Transcript of "Attention Is All You Need" - Sharif, ce.sharif.edu/courses/97-98/2/ce959-1/resources/root...Attention...

  • Transformer Networks

    Amir Ali Moinfar - M. Soleymani, Deep Learning

    Sharif University of Technology, Spring 2019

  • The "simple" translation model

    • Embedding each word (word2vec, trainable, …)
    • Some tricks (see the teacher-forcing sketch after this slide):

    – Teacher forcing
    – Reversing the input

    This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
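A minimal NumPy sketch of the teacher-forcing trick mentioned above. The `decoder_step` function is a hypothetical stand-in for an RNN decoder cell; the point is only the loop: at each step the ground-truth previous token, not the model's own prediction, is fed back as the next input.

```python
import numpy as np

def decoder_step(prev_token, state):
    # Hypothetical stand-in for an RNN decoder cell: returns a distribution
    # over a 5-word toy vocabulary plus an updated state.
    logits = np.random.randn(5) + state
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state + 0.1

def decode_with_teacher_forcing(target_tokens):
    state, losses = 0.0, []
    prev = 0                                   # <sos> token id
    for gold in target_tokens:
        probs, state = decoder_step(prev, state)
        losses.append(-np.log(probs[gold]))    # cross-entropy at this step
        prev = gold                            # teacher forcing: feed the gold token, not argmax(probs)
    return float(np.mean(losses))

print(decode_with_teacher_forcing([2, 4, 1, 3]))
```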

  • Problems with this framework

    • All the information about the input is embedded into a single vector
    – The last hidden node is "overloaded" with information

    • Particularly if the input is long

    • Parallelization?
    • Problems in backpropagation through the sequence

    This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019

  • Parallelization: Convolutional Models

    • Some work:
    – Neural GPU
    – ByteNet
    – ConvS2S

    • Limited by the size of the convolution
    • Maximum path length:

    – $\log_k n$ (for convolution kernel size $k$ and sequence length $n$)


    Kalchbrenner et al., "Neural Machine Translation in Linear Time", 2017

  • Removing the bottleneck: Attention Mechanism

    • Compute a weighted combination of all the hidden outputs into a single vector (sketch below)
    • Weights are functions of the current output state
    • The weights are a distribution over the input (sum to 1)

    This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
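A minimal NumPy sketch of the mechanism described above, with made-up sizes and random vectors standing in for real encoder outputs and decoder state: the scores depend on the current output state, the softmax turns them into a distribution over the input, and the result is a single weighted-combination vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d = 6, 8                                  # hypothetical: 6 input positions, hidden size 8
encoder_outputs = np.random.randn(n, d)      # all hidden outputs of the encoder
decoder_state   = np.random.randn(d)         # current output (decoder) state

scores  = encoder_outputs @ decoder_state    # one score per input position
weights = softmax(scores)                    # a distribution over the input (sums to 1)
context = weights @ encoder_outputs          # weighted combination into a single vector

print(weights.sum(), context.shape)          # 1.0, (8,)
```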

  • Attention effect in machine translation

    Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", 2014

    • Left: normal RNNs and long sentences
    • Right: attention map in machine translation

  • RNNs with Attention for VQA


    • Each hidden output of the LSTM selects a part of the image to look at

    Zhu et al., "Visual7W: Grounded Question Answering in Images", 2016

  • Attention Mechanism - Abstract View

    • A lookup mechanism (soft-lookup sketch below)
    – Query
    – Key
    – Value
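A rough NumPy sketch of the query/key/value view, with toy sizes: attention behaves like a soft dictionary lookup, returning a similarity-weighted mix of all values rather than the single value whose key best matches the query.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

m, d_k, d_v = 5, 4, 3                 # hypothetical: 5 stored items, key size 4, value size 3
keys   = np.random.randn(m, d_k)
values = np.random.randn(m, d_v)
query  = np.random.randn(d_k)

weights = softmax(query @ keys.T)     # how well the query matches each key
result  = weights @ values            # soft lookup: a weighted average of all values
print(result.shape)                   # (3,)
```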

  • Attention Mechanism - Abstract View (cont.)

    ???

  • Attention Mechanism - Abstract View (cont.)

    • For large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (see the scaling sketch below)

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
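A small NumPy experiment illustrating the point above, assuming random roughly unit-variance vectors: raw dot products grow with $d_k$, while dividing by $\sqrt{d_k}$, as in the paper's scaled dot-product attention, keeps the scores (and hence the softmax inputs) at roughly constant scale.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaling keeps softmax out of its flat, tiny-gradient regions
    return softmax(scores) @ V

for d_k in (4, 64, 512):
    q = np.random.randn(d_k)
    K = np.random.randn(10, d_k)
    raw, scaled = q @ K.T, (q @ K.T) / np.sqrt(d_k)
    # |raw| grows roughly like sqrt(d_k); |scaled| stays around the same size
    print(d_k, np.abs(raw).mean().round(1), np.abs(scaled).mean().round(1))

out = scaled_dot_product_attention(np.random.randn(3, 64),
                                   np.random.randn(10, 64),
                                   np.random.randn(10, 8))
print(out.shape)                          # (3, 8)
```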

  • Self-Attention

    • AKA intra-attention

    • An attention mechanism relating different positions of a single sequence

    => Q, K, V are derived from a single sequence

    • Check the case when (sketch below):
    – $Q_i = W^Q X_i$
    – $K_1, \dots, K_n = W^K X_1, \dots, W^K X_n$
    – $V_1, \dots, V_n = W^V X_1, \dots, W^V X_n$

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
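A minimal self-attention sketch in NumPy matching the case above, with random matrices standing in for the learned projections $W^Q$, $W^K$, $W^V$ and made-up sequence length and widths: Q, K and V all come from the same sequence X.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, d_k = 6, 16, 8               # hypothetical sizes
X = np.random.randn(n, d_model)          # a single input sequence

W_Q = np.random.randn(d_model, d_k)      # learned projections (random stand-ins here)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # Q, K, V all derived from the same X
out = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(out.shape)                         # (6, 8): one output per input position
```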

  • Multi-head attention

    • Allows the model to (sketch below):
    – jointly attend to information
    – from different representation subspaces
    – at different positions

    Vaswani et al., "Attention Is All You Need", 2017 [modified]; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
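A rough NumPy sketch of multi-head self-attention with assumed toy sizes: each head has its own projections (its own representation subspace), the heads run in parallel, and their outputs are concatenated and mixed by a final matrix (written W_O here).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n, d_model, h = 6, 16, 4                 # hypothetical; d_model must be divisible by h
d_head = d_model // h
X = np.random.randn(n, d_model)

heads = []
for _ in range(h):
    # Each head gets its own (random, stand-in) projections: its own subspace.
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = np.random.randn(d_model, d_model)
out = np.concatenate(heads, axis=-1) @ W_O   # concatenate the h heads, then mix them
print(out.shape)                             # (6, 16)
```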

  • Multi-head Self-Attention

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

  • Bonus: Attention Is All She Needs

    Gregory Jantz, "Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?", https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272, 2014

  • Attention Is All You Need

    • Replace LSTMs with a lot of attention!
    – State-of-the-art results
    – Much less computation for training

    Advantages:
    • Less complex
    • Can be parallelized, faster
    • Easier to learn distant dependencies

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer's Behavior

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Encoding + first decoding step

    [Link to gif]

  • Transformer's Behavior (cont.)

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Decoding

    [Link to gif]

  • Transformer architecture

    • The core of it:
    – Multi-head attention
    – Positional encoding

    Vaswani et al., "Attention Is All You Need", 2017; Jakob Uszkoreit, "Transformer: A Novel Neural Network Architecture for Language Understanding", https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

    [Link to gif]

  • Transformer architecture (cont.)

    • Encoder (layer sketch below)
    – Input embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Feed-forward with residual links

    • Decoder
    – Output embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Multi-head encoder-decoder attention
    – Feed-forward with residual links

    • Output
    – Linear + Softmax

    Vaswani et al., "Attention Is All You Need", 2017
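A condensed NumPy sketch of one encoder layer, under the usual simplifications (random stand-in weights, no dropout, no biases, single-head attention): self-attention and a position-wise feed-forward network, each wrapped in a residual link plus normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(X):
    d = X.shape[-1]
    W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))   # stand-in weights
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def feed_forward(X, d_ff=64):
    d = X.shape[-1]
    W1, W2 = np.random.randn(d, d_ff), np.random.randn(d_ff, d)
    return np.maximum(0, X @ W1) @ W2        # FFN(x) = max(0, x W1) W2, biases omitted

def encoder_layer(X):
    X = layer_norm(X + self_attention(X))    # sub-layer 1: self-attention + residual + norm
    X = layer_norm(X + feed_forward(X))      # sub-layer 2: feed-forward + residual + norm
    return X

X = np.random.randn(6, 32)                   # hypothetical: 6 tokens, d_model = 32
print(encoder_layer(X).shape)                # (6, 32)
```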

  • Transformer architecture (cont.)

    • Output (sketch below)
    – Linear + Softmax

    Vaswani et al., "Attention Is All You Need", 2017
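The final output step, sketched in NumPy with made-up sizes: a linear projection from the decoder's d_model-sized vector to vocabulary-sized logits, followed by a softmax over the vocabulary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 32, 1000                   # hypothetical sizes
decoder_output = np.random.randn(d_model)        # final decoder vector for one position

W_out = np.random.randn(d_model, vocab_size)     # the "Linear" layer (bias omitted)
probs = softmax(decoder_output @ W_out)          # the "Softmax": a distribution over the vocabulary
print(int(probs.argmax()), float(probs.sum()))   # predicted token id, 1.0
```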

  • Transformer architecture (cont.)

    • Encoder and Decoder

    Vaswani et al., "Attention Is All You Need", 2017

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

  • Transformer architecture (cont.)

    • Feed-forward layers
    • Residual links
    • Layer normalization (the paper uses layer-norm, not batch-norm)
    • Dropout

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer architecture (cont.)

    • Attention is all it needs

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer architecture (cont.)

    • [Multi-head] attention is all it needs

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer architecture (cont.)

    • Two types of attention is all it needs :D

    Vaswani et al., "Attention Is All You Need", 2017

    Remember the signature of multi-head attention (sketch below)
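A short NumPy sketch of the two call patterns, reusing the same attention(Q, K, V) signature with random stand-in weights: in self-attention everything comes from one sequence, while in encoder-decoder attention the queries come from the decoder and the keys/values from the encoder output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d = 16                                             # hypothetical model width
M = np.random.randn(7, d)                          # encoder output ("memory"), 7 source tokens
Y = np.random.randn(4, d)                          # decoder states, 4 target tokens so far
W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))

self_attn  = attention(Y @ W_Q, Y @ W_K, Y @ W_V)  # type 1: Q, K, V all from the decoder sequence
cross_attn = attention(Y @ W_Q, M @ W_K, M @ W_V)  # type 2: Q from decoder, K/V from encoder
print(self_attn.shape, cross_attn.shape)           # (4, 16) (4, 16)
```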

  • Transformer architecture (cont.)

    • Embeddings (sketch below)
    – Just a lookup table

    Vaswani et al., "Attention Is All You Need", 2017
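The embedding-as-lookup-table point, as a tiny NumPy sketch with assumed toy sizes: the table is just a matrix with one learned row per vocabulary entry, and embedding a sentence is plain row indexing.

```python
import numpy as np

vocab_size, d_model = 1000, 32                          # hypothetical sizes
embedding_table = np.random.randn(vocab_size, d_model)  # one (learned) row per token

token_ids = np.array([5, 42, 7])                        # a tokenized sentence
embedded  = embedding_table[token_ids]                  # the lookup: plain row indexing
print(embedded.shape)                                   # (3, 32)
```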

  • Transformer architecture (cont.)

    • Positional Encoding (sketch below)

    • It allows the model to easily learn to attend by relative positions, since for any fixed offset $k$:

    $\sin(pos + k) = \sin(pos)\cos(k) + \cos(pos)\sin(k)$
    $\cos(pos + k) = \cos(pos)\cos(k) - \sin(pos)\sin(k)$

    Vaswani et al., "Attention Is All You Need", 2017; Alexander Rush, "The Annotated Transformer", http://nlp.seas.harvard.edu/2018/04/03/attention.html (5/20/2019)
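A NumPy sketch of the sinusoidal positional encoding from Vaswani et al. (2017), PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the max_len and d_model values are arbitrary. The resulting rows are added to the input embeddings, one per position.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # positions 0 .. max_len-1
    i   = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=32)
print(pe.shape)                                       # (50, 32)
```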

  • Transformer architecture (cont.)

    Vaswani et al., "Attention Is All You Need", 2017

    • A 2-tier transformer network

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

  • Transformer's Behavior

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Encoding + first decoding step

    [Link to gif]

  • Transformer's Behavior (cont.)

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Decoding

    [Link to gif]

  • Complexity

    • Advantages:
    – Less complex
    – Can be parallelized, faster
    – Easier to learn distant dependencies

    Vaswani et al., "Attention Is All You Need", 2017

  • Interpretability

    • Attention mechanism in the encoder self-attention, layer 5 of 6

    Vaswani et al., "Attention Is All You Need", 2017

  • Interpretability (cont.)

    • Two heads in the encoder self-attention, layer 5 of 6

    Vaswani et al., "Attention Is All You Need", 2017

  • References

    • Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017.

    • Alammar, Jay. "The Illustrated Transformer." The Illustrated Transformer - Jay Alammar - Visualizing Machine Learning One Concept at a Time, 27 June 2018, jalammar.github.io/illustrated-transformer/.

    • Zhang, Shiyue. "Attention Is All You Need - Ppt Download." SlidePlayer, 20 June 2017, slideplayer.com/slide/13789541/.

    • Kurbanov, Rauf. "Attention Is All You Need." JetBrains Research, 27 Jan. 2019, research.jetbrains.org/files/material/5ace635c03259.pdf.

    • Polosukhin, Illia. "Attention Is All You Need." LinkedIn SlideShare, 26 Sept. 2017, www.slideshare.net/ilblackdragon/attention-is-all-you-need.

    • Rush, Alexander. The Annotated Transformer, 3 Apr. 2018, nlp.seas.harvard.edu/2018/04/03/attention.html.

    • Uszkoreit, Jakob. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, 31 Aug. 2017, ai.googleblog.com/2017/08/transformer-novel-neural-network.html.


  • Q&A


  • Thanks for your attention!

    $YourAttention = \mathrm{Softmax}\left(You\,[Presentation \mid Anything\ else]^{T}\right)\,[Presentation \mid Anything\ else]$