Attention is all you need - Sharif
ce.sharif.edu/courses/97-98/2/ce959-1/resources/root...Attention...
-
Transformer Networks
Amir Ali Moinfar - M. Soleymani, Deep Learning
Sharif University of Technology, Spring 2019
1
-
The "simple" translation model
• Embedding each word (word2vec, trainable, …)
• Some tricks:
– Teacher forcing
– Reversing the input
2  This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
-
Problems with this framework
• All the information about the input is embedded into a single vector
– The last hidden node is "overloaded" with information
• Particularly if the input is long
• Parallelization?
• Problems in backpropagation through the sequence
3  This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
-
Parallelization: Convolutional Models
• Some work:
– Neural GPU
– ByteNet
– ConvS2S
• Limited by size of convolution
• Maximum path length:
– log_k(n) (k: kernel size, n: sequence length)
4
Kalchbrenner et al. "Neural Machine Translation in Linear Time", 2017
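The log_k(n) maximum path length is easy to check numerically. A minimal sketch (the function name and the kernel-size/sequence-length values are illustrative, not from the slides):

```python
import math

def conv_path_length(n, k):
    """Number of convolutional layers (kernel size k) needed before
    any two of n positions can interact: ceil(log_k(n))."""
    return math.ceil(math.log(n, k))

# With kernel size 4, a 1000-token sequence needs only 5 layers,
# versus n - 1 = 999 sequential steps for a plain RNN.
print(conv_path_length(1000, 4))
```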
-
Removing bottleneck: Attention Mechanism
• Compute a weighted combination of all the hidden outputs into a single vector
• Weights are functions of the current output state
• The weights are a distribution over the input (sum to 1)
5  This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
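A minimal NumPy sketch of this weighted combination (all names and shapes are illustrative stand-ins, not from the slides):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))   # hypothetical encoder hidden outputs (4 steps, dim 3)
s = rng.standard_normal(3)        # hypothetical current output (decoder) state

scores = H @ s                    # one score per input position, a function of s
weights = softmax(scores)         # a distribution over the input (sums to 1)
context = weights @ H             # weighted combination -> single context vector
print(weights.sum(), context.shape)
```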
-
Attention effect in machine translation
• Left: Normal RNNs and long sentences
• Right: Attention map in machine translation
6  Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", 2014
-
RNNs with Attention for VQA
7
• Each hidden output of the LSTM selects a part of the image to look at
Zhu et al. "Visual7W: Grounded Question Answering in Images", 2016
-
Attention Mechanism - Abstract View
8
• A lookup mechanism:
– Query
– Key
– Value
-
Attention Mechanism - Abstract View (cont.)
9
???
-
Attention Mechanism - Abstract View (cont.)
• For large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
10  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
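That motivates the 1/√d_k scaling in the paper's attention. A minimal NumPy sketch of scaled dot-product attention (the shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Dividing by sqrt(d_k) keeps the logits moderate, so softmax
    # stays out of its extremely-small-gradient regions.
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys,    d_k = 8
V = rng.standard_normal((5, 16))   # 5 values,  d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 16)
```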
-
Self-Attention
• AKA intra-attention
• An attention mechanism relating different positions of a single sequence
⇒ Q, K, V are derived from a single sequence
• Check the case when:
– Q_i = W^Q X_i
– K_1, …, K_n = W^K X_1, …, W^K X_n
– V_1, …, V_n = W^V X_1, …, W^V X_n
11  Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
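That case, where Q, K, V are all projections of the same sequence X, can be sketched in NumPy (W^Q, W^K, W^V are random stand-ins for learned matrices; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8
X = rng.standard_normal((n, d_model))      # a single input sequence

W_Q = rng.standard_normal((d_model, d_k))  # stand-ins for learned projections
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # all three derived from the same X

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                          # each position attends to every position
print(out.shape)  # (6, 8)
```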
-
Multi-head attention
• Allows the model to:
– jointly attend to information
– from different representation subspaces
– at different positions
12  Vaswani et al. "Attention Is All You Need", 2017 [modified]; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
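A minimal sketch of running several heads in parallel and concatenating them (projections are random stand-ins for learned weights; head count and dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h=4):
    rng = np.random.default_rng(0)
    n, d_model = X.shape
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Each head gets its own projections, so it can attend to a
        # different representation subspace at different positions.
        W_Q = rng.standard_normal((d_model, d_k))
        W_K = rng.standard_normal((d_model, d_k))
        W_V = rng.standard_normal((d_model, d_k))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    W_O = rng.standard_normal((h * d_k, d_model))  # final output projection
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.default_rng(1).standard_normal((6, 32))
print(multi_head_self_attention(X).shape)  # (6, 32)
```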
-
Multi-head Self-Attention
13  Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Bonus: Attention Is All She Needs
Gregory Jantz, "Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?", https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272, 2014
-
Attention Is All You Need
• Replace LSTMs with a lot of attention!
– State-of-the-art results
– Much less computation for training
15
Advantages:
• Less complex
• Can be parallelized, faster
• Easy to learn distant dependencies
Vaswani et al. "Attention Is All You Need", 2017
-
Transformer's Behavior
• Encoding + first decoding step
[Link to gif]
16  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Transformer's Behavior (cont.)
• Decoding
[Link to gif]
17  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Transformer architecture
• The core of it:
– Multi-head attention
– Positional encoding
[Link to gif]
18  Vaswani et al. "Attention Is All You Need", 2017; Jakob Uszkoreit, "Transformer: A Novel Neural Network Architecture for Language Understanding", https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
-
Transformer architecture (cont.)
• Encoder:
– Input embedding (like word2vec)
– Positional encoding
– Multi-head self-attention
– Feed-forward with residual links
• Decoder:
– Output embedding (like word2vec)
– Positional encoding
– Multi-head self-attention
– Multi-head encoder-decoder attention
– Feed-forward with residual links
• Output:
– Linear + Softmax
19  Vaswani et al. "Attention Is All You Need", 2017
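The encoder bullets can be wired together in a short sketch. Layer normalization is used around each sub-layer per the paper; the attention and feed-forward sub-layers here are identity/ReLU stand-ins just to show the residual wiring:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_layer(x, self_attn, ffn):
    # Sub-layer 1: multi-head self-attention + residual link + norm
    x = layer_norm(x + self_attn(x))
    # Sub-layer 2: position-wise feed-forward + residual link + norm
    return layer_norm(x + ffn(x))

x = np.random.default_rng(0).standard_normal((6, 16))
out = encoder_layer(x, self_attn=lambda t: t, ffn=lambda t: np.maximum(t, 0.0))
print(out.shape)  # (6, 16)
```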
-
Transformer architecture (cont.)
• Output:
– Linear + Softmax
20  Vaswani et al. "Attention Is All You Need", 2017
-
Transformer architecture (cont.)
• Encoder and Decoder
21  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Transformer architecture (cont.)
• Feed-forward layers
• Residual links
• Layer normalization (the paper uses layer norm, not batch norm)
• Dropout
22  Vaswani et al. "Attention Is All You Need", 2017
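The feed-forward sub-layer is a two-layer MLP applied to each position independently: FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal sketch with small illustrative dimensions (the paper uses d_model = 512, d_ff = 2048):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = rng.standard_normal((6, d_model))
print(feed_forward(x, W1, b1, W2, b2).shape)  # (6, 16)
```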
-
Transformer architecture (cont.)
• Attention is all it needs
23  Vaswani et al. "Attention Is All You Need", 2017
-
Transformer architecture (cont.)
• [Multi-head] attention is all it needs
24  Vaswani et al. "Attention Is All You Need", 2017
-
Transformer architecture (cont.)
• Two types of attention is all it needs :D
25  Vaswani et al. "Attention Is All You Need", 2017
Remember the signature of multi-head attention
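The point about the signature: both attention types call the same function, differing only in where Q, K, and V come from. A NumPy sketch with random stand-in tensors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
enc = rng.standard_normal((7, 8))  # encoder outputs (source length 7)
dec = rng.standard_normal((3, 8))  # decoder states  (target length 3)

self_out = attention(Q=dec, K=dec, V=dec)    # decoder self-attention
cross_out = attention(Q=dec, K=enc, V=enc)   # encoder-decoder attention
print(self_out.shape, cross_out.shape)  # (3, 8) (3, 8)
```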
-
Transformer architecture (cont.)
• Embeddings
– Just a lookup table:
26  Vaswani et al. "Attention Is All You Need", 2017
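"Just a lookup table" really is row indexing (table values are random here; in a real model they are trained, and the token ids are illustrative):

```python
import numpy as np

vocab_size, d_model = 10, 4
table = np.random.default_rng(0).standard_normal((vocab_size, d_model))

token_ids = np.array([3, 1, 7])   # hypothetical token ids
embedded = table[token_ids]       # embedding = row lookup, one row per token
print(embedded.shape)  # (3, 4)
```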
-
Transformer architecture (cont.)
• Positional Encoding
• It allows the model to easily learn to attend by relative positions, since for any fixed offset k:
sin(pos + k) = sin(pos) cos(k) + cos(pos) sin(k)
cos(pos + k) = cos(pos) cos(k) − sin(pos) sin(k)
27  Vaswani et al. "Attention Is All You Need", 2017; Alexander Rush, "The Annotated Transformer", http://nlp.seas.harvard.edu/2018/04/03/attention.html (5/20/2019)
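The sinusoidal encoding behind those identities can be generated directly, following the paper's formula (the sequence length and model size here are illustrative):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```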
-
Transformer architecture (cont.)
• A 2-tier transformer network
28  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Transformer's Behavior
• Encoding + first decoding step
[Link to gif]
29  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Transformer's Behavior (cont.)
• Decoding
[Link to gif]
30  Vaswani et al. "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
-
Complexity
• Advantages:
– Less complex
– Can be parallelized, faster
– Easy to learn distant dependencies
31  Vaswani et al. "Attention Is All You Need", 2017
-
Interpretability
• Attention mechanism in the encoder self-attention in layer 5 of 6
32  Vaswani et al. "Attention Is All You Need", 2017
-
Interpretability (cont.)
• Two heads in the encoder self-attention in layer 5 of 6
33  Vaswani et al. "Attention Is All You Need", 2017
-
References
• Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017.
• Alammar, Jay. "The Illustrated Transformer." Visualizing Machine Learning One Concept at a Time, 27 June 2018, jalammar.github.io/illustrated-transformer/.
• Zhang, Shiyue. "Attention Is All You Need - Ppt Download." SlidePlayer, 20 June 2017, slideplayer.com/slide/13789541/.
• Kurbanov, Rauf. "Attention Is All You Need." JetBrains Research, 27 Jan. 2019, research.jetbrains.org/files/material/5ace635c03259.pdf.
• Polosukhin, Illia. "Attention Is All You Need." LinkedIn SlideShare, 26 Sept. 2017, www.slideshare.net/ilblackdragon/attention-is-all-you-need.
• Rush, Alexander. The Annotated Transformer, 3 Apr. 2018, nlp.seas.harvard.edu/2018/04/03/attention.html.
• Uszkoreit, Jakob. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, 31 Aug. 2017, ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
34
-
Q&A
35
-
Thanks for your attention!
Your Attention = Softmax(You · [Presentation | Anything else]ᵀ) · [Presentation | Anything else]
36