Transcript of "Attention Is All You Need" - Sharif, ce.sharif.edu/courses/97-98/2/ce959-1/resources/root...Attention...

  • Transformer Networks

    Amir Ali Moinfar - M. Soleymani, Deep Learning

    Sharif University of Technology, Spring 2019

  • The "simple" translation model

    • Embedding each word (word2vec, trainable, …)
    • Some tricks (see the teacher-forcing sketch after this slide):

    – Teacher forcing
    – Reversing the input

    This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
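A minimal NumPy sketch of the teacher-forcing trick mentioned above. The `decoder_step` function is a hypothetical stand-in for an RNN decoder cell; the point is only the loop: at each step the ground-truth previous token, not the model's own prediction, is fed back as the next input.

```python
import numpy as np

def decoder_step(prev_token, state):
    # Hypothetical stand-in for an RNN decoder cell: returns a distribution
    # over a 5-word toy vocabulary plus an updated state.
    logits = np.random.randn(5) + state
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state + 0.1

def decode_with_teacher_forcing(target_tokens):
    state, losses = 0.0, []
    prev = 0                                   # <sos> token id
    for gold in target_tokens:
        probs, state = decoder_step(prev, state)
        losses.append(-np.log(probs[gold]))    # cross-entropy at this step
        prev = gold                            # teacher forcing: feed the gold token, not argmax(probs)
    return float(np.mean(losses))

print(decode_with_teacher_forcing([2, 4, 1, 3]))
```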

  • Problems with this framework

    • All the information about the input is embedded into a single vector
    – The last hidden node is "overloaded" with information

    • Particularly if the input is long

    • Parallelization?
    • Problems in backpropagation through the sequence

    This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019

  • Parallelization: Convolutional Models

    • Some work:
    – Neural GPU
    – ByteNet
    – ConvS2S

    • Limited by the size of the convolution
    • Maximum path length:

    – $\log_k n$ (for convolution kernel size $k$ and sequence length $n$)


    Kalchbrenner et al., "Neural Machine Translation in Linear Time", 2017

  • Removing the bottleneck: Attention Mechanism

    • Compute a weighted combination of all the hidden outputs into a single vector (sketch below)
    • Weights are functions of the current output state
    • The weights are a distribution over the input (sum to 1)

    This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019
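A minimal NumPy sketch of the mechanism described above, with made-up sizes and random vectors standing in for real encoder outputs and decoder state: the scores depend on the current output state, the softmax turns them into a distribution over the input, and the result is a single weighted-combination vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d = 6, 8                                  # hypothetical: 6 input positions, hidden size 8
encoder_outputs = np.random.randn(n, d)      # all hidden outputs of the encoder
decoder_state   = np.random.randn(d)         # current output (decoder) state

scores  = encoder_outputs @ decoder_state    # one score per input position
weights = softmax(scores)                    # a distribution over the input (sums to 1)
context = weights @ encoder_outputs          # weighted combination into a single vector

print(weights.sum(), context.shape)          # 1.0, (8,)
```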

  • Attention effect in machine translation

    Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", 2014

    • Left: normal RNNs and long sentences
    • Right: attention map in machine translation

  • RNNs with Attention for VQA


    • Each hidden output of the LSTM selects a part of the image to look at

    Zhu et al., "Visual7W: Grounded Question Answering in Images", 2016

  • Attention Mechanism - Abstract View

    • A lookup mechanism (soft-lookup sketch below)
    – Query
    – Key
    – Value
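A rough NumPy sketch of the query/key/value view, with toy sizes: attention behaves like a soft dictionary lookup, returning a similarity-weighted mix of all values rather than the single value whose key best matches the query.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

m, d_k, d_v = 5, 4, 3                 # hypothetical: 5 stored items, key size 4, value size 3
keys   = np.random.randn(m, d_k)
values = np.random.randn(m, d_v)
query  = np.random.randn(d_k)

weights = softmax(query @ keys.T)     # how well the query matches each key
result  = weights @ values            # soft lookup: a weighted average of all values
print(result.shape)                   # (3,)
```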

  • Attention Mechanism - Abstract View (cont.)

    ???

  • Attention Mechanism - Abstract View (cont.)

    • For large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (see the scaling sketch below)

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
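A small NumPy experiment illustrating the point above, assuming random roughly unit-variance vectors: raw dot products grow with $d_k$, while dividing by $\sqrt{d_k}$, as in the paper's scaled dot-product attention, keeps the scores (and hence the softmax inputs) at roughly constant scale.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaling keeps softmax out of its flat, tiny-gradient regions
    return softmax(scores) @ V

for d_k in (4, 64, 512):
    q = np.random.randn(d_k)
    K = np.random.randn(10, d_k)
    raw, scaled = q @ K.T, (q @ K.T) / np.sqrt(d_k)
    # |raw| grows roughly like sqrt(d_k); |scaled| stays around the same size
    print(d_k, np.abs(raw).mean().round(1), np.abs(scaled).mean().round(1))

out = scaled_dot_product_attention(np.random.randn(3, 64),
                                   np.random.randn(10, 64),
                                   np.random.randn(10, 8))
print(out.shape)                          # (3, 8)
```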

  • Self-Attention

    • AKA intra-attention

    • An attention mechanism relating different positions of a single sequence

    => Q, K, V are derived from a single sequence

    • Check the case when (sketch below):
    – $Q_i = W^Q X_i$
    – $K_1, \dots, K_n = W^K X_1, \dots, W^K X_n$
    – $V_1, \dots, V_n = W^V X_1, \dots, W^V X_n$

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
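A minimal self-attention sketch in NumPy matching the case above, with random matrices standing in for the learned projections $W^Q$, $W^K$, $W^V$ and made-up sequence length and widths: Q, K and V all come from the same sequence X.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, d_k = 6, 16, 8               # hypothetical sizes
X = np.random.randn(n, d_model)          # a single input sequence

W_Q = np.random.randn(d_model, d_k)      # learned projections (random stand-ins here)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # Q, K, V all derived from the same X
out = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(out.shape)                         # (6, 8): one output per input position
```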

  • Multi-head attention

    • Allows the model to (sketch below):
    – jointly attend to information
    – from different representation subspaces
    – at different positions

    Vaswani et al., "Attention Is All You Need", 2017 [modified]; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)
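A rough NumPy sketch of multi-head self-attention with assumed toy sizes: each head has its own projections (its own representation subspace), the heads run in parallel, and their outputs are concatenated and mixed by a final matrix (written W_O here).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n, d_model, h = 6, 16, 4                 # hypothetical; d_model must be divisible by h
d_head = d_model // h
X = np.random.randn(n, d_model)

heads = []
for _ in range(h):
    # Each head gets its own (random, stand-in) projections: its own subspace.
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = np.random.randn(d_model, d_model)
out = np.concatenate(heads, axis=-1) @ W_O   # concatenate the h heads, then mix them
print(out.shape)                             # (6, 16)
```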

  • Multi-head Self-Attention

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

  • Bonus: Attention Is All She Needs

    Gregory Jantz, "Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?", https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272, 2014

  • Attention Is All You Need

    • Replace LSTMs with a lot of attention!
    – State-of-the-art results
    – Much less computation for training

    Advantages:
    • Less complex
    • Can be parallelized, faster
    • Easier to learn distant dependencies

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer's Behavior

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Encoding + first decoding step

    [Link to gif]

  • Transformer's Behavior (cont.)

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Decoding

    [Link to gif]

  • Transformer architecture

    • The core of it:
    – Multi-head attention
    – Positional encoding

    Vaswani et al., "Attention Is All You Need", 2017; Jakob Uszkoreit, "Transformer: A Novel Neural Network Architecture for Language Understanding", https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

    [Link to gif]

  • Transformer architecture (cont.)

    • Encoder (layer sketch below)
    – Input embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Feed-forward with residual links

    • Decoder
    – Output embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Multi-head encoder-decoder attention
    – Feed-forward with residual links

    • Output
    – Linear + Softmax

    Vaswani et al., "Attention Is All You Need", 2017
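A condensed NumPy sketch of one encoder layer, under the usual simplifications (random stand-in weights, no dropout, no biases, single-head attention): self-attention and a position-wise feed-forward network, each wrapped in a residual link plus normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(X):
    d = X.shape[-1]
    W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))   # stand-in weights
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def feed_forward(X, d_ff=64):
    d = X.shape[-1]
    W1, W2 = np.random.randn(d, d_ff), np.random.randn(d_ff, d)
    return np.maximum(0, X @ W1) @ W2        # FFN(x) = max(0, x W1) W2, biases omitted

def encoder_layer(X):
    X = layer_norm(X + self_attention(X))    # sub-layer 1: self-attention + residual + norm
    X = layer_norm(X + feed_forward(X))      # sub-layer 2: feed-forward + residual + norm
    return X

X = np.random.randn(6, 32)                   # hypothetical: 6 tokens, d_model = 32
print(encoder_layer(X).shape)                # (6, 32)
```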

  • Transformer architecture (cont.)

    • Output (sketch below)
    – Linear + Softmax

    Vaswani et al., "Attention Is All You Need", 2017
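The final output step, sketched in NumPy with made-up sizes: a linear projection from the decoder's d_model-sized vector to vocabulary-sized logits, followed by a softmax over the vocabulary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 32, 1000                   # hypothetical sizes
decoder_output = np.random.randn(d_model)        # final decoder vector for one position

W_out = np.random.randn(d_model, vocab_size)     # the "Linear" layer (bias omitted)
probs = softmax(decoder_output @ W_out)          # the "Softmax": a distribution over the vocabulary
print(int(probs.argmax()), float(probs.sum()))   # predicted token id, 1.0
```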

  • Transformer architecture (cont.)

    • Encoder and Decoder

    Vaswani et al., "Attention Is All You Need", 2017

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

  • Transformer architecture (cont.)

    • Feed-forward layers
    • Residual links
    • Layer normalization (the paper uses layer-norm, not batch-norm)
    • Dropout

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer architecture (cont.)

    • Attention is all it needs

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer architecture (cont.)

    • [Multi-head] attention is all it needs

    Vaswani et al., "Attention Is All You Need", 2017

  • Transformer architecture (cont.)

    • Two types of attention is all it needs :D

    Vaswani et al., "Attention Is All You Need", 2017

    Remember the signature of multi-head attention (sketch below)
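A short NumPy sketch of the two call patterns, reusing the same attention(Q, K, V) signature with random stand-in weights: in self-attention everything comes from one sequence, while in encoder-decoder attention the queries come from the decoder and the keys/values from the encoder output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d = 16                                             # hypothetical model width
M = np.random.randn(7, d)                          # encoder output ("memory"), 7 source tokens
Y = np.random.randn(4, d)                          # decoder states, 4 target tokens so far
W_Q, W_K, W_V = (np.random.randn(d, d) for _ in range(3))

self_attn  = attention(Y @ W_Q, Y @ W_K, Y @ W_V)  # type 1: Q, K, V all from the decoder sequence
cross_attn = attention(Y @ W_Q, M @ W_K, M @ W_V)  # type 2: Q from decoder, K/V from encoder
print(self_attn.shape, cross_attn.shape)           # (4, 16) (4, 16)
```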

  • Transformer architecture (cont.)

    • Embeddings (sketch below)
    – Just a lookup table

    Vaswani et al., "Attention Is All You Need", 2017
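The embedding-as-lookup-table point, as a tiny NumPy sketch with assumed toy sizes: the table is just a matrix with one learned row per vocabulary entry, and embedding a sentence is plain row indexing.

```python
import numpy as np

vocab_size, d_model = 1000, 32                          # hypothetical sizes
embedding_table = np.random.randn(vocab_size, d_model)  # one (learned) row per token

token_ids = np.array([5, 42, 7])                        # a tokenized sentence
embedded  = embedding_table[token_ids]                  # the lookup: plain row indexing
print(embedded.shape)                                   # (3, 32)
```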

  • Transformer architecture (cont.)

    • Positional Encoding (sketch below)

    • It allows the model to easily learn to attend by relative positions, since for any fixed offset $k$:

    $\sin(pos + k) = \sin(pos)\cos(k) + \cos(pos)\sin(k)$
    $\cos(pos + k) = \cos(pos)\cos(k) - \sin(pos)\sin(k)$

    Vaswani et al., "Attention Is All You Need", 2017; Alexander Rush, "The Annotated Transformer", http://nlp.seas.harvard.edu/2018/04/03/attention.html (5/20/2019)
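A NumPy sketch of the sinusoidal positional encoding from Vaswani et al. (2017), PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the max_len and d_model values are arbitrary. The resulting rows are added to the input embeddings, one per position.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # positions 0 .. max_len-1
    i   = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=32)
print(pe.shape)                                       # (50, 32)
```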

  • Transformer architecture (cont.)

    Vaswani et al., "Attention Is All You Need", 2017

    • A 2-tier transformer network

    Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

  • Transformer's Behavior

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Encoding + first decoding step

    [Link to gif]

  • Transformer's Behavior (cont.)

    Vaswani et al., "Attention Is All You Need", 2017; Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/ (5/20/2019)

    • Decoding

    [Link to gif]

  • Complexity

    • Advantages:
    – Less complex
    – Can be parallelized, faster
    – Easier to learn distant dependencies

    Vaswani et al., "Attention Is All You Need", 2017

  • Interpretability

    • Attention mechanism in the encoder self-attention, layer 5 of 6

    Vaswani et al., "Attention Is All You Need", 2017

  • Interpretability (cont.)

    • Two heads in the encoder self-attention, layer 5 of 6

    Vaswani et al., "Attention Is All You Need", 2017

  • References

    • Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017.

    • Alammar, Jay. "The Illustrated Transformer." The Illustrated Transformer - Jay Alammar - Visualizing Machine Learning One Concept at a Time, 27 June 2018, jalammar.github.io/illustrated-transformer/.

    • Zhang, Shiyue. "Attention Is All You Need - Ppt Download." SlidePlayer, 20 June 2017, slideplayer.com/slide/13789541/.

    • Kurbanov, Rauf. "Attention Is All You Need." JetBrains Research, 27 Jan. 2019, research.jetbrains.org/files/material/5ace635c03259.pdf.

    • Polosukhin, Illia. "Attention Is All You Need." LinkedIn SlideShare, 26 Sept. 2017, www.slideshare.net/ilblackdragon/attention-is-all-you-need.

    • Rush, Alexander. The Annotated Transformer, 3 Apr. 2018, nlp.seas.harvard.edu/2018/04/03/attention.html.

    • Uszkoreit, Jakob. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, 31 Aug. 2017, ai.googleblog.com/2017/08/transformer-novel-neural-network.html.


  • Q&A


  • Thanks for your attention!

    $YourAttention = \mathrm{Softmax}\left(You\,[Presentation \mid Anything\ else]^{T}\right)\,[Presentation \mid Anything\ else]$