CS224d Deep NLP Lecture 4: Word Window Classification and Neural Networks
CS224d Deep NLP
Lecture 4: Word Window Classification and Neural Networks
Richard Socher
Overview Today:
• General classification background
• Updating word vectors for classification
• Window classification & cross-entropy error derivation tips
• A single layer neural network!
• (Max-margin loss and backprop)
4/7/16 Richard Socher
Classification setup and notation
• Generally we have a training dataset consisting of samples {x_i, y_i}_{i=1}^N
• x_i: inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
• y_i: labels we try to predict, e.g. other words; classes such as sentiment, named entities, buy/sell decisions; later: multi-word sequences
Classification intuition
• Training data: {x_i, y_i}_{i=1}^N
• Simple illustration case: fixed 2-d word vectors to classify, using logistic regression → a linear decision boundary
• General ML: assume x is fixed and only train the logistic regression weights W, modifying only the decision boundary
Visualizations with ConvNetJS by Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Classification notation
• Cross-entropy loss function over the dataset {x_i, y_i}_{i=1}^N:
  J(θ) = (1/N) Σ_{i=1}^N −log( e^{f_{y_i}} / Σ_{c=1}^C e^{f_c} )
• Where for each data pair (x_i, y_i), the class probability is the softmax: p(y | x) = e^{f_y} / Σ_{c=1}^C e^{f_c}
• We can write f in matrix notation, f = Wx, and index elements of it based on class
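A minimal numpy sketch of this classifier (the toy W and x below are made up for illustration):

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax over class scores f."""
    f = f - np.max(f)          # shifting scores doesn't change the result
    p = np.exp(f)
    return p / np.sum(p)

def cross_entropy_loss(W, x, y):
    """-log p(y | x) for a linear classifier with scores f = Wx."""
    p = softmax(W.dot(x))
    return -np.log(p[y])

# toy example: 3 classes, 2-d input
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([2.0, -1.0])
loss = cross_entropy_loss(W, x, y=0)   # small, since class 0 scores highest
```

The loss is low when the softmax puts most probability mass on the true class y.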
Classification: Regularization!
• The really full loss function over any dataset includes regularization over all parameters θ:
  J(θ) = (1/N) Σ_{i=1}^N −log( e^{f_{y_i}} / Σ_{c=1}^C e^{f_c} ) + λ Σ_k θ_k²
• Regularization will prevent overfitting when we have a lot of features (or later a very powerful/deep model)
• In the classic overfitting plot, the x-axis is a more powerful model or more training iterations; blue is training error, red is test error
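The regularized objective can be sketched as (function and argument names are illustrative, not from the slides):

```python
import numpy as np

def regularized_loss(data_loss, params, lam):
    """Full objective: data loss plus an L2 penalty over all parameters theta."""
    return data_loss + lam * sum(np.sum(th ** 2) for th in params)

# example: data loss 1.0, one parameter vector of ones, lambda = 0.5
total = regularized_loss(1.0, [np.ones(2)], lam=0.5)   # 1.0 + 0.5 * 2 = 2.0
```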
Details: General ML optimization
• For general machine learning, θ usually consists only of the columns of W:
• So we only update the decision boundary
Visualizations with ConvNetJS by Karpathy
Classification difference with word vectors
• Common in deep learning: learn both W and the word vectors x
• θ then includes all word vectors and is very large. Overfitting danger!
Losing generalization by re-training word vectors
• Setting: training logistic regression for movie review sentiment; in the training data we have the words "TV" and "telly"
• In the testing data we have "television"
• Originally they were all similar (from pre-trained word vectors)
• What happens when we train the word vectors?
[Figure: "TV", "telly", and "television" close together in the vector space]
Losing generalization by re-training word vectors
• What happens when we train the word vectors? Those that are in the training data move around; words from pre-training that do NOT appear in training stay put
• Example: in the training data: "TV" and "telly"; in the testing data only: "television"
[Figure: "TV" and "telly" have moved; "television" is left behind :( ]
Losing generalization by re-training word vectors
• Take home message:
If you only have a small training dataset, don't train the word vectors.
If you have a very large dataset, it may work better to train word vectors to the task.
Side note on word vector notation
• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe
• L is a d × |V| matrix with one column per vocabulary word (aardvark, a, …, meta, …, zebra)
• These are the word features x_word from now on
• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L: x = Le, with L ∈ ℝ^{d×|V|}, e ∈ ℝ^{|V|×1}, x ∈ ℝ^{d}
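The one-hot multiply and the column lookup give the same vector; a small sketch (toy sizes, random L):

```python
import numpy as np

d, V = 4, 6                      # embedding size, vocabulary size (toy values)
L = np.random.randn(d, V)        # lookup table: one column per word

word_index = 2
e = np.zeros(V)
e[word_index] = 1.0              # one-hot vector for the word

x_slow = L.dot(e)                # conceptual: x = L e
x_fast = L[:, word_index]        # what implementations actually do: column lookup
```

In practice the column lookup is used, since multiplying by a one-hot vector wastes |V| − 1 multiplications.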
Window classification
• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example, auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."
• Example, ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway
Window classification
• Idea: classify a word in its context window of neighboring words.
• For example, named entity recognition into 4 classes: person, location, organization, none
• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information
Window classification
• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it
• Example: classify "Paris" in the context of this sentence with window length 2:
  … museums in Paris are amazing …
  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
• Resulting vector x_window = x ∈ ℝ^{5d}, a column vector!
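Building the window vector by concatenation can be sketched as (the embeddings here are random placeholders):

```python
import numpy as np

d = 3
# hypothetical d-dimensional embeddings for the five window words
vecs = {w: np.random.randn(d) for w in
        ["museums", "in", "Paris", "are", "amazing"]}

window = ["museums", "in", "Paris", "are", "amazing"]
x_window = np.concatenate([vecs[w] for w in window])  # vector in R^{5d}
```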
Simplest window classifier: Softmax
• With x = x_window we can use the same softmax classifier as before:
  ŷ_y = p(y | x) = e^{W_y· x} / Σ_{c=1}^C e^{W_c· x}
• With cross-entropy error as before: J = −log ŷ_y (the predicted model output probability for the true class)
• But how do you update the word vectors?
Updating concatenated word vectors
• Short answer: just take derivatives as before
• Long answer: let's go over the steps together (you'll have to fill in the details in PSet 1!)
• Define:
  • ŷ: softmax probability output vector (see previous slide)
  • t: target probability distribution (all 0's except at the ground-truth index of class y, where it's 1)
  • f = f(x) = Wx and f_c = c'th element of the f vector
• Hard the first time, hence some tips now :)
• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Know thy chain rule and don't forget which variables depend on what:
  ∂J/∂x = (∂J/∂f)(∂f/∂x)
• Tip 3: For the softmax part of the derivative, first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)
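The two cases in Tip 3 combine into ∂J/∂f = ŷ − t, which can be checked numerically with central differences (toy scores below):

```python
import numpy as np

def softmax(f):
    p = np.exp(f - np.max(f))
    return p / p.sum()

f = np.array([1.0, -0.5, 0.3])
y = 1                                  # ground-truth class
t = np.zeros_like(f)
t[y] = 1.0

analytic = softmax(f) - t              # Tip 3's two cases in one vector

# numerical check of dJ/df_c for J = -log softmax(f)[y]
eps = 1e-6
numeric = np.zeros_like(f)
for c in range(len(f)):
    fp, fm = f.copy(), f.copy()
    fp[c] += eps
    fm[c] -= eps
    numeric[c] = (-np.log(softmax(fp)[y]) + np.log(softmax(fm)[y])) / (2 * eps)
```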
• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives: ∂J/∂f = ŷ − t
• Tip 5: To later not go insane (and for the implementation!) → express results in terms of vector operations and define single index-able vectors: δ = ŷ − t
• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij
• Tip 7: To clean it up for even more complex functions later: know the dimensionality of variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!
• What is the dimensionality of the window vector gradient?
• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality: ∂J/∂x ∈ ℝ^{5d}
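For the softmax classifier f = Wx, the chain rule gives ∇_x J = Wᵀ(ŷ − t), which indeed has shape 5d (toy shapes below):

```python
import numpy as np

d, C = 4, 3
x = np.random.randn(5 * d)             # concatenated window vector
W = np.random.randn(C, 5 * d)
y = 0

p = np.exp(W.dot(x) - np.max(W.dot(x)))
p /= p.sum()                           # softmax probabilities y_hat
t = np.zeros(C)
t[y] = 1.0

grad_x = W.T.dot(p - t)                # chain rule through f = Wx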
• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let δ_window = ∇_x J ∈ ℝ^{5d}
• With x_window = [x_museums  x_in  x_Paris  x_are  x_amazing], we have the five d-dimensional chunks of δ_window as the updates for x_museums, x_in, x_Paris, x_are, x_amazing
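Splitting the 5d window gradient into per-word updates is a reshape/slice (toy gradient below):

```python
import numpy as np

d = 4
words = ["museums", "in", "Paris", "are", "amazing"]
grad_window = np.random.randn(5 * d)   # gradient w.r.t. the whole window

# slice the 5d gradient into five d-dimensional word-vector updates
per_word = {w: grad_window[i * d:(i + 1) * d] for i, w in enumerate(words)}
```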
• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location
What's missing for training the window model?
• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt W_ij first, then we have the full ∂J/∂W
A note on matrix implementations
• There are two expensive operations in the softmax: the matrix multiplication f = Wx and the exp
• A for loop is never as efficient as a single larger matrix multiplication when you implement it!
• Example code →
A note on matrix implementations
• Looping over word vectors instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
• Loop: 1000 loops, best of 3: 639 µs per loop
• Single matrix multiply: 10000 loops, best of 3: 53.8 µs per loop
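The two equivalent implementations can be sketched as below (toy sizes; the timings above are from the slides, and can be reproduced with a tool like IPython's %timeit):

```python
import numpy as np

C, d, N = 4, 50, 1000
W = np.random.randn(C, d)
wordvecs = np.random.randn(d, N)       # N word vectors as columns

# loop version: one matrix-vector product per word
scores_loop = np.stack([W.dot(wordvecs[:, i]) for i in range(N)], axis=1)

# vectorized version: one big matrix-matrix product
scores_mat = W.dot(wordvecs)
```

Both compute the same C × N score matrix; the single multiply lets the BLAS library do the work.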
A note on matrix implementations
• The result of the faster method is a C × N matrix: each column is an f(x) in our notation (unnormalized class scores)
• Matrices are awesome!
• You should speed-test your code a lot too
Softmax (= logistic regression) is not very powerful
• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!
Softmax (= logistic regression) is not very powerful
• Softmax gives only linear decision boundaries
• → Lame when the problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!
• Neural networks can learn much more complex functions and nonlinear decision boundaries!
From logistic regression to neural nets
Demystifying neural networks
Neural networks come with their own terminological baggage … just like SVMs
But if you understand how softmax models work, then you already understand the operation of a basic neural network neuron!
A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b
[Figure: neuron diagram labeling the inputs, the activation function, the output, and the bias unit, which corresponds to the intercept term]
A neuron is essentially a binary logistic regression unit
h_{w,b}(x) = f(wᵀx + b)
f(z) = 1 / (1 + e^{−z})
w, b are the parameters of this neuron, i.e., this logistic regression model
b: we can have an "always on" feature, which gives a class prior, or separate it out, as a bias term
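The neuron above can be sketched in a few lines (toy w, b, x for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, b, x):
    """h_{w,b}(x) = f(w^T x + b): a binary logistic regression unit."""
    return sigmoid(w.dot(x) + b)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
out = neuron(w, b=0.0, x=x)            # w.x = 1.5, so out = sigmoid(1.5)
```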
A neural network = running several logistic regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs …
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time
… which we can feed into another logistic regression function
It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time
Before we know it, we have a multilayer neural network …
Matrix notation for a layer
We have
a₁ = f(W₁₁x₁ + W₁₂x₂ + W₁₃x₃ + b₁)
a₂ = f(W₂₁x₁ + W₂₂x₂ + W₂₃x₃ + b₂)
etc.
In matrix notation:
z = Wx + b
a = f(z)
where f is applied element-wise: f([z₁, z₂, z₃]) = [f(z₁), f(z₂), f(z₃)]
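The matrix form and the per-unit sums compute the same activations; a quick sketch with toy shapes:

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))     # elementwise nonlinearity (sigmoid)

W = np.random.randn(2, 3)                  # 3 inputs -> 2 hidden units
b = np.random.randn(2)
x = np.random.randn(3)

z = W.dot(x) + b                           # z = Wx + b
a = f(z)                                   # a = f(z), f applied elementwise

# same as computing one unit by hand: a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a1 = f(W[0, 0] * x[0] + W[0, 1] * x[1] + W[0, 2] * x[2] + b[0])
```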
Non-linearities (f): why they're needed
• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W₁W₂x = Wx
• With more layers, they can approximate more complex functions!
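The collapse of stacked linear layers, W₁W₂x = Wx, can be verified directly (random toy matrices):

```python
import numpy as np

W1 = np.random.randn(3, 4)
W2 = np.random.randn(4, 5)
x = np.random.randn(5)

# two linear layers with no nonlinearity in between ...
two_layers = W1.dot(W2.dot(x))

# ... are exactly one linear layer with W = W1 W2
W = W1.dot(W2)
one_layer = W.dot(x)
```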
A more powerful window classifier
• Revisiting
• x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
A Single Layer Neural Network
• A single layer is a combination of a linear layer and a nonlinearity: z = Wx + b, a = f(z)
• The neural activations a can then be used to compute some function
• For instance, a softmax probability or an unnormalized score: s = Uᵀa
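The single-layer score s = Uᵀf(Wx + b) can be sketched as (toy dimensions; the variable names follow the slides):

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoid nonlinearity

n, h = 5, 8                                # input size, number of hidden units
x = np.random.randn(n)
W = np.random.randn(h, n)
b = np.random.randn(h)
U = np.random.randn(h)

a = f(W.dot(x) + b)                        # single layer: linear + nonlinearity
s = U.dot(a)                               # unnormalized score s = U^T a
```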
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)
s = Uᵀ f(Wx + b), with x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
Next lecture:
Training a window-based neural network.
Taking deeper derivatives → backprop
Then we have all the basic tools in place to learn about more complex models :)
Probably for next lecture …
Another output layer and loss function combo!
• So far: softmax and cross-entropy error (the exp is slow)
• We don't always need probabilities; often unnormalized scores are enough to classify correctly.
• Also: max-margin!
• More on that in future lectures!
Neural net model to classify grammatical phrases
• Idea: train a neural network to produce high scores for grammatical phrases of a specific length and low scores for ungrammatical phrases
• s = score(cat chills on a mat)
• s_c = score(cat chills Menlo a mat)
Another output layer and loss function combo!
• Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize J = max(0, 1 − s + s_c)
• This is continuous, so we can perform SGD
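The max-margin objective is a one-liner; it is zero once the true score beats the corrupt score by a margin of 1:

```python
def max_margin_loss(s, s_c):
    """J = max(0, 1 - s + s_c): zero when the true window's score s
    exceeds the corrupt window's score s_c by at least the margin 1."""
    return max(0.0, 1.0 - s + s_c)

# margin satisfied -> no loss, no gradient
loss_ok = max_margin_loss(s=3.0, s_c=0.5)     # 0.0
# margin violated -> positive loss drives an SGD update
loss_bad = max_margin_loss(s=1.0, s_c=0.8)    # 0.8
```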
Training with Backpropagation
Assuming the cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x
Training with Backpropagation
• Let's consider the derivative of a single weight W_ij
• This only appears inside a_i
• For example: W₂₃ is only used to compute a₂
[Figure: network with inputs x₁, x₂, x₃ and bias unit +1, hidden units a₁, a₂, weight W₂₃, and score s = Uᵀa]
Training with Backpropagation
Derivative of weight W_ij:
∂s/∂W_ij = ∂(U_i a_i)/∂W_ij = U_i (∂a_i/∂z_i)(∂z_i/∂W_ij) = U_i f′(z_i) x_j
Training with Backpropagation
Derivative of a single weight W_ij:
∂s/∂W_ij = U_i f′(z_i) · x_j = δ_i x_j
where for logistic f: f′(z) = f(z)(1 − f(z))
δ_i = U_i f′(z_i) is the local error signal; x_j is the local input signal
Training with Backpropagation
• From a single weight W_ij to the full W:
• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product ∂s/∂W = δ xᵀ, where δ is the "responsibility" coming from each activation a
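The outer-product gradient (and the bias gradient from the next slide) can be checked against central differences (toy shapes, random values):

```python
import numpy as np

np.random.seed(0)
f  = lambda z: 1.0 / (1.0 + np.exp(-z))
df = lambda z: f(z) * (1.0 - f(z))         # logistic: f'(z) = f(z)(1 - f(z))

x = np.random.randn(3)
W = np.random.randn(2, 3)
b = np.random.randn(2)
U = np.random.randn(2)

z = W.dot(x) + b
delta = U * df(z)                          # error signal per hidden unit
grad_W = np.outer(delta, x)                # ds/dW_ij = delta_i x_j
grad_b = delta                             # ds/db_i  = delta_i

# numerical check of one entry, ds/dW_23 (0-indexed: i=1, j=2)
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
sp = U.dot(f(Wp.dot(x) + b))
sm = U.dot(f(Wm.dot(x) + b))
numeric = (sp - sm) / (2 * eps)
```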
Training with Backpropagation
• For the biases b, we get: ∂s/∂b_i = U_i f′(z_i) = δ_i
Training with Backpropagation
That's almost backpropagation: it's simply taking derivatives and using the chain rule!
Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers
Example: the last derivatives of the model, the word vectors in x
Training with Backpropagation
• Take the derivative of the score with respect to a single word vector (for simplicity a 1-d vector, but the same holds if it was longer)
• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above, and hence x_j influences the overall score through all of them. Hence:
∂s/∂x_j = Σ_i ∂s/∂a_i · ∂a_i/∂x_j = Σ_i δ_i W_ij
(the δ_i are the re-used part of the previous derivative)
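The sum over neurons is the matrix product ∇_x s = Wᵀδ, re-using the same δ computed for the W gradient; a sketch with a numerical check:

```python
import numpy as np

np.random.seed(1)
f  = lambda z: 1.0 / (1.0 + np.exp(-z))
df = lambda z: f(z) * (1.0 - f(z))

x = np.random.randn(3)
W = np.random.randn(2, 3)
b = np.random.randn(2)
U = np.random.randn(2)

delta = U * df(W.dot(x) + b)               # re-used from the W gradient
grad_x = W.T.dot(delta)                    # each x_j feeds all a_i: sum over i

# numerical check of ds/dx_0
eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[0] += eps
xm[0] -= eps
numeric = (U.dot(f(W.dot(xp) + b)) - U.dot(f(W.dot(xm) + b))) / (2 * eps)
```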
Summary