1. How do we represent the meaning of a word?
Definition: meaning (Webster dictionary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:
signifier (symbol) ⟺ signified (idea or thing) = denotation
How do we have usable meaning in a computer?
Common solution: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms ("is a" relationships).

e.g. synonym sets containing "good":
(adj) full, good
(adj) estimable, good, honorable, respectable
(adj) beneficial, good
(adj) good, just, upright
(adj) adept, expert, good, practiced, proficient, skillful
(adj) dear, good, near
(adj) good, right, ripe
…
(adv) well, good
(adv) thoroughly, soundly, good
(n) good, goodness
(n) commodity, trade good, good

e.g. hypernyms of "panda":
[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
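These lookups can be reproduced with NLTK's WordNet interface. A minimal sketch (it assumes the nltk package is installed and the wordnet corpus has been downloaded):

```python
# Querying WordNet through NLTK for synonym sets and hypernyms.
# Assumes: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

poses = {'n': 'noun', 'v': 'verb', 'a': 'adj', 's': 'adj (sat)', 'r': 'adv'}

# Synonym sets containing "good", grouped by part of speech:
for synset in wn.synsets("good"):
    print("({}) {}".format(poses[synset.pos()],
                           ", ".join(lemma.name() for lemma in synset.lemmas())))

# Hypernyms of "panda": follow the "is a" links up the hierarchy.
panda = wn.synset("panda.n.01")
print(list(panda.closure(lambda s: s.hypernyms())))
```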
Problems with resources like WordNet
• Great as a resource but missing nuance
  • e.g. "proficient" is listed as a synonym for "good". This is only correct in some contexts.
• Missing new meanings of words
  • e.g. wicked, badass, nifty, wizard, genius, ninja, bombest
  • Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Hard to compute accurate word similarity
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel

Words can be represented by one-hot vectors (one 1, the rest 0s):
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g. 500,000)
Problem with words as discrete symbols
Example: in web search, if a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel".

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!

Solution:
• Could rely on WordNet's list of synonyms to get similarity?
• Instead: learn to encode similarity in the vectors themselves
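To make the orthogonality problem concrete, here is a toy sketch; the three-word vocabulary is illustrative only (a real one would have e.g. 500,000 entries):

```python
# One-hot vectors for a toy vocabulary, and why they encode no similarity.
import numpy as np

vocab = ["seattle", "motel", "hotel"]              # illustrative vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0                           # one 1, the rest 0s
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel @ hotel)   # 0.0 -- orthogonal: "motel" and "hotel" look unrelated
print(motel @ motel)   # 1.0 -- each word is similar only to itself
```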
Representing words by their context
• Core idea: A word's meaning is given by the words that frequently appear close by.
• "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!
• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w, as in the sketch below.

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking.
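A minimal sketch of collecting those nearby words, with a window_pairs helper introduced here for illustration (window size m = 2, over a pre-tokenized list):

```python
# Enumerate (center word, context word) pairs within a fixed-size window.
def window_pairs(tokens, m=2):
    """Yield (center, context) pairs for every position t and offset j != 0."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

sent = "government debt problems turning into banking crises".split()
print([ctx for center, ctx in window_pairs(sent) if center == "banking"])
# ['turning', 'into', 'crises'] -- contexts that will represent "banking"
```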
Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Note: word vectors are sometimes called word embeddings or word representations.

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
2. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context ("outside") words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
Word2Vec Overview
• Example windows and process for computing $P(w_{t+j} \mid w_t)$:

…problems turning into banking crises as…

Center word at position $t$: "into". Outside context words in a window of size 2: "problems" ($w_{t-2}$), "turning" ($w_{t-1}$), "banking" ($w_{t+1}$), "crises" ($w_{t+2}$), yielding $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$.
Word2Vec Overview
• The same process one position later: "banking" is now the center word at position $t$.

…problems turning into banking crises as…

Outside context words in the window of size 2: "turning" ($w_{t-2}$), "into" ($w_{t-1}$), "crises" ($w_{t+1}$), "as" ($w_{t+2}$), yielding $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$.
Word2vec: objective function
For each position $t = 1, \dots, T$, predict context words within a window of fixed size $m$, given center word $w_t$.

Likelihood:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

$\theta$ is all variables to be optimized.

The objective function $J(\theta)$ is the (average) negative log likelihood (sometimes called the cost or loss function):
$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Minimizing the objective function ⟺ maximizing predictive accuracy.
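The objective can also be written directly as code. In this sketch, prob(o, c) is a placeholder standing in for $P(o \mid c; \theta)$; a concrete version appears with the prediction function below:

```python
# Average negative log-likelihood over a tokenized corpus:
# J = -(1/T) * sum_t sum_{-m<=j<=m, j!=0} log prob(w_{t+j}, w_t)
import numpy as np

def neg_log_likelihood(tokens, prob, m=2):
    total, T = 0.0, len(tokens)
    for t in range(T):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:
                total -= np.log(prob(tokens[t + j], tokens[t]))
    return total / T
```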
Word2vec: objective function
• We want to minimize the objective function:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$
• Question: How to calculate $P(w_{t+j} \mid w_t; \theta)$?
• Answer: We will use two vectors per word $w$:
  • $v_w$ when $w$ is a center word
  • $u_w$ when $w$ is a context word
• Then for a center word $c$ and a context word $o$:
$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
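A sketch of this probability with NumPy; U and V are illustrative embedding matrices (one row per word, with words identified by integer indices):

```python
# Naive-softmax P(o | c) from two embedding matrices.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                        # illustrative sizes
U = 0.1 * rng.normal(size=(vocab_size, d))   # outside vectors u_w (one per word)
V = 0.1 * rng.normal(size=(vocab_size, d))   # center vectors v_w (one per word)

def prob(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]                        # dot products u_w . v_c for all w
    e = np.exp(scores - scores.max())        # shift for numerical stability
    return e[o] / e.sum()
```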
Word2Vec Overview with Vectors
• Example windows and process for computing $P(w_{t+j} \mid w_t)$
• $P(u_{\text{problems}} \mid v_{\text{into}})$ is short for $P(\text{problems} \mid \text{into}; u_{\text{problems}}, v_{\text{into}}, \theta)$

…problems turning into banking crises as…

Center word at position $t$: "into". The outside context words in the window of size 2 give $P(u_{\text{problems}} \mid v_{\text{into}})$, $P(u_{\text{turning}} \mid v_{\text{into}})$, $P(u_{\text{banking}} \mid v_{\text{into}})$, $P(u_{\text{crises}} \mid v_{\text{into}})$.
Word2Vec Overview with Vectors
• Example windows and process for computing $P(w_{t+j} \mid w_t)$

…problems turning into banking crises as…

Center word at position $t$: "banking". The outside context words in the window of size 2 give $P(u_{\text{turning}} \mid v_{\text{banking}})$, $P(u_{\text{into}} \mid v_{\text{banking}})$, $P(u_{\text{crises}} \mid v_{\text{banking}})$, $P(u_{\text{as}} \mid v_{\text{banking}})$.
Word2vec: prediction function
$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
• The dot product compares the similarity of $o$ and $c$: a larger dot product means a larger probability.
• After taking the exponent, we normalize over the entire vocabulary.
• This is an example of the softmax function $\mathbb{R}^n \to \mathbb{R}^n$:
$$\operatorname{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i$$
• The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$
  • "max" because it amplifies the probability of the largest $x_i$
  • "soft" because it still assigns some probability to smaller $x_i$
  • Frequently used in deep learning
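A sketch of softmax itself. Shifting the scores by their maximum leaves the output unchanged (softmax is shift-invariant) while avoiding overflow:

```python
# The softmax function, mapping arbitrary scores to a probability distribution.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # same result as exp(x), but overflow-safe
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # the largest score gets the largest share ("max"),
                   # yet smaller scores keep some probability ("soft")
```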
To train the model: compute all vector gradients!
• Recall: $\theta$ represents all model parameters, in one long vector.
• In our case, with $d$-dimensional vectors and $V$-many words, $\theta \in \mathbb{R}^{2dV}$, since every word has two vectors (see the sketch below).
• We then optimize these parameters.
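A tiny sketch of that parameter layout (the sizes here are illustrative; a real setting would have $d$ in the hundreds and e.g. 500,000 words):

```python
# Packing every word's center and outside vector into one long vector theta.
import numpy as np

d, num_words = 4, 6
V_center = np.zeros((num_words, d))    # v_w for every word w
U_outside = np.zeros((num_words, d))   # u_w for every word w
theta = np.concatenate([V_center.ravel(), U_outside.ravel()])
print(theta.shape)                     # (48,) = (2 * d * num_words,)
```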
3. Derivations of gradient
• The basic Lego piece
• Useful basics, such as $\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^\top \mathbf{x}) = \mathbf{a}$
• If in doubt: write out with indices
• Chain rule! If $y = f(u)$ and $u = g(x)$, i.e. $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
Chain Rule
• Chain rule! If $y = f(u)$ and $u = g(x)$, i.e. $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
• Simple example: e.g. with $y = u^3$ and $u = x^2 + 1$,
$$\frac{dy}{dx} = 3u^2 \cdot 2x = 6x\,(x^2 + 1)^2$$
Note
Let's derive the gradient for the center word together, for one example window and one example outside word (the result is sketched below).
You then also need the gradient for context words (it's similar; left for homework). That's all of the parameters $\theta$ here.
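The derivation itself was worked through live; for the naive-softmax model defined above, the standard result for the center-word gradient is:

```latex
\frac{\partial}{\partial v_c} \log P(o \mid c)
  = \frac{\partial}{\partial v_c} \Bigl[\, u_o^\top v_c
      - \log \sum_{w \in V} \exp(u_w^\top v_c) \Bigr]
  = u_o - \sum_{w \in V} \frac{\exp(u_w^\top v_c)}{\sum_{x \in V} \exp(u_x^\top v_c)}\, u_w
  = u_o - \sum_{w \in V} P(w \mid c)\, u_w
```

That is: the observed outside vector minus the model's expected outside vector. The gradient is zero exactly when the model's expectation matches the observation.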
Calculating all gradients!
• We went through the gradient for each center vector $v$ in a window.
• We also need gradients for outside vectors $u$.
• Derive at home!
• Generally, in each window we will compute updates for all parameters that are being used in that window. For example, with center word "banking":

…problems turning into banking crises as…

The window of size 2 around the center word at position $t$ uses $u_{\text{turning}}$, $u_{\text{into}}$, $u_{\text{crises}}$, $u_{\text{as}}$, and $v_{\text{banking}}$, via $P(u_{\text{turning}} \mid v_{\text{banking}})$, $P(u_{\text{into}} \mid v_{\text{banking}})$, $P(u_{\text{crises}} \mid v_{\text{banking}})$, $P(u_{\text{as}} \mid v_{\text{banking}})$.
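Following the derivation above, a sketch of both gradients for a single (center $c$, outside $o$) pair, with U and V as in the earlier probability sketch:

```python
# Gradients of the loss -log P(o | c) under naive softmax.
import numpy as np

def pair_gradients(o, c, U, V):
    scores = U @ V[c]
    p = np.exp(scores - scores.max())
    p /= p.sum()                      # p[w] = P(w | c)
    grad_vc = p @ U - U[o]            # = -(u_o - sum_w P(w|c) u_w)
    grad_U = np.outer(p, V[c])        # row w: P(w|c) * v_c ...
    grad_U[o] -= V[c]                 # ... minus v_c for the true outside word
    return grad_vc, grad_U
```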
Word2vec: More details
Why two vectors? → Easier optimization. Average both at the end.

Two model variants:
1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

This lecture so far: the skip-gram model.

Additional efficiency in training:
1. Negative sampling

So far: focus on naive softmax (the simpler training method).
Gradient Descent
• We have a cost function $J(\theta)$ we want to minimize.
• Gradient descent is an algorithm to minimize $J(\theta)$.
• Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

Note: our objectives are not convex like this :(
Intuition
[Figure: gradient descent on a simple convex function over two parameters; contour lines show levels of the objective function.]
Gradient Descent
• Update equation (in matrix notation):
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha\, \nabla_\theta J(\theta)$$
• Update equation (for a single parameter):
$$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha\, \frac{\partial}{\partial \theta_j} J(\theta)$$
• $\alpha$ = step size or learning rate
• Algorithm: sketched below.
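A minimal runnable sketch of the update rule. A toy convex objective $J(\theta) = \lVert\theta\rVert^2/2$ (whose gradient is simply $\theta$) stands in for the real objective here:

```python
# Gradient descent: repeatedly step along the negative gradient.
import numpy as np

alpha = 0.1                               # step size (learning rate)
theta = np.array([4.0, -2.0])
for _ in range(100):                      # "repeat" step of the algorithm
    theta_grad = theta                    # gradient of the toy J at theta
    theta = theta - alpha * theta_grad    # small step downhill
print(theta)                              # ~[0, 0], the minimum
```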
Stochastic Gradient Descent
• Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions!)
  • So computing $\nabla_\theta J(\theta)$ is very expensive
  • You would wait a very long time before making a single update!
• A very bad idea for pretty much all neural nets!
• Solution: stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one
• Algorithm: sketched below.
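The same toy setup with the SGD twist: each step uses a cheap, noisy gradient estimate, mimicking a gradient computed from one sampled window rather than the whole corpus:

```python
# Stochastic gradient descent on the toy objective J(theta) = ||theta||^2 / 2.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
theta = np.array([4.0, -2.0])
for _ in range(1000):
    # Stand-in for the gradient from one sampled window: true gradient + noise.
    noisy_grad = theta + rng.normal(scale=0.5, size=theta.shape)
    theta = theta - alpha * noisy_grad    # update after every single sample
print(theta)                              # hovers near [0, 0] despite the noise
```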
PSet 1: The skip-gram model and negative sampling
• From the paper: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize):
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta), \qquad J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma(-u_j^\top v_c)\right]$$
• The sigmoid function (we'll become good friends soon): $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• So we maximize the probability of two words co-occurring in the first log term.
PSet 1: The skip-gram model and negative sampling
• Simpler notation, more similar to class and the PSet (as a loss to minimize):
$$J(o, c) = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_i^\top v_c)$$
• We take $k$ negative samples (word indices $i$ drawn from $P(w)$).
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word.
• $P(w) = U(w)^{3/4}/Z$: the unigram distribution $U(w)$ raised to the 3/4 power, normalized by $Z$ (we provide this function in the starter code).
• The 3/4 power makes less frequent words be sampled more often.
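A sketch of this loss for one training pair; the $k$ negative word indices are assumed to have been drawn from $P(w)$ already, and U, V are as in the earlier sketches:

```python
# Negative-sampling loss for one (center c, outside o) pair.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(o, c, neg_indices, U, V):
    """-log sig(u_o . v_c) - sum_k log sig(-u_k . v_c)."""
    loss = -np.log(sigmoid(U[o] @ V[c]))         # pull the real pair together
    for k in neg_indices:
        loss -= np.log(sigmoid(-U[k] @ V[c]))    # push sampled pairs apart
    return loss
```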
PSet 1: The continuous bag of words model
• Main idea of continuous bag of words (CBOW): predict the center word from the sum of surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model (a sketch follows below).
• To make the assignment slightly easier: implementing the CBOW model is not required (you can do it for a couple of bonus points!), but you do have to do the theory problem on CBOW.
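Implementation is optional, but for intuition, a sketch of the CBOW prediction; here context words use the V matrix and the center word is scored with U, mirroring the earlier skip-gram sketches (real implementations differ in which matrix plays which role):

```python
# CBOW: predict the center word from the sum of surrounding word vectors.
import numpy as np

def cbow_prob(center, context_indices, U, V):
    v_hat = V[context_indices].sum(axis=0)   # sum of the context word vectors
    scores = U @ v_hat                       # one score per candidate center word
    p = np.exp(scores - scores.max())        # stable softmax over the vocabulary
    return (p / p.sum())[center]             # P(center | context)
```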