Lesson 4 Deep learning for NLP: Word Representation Learning
October 20, 2016
EPFL Doctoral Course EE-724, Nikolaos Pappas
Idiap Research Institute, Martigny
Human Language Technology: Application to Information Access
Outline of the talk
1. Introduction and Motivation
2. Neural Networks: The basics
3. Word Representation Learning
4. Summary and Beyond Words
Deep learning
• Machine learning boils down to minimizing an objective function to increase task performance
  • it mostly relies on human-crafted features, e.g. topic, syntax, grammar, polarity
➡ Representation learning: attempts to learn good features or representations automatically
➡ Deep learning: machine learning algorithms based on multiple levels of representation or abstraction
Motivation for exploring deep learning: Why care?
• Human-crafted features are time-consuming, rigid, and often incomplete
• Learned features are easy to adapt and learn
• Deep learning provides a very flexible, unified, and learnable framework that can handle a variety of inputs, such as vision, speech, and language:
  • unsupervised, from raw input (e.g. text)
  • supervised, with labels provided by humans (e.g. sentiment)
Motivation for exploring deep learning: Why now?
• What enabled deep learning techniques to start outperforming other machine learning techniques since Hinton et al. 2006?
  • larger amounts of data
  • faster computers and multicore CPUs and GPUs
  • new models, algorithms, and improvements over "older" methods (speech, vision, and language)
Deep learning for speech: Phoneme detection
• The first breakthrough results of "deep learning" on large datasets, by Dahl et al. 2010
  • -30% reduction of error
• More recently, on speech synthesis: van den Oord et al. 2016
Deep learning for vision: Object detection
• Popular topic for DL
• Breakthrough on ImageNet by Krizhevsky et al. 2012
  • -21% and -51% error reduction at top-1 and top-5
Deep learning for language: Ongoing
• Significant improvements in recent years across different levels (phonology, morphology, syntax, semantics) and applications in NLP:
  • Machine translation (most notable)
  • Question answering
  • Sentiment classification
  • Summarization
• Still a lot of work to be done… e.g. metrics, and tasks beyond "basic" recognition: attention, reasoning, planning
Attention mechanism for deep learning
• Operates on an input or intermediate sequence
• Chooses "where to look", i.e. learns to assign a relevance weight to each input position: essentially parametric pooling (see the sketch below)
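To make "parametric pooling" concrete, here is a minimal numpy sketch; the sequence H and the score vector w are toy assumptions, not part of any published model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # a sequence of 5 hidden states of size 8
w = rng.normal(size=8)        # learned attention parameter vector (toy)

scores = H @ w                # one relevance score per input position
alpha = softmax(scores)       # normalized "where to look" weights
pooled = alpha @ H            # weighted sum over positions: parametric pooling
print(alpha, pooled.shape)
```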
Deep learning for language: Machine translation
• Reached the state of the art in one year: Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015
Outline of the talk
1. Neural Networks
  • Basics: perceptron, logistic regression
  • Learning the parameters
  • Advanced models: spatial and temporal/sequential
2. Word Representation Learning
  • Semantic similarity
  • Traditional and recent approaches
  • Intrinsic and extrinsic evaluation
3. Summary and Beyond
Introduction to neural networks
• Biologically inspired by how the human brain works
• The brain seems to have a generic learning algorithm
• Neurons activate in response to inputs and in turn excite other neurons
What can a perceptron do?
• Processes a weighted sum of its inputs through a threshold activation
• Solves linearly separable problems
• …but not non-linearly separable ones (a minimal sketch follows below)
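A minimal sketch of the perceptron learning rule on the linearly separable AND function; the data and hyperparameters are toy assumptions:

```python
import numpy as np

# A perceptron trained on the (linearly separable) AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(10):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0   # threshold activation
        w += lr * (yi - pred) * xi          # perceptron update rule
        b += lr * (yi - pred)

print([1 if xi @ w + b > 0 else 0 for xi in X])  # [0, 0, 0, 1]
```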
A neural network: several logistic regressions at the same time
• Apply several regressions to obtain a vector of outputs
• The values of the outputs are initially unknown
• No need to specify ahead of time what values the logistic regressions are trying to predict
A neural network: several logistic regressions at the same time (continued)
• The intermediate variables are learned directly based on the training objective
• This makes them do a good job at predicting the target for the next layer
• Result: able to model non-linearities in the data! (see the forward-pass sketch below)
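A sketch of the forward pass, assuming a toy 3-unit input, four hidden logistic units, and one output logistic unit (the sizes are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input example

# Layer 1: four logistic regressions applied at the same time.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
h = sigmoid(W1 @ x + b1)          # intermediate variables, learned from the objective

# Layer 2: one more logistic regression on top of the hidden vector.
W2, b2 = rng.normal(size=4), 0.0
y = sigmoid(W2 @ h + b2)
print(h, y)
```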
Learning parameters using gradient descent
• Given training data, find the parameters (weights and biases) that minimize the loss with respect to these parameters
• Compute the gradient with respect to the parameters and make a small step towards the direction of the negative gradient, as in the sketch below
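A minimal sketch of full-batch gradient descent on a least-squares objective; the toy data, learning rate, and step count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad                          # small step along the negative gradient
print(w)  # close to true_w
```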
Going large scale: Stochastic gradient descent (SGD)
• Approximate the gradient using a mini-batch of examples instead of the entire training set
• Online SGD is the case where the mini-batch size is one
• Most commonly used when compared to (full-batch) GD; a mini-batch sketch follows below
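The same toy problem with mini-batch SGD; only the gradient computation changes, from the entire training set to a random mini-batch (batch size 10 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w, lr, batch = np.zeros(3), 0.05, 10
for epoch in range(50):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch):
        i = idx[start:start + batch]                  # a random mini-batch
        grad = 2 * X[i].T @ (X[i] @ w - y[i]) / batch  # approximate gradient
        w -= lr * grad
print(w)
```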
Learning parameters using gradient descent: Learning-rate decay
• Several out-of-the-box strategies exist for decaying the learning rate of an objective function
• Select the best one according to validation-set performance
Training neural networks with arbitrary layers: Backpropagation
• We still minimize the objective function, but this time we "backpropagate" the errors to all the hidden layers
• Chain rule: if y = f(u) and u = g(x), i.e. y = f(g(x)), then dy/dx = (dy/du) · (du/dx)
• Useful basic derivatives, e.g. σ'(z) = σ(z)(1 − σ(z)) for the sigmoid and 1 − tanh²(z) for tanh
• Typically, the backprop computation is implemented in popular libraries: Theano, Torch, TensorFlow (a manual sketch follows below)
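A manual sketch of backpropagation through the tiny two-layer network from before, applying the chain rule layer by layer; the single training example and learning rate are toy assumptions, and libraries like Theano, Torch, and TensorFlow automate exactly this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, target = rng.normal(size=3), 1.0
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0

for step in range(100):
    # Forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    loss = (y - target) ** 2
    # Backward pass: chain rule, from the loss down to each parameter
    dz2 = 2 * (y - target) * y * (1 - y)  # dLoss/d(output pre-activation)
    dW2, db2 = dz2 * h, dz2
    dh = dz2 * W2
    dz1 = dh * h * (1 - h)                # through the hidden sigmoids
    dW1, db1 = np.outer(dz1, x), dz1
    # Gradient descent update
    W1 -= 0.5 * dW1; b1 -= 0.5 * db1
    W2 -= 0.5 * dW2; b2 -= 0.5 * db2

print(float(loss))  # should be near zero
```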
Advanced neural networks
• Essentially, we now have all the basic "ingredients" we need to build deep neural networks
• The more layers, the more non-linear the final projection
• Augmentation with new properties
➡ Advanced neural networks are able to deal with different arrangements of the input:
  • Spatial: convolutional networks
  • Sequential: recurrent networks
Spatial modeling: Convolutional neural networks
• A fully connected network over input pixels is not efficient
• Inspired by the organization of the animal visual cortex:
  • assumes that the inputs are images
  • connects each neuron to a local region (see the sketch below)
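A sketch of the local-region idea with a single 3x3 filter slid over a toy 8x8 input; no library, just the raw operation:

```python
import numpy as np

# Each output unit connects only to a local 3x3 region of the input,
# and the same filter weights are reused at every position.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

out = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
print(out.shape)  # (6, 6): a feature map
```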
Sequence modeling: Recurrent neural networks
• Traditional networks can't model sequence information:
  • lack of information persistence
• Recursion: multiple copies of the same network, where each one passes information on to its successor (a minimal cell sketch follows below)
*Diagram from Christopher Olah's blog.
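A minimal sketch of a vanilla recurrent cell: the same weights are reused at every time step, and the hidden state h is what each copy passes on to its successor (toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(4, 3)) * 0.1   # input -> hidden
Whh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (the recursion)
b = np.zeros(4)

xs = rng.normal(size=(6, 3))          # a sequence of 6 input vectors
h = np.zeros(4)
for x in xs:
    h = np.tanh(Wxh @ x + Whh @ h + b)  # successor receives the previous h
print(h)
```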
Sequence modeling: Gated recurrent networks
• Long short-term memory networks are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
• The gated RNN by Cho et al. 2014 combines the forget and input gates into a single "update gate"
*Diagram from Christopher Olah's blog.
Sequence modeling: Neural Turing Machines and Memory Networks
• Combination of a recurrent network with an external memory bank: Graves et al. 2014, Weston et al. 2014
*Diagram from Christopher Olah's blog.
Sequence modeling: Recurrent neural networks are flexible
• Vanilla NNs
• Image captioning
• Sentiment classification
• Topic detection
• Machine translation
• Summarization
• Speech recognition
• Video classification
*Diagram from Karpathy's Stanford CS231n course.
Outline of the talk
1. Neural Networks
  • Basics: perceptron, logistic regression
  • Learning the parameters
  • Advanced models: spatial and temporal/sequential
2. Word Representation Learning
  • Semantic similarity
  • Traditional and recent approaches
  • Intrinsic and extrinsic evaluation
3. Summary and Beyond
*Image from Lebret's thesis (2016).
Semantic similarity: How similar are two linguistic items?
• Word level:
  • screwdriver vs. wrench: very similar
  • screwdriver vs. hammer: little similar
  • screwdriver vs. technician: related
  • screwdriver vs. fruit: unrelated
• Sentence level, compared to "The boss fired the worker":
  • "The supervisor let the employee go": very similar
  • "The boss reprimanded the worker": little similar
  • "The boss promoted the worker": related
  • "The boss went jogging today": unrelated
Semantic similarity: How similar are two linguistic items?
• Defined at many levels:
  • words, word senses or concepts, phrases, paragraphs, documents
• Similarity is a specific type of relatedness:
  • related: topically or via a relation, e.g. heart vs. surgeon, wheel vs. bike
  • similar: synonyms and hyponyms, e.g. doctor vs. surgeon, bike vs. bicycle
Semantic similarity: Numerous attempts to answer that
*Image from D. Jurgens' NAACL 2016 tutorial.
Semantic similarity: Why do we have so many methods?
• New resources or methods:
  • new datasets reveal weaknesses in previous methods
  • the state of the art is a moving target
• Task-specific similarity functions:
  • performance on new tasks is not satisfactory
➡ Semantic similarity is not the end task:
  • pick the one which yields the best results
  • need for methods that quickly adapt similarity
Two main sources for measuring similarity
• Massive text corpora
• Semantic resources and knowledge bases
How to represent semantics? Vector space models
• Explicit: each dimension denotes a specific linguistic item
  • interpretable dimensions
  • high dimensionality
• Continuous: dimensions are not tied to explicit concepts
  • enable comparison between represented linguistic items
  • low dimensionality
How to compare two linguistic items in the vector space
• Cosine of the angle θ between A and B: cos(θ) = (A · B) / (‖A‖ ‖B‖)
• Explicit models have a serious sparsity problem due to their discrete or "k-hot" vector representations:
  • france = [0, 0, 0, 1, 0, 0]
  • england = [0, 1, 0, 0, 0, 0]
  • "france is near spain" = [1, 0, 0, 1, 1, 1]
• cos(france, england) = 0.0
• cos(france, "france is near spain") = 0.57 (see the sketch below)
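A small sketch of cosine similarity over the one-hot vectors above; because one-hot vectors of distinct words never overlap, their cosine is always 0, which is exactly the sparsity problem:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

france = np.array([0, 0, 0, 1, 0, 0], dtype=float)
england = np.array([0, 1, 0, 0, 0, 0], dtype=float)
print(cosine(france, england))  # 0.0: one-hot vectors of distinct words never overlap
```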
Learning word vector representations from text
• Limitations of knowledge-based methods:
  • out of context despite the validity of the resources
  • most lack evaluation on practical tasks
• What if we do not know anything about words? Follow the distributional hypothesis:
"You shall know a word by the company it keeps", Firth 1957
  • "The value of the central bank increased by 10%." (financial institution)
  • "She often goes to the bank to withdraw cash." (financial institution)
  • "She went to the river bank to have a picnic with her child." (geographical term)
Simple approach: Compute a word-in-context co-occurrence matrix
• Matrix of counts between words and contexts (neighboring words or whole documents); a small sketch follows below
• Limitations of this method:
  • all words have equal importance (imbalance)
  • vectors are very high-dimensional (storage issue)
  • infrequent words have overly sparse vectors (which makes subsequent models less robust)
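A minimal sketch of counting word-in-context co-occurrences with a symmetric window of size 2; the corpus is a toy assumption (real matrices are built over massive corpora):

```python
from collections import Counter

corpus = "she went to the bank to withdraw cash from the bank".split()
window = 2

counts = Counter()
for i, word in enumerate(corpus):
    # Count every word within `window` positions of the current word.
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[(word, corpus[j])] += 1

print(counts[("bank", "the")])  # how often "the" appears near "bank"
```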
The most standard approach: Dimensionality reduction
• Perform singular value decomposition (SVD), X = U Σ Vᵀ, of the word co-occurrence matrix that we saw previously
  • typically, U·Σ is used as the vector space (see the sketch below)
*Image from D. Jurgens’ NAACL 2016 tutorial.
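A sketch of the SVD recipe on a random toy co-occurrence matrix; the truncation width k = 5 is an arbitrary choice:

```python
import numpy as np

# SVD of a (toy) word co-occurrence matrix X = U * S * V^T; the rows of
# U * S, truncated to k dimensions, serve as dense word vectors.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 50)).astype(float)  # 20 words x 50 contexts

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
word_vectors = U[:, :k] * S[:k]    # one k-dimensional vector per word
print(word_vectors.shape)           # (20, 5)
```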
The most standard approach: Dimensionality reduction (continued)
• Syntactically and semantically related words cluster together
*Plots from Rohde et al. 2005.
Dimensionality reduction with Hellinger PCA
• Perform PCA with the Hellinger distance on the word co-occurrence matrix: Lebret and Collobert 2014
• Well suited for discrete probability distributions (P, Q): H(P, Q) = (1/√2) ‖√P − √Q‖₂
• Neural approaches are time-consuming (tuning, data):
  • instead, compute word vectors efficiently with PCA
  • fine-tuning them on specific tasks works better than neural embeddings
• Limitations: hard to add new words; not scalable, O(mn²)
A sketch of the idea follows below.
https://github.com/rlebret/hpca
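A sketch of the Hellinger PCA idea under simplifying assumptions (PCA done via SVD of the centered matrix; the reference implementation at the URL above differs in details):

```python
import numpy as np

# Rows of the co-occurrence matrix are turned into probability
# distributions and square-rooted, so that Euclidean distance between
# rows is proportional to the Hellinger distance; then PCA reduces them.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(20, 50)).astype(float)  # toy counts

P = counts / counts.sum(axis=1, keepdims=True)  # each row: P(context | word)
R = np.sqrt(P)                                   # the Hellinger trick

R_centered = R - R.mean(axis=0)                  # PCA = SVD of centered data
U, S, Vt = np.linalg.svd(R_centered, full_matrices=False)
word_vectors = U[:, :5] * S[:5]
print(word_vectors.shape)  # (20, 5)
```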
Dimensionality reduction with weighted least squares
• GloVe vectors by Pennington et al. 2014. Factorizes the log of the co-occurrence matrix with a weighted least-squares objective: J = Σᵢⱼ f(Xᵢⱼ) (wᵢᵀ w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)², where f is a weighting function
• Fast training, scalable to huge corpora, but still hard to incorporate new words
• Reported much better results than neural embeddings; however, under equivalent tuning this is not the case: Levy and Goldberg 2015
http://nlp.stanford.edu/projects/glove/
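A small loader for the pre-trained vectors distributed at the project page above; the file name glove.6B.50d.txt is one of the downloadable files and is assumed to be present locally:

```python
import numpy as np

def load_glove(path):
    """Parse the GloVe text format: one word plus its values per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

vectors = load_glove("glove.6B.50d.txt")
print(vectors["king"][:5])
```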
Dimensionality reduction with neural networks
• The main idea is to directly learn low-dimensional word representations from data:
  • learning representations: Rumelhart et al. 1986
  • neural probabilistic language model: Bengio et al. 2003
  • NLP (almost) from scratch: Collobert and Weston 2008
• Recent methods are faster and simpler:
  • Continuous Bag-Of-Words (CBOW)
  • Skip-gram with Negative Sampling (SGNS)
  • word2vec toolkit: Mikolov et al. 2013
word2vec: Skip-gram with negative sampling (SGNS)
• Given the middle word, predict the surrounding ones in a fixed window of words (maximize the log-likelihood)
word2vec: Skip-gram with negative sampling (SGNS) (continued)
• How is the probability P(wₜ | h) implemented? As a softmax over the vocabulary: P(wₜ | h) = exp(s(wₜ, h)) / Σ_{w ∈ V} exp(s(w, h))
• The denominator is very inefficient for a big vocabulary!
• Instead it uses a more scalable objective, where log Qθ is a binary logistic regression of word w and history h: maximize log Qθ(D=1 | wₜ, h) for observed pairs, plus the expected log Qθ(D=0 | w̃, h) over k sampled "noise" words w̃
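A sketch of training SGNS with the gensim library rather than the original word2vec toolkit; the corpus is a toy assumption and parameter names follow gensim >= 4.0:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "boss", "fired", "the", "worker"],
    ["the", "supervisor", "let", "the", "employee", "go"],
    ["she", "went", "to", "the", "bank", "to", "withdraw", "cash"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=5,         # fixed window of surrounding words
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative (noise) samples
    min_count=1,
)
print(model.wv["bank"][:5])
```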
word2vec: Continuous Bag-Of-Words with negative sampling (CBOW)
• More efficient, but the ordering information of the words does not influence the projection
• Factorizes a PMI word-context matrix: Levy and Goldberg 2014
  • builds upon existing methods (new decomposition)
• Improvements on a variety of intrinsic tasks such as relatedness, categorization, and analogy: Baroni et al. 2014, Schnabel et al. 2015
word2vec: Learns meaningful linear relationships of words
• Word vector dimensions capture several meaningful relations between words: present/past tense, singular/plural, male/female, capital/country
• Analogies between words can be efficiently computed using basic arithmetic operations between vectors (+, −), as in the example below:
king − man + woman ≈ queen
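A sketch of the analogy computation with gensim, assuming the pre-trained GoogleNews vectors from the word2vec page have been downloaded locally:

```python
from gensim.models import KeyedVectors

# Load vectors in the original word2vec binary format.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# king - man + woman ~= queen, via vector arithmetic and cosine ranking.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```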
Learning word representations from text: Recap
• Most methods are *similar* to SVD over a PMI matrix; however, word2vec has the edge over the alternatives:
  • scales well to massive text corpora and new words
  • yields top results on most tasks
• On extrinsic tasks it is essential to fine-tune (for beating BOW)
➡ Several extensions:
  • dependency-based embeddings: Levy and Goldberg 2014
  • retrofitted-to-lexicons embeddings: Faruqui et al. 2014
  • sense-aware embeddings: Li and Jurafsky 2015
  • visually-grounded embeddings: Lazaridou et al. 2015
  • multilingual embeddings: Gouws et al. 2015
Open problems in semantic similarity research
• Irregular language:
  "can i watch 4od bbc iplayer etc with 10GB useage allowence?"
• Multi-word expressions:
  "We need to sort out the problem" | "We need to sort the problem out"
• Syntax and punctuation:
  "Man bites dog" | "Dog bites man"
  "A woman: without her, man is nothing."
Open problems in semantic similarity research
• Variable-size input:
  "Prius" | "A fuel-efficient hybrid car" | "An automobile powered by both an internal combustion (…)"
• Ambiguity when lacking context:
  "The boss fired his worker."
• Subjectivity versus objectivity:
  "This was a good day." | "This was a bad day."
• Out-of-vocabulary words: slang, hash-tags, neologisms
Beyond words
• Word vectors are also useful for building semantic vectors of phrases, sentences, and documents:
  • input or output space for several practical tasks
  • basis for multilingual or multimodal transfer (via alignment)
  • interpretability: do we care about what each word vector dimension means? It depends; we may need to compromise
• Next course:
  • learning representations of word sequences
  • more details on sequence models
References
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In NIPS, 2013.
• Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." In EMNLP, 2014.
• Rémi Lebret and Ronan Collobert. "Word Embeddings through Hellinger PCA." In EACL, 2014.
• Quoc V. Le and Tomas Mikolov. "Distributed Representations of Sentences and Documents." In ICML, 2014.
• Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. "Retrofitting word vectors to semantic lexicons." In ACL, 2014.
• Omer Levy and Yoav Goldberg. "Dependency-Based Word Embeddings." In ACL, 2014.
• Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. "Evaluation methods for unsupervised word embeddings." In EMNLP, 2015.
• Omer Levy, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." TACL, 2015.
• Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. "Problems With Evaluation of Word Embeddings Using Word Similarity Tasks." In RepEval, 2016.
• Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. "Sparse overcomplete word vector representations." In ACL, 2015.
• Yoav Goldberg. "A primer on neural network models for natural language processing." arXiv preprint 1510.00726, 2015.
• Ian Goodfellow, Aaron Courville, and Yoshua Bengio. "Deep Learning." Book in preparation for MIT Press, 2015.
Resources (1/2)
➡ Online courses
• Coursera course on "Neural networks for machine learning" by Geoffrey Hinton: https://www.coursera.org/learn/neural-networks
• Coursera course on "Machine learning" by Andrew Ng: https://www.coursera.org/learn/machine-learning
• Stanford CS224d "Deep learning for NLP" by Richard Socher: http://cs224d.stanford.edu/
➡ Conference tutorials
• Richard Socher and Christopher Manning, "Deep learning for NLP", NAACL 2013 tutorial: http://nlp.stanford.edu/courses/NAACL2013/
• David Jurgens and Mohammad Taher Pilehvar, "Semantic Similarity Frontiers: From Concepts to Documents", EMNLP 2015 tutorial: http://www.emnlp2015.org/tutorials.html#t1
• Mitesh M. Khapra and Sarath Chandar, "Multilingual and Multimodal Language Processing", NAACL 2016 tutorial: http://naacl.org/naacl-hlt-2016/t2.html
Resources (2/2)
➡ Deep learning toolkits
• Theano: http://deeplearning.net/software/theano
• Torch: http://www.torch.ch/
• TensorFlow: http://www.tensorflow.org/
• Keras: http://keras.io/
➡ Pre-trained word vectors and code
• word2vec toolkit and vectors: https://code.google.com/p/word2vec/
• GloVe code and vectors: http://nlp.stanford.edu/projects/glove/
• Hellinger PCA: https://github.com/rlebret/hpca
• Online word vector evaluation: http://wordvectors.org/