
Lesson 4 Deep learning for NLP: Word Representation Learning

October 20, 2016
EPFL Doctoral Course EE-724, Nikolaos Pappas
Idiap Research Institute, Martigny
Human Language Technology: Application to Information Access

Outline of the talk

1. Introduction and Motivation
2. Neural Networks - The basics
3. Word Representation Learning
4. Summary and Beyond Words

Deep learning

• Machine Learning boils down to minimizing an objective function to increase task performance
• It mostly relies on human-crafted features, e.g. topic, syntax, grammar, polarity
➡ Representation Learning: attempts to automatically learn good features or representations
➡ Deep Learning: machine learning algorithms based on multiple levels of representation or abstraction

Key point: Learning multiple levels of representation

Motivation for exploring deep learning: Why care?

• Human-crafted features are time-consuming, rigid, and often incomplete
• Learned features are easy to adapt and quick to learn
• Deep Learning provides a very flexible, unified, and learnable framework that can handle a variety of inputs, such as vision, speech, and language
  • unsupervised from raw input (e.g. text)
  • supervised with labels by humans (e.g. sentiment)

Motivation for exploring deep learning: Why now?

• What enabled deep learning techniques to start outperforming other machine learning techniques since Hinton et al. 2006?
  • Larger amounts of data
  • Faster computers and multi-core CPUs and GPUs
  • New models, algorithms, and improvements over “older” methods (speech, vision, and language)

Deep learning for speech: Phoneme detection

• The first breakthrough results of “deep learning” on large datasets by Dahl et al. 2010
  • -30% reduction of error
• Most recently on speech synthesis: Oord et al. 2016

Deep learning for vision: Object detection

• Popular topic for DL
• Breakthrough on ImageNet by Krizhevsky et al. 2012
  • -21% and -51% error reduction at top-1 and top-5

Deep learning for language: Ongoing

• Significant improvements in recent years across different levels (phonology, morphology, syntax, semantics) and applications in NLP
  • Machine translation (most notable)
  • Question answering
  • Sentiment classification
  • Summarization

Still a lot of work to be done… e.g. metrics, and going beyond “basic” recognition: attention, reasoning, planning

Attention mechanism for deep learning

• Operates on an input or intermediate sequence
• Chooses “where to look”, i.e. learns to assign a relevance to each input position; essentially a parametric pooling (see the sketch below)
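A minimal sketch of this idea in numpy (my own illustrative example, not code from the lecture): each position of a sequence gets a relevance score, the scores are normalized with a softmax, and the sequence is pooled into a single vector as the weighted sum.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))           # subtract the max for numerical stability
        return e / e.sum()

    def attention_pool(H, w):
        """H: (T, d) sequence of hidden vectors; w: (d,) learned relevance vector."""
        scores = H @ w                       # one relevance score per position
        alpha = softmax(scores)              # attention weights, sum to 1
        return alpha @ H, alpha              # weighted sum = "parametric pooling"

    rng = np.random.RandomState(0)
    H = rng.randn(5, 8)                      # toy sequence: 5 positions, 8 dimensions
    w = rng.randn(8)                         # in practice w is learned with the rest of the model
    context, alpha = attention_pool(H, w)
    print(alpha, context.shape)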

Deep learning for language: Machine Translation

• Reached the state of the art in one year: Bahdanau et al. 2014, Jean et al. 2014, Gulcehre et al. 2015

Outline of the talk

1. Neural Networks
  • Basics: perceptron, logistic regression
  • Learning the parameters
  • Advanced models: spatial and temporal/sequential
2. Word Representation Learning
  • Semantic similarity
  • Traditional and recent approaches
  • Intrinsic and extrinsic evaluation
3. Summary and Beyond

Introduction to neural networks

• Biologically inspired by how the human brain works
• The brain seems to have a generic learning algorithm
• Neurons activate in response to inputs and in turn excite other neurons

Artificial neuron or Perceptron

What can a perceptron do?

• Solve linearly separable problems
• …but not non-linearly separable ones (e.g. XOR), as the sketch below illustrates
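A minimal sketch (not from the slides) of the classic perceptron learning rule in numpy: it converges on the linearly separable AND function but never reaches full accuracy on XOR.

    import numpy as np

    def train_perceptron(X, y, epochs=20, lr=0.1):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = 1 if xi @ w + b > 0 else 0
                w += lr * (yi - pred) * xi       # perceptron update rule
                b += lr * (yi - pred)
        return w, b

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    for name, y in [("AND", np.array([0, 0, 0, 1])),    # linearly separable
                    ("XOR", np.array([0, 1, 1, 0]))]:    # not linearly separable
        w, b = train_perceptron(X, y)
        preds = (X @ w + b > 0).astype(int)
        print(name, "accuracy:", (preds == y).mean())    # AND reaches 1.0, XOR stays below 1.0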

From logistic regression to neural networks

A neural network: several logistic regressions at the same time

• Apply several regressions to obtain a vector of outputs
• The values of the outputs are initially unknown
• No need to specify ahead of time what values the logistic regressions are trying to predict

A neural network: several logistic regressions at the same time

• The intermediate variables are learned directly based on the training objective
• This makes them do a good job at predicting the target for the next layer
• Result: able to model non-linearities in the data!

A neural network: extension to multiple layers

A neural network: Matrix notation for a layer
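A minimal numpy sketch (illustrative, not from the slides) of the matrix notation: one layer of several logistic regressions is simply a = f(Wx + b), and a deeper network just chains such layers.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b):
        """One layer = several logistic regressions at once: a = f(Wx + b)."""
        return sigmoid(W @ x + b)

    rng = np.random.RandomState(0)
    x = rng.randn(4)                        # input vector (4 features)
    W1, b1 = rng.randn(3, 4), rng.randn(3)  # hidden layer: 3 logistic units
    W2, b2 = rng.randn(1, 3), rng.randn(1)  # output layer: 1 logistic unit
    h = layer(x, W1, b1)                    # hidden activations (learned, not specified by hand)
    y = layer(h, W2, b2)                    # final prediction
    print(h, y)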

Several activation functions to choose from

Learning parameters using gradient descent

• Given training data, find the parameters (weights W and biases b) that minimize the loss
• Compute the gradient with respect to the parameters and make a small step in the direction of the negative gradient
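As a concrete toy example (my own, assuming a least-squares loss on a small linear-regression problem), each gradient-descent step computes the gradient of the loss with respect to the weights and moves a small step against it.

    import numpy as np

    rng = np.random.RandomState(0)
    X, true_w = rng.randn(100, 3), np.array([2.0, -1.0, 0.5])
    y = X @ true_w                                 # toy regression data

    w, lr = np.zeros(3), 0.1
    for step in range(200):
        grad = 2 * X.T @ (X @ w - y) / len(X)      # gradient of the mean squared error w.r.t. w
        w -= lr * grad                             # small step towards the negative gradient
    print(w)                                       # approaches true_w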

Going large scale: Stochastic gradient descent (SGD)

• Approximate the gradient using a mini-batch of examples instead of the entire training set
• Online SGD when the mini-batch size is one
• Most commonly used when compared to full-batch GD
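The same toy regression problem with a mini-batch SGD loop (again a sketch, not from the lecture): the gradient is estimated from a small random batch instead of the whole training set.

    import numpy as np

    rng = np.random.RandomState(0)
    X, true_w = rng.randn(1000, 3), np.array([2.0, -1.0, 0.5])
    y = X @ true_w                                     # same toy regression setup as above

    w, lr, batch_size = np.zeros(3), 0.1, 16
    for step in range(500):
        idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # noisy but cheap gradient estimate
        w -= lr * grad                                 # with batch_size=1 this is online SGD
    print(w)                                           # approaches true_w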

Learning parameters using gradient descent

• Several out-of-the-box strategies for decaying the learning rate when optimizing an objective function
• Select the best according to validation-set performance

Training neural networks with arbitrary layers: Backpropagation

• We still minimize the objective function, but this time we “backpropagate” the errors to all the hidden layers
• Chain rule: if y = f(u) and u = g(x), i.e. y = f(g(x)), then dy/dx = (dy/du) · (du/dx)
• Useful basic derivatives: e.g. for the sigmoid, σ'(x) = σ(x)(1 - σ(x))
• Typically, backprop computation is implemented in popular libraries: Theano, Torch, Tensorflow
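A compact sketch of backpropagation by hand (my own toy example, not the lecture's): a two-layer network trained on XOR, where the chain rule propagates the output error back through each layer. In practice this is exactly what Theano, Torch, or TensorFlow automate.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)     # XOR: needs a hidden layer

    rng = np.random.RandomState(1)
    W1, b1 = rng.randn(2, 8), np.zeros(8)
    W2, b2 = rng.randn(8, 1), np.zeros(1)
    lr = 0.5

    for step in range(5000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        # backward pass (chain rule), for a cross-entropy loss on the sigmoid output
        dp = p - y                            # error at the output layer
        dh = (dp @ W2.T) * h * (1 - h)        # error backpropagated to the hidden layer
        W2 -= lr * h.T @ dp; b2 -= lr * dp.sum(0)
        W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)

    print(np.round(p.ravel(), 2))             # should be close to [0, 1, 1, 0]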


Advanced neural networks

• Essentially, we now have all the basic “ingredients” we need to build deep neural networks
• The more layers, the more non-linear the final projection
• Augmentation with new properties
➡ Advanced neural networks are able to deal with different arrangements of the input
  • Spatial: convolutional networks
  • Sequential: recurrent networks

Spatial modeling: Convolutional neural networks

• A fully connected network over input pixels is not efficient
• Inspired by the organization of the animal visual cortex
  • assumes that the inputs are images
  • connects each neuron to a local region
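A tiny sketch (my own) of the local-connectivity idea: each output value depends only on a small patch of the input, and the same filter weights are reused at every position.

    import numpy as np

    def conv2d_valid(image, kernel):
        """Naive 2-D convolution (cross-correlation, as is conventional in deep learning),
        with no padding: each output pixel sees only a local patch, with shared weights."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(6, 6)
    edge_filter = np.array([[1.0, -1.0]])          # a 1x2 horizontal-gradient filter
    print(conv2d_valid(image, edge_filter).shape)  # (6, 5)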

Sequence modeling: Recurrent neural networks

• Traditional networks can’t model sequence information
  • lack of information persistence
• Recursion: multiple copies of the same network, where each one passes on information to its successor

*Diagram from Christopher Olah’s blog.
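A minimal sketch (illustrative, not from the slides) of that recursion: the same weights are applied at every time step, and the hidden state carries information forward to its successor.

    import numpy as np

    def rnn_forward(xs, Wx, Wh, b):
        """Vanilla RNN: h_t = tanh(Wx x_t + Wh h_{t-1} + b); the same cell is reused each step."""
        h = np.zeros(Wh.shape[0])
        states = []
        for x in xs:                        # xs: sequence of input vectors
            h = np.tanh(Wx @ x + Wh @ h + b)
            states.append(h)                # h passes information on to the next step
        return np.array(states)

    rng = np.random.RandomState(0)
    xs = rng.randn(7, 4)                    # a toy sequence of 7 inputs with 4 features
    Wx, Wh, b = rng.randn(5, 4), rng.randn(5, 5), np.zeros(5)
    print(rnn_forward(xs, Wx, Wh, b).shape)     # (7, 5): one hidden state per time step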

Sequence modeling: Gated recurrent networks

• Long short-term memory (LSTM) nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
• The gated RNN (GRU) by Cho et al. 2014 combines the forget and input gates into a single “update gate”

*Diagram from Christopher Olah’s blog.

Sequence modeling: Neural Turing Machines or Memory Networks

• Combination of a recurrent network with an external memory bank: Graves et al. 2014, Weston et al. 2014

*Diagram from Christopher Olah’s blog.

Sequence modeling: Recurrent neural networks are flexible

• Vanilla NNs
• Image captioning
• Sentiment classification
• Topic detection
• Machine translation
• Summarization
• Speech recognition
• Video classification

*Diagram from Karpathy’s Stanford CS231n course.

Outline of the talk

1. Neural Networks
  • Basics: perceptron, logistic regression
  • Learning the parameters
  • Advanced models: spatial and temporal/sequential
2. Word Representation Learning
  • Semantic similarity
  • Traditional and recent approaches
  • Intrinsic and extrinsic evaluation
3. Summary and Beyond

*Image from Lebret's thesis (2016).

Semantic similarity: How similar are two linguistic items?

• Word level
  screwdriver vs wrench: very similar
  screwdriver vs hammer: little similar
  screwdriver vs technician: related
  screwdriver vs fruit: unrelated

• Sentence level ("The boss fired the worker" vs …)
  "The supervisor let the employee go": very similar
  "The boss reprimanded the worker": little similar
  "The boss promoted the worker": related
  "The boss went for jogging today": unrelated

Semantic similarity: How similar are two linguistic items?

• Defined at many levels
  • words, word senses or concepts, phrases, paragraphs, documents
• Similarity is a specific type of relatedness
  • related: topically or via a relation, e.g. heart vs surgeon, wheel vs bike
  • similar: synonyms and hyponyms, e.g. doctor vs surgeon, bike vs bicycle

Semantic similarity: Numerous attempts to answer that

*Image from D. Jurgens’ NAACL 2016 tutorial.


Semantic similarity: Why do we have so many methods?

• New resources or methods
  • new datasets reveal weaknesses in previous methods
  • the state of the art is a moving target
• Task-specific similarity functions
  • performance on new tasks is not satisfactory
➡ Semantic similarity is not the end task
  • pick the one which yields the best results
  • need for methods that quickly adapt the similarity

Two main sources for measuring similarity

• Massive text corpora
• Semantic resources and knowledge bases

How to represent semantics? Vector space models

• Explicit: each dimension denotes a specific linguistic item
  • interpretable dimensions
  • high dimensionality
• Continuous: dimensions are not tied to explicit concepts
  • enable comparison between represented linguistic items
  • low dimensionality

How to compare two linguistic items in the vector space

• Cosine of the angle θ between vectors A and B: cos(A, B) = (A · B) / (||A|| ||B||)
• Explicit models have a serious sparsity problem due to their discrete or “k-hot” vector representations

  france = [0, 0, 0, 1, 0, 0]
  england = [0, 1, 0, 0, 0, 0]
  france is near spain = [1, 0, 0, 1, 1, 1]

  • cos(france, england) = 0.0
  • cos(france, france is near spain) = 0.57
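A small numpy sketch of the cosine comparison on such one-hot style vectors (the exact value for the sentence depends on how its vector is weighted):

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    france   = np.array([0, 0, 0, 1, 0, 0], dtype=float)
    england  = np.array([0, 1, 0, 0, 0, 0], dtype=float)
    sentence = np.array([1, 0, 0, 1, 1, 1], dtype=float)   # "france is near spain"

    print(cos(france, england))     # 0.0: no overlap, even though both are countries
    print(cos(france, sentence))    # nonzero only because the word "france" literally occurs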

Learning word vector representations from text

• Limitations of knowledge-based methods
  • out of context, despite the validity of the resources
  • most lack evaluation on practical tasks
• What if we do not know anything about words? Follow the distributional hypothesis:

  “You shall know a word by the company it keeps”, Firth 1957

  The value of the central bank increased by 10%.            (financial institution)
  She often goes to the bank to withdraw cash.               (financial institution)
  She went to the river bank to have a picnic with her child.  (geographical term)

Simple approach: Compute a word-in-context co-occurrence matrix

• Matrix of counts between words and contexts (or documents)
• Limitations of this method:
  • all words have equal importance (imbalance)
  • vectors are very high dimensional (storage issue)
  • infrequent words have overly sparse vectors (makes subsequent models less robust)
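A small sketch of building such a counts matrix from a toy corpus with window-based contexts (the corpus and names are illustrative, not from the lecture):

    from collections import defaultdict

    corpus = ["the central bank increased rates",
              "she goes to the bank to withdraw cash",
              "she went to the river bank for a picnic"]
    window = 2

    counts = defaultdict(int)               # (word, context word) -> co-occurrence count
    for sentence in corpus:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1

    vocab = sorted({w for w, _ in counts})
    print(len(vocab), "words;", counts[("bank", "the")], "co-occurrences of (bank, the)")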

The most standard approach: Dimensionality reduction

• Perform singular value decomposition (SVD) of the word co-occurrence matrix that we saw previously
  • typically, U * Σ is used as the vector space

*Image from D. Jurgens’ NAACL 2016 tutorial.
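A sketch of the SVD step (using a random stand-in count matrix in place of real counts; assumes the co-occurrence counts have already been arranged into a dense words × contexts matrix M): keep the top k singular vectors and use U * Σ as the embedding space.

    import numpy as np

    rng = np.random.RandomState(0)
    M = rng.poisson(1.0, size=(50, 200)).astype(float)   # stand-in word x context count matrix

    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    k = 10
    word_vectors = U[:, :k] * S[:k]       # rows are dense k-dimensional word representations
    print(word_vectors.shape)             # (50, 10)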

The most standard approach: Dimensionality reduction

• Syntactically and semantically related words cluster together

*Plots from Rohde et al. 2005

Dimensionality reduction with Hellinger PCA

• Perform PCA with the Hellinger distance on the word co-occurrence matrix: Lebret and Collobert 2014
• Well suited for discrete probability distributions (P, Q): H(P, Q) = (1/sqrt(2)) ||sqrt(P) - sqrt(Q)||_2
• Neural approaches are time-consuming (tuning, data)
  • instead, compute word vectors efficiently with PCA
  • fine-tuning them on specific tasks works better than neural embeddings
• Limitations: hard to add new words, not scalable, O(mn²)

https://github.com/rlebret/hpca

Dimensionality reduction with weighted least squares

• GloVe vectors by Pennington et al. 2014. Factorizes the log of the co-occurrence matrix:
  J = Σ_{i,j} f(X_ij) (w_i · w̃_j + b_i + b̃_j - log X_ij)²
• Fast training, scalable to huge corpora, but still hard to incorporate new words
• Reported much better results than neural embeddings; however, under equivalent tuning this is not the case: Levy and Goldberg 2015

http://nlp.stanford.edu/projects/glove/

Dimensionality reduction with neural networks

• The main idea is to directly learn low-dimensional word representations from data
  • Learning representations: Rumelhart et al. 1986
  • Neural probabilistic language model: Bengio et al. 2003
  • NLP (almost) from scratch: Collobert and Weston 2008
• Recent methods are faster and simpler
  • Continuous Bag-Of-Words (CBOW)
  • Skip-gram with Negative Sampling (SGNS)
  • word2vec toolkit: Mikolov et al. 2013

word2vec: Skip-gram with negative sampling (SGNS)

• Given the middle word, predict the surrounding ones in a fixed window of words (maximize the log likelihood)

word2vec: Skip-gram with negative sampling (SGNS)

• How is the P(w_t | h) probability implemented? As a softmax over the full vocabulary:
  P(w_t | h) = exp(score(w_t, h)) / Σ_{w' in V} exp(score(w', h))
• The denominator is very inefficient for a big vocabulary!
• Instead it uses a more scalable objective, where log Q_θ is a binary logistic regression of word w and history h:
  J = log Q_θ(D = 1 | w_t, h) + k · E_{w̃ ~ P_noise}[ log Q_θ(D = 0 | w̃, h) ]
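A compact sketch of this idea on a toy corpus (my own simplification in numpy, not the actual word2vec implementation; real word2vec draws negatives from a smoothed unigram distribution rather than uniformly): each observed (centre, context) pair is a positive example for a binary logistic regression, k randomly drawn words act as negatives, and the softmax denominator is never computed.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    corpus = "the central bank increased rates the river bank flooded".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V, dim, window, k, lr = len(vocab), 10, 2, 3, 0.05

    rng = np.random.RandomState(0)
    W_in = 0.01 * rng.randn(V, dim)        # centre-word ("input") vectors
    W_out = 0.01 * rng.randn(V, dim)       # context-word ("output") vectors

    for epoch in range(200):
        for i, centre in enumerate(corpus):
            c = idx[centre]
            lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
            for j in list(range(lo, i)) + list(range(i + 1, hi)):
                o = idx[corpus[j]]                      # observed context word (positive)
                negatives = rng.randint(0, V, size=k)   # noise words (uniform, for simplicity)
                for target, label in [(o, 1.0)] + [(n, 0.0) for n in negatives]:
                    p = sigmoid(W_in[c] @ W_out[target])
                    g = p - label                       # gradient of the logistic loss w.r.t. the score
                    grad_in = g * W_out[target]
                    grad_out = g * W_in[c]
                    W_in[c] -= lr * grad_in
                    W_out[target] -= lr * grad_out

    print(W_in[idx["bank"]])               # the learned 10-dimensional vector for "bank"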

word2vec: Continuous Bag-Of-Words with negative sampling (CBOW)

• More efficient, but the ordering information of the words does not influence the projection
• Factorizes a PMI word-context matrix: Levy and Goldberg 2014
  • builds upon existing methods (new decomposition)
• Improvements on a variety of intrinsic tasks such as relatedness, categorization, and analogy: Baroni et al. 2014, Schnabel et al. 2015

word2vec: Learns meaningful linear relationships of words

• Word vector dimensions capture several meaningful relations between words: present vs past tense, singular vs plural, male vs female, capital vs country
• Analogy between words can be efficiently computed using basic arithmetic operations between vectors (+, -)

  king - man + woman ≈ queen
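A hedged sketch of the analogy computation, assuming the gensim library and a pre-trained vectors file (the file name is a placeholder): the analogy is answered by adding and subtracting vectors and returning the nearest neighbours by cosine similarity.

    from gensim.models import KeyedVectors

    # Assumes pre-trained word2vec vectors, e.g. from https://code.google.com/p/word2vec/
    vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # king - man + woman ≈ queen: nearest neighbours of the combined vector by cosine similarity
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))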

Learning word representations from text: Recap

• Most methods are *similar* to SVD over a PMI matrix; however, word2vec has the edge over alternatives
  • scales well to massive text corpora and new words
  • yields top results in most tasks
• On extrinsic tasks it is essential to fine-tune (for beating BOW)
➡ Several extensions
  • dependency-based embeddings: Levy and Goldberg 2014
  • retrofitted-to-lexicons embeddings: Faruqui et al. 2014
  • sense-aware embeddings: Li and Jurafsky 2015
  • visually-grounded embeddings: Lazaridou et al. 2015
  • multilingual embeddings: Gouws et al. 2015

Open problems in semantic similarity research

• Irregular language
  "can i watch 4od bbc iplayer etc with 10GB useage allowence?"
• Multi-word expressions
  "We need to sort out the problem" | "We need to sort the problem out"
• Syntax and punctuation
  "Man bites dog" | "Dog bites man"
  "A woman: without her, man is nothing."

Open problems in semantic similarity research

• Variable-size input
  "Prius" | "A fuel-efficient hybrid car" | "An automobile powered by both an internal combustion (…)"
• Ambiguity when lacking context
  "The boss fired his worker."
• Subjectivity versus objectivity
  "This was a good day." | "This was a bad day."
• Out-of-vocabulary words: slang, hash-tags, neologisms

Beyond words

• Word vectors are also useful for building semantic vectors of phrases, sentences, and documents
  • input or output space for several practical tasks
  • basis for multilingual or multimodal transfer (via alignment)
  • interpretability: do we care about what each word vector dimension means? It depends. We may need to compromise.
• Next course:
  • learning representations of word sequences
  • more details on sequence models

References

• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. “Distributed representations of words and phrases and their compositionality.” In NIPS, 2013.
• Jeffrey Pennington, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation.” In EMNLP, 2014.
• Rémi Lebret and Ronan Collobert. “Word Embeddings through Hellinger PCA.” In EACL, 2014.
• Quoc V. Le and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” In ICML, 2014.
• Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. “Retrofitting word vectors to semantic lexicons.” In ACL, 2014.
• Omer Levy and Yoav Goldberg. “Dependency-Based Word Embeddings.” In ACL, 2014.
• Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. “Evaluation methods for unsupervised word embeddings.” In EMNLP, 2015.
• Omer Levy, Yoav Goldberg, and Ido Dagan. “Improving distributional similarity with lessons learned from word embeddings.” TACL, 2015.
• Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. “Problems With Evaluation of Word Embeddings Using Word Similarity Tasks.” In RepEval, 2016.
• Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. “Sparse overcomplete word vector representations.” In ACL, 2015.
• Yoav Goldberg. “A primer on neural network models for natural language processing.” arXiv preprint 1510.00726, 2015.
• Ian Goodfellow, Aaron Courville, and Yoshua Bengio. “Deep Learning.” Book in preparation for MIT Press, 2015.

Resources (1/2)

➡ Online courses
• Coursera course on “Neural networks for machine learning” by Geoffrey Hinton
  https://www.coursera.org/learn/neural-networks
• Coursera course on “Machine learning” by Andrew Ng
  https://www.coursera.org/learn/machine-learning
• Stanford CS224d “Deep learning for NLP” by Richard Socher
  http://cs224d.stanford.edu/

➡ Conference tutorials
• Richard Socher and Christopher Manning, “Deep learning for NLP”, NAACL 2013 tutorial. http://nlp.stanford.edu/courses/NAACL2013/
• David Jurgens and Mohammad Taher Pilehvar, “Semantic Similarity Frontiers: From Concepts to Documents”, EMNLP 2015 tutorial. http://www.emnlp2015.org/tutorials.html#t1
• Mitesh M. Khapra and Sarath Chandar, “Multilingual and Multimodal Language Processing”, NAACL 2016 tutorial. http://naacl.org/naacl-hlt-2016/t2.html

Resources (2/2)

➡ Deep learning toolkits
• Theano http://deeplearning.net/software/theano
• Torch http://www.torch.ch/
• Tensorflow http://www.tensorflow.org/
• Keras http://keras.io/

➡ Pre-trained word vectors and code
• word2vec toolkit and vectors https://code.google.com/p/word2vec/
• GloVe code and vectors http://nlp.stanford.edu/projects/glove/
• Hellinger PCA https://github.com/rlebret/hpca
• Online word vector evaluation http://wordvectors.org/