CS388: Natural Language Processing
Greg Durrett
Lecture 10: Interpreting NNs, Neural CRFs
credit: Daniel Geng and Rishi Veerapaneni, ML@Berkeley
Administrivia
‣ Mini 2 due in one week
Recall: RNN LMs
I saw the dog
P(w|context) = softmax(W h_i)
‣ W is a (vocab size) x (hidden size) matrix mapping the hidden state h_i to word probs
‣ Backpropagate through the network to simultaneously learn to predict the next word given previous words at all positions
‣ Batch by grabbing many contiguous sequences of text from different parts of a large corpus
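As a concrete toy illustration of P(w|context) = softmax(W h_i) — the dimensions and the random hidden state below are invented for this sketch, not taken from the lecture's model:

```python
# Minimal sketch of P(w | context) = softmax(W h_i) for an RNN LM.
# Shapes and values are illustrative only.
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, vocab_size = 4, 10
h_i = rng.standard_normal(hidden_size)              # RNN hidden state at position i
W = rng.standard_normal((vocab_size, hidden_size))  # (vocab size) x (hidden size)

word_probs = softmax(W @ h_i)    # distribution over the next word
```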
Recall: ELMo
‣ CNN over each word => RNN
John visited Madagascar yesterday
CharCNN  CharCNN  CharCNN  CharCNN
2048 CNN filters projected down to 512-dim
4096-dim LSTMs w/ 512-dim projections
next word
Representation of visited (plus vectors from backwards LM)
Peters et al. (2018)
Recall: ELMo
Peters, Ruder, Smith (2019)
Some neural network
they dance at balls
Task predictions (sentiment, etc.)
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ For ELMo, best to use frozen embeddings: update the weights of your network but keep ELMo's parameters frozen
This Lecture
‣ Explaining neural networks' predictions
‣ Neural CRFs
Explaining NNs
What is an Explanation?
‣ Given a data instance, identify properties of the input/model that led to a particular decision being made
the movie was great
features = (I[great], I[the])
‣ Suppose weights = (+5, +3); what's the explanation?
‣ Suppose weights = (+0.1, +5); what's the explanation?
‣ Suppose weights = (+5, +0), decision = +; what's the explanation?
‣ Explanation != "what a human would do". So any analysis of explanations has to intrinsically be about our model
Idea 1: Looking at Weights
that movie was not great, in fact it was terrible!
‣ Feats = unigrams and bigrams; w(not great) = −5, w(great) = +5, w(terrible) = −3
‣ Classified as negative; what's the explanation?
‣ Is the maximum weight always right? not great and great cancel, so they don't really contribute to the classification decision. Correlated features make explanations confusing
‣ How can we define this? Deleting great would probably have little effect on the classification score
Idea 2: Counterfactuals
that movie was not ____, in fact it was terrible!
that movie was not great, in fact it was _____!
that movie was not great, in fact it was terrible!
[Figure: the model's predictions on the three variants: —, —, +]
‣ LIME: Locally-Interpretable Model-Agnostic Explanations
‣ Perturb the input many times and assess the impact on the model's prediction
‣ Local because we'll do work to learn how to interpret this one example
‣ Model-agnostic: treat the model as a black box
Ribeiro et al. (2016)
LIME
‣ Break the input into components (for text classification: unigrams)
‣ Check predictions on subsets of those
‣ Train a model to predict those predictions, and look at that model's weights
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
LIME
‣ Break down the input into many small pieces so the explanation is interpretable: x ∈ R^d → x' ∈ {0, 1}^{d'}
Ribeiro et al. (2016)
‣ Draw samples z' by perturbing x', then reconstruct z from z' and compute f(z) on that
‣ Now learn a model to predict f(z) based on z'. This model's weights will serve as the explanation for the decision (why it's +)
‣ If z' is very coarse, can interpret but can't learn a good model of the boundary. If z' is too fine-grained, can interpret but not predict (e.g., z' = z)
LIME
‣ Use a sparse linear model to achieve a sparse explanation
Ribeiro et al. (2016)
LIME
‣ Trainasparsemodel(onlylooksat10featuresofeachexample),thentrytouseLIMEtorecoverthefeatures.Greedy:removefeaturestomakepredictedclassprobdropbyasmuchaspossible
themoviewasgreat
P(+|x)
Cantreatthislayerlikealinearmodel,buthowtoconnectittoinput?Orenhundredsoffeatures
‣ Supposeforgetgateisverylowandthefirstthreewordsareforgo8en
‣ Howcanwegenerallyassessimpactofawordonthepredic=on?
Idea 3: Weights Revisited
‣ We don't have "weights", but what can tell us about the impact of the input on the output?
‣ LIME is very complex, but looking at weights is too simple
Gradient-Based Methods
Simonyan et al. (2013)
‣ S_c = score of class c; I_0 = current image
‣ Approximate the score with a first-order Taylor series approximation around the current image
‣ Higher gradient magnitude = a small change in pixels leads to a large change in the prediction
‣ To get a single magnitude for a pixel, max over color channels. Can do the same for a word (max over vector positions)
‣ Sanity check: does this make sense for linear models?
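For the sanity check: in a linear bag-of-words model the gradient of the score with respect to the feature vector is exactly the weight vector, so gradient saliency reduces to looking at weights. A tiny sketch with invented weights:

```python
# For a linear model score(x) = w . x over unigram indicator features,
# d(score)/dx_j = w_j, so gradient saliency just recovers |w_j|.
# Weights are toy values, not from the lecture.
import numpy as np

vocab = ["the", "movie", "was", "great"]
w = np.array([0.1, -0.2, 0.0, 5.0])
x = np.ones(4)                        # "the movie was great": all features on

grad = w                              # gradient of the score w.r.t. x
saliency = np.abs(grad)               # one magnitude per feature/word
top_word = vocab[int(saliency.argmax())]
```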
Gradient-Based Methods
Simonyan et al. (2013)
good   the
‣ Axes = word vector values. Lighter color = higher positive class probability
‣ Changing the word locally has little effect: this word doesn't matter much
‣ Changing the word makes a difference: seems like the word is having some impact
Gradients vs. LIME
Nguyen (2018)
‣ Explanation methods should predict features which, when deleted, cause the prediction to flip
‣ 1) Rank all features with the method. 2) Delete features and see how long it takes to flip the decision
‣ Omission: like the greedy algorithm from the LIME comparison
‣ Saliency (gradient method) is better at finding the flip points than LIME (but only slightly)
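A minimal sketch of this flip-point protocol, using an invented linear model as the classifier being explained:

```python
# Evaluation sketch: rank features with an explanation method, delete them
# in ranked order, and count how many deletions flip the decision.
# The linear model and its weights are toy values.
import numpy as np

w = np.array([5.0, 3.0, -1.0, 0.5])           # toy model weights
x = np.ones(4)                                # all features present

def predict(feats):
    return 1 if w @ feats > 0 else 0

ranking = np.argsort(-np.abs(w * x))          # 1) rank features by saliency
orig = predict(x)
flips_after = None
for k, j in enumerate(ranking, start=1):      # 2) delete in ranked order
    x[j] = 0.0
    if predict(x) != orig:
        flips_after = k                       # flipped after k deletions
        break
```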
Explaining Sequence Models
‣ These methods might work well for bag-of-words models, but what about other tasks?
I went to the store => Je suis allé au magasin
I ____ to the store => ???
‣ The translation system might totally break down; we need to stay on the data manifold
‣ Sample similar data points from a variational autoencoder (VAE): a more complex approach that requires another model
Alvarez-Melis and Jaakkola (2019)
Idea 4: Probing
‣ Train a model for task X and learn to predict task Y
‣ E.g.: take ELMo representations, freeze them, then try to predict POS tags with just a softmax layer
‣ Doesn't "explain" a prediction, but can illuminate what models are and aren't able to capture
Takeaways
‣ Looking at weights is generally hard for neural networks
‣ LIME is a good method for generating interpretable explanations, but not always easy to get right
‣ Gradient-based techniques can provide explanations, but these aren't perfect: they're very "local" and don't consider what happens if a word changes to a different word
‣ Probing tasks can tell you generally what your network might be doing, but are hard to interpret
Neural CRF Basics
NER Revisited
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
‣ Features in CRFs: I[tag=B-LOC & curr_word=Hangzhou], I[tag=B-LOC & prev_word=to], I[tag=B-LOC & curr_prefix=Han]
‣ Downsides:
‣ Lexical features mean that words need to be seen in the training data
‣ Linear model can't capture feature conjunctions as effectively (doesn't work well to look at more than 2 words with a single feature)
‣ Linear model over features
LSTMs for NER
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou
B-PER I-PER O O O B-LOC
‣ Transducer (LM-like model)
‣ What are the strengths and weaknesses of this model compared to CRFs?
LSTMs for NER
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou
B-PER I-PER O O O B-LOC
‣ Bidirectional transducer model
‣ What are the strengths and weaknesses of this model compared to CRFs?
Neural CRFs
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou
‣ Neural CRFs: bidirectional LSTMs (or some NN) compute emission potentials; capture structural constraints in the transition potentials
Neural CRFs
y1 y2 … yn   (emission potentials φ_e, transition potentials φ_t)

P(y|x) = (1/Z) ∏_{i=2..n} exp(φ_t(y_{i−1}, y_i)) · ∏_{i=1..n} exp(φ_e(y_i, i, x))

‣ Neural network computes unnormalized potentials that are consumed and "normalized" by a structured model
‣ Conventional: φ_e(y_i, i, x) = w⊤ f_e(y_i, i, x)
‣ Neural: φ_e(y_i, i, x) = W_{y_i}⊤ f(i, x), where W is a (num tags) x len(f) matrix
‣ f(i, x) could be the output of a feedforward neural network looking at the words around position i, or the i-th output of an LSTM, …
‣ Inference: compute f, then use Viterbi
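Viterbi inference over the computed potentials can be sketched as follows (toy tag set and potential values, not from the lecture):

```python
# Viterbi sketch: given emission potentials phi_e[i, t] (from the neural net)
# and transition potentials phi_t[t_prev, t], find the best tag sequence.
import numpy as np

def viterbi(phi_e, phi_t):
    n, T = phi_e.shape
    back = np.zeros((n, T), dtype=int)
    score = phi_e[0].copy()
    for i in range(1, n):
        # cand[s, t] = best score ending in prev tag s, then tag t at i
        cand = score[:, None] + phi_t + phi_e[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):     # follow backpointers
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

# Toy potentials: 2 positions, 2 tags; transitions discourage switching tags
phi_e = np.array([[2.0, 0.0],
                  [0.0, 1.0]])
phi_t = np.array([[0.0, -5.0],
                  [-5.0, 0.0]])
best = viterbi(phi_e, phi_t)   # sticks with tag 0 despite emission favoring 1
```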
Computing Gradients
y1 y2 … yn   (emission potentials φ_e, transition potentials φ_t)

P(y|x) = (1/Z) ∏_{i=2..n} exp(φ_t(y_{i−1}, y_i)) · ∏_{i=1..n} exp(φ_e(y_i, i, x))

‣ For linear model:
∂L/∂φ_{e,i} = −P(y_i = s|x) + I[s is gold]   ← "error signal", computed with forward-backward
‣ Conventional: φ_e(y_i, i, x) = w⊤ f_e(y_i, i, x), so ∂φ_{e,i}/∂w = f_e(y_i, i, x); the chain rule says to multiply these together, which gives our update
‣ For neural model: φ_e(y_i, i, x) = W_{y_i}⊤ f(i, x); compute the gradient of φ w.r.t. the parameters of the neural net
Neural CRFs
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou
1) Compute f(x)
2) Run forward-backward
3) Compute error signal
4) Backprop (no knowledge of CRF structure required)
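Steps 2–3 above can be sketched as a small forward-backward pass (toy potentials invented here; a real implementation would work in log space for numerical stability):

```python
# Forward-backward sketch: compute tag marginals P(y_i = s | x) from the
# potentials; the emission-potential gradient is then the "error signal"
# I[s is gold] - P(y_i = s | x), which feeds into backprop.
import numpy as np

def marginals(phi_e, phi_t):
    n, T = phi_e.shape
    E = np.exp(phi_e)                       # emission factors
    M = np.exp(phi_t)                       # transition factors
    alpha = np.zeros((n, T)); beta = np.zeros((n, T))
    alpha[0] = E[0]
    for i in range(1, n):                   # forward pass
        alpha[i] = E[i] * (alpha[i - 1] @ M)
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):          # backward pass
        beta[i] = M @ (E[i + 1] * beta[i + 1])
    Z = alpha[-1].sum()                     # partition function
    return alpha * beta / Z                 # P(y_i = s | x) for all i, s

phi_e = np.array([[2.0, 0.0], [0.0, 1.0]])  # toy: 2 positions, 2 tags
phi_t = np.zeros((2, 2))                    # uniform transitions
mu = marginals(phi_e, phi_t)
gold = np.array([0, 1])
error_signal = np.eye(2)[gold] - mu         # gradient w.r.t. emission potentials
```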
FFNN Neural CRF for NER
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
to Hangzhou today
previous word   curr word   next word
e(to)   e(Hangzhou)   e(today)
φ_e = W g(V f(x, i))
f(x, i) = [emb(x_{i−1}), emb(x_i), emb(x_{i+1})]   (FFNN)
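A sketch of this emission scorer with invented dimensions and random parameters:

```python
# FFNN emission sketch: f(x, i) concatenates embeddings of the previous,
# current, and next word; phi_e = W g(V f(x, i)) gives one score per tag.
# Dimensions, embeddings, and parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden, num_tags = 5, 8, 3
emb = {w: rng.standard_normal(emb_dim) for w in ["to", "Hangzhou", "today"]}
V = rng.standard_normal((hidden, 3 * emb_dim))
W = rng.standard_normal((num_tags, hidden))

def f(words, i):
    # context window [emb(x_{i-1}), emb(x_i), emb(x_{i+1})]
    return np.concatenate([emb[words[i - 1]], emb[words[i]], emb[words[i + 1]]])

def phi_e(words, i):
    return W @ np.tanh(V @ f(words, i))   # g = tanh; one score per tag

scores = phi_e(["to", "Hangzhou", "today"], 1)   # emission scores at "Hangzhou"
```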
LSTM Neural CRFs
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou
‣ Bidirectional LSTMs compute emission (or transition) potentials
LSTMs for NER
Barack Obama will travel to Hangzhou today for the G20 meeting.
PERSON          LOC            ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou
B-PER I-PER O O O B-LOC
‣ How does this compare to the neural CRF?
"NLP (Almost) From Scratch"
Collobert, Weston, et al. (2008, 2011)
‣ LM2: word vectors learned from a precursor to word2vec/GloVe, trained for 2 weeks (!) on Wikipedia
‣ WLL: independent classification; SLL: neural CRF
Neural CRFs with LSTMs
‣ Neural CRF using character LSTMs to compute word representations
Chiu and Nichols (2015), Lample et al. (2016)
Neural CRFs with LSTMs
Chiu and Nichols (2015), Lample et al. (2016)
‣ Chiu+Nichols: character CNNs instead of LSTMs
‣ Lin/Passos/Luo: use external resources like Wikipedia
‣ LSTM-CRF captures the important aspects of NER: word context (LSTM), sub-word features (character LSTMs), outside knowledge (word embeddings)
Takeaways
‣ Explanation methods: looking at weights, LIME, gradient-based
‣ All kinds of NNs can be integrated into CRFs for structured inference. Can be applied to NER, other tagging, parsing, …
‣ This concludes the ML/DL-heavy portion of the course. Starting Tuesday: syntax, then semantics