TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 11: Recurrent and Convolutional Neural Networks in NLP
Announcements
• Assignment 3 assigned yesterday, due Feb. 29
• project proposal due Tuesday, Feb. 16
• midterm on Thursday, Feb. 18
Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• neural network methods in NLP
• syntax and syntactic parsing
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications
2-transformation (1-layer) network
• we’ll call this a “2-transformation” neural network, or a “1-layer” neural network
• input vector is x
• score vector is the vector of label scores
• one hidden vector (“hidden layer”) in between
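A rough sketch of the two transformations (the tanh nonlinearity and variable names here are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def one_layer_network(x, W1, b1, W2, b2):
    # transformation 1: input vector -> hidden vector ("hidden layer")
    h = np.tanh(W1 @ x + b1)
    # transformation 2: hidden vector -> vector of label scores
    return W2 @ h + b2
```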
1-layer neural network for sentiment classification
[figure: network mapping an input sentence to a vector of label scores]
Use the softmax function to convert scores into probabilities
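For reference, softmax turns a score vector s with K entries into a probability distribution:

```latex
\mathrm{softmax}(s)_i = \frac{\exp(s_i)}{\sum_{j=1}^{K} \exp(s_j)}
```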
ikr smh he asked fir yo last name so he can add u on fb lololol
Neural Networks for Twitter Part-of-Speech Tagging
[figure: the tweet above with a POS tag above each word: intj, pronoun, verb, prep, det, adj, noun, proper noun, other]
adj = adjective, prep = preposition, intj = interjection
• in Assignment 3, you’ll build a neural network classifier to predict a word’s POS tag based on its context
Neural Networks for Twitter Part-of-Speech Tagging
• e.g., predict tag of yo given context
• what should the input x be?
  – it has to be independent of the label
  – it has to be a fixed-length vector
Neural Networks for Twitter Part-of-Speech Tagging
• e.g., predict tag of yo given context
• what should the input x be?
[figure: x includes the word vector for yo]
Neural Networks for Twitter Part-of-Speech Tagging
• e.g., predict tag of yo given context
• what should the input x be?
[figure: x concatenates the word vector for yo and the word vector for fir]
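A minimal sketch of building such an input (the toy 3-dimensional vectors below are made up; real ones would come from word2vec or similar):

```python
import numpy as np

# toy embeddings; real ones would be 25+ dimensions, learned from data
vec = {"yo":  np.array([0.1, -0.3,  0.2]),
       "fir": np.array([0.4,  0.0, -0.1])}

# fixed-length input, independent of the label:
# concatenate the center word's vector with a context word's vector
x = np.concatenate([vec["yo"], vec["fir"]])   # shape (6,)
```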
Neural Networks for Twitter Part-of-Speech Tagging
[figure: x concatenates the word vector for yo and the word vector for fir]
• when using word vectors as part of the input, we can also treat them as more parameters to be learned!
• this is called “updating” or “fine-tuning” the vectors (since they are initialized using something like word2vec)
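In a modern framework, fine-tuning just means making the embedding table a trainable parameter; e.g., a PyTorch sketch (not from the slides; the vocabulary size and dimensions are placeholders):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 25)   # stand-in for word2vec vectors
# freeze=False lets gradients update ("fine-tune") the vectors during training
emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
```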
Neural Networks for Twitter Part-of-Speech Tagging
• let’s use the center word + two words to the right:
[figure: x concatenates the vector for yo, the vector for last, and the vector for name]
• if name is to the right of yo, then yo is probably a form of your
• but our x above uses separate dimensions for each position!
  – i.e., name is two words to the right
  – what if name is one word to the right?
Features and Filters
• we could use a feature that returns 1 if name is to the right of the center word, but that does not use the word’s embedding
• how do we include a feature like “a word similar to name appears somewhere to the right of the center word”?
• rather than always specifying relative position and embedding, we want to add filters that look for words like name anywhere in the window (or sentence!)
Filters
• for now, think of a filter as a vector in the word vector space
• the filter matches a particular region of the space
• “match” = “has high dot product with”
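“Match = high dot product” in a tiny (made-up) example:

```python
import numpy as np

filt = np.array([0.5, -0.2, 0.9])    # a filter living in word-vector space
near = np.array([0.4, -0.1, 0.8])    # a word vector in the filter's region
far  = np.array([-0.5, 0.2, -0.9])   # a word vector far from that region

print(np.dot(filt, near))   # high dot product -> strong match
print(np.dot(filt, far))    # low (negative) dot product -> no match
```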
Convolution
• convolutional neural networks use a bunch of such filters
• each filter is matched against (dot product computed with) each word in the entire context window or sentence
• e.g., a single filter is a vector of the same length as word vectors
Convolution
[figure: the filter’s dot product is computed with the vector for each word in the window (last, yo, name) in turn]
• the result is a “feature map”: it has an entry for each word position in the context window/sentence
Pooling
[figure: the feature map, with an entry for each word position in the context window/sentence]
• how do we convert this into a fixed-length vector? use pooling:
  – max-pooling: returns the maximum value in the feature map
  – average pooling: returns the average of the values in the feature map
• then, this single filter produces a single feature value (the output of some kind of pooling); in practice, we use many filters of many different lengths (e.g., over n-grams rather than single words)
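Putting convolution and pooling together, a minimal numpy sketch (shapes and random values are illustrative):

```python
import numpy as np

def conv_and_pool(words, filters, pool="max"):
    """words: (n_words, d) word vectors; filters: (n_filters, d).
    Returns one pooled feature value per filter."""
    feature_maps = filters @ words.T        # (n_filters, n_words) of dot products
    if pool == "max":
        return feature_maps.max(axis=1)     # max-pooling over word positions
    return feature_maps.mean(axis=1)        # average pooling

words = np.random.randn(5, 25)      # a 5-word window with 25-dim vectors
filters = np.random.randn(8, 25)    # 8 filters, same length as word vectors
features = conv_and_pool(words, filters)    # fixed-length vector, shape (8,)
```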
Convolutional Neural Networks
• convolutional neural networks (convnets or CNNs) use filters that are “convolved with” (matched against all positions of) the input
• informally, think of convolution as “perform the same operation everywhere on the input in some systematic order”
• “convolutional layer” = set of filters that are convolved with the input vector (whether x or a hidden vector)
• could be followed by more convolutional layers, or by a type of pooling
• often used in NLP to convert a sentence into a feature vector
Recurrent Neural Networks
• input is a sequence: not too bad
[figure: at each position, the word’s vector is combined with the previous “hidden vector” to produce the next hidden vector]
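The diagrams correspond to a recurrence like the following (a standard Elman-style RNN; the slides’ exact parameterization is in the figures, so treat this as a sketch):

```python
import numpy as np

def rnn(xs, W, U, b):
    """xs: sequence of word vectors; returns the final hidden vector."""
    h = np.zeros(U.shape[0])
    for x in xs:
        # combine the current word vector with the previous hidden vector
        h = np.tanh(W @ x + U @ h + b)
    return h
```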
Disclaimer
• these diagrams are often useful for helping us understand and communicate neural network architectures
• but they rarely have any sort of formal semantics (unlike graphical models)
• they are more like cartoons
Long Short-Term Memory RNNs (gateless)
[figure: like the RNN above, but each step also updates a recurrent “memory cell” vector that is carried across time steps]
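The equations live in the figures, but a standard way to write a gateless memory cell (i.e., an LSTM with every gate fixed to 1) is:

```python
import numpy as np

def gateless_lstm_step(x, h, c, W, U, b):
    """One time step without gates: additive memory-cell update."""
    c = c + np.tanh(W @ x + U @ h + b)   # cell accumulates new information
    h = np.tanh(c)                       # hidden vector reads the cell
    return h, c
```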
Experiment: text classification
• Stanford Sentiment Treebank
• binary classification (positive/negative)
• 25-dim word vectors
• 50-dim cell/hidden vectors
• classification layer on final hidden vector
• AdaGrad, 10 epochs, mini-batch size 10
• early stopping on dev set

          accuracy
gateless  80.6
Output Gates
[figure: the hidden vector is now computed from the memory cell through an output gate]
• this is pointwise multiplication! the output gate is a vector
• the gate’s connection to the memory cell uses a diagonal matrix
• the gate uses a logistic sigmoid, so its output ranges from 0 to 1
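The slides’ equations are in the figures; a common formulation matching the notes above (with a diagonal matrix V_o on the cell connection, and ⊙ for pointwise multiplication) is:

```latex
o_t = \sigma\left(W_o x_t + U_o h_{t-1} + V_o c_t + b_o\right), \qquad
h_t = o_t \odot \tanh(c_t)
```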
              acc.
gateless      80.6
output gates  81.9

What’s being learned? (demo)
Input Gates
[figure: the candidate cell update is multiplied pointwise by an input gate before being added to the memory cell]
• again, this is pointwise multiplication
• the gate’s connection to the memory cell uses a diagonal matrix
• the difference from output gates: an output gate filters what leaves the cell, while an input gate filters what enters it
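Again hedging on notation (the exact equations are in the figures), a common input-gate formulation in the same style is:

```latex
i_t = \sigma\left(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i\right), \qquad
c_t = c_{t-1} + i_t \odot \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right)
```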
              acc.
gateless      80.6
output gates  81.9
input gates   84.4
Input and Output Gates

                      acc.
gateless              80.6
output gates          81.9
input gates           84.4
input & output gates  84.6
Forget Gates
[figure: the previous memory cell value is multiplied pointwise by a forget gate before being carried into the new cell value]
              acc.
gateless      80.6
output gates  81.9
input gates   84.4
forget gates  82.1
All Gates
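A sketch of one full step with input, forget, and output gates (standard LSTM equations; the diagonal cell connections from earlier slides are omitted here for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, p):
    """One step with all three gates; p holds weight matrices and biases."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h + p["bi"])   # what enters the cell
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h + p["bf"])   # what is kept from the old cell
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h + p["bo"])   # what leaves the cell
    c = f * c + i * np.tanh(p["Wc"] @ x + p["Uc"] @ h + p["bc"])
    h = o * np.tanh(c)
    return h, c
```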
                             acc.
gateless                     80.6
output gates                 81.9
input gates                  84.4
input & output gates         84.6
forget gates                 82.1
input & forget gates         84.1
forget & output gates        82.6
input, forget, output gates  85.3
Backward & Bidirectional LSTMs
• bidirectional: if shallow, just use forward and backward LSTMs in parallel, concatenate the final two hidden vectors, feed to softmax (results in the table below)
                             forward  backward  bidirectional
gateless                     80.6     80.3      81.5
output gates                 81.9     83.7      82.6
input gates                  84.4     82.9      83.9
forget gates                 82.1     83.4      83.1
input, forget, output gates  85.3     85.9      85.1
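In sketch form, the shallow bidirectional feature vector (reusing the `lstm_step` sketch above; `dim` is the hidden/cell size):

```python
import numpy as np

def bi_lstm_features(xs, p_fwd, p_bwd, dim):
    h_f = c_f = np.zeros(dim)
    for x in xs:                       # forward LSTM, left to right
        h_f, c_f = lstm_step(x, h_f, c_f, p_fwd)
    h_b = c_b = np.zeros(dim)
    for x in reversed(xs):             # backward LSTM, right to left
        h_b, c_b = lstm_step(x, h_b, c_b, p_bwd)
    return np.concatenate([h_f, h_b])  # fed to the softmax classification layer
```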
LSTM
[figure: the complete LSTM, with memory cell and input, forget, and output gates]
Deep LSTM (2-layer)
[figure: the layer-1 hidden vectors are the input sequence for layer 2]

                                     acc.
gateless, shallow (50)               80.6
gateless, deep (30, 30)              80.8
input/forget/output, shallow (50)    85.3
input/forget/output, deep (30, 30)   ~85
Deep Bidirectional LSTMs
• concatenate hidden vectors of forward & backward LSTMs; connect each entry to the forward and backward hidden vectors in the next layer (sketched below)
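A hypothetical helper (reusing `lstm_step` from above) that returns one concatenated [forward; backward] vector per position, so layers can be stacked:

```python
import numpy as np

def bi_layer(xs, p_fwd, p_bwd, dim):
    """One bidirectional layer: a [h_fwd; h_bwd] vector per position."""
    hs_f, h, c = [], np.zeros(dim), np.zeros(dim)
    for x in xs:                        # forward pass over the sequence
        h, c = lstm_step(x, h, c, p_fwd)
        hs_f.append(h)
    hs_b, h, c = [], np.zeros(dim), np.zeros(dim)
    for x in reversed(xs):              # backward pass over the sequence
        h, c = lstm_step(x, h, c, p_bwd)
        hs_b.append(h)
    hs_b.reverse()
    return [np.concatenate([f, b]) for f, b in zip(hs_f, hs_b)]

# deep: layer 2 reads layer 1's per-position outputs
# layer2 = bi_layer(bi_layer(xs, p1f, p1b, d1), p2f, p2b, d2)
```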
(logistic) sigmoid: σ(x) = 1 / (1 + e^{−x})