Transcript of: Natural Language Processing, anoopsarkar.github.io/nlp-class/assets/slides/nlm.pdf, Anoop Sarkar (SFU NatLangLab), Simon Fraser University, November 6, 2019 (38 slides).

Page 1:

SFU NatLangLab

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

November 6, 2019

Page 2:

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 1: Long distance dependencies

Page 3:

Long distance dependencies

Example

I He doesn’t have very much confidence in himself

I She doesn’t have very much confidence in herself

n-gram Language Models: P(wi | w i−1i−n+1)

P(himself | confidence, in)

P(herself | confidence, in)

What we want: P(wi | w<i)

P(himself | He, . . . , confidence)

P(herself | She, . . . , confidence)
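To make the contrast concrete, here is a small toy sketch in Python (my own example, not from the slides) showing which part of the sentence each model gets to condition on; the simple whitespace tokenization is an assumption for illustration.

```python
# A toy illustration: the context a trigram LM conditions on versus the full
# prefix w<i that we would like to condition on.
sentence = "He does not have very much confidence in himself".split()
i = sentence.index("himself")
trigram_context = sentence[i - 2:i]   # what P(wi | wi-2, wi-1) sees
full_history = sentence[:i]           # what P(wi | w<i) sees
print(trigram_context)                # ['confidence', 'in'] -- 'He' is out of reach
print(full_history)                   # the whole prefix, including 'He'
```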

Page 4:

Long distance dependencies
Other examples

- Selectional preferences: I ate lunch with a fork vs. I ate lunch with a backpack
- Topic: Babe Ruth was able to touch the home plate yet again vs. Lucy was able to touch the home audiences with her humour
- Register: consistency of register across the entire sentence, e.g. informal (Twitter) vs. formal (scientific articles)

Page 5:

Language Models

Chain rule, and ignore some of the history: the trigram model

p(w1, . . . , wn) ≈ p(w1) p(w2 | w1) p(w3 | w1, w2) . . . p(wn | wn−2, wn−1)
                 ≈ ∏t p(wt+1 | wt−1, wt)

How can we address the long-distance issues?

- Skip n-gram models: skip an arbitrary distance for the n-gram context.
- Variable n in n-gram models, chosen adaptively.
- Problems: still "all or nothing"; categorical rather than soft.

Page 6:

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 2: Neural Language Models

Page 7:

Neural Language Models

Use the chain rule and approximate the history with a neural network:

p(w1, . . . , wn) ≈ ∏t p(wt+1 | φ(w1, . . . , wt))

where φ(w1, . . . , wt) captures the history with a vector s(t).

Recurrent Neural Network

- Let y be the output wt+1 for the current word wt and history w1, . . . , wt.
- s(t) = f(Uxh · w(t) + Whh · s(t − 1)) where f is a sigmoid / tanh.
- s(t) encapsulates the history in a single vector of size h.
- The output word wt+1 at time step t is provided by y(t).
- y(t) = g(Vhy · s(t)) where g is the softmax.
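The equations above fit in a few lines of numpy. This is a minimal sketch rather than the slides' own code: the matrix orientations (Uxh as h×x, Whh as h×h, Vhy as |V|×h) and the toy sizes are my assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(w_t, s_prev, Uxh, Whh, Vhy):
    """One time step: s(t) = f(Uxh·w(t) + Whh·s(t-1)); y(t) = g(Vhy·s(t))."""
    s_t = np.tanh(Uxh @ w_t + Whh @ s_prev)    # f = tanh
    y_t = softmax(Vhy @ s_t)                   # g = softmax, a distribution over V
    return s_t, y_t

rng = np.random.default_rng(0)
x_dim, h_dim, vocab = 8, 16, 50                # made-up sizes
Uxh = rng.normal(0, 0.1, (h_dim, x_dim))
Whh = rng.normal(0, 0.1, (h_dim, h_dim))
Vhy = rng.normal(0, 0.1, (vocab, h_dim))
s, y = rnn_step(rng.normal(size=x_dim), np.zeros(h_dim), Uxh, Whh, Vhy)
print(y.shape, round(y.sum(), 6))              # (50,) 1.0
```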

Page 8:

Neural Language Models
Recurrent Neural Network

A single time step in the RNN:

- The input layer is a one-hot vector; it and the output layer y have the same dimensionality as the vocabulary (10K-200K).
- The one-hot vector is used to look up the word embedding w.
- The "hidden" layer s is orders of magnitude smaller (50-1K neurons).
- U is the matrix of weights between the input and hidden layers.
- V is the matrix of weights between the hidden and output layers.
- Without the recurrent weights W, this is equivalent to a bigram feedforward language model.

Page 9:

Neural Language Models
Recurrent Neural Network

[Figure: the RNN unrolled over six time steps. Each input w(t) feeds the hidden state s(t) through Uxh, each hidden state feeds the next hidden state through Whh, and each hidden state produces the output y(t) through Vhy.]

What is stored and what is computed:

- Model parameters: w ∈ Rx (word embeddings); Uxh ∈ Rx×h; Whh ∈ Rh×h; Vhy ∈ Rh×y where y = |V|.
- Vectors computed during the forward pass: s(t) ∈ Rh; y(t) ∈ Ry, where each y(t) is a probability distribution over the vocabulary V.
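A quick tally of these sizes (my own arithmetic, with made-up values for x, h and |V|) shows why the output matrix Vhy and the embeddings dominate the parameter count.

```python
# Rough parameter counts for the matrices listed above; x, h and |V| are made up.
x, h, y = 100, 64, 50_000                  # embedding size, hidden size, |V|
params = {"embeddings w": y * x, "Uxh": x * h, "Whh": h * h, "Vhy": h * y}
for name, count in params.items():
    print(f"{name:13s} {count:>10,d}")
print(f"{'total':13s} {sum(params.values()):>10,d}")
```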

Page 10:

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 3: Training RNN Language Models

Page 11:

Neural Language Models
Recurrent Neural Network

[Figure: the computational graph for an RNN language model.]

Page 12:

Training of RNNLM

- Training is performed using Stochastic Gradient Descent (SGD); a minimal sketch of the loop is given below.
- We go through the training data iteratively and update the weight matrices U, W and V after processing every word.
- Training is performed over several "epochs" (usually 5-10).
- An epoch is one pass through the training data.
- As with feedforward networks we have two passes:
  Forward pass: collect the values needed to make a prediction (for each time step).
  Backward pass: back-propagate the error gradients (through each time step).
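Below is a minimal sketch of this loop on a toy corpus, assuming sigmoid hidden units and using the U and V update rules from the following slides; the BPTT update of the recurrent weights W (and of the embeddings) is left out here to keep the sketch short. The corpus, sizes and learning rate are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus)); idx = {w: i for i, w in enumerate(vocab)}
V_size, h, x, alpha, epochs = len(vocab), 16, 8, 0.1, 5

E = rng.normal(0, 0.1, (V_size, x))       # word embeddings
U = rng.normal(0, 0.1, (h, x))            # input -> hidden
W = rng.normal(0, 0.1, (h, h))            # hidden -> hidden (not updated here)
V = rng.normal(0, 0.1, (V_size, h))       # hidden -> output

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

for epoch in range(epochs):               # several passes ("epochs") over the data
    s, loss = np.zeros(h), 0.0
    for t in range(len(corpus) - 1):
        w_t = E[idx[corpus[t]]]
        s = sigmoid(U @ w_t + W @ s)      # forward pass: hidden state s(t)
        y = softmax(V @ s)                # predicted distribution over the next word
        d = np.zeros(V_size); d[idx[corpus[t + 1]]] = 1.0
        loss -= np.log(y[idx[corpus[t + 1]]])
        eo = d - y                        # output-layer error gradient
        eh = (V.T @ eo) * s * (1 - s)     # error propagated to the hidden layer
        V += alpha * np.outer(eo, s)      # update after processing every word
        U += alpha * np.outer(eh, w_t)
    print(f"epoch {epoch}: loss {loss:.2f}")
```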

Page 13:

Training of RNNLM
Forward pass

- In the forward pass we compute the hidden state s(t) based on the previous states 1, . . . , t − 1:

  s(t) = f(Uxh · w(t) + Whh · s(t − 1))
  s(t) = f(Uxh · w(t) + Whh · f(Uxh · w(t − 1) + Whh · s(t − 2)))
  s(t) = f(Uxh · w(t) + Whh · f(Uxh · w(t − 1) + Whh · f(Uxh · w(t − 2) + Whh · s(t − 3))))
  and so on.

- Let us assume f is linear, e.g. f(x) = x.
- Notice that we then have to compute Whh · Whh · . . . = ∏i Whh.
- By examining this repeated matrix multiplication we can show that the norm of ∏i Whh → ∞ (explodes).
- This is why f is set to a function that returns a bounded value (sigmoid / tanh).
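A small numeric check of this point (my own example, with an arbitrary size and weight scale): repeatedly multiplying by the same matrix makes the norm of the product grow by orders of magnitude when its largest singular value is above 1, and shrink towards zero when it is below 1.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 50
Whh = rng.normal(0, 0.5, (h, h))          # scale chosen so the product explodes
prod = np.eye(h)
for k in range(1, 21):
    prod = Whh @ prod                     # Whh · Whh · ... (k factors)
    if k % 5 == 0:
        print(k, np.linalg.norm(prod))    # norm grows by orders of magnitude
```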

Page 14:

Training of RNNLM
Backward pass

- The gradient of the error vector at the output layer, eo(t), is computed using the cross-entropy criterion:

  eo(t) = d(t) − y(t)

- d(t) is the target vector: the word w(t + 1) represented as a one-hot (1-of-V) vector.
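As a small sketch (made-up sizes, with y(t) produced by a softmax), the error vector is just the one-hot target minus the predicted distribution:

```python
import numpy as np

vocab_size, target = 10, 3                        # made-up vocabulary size and target word
logits = np.random.default_rng(0).normal(size=vocab_size)
y = np.exp(logits - logits.max()); y /= y.sum()   # y(t): softmax output
d = np.zeros(vocab_size); d[target] = 1.0         # d(t): one-hot (1-of-V) target for w(t+1)
eo = d - y                                        # eo(t) = d(t) - y(t)
print(round(eo[target], 4), round(eo.sum(), 10))  # positive at the target; sums to ~0
```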

Page 15:

Training of RNNLM
Backward pass

- The weights V between the hidden layer s(t) and the output layer y(t) are updated as

  V(t+1) = V(t) + s(t) · eo(t) · α

  where α is the learning rate.
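Written out in numpy, this update is an outer product between s(t) and eo(t), scaled by α; the orientation of V here (|V| × h, so that y(t) = g(V · s(t))) is my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
h, vocab, alpha = 16, 10, 0.1
V = rng.normal(0, 0.1, (vocab, h))
s = rng.uniform(size=h)                   # hidden state s(t)
eo = rng.normal(size=vocab)               # output error eo(t) = d(t) - y(t)
V_new = V + alpha * np.outer(eo, s)       # V(t+1) = V(t) + s(t)·eo(t)·α
print(V_new.shape)                        # unchanged shape, every entry nudged
```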

Page 16:

Training of RNNLM
Backward pass

- Next, the gradients of the errors are propagated from the output layer to the hidden layer:

  eh(t) = dh(eo(t) · V, t)

  where the error vector is obtained using the function dh(), applied element-wise:

  dhj(x, t) = x · sj(t) · (1 − sj(t))
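A sketch of this step with made-up sizes: multiply eo(t) back through V and apply dh() element-wise, where sj(t)(1 − sj(t)) is the derivative of the sigmoid hidden activation.

```python
import numpy as np

rng = np.random.default_rng(0)
h, vocab = 16, 10
V = rng.normal(0, 0.1, (vocab, h))        # assumed orientation: hidden -> output
s = rng.uniform(size=h)                   # sigmoid hidden state s(t), values in (0, 1)
eo = rng.normal(size=vocab)               # output-layer error eo(t)
eh = (V.T @ eo) * s * (1.0 - s)           # eh(t) = dh(eo(t)·V, t), element-wise
print(eh.shape)                           # (16,)
```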

Page 17:

Training of RNNLM
Backward pass

- The weights U between the input layer w(t) and the hidden layer s(t) are then updated as

  U(t+1) = U(t) + w(t) · eh(t) · α

- Similarly, the word embeddings w can also be updated using the error gradient.
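One practical observation (my gloss, not a claim from the slides): if the input is the raw one-hot vector rather than a dense embedding, the outer product w(t) · eh(t) is nonzero in a single column, so only the weights of the current word change.

```python
import numpy as np

rng = np.random.default_rng(0)
h, vocab, alpha = 16, 10, 0.1
U = rng.normal(0, 0.1, (h, vocab))        # here U maps a one-hot input to the hidden layer
w = np.zeros(vocab); w[4] = 1.0           # one-hot input w(t) for word index 4
eh = rng.normal(size=h)                   # hidden-layer error eh(t)
U_new = U + alpha * np.outer(eh, w)       # U(t+1) = U(t) + w(t)·eh(t)·α
changed = np.nonzero(np.any(U_new != U, axis=0))[0]
print(changed)                            # [4]: only the current word's column moved
```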

Page 18:

Training of RNNLM: Backpropagation through time
Backward pass

- The recurrent weights W are updated by unfolding the network in time and training it as a deep feedforward neural network.
- The process of propagating errors back through the recurrent weights is called Backpropagation Through Time (BPTT).

Page 19:

Training of RNNLM: Backpropagation through time
[Fig. from [1]: the RNN unfolded as a deep feedforward network, 3 time steps back in time.]

Page 20:

Training of RNNLM: Backpropagation through time
Backward pass

- Error propagation is done recursively as follows (it requires the hidden-layer states from the previous time steps τ to be stored):

  eh(t − τ − 1) = dh(eh(t − τ) · W, t − τ − 1)

- The error gradients quickly vanish as they are backpropagated through time (exploding is less likely when f is a bounded function such as sigmoid / tanh).
- We use gated RNNs to stop gradients from vanishing or exploding.
- Popular gated RNNs are long short-term memory RNNs (LSTMs) and gated recurrent units (GRUs).

Page 21:

Training of RNNLM: Backpropagation through time
Backward pass

- The recurrent weights W are updated as:

  W(t+1) = W(t) + Σ_{z=0}^{T} s(t − z − 1) · eh(t − z) · α

- Note that the matrix W is changed in a single update, not during the backpropagation of the errors.
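A sketch of the two formulas together, with random stand-ins for the stored hidden states and an arbitrary BPTT depth T: the error is pushed back step by step, and W receives one accumulated update at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
h, T, alpha = 8, 3, 0.1                   # hidden size, BPTT depth, learning rate
W = rng.normal(0, 0.1, (h, h))
states = [rng.uniform(size=h) for _ in range(T + 2)]   # s(t-T-1), ..., s(t), stored earlier
eh = [np.zeros(h) for _ in range(T + 1)]
eh[0] = rng.normal(size=h)                # eh(t), the error arriving from the output layer

for tau in range(T):                      # eh(t-τ-1) = dh(eh(t-τ)·W, t-τ-1)
    s_prev = states[-(tau + 2)]           # s(t-τ-1)
    eh[tau + 1] = (W.T @ eh[tau]) * s_prev * (1 - s_prev)

dW = sum(np.outer(eh[z], states[-(z + 2)]) for z in range(T + 1))
W = W + alpha * dW                        # a single update to W, after all τ steps
print(np.linalg.norm(dW))
```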

Page 22:

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 4: Gated Recurrent Units

Page 23:

Interpolation for hidden units
u: use history or forget history

- For the RNN state s(t) ∈ Rh create a binary vector u ∈ {0, 1}h:

  ui = 1: use the new hidden state (the standard RNN update)
  ui = 0: copy the previous hidden state and ignore the RNN update

- Create an intermediate hidden state s̃(t), where f is tanh:

  s̃(t) = f(Uxh · w(t) + Whh · s(t − 1))

- Use the binary vector u to interpolate between copying the prior state s(t − 1) and using the new state s̃(t):

  s(t) = (1 − u) ⊙ s(t − 1) + u ⊙ s̃(t)

  where ⊙ is elementwise multiplication.

Page 24:

Interpolation for hidden units
r: reset or retain each element of the hidden state vector

- For the RNN state s(t − 1) ∈ Rh create a binary vector r ∈ {0, 1}h:

  ri = 1: if si(t − 1) should be used
  ri = 0: if si(t − 1) should be ignored

- Modify the intermediate hidden state s̃(t), where f is tanh:

  s̃(t) = f(Uxh · w(t) + Whh · (r ⊙ s(t − 1)))

- Use the binary vector u to interpolate between s(t − 1) and s̃(t):

  s(t) = (1 − u) ⊙ s(t − 1) + u ⊙ s̃(t)

Page 25:

Interpolation for hidden units
Learning u and r

- Instead of binary vectors u ∈ {0, 1}h and r ∈ {0, 1}h we want to learn u and r.
- Let u ∈ [0, 1]h and r ∈ [0, 1]h.
- Learn these two h-dimensional vectors using equations similar to the RNN hidden state equation:

  u(t) = σ(Uu · w(t) + Wu · s(t − 1))
  r(t) = σ(Ur · w(t) + Wr · s(t − 1))

- The sigmoid function σ ensures that each element of u and r lies in [0, 1].
- The use-history vector u and the reset vector r use different parameters Uu, Wu and Ur, Wr.

Page 26:

Interpolation for hidden units
Gated Recurrent Unit (GRU)

- Putting it all together:

  u(t) = σ(Uu · w(t) + Wu · s(t − 1))
  r(t) = σ(Ur · w(t) + Wr · s(t − 1))
  s̃(t) = tanh(Uxh · w(t) + Whh · (r(t) ⊙ s(t − 1)))
  s(t) = (1 − u(t)) ⊙ s(t − 1) + u(t) ⊙ s̃(t)
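The four equations translate directly into a small numpy step function; the parameter names mirror the slides (Uu/Wu, Ur/Wr for the gates, Uxh/Whh for the candidate state), while the shapes, initialization and sizes are my own choices.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def gru_step(w_t, s_prev, p):
    u = sigmoid(p["Uu"] @ w_t + p["Wu"] @ s_prev)                 # update gate u(t)
    r = sigmoid(p["Ur"] @ w_t + p["Wr"] @ s_prev)                 # reset gate r(t)
    s_tilde = np.tanh(p["Uxh"] @ w_t + p["Whh"] @ (r * s_prev))   # candidate state
    return (1 - u) * s_prev + u * s_tilde                         # interpolate old vs. new

rng = np.random.default_rng(0)
x, h = 8, 16
p = {"Uu": rng.normal(0, 0.1, (h, x)), "Wu": rng.normal(0, 0.1, (h, h)),
     "Ur": rng.normal(0, 0.1, (h, x)), "Wr": rng.normal(0, 0.1, (h, h)),
     "Uxh": rng.normal(0, 0.1, (h, x)), "Whh": rng.normal(0, 0.1, (h, h))}
s = gru_step(rng.normal(size=x), np.zeros(h), p)
print(s.shape)                                                    # (16,)
```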

Page 27:

Interpolation for hidden units
Long Short-Term Memory (LSTM)

- Split u(t) into two different gates i(t) and f(t):

  i(t) = σ(Ui · w(t) + Wi · s(t − 1))
  f(t) = σ(Uf · w(t) + Wf · s(t − 1))
  r(t) = σ(Ur · w(t) + Wr · s(t − 1))
  s̃(t) = tanh(Uxh · w(t) + Whh · s(t − 1))        [in the GRU: Whh · (r(t) ⊙ s(t − 1))]
  c(t) = f(t) ⊙ c(t − 1) + i(t) ⊙ s̃(t)            [in the GRU: s(t) = (1 − u(t)) ⊙ s(t − 1) + u(t) ⊙ s̃(t)]
  s(t) = r(t) ⊙ tanh(c(t))

- So the LSTM is a GRU plus an extra Uxh, Whh and an extra tanh.
- Q: what happens if f(t) is set to 1 − i(t)? A: read [3].
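And the LSTM step as a numpy sketch. I write the memory that f(t) and i(t) interpolate explicitly as c(t) (the usual LSTM cell) and treat r(t) as the output gate; the gate structure follows the slide, while the symbol c, the shapes and the sizes are my assumptions.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def lstm_step(w_t, s_prev, c_prev, p):
    i = sigmoid(p["Ui"] @ w_t + p["Wi"] @ s_prev)            # input gate i(t)
    f = sigmoid(p["Uf"] @ w_t + p["Wf"] @ s_prev)            # forget gate f(t)
    r = sigmoid(p["Ur"] @ w_t + p["Wr"] @ s_prev)            # output gate r(t)
    s_tilde = np.tanh(p["Uxh"] @ w_t + p["Whh"] @ s_prev)    # candidate state
    c = f * c_prev + i * s_tilde                             # memory cell c(t)
    s = r * np.tanh(c)                                       # hidden state s(t)
    return s, c

rng = np.random.default_rng(1)
x, h = 8, 16
p = {name: rng.normal(0, 0.1, (h, x if name.startswith("U") else h))
     for name in ["Ui", "Uf", "Ur", "Uxh", "Wi", "Wf", "Wr", "Whh"]}
s, c = lstm_step(rng.normal(size=x), np.zeros(h), np.zeros(h), p)
print(s.shape, c.shape)                                      # (16,) (16,)
```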

Page 28:

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 5: Sequence prediction using RNNs

Page 29:

Representation: finding the right parameters

Problem: Predict ?? using context, P(?? | context)

Profits/N soared/V at/P Boeing/?? Co. , easily topping forecasts on Wall Street , as their CEO Alan Mulally announced first quarter results .

Representation: history

- The input is a tuple (x[1:n], i) [ignoring y−1 for now].
- x[1:n] are the n words in the input.
- i is the index of the word being tagged.
- For example, for x4 = Boeing, we can use an RNN to summarize the entire context at i = 4:
  - x[1:i−1] = (Profits, soared, at)
  - x[i+1:n] = (Co., easily, ..., results, .)

Page 30:

Locally normalized RNN taggers

Log-linear model over a (history, tag) pair (h, y)

log Pr(y | h) = w · f(h, y) − log Σ_{y′} exp(w · f(h, y′))

f(h, y) is a vector of feature functions

RNN for tagging

- Replace f(h, y) with the RNN hidden state s(t).
- Define the output log-probability: log Pr(y | h) = log y(t).
- y(t) = g(V · s(t)) where g is the softmax.
- In neural LMs the output y ∈ V (the vocabulary).
- In sequence tagging with RNNs the output y ∈ T (the tagset).

log Pr(y[1:n] | x[1:n]) = Σ_{i=1}^{n} log Pr(yi | hi)
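A small sketch of this scoring with random stand-ins for the RNN states: each position gets a log-softmax over the tagset, and the sentence log-probability is the sum over positions. The sizes and the gold tag sequence are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, tags = 5, 16, 4                      # sentence length, hidden size, |T|
V = rng.normal(0, 0.1, (tags, h))
states = rng.normal(size=(n, h))           # stand-ins for s(1..n) from an RNN over x[1:n]
gold = [0, 2, 1, 3, 0]                     # a made-up tag sequence y[1:n]

logprob = 0.0
for i in range(n):
    z = V @ states[i]
    logp = z - z.max() - np.log(np.exp(z - z.max()).sum())   # log softmax over the tags
    logprob += logp[gold[i]]               # log Pr(y_i | h_i)
print(logprob)                             # log Pr(y[1:n] | x[1:n])
```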

Page 31:

Bidirectional RNNs
Fig. from [2]

[Figure: a bidirectional RNN over inputs x(1) . . . x(6), with forward states sf(1) . . . sf(6) and backward states sb(1) . . . sb(6).]
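A minimal numpy sketch of the idea in the figure: run one RNN left to right, another right to left, and concatenate their states at each position. All parameters and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x, h = 6, 8, 16
X = rng.normal(size=(n, x))                  # inputs x(1) ... x(n)
Uf, Wf = rng.normal(0, 0.1, (h, x)), rng.normal(0, 0.1, (h, h))
Ub, Wb = rng.normal(0, 0.1, (h, x)), rng.normal(0, 0.1, (h, h))

sf, sb = np.zeros((n, h)), np.zeros((n, h))
state = np.zeros(h)
for t in range(n):                           # forward RNN, left to right
    state = np.tanh(Uf @ X[t] + Wf @ state); sf[t] = state
state = np.zeros(h)
for t in reversed(range(n)):                 # backward RNN, right to left
    state = np.tanh(Ub @ X[t] + Wb @ state); sb[t] = state

bi = np.concatenate([sf, sb], axis=1)        # [sf(t); sb(t)] at every position
print(bi.shape)                              # (6, 32)
```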

Page 32:

Bidirectional RNNs can be stacked
Fig. from [2]

[Figure: two bidirectional RNNs stacked on top of each other: the first layer's states (sf1, sb1) over inputs x(1) . . . x(6) are the inputs to the second layer's states (sf2, sb2).]

Page 33:

Natural Language Processing

Anoop Sarkar
anoopsarkar.github.io/nlp-class

Simon Fraser University

Part 6: Training RNNs on GPUs

Page 34:

Parallelizing RNN computations
Fig. from [2]

Apply RNNs to batches of sequences: present the data as a 3D tensor of shape (T × B × F). Each dynamic update is then a matrix multiplication over the whole batch.
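A sketch of one such update with made-up sizes: with the data laid out as a (T × B × F) tensor, a single time step advances all B sequences with one matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
T, B, F, h = 20, 32, 8, 16                 # time steps, batch size, features, hidden size
X = rng.normal(size=(T, B, F))             # the batch as a 3D tensor
U = rng.normal(0, 0.1, (F, h))
W = rng.normal(0, 0.1, (h, h))

S = np.zeros((B, h))
for t in range(T):                         # one matmul per time step for the whole batch
    S = np.tanh(X[t] @ U + S @ W)          # (B,F)·(F,h) + (B,h)·(h,h) -> (B,h)
print(S.shape)                             # (32, 16)
```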

Page 35:

Binary Masks
Fig. from [2]

A mask matrix may be used to aid computations that should ignore the padded zeros (an example mask is shown below, followed by a short sketch of its use).

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
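A sketch of how such a mask can be built from the sequence lengths and used so that padded positions contribute nothing to the loss; the lengths and per-token losses here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T = 4, 6
lengths = np.array([6, 4, 5, 3])                 # true lengths of the padded sequences
mask = (np.arange(T)[None, :] < lengths[:, None]).astype(float)   # (B, T) of 0/1
token_loss = rng.uniform(size=(B, T))            # stand-in per-token losses
mean_loss = (token_loss * mask).sum() / mask.sum()
print(mask.astype(int))                          # 1s for real tokens, 0s for padding
print(round(mean_loss, 4))                       # mean over real tokens only
```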

Page 36:

Binary Masks
Fig. from [2]

It may be necessary to (partially) sort your data.

Page 37:

[1] Tomas Mikolov. Recurrent Neural Networks for Language Models. Google Talk. 2010.

[2] Philemon Brakel. MLIA-IQIA Summer School notes on RNNs. 2015.

[3] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber. LSTM: A Search Space Odyssey. 2017.

Page 38:

Acknowledgements

Many slides borrowed or inspired from lecture notes by Michael Collins, Chris Dyer, Kevin Knight, Chris Manning, Philipp Koehn, Adam Lopez, Graham Neubig, Richard Socher and Luke Zettlemoyer from their NLP course materials.

All mistakes are my own.

A big thank you to all the students who read through these notesand helped me improve them.