Transcript of COMP 551 - Applied Machine Learning, Lecture 16: Recurrent Neural Networks (wlh/comp551/slides/16-rnns.pdf)

Page 1:
COMP 551 - Applied Machine Learning
Lecture 16 --- Recurrent Neural Networks
William L. Hamilton (with slides and content from Joelle Pineau and Ryan Lowe)
* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

William L. Hamilton, McGill University and Mila 1

Page 2:

Midterm
§ I released practice questions. (See the announcement on MyCourses.)

§ The midterm is 6-8pm on Nov. 18th.
§ We will send room assignments soon!

§ The midterm itself will be 100 minutes.

William L. Hamilton, McGill University and Mila 2

Page 3:

Last time: Convolutional Neural Networks
[Figure: feedforward network vs. convolutional neural network (CNN)]

§ CNN characteristics:

§ Input is usually a 3D tensor: 2D image x 3 colours

§ Each layer transforms an input 3D tensor to an output 3D tensor using a differentiable function.

From: http://cs231n.github.io/convolutional-networks/
William L. Hamilton, McGill University and Mila 3

Page 4:

Some practical issues

§ Issue 1: Softmax vs. sigmoid

§ Issue 2: Optimization concerns

William L. Hamilton, McGill University and Mila 4

Page 5:

Practical issue #1: softmax vs. sigmoid
§ We have discussed the sigmoid function for binary classification.

§ Maps a real-valued scalar to probability.

§ Combine with cross-entropy loss to train binary classification models.

§ It can be extended to the softmax function for multiclass classification

§ Maps a real-valued K-dimensional vector to a K-way categorical distribution.

§ Combine with cross-entropy loss to train K-way classification models

William L. Hamilton, McGill University and Mila 5

σ(z) = 1/(1 + e^{−z})

softmax(z, k) = e^{z_k} / Σ_{j=1}^{K} e^{z_j}
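As an illustration (my own NumPy sketch, not code from the slides), the two functions side by side; subtracting max(z) inside the softmax is a standard numerical-stability trick, not something the slide specifies:

    import numpy as np

    def sigmoid(z):
        # maps a real-valued scalar (or array) to a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # maps a K-dimensional real vector to a K-way categorical distribution
        z = z - np.max(z)          # shift for numerical stability; does not change the result
        expz = np.exp(z)
        return expz / expz.sum()

    z = np.array([2.0, 0.5, -1.0])
    print(softmax(z))              # entries are positive and sum to 1
    print(sigmoid(0.0))            # 0.5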

Page 6:

Practical issue #1: softmax vs. sigmoid

William L. Hamilton, McGill University and Mila 6

P(y = k) = softmax(z, k) = e^{z_k} / Σ_{j=1}^{K} e^{z_j}

where:
§ P(y = k): probability that the output belongs to class k.
§ z: K-dimensional real-valued vector.
§ k: index of a particular dimension of z (i.e., k ∈ {1, …, K}).
§ z_k: k-th entry of z.
§ The denominator sums over all (exponentiated) entries of z.

Page 7:

Practical issue #1: softmax vs. sigmoid
§ Sigmoid is essentially a special case of the softmax with only two classes:

§ Cross-entropy can be generalized to the multiclass/softmax setting:

§ Multi-class (K-dimensional softmax) is not multi-output (K independent sigmoids).

§ Use softmax when every point belongs to one of K classes.

§ Use K independent sigmoids when there are multiple labels for each point.

William L. Hamilton, McGill University and Mila 7

σ(z) = 1/(1 + e^{−z}) = e^z/(1 + e^z) = e^z/(e^z + e^0)

softmax-cross-entropy(y, z) = −log(P(y = y)) = −log(softmax(z, y))

i.e., the negative log of the probability the model assigns to the true class y.
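For concreteness, a small NumPy sketch of this loss for a single example (my own illustration, not code from the slides); writing it via log-sum-exp is one common way to keep it numerically stable:

    import numpy as np

    def softmax_cross_entropy(y, z):
        # y: index of the true class, z: K-dimensional vector of logits
        # equivalent to -log(softmax(z)[y]), written in a numerically stable way
        z = z - np.max(z)
        return np.log(np.exp(z).sum()) - z[y]

    z = np.array([2.0, 0.5, -1.0])
    print(softmax_cross_entropy(0, z))  # small loss: class 0 has the largest logit
    print(softmax_cross_entropy(2, z))  # larger loss: class 2 has the smallest logit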

Page 8:

Practical issue #1: softmax vs. sigmoid
§ Multi-class classification (K-dimensional softmax) is not the same as multi-output classification (K independent sigmoid functions).

§ Use softmax/multiclass when every point belongs to one of K classes.
§ E.g., standard MNIST digit classification.
§ This is the more common scenario.

§ Use K independent sigmoids when there are multiple labels for each point.
§ E.g., detecting shapes in images that contain multiple shapes.
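To make the contrast concrete, a small sketch (my own) applying the two kinds of output head to the same logit vector:

    import numpy as np

    z = np.array([2.0, -0.5, 1.0])           # logits for K = 3 classes/labels

    # multi-class head: one softmax, the probabilities compete and sum to 1
    p_multiclass = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

    # multi-output head: K independent sigmoids, each label is scored on its own
    p_multilabel = 1.0 / (1.0 + np.exp(-z))

    print(p_multiclass, p_multiclass.sum())  # sums to 1.0
    print(p_multilabel)                      # each entry in (0, 1), no constraint on the sum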

William L. Hamilton, McGill University and Mila 8

Page 9:

Practical issue #2: Optimization concerns

William L. Hamilton, McGill University and Mila 9

Page 10:

Practical issue #2: Choosing the learning rate
§ Backprop is very sensitive to the choice of learning rate.

§ Too large ⇒ divergence.

§ Too small ⇒ VERY slow learning.

§ The learning rate also influences the ability to escape local optima.

§ The learning rate is a critical hyperparameter.

William L. Hamilton, McGill University and Mila 10

Page 11:

Practical issue #2: Adaptive optimization
§ It is now standard to use “adaptive” optimization algorithms.

§ These approaches modify the learning rate adaptively depending on how the training is progressing.

§ Adam, RMSProp, and AdaGrad are popular approaches, with Adam being the de facto standard in deep learning.

§ All of these approaches scale the learning rate for each parameter based on statistics of the history of gradient updates for that parameter.

§ Intuition: increase the update strength for parameters that have had smaller updates in the past.
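As a rough illustration of that idea (a sketch of an AdaGrad-style rule, not the exact update used by Adam or any particular library), each parameter's step is divided by the square root of its accumulated squared gradients, so parameters with a history of small gradients get relatively larger steps:

    import numpy as np

    def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
        # accum keeps a running sum of squared gradients, one entry per parameter
        accum += grad ** 2
        w -= lr * grad / (np.sqrt(accum) + eps)   # per-parameter scaled learning rate
        return w, accum

    w = np.zeros(2)
    accum = np.zeros(2)
    for grad in [np.array([1.0, 0.01]), np.array([1.0, 0.01])]:
        w, accum = adagrad_step(w, grad, accum)
    print(w)   # both coordinates moved the same amount (~ -0.17) even though the raw gradients differ by 100x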

William L. Hamilton, McGill University and Mila 11

Page 12:

Practical issue #2: Adding momentum
§ At each iteration of gradient descent, we compute an update based on the derivative of the error for the current (mini)batch of training examples:

Δ_i w = α ∂J/∂w

§ Δ_i w: the update for the weight vector.
§ ∂J/∂w: the derivative of the error w.r.t. the weights for the current minibatch.

William L. Hamilton, McGill University and Mila 12

Page 13:

Practical issue #2: Adding momentum
§ On the i-th gradient descent update, instead of:

Δ_i w = α ∂J/∂w

we do:

Δ_i w = α ∂J/∂w + β Δ_{i−1} w

The second term is called momentum.

William L. Hamilton, McGill University and Mila 13

Page 14:

Practical issue #2: Adding momentum
§ On the i-th gradient descent update, instead of:

Δ_i w = α ∂J/∂w

we do:

Δ_i w = α ∂J/∂w + β Δ_{i−1} w

The second term is called momentum.

Advantages:
§ Easy to pass small local minima.
§ Keeps the weights moving in areas where the error is flat.

Disadvantages:
§ With too much momentum, it can jump out of a global minimum!
§ One more parameter to tune, and more chances of divergence.
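A minimal sketch of this update rule (my own toy quadratic objective and my own choices of α = 0.1 and β = 0.9, not values from the slides):

    import numpy as np

    def momentum_step(w, velocity, grad, alpha=0.1, beta=0.9):
        # velocity carries the previous update: Δ_i w = α ∂J/∂w + β Δ_{i-1} w
        velocity = alpha * grad + beta * velocity
        return w - velocity, velocity

    # toy objective J(w) = 0.5 * ||w||^2, so ∂J/∂w = w
    w = np.array([5.0, -3.0])
    velocity = np.zeros_like(w)
    for _ in range(200):
        w, velocity = momentum_step(w, velocity, grad=w)
    print(w)   # essentially at the minimum (the origin)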

William L. Hamilton, McGill University and Mila 14

Page 15:

Major paradigms for deep learning
§ Deep neural networks: The model should be interpreted as a computation graph.

§ Supervised training: E.g., feedforward neural networks.

§ Unsupervised training (later in the course): E.g., autoencoders.

§ Special architectures for different problem domains.

§ Computer vision => Convolutional neural nets.

§ Text and speech => Recurrent neural nets.

William L. Hamilton, McGill University and Mila 15

Page 17:

Neural models for sequences
§ Several datasets contain sequences of data (e.g., time-series, text).
§ How could we process sequences with a feed-forward neural network?

1. Take vectors representing the last N timesteps and concatenate (join) them.
2. Take vectors representing the last N timesteps and average them.

Example: machine translation
• Input: sequence of words
• Output: sequence of words

From Phil Blunsom's slides
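A tiny illustration of the two fixed-window options (my own toy example, with random vectors standing in for learned word representations):

    import numpy as np

    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(10, 4))   # 10 timesteps, 4-dimensional vectors
    N = 3                                 # fixed-size context window

    window = sequence[-N:]                # last N timesteps
    concatenated = window.reshape(-1)     # option 1: a single vector of size N * 4
    averaged = window.mean(axis=0)        # option 2: a single vector of size 4

    print(concatenated.shape, averaged.shape)   # (12,) (4,)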

William L. Hamilton, McGill University and Mila 17

Page 18:

Neural models for sequences
§ Problem: these approaches don't exploit the sequential nature of the data!!

§ Also, they can only consider information from a fixed-size context window

§ Temporal information is very important in sequences!!

§ E.g. machine translation:

“John hit Steve on the head with a bat”

!= “Steve hit John on the bat with a head”

!= “Bat hit a with head on the John Steve”

William L. Hamilton, McGill University and Mila 18

Page 19:

Recurrent Neural Networks (RNNs)

• Compare to feed-forward neural networks:
‣ Information is propagated from the inputs to the outputs.
‣ No notion of “time” necessary.

• RNNs can have arbitrary topology.
‣ No fixed direction of information flow.
• Delays associated with specific connections.
‣ Every directed cycle must contain a delay.
• Possesses an internal dynamic state.

[Figure: a feed-forward neural net (inputs x1 … x5, two hidden layers, output layer) vs. the same network with cycles added.]

William L. Hamilton, McGill University and Mila 19

Page 20:

Recurrent Neural Networks (RNNs)

[Figure: RNN computation graph with y: target, L: loss, o: output, h: hidden state, x: input. Image from deeplearningbook.org]

§ What kind of cycles?
§ Cycles with a time delay.
§ The box means that the information is sent at the next time step (no infinite loops).

William L. Hamilton, McGill University and Mila 20

Page 21:

Recurrent Neural Networks (RNNs)

[Figure: RNN computation graph with y: target, L: loss, o: output, x: input. Image from deeplearningbook.org]

§ What does this allow us to do?
§ Can view the RNN as having a hidden state h_t that changes over time.
§ h_t represents “useful information” from past inputs.
§ A standard/simple RNN:

h_t = σ(W h_{t−1} + U x_t + b)
o_t = σ(V h_t + c)
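A minimal NumPy sketch of one step of this simple RNN (my own toy dimensions; taking the slide's generic σ to be tanh for the hidden state and a sigmoid for the output is an assumption on my part):

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 4, 3, 2
    U = rng.normal(scale=0.1, size=(d_hid, d_in))    # input-to-hidden weights
    W = rng.normal(scale=0.1, size=(d_hid, d_hid))   # hidden-to-hidden (recurrent) weights
    V = rng.normal(scale=0.1, size=(d_out, d_hid))   # hidden-to-output weights
    b, c = np.zeros(d_hid), np.zeros(d_out)

    def rnn_step(h_prev, x):
        # h_t = σ(W h_{t-1} + U x_t + b);  o_t = σ(V h_t + c)
        h = np.tanh(W @ h_prev + U @ x + b)
        o = 1.0 / (1.0 + np.exp(-(V @ h + c)))
        return h, o

    h = np.zeros(d_hid)
    for x in rng.normal(size=(5, d_in)):             # a length-5 input sequence
        h, o = rnn_step(h, x)                        # the same weights are reused at every step
    print(h, o)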

William L. Hamilton, McGill University and Mila 21

Page 22:

Recurrent Neural Networks (RNNs)

§ Can unroll the RNN over time to form an acyclic graph.

§ RNN = a special kind of feed-forward network, where the weights are shared between time steps.

E.g., h_3 = σ(W σ(W σ(W h_0 + U x_1 + b) + U x_2 + b) + U x_3 + b)

William L. Hamilton, McGill University and Mila 22

Page 23:

Kinds of output

§ How do we specify the target output of an RNN?
§ Many ways! Two main ones:

1) Can specify one target at the end of the sequence
§ Ex: sentiment classification

2) Can specify a target at each time step
§ Ex: generating language

William L. Hamilton, McGill University and Mila 23

Page 24:

Sentiment classification

[Figure: an RNN unrolled over the input “I loved the movie”. Each word enters as a one-hot vector at the input layer, the hidden layer carries a real-valued state forward, and the output layer produces a single value (0.7) at the end.]

§ Input: Sequence of words
§ E.g., a review, a tweet, a news article

§ Output: A single value
§ E.g., indicating the probability that the text has a positive sentiment

William L. Hamilton, McGill University and Mila 24

Page 25:

Sentiment classification

[Figure: the same RNN over “I loved the movie” as on the previous slide.]

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: A single value
§ E.g., indicating the probability that the text has a positive sentiment

§ Words can be encoded as “one-hot” vectors.
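For concreteness, a small sketch of one-hot encoding with a made-up four-word vocabulary (my own illustration):

    import numpy as np

    vocab = ["I", "loved", "the", "movie"]            # toy vocabulary, index = position in list
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[word_to_index[word]] = 1.0                  # exactly one entry is 1
        return v

    sentence = ["I", "loved", "the", "movie"]
    X = np.stack([one_hot(w) for w in sentence])      # shape (4 timesteps, 4-dim vocabulary)
    print(X)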

William L. Hamilton, McGill University and Mila 25

Page 26:

Sentiment classification

[Figure: the same RNN over “I loved the movie”.]

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: A single value
§ E.g., indicating the probability that the text has a positive sentiment

§ There is only one output, at the end of the sequence.

William L. Hamilton, McGill University and Mila 26

Page 27:

Sentiment classification

[Figure: the same RNN over “I loved the movie”.]

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: A single value
§ E.g., indicating the probability that the text has a positive sentiment

§ Classic example of a “sequence classification” task.

William L. Hamilton, McGill University and Mila 27

Page 28:

Language modeling

[Figure: an RNN unrolled over the input “I loved the movie”, now producing an output at every time step: the one-hot targets “loved”, “the”, “movie” (the next word at each position).]

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: Sequence of words
§ E.g., predicting the next word that will occur in a sentence.

William L. Hamilton, McGill University and Mila 28

Page 29:

Language modeling

[Figure: the same RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”.]

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: Sequence of words
§ E.g., predicting the next word that will occur in a sentence.

§ Same input representation as sentiment classification.

William L. Hamilton, McGill University and Mila 29

Page 30:

Language modeling

[Figure: the same RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”.]

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: Sequence of words
§ E.g., predicting the next word that will occur in a sentence.

§ But now we have an output at each time-step!

§ Predicting the next word is a multiclass prediction problem (each word is a class), so use a softmax!
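A rough sketch (my own toy dimensions and random weights) of how an RNN language model produces a softmax over the vocabulary at every time step:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d_hid = 4, 3
    U = rng.normal(scale=0.1, size=(d_hid, vocab_size))
    W = rng.normal(scale=0.1, size=(d_hid, d_hid))
    V = rng.normal(scale=0.1, size=(vocab_size, d_hid))

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    h = np.zeros(d_hid)
    inputs = np.eye(vocab_size)          # one-hot vectors for the words "I loved the movie"
    for x in inputs:
        h = np.tanh(W @ h + U @ x)       # hidden state carries information from past words
        p_next = softmax(V @ h)          # distribution over the vocabulary for the next word
        print(p_next)                    # one softmax output per time step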

William L. Hamilton, McGill University and Mila 32

Page 31:

Language modeling

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: Sequence of words
§ E.g., predicting the next word that will occur in a sentence.

§ Usually add an “end of sentence” token.

[Figure: the same RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”, “<EOS>”.]

William L. Hamilton, McGill University and Mila 33

Page 32:

Language modeling

§ Input: Sequence of words
§ E.g., review, tweet, or news article

§ Output: Sequence of words
§ E.g., predicting the next word that will occur in a sentence.

§ Classic “sequence modeling” task.
§ Language modelling is the “backbone” of NLP.
§ Useful for machine translation, dialogue systems, automated captioning, etc.

[Figure: the same RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”, “<EOS>”.]

William L. Hamilton, McGill University and Mila 34

Page 33:

Training RNNs

§ How can we train RNNs?
§ Same as feed-forward networks: train with backpropagation on the unrolled computation graph!

William L. Hamilton, McGill University and Mila 35

Page 34:

Training RNNs

§ How can we train RNNs?
§ Same as feed-forward networks: train with backpropagation on the unrolled computation graph!

§ This is called backpropagation through time (BPTT).
§ Same derivation as regular backprop (use the chain rule).
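To make BPTT concrete, here is a tiny hand-rolled sketch (my own example, not code from the slides) for a scalar RNN with a single squared-error loss at the end of the sequence. Note how the gradient for the shared weight w accumulates contributions from every time step:

    import numpy as np

    # a scalar Elman RNN: h_t = tanh(w * h_{t-1} + u * x_t), loss only at the final step
    def forward(w, u, xs, y):
        hs = [0.0]
        for x in xs:
            hs.append(np.tanh(w * hs[-1] + u * x))
        loss = 0.5 * (hs[-1] - y) ** 2
        return hs, loss

    def bptt(w, u, xs, y):
        hs, _ = forward(w, u, xs, y)
        dw, du = 0.0, 0.0
        dh = hs[-1] - y                      # dL/dh_T
        for t in range(len(xs), 0, -1):
            da = dh * (1.0 - hs[t] ** 2)     # through the tanh
            dw += da * hs[t - 1]             # the same w is reused at every step, so gradients add up
            du += da * xs[t - 1]
            dh = da * w                      # flow back to h_{t-1}
        return dw, du

    xs, y, w, u = [0.5, -0.2, 0.8], 1.0, 0.3, 0.7
    print(bptt(w, u, xs, y))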

William L. Hamilton, McGill University and Mila 36

Page 35:

Training RNNs

§ BPTT is straightforward for sequence classification.

§ Gradient flows from the final prediction back through all the layers.

[Figure: the sentiment-classification RNN over “I loved the movie”, with a single output (0.7) at the end.]

William L. Hamilton, McGill University and Mila 37

Page 36:

Backpropagation through time

§ BPTT is straightforward for sequence classification.

§ Gradient flows from the final prediction back through all the layers.

[Figure: the same sentiment-classification RNN over “I loved the movie”.]

William L. Hamilton, McGill University and Mila 38

Page 37:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ Gradient flows from the prediction at each time-step to the preceding time-steps.

[Figure: the language-modeling RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”, “<EOS>”.]

William L. Hamilton, McGill University and Mila 39

Page 38:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ Conceptually, we can think of this as jointly training on three sequence classification tasks.

[Figure: the language-modeling RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”, “<EOS>”.]

William L. Hamilton, McGill University and Mila 40

Page 39:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ Conceptually, we can think of this as jointly training on three sequence classification tasks.

[Figure: the same RNN, with only the final target (“<EOS>”) shown as one of the classification sub-tasks.]

William L. Hamilton, McGill University and Mila 41

Page 40:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ Conceptually, we can think of this as jointly training on three sequence classification tasks.

[Figure: the same RNN restricted to a shorter prefix, showing another classification sub-task.]

William L. Hamilton, McGill University and Mila 42

Page 41:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ Conceptually, we can think of this as jointly training on three sequence classification tasks.

[Figure: the same RNN restricted to an even shorter prefix, showing the remaining classification sub-task.]

William L. Hamilton, McGill University and Mila 43

Page 42:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ For very long sequence/language modeling tasks, sometimes we “truncate” the gradient flow.

[Figure: the full language-modeling RNN over “I loved the movie”, with next-word targets “loved”, “the”, “movie”, “<EOS>”.]

William L. Hamilton, McGill University and Mila 44

Page 43:

Backpropagation through time

§ BPTT is less straightforward for language modeling.

§ For very long sequence/language modeling tasks, sometimes we “truncate” the gradient flow.
§ I.e., only the last K time-steps are used to train the prediction.

[Figure: the same RNN, with the gradient for the “<EOS>” prediction truncated to the last few time steps.]
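In the same scalar-RNN style as the earlier BPTT sketch (again my own toy example), truncation simply means stopping the backward loop after K steps:

    import numpy as np

    # scalar RNN as before: h_t = tanh(w * h_{t-1} + u * x_t), squared-error loss at the end
    def truncated_bptt(w, u, xs, y, K):
        hs = [0.0]
        for x in xs:
            hs.append(np.tanh(w * hs[-1] + u * x))
        dw, du = 0.0, 0.0
        dh = hs[-1] - y
        T = len(xs)
        for t in range(T, max(T - K, 0), -1):    # only the last K time-steps contribute
            da = dh * (1.0 - hs[t] ** 2)
            dw += da * hs[t - 1]
            du += da * xs[t - 1]
            dh = da * w
        return dw, du

    xs = [0.5, -0.2, 0.8, 0.1, -0.4]
    print(truncated_bptt(0.3, 0.7, xs, y=1.0, K=2))   # gradient flow stops 2 steps back
    print(truncated_bptt(0.3, 0.7, xs, y=1.0, K=5))   # full BPTT for comparison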

William L. Hamilton, McGill University and Mila 45

Page 44:

There are many ways to add recurrence

§ So far, we have been considering a standard/simple RNN.

§ Recurrence is between the hidden states:

§ This is called the Elman RNN.

h_t = σ(W h_{t−1} + U x_t + b)

[Figure: the unrolled RNN, with recurrent connections between consecutive hidden states.]

William L. Hamilton, McGill University and Mila 46

Page 45:

There are many ways to add recurrence

§ But there are other options!
§ E.g., recurrence based on the output:
§ This is called the Jordan RNN.

h_t = σ(W o_{t−1} + U x_t + b)

[Figure: the unrolled RNN, with recurrent connections from each output back to the next hidden state.]
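A small side-by-side sketch of the two recurrence patterns (my own toy dimensions, biases omitted for brevity); the only difference is whether the previous hidden state h or the previous output o feeds into the next hidden state:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 4, 3, 3
    U = rng.normal(scale=0.1, size=(d_hid, d_in))
    V = rng.normal(scale=0.1, size=(d_out, d_hid))
    W_elman = rng.normal(scale=0.1, size=(d_hid, d_hid))    # hidden -> hidden
    W_jordan = rng.normal(scale=0.1, size=(d_hid, d_out))   # output -> hidden

    def elman_step(h_prev, x):
        h = np.tanh(W_elman @ h_prev + U @ x)    # recurrence through the hidden state
        return h, np.tanh(V @ h)

    def jordan_step(o_prev, x):
        h = np.tanh(W_jordan @ o_prev + U @ x)   # recurrence through the previous output
        return h, np.tanh(V @ h)

    x = rng.normal(size=d_in)
    print(elman_step(np.zeros(d_hid), x)[1])
    print(jordan_step(np.zeros(d_out), x)[1])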

William L. Hamilton, McGill University and Mila 47

Page 46:

There are many ways to add recurrence

§ Q: Which is better?

§ Elman RNN:
h_t = σ(W h_{t−1} + U x_t + b)
o_t = σ(V h_t + c)

§ Jordan RNN:
h_t = σ(W o_{t−1} + U x_t + b)
o_t = σ(V h_t + c)

[Figure: Jordan RNN vs. Elman RNN unrolled side by side.]

William L. Hamilton, McGill University and Mila 48

Page 47:

There are many ways to add recurrence

§ Q: Which is better?

§ A: Elman RNN. Usually the output o is constrained in some way, and may be missing some important info from the past.

[Figure: Jordan RNN vs. Elman RNN unrolled side by side.]

William L. Hamilton, McGill University and Mila 49

Page 48:

There are many ways to add recurrence

§ Q: Which is better?

§ A: Elman RNN. Usually the output o is constrained in some way, and may be missing some important info from the past.

§ We can also add both types of recurrence at once!

[Figure: Jordan RNN vs. Elman RNN unrolled side by side.]

William L. Hamilton, McGill University and Mila 50

Page 49:

Beyond Elman and Jordan RNNs

§ Elman and Jordan RNNs are relatively straightforward.

§ But in practice they are very hard to train!

§ Issue: Multiplying by the same W matrix over and over is very unstable…

§ There are recurrent architectures that fix this! (Next lecture).

h_t = σ(W h_{t−1} + U x_t + b)

E.g., h_3 = σ(W σ(W σ(W h_0 + U x_1 + b) + U x_2 + b) + U x_3 + b)
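A quick numerical illustration of why this is unstable (my own example, using a scaled orthogonal W and ignoring the inputs and the nonlinearity): depending on whether W's largest singular value is below or above 1, the recurrent signal vanishes or explodes:

    import numpy as np

    rng = np.random.default_rng(0)
    h0 = rng.normal(size=8)

    for scale in [0.5, 1.5]:                                    # largest singular value below / above 1
        W = scale * np.linalg.qr(rng.normal(size=(8, 8)))[0]    # orthogonal matrix times a scale
        h = h0.copy()
        for _ in range(50):
            h = W @ h                                           # linearized recurrence, applied 50 times
        print(scale, np.linalg.norm(h))                         # tiny (~1e-15) for 0.5, huge (~1e9) for 1.5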

William L. Hamilton, McGill University and Mila 51