Sequence Modeling with Neural Networks
Transcript of Harini Suresh’s lecture “Sequence Modeling with Neural Networks”
Sequence Modeling with Neural Networks
Harini Suresh
MIT 6.S191 | Intro to Deep Learning | IAP 2018
[Figure: an unfolded RNN — inputs x0, x1, x2; states s0, s1, s2, …; outputs y0, y1, y2]
What is a sequence?
● a sentence: “This morning I took the dog for a walk.”
● medical signals
● a speech waveform
Successes of deep models
● machine translation
● question answering
Left: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Right: https://rajpurkar.github.io/SQuAD-explorer/
a sequence modeling problem: predict the next word
“This morning I took the dog for a walk.”
given these words, predict what comes next?
idea: use a fixed window
“This morning I took the dog for a walk.”
given these 2 words (“for a”), predict the next word
for a → [ 1 0 0 0 0 0 1 0 0 0 ] → prediction
a one-hot feature vector indicates what each word is
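A minimal Python sketch of this window encoding — the ten-word vocabulary, window size, and helper names are illustrative assumptions, not the lecture’s setup:

```python
# Encode the last 2 words as concatenated one-hot vectors (fixed window).
import numpy as np

vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk", "."]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# window of the 2 most recent words, e.g. "for a"
window = ["for", "a"]
x = np.concatenate([one_hot(w) for w in window])  # shape: (2 * vocab size,)
print(x)  # this fixed-length vector would feed a classifier over the vocabulary
```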
problem: we can’t model long-term dependencies
“In France, I had a great time and I learnt some of the _____ language.”
We need information from the far past and future to accurately guess the correct word.
idea: use the entire sequence, as a set of counts (“bag of words”)
This morning I took the dog for a → [ 0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1 ] → prediction
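A hedged sketch of the bag-of-words encoding with the same toy vocabulary — note that shuffling the words leaves the vector unchanged, which is exactly the problem the next slide raises:

```python
# Bag of words: one count per vocabulary word, all order information discarded.
from collections import Counter
import numpy as np

vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk", "."]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(words):
    counts = Counter(words)
    v = np.zeros(len(vocab))
    for w, c in counts.items():
        v[word_to_idx[w]] = c
    return v

x = bag_of_words("this morning i took the dog for a".split())
print(x)  # any reordering of the words yields exactly the same vector
```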
problem: counts don’t preserve order
“The food was good, not bad at all.” vs
“The food was bad, not good at all.”
idea: use a really big fixed window
“This morning I took the dog for a walk.”
given these 7 words, predict the next word
[ 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 ... ] → prediction
(one one-hot block per word: this, morning, I, took, the, dog, ...)
problem: no parameter sharing
[ 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 ... ] ← “this morning” at the start of the window
[ 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ] ← “this morning” later in the window
each of these inputs has a separate parameter
things we learn about the sequence won’t transfer if they appear at different points in the sequence.
to model sequences, we need:
1. to deal with variable-length sequences
2. to maintain sequence order
3. to keep track of long-term dependencies
4. to share parameters across the sequence
let’s turn to recurrent neural networks.
example network:
[Figure: a feedforward network with input, hidden, and output layers]
let’s take a look at this one hidden unit
RNNs remember their previous state:
t = 0: the input x0 (“it”) comes in through weights W, the previous state s0 through weights U, and together they produce the new state s1.
t = 1: the input x1 (“was”) and the state s1 produce the new state s2.
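Written out (the slides show only the diagram, so the formula below is the standard simple-RNN update, with W on the input and U on the previous state as the arrows suggest):

$$s_{t+1} = f\left(W x_t + U s_t\right)$$

where f is a nonlinearity such as tanh.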
“unfolding” the RNN across time:
[Figure: the RNN unrolled along the time axis — each input x0, x1, x2, … enters through W, and each state s0 → s1 → s2 → … passes forward through U]
notice that we use the same parameters, W and U, at every timestep
sn can contain information from all past timesteps
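A numpy sketch of the unrolled computation — the dimensions, tanh nonlinearity, and zero initial state are assumptions; the point is that one W and one U serve every timestep and any sequence length:

```python
# Unrolling an RNN: the same W and U are reused at every timestep.
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim = 4, 3
W = rng.normal(size=(state_dim, input_dim))  # input -> state weights
U = rng.normal(size=(state_dim, state_dim))  # state -> state weights

def forward(xs, s0):
    """Run the RNN over a whole sequence, reusing W and U each step."""
    states = [s0]
    for x in xs:                      # works for any sequence length
        s = np.tanh(W @ x + U @ states[-1])
        states.append(s)
    return states

xs = [rng.normal(size=input_dim) for _ in range(5)]
states = forward(xs, s0=np.zeros(state_dim))
print(len(states))  # s0 .. s5: the final state has seen every input
```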
how do we train an RNN?
backpropagation! (through time)
remember: backpropagation
1. take the derivative (gradient) of the loss with respect to each parameter
2. shift parameters in the opposite direction in order to minimize loss
we have a loss at each timestep:
(since we’re making a prediction at each timestep)
[Figure: the unrolled RNN, now with an output y0, y1, y2, … produced through V at each timestep, and a corresponding loss J0, J1, J2, …]
we sum the losses across time:
loss at time t: $J_t(\theta)$
total loss: $J(\theta) = \sum_t J_t(\theta)$
where $\theta$ = our parameters, like the weights
what are our gradients?
we sum gradients across time for each parameter P:
$$\frac{\partial J}{\partial P} = \sum_t \frac{\partial J_t}{\partial P}$$
let’s try it out for W with the chain rule:
so let’s take a single timestep, say t = 2:
$$\frac{\partial J_2}{\partial W} = \frac{\partial J_2}{\partial y_2} \frac{\partial y_2}{\partial s_2} \frac{\partial s_2}{\partial W}$$
but wait… s1 also depends on W, so we can’t just treat $\frac{\partial s_2}{\partial W}$ as if s1 were a constant!
how does s2 depend on W?
s2 depends on W directly, and also through s1, which in turn depends on W through s0 — so we sum the contributions from every previous timestep:
$$\frac{\partial s_2}{\partial W} = \sum_{k=0}^{2} \frac{\partial s_2}{\partial s_k} \frac{\partial s_k}{\partial W}$$
backpropagation through time:
$$\frac{\partial J_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial J_t}{\partial y_t} \frac{\partial y_t}{\partial s_t} \frac{\partial s_t}{\partial s_k} \frac{\partial s_k}{\partial W}$$
the sum captures the contributions of W in previous timesteps to the error at timestep t
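To make the formula concrete, here is a scalar (1-dimensional state) sketch of backpropagation through time, checked against a finite difference; the tanh update, squared loss, and the specific numbers are illustrative assumptions:

```python
# Scalar BPTT: with a 1-D state, the Jacobian products become plain products.
import numpy as np

w, u = 0.5, 0.8                       # input weight and recurrent weight
xs = [1.0, -0.5, 0.25]                # a tiny input sequence
targets = [0.5, 0.0, -0.5]            # made-up targets for a squared loss

def states(w, u):
    s, out = 0.0, []
    for x in xs:
        s = np.tanh(w * x + u * s)    # s_t = f(w x_t + u s_{t-1})
        out.append(s)
    return out

def loss(w, u):
    return sum((s - t) ** 2 for s, t in zip(states(w, u), targets))

ss = states(w, u)
pre = [w * xs[j] + u * (ss[j - 1] if j > 0 else 0.0) for j in range(len(xs))]
fp = [1.0 - np.tanh(p) ** 2 for p in pre]     # f' at each timestep

grad_w = 0.0
for t in range(len(xs)):
    dJt_dst = 2.0 * (ss[t] - targets[t])
    for k in range(t + 1):
        # ds_t/ds_k: product of one-step derivatives ds_j/ds_{j-1} = u * f'_j
        dst_dsk = np.prod([u * fp[j] for j in range(k + 1, t + 1)])
        dsk_dw = fp[k] * xs[k]                # direct dependence of s_k on w
        grad_w += dJt_dst * dst_dsk * dsk_dw

eps = 1e-6                                    # finite-difference sanity check
numeric = (loss(w + eps, u) - loss(w - eps, u)) / (2 * eps)
print(grad_w, numeric)                        # the two should agree closely
```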
why are RNNs hard to train?
problem: vanishing gradient
at k = 0, the gradient has to flow back through every intermediate state:
$$\frac{\partial s_2}{\partial s_0} = \frac{\partial s_2}{\partial s_1} \cdot \frac{\partial s_1}{\partial s_0}$$
for a state sn far from s0, the same expansion keeps growing:
$$\frac{\partial s_n}{\partial s_0} = \frac{\partial s_n}{\partial s_{n-1}} \cdot \frac{\partial s_{n-1}}{\partial s_{n-2}} \cdots \frac{\partial s_1}{\partial s_0}$$
as the gap between timesteps gets bigger, this product gets longer and longer!
what are each of these terms?
each factor $\frac{\partial s_{j+1}}{\partial s_j}$ involves the weight matrix and the derivative of the activation f:
W: sampled from a standard normal distribution, so mostly < 1
f: tanh or sigmoid, so f′ < 1
we’re multiplying a lot of small numbers together.
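A one-liner makes the effect tangible:

```python
# The core of the vanishing gradient problem: a long product of factors
# smaller than 1 shrinks exponentially toward zero.
import numpy as np
factors = np.full(50, 0.9)            # 50 timesteps, each factor < 1
print(np.prod(factors))               # ~0.005: the signal from 50 steps back
```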
so what?
errors due to timesteps further back have increasingly small gradients, so parameters become biased to capture shorter-term dependencies.
“In France, I had a great time and I learnt some of the _____ language.”
our parameters are not trained to capture long-term dependencies, so the word we predict will mostly depend on the previous few words, not the much earlier ones.
solution #1: activation functions
[Figure: the derivatives of ReLU, sigmoid, and tanh — the ReLU derivative is 1 for positive inputs, unlike sigmoid and tanh]
using ReLU prevents f′ from shrinking the gradients
solution #2: initialization
weights initialized to the identity matrix
biases initialized to zeros
prevents W from shrinking the gradients
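A sketch of what this initialization looks like (pairing it with ReLU, per solution #1, is my assumption — the slide doesn’t fix an activation):

```python
# Identity-initialized recurrent weights: at the start of training, the
# pre-activation state is simply copied from one timestep to the next.
import numpy as np

state_dim, input_dim = 3, 4
U = np.identity(state_dim)            # recurrent weights start as the identity
W = np.random.default_rng(0).normal(scale=0.01, size=(state_dim, input_dim))
b = np.zeros(state_dim)               # biases start at zero

s = np.zeros(state_dim)
x = np.ones(input_dim)
s_next = np.maximum(0.0, W @ x + U @ s + b)   # ReLU step with this init
print(s_next)
```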
a different type of solution: more complex cells
solution #3: gated cells
rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through.
RNN vs LSTM, GRU, etc.
Long short-term memory (LSTM) cells are able to keep track of information throughout many timesteps.
solution #3: more on LSTMs
[Figure: the LSTM cell carries a cell state from cj to cj+1]
at each step, the LSTM:
1. forgets irrelevant parts of the previous state
2. selectively updates the cell state values
3. outputs certain parts of the cell state
[Figure: inside the cell — a forget gate, input gate, and output gate act on the cell state cj → cj+1, taking the input xt and state st and producing st+1]
why do LSTMs help?
1. the forget gate allows information to pass through unchanged
2. the cell state is separate from what’s outputted
3. cj depends on cj-1 through addition! → derivatives don’t expand into a long product! (see the sketch below)
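The slides only show the diagram, so the sketch below uses the standard textbook LSTM gate equations (an assumption about the exact parameterization):

```python
# One LSTM step with the standard gate equations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, s_prev, c_prev, params):
    """One timestep: returns the new output state s and cell state c."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([s_prev, x])            # gates see previous state + input
    f = sigmoid(Wf @ z + bf)                   # forget gate: what to keep in c
    i = sigmoid(Wi @ z + bi)                   # input gate: what to write to c
    o = sigmoid(Wo @ z + bo)                   # output gate: what to expose as s
    c = f * c_prev + i * np.tanh(Wc @ z + bc)  # additive cell-state update
    s = o * np.tanh(c)
    return s, c

state_dim, input_dim = 3, 4
rng = np.random.default_rng(0)
params = ([rng.normal(size=(state_dim, state_dim + input_dim)) for _ in range(4)]
          + [np.zeros(state_dim) for _ in range(4)])
s, c = lstm_step(rng.normal(size=input_dim),
                 np.zeros(state_dim), np.zeros(state_dim), params)
print(s, c)
```

Note the additive cell-state update line: that is point 3 above — gradients can flow through the addition without expanding into a long product.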
possible task: classification (e.g. sentiment)
[Figure: the RNN reads the input words (“don’t”, “fly”, …, “luggage”) through W and U; only the final state sn feeds through V to produce the prediction y = negative]
y is a probability distribution over possible classes (like positive, negative, neutral), aka a softmax
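A sketch of this pipeline with random stand-in weights (the dimensions and the 3-class setup are assumptions, not a trained model):

```python
# Sequence classification: run the RNN over the whole sequence, then map
# only the final state through V and a softmax over the classes.
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim, n_classes = 5, 4, 3   # e.g. positive, negative, neutral
W = rng.normal(size=(state_dim, input_dim))
U = rng.normal(size=(state_dim, state_dim))
V = rng.normal(size=(n_classes, state_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(state_dim)
for x in [rng.normal(size=input_dim) for _ in range(6)]:  # the word vectors
    s = np.tanh(W @ x + U @ s)

y = softmax(V @ s)           # probability distribution over the classes
print(y, y.sum())            # sums to 1
```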
possible task: music generation
[Audio demo of RNN-generated music. Music by: Francesco Marchesani, Computer Science Engineer, PoliMi]
[Figure: the unrolled RNN takes <start>, E, D as inputs and outputs E, D, F#, each output produced through V]
yi is actually a probability distribution over possible next notes, aka a softmax
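A sketch of the generation loop: sample a note from the softmax, feed it back as the next input. The four-note vocabulary and the untrained weights are stand-ins:

```python
# Autoregressive generation: each sampled note becomes the next input.
import numpy as np

rng = np.random.default_rng(0)
notes = ["<start>", "E", "D", "F#"]
n, state_dim = len(notes), 8
W = rng.normal(size=(state_dim, n))
U = rng.normal(size=(state_dim, state_dim))
V = rng.normal(size=(n, state_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(state_dim)
idx, melody = 0, []                      # start from the <start> token
for _ in range(8):
    x = np.eye(n)[idx]                   # one-hot of the previous note
    s = np.tanh(W @ x + U @ s)
    probs = softmax(V @ s)               # distribution over possible next notes
    idx = rng.choice(n, p=probs)         # sample the next note
    melody.append(notes[idx])
print(melody)
```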
possible task: machine translation
[Figure: an encoder RNN reads “the dog eats” into states s0, s1, s2; a decoder RNN is seeded with s2 and a <start> token and generates “le”, “chien”, “mange”, <end>, feeding each predicted word back in as the next input]
problem: a single encoding is limiting
all the decoder knows about the input sentence is in one fixed-length vector, s2
solution: attend over all encoder states
[Figure: at each decoding step, instead of relying only on the final encoder state, the decoder combines all encoder states s0, s1, s2 into a weighted summary s* and conditions on that]
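A sketch of one attention step: score each encoder state against the current decoder state, softmax the scores, and take a weighted sum as the context s*. Dot-product scoring is just one simple choice — the lecture doesn’t fix a particular scoring function:

```python
# Attention over encoder states: weights come from a softmax over scores.
import numpy as np

rng = np.random.default_rng(0)
state_dim = 4
encoder_states = [rng.normal(size=state_dim) for _ in range(3)]  # s0, s1, s2
decoder_state = rng.normal(size=state_dim)                       # current state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([decoder_state @ s for s in encoder_states])   # one per state
weights = softmax(scores)                 # how much to attend to each state
s_star = sum(w * s for w, s in zip(weights, encoder_states))     # context s*
print(weights, s_star)                    # s* feeds the next decoding step
```

In a full model the scores would typically come from a learned function of the decoder and encoder states, but the softmax-then-weighted-sum structure is the same.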
now we can model sequences!
● why recurrent neural networks?
● training them with backpropagation through time
● solving the vanishing gradient problem with activation functions, initialization, and gated cells (like LSTMs)
● building models for classification, music generation and machine translation
● using attention mechanisms
and there’s lots more to do!
● extending our models to timeseries + waveforms
● complex language models to generate long text or books
● language models to generate code
● controlling cars + robots
● predicting stock market trends
● summarizing books + articles
● handwriting generation
● multilingual translation models
● … many more!