Recurrent Neural Networks (RNNs)
Dr. Benjamin Roth, Nina Poerner
CIS, LMU München
Today
9:15 - 10:45: RNN Basics + CNN
11:00 - 11:45: PyTorch exercise
Instead of an exercise sheet, work through by next week:
- http://www.deeplearningbook.org/contents/rnn.html (Sections 10.0 - 10.2.1, 10.7, 10.10)
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Next week:
- 9:15 - 10:00: "Journal Club" on LSTMs
- 10:00 - 10:45: Keras (Part 2)
- 11:00 - 11:45: Word2Vec exercise
Recurrent Neural Networks (RNNs)
A family of neural networks for processing sequential data x(1), ..., x(T).
Sequences of words, characters, ...
Simplest case: for each time step t, compute a representation h(t) from the current input x(t) and the previous representation h(t−1).
h(t) = f (h(t−1), x(t); θ)
x(t) can be embeddings, one-hot, output of some previous layer ...
Question: By recursion, what does h(t) depend on?
- all previous inputs x(1) ... x(t)
- the initial state h(0) (typically all-zero, but not necessarily; cf. encoder-decoder)
- the parameters θ
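A minimal Python sketch of this recurrence (the names are illustrative; the concrete choice of f is introduced later as the "vanilla" RNN):

def rnn_forward(xs, h0, f, theta):
    # Apply the same transition f with the same parameters theta at every step.
    h = h0
    hs = []
    for x_t in xs:               # x(1) ... x(T)
        h = f(h, x_t, theta)     # h(t) is computed from h(t-1), x(t) and theta
        hs.append(h)
    return hs                    # by recursion, h(t) depends on x(1)...x(t), h(0) and theta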
Parameter Sharing
Going from time step t−1 to time step t is parameterized by the same parameters θ for all t!
h(t) = f (h(t−1), x(t); θ)
Question: Why is parameter sharing a good idea?
- Fewer parameters
- Can learn to detect features regardless of their position
- Can generalize to longer sequences than were seen in training
RNN: Output
The output at time t is computed from the hidden representation at time t:
o(t) = f(h(t); V_o)
Typically a linear transformation: o(t) = V_o^T h(t)
Some RNNs compute o(t) at every time step, others only at the last time step o(T)
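A minimal NumPy sketch of such a linear output layer (the shape of V_o and the helper name are illustrative assumptions):

import numpy as np

d_h, d_out = 10, 5
V_o = np.random.randn(d_h, d_out)     # output parameters, assumed shape (d_h, d_out)

def output(h_t):
    return V_o.T @ h_t                # o(t) = V_o^T h(t), one score per output class

# n-to-1: apply output() only to h(T); n-to-n (tagging): apply it to every h(t).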
RNN: Output
[Figure: three ways of producing outputs from an RNN, unrolled over time]
- n-to-1: hidden states h(1) ... h(T), a single output o(T) at the last time step → Sentiment classification, topic classification, ...
- tagging: an output o(t) at every time step → POS tagging, NER tagging, Language Model, ...
- tagging (selective) / Sequence2Sequence: an encoder reads the input, a decoder produces the outputs → Machine Translation, Summarization, Inflection, ...
Any questions so far?
RNN: Loss Function
Loss function:
- Several time steps: L(y(1), ..., y(T); o(1), ..., o(T))
- Last time step only: L(y; o(T))
Example: POS Tagging
- Output o(t) is the predicted distribution over POS tags
  - o(t) = P(tag = ? | h(t))
  - Typically: o(t) = softmax(V_o^T h(t))
- Loss at time t: negative log-likelihood (NLL) of the true label y(t):
  L(t) = − log P(tag = y(t) | h(t); V_o)
- Overall loss for all time steps: L = Σ_{t=1}^{T} L(t)
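A minimal NumPy sketch of the per-step NLL and the summed loss (hypothetical helper names):

import numpy as np

def nll_step(o_t, y_t):
    # o_t: predicted distribution over tags at time t, y_t: index of the true tag
    return -np.log(o_t[y_t])          # L(t) = -log P(tag = y(t) | h(t); V_o)

def total_loss(outputs, labels):
    # outputs: o(1) ... o(T), labels: y(1) ... y(T)
    return sum(nll_step(o_t, y_t) for o_t, y_t in zip(outputs, labels))   # L = sum_t L(t)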
Graphical Notation
Nodes indicate input data (x) or function outputs (otherwise).
Arrows indicate function arguments.
Compact notation (left):
- All time steps conflated.
- The black square indicates a "delay" of 1 time step.
Source: Goodfellow et al.: Deep Learning.
Graphical Notation: Including Output and Loss Function
Source: Goodfellow et al.: Deep Learning.
Any questions so far?
Backpropagation through time
We have calculated the loss L at time step T and want to update θ with gradient descent.
For now, imagine that we have time-step-specific "dummy" parameters θ(t), which are identical copies of θ
→ the unrolled RNN looks like a feed-forward neural network!
→ we can calculate ∂L/∂θ(t) using standard backpropagation
Question: How to calculate ∂L/∂θ?
Add up the "dummy" gradients:
∂L/∂θ = Σ_{t=1}^{T} ∂L/∂θ(t)
Truncated backpropagation through time
Simple idea: Stop backpropagation through time after k time steps
∂L/∂θ ≈ Σ_{t=T−k}^{T} ∂L/∂θ(t)
Question: What are advantages and disadvantages?
- Advantage: Faster and parallelizable
- Disadvantage: If k is too small, long-range dependencies are hard to learn
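A hedged PyTorch sketch of one common practical variant (the module names rnn, classifier and the data layout are illustrative assumptions; detaching the hidden state cuts the graph so gradients flow through at most k steps):

import torch

def tbptt_step(rnn, classifier, loss_fn, optimizer, xs, ys, h, k=10):
    # xs: (T, batch, d_in), ys: (T, batch), h: initial hidden state
    for start in range(0, xs.size(0), k):
        h = h.detach()                          # cut the graph: stop BPTT here
        optimizer.zero_grad()
        out, h = rnn(xs[start:start+k], h)      # unroll at most k time steps
        logits = classifier(out)                # (k, batch, n_classes)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       ys[start:start+k].reshape(-1))
        loss.backward()                         # backpropagate through <= k steps
        optimizer.step()
    return h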
Any questions so far?
Vanilla RNN
h(t) = f(h(t−1), x(t); θ) = tanh(U x(t) + W h(t−1) + b)
θ = {W, U, b}
W: Hidden-to-hidden
U: Input-to-hidden
b: Bias term
Vanilla RNN in Keras:

from keras.layers import SimpleRNN

vanilla = SimpleRNN(units=10, use_bias=True)
vanilla.build(input_shape=(None, None, 30))   # (batch, time, features)
print([weight.shape for weight in vanilla.get_weights()])
[(30, 10), (10, 10), (10,)]
Question: Which shape belongs to which weight?
Bidirectional RNNs
Conceptually: Two RNNs that run in opposite directions over the same input
Typically, each RNN has its own set of parameters
Two sequences of hidden vectors: →h(1), ..., →h(T) and ←h(1), ..., ←h(T)
Typically, →h and ←h are concatenated along their hidden dimension
Question: Which hidden vectors should we concatenate if our output layer needs a single hidden vector h?
- h = →h(T) || ←h(1)
- Because these are the vectors that have "read" the entire sequence
Question: Which hidden vectors should we concatenate if we need one hidden vector per time step t?
- h(t) = →h(t) || ←h(t)
- Full left context, full right context
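A hedged Keras sketch using the Bidirectional wrapper (hidden size 10 per direction is an assumption, so concatenation gives 20):

from keras.layers import Input, SimpleRNN, Bidirectional
from keras.models import Model

inputs = Input(shape=(None, 30))                       # (time, features)

# one concatenated vector per time step: h(t) = →h(t) || ←h(t)
per_step = Bidirectional(SimpleRNN(10, return_sequences=True),
                         merge_mode='concat')(inputs)  # (time, 20)

# a single vector: last forward state || last backward state (= ←h(1))
single = Bidirectional(SimpleRNN(10), merge_mode='concat')(inputs)  # (20,)

model = Model(inputs, [per_step, single])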
Multi-Layer RNNs
Conceptually: A stack of L RNNs, such that x_l(t) = h_{l−1}(t).
[Figure: a 2-layer RNN unrolled over time steps 1 to 5; the hidden states h_1(t) of layer 1 are the inputs of layer 2, which computes h_2(t).]
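A hedged Keras sketch of a 2-layer (stacked) RNN (layer sizes are illustrative assumptions):

from keras.models import Sequential
from keras.layers import SimpleRNN

stacked = Sequential([
    SimpleRNN(10, return_sequences=True, input_shape=(None, 30)),  # layer 1: emits h_1(1) ... h_1(T)
    SimpleRNN(10),                                                 # layer 2: its inputs are x_2(t) = h_1(t)
])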
Feeding outputs back
What do we do if the input sequence x(1) ... x(T) is only given at training time, but not at test time?
Examples: Machine Translation decoder, (generative) language model
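A minimal sketch of feeding the model's own prediction back as the next input at test time (the step function and token names are hypothetical):

import numpy as np

def generate(step, h, start_token, end_token, max_len=50):
    # step(h, w) is assumed to return (probs over vocabulary, new hidden state)
    w, out = start_token, []
    for _ in range(max_len):
        probs, h = step(h, w)
        w = int(np.argmax(probs))   # predicted word w(t)
        if w == end_token:
            break
        out.append(w)               # w(t) is fed back as the input at step t+1
    return out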
Example: Machine Translation
[Figure: decoder of an encoder-decoder model, unrolled over three time steps. h(0) is the last encoder state (has "read" the source). At each step t, the decoder computes h(t) and a predicted distribution o(t) over words; the loss L(t) compares o(t) to the true word y(t). The predicted word w(t) (or, with an oracle, the true word y(t)) is embedded and fed back as the input x(t+1).]
Oracle signal
Give the neural network a signal that it will not have at test time
Can be useful during training (e.g., mix oracle and predicted signal)
Can establish upper bounds on the performance of individual modules
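A minimal sketch of mixing oracle and predicted signals during training (hypothetical helper; at test time only the prediction is available):

import random

def next_decoder_input(y_prev, w_prev, oracle_prob=0.5, training=True):
    # y_prev: true previous word (oracle), w_prev: the model's previous prediction
    if training and random.random() < oracle_prob:
        return y_prev           # oracle signal, not available at test time
    return w_prev               # model's own prediction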
Gated RNNs: Teaser
Vanilla RNNs are not frequently used, as they tend to forget past information quickly
Instead: LSTM, GRU, ... (next week!)