Deep Learning
Recurrent Neural Network (RNNs)
Ali Ghodsi
University of Waterloo
October 23, 2015
Slides are partially based on the book in preparation, Deep Learning, by Bengio, Goodfellow, and Courville, 2015
Ali Ghodsi Deep Learning

Sequential data
Recurrent neural networks (RNNs) are often used for handling sequential data.
They were first introduced in 1986 (Rumelhart et al., 1986).
Sequential data usually involves variable-length inputs.

Parameter sharing
Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them.
[Figure: two networks with inputs x1, x2 and output y1, illustrating parameter sharing]

Recurrent neural network
Figure: Richard Socher

Dynamic systems
The classical form of a dynamical system:
s_t = f(s_{t-1})

Dynamic systems
Now consider a dynamical system with an external signal x_t:
s_t = f(s_{t-1}, x_t)
The state contains information about the whole past sequence:
s_t = g_t(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1)
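As a concrete illustration, this recurrence can be unrolled in a few lines of Python. The transition function f below is an arbitrary illustrative choice, not one given in the slides:

```python
import numpy as np

def f(s_prev, x):
    # Hypothetical smooth transition function; any f(s_{t-1}, x_t) would do.
    return np.tanh(s_prev + x)

def unroll(x_seq, s0=0.0):
    """Apply s_t = f(s_{t-1}, x_t) along the sequence, returning every state."""
    states = []
    s = s0
    for x in x_seq:
        s = f(s, x)  # the same f is reused at every time step
        states.append(s)
    return states

states = unroll([1.0, -0.5, 0.2])
```

Note that the final state depends on every input seen so far, which is exactly the g_t view of the state above.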

Parameter sharing
We can think of s_t as a summary of the past sequence of inputs up to time t.
If we defined a different function g_t for each possible sequence length, we would not get any generalization.
If the same parameters are used for every sequence length, we obtain much better generalization.

Recurrent Neural Networks
a_t = b + W s_{t-1} + U x_t
s_t = \tanh(a_t)
o_t = c + V s_t
p_t = \mathrm{softmax}(o_t)
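These four equations can be sketched directly in NumPy. The layer sizes and random parameter values below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2

U = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden-to-output weights
b = np.zeros(n_hid)
c = np.zeros(n_out)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(x_seq):
    """a_t = b + W s_{t-1} + U x_t; s_t = tanh(a_t); p_t = softmax(c + V s_t)."""
    s = np.zeros(n_hid)
    probs = []
    for x in x_seq:
        a = b + W @ s + U @ x
        s = np.tanh(a)
        probs.append(softmax(c + V @ s))
    return probs

probs = rnn_forward([rng.normal(size=n_in) for _ in range(5)])
```

Each p_t is a probability vector over the output classes at step t.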

Computing the Gradient in a Recurrent Neural Network
Using the generalized back-propagation algorithm, one can obtain the so-called Back-Propagation Through Time (BPTT) algorithm.

Exploding or Vanishing Product of Jacobians
In recurrent nets (and also in very deep nets), the final output is the composition of a large number of nonlinear transformations.
Even if each of these nonlinear transformations is smooth, their composition might not be.
The derivatives through the whole composition will tend to be either very small or very large.

Exploding or Vanishing Product of Jacobians
The Jacobian (matrix of derivatives) of a composition is the product of the Jacobians of each stage. If
f = f_T \circ f_{T-1} \circ \cdots \circ f_2 \circ f_1, where (f \circ g)(x) = f(g(x)),
then by the chain rule
(f \circ g)'(x) = (f' \circ g)(x) \, g'(x) = f'(g(x)) \, g'(x)

Exploding or Vanishing Product of Jacobians
The Jacobian matrix of derivatives of f(x) with respect to its input vector x is
f' = f_T' \, f_{T-1}' \cdots f_2' \, f_1'
where
f' = \frac{\partial f(x)}{\partial x}
and
f_t' = \frac{\partial f_t(a_t)}{\partial a_t},
where a_t = f_{t-1}(f_{t-2}(\cdots f_2(f_1(x)))).

Exploding or Vanishing Product of Jacobians
Simple example
Suppose all the numbers in the product are scalars and have the same value \alpha.
Multiplying many such numbers together tends to be either very large or very small.
As T goes to \infty:
\alpha^T goes to \infty if \alpha > 1
\alpha^T goes to 0 if \alpha < 1
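The scalar case is easy to check numerically; T = 100 here is an arbitrary illustrative length:

```python
# Repeated multiplication by the same factor either explodes or vanishes
# as the number of factors T grows.
T = 100
grow = 1.1 ** T   # factor > 1: the product explodes
decay = 0.9 ** T  # factor < 1: the product vanishes
```

With T = 100, the first product is already above 10^4 while the second is below 10^-4.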

Difficulty of Learning Long-Term Dependencies
Consider a general dynamical system with parameters \theta:
s_t = f_\theta(s_{t-1}, x_t)
o_t = g_w(s_t)
A loss L_t is computed at time step t as a function of o_t and some target y_t. At time T:
\frac{\partial L_T}{\partial \theta} = \sum_{t \le T} \frac{\partial L_T}{\partial s_t} \frac{\partial s_t}{\partial \theta}
\frac{\partial L_T}{\partial \theta} = \sum_{t \le T} \frac{\partial L_T}{\partial s_T} \frac{\partial s_T}{\partial s_t} \frac{\partial f_\theta(s_{t-1}, x_t)}{\partial \theta}
where
\frac{\partial s_T}{\partial s_t} = \frac{\partial s_T}{\partial s_{T-1}} \frac{\partial s_{T-1}}{\partial s_{T-2}} \cdots \frac{\partial s_{t+1}}{\partial s_t}
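The factor ds_T/ds_t can be formed numerically. For a tanh recurrence s_t = tanh(W s_{t-1} + U x_t), each per-step Jacobian is diag(1 - s_t^2) W; the sketch below (with illustrative random parameters and sequence length) multiplies them together:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
W = rng.normal(scale=0.5, size=(n, n))
U = rng.normal(scale=0.5, size=(n, 1))

s = np.zeros(n)
step_jacobians = []
for _ in range(20):
    s = np.tanh(W @ s + U @ rng.normal(size=1))
    step_jacobians.append(np.diag(1 - s**2) @ W)  # ds_t / ds_{t-1}

# ds_T/ds_t is the ordered product of the per-step Jacobians.
J = np.eye(n)
for J_step in step_jacobians:
    J = J_step @ J
```

Depending on the scale of W, the singular values of this product typically collapse toward zero (vanishing gradients) or blow up (exploding gradients) as the number of factors grows.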

Facing the challenge
Gradients propagated over many stages tend to either vanish (most of the time) or explode.

Echo State Networks
Set the recurrent and input weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights.
s_t = \sigma(W s_{t-1} + U x_t)
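A minimal echo state network along these lines can be sketched as follows. The reservoir size, the 0.9 spectral-radius scaling, and the next-step sine-prediction task are all illustrative assumptions; only the linear readout is learned:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res, T = 1, 50, 200

# Fixed random input and recurrent weights (never trained).
U = rng.normal(scale=0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # make the map slightly contractive

def reservoir_states(x_seq):
    """Collect s_t = tanh(W s_{t-1} + U x_t) for every step; W and U stay fixed."""
    s = np.zeros(n_res)
    states = []
    for x in x_seq:
        s = np.tanh(W @ s + U @ x)
        states.append(s.copy())
    return np.array(states)

# Toy task: predict the next input value from the current reservoir state.
x_seq = np.sin(np.linspace(0, 8 * np.pi, T + 1)).reshape(-1, 1)
S = reservoir_states(x_seq[:-1])
targets = x_seq[1:, 0]
readout, *_ = np.linalg.lstsq(S, targets, rcond=None)  # the only learned weights
pred = S @ readout
```

Because only the output weights are trained, learning reduces to a linear least-squares problem.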

Echo State Networks
If a change \Delta s in the state at time t is aligned with an eigenvector v of the Jacobian J with eigenvalue \lambda > 1, then the small change \Delta s becomes \lambda \Delta s after one time step, and \lambda^t \Delta s after t time steps.
If the largest eigenvalue \lambda < 1, the map from t to t + 1 is contractive.
The network then forgets information about the long-term past.
Solution: set the weights to make the Jacobians slightly contractive.

Long delays
Use recurrent connections with long delays.
Figure from Bengio et al., 2015

Leaky Units
Recall that
s_t = \sigma(W s_{t-1} + U x_t)
Consider instead
s_{t,i} = \left(1 - \frac{1}{\tau_i}\right) s_{t-1,i} + \frac{1}{\tau_i} \, \sigma(W s_{t-1} + U x_t)_i, \qquad \tau_i \ge 1
\tau_i = 1: ordinary RNN.
\tau_i > 1: gradients propagate more easily.
\tau_i \gg 1: the state changes very slowly, integrating the past values associated with the input sequence.
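A leaky-unit update is a one-line change to the ordinary RNN step. In this sketch the per-unit time constants tau are a fixed, hand-picked vector (an assumption of the illustration; they could also be learned):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 2, 3
U = rng.normal(scale=0.1, size=(n_hid, n_in))
W = rng.normal(scale=0.1, size=(n_hid, n_hid))
tau = np.array([1.0, 5.0, 50.0])  # tau=1: ordinary RNN unit; tau>>1: slow unit

def leaky_step(s_prev, x):
    # Blend the previous state with the fresh update, per unit.
    fresh = np.tanh(W @ s_prev + U @ x)
    return (1 - 1 / tau) * s_prev + (1 / tau) * fresh

s = np.zeros(n_hid)
for _ in range(10):
    s = leaky_step(s, rng.normal(size=n_in))
```

The unit with tau = 50 changes very slowly, acting as a running average of its inputs.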

Gated RNNs
It might be useful for the neural network to forget the old state in some cases.
Example: a a b b b a a a a b a b
It might be useful to keep the memory of the past.
Example:
Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it.

Gated RNNs, the Long Short-Term Memory
The Long Short-Term Memory (LSTM) algorithm was proposed in 1997 (Hochreiter and Schmidhuber, 1997).
Several variants of the LSTM are found in the literature:
Hochreiter and Schmidhuber, 1997
Graves, 2012
Graves et al., 2013
Sutskever et al., 2014
The principle is always to have a linear self-loop through which gradients can flow for a long duration.

Gated Recurrent Units (GRU)
Recent work on gated RNNs, the Gated Recurrent Unit (GRU), was proposed in 2014:
Cho et al., 2014
Chung et al., 2014, 2015
Jozefowicz et al., 2015
Chrupala et al., 2015


Gated Recurrent Units (GRU)
A standard RNN computes the hidden layer at the next time step directly:
h_t = f(W h_{t-1} + U x_t)
A GRU first computes an update gate (another layer) based on the current input vector and hidden state:
z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
It computes a reset gate similarly, but with different weights:
r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
Credit: Richard Socher

Gated Recurrent Units (GRU)
Update gate: z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
Reset gate: r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
New memory content: \tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})
If the reset gate is 0, this ignores the previous memory and stores only the new information.
Final memory at time step t combines the current and previous time steps:
h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t


Gated Recurrent Units (GRU)
z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})
h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t
If the reset gate is close to 0, the unit ignores the previous hidden state, allowing the model to drop information that is irrelevant.
The update gate z controls how much of the past state should matter now. If z is close to 1, information in that unit can be copied through many time steps.
Units with short-term dependencies often have very active reset gates.
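One GRU step can be sketched directly from these gate equations. The layer sizes and random parameters below are illustrative assumptions, and sigma is the logistic sigmoid:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4

Wz, Wr, W = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(3))
Uz, Ur, U = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(3))

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def gru_step(h_prev, x):
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x + r * (U @ h_prev))  # new memory content
    return z * h_prev + (1 - z) * h_tilde        # final memory

h = np.zeros(n_hid)
for _ in range(5):
    h = gru_step(h, rng.normal(size=n_in))
```

Note the element-wise products: each unit gates its own slice of the state independently.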

The Long Short-Term Memory (LSTM)
We can make the units even more complex. Allow each time step to modify:
Input gate (does the current input matter?): i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})
Forget gate (0 = forget the past): f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})
Output gate (how much the cell is exposed): o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})
New memory cell: \tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})
Final memory cell: c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
Final hidden state: h_t = o_t \circ \tanh(c_t)
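One LSTM step, following the equations above; the sizes and random parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hid = 3, 4

Wi, Wf, Wo, Wc = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(4))

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def lstm_step(h_prev, c_prev, x):
    i = sigmoid(Wi @ x + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev)  # new memory cell
    c = f * c_prev + i * c_tilde             # final memory cell: linear self-loop
    h = o * np.tanh(c)                       # final hidden state
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):
    h, c = lstm_step(h, c, rng.normal(size=n_in))
```

The cell update c_t = f_t * c_{t-1} + i_t * c_tilde is the linear self-loop through which gradients can flow for a long duration.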

Clipping Gradients
Figure: Pascanu et al., 2013
Strongly nonlinear functions tend to have derivatives that can beeither very large or very small in magnitude.

Clipping Gradients
A simple solution is to clip the gradient (Mikolov, 2012; Pascanu et al., 2013):
Clip the parameter gradient from a minibatch element-wise (Mikolov, 2012) just before the parameter update.
Clip the norm ||g|| of the gradient g (Pascanu et al., 2013) just before the parameter update.
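Both clipping strategies are a few lines each; the threshold value of 1.0 is an illustrative assumption:

```python
import numpy as np

def clip_elementwise(g, threshold):
    """Clip each gradient component into [-threshold, threshold] (Mikolov-style)."""
    return np.clip(g, -threshold, threshold)

def clip_norm(g, threshold):
    """Rescale g so its norm is at most threshold (Pascanu et al.-style)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([3.0, -4.0])          # norm 5
g_elem = clip_elementwise(g, 1.0)  # -> [1.0, -1.0]
g_norm = clip_norm(g, 1.0)         # -> [0.6, -0.8], norm 1
```

Norm clipping preserves the gradient direction, while element-wise clipping can change it; both bound the size of the update.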