Recurrent Neural Networks (RNNs)
Keith L. Downing
The Norwegian University of Science and Technology (NTNU), Trondheim, [email protected]
November 7, 2017
Keith L. Downing Recurrent Neural Networks (RNNs)
Recurrent Networks for Sequence Learning
[Diagram: a strictly feed-forward network vs. a recurrent network]
Feedback connections create directed cycles, which produce dynamic memory: current neural activation states reflect past activation states.
History-Dependent Activation Patterns
Grandma made me breakfast
Trump made me nauseous
The same two inputs produce different internal patterns due to different prior contexts: Grandma -vs- Trump.
Sequence Completion = Prediction
[Diagram: the prefixes "Trump made me" and "Grandma made me" drive the network into different internal states, which map to the completions "nauseous" and "breakfast" respectively.]
By modifying this mapping from internal activation states to outputs (via weight modification, i.e. learning), the net can predict the next word in a sentence.
Time-Delay Nets
Include history in the input, so no recurrence is needed; similar to NetTalk's use of context.
[Diagram: the words "Grandma", "made", "me" presented simultaneously as Input(T-2), Input(T-1), Input(T); the net outputs "breakfast".]
But this requires n-grams as input, and there are many of them! It also yields over-sensitivity to word position: Trump made me nauseous and Trump made me very nauseous would be handled very differently, because they yield different n-grams.
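The window construction above can be sketched in a few lines (a hypothetical 3-word context; the function name is illustrative, not from the slides):

```python
# Build fixed-size input windows (n-grams) for a time-delay net.
def windows(tokens, n=3):
    """Return all length-n input windows over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "Trump made me nauseous".split()
b = "Trump made me very nauseous".split()
print(windows(a))  # [('Trump', 'made', 'me'), ('made', 'me', 'nauseous')]
print(windows(b))  # the single extra word shifts every subsequent window
```

Only the first window is shared between the two sentences, which is exactly the over-sensitivity to word position noted above.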
Elman -vs- Jordan Nets
[Diagrams: Jordan net - the context layer receives a copy of the output layer; Elman net - the context layer receives a copy of the hidden layer. In both, the context layer feeds back into the hidden layer alongside the input layer.]
The Elman net should be more versatile, since its internal context is free to assume many activation patterns (i.e., representations), whereas the context of the Jordan net is restricted to that of the target patterns.
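The difference amounts to two update rules, sketched below in NumPy (illustrative layer sizes; U, W, Wj, V are assumed weight matrices, not taken from the slides):

```python
import numpy as np

# Elman: the hidden state feeds back; Jordan: the output feeds back.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))    # input -> hidden
W = rng.normal(size=(4, 4))    # Elman context: previous hidden -> hidden
Wj = rng.normal(size=(4, 2))   # Jordan context: previous output -> hidden
V = rng.normal(size=(2, 4))    # hidden -> output

def elman_step(x, h_prev):
    h = np.tanh(U @ x + W @ h_prev)   # context = previous hidden state
    return h, V @ h

def jordan_step(x, o_prev):
    h = np.tanh(U @ x + Wj @ o_prev)  # context = previous output
    return h, V @ h
```

Because the Elman context is the hidden layer itself, it can represent anything the hidden layer can; the Jordan context is confined to the (usually smaller) output space.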
Backpropagation Through Time (BPTT)
[Diagram: the recurrent net (Input → Hidden → Output, with weights U = input-to-hidden, W = hidden-to-hidden, V = hidden-to-output) unfolded in time across steps T-k, ..., T-1, T. During backpropagation, errors E against Target(T-k), ..., Target(T-1), Target(T) flow backwards through the unrolled graph.]
Since U, V, and W are shared across all time steps, for each case take the average of the recommended updates for U, V, and W.
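For a scalar RNN the unrolled computation is small enough to verify by hand. The sketch below (illustrative, not from the slides) accumulates the per-step contributions to ∂L/∂w across the unrolled graph and checks the result against a numerical gradient; the contributions are summed here, and averaging would simply rescale by the number of steps:

```python
import numpy as np

# Scalar RNN: h_t = tanh(u*x_t + w*h_{t-1}), loss L = 0.5*(h_T - y)^2.
def forward(u, w, xs, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(u * x + w * hs[-1]))
    return hs

def bptt_dw(u, w, xs, y):
    """Accumulate the per-step contributions to dL/dw over the unrolled graph."""
    hs = forward(u, w, xs)
    dL_dh = hs[-1] - y                  # gradient at the final hidden state
    dw = 0.0
    for t in range(len(xs), 0, -1):
        ds = dL_dh * (1 - hs[t] ** 2)   # back through tanh
        dw += ds * hs[t - 1]            # this step's recommended update to w
        dL_dh = ds * w                  # pass the gradient one step back in time
    return dw

xs, y, u, w = [0.5, -0.3, 0.8], 0.2, 1.1, 0.7
analytic = bptt_dw(u, w, xs, y)
eps = 1e-6   # central-difference check of the analytic gradient
num = (0.5 * (forward(u, w + eps, xs)[-1] - y) ** 2
       - 0.5 * (forward(u, w - eps, xs)[-1] - y) ** 2) / (2 * eps)
```

The analytic and numerical gradients agree to high precision, confirming that summing the per-step updates for the shared weight is the correct unrolled gradient.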
Applications of Recurrent NNs
1 Language Modeling - Given a sequence of words, predict the next one.
2 Text Generation - Given a few words, produce a meaningful sequence.
3 Machine Translation - Given a sequence of words in one language, produce a semantically equivalent sequence in another language.
4 Speech Recognition - Given a sequence of acoustic signals, produce a sequence of phonemes.
5 Image Captioning - Given an image, produce a short text description. This typically requires RNNs + Convolution Nets.
Most of the top RNN systems for natural language (NL) applications now use Long Short-Term Memory (LSTM; Hochreiter + Schmidhuber (1997)).
RNNs used for any NL task yield useful internal vector representations (a.k.a. embeddings) for NL expressions, wherein semantically similar expressions have similar vectors.
Recurrent Architectures: Elman Net
[Diagram: Elman net unrolled over four time steps. At each step t: input Xt, hidden Ht, output Ot, loss Lt, target Yt, with each hidden state Ht feeding into Ht+1. Inputs X1-X4: "Well" "we" "busted" "out"; targets Y1-Y4: "we" "busted" "out" "of"; sample outputs O1-O4: "you" "know" "down" "of".]
Recurrent Architectures: Jordan Net
[Diagram: Jordan net unrolled over four time steps. At each step t: input Xt, hidden Ht, output Ot, loss Lt, target Yt, with each output Ot (rather than the hidden state) feeding into Ht+1. Inputs: "Well" "we" "busted" "out"; targets: "we" "busted" "out" "of"; sample outputs: "you" "know" "down" "of".]
Recurrent Architectures: Teacher Forcing
[Diagram: teacher forcing unrolled over four time steps. At each step t: input Xt, hidden Ht, output Ot, loss Lt, target Yt, with each target Yt (rather than the output Ot) feeding into Ht+1. Inputs: "Well" "we" "busted" "out"; targets: "we" "busted" "out" "of"; sample outputs: "you" "know" "down" "of".]
During testing, the output (not the target) feeds back to H.
Since Y is independent of H, the gradients do not propagate backwards in time: the full power of BPTT is not needed for learning.
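A minimal sketch of the two modes (illustrative sizes and weight names; feedback enters the hidden layer through an assumed matrix Wf):

```python
import numpy as np

# fb = what feeds back into the hidden layer: the target during training
# (teacher forcing) or the net's own output during testing.
rng = np.random.default_rng(1)
U = rng.normal(size=(4, 3))    # input -> hidden
Wf = rng.normal(size=(4, 2))   # feedback -> hidden
V = rng.normal(size=(2, 4))    # hidden -> output

def step(x, fb):
    h = np.tanh(U @ x + Wf @ fb)
    return V @ h

def run(xs, ys=None):
    fb, outs = np.zeros(2), []
    for t, x in enumerate(xs):
        o = step(x, fb)
        outs.append(o)
        fb = ys[t] if ys is not None else o   # teacher forcing vs. free-running
    return outs

xs = [rng.normal(size=3) for _ in range(3)]
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
forced = run(xs, ys)    # training mode: targets feed back
free = run(xs)          # testing mode: outputs feed back
```

The two modes agree on the first step (both start from zero feedback) and then generally diverge, which is why a net trained only with teacher forcing can behave differently at test time.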
Recurrent Architectures: Delayed Supervision
[Diagram: delayed supervision. Inputs X1...Xn ("Well" "we" "busted" ... "surrender") drive hidden states H1...Hn, but only the final hidden state Hn produces an output O, scored by loss L against a target Y (e.g. "Springsteen" vs. "Lady Gaga").]
Recurrent Architectures: Captioning
[Diagram: captioning. A single input X1 (the image) initializes H1; later steps receive no new input. Each Ht produces an output Ot scored by loss Lt against target Yt. Sample captions: "Skinny" "nerd" "on" "Halloween"; "Sketchy" "dude" "hacking" "Mastercard".]
Calculating RNN Gradients
Standard calculations once the function graph is unrolled
[Diagram: the unrolled net with inputs "Well", "we", ..., target "surrender", and the shared weights labeled at every step. U, V, W = Weights.]

∂L/∂H1 = (∂O1/∂H1) • (∂L1/∂O1) + (∂H2/∂H1) • (∂L/∂H2)
Calculating RNN Gradients
The effects of Hi upon L involve all paths from Hi to Li, Li+1, ..., Ln.
[Diagram: the paths from H1 to the loss: directly through O1 to L1, and through H2 onward to L2, ..., Ln.]

∂L/∂H1 = (∂O1/∂H1) • (∂L1/∂O1) + (∂H2/∂H1) • (∂L/∂H2)
Jacobian involving the Hidden Layer
Derivatives of Loss w.r.t. Hidden Layer

Note: the hidden layer's activation levels change with each recurrent loop, so there is no single Jacobian ∂L/∂H. We must calculate ∂L/∂Hi ∀i:

∂L/∂Hi = (∂Oi/∂Hi) • (∂Li/∂Oi) + (∂Hi+1/∂Hi) • (∂L/∂Hi+1)

This recurrent relationship can create a long product:

∂L/∂Hi = ··· + (∂Hi+1/∂Hi) • (∂Hi+2/∂Hi+1) • ··· • (∂Hn/∂Hn−1) • (∂L/∂Hn) + ···

Each term includes the weights (W):

∂Hk+1/∂Hk = (∂sk+1/∂Hk) × (∂H(sk+1)/∂sk+1) = W^T × (∂H(sk+1)/∂sk+1)

where sk+1 = the sum of weighted inputs to H at time k+1.
∂Hk+1/∂Hk
∂Hk+1/∂Hk =

| ∂h1,k+1/∂h1,k   ∂h1,k+1/∂h2,k   ···   ∂h1,k+1/∂hm,k |
| ∂h2,k+1/∂h1,k   ∂h2,k+1/∂h2,k   ···   ∂h2,k+1/∂hm,k |
|       ⋮               ⋮          ⋱         ⋮        |
| ∂hm,k+1/∂h1,k   ∂hm,k+1/∂h2,k   ···   ∂hm,k+1/∂hm,k |
Finishing the Calculation of ∂Hk+1/∂Hk
Assume H uses tanh as its activation function: ∂tanh(x)/∂x = 1 − tanh(x)².

si,k+1 = sum of weighted inputs to hidden node hi on the k+1 recursion cycle.
hi,k+1 = output of hidden node i on cycle k+1.

∂H(sk+1)/∂sk+1 =

| ∂h1,k+1/∂s1,k+1 |   | 1 − h1,k+1² |
| ∂h2,k+1/∂s2,k+1 | = | 1 − h2,k+1² |
|        ⋮        |   |      ⋮      |

For each node hi in H, there is a) one row in ∂H(sk+1)/∂sk+1 and b) one row in W^T (for all incoming weights to hi).
So multiply each weight in the ith row of W^T by the ith member of the column vector to produce ∂Hk+1/∂Hk.
Finishing the Calculation of ∂Hk+1/∂Hk
∂Hk+1/∂Hk =

| (1 − h1,k+1²)w11   (1 − h1,k+1²)w21   ···   (1 − h1,k+1²)wm1 |
| (1 − h2,k+1²)w12   (1 − h2,k+1²)w22   ···   (1 − h2,k+1²)wm2 |
|        ⋮                   ⋮                       ⋮         |

Since row-by-row multiplication is not a standard array operation, the column vector can be converted into a diagonal matrix (with the only non-zero values along the main diagonal), denoted diag(1 − H²). Then we can use standard matrix multiplication (•):

∂Hk+1/∂Hk = diag(1 − H²) • W^T =

| 1 − h1,k+1²       0        ···  0 |   | w11  w21  w31  ··· |
|      0       1 − h2,k+1²   ···  0 | • | w12  w22  ···  ··· |
|      ⋮            ⋮         ⋱   ⋮ |   |  ⋮    ⋮         ⋮  |
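In NumPy, the diag-based product and the row-wise scaling it replaces give identical results (a small illustrative check with m = 3; the matrices here are random stand-ins):

```python
import numpy as np

# The hidden-to-hidden Jacobian: diag(1 - H^2) @ W.T,
# which equals scaling row i of W.T by (1 - h_i^2).
rng = np.random.default_rng(0)
m = 3
W = rng.normal(size=(m, m))             # hidden-to-hidden weights
h = np.tanh(rng.normal(size=m))         # H_{k+1} = tanh(s_{k+1})
J = np.diag(1 - h**2) @ W.T             # standard matrix multiplication
J_rowwise = (1 - h**2)[:, None] * W.T   # same result via broadcasting, no diag
```

In practice libraries use the broadcast form; the diagonal matrix is mainly a notational device for writing the Jacobian as a standard matrix product.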
Jacobians of Loss w.r.t. Weights
Notice the sums across all recurrent cycles (t).
∂L/∂V = Σ(t=1..n) (∂Ot/∂V) • (∂Lt/∂Ot)

where ∂Ot/∂V is a 3-d array containing the values of Ht; the details of ∂Lt/∂Ot depend upon the actual loss function and activation function of O.

∂L/∂U = Σ(t=1..n) (∂Ht/∂U) • (∂L/∂Ht)

where ∂Ht/∂U is a 3-d array containing the values of Xt.

∂L/∂W = Σ(t=1..n−1) (∂Ht+1/∂W) • (∂L/∂Ht+1)

where ∂Ht+1/∂W is a 3-d array containing the values of Ht.

These gradient products include the activations of X and H. When these approach 0, the gradients vanish. When the sigmoid or tanh is near saturation (0 or 1 for sigmoid; −1 or 1 for tanh), the diagonal matrix in ∂Hk+1/∂Hk approaches zero, and the gradients vanish again.
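The vanishing effect is easy to see numerically: repeatedly applying the Jacobian diag(1 − H²) • W^T shrinks a gradient vector geometrically when the weights are small. An illustrative sketch (the weight scale 0.1 is chosen to make the effect obvious; large weights can make the same product explode instead):

```python
import numpy as np

# Backpropagate a gradient vector through 30 recurrent steps.
rng = np.random.default_rng(0)
m = 5
W = 0.1 * rng.normal(size=(m, m))   # small hidden-to-hidden weights
h = np.tanh(rng.normal(size=m))
J = np.diag(1 - h**2) @ W.T         # one-step Jacobian, dH_{k+1}/dH_k
g = np.ones(m)                      # gradient arriving at the last step
norms = []
for _ in range(30):
    g = J @ g                       # one more step back in time
    norms.append(np.linalg.norm(g))
# norms shrink toward zero: the gradient vanishes
```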
Handling Uncooperative Gradients
Recurrence creates long product gradients which often vanish (and occasionally explode).
How can we combat this?
Innovative Network Architectures
Reservoir Computing: Echo-State Machines and Liquid-State Machines
Gated RNNs: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)
Explicit Memory - Neural Turing Machine
Supplementary Techniques
Multiple Timescales - using different degrees of leak/forgetting and skip connections.
Gradient Clipping - setting an upper bound on gradient magnitude.
Regularizing - adding penalty for gradient decay to the loss function.
Reservoir Computing
[Diagram: input a feeds a reservoir of many neurons with random, fixed weights and recurrent connections; only the reservoir-to-output weights are modifiable. Training cases: (a, a) (a, b) (b, a) (a, c) (c, e). Testing: given a, produce (a, b, a, c, e).]
Echo State Machines (ESM) and Liquid State Machines (LSM)
MANY reservoir neurons with random and recurrent connections allow sequences of inputs to create complex internal states (a.k.a. contexts).
Only tuneable synapses: reservoir to outputs → no long gradients.
Drawback: initializing reservoir weights is tricky.
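A minimal echo-state sketch (all sizes and scales are illustrative): the reservoir weights are random and fixed, and only the readout is fit, here by least squares on a one-step-ahead sine-prediction task. Rescaling the reservoir to spectral radius below 1, so that past inputs "echo" and fade, is one common answer to the tricky-initialization problem:

```python
import numpy as np

# Fixed random reservoir; trainable linear readout only.
rng = np.random.default_rng(0)
n_res = 50
Win = rng.normal(scale=0.5, size=n_res)              # input -> reservoir (fixed)
Wres = rng.normal(size=(n_res, n_res))               # reservoir -> reservoir (fixed)
Wres *= 0.9 / np.max(np.abs(np.linalg.eigvals(Wres)))  # spectral radius < 1

def reservoir_states(xs):
    h, states = np.zeros(n_res), []
    for x in xs:
        h = np.tanh(Win * x + Wres @ h)              # drive the fixed reservoir
        states.append(h.copy())
    return np.array(states)

xs = np.sin(np.linspace(0, 8 * np.pi, 200))          # toy signal
H = reservoir_states(xs[:-1])                        # reservoir states over time
Wout, *_ = np.linalg.lstsq(H, xs[1:], rcond=None)    # fit readout: predict next value
pred = H @ Wout
```

No gradient ever flows through the recurrent loop: training reduces to one linear regression from states to targets.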
Even RNNs Forget
Repeated multiplication by hidden-to-hidden weights, along with the integration of new inputs, causes memories (i.e. hidden-node states) to fade over time.
For some tasks, e.g. reinforcement learning, this is desirable: older information should have a diminishing effect upon current decision-making.
In other tasks, such as natural-language processing, some key pieces of old information need to retain influence.
I found my bicycle in the same town and on the same street, but not on the same front porch from which I had reported to the police that it had been stolen.
What was stolen?
A clear memory of bicycle must be maintained throughout processing of the entire sentence (or paragraph, or story).
Long Short-Term Memory (LSTM)
Invented by Hochreiter and Schmidhuber in 1997.
Breakthrough in 2009: Won Int'l Conf. on Doc. Analysis and Recognition (ICDAR) competition.
2013: Record for speech recognition on DARPA's TIMIT dataset.
Currently used in commercial speech-recognition systems by major players: Google, Apple, Microsoft...
Critical basis of RNN success, which is otherwise very limited.
Hundreds, if not thousands, of LSTM variations. One of the most popular is the GRU (Gated Recurrent Unit), which has fewer gates but still good performance. Also, Phased LSTMs handle asynchronous input.
Easily combined with other NN components, e.g., convolution.
Increased memory retention → hidden layers continue to produce (some) high values even when external inputs are low → gradients do not vanish.
See Understanding LSTM Networks. Colah’s Blog (August, 2015).
The LSTM Module
[Diagram: the LSTM module at time t. Input Xt and the previous output Ht−1 feed three sigmoid (σ) gates (forget, input, output) and a tanh layer; the cell state runs along the top from Ct−1 to Ct, modified by pointwise × and + operations. Legend: σ = sigmoid gate; Ct = cell state; Ht = cell output; Xt = cell input; boxes = layers of nodes; ovals = pointwise ops.]
Long Short-Term Memory (LSTM)
Yellow boxes are full layers of nodes that are preceded by a matrix of learnable weights and that produce a vector of values.
Orange ovals have unweighted vectors as inputs. They perform their operation (+ or ×) pointwise: either individually on each item of a vector, or pairwise with corresponding elements of two vectors (e.g. the state vector and the forget vector). A.k.a. the Hadamard Product.
The cell's internal state, C (a vector), runs along the top (red line) of the diagram.
C is selectively modified, as controlled by the forget and input gates.
The yellow tanh produces suggested contributions to Ct−1.
The modified state (Ct) is preserved (along the red line) and passed to the next time step.
Ct also passes through the (orange) tanh, whose result is also gated to produce the cell's output value, Ht.
Ht along with the next input Xt+1 comprise the complete set of weighted inputs to the 3 gates and (yellow) tanh at time t+1.
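The whole module can be written down in a few lines. A sketch with illustrative sizes (the weight and bias names are assumptions, not from the slides); each gate and the tanh layer take the concatenation [Xt, Ht−1] as input:

```python
import numpy as np

# One LSTM step: sigmoid forget/input/output gates plus a tanh candidate.
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
Wf, Wi, Wo, Wc = [rng.normal(scale=0.3, size=(n_h, n_x + n_h)) for _ in range(4)]
bf, bi, bo, bc = [np.zeros(n_h) for _ in range(4)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])   # [X_t, H_{t-1}]: input to all four layers
    f = sigmoid(Wf @ z + bf)          # forget gate
    i = sigmoid(Wi @ z + bi)          # input gate
    o = sigmoid(Wo @ z + bo)          # output gate
    c_tilde = np.tanh(Wc @ z + bc)    # suggested contribution to the state
    c = f * c_prev + i * c_tilde      # pointwise state update along the "red line"
    h = o * np.tanh(c)                # gated cell output
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):   # run a short sequence
    h, c = lstm_step(x, h, c)
```

Note that the state update c = f·c_prev + i·c_tilde is additive, which is what lets gradients flow across many steps without the long multiplicative products of a plain RNN.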
Supplementary Techniques for Gradient Management
These are minor adjustments, applicable to many NN architectures.
Gradient Clipping
Oversized gradients → large jumps in search space → risky in rugged error landscapes.
Reduce problem by restricting gradients to a pre-defined range.
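A common form is clipping by global norm (a sketch; the threshold 5.0 is illustrative):

```python
import numpy as np

# Rescale all gradients together so their global norm never exceeds max_norm.
def clip_by_norm(grads, max_norm=5.0):
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [g * scale for g in grads]

grads = [np.array([30.0, 40.0])]    # norm 50 -> rescaled down to norm 5
clipped = clip_by_norm(grads)
```

Rescaling by the global norm preserves the gradient's direction; clipping each component independently would not.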
Gradient-Sensitive Regularization
Develop a metric for amount of gradient decay across recurrent steps.
Include this value as a penalty in the loss function.
Facilitate Multiple Time Scales
Skip connections over time - implement delays by linking the output of a layer at time t directly to input at time t+d.
Variable leak rates - neurons have diverse leak parameters. More leak → more forgetting → shorter timescale of information integration.