Transcript of: Computing the Gradient in an RNN (University at Buffalo, srihari/CSE676/10.2.2 RNN-Gradients.pdf)

Page 1: Computing the Gradient in an RNN

Sargur Srihari

[email protected]


This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/CSE676

Page 2: Topics in Sequence Modeling

• Overview
1. Unfolding Computational Graphs
2. Recurrent Neural Networks
3. Bidirectional RNNs
4. Encoder-Decoder Sequence-to-Sequence Architectures
5. Deep Recurrent Networks
6. Recursive Neural Networks
7. The Challenge of Long-Term Dependencies
8. Echo-State Networks
9. Leaky Units and Other Strategies for Multiple Time Scales
10. LSTM and Other Gated RNNs
11. Optimization for Long-Term Dependencies
12. Explicit Memory


Page 3: Topics in Recurrent Neural Networks

0. Overview
1. Teacher Forcing and Networks with Output Recurrence
2. Computing the Gradient in an RNN
3. RNNs as Directed Graphical Models
4. Modeling Sequences Conditioned on Context with RNNs


Page 4: Training an RNN is straightforward

1. Determine the gradient using backpropagation
   • Apply generalized backpropagation to the unrolled computational graph (algorithm repeated on the next slide)
   • No specialized algorithms are necessary
2. Gradient descent
   • Gradients obtained by backpropagation are used with any general-purpose gradient-based technique to train an RNN

• This procedure is called BPTT: Back-Propagation Through Time (a minimal sketch follows below)
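A minimal sketch of these two steps on a toy scalar RNN (this model, the names w and u, and the squared-error loss are illustrative assumptions for the sketch, not the network used later in these slides):

```python
import numpy as np

# Toy scalar RNN: h_t = tanh(w*h_{t-1} + u*x_t), loss L = sum_t (h_t - y_t)^2.
# Step 1 is BPTT: ordinary backprop applied to the unrolled graph.
# Step 2 is any gradient-based update; here, vanilla gradient descent.
def bptt_step(w, u, xs, ys, lr=0.1):
    # Forward pass: unroll the graph, storing every hidden state h_t
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    loss = sum((h - y) ** 2 for h, y in zip(hs[1:], ys))

    # Backward pass: walk the unrolled graph in reverse order
    dw = du = dh = 0.0          # dh carries the gradient arriving from the future
    for t in reversed(range(len(xs))):
        dh += 2.0 * (hs[t + 1] - ys[t])      # local loss term at step t
        da = dh * (1.0 - hs[t + 1] ** 2)     # back through tanh
        dw += da * hs[t]                     # shared weight: one term per step
        du += da * xs[t]
        dh = da * w                          # gradient passed on to h_{t-1}

    return w - lr * dw, u - lr * du, loss

w, u = 0.5, 0.5
xs, ys = [1.0, -1.0, 1.0], [0.5, -0.5, 0.5]
for _ in range(100):
    w, u, loss = bptt_step(w, u, xs, ys)
print(w, u, loss)   # loss shrinks as gradient descent proceeds
```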


Page 5: General Backpropagation to compute gradient

• To compute the gradients of a variable z with respect to a set of variables T in a computational graph G
• [The slide reproduces the generalized back-propagation algorithm; a sketch follows below]
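The algorithm on the slide is, in essence, a memoized recursion over the graph. A stripped-down, runnable rendering on the two-node chain z = sin(x^2) follows; the dict-based graph and the per-node bprop lambdas are simplifying assumptions of this sketch, not the textbook's exact interface:

```python
import numpy as np

# Hand-built graph for z = sin(y), y = x^2, evaluated at x = 2
values = {"x": 2.0}
values["y"] = values["x"] ** 2           # y = g(x)
values["z"] = np.sin(values["y"])        # z = f(y)
consumers = {"x": ["y"], "y": ["z"]}     # nodes computed from each node
bprop = {                                # local chain-rule step for each node
    "y": lambda G: G * 2.0 * values["x"],     # given dz/dy, return dz/dy * dy/dx
    "z": lambda G: G * np.cos(values["y"]),   # given dz/dz, return dz/dz * dz/dy
}

def build_grad(V, grad_table):
    """Compute dz/dV by recursing from V toward z, memoizing in grad_table."""
    if V in grad_table:
        return grad_table[V]
    # Sum one chain-rule contribution per consumer of V
    g = sum(bprop[C](build_grad(C, grad_table)) for C in consumers[V])
    grad_table[V] = g
    return g

grad_table = {"z": 1.0}                  # seed the recursion: dz/dz = 1
print(build_grad("x", grad_table))       # equals cos(x^2) * 2x at x = 2
```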

Page 6: Build-grad function of Generalized Backprop

• If y = g(x) and z = f(y), then from the chain rule

  \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}, \qquad \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}

• Equivalently, in vector notation,

  \nabla_x z = \left(\frac{\partial y}{\partial x}\right)^{\top} \nabla_y z

  where \partial y / \partial x is the n \times n Jacobian matrix of g
• op.bprop(inputs, X, G) returns

  \sum_i \left(\nabla_X \texttt{op.f(inputs)}_i\right) G_i

  which is just an implementation of the chain rule
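For instance, taking op.f to be elementwise tanh (an illustrative choice), bprop can return the Jacobian-transpose-times-gradient product directly, without materializing the n x n Jacobian; the explicit-Jacobian check at the end confirms it matches \nabla_x z = (\partial y / \partial x)^\top \nabla_y z:

```python
import numpy as np

class TanhOp:
    """Elementwise tanh with the book-style f / bprop interface."""
    def f(self, x):
        return np.tanh(x)
    def bprop(self, inputs, X, G):
        # tanh has a single input, so X is that input; the Jacobian is
        # diagonal, and (dy/dx)^T G reduces to an elementwise product.
        (x,) = inputs
        return (1.0 - np.tanh(x) ** 2) * G

op = TanhOp()
x = np.array([0.3, -1.2, 0.7])
G = np.array([1.0, 0.5, -2.0])           # stands for grad of z wrt y = tanh(x)

J = np.diag(1.0 - np.tanh(x) ** 2)       # explicit n x n Jacobian of tanh
assert np.allclose(op.bprop([x], x, G), J.T @ G)
```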

Page 7: Example of how the BPTT algorithm behaves

• BPTT: Back-Propagation Through Time
• We provide an example of how to compute gradients by BPTT for the RNN equations


Page 8: Intuition of how the BPTT algorithm behaves

• Example: gradients by BPTT for the RNN equations given above


• The nodes of our computational graph include the parameters U, V, W, b, c as well as the sequence of nodes indexed by t: x^{(t)}, h^{(t)}, o^{(t)} and L^{(t)}
• For each node N we need to compute the gradient \nabla_N L recursively, based on the gradients computed at the nodes that follow it in the graph (a forward pass building these nodes is sketched below)
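A runnable sketch of the forward pass that creates these nodes, assuming the standard RNN equations from the earlier slides: a^{(t)} = b + W h^{(t-1)} + U x^{(t)}, h^{(t)} = \tanh(a^{(t)}), o^{(t)} = c + V h^{(t)}, \hat{y}^{(t)} = \mathrm{softmax}(o^{(t)}), and L^{(t)} = -\log \hat{y}^{(t)}_{y^{(t)}} (sizes and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 4, 5, 3, 6              # illustrative sizes

# Parameter nodes of the graph: U, V, W, b, c
U = rng.normal(scale=0.1, size=(n_h, n_in))
W = rng.normal(scale=0.1, size=(n_h, n_h))
V = rng.normal(scale=0.1, size=(n_out, n_h))
b, c = np.zeros(n_h), np.zeros(n_out)

xs = rng.normal(size=(T, n_in))               # inputs x(1..T)
ys = rng.integers(n_out, size=T)              # true targets y(1..T)

# Sequence nodes indexed by t: h(t), o(t), yhat(t), L(t)
hs, os_, yhats, Ls = [np.zeros(n_h)], [], [], []
for t in range(T):
    a = b + W @ hs[-1] + U @ xs[t]            # a(t) = b + W h(t-1) + U x(t)
    h = np.tanh(a)                            # h(t) = tanh(a(t))
    o = c + V @ h                             # o(t) = c + V h(t)
    yhat = np.exp(o - o.max())                # softmax over the outputs
    yhat /= yhat.sum()
    hs.append(h)
    os_.append(o)
    yhats.append(yhat)
    Ls.append(-np.log(yhat[ys[t]]))           # L(t) = -log yhat_{y(t)}
L = sum(Ls)                                   # total loss over the sequence
```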

Page 9: Computing BPTT gradients recursively

• Start the recursion with the nodes immediately preceding the final loss:

  \frac{\partial L}{\partial L^{(t)}} = 1

• We assume that the outputs o^{(t)} are used as arguments to the softmax to obtain the vector \hat{y}^{(t)} of probabilities over the output, and that the loss is the negative log-likelihood of the true target y^{(t)} given the input so far
• The gradient on the outputs at time step t is then

  (\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \mathbf{1}_{i, y^{(t)}}

• Work backwards starting from the end of the sequence: at the final time step \tau, h^{(\tau)} has only o^{(\tau)} as a descendant, so

  \nabla_{h^{(\tau)}} L = V^{\top} \nabla_{o^{(\tau)}} L
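A sketch of these two formulas (the sizes, random values, and variable names are illustrative):

```python
import numpy as np

def grad_o(o, y):
    """(grad wrt o)_i = yhat_i - 1[i == y] for softmax + negative log-likelihood."""
    yhat = np.exp(o - o.max())
    yhat /= yhat.sum()
    g = yhat.copy()
    g[y] -= 1.0                       # subtract the one-hot true target
    return g

rng = np.random.default_rng(1)
n_h, n_out = 4, 3
V = rng.normal(size=(n_out, n_h))     # from o(t) = c + V h(t)

g_o_tau = grad_o(rng.normal(size=n_out), y=2)
# At the final step tau, h(tau)'s only descendant is o(tau):
g_h_tau = V.T @ g_o_tau
print(g_h_tau)
```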

Page 10: Gradients over hidden units

• Iterate backwards in time to backpropagate gradients from t = \tau - 1 down to t = 1, noting that h^{(t)} (for t < \tau) has both o^{(t)} and h^{(t+1)} as descendants:

  \nabla_{h^{(t)}} L = W^{\top} \mathrm{diag}\left(1 - (h^{(t+1)})^2\right) \left(\nabla_{h^{(t+1)}} L\right) + V^{\top} \left(\nabla_{o^{(t)}} L\right)

• Here \mathrm{diag}(1 - (h^{(t+1)})^2) indicates the diagonal matrix with elements 1 - (h_i^{(t+1)})^2; this is the Jacobian of the hyperbolic tangent associated with hidden unit i at time t+1
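A sketch of this backward recursion over a sequence of length tau (random stand-in values; in a real run, hs and grad_os would come from the forward pass and the output gradients above):

```python
import numpy as np

def hidden_grads(W, V, hs, grad_os):
    """Backward recursion over hidden states.

    hs[k] holds h(k+1) and grad_os[k] holds the gradient wrt o(k+1),
    for k = 0..tau-1 (0-indexed lists for 1-indexed time steps).
    """
    tau = len(hs)
    g_h = [None] * tau
    g_h[-1] = V.T @ grad_os[-1]                 # final step: only o(tau) follows
    for k in range(tau - 2, -1, -1):            # t = tau-1 down to t = 1
        jac = 1.0 - hs[k + 1] ** 2              # diag of the tanh Jacobian at t+1
        g_h[k] = W.T @ (jac * g_h[k + 1]) + V.T @ grad_os[k]
    return g_h

rng = np.random.default_rng(2)
n_h, n_out, tau = 4, 3, 5
W, V = rng.normal(size=(n_h, n_h)), rng.normal(size=(n_out, n_h))
hs = [np.tanh(rng.normal(size=n_h)) for _ in range(tau)]
grad_os = [rng.normal(size=n_out) for _ in range(tau)]
g_h = hidden_grads(W, V, hs, grad_os)
```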


Page 11: Gradients over parameters

• Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes
• Because the parameters are shared across many time steps, some care is needed: each parameter's gradient is a sum of one contribution per time step (a sketch follows below)
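A sketch of the resulting parameter gradients: because U, V, W, b, c appear at every step of the unrolled graph, each gradient accumulates per-step contributions (random stand-in inputs; in practice hs, g_hs, and g_os come from the forward pass and the recursions above):

```python
import numpy as np

def param_grads(xs, hs, g_hs, g_os):
    """Sum each shared parameter's per-time-step gradient contributions.

    hs[0] is h(0); hs[t+1] is h(t+1). g_hs[t] and g_os[t] are the gradients
    wrt h(t+1) and o(t+1) from the backward recursion (0-indexed lists).
    """
    dU = dW = dV = db = dc = 0.0
    for t in range(len(xs)):
        da = (1.0 - hs[t + 1] ** 2) * g_hs[t]   # back through tanh at this step
        dc = dc + g_os[t]
        db = db + da
        dV = dV + np.outer(g_os[t], hs[t + 1])  # grad_o(t) h(t)^T
        dW = dW + np.outer(da, hs[t])           # involves h(t-1): weight sharing
        dU = dU + np.outer(da, xs[t])
    return dU, dV, dW, db, dc

rng = np.random.default_rng(3)
T, n_in, n_h, n_out = 5, 2, 4, 3
xs = rng.normal(size=(T, n_in))
hs = [np.tanh(rng.normal(size=n_h)) for _ in range(T + 1)]
g_hs = [rng.normal(size=n_h) for _ in range(T)]
g_os = [rng.normal(size=n_out) for _ in range(T)]
dU, dV, dW, db, dc = param_grads(xs, hs, g_hs, g_os)
```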
