
Recurrent Neural Networks: A Brief Overview

Douglas Eck

University of Montreal

RNN Overview, Oct 1 2007


RNNs versus FFNs

• Feed-Forward Networks (FFNs, left) can learn function mappings (FFN ==> FIR filter)
• Recurrent Neural Networks (RNNs, right) use the hidden layer as a memory store to learn sequences (RNN ==> IIR filter; see the sketch below)
• RNNs can (in principle at least) exhibit virtually unlimited temporal dynamics
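As an illustration of the difference, here is a minimal numpy sketch (not from the slides; the weight names W1, W2, W_in, W_rec, W_out and the tanh nonlinearity are illustrative assumptions): the FFN maps each input independently, while the RNN carries a hidden state from one timestep to the next.

import numpy as np

def ffn_step(x, W1, W2):
    """Feed-forward: the output depends only on the current input x."""
    return W2 @ np.tanh(W1 @ x)

def rnn_run(xs, W_in, W_rec, W_out):
    """Recurrent: the hidden layer h acts as a memory store across timesteps."""
    h = np.zeros(W_rec.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h)   # the new state depends on the old state
        ys.append(W_out @ h)
    return ys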


Several Methods

• SRN – Simple Recurrent Network (Elman, 1990)
• BPTT – Backpropagation Through Time (Rumelhart, Hinton & Williams, 1986)
• RTRL – Real Time Recurrent Learning (Williams & Zipser, 1989)
• LSTM – Long Short-Term Memory (Hochreiter & Schmidhuber, 1996)


SRN (Elman Net)

• Simple Recurrent Networks
• Hidden-layer activations are copied into a "copy" (context) layer
• Cycles are eliminated, allowing use of standard backpropagation (a one-step sketch follows below)
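A minimal sketch of one SRN timestep, assuming a tanh hidden layer and illustrative weight names (W_xh, W_ch, W_hy); the point is that the context layer is just a frozen copy of the previous hidden activations, so no cycle remains at training time.

import numpy as np

def srn_step(x, context, W_xh, W_ch, W_hy):
    """One Elman-net timestep: the hidden layer sees the input plus the context
    (copy) layer, and the context is then overwritten with a copy of the hidden."""
    hidden = np.tanh(W_xh @ x + W_ch @ context)   # context enters like an ordinary input
    y = W_hy @ hidden
    new_context = hidden.copy()                   # the copy step; backprop treats it as constant
    return y, new_context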


Some Observations by Elman

• Some problems change nature when expressed in time
• Example: Temporal XOR (Elman, 1990), e.g. the bit stream [101101110110000011000] (a data-generation sketch follows below)
• The RNN learns frequency detectors
• Error can give information about the temporal structure of the input
• Increasing sequential dependencies does not necessarily make the task harder
• Representation of time is task-dependent
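A sketch of how such a temporal XOR stream can be generated (the function name and seed are illustrative): two random bits are followed by their XOR, so only every third bit is predictable from the recent past.

import random

def temporal_xor_stream(n_triples, seed=0):
    """Two random bits followed by their XOR, concatenated into one stream.
    A next-bit predictor should show low error only on every third bit."""
    rng = random.Random(seed)
    bits = []
    for _ in range(n_triples):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        bits.extend([a, b, a ^ b])
    return bits

print("".join(map(str, temporal_xor_stream(7))))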


SRN Strengths

• Easy to train
• Potential for complex and useful temporal dynamics
• Can induce hierarchical temporal structure
• Can learn, e.g., a simple "natural language" grammar


SRN Shortcomings

• We need to compute weight changes at every timestep $t_0 < t \le t_1$: $\sum_{t=t_0+1}^{t_1} \Delta w_{ij}(t)$
• Thus at every $t$ we need to compute
  $-\frac{\partial E(t)}{\partial w_{ij}} = -\sum_{k \in U} \frac{\partial E(t)}{\partial y_k(t)} \frac{\partial y_k(t)}{\partial w_{ij}} = \sum_{k \in U} e_k(t) \frac{\partial y_k(t)}{\partial w_{ij}}$
• Since we know $e_k(t)$, we need only compute $\frac{\partial y_k(t)}{\partial w_{ij}}$
• The SRN truncates this derivative
• Long-timescale (and short-timescale) dependencies between error signals are lost


SRNs Generalized: BPTT

• Generalize the SRN to remember deeper into the past
• Trick: unfold the network to represent time spatially (one layer per discrete timestep)
• Still no cycles, allowing use of standard backpropagation


BPTT(∞)

For each timestep t:

• The current state of the network and the input pattern are added to a history buffer (which stores everything since time t = 0)
• Error $e_k(t)$ is injected; the $\epsilon$'s and $\delta$'s for times $t_0 < \tau \le t$ are computed:
  $\epsilon_k(t) = e_k(t)$
  $\delta_k(\tau) = f'_k(s_k(\tau))\,\epsilon_k(\tau)$
  $\epsilon_k(\tau - 1) = \sum_{l \in U} w_{lk}\,\delta_l(\tau)$
• Weight changes are computed as in standard backpropagation:
  $\frac{\partial E(t)}{\partial w_{ij}} = \sum_{\tau=t_0+1}^{t} \delta_i(\tau)\, x_j(\tau - 1)$ (a code sketch of the backward pass follows below)
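A minimal sketch of these equations for a single-layer tanh RNN with squared error at every output (the tanh/squared-error choice and the names W_in, W_rec, W_out are assumptions, not from the slides): the forward pass stores the full history, and the backward pass injects the output error at each step and pushes ε back through time.

import numpy as np

def bptt(inputs, targets, W_in, W_rec, W_out):
    """inputs, targets: lists of 1-D arrays. Returns gradients for the three weight matrices."""
    T, n_h = len(inputs), W_rec.shape[0]
    hs, ys = [np.zeros(n_h)], []
    # Forward pass: keep the whole history of hidden states (the history buffer).
    for x in inputs:
        hs.append(np.tanh(W_in @ x + W_rec @ hs[-1]))
        ys.append(W_out @ hs[-1])
    dW_in, dW_rec, dW_out = (np.zeros_like(W) for W in (W_in, W_rec, W_out))
    eps = np.zeros(n_h)                        # epsilon flowing back into the hidden layer
    # Backward pass: inject the output error at each t, then push it back in time.
    for t in reversed(range(T)):
        e = ys[t] - targets[t]                 # derivative of the squared error w.r.t. y_k(t)
        dW_out += np.outer(e, hs[t + 1])
        eps = eps + W_out.T @ e                # epsilon_k(tau) receives the injected error
        delta = (1.0 - hs[t + 1] ** 2) * eps   # delta_k(tau) = f'(s_k) * epsilon_k (tanh)
        dW_in += np.outer(delta, inputs[t])
        dW_rec += np.outer(delta, hs[t])       # x_j(tau - 1) here is the previous hidden state
        eps = W_rec.T @ delta                  # epsilon_k(tau - 1)
    return dW_in, dW_rec, dW_out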


Truncated/Epochwise BPTT

• When the training data is organized in epochs, we can limit the size of the history buffer to h, the length of an epoch
• When the training data is not in epochs, we can nonetheless truncate the gradient after h timesteps (a bounded-buffer sketch follows below)
• For n units and O(n²) weights, epochwise BPTT has space complexity O(nh) and time complexity O(n²h)
• This compares favorably to BPTT(∞) (substitute L, the length of the input sequence, for h)
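A minimal sketch of the bounded history buffer behind truncated BPTT(h); the class name and fields are illustrative, and the backward pass itself would be the same as in the BPTT sketch above, run over at most h stored timesteps.

from collections import deque

class TruncatedHistory:
    """Keeps only the last h timesteps; older entries fall off automatically."""
    def __init__(self, h):
        self.buffer = deque(maxlen=h)

    def record(self, x, hidden, target):
        self.buffer.append((x, hidden, target))

    def unroll(self):
        """The (at most h) stored timesteps, oldest first, to backpropagate over."""
        return list(self.buffer)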


BPTT's Gradient

• Recall that BPTT(∞) computes
  $\frac{\partial E(t)}{\partial w_{ij}} = \sum_{\tau=t_0+1}^{t} \delta_i(\tau)\, x_j(\tau - 1)$
• Thus errors at t take equally into account δ values from the entire history of the computation
• Truncated/Epochwise BPTT cuts off the gradient:
  $\frac{\partial E(t)}{\partial w_{ij}} = \sum_{\tau=t-h}^{t} \delta_i(\tau)\, x_j(\tau - 1)$
• If the data is naturally organized in epochs, this is not a problem
• If the data is not organized in epochs, this is not a problem either, provided h is "big enough"


RTRL

• RTRL = Real Time Recurrent Learning
• Instead of unfolding the network backward in time, one can propagate error forward in time
• Compute directly $p^k_{ij}(t) = \frac{\partial y_k(t)}{\partial w_{ij}}$
• $p^k_{ij}(t+1) = f'_k(s_k(t))\left[\sum_{l \in U} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, x_j(t)\right]$
• $\Delta w_{ij} = -\alpha \frac{\partial E(t)}{\partial w_{ij}} = \alpha \sum_{k \in U} e_k(t)\, p^k_{ij}(t)$ (a code sketch of this update follows below)
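A minimal sketch of the RTRL recursion above for a fully recurrent sigmoid network (the squared-error setup, the einsum formulation, and the variable names are illustrative assumptions). The table p holds one sensitivity ∂y_k/∂w_ij per (unit, weight) pair, which is where the heavy memory cost comes from.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtrl_step(x, y_prev, p, W, d, alpha=0.1):
    """One RTRL timestep.
    x: external input (m,);  y_prev: previous unit outputs (n,);
    p: sensitivities p[k, i, j] = dy_k/dw_ij, shape (n, n, n + m);
    W: weights over [y_prev, x], shape (n, n + m);  d: targets (n,)."""
    n = W.shape[0]
    z = np.concatenate([y_prev, x])               # x_j(t): recurrent plus external inputs
    y = sigmoid(W @ z)                            # y_k(t + 1)
    fprime = y * (1.0 - y)                        # f'_k for the logistic sigmoid
    # p^k_ij(t+1) = f'_k(s_k(t)) [ sum_l w_kl p^l_ij(t) + delta_ik x_j(t) ]
    p_new = np.einsum('kl,lij->kij', W[:, :n], p)
    p_new[np.arange(n), np.arange(n), :] += z     # the delta_ik x_j(t) term
    p_new *= fprime[:, None, None]
    e = d - y                                     # e_k(t)
    W = W + alpha * np.einsum('k,kij->ij', e, p_new)  # Δw_ij = α Σ_k e_k(t) p^k_ij(t)
    return y, p_new, W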


RTRL vs BPTT

• RTRL saves us from executing the backward dynamics
• Temporal credit assignment is solved during the forward pass
• RTRL is painfully slow: for n units and n² weights, it has O(n³) space complexity and O(n⁴)(!) time complexity per timestep
• Because BPTT is faster, RTRL is in general only of theoretical interest


Training Paradigms

• Epochwise training: reset the system at fixed stopping points
• Continual training: never reset the system
• Epochwise training provides a barrier for credit assignment
• It is unnatural for many tasks
• This is distinct from whether the weights are updated batchwise or iteratively


Teacher Forcing

• Continually-trained networks can move far from the desired trajectory and never return
• This is especially true if the network enters a region where activations become saturated (gradients go to 0)
• Solution: during training, replace the actual output $y_k(t)$ with the teacher signal $d_k(t)$ (sketched below)
• Necessary for certain problems in BPTT/RTRL networks (e.g. generating square waves)
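A minimal sketch of teacher forcing for a network whose previous output is fed back as part of its next input (the output-feedback architecture and the weight names W_in, W_fb, W_rec, W_out are illustrative assumptions): during training the teacher signal d_k(t) is substituted for the network's own output in the feedback path.

import numpy as np

def run(inputs, targets, W_in, W_fb, W_rec, W_out, teacher_forcing=True):
    """Run an output-feedback RNN over a sequence."""
    h = np.zeros(W_rec.shape[0])
    fb = np.zeros(W_out.shape[0])         # feedback from the previous timestep
    outputs = []
    for x, d in zip(inputs, targets):
        h = np.tanh(W_in @ x + W_fb @ fb + W_rec @ h)
        y = W_out @ h
        outputs.append(y)
        fb = d if teacher_forcing else y  # teacher forcing: feed back d_k(t) instead of y_k(t)
    return outputs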



Puzzle: How Much is Enough?

• Recall that for truncated BPTT, the true gradient is not being computed
• How many steps do we need to "get close enough"?
• Answer: certainly not more than 50
• Probably not more than 10(!)


Credit Assignment is Difficult

• In BPTT, error gets "diluted" with every subsequent layer (the credit assignment problem):
  $\epsilon_k(t) = e_k(t)$
  $\delta_k(\tau) = f'_k(s_k(\tau))\,\epsilon_k(\tau)$
  $\epsilon_k(\tau - 1) = \sum_{l \in U} w_{lk}\,\delta_l(\tau)$
• The logistic sigmoid $1/(1 + e^{-x})$ has a maximum derivative of 0.25
• So when $|w| < 4.0$, the error is always scaled by a factor < 1.0 at each step back in time (a numeric sketch follows below)
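A numeric sketch of that bound for a single self-recurrent sigmoid unit (the weight value and loop length are arbitrary): each step back in time multiplies the error signal by w · σ'(s), and since σ'(s) ≤ 0.25, any |w| < 4 makes the factor smaller than 1.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w, s = 3.0, 0.0          # |w| < 4; s = 0 is where the sigmoid derivative is largest (0.25)
error = 1.0
for step in range(1, 21):
    error *= w * sigmoid(s) * (1.0 - sigmoid(s))   # best-case scaling factor: 3 * 0.25 = 0.75
    if step % 5 == 0:
        print(f"after {step:2d} steps back in time: error signal = {error:.2e}")
# Even in this best case the signal decays like 0.75**steps and soon vanishes.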


Vanishing Gradients

• Bengio, Simard & Frasconi (1994)
• Remembering a bit requires creating an attractor basin in state space
• Two cases: either the system is overly sensitive to noise, or the error gradient vanishes exponentially
• This is a general problem (HMMs suffer something similar)


Solutions and Alternatives

• Non-gradient learning algorithms (including global search)
• Expectation Maximization (EM) training (e.g. Bengio's Input/Output HMM)
• Hybrid architectures to aid in preserving error signals


LSTM

[Figure: a recurrent network with input u[t] and output y[t]; the hidden layer contains a memory block with a single cell]

• A hybrid recurrent neural network
• Make the hidden units linear (their derivative is then 1.0)
• But linear units are unstable
• So place the units in a "memory block" protected by multiplicative gates


Inside an LSTM Memory Block

[Figure: the cell input g(net_c), the gate activations y_in, y_φ, y_out, the cell state s_c, and the cell output h(s_c) feeding the block output y_c]

• The gates are standard sigmoidal units
• The input gate y_in protects the linear unit s_c from spurious inputs
• The forget gate y_φ allows s_c to empty its own contents
• The output gate y_out allows the block to take itself offline and ignore error (a forward-step sketch follows below)
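A minimal sketch of one forward step through a single-cell memory block, following the gate names above (the choice of tanh for g and h and the weight names are illustrative assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_step(z, s_c, W_c, W_in, W_phi, W_out):
    """z: concatenated [external input, previous block outputs]; s_c: previous cell state.
    Each W_* maps z to the scalar net input of the cell or of one gate."""
    y_in  = sigmoid(W_in  @ z)         # input gate: shields s_c from spurious inputs
    y_phi = sigmoid(W_phi @ z)         # forget gate: lets s_c empty its own contents
    y_out = sigmoid(W_out @ z)         # output gate: can take the block offline
    g = np.tanh(W_c @ z)               # squashed cell input g(net_c)
    s_c = y_phi * s_c + y_in * g       # the (near-)linear cell state update
    y_c = y_out * np.tanh(s_c)         # block output y_c = y_out * h(s_c)
    return y_c, s_c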


LSTM Learning

• For the linear units s_c, a truncated RTRL approach is used (tables of partial derivatives)
• Everywhere else, standard backpropagation is used
• Rationale: due to the vanishing gradient problem, the errors would decay anyway


Properties of LSTM

• Very good at finding hierarchical structure
• Can induce nonlinear oscillation (for counting and timing)
• But error flow among blocks is truncated
• Difficult to train: the weights into the gates are sensitive


Formal Grammars

• LSTM solves the Embedded Reber Grammar faster, more reliably, and with a smaller hidden layer than RTRL/BPTT
• LSTM solves the context-sensitive language a^n b^n c^n better than RTRL/BPTT networks (a data-generation sketch follows after the figure)
• Trained on examples with n < 10, LSTM generalized to n > 1000
• BPTT-trained networks generalize to n ≤ 18 in the best case (Bodén & Wiles, 2001)

[Figure: cell states y_c1 and y_c2 of a trained network while processing an a^n b^n c^n string, plotted against timesteps 0–20, with the input and target symbols (S, a, b, c, T) marked along the axis]
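A sketch of how the a^n b^n c^n training strings can be generated, assuming (as in the slide) training values n < 10 and the start/terminal markers S and T seen in the figure; the symbol encoding and function names are illustrative.

import random

def anbncn(n):
    return "S" + "a" * n + "b" * n + "c" * n + "T"

def training_set(n_examples, max_n=9, seed=0):
    """Training strings with n < 10; generalization is tested on much larger n."""
    rng = random.Random(seed)
    return [anbncn(rng.randint(1, max_n)) for _ in range(n_examples)]

print(training_set(3))   # the network reads one symbol at a time and predicts the next symbol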
