
Recurrent Neural Networks

Zhirong Wu Jan 9th 2015

Recurrent Neural Networks

Wikipedia: a recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle.

Recurrent neural networks constitute a broad class of machines with dynamic state; that is, they have state whose evolution depends both on the input to the system and on the current state.

Artificial Neural Networks

• Feedforward
  • model static input-output functions
  • a single forward direction exists
  • proven useful in many pattern classification tasks

• Recurrent
  • model dynamic state transition systems
  • feedback connections included
  • practical difficulties for applications
  • can be useful for pattern classification, stochastic sequence modeling, associative memory

[Diagram: feedforward network with layer1 → layer2 → layer3]

Recurrent Neural Networks

Computations (discrete time): given the input sequence x_i(n), the model updates its internal states h_i(n) and output values y_i(n) as follows:

h_i(n+1) = f(\text{weighted input}) = f\Big( \sum_j w^{(in)}_{ij} x_j(n+1) + \sum_j w_{ij} h_j(n) + \sum_j w^{(out)}_{ij} y_j(n) \Big)

y_i(n+1) = f(\text{weighted input}) = f\Big( \sum_j w^{(in)}_{ij} x_j(n+1) + \sum_j w_{ij} h_j(n+1) + \sum_j w^{(out)}_{ij} y_j(n) \Big)

Computations (continuous time): described by differential equations.

[Diagram: input x_i(n) → internal state h_i(n) → output y_i(n)]

Recurrent Neural Networks

A typical, widely used formulation:

h(n+1) = f( W_{xh} x(n+1) + W_{hh} h(n) + b_h )

y(n+1) = f( W_{xy} x(n+1) + W_{hy} h(n+1) + b_y )

• no feedback from the output to the internal units
• no recurrent connections among the output units
• the only recurrent connections are the hidden-to-hidden weights W_{hh}

[Diagram: input x → hidden state h → output y, with weights W_xh, W_xy, W_hy and recurrent weights W_hh]
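As a concrete illustration, here is a minimal NumPy sketch of this formulation; the choice of tanh for f and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, Wxh, Whh, Wxy, Why, bh, by, f=np.tanh):
    """Run the recurrence h(n+1) = f(Wxh x(n+1) + Whh h(n) + bh),
    y(n+1) = f(Wxy x(n+1) + Why h(n+1) + by) over a sequence."""
    h = np.zeros(Whh.shape[0])           # initial hidden state h(0)
    ys = []
    for x in x_seq:                      # x_seq: iterable of input vectors
        h = f(Wxh @ x + Whh @ h + bh)    # update the internal state
        y = f(Wxy @ x + Why @ h + by)    # output uses the current input and the new state
        ys.append(y)
    return np.array(ys), h

# Toy usage with random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
nx, nh, ny, T = 3, 5, 2, 4
params = dict(Wxh=rng.normal(size=(nh, nx)), Whh=rng.normal(size=(nh, nh)),
              Wxy=rng.normal(size=(ny, nx)), Why=rng.normal(size=(ny, nh)),
              bh=np.zeros(nh), by=np.zeros(ny))
outputs, final_h = rnn_forward(rng.normal(size=(T, nx)), **params)
```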

Recurrent Neural Networks

Learning: given m sequences of input x^{(i)}(n) and output samples d^{(i)}(n) of length T, minimise

E = \frac{1}{2} \sum_{i=1}^{m} \sum_{n=1}^{T} \| d^{(i)}(n) - y^{(i)}(n) \|^2

where y^{(i)}(n) is the output from the RNN given the input x^{(i)}(n).

Backpropagation doesn't work with cycles. How?

Learning Algorithms

3 popular methods (gradient-based):

• BackPropagation Through Time: gradient updates for a whole sequence
• Real Time Recurrent Learning: gradient updates for each frame in a sequence
• Extended Kalman Filtering


Learning Algorithms

Backpropagation Through Time (BPTT):

Williams, Ronald J., and David Zipser. "Gradient-based learning algorithms for recurrent networks and their computational complexity." Back-propagation: Theory, architectures and applications (1995): 433-486.

Given the whole sequence of inputs and outputs, unfold the RNN through time. The RNN thus becomes a standard feedforward neural network with each layer corresponding to a time step in the RNN. Gradients can be computed using standard backpropagation.
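A minimal NumPy sketch of BPTT under simplifying assumptions (tanh hidden units, a linear output y(n) = W_hy h(n), and a squared-error loss; the direct input-to-output connection from the slides is omitted for brevity):

```python
import numpy as np

def bptt(x_seq, d_seq, Wxh, Whh, Why, bh):
    """One BPTT pass for h(n) = tanh(Wxh x(n) + Whh h(n-1) + bh), y(n) = Why h(n),
    with squared error against targets d(n). Returns the loss and parameter gradients."""
    T, nh = len(x_seq), Whh.shape[0]
    hs = [np.zeros(nh)]                       # h(0) = 0
    ys, loss = [], 0.0
    for x, d in zip(x_seq, d_seq):            # forward pass: unfold through time
        h = np.tanh(Wxh @ x + Whh @ hs[-1] + bh)
        y = Why @ h
        hs.append(h); ys.append(y)
        loss += 0.5 * np.sum((d - y) ** 2)

    dWxh, dWhh, dWhy, dbh = (np.zeros_like(p) for p in (Wxh, Whh, Why, bh))
    dh_next = np.zeros(nh)                    # gradient flowing back from step n+1
    for n in reversed(range(T)):              # backward pass through the unrolled network
        dy = ys[n] - d_seq[n]
        dWhy += np.outer(dy, hs[n + 1])
        dh = Why.T @ dy + dh_next             # total gradient at h(n)
        dz = (1.0 - hs[n + 1] ** 2) * dh      # backprop through tanh
        dWxh += np.outer(dz, x_seq[n])
        dWhh += np.outer(dz, hs[n])
        dbh += dz
        dh_next = Whh.T @ dz                  # pass the error one step further back in time
    return loss, dWxh, dWhh, dWhy, dbh
```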

Learning Algorithms

Real Time Recurrent Learning (RTRL):

For a particular frame in a sequence, compute gradients of the model parameters W. Now think about a single unit v_i and its K input units v_j, j = 1...K:

v_i(n+1) = f\Big( \sum_{j=1}^{K} w_{ij} v_j(n) \Big)

Then differentiate v_i.

Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural computation 1.2 (1989): 270-280.

Learning Algorithms

Real Time Recurrent Learning (RTRL):

Differentiate this unit with respect to a weight parameter w_{pq}:

v_i(n+1) = f\Big( \sum_{j=1}^{K} w_{ij} v_j(n) \Big)

We get

\frac{\partial v_i(n+1)}{\partial w_{pq}} = f'(z_i) \Big[ \sum_{j=1}^{K} w_{ij} \frac{\partial v_j(n)}{\partial w_{pq}} + \delta_{ip} v_q(n) \Big]

where z_i = \sum_{j=1}^{K} w_{ij} v_j(n), \delta_{ip} = 1 if i = p, and \delta_{ip} = 0 otherwise.

Learning Algorithms

Real Time Recurrent Learning (RTRL):

\frac{\partial v_i(n+1)}{\partial w_{pq}} = f'(z_i) \Big[ \sum_{j=1}^{K} w_{ij} \frac{\partial v_j(n)}{\partial w_{pq}} + \delta_{ip} v_q(n) \Big]

• compute the gradients at time n+1 using the gradients and activations at time n
• initialize at time 0: \frac{\partial v_j(0)}{\partial w_{pq}} = 0, j = 1...K
• accumulate gradients from n = 0 to T:

\frac{\partial E}{\partial w_{pq}} = \sum_{i=1}^{L} \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial w_{pq}}

• finally, gradient descent.
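A NumPy sketch of this update for a fully recurrent net v(n+1) = f(W v(n)) with tanh units; the way inputs are clamped onto the first units and targets attached to the last units is an illustrative assumption, not part of the slides:

```python
import numpy as np

def rtrl(x_seq, d_seq, W, f=np.tanh):
    """RTRL sketch: P[i, p, q] stores dv_i(n)/dw_pq and is updated online every step."""
    K = W.shape[0]
    v = np.zeros(K)
    P = np.zeros((K, K, K))                       # dv_j(0)/dw_pq = 0
    dW = np.zeros_like(W)
    for x, d in zip(x_seq, d_seq):
        v[:len(x)] = x                            # clamp the first units to the input
        z = W @ v
        v_new = f(z)
        fprime = 1.0 - v_new ** 2                 # derivative of tanh at z
        term = np.einsum('ij,jpq->ipq', W, P)     # sum_j w_ij dv_j(n)/dw_pq
        idx = np.arange(K)
        term[idx, idx, :] += v                    # delta_ip * v_q(n) term
        P = fprime[:, None, None] * term          # dv_i(n+1)/dw_pq
        v = v_new
        err = np.zeros(K)
        err[-len(d):] = v[-len(d):] - d           # dE/dv_i on the designated output units
        dW += np.einsum('i,ipq->pq', err, P)      # accumulate dE/dw_pq
    return dW
```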

Learning Algorithms

Update rule:

1. batch update: use m full sequences.
2. pattern update: use one full sequence.
3. online update: use each time step in a sequence (only applies to RTRL).

BPTT is much more time-efficient than RTRL, and it is the most widely used method for RNN training.

Learning Algorithms

Vanishing and Exploding Gradients:

Intuition: the recurrent connection W_{hh} is applied to the error signal recursively, backward through time.

• If its largest eigenvalue is bigger than 1, the gradients tend to explode, and learning will never converge.
• If its largest eigenvalue is smaller than 1, the gradients tend to vanish, and error signals can only affect small time lags, leading to a short-term memory.

For a detailed analysis, see Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
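A tiny numerical illustration of this intuition (the 50-unit matrix, the 0.9 and 1.1 scaling factors, and ignoring the nonlinearity are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.9, 1.1):                          # spectral radius below 1 vs above 1
    Whh = scale * np.linalg.qr(rng.normal(size=(50, 50)))[0]   # scaled orthogonal matrix
    err = rng.normal(size=50)
    for _ in range(100):                          # backpropagate 100 steps through time
        err = Whh.T @ err                         # nonlinearity ignored for clarity
    print(f"spectral radius ~{scale}: |error| after 100 steps = {np.linalg.norm(err):.2e}")
# The 0.9 case shrinks toward zero (vanishing), the 1.1 case blows up (exploding).
```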

Long Short-Term Memory

Breakthrough paper for RNNs:

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.

Original recurrent update:

h_t = f( W_{xh} x_t + W_{hh} h_{t-1} + b_h )

The LSTM architecture fixes the vanishing gradient problem: by introducing gates, it allows the error signal to affect steps far back in time. LSTM replaces the function f by a memory cell:

[Diagram: LSTM memory cell with input gate, forget gate, output gate, memory core, and output]


Long Short-Term Memory

• input gate controls when the memory gets perturbed.

• output gate controls when the memory perturbs others.

• forget gate controls when the memory gets flushed.

• during forward computation, information trapped in the memory can directly interact with inputs a long time afterwards.

• likewise, during backpropagation, error signals trapped in the memory can directly propagate to units a long time before.

• LSTM retains information over 1000 time steps, rather than about 10.
• the parameters are all learned together. Learning is a bit complicated, using a combination of BPTT and RTRL to deal with the gates.
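A sketch of one step of a memory cell in NumPy. Note this is the commonly used variant with a forget gate (added after the 1997 paper); the exact 1997 formulation differs slightly, and the weight layout used here is an assumption for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of an LSTM memory cell.
    W has shape (4*nh, nx+nh), b shape (4*nh,); gate order: input, forget, output, candidate."""
    z = W @ np.concatenate([x, h_prev]) + b
    nh = h_prev.size
    i = sigmoid(z[0*nh:1*nh])          # input gate: when the memory gets perturbed
    f = sigmoid(z[1*nh:2*nh])          # forget gate: when the memory gets flushed
    o = sigmoid(z[2*nh:3*nh])          # output gate: when the memory perturbs others
    g = np.tanh(z[3*nh:4*nh])          # candidate values written into the memory core
    c = f * c_prev + i * g             # memory core: mostly additive, so gradients survive
    h = o * np.tanh(c)                 # exposed output
    return h, c
```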

Application: image caption generator

• Given an image, output possible sentences to describe the image. The sentences can have varying lengths.

• Use CNN for image modelling and RNN for language generation.

Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." arXiv preprint arXiv:1411.4555 (2014).

Training: minimise the loss

\text{loss}(S, I) = -\sum_{t=1}^{N} \log p_t(S_t)

where the sentence words are represented through a word embedding W_e.


Train with BPTT to maximise

\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
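To make the generation loop concrete, here is a greedy-decoding sketch; We, Wout, params, start_id, end_id and the reuse of the lstm_step function sketched above are placeholders of my own, not the paper's code, and the image feature is assumed to have already been projected to the word-embedding size.

```python
import numpy as np

def generate_caption(image_feat, We, Wout, params, start_id, end_id, max_len=20):
    """Greedy decoding sketch: the CNN feature seeds the LSTM, then each predicted
    word is embedded with We and fed back in at the next step."""
    nh = params['nh']
    h, c = np.zeros(nh), np.zeros(nh)
    h, c = lstm_step(image_feat, h, c, params['W'], params['b'])   # step -1: show the image
    word = start_id
    caption = []
    for _ in range(max_len):
        h, c = lstm_step(We[word], h, c, params['W'], params['b']) # feed the embedded word
        logits = Wout @ h                                          # scores over the vocabulary
        word = int(np.argmax(logits))                              # p_t: pick the most likely word
        if word == end_id:
            break
        caption.append(word)
    return caption
```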

Results: [figure: example generated captions]

Application: visual attention

In a reinforcement learning framework:

The agent only has limited “bandwidth” to observe the environment. Each time, it makes some actions to maximize the total reward based on a history of observations.

Mnih, Volodymyr, Nicolas Heess, and Alex Graves. "Recurrent models of visual attention."

environment: translated and cluttered MNIST
bandwidth: location and resolution
action: recognition prediction and moving action

Unlike a CNN, this framework can recognize an image of arbitrary input size without scaling.

Training:
• the number of steps is fixed, say 6 steps.
• the reward is defined by the classification result of the last step.

Application: visual attention

[Diagram: glimpse feature extractor → recurrent memory → action prediction]
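A sketch of the recurrent attention loop described above; glimpse_net, core_rnn, locate and classify are hypothetical stand-ins for the paper's components, not its actual code.

```python
import numpy as np

def recurrent_attention(image, n_steps, state_size, glimpse_net, core_rnn, locate, classify):
    """Run a fixed number of glimpses, then classify from the final recurrent memory."""
    h = np.zeros(state_size)               # recurrent memory
    loc = np.zeros(2)                      # start looking at the image centre
    for _ in range(n_steps):               # fixed number of steps, e.g. 6
        g = glimpse_net(image, loc)        # limited-bandwidth observation (patch + location)
        h = core_rnn(h, g)                 # update the internal state
        loc = locate(h)                    # action: where to look next
    return classify(h)                     # the reward comes from this final prediction
```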

Results: [figure: recognition results]

Another memory machine: the Neural Turing Machine

Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." arXiv preprint arXiv:1410.5401 (2014).

[Diagram: a controller network reads from and writes to an external memory through an addressing mechanism]


Copy operation: by analysing the interaction of the controller and the memory, the machine is able to learn a "copy algorithm".
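One piece of the addressing mechanism can be sketched concretely: content-based addressing compares a key emitted by the controller against every memory row. This is a minimal sketch, with illustrative parameter names.

```python
import numpy as np

def content_addressing(M, k, beta):
    """NTM-style content addressing: cosine similarity between key k and every
    memory row M[i], sharpened by beta and normalised into read/write weights."""
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    w = np.exp(beta * sim)
    return w / w.sum()                     # soft weighting over memory rows
```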

Deep RNN

Instead of a single memory cell, we can stack several of them, as in feedforward networks.

Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013).
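A minimal sketch of the stacking idea (the per-layer cell functions and state shapes are assumptions):

```python
import numpy as np

def deep_rnn_step(x, states, cells):
    """Stack several recurrent cells: the output of layer l at time t becomes the
    input of layer l+1 at the same time step. `cells` is a list of step functions,
    `states` the matching list of per-layer hidden states."""
    inp, new_states = x, []
    for cell, h in zip(cells, states):
        h = cell(inp, h)                   # e.g. h = tanh(Wxh @ inp + Whh @ h + bh)
        new_states.append(h)
        inp = h                            # feed upward to the next layer
    return inp, new_states
```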

Overview

1. What is a Recurrent Neural Network
2. Learning algorithms of RNN
3. Vanishing Gradients and LSTM
4. Applications
5. Deep RNN

Code & Demo

http://www.cs.toronto.edu/~graves/

http://www.cs.toronto.edu/~graves/handwriting.html