CS 4803 / 7643: Deep Learning
Zsolt KiraGeorgia Tech
Topics:
– Recurrent Neural Networks (RNNs)
– BackProp Through Time (BPTT)

Plan for Today
• Model
– Recurrent Neural Networks (RNNs)
• Learning
– BackProp Through Time (BPTT)
(C) Dhruv Batra and Zsolt Kira 2
New Topic: RNNs
(C) Dhruv Batra and Zsolt Kira 3
Image Credit: Andrej Karpathy
New Words
• Recurrent Neural Networks (RNNs)
• Recursive Neural Networks
– General family; think graphs instead of chains
• Types:
– “Vanilla” RNNs (Elman Networks)
– Long Short-Term Memory (LSTMs)
– Gated Recurrent Units (GRUs)
– …
• Algorithms:
– BackProp Through Time (BPTT)
– BackProp Through Structure (BPTS)
(C) Dhruv Batra and Zsolt Kira 4
What’s wrong with MLPs?
• Problem 1: Can’t model sequences
– Fixed-sized inputs & outputs
– No temporal structure
• Problem 2: Pure feed-forward processing
– No “memory”, no feedback
(C) Dhruv Batra and Zsolt Kira 6
Image Credit: Alex Graves, book
Why model sequences?
Figure Credit: Carlos Guestrin
Why model sequences?
(C) Dhruv Batra and Zsolt Kira 8
Image Credit: Alex Graves
Sequences are everywhere…
(C) Dhruv Batra and Zsolt Kira 9
Image Credit: Alex Graves and Kevin Gimpel
(C) Dhruv Batra and Zsolt Kira 10
Even where you might not expect a sequence…
Image Credit: Vinyals et al.
Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015.
Gregor et al., “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015.
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Classify images by taking a series of “glimpses”
Even where you might not expect a sequence…
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Even where you might not expect a sequence…
Image Credit: Ba et al.; Gregor et al.
• Output ordering = sequence
(C) Dhruv Batra and Zsolt Kira
Sequences in Input or Output?
• It’s a spectrum…
– Input: No sequence → Output: No sequence. Example: “standard” classification / regression problems
– Input: No sequence → Output: Sequence. Example: Im2Caption
– Input: Sequence → Output: No sequence. Example: sentence classification, multiple-choice question answering
– Input: Sequence → Output: Sequence. Example: machine translation, video classification, video captioning, open-ended question answering
(C) Dhruv Batra and Zsolt Kira 16
Image Credit: Andrej Karpathy
2 Key Ideas
• Parameter Sharing
– in computation graphs = adding gradients
(C) Dhruv Batra and Zsolt Kira 17
Slide Credit: Marc'Aurelio Ranzato
(C) Dhruv Batra and Zsolt Kira 18
Computational Graph
+
Gradients add at branches
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
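In code, parameter sharing shows up exactly this way: when one parameter feeds several branches of the graph, its gradient is the sum of the branch gradients. A tiny hand-computed sketch (the function L and all values here are made up for illustration):

```python
# A shared parameter w feeds two branches of a toy graph:
#   f(w) = 3w  and  g(w) = w^2,  with total loss L = f(w) + g(w).
# Where the graph branches, the upstream gradients ADD: dL/dw = 3 + 2w.
w = 2.0
grad_f = 3.0               # gradient flowing back through the f branch
grad_g = 2.0 * w           # gradient flowing back through the g branch
grad_w = grad_f + grad_g   # gradients add at the branch point

# Sanity check against a numerical derivative of L.
L = lambda v: 3.0 * v + v ** 2
eps = 1e-6
num_grad = (L(w + eps) - L(w - eps)) / (2 * eps)
```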
2 Key Ideas
• Parameter Sharing
– in computation graphs = adding gradients
• “Unrolling”
– in computation graphs with parameter sharing
(C) Dhruv Batra and Zsolt Kira 21
How do we model sequences?
• No input
(C) Dhruv Batra and Zsolt Kira 23
Image Credit: Bengio, Goodfellow, Courville
How do we model sequences?
• With inputs
(C) Dhruv Batra and Zsolt Kira 25
Image Credit: Bengio, Goodfellow, Courville
2 Key Ideas
• Parameter Sharing
– in computation graphs = adding gradients
• “Unrolling”
– in computation graphs with parameter sharing
• Parameter sharing + Unrolling
– Allows modeling arbitrary sequence lengths!
– Keeps the number of parameters in check
(C) Dhruv Batra and Zsolt Kira 26
Recurrent Neural Network

x → RNN → y

We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} the old state, x_t the input vector at time step t, and f_W some function with parameters W. We usually want to predict an output vector y at some (or every) time step.

Notice: the same function and the same set of parameters are used at every time step.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
(Vanilla) Recurrent Neural Network

x → RNN → y

The state consists of a single “hidden” vector h:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t

Sometimes called a “Vanilla RNN” or an “Elman RNN” after Prof. Jeffrey Elman.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
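A minimal NumPy sketch of one vanilla RNN step under these equations (toy dimensions, small random weights; every name here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 3, 4, 2          # toy sizes, assumed for illustration

W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_xh = rng.standard_normal((D_h, D_in)) * 0.1
W_hy = rng.standard_normal((D_out, D_h)) * 0.1

def rnn_step(h_prev, x):
    """One vanilla (Elman) RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h                     # read a prediction out of the state
    return h, y

h = np.zeros(D_h)
for x in rng.standard_normal((5, D_in)):   # a length-5 input sequence
    h, y = rnn_step(h, x)            # the SAME W is used at every time step
```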
RNN: Computational Graph

h0 → fW → h1 → fW → h2 → fW → h3 → … → hT
      x1        x2        x3

Re-use the same weight matrix W at every time-step.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
RNN: Computational Graph: Many to Many

h0 → fW → h1 → fW → h2 → fW → h3 → … → hT
      x1        x2        x3
           y1        y2        y3           yT
           L1        L2        L3           LT

Each output yt has its own per-step loss Lt; the total loss L is the sum of the Lt.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
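The many-to-many unrolling with a per-step loss can be sketched as follows (toy NumPy model with made-up sizes; cross-entropy on random targets stands in for a real task):

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_h, V = 3, 4, 5                   # toy sizes; V = number of classes
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_xh = rng.standard_normal((D_h, D_in)) * 0.1
W_hy = rng.standard_normal((V, D_h)) * 0.1

xs = rng.standard_normal((6, D_in))      # input sequence x1..xT
targets = rng.integers(0, V, size=6)     # one target label per time step

h = np.zeros(D_h)
total_loss = 0.0
for x, t in zip(xs, targets):
    h = np.tanh(W_hh @ h + W_xh @ x)     # shared weights, unrolled in time
    logits = W_hy @ h
    p = np.exp(logits - logits.max()); p /= p.sum()   # softmax
    total_loss += -np.log(p[t])          # L_t: per-step cross-entropy
# L = sum_t L_t : the total loss adds the per-step losses
```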
RNN: Computational Graph: Many to One

h0 → fW → h1 → fW → h2 → fW → h3 → … → hT → y
      x1        x2        x3

A single output y is produced from the final hidden state hT.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
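A many-to-one sketch: run the recurrence over the whole input and read the single prediction off the last state (all sizes and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D_in, D_h, C = 3, 4, 2                 # toy sizes; C = number of output classes
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_xh = rng.standard_normal((D_h, D_in)) * 0.1
W_hy = rng.standard_normal((C, D_h)) * 0.1

xs = rng.standard_normal((7, D_in))    # e.g. word vectors of one sentence
h = np.zeros(D_h)
for x in xs:                           # consume the entire input sequence
    h = np.tanh(W_hh @ h + W_xh @ x)
y = W_hy @ h                           # single prediction from the LAST state hT
```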
RNN: Computational Graph: One to Many

h0 → fW → h1 → fW → h2 → fW → h3 → … → hT
      x (input only at the first step)
           y1        y2        y3           yT

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Sequence to Sequence: Many-to-one + One-to-many

Many to one (encoder, weights W1): h0 → fW → h1 → fW → h2 → … → hT, consuming x1, x2, x3, …
Encode the input sequence in a single vector hT.

One to many (decoder, weights W2): produce the output sequence y1, y2, … from that single input vector.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
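A minimal encoder-decoder sketch under this scheme, with separate parameter sets W1 (encoder) and W2 (decoder). Here the decoder takes no per-step input, which is a simplification for illustration; all names and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
D_in, D_h, V = 3, 4, 5
# Encoder weights (the slide's W1) and decoder weights (W2) are separate sets.
enc_Whh = rng.standard_normal((D_h, D_h)) * 0.1
enc_Wxh = rng.standard_normal((D_h, D_in)) * 0.1
dec_Whh = rng.standard_normal((D_h, D_h)) * 0.1
dec_Why = rng.standard_normal((V, D_h)) * 0.1

# Many to one: encode the input sequence into a single vector hT.
h = np.zeros(D_h)
for x in rng.standard_normal((6, D_in)):
    h = np.tanh(enc_Whh @ h + enc_Wxh @ x)
encoding = h

# One to many: produce an output sequence from that single vector.
outputs = []
h = encoding
for _ in range(4):
    h = np.tanh(dec_Whh @ h)           # no per-step input in this minimal sketch
    outputs.append(dec_Why @ h)        # one output vector per decoder step
```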
Example: Character-level Language Model

Vocabulary: [h, e, l, o]
Example training sequence: “hello”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Distributed Representations Toy Example
• Local vs Distributed
(C) Dhruv Batra and Zsolt Kira 45
Slide Credit: Moontae Lee

Distributed Representations Toy Example
• Can we interpret each dimension?
(C) Dhruv Batra and Zsolt Kira 46
Slide Credit: Moontae Lee
Power of distributed representations!
(C) Dhruv Batra and Zsolt Kira 47
Local
Distributed
Slide Credit: Moontae Lee
Example: Character-level Language Model

Vocabulary: [h, e, l, o]
Example training sequence: “hello”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Training Time: MLE / “Teacher Forcing”
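Teacher forcing on the “hello” example can be sketched like this: at every step the ground-truth character is fed in, and the loss is the negative log-probability of the next ground-truth character. The model here is an untrained toy vanilla RNN; all weights and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ['h', 'e', 'l', 'o']
seq = "hello"                         # training sequence from the slide
V, D_h = len(vocab), 8
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_xh = rng.standard_normal((D_h, V)) * 0.1
W_hy = rng.standard_normal((V, D_h)) * 0.1

def one_hot(c):
    v = np.zeros(V); v[vocab.index(c)] = 1.0
    return v

# Teacher forcing: at every step feed the GROUND-TRUTH character as input,
# and train to predict the next ground-truth character (maximum likelihood).
h = np.zeros(D_h)
loss = 0.0
for cur, nxt in zip(seq[:-1], seq[1:]):
    h = np.tanh(W_hh @ h + W_xh @ one_hot(cur))
    logits = W_hy @ h
    p = np.exp(logits - logits.max()); p /= p.sum()
    loss += -np.log(p[vocab.index(nxt)])   # MLE: log-prob of the next char
```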
Example: Character-level Language Model (Sampling)

Vocabulary: [h, e, l, o]
At test time, sample characters one at a time and feed each sample back into the model.

[Figure: at each step the softmax gives a distribution over the vocabulary; the sampled characters are “e”, “l”, “l”, “o”.]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Test Time: Sample / Argmax
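The test-time loop can be sketched as follows (untrained toy weights, so the samples are essentially random; swap rng.choice for p.argmax() to get argmax decoding):

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ['h', 'e', 'l', 'o']
V, D_h = len(vocab), 8
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_xh = rng.standard_normal((D_h, V)) * 0.1
W_hy = rng.standard_normal((V, D_h)) * 0.1

h = np.zeros(D_h)
idx = vocab.index('h')                # seed character
generated = ['h']
for _ in range(4):
    x = np.zeros(V); x[idx] = 1.0     # feed the previous sample back in
    h = np.tanh(W_hh @ h + W_xh @ x)
    logits = W_hy @ h
    p = np.exp(logits - logits.max()); p /= p.sum()   # softmax over vocab
    idx = rng.choice(V, p=p)          # sample (or p.argmax() for argmax)
    generated.append(vocab[idx])
```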
Backpropagation Through Time

Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Truncated Backpropagation Through Time

Run forward and backward through chunks of the sequence instead of the whole sequence. Carry hidden states forward in time indefinitely, but only backpropagate for some smaller number of steps.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
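The chunking scheme can be sketched as follows. There is no autograd here, so the backward pass is indicated only by a comment, but the key point is shown: the state is carried across chunks while the gradient path is cut at chunk boundaries. All sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(6)
D_in, D_h = 3, 4
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_xh = rng.standard_normal((D_h, D_in)) * 0.1

long_seq = rng.standard_normal((100, D_in))
chunk_len = 25                        # backprop only within chunks of this length

h = np.zeros(D_h)
num_updates = 0
for start in range(0, len(long_seq), chunk_len):
    chunk = long_seq[start:start + chunk_len]
    hs = []
    for x in chunk:                   # forward through ONE chunk only
        h = np.tanh(W_hh @ h + W_xh @ x)
        hs.append(h)
    # ... compute the loss on this chunk and backprop through `hs` only ...
    num_updates += 1
    h = h.copy()                      # "detach": carry the state forward in time,
                                      # but let no gradient flow into past chunks
```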
Deep (Stacked) RNNs

x → RNN → RNN → … → y

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
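A stacked RNN forward pass, where the hidden-state sequence of each layer becomes the input sequence of the layer above (toy sizes, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(7)
D_in, D_h, depth = 3, 4, 3            # 3 stacked RNN layers (toy sizes)
Wxh = [rng.standard_normal((D_h, D_in if l == 0 else D_h)) * 0.1
       for l in range(depth)]         # layer 0 reads x; layers 1+ read h below
Whh = [rng.standard_normal((D_h, D_h)) * 0.1 for _ in range(depth)]

hs = [np.zeros(D_h) for _ in range(depth)]   # one hidden state per layer
for x in rng.standard_normal((5, D_in)):
    inp = x
    for l in range(depth):
        hs[l] = np.tanh(Whh[l] @ hs[l] + Wxh[l] @ inp)
        inp = hs[l]                   # this layer's state feeds the next layer
```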
[Figure: generated text samples at successive stages of training: “at first”, then after training more, more, and more.]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
The Stacks Project: open-source algebraic geometry textbook
LaTeX source: http://stacks.math.columbia.edu/
The Stacks Project is licensed under the GNU Free Documentation License.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Generated C code
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Searching for interpretable cells

– quote detection cell
– line length tracking cell
– if statement cell
– quote/comment cell
– code depth cell

Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016. Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Neural Image Captioning
(C) Dhruv Batra 77

Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → Image Embedding (VGGNet), 4096-dim
Neural Image Captioning
(C) Dhruv Batra 79
Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → 4096-dim Image Embedding (VGGNet) → Linear

<start> Two people and two horses.

RNN → RNN → RNN → RNN → RNN → RNN
P(next) P(next) P(next) P(next) P(next) P(next)
Beam Search
• Proceed from left to right
• Maintain N partial captions
• Expand each caption with possible next words
• Discard all but the top N new partial captions
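The four bullet points map directly onto a loop. A self-contained sketch with a hypothetical toy next-word model (the vocabulary, the probabilities, and the helper next_log_probs are all made up for illustration):

```python
import math

# Hypothetical toy next-word model: given the last word, return
# log-probabilities over a tiny vocabulary (one word is favored).
VOCAB = ["a", "cat", "sat", "<end>"]
def next_log_probs(last_word):
    favored = {"<start>": "a", "a": "cat", "cat": "sat", "sat": "<end>"}
    return {w: math.log(0.7) if w == favored.get(last_word) else math.log(0.1)
            for w in VOCAB}

def beam_search(N=2, max_len=5):
    beams = [(["<start>"], 0.0)]                 # (partial caption, log-prob)
    for _ in range(max_len):                     # proceed left to right
        candidates = []
        for words, score in beams:
            if words[-1] == "<end>":             # finished captions pass through
                candidates.append((words, score))
                continue
            # expand each caption with every possible next word
            for w, lp in next_log_probs(words[-1]).items():
                candidates.append((words + [w], score + lp))
        # discard all but the top N new partial captions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:N]
    return beams[0][0]

best = beam_search()
```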
Image Captioning: Example Results
– A cat sitting on a suitcase on the floor
– A cat is sitting on a tree branch
– A dog is running in the grass with a frisbee
– A white teddy bear sitting in the grass
– Two people walking on the beach with surfboards
– Two giraffes standing in a grassy field
– A man riding a dirt bike on a dirt track
– A tennis player in action on the court
Captions generated using neuraltalk2. All images are CC0 Public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Image Captioning: Failure Cases
– A woman is holding a cat in her hand
– A woman standing on a beach holding a surfboard
– A person holding a computer mouse on a desk
– A bird is perched on a tree branch
– A man in a baseball uniform throwing a ball
Captions generated using neuraltalk2. All images are CC0 Public domain: fur coat, handstand, spider web, baseball.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Image Captioning with Attention

A CNN maps the image (H x W x 3) to a grid of features (L x D), one D-dimensional feature vector per spatial location. From the initial hidden state h0, the model computes an attention distribution a1 over the L locations; the weighted combination of features gives a context vector z1 (weighted features, dimension D). Feeding z1 together with the first word y1 into the RNN produces h1, which outputs both a distribution d1 over the vocabulary and a new attention distribution a2 over locations. The process repeats (z2, y2 → h2 → a3, d2 → …) until the caption is complete.

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
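One step of the soft-attention readout can be sketched as follows: scores over the L locations are softmaxed into a distribution a, and the context vector z is the attention-weighted combination of the features. The scoring matrix W_att here is a made-up stand-in for the model's learned attention function; all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(8)
L, D, D_h = 6, 4, 5                        # L feature locations, D channels
features = rng.standard_normal((L, D))     # CNN features, one D-vector per location
h = rng.standard_normal(D_h)               # current RNN hidden state
W_att = rng.standard_normal((L, D_h)) * 0.1  # hypothetical attention scorer

scores = W_att @ h                         # one scalar score per location
a = np.exp(scores - scores.max()); a /= a.sum()   # distribution over L locations
z = a @ features                           # soft attention: weighted combination
                                           # of features -> context vector (dim D)
```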
Image Captioning with Attention: Soft attention vs. Hard attention

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n