Raia Hadsellraiahadsell.com/.../oxford_cslecture_raia_hadsell_2017.pdf · 2017. 6. 1. · Raia...

Page 1:

Learning in sequential environments

Raia Hadsell, Staff Research Scientist, DeepMind

raiahadsell.com

Page 2:

Scaling deep reinforcement learning towards the real world:

part 1: learning sequential tasks without forgetting
part 2: learning to navigate in complex worlds

Page 3:

Raia Hadsell 2017

Reinforcement Learning

[Diagram: Agent and Environment in a loop, labelled ACTIONS, OBSERVATIONS, and REWARD?]
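The loop in the diagram can be sketched in a few lines of Python. This is only an illustration: the `Corridor` environment, the `run_episode` helper, and both policies are hypothetical names invented here and do not come from the talk.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the position

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe, act, receive a reward, repeat."""
    observation = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            return total_reward, step
    return total_reward, max_steps


rng = random.Random(0)
random_policy = lambda observation: rng.choice([-1, 1])
always_right = lambda observation: 1

print(run_episode(Corridor(), always_right))   # (1.0, 4)
print(run_episode(Corridor(), random_policy))  # depends on the random seed
```

The reward here arrives only at the goal, which is a small-scale version of the sparse-reward setting discussed later in the talk.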

Page 4:

Value Iteration

○ Maximizing Q^π(s,a) over possible policies gives the optimal action-value function and the Bellman equation:

○ Basic idea:

■ Approximate →

■ Apply the Bellman equation as an iterative update
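The slide's own equations did not survive in this transcript. In its standard textbook form, the Bellman optimality equation reads Q*(s,a) = E[r + γ max_a' Q*(s',a') | s,a], and value iteration applies it repeatedly as Q_{i+1}(s,a) = E[r + γ max_a' Q_i(s',a') | s,a]. The sketch below runs that update on a tiny tabular problem; the three-state chain, its rewards, and the discount factor are made up for illustration and are not taken from the talk.

```python
GAMMA = 0.9
TERMINAL = 2
ACTIONS = ("stay", "go")

# Hypothetical MDP: transitions[(state, action)] -> list of
# (probability, next_state, reward). State 2 is terminal.
transitions = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"): [(1.0, 1, 0.0)],
    (1, "stay"): [(1.0, 1, 0.0)],
    (1, "go"): [(1.0, 2, 1.0)],
}


def bellman_backup(q):
    """One sweep of Q_{i+1}(s,a) = E[r + gamma * max_a' Q_i(s',a')]."""
    new_q = {}
    for (state, action), outcomes in transitions.items():
        total = 0.0
        for prob, next_state, reward in outcomes:
            if next_state == TERMINAL:
                future = 0.0  # no value after the episode ends
            else:
                future = max(q[(next_state, b)] for b in ACTIONS)
            total += prob * (reward + GAMMA * future)
        new_q[(state, action)] = total
    return new_q


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat the backup until the largest change falls below tol."""
    q = {key: 0.0 for key in transitions}
    for _ in range(max_iters):
        new_q = bellman_backup(q)
        if max(abs(new_q[k] - q[k]) for k in q) < tol:
            return new_q
        q = new_q
    return q


q_star = value_iteration()
for key in sorted(q_star):
    print(key, round(q_star[key], 4))
```

A table like `q_star` is only feasible for small state spaces, which is the motivation for replacing it with a function approximator.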

Page 5:

End-to-End Reinforcement Learning

○ Use a neural network for Q(s,a; θ)

○ Train end-to-end from raw pixels

Page 6:

but.. a network for every task?

Page 7:

one network for all?

Page 8:

Catastrophic forgetting

● Well-known phenomenon
● Especially severe in Deep RL

Page 9:

Catastrophic forgetting

https://www.youtube.com/watch?v=Fh_zNpdc0Xs

Page 11:

An illustration

[Figure: schematic parameter space showing the Task A and Task B regions, the point * reached after training on Task A, and the paths taken by SGD, L2, and EWC when training on Task B.]

Page 12:

Elastic Weight Consolidation

Elastic Weight Consolidation (EWC): constrain important parameters to stay close to their old values.

Continual learning in the brain: synaptic consolidation reduces the plasticity of synapses that are vital to previous tasks.

Page 13:

Elastic Weight Consolidation

Implement the constraint as a quadratic penalty that is applied while training on Task B, but not uniformly: it should be greater for parameters that are important to Task A.

The posterior distribution contains exactly this information, but is intractable.

Page 14:

Elastic Weight Consolidation

Approximate the posterior with a Gaussian.

Mean: the parameter vector θ*_A.

Diagonal precision: given by an approximation of the Fisher information F.
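In the EWC paper cited later in this deck, the Gaussian approximation leads to the training loss L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_A,i)². The following is a minimal sketch of that penalty and of one common way to estimate a diagonal Fisher (the mean of squared per-sample log-likelihood gradients). The function names and all numbers are illustrative assumptions, not the authors' code.

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )


def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty: parameters with large F_i are pulled back harder."""
    return [lam * f * (t - s) for t, s, f in zip(theta, theta_star, fisher)]


def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


# Made-up numbers: parameter 0 matters a lot for Task A, parameter 1 barely.
theta_star_a = [1.0, -2.0]
grads_on_task_a = [[2.0, 0.1], [-2.0, -0.1]]
fisher_a = diagonal_fisher(grads_on_task_a)

theta = [1.5, -1.0]  # a candidate parameter vector while training on Task B
ewc = ewc_penalty(theta, theta_star_a, fisher_a, lam=1.0)
l2 = ewc_penalty(theta, theta_star_a, [1.0, 1.0], lam=1.0)  # uniform weights

print(fisher_a, ewc, l2)
print(ewc_penalty_grad(theta, theta_star_a, fisher_a, lam=1.0))
```

Setting every Fisher entry to 1 recovers the plain L2 penalty from the illustration, so the only difference between the two is the per-parameter weighting.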


Page 16:

Experiment: Permuted MNIST

Random, fixed permutations of the MNIST dataset.

Train a multilayer, fully-connected network with ReLUs until convergence.

We compare SGD, L2 regularisation, and EWC.

[Figure panels: Perm A, Perm B, Perm C]
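One way to build such tasks is to draw one pixel permutation per task and apply it to every image of that task. The sketch below assumes flattened 28x28 images and uses a stand-in list instead of real MNIST data; `make_permutation` and `permute_image` are hypothetical helper names.

```python
import random

NUM_PIXELS = 28 * 28


def make_permutation(num_pixels, seed):
    """Draw a random pixel ordering; the seed keeps it fixed for the whole task."""
    rng = random.Random(seed)
    perm = list(range(num_pixels))
    rng.shuffle(perm)
    return perm


def permute_image(flat_image, perm):
    """Reorder the pixels of one flattened image according to perm."""
    return [flat_image[i] for i in perm]


# One fixed permutation per task (Perm A, Perm B, Perm C).
tasks = {
    name: make_permutation(NUM_PIXELS, seed)
    for seed, name in enumerate(["A", "B", "C"])
}

image = list(range(NUM_PIXELS))  # stand-in for a flattened digit image
permuted = {name: permute_image(image, perm) for name, perm in tasks.items()}
```

Each task keeps the same labels and the same pixel values, so the tasks are equally hard for a fully-connected network but require different input weights.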


Page 21:

Let’s try something harder...

Sequential reinforcement learning tasks (10 Atari games)

Random ordering with extended game play on each task, multiple returns

Unknown task boundaries

Regular testing of all 10 games

Single network with fixed capacity

Page 22:

Experiment: Atari 10

Forget-Me-Not [1] allows labeling of data segments, used for:

● EWC regularisation
● Task-specific replay buffers used for DDQN [2]
● Task-specific bias and gains at each network layer

The Fisher is estimated at each task boundary and the EWC penalty is updated.

[1] The forget-me-not process, Milan et al., NIPS 2016
[2] Deep reinforcement learning with double Q-learning, van Hasselt et al., AAAI 2016
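A task-specific replay buffer can be sketched as one bounded queue per task label. In the sketch below the labels are plain strings supplied by the caller; in the experiment they would come from the segment labeling described on the slide. The `TaskReplay` class and its methods are hypothetical and are not the implementation used in the talk.

```python
import random
from collections import defaultdict, deque


class TaskReplay:
    """One bounded replay buffer per task label (illustrative sketch)."""

    def __init__(self, capacity_per_task, seed=0):
        # A deque with maxlen drops its oldest item once it is full.
        self.buffers = defaultdict(lambda: deque(maxlen=capacity_per_task))
        self.rng = random.Random(seed)

    def add(self, task_label, transition):
        self.buffers[task_label].append(transition)

    def sample(self, task_label, batch_size):
        """Sample without replacement from the buffer of a single task."""
        buffer = list(self.buffers[task_label])
        return self.rng.sample(buffer, min(batch_size, len(buffer)))


replay = TaskReplay(capacity_per_task=3)
for i in range(5):
    replay.add("pong", ("pong", i))
replay.add("breakout", ("breakout", 0))

print(list(replay.buffers["pong"]))  # only the three most recent transitions
print(replay.sample("pong", 2))
```

Keeping the buffers separate means that returning to an earlier game does not train on transitions from whichever game was played most recently.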

Page 24:

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell

Overcoming catastrophic forgetting in neural networks

PNAS 2017
arxiv.org/abs/1612.00796

Page 25:

Learning to navigate in complex mazes

Pages 26-29:

Navigation mazes

Game episode:

1. Random start
2. Find the goal (+10)
3. Teleport randomly
4. Re-find the goal (+10)
5. Repeat (limited time)

Variants:

● Static maze, static goal
● Static maze, random goal
● Random maze

Observations: RGB, velocity
Actions: 8

[Figure labels: rewards +10 and +1; episode lengths of 3600, 10800, and 3600 steps/episode]
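The episode structure above (random start, reach the goal, teleport, repeat until time runs out) can be mimicked in a toy grid. The sketch below is a loose illustration under stated assumptions: an open 5x5 grid with no walls, a random-walk agent with four moves instead of the eight actions on the slide, and no +1 rewards. The function name and parameters are hypothetical.

```python
import random


def run_maze_episode(size=5, goal=(4, 4), max_steps=3600, seed=0):
    """Random-walk agent in an open grid; returns (total reward, goal step times)."""
    rng = random.Random(seed)
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

    def random_cell():
        # Random start or teleport target: any cell except the goal itself.
        while True:
            cell = (rng.randrange(size), rng.randrange(size))
            if cell != goal:
                return cell

    position = random_cell()  # 1. random start
    total_reward = 0
    goal_times = []
    for step in range(1, max_steps + 1):  # 5. repeat for a limited time
        dx, dy = rng.choice(moves)
        x = min(max(position[0] + dx, 0), size - 1)
        y = min(max(position[1] + dy, 0), size - 1)
        position = (x, y)
        if position == goal:  # 2. and 4. find the goal (+10)
            total_reward += 10
            goal_times.append(step)
            position = random_cell()  # 3. teleport randomly
    return total_reward, goal_times


total_reward, goal_times = run_maze_episode()
print(total_reward, len(goal_times))
```

An agent that remembers where the goal is should shorten the gaps between successive entries of `goal_times`, which is the effect measured later in the talk.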

Page 30:

Why is learning navigation via reinforcement learning hard?

Given: Sparse rewards
The vast and meaningless silence of an agent exploring...

Wanted: Spatial knowledge
I have been here before! I know where to go!

[Two plots; axis scale 1e7]

Page 31:

Why is learning navigation via reinforcement learning hard?

Given: Sparse rewards
Wanted: Spatial knowledge

1. Accelerate reinforcement learning through auxiliary losses
   ➔ Stable gradients help learning, even if unrelated to reward
2. Drive spatial knowledge through choice of auxiliary tasks:
   ● Depth prediction
   ● Loop closure prediction

Pages 32-39:

Nav agent ingredients:

1. Convolutional encoder and RGB inputs
2. Stacked LSTM
3. Additional inputs (reward, action, and velocity)
4. RL: Asynchronous advantage actor critic (A3C), Mnih et al. (2016)
5. Aux task 1: Depth predictors (D1, D2)
6. Aux task 2: Loop closure predictor
7. For analysis: Position decoder

[Diagram labels: inputs x_t, r_{t-1}, {v_t, a_{t-1}}; encoder (enc); outputs Depth (D1), Depth (D2), Loop (L), Position]

Page 40:

details..

Page 41:

more details..

policy gradient:

depth prediction from visual features:

depth prediction from LSTM features:

loop prediction from LSTM features:
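The four terms listed on this slide lost their equations in the transcript. A common way to combine such terms is a weighted sum of the RL loss and the auxiliary losses, sketched below. The sketch assumes depth is predicted as a class over a few depth bins and loop closure as a binary label; the weights `beta_d1`, `beta_d2`, and `beta_loop` and all function names are hypothetical and do not reproduce the slide's formulas.

```python
import math

EPS = 1e-12  # guards against log(0)


def categorical_cross_entropy(probs, target_index):
    """Loss for one depth prediction treated as a choice among depth bins."""
    return -math.log(max(probs[target_index], EPS))


def binary_cross_entropy(prob, label):
    """Loss for the loop-closure prediction (label is 0 or 1)."""
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def total_loss(rl_loss, depth_probs_conv, depth_probs_lstm, depth_target,
               loop_prob, loop_label,
               beta_d1=1.0, beta_d2=1.0, beta_loop=1.0):
    """Weighted sum: RL loss + depth (visual) + depth (LSTM) + loop closure."""
    depth_loss_1 = categorical_cross_entropy(depth_probs_conv, depth_target)
    depth_loss_2 = categorical_cross_entropy(depth_probs_lstm, depth_target)
    loop_loss = binary_cross_entropy(loop_prob, loop_label)
    return (rl_loss
            + beta_d1 * depth_loss_1
            + beta_d2 * depth_loss_2
            + beta_loop * loop_loss)


loss = total_loss(
    rl_loss=1.0,
    depth_probs_conv=[0.5, 0.5],
    depth_probs_lstm=[0.5, 0.5],
    depth_target=0,
    loop_prob=0.5,
    loop_label=1,
)
print(loss)
```

Because the auxiliary terms produce a gradient at every step, they keep shaping the shared encoder and LSTM even when the reward is absent.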

Page 42:

Experiments

Architectures compared:

a. FF A3C
b. LSTM A3C
c. Nav A3C
d. Nav A3C +D1D2L

[Diagram labels: a and b take x_t as input; c and d take x_t, r_{t-1}, {v_t, a_{t-1}}; d adds the outputs Depth (D1), Depth (D2), and Loop (L)]

Page 43:

Results on large maze with static goal

https://www.youtube.com/watch?v=zHhbypmKaj0

[Figure labels: +10, +1]

Page 44:

Should depth be an input? Or a target?

[Diagram: two variants. Depth as input: rgbd_t, r_{t-1}, {v_t, a_{t-1}} feed the encoder. Depth as target: rgb_t, r_{t-1}, {v_t, a_{t-1}} feed the encoder, with Depth (D1) and Depth (D2) as prediction targets.]

Answer: the dense, non-noisy gradients from depth as a target are more helpful

Page 45:

Results with random goal locations

Is the agent remembering the goal location?

● Mean time to first goal find of episode: 14.0 sec
● Mean time to subsequent goal finds: 7.2 sec
● Not as impressive for large mazes: 15.4 sec vs 15.0 sec.

[Figure panels: small, large]
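The two statistics above can be computed from the times at which the goal is reached within each episode. The sketch below assumes teleportation is instantaneous, so the latency to a subsequent goal is the gap between consecutive arrivals. The episode data are made up purely to exercise the code and are not the talk's measurements.

```python
def goal_latencies(goal_times):
    """Split one episode's goal-arrival times into first and subsequent latencies."""
    if not goal_times:
        return None, []
    first = goal_times[0]
    later = [b - a for a, b in zip(goal_times, goal_times[1:])]
    return first, later


def mean(values):
    return sum(values) / len(values)


# Made-up arrival times (seconds since episode start), one list per episode.
episodes = [
    [10.0, 15.0, 19.0],
    [8.0, 14.0],
    [12.0, 16.0, 21.0, 25.0],
]

firsts, laters = [], []
for times in episodes:
    first, later = goal_latencies(times)
    if first is not None:
        firsts.append(first)
        laters.extend(later)

mean_first = mean(firsts)
mean_later = mean(laters)
print(mean_first, mean_later)
```

A subsequent-find latency well below the first-find latency suggests the agent is using memory of the goal location rather than searching from scratch.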

Page 46:

Latency to goal (as the agent returns)

● Trajectories of the Nav A3C+D+L agent in the I-maze and random goal maze over the course of one episode

● Value function and goal finding (red lines) are shown

Page 47:

Position decoding

● Trajectories of the Nav A3C+D+L agent in the random goal maze

● Position likelihoods are overlaid (predicted from LSTM hidden states)

● Initial uncertainty gives way to accurate position estimation.

[Diagram labels: inputs x_t, r_{t-1}, {v_t, a_{t-1}}; encoder (enc); outputs L, D1, D2, Position]

Page 48:

Results in random mazes

[Figure panels: small, large]

https://www.youtube.com/watch?v=EKXQAjoNdGM

Page 49:

https://www.youtube.com/watch?v=lNoaTyMZsWI

Page 50:

Thank you!

raiahadsell.com

Piotr Mirowski, Razvan Pascanu, Fabio, Ross, Andy, Hubert, Laurent, Koray, Dharsh, Misha, Andrea

Learning to navigate in complex environments

ICLR 2017
arxiv.org/abs/1611.03673