Page 1

Introduction to Deep Q-network

Presenter: Yunshu Du

CptS 580 Deep Learning

10/10/2016

Page 2

Deep Q-network (DQN)

Page 3

Deep Q-network (DQN)

• An artificial agent for general Atari game playing

– Learns to master 49 different Atari games directly from game screens

– Beats the best-performing learner from the same domain in 43 games

– Exceeds human expert performance in 29 games

Page 4

Deep Q-network (DQN)

• A demo of DQN playing Atari Breakout

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Page 5

Page 6

DQN is reinforcement learning + CNN magic!

• “Q”: Q-learning, a reinforcement learning (RL) method in which the agent interacts with the environment to maximize future rewards

• “Deep”, “network”: deep artificial neural networks that learn general representations in complex environments

Page 7

Q-Learning

• Action-value (Q) function

• Optimal Q function obeys Bellman equation

• The Q-Learning algorithm
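The equations behind these three bullets, reconstructed here in the standard form used in the DQN paper (not taken verbatim from the slide images):

```latex
% Action-value (Q) function: best expected discounted return
% from taking action a in state s, then following policy \pi
Q^*(s,a) = \max_\pi \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \,\middle|\, s_t = s,\ a_t = a,\ \pi \right]

% Bellman equation obeyed by the optimal Q function
Q^*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a \right]

% Tabular Q-learning update with learning rate \alpha
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```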

http://www.nervanasys.com/demystifying-deep-reinforcement-learning/


Page 8

Q-Learning

• Exploration vs. exploitation

– Should I learn as much as possible about the environment, or do my best with what I already know?

– ε-greedy exploration to select actions (a sketch follows below)

http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
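A minimal sketch of ε-greedy action selection (Python; the function and variable names are illustrative, not from the slides):

```python
# Epsilon-greedy: explore with probability epsilon, otherwise exploit
# the action with the highest current Q value.
import random

def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

In practice, DQN anneals ε from 1.0 toward a small value such as 0.1, so the agent explores heavily at first and exploits more as its Q estimates improve.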

Page 9

Example: Q-Learning for Atari Breakout

Page 10

Q-Learning

• But what if there are too many states/actions?

– Solution: a deep convolutional network as function approximator, Q(s, a; θ), with weights θ

Page 11

Deep convolutional neural network (CNN)

• Extracts features directly from raw pixels

• Atari game image pre-processing: 84x84x4 (four stacked 84x84 grayscale frames)

http://cs231n.github.io/convolutional-networks/
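A hedged sketch of this pre-processing step, assuming OpenCV and NumPy (the paper's exact cropping and frame-skipping details differ slightly):

```python
# Convert each RGB frame to grayscale, downsample to 84x84, and stack
# the 4 most recent frames into one 84x84x4 state.
import collections
import cv2
import numpy as np

def preprocess(frame_rgb):
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

frames = collections.deque(maxlen=4)   # holds the 4 most recent frames

def make_state(frame_rgb):
    frames.append(preprocess(frame_rgb))
    while len(frames) < 4:             # pad by repetition at episode start
        frames.append(frames[-1])
    return np.stack(frames, axis=-1)   # shape (84, 84, 4)
```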

Page 12

DQN Architecture

(figure: input layer and first two convolutional layers; the complete architecture is reconstructed on Page 14)

http://www.slideshare.net/onghaoyi/distributed-deep-qlearning

Page 13

DQN Architecture

(figure: adds the third convolutional layer; the complete architecture is reconstructed on Page 14)

http://www.slideshare.net/onghaoyi/distributed-deep-qlearning

Page 14

DQN Architecture

Input image: 84x84x4

Convolutional layers:

– Conv1: 32 filters, 8x8, stride 4; output size = (84-8)/4 + 1 = 20, i.e., 20x20x32; #W0 = (8*8*4)*32 = 8192

– Conv2: 64 filters, 4x4, stride 2; output size = (20-4)/2 + 1 = 9, i.e., 9x9x64; #W1 = (4*4*32)*64 = 32768

– Conv3: 64 filters, 3x3, stride 1; output size = (9-3)/1 + 1 = 7, i.e., 7x7x64; #W2 = (3*3*64)*64 = 36864

Fully connected layers:

– Reshape: 7*7*64 = 3136

– 512 rectifier units

– Output: Q values for each action

http://www.slideshare.net/onghaoyi/distributed-deep-qlearning

Any missing component?
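A minimal PyTorch sketch of this architecture, including the ReLU rectifiers between layers that the figure does not draw explicitly (PyTorch and the names below are my choices, not from the slides; the original DQN was implemented in Torch/Lua):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 9x9x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 7x7x64
            nn.ReLU(),
            nn.Flatten(),                                # 7*7*64 = 3136
            nn.Linear(3136, 512),                        # 512 rectifier units
            nn.ReLU(),
            nn.Linear(512, num_actions),                 # one Q value per action
        )

    def forward(self, x):
        return self.net(x)

q = DQN(num_actions=4)                     # e.g., 4 actions in Breakout
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```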

Page 15

Deep Q-Learning

• Problem: reinforcement learning is known to be unstable, or even to diverge, when a nonlinear function approximator such as a neural network is used

– Correlation between consecutive samples

– Small updates to the Q value may significantly change the policy

Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690.

Page 16

Deep Q-Learning

• Solutions in DQN

– Experience replay

• At each iteration, store the experience e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset D_t = {e_1, …, e_t}

• Draw samples of experience (s, a, r, s′) ~ U(D) uniformly at random and apply the Q update in minibatch fashion (a sketch follows below)

– Separate target network

• Clone Q(s, a; θ) to a separate target network Q̂(s, a; θ⁻) every C time steps

• Treat y as the target; θ⁻ is held fixed while updating θ

– Reward clipping

• Clip rewards to {-1, 1}
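A minimal sketch of the replay buffer described above (Python; the class name and fixed-capacity deque are my assumptions; the paper stores the last 1 million transitions):

```python
# Fixed-capacity experience replay: old transitions are evicted,
# and training samples are drawn uniformly at random.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # (s, a, r, s') ~ U(D): breaks the correlation between
        # consecutive samples that destabilizes learning
        return random.sample(self.buffer, batch_size)
```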

Page 17

Deep Q-network (DQN)

• Minimize the squared error loss between target and prediction

L_i(θ_i) = E_{(s,a,r,s′)~U(D)} [ ( r + γ max_{a′} Q̂(s′, a′; θ⁻) − Q(s, a; θ_i) )² ]

where y = r + γ max_{a′} Q̂(s′, a′; θ⁻) is the target and Q(s, a; θ_i) is the prediction

• Stochastic gradient descent w.r.t. the weights

– Minibatch of size 32

• Update weights using RMSProp: divide the gradient by a running average of its recent magnitude

http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

https://en.wikipedia.org/wiki/Stochastic_gradient_descent
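For reference, a standard statement of the RMSProp update from the lecture slides linked above (the symbols β for the decay rate and ε for the small stabilizer are mine, not the slide's):

```latex
% Running average of the squared gradient
v_t = \beta\, v_{t-1} + (1 - \beta)\,\left(\nabla_\theta L\right)^2

% Divide the gradient by the root of the running average
\theta \leftarrow \theta - \frac{\alpha}{\sqrt{v_t} + \epsilon}\,\nabla_\theta L
```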

Page 18

DQN: Putting It Together

• Input → CNN → Q values for actions

• Store experience {s_t, a_t, r_t, s_{t+1}}, then sample a minibatch

• Calculate the target for each sample

• Calculate the gradient and update the weights (a sketch of one training step follows below)
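A hedged sketch of one such training step, combining the loss, the target network, and minibatch SGD (PyTorch, with illustrative names; not the authors' exact code):

```python
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch holds minibatch tensors; `done` is a 0/1 float mask
    s, a, r, s_next, done = batch
    # Target: y = r + gamma * max_a' Qhat(s', a'; theta^-), theta^- held fixed
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    # Prediction: Q(s, a; theta) for the actions actually taken
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)   # squared error between target and prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()          # e.g., torch.optim.RMSprop, as on Page 17
    return loss.item()
```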

Page 19

Q-learning vs. DQN

(figure: side-by-side comparison of Q-learning and DQN)

http://www.nervanasys.com/demystifying-deep-reinforcement-learning/

Page 20

But … it’s not perfect!

• Reward clipping

– The agent can’t distinguish between different scales of reward (e.g., Pac-Man)

• Limited experience replay

– Might throw away important experiences

• High computational complexity

– Almost 10 days to train one game on a single GPU! Slower still on physical robots

– 10+ GB to store experiences

Andrej Karpathy’s blog

Page 21

Beyond DQN

• More stable learning

– Double DQN (van Hasselt et al., 2015): uses two Q-networks, one to select the action and the other to evaluate it (see the target below)

• Limited experience replay

– Prioritized Experience Replay (Schaul et al., 2016): weights experiences according to how surprising they are

• High computational time complexity

– Parallel/distributed computing (Nair et al., 2015)

– Dueling network (Wang et al., 2015): splits DQN into two streams

– Asynchronous RL (A3C) (Mnih et al., 2016): can be trained on CPUs

David Silver’s tutorial on Deep Reinforcement Learning, ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
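For the Double DQN bullet above, the modified target makes the role of the two networks concrete (standard form from the Double DQN paper; the online weights θ select the action, the target weights θ⁻ evaluate it):

```latex
% Online network (\theta) selects the action,
% target network (\theta^-) evaluates it
y = r + \gamma\, Q\!\left(s',\ \operatorname*{arg\,max}_{a'} Q(s', a'; \theta);\ \theta^-\right)
```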

Page 22

Beyond DQN

David Silver’s tutorial on Deep Reinforcement Learning, ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

Page 23

Beyond DQN

• Deep Policy Network for continuous control

– Simulated robots

– Physical robots

David Silver’s tutorial on Deep Reinforcement Learning, ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

Page 24

Beyond DQN

Mastering the game of Go with deep neural networks and tree search

Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., et al., 2016. Nature, 529, 484-489.

Page 25

So … DQN is not magic

• Q-learning + a CNN as function approximator

• Experience replay + separate target network + reward clipping = stabilized learning

• To be continued …

Page 26

Introduction to Deep Q-network

Presenter: Yunshu Du

CptS 580 Deep Learning

10/10/2016