Introduction to Deep Q-network
Presenter: Yunshu Du
CptS 580 Deep Learning
10/10/2016
Deep Q-network (DQN)
• An artificial agent for general Atari game playing
– Learns to master 49 different Atari games directly from game screens
– Beats the best-performing learner from the same domain in 43 games
– Exceeds human expert performance in 29 games
Deep Q-network (DQN)
• A demo of DQN playing Atari Breakout
https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN is reinforcement learning + CNN magic!
• “Q”: Q-learning, a reinforcement learning (RL) method in which the agent interacts with the environment to maximize future rewards
• “Deep”, “network”: deep artificial neural networks that learn general representations in complex environments
Q-Learning
• Action-value (Q) function
• Optimal Q function obeys Bellman equation
• The Q-Learning algorithm
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
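The equations on this slide did not survive transcription; in standard notation, the action-value function, its Bellman equation, and the tabular Q-learning update are:

Q^*(s,a) = \max_\pi \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi \right]

Q^*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s, a \right]

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \right)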
Q-Learning
• Exploration vs. exploitation
– Should I gather more information about the world, or do my best with what I already know?
– ε-greedy exploration to select actions
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
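As a concrete illustration, a minimal sketch of ε-greedy action selection in Python (q_values is a hypothetical array of Q estimates, one entry per action):

import numpy as np

def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon: pick a uniformly random action.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    # Exploit otherwise: pick the action with the highest estimated Q value.
    return int(np.argmax(q_values))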
Example: Q-Learning for Atari Breakout
Q-Learning
• But what if there are too many states/actions?
– Solution: use a deep convolutional network as a function approximator, Q(s, a; θ), parameterized by weights θ
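To make the scaling problem concrete, here is a minimal tabular Q-learning sketch (hypothetical names and constants); with 84x84x4-pixel states the table would need astronomically many entries, which is what motivates the function approximator:

from collections import defaultdict

alpha, gamma = 0.1, 0.99   # illustrative learning rate and discount factor
Q = defaultdict(float)     # Q[(state, action)] -> estimated return

def q_update(state, action, reward, next_state, actions):
    # One tabular Q-learning step: move Q(s,a) toward the bootstrapped target.
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])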
Deep convolutional neural network (CNN)
• Extracts features directly from raw pixels
• Atari game image pre-processing: frames are converted to grayscale, resized to 84x84, and stacked in groups of 4 → input of size 84x84x4
http://cs231n.github.io/convolutional-networks/
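A simplified sketch of this preprocessing, assuming OpenCV and NumPy (details differ slightly from the paper):

import numpy as np
import cv2

def preprocess(frame):
    # frame: raw RGB Atari screen, e.g. shape (210, 160, 3).
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)             # drop color
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0                    # scale to [0, 1]

def stack_frames(last_four):
    # Stack the 4 most recent processed frames into one 84x84x4 input.
    return np.stack(last_four, axis=-1)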
DQN Architecture
• Input image: 84x84x4
• Conv1: 32 filters, 8x8, stride 4
– output size = (84-8)/4+1 = 20 → 20x20x32
– #W0 = (8*8*4)*32 = 8192
• Conv2: 64 filters, 4x4, stride 2
– output size = (20-4)/2+1 = 9 → 9x9x64
– #W1 = (4*4*32)*64 = 32768
• Conv3: 64 filters, 3x3, stride 1
– output size = (9-3)/1+1 = 7 → 7x7x64
– #W2 = (3*3*64)*64 = 36864
• Fully connected: reshape 7*7*64 = 3136, then a layer of 512 rectifier units
• Output: one Q value for each action
• Any missing component?
http://www.slideshare.net/onghaoyi/distributed-deep-qlearning
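A minimal sketch of this architecture in PyTorch (an assumption; the original work used Torch/Lua). Note one answer to the question above: compared to a typical CNN, pooling layers are absent, since they would discard spatial information about object positions:

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 9x9x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 7x7x64
            nn.ReLU(),
            nn.Flatten(),                                # -> 3136
            nn.Linear(3136, 512),                        # 512 rectifier units
            nn.ReLU(),
            nn.Linear(512, num_actions),                 # one Q value per action
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (N, 4, 84, 84), values in [0, 1].
        return self.net(x)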
Q-Learning
• Problem: reinforcement learning is known to be unstable, or even to diverge, when using a nonlinear function approximator such as a neural network
– Correlation between consecutive samples
– Small updates to the Q values may significantly change the policy
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE
transactions on automatic control, 42(5), 674-690.
Deep Q-Learning
• Solutions in DQN
– Experience replay
• At each time step t, store the experience e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset D_t = {e_1, …, e_t}
• Draw samples of experience (s, a, r, s′) ~ U(D) uniformly at random and apply the Q update in minibatch fashion
– Separate target network
• Clone Q(s,a; θ) into a separate target network Q̂(s,a; θ⁻) every C time steps
• The target y is computed with θ⁻, which is held fixed while θ is updated
– Reward clipping
• Clip positive rewards to +1 and negative rewards to −1 (0 unchanged)
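A minimal sketch of the replay memory using Python's standard library (the capacity value is illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        # Oldest experiences are discarded once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between
        # consecutive experiences.
        return random.sample(self.buffer, batch_size)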
Deep Q-network (DQN)
• Minimize the squared error loss between target and prediction
• Stochastic gradient descent w.r.t. the weights
– Minibatch of size 32
• Update weights using RMSprop: divide the gradient by a running average of its recent magnitudes
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
L_i(θ_i) = E_{(s,a,r,s′)~U(D)} [ (y − Q(s,a; θ_i))² ], with target y = r + γ max_{a′} Q̂(s′,a′; θ⁻) and prediction Q(s,a; θ_i)
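A minimal sketch of computing this loss for one minibatch, assuming PyTorch and the hypothetical DQN module above:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch tensors: states (N,4,84,84), actions (N,) long, rewards (N,),
    # next_states (N,4,84,84), dones (N,) with 1.0 for terminal transitions.
    states, actions, rewards, next_states, dones = batch

    # Prediction: Q(s, a; theta) for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: r + gamma * max_a' Q_hat(s', a'; theta_minus), with frozen weights.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * q_next

    return F.mse_loss(q_pred, y)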
DQN: Putting It Together
1. Input frames → CNN → Q values for the actions
2. Store experience {s_t, a_t, r_t, s_{t+1}}, then sample a minibatch
3. Calculate the target for each sample
4. Calculate the gradient and update the weights
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
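A heavily simplified training-loop sketch tying the pieces together; env, make_tensors, and the constants are hypothetical stand-ins, while epsilon_greedy, ReplayBuffer, DQN, and dqn_loss are the sketches from earlier slides:

import copy
import torch

num_steps = 1_000_000                     # illustrative training length
C = 10_000                                # target-network update period
q_net = DQN(num_actions=4)                # e.g., 4 actions in Breakout
target_net = copy.deepcopy(q_net)         # clone for the frozen target
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
buffer = ReplayBuffer()

state = env.reset()                       # hypothetical Atari environment
for step in range(num_steps):
    q_values = q_net(state.unsqueeze(0)).squeeze(0)
    # Fixed epsilon for brevity; in practice it is annealed over time.
    action = epsilon_greedy(q_values.detach().numpy(), epsilon=0.1)
    next_state, reward, done = env.step(action)
    reward = max(-1.0, min(1.0, reward))  # reward clipping
    buffer.store(state, action, reward, next_state, done)

    if len(buffer.buffer) >= 32:
        batch = make_tensors(buffer.sample(32))  # hypothetical collation
        loss = dqn_loss(q_net, target_net, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % C == 0:                     # refresh the target network
        target_net.load_state_dict(q_net.state_dict())
    state = env.reset() if done else next_state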
But … It’s not perfect!
• Reward clipping
– Agent can't distinguish different scales of rewards (e.g., Ms. Pac-Man)
• Limited experience replay
– Might throw away important experiences
• High computational complexity
– Almost 10 days to train one game on a single GPU! Even slower for physical robots
– 10+ GB to store experiences
Andrej Karpathy’s blog
Beyond DQN
• More stable learning
– Double DQN (van Hasselt et al., 2015): uses two Q-networks, one to select the action and the other to evaluate it (target shown below)
• Limited experience replay
– Prioritized Experience Replay (Schaul et al., 2016): weights experiences according to surprise
• High computational complexity
– Parallel/distributed computing (Nair et al., 2015)
– Dueling network (Wang et al., 2015): splits the Q-network into two streams
– Asynchronous RL (A3C) (Mnih et al., 2016): can be trained on CPUs
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
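For reference, the Double DQN target from van Hasselt et al. (2015): the online weights θ select the action, while the target weights θ⁻ evaluate it:

y_t = r_{t+1} + \gamma \, Q\left( s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t);\ \theta_t^{-} \right)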
Beyond DQN
• Deep Policy Network for continuous control
– Simulated robots
– Physical robots
David Silver’s tutorial on Deep Reinforcement Learning
ICML 2016, http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
Beyond DQN
Mastering the game of
Go with deep neural
networks and tree search
Silver, D., Huang, A.,
Maddison, C.J., Guez, A.,
Sifre, L., Van Den Driessche,
G., Schrittwieser, J.,
Antonoglou, I.,
Panneershelvam, V., Lanctot,
M. and Dieleman, S., 2016.
So … DQN is not magic
• Q learning + CNN as function approximator
• Experience replay + separate target network + reward clipping = stabilized learning
• To be continued …