Deep Q-Learning
Transcript of Deep Q-Learning
Deep Q-Learning: A Reinforcement Learning Approach
What is Reinforcement Learning?
- Much like how biological agents behave
- No supervisor, only a reward
- Data is time-dependent (non-iid)
- Feedback is delayed
- Agent actions affect the data it receives
Examples:
- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- RoboCup Soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)
Videos
Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward.
- Defeat the world champion at Go: +R / -R for winning/losing a game
- Make a robot walk: +R for moving forward, -R for falling over
- Play ATARI games: +R / -R for increasing/decreasing score
- Control a helicopter: +R / -R for following the trajectory / crashing
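In symbols, the "expected cumulative reward" is usually formalised as the discounted return (standard notation, as in Sutton & Barto):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad \gamma \in [0, 1]
```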
Agent and Environment
Fully Observable Environments (agent state = environment state):
- Agent directly observes the environment
- Example: a chess board
Partially Observable Environments (agent state ≠ environment state):
- Agent indirectly observes the environment
- Example: a robot with a motion sensor or camera
- Agent must construct its own state representation
RL Components: Policy and Value Function
Policy is the agent's behaviour function:
- Maps from state to action
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
Value function is a prediction of future reward:
- Used to evaluate states and select between actions
Model
Predicts what the environment will do next:
- P predicts the next state
- R predicts the next (immediate) reward
Maze example: r = -1 per time-step and policy
[David Silver. Advanced Topics: RL]
Maze example: Value function and Model
[David Silver. Advanced Topics: RL]
Exploration-Exploitation Dilemma
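A standard way to trade off exploration and exploitation is an ε-greedy policy: act randomly with probability ε, greedily otherwise. A minimal sketch (not from the slides; function and parameter names are my own):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the maximal Q-value (ties go to the first)
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Annealing ε from 1.0 toward a small value over training is the usual schedule.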
Math: Markov Decision Process (MDP)
Almost all RL problems can be formalised as MDPs.
An MDP is a tuple ⟨S, A, P, R, γ⟩:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix: P_ss'^a = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R_s^a = E[R_{t+1} | S_t = s, A_t = a]
- γ ∈ [0, 1] is a discount factor
State-Value and Action-Value Functions, Bellman Equations
State-value function: expected return starting from state s, and then following policy π:
v_π(s) = E_π[G_t | S_t = s]
Action-value function: expected return starting from state s, taking action a, and then following policy π:
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
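Both values satisfy Bellman expectation equations, decomposing each into the immediate reward plus the discounted value of the successor (standard form, as in Silver's course):

```latex
v_\pi(s)    = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
```

```latex
q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \right]
```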
Finding an Optimal Policy
- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function v_*(s)
- All optimal policies achieve the optimal action-value function q_*(s, a)
All you need is to find q_*(s, a): acting greedily with respect to it is an optimal policy.
Bellman Opt Equation for state-value function
[David Silver. Advanced Topics: RL]
Bellman Opt Equation for action-value function
[David Silver. Advanced Topics: RL]
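Written out in the notation of the cited course, the two Bellman optimality equations express each optimal value via a one-step lookahead:

```latex
v_*(s)    = \max_a \Big( R_s^a + \gamma \sum_{s'} P_{ss'}^a\, v_*(s') \Big)
```

```latex
q_*(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s', a')
```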
Policy Iteration Demo
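The demo above alternates policy evaluation with greedy policy improvement. A minimal tabular sketch, assuming a known MDP model (the `P`/`R` dictionary encoding is my own illustrative choice, not from the slides):

```python
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until stable.
    P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward."""
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence
        while True:
            delta = 0.0
            for s in states:
                v = R[s][policy[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```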
Q-Learning: a Model-Free, Off-Policy Control Algorithm
Model-free (vs model-based):
- The MDP model is unknown, but experience can be sampled
- Or the MDP model is known, but is too big to use, except by sampling
Off-policy (vs on-policy):
- Can learn about a policy from experience sampled from some other policy
Control (vs prediction):
- Finds the best policy, rather than evaluating a given one
Q-Learning
[David Silver. Advanced Topics: RL]
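The tabular update on this slide can be sketched as a single backup step (the dictionary-based Q-table below is an illustrative choice, not from the slides):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning backup: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s', a').  The max makes the update off-policy:
    it bootstraps from the greedy action, whatever the behaviour policy did."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]
```

Repeating this update along trajectories generated by, e.g., an ε-greedy behaviour policy converges to q_* in the tabular setting.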
DQN - Q-Learning with function approximation
[Human-level control through deep reinforcement learning]
[Human-level control through deep reinforcement learning]
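The cited paper trains the network by minimising the squared TD error against a periodically frozen target network with parameters θ⁻, over minibatches drawn from the replay memory D:

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \big)^2 \Big]
```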
Issues with Q-Learning Using a Neural Network
- Data is sequential (non-iid)
- Policy changes rapidly with slight changes to Q-values
  - Policy may oscillate
  - Distribution of experience swings from one extreme to another
- Scale of rewards and Q-values is unknown
  - Unstable backpropagation due to large gradients
DQN Solutions
- Use experience replay
  - Breaks correlations in the data
  - Learns from all past policies
  - Uses off-policy Q-learning
- Freeze the target Q-network
  - Avoids policy oscillations
  - Breaks the correlation between the Q-network and its target
- Clip rewards and gradients
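The experience-replay idea can be sketched as a fixed-size buffer sampled uniformly at random (an illustrative sketch, not the paper's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions.
    Sampling minibatches uniformly at random breaks the temporal
    correlations present in sequential experience."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement within the minibatch
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training then interleaves acting (appending transitions) with gradient steps on sampled minibatches, which also lets each transition be reused across many updates.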
Neon Demo
Links
- Human-level control through deep reinforcement learning
- Course: David Silver. Advanced Topics: RL
- Tutorial: David Silver. Deep Reinforcement Learning
- Book: Sutton, Barto. Reinforcement Learning
- Source code: simple_dqn
- ReinforceJS
- The Arcade Learning Environment