Deep Q-Learning
Transcript of Deep Q-Learning
Deep Q-Learning: A Reinforcement Learning Approach
What is Reinforcement Learning?
- Much like how biological agents behave
- No supervisor, only a reward
- Data is time-dependent (non-iid)
- Feedback is delayed
- Agent actions affect the data it receives
Examples:
- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- RoboCup Soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)
Videos
Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward.
- Defeat the world champion at Go: +R / -R for winning/losing a game
- Make a robot walk: +R for moving forward, -R for falling over
- Play ATARI games: +R / -R for increasing/decreasing score
- Control a helicopter: +R / -R for following the trajectory / crashing
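In symbols, the "expected cumulative reward" is usually formalised as the discounted return (standard notation, as in Sutton & Barto):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad \gamma \in [0, 1]
```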
Agent and Environment
Fully Observable Environments (agent state = environment state):
- Agent directly observes the environment
- Example: a chess board
Partially Observable Environments (agent state ≠ environment state):
- Agent indirectly observes the environment
- Example: a robot with a motion sensor or camera
- Agent must construct its own state representation
RL Components: Policy and Value Function
Policy is the agent's behaviour function:
- Maps from state to action
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
Value function is a prediction of future reward:
- Used to evaluate states and select between actions
Model
Predicts what the environment will do next:
- P predicts the next state
- R predicts the next (immediate) reward
Maze example: r = -1 per time-step and policy
[David Silver. Advanced Topics: RL]
Maze example: Value function and Model
[David Silver. Advanced Topics: RL]
Exploration-Exploitation Dilemma
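A standard way to trade off exploration and exploitation is an ε-greedy policy: act randomly with probability ε, greedily otherwise. A minimal sketch (not from the slides; function and parameter names are my own):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the maximal Q-value (ties go to the first)
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Annealing ε from 1.0 toward a small value over training is the usual schedule.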
Math: Markov Decision Process (MDP)
Almost all RL problems can be formalised as MDPs.
An MDP is a tuple ⟨S, A, P, R, γ⟩:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix: P_ss'^a = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R_s^a = E[R_{t+1} | S_t = s, A_t = a]
- γ ∈ [0, 1] is a discount factor
State-Value and Action-Value Functions, Bellman Equations
State-value function: expected return starting from state s, and then following policy π:
v_π(s) = E_π[G_t | S_t = s]
Action-value function: expected return starting from state s, taking action a, and then following policy π:
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
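Both values satisfy Bellman expectation equations, decomposing each into the immediate reward plus the discounted value of the successor (standard form, as in Silver's course):

```latex
v_\pi(s)    = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
```

```latex
q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \right]
```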
Finding an Optimal Policy
- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function v_*(s)
- All optimal policies achieve the optimal action-value function q_*(s, a)
All you need is to find q_*(s, a): acting greedily with respect to it is an optimal policy.
Bellman Opt Equation for state-value function
[David Silver. Advanced Topics: RL]
Bellman Opt Equation for action-value function
[David Silver. Advanced Topics: RL]
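Written out in the notation of the cited course, the two Bellman optimality equations express each optimal value via a one-step lookahead:

```latex
v_*(s)    = \max_a \Big( R_s^a + \gamma \sum_{s'} P_{ss'}^a\, v_*(s') \Big)
```

```latex
q_*(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s', a')
```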
Policy Iteration Demo
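The demo above alternates policy evaluation with greedy policy improvement. A minimal tabular sketch, assuming a known MDP model (the `P`/`R` dictionary encoding is my own illustrative choice, not from the slides):

```python
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until stable.
    P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward."""
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence
        while True:
            delta = 0.0
            for s in states:
                v = R[s][policy[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```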
Q-Learning: a Model-Free, Off-Policy Control Algorithm
Model-free (vs model-based):
- The MDP model is unknown, but experience can be sampled
- Or the MDP model is known, but is too big to use, except by sampling
Off-policy (vs on-policy):
- Can learn about a policy from experience sampled from some other policy
Control (vs prediction):
- Finds the best policy, rather than evaluating a given one
Q-Learning
[David Silver. Advanced Topics: RL]
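The tabular update on this slide can be sketched as a single backup step (the dictionary-based Q-table below is an illustrative choice, not from the slides):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning backup: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s', a').  The max makes the update off-policy:
    it bootstraps from the greedy action, whatever the behaviour policy did."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]
```

Repeating this update along trajectories generated by, e.g., an ε-greedy behaviour policy converges to q_* in the tabular setting.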
DQN - Q-Learning with function approximation
[Human-level control through deep reinforcement learning]
[Human-level control through deep reinforcement learning]
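The cited paper trains the network by minimising the squared TD error against a periodically frozen target network with parameters θ⁻, over minibatches drawn from the replay memory D:

```latex
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \big)^2 \Big]
```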
Issues with Q-Learning Using a Neural Network
- Data is sequential (non-iid)
- Policy changes rapidly with slight changes to Q-values
  - Policy may oscillate
  - Distribution of experience swings from one extreme to another
- Scale of rewards and Q-values is unknown
  - Unstable backpropagation due to large gradients
DQN Solutions
- Use experience replay
  - Breaks correlations in the data
  - Learns from all past policies
  - Uses off-policy Q-learning
- Freeze the target Q-network
  - Avoids policy oscillations
  - Breaks the correlation between the Q-network and its target
- Clip rewards and gradients
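The experience-replay idea can be sketched as a fixed-size buffer sampled uniformly at random (an illustrative sketch, not the paper's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions.
    Sampling minibatches uniformly at random breaks the temporal
    correlations present in sequential experience."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement within the minibatch
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training then interleaves acting (appending transitions) with gradient steps on sampled minibatches, which also lets each transition be reused across many updates.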
Neon Demo
Links
- Human-level control through deep reinforcement learning
- Course: David Silver. Advanced Topics: RL
- Tutorial: David Silver. Deep Reinforcement Learning
- Book: Sutton, Barto. Reinforcement Learning
- Source code: simple_dqn
- ReinforceJS
- The Arcade Learning Environment