Post on 12-Aug-2018
The Reinforcement Learning Problem
Reinforcement Learning (RL)
RL is the study of learning in a scenario where an agent actively interacts with its environment to achieve a certain goal.
Christian Hanauer Reinforcement Learning 2 / 42
The Reinforcement Learning Problem
RL is characterized by:
No supervisor
Delayed rewards
Exploration versus exploitation dilemma
Uncertainty about environment
Reward Hypothesis
All goals can be described by the maximisation of the expected cumulative reward.
→ determine optimal policy π∗ (best course of actions)
Can you think of a counterexample?
Outline
1 Introduction
2 Markov Decision Processes
   Markov Decision Processes
   Value Functions
   Bellman Equation
3 Model Free Reinforcement Learning
   Generalized Policy Iteration
   Monte Carlo Methods
   Temporal-Difference Learning
4 Function Approximation
   Motivation
   Prediction Objective
   Methods
5 Summary
Markov Property
“The future is independent of the past given the present”
Definition
A state St is Markov if and only if
Pr[St+1 | St] = Pr[St+1 | S1, ..., St]
Definition
A Markov Process (or Markov Chain) is a tuple 〈S,P〉, where
S is a countable non-empty set of states
P is a state transition probability matrix with Pss′ = Pr[St+1 = s′ | St = s]
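Sampling a trajectory from such a chain only needs the transition rows. A minimal Python sketch, with a hypothetical two-state weather chain (states and probabilities are illustrative, not from the slides):

```python
import random

# Hypothetical 2-state weather chain: P[s][s2] = Pr[S_{t+1} = s2 | S_t = s].
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state, rng):
    """Sample the next state from the transition row of `state`."""
    r, acc = rng.random(), 0.0
    for nxt, p in P[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # guard against floating-point rounding

rng = random.Random(0)
state = "sunny"
trajectory = [state]
for _ in range(5):
    state = step(state, rng)
    trajectory.append(state)
print(trajectory)
```

Each row of P sums to 1, so `step` is a valid categorical sample from the current state's row.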
Markov Decision Process
Definition
A Markov Decision Process (MDP) is a tuple 〈S,A,P,R, γ〉, where
S is a countable non-empty set of states
A is a countable non-empty set of actions
P is a state transition probability matrix with Pass′ = Pr[St+1 = s′ | St = s, At = a]
R is a reward function with Ras = E[Rt+1 | St = s, At = a]
γ is a discount factor with γ ∈ [0, 1]
Task of the learning agent in an MDP:
Determine the action to take at each state, the so-called policy π : S → A
Goals and Rewards
What quantity do we want to maximize?
Definition
The return Gt is the total discounted sum of rewards from time-step t:
Gt = Rt+1 + γRt+2 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1
The reward Rt is a scalar feedback signal
γ ∈ [0, 1] determines how much we take future rewards into account
Mathematically convenient
Represents uncertainty about the future
Human behaviour shows a preference for immediate rewards
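For a finite episode the return can be computed in a single backward pass, using the recursion Gt = Rt+1 + γ Gt+1. A small sketch (the reward sequence is a made-up example):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed backwards via
    G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# gamma = 0: only the immediate reward counts; gamma = 1: undiscounted sum.
print(discounted_return([1, 2, 3], gamma=0.0))  # 1.0
print(discounted_return([1, 2, 3], gamma=1.0))  # 6.0
print(discounted_return([1, 2, 3], gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```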
Value Functions
Definition
The state-value function vπ(s) of an MDP is the expected return starting from state s and then following policy π
vπ(s) = Eπ[Gt |St = s]
Definition
The action-value function qπ(s, a) of an MDP is the expected return starting from state s, taking action a and then following policy π
qπ(s, a) = Eπ[Gt |St = s,At = a]
Ordering over policies:
π ≥ π′ if vπ(s) ≥ vπ′(s) ∀ s
Evaluating the Value Function
How can we determine the value function vπ(s)?
Idea: Decompose vπ(s) into immediate and future reward
vπ(s) = Eπ[Gt |St = s]
= Eπ[Rt+1 + γRt+2 + γ2Rt+3 + ...|St = s]
= Eπ[Rt+1 + γ(Rt+2 + γRt+3 + ...)|St = s]
= Eπ[Rt+1 + γGt+1|St = s]
= Eπ[Rt+1 + γvπ(St+1)|St = s] (Bellman Expectation Equation for vπ)
Can be solved directly (matrix equation) or iteratively (Dynamic Programming)
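The direct solution rearranges vπ = R + γPvπ into (I − γP)vπ = R for a fixed policy. A minimal numpy sketch; the two-state chain, its transition matrix, and the rewards are made up for illustration:

```python
import numpy as np

# Made-up 2-state chain under a fixed policy: state 1 is absorbing.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])      # P[s, s2] = Pr[S_{t+1} = s2 | S_t = s]
R = np.array([1.0, 0.0])        # expected immediate reward per state
gamma = 0.9

# Solve (I - gamma * P) v = R instead of inverting the matrix explicitly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # state 1 is absorbing with zero reward, so v[1] = 0
```

This works whenever γ < 1 (the matrix I − γP is then invertible); for large state spaces the iterative Dynamic Programming route is preferred.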
Optimizing the Value Function
How can we find an optimal policy π?
(i) The optimal action-value function q∗ is given by

q∗(s, a) = max_π qπ(s, a)
(ii) Use the Bellman Expectation Equation for qπ
qπ(s, a) = Eπ[Rt+1 + γqπ(St+1,At+1)|St = s,At = a]
(iii) The optimal q∗ is described by the Bellman Optimality Equation

q∗(s, a) = E[Rt+1 + γ max_{a′} q∗(St+1, a′) | St = s, At = a]
For any MDP: an optimal policy π∗ always exists
Generalized Policy Iteration
How do we learn an optimal policy π∗?
Policy evaluation: Monte Carlo, TD(0), ...
Policy improvement: Greedy policy improvement, ...
Policy Improvement Theorem
Theorem
Let π and π′ be any pair of policies such that qπ(s, π′(s)) ≥ vπ(s) for all s. Then the policy π′ must be as good as, or better than, π:
vπ′(s) ≥ vπ(s)
Proof:
vπ(s) ≤ qπ(s, π′(s)) = Eπ′ [Rt+1 + γvπ(St+1)|St = s]
≤ Eπ′ [Rt+1 + γqπ(St+1, π′(St+1))|St = s]
≤ Eπ′ [Rt+1 + γRt+2 + γ2qπ(St+2, π′(St+2))|St = s]
≤ Eπ′ [Rt+1 + γRt+2 + ...|St = s]
= vπ′(s)
Monte Carlo Methods
Overview:
MC methods learn from episodes of experience
MC methods are model-free
Update of V (St) after end of episode
Basic idea: sample sequences of MDP
How do we evaluate a policy using Monte Carlo Methods?
The value function is given by
vπ(s) = Eπ[Gt |St = s]
MC policy evaluation uses the empirical mean return toestimate vπ(s)
V (St)← V (St) + α(Gt − V (St))
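The constant-α update above is a one-line rule. A tiny sketch, with a single made-up state and hand-picked returns (both illustrative, not from the slides):

```python
# Constant-alpha Monte Carlo update: V(St) <- V(St) + alpha * (Gt - V(St)).
def mc_update(V, s, G, alpha):
    V[s] += alpha * (G - V[s])

V = {"s": 0.0}
for G in [1.0, 0.0, 1.0, 1.0]:   # returns observed from four sampled episodes
    mc_update(V, "s", G, alpha=0.5)
print(V["s"])  # 0.8125
```

With α = 1/N(s) this is exactly the incremental empirical mean; a constant α instead weights recent episodes more, which suits non-stationary problems.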
Remark: Incremental Mean
We will compute the mean µk of the sequence x1, x2, ... incrementally:

µk = (1/k) Σ_{j=1}^{k} xj
   = (1/k) (xk + Σ_{j=1}^{k−1} xj)
   = (1/k) (xk + (k − 1) µk−1)
   = µk−1 + (1/k) (xk − µk−1)
On-policy first-visit MC control
On-policy first-visit MC control
Initialize for all s ∈ S, a ∈ A:
   Q(s, a) arbitrarily
   Returns(s, a) empty list
   π(a|s) an arbitrary policy
Repeat:
(a) Generate an episode {S1, A1, R2, ..., ST} using π
(b) For each pair s, a in the episode:
G ← return following the first occurrence of s, a
Append G to Returns(s, a)
Q(s, a) ← average(Returns(s, a))
(c) For each s in the episode:
Update policy π ε-greedily
Remarks
Which door do you want toopen?
You get a reward R = 1
Which door do you want toopen?
Are you sure you have chosenthe right door?
Temporal-Difference Learning
Overview:
TD methods learn from episodes of experience
TD methods are model-free
Update of V (St) after each time step
Basic idea: update a guess towards a guess
How do we evaluate a policy using TD methods?
MC: Update the value V (St) towards actual return Gt
V (St)← V (St) + α(Gt − V (St))
TD: Update the value V (St) towards estimated return
V (St)← V (St) + α(Rt+1 + γV (St+1)− V (St))
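The TD(0) update can be sketched on a made-up two-state episodic chain: state 0 steps to state 1 with reward 0, and state 1 terminates with reward 1 (the environment and hyperparameters are illustrative assumptions, not from the slides):

```python
def td0(episodes, alpha=0.1, gamma=1.0):
    """TD(0) prediction on a fixed 2-state chain: 0 -> 1 -> terminal."""
    V = [0.0, 0.0]
    for _ in range(episodes):
        # transition 0 -> 1, reward 0: bootstrap from the estimate V[1]
        V[0] += alpha * (0.0 + gamma * V[1] - V[0])
        # transition 1 -> terminal, reward 1 (terminal state has value 0)
        V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
    return V

V = td0(episodes=1000)
print(V)  # both values approach the true value 1.0
```

Unlike MC, each update happens immediately after the step, using the current estimate of the next state's value as a stand-in for the rest of the return.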
Example: Driving Home
State                 | Elapsed Time (min) | Predicted Time to Go | Predicted Total Time
leaving office        | 0                  | 30                   | 30
reach car, raining    | 5                  | 35                   | 40
exiting highway       | 20                 | 15                   | 35
behind truck          | 30                 | 10                   | 40
entering home street  | 40                 | 3                    | 43
arriving home         | 43                 | 0                    | 43
Q-learning: Off-Policy TD Control
Q-learning
Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
Repeat (for each episode):
   Initialize S
   Repeat (for each step of episode):
      Choose A from S using policy π
      Take action A, observe R, S′
      Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
      S ← S′
   until S is terminal
Q-learning: Update of Q independent of the policy (off-policy)
Sarsa: Update of Q depends on the policy’s action (on-policy)
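The Q-learning pseudocode above can be sketched in Python on a made-up 1-D corridor environment (states, rewards, and hyperparameters are illustrative assumptions, not from the slides):

```python
import random

# Made-up corridor: states 0..3, state 3 is terminal with reward 1;
# action 0 moves left, action 1 moves right.
N, TERMINAL = 4, 3

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(TERMINAL, s + 1)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            # epsilon-greedy behaviour policy
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            s2, r = step(s, a)
            # off-policy update: bootstrap from max_a Q(S', a)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(TERMINAL)]
print(greedy)  # the greedy policy moves right in every non-terminal state
```

Because the update bootstraps from max_a Q(S′, a) rather than from the action the behaviour policy actually takes next, the learned Q approximates q∗ even though behaviour is ε-greedy; replacing the max with Q(S′, A′) for the next chosen action A′ gives Sarsa.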
MC vs. TD
Monte Carlo Methods:
Slow convergence
Markov property not exploited
Update at the end of an episode
TD methods:
Fast convergence
Markov property exploited
Update after every time step
TD(λ) methods:
Use weighted n-step returns as update target
Overview
Equation                                   | Full Backup (DP)            | Sample Backup (TD)
Bellman Expectation Equation for vπ(s)     | Iterative Policy Evaluation | TD Learning
Bellman Expectation Equation for qπ(s, a)  | Q-Policy Iteration          | Sarsa
Bellman Optimality Equation for q∗(s, a)   | Q-Value Iteration           | Q-Learning
Motivation
Why do we need value function approximation?
Situation:
Value function represented by a lookup table V(s)
Go: 10^170 states
Problem with large MDPs:
Large set of states/actions to store in memory
Slow exploration of states/actions
Solution for large MDPs:
Estimate value function with function approximators
v(s,w) ≈ vπ(s)
Generalize from known to unknown states
Prediction Objective
Prediction Objective: Minimize the squared error J(w)
J(w) = Σ_{s∈S} [vπ(s) − v(s, w)]²
Adjust w in direction of the negative gradient:
∆w = α [vπ(St)− v(St ,w)] ∇v(St ,w)
Methods
Incremental Methods
Substitute a target for vπ
MC: Use the return as target
∆w = α [Gt − v(St ,w)] ∇v(St ,w)
TD: Use estimated return as target
∆w = α [Rt+1 + γ v(St+1, w) − v(St, w)] ∇v(St, w)
Batch Methods
Sample state and value from experience
〈s, vπ〉 ∼ D
Apply stochastic gradient descent update
∆w = α [vπ − v(s, w)] ∇v(s, w)
Experience replay: decorrelates samples
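The incremental updates above can be sketched with a linear approximator v(s, w) = wᵀx(s); with one-hot features this reduces to the tabular update. The features, toy targets, and hyperparameters below are illustrative assumptions:

```python
import random

def features(s, n_states):
    """One-hot feature vector x(s) (an illustrative choice)."""
    x = [0.0] * n_states
    x[s] = 1.0
    return x

def v_hat(x, w):
    """Linear value estimate v(s, w) = w^T x(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(samples, n_states, alpha=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_states
    for _ in range(epochs):
        s, target = rng.choice(samples)        # <s, v_pi> ~ D
        x = features(s, n_states)
        err = target - v_hat(x, w)             # [v_pi - v(s, w)]
        for i in range(n_states):
            w[i] += alpha * err * x[i]         # alpha * err * grad v(s, w)
    return w

# toy "experience": true values 1.0 for state 0 and 2.0 for state 1
w = train([(0, 1.0), (1, 2.0)], n_states=2)
print(w)  # approaches [1.0, 2.0]
```

In practice the true target vπ is unknown, so an MC return or a TD target is substituted, exactly as in the incremental methods above.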
Summary
RL: Study of an agent learning through interaction with its environment
Markov Decision Processes: Divide task into prediction and control
Model-Free RL
Monte Carlo Methods
Temporal-Difference Learning Methods
TD(λ)
Large MDPs: function approximation
Outlook:
Policy Gradient Methods
Model based RL
...
Human-level control through deep reinforcement learning
Deep Q-Networks learn policies over a variety of Atari games:
Only pixels and game score as input
Experience Replay and fixed Q-targets
Exceeds human-level performance in more than half of the 49 games
Questions
The only stupid question is the one you were afraid to askbut never did. (Rich Sutton)
Resources I
Mnih, Volodymyr et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.
Mohri, Mehryar, Rostamizadeh, Afshin and Talwalkar, Ameet (2012). Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Massachusetts.
Sutton, Richard S. and Barto, Andrew G. (2016). Reinforcement Learning: An Introduction (Draft). MIT Press, Cambridge, Massachusetts.
Resources II
Szepesvari, Csaba (2013). Algorithms for Reinforcement Learning. Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.
A painless Q-learning tutorial (4.7.2017)http://mnemstudio.org/path-finding-q-learning-tutorial.htm
List of resources on Reinforcement Learning (4.7.2017)https://github.com/aikorea/awesome-rl
OpenAI Gym (4.7.2017)https://gym.openai.com/
Resources III
Repository with code, exercises and solutions for popularReinforcement Learning algorithms (4.7.2017)https://github.com/dennybritz/reinforcement-learning
Teaching Your Computer To Play Super Mario Bros (4.7.2017)http://www.ehrenbrav.com/2016/08/teaching-your-computer-to-play-super-mario-bros-a-fork-of-the-google-deepmind-atari-machine-learning-project/
University College London course on Reinforcement Learning(4.7.2017)http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html