
Reinforcement Learning

Christian Hanauer

5 July 2017, Advanced Topics in Machine Learning, LMU

The Reinforcement Learning Problem

Reinforcement Learning (RL)

RL is the study of learning in a scenario where an agent actively interacts with the environment to achieve a certain goal.


The Reinforcement Learning Problem

RL is characterized by:

No supervisor

Delayed rewards

Exploration versus exploitation dilemma

Uncertainty about environment

Reward Hypothesis

All goals can be described by the maximisation of the expected cumulative reward.

→ determine the optimal policy π* (best course of action)

Can you think of a counterexample?


Branches of Machine Learning


Applications of Reinforcement Learning


Outline

1 Introduction

2 Markov Decision Processes
  Markov Decision Processes
  Value Functions
  Bellman Equation

3 Model Free Reinforcement Learning
  Generalized Policy Iteration
  Monte Carlo Methods
  Temporal-Difference Learning

4 Function Approximation
  Motivation
  Prediction Objective
  Methods

5 Summary


Markov Property

“The future is independent of the past given the present”

Definition

A state S_t is Markov if and only if

Pr[S_{t+1} | S_t] = Pr[S_{t+1} | S_1, ..., S_t]

Definition

A Markov Process (or Markov Chain) is a tuple 〈S,P〉, where

S is a countable non-empty set of states

P is a state transition probability matrix with P_{ss′} = Pr[S_{t+1} = s′ | S_t = s]


Markov Decision Process

Definition

A Markov Decision Process (MDP) is a tuple 〈S,A,P,R, γ〉, where

S is a countable non-empty set of states

A is a countable non-empty set of actions

P is a state transition probability matrix with P^a_{ss′} = Pr[S_{t+1} = s′ | S_t = s, A_t = a]

R is a reward function with R^a_s = E[R_{t+1} | S_t = s, A_t = a]

γ is a discount factor with γ ∈ [0, 1]

Task of the learning agent in an MDP: determine the action to take in each state, the so-called policy π : S → A
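For concreteness, here is a minimal sketch (not from the slides) of how such a tuple ⟨S, A, P, R, γ⟩ could be represented in Python; the two states, two actions, and all probabilities and rewards are invented for illustration.

```python
import numpy as np

# Hypothetical two-state, two-action MDP <S, A, P, R, gamma>.
states = ["s0", "s1"]                # S: countable non-empty set of states
actions = ["stay", "move"]           # A: countable non-empty set of actions

# P[a][s, s'] = Pr[S_{t+1} = s' | S_t = s, A_t = a]
P = {
    "stay": np.array([[0.9, 0.1],
                      [0.2, 0.8]]),
    "move": np.array([[0.1, 0.9],
                      [0.7, 0.3]]),
}

# R[a][s] = E[R_{t+1} | S_t = s, A_t = a]
R = {
    "stay": np.array([0.0, 1.0]),
    "move": np.array([0.5, -1.0]),
}

gamma = 0.9                          # discount factor in [0, 1]

# A deterministic policy pi: S -> A, represented as a lookup table.
policy = {"s0": "move", "s1": "stay"}
```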


Example: Student MDP


Goals and Rewards

What quantity do we want to maximize?

Definition

The return G_t is the total discounted sum of rewards from time-step t:

G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

The reward R_t is a scalar feedback signal

γ ∈ [0, 1] determines how much we take future rewards into account (see the numeric sketch below):
  Mathematically convenient
  Represents uncertainty about the future
  Human behaviour shows a preference for immediate rewards
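A tiny numeric sketch of the return for an invented reward sequence, showing how γ weights future rewards; the numbers are arbitrary.

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for an invented
# finite reward sequence, evaluated for two different discount factors.
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 2.0]                 # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.9))    # 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return(rewards, 0.1))    # 1 + 0.1*0 + 0.01*2 = 1.02 (myopic)
```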


Value Functions

Definition

The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π

v_π(s) = E_π[G_t | S_t = s]

Definition

The action-value function q_π(s, a) of an MDP is the expected return starting from state s, taking action a and then following policy π

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Ordering over policies:

π ≥ π′ if v_π(s) ≥ v_{π′}(s) ∀ s


Evaluating the Value Function

How can we determine the value function v_π(s)?

Idea: Decompose v_π(s) into immediate and future reward

v_π(s) = E_π[G_t | S_t = s]
       = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
       = E_π[R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...) | S_t = s]
       = E_π[R_{t+1} + γ G_{t+1} | S_t = s]
       = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]    (Bellman Expectation Equation for v_π)

Can be solved directly (matrix equation) or iteratively (Dynamic Programming)
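As an illustration of the iterative route, here is a minimal sketch of iterative policy evaluation for a known model; the transition matrix P_π and reward vector R_π (already averaged over the policy) and all numbers are placeholders.

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman
    Expectation backup v <- R_pi + gamma * P_pi v until convergence."""
    v = np.zeros(len(R_pi))
    while True:
        v_new = R_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Placeholder two-state example; the direct (matrix) solution
# v = (I - gamma * P_pi)^{-1} R_pi gives the same answer.
P_pi = np.array([[0.5, 0.5],
                 [0.1, 0.9]])
R_pi = np.array([1.0, -1.0])
print(policy_evaluation(P_pi, R_pi))
print(np.linalg.solve(np.eye(2) - 0.9 * P_pi, R_pi))
```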


Optimizing the Value Function

How can we find an optimal policy π?

(i) The optimal action-value function q_* is given by

q_*(s, a) = max_π q_π(s, a)

(ii) Use the Bellman Expectation Equation for q_π

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

(iii) The optimal action-value function q_* satisfies the Bellman Optimality Equation

q_*(s, a) = E[R_{t+1} + γ max_{a′} q_*(S_{t+1}, a′) | S_t = s, A_t = a]

For any MDP: an optimal policy π∗ always exists


Example: Student MDP


1 Introduction

2 Markov Decision Processes
  Markov Decision Processes
  Value Functions
  Bellman Equation

3 Model Free Reinforcement Learning
  Generalized Policy Iteration
  Monte Carlo Methods
  Temporal-Difference Learning

4 Function Approximation
  Motivation
  Prediction Objective
  Methods

5 Summary


Generalized Policy Iteration

How do we learn an optimal policy π∗?

Policy evaluation: Monte Carlo, TD(0), ...
Policy improvement: greedy policy improvement, ...


Policy Improvement Theorem

Theorem

Let π and π′ be any pair of policies such that q_π(s, π′(s)) ≥ v_π(s) for all s. Then the policy π′ must be as good as, or better than, π:

v_{π′}(s) ≥ v_π(s)

Proof:

v_π(s) ≤ q_π(s, π′(s)) = E_{π′}[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       ≤ E_{π′}[R_{t+1} + γ q_π(S_{t+1}, π′(S_{t+1})) | S_t = s]
       ≤ E_{π′}[R_{t+1} + γ R_{t+2} + γ² q_π(S_{t+2}, π′(S_{t+2})) | S_t = s]
       ≤ E_{π′}[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
       = v_{π′}(s)


Monte Carlo Methods

Overview:

MC methods learn from episodes of experience

MC methods are model-free

Update of V(S_t) at the end of an episode

Basic idea: sample sequences from the MDP

How do we evaluate a policy using Monte Carlo Methods?

The value function is given by

v_π(s) = E_π[G_t | S_t = s]

MC policy evaluation uses the empirical mean return to estimate v_π(s); in incremental form the update is

V(S_t) ← V(S_t) + α (G_t − V(S_t))
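A minimal sketch of this update in code, assuming a placeholder generate_episode() hook that runs the policy π and returns (state, reward) pairs; the constant step size α matches the update above.

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes=1000, alpha=0.05, gamma=1.0):
    """Every-visit Monte Carlo prediction with a constant step size.

    `generate_episode` is a placeholder: it should follow pi and return
    a list of (state, reward) pairs [(S_1, R_2), (S_2, R_3), ...].
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Work backwards so G accumulates the discounted return from t.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])   # V(S_t) <- V(S_t) + alpha*(G_t - V(S_t))
    return V
```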


Remark: Incremental Mean

We can compute the mean μ_k of the sequence x_1, x_2, ... incrementally:

μ_k = (1/k) Σ_{j=1}^{k} x_j
    = (1/k) (x_k + Σ_{j=1}^{k−1} x_j)
    = (1/k) (x_k + (k−1) μ_{k−1})
    = μ_{k−1} + (1/k) (x_k − μ_{k−1})
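A quick sanity check, with invented numbers, that the incremental update reproduces the ordinary mean:

```python
# Sanity check: mu_k = mu_{k-1} + (x_k - mu_{k-1})/k equals the batch mean.
xs = [2.0, 4.0, 9.0]
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k
print(mu, sum(xs) / len(xs))   # both print 5.0
```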


On-policy first-visit MC control

On-policy first-visit MC control

Initialize, for all s ∈ S, a ∈ A:
    Q(s, a) ← arbitrary value
    Returns(s, a) ← empty list
    π(a|s) ← an arbitrary policy

Repeat:
    (a) Generate an episode {S_1, A_1, R_2, ..., S_T} using π
    (b) For each pair (s, a) in the episode:
        G ← return following the first occurrence of (s, a)
        Append G to Returns(s, a)
        Q(s, a) ← average(Returns(s, a))
    (c) For each s in the episode:
        Update the policy π ε-greedily (with respect to Q)
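A runnable sketch of this procedure, assuming a placeholder run_episode(policy) hook that follows the given policy and returns (state, action, reward) triples; names and default values are illustrative only.

```python
import random
from collections import defaultdict

def mc_control(run_episode, actions, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit Monte Carlo control (sketch).

    `run_episode(policy)` is a placeholder environment interface: it should
    follow the given policy and return [(S_1, A_1, R_2), (S_2, A_2, R_3), ...].
    """
    Q = defaultdict(float)
    returns = defaultdict(list)

    def policy(state):
        # epsilon-greedy policy improvement with respect to the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = run_episode(policy)
        # index of the first occurrence of each (s, a) pair
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # accumulate the return backwards through the episode
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if first_visit[(s, a)] == t:            # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```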


Remarks

Which door do you want to open?

You get a reward R = 1

Which door do you want to open?

Are you sure you have chosen the right door?


Temporal-Difference Learning

Overview:

TD methods learn from episodes of experience

TD methods are model-free

Update of V(S_t) after each time step

Basic idea: update a guess towards a guess

How do we evaluate a policy using TD methods?

MC: update the value V(S_t) towards the actual return G_t

V(S_t) ← V(S_t) + α (G_t − V(S_t))

TD: update the value V(S_t) towards the estimated return

V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))
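A minimal sketch of TD(0) prediction built around this update; env_reset() and env_step() are placeholder hooks for an environment that already acts according to π.

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, num_episodes=1000, alpha=0.1, gamma=1.0):
    """TD(0) prediction for a fixed policy (sketch).

    `env_reset()` and `env_step(state)` are placeholder hooks: env_step
    should act according to pi and return (reward, next_state, done).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env_reset()
        done = False
        while not done:
            reward, next_state, done = env_step(state)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # TD(0) update
            state = next_state
    return V
```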


Example: Driving Home

State                  Elapsed Time (min)   Predicted Time to Go (min)   Predicted Total Time (min)
leaving office                  0                     30                           30
reach car, raining              5                     35                           40
exiting highway                20                     15                           35
behind truck                   30                     10                           40
entering home street           40                      3                           43
arriving home                  43                      0                           43


Example: Driving Home


Q-learning: Off-Policy TD Control

Q-learning

Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy π
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    until S is terminal

Q-learning: Update of Q independent of the policy (off-policy)

Sarsa: Update of Q depends on the policy’s action (on-policy)
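A runnable sketch of the tabular algorithm above, assuming a classic OpenAI-Gym-style interface (env.reset() → state, env.step(a) → (next_state, reward, done, info)) with a discrete action space; the hyperparameters are arbitrary.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch for a Gym-style environment."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            # off-policy target uses max_a Q(S', a), independent of the behaviour policy
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```

For Sarsa, the max over actions in the target would be replaced by the value of the action actually chosen by the ε-greedy behaviour policy in S′.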


Example: Student MDP


MC vs. TD

Monte Carlo Methods:

Slow convergence

Markov property not exploited

Update at the end of an episode

TD methods:

Fast convergence

Markov property exploited

Update after every time step

TD(λ) methods:

Use a weighted average of n-step returns (the λ-return) as the update target


Example: Cliff Walking


Overview

                                            Full Backup (DP)              Sample Backup (TD)
Bellman Expectation Equation for v_π(s)     Iterative Policy Evaluation   TD Learning
Bellman Expectation Equation for q_π(s, a)  Q-Policy Iteration            Sarsa
Bellman Optimality Equation for q_*(s, a)   Q-Value Iteration             Q-Learning


Unified View of RL


1 Introduction

2 Markov Decision Processes
  Markov Decision Processes
  Value Functions
  Bellman Equation

3 Model Free Reinforcement Learning
  Generalized Policy Iteration
  Monte Carlo Methods
  Temporal-Difference Learning

4 Function Approximation
  Motivation
  Prediction Objective
  Methods

5 Summary


Motivation

Why do we need value function approximation?

Situation:

Value function represented by a lookup table V(s)
Go: 10^170 states

Problem with large MDPs:

Large set of states/actions to store in memory
Slow exploration of states/actions

Solution for large MDPs:

Estimate value function with function approximators

v(s, w) ≈ v_π(s)

Generalize from known to unknown states


Prediction Objective

Prediction Objective: Minimize the squared error J(w)

J(w) = Σ_{s∈S} [v_π(s) − v(s, w)]²

Adjust w in the direction of the negative gradient:

Δw = α [v_π(S_t) − v(S_t, w)] ∇_w v(S_t, w)
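As a sketch, one stochastic gradient step for a linear approximator v(s, w) = w · x(s), where ∇_w v(s, w) = x(s); the feature vector and target value are invented. In practice v_π(S_t) is unknown and is replaced by a target, as on the next slide.

```python
import numpy as np

def sgd_step(w, x_s, v_target, alpha=0.01):
    """One stochastic gradient step on [v_target - v(s, w)]^2 for a
    linear approximator v(s, w) = w . x(s), whose gradient is x(s)."""
    v_hat = w @ x_s
    return w + alpha * (v_target - v_hat) * x_s

# Hypothetical usage with an invented 4-dimensional feature vector x(s):
w = np.zeros(4)
x_s = np.array([1.0, 0.5, 0.0, 2.0])
w = sgd_step(w, x_s, v_target=3.0)
```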


Methods

Incremental Methods
  Substitute a target for v_π
  MC: use the return as target

    Δw = α [G_t − v(S_t, w)] ∇_w v(S_t, w)

  TD: use the estimated return as target

    Δw = α [R_{t+1} + γ v(S_{t+1}, w) − v(S_t, w)] ∇_w v(S_t, w)

Batch Methods
  Sample state and value from experience

    ⟨s, v_π⟩ ∼ D

  Apply the stochastic gradient descent update

    Δw = α [v_π − v(s, w)] ∇_w v(s, w)

  Experience replay decorrelates the samples (see the sketch below)
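A minimal sketch of the batch setting with experience replay, again for a linear approximator; buffer size, batch size, and step size are arbitrary choices.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Minimal experience-replay buffer: sampling stored transitions
    uniformly at random decorrelates consecutive samples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, features, target):
        self.buffer.append((features, target))

    def sample(self, batch_size=32):
        # assumes the buffer already holds at least batch_size entries
        batch = random.sample(self.buffer, batch_size)
        xs, targets = zip(*batch)
        return np.stack(xs), np.array(targets)

def batch_sgd_step(w, xs, targets, alpha=0.01):
    # minibatch version of  Δw = α [v_π − v(s, w)] ∇_w v(s, w)  for linear v
    errors = targets - xs @ w
    return w + alpha * xs.T @ errors / len(targets)
```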


Summary

RL: study of learning by an agent interacting with its environment

Markov Decision Processes: Divide task into prediction and control

Model-Free RL

Monte Carlo Methods
Temporal-Difference Learning Methods
TD(λ)

Large MDPs: function approximation

Outlook:

Policy Gradient Methods

Model-based RL

...


RL Agent Classification


Human-level control through deep reinforcement learning

Deep Q-Networks learn policies over a variety of Atari games:

Only pixels and game score as input

Experience Replay and fixed Q-targets

Exceeds human performance in more than half of the 49 games


Thank you for your attention!


Questions

The only stupid question is the one you were afraid to ask but never did. (Rich Sutton)


Resources I

Mnih, Volodymyr et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet (2012). Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Massachusetts.

Sutton, Richard S. and Barto, Andrew G. (2016). Reinforcement Learning: An Introduction (Draft). MIT Press, Cambridge, Massachusetts.


Resources II

Szepesvári, Csaba (2013). Algorithms for Reinforcement Learning. Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.

A painless Q-learning tutorial (accessed 4 July 2017).
http://mnemstudio.org/path-finding-q-learning-tutorial.htm

List of resources on Reinforcement Learning (accessed 4 July 2017).
https://github.com/aikorea/awesome-rl

OpenAI Gym (accessed 4 July 2017).
https://gym.openai.com/


Resources III

Repository with code, exercises and solutions for popular Reinforcement Learning algorithms (accessed 4 July 2017).
https://github.com/dennybritz/reinforcement-learning

Teaching Your Computer To Play Super Mario Bros (accessed 4 July 2017).
http://www.ehrenbrav.com/2016/08/teaching-your-computer-to-play-super-mario-bros-a-fork-of-the-google-deepmind-atari-machine-learning-project/

University College London course on Reinforcement Learning (accessed 4 July 2017).
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
