
Lecture 8: Policy Gradient I

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018

Additional reading: Sutton and Barto 2018 Chp. 13

With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel


Last Time: We want RL Algorithms that Perform

Optimization

Delayed consequences

Exploration

Generalization

And do it statistically and computationally efficiently


Last Time: Generalization and Efficiency

Can use structure and additional knowledge to help constrain and speed reinforcement learning


Class Structure

Last time: Imitation Learning

This time: Policy Search

Next time: Policy Search Cont.


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Policy-Based Reinforcement Learning

In the last lecture we approximated the value or action-value function using parameters θ,

Vθ(s) ≈ V π(s)

Qθ(s, a) ≈ Qπ(s, a)

A policy was generated directly from the value function

e.g. using ε-greedy

In this lecture we will directly parametrize the policy

πθ(s, a) = P[a|s, θ]

Goal is to find a policy π with the highest value function V π

We will focus again on model-free reinforcement learning


Value-Based and Policy-Based RL

Value Based

Learnt Value Function
Implicit policy (e.g. ε-greedy)

Policy Based

No Value Function
Learnt Policy

Actor-Critic

Learnt Value Function
Learnt Policy


Advantages of Policy-Based RL

Advantages:

Better convergence properties

Effective in high-dimensional or continuous action spaces

Can learn stochastic policies

Disadvantages:

Typically converge to a local rather than global optimum

Evaluating a policy is typically inefficient and high variance


Example: Rock-Paper-Scissors

Two-player game of rock-paper-scissors

Scissors beats paper
Rock beats scissors
Paper beats rock

Consider policies for iterated rock-paper-scissors

A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. a Nash equilibrium)


Example: Aliased Gridworld (1)

The agent cannot differentiate the grey states

Consider features of the following form (for all N, E, S, W)

φ(s, a) = 1(wall to N, a = move E)

Compare value-based RL, using an approximate value function

Qθ(s, a) = f (φ(s, a), θ)

To policy-based RL, using a parametrised policy

πθ(s, a) = g(φ(s, a), θ)


Example: Aliased Gridworld (2)

Under aliasing, an optimal deterministic policy will either

move W in both grey states (shown by red arrows)
move E in both grey states

Either way, it can get stuck and never reach the money

Value-based RL learns a near-deterministic policy

e.g. greedy or ε-greedy

So it will traverse the corridor for a long time


Example: Aliased Gridworld (3)

An optimal stochastic policy will randomly move E or W in grey states

πθ(wall to N and W, move E) = 0.5

πθ(wall to N and W, move W) = 0.5

It will reach the goal state in a few steps with high probability

Policy-based RL can learn the optimal stochastic policy


Policy Objective Functions

Goal: given a policy πθ(s, a) with parameters θ, find best θ

But how do we measure the quality of a policy πθ?

In episodic environments we can use the start value of the policy

J1(θ) = V^πθ(s1) = E_πθ[v1]

In continuing environments we can use the average value

JavV(θ) = ∑_s d^πθ(s) V^πθ(s)

where d^πθ(s) is the stationary distribution of the Markov chain for πθ.

Or the average reward per time-step

JavR(θ) = ∑_s d^πθ(s) ∑_a πθ(s, a) R_s^a

For simplicity, today we will mostly discuss the episodic case, but it can easily be extended to the continuing / infinite-horizon case


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ

Can use gradient free optimization:

Hill climbing
Simplex / amoeba / Nelder-Mead
Genetic algorithms
Cross-Entropy Method (CEM)
Covariance Matrix Adaptation (CMA)
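As a concrete illustration of the simplest of these, here is a minimal random-search hill-climbing sketch in Python (my own, not from the lecture); evaluate_policy is a hypothetical routine that rolls out the policy with the given parameters and returns an estimated value.

```python
import numpy as np

def hill_climb(evaluate_policy, theta, iters=100, noise_scale=0.1, seed=0):
    """Gradient-free policy search: keep a candidate theta and accept a
    random perturbation whenever it evaluates at least as well."""
    rng = np.random.default_rng(seed)
    best_value = evaluate_policy(theta)
    for _ in range(iters):
        candidate = theta + noise_scale * rng.standard_normal(theta.shape)
        value = evaluate_policy(candidate)
        if value >= best_value:   # accept improvements (evaluations are noisy in practice)
            theta, best_value = candidate, value
    return theta, best_value
```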


Recall: Human-in-the-Loop Exoskeleton Optimization (Zhang et al., Science 2017)


Optimization was done using CMA-ES, a variation of covariance matrix adaptation


Gradient Free Policy Optimization

Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)" (https://blog.openai.com/evolution-strategies/)


Gradient Free Policy Optimization

Often a great simple baseline to try

Benefits

Can work with any policy parameterization, including non-differentiable ones
Frequently very easy to parallelize

Limitations

Typically not very sample efficient because it ignores temporal structure


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ

Can use gradient free optimization:

Greater efficiency often possible using gradient

Gradient descent
Conjugate gradient
Quasi-Newton

We focus on gradient descent, many extensions possible

And on methods that exploit sequential structure


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Policy Gradient

Define V(θ) = V^πθ to make explicit the dependence of the value on the policy parameters

Assume episodic MDPs (easy to extend to related objectives, like average reward)


Policy Gradient

Define V(θ) = V^πθ to make explicit the dependence of the value on the policy parameters

Assume episodic MDPs

Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of V(θ) w.r.t. the policy parameters θ

∆θ = α ∇θV(θ)

Where ∇θV (θ) is the policy gradient

∇θV(θ) = ( ∂V(θ)/∂θ_1, ..., ∂V(θ)/∂θ_n )ᵀ

and α is a step-size parameter


Computing Gradients by Finite Differences

To evaluate policy gradient of πθ(s, a)

For each dimension k ∈ [1, n]

Estimate the kth partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the kth dimension

∂V(θ)/∂θ_k ≈ ( V(θ + ε u_k) − V(θ) ) / ε

where u_k is a unit vector with 1 in the kth component, 0 elsewhere.


Computing Gradients by Finite Differences

To evaluate policy gradient of πθ(s, a)

For each dimension k ∈ [1, n]

Estimate the kth partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the kth dimension

∂V(θ)/∂θ_k ≈ ( V(θ + ε u_k) − V(θ) ) / ε

where u_k is a unit vector with 1 in the kth component, 0 elsewhere.

Uses n evaluations to compute policy gradient in n dimensions

Simple, noisy, inefficient - but sometimes effective

Works for arbitrary policies, even if policy is not differentiable
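To make this concrete, here is a minimal Python sketch of the finite-difference estimator above (not from the lecture); evaluate_policy is a hypothetical stand-in for a routine that runs the policy with parameters θ and returns an estimate of V(θ).

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient dV/dtheta by perturbing each
    coordinate of theta by eps (n evaluations for n dimensions)."""
    n = len(theta)
    grad = np.zeros(n)
    v0 = evaluate_policy(theta)            # V(theta)
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                       # unit vector in the k-th dimension
        grad[k] = (evaluate_policy(theta + eps * u_k) - v0) / eps
    return grad

# Usage sketch: ascend the estimated gradient.
# theta = theta + alpha * finite_difference_gradient(rollout_return, theta)
```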


Training AIBO to Walk by Finite Difference Policy Gradient

Goal: learn a fast AIBO walk (useful for Robocup)

Adapt the walk's parameters by finite difference policy gradient

Evaluate performance of policy by field traversal time

(Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA 2004. http://www.cs.utexas.edu/ai-lab/pubs/icra04.pdf)

AIBO Policy Parameterization

The AIBO walk policy is an open-loop policy

No state: it chooses a set of action parameters that define an ellipse

Specified by 12 continuous parameters (elliptical loci)

The front locus (3 parameters: height, x-pos., y-pos.)
The rear locus (3 parameters)
Locus length
Locus skew multiplier in the x-y plane (for turning)
The height of the front of the body
The height of the rear of the body
The time each foot takes to move through its locus
The fraction of time each foot spends on the ground

New policies: for each parameter, randomly add ε, 0, or −ε


AIBO Policy Experiments

"All of the policy evaluations took place on actual robots... only human intervention required during an experiment involved replacing discharged batteries ... about once an hour."

Ran on 3 Aibos at once

Evaluated 15 policies per iteration.

Each policy evaluated 3 times (to reduce noise) and averaged

Each iteration took 7.5 minutes

Used η = 2 (learning rate for their finite difference approach)


Training AIBO to Walk by Finite Difference Policy Gradient: Results

Authors discuss that performance is likely impacted by: initial starting policy parameters, ε (how much policies are perturbed), η (how much to change the policy), as well as the policy parameterization


AIBO Walk Policies

https://www.cs.utexas.edu/AustinVilla/?p=research/learned walk


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Computing the gradient analytically

We now compute the policy gradient analytically

Assume policy πθ is differentiable whenever it is non-zero

and we know the gradient ∇θπθ(s, a)


Likelihood Ratio Policies

Denote a state-action trajectory as τ = (s0, a0, r0, ..., sT−1, aT−1, rT−1, sT)

Use R(τ) = ∑_{t=0}^{T} R(st, at) to be the sum of rewards for a trajectory τ


Likelihood Ratio Policies

Denote a state-action trajectory as τ = (s0, a0, r0, ..., sT−1, aT−1, rT−1, sT)

Use R(τ) = ∑_{t=0}^{T} R(st, at) to be the sum of rewards for a trajectory τ

Policy value is

V(θ) = E_πθ [ ∑_{t=0}^{T} R(st, at); πθ ] = ∑_τ P(τ; θ) R(τ),     (1)

where P(τ; θ) is used to denote the probability over trajectories when executing policy πθ

In this new notation, our goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (2)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (3)

Take the gradient with respect to θ:

∇θ V(θ) = ∇θ ∑_τ P(τ; θ) R(τ)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (4)

Take the gradient with respect to θ:

∇θ V(θ) = ∇θ ∑_τ P(τ; θ) R(τ)
        = ∑_τ ∇θ P(τ; θ) R(τ)
        = ∑_τ (P(τ; θ) / P(τ; θ)) ∇θ P(τ; θ) R(τ)
        = ∑_τ P(τ; θ) R(τ) [∇θ P(τ; θ) / P(τ; θ)]     (likelihood ratio)
        = ∑_τ P(τ; θ) R(τ) ∇θ log P(τ; θ)
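As a sanity check (not in the slides), the likelihood-ratio identity can be verified numerically on a toy problem: a one-parameter Bernoulli "trajectory" distribution, comparing the score-function estimate of ∇θ E[R] against the exact gradient. All names and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward(x):
    return np.where(x == 1, 1.0, 0.2)     # arbitrary "return" per outcome

theta = 0.3
p = sigmoid(theta)                        # P(x = 1; theta)
x = rng.binomial(1, p, size=200_000)      # sampled "trajectories"

# Score function (likelihood ratio) estimator: mean of R(x) * d/dtheta log p(x; theta)
score = np.where(x == 1, 1.0 - p, -p)
lr_estimate = np.mean(reward(x) * score)

# Exact gradient of E[R] = 0.2 + 0.8 * sigmoid(theta) w.r.t. theta
exact = 0.8 * p * (1.0 - p)

print(lr_estimate, exact)                 # the two values should be close
```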


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (5)

Take the gradient with respect to θ:

∇θ V(θ) = ∑_τ P(τ; θ) R(τ) ∇θ log P(τ; θ)

Approximate with empirical estimate for m sample paths under policy πθ:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)


Score Function Gradient Estimator: Intuition

Consider the generic form of R(τ^(i)) ∇θ log P(τ^(i); θ):

g_i = f(x_i) ∇θ log p(x_i | θ)

f (x) measures how good the sample x is.

Moving in the direction g_i pushes up the log probability of the sample, in proportion to how good it is

Valid even if f(x) is discontinuous and unknown, or the sample space (containing x) is a discrete set


Score Function Gradient Estimator: Intuition

g_i = f(x_i) ∇θ log p(x_i | θ)


Score Function Gradient Estimator: Intuition

g_i = f(x_i) ∇θ log p(x_i | θ)


Decomposing the Trajectories Into States and Actions

Approximate with empirical estimate for m sample paths under policy πθ:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)

∇θ log P(τ^(i); θ) =


Decomposing the Trajectories Into States and Actions

Approximate with empirical estimate for m sample paths under policy πθ:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)

∇θ log P(τ^(i); θ) = ∇θ log [ µ(s0) ∏_{t=0}^{T−1} πθ(at | st) P(st+1 | st, at) ]
                     (initial state distrib. × policy × dynamics model)

= ∇θ [ log µ(s0) + ∑_{t=0}^{T−1} ( log πθ(at | st) + log P(st+1 | st, at) ) ]

= ∑_{t=0}^{T−1} ∇θ log πθ(at | st)     (no dynamics model required!)


Score Function

Define score function as ∇θ log πθ(s, a)


Likelihood Ratio / Score Function Policy Gradient

Putting this together

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (6)

Approximate with empirical estimate for m sample paths under policy πθ using the score function:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)
            = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Do not need to know dynamics model
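A minimal sketch (mine, not the lecture's code) of this estimator: given m sampled trajectories and a routine that returns the score ∇θ log πθ(a|s), accumulate R(τ) times the summed score for each trajectory; grad_log_pi is a hypothetical callable supplied by whatever policy class is in use.

```python
import numpy as np

def policy_gradient_estimate(trajectories, grad_log_pi):
    """trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    grad_log_pi(state, action) -> ndarray, the score d/dtheta log pi_theta(a|s).
    Returns g ~= (1/m) sum_i R(tau_i) * sum_t grad log pi(a_t|s_t)."""
    m = len(trajectories)
    g = None
    for tau in trajectories:
        total_return = sum(r for (_, _, r) in tau)            # R(tau)
        score_sum = sum(grad_log_pi(s, a) for (s, a, _) in tau)
        term = total_return * score_sum
        g = term if g is None else g + term
    return g / m
```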


Policy Gradient Theorem

The policy gradient theorem generalizes the likelihood ratio approach

Theorem

For any differentiable policy πθ(s, a), and for any of the policy objective functions J = J1 (episodic reward), JavR (average reward per time step), or (1/(1−γ)) JavV (average value), the policy gradient is

∇θ J(θ) = E_πθ [ ∇θ log πθ(s, a) Q^πθ(s, a) ]


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Likelihood Ratio / Score Function Policy Gradient

∇θ V(θ) ≈ (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Unbiased but very noisy

Fixes that can make it practical

Temporal structure
Baseline

Next time will discuss some additional tricks


Policy Gradient: Use Temporal Structure

Previously:

∇θ E_τ[R] = E_τ [ ( ∑_{t=0}^{T−1} r_t ) ( ∑_{t=0}^{T−1} ∇θ log πθ(at | st) ) ]

We can repeat the same argument to derive the gradient estimator for a single reward term r_t′.

∇θ E[r_t′] = E [ r_t′ ∑_{t=0}^{t′} ∇θ log πθ(at | st) ]

Summing this formula over t′, we obtain

∇θ E[R] = E [ ∑_{t′=0}^{T−1} r_t′ ∑_{t=0}^{t′} ∇θ log πθ(at | st) ]
        = E [ ∑_{t=0}^{T−1} ∇θ log πθ(at | st) ∑_{t′=t}^{T−1} r_t′ ]

Policy Gradient: Use Temporal Structure

Recall that for a particular trajectory τ^(i), ∑_{t′=t}^{T−1} r_t′^(i) is the return G_t^(i)

∇θ E[R] ≈ (1/m) ∑_{i=1}^{m} ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i)) G_t^(i)
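A small helper (illustrative, not from the slides) that computes the reward-to-go G_t for every timestep of a trajectory, which is the quantity each score term is weighted by in the estimator above.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum_{t' >= t} gamma^(t'-t) * r_t' for each t.
    The undiscounted case (gamma = 1) matches the estimator above."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# e.g. rewards_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```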


Monte-Carlo Policy Gradient (REINFORCE)

Leverages likelihood ratio / score function and temporal structure

∆θt = α∇θ log πθ(st , at)Gt (7)

REINFORCE:
  Initialize policy parameters θ arbitrarily
  for each episode {s1, a1, r2, · · · , sT−1, aT−1, rT} ∼ πθ do
    for t = 1 to T − 1 do
      θ ← θ + α ∇θ log πθ(st, at) Gt
    end for
  end for
  return θ
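A compact NumPy sketch of REINFORCE with a linear-softmax policy (my own rendering of the pseudocode above, under assumed conventions): each episode is a list of (φ, a, r) tuples, where φ is a per-state feature matrix with one row per action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(phi, action, theta):
    """Score for a linear-softmax policy: phi(s,a) - E_pi[phi(s,.)].
    phi: (num_actions, num_features) feature matrix for this state."""
    probs = softmax(phi @ theta)
    return phi[action] - probs @ phi

def reinforce_update(theta, episode, alpha=0.01):
    """episode: list of (phi, action, reward) tuples as described above.
    Applies the REINFORCE update theta += alpha * grad log pi * G_t."""
    rewards = [r for (_, _, r) in episode]
    G, returns = 0.0, []
    for r in reversed(rewards):              # returns-to-go
        G = r + G
        returns.append(G)
    returns.reverse()
    for (phi, a, _), G_t in zip(episode, returns):
        theta = theta + alpha * grad_log_softmax(phi, a, theta) * G_t
    return theta
```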


Differentiable Policy Classes

Many choices of differentiable policy classes including:

Softmax
Gaussian
Neural networks


Softmax Policy

Weight actions using a linear combination of features φ(s, a)^T θ

Probability of action is proportional to exponentiated weight

πθ(s, a) = e^{φ(s,a)^T θ} / ∑_{a′} e^{φ(s,a′)^T θ}     (8)

The score function is

∇θ log πθ(s, a) = φ(s, a)− Eπθ[φ(s, ·)]
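The softmax score can be checked numerically; the sketch below (illustrative, not from the slides) compares the analytic expression φ(s, a) − E_πθ[φ(s, ·)] against a finite-difference gradient of log πθ(s, a) for random features.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
num_actions, num_features = 4, 3
phi = rng.standard_normal((num_actions, num_features))   # phi(s, a) for one state s
theta = rng.standard_normal(num_features)
a = 2

def log_pi(theta):
    return np.log(softmax(phi @ theta)[a])

# Analytic score: phi(s, a) - E_pi[phi(s, .)]
probs = softmax(phi @ theta)
analytic = phi[a] - probs @ phi

# Finite-difference gradient of log pi_theta(s, a)
eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(num_features)[k]) - log_pi(theta)) / eps
                    for k in range(num_features)])

print(np.allclose(analytic, numeric, atol=1e-4))          # expect True
```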


Gaussian Policy

In continuous action spaces, a Gaussian policy is natural

Mean is a linear combination of state features: µ(s) = φ(s)^T θ

Variance may be fixed at σ², or it can also be parametrised

Policy is Gaussian: a ∼ N(µ(s), σ²)

The score function is

∇θ log πθ(s, a) = (a − µ(s)) φ(s) / σ²
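A small sketch (mine, not from the slides) of the Gaussian policy: sample a ∼ N(φ(s)^T θ, σ²) and compute the score (a − µ(s)) φ(s) / σ².

```python
import numpy as np

def gaussian_policy_sample_and_score(phi_s, theta, sigma, rng):
    """phi_s: feature vector phi(s); returns (sampled action, score vector)."""
    mu = phi_s @ theta                       # mean mu(s) = phi(s)^T theta
    a = rng.normal(mu, sigma)                # a ~ N(mu(s), sigma^2)
    score = (a - mu) * phi_s / sigma**2      # grad_theta log pi_theta(s, a)
    return a, score

rng = np.random.default_rng(0)
action, score = gaussian_policy_sample_and_score(np.array([1.0, 0.5]),
                                                 np.array([0.2, -0.1]), 0.5, rng)
```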


Likelihood Ratio / Score Function Policy Gradient

∇θ V(θ) ≈ (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Unbiased but very noisy

Fixes that can make it practical

Temporal structure
Baseline

Next time will discuss some additional tricks


Policy Gradient: Introduce Baseline

Reduce variance by introducing a baseline b(s)

∇θ E_τ[R] = E_τ [ ∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ( ∑_{t′=t}^{T−1} r_t′ − b(st) ) ]

For any choice of b(s), gradient estimator is unbiased.

A near-optimal choice is the expected return, b(st) ≈ E[rt + rt+1 + · · · + rT−1]

Interpretation: increase the log probability of action at proportionally to how much the returns ∑_{t′=t}^{T−1} r_t′ are better than expected


Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇θ log π(at | st, θ) b(st)]
= E_{s0:t, a0:(t−1)} [ E_{s(t+1):T, at:(T−1)} [∇θ log π(at | st, θ) b(st)] ]


Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇θ log π(at | st, θ) b(st)]
= E_{s0:t, a0:(t−1)} [ E_{s(t+1):T, at:(T−1)} [∇θ log π(at | st, θ) b(st)] ]     (break up expectation)
= E_{s0:t, a0:(t−1)} [ b(st) E_{s(t+1):T, at:(T−1)} [∇θ log π(at | st, θ)] ]     (pull baseline term out)
= E_{s0:t, a0:(t−1)} [ b(st) E_{at} [∇θ log π(at | st, θ)] ]     (remove irrelevant variables)
= E_{s0:t, a0:(t−1)} [ b(st) ∑_a πθ(at | st) (∇θ π(at | st, θ) / πθ(at | st)) ]     (likelihood ratio)
= E_{s0:t, a0:(t−1)} [ b(st) ∑_a ∇θ π(at | st, θ) ]
= E_{s0:t, a0:(t−1)} [ b(st) ∇θ ∑_a π(at | st, θ) ]
= E_{s0:t, a0:(t−1)} [ b(st) ∇θ 1 ]
= E_{s0:t, a0:(t−1)} [ b(st) · 0 ] = 0


”Vanilla” Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, · · · do
  Collect a set of trajectories by executing the current policy
  At each timestep in each trajectory, compute
    the return Rt = ∑_{t′=t}^{T−1} r_t′, and
    the advantage estimate At = Rt − b(st).
  Re-fit the baseline, by minimizing ||b(st) − Rt||², summed over all trajectories and timesteps.
  Update the policy, using a policy gradient estimate g, which is a sum of terms ∇θ log π(at | st, θ) At.
  (Plug g into SGD or ADAM)
end for
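A schematic NumPy version of one iteration of this loop (a sketch under my own assumptions, not the lecture's reference code): collect_trajectories, grad_log_pi, and features are hypothetical helpers, and the baseline is a linear function of state features re-fit by least squares.

```python
import numpy as np

def vanilla_pg_iteration(theta, w, collect_trajectories, grad_log_pi, features, alpha=0.01):
    """One iteration of the loop above (sketch).
    collect_trajectories(theta) -> list of trajectories, each a list of (s, a, r)
    grad_log_pi(s, a, theta)    -> score vector for the current policy
    features(s)                 -> feature vector for the linear baseline b(s) = features(s) @ w
    Returns updated (theta, w)."""
    trajectories = collect_trajectories(theta)

    X, R, sa_pairs = [], [], []
    for tau in trajectories:
        # returns-to-go R_t = sum_{t' >= t} r_t'
        G, rets = 0.0, []
        for (_, _, r) in reversed(tau):
            G = r + G
            rets.append(G)
        rets.reverse()
        for (s, a, _), R_t in zip(tau, rets):
            X.append(features(s))
            R.append(R_t)
            sa_pairs.append((s, a))

    X, R = np.array(X), np.array(R)
    advantages = R - X @ w                            # A_t = R_t - b(s_t)

    # Policy gradient estimate: sum of grad log pi(a_t|s_t) * A_t, averaged over trajectories
    g = sum(grad_log_pi(s, a, theta) * adv
            for (s, a), adv in zip(sa_pairs, advantages)) / len(trajectories)
    theta = theta + alpha * g                         # plain SGD step (could feed g to Adam instead)

    # Re-fit the baseline by least squares on ||b(s_t) - R_t||^2
    w, *_ = np.linalg.lstsq(X, R, rcond=None)
    return theta, w
```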


Practical Implementation with Autodiff

The usual formula ∑_t ∇θ log π(at | st; θ) At is inefficient; we want to batch the data

Define a "surrogate" function using data from the current batch

L(θ) = ∑_t log π(at | st; θ) At

Then policy gradient estimator g = ∇θL(θ)

Can also include the value function fit error:

L(θ) = ∑_t ( log π(at | st; θ) At − ||V(st) − Rt||² )
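A minimal PyTorch sketch of this surrogate objective (my own, not from the lecture), assuming a categorical policy network policy_net and a value network value_net; the value-loss weight vf_coef is an extra knob I added, not part of the slide's formula.

```python
import torch

def surrogate_loss(policy_net, value_net, states, actions, returns, vf_coef=0.5):
    """L(theta) = sum_t [ log pi(a_t|s_t; theta) * A_t - vf_coef * ||V(s_t) - R_t||^2 ].
    Autodiff of the negated surrogate yields the policy gradient estimate
    plus the value-fit gradient. actions must be a LongTensor."""
    logits = policy_net(states)                            # (batch, num_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    values = value_net(states).squeeze(-1)                 # baseline V(s_t)
    advantages = (returns - values).detach()               # treat A_t as a constant weight
    policy_term = (log_probs * advantages).sum()
    value_term = ((values - returns) ** 2).sum()
    return -(policy_term - vf_coef * value_term)           # minimize the negative surrogate

# usage sketch:
# loss = surrogate_loss(policy_net, value_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```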


Value Functions

Recall the Q-function / state-action value function:

Q^{π,γ}(s, a) = E_π[ r0 + γ r1 + γ² r2 + · · · | s0 = s, a0 = a ]

The state-value function can serve as a great baseline:

V^{π,γ}(s) = E_π[ r0 + γ r1 + γ² r2 + · · · | s0 = s ] = E_{a∼π}[ Q^{π,γ}(s, a) ]

Advantage function: Combining Q with baseline V

A^{π,γ}(s, a) = Q^{π,γ}(s, a) − V^{π,γ}(s)


N-step estimators

Can also consider blending between TD and MC estimators for the target to substitute for the true state-action value function.

R_t^(1) = rt + γ V(st+1)
R_t^(2) = rt + γ rt+1 + γ² V(st+2)
· · ·
R_t^(∞) = rt + γ rt+1 + γ² rt+2 + · · ·

If we subtract baselines from the above, we get advantage estimators

A_t^(1) = rt + γ V(st+1) − V(st)
A_t^(2) = rt + γ rt+1 + γ² V(st+2) − V(st)
· · ·
A_t^(∞) = rt + γ rt+1 + γ² rt+2 + · · · − V(st)

A_t^(1) has low variance & high bias. A_t^(∞) has high variance but low bias. (Why? Which model-free policy estimation techniques are these like?)

Using an intermediate k (say, 20) can give an intermediate amount of bias and variance.
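A sketch (not from the lecture) of the k-step advantage estimate for a recorded trajectory, given per-state value estimates V:

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """A_t^(k) = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1}
                 + gamma^k * V(s_{t+k}) - V(s_t).
    rewards: list of r_t; values: list with values[j] = V(s_j)
    (len(rewards) + 1 entries, with V of a terminal state taken to be 0)."""
    T = len(rewards)
    k = min(k, T - t)                        # truncate at the episode end
    ret = sum(gamma**i * rewards[t + i] for i in range(k))
    ret += gamma**k * values[t + k]
    return ret - values[t]

# k = 1 gives the TD(0)-style estimate; large k approaches the Monte Carlo estimate.
```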


Application: Robot Locomotion


Class Structure

Last time: Imitation Learning

This time: Policy Search

Next time: Policy Search Cont.
