
Lecture 8: Policy Gradient I

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018

Additional reading: Sutton and Barto 2018 Chp. 13

With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel


Last Time: We want RL Algorithms that Perform

Optimization

Delayed consequences

Exploration

Generalization

And do it statistically and computationally efficiently


Last Time: Generalization and Efficiency

Can use structure and additional knowledge to help constrain and speed reinforcement learning


Class Structure

Last time: Imitation Learning

This time: Policy Search

Next time: Policy Search Cont.


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Policy-Based Reinforcement Learning

In the last lecture we approximated the value or action-value function using parameters θ,

Vθ(s) ≈ V π(s)

Qθ(s, a) ≈ Qπ(s, a)

A policy was generated directly from the value function

e.g. using ε-greedy

In this lecture we will directly parametrize the policy

πθ(s, a) = P[a|s, θ]

Goal is to find a policy π with the highest value function V π

We will focus again on model-free reinforcement learning


Value-Based and Policy-Based RL

Value Based

Learnt Value Function
Implicit policy (e.g. ε-greedy)

Policy Based

No Value Function
Learnt Policy

Actor-Critic

Learnt Value Function
Learnt Policy


Advantages of Policy-Based RL

Advantages:

Better convergence properties

Effective in high-dimensional or continuous action spaces

Can learn stochastic policies

Disadvantages:

Typically converge to a local rather than global optimum

Evaluating a policy is typically inefficient and high variance


Example: Rock-Paper-Scissors

Two-player game of rock-paper-scissors

Scissors beats paper
Rock beats scissors
Paper beats rock

Consider policies for iterated rock-paper-scissors

A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. a Nash equilibrium)


Example: Aliased Gridworld (1)

The agent cannot differentiate the grey states

Consider features of the following form (for all N, E, S, W)

φ(s, a) = 1(wall to N, a = move E)

Compare value-based RL, using an approximate value function

Qθ(s, a) = f (φ(s, a), θ)

To policy-based RL, using a parametrised policy

πθ(s, a) = g(φ(s, a), θ)


Example: Aliased Gridworld (2)

Under aliasing, an optimal deterministic policy will either

move W in both grey states (shown by red arrows)
move E in both grey states

Either way, it can get stuck and never reach the money

Value-based RL learns a near-deterministic policy

e.g. greedy or ε-greedy

So it will traverse the corridor for a long time


Example: Aliased Gridworld (3)

An optimal stochastic policy will randomly move E or W in grey states

πθ(wall to N and W, move E) = 0.5

πθ(wall to N and W, move W) = 0.5

It will reach the goal state in a few steps with high probability

Policy-based RL can learn the optimal stochastic policy


Policy Objective Functions

Goal: given a policy πθ(s, a) with parameters θ, find best θ

But how do we measure the quality of a policy πθ?

In episodic environments we can use the start value of the policy

J1(θ) = V^πθ(s1) = E_πθ[v1]

In continuing environments we can use the average value

JavV(θ) = ∑_s d^πθ(s) V^πθ(s)

where d^πθ(s) is the stationary distribution of the Markov chain for πθ.

Or the average reward per time-step

JavR(θ) = ∑_s d^πθ(s) ∑_a πθ(s, a) R_s^a

For simplicity, today we will mostly discuss the episodic case, but it can easily be extended to the continuing / infinite-horizon case


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ

Can use gradient free optimization:

Hill climbing
Simplex / amoeba / Nelder-Mead
Genetic algorithms
Cross-Entropy Method (CEM)
Covariance Matrix Adaptation (CMA)
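As a concrete illustration of the simplest of these, here is a minimal random-search hill-climbing sketch in Python (my own, not from the lecture); evaluate_policy is a hypothetical routine that rolls out the policy with the given parameters and returns an estimated value.

```python
import numpy as np

def hill_climb(evaluate_policy, theta, iters=100, noise_scale=0.1, seed=0):
    """Gradient-free policy search: keep a candidate theta and accept a
    random perturbation whenever it evaluates at least as well."""
    rng = np.random.default_rng(seed)
    best_value = evaluate_policy(theta)
    for _ in range(iters):
        candidate = theta + noise_scale * rng.standard_normal(theta.shape)
        value = evaluate_policy(candidate)
        if value >= best_value:   # accept improvements (evaluations are noisy in practice)
            theta, best_value = candidate, value
    return theta, best_value
```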


Recall: Human-in-the-Loop Exoskeleton Optimization (Zhang et al., Science 2017)


Optimization was done using CMA-ES, a variation of covariance matrix adaptation


Gradient Free Policy Optimization

Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)" (https://blog.openai.com/evolution-strategies/)


Gradient Free Policy Optimization

Often a great simple baseline to try

Benefits

Can work with any policy parameterization, including non-differentiable ones
Frequently very easy to parallelize

Limitations

Typically not very sample efficient because it ignores temporal structure


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ

Can use gradient free optimization:

Greater efficiency often possible using gradient

Gradient descent
Conjugate gradient
Quasi-Newton

We focus on gradient descent, many extensions possible

And on methods that exploit sequential structure


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Policy Gradient

Define V(θ) = V^πθ to make explicit the dependence of the value on the policy parameters

Assume episodic MDPs (easy to extend to related objectives, like average reward)


Policy Gradient

Define V(θ) = V^πθ to make explicit the dependence of the value on the policy parameters

Assume episodic MDPs

Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of V(θ) w.r.t. the policy parameters θ

∆θ = α ∇θV(θ)

Where ∇θV (θ) is the policy gradient

∇θV(θ) = ( ∂V(θ)/∂θ_1, ..., ∂V(θ)/∂θ_n )ᵀ

and α is a step-size parameter


Computing Gradients by Finite Differences

To evaluate policy gradient of πθ(s, a)

For each dimension k ∈ [1, n]

Estimate the kth partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the kth dimension

∂V(θ)/∂θ_k ≈ ( V(θ + ε u_k) − V(θ) ) / ε

where u_k is a unit vector with 1 in the kth component, 0 elsewhere.


Computing Gradients by Finite Differences

To evaluate policy gradient of πθ(s, a)

For each dimension k ∈ [1, n]

Estimate the kth partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the kth dimension

∂V(θ)/∂θ_k ≈ ( V(θ + ε u_k) − V(θ) ) / ε

where u_k is a unit vector with 1 in the kth component, 0 elsewhere.

Uses n evaluations to compute policy gradient in n dimensions

Simple, noisy, inefficient - but sometimes effective

Works for arbitrary policies, even if policy is not differentiable
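To make this concrete, here is a minimal Python sketch of the finite-difference estimator above (not from the lecture); evaluate_policy is a hypothetical stand-in for a routine that runs the policy with parameters θ and returns an estimate of V(θ).

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient dV/dtheta by perturbing each
    coordinate of theta by eps (n evaluations for n dimensions)."""
    n = len(theta)
    grad = np.zeros(n)
    v0 = evaluate_policy(theta)            # V(theta)
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                       # unit vector in the k-th dimension
        grad[k] = (evaluate_policy(theta + eps * u_k) - v0) / eps
    return grad

# Usage sketch: ascend the estimated gradient.
# theta = theta + alpha * finite_difference_gradient(rollout_return, theta)
```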


Training AIBO to Walk by Finite Difference Policy Gradient

Goal: learn a fast AIBO walk (useful for Robocup)

Adapt the walk's parameters by finite difference policy gradient

Evaluate performance of policy by field traversal time

(Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA 2004. http://www.cs.utexas.edu/ai-lab/pubs/icra04.pdf)

AIBO Policy Parameterization

The AIBO walk policy is an open-loop policy

No state: it chooses a set of action parameters that define an ellipse

Specified by 12 continuous parameters (elliptical loci)

The front locus (3 parameters: height, x-pos., y-pos.)
The rear locus (3 parameters)
Locus length
Locus skew multiplier in the x-y plane (for turning)
The height of the front of the body
The height of the rear of the body
The time each foot takes to move through its locus
The fraction of time each foot spends on the ground

New policies: for each parameter, randomly add ε, 0, or −ε


AIBO Policy Experiments

"All of the policy evaluations took place on actual robots... only human intervention required during an experiment involved replacing discharged batteries ... about once an hour."

Ran on 3 Aibos at once

Evaluated 15 policies per iteration.

Each policy evaluated 3 times (to reduce noise) and averaged

Each iteration took 7.5 minutes

Used η = 2 (learning rate for their finite difference approach)


Training AIBO to Walk by Finite Difference Policy Gradient: Results

Authors discuss that performance is likely impacted by: initial starting policy parameters, ε (how much policies are perturbed), η (how much to change the policy), as well as the policy parameterization


AIBO Walk Policies

https://www.cs.utexas.edu/AustinVilla/?p=research/learned walk


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Computing the gradient analytically

We now compute the policy gradient analytically

Assume policy πθ is differentiable whenever it is non-zero

and we know the gradient ∇θπθ(s, a)


Likelihood Ratio Policies

Denote a state-action trajectory as τ = (s0, a0, r0, ..., sT−1, aT−1, rT−1, sT)

Use R(τ) = ∑_{t=0}^{T} R(st, at) to be the sum of rewards for a trajectory τ


Likelihood Ratio Policies

Denote a state-action trajectory as τ = (s0, a0, r0, ..., sT−1, aT−1, rT−1, sT)

Use R(τ) = ∑_{t=0}^{T} R(st, at) to be the sum of rewards for a trajectory τ

Policy value is

V(θ) = E_πθ [ ∑_{t=0}^{T} R(st, at); πθ ] = ∑_τ P(τ; θ) R(τ),     (1)

where P(τ; θ) is used to denote the probability over trajectories when executing policy πθ

In this new notation, our goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (2)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (3)

Take the gradient with respect to θ:

∇θ V(θ) = ∇θ ∑_τ P(τ; θ) R(τ)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (4)

Take the gradient with respect to θ:

∇θ V(θ) = ∇θ ∑_τ P(τ; θ) R(τ)
        = ∑_τ ∇θ P(τ; θ) R(τ)
        = ∑_τ (P(τ; θ) / P(τ; θ)) ∇θ P(τ; θ) R(τ)
        = ∑_τ P(τ; θ) R(τ) [∇θ P(τ; θ) / P(τ; θ)]     (likelihood ratio)
        = ∑_τ P(τ; θ) R(τ) ∇θ log P(τ; θ)
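As a sanity check (not in the slides), the likelihood-ratio identity can be verified numerically on a toy problem: a one-parameter Bernoulli "trajectory" distribution, comparing the score-function estimate of ∇θ E[R] against the exact gradient. All names and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward(x):
    return np.where(x == 1, 1.0, 0.2)     # arbitrary "return" per outcome

theta = 0.3
p = sigmoid(theta)                        # P(x = 1; theta)
x = rng.binomial(1, p, size=200_000)      # sampled "trajectories"

# Score function (likelihood ratio) estimator: mean of R(x) * d/dtheta log p(x; theta)
score = np.where(x == 1, 1.0 - p, -p)
lr_estimate = np.mean(reward(x) * score)

# Exact gradient of E[R] = 0.2 + 0.8 * sigmoid(theta) w.r.t. theta
exact = 0.8 * p * (1.0 - p)

print(lr_estimate, exact)                 # the two values should be close
```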


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (5)

Take the gradient with respect to θ:

∇θ V(θ) = ∑_τ P(τ; θ) R(τ) ∇θ log P(τ; θ)

Approximate with empirical estimate for m sample paths under policy πθ:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)


Score Function Gradient Estimator: Intuition

Consider the generic form of R(τ^(i)) ∇θ log P(τ^(i); θ):

g_i = f(x_i) ∇θ log p(x_i | θ)

f (x) measures how good the sample x is.

Moving in the direction g_i pushes up the log probability of the sample, in proportion to how good it is

Valid even if f(x) is discontinuous and unknown, or the sample space (containing x) is a discrete set


Score Function Gradient Estimator: Intuition

g_i = f(x_i) ∇θ log p(x_i | θ)


Score Function Gradient Estimator: Intuition

g_i = f(x_i) ∇θ log p(x_i | θ)


Decomposing the Trajectories Into States and Actions

Approximate with empirical estimate for m sample paths under policy πθ:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)

∇θ log P(τ^(i); θ) =


Decomposing the Trajectories Into States and Actions

Approximate with empirical estimate for m sample paths under policy πθ:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)

∇θ log P(τ^(i); θ) = ∇θ log [ µ(s0) ∏_{t=0}^{T−1} πθ(at | st) P(st+1 | st, at) ]
                     (initial state distrib. × policy × dynamics model)

= ∇θ [ log µ(s0) + ∑_{t=0}^{T−1} ( log πθ(at | st) + log P(st+1 | st, at) ) ]

= ∑_{t=0}^{T−1} ∇θ log πθ(at | st)     (no dynamics model required!)


Score Function

Define score function as ∇θ log πθ(s, a)


Likelihood Ratio / Score Function Policy Gradient

Putting this together

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).     (6)

Approximate with empirical estimate for m sample paths under policy πθ using the score function:

∇θ V(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)
            = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Do not need to know dynamics model
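A minimal sketch (mine, not the lecture's code) of this estimator: given m sampled trajectories and a routine that returns the score ∇θ log πθ(a|s), accumulate R(τ) times the summed score for each trajectory; grad_log_pi is a hypothetical callable supplied by whatever policy class is in use.

```python
import numpy as np

def policy_gradient_estimate(trajectories, grad_log_pi):
    """trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    grad_log_pi(state, action) -> ndarray, the score d/dtheta log pi_theta(a|s).
    Returns g ~= (1/m) sum_i R(tau_i) * sum_t grad log pi(a_t|s_t)."""
    m = len(trajectories)
    g = None
    for tau in trajectories:
        total_return = sum(r for (_, _, r) in tau)            # R(tau)
        score_sum = sum(grad_log_pi(s, a) for (s, a, _) in tau)
        term = total_return * score_sum
        g = term if g is None else g + term
    return g / m
```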


Policy Gradient Theorem

The policy gradient theorem generalizes the likelihood ratio approach

Theorem

For any differentiable policy πθ(s, a), and for any of the policy objective functions J = J1 (episodic reward), JavR (average reward per time step), or (1/(1−γ)) JavV (average value), the policy gradient is

∇θ J(θ) = E_πθ [ ∇θ log πθ(s, a) Q^πθ(s, a) ]


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Likelihood Ratio / Score Function Policy Gradient

∇θ V(θ) ≈ (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Unbiased but very noisy

Fixes that can make it practical

Temporal structure
Baseline

Next time will discuss some additional tricks


Policy Gradient: Use Temporal Structure

Previously:

∇θ E_τ[R] = E_τ [ ( ∑_{t=0}^{T−1} r_t ) ( ∑_{t=0}^{T−1} ∇θ log πθ(at | st) ) ]

We can repeat the same argument to derive the gradient estimator for a single reward term r_t′.

∇θ E[r_t′] = E [ r_t′ ∑_{t=0}^{t′} ∇θ log πθ(at | st) ]

Summing this formula over t′, we obtain

∇θ E[R] = E [ ∑_{t′=0}^{T−1} r_t′ ∑_{t=0}^{t′} ∇θ log πθ(at | st) ]
        = E [ ∑_{t=0}^{T−1} ∇θ log πθ(at | st) ∑_{t′=t}^{T−1} r_t′ ]

Policy Gradient: Use Temporal Structure

Recall that for a particular trajectory τ^(i), ∑_{t′=t}^{T−1} r_t′^(i) is the return G_t^(i)

∇θ E[R] ≈ (1/m) ∑_{i=1}^{m} ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i)) G_t^(i)
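A small helper (illustrative, not from the slides) that computes the reward-to-go G_t for every timestep of a trajectory, which is the quantity each score term is weighted by in the estimator above.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum_{t' >= t} gamma^(t'-t) * r_t' for each t.
    The undiscounted case (gamma = 1) matches the estimator above."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# e.g. rewards_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```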


Monte-Carlo Policy Gradient (REINFORCE)

Leverages likelihood ratio / score function and temporal structure

∆θt = α∇θ log πθ(st , at)Gt (7)

REINFORCE:
  Initialize policy parameters θ arbitrarily
  for each episode {s1, a1, r2, · · · , sT−1, aT−1, rT} ∼ πθ do
    for t = 1 to T − 1 do
      θ ← θ + α ∇θ log πθ(st, at) Gt
    end for
  end for
  return θ
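A compact NumPy sketch of REINFORCE with a linear-softmax policy (my own rendering of the pseudocode above, under assumed conventions): each episode is a list of (φ, a, r) tuples, where φ is a per-state feature matrix with one row per action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(phi, action, theta):
    """Score for a linear-softmax policy: phi(s,a) - E_pi[phi(s,.)].
    phi: (num_actions, num_features) feature matrix for this state."""
    probs = softmax(phi @ theta)
    return phi[action] - probs @ phi

def reinforce_update(theta, episode, alpha=0.01):
    """episode: list of (phi, action, reward) tuples as described above.
    Applies the REINFORCE update theta += alpha * grad log pi * G_t."""
    rewards = [r for (_, _, r) in episode]
    G, returns = 0.0, []
    for r in reversed(rewards):              # returns-to-go
        G = r + G
        returns.append(G)
    returns.reverse()
    for (phi, a, _), G_t in zip(episode, returns):
        theta = theta + alpha * grad_log_softmax(phi, a, theta) * G_t
    return theta
```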


Differentiable Policy Classes

Many choices of differentiable policy classes including:

Softmax
Gaussian
Neural networks


Softmax Policy

Weight actions using a linear combination of features φ(s, a)^T θ

Probability of action is proportional to exponentiated weight

πθ(s, a) = e^{φ(s,a)^T θ} / ∑_{a′} e^{φ(s,a′)^T θ}     (8)

The score function is

∇θ log πθ(s, a) = φ(s, a)− Eπθ[φ(s, ·)]
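The softmax score can be checked numerically; the sketch below (illustrative, not from the slides) compares the analytic expression φ(s, a) − E_πθ[φ(s, ·)] against a finite-difference gradient of log πθ(s, a) for random features.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
num_actions, num_features = 4, 3
phi = rng.standard_normal((num_actions, num_features))   # phi(s, a) for one state s
theta = rng.standard_normal(num_features)
a = 2

def log_pi(theta):
    return np.log(softmax(phi @ theta)[a])

# Analytic score: phi(s, a) - E_pi[phi(s, .)]
probs = softmax(phi @ theta)
analytic = phi[a] - probs @ phi

# Finite-difference gradient of log pi_theta(s, a)
eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(num_features)[k]) - log_pi(theta)) / eps
                    for k in range(num_features)])

print(np.allclose(analytic, numeric, atol=1e-4))          # expect True
```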


Gaussian Policy

In continuous action spaces, a Gaussian policy is natural

Mean is a linear combination of state features: µ(s) = φ(s)^T θ

Variance may be fixed at σ², or it can also be parametrised

Policy is Gaussian: a ∼ N(µ(s), σ²)

The score function is

∇θ log πθ(s, a) = (a − µ(s)) φ(s) / σ²
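A small sketch (mine, not from the slides) of the Gaussian policy: sample a ∼ N(φ(s)^T θ, σ²) and compute the score (a − µ(s)) φ(s) / σ².

```python
import numpy as np

def gaussian_policy_sample_and_score(phi_s, theta, sigma, rng):
    """phi_s: feature vector phi(s); returns (sampled action, score vector)."""
    mu = phi_s @ theta                       # mean mu(s) = phi(s)^T theta
    a = rng.normal(mu, sigma)                # a ~ N(mu(s), sigma^2)
    score = (a - mu) * phi_s / sigma**2      # grad_theta log pi_theta(s, a)
    return a, score

rng = np.random.default_rng(0)
action, score = gaussian_policy_sample_and_score(np.array([1.0, 0.5]),
                                                 np.array([0.2, -0.1]), 0.5, rng)
```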


Likelihood Ratio / Score Function Policy Gradient

∇θ V(θ) ≈ (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Unbiased but very noisy

Fixes that can make it practical

Temporal structure
Baseline

Next time will discuss some additional tricks


Policy Gradient: Introduce Baseline

Reduce variance by introducing a baseline b(s)

∇θ E_τ[R] = E_τ [ ∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ( ∑_{t′=t}^{T−1} r_t′ − b(st) ) ]

For any choice of b(s), gradient estimator is unbiased.

A near-optimal choice is the expected return, b(st) ≈ E[rt + rt+1 + · · · + rT−1]

Interpretation: increase the log probability of action at proportionally to how much the returns ∑_{t′=t}^{T−1} r_t′ are better than expected


Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇θ log π(at | st, θ) b(st)]
= E_{s0:t, a0:(t−1)} [ E_{s(t+1):T, at:(T−1)} [∇θ log π(at | st, θ) b(st)] ]


Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇θ log π(at | st, θ) b(st)]
= E_{s0:t, a0:(t−1)} [ E_{s(t+1):T, at:(T−1)} [∇θ log π(at | st, θ) b(st)] ]     (break up expectation)
= E_{s0:t, a0:(t−1)} [ b(st) E_{s(t+1):T, at:(T−1)} [∇θ log π(at | st, θ)] ]     (pull baseline term out)
= E_{s0:t, a0:(t−1)} [ b(st) E_{at} [∇θ log π(at | st, θ)] ]     (remove irrelevant variables)
= E_{s0:t, a0:(t−1)} [ b(st) ∑_a πθ(at | st) (∇θ π(at | st, θ) / πθ(at | st)) ]     (likelihood ratio)
= E_{s0:t, a0:(t−1)} [ b(st) ∑_a ∇θ π(at | st, θ) ]
= E_{s0:t, a0:(t−1)} [ b(st) ∇θ ∑_a π(at | st, θ) ]
= E_{s0:t, a0:(t−1)} [ b(st) ∇θ 1 ]
= E_{s0:t, a0:(t−1)} [ b(st) · 0 ] = 0


”Vanilla” Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, · · · do
  Collect a set of trajectories by executing the current policy
  At each timestep in each trajectory, compute
    the return Rt = ∑_{t′=t}^{T−1} r_t′, and
    the advantage estimate At = Rt − b(st).
  Re-fit the baseline, by minimizing ||b(st) − Rt||², summed over all trajectories and timesteps.
  Update the policy, using a policy gradient estimate g, which is a sum of terms ∇θ log π(at | st, θ) At.
  (Plug g into SGD or ADAM)
end for
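A schematic NumPy version of one iteration of this loop (a sketch under my own assumptions, not the lecture's reference code): collect_trajectories, grad_log_pi, and features are hypothetical helpers, and the baseline is a linear function of state features re-fit by least squares.

```python
import numpy as np

def vanilla_pg_iteration(theta, w, collect_trajectories, grad_log_pi, features, alpha=0.01):
    """One iteration of the loop above (sketch).
    collect_trajectories(theta) -> list of trajectories, each a list of (s, a, r)
    grad_log_pi(s, a, theta)    -> score vector for the current policy
    features(s)                 -> feature vector for the linear baseline b(s) = features(s) @ w
    Returns updated (theta, w)."""
    trajectories = collect_trajectories(theta)

    X, R, sa_pairs = [], [], []
    for tau in trajectories:
        # returns-to-go R_t = sum_{t' >= t} r_t'
        G, rets = 0.0, []
        for (_, _, r) in reversed(tau):
            G = r + G
            rets.append(G)
        rets.reverse()
        for (s, a, _), R_t in zip(tau, rets):
            X.append(features(s))
            R.append(R_t)
            sa_pairs.append((s, a))

    X, R = np.array(X), np.array(R)
    advantages = R - X @ w                            # A_t = R_t - b(s_t)

    # Policy gradient estimate: sum of grad log pi(a_t|s_t) * A_t, averaged over trajectories
    g = sum(grad_log_pi(s, a, theta) * adv
            for (s, a), adv in zip(sa_pairs, advantages)) / len(trajectories)
    theta = theta + alpha * g                         # plain SGD step (could feed g to Adam instead)

    # Re-fit the baseline by least squares on ||b(s_t) - R_t||^2
    w, *_ = np.linalg.lstsq(X, R, rcond=None)
    return theta, w
```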


Practical Implementation with Autodiff

The usual formula ∑_t ∇θ log π(at | st; θ) At is inefficient; we want to batch the data

Define a "surrogate" function using data from the current batch

L(θ) = ∑_t log π(at | st; θ) At

Then policy gradient estimator g = ∇θL(θ)

Can also include the value function fit error:

L(θ) = ∑_t ( log π(at | st; θ) At − ||V(st) − Rt||² )
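A minimal PyTorch sketch of this surrogate objective (my own, not from the lecture), assuming a categorical policy network policy_net and a value network value_net; the value-loss weight vf_coef is an extra knob I added, not part of the slide's formula.

```python
import torch

def surrogate_loss(policy_net, value_net, states, actions, returns, vf_coef=0.5):
    """L(theta) = sum_t [ log pi(a_t|s_t; theta) * A_t - vf_coef * ||V(s_t) - R_t||^2 ].
    Autodiff of the negated surrogate yields the policy gradient estimate
    plus the value-fit gradient. actions must be a LongTensor."""
    logits = policy_net(states)                            # (batch, num_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    values = value_net(states).squeeze(-1)                 # baseline V(s_t)
    advantages = (returns - values).detach()               # treat A_t as a constant weight
    policy_term = (log_probs * advantages).sum()
    value_term = ((values - returns) ** 2).sum()
    return -(policy_term - vf_coef * value_term)           # minimize the negative surrogate

# usage sketch:
# loss = surrogate_loss(policy_net, value_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```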


Value Functions

Recall the Q-function / state-action value function:

Q^{π,γ}(s, a) = E_π[ r0 + γ r1 + γ² r2 + · · · | s0 = s, a0 = a ]

The state-value function can serve as a great baseline:

V^{π,γ}(s) = E_π[ r0 + γ r1 + γ² r2 + · · · | s0 = s ] = E_{a∼π}[ Q^{π,γ}(s, a) ]

Advantage function: Combining Q with baseline V

A^{π,γ}(s, a) = Q^{π,γ}(s, a) − V^{π,γ}(s)


N-step estimators

Can also consider blending between TD and MC estimators for the target to substitute for the true state-action value function.

R_t^(1) = rt + γ V(st+1)
R_t^(2) = rt + γ rt+1 + γ² V(st+2)
· · ·
R_t^(∞) = rt + γ rt+1 + γ² rt+2 + · · ·

If we subtract baselines from the above, we get advantage estimators

A_t^(1) = rt + γ V(st+1) − V(st)
A_t^(2) = rt + γ rt+1 + γ² V(st+2) − V(st)
· · ·
A_t^(∞) = rt + γ rt+1 + γ² rt+2 + · · · − V(st)

A_t^(1) has low variance & high bias. A_t^(∞) has high variance but low bias. (Why? Which model-free policy estimation techniques are these like?)

Using an intermediate k (say, 20) can give an intermediate amount of bias and variance.
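A sketch (not from the lecture) of the k-step advantage estimate for a recorded trajectory, given per-state value estimates V:

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """A_t^(k) = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1}
                 + gamma^k * V(s_{t+k}) - V(s_t).
    rewards: list of r_t; values: list with values[j] = V(s_j)
    (len(rewards) + 1 entries, with V of a terminal state taken to be 0)."""
    T = len(rewards)
    k = min(k, T - t)                        # truncate at the episode end
    ret = sum(gamma**i * rewards[t + i] for i in range(k))
    ret += gamma**k * values[t + k]
    return ret - values[t]

# k = 1 gives the TD(0)-style estimate; large k approaches the Monte Carlo estimate.
```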


Application: Robot Locomotion


Class Structure

Last time: Imitation Learning

This time: Policy Search

Next time: Policy Search Cont.
