
Transcript of Lecture 8: Policy Gradient I - Stanford University (web.stanford.edu/class/cs234/slides/cs234_2018_l8.pdf)


Lecture 8: Policy Gradient I

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018

Additional reading: Sutton and Barto 2018 Chp. 13

With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel


Last Time: We want RL Algorithms that Perform

Optimization

Delayed consequences

Exploration

Generalization

And do it statistically and computationally efficiently


Last Time: Generalization and Efficiency

Can use structure and additional knowledge to help constrain and speed reinforcement learning


Class Structure

Last time: Imitation Learning

This time: Policy Search

Next time: Policy Search Cont.


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Policy-Based Reinforcement Learning

In the last lecture we approximated the value or action-value function using parameters θ,

Vθ(s) ≈ V π(s)

Qθ(s, a) ≈ Qπ(s, a)

A policy was generated directly from the value function

e.g. using ε-greedy

In this lecture we will directly parametrize the policy

πθ(s, a) = P[a|s, θ]

Goal is to find a policy π with the highest value function V π

We will focus again on model-free reinforcement learning


Value-Based and Policy-Based RL

Value Based

Learnt Value Function
Implicit policy (e.g. ε-greedy)

Policy Based

No Value Function
Learnt Policy

Actor-Critic

Learnt Value Function
Learnt Policy


Advantages of Policy-Based RL

Advantages:

Better convergence properties

Effective in high-dimensional or continuous action spaces

Can learn stochastic policies

Disadvantages:

Typically converge to a local rather than global optimum

Evaluating a policy is typically inefficient and high variance


Example: Rock-Paper-Scissors

Two-player game of rock-paper-scissors

Scissors beats paper
Rock beats scissors
Paper beats rock

Consider policies for iterated rock-paper-scissors

A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. Nash equilibrium)


Example: Aliased Gridworld (1)

The agent cannot differentiate the grey states

Consider features of the following form (for all N, E, S, W)

φ(s, a) = 1(wall to N, a = move E)

Compare value-based RL, using an approximate value function

Qθ(s, a) = f (φ(s, a), θ)

To policy-based RL, using a parametrised policy

πθ(s, a) = g(φ(s, a), θ)


Example: Aliased Gridworld (2)

Under aliasing, an optimal deterministic policy will either

move W in both grey states (shown by red arrows)
move E in both grey states

Either way, it can get stuck and never reach the money

Value-based RL learns a near-deterministic policy

e.g. greedy or ε-greedy

So it will traverse the corridor for a long time


Example: Aliased Gridworld (3)

An optimal stochastic policy will randomly move E or W in grey states

πθ(wall to N and W, move E) = 0.5

πθ(wall to N and W, move W) = 0.5

It will reach the goal state in a few steps with high probability

Policy-based RL can learn the optimal stochastic policy


Policy Objective Functions

Goal: given a policy πθ(s, a) with parameters θ, find best θ

But how do we measure the quality of a policy πθ?

In episodic environments we can use the start value of the policy

J1(θ) = V πθ(s1) = Eπθ [v1]

In continuing environments we can use the average value

J_avV(θ) = ∑_s d^πθ(s) V^πθ(s)

where d^πθ(s) is the stationary distribution of the Markov chain for πθ.

Or the average reward per time-step

J_avR(θ) = ∑_s d^πθ(s) ∑_a πθ(s, a) R_s^a

For simplicity, today we will mostly discuss the episodic case, but it is easy to extend to the continuing / infinite-horizon case
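To make the episodic objective concrete, here is a minimal Monte Carlo sketch of estimating J_1(θ) = V^πθ(s_1) by rolling out the policy; the softmax parameterization and the env.reset()/env.step() interface are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Illustrative softmax policy: pi(a|s) proportional to exp(phi(s,a)^T theta)."""
    logits = phi_s @ theta              # phi_s: (num_actions, num_features)
    logits = logits - logits.max()      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def estimate_J1(theta, env, featurize, num_episodes=100):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_1): average episodic return."""
    returns = []
    for _ in range(num_episodes):
        s, done, G = env.reset(), False, 0.0
        while not done:
            p = softmax_policy(theta, featurize(s))
            a = np.random.choice(len(p), p=p)
            s, r, done = env.step(a)    # assumed environment interface
            G += r                      # undiscounted episodic return
        returns.append(G)
    return float(np.mean(returns))
```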


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ

Can use gradient free optimization:

Hill climbing
Simplex / amoeba / Nelder-Mead
Genetic algorithms
Cross-Entropy Method (CEM)
Covariance Matrix Adaptation (CMA)


Recall: Human-in-the-Loop Exoskeleton Optimization (Zhang et al., Science 2017)


Optimization was done using CMA-ES, a variant of covariance matrix adaptation


Gradient Free Policy Optimization

Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)" (https://blog.openai.com/evolution-strategies/)


Gradient Free Policy Optimization

Often a great simple baseline to try

Benefits

Can work with any policy parameterization, including non-differentiable ones
Frequently very easy to parallelize

Limitations

Typically not very sample efficient because it ignores temporal structure
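As a concrete instance of one of the gradient-free methods listed two slides back, here is a minimal cross-entropy method (CEM) sketch; `evaluate_policy` is a hypothetical function returning the (noisy) average return of the policy with parameters θ.

```python
import numpy as np

def cross_entropy_method(evaluate_policy, dim, iters=50, pop_size=100, elite_frac=0.2):
    """Gradient-free policy search: sample parameter vectors, keep the elites,
    and refit a Gaussian sampling distribution to them."""
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        thetas = mu + sigma * np.random.randn(pop_size, dim)       # sample candidate parameters
        scores = np.array([evaluate_policy(th) for th in thetas])  # noisy return estimates
        elites = thetas[np.argsort(scores)[-n_elite:]]             # best-performing candidates
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3 # refit sampling distribution
    return mu
```

Note how the episodes only enter through their total return: the temporal structure inside each trajectory is ignored, which is exactly the sample-inefficiency limitation mentioned above.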


Policy optimization

Policy based reinforcement learning is an optimization problem

Find policy parameters θ that maximize V πθ

Can use gradient-free optimization, as above

Greater efficiency often possible using gradient

Gradient descent
Conjugate gradient
Quasi-Newton

We focus on gradient descent, many extensions possible

And on methods that exploit sequential structure


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Policy Gradient

Define V(θ) = V^πθ to make explicit the dependence of the value on the policy parameters

Assume episodic MDPs (easy to extend to related objectives, like average reward)


Policy Gradient

Define V(θ) = V^πθ to make explicit the dependence of the value on the policy parameters

Assume episodic MDPs

Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of the policy, w.r.t. parameters θ

∆θ = α∇θV(θ)

Where ∇θV (θ) is the policy gradient

∇θV(θ) = ( ∂V(θ)/∂θ_1, ..., ∂V(θ)/∂θ_n )^T

and α is a step-size parameter


Computing Gradients by Finite Differences

To evaluate policy gradient of πθ(s, a)

For each dimension k ∈ [1, n]

Estimate the kth partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the kth dimension

∂V(θ)/∂θ_k ≈ ( V(θ + εu_k) − V(θ) ) / ε

where uk is a unit vector with 1 in kth component, 0 elsewhere.


Computing Gradients by Finite Differences

To evaluate policy gradient of πθ(s, a)

For each dimension k ∈ [1, n]

Estimate the kth partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the kth dimension

∂V(θ)/∂θ_k ≈ ( V(θ + εu_k) − V(θ) ) / ε

where uk is a unit vector with 1 in kth component, 0 elsewhere.

Uses n evaluations to compute policy gradient in n dimensions

Simple, noisy, inefficient - but sometimes effective

Works for arbitrary policies, even if policy is not differentiable
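A sketch of the finite-difference estimator above; `evaluate_policy(theta)` is a hypothetical rollout-based return estimate, and since it is noisy each evaluation would in practice average several episodes.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dV/dtheta_k ≈ (V(theta + eps*u_k) - V(theta)) / eps for each dimension k."""
    grad = np.zeros_like(theta)
    V = evaluate_policy(theta)                 # baseline evaluation V(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                           # unit vector in dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - V) / eps
    return grad

# One step of gradient ascent on V(theta):
# theta = theta + alpha * finite_difference_gradient(evaluate_policy, theta)
```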


Training AIBO to Walk by Finite Difference Policy Gradient

Goal: learn a fast AIBO walk (useful for Robocup)

Adapt these parameters by finite difference policy gradient

Evaluate performance of policy by field traversal time

(Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA 2004. http://www.cs.utexas.edu/ai-lab/pubs/icra04.pdf)


AIBO Policy Parameterization

AIBO walk policy is open-loop policy

No state; the policy chooses a set of action parameters that define an ellipse

Specified by 12 continuous parameters (elliptical loci)

The front locus (3 parameters: height, x-pos., y-pos.)
The rear locus (3 parameters)
Locus length
Locus skew multiplier in the x-y plane (for turning)
The height of the front of the body
The height of the rear of the body
The time each foot takes to move through its locus
The fraction of time each foot spends on the ground

New policies: for each parameter, randomly add ε, 0, or −ε


AIBO Policy Experiments

"All of the policy evaluations took place on actual robots... only human intervention required during an experiment involved replacing discharged batteries ... about once an hour."

Ran on 3 Aibos at once

Evaluated 15 policies per iteration.

Each policy evaluated 3 times (to reduce noise) and averaged

Each iteration took 7.5 minutes

Used η = 2 (learning rate for their finite difference approach)


Training AIBO to Walk by Finite Difference Policy Gradient: Results

Authors discuss that performance is likely impacted by: initial starting policy parameters, ε (how much policies are perturbed), η (how much to change the policy), as well as the policy parameterization


AIBO Walk Policies

https://www.cs.utexas.edu/AustinVilla/?p=research/learned walk


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Computing the gradient analytically

We now compute the policy gradient analytically

Assume policy πθ is differentiable whenever it is non-zero

and we know the gradient ∇θπθ(s, a)


Likelihood Ratio Policies

Denote a state-action trajectory as τ = (s_0, a_0, r_0, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T)

Use R(τ) = ∑_{t=0}^{T} R(s_t, a_t) to denote the sum of rewards for a trajectory τ


Likelihood Ratio Policies

Denote a state-action trajectory as τ = (s_0, a_0, r_0, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T)

Use R(τ) = ∑_{t=0}^{T} R(s_t, a_t) to denote the sum of rewards for a trajectory τ

Policy value is

V(θ) = E_πθ[ ∑_{t=0}^{T} R(s_t, a_t); πθ ] = ∑_τ P(τ; θ) R(τ),    (1)

where P(τ; θ) denotes the probability over trajectories when executing policy πθ

In this new notation, our goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).    (2)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).    (3)

Take the gradient with respect to θ:

∇θV(θ) = ∇θ ∑_τ P(τ; θ) R(τ)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).    (4)

Take the gradient with respect to θ:

∇θV(θ) = ∇θ ∑_τ P(τ; θ) R(τ)
        = ∑_τ ∇θP(τ; θ) R(τ)
        = ∑_τ (P(τ; θ)/P(τ; θ)) ∇θP(τ; θ) R(τ)
        = ∑_τ P(τ; θ) R(τ) [ ∇θP(τ; θ) / P(τ; θ) ]    ← likelihood ratio
        = ∑_τ P(τ; θ) R(τ) ∇θ log P(τ; θ)


Likelihood Ratio Policy Gradient

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).    (5)

Take the gradient with respect to θ:

∇θV(θ) = ∑_τ P(τ; θ) R(τ) ∇θ log P(τ; θ)

Approximate with an empirical estimate for m sample paths under policy πθ:

∇θV(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)


Score Function Gradient Estimator: Intuition

Consider the generic form of R(τ^(i)) ∇θ log P(τ^(i); θ):

g_i = f(x_i) ∇θ log p(x_i | θ)

f (x) measures how good the sample x is.

Moving in the direction g_i pushes up the log-prob of the sample, in proportion to how good it is

Valid even if f(x) is discontinuous and unknown, or the sample space (containing x) is a discrete set
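A toy illustration of the generic estimator g_i = f(x_i) ∇θ log p(x_i | θ), here for x ∼ N(θ, σ²) and a discontinuous f; the particular distribution and f are illustrative choices, not from the slides.

```python
import numpy as np

def score_function_gradient(f, theta, sigma=1.0, num_samples=100_000):
    """Estimate d/dtheta E_{x ~ N(theta, sigma^2)}[f(x)] with g_i = f(x_i) * d log p(x_i|theta)/dtheta."""
    x = theta + sigma * np.random.randn(num_samples)
    score = (x - theta) / sigma**2       # gradient of log N(x | theta, sigma^2) w.r.t. theta
    return float(np.mean(f(x) * score))

# f can be a discontinuous, black-box "reward": an indicator of x > 1
f = lambda x: (x > 1.0).astype(float)
print(score_function_gradient(f, theta=0.0))   # approx. 0.24 (the N(0,1) density at x = 1)
```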


Decomposing the Trajectories Into States and Actions

Approximate with an empirical estimate for m sample paths under policy πθ:

∇θV(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)

∇θ log P(τ^(i); θ) =


Decomposing the Trajectories Into States and Actions

Approximate with an empirical estimate for m sample paths under policy πθ:

∇θV(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)

∇θ log P(τ^(i); θ) = ∇θ log [ µ(s_0) ∏_{t=0}^{T−1} πθ(a_t | s_t) P(s_{t+1} | s_t, a_t) ]
    (initial state distribution × policy × dynamics model)

= ∇θ [ log µ(s_0) + ∑_{t=0}^{T−1} ( log πθ(a_t | s_t) + log P(s_{t+1} | s_t, a_t) ) ]

= ∑_{t=0}^{T−1} ∇θ log πθ(a_t | s_t)    ← no dynamics model required!


Score Function

Define score function as ∇θ log πθ(s, a)


Likelihood Ratio / Score Function Policy Gradient

Putting this together

Goal is to find the policy parameters θ:

arg max_θ V(θ) = arg max_θ ∑_τ P(τ; θ) R(τ).    (6)

Approximate with an empirical estimate for m sample paths under policy πθ using the score function:

∇θV(θ) ≈ g = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∇θ log P(τ^(i); θ)
         = (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Do not need to know dynamics model
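A sketch of this estimator: average R(τ^(i)) ∑_t ∇θ log πθ(a_t | s_t) over sampled trajectories. The trajectory format (lists of (s, a, r) tuples) and the `grad_log_pi` callable are assumed interfaces; concrete score functions for softmax and Gaussian policies are sketched later in the lecture.

```python
import numpy as np

def likelihood_ratio_gradient(theta, trajectories, grad_log_pi):
    """g = (1/m) sum_i R(tau_i) * sum_t grad log pi_theta(a_t | s_t).

    trajectories: list of episodes, each a list of (s, a, r) tuples (assumed format).
    grad_log_pi(theta, s, a): score function of the chosen policy class.
    """
    g = np.zeros_like(theta)
    for traj in trajectories:
        R = sum(r for _, _, r in traj)          # total reward R(tau) of the trajectory
        for s, a, _ in traj:
            g += R * grad_log_pi(theta, s, a)   # weight every score by the whole-trajectory reward
    return g / len(trajectories)
```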


Policy Gradient Theorem

The policy gradient theorem generalizes the likelihood ratio approach

Theorem

For any differentiable policy πθ(s, a), and for any of the policy objective functions J = J_1 (episodic reward), J_avR (average reward per time step), or (1/(1−γ)) J_avV (average value), the policy gradient is

∇θJ(θ) = E_πθ[ ∇θ log πθ(s, a) Q^πθ(s, a) ]


Table of Contents

1 Introduction

2 Policy Gradient

3 Score Function and Policy Gradient Theorem

4 Policy Gradient Algorithms and Reducing Variance


Likelihood Ratio / Score Function Policy Gradient

∇θV(θ) ≈ (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Unbiased but very noisy

Fixes that can make it practical

Temporal structure
Baseline

Next time will discuss some additional tricks


Policy Gradient: Use Temporal Structure

Previously:

∇θE_τ[R] = E_τ[ ( ∑_{t=0}^{T−1} r_t ) ( ∑_{t=0}^{T−1} ∇θ log πθ(a_t | s_t) ) ]

We can repeat the same argument to derive the gradient estimator for a single reward term r_{t'}:

∇θE[r_{t'}] = E[ r_{t'} ∑_{t=0}^{t'} ∇θ log πθ(a_t | s_t) ]

Summing this formula over t', we obtain

∇θE[R] = E[ ∑_{t'=0}^{T−1} r_{t'} ∑_{t=0}^{t'} ∇θ log πθ(a_t | s_t) ]
        = E[ ∑_{t=0}^{T−1} ∇θ log πθ(a_t | s_t) ∑_{t'=t}^{T−1} r_{t'} ]


Policy Gradient: Use Temporal Structure

Recall that for a particular trajectory τ^(i), ∑_{t'=t}^{T−1} r_{t'}^(i) is the return G_t^(i)

∇θE[R] ≈ (1/m) ∑_{i=1}^{m} ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i)) G_t^(i)


Monte-Carlo Policy Gradient (REINFORCE)

Leverages likelihood ratio / score function and temporal structure

∆θ_t = α ∇θ log πθ(s_t, a_t) G_t    (7)

REINFORCE:
    Initialize policy parameters θ arbitrarily
    for each episode {s_1, a_1, r_2, · · · , s_{T−1}, a_{T−1}, r_T} ∼ πθ do
        for t = 1 to T − 1 do
            θ ← θ + α ∇θ log πθ(s_t, a_t) G_t
        endfor
    endfor
    return θ
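A minimal runnable sketch of REINFORCE with a tabular softmax policy; the environment interface (env.reset(), env.step(a) returning (s, r, done)) and the tabular parameterization are illustrative assumptions rather than the slide's exact setup.

```python
import numpy as np

def reinforce(env, num_states, num_actions, alpha=0.01, num_episodes=1000, gamma=1.0):
    """Monte-Carlo policy gradient: theta <- theta + alpha * grad log pi(a_t|s_t) * G_t."""
    theta = np.zeros((num_states, num_actions))    # tabular softmax policy parameters

    def policy(s):
        logits = theta[s] - theta[s].max()
        p = np.exp(logits)
        return p / p.sum()

    for _ in range(num_episodes):
        # Roll out one episode with the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=policy(s))
            s_next, r, done = env.step(a)          # assumed environment interface
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Compute returns G_t backwards and apply the REINFORCE update at every timestep
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            s_t, a_t = states[t], actions[t]
            grad_log_pi = -policy(s_t)             # d log pi(a_t|s_t) / d theta[s_t, :]
            grad_log_pi[a_t] += 1.0                # = 1{a = a_t} - pi(a|s_t)
            theta[s_t] += alpha * grad_log_pi * G
    return theta
```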


Differentiable Policy Classes

Many choices of differentiable policy classes including:

Softmax
Gaussian
Neural networks


Softmax Policy

Weight actions using a linear combination of features, φ(s, a)^T θ

Probability of action is proportional to exponentiated weight

πθ(s, a) = e^{φ(s,a)^T θ} / ∑_{a'} e^{φ(s,a')^T θ}    (8)

The score function is

∇θ log πθ(s, a) = φ(s, a) − E_πθ[φ(s, ·)]
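A small sketch of this score function for a linear softmax policy; representing the state's features as one row per action in `phi_s` is an assumed convention. This function could be passed as the `grad_log_pi` callable in the earlier likelihood-ratio sketch.

```python
import numpy as np

def softmax_score(theta, phi_s, a):
    """Score function of a softmax policy: grad_theta log pi(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')].

    phi_s: array of shape (num_actions, num_features); row a' holds phi(s, a')."""
    logits = phi_s @ theta
    logits = logits - logits.max()            # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()
    return phi_s[a] - pi @ phi_s              # phi(s,a) minus expected feature under pi
```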


Gaussian Policy

In continuous action spaces, a Gaussian policy is natural

Mean is a linear combination of state features: µ(s) = φ(s)^T θ

Variance may be fixed at σ², or can also be parametrised

Policy is Gaussian, a ∼ N(µ(s), σ²)

The score function is

∇θ log πθ(s, a) = (a − µ(s)) φ(s) / σ²
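A corresponding sketch for the Gaussian policy, assuming a fixed σ and the linear mean µ(s) = φ(s)^T θ above; the feature vector `phi_s` is an assumed representation.

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma=1.0):
    """Score function of a Gaussian policy a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta:
    grad_theta log pi(a|s) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta                        # phi_s: feature vector phi(s)
    return (a - mu) * phi_s / sigma**2

def sample_gaussian_action(theta, phi_s, sigma=1.0):
    """Sample a continuous action from the Gaussian policy."""
    return np.random.normal(loc=phi_s @ theta, scale=sigma)
```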


Likelihood Ratio / Score Function Policy Gradient

∇θV(θ) ≈ (1/m) ∑_{i=1}^{m} R(τ^(i)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t^(i) | s_t^(i))

Unbiased but very noisy

Fixes that can make it practical

Temporal structure
Baseline

Next time will discuss some additional tricks


Policy Gradient: Introduce Baseline

Reduce variance by introducing a baseline b(s)

∇θE_τ[R] = E_τ[ ∑_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) ( ∑_{t'=t}^{T−1} r_{t'} − b(s_t) ) ]

For any choice of b(s), the gradient estimator is unbiased.

Near-optimal choice is the expected return, b(s_t) ≈ E[r_t + r_{t+1} + · · · + r_{T−1}]

Interpretation: increase the log-prob of action a_t proportionally to how much the returns ∑_{t'=t}^{T−1} r_{t'} are better than expected
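A small sketch of the baselined weights above: returns-to-go minus a state-dependent baseline. The linear baseline b(s) = w^T φ(s) and the `featurize` helper are illustrative assumptions.

```python
import numpy as np

def advantages_with_baseline(rewards, states, w, featurize):
    """A_t = (sum_{t' >= t} r_{t'}) - b(s_t), with an assumed linear baseline b(s) = w^T phi(s)."""
    G, advantages = 0.0, []
    for t in reversed(range(len(rewards))):
        G += rewards[t]                              # undiscounted return-to-go, as in the slide
        advantages.append(G - w @ featurize(states[t]))
    return advantages[::-1]                          # back into time order
```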


Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇θ log π(a_t | s_t, θ) b(s_t)]

= E_{s_{0:t}, a_{0:(t−1)}}[ E_{s_{(t+1):T}, a_{t:(T−1)}}[ ∇θ log π(a_t | s_t, θ) b(s_t) ] ]


Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇θ log π(a_t | s_t, θ) b(s_t)]

= E_{s_{0:t}, a_{0:(t−1)}}[ E_{s_{(t+1):T}, a_{t:(T−1)}}[ ∇θ log π(a_t | s_t, θ) b(s_t) ] ]    (break up expectation)

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{s_{(t+1):T}, a_{t:(T−1)}}[ ∇θ log π(a_t | s_t, θ) ] ]    (pull baseline term out)

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{a_t}[ ∇θ log π(a_t | s_t, θ) ] ]    (remove irrelevant variables)

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) ∑_a πθ(a | s_t) ( ∇θπ(a | s_t, θ) / πθ(a | s_t) ) ]    (likelihood ratio)

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) ∑_a ∇θπ(a | s_t, θ) ]

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) ∇θ ∑_a π(a | s_t, θ) ]

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) ∇θ 1 ]

= E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) · 0 ] = 0


”Vanilla” Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, · · · do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R_t = ∑_{t'=t}^{T−1} r_{t'}, and
        the advantage estimate A_t = R_t − b(s_t).
    Re-fit the baseline, by minimizing ||b(s_t) − R_t||², summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate g, which is a sum of terms ∇θ log π(a_t | s_t, θ) A_t.
    (Plug g into SGD or ADAM)
endfor
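A compact sketch of this loop, assuming hypothetical helpers `collect_trajectories(theta)`, `featurize(s)`, and a score function `grad_log_pi(theta, s, a)` such as the ones sketched earlier; the least-squares fit is one simple way to minimize ||b(s_t) − R_t||² for a linear baseline.

```python
import numpy as np

def vanilla_policy_gradient(theta, w, collect_trajectories, featurize, grad_log_pi,
                            alpha=0.01, num_iterations=100):
    """'Vanilla' policy gradient with returns-to-go and a linear baseline b(s) = w^T phi(s)."""
    for _ in range(num_iterations):
        trajs = collect_trajectories(theta)                  # list of [(s, a, r), ...] episodes
        Phi, R, g = [], [], np.zeros_like(theta)
        for traj in trajs:
            rewards = [r for _, _, r in traj]
            G = np.cumsum(rewards[::-1])[::-1]               # R_t = sum_{t'>=t} r_{t'}
            for t, (s, a, _) in enumerate(traj):
                A_t = G[t] - w @ featurize(s)                # advantage estimate
                g += grad_log_pi(theta, s, a) * A_t
                Phi.append(featurize(s)); R.append(G[t])
        theta = theta + alpha * g / len(trajs)               # policy gradient ascent step
        w, *_ = np.linalg.lstsq(np.array(Phi), np.array(R), rcond=None)  # re-fit the baseline
    return theta, w
```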


Practical Implementation with Autodiff

The usual formula ∑_t ∇θ log π(a_t | s_t; θ) A_t is inefficient; we want to batch data

Define a "surrogate" function using data from the current batch

L(θ) = ∑_t log π(a_t | s_t; θ) A_t

Then policy gradient estimator g = ∇θL(θ)

Can also include value function fit error

L(θ) = ∑_t ( log π(a_t | s_t; θ) A_t − ||V(s_t) − R_t||² )
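A sketch of the surrogate-loss trick using PyTorch as an illustrative autodiff library: build L(θ) from a batch of logged (s_t, a_t, A_t) and let backpropagation produce the policy gradient estimate. The network shape and optimizer below are assumptions, not the slide's specification.

```python
import torch

def surrogate_loss(policy_net, states, actions, advantages):
    """L(theta) = sum_t log pi(a_t | s_t; theta) * A_t  (advantages treated as constants).
    policy_net maps a batch of states to action logits; all tensors are assumed batched."""
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return (log_probs * advantages.detach()).sum()

# One update: maximize L(theta) by ascending its gradient (here via Adam on -L).
# policy_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(),
#                                  torch.nn.Linear(64, num_actions))
# optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
# loss = -surrogate_loss(policy_net, states, actions, advantages)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```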


Value Functions

Recall Q-function / state-action-value function:

Q^{π,γ}(s, a) = E_π[ r_0 + γr_1 + γ²r_2 + · · · | s_0 = s, a_0 = a ]

State-value function can serve as a great baseline

V^{π,γ}(s) = E_π[ r_0 + γr_1 + γ²r_2 + · · · | s_0 = s ] = E_{a∼π}[ Q^{π,γ}(s, a) ]

Advantage function: Combining Q with baseline V

A^{π,γ}(s, a) = Q^{π,γ}(s, a) − V^{π,γ}(s)


N-step estimators

Can also consider blending between TD and MC estimators for the target to substitute for the true state-action value function.

R_t^(1) = r_t + γV(s_{t+1})
R_t^(2) = r_t + γr_{t+1} + γ²V(s_{t+2})
· · ·
R_t^(∞) = r_t + γr_{t+1} + γ²r_{t+2} + · · ·

If we subtract baselines from the above, we get advantage estimators

A_t^(1) = r_t + γV(s_{t+1}) − V(s_t)
A_t^(2) = r_t + γr_{t+1} + γ²V(s_{t+2}) − V(s_t)
· · ·
A_t^(∞) = r_t + γr_{t+1} + γ²r_{t+2} + · · · − V(s_t)

A_t^(1) has low variance & high bias. A_t^(∞) has high variance but low bias. (Why? Which model-free policy evaluation techniques are these like?)

Using an intermediate k (say, 20) can give an intermediate amount of bias and variance.
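A sketch of the k-step advantage estimator under these definitions; rewards and bootstrapped value estimates are passed in as arrays, and V(s_T) = 0 is assumed for terminal states.

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """A^(k)_t = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k*V(s_{t+k}) - V(s_t).

    rewards: r_0 .. r_{T-1};  values: V(s_0) .. V(s_T), with V(s_T) = 0 assumed for terminal states."""
    end = min(t + k, len(rewards))                    # truncate at the end of the episode
    adv = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    adv += gamma ** (end - t) * values[end]           # bootstrap with the value estimate
    return adv - values[t]

# k = 1: low variance, high bias (TD-like);  large k: high variance, low bias (MC-like).
```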


Application: Robot Locomotion


Class Structure

Last time: Imitation Learning

This time: Policy Search

Next time: Policy Search Cont.
