Transcript of slides: rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-9.pdf

Advanced Policy Gradients

CS 285: Deep Reinforcement Learning, Decision Making, and Control

Sergey Levine

Class Notes

1. Homework 2 due today (11:59 pm)!
   • Don't be late!

2. Homework 3 comes out this week
   • Start early! Q-learning takes a while to run.

Today’s Lecture

1. Why does policy gradient work?

2. Policy gradient is a type of policy iteration

3. Policy gradient as a constrained optimization

4. From constrained optimization to natural gradient

5. Natural gradients and trust regions

• Goals:
   • Understand the policy iteration view of policy gradient

• Understand how to analyze policy gradient improvement

• Understand what natural gradient does and how to use it

Recap: policy gradients

(figure: the generic RL loop: generate samples, i.e., run the policy; fit a model / estimate the return; improve the policy)

Annotations on the gradient estimator: the sum of future rewards is the "reward to go"; a learned value function (function approximation) can also be used in its place.
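In symbols, the estimator being recapped (a standard REINFORCE-style statement; \hat{Q}_{i,t} is the reward to go, and a learned critic can replace it):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}, \qquad \hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})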

Why does policy gradient work?

(figure: the same RL loop: generate samples, i.e., run the policy; fit a model / estimate the return; improve the policy)

look familiar?
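For comparison, policy iteration (from the value-based lectures) alternates two steps: (1) evaluate the advantage A^{\pi}(s_t, a_t) of the current policy; (2) improve the policy, e.g. set \pi'(a_t \mid s_t) = 1 for a_t = \arg\max_{a_t} A^{\pi}(s_t, a_t) and 0 otherwise. The "fit a model / estimate return" box plays the role of step 1 and the "improve the policy" box plays the role of step 2; policy gradient can be seen as a softer, gradient-based version of the improvement step.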

Policy gradient as policy iteration
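The claim this slide works toward (a standard result; A^{\pi_\theta} is the advantage of the current policy and p_{\theta'}(\tau) the trajectory distribution of the new policy) is that the improvement in the RL objective equals the expected advantage of the old policy under the new policy's trajectories:

J(\theta') - J(\theta) = E_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_\theta}(s_t, a_t) \right]

The proof writes J(\theta) = E_{s_0}[V^{\pi_\theta}(s_0)] (the initial state distribution does not depend on the policy) and telescopes the value function inside the expectation over the new policy's trajectories.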

Policy gradient as policy iteration: importance sampling
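Writing the expectation per time step and importance sampling the action (one way to spell out this step):

E_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_\theta}(s_t, a_t) \right] = \sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[ E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)}\left[ \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]
= \sum_t E_{s_t \sim p_{\theta'}(s_t)}\left[ E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]

The inner expectation is now over actions from the old policy, but the outer expectation still uses the new policy's state marginals p_{\theta'}(s_t), which we cannot sample from.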

Ignoring distribution mismatch??

why do we want this to be true?

is it true? and when?
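Concretely, define the surrogate obtained by swapping p_{\theta'}(s_t) for p_\theta(s_t):

\bar{A}(\theta') = \sum_t E_{s_t \sim p_\theta(s_t)}\left[ E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]

We want J(\theta') - J(\theta) \approx \bar{A}(\theta') because \bar{A} depends on \theta' only through the importance weight: everything else can be estimated from samples of the current policy, so we can maximize it with respect to \theta'. The next slides argue that this approximation is valid whenever \pi_{\theta'} stays close to \pi_\theta, so that p_{\theta'}(s_t) \approx p_\theta(s_t).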

Bounding the distribution change

seem familiar?

not a great bound, but a bound!
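The "familiar" argument is the same one used for behavioral cloning/DAgger: if \pi_\theta is (near-)deterministic and \pi_{\theta'} picks a different action with probability at most \epsilon at each step, then

p_{\theta'}(s_t) = (1-\epsilon)^t p_\theta(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t)

so the state-marginal difference satisfies

\sum_{s_t} \left| p_{\theta'}(s_t) - p_\theta(s_t) \right| \le 2\left(1 - (1-\epsilon)^t\right) \le 2\epsilon t

which grows linearly in t: not a great bound, but a bound.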

Bounding the distribution change

Proof based on: Schulman, Levine, Moritz, Jordan, Abbeel. “Trust Region Policy Optimization.”
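The TRPO-style analysis extends this to arbitrary (stochastic) \pi_\theta: if the total variation divergence satisfies |\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)| \le \epsilon for all s_t, a useful lemma says there is a coupling under which the two policies choose the same action with probability at least 1 - \epsilon, so the same 2\epsilon t bound on the state-marginal difference goes through without assuming a deterministic policy.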

Bounding the objective value
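The state-marginal bound turns into a bound on expectations: for any bounded f,

E_{s_t \sim p_{\theta'}(s_t)}[f(s_t)] \ge E_{s_t \sim p_\theta(s_t)}[f(s_t)] - 2\epsilon t \max_{s_t} |f(s_t)|

Applying this with f(s_t) = E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)}[\gamma^t A^{\pi_\theta}(s_t, a_t)], whose magnitude is bounded by a constant C in O(T r_{\max}) (or O(r_{\max}/(1-\gamma)) with discounting), gives roughly

J(\theta') - J(\theta) \ge \bar{A}(\theta') - \sum_t 2\epsilon t \, C

The error term vanishes as \epsilon \to 0, so for policies that are close enough, maximizing the surrogate \bar{A} really does improve J.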

Where are we at so far?
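Summarizing, the update we actually want to perform is the constrained problem

\theta' \leftarrow \arg\max_{\theta'} \; \bar{A}(\theta') \quad \text{s.t.} \quad |\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)| \le \epsilon \;\; \text{for all } s_t

which, for small enough \epsilon, is guaranteed to improve J(\theta') - J(\theta).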

Break

A more convenient bound

KL divergence has some very convenient properties that make it much easier to approximate!
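In particular, total variation is bounded by the KL divergence (a Pinsker-style inequality):

|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)| \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}\!\left(\pi_{\theta'}(a_t \mid s_t) \,\|\, \pi_\theta(a_t \mid s_t)\right)}, \qquad D_{\mathrm{KL}}(p \,\|\, q) = E_{x \sim p}\left[\log \tfrac{p(x)}{q(x)}\right]

so it is enough to enforce D_{\mathrm{KL}}(\pi_{\theta'} \| \pi_\theta) \le \epsilon. Unlike total variation, the KL between parametric policies has an analytic form (e.g. between Gaussians) or can be estimated and differentiated from samples.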

How do we optimize the objective?

How do we enforce the constraint?

can do this incompletely (for a few grad steps)
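One option (the dual/penalty approach sketched here) is to work with the Lagrangian

\mathcal{L}(\theta', \lambda) = \bar{A}(\theta') - \lambda \left( D_{\mathrm{KL}}\!\left(\pi_{\theta'}(a_t \mid s_t) \,\|\, \pi_\theta(a_t \mid s_t)\right) - \epsilon \right)

and alternate: (1) maximize \mathcal{L} with respect to \theta', which can be done incompletely with a few gradient steps; (2) update \lambda \leftarrow \lambda + \alpha (D_{\mathrm{KL}} - \epsilon). Intuitively, raise the penalty when the constraint is violated and lower it otherwise; this is an instance of dual gradient descent.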

How (else) do we optimize the objective?

Use first order Taylor approximation for objective (a.k.a., linearization)
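With the linearization, the update becomes

\theta' \leftarrow \arg\max_{\theta'} \; \nabla_\theta \bar{A}(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}\!\left(\pi_{\theta'} \,\|\, \pi_\theta\right) \le \epsilon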

How do we optimize the objective?

(see policy gradient lecture for derivation)

exactly the normal policy gradient!
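Evaluating the gradient of the surrogate at \theta' = \theta and using \nabla \pi / \pi = \nabla \log \pi (see the policy gradient lecture):

\nabla_{\theta'} \bar{A}(\theta') \big|_{\theta' = \theta} = \sum_t E_{s_t \sim p_\theta(s_t)}\left[ E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A^{\pi_\theta}(s_t, a_t) \right] \right] = \nabla_\theta J(\theta)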

Can we just use the gradient then?
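Plain gradient ascent, \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), can itself be viewed as solving a constrained problem, just with a Euclidean (parameter-space) constraint:

\theta' \leftarrow \arg\max_{\theta'} \; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad \|\theta' - \theta\|^2 \le \epsilon, \qquad \text{giving} \;\; \alpha = \sqrt{\frac{\epsilon}{\|\nabla_\theta J(\theta)\|^2}}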

Can we just use the gradient then?

not the same!

second order Taylor expansion
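The Euclidean ball is not the same constraint as the KL ball: how much the action distribution changes per unit of parameter change depends on the parameterization. Expanding the KL divergence to second order around \theta' = \theta gives

D_{\mathrm{KL}}\!\left(\pi_{\theta'} \,\|\, \pi_\theta\right) \approx \tfrac{1}{2} (\theta' - \theta)^\top F (\theta' - \theta), \qquad F = E_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^\top \right]

where F is the Fisher information matrix, which can be estimated with samples from \pi_\theta.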

Can we just use the gradient then?

natural gradient
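Solving the linearized objective under this quadratic KL constraint gives the natural gradient update:

\theta' = \theta + \alpha F^{-1} \nabla_\theta J(\theta), \qquad \alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^\top F^{-1} \nabla_\theta J(\theta)}} \;\; \text{if we want to meet the constraint exactly}

Natural policy gradient methods typically just pick a step size \alpha; trust region methods (TRPO) instead pick \epsilon and derive \alpha from it.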

Is this even a problem in practice?

(image from Peters & Schaal 2008)

Essentially the same problem as this:

(figure from Peters & Schaal 2008)
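Roughly, the example in these figures is a simple linear-Gaussian policy (e.g. \pi_\theta(a \mid s) = \mathcal{N}(ks, \sigma^2) with parameters k and \sigma): the vanilla gradient's component along \sigma dominates, so the exploration noise collapses while progress on k becomes extremely slow, whereas the natural gradient rescales the two directions and points toward the optimum.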

Practical methods and notes

• Natural policy gradient
   • Generally a good choice to stabilize policy gradient training
   • See this paper for details: Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
   • Practical implementation: requires efficient Fisher-vector products; a bit non-trivial to do without computing the full matrix (see the sketch after this list). See: Schulman et al. Trust region policy optimization.

• Trust region policy optimization

• Just use the IS objective directly
   • Use regularization to stay close to the old policy
   • See: Proximal policy optimization
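A minimal NumPy sketch of the truncated natural gradient step, assuming a Fisher-vector product function fvp(v) = F v is available (in practice it is computed by differentiating the KL divergence, as in TRPO, rather than by forming F). The function names, the toy 2-parameter example, and the value of epsilon are illustrative, not the course's reference implementation.

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = r.copy()          # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def natural_gradient_step(theta, grad, fvp, epsilon=0.01):
    """One natural gradient step under the constraint
    (1/2) (theta' - theta)^T F (theta' - theta) <= epsilon."""
    direction = conjugate_gradient(fvp, grad)              # approximately F^{-1} grad
    step_size = np.sqrt(2.0 * epsilon / (grad @ direction + 1e-8))
    return theta + step_size * direction

# Toy check: with an explicit (assumed) Fisher matrix, fvp is just a matrix-vector product.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
fvp = lambda v: F @ v
theta = np.zeros(2)
grad = np.array([1.0, -0.5])
theta_new = natural_gradient_step(theta, grad, fvp, epsilon=0.01)

The point of the conjugate gradient solver is exactly the note above: it only ever touches F through matrix-vector products, so the full Fisher matrix is never formed or inverted.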

(figure: the RL loop once more: generate samples, i.e., run the policy; fit a model / estimate the return; improve the policy)

Review

• Policy gradient = policy iteration
   • Optimize the advantage under the new policy's state distribution
   • Using the old policy's state distribution optimizes a bound, if the policies are close enough

• Results in a constrained optimization problem
   • First order approximation to the objective = gradient ascent
   • Regular gradient ascent has the wrong constraint; use the natural gradient instead

• Practical algorithms
   • Natural policy gradient
   • Trust region policy optimization