Policy Gradient in Continuous Time


Presented by Hui Li

Duke University Machine Learning Group

May 30, 2007

by Remi Munos, JMLR 2006

Outline

• Introduction

• Discretized Stochastic Processes Approximation

• Model-free Reinforcement Learning (RL) algorithm

• Example Results

Introduction of the Problem

• Consider an optimal control problem with a continuous state $x_t$ (deterministic process) and a control $u_t$

System dynamics: $\frac{dx_t}{dt} = f(x_t, u_t)$

• Objective: find an optimal control $(u_t)$ that maximizes the functional

Objective function: $J(x_0; (u_t)_{t \ge 0}) = r(x_T)$ (terminal reward at the fixed horizon $T$)

• Consider a class of policies parameterized by $\alpha$, with $u_t = \pi_\alpha(t, x_t)$

• Find the parameter $\alpha$ that maximizes the performance measure

$V(\alpha) = J(x_0; \pi_\alpha(t, x_t))$

• The standard approach is gradient ascent on $V$, i.e. repeatedly moving $\alpha$ in the direction of $\nabla_\alpha V(\alpha)$; computing this gradient is the object of the paper

Introduction of the Problem

How to compute $\nabla_\alpha V(\alpha)$?

• Finite-difference method

$\frac{\partial V}{\partial \alpha_i}(\alpha) \simeq \frac{V(\alpha + \epsilon e_i) - V(\alpha)}{\epsilon}$

This method requires a large number of trajectories (at least one per parameter component) to compute the gradient of the performance measure.

• Pathwise estimation of the gradient

Compute the gradient using one trajectory only.
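As an illustration of the finite-difference approach, a minimal Python sketch, assuming a user-supplied rollout(alpha) that simulates one trajectory under the policy with parameter alpha and returns the terminal reward r(x_T); the function name and the perturbation size eps are illustrative.

```python
import numpy as np

def finite_difference_gradient(rollout, alpha, eps=1e-2):
    """Estimate grad V(alpha) by perturbing each parameter component.

    rollout(alpha) -> terminal reward r(x_T) of one trajectory under the
    policy parameterized by alpha (assumed deterministic here).
    Needs len(alpha) + 1 rollouts per gradient estimate, which is why the
    finite-difference method requires many trajectories.
    """
    alpha = np.asarray(alpha, dtype=float)
    v0 = rollout(alpha)                        # V(alpha)
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        e_i = np.zeros_like(alpha)
        e_i[i] = eps
        grad[i] = (rollout(alpha + e_i) - v0) / eps   # forward difference
    return grad
```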

Introduction of the Problem

Pathwise estimation of the gradient

Define $z_t = \nabla_\alpha x_t$, the sensitivity of the state with respect to the policy parameter.

Dynamics of $z_t$ (writing $f(x_t)$ for the closed-loop dynamics $f(x_t, \pi_\alpha(t, x_t))$):

$\frac{dz_t}{dt} = \nabla_\alpha f(x_t) + \nabla_x f(x_t)\, z_t$

where the dependence on $\alpha$ (which enters through the parameterized policy) is known, while $\nabla_x f(x_t)$ is unknown.

Gradient:

$\nabla_\alpha V(\alpha) = \nabla_\alpha r(x_T) = \nabla_x r(x_T)\, z_T$

• In reinforcement learning the dynamics $f$ is unknown. How can $z_t$ be approximated?
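When the model $f$ and its derivatives are available, this pathwise gradient can be computed along a single trajectory by integrating $x_t$ and $z_t$ together. A minimal Euler-scheme sketch in Python, assuming user-supplied callables f, grad_x_f, grad_alpha_f and grad_x_r (all names illustrative):

```python
import numpy as np

def pathwise_gradient(x0, alpha, f, grad_x_f, grad_alpha_f, grad_x_r,
                      T=1.0, dt=1e-3):
    """Integrate x_t and its sensitivity z_t = d x_t / d alpha jointly
    (Euler scheme), then return grad V = grad_x r(x_T) @ z_T.

    f(t, x, alpha)            -> dx/dt                (shape d)
    grad_x_f(t, x, alpha)     -> Jacobian d f / d x   (shape d x d)
    grad_alpha_f(t, x, alpha) -> d f / d alpha        (shape d x p)
    grad_x_r(x)               -> d r / d x            (shape d)
    """
    x = np.array(x0, dtype=float)
    z = np.zeros((x.size, np.size(alpha)))   # z_0 = 0: x_0 does not depend on alpha
    n_steps = int(round(T / dt))
    for n in range(n_steps):
        t = n * dt
        dz = grad_alpha_f(t, x, alpha) + grad_x_f(t, x, alpha) @ z
        x = x + dt * np.asarray(f(t, x, alpha))
        z = z + dt * dz
    return grad_x_r(x) @ z                   # grad_alpha V(alpha) = grad_x r(x_T) z_T
```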

Discretized Stochastic Processes Approximation

• A General Convergence Result (Theorem 3)

If $\frac{dx_t}{dt} = f(x_t)$, and a discrete stochastic process makes jumps whose average matches $f$ up to $o(\delta t)$ terms (with jump sizes vanishing as the time step $\delta t \to 0$), then the discrete process converges to the continuous trajectory $(x_t)$ as $\delta t \to 0$.

• Discretization of the state

Use a stochastic policy $\pi_\alpha(u \mid t, x)$ over a finite action set $U$.

Stochastic discrete state process $(X_{t_n})_{0 \le n \le N}$ with time step $\delta t$

Initialization: $X_0 = x_0$

Jump in state: at each step, draw an action $u_{t_n} \sim \pi_\alpha(\cdot \mid t_n, X_{t_n})$ and set $X_{t_{n+1}} = X_{t_n} + \delta t\, f(X_{t_n}, u_{t_n})$
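A minimal Python sketch of this discretization, assuming a finite action set, a stochastic policy pi(t, x) that returns action probabilities, and a black-box dynamics f(x, u) (names illustrative). Each step samples an action and jumps by dt * f(x, u), so the expected jump per step is dt times the policy-averaged drift, which is the property the convergence argument relies on.

```python
import numpy as np

def simulate_discrete_process(x0, actions, pi, f, T=1.0, dt=1e-2, rng=None):
    """Stochastic discrete state process X_{t_n}:
    X_0 = x0, and at each step an action u is drawn from the stochastic
    policy pi(t, x) (a probability vector over `actions`), giving the jump
    X_{t_{n+1}} = X_{t_n} + dt * f(X_{t_n}, u).
    """
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    n_steps = int(round(T / dt))
    for n in range(n_steps):
        t = n * dt
        probs = pi(t, x)                          # action probabilities at (t, x)
        u = actions[rng.choice(len(actions), p=probs)]
        x = x + dt * np.asarray(f(x, u))          # jump in state
        traj.append(x.copy())
    return np.array(traj)
```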

Proof of Proposition 5:

From Taylor's formula, $f(x_{t + \delta t}, u) = f(x_t, u) + O(\delta t)$, so each jump is $\delta t\, f(X_{t_n}, u_{t_n}) + o(\delta t)$.

The average jump:

$E[f(x_t, u)] = \sum_{u \in U} \pi_\alpha(u \mid t, x_t)\, f(x_t, u) = f(x_t) + o(\delta t)$

(with $f(x_t)$ denoting the policy-averaged drift), so the expected increment per step is $\delta t\, f(x_t) + o(\delta t)$.

Directly applying Theorem 3 proves Proposition 5.

• Discretization of the state gradient

Stochastic discrete state gradient process $(Z_{t_n})_{0 \le n \le N}$

Initialization: $Z_0 = 0$ (since $x_0$ does not depend on $\alpha$)

with the jump in $Z_{t_n}$ chosen so that its average matches the dynamics of $z_t$.

Proof of Proposition 6:

Since the average jump of $(Z_{t_n})$ matches the dynamics of $z_t$ up to $o(\delta t)$ terms, directly applying Theorem 3 proves Proposition 6.

Model-free Reinforcement Learning Algorithm

In this stochastic approximation, the state $X_{t_n}$ is observed and the policy $\pi_\alpha$ (hence its gradient with respect to $\alpha$) is given; we only need to approximate $\nabla_x f$.

Least-Squares Approximation of $\nabla_x f$

Define

$S(t) = \{ s \in [t - c, t] \mid u_s = u_t \}$

the set of past discrete times in the window $[t - c, t]$ at which the current action $u_t$ was taken.

From Taylor's formula, for every discrete time $s \in S(t)$ the observed increment satisfies

$X_{s + \delta t} - X_s = \delta t\, f(X_s, u_t) + o(\delta t)$, with $f(X_s, u_t) = f(X_t, u_t) + \nabla_x f(X_t, u_t)\,(X_s - X_t) + o(\|X_s - X_t\|)$.

We deduce a linear relation between the observed increments and the state differences $X_s - X_t$, in which $\nabla_x f(X_t, u_t)$ is the unknown coefficient matrix.

We may therefore derive an approximation of $\nabla_x f(X_t, u_t)$ by solving the corresponding least-squares problem over the times $s \in S(t)$, using the average value of the observed increments and state differences over the window.
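A sketch of this least-squares estimation in Python, as an illustrative reconstruction of the mechanism rather than the paper's exact formulation: over a window of states observed while the same action was applied, regress the increments (X_{s+1} - X_s)/δt on the differences X_s - X_t; the intercept then estimates f(X_t, u) and the slope matrix estimates ∇_x f(X_t, u). Function and variable names are assumptions.

```python
import numpy as np

def estimate_dynamics(X_window, x_t, dt):
    """Least-squares estimate of f(x_t, u) and grad_x f(x_t, u) from a
    window of consecutive states X_window = [X_s, ...] observed while the
    same action u was applied (the set S(t) above).

    Model: (X_{s+1} - X_s) / dt  ~=  f(x_t, u) + grad_x f(x_t, u) (X_s - X_t)
    """
    X_window = np.asarray(X_window, dtype=float)
    increments = (X_window[1:] - X_window[:-1]) / dt      # observed drift at each X_s
    deltas = X_window[:-1] - np.asarray(x_t)              # regressors X_s - X_t
    # Design matrix with an intercept column: [1, (X_s - X_t)^T]
    A = np.hstack([np.ones((deltas.shape[0], 1)), deltas])
    coef, *_ = np.linalg.lstsq(A, increments, rcond=None)  # shape (1 + d, d)
    f_hat = coef[0]                                        # estimate of f(x_t, u)
    grad_x_f_hat = coef[1:].T                              # estimate of grad_x f(x_t, u)
    return f_hat, grad_x_f_hat
```

The window must contain enough distinct states (at least d + 1 in d dimensions) for the regression to be well posed; otherwise the least-squares solution is only a minimum-norm fit.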

Algorithm

Experimental Results

Six continuous state variables:

x0, y0: hand position

x, y: mass position

vx, vy: mass velocity

Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}

Goal: reach a target (xG, yG) with the mass at specific time T

Terminal reward function

The system dynamics:

Consider a Boltzmann-like stochastic policy over the four actions
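A Boltzmann-like policy can be sketched as a softmax over per-action scores; the sketch below is illustrative, and the scores would in practice be a parameterized function of (t, x) whose exact form the slides do not show.

```python
import numpy as np

# The four control actions of the hand-mass problem
ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def boltzmann_policy(scores, temperature=1.0):
    """Softmax (Boltzmann) distribution over the actions given their scores.

    scores: per-action preferences, e.g. a parameterized function of (t, x)
    evaluated for each u in ACTIONS (parameterization illustrative).
    Returns a probability vector summing to 1.
    """
    scores = np.asarray(scores, dtype=float) / temperature
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Example: sample one of the four actions
probs = boltzmann_policy([0.2, 1.0, -0.5, 0.1])
u = ACTIONS[np.random.default_rng(0).choice(len(ACTIONS), p=probs)]
```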

Conclusion

• Described a reinforcement learning method for approximating the gradient of the performance of a continuous-time deterministic control problem with respect to the policy parameters

• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process