Policy Gradient in Continuous Time
[Page 1]
Policy Gradient in Continuous Time
Presented by Hui Li
Duke University Machine Learning Group
May 30, 2007
Paper by Rémi Munos, JMLR 2006
[Page 2]
Outline
• Introduction
• Discretized Stochastic Processes Approximation
• Model-free Reinforcement Learning (RL) algorithm
• Example Results
[Page 3]

Introduction of the Problem
• Consider an optimal control problem with continuous state $x_t$ and control $u_t$.
System dynamics: $\dfrac{dx_t}{dt} = f(x_t, u_t)$
• Objective: find an optimal control $(u_t)$ that maximizes the functional
$J(x_0; (u_t)_{t \ge 0}) = r(x_T)$
where $r$ is the terminal reward received at time $T$.
• The process is deterministic and the state is continuous.
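To make the setup concrete, here is a minimal sketch of evaluating the objective by Euler integration. The dynamics `f`, the reward `r`, and all constants below are toy placeholders assumed for illustration, not the paper's benchmark.

```python
import numpy as np

def f(x, u):
    """Toy dynamics dx/dt = f(x, u); a placeholder, not the paper's system."""
    return u - 0.1 * x

def r(x):
    """Toy terminal reward: negative squared distance to the point (1, ..., 1)."""
    return -np.sum((x - 1.0) ** 2)

def rollout(policy, x0, T=1.0, dt=0.01):
    """Euler-integrate dx/dt = f(x, u_t) with u_t = policy(t, x); return x_T and J = r(x_T)."""
    x = np.array(x0, dtype=float)
    for n in range(int(T / dt)):
        x = x + dt * f(x, policy(n * dt, x))
    return x, r(x)

x_T, J = rollout(lambda t, x: np.ones_like(x), np.zeros(2))
```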
[Page 4]

Introduction of the Problem
• Consider a class of parameterized policies $u_t = \pi_\theta(t, x_t)$ with parameter $\theta$.
• Find the parameter $\theta$ that maximizes the performance measure
$V(\theta) = J(x_0; \pi_\theta(t, x_t)_{t \ge 0})$.
• The standard approach is gradient ascent: $\theta \leftarrow \theta + \eta\, \nabla_\theta V(\theta)$. Computing $\nabla_\theta V(\theta)$ is the object of the paper.
[Page 5]
Introduction of the Problem
How to compute $\nabla_\theta V(\theta)$?
• Finite-difference method:
$\nabla_{\theta_i} V(\theta) \approx \dfrac{V(\theta + \varepsilon e_i) - V(\theta)}{\varepsilon}$
This method requires a large number of trajectories to compute the gradient of the performance measure (a sketch follows the list).
• Pathwise estimation of the gradient
Compute the gradient using one trajectory only
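A sketch of the finite-difference estimator: `V` is any function that runs a rollout and returns the objective (e.g. wrapping the `rollout` above with a hypothetical `make_policy(theta)`); the forward-difference form and the step `eps` are illustrative choices.

```python
import numpy as np

def fd_gradient(V, theta, eps=1e-4):
    """Estimate grad V(theta) one coordinate at a time: m parameters cost
    m + 1 rollouts, which is why the method needs many trajectories."""
    g = np.zeros_like(theta)
    V0 = V(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (V(theta + e) - V0) / eps   # forward difference
    return g
```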
[Page 6]
Introduction of the Problem
Pathwise estimation of the gradient
Define $z_t = \nabla_\theta x_t$.
Dynamics of $z_t$:
$\dfrac{dz_t}{dt} = \nabla_x f(x_t)\, z_t + \nabla_\theta f(x_t)$
Gradient:
$\nabla_\theta V(\theta) = \nabla_x r(x_T)\, \nabla_\theta x_T = \nabla_x r(x_T)\, z_T$
where $\nabla_x r(x_T)$ is known (the reward function is given) but $z_T$ is unknown.
• In reinforcement learning, $f(x_t)$ is unknown. How can we approximate $z_t$?
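Before turning to that question, here is what the pathwise computation looks like when the model and its Jacobians are known: integrate $x_t$ and $z_t$ jointly and read the gradient off at time $T$. A minimal sketch, assuming a toy closed-loop system $F(x; \theta) = \theta - 0.1x$ whose Jacobians are available in closed form:

```python
import numpy as np

def pathwise_grad(theta, x0, T=1.0, dt=0.01, goal=1.0):
    """Integrate x_t and z_t = dx_t/dtheta together; grad V = grad_x r(x_T) @ z_T."""
    d, m = x0.size, theta.size
    x = np.array(x0, dtype=float)
    z = np.zeros((d, m))                  # z_0 = 0: x_0 does not depend on theta
    for n in range(int(T / dt)):
        Fx = -0.1 * np.eye(d)             # grad_x F for the toy system
        Fth = np.eye(d)                   # grad_theta F (here d == m)
        z = z + dt * (Fx @ z + Fth)       # dz/dt = grad_x F z + grad_theta F
        x = x + dt * (theta - 0.1 * x)    # dx/dt = F(x; theta)
    grad_r = -2.0 * (x - goal)            # r(x) = -||x - goal||^2
    return grad_r @ z                     # a single trajectory suffices

g = pathwise_grad(np.array([0.5, 0.5]), np.zeros(2))
```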
[Page 7]

Discretized Stochastic Processes Approximation
• A general convergence result (Theorem 3): if $\dfrac{dx_t}{dt} = f(x_t)$, and a stochastic discrete process $(X_n)$ with $X_0 = x_0$ has average jump $E[X_{n+1} - X_n \mid X_n] = \Delta t\, f(X_n) + o(\Delta t)$ (with jumps of size $O(\Delta t)$), then $X_n$ converges in probability to $x_{t_n}$ as $\Delta t \to 0$.
[Page 8]

• Discretization of the state
Use a stochastic policy: at each discrete time $t_n = n\,\Delta t$, draw $u_n \sim \pi_\theta(\cdot \mid t_n, X_n)$.
Stochastic discrete state process $(X_n)_{0 \le n \le N}$:
Initialization: $X_0 = x_0$
Jump in state: $X_{n+1} = X_n + \Delta t\, f(X_n, u_n)$
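A sketch of this discretized process, assuming a toy `f` and a caller-supplied `policy_probs(t, x)` returning probabilities over a finite action set (here the four-action set used later in the experiments):

```python
import numpy as np

U = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([-1.0, 0.0]), np.array([0.0, -1.0])]

def f(x, u):
    """Toy dynamics, assumed for illustration."""
    return u - 0.1 * x

def discrete_state_process(policy_probs, x0, N=100, dt=0.01, seed=0):
    """X_0 = x_0; at each step draw u_n ~ pi(. | t_n, X_n) and apply the Euler jump."""
    rng = np.random.default_rng(seed)
    X = np.array(x0, dtype=float)
    for n in range(N):
        probs = policy_probs(n * dt, X)
        u = U[rng.choice(len(U), p=probs)]
        X = X + dt * f(X, u)              # X_{n+1} = X_n + dt f(X_n, u_n)
    return X

X_N = discrete_state_process(lambda t, x: np.full(4, 0.25), np.zeros(2))
```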
[Page 9]

Proof of Proposition 5:
From Taylor's formula,
$X_{n+1} = X_n + \Delta t\, f(X_n, u_n) + O(\Delta t^2)$.
The average jump:
$E[X_{n+1} - X_n \mid X_n = x] = \Delta t \sum_{u \in U} \pi_\theta(u \mid t, x)\, f(x, u) + O(\Delta t^2) = \Delta t\, \bar f(t, x) + o(\Delta t)$,
where $\bar f(t, x) := \sum_{u \in U} \pi_\theta(u \mid t, x)\, f(x, u)$ is the averaged dynamics.
Directly applying Theorem 3 proves Proposition 5.
[Page 10]

• Discretization of the state gradient
Stochastic discrete state gradient process $(Z_n)_{0 \le n \le N}$:
Initialization: $Z_0 = \nabla_\theta x_0 = 0$
Jump:
$Z_{n+1} = Z_n + \Delta t \left[ \nabla_x f(X_n, u_n)\, Z_n + f(X_n, u_n)\, \nabla_\theta \log \pi_\theta(u_n \mid t_n, X_n)^\top \right]$
with $u_n$ the action drawn for the state process $(X_n)$.
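Running the two processes together gives a single-trajectory gradient estimate. In this sketch `grad_x_f` and `grad_log_pi` are supplied as functions (assumptions for illustration); the model-free algorithm replaces `grad_x_f` with the least-squares estimate described on the later slides.

```python
import numpy as np

def state_and_gradient_process(policy_probs, grad_log_pi, grad_x_f, f, U,
                               x0, m, N=100, dt=0.01, seed=0):
    """Simulate (X_n, Z_n); grad V(theta) is then grad_x r(X_N) @ Z_N."""
    rng = np.random.default_rng(seed)
    X = np.array(x0, dtype=float)
    Z = np.zeros((X.size, m))                        # Z_0 = 0
    for n in range(N):
        t = n * dt
        k = rng.choice(len(U), p=policy_probs(t, X))
        fXu = f(X, U[k])                             # observable as the state jump / dt
        Z = Z + dt * (grad_x_f(X, U[k]) @ Z
                      + np.outer(fXu, grad_log_pi(t, X, k)))
        X = X + dt * fXu
    return X, Z
```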
[Page 11]

Proof of Proposition 6:
Since
$\nabla_\theta \bar f(t, x) = \sum_{u \in U} \nabla_\theta \pi_\theta(u \mid t, x)\, f(x, u) = E_{u \sim \pi_\theta}\!\left[ f(x, u)\, \nabla_\theta \log \pi_\theta(u \mid t, x) \right]$,
the average jump of $Z_n$ matches, up to $o(\Delta t)$, the dynamics of the gradient process $z_t$.
Directly applying Theorem 3 proves Proposition 6.
[Page 12]

Model-free Reinforcement Learning Algorithm
Let $\Delta X_n := X_{n+1} - X_n$, the observed state jump, so that $f(X_n, u_n) = \Delta X_n / \Delta t$ up to $O(\Delta t)$.
In this stochastic approximation, $\Delta X_n$ is observed and $\nabla_\theta \log \pi_\theta(u_n \mid t_n, X_n)$ is given (we chose the policy); we only need to approximate $\nabla_x f(X_n, u_n)$.
[Page 13]

Least-Squares Approximation of $\nabla_x f$
Define
$S(t) = \{ s \in [t - c,\, t] \mid u_s = u_t \}$,
the set of past discrete times $t - c \le s \le t$ at which the action $u_t$ was taken.
From Taylor's formula, for every discrete time $s \in S(t)$,
$\Delta X_s = \Delta t \left[ f(X_t, u_t) + \nabla_x f(X_t, u_t)\,(X_s - X_t) \right] + o(\Delta t)$.
We deduce
$\Delta X_s - \overline{\Delta X} \approx \Delta t\, \nabla_x f(X_t, u_t)\,(X_s - \overline{X})$,
[Page 14]

where $\overline{X}$ and $\overline{\Delta X}$ denote the average values of $X_s$ and $\Delta X_s$ over $s \in S(t)$.
We may derive an approximation of $\nabla_x f(X_t, u_t)$ by solving the least-squares problem
$\min_A\ \sum_{s \in S(t)} \left\| \Delta X_s - \overline{\Delta X} - A\,(X_s - \overline{X}) \right\|^2$,
and then taking $\nabla_x f(X_t, u_t) \approx A / \Delta t$.
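A sketch of this estimate with NumPy; stacking the samples of $S(t)$ into arrays is an assumed interface.

```python
import numpy as np

def estimate_grad_x_f(Xs, dXs, dt):
    """Xs: (k, d) states X_s for s in S(t); dXs: (k, d) observed jumps.
    Solves min_A sum_s || dX_s - dXbar - A (X_s - Xbar) ||^2, returns A / dt."""
    Xbar, dXbar = Xs.mean(axis=0), dXs.mean(axis=0)
    # Row form (Xs - Xbar) @ A.T ~= dXs - dXbar is ordinary linear least squares.
    A_T, *_ = np.linalg.lstsq(Xs - Xbar, dXs - dXbar, rcond=None)
    return A_T.T / dt                     # grad_x f ~= A / dt
```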
[Page 15]
Algorithm
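The algorithm box on this slide did not survive transcription, so the following is only a plausible assembly of the preceding pieces, not a verbatim copy. `rollout_grad` stands for a hypothetical function that runs one trajectory with the $X_n$/$Z_n$ updates, estimating $\nabla_x f$ by least squares along the way, and returns $\nabla_x r(X_N)\, Z_N$.

```python
import numpy as np

def train(rollout_grad, theta0, eta=0.1, iters=200):
    """Stochastic gradient ascent on V(theta), one trajectory per update."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        theta += eta * rollout_grad(theta)
    return theta
```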
[Page 16]
Experimental Results
Six continuous state variables:
x0, y0: hand position
x, y: mass position
vx, vy: mass velocity
Four control actions: U = {(1, 0), (0, 1), (−1, 0), (0, −1)}
Goal: reach a target (xG, yG) with the mass at a specified time T
Terminal reward function: rewards having the mass at the target at time T.
[Page 17]

The system dynamics relate the hand position (x0, y0), the mass position (x, y), and the mass velocity (vx, vy).
Consider a Boltzmann-like stochastic policy over the four actions.
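The slide's exact parameterization did not survive transcription; the sketch below assumes a generic softmax over scores $\theta_u \cdot \phi(x)$, with $\phi$ a hypothetical feature map, which also yields the $\nabla_\theta \log \pi$ term the algorithm needs.

```python
import numpy as np

def boltzmann_probs(theta, phi_x):
    """theta: (|U|, p) weights; phi_x: (p,) features. Returns pi(. | x)."""
    scores = theta @ phi_x
    scores -= scores.max()                 # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def grad_log_pi(theta, phi_x, k):
    """grad_theta log pi(u_k | x) = (1{u = u_k} - pi(u | x)) phi(x), row per action.
    Flatten the result if a vector-valued gradient is needed."""
    g = -np.outer(boltzmann_probs(theta, phi_x), phi_x)
    g[k] += phi_x
    return g
```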
[Page 18: example results (figures)]

[Page 19]
Conclusion
• Described a reinforcement learning method for approximating the gradient, with respect to the control parameters, of the performance measure of a continuous-time deterministic control problem
• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process