Policy Gradient in Continuous Time


Presented by Hui Li

Duke University Machine Learning Group

May 30, 2007

by Remi Munos, JMLR 2006

Outline

• Introduction

• Discretized Stochastic Processes Approximation

• Model-free Reinforcement Learning (RL) algorithm

• Example Results

Introduction of the Problem

• Consider an optimal control problem with a continuous state $x_t$ (deterministic process) and a control $u_t$

System dynamics: $\frac{dx_t}{dt} = f(x_t, u_t)$

• Objective: find an optimal control $(u_t)$ that maximizes the functional

Objective function: $J(x_0; (u_t)_{t \ge 0}) = r(x_T)$ (terminal reward at the fixed horizon $T$)

• Consider a class of policies parameterized by $\alpha$, with $u_t = \pi_\alpha(t, x_t)$

• Find the parameter $\alpha$ that maximizes the performance measure

$V(\alpha) = J(x_0; \pi_\alpha(t, x_t))$

• The standard approach is gradient ascent on $V$, i.e. repeatedly moving $\alpha$ in the direction of $\nabla_\alpha V(\alpha)$; computing this gradient is the object of the paper

Introduction of the Problem

How to compute $\nabla_\alpha V(\alpha)$?

• Finite-difference method

$\frac{\partial V}{\partial \alpha_i}(\alpha) \simeq \frac{V(\alpha + \epsilon e_i) - V(\alpha)}{\epsilon}$

This method requires a large number of trajectories (at least one per parameter component) to compute the gradient of the performance measure.

• Pathwise estimation of the gradient

Compute the gradient using one trajectory only.
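As an illustration of the finite-difference approach, a minimal Python sketch, assuming a user-supplied rollout(alpha) that simulates one trajectory under the policy with parameter alpha and returns the terminal reward r(x_T); the function name and the perturbation size eps are illustrative.

```python
import numpy as np

def finite_difference_gradient(rollout, alpha, eps=1e-2):
    """Estimate grad V(alpha) by perturbing each parameter component.

    rollout(alpha) -> terminal reward r(x_T) of one trajectory under the
    policy parameterized by alpha (assumed deterministic here).
    Needs len(alpha) + 1 rollouts per gradient estimate, which is why the
    finite-difference method requires many trajectories.
    """
    alpha = np.asarray(alpha, dtype=float)
    v0 = rollout(alpha)                        # V(alpha)
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        e_i = np.zeros_like(alpha)
        e_i[i] = eps
        grad[i] = (rollout(alpha + e_i) - v0) / eps   # forward difference
    return grad
```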

Introduction of the Problem

Pathwise estimation of the gradient

Define $z_t = \nabla_\alpha x_t$, the sensitivity of the state with respect to the policy parameter.

Dynamics of $z_t$ (writing $f(x_t)$ for the closed-loop dynamics $f(x_t, \pi_\alpha(t, x_t))$):

$\frac{dz_t}{dt} = \nabla_\alpha f(x_t) + \nabla_x f(x_t)\, z_t$

where the dependence on $\alpha$ (which enters through the parameterized policy) is known, while $\nabla_x f(x_t)$ is unknown.

Gradient:

$\nabla_\alpha V(\alpha) = \nabla_\alpha r(x_T) = \nabla_x r(x_T)\, z_T$

• In reinforcement learning the dynamics $f$ is unknown. How can $z_t$ be approximated?
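When the model $f$ and its derivatives are available, this pathwise gradient can be computed along a single trajectory by integrating $x_t$ and $z_t$ together. A minimal Euler-scheme sketch in Python, assuming user-supplied callables f, grad_x_f, grad_alpha_f and grad_x_r (all names illustrative):

```python
import numpy as np

def pathwise_gradient(x0, alpha, f, grad_x_f, grad_alpha_f, grad_x_r,
                      T=1.0, dt=1e-3):
    """Integrate x_t and its sensitivity z_t = d x_t / d alpha jointly
    (Euler scheme), then return grad V = grad_x r(x_T) @ z_T.

    f(t, x, alpha)            -> dx/dt                (shape d)
    grad_x_f(t, x, alpha)     -> Jacobian d f / d x   (shape d x d)
    grad_alpha_f(t, x, alpha) -> d f / d alpha        (shape d x p)
    grad_x_r(x)               -> d r / d x            (shape d)
    """
    x = np.array(x0, dtype=float)
    z = np.zeros((x.size, np.size(alpha)))   # z_0 = 0: x_0 does not depend on alpha
    n_steps = int(round(T / dt))
    for n in range(n_steps):
        t = n * dt
        dz = grad_alpha_f(t, x, alpha) + grad_x_f(t, x, alpha) @ z
        x = x + dt * np.asarray(f(t, x, alpha))
        z = z + dt * dz
    return grad_x_r(x) @ z                   # grad_alpha V(alpha) = grad_x r(x_T) z_T
```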

Discretized Stochastic Processes Approximation

• A General Convergence Result (Theorem 3)

If $\frac{dx_t}{dt} = f(x_t)$, and a discrete stochastic process makes jumps whose average matches $f$ up to $o(\delta t)$ terms (with jump sizes vanishing as the time step $\delta t \to 0$), then the discrete process converges to the continuous trajectory $(x_t)$ as $\delta t \to 0$.

• Discretization of the state

Use a stochastic policy $\pi_\alpha(u \mid t, x)$ over a finite action set $U$.

Stochastic discrete state process $(X_{t_n})_{0 \le n \le N}$ with time step $\delta t$

Initialization: $X_0 = x_0$

Jump in state: at each step, draw an action $u_{t_n} \sim \pi_\alpha(\cdot \mid t_n, X_{t_n})$ and set $X_{t_{n+1}} = X_{t_n} + \delta t\, f(X_{t_n}, u_{t_n})$
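A minimal Python sketch of this discretization, assuming a finite action set, a stochastic policy pi(t, x) that returns action probabilities, and a black-box dynamics f(x, u) (names illustrative). Each step samples an action and jumps by dt * f(x, u), so the expected jump per step is dt times the policy-averaged drift, which is the property the convergence argument relies on.

```python
import numpy as np

def simulate_discrete_process(x0, actions, pi, f, T=1.0, dt=1e-2, rng=None):
    """Stochastic discrete state process X_{t_n}:
    X_0 = x0, and at each step an action u is drawn from the stochastic
    policy pi(t, x) (a probability vector over `actions`), giving the jump
    X_{t_{n+1}} = X_{t_n} + dt * f(X_{t_n}, u).
    """
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    n_steps = int(round(T / dt))
    for n in range(n_steps):
        t = n * dt
        probs = pi(t, x)                          # action probabilities at (t, x)
        u = actions[rng.choice(len(actions), p=probs)]
        x = x + dt * np.asarray(f(x, u))          # jump in state
        traj.append(x.copy())
    return np.array(traj)
```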

Proof of Proposition 5:

From Taylor's formula, $f(x_{t + \delta t}, u) = f(x_t, u) + O(\delta t)$, so each jump is $\delta t\, f(X_{t_n}, u_{t_n}) + o(\delta t)$.

The average jump:

$E[f(x_t, u)] = \sum_{u \in U} \pi_\alpha(u \mid t, x_t)\, f(x_t, u) = f(x_t) + o(\delta t)$

(with $f(x_t)$ denoting the policy-averaged drift), so the expected increment per step is $\delta t\, f(x_t) + o(\delta t)$.

Directly applying Theorem 3 proves Proposition 5.

• Discretization of the state gradient

Stochastic discrete state gradient process $(Z_{t_n})_{0 \le n \le N}$

Initialization: $Z_0 = 0$ (since $x_0$ does not depend on $\alpha$)

with the jump in $Z_{t_n}$ chosen so that its average matches the dynamics of $z_t$.

Proof of Proposition 6:

Since the average jump of $(Z_{t_n})$ matches the dynamics of $z_t$ up to $o(\delta t)$ terms, directly applying Theorem 3 proves Proposition 6.

Model-free Reinforcement Learning Algorithm

In this stochastic approximation, the state $X_{t_n}$ is observed and the policy $\pi_\alpha$ (hence its gradient with respect to $\alpha$) is given; we only need to approximate $\nabla_x f$.

Least-Squares Approximation of $\nabla_x f$

Define

$S(t) = \{ s \in [t - c, t] \mid u_s = u_t \}$

the set of past discrete times in the window $[t - c, t]$ at which the current action $u_t$ was taken.

From Taylor's formula, for every discrete time $s \in S(t)$ the observed increment satisfies

$X_{s + \delta t} - X_s = \delta t\, f(X_s, u_t) + o(\delta t)$, with $f(X_s, u_t) = f(X_t, u_t) + \nabla_x f(X_t, u_t)\,(X_s - X_t) + o(\|X_s - X_t\|)$.

We deduce a linear relation between the observed increments and the state differences $X_s - X_t$, in which $\nabla_x f(X_t, u_t)$ is the unknown coefficient matrix.

We may therefore derive an approximation of $\nabla_x f(X_t, u_t)$ by solving the corresponding least-squares problem over the times $s \in S(t)$, using the average value of the observed increments and state differences over the window.
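A sketch of this least-squares estimation in Python, as an illustrative reconstruction of the mechanism rather than the paper's exact formulation: over a window of states observed while the same action was applied, regress the increments (X_{s+1} - X_s)/δt on the differences X_s - X_t; the intercept then estimates f(X_t, u) and the slope matrix estimates ∇_x f(X_t, u). Function and variable names are assumptions.

```python
import numpy as np

def estimate_dynamics(X_window, x_t, dt):
    """Least-squares estimate of f(x_t, u) and grad_x f(x_t, u) from a
    window of consecutive states X_window = [X_s, ...] observed while the
    same action u was applied (the set S(t) above).

    Model: (X_{s+1} - X_s) / dt  ~=  f(x_t, u) + grad_x f(x_t, u) (X_s - X_t)
    """
    X_window = np.asarray(X_window, dtype=float)
    increments = (X_window[1:] - X_window[:-1]) / dt      # observed drift at each X_s
    deltas = X_window[:-1] - np.asarray(x_t)              # regressors X_s - X_t
    # Design matrix with an intercept column: [1, (X_s - X_t)^T]
    A = np.hstack([np.ones((deltas.shape[0], 1)), deltas])
    coef, *_ = np.linalg.lstsq(A, increments, rcond=None)  # shape (1 + d, d)
    f_hat = coef[0]                                        # estimate of f(x_t, u)
    grad_x_f_hat = coef[1:].T                              # estimate of grad_x f(x_t, u)
    return f_hat, grad_x_f_hat
```

The window must contain enough distinct states (at least d + 1 in d dimensions) for the regression to be well posed; otherwise the least-squares solution is only a minimum-norm fit.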

Algorithm

Experimental Results

Six continuous state variables:

x0, y0: hand position

x, y: mass position

vx, vy: mass velocity

Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}

Goal: reach a target (xG, yG) with the mass at specific time T

Terminal reward function

The system dynamics:

Consider a Boltzmann-like stochastic policy over the four actions
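A Boltzmann-like policy can be sketched as a softmax over per-action scores; the sketch below is illustrative, and the scores would in practice be a parameterized function of (t, x) whose exact form the slides do not show.

```python
import numpy as np

# The four control actions of the hand-mass problem
ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def boltzmann_policy(scores, temperature=1.0):
    """Softmax (Boltzmann) distribution over the actions given their scores.

    scores: per-action preferences, e.g. a parameterized function of (t, x)
    evaluated for each u in ACTIONS (parameterization illustrative).
    Returns a probability vector summing to 1.
    """
    scores = np.asarray(scores, dtype=float) / temperature
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Example: sample one of the four actions
probs = boltzmann_policy([0.2, 1.0, -0.5, 0.1])
u = ACTIONS[np.random.default_rng(0).choice(len(ACTIONS), p=probs)]
```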

Conclusion

• Described a reinforcement learning method for approximating the gradient of the performance of a continuous-time deterministic control problem with respect to the policy parameters

• Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process