
Stochastic optimal control theory

Bert Kappen, SNN, Radboud University, Nijmegen, the Netherlands

July 5, 2008

Bert Kappen

Introduction

Optimal control theory: Optimize sum of a path cost and end cost. Result is optimal control sequence and optimal trajectory.

Input: Cost function. Output: Optimal trajectory and controls.

Classical control theory: what control signal should I give to move a plant to a desired state?

Input: Desired state trajectory. Output: Optimal control trajectory.

Bert Kappen ICML, July 5 2008 1

Types of optimal control problems

Finite horizon (fixed horizon time)

[Figure: controlled trajectory of the state x through the environment, from time t to the horizon time tf.]

Dynamics and environment may depend explicitly on time. Optimal control depends explicitly on time.

Bert Kappen ICML, July 5 2008 2

Types of optimal control problems

Finite horizon (moving horizon)


Dynamics and environment are static. Optimal control is time independent.

Bert Kappen ICML, July 5 2008 3

Types of optimal control problems

Finite horizon (moving horizon)


Dynamics and environment are static. Optimal control is time independent. Similar to RL.

Bert Kappen ICML, July 5 2008 4

Types of optimal control problems

Other types of control problems:
- minimum time
- infinite horizon, average reward
- infinite horizon, absorbing states

In addition one should distinguish:
- discrete vs. continuous state
- discrete vs. continuous time
- observable vs. partially observable

Bert Kappen ICML, July 5 2008 5

Overview

Deterministic optimal control (Kappen, 30 min.)
- Introduction of delayed reward problem in discrete time;
- Dynamic programming solution and deterministic Bellman equations;
- Solution in continuous time and states;
- Example: Mass on a spring
- Pontryagin maximum principle; Notion of an optimal (particle) trajectory
- Again Mass on a spring

Bert Kappen ICML, July 5 2008 6

Overview

Stochastic optimal control, discrete case (Toussaint, 40 min.)
- Stochastic Bellman equation (discrete state and time) and Dynamic Programming
- Reinforcement learning (exact solution, value iteration, policy improvement); Actor critic networks;
- Markov decision problems and probabilistic inference;
- Example: robotic motion control and planning

Bert Kappen ICML, July 5 2008 7

Overview

Stochastic optimal control, continuous case (Kappen, 40 min.)
- Stochastic differential equations
- Hamilton-Jacobi-Bellman equation (continuous state and time)
- LQ control, Riccati equation;
- Example of LQ control
- Learning; Partial observability: Inference and control;
- Certainty equivalence
- Path integral control; the role of noise and symmetry breaking; efficient approximate computation (MC, MF, BP, ...)
- Examples: Double slit, delayed choice, n joint arm

Bert Kappen ICML, July 5 2008 8

Overview

Research issues (Toussaint, 30 min.)
- Learning;
- efficient methods to compute value functions/cost-to-go
- control under partial observability (POMDPs)

Bert Kappen ICML, July 5 2008 9

Discrete time control

Consider the control of a discrete time deterministic dynamical system:

xt+1 = xt + f(t, xt, ut), t = 0, 1, . . . , T − 1

xt describes the state and ut specifies the control or action at time t.

Given x_{t=0} = x_0 and u_{0:T-1} = u_0, u_1, \ldots, u_{T-1}, we can compute x_{1:T}.

Define a cost for each sequence of controls:

C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t)

The problem of optimal control is to find the sequence u_{0:T-1} that minimizes C(x_0, u_{0:T-1}).

Bert Kappen ICML, July 5 2008 10

Dynamic programming

Find the minimal cost path from A to J.

C(J) = 0, C(H) = 3, C(I) = 4

C(F ) = min(6 + C(H), 3 + C(I))
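As an illustration (not from the slides), a minimal Python sketch of this backward recursion on a small graph. The slide only fixes C(J), C(H), C(I) and the two edges out of F, so the remaining nodes and edge costs below are hypothetical:

```python
# Dynamic programming on a small directed graph; C[n] is the minimal cost-to-go
# from node n to the goal J. Edge costs other than those on the slide are made up.
edges = {
    "A": {"B": 2, "C": 4},
    "B": {"F": 1, "G": 5},
    "C": {"F": 2, "G": 1},
    "F": {"H": 6, "I": 3},   # slide: C(F) = min(6 + C(H), 3 + C(I))
    "G": {"H": 2, "I": 7},
    "H": {"J": 3},           # C(H) = 3
    "I": {"J": 4},           # C(I) = 4
    "J": {},
}
C = {"J": 0.0}
for n in ["H", "I", "F", "G", "B", "C", "A"]:            # process nodes backward from J
    C[n] = min(cost + C[m] for m, cost in edges[n].items())
print(C["F"], C["A"])                                     # C(F) = 7 in this example
```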

Bert Kappen ICML, July 5 2008 11

Discrete time control

The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:

J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)

which solves the optimal control problem from an intermediate time t until the fixed end time T, for all intermediate states x_t.

Then,

J(T, x) = φ(x)

J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})

Bert Kappen ICML, July 5 2008 12

Discrete time control

One can recursively compute J(t, x) from J(t+ 1, x) for all x in the following way:

J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)

= \min_{u_t} \left( R(t, x_t, u_t) + \min_{u_{t+1:T-1}} \left[ \phi(x_T) + \sum_{s=t+1}^{T-1} R(s, x_s, u_s) \right] \right)

= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_{t+1}) \right)

= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right)

This is called the Bellman Equation.

Computes u as a function of x, t for all intermediate t and all x.

Bert Kappen ICML, July 5 2008 13

Discrete time control

The algorithm to compute the optimal control u∗0:T−1, the optimal trajectory x∗1:T

and the optimal cost is given by

1. Initialization: J(T, x) = φ(x)

2. Backwards: For t = T − 1, . . . , 0 and for all x compute

u∗t (x) = arg minu{R(t, x, u) + J(t+ 1, x+ f(t, x, u))}

J(t, x) = R(t, x, u∗t ) + J(t+ 1, x+ f(t, x, u∗t ))

3. Forwards: For t = 0, . . . , T − 1 compute

x∗t+1 = x∗t + f(t, x∗t , u∗t (x∗t ))

NB: the backward computation requires u∗t (x) for all x.
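A minimal Python sketch of this backward/forward scheme on a discretized state and control grid (the dynamics f, path cost R and end cost phi below are illustrative choices, not taken from the slides):

```python
import numpy as np

T, dt = 20, 0.1
xs = np.linspace(-2.0, 2.0, 81)            # state grid
us = np.linspace(-1.0, 1.0, 21)            # control grid
f   = lambda t, x, u: dt * (-x + u)        # hypothetical state increment f(t, x, u)
R   = lambda t, x, u: dt * 0.5 * u**2      # hypothetical path cost
phi = lambda x: 0.5 * (x - 1.0)**2         # hypothetical end cost

J = phi(xs)                                # 1. initialization: J(T, x)
u_star = np.zeros((T, len(xs)))
for t in range(T - 1, -1, -1):             # 2. backwards, for all x on the grid
    Jnew = np.empty_like(J)
    for i, x in enumerate(xs):
        q = R(t, x, us) + np.interp(x + f(t, x, us), xs, J)   # R + J(t+1, x + f)
        k = np.argmin(q)
        u_star[t, i], Jnew[i] = us[k], q[k]
    J = Jnew

x = -1.5                                   # 3. forwards from the start state
for t in range(T):
    x = x + f(t, x, np.interp(x, xs, u_star[t]))   # uses u*_t(x), stored for all x
print("x_T =", x)
```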

Bert Kappen ICML, July 5 2008 14

Continuous limit

Replace t+ 1 by t+ dt with dt→ 0.

xt+dt = xt + f(xt, ut, t)dt

C(x_0, u(0 \to T)) = \phi(x_T) + \int_0^T d\tau\, R(\tau, x(\tau), u(\tau))

Assume J(x, t) is smooth.

J(t, x) = \min_u \left( R(t, x, u)\,dt + J(t+dt, x + f(x, u, t)\,dt) \right)

\approx \min_u \left( R(t, x, u)\,dt + J(t, x) + \partial_t J(t, x)\,dt + \partial_x J(t, x)\, f(x, u, t)\,dt \right)

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(t, x) \right)

with boundary condition J(x, T ) = φ(x).

Bert Kappen ICML, July 5 2008 15

Continuous limit

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(t, x) \right)

with boundary condition J(x, T ) = φ(x).

This is called the Hamilton-Jacobi-Bellman Equation.

Computes the anticipated potential J(t, x) from the future potential φ(x).

Bert Kappen ICML, July 5 2008 16

Example: Mass on a spring

The spring exerts a force F_z = -z towards the rest position and the control adds a force F_u = u.

Newton's law: F = -z + u = m\ddot{z}

with m = 1.

Control problem: Given initial position and velocity z(0) = \dot{z}(0) = 0 at time t = 0, find the control path -1 < u(0 \to T) < 1 such that z(T) is maximal.

Bert Kappen ICML, July 5 2008 17

Example: Mass on a spring

Introduce x_1 = z, x_2 = \dot{z}; then

\dot{x}_1 = x_2, \qquad \dot{x}_2 = -x_1 + u

The end cost is φ(x) = −x1; path cost R(x, u, t) = 0.

The HJB takes the form:

-\partial_t J = \min_u \left( x_2 \frac{\partial J}{\partial x_1} + (-x_1 + u)\,\frac{\partial J}{\partial x_2} \right)

= x_2 \frac{\partial J}{\partial x_1} - x_1 \frac{\partial J}{\partial x_2} - \left| \frac{\partial J}{\partial x_2} \right|, \qquad u = -\mathrm{sign}\left( \frac{\partial J}{\partial x_2} \right)

Bert Kappen ICML, July 5 2008 18

Example: Mass on a spring

The solution is

J(t, x1, x2) = − cos(t− T )x1 + sin(t− T )x2 + α(t)

u(t, x1, x2) = −sign(sin(t− T ))

As an example consider T = 2π. Then, the optimal control is

u = −1, 0 < t < π

u = 1, π < t < 2π
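A quick numerical check (a sketch, not from the slides): integrate the dynamics under this bang-bang control with a simple Euler scheme and read off z(T):

```python
import numpy as np

# x1' = x2, x2' = -x1 + u with u(t) = -sign(sin(t - T)), T = 2*pi, starting from rest
T, dt = 2 * np.pi, 1e-4
x1, x2 = 0.0, 0.0
for t in np.arange(0.0, T, dt):
    u = -np.sign(np.sin(t - T))
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)     # final position z(T), approximately 4 for this control
```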

[Figure: the trajectories x_1(t) and x_2(t) under this control for T = 2\pi.]

Bert Kappen ICML, July 5 2008 19

Pontryagin minimum principle

The HJB equation is a PDE with boundary condition at future time. The PDE is solved using discretization of space and time.

The solution is an optimal cost-to-go for all x and t. From this we compute the optimal trajectory and optimal control.

An alternative approach is a variational approach that directly finds the optimal trajectory and optimal control.

Bert Kappen ICML, July 5 2008 20

Pontryagin minimum principle

We can write the optimal control problem as a constrained optimization problem with independent variables u(0 \to T) and x(0 \to T):

\min_{u(0 \to T),\, x(0 \to T)} \; \phi(x(T)) + \int_0^T dt\, R(x(t), u(t), t)

subject to the constraint \dot{x} = f(x, u, t)

and boundary condition x(0) = x0.

Introduce the Lagrange multiplier function λ(t):

C = \phi(x(T)) + \int_0^T dt\, \left[ R(t, x(t), u(t)) - \lambda(t)\left( f(t, x(t), u(t)) - \dot{x}(t) \right) \right]

= \phi(x(T)) + \int_0^T dt\, \left[ -H(t, x(t), u(t), \lambda(t)) + \lambda(t)\,\dot{x}(t) \right]

-H(t, x, u, \lambda) = R(t, x, u) - \lambda f(t, x, u)

Bert Kappen ICML, July 5 2008 21

Derivation PMP

The solution is found by extremizing C. This gives a necessary but not sufficient condition for a solution.

If we vary the action with respect to the trajectory x, the control u and the Lagrange multiplier λ, we get:

\delta C = \phi_x(x(T))\,\delta x(T) + \int_0^T dt\, \left[ -H_x\,\delta x(t) - H_u\,\delta u(t) + (-H_\lambda + \dot{x}(t))\,\delta\lambda(t) + \lambda(t)\,\delta\dot{x}(t) \right]

= \left( \phi_x(x(T)) + \lambda(T) \right)\delta x(T) + \int_0^T dt\, \left[ (-H_x - \dot{\lambda}(t))\,\delta x(t) - H_u\,\delta u(t) + (-H_\lambda + \dot{x}(t))\,\delta\lambda(t) \right]

where the \lambda\,\delta\dot{x} term has been integrated by parts, using \delta x(0) = 0.

For instance, H_x = \partial H(t, x(t), u(t), \lambda(t)) / \partial x(t).

Bert Kappen ICML, July 5 2008 22

We can solve Hu(t, x, u, λ) = 0 for u and denote the solution as

u∗(t, x, λ)

Assumes -H = R - \lambda^T f convex in u, so that the extremum is a minimum of the cost.

The remaining equations are

\dot{x} = H_\lambda(t, x, u^*(t, x, \lambda), \lambda)

\dot{\lambda} = -H_x(t, x, u^*(t, x, \lambda), \lambda)

with boundary conditions

x(0) = x0 λ(T ) = −φx(x(T ))

Mixed boundary value problem.

Bert Kappen ICML, July 5 2008 23

Again mass on a spring

Problem

\dot{x}_1 = x_2, \qquad \dot{x}_2 = -x_1 + u

R(x, u, t) = 0, \qquad \phi(x) = -x_1

Hamiltonian

H(t, x, u, \lambda) = -R(t, x, u) + \lambda^T f(t, x, u) = \lambda_1 x_2 + \lambda_2(-x_1 + u)

Minimizing the cost means maximizing H over the admissible controls -1 < u < 1:

u^* = \mathrm{sign}(\lambda_2), \qquad H^*(t, x, \lambda) = \lambda_1 x_2 - \lambda_2 x_1 + |\lambda_2|

The Hamilton equations

\dot{x} = \frac{\partial H^*}{\partial \lambda} \;\Rightarrow\; \dot{x}_1 = x_2, \quad \dot{x}_2 = -x_1 + \mathrm{sign}(\lambda_2)

\dot{\lambda} = -\frac{\partial H^*}{\partial x} \;\Rightarrow\; \dot{\lambda}_1 = \lambda_2, \quad \dot{\lambda}_2 = -\lambda_1

with x(t = 0) = x_0 and \lambda_1(T) = 1, \lambda_2(T) = 0.
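A sketch of solving these equations numerically for this example (not part of the slides). Here the costate equations do not involve x, so no shooting is needed: integrate λ backward from t = T, then integrate x forward using the resulting control:

```python
import numpy as np

T, dt = 2 * np.pi, 1e-4
n = int(T / dt)

lam = np.zeros((n + 1, 2))
lam[n] = [1.0, 0.0]                              # lambda_1(T) = 1, lambda_2(T) = 0
for k in range(n, 0, -1):                        # backward Euler sweep
    l1, l2 = lam[k]
    lam[k - 1] = [l1 - dt * l2, l2 + dt * l1]    # lambda_1' = lambda_2, lambda_2' = -lambda_1

x1, x2 = 0.0, 0.0                                # forward sweep with u = sign(lambda_2)
for k in range(n):
    u = np.sign(lam[k, 1])
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)        # reproduces the HJB result u(t) = -sign(sin(t - T)) and maximal z(T)
```

For a general problem the costate equations depend on the trajectory x, and the mixed boundary value problem is solved by shooting or collocation.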

Bert Kappen ICML, July 5 2008 24

Comments

The HJB method gives a sufficient (and often necessary) condition for optimality. The solution of the PDE is expensive.

The PMP method provides a necessary condition for optimal control. This means that it provides candidate solutions for optimality.

The PMP method is computationally less complicated than the HJB method because it does not require discretization of the state space.

Optimal control in continuous space and time contains many complications related to the existence, uniqueness and smoothness of the solution, particularly in the absence of noise. In the presence of noise many of these intricacies disappear.

HJB generalizes to the stochastic case, PMP does not (at least not easily).

Bert Kappen ICML, July 5 2008 25

Stochastic differential equations

Consider the random walk on the line:

xt+1 = xt + ξt ξt = ±1

with x0 = 0. We can compute

x_t = \sum_{i=1}^{t} \xi_i

Since xt is a sum of random variables, xt becomes Gaussian distributed with

\langle x_t \rangle = \sum_{i=1}^{t} \langle \xi_i \rangle = 0

\langle x_t^2 \rangle = \sum_{i,j=1}^{t} \langle \xi_i \xi_j \rangle = \sum_{i=1}^{t} \langle \xi_i^2 \rangle + \sum_{i,j=1,\, j \ne i}^{t} \langle \xi_i \xi_j \rangle = t

Note that the fluctuations grow as \sqrt{t}.
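A small sanity check in Python (illustrative, not from the slides): sample many random walks and verify that the mean stays near zero while the variance grows linearly in t:

```python
import numpy as np

rng = np.random.default_rng(0)
n_walks, t_max = 10000, 1000
xi = rng.choice([-1, 1], size=(n_walks, t_max))     # steps xi_i = +-1
x = xi.cumsum(axis=1)                               # x_t for each walk
for t in (10, 100, 1000):
    print(t, x[:, t - 1].mean(), (x[:, t - 1] ** 2).mean())   # mean ~ 0, variance ~ t
```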

Bert Kappen ICML, July 5 2008 26

Stochastic differential equations

In the continuous time limit we define

dxt = xt+dt − xt = dξ

with dξ an infinitesimal mean-zero Gaussian variable with \langle d\xi^2 \rangle = \nu\,dt.

Then

\frac{d}{dt}\langle x \rangle = 0 \;\Rightarrow\; \langle x \rangle(t) = 0

\frac{d}{dt}\langle x^2 \rangle = \nu \;\Rightarrow\; \langle x^2 \rangle(t) = \nu t

\rho(x, t|x_0, 0) = \frac{1}{\sqrt{2\pi\nu t}} \exp\left( -\frac{(x - x_0)^2}{2\nu t} \right)

Bert Kappen ICML, July 5 2008 27

Stochastic optimal control

Consider a stochastic dynamical system

dx = f(t, x, u)dt+ dξ

with dξ Gaussian noise, \langle d\xi_i\, d\xi_j \rangle = \nu_{ij}(t, x, u)\,dt.

The cost becomes an expectation:

C(t, x, u(t \to T)) = \left\langle \phi(x(T)) + \int_t^T d\tau\, R(\tau, x(\tau), u(\tau)) \right\rangle

over all stochastic trajectories starting at x with control path u(t→ T ).

Note that u(t), as part of u(t \to T), is used at time t. Next move to x + dx and repeat the optimization.

Bert Kappen ICML, July 5 2008 28

Stochastic optimal control

We obtain the Bellman recursion

J(t, x_t) = \min_{u_t} \left( R(t, x_t, u_t)\,dt + \langle J(t+dt, x_{t+dt}) \rangle \right)

\langle J(t+dt, x_{t+dt}) \rangle = \int dx_{t+dt}\, \mathcal{N}(x_{t+dt}|x_t, \nu dt)\, J(t+dt, x_{t+dt})

= J(t, x_t) + dt\,\partial_t J(t, x_t) + \langle dx \rangle\,\partial_x J(t, x_t) + \tfrac{1}{2}\langle dx^2 \rangle\,\partial_x^2 J(t, x_t)

\langle dx \rangle = f(x, u, t)\,dt, \qquad \langle dx^2 \rangle = \nu(t, x, u)\,dt

Thus,

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(t, x) + \tfrac{1}{2}\nu(t, x, u)\,\partial_x^2 J(t, x) \right)

with boundary condition J(x, T ) = φ(x).

Bert Kappen ICML, July 5 2008 29

Linear Quadratic control

The dynamics is linear

dx = [A(t)x + B(t)u + b(t)]\,dt + \sum_{j=1}^{m} (C_j(t)x + D_j(t)u + \sigma_j(t))\,d\xi_j, \qquad \langle d\xi_j\, d\xi_{j'} \rangle = \delta_{jj'}\,dt

The cost function is quadratic

\phi(x) = \tfrac{1}{2} x^T G x

R(x, u, t) = \tfrac{1}{2} x^T Q(t) x + u^T S(t) x + \tfrac{1}{2} u^T R(t) u

In this case the optimal cost-to-go is quadratic in x:

J(t, x) = \tfrac{1}{2} x^T P(t) x + \alpha^T(t) x + \beta(t)

u(t) = -\Psi(t)\,x(t) - \psi(t)

Bert Kappen ICML, July 5 2008 30

Substitution in the HJB equation yields ODEs for P,α, β:

-\dot{P} = P A + A^T P + \sum_{j=1}^{m} C_j^T P C_j + Q - \hat{S}^T \hat{R}^{-1} \hat{S}

-\dot{\alpha} = [A - B\hat{R}^{-1}\hat{S}]^T \alpha + \sum_{j=1}^{m} [C_j - D_j\hat{R}^{-1}\hat{S}]^T P \sigma_j + P b

\dot{\beta} = \tfrac{1}{2}\,\psi^T \hat{R}\,\psi - \alpha^T b - \tfrac{1}{2}\sum_{j=1}^{m} \sigma_j^T P \sigma_j

\hat{R} = R + \sum_{j=1}^{m} D_j^T P D_j, \qquad \hat{S} = B^T P + S + \sum_{j=1}^{m} D_j^T P C_j

\Psi = \hat{R}^{-1}\hat{S}, \qquad \psi = \hat{R}^{-1}\Big( B^T\alpha + \sum_{j=1}^{m} D_j^T P \sigma_j \Big)

with P(t_f) = G and \alpha(t_f) = \beta(t_f) = 0.

Bert Kappen ICML, July 5 2008 31

Example

Find the optimal control for the dynamics

dx = (x + u)\,dt + d\xi, \qquad \langle d\xi^2 \rangle = \nu\,dt

with end cost \phi(x) = 0 and path cost R(x, u) = \tfrac{1}{2}(x^2 + u^2).

The Riccati equations reduce to

-\dot{P} = 2P + 1 - P^2

-\dot{\alpha} = 0

\dot{\beta} = -\tfrac{1}{2}\nu P

with P (T ) = α(T ) = β(T ) = 0 and

u(x, t) = −P (t)x
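A minimal sketch of the computation (the horizon, noise level and initial state are illustrative): integrate the Riccati equation backward from P(T) = 0 and simulate the controlled diffusion with u = -P(t)x:

```python
import numpy as np

T, dt, nu = 10.0, 1e-3, 0.5
n = int(T / dt)

P = np.zeros(n + 1)                                  # P(T) = 0
for k in range(n, 0, -1):                            # -P' = 2P + 1 - P^2, backward Euler
    P[k - 1] = P[k] + dt * (2 * P[k] + 1 - P[k] ** 2)

rng = np.random.default_rng(0)
x = 2.0
for k in range(n):
    u = -P[k] * x                                    # optimal feedback control
    x += (x + u) * dt + np.sqrt(nu * dt) * rng.normal()
print(P[0], x)    # far from the horizon P approaches 1 + sqrt(2); x is regulated towards 0
```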

[Figure: numerical solution for this example, plotted against t on 0 ≤ t ≤ 10.]

Bert Kappen ICML, July 5 2008 32

Comments

Note that in the last example the optimal control is independent of ν, i.e. optimal stochastic control equals optimal deterministic control.

This is true in general for ’non-multiplicative’ noise (Cj = Dj = 0).

Bert Kappen ICML, July 5 2008 33

Learning

What happens if (part of) the state is not observed?

For instance,
- As a result of measurement error we do not know x_t but p(x_t|y_{0:t})
- We do not know the parameters of the dynamics
- We do not know the cost/rewards (RL case)

Bert Kappen ICML, July 5 2008 34

Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

Bert Kappen ICML, July 5 2008 35

Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

When the next day arrives, you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes

. . .

Bert Kappen ICML, July 5 2008 36

Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

When the next day arrives, you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

. . .

[Figure: value V as a function of time t.]

- The learning phase takes forever.
- Mix of exploration and optimizing (RL: actor-critic, E3, ...)
- Learning is not part of the control problem

Bert Kappen ICML, July 5 2008 37


Finite horizon learning

Imagine instead life as we know it. It is finite and we have only one life.

Aim is to maximize accumulated reward. This requires planning your learning!
- At t = 0, action is useless.
- At t = T, learning is useless.

[Figure: learning versus action along the time axis t.]

Problem of inference and control.

Bert Kappen ICML, July 5 2008 39

Inference and control

As an example, consider the problem

dx = αudt+ dξ

with α unobserved and x observed. Path cost R(x, u, t), end cost φ(x) and noise variance \langle d\xi^2 \rangle = \nu\,dt are given.

The problem is that the future information that we receive about α depends on u. Each time step we observe dx and u and thus learn about α.

pt+dt(α|dx, u) ∝ p(dx|α, u)pt(α)

The solution is to augment the state space with parameters θ_t (sufficient statistics) that describe p_t(α) = p(α|θ_t), with θ_0 known. Then with α = ±1, p_t(α = 1) = σ(θ_t):

d\theta = \frac{u}{\nu}\,dx = \frac{u}{\nu}\,(\alpha u\,dt + d\xi)

NB: In forward pass dθ = F (dx), thus θ also observed.

Bert Kappen ICML, July 5 2008 40

With zt = (xt, θt) we obtain a standard HJB:

-\partial_t J(t, z)\,dt = \min_u \left( R(t, z, u)\,dt + \langle dz \rangle_z\, \partial_z J(z, t) + \tfrac{1}{2} \langle dz^2 \rangle_z\, \partial_z^2 J(z, t) \right)

with boundary condition J(z, T ) = φ(x).

The expectation values are conditioned on (x_t, θ_t) and are averages over p(α|θ_t) and the Gaussian noise, cf.

\langle dx \rangle_{x,\theta} = \langle \alpha u\,dt + d\xi \rangle_{x,\theta} = \bar{\alpha}\,u\,dt, \qquad \bar{\alpha} = \int d\alpha\, p(\alpha|\theta)\,\alpha

[Figure: one-step look-ahead from t to t + dt towards t_f, using \langle dz \rangle_z, \langle dz^2 \rangle_z and J(z, t + dt).]

Bert Kappen ICML, July 5 2008 41

Certainty equivalence

An important special case of a partially observable control problem is the Kalman filter (y observed, x not observed).

dx = (x+ u)dt+ dξ

y = x+ η

The cost is quadratic in x and u, for instance

C(x_t, t, u_{t:T}) = \left\langle \sum_{\tau=t}^{T} \tfrac{1}{2}(x_\tau^2 + u_\tau^2) \right\rangle

The optimal control is u(x, t).

When x_t is not observed, we can compute p(x_t|y_{0:t}) using Kalman filtering and the optimal control minimizes

C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, p(x_t|y_{0:t})\, C(x_t, t, u_{t:T})

Bert Kappen ICML, July 5 2008 42

Since p(x_t|y_{0:t}) = \mathcal{N}(x_t|\mu_t, \sigma_t^2) is Gaussian and

C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, C(x_t, t, u_{t:T})\, \mathcal{N}(x_t|\mu_t, \sigma_t^2) = \sum_{\tau=t}^{T} \tfrac{1}{2} u_\tau^2 + \sum_{\tau=t}^{T} \tfrac{1}{2} \langle x_\tau^2 \rangle_{\mu_t, \sigma_t}

= \cdots = C(\mu_t, t, u_{t:T}) + \tfrac{1}{2}(T - t)\,\sigma_t^2

The first term is identical to the observed case with x_t → μ_t. The second term does not depend on u and thus does not affect the optimal control.

The optimal control for the Kalman filter is identical to the observed case with x_t replaced by μ_t:

uKF(y0:t, t) = u(µt, t)
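A sketch of this certainty-equivalent controller for an Euler discretization of the earlier example dx = (x + u)dt + dξ with observations y = x + η (the step size, noise levels and horizon are illustrative): a scalar Kalman filter tracks μ_t and the feedback law of the observed case is applied to μ_t:

```python
import numpy as np

T, dt, nu, r = 10.0, 1e-2, 0.5, 0.1        # horizon, step, process and observation noise
n = int(T / dt)

P = np.zeros(n + 1)                        # Riccati solution of the observed problem
for k in range(n, 0, -1):
    P[k - 1] = P[k] + dt * (2 * P[k] + 1 - P[k] ** 2)

rng = np.random.default_rng(1)
x, mu, var = 2.0, 0.0, 1.0                 # true state; filter mean and variance
for k in range(n):
    u = -P[k] * mu                         # certainty equivalence: use mu instead of x
    x = x + (x + u) * dt + np.sqrt(nu * dt) * rng.normal()
    y = x + np.sqrt(r) * rng.normal()      # noisy observation
    mu_p, var_p = mu + (mu + u) * dt, (1 + dt) ** 2 * var + nu * dt   # Kalman predict
    K = var_p / (var_p + r)                                           # Kalman update
    mu, var = mu_p + K * (y - mu_p), (1 - K) * var_p
print(x, mu)                               # the state is regulated using only mu
```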

Bert Kappen ICML, July 5 2008 43

Summary

For infinite time problems, learning is a meta problem with a time scale unrelated to the horizon time.

A finite time partially observable (or adaptive) control problem is in general equivalent to an observable non-adaptive control problem in an extended state space.

The partially observable case is generally more complex than the observable case. For instance an LQ problem with unknown parameters is not LQ.

For a Kalman filter with unobserved states and known parameters, the partial observability does not affect the optimal control law (certainty equivalence).

Learning can be done either in the maximum likelihood sense or in a full Bayesian way.

Bert Kappen ICML, July 5 2008 44

Path integral control

Solving the PDE is hard. We consider a special case that can be ’solved’.

dx = (f(x, t) + u)dt+ dξ

R(x, u, t) = V(x, t) + \tfrac{1}{2} u^T R u

and φ(x) arbitrary.

The stochastic HJB equation becomes:

-\partial_t J = \min_u \left( \tfrac{1}{2} u^T R u + V + (f + u)^T \partial_x J + \tfrac{1}{2}\mathrm{Tr}(\nu\,\partial_x^2 J) \right)

= -\tfrac{1}{2}(\partial_x J)^T R^{-1}(\partial_x J) + V + f^T \partial_x J + \tfrac{1}{2}\mathrm{Tr}(\nu\,\partial_x^2 J)

u = −R−1∂xJ(x, t)

must be solved backward in time with boundary condition J(T, x) = φ(x).

Bert Kappen ICML, July 5 2008 45

Closed form solution

If we further assume that \nu = \lambda R^{-1} for some scalar λ, then we can solve J in closed form. The solution contains the following steps:

Substitute J = -\lambda \log \Psi in the HJB equation. Because of the relation \nu = \lambda R^{-1}, the terms quadratic in Ψ cancel and only linear terms remain:

\partial_t \Psi = -H\Psi, \qquad H = -\frac{V}{\lambda} + f\,\partial_x + \tfrac{1}{2}\nu\,\partial_x^2

This equation must be solved backward in time with boundary condition \Psi(x, T) = \exp(-\phi(x)/\lambda).

The linearity allows us to reverse the direction of computation, replacing it by a diffusion process, in the following way.

Bert Kappen ICML, July 5 2008 46

Closed form solution

Let ρ(y, τ |x, t) describe a diffusion process defined by the Fokker-Planck equation

\partial_\tau \rho = -\frac{V}{\lambda}\rho - \partial_y(f\rho) + \tfrac{1}{2}\nu\,\partial_y^2\rho = H^\dagger \rho \qquad (1)

with ρ(y, t|x, t) = δ(y − x).

Define

A(x, t) = \int dy\, \rho(y, \tau|x, t)\,\Psi(y, \tau)

It is easy to see, using the equations of motion for Ψ and ρ, that A(x, t) is independent of τ. Evaluating A(x, t) for τ = t yields A(x, t) = Ψ(x, t). Evaluating A(x, t) for τ = T yields A(x, t) = \int dy\, \rho(y, T|x, t)\,\Psi(y, T). Thus,

\Psi(x, t) = \int dy\, \rho(y, T|x, t)\, \exp(-\phi(y)/\lambda)

Bert Kappen ICML, July 5 2008 47

Forward sampling of the diffusion process

The diffusion equation

\partial_\tau \rho = -\frac{V}{\lambda}\rho - \partial_y(f\rho) + \tfrac{1}{2}\nu\,\partial_y^2\rho \qquad (2)

can be sampled as

dx = f(x, t)\,dt + d\xi

x \to x + dx, \quad \text{with probability } 1 - V(x, t)\,dt/\lambda
x \to \dagger, \quad \text{with probability } V(x, t)\,dt/\lambda

Bert Kappen ICML, July 5 2008 48

Forward sampling of the diffusion process

We can estimate

\Psi(x, t) = \int dy\, \rho(y, T|x, t)\, \exp(-\phi(y)/\lambda) \approx \frac{1}{N} \sum_{i \in \text{alive}} \exp(-\phi(x_i(T))/\lambda)

by computing N trajectories xi(t→ T ), i = 1, . . . , N .

'Alive' denotes the subset of trajectories that do not get killed along the way by the † operation.
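A minimal Python sketch of this estimator (the drift f, potential V, end cost φ and the parameters are illustrative choices): simulate the killed diffusion forward and average exp(-φ/λ) over the surviving trajectories:

```python
import numpy as np

lam = nu = 0.5                                   # R = 1, so lambda = nu
T, dt, N = 1.0, 1e-2, 100000
f   = lambda x, t: np.zeros_like(x)              # uncontrolled drift (here zero)
V   = lambda x, t: 0.5 * x**2                    # hypothetical path cost potential
phi = lambda x: 0.5 * (x - 1.0)**2               # hypothetical end cost

rng = np.random.default_rng(0)
x = np.zeros(N)                                  # all trajectories start at x = 0
alive = np.ones(N, dtype=bool)
for t in np.arange(0.0, T, dt):
    x[alive] += f(x[alive], t) * dt + np.sqrt(nu * dt) * rng.normal(size=alive.sum())
    alive &= rng.random(N) >= V(x, t) * dt / lam      # kill with probability V dt / lambda
psi = np.exp(-phi(x[alive]) / lam).sum() / N          # killed trajectories contribute zero
print(psi, -lam * np.log(psi))                        # estimates of Psi(0, 0) and J(0, 0)
```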

Bert Kappen ICML, July 5 2008 49

The diffusion process

The diffusion process can be written as a path integral:

\rho(y, T|x, t) = \int [dx]_x^y \exp\left( -\frac{1}{\nu} S_{\text{path}}(x(t \to T)) \right)

S_{\text{path}}(x(t \to T)) = \int_t^T d\tau\, \left[ \tfrac{1}{2}\left( \dot{x}(\tau) - f(x(\tau), \tau) \right)^2 + V(x(\tau), \tau) \right]

[Figure: paths running from x at time t to y at time t_f.]

Bert Kappen ICML, July 5 2008 50

The path integral formulation

\Psi(x, t) = \int dy\, \rho(y, T|x, t)\, \exp\left( -\frac{\phi(y)}{\nu} \right) = \int [dx]_x \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)

S(x(t \to T)) = S_{\text{path}}(x(t \to T)) + \phi(x(T))

Ψ is a partition sum and J = -\nu \log \Psi therefore can be interpreted as a free energy. S is the energy of a path and ν the temperature.

The corresponding probability distribution is

p(x(t \to T)|x, t) = \frac{1}{\Psi(x, t)} \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)

Bert Kappen ICML, July 5 2008 51

An example: double slit

dx = udt+ dξ

C = \left\langle \tfrac{1}{2}x(T)^2 + \int_0^T d\tau\, \left[ \tfrac{1}{2}u(\tau)^2 + V(x(\tau), \tau) \right] \right\rangle

V(x, t = 1) implements a slit at an intermediate time t = 1.

\Psi(x, t) = \int dy\, \rho(y, T|x, t)\,\Psi(y, T)

can be solved in closed form.

[Figure: the double-slit geometry on 0 ≤ t ≤ 2 (left) and the optimal cost-to-go J(x, t) versus x for t = 0, 0.99, 1.01, 2 (right).]

Bert Kappen ICML, July 5 2008 52

MC sampling on double slit

J(x, t) \approx -\nu \log\left( \frac{1}{N} \sum_{i=1}^{N} \exp(-\phi(x_T^i)/\nu) \right)

[Figure: (a) naive sample paths; (b) J(x, t = 0) by naive sampling (n = 100000), MC estimate versus exact; (c) J(x, t = 0) by Laplace approximation and importance sampling (n = 100).]

Bert Kappen ICML, July 5 2008 53

The delayed choice

If we go further back in time we encounter a new phenomenon: a delayed choice of when to decide.

When the slit size goes to zero, J is given by

J(x, T) = -\nu \log \int dy\, \rho(y|x)\, e^{-\phi(y)/\nu} = \frac{1}{T}\left( \tfrac{1}{2}x^2 - \nu T \log 2\cosh\frac{x}{\nu T} \right)

where T is the time to reach the slits. The expression between brackets is a typical free energy with temperature νT.

[Figure: J(x, t) versus x for T = 2, T = 1 and T = 0.5.]

Symmetry breaking at νT = 1 separates two qualitatively different behaviors.
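A quick numerical illustration of this symmetry breaking (a sketch; ν = 1 and the grid are arbitrary choices): evaluate the free energy above and locate its minimum for several values of T:

```python
import numpy as np

nu = 1.0
x = np.linspace(-2, 2, 401)
for T in (2.0, 1.0, 0.5):
    J = (0.5 * x**2 - nu * T * np.log(2 * np.cosh(x / (nu * T)))) / T
    print(T, x[np.argmin(J)])   # minimum at x = 0 for nu*T >= 1, away from 0 for nu*T < 1
```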

Bert Kappen ICML, July 5 2008 54

The delayed choice

[Figure: controlled trajectories for the stochastic and for the deterministic problem.]

The timing of the decision, that is when the automaton decides to go left or right, is the consequence of spontaneous symmetry breaking.

Bert Kappen ICML, July 5 2008 55

N joint arm in a plane

[Figure: two configurations of the planar n-joint arm.]

dθi = uidt+ dξi, i = 1, . . . , n

C = \left\langle \phi(x_n(\vec\theta)) + \int d\tau\, \tfrac{1}{2} u(\tau)^2 \right\rangle

u_i = \frac{\langle \theta_i \rangle - \theta_i}{T - t}, \qquad i = 1, \ldots, n

\langle \cdots \rangle \propto \rho(\theta, T|\theta_0, t)\, \exp(-\phi(x_n(\vec\theta)))

〈θ〉 from uncontrolled penalized diffusion. Variational approximation.
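A sketch of a single control step for this problem (the arm geometry, target, noise level and the weight exp(-φ/ν) are illustrative assumptions): sample endpoints of the uncontrolled diffusion, weight them by the end cost, and apply the formula for u_i above:

```python
import numpy as np

n, nu, T, t = 3, 0.5, 1.0, 0.0
target = np.array([1.5, 1.5])

def end_effector(theta):            # planar arm with unit links at absolute angles theta
    return np.stack([np.cos(theta).sum(axis=1), np.sin(theta).sum(axis=1)], axis=1)

def phi(theta):                     # end cost: squared distance of the tip to the target
    d = end_effector(theta) - target
    return 0.5 * (d ** 2).sum(axis=1)

rng = np.random.default_rng(0)
theta = np.zeros(n)                                        # current joint angles
samples = theta + np.sqrt(nu * (T - t)) * rng.normal(size=(100000, n))
w = np.exp(-phi(samples) / nu)                             # weights of the penalized diffusion
theta_mean = (w[:, None] * samples).sum(axis=0) / w.sum()  # <theta>
u = (theta_mean - theta) / (T - t)                         # control for each joint
print(u)
```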

Bert Kappen ICML, July 5 2008 56

Other path integral control issues

Multi-agent coordination.

Relation between reinforcement learning and learning.

Realistic robotics applications; Non-differentiable cost functions (obstacles).

Bert Kappen ICML, July 5 2008 57

Summary

Deterministic control can be solved by
- HJB, a PDE
- PMP, two coupled ODEs with mixed boundary conditions

Stochastic control can be solved by
- HJB in general
- Riccati equations for the sufficient statistics in the LQ case
- Path integral control: like LQ control, but with arbitrary state cost and dynamics
- RL is a special case

Learning (PO states or parameter values)
- Decoupled from control in the RL case
- Joint inference and control typically harder than control only
- For Kalman filters the PO is irrelevant (certainty equivalence)

Bert Kappen ICML, July 5 2008 58

Summary

The PI control problem provides a novel link to machine learning:
- statistical physics
- symmetry breaking, 'phase transitions'
- 'efficient' computation (MCMC, BP, MF, EP)

Bert Kappen ICML, July 5 2008 59

Further reading

Check out the ICML tutorial paper and references there and other general references at the tutorial web page

http://ml.cs.tu-berlin.de/~mtoussai/08-optimal-control/

Bert Kappen ICML, July 5 2008 60