Linearly-Solvable Stochastic Optimal Control Problems

Emo Todorov

Applied Mathematics and Computer Science & Engineering

University of Washington

Winter 2014

AMATH/CSE 579

Problem formulation

In traditional MDPs the controller chooses actions u which in turn specify the transition probabilities p(x'|x, u). We can obtain a linearly-solvable MDP (LMDP) by allowing the controller to specify these probabilities directly:

x' ~ u(·|x)    controlled dynamics
x' ~ p(·|x)    passive dynamics
p(x'|x) = 0  ⇒  u(x'|x) = 0    feasible control set U(x)

The immediate cost is of the form

\ell(x, u(\cdot|x)) = q(x) + \mathrm{KL}\big(u(\cdot|x) \,\|\, p(\cdot|x)\big)

\mathrm{KL}\big(u(\cdot|x) \,\|\, p(\cdot|x)\big) = \sum_{x'} u(x'|x) \log \frac{u(x'|x)}{p(x'|x)} = E_{x' \sim u(\cdot|x)}\left[ \log \frac{u(x'|x)}{p(x'|x)} \right]

Thus the controller can impose any dynamics it wishes, but it pays a price (the KL divergence control cost) for pushing the system away from the passive dynamics.

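To make the immediate cost concrete, here is a minimal numerical sketch (not from the slides; the helper lmdp_cost, the passive and controlled distributions and the state cost are all made-up illustrations) that evaluates q(x) + KL(u||p) for a discrete set of successor states:

import numpy as np

def lmdp_cost(q_x, u, p):
    """Immediate LMDP cost  l(x, u(.|x)) = q(x) + KL(u(.|x) || p(.|x)).

    q_x : state cost q(x) at the current state
    u   : controlled next-state distribution u(.|x)  (1D array summing to 1)
    p   : passive next-state distribution   p(.|x)   (1D array summing to 1)
    Feasibility requires u = 0 wherever p = 0; the convention 0*log(0) = 0 is used.
    """
    u, p = np.asarray(u, float), np.asarray(p, float)
    assert np.all((p > 0) | (u == 0)), "u must vanish where p vanishes"
    mask = u > 0
    return q_x + float(np.sum(u[mask] * np.log(u[mask] / p[mask])))

# Made-up example: three successor states, uniform passive dynamics.
p = np.array([1/3, 1/3, 1/3])
print(lmdp_cost(1.0, p, p))                  # KL term is 0: only the state cost remains
print(lmdp_cost(1.0, [0.8, 0.1, 0.1], p))    # pushing probability toward state 0 costs extra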

Understanding the KL divergence cost

[Figure: three panels. "KL cost over the probability simplex": the KL divergence control cost as a function of the controlled probabilities u(1|x), u(2|x), u(3|x) ("one", "two", "three") relative to a passive distribution p. "How to bias a coin": for a fair passive coin (p = 0.5), the curves E(q), KL(u||p) and their sum E(q) + KL(u||p) are plotted against the controlled probability of heads u; the sum is minimized at u* ≈ 0.27. "Benefits of error tolerance".]

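The coin panel can be reproduced in a few lines. This is a sketch under the assumption, suggested by the figure's axes, that the state costs are q(heads) = 1 and q(tails) = 0 and the passive coin is fair; it minimizes E_u[q] + KL(u||p) over the bias by brute force and checks the result against the closed-form reweighting u* ∝ p·exp(−q), which has the same form as the optimal control law derived on the following slides.

import numpy as np

# Assumed from the figure: passive coin p(heads) = 0.5, state costs q(heads) = 1, q(tails) = 0.
p_heads = 0.5
q = np.array([1.0, 0.0])

def objective(u_heads):
    """E_u[q] + KL(u || p) for a coin with controlled P(heads) = u_heads."""
    u = np.array([u_heads, 1.0 - u_heads])
    p = np.array([p_heads, 1.0 - p_heads])
    return float(u @ q + np.sum(u * np.log(u / p)))

# Brute-force minimization over a fine grid of biases.
grid = np.linspace(1e-4, 1 - 1e-4, 10001)
u_star_grid = grid[np.argmin([objective(u) for u in grid])]

# Closed form: u*(i) proportional to p(i) * exp(-q(i)).
w = np.array([p_heads, 1 - p_heads]) * np.exp(-q)
u_star_formula = w[0] / w.sum()

print(u_star_grid, u_star_formula)   # both ~0.269, matching the 0.27 marked in the figure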

Simplifying the Bellman equation (first exit)

v(x) = \min_u \Big\{ \ell(x, u) + E_{x' \sim p(\cdot|x, u)}\big[ v(x') \big] \Big\}

     = \min_{u(\cdot|x)} \Big\{ q(x) + E_{x' \sim u(\cdot|x)}\Big[ \log \frac{u(x'|x)}{p(x'|x)} + \log \frac{1}{\exp(-v(x'))} \Big] \Big\}

     = \min_{u(\cdot|x)} \Big\{ q(x) + E_{x' \sim u(\cdot|x)}\Big[ \log \frac{u(x'|x)}{p(x'|x) \exp(-v(x'))} \Big] \Big\}

The last term is an unnormalized KL divergence...

Definitions

desirability function      z(x) \triangleq \exp(-v(x))
next-state expectation     P[z](x) \triangleq \sum_{x'} p(x'|x)\, z(x')

Normalizing the denominator by P[z](x) turns the last term into a proper KL divergence:

v(x) = \min_{u(\cdot|x)} \Big\{ q(x) - \log P[z](x) + \mathrm{KL}\Big( u(\cdot|x) \,\Big\|\, \frac{p(\cdot|x)\, z(\cdot)}{P[z](x)} \Big) \Big\}



Linear Bellman equation and optimal control law

KL(p_1(·) || p_2(·)) achieves its global minimum of 0 iff p_1 = p_2, thus

Theorem (optimal control law)

u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)}

The Bellman equation becomes

v(x) = q(x) - \log P[z](x), \qquad \text{i.e.} \qquad z(x) = \exp(-q(x))\, P[z](x)

which can be written more explicitly as

Theorem (linear Bellman equation)

z(x) = \exp(-q(x)) \sum_{x'} p(x'|x)\, z(x')    if x is non-terminal
z(x) = \exp(-q_T(x))                            if x is terminal


Illustration

[Figure: at a state x, the passive distribution p(x'|x) over successor states is reweighted by the desirability z(x') to give the optimal control law u*(x'|x) ∝ p(x'|x) z(x'); the next state x' is sampled from u*(·|x).]

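As a sketch of what the figure depicts (all numbers below are made up), the optimal control law can be sampled by reweighting the passive transition probabilities by the desirability of the successor states:

import numpy as np

def sample_next_state(p_row, z, rng):
    """Sample x' from u*(x'|x) = p(x'|x) z(x') / P[z](x).

    p_row : passive transition probabilities p(.|x) over successor states
    z     : desirability z(x') of each successor state
    """
    u = p_row * z
    u /= u.sum()                      # the normalizer is exactly P[z](x)
    return rng.choice(len(u), p=u)

rng = np.random.default_rng(0)
p_row = np.array([0.25, 0.25, 0.25, 0.25])   # made-up passive dynamics at state x
z = np.array([0.1, 0.9, 0.4, 0.2])           # made-up desirability of the four successors
print([sample_next_state(p_row, z, rng) for _ in range(10)])   # mostly the high-z states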

Summary of results

Let Q = diag(exp(−q)) and P_xy = p(y|x). Then we have

first exit         z = exp(−q) P[z]                 z = Q P z
finite horizon     z_k = exp(−q_k) P_k[z_{k+1}]     z_k = Q_k P_k z_{k+1}
average cost       z = exp(c − q) P[z]              λ z = Q P z
discounted cost    z = exp(−q) P[z^α]               z = Q P z^α

In the first-exit problem we can also write

z_N = Q_NN P_NN z_N + b,    z_N = (I − Q_NN P_NN)^{-1} b
b ≜ Q_NN P_NT exp(−q_T)

where N, T are the sets of non-terminal and terminal states respectively.

In the average-cost problem λ = exp(−c) is the principal eigenvalue of QP, where c is the average cost per step (equivalently c = −log λ).

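A sketch of the two computations above on a small chain (the chain, costs and sizes are made up): the first-exit desirability comes from a single linear solve, and the average-cost eigenpair comes from power iteration on QP.

import numpy as np

# --- First exit on a 5-state chain; state 4 is terminal (made-up example) ---
n, term = 5, [4]
nonterm = [s for s in range(n) if s not in term]
P = np.zeros((n, n))
for s in range(n - 1):                     # passive dynamics: unbiased random walk
    P[s, max(s - 1, 0)] += 0.5
    P[s, s + 1] += 0.5
q = np.full(n, 0.5)                        # constant state cost at non-terminal states
qT = 0.0                                   # zero terminal cost

QNN = np.diag(np.exp(-q[nonterm]))
PNN, PNT = P[np.ix_(nonterm, nonterm)], P[np.ix_(nonterm, term)]
b = QNN @ PNT @ np.exp(-qT * np.ones(len(term)))
zN = np.linalg.solve(np.eye(len(nonterm)) - QNN @ PNN, b)    # z_N = (I - QNN PNN)^{-1} b
print("first-exit cost-to-go v = -log z:", -np.log(zN))

# --- Average cost: make the walk ergodic and power-iterate  lambda z = QP z ---
P[n - 1, 0] = P[n - 1, n - 2] = 0.5
QP = np.diag(np.exp(-q)) @ P
z = np.ones(n)
for _ in range(1000):
    z = QP @ z
    z /= np.linalg.norm(z)
lam = z @ QP @ z                           # Rayleigh quotient at the converged eigenvector
print("average cost c = -log(lambda):", -np.log(lam))   # 0.5 here, since q is constant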


Stationary distribution under the optimal control law

Let μ(x) denote the stationary distribution under the optimal control law u*(·|x) in the average-cost problem. Then

\mu(x') = \sum_x u^*(x'|x)\, \mu(x)

Recall that

u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)} = \frac{p(x'|x)\, z(x')}{\lambda \exp(q(x))\, z(x)}

Defining r(x) ≜ μ(x)/z(x), we have

\mu(x') = \sum_x \frac{p(x'|x)\, z(x')}{\lambda \exp(q(x))\, z(x)}\, \mu(x)

\lambda\, r(x') = \sum_x \exp(-q(x))\, p(x'|x)\, r(x)

In vector notation this becomes

\lambda r = (QP)^T r

Thus z and r are the right and left principal eigenvectors of QP, and μ = z ∘ r (elementwise product).
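A sketch that checks this numerically on a made-up ergodic chain: z and r are obtained as right and left principal eigenvectors of QP, and the normalized elementwise product z ∘ r matches the stationary distribution of the optimally controlled chain.

import numpy as np

n = 5
P = np.zeros((n, n))                        # made-up passive dynamics: random walk on a ring
for s in range(n):
    P[s, (s - 1) % n] += 0.5
    P[s, (s + 1) % n] += 0.5
q = np.array([0.1, 0.5, 1.0, 0.5, 0.1])     # made-up state costs
QP = np.diag(np.exp(-q)) @ P

# Right and left principal eigenvectors of QP by power iteration.
z, r = np.ones(n), np.ones(n)
for _ in range(2000):
    z = QP @ z;   z /= z.sum()
    r = QP.T @ r; r /= r.sum()

# Optimal control law u*(x'|x) = p(x'|x) z(x') / P[z](x) and its stationary distribution.
U = P * z[None, :]
U /= U.sum(axis=1, keepdims=True)
mu = np.ones(n) / n
for _ in range(2000):
    mu = U.T @ mu                           # mu(x') = sum_x u*(x'|x) mu(x)

zr = z * r
print(np.round(mu, 4))
print(np.round(zr / zr.sum(), 4))           # matches mu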


Comparison to policy and value iteration

[Figure: comparison on a test problem whose state is (horizontal position, tangential velocity). Left: average cost vs. CPU time (sec) for z iteration and policy iteration. Middle: the optimal control and optimal cost-to-go computed by z iteration, and the optimal cost-to-go computed by policy iteration, over the state space. Right: average cost vs. number of updates (up to 10^4) for z iteration, policy iteration, and value iteration.]


Application to deterministic shortest paths

Given a graph and a set T of goal states, define the first-exit LMDP

p(x'|x)           random walk on the graph
q(x) = ρ > 0      constant cost at non-terminal states
q_T(x) = 0        zero cost at terminal states

For large ρ the optimal cost-to-go v^{(ρ)} is dominated by the state costs, since the KL divergence control costs are bounded. Thus we have

Theorem
The length of the shortest path from state x to a goal state is

\lim_{\rho \to \infty} \frac{v^{(\rho)}(x)}{\rho}

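A sketch of the theorem on a small made-up graph: the first-exit LMDP is solved with constant state cost ρ at non-goal nodes, and v^{(ρ)}(x)/ρ is compared against breadth-first-search distances.

import numpy as np
from collections import deque

# Made-up undirected graph on 6 nodes given as adjacency lists; node 5 is the goal.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
goal, rho = 5, 50.0
nonterm = [x for x in sorted(adj) if x != goal]

# Passive dynamics: random walk on the graph.
P = np.zeros((len(adj), len(adj)))
for x, nbrs in adj.items():
    P[x, nbrs] = 1.0 / len(nbrs)

# First-exit linear Bellman equation with q = rho at non-goal nodes and q_T = 0 at the goal.
QNN = np.exp(-rho) * np.eye(len(nonterm))
PNN = P[np.ix_(nonterm, nonterm)]
b = np.exp(-rho) * P[np.ix_(nonterm, [goal])].ravel()
zN = np.linalg.solve(np.eye(len(nonterm)) - QNN @ PNN, b)
v_over_rho = -np.log(zN) / rho

# Exact shortest path lengths by breadth-first search from the goal.
dist, dq = {goal: 0}, deque([goal])
while dq:
    x = dq.popleft()
    for y in adj[x]:
        if y not in dist:
            dist[y] = dist[x] + 1
            dq.append(y)

for x, v in zip(nonterm, v_over_rho):
    print(x, round(v, 3), dist[x])          # v/rho approaches the graph distance as rho grows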

Internet example

Performance on the graph of Internet routers as of 2003 (data from caida.org). There are 190914 nodes and 609066 undirected edges in the graph.

[Figure: Left: the approximation v^{(ρ)}(x)/ρ plotted against the true shortest path length (1 to 15), with the estimation error shown. Right: percentage of errors, all of them shortest path lengths over-estimated by 1, as a function of the parameter ρ (roughly 25 to 70); for large ρ some values of z(x) become numerically indistinguishable from 0.]


Embedding of traditional MDPs

Given a traditional MDP with controls ũ ∈ Ũ(x), transition probabilities p̃(x'|x, ũ) and costs ℓ̃(x, ũ), we can construct an LMDP such that the controls corresponding to the MDP's transition probabilities have the same costs as in the MDP. This is done by constructing p and q such that for all x, ũ ∈ Ũ(x)

q(x) + \mathrm{KL}\big( \tilde{p}(\cdot|x,\tilde{u}) \,\|\, p(\cdot|x) \big) = \tilde{\ell}(x,\tilde{u})

q(x) - \sum_{x'} \tilde{p}(x'|x,\tilde{u}) \log p(x'|x) = \tilde{\ell}(x,\tilde{u}) + \tilde{h}(x,\tilde{u})

where h̃ is the entropy of p̃(·|x, ũ). The construction is done separately for every x. Suppressing x, vectorizing over ũ, and defining s = −log p,

q \mathbf{1} + \tilde{P} s = \tilde{b}
\exp(-s)^T \mathbf{1} = 1

P̃ and b̃ = ℓ̃ + h̃ are known; q and s are unknown. Assuming P̃ is full rank,

y = \tilde{P}^{-1} \tilde{b}, \qquad s = y - q \mathbf{1}, \qquad q = -\log\big( \exp(-y)^T \mathbf{1} \big)

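A sketch of the construction at a single state (all MDP numbers are made up): given the action-conditional transition matrix P̃ and the vector b̃ = ℓ̃ + h̃, solve for q and s = −log p, then verify that every action's cost is reproduced.

import numpy as np

# Made-up MDP at one state x: 3 actions, 3 successor states.
P_tilde = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])              # rows: ptilde(.|x, u) for each action u
l_tilde = np.array([1.0, 2.0, 0.5])                # MDP costs ltilde(x, u)

h_tilde = -np.sum(P_tilde * np.log(P_tilde), axis=1)    # entropy of each row
b_tilde = l_tilde + h_tilde

# Solve  q*1 + P_tilde s = b_tilde  subject to  sum(exp(-s)) = 1.
y = np.linalg.solve(P_tilde, b_tilde)
q = -np.log(np.sum(np.exp(-y)))
s = y - q
p = np.exp(-s)                                     # embedded passive dynamics p(.|x)
print("q(x) =", q, " p(.|x) =", p, " sums to", p.sum())

# Check: q(x) + KL(ptilde(.|x,u) || p(.|x)) equals the original MDP cost for every action.
kl = np.sum(P_tilde * np.log(P_tilde / p[None, :]), axis=1)
print(np.round(q + kl, 6), l_tilde)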


Grid world example

[Figure: a grid world with obstacles; goal states are marked X. Maps of the MDP cost-to-go and of the embedded-LMDP cost-to-go over the grid, and a scatter plot of LMDP cost-to-go vs. MDP cost-to-go; R² = 0.986.]


Machine repair example

[Figure: machine repair problem with 100 states and 50 time steps. Maps of the MDP cost-to-go and of the embedded-LMDP cost-to-go over state and time step; scatter plot of LMDP vs. MDP cost-to-go, R² = 0.993. Performance on the MDP (relative cost) of the resulting control policies: MDP 1.00, LMDP 1.01, random 1.64.]


Continuous-time limit

Consider a continuous-state discrete-time LMDP where p^{(h)}(x'|x) is the h-step transition probability of some continuous-time stochastic process, and z^{(h)}(x) is the LMDP solution. The linear Bellman equation (first exit) is

z^{(h)}(x) = \exp(-h\, q(x))\, E_{x' \sim p^{(h)}(\cdot|x)}\big[ z^{(h)}(x') \big]

Let z = \lim_{h \downarrow 0} z^{(h)}. Taking the limit directly yields only the identity z(x) = z(x), but we can rearrange as

\lim_{h \downarrow 0} \frac{\exp(h\, q(x)) - 1}{h}\, z^{(h)}(x) = \lim_{h \downarrow 0} \frac{E_{x' \sim p^{(h)}(\cdot|x)}\big[ z^{(h)}(x') \big] - z^{(h)}(x)}{h}

Recalling the definition of the generator L, we now have

q(x)\, z(x) = L[z](x)

If the underlying process is an Ito diffusion, the generator is

L[z](x) = a(x)^T z_x(x) + \frac{1}{2}\, \mathrm{trace}\big( \Sigma(x)\, z_{xx}(x) \big)



Linearly-solvable controlled diffusions

Above z was defined as the continuous-time limit of the LMDP solutions z^{(h)}. But is z the solution to a continuous-time problem, and if so, what problem?

dx = (a(x) + B(x)\, u)\, dt + C(x)\, d\omega
\ell(x, u) = q(x) + \frac{1}{2} u^T R(x)\, u

Recall that for such problems we have u^* = -R^{-1} B^T v_x and

0 = q + a^T v_x + \frac{1}{2} \mathrm{tr}\big( C C^T v_{xx} \big) - \frac{1}{2} v_x^T B R^{-1} B^T v_x

Define z(x) = exp(−v(x)) and write the PDE in terms of z:

v_x = -\frac{z_x}{z}, \qquad v_{xx} = -\frac{z_{xx}}{z} + \frac{z_x z_x^T}{z^2}

0 = q - \frac{1}{z}\left( a^T z_x + \frac{1}{2} \mathrm{tr}\big( C C^T z_{xx} \big) + \frac{1}{2z} z_x^T B R^{-1} B^T z_x - \frac{1}{2z} z_x^T C C^T z_x \right)

Now if C C^T = B R^{-1} B^T, we obtain the linear HJB equation q z = L[z].

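The cancellation can be checked symbolically in one dimension. This is a sketch (not from the slides) using sympy, with scalar drift a(x), constant control gain b, noise scale c, control-cost weight r and state cost q(x): substituting v = −log z into the minimized HJB and imposing c² = b²/r leaves exactly q z − a z_x − ½ c² z_xx.

import sympy as sp

x = sp.symbols('x')
b, c, r = sp.symbols('b c r', positive=True)       # constant gain, noise scale, control weight
a, q, z = (sp.Function(f)(x) for f in ('a', 'q', 'z'))

v = -sp.log(z)                                      # z = exp(-v)
vx, vxx = sp.diff(v, x), sp.diff(v, x, 2)

# Scalar HJB after minimizing over u:  0 = q + a v_x + (1/2) c^2 v_xx - (1/2)(b^2/r) v_x^2
hjb = q + a*vx + sp.Rational(1, 2)*c**2*vxx - sp.Rational(1, 2)*(b**2/r)*vx**2

# Impose c^2 = b^2/r (the scalar version of C C^T = B R^{-1} B^T) and multiply through by z.
lhs = sp.expand(hjb.subs(c, b/sp.sqrt(r)) * z)

# The linear HJB  q z = L[z] = a z_x + (1/2) c^2 z_xx  should be what remains.
linear = q*z - a*sp.diff(z, x) - sp.Rational(1, 2)*(b**2/r)*sp.diff(z, x, 2)
print(sp.simplify(lhs - linear))                    # prints 0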


Quadratic control cost and KL divergence

The KL divergence between two Gaussians with means μ₁, μ₂ and common full-rank covariance Σ is ½ (μ₁ − μ₂)^T Σ^{-1} (μ₁ − μ₂).

Using Euler discretization of the controlled diffusion, the passive and controlled dynamics have means x + h a and x + h a + h B u respectively, and common covariance h C C^T. Thus the KL divergence control cost is

\frac{1}{2}\, h u^T B^T \big( h C C^T \big)^{-1} h B u = \frac{h}{2}\, u^T B^T \big( B R^{-1} B^T \big)^{-1} B u = \frac{h}{2}\, u^T R u

This is the quadratic control cost accumulated over time h.

[Figure: the passive and controlled one-step Gaussians centered near x; their means differ by h B u.]

Here we used C C^T = B R^{-1} B^T and assumed that B is full rank. If B is rank-deficient, the same result holds but the Gaussians are defined over the subspace spanned by the columns of B.

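A quick numerical sketch of the identity (dimensions and matrices are made up): when CC^T = BR^{-1}B^T and B is full rank, the Gaussian KL divergence between the Euler-discretized controlled and passive one-step distributions equals (h/2) u^T R u.

import numpy as np

rng = np.random.default_rng(1)
n, h = 3, 0.01
B = rng.standard_normal((n, n))                  # full-rank control gain (made up)
A = rng.standard_normal((n, n))
R = A @ A.T + n * np.eye(n)                      # positive-definite control cost matrix
C = B @ np.linalg.cholesky(np.linalg.inv(R))     # choose C so that C C^T = B R^{-1} B^T
u = rng.standard_normal(n)

mean_diff = h * B @ u                            # controlled mean minus passive mean
Sigma = h * C @ C.T                              # common one-step covariance
kl = 0.5 * mean_diff @ np.linalg.solve(Sigma, mean_diff)
print(kl, 0.5 * h * u @ R @ u)                   # the two numbers agree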

Summary of results

                   discrete time:                       continuous time:
first exit         exp(q) z = P[z]                      q z = L[z]
finite horizon     exp(q_k) z_k = P_k[z_{k+1}]          q z − z_t = L[z]
average cost       exp(q − c) z = P[z]                  (q − c) z = L[z]
discounted cost    exp(q) z = P[z^α]                    q z + z log(z^α) = L[z]

The relation between P[z] and L[z] is

P[z](x) = E_{x' \sim p(\cdot|x)}[z(x')]

L[z](x) = \lim_{h \downarrow 0} \frac{E_{x' \sim p^{(h)}(\cdot|x)}[z(x')] - z(x)}{h} = \lim_{h \downarrow 0} \frac{P^{(h)}[z](x) - z(x)}{h}

P^{(h)}[z](x) = z(x) + h\, L[z](x) + o(h)



Path-integral representation

We can unfold the linear Bellman equation (first exit) as

z(x) = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}[z(x')]
     = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}\big[ \exp(-q(x'))\, E_{x'' \sim p(\cdot|x')}[z(x'')] \big]
     = \cdots
     = E_{x_0 = x,\; x_{k+1} \sim p(\cdot|x_k)}\left[ \exp\left( -q_T(x_{t_{\mathrm{first}}}) - \sum_{k=0}^{t_{\mathrm{first}}-1} q(x_k) \right) \right]

This is a path-integral representation of z. Since KL(p||p) = 0, we have

\exp\big( E_{\mathrm{optimal}}[-\text{total cost}] \big) = z(x) = E_{\mathrm{passive}}\big[ \exp(-\text{total cost}) \big]

In continuous problems, the Feynman-Kac theorem states that the unique positive solution z to the parabolic PDE q z = a^T z_x + \frac{1}{2} \mathrm{tr}(C C^T z_{xx}) has the same path-integral representation:

z(x) = E_{x(0)=x,\; dx = a(x) dt + C(x) d\omega}\left[ \exp\left( -q_T(x(t_{\mathrm{first}})) - \int_0^{t_{\mathrm{first}}} q(x(t))\, dt \right) \right]

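A sketch of the discrete path-integral formula on a made-up two-state-plus-terminal chain: exp(−accumulated cost) is averaged over passive rollouts and compared with the exact z from the linear solve.

import numpy as np

rng = np.random.default_rng(2)

# Made-up first-exit LMDP: states 0,1 non-terminal, state 2 terminal.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4],
              [0.0, 0.0, 1.0]])
q = np.array([0.2, 0.7])
qT = 0.0

# Exact solution of the linear Bellman equation.
QNN = np.diag(np.exp(-q))
z_exact = np.linalg.solve(np.eye(2) - QNN @ P[:2, :2], QNN @ (P[:2, 2] * np.exp(-qT)))

# Path-integral estimate: E_passive[ exp(-q_T(x_end) - sum_k q(x_k)) ].
def rollout(x):
    total = 0.0
    while x != 2:
        total += q[x]
        x = rng.choice(3, p=P[x])
    return np.exp(-(total + qT))

z_mc = [np.mean([rollout(x) for _ in range(30000)]) for x in (0, 1)]
print(np.round(z_exact, 3), np.round(z_mc, 3))   # the Monte Carlo estimates approach z_exact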


Model-free learning

The solution to the linear Bellman equation

z(x) = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}[z(x')]

can be approximated in a model-free way given samples (x_n, x'_n, q_n = q(x_n)) obtained from the passive dynamics x'_n ~ p(·|x_n).

One possibility is a Monte Carlo method based on the path-integral representation, although convergence can be slow:

\hat{z}(x) = \frac{1}{\#\ \text{trajectories starting at } x} \sum \exp(-\text{trajectory cost})

Faster convergence is obtained using temporal difference learning:

\hat{z}(x_n) \leftarrow (1-\beta)\, \hat{z}(x_n) + \beta \exp(-q_n)\, \hat{z}(x'_n)

The learning rate β should decrease over time.

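A sketch of the temporal-difference update (Z learning, as the following slides call it) on a made-up four-state chain: transitions are sampled from the passive dynamics only, and ẑ approaches the solution of the linear Bellman equation.

import numpy as np

rng = np.random.default_rng(3)

# Made-up first-exit LMDP: states 0,1,2 non-terminal, state 3 terminal (absorbing).
P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.5, 0.2],
              [0.0, 0.0, 0.0, 1.0]])
q = np.array([0.3, 0.5, 0.2, 0.0])           # state costs; q_T = 0 at the terminal state

z_hat = np.ones(4)                           # z_hat(terminal) = exp(-q_T) = 1 and stays fixed
for episode in range(20000):
    x = rng.integers(0, 3)                   # start each episode at a random non-terminal state
    while x != 3:
        x_next = rng.choice(4, p=P[x])       # sample the passive dynamics
        beta = 100.0 / (100.0 + episode)     # decreasing learning rate
        z_hat[x] = (1 - beta) * z_hat[x] + beta * np.exp(-q[x]) * z_hat[x_next]
        x = x_next

# Compare with the exact solution of the linear Bellman equation.
QNN = np.diag(np.exp(-q[:3]))
z_exact = np.linalg.solve(np.eye(3) - QNN @ P[:3, :3], QNN @ P[:3, 3])
print(np.round(z_hat[:3], 2), np.round(z_exact, 2))   # approximately equal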


Importance sampling

The expectation of a function f(x) under a distribution p(x) can be approximated as

E_{x \sim p(\cdot)}[f(x)] \approx \frac{1}{N} \sum_n f(x_n)

where {x_n}, n = 1...N, are i.i.d. samples from p(·).

However, if f(x) has interesting behavior in regions where p(x) is small, convergence can be slow, i.e. we may need a very large N to obtain an accurate approximation. In the case of Z learning, the passive dynamics may rarely take the state to regions with low cost.

Importance sampling is a general (unbiased) method for speeding up convergence. Let q(x) be some other distribution which is better "adapted" to f(x), and let {x_n} now be samples from q(·). Then

E_{x \sim p(\cdot)}[f(x)] \approx \frac{1}{N} \sum_n \frac{p(x_n)}{q(x_n)} f(x_n)

This is essential for particle filters.

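A generic sketch of the estimator (the target f, nominal p and proposal q below are all made up): a rare-event probability under p is estimated poorly by naive sampling from p and far more accurately by sampling from a better-adapted q and reweighting by p/q.

import math
import numpy as np

rng = np.random.default_rng(4)
N = 10000

# Rare-event example: estimate E_p[f(x)] = P(x > 4) under p = N(0,1).
f = lambda x: (x > 4.0).astype(float)
log_p = lambda x: -0.5 * x**2                     # log-densities up to a common constant
log_q = lambda x: -0.5 * (x - 4.0)**2             # proposal q = N(4,1), adapted to f

exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))

x_p = rng.standard_normal(N)                      # naive Monte Carlo: sample from p
naive = np.mean(f(x_p))                           # typically 0 or far off at this N

x_q = 4.0 + rng.standard_normal(N)                # importance sampling: sample from q
weights = np.exp(log_p(x_q) - log_q(x_q))         # p(x)/q(x); the constants cancel
weighted = np.mean(weights * f(x_q))

print(exact, naive, weighted)                     # the importance-sampling estimate is close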


Greedy Z learning

Let û(x'|x) denote the greedy control law, i.e. the control law which would be optimal if the current approximation ẑ(x) were the exact desirability function. Then we can sample from û rather than p and use importance sampling:

\hat{z}(x_n) \leftarrow (1-\beta)\, \hat{z}(x_n) + \beta\, \frac{p(x'_n|x_n)}{\hat{u}(x'_n|x_n)}\, \exp(-q_n)\, \hat{z}(x'_n)

We now need access to the model p(x'|x) of the passive dynamics.

[Figure: grid world with a goal state marked X. Cost-to-go approximation error and log total cost vs. number of state transitions (0 to 50,000) for Q-learning with passive exploration, Q-learning with ε-greedy exploration, Z-learning with passive sampling, and Z-learning with greedy sampling.]

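A sketch of the greedy variant on the same made-up chain as the earlier Z-learning sketch: next states are drawn from the greedy law û ∝ p·ẑ, and the importance weight p/û corrects the update.

import numpy as np

rng = np.random.default_rng(5)

# Same made-up first-exit LMDP as in the earlier Z-learning sketch.
P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.5, 0.2],
              [0.0, 0.0, 0.0, 1.0]])
q = np.array([0.3, 0.5, 0.2, 0.0])

z_hat = np.ones(4)
for episode in range(20000):
    x = rng.integers(0, 3)
    while x != 3:
        u_hat = P[x] * z_hat                      # greedy law u_hat(.|x) ~ p(.|x) z_hat(.)
        u_hat /= u_hat.sum()
        x_next = rng.choice(4, p=u_hat)           # sample the greedy law, not the passive dynamics
        w = P[x, x_next] / u_hat[x_next]          # importance weight p / u_hat
        beta = 100.0 / (100.0 + episode)
        z_hat[x] = (1 - beta) * z_hat[x] + beta * w * np.exp(-q[x]) * z_hat[x_next]
        x = x_next

QNN = np.diag(np.exp(-q[:3]))
z_exact = np.linalg.solve(np.eye(3) - QNN @ P[:3, :3], QNN @ P[:3, 3])
print(np.round(z_hat[:3], 2), np.round(z_exact, 2))   # approximately equal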


Maximum principle for the most likely trajectory

Recall that for finite-horizon LMDPs we have

u_k^*(x'|x) = \exp(-q(x))\, p(x'|x)\, \frac{z_{k+1}(x')}{z_k(x)}

The probability that the optimally-controlled stochastic system initialized at state x_0 generates a given trajectory x_1, x_2, ..., x_T is

p^*(x_1, x_2, \ldots, x_T \mid x_0) = \prod_{k=0}^{T-1} u_k^*(x_{k+1}|x_k)
 = \prod_{k=0}^{T-1} \exp(-q(x_k))\, p(x_{k+1}|x_k)\, \frac{z_{k+1}(x_{k+1})}{z_k(x_k)}
 = \frac{\exp(-q_T(x_T))}{z_0(x_0)} \prod_{k=0}^{T-1} \exp(-q(x_k))\, p(x_{k+1}|x_k)

Theorem (LMDP maximum principle)
The most likely trajectory under p* coincides with the optimal trajectory for a deterministic finite-horizon problem with final cost q_T(x), dynamics x' = f(x, u) where f can be arbitrary, and immediate cost ℓ(x, u) = q(x) − log p(f(x, u) | x).



Trajectory probabilities in continuous time

There is no formula for the probability of a trajectory under the Ito diffusion dx = a(x) dt + C dω. However, the relative probability of two trajectories ϕ(t) and ψ(t) can be defined using the Onsager-Machlup functional:

\mathrm{OM}[\varphi(\cdot), \psi(\cdot)] \triangleq \lim_{\varepsilon \to 0} \frac{p\big( \sup_t |x(t) - \varphi(t)| < \varepsilon \big)}{p\big( \sup_t |x(t) - \psi(t)| < \varepsilon \big)}

It can be shown that

\mathrm{OM}[\varphi(\cdot), \psi(\cdot)] = \exp\left( \int_0^T L(\psi(t), \dot{\psi}(t)) - L(\varphi(t), \dot{\varphi}(t))\, dt \right)

where

L[x, v] \triangleq \frac{1}{2} (a(x) - v)^T (C C^T)^{-1} (a(x) - v) + \frac{1}{2} \mathrm{div}(a(x))

We can then fix ψ(t) and define the relative probability of a trajectory as

p_{\mathrm{OM}}(\varphi(\cdot)) = \exp\left( -\int_0^T L(\varphi(t), \dot{\varphi}(t))\, dt \right)



Continuous-time maximum principle

It can be shown that the trajectory maximizing p_OM(·) under the optimally-controlled stochastic dynamics for the problem

dx = a(x)\, dt + B\, (u\, dt + \sigma\, d\omega)
\ell(x, u) = q(x) + \frac{1}{2\sigma^2} \|u\|^2

coincides with the optimal trajectory for the deterministic problem

\dot{x} = a(x) + B u
\ell(x, u) = q(x) + \frac{1}{2\sigma^2} \|u\|^2 + \frac{1}{2} \mathrm{div}(a(x))

Example:

dx = (a(x) + u)\, dt + \sigma\, d\omega
\ell(x, u) = \frac{1}{2\sigma^2} u^2

[Figure: the drift a(x) and its divergence div(a(x)) plotted over position x ∈ [−5, +5].]



Example

[Figure: for the 1D example, the functions r(x,t), z(x,t) and μ(x,t) plotted over position x ∈ [−5, +5] and time t ∈ [0, 5], for noise levels σ = 0.6 and σ = 1.2.]
