Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov...

45
Linearly-solvable optimal control Emo Todorov Applied Mathematics and Computer Science & Engineering University of Washington based on: E. Todorov, "Efficient computation of optimal actions", PNAS 2009 Emo Todorov (UW) Linearly-solvable optimal control 1 / 19

Transcript of Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov...

Page 1: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Linearly-solvable optimal control

Emo Todorov

Applied Mathematics and Computer Science & Engineering

University of Washington

based on:E. Todorov, "Efficient computation of optimal actions", PNAS 2009

Emo Todorov (UW) Linearly-solvable optimal control 1 / 19

Page 2: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Problem formulation

In traditional MDPs the controller chooses actions u which in turn specify thetransition probabilities p (x0jx, u). We can obtain a linearly-solvable MDP(LMDP) by allowing the controller to specify these probabilities directly:

x0 � u (�jx) controlled dynamics

x0 � p (�jx) passive dynamics

p (x0jx) = 0) u (x0jx) = 0 feasible control set

The running cost is in the form

` (x, u (�jx)) = q (x) +DKL (u (�jx) jjp (�jx))

Thus the controller can impose any dynamics it wishes, however it pays aprice (KL divergence control cost) for pushing the system away from itspassive dynamics.

Emo Todorov (UW) Linearly-solvable optimal control 2 / 19

Page 3: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Problem formulation

In traditional MDPs the controller chooses actions u which in turn specify thetransition probabilities p (x0jx, u). We can obtain a linearly-solvable MDP(LMDP) by allowing the controller to specify these probabilities directly:

x0 � u (�jx) controlled dynamics

x0 � p (�jx) passive dynamics

p (x0jx) = 0) u (x0jx) = 0 feasible control set

The running cost is in the form

` (x, u (�jx)) = q (x) +DKL (u (�jx) jjp (�jx))

Thus the controller can impose any dynamics it wishes, however it pays aprice (KL divergence control cost) for pushing the system away from itspassive dynamics.

Emo Todorov (UW) Linearly-solvable optimal control 2 / 19

Page 4: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Reducing optimal control to a linear problemThe Bellman equation for a first-exit LMPD is

v (x) = minu(�jx)

�q (x) + Ex0�u(�jx)

�log

u (x0jx)p (x0jx)

�+ Ex0�u(�jx)

�v�x0���

= q (x) + minu(�jx)

Ex0�u(�jx)

�log

u (x0)p (x0jx) exp (�v (x0))

�The last term is an unnormalized KL divergence,

thus the optimal control is

u��x0jx

�=

p (x0jx) z (x0)P [z] (x)

z (x) , exp (�v (x)) desirability function

P [z] (x) , ∑x0 p (x0jx) z (x0) next-state expectation

The problem is now linear in the desirability function:

z (x) = exp (�q (x))P [z] (x)

Emo Todorov (UW) Linearly-solvable optimal control 3 / 19

Page 5: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Reducing optimal control to a linear problemThe Bellman equation for a first-exit LMPD is

v (x) = minu(�jx)

�q (x) + Ex0�u(�jx)

�log

u (x0jx)p (x0jx)

�+ Ex0�u(�jx)

�v�x0���

= q (x) + minu(�jx)

Ex0�u(�jx)

�log

u (x0)p (x0jx) exp (�v (x0))

�The last term is an unnormalized KL divergence, thus the optimal control is

u��x0jx

�=

p (x0jx) z (x0)P [z] (x)

z (x) , exp (�v (x)) desirability function

P [z] (x) , ∑x0 p (x0jx) z (x0) next-state expectation

The problem is now linear in the desirability function:

z (x) = exp (�q (x))P [z] (x)

Emo Todorov (UW) Linearly-solvable optimal control 3 / 19

Page 6: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Reducing optimal control to a linear problemThe Bellman equation for a first-exit LMPD is

v (x) = minu(�jx)

�q (x) + Ex0�u(�jx)

�log

u (x0jx)p (x0jx)

�+ Ex0�u(�jx)

�v�x0���

= q (x) + minu(�jx)

Ex0�u(�jx)

�log

u (x0)p (x0jx) exp (�v (x0))

�The last term is an unnormalized KL divergence, thus the optimal control is

u��x0jx

�=

p (x0jx) z (x0)P [z] (x)

z (x) , exp (�v (x)) desirability function

P [z] (x) , ∑x0 p (x0jx) z (x0) next-state expectation

The problem is now linear in the desirability function:

z (x) = exp (�q (x))P [z] (x)

Emo Todorov (UW) Linearly-solvable optimal control 3 / 19

Page 7: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Illustration

z(x’)

p(x’|x)u*(x’|x) ~ p(x’|x) z(x’)

x’: sampled from u*(x’|x)x

Emo Todorov (UW) Linearly-solvable optimal control 4 / 19

Page 8: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Linearly-solvable controlled diffusions

dx = a (x) dt+ B (x) (udt+ σdω)

` (x, u) = q (x) +1

2σ2 kuk2

Defining z = exp (�v) as before, the optimal control law is

u� (x) = σ2B (x)Tzx (x)z (x)

and the (first-exit) HJB equation becomes

q (x) z (x) = L [z] (x)

where L is the generator of the passive dynamics:

L [z] (x) = a (x)T zx (x) +σ2

2trace

�B (x)B (x)T zxx (x)

�Emo Todorov (UW) Linearly-solvable optimal control 5 / 19

Page 9: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Relation between the two problems

The above diffusion is the h! 0 limit of discrete-time continuous-spaceLMDPs with passive dynamics

p(h)�x0jx

�= N

�x+ ha (x) , hσ2B (x)B (x)T

The LMDP state cost is hq (x). From the KL divergence formula for Gaussians,the LMDP control cost reduces to the diffusion control cost h

2σ2 kuk2:

x

passive

controlled

hBu DKL (N (m1, Σ) jjN (m2, Σ))

=12(m1 �m2)

T Σ�1 (m1 �m2)

Thus LMDPs can approximate linearly-solvable diffusions, but they can alsoapproximate other dynamics, e.g. contact dynamics.

Emo Todorov (UW) Linearly-solvable optimal control 6 / 19

Page 10: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Relation between the two problems

The above diffusion is the h! 0 limit of discrete-time continuous-spaceLMDPs with passive dynamics

p(h)�x0jx

�= N

�x+ ha (x) , hσ2B (x)B (x)T

�The LMDP state cost is hq (x). From the KL divergence formula for Gaussians,the LMDP control cost reduces to the diffusion control cost h

2σ2 kuk2:

x

passive

controlled

hBu DKL (N (m1, Σ) jjN (m2, Σ))

=12(m1 �m2)

T Σ�1 (m1 �m2)

Thus LMDPs can approximate linearly-solvable diffusions, but they can alsoapproximate other dynamics, e.g. contact dynamics.

Emo Todorov (UW) Linearly-solvable optimal control 6 / 19

Page 11: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Relation between the two problems

The above diffusion is the h! 0 limit of discrete-time continuous-spaceLMDPs with passive dynamics

p(h)�x0jx

�= N

�x+ ha (x) , hσ2B (x)B (x)T

�The LMDP state cost is hq (x). From the KL divergence formula for Gaussians,the LMDP control cost reduces to the diffusion control cost h

2σ2 kuk2:

x

passive

controlled

hBu DKL (N (m1, Σ) jjN (m2, Σ))

=12(m1 �m2)

T Σ�1 (m1 �m2)

Thus LMDPs can approximate linearly-solvable diffusions, but they can alsoapproximate other dynamics, e.g. contact dynamics.

Emo Todorov (UW) Linearly-solvable optimal control 6 / 19

Page 12: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Summary of (mostly) linear Bellman equations

LMDP : Diffusion :

first exit exp (q) z = P [z] qz = L [z]finite horizon exp (qk) zk = Pk [zk+1] qz� zt = L [z]average cost exp (q� c) z = P [z] (q� c) z = L [z]discounted cost exp (q) z = P [zα] z log (zα) = L [z]

The operator P (h) in the diffusion-approximating LMDP is related to L as

P (h) [z] (x) = z (x) + hL [z] (x) + o�

h2�

Emo Todorov (UW) Linearly-solvable optimal control 7 / 19

Page 13: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Comparison to generic MDP approximation

0 40 80 120

z iter

policy iter

100502510

CPU time (sec)

aver

age

costP

­3 +3­9

+9

horizontal position

tang

entia

lvel

ocity

optimal control (z iter) optimal cost­to­go (z iter) optimal cost­to­go (policy iter)

0 10 102 103 104

number of updates

aver

age

cost

z pol. val.

Emo Todorov (UW) Linearly-solvable optimal control 8 / 19

Page 14: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Z-learning

The solution to the linear Bellman equation

z (x) = exp (�q (x))Ex0�p(�jx)�z�x0��

can be approximated in a model-free way given samples (xn, x0n, qn = q (xn))obtained from the passive dynamics x0n � p (�jxn).

One possibility is a Monte Carlo method based on the path integralrepresentation, although covergence can be slow:

bz (x) = 1# trajectoriesstarting at x

∑ exp (� trajectory cost)

Faster convergence is obtained using temporal difference learning:

bz (xn) (1� β)bz (xn) + β exp (�qn)bz �x0n�The learning rate β should decrease over time.

Emo Todorov (UW) Linearly-solvable optimal control 9 / 19

Page 15: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Z-learning

The solution to the linear Bellman equation

z (x) = exp (�q (x))Ex0�p(�jx)�z�x0��

can be approximated in a model-free way given samples (xn, x0n, qn = q (xn))obtained from the passive dynamics x0n � p (�jxn).

One possibility is a Monte Carlo method based on the path integralrepresentation, although covergence can be slow:

bz (x) = 1# trajectoriesstarting at x

∑ exp (� trajectory cost)

Faster convergence is obtained using temporal difference learning:

bz (xn) (1� β)bz (xn) + β exp (�qn)bz �x0n�The learning rate β should decrease over time.

Emo Todorov (UW) Linearly-solvable optimal control 9 / 19

Page 16: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Z-learning

The solution to the linear Bellman equation

z (x) = exp (�q (x))Ex0�p(�jx)�z�x0��

can be approximated in a model-free way given samples (xn, x0n, qn = q (xn))obtained from the passive dynamics x0n � p (�jxn).

One possibility is a Monte Carlo method based on the path integralrepresentation, although covergence can be slow:

bz (x) = 1# trajectoriesstarting at x

∑ exp (� trajectory cost)

Faster convergence is obtained using temporal difference learning:

bz (xn) (1� β)bz (xn) + β exp (�qn)bz �x0n�The learning rate β should decrease over time.

Emo Todorov (UW) Linearly-solvable optimal control 9 / 19

Page 17: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Importance sampling

Let bu (x0jx) denote the greedy control law, i.e. the control law which would beoptimal if the current approximation bz (x) were the exact desirability function.Then we can sample from bu rather than p and use importance sampling:

bz (xn) (1� β)bz (xn) + βp (x0njxn)bu (x0njxn)

exp (�qn)bz �x0n�We now need access to the model p (x0jx) of the passive dynamics.

X

Q­learning, passiveQ­learning, e­greedyZ­learning, passiveZ­learning, greedy

state transitions

cost

­to­g

oap

prox

imat

ion

erro

r

0 50,000state transitions

0 50,000

log

tota

lcos

t

Emo Todorov (UW) Linearly-solvable optimal control 10 / 19

Page 18: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Importance sampling

Let bu (x0jx) denote the greedy control law, i.e. the control law which would beoptimal if the current approximation bz (x) were the exact desirability function.Then we can sample from bu rather than p and use importance sampling:

bz (xn) (1� β)bz (xn) + βp (x0njxn)bu (x0njxn)

exp (�qn)bz �x0n�We now need access to the model p (x0jx) of the passive dynamics.

X

Q­learning, passiveQ­learning, e­greedyZ­learning, passiveZ­learning, greedy

state transitions

cost

­to­g

oap

prox

imat

ion

erro

r

0 50,000state transitions

0 50,000lo

gto

talc

ost

Emo Todorov (UW) Linearly-solvable optimal control 10 / 19

Page 19: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Estimation-control dualityThe probability of a trajectory under the passive dynamics is

p (x1, x2, � � � xTjx0) = ∏T�1k=0 p (xk+1jxk)

The probability of a trajectory under the optimally-controlled dynamics is

µ (x1, x2, � � � xTjx0) ∝ exp��∑T

k=1 q (xk)�

p (x1, x2, � � � xTjx0)

This is equivalent to Bayesian inference where p is the prior, µ the posterior,and q the negative log-likelihood (of some unspecified measurements).

Let µk (x) be the marginal of the trajectory density at time step k, and definerk (x) = µk (x) /zk (x). Then

zk (x) = exp (�q (x))∑x0 p�x0jx

�zk+1

�x0�

rk+1�x0�= ∑x exp (�q (x)) p

�x0jx

�rk (x)

µk (x) = rk (x) zk (x)

This is equivalent to recursive Bayesian inference where rk is the filteringdensity, zk the backward filtering density, and µk the posterior.

Emo Todorov (UW) Linearly-solvable optimal control 11 / 19

Page 20: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Estimation-control dualityThe probability of a trajectory under the passive dynamics is

p (x1, x2, � � � xTjx0) = ∏T�1k=0 p (xk+1jxk)

The probability of a trajectory under the optimally-controlled dynamics is

µ (x1, x2, � � � xTjx0) ∝ exp��∑T

k=1 q (xk)�

p (x1, x2, � � � xTjx0)

This is equivalent to Bayesian inference where p is the prior, µ the posterior,and q the negative log-likelihood (of some unspecified measurements).

Let µk (x) be the marginal of the trajectory density at time step k, and definerk (x) = µk (x) /zk (x). Then

zk (x) = exp (�q (x))∑x0 p�x0jx

�zk+1

�x0�

rk+1�x0�= ∑x exp (�q (x)) p

�x0jx

�rk (x)

µk (x) = rk (x) zk (x)

This is equivalent to recursive Bayesian inference where rk is the filteringdensity, zk the backward filtering density, and µk the posterior.

Emo Todorov (UW) Linearly-solvable optimal control 11 / 19

Page 21: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

General case

Continuous: Discrete:

dx = a (x) dt+ B (x) (udt+ dω) xt+1 � u (�jxt)

` (x, u) = q (x) + 12 kuk

2 ` (x, u) = q (x) +DKL (u (�jx) jj p (�jx))

is dual to is dual to

dx = a (x) dt+ B (x) dω xt+1 � p (�jxt)

dy = h (x) dt+ dν yt � Bernoulli (r (xt))

when when

q (x) = 12 kh (x)k

2 q (x) = � log (r (x))

Emo Todorov (UW) Linearly-solvable optimal control 12 / 19

Page 22: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Linear case

dynamics: dx = Axdt+ Budt+ Cdω

cost rate: ` (x, u) = 12 xTQx+ 1

2 kuk2

measurement: dy = Hxdt+ dν

cost-to-go Hessian: �V = Q+ATV+VA�VBBTV

covariance: S = CCT +AS+ SAT � SHTHS

inverse covariance: P = HTH�ATP� PA� PCCTP

Duality relations:

linear-quadratic regulator: V A B Q t

Kalman-Bucy filter: S AT HT CCT tf � t

information filter: P �A C HTH tf � t

Emo Todorov (UW) Linearly-solvable optimal control 13 / 19

Page 23: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Linear case

dynamics: dx = Axdt+ Budt+ Cdω

cost rate: ` (x, u) = 12 xTQx+ 1

2 kuk2

measurement: dy = Hxdt+ dν

cost-to-go Hessian: �V = Q+ATV+VA�VBBTV

covariance: S = CCT +AS+ SAT � SHTHS

inverse covariance: P = HTH�ATP� PA� PCCTP

Duality relations:

linear-quadratic regulator: V A B Q t

Kalman-Bucy filter: S AT HT CCT tf � t

information filter: P �A C HTH tf � t

Emo Todorov (UW) Linearly-solvable optimal control 13 / 19

Page 24: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Maximum principle for the most likely trajectory

The maximum of µ is the optimal trajectory for a deterministic problem withany dynamics x0 = f (x, u) and cost ` (x, u) = q (x)� log p (f (x, u) , x).

For diffusions with B = const the corresponding deterministic problem hasdynamics x = a (x) and cost ` (x, u) = q (x) + 1

2σ2 kuk2 + 12 div (a (x)).

dx = a (x) dt+ udt+ σdω

q = 0

qT = x2

­5 +50

0

­2

+2

position (x)

a(x)div(a(x))

0 5

0

+5

­5

time (t)

po

siti

on

(x)

r(x,t) z(x,t)mu(x,t)

sigma = 0.6

sigma = 1.2

Emo Todorov (UW) Linearly-solvable optimal control 14 / 19

Page 25: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Maximum principle for the most likely trajectory

The maximum of µ is the optimal trajectory for a deterministic problem withany dynamics x0 = f (x, u) and cost ` (x, u) = q (x)� log p (f (x, u) , x).

For diffusions with B = const the corresponding deterministic problem hasdynamics x = a (x) and cost ` (x, u) = q (x) + 1

2σ2 kuk2 + 12 div (a (x)).

dx = a (x) dt+ udt+ σdω

q = 0

qT = x2

­5 +50

0

­2

+2

position (x)

a(x)div(a(x))

0 5

0

+5

­5

time (t)

po

siti

on

(x)

r(x,t) z(x,t)mu(x,t)

sigma = 0.6

sigma = 1.2

Emo Todorov (UW) Linearly-solvable optimal control 14 / 19

Page 26: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Compositionality

Consider a finite-horizon problem with final cost

qT (x) = � log�∑k wk exp (�gk (x))

�Let zk (x, t), u�k (x, t) be the desirability functions and optimal control laws forcomponent problems with final costs gk (x).Then the desirability function for the composite problem is

z (x, t) = ∑k wkzk (x, t)

and the optimal control law is

u� (x, t) = ∑kwkzk (x, t)

z (x, t)u�k (x, t)

Emo Todorov (UW) Linearly-solvable optimal control 15 / 19

Page 27: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Application to LQG

dx = u dt+ dω, ` (x, u) =12

u2, gk (x) =12(x� ck)

2

Emo Todorov (UW) Linearly-solvable optimal control 16 / 19

Page 28: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Function approximation

Infinite-horizon average-cost discrete-time continuous-space problem:

λ z (x) = exp (�q (x))Z

p�x0jx

�z�x0�

dx0

Select a collocation set fxng. Define a function approximator

bz (x; w, θ) = ∑i wi fi (x; θ)

Construct the (usually sparse) matrices F, G with elements

Fn,i = fi (xn; θ)

Gn,i = exp (�q (xn))Z

p�x0jxn

�fi�x0; θ

�dx0

The above functional equation now becomes an algebraic equation:

λ F (θ)w = G (θ)w

Solving for λ, w is a linear problem; θ can be adapted via gradient descent.

Emo Todorov (UW) Linearly-solvable optimal control 17 / 19

Page 29: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Function approximation

Infinite-horizon average-cost discrete-time continuous-space problem:

λ z (x) = exp (�q (x))Z

p�x0jx

�z�x0�

dx0

Select a collocation set fxng. Define a function approximator

bz (x; w, θ) = ∑i wi fi (x; θ)

Construct the (usually sparse) matrices F, G with elements

Fn,i = fi (xn; θ)

Gn,i = exp (�q (xn))Z

p�x0jxn

�fi�x0; θ

�dx0

The above functional equation now becomes an algebraic equation:

λ F (θ)w = G (θ)w

Solving for λ, w is a linear problem; θ can be adapted via gradient descent.

Emo Todorov (UW) Linearly-solvable optimal control 17 / 19

Page 30: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Example: 40 adaptive Gaussian bases

car­on­a­hillcycle

7.48

4.63

positionve

loci

ty

state cost7.51

4.66

metronome cycle

0 100CG iteration

log 1

0re

sidu

al

optimal control approximation

Emo Todorov (UW) Linearly-solvable optimal control 18 / 19

Page 31: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Remarks on function approximation

The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.

Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.The computational bottleneck is the evaluation of

Rp (x0jxn) fi (x0; θ) dx0.

If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.

Emo Todorov (UW) Linearly-solvable optimal control 19 / 19

Page 32: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Remarks on function approximation

The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.

The computational bottleneck is the evaluation ofR

p (x0jxn) fi (x0; θ) dx0.If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.

Emo Todorov (UW) Linearly-solvable optimal control 19 / 19

Page 33: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Remarks on function approximation

The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.The computational bottleneck is the evaluation of

Rp (x0jxn) fi (x0; θ) dx0.

If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.

Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.

Emo Todorov (UW) Linearly-solvable optimal control 19 / 19

Page 34: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Remarks on function approximation

The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.The computational bottleneck is the evaluation of

Rp (x0jxn) fi (x0; θ) dx0.

If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.

Emo Todorov (UW) Linearly-solvable optimal control 19 / 19

Page 35: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Trajectory optimization

Emo Todorov

Applied Mathematics Computer Science and Engineering

University of Washington

Page 36: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

MuJoCo: A physics engine for control

recursive algorithms for smooth dynamics new algorithms for contact dynamics modeling of tendons, muscles, pneumatics 100 times faster than real-time (single core, 23 DOF humanoid) parallel evaluation for finite differences

Page 37: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Trajectory optimization methods General procedure: - divide the relevant variables (positions, velocities, torques, contact forces) into independent and dependent sets; compute the latter given the former - if the independent variables are physically coupled, impose constraints - use contact smoothing to ensure that the cost function is differentiable - apply a suitable 2nd-order optimization method, usually with continuation

Independent Dependent (computation) Constraints Smoothing

Torques Positions (integration) Velocities (forward dynamics) Contact forces (contact solver)

n/a Iterative contact approximation

Torques Contact forces

Positions (integration) Velocities (forward dynamics)

Contact & friction Soft constraints

Positions Velocities (differentiation) Torques (inverse dynamics) Contact forces (inverse contact solver)

Under-actuation Finite differences, Convex contact model

Positions Contact forces

Velocities (differentiation) Torques (inverse dynamics)

Under-actuation Contact & friction

Soft constraints

Positions Velocities Torques Contact forces

n/a Differentiation Dynamics Contact & friction

Soft constraints

Page 38: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Model-predictive control (MPC)

At time step t, solve a trajectory optimization problem of the form Larger horizon N results in better performance but requires more computation. The final cost h(x) should approximate all future costs incurred after time t+N.

( ) ( )∑−+

=+ +

1

,minNt

tkkkNt uxxh

There is always a plan, the plan is re-optimized all the time, only the initial portion is ever executed.

The optimizer is based on the iLQG method described in Todorov and Li, ACC 2005, with further improvements described in Tassa, Erez and Todorov, IROS 2012.

Page 39: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

The cost has the extra term where e is the vector distance to the nearest surface. The planned contact impulse f is penalized in full, but scaled by c before being applied.

Contact-aware optimization

We introduce auxiliary decision variables indicating if potential contact i should be active in movement phase . These variables affect the dynamics and cost function, and are optimized along with the movement trajectory.

Mordatch, Todorov and Popovic, SIGGRAPH 2o12 Mordatch, Popovic and Todorov, SCA 2012

Page 40: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Results Kulchenko and Todorov, ICRA 2011 MPC: plan with iLQG until next contact (adaptive horizon). Contact dynamics handled implicitly: heuristic value function used as final cost (encodes what is a good way to hit the ball).

Tassa, Erez and Todorov, IROS 2012 MPC: plan with iLQG 0.5 sec into the future (fixed horizon). Smoothing due to soft approximation to contact.

Mordatch, Todorov and Popovic, SIGRAPH 2012 Space-time optimization: spline representation of trajectory. Contact smoothing due to finite differencing. Auxiliary decision variables for all potential contacts. Continuation: contact forces from a distance.

Mordatch, Popovic and Todorov, SCA 2012 Similar to above, but contact forces and their origins are now independent variables, and contact smoothing is due to soft constraints.

Erez, Tassa and Todorov, IROS 2012 Space-time optimization: Fourier representation of trajectory. Contact smoothing due to inverse contact solver. Continuation: root forces, contact forces from a distance.

Page 41: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Towards real-world applications

Drive utility vehicle to site Walk across rubble Remove debris blocking door Open door, enter building Climb ladder and stairs Use tool to break concrete panel Close leaking valve Replace component

DARPA Robotics Challenge

Page 42: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

The case for MPC in the brain

yet most neurons in your brain are doing something when you move... perhaps online re-optimization of the movement?

complex “hardwired” behaviors can be generated with few neurons

we usually think of learning/optimization as setting up a similar neural machine, which is then responsible for online generation of behavior

small brain: box of chocolates / behaviors

large brain: chocolate factory / behavior optimizer

If you have expensive machinery, you should use it all the time.

Page 43: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Contact smoothing via soft constraints

If the contact forces are treated as independent variables and physical realism is enforced via penalty functions, then the cost function is differentiable

Coulomb friction model

complementarity:

A contact inverse inertia:

b contact velocity before impulse

f contact impulse

v contact velocity after impulse

TJJMA 1−=0,0,0 =≥≥ xyyx

0, ≤=

−⊥

⊥+=

αα

µ

TT

TNT

NN

fvffv

fvbfAv yx ⊥

( )

( )...

,min2

22

2

yxyx

yx

+−+

Penalty functions instead of hard constraints: replace with one of yx ⊥

Page 44: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

The contact impulse f is the solution to a linear equation: subject to complicated but easy-to-enforce constraints: Iterative solvers tend to work well and produce smooth results: - solve the linear equation - enforce the constraints softly - repeat a couple of times

Contact smoothing via iterative approximations

bfAv +=

0, ≤=

−⊥

αα

µ

TT

TNT

NN

fvffv

fv

Page 45: Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov (UW) Linearly-solvable optimal control 9 / 19. Importance sampling Let bu(x0jx) denote

Convex, smooth and invertible contact model

Define the contact impulse by minimizing contact-space kinetic energy subject to for each contact. Replace the constraints with penalty functions

vAvT 1

21 −

TNNN ffvf ≥≥≥ µ,0,0

( ) ( )bAfdfdbfAfff VFTT

f ++++=21minarg*

( ) ( )vdfd VF ,

( ) ( )bAfvfbA +=→ ,,

( ) ( )AfvbfvA −=→ ,,

Forward contact dynamics:

Inverse contact dynamics:

( )( ) ( )fdvdAvff FVT

f +∇+= minarg*

0 200 400 6004

2

0

2

4

time (msec)

verticalvelocity

normalimpulse

ball drop test

Todorov, ICRA 2011