Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov...

Linearly-solvable optimal control

Emo Todorov

Applied Mathematics and Computer Science & Engineering

University of Washington

based on: E. Todorov, "Efficient computation of optimal actions", PNAS 2009

Emo Todorov (UW) Linearly-solvable optimal control 1 / 19

Problem formulation

In traditional MDPs the controller chooses actions $u$ which in turn specify the transition probabilities $p(x'|x,u)$. We can obtain a linearly-solvable MDP (LMDP) by allowing the controller to specify these probabilities directly:

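\[ x' \sim u(\cdot|x) \qquad \text{controlled dynamics} \]

\[ x' \sim p(\cdot|x) \qquad \text{passive dynamics} \]

\[ p(x'|x) = 0 \;\Rightarrow\; u(x'|x) = 0 \qquad \text{feasible control set} \]

The running cost is in the form

\[ \ell(x, u(\cdot|x)) = q(x) + D_{\mathrm{KL}}\big(u(\cdot|x) \,\|\, p(\cdot|x)\big) \]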

Thus the controller can impose any dynamics it wishes, however it pays a price (KL divergence control cost) for pushing the system away from its passive dynamics.


Reducing optimal control to a linear problem

The Bellman equation for a first-exit LMDP is

\[ v(x) = \min_{u(\cdot|x)} \left\{ q(x) + E_{x' \sim u(\cdot|x)}\left[ \log \frac{u(x'|x)}{p(x'|x)} \right] + E_{x' \sim u(\cdot|x)}\big[ v(x') \big] \right\} \]

\[ \phantom{v(x)} = q(x) + \min_{u(\cdot|x)} E_{x' \sim u(\cdot|x)}\left[ \log \frac{u(x'|x)}{p(x'|x) \exp(-v(x'))} \right] \]

The last term is an unnormalized KL divergence, thus the optimal control is

\[ u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)} \]

\[ z(x) \triangleq \exp(-v(x)) \qquad \text{desirability function} \]

\[ P[z](x) \triangleq \sum_{x'} p(x'|x)\, z(x') \qquad \text{next-state expectation} \]

The problem is now linear in the desirability function:

\[ z(x) = \exp(-q(x))\, P[z](x) \]
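The linear equation for the desirability function can be solved by simple fixed-point iteration on a small problem. The following is a minimal sketch, not the author's implementation: the 5-state chain, its state costs, the iteration count and all function names are illustrative assumptions.

```python
import math

def solve_first_exit_lmdp(p, q, terminal, iters=500):
    """Iterate z <- exp(-q) * P[z] on non-terminal states.

    Terminal states keep the boundary value z = exp(-q).
    """
    n = len(q)
    z = [math.exp(-q[x]) if x in terminal else 1.0 for x in range(n)]
    for _ in range(iters):
        z = [
            z[x] if x in terminal
            else math.exp(-q[x]) * sum(p[x][y] * z[y] for y in range(n))
            for x in range(n)
        ]
    return z

def optimal_control(p, z, x):
    """u*(x'|x) = p(x'|x) z(x') / P[z](x)."""
    n = len(z)
    pz = sum(p[x][y] * z[y] for y in range(n))
    return [p[x][y] * z[y] / pz for y in range(n)]

# Hypothetical chain: random-walk passive dynamics, state 4 is the exit.
p = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
q = [0.2, 0.2, 0.2, 0.2, 0.0]
terminal = {4}

z = solve_first_exit_lmdp(p, q, terminal)
v = [-math.log(zi) for zi in z]   # cost-to-go recovered from z = exp(-v)
u2 = optimal_control(p, z, 2)     # optimal transition probabilities out of state 2
```

In this example the optimal control out of state 2 shifts probability mass from state 1 to state 3, i.e. toward the exit, because the desirability is larger there.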




Illustration

[Figure: the optimal control reweights the passive dynamics by the desirability, u*(x'|x) ∝ p(x'|x) z(x'); the next state x' is sampled from u*(x'|x).]


Linearly-solvable controlled diffusions

\[ dx = a(x)\,dt + B(x)\,(u\,dt + \sigma\,d\omega) \]

\[ \ell(x,u) = q(x) + \frac{1}{2\sigma^2}\,\|u\|^2 \]

Defining $z = \exp(-v)$ as before, the optimal control law is

\[ u^*(x) = \sigma^2 B(x)^T \frac{z_x(x)}{z(x)} \]

and the (first-exit) HJB equation becomes

\[ q(x)\, z(x) = L[z](x) \]

where $L$ is the generator of the passive dynamics:

\[ L[z](x) = a(x)^T z_x(x) + \frac{\sigma^2}{2}\,\mathrm{trace}\big( B(x) B(x)^T z_{xx}(x) \big) \]

Relation between the two problems

The above diffusion is the $h \to 0$ limit of discrete-time continuous-space LMDPs with passive dynamics

\[ p^{(h)}(x'|x) = \mathcal{N}\big( x + h\,a(x),\; h\sigma^2 B(x) B(x)^T \big) \]

The LMDP state cost is $h\,q(x)$. From the KL divergence formula for Gaussians, the LMDP control cost reduces to the diffusion control cost $\frac{h}{2\sigma^2}\|u\|^2$:

\[ D_{\mathrm{KL}}\big( \mathcal{N}(m_1, \Sigma) \,\|\, \mathcal{N}(m_2, \Sigma) \big) = \tfrac{1}{2} (m_1 - m_2)^T \Sigma^{-1} (m_1 - m_2) \]

[Figure: passive and controlled transition densities starting from x; their means differ by hBu.]

Thus LMDPs can approximate linearly-solvable diffusions, but they can also approximate other dynamics, e.g. contact dynamics.
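The reduction of the control cost can be checked in a few lines. This is a worked sketch under the added assumption that B(x) is square and invertible, so that the covariance has an inverse; the controlled mean is taken to be the passive mean shifted by hBu, as in the diffusion dynamics.

```latex
% Assumption (not stated on the slide): B is square and invertible.
\begin{aligned}
m_1 - m_2 &= \big(x + h\,a(x) + h\,B(x)\,u\big) - \big(x + h\,a(x)\big) = h\,B u,
\qquad \Sigma = h\sigma^2 B B^T \\
D_{\mathrm{KL}} &= \tfrac{1}{2}\,(hBu)^T \big(h\sigma^2 B B^T\big)^{-1} (hBu)
 = \frac{h}{2\sigma^2}\; u^T B^T B^{-T} B^{-1} B\, u
 = \frac{h}{2\sigma^2}\,\|u\|^2
\end{aligned}
```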




Summary of (mostly) linear Bellman equations

Problem | LMDP | Diffusion
first exit | $\exp(q)\, z = P[z]$ | $q z = L[z]$
finite horizon | $\exp(q_k)\, z_k = P_k[z_{k+1}]$ | $q z - z_t = L[z]$
average cost | $\exp(q - c)\, z = P[z]$ | $(q - c)\, z = L[z]$
discounted cost | $\exp(q)\, z = P[z^\alpha]$ | $z \log(z^\alpha) = L[z]$

The operator $P^{(h)}$ in the diffusion-approximating LMDP is related to $L$ as

\[ P^{(h)}[z](x) = z(x) + h\,L[z](x) + o(h^2) \]


Comparison to generic MDP approximation

[Figure: comparison of z-iteration and policy iteration. Panels: average cost versus CPU time (sec) for z iter and policy iter; optimal control (z iter), optimal cost-to-go (z iter) and optimal cost-to-go (policy iter), each plotted over horizontal position and tangential velocity; average cost versus number of updates.]


Z-learning

The solution to the linear Bellman equation

\[ z(x) = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}\big[ z(x') \big] \]

can be approximated in a model-free way given samples $(x_n, x'_n, q_n = q(x_n))$ obtained from the passive dynamics $x'_n \sim p(\cdot|x_n)$.

One possibility is a Monte Carlo method based on the path integral representation, although convergence can be slow:

\[ \hat z(x) = \frac{1}{\#\,\text{trajectories starting at } x} \sum \exp(-\text{trajectory cost}) \]

Faster convergence is obtained using temporal difference learning:

\[ \hat z(x_n) \leftarrow (1 - \beta)\, \hat z(x_n) + \beta \exp(-q_n)\, \hat z(x'_n) \]

The learning rate $\beta$ should decrease over time.
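The temporal difference update can be sketched on a small tabular problem. The chain, the per-state learning-rate schedule beta = a / (a + visits) and the episode count below are illustrative assumptions; the talk only states that the learning rate should decrease over time. The learner uses the transition matrix only to simulate samples from the passive dynamics.

```python
import math
import random

def z_learning(p, q, terminal, n_episodes=10000, a=10.0, seed=0):
    """Tabular Z-learning from passive-dynamics samples (x, x', q(x))."""
    rng = random.Random(seed)
    n = len(q)
    z_hat = [math.exp(-q[x]) if x in terminal else 1.0 for x in range(n)]
    visits = [0] * n
    for _ in range(n_episodes):
        x = rng.randrange(n)                      # random start state
        while x not in terminal:
            # Sample the next state from the passive dynamics p(.|x).
            x_next = rng.choices(range(n), weights=p[x])[0]
            beta = a / (a + visits[x])            # assumed decreasing schedule
            visits[x] += 1
            z_hat[x] = (1 - beta) * z_hat[x] + beta * math.exp(-q[x]) * z_hat[x_next]
            x = x_next
    return z_hat

# Hypothetical first-exit chain: random walk on 5 states, state 4 is the exit.
p = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
q = [0.2, 0.2, 0.2, 0.2, 0.0]
terminal = {4}
z_hat = z_learning(p, q, terminal)
```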


Importance sampling

Let $\hat u(x'|x)$ denote the greedy control law, i.e. the control law which would be optimal if the current approximation $\hat z(x)$ were the exact desirability function. Then we can sample from $\hat u$ rather than $p$ and use importance sampling:

\[ \hat z(x_n) \leftarrow (1 - \beta)\, \hat z(x_n) + \beta\, \frac{p(x'_n|x_n)}{\hat u(x'_n|x_n)} \exp(-q_n)\, \hat z(x'_n) \]

We now need access to the model $p(x'|x)$ of the passive dynamics.

[Figure: Q-learning (passive), Q-learning (ε-greedy), Z-learning (passive) and Z-learning (greedy) compared over 50,000 state transitions. Panels: cost-to-go approximation error versus state transitions; log total cost versus state transitions.]
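The importance weight p / u_hat keeps the update target unbiased with respect to the passive dynamics. The sketch below checks this identity exactly on one hypothetical state; the transition row, the current estimate z_hat and the state cost are made-up values for illustration.

```python
import math

def greedy_control(p_row, z_hat):
    """u_hat(x'|x) proportional to p(x'|x) * z_hat(x')."""
    w = [pi * zi for pi, zi in zip(p_row, z_hat)]
    total = sum(w)
    return [wi / total for wi in w]

p_row = [0.0, 0.5, 0.0, 0.5, 0.0]   # hypothetical passive dynamics p(.|x)
z_hat = [0.1, 0.2, 0.3, 0.5, 1.0]   # hypothetical current approximation
q_x = 0.2                           # hypothetical state cost q(x)
u_hat = greedy_control(p_row, z_hat)

# Expected Z-learning target when sampling x' from the passive dynamics.
passive_target = sum(pi * math.exp(-q_x) * zi for pi, zi in zip(p_row, z_hat))

# Importance-weighted target for each x' that u_hat can generate.
per_sample = [
    (pi / ui) * math.exp(-q_x) * zi
    for pi, ui, zi in zip(p_row, u_hat, z_hat) if ui > 0
]

# Expected importance-weighted target when sampling x' from u_hat.
weighted_mean = sum(
    ui * (pi / ui) * math.exp(-q_x) * zi
    for pi, ui, zi in zip(p_row, u_hat, z_hat) if ui > 0
)
```

Because u_hat is proportional to p times z_hat, the ratio p / u_hat equals P[z_hat](x) / z_hat(x'), so in this sketch every entry of `per_sample` equals `passive_target`, whichever next state was drawn.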




Estimation-control duality

The probability of a trajectory under the passive dynamics is

\[ p(x_1, x_2, \cdots, x_T \,|\, x_0) = \prod_{k=0}^{T-1} p(x_{k+1}|x_k) \]

The probability of a trajectory under the optimally-controlled dynamics is

\[ \mu(x_1, x_2, \cdots, x_T \,|\, x_0) \propto \exp\Big( -\sum_{k=1}^{T} q(x_k) \Big)\; p(x_1, x_2, \cdots, x_T \,|\, x_0) \]

This is equivalent to Bayesian inference where $p$ is the prior, $\mu$ the posterior, and $q$ the negative log-likelihood (of some unspecified measurements).

Let $\mu_k(x)$ be the marginal of the trajectory density at time step $k$, and define $r_k(x) = \mu_k(x) / z_k(x)$. Then

\[ z_k(x) = \exp(-q(x)) \sum_{x'} p(x'|x)\, z_{k+1}(x') \]

\[ r_{k+1}(x') = \sum_{x} \exp(-q(x))\, p(x'|x)\, r_k(x) \]

\[ \mu_k(x) = r_k(x)\, z_k(x) \]

This is equivalent to recursive Bayesian inference where $r_k$ is the filtering density, $z_k$ the backward filtering density, and $\mu_k$ the posterior.
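The forward and backward recursions can be checked against brute-force enumeration of trajectories on a tiny problem. This is a sketch under stated assumptions: the 3-state transition matrix and costs are made up, the terminal condition z_T(x) = exp(-q(x)) and the initial condition r_0 = indicator of x0 are assumed, and the product r_k z_k is normalized at each step.

```python
import itertools
import math

def forward_backward(p, q, x0, T):
    """Marginals mu_k proportional to r_k * z_k, for k = 0..T."""
    n = len(q)
    e = [math.exp(-qi) for qi in q]
    # Backward pass: z_k(x) = exp(-q(x)) * sum_x' p(x'|x) z_{k+1}(x').
    z = [None] * (T + 1)
    z[T] = e[:]                                   # assumed terminal condition
    for k in range(T - 1, -1, -1):
        z[k] = [e[x] * sum(p[x][y] * z[k + 1][y] for y in range(n)) for x in range(n)]
    # Forward pass: r_{k+1}(x') = sum_x exp(-q(x)) p(x'|x) r_k(x).
    r = [None] * (T + 1)
    r[0] = [1.0 if x == x0 else 0.0 for x in range(n)]
    for k in range(T):
        r[k + 1] = [sum(e[x] * p[x][y] * r[k][x] for x in range(n)) for y in range(n)]
    mu = []
    for k in range(T + 1):
        w = [r[k][x] * z[k][x] for x in range(n)]
        s = sum(w)
        mu.append([wi / s for wi in w])
    return mu

def brute_force_marginals(p, q, x0, T):
    """Enumerate all trajectories, weighting each by exp(-sum_k q(x_k)) * p(trajectory)."""
    n = len(q)
    marg = [[0.0] * n for _ in range(T + 1)]
    for traj in itertools.product(range(n), repeat=T):
        path = (x0,) + traj
        w = math.exp(-sum(q[x] for x in traj))
        for a, b in zip(path, path[1:]):
            w *= p[a][b]
        for k, x in enumerate(path):
            marg[k][x] += w
    total = sum(marg[0])
    return [[m / total for m in row] for row in marg]

# Hypothetical 3-state problem.
p = [[0.6, 0.4, 0.0], [0.3, 0.4, 0.3], [0.0, 0.5, 0.5]]
q = [1.0, 0.0, 0.5]
mu_fb = forward_backward(p, q, x0=0, T=4)
mu_bf = brute_force_marginals(p, q, x0=0, T=4)
```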


General case

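Continuous:

\[ dx = a(x)\,dt + B(x)\,(u\,dt + d\omega), \qquad \ell(x,u) = q(x) + \tfrac{1}{2}\|u\|^2 \]

is dual to

\[ dx = a(x)\,dt + B(x)\,d\omega, \qquad dy = h(x)\,dt + d\nu \]

when

\[ q(x) = \tfrac{1}{2}\|h(x)\|^2 \]

Discrete:

\[ x_{t+1} \sim u(\cdot|x_t), \qquad \ell(x,u) = q(x) + D_{\mathrm{KL}}\big( u(\cdot|x) \,\|\, p(\cdot|x) \big) \]

is dual to

\[ x_{t+1} \sim p(\cdot|x_t), \qquad y_t \sim \mathrm{Bernoulli}(r(x_t)) \]

when

\[ q(x) = -\log(r(x)) \]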


Linear case

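dynamics: $dx = Ax\,dt + Bu\,dt + C\,d\omega$

cost rate: $\ell(x,u) = \tfrac{1}{2} x^T Q x + \tfrac{1}{2}\|u\|^2$

measurement: $dy = Hx\,dt + d\nu$

cost-to-go Hessian: $-\dot V = Q + A^T V + V A - V B B^T V$

covariance: $\dot S = C C^T + A S + S A^T - S H^T H S$

inverse covariance: $\dot P = H^T H - A^T P - P A - P C C^T P$

Duality relations:

Problem | Matrix | Dynamics | Input | Cost | Time
linear-quadratic regulator | $V$ | $A$ | $B$ | $Q$ | $t$
Kalman-Bucy filter | $S$ | $A^T$ | $H^T$ | $C C^T$ | $t_f - t$
information filter | $P$ | $-A$ | $C$ | $H^T H$ | $t_f - t$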


Maximum principle for the most likely trajectory

The maximum of $\mu$ is the optimal trajectory for a deterministic problem with any dynamics $x' = f(x,u)$ and cost $\ell(x,u) = q(x) - \log p(f(x,u), x)$.

For diffusions with $B = \text{const}$ the corresponding deterministic problem has dynamics $\dot x = a(x)$ and cost $\ell(x,u) = q(x) + \frac{1}{2\sigma^2}\|u\|^2 + \frac{1}{2}\,\mathrm{div}(a(x))$.

Example: $dx = a(x)\,dt + u\,dt + \sigma\,d\omega$, with $q = 0$ and $q_T = x^2$.

[Figure: a(x) and div(a(x)) versus position (x); r(x,t), z(x,t) and mu(x,t) plotted over time (t) and position (x), for sigma = 0.6 and sigma = 1.2.]


Compositionality

Consider a finite-horizon problem with final cost

\[ q_T(x) = -\log\Big( \sum_k w_k \exp(-g_k(x)) \Big) \]

Let $z_k(x,t)$, $u^*_k(x,t)$ be the desirability functions and optimal control laws for component problems with final costs $g_k(x)$.

Then the desirability function for the composite problem is

\[ z(x,t) = \sum_k w_k\, z_k(x,t) \]

and the optimal control law is

\[ u^*(x,t) = \sum_k \frac{w_k\, z_k(x,t)}{z(x,t)}\, u^*_k(x,t) \]
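Both statements follow from linearity. The sketch below spells out the steps for the diffusion case, assuming the control law for linearly-solvable diffusions given earlier (optimal control equal to sigma^2 B^T z_x / z) and assuming that all component problems share the same dynamics and state cost.

```latex
% Final-time condition: the composite desirability is a weighted sum of the components.
z(x,T) = \exp(-q_T(x)) = \sum_k w_k \exp(-g_k(x)) = \sum_k w_k\, z_k(x,T)
% The equation for z is linear, so the weighted sum remains a solution for all t:
z(x,t) = \sum_k w_k\, z_k(x,t)
% Substituting into the optimal control law (assumed diffusion form):
u^*(x,t) = \sigma^2 B^T \frac{z_x}{z}
         = \sum_k \frac{w_k\, z_k}{z}\; \sigma^2 B^T \frac{(z_k)_x}{z_k}
         = \sum_k \frac{w_k\, z_k(x,t)}{z(x,t)}\; u^*_k(x,t)
```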


Application to LQG

\[ dx = u\,dt + d\omega, \qquad \ell(x,u) = \tfrac{1}{2} u^2, \qquad g_k(x) = \tfrac{1}{2}(x - c_k)^2 \]


Function approximation

Infinite-horizon average-cost discrete-time continuous-space problem:

\[ \lambda\, z(x) = \exp(-q(x)) \int p(x'|x)\, z(x')\, dx' \]

Select a collocation set $\{x_n\}$. Define a function approximator

\[ \hat z(x; w, \theta) = \sum_i w_i\, f_i(x; \theta) \]

Construct the (usually sparse) matrices $F$, $G$ with elements

\[ F_{n,i} = f_i(x_n; \theta) \]

\[ G_{n,i} = \exp(-q(x_n)) \int p(x'|x_n)\, f_i(x'; \theta)\, dx' \]

The above functional equation now becomes an algebraic equation:

\[ \lambda\, F(\theta)\, w = G(\theta)\, w \]

Solving for $\lambda$, $w$ is a linear problem; $\theta$ can be adapted via gradient descent.
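The algebraic eigenproblem can be illustrated in its simplest special case. The sketch below assumes a finite state space with one indicator basis function per state, so that F is the identity and G has entries exp(-q(x)) p(x'|x); it then finds the principal eigenpair by power iteration. The transition matrix, costs and solver choice are illustrative assumptions, not the adaptive Gaussian-basis method of the talk.

```python
import math

def principal_eigenpair(G, iters=2000):
    """Power iteration for lambda * z = G z, normalizing z so that max(z) = 1."""
    n = len(G)
    z = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        Gz = [sum(G[i][j] * z[j] for j in range(n)) for i in range(n)]
        lam = max(Gz)
        z = [g / lam for g in Gz]
    return lam, z

# Hypothetical 3-state average-cost problem with indicator bases (F = identity).
p = [[0.6, 0.4, 0.0], [0.3, 0.4, 0.3], [0.0, 0.5, 0.5]]
q = [1.0, 0.0, 0.5]
G = [[math.exp(-q[i]) * p[i][j] for j in range(3)] for i in range(3)]

lam, z = principal_eigenpair(G)
c = -math.log(lam)   # average cost, using exp(q - c) z = P[z] from the summary table
```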


Example: 40 adaptive Gaussian bases

[Figure: two examples, car-on-a-hill cycle and metronome cycle. Panels: state cost over position and velocity; optimal control approximation; log10 residual versus CG iteration (0 to 100).]


Remarks on function approximation

The collocation states and bases need to cover the (unknown) region of state space where the optimally-controlled dynamics tend to live. We are exploring various adaptive methods; it is not yet clear which is best.

Trajectory optimization may be a good way to initialize the basis function method, which can then construct a feedback controller applicable over a much wider region.

The computational bottleneck is the evaluation of $\int p(x'|x_n)\, f_i(x'; \theta)\, dx'$. If $p$ is a Gaussian mixture, and all $f_i$'s are Gaussians (possibly multiplied by polynomials), this evaluation can be done analytically. Otherwise one can use cubature formulas.

Overall impression: Numerical methods derived from the LMDP framework are much more efficient than methods for generic stochastic control problems, but not enough to beat the curse of dimensionality.
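The analytic evaluation can be illustrated in one dimension. The sketch below assumes a single Gaussian for p (rather than a mixture) and an unnormalized Gaussian basis function exp(-(x' - c)^2 / (2 w^2)), and compares the closed form with a midpoint-rule quadrature; the parameter values and function names are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def analytic_integral(m, s, c, w):
    """Closed form of the integral of N(x'; m, s^2) * exp(-(x' - c)^2 / (2 w^2)) dx'."""
    var = s * s + w * w
    return w / math.sqrt(var) * math.exp(-0.5 * (m - c) ** 2 / var)

def numeric_integral(m, s, c, w, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of the same integral on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += gaussian_pdf(x, m, s) * math.exp(-0.5 * ((x - c) / w) ** 2)
    return total * h

# Hypothetical passive-dynamics mean/std and basis center/width.
exact = analytic_integral(m=0.3, s=0.5, c=1.0, w=0.8)
approx = numeric_integral(m=0.3, s=0.5, c=1.0, w=0.8)
```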





Trajectory optimization

Emo Todorov

Applied Mathematics Computer Science and Engineering

University of Washington

MuJoCo: A physics engine for control

- recursive algorithms for smooth dynamics
- new algorithms for contact dynamics
- modeling of tendons, muscles, pneumatics
- 100 times faster than real-time (single core, 23 DOF humanoid)
- parallel evaluation for finite differences

Trajectory optimization methods

General procedure:
- divide the relevant variables (positions, velocities, torques, contact forces) into independent and dependent sets; compute the latter given the former
- if the independent variables are physically coupled, impose constraints
- use contact smoothing to ensure that the cost function is differentiable
- apply a suitable 2nd-order optimization method, usually with continuation

Independent | Dependent (computation) | Constraints | Smoothing
Torques | Positions (integration); Velocities (forward dynamics); Contact forces (contact solver) | n/a | Iterative contact approximation
Torques; Contact forces | Positions (integration); Velocities (forward dynamics) | Contact & friction | Soft constraints
Positions | Velocities (differentiation); Torques (inverse dynamics); Contact forces (inverse contact solver) | Under-actuation | Finite differences; Convex contact model
Positions; Contact forces | Velocities (differentiation); Torques (inverse dynamics) | Under-actuation; Contact & friction | Soft constraints
Positions; Velocities; Torques; Contact forces | n/a | Differentiation; Dynamics; Contact & friction | Soft constraints

Model-predictive control (MPC)

At time step t, solve a trajectory optimization problem of the form

\[ \min \; h(x_{t+N}) + \sum_{k=t}^{t+N-1} \ell(x_k, u_k) \]

Larger horizon N results in better performance but requires more computation. The final cost h(x) should approximate all future costs incurred after time t+N.

There is always a plan; the plan is re-optimized all the time, and only the initial portion is ever executed.
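The receding-horizon loop can be sketched on a toy problem. This is not the iLQG optimizer used in the talk: as a stand-in, the sketch assumes scalar linear dynamics with quadratic running and final costs, for which the finite-horizon problem is solved exactly by a backward Riccati pass. All constants and function names are illustrative assumptions.

```python
# Hypothetical scalar problem: x_{k+1} = A_DYN * x_k + B_DYN * u_k,
# running cost Q_COST * x^2 + R_COST * u^2, final cost H_FINAL * x^2.
A_DYN, B_DYN = 1.0, 1.0
Q_COST, R_COST, H_FINAL = 1.0, 1.0, 1.0

def plan_first_control(x, horizon):
    """Solve the horizon-N problem by a backward Riccati pass; return only the first control."""
    P = H_FINAL
    K = 0.0
    for _ in range(horizon):
        # After the last iteration, K is the feedback gain for the first time step.
        K = A_DYN * B_DYN * P / (R_COST + B_DYN ** 2 * P)
        P = Q_COST + A_DYN ** 2 * P - A_DYN * B_DYN * P * K
    return -K * x

def run_mpc(x0, steps, horizon):
    """Re-plan at every step and execute only the initial portion of each plan."""
    xs = [x0]
    for _ in range(steps):
        u = plan_first_control(xs[-1], horizon)
        xs.append(A_DYN * xs[-1] + B_DYN * u)
    return xs

xs = run_mpc(x0=5.0, steps=30, horizon=5)
```

In a nonlinear setting the call to `plan_first_control` would be replaced by an iterative trajectory optimizer, typically warm-started from the previous plan.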

The optimizer is based on the iLQG method described in Todorov and Li, ACC 2005, with further improvements described in Tassa, Erez and Todorov, IROS 2012.

The cost has the extra term where e is the vector distance to the nearest surface. The planned contact impulse f is penalized in full, but scaled by c before being applied.

Contact-aware optimization

We introduce auxiliary decision variables indicating if potential contact i should be active in movement phase . These variables affect the dynamics and cost function, and are optimized along with the movement trajectory.

Mordatch, Todorov and Popovic, SIGGRAPH 2012
Mordatch, Popovic and Todorov, SCA 2012

Results

Kulchenko and Todorov, ICRA 2011
MPC: plan with iLQG until next contact (adaptive horizon). Contact dynamics handled implicitly: heuristic value function used as final cost (encodes what is a good way to hit the ball).

Tassa, Erez and Todorov, IROS 2012
MPC: plan with iLQG 0.5 sec into the future (fixed horizon). Smoothing due to soft approximation to contact.

Mordatch, Todorov and Popovic, SIGGRAPH 2012
Space-time optimization: spline representation of trajectory. Contact smoothing due to finite differencing. Auxiliary decision variables for all potential contacts. Continuation: contact forces from a distance.

Mordatch, Popovic and Todorov, SCA 2012
Similar to above, but contact forces and their origins are now independent variables, and contact smoothing is due to soft constraints.

Erez, Tassa and Todorov, IROS 2012
Space-time optimization: Fourier representation of trajectory. Contact smoothing due to inverse contact solver. Continuation: root forces, contact forces from a distance.

Towards real-world applications

- Drive utility vehicle to site
- Walk across rubble
- Remove debris blocking door
- Open door, enter building
- Climb ladder and stairs
- Use tool to break concrete panel
- Close leaking valve
- Replace component

DARPA Robotics Challenge

The case for MPC in the brain

yet most neurons in your brain are doing something when you move... perhaps online re-optimization of the movement?

complex “hardwired” behaviors can be generated with few neurons

we usually think of learning/optimization as setting up a similar neural machine, which is then responsible for online generation of behavior

small brain: box of chocolates / behaviors

large brain: chocolate factory / behavior optimizer

If you have expensive machinery, you should use it all the time.

Contact smoothing via soft constraints

If the contact forces are treated as independent variables and physical realism is enforced via penalty functions, then the cost function is differentiable.

- $A = J M^{-1} J^T$: contact inverse inertia
- $b$: contact velocity before impulse
- $f$: contact impulse
- $v$: contact velocity after impulse

\[ v = A f + b \]

complementarity:

\[ x \perp y \;:\quad x \ge 0,\; y \ge 0,\; xy = 0 \]

Coulomb friction model:

\[ f_N \perp v_N, \qquad \big( \mu f_N - \|f_T\| \big) \perp \|v_T\|, \qquad v_T = \alpha f_T,\; \alpha \le 0 \]

Penalty functions instead of hard constraints: replace $x \perp y$ with one of

\[ \min(x,y)^2, \qquad \big( x + y - \sqrt{x^2 + y^2} \big)^2, \qquad \ldots \]
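Two standard penalties for a complementarity condition between x and y are the squared minimum and the squared Fischer-Burmeister function. The short sketch below evaluates both on made-up inputs to show that they vanish exactly when x >= 0, y >= 0 and xy = 0; the function names are illustrative, and how such penalties are weighted inside a cost function is not covered.

```python
import math

def penalty_min(x, y):
    """Squared-minimum penalty for the condition x >= 0, y >= 0, x*y = 0."""
    return min(x, y) ** 2

def penalty_fb(x, y):
    """Squared Fischer-Burmeister penalty for the same condition."""
    return (x + y - math.sqrt(x * x + y * y)) ** 2

feasible = [(0.0, 3.0), (2.0, 0.0), (0.0, 0.0)]        # complementarity holds
infeasible = [(1.0, 1.0), (-1.0, 0.0), (0.0, -2.0), (-1.0, 2.0)]

feasible_values = [(penalty_min(x, y), penalty_fb(x, y)) for x, y in feasible]
infeasible_values = [(penalty_min(x, y), penalty_fb(x, y)) for x, y in infeasible]
```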

Contact smoothing via iterative approximations

The contact impulse f is the solution to a linear equation:

\[ v = A f + b \]

subject to complicated but easy-to-enforce constraints:

\[ f_N \perp v_N, \qquad \big( \mu f_N - \|f_T\| \big) \perp \|v_T\|, \qquad v_T = \alpha f_T,\; \alpha \le 0 \]

Iterative solvers tend to work well and produce smooth results:
- solve the linear equation
- enforce the constraints softly
- repeat a couple of times

Convex, smooth and invertible contact model

Define the contact impulse by minimizing contact-space kinetic energy $\tfrac{1}{2} v^T A^{-1} v$ subject to $f_N \ge 0$, $v_N \ge 0$, $\mu f_N \ge \|f_T\|$ for each contact. Replace the constraints with penalty functions $d_F(f)$, $d_V(v)$.

Forward contact dynamics: $(A, b) \to (f,\; v = A f + b)$

\[ f^* = \arg\min_f \; \tfrac{1}{2} f^T A f + f^T b + d_F(f) + d_V(A f + b) \]

Inverse contact dynamics: $(A, v) \to (f,\; b = v - A f)$

\[ f^* = \arg\min_f \; f^T \big( v + A \nabla d_V(v) \big) + d_F(f) \]

[Figure: ball drop test; vertical velocity and normal impulse versus time (msec).]

Todorov, ICRA 2011
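The inverse contact problem can be related to the forward one through the forward stationarity condition. The steps below are a sketch under the added assumptions that the penalty functions d_F and d_V are differentiable and that A is symmetric (as it is for A = J M^{-1} J^T).

```latex
% Forward objective and its stationarity condition:
J(f) = \tfrac{1}{2} f^T A f + f^T b + d_F(f) + d_V(A f + b)
\nabla J(f) = A f + b + \nabla d_F(f) + A\, \nabla d_V(A f + b) = 0
% Substitute v = A f + b:
v + A\, \nabla d_V(v) + \nabla d_F(f) = 0
% With v known, this is the stationarity condition of the inverse problem:
f^* = \arg\min_f \; f^T \big( v + A\, \nabla d_V(v) \big) + d_F(f)
```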
