Approximate Dynamic Programming

Václav Šmídl

Seminar CSKI, 18.4.2004


Outline

1 Introduction
    Control of Dynamic Systems
    Dynamic Programming

2 Discrete domain
    Markov Decision Processes
    Curses of dimensionality
    Real-time Dynamic Programming
    Q-learning

3 Continuous Domain
    Linear Quadratic Control
    Adaptive Critic
    Neural Networks


Decision Making and Control

[Block diagram: a Decision block produces the action a_t, which drives the System; the System produces the observed state s_t, fed back to the Decision block through a unit delay z^{-1}; the loss Z(a_t, ..., a_∞, s_t, ..., s_∞) evaluates the whole closed loop.]

a_t    action, decision (input),
s_t    observed state (output),
Z(·)   loss function,
Θ_t    internal (unknown) quantities.

Optimal decision making (control):

    a_t = \arg\min_{a_t} \{ Z(a_t, \dots, a_\infty, s_t, \dots, s_\infty) \}.

Optimal uncertain/robust decision making:

    a_t = \arg\min_{a_t} \mathrm{E}_{\Theta_t} \left[ Z(a_t, \dots, a_\infty, s_t, s_{t+1}(\Theta_t), \dots, s_\infty(\Theta_t)) \right].


Dynamic Programming

In the case of an additive loss function,

    Z(a_t, \dots, a_\infty, s_t, \dots, s_\infty) = \sum_{k=0}^{\infty} \gamma^k \, z(a_{t+k}, s_{t+k}, a_{t+k-1}, \dots),

e.g. Z = \sum_{k=1}^{\infty} (s_{t+k} - s^{\mathrm{ref}}_{t+k})^2 + a_t^2, the solution can be simplified.

Theorem
A strategy of actions a_t which minimizes

    V(s_t) = \min_{a_t} \mathrm{E}\left[ z(s_t, a_t) + \gamma V(s_{t+1}) \right]        (1)

at time t also minimizes the overall loss Z(a_t, \dots, a_\infty, s_t, \dots, s_\infty).

Equation (1) is known as (i) the Hamilton-Jacobi equation in physics, (ii) the Bellman equation, (iii) the Hamilton-Jacobi-Bellman equation.

Optimal robust control (solved by PDE)
Operational research: Markov Decision Processes (discrete)
Artificial intelligence: Neurocontrol (continuous)
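
A one-step expansion of the additive loss (not on the slide; it assumes the Markov property so that minimization and expectation can be nested) shows where (1) comes from:

    \begin{aligned}
    Z_t &= \sum_{k=0}^{\infty} \gamma^k z_{t+k} = z_t + \gamma \sum_{k=0}^{\infty} \gamma^k z_{t+1+k} = z_t + \gamma Z_{t+1}, \\
    V(s_t) &= \min_{a_t, a_{t+1}, \dots} \mathrm{E}[Z_t]
            = \min_{a_t} \mathrm{E}\Big[ z(s_t, a_t) + \gamma \min_{a_{t+1}, \dots} \mathrm{E}[Z_{t+1}] \Big]
            = \min_{a_t} \mathrm{E}\big[ z(s_t, a_t) + \gamma V(s_{t+1}) \big].
    \end{aligned}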


Aim of the research

Long-term aims:

Apply formal methods of decision making in 'difficult' areas: multi-agent systems, multiple-participant decision making.

We seek decision-making algorithms that:

1 take uncertainty into account,
2 are computationally tractable,
3 are as close to optimality as possible,
4 operate reliably in a changing environment:
    1 on-line learning (system identification),
    2 on-line design (improvement) of the decision-making strategy.

This talk offers a summary of the state of the art.


Example: Markov process

RACE TRACK: a stochastic shortest-path problem [Bradtke, 94].

Model:

    x_{t+1} = x_t + \dot{x}_t + a_{x;t},
    y_{t+1} = y_t + \dot{y}_t + a_{y;t},
    \dot{x}_{t+1} = \dot{x}_t + a_{x;t},
    \dot{y}_{t+1} = \dot{y}_t + a_{y;t}.

If [x_{t+1}, y_{t+1}] falls outside the track bounds:

    x_{t+1} = x_t,  y_{t+1} = y_t,
    \dot{x}_{t+1} = \dot{y}_{t+1} = 0.

Control actions a_{x;t}, a_{y;t} ∈ {−1, 0, 1} are executed with probability p.

Loss function: z(s_t, s_{t+1}, a_t) = 1.
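
A minimal Python sketch of this transition model (my own illustration; the function and variable names, the track-membership test on_track, and the default p are assumptions, not taken from [Bradtke, 94]):

    import random

    def step(state, action, on_track, p=0.9):
        """state = (x, y, vx, vy); action = (ax, ay) with ax, ay in {-1, 0, 1}."""
        x, y, vx, vy = state
        ax, ay = action if random.random() < p else (0, 0)   # action ignored with prob. 1-p
        vx, vy = vx + ax, vy + ay                            # velocity update
        x, y = x + vx, y + vy                                # position uses the new velocity
        if not on_track(x, y):                               # crash: back to previous cell, zero speed
            return (state[0], state[1], 0, 0), 1.0
        return (x, y, vx, vy), 1.0                           # unit loss per step (shortest path)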


Exact Solutions of Dynamic Programming

We seek a function V(s_t) (a lookup table) for which it holds that:

    V(s_t) = \min_{a_t} \sum_{s_{t+1}} f(s_{t+1}|s_t, a_t) \left( z(s_t, s_{t+1}, a_t) + \gamma V(s_{t+1}) \right).

Solutions:

Value iteration: the backup above is iterated for all states, minimizing over all possible values of a_t.

Policy iteration: we start with a guess of the optimal policy, a_t = a^{(0)}(s_t),

    V(s_t) = \sum_{s_{t+1}} f(s_{t+1}|s_t, a(s_t)) \left( z(s_t, s_{t+1}, a(s_t)) + \gamma V(s_{t+1}) \right),

and recompute the new estimate of the policy.

Linear programming: existence of upper bounds, constrained optimization.
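
The value-iteration solution is compact for a tabular MDP. A minimal sketch (my own, not from the talk; it assumes the transition probabilities f and step losses z are given as per-action |S| x |S| matrices):

    import numpy as np

    def value_iteration(f, z, gamma=0.95, tol=1e-8):
        """f[a], z[a]: |S| x |S| arrays of transition probabilities and step losses."""
        actions = list(f)
        n = f[actions[0]].shape[0]
        V = np.zeros(n)
        while True:
            # one Bellman backup per state and action: E[z + gamma * V]
            Q = np.array([(f[a] * (z[a] + gamma * V[None, :])).sum(axis=1)
                          for a in actions])
            V_new = Q.min(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, [actions[i] for i in Q.argmin(axis=0)]   # value and greedy policy
            V = V_new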


Curses of dimensionality

1 Size of the state space -> size of the value function
    1 aggregation of states,
    2 using continuous functions.

2 Taking expectations over the whole space -> time consuming
    1 sampling,
    2 tree search,
    3 roll-out heuristics,
    4 post-decision formulation of DP.

3 Size of the space of decisions -> expensive minimization
    1 model-less (Q-)learning.


Iterative methods

The estimate of V(s_t) in the k-th iteration is V^{(k)}(s_t).

Synchronous policy. For all states do:

    V^{(k+1)}(s_t) = \min_{a_t} \sum_{s_{t+1}} f(s_{t+1}|s_t, a_t) \left( z(s_t, s_{t+1}, a_t) + \gamma V^{(k)}(s_{t+1}) \right).

Each state is visited once in each sweep.

Asynchronous policy. For states in a chosen set do:

    V^{(k+1)}(s_t) = \min_{a_t} \sum_{s_{t+1}} f(s_{t+1}|s_t, a_t) \left( z(s_t, s_{t+1}, a_t) + \gamma V^{(k)}(s_{t+1}) \right),

and generate a new set.
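
Only the choice of which states are backed up differs between the two schemes. A minimal sketch of one asynchronous sweep (my own; backup and next_set are assumed user-supplied callables, not part of the talk):

    def asynchronous_sweep(V, working_set, backup, next_set):
        """backup(s, V): one-step Bellman backup min_a E[z + gamma*V] at state s;
        next_set(s, V): proposes states to visit in the following sweep."""
        proposed = set()
        for s in working_set:
            V[s] = backup(s, V)          # in-place update, later backups already see it
            proposed |= next_set(s, V)
        return V, proposed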


Real-time Dynamic Programming

Asynchronous policy: the new set is generated by applying the decision based on the current estimate. (Requires many trial simulation runs to converge.)

Theorem
Asymptotic convergence is guaranteed if:
1 there is at least one optimal strategy,
2 all step losses z(s_t, a_t) > 0,
3 initial values for all states are non-overestimating, V_0(s_t) ≤ V(s_t).

Variants:

RTDP: the basic variant with a known model,

ARTDP: a variant with a model whose unknown parameters are estimated by maximum likelihood,

RTQ: a variant with an unknown model.
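
A single RTDP trial can be sketched as follows (my own illustration; sample_next simulates the known — or, in ARTDP, the estimated — model, and greedy returns the action minimizing the one-step lookahead):

    def rtdp_trial(V, start, is_goal, backup, greedy, sample_next, max_steps=10_000):
        s = start
        for _ in range(max_steps):
            if is_goal(s):
                break
            V[s] = backup(s, V)          # asynchronous Bellman backup at the visited state
            a = greedy(s, V)             # decision based on the current estimate
            s = sample_next(s, a)        # follow the simulated system
        return V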


Model-less learning

In the case of an unknown model, we may extend the Bellman equation as follows:

    V(s_t) = \min_{a_t} Q(s_t, a_t),

    Q(s_t, a_t) = \sum_{s_{t+1}} f(s_{t+1}|s_t, a_t) \left( z(s_t, s_{t+1}, a_t) + \gamma V(s_{t+1}) \right),

yielding the recursion:

    Q(s_t, a_t) = \sum_{s_{t+1}} f(s_{t+1}|s_t, a_t) \left( z(s_t, s_{t+1}, a_t) + \gamma \min_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right).

Advantages: (i) no need for an explicit model, (ii) easier computation of optimal actions.

Disadvantages: (i) a larger function to store, (ii) learning is more time- and data-consuming.
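
With the model unknown, the expectation over f(s_{t+1}|s_t, a_t) is replaced by samples from interaction. A minimal tabular update sketch (my own; alpha is an assumed learning rate and Q a dict keyed by (state, action) with pre-initialized entries):

    def q_update(Q, s, a, z, s_next, actions, alpha=0.1, gamma=0.95):
        # sampled version of the recursion above; min over actions because z is a loss
        target = z + gamma * min(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        return Q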


Experimental comparison: Race track [Bradtke, 94]

States: 22 546 total, 9 836 effective.

Number of operations to reach convergence:

            # of trials/passes    # of backups
    GS      38 passes             835 468
    RTDP    1 000 trials          517 356
    ARTDP   800 trials            653 774
    RTQ     60 000 trials         10 million

Number of visited states:

            <100×    <10×    0×
    RTDP    97%      70%     8%
    RTQ     50%      8%      2%


Linear Quadratic Control

A linear model with fully observable state and Gaussian noise:

    s_{t+1} = A s_t + B a_t + e_t,    e_t ∼ N(0, Σ),

and a quadratic loss function

    z(s_t, a_t) = s_t' C s_t + a_t' D a_t.

The Bellman function is quadratic in s_t:

    V(s_t) = [s_t', 1] \, \Phi \, [s_t', 1]'.

Φ is obtained as the solution of a Riccati equation. In other cases the problem is intractable; LQG still serves as an etalon (benchmark) for tests.
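
For reference, the quadratic value function of the discounted, fully observed LQ problem can be computed by iterating the standard Riccati recursion; a small sketch (my own, based on textbook results rather than the talk; the additive constant due to Σ is omitted):

    import numpy as np

    def lq_riccati(A, B, C, D, gamma=1.0, iters=500):
        """Returns P with V(s) = s' P s + const and the gain K of a_t = -K s_t."""
        P = np.zeros((A.shape[0], A.shape[0]))
        for _ in range(iters):
            # gain from d/da [s'Cs + a'Da + gamma*(As+Ba)'P(As+Ba)] = 0
            K = np.linalg.solve(D + gamma * B.T @ P @ B, gamma * B.T @ P @ A)
            # one Bellman backup of the quadratic value function
            P = C + gamma * A.T @ P @ A - gamma * A.T @ P @ B @ K
        return P, K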


Adaptive Critic Algorithms

The continuous counterpart of the policy-iteration approach used in the discrete case. However, it has a forward-dynamic-programming structure.

    V(s_t) = \min_{a_t} \mathrm{E}\left[ z(s_t, a_t) + \gamma V(s_{t+1}) \right]        (2)

Parameterized approximants:

    V(s_t) ≈ V^{(k)}(s_t, w_k),
    a_t ≈ a^{(k)}(s_t, α_k).

Iterations:
1 Policy improvement:

    a^{(k+1)}(s_t, α_{k+1}) = \arg\min_{a_t(s_t, α)} \mathrm{E}\left[ z(s_t, a_t(s_t, α)) + \gamma V^{(k)}(s_{t+1}, w_k) \right],

2 Value determination:

    V^{(k+1)}(s_t, w_{k+1}) = \mathrm{E}\left[ z(s_t, a^{(k)}(s_t, α_k)) + \gamma V^{(k)}(s_{t+1}, w_k) \right].
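
The two steps can be made concrete on a scalar linear-quadratic toy problem (my own illustration, not from the talk): with critic V(s, w) = w s^2 and actor a(s, α) = α s, both steps have closed forms.

    def adaptive_critic_lq(A=0.9, B=0.5, C=1.0, D=0.1, gamma=0.95, n_iter=50):
        """Toy system s' = A*s + B*a, step loss z = C*s^2 + D*a^2 (deterministic case)."""
        w, alpha = 0.0, 0.0
        for _ in range(n_iter):
            # 1) policy improvement: minimize C*s^2 + D*(alpha*s)^2 + gamma*w*(A*s + B*alpha*s)^2 over alpha
            alpha = -gamma * w * B * A / (D + gamma * w * B * B)
            # 2) value determination: one backup of V under the fixed actor
            w = C + D * alpha ** 2 + gamma * w * (A + B * alpha) ** 2
        return w, alpha

For these parameters the pair (w, α) settles at the fixed point of (2) for this toy problem, i.e. the discounted LQ solution.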


Properties of Adaptive Critics

In the deterministic case:

1 In each step, the new control law a^{(k+1)}(·) achieves better overall performance than the previous one.

2 In each step, the new value function V^{(k+1)}(·) is a closer approximation of the Bellman function.

3 The algorithm terminates at the optimal control, if one exists.
4 No backward iteration is needed.

In the stochastic case no proof has been found, but one probably exists (via stochastic approximations).


Design of Adaptive Critics

Choose functional approximations that are:

1 differentiable,
2 less complex than a lookup table,
3 such that algorithms for computing the parameter values exist,
4 capable of approximating the optimal shape within the desired accuracy.

Then:

    a^{(k+1)}(s_t, α_{k+1}) = \arg\min_{a_t} \mathrm{E}\left[ z(s_t, a_t) + \gamma V^{(k)}(s_{t+1}, w_k) \right],

holds if:

    \frac{\partial}{\partial a_t} \left[ z(s_t, a_t) + \gamma V^{(k)}(s_{t+1}, w_k) \right]_{a_t = \hat{a}_t} = 0.

The closest parameterized approximation:

    α_{k+1} ≡ \arg\min_{α} \left\| \hat{a}_t − a(s_t, α) \right\|.
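
The last step — projecting the pointwise optimal actions back onto the parameterized actor — reduces to ordinary least squares whenever the actor is linear in features. A small sketch (my own; the feature map phi and the sampled states are assumptions, not from the talk):

    import numpy as np

    def fit_actor(phi, a_hat):
        """phi: (N, p) feature matrix phi(s_i); a_hat: (N,) pointwise optimal actions."""
        alpha, *_ = np.linalg.lstsq(phi, a_hat, rcond=None)
        return alpha                     # actor: a(s, alpha) = phi(s) @ alpha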


Variants of Adaptive Critics

1 Dual Heuristic Programming.
All that is required to update the control law are the partial derivatives of V^{(k)}(s_{t+1}, w_k). Hence, we approximate the derivatives directly:

    λ(s_t, l_k) ≈ \frac{\partial V(s_t)}{\partial s_t}.

2 Action-Dependent Dynamic Programming.
A similar idea to that of Q-learning: we do not approximate V(s_t) but Q(s_t, a_t),

    Q^{(k)}(s_t, a_t, q_k) ≈ Q(s_t, a_t).

The same requirements apply as for the traditional critic design (differentiability, accuracy, etc.).


Critics using Neural Networks

The functional approximations are neural networks:

    V(s_t) ≈ NN(w_k),    a(s_t) ≈ NN(α_k).

Learning rules are replaced by gradient descent:

    w^{(k+1)} = w^{(k)} − β \frac{\partial E}{\partial w},    E = [e(t)]^2,

where e(t) is the error in the Bellman equation (or its derivative, or both). Variants in the derivative of the error E:

    partial:                \frac{\partial E}{\partial w} = 2 e(t) \left[ −\frac{\partial V(s_t, w)}{\partial w} \right]

    total (Galerkin PDE):   \frac{\partial E}{\partial w} = 2 e(t) \left[ −\frac{\partial V(s_t, w)}{\partial w} + \gamma \frac{\partial V(s_{t+1}, w)}{\partial w} \right]
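
For a critic that is linear in its parameters, V(s, w) = w·φ(s), both variants are one line each. A minimal sketch (my own simplification; a neural-network critic would replace φ by the network's gradient ∂V/∂w):

    import numpy as np

    def critic_update(w, phi_t, phi_t1, z, gamma=0.95, beta=0.01, total_gradient=False):
        e = z + gamma * w @ phi_t1 - w @ phi_t           # Bellman error e(t)
        if total_gradient:                               # total (Galerkin PDE) variant
            grad = 2 * e * (gamma * phi_t1 - phi_t)
        else:                                            # partial variant
            grad = 2 * e * (-phi_t)
        return w - beta * grad                           # gradient-descent step on E = e(t)^2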


Issues of Adaptive Critics

Initialization: techniques used for improvement: (i) robust control, (ii) Model Predictive Control.

Convergence was tested on the LQG case using two criteria:

1 Correct equilibrium, i.e. zero derivative when the optimum is found:
    none of the total-gradient (Galerkin) variants possesses this property; all partial-gradient variants do.

2 Unconditional quadratic stability (quadratic Lyapunov function, strictly monotonic decrease in the stochastic case):
    all total-gradient variants possess this property; none of the partial-gradient variants does.

New definitions of the error that meet both criteria have appeared recently.


Conclusion

A wide range of approximate methods for dynamic programming exists, and a wide range of problem-specific approaches (tips & tricks) can also be found.

The only rigorous general theory is that of stochastic approximations; most of the proofs are based on it.

For our purposes, some variants of the Adaptive Critics can be elaborated. Similar tools are already being used (estimation of parameters via distance minimization).
