Transcript of "Stochastic Learning and Optimization - A Sensitivity-Based Approach"

Page 1

Stochastic Learning and Optimization - A Sensitivity-Based Approach

Plenary Presentation, 2008 IFAC World Congress

July 8, 2008

Xi-Ren Cao, The Hong Kong University of Science & Technology

Page 2

A Unified Framework for Stochastic Learning and Optimization

(with a sensitivity-based view)

a. Perturbation analysis (PA): a counterpart of MDPs

b. Markov decision processes (MDPs): a new and simple approach

c. Overview of reinforcement learning (RL)

d. Event-based Optimization and others


Page 3

Sensitivity-Based View

[Diagram: the sensitivity-based view links PA (system theory), MDPs (operations research), RL (computer science), SAC (stochastic adaptive control, direct), event-based optimization (vs. state-based), Lebesgue sampling, and financial engineering.]

Page 4

Optimization Problems

Dynamic system (state x); actions α_l, l = 0, 1, ...; observations y_l, l = 0, 1, ...; system performance η.

Policy: action = d(information), i.e., α = d(y).

Goal: to find a policy that has the best performance.

Page 5

Example: a queueing system (packets in, packets out).

Actions: service rate μ_n (state dependent); policy: μ_n = d(n).

Observations: number of packets in the system (the state), n = 0, 1, ..., N.

Performance: average number served per second minus costs.

Page 6

Policy Space: Best Policy?

The policy space D may be continuous (with parameters θ) or discrete.

Difficulties:
- The policy space is too large (100 states and 2 actions give 2^100 ≈ 10^30 policies; at 10 GHz it would take about 10^12 years just to count them).
- The state space is too large and its structure is unknown.

Page 7

Example: a queue with arrival rate λ and service rate μ; policy μ_n = d(n), with 3 states n = 1, 2, 3.

Policy space D:
- Continuous: D = [0, 1]^3 (one rate μ_n per state).
- Discrete: a grid over D (e.g., 5^3 grid points).

[Figure: the cube [0, 1]^3 with axes μ_1, μ_2, μ_3, and a sample policy μ_1 = 0.5, μ_2 = 0.75, μ_3 = 0.5.]

Page 8

How can we obtain as much performance information about other policies as possible?

- With structural information of the system: analyze the behavior of one policy and interpret the performance of others from it.
- With no structural information of the system: search methods (blind random search, ordinal optimization), which explore geometric properties of the distribution of η over D and evaluate each policy.

Page 9

Two views of the same queue (arrival rate λ, service rate μ):
- Black box: actions (service rate μ_n) and observations (state n = 0, 1, ..., N) only.
- Structure known: the same actions and observations, plus the structure of the system.

Page 10

Simplicity is Beauty

Physics has simple, elegant formulas: F = ma; E = mc^2; f ∝ m_1 m_2 / r^2; PV = const · T; m = m_0 / sqrt(1 − v^2/c^2); ∇·D = ρ_f, ∇·B = 0, ∇×E = −∂B/∂t; ...

How about Stochastic Learning & Optimization? Two sensitivity equations:

dη/dθ = πQg        η' − η = π'Qg

Page 11

With Structural Information

- Continuous policy spaces: with some knowledge, studying one policy (at θ) gives the performance in its neighborhood (θ + Δθ): dη/dθ = πQg.
- Discrete policy spaces: with some knowledge, studying one policy helps find a better policy: η' − η = π'Qg.

Page 12

A Sample Path

The dynamic behavior of a system under a policy can be represented by a sample path, and analyzing a sample path gives the performance under that policy. What about other policies?

Discrete-time model (embedded Markov chain): interarrival times and service times.

[Figure: a sample path of the queue, with the state moving among 1, 2, 3.]

Page 13

The Markov Model

System dynamics: X = {X_n, n = 1, 2, ...}, X_n ∈ S = {1, 2, ..., M}; transition probability matrix P = [p(j|i)], i, j = 1, ..., M.

System performance: reward function f = (f(1), ..., f(M))^T.

Steady-state probability: π = (π(1), π(2), ..., π(M)), with π(I − P) = 0 and πe = 1, where I is the identity matrix and e = (1, ..., 1)^T.

Performance measure:
  η = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} f(X_t) = Σ_{i∈S} π(i) f(i) = πf.

[Figure: a 3-state Markov chain with transition probabilities p(j|i).]
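A small numerical sketch (not from the slides) of these definitions, using a hypothetical 3-state chain; the matrix P and reward vector f below are arbitrary illustrative choices:

```python
import numpy as np

# Hypothetical 3-state transition matrix P = [p(j|i)] and reward vector f.
P = np.array([[0.0, 0.5, 0.5],
              [0.7, 0.0, 0.3],
              [0.4, 0.6, 0.0]])
f = np.array([1.0, 2.0, 3.0])
M = len(f)

# Steady-state probability: solve pi (I - P) = 0 together with pi e = 1.
A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
b = np.concatenate([np.zeros(M), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

eta = pi @ f   # long-run average performance, eta = pi f
print(pi, eta)
```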

Page 14

Perturbation Analysis

Page 15

Perturbation Analysis (PA)

For two Markov chains P = [p(j|i)] (with η, π) and P' = [p'(j|i)] (with η', π'), let Q = P' − P and
  P(δ) = (1 − δ)P + δP',  δ ∈ [0, 1].

Performance gradient:
  dη(δ)/dδ = πQg = πP'g − πPg.

Poisson equation:
  (I − P)g + ηe = f.

[Figure: the line segment from P to P', with P(δ) in between.]

Page 16

X: a sample path under P, with performance
  η = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} f(X_n) = πf.

X(δ): a sample path under P(δ) = P + δQ, Q = P' − P, with performance η(δ).

If δ is very small, the changes in the sample path are also very small; the changes are represented by many jumps (e.g., a jump from i to j at time T_ij, a jump from i' to j', ...).

[Figure: the sample paths X and X(δ); X(δ) occasionally jumps away from X.]

Page 17

[Figure: two sample paths X and X', merging after a jump from i to j at time T_ij.]

Define the performance potential of state i:
  g(i) = lim_{N→∞} E{ Σ_{n=0}^{N} [f(X_n) − η] | X_0 = i },
where
  η = lim_{N→∞} (1/N) E{ Σ_{n=0}^{N−1} f(X_n) }.
g(i) is the potential contribution of state i to the performance.

Effect of a jump from i to j on the performance:
  γ(i, j) = g(j) − g(i).

With g = (g(1), ..., g(M))^T, the potentials satisfy the Poisson equation: (I − P)g + ηe = f.
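A small numerical sketch (not from the slides), reusing the hypothetical P and f from the earlier sketch, of how the potentials can be computed from the Poisson equation:

```python
import numpy as np

# Hypothetical 3-state chain (same P, f as in the earlier sketch).
P = np.array([[0.0, 0.5, 0.5],
              [0.7, 0.0, 0.3],
              [0.4, 0.6, 0.0]])
f = np.array([1.0, 2.0, 3.0])
M = len(f)

# Steady-state probability pi and average performance eta = pi f.
A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
pi, *_ = np.linalg.lstsq(A, np.concatenate([np.zeros(M), [1.0]]), rcond=None)
eta = pi @ f

# Potentials: solve (I - P + e pi) g = f.  This particular solution satisfies
# the Poisson equation (I - P) g + eta e = f, with the normalization pi g = eta.
g = np.linalg.solve(np.eye(M) - P + np.outer(np.ones(M), pi), f)

# Effect of a jump from i to j on performance: gamma(i, j) = g(j) - g(i).
gamma = g[None, :] - g[:, None]
print(g, gamma, sep="\n")
```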

Page 18

Adding up the effects of all the jumps, we obtain η(δ) − η and hence the performance gradient:
  dη(δ)/dδ = πQg,  Q = P' − P.

[Figure: the segment from P to P' with P(δ) in between, and the sample paths X and X(δ) with jumps i → j and i' → j'.]

Page 19

Markov Decision Processes - Policy Iteration

Page 20

Two Sensitivity Formulas

P P’*P(δ)Two Markov chains P, η, π

P’, η’, π’, with Q=P’-P

Continuous policy space Discrete policy space

Gradient-based optimization

Similarly, we can construct

,)( Qgd

d πδδη

= .' PPQ −=

Performance gradient formula: Performance difference formula:

.'' Qgπηη =− .' PPQ −=

Policy iteration

19/40
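A small numerical check (not from the slides) of the two sensitivity formulas, using the hypothetical chain from the earlier sketches and an arbitrary second policy P':

```python
import numpy as np

def stationary(P):
    """Steady-state probabilities: pi (I - P) = 0, pi e = 1."""
    M = P.shape[0]
    A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
    pi, *_ = np.linalg.lstsq(A, np.concatenate([np.zeros(M), [1.0]]), rcond=None)
    return pi

P  = np.array([[0.0, 0.5, 0.5], [0.7, 0.0, 0.3], [0.4, 0.6, 0.0]])
Pp = np.array([[0.2, 0.3, 0.5], [0.5, 0.2, 0.3], [0.1, 0.8, 0.1]])  # another policy P'
f  = np.array([1.0, 2.0, 3.0])
M, Q = len(f), Pp - P

pi, pip = stationary(P), stationary(Pp)
eta, etap = pi @ f, pip @ f
g = np.linalg.solve(np.eye(M) - P + np.outer(np.ones(M), pi), f)    # potentials of P

# Performance difference formula: eta' - eta = pi' Q g.
print(etap - eta, pip @ Q @ g)

# Performance gradient formula: d eta(delta)/d delta at delta = 0 equals pi Q g.
delta = 1e-6
print((stationary(P + delta * Q) @ f - eta) / delta, pi @ Q @ g)
```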

Page 21

Policy Iteration

Performance difference: η' − η = π'Qg = π'(P' − P)g.

1. Since π' > 0, η' > η whenever P'g ≥ Pg with strict inequality in at least one state.

2. Policy iteration: at every state, choose an action so that the resulting policy P' satisfies P'g ≥ Pg (improving where possible).

3. Improve the performance iteratively; stop when no improvement can be made. (A minimal sketch of the resulting algorithm follows below.)
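A minimal policy-iteration sketch (not from the slides), assuming a small average-reward MDP in which every policy gives an irreducible chain; P_a and f_a below are hypothetical per-action transition matrices and reward vectors:

```python
import numpy as np

def stationary(P):
    M = P.shape[0]
    A = np.vstack([(np.eye(M) - P).T, np.ones(M)])
    pi, *_ = np.linalg.lstsq(A, np.concatenate([np.zeros(M), [1.0]]), rcond=None)
    return pi

def policy_iteration(P_a, f_a):
    """P_a[a]: transition matrix under action a; f_a[a]: reward vector under action a."""
    n_actions, M, _ = P_a.shape
    d = np.zeros(M, dtype=int)                   # start from an arbitrary policy
    while True:
        P = P_a[d, np.arange(M), :]              # transition matrix of the current policy
        f = f_a[d, np.arange(M)]
        pi = stationary(P)
        eta = pi @ f
        g = np.linalg.solve(np.eye(M) - P + np.outer(np.ones(M), pi), f)   # potentials
        # At every state pick the action maximizing f(i, a) + sum_j p(j|i, a) g(j).
        d_new = (f_a + P_a @ g).argmax(axis=0)
        if np.array_equal(d_new, d):             # stop when no improvement can be made
            return d, eta, g
        d = d_new

# Example: 2 actions, 3 states (arbitrary numbers).
P_a = np.array([[[0.0, 0.5, 0.5], [0.7, 0.0, 0.3], [0.4, 0.6, 0.0]],
                [[0.2, 0.3, 0.5], [0.5, 0.2, 0.3], [0.1, 0.8, 0.1]]])
f_a = np.array([[1.0, 2.0, 3.0],
                [1.5, 1.0, 2.5]])
print(policy_iteration(P_a, f_a))
```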

Page 22

More on Policy Iteration

Performance criteria:

- Average performance: η = πf.

- Discounted performance: η_β(i) = E{ Σ_{k=0}^{∞} β^k f(X_k) | X_0 = i }.

- Bias g: g(i) = E{ Σ_{k=0}^{∞} [f(X_k) − η] | X_0 = i }, with πg = 0.

- Bias of the bias (2nd-order bias) g_2: g_2(i) = −E{ Σ_{k=0}^{∞} g(X_k) | X_0 = i }, with πg = 0.

- Bias of the (n−1)th bias (nth-order bias) g_n: g_n(i) = −E{ Σ_{k=0}^{∞} g_{n−1}(X_k) | X_0 = i }, with πg_{n−1} = 0.

[Figure: E{f(X_k) | X_0 = i} as a function of k, starting from f(i) and converging to η.]

The bias measures the transient behavior.

Page 23

Perf./Bias Difference Formulas → Policy Iteration

Two policies P' (with π', η', g', g_2', ...) and P (with π, η, g, g_2, ...). Let
  P* = lim_{N→∞} (1/N) Σ_{n=0}^{N−1} P^n.

Performance difference:
  η' − η = P'*[(f' + P'g) − (f + Pg)] + (P'* − I)η.

If η' = η, then
  g' − g = P'*(P' − P)g_2 + (I − P' + P'*)^{-1}[(f' + P'g) − (f + Pg)].

If g_k' = g_k for k = 1, ..., n, then
  g_{n+1}' − g_{n+1} = P'*(P' − P)g_{n+2} + (I − P' + P'*)^{-1}(P' − P)g_{n+1}.

These give policy iteration for the optimal n-bias and the optimality equations for n-bias optimization.

Page 24

Multi-Chain MDPs: Perf. / Bias / Blackwell Optimization

With the performance difference formulas, we can derive a simple, intuitive approach without discounting.

[Figure: nested policy sets D ⊇ D0 ⊇ D1 ⊇ D2 ⊇ ... ⊇ DM.]
- D: policy space
- D0: performance-optimal policies
- D1: (1st) bias-optimal policies
- D2: 2nd-bias-optimal policies
- ...
- DM: Blackwell optimal policies

Page 25

PA (continuous policy spaces): with some knowledge, studying one policy (at θ) gives the performance in its neighborhood (θ + Δθ): dη/dθ = πQg.

MDP (discrete policy spaces): with some knowledge, studying one policy helps find a better policy: η' − η = π'Qg.

Observations:
- We do not need to evaluate every policy (the policy space is large).
- The state space is too large, so it is difficult to evaluate even one policy; instead, estimate g, Pg, or πQg.

Page 26

Reinforcement Learning

P is too large, or not completely known. Learning: estimate from a sample path.

Page 27

PA: dη(δ)/dδ = πQg = πP'g − πPg.   MDPs: η' − η = π'Qg = π'(P' − P)g.

Estimating g:
  g(i) = E{ Σ_{k=0}^{∞} [f(X_k) − η] | X_0 = i } = E{ [f(i) − η] + g(X_1) | X_0 = i }.

- Monte Carlo: average of Σ_k [f(X_k) − η] along the sample path.
- Stochastic approximation:
    g(X_k) := g(X_k) + α_k { f(X_k) − η + g(X_{k+1}) − g(X_k) },
  with the temporal difference (TD)
    δ_k = f(X_k) − η + g(X_{k+1}) − g(X_k),
  and step sizes α_k > 0, Σ_k α_k = ∞, Σ_k α_k^2 < ∞.

Example: if X_k = 2 and X_{k+1} = 4, then g(2) := (1 − α)g(2) + α[f(2) − η + g(4)].

A minimal simulation sketch of this TD update appears below.
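A minimal sketch (not from the slides) of this TD update for the potentials, simulating the hypothetical 3-state chain used earlier; η is estimated on line alongside g:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.0, 0.5, 0.5], [0.7, 0.0, 0.3], [0.4, 0.6, 0.0]])
f = np.array([1.0, 2.0, 3.0])
M = len(f)

g = np.zeros(M)      # potential estimates
eta = 0.0            # running estimate of the average reward
x = 0                # current state X_k
for k in range(1, 200_000):
    x_next = rng.choice(M, p=P[x])           # simulate one transition
    alpha = 1.0 / k                          # step sizes: sum = inf, sum of squares < inf
    delta = f[x] - eta + g[x_next] - g[x]    # temporal difference delta_k
    g[x] += alpha * delta                    # g(X_k) := g(X_k) + alpha_k * delta_k
    eta += alpha * (f[x] - eta)              # estimate eta from the same path
    x = x_next

print(eta, g - g.mean())   # potentials are determined only up to an additive constant
```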

Page 28

PA: dη(δ)/dδ = πQg = πP'g − πPg.   MDPs: η' − η = π'Qg = π'(P' − P)g.

Estimating Pg (the Q-factors):
  Q(i, α) = f(i, α) − η + Σ_{j=1}^{M} p^α(j|i) g(j).
Similar temporal difference (TD) algorithms can be developed.

Estimating πQg directly, with Q = P' − P = ΔP:
  dη(δ)/dδ = π(ΔP)g = E{ [Δp(X_{k+1}|X_k) / p(X_{k+1}|X_k)] g(X_{k+1}) }
           = lim_{N→∞} (1/N) Σ_{n=0}^{N} [Δp(X_{n+1}|X_n) / p(X_{n+1}|X_n)] ĝ(X_{n+1}, X_{n+2}, ...),
where ĝ(X_{n+1}, X_{n+2}, ...) is an estimate satisfying E{ ĝ(X_{n+1}, X_{n+2}, ...) | X_{n+1} } = g(X_{n+1}).
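A crude single-path sketch (not from the slides) of estimating πQg = π(ΔP)g directly; the window length W used as the estimate ĝ and all the numbers are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
P  = np.array([[0.0, 0.5, 0.5], [0.7, 0.0, 0.3], [0.4, 0.6, 0.0]])
Pp = np.array([[0.2, 0.3, 0.5], [0.5, 0.2, 0.3], [0.1, 0.8, 0.1]])  # perturbed policy P'
f  = np.array([1.0, 2.0, 3.0])
dP = Pp - P                                   # Delta P = Q
N, M = 100_000, len(f)

# Simulate one long sample path under P.
X = np.empty(N, dtype=int)
X[0] = 0
for n in range(N - 1):
    X[n + 1] = rng.choice(M, p=P[X[n]])
eta_hat = f[X].mean()                         # estimate of eta from the same path

# g_hat(X_{n+1}, ...): truncated sum of [f(X_k) - eta] over a window of W steps.
W = 50
terms = []
for n in range(N - W - 1):
    i, j = X[n], X[n + 1]
    g_hat = np.sum(f[X[n + 1:n + 1 + W]] - eta_hat)
    terms.append(dP[i, j] / P[i, j] * g_hat)

print(np.mean(terms))                         # estimate of d eta(delta)/d delta = pi (Delta P) g
```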

Page 29

Policy Iteration Based Learning and Optimization

- Analytical (P, f known): policy iteration, with g obtained by solving the Poisson equation or by numerical methods.
- Learn g(i) (no matrix inversion, etc.):
  - Monte Carlo: long-run, accurate estimates + PI; or short-run, noisy estimates + SA + GPI.
  - Temporal difference: long-run, accurate estimates + PI; or short-run, noisy estimates + SA + GPI.
- Learn Q(i, α) (P completely unknown):
  - Monte Carlo: long-run, accurate estimates + PI; or short-run, noisy estimates + SA + GPI (to be done).
  - Temporal difference: long-run, accurate estimates + PI; or short-run, noisy estimates + SA + GPI (SARSA).

Page 30

PA-Gradient Based Learning and Optimization

- Analytical (P, f known): performance derivative formula (PDF) + gradient methods (GM).
- Learn g(i):
  - Monte Carlo: long-run, accurate estimates + PDF + GM.
  - Temporal difference: long-run, accurate estimates + PDF + GM.
- Learn dη/dθ directly; find a zero of dη/dθ:
  - Updates every regenerative period: long-run, accurate estimates + GM.
  - Updates every transition: short-run, noisy estimates + TD.

Page 31

A Map of the L&O World

Two policies P, P' with Q = P' − P; steady-state probabilities π, π'; long-run average performance η, η'; Poisson equation (I − P + eπ)g = f.

At the center are the potentials g and the two sensitivity equations:
  dη/dδ = πQg  (PA, policy gradient, SAC)   and   η' − η = π'Qg  (MDP, policy iteration),
linked by gradient-based PI. RL (TD(λ), Q-learning, neuro-DP, ...) provides online estimates via stochastic approximation, supporting online gradient-based optimization and online policy iteration.

(RL: reinforcement learning; PA: perturbation analysis; MDP: Markov decision process; SAC: stochastic adaptive control.)

Page 32

Event-Based Optimization - New Directions

Page 33

Limitations of the State-Based Model

1. Curse of dimensionality
2. State-based policies may not be the best
3. Special features not captured

→ Event-Based Formulation

Page 34

Admission Control in Communications

[Figure: arrivals at rate λ are admitted with probability b(n) and rejected with probability 1 − b(n); admitted customers move among servers with routing probabilities q_{0i}, q_{ij}.]

n: population (number of all customers in the network); n_i: number of customers at server i; (n_1, ..., n_M): state; N: capacity.

Event: a customer arrives and finds a population n. How do we choose the admission probability b(n)?

Page 35

Sensitivity-Based Approaches to Event-Based Optimization

Constructing new sensitivity equations! The same map applies (PA, MDP with policy iteration, SAC, policy gradient, gradient-based PI, and RL with TD(λ), Q-learning, neuro-DP, ... providing online estimates via stochastic approximation), but the two sensitivity equations
  dη/dδ = πQg   and   η' − η = π'Qg
are replaced by event-based counterparts written in terms of the probabilities of events and aggregated potentials.

(RL: reinforcement learning; PA: perturbation analysis; MDP: Markov decision process; SAC: stochastic adaptive control.)

Page 36

Advantages of the Event-Based Approach

1. The number of aggregated potentials (e.g., d(n), n = 0, 1, ..., N) may be linear in the system size.
2. Actions at different states are correlated, so standard MDPs do not apply.
3. Special features are captured by events; an action may depend on "future" information.
4. May have better performance.
5. Opens up a new direction for many engineering problems: POMDPs (observation y as an event), hierarchical control (mode change as an event), networks of networks (transitions among subnets as events), Lebesgue sampling.

Page 37

Riemann Sampling vs. Lebesgue Sampling

RS (Riemann sampling): sample the system at prespecified time instants t_1, t_2, ..., t_k, ...

LS (Lebesgue sampling): sample the system whenever the signal reaches one of the prespecified levels d_1, d_2, d_3, ..., and control is applied then.

[Figure: the same signal sampled periodically in time (RS) and at level crossings (LS).]

A small sketch contrasting the two follows below.
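A toy sketch (not from the slides) contrasting the two sampling schemes on a simulated random signal; the signal model, levels, and sampling period are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
T, dt = 10.0, 0.001
t = np.arange(0.0, T, dt)
x = np.cumsum(rng.normal(0.0, np.sqrt(dt), t.size))   # a Brownian-motion-like signal

# Riemann sampling: sample at fixed time instants t_1, t_2, ...
rs_times = t[::500]                                    # every 0.5 time units

# Lebesgue sampling: sample whenever the signal crosses one of the
# prespecified levels d_1 < d_2 < ...; control would be applied at these instants.
levels = np.arange(-3.0, 3.0, 0.5)
last = np.searchsorted(levels, x[0])
ls_times = []
for k in range(1, x.size):
    cur = np.searchsorted(levels, x[k])
    if cur != last:                                    # a level was crossed
        ls_times.append(t[k])
        last = cur

print(len(rs_times), len(ls_times))
```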

Page 38

A Model for Stock Prices or Financial Assets

  dX(t) = b(t, X(t)) dt + σ(t, X(t)) dw(t) + ∫ γ(t, X(t−), z) N(dt, dz),

where w(t) is a Brownian motion, N(dt, dz) is a Poisson random measure, and X(t) is an Ito-Levy process.

[Figure: a sample path X(t) and the first time τ_1* at which it reaches a level x*.]
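A toy Euler-type simulation sketch (not from the slides) of such a jump-diffusion; the coefficients b, σ, γ and the jump intensity below are made-up illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

b     = lambda t, x: 0.05 * x           # drift coefficient
sigma = lambda t, x: 0.20 * x           # diffusion coefficient
gamma = lambda t, x, z: x * z           # jump size for mark z
jump_rate = 0.5                         # jump intensity of the Poisson random measure

T, n = 1.0, 1000
dt = T / n
X = np.empty(n + 1)
X[0] = 100.0
for k in range(n):
    t = k * dt
    dw = rng.normal(0.0, np.sqrt(dt))                 # Brownian increment
    dX = b(t, X[k]) * dt + sigma(t, X[k]) * dw
    if rng.random() < jump_rate * dt:                 # a jump occurs in (t, t + dt]
        z = rng.normal(-0.05, 0.10)                   # jump mark z
        dX += gamma(t, X[k], z)
    X[k + 1] = X[k] + dX

print(X[-1])
```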

Page 39

A Material Handling System for an Assembly Line

[Figure: semi-products flow through assembling stations S1-S6; lineside buffers B1-B15 feed the stations and are replenished by dollies from a central docking area.]

The event-based approach leads to a 6-10% performance improvement.

Page 40

Sensitivity-Based View of Optimization

1. A map of the learning and optimization world: results in different areas can be obtained and explained from the two sensitivity equations, with a simple and complete derivation for MDPs.

2. Extension to event-based optimization: policy iteration, perturbation analysis, reinforcement learning, time aggregation, ..., Lebesgue sampling, sensor networks, POMDPs, hierarchical control, ...

Page 41

THANKS !

Questions?


Page 42

Xi-Ren Cao: Stochastic Learning and Optimization - A Sensitivity-Based Approach. Springer, October 2007. 9 chapters, 566 pages, 119 figures, 27 tables, 212 homework problems.