Transcript of ORF 544 Stochastic Optimization and Learning

Page 1

ORF 544

Stochastic Optimization and Learning
Spring, 2019

Warren Powell
Princeton University

http://www.castlelab.princeton.edu

© 2018 W.B. Powell

Page 2

Week 11

Forward approximate dynamic programming

© 2019 Warren B. Powell

Page 3

Approximate dynamic programming

Exploiting monotonicity

© 2018 W.B. Powell Slide 3

Page 4

Exploiting monotonicity

Monotonicity in dynamic programming
» There are many problems where the value function increases (or decreases) monotonically with all dimensions of the state variable:
» Operations research
• Optimal equipment replacement – Replace when a measure of deterioration exceeds some point.
• Dispatching customers in a queue – Costs increase monotonically with the number of customers in the queue. Dispatch when the queue exceeds some number.

© 2019 Warren B. Powell

[Figure: threshold policy as a function of queue length – do nothing (0) below the threshold, repair/dispatch (1) above it.]

Page 5

Exploiting monotonicity

Monotonicity in dynamic programming (cont’d)
» Energy
• Buy when electricity price is below one number, sell when above another number.
• Value of energy in storage increases with amount being held (and perhaps price, speed of wind, demand, …).

© 2019 Warren B. Powell

[Figure: threshold policy as a function of price – buy (−1) below the lower threshold, do nothing (0) in between, sell (1) above the upper threshold.]

Page 6

Exploiting monotonicity

Monotonicity in dynamic programming (cont’d)
» Health
• Dosage of a diabetes drug increases with blood sugar.
• Dosage of statins (for reducing cholesterol) increases as the cholesterol level increases (and also increases with age and weight).
» Finance
• The value of holding cash in a mutual fund increases with the amount of redemptions, stock indices, and interest rates.
• The value of holding an option increases with price and volatility.

© 2019 Warren B. Powell

Page 7

Exploiting monotonicity

[Figure: electricity prices over the hours of a day, with charge and discharge bid levels $b^{charge}_{1pm,2pm}$ and $b^{discharge}_{1pm,2pm}$.] The bid is placed at 1pm, consisting of charge and discharge prices between 2pm and 3pm.

© 2019 Warren B. Powell

Page 8

Exploiting monotonicity

The exact value function

© 2019 Warren B. Powell

Page 9

Exploiting monotonicity

Approximate value function without monotonicity

© 2019 Warren B. Powell

Page 10

Exploiting monotonicity

Maintaining monotonicity
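One way to maintain monotonicity, sketched below for a one-dimensional state, is to smooth in the new observation and then push violating neighbors up or down. This is a hypothetical illustration (not the course code), assuming a lookup-table approximation that should be nondecreasing in the state index:

```python
import numpy as np

def monotone_update(V, s, v_hat, alpha):
    """Smooth observation v_hat into V[s], then restore monotonicity.

    V: 1-d array of value estimates, assumed nondecreasing in the state index.
    s: index of the observed state; v_hat: observed value; alpha: stepsize.
    """
    V = V.copy()
    V[s] = (1 - alpha) * V[s] + alpha * v_hat
    # Project back onto the set of nondecreasing functions:
    # states below s cannot exceed V[s]; states above cannot fall below it.
    V[:s] = np.minimum(V[:s], V[s])
    V[s + 1:] = np.maximum(V[s + 1:], V[s])
    return V

# Example: an update that would violate monotonicity gets repaired.
V = np.array([0.0, 1.0, 2.0, 3.0])
print(monotone_update(V, 1, 5.0, 0.5))  # [0. 3. 3. 3.]
```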

© 2019 Warren B. Powell

Page 11

Exploiting monotonicity

Maintaining monotonicity

© 2019 Warren B. Powell

Page 12

Exploiting monotonicity

Maintaining monotonicity

© 2019 Warren B. Powell

Page 13

Exploiting monotonicity

Maintaining monotonicity

© 2019 Warren B. Powell

Page 14

Exploiting monotonicity

© 2019 Warren B. Powell

Page 15

Exploiting monotonicity

Benchmarking
» M-ADP and AVI (number of iterations)

© 2019 Warren B. Powell

Page 16

Exploiting monotonicity

Observations
» Initial experiments using various machine learning approximations produced poor results (60–80 percent of optimal).
» Vanilla lookup table works poorly (within computing budgets that spanned days).
» Lookup table with monotonicity worked quite well.
» We have run this algorithm on problems with up to 7 dimensions (but that is our limit).

© 2019 Warren B. Powell

Page 17

Backward ADP

Backward ADP for a clinical trial problem
» The problem is to learn the value of a new drug within a budget of patients to be tested.
» Backward MDP required 268–485 hours.
» Forward ADP exploiting monotonicity (we will cover this later) required 18–30 hours.
» Backward ADP required 20 minutes, with a solution that was within 1.2 percent of optimal.

© 2019 Warren B. Powell

Page 18

Approximate dynamic programming

Driverless EV problem

© 2018 W.B. Powell Slide 18

Page 19

Driverless EV optimization

Current Uber logic:
» Show nearest 8 drivers.
» Contact closest driver to confirm assignment.
» If driver does not confirm, contact second closest driver.

Limitations:
» Ignores potential future opportunities for each driver.

© 2019 Warren B. Powell

Page 20

Driverless EV optimization

Closest car… but it moves the car away from the busy downtown area and strands the other car in a low-density area.

Assigning the car in the less dense area allows the closer car to handle potential demands in denser areas.

© 2019 Warren B. Powell

Page 21

Driverless EV optimization

© 2019 Warren B. Powell

Page 22

Driverless EV optimization

The central operator should think about:
• Should it accept this trip?
• What is the best car to assign to the trip, considering
– The type of car
– The charge level of the battery
» If it doesn’t assign a car to a trip, should it:
• Sit where it is?
• Reposition to a better location?
• Recharge the battery?
• Move to a parking facility?

© 2019 Warren B. Powell

Page 23

Driverless EV optimization

[Figure: riders and cars to be matched.]

© 2019 Warren B. Powell

Page 24

Driverless EV optimization

[Figure: assignments of cars to riders over times t, t+1, t+2.]

© 2019 Warren B. Powell

Page 25

Driverless EV optimization

[Figure: assignments of cars to riders over times t, t+1, t+2, continued.]

© 2019 Warren B. Powell

Page 26

Driverless EV optimization

[Figure: assignments of cars to riders over times t, t+1, t+2.]

The assignment of cars to riders evolves over time, with new riders arriving, along with updates of cars available.

© 2019 Warren B. Powell

Page 27

Driverless EV optimization

Policies based on value function approximations

» Exact value functions are rare:
• Discrete states and actions, with a computable one-step transition matrix.
» Approximate value functions are defined by:
• Approximation architecture – linear, nonlinear separable, nonlinear
• Learning strategy – pure forward pass, two-pass

$$X^{VFA}(S_t^n) = \arg\max_{x_t} \Big( C(S_t^n, x_t) + \overline{V}_t^{n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)$$

Here $C(S_t^n, x_t)$ is the contribution for taking decision $x_t$ from state $S_t^n$, and $\overline{V}_t^{n-1}$ approximates the expected value of the post-decision state.
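As a minimal illustration of this policy class (a hypothetical sketch, not the course code; `decisions`, `contribution`, `post_decision`, and `V_bar` are assumed callables and containers), the decision is whichever one maximizes contribution plus the approximate value of the post-decision state:

```python
def vfa_policy(S, decisions, contribution, post_decision, V_bar):
    """X^{VFA}(S) = argmax over x of C(S, x) + V_bar(S^{M,x}(S, x))."""
    return max(decisions,
               key=lambda x: contribution(S, x) + V_bar(post_decision(S, x)))
```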

© 2019 Warren B. Powell

Page 28

Driverless EV optimization

[Figure: a car with attribute vector $a$ (time, location, battery, fleet) can cover trip 1 at cost $c_{t a_1 r_1}$ or trip 2 at cost $c_{t a_1 r_2}$, with downstream values $\bar v(a')$ and $\bar v(a'')$.]

© 2019 Warren B. Powell

Page 29

[Figure: each assignment arc now carries cost plus downstream value, $c_{t a_1 r_1} + \bar v(a')$ and $c_{t a_1 r_2} + \bar v(a'')$, for the car with attribute vector $a$ (time, location, battery, fleet).]

Driverless EV optimization

© 2019 Warren B. Powell

Page 30

[Figure: assignment network over cars $a_1, a_2, a_3, a_4$, with downstream values $\bar v(a_1'), \bar v(a_2'), \bar v(a_4')$ and $\bar v(a_1''), \bar v(a_2''), \bar v(a_3'')$ on the arcs.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 31

[Figure: the same network; solving it yields an observed marginal value $\hat v(a_1)$ for car $a_1$.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 32

[Figure: the assignment network is re-solved.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 33

[Figure: new car assignments, with new future values $\bar v(a_2''), \bar v(a_3''), \bar v(a_4'')$.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 34

[Figure: the original solution, with future values $\bar v(a_1''), \bar v(a_2''), \bar v(a_3'')$.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 35

[Figure: the new solution, with future values $\bar v(a_2''), \bar v(a_3''), \bar v(a_4'')$.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 36

[Figure: the difference between the original and new solutions gives the marginal value $\hat v(a_1)$, accumulated from the chain of differences $\bar v(a_1'') - \bar v(a_2'')$, $\bar v(a_2'') - \bar v(a_3'')$, $\bar v(a_3'') - \bar v(a_4'')$.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 37

[Figure: $\hat v(a_1)$ combines the difference in assignment costs with the changes in future values along the chain.]

Driverless EV optimization

© 2019 Warren B. Powell

Page 38

Driverless EV optimization

Assignment network
» Capture the value of the downstream driver.
» Add this value to the assignment arc.

[Figure: drivers $a_1, \dots, a_5$ assigned to loads; each decision $d$ for driver $a_3$ leads to a future attribute vector $a^M(a_3, d_1), \dots, a^M(a_3, d_5)$.]

$a^M(a_3, d)$ = attribute vector of the driver in the future given a decision $d$.

© 2019 Warren B. Powell

Page 39

Driverless EV optimization

Finding the marginal value of a driver:
» Dual variables
• Can provide unreliable estimates.
• Need to get the marginal value of drivers who are not actually there.
» Numerical derivatives:
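A hedged sketch of the numerical-derivative idea: re-solve the assignment problem with one additional driver of attribute $a$ and difference the objective values. `solve_assignment` and `add_driver` are hypothetical helpers, not the course code:

```python
def marginal_value(drivers, loads, a, solve_assignment, add_driver):
    """Estimate v(a) = F(drivers + one driver with attribute a) - F(drivers).

    solve_assignment returns the optimal objective of the assignment problem;
    add_driver returns a copy of the driver set with one more driver of
    attribute a (this also handles drivers who are not actually there).
    """
    base = solve_assignment(drivers, loads)
    perturbed = solve_assignment(add_driver(drivers, a), loads)
    return perturbed - base
```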

© 2019 Warren B. Powell

Page 40

Driverless EV optimization

Estimating average values:

$$\bar v^n(a) = (1-\alpha_{n-1})\,\bar v^{n-1}(a) + \alpha_{n-1}\,\hat v^n(a)$$

where $a$ is the attribute of the car, $\bar v^{n-1}(a)$ is the old estimate of the value of a car with attribute $a$, and $\hat v^n(a)$ is the new estimate of the value of a car with attribute $a$.
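In code, this smoothing recursion is a single line; below is a minimal sketch using a harmonic stepsize $\alpha_{n-1} = 1/n$ (the stepsize rule is an assumption; any declining stepsize could be substituted):

```python
from collections import defaultdict

v_bar = defaultdict(float)   # current estimates v_bar[a]
counts = defaultdict(int)    # number of observations of attribute a

def update(a, v_hat):
    """Smooth the observed value v_hat of a car with attribute a into v_bar[a]."""
    counts[a] += 1
    alpha = 1.0 / counts[a]  # harmonic stepsize (assumed rule)
    v_bar[a] = (1 - alpha) * v_bar[a] + alpha * v_hat
```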

© 2019 Warren B. Powell

Page 41

Driverless EV optimization

We had to use two statistical techniques to accelerate learning:
» Monotonicity with respect to time and charge level.
» Hierarchical aggregation (just as we did with the trucking example); see the sketch below.
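A hedged sketch of hierarchical aggregation: keep an estimate of $\bar v(a)$ at several levels of aggregation and combine them with weights. Here the weights are proportional to observation counts, which is a deliberate simplification; the lectures weight each level by the inverse of its variance plus squared bias:

```python
def hierarchical_estimate(a, levels, v_bar, n_obs):
    """Combine estimates of v(a) across aggregation levels g.

    levels: list of functions mapping attribute a to its aggregated key.
    v_bar[g][key], n_obs[g][key]: per-level estimates and observation counts.
    Weights proportional to counts are a simplification of the
    inverse-(variance + bias^2) weights used in the lectures.
    """
    keys = [(g, G(a)) for g, G in enumerate(levels)]
    weights = [n_obs[g].get(k, 0) for g, k in keys]
    total = sum(weights) or 1
    return sum(w / total * v_bar[g].get(k, 0.0)
               for w, (g, k) in zip(weights, keys))
```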

© 2019 Warren B. Powell

Page 42

Driverless fleets of EVs using ADP

The value of a vehicle in the future
» The value function approximation captures charge level, as well as time and location.
» Hierarchical aggregation accelerated the learning process.

[Figure: marginal value of a vehicle.]

Page 43

Driverless EV optimization

Step 1: Start with a pre-decision state $S_t^n$.
Step 2: Solve the deterministic optimization using an approximate value function:
$$\hat v_t^n = \max_{x_t} \Big( C(S_t^n, x_t) + \overline V_t^{n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)$$
to obtain $x_t^n$.
Step 3: Update the value function approximation:
$$\overline V_{t-1}^{n,x}(S_{t-1}^{x,n}) = (1-\alpha_{n-1})\,\overline V_{t-1}^{n-1,x}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat v_t^n$$
Step 4: Obtain a Monte Carlo sample of $W_{t+1}(\omega^n)$ and compute the next pre-decision state:
$$S_{t+1}^n = S^M\big(S_t^n, x_t^n, W_{t+1}(\omega^n)\big)$$
Step 5: Return to step 1.

Steps 2–4 cycle through deterministic optimization, recursive statistics, and simulation: “on-policy learning.”
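A compact sketch of this loop (hypothetical helper names throughout; in the EV application the inner max would be an assignment problem solved by a network solver, and `V_bar` is assumed to be a list of dict-backed lookup tables over hashable post-decision states):

```python
def forward_adp(S0, T, N, decisions, C, SMx, SM, sample_W, V_bar, alpha):
    """Forward ADP with post-decision states (minimal sketch)."""
    for n in range(N):
        S = S0
        S_post_prev = None
        for t in range(T):
            # Step 2: optimize contribution plus downstream value.
            x = max(decisions(S),
                    key=lambda x: C(S, x) + V_bar[t].get(SMx(S, x), 0.0))
            v_hat = C(S, x) + V_bar[t].get(SMx(S, x), 0.0)
            # Step 3: update the VFA around the previous post-decision state.
            if S_post_prev is not None:
                old = V_bar[t - 1].get(S_post_prev, 0.0)
                V_bar[t - 1][S_post_prev] = (1 - alpha) * old + alpha * v_hat
            S_post_prev = SMx(S, x)
            # Step 4: simulate exogenous information ("on-policy learning").
            S = SM(S, x, sample_W(t))
    return V_bar
```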

© 2019 Warren B. Powell

Page 44

Driverless EV optimization

» 22,000 zones
» 2,000 cars
» Battery capacity: 50 kWh
» 31,560 trips

© 2019 Warren B. Powell

Page 45

Driverless EV optimization

No aggregation – the solution gets worse!

© 2019 Warren B. Powell

Profit versus number of iterations

Page 46

Driverless EV optimization

Three levels of aggregation - Better

© 2019 Warren B. Powell

Profit versus number of iterations

Page 47

Driverless EV optimization

Five levels of aggregation – old VFAs

© 2019 Warren B. Powell

Profit versus number of iterations

Page 48

Driverless EV optimization

Five levels of aggregation – new VFAs

© 2019 Warren B. Powell

Profit versus number of iterations

Page 49

Driverless EV optimization

Trip requests over time
» The challenge is to recharge during off-peak periods.

© 2019 Warren B. Powell

[Figure: trip requests versus time.]

Page 50

Driverless fleets of EVs using ADP

Heuristic dispatch vs. ADP-based policies
» Effect of value function approximations on recharging

[Figure: total battery charge, trip requests, and actual trips over time. With the heuristic recharging logic, recharging occurs during the peak period; with recharging controlled by approximate dynamic programming, there is no recharging during the peak period.]

Page 51

The economics of driverless fleets

We can simulate different fleet sizes and battery capacities, properly modeling recharging behaviors given battery capacity.

[Figure: profit ($) across fleet sizes and battery capacities.]

Page 52

Approximate dynamic programming

Exploiting convexity in an inventory problem

© 2018 W.B. Powell Slide 52

Page 53

Exploiting convexity

Single inventory problem
» With pre-decision state:
$$x_t^n = \arg\max_{x_t \in \mathcal X_t} \Big( C(S_t^n, x_t) + \mathbb E\big[\, \overline V_{t+1}^{n-1}\big(S^M(S_t^n, x_t, W_{t+1})\big) \,\big|\, S_t^n \big] \Big)$$
» With post-decision state:
$$x_t^n = \arg\max_{x_t \in \mathcal X_t} \Big( C(S_t^n, x_t) + \overline V_t^{x,n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)$$

Page 54

Exploiting convexity

Updating strategies
» Forward pass (approximate value iteration), updating as we go (also known as TD(0)):
• Compute the value of being in a state by “bootstrapping” the downstream value function approximation:
$$\hat V_t^n(S_t^n) = \max_{x_t \in \mathcal X_t} \Big( C(S_t^n, x_t) + \overline V_t^{x,n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)$$
• Now compute the marginal value using
$$\hat v_t^n = \frac{\hat V_t^n(S_t^n + \delta) - \hat V_t^n(S_t^n)}{\delta}$$
• Use this to update the piecewise linear value functions.

© 2019 Warren B. Powell

Page 55

Exploiting convexity

Updating strategies
» Double pass:
• Perform a forward pass, simulating the policy for $t = 0, \dots, T$:
$$x_t^n = \arg\max_{x_t \in \mathcal X_t} \Big( C(S_t^n, x_t) + \overline V_t^{x,n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)$$
• Now perturb the state by $\delta$ and solve again:
$$x_t^{\delta,n} = \arg\max_{x_t \in \mathcal X_t} \Big( C(S_t^{\delta,n}, x_t) + \overline V_t^{x,n-1}\big(S^{M,x}(S_t^{\delta,n}, x_t)\big) \Big)$$
Update $S_{t+1}^{\delta,n} = S_t^{\delta,n} + x_t^{\delta,n}$ (remember $x_t^{\delta,n}$ may be negative). Store $S_t^{\delta,n}$ as you proceed. Also store the incremental cost
$$\hat C_t^{\delta} = C(S_t^{\delta,n}, x_t^{\delta,n}) - C(S_t^n, x_t^n)$$
• Perform a backward pass, computing marginal values (and performing updates):
$$\hat v_t^n = \hat C_t^{\delta}(S_t^n, x_t^n) + \hat v_{t+1}^n$$
$$\overline V_{t-1}^n(S_{t-1}^{x,n}) \leftarrow U^V\big(\overline V_{t-1}^{n-1}, S_{t-1}^{x,n}, \hat v_t^n\big)$$
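A minimal sketch of the backward pass, under the assumption that the incremental costs $\hat C_t^\delta$ were stored on the forward pass:

```python
def backward_pass(delta_costs):
    """Backward pass of the double-pass (TD(1)) procedure.

    delta_costs[t] is the incremental cost C^delta_t collected on the forward
    pass from perturbing the state at time t; the marginal value accumulates
    backward: v_hat[t] = delta_costs[t] + v_hat[t+1].
    """
    T = len(delta_costs)
    v_hat = [0.0] * (T + 1)
    for t in reversed(range(T)):
        v_hat[t] = delta_costs[t] + v_hat[t + 1]
    return v_hat[:T]  # use these to update V_bar along the stored states
```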

Page 56

Exploiting convexity

We update the piecewise linear value functions by computing estimates of slopes using a backward pass:
» The cost along the marginal path is the derivative of the simulation with respect to the flow perturbation.

Page 57

Exploiting convexity

From pre- to post-decision state
» Get the marginal value of inventory, $\hat v_t^n$, at time $t$ for inventory level $R_t^n$ during iteration $n$.
» Use this to update the value function around the previous post-decision state $S_{t-1}^{x,n}$, to obtain $\overline V_{t-1}^n(S_{t-1}^{x,n})$.

© 2019 Warren B. Powell

Page 58

Exploiting convexity

Updating strategies
» The choice between a pure forward pass (TD(0)) and a double pass (TD(1)) is highly problem dependent. Every problem has a natural “horizon.”
» We have found that in our fleet management problems, the pure forward pass works fine and is much easier to implement.
» For energy storage problems, it is essential that we use a double pass, since a decision at time $t$ can have an impact hundreds of time periods into the future.

© 2019 Warren B. Powell

Page 59

© 2004 Warren B. Powell Slide 59

Exploiting convexity

It is important to maintain concavity:

[Figure: piecewise linear concave value function and its slopes $\bar v_t^k$.]

Page 60

© 2004 Warren B. Powell Slide 60

Exploiting convexity

A concave value function has monotonically decreasing slopes $\bar v_0 \ge \bar v_1 \ge \bar v_2 \ge \bar v_3$ over the segments $u_0, u_1, u_2, u_3$ of the resource level $R_{it}$. But updating the function with a stochastic gradient $\hat V_{it}^n$ may violate this property.

[Figure: a piecewise linear concave value function and its slopes.]

Page 61

© 2004 Warren B. Powell Slide 61

Exploiting convexity

[Figure: slopes $\bar v_0, \bar v_1, \bar v_2$ over segments $u_0, u_1, u_2$ of the resource level $R_{it}^k$.]

Page 62

© 2004 Warren B. Powell Slide 62

Exploiting convexity

[Figure: an observed slope $\hat v_{it}^n$ arrives for one segment.]

Page 63

© 2004 Warren B. Powell Slide 63

Exploiting convexity

$$\bar v_{it}^n = (1-\alpha_{n-1})\,\bar v_{it}^{n-1} + \alpha_{n-1}\,\hat v_{it}^n$$

[Figure: the smoothed slope after the update.]

Page 64

© 2004 Warren B. Powell Slide 64

Exploiting convexity

[Figure: the update $\bar v_{it}^n = (1-\alpha_{n-1})\bar v_{it}^{n-1} + \alpha_{n-1}\hat v_{it}^n$ violates monotonicity of the slopes.]

Page 65

© 2004 Warren B. Powell Slide 65

Exploiting convexity

[Figure: after the update, neighboring slopes are adjusted to restore concavity.]

Page 66

Exploiting convexity

Ways of maintaining concavity (monotonicity in the slopes):
» CAVE algorithm – Use the updated slope at one point to update over a range, which shrinks with the iterations (the shrinking factor is a tunable parameter). Expand the range if needed to ensure concavity is maintained.
• Works well! But we could never prove convergence.
» Leveling algorithm – Update at a point $x$.
• Force monotonicity in the slopes by increasing/decreasing slopes farther from $x$ as needed.
• Works fine, without tunable parameters.
» SPAR algorithm – Perform a nearest-point projection onto the space of monotone functions (see the sketch below).
• Nice theoretical convergence proof – could never get it to work.
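A hedged sketch of the SPAR-style projection step: the updated slope vector is projected onto the nonincreasing set by pooling adjacent violators, which gives the nearest-point (least-squares) projection. This is an illustration of the projection idea, not the published algorithm's code:

```python
def project_concave(slopes):
    """Nearest-point projection of slopes onto the nonincreasing set
    (pool-adjacent-violators), restoring concavity of the value function."""
    pools = []  # each pool: [sum of slopes, count]
    for s in slopes:
        pools.append([s, 1])
        # Merge while a later pool's average exceeds the previous pool's.
        while (len(pools) > 1
               and pools[-2][0] / pools[-2][1] < pools[-1][0] / pools[-1][1]):
            total, count = pools.pop()
            pools[-1][0] += total
            pools[-1][1] += count
    out = []
    for total, count in pools:
        out.extend([total / count] * count)
    return out

# Example: a stochastic update broke monotonicity of the slopes.
print(project_concave([5.0, 3.0, 4.0, 1.0]))  # [5.0, 3.5, 3.5, 1.0]
```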

© 2019 Warren B. Powell

Page 67

Exploiting convexity

Derivatives are used to estimate a piecewise linear approximation $\overline V_t(R_t)$.

[Figure: piecewise linear $\overline V_t(R_t)$ as a function of $R_t$.]

Page 68

Exploiting convexity

With luck, your objective function improves.

[Figure: objective function versus iterations.]

Page 69

Exploiting convexity

Testing on a time-dependent, deterministic problem
» Blue = optimal (found by solving a single linear program over the entire horizon)
» Black = ADP solution

© 2014 W. B. Powell

[Figure: ADP solution versus the optimal solution.]

Page 70

Exploiting convexity

Stochastic, time-dependent problems

[Figure: percent of optimal on 21 benchmark storage problems.]

Page 71

Exploiting convexity

Stochastic, time-dependent problems

[Figure: percent of optimal (95–101 percent) on 21 benchmark storage problems.]

Page 72

Approximate dynamic programming

Resource allocation (freight cars) for two-stage problem

© 2018 W.B. Powell Slide 72

Page 73

© 2004 Warren B. Powell Slide 73

Page 74

© 2004 Warren B. Powell

Forecasts of Car Demands

[Figure: forecast versus actual car demands.]

Page 75

© 2004 Warren B. Powell

Two-stage problems

Two-stage resource allocation under uncertainty

Page 76

© 2004 Warren B. Powell

Optimization frameworks

We obtain piecewise linear recourse functions for each region.

Page 77

© 2004 Warren B. Powell

Optimization frameworks

The function is piecewise linear on the integers. We approximate the value of cars in the future using a separable approximation.

[Figure: profits versus the number of vehicles (0–5) at a location.]

Page 78

© 2004 Warren B. Powell

Optimization frameworks

Using a standard network transformation: each link captures the marginal reward of an additional car.

Page 79

© 2004 Warren B. Powell

Two-stage problems

Page 80

© 2004 Warren B. Powell

Two-stage problems

Page 81

© 2004 Warren B. Powell

Two-stage problems

[Figure: network with supplies $R_1^n, \dots, R_5^n$.]

Page 82

© 2004 Warren B. Powell

Two-stage problems

We estimate the functions by sampling from our distributions.

[Figure: supplies $R_1^n, \dots, R_5^n$ face sampled demands $\hat D_1^n(\omega), \hat D_2^n(\omega), \hat D_3^n(\omega), \dots$, yielding marginal values $\hat v_1^n(\omega), \dots, \hat v_5^n(\omega)$.]

Page 83

© 2004 Warren B. Powell

Two-stage problems

The time $t$ subproblem: $V_{ta}^n(R_{t1}, R_{t2}, R_{t3})$.

[Figure: resources $R_{t1}, R_{t2}, R_{t3}$ flow to downstream nodes such as $(i-1, t+3)$, $(i, t+1)$, $(i+1, t+5)$.]

Gradients: left and right marginal values $(\hat v_{t1}^{n-}, \hat v_{t1}^{n+})$, $(\hat v_{t2}^{n-}, \hat v_{t2}^{n+})$, $(\hat v_{t3}^{n-}, \hat v_{t3}^{n+})$.

Page 84

© 2004 Warren B. Powell

Two-stage problems

Left and right gradients are found by solving flow augmenting path problems. The right derivative (the value of one more unit of a resource) is found from a flow augmenting path from that node to the supersink.

[Figure: gradient $\hat v_{t3}^n$ of $V_{ta}^n(R_{t1}, R_{t2}, R_{t3})$ with respect to $R_{t3}$.]

Page 85

© 2004 Warren B. Powell

Two-stage problems

Left and right derivatives are used to build up a nonlinear approximation of the subproblem.

[Figure: piecewise linear approximation $\overline V_{it}(R_{1t})$ around the point $R_{1t}^k$.]

Page 86

© 2004 Warren B. Powell

Two-stage problems

Left and right derivatives are used to build up a nonlinear approximation of the subproblem.

[Figure: left derivative $\bar v_t^{k-}$ and right derivative $\bar v_t^{k+}$ of $\overline V_{it}(R_{1t})$ at $R_{1t}^k$.]

Page 87

© 2004 Warren B. Powell

Two-stage problems

Each iteration adds new segments, as well as refining old ones.

[Figure: derivatives $\bar v_t^{(k+1)-}$ and $\bar v_t^{(k+1)+}$ at a new point $R_{1t}^{k+1}$.]

Page 88

© 2004 Warren B. Powell

Two-stage problems

0.0

0.5

1.0

1.5

2.0

2.5

0 1 2 3 4 5 6 7 8 9 10

Variable Value, s

Func

tiona

l Val

ue, f

(s) =

ln(1

+s)

Exact1 Iter2 Iter5 Iter10 Iter15 Iter20 Iter

Number of resources

Page 89

© 2004 Warren B. Powell

What we can prove

SPAR algorithm
» Powell, Ruszczynski and Topaloglu, Mathematics of Operations Research (2004).

Leveling algorithm
» Topaloglu and Powell, Operations Research Letters (2003).

But so far, our fastest convergence is with the original version, called the CAVE algorithm, for which convergence has not been proven.
» Godfrey and Powell, Management Science (2001).

Page 90

Approximate dynamic programming

Blood management problem
Managing multiple resources

© 2018 W.B. Powell Slide 90

Page 91

Slide 91

Blood management

Managing blood inventories over time

[Figure: timeline over weeks 0–3. At each time $t = 0, 1, 2, 3$, the state $S_t$ leads to a decision $x_t$ and a post-decision state $S_t^x$, after which new donations and demands $\hat R_{t+1}, \hat D_{t+1}$ arrive.]

© 2018 Warren B. Powell

Page 92

[Figure: assignment network. Blood supplies $R_{t,(AB+,0)}, R_{t,(AB+,1)}, R_{t,(AB+,2)}$ and $R_{t,(O-,0)}, R_{t,(O-,1)}, R_{t,(O-,2)}$ (type, age) either satisfy a demand $\hat D_{t,AB}, \dots$ for one of the blood types AB+, AB−, A+, A−, B+, B−, O+, O−, or are held, producing the post-decision inventories $R_t^x$ with ages incremented.]

© 2018 Warren B. Powell

Page 93

[Figure: pre-decision inventories $R_t$ become post-decision inventories $R_t^x$ (ages incremented after satisfying demands $\hat D_t$); new donations $\hat R_{t+1,AB}, \hat R_{t+1,O}$ then yield $R_{t+1}$.]

© 2018 Warren B. Powell

Page 94

[Figure: supplies $R_{t,(AB+,0)}, \dots, R_{t,(O-,2)}$ are assigned to demands $\hat D_t$ or held, producing $R_t^x$.]

© 2018 Warren B. Powell

Page 95

[Figure: the assignment network defines $F_t(R_t)$: supplies $R_t$ are assigned to demands $\hat D_t$ or held, producing $R_t^x$.]

Solve this as a linear program.

© 2018 Warren B. Powell

Page 96

[Figure: the linear program $F_t(R_t)$ with duals $\hat\pi_{t,(AB+,0)}, \hat\pi_{t,(AB+,1)}, \hat\pi_{t,(AB+,2)}, \hat\pi_{t,(O-,0)}, \hat\pi_{t,(O-,1)}, \hat\pi_{t,(O-,2)}$ on the supply nodes.]

Dual variables give the value of an additional unit of blood.

© 2018 Warren B. Powell

Page 97

Updating the value function approximation

Estimate the gradient at $R_t^n$: the dual $\hat\pi_{t,(AB+,2)}^n$ gives the slope of $F_t(R)$ at $R_{t,(AB+,2)}^n$.

[Figure: $F_t(R)$ and its slope at $R_t^n$.]

© 2018 Warren B. Powell

Page 98

Updating the value function approximation

Update the value function at $R_{t-1}^{x,n}$: smooth the slope $\hat\pi_{t,(AB+,2)}^n$ into $\overline V_{t-1}^{n-1}(R_{t-1}^x)$.

[Figure: $F_t(R)$ at $R_{t,(AB+,2)}^n$ and the approximation $\overline V_{t-1}^{n-1}(R_{t-1}^x)$ at $R_{t-1}^{x,n}$.]

© 2018 Warren B. Powell

Page 99

Updating the value function approximation

Update the value function at $R_{t-1}^{x,n}$.

[Figure: the slope $\hat\pi_{t,(AB+,2)}^n$ is applied to $\overline V_{t-1}^{n-1}(R_{t-1}^x)$ at $R_{t-1}^{x,n}$.]

© 2018 Warren B. Powell

Page 100

Updating the value function approximation

Update the value function at $R_{t-1}^{x,n}$.

[Figure: the updated approximation $\overline V_{t-1}^n(R_{t-1}^x)$.]

© 2018 Warren B. Powell

Page 101

Updating the value function approximation

Notes:
» We get a marginal value for each supply node.
» These are provided automatically by linear programming solvers.
» Be careful when the supply at a node = 0. While we would like to have the marginal value of one more unit, if we use the marginal values produced by the linear programming code, it might be the slope to the left (where the “supply” would be negative).

© 2018 Warren B. Powell

Page 102

Exploiting concavity

Derivatives are used to estimate a piecewise linear approximation $\overline V_t(R_t)$.

[Figure: piecewise linear $\overline V_t(R_t)$ as a function of $R_t$.]

© 2018 Warren B. Powell

Page 103

Approximate value iteration

Step 1: Start with a pre-decision state $S_t^n$.
Step 2: Solve the deterministic optimization using an approximate value function:
$$\max_x \Big( C(S_t^n, x) + \overline V_t^{n-1}\big(S^{M,x}(S_t^n, x)\big) \Big)$$
to obtain $x_t^n$ and dual variables $\hat v_{ti}^n$.
Step 3: Update the value function approximation:
$$\overline V_{t-1}^{n,x}(S_{t-1}^{x,n}) = (1-\alpha_{n-1})\,\overline V_{t-1}^{n-1,x}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat v_t^n$$
Step 4: Obtain a Monte Carlo sample of $W_t(\omega^n)$ and compute the next pre-decision state:
$$S_{t+1}^n = S^M\big(S_t^n, x_t^n, W_{t+1}(\omega^n)\big)$$
Step 5: Return to step 1.

(Steps 2–4 cycle through deterministic optimization, recursive statistics, and simulation.)

© 2018 Warren B. Powell

Page 104

Slide 104

Approximate value iteration

(The same algorithm as the previous slide, repeated as the animation steps through the loop.)

© 2018 Warren B. Powell

Page 105

Approximate value iteration

(The same algorithm, continued.)

© 2018 Warren B. Powell

Page 106

Approximate value iteration

(The same algorithm, continued.)

© 2018 Warren B. Powell

Page 107

Iterative learning

[Figure: simulating forward in time $t$, iteration by iteration.]

© 2018 Warren B. Powell

Page 108

Iterative learning

© 2018 Warren B. Powell

Page 109

Iterative learning

© 2018 Warren B. Powell

Page 110

Iterative learning

© 2018 Warren B. Powell

Page 111

Approximate dynamic programming

Grid level storage

© 2018 W.B. Powell Slide 111

Page 112

Slide 112

Grid level storage

Imagine 25 large storage devices spread around the PJM grid:

© 2019 Warren B. Powell

Page 113

Slide 113

Grid level storage

Optimizing battery storage

© 2019 Warren B. Powell

Page 114

Slide 114

Value function approximations

© 2019 Warren B. Powell

Page 115

Slide 115

Value function approximations

[Figure: value function approximations for Monday, indexed by time (:05, :10, :15, :20).]

© 2019 Warren B. Powell

Page 116

Slide 116

Value function approximations

[Figure: as on the previous slide.]

© 2019 Warren B. Powell

Page 117

Slide 117

Value function approximations

[Figure: as on the previous slide.]

© 2019 Warren B. Powell

Page 118

Exploiting concavity

Derivatives are used to estimate a piecewise linear approximation $\overline V_t(R_t)$.

[Figure: piecewise linear $\overline V_t(R_t)$ as a function of $R_t$.]

© 2019 Warren B. Powell

Page 119

Approximate dynamic programming

… a typical performance graph.

[Figure: objective function versus iterations (0–1000).]

© 2019 Warren B. Powell

Page 120

Grid level storage control

Without storage

© 2019 Warren B. Powell

Page 121

Grid level storage control

With storage

© 2019 Warren B. Powell

Page 122

Approximate dynamic programming

Locomotive application

© 2018 W.B. Powell Slide 122

Page 123

© 2019 Warren B. Powell

Page 124

© 2019 Warren B. Powell

Page 125

© 2019 Warren B. Powell

Page 126

© 2019 Warren B. Powell

Page 127

© 2008 Warren B. Powell Slide 127© 2019 Warren B. Powell

Page 128

The locomotive assignment problem

[Figure: locomotives assigned to trains among yards such as Atlanta, Baltimore, and Jacksonville.]

© 2019 Warren B. Powell

Page 129

The locomotive assignment problem

Locomotives (horsepower): 4400, 4400, 6000, 4400, 5700, 4600, 6200

Complications: consist-breakup costs, shop routing bonuses/penalties, leader logic.

© 2019 Warren B. Powell

Page 130

The locomotive assignment problem

[Figure: locomotives with horsepower 4400, 4400, 6000, 4400, 5700, 4600, 6200 coming in on the same train.]

© 2019 Warren B. Powell

Page 131

The locomotive assignment problem

[Figure: the locomotives (horsepower 4400, 4400, 6000, 4400, 5700, 4600, 6200) feed a train reward function.]

© 2019 Warren B. Powell

Page 132

The locomotive assignment problem

The train reward function

[Figure: train reward as a function of power, rising from a minimum power level to a goal power level, with a repositioning region.]

© 2019 Warren B. Powell

Page 133

The locomotive assignment problem

Locomotives (horsepower): 4400, 4400, 6000, 4400, 5700, 4600, 6200

A train may need 12,130 horsepower. Solutions:
4400 + 4400 + 6000 = 14,800
4400 + 4400 + 4400 = 13,200
6000 + 6200 = 12,200

© 2019 Warren B. Powell

Page 134

The locomotive assignment problem

(The same example, repeated as the animation highlights a solution.)

© 2019 Warren B. Powell

Page 135

The locomotive assignment problem

[Figure: locomotives grouped into horsepower buckets.]

The value of locomotives in the future is captured through nonlinear approximations.

© 2019 Warren B. Powell

Page 136

The locomotive assignment problem

[Figure: the locomotive assignment network.]

The locomotive subproblem can be solved quickly using Cplex.

© 2019 Warren B. Powell

Page 137

Iterative learning

Approximate dynamic programming

[Figure: simulating forward in time.]

© 2019 Warren B. Powell

Page 138

Iterative learning

Approximate dynamic programming

© 2019 Warren B. Powell

Page 139

Iterative learning

Approximate dynamic programming

© 2019 Warren B. Powell

Page 140

Approximate dynamic programming

Step 1: Start with a pre-decision state $S_t^n$.
Step 2: Solve the deterministic optimization using an approximate value function:
$$\hat v_t^n = \max_{x_t} \Big( C(S_t^n, x_t) + \overline V_t^{n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)$$
to obtain $x_t^n$.
Step 3: Update the value function approximation:
$$\overline V_{t-1}^{n,x}(S_{t-1}^{x,n}) = (1-\alpha_{n-1})\,\overline V_{t-1}^{n-1,x}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat v_t^n$$
Step 4: Obtain a Monte Carlo sample of $W_t(\omega^n)$ and compute the next pre-decision state:
$$S_{t+1}^n = S^M\big(S_t^n, x_t^n, W_{t+1}(\omega^n)\big)$$
Step 5: Return to step 1.

(Steps 2–4 cycle through deterministic optimization, recursive statistics, and simulation.)

© 2019 Warren B. Powell

Page 141

Approximate dynamic programming

(The same algorithm as the previous slide.)

© 2019 Warren B. Powell

Page 142

Laboratory testing

ADP as a percent of IP optimal:
» 3-day horizon
» Five locomotive types

© 2019 Warren B. Powell

Page 143

Laboratory testing

Train delay with uncertain transit times and yard delays

[Figure: train delay using deterministically trained VFAs versus stochastically trained VFAs.]

© 2019 Warren B. Powell

Page 144

Laboratory testing

How do we do it?
» The stochastic model keeps more power in inventory. The challenge is knowing when and where.

[Figure: power in inventory over the time periods of the simulation, stochastic training versus deterministic training.]

© 2019 Warren B. Powell

Page 145

Experiments with ADP for energy storage

“Does anything work?”

© 2018 W.B. Powell Slide 145

Page 146

Energy storage model

[Figure: wind energy $E_t$, grid with LMP $P_t$, load $L_t$, and energy in storage $R_t$, connected by flows $x_t$.]

Contribution: $C(S_t, x_t)$, a function of the price $P_t$, the flows $x_t$, and the battery efficiency $\eta$.

© 2019 Warren B. Powell

Page 147

A storage problem

Energy storage with stochastic prices, supplies and demands.

State variable $S_t = (E_t^{wind}, P_t^{grid}, D_t, R_t^{battery})$; controllable inputs $x_t$; exogenous inputs $W_{t+1}$:

$$E_{t+1}^{wind} = E_t^{wind} + \hat E_{t+1}^{wind}$$
$$P_{t+1}^{grid} = P_t^{grid} + \hat P_{t+1}^{grid}$$
$$D_{t+1}^{load} = D_t^{load} + \hat D_{t+1}^{load}$$
$$R_{t+1}^{battery} = R_t^{battery} + A x_t$$

© 2019 Warren B. Powell
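A minimal sketch of this transition function (the flow names and the placement of the charging efficiency `eta` are assumptions for illustration, not the course code):

```python
def transition(S, x, W, eta=0.9):
    """S = (E_wind, P_grid, D_load, R_battery); x = flows; W = exogenous changes."""
    E, P, D, R = S
    E1 = E + W["E_hat"]   # wind energy
    P1 = P + W["P_hat"]   # grid price (LMP)
    D1 = D + W["D_hat"]   # load
    # R_{t+1} = R_t + A x_t: only the storage-affecting flows enter here;
    # charging is discounted by the assumed efficiency eta.
    R1 = R + eta * (x["wind_battery"] + x["grid_battery"]) - x["battery_load"]
    return (E1, P1, D1, R1)
```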

Page 148

A storage problem

Bellman’s optimality equation:
$$V_t(S_t) = \min_{x_t \in \mathcal X_t} \Big( C(S_t, x_t) + \mathbb E\big[ V_{t+1}\big(S^M(S_t, x_t, W_{t+1})\big) \,\big|\, S_t \big] \Big)$$
with state $S_t = (E_t^{wind}, P_t^{grid}, D_t^{load}, R_t^{battery})$, decision $x_t = (x_t^{wind,battery}, x_t^{wind,load}, x_t^{grid,battery}, x_t^{grid,load}, x_t^{battery,load})$, and exogenous information $W_{t+1} = (\hat E_{t+1}^{wind}, \hat P_{t+1}^{grid}, \hat D_{t+1}^{load})$.

These are the “three curses of dimensionality.”

© 2019 Warren B. Powell

Page 149:

Approximate policy iteration

Step 1: Start with a pre-decision state $S^n$.
Step 2: Inner loop: do for $m = 1, \dots, M$:
  Step 2a: Solve the deterministic optimization using an approximate value function:
  $\hat{v}^m = \min_x \big( C(S^m, x) + \bar{V}^{n-1}(S^{M,x}(S^m, x)) \big)$
  to obtain $x^m$.
  Step 2b: Update the value function approximation:
  $\bar{V}^{n-1,m}(S^{x,m}) = (1-\alpha_{m-1})\,\bar{V}^{n-1,m-1}(S^{x,m}) + \alpha_{m-1}\,\hat{v}^m$
  Step 2c: Obtain a Monte Carlo sample of $W(\omega^m)$ and compute the next pre-decision state:
  $S^{m+1} = S^M\big(S^m, x^m, W(\omega^m)\big)$
Step 3: Update $\bar{V}^n(S)$ using $\bar{V}^{n-1,M}(S)$ and return to step 1.
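A structural sketch of the outer/inner loops, assuming a hypothetical solve() routine that performs Step 2a under the frozen outer approximation and returns the decision, the sampled value, and the post-decision state; simulate() performs Step 2c. All names are placeholders.

```python
def approximate_policy_iteration(S0, V, N, M, alpha, solve, simulate):
    """Outer loop n: freeze the policy implied by V. Inner loop m:
    evaluate that fixed policy, accumulating updates in V_inner."""
    for n in range(N):
        V_inner = dict(V)  # V^{n-1,0} starts from the frozen VFA
        S = S0
        for m in range(M):
            x, v_hat, S_post = solve(S, V)   # policy uses frozen V^{n-1}
            # Step 2b: recursive statistics on the inner approximation
            V_inner[S_post] = ((1 - alpha) * V_inner.get(S_post, 0.0)
                               + alpha * v_hat)
            S = simulate(S, x)               # Step 2c: Monte Carlo step
        V = V_inner  # Step 3: V^n is built from V^{n-1,M}
    return V
```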

© 2019 Warren B. Powell

Page 150:

Approximate policy iteration

Machine learning methods (coded in R):
» SVR – support vector regression with Gaussian radial basis kernel
» LBF – weighted linear combination of polynomial basis functions
» GPR – Gaussian process regression with Gaussian RBF
» LPR – kernel smoothing with second-order local polynomial fit
» DC-R – Dirichlet clouds: local parametric regression
» TRE – regression trees with constant local fit

© 2019 Warren B. Powell

Page 151:

Approximate policy iteration

Test problem sets
» Linear Gaussian control
• P1 = linear-quadratic regulation
• The remaining problems (P2–P7) are nonquadratic
» Finite-horizon energy storage problems (Salas benchmark problems)
• 100 time-period problems
• Value functions are fitted for each time period

© 2019 Warren B. Powell

Page 152:

Linear Gaussian control

[figure: percent of optimal (0–100) achieved on problems P1–P7 by SVR (support vector regression), GPR (Gaussian process regression), DC-R (local parametric regression), LPR (kernel smoothing), LBF (linear basis functions), and TRE (regression trees)]

© 2019 Warren B. Powell

Page 153:

Energy storage applications

[figure: percent of optimal (0–100) achieved on the energy storage benchmarks by SVR (support vector regression), GPR (Gaussian process regression), LPR (kernel smoothing), and DC-R (local parametric regression)]

© 2019 Warren B. Powell

Page 154:

Approximate policy iteration

A tale of two distributions:
» The sampling distribution, which governs the likelihood that we sample a state.
» The learning distribution, which is the distribution of states we would visit given the current policy.

[figure: two approximations $\bar{V}_1(s)$ and $\bar{V}_2(s)$ of the value function $V(s)$, plotted over the state $S$]

© 2019 Warren B. Powell

Page 155:

Approximate policy iteration

Using the optimal value function

Now we are going to use the optimal policy to fit approximate value functions and watch the stability.

[figure: optimal value function with a quadratic fit; state distribution under the optimal policy]

© 2019 Warren B. Powell

Page 156:

Approximate policy iteration

• Policy evaluation: 500 samples (the problem has only 31 states!)
• After 50 policy improvements with the optimal-policy state distribution: divergence in the sequence of VFAs, 40–70% optimality.
• After 50 policy improvements with a uniform state distribution: stable VFAs, 90% optimality.

[figure: state distribution under the optimal policy; VFAs estimated after 50 and after 51 policy iterations]

© 2019 Warren B. Powell

Page 157:

Policy iteration with parametric VFA

Observations:
» Experiments using a number of machine learning approximations produced poor results (60–80 percent of optimal).
» Policy search using a simple error-correction term works surprisingly well (for this problem).
» Using low-order polynomial approximations, fit using state distributions simulated under the optimal policy, works poorly!

© 2019 Warren B. Powell

Page 158:

“I think you give too rosy a picture of ADP….” – Andy Barto, in comments on a paper (2009)

“Is the RL glass half full, or half empty?” – Rich Sutton, NIPS workshop (2014)

Approximate value functions can work very well, but you need structure to guide the learning process. ADP needs benchmarks and careful tuning. – WBP

© 2019 Warren B. Powell

Page 159:

Least squares approximate policy iteration

…and comparison to policy search

© 2018 W.B. Powell Slide 159

Page 160:

Least squares API vs. policy search

Assume we approximate our value function using a linear model:
$\bar{V}(S) = \sum_{f \in F} \theta_f \phi_f(S) = \phi(S)^T \theta$

Using the post-decision state:
$\bar{V}^x(S^x) = \sum_{f \in F} \theta_f \phi_f(S^x) = \phi(S^x)^T \theta$

In steady state, we might write:
$V^x(S^x_{t-1}) = \mathbb{E}\big\{ \max_x \big( C(S_t, x) + \gamma\, V^x(S^x_t) \big) \,\big|\, S^x_{t-1} \big\}$

© 2019 Warren B. Powell

Page 161:

Least squares API vs. policy search

Least squares policy iteration (Lagoudakis and Parr):
» Bellman's equation (infinite horizon):
$V(s) = \max_x \Big( C(s, x) + \gamma \sum_{s'} P(s' \mid s, x)\, V(s') \Big)$
» … is equivalent to (for a fixed policy $\pi$):
$\phi(s)^T \theta = C(s, X^\pi(s)) + \gamma \sum_{s'} P(s' \mid s, X^\pi(s))\, \phi(s')^T \theta$
» Rearranging gives:
$\Big( \phi(s) - \gamma \sum_{s'} P(s' \mid s, X^\pi(s))\, \phi(s') \Big)^T \theta = C(s, X^\pi(s))$
» … where the term in parentheses is our explanatory variable “X”. But we cannot compute it exactly (due to the expectation), so we sample it.

The need to sample introduces an “errors-in-variables” formulation.
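A numpy sketch of the sampled fit, using the features of the visited states as instruments (the LSTD-style solution). Here phi, phi_next and c are assumed to be collected by simulating the fixed policy; the names are illustrative.

```python
import numpy as np

def lstd_fit(phi, phi_next, c, gamma):
    """Fit theta in V(s) = phi(s)^T theta from sampled transitions.

    phi      : (n, F) features of the states visited
    phi_next : (n, F) features of the sampled next states
    c        : (n,)   sampled one-period contributions

    Solves phi^T (phi - gamma*phi_next) theta = phi^T c; using phi
    itself as the instrument corrects the errors-in-variables bias
    created by sampling the expectation.
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ c
    return np.linalg.solve(A, b)
```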

© 2019 Warren B. Powell

Page 162:

Least squares API vs. policy search

For benchmark datasets, see http://www.castlelab.princeton.edu/datasets.htm

[figure: benchmark performance of the myopic policy]

© 2019 Warren B. Powell

Page 163:

Least squares API vs. policy search

For benchmark datasets, see http://www.castlelab.princeton.edu/datasets.htm

[figure: benchmark performance of the myopic policy and LSAPI]

© 2019 Warren B. Powell

Page 164:

Least squares API vs. policy search

Variations of LSAPI:
» Theorem (W. Scott and W.B.P.): Bellman-error minimization using instrumental variables and projected Bellman-error minimization are the same!

© 2019 Warren B. Powell

Page 165:

Least squares API vs. policy search

For benchmark datasets, see http://www.castlelab.princeton.edu/datasets.htm

[figure: benchmark performance of the myopic policy and LSAPI]

© 2019 Warren B. Powell

Page 166:

Least squares API vs. policy search

For benchmark datasets, see http://www.castlelab.princeton.edu/datasets.htm

[figure: benchmark performance of the myopic policy, LSAPI, and LSAPI with projected Bellman error]

© 2019 Warren B. Powell

Page 167:

Least squares API vs. policy search

Finding the best policy (“policy search”):
» Assume our policy is given by
$X^\pi(S_t \mid \theta) = \arg\min_{x_t} \Big( C(S_t, x_t) + \sum_f \theta_f\, \phi_f\big(S^{M,x}(S_t, x_t)\big) \Big)$
» We wish to minimize the function
$\min_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} \gamma^t\, C\big(S_t, X^\pi(S_t \mid \theta)\big)$
using sampled estimates $\hat{F}(\theta, \omega)$.

© 2019 Warren B. Powell

Page 168:

Least squares API vs. policy search

Two solution methods:
» Forward approximate dynamic programming
• Use least squares approximate policy iteration (LSAPI) to solve
$V^x(S^x_{t-1}) = \mathbb{E}\big\{ \min_x \big( C(S_t, x) + V^x(S^{M,x}(S_t, x)) \big) \,\big|\, S^x_{t-1} \big\}$
$\textstyle\sum_f \theta_f \phi_f(S^x_{t-1}) = \mathbb{E}\big\{ \min_x \big( C(S_t, x) + \sum_f \theta_f \phi_f(S^{M,x}(S_t, x)) \big) \,\big|\, S^x_{t-1} \big\}$
» Policy search
• Find $\theta$ to optimize
$\min_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} \gamma^t\, C\big(S_t, X^\pi(S_t \mid \theta)\big)$
where
$X^\pi(S_t \mid \theta) = \arg\min_{x_t} \Big( C(S_t, x_t) + \sum_f \theta_f\, \phi_f\big(S^{M,x}(S_t, x_t)\big) \Big)$
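A sketch of the second method under these assumptions: F_hat(theta) is a hypothetical routine returning a noisy simulated estimate of the objective, and we are minimizing. Simple perturb-and-accept random search stands in for whatever search routine one prefers (SPSA or CMA-ES would be natural substitutes).

```python
import numpy as np

def policy_search(F_hat, theta0, n_reps=20, n_iters=200, sigma=0.1, seed=0):
    """Minimize E[F_hat(theta)] by random local search, averaging
    n_reps simulations per candidate to tame the sampling noise."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    best = np.mean([F_hat(theta) for _ in range(n_reps)])
    for _ in range(n_iters):
        cand = theta + sigma * rng.standard_normal(theta.shape)
        score = np.mean([F_hat(cand) for _ in range(n_reps)])
        if score < best:
            theta, best = cand, score
    return theta, best
```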

© 2019 Warren B. Powell

Page 169:

Least squares API vs. policy search

For benchmark datasets, see http://www.castlelab.princeton.edu/datasets.htm

[figure: benchmark performance of the myopic policy, LSAPI, and LSAPI with instrumental variables]

© 2019 Warren B. Powell

Page 170:

Least squares API vs. policy search

For benchmark datasets, see http://www.castlelab.princeton.edu/datasets.htm

[figure: benchmark performance of policy search]

© 2019 Warren B. Powell

Page 171:

Least squares API vs. policy search

Notes:
» Using various parametric approximations for the value function did not work well (but it did work well with the trucking application!).
» Using the same functions, but fitted using policy search, worked quite well.
» We do not know how general this conclusion is. For example, when we can exploit problem structure, value function approximations fitted using Bellman's equation can work quite well.
» It appears that the value function approximations need to be quite accurate, and not just locally.

© 2019 Warren B. Powell

Page 172:

Least squares API vs. policy search

Notes:
» So why don't we always do policy search?
• It simply doesn't work all the time.
• It is unable to handle time-dependent policies.
• The vector θ needs to be fairly low-dimensional if we do not have access to derivatives (and derivatives where the CFA term is in the objective can be tricky).
» One strategy is to use both:
• Start by trying to fit θ to approximate the value function; use this to get an initial value of θ.
• Then use policy search to tune it to something even better.

© 2019 Warren B. Powell

Page 173:

Approximate dynamic programming

Stochastic dual dynamic programming

© 2018 W.B. Powell Slide 173

Page 174:

ADP with Benders cuts (SDDP)

ADP with Benders decomposition:
» Mario Pereira (1991) was the first to notice that we could approximate stochastic linear programs (which are convex) using Benders cuts.
» SDDP requires two approximations:
• We are limited to solving a sampled model; that is, we are limited to a small set of sample paths.
• We have to assume that new information arriving at time t+1 is independent of the information at time t.
» SDDP is approximate dynamic programming using a two-pass procedure:
• We simulate forward using the current value function approximation (represented using a set of cuts).
• We then perform a backward pass to create a new cut for each time period, averaging over the set of samples.
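A minimal sketch of the cut bookkeeping for a one-dimensional resource state, assuming a minimization problem so the cuts are lower-bounding affine functions; v_trial and dual would come from solving the stage linear programs in the backward pass. Names are illustrative.

```python
def cut_vfa(cuts, R):
    """Benders-cut approximation: pointwise max of affine cuts a + b*R
    (lower bounds on a convex value function)."""
    return max((a + b * R for (a, b) in cuts), default=0.0)

def add_cut(cuts, R_trial, v_trial, dual):
    """Backward pass at one time period: the sampled-average value
    v_trial and slope (dual) at the trial state R_trial define a
    new supporting hyperplane."""
    a = v_trial - dual * R_trial  # intercept so the cut touches at R_trial
    cuts.append((a, dual))
```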

© 2019 Warren B. Powell

Page 175:

ADP with Benders cuts (SDDP)

© 2019 Warren B. Powell

Page 176:

ADP with Benders cuts (SDDP)

The forward pass

[figure: forward simulation through post-decision state variables]

© 2019 Warren B. Powell

Page 177:

ADP with Benders cuts (SDDP)

The set of sample paths (of wind)

© 2019 Warren B. Powell

Page 178:

ADP with Benders cuts (SDDP)

© 2019 Warren B. Powell

Page 179:

ADP with Benders cuts (SDDP)

© 2019 Warren B. Powell

Page 180:

ADP with Benders cuts (SDDP)

We need to solve a linear program for each sample path (“scenario”), which is why the sample set cannot be very large. We then need to average over the solutions of all of these linear programs.

© 2019 Warren B. Powell

Page 181:

ADP with Benders cuts (SDDP)

Regularization:
» In the 1990s, Ruszczynski showed that Benders decomposition works much better if you introduce a regularization term, which penalizes deviations from the current solution.
» This powerful idea (now widely used in statistics) had not been generalized in a practical way to multiperiod (“multistage”) problems.
» In recent work, Tsvetan Asamov introduced a practical way to do regularization for multiperiod problems (with hundreds of time periods). A key feature is that the regularization term does not depend on history.
» The method is very easy to implement.

© 2019 Warren B. Powell

Page 182:

ADP with Benders cuts (SDDP)

We penalize deviations from the previous resource state. The penalty is not indexed by the sample path; we simply retain the previous resource state.
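A sketch of how the forward-pass objective changes, assuming a quadratic penalty; stage_cost_plus_cuts and the mapping from the decision x to its resource state are model-specific placeholders.

```python
def regularized_objective(stage_cost_plus_cuts, R_prev, rho):
    """Asamov-style regularization: penalize deviation of the resource
    state from the single retained point R_prev (not indexed by the
    sample path)."""
    def objective(x, R_of_x):
        return stage_cost_plus_cuts(x) + rho * (R_of_x - R_prev) ** 2
    return objective
```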

© 2019 Warren B. Powell

Page 183:

ADP with Benders cuts (SDDP)

[figure: upper and lower bounds (UB/LB) vs. iteration, with and without regularization]

SDDP with regularization

» Regularization slightly improves the upper bound, but dramatically improves the lower bound.

» There is minimal additional computational cost to introduce regularization for multiperiod problems.

© 2019 Warren B. Powell

Page 184:

ADP with Benders cuts (SDDP)

Comparisons on a deterministic grid problem:
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 25 batteries

[figure: SDDP lower and upper bounds vs. SPWL; slightly faster convergence using SPWL VFAs]

© 2019 Warren B. Powell

Page 185:

ADP with Benders cuts (SDDP)

Comparisons on a deterministic grid problem:
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 50 batteries

[figure: SDDP lower and upper bounds vs. SPWL; much faster convergence using SPWL VFAs]

© 2019 Warren B. Powell

Page 186:

ADP with Benders cuts (SDDP)

Comparisons on a stochastic grid problem:
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 25 batteries

[figure: SDDP lower bounds with 20 and 100 scenarios, and SDDP upper bounds with 20 and 100 scenarios; SPWL VFAs still exhibit slightly faster convergence]

© 2019 Warren B. Powell

Page 187:

ADP with Benders cuts (SDDP)

Comparisons on a stochastic grid problem:
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 50 batteries

[figure: SDDP lower bounds with 20 and 100 scenarios, and SDDP upper bounds with 20 and 100 scenarios; SPWL VFAs still exhibit slightly faster convergence]

© 2019 Warren B. Powell

Page 188:

ADP with Benders cuts (SDDP)

Comparisons on a stochastic grid problem:
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 100 batteries

[figure: SDDP and SPWL upper bounds with 20 and 100 scenarios; performance is very similar, despite the high dimensionality (100 batteries)]

© 2019 Warren B. Powell

Page 189:

Problem classes

Learning policies (“algorithms”):
» Approximate dynamic programming
» Q-learning
» SDDP (see next slide)

© 2019 Warren B. Powell

Page 190:

ADP with Benders cuts (SDDP)

© 2019 Warren B. Powell

Page 191:

ADP with Benders cuts (SDDP)

Notes:
» SDDP is a form of learning policy: it is a method for learning the cuts of the value function.
» It is parameterized by the regularization term, with coefficients $\rho_k \ge 0$.
» So our learning policy is determined by:
• The SDDP structure.
• The parameters of the regularization term.
• The graphs above represent optimization of the learning policy.
» Then we have to test the cuts as an implementation policy, and compare to other VFA approximations.

© 2019 Warren B. Powell

Page 192:

ADP with Benders cuts (SDDP)

The forward pass

The value function approximation defines the implementation policy!

© 2019 Warren B. Powell

Page 193:

ADP with Benders cuts (SDDP)

More complex information processes:
» We can easily approximate convex functions (using Benders cuts or separable, piecewise linear value functions).
» But what if we have an external information process that affects the system?
• Weather
• Economy
» In the next slide, we show how the algorithm changes when we no longer enjoy “intertemporal independence.”
» Assume that the information state is “weather,” which may be sunny, cloudy or stormy.

© 2019 Warren B. Powell

Page 194:

ADP with Benders cuts (SDDP)

Weather states

© 2019 Warren B. Powell

Page 195:

ADP with Benders cuts (SDDP)

More complex information processes (cont'd):
» In this algorithm, we are using a lookup table over the weather state: we generate a new cut for each information state.
» This can work quite well if the information state is simple, but not if it has multiple dimensions.
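A sketch of the lookup-table idea: keep a separate cut collection per information state, so the backward pass appends cuts only to the list for the weather that was observed. The keys and affine cut form are illustrative.

```python
# One cut collection per information state (lookup table on weather).
cuts_by_weather = {"sunny": [], "cloudy": [], "stormy": []}

def vfa(weather, R):
    """Evaluate the cuts for the current information state only."""
    cuts = cuts_by_weather[weather]
    return max((a + b * R for (a, b) in cuts), default=0.0)

def add_cut(weather, R_trial, v_trial, dual):
    """Backward pass: append the new cut to this weather's list."""
    cuts_by_weather[weather].append((v_trial - dual * R_trial, dual))
```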

© 2019 Warren B. Powell