Reinforcement Learning – Overview of Recent Progress and Implications for Process Control and Beyond (Integrated Multi-Scale Decision-making)
October 4, 2018 CMU EWO Webinar
Jay H. Lee1
(with Thomas Badgwell2)
1 Korea Advanced Institute of Science and Technology, Daejeon, Korea; 2 ExxonMobil Research & Engineering Company, Clinton, NJ
Introduction to KAIST – 47 Years Old
• 1971: KAIS (Korea Advanced Institute of Science), a graduate school in Seoul established under a new law granting special privileges such as exemption from compulsory military service
• 1984: KIT (Korea Institute of Technology), an undergraduate school in Daejeon for students gifted in math and science
• 1989: KAIST (Korea Advanced Institute of Science and Technology), established through the merging of KAIS and KIT; main campus in Daejeon, business school in Seoul
KAIST Today – Brief Statistics
Overall Structure of This Talk
• Part I: Introduction of Reinforcement Learning and Implications for Process Control (Acknowledgment: Thomas Badgwell)
• Part II: And Beyond (Integrated Multi-Scale Decision-making)
Data-driven decision-making & control in the engineering domain
[Diagram: the target system (environment) yields data $\mathcal{D} = \{x_{1:n}, y_{1:n}, \theta_{1:n}\}$ through data acquisition; learning fits a model $y = f(x; \theta)$; decision-making then selects $x^*|\theta = \arg\max_x f(x; \theta)$, which is applied back to the system.]
In a dynamic & stochastic environment, data can help us model more realistically and derive a more accurate solution!
Framework of decision-making: data & model based decision making
Real-world task → (Modeling) → Formal task (model) → (Solving) → Algorithm (program) → applied to the target system, e.g.
$\max_x \sum_{i=1}^{N} P_i(x; \theta, U)$
• Validation: Are we building the right model? Is this solution good for the target system?
• Verification: Does the algorithm capture all the essential aspects of the model?
Data analytics
• Bayesian Statistics
• Machine learning
• Bayesian Network
Modeling
• Optimization
• Markov Decision Process
• Game Theory
Decision Making
• Mathematical Programming
• Dynamic Programming
• Reinforcement Learning
Agenda
• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
In Reinforcement Learning, an agent learns a decision policy by taking actions and observing the response of (the 'reward' or 'penalty' from) the environment (a framing abstracted from animal psychology).
The agent learns a policy $\pi(A_t = a \mid S_t = s)$ that maximizes a long-term value function:
$v_\pi(s) = E\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016)
The properties of an optimal policy are described by Bellman's optimality equation (from Optimal Control theory).
• Bellman's optimality equation answers the question: when is the value function $v^*(s)$ maximized? It enforces consistency of the optimal value function as the state of the environment changes [6]:
$v^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v^*(s')\right]$
• In practice Bellman's optimality equation usually cannot be solved because:
➢ We often don't know the environment model $p(s', r \mid s, a)$
➢ Solution complexity explodes with state dimension (which may be infinite)
• All RL algorithms can be regarded as approximate solutions to Bellman's optimality equation, dealing in various ways with these two limitations [2].
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016)
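Because Bellman's optimality equation underlies all of RL, a tiny worked example helps. The following is a minimal sketch (my own illustration, not from the talk) of value iteration on a hypothetical two-state, two-action MDP with a known model $p(s',r|s,a)$; the transition probabilities, rewards, and discount factor are made-up assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)],
        1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],
        1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
}
gamma = 0.9          # discount factor (assumed)
v = np.zeros(2)      # estimate of the optimal value function v*(s)

# Value iteration: repeatedly apply the Bellman optimality operator.
for _ in range(1000):
    v_new = np.array([
        max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:   # stop once the update has converged
        break
    v = v_new

# Greedy policy extracted from the converged value function.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(v, policy)
```

In realistic problems this exact sweep is impossible (unknown model, huge state space), which is exactly why the approximate RL algorithms discussed next are needed.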
Reinforcement Learning: from Vision [5] to Today's Reality
Powerful new RL algorithms + Deep Neural Nets + Powerful new hardware
[5] A. Turing, Computing machinery and intelligence. Mind, 59, 433-460, (1950).
Example: RL strategy for "AlphaGo"
Combinatorial explosion in "Go" play: approximately $250^{150} \approx 10^{360}$ branches (number of atoms in the universe ~$10^{80}$).
Reducing the Search Space (trained on self-play / simulation data):
• Breadth reduction (Policy Network): learning a probability distribution over legal moves over the board, p(next action | current state)
• Depth reduction (Value Network): predicting the expected outcome in a board position, p(my win | next state)
AlphaGo Zero: benefit of unbiased exploration!
Alpha-Go: Capturing the World's Attention
[Diagram: the Policy Network (classification; breadth) is trained from human expert positions and gives $p_{\sigma/\rho}(a|s)$; the Value Network (regression; depth) is trained from self-play (~30 million positions per day) and gives $v_\theta(s')$.]
Analogy to chess playing (a slide from my presentation at CPC-VII, Lake Louise, 2006):
• Initial policy: have expert players play a large number of games (expert position learning).
• Playing games with the current policy, then value function approximation: assign "value" scores for all the board positions encountered during the games; use some sort of interpolation for other positions.
• A new playing policy: iterative improvement via self-play ("self-optimizing simulation").
Solving Algorithms of RL: value-based or policy-based

The underlying problem is a stochastic optimization problem, $\min_\pi J = E\left[\sum_t C(s_t, \pi(s_t))\right]$, whose long-term value must be estimated from samples.

1. Value function methods (value-based)
▪ Learnt value function (or Q-value): $V_\theta(s) \approx V(s)$, $Q_\theta(s,x) \approx Q(s,x)$
▪ Implicit policy (use $V_\theta(s_t)$ instead of $V(s_t)$)
✓ TD learning (off-policy*: Q-learning, on-policy**: SARSA), with update
$V(s_t) \leftarrow V(s_t) + \alpha\left[C(s_t, \pi(s_t)) + \gamma V(s_{t+1}) - V(s_t)\right]$, where the bracketed term compares the TD target with the current estimate
✓ Dual heuristic programming

2. Policy search methods (policy-based)
▪ No value function
▪ Learnt policy: $\pi_\theta(s,x) = P[x|s,\theta]$, which can be a "stochastic" policy
✓ Policy gradient algorithms, e.g. Monte-Carlo PG (analytical method): $\theta \leftarrow \theta + \alpha \nabla_\theta J$, with $\nabla_\theta J = E\left[\nabla_\theta \log \pi_\theta(s,x)\,\hat{v}\right]$

3. Actor-critic
▪ Learnt value function (critic) and learnt policy (actor): two approximators, $V_\theta(s) \approx V(s)$ and $\pi_w(s,x) = P[x|s,w]$

*learning policy ≠ sampling policy, **learning policy = sampling policy
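As a concrete instance of the value-based family above, here is a minimal sketch (my own illustration, not from the talk) of tabular Q-learning, the off-policy TD method named on the slide, written in the usual reward-maximization convention rather than the cost convention used elsewhere in these slides. The environment interface (`env.reset()`, `env.step(a)`, `env.n_actions`) and all hyperparameters are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); states must be hashable.
    """
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if random.random() < eps:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # off-policy TD update: bootstrap with max over next-state actions
            td_target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (td_target - Q[s][a])
            s = s_next
    return Q
```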
Major Issues in RL – 1. Sampled value computation: TD vs. MC

Given the observable (noisy) rewards and states $s_t, x_t, C_t, s_{t+1}$, where $C_t = C(s_t, x_t)$, how can we compute the "estimation targets" $\hat{v}_t(s_t)$ for the hidden (valuable) value function $V(s_t)$? (TD: temporal-difference, MC: Monte-Carlo)

▪ TD learning (shallow backups, 1-step): $\hat{v}_t = C_t + \gamma V(s_{t+1})$
Bootstrapping: the update involves an estimate. Low variance but some bias (due to bootstrapping); more sensitive to the initial value.
▪ MC learning (deep backups, ∞-step): cumulated cost through a sample path, $\hat{v}_t = C_t + \gamma C_{t+1} + \gamma^2 C_{t+2} + \cdots$
Zero bias but high variance (due to many random samples); more samples required (complete sequences).
▪ TD($\lambda$) learning: a weighted average of n-step estimates,
$\hat{v}_t^{(n)} = C_t + \gamma C_{t+1} + \cdots + \gamma^{n-1} C_{t+n-1} + \gamma^{n} V(s_{t+n})$, $\quad \hat{v}_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{v}_t^{(n)}$,
spanning the range from 1-step TD through 2-step, 3-step, ..., n-step estimates to MC (∞-step).
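To make the estimation targets concrete, here is a small sketch (my own, not from the slides) computing the MC, 1-step TD, and n-step targets for one sampled trajectory; the trajectory data, value estimates, and discount factor are made up for illustration.

```python
def mc_target(costs, gamma):
    """Monte-Carlo target: discounted cumulated cost over a complete sample path."""
    return sum(gamma**k * c for k, c in enumerate(costs))

def n_step_target(costs, V, states, t, n, gamma):
    """n-step target: n sampled costs, then bootstrap from the current V estimate."""
    g = sum(gamma**k * costs[t + k] for k in range(n))
    return g + gamma**n * V[states[t + n]]

# Illustrative trajectory: states s0..s4 and stage costs C_t = C(s_t, x_t).
states = ["s0", "s1", "s2", "s3", "s4"]
costs = [1.0, 0.5, 0.2, 0.8]
V = {s: 0.0 for s in states}   # current value estimates (all zero here)
gamma = 0.9

print(mc_target(costs, gamma))                       # MC (complete-path) target from t=0
print(n_step_target(costs, V, states, 0, 1, gamma))  # 1-step TD target from t=0
print(n_step_target(costs, V, states, 0, 3, gamma))  # 3-step target from t=0
```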
Major Issues in RL – 2. Value function approximation (VFA)

Given the observable (noisy) rewards and states $s_t, x_t, C_t, s_{t+1}$ (with $C_t = C(s_t, x_t)$) and the computed estimation targets $\hat{v}_t(s_t)$, which function approximators should represent $V(s_t)$?

▪ Linear combinations of features (features parameterized by the state), with Dim(feature) << Dim(state):
$V_\theta(s) = \sum_{f \in F} \theta_f \phi_f(s) \approx V(s)$
▪ Artificial neural networks (esp. deep neural networks): for complex & nonlinear relations, requiring large data for learning
▪ Nonparametric models (e.g., Gaussian process, kernel regression)

Fitting approaches: ✓ gradient-based approaches ✓ least-squares approaches ✓ probabilistic models
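The linear-features case can be written in a few lines. The sketch below (my own illustration, not from the slides) performs one gradient-based (semi-gradient) TD(0) update of the weights of a linear value function approximator; the feature map, step size, and discount factor are assumptions.

```python
import numpy as np

def td0_linear_update(theta, phi_s, phi_s_next, cost, gamma=0.95, alpha=0.01):
    """One semi-gradient TD(0) update for V_theta(s) = theta^T phi(s)."""
    v_s = theta @ phi_s
    v_s_next = theta @ phi_s_next
    td_error = cost + gamma * v_s_next - v_s   # TD target minus current estimate
    return theta + alpha * td_error * phi_s    # gradient of V_theta(s) w.r.t. theta is phi(s)

# Illustrative usage with a made-up 3-dimensional feature map.
theta = np.zeros(3)
phi = lambda s: np.array([1.0, s, s**2])       # assumed feature map phi(s)
theta = td0_linear_update(theta, phi(0.2), phi(0.5), cost=1.0)
print(theta)
```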
Major Issues in RL – 3. Exploitation vs. Exploration

Online decision-making involves a fundamental choice: "exploit" the best option given current knowledge, or "explore" the most uncertain one. The best long-term strategy may involve short-term sacrifices, since the knowledge gained by exploring (online learning from measurements) improves later decisions.

Principles of exploration:
▪ Naive exploration: add noise to the greedy policy (e.g., ε-greedy)
▪ Optimistic initialization: assume the best until proven otherwise
▪ Optimism in the face of uncertainty: prefer actions with uncertain values (e.g., upper confidence bound)
▪ Probability matching: select actions w.r.t. the probability they are best (e.g., Thompson sampling)
▪ Knowledge (belief) state search: look-ahead search incorporating the value of knowledge (e.g., Gittins indices, knowledge gradient)

The exploitation vs. exploration dilemma ("optimal learning") can be modeled as a multi-armed bandit or as an (unknown) MDP; Bayesian inference gives a principled way of handling uncertain knowledge (e.g., as a distribution), and the resulting solution is a Bayes-Adaptive MDP.
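A multi-armed bandit makes these principles tangible. Below is a minimal sketch (my own, with made-up arm probabilities) of the first and third principles on a Bernoulli bandit: ε-greedy noise on the greedy policy, and an upper-confidence-bound rule (optimism in the face of uncertainty).

```python
import math
import random

true_p = [0.3, 0.5, 0.7]        # hypothetical success probabilities of 3 arms
counts = [0] * 3                # pulls per arm
values = [0.0] * 3              # running mean reward per arm

def pull_and_update(arm):
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

def eps_greedy(eps=0.1):
    if random.random() < eps:                      # explore: random arm
        return random.randrange(len(true_p))
    return max(range(len(true_p)), key=lambda a: values[a])   # exploit

def ucb(t, c=2.0):
    # prefer arms whose values are still uncertain (rarely tried)
    return max(range(len(true_p)),
               key=lambda a: float('inf') if counts[a] == 0
               else values[a] + c * math.sqrt(math.log(t + 1) / counts[a]))

for t in range(1000):
    pull_and_update(ucb(t))     # or: pull_and_update(eps_greedy())
print(counts, [round(v, 2) for v in values])
```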
Major Issues in RL – 4. Stochastic policy vs. Deterministic policy

A deterministic policy maps the state directly to an action, $x = \mu(s)$; a stochastic policy specifies a distribution over actions, $\pi(x|s) = \Pr[x|s]$.

Example: consider policies for iterated rock-paper-scissors.
▪ A deterministic policy is easily exploited.
▪ A uniform random policy is optimal.
Sometimes you need a stochastic policy!

How can we express a stochastic policy (in policy search methods)?
▪ Softmax policy: $\pi_\theta(s,x) = \dfrac{e^{\phi(s,x)^T\theta}}{\sum_{x'} e^{\phi(s,x')^T\theta}}$
▪ Gaussian policy: $\pi_\theta(s,x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\dfrac{(x-\mu_\theta(s))^2}{2\sigma^2}\right)$

Example: Aliased Gridworld (trained policies shown as red arrows in the figure)
▪ The agent cannot differentiate the grey states.
▪ Value-based RL → deterministic policy: it can get stuck and never reach the money (the goal state).
▪ Policy-based RL → stochastic policy: it will reach the goal state in a few steps with high probability.

Advantages (of policy-based methods):
▪ Better convergence properties
▪ Effective in high-dimensional / continuous action spaces
Disadvantages:
▪ Typically converge to a local optimum
▪ Typically inefficient & high variance
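The softmax policy above is straightforward to implement. The sketch below (my own illustration) samples an action from a softmax policy over linear features and computes the score $\nabla_\theta \log \pi_\theta(s,x)$ used by the policy-gradient update; the feature map, parameters, and action set are assumptions.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(x|s) proportional to exp(phi(s,x)^T theta)."""
    prefs = np.array([phi(s, x) @ theta for x in actions])
    prefs -= prefs.max()                     # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def score(theta, phi, s, x, actions):
    """grad_theta log pi_theta(x|s) = phi(s,x) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, a) for p, a in zip(probs, actions))
    return phi(s, x) - expected_phi

# Illustrative usage: 3 discrete actions, made-up 2-D features.
actions = [0, 1, 2]
phi = lambda s, x: np.array([s, float(x)])   # assumed feature map phi(s, x)
theta = np.array([0.1, -0.2])
probs = softmax_policy(theta, phi, 1.0, actions)
x = np.random.choice(actions, p=probs)       # sample a stochastic action
print(probs, score(theta, phi, 1.0, x, actions))
```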
Agenda
• What is Reinforcement Learning? (Bipedal Robot example)
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning example: Bipedal Robot [6] (OpenAI Gym)
• 24 continuous states = [ hull angle, angular velocity, horizontal speed, vertical speed, position of joints, joint angular speeds, legs contact with ground, 10 LIDAR rangefinder measurements ]
• 4 continuous actions = [ leg joint torques ]
• Reward / penalty: +x every time it moves forward; -0.1 for applying leg motor torque; -100 for falling down
[Figures: snapshot after 3k episodes; snapshot after 40k episodes]
A surprising policy! Better algorithms for tougher tasks.
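For readers who want to reproduce this kind of experiment, a minimal interaction loop with the OpenAI Gym toolkit [6] looks roughly like the sketch below (my own, not from the talk). The environment id "BipedalWalker-v2" and the random placeholder policy are assumptions; a real agent would replace the sampled action with its learned policy.

```python
import gym  # OpenAI Gym toolkit [6]

env = gym.make("BipedalWalker-v2")   # assumed environment id for the bipedal robot task
obs = env.reset()                    # 24-dimensional continuous observation

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()           # placeholder: random 4-D torque vector
    obs, reward, done, info = env.step(action)   # +x forward, -0.1 per torque, -100 on fall
    total_reward += reward
print("episode return:", total_reward)
```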
Agenda
• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning has advantages and disadvantages when compared with Model Predictive Control.

RL (model-free) advantages vs. MPC:
• No need to develop a process model (develop the policy from data directly)
• Able to work with complex nonlinear, stochastic environments
• Fast on-line execution
• Can adapt to changing environments

RL (model-free) disadvantages vs. MPC:
• Extensive trial-and-error learning is required
• Must be allowed to fail during training (simulation can be used)
• Training may not be stable or repeatable
• May get stuck in local minima during training
• Extensive goal engineering may be required
• Must re-do training if the goal is changed
• No closed-loop stability guarantees
Agenda
• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
MPC is the current state-of-the-art for chemical plants. Reinforcement Learning has the potential to complement it and expand its capability.
Potential process control applications of RL technology include:
• Directly replace existing process controllers with RL agents
• Use RL agents to help manage process control systems
✓ Switch controllers ON/OFF, adjust limits and tuning parameters as appropriate
✓ Compensate for common disturbances such as weather events or feedrate changes
✓ Supervising / optimizing control, esp. those involving significant uncertainty
• Use RL agents to advise operators during unusual situations (process upsets, startup/shutdown)
• Use a hierarchy of RL agents to simplify operation of a chemical plant/refinery
✓ Process safety agents
✓ Environmental compliance agents
✓ Reliability agents
✓ Economic optimization agents
Agenda
• What is Reinforcement Learning? (Bipedal Robot example)
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning research opportunities
Potential RL research opportunities include:
• RL methods with “Disciplined learning”.
• Integrate aspects of RL technology with MPC
➢ Lee and Wong [8], Morinelly and Ydstie [9], Kamthe and Deisenroth [10]
• Exploration vs. exploitation
• Find the class of function approximations for which a state-of-the-art RL algorithm (e.g. A3C [11]) converges (value and policy function approximations converge to optimal values)
• Prove closed-loop stability for the case of a state-of-the-art RL algorithm (e.g. A3C [11])
• Develop a robust RL algorithm by training in parallel with a number of environments that each represent a realization of the uncertainty set
• Develop RL technology that allows a hierarchy of prioritized RL agents to cooperate to accomplish a complex task.
Exemplary RL with "Disciplined" Learning: Integrated Reaction Separation System with Recycle (Tosukhowong and Lee, AIChE J. 2009)

[Process flow diagram: series reaction 1 →(k1)→ 2 →(k2)→ 3 in a reactor (holdup M_R, fed by F_0 with composition x_1), followed by a distillation column (holdups M_D, M_B; flows F, D, L, V) producing bottoms B with composition x_2B.]

• Manipulated inputs: $u_0 = [F\ \ M_R^{sp}\ \ M_D^{sp}\ \ M_B^{sp}\ \ L\ \ B]^T / 100$
• Stage-wise cost weights: $Q_u = 10000$, $Q_y = 6000$, $R = 20\, I_{6\times 6}$

Operating Modes        1       2      3      4
Product conc. (x2B)    0.886   0.85   0.91   0.82
Production rate (B)    100     115    80     125
Stochastic Disturbances and Variations
[Figure: on-line stochastic disturbance (one of the realizations)]
On-line performance comparison with 12 new disturbance realizations: RL (learned starting with closed-loop data from the 7 NMPC controllers) vs. 7 NMPC controllers with different tunings.
[Figure: RL controller result from one realization (x2B_sp = 0.85, B_sp = 115): product variables and manipulated variables]
[Figure: result of NMPC 1 (the best MPC controller in this case): product variables and manipulated variables]
Reinforcement Learning with Mathematical Programming for Multi-Scale Dynamic Decision Making in an Uncertain Environment
Dynamic decision-making in an uncertain environment
A decision-maker interacts with an uncertain environment: a decision is executed, the environment responds, and state information feeds back to the decision-maker for iterative improvement, i.e., a sequential decision-making process under uncertainty.
Application areas: industrial & manufacturing systems, financial engineering, robotics, power systems, medical applications, computing & communications, game playing, ...
Multi-scale decision-making: how do we integrate between layers?
Grossmann (2005). Enterprise-wide optimization: A new frontier in PSE. AIChE Journal, 51(7), 1846-1857.
Math programming over time-scale multiplicity: renewable (wind) energy example

Decision hierarchy over hours, days, and years: yearly capacity planning (sizing) fixes the design (capacity) for daily production planning, which in turn imposes operation constraints on hourly dispatch scheduling; day-ahead prediction and operation/uncertainty information, and year-ahead operation information, flow back up the hierarchy.

[Figures: daily operating cost over the year vs. its yearly average; wind profiles in summer vs. winter]

Temporally-integrated mathematical programming (MP):
❖ At fine scale: 1 year = 24 × 365 = 8760 hours → computationally infeasible
❖ At coarse scale: coarse-graining and "averaging" of hourly dynamics and uncertainty → optimistic estimation
→ "A gap between the layers"
Uncertainty handling: MP vs. MDP

Math programming (MP): solution "over" a time horizon
❖ Stochastic data: scenario tree
❖ Solution structure: decision tree $x_1, x_2(\omega_1), \ldots, x_T(\omega_{T-1})$

Markov Decision Process (MDP): "stage-wise" solution following the state transition
❖ Stochastic data: probability distribution
❖ Solution structure: decision policy $\pi: S \rightarrow X$

Birge & Louveaux (2011). Introduction to stochastic programming. Springer Science & Business Media.
Puterman (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Value function-based stage decomposition vs. multi-stage stochastic programming (MSSP) with recourse variables. The value function is the expected sum of all future costs.
Proposed "combined" strategy

High-level management / planning (MDP-based planning):
- Known state and simple dynamics (e.g., inventories, price info.)
- High-level future uncertainty (evolving over a long time)
- Long-time-horizon decision-making

Low-level operation / scheduling (finite-horizon operation containing a long-term evaluation):
- Optimization with complex dynamics and constraints
- Lower, fast-decaying uncertainty (due to frequent feedback)
- Need for resolving the end-effect

The two levels are linked through a multi-scale uncertainty model: the low level provides simulated samples from the state transition and reward; the high level provides the long-term evaluation (value function). → MDP-based planning combined with an operation model.
The high-level (planning) problem is solved by DP or RL, the low-level (operation) problem by MP. The multi-scale uncertainty model is built from "big data" (weather factors, market factors, etc.) that is nonstationary, deterministic + stochastic, and periodic (diurnal, daily, seasonal, etc.).
Applications of the strategy

< Microgrid Operation & Design >
Shin, J., Lee, J. H., & Realff, M. J. (2017). "Operational Planning and Optimal Sizing of Microgrid Considering Multi-scale Wind Uncertainty," Applied Energy, 195, 616-633.

< Refinery Procurement & Production Planning >
Shin, J., & Lee, J. H. (2017). "Crude Selection Integrated with Optimal Refinery Operation by Combining Optimal Learning and Mathematical Programming," IFAC-PapersOnLine, 50(1), 9032-9037.

< Raw Material Procurement Planning & Scheduling >
Shin, J., & Lee, J. H. (2016). "Multi-time Scale Procurement Planning Considering Multiple Suppliers and Uncertainty in Supply and Demand," Computers & Chemical Engineering, 91, 114-126.
PART 3-1
Practical Example of the Proposed Model: Hybrid Renewable Energy Network
< Microgrid Operation & Design >
Shin, J., Lee, J. H., & Realff, M. J. (2017). "Operational Planning and Optimal Sizing of Microgrid Considering Multi-scale Wind Uncertainty," Applied Energy, 195, 616-633.
Decision hierarchy in HRES

• Design & Sizing [Yearly]: decisions on capacity acquisition of each source type, to minimize capital cost & operating cost
• Unit Commitment (UC) [Daily]: decisions on which source should be on-line at what time, to minimize operating cost
• Dispatch [Hourly]: decisions on the output of on-line sources, to minimize operating cost while meeting the hourly load

Resource limitation and operation constraints flow down the hierarchy; yearly and daily operation data flow back up. Given both hourly uncertainty and daily & seasonal uncertainty, a temporally integrated stochastic model is required!

H. Liang and W. Zhuang, "Stochastic modeling and optimization in a microgrid: A survey," Energies, Vol. 7, No. 4, pp. 2027-2050, 2014.
Bhandari, Binayak, et al., "Optimization of hybrid renewable energy power systems: A review," International Journal of Precision Engineering and Manufacturing-Green Technology, 2(1), 99-112, 2015.
Yang, Hongxing, Lin Lu, and Wei Zhou, "A novel optimization sizing model for hybrid solar-wind power generation system," Solar Energy, 81(1), 76-84, 2007.
Energy System Model
❖ Wind power conversion model
❖ Battery dynamic model: self-discharging, charging to battery, discharging from battery
❖ Generators:
• Slow-start generator (SG): operational limitations (ramp up/down, minimum up/down time) → commitment plans should be decided at least a day ahead
• Fast-start generator (FG): more flexible source, but expensive
❖ Daily demand (morning ramp, evening peak)
[Figure: wind speed (m/s) and power output over the daily hours]
Temporal integration model: yearly sizing (design), daily unit commitment, and hourly generation dispatch (operation), capturing both intra-day and inter-day variation.
Multi-scale WIND model
[Figures: deterministic profile and hourly ramping (% of installed capacity); seasonality across winter, spring/fall, and summer]
Pesch, T., et al., "A new Markov-chain-related statistical approach for modelling synthetic wind power time series," New Journal of Physics, 17(5), 055001, 2015.
Wan, Yih-huei, "Analysis of wind power ramping behavior in ERCOT," Contract 303, 275-3000, 2011.
❖ Daily-average time series (inter-day wind transition): for the daily average data series (Feb.), the ACF shows an exponential decrease and the PACF cuts off after one lag.
❖ Hourly data regression model (intra-day wind scenarios): hourly wind is regressed on the daily-hour profile, the effect of the daily-average value, and the effect of the previous hour; parameters fitted by least squares.
❖ Yearly wind data generation:
- Data: 2002 ~ 2014 (except 2011) from the National Wind Technology Center
- Monthly-average wind value → daily-average value (inter-day variation) → daily-hours profile (intra-day variation) → hourly wind data generation for one year
- Winter: high production & high variation; Summer: low production & low variation
[Figures: generated vs. actual monthly-average data]
Conventional approach
❖ "One-day" planning: two-stage stochastic programming (2SSP)
Ruiz, Pablo A., et al., "Uncertainty management in the unit commitment problem," IEEE Transactions on Power Systems, 24(2), 642-651, 2009.
Limitation of the two-stage SP → not appropriate for the multi-day problem.
"End-effects": an optimal solution within a finite horizon may be a bad solution beyond the time horizon; there is a trade-off between the current-day cost and the next-day cost.
Proposed approach: yearly sizing problem via value function-based optimization
For each month m = 1, ..., 12, an operational-level VFA* is learned (the RL part): hourly operation over one day is solved as an LP (the MP part), driven by the multi-scale wind uncertainty model (seasonal, daily, hourly) with intra-day wind scenarios and inter-day wind transitions. The yearly operation cost in the sizing problem is then estimated from the value function, which captures the trade-off between current-day and future cost subject to the capacity limitation.
*VFA: value function approximation
Case study
❖ Tested methods:
Method 1: two-stage SP model without the value function (daily independent)
Method 2: MDP + LP with the value function
❖ 2002 ~ 2014 (except 2011) wind data is used for wind uncertainty modeling
❖ 2015 wind data is used for comparing the performance of the tested methods
❖ [Table: system parameters for the case study, covering SG1, SG2, SG3, FG, the battery (out/in), HVDC (buy/sell), the system, and other parameters; demand peak 25, off-peak 13]
Example case: Method 1 (2SSP model without the value function) vs. Method 2 (MDP+LP model with the value function), illustrating the trade-off between current & future cost.

At 15th day (daily wind production: 4.86):
            1st stage cost   2nd stage cost   Overall cost
Method 1    41.30            202.30           243.60
Method 2    41.30            203.69           244.99
(Method 2 vs. Method 1: -0.57%)

At 16th day (daily wind production: 6.22):
            1st stage cost   2nd stage cost   Overall cost
Method 1    41.30            166.76           208.06
Method 2    34.90            158.39           193.29
(Method 2 vs. Method 1: +7.10%)
Case study: results
❖ Tested methods:
Method 1*: two-stage SP model without the value function (daily independent)
Method 2**: two-stage SP model with the value function
❖ Affected day: a day on which Method 1 and Method 2 make different decisions
❖ Improvement for the 151 affected days (40%), by month:

Month   Improvement of Method 2** over Method 1* (%)
1       6.82
2       6.12
3       3.27
4       2.39
5       1.95
6       2.45
7       2.13
8       3.12
9       6.53
10      12.03
11      16.46
12      10.27
PART 3-2
Practical Example of the Proposed Model: Refinery Procurement & Production Planning
< Refinery Procurement & Production Planning >
Shin, J., & Lee, J. H. (2017). "Crude Selection Integrated with Optimal Refinery Operation by Combining Optimal Learning and Mathematical Programming," IFAC-PapersOnLine, 50(1), 9032-9037.
Decision-making structure

• Crude Oil Selection: MDP* (day-to-day)
- Recurring dynamics & decisions, dominated by price uncertainties
- Infinite-time-horizon decision-making
- Driven by a daily price uncertainty model: $p_t = (p_{t,\mathrm{WTI}},\ p_{t,\mathrm{JF}},\ p_{t,\mathrm{GAS}},\ p_{t,\mathrm{DSL}})$, $\hat{p}_{t+1} = p_t + w_{t+1}$ (data: U.S. Energy Information Administration (EIA), 2014.12 ~ 2017.3, 600 days)
• Refinery Optimization: LP** (for one day)
- Optimization with many constraints & realized price info.
- Finite-horizon operation containing a long-term evaluation
- Need for resolving the end-effect

The MDP passes the purchased crude & updated value function to the LP; the LP returns samples for the next state & stage-wise reward.
→ MDP-based planning combined with a detailed refinery operation model
*MDP: Markov decision process, **LP: Linear programming
Model Formulation

Crude Oil Selection: MDP
❖ State and decision variable:
$S_t = [p_{t,i},\ s_{t,c}]$, where $p_{t,i}$ = price of product $i$ and $s_{t,c}$ = storage of crude $c$
$a_t = y_{t,c} = 1$ if crude $c$ is selected, 0 otherwise
❖ One-day reward function $R_t(S_t, a_t)$, defined by the lower-level optimization:
$R(S_t, a_t) = \max_{x}\ Q_1(a_t) + \sum_{p'} P(p'|p_t)\, Q_2(p', s_t)$, with $x = (f_c,\ s_c,\ x_{i,u},\ x_{b,p},\ x_p,\ v_p)$,
where $Q_1$ is the crude purchase term and $Q_2$ the unit processing & product sale term after the uncertainty realization (prices of import & export products)
❖ State transition probability: $P(S'|S_t) = P(p'|p_t) \times P(s'|s_t)$, where $P(p'|p_t)$ comes from the price model and $P(s'|s_t) = 1$ if $s'$ is the optimal solution of the refinery optimization given initial storage $s_t$, and 0 otherwise (inventory transition $s_{t+1,c} = s_c$ from the LP solution)

Refinery Op. Optimization: LP (constrained by the upper-level decision & state; in turn, the MDP model is defined by this lower-level optimization)
❖ Constraints:
- crude availability: $y_c f_c^{\min} \le f_c \le y_c f_c^{\max}$
- initial storage level: $s_{0,c} = s_{t,c}$
Value function approximation & learning
❖ Value function approximation model: $\bar{V}_\theta(S) = \theta^T \phi(S)$, where $\theta = [\theta_0\ \ \theta_p\ \ \theta_{pp}\ \ \theta_s\ \ \theta_{ps}]^T$ and $\phi(S_t) = [1\ \ p_{t,i}\ \ p_{t,i}p_{t,j}\ \ s_{t,c}\ \ p_t^{\mathrm{WTI}} s_{t,c}]$
Bradtke, S. J., & Barto, A. G. (1996). Machine learning, 22(1-3), 33-57.
Boyan, J. A. (2002). Machine Learning, 49(2-3), 233-246
Marginal value of crude inventory: $\tilde{\theta}_s = \theta_s + \theta_{ps}\, p_t^{\mathrm{WTI}}$, which depends on the price state. When the marginal value exceeds the price, $\tilde{\theta}_{s,c} - p_c > 0$ ($/ton), keeping a higher inventory is preferred: a quantitative evaluation of what kind of, and how much, crude oil to retain.
❖ Value function learning algorithm
• Initialize $\theta_0$
• For each iteration ($n = 0, \ldots, N-1$):
1. Simulation data collection (from the LP & price model): $S_n \rightarrow a_n(\theta_n) \rightarrow r_n \rightarrow S_n'$
2. Define the basis: $\phi_n = \phi(S_n)$, $\phi_{n+1} = \phi(S_{n+1})$
3. Update the value function: $\theta_{n+1} \leftarrow \mathrm{RLSTD}^*(\theta_n, \phi_n, \phi_{n+1}, r_n)$
The basis uses linear & bilinear terms for price & inventory (not always a "Deep NN").
*RLSTD: Recursive least squares temporal difference
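For readers unfamiliar with RLSTD, the following is a minimal sketch (my own, in the spirit of the least-squares TD references cited above, not the authors' exact implementation) of one recursive update of the linear value-function weights; the discount factor, initialization, and example features are assumptions.

```python
import numpy as np

class RLSTD:
    """Recursive least-squares TD for a linear value function V(S) = theta^T phi(S)."""

    def __init__(self, n_features, gamma=0.99, delta=1.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)
        self.P = np.eye(n_features) / delta   # running inverse of the LSTD matrix

    def update(self, phi_n, phi_next, r_n):
        # TD feature difference for the sampled transition S_n -> S_n'
        d = phi_n - self.gamma * phi_next
        # Sherman-Morrison style recursive update of P and the weights
        k = self.P @ phi_n / (1.0 + d @ self.P @ phi_n)
        td_error = r_n - d @ self.theta
        self.theta += k * td_error
        self.P -= np.outer(k, d @ self.P)
        return self.theta

# Illustrative usage with made-up 3-dimensional features and a made-up reward.
rlstd = RLSTD(n_features=3)
theta = rlstd.update(np.array([1.0, 0.2, 0.5]), np.array([1.0, 0.1, 0.4]), r_n=2.0)
print(theta)
```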
Case Study
❖ Refinery model parameters: Favennec, J. (2001). Petroleum Refining V5. Refinery Operation and Management, Technip
❖ Seven crude oil types
- Crude assay information: http://corporate.exxonmobil.com/en/company/worldwide-operations/crude-oils/assays
- Individual crude oil price: $p_{t,c} = p_{t,\mathrm{WTI}} + \xi_c$, where $\xi_c = f(\text{VR yield, sulphur content})$
❖ Price model: first-order Markov chain
- Data source: U.S. EIA, 2014.12 ~ 2017.3 (600 days)
❖ Size of state space: inventory space ($3^7$) × price space (90) = 196,830
❖ Size of decision (crude selection) space: 1-crude selections (7) + 2-crude selections (21) = 28
❖ Number of parameters for VFA: 51
❖ Quality specification of final products:
- PG98: C4 content ~5%, RVP 0.5 ~ 0.86, sensitivity ~10, RON ~98
- ES95: C4 content ~5%, RVP 0.45 ~ 0.86, sensitivity ~10, RON ~95
- DSL: sulphur content ~0.05%, HF, VBI 30 ~ 33
Tuning Parameters for Learning

❖ Number of samples (N): limited budget (time) for sampling & learning
❖ Exploration rate (ε): used to make the decision $a_n$ during learning (ε-greedy policy).
For each iteration of learning ($n = 1, \ldots, N$): $s_n \rightarrow a_n \rightarrow r_n \rightarrow s_n'$, with
$a_n$ = the greedy policy (exploit decision) with probability $1-\varepsilon$, or a randomly chosen action (exploratory decision for knowledge acquisition) with probability $\varepsilon$.

[Figure: improvement (%) of the proposed policy over the reference*, as a function of the number of iterations (×10²) and the exploration rate (ε)]

                        Reference*    Proposed
Refinery opt. model     LP            LP
Uncertainty accounted   No            Price
VF incorporation        No            Yes (RL)
Numerical Result

❖ Number of samples: 30000, exploration rate: 0.4
❖ Improvement across 100 runs, each with a randomly generated price dataset (200 days) and a randomly chosen initial inventory state
(Reference*: LP refinery optimization, no uncertainty accounted, no VF; Proposed: LP refinery optimization with price uncertainty and VF incorporation)

         Crude purchase (kton)        Inventory (kton)
         Proposed    Reference        Proposed    Reference
C1       279.67      289.42           99.15       0.50
C2       48.39       49.15            1.23        0.65
C3       8.08        0                40.67       0.57
C4       7.11        0                16.83       0.74
C5       0.33        0.16             2.06        0.46
C6       23.49       25.79            5.43        0.76
C7       0           0                0.49        0.94
Total    367.08      364.53           165.87      4.62
PART 3-3
Practical Example of the Proposed Model: Raw Material Procurement Planning and Scheduling
< Raw Material Procurement Planning & Scheduling >
Shin, J., & Lee, J. H. (2016). "Multi-time Scale Procurement Planning Considering Multiple Suppliers and Uncertainty in Supply and Demand," Computers & Chemical Engineering, 91, 114-126.
[Diagram: multiple independent suppliers (Supplier 1, 2, ..., n) deliver, with random lead times, to a dock and inventory that feed a manufacturer facing random demand.]
Raw Mat. Procurement Planning & Scheduling
Physical product flow: multiple independent suppliers → dock → inventory → manufacturer, with random lead times and random demand.
• Procurement planning: monthly or bi-weekly, t = 1, ...
• Unloading scheduling: daily, h = 1, ..., H; returns the scheduling cost and final inventory to the planning level
❖ Procurement planning (passes the order plan and initial inventory to scheduling)
- Decision: order plan for each supplier
- Objective: to minimize (procurement cost) = (purchasing cost) + (scheduling cost)
❖ Unloading scheduling
- Decision: unloading schedule, inventory level, input to manufacturer
- Objective: to minimize (scheduling cost) = (inventory holding cost) + (unloading cost) + (penalty cost)
- Subject to: constraints on tank & demand, vessel arrival & departure, and constraints on the unloading operation; returns the daily cost & state transition
Time index: (t, h), with (t, H) = (t+1, 0).
Integrated Formulation

Planning: MDP (from t to t+1; provides the initial state & constraints to scheduling)
❖ State, decision & exogenous variables
- State $s_t = [x_t,\ d_t,\ l_{t,i}]$: $x_t$ = inventory level at time $t$, $d_t$ = demand at time $t$, $l_{t,i}$ = lead time of supplier $i$ at time $t$
- Decision $a_t = [a_{t,i}]$: order of supplier $i$ placed at time $t$
- Exogenous information $\omega_t = [\hat{d}_t,\ \hat{l}_{t,i}]$: new info. on demand and on lead times at time $t$
❖ One-period cost function: $c(s_t, a_t) = C_{\mathrm{order}}(a_t) + E\big[C^*_{\mathrm{sch}}(s_t, a_t, \hat{\omega}_{t+1})\big]$, where $C^*_{\mathrm{sch}}$ is the optimal scheduling cost returned by the lower level
❖ State transition probability: $p(s_{t+1}|s_t, a_t) = \prod_{i=1}^{n} p(l'_i|l_{t,i})\; p(d'|d_t)\; p(x'|s_t, a_t, d_t, l_t)$, with $p(x'|s_t, a_t, d_t, l_t) = 1$ if $x' = x^*_{t,H}$ (the final inventory of the scheduling solution) and 0 otherwise

Scheduling: MILP over the daily steps $(t,0), \ldots, (t,h), \ldots, (t,H)$, with time integration $(t,H) = (t+1,0)$
- Decision variables: $x_{t,h} = [Xf_{i,h},\ Xl_{i,h},\ Tf_i,\ Tl_i,\ Z_{i,h},\ Fu_{i,h},\ Fd_h,\ Ld_h,\ Lv_i]^T$
- Objective $C^*_{\mathrm{sch}} = \min \sum_{h} C(x_{t,h})$, summing the inventory holding cost, unloading cost $C_{un,i}$ (between vessel arrival $Tf_i$ and departure $Tl_i$), lost-demand penalty $C_{ld} Ld_h$, and lost-volume penalty $C_{lv} Lv_i$
- Inputs from the planning level: initial storage level $x_{t,0} = x_t$, delivered orders $a_t$ (bounding the unloaded amounts $Fu_{i,h}$), vessel availability $Av_{f,i}$ (from the lead times $l_{t,i}, \hat{l}_{t+1,i}$), and unit demand $Fd_h$ (from $d_t, \hat{d}_{t+1}$ spread over the $H$ daily steps)
Approximation Strategy

• Challenge 1: computing the cost & state transition (scheduling model → planning model) for all states & decisions. Options: the full MILP model, or a heuristic estimation.
• Challenge 2: computing the value function (value function computation → policy construction). Options: exact dynamic programming (value iteration), or approximate value iteration with a piecewise-linearly approximated value function,
$V(s) \approx \bar{V}_\theta(s) = \theta^T \phi(s)$, with basis functions featured by the state variables, $\phi(s) = [1\ \ x\ \ d\ \ l]$.
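As a rough illustration of approximate value iteration with a linear value-function approximation, the sketch below (my own; not the paper's algorithm) alternates sampled Bellman backups with a least-squares refit of the weights; the sampler, model interface, basis, and discount factor are assumptions.

```python
import numpy as np

def approximate_value_iteration(sample_states, actions, model, phi, gamma=0.95, n_iters=50):
    """Fitted (approximate) value iteration with V(s) ~ theta^T phi(s).

    `model(s, a)` is assumed to return a list of (prob, next_state, cost) tuples,
    and `phi(s)` a feature vector; both are placeholders for the real
    planning/scheduling models.
    """
    Phi = np.array([phi(s) for s in sample_states])
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        targets = []
        for s in sample_states:
            # sampled Bellman backup using the current approximation
            q = [sum(p * (c + gamma * phi(s2) @ theta) for p, s2, c in model(s, a))
                 for a in actions]
            targets.append(min(q))          # cost minimization
        # least-squares fit of theta to the backed-up targets
        theta, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return theta
```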
Numerical Case Study – Case 1: moderate size

                              Policy 1                Policy 2             Policy 3
Planning model                MDP with safety stock   MDP                  MDP
Uncertainty accounted         Demand                  Demand & Lead time   Demand & Lead time
Integration with scheduling   No                      MILP model           Heuristic approach
Solution algorithm            Exact VI*               Exact VI             Approximate VI
*VI: Value iteration

                                        Case 1: Moderate size   Case 2: Large size
Decision horizon (scheduling)           10                      30 (one month)
Decision horizon (planning)             10                      12 (one year)
Number of suppliers                     3                       5
Unit cost / penalty: inventory holding  0.1                     0.1
Unit cost / penalty: unloading          3                       5
Unit cost / penalty: lost-demand        10                      10
Unit cost / penalty: lost-volume        5                       5
Unit cost / penalty: safety-stock       10                      10
❖ Three suppliers (Supplier 1: unreliable but cheap; Suppliers 2 and 3: reliable but expensive)

                Supplier 1   Supplier 2   Supplier 3
Lead time       2,3,4,5      2            5
Order space     0,3,6,9      0,3,6        0,6,9,12
Fixed cost      1.5          2            2
Variable cost   0.5          0.5          0.5

❖ MILP model: # of variables: 159 (integer: 66); # of constraints: 317
❖ MDP model: size of state space: 504; size of decision space: 48
❖ Results of Case 1 (100-simulation average):

                                 Policy 1   Policy 2   Policy 3
Average cost                     197.67     171.91     172.46
Improvement over Policy 1 (%)    -          13.03      12.76
CPU time (s)                     2.28       1343.73    23.96
(98.15% improvement in CPU time for Policy 3 relative to Policy 2)
Numerical Case Study – Case 2: large size

❖ Five suppliers (Suppliers 1-3: unreliable but cheap; Suppliers 4 and 5: reliable but expensive)

                Supplier 1   Supplier 2   Supplier 3   Supplier 4   Supplier 5
Lead time       {5~7}        {10,11}      {13~16}      4            15
Order space     {0,5,10}     {0,5,10}     {0,10,15}    {0,5}        {0,10,15}
Fixed cost      1.5          1.7          1.4          2            2
Variable cost   0.5          0.5          0.5          0.5          0.5

❖ MILP model: # of variables: 705 (integer: 310); # of constraints: 1395
❖ MDP model: size of state space: 4424; size of decision space: 136
❖ Results of Case 2 (100-simulation average); Policy 2 (exact VI with the MILP model) is computationally infeasible at this size:

                                 Policy 1   Policy 2   Policy 3
Average cost                     606.75     -          519.46
Improvement over Policy 1 (%)    -          -          14.46
CPU time (s)                     21.13      -          1247.10
Summary
• Reinforcement Learning has its technical roots in animal psychology and optimal control. Recent advances in RL algorithms, Deep NN, and hardware have enabled superhuman performance for some applications.
• Reinforcement Learning has advantages/disadvantages relative to Model Predictive Control, mostly because it emphasizes development of a control policy rather than a process model.
✓ Extensive off-line learning is possible if a good simulator is available. For on-line learning, “disciplined learning” is needed.
• Potential Process Control applications of Reinforcement Learning include:
✓ Use RL agents to manage control systems or optimize under uncertainty
✓ Use RL agents to advise operators during unusual situations
✓ Use a hierarchical network of RL agents to simplify operation of a plant
✓ Use RL agents to integrate strategic business decisions with plant operation decisions
We believe that Reinforcement Learning has the potential to significantly impact both the theory and practice of Process Control, and more generally of Integrated Strategic / Operational Decision Making!
Thank you for listening
Matthew Realff (GT), Joohyun Shin (KAIST), Thomas Badgwell (ExxonMobil)
References
[1] D. Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529, 484-489, (2016).
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016).
[3] D. Silver, Lecture 1: Introduction to Reinforcement Learning, Google DeepMind, (2015).
[4] S. Levine, Deep Reinforcement Learning, Berkeley CS294-112, (2017).
[5] A. Turing, Computing machinery and intelligence, Mind, 59, 433-460, (1950).
[6] OpenAI Gym, A toolkit for developing and comparing reinforcement learning algorithms, (2018).
[7] T. Badgwell, K. Liu, N. Subrahmanya, W. Liu, and M. Kovalski, Adaptive PID Controller Tuning via Deep Reinforcement Learning, U.S. provisional patent application filed, (2017).
[8] J. Lee and W. Wong, Approximate dynamic programming approach for process control, Journal of Process Control, 20, 1038-1048, (2010).
[9] J. Morinelly and E. Ydstie, Dual MPC with Reinforcement Learning, IFAC-PapersOnLine, j.ifacol.2016.07.276, (2016).
[10] S. Kamthe and P. Deisenroth, Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control, arXiv:1706.06491v1, (2017).
[11] V. Mnih et al., Asynchronous Methods for Deep Reinforcement Learning, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, (2016).
[12] Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
[13] Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality (Vol. 703). John Wiley & Sons.
[14] Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3).
[15] Lee, J. H., Shin, J., & Realff, M. J. (2018). Machine learning: Overview of the recent progresses and implications for the process systems engineering field. Computers & Chemical Engineering, 114, 111-121.