Reinforcement Learning – Overview of Recent Progress and Implications for Process Control and Beyond (Integrated Multi-Scale Decision-making)
October 4, 2018 CMU EWO Webinar
Jay H. Lee1
(with Thomas Badgwell2)
1 Korea Advanced Institute of Science and Technology, Daejeon, Korea; 2 ExxonMobil Research & Engineering Company, Clinton, NJ
Introduction to KAIST – 47 Years Old
• 1971: KAIS (Korea Advanced Institute of Science), a graduate school in Seoul established under a new law granting special privileges such as exemption from compulsory military service
• 1984: KIT (Korea Institute of Technology), an undergraduate school in Daejeon for students gifted in math and science
• 1989: KAIST (Korea Advanced Institute of Science and Technology), established through the merging of KAIS and KIT; main campus in Daejeon, business school in Seoul
KAIST Today – Brief Statistics
Overall Structure of This Talk
• Part I: Introduction of Reinforcement Learning and Implications for Process Control (Acknowledgment: Thomas Badgwell)
• Part II: And Beyond (Integrated Multi-Scale Decision-making)
Data-driven decision-making & control in the engineering domain
[Diagram: the target system (environment) yields data $\mathcal{D} = \{x_{1:n}, y_{1:n}, \theta_{1:n}\}$ through data acquisition; learning fits a model $y = f(x; \theta)$; decision-making then selects $x^*|\theta = \arg\max_x f(x; \theta)$, which is applied back to the system.]
In a dynamic & stochastic environment, data can help us model more realistically and derive a more accurate solution!
Framework of decision-making: data & model based decision making
Real-world task → (Modeling) → Formal task (model) → (Solving) → Algorithm (program) → applied to the target system, e.g.
$\max_x \sum_{i=1}^{N} P_i(x; \theta, U)$
• Validation: Are we building the right model? Is this solution good for the target system?
• Verification: Does the algorithm capture all the essential aspects of the model?
Data analytics
• Bayesian Statistics
• Machine learning
• Bayesian Network
Modeling
• Optimization
• Markov Decision Process
• Game Theory
Decision Making
• Mathematical Programming
• Dynamic Programming
• Reinforcement Learning
Agenda
• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
In Reinforcement Learning, an agent learns a decision policy by taking actions and observing the response of (the 'reward' or 'penalty' from) the environment (a framing abstracted from animal psychology).
The agent learns a policy $\pi(A_t = a \mid S_t = s)$ that maximizes a long-term value function:
$v_\pi(s) = E\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016)
The properties of an optimal policy are described by Bellman's optimality equation (from Optimal Control theory).
• Bellman's optimality equation answers the question: when is the value function $v^*(s)$ maximized? It enforces consistency of the optimal value function as the state of the environment changes [6]:
$v^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v^*(s')\right]$
• In practice Bellman's optimality equation usually cannot be solved because:
➢ We often don't know the environment model $p(s', r \mid s, a)$
➢ Solution complexity explodes with state dimension (which may be infinite)
• All RL algorithms can be regarded as approximate solutions to Bellman's optimality equation, dealing in various ways with these two limitations [2].
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016)
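Because Bellman's optimality equation underlies all of RL, a tiny worked example helps. The following is a minimal sketch (my own illustration, not from the talk) of value iteration on a hypothetical two-state, two-action MDP with a known model $p(s',r|s,a)$; the transition probabilities, rewards, and discount factor are made-up assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)],
        1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],
        1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
}
gamma = 0.9          # discount factor (assumed)
v = np.zeros(2)      # estimate of the optimal value function v*(s)

# Value iteration: repeatedly apply the Bellman optimality operator.
for _ in range(1000):
    v_new = np.array([
        max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:   # stop once the update has converged
        break
    v = v_new

# Greedy policy extracted from the converged value function.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(v, policy)
```

In realistic problems this exact sweep is impossible (unknown model, huge state space), which is exactly why the approximate RL algorithms discussed next are needed.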
Reinforcement Learning: from Vision [5] to Today's Reality
Powerful new RL algorithms + Deep Neural Nets + Powerful new hardware
[5] A. Turing, Computing machinery and intelligence. Mind, 59, 433-460, (1950).
Example: RL strategy for "AlphaGo"
Combinatorial explosion in "Go" play: approximately $250^{150} \approx 10^{360}$ branches (number of atoms in the universe ~$10^{80}$).
Reducing the Search Space (trained on self-play / simulation data):
• Breadth reduction (Policy Network): learning a probability distribution over legal moves over the board, p(next action | current state)
• Depth reduction (Value Network): predicting the expected outcome in a board position, p(my win | next state)
AlphaGo Zero: benefit of unbiased exploration!
Alpha-Go: Capturing the World's Attention
[Diagram: the Policy Network (classification; breadth) is trained from human expert positions and gives $p_{\sigma/\rho}(a|s)$; the Value Network (regression; depth) is trained from self-play (~30 million positions per day) and gives $v_\theta(s')$.]
Analogy to chess playing (a slide from my presentation at CPC-VII, Lake Louise, 2006):
• Initial policy: have expert players play a large number of games (expert position learning).
• Playing games with the current policy, then value function approximation: assign "value" scores for all the board positions encountered during the games; use some sort of interpolation for other positions.
• A new playing policy: iterative improvement via self-play ("self-optimizing simulation").
Solving Algorithms of RL: value-based or policy-based

The underlying problem is a stochastic optimization problem, $\min_\pi J = E\left[\sum_t C(s_t, \pi(s_t))\right]$, whose long-term value must be estimated from samples.

1. Value function methods (value-based)
▪ Learnt value function (or Q-value): $V_\theta(s) \approx V(s)$, $Q_\theta(s,x) \approx Q(s,x)$
▪ Implicit policy (use $V_\theta(s_t)$ instead of $V(s_t)$)
✓ TD learning (off-policy*: Q-learning, on-policy**: SARSA), with update
$V(s_t) \leftarrow V(s_t) + \alpha\left[C(s_t, \pi(s_t)) + \gamma V(s_{t+1}) - V(s_t)\right]$, where the bracketed term compares the TD target with the current estimate
✓ Dual heuristic programming

2. Policy search methods (policy-based)
▪ No value function
▪ Learnt policy: $\pi_\theta(s,x) = P[x|s,\theta]$, which can be a "stochastic" policy
✓ Policy gradient algorithms, e.g. Monte-Carlo PG (analytical method): $\theta \leftarrow \theta + \alpha \nabla_\theta J$, with $\nabla_\theta J = E\left[\nabla_\theta \log \pi_\theta(s,x)\,\hat{v}\right]$

3. Actor-critic
▪ Learnt value function (critic) and learnt policy (actor): two approximators, $V_\theta(s) \approx V(s)$ and $\pi_w(s,x) = P[x|s,w]$

*learning policy ≠ sampling policy, **learning policy = sampling policy
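As a concrete instance of the value-based family above, here is a minimal sketch (my own illustration, not from the talk) of tabular Q-learning, the off-policy TD method named on the slide, written in the usual reward-maximization convention rather than the cost convention used elsewhere in these slides. The environment interface (`env.reset()`, `env.step(a)`, `env.n_actions`) and all hyperparameters are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); states must be hashable.
    """
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if random.random() < eps:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # off-policy TD update: bootstrap with max over next-state actions
            td_target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (td_target - Q[s][a])
            s = s_next
    return Q
```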
Major Issues in RL – 1. Sampled value computation: TD vs. MC

Given the observable (noisy) rewards and states $s_t, x_t, C_t, s_{t+1}$, where $C_t = C(s_t, x_t)$, how can we compute the "estimation targets" $\hat{v}_t(s_t)$ for the hidden (valuable) value function $V(s_t)$? (TD: temporal-difference, MC: Monte-Carlo)

▪ TD learning (shallow backups, 1-step): $\hat{v}_t = C_t + \gamma V(s_{t+1})$
Bootstrapping: the update involves an estimate. Low variance but some bias (due to bootstrapping); more sensitive to the initial value.
▪ MC learning (deep backups, ∞-step): cumulated cost through a sample path, $\hat{v}_t = C_t + \gamma C_{t+1} + \gamma^2 C_{t+2} + \cdots$
Zero bias but high variance (due to many random samples); more samples required (complete sequences).
▪ TD($\lambda$) learning: a weighted average of n-step estimates,
$\hat{v}_t^{(n)} = C_t + \gamma C_{t+1} + \cdots + \gamma^{n-1} C_{t+n-1} + \gamma^{n} V(s_{t+n})$, $\quad \hat{v}_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{v}_t^{(n)}$,
spanning the range from 1-step TD through 2-step, 3-step, ..., n-step estimates to MC (∞-step).
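To make the estimation targets concrete, here is a small sketch (my own, not from the slides) computing the MC, 1-step TD, and n-step targets for one sampled trajectory; the trajectory data, value estimates, and discount factor are made up for illustration.

```python
def mc_target(costs, gamma):
    """Monte-Carlo target: discounted cumulated cost over a complete sample path."""
    return sum(gamma**k * c for k, c in enumerate(costs))

def n_step_target(costs, V, states, t, n, gamma):
    """n-step target: n sampled costs, then bootstrap from the current V estimate."""
    g = sum(gamma**k * costs[t + k] for k in range(n))
    return g + gamma**n * V[states[t + n]]

# Illustrative trajectory: states s0..s4 and stage costs C_t = C(s_t, x_t).
states = ["s0", "s1", "s2", "s3", "s4"]
costs = [1.0, 0.5, 0.2, 0.8]
V = {s: 0.0 for s in states}   # current value estimates (all zero here)
gamma = 0.9

print(mc_target(costs, gamma))                       # MC (complete-path) target from t=0
print(n_step_target(costs, V, states, 0, 1, gamma))  # 1-step TD target from t=0
print(n_step_target(costs, V, states, 0, 3, gamma))  # 3-step target from t=0
```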
Major Issues in RL – 2. Value function approximation (VFA)

Given the observable (noisy) rewards and states $s_t, x_t, C_t, s_{t+1}$ (with $C_t = C(s_t, x_t)$) and the computed estimation targets $\hat{v}_t(s_t)$, which function approximators should represent $V(s_t)$?

▪ Linear combinations of features (features parameterized by the state), with Dim(feature) << Dim(state):
$V_\theta(s) = \sum_{f \in F} \theta_f \phi_f(s) \approx V(s)$
▪ Artificial neural networks (esp. deep neural networks): for complex & nonlinear relations, requiring large data for learning
▪ Nonparametric models (e.g., Gaussian process, kernel regression)

Fitting approaches: ✓ gradient-based approaches ✓ least-squares approaches ✓ probabilistic models
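The linear-features case can be written in a few lines. The sketch below (my own illustration, not from the slides) performs one gradient-based (semi-gradient) TD(0) update of the weights of a linear value function approximator; the feature map, step size, and discount factor are assumptions.

```python
import numpy as np

def td0_linear_update(theta, phi_s, phi_s_next, cost, gamma=0.95, alpha=0.01):
    """One semi-gradient TD(0) update for V_theta(s) = theta^T phi(s)."""
    v_s = theta @ phi_s
    v_s_next = theta @ phi_s_next
    td_error = cost + gamma * v_s_next - v_s   # TD target minus current estimate
    return theta + alpha * td_error * phi_s    # gradient of V_theta(s) w.r.t. theta is phi(s)

# Illustrative usage with a made-up 3-dimensional feature map.
theta = np.zeros(3)
phi = lambda s: np.array([1.0, s, s**2])       # assumed feature map phi(s)
theta = td0_linear_update(theta, phi(0.2), phi(0.5), cost=1.0)
print(theta)
```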
Major Issues in RL – 3. Exploitation vs. Exploration

Online decision-making involves a fundamental choice: "exploit" the best option given current knowledge, or "explore" the most uncertain one. The best long-term strategy may involve short-term sacrifices, since the knowledge gained by exploring (online learning from measurements) improves later decisions.

Principles of exploration:
▪ Naive exploration: add noise to the greedy policy (e.g., ε-greedy)
▪ Optimistic initialization: assume the best until proven otherwise
▪ Optimism in the face of uncertainty: prefer actions with uncertain values (e.g., upper confidence bound)
▪ Probability matching: select actions w.r.t. the probability they are best (e.g., Thompson sampling)
▪ Knowledge (belief) state search: look-ahead search incorporating the value of knowledge (e.g., Gittins indices, knowledge gradient)

The exploitation vs. exploration dilemma ("optimal learning") can be modeled as a multi-armed bandit or as an (unknown) MDP; Bayesian inference gives a principled way of handling uncertain knowledge (e.g., as a distribution), and the resulting solution is a Bayes-Adaptive MDP.
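A multi-armed bandit makes these principles tangible. Below is a minimal sketch (my own, with made-up arm probabilities) of the first and third principles on a Bernoulli bandit: ε-greedy noise on the greedy policy, and an upper-confidence-bound rule (optimism in the face of uncertainty).

```python
import math
import random

true_p = [0.3, 0.5, 0.7]        # hypothetical success probabilities of 3 arms
counts = [0] * 3                # pulls per arm
values = [0.0] * 3              # running mean reward per arm

def pull_and_update(arm):
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

def eps_greedy(eps=0.1):
    if random.random() < eps:                      # explore: random arm
        return random.randrange(len(true_p))
    return max(range(len(true_p)), key=lambda a: values[a])   # exploit

def ucb(t, c=2.0):
    # prefer arms whose values are still uncertain (rarely tried)
    return max(range(len(true_p)),
               key=lambda a: float('inf') if counts[a] == 0
               else values[a] + c * math.sqrt(math.log(t + 1) / counts[a]))

for t in range(1000):
    pull_and_update(ucb(t))     # or: pull_and_update(eps_greedy())
print(counts, [round(v, 2) for v in values])
```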
Major Issues in RL – 4. Stochastic policy vs. Deterministic policy

A deterministic policy maps the state directly to an action, $x = \mu(s)$; a stochastic policy specifies a distribution over actions, $\pi(x|s) = \Pr[x|s]$.

Example: consider policies for iterated rock-paper-scissors.
▪ A deterministic policy is easily exploited.
▪ A uniform random policy is optimal.
Sometimes you need a stochastic policy!

How can we express a stochastic policy (in policy search methods)?
▪ Softmax policy: $\pi_\theta(s,x) = \dfrac{e^{\phi(s,x)^T\theta}}{\sum_{x'} e^{\phi(s,x')^T\theta}}$
▪ Gaussian policy: $\pi_\theta(s,x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\dfrac{(x-\mu_\theta(s))^2}{2\sigma^2}\right)$

Example: Aliased Gridworld (trained policies shown as red arrows in the figure)
▪ The agent cannot differentiate the grey states.
▪ Value-based RL → deterministic policy: it can get stuck and never reach the money (the goal state).
▪ Policy-based RL → stochastic policy: it will reach the goal state in a few steps with high probability.

Advantages (of policy-based methods):
▪ Better convergence properties
▪ Effective in high-dimensional / continuous action spaces
Disadvantages:
▪ Typically converge to a local optimum
▪ Typically inefficient & high variance
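The softmax policy above is straightforward to implement. The sketch below (my own illustration) samples an action from a softmax policy over linear features and computes the score $\nabla_\theta \log \pi_\theta(s,x)$ used by the policy-gradient update; the feature map, parameters, and action set are assumptions.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(x|s) proportional to exp(phi(s,x)^T theta)."""
    prefs = np.array([phi(s, x) @ theta for x in actions])
    prefs -= prefs.max()                     # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def score(theta, phi, s, x, actions):
    """grad_theta log pi_theta(x|s) = phi(s,x) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, a) for p, a in zip(probs, actions))
    return phi(s, x) - expected_phi

# Illustrative usage: 3 discrete actions, made-up 2-D features.
actions = [0, 1, 2]
phi = lambda s, x: np.array([s, float(x)])   # assumed feature map phi(s, x)
theta = np.array([0.1, -0.2])
probs = softmax_policy(theta, phi, 1.0, actions)
x = np.random.choice(actions, p=probs)       # sample a stochastic action
print(probs, score(theta, phi, 1.0, x, actions))
```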
Agenda
• What is Reinforcement Learning? (Bipedal Robot example)
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning example: Bipedal Robot [6] (OpenAI Gym)
• 24 continuous states = [ hull angle, angular velocity, horizontal speed, vertical speed, position of joints, joint angular speeds, legs contact with ground, 10 LIDAR rangefinder measurements ]
• 4 continuous actions = [ leg joint torques ]
• Reward / penalty: +x every time it moves forward; -0.1 for applying leg motor torque; -100 for falling down
[Figures: snapshot after 3k episodes; snapshot after 40k episodes]
A surprising policy! Better algorithms for tougher tasks.
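For readers who want to reproduce this kind of experiment, a minimal interaction loop with the OpenAI Gym toolkit [6] looks roughly like the sketch below (my own, not from the talk). The environment id "BipedalWalker-v2" and the random placeholder policy are assumptions; a real agent would replace the sampled action with its learned policy.

```python
import gym  # OpenAI Gym toolkit [6]

env = gym.make("BipedalWalker-v2")   # assumed environment id for the bipedal robot task
obs = env.reset()                    # 24-dimensional continuous observation

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()           # placeholder: random 4-D torque vector
    obs, reward, done, info = env.step(action)   # +x forward, -0.1 per torque, -100 on fall
    total_reward += reward
print("episode return:", total_reward)
```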
Agenda
• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning has advantages and disadvantages when compared with Model Predictive Control.

RL (model-free) advantages vs. MPC:
• No need to develop a process model (develop the policy from data directly)
• Able to work with complex nonlinear, stochastic environments
• Fast on-line execution
• Can adapt to changing environments

RL (model-free) disadvantages vs. MPC:
• Extensive trial-and-error learning is required
• Must be allowed to fail during training (simulation can be used)
• Training may not be stable or repeatable
• May get stuck in local minima during training
• Extensive goal engineering may be required
• Must re-do training if the goal is changed
• No closed-loop stability guarantees
Agenda
• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
MPC is the current state-of-the-art for chemical plants. Reinforcement Learning has the potential to complement it and expand its capability.
Potential process control applications of RL technology include:
• Directly replace existing process controllers with RL agents
• Use RL agents to help manage process control systems
✓ Switch controllers ON/OFF, adjust limits and tuning parameters as appropriate
✓ Compensate for common disturbances such as weather events or feedrate changes
✓ Supervising / optimizing control, esp. those involving significant uncertainty
• Use RL agents to advise operators during unusual situations (process upsets, startup/shutdown)
• Use a hierarchy of RL agents to simplify operation of a chemical plant/refinery
✓ Process safety agents
✓ Environmental compliance agents
✓ Reliability agents
✓ Economic optimization agents
Agenda
• What is Reinforcement Learning? (Bipedal Robot example)
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning research opportunities
Potential RL research opportunities include:
• RL methods with “Disciplined learning”.
• Integrate aspects of RL technology with MPC
➢ Lee and Wong [8], Morinelly and Ydstie [9], Kamthe and Deisenroth [10]
• Exploration vs. exploitation
• Find the class of function approximations for which a state-of-the-art RL algorithm (e.g. A3C [11]) converges (value and policy function approximations converge to optimal values)
• Prove closed-loop stability for the case of a state-of-the-art RL algorithm (e.g. A3C [11])
• Develop a robust RL algorithm by training in parallel with a number of environments that each represent a realization of the uncertainty set
• Develop RL technology that allows a hierarchy of prioritized RL agents to cooperate to accomplish a complex task.
Exemplary RL with "Disciplined" Learning: Integrated Reaction Separation System with Recycle (Tosukhowong and Lee, AIChE J. 2009)

[Process flow diagram: series reaction 1 →(k1)→ 2 →(k2)→ 3 in a reactor (holdup M_R, fed by F_0 with composition x_1), followed by a distillation column (holdups M_D, M_B; flows F, D, L, V) producing bottoms B with composition x_2B.]

• Manipulated inputs: $u_0 = [F\ \ M_R^{sp}\ \ M_D^{sp}\ \ M_B^{sp}\ \ L\ \ B]^T / 100$
• Stage-wise cost weights: $Q_u = 10000$, $Q_y = 6000$, $R = 20\, I_{6\times 6}$

Operating Modes        1       2      3      4
Product conc. (x2B)    0.886   0.85   0.91   0.82
Production rate (B)    100     115    80     125
Stochastic Disturbances and Variations
[Figure: on-line stochastic disturbance (one of the realizations)]
On-line performance comparison with 12 new disturbance realizations: RL (learned starting with closed-loop data from the 7 NMPC controllers) vs. 7 NMPC controllers with different tunings.
[Figure: RL controller result from one realization (x2B_sp = 0.85, B_sp = 115): product variables and manipulated variables]
[Figure: result of NMPC 1 (the best MPC controller in this case): product variables and manipulated variables]
Reinforcement Learning with Mathematical Programming for Multi-Scale Dynamic Decision Making in an Uncertain Environment
Dynamic decision-making in an uncertain environment
A decision-maker interacts with an uncertain environment: a decision is executed, the environment responds, and state information feeds back to the decision-maker for iterative improvement, i.e., a sequential decision-making process under uncertainty.
Application areas: industrial & manufacturing systems, financial engineering, robotics, power systems, medical applications, computing & communications, game playing, ...
Multi-scale decision-making: how do we integrate between layers?
Grossmann (2005). Enterprise-wide optimization: A new frontier in PSE. AIChE Journal, 51(7), 1846-1857.
Math programming over time-scale multiplicity: renewable (wind) energy example

Decision hierarchy over hours, days, and years: yearly capacity planning (sizing) fixes the design (capacity) for daily production planning, which in turn imposes operation constraints on hourly dispatch scheduling; day-ahead prediction and operation/uncertainty information, and year-ahead operation information, flow back up the hierarchy.

[Figures: daily operating cost over the year vs. its yearly average; wind profiles in summer vs. winter]

Temporally-integrated mathematical programming (MP):
❖ At fine scale: 1 year = 24 × 365 = 8760 hours → computationally infeasible
❖ At coarse scale: coarse-graining and "averaging" of hourly dynamics and uncertainty → optimistic estimation
→ "A gap between the layers"
Uncertainty handling: MP vs. MDP

Math programming (MP): solution "over" a time horizon
❖ Stochastic data: scenario tree
❖ Solution structure: decision tree $x_1, x_2(\omega_1), \ldots, x_T(\omega_{T-1})$

Markov Decision Process (MDP): "stage-wise" solution following the state transition
❖ Stochastic data: probability distribution
❖ Solution structure: decision policy $\pi: S \rightarrow X$

Birge & Louveaux (2011). Introduction to stochastic programming. Springer Science & Business Media.
Puterman (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Value function-based stage decomposition vs. multi-stage stochastic programming (MSSP) with recourse variables. The value function is the expected sum of all future costs.
Proposed "combined" strategy

High-level management / planning (MDP-based planning):
- Known state and simple dynamics (e.g., inventories, price info.)
- High-level future uncertainty (evolving over a long time)
- Long-time-horizon decision-making

Low-level operation / scheduling (finite-horizon operation containing a long-term evaluation):
- Optimization with complex dynamics and constraints
- Lower, fast-decaying uncertainty (due to frequent feedback)
- Need for resolving the end-effect

The two levels are linked through a multi-scale uncertainty model: the low level provides simulated samples from the state transition and reward; the high level provides the long-term evaluation (value function). → MDP-based planning combined with an operation model.
The high-level (planning) problem is solved by DP or RL, the low-level (operation) problem by MP. The multi-scale uncertainty model is built from "big data" (weather factors, market factors, etc.) that is nonstationary, deterministic + stochastic, and periodic (diurnal, daily, seasonal, etc.).
Applications of the strategy

< Microgrid Operation & Design >
Shin, J., Lee, J. H., & Realff, M. J. (2017). "Operational Planning and Optimal Sizing of Microgrid Considering Multi-scale Wind Uncertainty," Applied Energy, 195, 616-633.

< Refinery Procurement & Production Planning >
Shin, J., & Lee, J. H. (2017). "Crude Selection Integrated with Optimal Refinery Operation by Combining Optimal Learning and Mathematical Programming," IFAC-PapersOnLine, 50(1), 9032-9037.

< Raw Material Procurement Planning & Scheduling >
Shin, J., & Lee, J. H. (2016). "Multi-time Scale Procurement Planning Considering Multiple Suppliers and Uncertainty in Supply and Demand," Computers & Chemical Engineering, 91, 114-126.
PART 3-1
Practical Example of the Proposed Model: Hybrid Renewable Energy Network
< Microgrid Operation & Design >
Shin, J., Lee, J. H., & Realff, M. J. (2017). "Operational Planning and Optimal Sizing of Microgrid Considering Multi-scale Wind Uncertainty," Applied Energy, 195, 616-633.
Decision hierarchy in HRES

• Design & Sizing [Yearly]: decisions on capacity acquisition of each source type, to minimize capital cost & operating cost
• Unit Commitment (UC) [Daily]: decisions on which source should be on-line at what time, to minimize operating cost
• Dispatch [Hourly]: decisions on the output of on-line sources, to minimize operating cost while meeting the hourly load

Resource limitation and operation constraints flow down the hierarchy; yearly and daily operation data flow back up. Given both hourly uncertainty and daily & seasonal uncertainty, a temporally integrated stochastic model is required!

H. Liang and W. Zhuang, "Stochastic modeling and optimization in a microgrid: A survey," Energies, Vol. 7, No. 4, pp. 2027-2050, 2014.
Bhandari, Binayak, et al., "Optimization of hybrid renewable energy power systems: A review," International Journal of Precision Engineering and Manufacturing-Green Technology, 2(1), 99-112, 2015.
Yang, Hongxing, Lin Lu, and Wei Zhou, "A novel optimization sizing model for hybrid solar-wind power generation system," Solar Energy, 81(1), 76-84, 2007.
Energy System Model
❖ Wind power conversion model
❖ Battery dynamic model: self-discharging, charging to battery, discharging from battery
❖ Generators:
• Slow-start generator (SG): operational limitations (ramp up/down, minimum up/down time) → commitment plans should be decided at least a day ahead
• Fast-start generator (FG): more flexible source, but expensive
❖ Daily demand (morning ramp, evening peak)
[Figure: wind speed (m/s) and power output over the daily hours]
Temporal integration model: yearly sizing (design), daily unit commitment, and hourly generation dispatch (operation), capturing both intra-day and inter-day variation.
Multi-scale WIND model
[Figures: deterministic profile and hourly ramping (% of installed capacity); seasonality across winter, spring/fall, and summer]
Pesch, T., et al., "A new Markov-chain-related statistical approach for modelling synthetic wind power time series," New Journal of Physics, 17(5), 055001, 2015.
Wan, Yih-huei, "Analysis of wind power ramping behavior in ERCOT," Contract 303, 275-3000, 2011.
❖ Daily-average time series (inter-day wind transition): for the daily average data series (Feb.), the ACF shows an exponential decrease and the PACF cuts off after one lag.
❖ Hourly data regression model (intra-day wind scenarios): hourly wind is regressed on the daily-hour profile, the effect of the daily-average value, and the effect of the previous hour; parameters fitted by least squares.
❖ Yearly wind data generation:
- Data: 2002 ~ 2014 (except 2011) from the National Wind Technology Center
- Monthly-average wind value → daily-average value (inter-day variation) → daily-hours profile (intra-day variation) → hourly wind data generation for one year
- Winter: high production & high variation; Summer: low production & low variation
[Figures: generated vs. actual monthly-average data]
Conventional approach
❖ "One-day" planning: two-stage stochastic programming (2SSP)
Ruiz, Pablo A., et al., "Uncertainty management in the unit commitment problem," IEEE Transactions on Power Systems, 24(2), 642-651, 2009.
Limitation of the two-stage SP → not appropriate for the multi-day problem.
"End-effects": an optimal solution within a finite horizon may be a bad solution beyond the time horizon; there is a trade-off between the current-day cost and the next-day cost.
Proposed approach: yearly sizing problem via value function-based optimization
For each month m = 1, ..., 12, an operational-level VFA* is learned (the RL part): hourly operation over one day is solved as an LP (the MP part), driven by the multi-scale wind uncertainty model (seasonal, daily, hourly) with intra-day wind scenarios and inter-day wind transitions. The yearly operation cost in the sizing problem is then estimated from the value function, which captures the trade-off between current-day and future cost subject to the capacity limitation.
*VFA: value function approximation
Case study
❖ Tested methods:
Method 1: two-stage SP model without the value function (daily independent)
Method 2: MDP + LP with the value function
❖ 2002 ~ 2014 (except 2011) wind data is used for wind uncertainty modeling
❖ 2015 wind data is used for comparing the performance of the tested methods
❖ [Table: system parameters for the case study, covering SG1, SG2, SG3, FG, the battery (out/in), HVDC (buy/sell), the system, and other parameters; demand peak 25, off-peak 13]
Example case: Method 1 (2SSP model without the value function) vs. Method 2 (MDP+LP model with the value function), illustrating the trade-off between current & future cost.

At 15th day (daily wind production: 4.86):
            1st stage cost   2nd stage cost   Overall cost
Method 1    41.30            202.30           243.60
Method 2    41.30            203.69           244.99
(Method 2 vs. Method 1: -0.57%)

At 16th day (daily wind production: 6.22):
            1st stage cost   2nd stage cost   Overall cost
Method 1    41.30            166.76           208.06
Method 2    34.90            158.39           193.29
(Method 2 vs. Method 1: +7.10%)
Case study: results
❖ Tested methods:
Method 1*: two-stage SP model without the value function (daily independent)
Method 2**: two-stage SP model with the value function
❖ Affected day: a day on which Method 1 and Method 2 make different decisions
❖ Improvement for the 151 affected days (40%), by month:

Month   Improvement of Method 2** over Method 1* (%)
1       6.82
2       6.12
3       3.27
4       2.39
5       1.95
6       2.45
7       2.13
8       3.12
9       6.53
10      12.03
11      16.46
12      10.27
PART 3-2
Practical Example of the Proposed Model: Refinery Procurement & Production Planning
< Refinery Procurement & Production Planning >
Shin, J., & Lee, J. H. (2017). "Crude Selection Integrated with Optimal Refinery Operation by Combining Optimal Learning and Mathematical Programming," IFAC-PapersOnLine, 50(1), 9032-9037.
Decision-making structure

• Crude Oil Selection: MDP* (day-to-day)
- Recurring dynamics & decisions, dominated by price uncertainties
- Infinite-time-horizon decision-making
- Driven by a daily price uncertainty model: $p_t = (p_{t,\mathrm{WTI}},\ p_{t,\mathrm{JF}},\ p_{t,\mathrm{GAS}},\ p_{t,\mathrm{DSL}})$, $\hat{p}_{t+1} = p_t + w_{t+1}$ (data: U.S. Energy Information Administration (EIA), 2014.12 ~ 2017.3, 600 days)
• Refinery Optimization: LP** (for one day)
- Optimization with many constraints & realized price info.
- Finite-horizon operation containing a long-term evaluation
- Need for resolving the end-effect

The MDP passes the purchased crude & updated value function to the LP; the LP returns samples for the next state & stage-wise reward.
→ MDP-based planning combined with a detailed refinery operation model
*MDP: Markov decision process, **LP: Linear programming
Model Formulation

Crude Oil Selection: MDP
❖ State and decision variable:
$S_t = [p_{t,i},\ s_{t,c}]$, where $p_{t,i}$ = price of product $i$ and $s_{t,c}$ = storage of crude $c$
$a_t = y_{t,c} = 1$ if crude $c$ is selected, 0 otherwise
❖ One-day reward function $R_t(S_t, a_t)$, defined by the lower-level optimization:
$R(S_t, a_t) = \max_{x}\ Q_1(a_t) + \sum_{p'} P(p'|p_t)\, Q_2(p', s_t)$, with $x = (f_c,\ s_c,\ x_{i,u},\ x_{b,p},\ x_p,\ v_p)$,
where $Q_1$ is the crude purchase term and $Q_2$ the unit processing & product sale term after the uncertainty realization (prices of import & export products)
❖ State transition probability: $P(S'|S_t) = P(p'|p_t) \times P(s'|s_t)$, where $P(p'|p_t)$ comes from the price model and $P(s'|s_t) = 1$ if $s'$ is the optimal solution of the refinery optimization given initial storage $s_t$, and 0 otherwise (inventory transition $s_{t+1,c} = s_c$ from the LP solution)

Refinery Op. Optimization: LP (constrained by the upper-level decision & state; in turn, the MDP model is defined by this lower-level optimization)
❖ Constraints:
- crude availability: $y_c f_c^{\min} \le f_c \le y_c f_c^{\max}$
- initial storage level: $s_{0,c} = s_{t,c}$
Value function approximation & learning
❖ Value function approximation model: $\bar{V}_\theta(S) = \theta^T \phi(S)$, where $\theta = [\theta_0\ \ \theta_p\ \ \theta_{pp}\ \ \theta_s\ \ \theta_{ps}]^T$ and $\phi(S_t) = [1\ \ p_{t,i}\ \ p_{t,i}p_{t,j}\ \ s_{t,c}\ \ p_t^{\mathrm{WTI}} s_{t,c}]$
Bradtke, S. J., & Barto, A. G. (1996). Machine learning, 22(1-3), 33-57.
Boyan, J. A. (2002). Machine Learning, 49(2-3), 233-246
Marginal value of crude inventory: $\tilde{\theta}_s = \theta_s + \theta_{ps}\, p_t^{\mathrm{WTI}}$, which depends on the price state. When the marginal value exceeds the price, $\tilde{\theta}_{s,c} - p_c > 0$ ($/ton), keeping a higher inventory is preferred: a quantitative evaluation of what kind of, and how much, crude oil to retain.
❖ Value function learning algorithm
• Initialize $\theta_0$
• For each iteration ($n = 0, \ldots, N-1$):
1. Simulation data collection (from the LP & price model): $S_n \rightarrow a_n(\theta_n) \rightarrow r_n \rightarrow S_n'$
2. Define the basis: $\phi_n = \phi(S_n)$, $\phi_{n+1} = \phi(S_{n+1})$
3. Update the value function: $\theta_{n+1} \leftarrow \mathrm{RLSTD}^*(\theta_n, \phi_n, \phi_{n+1}, r_n)$
The basis uses linear & bilinear terms for price & inventory (not always a "Deep NN").
*RLSTD: Recursive least squares temporal difference
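For readers unfamiliar with RLSTD, the following is a minimal sketch (my own, in the spirit of the least-squares TD references cited above, not the authors' exact implementation) of one recursive update of the linear value-function weights; the discount factor, initialization, and example features are assumptions.

```python
import numpy as np

class RLSTD:
    """Recursive least-squares TD for a linear value function V(S) = theta^T phi(S)."""

    def __init__(self, n_features, gamma=0.99, delta=1.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)
        self.P = np.eye(n_features) / delta   # running inverse of the LSTD matrix

    def update(self, phi_n, phi_next, r_n):
        # TD feature difference for the sampled transition S_n -> S_n'
        d = phi_n - self.gamma * phi_next
        # Sherman-Morrison style recursive update of P and the weights
        k = self.P @ phi_n / (1.0 + d @ self.P @ phi_n)
        td_error = r_n - d @ self.theta
        self.theta += k * td_error
        self.P -= np.outer(k, d @ self.P)
        return self.theta

# Illustrative usage with made-up 3-dimensional features and a made-up reward.
rlstd = RLSTD(n_features=3)
theta = rlstd.update(np.array([1.0, 0.2, 0.5]), np.array([1.0, 0.1, 0.4]), r_n=2.0)
print(theta)
```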
Case Study
❖ Refinery model parameters: Favennec, J. (2001). Petroleum Refining V5. Refinery Operation and Management, Technip
❖ Seven crude oil types
- Crude assay information: http://corporate.exxonmobil.com/en/company/worldwide-operations/crude-oils/assays
- Individual crude oil price: $p_{t,c} = p_{t,\mathrm{WTI}} + \xi_c$, where $\xi_c = f(\text{VR yield, sulphur content})$
❖ Price model: first-order Markov chain
- Data source: U.S. EIA, 2014.12 ~ 2017.3 (600 days)
❖ Size of state space: inventory space ($3^7$) × price space (90) = 196,830
❖ Size of decision (crude selection) space: 1-crude selections (7) + 2-crude selections (21) = 28
❖ Number of parameters for VFA: 51
❖ Quality specification of final products:
- PG98: C4 content ~5%, RVP 0.5 ~ 0.86, sensitivity ~10, RON ~98
- ES95: C4 content ~5%, RVP 0.45 ~ 0.86, sensitivity ~10, RON ~95
- DSL: sulphur content ~0.05%, HF, VBI 30 ~ 33
Tuning Parameters for Learning

❖ Number of samples (N): limited budget (time) for sampling & learning
❖ Exploration rate (ε): used to make the decision $a_n$ during learning (ε-greedy policy).
For each iteration of learning ($n = 1, \ldots, N$): $s_n \rightarrow a_n \rightarrow r_n \rightarrow s_n'$, with
$a_n$ = the greedy policy (exploit decision) with probability $1-\varepsilon$, or a randomly chosen action (exploratory decision for knowledge acquisition) with probability $\varepsilon$.

[Figure: improvement (%) of the proposed policy over the reference*, as a function of the number of iterations (×10²) and the exploration rate (ε)]

                        Reference*    Proposed
Refinery opt. model     LP            LP
Uncertainty accounted   No            Price
VF incorporation        No            Yes (RL)
Numerical Result

❖ Number of samples: 30000, exploration rate: 0.4
❖ Improvement across 100 runs, each with a randomly generated price dataset (200 days) and a randomly chosen initial inventory state
(Reference*: LP refinery optimization, no uncertainty accounted, no VF; Proposed: LP refinery optimization with price uncertainty and VF incorporation)

         Crude purchase (kton)        Inventory (kton)
         Proposed    Reference        Proposed    Reference
C1       279.67      289.42           99.15       0.50
C2       48.39       49.15            1.23        0.65
C3       8.08        0                40.67       0.57
C4       7.11        0                16.83       0.74
C5       0.33        0.16             2.06        0.46
C6       23.49       25.79            5.43        0.76
C7       0           0                0.49        0.94
Total    367.08      364.53           165.87      4.62
PART 3-3
Practical Example of the Proposed Model: Raw Material Procurement Planning and Scheduling
< Raw Material Procurement Planning & Scheduling >
Shin, J., & Lee, J. H. (2016). "Multi-time Scale Procurement Planning Considering Multiple Suppliers and Uncertainty in Supply and Demand," Computers & Chemical Engineering, 91, 114-126.
[Diagram: multiple independent suppliers (Supplier 1, 2, ..., n) deliver, with random lead times, to a dock and inventory that feed a manufacturer facing random demand.]
Raw Mat. Procurement Planning & Scheduling
Physical product flow: multiple independent suppliers → dock → inventory → manufacturer, with random lead times and random demand.
• Procurement planning: monthly or bi-weekly, t = 1, ...
• Unloading scheduling: daily, h = 1, ..., H; returns the scheduling cost and final inventory to the planning level
❖ Procurement planning (passes the order plan and initial inventory to scheduling)
- Decision: order plan for each supplier
- Objective: to minimize (procurement cost) = (purchasing cost) + (scheduling cost)
❖ Unloading scheduling
- Decision: unloading schedule, inventory level, input to manufacturer
- Objective: to minimize (scheduling cost) = (inventory holding cost) + (unloading cost) + (penalty cost)
- Subject to: constraints on tank & demand, vessel arrival & departure, and constraints on the unloading operation; returns the daily cost & state transition
Time index: (t, h), with (t, H) = (t+1, 0).
Integrated Formulation

Planning: MDP (from t to t+1; provides the initial state & constraints to scheduling)
❖ State, decision & exogenous variables
- State $s_t = [x_t,\ d_t,\ l_{t,i}]$: $x_t$ = inventory level at time $t$, $d_t$ = demand at time $t$, $l_{t,i}$ = lead time of supplier $i$ at time $t$
- Decision $a_t = [a_{t,i}]$: order of supplier $i$ placed at time $t$
- Exogenous information $\omega_t = [\hat{d}_t,\ \hat{l}_{t,i}]$: new info. on demand and on lead times at time $t$
❖ One-period cost function: $c(s_t, a_t) = C_{\mathrm{order}}(a_t) + E\big[C^*_{\mathrm{sch}}(s_t, a_t, \hat{\omega}_{t+1})\big]$, where $C^*_{\mathrm{sch}}$ is the optimal scheduling cost returned by the lower level
❖ State transition probability: $p(s_{t+1}|s_t, a_t) = \prod_{i=1}^{n} p(l'_i|l_{t,i})\; p(d'|d_t)\; p(x'|s_t, a_t, d_t, l_t)$, with $p(x'|s_t, a_t, d_t, l_t) = 1$ if $x' = x^*_{t,H}$ (the final inventory of the scheduling solution) and 0 otherwise

Scheduling: MILP over the daily steps $(t,0), \ldots, (t,h), \ldots, (t,H)$, with time integration $(t,H) = (t+1,0)$
- Decision variables: $x_{t,h} = [Xf_{i,h},\ Xl_{i,h},\ Tf_i,\ Tl_i,\ Z_{i,h},\ Fu_{i,h},\ Fd_h,\ Ld_h,\ Lv_i]^T$
- Objective $C^*_{\mathrm{sch}} = \min \sum_{h} C(x_{t,h})$, summing the inventory holding cost, unloading cost $C_{un,i}$ (between vessel arrival $Tf_i$ and departure $Tl_i$), lost-demand penalty $C_{ld} Ld_h$, and lost-volume penalty $C_{lv} Lv_i$
- Inputs from the planning level: initial storage level $x_{t,0} = x_t$, delivered orders $a_t$ (bounding the unloaded amounts $Fu_{i,h}$), vessel availability $Av_{f,i}$ (from the lead times $l_{t,i}, \hat{l}_{t+1,i}$), and unit demand $Fd_h$ (from $d_t, \hat{d}_{t+1}$ spread over the $H$ daily steps)
Approximation Strategy

• Challenge 1: computing the cost & state transition (scheduling model → planning model) for all states & decisions. Options: the full MILP model, or a heuristic estimation.
• Challenge 2: computing the value function (value function computation → policy construction). Options: exact dynamic programming (value iteration), or approximate value iteration with a piecewise-linearly approximated value function,
$V(s) \approx \bar{V}_\theta(s) = \theta^T \phi(s)$, with basis functions featured by the state variables, $\phi(s) = [1\ \ x\ \ d\ \ l]$.
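As a rough illustration of approximate value iteration with a linear value-function approximation, the sketch below (my own; not the paper's algorithm) alternates sampled Bellman backups with a least-squares refit of the weights; the sampler, model interface, basis, and discount factor are assumptions.

```python
import numpy as np

def approximate_value_iteration(sample_states, actions, model, phi, gamma=0.95, n_iters=50):
    """Fitted (approximate) value iteration with V(s) ~ theta^T phi(s).

    `model(s, a)` is assumed to return a list of (prob, next_state, cost) tuples,
    and `phi(s)` a feature vector; both are placeholders for the real
    planning/scheduling models.
    """
    Phi = np.array([phi(s) for s in sample_states])
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        targets = []
        for s in sample_states:
            # sampled Bellman backup using the current approximation
            q = [sum(p * (c + gamma * phi(s2) @ theta) for p, s2, c in model(s, a))
                 for a in actions]
            targets.append(min(q))          # cost minimization
        # least-squares fit of theta to the backed-up targets
        theta, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return theta
```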
Numerical Case Study – Case 1: moderate size

                              Policy 1                Policy 2             Policy 3
Planning model                MDP with safety stock   MDP                  MDP
Uncertainty accounted         Demand                  Demand & Lead time   Demand & Lead time
Integration with scheduling   No                      MILP model           Heuristic approach
Solution algorithm            Exact VI*               Exact VI             Approximate VI
*VI: Value iteration

                                        Case 1: Moderate size   Case 2: Large size
Decision horizon (scheduling)           10                      30 (one month)
Decision horizon (planning)             10                      12 (one year)
Number of suppliers                     3                       5
Unit cost / penalty: inventory holding  0.1                     0.1
Unit cost / penalty: unloading          3                       5
Unit cost / penalty: lost-demand        10                      10
Unit cost / penalty: lost-volume        5                       5
Unit cost / penalty: safety-stock       10                      10
❖ Three suppliers (Supplier 1: unreliable but cheap; Suppliers 2 and 3: reliable but expensive)

                Supplier 1   Supplier 2   Supplier 3
Lead time       2,3,4,5      2            5
Order space     0,3,6,9      0,3,6        0,6,9,12
Fixed cost      1.5          2            2
Variable cost   0.5          0.5          0.5

❖ MILP model: # of variables: 159 (integer: 66); # of constraints: 317
❖ MDP model: size of state space: 504; size of decision space: 48
❖ Results of Case 1 (100-simulation average):

                                 Policy 1   Policy 2   Policy 3
Average cost                     197.67     171.91     172.46
Improvement over Policy 1 (%)    -          13.03      12.76
CPU time (s)                     2.28       1343.73    23.96
(98.15% improvement in CPU time for Policy 3 relative to Policy 2)
Numerical Case Study – Case 2: large size

❖ Five suppliers (Suppliers 1-3: unreliable but cheap; Suppliers 4 and 5: reliable but expensive)

                Supplier 1   Supplier 2   Supplier 3   Supplier 4   Supplier 5
Lead time       {5~7}        {10,11}      {13~16}      4            15
Order space     {0,5,10}     {0,5,10}     {0,10,15}    {0,5}        {0,10,15}
Fixed cost      1.5          1.7          1.4          2            2
Variable cost   0.5          0.5          0.5          0.5          0.5

❖ MILP model: # of variables: 705 (integer: 310); # of constraints: 1395
❖ MDP model: size of state space: 4424; size of decision space: 136
❖ Results of Case 2 (100-simulation average); Policy 2 (exact VI with the MILP model) is computationally infeasible at this size:

                                 Policy 1   Policy 2   Policy 3
Average cost                     606.75     -          519.46
Improvement over Policy 1 (%)    -          -          14.46
CPU time (s)                     21.13      -          1247.10
Summary
• Reinforcement Learning has its technical roots in animal psychology and optimal control. Recent advances in RL algorithms, Deep NN, and hardware have enabled superhuman performance for some applications.
• Reinforcement Learning has advantages/disadvantages relative to Model Predictive Control, mostly because it emphasizes development of a control policy rather than a process model.
✓ Extensive off-line learning is possible if a good simulator is available. For on-line learning, “disciplined learning” is needed.
• Potential Process Control applications of Reinforcement Learning include:
✓ Use RL agents to manage control systems or optimize under uncertainty
✓ Use RL agents to advise operators during unusual situations
✓ Use a hierarchical network of RL agents to simplify operation of a plant
✓ Use RL agents to integrate strategic business decisions with plant operation decisions
We believe that Reinforcement Learning has the potential to significantly impact both the theory and practice of Process Control, and more generally of Integrated Strategic / Operational Decision Making!
Thank you for listening
Matthew Realff (GT), Joohyun Shin (KAIST), Thomas Badgwell (ExxonMobil)
References
[1] D. Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529, 484-489, (2016).
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016).
[3] D. Silver, Lecture 1: Introduction to Reinforcement Learning, Google DeepMind, (2015).
[4] S. Levine, Deep Reinforcement Learning, Berkeley CS294-112, (2017).
[5] A. Turing, Computing machinery and intelligence, Mind, 59, 433-460, (1950).
[6] OpenAI Gym, A toolkit for developing and comparing reinforcement learning algorithms, (2018).
[7] T. Badgwell, K. Liu, N. Subrahmanya, W. Liu, and M. Kovalski, Adaptive PID Controller Tuning via Deep Reinforcement Learning, U.S. provisional patent application filed, (2017).
[8] J. Lee and W. Wong, Approximate dynamic programming approach for process control, Journal of Process Control, 20, 1038-1048, (2010).
[9] J. Morinelly and E. Ydstie, Dual MPC with Reinforcement Learning, IFAC-PapersOnLine, j.ifacol.2016.07.276, (2016).
[10] S. Kamthe and P. Deisenroth, Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control, arXiv:1706.06491v1, (2017).
[11] V. Mnih et al., Asynchronous Methods for Deep Reinforcement Learning, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, (2016).
[12] Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
[13] Powell, W. B. (2007). Approximate Dynamic Programming: Solving the curses of dimensionality (Vol. 703). John Wiley & Sons.
[14] Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3).
[15] Lee, J. H., Shin, J., & Realff, M. J. (2018). Machine learning: Overview of the recent progresses and implications for the process systems engineering field. Computers & Chemical Engineering, 114, 111-121.