2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu...

93

Transcript of 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu...

Page 1: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H
Page 2: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Moncrief-O’Donnell Chair, UTA Research Institute (UTARI)The University of Texas at Arlington, USA

and

F.L. Lewis National Academy of Inventors

Talk available online at http://www.UTA.edu/UTARI/acs

New Developments in Integral Reinforcement Learning:Continuous-time Optimal Control and Games

Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries

Northeastern University, Shenyang, China

Supported by :NSF AFOSR EuropeONR US TARDEC

Supported by :China NNSFChina Project 111

Page 3: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

3

Thanks to Han-Xiong Li

Page 4: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Optimal Control is Effective for:Aircraft AutopilotsVehicle engine controlAerospace VehiclesShip ControlIndustrial Process Control

Multi-player Games Occur in:Networked Systems Bandwidth AssignmentEconomicsControl Theory disturbance rejectionTeam gamesInternational politicsSports strategy

But, optimal control and game solutions are found byOffline solution of Matrix Design equationsA full dynamical model of the system is needed

Optimality and Games

Page 5: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1 Tu R B Px Kx

BuAxx

( ( )) ( ) ( ) ( )T T T

t

V x t x Qx u Ru d x t Px t

0 ( , , ) 2 2 ( )T

T T T T T T TV VH x u V x Qx u Ru x x Qx u Ru x P Ax Bu x Qx u Rux x

Derivation of Linear Quadratic Regulator

System 

Cost

Differentiate using Leibniz’ formula

Optimal Control is

Scalar equation

( )T TV x Qx u Ru

Differential equivalent is the Bellman equation

Stationarity condition  0 2 2 TH Ru B Pxu

1 10 2 ( )T T T T Tx P Ax BR B Px x Qx x PBR B Px HJB equation

10 T T T T T Tx PAx x A Px x Qx x PBR B Px

u Kx Given any stabilizing FB policy

0 ( , , ) ( ) ( )T T TVH x u x A BK P P A BK Q K RK xx

Lyapunov equation

Full system dynamics must be knownOff‐line solution

Rudolph Kalman 1960

Riccati equation

Page 6: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

xux Ax Bu

SystemControlK

PBPBRQPAPA TT 10

1 TK R B P

On-line real-timeControl Loop

Off-line Design LoopUsing ARE

Optimal Control- The Linear Quadratic Regulator (LQR)

An Offline Design Procedurethat requires Knowledge of system dynamics model (A,B)

System modeling is expensive, time consuming, and inaccurate

( , )Q R

User prescribed optimization criterion ( ( )) ( )T T

t

V x t x Qx u Ru d

Page 7: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Adaptive Control is online and works for unknown systems.Generally not Optimal

Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.

Reinforcement Learning turns out to be the key to this!

We want to find optimal control solutions Online in real-time Using adaptive control techniquesWithout knowing the full dynamics

For nonlinear systems and general performance indices

Bring together Optimal Control and Adaptive Control

Page 8: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

SystemAdaptiveLearning system

ControlInputs

outputs

environmentTuneactor

Reinforcementsignal

Actor

Critic

Desiredperformance

Reinforcement learningIvan Pavlov 1890s

Actor-Critic Learning

We want OPTIMAL performance- ADP- Approximate Dynamic Programming

Every living organism improves its control actions based on rewards received from the environment

The resources available to living organisms are usually meager.Nature uses optimal control.

Optimality in Biological Systems

Page 9: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

D. Vrabie, K. Vamvoudakis, and F.L. Lewis,Optimal Adaptive Control and DifferentialGames by Reinforcement LearningPrinciples, IET Press,2012.

BooksF.L. Lewis, D. Vrabie, and V. Syrmos,Optimal Control, third edition, John Wiley andSons, New York, 2012.

New Chapters on:Reinforcement LearningDifferential Games

Page 10: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

F.L. Lewis and D. Vrabie,“Reinforcement learning and adaptive dynamic programming for feedback control,”IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.

IEEE Control Systems Magazine,F. Lewis, D. Vrabie, and K. Vamvoudakis,“Reinforcement learning and feedback Control,” Dec. 2012

Page 11: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Multi‐player Game SolutionsIEEE Control Systems Magazine,Dec 2017

Page 12: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Integral Reinforcement Learning for CT systemsSynchronous IRLMulti-player Non-zero Sum GamesGraphical Games on NetworksZero-Sum GamesQ Learning for CT SystemsExperience ReplayOff-Policy IRL

Page 13: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

t

T

t

dtRuuxQdtuxrtxV ))((),())((

( , , ) ( , ) ( , ) ( ) ( ) ( , ) 0T TV V VH x u V r x u x r x u f x g x u r x u

x x x

112( ) ( )T Vu h x R g x

x

dxdVggR

dxdVxQf

dxdV T

TT *1

*

41

*

)(0

, (0) 0V

Nonlinear System dynamics 

Cost/value 

Bellman Equation, in terms of the Hamiltonian function

Stationary Control Policy

HJB equation

CT Systems‐ Derivation of Nonlinear Optimal Regulator

Off‐line solutionHJB hard to solve.   May not have smooth solution.Dynamics must be known

Stationarity condition 0Hu

( , ) ( ) ( )x f x u f x g x u

Leibniz gives Differential equivalent

To find online methods for optimal control Focus on these two equations

Problem‐ System dynamics shows up in Hamiltonian

Page 14: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

),,(),(),(0 uxVxHuxruxf

xV T

0 ( , ( )) ( , ( ))T

jj j

Vf x h x r x h x

x

(0) 0jV

1121( ) ( ) jT

j

Vh x R g x

x

CT Policy Iteration – a Reinforcement Learning Technique

• Convergence proved by Leake and Liu 1967, Saridis 1979 if Lyapunov eq. solved exactly

• Beard & Saridis used Galerkin Integrals to solve Lyapunov eq.• Abu Khalaf & Lewis used NN to approx. V for nonlinear systems and proved convergence

RuuxQuxr T )(),(Utility 

The cost is given by solving the CT Bellman equation

Policy Iteration Solution

Pick stabilizing initial control policy

Policy Evaluation ‐ Find cost, Bellman eq.

Policy improvement ‐ Update control Full system dynamics must be knownOff‐line solution

dxdVggR

dxdVxQf

dxdV T

TT *1

*

41

*

)(0

Scalar equation

M. Abu-Khalaf, F.L. Lewis, and J. Huang, “Policyiterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation,”IEEE Trans. Automatic Control, vol. 51, no. 12, pp.1989-1995, Dec. 2006.

( ) ( )u x h xGiven any admissible policy

0 ( )h x

Converges to solution of HJB

Page 15: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

PBPBRQPAPA TT 10

1 Tu R B Px Kx

BuAxx

( ( )) ( ) ( ) ( )T T T

t

V x t x Qx u Ru d x t Px t

Full system dynamics must be knownOff‐line solution

0 ( , , ) 2 2 ( )T

T T T T T T TV VH x u V x Qx u Ru x x Qx u Ru x P Ax Bu x Qx u Rux x

Policy Iterations for the Linear Quadratic Regulator

0 ( ) ( )T TA BK P P A BK Q K RK

u Kx

System 

Cost

Differential equivalent is the Bellman equation

Given any stabilizing FB policy

The cost value is found by solving Lyapunov equation = Bellman equation

Optimal Control is

Algebraic Riccati equation

Page 16: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

LQR Policy iteration = Kleinman algorithm

1. For a given control policy                 solve for the cost: 

2. Improve policy:

If started with a stabilizing control policy the matrix monotonically converges to the unique positive definite solution of the Riccati equation.

Every iteration step will return a stabilizing controller. The system has to be known.

ju K x

0 T Tj j j j j jA P P A Q K RK

11

Tj jK R B P

j jA A BK

0K jP

Kleinman 1968

Bellman eq. = Lyapunov eq.

OFF‐LINE DESIGNMUST SOLVE LYAPUNOV EQUATION AT EACH STEP.

Matrix equation

Page 17: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Can Avoid knowledge of drift term f(x)

Work of Draguna Vrabie

Policy iteration requires repeated solution of the CT Bellman equation

0 ( , ( )) ( , ( )) ( , ( )) ( ) ( , , ( ))T T

TV V VV r x u x x r x u x f x u x Q x u Ru H x u xx x x

This can be done online without knowing f(x)using measurements of x(t), u(t) along the system trajectories

Integral Reinforcement Learning

( ) ( )x f x g x u

D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477-484, 2009.

Page 18: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Lemma 1 – Draguna Vrabie

Solves Bellman equation without knowing f(x,u)

( ( )) ( , ) ( ( )), (0) 0t T

t

V x t r x u d V x t T V

0 ( , ) ( , ) ( , , ), (0) 0TV Vf x u r x u H x u V

x x

Allows definition of temporal difference error for CT systems

( ) ( ( )) ( , ) ( ( ))t T

t

e t V x t r x u d V x t T

Integral reinf. form  (IRL) for the CT Bellman eq.Is equivalent to

( ( )) ( , ) ( , ) ( , )t T

t t t T

V x t r x u d r x u d r x u d

value

Key Idea

Work of Draguna Vrabie 2009Integral Reinforcement Learning

Bad Bellman Equation

Good Bellman Equation

Page 19: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( ) ( ) ( )( ) ( ) ( ) ( )t T

T T T T

tx t Px t x Q L RL x d x t T Px t T

0T Tc cA P P A L RL Q

Lemma 1 ‐ D. Vrabie‐ LQR case

is equivalent to

( ) ( ) ( )T

T T T Tc c

d x P x x A P PA x x L RL Q xdt

Proof:

( ) ( ) ( ) ( ) ( ) ( )t T t T

T T T T T

t t

x Q L RL xd d x Px x t Px t x t T Px t T

Solves Lyapunov equation without knowing A or B

, cA A BL Integral Reinforcement Learning form

Lyapunov equation

( ( )) ( , ) ( ( )), (0) 0t T

t

V x t r x u d V x t T V

LQR Case

IRL Bellman equation

Value function  ( ( )) ( ) ( )TV x t x t Px t

Page 20: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( ( )) ( , ) ( ( ))t T

k k kt

V x t r x u dt V x t T

IRL Policy iteration

Initial stabilizing control is needed

Cost update

Control gain update

f(x) and g(x) do not appear

g(x) needed for control update

Policy evaluation‐ IRL Bellman Equation

Policy improvement11

21 1( ) ( )T kk k

Vu h x R g xx

),,(),(),(0 uxVxHuxruxf

xV T

Equivalent to

Solves Bellman eq. (nonlinear Lyapunov eq.) without knowing system dynamics

CT Bellman eq.

Integral Reinforcement Learning (IRL)- Draguna Vrabie

D. Vrabie proved convergence to the optimal value and controlAutomatica 2009, Neural Networks 2009

Converges to solution to HJB eq.dx

dVggRdx

dVxQfdx

dV TTT *

1*

41

*

)(0

Page 21: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

CT Policy Iteration – How to implement online?Linear Systems Quadratic Cost‐ LQR

1 111 12 11 121 2 1 2

2 212 22 12 22

1 2 1 2

1 2 1 211 12 22 11 12 22

2 2 2 2

( ) ( )

( ) ( )( ) ( ) ( ) ( )

( ) ( )

( ) ( )2 2( ) ( )

( ) ( )t t T

Tk

p p p px t x t Tx t x t x t T x t T

p p p px t x t T

x xp p p x x p p p x x

x x

p x t x t T

Quadratic basis set

( ) ( ) ( )( ) ( ) ( ) ( )t T

T T T Tk k k k

tx t P x t x Q K RK x d x t T P x t T

Policy evaluation‐ solve IRL Bellman Equation

( ) ( ) ( ) ( ) ( )( ) ( )t T

T T T Tk k k k

tx t P x t x t T P x t T x Q K RK x d

( ( )) ( ) ( )TV x t x t Px tValue function is quadratic

( ) ( ) ( ) ( ) ( ) ( )t T

T T T Tk k k k

tp t p x t x t T x Q L RL x d

( , )t t T

Same form as standard System ID problems

Page 22: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( ( )) ( ) ( ( ))t T

Tk k k k

t

V x t Q x u Ru dt V x t T

Approximate value by Weierstrass Approximator Network ( )TV W x

( ( )) ( ) ( ( ))t T

T T Tk k k k

t

W x t Q x u Ru dt W x t T

( ( )) ( ( )) ( )t T

T Tk k k

t

W x t x t T Q x u Ru dt

regression vector Reinforcement on time interval [t, t+T]

kWNow use RLS along the trajectory to get new weights

Then find updated FB1 11 1

2 21 1( ( ))( ) ( ) ( )

( )

TT Tk

k k kV x tu h x R g x R g x Wx x t

Nonlinear Case- Approximate Dynamic Programming

Direct Optimal Adaptive Control for Partially Unknown CT Systems

Value Function Approximation (VFA) to Solve Bellman Equation– Paul Werbos (ADP), Dimitri Bertsekas (NDP)

Scalar equation with vector unknowns

Same form as standard System ID problems in Adaptive Control

Optimal ControlandAdaptive Controlcome togetherOn this slide.Because of RL

Page 23: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( ( )) ( ( )) ( )t T

T Tk k k

t

W x t x t T Q x u Ru dt

2

( ( )) ( ( 2 )) ( )t T

T Tk k k

t T

W x t T x t T Q x u Ru dt

3

2

( ( 2 )) ( ( 3 )) ( )t T

T Tk k k

t T

W x t T x t T Q x u Ru dt

11 12

12 22

p pp p

11 12 22TW p p p

Solving the IRL Bellman Equation

Need data from 3 time intervals to get 3 equations to solve for 3 unknowns

Now solve by Batch least-squares

Solve for value function parameters

Page 24: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( ( )) ( ( )) ( ( )) ( ) ( )t T

T T Tk k k k

t

W x t W x t x t T Q x u Ru dt t

2

( ( )) ( ( )) ( ( 2 )) ( ) ( )t T

T T Tk k k k

t T

W x t T W x t T x t T Q x u Ru dt t T

3

2

( ( 2 )) ( ( 2 )) ( ( 3 )) ( ) ( 2 )t T

T T Tk k k k

t T

W x t T W x t T x t T Q x u Ru dt t T

11 12

12 22

p pp p

11 12 22TW p p p

Solving the IRL Bellman Equation

Need data from 3 time intervals to get 3 equations to solve for 3 unknowns

Now solve by Batch least-squares

Or can use Recursive Least-Squares (RLS)

Solve for value function parameters

Put together

( ( )) ( ( )) ( ( 2 )) ( ) ( ) ( 2 )TkW x t x t T x t T t t T t T

Page 25: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

t t+T

observe x(t)

apply uk=Lkx

observe cost integral

update P

Do RLS until convergence to Pk

update control gain

Integral Reinforcement Learning (IRL)

Data set at time [t,t+T)

( ), ( , ), ( )x t t t T x t T

t+2T

observe x(t+T)

apply uk=Lkx

observe cost integral

update P

t+3T

observe x(t+2T)

apply uk=Lkx

observe cost integral

update P

11

Tk kK R B P

( , )t t T ( , 2 )t T t T ( 2 , 3 )t T t T

Or use batch least-squares

(x( )) ( ( )) ( ) ( ) ( ) ( , )t T

T T Tk k k

tW t x t T x Q K RK x d t t T

Solve Bellman Equation - Solves Lyapunov eq. without knowing dynamics

This is a data-based approach that uses measurements of x(t), u(t)Instead of the plant dynamical model.

A is not needed anywhere

Page 26: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Persistence of Excitation

Regression vector must be PE

Relates to choice of reinforcement interval T

( ( )) ( ( )) ( )t T

T Tk k k

t

W x t x t T Q x u Ru dt

Page 27: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Implementation Policy evaluationNeed to solve online

Add a new state= Integral Reinforcement

T Tx Q x u R u

This is the controller dynamics or memory

(x( )) ( ( )) ( ) ( ) ( ) ( , )t T

T T Tk k k

tW t x t T x Q K RK x d t t T

Page 28: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Direct Optimal Adaptive Controller

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval

Draguna Vrabie

DynamicControlSystemw/ MEMORY

xu

V

ZOH T

x Ax Bu System

T Tx Q x u Ru

Critic

ActorK

T T

Run RLS or use batch L.S.To identify value of current control

Update FB gain afterCritic has converged

Reinforcement interval T can be selected on line on the fly – can change

Solves Riccati Equation Online without knowing A matrix

CT time Actor‐Critic Structure

Page 29: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Actor / Critic structure for CT Systems

Theta waves 4-8 Hz

Reinforcement learning

Motor control 200 Hz

Optimal Adaptive IRL for CT systems

A new structure of adaptive controllers

( ( )) ( , ) ( ( ))t T

k k kt

V x t r x u dt V x t T

1121 1( ) ( )T k

k kVu h x R g xx

D. Vrabie, 2009

Page 30: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

PBPBRQPAPA TT 10

1 Tu R B Px Kx

Full system dynamics must be knownOff‐line solution

Algebraic Riccati equation

( ) ( ) ( )( ) ( ) ( ) ( )t T

T T T Tk k k k

tx t P x t x Q K RK x d x t T P x t T

11

Tk kK R B P Only B is needed

On line solution in real timeUses data measurements along system trajectory

CT Bellman eq.

1 1( ) ( )k ku t K x t

The Optimal Control Solution

ARE solution can be found online in real-time by using the IRL Algorithmwithout knowing A matrix

Iterate for k=0,1,2,….

The Bottom Line about Integral Reinforcement Learning Off-line ARE Solution

Data-driven On-line ARE Solution

Data set at time [t,t+T)

( ), ( , ), ( )x t t t T x t T

Page 31: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

xux Ax Bu

SystemControlK

PBPBRQPAPA TT 10

1 TK R B P

On-line Control Loop

Off-line Design Loop

Optimal Control- The Linear Quadratic Regulator (LQR)

An Offline design Procedurethat requires Knowledge of system dynamics model(A,B)

System modeling is expensive, time consuming, and inaccurate

( , )J Q RUser prescribed optimization criterion

Page 32: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1 TK R B P

PBPBRQPAPA TT 10

xux Ax Bu

SystemControlK On-line Control Loop

On-line Performance Loop

Data-driven Online Adaptive Optimal ControlDDO

An Online Supervisory Control Procedurethat requires no Knowledge of system dynamics model A

Automatically tunes the control gains in real time to optimize a user given cost functionUses measured data (u(t),x(t)) along system trajectories

( , )J Q RUser prescribed optimization criterion

( ) ( ) ( )( ) ( ) ( ) ( )t T

T T T Tk k k k

tx t P x t x Q K RK x d x t T P x t T

11

Tk kK R B P

Data set at time [t,t+T)

( ), ( , ), ( )x t t t T x t T

Page 33: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Simulation 1‐ F‐16 aircraft pitch rate controller

1.01887 0.90506 0.00215 00.82225 1.07741 0.17555 0

0 0 1 1x x u

Stevens and Lewis 2003

*1 11 12 13 22 23 33[p 2p 2p p 2p p ]

[1.4245 1.1682 -0.1352 1.4349 -0.1501 0.4329]

T

T

W

PBPBRQPAPA TT 10 ARE

Exact solution

,Q I R I

][ eqx

Select quadratic NN basis set for VFA

Page 34: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

0 0.5 1 1.5 2-0.3

-0.2

-0.1

0Control signal

Time (s)

0 0.5 1 1.5 2-0.4

-0.2

0Controller parameters

Time (s)

0 0.5 1 1.5 20

0.5

1

1.5

2

2.5

3

3.5

4System states

Time (s)

0 1 2 3 4 5 60

0.05

0.1

0.15

0.2

Critic parameters

Time (s)

P(1,1)P(1,2)P(2,2)P(1,1) - optimalP(1,2) - optimalP(2,2) - optimal

Simulations on:  F‐16 autopilot

A matrix not needed 

Converge to SS Riccati equation soln

Solves ARE online without knowing A

PBPBRQPAPA TT 10

Page 35: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1/ / 0 0 00 1/ 1/ 0 0

,1/ 0 1/ 1/ 1/

0 0 0 0

p p p

T T

G G G G

E

T K TT T

A BRT T T T

K

x Ax Bu

( ) [ ( ) ( ) ( ) ( )]Tg gx t f t P t X t E t

FrequencyGenerator outputGovernor positionIntegral control

Simulation 2: Load Frequency Control of Electric Power system

PBPBRQPAPA TT 10

0.4750 0.4766 0.0601 0.4751

0.4766 0.7831 0.1237 0.3829

0.0601 0.1237 0.0513 0.0298

0.4751 0.3829 0.0298 2.3370

.AREP

ARE

ARE solution using full dynamics model (A,B)

A. Al-Tamimi, D. Vrabie, Youyi Wang

Page 36: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

0.4750 0.4766 0.0601 0.4751

0.4766 0.7831 0.1237 0.3829

0.0601 0.1237 0.0513 0.0298

0.4751 0.3829 0.0298 2.3370

.AREP

PBPBRQPAPA TT 10

0 10 20 30 40 50 60-0.5

0

0.5

1

1.5

2

2.5P matrix parameters P(1,1),P(1,3),P(2,4),P(4,4)

Time (s)

P(1,1)P(1,3)P(2,4)P(4,4)

( ), ( ), ( : )x t x t T t t T Fifteen data points

Hence, the value estimate was updated every 1.5s.

IRL period of T= 0.1s.

Solves ARE online without knowing A

0.4802 0.4768 0.0603 0.4754

0.4768 0.7887 0.1239 0.3834

0.0603 0.1239 0.0567 0.0300

0.4754 0.3843 0.0300 2.3433

.critic NNP

0 1 2 3 4 5 6-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

0.06

0.08

0.1System states

Time (s)

Page 37: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Optimal Control Design Allows a Lot of Design Freedom

Page 38: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( ( )) ( , ) ( ( ))t T

k k kt

V x t r x u dt V x t T

IRL Policy iteration Initial stabilizing control is needed

Cost update

Control gain update

Policy evaluation‐ IRL Bellman Equation

Policy improvement11

21 1( ) ( )T kk k

Vu h x R g xx

CT PI Bellman eq.= Lyapunov eq.

IRL Value Iteration - Draguna Vrabie

Converges to solution to HJB eq.dx

dVggRdx

dVxQfdx

dV TTT *

1*

41

*

)(0

1( ( )) ( , ) ( ( ))t T

k k kt

V x t r x u dt V x t T

IRL Value iteration Initial stabilizing control is NOT needed

Cost update

Control gain update

Value evaluation‐ IRL Bellman Equation

Policy improvement1 11

21 1( ) ( )T kk k

Vu h x R g xx

CT VI Bellman eq.

Converges if T is small enough

Page 39: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Kung Tz 500 BCConfucius

ArcheryChariot driving

MusicRites and Rituals

PoetryMathematics

孔子Man’s relations to

FamilyFriendsSocietyNationEmperorAncestors

Tian xia da tongHarmony under heaven

Page 40: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Pythagoras 500 BC Natural Philosophy, Ethics, and Mathematics

Numbers and the harmony of the spheres

Music, Mathematics, Gymnastics, Astronomy, Medicine

Translate music to mathematical equationsRatios in music

Fire, air, water, earth

The School of Pythagoras esoterikoi and exoterikoi

Mathematikoi - learnersAkoustiki- listeners

Patterns in NatureSerenity and Self-Possession

Page 41: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H
Page 42: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Actor / Critic structure for CT Systems

Theta waves 4-8 Hz

Reinforcement learning

Motor control 200 Hz

Optimal Adaptive IRL for CT systems

A new structure of adaptive controllers

( ( )) ( , ) ( ( ))t T

k k kt

V x t r x u dt V x t T

1121 1( ) ( )T k

k kVu h x R g xx

D. Vrabie, 2009

Page 43: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Motor control 200 Hz

Oscillation is a fundamental property of neural tissue

Brain has multiple adaptive clocks with different timescales

theta rhythm, Hippocampus, Thalamus, 4-10 Hzsensory processing, memory and voluntary control of movement.

gamma rhythms 30-100 Hz, hippocampus and neocortex high cognitive activity.

• consolidation of memory• spatial mapping of the environment – place cells

The high frequency processing is due to the large amounts of sensorial data to be processed

Spinal cord

D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control forpartially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.

D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller- A Continuous-Time Formulation -,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.

Page 44: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Doya, Kimura, Kawato 2001

Limbic system

Motor control 200 Hz

theta rhythms 4-10 HzDeliberativeevaluation

control

Page 45: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Cerebral cortexMotor areas

ThalamusBasal ganglia

Cerebellum

Brainstem

Spinal cord

Interoceptivereceptors

Exteroceptivereceptors

Muscle contraction and movement

Summary of Motor Control in the Human Nervous System

reflex

Supervisedlearning

ReinforcementLearning- dopamine

(eye movement)inf. olive

Hippocampus

Unsupervisedlearning

Limbic System

Motor control 200 Hz

theta rhythms 4-10 Hz

picture by E. StinguD. Vrabie

Memoryfunctions

Long term

Short term

Hierarchy of multiple parallel loops

gamma rhythms 30-100 Hz

Page 46: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

69

Synchronous Real‐time Data‐driven Optimal Control 

Page 47: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Actor / Critic structure for CT Systems

Theta waves 4-8 Hz

Motor control 200 Hz

Optimal Adaptive Integral Reinforcement Learning for CT systems

A new structure of adaptive controllers

( ( )) ( , ) ( ( ))t T

k k kt

V x t r x u dt V x t T

1121 1( ) ( )T k

k kVu h x R g xx

D. Vrabie, 2009

Policy Iteration gives the structure needed for online optimal solution

Page 48: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Theorem (Kyriakos Vamvoudakis)‐ Online Learning of Nonlinear Optimal Control

Tune actor NN weights as

Learning the Value

Learning the control policy

Online Synchronous Policy IterationKyriakos Vamvoudakis

1ˆ ˆ( ) ( )T

cV X W X

( )( ( )) ( ) ( ) ( ) ( )] ( ( ))t T

t T T TT

t

V X t e X Q X u Ru d e V X t Th t ht t t t t+

- - -é= + + +êëò

( )1

ˆ ˆ ˆ( ) ( ( ) ( ) ( ) ( ))t

T t T T TB c T

t T

e W t e X Q X u Ru d

1 1 1( ) ( ) ( )Tt e t t T

1

21 1

( )ˆ(1 ( ) ( ))c c BT

tW e

t t

fa

f f

D= -

+D D

1 11 ˆ(̂ ) ( )( )2

T Tu

u t R G X WX

f- ¶= -

2 1 1

11 1 12

1 1

ˆ ˆ ˆ(( ( ) )

( )1 ˆ ˆ( ( ) ( )( ) ) ( ) )4 (1 ( ) ( ))

Tu u u c

TT T

u cT

W FW F t W

tG X R G X W WX X t t

a f

f f f

f f-

= - -

¶ ¶ D-

¶ ¶ +D D

Actor NN

Critic NN

IRL Bellman Equation

Page 49: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1 11 1 1 2 2 2

1 1( ) ( ) ( ) ( ).2 2

T TL t V x tr W a W tr W a W

Lyapunov  energy‐based Proof:

1140 ( )

T TTdV dV dVf Q x gR g

dx dx dx

V(x)= Unknown solution to HJB eq.

2 1 2ˆW W W

1 1 1̂W W W

1 1 1( , , ) ( ) ( )T THH x W u W f gu Q x u Ru

W1= Unknown LS solution to Bellman equation for given N

Guarantees stability

Page 50: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Adaptive Critic structure Reinforcement learning

Two Learning NetworksTune them Simultaneously

Synchronous Online Solution of Optimal Control for Nonlinear Systems

A new form of Adaptive Control with TWO tunable networks

A new structure of adaptive controllers

12 2 2 2 1 1 1 2 11ˆ ˆ ˆ ˆ ˆ{( ) ( ) ( ) }4

T TW F W F W D x W m x W

K.G. Vamvoudakis and F.L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinitehorizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878-888, May 2010.

Page 51: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

A New Class of Adaptive Control

Plantcontrol output

Identify the Controller-Direct Adaptive

Identify the system model-Indirect Adaptive

Identify the performance value-Optimal Adaptive

)()( xWxV T

Page 52: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Simulation 1‐ F‐16 aircraft pitch rate controller

1.01887 0.90506 0.00215 00.82225 1.07741 0.17555 0

0 0 1 1x x u

Stevens and Lewis 2003

*1 11 12 13 22 23 33[p 2p 2p p 2p p ]

[1.4245 1.1682 -0.1352 1.4349 -0.1501 0.4329]

T

T

W

PBPBRQPAPA TT 10

Solves ARE online

Exact solution

1̂( ) [1.4279 1.1612 -0.1366 1.4462 -0.1480 0.4317] .TfW t

Algorithm converges to

,Q I R I

Must add probing noise to get PE

111 22

ˆ( ) ( ) ( )T Tu x R g x W n t

2ˆ ( ) [1.4279 1.1612 -0.1366 1.4462 -0.1480 0.4317]TfW t

(exponentially decay n(t))

1

2 1

3 11 11 12 2 2

2

3 2

3

2 0 0 1.42790 1.1612

0 0 -0.1366ˆ ( ) 01.44620 2 01-0.148000.43170 0 2

T

T

T

xx xx x

u x R B Px Rxx x

x

][ eqx

Select quadratic NN basis set for VFA

Page 53: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Critic NN parameters‐Converge to ARE solution

System states

Page 54: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Simulation 2. – Nonlinear System  

2( ) ( ) ,x f x g x u x R

21 2

1 2

1

1

0.5 0.5 (1 ( )( )

cos(2 ) 2

0 ( )

)

.cos(2 ) 2

x xf x

x

g x

x x

x

,Q I R I

* 2 21 2

1( )2

V x x x Optimal Value 

*1 2( ) (cos(2 ) 2) .u x x x Optimal control

2 21 1 1 2 2( ) [ ] ,Tx x x x x Select VFA basis set 

1̂( ) [0.5017 -0.0020 1.0008] .TfW t

Algorithm converges to

2ˆ ( ) [0.5017 -0.0020 1.0008] .T

fW t 111

2 2 121

2

2 0 0.50170ˆ ( ) -0.0020

cos(2 ) 20 2 1.0008

TT x

u x R x xx

x

Nevistic V. and Primbs J. A. (1996)Converse optimal

dxdVggR

dxdVxQf

dxdV T

TT *1

*

41

*

)(0

Solves HJB equation online

Page 55: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Critic NN parameters states

Optimal value fn. Value fn. approx. error Control approx error

Page 56: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Data‐driven Online Solution of Differential GamesSynchronous Solution of Multi‐player Non Zero‐sum Games

Page 57: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Sun Tz bin fa孙子兵法

500 BC

Multi‐player Differential Games

Page 58: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Real‐Time Solution of Multi‐Player NZS Games

1

( ) ( )N

j jj

x f x g x u

Multi‐Player Nonlinear Systems

Optimal control *1 2

10

( (0), , , ) min ( ( ) ) ;i

NT

i N i i ij ij

V x Q x R dt i N

* * * * * * * *1 2 1 1 2( , , ,..., ) ( , , ,..., ),i i i iV V V i N Nash equilibrium

1

1

0 ,N

T T Ti c c i i j j jj ij jj j j

j

P A A P Q P B R R R B P i N

Requires Offline solution of coupled Hamilton‐Jacobi –Bellman  eqs.

Kyriakos Vamvoudakis, Automatica 2011

Continuous‐time,  N players

112

1

0 ( ) ( ) ( ) ( )N

T Ti j jj j j

j

V f x g x R g x V

114

1

( ) ( ) ( ) , (0) 0N

T T Ti j j jj ij jj j j i

j

Q x V g x R R R g x V V

Linear Quadratic Regulator Case‐ coupled AREs

These are hard to solveIn the nonlinear case, HJB generally cannot be solved

100

112( ) ( ) ,T

i ii i ix R g x V i N

Control policies

Page 59: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Real‐Time Solution of Multi‐Player Games

1 210

( (0), , , ) ( ( ) ) ;N

Ti N i i ij i

j

V x Q x R dt i N

Value functions

Solve Bellman eq.

Policy Iteration Solution:

11

0 ( , , , ) ( ) ( ) ( ) , (0) 0N

k k k T i kN i j j i

j

r x V f x g x V i N

1 112( ) ( ) ,k T k

i ii i ix R g x V i N Policy Update

Non‐Zero Sum Games – Synchronous Policy Iteration Kyriakos Vamvoudakis

Convergence has not been provenHard to solve Hamiltonian equationBut this gives the structure we need for online Synchronous PI Solution

Differential equivalent gives coupled Bellman eqs.

11 1

0 ( ) ( ) ( ( ) ( ) ) ( , , , , ),N N

T Ti j ij j i j j i i N

j jQ x u R u V f x g x u H x V u u i N

Leibniz gives Differential equivalent

Page 60: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Real‐Time Solution of Multi‐Player Games

N Critic Neural Networks for VFA 1 1 1ˆ ˆ( ) ( ) ,TV x W x 2 2 2

ˆ ˆ( ) ( )TV x W x

11 1 1 1 1 1 11 1 2 12 22

1 1

ˆ ˆ[ ( ) ]( 1)

T T TTW a W Q x u R u u R u

13 3 2 3 1 3 1 1 11 21 11 1 3 2 2 1 3 1 1

1 1ˆ ˆ ˆ ˆ ˆ ˆ ˆ{( ) ( ) ( ) ( ) }4 4

T T T T T TW F W F W g x R R R g x W m W D x W m W

N Actor Neural Networks

On‐Line Learning – for Player 1:

111 11 1 1 32

ˆ( ) ( ) ,T Tu x R g x W

Online Synchronous PI Solution for Multi‐Player Games

Learns Bellman eq. solution

Kyriakos Vamvoudakis

112 22 2 2 42

ˆ( ) ( )T Tu x R g x W

2-player case

Each player needs 2 NN – a Critic and an Actor

Learns control policy

Player 1 Player 2

Convergence is proven using Lyapunov functions

K.G. Vamvoudakis and F.L. Lewis, “Multi-Player Non-Zero Sum Games: Online Adaptive LearningSolution of Coupled Hamilton-Jacobi Equations,” Automatica, vol. 47, pp. 1556-1569, 2011.

Page 61: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1 11 1 1 2 2 2

1 1( ) ( ) ( ) ( ).2 2

T TL t V x tr W a W tr W a W

Lyapunov  energy‐based Proof:

1140 ( )

T TTdV dV dVf Q x gR g

dx dx dx

V(x)= Unknown solution to HJB eq.

2 1 2ˆW W W

1 1 1̂W W W

1 1 1( , , ) ( ) ( )T THH x W u W f gu Q x u Ru

W1= Unknown LS solution to Bellman equation for given N

Guarantees stability

Page 62: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Manufacturing as the Interactions of Multiple AgentsEach machine has it own dynamics and cost functionNeighboring machines influence each other most stronglyThere are local optimization requirements as well as global necessities

( )i

i i i i i i ij j jj N

A d g B u e B u

12

0

( (0), , ) ( )i

T T Ti i i i i ii i i ii i j ij j

j N

J u u Q u R u u R u dt

Each process has its own dynamics

And cost function

Each process helps other processes achieve optimality and efficiency

Page 63: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Process i Cost Function Optimization

Process j1 controlupdate

Process i controlupdate

Process j2 controlupdate

Optimal Performance of Each Process Depends on the Control of its Neighbor Processes

Control Policy of Each Process Depends on the Performance of its Neighbor Processes

Process i Cost Function Optimization

Process i controlupdate

Process j2 Cost Function Optimization

Process j1 Cost Function Optimization

Multi-player Games for Multi-Process Optimal ControlData-Driven Optimization (DDO)

Page 64: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Sun Tz bin fa孙子兵法

Games on Communication Graphs

500 BC

Page 65: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2013

Key Point

Lyapunov Functions and Performance IndicesMust depend on graph topology

Hongwei Zhang, F.L. Lewis, and Abhijit Das“Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,”IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.

H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.

Page 66: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

,i i i ix Ax B u

0 0x Ax

0( ) ( ),ix t x t i

0( ) ( ),i

i ij i j i ij N

e x x g x x

( ) ,nix t ( ) im

iu t

Graphical GamesSynchronization‐ Cooperative Tracker Problem

Node dynamics

Target generator dynamics

Synchronization problem

Local neighborhood tracking error (Lihua Xie)

x0(t)

( )i

i i i i i i ij j jj N

A d g B u e B u

12

0

( (0), , ) ( )i

T T Ti i i i i ii i i ii i j ij j

j N

J u u Q u R u u R u dt

12

0

( ( ), ( ), ( ))i i i iL t u t u t dt

Local nbhd. tracking error dynamics

Define Local nbhd. performance index

Local agent dynamics driven by neighbors’ controls

Values driven by neighbors’ controls

K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.

M. Abouheaf, K. Vamvoudakis, F.L. Lewis, S. Haesaert, and R. Babuska, “Multi-Agent Discrete-Time GraphicalGames and Reinforcement Learning Solutions,” Automatica, Vol. 50, no. 12, pp. 3038-3053, 2014.

Page 67: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1u

2u

iu Control action of player i

Value function of player i

New Differential Graphical Game

( )i

i i i i i i ij j jj N

A d g B u e B u

State dynamics of agent i

Local DynamicsLocal Value Function

Only depends on graph neighbors

12

0

( (0), , ) ( )i

T T Ti i i i i ii i i ii i j ij j

j N

J u u Q u R u u R u dt

Page 68: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1

N

i ii

z Az B u

12

10

( (0), , ) ( )N

T Ti i i j ij j

j

J z u u z Qz u R u dt

1u

2u

iu Control action of player i

Central Dynamics

Value function of player i

Standard Multi-Agent Differential Game

Central DynamicsLocal Value Functiondepends on ALL

other control actions

Page 69: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Def. Local Best response.is said to be agent i’s local best response to fixed policies        of its neighbors if

* * *( , ) ( , ), ,i j G j i j G jJ u u J u u i j N

A restriction on what sorts of performance indices can be selected in multi‐player graph games. 

A condition on the reaction curves (Basar and Olsder) of the agents

This rules out the disconnected counterexample.

*( , ) ( , ),i i i i i i iJ u u J u u u

*iu iu

New Definition of Nash Equilibrium for Graphical Games

* * *1 2, ,...,u u u

* * * *( , ) ( , ),i i i G i i i G iJ J u u J u u i N

Def:  Interactive Nash equilibrium

are in Interactive Nash equilibrium if

2.  There exists a policy         such that ju

1.

That is, every player can find a policy that changes the value of every other player. 

i.e. they are in Nash equilibrium

Page 70: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1 1 12 2 2( , , , ) ( ) 0

i i

TT T Ti i

i i i i i i i i i ij j j i ii i i ii i j ij ji i j N j N

V VH u u A d g B u e B u Q u R u u R u

10 ( ) Ti ii i i ii i

i i

H Vu d g R B

u

2 1 2 1 11 1 12 2 2( ) ( ) 0,

i

TT Tj jc T T Ti i i

i i ii i i i i ii i j j j jj ij jj ji i i j jj N

V VV V VA Q d g B R B d g B R R R B i N

2 1 1( ) ( ) ,i

jc T Tii i i i i ii i ij j j j jj j

i jj N

VVA A d g B R B e d g B R B i N

* *( , , , ) 0ii i i i

i

VH u u

12( ( )) ( )

i

T T Ti i i ii i i ii i j ij j

j Nt

V t Q u R u u R u dt

Value function

Differential equivalent (Leibniz formula) is Bellman’s Equation

Stationarity Condition

1.  Coupled HJ equations

where

Graphical Game Solution Equations

Now use Synchronous PI to learn optimal Nash policies online in real‐time as players interact 

Page 71: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Online Solution of Graphical Games

Use Reinforcement Learning Convergence Results

POLICY ITERATION

Kyriakos Vamvoudakis

Page 72: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Data‐driven Online Solution of Differential GamesZero‐sum 2‐Player Games and H‐infinity Control

Page 73: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

2

0

2

0

2

0

2

0

2

)(

)(

)(

)(

dttd

dtuhh

dttd

dttz T

System( ) ( ) ( )( )

TT T

x f x g x u k x dy h x

z y u

( )u l x

d

u

z

x control

Performance output disturbance

Find control u(t) so that 

For all L2 disturbancesAnd a prescribed gain 

L2 Gain Problem

H‐Infinity Control Using Neural Networks

Zero‐Sum differential gameNature as the opposing player

Disturbance Rejection

Page 74: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

2* 2

0

( (0)) min max ( (0), , ) min max ( ) ( )T T

u ud dV x V x u d h x h x u Ru d dt

Define 2‐player zero‐sum game as

min max ( (0), , ) max min ( (0), , )u ud d

V x u d V x u d

The game has a unique value (saddle‐point solution) iff the Nash condition holds

min max ( , , , ) max min ( , , , )u ud d

H x V u d H x V u d

A necessary condition for this is the Isaacs Condition

0 , 0H Hu d

Stationarity Conditions

Page 75: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

( , ) ( ) ( ) ( )( )

x f x u f x g x u k x dy h x

System 

Cost 

Online Zero‐Sum Differential Games

22( ( ), , ) ( , , )T T

t t

V x t u d h h u Ru d dt r x u d dt

H-infinity Control

2 players

Leibniz gives Differential equivalent

K.G. Vamvoudakis and F.L. Lewis, “Online solution of nonlinear two-player zero-sum games using synchronous policy iteration,” Int. J. Robust and Nonlinear Control, vol. 22, pp. 1460-1483, 2012.

D. Vrabie and F.L. Lewis, “Adaptive dynamic programming for online solution of a zero-sum differential game,” J Control Theory App., vol. 9, no. 3, pp. 353–360, 2011

Optimal control/dist. policies found by stationarity conditions

Game saddle point solution found from Hamiltonian   ‐ ZS Game BELLMAN EQUATION

0 , 0H Hu d

112 ( )Tu R g x V 2

1 ( )2

Td k x V

22( , , , ) ( ( ) ( ) ( ) ) 0TT TVH x u d h h u Ru d V f x g x u k x dx

HJI equation * *

12

0 ( , , , )1 1( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )4 4

T T T T T T

H x V u d

h h V x f x V x g x R g x V x V x kk V x

Page 76: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

1. For a given control policy                solve for the value               ( )ju x 1( ( ))jV x t

111 12( ) ( )T

j ju x R g x V

Double Policy Iteration Algorithm to Solve HJI

Start with stabilizing initial policy  0 ( )u x

ZS game Bellman eq.

3. Improve policy:

• Convergence proved by Van der Schaft if can solve nonlinear Lyapunov equation exactly

• Abu Khalaf & Lewis used NN to approximate V for nonlinear systems and proved convergence

ONLINE SOLUTION‐ Use IRL to solve the ZS Bellman eq. and update the actors at each step

220 ( )( )T iT i T ij j j jh h V x f gu kd u Ru d

Add inner loop to solve for available storage

12

1 ( )2

i T ijd k x V

2.  Set              .   For i=0,1,… solve for  1( ( )),i ijV x t d 0 0d

On convergence set 1( ) ( )ij jV x V x

Page 77: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Actor‐Critic structure ‐ three time scales

x

u2 1 0;x Ax B u B w x

System

Controller/Player 1

w( 1)T i kuu B P x 1

1T i

ww B P x

Disturbance/Player 2

CriticLearning procedure

ˆ ˆ, if =1

ˆ ˆ ˆ ˆ, if >1

T T T

T T

V x C Cx u u i

V w w u u i

T T

1 2

1iuP ( 1) 1 ( 1)i k i i k

u u uP P Z

V

iuP

x

( ) 1 ( )i k i i ku u uP P Z

Page 78: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

2

1/ / 0 0 00 1/ 1/ 0 0

,1/ 0 1/ 1/ 1/

0 0 0 0

p p p

T T

G G G G

E

T K TT T

A BRT T T T

K

2 1x Ax B u B d

( ) [ ( ) ( ) ( ) ( )]Tg gx t f t P t X t E t

FrequencyGenerator outputGovernor positionIntegral control

-0.0665 8 0 0 0 -3.663 3.663 0

, [0 0 13.7355 0] . -6.86 0 -13.736 -13.736 0.6 0 0 0

TA B

Simulation- H-inf control for Electric Power Plant- LFC

A is unknownB1, B2 are known

1 8 0 0 0 TB

2 2 1 10 ( )T T T TA P PA C C P B B B B P

Load disturbance

Page 79: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Simulation result – Electric Power Plant LFC

• System – Power plant ‐ internally stable system; – system state                                                        

(incremental changes of: frequency deviation, generator output, governor position and integral control)

– Player 1 ‐ controller ; Player 2 – load disturbance

• Nash equilibrium solution 

• Online learned solution using ADP – after 5 updates of the parameters of Player 2

[ ( ) ( ) ( ) ( )]g gx f t P t X t E t

0.6036 0.7398 0.0609 0.58770.7398 1.5438 0.1702 0.59780.0609 0.1702 0.0502 0.03570.5877 0.5978 0.0357 2.3307

uP

5

0.6036 0.7399 0.0609 0.58770.7399 1.5440 0.1702 0.59790.0609 0.1702 0.0502 0.03570.5877 0.5979 0.0357 2.3307

uP

2 2 1 10 ( )T T T TA P PA C C P B B B B P Solves GARE online without knowing A

Page 80: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Parameters of the critic – ARE Solution elements

– Cost function learning using least squares

– Sampling integration time T=0.1 s

– The policy of Player 1 is updated every 2.5 s

– The policy of Player 2 is updated only when the policy of Player 1 has converged

– Number of updates of Player 1 before an update of Player 2

moments when Player 2 is updated

0 50 100 150 200 250 300

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5Parameters of the cost function of the game

Time (s)

P(1,1)P(1,2)P(2,2)P(3,4)P(4,4)

Page 81: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

144

New Developments in IRL for CT SystemsQ Learning for CT SystemsExperience ReplayOff-Policy IRL

Page 82: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Applications of Reinforcement Learning

Human‐Robot Interactive LearningIndustrial process control‐Mineral grinding in Gansu, ChinaResilient Control to Cyber‐Attacks in Networked Multi‐agent SystemsDecision & Control for Heterogeneous MAS (different dynamics)

Page 83: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Intelligent Operational Control for Complex Industrial Processes

Professor Chai Tianyou

State Key Laboratory of Synthetical Automation for Process Industries

Northeastern UniversityMay 20, 2013

Jinliang Ding

1. Jinliang Ding, H. Modares, Tianyou Chai, and F.L. Lewis, "Data-based Multi-objective Plant-widePerformance Optimization of Industrial Processes under Dynamic Environments,” IEEE Trans. IndustrialInformatics, vol.12, no. 2, pp. 454-465, April 2016.

2. Xinglong Lu, B. Kiumarsi, Tianyou Chai, and F.L. Lewis, “Data-driven Optimal Control of OperationalIndices for a Class of Industrial Processes,” IET Control Theory & Applications, vol. 10, no. 12, pp. 1348-1356, 2016.

Page 84: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Manufacturing as the Interactions of Multiple AgentsEach machine has it own dynamics and cost functionNeighboring machines influence each other most stronglyThere are local optimization requirements as well as global necessities

Page 85: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Production line for mineral processing plant

Mineral Processing Plant in Gansu China

Page 86: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Existing Manual Control for Plant production indices, unit operational indices, and unit process control for a production line

Overall

Page 87: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

ˆ ( )kQ t

( )kQ mT

,~ { } 1,

1, 2,3i jr i n

j

r r

ˆ( )tr

( )mTr

*min max, ,k k kQ Q Q

*,i jr

*min max, ,k k kQ Q Q

( )kQ mT

*min max, ,k k kQ Q Q

, ( )i jr mT

Automated online reinforcement learning for determining operational indices

Implemented by Jingliang Ding and Chai Tianyou’s group in biggest mineral processingfactory of hematite iron ore in China, Gansu Province.

Savings of 30.75 million RMB per year were realized by implementing this automatedoptimization procedure instead of the standard industry practice of human operatorselection of process operational indices.

2 RL loopsAnd Value Function Approximation

Page 88: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

RL for Human-Robot Interaction (HRI)1. H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-robot

Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, vol. 46, no. 3,pp. 655-667, 2016.

2. I. Ranatunga, F.L. Lewis, D.O. Popa, and S.M. Tousif, "Adaptive Admittance Control forHuman-Robot Interaction Using Model Reference Design and Adaptive Inverse Filtering" IEEETransactions on Control Systems Technology, vol. 25, no. 1, pp. 278-285, Jan. 2017.

3. B. AlQaudi, H. Modares, I. Ranatunga, S.M. Tousif, F.L. Lewis, and D.O. Popa, “Modelreference adaptive impedance control for physical human robot interaction,” Control Theory andTechnology, vol. 14, no. 1, pp. 1-15, Feb. 2016.

Page 89: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

PR2 meets Isura

Page 90: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Robot dynamics

Prescribed Error system

Control torque depends onImpedance model parameters

Impedance Control

Page 91: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

Human task learning has 2 components:1. Human learns a robot dynamics model to compensate for robot nonlinearities2. Human learns a task model to properly perform a task 

Inner Robot Specific Control LoopINDEPENDENT OF TASK

Outer Task Specific Control LoopINDEPENDENT OF ROBOT DETAILS

Human Performance Factors Studies

Page 92: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H

H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human‐robot Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, to appear 2015.

RL for Human‐Robot Interactions

RobotManipulator

TorqueDesign

PrescribedImpedance Model

xhf

mx

e

thKHuman

FeedforwardControl

‐dx

+

-

Performance

Robot‐specific inner control loop 

Task‐specific outer‐loop control designUsing IRL Optimal Tracking

Inspired by work of Sam S. Ge Implemented on PR2 robot

Page 93: 2018 new RL for opt ctrl and graph games - lewisgroup.uta.edu new... · Scalar equation VxQxuRu ()TT Differential equivalent is the Bellman equation Stationarity condition 022T H