UTA Research Institute (UTARI), The University of Texas at Arlington, USA
F.L. Lewis
Optimal Control Introduction
Supported by NSF, ARO, AFOSR
He who exerts his mind to the utmost knows nature’s pattern.
The way of learning is none other than finding the lost mind.
Mencius (Meng Tzu), c. 300 BC
Man's task is to understand patterns in nature and society.
Mencius
Optimal Control
Reinforcement learning
Control of Robotic Devices
Adaptive Learning of Human-Robot Interfaces
Dynamic Games for HRI
Importance of Feedback Control
Darwin - FB and natural selection
Volterra - FB and fish population balance
Adam Smith - FB and the international economy
James Watt - FB and the steam engine
FB and cell homeostasis
The resources available to most species for their survival are meager and limited
Nature uses Optimal control
[Diagram: inputs -> SYSTEM (processing) -> outputs. Alfred North Whitehead; Von Bertalanffy, Systems Theory, 1920s.]
Relevance- Machine Feedback Control
[Diagram (shown twice): tank gun-barrel pointing system. Desired trajectory qd; rigid turret modes qr1, qr2 (Az/El); barrel flexible modes qf; compliant coupling; moving tank platform; turret with backlash and compliant drive train; terrain and vehicle vibration disturbances d(t); barrel tip position.]
[Diagram: Single-Wheel/Terrain System with Nonlinearities. Vehicle mass m; parallel damper mc with active damping uc (if used), kc, cc; vibratory modes qf(t); forward speed y(t); vertical motion z(t); surface roughness w(t); suspension spring/damper k, c; series damper; wheel.]
High-Speed Precision Motion Control with unmodeled dynamics, vibration suppression, disturbance rejection, friction compensation, deadzone/backlash control
Vehicle Suspension
Industrial Machines
Military Land Systems
Aerospace
Fixed Controller Topologies

Indirect Scheme
[Diagram: the controller drives the plant with control u(t), producing output y(t); a system identifier produces the estimated output ŷ(t) and an identification error; the controller is tuned through the identifier to follow the desired output y_d(t).]

Direct Scheme
[Diagram: the controller is driven directly by the tracking error between the desired output y_d(t) and the plant output y(t), producing the control u(t).]

Feedback/Feedforward Scheme
[Diagram: controller #1 acts on the tracking error (feedback) and controller #2 acts on the desired output y_d(t) (feedforward); both drive the plant.]
Cell Homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Biological Systems
Optimality in Control Systems Design - R. Kalman, 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics:
$\dot r = w$
$\dot w = \dfrac{v^2}{r} - \dfrac{\mu}{r^2} + \dfrac{F}{m}\sin\phi$
$\dot v = -\dfrac{wv}{r} + \dfrac{F}{m}\cos\phi$
(r = radius, w = radial velocity, v = tangential velocity, F = thrust, m = mass, φ = thrust direction angle)

Objectives:
Get to orbit in minimum time
Use minimum fuel
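A minimal simulation sketch of the dynamics as reconstructed above, in MATLAB; the constants (normalized μ, constant thrust, fixed mass, fixed thrust angle φ) are illustrative assumptions, since the actual minimum-time/minimum-fuel solution would supply the optimal thrust-angle program φ(t).

```matlab
% Orbit injection dynamics (as reconstructed above), normalized units.
% State: [r; w; v] = [radius; radial velocity; tangential velocity].
mu  = 1;          % gravitational parameter (normalized, illustrative)
F   = 0.1;        % thrust magnitude (illustrative)
m   = 1;          % vehicle mass (held constant here for simplicity)
phi = pi/4;       % thrust direction angle (fixed; the optimal phi(t) is the design variable)

dyn = @(t, s) [ s(2);                                        % rdot = w
                s(3)^2/s(1) - mu/s(1)^2 + (F/m)*sin(phi);    % wdot
               -s(2)*s(3)/s(1)          + (F/m)*cos(phi) ];  % vdot

s0 = [1; 0; 1];                    % start on a circular orbit of radius 1
[t, s] = ode45(dyn, [0 10], s0);
plot(t, s); legend('r', 'w', 'v');
```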
Optimal Control is Effective for:
Aircraft Autopilots
Vehicle engine control
Aerospace Vehicles
Ship Control
Industrial Process Control
Robot Control

Optimal Control:
Minimum time
Minimum fuel
Minimum energy
Constrained control

Optimal control solutions are found by offline solution of matrix design equations.
A full dynamical model of the system is needed.
Optimality and Games
Optimal Control: Linear Quadratic Regulator (LQR)

System: $\dot x = Ax + Bu$
Performance index: $V(x(t)) = \int_t^{\infty} (x^T Q x + u^T R u)\, d\tau = x^T(t) P x(t)$

Leibniz's formula - the differential equivalent to the PI is the Bellman equation:
$\dfrac{d}{dt}(x^T P x) = \dot x^T P x + x^T P \dot x = (Ax+Bu)^T P x + x^T P (Ax+Bu)$

Hamiltonian function:
$0 = H\!\left(x, u, \dfrac{\partial V}{\partial x}\right) = \dfrac{\partial V}{\partial x}^{T}\dot x + x^T Q x + u^T R u = 2 x^T P (Ax+Bu) + x^T Q x + u^T R u$

Stationarity condition:
$\dfrac{d}{du} H\!\left(x, u, \dfrac{\partial V}{\partial x}\right) = \dfrac{d}{du}\left[(Ax+Bu)^T 2 P x + x^T Q x + u^T R u\right] = 2 B^T P x + 2 R u = 0$

Optimal control is SVFB: $u = -R^{-1} B^T P x = -Kx$

Algebraic Riccati equation: $0 = A^T P + P A + Q - P B R^{-1} B^T P$

Full system dynamics must be known. Off-line solution.
Optimal Control: Linear Quadratic Regulator

System model: $\dot x = Ax + Bu$
Performance function: $V(x(t)) = \int_t^{\infty} (x^T Q x + u^T R u)\, d\tau = x^T(t) P x(t)$
Optimal control is $u = -R^{-1} B^T P x = -Kx$
Algebraic Riccati equation: $0 = A^T P + P A + Q - P B R^{-1} B^T P$

MATLAB Control Systems Toolbox: [K,P]=lqr(A,B,Q,R)

Full system dynamics must be known. Off-line solution. Cannot change performance objectives.
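A minimal sketch of this off-line design on an illustrative plant; the A, B, Q, R values are assumptions for demonstration only.

```matlab
% Off-line LQR design: solve the ARE and form the state-variable feedback gain.
A = [0 1; 0 0];  B = [0; 1];        % illustrative double-integrator plant
Q = eye(2);      R = 1;             % state and control weights

[K, P] = lqr(A, B, Q, R);           % K = R^{-1} B' P, P solves the ARE

% Check the ARE residual: A'P + PA + Q - P B R^{-1} B' P = 0
residual = A'*P + P*A + Q - P*B*(R\(B'*P));
disp(norm(residual));               % ~ 0

% Closed-loop response under u = -Kx from an initial condition
initial(ss(A - B*K, zeros(2,1), eye(2), 0), [1; 0]);
```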
Optimal Control: Linear Quadratic Regulator

System: $\dot x = Ax + Bu$
Cost: $V(x(t)) = \int_t^{\infty} (x^T Q x + u^T R u)\, d\tau = x^T(t) P x(t)$

Leibniz's formula - the differential equivalent is the Bellman equation:
$0 = H\!\left(x, u, \dfrac{\partial V}{\partial x}\right) = \dfrac{\partial V}{\partial x}^{T}\dot x + x^T Q x + u^T R u = 2 x^T P (Ax+Bu) + x^T Q x + u^T R u$

Given any stabilizing FB policy $u = -Kx$, the cost value is found by solving the Lyapunov equation:
$0 = (A-BK)^T P + P(A-BK) + Q + K^T R K$

Stationarity: $\dfrac{d}{du} H\!\left(x, u, \dfrac{\partial V}{\partial x}\right) = 2 B^T P x + 2 R u = 0$

Optimal control is $u = -R^{-1} B^T P x = -Kx$
Algebraic Riccati equation: $0 = A^T P + P A + Q - P B R^{-1} B^T P$

Full system dynamics must be known. Off-line solution.
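A sketch of this policy-evaluation step in MATLAB: for an assumed stabilizing (non-optimal) gain K, the value of that policy comes from the Lyapunov equation, and a policy-improvement step follows. Plant and gain values are illustrative.

```matlab
% Policy evaluation for a given stabilizing feedback u = -Kx.
A = [0 1; 0 0];  B = [0; 1];  Q = eye(2);  R = 1;   % illustrative plant and weights
K = [1 2];                                          % some stabilizing, non-optimal gain

% Solve 0 = (A-BK)'P + P(A-BK) + Q + K'RK  for the cost kernel of this policy
P = lyap((A - B*K)', Q + K'*R*K);                   % V(x) = x' P x under u = -Kx

% Policy improvement: one step toward the ARE solution
Knew = R \ (B'*P);
```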
Many Successful Design Applications of Optimal Control
Optimal Control: Linear Quadratic Regulator

System: $\dot x = Ax + Bu$
Cost function: $V(x(t)) = \int_t^{\infty} (x^T Q x + u^T R u)\, d\tau = \int_t^{\infty} r(x,u)\, d\tau = x^T(t) P x(t)$
Optimal control is $u = -R^{-1} B^T P x = -Kx$
Algebraic Riccati equation: $0 = A^T P + P A + Q - P B R^{-1} B^T P$
MATLAB Control Systems Toolbox

Full system dynamics must be known. Off-line solution.

Chop off the tail of the cost function:
$V(x(t)) = \int_t^{t+T} r(x,u)\, d\tau + \int_{t+T}^{\infty} r(x,u)\, d\tau$
$V(x(t)) = \int_t^{t+T} r(x,u)\, d\tau + V(x(t+T))$   - Bellman Equation

Update control using Hamiltonian physics: $u = -R^{-1} B^T P x = -Kx$

Reinforcement Learning - Policy Iteration
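A sketch that checks the interval ("chopped-tail") Bellman equation numerically: with P from the ARE and u = -Kx, the cost accumulated over [t, t+T] plus the value at x(t+T) equals the value at x(t). The plant, horizon T, and initial state are illustrative assumptions.

```matlab
% Verify V(x(t)) = int_t^{t+T} (x'Qx + u'Ru) dtau + V(x(t+T)) along a trajectory.
A = [0 1; 0 0];  B = [0; 1];  Q = eye(2);  R = 1;  T = 0.5;   % illustrative
[K, P] = lqr(A, B, Q, R);
Acl = A - B*K;

x0 = [1; -1];
% Augment the state with the running cost so ode45 integrates both together.
f = @(t, z) [Acl*z(1:2); z(1:2)'*(Q + K'*R*K)*z(1:2)];
[~, z] = ode45(f, [0 T], [x0; 0]);

xT  = z(end, 1:2)';           % x(t+T)
rho = z(end, 3);              % integral reinforcement over [t, t+T]
lhs = x0'*P*x0;               % V(x(t))
rhs = rho + xT'*P*xT;         % reinforcement plus V(x(t+T))
disp(lhs - rhs);              % ~ 0
```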
Online Learning Feedback ControlSynthesis of Robot Control and Biologically Inspired Learning
Neural Network Properties
Learning
Recall
Function approximation
Generalization
Classification
Association
Pattern recognition
Clustering
Robustness to single node failure
Repair and reconfiguration
Nervous system cell. http://www.sirinet.net/~jgjohnso/index.html
Dynamic Neural Networks and Control Learning
[Diagram: neural network robot controller. Outer tracking loop: desired trajectory q_d, tracking error e, filtered error r, gain term K_v r plus robust control term v(t) driving the robot system. Nonlinear inner learning loop: NN estimate f̂(x) with tunable weights, acting as a feedforward loop.]
Theorem 1 (NN Weight Tuning for Stability)
Let the desired trajectory $q_d(t)$ and its derivatives be bounded. Let the initial tracking error be within a certain allowable set $U$. Let $Z_M$ be a known upper bound on the Frobenius norm of the unknown ideal weights $Z$. Take the control input as
$\tau = \hat W^T \sigma(\hat V^T x) + K_v r - v$, with robustifying term $v(t) = -K_Z(\|\hat Z\|_F + Z_M)\, r$.
Let weight tuning be provided by
$\dot{\hat W} = F \hat\sigma r^T - F \hat\sigma' \hat V^T x\, r^T - \kappa F \|r\| \hat W$
$\dot{\hat V} = G x (\hat\sigma'^T \hat W r)^T - \kappa G \|r\| \hat V$
with any constant matrices $F = F^T > 0$, $G = G^T > 0$, and scalar tuning parameter $\kappa > 0$. Initialize the weight estimates as $\hat W = 0$, $\hat V =$ random.
Then the filtered tracking error $r(t)$ and NN weight estimates $\hat W, \hat V$ are uniformly ultimately bounded. Moreover, arbitrarily small tracking error may be achieved by selecting large control gains $K_v$.
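A minimal sketch of one Euler step of these tuning laws, with illustrative dimensions, gains, and a logistic-sigmoid activation; x here denotes the NN input vector and r the filtered tracking error, both stand-ins for signals the controller would supply.

```matlab
% One Euler step of the NN weight tuning laws (illustrative sizes and gains).
n = 4;  L = 10;  m = 2;                  % NN input, hidden, and output dimensions (assumed)
F = 10*eye(L);  G = 10*eye(n);  kappa = 0.1;  dt = 1e-3;

What = zeros(L, m);                      % initialize W_hat = 0
Vhat = 0.1*randn(n, L);                  % initialize V_hat randomly

xnn = randn(n, 1);                       % NN input (stand-in for the tracking signals)
r   = randn(m, 1);                       % filtered tracking error (stand-in)

sig   = @(z) 1./(1 + exp(-z));           % logistic sigmoid
s     = sig(Vhat'*xnn);                  % sigma_hat
sprim = diag(s.*(1 - s));                % sigma_hat' (sigmoid Jacobian)

Whatdot = F*s*r' - F*sprim*(Vhat'*xnn)*r' - kappa*F*norm(r)*What;
Vhatdot = G*xnn*(sprim'*What*r)' - kappa*G*norm(r)*Vhat;

What = What + dt*Whatdot;                % Euler update of the weight estimates
Vhat = Vhat + dt*Vhatdot;
```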
NEW NN Tuning Laws for Control Learning
Force Control
Flexible pointing systems - Simis Labs, Inc. and US Army
Vehicle active suspension - Leo Davis Technol.
SBIR Contracts
Learning NN Controller
Applications at Boeing Defense Space & Security
Kevin Wise and Eugene Lavretsky
Highly reliable adaptive uncertainty approximation compensators for flight control applications:
unmanned aircraft – Phantom Ray
Tailkit adaptive control systems for Joint Direct Attack Munition (JDAM) munitions:
Mk-82, Mk-84, and Laser-guided variants
Fielded for defense in Iraq and Afghanistan.
BUT - Nature is more than stable. Nature is OPTIMAL and conserves energy.
We want to learn optimal control solutions:
Online, in real time
Using adaptive control techniques
Without knowing the full dynamics
For Adaptive Man/Machine systems-
Reinforcement Learning
Every living organism improves its control actions based on rewards received from the environment
The resources available to living organisms are usually meager. Nature uses optimal control.
1. Apply a control. Evaluate the benefit of that control.
2. Improve the control policy.
RL finds optimal policies by evaluating the effects of suboptimal policies
Optimality in Biological Systems
Different methods of learning
[Diagram: actor-critic learning. The Actor applies control inputs to the system/environment and observes outputs; the Critic compares outputs with the desired performance and produces a reinforcement signal used to tune the Actor (adaptive learning system).]
Reinforcement learning - Ivan Pavlov, 1890s
Actor-Critic Learning
We want OPTIMAL performance- ADP- Approximate Dynamic Programming
Sutton & Barto book
Adaptive Critic structure: a slow reinforcement-learning improvement loop wrapped around fast inner control loops.
Optimal Control: Linear Quadratic Regulator

System: $\dot x = Ax + Bu$
Cost function: $V(x(t)) = \int_t^{\infty} (x^T Q x + u^T R u)\, d\tau = \int_t^{\infty} r(x,u)\, d\tau = x^T(t) P x(t)$
Optimal control is $u = -R^{-1} B^T P x = -Kx$
Algebraic Riccati equation: $0 = A^T P + P A + Q - P B R^{-1} B^T P$
MATLAB Control Systems Toolbox

Full system dynamics must be known. Off-line solution.

Chop off the tail of the cost function:
$V(x(t)) = \int_t^{t+T} r(x,u)\, d\tau + \int_{t+T}^{\infty} r(x,u)\, d\tau$
$V(x(t)) = \int_t^{t+T} r(x,u)\, d\tau + V(x(t+T))$   - Bellman Equation

Update control using Hamiltonian physics: $u = -R^{-1} B^T P x = -Kx$

Reinforcement Learning - Policy Iteration
CT Policy Iteration - how to implement online? Linear systems, quadratic cost (LQR)

Policy evaluation - solve the IRL Bellman equation:
$x^T(t)\, P_k\, x(t) = \int_t^{t+T} x^T (Q + K_k^T R K_k)\, x \; d\tau + x^T(t+T)\, P_k\, x(t+T)$

Value function is quadratic: $V(x(t)) = x^T(t) P x(t)$

$x^T(t)\, P_k\, x(t) - x^T(t+T)\, P_k\, x(t+T) = \int_t^{t+T} x^T (Q + K_k^T R K_k)\, x \; d\tau$

In regression form:
$\bar p_k^T \left[\bar x(t) - \bar x(t+T)\right] = \int_t^{t+T} x^T (Q + K_k^T R K_k)\, x \; d\tau$

Reinforcement on the time interval [t, t+T]; $\bar x(t)$ is the regression vector (quadratic basis in x).

Same form as standard System ID problems.
Solve using RLS or batch LS.
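A sketch of the regression step in MATLAB: the quadratic value $x^T P_k x$ is rewritten as $\bar p_k^T \bar x$ with $\bar x$ the vector of quadratic monomials, and batch least squares recovers $\bar p_k$. Here the reinforcement "measurements" are synthesized from a known kernel purely to show that the regression recovers it; in the on-line algorithm they would come from the plant.

```matlab
% Batch least-squares identification of a quadratic value kernel from
% interval reinforcements (synthetic data, illustrative only).
quadbasis = @(x) [x(1)^2; x(1)*x(2); x(2)^2];    % xbar, so that x'Px = pbar'*xbar
Ptrue = [2 0.5; 0.5 1];                          % assumed value kernel to be recovered

N = 20;  Phi = zeros(N, 3);  y = zeros(N, 1);
for i = 1:N
    x0 = randn(2, 1);  xT = 0.5*randn(2, 1);     % stand-ins for x(t) and x(t+T)
    Phi(i, :) = (quadbasis(x0) - quadbasis(xT))';% regression vector
    y(i)      = x0'*Ptrue*x0 - xT'*Ptrue*xT;     % plays the role of the reinforcement
end
pbar = Phi \ y;                                  % batch least squares
Pk   = [pbar(1) pbar(2)/2; pbar(2)/2 pbar(3)];   % reassembled symmetric kernel (= Ptrue)
```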
Union of Reinforcement Learning and Adaptive Control
Value Function Approximation
1. Select initial control policy
2. Find associated cost - the Bellman equation solves the Lyapunov equation without knowing the dynamics:
$\bar p_k^T \left[\bar x(t) - \bar x(t+T)\right] = \int_t^{t+T} x^T (Q + K_k^T R K_k)\, x \; d\tau \equiv \rho(t, t+T)$
3. Improve control: $K_{k+1} = R^{-1} B^T P_k$
Over each reinforcement interval [t, t+T]: observe x(t); apply $u_k(t) = K_k x(t)$; observe the cost integral (reinforcement); observe x(t+T); update P; do RLS until convergence to $P_k$; then update the control gain to $K_{k+1}$.
A is not needed anywhere
This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.
Integral Reinforcement Learning (IRL)
Data set on the interval [t, t+T): $\{\, x(t),\ \rho(t, t+T),\ x(t+T)\, \}$
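A compact sketch of the whole on-line procedure for an illustrative second-order plant: apply $u_k = -K_k x$, collect (x(t), ρ, x(t+T)) over successive intervals, solve the IRL Bellman equation by batch least squares, update the gain, and repeat. The A matrix appears only to simulate the unknown plant, not in the learning updates; persistence of excitation is handled crudely by re-seeding the state when it decays.

```matlab
% Integral reinforcement learning (policy iteration) for the LQR problem.
A = [0 1; -1 -2];  B = [0; 1];          % illustrative plant; A used only for simulation
Q = eye(2);  R = 1;  T = 0.1;           % utility weights and reinforcement interval
K = [0 0];                              % initial admissible policy (A is already stable here)
quadbasis = @(x) [x(1)^2; x(1)*x(2); x(2)^2];

x = [1; -1];                            % current plant state
for k = 1:10                            % policy-iteration steps
    Phi = zeros(8, 3);  y = zeros(8, 1);
    for i = 1:8                         % collect 8 reinforcement samples per step
        f = @(t, z) [A*z(1:2) - B*(K*z(1:2)); z(1:2)'*(Q + K'*R*K)*z(1:2)];
        [~, z] = ode45(f, [0 T], [x; 0]);          % plant plus running cost
        xT = z(end, 1:2)';  rho = z(end, 3);
        Phi(i, :) = (quadbasis(x) - quadbasis(xT))';
        y(i) = rho;
        x = xT;
        if norm(x) < 1e-3, x = randn(2, 1); end    % re-excite if the state has decayed
    end
    pbar = Phi \ y;                                % policy evaluation by batch LS
    P = [pbar(1) pbar(2)/2; pbar(2)/2 pbar(3)];
    K = R \ (B'*P);                                % policy improvement
end
[Kopt, Popt] = lqr(A, B, Q, R);         % for comparison: K converges toward Kopt
```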
Motor control 200 Hz
Oscillation is a fundamental property of neural tissue
Brain has multiple adaptive clocks with different timescales
theta rhythm, hippocampus and thalamus, 4-10 Hz: sensory processing, memory, and voluntary control of movement
gamma rhythms, 30-100 Hz, hippocampus and neocortex: high cognitive activity
• consolidation of memory
• spatial mapping of the environment - place cells
The high-frequency processing is due to the large amounts of sensory data to be processed.
D. Vrabie, F. Lewis, and Dan Levine- RL for Continuous-Time Systems
Doya, Kimura, Kawato 2001
Summary of Motor Control in the Human Nervous System
[Diagram (picture by E. Stingu and D. Vrabie): hierarchy of multiple parallel loops - spinal cord (reflex), brainstem, cerebellum (supervised learning; eye movement, inferior olive), thalamus and basal ganglia (reinforcement learning - dopamine), cerebral cortex motor areas, limbic system and hippocampus (unsupervised learning; memory functions, long term and short term); interoceptive and exteroceptive receptors; muscle contraction and movement. Timescales: motor control 200 Hz, theta rhythms 4-10 Hz, gamma rhythms 30-100 Hz.]
Adaptive Critic structure: reinforcement learning at theta-wave timescales (4-8 Hz) around motor control loops at 200 Hz.
F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
"Reinforcement learning and feedback control," IEEE Control Systems Magazine, Dec. 2012.
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012, to appear.
Books
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012.
New Chapters on:Reinforcement LearningDifferential Games
Direct Optimal Adaptive Controller
A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval
Draguna Vrabie
[Diagram: hybrid actor-critic controller - a dynamic control system with memory. System $\dot x = Ax + Bu$; utility $x^T Q x + u^T R u$; the Critic holds the value V, sampled with a zero-order hold of period T; the Actor is the feedback gain mapping x to u.]
Run RLS or use batch L.S. to identify the value of the current control.
Update the FB gain after the Critic has converged.
Reinforcement interval T can be selected on line on the fly – can change
Solves Riccati Equation Online without knowing A matrix
Optimal Control Design Allows a Lot of Design Freedom
$V(x(t)) = \int_t^{\infty} r(x, u)\, d\tau$
Tailor the controls design by choosing the utility r(x, u):
Minimum Energy
Minimum Fuel
Minimum Time
Control Actuator Constraints
Optimal Control for Constrained Input Systems
Encode the constraint into the value function using a nonquadratic integrand:
$\|u\|_q^2 = 2\int_0^u q^T(v)\, dv$
This is a quasi-norm: weaker than a norm - the homogeneity property is replaced by the weaker symmetry property $\|x\|_q = \|-x\|_q$.
(Used by Lyshevsky for H2 control)

Control constrained by the saturation function tanh(p):
$J(x, u) = \int_0^{\infty} \left[ Q(x) + 2\int_0^u \tanh^{-1}(v)^T R\, dv \right] dt$

Then $u = -R^{-1} g^T(x)\, \dfrac{\partial V}{\partial x}$ is BOUNDED.
Murad Abu-Khalaf
[Plot: state evolution (x1 vs x2) for both controllers - switching surface method vs. nonquadratic functionals method.]
Near Minimum-Time Control - encode into the value function:
$V = \int_0^{\infty} \left[ \tanh(x^T Q x) + 2\int_0^u \tanh^{-1}(v)^T R\, dv \right] dt$
Murad Abu-Khalaf
Force Control - Intelligent Automation, Inc.
Flexible pointing systems - Simis Labs, Inc. and US Army
Vehicle active suspension - Leo Davis Technol.
SBIR Contracts - SBA Tibbets Award 1996 (led by TMAC)
Learning NN Controller
A. Yesildirek and F.L. Lewis, "Method for feedback linearization of neural networks and neural network incorporating same," U.S. Patent 5,943,660, awarded 24 August 1999.
S. Jagannathan and F.L. Lewis, "Discrete-time tuning of neural network controllers for nonlinear dynamical systems," U.S. Patent 6,064,997, awarded 16 May 2000.
R. Selmic, F.L. Lewis, A.J. Calise, and M.B. McFarland, "Backlash Compensation Using Neural Network," U.S. Patent 6,611,823, awarded 26 Aug. 2003.
J. Campos and F.L. Lewis, "Method for Backlash Compensation Using Discrete-Time Neural Networks," U.S. Patent 7,080,055, awarded July 2006.
Patents
K. Vamvoudakis, D. Vrabie, and F.L. Lewis, “Control methodology for online adaptation to optimal feedback controller using integral reinforcement learning,” provisional patent, filed March 2012.
Applications of our Algorithms at Boeing Defense Space & Security
Kevin Wise and Eugene Lavretsky
Highly reliable adaptive uncertainty approximation compensators for flight control applications:
Unmanned aircraft – Phantom Ray
Tailkit adaptive control systems for Joint Direct Attack Munition (JDAM) munitions:
Mk-82, Mk-84, and Laser-guided variants
Fielded for defense in Iraq and Afghanistan. Saved lives.
Optimal engine controllers based on RL for Caterpillar, Ford (Zetec engine), and GM
Applications of Our Algorithms to Auto Engine Control
Student S. Jagannathan - 11 US patents
8-10% improvement in fuel efficiency and a drastic reduction in NOx (90%), HC (30%) and CO (15%) by operating with adaptive exhaust gas recirculation.
Savings to Caterpillar were over $65,000 per component.
A Q-Learning Based Adaptive Optimal Controller Implementation for a Humanoid Robot Arm
Said Ghani Khan, Guido Herrmann, Tony Pipe, Chris Melhuish
Bristol Robotics Laboratory, University of Bristol and University of the West of England, Bristol, UK
Conference on Decision and Control (CDC) 2011, Orlando, 11 December 2011
The cost function is modified to include constraints
Introducing Constraints
The new cost function....
qL is the joint limit.
BRL BERT II ARM Bristol Robotics Lab
Natural Human-Like Motion
An Approximate Dynamic Programming Based Controller for an Underactuated 6DoF Quadrotor
Emanuel Stingu, Frank Lewis
Automation & Robotics Research Institute (ARRI), University of Texas at Arlington
Supported by ARO grant W91NF-05-1-0314 and NSF grant ECCS-0801330

3 control loops
The quadrotor has 17 states and only 4 control inputs, so it is very under-actuated. Three control loops with dynamic inversion are used to generate the 4 control signals.
[Diagram: quadrotor axes x, y, z; amount of tilt, direction of tilt, yaw.]
Momentum theory applied to the propeller
[Diagram: three control loops. Reference trajectory -> translation and yaw controller (position and velocity errors; altitude control: thrust force; dynamic inversion from accelerations to attitude) -> desired attitude -> attitude controller (dynamic inversion from rotation accelerations to forces) -> desired forces -> motor and propeller controller (dynamic inversion from thrust forces to motor voltages) -> motor commands.]
[Plots: flight results using the standard controller vs. using the reinforcement learning controller.]
Potential New Applications in Prosthetics
With Sesh Commuri, Oklahoma University, and Muthu Wijesundara
Adaptive Reinforcement Learning of Human/Prosthetic Ankle Interactions
Current methods use stored gait information
What about Adaptive Human-Robot Interfaces (HRI)?
Adaptive Learning
HRI
Different users, different robotic devices
Interface learns so as to improve Human/Robot team performance
Tune parameters such as stiffness, speed of response, dynamic motion
There are 3 components:
Human
Robot
The Interface
They should all be intelligent
We want verifiable performance and straightforward tuning algorithms
Several Approaches to Adaptive HRI:
1. Learning of Coordinate Transforms
2. Dynamic Motion Primitives to Modify Task Parameters
3. Adaptive Impedance Control
Current and Expected Funding
F.L. Lewis, “Adaptive Dynamic Programming for Real-Time Cooperative Multi-Player Games and Graphical Games,” NSF Grant, $272,000 for 3 years, July 2011.
D. Popa, Z. Celik-Butler, D. Butler, and F.L. Lewis, “NRI: Multi-Modal Skin and Garments for Healthcare and Home Robots,” NSF Grant, $1.3M for 4 years, Sept. 2012.
F.L. Lewis, "Games and Learning for Cooperative Nonlinear Systems and Internal Structure of Coalitions on Graphs," TARDEC/NAC grant, $81,000 for 1 year, Jan. 2013 expected. IF NO FISCAL CLIFF!
1. Learning of Coordinate Transforms: user coordinates -> robot coordinates

Robot inverse kinematics: $y_d(t) \rightarrow q_d(t)$
Robot inverse Jacobian: $\dot q_d = J^{-1} \dot y_d$, feeding the standard robot controller

Hard to find accurately. Needs a dynamics model of the robot. Does not generalize to different robots.
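A small sketch of the fixed (non-learned) mapping being criticized above: for an assumed planar 2-link arm, a user-space velocity command $\dot y_d$ is mapped to joint rates through the inverse Jacobian. Link lengths and the configuration are illustrative.

```matlab
% Classical inverse-Jacobian mapping from task-space rates to joint-space rates
% for a planar 2-link arm (illustrative link lengths a1, a2).
a1 = 1;  a2 = 0.8;
q  = [pi/4; pi/6];                       % current joint angles
yd_dot = [0.1; -0.05];                   % desired end-effector velocity (user command)

J = [-a1*sin(q(1)) - a2*sin(q(1)+q(2)),  -a2*sin(q(1)+q(2));
      a1*cos(q(1)) + a2*cos(q(1)+q(2)),   a2*cos(q(1)+q(2))];

qd_dot = J \ yd_dot;                     % q_dot_d = J^{-1} * y_dot_d
```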
Learning the Interface Mapping - what is it exactly?
[Diagram: u -> f -> y]
What can we do to get f(u)?
The simplest way is to obtain a set of inputs and a set of outputs and calculate the relationship (Curve Fitting)
[Diagram: u -> Curve Fitting Algorithm -> f, producing y = f(u).]
$y_i = f_i(u) = w_{i0} + \sum_{j=1}^{P} w_{ij}\, u_j, \quad i = 1, \dots, M$
Static nonlinear map
Dan Popa
Reinforcement Learning
NN Training
The log-sigmoid function $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ is used as the neuron activation function.
Train using MATLAB NN Toolbox very easily if input/output pairs are known.
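A minimal sketch of that fit with the MATLAB Neural Network (Deep Learning) Toolbox, assuming input/output pairs are available; the one-dimensional map, network size, and data are illustrative.

```matlab
% Fit a static nonlinear map y = f(u) from known input/output pairs.
u = linspace(-2, 2, 200);                % inputs (illustrative)
y = tanh(2*u) + 0.1*u.^2;                % outputs of an unknown map (stand-in data)

net = feedforwardnet(10);                % one hidden layer with 10 neurons
net.layers{1}.transferFcn = 'logsig';    % log-sigmoid activation, as on the slide
net = train(net, u, y);                  % supervised training on the input/output pairs

yhat = net(u);                           % evaluate the learned map
plot(u, y, u, yhat, '--');
```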
Dan Popa
Static Mapping Approach
f u
f(u) is a coordinate transformation
Found by training neural network with known input/output pairs Dan Popa
Reinforcement Learning
[Diagram: u -> Curve Fitting Algorithm -> f(u), y]
What if we cannot specify the desired output trajectory directly?
Reinforcement Learning: "Learning by interacting with an environment."
With RL we do not have to specify a desired trajectory.
Instead, a Reward Function is used
Dan Popa
How to find the Reward Function?
Model of Task
A task is a profile in velocity/force space. Different curve for different tasks.
2. Learn and Modify Task Parameters
$\ddot{x} = -D\dot{x} + K(g - x) - K(g - x_0)\, s + K\, W^T \sigma(V^T s)\, s$
Dynamic Motion Primitives for Characterizing the Task
Learning the task:
Kinesthetic demonstration to learn D, K, initial weights
Reward-based learning to tune the weights
Phase variable s makes DMP independent of time
Tunable NN weights
Different NN weights give different task trajectories
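A minimal rollout sketch of a single-DOF motion primitive of the form reconstructed above; the gains, phase constant, and the choice of sigmoid basis functions centered along the phase are illustrative assumptions. Changing W reshapes the trajectory without changing the goal.

```matlab
% Roll out one DOF of a dynamic motion primitive:
%   xddot = -D*xdot + K*(g - x) - K*(g - x0)*s + K*(W'*sigma(V,s))*s
% with a phase variable s decaying from 1 to 0 (time-independent parameterization).
K = 50;  D = 2*sqrt(K);  alpha_s = 4;        % illustrative gains and phase constant
g = 1;  x0 = 0;                              % goal and start
L = 10;
V = linspace(0.1, 1, L);                     % basis centers along the phase (assumed)
W = 0.5*randn(L, 1);                         % tunable weights: different W, different trajectory
sig = @(z) 1./(1 + exp(-10*z));              % sigmoid basis functions (assumed)

dt = 1e-3;  Tend = 2;  N = round(Tend/dt);
x = x0;  xdot = 0;  s = 1;  traj = zeros(N, 1);
for i = 1:N
    f = W' * sig(s - V)';                    % weighted basis forcing term
    xddot = -D*xdot + K*(g - x) - K*(g - x0)*s + K*f*s;
    xdot  = xdot + dt*xddot;
    x     = x + dt*xdot;
    s     = s - dt*alpha_s*s;                % phase dynamics: sdot = -alpha_s*s
    traj(i) = x;
end
plot((1:N)*dt, traj);
```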
• Build an adaptive interface system that allows a single operator to manipulate a robot with multiple degrees of freedom and/or multiple robots
• The interface system should be intuitive, easy to use, quick to learn, and applicable to different robot configurations
[Diagram: operator -> interface system -> input -> robot system -> output; performance evaluation of the output feeds back to tune the interface.]
Reinforcement Learning for Human-Robot Interfaces - Dan Popa
What to learn? How to evaluate performance?
Adaptive Impedance Control
3. Adaptive Impedance Control
[Diagram: human intent -> impedance model -> inner-loop FB controllers -> robot performance; a performance evaluation block drives tuning of the impedance model.]
Tune interface so robot learns to control the human model to improve task performance
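A sketch of a tunable impedance (admittance) model of the kind shown in the diagram: the interface renders a human force command F_h into a reference motion through M ẍ + D ẋ + K x = F_h, and the parameters (M, D, K) are exactly what an adaptive/learning layer would tune. All values are illustrative.

```matlab
% Tunable impedance (admittance) model: M*xddot + D*xdot + K*x = F_h.
M = 1;  D = 8;  K = 20;                 % impedance parameters (candidates for tuning)
Fh = @(t) 5*(t < 1);                    % human force command: a 1-second push (illustrative)

imp = @(t, z) [z(2); (Fh(t) - D*z(2) - K*z(1))/M];
[t, z] = ode45(imp, [0 3], [0; 0]);
plot(t, z(:, 1));                       % reference motion handed to the inner-loop controllers
```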
Model of Human Behavior
For a given task, a skilled human operator adapts to learn-
1. An inverse model of the robot system to cancel nonlinearities
2. A feedforward predictive control based on the task
Frequency Crossover Theory -
Human operator adapts to make the overall closed-loop man-machine system
look like a high bandwidth frequency response
And remain invariant to a wide range of task variations and changes in operating condition
Suzuki & Furuta 2012
For a given task, after learning, the human model has three parts:
1. Time delay due to vision processing and reaction time
2. A first-order filter due to neuromuscular dynamics
3. A PD controller in the cerebellum
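A sketch of this three-part operator model as a transfer function: a reaction/vision time delay, a first-order neuromuscular lag, and a PD (lead) term; the numerical values are illustrative, not fitted data.

```matlab
% Human operator model: PD (lead) term * neuromuscular lag * reaction delay.
Kp  = 2;      % operator gain (illustrative)
TL  = 0.4;    % lead (PD) time constant, the cerebellum's predictive action
TN  = 0.1;    % neuromuscular lag time constant
tau = 0.2;    % visual processing / reaction time delay, seconds

Gh = tf(Kp*[TL 1], [TN 1]);   % Kp*(TL*s + 1)/(TN*s + 1)
Gh.InputDelay = tau;          % append e^{-tau*s}
bode(Gh);                     % inspect the operator's frequency response
```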
Adaptive Impedance Control as a 2-Player Game
[Diagram: human intent -> interface parameters -> robot performance; a critic agent performs the performance evaluation and drives the tuning.]
A 2-Player Dynamic Game
Player 1- the human
Player 2- the Interface!
Sun Tzu, Bing Fa (孙子兵法, The Art of War)
Graphical Coalitional Games
500 BC
Optimal Control is Effective for:
Aircraft Autopilots
Vehicle engine control
Aerospace Vehicles
Ship Control
Industrial Process Control
Robot Control

Multi-player Games Occur in:
Economics
Control Theory - disturbance rejection
Team games
International politics
Sports strategy
Optimality and Games
Actor‐Critic Game structure ‐ three time scales
System: $\dot x = Ax + B_1 u + B_2 w, \quad x(0) = x_0$
Controller/Player 1: $u = -B_1^T P_u^{(i,k)} x$
Interface/Player 2: $w = B_2^T P_w^{(i)} x$
Critic learning procedure: the value estimate $\hat V$ is built from $x^T C^T C x + u^T u$ on the first iteration ($i = 1$) and includes the $w^T w$ and $u^T u$ terms thereafter ($i > 1$); the critic parameters are updated recursively from the measured data $Z_u^{(i,k)}$ between gain updates, $P_u^{(i,k+1)} = P_u^{(i,k)} + \,\cdot\, Z_u^{(i,k)}$.
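As an off-line counterpart to the on-line actor-critic procedure above, the sketch below iterates toward the zero-sum game (H-infinity type) Riccati solution using only Lyapunov solves: an inner loop updates the disturbance (interface) policy and an outer loop updates the control policy. This is a sketch of one policy-iteration scheme from the zero-sum game literature, not the on-line algorithm of the slide; the system data and γ are illustrative, and it assumes a stabilizing initial control gain with γ above the optimal attenuation level.

```matlab
% Off-line policy iteration for the two-player zero-sum game
%   xdot = A x + B1 u + B2 w,  cost = int( x'Qx + u'Ru - gamma^2 w'w ) dt
A  = [0 1; -1 -2];  B1 = [0; 1];  B2 = [0.5; 0];     % illustrative
Q  = eye(2);  R = 1;  gamma = 2;

K = [0 0];                                   % initial stabilizing control gain (A stable here)
for j = 1:15                                 % outer loop: control player
    L = zeros(1, 2);                         % inner loop: disturbance player, start at w = 0
    for i = 1:15
        Ac = A - B1*K + B2*L;
        P  = lyap(Ac', Q + K'*R*K - gamma^2*(L'*L));
        L  = (1/gamma^2)*(B2'*P);            % disturbance policy update: w = L x
    end
    K = R \ (B1'*P);                         % control policy update: u = -K x
end
% When the iterations converge, u = -K x and w = (1/gamma^2) B2' P x
% approximate the saddle-point policies of the game.
```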
The cloud-capped towers, the gorgeous palaces,The solemn temples, the great globe itself,Yea, all which it inherit, shall dissolve,And, like this insubstantial pageant faded,Leave not a rack behind.
We are such stuff as dreams are made on, and our little life is rounded with a sleep.
Our revels now are ended. These our actors, As I foretold you, were all spirits, and Are melted into air, into thin air.
Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare
The way that can be told is not the Constant WayThe name that can be named is not the Constant Name
For nameless is the true wayBeyond the myriad experiences of the world
To experience without intention is to sense the world
All experience is an archwherethrough gleams that untravelled landwhose margins fade forever as we move
Dao ke dao, feichang dao; ming ke ming, feichang ming