    Neural Networks

    journal homepage: www.elsevier.com/locate/neunet

    2009 Special Issue

Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems

Draguna Vrabie*, Frank Lewis
Automation and Robotics Research Institute, University of Texas at Arlington, 7300 Jack Newell Blvd. S., Fort Worth, TX 76118, USA

* Corresponding author. Tel.: +1 817 272 5938; fax: +1 817 272 5938. E-mail addresses: [email protected] (D. Vrabie), [email protected] (F. Lewis).

Article info

Article history:
Received 11 January 2009
Received in revised form 16 March 2009
Accepted 19 March 2009

Keywords:
Direct adaptive optimal control
Policy iteration
Neural networks
Online control

Abstract

In this paper we present, in a continuous-time framework, an online approach to direct adaptive optimal control with infinite horizon cost for nonlinear systems. The algorithm converges online to the optimal control solution without knowledge of the internal system dynamics. Closed-loop dynamic stability is guaranteed throughout. The algorithm is based on a reinforcement learning scheme, namely Policy Iterations, and makes use of neural networks, in an Actor/Critic structure, to parametrically represent the control policy and the performance of the control system. The two neural networks are trained to express the optimal controller and the optimal cost function which describes the infinite horizon control performance. Convergence of the algorithm is proven under the realistic assumption that the two neural networks do not provide perfect representations for the nonlinear control and cost functions. The result is a hybrid control structure which involves a continuous-time controller and a supervisory adaptation structure which operates based on data sampled from the plant and from the continuous-time performance dynamics. Such a control structure is unlike any standard form of controller previously seen in the literature. Simulation results, obtained considering two second-order nonlinear systems, are provided.

    © 2009 Elsevier Ltd. All rights reserved.

    1. Introduction

In an environment in which a number of players compete for a limited resource, optimal behavior with respect to desired long-term goals leads to long-term advantages. In a control engineering framework the role of the environment is played by a system to be controlled (this ranges from industrial processes such as distillation columns and power systems, to airplanes, medical equipment and mobile robots), while the controller, equipped with sensors and actuators, plays the role of the agent which is able to regulate the state of the environment such that the desired performance is obtained. An intelligent controller is able to adapt its actions in the face of unforeseen changes in the system dynamics. In the case in which the controller has a fixed parametric structure, the adaptation of controller behavior is equivalent to changing the values of the controller parameters. From a control engineering perspective, not every automatic control loop needs to be designed to exhibit intelligent behavior. In fact, in industrial process control there exists a hierarchy of control loops which has at the lowest level the simplest and most robust regulation, which provides fast reaction in the face of parametric and non-parametric disturbances without controller adaptation, while at the topmost end are placed the so-called money-making loops, whose operation close to optimality has the greatest impact on maximization of income. In the latter case the control performance is not explicitly defined in terms of desired trajectories for the states and/or outputs of the system; instead it is implicitly expressed through a functional that captures the nature of the desired performance in a more general sense. Such an optimality criterion characterizes the system's performance in terms of the control inputs and system states; it is in fact an implicit representation of a desired balance between the amount of effort invested in the control process and the resulting outputs.

Optimal control refers to a class of methods that can be used to synthesize a control policy which results in the best possible behavior with respect to the prescribed criterion (i.e. the control policy which leads to maximization of performance). The solutions of optimal control problems can be obtained either by using Pontryagin's minimum principle, which provides a necessary condition for optimality, or by solving the Hamilton–Jacobi–Bellman (HJB) equation, which is a sufficient condition (see e.g. Kirk (2004) and Lewis and Syrmos (1995)). Although mathematically elegant, both approaches present a major disadvantage posed by the requirement of complete knowledge of the system dynamics. In the case when only an approximate model of the system is available, the optimal controller derived with respect to the system's model will not perform optimally when applied for the control of the real process.


Thus, adaptation of the controller parameters such that operation becomes optimal with respect to the behavior of the real plant is highly desirable.

Adaptive optimal controllers have been developed either by adding optimality features to an adaptive controller (e.g. the adaptation of the controller parameters is driven by the desired performance improvement reflected by an optimality criterion functional) or by adding adaptive features to an optimal controller (e.g. the optimal control policy is improved relative to the adaptation of the parameters of the model of the system). A third approach to adaptive optimal control (Sutton, Barto, & Williams, 1992), namely reinforcement learning (RL) (Sutton & Barto, 1998), was introduced and extensively developed in the computational intelligence and machine learning communities, generally to find optimal control policies for Markovian systems with discrete state and action spaces (Howard, 1960). RL-based solutions to the continuous-time optimal control problem have been given in Baird (1994) and Doya (2000). The RL algorithms are constructed on the idea that successful control decisions should be remembered, by means of a reinforcement signal, such that they become more likely to be used a second time. Although the idea originates from experimental animal learning, where it has been observed that the dopamine neurotransmitter acts as a reinforcement informational signal which favors learning at the level of the neuron (see e.g. Doya, Kimura, and Kawato (2001) and Schultz, Tremblay, and Hollerman (2000)), RL is strongly connected from a theoretical point of view with direct and indirect adaptive optimal control methods. In the present issue, Werbos (in this issue) reviews four generations of general-purpose learning designs in Adaptive, Approximate Dynamic Programming, which provide approximate solutions to optimal control problems and include reinforcement learning as a special case. He argues the relevance of such methods not only for the general goal of replicating human intelligence but also for bringing a solution of efficient regulation in electrical power systems.

The main advantage of using RL for solving optimal control problems comes from the fact that a number of RL algorithms, e.g. Q-learning (Watkins, 1989) (also known as Action Dependent Heuristic Dynamic Programming (Werbos, 1989, 1992)), do not require knowledge or identification/learning of the system dynamics. This is important since it is well known that modeling and identification procedures for the dynamics of a given nonlinear system are most often time-consuming iterative approaches which require model design, parameter identification and model validation at each step of the iteration. The identification procedure is even more difficult when the system has hidden nonlinear dynamics which manifest only in certain operating regions. In the RL algorithms' case the learning process is moved to a higher level, having no longer as an object of interest the system's dynamics but a performance index which quantifies how close to optimality the closed-loop control system operates. In other words, instead of identifying a model of the plant dynamics, which will later be used for the controller design, the RL algorithms require identification of the static map which describes the system performance associated with a given control policy. One sees now that, as long as enough information is available to describe the performance associated with a given control policy at all significant operating points of the control system, the system performance map can be easily learned, conditioned by the fact that the control system maintains stability properties. This is again advantageous compared with an open-loop identification procedure which, due to the excitatory inputs required for making the natural modes of the system visible in the measured system states, could have as a result the instability of the system.

Even in the case when complete knowledge of the system dynamics is available, a second difficulty appears from the fact that the HJB equation, underlying the optimal control problem, is generally nonlinear and most often does not possess an analytical solution; thus the optimal control solution is regularly addressed by numerical methods (Huang & Lin, 1995). Also from this point of view, RL algorithms provide a natural approach to solve the optimal control problem, as they can be implemented by means of function approximation structures, such as neural networks, which can be trained to learn the solution of the HJB equation. RL algorithms, such as the one that we will present in Section 3, and develop for online implementation in Section 4, can easily be incorporated in higher-level decision-making structures of the sort presented in (Brannon, Seiffertt, Draelos, & Wunch, in this issue).

RL algorithms can be implemented on Actor/Critic structures which involve two function approximators, namely the Actor, which parameterizes the control policy, and the Critic, a parametric representation for the cost function which describes the performance of the control system. In this case the solution of the optimal control problem will be provided in the form of the Actor neural network for which the associated cost, i.e. the output of the Critic neural network, has an extremal value.

In this paper we present an adaptive method, which uses neural-network-type structures in an Actor/Critic configuration, for solving online the optimal control problem for the case of nonlinear systems, in a continuous-time framework, without making use of explicit knowledge of the internal dynamics of the nonlinear system. The method is based on Policy Iteration (PI), an RL algorithm which iterates between the steps of policy evaluation and policy improvement. The PI method starts by evaluating the cost of a given admissible initial policy and then uses this information to obtain a new control policy, which is improved in the sense of having a smaller associated cost compared with the previous policy, over the domain of interest in the state space. The two steps are repeated until the policy improvement step no longer changes the present policy, indicating that the optimal control behavior has been obtained.

In the case of continuous-time systems with linear dynamics, PI was employed for finding the solution of the state feedback optimal control problem (i.e. LQR) in Murray, Cox, Lendaris, and Saeks (2002), while the convergence guarantee to the LQR solution was given in Kleinman (1968). The PI algorithm, as used by Kleinman (1968), requires repetitive solution of Lyapunov equations, which involve complete knowledge of the system dynamics (i.e. both the input-to-state and internal system dynamics specified by the plant input and system matrices). For nonlinear systems, the PI algorithm was first developed by Leake and Liu (1967). Three decades later it was introduced in Beard, Saridis, and Wen (1997) as a feasible adaptive solution to the CT optimal control problem. In Beard et al. (1997) the Generalized HJB equations (a sort of nonlinear Lyapunov equations), which appear in the PI algorithm, were solved using successive Galerkin approximation algorithms. A neural-network-based approach was developed and extended to the cases of H2 and H-infinity with constrained control in Abu-Khalaf and Lewis (2005) and Abu-Khalaf, Lewis, and Huang (2006). Neural-network-based Actor/Critic structures, in a continuous-time framework, with neural network tuning laws, have been given in Hanselmann, Noakes, and Zaknich (2007). All of the above-mentioned methods require complete knowledge of the system dynamics.

In Vrabie, Pastravanu, and Lewis (2007) and Vrabie and Lewis (2008) the authors gave a new formulation of the PI algorithm for linear and nonlinear continuous-time systems. This new formulation allows online adaptation (i.e. learning) of the continuous-time operating controller to the optimal state feedback control policy, without requiring knowledge of the system internal dynamics (knowledge regarding the input-to-state dynamics is still required, but from a system identification point of view this knowledge is relatively easier to obtain).


The continuous-time online approach to RL, which we describe in Section 3, is the one introduced in Vrabie and Lewis (2008). In that paper the convergence of the algorithm was proved under the assumption that the two function approximators in the Actor/Critic structure can provide exact representations of the control and cost functions. While the development in Vrabie and Lewis (2008) showed the validity of the approach to online learning, the assumption of exact representation of the cost functions which have to be learned was not realistic. In this paper we revisit the online algorithm presented in Vrabie and Lewis (2008) and we extend those results, providing convergence proofs for the function-approximator-based algorithm while taking into account the existing approximation errors between the Actor/Critic structures and the control and cost functions, respectively. The algorithm converges online to the optimal control solution without knowledge of the internal system dynamics. Closed-loop dynamic stability is guaranteed throughout.

The end result is a control and adaptation structure which is a hybrid combination between a continuous-time controller and a learning structure which operates based on discrete sampled data from the system and from the continuous-time dynamics reflecting the performance of the system. Such a structure is unlike any of the standard forms of controllers appearing in the literature.

In the next section we give an overview of the optimal control problem for nonlinear systems. The proposed Policy Iteration algorithm, which solves the HJB equation without requiring knowledge of the internal dynamics of the system, is presented in Section 3. Convergence of the algorithm is proven by showing equivalence with the general PI algorithm for nonlinear systems. The formulation of the introduced algorithm using neural network approximation structures is discussed in Section 4. Convergence of the neural-network-based algorithm while considering the error between the cost function and its neural network approximation is then provided. Section 5 gives a flowchart of the online algorithm and discusses the online implementation on an Actor/Critic structure, while commenting also on the relations between the proposed online algorithm and certain learning mechanisms in the mammal brain. Section 6 presents simulation results considering two nonlinear systems with quadratic and quartic cost functions.

    2. Background in nonlinear optimal control

In this section we give the formulation of the nonlinear optimal control problem.

Consider the time-invariant dynamical system, affine in the input, given by

\dot{x}(t) = f(x(t)) + g(x(t))u(x(t)); \quad x(0) = x_0    (1)

with x(t) ∈ R^n, f(x(t)) ∈ R^n, g(x(t)) ∈ R^{n×m} and the input u(t) ∈ U ⊂ R^m. We assume that f(0) = 0, that f(x) + g(x)u is Lipschitz continuous on a set Ω ⊆ R^n which contains the origin, and that the dynamical system is stabilizable on Ω, i.e. there exists a continuous control function u(t) ∈ U such that the system is asymptotically stable on Ω.

We note here that although global asymptotic stability is guaranteed in the linear system case, it is generally difficult to guarantee in a general continuous-time nonlinear system problem setting. This is due to the possibly non-smooth nature of the nonlinear system dynamics: at the points at which there exist discontinuities of \dot{x}, there will also exist discontinuities of the gradient of the cost function. For this reason we have to restrict our discussion to the case in which asymptotic stability is desired and sought only in a region Ω ⊆ R^n in which the cost function is continuously differentiable.

The infinite horizon integral cost associated with the control input \{u(\tau); \tau \ge t\} is defined as

V^u(x(t)) = \int_t^{\infty} r(x(\tau), u(\tau)) \, d\tau    (2)

where x(\tau) denotes the solution of (1) for initial condition x(t) ∈ Ω and input \{u(\tau); \tau \ge t\}, r(x, u) = Q(x) + u^T R u with Q(x) positive definite, i.e. ∀x ≠ 0, Q(x) > 0 and x = 0 ⇒ Q(x) = 0, and R ∈ R^{m×m} a positive definite matrix.
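As an illustration of the cost functional (2) (not part of the original paper), the sketch below numerically approximates V^u(x_0) for a given policy by simulating the closed loop and truncating the integral at a finite horizon. The function and variable names are ours; the dynamics and policy are those of Example 1 in Section 6.1, and a simple forward Euler scheme is used only for brevity.

```python
import numpy as np

def closed_loop_cost(F, G, policy, Q, R, x0, dt=1e-3, horizon=20.0):
    """Approximate V^u(x0) = int_0^inf [Q(x) + u^T R u] dt by forward Euler and truncation."""
    x = np.asarray(x0, dtype=float)
    cost = 0.0
    for _ in range(int(horizon / dt)):
        u = np.atleast_1d(policy(x))
        cost += (Q(x) + u @ R @ u) * dt        # running cost r(x, u)
        x = x + dt * (F(x) + G(x) @ u)         # dynamics (1): x_dot = f(x) + g(x) u
    return cost

# Dynamics, cost and optimal policy of Example 1 (Section 6.1), used here as a test case.
F = lambda x: np.array([-x[0] + x[1],
                        -0.5 * (x[0] + x[1]) + 0.5 * x[1] * np.sin(x[0]) ** 2])
G = lambda x: np.array([[0.0], [np.sin(x[0])]])
Q = lambda x: x[0] ** 2 + x[1] ** 2
R = np.array([[1.0]])
mu_opt = lambda x: -np.sin(x[0]) * x[1]

print(closed_loop_cost(F, G, mu_opt, Q, R, x0=[0.5, -0.5]))
# Should be close to the known optimal value V*(x0) = 0.5*0.5**2 + (-0.5)**2 = 0.375.
```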

Definition 1 (Admissible (Stabilizing) Policy (Beard et al., 1997)). A control policy µ(x) is defined as admissible with respect to (2) on Ω, denoted by µ ∈ Ψ(Ω), if µ(x) is continuous on Ω, µ(0) = 0, µ(x) stabilizes (1) on Ω and V(x_0) is finite ∀x_0 ∈ Ω.

The cost function associated with any admissible control policy µ ∈ Ψ(Ω) is

V^µ(x(t)) = \int_t^{\infty} r(x(\tau), µ(x(\tau))) \, d\tau.    (3)

V^µ(x) is C^1. The infinitesimal version of (3) is

0 = r(x, µ(x)) + (\nabla V^µ_x)^T (f(x) + g(x)µ(x)), \quad V^µ(0) = 0    (4)

where \nabla V^µ_x (a column vector) denotes the gradient of the cost function V^µ with respect to x, as the cost function does not depend explicitly on time. Eq. (4) is a Lyapunov equation for nonlinear systems which, given the controller µ(x) ∈ Ψ(Ω), can be solved for the cost function V^µ(x) associated with it. Given that µ(x) is an admissible control policy, if V^µ(x) satisfies (4), with r(x, µ(x)) ≥ 0, then V^µ(x) is a Lyapunov function for the system (1) with control policy µ(x).

The optimal control problem can now be formulated: Given the continuous-time system (1), the set Ψ(Ω) of admissible control policies, and the infinite horizon cost functional (2), find an admissible control policy such that the cost index (2) associated with the system (1) is minimized.

Defining the Hamiltonian of the problem

H(x, u, \nabla V_x) = r(x(t), u(t)) + (\nabla V_x)^T (f(x(t)) + g(x(t))u(t)),    (5)

the optimal cost function V^*(x) satisfies the HJB equation

0 = \min_{u \in \Psi(\Omega)} [H(x, u, \nabla V^*_x)].    (6)

This Hamiltonian, which does not depend explicitly on time and corresponds to the infinite horizon optimal control problem, must be identically zero when evaluated on an extremal trajectory. Also, every admissible policy µ(x) and associated cost V^µ(x), with V^µ(0) = 0, satisfy H(x, µ, \nabla V^µ_x) = 0. Thus this equation can be used to solve for the cost associated with any given admissible policy.

Assuming that the minimum on the right-hand side of Eq. (6) exists and is unique, the optimal control function for the given problem is

u^*(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla V^*_x.    (7)

Inserting this optimal control policy in the Hamiltonian we obtain the formulation of the HJB equation in terms of \nabla V^*_x:

0 = Q(x) + (\nabla V^*_x)^T f(x) - \frac{1}{4} (\nabla V^*_x)^T g(x) R^{-1} g^T(x) \nabla V^*_x, \quad V^*(0) = 0.    (8)

For the linear system case, considering a quadratic cost functional, the equivalent of this HJB equation is the well-known Riccati equation.

In order to find the optimal control solution for the problem one only needs to solve the HJB equation (8) for the cost function and then substitute the solution in (7) to obtain the optimal control. However, solving the HJB equation is generally difficult. It also requires complete knowledge of the system dynamics (i.e. the functions f(x), g(x) need to be known).
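To make the link with the Riccati equation explicit, the linear-quadratic specialization of (7)-(8) reads as follows. This is standard material, included here only as an illustration and not worked out in the paper.

```latex
% Take \dot{x} = Ax + Bu, Q(x) = x^{T}Qx and V^{*}(x) = x^{T}Px with P = P^{T} > 0,
% so that \nabla V^{*}_{x} = 2Px. Then (7) and (8) become
\begin{align*}
u^{*}(x) &= -\tfrac{1}{2}R^{-1}B^{T}(2Px) = -R^{-1}B^{T}Px,\\
0 &= x^{T}Qx + (2Px)^{T}Ax - \tfrac{1}{4}(2Px)^{T}BR^{-1}B^{T}(2Px)
   = x^{T}\!\left(Q + A^{T}P + PA - PBR^{-1}B^{T}P\right)\!x \quad \forall x,
\end{align*}
% i.e. P solves the algebraic Riccati equation A^{T}P + PA - PBR^{-1}B^{T}P + Q = 0.
```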


    3. Policy iteration algorithm for solving the HJB equation

In the following we give a new online iterative algorithm which will adapt to solve the infinite horizon optimal control problem without using knowledge regarding the system internal dynamics (i.e. the system function f(x)). Convergence of the algorithm to the optimal control function is then provided.

    3.1. Policy iteration algorithm

Let µ^{(0)}(x(t)) ∈ Ψ(Ω) be an admissible policy, and let T > 0 be such that if x(t) ∈ Ω then also x(t + T) ∈ Ω (the existence of such T > 0 is guaranteed by the admissibility of µ^{(0)}(·) on Ω). Then the iteration between

1. (policy evaluation) solve for V^{µ(i)}(x(t)) using

V^{µ(i)}(x(t)) = \int_t^{t+T} r(x(s), µ^{(i)}(x(s))) \, ds + V^{µ(i)}(x(t+T)), \quad \text{with } V^{µ(i)}(0) = 0    (9)

and

2. (policy improvement) update the control policy using

µ^{(i+1)} = \arg\min_{u \in \Psi(\Omega)} [H(x, u, \nabla V^{µ(i)}_x)],    (10)

which explicitly is

µ^{(i+1)}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla V^{µ(i)}_x,    (11)

converges to the optimal control policy µ^* ∈ Ψ(Ω) with corresponding cost V^*(x_0) = \min_µ \left( \int_0^{\infty} r(x(\tau), µ(x(\tau))) \, d\tau \right).

Eqs. (9) and (11) give a new formulation for the Policy Iteration algorithm which allows solving the optimal control problem without making use of any knowledge of the internal dynamics of the system, f(x). This algorithm is an online version of the offline algorithms proposed in Abu-Khalaf and Lewis (2005) and Beard et al. (1997), motivated by the success of the online adaptive critic techniques proposed by computational intelligence researchers (Bertsekas and Tsitsiklis (1996); Murray et al. (2002); Prokhorov and Wunsch (1997)). In the spirit of reinforcement learning algorithms, the integral term in (9) can be regarded as the reinforcement over the time interval [t, t + T).

Eq. (9) is a discretized version of V^{µ(i)}(x(t)) = \int_t^{\infty} r(x(\tau), µ^{(i)}(x(\tau))) \, d\tau and it can be viewed as a Lyapunov equation for nonlinear systems. In this paper we shall refer to it also as

LE(V^{µ(i)}(x(t))) \triangleq \int_t^{t+T} r(x(s), µ^{(i)}(x(s))) \, ds + V^{µ(i)}(x(t+T)) - V^{µ(i)}(x(t)), \quad \text{with } V^{µ(i)}(0) = 0.

The convergence of the new PI algorithm is given in the next subsection. The implementation of the algorithm using neural network structures will be discussed in Section 4.
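Structurally, the iteration (9)-(11) can be summarized by the following sketch. The code and names are ours and purely illustrative; the two steps are kept abstract, and a concrete least-squares realization that uses measured data only is given in Section 4.

```python
import numpy as np

def policy_iteration(evaluate_policy, improve_policy, mu0, probe_states, tol=1e-6, max_iter=50):
    """Abstract form of (9)-(11).

    `evaluate_policy(mu)` must return the cost V^mu (a callable with V(0) = 0) obtained
    from data sampled along closed-loop trajectories, via (9); `improve_policy(V)` must
    return the policy of (11), mu(x) = -1/2 R^{-1} g(x)^T grad V(x).
    """
    mu, V_old = mu0, None
    for _ in range(max_iter):
        V = evaluate_policy(mu)                     # policy evaluation step, eq. (9)
        mu = improve_policy(V)                      # policy improvement step, eq. (11)
        if V_old is not None:
            # stop when the evaluated cost stops changing on a set of probe states
            if max(abs(V(np.asarray(x)) - V_old(np.asarray(x))) for x in probe_states) < tol:
                break
        V_old = V
    return mu, V
```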

    3.2. Convergence of the policy iteration algorithm

If µ^{(i)} ∈ Ψ(Ω) and V^{µ(i)}(x(t)) ∈ C^1(Ω) satisfy Eq. (9), then one can show that the new control policy µ^{(i+1)}, determined based on Eq. (10), is admissible for the system (1) (for proof see Abu-Khalaf and Lewis (2005) and Beard et al. (1997)).

The following result is required in order to prove the convergence of the proposed policy iteration algorithm.

Lemma 1. Solving for V^{µ(i)} in Eq. (9) is equivalent to finding the solution of

0 = r(x, µ^{(i)}(x)) + (\nabla V^{µ(i)}_x)^T (f(x) + g(x)µ^{(i)}(x)), \quad V^{µ(i)}(0) = 0.    (12)

    The proof is presented in the Appendix.

    Remark 1. Although the same solution is obtained solving eitherEq. (9) or (12), solving Eq. (9) does not require any knowledge ofthe system dynamics f (x), which in turn appears explicitly in (12).

From Lemma 1 it follows that the algorithm (9) and (11) is equivalent to iterating between (12) and (11), without using knowledge of the system internal dynamics f(x).

Theorem 1 (Convergence). The policy iteration (9) and (11) converges uniformly to the optimal control solution on the trajectories originating in Ω, i.e.

\forall \varepsilon > 0 \; \exists i_0 : \forall i \ge i_0, \quad \sup_{x \in \Omega} |V^{µ(i)}(x) - V^*(x)| < \varepsilon, \quad \sup_{x \in \Omega} |µ^{(i)}(x) - u^*(x)| < \varepsilon.    (13)

Proof. In Abu-Khalaf and Lewis (2005) and Beard et al. (1997) it was shown that, iterating on Eqs. (12) and (11) and provided that the initial policy µ^{(0)}(x) is admissible, all the subsequent control policies will be admissible and the iteration (12) and (11) will converge to the solution of the HJB equation, i.e. Eq. (13) is satisfied.

Based on the proven equivalence between Eqs. (9) and (12) we can conclude that the proposed online adaptive optimal control algorithm will converge to the solution of the optimal control problem (2), on Ω, without using knowledge of the internal dynamics of the controlled system (1). □

    4. Neural-network-based approximation of the HJB solution

For the implementation of the iteration scheme given by (9) and (11) one only needs knowledge of the input-to-state dynamics, i.e. the function g(x), which is required for the policy update in Eq. (11). One can see that knowledge of the internal state dynamics, described by f(x), is not required. The information regarding the internal dynamics f(x) is embedded in the states x(t) and x(t + T), which are sampled online.

    4.1. Neural network approximation of the cost function

In order to solve Eq. (9) we will use a neural-network-type structure to obtain an approximation of the cost function for any x ∈ Ω. Due to the universal approximation property (Hornik, Stinchcombe, & White, 1990), neural networks are natural candidates for approximating smooth functions on compact sets. Thus we consider that the cost function V^{µ(i)}(x) can be approximated, for x ∈ Ω, by

V^{µ(i)}_L(x) = \sum_{j=1}^{L} w^{µ(i)}_j φ_j(x) = (w^{µ(i)}_L)^T φ_L(x).    (14)

This can be seen as a neural network with L neurons in the hidden layer and activation functions φ_j(x) ∈ C^1(Ω), φ_j(0) = 0. Here φ_L(x) is the vector of activation functions, w^{µ(i)}_j denote the weights of the output layer and w^{µ(i)}_L is the weight vector. The output layer neuron has a linear activation function. The weights of the hidden layer are all equal to one and will not be changed during the training procedure.

Note that, given an infinite set of linearly independent activation functions \{φ_j(x)\}_1^{\infty}, such that φ_j(x) ∈ C^1(Ω), φ_j(0) = 0, j = 1, ..., ∞, which satisfy the completeness property (i.e. any function f(x) ∈ C^1(Ω), f(0) = 0 can be represented as a linear combination of a subset of \{φ_j(x)\}_1^{\infty}), the exact solution of Eq. (9) can be expressed as

V^{µ(i)}(x) = \sum_{j=1}^{\infty} c^{µ(i)}_j φ_j(x) = (c^{µ(i)}_{\infty})^T φ_{\infty}(x),    (15)


where φ_{\infty}(x) is the vector of activation functions and c^{µ(i)}_{\infty} denotes the weight vector.

Using the neural network approximate description for the cost function, Eqs. (14) and (9) can be written as

(w^{µ(i)}_L)^T φ_L(x(t)) = \int_t^{t+T} r(x, µ^{(i)}(x)) \, d\tau + (w^{µ(i)}_L)^T φ_L(x(t+T)).    (16)

As the cost function was replaced with the neural network approximation, (16) will have the residual error

δ^{µ(i)}_L(x(t), T) = \int_t^{t+T} r(x, µ^{(i)}(x)) \, d\tau + (w^{µ(i)}_L)^T [φ_L(x(t+T)) - φ_L(x(t))].    (17)

From the perspective of temporal difference learning methods (e.g., Baird (1994) and Doya (2000)) this error can be viewed as a temporal difference residual error for continuous-time systems.

To determine the parameters of the neural network approximating the cost function V^{µ(i)}_L, in the least-squares sense, we use the method of weighted residuals. Thus we tune the parameters w^{µ(i)}_L of the cost function approximation V^{µ(i)}_L in order to minimize the objective

S = \int_{\Omega} δ^{µ(i)}_L(x, T) \, δ^{µ(i)}_L(x, T) \, dx.    (18)

This amounts to \int_{\Omega} \frac{d δ^{µ(i)}_L(x,T)}{d w^{µ(i)}_L} δ^{µ(i)}_L(x, T) \, dx = 0. Using the inner product notation for the Lebesgue integral, one can write

\left\langle \frac{d δ^{µ(i)}_L(x,T)}{d w^{µ(i)}_L}, \; δ^{µ(i)}_L(x, T) \right\rangle_{\Omega} = 0,    (19)

which is

\langle φ_L(x(t+T)) - φ_L(x(t)), \; [φ_L(x(t+T)) - φ_L(x(t))]^T \rangle_{\Omega} \, w^{µ(i)}_L + \left\langle φ_L(x(t+T)) - φ_L(x(t)), \; \int_t^{t+T} r(x(s), µ^{(i)}(x(s))) \, ds \right\rangle_{\Omega} = 0.    (20)

Provided that Φ = \langle φ_L(x(t+T)) - φ_L(x(t)), \; [φ_L(x(t+T)) - φ_L(x(t))]^T \rangle_{\Omega} is invertible, we obtain

w^{µ(i)}_L = -Φ^{-1} \left\langle φ_L(x(t+T)) - φ_L(x(t)), \; \int_t^{t+T} r(x(s), µ^{(i)}(x(s))) \, ds \right\rangle_{\Omega}.    (21)

To show that Φ can be inverted the following technical results are needed.

Definition 2 (Linearly Independent Set of Functions (Kolmogorov & Fomin, 1999)). A set of functions \{φ_j\}_1^N is said to be linearly independent on a set Ω if \sum_{j=1}^N c_j φ_j(x) = 0 a.e. on Ω implies that c_1 = ... = c_N = 0.

Lemma 2. If the set \{φ_j\}_1^N is linearly independent and u ∈ Ψ(Ω) then the set \{\nabla φ_j^T (f + gu)\}_1^N is also linearly independent.

For proof see Beard et al. (1997).

We now give a technical lemma proving that Φ can be inverted.

Lemma 3. Let µ(x) ∈ Ψ(Ω) be such that f(x) + g(x)µ(x) is asymptotically stable. Given that the set \{φ_j\}_1^N is linearly independent, there exists T > 0 such that for all x(t) ∈ Ω − \{0\} the set \{\bar{φ}_j(x(t), T) = φ_j(x(t+T)) - φ_j(x(t))\}_1^N is also linearly independent.

The proof is presented in the Appendix section.

Based on the result of Lemma 3, there exist values of T such that Φ is invertible and the parameters w^{µ(i)}_L of the cost function V^{µ(i)}_L can be calculated. Having solved for the cost function V^{µ(i)}_L associated with the control policy µ^{(i)}, the policy update step can be executed. The new control policy will thus be

µ^{(i+1)}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla φ_L^T(x) \, w^{µ(i)}_L.    (22)

Eq. (22) gives the output of the Actor neural network. Note that in this implementation the controller (Actor) can be seen as a neural network, similar to the one approximating the Critic, which has the same weight parameters as the Critic neural network, but whose activation functions are the gradients of those in the Critic.
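In practice (21) is a linear least-squares problem in the critic weights and (22) is an evaluation of the corresponding actor. The sketch below is ours (hypothetical names), and it assumes the integral reinforcements and the basis-function values at the interval endpoints have already been measured.

```python
import numpy as np

def critic_weights(phi_t, phi_tT, rho):
    """Least-squares critic update, cf. (20)-(21).

    phi_t[k]  = phi_L(x(t_k)),  phi_tT[k] = phi_L(x(t_k + T)),
    rho[k]    = integral of r(x, mu(x)) over [t_k, t_k + T]  (integral reinforcement).
    """
    dphi = np.asarray(phi_tT) - np.asarray(phi_t)
    # (16) asks w^T dphi[k] = -rho[k] for every sample; solving this in the least-squares
    # sense is equivalent to the normal-equation form (21).
    w, *_ = np.linalg.lstsq(dphi, -np.asarray(rho), rcond=None)
    return w

def improved_policy(w, grad_phi, g, R_inv):
    """Policy update (22): mu(x) = -1/2 R^{-1} g(x)^T grad_phi(x)^T w."""
    return lambda x: -0.5 * R_inv @ g(x).T @ grad_phi(x).T @ w

# Quadratic basis used in Example 1, phi_3(x) = [x1^2, x1*x2, x2^2]:
grad_phi = lambda x: np.array([[2 * x[0], 0.0],
                               [x[1],     x[0]],
                               [0.0,      2 * x[1]]])    # row j is d(phi_j)/dx
g = lambda x: np.array([[0.0], [np.sin(x[0])]])          # input-to-state dynamics of Example 1
R_inv = np.array([[1.0]])
```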

4.2. Convergence of V^{µ(i)}_L(x) to the exact solution of the Lyapunov equation V^{µ(i)}(x)

We now discuss the convergence of the method of least squares when a neural network is used to approximate the cost function in Eq. (9).

Definition 3 (Convergence in the Mean). A sequence of Lebesgue integrable functions on a set Ω, \{f_n(x)\} ∈ L^2(Ω), is said to converge in the mean to f(x) ∈ L^2(Ω) if ∀ε > 0, ∃N(ε) such that ∀n > N(ε), ‖f_n(x) − f(x)‖_{L^2(Ω)} < ε, where ‖f(x)‖^2_{L^2(Ω)} = \langle f, f \rangle.

Eq. (9) can be written using a linear operator A defined on the Hilbert space of continuous and differentiable functionals on Ω:

A V^µ \equiv V^µ(x(t)) - V^µ(x(t+T)) = \int_t^{t+T} r(x(s), µ(x(s))) \, ds \equiv d(x, µ(x), T).    (23)

Neural networks which are defined such that the activation functions are power series of order m are differentiable and can uniformly approximate a continuous function, with all its partial derivatives up to order m, by differentiating the series termwise. This type of series is m-uniformly dense, as shown in Lemma 4.

Lemma 4 (Higher Order Weierstrass Approximation Theorem). Let f(x) ∈ C^m(Ω); then there exists a polynomial P(x) such that it converges uniformly to f(x), and all its partial derivatives up to order m converge uniformly (Hornik et al., 1990).

The following facts hold under the stated standard conditions in optimal control.

Fact 1. The solution of (9) is positive definite. This is guaranteed when the system has stabilizable dynamics and when the performance functional satisfies zero state observability (i.e. observability of the system state through the cost function) (Van Der Schaft, 1992).

Fact 2. The system dynamics and the performance integrand r(x(s), µ(x(s))) are such that the solution of (9) is continuous and differentiable on Ω.

Fact 3. We can choose a complete set \{φ_j\}_1^{\infty} ∈ C^1(Ω) such that the solution V ∈ C^1(Ω) and ∇V can be uniformly approximated by the infinite series built based on \{φ_j\}_1^{\infty}.


Fact 4. The sequence \{\bar{φ}_j(x(t), T) = φ_j(x(t+T)) - φ_j(x(t))\}_1^{\infty} is linearly independent and complete. The linear independence results from Lemma 3, being conditioned by certain values of the sample time T. The completeness relies on the high-order Weierstrass approximation theorem: ∀V, ε ∃L, w_L such that |V_L − V| < ε. This implies that, as L → ∞, \sup_{x∈Ω} |A V_L − A V| → 0 ⇒ ‖A V_L − A V‖_{L^2(Ω)} → 0, which proves the completeness of \{\bar{φ}_j(x(t), T) = A φ_j\}_1^{\infty}.

The first three assumptions are standard in optimal control, and we have proven the fourth herein. The next result is required.

Lemma 5. Given a set of N linearly independent functions \{f_j(x)\}_1^N defined on Ω, then

‖α_N^T f_N‖^2_{L^2(Ω)} → 0 ⇔ ‖α_N‖^2_{l^2} → 0.    (24)

For proof see Abu-Khalaf and Lewis (2005).

The next main result shows convergence in the mean.

Theorem 2. Given that Facts 1–4 hold, approximate solutions exist for (9) using the method of least squares and they are unique for each L. In addition, as L → ∞,

R1. ‖LE(V^{µ(i)}_L(x)) − LE(V^{µ(i)}(x))‖_{L^2(Ω)} → 0
R2. ‖V^{µ(i)}_L(x) − V^{µ(i)}(x)‖_{L^2(Ω)} → 0
R3. ‖∇V^{µ(i)}_L(x) − ∇V^{µ(i)}(x)‖_{L^2(Ω)} → 0
R4. ‖µ^{(i)}_L(x) − µ^{(i)}(x)‖_{L^2(Ω)} → 0.

Proof. The solution V^{µ(i)}_L of (9) in the least-squares sense is the solution of the minimization problem

‖A V^{µ(i)}_L − d(x, µ^{(i)}, T)‖^2 = \min_{w_L} ‖w_L^T \bar{φ}(x) − d(x, µ^{(i)}, T)‖^2.    (25)

The uniqueness of the solution follows directly from the linear independence of \{\bar{φ}_j(x(t), T)\}_1^L, and R1 follows from the completeness of \{\bar{φ}_j(x(t), T) = A φ_j\}_1^{\infty}.

To show R2 we write

LE(V^{µ(i)}_L(x)) − LE(V^{µ(i)}(x)) = w_L^T \bar{φ}_L(x, T) − c_{\infty}^T \bar{φ}_{\infty}(x, T) = ε_L(x, T),    (26)

(w_L − c_L)^T \bar{φ}_L(x, T) = ε_L(x, T) + \sum_{j=L+1}^{\infty} c_j \bar{φ}_j(x, T) = ε_L(x, T) + e_L(x, T).    (27)

e_L(x, T) converges uniformly to zero due to the high-order Weierstrass approximation theorem (this implies convergence in the mean) and ε_L(x, T) converges in the mean to zero due to R1. Then

‖(w_L − c_L)^T \bar{φ}_L(x, T)‖^2_{L^2(Ω)} = ‖ε_L(x, T) + e_L(x, T)‖^2_{L^2(Ω)} ≤ 2‖ε_L(x, T)‖^2_{L^2(Ω)} + 2‖e_L(x, T)‖^2_{L^2(Ω)} → 0.    (28)

Since \bar{φ}_L(x, T) is linearly independent, then, based on Lemma 5, one sees that ‖w_L − c_L‖^2_{l^2} → 0. As the set \{φ_j\}_1^L is linearly independent, from Lemma 5 it results that ‖(w_L − c_L)^T φ_L(x)‖^2_{L^2(Ω)} → 0. It thus follows that, as L → ∞, ‖V^{µ(i)}_L − V^{µ(i)}‖^2_{L^2(Ω)} → 0.

Similarly, since \{dφ_j/dx\}_1^L is linearly independent, from Lemma 5 it results that ‖(w_L − c_L)^T ∇φ_L(x)‖^2_{L^2(Ω)} → 0, from which it follows that ‖∇V^{µ(i)}_L − ∇V^{µ(i)}‖^2_{L^2(Ω)} → 0.

R4 follows immediately from R3 given that

‖µ^{(i)}_L(x) − µ^{(i)}(x)‖^2_{L^2(Ω)} = ‖−R^{-1} g^T(x) (∇V^{µ(i-1)}_L(x) − ∇V^{µ(i-1)}(x))‖^2_{L^2(Ω)} ≤ ‖−R^{-1} g^T(x)‖^2_{L^2(Ω)} ‖∇V^{µ(i-1)}_L(x) − ∇V^{µ(i-1)}(x)‖^2_{L^2(Ω)} → 0. □    (29)

Based on the result in Theorem 2 the following stronger result of uniform convergence can be shown.

Corollary 1. If the results from Theorem 2 hold then

\sup_{x∈Ω} |V^{µ(i)}_L(x) − V^{µ(i)}(x)| → 0, \quad \sup_{x∈Ω} |∇V^{µ(i)}_L(x) − ∇V^{µ(i)}(x)| → 0 \quad and \quad \sup_{x∈Ω} |µ^{(i)}_L(x) − µ^{(i)}(x)| → 0.

For proof see Abu-Khalaf and Lewis (2005).

The next result shows that, given an initial admissible control policy µ^{(0)}(x), the control policy µ^{(i)}_L(x) at each step i of the Policy Iteration algorithm, with value function approximation, is admissible provided that the number of neurons in the hidden layer is sufficiently large.

Corollary 2 (Admissibility of µ^{(i)}_L(x)). ∃L_0 such that ∀L > L_0, µ^{(i)}_L ∈ Ψ(Ω).

The proof is given in the Appendix.

Corollary 3. \sup_{x∈Ω} |µ^{(i)}_L(x) − µ^{(i)}(x)| → 0 implies that \sup_{x∈Ω} |V^{(i)}_L(x) − V^{(i)}(x)| → 0.

4.3. Convergence of the method of least squares to the solution of the HJB equation

In this section we show that the successive least-squares solution using neural networks converges to the solution of the HJB equation (8).

Theorem 3. Under the assumptions of Theorem 2 the following is satisfied ∀i ≥ 0:

i. \sup_{x∈Ω} |V^{(i)}_L(x) − V^{(i)}(x)| → 0
ii. \sup_{x∈Ω} |µ^{(i+1)}_L(x) − µ^{(i+1)}(x)| → 0
iii. ∃L_0 : ∀L ≥ L_0, µ^{(i)}_L(x) ∈ Ψ(Ω).

The proof is by induction and is presented in Abu-Khalaf and Lewis (2005).



Theorem 4. ∀ε > 0, ∃i_0, L_0 : ∀i ≥ i_0, L ≥ L_0,

i. \sup_{x∈Ω} |V^{(i)}_L(x) − V^*(x)| < ε,
ii. \sup_{x∈Ω} |µ^{(i+1)}_L(x) − u^*(x)| < ε,
iii. µ^{(i)}_L(x) ∈ Ψ(Ω).

The proof follows directly from Theorems 1 and 3.

    5. Online algorithm on an Actor/Critic structure

In this section we discuss the implementation of the adaptive algorithm on the Actor/Critic structure. We present the main features of the online adaptive critic while noting similarities with learning mechanisms in the mammal brain.

The structure of the system with the adaptive controller is presented in Fig. 1.

Fig. 1. Structure of the system with adaptive controller.

It is important to note that the adaptation structure has dynamics consisting of the state V(t), i.e. the value, which evolves based on \dot{V} = Q(x) + u^T R u. This provides a dynamic memory that enables one to extract the information regarding the cost associated with the given policy. If one resets V(t) to zero at the beginning of each sample interval [t, t + T), then the measurement V(t + T) gives the reinforcement over the time interval [t, t + T) required to implement the policy evaluation step in (9), i.e. V(t + T) gives the integral reinforcement term in (9). Thus, this is a dynamical controller whose memory is exactly the value V(t) of using the current policy.

The policy iteration technique in this paper has led us to a control system structure that allows one to perform optimal control in an adaptive fashion online, without knowing the internal dynamics of the system. We term this optimal adaptive control. This structure is not a standard one in the control systems literature. It is a hybrid continuous-time/discrete-time adaptive control structure which has continuous-time dynamics and a discrete-time sampled-data portion for policy evaluation.

The algorithm is suitable for online implementation from the control theory point of view since the control policy µ^{(i+1)}_L(x), updated at time t_{i+1} after observing the state x(t_{i+1}), will be used for controlling the system during the time interval (t_{i+1}, t_{i+1} + T]. The flowchart of the online algorithm is presented in Fig. 2.

Fig. 2. Flowchart of the online algorithm.

All the calculations involved are performed at a supervisory level which operates based on discrete-time data measured from the system. This high-level intelligent control structure implements the Policy Iteration algorithm and uses the Critic neural network to parameterize the performance of the continuous-time control system associated with a certain control policy. The high-level supervisory structure makes the decisions relative to the discrete-time moments at which both the Actor and the Critic parameters will be updated. The Actor neural network is part of the control system structure and performs continuous-time control, while its constant gain is updated at discrete moments in time. The algorithm converges to the solution of the continuous-time optimal control problem, as proved in Section 4, since the Critic update is based on the observations of the continuous-time cost over a finite sample interval. The net result is a continuous-time controller incorporated in a continuous-time/discrete-time adaptive structure, which includes the continuous-time dynamics of the cost function and operates based on sampled data, to perform the policy evaluation and policy update steps at discrete moments in time.

It is interesting to note the rough similarity between the above-mentioned adaptive controller structure and learning mechanisms in the mammal brain. The Critic structure learns, in an episodic manner and based on samples of the reward signal from the environment, the parameters of a function which describes the Actor performance. Once a performance evaluation episode is completed, the Critic passes this information to the Actor structure, which will use it to adapt for improved performance. At all times the Actor must perform continuous-time control for the system (the environment in which optimal behavior is sought). This description of the way in which the Actor/Critic structure works while searching for continuous-time optimal control policies points out the existence of two time scales for the mechanisms involved:

– a fast time scale which characterizes the continuous-time control process, and
– a slower time scale which characterizes the learning processes at the levels of the Critic and the Actor.

Thus the Actor and Critic structures perform tasks at different operation frequencies in relation with the nature of the task to be performed (i.e. learning or control). Evidence regarding the oscillatory behavior naturally characterizing biological neural systems is presented in a comprehensive manner in Levine, Brown, and Shirey (2000). Different oscillation frequencies are connected with the way in which different areas of the brain perform their functions of processing the information received from the sensors. Low-level control structures must quickly react to new information received from the environment while higher-level structures slowly evaluate the results associated with the present behavior policy.


In Section 4 it was shown that, using only limited information about the system, namely the states x measured from the sensors and the augmented system state V, extracted from the system only at specific time values (i.e. x(t), x(t + T) and V(t + T) − V(t)), the Critic is able to evaluate the infinite horizon continuous-time performance of the system associated with a given control policy described in terms of the Actor parameters. The Critic learns the cost function associated with a certain control behavior based on a computed temporal difference (TD) error signal, given by V(t + T) − V(t).

It is interesting to mention here that in a number of reports, e.g. Schultz (2004) and Schultz, Dayan and Montague (1997), it is argued that the temporal difference error between the received and the expected rewards is physically encoded in the dopamine signal produced by basal ganglia structures in the mammal brain. At the same time, it is known that the dopamine signal encoding the temporal difference error favors the learning process by increasing the synaptic plasticity of certain groups of neurons. We also note an interesting point presented in this issue (Perlovsky, in this issue) associating certain emotions with the need for cognition, i.e. emotions play the role of reinforcement signals which drive the need for cognition. This kind of reinforcement learning is located at a higher level than the dopamine-driven learning, suggesting that there exists a hierarchy of reinforcement-based learning mechanisms in the mammal brain.

The cost function solution, given by (21), can be obtained in real time, after a sufficient number of data points are collected along state trajectories in the region of interest Ω. In practice, the matrix inversion in (21) is not performed; the solution of the equation is obtained using algorithms that involve techniques such as Gaussian elimination, backsubstitution, and Householder reflections. One should note that the least-squares method for finding the parameters of the cost function can be replaced with any other suitable, recursive or non-recursive, method of parameter identification.

The iterations will be stopped (i.e. the Critic will stop updating the control policy) when the error between the system performance evaluated at two consecutive steps crosses below a designer-specified threshold. Also, when this error becomes bigger than the above-mentioned threshold, indicating a change in the system dynamics, the Critic will again take the decision to start tuning the Actor parameters.

We note again that no knowledge about the system dynamics f(x) is required for the evaluation of the cost or the update of the control policy. However, knowledge of the g(x) function is required for the update of the control policy, using (22), and this makes the online tuning algorithm only partially model free.

The next section presents simulation results which were obtained considering two second-order nonlinear systems.
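The hybrid loop described in this section can be sketched as follows. This is our own illustrative code, not the authors' implementation: the names are hypothetical, a simple forward Euler scheme replaces whatever integration the authors used, the sketch assumes a second-order system, data are gathered along short trajectories restarted at random points of Ω as in Section 6, and the critic update is the batch least-squares step of (21).

```python
import numpy as np

def run_online_pi(F, G, Q, R, phi, grad_phi, w0, T=0.1, dt=1e-3,
                  samples_per_update=30, samples_per_trajectory=6,
                  n_updates=20, tol=1e-4, seed=0):
    """Hybrid online policy iteration: continuous-time control plus sampled-data learning."""
    rng = np.random.default_rng(seed)
    R = np.atleast_2d(R)
    R_inv = np.linalg.inv(R)
    w = np.asarray(w0, dtype=float)

    def mu(x):                                        # Actor, eq. (22), with current weights
        return -0.5 * R_inv @ G(x).T @ grad_phi(x).T @ w

    x = rng.uniform(-1.0, 1.0, size=2)
    for _ in range(n_updates):
        phi_t, phi_tT, rho = [], [], []
        for k in range(samples_per_update):
            if k % samples_per_trajectory == 0:       # start a new trajectory in Omega
                x = rng.uniform(-1.0, 1.0, size=2)
            phi_t.append(phi(x))
            V = 0.0                                   # reset the value integrator
            for _ in range(int(T / dt)):              # control and V_dot = Q(x) + u^T R u
                u = mu(x)
                V += (Q(x) + u @ R @ u) * dt
                x = x + dt * (F(x) + G(x) @ u)
            phi_tT.append(phi(x))
            rho.append(V)                             # integral reinforcement over [t, t+T)
        dphi = np.array(phi_tT) - np.array(phi_t)
        w_new, *_ = np.linalg.lstsq(dphi, -np.array(rho), rcond=None)   # Critic, eq. (21)
        converged = np.linalg.norm(w_new - w) < tol   # supervisory stopping test
        w = w_new
        if converged:
            break
    return w
```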

    6. Simulation examples

In this section we illustrate the results of the adaptive optimal control algorithm considering two nonlinear systems for which the optimal cost function and optimal controller are known. The nonlinear system examples were developed using the converse HJB approach (Nevistic & Primbs, 1996), which allows construction of nonlinear systems, specified initially in a general form, starting from the known optimal cost function. In effect it solves the HJB equation conversely, given the optimal cost function, for the dynamics of the nonlinear system.

    6.1. Example 1

We consider the nonlinear system given by the equations

\dot{x}_1 = -x_1 + x_2
\dot{x}_2 = f(x) + g(x)u    (30)

with f(x) = -\frac{1}{2}(x_1 + x_2) + \frac{1}{2} x_2 \sin^2(x_1), g(x) = \sin(x_1).

If we define the infinite horizon cost function to be minimized as V^u(x(t)) = \int_t^{\infty} (Q(x) + u^2) \, d\tau with Q(x) = x_1^2 + x_2^2, then the optimal cost function for this system is V^*(x) = \frac{1}{2} x_1^2 + x_2^2 and the optimal controller is u^*(x) = -\sin(x_1) x_2.

The simulation was conducted using data obtained from the system every 0.1 s. We note here that the value of this sample time is not relevant for the cost function identification procedure. In fact, the data does not have to be measured with a fixed sample time, as long as it is suitable for learning (i.e. carries new information on the cost function to be identified). In this sense, as long as the measured signals have not reached steady state values, meaning that the measured data is not redundant, the sampling can be executed as quickly as the hardware permits. At the same time, as we used a batch method for identifying the cost function parameters in the least-squares sense, a larger number of samples will lead to a better approximation of the cost function. However, this batch procedure can be replaced with a recursive one, or a recursive procedure on time windows, such that the parameters of the cost function will be adapted over time as more data is acquired.

For the purpose of demonstrating the algorithm the initial state of the system is taken to be different from zero. For each iteration we considered data measured along five trajectories defined by five different initial conditions chosen randomly in Ω = \{-1 \le x_i \le 1; i = 1, 2\}. The initial stabilizing controller was taken as µ^{(0)}(x) = -\frac{3}{2} \sin(x_1)(x_1 + x_2). The cost function V^{µ(i)}(x) was approximated by the following smooth function, for x ∈ Ω: V^{µ(i)}_L(x) = (w^{µ(i)}_L)^T φ_L(x) with L = 3, w^{µ(i)}_3 = [w^{µ(i)}_1 \; w^{µ(i)}_2 \; w^{µ(i)}_3]^T and φ_3(x) = [x_1^2 \; x_1 x_2 \; x_2^2]^T.

In order to solve online for the neural network weights w^{µ(i)}_3 which parameterize the cost function, at each iteration step we set up a least-squares problem with the solution given by (21). At each iteration step we solved for w^{µ(i)}_3 using 30 data points, each consisting of the measured cost associated with the given control policy over a time interval of T = 0.1 s together with the system state at the beginning and at the end of that interval; six points were measured over each of the five trajectories in the state space. In this way, every 3 s the cost function was solved for and a policy update was performed. The result of applying the algorithm is presented in Fig. 3.

Fig. 3. Convergence of the Critic parameters.

One can see from the figure that the parameters of the Critic converged to the coefficients of the optimal cost function V^*(x) = \frac{1}{2} x_1^2 + x_2^2, i.e. w^{u^*}_3 = [0.5 \; 0 \; 1]^T.
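For reference, the Example 1 problem data can be written out as below and passed to an online loop such as the run_online_pi sketch at the end of Section 5 (the code and names are ours; only the dynamics, cost, basis and reported optimal weights come from the paper).

```python
import numpy as np

# Example 1 (Section 6.1): dynamics, cost and critic basis.
F = lambda x: np.array([-x[0] + x[1],
                        -0.5 * (x[0] + x[1]) + 0.5 * x[1] * np.sin(x[0]) ** 2])
G = lambda x: np.array([[0.0], [np.sin(x[0])]])
Q = lambda x: x[0] ** 2 + x[1] ** 2
R = np.array([[1.0]])

phi      = lambda x: np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])      # phi_3(x)
grad_phi = lambda x: np.array([[2 * x[0], 0.0],
                               [x[1],     x[0]],
                               [0.0,      2 * x[1]]])

# Critic weights that reproduce the initial policy mu^(0)(x) = -(3/2) sin(x1)(x1 + x2)
# through (22): -(1/2) sin(x1)(w2*x1 + 2*w3*x2) = -(3/2) sin(x1)(x1 + x2) => w2 = 3, w3 = 1.5
# (w1 does not influence the policy and is set to zero here).
w0 = np.array([0.0, 3.0, 1.5])

# The paper reports convergence of the critic parameters to the coefficients of
# V*(x) = 0.5*x1^2 + x2^2, i.e. w* = [0.5, 0, 1].
w_star = np.array([0.5, 0.0, 1.0])
```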


    6.2. Example 2

    In this example we present the results obtained for a systemwhichhas stronger nonlinearities andquartic cost.We consider thenonlinear system given by the equations{ẋ1 = −x1 + x2 + 2x32ẋ2 = f (x)+ g(x)u

    (31)

    with f (x) = − 12 (x1 + x2)+12x2(1+ 2x

    22) sin

    2(x1), g(x) = sin(x1).If we define Q (x) = x21 + x

    22 + 2x

    42 the infinite horizon cost

    function to be minimized then the optimal cost function for thissystem is V ∗(x) = 12x

    21 + x

    22 + x

    42 and the optimal controller is

    u∗(x) = − sin(x1)(x2 + 2x32).The simulation was conducted using data obtained from the

The simulation was conducted using data obtained from the system at every 0.1 s. For each iteration we considered data measured along five trajectories defined by five different initial conditions chosen randomly in $\Omega = \{-1 \le x_i \le 1;\ i = 1, 2\}$. The initial stabilizing controller was taken as $\mu^{(0)}(x) = -\tfrac{1}{2}\sin(x_1)(3x_2 - 0.2x_1^2 x_2 + 12x_2^3)$. The cost function $V^{\mu^{(i)}}(x)$ was approximated on $\Omega$ as $V_L^{\mu^{(i)}}(x) = (w_L^{\mu^{(i)}})^T \varphi_L(x)$ with $L = 8$, $w_8^{\mu^{(i)}} = [w_1^{\mu^{(i)}}\ \cdots\ w_8^{\mu^{(i)}}]^T$ and $\varphi_8(x) = [x_1^2\ \ x_1 x_2\ \ x_2^2\ \ x_1^4\ \ x_1^3 x_2\ \ x_1^2 x_2^2\ \ x_1 x_2^3\ \ x_2^4]^T$.

At each iteration step we solved for $w_8^{\mu^{(i)}}$ using 40 data points, i.e. eight points measured on each of the five trajectories in $\Omega$. Each data point consists of the cost measured for the present control policy over one time interval of length $T = 0.1$ s together with the system states at the two ends of this interval. In this way, every 4 s the cost function was solved for and a policy update was performed. One notes that the data set measured on a single trajectory is in itself sufficient to identify the parameters of a cost function fit for that trajectory; however, the parameters fitted along one trajectory are often not equal to those fitted along another, which is why data from all five trajectories is used. The result of applying the algorithm is presented in Fig. 4.

The figure clearly shows that the parameters of the Critic neural network converged to the coefficients of the optimal cost function $V^*(x) = \tfrac{1}{2}x_1^2 + x_2^2 + x_2^4$, i.e. $w_8^{u^*} = [0.5\ \ 0\ \ 1\ \ 0\ \ 0\ \ 0\ \ 0\ \ 1]^T$. One observes that after three iteration steps the parameters of the controller, obtained based on the update Eq. (22), are very close to the parameters of the optimal controller $u^*(x) = -\sin(x_1)(x_2 + 2x_2^3)$.
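A sketch of how such a controller update could be computed from the fitted critic weights is given below (our Python code, not the authors'; the update follows the general form $\mu_L^{(i+1)}(x) = -\tfrac{1}{2}R^{-1}g(x)^T \nabla\varphi_8(x)^T w$, again assuming $R = 1$).

```python
import numpy as np

def grad_phi8(x):
    """Jacobian (8 x 2) of the basis phi8 used in Example 2."""
    x1, x2 = x
    return np.array([
        [2 * x1,            0.0              ],   # x1^2
        [x2,                x1               ],   # x1*x2
        [0.0,               2 * x2           ],   # x2^2
        [4 * x1**3,         0.0              ],   # x1^4
        [3 * x1**2 * x2,    x1**3            ],   # x1^3*x2
        [2 * x1 * x2**2,    2 * x1**2 * x2   ],   # x1^2*x2^2
        [x2**3,             3 * x1 * x2**2   ],   # x1*x2^3
        [0.0,               4 * x2**3        ],   # x2^4
    ])

def policy_from_critic(x, w, R=1.0):
    """Improved policy mu_L^{(i+1)}(x) = -1/(2R) * g(x)^T * dphi(x)^T * w."""
    g = np.array([0.0, np.sin(x[0])])
    grad_V = grad_phi8(x).T @ w          # approximate gradient of the cost
    return -0.5 / R * g @ grad_V
```

With the converged weights $[0.5\ \ 0\ \ 1\ \ 0\ \ 0\ \ 0\ \ 0\ \ 1]^T$ this expression reduces to $-\sin(x_1)(x_2 + 2x_2^3)$, matching the optimal controller quoted above.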

    7. Conclusions

In this paper we have presented a continuous-time adaptive controller, based on Policy Iteration, which adapts online to learn the continuous-time optimal control policy without using knowledge of the internal dynamics of the nonlinear system. Convergence of the proposed algorithm to the solution of the optimal control problem has been established under the condition that an initial stabilizing controller is available. The paper also presents proofs of convergence for the online neural-network-based version of the algorithm which take into account the neural network approximation error. Simulation results support the effectiveness of the online adaptive optimal controller.

    Acknowledgements

This work was supported by the National Science Foundation ECCS-0801330 and the Army Research Office W91NF-05-1-0314.

    Fig. 4. Convergence of the Critic parameters.

    Appendix

Proof of Lemma 1. Since $\mu^{(i)} \in \Psi(\Omega)$, the function $V^{\mu^{(i)}} \in C^1(\Omega)$ defined as $V^{\mu^{(i)}}(x(t)) = \int_t^{\infty} r(x(s), \mu^{(i)}(x(s)))\,ds$ is a Lyapunov function for the system $\dot{x}(t) = f(x(t)) + g(x(t))\mu^{(i)}(x(t))$, and it satisfies
$$(\nabla V_x^{\mu^{(i)}})^T \bigl(f(x) + g(x)\mu^{(i)}(x)\bigr) = -r(x(t), \mu^{(i)}(x(t))) \qquad (32)$$
with $r(x(t), \mu^{(i)}(x(t))) > 0$ for $x(t) \neq 0$. Integrating (32) over the time interval $[t, t+T]$, one obtains
$$V^{\mu^{(i)}}(x(t)) = \int_t^{t+T} r(x(s), \mu^{(i)}(x(s)))\,ds + V^{\mu^{(i)}}(x(t+T)). \qquad (33)$$
This means that the unique solution of (12), $V^{\mu^{(i)}}$, also satisfies (33). To complete the proof we have to show that Eq. (33) has a unique solution. The proof is by contradiction.

Thus, assume that there exists another cost function $V \in C^1(\Omega)$ which satisfies (14) with the end condition $V(0) = 0$. This cost function also satisfies $\dot{V}(x(t)) = -r(x(t), \mu^{(i)}(x(t)))$. Subtracting this relation from (32) we obtain
$$\left(\frac{d[V(x(t)) - V^{\mu^{(i)}}(x(t))]}{dx}\right)^{T}\!\dot{x} = \left(\frac{d[V(x(t)) - V^{\mu^{(i)}}(x(t))]}{dx}\right)^{T}\!\bigl(f(x(t)) + g(x(t))\mu^{(i)}(x(t))\bigr) = 0, \qquad (34)$$
which must hold for any $x$ on the system trajectories generated by the stabilizing policy $\mu^{(i)}$. Thus $V(x(t)) = V^{\mu^{(i)}}(x(t)) + c$. As this relation must also hold for $x(t) = 0$, then $V(0) = V^{\mu^{(i)}}(0) + c$, hence $c = 0$ and thus $V(x(t)) = V^{\mu^{(i)}}(x(t))$, i.e. Eq. (9) has a unique solution, which is equal to the unique solution of (12). □
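As a quick illustration of the interval relation (33) (our example, not part of the paper's proof), take the scalar closed-loop system $\dot{x} = -x$ with reward $r(x) = x^2$, so that $x(s) = x(t)e^{-(s-t)}$. Then
$$V(x(t)) = \int_t^{\infty} x(s)^2\,ds = \frac{x(t)^2}{2},$$
and indeed, for any $T > 0$,
$$\int_t^{t+T} x(t)^2 e^{-2(s-t)}\,ds + V(x(t+T)) = \frac{x(t)^2}{2}\bigl(1 - e^{-2T}\bigr) + \frac{x(t)^2 e^{-2T}}{2} = \frac{x(t)^2}{2} = V(x(t)).$$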

Proof of Lemma 3. The proof is by contradiction. The vector field $\dot{x} = f(x) + g(x)\mu(x)$ is asymptotically stable. Denote by $\eta(\tau; x(t), \mu)$, $x(t) \in \Omega$, the system trajectories obtained using the policy $\mu(x)$ for any $x(t) \in \Omega$. Then along the system trajectories we have that
$$\phi(x(t+T)) - \phi(x(t)) = \int_t^{t+T} \nabla\phi_x^T (f + g\mu)(\eta(\tau; x(t), \mu))\,d\tau. \qquad (35)$$

Suppose that the lemma is not true; then for all $T > 0$ there exists a nonzero constant vector $c \in \mathbb{R}^N$ such that $c^T[\phi(x(t+T)) - \phi(x(t))] \equiv 0$ for all $x(t) \in \Omega$. This implies that, for all $T > 0$, $c^T \int_t^{t+T} \nabla\phi_x^T (f + g\mu)(\eta(\tau; x(t), \mu))\,d\tau \equiv 0$ and thus, for all $x(t) \in \Omega$, $c^T \nabla\phi_x^T (f + g\mu)(\eta(\tau; x(t), \mu)) \equiv 0$. This means that the set $\{\nabla\phi_j^T (f + g\mu)\}_1^N$ is not linearly independent, contradicting Lemma 2. Thus there exists $T > 0$ such that for all $x(t_0) \in \Omega$ the set $\{\bar{\phi}_j(x(t_0), T)\}_1^N$ is also linearly independent. □
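In implementation terms (our remark, under the assumption that $\bar{\phi}_j(x(t), T)$ denotes the interval difference $\phi_j(x(t)) - \phi_j(x(t+T))$ used as the least-squares regressor), the excitation condition of Lemma 3 can be monitored online by checking that the stacked regressor matrix has full column rank before solving:

```python
import numpy as np

def regressors_are_exciting(A):
    """A: matrix whose rows are phi(x_t) - phi(x_{t+T}) for the collected samples.

    Full column rank means the critic weights are uniquely determined
    by the least-squares problem built from these samples.
    """
    return np.linalg.matrix_rank(A) == A.shape[1]
```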

Proof of Corollary 2. Consider the function $V^{\mu^{(i)}}$, which is a Lyapunov function for the system (1) with control policy $\mu^{(i)}$. Taking the derivative of $V^{\mu^{(i)}}$ along the trajectories generated by the controller $\mu_L^{(i+1)}(x)$, we obtain
$$\dot{V}^{\mu^{(i)}} = (\nabla V_x^{\mu^{(i)}})^T \bigl(f(x) + g(x)\mu_L^{(i+1)}(x)\bigr). \qquad (36)$$
We also have that $\dot{V}^{\mu^{(i)}} = (\nabla V_x^{\mu^{(i)}})^T (f(x) + g(x)\mu^{(i)}(x)) = -Q(x) - (\mu^{(i)}(x))^T R\,\mu^{(i)}(x)$ and thus $(\nabla V_x^{\mu^{(i)}})^T f(x) = -(\nabla V_x^{\mu^{(i)}})^T g(x)\mu^{(i)}(x) - Q(x) - (\mu^{(i)}(x))^T R\,\mu^{(i)}(x)$. With this, (36) becomes
$$\dot{V}^{\mu^{(i)}} = -Q(x) - (\mu^{(i)}(x))^T R\,\mu^{(i)}(x) - (\nabla V_x^{\mu^{(i)}})^T g(x)\bigl(\mu^{(i)}(x) - \mu_L^{(i+1)}(x)\bigr). \qquad (37)$$
Using the controller update $\mu^{(i+1)}(x) = -\tfrac{1}{2}R^{-1}g(x)^T \nabla V_x^{\mu^{(i)}}$, we can write $(\nabla V_x^{\mu^{(i)}})^T g(x) = -2(\mu^{(i+1)}(x))^T R$. Thus (37) becomes
$$\begin{aligned}
\dot{V}^{\mu^{(i)}} &= -Q - (\mu^{(i)})^T R\,\mu^{(i)} + 2(\mu^{(i+1)})^T R\,(\mu^{(i)} - \mu_L^{(i+1)})\\
&= -Q - (\mu^{(i)} - \mu^{(i+1)})^T R\,(\mu^{(i)} - \mu^{(i+1)}) - (\mu_L^{(i+1)})^T R\,\mu_L^{(i+1)}\\
&\quad + (\mu^{(i+1)} - \mu_L^{(i+1)})^T R\,(\mu^{(i+1)} - \mu_L^{(i+1)}). \qquad (38)
\end{aligned}$$
Since $\sup_{x\in\Omega}\bigl|\mu_L^{(i+1)}(x) - \mu^{(i+1)}(x)\bigr| \to 0$ as $L \to \infty$, there exists an $L_0$ such that for all $L > L_0$, $\dot{V}^{\mu^{(i)}} < 0$, which means that $V^{\mu^{(i)}}$ is a Lyapunov function for the system with control policy $\mu_L^{(i+1)}(x)$, which proves the corollary. □
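For readability, the step between the two lines of (38) is the following completing-the-square identity (our notation: $a = \mu^{(i)}$, $b = \mu^{(i+1)}$, $c = \mu_L^{(i+1)}$, with $R = R^T > 0$):
$$-a^T R a + 2b^T R (a - c) = -(a-b)^T R (a-b) - c^T R c + (b-c)^T R (b-c),$$
which can be verified by expanding the right-hand side: the cross terms $2b^T R a$ and $-2b^T R c$ are reproduced and all remaining quadratic terms cancel.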

