

Optimal Control for Unknown Discrete-Time Nonlinear Markov Jump Systems Using Adaptive Dynamic Programming

Xiangnan Zhong, Haibo He, Senior Member, IEEE, Huaguang Zhang, Senior Member, IEEE, and Zhanshan Wang, Member, IEEE

Abstract— In this paper, we develop and analyze an optimal control method for a class of discrete-time nonlinear Markov jump systems (MJSs) with unknown system dynamics. Specifically, an identifier is established for the unknown systems to approximate the system states, and an optimal control approach for nonlinear MJSs is developed to solve the Hamilton–Jacobi–Bellman equation based on the adaptive dynamic programming technique. We also develop a detailed stability analysis of the control approach, including the convergence of the performance index function for nonlinear MJSs and the existence of the corresponding admissible control. Neural network techniques are used to approximate the proposed performance index function and the control law. To demonstrate the effectiveness of our approach, three simulation studies, one linear case, one nonlinear case, and one single link robot arm case, are used to validate the performance of the proposed optimal control method.

Index Terms— Adaptive dynamic programming (ADP), Markov jump systems (MJSs), neural network, optimal control, state identifier.

I. INTRODUCTION

MARKOV jump systems (MJSs) have witnessed extensive studies in recent years because of their powerful modeling capability for power systems, network control systems, and manufacturing systems [1], [2]. These systems include abrupt variations in their structures due to sudden environmental disturbances and subsystem interconnection variations. Therefore, these systems are inherently vulnerable to component failures or repairs and are difficult to model. Due to the wide spectrum of applications of MJSs, there has been extensive research on stability analysis [3], controller design [4], [5], and filtering [6], [7].

Manuscript received May 10, 2013; revised January 29, 2014; accepted February 4, 2014. This work was supported in part by the National Science Foundation under Grant ECCS 1053717, in part by the Army Research Office under Grant W911NF-12-1-0378, in part by the NSF-DFG Collaborative Research on Autonomous Learning under Grant CNS 1117314, in part by the National Natural Science Foundation of China under Grant 51228701 and Grant 61034005, and in part by the IAPI Fundamental Research Funds under Grant 2013ZCX01-07.

X. Zhong and H. He are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]; [email protected]).

H. Zhang and Z. Wang are with the School of Information Science and Engineering, Northeastern University, Shenyang 110006, China, and also with the State Key Laboratory of Integrated Automation of Process Industry Technology and Research Center of National Metallurgical Automation, Shenyang 110004, China (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2014.2305841

So far, many of the studies focus on solving the optimal control problem based on accurate MJS functions and/or models [8], [9]. However, in many real-world applications, complete knowledge of the system functions is often infeasible or very difficult to obtain.

This paper develops a method for a class of unknown discrete-time nonlinear MJSs using the adaptive dynamic programming (ADP) technique. Specifically, we propose an optimal control scheme that converts the MJS control problem with multiple subsystems into a single-objective optimal control problem. That is, the performance index functions of all the subsystems in the MJSs are combined into one performance index function depending on the Markov chain and the weighted sum technique. The major contributions of this paper include the following.

1) The ADP technique is introduced into the field of MJSs to solve this kind of problem without knowledge of the system functions. Unlike traditional methods, such as the linear matrix inequality technique, our ADP-based approach can adapt to and learn the system dynamics, meaning it can still find the near-optimal controller even if the system parameters change.

2) A theoretical analysis is developed in this paper, focused on the stability of the proposed ADP approach for MJSs. The convergence of the proposed performance index function and the existence of the corresponding control law are established. These results are also verified by the numerical examples.

3) A state identifier is established to enable the control process without requiring the system dynamics. This is important, as the system functions and dynamics are usually difficult to obtain for some nonlinear systems.

The rest of this paper is organized as follows. In Section II, we review the related work, highlighting the state-of-the-art research on this topic. In Section III, we formulate the MJS problem analyzed in this paper. Section IV establishes the state identifier using a three-layer neural network and provides the corresponding stability proof. Section V presents the convergence analysis of the performance index function and the existence of the corresponding admissible control law for the proposed optimal control scheme. Actor-critic networks


are used to implement this ADP scheme in Section VI. Three numerical examples, including a linear one, a nonlinear one, and a single link robot arm case, are presented in Section VII to demonstrate the validity of the proposed approach. Finally, Section VIII concludes this paper.

II. RELATED WORK

The study of MJSs has attracted considerable attention in recent years. Most results on MJSs are obtained under full information of the system dynamics, but in many practical situations the system dynamics cannot easily be obtained exactly. To address this problem, Chen et al. [5] designed a memoryless state feedback controller for uncertain MJSs to guarantee that the closed-loop cost function value was not more than a specified level of performance for any admissible uncertainties. In [10], an optimal estimator for the current state was designed according to current and past observations to cope with varying system parameters. Farias et al. [11] introduced the ADP method into the stochastic control problem and approximated the optimal control law via linear programming. In [12], they defined this algorithm as approximate linear programming and provided detailed theoretical and simulation results.

Taking advantage of solving the problem without knowledge of the system functions, ADP has attracted significantly increasing attention from both theoretical research and real-world applications [13]–[18] over the past decades by attempting to obtain approximate solutions of the Hamilton–Jacobi–Bellman (HJB) equations. It has been widely recognized that ADP could be one of the core methodologies to achieve optimal control of stochastic processes in a general case and to achieve brain-like intelligent control [19], [20]. Extensive efforts and promising results have been achieved over the past decades. Here, we highlight a few important ADP studies from the theoretical perspective that are closely related to the research presented in this paper; interested readers can refer to the two important handbooks on ADP for many other successful architectures, algorithms, models, and challenging engineering applications [21], [22]. For instance, Al-Tamimi et al. [23] provided the convergence of the value-iteration-based ADP algorithm for general discrete-time nonlinear systems. In [24], Abu-Khalaf et al. introduced a new generalized nonquadratic function into the performance index to evaluate the performance of systems with constrained control inputs. This idea of bounded control is also related to the work presented in [25] and [26], in which the authors focused on the optimal control problem for nonlinear systems with unknown perturbations. The optimal control problem with constrained input was also solved in [27] and [28] based on the ADP algorithm. In [29], the feasibility of using the solution of the optimal control problem to solve the robust control problem was shown for nonlinear systems with matched uncertainties. The author further extended the results to nonlinear systems with unmatched uncertainties in [30]. Zhong et al. [31] used an online neural network learning method to train the control law for the robust control problem. In [32], Wei and Liu proposed a new θ-ADP iterative algorithm to solve the optimal control problem of infinite-horizon discrete-time

nonlinear systems by finding a lower bound for the parameter θ to ensure the convergence of the algorithm. Motivated by these results, Liu and Wei [33] further developed convergence conditions for the situation in which the iterative control policy and iterative performance index cannot be obtained accurately. In [34], an optimal control scheme for unknown nonaffine nonlinear discrete-time systems using a cost function with a discount factor was developed and analyzed. For affine nonlinear systems, optimal control using general value iteration was provided in [35]. A new iterative ADP method was proposed to solve a class of nonlinear zero-sum differential games in [36] and [37] for the continuous- and discrete-time cases, respectively. Wei et al. [38] developed a numerical iterative ADP algorithm with convergence analysis. Moreover, adaptive critic techniques were also applied to engine torque and air-fuel ratio control [39] and tracking control [40]. From the architecture point of view, He et al. [41], [42] integrated a reference network into the classic ADP structure to adaptively establish an internal goal representation to facilitate optimal learning and control. They then used this new structure to solve the tracking control problem and obtained effective performance [43], [44]. This GrHDP approach was also applied to the maze navigation example and compared with many other reinforcement learning approaches in [45] and [46]. The hierarchical GrHDP architecture was further studied in [47] and [48]. Furthermore, due to the problem of (partially) unknown system dynamics, many researchers have developed different approaches to handle such partially observable situations [49], [50]. In a similar setting, Zhang et al. [51] employed a model network based on a recurrent neural network structure to reconstruct the unknown system dynamics for nonlinear systems.

Motivated by the research presented in [23], [32]–[34], and [51], in this paper we are interested in the problem of stability analysis for a class of unknown discrete-time nonlinear systems with Markovian jumping parameters using the ADP technique. Note that, in [11] and [12], Farias et al. also solved the stochastic control problem based on ADP. In particular, they combined the linear programming approach with the ADP technique as the approximate linear programming algorithm, which is used to approximate the optimal control law for the stochastic system. In our current design, the actor-critic networks are used to implement the proposed ADP approach. In other words, we use the neural network technique to estimate the control law and the performance index function iteratively.

III. PROBLEM STATEMENT

Consider the unknown discrete-time nonlinear Markov jump systems (MJSs) of the following form:

$$x_{k+1} = f_i(x_k) + g_i(x_k) u_k \quad (1)$$

where $x_k \in \mathbb{R}^n$ denotes the system state with initial value $x_0$, $u_k \in \mathbb{R}^l$ is the system input, and $i$ is the simplified notation for a discrete-time Markov chain $\{r_k\}$ taking values in a finite state space $S = \{1, 2, \ldots, m\}$, where $m$ is the number of subsystems. Assume that $f + gu$ is Lipschitz continuous on a set $\Omega \subseteq \mathbb{R}^n$ containing the origin. $f_i(x_k)$ and $g_i(x_k)$ are the unknown discrete-time system functions, and


$f_i(0) = 0$ and $g_i(0) = 0$, which means the system state $x_k = 0$ is an equilibrium point of system (1) under the control $u_k = 0$.

Define the transition probability matrix for the discrete-time MJSs as

$$H = \begin{pmatrix} \pi_{11} & \pi_{12} & \cdots & \pi_{1m} \\ \pi_{21} & \pi_{22} & \cdots & \pi_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \pi_{m1} & \pi_{m2} & \cdots & \pi_{mm} \end{pmatrix}. \quad (2)$$

The elements of (2) can be expressed by

$$\pi_{ab} = \Pr\{r_{k+1} = b \mid r_k = a\} \quad (3)$$

which denotes the probability of the next system mode $b$ given the current mode $a$. Therefore, $\pi_{ab} \ge 0$, $\forall a, b \in S$, and for each subsystem $a$, $\sum_{b=1}^{m} \pi_{ab} = 1$.
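As a small illustration of (3), the sketch below samples a mode sequence $\{r_k\}$ from a transition matrix $H$. It is not part of the paper's algorithm; the two-mode $H$ and all names are illustrative assumptions (modes are 0-indexed in the code).

```python
import numpy as np

def sample_modes(H, r0, steps, rng=None):
    """Sample a Markov mode sequence: row a of H is Pr{r_{k+1}=b | r_k=a}."""
    rng = rng or np.random.default_rng()
    m = H.shape[0]
    modes = [r0]
    for _ in range(steps):
        # draw the next mode from the row of the current mode
        modes.append(int(rng.choice(m, p=H[modes[-1]])))
    return modes

H = np.array([[0.2, 0.8],
              [0.4, 0.6]])   # a two-mode chain, cf. (73) later in the paper
print(sample_modes(H, r0=0, steps=10))
```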

Assume that the MJSs (1) are completely controllable and bounded on $\Omega \subseteq \mathbb{R}^n$. The performance index function for each subsystem can be described by

$$J_i(x_k) = \sum_{z=k}^{\infty} U_i(x_z, u_z) \quad (4)$$

where the utility function can be chosen as

$$U_i(x_k, u_k) = Q_i(x_k) + u_k^T R_i u_k \quad (5)$$

in which $Q_i(x_k)$ and $R_i$ are positive definite. This means $U_i(x_k, u_k)$ is positive definite, i.e., $U_i(0, 0) = 0$ if and only if $x_k = 0$ and $u_k = 0$; otherwise, $U_i(x_k, u_k) > 0$.

An equivalent form of (4) is given by the Bellman equation

$$J_i(x_k) = U_i(x_k, u_k) + \sum_{z=k+1}^{\infty} U_i(x_z, u_z) = U_i(x_k, u_k) + J_i(x_{k+1}) \quad (6)$$

where $i \in S$.

The purpose of this paper is to find the optimal control law $u_k^*$ that minimizes the performance index function of the whole MJSs and stabilizes the MJSs. However, due to the existence of the transition probabilities, we cannot simply add all the performance index functions of the subsystems together to form the final one for the MJSs. Here, we reconstruct the performance index functions (4) of the subsystems using the transition probability matrix (2) as follows:

$$\begin{cases} J_I(x_k) = \pi_{11} J_1(x_k) + \pi_{12} J_2(x_k) + \cdots + \pi_{1m} J_m(x_k) \\ J_{II}(x_k) = \pi_{21} J_1(x_k) + \pi_{22} J_2(x_k) + \cdots + \pi_{2m} J_m(x_k) \\ \quad\vdots \\ J_M(x_k) = \pi_{m1} J_1(x_k) + \pi_{m2} J_2(x_k) + \cdots + \pi_{mm} J_m(x_k). \end{cases} \quad (7)$$

Hence, we transform the MJS control problem into a multiobjective optimal control problem. Using the weighted sum technique, we then convert this multiobjective optimal control problem into a single-objective optimization problem. The performance index function can be rewritten as

$$J(x_k) = \omega_1 J_I(x_k) + \omega_2 J_{II}(x_k) + \cdots + \omega_m J_M(x_k) \quad (8)$$

where $\omega_i > 0$ are the weights and $\sum_{i=1}^{m} \omega_i = 1$.

Therefore, the control vector $u_k$ must be found to minimize the performance index function (8). Note that, for optimal control problems, the designed control law must not only stabilize the system on the compact set $\Omega$ but also guarantee that (8) is finite, which means the control must be admissible.

Definition 1 ([23], [24]) (Admissible Controls): A control law $u_k$ is said to be admissible with respect to (8) on $\Omega$ if $u_k$ is continuous on $\Omega$, $u_k$ stabilizes system (1) for all $x_0 \in \Omega$, $u_k = 0$ if $x_k = 0$, and $J(x_k)$ is finite $\forall x_k \in \Omega$.

Equation (8) can be expanded as

$$\begin{aligned}
J(x_k) &= \omega_1 J_I(x_k) + \omega_2 J_{II}(x_k) + \cdots + \omega_m J_M(x_k) \\
&= \omega_1 (\pi_{11} J_1(x_k) + \pi_{12} J_2(x_k) + \cdots + \pi_{1m} J_m(x_k)) \\
&\quad + \omega_2 (\pi_{21} J_1(x_k) + \pi_{22} J_2(x_k) + \cdots + \pi_{2m} J_m(x_k)) \\
&\quad + \cdots + \omega_m (\pi_{m1} J_1(x_k) + \cdots + \pi_{mm} J_m(x_k)) \\
&= (\omega_1 \pi_{11} + \omega_2 \pi_{21} + \cdots + \omega_m \pi_{m1}) J_1(x_k) \\
&\quad + (\omega_1 \pi_{12} + \omega_2 \pi_{22} + \cdots + \omega_m \pi_{m2}) J_2(x_k) \\
&\quad + \cdots + (\omega_1 \pi_{1m} + \cdots + \omega_m \pi_{mm}) J_m(x_k) \\
&= D_1 J_1(x_k) + D_2 J_2(x_k) + \cdots + D_m J_m(x_k) \\
&= \sum_{i=1}^{m} D_i J_i(x_k) \\
&= \sum_{i=1}^{m} D_i \left( \sum_{z=k}^{\infty} \left( Q_i(x_z) + u_z^T R_i u_z \right) \right) \\
&= \sum_{i=1}^{m} \sum_{z=k}^{\infty} D_i \left( Q_i(x_z) + u_z^T R_i u_z \right) \quad (9)
\end{aligned}$$

where $D_i = \sum_{j=1}^{m} \omega_j \pi_{ji} > 0$. Hence, (9) is positive definite, i.e., the obtained performance index function $J(x_k)$ is positive definite, and it can therefore serve as a Lyapunov function. Equation (9) can be rewritten as

$$\begin{aligned}
J(x_k) &= \sum_{i=1}^{m} D_i \left( Q_i(x_k) + u_k^T R_i u_k \right) + \sum_{i=1}^{m} \sum_{z=k+1}^{\infty} D_i \left( Q_i(x_z) + u_z^T R_i u_z \right) \\
&= \sum_{i=1}^{m} D_i U_i(x_k, u_k) + \sum_{i=1}^{m} D_i J_i(x_{k+1}) \\
&= D^T T(x_k, u_k) + J(x_{k+1}) \quad (10)
\end{aligned}$$

where

$$D = (D_1, D_2, \ldots, D_m)^T$$
$$T(x_k, u_k) = (U_1(x_k, u_k), U_2(x_k, u_k), \ldots, U_m(x_k, u_k))^T.$$

Let us define a stochastic operator $P$ by

$$P J(x_k) = \min_{u_k} \{ D^T T(x_k, u_k) + J(x_{k+1}) \} \quad (11)$$

where the minimization is carried out componentwise. ADP involves the solution of Bellman's equation

$$J(x_k) = P J(x_k). \quad (12)$$
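To make (10) and (11) concrete, the sketch below computes $D = H^T \omega$ and evaluates the combined one-step cost $D^T T(x_k, u_k)$ for quadratic utilities $U_i = x^T Q_i x + u^T R_i u$ (the special case used in the simulations of Section VII). All numeric values and names are illustrative assumptions, not the paper's data.

```python
import numpy as np

def combined_cost(H, w, Q_list, R_list, x, u):
    """One-step cost D^T T(x, u) of (10), with D_i = sum_j w_j * pi_{ji}."""
    D = H.T @ w                                   # D = H^T * omega
    # subsystem utilities U_i(x, u) = x^T Q_i x + u^T R_i u
    T = np.array([x @ Q @ x + u @ R @ u for Q, R in zip(Q_list, R_list)])
    return D @ T

H = np.array([[0.2, 0.8], [0.4, 0.6]])            # example transition matrix
w = np.array([0.3, 0.7])                          # weights, sum to 1
Q_list, R_list = [np.eye(2), np.eye(2)], [np.eye(1), np.eye(1)]
print(combined_cost(H, w, Q_list, R_list, np.array([1.0, -0.5]), np.array([0.2])))
```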


According to Bellman’s optimality principle, the uniquesolution J ∗(xk) of (12) is the optimal performance indexfunction and satisfies the discrete-time HJB equation

J ∗(xk) = minuk

{DT T (xk, uk) + J ∗(xk+1)}. (13)

Here, we assume that the minimum on the right-hand side of (13) exists and is unique [18]. Therefore, the optimal control $u_k^*$ satisfies the first-order necessary condition, which is given by the gradient of the right-hand side of (13) with respect to $u_k$:

$$\frac{\partial \left( D^T T(x_k, u_k) \right)}{\partial u_k} + \left( \frac{\partial x_{k+1}}{\partial u_k} \right)^T \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}} = 0 \quad (14)$$

and therefore the optimal control law is obtained as

$$u_k^* = \arg\min_{u_k} \{ D^T T(x_k, u_k) + J^*(x_{k+1}) \} = -\frac{1}{2} \left( \sum_{i=1}^{m} D_i R_i \right)^{-1} g_i^T(x_k) \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}} \quad (15)$$

where $J^*(x_k)$ is the solution of the following HJB equation:

$$\begin{aligned}
J^*(x_k) = \sum_{i=1}^{m} D_i Q_i(x_k) &+ \frac{1}{4} \left( g^T(x_k) \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}} \right)^T \left( \sum_{i=1}^{m} D_i R_i \right)^{-1} g^T(x_k) \frac{\partial J^*(x_{k+1})}{\partial x_{k+1}} \\
&+ J^*(x_{k+1}). \quad (16)
\end{aligned}$$

IV. IDENTIFIER DESIGN BY NEURAL NETWORK

In this paper, we solve the optimal control problem of the MJSs without knowledge of the system functions; therefore, we need to design an identifier to approximate the system states for each subsystem. Motivated by the identification idea in [51], we extend the approach by using $x_k$ and $x_{k-1}$ to estimate the next state $x_{k+1}$. A three-layer neural network is considered as the function approximation structure. The weight matrix between the input and the hidden layers is denoted by a constant matrix $W_1$, and the weight matrix between the hidden and the output layers is denoted by $\hat{W}_2$. In this paper, the input-to-hidden weight matrix $W_1$ is chosen randomly at initialization, and only the output-layer weight matrix $\hat{W}_2$ is updated during the training process.

We define the identification scheme as follows:

$$\hat{x}_{k+1} = \hat{W}_2^T(k) \Phi(h_k) \quad (17)$$

where $\hat{x}_{k+1}$ is the estimated value of $x_{k+1}$ and $h_k = W_1^T [x_k^T \; x_{k-1}^T \; u_k^T]^T$. $\Phi(h_k)$ is the bounded activation function, i.e., $\|\Phi(h_k)\| \le \Phi_m$. Here, we let

$$\Phi(\cdot) = \frac{1 - e^{-(\cdot)}}{1 + e^{-(\cdot)}}. \quad (18)$$

According to the universal approximation property of neural networks, the identified subsystem of MJSs (1) has a neural network representation, which can be described by

$$x_{k+1} = W_2^{*T} \Phi(h_k) + \delta_k \quad (19)$$

where $W_2^*$ is the ideal weight matrix between the hidden and the output layers, and $\delta_k$ is the bounded neural network approximation error.

Furthermore, the weight matrix error is denoted by

$$\tilde{W}_2(k) = \hat{W}_2(k) - W_2^* \quad (20)$$

and the system identification error is defined by

$$e_{k+1} = \hat{x}_{k+1} - x_{k+1}. \quad (21)$$

Combining (17), (19), and (21), we obtain

$$e_{k+1} = \hat{W}_2^T(k) \Phi(h_k) - W_2^{*T} \Phi(h_k) - \delta_k = \tilde{W}_2^T(k) \Phi(h_k) - \delta_k. \quad (22)$$

The following lemma is used to prove the convergence of the weight matrix error (20) and the identification error (21).

Lemma 1 [52] (Cauchy–Schwarz Inequality): Let A and B be vectors of the same dimension. Then A and B satisfy

$$|\langle A, B \rangle|^2 \le \|A\|^2 \cdot \|B\|^2 \quad (23)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product.

Now, we show that the estimation errors of the identifier design are convergent using a Lyapunov stability argument.

Theorem 1: Define the identification scheme as (17), with the weight update law

$$\hat{W}_2(k+1) = \hat{W}_2(k) - \beta \Phi(h_k) e_{k+1}^T \quad (24)$$

where $\beta > 0$ is the learning rate of the neural network, which satisfies the condition

$$0 < \beta \Phi_m^2 < \min\left\{ \frac{1}{2}, \left( \frac{1}{2\xi} - \frac{1}{2} \right) \right\} \quad (25)$$

in which $0 < \xi < 1$ is a constant. Moreover, the neural network approximation error $\delta_k$ is assumed to be upper bounded by a function of the identification error such that

$$\delta_k^T \delta_k \le \xi e_k^T e_k. \quad (26)$$

Then, the identification error $e_{k+1}$ in (22) is asymptotically stable and the weight estimation error $\tilde{W}_2(k)$ in (20) is bounded, which means the proposed state identifier is feasible.

Proof: Consider the following Lyapunov function:

$$V(k) = V_1(k) + V_2(k) \quad (27)$$

where

$$V_1(k) = e_k^T e_k \quad (28)$$
$$V_2(k) = \sigma \cdot \mathrm{tr}\{ \tilde{W}_2^T(k) \tilde{W}_2(k) \} \quad (29)$$

in which $\sigma > 0$ is a constant parameter. Hence, (27) is a positive definite function, and its first difference is

$$\Delta V(k) = \Delta V_1(k) + \Delta V_2(k). \quad (30)$$

For convenience, we denote

$$\begin{aligned}
\Delta V_1(k) &= e_{k+1}^T e_{k+1} - e_k^T e_k \\
&= (\tilde{W}_2^T(k)\Phi(h_k) - \delta_k)^T (\tilde{W}_2^T(k)\Phi(h_k) - \delta_k) - e_k^T e_k \\
&= (\tilde{W}_2^T(k)\Phi(h_k))^T (\tilde{W}_2^T(k)\Phi(h_k)) - 2(\tilde{W}_2^T(k)\Phi(h_k))^T \delta_k + \delta_k^T \delta_k - e_k^T e_k \quad (31)
\end{aligned}$$


and

$$\begin{aligned}
\Delta V_2(k) &= \sigma \cdot \mathrm{tr}\{ \tilde{W}_2^T(k+1)\tilde{W}_2(k+1) - \tilde{W}_2^T(k)\tilde{W}_2(k) \} \\
&= \sigma \cdot \mathrm{tr}\{ (\hat{W}_2(k+1) - W_2^*)^T (\hat{W}_2(k+1) - W_2^*) - \tilde{W}_2^T(k)\tilde{W}_2(k) \} \\
&= \sigma \cdot \mathrm{tr}\{ (\hat{W}_2(k) - \beta\Phi(h_k)e_{k+1}^T - W_2^*)^T (\hat{W}_2(k) - \beta\Phi(h_k)e_{k+1}^T - W_2^*) - \tilde{W}_2^T(k)\tilde{W}_2(k) \} \\
&= \sigma \cdot \mathrm{tr}\{ (\tilde{W}_2(k) - \beta\Phi(h_k)e_{k+1}^T)^T (\tilde{W}_2(k) - \beta\Phi(h_k)e_{k+1}^T) - \tilde{W}_2^T(k)\tilde{W}_2(k) \} \\
&= -2\sigma\beta \Phi^T(h_k)\tilde{W}_2(k)e_{k+1} + \sigma\beta^2 e_{k+1}^T \left( \Phi^T(h_k)\Phi(h_k) \right) e_{k+1}. \quad (32)
\end{aligned}$$

Substituting (31) and (32) into (30) yields

$$\begin{aligned}
\Delta V(k) &= \Delta V_1(k) + \Delta V_2(k) \\
&= (\tilde{W}_2^T(k)\Phi(h_k))^T (\tilde{W}_2^T(k)\Phi(h_k)) - 2(\tilde{W}_2^T(k)\Phi(h_k))^T \delta_k + \delta_k^T \delta_k - e_k^T e_k \\
&\quad - 2\sigma\beta \Phi^T(h_k)\tilde{W}_2(k)e_{k+1} + \sigma\beta^2 e_{k+1}^T \left( \Phi^T(h_k)\Phi(h_k) \right) e_{k+1} \\
&= (\tilde{W}_2^T(k)\Phi(h_k))^T (\tilde{W}_2^T(k)\Phi(h_k)) - 2(\tilde{W}_2^T(k)\Phi(h_k))^T \delta_k + \delta_k^T \delta_k - e_k^T e_k \\
&\quad - 2\sigma\beta \Phi^T(h_k)\tilde{W}_2(k)(\tilde{W}_2^T(k)\Phi(h_k) - \delta_k) \\
&\quad + \sigma\beta^2 \Phi^T(h_k)\Phi(h_k) (\tilde{W}_2^T(k)\Phi(h_k) - \delta_k)^T (\tilde{W}_2^T(k)\Phi(h_k) - \delta_k). \quad (33)
\end{aligned}$$

According to (23), one has

$$\begin{aligned}
\Delta V(k) &\le (\tilde{W}_2^T(k)\Phi(h_k))^T (\tilde{W}_2^T(k)\Phi(h_k)) - 2(\tilde{W}_2^T(k)\Phi(h_k))^T \delta_k + \delta_k^T \delta_k - e_k^T e_k \\
&\quad - 2\sigma\beta \Phi^T(h_k)\tilde{W}_2(k)(\tilde{W}_2^T(k)\Phi(h_k) - \delta_k) \\
&\quad + 2\sigma\beta^2 \Phi^T(h_k)\Phi(h_k) \left( (\tilde{W}_2^T(k)\Phi(h_k))^T \tilde{W}_2^T(k)\Phi(h_k) + \delta_k^T \delta_k \right). \quad (34)
\end{aligned}$$

Based on (26), we know

$$\begin{aligned}
\Delta V(k) &\le (1 - 2\sigma\beta + 2\sigma\beta^2 \Phi_m^2)(\tilde{W}_2^T(k)\Phi(h_k))^T \tilde{W}_2^T(k)\Phi(h_k) \\
&\quad + (-1 + \xi + 2\xi\sigma\beta^2 \Phi_m^2) e_k^T e_k + (-2 + 2\sigma\beta)(\tilde{W}_2^T(k)\Phi(h_k))^T \delta_k. \quad (35)
\end{aligned}$$

If the design parameter $\sigma$ is selected as

$$\sigma = \frac{1}{\beta} \quad (36)$$

then (35) can be rewritten as

$$\Delta V(k) \le (-1 + 2\beta\Phi_m^2)(\tilde{W}_2^T(k)\Phi(h_k))^T \tilde{W}_2^T(k)\Phi(h_k) + (-1 + \xi + 2\xi\beta\Phi_m^2) e_k^T e_k. \quad (37)$$

According to condition (25), we obtain $\Delta V(k) \le 0$. Hence, the identification error $e_k$ satisfies the stability condition and the weight estimation error $\tilde{W}_2(k)$ is bounded. This completes the proof. □
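For concreteness, a minimal sketch of the identifier (17) with the update law (24) is given below. The layer sizes, the uniform weight initialization, and all names are illustrative assumptions; the fixed input-to-hidden weights follow the design described above.

```python
import numpy as np

def phi(z):
    # activation (18): (1 - e^{-z}) / (1 + e^{-z})
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

class Identifier:
    """State identifier sketch: x_hat_{k+1} = W2(k)^T Phi(W1^T [x_k; x_{k-1}; u_k])."""
    def __init__(self, n, l, n_hidden, beta=0.01, rng=None):
        rng = rng or np.random.default_rng()
        self.W1 = rng.uniform(-1, 1, (2 * n + l, n_hidden))  # fixed after init
        self.W2 = rng.uniform(-1, 1, (n_hidden, n))          # trained online
        self.beta = beta

    def step(self, x_k, x_km1, u_k, x_kp1):
        h = self.W1.T @ np.concatenate([x_k, x_km1, u_k])
        s = phi(h)
        x_hat = self.W2.T @ s                    # estimate (17)
        e = x_hat - x_kp1                        # identification error (21)
        self.W2 -= self.beta * np.outer(s, e)    # update law (24)
        return x_hat, e
```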

With the estimated system states, an ADP algorithm for MJSs is developed in the following section to approximate the solution of the HJB equation and to seek the optimal control at the same time.

V. OPTIMAL CONTROL FOR UNKNOWN MJSS

In this section, an ADP approach is proposed to approximate the optimal performance index function and control law for MJSs. Two subsections are included. The first proposes an ADP algorithm for discrete-time nonlinear MJSs to approximate the HJB equation and solve for the optimal control law according to the obtained performance index function (10). The corresponding stability analysis is given in the second subsection, including the convergence of the obtained performance index function for MJSs and the existence of the optimal control input.

A. ADP Algorithm to Approximate the Optimal Control for MJSs

In this ADP algorithm, we start with an initial performance index function $J^{(0)}(x) = 0$. Then, we solve for the control law $u_k^{(0)}$ as

$$u_k^{(0)} = \arg\min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(0)}(x_{k+1}) \right\}. \quad (38)$$

According to $u_k^{(0)}$, iteration on the performance index function is performed by computing

$$J^{(1)}(x_k) = \min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(0)}(x_{k+1}) \right\} = D^T T(x_k, u_k^{(0)}) + J^{(0)}(x_{k+1}). \quad (39)$$

Because $J^{(0)}(x) = 0$, it follows that

$$J^{(1)}(x_k) = D^T T(x_k, u_k^{(0)}). \quad (40)$$

Based on (40), we can obtain the following iteration equations:

$$u_k^{(1)} = \arg\min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(1)}(x_{k+1}) \right\} \quad (41)$$

$$J^{(2)}(x_k) = \min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(1)}(x_{k+1}) \right\} = D^T T(x_k, u_k^{(1)}) + J^{(1)}(x_{k+1}). \quad (42)$$

The ADP algorithm, therefore, is obtained by iterating between a sequence of control laws $u_k^{(n)}$

$$u_k^{(n)} = \arg\min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(n)}(x_{k+1}) \right\} = \arg\min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(n)}(f_i(x_k) + g_i(x_k) u_k) \right\} \quad (43)$$

and a sequence of performance index functions $J^{(n)}(x_k)$

$$J^{(n+1)}(x_k) = \min_{u_k} \left\{ D^T T(x_k, u_k) + J^{(n)}(x_{k+1}) \right\} = D^T T(x_k, u_k^{(n)}) + J^{(n)}(f_i(x_k) + g_i(x_k) u_k^{(n)}) \quad (44)$$

where $k$ is the time index, $i$ is the index of the active subsystem at time step $k$, and $n$ is the iteration index.

Note that, in the ADP algorithm, we do not need to start from an optimal performance index function, which is difficult to find for general nonlinear jump systems. It is an incremental optimization process that is implemented forward in time and online. Moreover, this process is adaptive, since it does not require knowledge of the system functions.


In the following section, it is shown that $J^{(n)}(x_k)$ and $u_k^{(n)}$ converge to the optimal performance index function and to the corresponding optimal control law, respectively.
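A compact sketch of one step of the iteration (43)–(44) is given below. It minimizes over a coarse action grid purely for illustration (the paper itself performs the minimization with actor-critic networks in Section VI); `model`, `J`, `T_fn`, and `u_grid` are assumed callables and data supplied by the user.

```python
import numpy as np

def vi_step(x, i, J, model, D, T_fn, u_grid):
    """One value-iteration step of (43)-(44) at state x with active mode i.

    model(x, u, i) -> estimate of x_{k+1} (e.g., from the identifier)
    J(x)           -> current performance index estimate J^{(n)}(x)
    T_fn(x, u)     -> vector of subsystem utilities U_i(x, u)
    """
    costs = [D @ T_fn(x, u) + J(model(x, u, i)) for u in u_grid]
    best = int(np.argmin(costs))
    return u_grid[best], costs[best]   # u_k^{(n)} and J^{(n+1)}(x_k)
```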

B. Convergence Analysis of the Proposed ADP Approach

To prove the convergence of the proposed ADP approach for discrete-time nonlinear MJSs, let us start with the following lemmas, which are important in the convergence analysis.

Lemma 2: Let $\eta_k^{(n)}$ be any stabilizing and admissible control law and $\Gamma^{(0)}(x) = J^{(0)}(x) = 0$, where $\Gamma^{(n)}(x_k)$ is updated as

$$\Gamma^{(n+1)}(x_k) = D^T T(x_k, \eta_k^{(n)}) + \Gamma^{(n)}(x_{k+1}) \quad (45)$$

where

$$T(x_k, \eta_k^{(n)}) = \left( U_1(x_k, \eta_k^{(n)}), U_2(x_k, \eta_k^{(n)}), \ldots, U_m(x_k, \eta_k^{(n)}) \right)^T$$
$$U_i(x_k, \eta_k^{(n)}) = Q_i(x_k) + \eta_k^{(n)T} R_i \eta_k^{(n)}, \quad i \in S.$$

Then

$$J^{(n)}(x_k) \le \Gamma^{(n)}(x_k).$$

Lemma 2 can easily be proved because $J^{(n)}(x_k)$ is the result when the control $u_k$ minimizes the right-hand side of (45).

Lemma 3: Define the performance index function sequence for discrete-time MJSs as in (44), and suppose the MJSs (1) are controllable and $J^{(0)}(x) = 0$. Then $J^{(n)}(x_k)$ is a monotonically nondecreasing sequence, i.e., $\forall n$, $J^{(n)}(x_k) \le J^{(n+1)}(x_k)$.

Proof: From Lemma 2, we know that if $\Gamma^{(0)}(x) = J^{(0)}(x) = 0$, then the new sequence $\Gamma^{(n)}(x_k)$ defined in (45) has the following property:

$$J^{(n)}(x_k) \le \Gamma^{(n)}(x_k). \quad (46)$$

Because $\eta_k^{(n)}$ is an arbitrary stabilizing sequence, assume $\eta_k^{(n-1)} = u_k^{(n)}$, such that

$$\Gamma^{(n)}(x_k) = D^T T(x_k, \eta_k^{(n-1)}) + \Gamma^{(n-1)}(x_{k+1}) = D^T T(x_k, u_k^{(n)}) + \Gamma^{(n-1)}(x_{k+1}). \quad (47)$$

In the following, we prove $J^{(n+1)}(x_k) \ge \Gamma^{(n)}(x_k)$ by mathematical induction. Let us start with $n = 0$. We know that $J^{(0)}(x_k) = \Gamma^{(0)}(x_k) = 0$; then

$$J^{(1)}(x_k) - \Gamma^{(0)}(x_k) = D^T T(x_k, u_k^{(0)}) \ge 0. \quad (48)$$

Thus, for $n = 0$, we obtain $J^{(1)}(x_k) \ge \Gamma^{(0)}(x_k)$. Now, we assume the claim holds for the $(n-1)$th iteration step

$$J^{(n)}(x_k) - \Gamma^{(n-1)}(x_k) \ge 0. \quad (49)$$

By subtracting (47) from (44), it follows that

$$\begin{aligned}
J^{(n+1)}(x_k) - \Gamma^{(n)}(x_k) &= D^T T(x_k, u_k^{(n)}) + J^{(n)}(x_{k+1}) - \left( D^T T(x_k, u_k^{(n)}) + \Gamma^{(n-1)}(x_{k+1}) \right) \\
&= J^{(n)}(x_{k+1}) - \Gamma^{(n-1)}(x_{k+1}) \ge 0 \quad (50)
\end{aligned}$$

which completes the proof of $J^{(n+1)}(x_k) \ge \Gamma^{(n)}(x_k)$. On the other side, we obtain $J^{(n)}(x_k) \le \Gamma^{(n)}(x_k)$ from (46); hence $J^{(n)}(x_k) \le \Gamma^{(n)}(x_k) \le J^{(n+1)}(x_k)$ for any $n = 0, 1, 2, \ldots$, which means $J^{(n)}(x_k) \le J^{(n+1)}(x_k)$ at every iteration step, i.e., $J^{(n)}(x_k)$ is a monotonically nondecreasing sequence. The conclusion holds. □

From Lemma 3, we know the performance index function sequence (44) for MJSs is monotonically nondecreasing. Now, we present our main theorem.

Theorem 2: Let the sequences $J^{(n)}(x_k)$ and $u_k^{(n)}$ be defined as in (44) and (43), respectively. If the MJSs (1) are controllable, then the following conditions hold.

1) An admissible control law exists for MJSs (1).
2) There exists an upper bound $C(x_k)$ such that $0 \le J^{(n)}(x_k) \le J^{\infty}(x_k) \le C(x_k)$.
3) The performance index function sequence converges to the optimal value $J^*(x_k)$, and $\forall k$, $u_k^{\infty}$ is an asymptotically stable control law for MJSs (1), i.e., $u_k^{\infty} = u_k^*$.

Proof: Let us start with the admissibility part. Since $J^{(n)}(x_k)$ is positive definite, it attains a minimum at $x_k = 0$, and thus $dJ^{(n)}(x_k)/dx_k$ should vanish there. This implies that $u_k = 0$ if $x_k = 0$. The continuity assumption on $f + gu$ implies that a continuous control law exists and that system (1) cannot jump to infinity in any one step under finite control. Furthermore, because $f_i(0) = g_i(0) = 0$, when the system state $x_k$ reaches the equilibrium state, the control input becomes zero and the state of the MJSs is kept at zero. According to Definition 1, an admissible control law exists for MJSs (1), which proves part (1).

The second part of the theorem follows by noting that the elements of the performance index function sequence $J^{(n)}(x_k)$ for MJSs are all positive values from (9). Therefore, using Lemma 3 and (9), the left-hand side of the conclusion $0 \le J^{(n)}(x_k) \le J^{\infty}(x_k)$ holds. Now, we prove that this positive sequence has an upper bound.

Define $\mu_k$ as any stabilizing and admissible control law and let $\mu_k = \eta_k^{(n)}$. Therefore, the new sequence based on $\mu_k$ is updated as

$$\Lambda^{(n+1)}(x_k) = D^T T(x_k, \mu_k) + \Lambda^{(n)}(x_{k+1}) \quad (51)$$

where $\Lambda^{(0)}(x) = J^{(0)}(x) = 0$.

Motivated by the research in [23] and [34], we obtain the following:

$$\begin{aligned}
\Lambda^{(n+1)}(x_k) &= D^T T(x_k, \mu_k) + \Lambda^{(n)}(x_{k+1}) \\
&= D^T T(x_k, \mu_k) + D^T T(x_{k+1}, \mu_{k+1}) + \Lambda^{(n-1)}(x_{k+2}) \\
&\;\;\vdots \\
&= D^T T(x_k, \mu_k) + D^T T(x_{k+1}, \mu_{k+1}) + \cdots + D^T T(x_{k+n}, \mu_{k+n}) + \Lambda^{(0)}(x_{k+n+1}). \quad (52)
\end{aligned}$$

Because $\Lambda^{(0)}(x) = 0$, it follows that

$$\Lambda^{(n+1)}(x_k) = \sum_{t=0}^{n} D^T T(x_{k+t}, \mu_{k+t}) = \sum_{t=k}^{n+k} D^T T(x_t, \mu_t). \quad (53)$$


Fig. 1. Neural network structure of the proposed ADP approach.

Letting $n \to \infty$, $\lim_{n \to \infty} \Lambda^{(n+1)}(x_k) = \Lambda^{\infty}(x_k)$, and (53) becomes

$$\Lambda^{\infty}(x_k) = \sum_{t=k}^{\infty} D^T T(x_t, \mu_t). \quad (54)$$

Assume $\eta_k^{(n)} = \mu_k$ and $\Gamma^{(n)}(x_k) = \Lambda^{(n)}(x_k)$, such that $J^{(n)}(x_k) \le \Lambda^{(n)}(x_k)$ follows from Lemma 2. This can be rewritten as $J^{\infty}(x_k) \le \Lambda^{\infty}(x_k)$ when $n \to \infty$. Combining this with (54), it follows that

$$J^{\infty}(x_k) \le \Lambda^{\infty}(x_k) = \sum_{t=k}^{\infty} D^T T(x_t, \mu_t). \quad (55)$$

Define $C(x_k) = \sum_{t=k}^{\infty} D^T T(x_t, \mu_t)$, so that (55) can be rewritten as $J^{\infty}(x_k) \le C(x_k)$. Hence, the proof of part (2) is complete. Note that $C(x_k)$ is determined by an admissible stabilizing law $\mu_k$, which means $C(x_k)$ is finite.

For part (3), consider the definition of the upper bound $C(x_k)$. Because $\mu_k$ is defined as an admissible control, if $\mu_k$ is applied over the infinite horizon, it follows that

$$J^{\infty}(x_k) = C(x_k) \ge J^*(x_k). \quad (56)$$

On the other hand, since $J^{(n)}(x_k) \le \Lambda^{(n)}(x_k)$, which can be rewritten as $J^{\infty}(x_k) \le \Lambda^{\infty}(x_k) = \sum_{t=k}^{\infty} D^T T(x_t, \mu_t)$, we obtain

$$J^{\infty}(x_k) \le \sum_{t=k}^{\infty} D^T T(x_t, u_t^*) \quad (57)$$

by setting $\mu_k = u_k^*$, which means $J^{\infty}(x_k) \le J^*(x_k)$. From (56), we know $J^*(x_k) \le J^{\infty}(x_k)$. Hence, $J^{\infty}(x_k) = J^*(x_k)$, i.e., $J^{(n)}(x_k)$ converges to the optimal value $J^*(x_k)$.

Then, the convergence of the corresponding control law sequence $u_k^{(n)}$ is established as follows. From (9), we know the performance index function (10) for MJSs is positive definite. We can further write

$$J^{\infty}(x_{k+1}) - J^{\infty}(x_k) = -D^T T(x_k, u_k^{\infty}). \quad (58)$$

Since $D^T T(x_k, u_k^{\infty})$ is positive definite, equation (58) is negative definite. Therefore, $J^{\infty}(x_k)$ can be seen as a Lyapunov function for the admissible control $u_k^{\infty}$. Besides, because (58) is negative definite, $u_k^{\infty}$ can make the MJSs (1) asymptotically stable. As $J^{\infty}(x_k) = J^*(x_k)$, it follows that

$$J^{\infty}(x_{k+1}) - J^{\infty}(x_k) = J^*(x_{k+1}) - J^*(x_k). \quad (59)$$

Considering (13) and (58), (59) becomes

$$-D^T T(x_k, u_k^{\infty}) = -D^T T(x_k, u_k^*). \quad (60)$$

Hence, the conclusion $u_k^{\infty} = u_k^*$ is proved, which completes the proof. □

From Theorem 2, we know the obtained performance index function sequence $J^{(n)}(x_k)$ for discrete-time nonlinear MJSs is monotonically nondecreasing and converges to the optimal value $J^*(x_k)$ for each $x_k$, and the corresponding admissible control input exists to asymptotically stabilize the MJSs, i.e., as $n \to \infty$, $J^{(n)}(x_k) \to J^*(x_k)$ and $u_k^{(n)} \to u_k^*$.

VI. DESIGN OF THE PROPOSED ADP APPROACH

In this section, we use neural networks to approximate the obtained performance index function sequence (44) and the control law sequence (43). The implementation process is shown in Fig. 1. The unknown MJS is replaced by the state identifier introduced in Section IV. Two neural networks, the critic and the action networks, are used iteratively to estimate the optimal values of the performance index function and the control law. The detailed implementation process based on the actor-critic networks is presented as follows.

A. Critic Network

The purpose of the critic network is to approximate the performance index function sequence $J^{(n)}(x_k)$ of the proposed MJSs. A three-layer neural network is built as this function approximation structure. Set the weight matrix between the input and the hidden layers as $W_{c1}$, and the weight matrix between the hidden and the output layers as $W_{c2}$. Therefore, the output of the critic network can be defined as

$$\hat{J}^{(n)}(x_k) = W_{c2}^{(n)T} \Phi(y_k) \quad (61)$$


where $\Phi(\cdot)$ is the activation function defined in (18) and $y_k = W_{c1}^T [x_k^T, u_k^T]^T$.

Based on (44), the target performance index function is

$$J^{(n)}(x_k) = D^T T(x_k, u_k^{(n-1)}) + \hat{J}^{(n-1)}(x_{k+1}). \quad (62)$$

Therefore, the output error of the critic network is

$$e_c^{(n)}(k) = \hat{J}^{(n)}(x_k) - J^{(n)}(x_k) = \hat{J}^{(n)}(x_k) - \hat{J}^{(n-1)}(x_{k+1}) - D^T T(x_k, u_k^{(n-1)}). \quad (63)$$

The weight matrix is updated to minimize the following performance measure:

$$E_c^{(n)}(k) = \frac{1}{2} e_c^{(n)2}(k). \quad (64)$$

According to the gradient descent rule, the update scheme of the critic network is as follows:

$$\begin{aligned}
W_{c2}^{(n+1)} &= W_{c2}^{(n)} - \beta_c \left( \frac{\partial E_c^{(n)}(k)}{\partial W_{c2}^{(n)}} \right) \\
&= W_{c2}^{(n)} - \beta_c \left( \frac{\partial E_c^{(n)}(k)}{\partial e_c^{(n)}(k)} \cdot \frac{\partial e_c^{(n)}(k)}{\partial W_{c2}^{(n)}} \right) \\
&= W_{c2}^{(n)} - \beta_c \Phi(y_k) e_c^{(n)T}(k) \quad (65)
\end{aligned}$$

where $\beta_c > 0$ is the learning rate of the critic network.
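A minimal sketch of one critic update, (61)–(65), is shown below. The caller supplies the target from (62); the weight shapes and names are illustrative assumptions.

```python
import numpy as np

def phi(z):
    # activation (18)
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

def critic_update(Wc1, Wc2, x_k, u_k, target, beta_c=0.01):
    """One gradient step on W_c2; `target` = D^T T(x_k, u^{(n-1)}) + J_hat(x_{k+1})."""
    y = Wc1.T @ np.concatenate([x_k, u_k])
    s = phi(y)
    J_hat = float(Wc2.T @ s)                 # critic output (61)
    e_c = J_hat - target                     # output error (63)
    Wc2 = Wc2 - beta_c * e_c * s[:, None]    # update (65): -beta_c * Phi(y) * e_c
    return Wc2, J_hat
```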

B. Action Network

The control law sequence $u_k^{(n)}$ is estimated by the action network. Consider a three-layer neural network architecture as this function approximation structure. Denote the weight matrix between the input and the hidden layers as a constant matrix $W_{a1}$, and the weight matrix between the hidden and the output layers as $W_{a2}$. Then, the estimated control law can be formulated as

$$\hat{u}_k^{(n)} = \Phi\left( W_{a2}^{(n)T} \Phi(t_k) \right) \quad (66)$$

where $t_k = W_{a1}^T x_k$, and the definition of $\Phi(t_k)$ is the same as that of $\Phi(y_k)$ in the critic network part.

Since $u_k^{(n)}$, given by (43), is the target of the output of the action network, define the output error as

$$e_a^{(n)}(k) = \hat{u}_k^{(n)} - u_k^{(n)}. \quad (67)$$

The weight matrix in this process is updated to minimize the following performance measure:

$$E_a^{(n)}(k) = \frac{1}{2} e_a^{(n)T}(k) e_a^{(n)}(k). \quad (68)$$

Then, we can derive the gradient descent rule to train

$$W_{a2}^{(n+1)} = W_{a2}^{(n)} - \beta_a \left( \frac{\partial E_a^{(n)}(k)}{\partial W_{a2}^{(n)}} \right) \quad (69)$$

where $\beta_a > 0$ is the learning rate of the action network, and

$$\frac{\partial E_a^{(n)}(k)}{\partial W_{a2}^{(n)}} = \frac{\partial E_a^{(n)}(k)}{\partial e_a^{(n)}(k)} \cdot \frac{\partial e_a^{(n)}(k)}{\partial \hat{u}_k^{(n)}} \cdot \frac{\partial \hat{u}_k^{(n)}}{\partial W_{a2}^{(n)}} = \Phi(t_k) \cdot \frac{1}{2} \left( 1 - \Phi^2\big(v_a^{(n)}(k)\big) \right) \cdot e_a^{(n)T}(k) \quad (70)$$

where $v_a^{(n)}(k) = W_{a2}^{(n)T} \Phi(t_k)$ is the input to the output-layer activation, and $\frac{1}{2}(1 - \Phi^2(\cdot))$ is the derivative of the activation function (18).

Note that during this training procedure, the input-to-hidden weight matrices $W_{c1}$ and $W_{a1}$ are chosen randomly at initialization, and only the hidden-to-output weight matrices $W_{c2}$ and $W_{a2}$ are updated.
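A matching sketch of one action-network update, (66)–(70), is given below; all shapes and names are again illustrative assumptions.

```python
import numpy as np

def phi(z):
    # activation (18)
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

def actor_update(Wa1, Wa2, x_k, u_target, beta_a=0.01):
    """One gradient step on W_a2; `u_target` is the control law from (43)."""
    s = phi(Wa1.T @ x_k)                      # hidden-layer output
    v = Wa2.T @ s                             # output-layer input
    u_hat = phi(v)                            # estimated control (66)
    e_a = u_hat - u_target                    # output error (67)
    # chain rule (70): d(phi)/dv = 0.5 * (1 - phi(v)^2)
    grad = np.outer(s, 0.5 * (1.0 - u_hat**2) * e_a)
    return Wa2 - beta_a * grad, u_hat
```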

Remark: The training procedure above is used to obtain the performance index function and the control law sequence. It is very important that the whole system remain stable while both the action and the critic networks undergo adaptation, which means one should ensure the convergence of the networks' weights. Many papers study the neural-network-based ADP technique. Some prove the convergence of the iterative performance index function and control law, and neural networks are then just used to implement this process [14], [23], [34]. Others prove convergence in another way, namely, through convergence analysis of the neural network weights and the state [43], [51], [53]. In this paper, we use the first method to prove that our proposed method is convergent, including the performance index function and the control law; the actor-critic networks are then used to implement this method. A detailed analysis of the neural network training algorithm can be found in [53], where Liu et al. provide a theorem showing that the training errors of the neural network weights are uniformly ultimately bounded using a Lyapunov stability argument.

VII. SIMULATION ANALYSIS

In this section, we provide three numerical examples to demonstrate the effectiveness of the neural-network-based ADP approach proposed in this paper. Specifically, the first example solves a two-mode linear MJS, and we compare the results with the theoretical solution of the HJB equation. A two-mode nonlinear MJS is considered in the second example, and without loss of generality, we also consider two kinds of arbitrary selections of the system functions. In the third example, we consider a single link robot arm, which is very popular in Markov jump problems. Four jumping modes are considered in this case.

A. Linear System

We start with the following discrete-time linear MJS with two jumping modes:

$$x_{k+1} = A_i x_k + B_i u_k \quad (71)$$

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$.

The dynamics in each mode can be described as

$$\text{mode 1: } A_1 = \begin{pmatrix} -0.5 & 0.1 \\ 0.1 & 0.6 \end{pmatrix}, \quad B_1 = \begin{pmatrix} 0.1 \\ 0 \end{pmatrix}$$
$$\text{mode 2: } A_2 = \begin{pmatrix} 0.6 & -0.2 \\ 0.1 & 0.1 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 0 \\ 0.6 \end{pmatrix}. \quad (72)$$

The transition probability matrix is

$$H = \begin{pmatrix} 0.2 & 0.8 \\ 0.4 & 0.6 \end{pmatrix}. \quad (73)$$

Assume that the system functions and dynamics are unknown. Based on the identifier design approach proposed in Section IV, the state identifiers for the two subsystems are trained


Fig. 2. Identification errors for modes 1 and 2 of the linear MJS.

Fig. 3. Active jumping mode and system responses with the ADP controller.

for a maximum of 250 time steps. The initial weights of the identifiers are randomly chosen in [−1, 1], and the learning rate is set to β = 0.01. The identification errors for both subsystems are shown in Fig. 2. We can observe that both errors converge to zero asymptotically, which means these two identifiers can approximate the states effectively.

With the above identifiers, we define the performance index function for each mode of this linear MJS in the linear quadratic form $J_i(x_k) = x_k^T Q_i x_k + u_k^T R_i u_k$, $i \in \{1, 2\}$, where $Q_i$ and $R_i$ are identity matrices of appropriate dimensions. In this situation, set the weight vector as $\omega = [0.3, 0.7]^T$. Combining this with the knowledge of (73), we convert this two-mode MJS control problem into a single-objective optimal control problem according to (10). Therefore, the performance index function for the whole MJS can be described as

$$J(x_k) = (0.3 \times 0.2 + 0.7 \times 0.4) J_1(x_k) + (0.3 \times 0.8 + 0.7 \times 0.6) J_2(x_k) = 0.34\, J_1(x_k) + 0.66\, J_2(x_k) \quad (74)$$

consistent with $D_i = \sum_j \omega_j \pi_{ji}$ in (9).
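As a quick check of (74), the combined weights are just $D = H^T \omega$ from the definition $D_i = \sum_j \omega_j \pi_{ji}$; a short illustrative sketch:

```python
import numpy as np

H = np.array([[0.2, 0.8],
              [0.4, 0.6]])     # transition matrix (73)
omega = np.array([0.3, 0.7])   # chosen weight vector
D = H.T @ omega
print(D)                       # [0.34 0.66]: J = 0.34*J_1 + 0.66*J_2
```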

Fig. 4. Performance index function trajectory of the linear MJS.

Fig. 5. Weights trajectories of the action and the critic networks from the hidden to the output layer.

For the design of the controller, we choose the initial state as $x_0 = [1, -0.5]^T$. Two three-layer neural networks are built as the critic and the action networks, and the numbers of hidden layer nodes are set to $N_{hc} = 8$ and $N_{ha} = 6$, respectively. The learning rates of both the action and the critic networks are set to $\beta_c = \beta_a = 0.01$. The initial weights of both networks are set randomly within [−1, 1].

The active jumping mode and the system responses during training are shown in Fig. 3. We can clearly observe that the system randomly jumps between the two modes and the state variables converge to zero even though the mode jumps randomly between modes 1 and 2. Moreover, after the system reaches stability (after 10 time steps), the state variables do not change even though the modes still jump randomly. The trajectory of the performance index function sequence at time step k = 0 for MJS (72) is shown in Fig. 4, showing that the obtained performance index function sequence is monotonically nondecreasing and settles at its optimal value during this process, in line with the theoretical analysis in Section V-B. Weights of both the action and the critic networks from the hidden to the output layer are shown in Fig. 5.

Furthermore, to demonstrate the effectiveness of this method, we compare this ADP controller with the standard linear quadratic regulator (LQR) controller, which is the exact


Fig. 6. Comparisons of system responses of the ADP and the LQR controllers.

solution of the HJB equation. We fix the optimal weights of the critic and the action networks obtained above and test the performance of the controller. The system responses of both controllers are shown in Fig. 6, including both the state variables and the control law trajectories of the two controllers. It can be observed that the system responses of the designed ADP controller converge to those of the LQR controller, which means the training process of the proposed ADP method can attain the performance of the optimal control solution. The simulation results reveal that the proposed neural-network-based ADP approach is effective for the linear MJS with unknown discrete-time dynamics and can obtain satisfactory performance.

B. Nonlinear System

Now, we turn to a nonlinear discrete-time MJS with two jumping modes. The system functions can be described as follows:

$$\text{mode 1: } \begin{cases} x_{1(k+1)} = -x_{1(k)} + x_{1(k)} \cos(x_{1(k)} x_{2(k)}) \\ x_{2(k+1)} = -\sin(x_{1(k)} + u_k) \end{cases}$$
$$\text{mode 2: } \begin{cases} x_{1(k+1)} = -\sin(0.5\, x_{2(k)}) \\ x_{2(k+1)} = -\sin(0.9\, x_{1(k)}) \cos(x_{2(k)} + u_k). \end{cases} \quad (75)$$

The transition probability matrix is

$$H = \begin{pmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{pmatrix}. \quad (76)$$

Assume the system has unknown state dynamics. According to the identification approach presented in Section IV, a three-layer neural network is trained to approximate the state of the

Fig. 7. Identification errors for modes 1 and 2 of the nonlinear MJS.

Fig. 8. Active jumping mode and system responses with the ADP controller.

next step. The learning rate is set to β = 0.01, and the initial weights of the identifiers are chosen randomly in [−1, 1]. The identification errors of both subsystems are shown in Fig. 7. It can be observed that the identification errors converge to zero, indicating that the designed identifiers can accurately estimate the system state.

With these two identifiers, an ADP controller is designed to stabilize this MJS. Two three-layer neural networks are built as the critic and the action networks with $N_{hc} = 8$ and $N_{ha} = 6$. The learning rates of both networks are set to $\beta_c = \beta_a = 0.01$, and the initial weights of both networks are chosen randomly within [−1, 1]. The initial state is set to $x_0 = [1, -0.5]^T$. The weight vector is set to $\omega = [0.6, 0.4]^T$. Therefore, we obtain the performance index function for MJS (75) as follows:

$$J(x_k) = (0.6 \times 0.7 + 0.4 \times 0.2) J_1(x_k) + (0.6 \times 0.3 + 0.4 \times 0.8) J_2(x_k) \quad (77)$$

consistent with $D_i = \sum_j \omega_j \pi_{ji}$, where $J_i(x_k) = x_k^T Q_i x_k + u_k^T R_i u_k$, $i \in \{1, 2\}$, and $Q_i$ and $R_i$ are identity matrices of appropriate dimensions.

The system performance of the designed ADP controller is shown in Fig. 8, which illustrates the active mode at each


Fig. 9. Performance index function trajectory of the nonlinear MJS.

Fig. 10. Weights trajectories of the action and the critic networks from the hidden to the output layer.

time step, the state responses, and the control input trajectory during the training process. We can observe that the system state variables reach the equilibrium point at the 10th time step and then stay at the equilibrium values even though the active mode keeps changing. The performance index function sequence at time step k = 0 is shown in Fig. 9, which is consistent with the analysis in Section V-B in that it is monotonically nondecreasing. The learning weights of both the critic and the action networks from the hidden to the output layer are shown in Fig. 10.

Additionally, without loss of generality, we consider an arbitrary selection of the system functions in the following two cases.

1) Consider a set of nonlinear systems

$$\begin{cases} x_{1(k+1)} = \lambda_1 \sin(\lambda_2 x_{2(k)}) \\ x_{2(k+1)} = \lambda_3 \sin(\lambda_4 x_{1(k)}) \cos(\lambda_5 x_{2(k)} + \lambda_6 u_k) \end{cases} \quad (78)$$

where $\lambda_1 \sim \lambda_6$ are designed parameters, which are adjustable. For convenience, we assume $\lambda_1 \in [-1, 1]$, $\lambda_2 \in [-1, 1]$, $\lambda_3 \in [-1, 1]$, $\lambda_4 \in [-1, 1]$, $\lambda_5 \in [-100, 100]$, and $\lambda_6 \in [-100, 100]$. As we know, different sets of designed parameters ($\lambda_1 \sim \lambda_6$) give different system functions. Therefore, we

Fig. 11. Average state trajectories over 10 000 rounds.

Fig. 12. Histogram of the state values at the 50th time step over 10 000 rounds.

randomly choose the parameters within their boundaries for each time step and let the system jump among these different functions for 50 time steps. Set the initial state as $x_0 = [-0.5, 1]^T$ and choose the initial weights of both the critic and the action networks randomly within [−1, 1]. The performance index function is defined as $J(x_k) = x_k^T x_k + u_k^T u_k$ in this case. Then, the values of the state variables are collected and the root mean square error (RMSE) is measured for each round. We repeat this process 10 000 times and plot the average state trajectories of $x_1$ and $x_2$ over these 10 000 rounds, which are shown in Fig. 11. The histogram of the state values at the 50th time step and the state RMSE for these 10 000 rounds of $x_1$ and $x_2$ are shown in Figs. 12 and 13. The results show that the RMSE for $x_1$ and $x_2$ concentrates in a small range of errors, which means the states converge regardless of parameter changes within their boundaries. Moreover, from Fig. 12, we know that almost all the states at the 50th time step are located in a small neighborhood of zero (the equilibrium point). Therefore, all the rounds in this situation achieve the desired performance, which means the rate of success is 100%.


Fig. 13. Histogram of RMSE for x1 and x2 over 10 000 rounds.

Fig. 14. Average state trajectories over 10 000 rounds.

2) Consider the following nonlinear systems:

$$\text{mode 1: } \begin{cases} x_{1(k+1)} = p_1 x_{1(k)} + x_{1(k)} \cos(p_2 x_{1(k)} x_{2(k)}) \\ x_{2(k+1)} = p_3 \sin(p_4 x_{1(k)} + u_k) \end{cases}$$
$$\text{mode 2: } \begin{cases} x_{1(k+1)} = p_5 \sin(p_6 x_{2(k)}) \\ x_{2(k+1)} = p_7 \sin(p_8 x_{1(k)}) \cos(p_9 x_{2(k)} + p_{10} u_k) \end{cases} \quad (79)$$

where $p_1 \sim p_{10}$ are designed parameters, chosen within their boundaries. Moreover, we assume $p_1 \in [-1, 0]$, $p_2 \in [-100, 100]$, $p_3 \in [-1, 1]$, $p_4 \in [-100, 100]$, $p_5 \in [-1, 1]$, $p_6 \in [-1, 1]$, $p_7 \in [-1, 1]$, $p_8 \in [-1, 1]$, $p_9 \in [-100, 100]$, and $p_{10} \in [-100, 100]$. We can clearly observe that system (75) is the above MJS with a specific set of the designed parameters. Without loss of generality, we randomly choose a set of these parameters ($p_1 \sim p_{10}$) within their boundaries at the beginning of each round. In other words, the system jumps between two arbitrarily selected functions in one run. The transition probability matrix is defined as

$$H = \begin{pmatrix} a & 1-a \\ 1-b & b \end{pmatrix} \quad (80)$$

where $0 < a < 1$ and $0 < b < 1$ are the transition probabilities, which are randomly chosen at initialization. We repeat this process 10 000 times. For each round, the initial state is set to $x_0 = [-0.5, 1]^T$. The system jumps every two time steps and continues for 50 steps. The state value at each time step is collected and the RMSE for the

Fig. 15. Histogram of the state values at the 50th time step over 10 000 rounds.

Fig. 16. Histogram of RMSE for x1 and x2 over 10 000 rounds.

system state is measured for each set of parameters. Fig. 14 shows the average trajectories of $x_1$ and $x_2$ over 10 000 rounds. The histograms of the state values at the 50th time step and of the RMSE for $x_1$ and $x_2$ are shown in Figs. 15 and 16, respectively. From the results, we know most of the jumping processes converge to zero (the equilibrium point). In this paper, we say the desired performance is achieved if the state trajectories converge to the range [−0.01, 0.01]. From Fig. 15, we obtain that the number of unsuccessful rounds is 287, indicating that the success rate in this case is 97.13%.

C. Single Link Robot Arm

In this section, we consider a single link robot arm to illustrate the effectiveness of the proposed design approach. This model is very popular in Markov jump problems (see [54]–[56]). Compared with these papers, our approach does not require knowledge of the system functions. This is very important, because if the parameters of the system modes change, we do not need to redesign the controller.

The dynamic function of the single link robot arm is given by

$$\ddot{\theta}(t) = -\frac{MgL}{G} \sin(\theta(t)) - \frac{D}{G} \dot{\theta}(t) + \frac{1}{G} u(t) \quad (81)$$


Fig. 17. Jumping mode evolution r of the robot arm system.

where $\theta(t)$ is the angle position of the robot arm and $u(t)$ is the control input. Moreover, $M$ is the mass of the payload, $G$ is the moment of inertia, $g$ is the acceleration of gravity, $L$ is the length of the arm, and $D$ is the viscous friction. According to [54], the values of the system parameters are given by $g = 9.81$, $D = 2$, and $L = 0.5$, respectively. This process is a Markov jump process because the parameters $M$ and $G$ have four different modes. Setting $x_1(t) = \theta(t)$ and $x_2(t) = \dot{\theta}(t)$, the dynamic function (81) can be represented by

$$\begin{cases} \dot{x}_1(t) = x_2(t) \\ \dot{x}_2(t) = -\dfrac{2}{G(r)} x_2(t) + \dfrac{1}{G(r)} u(t) - \dfrac{4.905\, M(r) \sin(x_1(t))}{G(r)} \end{cases} \quad (82)$$

where $r \in \{1, 2, 3, 4\}$, and $G(r)$ and $M(r)$ depend on the jumping mode $r$. In this paper, we set $G(1) = 1$, $G(2) = 5$, $G(3) = 10$, $G(4) = 15$, and $M(1) = 1$, $M(2) = 5$, $M(3) = 10$, $M(4) = 15$. The transition probability matrix is described as

$$H = \begin{pmatrix} 0.2 & 0.1 & 0.4 & 0.3 \\ 0.3 & 0.2 & 0.2 & 0.3 \\ 0.1 & 0.3 & 0.3 & 0.3 \\ 0.4 & 0.4 & 0.1 & 0.1 \end{pmatrix}. \quad (83)$$

In our current simulation, the sampling period is chosen as T = 0.05 s. Two three-layer neural networks are built as the critic and the action networks. The numbers of hidden neurons of these two networks are chosen as Nhc = 8 and Nha = 6. The learning rates of both networks are set to βc = βa = 0.01, and the initial weights of both networks are chosen randomly within [−1, 1]. We set the initial state of the system to x0 = [0.5, 0.5]^T. The system parameters jump randomly among the four modes, which can be clearly observed in Fig. 17. The state trajectories and the control law of the robot arm system under jumping mode r are shown in Figs. 18 and 19, respectively. The weight learning processes of the critic and the action networks are shown in Fig. 20. We know from the results that this MJS converges to its stable state under the designed control law. The simulation results reveal that the proposed control method can be applied to nonlinear MJSs with many jumping modes and can obtain satisfactory performance.
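For readers who wish to reproduce the setup, a minimal sketch of the jump simulation is given below, assuming a forward-Euler discretization of (82) with the stated sampling period T = 0.05 s and mode transitions sampled from H in (83); the controller is again a hypothetical `policy` placeholder rather than the trained action network described above.

```python
import numpy as np

T = 0.05                                   # sampling period (s)
G = {1: 1.0, 2: 5.0, 3: 10.0, 4: 15.0}     # mode-dependent moment of inertia
M = {1: 1.0, 2: 5.0, 3: 10.0, 4: 15.0}     # mode-dependent payload mass
H = np.array([[0.2, 0.1, 0.4, 0.3],        # transition matrix (83)
              [0.3, 0.2, 0.2, 0.3],
              [0.1, 0.3, 0.3, 0.3],
              [0.4, 0.4, 0.1, 0.1]])

def step(x, u, r):
    """One forward-Euler step of the mode-r robot arm dynamics (82)."""
    x1, x2 = x
    dx2 = -(2.0 / G[r]) * x2 + u / G[r] - 4.905 * M[r] * np.sin(x1) / G[r]
    return np.array([x1 + T * x2, x2 + T * dx2])

def policy(x, r):
    return 0.0  # hypothetical placeholder for the trained action network

rng = np.random.default_rng(0)
x, r = np.array([0.5, 0.5]), 1             # initial state x0 and initial mode
for _ in range(400):                       # 20 s of simulated time
    r = int(rng.choice(4, p=H[r - 1])) + 1 # next mode sampled from row r of H
    x = step(x, policy(x, r), r)
```

The Euler discretization is one simple choice consistent with the stated sampling period; any standard discretization of (82) could be substituted.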

Fig. 18. State trajectories of the robot arm system under jumping mode r .

Fig. 19. Control law of the robot arm system under jumping mode r .

Fig. 20. Weight trajectories of the action and the critic networks from the hidden to the output layer.

VIII. CONCLUSION

In this paper, we proposed an optimal control method for a class of discrete-time nonlinear MJSs with unknown dynamics. An identifier was designed to approximate the state variables of the unknown systems, and an ADP-based approach was proposed to control this kind of jump system by transforming the MJS control problem into a single-objective optimal control problem. The convergence of the performance index function and the existence of the admissible control in this situation were proved in detail. Neural network techniques were applied to implement the proposed ADP method. Three simulation studies, one linear case, one


nonlinear case, and one single link robot arm case, were used to demonstrate the performance of the proposed optimal control method.

REFERENCES

[1] V. Ugrinovskii and H. R. Pota, “Decentralized control of power systems via robust control of uncertain Markov jump parameter systems,” Int. J. Control, vol. 78, no. 9, pp. 662–677, 2005.

[2] S. C. Lee, “Maintenance strategies for manufacturing systems using Markov models,” Ph.D. dissertation, Dept. Mech. Eng., Univ. Michigan, Ann Arbor, MI, USA, 2010.

[3] L. Zhang and E. K. Boukas, “Stability and stabilization of Markovian jump linear systems with partly unknown transition probabilities,” Automatica, vol. 45, no. 2, pp. 463–468, 2009.

[4] J. Daafouz, P. Riedinger, and C. Iung, “Stability analysis and control synthesis for switched systems: A switched Lyapunov function approach,” IEEE Trans. Autom. Control, vol. 47, no. 11, pp. 1883–1887, Nov. 2002.

[5] W. Chen, J. Xu, and Z. Guan, “Guaranteed cost control for uncertain Markovian jump systems with mode-dependent time-delays,” IEEE Trans. Autom. Control, vol. 48, no. 12, pp. 2270–2277, Dec. 2003.

[6] A. P. C. Goncalves, A. R. Fioravanti, and J. C. Geromel, “H∞ filtering of discrete-time Markov jump linear systems through linear matrix inequalities,” IEEE Trans. Autom. Control, vol. 54, no. 6, pp. 1347–1351, Jun. 2009.

[7] M. S. Mahmoud, P. Shi, and A. Ismail, “Robust Kalman filtering for discrete-time Markovian jump systems with parameter uncertainty,” J. Comput. Appl. Math., vol. 169, pp. 53–69, Aug. 2004.

[8] E. K. Boukas and A. Benzaouia, “Stability of discrete-time linear systems with Markovian jumping parameters and constrained control,” IEEE Trans. Autom. Control, vol. 47, no. 3, pp. 516–521, Mar. 2002.

[9] Y. Fang and K. A. Loparo, “Stochastic stability of jump linear systems,” IEEE Trans. Autom. Control, vol. 47, no. 7, pp. 1204–1208, Jul. 2002.

[10] I. Matei and J. Baras, “Optimal state estimation for discrete-time Markovian jump linear systems, in the presence of delayed output observations,” IEEE Trans. Autom. Control, vol. 56, no. 9, pp. 2235–2240, Sep. 2011.

[11] D. Farias and B. V. Roy, “Approximate dynamic programming via linear programming,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2001, pp. 689–695.

[12] D. P. de Farias and B. Van Roy, “The linear programming approach to approximate dynamic programming,” Oper. Res., vol. 51, no. 6, pp. 850–865, 2003.

[13] D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw. Learn. Syst., vol. 8, no. 5, pp. 997–1007, Sep. 1997.

[14] H. G. Zhang, Y. H. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw. Learn. Syst., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.

[15] D. Liu, Y. Zhang, and H. Zhang, “A self-learning call admission control scheme for CDMA cellular networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 16, no. 5, pp. 1219–1228, Sep. 2005.

[16] J. Fu, H. He, and X. Zhou, “Adaptive learning and control for MIMO system based on adaptive dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 22, no. 7, pp. 1133–1148, Jul. 2011.

[17] H. G. Zhang, Q. L. Wei, and Y. H. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.

[18] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, pp. 878–888, May 2010.

[19] P. J. Werbos, “Intelligence in the brain: A theory of how it works and how to build it,” Neural Netw., vol. 22, no. 3, pp. 200–212, 2009.

[20] P. J. Werbos, “Using ADP to understand and replicate brain intelligence: The next level design,” in Proc. IEEE Int. Symp. Approx. Dyn. Program. Reinforcement Learn., Apr. 2007, pp. 209–216.

[21] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming. New York, NY, USA: Wiley, 2004.

[22] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. New York, NY, USA: Wiley, 2012.

[23] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 942–949, Aug. 2008.

[24] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.

[25] D. M. Adhyaru, I. N. Kar, and M. Gopal, “Bounded robust control of nonlinear systems using neural network-based HJB solution,” Neural Comput. Appl., vol. 20, no. 1, pp. 91–103, 2011.

[26] D. M. Adhyaru, I. N. Kar, and M. Gopal, “Fixed final time optimal control approach for bounded robust controller design using Hamilton–Jacobi–Bellman solution,” IET Control Theory Appl., vol. 3, no. 9, pp. 1183–1195, Sep. 2009.

[27] D. Liu, D. Wang, and X. Yang, “An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs,” Inf. Sci., vol. 220, pp. 331–342, Jan. 2013.

[28] D. Wang, D. Liu, D. Zhao, Y. Huang, and D. Zhang, “A neural-network-based iterative GDHP approach for solving a class of nonlinear optimal control problems with control constraints,” Neural Comput. Appl., vol. 22, no. 2, pp. 219–227, 2013.

[29] F. Lin, R. D. Brandt, and J. Sun, “Robust control of nonlinear systems: Compensating for uncertainty,” Int. J. Control, vol. 56, no. 6, pp. 1453–1459, 1992.

[30] F. Lin, “An optimal control approach to robust control design,” Int. J. Control, vol. 73, no. 3, pp. 177–186, 2000.

[31] X. Zhong, H. He, and D. V. Prokhorov, “Robust controller design of continuous-time nonlinear system using neural network,” in Proc. Int. Joint Conf. Neural Netw., 2013, pp. 1–8.

[32] Q. Wei and D. Liu, “Adaptive dynamic programming with stable value iteration algorithm for discrete-time nonlinear systems,” in Proc. IEEE Int. Joint Conf. Neural Netw., Jun. 2012, pp. 1–6.

[33] D. Liu and Q. Wei, “Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 43, no. 2, pp. 779–789, Apr. 2013.

[34] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, “Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming,” Automatica, vol. 48, no. 8, pp. 1825–1832, 2012.

[35] H. Li and D. Liu, “Optimal control for discrete-time affine nonlinear systems using general value iteration,” IET Control Theory Appl., vol. 6, no. 18, pp. 2725–2736, Dec. 2012.

[36] H. Zhang, Q. Wei, and D. Liu, “An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games,” Automatica, vol. 47, no. 1, pp. 207–214, 2011.

[37] D. Liu, H. Li, and D. Wang, “Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm,” Neurocomputing, vol. 110, pp. 92–100, Jun. 2013.

[38] Q. Wei and D. Liu, “Numerical adaptive learning control scheme for discrete-time non-linear systems,” IET Control Theory Appl., vol. 7, no. 11, pp. 1472–1486, Jul. 2013.

[39] D. Liu, H. Javaherian, O. Kovalenko, and T. Huang, “Adaptive critic learning techniques for engine torque and air–fuel ratio control,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 988–993, Aug. 2008.

[40] D. Wang, D. Liu, and Q. Wei, “Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach,” Neurocomputing, vol. 78, no. 1, pp. 14–22, 2012.

[41] H. He, Z. Ni, and J. Fu, “A three-network architecture for on-line learning and optimization based on adaptive dynamic programming,” Neurocomputing, vol. 78, no. 1, pp. 3–13, 2012.

[42] H. He, Self-Adaptive Systems for Machine Intelligence. New York, NY, USA: Wiley, 2011.

[43] Z. Ni, H. He, and J. Wen, “Adaptive learning in tracking control based on the dual critic network design,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913–928, Jun. 2013.

[44] Z. Ni, X. Fang, H. He, D. Zhao, and X. Xu, “Real-time tracking control on adaptive critic design with uniformly ultimately bounded condition,” in Proc. IEEE Symp. Adapt. Dyn. Program. Reinforcement Learn., Apr. 2013, pp. 20–25.

[45] Z. Ni, H. He, J. Wen, and X. Xu, “Goal representation heuristic dynamic programming on maze navigation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 12, pp. 2038–2050, Dec. 2013.


[46] Z. Ni and H. He, “Heuristic dynamic programming with internal goal representation,” Soft Comput., vol. 17, no. 11, pp. 2101–2108, 2013.

[47] H. He, Z. Ni, and D. Zhao, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control (Learning and Optimization in Hierarchical Adaptive Critic Design). New York, NY, USA: Wiley, 2013, pp. 78–95.

[48] Z. Ni, H. He, D. Zhao, and D. V. Prokhorov, “Reinforcement learning control based on multi-goal representation using hierarchical heuristic dynamic programming,” in Proc. IJCNN, 2012, pp. 1–8.

[49] F. L. Lewis and K. G. Vamvoudakis, “Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 14–25, Feb. 2011.

[50] Z. Ni, H. He, and X. Zhong, Frontiers of Intelligent Control and Information Processing (Experimental Studies on Data-Driven Heuristic Dynamic Programming for POMDP). Singapore: World Sci. Publishing, 2014.

[51] X. Zhang, H. Zhang, Q. Sun, and Y. Luo, “Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence,” Neurocomputing, vol. 91, pp. 48–55, Aug. 2012.

[52] H. Alzer, “On the Cauchy–Schwarz inequality,” J. Math. Anal. Appl., vol. 234, no. 1, pp. 6–14, 1999.

[53] F. Liu, J. Sun, J. Si, W. Guo, and S. Mei, “A boundedness result for the direct heuristic dynamic programming,” Neural Netw., vol. 32, pp. 229–235, Aug. 2012.

[54] X. Luan, F. Liu, and P. Shi, “Neural-network-based finite-time H∞ control for extended Markov jump nonlinear systems,” Int. J. Adapt. Control Signal Process., vol. 24, no. 7, pp. 554–567, 2010.

[55] H. Wu and K. Cai, “Mode-independent robust stabilization for uncertain Markovian jump nonlinear systems via fuzzy control,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 3, pp. 509–519, Jun. 2005.

[56] R. Palm and D. Driankov, “Fuzzy switched hybrid systems-modeling and identification,” in Proc. IEEE Int. Symp. Intell. Control, CIRA, ISAS, Sep. 1998, pp. 130–135.

Xiangnan Zhong received the B.S. and M.S. degrees in automation and control from Northeastern University, Shenyang, China, in 2010 and 2012, respectively. She is currently pursuing the Ph.D. degree with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA.

Her current research interests include adaptive dynamic programming, reinforcement learning, optimal control, and machine learning.

Haibo He (SM’11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, and the Ph.D. degree in electrical engineering from Ohio University, Athens, OH, USA, in 1999, 2002, and 2006, respectively.

He is currently the Robert Haas Endowed Professor in electrical engineering with the University of Rhode Island, Kingston, RI, USA. From 2006 to 2009, he was an Assistant Professor with the Department of Electrical and Computer Engineering,

Stevens Institute of Technology, Hoboken, NJ, USA. He has published one research book (Wiley), edited one research book (Wiley-IEEE) and six conference proceedings (Springer), and authored and co-authored more than 130 peer-reviewed journal and conference papers. His research has been covered by national and international media, such as IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News. His current research interests include machine learning, cyber-physical systems, computational intelligence, hardware design for machine intelligence, and various applications such as smart grid and renewable energy systems.

Dr. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID. He received the IEEE Computational Intelligence Society Outstanding Early Career Award in 2014, the National Science Foundation CAREER Award in 2011, and the Providence Business News Rising Star Innovator Award in 2011.

Huaguang Zhang (SM’04) received the B.S. and M.S. degrees in control engineering from the Northeast Dianli University of China, Jilin, China, in 1982 and 1985, respectively, and the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991.

He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Post-Doctoral Fellow. Since 1994, he has been a Professor and the Head of the Institute of Electric

Automation, School of Information Science and Engineering, Northeastern University. He has authored and co-authored more than 200 journal and conference papers and four monographs, and co-invented 20 patents. His current research interests include fuzzy control, stochastic system control, nonlinear control, and their applications.

Dr. Zhang is the Chair of the Adaptive Dynamic Programming and Reinforcement Learning Technical Committee of the IEEE Computational Intelligence Society. He is an Associate Editor of Automatica, the IEEE TRANSACTIONS ON FUZZY SYSTEMS, the IEEE TRANSACTIONS ON NEURAL NETWORKS, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B, and Neurocomputing. He was awarded the Outstanding Youth Science Foundation Award from the National Natural Science Foundation Committee of China in 2003. He was named the Cheung Kong Scholar by the Education Ministry of China in 2005.

Zhanshan Wang (M’09) received the B.S. degree in industry electric automation from Baotou Iron and Steel Institute (now Inner Mongolia University of Science and Technology), Baotou, China, the M.S. degree in control theory and control engineering from Fushun Petroleum Institute (now Liaoning Shihua University), Fushun, China, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 1994, 2001, and 2006, respectively.

He is currently a Professor with Northeastern University. His current research interests include stability analysis of recurrent neural networks, synchronization of complex networks, fault diagnosis, fault tolerant control, and nonlinear control.