
Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2013, Article ID 387817, 16 pages, http://dx.doi.org/10.1155/2013/387817

Research Article
Decentralized Reinforcement Learning Robust Optimal Tracking Control for Time Varying Constrained Reconfigurable Modular Robot Based on ACI and Q-Function

Bo Dong¹ and Yuanchun Li²

¹ Department of Communication Engineering, Jilin University, Changchun 130022, China
² Department of Control Engineering, Changchun University of Technology, Changchun 130012, China

Correspondence should be addressed to Yuanchun Li; yuanchun@jlu.edu.cn

Received 20 August 2013; Revised 13 November 2013; Accepted 13 November 2013

Academic Editor: M. Onder Efe

Copyright © 2013 B. Dong and Y. Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A novel decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots, based on the action-critic-identifier (ACI) and the state-action value function (Q-function), is presented to solve the problem of the continuous-time nonlinear optimal control policy for strongly coupled uncertain robotic systems. The dynamics of the time varying constrained reconfigurable modular robot are described as a synthesis of interconnected subsystems, and a continuous-time state equation and Q-function are designed in this paper. Combining ACI with RBF networks, the global uncertainty of the subsystem and the HJB (Hamilton-Jacobi-Bellman) equation are estimated: the critic-NN and the action-NN approximate the optimal Q-function and the optimal control policy, the identifier identifies the global uncertainty, and the RBF-NN is used to update the weights of the ACI-NN. On this basis, a novel decentralized robust optimal tracking controller of the subsystem is proposed, so that the subsystem can track the desired trajectory and the tracking error can converge to zero in finite time. The stability of the ACI and of the robust optimal tracking controller is confirmed by Lyapunov theory. Finally, comparative simulation examples are presented to illustrate the effectiveness of the proposed ACI and decentralized control theory.

1 Introduction

A reconfigurable modular robot can transform its configuration according to different external situations and task requirements. Following the concept of modular design and the decentralized control theory of subsystems, a reconfigurable modular robot can complete a task by efficiently changing its structure in different situations, without redesigning the control law. At the same time, reconfigurable modular robots possess good accuracy and flexibility.

Many scholars have studied the dynamics and control methods of reconfigurable modular robots. A novel VGSTA-ESO based decentralized ADRC method for reconfigurable modular robots was proposed in [1]: by designing a high-precision VGSTA-ESO to estimate the nonlinear terms of the dynamic model and the interconnection terms of the subsystem, joint trajectory tracking control is implemented. Based on the computed torque method, a robust fuzzy neural network controller is proposed in [2] to solve the problem of model uncertainty arising in the modeling process. Reference [3] presents a decentralized adaptive fuzzy sliding mode control method for reconfigurable modular robots: a fuzzy logic system approximates the unknown dynamics of the subsystem, and a sliding mode controller with an adaptive scheme is designed to eliminate the effects of both the interconnection term and the fuzzy approximation error. A decentralized adaptive neural network control algorithm for reconfigurable manipulators is proposed in [4], where neural networks with adaptive update laws approximate the unknown dynamic functions and interconnections of the subsystem. A new distributed control method is proposed in [5], which uses a decomposition algorithm to decompose the robot dynamic system into a number of dynamical subsystems, and an adaptive sliding mode controller is designed to offset the impact of model uncertainty. An observer based decentralized adaptive fuzzy controller for reconfigurable manipulators is proposed in [6]: by designing a state observer, the adaptive fuzzy systems that model the unknown dynamics of the subsystem and the interconnection term can be constructed from the state estimates. Nevertheless, these methods require the dynamics of the reconfigurable modular robot system to be known either fully or at least partially, a requirement that is hard to satisfy. Moreover, because of the strongly coupled model uncertainties and the interconnection terms of the subsystems in a reconfigurable modular robot system, the processing load on the controller increases, and larger time delays and computational errors are easily produced, so designing controllers with the methods and algorithms above becomes too complicated.

In recent years, reinforcement learning has received extensive attention from scholars as one of the most effective methods for solving control problems of continuous-time, strongly coupled nonlinear systems. Reinforcement learning [7, 8] is a learning method that maps situations to actions so as to maximize a numerical reward signal. Unlike supervised learning, reinforcement learning does not require a teacher signal in each state; instead, it learns through interaction with the environment. Because of its adaptive optimization capability for nonlinear models under uncertainty, reinforcement learning has a unique advantage in solving optimization and control problems for complex models [9–11]. Zhang and his team presented an infinite-horizon optimal tracking control scheme for discrete-time nonlinear systems via the greedy HDP iteration algorithm [12–14]. Through a system transformation, the optimal tracking problem is converted into an optimal regulation problem, and the greedy HDP iteration algorithm, with rigorous convergence analysis, is introduced to solve the regulation problem. They then proposed a data-driven robust approximate optimal tracking control using adaptive dynamic programming together with a data-driven model, established by a recurrent neural network, that reconstructs the unknown system dynamics from available input-output data [15]. After this, they designed a fuzzy critic estimator to estimate the value function of nonlinear continuous-time systems [16]. On this basis, the synchronization problem for an array of neural networks with hybrid coupling and interval time varying delay was addressed with an augmented Lyapunov-Krasovskii functional method [17]. An FRL scheme using only the immediate reward, together with sufficient conditions, is adopted to analyze the convergence of the optimal task performance. Bhasin presents a neural network control of a robot interacting with an uncertain viscoelastic environment [18]; in [18], a continuous controller is developed for a robot that moves in free space and then regulates the new coupled dynamic system to a desired setpoint. Khan et al. present an implementation of a model-free Q-learning based discrete model reference compliance controller for a humanoid robot arm [19], where a recently developed Q-learning scheme is used to develop an optimal policy online. Patchaikani et al. propose an adaptive critic-based real-time redundancy resolution scheme for the kinematic control of a redundant manipulator [20]: the kinematic control of the redundant manipulator is formulated as a discrete-time input affine system, and then an optimal real-time redundancy resolution scheme is proposed. Although research on reinforcement learning algorithms has developed rapidly in recent years, some deficiencies remain. For example, when the global system contains multiple subsystems, the methods above cannot handle the impact of the interconnection terms between the subsystems. Moreover, the methods above mostly address the learning and optimization problems of the system itself; when external constraints act on the system, these methods are no longer applicable. Therefore, designing a robust reinforcement learning optimal control method for systems with external constraints and multiple coupled subsystems is an urgent problem to be solved.

In this paper, we present a novel continuous-time decentralized reinforcement learning robust optimal tracking control theory for the time varying constrained reconfigurable modular robot. Combining ACI with an RBF-NN, the critic-NN is used to estimate the optimal $Q$-function, the action-NN is used to approximate the optimal control policy, and the identifier is adopted to identify the global uncertainty, so that the HJB equation can be estimated with a bounded and convergent estimation error. First, since a decentralized control method is adopted, each joint subsystem has its own controller, which greatly reduces the processing load of the controllers. Second, since the time varying constraints can be compensated within the subsystems, the proposed method is suitable for a reconfigurable modular robot operating in a time varying constrained environment. Third, the proposed control method compensates for the impact of the model uncertainties and the interconnection terms on the system, so that the subsystems track the desired trajectories and the tracking error converges to zero in finite time.

2 Problem Formulation

Assume that the time varying external constraints on the end of the reconfigurable modular robot are given by

$$\Psi(q, t) = 0 \tag{1}$$

Here, $q \in R^{n}$ is the vector of joint displacements, and $\Psi: R^{n} \to R^{m}$, where $m$ is the dimension of the external constraint conditions. With the time varying constraints, the dynamics of a reconfigurable modular robot can be written as

$$M(q)\ddot q + C(q,\dot q)\dot q + G(q) + F(q,\dot q) = u + J_\Psi^T(q,t)\,f \tag{2}$$

In (2), $M(q) \in R^{n\times n}$ is the inertia matrix, $C(q,\dot q)\dot q \in R^{n}$ is the Coriolis and centripetal force, $G(q) \in R^{n}$ is the gravity term, $F(q,\dot q)$ is the unmodeled dynamics including friction terms and external disturbances, $u \in R^{n}$ is the applied joint torque, and $J_\Psi^T(q,t) f$ is the contact force generated by the contact between the end of the reconfigurable modular robot and the external constraints.

After $m$ constraints are introduced for a robot that otherwise works in free space, the system loses $m$ degrees of freedom because of the limitation (1). Therefore, the number of degrees of freedom of the robot changes from $n$ to $n - m$, so that only $n - m$ independent joint displacements are needed to describe the constrained motion fully.

Define

$$q = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}, \qquad q_1 \in R^{n-m}, \quad q_2 \in R^{m} \tag{3}$$

Substituting (3) into (1), we obtain

$$\Psi\big(q_1, \Theta(q_1,t), t\big) = 0 \tag{4}$$

where

$$q_2 = \Theta(q_1,t) \tag{5}$$

Therefore, $q$ in (3) can be described fully by the joint displacement $q_1$:

$$q = \begin{bmatrix} q_1 \\ \Theta(q_1,t) \end{bmatrix} \tag{6}$$

The time derivative of (6) is

$$\dot q = \begin{bmatrix} \dot q_1 \\ \dfrac{\partial\Theta(q_1,t)}{\partial q_1}\dot q_1 + \dfrac{\partial\Theta(q_1,t)}{\partial t} \end{bmatrix} = \begin{bmatrix} I_{n-m} & 0 \\ \dfrac{\partial\Theta(q_1,t)}{\partial q_1} & I_m \end{bmatrix} \begin{bmatrix} \dot q_1 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ \dfrac{\partial\Theta(q_1,t)}{\partial t} \end{bmatrix} = T\dot\theta + H \tag{7}$$

In (7),

$$T = \begin{bmatrix} I_{n-m} & 0 \\ \dfrac{\partial\Theta(q_1,t)}{\partial q_1} & I_m \end{bmatrix} \in R^{n\times n}, \qquad \dot\theta = \begin{bmatrix} \dot q_1 \\ 0 \end{bmatrix} \in R^{n}, \qquad H = \begin{bmatrix} 0 \\ \dfrac{\partial\Theta(q_1,t)}{\partial t} \end{bmatrix} \in R^{n} \tag{8}$$
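As a quick numerical check of (7) and (8), the following Python sketch uses a hypothetical two-joint example that is not taken from the paper: $n = 2$, $m = 1$, an assumed constraint solution $\Theta(q_1,t) = \sin q_1 + 0.5t$, and an assumed trajectory $q_1(t) = 0.3t^2$. It builds $T$, $\dot\theta$, and $H$ as in (8) and compares $T\dot\theta + H$ with a central-difference derivative of $q(t)$ from (6). All function names are illustrative.

```python
import math

# Hypothetical example (not from the paper): n = 2 joints, m = 1 constraint,
# eliminated coordinate q2 = Theta(q1, t) = sin(q1) + 0.5 * t.
def theta_fn(q1, t):
    return math.sin(q1) + 0.5 * t

def dtheta_dq1(q1, t):
    return math.cos(q1)

def dtheta_dt(q1, t):
    return 0.5

def q1_traj(t):
    """Assumed trajectory of the independent coordinate q1."""
    return 0.3 * t ** 2

def q_of_t(t):
    """Full joint vector q = [q1, Theta(q1, t)], as in (6)."""
    q1 = q1_traj(t)
    return [q1, theta_fn(q1, t)]

def qdot_via_T_H(t):
    """q_dot = T theta_dot + H, with T, theta_dot and H built as in (8)."""
    q1 = q1_traj(t)
    q1_dot = 0.6 * t                      # derivative of 0.3 * t**2
    T = [[1.0, 0.0],
         [dtheta_dq1(q1, t), 1.0]]
    theta_dot = [q1_dot, 0.0]
    H = [0.0, dtheta_dt(q1, t)]
    return [sum(T[r][c] * theta_dot[c] for c in range(2)) + H[r] for r in range(2)]

def qdot_finite_difference(t, h=1e-6):
    """Central-difference derivative of q(t), used only for comparison."""
    qp, qm = q_of_t(t + h), q_of_t(t - h)
    return [(a - b) / (2 * h) for a, b in zip(qp, qm)]

if __name__ == "__main__":
    t = 1.2
    print(qdot_via_T_H(t), qdot_finite_difference(t))
```

Because $\Theta$ is available in closed form here, its partial derivatives are coded by hand; for a real constraint they would have to be obtained by solving (4) for $q_2$, which this sketch does not attempt.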

Therefore, the second derivative of $q$ follows directly:

$$\ddot q = T\ddot\theta + \dot T\dot\theta + \dot H \tag{9}$$

Substituting (7) and (9) into (2), we get

$$u + J_\Psi^T(q,t)\,f = M(q)\big(T\ddot\theta + \dot T\dot\theta + \dot H\big) + C(q,\dot q)\big(T\dot\theta + H\big) + G(q) + F(q,\dot q) \tag{10}$$

Define

$$E = \begin{bmatrix} I_{(n-m)\times(n-m)} \\ 0_{m\times(n-m)} \end{bmatrix} \in R^{n\times(n-m)} \tag{11}$$

Therefore,

$$\dot\theta = \begin{bmatrix} \dot q_1 \\ 0 \end{bmatrix} = E\dot q_1 \tag{12}$$
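The second-derivative relation (9), combined with $\dot\theta = E\dot q_1$ from (12) and hence $\ddot\theta = E\ddot q_1$, can be checked the same way. This hypothetical sketch assumes $n = 2$, $m = 1$, $\Theta(q_1,t) = \sin q_1 + 0.25t^2$, and $q_1(t) = 0.3t^2$, none of which come from the paper. It evaluates $TE\ddot q_1 + \dot TE\dot q_1 + \dot H$ and compares the result with a finite-difference second derivative of $q(t)$.

```python
import math

# Hypothetical example (not from the paper): n = 2, m = 1,
# Theta(q1, t) = sin(q1) + 0.25 * t**2 and q1(t) = 0.3 * t**2.
def q_of_t(t):
    q1 = 0.3 * t ** 2
    return [q1, math.sin(q1) + 0.25 * t ** 2]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def qddot_via_eq9(t):
    """q_ddot = T E q1_ddot + T_dot E q1_dot + H_dot, following (9) and (12)."""
    q1, q1_dot, q1_ddot = 0.3 * t ** 2, 0.6 * t, 0.6
    T = [[1.0, 0.0], [math.cos(q1), 1.0]]
    T_dot = [[0.0, 0.0], [-math.sin(q1) * q1_dot, 0.0]]  # time derivative of dTheta/dq1
    H_dot = [0.0, 0.5]                                    # time derivative of dTheta/dt = 0.5 t
    E = [[1.0], [0.0]]                                    # E from (11) with n - m = 1, m = 1
    a = matvec(T, matvec(E, [q1_ddot]))                   # T theta_ddot, theta_ddot = E q1_ddot
    b = matvec(T_dot, matvec(E, [q1_dot]))                # T_dot theta_dot, theta_dot = E q1_dot
    return [a[r] + b[r] + H_dot[r] for r in range(2)]

def qddot_finite_difference(t, h=1e-4):
    """Central second difference of q(t), used only for comparison."""
    qp, q0, qm = q_of_t(t + h), q_of_t(t), q_of_t(t - h)
    return [(p - 2 * c + m) / h ** 2 for p, c, m in zip(qp, q0, qm)]

if __name__ == "__main__":
    print(qddot_via_eq9(1.2), qddot_finite_difference(1.2))
```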

So (2) can be decomposed into the following form:

$$\sum_{j=1}^{n} M_{ij}(q)\big[(TE\ddot q_1)_j + (\dot TE\dot q_1)_j + \dot H_j\big] + \sum_{j=1}^{n} C_{ij}(q,\dot q)\big[(TE\dot q_1)_j + H_j\big] + G_i(q) + F_i(q_i,\dot q_i) - f_i = u_i \tag{13}$$

In the equation above, $(TE\ddot q_1)_j$, $(\dot TE\dot q_1)_j$, $(TE\dot q_1)_j$, $\dot H_j$, and $H_j$ are the $j$th elements of $TE\ddot q_1$, $\dot TE\dot q_1$, $TE\dot q_1$, $\dot H$, and $H$, respectively; $G_i(q)$, $F_i(q_i,\dot q_i)$, and $u_i$ are the $i$th elements of $G(q)$, $F(q,\dot q)$, and $u$; $f_i$ is the constraint force acting on the $i$th joint; and $M_{ij}(q)$ and $C_{ij}(q,\dot q)$ are the $ij$th elements of $M(q)$ and $C(q,\dot q)$, respectively. Thus, as shown in Figure 1, the dynamical model of each subsystem can be formulated in joint space as follows:

$$M_i(q_i)\ddot q_i + C_i(q_i,\dot q_i)\dot q_i + G_i(q_i) + F_i(q_i,\dot q_i) + Z_i(q,\dot q,\ddot q) - f_i = u_i \tag{14}$$

$$\begin{aligned} Z_i(q,\dot q,\ddot q) ={}& \sum_{j=1,\, j\ne i}^{n} M_{ij}(q)\big[(TE\ddot q_1)_j + (\dot TE\dot q_1)_j + \dot H_j\big] + M_{ii}(q)\big[(TE\ddot q_1)_i + (\dot TE\dot q_1)_i + \dot H_i\big] - M_i(q_i)\ddot q_i \\ &+ \sum_{j=1,\, j\ne i}^{n} C_{ij}(q,\dot q)\big[(TE\dot q_1)_j + H_j\big] + C_{ii}(q,\dot q)\big[(TE\dot q_1)_i + H_i\big] - C_i(q_i,\dot q_i)\dot q_i + \big[G_i(q) - G_i(q_i)\big] \end{aligned} \tag{15}$$
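The decomposition into (13) and (15) is an exact algebraic regrouping. Whatever local model $M_i(q_i)$, $C_i(q_i,\dot q_i)$, $G_i(q_i)$ is chosen, $Z_i$ in (15) absorbs the difference, so the subsystem form $M_i\ddot q_i + C_i\dot q_i + G_i + F_i + Z_i - f_i = u_i$ shown in Figure 1 reproduces row $i$ of (13). The Python sketch below checks this identity with random numbers standing in for the evaluated terms (hypothetical data). The vectors `a` and `v` stand for $TE\ddot q_1 + \dot TE\dot q_1 + \dot H$ and $TE\dot q_1 + H$, and the `_loc` names are illustrative stand-ins for the local subsystem terms.

```python
import random

def row_13(i, M, C, a, v, G_full, F, f):
    """Left-hand side of row i of (13)."""
    n = len(a)
    return (sum(M[i][j] * a[j] for j in range(n))
            + sum(C[i][j] * v[j] for j in range(n))
            + G_full[i] + F[i] - f[i])

def z_15(i, M, C, a, v, G_full, G_loc, M_loc, C_loc, qdd, qd):
    """Interconnection term Z_i as defined in (15)."""
    n = len(a)
    z = sum(M[i][j] * a[j] for j in range(n) if j != i) + M[i][i] * a[i] - M_loc[i] * qdd[i]
    z += sum(C[i][j] * v[j] for j in range(n) if j != i) + C[i][i] * v[i] - C_loc[i] * qd[i]
    z += G_full[i] - G_loc[i]
    return z

def row_14(i, M_loc, C_loc, G_loc, F, f, qdd, qd, z):
    """Left-hand side of the subsystem model M_i qdd_i + C_i qd_i + G_i + F_i + Z_i - f_i."""
    return M_loc[i] * qdd[i] + C_loc[i] * qd[i] + G_loc[i] + F[i] + z - f[i]

def random_case(n, seed=0):
    """Random numbers standing in for the evaluated matrices and vectors."""
    rng = random.Random(seed)

    def r():
        return rng.uniform(-2.0, 2.0)

    M = [[r() for _ in range(n)] for _ in range(n)]
    C = [[r() for _ in range(n)] for _ in range(n)]
    vecs = [[r() for _ in range(n)] for _ in range(10)]
    return M, C, vecs

if __name__ == "__main__":
    n = 3
    M, C, (a, v, G_full, G_loc, M_loc, C_loc, qdd, qd, F, f) = random_case(n)
    for i in range(n):
        z = z_15(i, M, C, a, v, G_full, G_loc, M_loc, C_loc, qdd, qd)
        print(i, row_13(i, M, C, a, v, G_full, F, f),
              row_14(i, M_loc, C_loc, G_loc, F, f, qdd, qd, z))
```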

Let $x_i = [x_{i1}, x_{i2}]^T = [q_i, \dot q_i]^T$ for $i = 1, \ldots, n$. Then (14) can be written as the following state equation:

$$S_i: \begin{cases} \dot x_{i1} = x_{i2} \\ \dot x_{i2} = -f(x_i,u_i) - h_i(q,\dot q,\ddot q) - f_i \\ y_i = x_{i1} \end{cases} \tag{16}$$


[Figure 1: block diagram of subsystems $1, \ldots, n$. Subsystem $i$ obeys $M_i\ddot q_i + C_i\dot q_i + G_i + F_i + Z_i - f_i = u_i$ and is coupled to the others through the interconnection term $Z_i(q,\dot q,\ddot q)$; the complete robot obeys $\ddot q = M^{-1}\big(u - C\dot q - G - F + J_\Psi^T(q,t) f\big)$.]

Figure 1: The architecture of the time varying constrained reconfigurable modular robot system.

where $x_i$ is the state vector of subsystem $S_i$, $y_i$ is the output of subsystem $S_i$, and $h_i(q,\dot q,\ddot q)$ is the interconnection term of the subsystem. $f(x_i,u_i)$ and $h_i(q,\dot q,\ddot q)$ are defined as

$$\begin{aligned} f(x_i,u_i) &= M_i^{-1}(q_i)\big[C_i(q_i,\dot q_i)\dot q_i + G_i(q_i) + F_i(q_i,\dot q_i) - u_i\big] \\ h_i(q,\dot q,\ddot q) &= -M_i^{-1}(q_i)\,Z_i(q,\dot q,\ddot q) \end{aligned} \tag{17}$$

For the time varying constrained reconfigurable modular robot system, we need to design a decentralized robust optimal tracking control policy that makes each subsystem track its desired trajectory while the tracking error remains bounded and converges.

3 Decentralized Reinforcement Learning Robust Optimal Tracking Control Based on ACI and Q-Function

Assumption 1. The desired trajectory $y_{id}$, its derivatives $\dot y_{id}$ and $\ddot y_{id}$, and the input gain matrix $b_i(x_i)$ are bounded.

Then (16) can be transformed into

$$S_i: \begin{cases} \dot x_{i1} = x_{i2} \\ \dot x_{i2} = -\big[F(x_i,u_i) + h_i(q,\dot q,\ddot q) + f_i\big] + b_i(x_i)u_i \\ y_i = x_{i1} \end{cases} \tag{18}$$

where $F(x_i,u_i) = f(x_i,u_i) + b_i(x_i)u_i$.
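The definition of $F(x_i,u_i)$ can be sanity-checked for a single joint. Suppose, for illustration only, that the input gain is $b_i(x_i) = M_i^{-1}(q_i)$ (the paper does not state $b_i$ explicitly). Then the $u_i$ terms cancel, so $F(x_i,u_i)$ does not actually depend on $u_i$, and the second state equations of (16) and (18) coincide. All parameter values below are hypothetical.

```python
def f_17(qd, u, M_i, C_i, G_i, F_i):
    """f(x_i, u_i) from (17) for one joint: M_i^{-1} [C_i qd + G_i + F_i - u]."""
    return (C_i * qd + G_i + F_i - u) / M_i

def F_18(qd, u, M_i, C_i, G_i, F_i):
    """F(x_i, u_i) = f(x_i, u_i) + b_i u_i, with the assumed input gain b_i = 1 / M_i."""
    return f_17(qd, u, M_i, C_i, G_i, F_i) + u / M_i

def x2dot_16(qd, u, h, f_con, p):
    """Second state equation of (16): -f(x_i, u_i) - h_i - f_i."""
    return -f_17(qd, u, **p) - h - f_con

def x2dot_18(qd, u, h, f_con, p):
    """Second state equation of (18): -[F(x_i, u_i) + h_i + f_i] + b_i u_i."""
    b = 1.0 / p["M_i"]
    return -(F_18(qd, u, **p) + h + f_con) + b * u

params = {"M_i": 1.5, "C_i": 0.2, "G_i": 3.0, "F_i": 0.1}   # hypothetical joint values

if __name__ == "__main__":
    print(F_18(0.4, 1.0, **params), F_18(0.4, 7.0, **params))   # same value for both inputs
    print(x2dot_16(0.4, 2.0, 0.3, 0.05, params), x2dot_18(0.4, 2.0, 0.3, 0.05, params))
```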

Assumption 2. The interconnection terms are bounded, satisfying

$$|h_i(q,\dot q,\ddot q)| \le \delta_{i0} + \sum_{j=1}^{n}\delta_{ij}(|s_{ij}|) \tag{19}$$

where $\delta_{i0} > 0$ is an unknown constant and $\delta_{ij}(|s_{ij}|) \ge 0$ is an unknown smooth Lipschitz function.

The trajectory tracking error of joint subsystem $i$ can be defined as

$$e_i(t) = x_i - y_{id} \tag{20}$$

For the continuous-time state equation (18) of the subsystem, which contains the nonlinear function and the interconnection terms, the value function can generally be defined as

$$V_i^{u_i(e_i(t))}(e_i(t)) = \int_0^\infty r_i\big(e_i(t), u_i(e_i(t))\big)\,dt \tag{21}$$

For brevity, we write $e_i$ and $u_i$ instead of $e_i(t)$ and $u_i(e_i(t))$. Since the update of the trajectory $y_{id}$ relies on the subsystem control $u_i$, we transform the value function into the following form to avoid infinite results when using (21):

$$V_i^{u_i}(e_i) = \int_0^\infty r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau, \qquad t \le \tau < \infty \tag{22}$$

Thus, the optimal value function of the subsystem can be defined as follows:

$$V_i^*(e_i) = \min_{u_i,\ t\le\tau<\infty} \int_0^\infty r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau \tag{23}$$


Here, $r_i(e_i,u_i)$ is the reward function for the current state, given by

$$r_i(e_i,u_i) = e_i^T Q_e e_i + u_i^T R u_i \tag{24}$$

where $Q_e$ and $R$ are positive definite matrices.
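To make (22) and (24) concrete, the sketch below evaluates the reward along a sampled error and control trajectory and approximates the cost integral with the trapezoidal rule over a finite horizon. The trajectories $e_i(t) = [e^{-t}, 0.5e^{-t}]^T$ and $u_i(t) = -2e^{-t}$ and the weights $Q_e = \mathrm{diag}(1, 2)$ and $R = 0.1$ are hypothetical. For these choices, the reward is $1.9e^{-2t}$, whose integral over $[0,\infty)$ is $0.95$, so the finite-horizon estimate should approach that value.

```python
import math

def quad_form(x, A):
    """Return x^T A x for a vector x (list) and a square matrix A (list of lists)."""
    n = len(x)
    return sum(x[r] * A[r][c] * x[c] for r in range(n) for c in range(n))

def reward(e, u, Qe, R):
    """Instantaneous reward r_i(e_i, u_i) = e^T Qe e + u^T R u, as in (24)."""
    return quad_form(e, Qe) + quad_form(u, R)

def value_estimate(ts, e_traj, u_traj, Qe, R):
    """Trapezoidal approximation of the cost integral over the sampled horizon ts."""
    r = [reward(e, u, Qe, R) for e, u in zip(e_traj, u_traj)]
    return sum(0.5 * (r[k] + r[k + 1]) * (ts[k + 1] - ts[k]) for k in range(len(ts) - 1))

if __name__ == "__main__":
    dt, horizon = 1e-3, 20.0
    ts = [k * dt for k in range(int(round(horizon / dt)) + 1)]
    e_traj = [[math.exp(-t), 0.5 * math.exp(-t)] for t in ts]   # hypothetical error
    u_traj = [[-2.0 * math.exp(-t)] for t in ts]                # hypothetical control
    Qe = [[1.0, 0.0], [0.0, 2.0]]
    R = [[0.1]]
    print(value_estimate(ts, e_traj, u_traj, Qe, R))   # close to 0.95
```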

Recording the value of state-action pairs is typically more useful than recording the value of states alone, since state-action pairs predict the reward. A low reward for a state does not imply that the values of its state-action pairs are also low: if the subsystem's state produces a higher reward over a period of time, it can still obtain a higher state-action value. Therefore, from a long-term perspective, defining a suitable state-action value function ($Q$-function) can make actions produce more reward [21, 22].

According to (23) and (24), the continuous-time optimal $Q$-function can be defined as

$$\begin{aligned} Q_i^*(e_i,a_i,u_i) &= r_i(e_i,a_i,u_i) + V_i^*(e_i,u_i) \\ &= r_i(e_i,a_i,u_i) + \min_{u_i,\ t\le\tau<\infty}\int_0^\infty r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau \end{aligned} \tag{25}$$

Assumption 3. The partial derivatives of $Q_i^*$ and $r_i(e_i,a_i,u_i)$ exist and are continuous in the domain.

According to (18) and (24), under the control policy $u_i$, the optimal $Q$-function satisfies the following Hamilton-Jacobi-Bellman equation [23]:

$$\begin{aligned} \mathrm{HJB}_i(e_i,u_i,\nabla Q_i^*) &= \min_{u_i(e_i)}\Big[r_i(e_i,a_i,u_i) + \nabla Q_i^*\big(-F(e_i,u_i) - h_i(e,\dot e,\ddot e) - f_i + b_i(e_i)u_i\big)\Big] \\ &= \min_{u_i(e_i)}\big[r_i(e_i,a_i,u_i) + \nabla Q_i^*\,\Phi_i(e_i,u_i)\big] \end{aligned} \tag{26}$$

where $\Phi_i(e_i,u_i) = -F(e_i,u_i) - h_i(e,\dot e,\ddot e) - f_i + b_i(e_i)u_i$ denotes the global uncertainty, which includes the unknown dynamics of the subsystem and the interconnection term, and $\nabla Q_i^* = \partial Q_i^*/\partial e_i$ denotes the gradient of the optimal $Q$-function.

Lemma 4 (see [24]). Consider the dynamics (14) of a subsystem of the time varying constrained reconfigurable modular robot. For the minimum of the HJB equation (26) to possess a stationary point with respect to $u_i$, the optimal $Q$-function and the optimal control policy must satisfy the following conditions:

(1) $\partial\,\mathrm{HJB}_i(e_i,u_i,\nabla Q_i)/\partial u_i = 0$;

(2) $\partial^2\,\mathrm{HJB}_i(e_i,u_i,\nabla Q_i)/(\partial u_i\,\partial u_i^T) \ge 0$.

These necessary conditions lead to the following results:

(a) The bounded control policy can guarantee a local minimum of the HJB equation (26) and satisfy the constraints imposed on the control inputs.

(b) The Hessian matrix is positive definite, and the control policy $u_i$ renders the global minimum of the HJB equation.

(c) If an optimal algorithm exists, it is unique.
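For the quadratic reward (24), the two conditions of Lemma 4 can be worked out explicitly. The derivation below is a sketch that assumes $R$ is symmetric and that $F(e_i,u_i)$, $h_i$, and $f_i$ do not depend on $u_i$, so that $\partial\Phi_i/\partial u_i = b_i(e_i)$. The symbol $\mathcal{H}_i$ for the bracketed term of (26) is introduced here for illustration only.

```latex
\begin{aligned}
\mathcal{H}_i(u_i) &= e_i^T Q_e e_i + u_i^T R u_i + \nabla Q_i^*\,\Phi_i(e_i,u_i) \\
\frac{\partial \mathcal{H}_i}{\partial u_i} &= 2 R u_i + b_i^T(e_i)\,(\nabla Q_i^*)^T = 0
  \;\Longrightarrow\; u_i^* = -\tfrac{1}{2} R^{-1} b_i^T(e_i)\,(\nabla Q_i^*)^T \\
\frac{\partial^2 \mathcal{H}_i}{\partial u_i\,\partial u_i^T} &= 2R > 0
\end{aligned}
```

Because $R$ is positive definite, the Hessian is positive definite for every $e_i$, so under these assumptions the stationary point is the unique global minimizer of the bracket, which is the situation described in results (b) and (c).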

According to Lemma 4, if the reward function is smooth and the optimal control $u_i^*$ is adopted, then the HJB equation satisfies

$$\mathrm{HJB}_i^*(e_i,u_i^*,\nabla Q_i^*) = \min_{u_i^*(e_i)}\big[r_i(e_i,a_i,u_i^*) + \nabla Q_i^*\,\Phi_i(e_i,u_i^*)\big] = 0 \tag{27}$$

The optimal control can then be expressed as follows:

$$u_i^*(e_i) = \arg\min_{u_i^*}\big[\mathrm{HJB}_i^*(e_i,u_i^*,\nabla Q_i^*)\big] = -\frac{1}{2}R^{-1}b_i^T(e_i)\left(\frac{\partial Q_i^*(e_i,a_i,u_i)}{\partial e_i}\right)^T \tag{28}$$
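The closed-form minimizer can also be checked against a brute-force search. In the hypothetical sketch below, $R$, $b_i(e_i)$, $\nabla Q_i^*$, and the $u_i$-independent part of $\Phi_i$ (called `drift`) are made-up numbers, and $u_i$ is two-dimensional. The term $e_i^TQ_ee_i$ is dropped because it does not depend on $u_i$. The check confirms that $u_i^* = -\tfrac12 R^{-1}b_i^T(\nabla Q_i^*)^T$ minimizes the bracket of (26) under these assumptions.

```python
import random

def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def u_optimal(R, b, grad_Q):
    """Stationary point u* = -1/2 R^{-1} b^T (grad Q)^T for symmetric positive definite R."""
    return [-0.5 * x for x in matvec(inv2(R), matvec(transpose(b), grad_Q))]

def hjb_objective(u, R, b, grad_Q, drift):
    """u-dependent part of the bracket in (26): u^T R u + grad_Q . (drift + b u).

    `drift` stands for the part of Phi_i that does not depend on u (assumed known here).
    """
    Ru, bu = matvec(R, u), matvec(b, u)
    return (sum(ui * x for ui, x in zip(u, Ru))
            + sum(g * (d + w) for g, d, w in zip(grad_Q, drift, bu)))

if __name__ == "__main__":
    R = [[2.0, 0.5], [0.5, 1.0]]        # hypothetical positive definite weight
    b = [[1.0, 0.0], [0.3, 1.0]]        # hypothetical input gain b_i(e_i)
    g, d = [0.7, -1.2], [0.1, 0.2]      # hypothetical gradient and drift
    u_star = u_optimal(R, b, g)
    rng = random.Random(0)
    gap = min(hjb_objective([u_star[0] + rng.uniform(-1, 1), u_star[1] + rng.uniform(-1, 1)],
                            R, b, g, d) - hjb_objective(u_star, R, b, g, d)
              for _ in range(1000))
    print(u_star, gap >= 0.0)
```

For symmetric $R$, moving away from $u_i^*$ by $\delta$ raises the objective by exactly $\delta^TR\delta$, which is why the random perturbations never find a lower value.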

If the optimal $Q$-function $Q_i^*$ is continuously differentiable and known, with initial value $Q_i^*(0) = 0$, and if the optimal control policy $u_i^*(e_i)$ and the global uncertainty of the subsystem $\Phi_i(e_i,u_i^*)$ are known, then the HJB equation (27) holds and is solvable. In practice, however, $Q_i^*$ is not differentiable everywhere, and $u_i^*(e_i)$ and $\Phi_i(e_i,u_i^*)$ are unknown.

Therefore, it is not feasible to solve the HJB equation by conventional methods. In this paper, we combine the action-critic-identifier (ACI) with an RBF neural network to estimate the optimal control policy, the optimal $Q$-function, and the global uncertainty of the subsystem. The action-NN estimates $u_i^*(e_i)$, denoted as $\hat u_i(e_i)$; the critic-NN estimates $Q_i^*$, denoted as $\hat Q_i$; and a robust neural network identifier identifies $\Phi_i(e_i,u_i^*)$, denoted as $\hat\Phi_i(e_i,u_i^*)$. The block diagram of the ACI architecture is shown in Figure 2. The estimated HJB equation can be expressed as follows:

$$\widehat{\mathrm{HJB}}_i^*(e_i,\hat u_i,\nabla\hat Q_i) = \min_{\hat u_i(e_i)}\big[r_i(e_i,a_i,\hat u_i) + \nabla\hat Q_i\,\hat\Phi_i(e_i,\hat u_i)\big] \tag{29}$$

The identification error of the HJB equation above can be expressed as

$$\delta_{hi} = \widehat{\mathrm{HJB}}_i^*(e_i,\hat u_i,\nabla\hat Q_i) - \mathrm{HJB}_i^*(e_i,u_i^*,\nabla Q_i^*) \tag{30}$$
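Since (27) sets $\mathrm{HJB}_i^* = 0$, computing the error (30) in practice amounts to evaluating the bracket of (29) at the current action estimate $\hat u_i$. The sketch below assumes a Gaussian RBF basis for the critic (with made-up centers and width), a scalar input, and an identifier output $\hat\Phi_i$ that is simply passed in. The critic gradient is computed analytically from $\hat Q_i = \hat W_{ic}^TS_{ic}(e_i)$, the critic form used in this section. The sketch illustrates only the computation of $\delta_{hi}$; it is not the paper's identifier.

```python
import math

def gaussian_basis(e, centers, width):
    """Gaussian RBF vector S(e); the basis shape, centers and width are assumptions."""
    return [math.exp(-sum((ej - cj) ** 2 for ej, cj in zip(e, c)) / (2.0 * width ** 2))
            for c in centers]

def critic_value(e, W_c, centers, width):
    """Q_hat = W_c^T S_c(e)."""
    return sum(w * s for w, s in zip(W_c, gaussian_basis(e, centers, width)))

def critic_gradient(e, W_c, centers, width):
    """dQ_hat/de = sum_k W_c[k] ds_k/de, with ds_k/de = -(e - c_k) s_k / width^2."""
    S = gaussian_basis(e, centers, width)
    grad = [0.0] * len(e)
    for w, s, c in zip(W_c, S, centers):
        for j in range(len(e)):
            grad[j] += w * (-(e[j] - c[j]) / width ** 2) * s
    return grad

def hjb_residual(e, u_hat, W_c, phi_hat, Qe, R, centers, width):
    """delta_hi = r_i(e_i, u_hat_i) + grad(Q_hat_i) . Phi_hat_i, using HJB* = 0 from (27).

    phi_hat is the identifier's estimate (taken as given); u_hat and R are scalars.
    """
    n = len(e)
    r = sum(e[a] * Qe[a][b] * e[b] for a in range(n) for b in range(n)) + R * u_hat ** 2
    grad = critic_gradient(e, W_c, centers, width)
    return r + sum(g * p for g, p in zip(grad, phi_hat))

if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, -0.5], [-0.5, 1.0]]   # hypothetical RBF centers
    print(hjb_residual([0.2, -0.1], 0.3, [0.8, -0.3, 1.1], [0.5, -0.4],
                       [[1.0, 0.0], [0.0, 2.0]], 0.1, centers, 1.0))
```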

A classic radial basis function neural network, proposed in [25], is given by

$$N(x) = W^{*T}S(x) + \varepsilon(x) \tag{31}$$


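[Figure 2: block diagram with action, critic, identifier, subsystem, reward function, and HJB error blocks; the signals shown include $e_i$, $u_i$, $\hat u_i$, $r_i(e_i,a_i,u_i)$, $\hat Q_i(e_i,a_i,u_i)$, $\Phi_i(e_i,u_i)$, $\hat\Phi_i(e_i,\hat u_i)$, $\delta_{hi}$, and an integrator $1/s$.]

Figure 2: The architecture of action-critic-identifier.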

where $W^*$ denotes the ideal neural network weights and $\varepsilon(x)$ represents the estimation error. With a sufficient number of nodes, and with the centers and widths of the nodes chosen appropriately, an RBF-NN can approximate any continuous function on a compact set to arbitrary accuracy. Therefore, the optimal $Q$-function and the optimal control policy can be expressed as follows:

$$\begin{aligned} Q_i^* &= W_i^T S_i(e_i) + \varepsilon_{ic}(e_i) \\ u_i^*(e_i) &= -\frac{1}{2}R^{-1}b_i^T(e_i)\big[S_i(e_i)^T W_i + \varepsilon_{ia}(e_i)\big] \end{aligned} \tag{32}$$

where $S_i(e_i) = [s_{i1}(e_i) \cdots s_{in}(e_i)]^T$ is the vector of smooth basis functions of the neural network, $W_i$ denotes the ideal unknown neural network weights, and $\varepsilon_{ic}(e_i)$ and $\varepsilon_{ia}(e_i)$ are the estimation errors. Using $\hat Q_i$ and $\hat u_i(e_i)$ to estimate $Q_i^*$ and $u_i^*(e_i)$, we obtain

$$\hat Q_i = \hat W_{ic}^T S_{ic}(e_i) \tag{33}$$

$$\hat u_i(e_i) = -\frac{1}{2}R^{-1}b_i^T(e_i)\,S_{ia}(e_i)^T\hat W_{ia} \tag{34}$$

In the equations above, $\hat{W}_{ic}(t)$ and $\hat{W}_{ia}(t)$ denote the weights of the critic-NN and the action-NN, respectively. The weight estimation errors are defined as follows:

$$\tilde{W}_{ic}(t) = W_i - \hat{W}_{ic}(t) \tag{35}$$

$$\tilde{W}_{ia}(t) = W_i - \hat{W}_{ia}(t) \tag{36}$$

The weight update law of the critic-NN is a gradient descent algorithm, given as follows:

$$\dot{\hat{W}}_{ic}(t) = -n_1 L_i\left(L_i^{T}\hat{W}_{ic} + e_i^{T} Q_e e_i + u_i^{T} R u_i\right) \tag{37}$$

In the equation above, $n_1 > 0$ is the adaptive gain of the neural network, and $L_i$ and $l_i$ are defined as

$$L_i = \frac{l_i}{l_i^{T} l_i + 1}, \qquad l_i = \nabla S_{ic}(e_i)\,\dot{e}_i \tag{38}$$

Therefore, according to the definitions above, the following bounds can be obtained:

$$L_{im} \le \|L_i\| \le L_{iM}, \qquad S_{icm} \le \|S_{ic}(e_i)\| \le S_{icM}, \qquad S_{iam} \le \|S_{ia}(e_i)\| \le S_{iaM} \tag{39}$$
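The normalized critic law (37)-(38) can be sketched as a single explicit-Euler step. The step size `dt`, the availability of the basis Jacobian `grad_S_ic`, the scalar input, and the function name are assumptions. The normalization in (38) is also what keeps $L_i$ bounded: since $\|l_i\|/(\|l_i\|^2 + 1) \le 1/2$ for every $l_i$, the upper bound $L_{iM} = 1/2$ in (39) always holds.

```python
import numpy as np

def critic_step(W_ic, grad_S_ic, e_dot, e, u, Qe, R, n1, dt):
    """One explicit-Euler step of the critic update (37) with the regressor of (38).

    grad_S_ic : (n, d) Jacobian of the critic basis at e, so l_i = grad_S_ic @ e_dot.
    u, R      : scalar joint input and scalar control weight (assumed).
    """
    l = grad_S_ic @ e_dot                  # l_i = grad S_ic(e_i) * e_dot_i
    L = l / (l @ l + 1.0)                  # normalized regressor L_i
    reward = e @ Qe @ e + R * u * u        # e_i^T Q_e e_i + u_i^T R u_i
    W_dot = -n1 * L * (L @ W_ic + reward)  # update law (37) as printed
    return W_ic + dt * W_dot, L

W, L = critic_step(np.zeros(3), np.eye(3)[:, :2], np.array([1.0, 0.5]),
                   np.array([0.1, 0.0]), 0.3, np.eye(2), 1.0, n1=5.0, dt=1e-3)
print(W, np.linalg.norm(L))
```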


Combining (35) with (38), we obtain

$$\dot{\tilde{W}}_{ic}(t) = -n_1 L_i\left(L_i^{T}\tilde{W}_{ic} + \delta_{hi}\right) \tag{40}$$

The weight update law of the action-NN is also developed as a gradient descent algorithm:

$$\dot{\hat{W}}_{ia}(t) = -n_2 S_{ia}(e_i)\left(\hat{W}_{ia}^{T} S_{ia}(e_i) + \frac{1}{2}R^{-1} b_i(e_i)\nabla S_{ic}(e_i)^{T}\hat{W}_{ic}\right)^{T} \tag{41}$$
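A minimal sketch of one Euler step of the action-NN law (41) follows, assuming a scalar joint input and treating $b_i(e_i)$ as a column vector so that $b_i^{T}\nabla S_{ic}^{T}\hat{W}_{ic}$ is a scalar. The function name, argument shapes, and step size are hypothetical.

```python
import numpy as np

def actor_step(W_ia, W_ic, S_ia, grad_S_ic, b, R, n2, dt):
    """One explicit-Euler step of the action-NN update (41) for a scalar input.

    S_ia      : (n_a,) actor basis vector S_ia(e_i)
    grad_S_ic : (n_c, d) Jacobian of the critic basis
    b         : (d,) input vector b_i(e_i), treated as a column vector (assumption)
    """
    critic_term = 0.5 / R * b @ (grad_S_ic.T @ W_ic)  # 1/2 R^{-1} b_i grad S_ic^T W_ic
    residual = W_ia @ S_ia + critic_term              # bracket in (41), a scalar here
    W_dot = -n2 * S_ia * residual
    return W_ia + dt * W_dot
```

The update stops exactly when the bracket in (41) vanishes, that is, when the actor output agrees with the policy implied by the current critic weights.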

According to the estimation error of the action-NN in (36), the optimal control $u_i^{*}(e_i)$ minimizes the optimal $Q$-function, which gives the following equation:

$$W_{ia}^{T} S_{ia}(e_i) + \frac{1}{2}R^{-1} b_i(e_i)\nabla S_{ic}(e_i)^{T} W_{ic} + \frac{1}{2}R^{-1} b_i(e_i)\nabla\varepsilon_{ic}(e_i) + \varepsilon_{ia}(e_i) = 0 \tag{42}$$

Combining (41) with (42), we obtain

$$\dot{\tilde{W}}_{ia} = -n_2 S_{ia}(e_i)\Big(\tilde{W}_{ia}^{T} S_{ia}(e_i) + \frac{1}{2}R^{-1} b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde{W}_{ic} - \frac{1}{2}R^{-1} b_i(e_i)\nabla\varepsilon_{ic}(e_i) - \varepsilon_{ia}(e_i)\Big) \tag{43}$$

After using the critic-NN and action-NN to estimate $Q_i$ and $u_i(e_i)$, a robust RBF-NN identifier is designed to identify the nonlinear uncertainties of the subsystem. Here $\Phi_i(e_i, \hat{u}_i)$ can be expressed as follows:

$$\Phi_i(e_i, \hat{u}_i) = \dot{e}_{iF} = W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T} e_{iF}\right) + \varepsilon_{iF}(e_{iF}) + b_i(e_i)\hat{u}_i \tag{44}$$

where $\kappa(\cdot)$ is the basis function of the neural network and $W_{iF}$, $\Lambda_{iF}$ are the unknown ideal neural network weights.

Equation (44) can be identified by a robust RBF-NN identifier, which gives

$$\hat{\Phi}_i(e_i, \hat{u}_i) = \dot{\hat{e}}_{iF} = \hat{W}_{iF}^{T}\hat{\kappa}_{iF} + b_i(e_i)\hat{u}_i + \mu_i \tag{45}$$

Here $\hat{\kappa}_{iF}$ denotes the estimated value of the basis function of the neural network, $\hat{W}_{iF}$ and $\hat{\Lambda}_{iF}$ are the estimated values of the neural network weights, and $\mu_i \in \mathbb{R}$ is the feedback error term given by [26]

$$\begin{aligned} \mu_i &= k\left(e_{iF}(t) - \hat{e}_{iF}(t)\right) - k\left(e_{iF}(0) - \hat{e}_{iF}(0)\right) + \vartheta = k\left(\tilde{e}_{iF}(t) - \tilde{e}_{iF}(0)\right) + \vartheta,\\ \dot{\vartheta} &= (k\alpha + \gamma)\tilde{e}_{iF} + \beta_1 \operatorname{sat}(\tilde{e}_{iF}) \end{aligned} \tag{46}$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gains and $\operatorname{sat}(\cdot)$ is a saturation function. The state estimation error of the identifier-NN, $\tilde{e}_{iF} = e_{iF} - \hat{e}_{iF}$, therefore evolves as follows:

$$\dot{\tilde{e}}_{iF} = \dot{e}_{iF} - \dot{\hat{e}}_{iF} = W_{iF}^{T}\kappa_{iF} - \hat{W}_{iF}^{T}\hat{\kappa}_{iF} + \varepsilon_{iF}(e_{iF}) - \mu_i \tag{47}$$

A filtered identification error is defined as follows

$$E_i = \dot{\tilde{e}}_{iF} + \alpha\tilde{e}_{iF} \tag{48}$$
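The feedback term (46) appears to follow the structure of a RISE-type (robust integral of the sign of the error) feedback, with $\operatorname{sat}(\cdot)$ used in place of $\operatorname{sgn}(\cdot)$. The sketch below integrates $\vartheta$ with explicit Euler and also computes the filtered error (48). The boundary-layer width `eps`, the choice $\vartheta(0) = 0$, and the class and function names are assumptions.

```python
import numpy as np

def sat(x, eps=0.01):
    """Saturation approximating sgn(x); the boundary-layer width eps is an assumption."""
    return np.clip(np.asarray(x, dtype=float) / eps, -1.0, 1.0)

class FeedbackTerm:
    """Feedback term mu_i of (46), with vartheta integrated by explicit Euler."""

    def __init__(self, e_tilde0, k, alpha, gamma, beta1, eps=0.01):
        self.e0 = np.asarray(e_tilde0, dtype=float)
        self.k, self.alpha, self.gamma, self.beta1, self.eps = k, alpha, gamma, beta1, eps
        self.theta = np.zeros_like(self.e0)   # vartheta(0) = 0 (assumed)

    def step(self, e_tilde, dt):
        """Return mu_i(t) for the current e~_iF, then advance vartheta by dt."""
        e_tilde = np.asarray(e_tilde, dtype=float)
        mu = self.k * (e_tilde - self.e0) + self.theta
        theta_dot = ((self.k * self.alpha + self.gamma) * e_tilde
                     + self.beta1 * sat(e_tilde, self.eps))
        self.theta = self.theta + dt * theta_dot
        return mu

def filtered_error(e_tilde_dot, e_tilde, alpha):
    """E_i = d/dt(e~_iF) + alpha * e~_iF, cf. (48)."""
    return e_tilde_dot + alpha * e_tilde
```

The integral part $\vartheta$ lets $\mu_i$ keep growing while a constant estimation error persists, which is what allows the identifier to absorb bounded uncertainties without knowing them in advance.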

Taking the time derivative of (48) gives

$$\begin{aligned}\dot{E}_i ={}& W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} - \hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} + \alpha\dot{\tilde{e}}_{iF} - kE_i - \gamma\tilde{e}_{iF}\\ &- \dot{\hat{W}}_{iF}^{T}\hat{\kappa}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) - \beta_1\operatorname{sat}(\tilde{e}_{iF}) - \hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\dot{\hat{\Lambda}}_{iF}^{T}\hat{e}_{iF}\end{aligned} \tag{49}$$

Here, the weights $\hat{W}_{iF}$ and $\hat{\Lambda}_{iF}$ of the identifier-NN are updated by

$$\dot{\hat{W}}_{iF} = \operatorname{proj}\left(\Gamma_{iWF}\,\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}\tilde{e}_{iF}^{T}\right), \qquad \dot{\hat{\Lambda}}_{iF} = \operatorname{proj}\left(\Gamma_{i\Lambda F}\,\dot{\hat{e}}_{iF}\tilde{e}_{iF}^{T}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\right) \tag{50}$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive constant adaptation gain matrices. To analyze the convergence of the filtered identification error, the term $-\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}$ in (49) can be split into the following form:

$$\begin{aligned}
-\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}
&= -\frac{1}{2}\hat{\kappa}'_{iF}\dot{\hat{e}}_{iF}\left[(\Lambda_{iF}^{T}-\tilde{\Lambda}_{iF}^{T})(W_{iF}^{T}-\tilde{W}_{iF}^{T}) + (W_{iF}^{T}-\tilde{W}_{iF}^{T})(\Lambda_{iF}^{T}-\tilde{\Lambda}_{iF}^{T})\right]\\
&= -\frac{1}{2}\hat{\kappa}'_{iF}\dot{\hat{e}}_{iF}\left[W_{iF}^{T}(\Lambda_{iF}^{T}-\tilde{\Lambda}_{iF}^{T}) + (W_{iF}^{T}-\tilde{W}_{iF}^{T})\Lambda_{iF}^{T} - \tilde{W}_{iF}^{T}(\Lambda_{iF}^{T}-\tilde{\Lambda}_{iF}^{T}) - (W_{iF}^{T}-\tilde{W}_{iF}^{T})\tilde{\Lambda}_{iF}^{T}\right]\\
&= -\frac{1}{2}\hat{\kappa}'_{iF}\dot{\hat{e}}_{iF}\left[W_{iF}^{T}(\Lambda_{iF}^{T}-\tilde{\Lambda}_{iF}^{T}) + (W_{iF}^{T}-\tilde{W}_{iF}^{T})\Lambda_{iF}^{T}\right] + \frac{1}{2}\hat{\kappa}'_{iF}\dot{\hat{e}}_{iF}\left[\tilde{W}_{iF}^{T}(\Lambda_{iF}^{T}-\tilde{\Lambda}_{iF}^{T}) + (W_{iF}^{T}-\tilde{W}_{iF}^{T})\tilde{\Lambda}_{iF}^{T}\right]\\
&= \frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\tilde{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{\tilde{e}}_{iF} - \frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{iF} - \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF}\\
&\quad + \frac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}
\end{aligned} \tag{51}$$

where $\tilde{W}_{iF}^{T} = W_{iF}^{T} - \hat{W}_{iF}^{T}$ and $\tilde{\Lambda}_{iF}^{T} = \Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}$. Substituting (51) into (49), (49) reduces to the following form:

$$\dot{E}_i = P_{F1} + P_{F2} + P_{F3} - kE_i - \gamma\tilde{e}_{iF} - \beta_1\operatorname{sat}(\tilde{e}_{iF}) \tag{52}$$


In the equation above, $P_{F1}$, $P_{F2}$, and $P_{F3}$ are given, respectively, by

$$P_{F1} = \frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\tilde{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{\tilde{e}}_{iF} - \hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\dot{\hat{\Lambda}}_{iF}^{T}\hat{e}_{iF} + \alpha\dot{\tilde{e}}_{iF} - \dot{\hat{W}}_{iF}^{T}\hat{\kappa}_{iF} \tag{53}$$

$$P_{F2} = -\frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{iF} - \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) \tag{54}$$

$$P_{F3} = \frac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} \tag{55}$$

According to Assumption 1, (48), and (50), $P_{F1}$, $P_{F2}$, and $P_{F3}$ are bounded as follows:

$$\|P_{F1}\| \le J_1\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|, \qquad \|P_{F2}\| \le \varsigma_1, \qquad \|P_{F3}\| \le \varsigma_2 \tag{56}$$

Combining (53) and (54) with (55), we obtain

$$\|\dot{P}_{F2} + \dot{P}_{F3}\| \le \varsigma_3 + \varsigma_4 J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\| \tag{57}$$

where $\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T}) = [\tilde{e}_{iF}^{T}\ E_i^{T}]^{T}$, $J_i(\cdot)$ is a globally invertible nondecreasing function, and $\varsigma_i$ ($i = 1, 2, 3, 4$) are computable positive constants.

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function as follows:

$$V_{iL}(\tilde{e}_{iF}, E_i) = \frac{1}{2}E_i^{T}E_i + \frac{1}{2}\gamma\tilde{e}_{iF}^{T}\tilde{e}_{iF} + \varpi_i(t) + \phi_i(t) \tag{58}$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are defined as follows:

$$\begin{aligned}\dot{\varpi}_i(t) &= -\Big[E_i^{T}\big(P_{F2} - \beta_1\operatorname{sat}(\tilde{e}_{iF})\big) + \dot{\tilde{e}}_{iF}^{T}P_{F3} - \beta_2 J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\,\|\tilde{e}_{iF}\|\Big],\\ \varpi_i(0) &= \beta_1|\tilde{e}_{iF}(0)| - \tilde{e}_{iF}^{T}(0)\big(P_{F2}(0) + P_{F3}(0)\big)\end{aligned} \tag{59}$$

$$\phi_i(t) = \frac{1}{4\alpha}\left[\operatorname{tr}\left(\tilde{W}_{iF}^{T}\Gamma_{iWF}^{-1}\tilde{W}_{iF}\right) + \operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\tilde{\Lambda}_{iF}\right)\right] \tag{60}$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Define $d = [E_i^{T}\ \tilde{e}_{iF}^{T}\ \varpi_i^{1/2}\ \phi_i^{1/2}]^{T}$; $\beta_1, \beta_2 \in \mathbb{R}$ are positive adaptation gains chosen to ensure $\varpi_i(t) \ge 0$. Then

$$U_1(d) \le V_{iL}(\tilde{e}_{iF}, E_i) \le U_2(d) \tag{61}$$

where

$$U_1(d) = \frac{1}{2}\min(1, \gamma)\|d\|^{2}, \qquad U_2(d) = \max(1, \gamma)\|d\|^{2} \tag{62}$$

The time derivative of (58) is

$$\dot{V}_{iL}(\tilde{e}_{iF}, E_i) = \nabla V_{iL}^{T}\, K\left[\dot{E}_i^{T}\ \ \dot{\tilde{e}}_{iF}^{T}\ \ \tfrac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i\ \ \tfrac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\right]^{T} \tag{63}$$

where $K[\cdot]$ denotes a Filippov set [27]. Hence $\dot{V}_{iL}(\tilde{e}_{iF}, E_i)$ can be rewritten as

$$\begin{aligned}
\dot{V}_{iL}(\tilde{e}_{iF}, E_i) ={}& \left[E_i^{T}\ \ \gamma\tilde{e}_{iF}^{T}\ \ 2\varpi_i^{1/2}\ \ 2\phi_i^{1/2}\right] K\left[\dot{E}_i^{T}\ \ \dot{\tilde{e}}_{iF}^{T}\ \ \tfrac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i\ \ \tfrac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\right]^{T}\\
\le{}& E_i^{T}\Big(\frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\tilde{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{\tilde{e}}_{iF} - \hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\dot{\hat{\Lambda}}_{iF}^{T}\hat{e}_{iF} + \alpha\dot{\tilde{e}}_{iF}\\
&- \frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{iF} - \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} - \gamma\tilde{e}_{iF} - \dot{\hat{W}}_{iF}^{T}\hat{\kappa}_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF})\\
&+ \frac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} - kE_i - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})]\Big)\\
&- E_i^{T}\Big(-\frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{iF} - \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})]\Big)\\
&- \dot{\tilde{e}}_{iF}^{T}\Big(\frac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}\Big) + \gamma\tilde{e}_{iF}^{T}\left(E_i - \alpha\tilde{e}_{iF}\right)\\
&+ \beta_2 J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\,\|\tilde{e}_{iF}\| - \frac{1}{2\alpha}\operatorname{tr}\left(\tilde{W}_{iF}^{T}\Gamma_{iWF}^{-1}\dot{\hat{W}}_{iF}\right) - \frac{1}{2\alpha}\operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\dot{\hat{\Lambda}}_{iF}\right)
\end{aligned} \tag{64}$$

Substituting (53), (54), and (55) into (64), we obtain

$$\begin{aligned}
\dot{V}_{iL}(\tilde{e}_{iF}, E_i) ={}& E_i^{T}\left(P_{F1} + P_{F2} + P_{F3} - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})] - kE_i - \gamma\tilde{e}_{iF}\right) + \gamma\tilde{e}_{iF}^{T}\left(E_i - \alpha\tilde{e}_{iF}\right)\\
&- E_i^{T}\left(P_{F2} - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})]\right) - \dot{\tilde{e}}_{iF}^{T}P_{F3} + \beta_2 J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\,\|\tilde{e}_{iF}\|\\
&- \frac{1}{2\alpha}\operatorname{tr}\left(\tilde{W}_{iF}^{T}\Gamma_{iWF}^{-1}\dot{\hat{W}}_{iF}\right) - \frac{1}{2\alpha}\operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\dot{\hat{\Lambda}}_{iF}\right)\\
={}& -\alpha\gamma\tilde{e}_{iF}^{T}\tilde{e}_{iF} - kE_i^{T}E_i + E_i^{T}P_{F1} + \left(E_i^{T} - \dot{\tilde{e}}_{iF}^{T}\right)P_{F3} + \beta_2 J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\,\|\tilde{e}_{iF}\|\\
&- \frac{1}{2\alpha}\operatorname{tr}\left(\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}\tilde{e}_{iF}^{T}\right) - \frac{1}{2\alpha}\operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{iF}\tilde{e}_{iF}^{T}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\right)\\
\le{}& -k_1\|\tilde{e}_{iF}\|^{2} - k_2\|E_i\|^{2} + \frac{J_1\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)^{2}}{4k_3}\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2} + \frac{\beta_2^{2}J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)^{2}}{4\alpha k_4}\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2}
\end{aligned} \tag{65}$$

where $k_{\min} = \min\{k_1, k_2\}$, $\xi = \min\{k_3, \alpha k_4/\beta_2^{2}\}$, and $J\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)^{2} = J_1\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)^{2} + J_2\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)^{2}$. Hence the following conclusion can be obtained:

$$\dot{V}_{iL}(\tilde{e}_{iF}, E_i) \le -k_{\min}\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2} + \frac{J\big(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\big)^{2}\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2}}{4\xi} \le -c\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2} \tag{66}$$

Therefore, for some positive constant $c$, $-c\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2}$ is a negative semidefinite function on the adjustable region $D$ given by

$$D = \left\{d(t) \;\middle|\; \|d\| \le J^{-1}\left(2\sqrt{k_{\min}\xi}\right)\right\} \tag{67}$$

so that, by Lyapunov stability theory, the identification error system is stable.

To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, a novel decentralized reinforcement learning robust optimal tracking controller is designed in this paper, in which a robust term compensates for the neural network approximation errors. The robust control term is designed as

$$u_{irb} = \frac{N_{rb}\, e_i}{e_i^{T}e_i + \zeta} \tag{68}$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ is chosen to satisfy

$$\begin{aligned}
N_{rb} \ge{}& \left[\frac{\delta_{hi}^{2}}{2n_1} + \frac{n_1\left(-\varepsilon_{ia}(e_i) - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)^{2}}{2n_2} + n_1 n_2\|b_i(e_i)\|^{2}\left(-\varepsilon_{ia}(e_i) - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)^{2}\right]\frac{e_i^{T}e_i + \zeta}{2n_1 n_2\|b_i(e_i)\|\,e_i^{T}e_i}\\
\ge{}& \left[n_1^{2}\left(-\varepsilon_{ia}(e_i) - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)^{2} + n_2\delta_{hi}^{2} + 2n_1^{2}n_2^{2}\|b_i(e_i)\|^{2}\left(-\varepsilon_{ia}(e_i) - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)^{2}\right]\frac{e_i^{T}e_i + \zeta}{4n_1^{2}n_2^{2}\|b_i(e_i)\|\,e_i^{T}e_i}
\end{aligned} \tag{69}$$

Therefore, the global control law is designed as follows:

$$u_{mix} = \hat{u}_i + u_{irb} = -\frac{1}{2}R^{-1}b_i^{T}(e_i)S_{ia}(e_i)^{T}\hat{W}_{ia} + \frac{N_{rb}\,e_i}{e_i^{T}e_i + \zeta} \tag{70}$$
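The composite law (70) is simple to evaluate once $\hat{u}_i$ is known. The sketch below assumes a scalar joint error $e_i$ and a precomputed $N_{rb}$; the function names and the numbers are hypothetical. Because $\zeta > 0$, the robust term never becomes singular at $e_i = 0$: $|u_{irb}| = N_{rb}|e_i|/(e_i^2 + \zeta)$ peaks at $|e_i| = \sqrt{\zeta}$ with value $N_{rb}/(2\sqrt{\zeta})$.

```python
import math

def robust_term(e, N_rb, zeta):
    """u_irb = N_rb * e / (e^2 + zeta), cf. (68), for a scalar joint error e."""
    return N_rb * e / (e * e + zeta)

def mixed_control(u_hat, e, N_rb, zeta):
    """u_mix = u_hat_i + u_irb, cf. (70)."""
    return u_hat + robust_term(e, N_rb, zeta)

# |u_irb| peaks at |e| = sqrt(zeta), where it equals N_rb / (2 sqrt(zeta))
N_rb, zeta = 4.0, 0.25
peak = N_rb / (2.0 * math.sqrt(zeta))
print(robust_term(math.sqrt(zeta), N_rb, zeta), peak)   # both 4.0
```

Choosing a smaller $\zeta$ sharpens the robust action near zero error but raises its peak magnitude, so in practice $\zeta$ trades chattering against actuator limits.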

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the actual output tracks the desired trajectory asymptotically.

Proof. Choose the Lyapunov function as follows:

$$V_{iu}(e_i, u_{mix}) = \frac{1}{2n_1}\operatorname{tr}\{\tilde{W}_{ic}^{T}\tilde{W}_{ic}\} + \frac{n_1}{2n_2}\operatorname{tr}\{\tilde{W}_{ia}^{T}\tilde{W}_{ia}\} + n_1 n_2\left[e_{iF}^{T}e_{iF} + \Xi\int_{0}^{\infty} r_i(e_i, u_{mix})\,d\tau\right] \tag{71}$$

where $\Xi > 0$ is an undetermined parameter and $t \le \tau < \infty$. The time derivative of (71) is

$$\begin{aligned}
\dot{V}_{iu}(e_i, u_{mix}) ={}& \frac{1}{2n_1}\operatorname{tr}\{\tilde{W}_{ic}^{T}\dot{\tilde{W}}_{ic}\} + \frac{n_1}{2n_2}\operatorname{tr}\{\tilde{W}_{ia}^{T}\dot{\tilde{W}}_{ia}\} + n_1 n_2\left[e_{iF}^{T}\dot{e}_{iF} + \Xi r_i(e_i, u_{mix})\right]\\
={}& \frac{1}{2n_1}\operatorname{tr}\left\{\tilde{W}_{ic}^{T}\left(-n_1 L_i\left(L_i^{T}\tilde{W}_{ic} + \delta_{hi}\right)\right)\right\}\\
&+ \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde{W}_{ia}^{T}\left[-n_2 S_{ia}(e_i)\Big(\tilde{W}_{ia}^{T}S_{ia}(e_i) - \varepsilon_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde{W}_{ic} - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Big)\right]\right\}\\
&+ n_1 n_2\left[e_{iF}^{T}\left(W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T}e_i\right) + \varepsilon_{iF}(e_i) + b_i(e_i)u_{mix}\right) + \Xi\left(e_i^{T}Q_e e_i + u_{mix}^{T}Ru_{mix}\right)\right]\\
\le{}& -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2}\right)\|\tilde{W}_{ic}\|^{2} + \frac{1}{2n_1}\delta_{hi}^{2} - \left(n_1 S_{iam}^{2} - \frac{3}{4}n_1 n_2 S_{iaM}^{2}\right)\|\tilde{W}_{ia}\|^{2}\\
&+ \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\|\tilde{W}_{ic}\|^{2}\\
&+ \frac{n_1}{2n_2}\left(\varepsilon_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)^{T}\left(\varepsilon_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)\\
&+ n_1 n_2\|b_i(e_i)\|^{2}\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i) + n_1 n_2 S_{iaM}^{2}\|b_i(e_i)\|^{2}\|\tilde{W}_{ia}\|^{2}\\
&+ n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2} + n_1 n_2\left(\|b_i(e_i)\|^{2} - \Xi\lambda_{\min}(R)\right)\|u_{mix}\|^{2}
\end{aligned} \tag{72}$$

If the following conditions are satisfied:

$$\begin{aligned}\lambda_{\min}(Q_e)\|e_{iF}\|_2^{2} &\le e_{iF}^{T}Q_e e_{iF} \le \lambda_{\max}(Q_e)\|e_{iF}\|_2^{2},\\ \lambda_{\min}(R)\|u_{mix}\|_2^{2} &\le u_{mix}^{T}Ru_{mix} \le \lambda_{\max}(R)\|u_{mix}\|_2^{2},\\ \Xi &> \frac{\|b_i(e_i)\|^{2}}{\lambda_{\min}(R)}\end{aligned} \tag{73}$$

then $\dot{V}_{iu}(e_i, u_{mix})$ can be further bounded as

$$\begin{aligned}
\dot{V}_{iu}(e_i, u_{mix}) \le{}& -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2} - \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\right)\|\tilde{W}_{ic}\|^{2}\\
&- \left(n_1 S_{iam}^{2} - \frac{3}{4}n_1 n_2 S_{iaM}^{2} - n_1 n_2 S_{iaM}^{2}\|b_i(e_i)\|^{2}\right)\|\tilde{W}_{ia}\|^{2}\\
&- n_1 n_2\left(\|b_i(e_i)\|^{2} + \Xi\lambda_{\min}(R)\right)\|u_{mix}\|^{2} - n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}\\
\le{}& -n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}
\end{aligned} \tag{74}$$

Therefore, $\dot{V}_{iu}(e_i, u_{mix}) < 0$ for $e_{iF} \neq 0$.

4. Simulations

To verify the validity and convergence of the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method, and to study the convergence of the error through comparative simulations, two different configurations of the time varying externally constrained reconfigurable modular robot are used in this paper, as shown in Figures 3 and 4.

For ease of analysis, the configurations above are transformed into the analytic charts shown in Figures 5 and 6, where $L_1$, $L_2$, and $L_4$ are the link lengths and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint is modeled as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are as follows:

$$\begin{aligned}\Psi_A(q, t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right],\\ \Psi_B(q, t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right]\end{aligned} \tag{75}$$

In the equation above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2} \tag{76}$$
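The constraint geometry in (75) and (76) can be evaluated directly, which is useful for checking that a simulated state actually lies on the constraint surface. In the sketch below the link lengths are hypothetical (this section does not give them), the argument $t/2$ in (76) follows the reconstruction above, and $\cot\alpha$ is computed as $1/\tan\alpha$, which is safe because $\alpha(t)$ stays in $[0.75\pi - 0.2,\ 0.75\pi + 0.2]$, away from multiples of $\pi$.

```python
import math

def alpha(t):
    """Constraint angle (76): alpha(t) = 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def psi_A(q1, q2, t, L1, L2, L3, L4):
    """Constraint (75) for configuration A; zero when the constraint is satisfied."""
    return L1 * math.cos(q1) + L2 * math.cos(q2) - (L3 + L4 / math.tan(alpha(t)))

def psi_B(q2, t, L1, L2, L3, L4):
    """Constraint (75) for configuration B (the joint 1 term is fixed at L1)."""
    return L1 + L2 * math.cos(q2) - (L3 + L4 / math.tan(alpha(t)))

# Hypothetical link lengths (not given in this section)
L = dict(L1=1.0, L2=2.0, L3=0.5, L4=0.8)
print(alpha(0.0), psi_A(0.3, 1.2, 1.0, **L), psi_B(1.2, 1.0, **L))
```

Setting $q_1 = 0$ in $\Psi_A$ recovers $\Psi_B$, which matches the observation that joint 1 does not move in configuration B.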

The initial joint positions are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B, and the initial joint velocities are zero. The dynamic models of configurations A and B are as follows:

$$\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos q_2 + 0.6066 & 0.18\cos q_2 + 0.1233\\ 0.18\cos q_2 + 0.1233 & 0.1233\end{bmatrix}, &
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^{2}q_2 & -0.06\cos q_2\\ -0.06\cos q_2 & 0.1233\end{bmatrix},\\
C_A(q, \dot{q}) &= \begin{bmatrix} -0.36\sin q_2\,\dot{q}_2 & -0.18\sin q_2\,\dot{q}_2\\ 0.18\sin q_2\,(\dot{q}_1 - \dot{q}_2) & 0.18\sin q_2\,\dot{q}_1\end{bmatrix}, &
C_B(q, \dot{q}) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot{q}_2 & 0.06\sin q_2\,\dot{q}_2\\ 0.06\sin q_2\,\dot{q}_2 & 0\end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin q_1\\ -5.88\sin(q_1 + q_2)\end{bmatrix}, &
G_B(q) &= \begin{bmatrix} 0\\ -5.88\cos q_2\end{bmatrix},\\
F_A(q, \dot{q}) &= \begin{bmatrix} \dot{q}_1 + 10\sin(3q_1) + 2\operatorname{sgn}(\dot{q}_1)\\ 1.2\dot{q}_2 + 5\sin(2q_2) + \operatorname{sgn}(\dot{q}_2)\end{bmatrix}, &
F_B(q, \dot{q}) &= \begin{bmatrix} 0\\ 1.5\dot{q}_2 + \sin q_2 + 1.2\operatorname{sgn}(\dot{q}_2)\end{bmatrix}
\end{aligned} \tag{77}$$
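As a sketch of how the configuration A model in (77) might be coded for simulation, the block below implements $M_A$, $C_A$, and $G_A$ and solves for joint accelerations. The standard form $M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau$ is an assumption (the paper's model (14) is outside this section), and the friction term $F_A$ is omitted for brevity. A quick calculation gives $\det M_A = 0.05959089 - 0.0324\cos^2 q_2 > 0$, so the solve is well posed for every $q_2$.

```python
import numpy as np

def M_A(q):
    c2 = np.cos(q[1])
    return np.array([[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
                     [0.18 * c2 + 0.1233, 0.1233]])

def C_A(q, qd):
    s2 = np.sin(q[1])
    return np.array([[-0.36 * s2 * qd[1], -0.18 * s2 * qd[1]],
                     [0.18 * s2 * (qd[0] - qd[1]), 0.18 * s2 * qd[0]]])

def G_A(q):
    return np.array([-5.88 * np.sin(q[0] + q[1]) - 17.64 * np.sin(q[0]),
                     -5.88 * np.sin(q[0] + q[1])])

def forward_dynamics_A(q, qd, tau):
    """Joint accelerations, assuming M q'' + C q' + G = tau (friction F_A omitted)."""
    rhs = tau - C_A(q, qd) @ qd - G_A(q)
    return np.linalg.solve(M_A(q), rhs)

q, qd = np.array([0.3, 0.2]), np.zeros(2)
print(forward_dynamics_A(q, qd, np.zeros(2)))   # motion from rest with zero torque
```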

The desired trajectories of configurations A and B are as follows.

Configuration A:

$$\begin{aligned} y_{1d} &= 0.5\cos t + 0.2\sin(3t),\\ y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin(\alpha(t) - y_{1d}) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t)\end{aligned} \tag{78}$$


Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$\begin{aligned} y_{1d} &= 0,\\ y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\alpha(t) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t)\end{aligned} \tag{79}$$
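The reference trajectories (78) and (79) can be generated as below. The link lengths $L_1 = 1$, $L_2 = 2$, $L_3 = 0.5$ are hypothetical, chosen only so that the arcsin argument stays inside $[-1, 1]$ for all $t$; with real link lengths, `math.asin` raises `ValueError` if the desired joint 2 angle is geometrically unreachable. The $\alpha(t)$ helper repeats (76) so the block is self-contained.

```python
import math

def alpha(t):
    """Constraint angle alpha(t) = 0.75*pi + 0.2*sin(t/2), as in (76)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def desired_A(t, L1, L2, L3):
    """Desired joint trajectories (78) for configuration A."""
    y1 = 0.5 * math.cos(t) + 0.2 * math.sin(3.0 * t)
    a = alpha(t)
    # math.asin raises ValueError if the argument leaves [-1, 1]
    y2 = math.asin((L1 * math.sin(a - y1) - L3 * math.sin(a)) / L2) + a
    return y1, y2

def desired_B(t, L1, L2, L3):
    """Desired joint trajectories (79) for configuration B (joint 1 held at zero)."""
    a = alpha(t)
    y2 = math.asin((L1 * math.sin(a) - L3 * math.sin(a)) / L2) + a
    return 0.0, y2

# Hypothetical link lengths chosen so the arcsin argument stays in [-1, 1]
for t in (0.0, 2.5, 5.0):
    print(t, desired_A(t, 1.0, 2.0, 0.5), desired_B(t, 1.0, 2.0, 0.5))
```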

Because of the time varying external constraint, the motion of joint 1 in configuration B is zero.

To confirm that the adopted method can be applied to different configurations, and to verify the tracking performance of the subsystems under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, this section presents comparative simulations of the classic RBF neural network control method and the ACI based decentralized reinforcement learning robust optimal tracking control method.

Figures 7–14 show the joint tracking and error curves obtained by using the classic RBF neural network [4] to compensate for the dynamic nonlinear term and the interconnection term in the subsystem. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11–14 show that, when time varying constraints exist, the joint subsystem constraints of the reconfigurable modular robot cannot be well compensated for by the classical neural control method. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness and the tracking errors increase.

Figure 5: The analytic chart of configuration A (links $L_1$ to $L_4$, joint angles $q_1$ and $q_2$, constraint angle $\alpha$, axes $X$ and $Y$).

Figure 6: The analytic chart of configuration B (links $L_1$ to $L_4$, joint angles $q_1$ and $q_2$, constraint angle $\alpha$, axes $X$ and $Y$).

Figures 15–22 show the joint tracking and error curves obtained when ACI is used to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in the subsystems. Figures 15–18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the strong identification ability of ACI, which can identify the uncertainties contained in the subsystems in a short time. The error curves in Figures 19–22 show that the tracking error is very small and converges to zero in a short time. In addition, the joint subsystems exhibit good robustness under the proposed decentralized control method.

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. These two figures show that the proposed decentralized control method fully satisfies the accessibility requirement of the reconfigurable modular robot, and no singular displacements of the joint subsystems or the end-effector appear. The parameters used in ACI are listed in Table 1.


Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

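Table 1: Parameter list of action-critic-identifier.

| Parameter | $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
|---|---|---|---|---|---|---|---|---|---|
| Value | 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |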

The simulation results show that, compared with the classical RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. The joint variables track the desired trajectories within a very short time in both configurations, and the error fluctuates within a minimal range after convergence.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error in rad versus time in s).


Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error in rad versus time in s).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, by combining ACI with an RBF neural network, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time varying constrained reconfigurable modular robots. This theory addresses the problem of the continuous-time nonlinear optimal control policy for strongly coupled, uncertain robotic systems. First, we built the model of each subsystem with the time varying external constraints and described the global robot system as a synthesis of interconnected subsystems. Second, ACI was used to estimate the HJB equation and the global uncertainty: a continuous-time optimal $Q$-function replaces the traditional optimal value function, the optimal $Q$-function and the optimal control policy are approximated by the critic-NN and the action-NN, respectively, and the global uncertainty is identified by the identifier. Third, we designed a novel decentralized robust optimal tracking controller so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions were designed to confirm the stability of ACI and of the subsystems. Finally, to confirm the superiority of the proposed control theory, comparative simulation examples were presented for two different configurations of the robot.

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

In future work, more complex configurations of time varying externally constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 20: Tracking error curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 21: Tracking error curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).


Figure 22: Tracking error curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 23: 3D-tip trajectory curve of configuration A with ACI (x, y, z axes; desired and actual trajectories).

Figure 24: 3D-tip trajectory curve of configuration B with ACI (x, y, z axes; desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, "Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode," Information, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, "Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control," Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.


[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent $H_\infty$ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Network, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


uncertainty. An observer-based decentralized adaptive fuzzy controller for reconfigurable manipulators is proposed in [6]: by designing a state observer, the adaptive fuzzy systems that model the unknown dynamics of the subsystem and the interconnection term can be constructed from the state estimates. Nevertheless, these methods require the dynamics of the reconfigurable modular robot system to be fully or at least partially known, which is hard to satisfy. Moreover, because the reconfigurable modular robot system contains strongly coupled model uncertainties and interconnection terms among its subsystems, these methods increase the processing load on the controller and are prone to larger time delays and calculation errors, so designing controllers with the methods and algorithms above becomes too complicated.

In recent years, reinforcement learning has received extensive attention from scholars as one of the most effective methods for solving control problems of strongly coupled continuous-time nonlinear systems. Reinforcement learning [7, 8] is a learning method that maps situations to actions so as to maximize a numerical reward signal. Unlike supervised learning, reinforcement learning does not need a teacher signal for each state; instead, it learns through interaction with its environment. Because of its adaptive optimization capability for nonlinear models under uncertainty, reinforcement learning has a unique advantage in solving optimization and control problems for complex models [9–11]. Zhang and his team presented an infinite-time optimal tracking control scheme for discrete-time nonlinear systems via the greedy HDP iteration algorithm [12–14]. Through a system transformation, the optimal tracking problem is converted into an optimal regulation problem, and the greedy HDP iteration algorithm is introduced to solve the regulation problem with a rigorous convergence analysis. They then proposed a data-driven robust approximate optimal tracking control using adaptive dynamic programming, together with a data-driven model, established by a recurrent neural network, that reconstructs the unknown system dynamics from available input-output data [15]. After this, they designed a fuzzy critic estimator to estimate the value function of nonlinear continuous-time systems [16]. On this basis, the synchronization problem for an array of neural networks with hybrid coupling and interval time varying delay was addressed with an augmented Lyapunov-Krasovskii functional method [17]. An FRL scheme that uses only the immediate reward, together with sufficient conditions, is adopted to analyze the convergence of the optimal task performance. Bhasin et al. present neural network control of a robot interacting with an uncertain viscoelastic environment [18]. In [18], a continuous controller is developed for a robot that moves in free space and then regulates the new coupled dynamic system to a desired setpoint. Khan et al. present an implementation of a model-free, Q-learning based discrete model reference compliance controller for a humanoid robot arm [19], in which a recently developed Q-learning scheme is used to learn an optimal policy online. Patchaikani et al. propose an adaptive critic-based real-time redundancy resolution scheme for kinematic control of redundant manipulators [20]: the kinematic control of the redundant manipulator is formulated as a discrete-time input-affine system, and an optimal real-time redundancy resolution scheme is then proposed.

Although research on reinforcement learning algorithms has developed rapidly in recent years, some deficiencies remain. For example, when the global system contains multiple subsystems, the methods above cannot handle the impact of the interconnection terms between subsystems. Meanwhile, the methods above are mostly used to solve learning and optimization problems of the system itself; when external constraints act on the system, they are no longer applicable. Therefore, designing a robust reinforcement learning optimal control method for systems with external constraints and multiple coupled subsystems is an urgent problem to be solved.

In this paper, we present a novel continuous-time decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots. Combining ACI with RBF-NN, the critic-NN is used to estimate the optimal $Q$-function, the action-NN approximates the optimal control policy, and the identifier identifies the global uncertainty, so that the HJB equation can be estimated with a bounded and convergent estimation error. First, since a decentralized control method is adopted, each joint subsystem has its own controller, which greatly reduces the processing load of each controller. Second, because the time varying constraints are compensated within the subsystems, the proposed method is suitable for reconfigurable modular robots operating in a time varying constrained environment. Third, the proposed control method compensates for the effects of the model uncertainties and the interconnection terms, so that the subsystems track the desired trajectories and the tracking error converges to zero in finite time.

2. Problem Formulation

Assume that the time varying external constraint acting on the end of the reconfigurable modular robot is given by

$$\Psi(q, t) = 0 \qquad (1)$$

Here, $q \in \mathbb{R}^n$ is the vector of joint displacements, $\Psi: \mathbb{R}^n \to \mathbb{R}^m$, and $m$ is the dimension of the external limiting conditions. With the time varying constraints, the dynamics of a reconfigurable modular robot can be written as follows:

$$M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) + F(q,\dot{q}) = u + J_\Psi^T(q,t)\, f \qquad (2)$$

where $M(q) \in \mathbb{R}^{n\times n}$ is the inertia matrix, $C(q,\dot{q})\dot{q} \in \mathbb{R}^n$ is the Coriolis and centripetal force, $G(q) \in \mathbb{R}^n$ is the gravity term, $F(q,\dot{q})$ is the unmodeled dynamics including friction terms and external disturbances, $u \in \mathbb{R}^n$ is the applied joint torque, and $J_\Psi^T(q,t) f$ is the contact force generated by contact between the end of the reconfigurable modular robot and the external constraints.

When $m$ constraints are introduced for a robot that works in free space, the system loses $m$ degrees of freedom because of (1). Therefore, the number of degrees of freedom of the robot changes from $n$ to $n - m$, so only $n - m$ independent joint displacements are needed to fully describe the constrained motion.

Define

$$q = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}, \qquad q_1 \in \mathbb{R}^{n-m}, \quad q_2 \in \mathbb{R}^{m} \qquad (3)$$

Substituting (3) into (1), we get

$$\Psi\big(q_1, \Theta(q_1, t), t\big) = 0 \qquad (4)$$

where

$$q_2 = \Theta(q_1, t) \qquad (5)$$

Therefore, $q$ in (3) can be described fully by the joint displacement $q_1$:

$$q = \begin{bmatrix} q_1 \\ \Theta(q_1, t) \end{bmatrix} \qquad (6)$$
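In (4) and (5), $\Theta$ is defined only implicitly by the constraint. For a scalar $q_2$ ($m = 1$), one way to evaluate $\Theta(q_1, t)$ numerically, and hence to build $q$ in (6), is Newton's method on $\Psi$; this is a minimal sketch that assumes $\partial\Psi/\partial q_2 \neq 0$ near the solution (the implicit function theorem condition). The function name, tolerances, and the test constraints are hypothetical.

```python
import math

def solve_theta(psi, q1, t, q2_init=0.0, tol=1e-10, max_iter=50, h=1e-7):
    """Solve psi(q1, q2, t) = 0 for scalar q2 with Newton's method.

    dpsi/dq2 is approximated by a central finite difference; an analytic
    derivative can be substituted when it is available.
    """
    q2 = q2_init
    for _ in range(max_iter):
        val = psi(q1, q2, t)
        if abs(val) < tol:
            return q2
        slope = (psi(q1, q2 + h, t) - psi(q1, q2 - h, t)) / (2.0 * h)
        if slope == 0.0:
            raise ValueError("dpsi/dq2 vanished; Theta is not locally defined here")
        q2 -= val / slope
    raise RuntimeError("Newton iteration did not converge")
```

With the hypothetical constraint $\Psi = \sin(q_2) - 0.5\,q_1$, for instance, `solve_theta` returns $q_2 = \arcsin(0.5\,q_1)$.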

The time derivative of (6) is

$$\dot{q} = \begin{bmatrix} \dot{q}_1 \\ \dfrac{\partial \Theta(q_1,t)}{\partial q_1}\dot{q}_1 + \dfrac{\partial \Theta(q_1,t)}{\partial t} \end{bmatrix} = \begin{bmatrix} I_{n-m} & 0 \\ \dfrac{\partial \Theta(q_1,t)}{\partial q_1} & I_m \end{bmatrix} \begin{bmatrix} \dot{q}_1 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ \dfrac{\partial \Theta(q_1,t)}{\partial t} \end{bmatrix} = T\dot{\theta} + H \qquad (7)$$

In (7),

$$T = \begin{bmatrix} I_{n-m} & 0 \\ \dfrac{\partial \Theta(q_1,t)}{\partial q_1} & I_m \end{bmatrix} \in \mathbb{R}^{n\times n}, \qquad \dot{\theta} = \begin{bmatrix} \dot{q}_1 \\ 0 \end{bmatrix} \in \mathbb{R}^{n}, \qquad H = \begin{bmatrix} 0 \\ \dfrac{\partial \Theta(q_1,t)}{\partial t} \end{bmatrix} \in \mathbb{R}^{n} \qquad (8)$$

Therefore, the second derivative of $q$ is readily obtained as

$$\ddot{q} = \dot{T}\dot{\theta} + T\ddot{\theta} + \dot{H} \qquad (9)$$

Substituting (7) and (9) into (2), we get

$$u + J_\Psi^T(q,t)\, f = M(q)\big(\dot{T}\dot{\theta} + T\ddot{\theta} + \dot{H}\big) + C(q,\dot{q})\big(T\dot{\theta} + H\big) + G(q) + F(q,\dot{q}) \qquad (10)$$

Define

$$E = \begin{bmatrix} I_{(n-m)\times(n-m)} \\ 0_{m\times(n-m)} \end{bmatrix} \in \mathbb{R}^{n\times(n-m)} \qquad (11)$$

Therefore,

$$\dot{\theta} = \begin{bmatrix} \dot{q}_1 \\ 0 \end{bmatrix} = E\dot{q}_1 \qquad (12)$$
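The velocity reduction in (7), (8), (11), and (12) can be checked numerically. The sketch below (Python with NumPy) assembles $T$, $H$, and $E$ from user-supplied partial derivatives of $\Theta$ and recovers $\dot{q} = TE\dot{q}_1 + H$. The function names and the test constraint $\Theta(q_1, t) = \sin(q_1) + 0.5t$ are hypothetical choices made only for this check.

```python
import numpy as np

def build_T_H(dtheta_dq1, dtheta_dt):
    """Assemble T (n x n) and H (n,) as in (8).

    dtheta_dq1: (m, n-m) array of partial derivatives of Theta w.r.t. q1.
    dtheta_dt:  (m,) array of partial derivatives of Theta w.r.t. t.
    """
    dtheta_dq1 = np.atleast_2d(np.asarray(dtheta_dq1, dtype=float))
    m, k = dtheta_dq1.shape              # k = n - m
    n = k + m
    T = np.eye(n)                        # identity blocks I_{n-m} and I_m on the diagonal
    T[k:, :k] = dtheta_dq1               # lower-left block dTheta/dq1
    H = np.concatenate([np.zeros(k), np.atleast_1d(dtheta_dt)])
    return T, H

def build_E(n, m):
    """E = [I_{n-m}; 0_{m x (n-m)}] from (11), so that theta_dot = E q1_dot."""
    k = n - m
    return np.vstack([np.eye(k), np.zeros((m, k))])

def qdot_from_q1dot(q1dot, dtheta_dq1, dtheta_dt):
    """Full joint velocity q_dot = T E q1_dot + H, combining (7) and (12)."""
    q1dot = np.atleast_1d(np.asarray(q1dot, dtype=float))
    T, H = build_T_H(dtheta_dq1, dtheta_dt)
    n = T.shape[0]
    m = n - q1dot.shape[0]
    return T @ build_E(n, m) @ q1dot + H
```

For the test constraint with $q_1(t) = 0.3t$, the result at $t = 1$ should match a finite-difference derivative of $q(t) = [0.3t,\ \sin(0.3t) + 0.5t]^T$.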

Thus, (2) can be decomposed into the following form:

$$\sum_{j=1}^{n} M_{ij}(q)\big[(\dot{T}E\dot{q}_1)_j + (TE\ddot{q}_1)_j + \dot{H}_j\big] + \sum_{j=1}^{n} C_{ij}(q,\dot{q})\big[(TE\dot{q}_1)_j + H_j\big] + G_i(q) + F_i(q_i,\dot{q}_i) - f_i = u_i \qquad (13)$$

In the equation above, $(\dot{T}E\dot{q}_1)_j$, $(TE\ddot{q}_1)_j$, $(TE\dot{q}_1)_j$, and $H_j$ ($\dot{H}_j$) are the $j$th elements of $\dot{T}E\dot{q}_1$, $TE\ddot{q}_1$, $TE\dot{q}_1$, and $H$ ($\dot{H}$), respectively; $G_i(q)$, $F_i(q_i,\dot{q}_i)$, and $u_i$ are the $i$th elements of $G(q)$, $F(q,\dot{q})$, and $u$; $f_i$ is the constraint force acting on the $i$th joint; and $M_{ij}(q)$ and $C_{ij}(q,\dot{q})$ are the $ij$th elements of $M(q)$ and $C(q,\dot{q})$, respectively. Therefore, as shown in Figure 1, the dynamic model of each subsystem can be formulated in joint space as follows:

$$M_i(q_i)\ddot{q}_i + C_i(q_i,\dot{q}_i)\dot{q}_i + G_i(q_i) + F_i(q_i,\dot{q}_i) + Z_i(q,\dot{q},\ddot{q}) = u_i \qquad (14)$$

$$\begin{aligned} Z_i(q,\dot{q},\ddot{q}) ={}& \sum_{j=1,\, j\neq i}^{n} M_{ij}(q)\big[(\dot{T}E\dot{q}_1)_j + (TE\ddot{q}_1)_j + \dot{H}_j\big] + M_{ii}(q)\big[(\dot{T}E\dot{q}_1)_i + (TE\ddot{q}_1)_i + \dot{H}_i\big] - M_i(q_i)\ddot{q}_i \\ &+ \sum_{j=1,\, j\neq i}^{n} C_{ij}(q,\dot{q})\big[(TE\dot{q}_1)_j + H_j\big] + C_{ii}(q,\dot{q})\big[(TE\dot{q}_1)_i + H_i\big] - C_i(q_i,\dot{q}_i)\dot{q}_i + \big[G_i(q) - G_i(q_i)\big] \end{aligned} \qquad (15)$$

Let $x_i = [x_{i1}, x_{i2}]^T = [q_i, \dot{q}_i]^T$ for $i = 1, \ldots, n$; then (10) can be presented as the following state equation:

$$S_i: \begin{cases} \dot{x}_{i1} = x_{i2} \\ \dot{x}_{i2} = -f(x_i, u_i) - h_i(q,\dot{q},\ddot{q}) - f_i \\ y_i = x_{i1} \end{cases} \qquad (16)$$


Figure 1: The architecture of the time varying constrained reconfigurable modular robot system (subsystems $1, \ldots, n$, each described by $M_i\ddot{q}_i + C_i\dot{q}_i + G_i + F_i + Z_i - f_i = u_i$ and coupled through $\ddot{q} = M^{-1}\big(u - C\dot{q} - G - F + J_\Psi^T(q,t) f\big)$).

where $x_i$ is the state vector of subsystem $S_i$, $y_i$ is the output of subsystem $S_i$, and $h_i(q,\dot{q},\ddot{q})$ is the interconnection term of the subsystem. $f(x_i, u_i)$ and $h_i(q,\dot{q},\ddot{q})$ are defined as

$$\begin{aligned} f(x_i, u_i) &= M_i^{-1}(q_i)\big[C_i(q_i,\dot{q}_i)\dot{q}_i + G_i(q_i) + F_i(q_i,\dot{q}_i) - u_i\big] \\ h_i(q,\dot{q},\ddot{q}) &= -M_i^{-1}(q_i)\, Z_i(q,\dot{q},\ddot{q}) \end{aligned} \qquad (17)$$
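To make the subsystem model (16)–(17) concrete, the sketch below simulates one isolated joint with forward Euler. It is a minimal illustration under stated assumptions: the interconnection term $h_i$ and the constraint force $f_i$ are set to zero by default, $M_i$ and $C_i$ are constants, and the gravity and friction models ($G_i = mgl\,\sin q_i$, $F_i = \text{friction}\cdot\dot{q}_i$) as well as all parameter names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class JointParams:
    """Hypothetical constant parameters of one isolated joint subsystem."""
    inertia: float   # M_i(q_i), assumed constant
    damping: float   # C_i(q_i, dq_i), assumed constant
    mgl: float       # gravity model G_i(q_i) = mgl * sin(q_i) (assumption)
    friction: float  # viscous friction F_i = friction * dq_i (assumption)

def f_term(x, u, p):
    """f(x_i, u_i) from (17): M_i^{-1} [C_i dq_i + G_i + F_i - u_i]."""
    q, dq = x
    return (p.damping * dq + p.mgl * math.sin(q) + p.friction * dq - u) / p.inertia

def subsystem_rhs(x, u, p, h=0.0, f_c=0.0):
    """Right-hand side of (16); h (interconnection) and f_c (constraint force) default to 0."""
    _, dq = x
    return (dq, -f_term(x, u, p) - h - f_c)

def simulate(x0, u_fn, p, dt=1e-3, steps=1000):
    """Forward-Euler integration of the isolated subsystem; u_fn(t, x) returns u_i."""
    x = x0
    for k in range(steps):
        dq, ddq = subsystem_rhs(x, u_fn(k * dt, x), p)
        x = (x[0] + dt * dq, x[1] + dt * ddq)
    return x
```

With pure gravity compensation, `u_fn = lambda t, x: p.mgl * math.sin(x[0])`, a joint starting at rest stays where it is; with `u = 0` it falls under gravity.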

For the time varying constrained reconfigurable modular robot system, we need to design a decentralized robust optimal tracking control policy that makes each subsystem track its desired trajectory while keeping the tracking error bounded and convergent.

3. Decentralized Reinforcement Learning Robust Optimal Tracking Control Based on ACI and Q-Function

Assumption 1. The desired trajectory $y_{id}$, $\dot{y}_{id}$, $\ddot{y}_{id}$ and the input gain matrix $b_i(x_i)$ are bounded.

Then (16) can be transformed into the following form:

$$S_i: \begin{cases} \dot{x}_{i1} = x_{i2} \\ \dot{x}_{i2} = -\big[F(x_i, u_i) + h_i(q,\dot{q},\ddot{q}) + f_i\big] + b_i(x_i)\, u_i \\ y_i = x_{i1} \end{cases} \qquad (18)$$

where $F(x_i, u_i) = f(x_i, u_i) + b_i(x_i)\, u_i$.

Assumption 2. The interconnection terms are bounded, satisfying

$$\big|h_i(q,\dot{q},\ddot{q})\big| \le \delta_{i0} + \sum_{j=1}^{n} \delta_{ij}\big(|s_{ij}|\big) \qquad (19)$$

where $\delta_{i0} > 0$ is an unknown constant and $\delta_{ij}(|s_{ij}|) \ge 0$ is an unknown smooth Lipschitz function.

The trajectory tracking error of the $i$th joint subsystem can be defined as

$$e_i(t) = x_i - y_{id} \qquad (20)$$

For the continuous-time state equation (18) of the subsystem, which contains the nonlinear function and the interconnection terms, the value function can generally be defined as

$$V_i^{u_i(e_i(t))}\big(e_i(t)\big) = \int_0^{\infty} r_i\big(e_i(t), u_i(e_i(t))\big)\, dt \qquad (21)$$

For brevity, we write $e_i$ and $u_i$ instead of $e_i(t)$ and $u_i(e_i(t))$. Since the trajectory $y_{id}$ relies on the subsystem control $u_i$ for updating, and to avoid an infinite result from (21), we transform the value function into the following form:

$$V_i^{u_i}(e_i) = \int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\, d\tau, \qquad t \le \tau < \infty \qquad (22)$$

Thus, the optimal value function of the subsystem can be defined as follows:

$$V_i^*(e_i) = \min_{u_i,\; t\le\tau<\infty} \int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\, d\tau \qquad (23)$$


Here, $r_i(e_i, u_i)$ represents the reward function for the current state:

$$r_i(e_i, u_i) = e_i^T Q_e e_i + u_i^T R u_i \qquad (24)$$

where $Q_e$ and $R$ are positive definite matrices.

Recording the values of state-action pairs is typically more useful than recording the values of states alone, since state-action values are predictions of the reward. A low reward in a given state does not imply that the value of the corresponding state-action pair is also low: if the subsystem produces higher rewards over the subsequent period of time, the pair can still have a high state-action value. Therefore, from a long-term perspective, defining a suitable state-action value function ($Q$-function) can make actions produce more rewards [21, 22].
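The argument above can be illustrated with a small discrete-time example. The sketch below runs tabular Bellman-optimality sweeps on $Q(s, a)$ (Q-value iteration in the spirit of [21], not the continuous-time $Q$-function (25) used in this paper) for a hypothetical two-state chain: in state 0, "advance" earns no immediate reward while "stay" earns 0.1, yet "advance" has the higher state-action value because it leads to a reward of 10. All states, rewards, and the discount factor are hypothetical.

```python
def q_value_iteration(transitions, gamma=0.9, sweeps=200):
    """Tabular Bellman-optimality sweeps for Q(s, a) on a deterministic MDP.

    transitions[s][a] = (reward, next_state), with next_state None for terminal.
    """
    Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(sweeps):
        for s, acts in transitions.items():
            for a, (r, s_next) in acts.items():
                # Terminal transitions contribute no future value.
                future = 0.0 if s_next is None else max(Q[s_next].values())
                Q[s][a] = r + gamma * future
    return Q

chain = {
    0: {"stay": (0.1, 0), "advance": (0.0, 1)},  # low immediate reward for "advance"
    1: {"advance": (10.0, None)},                 # large reward on reaching the goal
}
Q = q_value_iteration(chain)
```

Here $Q(0, \text{advance}) = 0 + 0.9\cdot 10 = 9$, whereas $Q(0, \text{stay}) = 0.1 + 0.9\cdot 9 = 8.2$, so the action with the lower immediate reward is preferred.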

According to (23) and (24), the continuous-time optimal $Q$-function can be defined as

$$\begin{aligned} Q_i^*(e_i, a_i, u_i) &= r_i(e_i, a_i, u_i) + V_i^*(e_i, u_i) \\ &= r_i(e_i, a_i, u_i) + \min_{u_i,\; t\le\tau<\infty} \int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\, d\tau \end{aligned} \qquad (25)$$

Assumption 3. The partial derivatives of $Q_i^*$ and $r_i(e_i, a_i, u_i)$ exist and are continuous in the domain.

According to (18) and (24), under the control policy $u_i$, the optimal $Q$-function satisfies the following Hamilton-Jacobi-Bellman equation [23]:

HJB119894(119890

119894 119906

119894 nabla119876

lowast

119894)

= min119906119894(119890119894)

[119903119894(119890

119894 119886

119894 119906

119894)

+nabla119876lowast

119894(minus119865 (119890

119894 119906

119894) minus ℎ

119894(119890 119890 119890) minus 119891

119894+ 119887

119894(119890

119894) 119906

119894)]

= min119906119894(119890119894)

[119903119894(119890

119894 119886

119894 119906

119894) + nabla119876

lowast

119894Φ

119894(119890

119894 119906

119894)]

(26)

where $\Phi_i(e_i, u_i) = -F(e_i, u_i) - h_i(e, \dot e, \ddot e) - f_i + b_i(e_i) u_i$ denotes the global uncertainty, which includes the unknown dynamics of the subsystem and the interconnection term, and $\nabla Q_i^* = \partial Q_i^* / \partial e_i$ denotes the gradient of the optimal $Q$-function.

Lemma 4 (see [24]). Consider the dynamics (14) of the subsystem of the time-varying constrained reconfigurable modular robot. For the minimum of the HJB equation (26) to possess a stationary point with respect to $u_i$, the optimal $Q$-function and the optimal control policy must satisfy the following conditions:

(1) $\partial \mathrm{HJB}_i(e_i, u_i, \nabla Q_i) / \partial u_i = 0$;

(2) $\partial^2 \mathrm{HJB}_i(e_i, u_i, \nabla Q_i) / (\partial u_i \times \partial u_i^T) \ge 0$.

The necessary conditions above lead to the following results:

(a) The bounded control policy guarantees a local minimum of the HJB equation (26) and satisfies the constraints imposed on the control inputs.

(b) The Hessian matrix is positive definite, and the control policy $u_i$ renders the global minimum of the HJB equation.

(c) If an optimal algorithm exists, it is unique.

According to Lemma 4, if the reward function is smooth and the optimal control $u_i^*$ is adopted, then the HJB equation satisfies

$$\mathrm{HJB}_i^*(e_i, u_i^*, \nabla Q_i^*) = \min_{u_i^*(e_i)} \big[ r_i(e_i, a_i, u_i^*) + \nabla Q_i^* \, \Phi_i(e_i, u_i^*) \big] = 0. \tag{27}$$

The optimal control can then be expressed as follows:

$$u_i^*(e_i) = \arg\min_{u_i^*} \big[ \mathrm{HJB}_i^*(e_i, u_i^*, \nabla Q_i^*) \big] = -\frac{1}{2} R^{-1} b_i^T(e_i) \frac{\partial Q_i^*(e_i, a_i, u_i)^T}{\partial e_i}. \tag{28}$$

The minus sign follows from condition (1) of Lemma 4, since the bracket in (27) is quadratic in $u_i$: setting $2 R u_i + b_i^T(e_i) \nabla Q_i^{*T} = 0$ gives (28), which agrees with the forms used in (32) and (34).
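The closed form (28) can be checked numerically. The sketch below freezes $b_i(e_i)$, the drift terms of $\Phi_i$, and $\nabla Q_i^*$ at random hypothetical values (an assumption for illustration, not the authors' code) and confirms that $u_i^* = -\tfrac{1}{2}R^{-1}b_i^T(e_i)\nabla Q_i^{*T}$ zeroes the gradient of the bracket in (26), that its Hessian $2R$ is positive definite (condition (2) of Lemma 4), and that no random perturbation lowers the bracket.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                         # hypothetical state and input dimensions
Q_e = np.eye(n)
R = np.diag([2.0, 0.5])             # positive definite input weight
b = rng.standard_normal((n, m))     # b_i(e_i) frozen at one operating point
drift = rng.standard_normal(n)      # stands in for -F - h_i - f_i at that point
grad_Q = rng.standard_normal(n)     # stands in for the row vector dQ_i*/de_i
e = rng.standard_normal(n)

def hamiltonian(u):
    """Bracket of (26): r_i(e_i, u) + grad_Q . Phi_i(e_i, u), with Phi_i = drift + b u."""
    return e @ Q_e @ e + u @ R @ u + grad_Q @ (drift + b @ u)

# Condition (1) of Lemma 4: dH/du = 2 R u + b^T grad_Q = 0
u_star = -0.5 * np.linalg.solve(R, b.T @ grad_Q)
grad_H = 2.0 * R @ u_star + b.T @ grad_Q
# Condition (2): the Hessian 2 R is positive definite, so u_star is the minimizer
hessian_eigs = np.linalg.eigvalsh(2.0 * R)

perturbations = rng.standard_normal((100, m))
is_min = all(hamiltonian(u_star + d) >= hamiltonian(u_star) for d in perturbations)
print(np.allclose(grad_H, 0.0), bool(np.all(hessian_eigs > 0)), is_min)
```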

If the optimal $Q$-function $Q_i^*$ is continuously differentiable and known with initial value $Q_i^*(0) = 0$, and the optimal control policy $u_i^*(e_i)$ and the global uncertainty of the subsystem $\Phi_i(e_i, u_i^*)$ are known, then the HJB equation (27) holds and is solvable. In practice, however, $Q_i^*$ is not differentiable everywhere, and $u_i^*(e_i)$ and $\Phi_i(e_i, u_i^*)$ are unknown.

Therefore, it is not feasible to solve the HJB equation by the averaging method. In this paper, we combine the action-critic-identifier (ACI) with an RBF neural network to estimate the optimal control policy, the optimal $Q$-function, and the global uncertainty of the subsystem. The action-NN estimates $u_i^*(e_i)$, denoted as $\hat u_i(e_i)$; the critic-NN estimates $Q_i^*$, denoted as $\hat Q_i$; and a robust neural network identifier identifies $\Phi_i(e_i, u_i^*)$, denoted as $\hat\Phi_i(e_i, u_i^*)$. The block diagram of the ACI architecture is shown in Figure 2. The estimated HJB equation can be expressed as follows:

$$\widehat{\mathrm{HJB}}_i^*(e_i, \hat u_i, \nabla \hat Q_i) = \min_{\hat u_i(e_i)} \big[ r_i(e_i, a_i, \hat u_i) + \nabla \hat Q_i \, \hat\Phi_i(e_i, \hat u_i) \big]. \tag{29}$$

The identification error of the HJB equation above can be expressed as

$$\delta_{hi} = \widehat{\mathrm{HJB}}_i^*(e_i, \hat u_i, \nabla \hat Q_i) - \mathrm{HJB}_i^*(e_i, u_i^*, \nabla Q_i^*). \tag{30}$$

A classic radial basis function (RBF) neural network, proposed in [25], is given by

$$N(x) = W^{*T} S(x) + \varepsilon(x), \tag{31}$$


Figure 2: The architecture of action-critic-identifier (blocks: action, critic, identifier, subsystem, and reward function; signals include $Q_i(e_i, a_i, u_i)$, $r_i(e_i, a_i, u_i)$, $\Phi_i(e_i, u_i)$, and the HJB error $\delta_{hi}$).

where $W^*$ denotes the ideal neural network weights and $\varepsilon(x)$ represents the estimation error. With a sufficient number of nodes whose centers and widths are chosen appropriately, an RBF-NN can approximate any continuous function on a compact set. Therefore, the optimal $Q$-function and the optimal control policy can be expressed as follows:

$$\begin{aligned} Q_i^* &= W_i^T S_i(e_i) + \varepsilon_{ic}(e_i), \\ u_i^*(e_i) &= -\frac{1}{2} R^{-1} b_i^T(e_i) \big[ S_i(e_i)^T W_i + \varepsilon_{ia}(e_i) \big], \end{aligned} \tag{32}$$

where $S_i(e_i) = [s_{i1}(e_i)\ \cdots\ s_{in}(e_i)]^T$ is the vector of smooth basis functions of the neural network, $W_i$ denotes the ideal unknown neural network weight, and $\varepsilon_{ic}(e_i)$ and $\varepsilon_{ia}(e_i)$ are the estimation errors. Using $\hat Q_i$ and $\hat u_i(e_i)$ to estimate $Q_i^*$ and $u_i^*(e_i)$, we obtain

$$\hat Q_i = \hat W_{ic}^T S_{ic}(e_i), \tag{33}$$

$$\hat u_i(e_i) = -\frac{1}{2} R^{-1} b_i^T(e_i) S_{ia}(e_i)^T \hat W_{ia}. \tag{34}$$

In the equations above, $\hat W_{ic}(t)$ and $\hat W_{ia}(t)$ denote the weights of the critic-NN and the action-NN, respectively. The weight estimation errors are

$$\tilde W_{ic}(t) = W_i - \hat W_{ic}(t), \tag{35}$$

$$\tilde W_{ia}(t) = W_i - \hat W_{ia}(t). \tag{36}$$
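The critic (33) and actor (34) can be sketched with Gaussian RBF basis functions, the usual choice for the RBF-NN in (31). The centers, width, weight shapes, and input matrix below are hypothetical. In particular, the actor weight $\hat W_{ia}$ is assumed to be an $h \times n$ matrix so that $S_{ia}(e_i)^T\hat W_{ia}$ has the dimension of $e_i$, which the text does not state explicitly.

```python
import numpy as np

def gaussian_rbf(e, centers, width):
    """S(e) = [s_1(e) ... s_h(e)]^T with s_j(e) = exp(-||e - c_j||^2 / width^2)."""
    d2 = np.sum((centers - e) ** 2, axis=1)
    return np.exp(-d2 / width**2)

def critic(e, W_c, centers, width):
    """Critic estimate (33): Q_hat_i = W_ic^T S_ic(e_i)."""
    return float(W_c @ gaussian_rbf(e, centers, width))

def actor(e, W_a, centers, width, R, b):
    """Actor estimate (34): u_hat_i = -1/2 R^{-1} b_i^T(e_i) S_ia(e_i)^T W_ia.
    W_a is assumed to have shape (h, n) so that S_ia^T W_ia is an n-vector."""
    s = gaussian_rbf(e, centers, width)
    return -0.5 * np.linalg.solve(R, b.T @ (s @ W_a))

# Hypothetical setup: 3 x 3 grid of centers in a 2-D error space, one input
grid = np.linspace(-1.0, 1.0, 3)
centers = np.array([[x, y] for x in grid for y in grid])
width = 0.8
rng = np.random.default_rng(1)
W_c = rng.standard_normal(len(centers))
W_a = rng.standard_normal((len(centers), 2))
R = np.array([[1.0]])
b = np.array([[0.0], [1.0]])        # stand-in for b_i(e_i)

e = np.array([0.2, -0.1])
print(critic(e, W_c, centers, width), actor(e, W_a, centers, width, R, b))
```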

The weight update law of the critic-NN is a gradient descent algorithm:

$$\dot{\hat W}_{ic}(t) = -n_1 L_i \big( L_i^T \hat W_{ic} + e_i^T Q_e e_i + u_i^T R u_i \big). \tag{37}$$

In the equation above, $n_1 > 0$ is the adaptive gain of the neural network, and $L_i$ and $l_i$ are defined as

$$L_i = \frac{l_i}{l_i^T l_i + 1}, \qquad l_i = \nabla S_{ic}(e_i) \, \dot e_i. \tag{38}$$

Therefore, according to the definitions above, the following inequalities hold:

$$L_{im} \le L_i \le L_{iM}, \qquad S_{icm} \le S_{ic}(e_i) \le S_{icM}, \qquad S_{iam} \le S_{ia}(e_i) \le S_{iaM}. \tag{39}$$
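The critic update (37) with the normalized regressor (38) is a gradient descent on $\tfrac12(L_i^T\hat W_{ic} + r_i)^2$. The following minimal sketch integrates it with forward Euler on synthetic, consistent data: the targets are generated from assumed "true" weights rather than from the quadratic reward (24), so that the residual can be driven to zero. The gains, sizes, and data are illustrative assumptions.

```python
import numpy as np

def normalized_regressor(grad_S, e_dot):
    """(38): l_i = grad S_ic(e_i) e_dot_i and L_i = l_i / (l_i^T l_i + 1).
    grad_S has shape (h, n); row j is the gradient of basis function s_j."""
    l = grad_S @ e_dot
    return l / (l @ l + 1.0)

def critic_step(W_c, L, r, n1, dt):
    """One forward-Euler step of (37): dW_ic/dt = -n1 L_i (L_i^T W_ic + r_i)."""
    residual = L @ W_c + r
    return W_c - dt * n1 * L * residual, residual

# Synthetic, consistent data (an assumption for illustration): the targets are
# generated from hypothetical "true" weights so that L_k^T W_true + r_k = 0.
rng = np.random.default_rng(2)
h, n = 5, 2
W_true = rng.standard_normal(h)
regressors = [normalized_regressor(rng.standard_normal((h, n)), rng.standard_normal(n))
              for _ in range(20)]
targets = [-(L @ W_true) for L in regressors]

W_c = np.zeros(h)
for epoch in range(2000):
    worst = 0.0
    for L, r in zip(regressors, targets):
        W_c, res = critic_step(W_c, L, r, n1=20.0, dt=0.1)
        worst = max(worst, abs(res))
print("largest residual in the last epoch:", worst)
```

Because $\|L_i\| = \|l_i\|/(\|l_i\|^2 + 1) \le 1/2$, the normalization keeps every Euler step bounded: with $n_1\,dt = 2$ the per-sample step factor $n_1\,dt\,\|L_i\|^2$ stays at or below $1/2$.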


Combining (35) with (38), we obtain

$$\dot{\tilde W}_{ic}(t) = -n_1 L_i \big( L_i^T \tilde W_{ic} + \delta_{hi} \big). \tag{40}$$

The weight update law of the action-NN is also developed by a gradient descent algorithm:

$$\dot{\hat W}_{ia}(t) = -n_2 S_{ia}(e_i) \Big( \hat W_{ia}^T S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T \hat W_{ic} \Big)^T. \tag{41}$$
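The action-NN law (41) is likewise a gradient step that pulls $\hat W_{ia}^T S_{ia}(e_i)$ toward $-\tfrac12 R^{-1} b_i(e_i)\nabla S_{ic}(e_i)^T\hat W_{ic}$ for a fixed critic. The sketch assumes a single-input subsystem with scalar $R$ and $b_i$ (so that the product in (41) is well defined), Gaussian bases shared by the critic and the actor, and frozen random critic weights; all of these are illustrative choices, not the paper's settings.

```python
import numpy as np

def gaussian_rbf(e, centers, width):
    """Gaussian basis vector S(e) with entries exp(-||e - c_j||^2 / width^2)."""
    return np.exp(-np.sum((centers - e) ** 2, axis=1) / width**2)

def gaussian_rbf_grad(e, centers, width):
    """grad S(e) with shape (h, n); row j is d s_j / d e."""
    s = gaussian_rbf(e, centers, width)
    return (-2.0 / width**2) * (e - centers) * s[:, None]

def actor_target(e, W_c, centers, width, R, b):
    """-1/2 R^{-1} b_i grad S_ic(e_i)^T W_ic, with scalar R and b_i (assumed single input)."""
    return -0.5 * (b / R) * (gaussian_rbf_grad(e, centers, width).T @ W_c)

def actor_step(W_a, e, W_c, centers, width, R, b, n2, dt):
    """Forward-Euler step of (41):
    dW_ia/dt = -n2 S_ia (W_ia^T S_ia + 1/2 R^{-1} b_i grad S_ic^T W_ic)^T."""
    s = gaussian_rbf(e, centers, width)
    err = W_a.T @ s - actor_target(e, W_c, centers, width, R, b)
    return W_a - dt * n2 * np.outer(s, err)

grid = np.linspace(-1.0, 1.0, 3)
centers = np.array([[x, y] for x in grid for y in grid])   # shared by critic and actor here
width, R, b = 0.8, 1.0, 1.0
rng = np.random.default_rng(3)
W_c = rng.standard_normal(len(centers))                    # frozen critic weights
samples = rng.uniform(-1.0, 1.0, size=(30, 2))             # visited error states

def actor_loss(W_a):
    """Mean squared mismatch of the bracket in (41) over the sampled states."""
    return np.mean([np.sum((W_a.T @ gaussian_rbf(e, centers, width)
                            - actor_target(e, W_c, centers, width, R, b)) ** 2)
                    for e in samples])

W_a = np.zeros((len(centers), 2))
loss_before = actor_loss(W_a)
for _ in range(300):
    for e in samples:
        W_a = actor_step(W_a, e, W_c, centers, width, R, b, n2=1.0, dt=0.05)
print(f"loss before {loss_before:.4f}, after {actor_loss(W_a):.4f}")
```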

According to the estimation error of the action-NN in (36), the optimal control $u_i^*(e_i)$ minimizes the optimal $Q$-function, which gives

$$W_{ia}^T S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T W_{ic} + \frac{1}{2} R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i) + \varepsilon_{ia}(e_i) = 0. \tag{42}$$

Substituting (41) into (42), we get

$$\dot{\tilde W}_{ia} = -n_2 S_{ia}(e_i) \Big( \tilde W_{ia}^T S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T \tilde W_{ic} - \frac{1}{2} R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i) - \varepsilon_{ia}(e_i) \Big). \tag{43}$$

After using the critic-NN and the action-NN to obtain the estimates $\hat Q_i$ and $\hat u_i(e_i)$, we design a robust RBF-NN identifier to identify the nonlinear uncertainties of the subsystem. Here, $\Phi_i(e_i, \hat u_i)$ can be expressed as

$$\Phi_i(e_i, \hat u_i) = \dot e_{iF} = W_{iF}^T \kappa(\Lambda_{iF}^T e_{iF}) + \varepsilon_{iF}(e_{iF}) + b_i(e_i) \hat u_i, \tag{44}$$

where $\kappa(\cdot)$ denotes the basis function of the neural network, and $W_{iF}$ and $\Lambda_{iF}$ are the unknown ideal neural network weights.

Equation (44) can be identified by the robust RBF-NN identifier, which gives

$$\hat\Phi_i(e_i, \hat u_i) = \dot{\hat e}_{iF} = \hat W_{iF}^T \hat\kappa_{iF} + b_i(e_i) \hat u_i + \mu_i. \tag{45}$$

Here, $\hat\kappa_{iF}$ denotes the estimated value of the basis function of the neural network, $\hat W_{iF}$ and $\hat\Lambda_{iF}$ are the estimated neural network weights, and $\mu_i \in \mathbb{R}$ is the feedback error term [26]:

$$\begin{aligned} \mu_i &= k \big( e_{iF}(t) - \hat e_{iF}(t) \big) - k \big( e_{iF}(0) - \hat e_{iF}(0) \big) + \vartheta = k \big( \tilde e_{iF}(t) - \tilde e_{iF}(0) \big) + \vartheta, \\ \dot\vartheta &= (k\alpha + \gamma) \tilde e_{iF} + \beta_1 \, \mathrm{sat}(\tilde e_{iF}), \end{aligned} \tag{46}$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gain constants and $\mathrm{sat}(\cdot)$ is a saturation function. Therefore, the state estimation error of the identifier-NN, $\tilde e_{iF} = e_{iF} - \hat e_{iF}$, evolves as

$$\dot{\tilde e}_{iF} = \dot e_{iF} - \dot{\hat e}_{iF} = W_{iF}^T \kappa_{iF} - \hat W_{iF}^T \hat\kappa_{iF} + \varepsilon_{iF}(e_{iF}) - \mu_i. \tag{47}$$
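The feedback term (46) is of the RISE type: an integral of a linear error term plus a saturated error term. The following minimal sketch runs a scalar version of the identifier (45) in which the neural-network part and $b_i(e_i)\hat u_i$ are set to zero, so that only $\mu_i$ tracks a hypothetical signal $e_{iF}(t) = \sin t$. The gains, the boundary-layer width of $\mathrm{sat}(\cdot)$, and the signal are assumptions for illustration.

```python
import math
import numpy as np

def sat(x, layer=0.01):
    """Saturation used in (46); the boundary-layer width `layer` is an assumed value."""
    return max(-1.0, min(1.0, x / layer))

def run_identifier(k=10.0, alpha=2.0, gamma=1.0, beta1=2.0, dt=1e-4, T=10.0):
    """Scalar identifier d(e_hat)/dt = mu_i using the feedback term (46).
    The NN part W_hat^T kappa_hat and the input term b_i u_hat_i are set to zero."""
    signal = math.sin                 # hypothetical e_iF(t); its derivative is never used
    e_hat, theta = 0.0, 0.0
    err0 = signal(0.0) - e_hat        # tilde e_iF(0)
    times, errors = [], []
    for step in range(int(round(T / dt))):
        t = step * dt
        err = signal(t) - e_hat       # tilde e_iF(t)
        mu = k * (err - err0) + theta
        e_hat += dt * mu              # forward-Euler step of (45)
        theta += dt * ((k * alpha + gamma) * err + beta1 * sat(err))
        times.append(t)
        errors.append(err)
    return np.array(times), np.array(errors)

times, errors = run_identifier()
print("max |error| over the whole run:", np.abs(errors).max())
print("max |error| for t >= 8 s:", np.abs(errors[times >= 8.0]).max())
```

Inside the saturation boundary layer the error approximately obeys $\ddot{\tilde e}_{iF} + k\dot{\tilde e}_{iF} + (k\alpha + \gamma + \beta_1/0.01)\tilde e_{iF} = \ddot e_{iF}$, so with these gains the steady-state error is a few thousandths, well inside the layer.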

A filtered identification error is defined as

$$E_i = \dot{\tilde e}_{iF} + \alpha \tilde e_{iF}. \tag{48}$$

Taking the time derivative of (48) gives (primes denote derivatives of the basis function with respect to its argument)

$$\begin{aligned} \dot E_i ={}& W_{iF}^T \kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} - \hat W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} + \alpha \dot{\tilde e}_{iF} - k E_i - \gamma \tilde e_{iF} \\ & - \dot{\hat W}_{iF}^T \hat\kappa_{iF} + \dot\varepsilon_{iF}(e_{iF}) - \beta_1 \, \mathrm{sat}(\tilde e_{iF}) - \hat W_{iF}^T \hat\kappa'_{iF} \dot{\hat\Lambda}_{iF}^T \hat e_{iF}. \end{aligned} \tag{49}$$

Here, the weights $\hat W_{iF}$ and $\hat\Lambda_{iF}$ of the identification-NN are updated by

$$\begin{aligned} \dot{\hat W}_{iF} &= \mathrm{proj}\big( \Gamma_{iWF} \, \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} \tilde e_{iF}^T \big), \\ \dot{\hat\Lambda}_{iF} &= \mathrm{proj}\big( \Gamma_{i\Lambda F} \, \dot{\hat e}_{iF} \tilde e_{iF}^T \hat W_{iF}^T \hat\kappa'_{iF} \big), \end{aligned} \tag{50}$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive constant adaptation gain matrices. To analyze the convergence of the filtered identification error, the term $-\hat W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF}$ in (49) can be decomposed as follows:

$$\begin{aligned} -\hat W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} ={}& -\frac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} - \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot{\hat e}_{iF} \\ & + \frac{1}{2} \tilde W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} + \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \tilde\Lambda_{iF}^T \dot{\hat e}_{iF} \\ ={}& \frac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\tilde e}_{iF} + \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot{\tilde e}_{iF} - \frac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot e_{iF} - \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} \\ & + \frac{1}{2} \tilde W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} + \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \tilde\Lambda_{iF}^T \dot{\hat e}_{iF}, \end{aligned} \tag{51}$$

where $\tilde W_{iF}^T = W_{iF}^T - \hat W_{iF}^T$ and $\tilde\Lambda_{iF}^T = \Lambda_{iF}^T - \hat\Lambda_{iF}^T$. Substituting (51) into (49), (49) reduces to

$$\dot E_i = P_{F1} + P_{F2} + P_{F3} - k E_i - \gamma \tilde e_{iF} - \beta_1 \, \mathrm{sat}(\tilde e_{iF}), \tag{52}$$


where $P_{F1}$, $P_{F2}$, and $P_{F3}$ are, respectively,

$$P_{F1} = \frac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\tilde e}_{iF} + \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot{\tilde e}_{iF} - \hat W_{iF}^T \hat\kappa'_{iF} \dot{\hat\Lambda}_{iF}^T \hat e_{iF} + \alpha \dot{\tilde e}_{iF} - \dot{\hat W}_{iF}^T \hat\kappa_{iF}, \tag{53}$$

$$P_{F2} = -\frac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot e_{iF} - \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} + W_{iF}^T \kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} + \dot\varepsilon_{iF}(e_{iF}), \tag{54}$$

$$P_{F3} = \frac{1}{2} \tilde W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} + \frac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \tilde\Lambda_{iF}^T \dot{\hat e}_{iF}. \tag{55}$$
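The six-term split of $-\hat W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}$ that feeds $P_{F1}$, $P_{F2}$, and $P_{F3}$ can be checked numerically. The sketch below assumes random weights, a diagonal matrix standing in for $\hat\kappa'_{iF}$, and $\dot{\hat e}_{iF} = \dot e_{iF} - \dot{\tilde e}_{iF}$, and verifies that the half-weighted terms sum to the original product for arbitrary data.

```python
import numpy as np

rng = np.random.default_rng(4)
h, n_out, n_in = 6, 2, 3                          # hypothetical layer sizes
W, W_hat = rng.standard_normal((2, h, n_out))     # ideal / estimated output weights
Lam, Lam_hat = rng.standard_normal((2, n_in, h))  # ideal / estimated input weights
kp = np.diag(rng.uniform(0.1, 1.0, h))            # stands in for kappa_hat'_iF (diagonal Jacobian)
e_dot, e_tilde_dot = rng.standard_normal((2, n_in))
e_hat_dot = e_dot - e_tilde_dot                   # because tilde e_iF = e_iF - hat e_iF
W_t, Lam_t = W - W_hat, Lam - Lam_hat             # tilde W_iF, tilde Lambda_iF

def term(A, B, v):
    """A^T kappa' B^T v, the generic product appearing in (51)-(55)."""
    return A.T @ kp @ B.T @ v

lhs = -term(W_hat, Lam_hat, e_hat_dot)
P3 = 0.5 * term(W_t, Lam_hat, e_hat_dot) + 0.5 * term(W_hat, Lam_t, e_hat_dot)
rhs = (0.5 * term(W, Lam_hat, e_tilde_dot) + 0.5 * term(W_hat, Lam, e_tilde_dot)
       - 0.5 * term(W, Lam_hat, e_dot) - 0.5 * term(W_hat, Lam, e_dot) + P3)
print(np.allclose(lhs, rhs))
```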

According to Assumption 1, (48), and (50), $P_{F1}$, $P_{F2}$, and $P_{F3}$ are bounded as

$$\|P_{F1}\| \le J_1\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big) \, \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|, \qquad \|P_{F2}\| \le \varsigma_1, \qquad \|P_{F3}\| \le \varsigma_2. \tag{56}$$

Combining (53) and (54) with (55), we obtain

$$\|\dot P_{F2} + \dot P_{F3}\| \le \varsigma_3 + \varsigma_4 J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big) \, \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|, \tag{57}$$

where $\varphi_i(\tilde e_{iF}^T, E_i^T) = [\tilde e_{iF}^T\ E_i^T]^T$, $J_i(\cdot)$ is a globally invertible nondecreasing function, and $\varsigma_i$ $(i = 1, 2, 3, 4)$ are computable positive constants.

Theorem 5. Consider the dynamics (14) of the subsystem of the time-varying constrained reconfigurable modular robot and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function

$$V_{iL}(\tilde e_{iF}, E_i) = \frac{1}{2} E_i^T E_i + \frac{1}{2} \gamma \, \tilde e_{iF}^T \tilde e_{iF} + \varpi_i(t) + \phi_i(t). \tag{58}$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are given by

$$\begin{aligned} \dot\varpi_i(t) &= -\Big[ E_i^T \big( P_{F2} - \beta_1 \, \mathrm{sat}(\tilde e_{iF}) \big) + \dot{\tilde e}_{iF}^T P_{F3} - \beta_2 J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big) \|\varphi_i(\tilde e_{iF}^T, E_i^T)\| \, \|\tilde e_{iF}\| \Big], \\ \varpi_i(0) &= \beta_1 |\tilde e_{iF}(0)| - \tilde e_{iF}^T(0) \big( P_{F2}(0) + P_{F3}(0) \big), \end{aligned} \tag{59}$$

$$\phi_i(t) = \frac{\alpha}{4} \Big[ \mathrm{tr}\big( \tilde W_{iF}^T \Gamma_{iWF}^{-1} \tilde W_{iF} \big) + \mathrm{tr}\big( \tilde\Lambda_{iF}^T \Gamma_{i\Lambda F}^{-1} \tilde\Lambda_{iF} \big) \Big], \tag{60}$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Define $d = [E_i^T\ \ \tilde e_{iF}^T\ \ \varpi_i^{1/2}\ \ \phi_i^{1/2}]$. The constants $\beta_1, \beta_2 \in \mathbb{R}$ are positive adaptation gains chosen to ensure $\varpi_i(t) \ge 0$, so that

$$U_1(d) \le V_{iL}(\tilde e_{iF}, E_i) \le U_2(d), \tag{61}$$

where

$$U_1(d) = \frac{1}{2} \min(1, \gamma) \|d\|^2, \qquad U_2(d) = \max(1, \gamma) \|d\|^2. \tag{62}$$

The time derivative of (58) is

$$\dot V_{iL}(\tilde e_{iF}, E_i) = \nabla V_{iL}^T K\Big[ \dot E_i^T\ \ \dot{\tilde e}_{iF}^T\ \ \tfrac{1}{2} \varpi_i^{-1/2} \dot\varpi_i\ \ \tfrac{1}{2} \phi_i^{-1/2} \dot\phi_i \Big]^T, \tag{63}$$

where $K[\cdot]$ denotes the Filippov set [27]. Hence, $\dot V_{iL}(\tilde e_{iF}, E_i)$ can be written as

$$\begin{aligned} \dot V_{iL}(\tilde e_{iF}, E_i) ={}& \big[ E_i^T\ \ \gamma \tilde e_{iF}^T\ \ 2\varpi_i^{1/2}\ \ 2\phi_i^{1/2} \big] \, K\Big[ \dot E_i^T\ \ \dot{\tilde e}_{iF}^T\ \ \tfrac{1}{2} \varpi_i^{-1/2} \dot\varpi_i\ \ \tfrac{1}{2} \phi_i^{-1/2} \dot\phi_i \Big]^T \\ \le{}& E_i^T \Big( \tfrac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\tilde e}_{iF} + \tfrac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot{\tilde e}_{iF} - \hat W_{iF}^T \hat\kappa'_{iF} \dot{\hat\Lambda}_{iF}^T \hat e_{iF} + \alpha \dot{\tilde e}_{iF} \\ & - \tfrac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot e_{iF} - \tfrac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} - \gamma \tilde e_{iF} - \dot{\hat W}_{iF}^T \hat\kappa_{iF} + W_{iF}^T \kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} + \dot\varepsilon_{iF}(e_{iF}) \\ & + \tfrac{1}{2} \tilde W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} + \tfrac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \tilde\Lambda_{iF}^T \dot{\hat e}_{iF} - k E_i - \beta_1 K[\mathrm{sat}(\tilde e_{iF})] \Big) \\ & - E_i^T \Big( -\tfrac{1}{2} W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot e_{iF} - \tfrac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} + W_{iF}^T \kappa'_{iF} \Lambda_{iF}^T \dot e_{iF} + \dot\varepsilon_{iF}(e_{iF}) - \beta_1 K[\mathrm{sat}(\tilde e_{iF})] \Big) \\ & - \dot{\tilde e}_{iF}^T \Big( \tfrac{1}{2} \tilde W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} + \tfrac{1}{2} \hat W_{iF}^T \hat\kappa'_{iF} \tilde\Lambda_{iF}^T \dot{\hat e}_{iF} \Big) + \gamma \tilde e_{iF}^T \big( E_i - \alpha \tilde e_{iF} \big) \\ & + \beta_2 J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big) \|\varphi_i(\tilde e_{iF}^T, E_i^T)\| \, \|\tilde e_{iF}\| - \frac{\alpha}{2} \mathrm{tr}\big( \tilde W_{iF}^T \Gamma_{iWF}^{-1} \dot{\hat W}_{iF} \big) - \frac{\alpha}{2} \mathrm{tr}\big( \tilde\Lambda_{iF}^T \Gamma_{i\Lambda F}^{-1} \dot{\hat\Lambda}_{iF} \big). \end{aligned} \tag{64}$$

Substituting (53), (54), and (55) into (64), we get

$$\begin{aligned} \dot V_{iL}(\tilde e_{iF}, E_i) \le{}& E_i^T \big( P_{F1} + P_{F2} + P_{F3} - \beta_1 K[\mathrm{sat}(\tilde e_{iF})] - k E_i - \gamma \tilde e_{iF} \big) + \gamma \tilde e_{iF}^T \big( E_i - \alpha \tilde e_{iF} \big) \\ & - E_i^T \big( P_{F2} - \beta_1 K[\mathrm{sat}(\tilde e_{iF})] \big) - \dot{\tilde e}_{iF}^T P_{F3} + \beta_2 J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big) \|\varphi_i(\tilde e_{iF}^T, E_i^T)\| \, \|\tilde e_{iF}\| \\ & - \frac{\alpha}{2} \mathrm{tr}\big( \tilde W_{iF}^T \Gamma_{iWF}^{-1} \dot{\hat W}_{iF} \big) - \frac{\alpha}{2} \mathrm{tr}\big( \tilde\Lambda_{iF}^T \Gamma_{i\Lambda F}^{-1} \dot{\hat\Lambda}_{iF} \big) \\ ={}& -\alpha\gamma \, \tilde e_{iF}^T \tilde e_{iF} - k E_i^T E_i + E_i^T P_{F1} + \big( E_i^T - \dot{\tilde e}_{iF}^T \big) P_{F3} + \beta_2 J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big) \|\varphi_i(\tilde e_{iF}^T, E_i^T)\| \, \|\tilde e_{iF}\| \\ & - \frac{\alpha}{2} \mathrm{tr}\big( \tilde W_{iF}^T \hat\kappa'_{iF} \hat\Lambda_{iF}^T \dot{\hat e}_{iF} \tilde e_{iF}^T \big) - \frac{\alpha}{2} \mathrm{tr}\big( \tilde\Lambda_{iF}^T \dot{\hat e}_{iF} \tilde e_{iF}^T \hat W_{iF}^T \hat\kappa'_{iF} \big) \\ \le{}& -k_1 \|\tilde e_{iF}\|^2 - k_2 \|E_i\|^2 + \frac{J_1\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big)^2}{4 k_3} \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|^2 + \frac{\beta_2^2 J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big)^2}{4 \alpha k_4} \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|^2, \end{aligned} \tag{65}$$

where $k_{\min} = \min\{k_1, k_2\}$, $\xi = \min\{k_3, \alpha k_4 / \beta_2^2\}$, and $J\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big)^2 = J_1\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big)^2 + J_2\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big)^2$. Hence, the following conclusion can be obtained:

$$\dot V_{iL}(\tilde e_{iF}, E_i) \le -k_{\min} \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|^2 + \frac{J\big(\|\varphi_i(\tilde e_{iF}^T, E_i^T)\|\big)^2 \, \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|^2}{4\xi} \le -c \, \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|^2. \tag{66}$$
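The step from (65) to (66) merges the two bound terms using $\xi = \min\{k_3, \alpha k_4/\beta_2^2\}$, and the bound in (66) is negative exactly when $J(\|\varphi_i\|) < 2\sqrt{k_{\min}\xi}$, that is, inside a region of radius $J^{-1}(2\sqrt{k_{\min}\xi})$. A minimal sketch, assuming linear $J_1(s) = c_1 s$ and $J_2(s) = c_2 s$ and arbitrary positive constants, checks both facts numerically:

```python
import numpy as np

# Hypothetical positive constants; J_1 and J_2 are assumed linear, J_j(s) = c_j s.
k1, k2, k3, k4 = 2.0, 3.0, 1.5, 0.5
alpha, beta2 = 2.0, 1.0
c1, c2 = 0.4, 0.3

k_min = min(k1, k2)
xi = min(k3, alpha * k4 / beta2**2)
c = np.hypot(c1, c2)                        # J(s) = sqrt(J_1(s)^2 + J_2(s)^2) = c s

def split_coeff(s):
    """Coefficient of ||phi_i||^2 from the two separate terms in the last line of (65)."""
    return (c1 * s) ** 2 / (4 * k3) + beta2**2 * (c2 * s) ** 2 / (4 * alpha * k4)

def merged_coeff(s):
    """Coefficient J(s)^2 / (4 xi) used in (66)."""
    return (c * s) ** 2 / (4 * xi)

def vdot_bound(s):
    """Middle expression of (66) as a function of s = ||phi_i||."""
    return (-k_min + merged_coeff(s)) * s**2

radius = 2 * np.sqrt(k_min * xi) / c        # J^{-1}(2 sqrt(k_min xi)) for linear J
s = np.linspace(0.0, 2 * radius, 201)
print(bool(np.all(split_coeff(s) <= merged_coeff(s) + 1e-12)), radius)
```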

Therefore, for some positive constant $c$, $-c \|\varphi_i(\tilde e_{iF}^T, E_i^T)\|^2$ is a negative semidefinite function on the adjustable region $D$, defined as

$$D = \big\{ d(t) \ \big|\ \|d\| \le J^{-1}\big( 2\sqrt{k_{\min} \xi} \big) \big\}, \tag{67}$$

so Lyapunov stability theory shows that the system is stable. To make the subsystem of the time-varying constrained reconfigurable modular robot track the desired trajectory asymptotically, we design a novel decentralized reinforcement learning robust optimal tracking controller that uses a robust term to compensate for the neural network approximation errors. The robust control term is designed as

$$u_{irb} = \frac{N_{rb} \, e_i}{e_i^T e_i + \zeta}. \tag{68}$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ satisfies

$$\begin{aligned} N_{rb} \ge{}& \bigg[ \frac{\delta_{hi}^2}{2 n_1} + \frac{n_1 \big( -\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i)/2 \big)^2}{2 n_2} + n_1 n_2 \|b_i(e_i)\|^2 \big( -\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i)/2 \big)^2 \bigg] \cdot \frac{e_i^T e_i + \zeta}{2 n_1 n_2 \|b_i(e_i)\| \, e_i^T e_i} \\ \ge{}& \Big[ n_1^2 \big( -\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i)/2 \big)^2 + n_2 \delta_{hi}^2 + 2 n_1^2 n_2^2 \|b_i(e_i)\|^2 \big( -\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i)/2 \big)^2 \Big] \cdot \frac{e_i^T e_i + \zeta}{4 n_1^2 n_2^2 \|b_i(e_i)\| \, e_i^T e_i}. \end{aligned} \tag{69}$$

Therefore, the global control law can be designed as follows:

$$u_{\mathrm{mix}} = \hat u_i + u_{irb} = -\frac{1}{2} R^{-1} b_i^T(e_i) S_{ia}(e_i)^T \hat W_{ia} + \frac{N_{rb} \, e_i}{e_i^T e_i + \zeta}. \tag{70}$$
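The mixed law (70) adds the bounded robust term (68) to the actor output (34). The sketch below assumes that $e_i$ and $u_i$ have the same dimension (needed for the sum to be defined), treats $N_{rb}$ as a scalar, and uses illustrative numbers. It also checks a property that follows directly from (68): $\|u_{irb}\| = N_{rb}\|e_i\|/(\|e_i\|^2 + \zeta) \le N_{rb}/(2\sqrt\zeta)$, with the peak at $\|e_i\| = \sqrt\zeta$.

```python
import numpy as np

def robust_term(e, N_rb, zeta):
    """(68): u_irb = N_rb e_i / (e_i^T e_i + zeta)."""
    return N_rb * e / (e @ e + zeta)

def mixed_control(e, s_a, W_a, R, b, N_rb, zeta):
    """(70): u_mix = u_hat_i + u_irb, with s_a = S_ia(e_i) and W_a of shape (h, n).
    Adding the two parts assumes e_i and u_i have the same dimension."""
    u_hat = -0.5 * np.linalg.solve(R, b.T @ (s_a @ W_a))
    return u_hat + robust_term(e, N_rb, zeta)

# The robust term is bounded by N_rb / (2 sqrt(zeta)) and vanishes as e -> 0.
N_rb, zeta = 5.0, 0.04                      # illustrative values
rng = np.random.default_rng(5)
peak = max(np.linalg.norm(robust_term(e, N_rb, zeta))
           for e in rng.normal(scale=0.5, size=(1000, 1)))
print(peak, N_rb / (2 * np.sqrt(zeta)))

# Hypothetical single-joint numbers: u_hat = -0.5 and u_irb = 10 here
u = mixed_control(np.array([0.1]), np.array([1.0, 0.5]), np.array([[2.0], [-2.0]]),
                  np.array([[1.0]]), np.array([[1.0]]), N_rb, zeta)
print(u)
```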

Theorem 6. Consider the dynamics (14) of the subsystem of the time-varying constrained reconfigurable modular robot. If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller (70) of the subsystem is adopted, then the closed-loop system is stable and the actual output tracks the desired trajectory asymptotically.

Proof. Design the Lyapunov function

$$V_{iu}(e_i, u_{\mathrm{mix}}) = \frac{1}{2 n_1} \mathrm{tr}\{\tilde W_{ic}^T \tilde W_{ic}\} + \frac{n_1}{2 n_2} \mathrm{tr}\{\tilde W_{ia}^T \tilde W_{ia}\} + n_1 n_2 \Big[ e_{iF}^T e_{iF} + \Xi \int_0^{\infty} r_i(e_i, u_{\mathrm{mix}}) \, d\tau \Big], \tag{71}$$

where $\Xi > 0$ is an undetermined parameter and $t \le \tau < \infty$. The time derivative of (71) is

$$\begin{aligned} \dot V_{iu}(e_i, u_{\mathrm{mix}}) ={}& \frac{1}{2 n_1} \mathrm{tr}\{\tilde W_{ic}^T \dot{\tilde W}_{ic}\} + \frac{n_1}{2 n_2} \mathrm{tr}\{\tilde W_{ia}^T \dot{\tilde W}_{ia}\} + n_1 n_2 \big[ e_{iF}^T \dot e_{iF} + \Xi \, r_i(e_i, u_{\mathrm{mix}}) \big] \\ ={}& \frac{1}{2 n_1} \mathrm{tr}\Big\{ \tilde W_{ic}^T \big( -n_1 L_i ( L_i^T \tilde W_{ic} + \delta_{hi} ) \big) \Big\} \\ & + \frac{n_1}{2 n_2} \mathrm{tr}\Big\{ \tilde W_{ia}^T \Big[ -n_2 S_{ia}(e_i) \Big( \tilde W_{ia}^T S_{ia}(e_i) - \varepsilon_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T \tilde W_{ic} - \frac{1}{2} R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i) \Big) \Big] \Big\} \\ & + n_1 n_2 \, e_{iF}^T \big( W_{iF}^T \kappa(\Lambda_{iF}^T e_i) + \varepsilon_{iF}(e_i) + b_i(e_i) u_{\mathrm{mix}} \big) + n_1 n_2 \Xi \big( e_i^T Q_e e_i + u_{\mathrm{mix}}^T R u_{\mathrm{mix}} \big) \\ \le{}& -\Big( L_{im}^2 - \frac{n_1}{2} L_{iM}^2 \Big) \|\tilde W_{ic}\|^2 + \frac{1}{2 n_1} \delta_{hi}^2 - \Big( n_1 S_{iam}^2 - \frac{3}{4} n_1 n_2 S_{iaM}^2 \Big) \|\tilde W_{ia}\|^2 \\ & + \frac{n_1}{4 n_2} \|R^{-1}\|^2 \|b_i(e_i)\|^2 \|\nabla S_{icM}\|^2 \|\tilde W_{ic}\|^2 \\ & + \frac{n_1}{2 n_2} \Big( \varepsilon_{ia}(e_i) + \frac{R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i)}{2} \Big)^T \Big( \varepsilon_{ia}(e_i) + \frac{R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i)}{2} \Big) \\ & + n_1 n_2 \|b_i(e_i)\|^2 \varepsilon_{ia}(e_i)^T \varepsilon_{ia}(e_i) + n_1 n_2 S_{iaM}^2 \|b_i(e_i)\|^2 \|\tilde W_{ia}\|^2 \\ & + n_1 n_2 \Xi \lambda_{\min}(Q_e) \|e_{iF}\|^2 + n_1 n_2 \big( \|b_i(e_i)\|^2 - \Xi \lambda_{\min}(R) \big) \|u_{\mathrm{mix}}\|^2. \end{aligned} \tag{72}$$

If the following inequalities hold (the first two are the standard eigenvalue bounds for the symmetric positive definite weights $Q_e$ and $R$, and the third is a design condition on $\Xi$),

$$
\begin{gathered}
\lambda_{\min}(Q_e)\|e_{iF}\|_2^{2} \le e_{iF}^{T} Q_e e_{iF} \le \lambda_{\max}(Q_e)\|e_{iF}\|_2^{2}, \\
\lambda_{\min}(R)\|u_{\mathrm{mix}}\|_2^{2} \le u_{\mathrm{mix}}^{T} R\,u_{\mathrm{mix}} \le \lambda_{\max}(R)\|u_{\mathrm{mix}}\|_2^{2}, \\
\Xi > \frac{\|b_i(e_i)\|^{2}}{\lambda_{\min}(R)},
\end{gathered}
\tag{73}
$$

then $\dot V_{iu}(e_i, u_{\mathrm{mix}})$ can be further bounded as

$$
\begin{aligned}
\dot V_{iu}(e_i, u_{\mathrm{mix}}) \le{}& -\Big(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2} - \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\Big)\|\tilde W_{ic}\|^{2} \\
&- \Big(n_1 S_{iam}^{2} - \frac{3}{4}n_1 n_2 S_{iaM}^{2} - n_1 n_2 S_{iaM}^{2}\|b_i(e_i)\|^{2}\Big)\|\tilde W_{ia}\|^{2} \\
&- n_1 n_2\big(\Xi\,\lambda_{\min}(R) - \|b_i(e_i)\|^{2}\big)\|u_{\mathrm{mix}}\|^{2} - n_1 n_2\,\Xi\,\lambda_{\min}(Q_e)\|e_{iF}\|^{2} \\
\le{}& -n_1 n_2\,\Xi\,\lambda_{\min}(Q_e)\|e_{iF}\|^{2}
\end{aligned}
\tag{74}
$$

Therefore, $\dot V_{iu}(e_i, u_{\mathrm{mix}}) < 0$ whenever $e_{iF} \neq 0$.
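Condition (73) can be checked numerically. The sketch below is illustrative only and is not the authors' code. It computes the extreme eigenvalues of a symmetric 2×2 weighting matrix $R$, checks the eigenvalue (Rayleigh-quotient) bounds, returns the smallest admissible $\Xi$, and evaluates the sign of the $\|u_{\mathrm{mix}}\|^2$ coefficient in (72). The 2×2 size, the matrix entries, and the value of $\|b_i(e_i)\|^2$ are all hypothetical.

```python
import math


def sym2_eigs(a, b, c):
    """Return (lambda_min, lambda_max) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


def quad_form(a, b, c, x):
    """Return x^T M x for M = [[a, b], [b, c]] and x = (x0, x1)."""
    return a * x[0] ** 2 + 2.0 * b * x[0] * x[1] + c * x[1] ** 2


def rayleigh_bounds_hold(a, b, c, x, tol=1e-12):
    """Check lambda_min ||x||^2 <= x^T M x <= lambda_max ||x||^2, as used in (73)."""
    lam_min, lam_max = sym2_eigs(a, b, c)
    norm_sq = x[0] ** 2 + x[1] ** 2
    value = quad_form(a, b, c, x)
    return lam_min * norm_sq - tol <= value <= lam_max * norm_sq + tol


def xi_lower_bound(b_norm_sq, R):
    """Smallest admissible Xi from (73): Xi must exceed ||b_i||^2 / lambda_min(R)."""
    lam_min, _ = sym2_eigs(*R)
    if lam_min <= 0.0:
        raise ValueError("R must be positive definite")
    return b_norm_sq / lam_min


def u_mix_coefficient(b_norm_sq, xi, R, n1=1.0, n2=1.0):
    """Coefficient n1*n2*(||b_i||^2 - Xi*lambda_min(R)) of ||u_mix||^2 in (72)."""
    lam_min, _ = sym2_eigs(*R)
    return n1 * n2 * (b_norm_sq - xi * lam_min)
```

For example, with the hypothetical $R = \mathrm{diag}(2, 1)$ (passed as `(2.0, 0.0, 1.0)`), $\lambda_{\min}(R) = 1$. With $\|b_i\|^2 = 4$, any $\Xi > 4$ makes the $\|u_{\mathrm{mix}}\|^2$ coefficient negative; $\Xi = 5$ gives $-1$ when $n_1 = n_2 = 1$.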

4. Simulations

To verify the validity and convergence of the proposed decentralized reinforcement learning robust optimal tracking control method based on ACI, and to study the convergence of the error by comparing simulation results, this paper applies the method to two different configurations of the time varying externally constrained reconfigurable modular robot, shown in Figures 3 and 4.

To simplify the analysis, these configurations are represented by the analytic charts shown in Figures 5 and 6. Here $L_1$, $L_2$, and $L_4$ are the lengths of the links, and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint is modeled as a rotating column whose angle $\alpha(t)$ with the $X$-axis changes with time. The constraint equations of configuration A and configuration B are as follows:

$$
\begin{aligned}
\Psi_A(q,t) &= L_1\cos q_1 + L_2\cos q_2 - \big[L_3 + L_4\cot\alpha(t)\big],\\
\Psi_B(q,t) &= L_1 + L_2\cos q_2 - \big[L_3 + L_4\cot\alpha(t)\big].
\end{aligned}
\tag{75}
$$

In the equations above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$
\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2}.
\tag{76}
$$

The initial joint positions are $q_1(0) = 2$ and $q_2(0) = 2$ in configuration A, and $q_1(0) = 2$ and $q_2(0) = 2$ in configuration B. The initial joint velocities are zero. The dynamic models of configurations A and B are as follows:

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos q_2 + 0.6066 & 0.18\cos q_2 + 0.1233 \\ 0.18\cos q_2 + 0.1233 & 0.1233 \end{bmatrix}, &
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^2 q_2 & -0.06\cos q_2 \\ -0.06\cos q_2 & 0.1233 \end{bmatrix},\\
C_A(q,\dot q) &= \begin{bmatrix} -0.36\sin q_2\,\dot q_2 & -0.18\sin q_2\,\dot q_2 \\ 0.18\sin q_2\,(\dot q_1 - \dot q_2) & 0.18\sin q_2\,\dot q_1 \end{bmatrix}, &
C_B(q,\dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot q_2 & 0.06\sin q_2\,\dot q_2 \\ 0.06\sin q_2\,\dot q_2 & 0 \end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin q_1 \\ -5.88\sin(q_1 + q_2) \end{bmatrix}, &
G_B(q) &= \begin{bmatrix} 0 \\ -5.88\cos q_2 \end{bmatrix},\\
F_A(q,\dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1) \\ 1.2\,\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2) \end{bmatrix}, &
F_B(q,\dot q) &= \begin{bmatrix} 0 \\ 1.5\,\dot q_2 + \sin q_2 + 1.2\,\mathrm{sgn}(\dot q_2) \end{bmatrix}.
\end{aligned}
\tag{77}
$$
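The sketch below makes the configuration A model in (77) concrete by evaluating the unconstrained forward dynamics $\ddot q = M_A(q)^{-1}\big(u - C_A(q,\dot q)\dot q - G_A(q) - F_A(q,\dot q)\big)$. It is a minimal illustration under stated assumptions: the constraint-force term $J_\Psi^{T}(q,t)f$ is deliberately omitted, and the coefficients are taken as printed in (77). The function names are hypothetical.

```python
import math


def sgn(v):
    """Sign function used in the friction-like term F_A."""
    return (v > 0) - (v < 0)


def forward_dynamics_A(q, dq, u):
    """ddq = M_A^{-1} (u - C_A dq - G_A - F_A) for configuration A, eq. (77).

    The constraint-force term J_Psi^T(q, t) f is omitted in this sketch.
    """
    q1, q2 = q
    dq1, dq2 = dq
    c2, s2 = math.cos(q2), math.sin(q2)
    # Inertia matrix M_A(q) (symmetric)
    m11, m12, m22 = 0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233, 0.1233
    # Coriolis/centrifugal matrix C_A(q, dq)
    c11, c12 = -0.36 * s2 * dq2, -0.18 * s2 * dq2
    c21, c22 = 0.18 * s2 * (dq1 - dq2), 0.18 * s2 * dq1
    # Gravity vector G_A(q)
    g1 = -5.88 * math.sin(q1 + q2) - 17.64 * math.sin(q1)
    g2 = -5.88 * math.sin(q1 + q2)
    # Friction-like term F_A(q, dq)
    f1 = dq1 + 10.0 * math.sin(3.0 * q1) + 2.0 * sgn(dq1)
    f2 = 1.2 * dq2 + 5.0 * math.sin(2.0 * q2) + sgn(dq2)
    # Right-hand side r = u - C dq - G - F
    r1 = u[0] - (c11 * dq1 + c12 * dq2) - g1 - f1
    r2 = u[1] - (c21 * dq1 + c22 * dq2) - g2 - f2
    # Solve the 2x2 system M ddq = r by Cramer's rule (det(M_A) > 0 for all q2)
    det = m11 * m22 - m12 * m12
    return (m22 * r1 - m12 * r2) / det, (m11 * r2 - m12 * r1) / det
```

A quick sanity check: at rest, an input equal to $G_A(q) + F_A(q, 0)$ holds the arm still, so the returned acceleration is (numerically) zero.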

The desired trajectories of configurations A and B are as follows.

Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos t + 0.2\sin(3t),\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\big(\alpha(t) - y_{1d}\big) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{78}
$$

Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$
\begin{aligned}
y_{1d} &= 0,\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\alpha(t) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{79}
$$

Because of the dynamic external constraint, joint 1 in configuration B does not move, so its desired trajectory is zero.
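The reference signals in (76), (78), and (79) can be generated directly. This section does not give numerical link lengths, so the sketch below uses hypothetical values $L_1 = 0.5$, $L_2 = 1.0$, $L_3 = 0.3$. These values are chosen only so that the $\arcsin$ argument stays inside $[-1, 1]$ for all $t$.

```python
import math

# Hypothetical link parameters (not given in this section), chosen so that
# |L1*sin(.) - L3*sin(.)| / L2 <= 0.8 and asin() is always defined.
L1, L2, L3 = 0.5, 1.0, 0.3


def alpha(t):
    """Angle between the time varying constraint and the X-axis, eq. (76)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)


def desired_A(t):
    """Desired joint trajectories (y1d, y2d) of configuration A, eq. (78)."""
    y1 = 0.5 * math.cos(t) + 0.2 * math.sin(3.0 * t)
    a = alpha(t)
    y2 = math.asin((L1 * math.sin(a - y1) - L3 * math.sin(a)) / L2) + a
    return y1, y2


def desired_B(t):
    """Desired joint trajectories (y1d, y2d) of configuration B, eq. (79); joint 1 is fixed."""
    y1 = 0.0
    a = alpha(t)
    y2 = math.asin((L1 * math.sin(a) - L3 * math.sin(a)) / L2) + a
    return y1, y2
```

Sampling `desired_A(t)` and `desired_B(t)` on $0 \le t \le 10$ s reproduces the 10 s horizon used in Figures 7 to 22. The amplitudes of $y_{2d}$ depend on the assumed link lengths.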

This part confirms that the proposed method applies to different configurations and verifies how well the subsystems track their desired trajectories under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method. To do so, it compares the classical RBF neural network control method with the ACI based decentralized reinforcement learning robust optimal tracking control method.

Figures 7 to 14 show the joint tracking and error curves obtained when a classical RBF neural network [4] is used to compensate the dynamic nonlinear term and the interconnection term in each subsystem. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories. This is because the classical neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11 to 14 show that the classical neural control method cannot compensate well for the joint subsystem constraints of the reconfigurable modular robot when time varying constraints are present. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness, and the tracking errors grow.

Figure 5: The analytic chart of configuration A (link lengths $L_1$, $L_2$, $L_4$; distance $L_3$; joint angles $q_1$, $q_2$; constraint angle $\alpha$; axes $X$, $Y$).

Figure 6: The analytic chart of configuration B (link lengths $L_1$, $L_2$, $L_4$; distance $L_3$; joint angles $q_1$, $q_2$; constraint angle $\alpha$; axes $X$, $Y$).

Figures 15 to 22 show the joint tracking and error curves obtained when ACI is used to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in each subsystem. Figures 15 to 18 show that, under the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the identification ability of ACI, which identifies the uncertainties contained in the subsystems in a short time. The error curves in Figures 19 to 22 show that the tracking error is very small and converges to zero in a short time. In addition, the joint subsystems remain robust when the proposed decentralized control method is adopted.

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. They indicate that the proposed decentralized control method satisfies the reachability requirement of the reconfigurable modular robot, and no singular displacements of the joint subsystems or the end-effector appear. The parameters used in ACI are listed in Table 1.

Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Table 1: Parameter list of action-critic-identifier.

| $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
|---|---|---|---|---|---|---|---|---|
| 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |

The simulation results show that, compared with the classical RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. In both configurations, the joint variables track the desired trajectory within a very short time, and the error fluctuates only minimally within its convergence range.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error in rad versus time in s).

Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error in rad versus time in s).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, ACI is combined with an RBF neural network to propose a novel decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots. The theory addresses the continuous-time nonlinear optimal control problem for strongly coupled robotic systems with uncertainty. First, we built the model of each subsystem under time varying external constraints and described the global robot system as a synthesis of interconnected subsystems. Second, ACI is used to estimate the HJB equation and the global uncertainty. A continuous-time optimal $Q$-function replaces the traditional optimal value function, the optimal $Q$-function and the optimal control policy are approximated by the critic-NN and the action-NN, and the global uncertainty is identified by the identifier. Third, we designed a novel decentralized robust optimal tracking controller so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions are designed to confirm the stability of ACI and of the subsystems. Finally, comparative simulation examples on two different configurations of the robot are presented to confirm the superiority of the proposed control theory.

In future work, more complex configurations of time varying externally constrained reconfigurable modular robots will be considered, together with decentralized controllers that achieve higher control and error precision.

Figure 19: Tracking error curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 20: Tracking error curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 21: Tracking error curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 22: Tracking error curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 23: 3D tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, “Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode,” INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, “Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control,” Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, “Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic,” Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, “Decentralized adaptive neural network control for reconfigurable manipulators,” in Proceedings of the Chinese Control and Decision Conference (CCDC ’10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, “A new distributed control scheme of modular and reconfigurable robots,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA ’07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, “Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator,” Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, “Lebesgue-sampling-based optimal control problems with time aggregation,” IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, “Efficient reinforcement learning using recursive least-squares methods,” Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, “Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, “Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator,” in The International Joint Conference on Neural Networks, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, “Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, “Neural network control of a robot interacting with an uncertain viscoelastic environment,” IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, “A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm,” in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM ’11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, “A single network adaptive critic-based redundancy resolution scheme for robot manipulators,” IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from Delayed Rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, “Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems,” IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, “Deterministic learning based adaptive network control of robot in task space,” Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, “Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure,” IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, “A calculus for computing Filippov’s differential inclusion with application to the variable structure control of robot manipulators,” IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


and $J_\Psi^{T}(q,t)\,f$ is the contact force generated by the contact between the end of the reconfigurable modular robot and the external constraints.

When $m$ constraints are introduced for a robot working in free space, the limitation imposed by (1) removes $m$ degrees of freedom from the system. The number of degrees of freedom therefore changes from $n$ to $n - m$, so only $n - m$ independent joint displacements are needed to fully describe the constrained motion.

Define

$$
q = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}, \quad q_1 \in \mathbb{R}^{n-m}, \quad q_2 \in \mathbb{R}^{m}.
\tag{3}
$$

Substituting (3) into (1), we obtain

$$
\Psi\big(q_1, \Theta(q_1, t), t\big) = 0,
\tag{4}
$$

where

$$
q_2 = \Theta(q_1, t).
\tag{5}
$$

Therefore, (3) can be fully described by the joint displacement $q_1$:

$$
q = \begin{bmatrix} q_1 \\ \Theta(q_1, t) \end{bmatrix}.
\tag{6}
$$

The time derivative of (6) is

$$
\dot q = \begin{bmatrix} \dot q_1 \\[2pt] \dfrac{\partial\Theta(q_1,t)}{\partial q_1}\dot q_1 + \dfrac{\partial\Theta(q_1,t)}{\partial t} \end{bmatrix}
= \begin{bmatrix} I_{n-m} & 0 \\[2pt] \dfrac{\partial\Theta(q_1,t)}{\partial q_1} & I_m \end{bmatrix}
\begin{bmatrix} \dot q_1 \\ 0 \end{bmatrix}
+ \begin{bmatrix} 0 \\[2pt] \dfrac{\partial\Theta(q_1,t)}{\partial t} \end{bmatrix}
= T\dot\theta + H.
\tag{7}
$$

In (7),

$$
T = \begin{bmatrix} I_{n-m} & 0 \\[2pt] \dfrac{\partial\Theta(q_1,t)}{\partial q_1} & I_m \end{bmatrix} \in \mathbb{R}^{n\times n}, \quad
\theta = \begin{bmatrix} q_1 \\ 0 \end{bmatrix} \in \mathbb{R}^{n}, \quad
H = \begin{bmatrix} 0 \\[2pt] \dfrac{\partial\Theta(q_1,t)}{\partial t} \end{bmatrix} \in \mathbb{R}^{n}.
\tag{8}
$$
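Decomposition (7)–(8) can be verified numerically for a small case. The sketch below assumes $n = 2$ joints and $m = 1$ constraint with a hypothetical constraint solution $\Theta(q_1, t) = 0.5\,q_1 + \sin t$, which is not the paper's $\Theta$. It compares $T\dot\theta + H$ (with $\dot\theta = E\dot q_1$ as in (11)–(12)) against a central finite difference of $q(t)$ from (6).

```python
import math

# Hypothetical constraint solution for n = 2 joints and m = 1 constraint:
# q2 = Theta(q1, t) = 0.5*q1 + sin(t). Chosen only for illustration.


def theta_fn(q1, t):
    return 0.5 * q1 + math.sin(t)


def dtheta_dq1(q1, t):
    return 0.5


def dtheta_dt(q1, t):
    return math.cos(t)


def qdot_via_T(q1, dq1, t):
    """q_dot = T * theta_dot + H from (7)-(8), with theta_dot = E * dq1 = [dq1, 0]."""
    T = [[1.0, 0.0],
         [dtheta_dq1(q1, t), 1.0]]   # [[I_{n-m}, 0], [dTheta/dq1, I_m]]
    theta_dot = [dq1, 0.0]           # E = [[1], [0]] as in (11)-(12)
    H = [0.0, dtheta_dt(q1, t)]      # [0, dTheta/dt]
    return [sum(T[r][k] * theta_dot[k] for k in range(2)) + H[r] for r in range(2)]


def qdot_finite_difference(q1_of_t, t, h=1e-6):
    """Central difference of q(t) = [q1(t), Theta(q1(t), t)], as in (6)."""
    def q(s):
        q1 = q1_of_t(s)
        return [q1, theta_fn(q1, s)]
    q_plus, q_minus = q(t + h), q(t - h)
    return [(a - b) / (2.0 * h) for a, b in zip(q_plus, q_minus)]
```

For example, with $q_1(t) = t^2$ the two functions agree to about $10^{-6}$ at any $t$, which illustrates that the lower block row of $T$ carries the chain-rule term $(\partial\Theta/\partial q_1)\dot q_1$ and $H$ carries the explicit time dependence.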

Therefore, the second derivative of $q$ is easily obtained as

$$
\ddot q = T\ddot\theta + \dot T\dot\theta + \dot H.
\tag{9}
$$

Substituting (7) and (9) into (2), we get

$$
u + J_\Psi^{T}(q,t)\,f = M(q)\big(T\ddot\theta + \dot T\dot\theta + \dot H\big) + C(q,\dot q)\big(T\dot\theta + H\big) + G(q) + F(q,\dot q).
\tag{10}
$$

Define

$$
E = \begin{bmatrix} I_{(n-m)\times(n-m)} \\ 0_{m\times(n-m)} \end{bmatrix} \in \mathbb{R}^{n\times(n-m)}.
\tag{11}
$$

Therefore,

$$
\theta = \begin{bmatrix} q_1 \\ 0 \end{bmatrix} = E q_1.
\tag{12}
$$

So (2) can be decomposed into the following form:

$$
\sum_{j=1}^{n} M_{ij}(q)\big[(TE\ddot q_1)_j + (\dot TE\dot q_1)_j + \dot H_j\big]
+ \sum_{j=1}^{n} C_{ij}(q,\dot q)\big[(TE\dot q_1)_j + H_j\big]
+ G_i(q) + F_i(q_i,\dot q_i) - f_i = u_i.
\tag{13}
$$

In the equation above, $(TE\ddot q_1)_j$, $(\dot TE\dot q_1)_j$, $(TE\dot q_1)_j$, and $H_j$ are the $j$th elements of $TE\ddot q_1$, $\dot TE\dot q_1$, $TE\dot q_1$, and $H$, respectively. $G_i(q)$, $F_i(q_i,\dot q_i)$, and $u_i$ are the $i$th elements of $G(q)$, $F(q,\dot q)$, and $u$; $f_i$ is the constraint force acting on the $i$th joint; and $M_{ij}(q)$ and $C_{ij}(q,\dot q)$ are the $ij$th elements of $M(q)$ and $C(q,\dot q)$, respectively. As shown in Figure 1, each subsystem's dynamic model can then be formulated in joint space as follows:

$$
M_i(q_i)\ddot q_i + C_i(q_i,\dot q_i)\dot q_i + G_i(q_i) + F_i(q_i,\dot q_i) + Z_i(q,\dot q,\ddot q) - f_i = u_i,
\tag{14}
$$

$$
\begin{aligned}
Z_i(q,\dot q,\ddot q) ={}& \sum_{j=1,\, j\neq i}^{n} M_{ij}(q)\big[(TE\ddot q_1)_j + (\dot TE\dot q_1)_j + \dot H_j\big]
+ M_{ii}(q)\big[(TE\ddot q_1)_i + (\dot TE\dot q_1)_i + \dot H_i\big] - M_i(q_i)\ddot q_i \\
&+ \sum_{j=1,\, j\neq i}^{n} C_{ij}(q,\dot q)\big[(TE\dot q_1)_j + H_j\big]
+ C_{ii}(q,\dot q)\big[(TE\dot q_1)_i + H_i\big] - C_i(q_i,\dot q_i)\dot q_i
+ \big[G_i(q) - G_i(q_i)\big].
\end{aligned}
\tag{15}
$$

Let $x_i = [x_{i1}, x_{i2}]^{T} = [q_i, \dot q_i]^{T}$ for $i = 1, \dots, n$. Then (10) can be expressed by the following state equation:

$$
S_i:\ \left\{
\begin{aligned}
\dot x_{i1} &= x_{i2},\\
\dot x_{i2} &= -f(x_i, u_i) - h_i(q,\dot q,\ddot q) - f_i,\\
y_i &= x_{i1},
\end{aligned}
\right.
\tag{16}
$$

Figure 1: The architecture of the time varying constrained reconfigurable modular robot system. Subsystems $1, \dots, i, \dots, n$ each obey $M_i\ddot q_i + C_i\dot q_i + G_i + F_i + Z_i - f_i = u_i$ and are coupled through $\ddot q = M^{-1}\big(u - C\dot q - G - F + J_\Psi^{T}(q,t)f\big)$.

where $x_i$ is the state vector of subsystem $S_i$, $y_i$ is the output of subsystem $S_i$, and $h_i(q,\dot q,\ddot q)$ is the interconnection term of the subsystem. The terms $f(x_i, u_i)$ and $h_i(q,\dot q,\ddot q)$ are defined as

$$
\begin{aligned}
f(x_i, u_i) &= M_i^{-1}(q_i)\big[C_i(q_i,\dot q_i)\dot q_i + G_i(q_i) + F_i(q_i,\dot q_i) - u_i\big],\\
h_i(q,\dot q,\ddot q) &= -M_i^{-1}(q_i)\,Z_i(q,\dot q,\ddot q).
\end{aligned}
\tag{17}
$$
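The decentralized state equation (16), with $f(x_i,u_i)$ from (17), can be integrated per subsystem once $M_i$, $C_i$, $G_i$, $F_i$, $h_i$, and $f_i$ are supplied. The sketch below uses forward Euler on a scalar joint. The model callables, the treatment of $h_i$ and $f_i$ as constants, and the parameter values in the usage example are hypothetical simplifications for illustration.

```python
def subsystem_rhs(x, u, M, C, G, F, h, f_c):
    """Right-hand side of the subsystem state equation (16) with f from (17).

    x = (x1, x2) = (q_i, dq_i). M, C, G, F are hypothetical scalar callables;
    h stands for the interconnection term h_i and f_c for the constraint term f_i.
    """
    x1, x2 = x
    f_val = (C(x1, x2) * x2 + G(x1) + F(x1, x2) - u) / M(x1)  # f(x_i, u_i) in (17)
    return x2, -f_val - h - f_c                                # (16)


def simulate_euler(x0, u, dt, steps, **model):
    """Forward-Euler integration of (16) with a constant input u."""
    x = x0
    for _ in range(steps):
        dx1, dx2 = subsystem_rhs(x, u, **model)
        x = (x[0] + dt * dx1, x[1] + dt * dx2)
    return x
```

For example, with $M_i = 2$, all other terms zero, and $u_i = 2$, the acceleration is exactly 1. After 1000 steps of $10^{-3}$ s starting from rest, the velocity is 1.0 and the Euler position is 0.4995, which is close to the exact value 0.5.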

For the time varying constrained reconfigurable modular robot system, we need to design a decentralized robust optimal tracking control policy that makes each subsystem track its desired trajectory while keeping the tracking error convergent and bounded.

3. Decentralized Reinforcement Learning Robust Optimal Tracking Control Based on ACI and Q-Function

Assumption 1. The desired trajectory $y_{id}$, its derivatives $\dot y_{id}$ and $\ddot y_{id}$, and the input gain matrix $b_i(x_i)$ are bounded.

Then (16) can be transformed into the following form:

$$
S_i:\ \left\{
\begin{aligned}
\dot x_{i1} &= x_{i2},\\
\dot x_{i2} &= -\big[F(x_i, u_i) + h_i(q,\dot q,\ddot q) + f_i\big] + b_i(x_i)\,u_i,\\
y_i &= x_{i1},
\end{aligned}
\right.
\tag{18}
$$

where $F(x_i, u_i) = f(x_i, u_i) + b_i(x_i)\,u_i$.

Assumption 2. The interconnection terms are bounded and satisfy

$$
\big|h_i(q,\dot q,\ddot q)\big| \le \delta_{i0} + \sum_{j=1}^{n}\delta_{ij}\big(|s_{ij}|\big),
\tag{19}
$$

where $\delta_{i0} > 0$ is an unknown constant and $\delta_{ij}(|s_{ij}|) \ge 0$ is an unknown smooth Lipschitz function.

The trajectory tracking error of the $i$th joint subsystem is defined as

$$
e_i(t) = x_i - y_{id}.
\tag{20}
$$

For the continuous-time state equation (18) of the subsystem, which contains the nonlinear function and the interconnection terms, the value function can generally be defined as

$$
V_i^{u_i(e_i(t))}\big(e_i(t)\big) = \int_0^{\infty} r_i\big(e_i(t), u_i(e_i(t))\big)\,dt.
\tag{21}
$$

To simplify the notation, we write $e_i$ and $u_i$ instead of $e_i(t)$ and $u_i(e_i(t))$. The trajectory $y_{id}$ depends on the subsystem control $u_i$ for its updates, so using (21) directly could produce an infinite value. To avoid this, we transform the value function into the following form:

$$
V_i^{u_i}(e_i) = \int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau, \quad t \le \tau < \infty.
\tag{22}
$$

Thus, the optimal value function of the subsystem can be defined as follows:

$$
V_i^{*}(e_i) = \min_{u_i,\ t \le \tau < \infty} \int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau.
\tag{23}
$$

Here $r_i(e_i, u_i)$ is the reward function for the current state:

$$
r_i(e_i, u_i) = e_i^{T} Q_e e_i + u_i^{T} R\,u_i,
\tag{24}
$$

where $Q_e$ and $R$ are positive definite matrices.
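To make (21)–(24) concrete, the sketch below evaluates the quadratic reward (24) and approximates the value integral with the trapezoidal rule over a finite horizon. The finite horizon is an assumption that stands in for the infinite upper limit. The error and control signals are hypothetical closed-loop trajectories supplied as callables.

```python
import math


def quad(v, A):
    """v^T A v for a vector v (list) and square matrix A (nested lists)."""
    n = len(v)
    return sum(v[r] * A[r][c] * v[c] for r in range(n) for c in range(n))


def reward(e, u, Qe, R):
    """Quadratic reward r_i(e_i, u_i) = e_i^T Q_e e_i + u_i^T R u_i, eq. (24)."""
    return quad(e, Qe) + quad(u, R)


def truncated_value(e_traj, u_traj, Qe, R, dt, horizon):
    """Trapezoidal estimate of int_0^horizon r_i(e(t), u(t)) dt.

    A finite horizon stands in for the infinite upper limit in (21)-(22);
    e_traj and u_traj are callables t -> list (hypothetical closed-loop signals).
    """
    n = int(round(horizon / dt))
    vals = [reward(e_traj(k * dt), u_traj(k * dt), Qe, R) for k in range(n + 1)]
    return dt * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


if __name__ == "__main__":
    # e(t) = exp(-t), u(t) = 0, Q_e = R = 1: the exact infinite-horizon value is 1/2.
    v = truncated_value(lambda t: [math.exp(-t)], lambda t: [0.0],
                        [[1.0]], [[1.0]], dt=1e-3, horizon=10.0)
    print(round(v, 4))
```

The exponentially decaying error in the example makes the neglected tail beyond 10 s negligible. For errors that do not decay, the truncated integral grows with the horizon, which is the kind of divergence the reformulation in (22) is meant to avoid.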

Recording the value of state-action pairs is typically more useful than recording the value of states alone, because state-action pairs predict the reward. A state with a low reward does not necessarily have low state-action values. If the subsystem's state produces a higher reward over a period of time, the corresponding state-action pair can still have a higher value. Therefore, from a long-term perspective, defining a suitable state-action value function ($Q$-function) allows actions to produce more reward [21, 22].

According to (23) and (24), the continuous-time optimal $Q$-function can be defined as

$$
\begin{aligned}
Q_i^{*}(e_i, a_i, u_i) &= r_i(e_i, a_i, u_i) + V_i^{*}(e_i, u_i)\\
&= r_i(e_i, a_i, u_i) + \min_{u_i,\ t \le \tau < \infty}\int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau.
\end{aligned}
\tag{25}
$$

Assumption 3. The partial derivatives of $Q_i^{*}$ and $r_i(e_i,a_i,u_i)$ exist and are continuous in the domain.

According to (18) and (24), under the control policy $u_i$, the optimal $Q$-function satisfies the following Hamilton-Jacobi-Bellman (HJB) equation [23]:

$$\begin{aligned}\mathrm{HJB}_i(e_i,u_i,\nabla Q_i^{*})&=\min_{u_i(e_i)}\Big[r_i(e_i,a_i,u_i)+\nabla Q_i^{*}\big(-F(e_i,u_i)-h_i(e,\dot e,\ddot e)-f_i+b_i(e_i)u_i\big)\Big]\\&=\min_{u_i(e_i)}\Big[r_i(e_i,a_i,u_i)+\nabla Q_i^{*}\,\Phi_i(e_i,u_i)\Big],\end{aligned} \tag{26}$$

where $\Phi_i(e_i,u_i)=-F(e_i,u_i)-h_i(e,\dot e,\ddot e)-f_i+b_i(e_i)u_i$ denotes the global uncertainty, including the unknown dynamics of the subsystem and the interconnection term, and $\nabla Q_i^{*}=\partial Q_i^{*}/\partial e_i$ denotes the gradient of the optimal $Q$-function.

Lemma 4 (see [24]). Consider the dynamics of the subsystem of the time-varying constrained reconfigurable modular robot in (14). For the minimum of the HJB equation (26) to possess a stationary point with respect to $u_i$, the optimal $Q$-function and the optimal control policy must satisfy the following conditions:

(1) $\partial\,\mathrm{HJB}_i(e_i,u_i,\nabla Q_i)/\partial u_i=0$;
(2) $\partial^2\mathrm{HJB}_i(e_i,u_i,\nabla Q_i)/(\partial u_i\times\partial u_i^T)\ge 0$.

These necessary conditions lead to the following results:

(a) The bounded control policy can guarantee a local minimum of the HJB equation (26) and satisfy the constraints imposed on the control inputs.
(b) The Hessian matrix is positive definite, and the control policy $u_i$ can render the global minimum of the HJB equation.
(c) If an optimal algorithm exists, it is unique.

According to Lemma 4, if the reward function is smooth and the optimal control $u_i^{*}$ is adopted, then the HJB equation satisfies

$$\mathrm{HJB}_i^{*}(e_i,u_i^{*},\nabla Q_i^{*})=\min_{u_i^{*}(e_i)}\Big[r_i(e_i,a_i,u_i^{*})+\nabla Q_i^{*}\,\Phi_i(e_i,u_i^{*})\Big]=0. \tag{27}$$

The optimal control can then be expressed as

$$u_i^{*}(e_i)=\arg\min_{u_i^{*}}\Big[\mathrm{HJB}_i^{*}(e_i,u_i^{*},\nabla Q_i^{*})\Big]=-\frac12 R^{-1}b_i^T(e_i)\,\frac{\partial Q_i^{*}(e_i,a_i,u_i)^T}{\partial e_i}. \tag{28}$$
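The stationarity condition (1) of Lemma 4 can be checked numerically. The sketch below assumes a single-input subsystem with $Q_e=q_eI$ and scalar $R$, and treats $\nabla Q_i^{*}$, the input-independent part of $\Phi_i$, and $b_i(e_i)$ as given vectors; the values and function names are hypothetical. For the bracket of (26), the derivative with respect to $u_i$ is $2Ru_i+\nabla Q_i^{*}b_i(e_i)$, so the stationary point is $u_i^{*}=-\frac12R^{-1}b_i^T(\nabla Q_i^{*})^T$, and the second derivative $2R>0$ satisfies condition (2).

```python
def hamiltonian(u, e, grad_q, drift, b, q_e, r):
    """Bracketed term of (26) for a single-input subsystem.

    e, grad_q, drift and b are equal-length lists: grad_q stands in for the row
    vector dQ*/de, drift for the u-independent part of Phi_i, and b for the
    input vector b_i(e_i). Q_e = q_e*I and R = r (scalar) are assumptions.
    """
    quad_e = q_e * sum(x * x for x in e)
    phi = [d + bj * u for d, bj in zip(drift, b)]  # Phi_i(e_i, u) for this u
    return quad_e + r * u * u + sum(g * p for g, p in zip(grad_q, phi))


def optimal_input(grad_q, b, r):
    """Stationary point u* = -(1/2) R^{-1} b^T (dQ*/de)^T of the bracket."""
    return -0.5 * sum(bj * g for bj, g in zip(b, grad_q)) / r


args = ([0.3, -0.1], [1.2, 0.4], [0.5, -0.2], [0.0, 2.0], 1.0, 0.5)
u_star = optimal_input(args[1], args[3], args[5])
h = 1e-4
slope = (hamiltonian(u_star + h, *args) - hamiltonian(u_star - h, *args)) / (2 * h)
print(u_star, slope)  # the slope is zero up to rounding
```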

If the optimal $Q$-function $Q_i^{*}$ is continuously differentiable and known with initial value $Q_i^{*}(0)=0$, and both the optimal control policy $u_i^{*}(e_i)$ and the global uncertainty of the subsystem $\Phi_i(e_i,u_i^{*})$ are known, then the HJB equation (27) holds and is solvable. In practice, however, $Q_i^{*}$ is not differentiable everywhere, and $u_i^{*}(e_i)$ and $\Phi_i(e_i,u_i^{*})$ are unknown.

Therefore, it is not feasible to solve the HJB equation by conventional methods. In this paper, we combine the action-critic-identifier (ACI) with an RBF neural network to estimate the optimal control policy, the optimal $Q$-function, and the global uncertainty of the subsystem. The action-NN estimates $u_i^{*}(e_i)$, denoted as $\hat u_i(e_i)$; the critic-NN estimates $Q_i^{*}$, denoted as $\hat Q_i$; and a robust neural network identifier identifies $\Phi_i(e_i,u_i^{*})$, denoted as $\hat\Phi_i(e_i,u_i^{*})$. The block diagram of the ACI architecture is shown in Figure 2. The estimated HJB equation can be expressed as follows:

$$\mathrm{HJB}_i^{*}(e_i,\hat u_i,\nabla\hat Q_i)=\min_{u_i(e_i)}\Big[r_i(e_i,a_i,\hat u_i)+\nabla\hat Q_i\,\hat\Phi_i(e_i,\hat u_i)\Big]. \tag{29}$$

The identification error of the HJB equation above can be expressed as

$$\delta_{hi}=\mathrm{HJB}_i^{*}(e_i,\hat u_i,\nabla\hat Q_i)-\mathrm{HJB}_i^{*}(e_i,u_i^{*},\nabla Q_i^{*}). \tag{30}$$

A classic radial basis function neural network is proposed in [25], as shown in (31):

$$N(x)=W^{*T}S(x)+\varepsilon(x), \tag{31}$$


Figure 2: The architecture of action-critic-identifier (blocks: Action, Critic, Identifier, Subsystem, Reward function, and HJB error; signals include $Q_i(e_i,a_i,u_i)$, $r_i(e_i,a_i,u_i)$, $\Phi_i(e_i,u_i)$, $u_i$, and the HJB error $\delta_{hi}$).

where $W^{*}$ denotes the ideal neural network weights and $\varepsilon(x)$ represents the estimation error. With a sufficient number of nodes whose centers and widths are chosen appropriately, an RBF-NN can approximate any continuous function on a compact set. Therefore, the optimal $Q$-function and the optimal control policy can be expressed as follows:

$$\begin{aligned}Q_i^{*}&=W_i^T S_i(e_i)+\varepsilon_{ic}(e_i),\\u_i^{*}(e_i)&=-\frac12 R^{-1}b_i^T(e_i)\big[S_i(e_i)^T W_i+\varepsilon_{ia}(e_i)\big],\end{aligned} \tag{32}$$

where $S_i(e_i)=[s_{i1}(e_i)\ \cdots\ s_{in}(e_i)]^T$ denotes the smooth basis function vector of the neural network, $W_i$ denotes the ideal unknown neural network weights, and $\varepsilon_{ic}(e_i)$ and $\varepsilon_{ia}(e_i)$ are the estimation errors. Using $\hat Q_i$ and $\hat u_i(e_i)$ to estimate $Q_i^{*}$ and $u_i^{*}(e_i)$, we obtain

$$\hat Q_i=\hat W_{ic}^T S_{ic}(e_i), \tag{33}$$

$$\hat u_i(e_i)=-\frac12 R^{-1}b_i^T(e_i)\,S_{ia}(e_i)^T\hat W_{ia}. \tag{34}$$
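The critic and actor outputs (33) and (34) can be sketched with Gaussian radial basis functions. The Gaussian shape, the centers, the common width, the scalar error, and the scalar $b_i$ and $R$ are assumptions made for illustration; the text only requires smooth basis functions $S_{ic}$ and $S_{ia}$. The gradient helper gives $\nabla S_{ic}(e_i)$, which enters the update laws (37), (38), and (41).

```python
import math


def rbf_features(e, centers, width):
    """Gaussian RBF vector S(e) = [s_1(e), ..., s_n(e)] for a scalar error e.

    The Gaussian shape, the centers and the common width are illustrative
    assumptions; the text only requires smooth basis functions.
    """
    return [math.exp(-((e - c) ** 2) / (2.0 * width ** 2)) for c in centers]


def rbf_gradient(e, centers, width):
    """Element-wise derivative dS/de of the Gaussian features."""
    return [-(e - c) / width ** 2 * s
            for c, s in zip(centers, rbf_features(e, centers, width))]


def critic_output(w_c, e, centers, width):
    """Critic estimate (33): Q_hat = W_c^T S_c(e)."""
    return sum(w * s for w, s in zip(w_c, rbf_features(e, centers, width)))


def actor_output(w_a, e, centers, width, b, r):
    """Actor estimate (34) with scalar b_i and R: u_hat = -(1/2) R^{-1} b S_a^T W_a."""
    s_a = rbf_features(e, centers, width)
    return -0.5 / r * b * sum(s * w for s, w in zip(s_a, w_a))


centers = [-1.0, 0.0, 1.0]
print(critic_output([1.0, 1.0, 1.0], 0.0, centers, 1.0),
      actor_output([0.0, 2.0, 0.0], 0.0, centers, 1.0, b=1.0, r=0.5))
```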

In (33) and (34), $\hat W_{ic}(t)$ and $\hat W_{ia}(t)$ denote the weights of the critic-NN and the action-NN, respectively. The weight estimation errors are

$$\tilde W_{ic}(t)=W_i-\hat W_{ic}(t), \tag{35}$$

$$\tilde W_{ia}(t)=W_i-\hat W_{ia}(t). \tag{36}$$

The weight update law of the critic-NN is a gradient descent algorithm:

$$\dot{\hat W}_{ic}(t)=-n_1L_i\big(L_i^T\hat W_{ic}+e_i^TQ_ee_i+u_i^TRu_i\big). \tag{37}$$

In the equation above, $n_1>0$ is the adaptive gain of the neural network, and $L_i$ and $l_i$ are defined as

$$L_i=\frac{l_i}{l_i^Tl_i+1},\qquad l_i=\nabla S_{ic}(e_i)\,\dot e_i. \tag{38}$$

Therefore, according to the definitions above, the following inequalities can be obtained:

$$L_{im}\le L_i\le L_{iM},\qquad S_{icm}\le S_{ic}(e_i)\le S_{icM},\qquad S_{iam}\le S_{ia}(e_i)\le S_{iaM}. \tag{39}$$
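A single explicit-Euler step of the critic law (37), with the normalized regressor (38), is sketched below for a scalar error (an assumption); $\nabla S_{ic}$, $\dot e_i$, the reward value, the gain $n_1$, and the step size are placeholder inputs. Because $\|L_i\|=\|l_i\|/(\|l_i\|^2+1)\le\frac12$, the normalization keeps the upper bound $L_{iM}$ in (39) finite, and, with $L_i$ and the reward held fixed, one step scales the bracketed term of (37) by $1-\Delta t\,n_1\|L_i\|^2$.

```python
def normalized_regressor(grad_s, e_dot):
    """L_i = l_i / (l_i^T l_i + 1) with l_i = grad S_ic(e_i) * e_dot (scalar e_i)."""
    l = [g * e_dot for g in grad_s]
    denom = sum(x * x for x in l) + 1.0
    return [x / denom for x in l]


def critic_step(w_c, L, reward_value, n1, dt):
    """One explicit-Euler step of the critic law (37).

    Returns the updated weights and the bracketed term L^T W_c + r_i evaluated
    before the step; the weights move along -n1 * L * bracket.
    """
    bracket = sum(a * w for a, w in zip(L, w_c)) + reward_value
    return [w - dt * n1 * a * bracket for w, a in zip(w_c, L)], bracket


L = normalized_regressor([0.2, -0.5, 1.0], e_dot=2.0)
w1, bracket0 = critic_step([0.0, 0.0, 0.0], L, reward_value=1.0, n1=1.0, dt=0.5)
_, bracket1 = critic_step(w1, L, reward_value=1.0, n1=1.0, dt=0.5)
print(bracket0, bracket1)
```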


Combining (35) with (38), we obtain

$$\dot{\tilde W}_{ic}(t)=-n_1L_i\big(L_i^T\tilde W_{ic}+\delta_{hi}\big). \tag{40}$$

The weight update law of the action-NN is also developed by gradient descent:

$$\dot{\hat W}_{ia}(t)=-n_2S_{ia}(e_i)\Big(\hat W_{ia}^TS_{ia}(e_i)+\frac12R^{-1}b_i(e_i)\,\nabla S_{ic}(e_i)^T\hat W_{ic}\Big)^T. \tag{41}$$
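Similarly, the actor law (41) can be sketched for a scalar input with a frozen feature vector $S_{ia}(e_i)$ and a frozen critic contribution $\frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T\hat W_{ic}$; both are hypothetical values chosen for illustration. Under these assumptions, the bracket in (41) contracts by the factor $1-\Delta t\,n_2\|S_{ia}\|^2$ per Euler step, so it converges to zero whenever $0<\Delta t\,n_2\|S_{ia}\|^2<2$.

```python
def actor_step(w_a, s_a, critic_term, n2, dt):
    """One explicit-Euler step of the actor law (41) for a scalar input.

    critic_term stands for (1/2) R^{-1} b_i(e_i) grad S_ic(e_i)^T W_c_hat,
    precomputed from the critic; the bracket W_a^T S_a + critic_term is the
    quantity the actor drives toward zero.
    """
    bracket = sum(w * s for w, s in zip(w_a, s_a)) + critic_term
    return [w - dt * n2 * s * bracket for w, s in zip(w_a, s_a)], bracket


w_a = [0.0, 0.0, 0.0]
s_a = [0.5, 1.0, 0.5]  # hypothetical frozen feature vector S_ia(e_i)
for _ in range(20):
    w_a, _bracket = actor_step(w_a, s_a, critic_term=0.8, n2=1.0, dt=0.5)
print(w_a)
```

With these values the contraction factor is $0.25$, so twenty steps shrink the bracket to roughly $10^{-12}$ times its initial value.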

According to the estimation error of the action-NN in (36), the optimal control $u_i^{*}(e_i)$ minimizes the optimal $Q$-function, which yields

$$W_{ia}^TS_{ia}(e_i)+\frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^TW_{ic}+\frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)+\varepsilon_{ia}(e_i)=0. \tag{42}$$

Substituting (41) into (42), we obtain

$$\dot{\tilde W}_{ia}=-n_2S_{ia}(e_i)\Big(\tilde W_{ia}^TS_{ia}(e_i)+\frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T\tilde W_{ic}-\frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)-\varepsilon_{ia}(e_i)\Big). \tag{43}$$

After using the critic-NN and the action-NN to estimate $Q_i$ and $\hat u_i(e_i)$, we design a robust RBF-NN identifier to identify the nonlinear uncertainties of the subsystem. Here, $\Phi_i(e_i,\hat u_i)$ can be expressed as

$$\Phi_i(e_i,\hat u_i)=\dot e_{iF}=W_{iF}^T\kappa(\Lambda_{iF}^Te_{iF})+\varepsilon_{iF}(e_{iF})+b_i(e_i)\hat u_i, \tag{44}$$

where $\kappa(\cdot)$ denotes the basis function of the neural network, and $W_{iF}$ and $\Lambda_{iF}$ denote the unknown ideal neural network weights. Equation (44) can be identified by the robust RBF-NN identifier, which gives

$$\hat\Phi_i(e_i,\hat u_i)=\dot{\hat e}_{iF}=\hat W_{iF}^T\hat\kappa_{iF}+b_i(e_i)\hat u_i+\mu_i. \tag{45}$$

Here, $\hat\kappa_{iF}$ denotes the estimated value of the neural network basis function, $\hat W_{iF}$ and $\hat\Lambda_{iF}$ denote the estimated neural network weights, and $\mu_i\in\mathbb{R}$ denotes the feedback error term [26]:

$$\begin{aligned}\mu_i&=k\big(e_{iF}(t)-\hat e_{iF}(t)\big)-k\big(e_{iF}(0)-\hat e_{iF}(0)\big)+\vartheta=k\big(\tilde e_{iF}(t)-\tilde e_{iF}(0)\big)+\vartheta,\\\dot\vartheta&=(k\alpha+\gamma)\tilde e_{iF}+\beta_1\,\mathrm{sat}(\tilde e_{iF}),\end{aligned} \tag{46}$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gains and $\mathrm{sat}(\cdot)$ is a saturation function. Therefore, the state estimation error of the identifier-NN satisfies

$$\dot{\tilde e}_{iF}=\dot e_{iF}-\dot{\hat e}_{iF}=W_{iF}^T\kappa_{iF}-\hat W_{iF}^T\hat\kappa_{iF}+\varepsilon_{iF}(e_{iF})-\mu_i. \tag{47}$$
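The feedback term (46) can be implemented in discrete time by integrating $\vartheta$ with explicit Euler. In the sketch below, the identification error is scalar, $\mathrm{sat}(\cdot)$ is taken as a unit clipping function, and the class and method names are hypothetical; the text does not specify the saturation limit or the integration scheme.

```python
def sat(x, limit=1.0):
    """Saturation of a scalar; the unit limit is an assumption."""
    return max(-limit, min(limit, x))


class FeedbackTerm:
    """Discrete-time sketch of mu_i in (46) for a scalar identification error.

    mu = k*(e_tilde(t) - e_tilde(0)) + theta, where
    d(theta)/dt = (k*alpha + gamma)*e_tilde + beta1*sat(e_tilde)
    is integrated by explicit Euler (an implementation assumption).
    """

    def __init__(self, k, alpha, gamma, beta1, e_tilde0):
        self.k, self.alpha, self.gamma, self.beta1 = k, alpha, gamma, beta1
        self.e_tilde0 = e_tilde0
        self.theta = 0.0  # theta(0) = 0, so mu(0) = 0

    def update(self, e_tilde, dt):
        """Advance theta by one step and return the current value of mu."""
        self.theta += dt * ((self.k * self.alpha + self.gamma) * e_tilde
                            + self.beta1 * sat(e_tilde))
        return self.k * (e_tilde - self.e_tilde0) + self.theta


fb = FeedbackTerm(k=2.0, alpha=1.0, gamma=1.0, beta1=0.5, e_tilde0=0.1)
mu_1 = fb.update(0.1, dt=0.01)
print(mu_1)
```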

A filtered identification error is defined as

$$E_i=\dot{\tilde e}_{iF}+\alpha\tilde e_{iF}. \tag{48}$$

Its time derivative is

$$\begin{aligned}\dot E_i={}&W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}-\hat W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}+\alpha\dot{\tilde e}_{iF}-kE_i-\gamma\tilde e_{iF}\\&-\dot{\hat W}_{iF}^T\hat\kappa_{iF}+\dot\varepsilon_{iF}(e_{iF})-\beta_1\,\mathrm{sat}(\tilde e_{iF})-\hat W_{iF}^T\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^T\hat e_{iF}.\end{aligned} \tag{49}$$

Here, the weights $\hat W_{iF}$ and $\hat\Lambda_{iF}$ of the identification NN are updated by

$$\dot{\hat W}_{iF}=\mathrm{proj}\big(\Gamma_{iWF}\,\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}\tilde e_{iF}^T\big),\qquad\dot{\hat\Lambda}_{iF}=\mathrm{proj}\big(\Gamma_{i\Lambda F}\,\dot{\hat e}_{iF}\tilde e_{iF}^T\hat W_{iF}^T\hat\kappa'_{iF}\big), \tag{50}$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive constant adaptation gain matrices. To analyze the convergence of the filtered identification error, the term $-\hat W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}$ in (49) can be segregated as follows:

$$\begin{aligned}-\hat W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}={}&\frac12\tilde W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}+\frac12\hat W_{iF}^T\hat\kappa'_{iF}\tilde\Lambda_{iF}^T\dot{\hat e}_{iF}-\frac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}-\frac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot{\hat e}_{iF}\\={}&\frac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\tilde e}_{iF}+\frac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot{\tilde e}_{iF}-\frac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot e_{iF}-\frac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}\\&+\frac12\tilde W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}+\frac12\hat W_{iF}^T\hat\kappa'_{iF}\tilde\Lambda_{iF}^T\dot{\hat e}_{iF},\end{aligned}\tag{51}$$

where $\tilde W_{iF}^T=W_{iF}^T-\hat W_{iF}^T$ and $\tilde\Lambda_{iF}^T=\Lambda_{iF}^T-\hat\Lambda_{iF}^T$. Substituting (51) into (49), (49) reduces to

$$\dot E_i=P_{F1}+P_{F2}+P_{F3}-kE_i-\gamma\tilde e_{iF}-\beta_1\,\mathrm{sat}(\tilde e_{iF}), \tag{52}$$


where $P_{F1}$, $P_{F2}$, and $P_{F3}$ are, respectively,

$$P_{F1}=\frac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\tilde e}_{iF}+\frac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot{\tilde e}_{iF}-\hat W_{iF}^T\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^T\hat e_{iF}+\alpha\dot{\tilde e}_{iF}-\dot{\hat W}_{iF}^T\hat\kappa_{iF}, \tag{53}$$

$$P_{F2}=-\frac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot e_{iF}-\frac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}+W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}+\dot\varepsilon_{iF}(e_{iF}), \tag{54}$$

$$P_{F3}=\frac12\tilde W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}+\frac12\hat W_{iF}^T\hat\kappa'_{iF}\tilde\Lambda_{iF}^T\dot{\hat e}_{iF}. \tag{55}$$
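The regrouping into $P_{F1}$, $P_{F2}$, and $P_{F3}$ is pure bookkeeping, so it can be spot-checked numerically. With subscripts $iF$ dropped, $\tilde W=W-\hat W$, $\tilde\Lambda=\Lambda-\hat\Lambda$, and $\dot{\tilde e}=\dot e-\dot{\hat e}$, the sketch verifies that $-\hat W^T\hat\kappa'\hat\Lambda^T\dot{\hat e}$ equals $\tfrac12W^T\hat\kappa'\hat\Lambda^T\dot{\tilde e}+\tfrac12\hat W^T\hat\kappa'\Lambda^T\dot{\tilde e}-\tfrac12W^T\hat\kappa'\hat\Lambda^T\dot e-\tfrac12\hat W^T\hat\kappa'\Lambda^T\dot e+\tfrac12\tilde W^T\hat\kappa'\hat\Lambda^T\dot{\hat e}+\tfrac12\hat W^T\hat\kappa'\tilde\Lambda^T\dot{\hat e}$, that is, the $\dot{\tilde e}$ and $\dot e$ terms of $P_{F1}$ and $P_{F2}$ together with $P_{F3}$. The sizes $n=3$, $p=4$, the random entries, and the random stand-in for $\hat\kappa'$ are arbitrary assumptions.

```python
import random


def mat(rows, cols, rng):
    """Random rows x cols matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def term(w, kappa_p, lam, v):
    """W^T kappa' Lambda^T v for W (p x n), kappa' (p x p), Lambda (n x p), v (n x 1)."""
    return matmul(matmul(matmul(transpose(w), kappa_p), transpose(lam)), v)


rng = random.Random(0)
n, p = 3, 4
W, W_hat = mat(p, n, rng), mat(p, n, rng)
Lam, Lam_hat = mat(n, p, rng), mat(n, p, rng)
K_hat = mat(p, p, rng)  # random stand-in for kappa'_hat
e_dot, e_dot_hat = mat(n, 1, rng), mat(n, 1, rng)
W_t, Lam_t, e_dot_t = sub(W, W_hat), sub(Lam, Lam_hat), sub(e_dot, e_dot_hat)

lhs = [[-x for x in row] for row in term(W_hat, K_hat, Lam_hat, e_dot_hat)]
pieces = [(0.5, W, Lam_hat, e_dot_t), (0.5, W_hat, Lam, e_dot_t),
          (-0.5, W, Lam_hat, e_dot), (-0.5, W_hat, Lam, e_dot),
          (0.5, W_t, Lam_hat, e_dot_hat), (0.5, W_hat, Lam_t, e_dot_hat)]
rhs = [[0.0] for _ in range(n)]
for c, w, lam, v in pieces:
    t = term(w, K_hat, lam, v)
    rhs = [[r[0] + c * x[0]] for r, x in zip(rhs, t)]
max_err = max(abs(a[0] - b[0]) for a, b in zip(lhs, rhs))
print(max_err)
```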

According to Assumption 1, (48), and (50), the upper bounds of $P_{F1}$, $P_{F2}$, and $P_{F3}$ are

$$\|P_{F1}\|\le J_1\big(\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|\big)\,\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|,\qquad\|P_{F2}\|\le\varsigma_1,\qquad\|P_{F3}\|\le\varsigma_2. \tag{56}$$

Combining (53) and (54) with (55), we obtain

$$\big\|\dot P_{F2}+\dot P_{F3}\big\|\le\varsigma_3+\varsigma_4J_2\big(\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|\big)\,\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|, \tag{57}$$

where $\varphi_i(\tilde e_{iF}^T,E_i^T)=[\tilde e_{iF}^T\ E_i^T]^T$, $J_1(\cdot)$ and $J_2(\cdot)$ are globally invertible nondecreasing functions, and $\varsigma_i$ $(i=1,2,3,4)$ are computable positive constants.

Theorem 5. Consider the dynamics of the subsystem of the time-varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function as follows:

$$V_{iL}(\tilde e_{iF},E_i)=\frac12E_i^TE_i+\frac12\gamma\,\tilde e_{iF}^T\tilde e_{iF}+\varpi_i(t)+\phi_i(t). \tag{58}$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are defined as follows:

$$\begin{aligned}\dot\varpi_i(t)&=-\Big[E_i^T\big(P_{F2}-\beta_1\,\mathrm{sat}(\tilde e_{iF})\big)+\dot{\tilde e}_{iF}^TP_{F3}-\beta_2J_2\big(\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|\big)\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|\,\|\tilde e_{iF}\|\Big],\\\varpi_i(0)&=\beta_1\big|\tilde e_{iF}(0)\big|-\tilde e_{iF}^T(0)\big(P_{F2}(0)+P_{F3}(0)\big),\end{aligned} \tag{59}$$

$$\phi_i(t)=\frac{1}{4\alpha}\Big[\mathrm{tr}\big(\tilde W_{iF}^T\Gamma_{iWF}^{-1}\tilde W_{iF}\big)+\mathrm{tr}\big(\tilde\Lambda_{iF}^T\Gamma_{i\Lambda F}^{-1}\tilde\Lambda_{iF}\big)\Big], \tag{60}$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Define $d=[E_i^T\ \tilde e_{iF}^T\ \varpi_i^{1/2}\ \phi_i^{1/2}]^T$; $\beta_1,\beta_2\in\mathbb{R}$ are positive adaptation gains chosen to ensure $\varpi_i(t)\ge 0$. Then

$$U_1(d)\le V_{iL}(\tilde e_{iF},E_i)\le U_2(d), \tag{61}$$

where

$$U_1(d)=\frac12\min(1,\gamma)\|d\|^2,\qquad U_2(d)=\max(1,\gamma)\|d\|^2. \tag{62}$$
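The sandwich bound (61) with (62) can be sanity-checked by random sampling. The sketch assumes three-dimensional $E_i$ and $\tilde e_{iF}$ and draws nonnegative $\varpi_i$ and $\phi_i$ (illustrative choices). Since $d$ contains $\varpi_i^{1/2}$ and $\phi_i^{1/2}$, $\|d\|^2$ includes $\varpi_i+\phi_i$ directly; the bound then holds term by term, because $\frac12\min(1,\gamma)$ is no larger than any coefficient of $V_{iL}$ ($\frac12$, $\frac{\gamma}{2}$, $1$) and $\max(1,\gamma)$ is no smaller than any of them.

```python
import random


def lyapunov_v(E, e_t, varpi, phi, gamma):
    """V_iL in (58): 0.5*|E|^2 + 0.5*gamma*|e_tilde|^2 + varpi + phi."""
    return (0.5 * sum(x * x for x in E) + 0.5 * gamma * sum(x * x for x in e_t)
            + varpi + phi)


def bounds(E, e_t, varpi, phi, gamma):
    """U_1(d) and U_2(d) from (62) with d = [E, e_tilde, sqrt(varpi), sqrt(phi)]."""
    d_sq = sum(x * x for x in E) + sum(x * x for x in e_t) + varpi + phi
    return 0.5 * min(1.0, gamma) * d_sq, max(1.0, gamma) * d_sq


rng = random.Random(1)
ok = True
for _ in range(1000):
    gamma = rng.uniform(0.01, 10.0)
    E = [rng.uniform(-2, 2) for _ in range(3)]
    e_t = [rng.uniform(-2, 2) for _ in range(3)]
    varpi, phi = rng.uniform(0, 3), rng.uniform(0, 3)  # both nonnegative
    v = lyapunov_v(E, e_t, varpi, phi, gamma)
    u1, u2 = bounds(E, e_t, varpi, phi, gamma)
    ok = ok and (u1 <= v + 1e-12) and (v <= u2 + 1e-12)
print(ok)
```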

The time derivative of (58) is

$$\dot V_{iL}(\tilde e_{iF},E_i)=\nabla V_{iL}^T\,K\Big[\dot E_i^T\ \ \dot{\tilde e}_{iF}^T\ \ \tfrac12\varpi_i^{-1/2}\dot\varpi_i\ \ \tfrac12\phi_i^{-1/2}\dot\phi_i\Big]^T, \tag{63}$$

where $K[\cdot]$ denotes a Filippov set [27]. Thus, writing $\varphi_i$ for $\varphi_i(\tilde e_{iF}^T,E_i^T)$, $\dot V_{iL}(\tilde e_{iF},E_i)$ can be rewritten as

$$\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i)={}&\big[E_i^T\ \ \gamma\tilde e_{iF}^T\ \ 2\varpi_i^{1/2}\ \ 2\phi_i^{1/2}\big]\,K\Big[\dot E_i^T\ \ \dot{\tilde e}_{iF}^T\ \ \tfrac12\varpi_i^{-1/2}\dot\varpi_i\ \ \tfrac12\phi_i^{-1/2}\dot\phi_i\Big]^T\\
\le{}&E_i^T\Big(\tfrac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\tilde e}_{iF}+\tfrac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot{\tilde e}_{iF}-\hat W_{iF}^T\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^T\hat e_{iF}+\alpha\dot{\tilde e}_{iF}\\
&\qquad-\tfrac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot e_{iF}-\tfrac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}-\gamma\tilde e_{iF}-\dot{\hat W}_{iF}^T\hat\kappa_{iF}+W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}+\dot\varepsilon_{iF}(e_{iF})\\
&\qquad+\tfrac12\tilde W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}+\tfrac12\hat W_{iF}^T\hat\kappa'_{iF}\tilde\Lambda_{iF}^T\dot{\hat e}_{iF}-kE_i-\beta_1K[\mathrm{sat}(\tilde e_{iF})]\Big)\\
&-E_i^T\Big(-\tfrac12W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot e_{iF}-\tfrac12\hat W_{iF}^T\hat\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}+W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF}+\dot\varepsilon_{iF}(e_{iF})-\beta_1K[\mathrm{sat}(\tilde e_{iF})]\Big)\\
&-\dot{\tilde e}_{iF}^T\Big(\tfrac12\tilde W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}+\tfrac12\hat W_{iF}^T\hat\kappa'_{iF}\tilde\Lambda_{iF}^T\dot{\hat e}_{iF}\Big)+\gamma\tilde e_{iF}^T\big(E_i-\alpha\tilde e_{iF}\big)\\
&+\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\,\|\tilde e_{iF}\|-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde W_{iF}^T\Gamma_{iWF}^{-1}\dot{\hat W}_{iF}\big)-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde\Lambda_{iF}^T\Gamma_{i\Lambda F}^{-1}\dot{\hat\Lambda}_{iF}\big).
\end{aligned}\tag{64}$$

Substituting (53), (54), and (55) into (64), with $\varphi_i$ short for $\varphi_i(\tilde e_{iF}^T,E_i^T)$, we obtain

$$\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i)={}&E_i^T\big(P_{F1}+P_{F2}+P_{F3}-\beta_1K[\mathrm{sat}(\tilde e_{iF})]-kE_i-\gamma\tilde e_{iF}\big)+\gamma\tilde e_{iF}^T\big(E_i-\alpha\tilde e_{iF}\big)\\
&-E_i^T\big(P_{F2}-\beta_1K[\mathrm{sat}(\tilde e_{iF})]\big)-\dot{\tilde e}_{iF}^TP_{F3}+\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\,\|\tilde e_{iF}\|\\
&-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde W_{iF}^T\Gamma_{iWF}^{-1}\dot{\hat W}_{iF}\big)-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde\Lambda_{iF}^T\Gamma_{i\Lambda F}^{-1}\dot{\hat\Lambda}_{iF}\big)\\
={}&-\alpha\gamma\tilde e_{iF}^T\tilde e_{iF}-kE_i^TE_i+E_i^TP_{F1}+\big(E_i^T-\dot{\tilde e}_{iF}^T\big)P_{F3}+\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\,\|\tilde e_{iF}\|\\
&-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}\tilde e_{iF}^T\big)-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde\Lambda_{iF}^T\dot{\hat e}_{iF}\tilde e_{iF}^T\hat W_{iF}^T\hat\kappa'_{iF}\big)\\
\le{}&-k_1\|\tilde e_{iF}\|^2-k_2\|E_i\|^2+\frac{J_1(\|\varphi_i\|)^2}{4k_3}\|\varphi_i\|^2+\frac{\beta_2^2J_2(\|\varphi_i\|)^2}{4\alpha k_4}\|\varphi_i\|^2,
\end{aligned}\tag{65}$$

where $k_{\min}=\min\{k_1,k_2\}$, $\xi=\min\{k_3,\alpha k_4/\beta_2^2\}$, and $J(\|\varphi_i\|)^2=J_1(\|\varphi_i\|)^2+J_2(\|\varphi_i\|)^2$ with $\varphi_i=\varphi_i(\tilde e_{iF}^T,E_i^T)$. Hence,

$$\dot V_{iL}(\tilde e_{iF},E_i)\le-k_{\min}\|\varphi_i\|^2+\frac{J(\|\varphi_i\|)^2\|\varphi_i\|^2}{4\xi}\le-c\|\varphi_i\|^2. \tag{66}$$
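To see how the radius in (67) arises, note that the first bound in (66) is negative exactly when $J(\|\varphi_i\|)^2<4k_{\min}\xi$, that is, when $\|\varphi_i\|<J^{-1}(2\sqrt{k_{\min}\xi})$ for an increasing $J$. The sketch below uses the hypothetical choice $J(s)=1+s$ and illustrative constants $k_{\min}=\xi=2$, which give the radius $3$; neither is taken from the paper.

```python
import math


def derivative_bound(s, k_min, xi, J):
    """First right-hand side in (66) evaluated at ||phi_i|| = s."""
    return -k_min * s * s + J(s) ** 2 * s * s / (4.0 * xi)


def region_radius(k_min, xi, J_inv):
    """Radius J^{-1}(2*sqrt(k_min*xi)) used in the region D of (67)."""
    return J_inv(2.0 * math.sqrt(k_min * xi))


def J(s):
    """Hypothetical globally invertible nondecreasing function."""
    return 1.0 + s


def J_inv(y):
    """Inverse of the hypothetical J."""
    return y - 1.0


radius = region_radius(k_min=2.0, xi=2.0, J_inv=J_inv)  # 2*sqrt(4) - 1 = 3
print(radius, derivative_bound(2.9, 2.0, 2.0, J), derivative_bound(3.1, 2.0, 2.0, J))
```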

Therefore, for a positive constant $c$, $-c\|\varphi_i(\tilde e_{iF}^T,E_i^T)\|^2$ is a negative semidefinite function on the adjustable region $D$:

$$D=\Big\{d(t)\ \Big|\ \|d\|\le J^{-1}\big(2\sqrt{k_{\min}\xi}\big)\Big\}, \tag{67}$$

so Lyapunov stability theory shows that the system is stable.

To make the subsystem of the time-varying constrained reconfigurable modular robot track the desired trajectory asymptotically, this paper designs a novel decentralized reinforcement learning robust optimal tracking controller, in which a robust term compensates for the neural network approximation errors. The robust control term is designed as

$$u_{irb}=\frac{N_{rb}\,e_i}{e_i^Te_i+\zeta}. \tag{68}$$

In the equation above, $\zeta>0$ is a constant, and $N_{rb}$ satisfies

$$\begin{aligned}N_{rb}&\ge\left[\frac{\delta_{hi}^2}{2n_1}+\frac{n_1\big(-\varepsilon_{ia}(e_i)-\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\big)^2}{2n_2}+n_1n_2\|b_i(e_i)\|^2\big(-\varepsilon_{ia}(e_i)-\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\big)^2\right]\frac{e_i^Te_i+\zeta}{2n_1n_2\|b_i(e_i)\|\,e_i^Te_i}\\&\ge\Big[n_1^2\big(-\varepsilon_{ia}(e_i)-\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\big)^2+n_2\delta_{hi}^2+2n_1^2n_2^2\|b_i(e_i)\|^2\big(-\varepsilon_{ia}(e_i)-\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\big)^2\Big]\frac{e_i^Te_i+\zeta}{4n_1^2n_2^2\|b_i(e_i)\|\,e_i^Te_i}.\end{aligned}\tag{69}$$

Therefore, the global control law can be designed as follows:

$$u_{\mathrm{mix}}=\hat u_i+u_{irb}=-\frac12R^{-1}b_i^T(e_i)\,S_{ia}(e_i)^T\hat W_{ia}+\frac{N_{rb}\,e_i}{e_i^Te_i+\zeta}. \tag{70}$$
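A minimal sketch of the robust term (68) and the mixed law (70) follows. It assumes, as the sum in (70) implicitly does, that $e_i$ and $\hat u_i$ have the same dimension, and it takes $\hat u_i$ as a precomputed input (for example from (34)); the function names and numbers are illustrative. A property useful for tuning is $\|u_{irb}\|=N_{rb}\|e_i\|/(\|e_i\|^2+\zeta)\le N_{rb}/(2\sqrt\zeta)$, with equality at $\|e_i\|=\sqrt\zeta$, so the robust term stays bounded and vanishes as $e_i\to0$.

```python
def robust_term(e, n_rb, zeta):
    """Robust term (68): u_rb = N_rb * e / (e^T e + zeta), element-wise over e."""
    ee = sum(x * x for x in e)
    return [n_rb * x / (ee + zeta) for x in e]


def mixed_control(u_hat, e, n_rb, zeta):
    """Global law (70): u_mix = u_hat + u_rb (u_hat and e have equal length here)."""
    return [a + b for a, b in zip(u_hat, robust_term(e, n_rb, zeta))]


# With zeta = 0.04 and N_rb = 1 the bound N_rb / (2*sqrt(zeta)) equals 2.5,
# attained at |e| = sqrt(zeta) = 0.2.
print(robust_term([0.2], 1.0, 0.04), mixed_control([0.5], [0.2], 1.0, 0.04))
```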

Theorem 6. Consider the dynamics of the subsystem of the time-varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the actual output tracks the desired trajectory asymptotically.

Proof. Design the Lyapunov function as follows:

$$V_{iu}(e_i,u_{\mathrm{mix}})=\frac{1}{2n_1}\mathrm{tr}\{\tilde W_{ic}^T\tilde W_{ic}\}+\frac{n_1}{2n_2}\mathrm{tr}\{\tilde W_{ia}^T\tilde W_{ia}\}+n_1n_2\Big[e_{iF}^Te_{iF}+\Xi\int_0^{\infty}r_i(e_i,u_{\mathrm{mix}})\,d\tau\Big], \tag{71}$$

where $\Xi>0$ is an undetermined parameter and $t\le\tau<\infty$. The time derivative of (71) is

$$\begin{aligned}
\dot V_{iu}(e_i,u_{\mathrm{mix}})={}&\frac{1}{2n_1}\mathrm{tr}\{\tilde W_{ic}^T\dot{\tilde W}_{ic}\}+\frac{n_1}{2n_2}\mathrm{tr}\{\tilde W_{ia}^T\dot{\tilde W}_{ia}\}+n_1n_2\big[e_{iF}^T\dot e_{iF}+\Xi r_i(e_i,u_{\mathrm{mix}})\big]\\
={}&\frac{1}{2n_1}\mathrm{tr}\Big\{\tilde W_{ic}^T\big(-n_1L_i(L_i^T\tilde W_{ic}+\delta_{hi})\big)\Big\}\\
&+\frac{n_1}{2n_2}\mathrm{tr}\Big\{\tilde W_{ia}^T\Big[-n_2S_{ia}(e_i)\Big(\tilde W_{ia}^TS_{ia}(e_i)-\varepsilon_{ia}(e_i)+\frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T\tilde W_{ic}-\frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Big)\Big]\Big\}\\
&+n_1n_2e_{iF}^T\big(W_{iF}^T\kappa(\Lambda_{iF}^Te_i)+\varepsilon_{iF}(e_i)+b_i(e_i)u_{\mathrm{mix}}\big)+\Xi\big(e_i^TQ_ee_i+u_{\mathrm{mix}}^TRu_{\mathrm{mix}}\big)\\
\le{}&-\Big(L_{im}^2-\frac{n_1}{2}L_{iM}^2\Big)\|\tilde W_{ic}\|^2+\frac{1}{2n_1}\delta_{hi}^2-\Big(n_1S_{iam}^2-\frac34n_1n_2S_{iaM}^2\Big)\|\tilde W_{ia}\|^2\\
&+\frac{n_1}{4n_2}\|R^{-1}\|^2\|b_i(e_i)\|^2\|\nabla S_{icM}\|^2\|\tilde W_{ic}\|^2\\
&+\frac{n_1}{2n_2}\Big(\varepsilon_{ia}(e_i)+\frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Big)^T\Big(\varepsilon_{ia}(e_i)+\frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Big)\\
&+n_1n_2\|b_i(e_i)\|^2\varepsilon_{ia}(e_i)^T\varepsilon_{ia}(e_i)+n_1n_2S_{iaM}^2\|b_i(e_i)\|^2\|\tilde W_{ia}\|^2\\
&+n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2+n_1n_2\big(\|b_i(e_i)\|^2-\Xi\lambda_{\min}(R)\big)\|u_{\mathrm{mix}}\|^2.
\end{aligned}\tag{72}$$

If the following inequalities hold (the first two are the Rayleigh-quotient bounds, which are satisfied by any symmetric positive definite $Q_e$ and $R$),

$$
\begin{aligned}
&\lambda_{\min}(Q_e)\left\|e_{iF}\right\|_2^{2} \le e_{iF}^{T}Q_e e_{iF} \le \lambda_{\max}(Q_e)\left\|e_{iF}\right\|_2^{2}, \\
&\lambda_{\min}(R)\left\|u_{\text{mix}}\right\|_2^{2} \le u_{\text{mix}}^{T}R\,u_{\text{mix}} \le \lambda_{\max}(R)\left\|u_{\text{mix}}\right\|_2^{2}, \\
&\Xi > \frac{\left\|b_i(e_i)\right\|^{2}}{\lambda_{\min}(R)},
\end{aligned}
\tag{73}
$$

then the time derivative of the Lyapunov function, denoted here $\dot{L}_{iu}(e_i,u_{\text{mix}})$, can be further bounded as

$$
\begin{aligned}
\dot{L}_{iu}(e_i,u_{\text{mix}}) &\le -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2} - \frac{n_1}{4n_2}\left\|R^{-1}\right\|^{2}\left\|b_i(e_i)\right\|^{2}\nabla S_{icM}^{2}\right)\left\|\tilde{W}_{ic}\right\|^{2} \\
&\quad - \left(n_1 S_{iam}^{2} - \frac{3}{4}n_1 n_2 S_{iaM}^{2} - n_1 n_2 S_{iaM}^{2}\left\|b_i(e_i)\right\|^{2}\right)\left\|\tilde{W}_{ia}\right\|^{2} \\
&\quad - n_1 n_2\left(\left\|b_i(e_i)\right\|^{2} + \Xi\,\lambda_{\min}(R)\right)\left\|u_{\text{mix}}\right\|^{2} - n_1 n_2\,\Xi\,\lambda_{\min}(Q_e)\left\|e_{iF}\right\|^{2} \\
&\le -n_1 n_2\,\Xi\,\lambda_{\min}(Q_e)\left\|e_{iF}\right\|^{2}
\end{aligned}
\tag{74}
$$

Therefore, $\dot{L}_{iu}(e_i,u_{\text{mix}}) < 0$ whenever $e_{iF} \neq 0$.
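Condition (73) combines two Rayleigh-quotient bounds with a lower bound on the weighting factor $\Xi$. The Python sketch below, assuming NumPy and purely illustrative matrices (the values of $Q_e$, $R$, $b_i(e_i)$, $n_1$, $n_2$ are hypothetical, not taken from the paper), computes the smallest admissible $\Xi$ and checks that the $\|u_{\text{mix}}\|^2$ coefficient $n_1 n_2(\|b_i(e_i)\|^2 - \Xi\lambda_{\min}(R))$ appearing in (72) is then negative. Taking $\|b_i(e_i)\|$ as the spectral norm is an assumption of this sketch.

```python
import numpy as np

def eig_bounds(M):
    """Return (lambda_min, lambda_max) of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(M)  # eigenvalues in ascending order
    return eigvals[0], eigvals[-1]

def xi_threshold(b, R):
    """Lower bound on Xi from (73): ||b||^2 / lambda_min(R), spectral norm assumed."""
    lam_min, _ = eig_bounds(R)
    return np.linalg.norm(b, 2) ** 2 / lam_min

def u_mix_coefficient(b, R, Xi, n1, n2):
    """Coefficient of ||u_mix||^2 in (72): n1*n2*(||b||^2 - Xi*lambda_min(R))."""
    lam_min, _ = eig_bounds(R)
    return n1 * n2 * (np.linalg.norm(b, 2) ** 2 - Xi * lam_min)

# Hypothetical values chosen only for illustration.
Q_e = np.diag([10.0, 1.0])
R = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([[1.0, 0.0], [0.0, 0.8]])
n1, n2 = 1.0, 1.0

# The Rayleigh-quotient bounds in (73) hold for every error vector.
rng = np.random.default_rng(0)
q_min, q_max = eig_bounds(Q_e)
for e in rng.normal(size=(1000, 2)):
    quad = e @ Q_e @ e
    assert q_min * (e @ e) - 1e-12 <= quad <= q_max * (e @ e) + 1e-12

Xi = 1.1 * xi_threshold(b, R)  # any value strictly above the threshold
print(f"threshold = {xi_threshold(b, R):.4f}, Xi = {Xi:.4f}")
print("u_mix coefficient:", u_mix_coefficient(b, R, Xi, n1, n2))  # negative
```

Choosing $\Xi$ just above the threshold keeps the $u_{\text{mix}}$ term non-positive, which is what the final step of the bound relies on.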

4. Simulations

To verify the validity and convergence of the proposed decentralized reinforcement learning robust optimal tracking control method based on ACI, and to study the convergence of the error by comparing simulation results, two different configurations of the time-varying externally constrained reconfigurable modular robot are used in this paper, as shown in Figures 3 and 4.

To facilitate the analysis, the configurations above are transformed into the analytic charts shown in Figures 5 and 6, where $L_1$, $L_2$, and $L_4$ are the lengths of the links and $L_3$ is the distance between the time-varying constraint joint and the base module.

The time-varying constraint is modeled as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are as follows:

$$
\begin{aligned}
\Psi_A(q,t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right], \\
\Psi_B(q,t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right].
\end{aligned}
\tag{75}
$$

In the equation above, the angle $\alpha(t)$ between the time-varying constraint and the $X$-axis is defined as

$$
\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2}. \tag{76}
$$
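The constraint functions (75) and the constraint angle (76) can be evaluated directly. The sketch below is a minimal Python/NumPy illustration, assuming the argument of the sine in (76) is $t/2$; the link lengths `L` are hypothetical, since numerical values of $L_1, \dots, L_4$ are not listed in this section. Because $\alpha(t)$ stays between roughly $2.16$ and $2.56$ rad, $\sin\alpha(t)$ never vanishes and the cotangent is well defined.

```python
import numpy as np

def alpha(t):
    """Constraint angle of (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * np.pi + 0.2 * np.sin(t / 2.0)

def cot(x):
    return np.cos(x) / np.sin(x)

def psi_A(q, t, L):
    """Constraint function of configuration A in (75); L = (L1, L2, L3, L4)."""
    L1, L2, L3, L4 = L
    return L1 * np.cos(q[0]) + L2 * np.cos(q[1]) - (L3 + L4 * cot(alpha(t)))

def psi_B(q, t, L):
    """Constraint function of configuration B in (75); q[0] does not appear."""
    L1, L2, L3, L4 = L
    return L1 + L2 * np.cos(q[1]) - (L3 + L4 * cot(alpha(t)))

# Hypothetical link lengths, for illustration only.
L = (0.3, 0.6, 0.3, 0.2)
for t in np.linspace(0.0, 10.0, 5):
    print(f"t={t:4.1f}  alpha={alpha(t):.4f}  "
          f"psi_A={psi_A(np.array([0.5, 1.5]), t, L):+.4f}  "
          f"psi_B={psi_B(np.array([0.0, 1.5]), t, L):+.4f}")
```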

The initial joint positions are $q_1(0) = 2$ and $q_2(0) = 2$ in configuration A and $q_1(0) = 2$ and $q_2(0) = 2$ in configuration B, and the initial joint velocities are zero. The dynamic models of configurations A and B are as follows:

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos(q_2) + 0.6066 & 0.18\cos(q_2) + 0.1233 \\ 0.18\cos(q_2) + 0.1233 & 0.1233 \end{bmatrix}, \\
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^{2}(q_2) & -0.06\cos(q_2) \\ -0.06\cos(q_2) & 0.1233 \end{bmatrix}, \\
C_A(q,\dot q) &= \begin{bmatrix} -0.36\sin(q_2)\,\dot q_2 & -0.18\sin(q_2)\,\dot q_2 \\ 0.18\sin(q_2)\left(\dot q_1 - \dot q_2\right) & 0.18\sin(q_2)\,\dot q_1 \end{bmatrix}, \\
C_B(q,\dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot q_2 & 0.06\sin(q_2)\,\dot q_2 \\ 0.06\sin(q_2)\,\dot q_2 & 0 \end{bmatrix}, \\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin(q_1) \\ -5.88\sin(q_1 + q_2) \end{bmatrix}, \\
G_B(q) &= \begin{bmatrix} 0 \\ -5.88\cos(q_2) \end{bmatrix}, \\
F_A(q,\dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1) \\ 1.2\,\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2) \end{bmatrix}, \\
F_B(q,\dot q) &= \begin{bmatrix} 0 \\ 1.5\,\dot q_2 + \sin(q_2) + 1.2\,\mathrm{sgn}(\dot q_2) \end{bmatrix}.
\end{aligned}
\tag{77}
$$
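A minimal Python/NumPy sketch of the configuration A plant in (77) is given below. It solves $M_A(q)\ddot q = u - C_A(q,\dot q)\dot q - G_A(q) - F_A(q,\dot q)$ for the joint accelerations. The constraint force acting through the time-varying constraint is omitted here (an assumption of this sketch), so this is only the unconstrained plant; configuration B follows the same pattern with $M_B$, $C_B$, $G_B$, $F_B$. The Euler loop is a crude illustration of how the model could be stepped forward, not the integrator used for the reported simulations.

```python
import numpy as np

def M_A(q):
    c2 = np.cos(q[1])
    return np.array([[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
                     [0.18 * c2 + 0.1233, 0.1233]])

def C_A(q, dq):
    s2 = np.sin(q[1])
    return np.array([[-0.36 * s2 * dq[1], -0.18 * s2 * dq[1]],
                     [0.18 * s2 * (dq[0] - dq[1]), 0.18 * s2 * dq[0]]])

def G_A(q):
    s12 = np.sin(q[0] + q[1])
    return np.array([-5.88 * s12 - 17.64 * np.sin(q[0]), -5.88 * s12])

def F_A(q, dq):
    return np.array([dq[0] + 10.0 * np.sin(3 * q[0]) + 2.0 * np.sign(dq[0]),
                     1.2 * dq[1] + 5.0 * np.sin(2 * q[1]) + np.sign(dq[1])])

def joint_acceleration_A(q, dq, u):
    """Solve M(q) qdd = u - C(q, dq) dq - G(q) - F(q, dq); constraint force omitted."""
    rhs = u - C_A(q, dq) @ dq - G_A(q) - F_A(q, dq)
    return np.linalg.solve(M_A(q), rhs)

# Usage: 100 explicit Euler steps from q(0) = (2, 2), zero velocity, zero torque.
q, dq, dt = np.array([2.0, 2.0]), np.zeros(2), 1e-3
for _ in range(100):
    qdd = joint_acceleration_A(q, dq, np.zeros(2))
    q, dq = q + dt * dq, dq + dt * qdd
print(q, dq)
```

Using `np.linalg.solve` rather than forming $M_A^{-1}$ explicitly is the usual, numerically safer way to apply the inverse inertia matrix.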

The desired trajectories of configurations A and B are given as follows.

Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos(t) + 0.2\sin(3t), \\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\left(\alpha(t) - y_{1d}\right) - L_3\sin\left(\alpha(t)\right)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{78}
$$

Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$
\begin{aligned}
y_{1d} &= 0, \\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\left(\alpha(t)\right) - L_3\sin\left(\alpha(t)\right)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{79}
$$
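The desired trajectories (78) and (79) are closed-form functions of time once the link lengths are fixed. The Python/NumPy sketch below evaluates them, assuming the reading $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$ of (76); the link lengths are hypothetical and were chosen so that the argument of $\arcsin$ stays inside $[-1, 1]$, which the function checks explicitly because other geometries could leave that domain.

```python
import numpy as np

def alpha(t):
    """Constraint angle of (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * np.pi + 0.2 * np.sin(t / 2.0)

def _theta(arg, a):
    """arcsin[arg] + alpha(t), rejecting arguments outside arcsin's domain."""
    if np.any(np.abs(arg) > 1.0):
        raise ValueError("link lengths push the arcsin argument outside [-1, 1]")
    return np.arcsin(arg) + a

def desired_A(t, L1, L2, L3):
    """Desired joint trajectories of configuration A, eq. (78)."""
    t = np.asarray(t, dtype=float)
    a = alpha(t)
    y1d = 0.5 * np.cos(t) + 0.2 * np.sin(3 * t)
    y2d = _theta((L1 * np.sin(a - y1d) - L3 * np.sin(a)) / L2, a)
    return y1d, y2d

def desired_B(t, L1, L2, L3):
    """Desired joint trajectories of configuration B, eq. (79); joint 1 is held at 0."""
    t = np.asarray(t, dtype=float)
    a = alpha(t)
    y1d = np.zeros_like(t)
    y2d = _theta((L1 * np.sin(a) - L3 * np.sin(a)) / L2, a)
    return y1d, y2d

# Hypothetical link lengths (not from the paper).
t = np.linspace(0.0, 10.0, 1001)
y1A, y2A = desired_A(t, L1=0.3, L2=0.6, L3=0.3)
y1B, y2B = desired_B(t, L1=0.3, L2=0.6, L3=0.2)
print(y2A.min(), y2A.max(), y2B.min(), y2B.max())
```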

Because of the dynamic external constraint, the variation of joint 1 in configuration B is zero.

To confirm that the adopted method can be applied to different configurations, and to verify how well the subsystems track their desired trajectories under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, this section presents comparative simulations of the classic RBF neural network control method and the ACI-based decentralized reinforcement learning robust optimal tracking control method.

Figures 7 to 14 show the joint tracking and error curves obtained when a classic RBF neural network [4] is used to compensate for the dynamic nonlinear term and the interconnection term in the subsystem. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training and parameter-adaptation process. The error curves in Figures 11 to 14 show that, when the time-varying constraints are present, the classical neural control method cannot adequately compensate for the joint subsystem constraints of the reconfigurable modular robot. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness and the tracking errors grow.

Figure 5: The analytic chart of configuration A (links $L_1$ to $L_4$, joint angles $q_1$ and $q_2$, constraint angle $\alpha$, axes $X$ and $Y$).

Figure 6: The analytic chart of configuration B (links $L_1$ to $L_4$, joint angles $q_1$ and $q_2$, constraint angle $\alpha$, axes $X$ and $Y$).

Figures 15 to 22 show the joint tracking and error curves obtained when ACI is used to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in the subsystem. Figures 15 to 18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the identification ability of ACI, which identifies the uncertainties contained in the subsystems in a short time. The error curves in Figures 19 to 22 show that the tracking error is very small and converges to zero in a short time. In addition, the joint subsystems remain robust when the proposed decentralized control method is adopted.
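The comparison above quotes convergence times of about 2 s and 0.5 s without stating a criterion. One common way to make such a number reproducible is a settling-time definition: the earliest time after which the absolute tracking error stays inside a tolerance band for the rest of the record. The Python/NumPy sketch below implements that definition; the tolerance value and the exponential test signals are assumptions for illustration, not data from the paper's simulations.

```python
import numpy as np

def convergence_time(t, e, tol):
    """Earliest sample time after which |e| stays within tol for the rest of the record.

    Returns np.inf if the last sample is still outside the band.
    """
    t = np.asarray(t, dtype=float)
    outside = np.flatnonzero(np.abs(np.asarray(e, dtype=float)) > tol)
    if outside.size == 0:
        return t[0]          # inside the band from the start
    last = outside[-1]
    if last == t.size - 1:
        return np.inf        # never settles within the recorded window
    return t[last + 1]

# Synthetic error signals (not the paper's data): a fast and a slow exponential decay.
t = np.linspace(0.0, 10.0, 10001)
fast = 0.1 * np.exp(-5.0 * t)
slow = 0.1 * np.exp(-1.0 * t)
print(convergence_time(t, fast, 1e-3), convergence_time(t, slow, 1e-3))
```

Because the result depends on the chosen band, a band should be reported together with any convergence time measured this way.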

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. These figures show that the proposed decentralized control method fully satisfies the accessibility requirement of the reconfigurable modular robot, and no singular displacements of the joint subsystems or the end-effector appear. The parameters used in ACI are listed in Table 1.

Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

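Table 1: Parameter list of action-critic-identifier.

| Parameter | $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
|---|---|---|---|---|---|---|---|---|---|
| Value | 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |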

The simulation results show that, compared with the classic RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time-varying constrained reconfigurable modular robot. In each configuration, the joint variables track the desired trajectory within a very short time, and the error fluctuates within a minimal range as it converges.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error in rad versus time in s).

Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error in rad versus time in s).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, by combining ACI with an RBF neural network, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time-varying constrained reconfigurable modular robots. The theory addresses the problem of finding a continuous-time nonlinear optimal control policy for strongly coupled, uncertain robotic systems. First, we built the model of the subsystem with time-varying external constraints and described the global robot system as a synthesis of interconnected subsystems. Second, ACI was used to estimate the HJB equation and the global uncertainty: a continuous-time optimal $Q$-function replaces the traditional optimal value function, the optimal $Q$-function and the optimal control policy are approximated by critic-NN and action-NN, respectively, and the global uncertainty is identified by the identifier. Third, we designed a novel decentralized robust optimal tracking controller so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions were designed to confirm the stability of ACI and of the subsystem. Finally, to confirm the superiority of the proposed control theory, comparative simulation examples were presented for two different configurations of the robot.

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

In future work, more complex configurations of time-varying externally constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 20: Tracking error curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 21: Tracking error curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 22: Tracking error curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 23: 3D tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, “Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode,” INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, “Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control,” Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, “Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic,” Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, “Decentralized adaptive neural network control for reconfigurable manipulators,” in Proceedings of the Chinese Control and Decision Conference (CCDC ’10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, “A new distributed control scheme of modular and reconfigurable robots,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA ’07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, “Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator,” Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, “Lebesgue-sampling-based optimal control problems with time aggregation,” IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, “Efficient reinforcement learning using recursive least-squares methods,” Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, “Homogenous polynomially parameter-dependent $H_\infty$ filter designs of discrete-time fuzzy systems,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, “Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator,” in The International Joint Conference on Neural Network, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, “Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, “Neural network control of a robot interacting with an uncertain viscoelastic environment,” IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, “A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm,” in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM ’11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, “A single network adaptive critic-based redundancy resolution scheme for robot manipulators,” IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, “Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems,” IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, “Deterministic learning based adaptive network control of robot in task space,” Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, “Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure,” IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, “A calculus for computing Filippov’s differential inclusion with application to the variable structure control of robot manipulators,” IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


Figure 1: The architecture of the time varying constrained reconfigurable modular robot system (subsystems $1, \dots, n$, each described by $M_i\ddot q_i + C_i\dot q_i + G_i + F_i + Z_i - f_i = u_i$ and coupled through the interconnection terms $Z_i(q,\dot q,\ddot q)$).

where $x_i$ is the state vector of subsystem $S_i$, $y_i$ is the output of subsystem $S_i$, and $h_i(q,\dot q,\ddot q)$ is the interconnection term of the subsystem. The terms $f(x_i,u_i)$ and $h_i(q,\dot q,\ddot q)$ are defined as

$$
\begin{aligned}
f(x_i,u_i) &= M_i^{-1}(q_i)\left[C_i(q_i,\dot q_i)\,\dot q_i + G_i(q_i) + F_i(q_i,\dot q_i) - u_i\right], \\
h_i(q,\dot q,\ddot q) &= -M_i^{-1}(q_i)\,Z_i(q,\dot q,\ddot q).
\end{aligned}
\tag{17}
$$

For the time-varying constrained reconfigurable modular robot system, we need to design a decentralized robust optimal tracking control policy that makes each subsystem track its desired trajectory while keeping the tracking error bounded and convergent.

3. Decentralized Reinforcement Learning Robust Optimal Tracking Control Based on ACI and $Q$-Function

Assumption 1. The desired trajectories $y_{id}$, $\dot y_{id}$, $\ddot y_{id}$ and the input gain matrix $b_i(x_i)$ are bounded.

Then (16) can be transformed into the following form:

$$
S_i:\;
\begin{cases}
\dot x_{i1} = x_{i2}, \\
\dot x_{i2} = -\left[F(x_i,u_i) + h_i(q,\dot q,\ddot q) + f_i\right] + b_i(x_i)\,u_i, \\
y_i = x_{i1},
\end{cases}
\tag{18}
$$

where $F(x_i,u_i) = f(x_i,u_i) + b_i(x_i)\,u_i$.

Assumption 2. The interconnection terms are bounded, satisfying

$$
\left|h_i(q,\dot q,\ddot q)\right| \le \delta_{i0} + \sum_{j=1}^{n}\delta_{ij}\left(\left|s_{ij}\right|\right), \tag{19}
$$

where $\delta_{i0} > 0$ is an unknown constant and $\delta_{ij}(|s_{ij}|) \ge 0$ is an unknown smooth Lipschitz function.

The trajectory tracking error of the joint subsystem can be defined as

$$
e_i(t) = x_i - y_{id}. \tag{20}
$$

For the continuous-time state equation of the subsystem in (18), with its nonlinear function and interconnection terms, the value function can generally be defined as

$$
V_i^{u_i(e_i(t))}\left(e_i(t)\right) = \int_{0}^{\infty} r_i\left(e_i(t), u_i\left(e_i(t)\right)\right)dt. \tag{21}
$$

To simplify notation, we write $e_i$ and $u_i$ instead of $e_i(t)$ and $u_i(e_i(t))$. Since the trajectory $y_{id}$ relies on the subsystem control $u_i$ for updating, to avoid the infinite result produced by (21) we transform the value function into the following form:

$$
V_i^{u_i}(e_i) = \int_{0}^{\infty} r_i\left(e_i(\tau), u_i\left(e_i(\tau)\right)\right)d\tau, \quad t \le \tau < \infty. \tag{22}
$$

Thus, the optimal value function of the subsystem can be defined as follows:

$$
V_i^{*}(e_i) = \min_{\substack{u_i \\ t \le \tau < \infty}} \int_{0}^{\infty} r_i\left(e_i(\tau), u_i\left(e_i(\tau)\right)\right)d\tau. \tag{23}
$$

Here $r_i(e_i,u_i)$ represents the reward function for the current state, given by

$$
r_i(e_i,u_i) = e_i^{T}Q_e e_i + u_i^{T}R\,u_i, \tag{24}
$$

where $Q_e$ and $R$ are positive definite matrices.
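The reward (24) and the cost integral in (22) can be evaluated numerically along a sampled trajectory. The Python/NumPy sketch below computes $r_i$ and approximates the integral with the trapezoidal rule over a finite horizon; truncating the infinite upper limit is an approximation introduced for this sketch, and the exponentially decaying test error is a hypothetical signal whose exact infinite-horizon cost, $\int_0^\infty e^{-2t}\,dt = 1/2$, is known.

```python
import numpy as np

def reward(e, u, Q_e, R):
    """Quadratic reward of (24): r_i(e_i, u_i) = e^T Q_e e + u^T R u."""
    return float(e @ Q_e @ e + u @ R @ u)

def truncated_cost(t, E, U, Q_e, R):
    """Trapezoidal estimate of the cost integral in (22) over the finite grid t.

    E and U hold one error sample and one control sample per row, aligned with t.
    """
    r = np.array([reward(e, u, Q_e, R) for e, u in zip(E, U)])
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(t)))

# Usage: e(t) = exp(-t) * [1, 0] with zero control.
t = np.linspace(0.0, 10.0, 2001)
E = np.exp(-t)[:, None] * np.array([1.0, 0.0])
U = np.zeros((t.size, 1))
print(truncated_cost(t, E, U, np.eye(2), np.array([[2.0]])))  # close to 0.5
```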

Typically, recording the value of state-action pairs is more useful than recording the value of states alone, since state-action pairs predict the reward. A low reward for a state does not imply a low value for the corresponding state-action pair: if the subsystem state produces higher rewards over a period of time, the pair can still obtain a higher state-action value. Therefore, from a long-term perspective, defining a suitable state-action value function ($Q$-function) allows actions to produce more reward [21, 22].

According to (23) and (24), the continuous-time optimal $Q$-function can be defined as

$$
\begin{aligned}
Q_i^{*}(e_i,a_i,u_i) &= r_i(e_i,a_i,u_i) + V_i^{*}(e_i,u_i) \\
&= r_i(e_i,a_i,u_i) + \min_{\substack{u_i \\ t \le \tau < \infty}} \int_{0}^{\infty} r_i\left(e_i(\tau), u_i\left(e_i(\tau)\right)\right)d\tau.
\end{aligned}
\tag{25}
$$

Assumption 3. The partial derivatives of $Q_i^{*}$ and $r_i(e_i,a_i,u_i)$ exist and are continuous in the domain. According to (18) and (24), under the control policy $u_i$, the optimal $Q$-function satisfies the following Hamilton-Jacobi-Bellman (HJB) equation [23]:

$$
\begin{aligned}
\mathrm{HJB}_i\left(e_i,u_i,\nabla Q_i^{*}\right)
&= \min_{u_i(e_i)}\left[r_i(e_i,a_i,u_i) + \nabla Q_i^{*}\left(-F(e_i,u_i) - h_i(e,\dot e,\ddot e) - f_i + b_i(e_i)\,u_i\right)\right] \\
&= \min_{u_i(e_i)}\left[r_i(e_i,a_i,u_i) + \nabla Q_i^{*}\,\Phi_i(e_i,u_i)\right],
\end{aligned}
\tag{26}
$$

where $\Phi_i(e_i,u_i) = -F(e_i,u_i) - h_i(e,\dot e,\ddot e) - f_i + b_i(e_i)\,u_i$ denotes the global uncertainty, which includes the unknown dynamics of the subsystem and the interconnection term, and $\nabla Q_i^{*} = \partial Q_i^{*}/\partial e_i$ denotes the gradient of the optimal $Q$-function.

Lemma 4 (see [24]). Consider the dynamics of the subsystem of the time-varying constrained reconfigurable modular robot in (14). For the minimum of the HJB equation (26) to possess a stationary point with respect to $u_i$, the optimal $Q$-function and the optimal control policy must satisfy the following conditions:

(1) $\partial\,\mathrm{HJB}_i(e_i,u_i,\nabla Q_i)/\partial u_i = 0$;

(2) $\partial^{2}\mathrm{HJB}_i(e_i,u_i,\nabla Q_i)/(\partial u_i \times \partial u_i^{T}) \ge 0$.

The necessary conditions above lead to the following results:

(a) The bounded control policy guarantees a local minimum of the HJB equation (26) and satisfies the constraints imposed on the control inputs.

(b) The Hessian matrix is positive definite, and the control policy $u_i$ renders the global minimum of the HJB equation.

(c) If an optimal algorithm exists, it is unique.

According to Lemma 4, if the reward function is smooth and the optimal control $u_i^{*}$ is adopted, then the HJB equation satisfies

$$
\mathrm{HJB}_i^{*}\left(e_i,u_i^{*},\nabla Q_i^{*}\right) = \min_{u_i^{*}(e_i)}\left[r_i(e_i,a_i,u_i^{*}) + \nabla Q_i^{*}\,\Phi_i(e_i,u_i^{*})\right] = 0. \tag{27}
$$

The optimal control can then be expressed as follows:

$$
u_i^{*}(e_i) = \arg\min_{u_i^{*}}\left[\mathrm{HJB}_i^{*}\left(e_i,u_i^{*},\nabla Q_i^{*}\right)\right] = -\frac{1}{2}R^{-1}b_i^{T}(e_i)\left(\frac{\partial Q_i^{*}(e_i,a_i,u_i)}{\partial e_i}\right)^{T}. \tag{28}
$$
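Condition (1) of Lemma 4 yields (28) directly when the global uncertainty is input-affine, $\Phi_i(e_i,u_i) = \phi_i(e_i) + b_i(e_i)u_i$, which is an assumption made for this sketch ($\phi_i$ is a hypothetical name for the $u_i$-independent part). Setting the derivative of $u_i^{T}Ru_i + \nabla Q_i^{*}(\phi_i + b_i u_i)$ with respect to $u_i$ to zero gives $2Ru_i + b_i^{T}\nabla Q_i^{*T} = 0$, hence $u_i^{*} = -\frac{1}{2}R^{-1}b_i^{T}\nabla Q_i^{*T}$, and the Hessian $2R$ is positive definite, matching condition (2). The Python/NumPy sketch below evaluates this control for a given gradient and numerically confirms that it minimizes the bracket in (26); all numbers are randomly generated placeholders.

```python
import numpy as np

def optimal_control(grad_Q, b, R):
    """u* = -1/2 R^{-1} b^T grad_Q^T as in (28); grad_Q is dQ*/de as a 1-D array."""
    return -0.5 * np.linalg.solve(R, b.T @ grad_Q)

def hjb_bracket(u, e, grad_Q, phi, b, Q_e, R):
    """r_i(e, u) + grad_Q . Phi_i(e, u) with the assumed input-affine Phi = phi + b u."""
    return e @ Q_e @ e + u @ R @ u + grad_Q @ (phi + b @ u)

# Random but well-posed placeholder data.
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2))
R = A @ A.T + np.eye(2)          # symmetric positive definite weight
b = rng.normal(size=(4, 2))      # input gain b_i(e_i) at the current error
grad_Q = rng.normal(size=4)
e, phi, Q_e = rng.normal(size=4), rng.normal(size=4), np.eye(4)

u_star = optimal_control(grad_Q, b, R)
stationarity = 2 * R @ u_star + b.T @ grad_Q   # condition (1): should vanish
h_star = hjb_bracket(u_star, e, grad_Q, phi, b, Q_e, R)
perturbed = [hjb_bracket(u_star + 0.1 * d, e, grad_Q, phi, b, Q_e, R)
             for d in rng.normal(size=(50, 2))]
print(np.abs(stationarity).max(), h_star, min(perturbed))
```

Since the bracket is quadratic in $u_i$ with Hessian $2R$, any perturbation $\delta$ raises it by exactly $\delta^{T}R\,\delta \ge 0$, which is what the perturbation check observes.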

If the optimal $Q$-function $Q_i^{*}$ is continuously differentiable and known with initial value $Q_i^{*}(0) = 0$, and the optimal control policy $u_i^{*}(e_i)$ and the global uncertainty of the subsystem $\Phi_i(e_i,u_i^{*})$ are known, then the HJB equation (27) holds and is solvable. In practice, however, $Q_i^{*}$ is not differentiable everywhere, and $u_i^{*}(e_i)$ and $\Phi_i(e_i,u_i^{*})$ are unknown.

Therefore, it is not feasible to solve the HJB equation by the average method. In this paper, we combine the action-critic-identifier (ACI) with an RBF neural network to estimate the optimal control policy, the optimal $Q$-function, and the global uncertainty of the subsystem. Action-NN is used to estimate $u_i^{*}(e_i)$, denoted as $\hat u_i(e_i)$; $Q_i^{*}$ is estimated by critic-NN and expressed as $\hat Q_i$; and a robust neural network identifier is used to identify $\Phi_i(e_i,u_i^{*})$, denoted as $\hat\Phi_i(e_i,u_i^{*})$. The block diagram of the ACI architecture is shown in Figure 2. The estimated HJB equation can be expressed as follows:

$$\mathrm{HJB}_i^{*}\left(e_i, \hat{u}_i, \nabla \hat{Q}_i\right) = \min_{\hat{u}_i(e_i)} \left[ r_i\left(e_i, a_i, \hat{u}_i\right) + \nabla \hat{Q}_i\, \hat{\Phi}_i\left(e_i, \hat{u}_i\right) \right]. \tag{29}$$

The identification error of the HJB equation above can be expressed as

$$\delta_{hi} = \mathrm{HJB}_i^{*}\left(e_i, \hat{u}_i, \nabla \hat{Q}_i\right) - \mathrm{HJB}_i^{*}\left(e_i, u_i^{*}, \nabla Q_i^{*}\right). \tag{30}$$

Because the second term vanishes by (27), $\delta_{hi}$ is simply the value of the estimated HJB expression (29).

A classic radial basis function neural network is proposed in [25]:

$$N(x) = W^{*T} S(x) + \varepsilon(x). \tag{31}$$


Figure 2: The architecture of action-critic-identifier (blocks: action, critic, identifier, subsystem, reward function, and HJB error $\delta_{hi}$).

where $W^{*}$ denotes the ideal neural network weights and $\varepsilon(x)$ represents the estimation error. With a sufficient number of nodes whose centers and widths are chosen appropriately, an RBF-NN can approximate any continuous function on a compact set to arbitrary accuracy. Therefore, the optimal $Q$-function and the optimal control policy can be expressed as

$$\begin{aligned} Q_i^{*} &= W_i^{T} S_i(e_i) + \varepsilon_{ic}(e_i), \\ u_i^{*}(e_i) &= -\frac{1}{2} R^{-1} b_i^{T}(e_i)\left[S_i(e_i)^{T} W_i + \varepsilon_{ia}(e_i)\right], \end{aligned} \tag{32}$$

where $S_i(e_i) = [s_{i1}(e_i) \cdots s_{in}(e_i)]^{T}$ is the vector of smooth basis functions of the neural network, $W_i$ denotes the ideal unknown neural network weights, and $\varepsilon_{ic}(e_i)$ and $\varepsilon_{ia}(e_i)$ are the estimation errors. Using $\hat{Q}_i$ and $\hat{u}_i(e_i)$ to estimate $Q_i^{*}$ and $u_i^{*}(e_i)$, we obtain

$$\hat{Q}_i = \hat{W}_{ic}^{T} S_{ic}(e_i), \tag{33}$$

$$\hat{u}_i(e_i) = -\frac{1}{2} R^{-1} b_i^{T}(e_i)\, S_{ia}(e_i)^{T} \hat{W}_{ia}. \tag{34}$$

In the equations above, $\hat{W}_{ic}(t)$ and $\hat{W}_{ia}(t)$ denote the weights of the critic-NN and the action-NN, respectively. The weight estimation errors are

$$\tilde{W}_{ic}(t) = W_i - \hat{W}_{ic}(t), \tag{35}$$

$$\tilde{W}_{ia}(t) = W_i - \hat{W}_{ia}(t). \tag{36}$$
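The critic and action networks in (33) and (34) are linear-in-the-weights RBF networks of the form (31). The sketch below assumes Gaussian basis functions with hypothetical centers and a shared width (the basis shape is not stated in this section) and fits the output weights by linear least squares, to illustrate the approximation property invoked for (32). `gaussian_basis` and the sine target are illustrative choices.

```python
import numpy as np


def gaussian_basis(x, centers, width):
    """S(x): Gaussian RBF activations exp(-(x - c_j)^2 / (2 width^2)) for each sample."""
    x = np.atleast_1d(x)[:, None]                      # shape (num_samples, 1)
    return np.exp(-((x - centers[None, :]) ** 2) / (2.0 * width**2))


# Hypothetical network: 25 centers spread uniformly over the input range.
centers = np.linspace(-np.pi, np.pi, 25)
width = centers[1] - centers[0]

# Fit W in N(x) = W^T S(x) to a smooth target by linear least squares.
x_train = np.linspace(-np.pi, np.pi, 200)
S_train = gaussian_basis(x_train, centers, width)     # shape (200, 25)
W, *_ = np.linalg.lstsq(S_train, np.sin(x_train), rcond=None)

x_test = np.linspace(-np.pi, np.pi, 57)
approx = gaussian_basis(x_test, centers, width) @ W
max_err = np.max(np.abs(approx - np.sin(x_test)))    # plays the role of eps(x) in (31)
print(f"max approximation error: {max_err:.2e}")
```

Adding centers while tying the width to their spacing typically shrinks the residual term, which is the behavior the text relies on when it writes $Q_i^*$ and $u_i^*$ as weighted basis expansions plus small errors.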

The weight update law of the critic-NN is a gradient descent algorithm:

$$\dot{\hat{W}}_{ic}(t) = -n_1 L_i\left(L_i^{T} \hat{W}_{ic} + e_i^{T} Q_e e_i + \hat{u}_i^{T} R \hat{u}_i\right). \tag{37}$$

In the equation above, $n_1 > 0$ is the adaptive gain of the neural network, and $L_i$ and $l_i$ are defined as

$$L_i = \frac{l_i}{l_i^{T} l_i + 1}, \qquad l_i = \nabla S_{ic}(e_i)\, \dot{e}_i. \tag{38}$$

Therefore, according to the definitions above, the following bounds hold:

$$L_{im} \le \|L_i\| \le L_{iM}, \qquad S_{icm} \le \|S_{ic}(e_i)\| \le S_{icM}, \qquad S_{iam} \le \|S_{ia}(e_i)\| \le S_{iaM}. \tag{39}$$
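A discrete-time sketch of the critic law (37) with the normalized regressor (38) follows. It uses an explicit Euler step of size `dt`, which is an assumption since the paper states the law in continuous time; array shapes and function names are illustrative. The bracket in (37) is the residual $\rho = L_i^T \hat W_{ic} + e_i^T Q_e e_i + \hat u_i^T R \hat u_i$, and (37) is gradient descent on $\rho^2/2$ with respect to $\hat W_{ic}$.

```python
import numpy as np


def normalized_regressor(grad_S_c, e_dot):
    """(38): l_i = grad S_ic(e_i) e_dot_i and L_i = l_i / (l_i^T l_i + 1)."""
    l = grad_S_c @ e_dot
    return l / (l @ l + 1.0)


def critic_step(W_c, L, e, u, Q_e, R, n1, dt):
    """One explicit-Euler step of (37); returns the new weights and the residual used."""
    residual = L @ W_c + e @ Q_e @ e + u @ R @ u
    return W_c - dt * n1 * L * residual, residual


rng = np.random.default_rng(1)
num_nodes, n_e, n_u = 6, 3, 2
grad_S_c = rng.normal(size=(num_nodes, n_e))   # Jacobian of the critic basis at e_i
e, e_dot, u = rng.normal(size=n_e), rng.normal(size=n_e), rng.normal(size=n_u)
Q_e, R = np.eye(n_e), np.eye(n_u)

L = normalized_regressor(grad_S_c, e_dot)
W_c = rng.normal(size=num_nodes)
W_next, rho = critic_step(W_c, L, e, u, Q_e, R, n1=5.0, dt=0.01)
rho_next = L @ W_next + e @ Q_e @ e + u @ R @ u
# For a fixed regressor the residual contracts by the factor (1 - dt * n1 * L^T L).
print(rho, rho_next)
```

The normalization in (38) keeps $\|L_i\| \le 1/2$ regardless of $\|l_i\|$, so the step stays well scaled even when the basis gradients are large; this is one reason for the upper bound $L_{iM}$ in (39).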


Combining (35) with (38), we obtain

$$\dot{\tilde{W}}_{ic}(t) = -n_1 L_i\left(L_i^{T} \tilde{W}_{ic} + \delta_{hi}\right). \tag{40}$$

The weight update law of the action-NN is also developed by gradient descent:

$$\dot{\hat{W}}_{ia}(t) = -n_2 S_{ia}(e_i)\left(\hat{W}_{ia}^{T} S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^{T} \hat{W}_{ic}\right)^{T}. \tag{41}$$
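Similarly, (41) is gradient descent on half the squared mismatch between the action-NN output $\hat W_{ia}^T S_{ia}$ and the critic-implied term $-\tfrac12 R^{-1} b_i \nabla S_{ic}^T \hat W_{ic}$. In the sketch below the input matrix enters as `b_T`, an $m \times p$ array, so that the dimensions agree with (34); that treatment of $b_i$, the Euler step size `dt`, and all shapes are assumptions for illustration.

```python
import numpy as np


def action_mismatch(W_a, S_a, W_c, grad_S_c, b_T, R):
    """W_a^T S_a + 1/2 R^{-1} b (grad S_c)^T W_c; zero when actor and critic agree."""
    critic_term = 0.5 * np.linalg.solve(R, b_T @ (grad_S_c.T @ W_c))
    return W_a.T @ S_a + critic_term


def action_step(W_a, S_a, W_c, grad_S_c, b_T, R, n2, dt):
    """One explicit-Euler step of (41): W_a <- W_a - dt * n2 * S_a mismatch^T."""
    mismatch = action_mismatch(W_a, S_a, W_c, grad_S_c, b_T, R)
    return W_a - dt * n2 * np.outer(S_a, mismatch)


rng = np.random.default_rng(2)
N_a, N_c, p, m = 5, 6, 3, 2            # basis sizes, error dim, input dim (illustrative)
S_a = np.abs(rng.normal(size=N_a))
W_a = rng.normal(size=(N_a, m))
W_c = rng.normal(size=N_c)
grad_S_c = rng.normal(size=(N_c, p))   # Jacobian of the critic basis
b_T = rng.normal(size=(m, p))
R = np.diag([1.0, 3.0])

before = action_mismatch(W_a, S_a, W_c, grad_S_c, b_T, R)
W_a_next = action_step(W_a, S_a, W_c, grad_S_c, b_T, R, n2=2.0, dt=0.01)
after = action_mismatch(W_a_next, S_a, W_c, grad_S_c, b_T, R)
# With the critic frozen, the mismatch shrinks by the scalar factor (1 - dt * n2 * S_a^T S_a).
print(np.linalg.norm(before), np.linalg.norm(after))
```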

According to the action-NN weight estimation error in (36), the optimal control $u_i^{*}(e_i)$ minimizes the optimal $Q$-function, and we obtain

$$W_{ia}^{T} S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^{T} W_{ic} + \frac{1}{2} R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i) + \varepsilon_{ia}(e_i) = 0. \tag{42}$$

Substituting (41) into (42), we obtain

$$\dot{\tilde{W}}_{ia} = -n_2 S_{ia}(e_i)\left(\tilde{W}_{ia}^{T} S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^{T} \tilde{W}_{ic} - \frac{1}{2} R^{-1} b_i(e_i) \nabla \varepsilon_{ic}(e_i) - \varepsilon_{ia}(e_i)\right). \tag{43}$$

After using the critic-NN and the action-NN to obtain $\hat{Q}_i$ and $\hat{u}_i(e_i)$, we design a robust RBF-NN identifier to identify the nonlinear uncertainties of the subsystem. Here $\Phi_i(e_i, \hat{u}_i)$ can be expressed as

$$\Phi_i(e_i, \hat{u}_i) = \dot{e}_{iF} = W_{iF}^{T} \kappa\left(\Lambda_{iF}^{T} e_{iF}\right) + \varepsilon_{iF}(e_{iF}) + b_i(e_i)\, \hat{u}_i, \tag{44}$$

where $\kappa(\cdot)$ denotes the basis function of the neural network and $W_{iF}$, $\Lambda_{iF}$ are the unknown ideal neural network weights.

Equation (44) can be identified by the robust RBF-NN identifier, which gives

$$\hat{\Phi}_i(e_i, \hat{u}_i) = \dot{\hat{e}}_{iF} = \hat{W}_{iF}^{T} \hat{\kappa}_{iF} + b_i(e_i)\, \hat{u}_i + \mu_i. \tag{45}$$

Here $\hat{\kappa}_{iF}$ denotes the estimated value of the basis function, $\hat{W}_{iF}$ and $\hat{\Lambda}_{iF}$ are the estimated neural network weights, and $\mu_i \in \mathbb{R}$ is the feedback error term [26]:

$$\begin{aligned} \mu_i &= k\left(e_{iF}(t) - \hat{e}_{iF}(t)\right) - k\left(e_{iF}(0) - \hat{e}_{iF}(0)\right) + \vartheta = k\left(\tilde{e}_{iF}(t) - \tilde{e}_{iF}(0)\right) + \vartheta, \\ \dot{\vartheta} &= (k\alpha + \gamma)\, \tilde{e}_{iF} + \beta_1 \operatorname{sat}(\tilde{e}_{iF}), \end{aligned} \tag{46}$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gains and $\operatorname{sat}(\cdot)$ is a saturation function. Therefore, the state estimation error of the identifier-NN, $\tilde{e}_{iF} = e_{iF} - \hat{e}_{iF}$, evolves as

$$\dot{\tilde{e}}_{iF} = \dot{e}_{iF} - \dot{\hat{e}}_{iF} = W_{iF}^{T} \kappa_{iF} - \hat{W}_{iF}^{T} \hat{\kappa}_{iF} + \varepsilon_{iF}(e_{iF}) - \mu_i. \tag{47}$$

A filtered identification error is defined as

$$E_i = \dot{\tilde{e}}_{iF} + \alpha \tilde{e}_{iF}. \tag{48}$$
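The feedback term (46) and the filtered error (48) can be sketched in discrete time as follows. The saturation width, the explicit Euler integration of $\vartheta$, the initial value $\vartheta(0) = 0$, and the class and function names are assumptions made for illustration; they are not specified in this section.

```python
import numpy as np


def sat(x, width=1.0):
    """Saturation function; the width is not given in the text (assumed 1.0)."""
    return np.clip(np.asarray(x, dtype=float) / width, -1.0, 1.0)


class FeedbackTerm:
    """Feedback term mu_i of (46), integrated with explicit Euler steps.

    mu = k (e~(t) - e~(0)) + theta,  theta_dot = (k alpha + gamma) e~ + beta1 sat(e~),
    with theta(0) = 0 assumed.
    """

    def __init__(self, k, alpha, gamma, beta1, e_tilde0):
        self.k, self.alpha, self.gamma, self.beta1 = k, alpha, gamma, beta1
        self.e_tilde0 = np.asarray(e_tilde0, dtype=float)
        self.theta = np.zeros_like(self.e_tilde0)

    def step(self, e_tilde, dt):
        """Return mu at the current time, then advance theta by dt."""
        e_tilde = np.asarray(e_tilde, dtype=float)
        mu = self.k * (e_tilde - self.e_tilde0) + self.theta
        theta_dot = (self.k * self.alpha + self.gamma) * e_tilde + self.beta1 * sat(e_tilde)
        self.theta = self.theta + dt * theta_dot
        return mu


def filtered_error(e_tilde_dot, e_tilde, alpha):
    """(48): E_i = d/dt e~_iF + alpha * e~_iF."""
    return e_tilde_dot + alpha * e_tilde


feedback = FeedbackTerm(k=2.0, alpha=1.0, gamma=0.5, beta1=0.3, e_tilde0=[0.1, -0.2])
for _ in range(3):
    print(feedback.step([0.1, -0.2], dt=0.01))
```

The integral state $\vartheta$ lets $\mu_i$ grow while $\tilde e_{iF}$ persists, and the $k(\tilde e_{iF}(t) - \tilde e_{iF}(0))$ part makes $\mu_i(0) = \vartheta(0)$, so the identifier starts without a feedback jump.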

Differentiating (48) yields

$$\begin{aligned} \dot{E}_i ={}& W_{iF}^{T} \kappa'_{iF} \Lambda_{iF}^{T} \dot{e}_{iF} - \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} + \alpha \dot{\tilde{e}}_{iF} - k E_i - \gamma \tilde{e}_{iF} \\ & - \dot{\hat{W}}_{iF}^{T} \hat{\kappa}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) - \beta_1 \operatorname{sat}(\tilde{e}_{iF}) - \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \dot{\hat{\Lambda}}_{iF}^{T} \hat{e}_{iF}, \end{aligned} \tag{49}$$

where $\kappa'_{iF}$ and $\hat{\kappa}'_{iF}$ denote the derivatives of the basis function evaluated at the ideal and the estimated arguments, respectively.

Here the weights $\hat{W}_{iF}$ and $\hat{\Lambda}_{iF}$ of the identification-NN are updated by

$$\begin{aligned} \dot{\hat{W}}_{iF} &= \operatorname{proj}\left(\Gamma_{iWF}\, \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF}\, \tilde{e}_{iF}^{T}\right), \\ \dot{\hat{\Lambda}}_{iF} &= \operatorname{proj}\left(\Gamma_{i\Lambda F}\, \dot{\hat{e}}_{iF}\, \tilde{e}_{iF}^{T} \hat{W}_{iF}^{T} \hat{\kappa}'_{iF}\right), \end{aligned} \tag{50}$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive definite constant adaptation gain matrices. To analyze the convergence of the filtered identification error, the term $\tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF}$ can be decomposed as follows:

$$\begin{aligned}
\tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF}
={}& \frac{1}{2} \hat{\kappa}'_{iF} \dot{\hat{e}}_{iF} \left[ \left(\Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}\right)\left(W_{iF}^{T} - \hat{W}_{iF}^{T}\right) + \left(W_{iF}^{T} - \hat{W}_{iF}^{T}\right)\left(\Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}\right) \right] \\
={}& \frac{1}{2} \hat{\kappa}'_{iF} \dot{\hat{e}}_{iF} \left[ \hat{W}_{iF}^{T}\left(\Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}\right) + \left(W_{iF}^{T} - \hat{W}_{iF}^{T}\right)\hat{\Lambda}_{iF}^{T} - W_{iF}^{T}\left(\Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}\right) - \left(W_{iF}^{T} - \hat{W}_{iF}^{T}\right)\Lambda_{iF}^{T} \right] \\
={}& \frac{1}{2} \hat{\kappa}'_{iF} \dot{\hat{e}}_{iF} \left[ \hat{W}_{iF}^{T}\left(\Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}\right) + \left(W_{iF}^{T} - \hat{W}_{iF}^{T}\right)\hat{\Lambda}_{iF}^{T} \right] - \frac{1}{2} \hat{\kappa}'_{iF} \dot{\hat{e}}_{iF} \left[ W_{iF}^{T}\left(\Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}\right) + \left(W_{iF}^{T} - \hat{W}_{iF}^{T}\right)\Lambda_{iF}^{T} \right] \\
={}& \frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} + \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} \\
& + \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF},
\end{aligned} \tag{51}$$

where $\tilde{W}_{iF}^{T} = W_{iF}^{T} - \hat{W}_{iF}^{T}$ and $\tilde{\Lambda}_{iF}^{T} = \Lambda_{iF}^{T} - \hat{\Lambda}_{iF}^{T}$. Substituting (51) into (49), (49) reduces to

$$\dot{E}_i = P_{F1} + P_{F2} + P_{F3} - k E_i - \gamma \tilde{e}_{iF} - \beta_1 \operatorname{sat}(\tilde{e}_{iF}), \tag{52}$$


where $P_{F1}$, $P_{F2}$, and $P_{F3}$ are given, respectively, by

$$P_{F1} = \frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} + \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \dot{\hat{\Lambda}}_{iF}^{T} \hat{e}_{iF} + \alpha \dot{\tilde{e}}_{iF} - \dot{\hat{W}}_{iF}^{T} \hat{\kappa}_{iF}, \tag{53}$$

$$P_{F2} = -\frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} + W_{iF}^{T} \kappa'_{iF} \Lambda_{iF}^{T} \dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}), \tag{54}$$

$$P_{F3} = \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF}. \tag{55}$$

According to Assumption 1, (48), and (50), $P_{F1}$, $P_{F2}$, and $P_{F3}$ are bounded as

$$\|P_{F1}\| \le J_1\left(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\right)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|, \qquad \|P_{F2}\| \le \varsigma_1, \qquad \|P_{F3}\| \le \varsigma_2. \tag{56}$$

Combining (53) and (54) with (55), we obtain

$$\left\|\dot{P}_{F2} + \dot{P}_{F3}\right\| \le \varsigma_3 + \varsigma_4 J_2\left(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\right)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|, \tag{57}$$

where $\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T}) = [\tilde{e}_{iF}^{T}\ E_i^{T}]^{T}$, $J_1(\cdot)$ and $J_2(\cdot)$ are globally invertible nondecreasing functions, and $\varsigma_j$ ($j = 1, 2, 3, 4$) are computable positive constants.

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function

$$V_{iL}(\tilde{e}_{iF}, E_i) = \frac{1}{2} E_i^{T} E_i + \frac{1}{2} \gamma \tilde{e}_{iF}^{T} \tilde{e}_{iF} + \varpi_i(t) + \phi_i(t). \tag{58}$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are given by

$$\begin{aligned} \dot{\varpi}_i(t) &= -\left[E_i^{T}\left(P_{F2} - \beta_1 \operatorname{sat}(\tilde{e}_{iF})\right) + \dot{\tilde{e}}_{iF}^{T} P_{F3} - \beta_2 J_2\left(\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\right)\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|\,\|\tilde{e}_{iF}\|\right], \\ \varpi_i(0) &= \beta_1 \left|\tilde{e}_{iF}(0)\right| - \tilde{e}_{iF}^{T}(0)\left(P_{F2}(0) + P_{F3}(0)\right), \end{aligned} \tag{59}$$

$$\phi_i(t) = \frac{\alpha}{4}\left[\operatorname{tr}\left(\tilde{W}_{iF}^{T} \Gamma_{iWF}^{-1} \tilde{W}_{iF}\right) + \operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T} \Gamma_{i\Lambda F}^{-1} \tilde{\Lambda}_{iF}\right)\right], \tag{60}$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Define $d = [E_i^{T}\ \tilde{e}_{iF}^{T}\ \varpi_i^{1/2}\ \phi_i^{1/2}]^{T}$, where $\beta_1, \beta_2 \in \mathbb{R}$ are positive adaptation gains chosen to ensure $\varpi_i(t) \ge 0$. Then

$$U_1(d) \le V_{iL}(\tilde{e}_{iF}, E_i) \le U_2(d), \tag{61}$$

where

$$U_1(d) = \frac{1}{2}\min(1, \gamma)\|d\|^{2}, \qquad U_2(d) = \max(1, \gamma)\|d\|^{2}. \tag{62}$$
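The sandwich bound (61)-(62) follows by comparing each term of (58) with the corresponding component of $d$. A quick randomized check is sketched below; it treats $\varpi_i$ and $\phi_i$ simply as given nonnegative numbers (their definitions do not affect the bound), and the function names are illustrative.

```python
import numpy as np


def lyapunov_V(E, e_tilde, varpi, phi, gamma):
    """V_iL from (58), with varpi_i(t) and phi_i(t) supplied as nonnegative numbers."""
    return 0.5 * E @ E + 0.5 * gamma * e_tilde @ e_tilde + varpi + phi


def bounds(E, e_tilde, varpi, phi, gamma):
    """U_1(d) and U_2(d) from (62) with d = [E, e_tilde, sqrt(varpi), sqrt(phi)]."""
    d_sq = E @ E + e_tilde @ e_tilde + varpi + phi   # ||d||^2
    return 0.5 * min(1.0, gamma) * d_sq, max(1.0, gamma) * d_sq


rng = np.random.default_rng(3)
violations = 0
for _ in range(1000):
    gamma = rng.uniform(0.01, 10.0)
    E, e_tilde = rng.normal(size=2), rng.normal(size=2)
    varpi, phi = rng.uniform(0.0, 5.0, size=2)
    V = lyapunov_V(E, e_tilde, varpi, phi, gamma)
    U1, U2 = bounds(E, e_tilde, varpi, phi, gamma)
    violations += not (U1 <= V <= U2)
print("violations:", violations)
```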

The time derivative of (58) is

$$\dot{V}_{iL}(\tilde{e}_{iF}, E_i) = \nabla V_{iL}^{T} K\left[\dot{E}_i^{T}\ \ \dot{\tilde{e}}_{iF}^{T}\ \ \frac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i\ \ \frac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\right]^{T}, \tag{63}$$

where $K[\cdot]$ denotes a Filippov set [27]. Hence $\dot{V}_{iL}(\tilde{e}_{iF}, E_i)$ can be written as

$$\begin{aligned}
\dot{V}_{iL}(\tilde{e}_{iF}, E_i) ={}& \left[E_i^{T}\ \ \gamma \tilde{e}_{iF}^{T}\ \ 2\varpi_i^{1/2}\ \ 2\phi_i^{1/2}\right] K\left[\dot{E}_i^{T}\ \ \dot{\tilde{e}}_{iF}^{T}\ \ \frac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i\ \ \frac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\right]^{T} \\
\le{}& E_i^{T}\Big(\frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} + \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \dot{\hat{\Lambda}}_{iF}^{T} \hat{e}_{iF} + \alpha \dot{\tilde{e}}_{iF} \\
& - \frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \gamma \tilde{e}_{iF} - \dot{\hat{W}}_{iF}^{T} \hat{\kappa}_{iF} + W_{iF}^{T} \kappa'_{iF} \Lambda_{iF}^{T} \dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) \\
& + \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF} - k E_i - \beta_1 K\left[\operatorname{sat}(\tilde{e}_{iF})\right]\Big) \\
& - E_i^{T}\Big(-\frac{1}{2} W_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} - \frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} + W_{iF}^{T} \kappa'_{iF} \Lambda_{iF}^{T} \dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) - \beta_1 K\left[\operatorname{sat}(\tilde{e}_{iF})\right]\Big) \\
& - \dot{\tilde{e}}_{iF}^{T}\Big(\frac{1}{2} \tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^{T} \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^{T} \dot{\tilde{e}}_{iF}\Big) + \gamma \tilde{e}_{iF}^{T}\left(E_i - \alpha \tilde{e}_{iF}\right) \\
& + \beta_2 J_2\left(\|\varphi_i\|\right)\|\varphi_i\|\,\|\tilde{e}_{iF}\| - \frac{\alpha}{2}\operatorname{tr}\left(\tilde{W}_{iF}^{T} \Gamma_{iWF}^{-1} \dot{\hat{W}}_{iF}\right) - \frac{\alpha}{2}\operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T} \Gamma_{i\Lambda F}^{-1} \dot{\hat{\Lambda}}_{iF}\right),
\end{aligned} \tag{64}$$

where $\varphi_i$ abbreviates $\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})$.

Substituting (53), (54), and (55) into (64), we obtain

$$\begin{aligned}
\dot{V}_{iL}(\tilde{e}_{iF}, E_i) \le{}& E_i^{T}\left(P_{F1} + P_{F2} + P_{F3} - \beta_1 K\left[\operatorname{sat}(\tilde{e}_{iF})\right] - k E_i - \gamma \tilde{e}_{iF}\right) + \gamma \tilde{e}_{iF}^{T}\left(E_i - \alpha \tilde{e}_{iF}\right) \\
& - E_i^{T}\left(P_{F2} - \beta_1 K\left[\operatorname{sat}(\tilde{e}_{iF})\right]\right) - \dot{\tilde{e}}_{iF}^{T} P_{F3} + \beta_2 J_2\left(\|\varphi_i\|\right)\|\varphi_i\|\,\|\tilde{e}_{iF}\| \\
& - \frac{\alpha}{2}\operatorname{tr}\left(\tilde{W}_{iF}^{T} \Gamma_{iWF}^{-1} \dot{\hat{W}}_{iF}\right) - \frac{\alpha}{2}\operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T} \Gamma_{i\Lambda F}^{-1} \dot{\hat{\Lambda}}_{iF}\right) \\
={}& -\alpha\gamma \tilde{e}_{iF}^{T} \tilde{e}_{iF} - k E_i^{T} E_i + E_i^{T} P_{F1} + \left(E_i^{T} - \dot{\tilde{e}}_{iF}^{T}\right) P_{F3} + \beta_2 J_2\left(\|\varphi_i\|\right)\|\varphi_i\|\,\|\tilde{e}_{iF}\| \\
& - \frac{\alpha}{2}\operatorname{tr}\left(\tilde{W}_{iF}^{T} \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} \tilde{e}_{iF}^{T}\right) - \frac{\alpha}{2}\operatorname{tr}\left(\tilde{\Lambda}_{iF}^{T} \dot{\hat{e}}_{iF} \tilde{e}_{iF}^{T} \hat{W}_{iF}^{T} \hat{\kappa}'_{iF}\right) \\
\le{}& -k_1 \|\tilde{e}_{iF}\|^{2} - k_2 \|E_i\|^{2} + \frac{J_1\left(\|\varphi_i\|\right)^{2}}{4k_3}\|\varphi_i\|^{2} + \frac{\beta_2^{2} J_2\left(\|\varphi_i\|\right)^{2}}{4\alpha k_4}\|\varphi_i\|^{2},
\end{aligned} \tag{65}$$

where $k_{\min} = \min\{k_1, k_2\}$, $\xi = \min\{k_3, \alpha k_4/\beta_2^{2}\}$, and $J(\|\varphi_i\|)^{2} = J_1(\|\varphi_i\|)^{2} + J_2(\|\varphi_i\|)^{2}$. The following conclusion is therefore obtained:

$$\dot{V}_{iL}(\tilde{e}_{iF}, E_i) \le -k_{\min}\|\varphi_i\|^{2} + \frac{J\left(\|\varphi_i\|\right)^{2}\|\varphi_i\|^{2}}{4\xi} \le -c\|\varphi_i\|^{2}. \tag{66}$$

Therefore, for some positive constant $c$, $-c\|\varphi_i(\tilde{e}_{iF}^{T}, E_i^{T})\|^{2}$ is a negative semidefinite function on the adjustable region $D$ defined as

$$D = \left\{ d(t) \;\middle|\; \|d\| \le J^{-1}\left(2\sqrt{k_{\min}\xi}\right) \right\}, \tag{67}$$

so Lyapunov stability theory shows that the system is stable. To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, a novel decentralized reinforcement learning robust optimal tracking controller is designed in this paper, in which a robust term compensates for the neural network approximation errors. The robust control term is designed as

$$u_{irb} = \frac{N_{rb}\, e_i}{e_i^{T} e_i + \zeta}. \tag{68}$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ is chosen to satisfy

$$\begin{aligned}
N_{rb} &\ge \left[\frac{\delta_{hi}^{2}}{2n_1} + \frac{n_1\left(-\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{2}}{2n_2} + n_1 n_2 \|b_i(e_i)\|^{2}\left(-\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{2}\right] \frac{e_i^{T} e_i + \zeta}{2 n_1 n_2 \|b_i(e_i)\|\, e_i^{T} e_i} \\
&\ge \left[n_1^{2}\left(-\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{2} + n_2 \delta_{hi}^{2} + 2 n_1^{2} n_2^{2} \|b_i(e_i)\|^{2}\left(-\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{2}\right] \frac{e_i^{T} e_i + \zeta}{4 n_1^{2} n_2^{2} \|b_i(e_i)\|\, e_i^{T} e_i}.
\end{aligned} \tag{69}$$

Therefore, the global control law is designed as

$$u_{\mathrm{mix}} = \hat{u}_i + u_{irb} = -\frac{1}{2} R^{-1} b_i^{T}(e_i)\, S_{ia}(e_i)^{T} \hat{W}_{ia} + \frac{N_{rb}\, e_i}{e_i^{T} e_i + \zeta}. \tag{70}$$
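The composite law (70) and the gain condition (69) can be evaluated numerically as sketched below. The scalar `eps_sq` stands for the squared magnitude of $-\varepsilon_{ia}(e_i) - R^{-1} b_i(e_i)\nabla\varepsilon_{ic}(e_i)/2$ and `b_norm` for $\|b_i(e_i)\|$; in practice these would be replaced by known upper bounds, which is our assumption, as are the numeric values and function names. The check confirms that the two bracketed forms in (69) are the same quantity written over different denominators.

```python
import numpy as np


def n_rb_bound_form1(delta_h, eps_sq, b_norm, e_sq, zeta, n1, n2):
    """First expression in (69)."""
    bracket = delta_h**2 / (2 * n1) + n1 * eps_sq / (2 * n2) + n1 * n2 * b_norm**2 * eps_sq
    return bracket * (e_sq + zeta) / (2 * n1 * n2 * b_norm * e_sq)


def n_rb_bound_form2(delta_h, eps_sq, b_norm, e_sq, zeta, n1, n2):
    """Second expression in (69), written over the common denominator 4 n1^2 n2^2."""
    bracket = n1**2 * eps_sq + n2 * delta_h**2 + 2 * n1**2 * n2**2 * b_norm**2 * eps_sq
    return bracket * (e_sq + zeta) / (4 * n1**2 * n2**2 * b_norm * e_sq)


def mixed_control(u_hat, e, n_rb, zeta):
    """(70): u_mix = u_hat + N_rb e / (e^T e + zeta)."""
    return u_hat + n_rb * e / (e @ e + zeta)


e = np.array([0.3, -0.1])
args = dict(delta_h=0.4, eps_sq=0.02, b_norm=1.5, e_sq=e @ e, zeta=0.05, n1=2.0, n2=0.5)
lower = n_rb_bound_form1(**args)
u_mix = mixed_control(u_hat=np.array([0.1, 0.2]), e=e, n_rb=lower, zeta=0.05)
print(lower, n_rb_bound_form2(**args), u_mix)
```

Because the bound grows like $1/(e_i^T e_i)$ as $e_i \to 0$ while the robust term is multiplied by $e_i/(e_i^T e_i + \zeta)$, the constant $\zeta > 0$ is what keeps $u_{irb}$ finite near the origin.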

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the actual output tracks the desired trajectory asymptotically.

Proof. Choose the Lyapunov function

$$V_{iu}(e_i, u_{\mathrm{mix}}) = \frac{1}{2n_1}\operatorname{tr}\left\{\tilde{W}_{ic}^{T}\tilde{W}_{ic}\right\} + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde{W}_{ia}^{T}\tilde{W}_{ia}\right\} + n_1 n_2\left[e_{iF}^{T} e_{iF} + \Xi\int_{0}^{\infty} r_i(e_i, u_{\mathrm{mix}})\, d\tau\right], \tag{71}$$

where $\Xi > 0$ is a parameter to be determined and $t \le \tau < \infty$. The time derivative of (71) is

$$\begin{aligned}
\dot{V}_{iu}(e_i, u_{\mathrm{mix}}) ={}& \frac{1}{2n_1}\operatorname{tr}\left\{\tilde{W}_{ic}^{T}\dot{\tilde{W}}_{ic}\right\} + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde{W}_{ia}^{T}\dot{\tilde{W}}_{ia}\right\} + n_1 n_2\left[e_{iF}^{T}\dot{e}_{iF} + \Xi\, r_i(e_i, u_{\mathrm{mix}})\right] \\
={}& \frac{1}{2n_1}\operatorname{tr}\left\{\tilde{W}_{ic}^{T}\left(-n_1 L_i\left(L_i^{T}\tilde{W}_{ic} + \delta_{hi}\right)\right)\right\} \\
& + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde{W}_{ia}^{T}\left[-n_2 S_{ia}(e_i)\left(\tilde{W}_{ia}^{T}S_{ia}(e_i) - \varepsilon_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde{W}_{ic} - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)\right]\right\} \\
& + n_1 n_2\left[e_{iF}^{T}\left(W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T}e_i\right) + \varepsilon_{iF}(e_i) + b_i(e_i)u_{\mathrm{mix}}\right) + \Xi\left(e_i^{T}Q_e e_i + u_{\mathrm{mix}}^{T}R\,u_{\mathrm{mix}}\right)\right] \\
\le{}& -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2}\right)\|\tilde{W}_{ic}\|^{2} + \frac{1}{2n_1}\delta_{hi}^{2} - \left(n_1 S_{iam}^{2} - \frac{3}{4}n_1 n_2 S_{iaM}^{2}\right)\|\tilde{W}_{ia}\|^{2} \\
& + \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\nabla S_{icM}^{2}\|\tilde{W}_{ic}\|^{2} \\
& + \frac{n_1}{2n_2}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{T}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right) \\
& + n_1 n_2\|b_i(e_i)\|^{2}\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i) + n_1 n_2 S_{iaM}^{2}\|b_i(e_i)\|^{2}\|\tilde{W}_{ia}\|^{2} \\
& + n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2} + n_1 n_2\left(\|b_i(e_i)\|^{2} - \Xi\lambda_{\min}(R)\right)\|u_{\mathrm{mix}}\|^{2}.
\end{aligned} \tag{72}$$

If the following conditions hold (the first two are the standard Rayleigh-quotient bounds for symmetric $Q_e$ and $R$):

$$\begin{aligned} \lambda_{\min}(Q_e)\|e_{iF}\|_2^{2} &\le e_{iF}^{T}Q_e e_{iF} \le \lambda_{\max}(Q_e)\|e_{iF}\|_2^{2}, \\ \lambda_{\min}(R)\|u_{\mathrm{mix}}\|_2^{2} &\le u_{\mathrm{mix}}^{T}R\,u_{\mathrm{mix}} \le \lambda_{\max}(R)\|u_{\mathrm{mix}}\|_2^{2}, \\ \Xi &> \frac{\|b_i(e_i)\|^{2}}{\lambda_{\min}(R)}, \end{aligned} \tag{73}$$

then $\dot{V}_{iu}(e_i, u_{\mathrm{mix}})$ can be further bounded as

$$\begin{aligned}
\dot{V}_{iu}(e_i, u_{\mathrm{mix}}) \le{}& -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2} - \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\nabla S_{icM}^{2}\right)\|\tilde{W}_{ic}\|^{2} \\
& - \left(n_1 S_{iam}^{2} - \frac{3}{4}n_1 n_2 S_{iaM}^{2} - n_1 n_2 S_{iaM}^{2}\|b_i(e_i)\|^{2}\right)\|\tilde{W}_{ia}\|^{2} \\
& - n_1 n_2\left(\Xi\lambda_{\min}(R) - \|b_i(e_i)\|^{2}\right)\|u_{\mathrm{mix}}\|^{2} - n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2} \\
\le{}& -n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}.
\end{aligned} \tag{74}$$

Therefore, $\dot{V}_{iu}(e_i, u_{\mathrm{mix}}) < 0$ whenever $e_{iF} \neq 0$.

4. Simulations

To verify the validity and convergence of the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method, and to study the convergence of the error through comparative simulation results, two different configurations of the reconfigurable modular robot with a time varying external constraint are used, as shown in Figures 3 and 4.

For ease of analysis, these configurations are converted into the analytic diagrams shown in Figures 5 and 6, where $L_1$, $L_2$, and $L_4$ are the link lengths and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint can be regarded as a rotating column whose orientation changes with time. The constraint equations of configurations A and B are as follows:

$$\begin{aligned} \Psi_A(q, t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right], \\ \Psi_B(q, t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right]. \end{aligned} \tag{75}$$

In the equations above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2}. \tag{76}$$
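The constraint functions (75) with the angle profile (76) are straightforward to evaluate. The link lengths below are hypothetical placeholders, since their values are not given in this section, and the function names are illustrative.

```python
import math


def alpha(t):
    """(76): angle between the time varying constraint and the X-axis."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)


def psi_a(q, t, L1, L2, L3, L4):
    """(75), configuration A."""
    return L1 * math.cos(q[0]) + L2 * math.cos(q[1]) - (L3 + L4 / math.tan(alpha(t)))


def psi_b(q, t, L1, L2, L3, L4):
    """(75), configuration B; q[0] does not appear because joint 1 does not move."""
    return L1 + L2 * math.cos(q[1]) - (L3 + L4 / math.tan(alpha(t)))


# Hypothetical link lengths, chosen so that q2 = pi/2 satisfies Psi_B = 0 at t = 0.
lengths = dict(L1=1.0, L2=1.0, L3=1.5, L4=0.5)
print(alpha(0.0), psi_b([0.0, math.pi / 2], 0.0, **lengths))
```

Since $\cot\alpha(t)$ is computed as `1 / tan`, the sketch would fail if $\alpha(t)$ reached a multiple of $\pi$; with (76), $\alpha(t)$ stays within $0.75\pi \pm 0.2$ rad, so this does not occur.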

The initial joint positions are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B, and the initial joint velocities are zero. The dynamic models of configurations A and B are

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos q_2 + 0.6066 & 0.18\cos q_2 + 0.1233\\ 0.18\cos q_2 + 0.1233 & 0.1233 \end{bmatrix},\\
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^2 q_2 & -0.06\cos q_2\\ -0.06\cos q_2 & 0.1233 \end{bmatrix},\\
C_A(q,\dot q) &= \begin{bmatrix} -0.36\sin q_2\,\dot q_2 & -0.18\sin q_2\,\dot q_2\\ 0.18\sin q_2\,(\dot q_1 - \dot q_2) & 0.18\sin q_2\,\dot q_1 \end{bmatrix},\\
C_B(q,\dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot q_2 & 0.06\sin q_2\,\dot q_2\\ 0.06\sin q_2\,\dot q_2 & 0 \end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin q_1\\ -5.88\sin(q_1 + q_2) \end{bmatrix},\qquad
G_B(q) = \begin{bmatrix} 0\\ -5.88\cos q_2 \end{bmatrix},\\
F_A(q,\dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1)\\ 1.2\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2) \end{bmatrix},\qquad
F_B(q,\dot q) = \begin{bmatrix} 0\\ 1.5\dot q_2 + \sin q_2 + 1.2\,\mathrm{sgn}(\dot q_2) \end{bmatrix}.
\end{aligned}
\tag{77}
$$
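The matrices in (77) can be evaluated directly. The sketch below is a minimal, hypothetical Python rendering for configuration A that also solves for the joint accelerations. It assumes the common Euler-Lagrange form $M(q)\ddot q + C(q,\dot q)\dot q + G(q) + F(q,\dot q) = \tau$ with the constraint force left out, reads the velocity terms of $F_A$ as $\dot q_1$ and $\dot q_2$ (an assumption about the friction model), and uses $\mathrm{sgn}(0) = 0$. The names `model_a` and `joint_accel_a` are illustrative, not the authors' simulation code.

```python
import math


def sgn(x):
    """Sign function with sgn(0) = 0 (assumed convention)."""
    return (x > 0) - (x < 0)


def model_a(q, dq):
    """Return M_A, C_A, G_A, F_A of configuration A from (77) as nested lists."""
    q1, q2 = q
    dq1, dq2 = dq
    c2, s2 = math.cos(q2), math.sin(q2)
    M = [[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
         [0.18 * c2 + 0.1233, 0.1233]]
    C = [[-0.36 * s2 * dq2, -0.18 * s2 * dq2],
         [0.18 * s2 * (dq1 - dq2), 0.18 * s2 * dq1]]
    G = [-5.88 * math.sin(q1 + q2) - 17.64 * math.sin(q1),
         -5.88 * math.sin(q1 + q2)]
    # Friction terms; the velocity dependence is an assumed reading of (77).
    F = [dq1 + 10 * math.sin(3 * q1) + 2 * sgn(dq1),
         1.2 * dq2 + 5 * math.sin(2 * q2) + sgn(dq2)]
    return M, C, G, F


def joint_accel_a(q, dq, tau):
    """Solve M(q) ddq = tau - C dq - G - F for the 2x2 case by Cramer's rule."""
    M, C, G, F = model_a(q, dq)
    rhs = [tau[k] - (C[k][0] * dq[0] + C[k][1] * dq[1]) - G[k] - F[k]
           for k in range(2)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    ddq1 = (M[1][1] * rhs[0] - M[0][1] * rhs[1]) / det
    ddq2 = (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det
    return [ddq1, ddq2]


# Example: unforced joint accelerations at the initial state q = (2, 2), dq = 0.
print(joint_accel_a([2.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
```

Since $\det M_A = 0.05959089 - 0.0324\cos^2 q_2 > 0$ and $(M_A)_{11} = 0.36\cos q_2 + 0.6066 > 0$ for every $q_2$, $M_A$ is positive definite and the 2×2 solve is always well posed.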

The desired trajectories of configurations A and B are as follows.

Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos t + 0.2\sin(3t),\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\big(\alpha(t) - y_{1d}\big) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{78}
$$


Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$
\begin{aligned}
y_{1d} &= 0,\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\alpha(t) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{79}
$$
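The reference generator in (78) and (79) can be sketched as follows. This is an illustrative Python version under stated assumptions: it reads (76) as $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$, and the link lengths in `LINKS` are hypothetical placeholders, because numerical values of $L_1$, $L_2$, $L_3$ are not given in this section. The arcsine argument is checked explicitly so that infeasible link lengths produce a descriptive error.

```python
import math


def alpha(t):
    """Constraint angle from (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2)


def theta(y1d, t, L1, L2, L3):
    """Joint 2 reference Theta(y1d, t) shared by (78) and (79)."""
    a = alpha(t)
    arg = (L1 * math.sin(a - y1d) - L3 * math.sin(a)) / L2
    if not -1.0 <= arg <= 1.0:
        raise ValueError("no admissible joint 2 angle for these link lengths")
    return math.asin(arg) + a


def desired_a(t, L1, L2, L3):
    """Configuration A references (78)."""
    y1d = 0.5 * math.cos(t) + 0.2 * math.sin(3 * t)
    return y1d, theta(y1d, t, L1, L2, L3)


def desired_b(t, L1, L2, L3):
    """Configuration B references (79): joint 1 is held at zero."""
    return 0.0, theta(0.0, t, L1, L2, L3)


# Hypothetical link lengths (L1, L2, L3), chosen only so that arcsin stays defined.
LINKS = (1.0, 1.0, 0.5)
for t in (0.0, 1.0, 2.0):
    print(t, desired_a(t, *LINKS), desired_b(t, *LINKS))
```

By construction, $L_2\sin(y_{2d} - \alpha) = L_1\sin(\alpha - y_{1d}) - L_3\sin\alpha$, which is a convenient check when the reference is fed to a simulation.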

Because of the dynamic external constraint, joint 1 in configuration B does not move; its desired displacement is zero.

To confirm that the adopted method applies to different configurations, and to verify how well the subsystems track their desired trajectories under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, this part compares two approaches in simulation: the classic RBF neural network control method and the ACI based decentralized reinforcement learning robust optimal tracking control method.

Figures 7–14 show the joint tracking and error curves obtained when a classic RBF neural network [4] is used to compensate for the dynamic nonlinear terms and the interconnection terms of the subsystems. Figures 7–10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classic neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11–14 show that, when time varying constraints are present, the classic neural control method cannot adequately compensate for the joint subsystem constraints of the reconfigurable modular robot. As the joint output variables become larger, the reconfigurable modular system loses robustness and the tracking errors grow.

Figure 5: The analytic chart of configuration A (axes $X$ and $Y$; joint angles $q_1$, $q_2$; lengths $L_1$–$L_4$; constraint angle $\alpha$).

Figure 6: The analytic chart of configuration B (axes $X$ and $Y$; joint angles $q_1$, $q_2$; lengths $L_1$–$L_4$; constraint angle $\alpha$).

Figures 15–22 show the joint tracking and error curves obtained when ACI is used to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in each subsystem. Figures 15–18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is due to the identification ability of ACI, which identifies the uncertainties contained in the subsystems in a short time. The error curves in Figures 19–22 show that the tracking error is very small and converges to zero in a short time. In addition, the joint subsystems remain robust when the proposed decentralized control method is adopted.

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. They show that the proposed decentralized control method fully satisfies the reachability requirement of the reconfigurable modular robot, and that no singular displacements of the joint subsystems or the end-effector appear. The parameters used in ACI are listed in Table 1.


Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Table 1: Parameter list of action-critic-identifier.

| $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
|---|---|---|---|---|---|---|---|---|
| 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |

The simulation results show that, compared with the classic RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. In each configuration, the joint variables track the desired trajectories within a very short time, and the fluctuation of the converged error is minimal.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error in rad versus time in s).


Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error in rad versus time in s).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error in rad versus time in s).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, by combining ACI with an RBF neural network, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time varying constrained reconfigurable modular robots. The theory addresses the continuous-time nonlinear optimal control policy problem for strongly coupled robotic systems with uncertainty. First, we model the subsystems with time varying external constraints and describe the global robot system as a synthesis of interconnected subsystems. Second, ACI is used to estimate the HJB equation and the global uncertainty: a continuous-time optimal $Q$-function replaces the traditional optimal value function, the optimal $Q$-function and the optimal control policy are approximated by the critic-NN and the action-NN, and the global uncertainty is identified by the identifier. Third, we design a novel decentralized robust optimal tracking controller so that the desired trajectory is tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions are designed to confirm the stability of ACI and of the subsystems. Finally, to confirm the superiority of the proposed control theory, comparative simulation examples have been presented for two different configurations of the robot.

In future work, more complex configurations of time varying external constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 20: Tracking error curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 21: Tracking error curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 error in rad versus time in s).


Figure 22: Tracking error curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 23: 3D-tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D-tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, “Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode,” INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, “Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control,” Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, “Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic,” Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, “Decentralized adaptive neural network control for reconfigurable manipulators,” in Proceedings of the Chinese Control and Decision Conference (CCDC ’10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, “A new distributed control scheme of modular and reconfigurable robots,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA ’07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, “Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator,” Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, “Lebesgue-sampling-based optimal control problems with time aggregation,” IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, “Efficient reinforcement learning using recursive least-squares methods,” Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.


[14] H. Zhang, X. Xie, and S. Tong, “Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, “Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator,” in The International Joint Conference on Neural Network, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, “Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, “Neural network control of a robot interacting with an uncertain viscoelastic environment,” IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, “A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm,” in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM ’11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, “A single network adaptive critic-based redundancy resolution scheme for robot manipulators,” IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, “Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems,” IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, “Deterministic learning based adaptive network control of robot in task space,” Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, “Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure,” IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, “A calculus for computing Filippov’s differential inclusion with application to the variable structure control of robot manipulators,” IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


Here $r_i(e_i, u_i)$ represents the reward function for the current state:

$$
r_i(e_i, u_i) = e_i^T Q_e e_i + u_i^T R u_i, \tag{24}
$$

where $Q_e$ and $R$ are positive definite matrices.
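As a small illustration of (24), the quadratic reward can be computed as below. The weight matrices and the choice $e_i = (\text{error}, \text{error rate})$ with a scalar joint input are hypothetical; the sketch only shows that $r_i$ is the sum of two quadratic forms, which for positive definite $Q_e$ and $R$ is zero only when both the error and the control are zero.

```python
def quad_form(x, P):
    """Return x^T P x for a vector x and a square matrix P given as nested lists."""
    n = len(x)
    return sum(x[j] * P[j][k] * x[k] for j in range(n) for k in range(n))


def reward(e, u, Qe, R):
    """Quadratic reward r_i(e_i, u_i) = e^T Qe e + u^T R u, as in (24)."""
    return quad_form(e, Qe) + quad_form(u, R)


# Hypothetical weights: e_i = (tracking error, error rate), scalar joint input u_i.
Qe = [[10.0, 0.0], [0.0, 1.0]]
R = [[0.1]]
print(reward([0.2, -0.5], [3.0], Qe, R))
```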

Recording the value of state-action pairs is typically more useful than recording the value of states alone, since state-action pairs predict the reward. A low reward for a state does not imply that the value of its state-action pairs is also low: if the subsystem's state produces a higher reward over a period of time, the state-action pair can still attain a higher value. Therefore, from a long-term perspective, defining a suitable state-action value function ($Q$-function) enables the actions to produce more reward [21, 22].

According to (23) and (24), the continuous-time optimal $Q$-function can be defined as

$$
\begin{aligned}
Q_i^*(e_i, a_i, u_i) &= r_i(e_i, a_i, u_i) + V_i^*(e_i, u_i)\\
&= r_i(e_i, a_i, u_i) + \min_{u_i,\ t \le \tau < \infty}\int_0^{\infty} r_i\big(e_i(\tau), u_i(e_i(\tau))\big)\,d\tau.
\end{aligned}
\tag{25}
$$
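The integral in (25) is the cost a policy accumulates over the horizon, and the minimization selects the best policy. A hypothetical scalar example, not the robot model, makes this concrete: for $\dot e = a e + b u$ with reward $q e^2 + r u^2$ and a linear policy $u = -k e$, the integral has a closed form, and minimizing it over $k$ recovers the scalar LQR gain. The function names and numbers below are illustrative assumptions.

```python
import math


def lqr_scalar(a, b, q, r):
    """Optimal gain k* and value coefficient p for de/dt = a e + b u, reward q e^2 + r u^2."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)  # root of q + 2ap - b^2 p^2 / r = 0
    return b * p / r, p


def cost_closed_form(a, b, k, q, r, e0):
    """Integral of q e^2 + r u^2 over [0, inf) under the policy u = -k e."""
    lam = a - b * k  # closed-loop pole
    if lam >= 0:
        raise ValueError("policy is not stabilizing; the integral diverges")
    return (q + r * k * k) * e0 * e0 / (-2.0 * lam)


def cost_numeric(a, b, k, q, r, e0, horizon=20.0, steps=20000):
    """Trapezoidal approximation of the same integral on [0, horizon]."""
    lam = a - b * k
    h = horizon / steps
    total = 0.0
    for j in range(steps + 1):
        e = e0 * math.exp(lam * j * h)
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * (q * e * e + r * (k * e) ** 2)
    return total * h


k_star, p = lqr_scalar(1.0, 1.0, 1.0, 1.0)
print(k_star, p)                                        # both equal 1 + sqrt(2)
print(cost_closed_form(1.0, 1.0, 2.0, 1.0, 1.0, 1.0))   # 2.5, larger than p
```

Any stabilizing gain other than $k^*$ yields a larger cost than $p\,e_0^2$, which is exactly the role of the minimum in (25).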

Assumption 3. The partial derivatives of $Q_i^*$ and $r_i(e_i, a_i, u_i)$ exist and are continuous in the domain.

According to (18) and (24), under the control policy $u_i$, the optimal $Q$-function satisfies the following Hamilton-Jacobi-Bellman equation [23]:

$$
\begin{aligned}
\mathrm{HJB}_i(e_i, u_i, \nabla Q_i^*) &= \min_{u_i(e_i)}\Big[r_i(e_i, a_i, u_i) + \nabla Q_i^*\big(-F(e_i, u_i) - h_i(e, \dot e, \ddot e) - f_i + b_i(e_i)\,u_i\big)\Big]\\
&= \min_{u_i(e_i)}\Big[r_i(e_i, a_i, u_i) + \nabla Q_i^*\,\Phi_i(e_i, u_i)\Big],
\end{aligned}
\tag{26}
$$

where $\Phi_i(e_i, u_i) = -F(e_i, u_i) - h_i(e, \dot e, \ddot e) - f_i + b_i(e_i)\,u_i$ denotes the global uncertainty, including the unknown dynamics of the subsystem and the interconnection term, and $\nabla Q_i^* = \partial Q_i^*/\partial e_i$ denotes the gradient of the optimal $Q$-function.

Lemma 4 (see [24]). Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). For the minimum of the HJB equation (26) to be a stationary point with respect to $u_i$, the optimal $Q$-function and the optimal control policy must satisfy the following conditions:

(1) $\partial\,\mathrm{HJB}_i(e_i, u_i, \nabla Q_i)/\partial u_i = 0$;

(2) $\partial^2\,\mathrm{HJB}_i(e_i, u_i, \nabla Q_i)/(\partial u_i\,\partial u_i^T) \ge 0$.

These necessary conditions lead to the following results:

(a) The bounded control policy guarantees a local minimum of the HJB equation (26) and satisfies the constraints imposed on the control inputs.

(b) The Hessian matrix is positive definite, and the control policy $u_i$ renders the global minimum of the HJB equation.

(c) If an optimal algorithm exists, it is unique.

According to Lemma 4, if the reward function is smooth and the optimal control $u_i^*$ is adopted, the HJB equation satisfies

$$
\mathrm{HJB}_i^*(e_i, u_i^*, \nabla Q_i^*) = \min_{u_i^*(e_i)}\Big[r_i(e_i, a_i, u_i^*) + \nabla Q_i^*\,\Phi_i(e_i, u_i^*)\Big] = 0. \tag{27}
$$

The optimal control can then be expressed as follows:

$$
u_i^*(e_i) = \arg\min_{u_i^*}\Big[\mathrm{HJB}_i^*(e_i, u_i^*, \nabla Q_i^*)\Big] = -\frac{1}{2}R^{-1}b_i^T(e_i)\left(\frac{\partial Q_i^*(e_i, a_i, u_i)}{\partial e_i}\right)^T. \tag{28}
$$
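The stationarity conditions of Lemma 4 can be checked numerically for the control law (28). In the hypothetical sketch below, the subsystem has a scalar input, $\Phi_i = d_i + b_i u_i$ is split into an assumed drift vector $d_i$ (standing in for $-F - h_i - f_i$) and an input vector $b_i$, and $\nabla Q_i^*$ is a given row vector. The bracket in (26) is then quadratic in $u_i$ with second derivative $2R > 0$, so $u_i^* = -\tfrac{1}{2}R^{-1}b_i^T(\nabla Q_i^*)^T$ is its unique minimizer.

```python
def hamiltonian(u, e, grad_q, drift, b, Qe, R):
    """Bracket of (26), r_i + grad Q . Phi_i, for a scalar input u (hypothetical split of Phi_i)."""
    n = len(e)
    r = sum(e[j] * Qe[j][k] * e[k] for j in range(n) for k in range(n)) + R * u * u
    phi = [drift[j] + b[j] * u for j in range(n)]
    return r + sum(g * p for g, p in zip(grad_q, phi))


def optimal_u(grad_q, b, R):
    """u* = -(1/2) R^{-1} b^T (dQ/de)^T from (28), with scalar R."""
    return -0.5 / R * sum(bj * gj for bj, gj in zip(b, grad_q))


# Hypothetical numbers for one joint subsystem with e_i = (error, error rate).
e, grad_q, drift, b = [0.3, -0.1], [2.0, 0.8], [0.05, -1.2], [0.0, 4.0]
Qe, R = [[10.0, 0.0], [0.0, 1.0]], 0.1
u_star = optimal_u(grad_q, b, R)
print(u_star, hamiltonian(u_star, e, grad_q, drift, b, Qe, R))
```

Perturbing $u_i$ in either direction from `u_star` increases the bracket by exactly $R\,\Delta u^2$, which is condition (2) of Lemma 4 in this scalar case.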

If the optimal $Q$-function $Q_i^*$ is continuously differentiable and known with initial value $Q_i^*(0) = 0$, and the optimal control policy $u_i^*(e_i)$ and the global uncertainty of the subsystem $\Phi_i(e_i, u_i^*)$ are known, then the HJB equation (27) holds and is solvable. In practice, however, $Q_i^*$ is not differentiable everywhere, and $u_i^*(e_i)$ and $\Phi_i(e_i, u_i^*)$ are unknown.

Therefore, it is not feasible to solve the HJB equation by conventional methods. In this paper, we combine the action-critic-identifier (ACI) with an RBF neural network to estimate the optimal control policy, the optimal $Q$-function, and the global uncertainty of the subsystem. The action-NN estimates $u_i^*(e_i)$, denoted $\hat u_i(e_i)$; the critic-NN estimates $Q_i^*$, denoted $\hat Q_i$; and a robust neural network identifier identifies $\Phi_i(e_i, u_i^*)$, denoted $\hat\Phi_i(e_i, u_i^*)$. The block diagram of the ACI architecture is shown in Figure 2. The estimated HJB equation can be expressed as follows:

$$
\widehat{\mathrm{HJB}}{}_i^*(e_i, \hat u_i, \nabla\hat Q_i) = \min_{\hat u_i(e_i)}\Big[r_i(e_i, a_i, \hat u_i) + \nabla\hat Q_i\,\hat\Phi_i(e_i, \hat u_i)\Big]. \tag{29}
$$

The identification error of the HJB equation above can be expressed as

$$
\delta_{hi} = \widehat{\mathrm{HJB}}{}_i^*(e_i, \hat u_i, \nabla\hat Q_i) - \mathrm{HJB}_i^*(e_i, u_i^*, \nabla Q_i^*). \tag{30}
$$

A classic radial basis function neural network, proposed in [25], is given by

$$
N(x) = W^{*T} S(x) + \varepsilon(x), \tag{31}
$$


Figure 2: The architecture of action-critic-identifier (blocks: action, critic, identifier, subsystem, reward function, and HJB error; signals include $e_i$, $u_i$, $\hat Q_i(e_i, a_i, u_i)$, $r_i(e_i, a_i, u_i)$, $\Phi_i(e_i, u_i)$, and $\delta_{hi}$, with an integrator $1/s$).

where $W^*$ denotes the ideal neural network weights and $\varepsilon(x)$ represents the estimation error. With a sufficient number of nodes and appropriately chosen centers and widths, an RBF-NN can approximate any continuous function on a compact set to arbitrary accuracy. Therefore, the optimal $Q$-function and the optimal control policy can be expressed as

$$
\begin{aligned}
Q_i^* &= W_i^T S_i(e_i) + \varepsilon_{ic}(e_i),\\
u_i^*(e_i) &= -\frac{1}{2}R^{-1}b_i^T(e_i)\big[S_i(e_i)^T W_i + \varepsilon_{ia}(e_i)\big],
\end{aligned}
\tag{32}
$$

where $S_i(e_i) = [s_{i1}(e_i), \ldots, s_{in}(e_i)]^T$ is the smooth basis function vector of the neural network, $W_i$ is the ideal unknown weight vector, and $\varepsilon_{ic}(e_i)$ and $\varepsilon_{ia}(e_i)$ are the estimation errors. Using $\hat Q_i$ and $\hat u_i(e_i)$ to estimate $Q_i^*$ and $u_i^*(e_i)$, we obtain

$$
\hat Q_i = \hat W_{ic}^T S_{ic}(e_i), \tag{33}
$$

$$
\hat u_i(e_i) = -\frac{1}{2}R^{-1}b_i^T(e_i)\,S_{ia}(e_i)^T\hat W_{ia}. \tag{34}
$$

In the equations above, $\hat W_{ic}(t)$ and $\hat W_{ia}(t)$ denote the weights of the critic-NN and the action-NN, respectively, and the weight estimation errors are

$$
\tilde W_{ic}(t) = W_i - \hat W_{ic}(t), \tag{35}
$$

$$
\tilde W_{ia}(t) = W_i - \hat W_{ia}(t). \tag{36}
$$
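The critic and actor outputs (33) and (34) can be sketched with a Gaussian RBF basis. The Gaussian form is a common choice but an assumption here, since this section does not fix $S_{ic}$ and $S_{ia}$; the centers, width, and weights are hypothetical, and the input gain $b_i$ and weight $R$ are treated as scalars to keep the dimensions consistent.

```python
import math


def gaussian_basis(e, centers, width):
    """S(e) = [exp(-||e - c_j||^2 / (2 width^2))]_j, an assumed RBF basis."""
    return [math.exp(-sum((ek - ck) ** 2 for ek, ck in zip(e, c)) / (2.0 * width ** 2))
            for c in centers]


def critic_output(e, w_c, centers, width):
    """Q_hat = W_c^T S_c(e), as in (33)."""
    return sum(w * s for w, s in zip(w_c, gaussian_basis(e, centers, width)))


def actor_output(e, w_a, centers, width, b, R):
    """u_hat = -(1/2) R^{-1} b S_a(e)^T W_a, as in (34) with scalar b and R."""
    s_dot_w = sum(w * s for w, s in zip(w_a, gaussian_basis(e, centers, width)))
    return -0.5 / R * b * s_dot_w


centers = [[-0.5, 0.0], [0.0, 0.0], [0.5, 0.0]]   # hypothetical node centers
print(critic_output([0.1, 0.2], [1.0, 2.0, 0.5], centers, 0.4))
print(actor_output([0.1, 0.2], [0.3, -0.2, 0.1], centers, 0.4, b=4.0, R=0.1))
```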

The weight update law of the critic-NN is a gradient descent algorithm:

$$
\dot{\hat W}_{ic}(t) = -n_1 L_i\big(L_i^T\hat W_{ic} + e_i^T Q_e e_i + u_i^T R u_i\big). \tag{37}
$$

In the equation above, $n_1 > 0$ is the adaptive gain of the neural network, and $L_i$ and $l_i$ are defined as

$$
L_i = \frac{l_i}{l_i^T l_i + 1},\qquad l_i = \nabla S_{ic}(e_i)\,\dot e_i. \tag{38}
$$

Therefore, according to the definitions above, the following inequalities hold:

$$
L_{im} \le \|L_i\| \le L_{iM},\qquad S_{icm} \le \|S_{ic}(e_i)\| \le S_{icM},\qquad S_{iam} \le \|S_{ia}(e_i)\| \le S_{iaM}. \tag{39}
$$
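One explicit-Euler step of the critic update (37), with $L_i$ and $l_i$ from (38), might look like the following. It is a sketch under assumptions: a Gaussian basis (so the Jacobian of $S_{ic}$ has the closed form $\partial s_j/\partial e = -s_j(e - c_j)/\sigma^2$), a hypothetical step size `dt` and gain `n1`, and the residual $L_i^T\hat W_{ic} + r_i$ exactly as printed in (37).

```python
import math


def basis_and_jacobian(e, centers, width):
    """Gaussian RBF vector S(e) and the rows dS_j/de of its Jacobian (assumed basis)."""
    s, jac = [], []
    for c in centers:
        sj = math.exp(-sum((ek - ck) ** 2 for ek, ck in zip(e, c)) / (2.0 * width ** 2))
        s.append(sj)
        jac.append([-sj * (ek - ck) / width ** 2 for ek, ck in zip(e, c)])
    return s, jac


def critic_step(w_c, e, de, reward_value, n1, dt, centers, width):
    """One explicit-Euler step of (37): dW/dt = -n1 L (L^T W + r), with L from (38)."""
    _, jac = basis_and_jacobian(e, centers, width)
    l = [sum(jk * dek for jk, dek in zip(row, de)) for row in jac]   # l_i = grad S_ic(e) de
    denom = sum(x * x for x in l) + 1.0
    L = [x / denom for x in l]                                       # L_i = l_i / (l_i^T l_i + 1)
    resid = sum(Lk * wk for Lk, wk in zip(L, w_c)) + reward_value    # L_i^T W_ic + r_i
    return [wk - dt * n1 * Lk * resid for wk, Lk in zip(w_c, L)], L


centers = [[-0.5, 0.0], [0.0, 0.0], [0.5, 0.0]]   # hypothetical node centers
w_c, L = critic_step([0.0, 0.0, 0.0], [0.1, 0.2], [1.0, -2.0], 0.7,
                     n1=20.0, dt=1e-3, centers=centers, width=0.4)
print(w_c, L)
```

The normalization in (38) keeps $\|L_i\| \le 1/2$ however large $l_i$ becomes, because $x/(x^2 + 1) \le 1/2$ for $x \ge 0$; this is one concrete reason bounds of the form (39) can hold.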


Combining (35) with (38), we obtain

$$
\dot{\tilde W}_{ic}(t) = -n_1 L_i\big(L_i^T\tilde W_{ic} + \delta_{hi}\big). \tag{40}
$$

The weight update law of the action-NN is also developed from a gradient descent algorithm:

$$
\dot{\hat W}_{ia}(t) = -n_2 S_{ia}(e_i)\left(\hat W_{ia}^T S_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T\hat W_{ic}\right)^T. \tag{41}
$$
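The action-NN law (41) drives $\hat W_{ia}^T S_{ia}(e_i)$ toward $-\tfrac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T\hat W_{ic}$. The Euler-step sketch below assumes a scalar control channel with $b_i$ a row vector, a Gaussian basis shared by critic and actor (a simplification), and hypothetical gains. At a fixed $e_i$ with a fixed critic, the residual contracts by the factor $1 - n_2\,dt\,\|S_{ia}\|^2$ per step.

```python
import math


def basis_and_jacobian(e, centers, width):
    """Gaussian RBF vector S(e) and the rows dS_j/de of its Jacobian (assumed basis)."""
    s, jac = [], []
    for c in centers:
        sj = math.exp(-sum((ek - ck) ** 2 for ek, ck in zip(e, c)) / (2.0 * width ** 2))
        s.append(sj)
        jac.append([-sj * (ek - ck) / width ** 2 for ek, ck in zip(e, c)])
    return s, jac


def actor_step(w_a, w_c, e, b_row, R, n2, dt, centers, width):
    """One explicit-Euler step of (41); returns new actor weights and the pre-step residual."""
    s_a, jac = basis_and_jacobian(e, centers, width)
    # grad S_ic(e)^T W_ic: the critic's estimate of dQ/de, one entry per error component.
    grad_q = [sum(w_c[j] * jac[j][k] for j in range(len(w_c))) for k in range(len(e))]
    target_term = 0.5 / R * sum(bk * gk for bk, gk in zip(b_row, grad_q))
    resid = sum(wk * sk for wk, sk in zip(w_a, s_a)) + target_term
    return [wk - dt * n2 * sk * resid for wk, sk in zip(w_a, s_a)], resid


centers = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]   # hypothetical node centers
w_a, w_c, e = [0.0, 0.0, 0.0], [1.0, -0.5, 2.0], [0.1, -0.2]
for _ in range(100):
    w_a, resid = actor_step(w_a, w_c, e, [0.0, 4.0], 0.1, 50.0, 0.01, centers, 0.5)
print(resid)
```

Convergence at a fixed $e_i$ requires $0 < n_2\,dt\,\|S_{ia}\|^2 < 2$; in continuous time any $n_2 > 0$ suffices.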

According to the estimation error of the action-NN in (36), and since the optimal control $u_i^*(e_i)$ minimizes the optimal $Q$-function, we obtain

$$
W_{ia}^T S_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T W_{ic} + \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i) + \varepsilon_{ia}(e_i) = 0. \tag{42}
$$

Substituting (41) into (42), we obtain

$$
\dot{\tilde W}_{ia} = -n_2 S_{ia}(e_i)\left(\tilde W_{ia}^T S_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^T\tilde W_{ic} - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i) - \varepsilon_{ia}(e_i)\right). \tag{43}
$$

After using the critic-NN and the action-NN to estimate $Q_i$ and $u_i(e_i)$, we design a robust RBF-NN identifier to identify the nonlinear uncertainties of the subsystem. Here $\Phi_i(e_i, \hat u_i)$ can be expressed as

$$
\Phi_i(e_i, \hat u_i) = \dot e_{iF} = W_{iF}^T\kappa\big(\Lambda_{iF}^T e_{iF}\big) + \varepsilon_{iF}(e_{iF}) + b_i(e_i)\,\hat u_i, \tag{44}
$$

where $\kappa(\cdot)$ denotes the basis function of the neural network and $W_{iF}$ and $\Lambda_{iF}$ denote the unknown ideal neural network weights.

Equation (44) can be identified by the robust RBF-NN identifier, which gives

$$
\hat\Phi_i(e_i, \hat u_i) = \dot{\hat e}_{iF} = \hat W_{iF}^T\hat\kappa_{iF} + b_i(e_i)\,\hat u_i + \mu_i. \tag{45}
$$

Here $\hat\kappa_{iF}$ denotes the estimated value of the basis function of the neural network, $\hat W_{iF}$ and $\hat\Lambda_{iF}$ are the estimated neural network weights, and $\mu_i \in \mathbb{R}$ is the feedback error term [26]:

$$
\begin{aligned}
\mu_i &= k\big(e_{iF}(t) - \hat e_{iF}(t)\big) - k\big(e_{iF}(0) - \hat e_{iF}(0)\big) + \vartheta = k\big(\tilde e_{iF}(t) - \tilde e_{iF}(0)\big) + \vartheta,\\
\dot\vartheta &= (k\alpha + \gamma)\,\tilde e_{iF} + \beta_1\,\mathrm{sat}(\tilde e_{iF}),
\end{aligned}
\tag{46}
$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gain constants and $\mathrm{sat}(\cdot)$ is a saturation function. The state estimation error of the identifier-NN can therefore be expressed as

$$
\dot{\tilde e}_{iF} = \dot e_{iF} - \dot{\hat e}_{iF} = W_{iF}^T\kappa_{iF} - \hat W_{iF}^T\hat\kappa_{iF} + \varepsilon_{iF}(e_{iF}) - \mu_i. \tag{47}
$$
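The feedback term (46) has the RISE-type structure of [26]. The sketch below discretizes it for a scalar error with explicit Euler, reading the second line of (46) as $\dot\vartheta$ with $\vartheta(0) = 0$. The saturation limit of 1, the step size, the class name `FeedbackTerm`, and the use of Table 1's $k$, $\alpha$, $\beta_1$, $\gamma$ as the gains of (46) are assumptions.

```python
def sat(x, limit=1.0):
    """Saturation function; the limit of 1 is an assumption (not given in the paper)."""
    return max(-limit, min(limit, x))


class FeedbackTerm:
    """Discrete-time sketch of mu_i from (46) for a scalar identification error."""

    def __init__(self, k, alpha, beta1, gamma, e_tilde0):
        self.k, self.alpha, self.beta1, self.gamma = k, alpha, beta1, gamma
        self.e_tilde0 = e_tilde0
        self.vartheta = 0.0  # integral state, vartheta(0) = 0 assumed

    def step(self, e_tilde, dt):
        """Return mu at the current sample, then advance vartheta by explicit Euler."""
        mu = self.k * (e_tilde - self.e_tilde0) + self.vartheta
        dvartheta = (self.k * self.alpha + self.gamma) * e_tilde + self.beta1 * sat(e_tilde)
        self.vartheta += dt * dvartheta
        return mu


# Gains read from Table 1 (k = 800, alpha = 300, beta1 = 0.2, gamma = 0.5).
fb = FeedbackTerm(k=800.0, alpha=300.0, beta1=0.2, gamma=0.5, e_tilde0=0.01)
print(fb.step(0.01, 1e-3), fb.step(0.01, 1e-3))
```

With gains this large, the integral part of $\mu_i$ grows quickly for any persistent $\tilde e_{iF}$, which is what pushes the identifier state toward the measured one.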

A filtered identification error is defined as follows

$$
E_i = \dot{\tilde e}_{iF} + \alpha\tilde e_{iF}. \tag{48}
$$

Differentiating (48) gives

$$
\begin{aligned}
\dot E_i ={}& W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot e_{iF} - \hat W_{iF}^T\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF} + \alpha\dot{\tilde e}_{iF} - kE_i - \gamma\tilde e_{iF}\\
&- \dot{\hat W}_{iF}^T\hat\kappa_{iF} + \dot\varepsilon_{iF}(e_{iF}) - \beta_1\,\mathrm{sat}(\tilde e_{iF}) - \hat W_{iF}^T\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^T\hat e_{iF}.
\end{aligned}
\tag{49}
$$

Here the weights $\hat W_{iF}$ and $\hat\Lambda_{iF}$ of the identifier-NN are updated by

$$
\begin{aligned}
\dot{\hat W}_{iF} &= \mathrm{proj}\big(\Gamma_{iWF}\,\hat\kappa'_{iF}\hat\Lambda_{iF}^T\dot{\hat e}_{iF}\,\tilde e_{iF}^T\big),\\
\dot{\hat\Lambda}_{iF} &= \mathrm{proj}\big(\Gamma_{i\Lambda F}\,\dot{\hat e}_{iF}\,\tilde e_{iF}^T\hat W_{iF}^T\hat\kappa'_{iF}\big),
\end{aligned}
\tag{50}
$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive constant adaptation gain matrices. To analyze the convergence of the filtered identification error, the term $-\hat W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}$ in (49) can be decomposed as follows:

$$
\begin{aligned}
-\hat W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}
={}& \tfrac12\bigl(\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T} + \hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\bigr)\dot{\hat e}_{iF}
- \tfrac12\bigl(W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T} + \hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\bigr)\dot{\hat e}_{iF}\\
={}& \tfrac12 W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\tilde e}_{iF} + \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot{\tilde e}_{iF}
- \tfrac12 W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} - \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF}\\
&+ \tfrac12\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF} + \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF},
\end{aligned}\tag{51}
$$

where $\tilde W_{iF}^{T} = W_{iF}^{T} - \hat W_{iF}^{T}$ and $\tilde\Lambda_{iF}^{T} = \Lambda_{iF}^{T} - \hat\Lambda_{iF}^{T}$. Substituting (51) into (49), the latter reduces to

$$
\dot E_i = P_{F1} + P_{F2} + P_{F3} - kE_i - \gamma\tilde e_{iF} - \beta_1\,\mathrm{sat}(\tilde e_{iF}). \tag{52}
$$


In (52), $P_{F1}$, $P_{F2}$, and $P_{F3}$ are, respectively,

$$
P_{F1} = \tfrac12 W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\tilde e}_{iF} + \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot{\tilde e}_{iF} - \hat W_{iF}^{T}\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^{T}\hat e_{iF} + \alpha\dot{\tilde e}_{iF} - \dot{\hat W}_{iF}^{T}\hat\kappa_{iF}, \tag{53}
$$

$$
P_{F2} = -\tfrac12 W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} - \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + \dot\varepsilon_{iF}(e_{iF}), \tag{54}
$$

$$
P_{F3} = \tfrac12\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF} + \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF}. \tag{55}
$$
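The split used in (51), and hence the grouping into $P_{F1}$, $P_{F2}$, and $P_{F3}$, is a purely algebraic identity that relies only on $\dot{\tilde e}_{iF} = \dot e_{iF} - \dot{\hat e}_{iF}$. The sketch below checks it numerically with random matrices. It assumes the shapes $W_{iF},\hat W_{iF}\in\mathbb{R}^{N\times n}$ and $\Lambda_{iF},\hat\Lambda_{iF}\in\mathbb{R}^{n\times N}$. It also uses a diagonal random stand-in for $\hat\kappa'_{iF}$, the same matrix in every term. The function name is hypothetical.

```python
import numpy as np


def decomposition_residual(n=3, N=5, seed=0):
    """Max abs difference between -W_hat^T K Lam_hat^T e_hat_dot and the six terms of (51).

    Assumed shapes: W, W_hat in R^{N x n}; Lam, Lam_hat in R^{n x N}; K stands in for
    the activation Jacobian kappa'_hat (N x N, diagonal here); e_dot, e_hat_dot in R^n.
    """
    rng = np.random.default_rng(seed)
    W, W_hat = rng.normal(size=(N, n)), rng.normal(size=(N, n))
    Lam, Lam_hat = rng.normal(size=(n, N)), rng.normal(size=(n, N))
    K = np.diag(rng.uniform(0.1, 1.0, size=N))
    e_dot, e_hat_dot = rng.normal(size=n), rng.normal(size=n)
    e_tilde_dot = e_dot - e_hat_dot                 # d(e~)/dt = de/dt - d(e_hat)/dt
    W_t, Lam_t = W - W_hat, Lam - Lam_hat           # weight estimation errors

    lhs = -W_hat.T @ K @ Lam_hat.T @ e_hat_dot
    X = W.T @ K @ Lam_hat.T                         # W^T K Lam_hat^T   (n x n)
    Y = W_hat.T @ K @ Lam.T                         # W_hat^T K Lam^T   (n x n)
    p1_part = 0.5 * X @ e_tilde_dot + 0.5 * Y @ e_tilde_dot   # terms collected in P_F1
    p2_part = -0.5 * X @ e_dot - 0.5 * Y @ e_dot              # terms collected in P_F2
    p3 = 0.5 * W_t.T @ K @ Lam_hat.T @ e_hat_dot + 0.5 * W_hat.T @ K @ Lam_t.T @ e_hat_dot
    return float(np.max(np.abs(lhs - (p1_part + p2_part + p3))))
```

A residual at round-off level (around $10^{-15}$) confirms the bookkeeping. The identity would fail if different Jacobians were used in different terms.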

According to Assumption 1, (48), and (50), $P_{F1}$, $P_{F2}$, and $P_{F3}$ are bounded as

$$
\|P_{F1}\| \le J_1\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)\,\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|,\qquad
\|P_{F2}\| \le \varsigma_1,\qquad \|P_{F3}\| \le \varsigma_2. \tag{56}
$$

Combining (53) and (54) with (55), we obtain

$$
\bigl\|\dot P_{F2} + \dot P_{F3}\bigr\| \le \varsigma_3 + \varsigma_4 J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)\,\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|, \tag{57}
$$

where $\varphi_i(\tilde e_{iF}^{T},E_i^{T}) = [\tilde e_{iF}^{T}\;\;E_i^{T}]^{T}$, $J_i(\cdot)$ is a globally invertible nondecreasing function, and $\varsigma_i$ ($i = 1,2,3,4$) are computable positive constants.
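The bounds (56) and (57) rely on the projection in (50) keeping the weight estimates bounded. The paper does not spell out $\mathrm{proj}(\cdot)$, so the sketch below assumes one common choice: a norm-ball projection. Once the Frobenius norm of the estimate reaches a bound, it removes the outward radial component of the update.

The code performs one explicit Euler step of (50). The function names, the bound value, and the shapes ($\hat W_{iF}\in\mathbb{R}^{N\times n}$, $\hat\Lambda_{iF}\in\mathbb{R}^{n\times N}$) are hypothetical.

```python
import numpy as np


def proj(theta_hat, update, bound):
    """Norm-ball projection (assumed form).

    Inside the ball, or when the update points inward, the update is returned unchanged.
    On or outside the boundary with an outward-pointing update, the component along
    theta_hat is removed, so the Frobenius norm does not grow to first order.
    """
    inner = float(np.sum(theta_hat * update))
    norm_sq = float(np.sum(theta_hat * theta_hat))
    if norm_sq < bound ** 2 or inner <= 0.0:
        return update
    return update - (inner / norm_sq) * theta_hat


def identifier_weight_step(W_hat, Lam_hat, kappa_p_hat, e_hat_dot, e_tilde,
                           Gamma_W, Gamma_L, dt, bound=10.0):
    """One explicit-Euler step of the update laws (50).

    W_hat: (N, n), Lam_hat: (n, N), kappa_p_hat: (N, N) activation Jacobian,
    e_hat_dot, e_tilde: (n,), Gamma_W: (N, N), Gamma_L: (n, n).
    """
    outer = np.outer(e_hat_dot, e_tilde)                       # (n, n): e_hat_dot e~^T
    W_dot = proj(W_hat, Gamma_W @ kappa_p_hat @ Lam_hat.T @ outer, bound)
    L_dot = proj(Lam_hat, Gamma_L @ outer @ W_hat.T @ kappa_p_hat, bound)
    return W_hat + dt * W_dot, Lam_hat + dt * L_dot
```

Because the Euler step is discrete, the estimate can overshoot the ball slightly. A smooth projection or a final rescaling are common remedies if strict bounds are required.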

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function

$$
V_{iL}(\tilde e_{iF},E_i) = \tfrac12 E_i^{T}E_i + \tfrac12\gamma\,\tilde e_{iF}^{T}\tilde e_{iF} + \varpi_i(t) + \phi_i(t). \tag{58}
$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are given by

$$
\begin{aligned}
\dot\varpi_i(t) &= -\Bigl[E_i^{T}\bigl(P_{F2} - \beta_1\,\mathrm{sat}(\tilde e_{iF})\bigr) + \dot{\tilde e}_{iF}^{T}P_{F3} - \beta_2 J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\,\|\tilde e_{iF}\|\Bigr],\\
\varpi_i(0) &= \beta_1\bigl|\tilde e_{iF}(0)\bigr| - \tilde e_{iF}^{T}(0)\bigl(P_{F2}(0) + P_{F3}(0)\bigr),
\end{aligned}\tag{59}
$$

$$
\phi_i(t) = \frac{\alpha}{4}\Bigl[\mathrm{tr}\bigl(\tilde W_{iF}^{T}\Gamma_{iWF}^{-1}\tilde W_{iF}\bigr) + \mathrm{tr}\bigl(\tilde\Lambda_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\tilde\Lambda_{iF}\bigr)\Bigr], \tag{60}
$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Define $d = [E_i^{T}\;\;\tilde e_{iF}^{T}\;\;\varpi_i^{1/2}\;\;\phi_i^{1/2}]^{T}$. The positive adaptation gains $\beta_1,\beta_2\in\mathbb{R}$ are chosen to ensure $\varpi_i(t) \ge 0$, so that

$$
U_1(d) \le V_{iL}(\tilde e_{iF},E_i) \le U_2(d), \tag{61}
$$

where

$$
U_1(d) = \tfrac12\min(1,\gamma)\,\|d\|^{2},\qquad U_2(d) = \max(1,\gamma)\,\|d\|^{2}. \tag{62}
$$
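As a quick sanity check of (61) and (62), the sketch below evaluates $V_{iL}$ from (58) together with $U_1$ and $U_2$. It assumes $\varpi_i,\phi_i\ge 0$, as required above, so that $\|d\|^2 = \|E_i\|^2 + \|\tilde e_{iF}\|^2 + \varpi_i + \phi_i$. The function name is hypothetical.

```python
import numpy as np


def lyapunov_sandwich(E, e_tilde, varpi, phi, gamma):
    """Return (U1, V_iL, U2) for (58), (61), (62); assumes varpi, phi >= 0 and gamma > 0."""
    E = np.asarray(E, dtype=float)
    e_tilde = np.asarray(e_tilde, dtype=float)
    V = 0.5 * E @ E + 0.5 * gamma * e_tilde @ e_tilde + varpi + phi
    # ||d||^2 for d = [E^T, e~^T, varpi^(1/2), phi^(1/2)]^T
    d_sq = E @ E + e_tilde @ e_tilde + varpi + phi
    U1 = 0.5 * min(1.0, gamma) * d_sq
    U2 = max(1.0, gamma) * d_sq
    return U1, V, U2
```

The lower bound holds because every coefficient in $V_{iL}$ is at least $\tfrac12\min(1,\gamma)$. The upper bound holds because every coefficient is at most $\max(1,\gamma)$.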

The time derivative of (58) is

$$
\dot V_{iL}(\tilde e_{iF},E_i) = \nabla V_{iL}^{T}\,K\Bigl[\dot E_i^{T}\;\;\dot{\tilde e}_{iF}^{T}\;\;\tfrac12\varpi_i^{-1/2}\dot\varpi_i\;\;\tfrac12\phi_i^{-1/2}\dot\phi_i\Bigr]^{T}, \tag{63}
$$

where $K[\cdot]$ denotes a Filippov set [27]. Hence, $\dot V_{iL}(\tilde e_{iF},E_i)$ can be written as

$$
\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i) ={}& \bigl[E_i^{T}\;\;\gamma\tilde e_{iF}^{T}\;\;2\varpi_i^{1/2}\;\;2\phi_i^{1/2}\bigr]\,K\Bigl[\dot E_i^{T}\;\;\dot{\tilde e}_{iF}^{T}\;\;\tfrac12\varpi_i^{-1/2}\dot\varpi_i\;\;\tfrac12\phi_i^{-1/2}\dot\phi_i\Bigr]^{T}\\
\le{}& E_i^{T}\Bigl(\tfrac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\tilde e}_{iF}+\tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot{\tilde e}_{iF}-\hat W_{iF}^{T}\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^{T}\hat e_{iF}+\alpha\dot{\tilde e}_{iF}\\
&\qquad-\tfrac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF}-\tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF}-\gamma\tilde e_{iF}-\dot{\hat W}_{iF}^{T}\hat\kappa_{iF}+W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF}+\dot\varepsilon_{iF}(e_{iF})\\
&\qquad+\tfrac12\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}+\tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF}-kE_i-\beta_1K[\mathrm{sat}(\tilde e_{iF})]\Bigr)\\
&-E_i^{T}\Bigl(-\tfrac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF}-\tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF}+W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF}+\dot\varepsilon_{iF}(e_{iF})-\beta_1K[\mathrm{sat}(\tilde e_{iF})]\Bigr)\\
&-\dot{\tilde e}_{iF}^{T}\Bigl(\tfrac12\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}+\tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF}\Bigr)+\gamma\tilde e_{iF}^{T}(E_i-\alpha\tilde e_{iF})\\
&+\beta_2J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\,\|\tilde e_{iF}\|-\frac{\alpha}{2}\mathrm{tr}\bigl(\tilde W_{iF}^{T}\Gamma_{iWF}^{-1}\dot{\hat W}_{iF}\bigr)-\frac{\alpha}{2}\mathrm{tr}\bigl(\tilde\Lambda_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\dot{\hat\Lambda}_{iF}\bigr).
\end{aligned}\tag{64}
$$

Grouping the terms of (64) according to (53), (54), and (55) gives

$$
\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i) ={}& E_i^{T}\bigl(P_{F1}+P_{F2}+P_{F3}-\beta_1K[\mathrm{sat}(\tilde e_{iF})]-kE_i-\gamma\tilde e_{iF}\bigr)+\gamma\tilde e_{iF}^{T}(E_i-\alpha\tilde e_{iF})\\
&-E_i^{T}\bigl(P_{F2}-\beta_1K[\mathrm{sat}(\tilde e_{iF})]\bigr)-\dot{\tilde e}_{iF}^{T}P_{F3}+\beta_2J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\,\|\tilde e_{iF}\|\\
&-\frac{\alpha}{2}\mathrm{tr}\bigl(\tilde W_{iF}^{T}\Gamma_{iWF}^{-1}\dot{\hat W}_{iF}\bigr)-\frac{\alpha}{2}\mathrm{tr}\bigl(\tilde\Lambda_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\dot{\hat\Lambda}_{iF}\bigr)\\
={}& -\alpha\gamma\tilde e_{iF}^{T}\tilde e_{iF}-kE_i^{T}E_i+E_i^{T}P_{F1}+\bigl(E_i^{T}-\dot{\tilde e}_{iF}^{T}\bigr)P_{F3}+\beta_2J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\,\|\tilde e_{iF}\|\\
&-\frac{\alpha}{2}\mathrm{tr}\bigl(\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}\tilde e_{iF}^{T}\bigr)-\frac{\alpha}{2}\mathrm{tr}\bigl(\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF}\tilde e_{iF}^{T}\hat W_{iF}^{T}\hat\kappa'_{iF}\bigr)\\
\le{}& -k_1\|\tilde e_{iF}\|^{2}-k_2\|E_i\|^{2}+\frac{J_1\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)^{2}}{4k_3}\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^{2}+\frac{\beta_2^{2}J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)^{2}}{4\alpha k_4}\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^{2},
\end{aligned}\tag{65}
$$

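where $k_{\min} = \min\{k_1,k_2\}$, $\xi = \min\{k_3,\,\alpha k_4/\beta_2^{2}\}$, and $J\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)^{2} = J_1\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)^{2} + J_2\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)^{2}$.

In the second step of (65), $E_i - \dot{\tilde e}_{iF} = \alpha\tilde e_{iF}$ by (48). With the update laws (50) and the projection inactive, the $P_{F3}$ term therefore cancels the two trace terms. The last step applies (56) and Young's inequality, with $k_1,\dots,k_4$ positive constants. Hence, the following conclusion can be obtained: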

$$
\dot V_{iL}(\tilde e_{iF},E_i) \le -k_{\min}\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^{2}+\frac{J\bigl(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|\bigr)^{2}\,\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^{2}}{4\xi} \le -c\,\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^{2}. \tag{66}
$$
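The step from (65) to (66) and (67) can be made concrete numerically. The sketch below does three things:

- computes $k_{\min}$ and $\xi$ as defined after (65);
- evaluates the middle expression of (66) as a function of $s=\|\varphi_i\|$;
- returns the radius $J^{-1}(2\sqrt{k_{\min}\xi})$ of the region $D$.

The choice $J(s)=s$ is a hypothetical placeholder, since the paper only requires $J$ to be globally invertible and nondecreasing. All gain values in the usage are illustrative.

```python
import math


def gains(k1, k2, k3, k4, alpha, beta2):
    """k_min = min{k1, k2} and xi = min{k3, alpha*k4/beta2^2}, as defined after (65)."""
    return min(k1, k2), min(k3, alpha * k4 / beta2 ** 2)


def bound_66(s, k_min, xi, J=lambda s: s):
    """Middle expression of (66): -k_min s^2 + J(s)^2 s^2 / (4 xi), with s = ||phi_i||."""
    return -k_min * s ** 2 + J(s) ** 2 * s ** 2 / (4.0 * xi)


def region_radius(k_min, xi, J_inv=lambda y: y):
    """Radius J^{-1}(2 sqrt(k_min xi)) of the region D in (67); J_inv is an assumption."""
    return J_inv(2.0 * math.sqrt(k_min * xi))
```

With $J(s)=s$, the expression is negative exactly for $0<s<2\sqrt{k_{\min}\xi}$, so larger gains (hence larger $k_{\min}$ and $\xi$) enlarge $D$.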

Therefore, for a suitable positive constant $c$, $-c\,\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^{2}$ is a negative semidefinite function on the adjustable region $D$ given by

$$
D = \bigl\{\,d(t) \;\big|\; \|d\| \le J^{-1}\bigl(2\sqrt{k_{\min}\xi}\bigr)\bigr\}, \tag{67}
$$

so Lyapunov stability theory shows that the system is stable.

To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, this paper designs a novel decentralized reinforcement learning robust optimal tracking controller. A robust term compensates for the neural network approximation errors. The robust control term is designed as

$$
u_{irb} = \frac{N_{rb}\,e_i}{e_i^{T}e_i + \zeta}. \tag{68}
$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ satisfies

$$
\begin{aligned}
N_{rb} \ge{}& \Biggl[\frac{\delta_{hi}^{2}}{2n_1} + \frac{n_1}{2n_2}\Bigl(-\varepsilon_{ia}(e_i) - \tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)^{2} + n_1n_2\|b_i(e_i)\|^{2}\Bigl(-\varepsilon_{ia}(e_i) - \tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)^{2}\Biggr]\frac{e_i^{T}e_i + \zeta}{2n_1n_2\|b_i(e_i)\|\,e_i^{T}e_i}\\
={}& \Bigl[n_1^{2}\Bigl(-\varepsilon_{ia}(e_i) - \tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)^{2} + n_2\delta_{hi}^{2} + 2n_1^{2}n_2^{2}\|b_i(e_i)\|^{2}\Bigl(-\varepsilon_{ia}(e_i) - \tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)^{2}\Bigr]\frac{e_i^{T}e_i + \zeta}{4n_1^{2}n_2^{2}\|b_i(e_i)\|\,e_i^{T}e_i}.
\end{aligned}\tag{69}
$$
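The two lines of (69) are the same quantity: multiplying the numerator and denominator of the first line by $2n_1n_2$ gives the second. The sketch below checks this numerically. It uses hypothetical inputs:

- `X_sq` stands for the nonnegative scalar $\bigl(-\varepsilon_{ia}(e_i) - \tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\bigr)^2$;
- `b_norm` stands for $\|b_i(e_i)\|$;
- `ee` stands for $e_i^{T}e_i$.

```python
import math


def nrb_bound_first(delta_h, n1, n2, b_norm, X_sq, ee, zeta):
    """First line of (69); ee = e_i^T e_i must be positive."""
    bracket = delta_h ** 2 / (2 * n1) + n1 * X_sq / (2 * n2) + n1 * n2 * b_norm ** 2 * X_sq
    return bracket * (ee + zeta) / (2 * n1 * n2 * b_norm * ee)


def nrb_bound_second(delta_h, n1, n2, b_norm, X_sq, ee, zeta):
    """Second line of (69): numerator and denominator of the first line times 2*n1*n2."""
    bracket = n1 ** 2 * X_sq + n2 * delta_h ** 2 + 2 * n1 ** 2 * n2 ** 2 * b_norm ** 2 * X_sq
    return bracket * (ee + zeta) / (4 * n1 ** 2 * n2 ** 2 * b_norm * ee)
```

The bound grows like $1/(e_i^{T}e_i)$ as $e_i\to 0$, which the code reflects through the `ee` argument.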

Therefore, the overall control law is designed as

$$
u_{\mathrm{mix}} = u_i + u_{irb} = -\tfrac12R^{-1}b_i^{T}(e_i)\,S_{ia}(e_i)\,\hat W_{ia}^{T} + \frac{N_{rb}\,e_i}{e_i^{T}e_i + \zeta}. \tag{70}
$$
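A minimal sketch of (68) and (70) follows. It makes two assumptions:

- the optimal part $u_i$ has already been produced by the action-NN and is passed in as `u_opt`;
- for simplicity, `u_opt` and $e_i$ have the same dimension, so the sum in (70) is elementwise.

It also illustrates a property of (68): $\|u_{irb}\| = N_{rb}\|e_i\|/(\|e_i\|^2+\zeta) \le N_{rb}/(2\sqrt{\zeta})$, with equality at $\|e_i\|=\sqrt{\zeta}$. The robust term is therefore bounded and continuous at $e_i=0$. The function names are hypothetical.

```python
import numpy as np


def robust_term(e, N_rb, zeta):
    """u_irb = N_rb * e / (e^T e + zeta) from (68); zeta > 0 keeps it finite at e = 0."""
    e = np.asarray(e, dtype=float)
    return N_rb * e / (e @ e + zeta)


def mixed_control(u_opt, e, N_rb, zeta):
    """u_mix = u_i + u_irb from (70); u_opt stands in for the action-NN output u_i."""
    return np.asarray(u_opt, dtype=float) + robust_term(e, N_rb, zeta)
```

Compared with a discontinuous sign-type robust term, this smooth form avoids chattering at the cost of not being exactly dead-beat near $e_i=0$.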

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). Suppose the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted. Then the closed-loop system is stable, and the actual output tracks the desired trajectory asymptotically.

Proof. Choose the Lyapunov function

$$
V_{iu}(e_i,u_{\mathrm{mix}}) = \frac{1}{2n_1}\mathrm{tr}\{\tilde W_{ic}^{T}\tilde W_{ic}\} + \frac{n_1}{2n_2}\mathrm{tr}\{\tilde W_{ia}^{T}\tilde W_{ia}\} + n_1n_2\Bigl[e_{iF}^{T}e_{iF} + \Xi\int_{t}^{\infty} r_i(e_i,u_{\mathrm{mix}})\,d\tau\Bigr], \tag{71}
$$

where $\Xi > 0$ is a parameter to be determined and $t \le \tau < \infty$. The time derivative of (71) is

$$
\begin{aligned}
\dot V_{iu}(e_i,u_{\mathrm{mix}}) ={}& \frac{1}{2n_1}\mathrm{tr}\{\tilde W_{ic}^{T}\dot{\tilde W}_{ic}\} + \frac{n_1}{2n_2}\mathrm{tr}\{\tilde W_{ia}^{T}\dot{\tilde W}_{ia}\} + n_1n_2\bigl[e_{iF}^{T}\dot e_{iF} - \Xi\,r_i(e_i,u_{\mathrm{mix}})\bigr]\\
={}& \frac{1}{2n_1}\mathrm{tr}\bigl\{\tilde W_{ic}^{T}\bigl(-n_1L_i(L_i^{T}\tilde W_{ic}+\delta_{hi})\bigr)\bigr\}\\
&+ \frac{n_1}{2n_2}\mathrm{tr}\Bigl\{\tilde W_{ia}^{T}\Bigl[-n_2S_{ia}(e_i)\Bigl(\tilde W_{ia}^{T}S_{ia}(e_i)-\varepsilon_{ia}(e_i)+\tfrac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde W_{ic}-\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)\Bigr]\Bigr\}\\
&+ n_1n_2e_{iF}^{T}\bigl(W_{iF}^{T}\kappa(\Lambda_{iF}^{T}e_i)+\varepsilon_{iF}(e_i)+b_i(e_i)u_{\mathrm{mix}}\bigr) - n_1n_2\Xi\bigl(e_i^{T}Q_ee_i+u_{\mathrm{mix}}^{T}Ru_{\mathrm{mix}}\bigr)\\
\le{}& -\Bigl(L_{im}^{2}-\frac{n_1}{2}L_{iM}^{2}\Bigr)\|\tilde W_{ic}\|^{2}+\frac{1}{2n_1}\delta_{hi}^{2}-\Bigl(n_1S_{iam}^{2}-\frac34n_1n_2S_{iaM}^{2}\Bigr)\|\tilde W_{ia}\|^{2}\\
&+\frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\nabla S_{icM}^{2}\|\tilde W_{ic}\|^{2}\\
&+\frac{n_1}{2n_2}\Bigl(\varepsilon_{ia}(e_i)+\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)^{T}\Bigl(\varepsilon_{ia}(e_i)+\tfrac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Bigr)\\
&+n_1n_2\|b_i(e_i)\|^{2}\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i)+n_1n_2S_{iaM}^{2}\|b_i(e_i)\|^{2}\|\tilde W_{ia}\|^{2}\\
&-n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}+n_1n_2\bigl(\|b_i(e_i)\|^{2}-\Xi\lambda_{\min}(R)\bigr)\|u_{\mathrm{mix}}\|^{2}.
\end{aligned}\tag{72}
$$

If the following inequalities hold:

$$
\begin{gathered}
\lambda_{\min}(Q_e)\,\|e_{iF}\|_2^{2} \le e_{iF}^{T}Q_ee_{iF} \le \lambda_{\max}(Q_e)\,\|e_{iF}\|_2^{2},\\
\lambda_{\min}(R)\,\|u_{\mathrm{mix}}\|_2^{2} \le u_{\mathrm{mix}}^{T}Ru_{\mathrm{mix}} \le \lambda_{\max}(R)\,\|u_{\mathrm{mix}}\|_2^{2},\\
\Xi > \frac{\|b_i(e_i)\|^{2}}{\lambda_{\min}(R)},
\end{gathered}\tag{73}
$$

then $\dot V_{iu}(e_i,u_{\mathrm{mix}})$ can be further bounded as

$$
\begin{aligned}
\dot V_{iu}(e_i,u_{\mathrm{mix}}) \le{}& -\Bigl(L_{im}^{2}-\frac{n_1}{2}L_{iM}^{2}-\frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\nabla S_{icM}^{2}\Bigr)\|\tilde W_{ic}\|^{2}\\
&-\Bigl(n_1S_{iam}^{2}-\frac34n_1n_2S_{iaM}^{2}-n_1n_2S_{iaM}^{2}\|b_i(e_i)\|^{2}\Bigr)\|\tilde W_{ia}\|^{2}\\
&-n_1n_2\bigl(\Xi\lambda_{\min}(R)-\|b_i(e_i)\|^{2}\bigr)\|u_{\mathrm{mix}}\|^{2}-n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}\\
\le{}& -n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}.
\end{aligned}\tag{74}
$$

Therefore, $\dot V_{iu}(e_i,u_{\mathrm{mix}}) < 0$ for $e_{iF} \neq 0$.
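Conditions (73) are easy to check numerically. The first two are the Rayleigh-quotient bounds, which hold for any symmetric matrix. The third gives a lower bound on $\Xi$. The sketch below uses `numpy.linalg.eigvalsh`, which returns the eigenvalues of a symmetric matrix in ascending order. The matrices and the value of $\|b_i(e_i)\|$ in the usage are illustrative assumptions.

```python
import numpy as np


def rayleigh_bounds_hold(Q, x, tol=1e-12):
    """Check lambda_min(Q)||x||^2 <= x^T Q x <= lambda_max(Q)||x||^2 for symmetric Q."""
    Q = np.asarray(Q, dtype=float)
    x = np.asarray(x, dtype=float)
    lam = np.linalg.eigvalsh(Q)          # eigenvalues in ascending order
    quad, norm_sq = x @ Q @ x, x @ x
    return lam[0] * norm_sq - tol <= quad <= lam[-1] * norm_sq + tol


def xi_lower_bound(b_norm, R):
    """Right-hand side of the last condition in (73): ||b_i(e_i)||^2 / lambda_min(R)."""
    return b_norm ** 2 / np.linalg.eigvalsh(np.asarray(R, dtype=float))[0]
```

For example, with $\|b_i(e_i)\| = 2$ and $R = \mathrm{diag}(0.5, 2)$, any $\Xi > 8$ satisfies the last condition.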

4 Simulations

This section has two aims: to verify the validity and convergence of the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method, and to study the error convergence through comparative simulations. Two configurations of the reconfigurable modular robot with time varying external constraints are used, as shown in Figures 3 and 4.

To facilitate the analysis, these configurations are transformed into the analytic charts shown in Figures 5 and 6. In the charts, $L_1$, $L_2$, and $L_4$ are the link lengths, and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint is modeled as a column that rotates over time. The constraint equations of configurations A and B are

$$
\begin{aligned}
\Psi_A(q,t) &= L_1\cos q_1 + L_2\cos q_2 - \bigl[L_3 + L_4\cot\alpha(t)\bigr],\\
\Psi_B(q,t) &= L_1 + L_2\cos q_2 - \bigl[L_3 + L_4\cot\alpha(t)\bigr].
\end{aligned}\tag{75}
$$

In the equation above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$
\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2}. \tag{76}
$$
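The constraint (75) with the rotating angle (76) can be evaluated directly. The sketch below reads (76) as $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$. It takes the link lengths as arguments because their numerical values are not given here; the test values are hypothetical. Since $\alpha(t)$ stays within roughly $[2.16, 2.56]$ rad, $\cot\alpha(t)$ is always finite.

```python
import math


def alpha_t(t):
    """Constraint angle alpha(t) = 0.75*pi + 0.2*sin(t/2), reading of (76)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)


def psi_A(q1, q2, t, L1, L2, L3, L4):
    """Constraint function of configuration A in (75); zero when the constraint is met."""
    return L1 * math.cos(q1) + L2 * math.cos(q2) - (L3 + L4 / math.tan(alpha_t(t)))


def psi_B(q2, t, L1, L2, L3, L4):
    """Constraint function of configuration B in (75); joint 1 does not enter."""
    return L1 + L2 * math.cos(q2) - (L3 + L4 / math.tan(alpha_t(t)))
```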

The initial joint positions are $q_1(0) = 2$ and $q_2(0) = 2$ in configuration A, and $q_1(0) = 2$ and $q_2(0) = 2$ in configuration B. The initial joint velocities are zero. The dynamic models of configurations A and B are

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos(q_2) + 0.6066 & 0.18\cos(q_2) + 0.1233\\ 0.18\cos(q_2) + 0.1233 & 0.1233\end{bmatrix},\qquad
M_B(q) = \begin{bmatrix} 0.17 - 0.1166\cos^{2}(q_2) & -0.06\cos(q_2)\\ -0.06\cos(q_2) & 0.1233\end{bmatrix},\\
C_A(q,\dot q) &= \begin{bmatrix} -0.36\sin(q_2)\dot q_2 & -0.18\sin(q_2)\dot q_2\\ 0.18\sin(q_2)(\dot q_1 - \dot q_2) & 0.18\sin(q_2)\dot q_1\end{bmatrix},\qquad
C_B(q,\dot q) = \begin{bmatrix} 0.1166\sin(2q_2)\dot q_2 & 0.06\sin(q_2)\dot q_2\\ 0.06\sin(q_2)\dot q_2 & 0\end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1+q_2) - 17.64\sin(q_1)\\ -5.88\sin(q_1+q_2)\end{bmatrix},\qquad
G_B(q) = \begin{bmatrix} 0\\ -5.88\cos(q_2)\end{bmatrix},\\
F_A(q,\dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1)\\ 12\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2)\end{bmatrix},\qquad
F_B(q,\dot q) = \begin{bmatrix} 0\\ 15\dot q_2 + \sin(q_2) + 12\,\mathrm{sgn}(\dot q_2)\end{bmatrix}.
\end{aligned}\tag{77}
$$
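For simulation, the matrices in (77) are typically used in the standard form $M(q)\ddot q + C(q,\dot q)\dot q + G(q) + F(q,\dot q) = \tau$. That form and the joint torque input $\tau$ are assumptions here, since the general model (14) is not restated in this section. The sketch below builds the configuration A matrices and solves for the joint accelerations.

The coefficients are transcribed as they appear in (77). Decimal points in the source text are uncertain for some friction terms (for example, `10` and `12`), so treat those numbers as placeholders to be checked against the original.

```python
import numpy as np


def dynamics_A(q, dq):
    """M, C, G, F of configuration A as given in (77)."""
    q1, q2 = q
    dq1, dq2 = dq
    c2, s2 = np.cos(q2), np.sin(q2)
    M = np.array([[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
                  [0.18 * c2 + 0.1233, 0.1233]])
    C = np.array([[-0.36 * s2 * dq2, -0.18 * s2 * dq2],
                  [0.18 * s2 * (dq1 - dq2), 0.18 * s2 * dq1]])
    G = np.array([-5.88 * np.sin(q1 + q2) - 17.64 * np.sin(q1),
                  -5.88 * np.sin(q1 + q2)])
    F = np.array([dq1 + 10 * np.sin(3 * q1) + 2 * np.sign(dq1),
                  12 * dq2 + 5 * np.sin(2 * q2) + np.sign(dq2)])
    return M, C, G, F


def joint_acceleration_A(q, dq, tau):
    """Solve M(q) ddq = tau - C dq - G - F (assumed model form) for ddq."""
    M, C, G, F = dynamics_A(q, dq)
    rhs = np.asarray(tau, dtype=float) - C @ np.asarray(dq, dtype=float) - G - F
    return np.linalg.solve(M, rhs)
```

The Coulomb terms (`np.sign`) make the right-hand side discontinuous at zero velocity. A fixed small integration step, or a smoothed sign function, is a common practical choice.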

The desired trajectories of configurations A and B are as follows. Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos(t) + 0.2\sin(3t),\\
y_{2d} &= \Theta(y_{1d},t) = \arcsin\Bigl[\frac{L_1\sin\bigl(\alpha(t) - y_{1d}\bigr) - L_3\sin\bigl(\alpha(t)\bigr)}{L_2}\Bigr] + \alpha(t).
\end{aligned}\tag{78}
$$


Figure 3 Configuration A for simulation

Figure 4 Configuration B for simulation

Configuration B

$$
\begin{aligned}
y_{1d} &= 0,\\
y_{2d} &= \Theta(y_{1d},t) = \arcsin\Bigl[\frac{L_1\sin\bigl(\alpha(t)\bigr) - L_3\sin\bigl(\alpha(t)\bigr)}{L_2}\Bigr] + \alpha(t).
\end{aligned}\tag{79}
$$

Because of the dynamic external constraint, joint 1 in configuration B does not move; its desired trajectory is zero.
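The desired trajectories (78) and (79) can be generated as below, under these assumptions:

- the link lengths $L_1$, $L_2$, $L_3$ are inputs, because their values are not listed in this section; the test values are hypothetical;
- $\alpha(t)$ is read from (76) as $0.75\pi + 0.2\sin(t/2)$.

Python's `math.asin` raises `ValueError` when its argument leaves $[-1,1]$. This conveniently flags link lengths for which $\Theta$ is undefined.

```python
import math


def alpha_t(t):
    """Constraint angle from (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)


def desired_A(t, L1, L2, L3):
    """(y1d, y2d) for configuration A, following (78)."""
    y1 = 0.5 * math.cos(t) + 0.2 * math.sin(3 * t)
    a = alpha_t(t)
    y2 = math.asin((L1 * math.sin(a - y1) - L3 * math.sin(a)) / L2) + a
    return y1, y2


def desired_B(t, L1, L2, L3):
    """(y1d, y2d) for configuration B, following (79); joint 1 is held at zero."""
    a = alpha_t(t)
    y2 = math.asin((L1 * math.sin(a) - L3 * math.sin(a)) / L2) + a
    return 0.0, y2
```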

Comparative simulations are performed with two methods: the classic RBF neural network control method, and the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method. The aims are to confirm that the proposed method applies to different configurations, and to verify how well the ACI and RBF-NN based method tracks the desired trajectories of the subsystems.

Figures 7 to 14 show the joint tracking and error curves obtained when the classic RBF neural network [4] compensates for the dynamic nonlinear term and the interconnection term in the subsystem. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11 to 14 show that, when time varying constraints are present, the classical neural control method cannot compensate well for the joint subsystem constraints of the reconfigurable modular robot. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness, and the tracking errors increase.

Figure 5: The analytic chart of configuration A (links $L_1$ to $L_4$, joint angles $q_1$ and $q_2$, constraint angle $\alpha$, axes $X$ and $Y$).

Figure 6: The analytic chart of configuration B (links $L_1$ to $L_4$, joint angles $q_1$ and $q_2$, constraint angle $\alpha$, axes $X$ and $Y$).

Figures 15 to 22 show the joint tracking and error curves obtained when ACI identifies the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in the subsystem. Figures 15 to 18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the identification ability of ACI, which identifies the uncertainties in the subsystems in a short time. The error curves in Figures 19 to 22 show that the tracking error is very small and converges to zero quickly. In addition, the joint subsystems exhibit robustness under the proposed decentralized control method.

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. They show that the proposed decentralized control method satisfies the reachability requirement of the reconfigurable modular robot, and that no singular displacements of the joint subsystems or the end-effector occur. The parameters used in ACI are listed in Table 1.


Figure 7: Trajectory tracking curve of configuration A, joint 1, with the RBF neural network (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A, joint 2, with the RBF neural network (joint 2 position (rad) versus time (s); desired and actual trajectories).

Table 1: Parameter list of the action-critic-identifier.

$$
\begin{array}{ccccccccc}
k & \alpha & \upsilon & \eta_{a1} & \eta_{a2} & \eta_{c} & \beta_1 & \beta_2 & \gamma\\
\hline
800 & 300 & 0.005 & 10 & 50 & 20 & 0.2 & 2 & 0.5
\end{array}
$$

The simulation results show that, unlike the classic RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. In both configurations, the joint variables track the desired trajectories within a very short time, and the error fluctuation after convergence is minimal.

Figure 9: Trajectory tracking curve of configuration B, joint 1, with the RBF neural network (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B, joint 2, with the RBF neural network (joint 2 position (rad) versus time (s); desired and actual trajectories).

Figure 11: Tracking error curve of configuration A, joint 1, with the RBF neural network (joint 1 error (rad) versus time (s)).


Figure 12: Tracking error curve of configuration A, joint 2, with the RBF neural network (joint 2 error (rad) versus time (s)).

Figure 13: Tracking error curve of configuration B, joint 1, with the RBF neural network (joint 1 error (rad) versus time (s)).

Figure 14: Tracking error curve of configuration B, joint 2, with the RBF neural network (joint 2 error (rad) versus time (s)).

Figure 15: Trajectory tracking curve of configuration A, joint 1, with ACI-based reinforcement learning (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A, joint 2, with ACI-based reinforcement learning (joint 2 position (rad) versus time (s); desired and actual trajectories).

5 Conclusions and Future Work

This paper combines ACI with an RBF neural network to propose a novel decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots. The theory addresses the continuous time nonlinear optimal control problem for strongly coupled, uncertain robotic systems.

First, the subsystem model with time varying external constraints is built, and the global robot system is described as a synthesis of interconnected subsystems. Second, ACI is used to estimate the HJB equation and the global uncertainty. A continuous time optimal $Q$-function replaces the traditional optimal value function. The optimal $Q$-function and the optimal control policy are approximated by the critic-NN and the action-NN, and the global uncertainty is identified by the identifier. Third, a novel decentralized robust optimal tracking controller is designed so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two Lyapunov functions are constructed to confirm the stability of ACI and of the subsystem. Finally, comparative simulations on two different robot configurations are presented to confirm the superiority of the proposed control theory.

Figure 17: Trajectory tracking curve of configuration B, joint 1, with ACI-based reinforcement learning (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B, joint 2, with ACI-based reinforcement learning (joint 2 position (rad) versus time (s); desired and actual trajectories).

In the future, more complex configurations of time varying externally constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI-based reinforcement learning (joint 1 error (rad) versus time (s)).

Figure 20: Tracking error curve of configuration A joint 2 with ACI-based reinforcement learning (joint 2 error (rad) versus time (s)).

Figure 21: Tracking error curve of configuration B joint 1 with ACI-based reinforcement learning (joint 1 error (rad) versus time (s)).

Figure 22: Tracking error curve of configuration B joint 2 with ACI-based reinforcement learning (joint 2 error (rad) versus time (s)).

Figure 23: 3D-tip trajectory curve of configuration A with ACI (desired vs. actual trajectory).

Figure 24: 3D-tip trajectory curve of configuration B with ACI (desired vs. actual trajectory).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, "Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode," INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, "Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control," Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Networks, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


Figure 2: The architecture of action-critic-identifier (blocks: action, critic, identifier, subsystem, and reward function; signals include the Q-function $Q_i(e_i, a_i, u_i)$, the reward $r_i(e_i, a_i, u_i)$, the identifier output $\Phi_i(e_i, u_i)$, and the HJB error $\delta_{hi}$).

where $W^*$ denotes the ideal neural network weights and $\varepsilon(x)$ represents the estimation error. With a sufficient number of nodes and appropriately chosen centers and widths, an RBF-NN can approximate any continuous function on a compact set to arbitrary accuracy. Therefore, the optimal Q-function and the optimal control policy can be expressed as follows:

$$
Q_i^* = W_i^T S_i(e_i) + \varepsilon_{ic}(e_i), \qquad
u_i^*(e_i) = -\frac{1}{2} R^{-1} b_i^T(e_i)\left[S_i(e_i)^T W_i + \varepsilon_{ia}(e_i)\right] \tag{32}
$$

where $S_i(e_i) = [s_{i1}(e_i) \;\cdots\; s_{in}(e_i)]^T$ is the smooth basis function vector of the neural network, $W_i$ denotes the ideal unknown neural network weights, and $\varepsilon_{ic}(e_i)$ and $\varepsilon_{ia}(e_i)$ are the estimation errors. Using $\hat{Q}_i$ and $\hat{u}_i(e_i)$ to estimate $Q_i^*$ and $u_i^*(e_i)$, we obtain

$$
\hat{Q}_i = \hat{W}_{ic}^T S_{ic}(e_i) \tag{33}
$$

$$
\hat{u}_i(e_i) = -\frac{1}{2} R^{-1} b_i^T(e_i) S_{ia}(e_i)^T \hat{W}_{ia} \tag{34}
$$

In the equations above, $\hat{W}_{ic}(t)$ and $\hat{W}_{ia}(t)$ denote the weights of the critic-NN and the action-NN, respectively. The weight estimation errors are

$$
\tilde{W}_{ic}(t) = W_i - \hat{W}_{ic}(t) \tag{35}
$$

$$
\tilde{W}_{ia}(t) = W_i - \hat{W}_{ia}(t) \tag{36}
$$
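A minimal sketch of how the RBF approximations in (32)–(34) can be evaluated, in Python with NumPy. It assumes Gaussian basis functions with hand-picked centers and a common width (the paper does not fix these here) and a scalar joint input, and it reads the bracketed term of the optimal policy in the usual HJB form, i.e., as the gradient of the critic output with respect to $e_i$; that reading and all function names are assumptions for illustration only.

```python
import numpy as np

def rbf_features(e, centers, width):
    """Gaussian RBF vector S(e), one entry per node (Gaussian shape is assumed)."""
    sq_dist = np.sum((centers - e) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def rbf_gradient(e, centers, width):
    """Jacobian dS/de of the Gaussian basis, shape (n_nodes, dim_e)."""
    s = rbf_features(e, centers, width)
    return -((e - centers) / width ** 2) * s[:, None]

def critic_q(W_c, e, centers, width):
    """Critic output Q_hat = W_c^T S_c(e), cf. (33)."""
    return float(W_c @ rbf_features(e, centers, width))

def critic_implied_control(W_c, e, b, R, centers, width):
    """-1/2 R^{-1} b^T (dS/de)^T W_c for a scalar input (HJB-form reading, assumed)."""
    return -0.5 / R * float(b @ (rbf_gradient(e, centers, width).T @ W_c))
```

For example, with nodes centered at (0, 0) and (1, 0), unit width, and unit weights, `critic_q` returns $1 + e^{-1/2} \approx 1.607$ at $e_i = 0$.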

The weight update law of the critic-NN is a gradient descent algorithm:

$$
\dot{\hat{W}}_{ic}(t) = -n_1 L_i\left(L_i^T \hat{W}_{ic} + e_i^T Q_e e_i + \hat{u}_i^T R \hat{u}_i\right) \tag{37}
$$

In the equation above, $n_1 > 0$ is the adaptive gain of the neural network, and $L_i$ and $l_i$ are defined as

$$
L_i = \frac{l_i}{l_i^T l_i + 1}, \qquad l_i = \nabla S_{ic}(e_i)\, \dot{e}_i \tag{38}
$$

Therefore, according to the definitions above, the following inequalities hold:

$$
L_{im} \le \|L_i\| \le L_{iM}, \qquad S_{icm} \le \|S_{ic}(e_i)\| \le S_{icM}, \qquad S_{iam} \le \|S_{ia}(e_i)\| \le S_{iaM} \tag{39}
$$
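The critic law (37)–(38) can be sketched as one explicit-Euler step. This assumes a scalar input, a scalar $R$, and a regressor $l_i$ computed elsewhere and passed in; the step size `dt` and the function names are hypothetical. The sketch also shows why $\|L_i\|$ is bounded as in (39): $\|l\|/(\|l\|^2 + 1) \le 1/2$ for every $l$.

```python
import numpy as np

def normalized_regressor(l):
    """L_i = l_i / (l_i^T l_i + 1) from (38); its norm never exceeds 1/2."""
    l = np.asarray(l, dtype=float)
    return l / (float(l @ l) + 1.0)

def critic_residual(W_c, L, e, u, Q_e, R):
    """Bracket of (37): L_i^T W_c + e^T Q_e e + u^T R u, with scalar u and R."""
    return float(L @ W_c) + float(e @ Q_e @ e) + u * R * u

def critic_step(W_c, l, e, u, Q_e, R, n1, dt):
    """One forward-Euler step of (37), i.e. gradient descent on 0.5 * residual^2."""
    L = normalized_regressor(l)
    return W_c - dt * n1 * L * critic_residual(W_c, L, e, u, Q_e, R)
```

Because the residual is linear in $\hat{W}_{ic}$, each step multiplies it by $1 - n_1\,dt\,\|L_i\|^2$, so it shrinks whenever $0 < n_1\,dt\,\|L_i\|^2 < 2$.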

Combining (35) with (38), we obtain

$$
\dot{\tilde{W}}_{ic}(t) = -n_1 L_i\left(L_i^T \tilde{W}_{ic} + \delta_{hi}\right) \tag{40}
$$

The weight update law of the action-NN is also developed by a gradient descent algorithm:

$$
\dot{\hat{W}}_{ia}(t) = -n_2 S_{ia}(e_i)\left(\hat{W}_{ia}^T S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T \hat{W}_{ic}\right)^T \tag{41}
$$
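Similarly, a hedged sketch of one forward-Euler step of the action law (41). It assumes a scalar input, a scalar $R$, $b_i(e_i)$ treated as a vector, and that the basis vector $S_{ia}(e_i)$ and the critic basis Jacobian $\nabla S_{ic}(e_i)$ have already been evaluated. The step drives the action-NN output $\hat{W}_{ia}^T S_{ia}$ toward the critic-implied control $-\frac{1}{2}R^{-1} b_i \nabla S_{ic}^T \hat{W}_{ic}$, where the bracket in (41) vanishes. Function names are hypothetical.

```python
import numpy as np

def action_residual(W_a, S_a, grad_S_c, W_c, b, R):
    """Bracket of (41): W_a^T S_a + 1/2 R^{-1} b (grad S_c)^T W_c (scalar input)."""
    critic_control = -0.5 / R * float(b @ (grad_S_c.T @ W_c))
    return float(W_a @ S_a) - critic_control

def action_step(W_a, S_a, grad_S_c, W_c, b, R, n2, dt):
    """One forward-Euler step of (41): W_a <- W_a - dt * n2 * S_a * residual."""
    return W_a - dt * n2 * S_a * action_residual(W_a, S_a, grad_S_c, W_c, b, R)
```

As with the critic, the residual is scaled by $1 - n_2\,dt\,\|S_{ia}\|^2$ at each step.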

According to the estimation error of the action-NN in (36), the optimal control $u_i^*(e_i)$ minimizes the optimal Q-function, and we obtain

$$
W_{ia}^T S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T W_{ic} + \frac{1}{2} R^{-1} b_i(e_i) \nabla\varepsilon_{ic}(e_i) + \varepsilon_{ia}(e_i) = 0 \tag{42}
$$

Combining (41) with (42), we obtain

$$
\dot{\tilde{W}}_{ia} = -n_2 S_{ia}(e_i)\left(\tilde{W}_{ia}^T S_{ia}(e_i) + \frac{1}{2} R^{-1} b_i(e_i) \nabla S_{ic}(e_i)^T \tilde{W}_{ic} - \frac{1}{2} R^{-1} b_i(e_i) \nabla\varepsilon_{ic}(e_i) - \varepsilon_{ia}(e_i)\right) \tag{43}
$$

After the critic-NN and the action-NN are used to estimate $Q_i^*$ and $u_i^*(e_i)$, a robust RBF-NN identifier is designed to identify the nonlinear uncertainties of the subsystem. Here $\Phi_i(e_i, \hat{u}_i)$ can be expressed as

$$
\Phi_i(e_i, \hat{u}_i) = \dot{e}_{iF} = W_{iF}^T \kappa\left(\Lambda_{iF}^T e_{iF}\right) + \varepsilon_{iF}(e_{iF}) + b_i(e_i)\hat{u}_i \tag{44}
$$

where $\kappa(\cdot)$ denotes the basis function of the neural network, and $W_{iF}$ and $\Lambda_{iF}$ denote the unknown ideal neural network weights.

Equation (44) can be identified by the robust RBF-NN identifier

$$
\hat{\Phi}_i(e_i, \hat{u}_i) = \dot{\hat{e}}_{iF} = \hat{W}_{iF}^T \hat{\kappa}_{iF} + b_i(e_i)\hat{u}_i + \mu_i \tag{45}
$$

Here $\hat{\kappa}_{iF}$ denotes the estimated value of the basis function of the neural network, $\hat{W}_{iF}$ and $\hat{\Lambda}_{iF}$ are the estimated neural network weights, and $\mu_i \in \mathbb{R}$ is the feedback error term [26]:

$$
\begin{aligned}
\mu_i &= k\left(e_{iF}(t) - \hat{e}_{iF}(t)\right) - k\left(e_{iF}(0) - \hat{e}_{iF}(0)\right) + \vartheta = k\left(\tilde{e}_{iF}(t) - \tilde{e}_{iF}(0)\right) + \vartheta, \\
\dot{\vartheta} &= (k\alpha + \gamma)\tilde{e}_{iF} + \beta_1 \operatorname{sat}(\tilde{e}_{iF})
\end{aligned} \tag{46}
$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gains and $\operatorname{sat}(\cdot)$ is a saturation function. The state estimation error of the identifier-NN then satisfies

$$
\dot{\tilde{e}}_{iF} = \dot{e}_{iF} - \dot{\hat{e}}_{iF} = W_{iF}^T \kappa_{iF} - \hat{W}_{iF}^T \hat{\kappa}_{iF} + \varepsilon_{iF}(e_{iF}) - \mu_i \tag{47}
$$

A filtered identification error is defined as

$$
E_i = \dot{\tilde{e}}_{iF} + \alpha \tilde{e}_{iF} \tag{48}
$$
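The feedback term (46) and the filtered error (48) can be sketched as follows, assuming a unit saturation limit and forward-Euler integration of $\vartheta$ (both are our choices, not specified in the text). Differentiating (46) gives $\dot{\mu}_i = kE_i + \gamma\tilde{e}_{iF} + \beta_1 \operatorname{sat}(\tilde{e}_{iF})$, which is how the terms $-kE_i - \gamma\tilde{e}_{iF} - \beta_1\operatorname{sat}(\tilde{e}_{iF})$ enter $\dot{E}_i$ through (47). Class and function names are hypothetical.

```python
import numpy as np

def sat(x, limit=1.0):
    """Elementwise saturation; the unit limit is an assumption."""
    return np.clip(x, -limit, limit)

def filtered_error(e_tilde_dot, e_tilde, alpha):
    """E_i = d(e~_iF)/dt + alpha * e~_iF, cf. (48)."""
    return np.asarray(e_tilde_dot, dtype=float) + alpha * np.asarray(e_tilde, dtype=float)

class RiseFeedback:
    """mu_i = k (e~(t) - e~(0)) + theta, d(theta)/dt = (k alpha + gamma) e~ + beta1 sat(e~), cf. (46)."""

    def __init__(self, k, alpha, gamma, beta1, e_tilde0):
        self.k, self.alpha, self.gamma, self.beta1 = k, alpha, gamma, beta1
        self.e0 = np.asarray(e_tilde0, dtype=float)
        self.theta = np.zeros_like(self.e0)

    def step(self, e_tilde, dt):
        """Return mu_i at the current sample, then advance theta by one Euler step."""
        e_tilde = np.asarray(e_tilde, dtype=float)
        mu = self.k * (e_tilde - self.e0) + self.theta
        theta_dot = (self.k * self.alpha + self.gamma) * e_tilde + self.beta1 * sat(e_tilde)
        self.theta = self.theta + dt * theta_dot
        return mu
```

Note that $\mu_i(0) = 0$ by construction, since $\vartheta$ starts at zero and the $k$ term is referenced to $\tilde{e}_{iF}(0)$.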

Its time derivative is

$$
\begin{aligned}
\dot{E}_i ={}& W_{iF}^T \kappa'_{iF} \Lambda_{iF}^T \dot{e}_{iF} - \hat{W}_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF} + \alpha \dot{\tilde{e}}_{iF} - k E_i - \gamma \tilde{e}_{iF} \\
& - \dot{\hat{W}}_{iF}^T \hat{\kappa}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) - \beta_1 \operatorname{sat}(\tilde{e}_{iF}) - \hat{W}_{iF}^T \hat{\kappa}'_{iF} \dot{\hat{\Lambda}}_{iF}^T \hat{e}_{iF}
\end{aligned} \tag{49}
$$

where $\kappa'_{iF}$ and $\hat{\kappa}'_{iF}$ denote the derivatives of the basis function with respect to its argument, evaluated at the ideal and the estimated arguments, respectively.

The weights $\hat{W}_{iF}$ and $\hat{\Lambda}_{iF}$ of the identifier-NN are updated by

$$
\dot{\hat{W}}_{iF} = \operatorname{proj}\left(\Gamma_{iWF}\, \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF} \tilde{e}_{iF}^T\right), \qquad
\dot{\hat{\Lambda}}_{iF} = \operatorname{proj}\left(\Gamma_{i\Lambda F}\, \dot{\hat{e}}_{iF} \tilde{e}_{iF}^T \hat{W}_{iF}^T \hat{\kappa}'_{iF}\right) \tag{50}
$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive constant adaptation gain matrices. To analyze the convergence of the filtered identification error, the term $-\hat{W}_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF}$ in (49) is split by adding and subtracting terms and using $\dot{e}_{iF} = \dot{\hat{e}}_{iF} + \dot{\tilde{e}}_{iF}$:

$$
\begin{aligned}
-\hat{W}_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF}
={}& \frac{1}{2}\left(\tilde{W}_{iF} - W_{iF}\right)^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF} + \frac{1}{2}\hat{W}_{iF}^T \hat{\kappa}'_{iF} \left(\tilde{\Lambda}_{iF} - \Lambda_{iF}\right)^T \dot{\hat{e}}_{iF} \\
={}& \frac{1}{2} W_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\tilde{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^T \hat{\kappa}'_{iF} \Lambda_{iF}^T \dot{\tilde{e}}_{iF} - \frac{1}{2} W_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{e}_{iF} - \frac{1}{2} \hat{W}_{iF}^T \hat{\kappa}'_{iF} \Lambda_{iF}^T \dot{e}_{iF} \\
& + \frac{1}{2} \tilde{W}_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^T \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^T \dot{\hat{e}}_{iF}
\end{aligned} \tag{51}
$$

where $\tilde{W}_{iF} = W_{iF} - \hat{W}_{iF}$ and $\tilde{\Lambda}_{iF} = \Lambda_{iF} - \hat{\Lambda}_{iF}$. Substituting (51) into (49), $\dot{E}_i$ reduces to

$$
\dot{E}_i = P_{F1} + P_{F2} + P_{F3} - k E_i - \gamma \tilde{e}_{iF} - \beta_1 \operatorname{sat}(\tilde{e}_{iF}) \tag{52}
$$

Here $P_{F1}$, $P_{F2}$, and $P_{F3}$ are, respectively,

$$
P_{F1} = \frac{1}{2} W_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\tilde{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^T \hat{\kappa}'_{iF} \Lambda_{iF}^T \dot{\tilde{e}}_{iF} - \hat{W}_{iF}^T \hat{\kappa}'_{iF} \dot{\hat{\Lambda}}_{iF}^T \hat{e}_{iF} + \alpha \dot{\tilde{e}}_{iF} - \dot{\hat{W}}_{iF}^T \hat{\kappa}_{iF} \tag{53}
$$

$$
P_{F2} = -\frac{1}{2} W_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{e}_{iF} - \frac{1}{2} \hat{W}_{iF}^T \hat{\kappa}'_{iF} \Lambda_{iF}^T \dot{e}_{iF} + W_{iF}^T \kappa'_{iF} \Lambda_{iF}^T \dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) \tag{54}
$$

$$
P_{F3} = \frac{1}{2} \tilde{W}_{iF}^T \hat{\kappa}'_{iF} \hat{\Lambda}_{iF}^T \dot{\hat{e}}_{iF} + \frac{1}{2} \hat{W}_{iF}^T \hat{\kappa}'_{iF} \tilde{\Lambda}_{iF}^T \dot{\hat{e}}_{iF} \tag{55}
$$
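The passage from (49) to (52) relies on an exact add-and-subtract identity. Below is a quick numerical check with random matrices, under a grouping stated here as an assumption (subscripts $iF$ dropped): the weight-dependent parts of $P_{F1}$ are $\frac{1}{2}W^T\hat\kappa'\hat\Lambda^T\dot{\tilde e} + \frac{1}{2}\hat W^T\hat\kappa'\Lambda^T\dot{\tilde e}$, those of $P_{F2}$ are $-\frac{1}{2}W^T\hat\kappa'\hat\Lambda^T\dot e - \frac{1}{2}\hat W^T\hat\kappa'\Lambda^T\dot e + W^T\kappa'\Lambda^T\dot e$, and $P_{F3} = \frac{1}{2}\tilde W^T\hat\kappa'\hat\Lambda^T\dot{\hat e} + \frac{1}{2}\hat W^T\hat\kappa'\tilde\Lambda^T\dot{\hat e}$. Their sum should equal $W^T\kappa'\Lambda^T\dot e - \hat W^T\hat\kappa'\hat\Lambda^T\dot{\hat e}$ for any matrices, because only $\tilde W = W - \hat W$, $\tilde\Lambda = \Lambda - \hat\Lambda$, and $\dot e = \dot{\hat e} + \dot{\tilde e}$ are used. `K` and `K_hat` are stand-ins for $\kappa'$ and $\hat\kappa'$ at some evaluation point.

```python
import numpy as np

def split_terms(W, W_hat, Lam, Lam_hat, K, K_hat, e_dot, e_hat_dot):
    """Weight-dependent parts of P_F1, P_F2, P_F3 (grouping assumed, see text)."""
    W_t, Lam_t = W - W_hat, Lam - Lam_hat        # tilde W, tilde Lambda
    e_t_dot = e_dot - e_hat_dot                   # time derivative of tilde e
    p1 = 0.5 * W.T @ K_hat @ Lam_hat.T @ e_t_dot + 0.5 * W_hat.T @ K_hat @ Lam.T @ e_t_dot
    p2 = (-0.5 * W.T @ K_hat @ Lam_hat.T @ e_dot
          - 0.5 * W_hat.T @ K_hat @ Lam.T @ e_dot
          + W.T @ K @ Lam.T @ e_dot)
    p3 = 0.5 * W_t.T @ K_hat @ Lam_hat.T @ e_hat_dot + 0.5 * W_hat.T @ K_hat @ Lam_t.T @ e_hat_dot
    return p1, p2, p3

rng = np.random.default_rng(0)
n, h = 3, 5                                       # error dimension, hidden nodes
W, W_hat = rng.normal(size=(h, n)), rng.normal(size=(h, n))
Lam, Lam_hat = rng.normal(size=(n, h)), rng.normal(size=(n, h))
K, K_hat = np.diag(rng.random(h)), np.diag(rng.random(h))
e_dot, e_hat_dot = rng.normal(size=n), rng.normal(size=n)

p1, p2, p3 = split_terms(W, W_hat, Lam, Lam_hat, K, K_hat, e_dot, e_hat_dot)
lhs = W.T @ K @ Lam.T @ e_dot - W_hat.T @ K_hat @ Lam_hat.T @ e_hat_dot
print(np.allclose(p1 + p2 + p3, lhs))             # prints: True
```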

According to Assumption 1, (48), and (50), the upper bounds of $P_{F1}$, $P_{F2}$, and $P_{F3}$ are

$$
\|P_{F1}\| \le J_1\left(\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\|\right)\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\|, \qquad \|P_{F2}\| \le \varsigma_1, \qquad \|P_{F3}\| \le \varsigma_2 \tag{56}
$$

Combining (53) and (54) with (55), we obtain

$$
\|\dot{P}_{F2} + \dot{P}_{F3}\| \le \varsigma_3 + \varsigma_4 J_2\left(\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\|\right)\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\| \tag{57}
$$

where $\varphi_i(\tilde{e}_{iF}^T, E_i^T) = [\tilde{e}_{iF}^T \;\; E_i^T]^T$, $J_i(\cdot)$ is a globally invertible nondecreasing function, and $\varsigma_i$ $(i = 1, 2, 3, 4)$ are computable positive constants.

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function as

$$
V_{iL}(\tilde{e}_{iF}, E_i) = \frac{1}{2} E_i^T E_i + \frac{1}{2}\gamma \tilde{e}_{iF}^T \tilde{e}_{iF} + \varpi_i(t) + \phi_i(t) \tag{58}
$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are given by

$$
\begin{aligned}
\dot{\varpi}_i(t) &= -\left[E_i^T\left(P_{F2} - \beta_1 \operatorname{sat}(\tilde{e}_{iF})\right) + \dot{\tilde{e}}_{iF}^T P_{F3} - \beta_2 J_2\left(\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\|\right)\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\|\,\|\tilde{e}_{iF}\|\right], \\
\varpi_i(0) &= \beta_1 |\tilde{e}_{iF}(0)| - \tilde{e}_{iF}^T(0)\left(P_{F2}(0) + P_{F3}(0)\right)
\end{aligned} \tag{59}
$$

$$
\phi_i(t) = \frac{\alpha}{4}\left[\operatorname{tr}\left(\tilde{W}_{iF}^T \Gamma_{iWF}^{-1} \tilde{W}_{iF}\right) + \operatorname{tr}\left(\tilde{\Lambda}_{iF}^T \Gamma_{i\Lambda F}^{-1} \tilde{\Lambda}_{iF}\right)\right] \tag{60}
$$

where $\operatorname{tr}(\cdot)$ denotes the matrix trace. Define $d = [E_i^T \;\; \tilde{e}_{iF}^T \;\; \varpi_i^{1/2} \;\; \phi_i^{1/2}]^T$; $\beta_1, \beta_2 \in \mathbb{R}$ are positive adaptation gains chosen to ensure $\varpi_i(t) \ge 0$, so that

$$
U_1(d) \le V_{iL}(\tilde{e}_{iF}, E_i) \le U_2(d) \tag{61}
$$

where

$$
U_1(d) = \frac{1}{2}\min(1, \gamma)\|d\|^2, \qquad U_2(d) = \max(1, \gamma)\|d\|^2 \tag{62}
$$

The time derivative of (58) is

$$
\dot{V}_{iL}(\tilde{e}_{iF}, E_i) = \nabla V_{iL}^T K\left[\dot{E}_i^T \;\; \dot{\tilde{e}}_{iF}^T \;\; \tfrac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i \;\; \tfrac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\right]^T \tag{63}
$$

where $K[\cdot]$ denotes the Filippov set [27]. Writing $\varphi_i$ for $\varphi_i(\tilde{e}_{iF}^T, E_i^T)$, $\dot{V}_{iL}$ can be expanded as

$$
\begin{aligned}
\dot{V}_{iL}(\tilde{e}_{iF}, E_i)
={}& \left[E_i^T \;\; \gamma\tilde{e}_{iF}^T \;\; 2\varpi_i^{1/2} \;\; 2\phi_i^{1/2}\right] K\left[\dot{E}_i^T \;\; \dot{\tilde{e}}_{iF}^T \;\; \tfrac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i \;\; \tfrac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\right]^T \\
\le{}& E_i^T\Big(\tfrac{1}{2}W_{iF}^T\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^T\dot{\tilde{e}}_{iF} + \tfrac{1}{2}\hat{W}_{iF}^T\hat{\kappa}'_{iF}\Lambda_{iF}^T\dot{\tilde{e}}_{iF} - \hat{W}_{iF}^T\hat{\kappa}'_{iF}\dot{\hat{\Lambda}}_{iF}^T\hat{e}_{iF} + \alpha\dot{\tilde{e}}_{iF} \\
& \quad - \tfrac{1}{2}W_{iF}^T\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^T\dot{e}_{iF} - \tfrac{1}{2}\hat{W}_{iF}^T\hat{\kappa}'_{iF}\Lambda_{iF}^T\dot{e}_{iF} - \gamma\tilde{e}_{iF} - \dot{\hat{W}}_{iF}^T\hat{\kappa}_{iF} + W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) \\
& \quad + \tfrac{1}{2}\tilde{W}_{iF}^T\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^T\dot{\hat{e}}_{iF} + \tfrac{1}{2}\hat{W}_{iF}^T\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^T\dot{\hat{e}}_{iF} - kE_i - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})]\Big) \\
& - E_i^T\Big(-\tfrac{1}{2}W_{iF}^T\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^T\dot{e}_{iF} - \tfrac{1}{2}\hat{W}_{iF}^T\hat{\kappa}'_{iF}\Lambda_{iF}^T\dot{e}_{iF} + W_{iF}^T\kappa'_{iF}\Lambda_{iF}^T\dot{e}_{iF} + \dot{\varepsilon}_{iF}(e_{iF}) - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})]\Big) \\
& - \dot{\tilde{e}}_{iF}^T\Big(\tfrac{1}{2}\tilde{W}_{iF}^T\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^T\dot{\hat{e}}_{iF} + \tfrac{1}{2}\hat{W}_{iF}^T\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^T\dot{\hat{e}}_{iF}\Big) + \gamma\tilde{e}_{iF}^T\left(E_i - \alpha\tilde{e}_{iF}\right) \\
& + \beta_2 J_2(\|\varphi_i\|)\|\varphi_i\|\,\|\tilde{e}_{iF}\| - \tfrac{\alpha}{2}\operatorname{tr}\big(\tilde{W}_{iF}^T\Gamma_{iWF}^{-1}\dot{\hat{W}}_{iF}\big) - \tfrac{\alpha}{2}\operatorname{tr}\big(\tilde{\Lambda}_{iF}^T\Gamma_{i\Lambda F}^{-1}\dot{\hat{\Lambda}}_{iF}\big)
\end{aligned} \tag{64}
$$

Substituting (53), (54), and (55) into (64) yields, with $\varphi_i$ short for $\varphi_i(\tilde{e}_{iF}^T, E_i^T)$,

$$
\begin{aligned}
\dot{V}_{iL}(\tilde{e}_{iF}, E_i)
={}& E_i^T\left(P_{F1} + P_{F2} + P_{F3} - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})] - kE_i - \gamma\tilde{e}_{iF}\right) + \gamma\tilde{e}_{iF}^T\left(E_i - \alpha\tilde{e}_{iF}\right) \\
& - E_i^T\left(P_{F2} - \beta_1 K[\operatorname{sat}(\tilde{e}_{iF})]\right) - \dot{\tilde{e}}_{iF}^T P_{F3} + \beta_2 J_2(\|\varphi_i\|)\|\varphi_i\|\,\|\tilde{e}_{iF}\| \\
& - \tfrac{\alpha}{2}\operatorname{tr}\big(\tilde{W}_{iF}^T\Gamma_{iWF}^{-1}\dot{\hat{W}}_{iF}\big) - \tfrac{\alpha}{2}\operatorname{tr}\big(\tilde{\Lambda}_{iF}^T\Gamma_{i\Lambda F}^{-1}\dot{\hat{\Lambda}}_{iF}\big) \\
={}& -\alpha\gamma\tilde{e}_{iF}^T\tilde{e}_{iF} - kE_i^TE_i + E_i^TP_{F1} + \left(E_i^T - \dot{\tilde{e}}_{iF}^T\right)P_{F3} + \beta_2 J_2(\|\varphi_i\|)\|\varphi_i\|\,\|\tilde{e}_{iF}\| \\
& - \tfrac{\alpha}{2}\operatorname{tr}\big(\tilde{W}_{iF}^T\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^T\dot{\hat{e}}_{iF}\tilde{e}_{iF}^T\big) - \tfrac{\alpha}{2}\operatorname{tr}\big(\tilde{\Lambda}_{iF}^T\dot{\hat{e}}_{iF}\tilde{e}_{iF}^T\hat{W}_{iF}^T\hat{\kappa}'_{iF}\big) \\
\le{}& -k_1\|\tilde{e}_{iF}\|^2 - k_2\|E_i\|^2 + \frac{J_1(\|\varphi_i\|)^2}{4k_3}\|\varphi_i\|^2 + \frac{\beta_2^2 J_2(\|\varphi_i\|)^2}{4\alpha k_4}\|\varphi_i\|^2
\end{aligned} \tag{65}
$$

where $k_{\min} = \min\{k_1, k_2\}$, $\xi = \min\{k_3, \alpha k_4/\beta_2^2\}$, and $J(\|\varphi_i\|)^2 = J_1(\|\varphi_i\|)^2 + J_2(\|\varphi_i\|)^2$, with $\varphi_i$ short for $\varphi_i(\tilde{e}_{iF}^T, E_i^T)$. Hence

$$
\dot{V}_{iL}(\tilde{e}_{iF}, E_i) \le -k_{\min}\|\varphi_i\|^2 + \frac{J(\|\varphi_i\|)^2}{4\xi}\|\varphi_i\|^2 \le -c\|\varphi_i\|^2 \tag{66}
$$
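The inequality (66) and the region (67) can be explored numerically. The sketch below assumes a hypothetical affine bounding function $J(s) = j_0 + j_1 s$ (the actual $J$ depends on the bounds in (56) and (57)) and illustrative gain values; it computes $k_{\min}$, $\xi$, the coefficient $k_{\min} - J(s)^2/(4\xi)$ that multiplies $-\|\varphi_i\|^2$, and the radius $J^{-1}(2\sqrt{k_{\min}\xi})$ at which that coefficient vanishes.

```python
import math

def gains(k1, k2, k3, k4, alpha, beta2):
    """k_min = min{k1, k2} and xi = min{k3, alpha * k4 / beta2^2}, as defined after (65)."""
    return min(k1, k2), min(k3, alpha * k4 / beta2 ** 2)

def margin(s, k_min, xi, J):
    """Coefficient c(s) with dV/dt <= -c(s) ||phi||^2 when ||phi|| = s."""
    return k_min - J(s) ** 2 / (4.0 * xi)

def region_radius(k_min, xi, J_inv):
    """Radius J^{-1}(2 sqrt(k_min xi)) of the set D in (67)."""
    return J_inv(2.0 * math.sqrt(k_min * xi))

j0, j1 = 0.5, 2.0                               # hypothetical J(s) = j0 + j1 * s

def J(s):
    return j0 + j1 * s

def J_inv(y):
    return (y - j0) / j1

k_min, xi = gains(1.0, 2.0, 0.8, 1.0, 1.5, 1.0)
r = region_radius(k_min, xi, J_inv)
print(round(r, 4), margin(0.9 * r, k_min, xi, J) > 0, margin(1.1 * r, k_min, xi, J) < 0)
# prints: 0.6444 True True
```

Because $J$ is nondecreasing, the coefficient stays positive for every $\|\varphi_i\|$ below this radius, which is the role played by the set $D$.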

Therefore, for some positive constant $c$, $-c\|\varphi_i(\tilde{e}_{iF}^T, E_i^T)\|^2$ is a negative semidefinite function on the adjustable region

$$
D = \left\{d(t) \;\middle|\; \|d\| \le J^{-1}\left(2\sqrt{k_{\min}\xi}\right)\right\} \tag{67}
$$

so Lyapunov stability theory shows that the system is stable.

To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, a novel decentralized reinforcement learning robust optimal tracking controller is designed in this paper, in which a robust term compensates for the neural network approximation errors. The robust control term is designed as

$$
u_{irb} = \frac{N_{rb}\, e_i}{e_i^T e_i + \zeta} \tag{68}
$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ satisfies

$$
\begin{aligned}
N_{rb} \ge{}& \left[\frac{\delta_{hi}^2}{2n_1} + \frac{n_1\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)/2\right)^2}{2n_2} + n_1 n_2\|b_i(e_i)\|^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)/2\right)^2\right]\frac{e_i^T e_i + \zeta}{2n_1 n_2\|b_i(e_i)\|\, e_i^T e_i} \\
\ge{}& \left[n_1^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)/2\right)^2 + n_2\delta_{hi}^2 + 2n_1^2 n_2^2\|b_i(e_i)\|^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)/2\right)^2\right]\frac{e_i^T e_i + \zeta}{4n_1^2 n_2^2\|b_i(e_i)\|\, e_i^T e_i}
\end{aligned} \tag{69}
$$

Therefore, the global control law is designed as

$$u_{mix} = u_i + u_{irb} = -\frac{1}{2}R^{-1}b_i^{T}(e_i)\,S_{ia}(e_i)^{T}\hat W_{ia} + \frac{N_{rb}\,e_i}{e_i^{T}e_i + \zeta} \tag{70}$$
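The robust term (68), the bound (69), and the mixed law (70) can be evaluated numerically. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: it treats a single joint subsystem with scalar error $e_i$, scalar input gain $b_i$, and scalar weight $R$, and all argument names and numbers (`s_ia`, `w_ia`, `eps_ia`, `grad_eps_ic`, `delta_h`) are hypothetical stand-ins for $S_{ia}(e_i)$, $\hat W_{ia}$, $\varepsilon_{ia}$, $\nabla\varepsilon_{ic}$, and $\delta_{hi}$. Multiplying the first bracket of (69) and its denominator by $2n_1n_2$ gives the second form, so the two expressions are equal; the sketch uses the second form. Because (69) divides by $e_i^{T}e_i$, the bound grows without limit as $e_i \to 0$, so the sketch rejects $e_i = 0$.

```python
import math


def u_robust(e, n_rb, zeta):
    """Robust term (68) for a scalar subsystem error: N_rb * e / (e^2 + zeta)."""
    return n_rb * e / (e * e + zeta)


def u_action(b, r, s_ia, w_ia):
    """Optimal-control part of (70), scalar input: -(1/2) R^{-1} b S_ia^T W_ia."""
    # s_ia: basis vector S_ia(e_i); w_ia: current action-NN weight estimate
    return -0.5 / r * b * sum(s * w for s, w in zip(s_ia, w_ia))


def n_rb_bound(e, zeta, b, r, n1, n2, eps_ia, grad_eps_ic, delta_h):
    """Lower bound on N_rb from the second form of (69), scalar case."""
    if e == 0.0:
        raise ValueError("(69) divides by e^T e, so e must be nonzero")
    x = -eps_ia - b * grad_eps_ic / (2.0 * r)  # -eps_ia - R^{-1} b grad(eps_ic) / 2
    num = n1**2 * x**2 + n2 * delta_h**2 + 2.0 * n1**2 * n2**2 * b**2 * x**2
    return num * (e * e + zeta) / (4.0 * n1**2 * n2**2 * abs(b) * e * e)


def u_mix(e, zeta, b, r, s_ia, w_ia, n1, n2, eps_ia, grad_eps_ic, delta_h):
    """Mixed law (70): action-NN part plus robust term with N_rb at its bound."""
    n_rb = n_rb_bound(e, zeta, b, r, n1, n2, eps_ia, grad_eps_ic, delta_h)
    return u_action(b, r, s_ia, w_ia) + u_robust(e, n_rb, zeta)


# Usage with hypothetical values
print(u_mix(0.5, 0.1, 2.0, 1.0, [1.0, 0.5], [0.2, 0.4], 1.0, 1.0, 0.1, 0.2, 0.3))
```

For a fixed $N_{rb}$, the robust term is bounded by $N_{rb}/(2\sqrt{\zeta})$, attained at $e_i = \sqrt{\zeta}$, which is one reason for the $\zeta$ in the denominator of (68).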

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the actual output tracks the desired trajectory asymptotically.

Proof. Consider the Lyapunov function

$$V_{iu}(e_i, u_{mix}) = \frac{1}{2n_1}\operatorname{tr}\left\{\tilde W_{ic}^{T}\tilde W_{ic}\right\} + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde W_{ia}^{T}\tilde W_{ia}\right\} + n_1n_2\left[e_{iF}^{T}e_{iF} + \Xi\int_{0}^{\infty} r_i(e_i, u_{mix})\,d\tau\right] \tag{71}$$

where $\Xi > 0$ is an undetermined parameter and $t \le \tau < \infty$. The time derivative of (71) is

$$
\begin{aligned}
\dot V_{iu}(e_i,u_{mix}) ={}& \frac{1}{2n_1}\operatorname{tr}\left\{\tilde W_{ic}^{T}\dot{\tilde W}_{ic}\right\} + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde W_{ia}^{T}\dot{\tilde W}_{ia}\right\} + n_1n_2\left[e_{iF}^{T}\dot e_{iF} + \Xi\, r_i(e_i,u_{mix})\right]\\
={}& \frac{1}{2n_1}\operatorname{tr}\left\{\tilde W_{ic}^{T}\left(-n_1L_i\left(L_i^{T}\tilde W_{ic}+\delta_{hi}\right)\right)\right\}\\
&+ \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde W_{ia}^{T}\left[-n_2S_{ia}(e_i)\left(\tilde W_{ia}^{T}S_{ia}(e_i) - \varepsilon_{ia}(e_i) + \frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde W_{ic} - \frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)\right]\right\}\\
&+ n_1n_2\left[e_{iF}^{T}\left(W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T}e_i\right) + \varepsilon_{iF}(e_i) + b_i(e_i)u_{mix}\right) + \Xi\left(e_i^{T}Q_ee_i + u_{mix}^{T}Ru_{mix}\right)\right]\\
\le{}& -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2}\right)\|\tilde W_{ic}\|^{2} + \frac{1}{2n_1}\delta_{hi}^{2} - \left(n_1S_{iam}^{2} - \frac34n_1n_2S_{iaM}^{2}\right)\|\tilde W_{ia}\|^{2}\\
&+ \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\|\tilde W_{ic}\|^{2}\\
&+ \frac{n_1}{2n_2}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{T}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)\\
&+ n_1n_2\|b_i(e_i)\|^{2}\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i) + n_1n_2S_{iaM}^{2}\|b_i(e_i)\|^{2}\|\tilde W_{ia}\|^{2}\\
&+ n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2} + n_1n_2\left(\|b_i(e_i)\|^{2} - \Xi\lambda_{\min}(R)\right)\|u_{mix}\|^{2}
\end{aligned}
\tag{72}
$$

If the following inequalities hold:

$$
\begin{aligned}
&\lambda_{\min}(Q_e)\|e_{iF}\|_2^{2} \le e_{iF}^{T}Q_ee_{iF} \le \lambda_{\max}(Q_e)\|e_{iF}\|_2^{2},\\
&\lambda_{\min}(R)\|u_{mix}\|_2^{2} \le u_{mix}^{T}Ru_{mix} \le \lambda_{\max}(R)\|u_{mix}\|_2^{2},\\
&\Xi > \frac{\|b_i(e_i)\|^{2}}{\lambda_{\min}(R)}
\end{aligned}
\tag{73}
$$

then $\dot V_{iu}(e_i, u_{mix})$ can be further bounded as

$$
\begin{aligned}
\dot V_{iu}(e_i,u_{mix}) \le{}& -\left(L_{im}^{2} - \frac{n_1}{2}L_{iM}^{2} - \frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\right)\|\tilde W_{ic}\|^{2}\\
&- \left(n_1S_{iam}^{2} - \frac34n_1n_2S_{iaM}^{2} - n_1n_2S_{iaM}^{2}\|b_i(e_i)\|^{2}\right)\|\tilde W_{ia}\|^{2}\\
&- n_1n_2\left(\|b_i(e_i)\|^{2} + \Xi\lambda_{\min}(R)\right)\|u_{mix}\|^{2} - n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}\\
\le{}& -n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}
\end{aligned}
\tag{74}
$$

Therefore, $\dot V_{iu}(e_i, u_{mix}) < 0$ for $e_{iF} \neq 0$.
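The last step in (74) drops the $\|\tilde W_{ic}\|^{2}$ and $\|\tilde W_{ia}\|^{2}$ terms, which is valid only when their coefficients are nonnegative; together with the condition on $\Xi$ in (73), this gives checks on the design constants $n_1$, $n_2$, and $\Xi$. The Python sketch below evaluates those coefficients and the lower bound on $\Xi$ for hypothetical values of the bounds $L_{im}$, $L_{iM}$, $S_{iam}$, $S_{iaM}$, $\|\nabla S_{icM}\|$, $\|b_i(e_i)\|$, and $\|R^{-1}\|$, and checks the Rayleigh-quotient inequalities of (73) for a symmetric 2 × 2 weighting matrix. All function names and numbers are illustrative assumptions.

```python
import math


def lyapunov_coefficients(l_m, l_M, s_am, s_aM, grad_s_cM, b_norm, r_inv_norm, n1, n2):
    """Coefficients of ||W~_ic||^2 and ||W~_ia||^2 in the first bound of (74)."""
    c_ic = (l_m**2 - 0.5 * n1 * l_M**2
            - n1 / (4.0 * n2) * r_inv_norm**2 * b_norm**2 * grad_s_cM**2)
    c_ia = n1 * s_am**2 - 0.75 * n1 * n2 * s_aM**2 - n1 * n2 * s_aM**2 * b_norm**2
    return c_ic, c_ia


def xi_lower_bound(b_norm, lam_min_r):
    """Condition Xi > ||b_i(e_i)||^2 / lambda_min(R) from (73)."""
    return b_norm**2 / lam_min_r


def sym2_eigs(a, b, d):
    """Eigenvalues (ascending) of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    rad = math.hypot(0.5 * (a - d), b)
    return mean - rad, mean + rad


def quad_form(a, b, d, x):
    """x^T Q x for Q = [[a, b], [b, d]]."""
    return a * x[0]**2 + 2.0 * b * x[0] * x[1] + d * x[1]**2


# Usage with hypothetical bounds: both coefficients should be nonnegative
c_ic, c_ia = lyapunov_coefficients(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5)
print(c_ic >= 0 and c_ia >= 0, xi_lower_bound(1.0, 0.5))
```

If either coefficient comes out negative, smaller learning gains $n_1$, $n_2$ would be needed under these assumed bounds.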

4. Simulations

To verify the validity and convergence of the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method, and to study the convergence of the tracking error through comparative simulations, two different configurations of the time varying externally constrained reconfigurable modular robot are used, as shown in Figures 3 and 4.

To facilitate the analysis, these configurations are transformed into the analytic charts shown in Figures 5 and 6, where $L_1$, $L_2$, and $L_4$ are the link lengths and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint can be modeled as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are

$$
\begin{aligned}
\Psi_A(q,t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right],\\
\Psi_B(q,t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right]
\end{aligned}
\tag{75}
$$

In the equation above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2} \tag{76}$$
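The constraint (75) with the angle (76) can be evaluated directly. This Python sketch assumes the reading $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$ of (76) and uses hypothetical link lengths, since $L_1$ to $L_4$ are not given in this section. The helper `q2_on_constraint_b` (a hypothetical name) solves $\Psi_B(q,t) = 0$ for the joint 2 angle in $[0, \pi]$.

```python
import math

# Hypothetical link lengths (L1, L2, L3, L4); the paper's values are not given here.
LINKS = (0.5, 0.8, 0.3, 0.3)


def alpha(t):
    """Constraint angle (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)


def constraint_rhs(t, links=LINKS):
    """Bracketed term L3 + L4*cot(alpha(t)) shared by both rows of (75)."""
    _, _, l3, l4 = links
    return l3 + l4 / math.tan(alpha(t))


def psi_a(q1, q2, t, links=LINKS):
    """Constraint Psi_A(q, t) of configuration A."""
    l1, l2, _, _ = links
    return l1 * math.cos(q1) + l2 * math.cos(q2) - constraint_rhs(t, links)


def psi_b(q2, t, links=LINKS):
    """Constraint Psi_B(q, t) of configuration B (joint 1 fixed)."""
    l1, l2, _, _ = links
    return l1 + l2 * math.cos(q2) - constraint_rhs(t, links)


def q2_on_constraint_b(t, links=LINKS):
    """Joint 2 angle in [0, pi] satisfying Psi_B(q, t) = 0."""
    l1, l2, _, _ = links
    return math.acos((constraint_rhs(t, links) - l1) / l2)


for t in (0.0, 2.0, 5.0):
    print(t, alpha(t), q2_on_constraint_b(t))
```

With these assumed lengths the `acos` argument stays inside $[-1, 1]$ for every $t$, because $\alpha(t)$ only varies by $\pm 0.2$ rad around $0.75\pi$.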

The initial joint positions are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B, and the initial joint velocities are zero. The dynamic models of configurations A and B are

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos(q_2) + 0.6066 & 0.18\cos(q_2) + 0.1233\\ 0.18\cos(q_2) + 0.1233 & 0.1233 \end{bmatrix},\\
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^{2}(q_2) & -0.06\cos(q_2)\\ -0.06\cos(q_2) & 0.1233 \end{bmatrix},\\
C_A(q,\dot q) &= \begin{bmatrix} -0.36\sin(q_2)\dot q_2 & -0.18\sin(q_2)\dot q_2\\ 0.18\sin(q_2)(\dot q_1 - \dot q_2) & 0.18\sin(q_2)\dot q_1 \end{bmatrix},\\
C_B(q,\dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\dot q_2 & 0.06\sin(q_2)\dot q_2\\ 0.06\sin(q_2)\dot q_2 & 0 \end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1+q_2) - 17.64\sin(q_1)\\ -5.88\sin(q_1+q_2) \end{bmatrix},\quad
G_B(q) = \begin{bmatrix} 0\\ -5.88\cos(q_2) \end{bmatrix},\\
F_A(q,\dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\operatorname{sgn}(\dot q_1)\\ 1.2\dot q_2 + 5\sin(2q_2) + \operatorname{sgn}(\dot q_2) \end{bmatrix},\quad
F_B(q,\dot q) = \begin{bmatrix} 0\\ 1.5\dot q_2 + \sin(q_2) + 1.2\operatorname{sgn}(\dot q_2) \end{bmatrix}
\end{aligned}
\tag{77}
$$
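The model (77) can be coded directly for simulation. The Python sketch below builds $M$, $C$, $G$, and $F$ for both configurations and computes joint accelerations, assuming the standard rigid-body form $M(q)\ddot q + C(q,\dot q)\dot q + G(q) + F(q,\dot q) = \tau$ (the subsystem model (14) is not reproduced in this section) and $\operatorname{sgn}(0) = 0$. The function names are hypothetical and the sketch ignores the constraint forces.

```python
import math


def sgn(x):
    """Signum with sgn(0) = 0."""
    return (x > 0) - (x < 0)


def model_a(q, dq):
    """M, C, G, F of configuration A from (77); q, dq are (joint 1, joint 2) tuples."""
    q1, q2 = q
    dq1, dq2 = dq
    c2, s2 = math.cos(q2), math.sin(q2)
    M = [[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
         [0.18 * c2 + 0.1233, 0.1233]]
    C = [[-0.36 * s2 * dq2, -0.18 * s2 * dq2],
         [0.18 * s2 * (dq1 - dq2), 0.18 * s2 * dq1]]
    G = [-5.88 * math.sin(q1 + q2) - 17.64 * math.sin(q1),
         -5.88 * math.sin(q1 + q2)]
    F = [dq1 + 10.0 * math.sin(3.0 * q1) + 2.0 * sgn(dq1),
         1.2 * dq2 + 5.0 * math.sin(2.0 * q2) + sgn(dq2)]
    return M, C, G, F


def model_b(q, dq):
    """M, C, G, F of configuration B from (77)."""
    _, q2 = q
    _, dq2 = dq
    c2, s2 = math.cos(q2), math.sin(q2)
    M = [[0.17 - 0.1166 * c2**2, -0.06 * c2],
         [-0.06 * c2, 0.1233]]
    C = [[0.1166 * math.sin(2.0 * q2) * dq2, 0.06 * s2 * dq2],
         [0.06 * s2 * dq2, 0.0]]
    G = [0.0, -5.88 * c2]
    F = [0.0, 1.5 * dq2 + s2 + 1.2 * sgn(dq2)]
    return M, C, G, F


def forward_dynamics(model, q, dq, tau):
    """Solve M qdd = tau - C dq - G - F (assumed standard form) for qdd."""
    M, C, G, F = model(q, dq)
    rhs = [tau[i] - (C[i][0] * dq[0] + C[i][1] * dq[1]) - G[i] - F[i] for i in range(2)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * rhs[0] - M[0][1] * rhs[1]) / det,
            (-M[1][0] * rhs[0] + M[0][0] * rhs[1]) / det]


print(forward_dynamics(model_a, (2.0, 2.0), (0.0, 0.0), [0.0, 0.0]))
```

Solving the 2 × 2 system in closed form keeps the sketch dependency-free; both inertia matrices stay positive definite over the full range of $q_2$, so the determinant never vanishes.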

The desired trajectories of configurations A and B are as follows.

Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos(t) + 0.2\sin(3t),\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\left(\alpha(t) - y_{1d}\right) - L_3\sin\left(\alpha(t)\right)}{L_2}\right] + \alpha(t)
\end{aligned}
\tag{78}
$$


Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$
\begin{aligned}
y_{1d} &= 0,\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\left(\alpha(t)\right) - L_3\sin\left(\alpha(t)\right)}{L_2}\right] + \alpha(t)
\end{aligned}
\tag{79}
$$
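The desired trajectories (78) and (79) can be generated as functions of time. This Python sketch uses hypothetical link lengths $L_1 = 0.5$, $L_2 = 0.8$, $L_3 = 0.3$ (not given in this section), chosen so that $|L_1\sin(\cdot) - L_3\sin(\cdot)| \le L_2$ and the $\arcsin$ arguments stay in $[-1, 1]$, and it assumes the reading $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$ of (76).

```python
import math

# Hypothetical link lengths; chosen so the arcsin arguments stay within [-1, 1].
L1, L2, L3 = 0.5, 0.8, 0.3


def alpha(t):
    """Constraint angle, assumed reading of (76)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)


def desired_a(t):
    """Desired joint trajectories (y1d, y2d) of configuration A, (78)."""
    y1 = 0.5 * math.cos(t) + 0.2 * math.sin(3.0 * t)
    a = alpha(t)
    arg = (L1 * math.sin(a - y1) - L3 * math.sin(a)) / L2
    # |arg| <= 1 by the choice of lengths; the clamp only guards against round-off
    return y1, math.asin(max(-1.0, min(1.0, arg))) + a


def desired_b(t):
    """Desired joint trajectories of configuration B, (79); joint 1 stays at zero."""
    a = alpha(t)
    arg = (L1 * math.sin(a) - L3 * math.sin(a)) / L2
    return 0.0, math.asin(arg) + a


# Sample the first 10 s, as in the simulation figures
for k in range(0, 11, 2):
    print(k, desired_a(float(k)), desired_b(float(k)))
```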

Because of the dynamic external constraint, the variation of joint 1 in configuration B is zero.

To confirm that the proposed method can be applied to different configurations, and to verify how well the subsystems track their desired trajectories under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, comparative simulations are carried out with the classic RBF neural network control method and with the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method.

Figures 7 to 14 show the joint tracking and error curves obtained when a classic RBF neural network [4] is used to compensate for the dynamic nonlinear terms and the interconnection terms of the subsystems. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11–14 show that, when time varying constraints exist, the classical neural control method cannot compensate well for the joint subsystem constraints of the reconfigurable modular robot. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness and the tracking errors grow.

Figure 5: The analytic chart of configuration A (labels: joint angles $q_1$, $q_2$; lengths $L_1$ to $L_4$; constraint angle $\alpha$; axes $X$, $Y$).

Figure 6: The analytic chart of configuration B (labels: joint angles $q_1$, $q_2$; lengths $L_1$ to $L_4$; constraint angle $\alpha$; axes $X$, $Y$).

Figures 15 to 22 show the joint tracking and error curves obtained when ACI is used to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in the subsystems. Figures 15 to 18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the ability of ACI to identify the uncertainties contained in the subsystems in a short time. The error curves in Figures 19–22 show that the tracking error is small and converges to zero in a short time. In addition, the joint subsystems remain robust under the proposed decentralized control method.

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. They show that the proposed decentralized control method fully satisfies the reachability requirement of the reconfigurable modular robot, and that no singular displacements of the joint subsystems or the end-effector occur. The parameters used in ACI are listed in Table 1.


Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

Table 1: Parameter list of action-critic-identifier.

| $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
|---|---|---|---|---|---|---|---|---|
| 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |

The simulation results show that, compared with the classic RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. In each configuration, the joint variables track the desired trajectories within a very short time, and the tracking errors fluctuate within a small range.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error in rad vs. time in s; vertical axis from −0.1 to 0.1 rad).


Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error in rad vs. time in s; vertical axis from −0.1 to 0.1 rad).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error in rad vs. time in s; vertical axis from −0.05 to 0.05 rad).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error in rad vs. time in s; vertical axis from −0.05 to 0.05 rad).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI-based reinforcement learning (joint 1 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI-based reinforcement learning (joint 2 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, by combining ACI with an RBF neural network, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time varying constrained reconfigurable modular robots. The theory addresses the problem of obtaining a continuous time nonlinear optimal control policy for strongly coupled, uncertain robotic systems. First, the subsystem model with time varying external constraints was built, and the global robot system was described as a synthesis of interconnected subsystems. Second, ACI was used to estimate the HJB equation and the global uncertainty: a continuous time optimal $Q$-function replaces the traditional optimal value function, the optimal $Q$-function and the optimal control policy are approximated by the critic-NN and the action-NN, and the global uncertainty is identified by the identifier. Third, a decentralized robust optimal tracking controller was designed so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two Lyapunov functions were designed to confirm the stability of ACI and of the subsystem. Finally, comparative simulation examples on two different configurations of the robot were presented to confirm the superiority of the proposed control theory.

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI-based reinforcement learning (joint 1 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI-based reinforcement learning (joint 2 position in rad vs. time in s, 0 to 10 s; desired and actual trajectories).

In future work, more complex configurations of time varying externally constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI-based reinforcement learning (joint 1 error in rad vs. time in s; vertical axis from −0.05 to 0.05 rad).

Figure 20: Tracking error curve of configuration A joint 2 with ACI-based reinforcement learning (joint 2 error in rad vs. time in s; vertical axis from −0.05 to 0.05 rad).

Figure 21: Tracking error curve of configuration B joint 1 with ACI-based reinforcement learning (joint 1 error in rad vs. time in s; vertical axis from −0.05 to 0.05 rad).


Figure 22: Tracking error curve of configuration B joint 2 with ACI-based reinforcement learning (joint 2 error in rad vs. time in s; vertical axis from −0.05 to 0.05 rad).

Figure 23: 3D-tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D-tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, "Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode," Information, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, "Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control," Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent $H_\infty$ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Network, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


Combining (35) with (38), we obtain

$$\dot{\tilde W}_{ic}(t) = -n_1L_i\left(L_i^{T}\tilde W_{ic} + \delta_{hi}\right) \tag{40}$$

The update law for the action-NN weights is developed with a gradient descent algorithm as follows:

$$\dot{\hat W}_{ia}(t) = -n_2S_{ia}(e_i)\left(\hat W_{ia}^{T}S_{ia}(e_i) + \frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\hat W_{ic}\right)^{T} \tag{41}$$
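The update law (41) is a gradient flow that drives the action-NN output $\hat W_{ia}^{T}S_{ia}(e_i)$ toward the critic-implied control $-\frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\hat W_{ic}$. The Python sketch below takes one forward-Euler step of (41) under simplifying assumptions: a scalar input, a scalar state (so $\nabla S_{ic}$ is a plain vector), fixed basis values, and a hypothetical step size `dt`. It is an illustration, not the authors' implementation.

```python
def actor_step(w_ia, s_ia, grad_s_ic, w_ic, b, r, n2, dt):
    """One forward-Euler step of the action-NN update law (41), scalar input.

    w_ia, s_ia     : action-NN weights and basis values S_ia(e_i) (same length)
    grad_s_ic, w_ic: critic basis gradient and critic weights (same length)
    """
    u_hat = sum(w * s for w, s in zip(w_ia, s_ia))                 # W_ia^T S_ia(e_i)
    u_critic = -0.5 / r * b * sum(g * w for g, w in zip(grad_s_ic, w_ic))
    residual = u_hat - u_critic  # = W_ia^T S_ia + (1/2) R^{-1} b grad(S_ic)^T W_ic
    return [w - dt * n2 * s * residual for w, s in zip(w_ia, s_ia)]


# Usage with hypothetical values: repeated steps shrink the residual geometrically
w = [0.0, 0.0]
for _ in range(200):
    w = actor_step(w, [1.0, 0.5], [1.0, 1.0], [0.5, 0.5], 2.0, 1.0, 1.0, 0.1)
print(w)
```

With fixed basis values, each step multiplies the residual by $1 - \Delta t\, n_2\|S_{ia}\|^{2}$, so this discretization converges only when $\Delta t\, n_2\|S_{ia}\|^{2} < 2$.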

According to the estimation error of the action-NN in (36), the optimal control $u_i^{*}(e_i)$ minimizes the optimal $Q$-function, which gives

$$W_{ia}^{T}S_{ia}(e_i) + \frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}W_{ic} + \frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i) + \varepsilon_{ia}(e_i) = 0 \tag{42}$$

Substituting (41) into (42), we obtain

$$\dot{\tilde W}_{ia} = -n_2S_{ia}(e_i)\left(\tilde W_{ia}^{T}S_{ia}(e_i) + \frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde W_{ic} - \frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i) - \varepsilon_{ia}(e_i)\right) \tag{43}$$

After the critic-NN and action-NN are used to estimate $Q_i$ and $u_i(e_i)$, a robust RBF-NN identifier is designed to identify the nonlinear uncertainties of the subsystem. Here, $\Phi_i(e_i, \hat u_i)$ can be expressed as

$$\Phi_i(e_i, \hat u_i) = \dot e_{iF} = W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T}e_{iF}\right) + \varepsilon_{iF}(e_{iF}) + b_i(e_i)\hat u_i \tag{44}$$

where $\kappa(\cdot)$ is the basis function of the neural network and $W_{iF}$ and $\Lambda_{iF}$ are the unknown ideal neural network weights.

The dynamics (44) can be identified by the robust RBF-NN identifier, which gives

$$\hat\Phi_i(e_i, \hat u_i) = \dot{\hat e}_{iF} = \hat W_{iF}^{T}\hat\kappa_{iF} + b_i(e_i)\hat u_i + \mu_i \tag{45}$$

Here, $\hat\kappa_{iF}$ is the estimated value of the neural network basis function, $\hat W_{iF}$ and $\hat\Lambda_{iF}$ are the estimated neural network weights, and $\mu_i \in \mathbb{R}$ is the feedback error term given by [26]

$$
\begin{aligned}
\mu_i &= k\left(e_{iF}(t) - \hat e_{iF}(t)\right) - k\left(e_{iF}(0) - \hat e_{iF}(0)\right) + \vartheta = k\left(\tilde e_{iF}(t) - \tilde e_{iF}(0)\right) + \vartheta,\\
\dot\vartheta &= (k\alpha + \gamma)\tilde e_{iF} + \beta_1\operatorname{sat}(\tilde e_{iF})
\end{aligned}
\tag{46}
$$

where $k$, $\alpha$, $\beta_1$, and $\gamma$ are positive control gain constants and $\operatorname{sat}(\cdot)$ is a saturation function. Therefore, the state estimation error of the identifier-NN can be expressed as

$$\dot{\tilde e}_{iF} = \dot e_{iF} - \dot{\hat e}_{iF} = W_{iF}^{T}\kappa_{iF} - \hat W_{iF}^{T}\hat\kappa_{iF} + \varepsilon_{iF}(e_{iF}) - \mu_i \tag{47}$$
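The feedback term (46) combines a proportional term in the estimation error with an integral state $\vartheta$, in the style of the RISE structure of [26], and (48) filters the estimation error. The Python sketch below integrates $\dot\vartheta$ with forward Euler for a scalar estimation error. It assumes $\vartheta(0) = 0$ and a unit saturation limit (the paper does not state one), and it uses the gains $k = 800$, $\alpha = 300$, $\beta_1 = 0.2$, $\gamma = 0.5$ from Table 1 on the assumption that they are the gains of (46). The class and function names are hypothetical.

```python
def sat(x, limit=1.0):
    """Saturation function; a unit limit is assumed here."""
    return max(-limit, min(limit, x))


class RiseFeedback:
    """Scalar sketch of the identifier feedback term mu_i in (46)."""

    def __init__(self, k, alpha, beta1, gamma, e_tilde0):
        self.k, self.alpha, self.beta1, self.gamma = k, alpha, beta1, gamma
        self.e_tilde0 = e_tilde0  # initial estimation error e~_iF(0)
        self.theta = 0.0          # integral state vartheta, assumed vartheta(0) = 0

    def step(self, e_tilde, dt):
        """Advance vartheta by one forward-Euler step and return mu_i."""
        dtheta = (self.k * self.alpha + self.gamma) * e_tilde + self.beta1 * sat(e_tilde)
        self.theta += dt * dtheta
        return self.k * (e_tilde - self.e_tilde0) + self.theta


def filtered_error(de_tilde, e_tilde, alpha):
    """Filtered identification error (48): E_i = d(e~_iF)/dt + alpha * e~_iF."""
    return de_tilde + alpha * e_tilde


# Gains from Table 1 (assumed to be the gains of (46))
rise = RiseFeedback(800.0, 300.0, 0.2, 0.5, e_tilde0=0.0)
print(rise.step(0.01, 0.001))
```

Because $\mu_i$ starts at zero and the initial error is subtracted, the identifier output (45) is continuous at $t = 0$.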

A filtered identification error is defined as follows

$$E_i = \dot{\tilde e}_{iF} + \alpha\tilde e_{iF} \tag{48}$$

The time derivative of the equation above is

$$
\begin{aligned}
\dot E_i ={}& W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} - \hat W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF} + \alpha\dot{\tilde e}_{iF} - kE_i - \gamma\tilde e_{iF}\\
&- \dot{\hat W}_{iF}^{T}\hat\kappa_{iF} + \dot\varepsilon_{iF}(e_{iF}) - \beta_1\operatorname{sat}(\tilde e_{iF}) - \hat W_{iF}^{T}\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^{T}\hat e_{iF}
\end{aligned}
\tag{49}
$$

Here, the weights $\hat W_{iF}$ and $\hat\Lambda_{iF}$ of the identification-NN are updated by

$$
\begin{aligned}
\dot{\hat W}_{iF} &= \operatorname{proj}\left(\Gamma_{iWF}\,\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}\tilde e_{iF}^{T}\right),\\
\dot{\hat\Lambda}_{iF} &= \operatorname{proj}\left(\Gamma_{i\Lambda F}\,\dot{\hat e}_{iF}\tilde e_{iF}^{T}\hat W_{iF}^{T}\hat\kappa'_{iF}\right)
\end{aligned}
\tag{50}
$$

where $\Gamma_{iWF}$ and $\Gamma_{i\Lambda F}$ are positive constant adaptation gain matrices. To analyze the convergence of the filtered identification error, $\tilde W_{iF}^{T}\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot e_{iF}$ can be decomposed as follows:

$$
\begin{aligned}
\tilde W_{iF}^{T}\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot e_{iF}
={}& \frac12\kappa'_{iF}\dot e_{iF}\left[\left(\Lambda_{iF}^{T}-\hat\Lambda_{iF}^{T}\right)\left(W_{iF}^{T}-\hat W_{iF}^{T}\right) + \left(W_{iF}^{T}-\hat W_{iF}^{T}\right)\left(\Lambda_{iF}^{T}-\hat\Lambda_{iF}^{T}\right)\right]\\
={}& \frac12\kappa'_{iF}\dot e_{iF}\left[\hat W_{iF}^{T}\left(\Lambda_{iF}^{T}-\hat\Lambda_{iF}^{T}\right) + \left(W_{iF}^{T}-\hat W_{iF}^{T}\right)\hat\Lambda_{iF}^{T} - W_{iF}^{T}\left(\Lambda_{iF}^{T}-\hat\Lambda_{iF}^{T}\right) - \left(W_{iF}^{T}-\hat W_{iF}^{T}\right)\Lambda_{iF}^{T}\right]\\
={}& \frac12\kappa'_{iF}\dot e_{iF}\left[\hat W_{iF}^{T}\left(\Lambda_{iF}^{T}-\hat\Lambda_{iF}^{T}\right) + \left(W_{iF}^{T}-\hat W_{iF}^{T}\right)\hat\Lambda_{iF}^{T}\right] - \frac12\kappa'_{iF}\dot e_{iF}\left[W_{iF}^{T}\left(\Lambda_{iF}^{T}-\hat\Lambda_{iF}^{T}\right) + \left(W_{iF}^{T}-\hat W_{iF}^{T}\right)\Lambda_{iF}^{T}\right]\\
={}& \frac12W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + \frac12\hat W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} - \frac12W_{iF}^{T}\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} - \frac12\hat W_{iF}^{T}\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF}\\
&+ \frac12\tilde W_{iF}^{T}\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} + \frac12\hat W_{iF}^{T}\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot e_{iF}
\end{aligned}
\tag{51}
$$

where $\tilde W_{iF} = W_{iF} - \hat W_{iF}$ and $\tilde\Lambda_{iF} = \Lambda_{iF} - \hat\Lambda_{iF}$, and the last step uses $\dot{\hat e}_{iF} = \dot e_{iF} - \dot{\tilde e}_{iF}$. Substituting (51) into (49), (49) reduces to the following form:

$$
\dot E_i = P_{F1} + P_{F2} + P_{F3} - kE_i - \gamma\tilde e_{iF} - \beta_1\,\mathrm{sat}(\tilde e_{iF}) \tag{52}
$$


In the equation above, $P_{F1}$, $P_{F2}$, and $P_{F3}$ are given, respectively, by

$$
P_{F1} = \frac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\tilde e}_{iF} + \frac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot{\tilde e}_{iF} - \hat W_{iF}^{T}\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^{T}\hat e_{iF} + \alpha\dot{\tilde e}_{iF} - \dot{\hat W}_{iF}^{T}\hat\kappa_{iF} \tag{53}
$$

$$
P_{F2} = -\frac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} - \frac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + \dot\varepsilon_{iF}(e_{iF}) \tag{54}
$$

$$
P_{F3} = \frac12\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF} + \frac12\hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF} \tag{55}
$$
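The split in (51), and therefore the grouping into $P_{F1}$, $P_{F2}$, and $P_{F3}$, relies only on $\tilde W_{iF} = W_{iF} - \hat W_{iF}$, $\tilde\Lambda_{iF} = \Lambda_{iF} - \hat\Lambda_{iF}$, and $\dot{\hat e}_{iF} = \dot e_{iF} - \dot{\tilde e}_{iF}$. The minimal Python sketch below checks this identity numerically with random matrices. The dimensions, the random seed, and the random matrix `K` standing in for $\hat\kappa'_{iF}$ are illustrative assumptions and do not come from the paper.

```python
import random

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    """Matrix-vector product A @ x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def msub(A, B):
    """Elementwise difference A - B of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def rand_mat(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def chain(W, K, Lam, v):
    """Evaluate W^T K Lam^T v for W (m x n), K (m x m), Lam (n x m), v (length n)."""
    return matvec(transpose(W), matvec(K, matvec(transpose(Lam), v)))

def split_residual(n=2, m=4, seed=0):
    """Max abs difference between the left side and the six-term right side of (51)."""
    rng = random.Random(seed)
    W, Lam = rand_mat(m, n, rng), rand_mat(n, m, rng)          # ideal weights
    W_hat, Lam_hat = rand_mat(m, n, rng), rand_mat(n, m, rng)  # current estimates
    W_til, Lam_til = msub(W, W_hat), msub(Lam, Lam_hat)        # estimation errors
    K = rand_mat(m, m, rng)        # hypothetical stand-in for kappa'_iF at the estimate
    e_dot = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    et_dot = [rng.uniform(-1.0, 1.0) for _ in range(n)]        # time derivative of tilde e
    eh_dot = [a - b for a, b in zip(e_dot, et_dot)]            # hat e dot = e dot - tilde e dot

    lhs = chain(W_hat, K, Lam_hat, eh_dot)
    terms = [
        (0.5, chain(W, K, Lam_hat, e_dot)),
        (0.5, chain(W_hat, K, Lam, e_dot)),
        (-0.5, chain(W, K, Lam_hat, et_dot)),
        (-0.5, chain(W_hat, K, Lam, et_dot)),
        (-0.5, chain(W_til, K, Lam_hat, eh_dot)),
        (-0.5, chain(W_hat, K, Lam_til, eh_dot)),
    ]
    rhs = [sum(c * t[j] for c, t in terms) for j in range(n)]
    return max(abs(a - b) for a, b in zip(lhs, rhs))

if __name__ == "__main__":
    print(split_residual())  # expected to be at floating-point round-off level
```

Because the check holds for arbitrary `K`, it shows that (51) does not depend on the particular activation function used in the identifier.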

According to Assumption 1, (48), and (50), the upper bounds of $P_{F1}$, $P_{F2}$, and $P_{F3}$ are

$$
\left\|P_{F1}\right\| \le J_1\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|,\qquad \left\|P_{F2}\right\| \le \varsigma_1,\qquad \left\|P_{F3}\right\| \le \varsigma_2 \tag{56}
$$

Combining (53) and (54) with (55), we obtain

$$
\left\|\dot P_{F2} + \dot P_{F3}\right\| \le \varsigma_3 + \varsigma_4J_2\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\| \tag{57}
$$

where $\varphi_i(\tilde e_{iF}^{T},E_i^{T}) = [\tilde e_{iF}^{T}\;\;E_i^{T}]^{T}$, $J_i(\cdot)$ is a globally invertible nondecreasing function, and $\varsigma_i$ ($i = 1,2,3,4$) are computable positive constants.
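One way to see how the update laws (50) act on the identifier weights is a single explicit-Euler step. The sketch below rests on several assumptions that are not from the paper: the activation is chosen as elementwise $\tanh$, so $\hat\kappa'_{iF}$ is the diagonal matrix $\mathrm{diag}(1 - \tanh^2(\hat\Lambda_{iF}^{T}\hat e_{iF}))$; the gains are scalar, $\Gamma = gI$; and `proj` is replaced by a hypothetical rescaling onto a Frobenius-norm ball. The paper's continuous-time projection operator is not reproduced here.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def frob(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(a * a for row in A for a in row))

def project(A, radius):
    """Hypothetical stand-in for proj(.): rescale A back onto a Frobenius-norm ball."""
    nrm = frob(A)
    if nrm <= radius:
        return A
    return [[a * radius / nrm for a in row] for row in A]

def identifier_step(W_hat, Lam_hat, e_hat, eh_dot, e_til, g_w, g_lam, dt, radius):
    """One Euler step of (50). W_hat is m x n, Lam_hat is n x m, vectors have length n."""
    z = matvec(transpose(Lam_hat), e_hat)               # Lam_hat^T e_hat (length m)
    kp = [1.0 - math.tanh(zj) ** 2 for zj in z]         # diagonal of kappa' (tanh assumed)
    # W_hat_dot = g_w * kappa' Lam_hat^T eh_dot e_til^T   -> m x n
    a = [k * v for k, v in zip(kp, matvec(transpose(Lam_hat), eh_dot))]
    W_dot = [[g_w * x for x in row] for row in outer(a, e_til)]
    # Lam_hat_dot = g_lam * eh_dot e_til^T W_hat^T kappa' = g_lam * eh_dot (kappa' W_hat e_til)^T -> n x m
    b = [k * v for k, v in zip(kp, matvec(W_hat, e_til))]
    Lam_dot = [[g_lam * x for x in row] for row in outer(eh_dot, b)]
    W_new = [[w + dt * d for w, d in zip(rw, rd)] for rw, rd in zip(W_hat, W_dot)]
    Lam_new = [[l + dt * d for l, d in zip(rl, rd)] for rl, rd in zip(Lam_hat, Lam_dot)]
    return project(W_new, radius), project(Lam_new, radius)

if __name__ == "__main__":
    W0 = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.05]]          # m = 3, n = 2 (illustrative)
    L0 = [[0.2, -0.1, 0.0], [0.1, 0.3, -0.2]]
    print(identifier_step(W0, L0, [0.1, -0.2], [0.5, 0.1], [0.05, -0.02], 10.0, 10.0, 1e-3, 5.0))
```

Both update directions are proportional to $\tilde e_{iF}^{T}$, so the weights stop adapting once the identification error vanishes. This matches the role of (50) in the proof of Theorem 5 that follows.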

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function as follows:

$$
V_{iL}(\tilde e_{iF},E_i) = \frac12E_i^{T}E_i + \frac12\gamma\tilde e_{iF}^{T}\tilde e_{iF} + \varpi_i(t) + \phi_i(t) \tag{58}
$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are defined as follows:

$$
\begin{aligned}
\dot\varpi_i(t) &= -\left[E_i^{T}\left(P_{F2} - \beta_1\,\mathrm{sat}(\tilde e_{iF})\right) + \dot{\tilde e}_{iF}^{T}P_{F3} - \beta_2J_2\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\left\|\tilde e_{iF}\right\|\right],\\
\varpi_i(0) &= \beta_1\left|\tilde e_{iF}(0)\right| - \tilde e_{iF}^{T}(0)\left(P_{F2}(0) + P_{F3}(0)\right)
\end{aligned}\tag{59}
$$

$$
\phi_i(t) = \frac{\alpha}{4}\left[\mathrm{tr}\left(\tilde W_{iF}^{T}\Gamma_{iWF}^{-1}\tilde W_{iF}\right) + \mathrm{tr}\left(\tilde\Lambda_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\tilde\Lambda_{iF}\right)\right] \tag{60}
$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Define $d = [E_i^{T}\;\;\tilde e_{iF}^{T}\;\;\varpi_i^{1/2}\;\;\phi_i^{1/2}]^{T}$, and choose the positive adaptation gains $\beta_1, \beta_2 \in \mathbb{R}$ so that $\varpi_i(t) \ge 0$. We then obtain

$$
U_1(d) \le V_{iL}(\tilde e_{iF},E_i) \le U_2(d) \tag{61}
$$

where

$$
U_1(d) = \frac12\min(1,\gamma)\left\|d\right\|^2,\qquad U_2(d) = \max(1,\gamma)\left\|d\right\|^2 \tag{62}
$$
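The sandwich bound (61)–(62) follows because $\|d\|^2 = \|E_i\|^2 + \|\tilde e_{iF}\|^2 + \varpi_i + \phi_i$ and both $\varpi_i$ and $\phi_i$ are nonnegative. The minimal sketch below spot-checks the bound on random samples. The vector dimension, sampling ranges, and seed are illustrative assumptions.

```python
import random

def lyapunov_V(E, e_til, varpi, phi, gamma):
    """V_iL from (58): 0.5 E^T E + 0.5 gamma e~^T e~ + varpi + phi."""
    return 0.5 * sum(x * x for x in E) + 0.5 * gamma * sum(x * x for x in e_til) + varpi + phi

def d_norm_sq(E, e_til, varpi, phi):
    """||d||^2 for d = [E^T, e~^T, varpi^(1/2), phi^(1/2)]^T."""
    return sum(x * x for x in E) + sum(x * x for x in e_til) + varpi + phi

def sandwich_holds(gamma, trials=2000, dim=2, seed=0):
    """Check U1(d) <= V_iL <= U2(d) from (61)-(62) on random samples."""
    rng = random.Random(seed)
    for _ in range(trials):
        E = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        e_til = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        varpi = rng.uniform(0.0, 3.0)   # nonnegative by the choice of beta_1, beta_2
        phi = rng.uniform(0.0, 3.0)     # nonnegative since Gamma^{-1} is positive definite
        d2 = d_norm_sq(E, e_til, varpi, phi)
        V = lyapunov_V(E, e_til, varpi, phi, gamma)
        U1 = 0.5 * min(1.0, gamma) * d2
        U2 = max(1.0, gamma) * d2
        if not (U1 <= V + 1e-12 and V <= U2 + 1e-12):
            return False
    return True

if __name__ == "__main__":
    print(sandwich_holds(0.5), sandwich_holds(3.0))
```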

The time derivative of (58) is given by

$$
\dot V_{iL}(\tilde e_{iF},E_i) = \nabla V_{iL}^{T}K\left[\dot E_i^{T}\;\;\dot{\tilde e}_{iF}^{T}\;\;\frac12\varpi_i^{-1/2}\dot\varpi_i\;\;\frac12\phi_i^{-1/2}\dot\phi_i\right]^{T} \tag{63}
$$

where $K[\cdot]$ denotes a Filippov set [27]. Thus, $\dot V_{iL}(\tilde e_{iF},E_i)$ can be rewritten in the following form:

$$
\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i) ={}& \left[E_i^{T}\;\;\gamma\tilde e_{iF}^{T}\;\;2\varpi_i^{1/2}\;\;2\phi_i^{1/2}\right]K\left[\dot E_i^{T}\;\;\dot{\tilde e}_{iF}^{T}\;\;\tfrac12\varpi_i^{-1/2}\dot\varpi_i\;\;\tfrac12\phi_i^{-1/2}\dot\phi_i\right]^{T}\\
\le{}& E_i^{T}\Big(\tfrac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\tilde e}_{iF} + \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot{\tilde e}_{iF} - \hat W_{iF}^{T}\hat\kappa'_{iF}\dot{\hat\Lambda}_{iF}^{T}\hat e_{iF} + \alpha\dot{\tilde e}_{iF}\\
&\quad - \tfrac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} - \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} - \gamma\tilde e_{iF} - \dot{\hat W}_{iF}^{T}\hat\kappa_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + \dot\varepsilon_{iF}(e_{iF})\\
&\quad + \tfrac12\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF} + \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF} - kE_i - \beta_1K[\mathrm{sat}(\tilde e_{iF})]\Big)\\
&- E_i^{T}\Big(-\tfrac12W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot e_{iF} - \tfrac12\hat W_{iF}^{T}\hat\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot e_{iF} + \dot\varepsilon_{iF}(e_{iF}) - \beta_1K[\mathrm{sat}(\tilde e_{iF})]\Big)\\
&- \dot{\tilde e}_{iF}^{T}\,\tfrac12\Big(\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF} + \hat W_{iF}^{T}\hat\kappa'_{iF}\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF}\Big) + \gamma\tilde e_{iF}^{T}\left(E_i - \alpha\tilde e_{iF}\right)\\
&+ \beta_2J_2\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\left\|\tilde e_{iF}\right\|\\
&- \tfrac{\alpha}{2}\mathrm{tr}\left(\tilde W_{iF}^{T}\Gamma_{iWF}^{-1}\dot{\hat W}_{iF}\right) - \tfrac{\alpha}{2}\mathrm{tr}\left(\tilde\Lambda_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\dot{\hat\Lambda}_{iF}\right)
\end{aligned}\tag{64}
$$

Substituting (53), (54), and (55) into (64) yields

$$
\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i) ={}& E_i^{T}\left(P_{F1} + P_{F2} + P_{F3} - \beta_1K[\mathrm{sat}(\tilde e_{iF})] - kE_i - \gamma\tilde e_{iF}\right) + \gamma\tilde e_{iF}^{T}\left(E_i - \alpha\tilde e_{iF}\right)\\
&- E_i^{T}\left(P_{F2} - \beta_1K[\mathrm{sat}(\tilde e_{iF})]\right) - \dot{\tilde e}_{iF}^{T}P_{F3} + \beta_2J_2\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\left\|\tilde e_{iF}\right\|\\
&- \tfrac{\alpha}{2}\mathrm{tr}\left(\tilde W_{iF}^{T}\Gamma_{iWF}^{-1}\dot{\hat W}_{iF}\right) - \tfrac{\alpha}{2}\mathrm{tr}\left(\tilde\Lambda_{iF}^{T}\Gamma_{i\Lambda F}^{-1}\dot{\hat\Lambda}_{iF}\right)\\
={}& -\alpha\gamma\tilde e_{iF}^{T}\tilde e_{iF} - kE_i^{T}E_i + E_i^{T}P_{F1} + \left(E_i^{T} - \dot{\tilde e}_{iF}^{T}\right)P_{F3} + \beta_2J_2\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\left\|\tilde e_{iF}\right\|\\
&- \tfrac{\alpha}{2}\mathrm{tr}\left(\tilde W_{iF}^{T}\hat\kappa'_{iF}\hat\Lambda_{iF}^{T}\dot{\hat e}_{iF}\tilde e_{iF}^{T}\right) - \tfrac{\alpha}{2}\mathrm{tr}\left(\tilde\Lambda_{iF}^{T}\dot{\hat e}_{iF}\tilde e_{iF}^{T}\hat W_{iF}^{T}\hat\kappa'_{iF}\right)\\
\le{}& -k_1\left\|\tilde e_{iF}\right\|^2 - k_2\left\|E_i\right\|^2 + \frac{J_1\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)^2}{4k_3}\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|^2 + \frac{\beta_2^2J_2\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)^2}{4\alpha k_4}\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|^2
\end{aligned}\tag{65}
$$

where $k_{\min} = \min\{k_1,k_2\}$, $\xi = \min\{k_3,\ \alpha k_4/\beta_2^2\}$, and $J(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|)^2 = J_1(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|)^2 + J_2(\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|)^2$. This gives the following conclusion:

$$
\begin{aligned}
\dot V_{iL}(\tilde e_{iF},E_i) &\le -k_{\min}\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|^2 + \frac{J\left(\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|\right)^2\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|^2}{4\xi}\\
&\le -c\left\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\right\|^2
\end{aligned}\tag{66}
$$
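The passage from the middle line of (65) to its final line, and then to (66), can be reproduced with Young's inequality $ab \le \epsilon a^2 + b^2/(4\epsilon)$, applied to $\|E_i\|J_1\|\varphi_i\|$ with $\epsilon = k_3$ and to $\beta_2J_2\|\varphi_i\|\|\tilde e_{iF}\|$ with $\epsilon = \alpha k_4$. The paper does not state how $k_1,\dots,k_4$ are chosen. The minimal sketch below therefore assumes $k_2 = k - k_3$ and $k_1 = \alpha(\gamma - k_4)$, and uses hypothetical linear bounding functions $J_1(s) = c_1s$ and $J_2(s) = c_2s$. It evaluates the decay margin $k_{\min} - J(\|\varphi_i\|)^2/(4\xi)$ from (66), which stays positive inside the region $D$ of (67).

```python
import math

def young_bound(a, b, eps):
    """Right-hand side of Young's inequality a*b <= eps*a^2 + b^2/(4*eps), eps > 0."""
    return eps * a * a + b * b / (4.0 * eps)

def gains(k, gamma, alpha, beta2, k3, k4):
    """Hypothetical gain split: k2 = k - k3, k1 = alpha*(gamma - k4). Returns (k_min, xi)."""
    k1 = alpha * (gamma - k4)
    k2 = k - k3
    return min(k1, k2), min(k3, alpha * k4 / beta2 ** 2)

def grouped_bound_ok(J1, J2, k3, k4, alpha, beta2):
    """Check J1^2/(4 k3) + beta2^2 J2^2/(4 alpha k4) <= (J1^2 + J2^2)/(4 xi)."""
    xi = min(k3, alpha * k4 / beta2 ** 2)
    lhs = J1 ** 2 / (4.0 * k3) + beta2 ** 2 * J2 ** 2 / (4.0 * alpha * k4)
    return lhs <= (J1 ** 2 + J2 ** 2) / (4.0 * xi) + 1e-12

def decay_margin(phi_norm, k_min, xi, c1, c2):
    """k_min - J(s)^2/(4 xi), with J(s)^2 = (c1*s)^2 + (c2*s)^2 (linear J_i assumed)."""
    J_sq = (c1 * phi_norm) ** 2 + (c2 * phi_norm) ** 2
    return k_min - J_sq / (4.0 * xi)

def region_radius(k_min, xi, c1, c2):
    """J^{-1}(2*sqrt(k_min*xi)) for J(s) = sqrt(c1^2 + c2^2)*s, i.e. the radius of D in (67)."""
    return 2.0 * math.sqrt(k_min * xi) / math.hypot(c1, c2)

if __name__ == "__main__":
    # k, alpha, beta2, gamma taken from Table 1; k3, k4, c1, c2 are illustrative choices.
    k_min, xi = gains(k=800.0, gamma=0.5, alpha=300.0, beta2=2.0, k3=400.0, k4=0.25)
    r = region_radius(k_min, xi, 1.0, 1.0)
    print(k_min, xi, r, decay_margin(0.5 * r, k_min, xi, 1.0, 1.0))
```

With these illustrative numbers, the margin is positive for $\|\varphi_i\|$ below the radius and negative above it. This is why the stability conclusion is stated only on the adjustable region $D$.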

Therefore, for some positive constant $c$, $-c\|\varphi_i(\tilde e_{iF}^{T},E_i^{T})\|^2$ is a negative semidefinite function on the adjustable region $D$, defined as follows:

$$
D = \left\{d(t)\;\middle|\;\left\|d\right\| \le J^{-1}\left(2\sqrt{k_{\min}\xi}\right)\right\} \tag{67}
$$

so Lyapunov stability theory shows that the system is stable. To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, this paper designs a novel decentralized reinforcement learning robust optimal tracking controller that uses a robust term to compensate for the neural network approximation errors. The robust control term is designed as

$$
u_{irb} = \frac{N_{rb}\,e_i}{e_i^{T}e_i + \zeta} \tag{68}
$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ is chosen to satisfy

$$
\begin{aligned}
N_{rb} \ge{}& \left[\frac{\delta_{hi}^2}{2n_1} + \frac{n_1\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2}{2n_2} + n_1n_2\left\|b_i(e_i)\right\|^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2\right]\frac{e_i^{T}e_i + \zeta}{2n_1n_2\left\|b_i(e_i)\right\|e_i^{T}e_i}\\
\ge{}& \left[n_1^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2 + n_2\delta_{hi}^2 + 2n_1^2n_2^2\left\|b_i(e_i)\right\|^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2\right]\frac{e_i^{T}e_i + \zeta}{4n_1^2n_2^2\left\|b_i(e_i)\right\|e_i^{T}e_i}
\end{aligned}\tag{69}
$$

Therefore, the global control law is designed as follows:

$$
u_{\mathrm{mix}} = u_i + u_{irb} = -\frac12R^{-1}b_i^{T}(e_i)S_{ia}(e_i)^{T}\hat W_{ia} + \frac{N_{rb}\,e_i}{e_i^{T}e_i + \zeta} \tag{70}
$$
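The robust term (68) and the mixed law (70) are simple to evaluate once the action-NN output is available. The minimal sketch below makes several assumptions. The action-NN part $u_i$ is passed in as a precomputed vector (hypothetical input). $N_{rb}$ is treated as a given scalar. The vector expression squared in (69) is replaced by a scalar norm `x`. The sketch also confirms that the two expressions for the bound in (69) are algebraically identical, and that $\zeta > 0$ keeps the robust term bounded: $\|u_{irb}\| \le N_{rb}/(2\sqrt{\zeta})$, because $\|e_i\|^2 + \zeta \ge 2\sqrt{\zeta}\|e_i\|$.

```python
import math

def n_rb_form1(delta, x, b_norm, n1, n2, s, zeta):
    """First expression of (69); x ~ norm of (-eps_ia - R^{-1} b_i grad(eps_ic)/2), s = e_i^T e_i > 0."""
    bracket = delta ** 2 / (2 * n1) + n1 * x ** 2 / (2 * n2) + n1 * n2 * b_norm ** 2 * x ** 2
    return bracket * (s + zeta) / (2 * n1 * n2 * b_norm * s)

def n_rb_form2(delta, x, b_norm, n1, n2, s, zeta):
    """Second expression of (69), obtained by multiplying the bracket through by 2*n1*n2."""
    bracket = n1 ** 2 * x ** 2 + n2 * delta ** 2 + 2 * n1 ** 2 * n2 ** 2 * b_norm ** 2 * x ** 2
    return bracket * (s + zeta) / (4 * n1 ** 2 * n2 ** 2 * b_norm * s)

def robust_term(e, n_rb, zeta):
    """u_irb = N_rb e / (e^T e + zeta), eq. (68), with N_rb a given scalar."""
    s = sum(x * x for x in e)
    return [n_rb * x / (s + zeta) for x in e]

def u_mix(u_opt, e, n_rb, zeta):
    """u_mix = u_i + u_irb, eq. (70); u_opt is the (hypothetical) precomputed action-NN part."""
    return [a + b for a, b in zip(u_opt, robust_term(e, n_rb, zeta))]

if __name__ == "__main__":
    print(n_rb_form1(0.3, 0.2, 1.5, 2.0, 4.0, 0.01, 0.1), n_rb_form2(0.3, 0.2, 1.5, 2.0, 4.0, 0.01, 0.1))
    print(u_mix([0.1, -0.2], [0.05, 0.02], 5.0, 0.04))
```

The peak of $\|u_{irb}\|$ occurs at $\|e_i\| = \sqrt{\zeta}$, so $\zeta$ trades off the strength of the compensation near the origin against the size of the control signal.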

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). Suppose the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted. Then the closed-loop system is stable, and the actual output tracks the desired trajectory asymptotically.

Proof. Design the Lyapunov function as follows:

$$
V_{iu}(e_i,u_{\mathrm{mix}}) = \frac{1}{2n_1}\mathrm{tr}\left(\tilde W_{ic}^{T}\tilde W_{ic}\right) + \frac{n_1}{2n_2}\mathrm{tr}\left(\tilde W_{ia}^{T}\tilde W_{ia}\right) + n_1n_2\left[e_{iF}^{T}e_{iF} + \Xi\int_{t}^{\infty}r_i(e_i,u_{\mathrm{mix}})\,d\tau\right] \tag{71}
$$

where $\Xi > 0$ is an undetermined parameter and $t \le \tau < \infty$. The time derivative of (71) is

$$
\begin{aligned}
\dot V_{iu}(e_i,u_{\mathrm{mix}}) ={}& \frac{1}{2n_1}\mathrm{tr}\left(\tilde W_{ic}^{T}\dot{\tilde W}_{ic}\right) + \frac{n_1}{2n_2}\mathrm{tr}\left(\tilde W_{ia}^{T}\dot{\tilde W}_{ia}\right) + n_1n_2\left[e_{iF}^{T}\dot e_{iF} + \Xi r_i(e_i,u_{\mathrm{mix}})\right]\\
={}& \frac{1}{2n_1}\mathrm{tr}\left(\tilde W_{ic}^{T}\left(-n_1L_i\left(L_i^{T}\tilde W_{ic} + \delta_{hi}\right)\right)\right)\\
&+ \frac{n_1}{2n_2}\mathrm{tr}\left(\tilde W_{ia}^{T}\left[-n_2S_{ia}(e_i)\left(\tilde W_{ia}^{T}S_{ia}(e_i) - \varepsilon_{ia}(e_i) + \frac12R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde W_{ic} - \frac12R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)\right]\right)\\
&+ n_1n_2e_{iF}^{T}\left(W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T}e_i\right) + \varepsilon_{iF}(e_i) + b_i(e_i)u_{\mathrm{mix}}\right) + \Xi\left(e_i^{T}Q_ee_i + u_{\mathrm{mix}}^{T}Ru_{\mathrm{mix}}\right)\\
\le{}& -\left(L_{im}^2 - \frac{n_1}{2}L_{iM}^2\right)\left\|\tilde W_{ic}\right\|^2 + \frac{1}{2n_1}\delta_{hi}^2 - \left(n_1S_{iam}^2 - \frac34n_1n_2S_{iaM}^2\right)\left\|\tilde W_{ia}\right\|^2\\
&+ \frac{n_1}{4n_2}\left\|R^{-1}\right\|^2\left\|b_i(e_i)\right\|^2\nabla S_{icM}^2\left\|\tilde W_{ic}\right\|^2\\
&+ \frac{n_1}{2n_2}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{T}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)\\
&+ n_1n_2\left\|b_i(e_i)\right\|^2\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i) + n_1n_2S_{iaM}^2\left\|b_i(e_i)\right\|^2\left\|\tilde W_{ia}\right\|^2\\
&+ n_1n_2\Xi\lambda_{\min}(Q_e)\left\|e_{iF}\right\|^2 + n_1n_2\left(\left\|b_i(e_i)\right\|^2 - \Xi\lambda_{\min}(R)\right)\left\|u_{\mathrm{mix}}\right\|^2
\end{aligned}\tag{72}
$$

If the following inequalities hold:

$$
\begin{aligned}
&\lambda_{\min}(Q_e)\left\|e_{iF}\right\|_2^2 \le e_{iF}^{T}Q_ee_{iF} \le \lambda_{\max}(Q_e)\left\|e_{iF}\right\|_2^2,\\
&\lambda_{\min}(R)\left\|u_{\mathrm{mix}}\right\|_2^2 \le u_{\mathrm{mix}}^{T}Ru_{\mathrm{mix}} \le \lambda_{\max}(R)\left\|u_{\mathrm{mix}}\right\|_2^2,\\
&\Xi > \frac{\left\|b_i(e_i)\right\|^2}{\lambda_{\min}(R)}
\end{aligned}\tag{73}
$$

then $\dot V_{iu}(e_i,u_{\mathrm{mix}})$ can be further bounded as

$$
\begin{aligned}
\dot V_{iu}(e_i,u_{\mathrm{mix}}) \le{}& -\left(L_{im}^2 - \frac{n_1}{2}L_{iM}^2 - \frac{n_1}{4n_2}\left\|R^{-1}\right\|^2\left\|b_i(e_i)\right\|^2\nabla S_{icM}^2\right)\left\|\tilde W_{ic}\right\|^2\\
&- \left(n_1S_{iam}^2 - \frac34n_1n_2S_{iaM}^2 - n_1n_2S_{iaM}^2\left\|b_i(e_i)\right\|^2\right)\left\|\tilde W_{ia}\right\|^2\\
&- n_1n_2\left(\left\|b_i(e_i)\right\|^2 + \Xi\lambda_{\min}(R)\right)\left\|u_{\mathrm{mix}}\right\|^2 - n_1n_2\Xi\lambda_{\min}(Q_e)\left\|e_{iF}\right\|^2\\
\le{}& -n_1n_2\Xi\lambda_{\min}(Q_e)\left\|e_{iF}\right\|^2
\end{aligned}\tag{74}
$$

Therefore, we conclude that $\dot V_{iu}(e_i,u_{\mathrm{mix}}) < 0$ whenever $e_{iF} \neq 0$.
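The third condition in (73) gives a concrete lower bound on the free parameter $\Xi$, and the first two conditions are the Rayleigh-quotient bounds for a symmetric positive definite matrix. The minimal sketch below computes the bound for a hypothetical $2\times2$ weighting matrix $R$ and a given value of $\|b_i(e_i)\|$, and spot-checks the Rayleigh bounds. The matrix, the norm value, and the closed-form $2\times2$ eigenvalue helper are illustrative assumptions.

```python
import math
import random

def sym2_eigs(a, c, d):
    """Eigenvalues (min, max) of the symmetric 2x2 matrix [[a, c], [c, d]]."""
    mean = 0.5 * (a + d)
    rad = math.hypot(0.5 * (a - d), c)
    return mean - rad, mean + rad

def xi_lower_bound(b_norm, R):
    """Right-hand side of the third condition in (73): ||b_i(e_i)||^2 / lambda_min(R)."""
    lam_min, _ = sym2_eigs(R[0][0], R[0][1], R[1][1])
    if lam_min <= 0:
        raise ValueError("R must be positive definite")
    return b_norm ** 2 / lam_min

def quad(R, u):
    """Quadratic form u^T R u for a 2x2 matrix R."""
    return sum(u[i] * R[i][j] * u[j] for i in range(2) for j in range(2))

def rayleigh_bounds_hold(R, trials=1000, seed=0):
    """Check lambda_min ||u||^2 <= u^T R u <= lambda_max ||u||^2 on random u."""
    lo, hi = sym2_eigs(R[0][0], R[0][1], R[1][1])
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)]
        n2 = u[0] ** 2 + u[1] ** 2
        if not (lo * n2 - 1e-12 <= quad(R, u) <= hi * n2 + 1e-12):
            return False
    return True

if __name__ == "__main__":
    R = [[2.0, 0.5], [0.5, 1.0]]          # hypothetical control weighting
    print(xi_lower_bound(1.2, R), rayleigh_bounds_hold(R))
```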

4. Simulations

To verify the validity and convergence of the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method, and to study the convergence of the tracking error by comparing simulation results, this paper uses two different configurations of the reconfigurable modular robot with time varying external constraints, shown in Figures 3 and 4.

To simplify the analysis of these configurations, we transform them into the analytic charts shown in Figures 5 and 6. Here $L_1$, $L_2$, and $L_4$ are the link lengths, and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint can be modeled as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are as follows:

$$
\begin{aligned}
\Psi_A(q,t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right],\\
\Psi_B(q,t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right]
\end{aligned}\tag{75}
$$

In the equation above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as follows:

$$
\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2} \tag{76}
$$
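The constraint functions (75) and the constraint angle (76) can be evaluated directly. In the minimal sketch below, the link lengths are hypothetical placeholders (the paper does not list them in this section), and (76) is read as $0.75\pi + 0.2\sin(t/2)$. $\Psi_B$ coincides with $\Psi_A$ evaluated at $q_1 = 0$, which is consistent with joint 1 staying fixed in configuration B.

```python
import math

def alpha(t):
    """Angle between the time varying constraint and the X-axis, eq. (76)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def psi_A(q1, q2, t, L1, L2, L3, L4):
    """Constraint equation of configuration A, eq. (75); cot(a) computed as 1/tan(a)."""
    return L1 * math.cos(q1) + L2 * math.cos(q2) - (L3 + L4 / math.tan(alpha(t)))

def psi_B(q2, t, L1, L2, L3, L4):
    """Constraint equation of configuration B, eq. (75)."""
    return L1 + L2 * math.cos(q2) - (L3 + L4 / math.tan(alpha(t)))

if __name__ == "__main__":
    L1, L2, L3, L4 = 0.3, 0.6, 0.4, 0.2   # hypothetical link lengths (m)
    for t in (0.0, 2.5, 5.0):
        print(t, alpha(t), psi_A(0.2, 1.5, t, L1, L2, L3, L4), psi_B(1.5, t, L1, L2, L3, L4))
```

Because $\alpha(t)$ stays within $0.75\pi \pm 0.2$, $\tan\alpha(t)$ never vanishes, so the cotangent in (75) is always well defined.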

The initial positions of the joint modules are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B. The initial joint velocities are zero. The dynamic models of configurations A and B are as follows:

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos q_2 + 0.6066 & 0.18\cos q_2 + 0.1233\\ 0.18\cos q_2 + 0.1233 & 0.1233\end{bmatrix},\\
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^2 q_2 & -0.06\cos q_2\\ -0.06\cos q_2 & 0.1233\end{bmatrix},\\
C_A(q,\dot q) &= \begin{bmatrix} -0.36\sin q_2\,\dot q_2 & -0.18\sin q_2\,\dot q_2\\ 0.18\sin q_2\,(\dot q_1 - \dot q_2) & 0.18\sin q_2\,\dot q_1\end{bmatrix},\\
C_B(q,\dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot q_2 & 0.06\sin q_2\,\dot q_2\\ 0.06\sin q_2\,\dot q_2 & 0\end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin q_1\\ -5.88\sin(q_1 + q_2)\end{bmatrix},\qquad
G_B(q) = \begin{bmatrix} 0\\ -5.88\cos q_2\end{bmatrix},\\
F_A(q,\dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1)\\ 12\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2)\end{bmatrix},\qquad
F_B(q,\dot q) = \begin{bmatrix} 0\\ 15\dot q_2 + \sin q_2 + 12\,\mathrm{sgn}(\dot q_2)\end{bmatrix}
\end{aligned}\tag{77}
$$
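A quick sanity check on a robot model is that its inertia matrix is symmetric positive definite for every joint angle. The minimal sketch below encodes only the inertia and gravity terms of (77), with coefficients 0.36, 0.6066, 0.18, 0.1233, 0.17, 0.1166, 0.06, 5.88, and 17.64. The Coriolis and friction terms are omitted, and the function names are illustrative. Using Sylvester's criterion, it checks that $M_A$ and $M_B$ are positive definite over a grid of $q_2$.

```python
import math

def M_A(q2):
    """Inertia matrix of configuration A from (77)."""
    c = math.cos(q2)
    return [[0.36 * c + 0.6066, 0.18 * c + 0.1233],
            [0.18 * c + 0.1233, 0.1233]]

def M_B(q2):
    """Inertia matrix of configuration B from (77)."""
    c = math.cos(q2)
    return [[0.17 - 0.1166 * c * c, -0.06 * c],
            [-0.06 * c, 0.1233]]

def G_A(q1, q2):
    """Gravity vector of configuration A from (77)."""
    s12 = math.sin(q1 + q2)
    return [-5.88 * s12 - 17.64 * math.sin(q1), -5.88 * s12]

def G_B(q2):
    """Gravity vector of configuration B from (77)."""
    return [0.0, -5.88 * math.cos(q2)]

def is_pos_def_2x2(M):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0

if __name__ == "__main__":
    grid = [0.05 * k for k in range(-126, 127)]   # q2 roughly in [-2*pi, 2*pi]
    print(all(is_pos_def_2x2(M_A(q)) and is_pos_def_2x2(M_B(q)) for q in grid))
```

For $M_A$, the determinant simplifies to $0.05959 - 0.0324\cos^2 q_2$, which stays positive for every $q_2$.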

The desired trajectories of configurations A and B are as follows. Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos t + 0.2\sin 3t,\\
y_{2d} &= \Theta(y_{1d},t) = \arcsin\left[\frac{L_1\sin\left(\alpha(t) - y_{1d}\right) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t)
\end{aligned}\tag{78}
$$


Figure 3 Configuration A for simulation

Figure 4 Configuration B for simulation

Configuration B

$$
\begin{aligned}
y_{1d} &= 0,\\
y_{2d} &= \Theta(y_{1d},t) = \arcsin\left[\frac{L_1\sin\alpha(t) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t)
\end{aligned}\tag{79}
$$
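Both desired trajectories share the same map $\Theta$: setting $y_{1d} = 0$ in the configuration A formula (78) gives exactly the configuration B formula (79). The minimal sketch below generates both references. The link lengths are hypothetical placeholders, chosen so that the $\arcsin$ argument stays inside $[-1, 1]$ over the 10 s horizon. With other lengths, the argument can leave that interval, so the code raises an error instead of returning NaN.

```python
import math

def alpha(t):
    """Constraint angle from (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def theta(y1d, t, L1, L2, L3):
    """Theta(y1d, t) from (78); with y1d = 0 it reduces to (79)."""
    a = alpha(t)
    arg = (L1 * math.sin(a - y1d) - L3 * math.sin(a)) / L2
    if abs(arg) > 1.0:
        raise ValueError("arcsin argument outside [-1, 1]; check the link lengths")
    return math.asin(arg) + a

def desired_A(t, L1, L2, L3):
    """(y1d, y2d) for configuration A, eq. (78)."""
    y1 = 0.5 * math.cos(t) + 0.2 * math.sin(3 * t)
    return y1, theta(y1, t, L1, L2, L3)

def desired_B(t, L1, L2, L3):
    """(y1d, y2d) for configuration B, eq. (79): joint 1 stays at zero."""
    return 0.0, theta(0.0, t, L1, L2, L3)

if __name__ == "__main__":
    L1, L2, L3 = 0.3, 0.6, 0.4   # hypothetical link lengths (m)
    for t in (0.0, 1.0, 5.0, 10.0):
        print(t, desired_A(t, L1, L2, L3), desired_B(t, L1, L2, L3))
```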

Because of the time varying external constraint, the displacement of joint 1 in configuration B is zero.

This part checks whether the adopted method applies to different configurations, and how well each subsystem tracks its desired trajectory under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method. To do so, it presents comparative simulations of the classic RBF neural network control method and the ACI based decentralized reinforcement learning robust optimal tracking control method.

Figures 7–14 show the joint tracking and error curves obtained with a classic RBF neural network [4], which compensates for the effect of the dynamic nonlinear term and the interconnection term in the subsystem. Figures 7 to 10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11–14 show that, when time varying constraints exist, the classical neural control method cannot adequately compensate for the joint subsystem constraints of the reconfigurable modular robot. As the joint output variables become larger, the reconfigurable modular system loses robustness and the tracking errors grow.

Figure 5: The analytic chart of configuration A (link lengths $L_1$, $L_2$, $L_4$; distance $L_3$; joint angles $q_1$, $q_2$; constraint angle $\alpha$ with respect to the $X$-axis).

Figure 6: The analytic chart of configuration B (same notation as Figure 5).

Figures 15–22 show the joint tracking and error curves obtained by using ACI to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in the subsystem. Figures 15 to 18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the identification capability of ACI, which identifies the uncertainties contained in the subsystems in a short time. The error curves in Figures 19–22 show that the tracking error is very small and converges to zero in a short time. In addition, the joint subsystems exhibit robustness when the proposed decentralized control method is adopted.

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. These two figures show that the proposed decentralized control method fully satisfies the accessibility requirement of the reconfigurable modular robot, and that no singular displacements of the joint subsystems or the end-effector appear. The parameters used in ACI are listed in Table 1.

Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position (rad) versus time (s); desired and actual trajectories).

Table 1: Parameter list of action-critic-identifier.

| Parameter | $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
| Value | 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |

The simulation results show that, compared with the classic RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. In each configuration, the joint variables track the desired trajectories within a very short time, and the error fluctuates only minimally within its convergence range.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position (rad) versus time (s); desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error (rad) versus time (s)).

Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error (rad) versus time (s)).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error (rad) versus time (s)).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error (rad) versus time (s)).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 position (rad) versus time (s); desired and actual trajectories).

5. Conclusions and Future Work

In this paper, ACI is combined with an RBF neural network to propose a novel decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots. This theory addresses the problem of finding a continuous time nonlinear optimal control policy for strongly coupled robotic systems with uncertainty. First, we build the model of the subsystem with time varying external constraints and describe the global robot system as a synthesis of interconnected subsystems. Second, ACI is used to estimate the HJB equation and the global uncertainty. A continuous time optimal $Q$-function replaces the traditional optimal value function. The critic-NN approximates the optimal $Q$-function and the action-NN approximates the optimal control policy, while the identifier identifies the global uncertainty. Third, we design a novel decentralized robust optimal tracking controller so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two Lyapunov functions are designed to confirm the stability of ACI and of the subsystem. Finally, comparative simulation examples on two different configurations of the robot confirm the superiority of the proposed control theory.

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 position (rad) versus time (s); desired and actual trajectories).

Future work will address more complex configurations of reconfigurable modular robots with time varying external constraints. Decentralized controllers with higher control and error precision will also be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI-based reinforcement learning (joint 1 error (rad) versus time (s)).

Figure 20: Tracking error curve of configuration A joint 2 with ACI-based reinforcement learning (joint 2 error (rad) versus time (s)).

Figure 21: Tracking error curve of configuration B joint 1 with ACI-based reinforcement learning (joint 1 error (rad) versus time (s)).

Figure 22: Tracking error curve of configuration B joint 2 with ACI-based reinforcement learning (joint 2 error (rad) versus time (s)).

Figure 23: 3D tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, “Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode,” Information, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, “Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control,” Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, “Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic,” Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, “Decentralized adaptive neural network control for reconfigurable manipulators,” in Proceedings of the Chinese Control and Decision Conference (CCDC ’10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, “A new distributed control scheme of modular and reconfigurable robots,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA ’07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, “Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator,” Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, “Lebesgue-sampling-based optimal control problems with time aggregation,” IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, “Efficient reinforcement learning using recursive least-squares methods,” Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, “Homogenous polynomially parameter-dependent $H_\infty$ filter designs of discrete-time fuzzy systems,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, “Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator,” in The International Joint Conference on Neural Networks, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, “Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, “Neural network control of a robot interacting with an uncertain viscoelastic environment,” IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, “A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm,” in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM ’11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, “A single network adaptive critic-based redundancy resolution scheme for robot manipulators,” IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, “Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems,” IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, “Deterministic learning based adaptive network control of robot in task space,” Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, “Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure,” IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, “A calculus for computing Filippov’s differential inclusion with application to the variable structure control of robot manipulators,” IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


Among the equations above, $P_{F1}$, $P_{F2}$, and $P_{F3}$ can be expressed, respectively, as follows:

$$
P_{F1}=\frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{iF}+\frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF}-\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\dot{\hat{\Lambda}}_{iF}^{T}\hat{e}_{i}+\alpha\dot{e}_{iF}-\dot{\hat{W}}_{iF}^{T}\hat{\kappa}_{iF}\tag{53}
$$

$$
P_{F2}=-\frac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{i}-\frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{i}+W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{i}+\dot{\varepsilon}_{iF}(e_{i})\tag{54}
$$

$$
P_{F3}=\frac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}+\frac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}\tag{55}
$$

According to Assumption 1, (48), and (50), the upper bounds of $P_{F1}$, $P_{F2}$, and $P_{F3}$ are

$$
\|P_{F1}\|\le J_1\big(\|\varphi_i(e_{iF}^{T},E_i^{T})\|\big)\,\|\varphi_i(e_{iF}^{T},E_i^{T})\|,\qquad \|P_{F2}\|\le\varsigma_1,\qquad \|P_{F3}\|\le\varsigma_2\tag{56}
$$

Combining (53) and (54) with (55), we obtain

$$
\|\dot{P}_{F2}+\dot{P}_{F3}\|\le\varsigma_3+\varsigma_4J_2\big(\|\varphi_i(e_{iF}^{T},E_i^{T})\|\big)\,\|\varphi_i(e_{iF}^{T},E_i^{T})\|\tag{57}
$$

where $\varphi_i(e_{iF}^{T},E_i^{T})=[e_{iF}^{T}\ \ E_i^{T}]^{T}$ (written $\varphi_i$ below), $J_i(\cdot)$ is a globally invertible nondecreasing function, and $\varsigma_i$ $(i=1,2,3,4)$ are computable positive constants.

Theorem 5. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14) and the state equation (18). If the designed identifier and the corresponding weight update laws are adopted, then the global uncertainty of the subsystem, which depends explicitly on the error term, can be identified, and the identification error converges and remains bounded.

Proof. Define the Lyapunov function as follows:

$$
V_{iL}(e_{iF},E_i)=\frac{1}{2}E_i^{T}E_i+\frac{1}{2}\gamma e_{iF}^{T}e_{iF}+\varpi_i(t)+\phi_i(t)\tag{58}
$$

In the equation above, $\varpi_i(t)$ and $\phi_i(t)$ are given by

$$
\begin{aligned}
\dot{\varpi}_i(t)&=-\Big[E_i^{T}\big(P_{F2}-\beta_1\,\mathrm{sat}(e_{iF})\big)+\dot{e}_{iF}^{T}P_{F3}-\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\|e_{iF}\|\Big],\\
\varpi_i(0)&=\beta_1|e_{iF}(0)|-e_{iF}^{T}(0)\big(P_{F2}(0)+P_{F3}(0)\big)
\end{aligned}\tag{59}
$$

$$
\phi_i(t)=\frac{1}{4\alpha}\Big[\mathrm{tr}\big(\tilde{W}_{iF}^{T}\Gamma_{iW_F}^{-1}\tilde{W}_{iF}\big)+\mathrm{tr}\big(\tilde{\Lambda}_{iF}^{T}\Gamma_{i\Lambda_F}^{-1}\tilde{\Lambda}_{iF}\big)\Big]\tag{60}
$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Define $d=[E_i^{T}\ \ e_{iF}^{T}\ \ \varpi_i^{1/2}\ \ \phi_i^{1/2}]^{T}$, and let $\beta_1,\beta_2\in\mathbb{R}$ be positive adaptation gains chosen to ensure $\varpi_i(t)\ge 0$; then

$$
U_1(d)\le V_{iL}(e_{iF},E_i)\le U_2(d)\tag{61}
$$

where

$$
U_1(d)=\frac{1}{2}\min(1,\gamma)\|d\|^{2},\qquad U_2(d)=\max(1,\gamma)\|d\|^{2}\tag{62}
$$
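A quick numerical check makes the sandwich bound (61) and (62) concrete. The Python sketch below evaluates $V_{iL}$ from (58) and $U_1$, $U_2$ from (62) for randomly sampled states. It treats $\varpi_i$ and $\phi_i$ as given nonnegative numbers (in the design they come from (59) and (60)); the two-dimensional vectors, the sampling ranges, and the function names are illustrative assumptions, not part of the original design.

```python
import random

def lyapunov_v(E, e, varpi, phi, gamma):
    """V_iL from (58): 0.5*E^T E + 0.5*gamma*e^T e + varpi + phi."""
    return 0.5 * sum(x * x for x in E) + 0.5 * gamma * sum(x * x for x in e) + varpi + phi

def sandwich_bounds(E, e, varpi, phi, gamma):
    """U1(d), U2(d) from (62), with d = [E^T, e^T, varpi^(1/2), phi^(1/2)]^T."""
    # ||d||^2 = ||E||^2 + ||e||^2 + varpi + phi, since (varpi^(1/2))^2 = varpi
    d_sq = sum(x * x for x in E) + sum(x * x for x in e) + varpi + phi
    return 0.5 * min(1.0, gamma) * d_sq, max(1.0, gamma) * d_sq

# Hypothetical random check: two-dimensional E_i and e_iF, arbitrary gamma > 0.
random.seed(0)
for _ in range(1000):
    gamma = random.uniform(0.01, 10.0)
    E = [random.uniform(-5.0, 5.0) for _ in range(2)]
    e = [random.uniform(-5.0, 5.0) for _ in range(2)]
    varpi, phi = random.uniform(0.0, 5.0), random.uniform(0.0, 5.0)
    U1, U2 = sandwich_bounds(E, e, varpi, phi, gamma)
    assert U1 <= lyapunov_v(E, e, varpi, phi, gamma) <= U2
```

The check always passes because each coefficient in (58), namely $1/2$, $\gamma/2$, $1$, and $1$, lies between $\frac{1}{2}\min(1,\gamma)$ and $\max(1,\gamma)$.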

The time derivative of (58) is

$$
\dot{V}_{iL}(e_{iF},E_i)=\nabla V_{iL}^{T}\,K\Big[\dot{E}_i^{T}\ \ \dot{e}_{iF}^{T}\ \ \tfrac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i\ \ \tfrac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\Big]^{T}\tag{63}
$$

where $K[\cdot]$ denotes a Filippov set [27]. Thus $\dot{V}_{iL}(e_{iF},E_i)$ can be rewritten as

$$
\begin{aligned}
\dot{V}_{iL}(e_{iF},E_i)&=\Big[E_i^{T}\ \ \gamma e_{iF}^{T}\ \ 2\varpi_i^{1/2}\ \ 2\phi_i^{1/2}\Big]K\Big[\dot{E}_i^{T}\ \ \dot{e}_{iF}^{T}\ \ \tfrac{1}{2}\varpi_i^{-1/2}\dot{\varpi}_i\ \ \tfrac{1}{2}\phi_i^{-1/2}\dot{\phi}_i\Big]^{T}\\
&\le E_i^{T}\Big(\tfrac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{iF}+\tfrac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{iF}-\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\dot{\hat{\Lambda}}_{iF}^{T}\hat{e}_{i}+\alpha\dot{e}_{iF}-\dot{\hat{W}}_{iF}^{T}\hat{\kappa}_{iF}\\
&\qquad-\tfrac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{i}-\tfrac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{i}+W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{i}+\dot{\varepsilon}_{iF}(e_{i})\\
&\qquad+\tfrac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}+\tfrac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}-kE_i-\gamma e_{iF}-\beta_1K[\mathrm{sat}(e_{iF})]\Big)\\
&\quad-E_i^{T}\Big(-\tfrac{1}{2}W_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{e}_{i}-\tfrac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\Lambda_{iF}^{T}\dot{e}_{i}+W_{iF}^{T}\kappa'_{iF}\Lambda_{iF}^{T}\dot{e}_{i}+\dot{\varepsilon}_{iF}(e_{i})-\beta_1K[\mathrm{sat}(e_{iF})]\Big)\\
&\quad-\dot{e}_{iF}^{T}\Big(\tfrac{1}{2}\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}+\tfrac{1}{2}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}\Big)+\gamma e_{iF}^{T}(E_i-\alpha e_{iF})\\
&\quad+\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\|e_{iF}\|-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde{W}_{iF}^{T}\Gamma_{iW_F}^{-1}\dot{\hat{W}}_{iF}\big)-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde{\Lambda}_{iF}^{T}\Gamma_{i\Lambda_F}^{-1}\dot{\hat{\Lambda}}_{iF}\big)
\end{aligned}\tag{64}
$$

Substituting (53), (54), and (55) into (64), we obtain

$$
\begin{aligned}
\dot{V}_{iL}(e_{iF},E_i)&=E_i^{T}\big(P_{F1}+P_{F2}+P_{F3}-\beta_1K[\mathrm{sat}(e_{iF})]-kE_i-\gamma e_{iF}\big)+\gamma e_{iF}^{T}(E_i-\alpha e_{iF})\\
&\quad-E_i^{T}\big(P_{F2}-\beta_1K[\mathrm{sat}(e_{iF})]\big)-\dot{e}_{iF}^{T}P_{F3}+\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\|e_{iF}\|\\
&\quad-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde{W}_{iF}^{T}\Gamma_{iW_F}^{-1}\dot{\hat{W}}_{iF}\big)-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde{\Lambda}_{iF}^{T}\Gamma_{i\Lambda_F}^{-1}\dot{\hat{\Lambda}}_{iF}\big)\\
&=-\alpha\gamma e_{iF}^{T}e_{iF}-kE_i^{T}E_i+E_i^{T}P_{F1}+(E_i-\dot{e}_{iF})^{T}P_{F3}+\beta_2J_2(\|\varphi_i\|)\|\varphi_i\|\|e_{iF}\|\\
&\quad-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde{W}_{iF}^{T}\hat{\kappa}'_{iF}\hat{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}e_{iF}^{T}\big)-\frac{1}{2\alpha}\mathrm{tr}\big(\tilde{\Lambda}_{iF}^{T}\dot{\hat{e}}_{i}e_{iF}^{T}\hat{W}_{iF}^{T}\hat{\kappa}'_{iF}\big)\\
&\le-k_1\|e_{iF}\|^{2}-k_2\|E_i\|^{2}+\frac{J_1^{2}(\|\varphi_i\|)}{4k_3}\|\varphi_i\|^{2}+\frac{\beta_2^{2}J_2^{2}(\|\varphi_i\|)}{4\alpha k_4}\|\varphi_i\|^{2}
\end{aligned}\tag{65}
$$

where $k_{\min}=\min\{k_1,k_2\}$, $\xi=\min\{k_3,\ \alpha k_4/\beta_2^{2}\}$, and $J^{2}(\|\varphi_i\|)=J_1^{2}(\|\varphi_i\|)+J_2^{2}(\|\varphi_i\|)$, so the following conclusion can be obtained:

$$
\dot{V}_{iL}(e_{iF},E_i)\le-k_{\min}\|\varphi_i\|^{2}+\frac{J^{2}(\|\varphi_i\|)\,\|\varphi_i\|^{2}}{4\xi}\le-c\|\varphi_i\|^{2}\tag{66}
$$
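The region in (67) follows because the middle expression in (66) is negative exactly when $J(\|\varphi_i\|)<2\sqrt{k_{\min}\xi}$. The Python sketch below checks this numerically for an illustrative choice $J(s)=1+s$ and arbitrary constants $k_{\min}$ and $\xi$; none of these values come from the paper, and the function names are hypothetical.

```python
import math

def vdot_bound(s, J, k_min, xi):
    """Middle expression in (66) as a function of s = ||phi_i||."""
    return -k_min * s ** 2 + J(s) ** 2 * s ** 2 / (4.0 * xi)

def region_radius(J_inv, k_min, xi):
    """Radius of the set D in (67): J^{-1}(2 * sqrt(k_min * xi))."""
    return J_inv(2.0 * math.sqrt(k_min * xi))

# Hypothetical choices, not taken from the paper.
def J(s):
    """Globally invertible and nondecreasing on s >= 0."""
    return 1.0 + s

def J_inv(y):
    return y - 1.0

k_min, xi = 2.0, 3.0
r = region_radius(J_inv, k_min, xi)          # 2*sqrt(6) - 1
inside = [vdot_bound(f * r, J, k_min, xi) for f in (0.1, 0.5, 0.9)]
outside = vdot_bound(1.1 * r, J, k_min, xi)  # J(s) now exceeds 2*sqrt(k_min*xi)
```

Because $J$ is nondecreasing, for any radius $s_{\max}$ strictly inside the boundary one may take $c=k_{\min}-J^{2}(s_{\max})/(4\xi)>0$ on $\|\varphi_i\|\le s_{\max}$.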

Therefore, for some positive constant $c$, $-c\|\varphi_i\|^{2}$ is a negative semidefinite function on the adjustable region $D$ defined as

$$
D=\Big\{d(t)\ \Big|\ \|d\|\le J^{-1}\big(2\sqrt{k_{\min}\xi}\big)\Big\}\tag{67}
$$

so Lyapunov stability theory shows that the system is stable. To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, a novel decentralized reinforcement learning robust optimal tracking controller is designed in this paper, using a robust term to compensate for the neural network approximation errors. The robust control term is designed as

$$
u_{irb}=\frac{N_{rb}\,e_i}{e_i^{T}e_i+\zeta}\tag{68}
$$

In the equation above, $\zeta>0$ is a constant, and $N_{rb}$ is chosen to satisfy

$$
\begin{aligned}
N_{rb}&\ge\left[\frac{\delta_{hi}^{2}}{2n_1}+\frac{n_1}{2n_2}\Big(-\varepsilon_{ia}(e_i)-\frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\Big)^{2}+n_1n_2\|b_i(e_i)\|^{2}\Big(-\varepsilon_{ia}(e_i)-\frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\Big)^{2}\right]\frac{e_i^{T}e_i+\zeta}{2n_1n_2\|b_i(e_i)\|\,e_i^{T}e_i}\\
&\ge\left[n_1^{2}\Big(-\varepsilon_{ia}(e_i)-\frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\Big)^{2}+n_2\delta_{hi}^{2}+2n_1^{2}n_2^{2}\|b_i(e_i)\|^{2}\Big(-\varepsilon_{ia}(e_i)-\frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\Big)^{2}\right]\frac{e_i^{T}e_i+\zeta}{4n_1^{2}n_2^{2}\|b_i(e_i)\|\,e_i^{T}e_i}
\end{aligned}\tag{69}
$$
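Multiplying the bracket in the first line of (69) by $2n_1n_2/(2n_1n_2)$ gives the second line, so the two expressions actually coincide. The Python sketch below evaluates both for sample values. It assumes the squared term $(-\varepsilon_{ia}-R^{-1}b_i\nabla\varepsilon_{ic}/2)^2$ is available as a scalar (for example, a squared Euclidean norm); all numbers and function names are hypothetical.

```python
import math

def n_rb_bound_form1(delta_h, x_sq, b_norm, e_sq, n1, n2, zeta):
    """First line of (69). x_sq stands for the squared term
    (-eps_ia - R^{-1} b_i grad(eps_ic) / 2)^2, treated here as a scalar."""
    bracket = delta_h ** 2 / (2 * n1) + n1 * x_sq / (2 * n2) + n1 * n2 * b_norm ** 2 * x_sq
    return bracket * (e_sq + zeta) / (2 * n1 * n2 * b_norm * e_sq)

def n_rb_bound_form2(delta_h, x_sq, b_norm, e_sq, n1, n2, zeta):
    """Second line of (69): the same bound over the common denominator 4 n1^2 n2^2."""
    bracket = n1 ** 2 * x_sq + n2 * delta_h ** 2 + 2 * n1 ** 2 * n2 ** 2 * b_norm ** 2 * x_sq
    return bracket * (e_sq + zeta) / (4 * n1 ** 2 * n2 ** 2 * b_norm * e_sq)

# Hypothetical values; e_sq = e_i^T e_i must be positive,
# because the bound grows like 1 / (e_i^T e_i) near the origin.
params = dict(delta_h=0.3, x_sq=0.05, b_norm=1.5, e_sq=0.2, n1=2.0, n2=0.5, zeta=0.01)
f1 = n_rb_bound_form1(**params)
f2 = n_rb_bound_form2(**params)
assert math.isclose(f1, f2)  # the two lines of (69) agree
```

The factor $(e_i^{T}e_i+\zeta)/(e_i^{T}e_i)$ shows why the bound on $N_{rb}$ becomes large as the tracking error approaches zero.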

Therefore, the global control law can be designed as follows:

$$
u_{mix}=u_i+u_{irb}=-\frac{1}{2}R^{-1}b_i^{T}(e_i)\,S_{ia}(e_i)^{T}\hat{W}_{ia}+\frac{N_{rb}\,e_i}{e_i^{T}e_i+\zeta}\tag{70}
$$
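The constant $\zeta>0$ in the robust part of (70) keeps that term smooth and bounded: since $e_i^{T}e_i+\zeta\ge 2\sqrt{\zeta}\,\|e_i\|$, the robust term never exceeds $N_{rb}/(2\sqrt{\zeta})$ in norm for a fixed gain, with equality at $\|e_i\|=\sqrt{\zeta}$. The Python sketch below illustrates this with a hypothetical fixed $N_{rb}$; in (69) $N_{rb}$ depends on $e_i$, so the sketch only shows the shape of (68), not the full controller.

```python
import math

def robust_term(e, n_rb, zeta):
    """Robust control term (68): u_irb = N_rb * e / (e^T e + zeta), with zeta > 0."""
    denom = sum(x * x for x in e) + zeta
    return [n_rb * x / denom for x in e]

def robust_term_norm_bound(n_rb, zeta):
    """Upper bound on ||u_irb|| for a fixed gain: e^T e + zeta >= 2 sqrt(zeta) ||e||."""
    return n_rb / (2.0 * math.sqrt(zeta))

# Hypothetical fixed gain and smoothing constant.
n_rb, zeta = 5.0, 0.04
peak = robust_term([math.sqrt(zeta), 0.0], n_rb, zeta)  # ||e|| = sqrt(zeta) attains the bound
```

Unlike a sign-function robust term, this form vanishes at $e_i=0$ and has no discontinuity there.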

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the desired trajectory can be tracked asymptotically by the actual output.

Proof. Design the Lyapunov function as follows:

$$
V_{iu}(e_i,u_{mix})=\frac{1}{2n_1}\mathrm{tr}\{\tilde{W}_{ic}^{T}\tilde{W}_{ic}\}+\frac{n_1}{2n_2}\mathrm{tr}\{\tilde{W}_{ia}^{T}\tilde{W}_{ia}\}+n_1n_2\Big[e_{iF}^{T}e_{iF}+\Xi\int_{0}^{\infty}r_i(e_i,u_{mix})\,d\tau\Big]\tag{71}
$$

where $\Xi>0$ is an undetermined parameter and $t\le\tau<\infty$. The time derivative of (71) is

$$
\begin{aligned}
\dot{V}_{iu}(e_i,u_{mix})&=\frac{1}{2n_1}\mathrm{tr}\{\tilde{W}_{ic}^{T}\dot{\tilde{W}}_{ic}\}+\frac{n_1}{2n_2}\mathrm{tr}\{\tilde{W}_{ia}^{T}\dot{\tilde{W}}_{ia}\}+n_1n_2\big[e_{iF}^{T}\dot{e}_{iF}+\Xi r_i(e_i,u_{mix})\big]\\
&=\frac{1}{2n_1}\mathrm{tr}\Big\{\tilde{W}_{ic}^{T}\big(-n_1L_i(L_i^{T}\tilde{W}_{ic}+\delta_{hi})\big)\Big\}\\
&\quad+\frac{n_1}{2n_2}\mathrm{tr}\Big\{\tilde{W}_{ia}^{T}\Big[-n_2S_{ia}(e_i)\Big(\tilde{W}_{ia}^{T}S_{ia}(e_i)-\varepsilon_{ia}(e_i)+\frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde{W}_{ic}-\frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\Big)\Big]\Big\}\\
&\quad+n_1n_2\Big[e_{iF}^{T}\big(W_{iF}^{T}\kappa(\Lambda_{iF}^{T}e_i)+\varepsilon_{iF}(e_i)+b_i(e_i)u_{mix}\big)+\Xi\big(e_i^{T}Q_ee_i+u_{mix}^{T}Ru_{mix}\big)\Big]\\
&\le-\Big(L_{im}^{2}-\frac{n_1}{2}L_{iM}^{2}\Big)\|\tilde{W}_{ic}\|^{2}+\frac{1}{2n_1}\delta_{hi}^{2}-\Big(n_1S_{iam}^{2}-\frac{3}{4}n_1n_2S_{iaM}^{2}\Big)\|\tilde{W}_{ia}\|^{2}\\
&\quad+\frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\|\tilde{W}_{ic}\|^{2}\\
&\quad+\frac{n_1}{2n_2}\Big(\varepsilon_{ia}(e_i)+\frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\Big)^{T}\Big(\varepsilon_{ia}(e_i)+\frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\Big)\\
&\quad+n_1n_2\|b_i(e_i)\|^{2}\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i)+n_1n_2S_{iaM}^{2}\|b_i(e_i)\|^{2}\|\tilde{W}_{ia}\|^{2}\\
&\quad+n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}+n_1n_2\big(\|b_i(e_i)\|^{2}-\Xi\lambda_{\min}(R)\big)\|u_{mix}\|^{2}
\end{aligned}\tag{72}
$$

If the following inequalities are satisfied:

$$
\begin{aligned}
&\lambda_{\min}(Q_e)\|e_{iF}\|_2^{2}\le e_{iF}^{T}Q_ee_{iF}\le\lambda_{\max}(Q_e)\|e_{iF}\|_2^{2},\\
&\lambda_{\min}(R)\|u_{mix}\|_2^{2}\le u_{mix}^{T}Ru_{mix}\le\lambda_{\max}(R)\|u_{mix}\|_2^{2},\\
&\Xi>\frac{\|b_i(e_i)\|^{2}}{\lambda_{\min}(R)}
\end{aligned}\tag{73}
$$

then $\dot{V}_{iu}(e_i,u_{mix})$ can be further bounded as

$$
\begin{aligned}
\dot{V}_{iu}(e_i,u_{mix})&\le-\Big(L_{im}^{2}-\frac{n_1}{2}L_{iM}^{2}-\frac{n_1}{4n_2}\|R^{-1}\|^{2}\|b_i(e_i)\|^{2}\|\nabla S_{icM}\|^{2}\Big)\|\tilde{W}_{ic}\|^{2}\\
&\quad-\Big(n_1S_{iam}^{2}-\frac{3}{4}n_1n_2S_{iaM}^{2}-n_1n_2S_{iaM}^{2}\|b_i(e_i)\|^{2}\Big)\|\tilde{W}_{ia}\|^{2}\\
&\quad-n_1n_2\big(\|b_i(e_i)\|^{2}+\Xi\lambda_{\min}(R)\big)\|u_{mix}\|^{2}-n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}\\
&\le-n_1n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^{2}
\end{aligned}\tag{74}
$$

Therefore, we can conclude that $\dot{V}_{iu}(e_i,u_{mix})<0$ for $e_{iF}\neq 0$.
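Of the conditions in (73), only the last one constrains a design parameter; the first two are the standard Rayleigh-quotient bounds that hold for any symmetric positive definite $Q_e$ and $R$. The Python sketch below computes the threshold $\|b_i(e_i)\|^2/\lambda_{\min}(R)$ for a symmetric $2\times 2$ weighting matrix; treating $R$ as $2\times 2$, and the numerical values of $R$ and $b_i(e_i)$, are assumptions for illustration only.

```python
import math

def lambda_min_sym2(R):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b = R[0]
    d = R[1][1]
    return 0.5 * (a + d) - math.sqrt(0.25 * (a - d) ** 2 + b * b)

def xi_lower_bound(b_vec, R):
    """||b_i(e_i)||^2 / lambda_min(R), the threshold in the last line of (73)."""
    lam = lambda_min_sym2(R)
    if lam <= 0.0:
        raise ValueError("R must be positive definite")
    return sum(x * x for x in b_vec) / lam

# Hypothetical values for illustration (a 2x2 R is an assumption).
R = [[2.0, 0.0], [0.0, 1.0]]
b_vec = [1.0, 1.0]
threshold = xi_lower_bound(b_vec, R)  # 2.0 / 1.0
Xi = 1.1 * threshold                  # any Xi above the threshold satisfies (73)
```

Because $\|b_i(e_i)\|$ varies with the state, in practice $\Xi$ would be chosen against an upper bound of $\|b_i(e_i)\|$ over the operating region.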

4. Simulations

To verify the validity and convergence of the proposed ACI-based decentralized reinforcement learning robust optimal tracking control method, and to study the convergence of the error by comparing simulation results, two different configurations of the time varying externally constrained reconfigurable modular robot are used in this paper, as shown in Figures 3 and 4.

To facilitate the analysis, these configurations are converted into analytic charts, shown in Figures 5 and 6, where $L_1$, $L_2$, and $L_4$ are the lengths of the links and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint is modeled as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are as follows:

$$
\begin{aligned}
\Psi_A(q,t)&=L_1\cos q_1+L_2\cos q_2-\big[L_3+L_4\cot\alpha(t)\big]\\
\Psi_B(q,t)&=L_1+L_2\cos q_2-\big[L_3+L_4\cot\alpha(t)\big]
\end{aligned}\tag{75}
$$

In the equation above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as follows:

$$
\alpha(t)=0.75\pi+0.2\sin\frac{t}{2}\tag{76}
$$
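The constraint geometry in (75) and (76) can be evaluated directly. The Python sketch below is an illustration only: it reads the argument of the sine in (76) as $t/2$, and the unit link lengths passed in are hypothetical placeholders, since the actual values of $L_1$ to $L_4$ are not stated in this section.

```python
import math

def alpha(t):
    """Angle between the rotating constraint and the X-axis, read from (76)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def psi(config, q, t, L1, L2, L3, L4):
    """Constraint functions (75); the constraint is active where psi == 0."""
    # alpha(t) stays within 0.75*pi +/- 0.2, so tan(alpha) never vanishes.
    cot_a = 1.0 / math.tan(alpha(t))
    rhs = L3 + L4 * cot_a
    if config == "A":
        return L1 * math.cos(q[0]) + L2 * math.cos(q[1]) - rhs
    if config == "B":
        return L1 + L2 * math.cos(q[1]) - rhs
    raise ValueError("config must be 'A' or 'B'")

# Hypothetical unit link lengths, for illustration only.
lengths = (1.0, 1.0, 1.0, 1.0)
residual_B = psi("B", [0.0, math.pi / 2], 0.0, *lengths)
```

Since $\alpha(t)$ oscillates around $3\pi/4$ with amplitude 0.2 rad, $\cot\alpha(t)$ stays near $-1$, and the constraint surface moves slowly with time.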

The initial joint positions are $q_1(0)=2$ and $q_2(0)=2$ in configuration A, and $q_1(0)=2$ and $q_2(0)=2$ in configuration B. The initial joint velocities are zero. The dynamic models of configurations A and B are as follows:

$$
\begin{aligned}
M_A(q)&=\begin{bmatrix}0.36\cos q_2+0.6066 & 0.18\cos q_2+0.1233\\ 0.18\cos q_2+0.1233 & 0.1233\end{bmatrix},\qquad
M_B(q)=\begin{bmatrix}0.17-0.1166\cos^{2}q_2 & -0.06\cos q_2\\ -0.06\cos q_2 & 0.1233\end{bmatrix},\\
C_A(q,\dot{q})&=\begin{bmatrix}-0.36\sin(q_2)\dot{q}_2 & -0.18\sin(q_2)\dot{q}_2\\ 0.18\sin(q_2)(\dot{q}_1-\dot{q}_2) & 0.18\sin(q_2)\dot{q}_1\end{bmatrix},\qquad
C_B(q,\dot{q})=\begin{bmatrix}0.1166\sin(2q_2)\dot{q}_2 & 0.06\sin(q_2)\dot{q}_2\\ 0.06\sin(q_2)\dot{q}_2 & 0\end{bmatrix},\\
G_A(q)&=\begin{bmatrix}-5.88\sin(q_1+q_2)-17.64\sin q_1\\ -5.88\sin(q_1+q_2)\end{bmatrix},\qquad
G_B(q)=\begin{bmatrix}0\\ -5.88\cos q_2\end{bmatrix},\\
F_A(q,\dot{q})&=\begin{bmatrix}\dot{q}_1+10\sin(3q_1)+2\,\mathrm{sgn}(\dot{q}_1)\\ 1.2\dot{q}_2+5\sin(2q_2)+\mathrm{sgn}(\dot{q}_2)\end{bmatrix},\qquad
F_B(q,\dot{q})=\begin{bmatrix}0\\ 1.5\dot{q}_2+\sin q_2+1.2\,\mathrm{sgn}(\dot{q}_2)\end{bmatrix}
\end{aligned}\tag{77}
$$
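To make the model (77) easy to reuse, the Python sketch below codes configurations A and B and evaluates joint torques under the usual rigid-body form $\tau=M(q)\ddot{q}+C(q,\dot{q})\dot{q}+G(q)+F(q,\dot{q})$. That torque form, the function names, and the reading of the arguments inside $F$ (positions inside $\sin$, velocities inside $\mathrm{sgn}$) are assumptions for illustration, not the authors' simulation code.

```python
import math

def sgn(x):
    """Sign function used in the friction terms (sgn(0) = 0)."""
    return (x > 0) - (x < 0)

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def dynamics(config, q, qd):
    """Return (M, C, G, F) of (77) for configuration 'A' or 'B'."""
    s2, c2 = math.sin(q[1]), math.cos(q[1])
    if config == "A":
        M = [[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
             [0.18 * c2 + 0.1233, 0.1233]]
        C = [[-0.36 * s2 * qd[1], -0.18 * s2 * qd[1]],
             [0.18 * s2 * (qd[0] - qd[1]), 0.18 * s2 * qd[0]]]
        G = [-5.88 * math.sin(q[0] + q[1]) - 17.64 * math.sin(q[0]),
             -5.88 * math.sin(q[0] + q[1])]
        # Assumed reading: sin(.) terms use positions, sgn(.) terms use velocities.
        F = [qd[0] + 10 * math.sin(3 * q[0]) + 2 * sgn(qd[0]),
             1.2 * qd[1] + 5 * math.sin(2 * q[1]) + sgn(qd[1])]
    elif config == "B":
        M = [[0.17 - 0.1166 * c2 ** 2, -0.06 * c2],
             [-0.06 * c2, 0.1233]]
        C = [[0.1166 * math.sin(2 * q[1]) * qd[1], 0.06 * s2 * qd[1]],
             [0.06 * s2 * qd[1], 0.0]]
        G = [0.0, -5.88 * c2]
        F = [0.0, 1.5 * qd[1] + s2 + 1.2 * sgn(qd[1])]
    else:
        raise ValueError("config must be 'A' or 'B'")
    return M, C, G, F

def inverse_dynamics(config, q, qd, qdd):
    """Joint torques tau = M(q) qdd + C(q, qd) qd + G(q) + F(q, qd) (assumed form)."""
    M, C, G, F = dynamics(config, q, qd)
    return [sum(M[i][j] * qdd[j] + C[i][j] * qd[j] for j in range(2)) + G[i] + F[i]
            for i in range(2)]
```

Both inertia matrices stay positive definite for every $q_2$: expanding the determinants gives $\det M_A=0.05959089-0.0324\cos^{2}q_2\ge 0.0272$ and $\det M_B=0.020961-0.01797678\cos^{2}q_2\ge 0.0030$, with positive diagonal entries.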

The desired trajectories of configurations A and B are as follows.

Configuration A:

$$
\begin{aligned}
y_{1d}&=0.5\cos t+0.2\sin 3t,\\
y_{2d}&=\Theta(y_{1d},t)=\arcsin\left[\frac{L_1\sin\big(\alpha(t)-y_{1d}\big)-L_3\sin\alpha(t)}{L_2}\right]+\alpha(t)
\end{aligned}\tag{78}
$$


Figure 3 Configuration A for simulation

Figure 4 Configuration B for simulation

Configuration B:

$$
\begin{aligned}
y_{1d}&=0,\\
y_{2d}&=\Theta(y_{1d},t)=\arcsin\left[\frac{L_1\sin\alpha(t)-L_3\sin\alpha(t)}{L_2}\right]+\alpha(t)
\end{aligned}\tag{79}
$$
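The desired trajectories (78) and (79) share one formula for $\Theta$; setting $y_{1d}=0$ in (78) gives (79). The Python sketch below samples both over 0 to 10 s. The link lengths $L_1=L_2=0.4$ and $L_3=0.2$ are hypothetical values chosen only so that the arcsine argument stays in $[-1,1]$, and the reading of (76) as $0.75\pi+0.2\sin(t/2)$ is an assumption.

```python
import math

def alpha(t):
    """Constraint angle from (76), read as 0.75*pi + 0.2*sin(t/2)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def theta(y1d, t, L1, L2, L3):
    """Theta(y1d, t) in (78); with y1d = 0 it reduces to (79)."""
    a = alpha(t)
    arg = (L1 * math.sin(a - y1d) - L3 * math.sin(a)) / L2
    if abs(arg) > 1.0:
        raise ValueError(f"arcsin argument {arg:.3f} out of range at t = {t}")
    return math.asin(arg) + a

def desired_A(t, L1, L2, L3):
    y1d = 0.5 * math.cos(t) + 0.2 * math.sin(3.0 * t)
    return y1d, theta(y1d, t, L1, L2, L3)

def desired_B(t, L1, L2, L3):
    return 0.0, theta(0.0, t, L1, L2, L3)  # joint 1 of configuration B stays at zero

# Hypothetical link data (not given in this section): L1 = L2 = 0.4, L3 = 0.2.
L = (0.4, 0.4, 0.2)
traj_A = [desired_A(0.1 * k, *L) for k in range(101)]  # 0 to 10 s
traj_B = [desired_B(0.1 * k, *L) for k in range(101)]
```

The explicit range check matters because $\Theta$ is only defined when the constraint is geometrically reachable for the chosen link lengths.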

Because of the limit imposed by the dynamic external constraint, the variation of joint 1 in configuration B is zero.

To confirm that the adopted method can be applied to different configurations, and to verify the tracking performance of the subsystem desired trajectories under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, this section presents comparative simulations of the classic RBF neural network control method and of the ACI-based decentralized reinforcement learning robust optimal tracking control method.

From Figures 7 8 9 10 11 12 13 and 14 the joint trackingand error curves are shown by using classic RBF neural net-work [4] to compensate the effect of the dynamic nonlinearterm and the interconnection term in the subsystem Figures7 to 10 show that the actual output trajectories take about2 seconds to track the desired trajectories This is due tothe fact that the classical neural network method requires alonger training process and parameter adaptive process Theerror curves in Figures 11ndash14 show that the joint subsystem

q1L2

L3

L4

L1

Y

X

120572

q2

Figure 5 The analytic chart of configuration A

q2

L4

L2

L1

L3

Y

120572

X

q1

Figure 6 The analytic chart of configuration B

constraints cannot be well compensated by adopting the clas-sical neural control method for the reconfigurable modularrobot when the time varying constrains exist When thejoint output variables turn larger the reconfigurable modularsystem cannot exhibit a good robustness and the trackingerrors are larger than before

Figures 15 16 17 18 19 20 21 and 22 showed thejoint tracking and error curves by adopting ACI to identifythe optimal 119876-function optimal control policy and globaldynamic nonlinear terms of the HJB equation in the sub-system Figures 15 to 18 show that the actual output trajec-tories of the subsystems can track the desired trajectoriesin 05 seconds by using the proposed robust reinforcementlearning optimal control method This is due to the excellentidentifying ability of ACIwhich can identify the uncertaintiescontained in the subsystems in a short timeThe error curvesin Figures 19ndash22 show that the tracking error is very smalland it can converge to zero in a short time Besides whenthe proposed decentralized control method is adopted for thejoint subsystems the robustness is manifested

Figures 23 and 24 showed the 3D-tip trajectory curvesby using ACI algorithm These two figures show that theproposed decentralized control method can fully satisfythe accessibility requirement of the reconfigurable modularrobot Besides the singular displacements of the joint subsys-tems and the end-effector would not appear The parametersdefined in ACI are shown in Table 1

12 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10

0

05

1

15

2

25

Time (s)

Join

t 1 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

minus1

minus05

Figure 7 Trajectory tracking curve of configuration A joint 1 withRBF neural network

0 1 2 3 4 5 6 7 8 9 1008

1

12

14

16

18

2

22

24

26

Time (s)

Join

t 2 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

Figure 8 Trajectory tracking curve of configuration A joint 2 withRBF neural network

Table 1 Parameter list of action-critic-identifier

119896 120572 120592 1205781198861

1205781198862

120578119888

1205731

1205732

120574

800 300 0005 10 50 20 02 2 05

The simulation results show that compared with thesituation of using classic RBF neural network controllerthe decentralized robust optimal tracking control methodbased on ACI and reinforcement learning can be appliedinto different configurations of the time varying constrainedreconfigurable modular robot And the joint variables cantrack the desired trajectory within a very short time indifferent configurations and the fluctuation of the errorconvergence range is minimal

Figure 9: Trajectory tracking curve of configuration B, joint 1, with RBF neural network (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B, joint 2, with RBF neural network (joint 2 position in rad versus time in s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A, joint 1, with RBF neural network (joint 1 error in rad versus time in s).

Figure 12: Tracking error curve of configuration A, joint 2, with RBF neural network (joint 2 error in rad versus time in s).

Figure 13: Tracking error curve of configuration B, joint 1, with RBF neural network (joint 1 error in rad versus time in s).

Figure 14: Tracking error curve of configuration B, joint 2, with RBF neural network (joint 2 error in rad versus time in s).

Figure 15: Trajectory tracking curve of configuration A, joint 1, with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A, joint 2, with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, by combining ACI with an RBF neural network, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time varying constrained reconfigurable modular robots. The theory addresses the problem of the continuous time nonlinear optimal control policy for strongly coupled, uncertain robotic systems. First, we built the subsystem model with time varying external constraints and described the global robot system as a synthesis of interconnected subsystems. Second, ACI was used to estimate the HJB equation and the global uncertainty: a continuous time optimal Q-function replaced the traditional optimal value function, the optimal Q-function and the optimal control policy were approximated by the critic-NN and the action-NN, and the global uncertainty was identified by the identifier. Third, we designed a novel decentralized robust optimal tracking controller so that the desired trajectory can be tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions were designed to confirm the stability of ACI and of the subsystem. Finally, to confirm the superiority of the proposed control theory, comparative simulation examples were presented for two different configurations of the robot.

Figure 17: Trajectory tracking curve of configuration B, joint 1, with ACI based reinforcement learning (joint 1 position in rad versus time in s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B, joint 2, with ACI based reinforcement learning (joint 2 position in rad versus time in s; desired and actual trajectories).

In future work, more complex configurations of time varying externally constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A, joint 1, with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 20: Tracking error curve of configuration A, joint 2, with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 21: Tracking error curve of configuration B, joint 1, with ACI based reinforcement learning (joint 1 error in rad versus time in s).

Figure 22: Tracking error curve of configuration B, joint 2, with ACI based reinforcement learning (joint 2 error in rad versus time in s).

Figure 23: 3D tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, "Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode," INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, "Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control," Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Networks, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


$$
\begin{aligned}
&= -\alpha\gamma\, e_{iF}^{T}e_{iF} + \left(E_i^{T} - e_{iF}^{T}\right)P_{F3}\,\frac{J_1\left(\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|\right)^2}{4k_3} - \frac{1}{2\alpha}\operatorname{tr}\left(\tilde W_{iF}^{T}\kappa_{iF}\Lambda_{iF}^{T}e_{iF}e_{iF}^{T}\right) - \frac{1}{2\alpha}\operatorname{tr}\left(\Lambda_{iF}^{T}e_{iF}e_{iF}^{T}\tilde W_{iF}^{T}\kappa_{iF}\right)\\
&\le -k_1\|e_{iF}\|^2 - k_2\|E_i\|^2 + \frac{J_1\left(\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|\right)^2}{4k_3}\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|^2 + \frac{\beta_2^2 J_2\left(\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|\right)^2}{4\alpha k_4}\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|^2
\end{aligned}\tag{65}
$$

where $k_{\min} = \min\{k_1, k_2\}$, $\xi = \min\{k_3, \alpha k_4/\beta_2^2\}$, and $J(\varphi_i(e_{iF}^{T}, E_i^{T}))^2 = J_1(\varphi_i(e_{iF}^{T}, E_i^{T}))^2 + J_2(\varphi_i(e_{iF}^{T}, E_i^{T}))^2$, so the following conclusion can be obtained:

$$
\dot V_{iL}(e_{iF}, E_i) \le -k_{\min}\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|^2 + \frac{J\left(\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|\right)^2\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|^2}{4\xi} \le -c\left\|\varphi_i\left(e_{iF}^{T}, E_i^{T}\right)\right\|^2 \tag{66}
$$
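The passage from (65) to (66), and the region $D$ in (67), can be traced with two elementary facts. The block below is a sketch of that reasoning under stated assumptions, not the authors' own derivation: it assumes $\varphi_i$ stacks $e_{iF}$ and $E_i$, so that $\|\varphi_i\|^2 = \|e_{iF}\|^2 + \|E_i\|^2$, and that $J$ is nonnegative and strictly increasing, so that $J^{-1}$ exists. The $J^2/(4\,\cdot)$ factors in (65) have the shape produced by Young's inequality.

```latex
% Young's inequality (source of the 1/(4k) factors), for scalars a, b and k > 0:
ab \le k\,a^{2} + \frac{b^{2}}{4k}
  \quad\text{since}\quad \Big(\sqrt{k}\,a - \frac{b}{2\sqrt{k}}\Big)^{2} \ge 0 .

% Collecting the J-terms of (65) with \xi = \min\{k_3,\ \alpha k_4/\beta_2^2\}:
\frac{J_1^{2}}{4k_3} + \frac{\beta_2^{2} J_2^{2}}{4\alpha k_4}
  \le \frac{J_1^{2} + J_2^{2}}{4\xi} = \frac{J^{2}}{4\xi},
\qquad
-k_1\|e_{iF}\|^{2} - k_2\|E_i\|^{2} \le -k_{\min}\|\varphi_i\|^{2}.

% Condition behind (66)-(67), for \varphi_i \ne 0 and 0 < c < k_{\min}:
-k_{\min} + \frac{J(\|\varphi_i\|)^{2}}{4\xi} \le -c
  \iff J(\|\varphi_i\|) \le 2\sqrt{(k_{\min} - c)\,\xi}.
```

Letting $c \to 0^{+}$ gives $J(\|\varphi_i\|) < 2\sqrt{k_{\min}\xi}$, that is, $\|\varphi_i\| < J^{-1}(2\sqrt{k_{\min}\xi})$, which matches the bound in (67) up to the boundary.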

Therefore, for a suitable constant $c > 0$, $-c\|\varphi_i(e_{iF}^{T}, E_i^{T})\|^2$ is a negative semidefinite function on the adjustable region $D$ defined as

$$
D = \left\{ d(t) \mid d \le J^{-1}\left(2\sqrt{k_{\min}\,\xi}\right) \right\} \tag{67}
$$

Hence, by Lyapunov stability theory, the system is stable. To make the subsystem of the time varying constrained reconfigurable modular robot track the desired trajectory asymptotically, a novel decentralized reinforcement learning robust optimal tracking controller is designed in this paper, in which a robust term compensates for the neural network approximation errors. The robust control term is designed as

$$
u_{irb} = \frac{N_{rb}\, e_i}{e_i^{T}e_i + \zeta} \tag{68}
$$

In the equation above, $\zeta > 0$ is a constant, and $N_{rb}$ is chosen as

$$
\begin{aligned}
N_{rb} &\ge \left[\frac{\delta_{hi}^2}{2n_1} + \frac{n_1\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\dfrac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2}{2n_2} + n_1 n_2\|b_i(e_i)\|^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\dfrac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2\right]\frac{e_i^{T}e_i + \zeta}{2n_1 n_2\|b_i(e_i)\|\,e_i^{T}e_i}\\
&= \left[n_1^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\dfrac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2 + n_2\delta_{hi}^2 + 2n_1^2 n_2^2\|b_i(e_i)\|^2\left(-\varepsilon_{ia}(e_i) - R^{-1}b_i(e_i)\dfrac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^2\right]\frac{e_i^{T}e_i + \zeta}{4n_1^2 n_2^2\|b_i(e_i)\|\,e_i^{T}e_i}
\end{aligned}\tag{69}
$$
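The two bracketed expressions in (69) differ only by clearing the common factor $2n_1n_2$ from numerator and denominator. A quick numerical check in Python makes this concrete. It is an illustration only: it treats the squared residual $(-\varepsilon_{ia} - R^{-1}b_i\nabla\varepsilon_{ic}/2)^2$ and every norm as a positive scalar, and the function names are hypothetical.

```python
import math
import random

def nrb_bound_first(delta_h, x_sq, n1, n2, b_norm, e_sq, zeta):
    """First form of (69). x_sq stands in for the squared residual
    (-eps_ia - R^{-1} b_i grad(eps_ic)/2)^2, treated as a scalar."""
    bracket = (delta_h ** 2 / (2 * n1)
               + n1 * x_sq / (2 * n2)
               + n1 * n2 * b_norm ** 2 * x_sq)
    return bracket * (e_sq + zeta) / (2 * n1 * n2 * b_norm * e_sq)

def nrb_bound_second(delta_h, x_sq, n1, n2, b_norm, e_sq, zeta):
    """Second form of (69): numerator and denominator multiplied by 2*n1*n2."""
    bracket = (n1 ** 2 * x_sq
               + n2 * delta_h ** 2
               + 2 * n1 ** 2 * n2 ** 2 * b_norm ** 2 * x_sq)
    return bracket * (e_sq + zeta) / (4 * n1 ** 2 * n2 ** 2 * b_norm * e_sq)

# Compare the two forms on random positive inputs.
random.seed(0)
for _ in range(1000):
    args = [random.uniform(0.1, 5.0) for _ in range(7)]
    assert math.isclose(nrb_bound_first(*args), nrb_bound_second(*args), rel_tol=1e-12)
print("both forms of (69) agree")
```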

Therefore, the global control law is designed as

$$
u_{\mathrm{mix}} = u_i + u_{irb} = -\frac{1}{2}R^{-1}b_i^{T}(e_i)\,S_{ia}(e_i)^{T}\hat W_{ia} + \frac{N_{rb}\, e_i}{e_i^{T}e_i + \zeta} \tag{70}
$$
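The robust term (68), which enters (70), can be sketched as below. This is a minimal Python illustration, not the authors' implementation: it treats $N_{rb}$ as a fixed nonnegative scalar (in (69) it is a state-dependent lower bound), and `robust_term` and `robust_term_norm_bound` are hypothetical names. For a fixed $N_{rb}$, the map $\|e\| \mapsto \|e\|/(\|e\|^2 + \zeta)$ peaks at $\|e\| = \sqrt{\zeta}$, so $\|u_{irb}\| \le N_{rb}/(2\sqrt{\zeta})$.

```python
import math

def robust_term(e, n_rb, zeta):
    """Robust term of (68): u_irb = N_rb * e / (e^T e + zeta), with e a list."""
    if zeta <= 0:
        raise ValueError("zeta must be positive")
    denom = sum(x * x for x in e) + zeta   # e^T e + zeta, never zero
    return [n_rb * x / denom for x in e]

def robust_term_norm_bound(n_rb, zeta):
    """Largest ||u_irb|| for a fixed N_rb, reached when ||e|| = sqrt(zeta)."""
    return n_rb / (2.0 * math.sqrt(zeta))

# Usage with illustrative numbers (not taken from the paper).
u = robust_term([0.1, -0.2], n_rb=5.0, zeta=0.01)
print(u, math.hypot(*u), robust_term_norm_bound(5.0, 0.01))
```

The constant $\zeta > 0$ keeps the term well defined at $e_i = 0$ and, for a fixed gain, bounded; this is one reason a smooth $e/(e^{T}e + \zeta)$ form can be preferred over a discontinuous sign term.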

Theorem 6. Consider the dynamics of the subsystem of the time varying constrained reconfigurable modular robot in (14). If the system parameter conditions and the assumptions hold, the critic-NN, action-NN, and identifier are given by (33), (34), and (45), respectively, and the decentralized robust optimal tracking controller of the subsystem in (70) is adopted, then the closed-loop system is stable and the actual output tracks the desired trajectory asymptotically.

Proof. Choose the Lyapunov function

$$
V_{iu}(e_i, u_{\mathrm{mix}}) = \frac{1}{2n_1}\operatorname{tr}\left\{\tilde W_{ic}^{T}\tilde W_{ic}\right\} + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde W_{ia}^{T}\tilde W_{ia}\right\} + n_1 n_2\left[e_{iF}^{T}e_{iF} + \Xi\int_{t}^{\infty} r_i(e_i, u_{\mathrm{mix}})\,d\tau\right] \tag{71}
$$

where $\Xi > 0$ is a parameter to be determined and $t \le \tau < \infty$. The time derivative of (71) is

$$
\begin{aligned}
\dot V_{iu}(e_i, u_{\mathrm{mix}}) &= \frac{1}{2n_1}\operatorname{tr}\left\{\tilde W_{ic}^{T}\dot{\tilde W}_{ic}\right\} + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde W_{ia}^{T}\dot{\tilde W}_{ia}\right\} + n_1 n_2\left[e_{iF}^{T}\dot e_{iF} + \Xi\, r_i(e_i, u_{\mathrm{mix}})\right]\\
&= \frac{1}{2n_1}\operatorname{tr}\left\{\tilde W_{ic}^{T}\left(-n_1 L_i\left(L_i^{T}\tilde W_{ic} + \delta_{hi}\right)\right)\right\}\\
&\quad + \frac{n_1}{2n_2}\operatorname{tr}\left\{\tilde W_{ia}^{T}\left[-n_2 S_{ia}(e_i)\left(\tilde W_{ia}^{T}S_{ia}(e_i) - \varepsilon_{ia}(e_i) + \frac{1}{2}R^{-1}b_i(e_i)\nabla S_{ic}(e_i)^{T}\tilde W_{ic} - \frac{1}{2}R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)\right)\right]\right\}\\
&\quad + n_1 n_2\left[e_{iF}^{T}\left(W_{iF}^{T}\kappa\left(\Lambda_{iF}^{T}e_i\right) + \varepsilon_{iF}(e_i) + b_i(e_i)u_{\mathrm{mix}}\right) + \Xi\left(e_i^{T}Q_e e_i + u_{\mathrm{mix}}^{T}Ru_{\mathrm{mix}}\right)\right]\\
&\le -\left(L_{im}^2 - \frac{n_1}{2}L_{iM}^2\right)\|\tilde W_{ic}\|^2 + \frac{1}{2n_1}\delta_{hi}^2 - \left(n_1 S_{iam}^2 - \frac{3}{4}n_1 n_2 S_{iaM}^2\right)\|\tilde W_{ia}\|^2\\
&\quad + \frac{n_1}{4n_2}\|R^{-1}\|^2\|b_i(e_i)\|^2\,\nabla S_{icM}^2\,\|\tilde W_{ic}\|^2\\
&\quad + \frac{n_1}{2n_2}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)^{T}\left(\varepsilon_{ia}(e_i) + R^{-1}b_i(e_i)\frac{\nabla\varepsilon_{ic}(e_i)}{2}\right)\\
&\quad + n_1 n_2\|b_i(e_i)\|^2\varepsilon_{ia}(e_i)^{T}\varepsilon_{ia}(e_i) + n_1 n_2 S_{iaM}^2\|b_i(e_i)\|^2\|\tilde W_{ia}\|^2\\
&\quad + n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2 + n_1 n_2\left(\|b_i(e_i)\|^2 - \Xi\lambda_{\min}(R)\right)\|u_{\mathrm{mix}}\|^2
\end{aligned}\tag{72}
$$
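The first term of (72) substitutes the critic weight error dynamics $\dot{\tilde W}_{ic} = -n_1 L_i(L_i^{T}\tilde W_{ic} + \delta_{hi})$. With $\delta_{hi} = 0$ these give $\frac{d}{dt}\frac{1}{2}\|\tilde W_{ic}\|^2 = -n_1(L_i^{T}\tilde W_{ic})^2 \le 0$. The Python sketch below integrates that law with forward Euler under illustrative assumptions that are not from the paper: $\delta_{hi} = 0$, a two-dimensional weight error, and a hypothetical regressor $L_i(t) = [\sin t, \cos t]^{T}$, which is persistently exciting. With this regressor the error norm decays.

```python
import math

def simulate_critic_error(w0, n1=1.0, dt=1e-3, t_end=20.0):
    """Forward-Euler integration of dW/dt = -n1 * L(t) * (L(t)^T W), i.e. the
    critic weight-error law used in (72) with delta_hi = 0.
    L(t) = [sin t, cos t] is a hypothetical, persistently exciting regressor."""
    w = list(w0)
    t = 0.0
    for _ in range(int(round(t_end / dt))):
        L = (math.sin(t), math.cos(t))
        proj = L[0] * w[0] + L[1] * w[1]          # L^T W
        w = [w[0] - dt * n1 * L[0] * proj,
             w[1] - dt * n1 * L[1] * proj]
        t += dt
    return w

# Usage: the weight-error norm before and after 20 s of adaptation.
w0 = [1.0, -0.5]
print(math.hypot(*w0), math.hypot(*simulate_critic_error(w0)))
```

Because $L_i^{T}L_i = 1$ here, each Euler step multiplies the component along $L_i$ by $1 - n_1\,dt$ and leaves the orthogonal component unchanged, so the norm never grows for small step sizes.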

If the following inequalities are satisfied:

$$
\begin{aligned}
&\lambda_{\min}(Q_e)\|e_{iF}\|_2^2 \le e_{iF}^{T}Q_e e_{iF} \le \lambda_{\max}(Q_e)\|e_{iF}\|_2^2,\\
&\lambda_{\min}(R)\|u_{\mathrm{mix}}\|_2^2 \le u_{\mathrm{mix}}^{T}Ru_{\mathrm{mix}} \le \lambda_{\max}(R)\|u_{\mathrm{mix}}\|_2^2,\\
&\Xi > \frac{\|b_i(e_i)\|^2}{\lambda_{\min}(R)},
\end{aligned}\tag{73}
$$

then $\dot V_{iu}(e_i, u_{\mathrm{mix}})$ can be further bounded as

$$
\begin{aligned}
\dot V_{iu}(e_i, u_{\mathrm{mix}}) &\le -\left(L_{im}^2 - \frac{n_1}{2}L_{iM}^2 - \frac{n_1}{4n_2}\|R^{-1}\|^2\|b_i(e_i)\|^2\,\nabla S_{icM}^2\right)\|\tilde W_{ic}\|^2\\
&\quad - \left(n_1 S_{iam}^2 - \frac{3}{4}n_1 n_2 S_{iaM}^2 - n_1 n_2 S_{iaM}^2\|b_i(e_i)\|^2\right)\|\tilde W_{ia}\|^2\\
&\quad - n_1 n_2\left(\|b_i(e_i)\|^2 + \Xi\lambda_{\min}(R)\right)\|u_{\mathrm{mix}}\|^2 - n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2\\
&\le -n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2
\end{aligned}\tag{74}
$$

Therefore, $\dot V_{iu}(e_i, u_{\mathrm{mix}}) < 0$ whenever $e_{iF} \ne 0$.
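The first two conditions in (73) are the Rayleigh quotient bounds for a symmetric matrix, and the third sets a lower bound on $\Xi$. A small Python check for the symmetric $2\times 2$ case is sketched below; the matrix size, the random test vectors, and the function names are assumptions for illustration only.

```python
import math
import random

def sym2_eigs(a, b, d):
    """Eigenvalues (min, max) of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    rad = math.hypot(0.5 * (a - d), b)
    return mean - rad, mean + rad

def rayleigh_bounds_hold(a, b, d, x):
    """Check lambda_min ||x||^2 <= x^T Q x <= lambda_max ||x||^2, as in (73)."""
    lo, hi = sym2_eigs(a, b, d)
    quad = a * x[0] ** 2 + 2 * b * x[0] * x[1] + d * x[1] ** 2
    n2 = x[0] ** 2 + x[1] ** 2
    eps = 1e-12 * max(1.0, abs(hi)) * n2      # floating-point slack
    return lo * n2 - eps <= quad <= hi * n2 + eps

def min_xi(b_norm, lambda_min_r):
    """Strict lower bound on Xi from (73): Xi > ||b_i(e_i)||^2 / lambda_min(R)."""
    return b_norm ** 2 / lambda_min_r

# Usage: random symmetric positive definite matrices and random vectors.
random.seed(1)
ok = all(rayleigh_bounds_hold(random.uniform(0.1, 5), random.uniform(-1, 1),
                              random.uniform(0.1, 5),
                              (random.uniform(-3, 3), random.uniform(-3, 3)))
         for _ in range(200))
print(ok, min_xi(2.0, 0.5))
```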

4. Simulations

To verify the validity of the proposed ACI based decentralized reinforcement learning robust optimal tracking control method, and to study the convergence of the tracking error by comparing simulation results, two different configurations of the time varying externally constrained reconfigurable modular robot are simulated, as shown in Figures 3 and 4.

For ease of analysis, the configurations above are transformed into the analytic charts shown in Figures 5 and 6, where $L_1$, $L_2$, and $L_4$ are the link lengths and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint is modeled as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are

$$
\begin{aligned}
\Psi_A(q, t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right],\\
\Psi_B(q, t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right].
\end{aligned}\tag{75}
$$

In the equation above, the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$
\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2} \tag{76}
$$
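A direct transcription of (75) and (76) is sketched below in Python. It assumes the reading $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$, and the usage example uses hypothetical link lengths because the values of $L_1$ to $L_4$ are not listed in this section. The cotangent is computed as $1/\tan\alpha$, which is safe here because $\alpha(t)$ stays within $0.75\pi \pm 0.2$ rad, away from multiples of $\pi/2$.

```python
import math

def alpha(t):
    """Angle (76) between the constraint and the X-axis; sin(t/2) reading assumed."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def cot(x):
    """Cotangent; alpha(t) never reaches a multiple of pi/2, so tan(x) is finite and nonzero."""
    return 1.0 / math.tan(x)

def psi_a(q1, q2, t, L1, L2, L3, L4):
    """Constraint function of configuration A, (75)."""
    return L1 * math.cos(q1) + L2 * math.cos(q2) - (L3 + L4 * cot(alpha(t)))

def psi_b(q2, t, L1, L2, L3, L4):
    """Constraint function of configuration B, (75); q1 does not appear."""
    return L1 + L2 * math.cos(q2) - (L3 + L4 * cot(alpha(t)))

# Usage with hypothetical link lengths (L1, L2, L3, L4), not taken from the paper.
lengths = (0.3, 0.3, 0.2, 0.1)
print(psi_a(2.0, 2.0, 0.0, *lengths), psi_b(2.0, 0.0, *lengths))
```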

The initial joint positions are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B, and the initial joint velocities are zero. The dynamic models of configurations A and B are

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos q_2 + 0.6066 & 0.18\cos q_2 + 0.1233\\ 0.18\cos q_2 + 0.1233 & 0.1233 \end{bmatrix},\\
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^2 q_2 & -0.06\cos q_2\\ -0.06\cos q_2 & 0.1233 \end{bmatrix},\\
C_A(q, \dot q) &= \begin{bmatrix} -0.36\sin(q_2)\,\dot q_2 & -0.18\sin(q_2)\,\dot q_2\\ 0.18\sin(q_2)\,(\dot q_1 - \dot q_2) & 0.18\sin(q_2)\,\dot q_1 \end{bmatrix},\\
C_B(q, \dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot q_2 & 0.06\sin(q_2)\,\dot q_2\\ 0.06\sin(q_2)\,\dot q_2 & 0 \end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin q_1\\ -5.88\sin(q_1 + q_2) \end{bmatrix},\qquad
G_B(q) = \begin{bmatrix} 0\\ -5.88\cos q_2 \end{bmatrix},\\
F_A(q, \dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1)\\ 1.2\,\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2) \end{bmatrix},\qquad
F_B(q, \dot q) = \begin{bmatrix} 0\\ 1.5\,\dot q_2 + \sin q_2 + 1.2\,\mathrm{sgn}(\dot q_2) \end{bmatrix}
\end{aligned}\tag{77}
$$
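For simulation, (77) can be coded directly. The Python sketch below rests on assumptions not stated in this section: the standard form $M(q)\ddot q + C(q,\dot q)\dot q + G(q) + F(q,\dot q) = \tau$, coefficients read as 0.36, 0.6066, 5.88, 17.64, and so on, friction terms that act on joint velocities, and $\mathrm{sgn}(0) = 0$. The function names are hypothetical. Because the models are $2\times 2$, the joint accelerations are obtained by Cramer's rule.

```python
import math

def sgn(x):
    """Sign function; sgn(0) = 0 is assumed here."""
    return (x > 0) - (x < 0)

def model_a(q, dq):
    """Return M, C, G, F of configuration A, following (77)."""
    q1, q2 = q
    dq1, dq2 = dq
    s2, c2 = math.sin(q2), math.cos(q2)
    M = [[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
         [0.18 * c2 + 0.1233, 0.1233]]
    C = [[-0.36 * s2 * dq2, -0.18 * s2 * dq2],
         [0.18 * s2 * (dq1 - dq2), 0.18 * s2 * dq1]]
    G = [-5.88 * math.sin(q1 + q2) - 17.64 * math.sin(q1),
         -5.88 * math.sin(q1 + q2)]
    F = [dq1 + 10.0 * math.sin(3.0 * q1) + 2.0 * sgn(dq1),
         1.2 * dq2 + 5.0 * math.sin(2.0 * q2) + sgn(dq2)]
    return M, C, G, F

def model_b(q, dq):
    """Return M, C, G, F of configuration B, following (77); q1 does not appear."""
    _, q2 = q
    _, dq2 = dq
    s2, c2 = math.sin(q2), math.cos(q2)
    M = [[0.17 - 0.1166 * c2 ** 2, -0.06 * c2],
         [-0.06 * c2, 0.1233]]
    C = [[0.1166 * math.sin(2.0 * q2) * dq2, 0.06 * s2 * dq2],
         [0.06 * s2 * dq2, 0.0]]
    G = [0.0, -5.88 * c2]
    F = [0.0, 1.5 * dq2 + math.sin(q2) + 1.2 * sgn(dq2)]
    return M, C, G, F

def forward_dynamics(model, q, dq, tau):
    """Solve M(q) qdd = tau - C dq - G - F for the 2x2 case by Cramer's rule."""
    M, C, G, F = model(q, dq)
    rhs = [tau[i] - (C[i][0] * dq[0] + C[i][1] * dq[1]) - G[i] - F[i]
           for i in range(2)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * rhs[0] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det]

# Usage: accelerations of configuration A at the initial state with zero torque.
print(forward_dynamics(model_a, [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
```

Under this reading both inertia matrices stay positive definite: expanding the determinants gives $\det M_A = 0.05959089 - 0.0324\cos^2 q_2$ and $\det M_B = 0.020961 - 0.01797678\cos^2 q_2$, both positive for every $q_2$, so the division above is always well defined.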

The desired trajectories of configurations A and B are as follows.

Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos t + 0.2\sin(3t),\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\left(\alpha(t) - y_{1d}\right) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t)
\end{aligned}\tag{78}
$$

Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$
\begin{aligned}
y_{1d} &= 0,\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin\alpha(t) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t)
\end{aligned}\tag{79}
$$
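The desired trajectories (78) and (79) can be generated as sketched below. The link lengths in the usage example are hypothetical, the reading $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$ is an assumption, and the functions raise an error when the arcsin argument leaves $[-1, 1]$, which would mean that the chosen lengths make the trajectory infeasible. Setting $y_{1d} = 0$ in (78) reduces it to (79).

```python
import math

def alpha(t):
    """Constraint angle (76); sin(t/2) reading assumed."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def _checked_asin(arg):
    if not -1.0 <= arg <= 1.0:
        raise ValueError("arcsin argument outside [-1, 1]; check the link lengths")
    return math.asin(arg)

def desired_a(t, L1, L2, L3):
    """Desired joint trajectories (y1d, y2d) of configuration A, (78)."""
    y1d = 0.5 * math.cos(t) + 0.2 * math.sin(3.0 * t)
    a = alpha(t)
    y2d = _checked_asin((L1 * math.sin(a - y1d) - L3 * math.sin(a)) / L2) + a
    return y1d, y2d

def desired_b(t, L1, L2, L3):
    """Desired joint trajectories (y1d, y2d) of configuration B, (79); joint 1 stays at zero."""
    a = alpha(t)
    return 0.0, _checked_asin((L1 - L3) * math.sin(a) / L2) + a

# Usage with hypothetical link lengths L1 = 0.3, L2 = 0.6, L3 = 0.2.
for t in (0.0, 2.5, 5.0):
    print(t, desired_a(t, 0.3, 0.6, 0.2), desired_b(t, 0.3, 0.6, 0.2))
```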

Owing to the dynamic external constraint, the motion of joint 1 in configuration B is zero.

To confirm that the adopted method can be applied to different configurations, and to verify how well the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method tracks the subsystem desired trajectories, comparative simulations are carried out with the classic RBF neural network control method and with the ACI based decentralized reinforcement learning robust optimal tracking control method.

Figures 7–14 show the joint tracking and error curves obtained when a classic RBF neural network [4] is used to compensate for the dynamic nonlinear term and the interconnection term in the subsystem. Figures 7–10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classic neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11–14 show that, when time varying constraints exist, the classic neural control method cannot compensate well for the joint subsystem constraints of the reconfigurable modular robot. When the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness and the tracking errors increase.

Figure 5: The analytic chart of configuration A (links L1 to L4, joint angles q1 and q2, constraint angle α, axes X and Y).

Figure 6: The analytic chart of configuration B (links L1 to L4, joint angles q1 and q2, constraint angle α, axes X and Y).
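Tracking-time statements such as the roughly 2 seconds reported for the RBF controller can be made reproducible with a settling-time measure. The helper below is a hypothetical sketch, not part of the original study: it returns the earliest sample time after which the absolute tracking error stays within a chosen tolerance. The tolerance and the synthetic error signal in the usage example are assumptions.

```python
import math

def settling_time(times, errors, tol):
    """Earliest sample time after which |error| stays within tol for the rest
    of the record; returns None if the final sample is still outside tol."""
    settled_from = None
    for t, e in zip(times, errors):
        if abs(e) > tol:
            settled_from = None        # left the band: restart the search
        elif settled_from is None:
            settled_from = t           # start of a run inside the band
    return settled_from

# Usage on a synthetic, exponentially decaying error signal.
dt = 1e-3
times = [k * dt for k in range(10001)]
errors = [0.5 * math.exp(-3.0 * t) for t in times]
print(settling_time(times, errors, tol=0.01))   # close to ln(50)/3, about 1.3 s
```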

Figures 15–22 show the joint tracking and error curves obtained when ACI is used to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in each subsystem. In Figures 15–18, the proposed robust reinforcement learning optimal control method brings the actual output trajectories of the subsystems onto the desired trajectories within 0.5 seconds. This speed comes from the strong identification ability of ACI, which identifies the uncertainties in the subsystems within a short time. The error curves in Figures 19–22 show that the tracking error is small and converges to zero quickly. The joint subsystems also remain robust when the proposed decentralized control method is applied.
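The convergence times quoted above (about 2 s for the RBF baseline, about 0.5 s for ACI) are read off the tracking curves. One way to make such a comparison reproducible is to compute, from a sampled error signal, the first time after which $|e(t)|$ stays inside a tolerance band. The following Python sketch does this. The tolerance and the synthetic exponentially decaying error signals are illustrative assumptions, not the paper's data.

```python
import numpy as np

def settling_time(t, e, tol):
    """Earliest sample time after which |e| stays within tol for the rest of the record.

    Returns None if the last sample is still outside the band.
    """
    outside = np.abs(np.asarray(e)) > tol
    if outside[-1]:
        return None
    idx = np.nonzero(outside)[0]
    if idx.size == 0:
        return float(t[0])           # inside the band from the start
    return float(t[idx[-1] + 1])     # first sample after the last excursion

t = np.linspace(0.0, 10.0, 10001)
e_slow = 0.5 * np.exp(-2.0 * t)      # synthetic error, time constant 0.5 s
e_fast = 0.5 * np.exp(-8.0 * t)      # synthetic error, time constant 0.125 s
print(settling_time(t, e_slow, 0.01), settling_time(t, e_fast, 0.01))
```

Applied to the logged joint errors of both controllers with the same tolerance, this metric would turn the visual comparison between Figures 11–14 and Figures 19–22 into numbers.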

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. They show that the proposed decentralized control method satisfies the accessibility requirement of the reconfigurable modular robot. No singular displacements appear in the joint subsystems or at the end-effector. Table 1 lists the parameters used in ACI.


Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Table 1: Parameter list of action-critic-identifier.

Parameter:  k     α     υ      η_a1   η_a2   η_c   β_1   β_2   γ
Value:      800   300   0.005  10     50     20    0.2   2     0.5
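For anyone reproducing the simulation, Table 1 can be kept as one immutable configuration object. The following Python sketch does that. Two points are assumptions made for illustration. First, every entry is treated as a positive gain or learning rate, and the sketch enforces this as a check. Second, the field names are hypothetical. The $\alpha$ in Table 1 is an ACI parameter (value 300) and is presumably distinct from the constraint angle $\alpha(t)$ in (76).

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ACIParams:
    """Values of Table 1; the symbols' roles are defined in the ACI design section."""
    k: float = 800.0
    alpha: float = 300.0      # ACI parameter, not the constraint angle alpha(t)
    upsilon: float = 0.005
    eta_a1: float = 10.0
    eta_a2: float = 50.0
    eta_c: float = 20.0
    beta1: float = 0.2
    beta2: float = 2.0
    gamma: float = 0.5

    def __post_init__(self):
        # Assumption: every entry is a positive gain or rate.
        for name, value in asdict(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")

params = ACIParams()
print(params)
```

A frozen dataclass prevents a parameter from being changed by accident partway through a run. It also records the decimal values (0.005, 0.2, 0.5) explicitly.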

The simulation results show that, unlike the classical RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning applies to different configurations of the time varying constrained reconfigurable modular robot. In both configurations, the joint variables track the desired trajectories within a short time, and after convergence the tracking errors fluctuate only within a narrow range.

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error in rad versus time in s, 0–10 s).


Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error in rad versus time in s, 0–10 s).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error in rad versus time in s, 0–10 s).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error in rad versus time in s, 0–10 s).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI-based reinforcement learning (joint 1 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI-based reinforcement learning (joint 2 position in rad versus time in s, 0–10 s; desired and actual trajectories).

5. Conclusions and Future Work

In this paper, ACI is combined with an RBF neural network to propose a novel decentralized reinforcement learning robust optimal tracking control theory for time varying constrained reconfigurable modular robots. The theory addresses the problem of finding a continuous-time nonlinear optimal control policy for strongly coupled robotic systems with uncertainty. The work proceeds in four steps:

1. We built the model of each subsystem with time varying external constraints and described the global robot system as a synthesis of interconnected subsystems.
2. ACI was used to estimate the HJB equation and the global uncertainty. A continuous-time optimal $Q$-function replaces the traditional optimal value function. The critic-NN and the action-NN approximate the optimal $Q$-function and the optimal control policy, respectively, and the identifier identifies the global uncertainty.
3. We designed a novel decentralized robust optimal tracking controller so that the desired trajectory is tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions were designed to confirm the stability of ACI and of the subsystems.
4. To confirm the superiority of the proposed control theory, we presented comparative simulation examples for two different configurations of the robot.

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI-based reinforcement learning (joint 1 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI-based reinforcement learning (joint 2 position in rad versus time in s, 0–10 s; desired and actual trajectories).

Future work will consider more complex configurations of time varying externally constrained reconfigurable modular robots, together with decentralized controllers of higher control and error precision.

Figure 19: Tracking error curve of configuration A joint 1 with ACI-based reinforcement learning (joint 1 error in rad versus time in s, 0–10 s).

Figure 20: Tracking error curve of configuration A joint 2 with ACI-based reinforcement learning (joint 2 error in rad versus time in s, 0–10 s).

Figure 21: Tracking error curve of configuration B joint 1 with ACI-based reinforcement learning (joint 1 error in rad versus time in s, 0–10 s).


Figure 22: Tracking error curve of configuration B joint 2 with ACI-based reinforcement learning (joint 2 error in rad versus time in s, 0–10 s).

Figure 23: 3D-tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D-tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, "Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode," INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, "Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control," Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.

[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Networks, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from Delayed Rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.


$$
\begin{aligned}
&\le -\left(L_{im}^2 - \frac{n_1}{2}L_{iM}^2\right)\|\tilde W_{ic}\|^2 + \frac{1}{2n_1}\delta_{hi}^2 - \left(n_1 S_{iam}^2 - \frac{3}{4}n_1 n_2 S_{iaM}^2\right)\|\tilde W_{ia}\|^2\\
&\quad + \frac{n_1}{4n_2}\|R^{-1}\|^2\|b_i(e_i)\|^2\nabla S_{icM}^2\|\tilde W_{ic}\|^2\\
&\quad + \frac{n_1}{2n_2}\left(\varepsilon_{ia}(e_i) + \frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\right)^T\left(\varepsilon_{ia}(e_i) + \frac{R^{-1}b_i(e_i)\nabla\varepsilon_{ic}(e_i)}{2}\right)\\
&\quad + n_1 n_2\|b_i(e_i)\|^2\varepsilon_{ia}(e_i)^T\varepsilon_{ia}(e_i) + n_1 n_2 S_{iaM}^2\|b_i(e_i)\|^2\|\tilde W_{ia}\|^2\\
&\quad - n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2 + n_1 n_2\left(\|b_i(e_i)\|^2 - \Xi\lambda_{\min}(R)\right)\|u_{mix}\|^2.
\end{aligned}
\tag{72}
$$

Suppose the following inequalities are satisfied. The two Rayleigh-quotient bounds hold automatically for symmetric positive definite $Q_e$ and $R$, so the substantive condition is the lower bound on $\Xi$:

$$
\begin{aligned}
&\lambda_{\min}(Q_e)\|e_{iF}\|_2^2 \le e_{iF}^T Q_e e_{iF} \le \lambda_{\max}(Q_e)\|e_{iF}\|_2^2,\\
&\lambda_{\min}(R)\|u_{mix}\|_2^2 \le u_{mix}^T R u_{mix} \le \lambda_{\max}(R)\|u_{mix}\|_2^2,\\
&\Xi > \frac{\|b_i(e_i)\|^2}{\lambda_{\min}(R)}.
\end{aligned}
\tag{73}
$$

Then $\dot V_{iu}(e_i, u_{mix})$ can be further bounded as

$$
\begin{aligned}
\dot V_{iu}(e_i, u_{mix}) \le{}& -\left(L_{im}^2 - \frac{n_1}{2}L_{iM}^2 - \frac{n_1}{4n_2}\|R^{-1}\|^2\|b_i(e_i)\|^2\nabla S_{icM}^2\right)\|\tilde W_{ic}\|^2\\
&- \left(n_1 S_{iam}^2 - \frac{3}{4}n_1 n_2 S_{iaM}^2 - n_1 n_2 S_{iaM}^2\|b_i(e_i)\|^2\right)\|\tilde W_{ia}\|^2\\
&- n_1 n_2\left(\Xi\lambda_{\min}(R) - \|b_i(e_i)\|^2\right)\|u_{mix}\|^2 - n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2\\
\le{}& -n_1 n_2\Xi\lambda_{\min}(Q_e)\|e_{iF}\|^2.
\end{aligned}
\tag{74}
$$

Therefore, provided the coefficients of the weight-error terms in (74) are nonnegative, $\dot V_{iu}(e_i, u_{mix}) < 0$ whenever $e_{iF} \neq 0$.
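The step from (72) to (74) rests on two facts. The first is the Rayleigh-quotient bounds in (73). The second is the gain condition $\Xi > \|b_i(e_i)\|^2/\lambda_{\min}(R)$, which makes the coefficient $n_1 n_2(\|b_i(e_i)\|^2 - \Xi\lambda_{\min}(R))$ of $\|u_{mix}\|^2$ negative. The following Python sketch checks both numerically for randomly generated matrices. Several choices are illustrative assumptions, not values from the paper:

- the $2\times 2$ sizes,
- treating $b_i(e_i)$ as a vector,
- $n_1 = n_2 = 1$,
- choosing $\Xi$ with a 10% margin.

```python
import numpy as np

def rayleigh_bounds_hold(A, x, rtol=1e-9):
    """Check lambda_min(A)*||x||^2 <= x^T A x <= lambda_max(A)*||x||^2 for symmetric A."""
    eig = np.linalg.eigvalsh(A)                 # eigenvalues in ascending order
    quad, nx2 = float(x @ A @ x), float(x @ x)
    tol = rtol * max(1.0, abs(quad))
    return eig[0] * nx2 - tol <= quad <= eig[-1] * nx2 + tol

def random_spd(n, rng):
    """Random symmetric positive definite matrix (stand-in for Q_e or R)."""
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)

def u_mix_coefficient(b_norm_sq, xi, lam_min_R, n1=1.0, n2=1.0):
    """Coefficient n1*n2*(||b_i||^2 - Xi*lambda_min(R)) multiplying ||u_mix||^2 in (72)."""
    return n1 * n2 * (b_norm_sq - xi * lam_min_R)

rng = np.random.default_rng(0)
Q_e, R = random_spd(2, rng), random_spd(2, rng)
b = rng.standard_normal(2)                      # hypothetical value of b_i(e_i) at one instant
lam_min_R = float(np.linalg.eigvalsh(R)[0])
xi = 1.1 * float(b @ b) / lam_min_R             # gain condition of (73) with a 10% margin

bounds_ok = all(rayleigh_bounds_hold(A, rng.standard_normal(2))
                for A in (Q_e, R) for _ in range(100))
coef = u_mix_coefficient(float(b @ b), xi, lam_min_R)
print(bounds_ok, coef < 0)
```

If $\Xi$ is chosen below the threshold, the coefficient becomes positive, and the $\|u_{mix}\|^2$ term can no longer be discarded from the bound.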

4. Simulations

Two different configurations of the time varying externally constrained reconfigurable modular robot are used in this section, shown in Figures 3 and 4. They serve two purposes: to verify the validity and convergence of the proposed decentralized reinforcement learning robust optimal tracking control method based on ACI, and to study the convergence of the tracking error through comparative simulations.

To simplify the analysis, the configurations above are represented by the analytic charts shown in Figures 5 and 6. In these charts, $L_1$, $L_2$, and $L_4$ are the lengths of the links, and $L_3$ is the distance between the time varying constraint joint and the base module.

The time varying constraint can be regarded as a column that rotates with a certain degree of freedom. The constraint equations of configurations A and B are as follows:

$$
\begin{aligned}
\Psi_A(q, t) &= L_1\cos q_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right],\\
\Psi_B(q, t) &= L_1 + L_2\cos q_2 - \left[L_3 + L_4\cot\alpha(t)\right].
\end{aligned}
\tag{75}
$$

In (75), the angle $\alpha(t)$ between the time varying constraint and the $X$-axis is defined as

$$
\alpha(t) = 0.75\pi + 0.2\sin\frac{t}{2}.
\tag{76}
$$
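Equations (75) and (76) can be evaluated directly. The following is a minimal Python sketch, assuming that (76) reads $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$. The link lengths in the usage line are hypothetical, because this section does not give their values. The sketch also checks that $\alpha(t)$ stays within $0.75\pi \pm 0.2$, away from the poles of $\cot$ at $0$ and $\pi$, so (75) is always well defined.

```python
import math

def alpha(t):
    """Constraint angle, reading (76) as alpha(t) = 0.75*pi + 0.2*sin(t/2) (assumption)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def cot(x):
    return math.cos(x) / math.sin(x)

def psi_A(q1, q2, t, L1, L2, L3, L4):
    """Constraint function Psi_A(q, t) of (75) for configuration A."""
    return L1 * math.cos(q1) + L2 * math.cos(q2) - (L3 + L4 * cot(alpha(t)))

def psi_B(q2, t, L1, L2, L3, L4):
    """Constraint function Psi_B(q, t) of (75); q1 does not appear in configuration B."""
    return L1 + L2 * math.cos(q2) - (L3 + L4 * cot(alpha(t)))

# alpha(t) stays in [0.75*pi - 0.2, 0.75*pi + 0.2], away from the poles of cot.
lo, hi = 0.75 * math.pi - 0.2, 0.75 * math.pi + 0.2
assert all(lo <= alpha(0.01 * k) <= hi for k in range(2000))

# Hypothetical link lengths L1..L4, for illustration only.
print(psi_A(math.pi / 2, math.pi / 2, 0.0, 0.2, 0.4, 0.1, 0.3))
```

A joint configuration satisfies the constraint at time $t$ when the corresponding $\Psi$ evaluates to zero.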

The initial joint positions are $q_1(0) = 2$, $q_2(0) = 2$ in configuration A and $q_1(0) = 2$, $q_2(0) = 2$ in configuration B. The initial joint velocities are zero. The dynamic models of configurations A and B are as follows:

$$
\begin{aligned}
M_A(q) &= \begin{bmatrix} 0.36\cos q_2 + 0.6066 & 0.18\cos q_2 + 0.1233\\ 0.18\cos q_2 + 0.1233 & 0.1233 \end{bmatrix},\\
M_B(q) &= \begin{bmatrix} 0.17 - 0.1166\cos^2 q_2 & -0.06\cos q_2\\ -0.06\cos q_2 & 0.1233 \end{bmatrix},\\
C_A(q, \dot q) &= \begin{bmatrix} -0.36\sin q_2\,\dot q_2 & -0.18\sin q_2\,\dot q_2\\ 0.18\sin q_2\,(\dot q_1 - \dot q_2) & 0.18\sin q_2\,\dot q_1 \end{bmatrix},\\
C_B(q, \dot q) &= \begin{bmatrix} 0.1166\sin(2q_2)\,\dot q_2 & 0.06\sin q_2\,\dot q_2\\ 0.06\sin q_2\,\dot q_2 & 0 \end{bmatrix},\\
G_A(q) &= \begin{bmatrix} -5.88\sin(q_1 + q_2) - 17.64\sin q_1\\ -5.88\sin(q_1 + q_2) \end{bmatrix},\\
G_B(q) &= \begin{bmatrix} 0\\ -5.88\cos q_2 \end{bmatrix},\\
F_A(q, \dot q) &= \begin{bmatrix} \dot q_1 + 10\sin(3q_1) + 2\,\mathrm{sgn}(\dot q_1)\\ 1.2\dot q_2 + 5\sin(2q_2) + \mathrm{sgn}(\dot q_2) \end{bmatrix},\\
F_B(q, \dot q) &= \begin{bmatrix} 0\\ 1.5\dot q_2 + \sin q_2 + 1.2\,\mathrm{sgn}(\dot q_2) \end{bmatrix}.
\end{aligned}
\tag{77}
$$
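To make (77) concrete, the following sketch codes the configuration A matrices and computes joint accelerations. It relies on three assumptions:

- The model has the standard rigid-body form $M(q)\ddot q + C(q,\dot q)\dot q + G(q) + F(q,\dot q) = \tau$.
- The constraint-force term is omitted.
- In the friction vector, the linear and sign terms act on the joint velocities and the sine terms act on the joint positions.

This is a minimal Python sketch, not the authors' simulation code.

```python
import numpy as np

def M_A(q):
    """Inertia matrix of configuration A from (77)."""
    c2 = np.cos(q[1])
    return np.array([[0.36 * c2 + 0.6066, 0.18 * c2 + 0.1233],
                     [0.18 * c2 + 0.1233, 0.1233]])

def C_A(q, dq):
    """Coriolis/centripetal matrix of configuration A from (77)."""
    s2 = np.sin(q[1])
    return np.array([[-0.36 * s2 * dq[1], -0.18 * s2 * dq[1]],
                     [0.18 * s2 * (dq[0] - dq[1]), 0.18 * s2 * dq[0]]])

def G_A(q):
    """Gravity vector of configuration A from (77)."""
    g12 = -5.88 * np.sin(q[0] + q[1])
    return np.array([g12 - 17.64 * np.sin(q[0]), g12])

def F_A(q, dq):
    """Friction vector of configuration A (sine arguments read as positions; assumption)."""
    return np.array([dq[0] + 10.0 * np.sin(3.0 * q[0]) + 2.0 * np.sign(dq[0]),
                     1.2 * dq[1] + 5.0 * np.sin(2.0 * q[1]) + np.sign(dq[1])])

def accel_A(q, dq, tau):
    """Solve M q'' = tau - C q' - G - F for q'' (constraint forces omitted)."""
    rhs = tau - C_A(q, dq) @ dq - G_A(q) - F_A(q, dq)
    return np.linalg.solve(M_A(q), rhs)

q = np.array([0.3, -0.5])
print(accel_A(q, np.array([0.2, -0.1]), np.zeros(2)))
```

With this $C_A$, the identity $\dot q^T(\dot M_A - 2C_A)\dot q = 0$ holds, which is a useful sanity check on the Coriolis matrix. In addition, $\det M_A(q) = 0.059591 - 0.0324\cos^2 q_2 > 0$, so the linear solve is always well posed.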

The desired trajectories of configurations A and B are given as follows.

Configuration A:

$$
\begin{aligned}
y_{1d} &= 0.5\cos t + 0.2\sin(3t),\\
y_{2d} &= \Theta(y_{1d}, t) = \arcsin\left[\frac{L_1\sin(\alpha(t) - y_{1d}) - L_3\sin\alpha(t)}{L_2}\right] + \alpha(t).
\end{aligned}
\tag{78}
$$
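By construction, (78) makes $L_1\sin(\alpha - y_{1d}) + L_2\sin(\alpha - y_{2d}) - L_3\sin\alpha$ vanish identically, since $\sin(y_{2d} - \alpha)$ equals the arcsin argument. The following Python sketch evaluates the configuration A trajectory and checks that identity numerically. It assumes that (76) reads $\alpha(t) = 0.75\pi + 0.2\sin(t/2)$. The link lengths are hypothetical, chosen so that the arcsin argument stays inside $[-1, 1]$.

```python
import math

def alpha(t):
    """Constraint angle, reading (76) as alpha(t) = 0.75*pi + 0.2*sin(t/2) (assumption)."""
    return 0.75 * math.pi + 0.2 * math.sin(t / 2.0)

def desired_A(t, L1, L2, L3):
    """Desired joint trajectory (y1d, y2d) of configuration A following (78)."""
    a = alpha(t)
    y1d = 0.5 * math.cos(t) + 0.2 * math.sin(3.0 * t)
    s = (L1 * math.sin(a - y1d) - L3 * math.sin(a)) / L2
    if not -1.0 <= s <= 1.0:
        raise ValueError("arcsin argument outside [-1, 1]; link lengths are incompatible")
    return y1d, math.asin(s) + a

def constraint_residual(y1d, y2d, t, L1, L2, L3):
    """L1*sin(a - y1d) + L2*sin(a - y2d) - L3*sin(a), which (78) makes zero."""
    a = alpha(t)
    return L1 * math.sin(a - y1d) + L2 * math.sin(a - y2d) - L3 * math.sin(a)

# Hypothetical link lengths (not given in this section).
L1, L2, L3 = 0.2, 0.4, 0.1
worst = max(abs(constraint_residual(*desired_A(0.01 * k, L1, L2, L3), 0.01 * k, L1, L2, L3))
            for k in range(1001))
print(f"max |residual| over t in [0, 10] s: {worst:.1e}")
```

The explicit range check matters in practice. If the link lengths make $|L_1 \sin(\alpha - y_{1d}) - L_3\sin\alpha| > L_2$, no joint-2 angle can satisfy the constraint, and the desired trajectory is undefined.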


[12] H Zhang Q Wei and Y Luo ldquoA novel infinite-time optimaltracking control scheme for a class of discrete-time nonlinearsystems via the greedy HDP iteration algorithmrdquo IEEE Trans-actions on Systems Man and Cybernetics B vol 38 no 4 pp937ndash942 2008

[13] H Zhang R Song Q Wei and T Zhang ldquoOptimal trackingcontrol for a class of nonlinear discrete-time systems withtime delays based on heuristic dynamic programmingrdquo IEEETransactions on Neural Networks vol 22 no 12 pp 1851ndash18622011

16 Mathematical Problems in Engineering

[14] H Zhang X Xie and S Tong ldquoHomogenous polynomiallyparameter-dependent 119867

infinfilter designs of discrete-time fuzzy

systemsrdquo IEEE Transactions on Systems Man and CyberneticsB vol 41 no 5 pp 1313ndash1322 2011

[15] H Zhang L Cui X Zhang and Y Luo ldquoData-drivenrobust approximate optimal tracking control for unknown gen-eral nonlinear systems using adaptive dynamic programmingmethodrdquo IEEE Transactions on Neural Networks vol 22 no 12pp 2226ndash2236 2011

[16] J Zhang H Zhang Y Luo and H Liang ldquoOptimal controldesign for nonlinear systems adaptive dynamic programmingbased on fuzzy critic estimatorrdquo in The International JointConference on Neural Network pp 1ndash6 Brisbane Australia2012

[17] H Zhang D Gong B Chen and Z Liu ldquoSynchronization forcoupled neural networkswith interval delay a novel augmentedlyapunov-krasovskii functional methodrdquo IEEE Transactions onNeural Networks and Learning Systems vol 24 no 1 pp 58ndash702013

[18] S Bhasin K Dupree P M Patre and W E Dixon ldquoNeuralnetwork control of a robot interacting with an uncertain vis-coelastic environmentrdquo IEEE Transactions on Control SystemsTechnology vol 19 no 4 pp 947ndash955 2011

[19] S G Khan G Herrmann F Lewis T Pipe and C MelhuishldquoA Q-learning based Cartesian model reference compliancecontroller implementation for a humanoid robot armrdquo inProceedings of the IEEE 5th International Conference onRoboticsAutomation andMechatronics (RAM rsquo11) pp 214ndash219 QingdaoChina September 2011

[20] P K Patchaikani L Behera and G Prasad ldquoA single networkadaptive critic-based redundancy resolution scheme for robotmanipulatorsrdquo IEEE Transactions on Industrial Electronics vol59 no 8 pp 3241ndash3253 2012

[21] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[22] C J CHWatkinsLearning fromdelayed rewards [PhD thesis]University of Cambridge Cambridge UK 1989

[23] F L Lewis and V L Syrmos Optimal Control John Wiley ampSons New York NY USA 1995

[24] M Sassano and A Astolfi ldquoDynamic approximate solutionsof the HJ inequality and of the HJB equation for input-affinenonlinear systemsrdquo IEEE Transactions on Automatic Controlvol 57 no 10 pp 2490ndash2503 2012

[25] Y X Wu and C Wang ldquoDeterministic learning based adaptivenetwork control of robot in task spacerdquoActa Automatica Sinicavol 39 no 1 pp 1ndash10 2013

[26] P M Patre W MacKunis K Kaiser and W E DixonldquoAsymptotic tracking for uncertain dynamic systems via amultilayer neural network feedforward and RISE feedbackcontrol structurerdquo IEEE Transactions on Automatic Control vol53 no 9 pp 2180ndash2185 2008

[27] B E Paden and S S Sastry ldquoA calculus for computing Filippovrsquosdifferential inclusion with application to the variable structurecontrol of robot manipulatorsrdquo IEEE Transactions on Circuitsand Systems vol 34 no 1 pp 73ndash82 1987

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article Decentralized Reinforcement Learning ...downloads.hindawi.com/journals/mpe/2013/387817.pdf · Research Article Decentralized Reinforcement Learning Robust Optimal

Mathematical Problems in Engineering 11

Figure 3: Configuration A for simulation.

Figure 4: Configuration B for simulation.

Configuration B:

$$
\begin{aligned}
y_{1d} &= 0, \\
y_{2d} &= \Theta\left(y_{1d}, t\right) = \arcsin\left[\frac{L_1 \sin\left(\alpha(t)\right) - L_3 \sin\left(\alpha(t)\right)}{L_2}\right] + \alpha(t).
\end{aligned} \tag{79}
$$
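Equation (79) can be evaluated directly to generate the desired joint trajectories of configuration B. The sketch below is a minimal illustration under stated assumptions: the constraint angle profile `alpha` and the link lengths passed in are hypothetical placeholders, because their actual values are not given in this section. The arcsin argument is checked because Eq. (79) is defined only when it lies in $[-1, 1]$.

```python
import math


def config_b_desired(t, alpha, L1, L2, L3):
    """Desired joint trajectories (y1d, y2d) of configuration B, Eq. (79).

    t      : time (s)
    alpha  : hypothetical callable t -> alpha(t) in rad (constraint angle)
    L1..L3 : hypothetical link lengths, all in the same unit
    """
    a = alpha(t)
    y1d = 0.0  # joint 1 is held fixed by the external constraint
    # Term-by-term transcription of Eq. (79): [L1 sin(alpha) - L3 sin(alpha)] / L2
    arg = (L1 * math.sin(a) - L3 * math.sin(a)) / L2
    if abs(arg) > 1.0:
        raise ValueError("arcsin argument outside [-1, 1]; geometry infeasible")
    y2d = math.asin(arg) + a
    return y1d, y2d
```

For example, `config_b_desired(1.0, lambda t: 0.5 * math.sin(t), 0.4, 0.3, 0.2)` returns the pair $(y_{1d}, y_{2d})$ at $t = 1$ s for these made-up dimensions.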

Because of the dynamic external constraints, joint 1 in configuration B does not move; that is, its desired trajectory is $y_{1d} = 0$.

To confirm that the adopted method applies to different configurations, and to verify how well each subsystem tracks its desired trajectory under the ACI and RBF-NN based decentralized reinforcement learning robust optimal tracking control method, this part presents comparative simulations of two methods: the classic RBF neural network control method and the ACI based decentralized reinforcement learning robust optimal tracking control method.

Figure 5: The analytic chart of configuration A (links L1 to L4, joint angles q1 and q2, angle α, X and Y axes).

Figure 6: The analytic chart of configuration B (links L1 to L4, joint angles q1 and q2, angle α, X and Y axes).

Figures 7–14 show the joint tracking and error curves obtained when the classic RBF neural network [4] is used to compensate for the dynamic nonlinear term and the interconnection term of each subsystem. Figures 7–10 show that the actual output trajectories take about 2 seconds to track the desired trajectories, because the classical neural network method requires a longer training and parameter adaptation process. The error curves in Figures 11–14 show that, when time varying constraints exist, the classical neural control method cannot compensate well for the joint subsystem constraints of the reconfigurable modular robot. As the joint output variables become larger, the reconfigurable modular system does not exhibit good robustness and the tracking errors grow.
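The RBF compensation discussed above can be pictured as a per-joint network $\hat f(x) = \sum_i \hat W_i \varphi_i(x)$ with Gaussian activations whose weights are adapted online. The sketch below is a generic textbook version under loudly stated assumptions: the Gaussian basis, the shared width, the gradient-type law $\dot{\hat W} = \Gamma \varphi(x) s$ with $s$ a filtered tracking error, and all names and gains are hypothetical; the exact network and update law of [4] are not given in this section.

```python
import math


def rbf_features(x, centers, width):
    """Gaussian activations phi_i(x) = exp(-||x - c_i||^2 / width^2)."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / width ** 2)
            for c in centers]


class RBFCompensator:
    """Scalar-output RBF network f_hat(x) = sum_i W_i * phi_i(x) for one joint.

    Hypothetical illustration: basis, width, and adaptation law are assumptions.
    """

    def __init__(self, centers, width, gain):
        self.centers = [list(c) for c in centers]
        self.width = float(width)
        self.gain = float(gain)              # adaptation gain Gamma (hypothetical)
        self.W = [0.0] * len(self.centers)   # weights start at zero

    def output(self, x):
        """Current estimate of the unknown nonlinear/interconnection term."""
        phi = rbf_features(x, self.centers, self.width)
        return sum(w * p for w, p in zip(self.W, phi))

    def update(self, x, s, dt):
        """One Euler step of W_dot = gain * phi(x) * s.

        s is a filtered tracking error; the sign convention depends on how
        f_hat enters the control law.
        """
        phi = rbf_features(x, self.centers, self.width)
        self.W = [w + dt * self.gain * p * s for w, p in zip(self.W, phi)]
```

A decentralized controller would hold one such compensator per joint subsystem, calling `output` to form the compensation term and `update` once per control step; the slow convergence noted above corresponds to the time these weights need to adapt.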

Figures 15–22 show the joint tracking and error curves obtained when ACI is adopted to identify the optimal $Q$-function, the optimal control policy, and the global dynamic nonlinear terms of the HJB equation in each subsystem. Figures 15–18 show that, with the proposed robust reinforcement learning optimal control method, the actual output trajectories of the subsystems track the desired trajectories within 0.5 seconds. This is attributed to the identification ability of ACI, which identifies the uncertainties contained in the subsystems in a short time. The error curves in Figures 19–22 show that the tracking error is very small and converges to zero in a short time. In addition, the joint subsystems remain robust under the proposed decentralized control method.
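One motivation for working with a $Q$-function rather than a value function is that the greedy policy can be read off the $Q$-function by minimizing over the input, without a model of the dynamics. The sketch below illustrates this only in the simplest linear-quadratic, single-input case, where $Q(x, u) = z^\top H z$ with $z = [x; u]$; this is an assumption for illustration, not the paper's continuous-time nonlinear $Q$-function. Loosely speaking, in the ACI scheme the critic-NN supplies the $Q$-function estimate that the known kernel `H` stands in for here.

```python
def greedy_gain(H):
    """Gain K such that u* = -K . x minimises Q(x, u) = z^T H z, z = [x; u].

    H is a symmetric (n+1)x(n+1) list of lists; the last row/column belongs to
    the scalar input u. Setting dQ/du = 2 (H_ux . x + H_uu u) = 0 gives
    u* = -(H_ux / H_uu) . x, which is a minimum only if H_uu > 0.
    """
    n = len(H) - 1
    h_uu = H[n][n]
    if h_uu <= 0:
        raise ValueError("H_uu must be positive for a minimiser to exist")
    return [H[n][j] / h_uu for j in range(n)]


def q_value(H, x, u):
    """Evaluate the quadratic Q-function z^T H z with z = [x; u]."""
    z = list(x) + [u]
    return sum(z[i] * H[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))


def greedy_action(H, x):
    """Model-free greedy input u* = -K . x obtained from Q alone."""
    K = greedy_gain(H)
    return -sum(k * xi for k, xi in zip(K, x))
```

Note that only the entries of `H` are used; no system matrices appear, which is the property that lets a learned $Q$-function yield a control policy directly.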

Figures 23 and 24 show the 3D tip trajectory curves obtained with the ACI algorithm. They show that the proposed decentralized control method fully satisfies the accessibility (reachability) requirement of the reconfigurable modular robot, and that no singular displacements of the joint subsystems or the end-effector occur. The parameters used in ACI are listed in Table 1.


Figure 7: Trajectory tracking curve of configuration A joint 1 with RBF neural network (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 8: Trajectory tracking curve of configuration A joint 2 with RBF neural network (joint 2 position (rad) versus time (s); desired and actual trajectories).

Table 1: Parameter list of action-critic-identifier.

| $k$ | $\alpha$ | $\upsilon$ | $\eta_{a1}$ | $\eta_{a2}$ | $\eta_c$ | $\beta_1$ | $\beta_2$ | $\gamma$ |
|---|---|---|---|---|---|---|---|---|
| 800 | 300 | 0.005 | 10 | 50 | 20 | 0.2 | 2 | 0.5 |

The simulation results show that, compared with the classic RBF neural network controller, the decentralized robust optimal tracking control method based on ACI and reinforcement learning can be applied to different configurations of the time varying constrained reconfigurable modular robot. In both configurations, the joint variables track the desired trajectories within a short time (about 0.5 s, versus about 2 s for the RBF controller), and the fluctuation of the error within the convergence range is minimal.
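The comparison above rests on how quickly each joint error enters, and then stays within, a small band around zero (roughly 0.5 s with ACI versus about 2 s with the RBF controller, per the discussion of Figures 7–10 and 15–18). The section does not state the exact criterion behind those times; the sketch below shows one common way to extract such a settling time from a sampled error trace, with the tolerance `tol` as a hypothetical user choice.

```python
def settling_time(t, e, tol):
    """Earliest sample time after which |e| stays within tol until the end.

    t   : sample times (s)
    e   : tracking errors (rad), same length as t
    tol : hypothetical tolerance band (rad)
    Returns None if the last sample is still outside the band.
    """
    last_outside = None
    for i, ei in enumerate(e):
        if abs(ei) > tol:
            last_outside = i  # remember the latest excursion outside the band
    if last_outside is None:
        return t[0]           # error was inside the band from the start
    if last_outside == len(e) - 1:
        return None           # never settled within the record
    return t[last_outside + 1]
```

Applying the same `tol` to the error traces of both controllers gives a like-for-like convergence comparison; the result depends on the chosen band, so it should be reported with the tolerance.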

Figure 9: Trajectory tracking curve of configuration B joint 1 with RBF neural network (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 10: Trajectory tracking curve of configuration B joint 2 with RBF neural network (joint 2 position (rad) versus time (s); desired and actual trajectories).

Figure 11: Tracking error curve of configuration A joint 1 with RBF neural network (joint 1 error (rad) versus time (s)).


Figure 12: Tracking error curve of configuration A joint 2 with RBF neural network (joint 2 error (rad) versus time (s)).

Figure 13: Tracking error curve of configuration B joint 1 with RBF neural network (joint 1 error (rad) versus time (s)).

Figure 14: Tracking error curve of configuration B joint 2 with RBF neural network (joint 2 error (rad) versus time (s)).

Figure 15: Trajectory tracking curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 16: Trajectory tracking curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 position (rad) versus time (s); desired and actual trajectories).

5. Conclusions and Future Work

In this paper, by combining ACI with an RBF neural network, a novel decentralized reinforcement learning robust optimal tracking control theory has been proposed for time varying constrained reconfigurable modular robots. This theory addresses the problem of the continuous-time nonlinear optimal control policy for strongly coupled, uncertain robotic systems. First, we built the subsystem model with time varying external constraints and described the global robot system as a synthesis of interconnected subsystems. Second, ACI is used to estimate the HJB equation and the global uncertainty: a continuous-time optimal $Q$-function takes the place of the traditional optimal value function, the optimal $Q$-function and the optimal control policy are approximated by the critic-NN and the action-NN, respectively, and the global uncertainty is identified by the identifier. Third, we designed a novel decentralized robust optimal tracking controller so that the desired trajectory is tracked and the tracking error converges to zero in finite time. On this basis, two kinds of Lyapunov functions were designed to confirm the stability of ACI and of the subsystem. Finally, to confirm the superiority of the proposed control theory, comparative simulation examples were presented for two different configurations of the robot.

Figure 17: Trajectory tracking curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 position (rad) versus time (s); desired and actual trajectories).

Figure 18: Trajectory tracking curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 position (rad) versus time (s); desired and actual trajectories).

In future work, more complex configurations of time varying externally constrained reconfigurable modular robots will be included, and decentralized controllers with higher control and error precision will be considered.

Figure 19: Tracking error curve of configuration A joint 1 with ACI based reinforcement learning (joint 1 error (rad) versus time (s)).

Figure 20: Tracking error curve of configuration A joint 2 with ACI based reinforcement learning (joint 2 error (rad) versus time (s)).

Figure 21: Tracking error curve of configuration B joint 1 with ACI based reinforcement learning (joint 1 error (rad) versus time (s)).


Figure 22: Tracking error curve of configuration B joint 2 with ACI based reinforcement learning (joint 2 error (rad) versus time (s)).

Figure 23: 3D tip trajectory curve of configuration A with ACI (desired and actual trajectories).

Figure 24: 3D tip trajectory curve of configuration B with ACI (desired and actual trajectories).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is financially supported by the National Natural Science Foundation of China (Grant nos. 61374051 and 60974010) and the Scientific Technological Development Plan Project in Jilin Province of China (Grant no. 20110705). The first author is funded by the China Scholarship Council.

References

[1] Y. Li and B. Dong, "Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode," INFORMATION, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, "Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control," Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.


[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Network, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: Research Article Decentralized Reinforcement Learning ...downloads.hindawi.com/journals/mpe/2013/387817.pdf · Research Article Decentralized Reinforcement Learning Robust Optimal

12 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10

0

05

1

15

2

25

Time (s)

Join

t 1 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

minus1

minus05

Figure 7 Trajectory tracking curve of configuration A joint 1 withRBF neural network

0 1 2 3 4 5 6 7 8 9 1008

1

12

14

16

18

2

22

24

26

Time (s)

Join

t 2 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

Figure 8 Trajectory tracking curve of configuration A joint 2 withRBF neural network

Table 1 Parameter list of action-critic-identifier

119896 120572 120592 1205781198861

1205781198862

120578119888

1205731

1205732

120574

800 300 0005 10 50 20 02 2 05

The simulation results show that compared with thesituation of using classic RBF neural network controllerthe decentralized robust optimal tracking control methodbased on ACI and reinforcement learning can be appliedinto different configurations of the time varying constrainedreconfigurable modular robot And the joint variables cantrack the desired trajectory within a very short time indifferent configurations and the fluctuation of the errorconvergence range is minimal

0 1 2 3 4 5 6 7 8 9 10

0

05

1

15

2

Time (s)

Join

t 1 p

ositi

on (r

ad)

minus2

minus15

minus05

minus1

Desired trajectoryActual trajectory

Figure 9 Trajectory tracking curve of configuration B joint 1 withRBF neural network

0 1 2 3 4 5 6 7 8 9 1017

18

19

2

21

22

23

24

Time (s)

Join

t 2 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

Figure 10 Trajectory tracking curve of configuration B joint 2 withRBF neural network

0 1 2 3 4 5 6 7 8 9 10

0

002

004

006

008

01

Time (s)

Join

t 1 er

ror (

rad)

minus01

minus008

minus006

minus004

minus002

Figure 11 Tracking error curve of configuration A joint 1 with RBFneural network

Mathematical Problems in Engineering 13

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

002

004

006

008

01

Join

t 2 er

ror (

rad)

minus01

minus008

minus006

minus004

minus002

Figure 12 Tracking error curve of configuration A joint 2 with RBFneural network

0 1 2 3 4 5 6 7 8 9 10

0

001

002

003

004

005

Time (s)

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 13 Tracking error curve of configuration B joint 1 with RBFneural network

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 2 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 14 Tracking error curve of configuration B joint 2 with RBFneural network

0 1 2 3 4 5 6 7 8 9 10

0

05

1

15

2

Time (s)

Join

t 1 p

ositi

on (r

ad)

minus1

minus05

Desired trajectoryActual trajectory

Figure 15 Trajectory tracking curve of configuration A joint 1 withACI based reinforcement learning

0 1 2 3 4 5 6 7 8 9 1008

1

12

14

16

18

2

22

24

26

Time (s)

Join

t 2 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

Figure 16 Trajectory tracking curve of configuration A joint 2 withACI based reinforcement learning

5 Conclusions and Future Work

In this paper combining ACI with RBF neural network anovel decentralized reinforcement learning robust optimaltracking control theory has been proposed for time varyingconstrained reconfigurable modular robots Moreover thistheory is used to solve the problem of the continuoustime nonlinear optimal control policy for strongly coupleduncertainty robotic system Firstly we build the model ofsubsystem with the time varying external constraints anddescribed the global robot system as a synthesis of inter-connected subsystem Secondly ACI is used to estimate theHJB equation and the global uncertainty where a continuous-time optimal 119876-function is adopted to take the place oftraditional optimal value function the optimal 119876-function

14 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10

0

05

1

15

2

Time (s)

Join

t 1 p

ositi

on (r

ad)

minus15

minus05

minus2

minus1

Desired trajectoryActual trajectory

Figure 17 Trajectory tracking curve of configuration B joint 1 withACI based reinforcement learning

0 1 2 3 4 5 6 7 8 9 1017

18

19

2

21

22

23

24

Time (s)

Join

t 2 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

Figure 18 Trajectory tracking curve of configuration B joint 2 withACI based reinforcement learning

and the optimal control police are approximated by critic-NNand action-NN and the global uncertainty is identified bythe identifier Thirdly we design a novel decentralized robustoptimal tracking controller so the desired trajectory can betracked and the tracking error could converge to zero infinite timeOn this basis two kinds of Lyapunov functions aredesigned to confirm the stability of ACI and the subsystemFinally in order to confirm the superiority of the proposedcontrol theory the comparative simulation examples havebeen presented combining two different configurations of therobot

In the future more complex configuration for timevarying external constrained reconfigurable modular robotcan be included Therefore the decentralized controller withhigher control and error precision will be considered

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 19 Tracking error curve of configuration A joint 1 with ACIbased reinforcement learning

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005Jo

int 2

erro

r (ra

d)

minus005

minus004

minus003

minus002

minus001

Figure 20 Tracking error curve of configuration A joint 2 with ACIbased reinforcement learning

0

0

1 2 3 4 5 6 7 8 9 10Time (s)

001

002

003

004

005

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 21 Tracking error curve of configuration B joint 1 with ACIbased reinforcement learning

Mathematical Problems in Engineering 15

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 2 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 22 Tracking error curve of configuration B joint 2 with ACIbased reinforcement learning

005

1

02 03 04 05 06 07

0

01

02

03

minus1

minus05minus02

minus01

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 23 3D-tip trajectory curve of configuration A with ACI

005

1

035 036 037 038 039 04

006008

01012014016018

minus1

minus05

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 24 3D-tip trajectory curve of configuration B with ACI

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is financially supported by the National NaturalScience Foundation of China (Grant nos 61374051 and60974010) the Scientific Technological Development PlanProject in Jilin Province of China (Grant no 20110705) Thefirst author is funded by China Scholarship Council

References

[1] Y. Li and B. Dong, “Decentralized ADRC control for reconfigurable manipulators based on VGSTA-ESO of sliding mode,” Information, vol. 15, no. 6, pp. 2453–2466, 2012.

[2] Y. Li, M.-C. Zhu, and Y.-C. Li, “Simulation on robust neurofuzzy compensator for reconfigurable manipulator motion control,” Journal of System Simulation, vol. 19, no. 22, pp. 5169–5174, 2007.

[3] M.-C. Zhu and Y.-C. Li, “Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic,” Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, “Decentralized adaptive neural network control for reconfigurable manipulators,” in Proceedings of the Chinese Control and Decision Conference (CCDC ’10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, “A new distributed control scheme of modular and reconfigurable robots,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA ’07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, “Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator,” Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, “Lebesgue-sampling-based optimal control problems with time aggregation,” IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, “Efficient reinforcement learning using recursive least-squares methods,” Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, “A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, “Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.


[14] H. Zhang, X. Xie, and S. Tong, “Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, “Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator,” in The International Joint Conference on Neural Network, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, “Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, “Neural network control of a robot interacting with an uncertain viscoelastic environment,” IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, “A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm,” in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM ’11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, “A single network adaptive critic-based redundancy resolution scheme for robot manipulators,” IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from delayed rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, “Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems,” IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, “Deterministic learning based adaptive network control of robot in task space,” Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, “Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure,” IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, “A calculus for computing Filippov’s differential inclusion with application to the variable structure control of robot manipulators,” IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.



0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 19 Tracking error curve of configuration A joint 1 with ACIbased reinforcement learning

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005Jo

int 2

erro

r (ra

d)

minus005

minus004

minus003

minus002

minus001

Figure 20 Tracking error curve of configuration A joint 2 with ACIbased reinforcement learning

0

0

1 2 3 4 5 6 7 8 9 10Time (s)

001

002

003

004

005

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 21 Tracking error curve of configuration B joint 1 with ACIbased reinforcement learning

Mathematical Problems in Engineering 15

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 2 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 22 Tracking error curve of configuration B joint 2 with ACIbased reinforcement learning

005

1

02 03 04 05 06 07

0

01

02

03

minus1

minus05minus02

minus01

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 23 3D-tip trajectory curve of configuration A with ACI

005

1

035 036 037 038 039 04

006008

01012014016018

minus1

minus05

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 24 3D-tip trajectory curve of configuration B with ACI

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is financially supported by the National NaturalScience Foundation of China (Grant nos 61374051 and60974010) the Scientific Technological Development PlanProject in Jilin Province of China (Grant no 20110705) Thefirst author is funded by China Scholarship Council

References

[1] Y Li and B Dong ldquoDecentralized ADRC control for reconfig-urable manipulators based on VGSTA-ESO of sliding moderdquoINFORMATION vol 15 no 6 pp 2453ndash2466 2012

[2] Y LiM-C Zhu andY-C Li ldquoSimulation on robust neurofuzzycompensator for reconfigurable manipulator motion controlrdquoJournal of System Simulation vol 19 no 22 pp 5169ndash5174 2007

[3] M-C Zhu and Y-C Li ldquoDecentralized adaptive sliding modecontrol for reconfigurable manipulators using fuzzy logicrdquoJournal of Jilin University (Engineering and Technology Edition)vol 39 no 1 pp 170ndash176 2009

[4] L Zhu and Y Li ldquoDecentralized adaptive neural networkcontrol for reconfigurable manipulatorsrdquo in Proceedings of theChinese Control and Decision Conference (CCDC rsquo10) pp 1760ndash1765 Xuzhou China May 2010

[5] M Zhu Y Li and Y Li ldquoA new distributed control schemeof modular and reconfigurable robotsrdquo in Proceedings of theIEEE International Conference onMechatronics and Automation(ICMA rsquo07) pp 2622ndash2627 Harbin China August 2007

[6] M C Zhu Y Li and Y C Li ldquoObserver-based decentralizedadaptive fuzzy control for reconfigurable manipulatorrdquo Controland Decision vol 24 no 3 pp 429ndash434 2009

[7] S Richard and A G Barto Reinforcement Learning An Intro-duction The MIT Press London UK 1998

[8] F L Lewis and D Liu ReinForcement Learning and Approxi-mate Dynamic Programming For Feedback Control Wiley-IEEEPress New York NY USA 2012

[9] Y-K Xu and X-R Cao ldquoLebesgue-sampling-based optimalcontrol problems with time aggregationrdquo IEEE Transactions onAutomatic Control vol 56 no 5 pp 1097ndash1109 2011

[10] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009

[11] X Xu H-G He and D Hu ldquoEfficient reinforcement learningusing recursive least-squares methodsrdquo Journal of ArtificialIntelligence Research vol 16 pp 259ndash292 2002

[12] H Zhang Q Wei and Y Luo ldquoA novel infinite-time optimaltracking control scheme for a class of discrete-time nonlinearsystems via the greedy HDP iteration algorithmrdquo IEEE Trans-actions on Systems Man and Cybernetics B vol 38 no 4 pp937ndash942 2008

[13] H Zhang R Song Q Wei and T Zhang ldquoOptimal trackingcontrol for a class of nonlinear discrete-time systems withtime delays based on heuristic dynamic programmingrdquo IEEETransactions on Neural Networks vol 22 no 12 pp 1851ndash18622011

16 Mathematical Problems in Engineering

[14] H Zhang X Xie and S Tong ldquoHomogenous polynomiallyparameter-dependent 119867

infinfilter designs of discrete-time fuzzy

systemsrdquo IEEE Transactions on Systems Man and CyberneticsB vol 41 no 5 pp 1313ndash1322 2011

[15] H Zhang L Cui X Zhang and Y Luo ldquoData-drivenrobust approximate optimal tracking control for unknown gen-eral nonlinear systems using adaptive dynamic programmingmethodrdquo IEEE Transactions on Neural Networks vol 22 no 12pp 2226ndash2236 2011

[16] J Zhang H Zhang Y Luo and H Liang ldquoOptimal controldesign for nonlinear systems adaptive dynamic programmingbased on fuzzy critic estimatorrdquo in The International JointConference on Neural Network pp 1ndash6 Brisbane Australia2012

[17] H Zhang D Gong B Chen and Z Liu ldquoSynchronization forcoupled neural networkswith interval delay a novel augmentedlyapunov-krasovskii functional methodrdquo IEEE Transactions onNeural Networks and Learning Systems vol 24 no 1 pp 58ndash702013

[18] S Bhasin K Dupree P M Patre and W E Dixon ldquoNeuralnetwork control of a robot interacting with an uncertain vis-coelastic environmentrdquo IEEE Transactions on Control SystemsTechnology vol 19 no 4 pp 947ndash955 2011

[19] S G Khan G Herrmann F Lewis T Pipe and C MelhuishldquoA Q-learning based Cartesian model reference compliancecontroller implementation for a humanoid robot armrdquo inProceedings of the IEEE 5th International Conference onRoboticsAutomation andMechatronics (RAM rsquo11) pp 214ndash219 QingdaoChina September 2011

[20] P K Patchaikani L Behera and G Prasad ldquoA single networkadaptive critic-based redundancy resolution scheme for robotmanipulatorsrdquo IEEE Transactions on Industrial Electronics vol59 no 8 pp 3241ndash3253 2012

[21] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[22] C J CHWatkinsLearning fromdelayed rewards [PhD thesis]University of Cambridge Cambridge UK 1989

[23] F L Lewis and V L Syrmos Optimal Control John Wiley ampSons New York NY USA 1995

[24] M Sassano and A Astolfi ldquoDynamic approximate solutionsof the HJ inequality and of the HJB equation for input-affinenonlinear systemsrdquo IEEE Transactions on Automatic Controlvol 57 no 10 pp 2490ndash2503 2012

[25] Y X Wu and C Wang ldquoDeterministic learning based adaptivenetwork control of robot in task spacerdquoActa Automatica Sinicavol 39 no 1 pp 1ndash10 2013

[26] P M Patre W MacKunis K Kaiser and W E DixonldquoAsymptotic tracking for uncertain dynamic systems via amultilayer neural network feedforward and RISE feedbackcontrol structurerdquo IEEE Transactions on Automatic Control vol53 no 9 pp 2180ndash2185 2008

[27] B E Paden and S S Sastry ldquoA calculus for computing Filippovrsquosdifferential inclusion with application to the variable structurecontrol of robot manipulatorsrdquo IEEE Transactions on Circuitsand Systems vol 34 no 1 pp 73ndash82 1987

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 14: Research Article Decentralized Reinforcement Learning ...downloads.hindawi.com/journals/mpe/2013/387817.pdf · Research Article Decentralized Reinforcement Learning Robust Optimal

14 Mathematical Problems in Engineering

0 1 2 3 4 5 6 7 8 9 10

0

05

1

15

2

Time (s)

Join

t 1 p

ositi

on (r

ad)

minus15

minus05

minus2

minus1

Desired trajectoryActual trajectory

Figure 17 Trajectory tracking curve of configuration B joint 1 withACI based reinforcement learning

0 1 2 3 4 5 6 7 8 9 1017

18

19

2

21

22

23

24

Time (s)

Join

t 2 p

ositi

on (r

ad)

Desired trajectoryActual trajectory

Figure 18 Trajectory tracking curve of configuration B joint 2 withACI based reinforcement learning

and the optimal control police are approximated by critic-NNand action-NN and the global uncertainty is identified bythe identifier Thirdly we design a novel decentralized robustoptimal tracking controller so the desired trajectory can betracked and the tracking error could converge to zero infinite timeOn this basis two kinds of Lyapunov functions aredesigned to confirm the stability of ACI and the subsystemFinally in order to confirm the superiority of the proposedcontrol theory the comparative simulation examples havebeen presented combining two different configurations of therobot

In the future more complex configuration for timevarying external constrained reconfigurable modular robotcan be included Therefore the decentralized controller withhigher control and error precision will be considered

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 19 Tracking error curve of configuration A joint 1 with ACIbased reinforcement learning

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005Jo

int 2

erro

r (ra

d)

minus005

minus004

minus003

minus002

minus001

Figure 20 Tracking error curve of configuration A joint 2 with ACIbased reinforcement learning

0

0

1 2 3 4 5 6 7 8 9 10Time (s)

001

002

003

004

005

Join

t 1 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 21 Tracking error curve of configuration B joint 1 with ACIbased reinforcement learning

Mathematical Problems in Engineering 15

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 2 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 22 Tracking error curve of configuration B joint 2 with ACIbased reinforcement learning

005

1

02 03 04 05 06 07

0

01

02

03

minus1

minus05minus02

minus01

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 23 3D-tip trajectory curve of configuration A with ACI

005

1

035 036 037 038 039 04

006008

01012014016018

minus1

minus05

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 24 3D-tip trajectory curve of configuration B with ACI

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is financially supported by the National NaturalScience Foundation of China (Grant nos 61374051 and60974010) the Scientific Technological Development PlanProject in Jilin Province of China (Grant no 20110705) Thefirst author is funded by China Scholarship Council

References

[1] Y Li and B Dong ldquoDecentralized ADRC control for reconfig-urable manipulators based on VGSTA-ESO of sliding moderdquoINFORMATION vol 15 no 6 pp 2453ndash2466 2012

[2] Y LiM-C Zhu andY-C Li ldquoSimulation on robust neurofuzzycompensator for reconfigurable manipulator motion controlrdquoJournal of System Simulation vol 19 no 22 pp 5169ndash5174 2007

[3] M-C Zhu and Y-C Li ldquoDecentralized adaptive sliding modecontrol for reconfigurable manipulators using fuzzy logicrdquoJournal of Jilin University (Engineering and Technology Edition)vol 39 no 1 pp 170ndash176 2009

[4] L Zhu and Y Li ldquoDecentralized adaptive neural networkcontrol for reconfigurable manipulatorsrdquo in Proceedings of theChinese Control and Decision Conference (CCDC rsquo10) pp 1760ndash1765 Xuzhou China May 2010

[5] M Zhu Y Li and Y Li ldquoA new distributed control schemeof modular and reconfigurable robotsrdquo in Proceedings of theIEEE International Conference onMechatronics and Automation(ICMA rsquo07) pp 2622ndash2627 Harbin China August 2007

[6] M C Zhu Y Li and Y C Li ldquoObserver-based decentralizedadaptive fuzzy control for reconfigurable manipulatorrdquo Controland Decision vol 24 no 3 pp 429ndash434 2009

[7] S Richard and A G Barto Reinforcement Learning An Intro-duction The MIT Press London UK 1998

[8] F L Lewis and D Liu ReinForcement Learning and Approxi-mate Dynamic Programming For Feedback Control Wiley-IEEEPress New York NY USA 2012

[9] Y-K Xu and X-R Cao ldquoLebesgue-sampling-based optimalcontrol problems with time aggregationrdquo IEEE Transactions onAutomatic Control vol 56 no 5 pp 1097ndash1109 2011

[10] F L Lewis andD Vrabie ldquoReinforcement learning and adaptivedynamic programming for feedback controlrdquo IEEE Circuits andSystems Magazine vol 9 no 3 pp 32ndash50 2009

[11] X Xu H-G He and D Hu ldquoEfficient reinforcement learningusing recursive least-squares methodsrdquo Journal of ArtificialIntelligence Research vol 16 pp 259ndash292 2002

[12] H Zhang Q Wei and Y Luo ldquoA novel infinite-time optimaltracking control scheme for a class of discrete-time nonlinearsystems via the greedy HDP iteration algorithmrdquo IEEE Trans-actions on Systems Man and Cybernetics B vol 38 no 4 pp937ndash942 2008

[13] H Zhang R Song Q Wei and T Zhang ldquoOptimal trackingcontrol for a class of nonlinear discrete-time systems withtime delays based on heuristic dynamic programmingrdquo IEEETransactions on Neural Networks vol 22 no 12 pp 1851ndash18622011

16 Mathematical Problems in Engineering

[14] H Zhang X Xie and S Tong ldquoHomogenous polynomiallyparameter-dependent 119867

infinfilter designs of discrete-time fuzzy

systemsrdquo IEEE Transactions on Systems Man and CyberneticsB vol 41 no 5 pp 1313ndash1322 2011

[15] H Zhang L Cui X Zhang and Y Luo ldquoData-drivenrobust approximate optimal tracking control for unknown gen-eral nonlinear systems using adaptive dynamic programmingmethodrdquo IEEE Transactions on Neural Networks vol 22 no 12pp 2226ndash2236 2011

[16] J Zhang H Zhang Y Luo and H Liang ldquoOptimal controldesign for nonlinear systems adaptive dynamic programmingbased on fuzzy critic estimatorrdquo in The International JointConference on Neural Network pp 1ndash6 Brisbane Australia2012

[17] H Zhang D Gong B Chen and Z Liu ldquoSynchronization forcoupled neural networkswith interval delay a novel augmentedlyapunov-krasovskii functional methodrdquo IEEE Transactions onNeural Networks and Learning Systems vol 24 no 1 pp 58ndash702013

[18] S Bhasin K Dupree P M Patre and W E Dixon ldquoNeuralnetwork control of a robot interacting with an uncertain vis-coelastic environmentrdquo IEEE Transactions on Control SystemsTechnology vol 19 no 4 pp 947ndash955 2011

[19] S G Khan G Herrmann F Lewis T Pipe and C MelhuishldquoA Q-learning based Cartesian model reference compliancecontroller implementation for a humanoid robot armrdquo inProceedings of the IEEE 5th International Conference onRoboticsAutomation andMechatronics (RAM rsquo11) pp 214ndash219 QingdaoChina September 2011

[20] P K Patchaikani L Behera and G Prasad ldquoA single networkadaptive critic-based redundancy resolution scheme for robotmanipulatorsrdquo IEEE Transactions on Industrial Electronics vol59 no 8 pp 3241ndash3253 2012

[21] C J C H Watkins and P Dayan ldquoQ-learningrdquo MachineLearning vol 8 no 3-4 pp 279ndash292 1992

[22] C J CHWatkinsLearning fromdelayed rewards [PhD thesis]University of Cambridge Cambridge UK 1989

[23] F L Lewis and V L Syrmos Optimal Control John Wiley ampSons New York NY USA 1995

[24] M Sassano and A Astolfi ldquoDynamic approximate solutionsof the HJ inequality and of the HJB equation for input-affinenonlinear systemsrdquo IEEE Transactions on Automatic Controlvol 57 no 10 pp 2490ndash2503 2012

[25] Y X Wu and C Wang ldquoDeterministic learning based adaptivenetwork control of robot in task spacerdquoActa Automatica Sinicavol 39 no 1 pp 1ndash10 2013

[26] P M Patre W MacKunis K Kaiser and W E DixonldquoAsymptotic tracking for uncertain dynamic systems via amultilayer neural network feedforward and RISE feedbackcontrol structurerdquo IEEE Transactions on Automatic Control vol53 no 9 pp 2180ndash2185 2008

[27] B E Paden and S S Sastry ldquoA calculus for computing Filippovrsquosdifferential inclusion with application to the variable structurecontrol of robot manipulatorsrdquo IEEE Transactions on Circuitsand Systems vol 34 no 1 pp 73ndash82 1987

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 15: Research Article Decentralized Reinforcement Learning ...downloads.hindawi.com/journals/mpe/2013/387817.pdf · Research Article Decentralized Reinforcement Learning Robust Optimal

Mathematical Problems in Engineering 15

0 1 2 3 4 5 6 7 8 9 10Time (s)

0

001

002

003

004

005

Join

t 2 er

ror (

rad)

minus005

minus004

minus003

minus002

minus001

Figure 22 Tracking error curve of configuration B joint 2 with ACIbased reinforcement learning

005

1

02 03 04 05 06 07

0

01

02

03

minus1

minus05minus02

minus01

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 23 3D-tip trajectory curve of configuration A with ACI

005

1

035 036 037 038 039 04

006008

01012014016018

minus1

minus05

xla

bel

y labelz label

Desired trajectoryActual trajectory

Figure 24 3D-tip trajectory curve of configuration B with ACI

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is financially supported by the National NaturalScience Foundation of China (Grant nos 61374051 and60974010) the Scientific Technological Development PlanProject in Jilin Province of China (Grant no 20110705) Thefirst author is funded by China Scholarship Council

References

[1] Y Li and B Dong ldquoDecentralized ADRC control for reconfig-urable manipulators based on VGSTA-ESO of sliding moderdquoINFORMATION vol 15 no 6 pp 2453ndash2466 2012

[2] Y LiM-C Zhu andY-C Li ldquoSimulation on robust neurofuzzycompensator for reconfigurable manipulator motion controlrdquoJournal of System Simulation vol 19 no 22 pp 5169ndash5174 2007

[3] M.-C. Zhu and Y.-C. Li, "Decentralized adaptive sliding mode control for reconfigurable manipulators using fuzzy logic," Journal of Jilin University (Engineering and Technology Edition), vol. 39, no. 1, pp. 170–176, 2009.

[4] L. Zhu and Y. Li, "Decentralized adaptive neural network control for reconfigurable manipulators," in Proceedings of the Chinese Control and Decision Conference (CCDC '10), pp. 1760–1765, Xuzhou, China, May 2010.

[5] M. Zhu, Y. Li, and Y. Li, "A new distributed control scheme of modular and reconfigurable robots," in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '07), pp. 2622–2627, Harbin, China, August 2007.

[6] M. C. Zhu, Y. Li, and Y. C. Li, "Observer-based decentralized adaptive fuzzy control for reconfigurable manipulator," Control and Decision, vol. 24, no. 3, pp. 429–434, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, London, UK, 1998.

[8] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, Wiley-IEEE Press, New York, NY, USA, 2012.

[9] Y.-K. Xu and X.-R. Cao, "Lebesgue-sampling-based optimal control problems with time aggregation," IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 1097–1109, 2011.

[10] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.

[11] X. Xu, H.-G. He, and D. Hu, "Efficient reinforcement learning using recursive least-squares methods," Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[12] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 38, no. 4, pp. 937–942, 2008.

[13] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1851–1862, 2011.


[14] H. Zhang, X. Xie, and S. Tong, "Homogenous polynomially parameter-dependent H∞ filter designs of discrete-time fuzzy systems," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 41, no. 5, pp. 1313–1322, 2011.

[15] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2226–2236, 2011.

[16] J. Zhang, H. Zhang, Y. Luo, and H. Liang, "Optimal control design for nonlinear systems: adaptive dynamic programming based on fuzzy critic estimator," in The International Joint Conference on Neural Networks, pp. 1–6, Brisbane, Australia, 2012.

[17] H. Zhang, D. Gong, B. Chen, and Z. Liu, "Synchronization for coupled neural networks with interval delay: a novel augmented Lyapunov-Krasovskii functional method," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 58–70, 2013.

[18] S. Bhasin, K. Dupree, P. M. Patre, and W. E. Dixon, "Neural network control of a robot interacting with an uncertain viscoelastic environment," IEEE Transactions on Control Systems Technology, vol. 19, no. 4, pp. 947–955, 2011.

[19] S. G. Khan, G. Herrmann, F. Lewis, T. Pipe, and C. Melhuish, "A Q-learning based Cartesian model reference compliance controller implementation for a humanoid robot arm," in Proceedings of the IEEE 5th International Conference on Robotics, Automation and Mechatronics (RAM '11), pp. 214–219, Qingdao, China, September 2011.

[20] P. K. Patchaikani, L. Behera, and G. Prasad, "A single network adaptive critic-based redundancy resolution scheme for robot manipulators," IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3241–3253, 2012.

[21] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[22] C. J. C. H. Watkins, Learning from Delayed Rewards [Ph.D. thesis], University of Cambridge, Cambridge, UK, 1989.

[23] F. L. Lewis and V. L. Syrmos, Optimal Control, John Wiley & Sons, New York, NY, USA, 1995.

[24] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490–2503, 2012.

[25] Y. X. Wu and C. Wang, "Deterministic learning based adaptive network control of robot in task space," Acta Automatica Sinica, vol. 39, no. 1, pp. 1–10, 2013.

[26] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Transactions on Automatic Control, vol. 53, no. 9, pp. 2180–2185, 2008.

[27] B. E. Paden and S. S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Transactions on Circuits and Systems, vol. 34, no. 1, pp. 73–82, 1987.
