762 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 2013

Online Learning Control Using Adaptive Critic Designs With Sparse Kernel Machines

Xin Xu, Senior Member, IEEE, Zhongsheng Hou, Chuanqiang Lian, and Haibo He, Senior Member, IEEE

Abstract—In the past decade, adaptive critic designs (ACDs), including heuristic dynamic programming (HDP), dual heuristic programming (DHP), and their action-dependent variants, have been widely studied to realize online learning control of dynamical systems. However, because neural networks with manually designed features are commonly used to deal with continuous state and action spaces, the generalization capability and learning efficiency of previous ACDs still need to be improved. In this paper, a novel framework of ACDs with sparse kernel machines is presented by integrating kernel methods into the critic of ACDs. To improve the generalization capability as well as the computational efficiency of kernel machines, a sparsification method based on approximately linear dependence analysis is used. Using the sparse kernel machines, two kernel-based ACD algorithms, that is, kernel HDP (KHDP) and kernel DHP (KDHP), are proposed and their performance is analyzed both theoretically and empirically. Because of the representation learning and generalization capability of sparse kernel machines, KHDP and KDHP can obtain much better performance than previous HDP and DHP with manually designed neural networks. Simulation and experimental results on two nonlinear control problems, that is, a continuous-action inverted pendulum problem and a ball and plate control problem, demonstrate the effectiveness of the proposed kernel ACD methods.

Index Terms—Adaptive critic designs, approximate dynamic programming, kernel machines, learning control, Markov decision processes, reinforcement learning.

I. INTRODUCTION

REINFORCEMENT learning (RL) is a machine learning framework for solving sequential decision-making problems that can be modeled using the Markov decision process (MDP) formalism. In RL, the learning agent interacts with an initially unknown environment and modifies its action policies to maximize its cumulative payoffs [1], [2].

Manuscript received September 2, 2011; revised October 12, 2012; accepted December 16, 2012. Date of publication February 13, 2013; date of current version March 8, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant 61075072, Grant 90820302, Grant 51228701, and Grant 61120106009, the New Century Excellent Talent Program under Grant NCET-10-0901, and the U.S. National Science Foundation under Grant CAREER ECCS 1053717.

X. Xu and C. Lian are with the College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]; [email protected]).

Z. Hou is with the Advanced Control Systems Laboratory, School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]).

H. He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2012.2236354

Although earlier RL research focused on tabular algorithms in discrete state/action spaces, approximation and generalization methods for RL have received increasing research interest in recent years. In the literature, several synonyms are used for RL, including approximate/adaptive dynamic programming (ADP) and neuro-dynamic programming [3]–[7]. One common goal of ADP and RL is to solve the optimal control problem of MDPs with large or continuous state and action spaces. RL has been shown to be a very promising framework for solving learning control problems that are difficult or even impossible for mathematical programming and supervised learning methods. However, despite some successful empirical results in real-world applications [8]–[11], realizing efficient online learning control for MDPs with large or continuous spaces is still a difficult problem. In such cases, many RL or ADP algorithms converge slowly and require a large number of training samples [12]. As indicated in [1], this problem is closely related to the generalization capability of learning machines, that is, the ability of a learning algorithm to perform accurately on new, unseen examples after training on a finite data set.

In order to improve the generalization capability and learning efficiency of RL, function approximation has been a central topic in RL research. Currently, there are three main categories of work on function approximation for RL: value function approximation (VFA) [13], [14], policy search [15], and actor-critic methods [16]. Actor-critic algorithms, viewed as a hybrid of VFA and policy search, have been shown to be more effective than standard VFA or policy search in online learning tasks with continuous state/action spaces [17]. In an actor-critic learning controller, there is an actor for policy learning and a critic for VFA or policy evaluation. One pioneering work on RL algorithms using the actor-critic architecture can be found in [18]. In recent years, adaptive critic designs (ACDs) [19]–[23], [28]–[30] have been widely studied as an important class of actor-critic learning control methods for dynamical systems. Generally, ACDs can be categorized into the following major groups: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [17]. Among ACD architectures, DHP is the most popular one, and it has been proven to be more efficient than HDP [19].

Although ACDs have been applied to various learning control problems [24]–[26], such as aircraft control, automotive engine control, and power system control, there are still some difficult issues in the design and implementation of ACDs.



The first issue is that the learning efficiency and convergence of ACDs greatly rely on the empirical design of the critic, including the approximation structure and the learning rate. In ACDs, multilayer perceptron neural networks (MLPNNs) [48] have commonly been used for VFA, but the structure and learning rates (step sizes) of MLPNNs have to be manually selected to obtain good performance [27]. The second difficulty is that the robustness to uncertainties in learning control systems based on DHP or HDP still needs to be improved. In DHP and HDP, due to local minima in neural network training, how to improve the quality of the final policies is still an open problem [22]. As suggested in [22], the most important potential extension of their results would be to characterize the quality of the converged solution in ACDs. Recent studies have attempted to approximate the optimal control solution using various ADP techniques with or without an a priori system model [28]–[30], [47]. Vamvoudakis and Lewis [29] proposed an online actor-critic algorithm to solve the continuous-time infinite-horizon optimal control problem under the assumption of known dynamics. Zhang et al. [30] proposed a data-driven robust approximate optimal tracking control scheme for unknown general nonlinear systems. Nevertheless, these works still relied on manual settings of the critic networks, and the learning control performance depended on the empirical design of basis functions. Therefore, it is desirable to develop automatic feature representation and selection methods for critic learning in ADP approaches.

As is well known, feature representation and selection is a critical factor for improving the generalization performance of machine learning algorithms. However, compared with supervised learning, there are relatively few works on feature representation and selection in reinforcement learning, especially in online learning control methods. For ACDs, it was pointed out recently [22] that the choice of basis functions for the critic should be studied so that a good estimate of the policy gradient can be obtained and the performance of ACDs improved. The motivation of this paper is to present a novel kernel-based feature representation method for ACDs and to develop new online learning control algorithms with sparse kernel machines. Based on theoretical and empirical results from statistical learning [31], [32], sparse kernel machines can be expected to have better generalization capability than conventional MLPNNs with manually designed structures. Therefore, the goal of this paper is to provide a new kernel-based feature representation method for ACDs, which is important for realizing efficient online learning control of uncertain dynamical systems.

Recently, kernel machines have been widely studied to realize nonlinear and nonparametric versions of supervised or unsupervised learning algorithms [31], [32]. The main idea of kernel machines is as follows: an inner product in a high-dimensional feature space can be represented by a Mercer kernel function; thus, existing learning algorithms in linear spaces can be transformed into kernel-based algorithms without explicitly computing the inner products in high-dimensional feature spaces. This idea, usually called the kernel trick, has been widely used in supervised and unsupervised learning problems [32]. In supervised learning, the most popular kernel machines include support vector machines (SVMs) and Gaussian processes (GPs), which have been applied to many classification and regression problems. In most cases, kernel machines obtain very good results or even state-of-the-art performance [32]–[34]. In unsupervised learning, kernel principal component analysis and kernel independent component analysis have also been studied by many researchers [34]. Comprehensive reviews of kernel machines can be found in [35].

The combination of kernel methods with RL and ADP has also received increasing research interest in recent years. However, the function approximation problem is more difficult in RL than in supervised learning. One of the earlier works in this direction was published in [36], where kernel-based locally weighted averaging was used to approximate the state value functions of MDPs. The application of GPs or SVMs to reinforcement learning problems has also been studied in the literature, such as GPs in temporal difference [TD(0)] learning [37], SVMs for RL [38], and GPs in model-based approximate policy iteration [39]. In [38], support vector regression was applied to batch learning of state value functions of MDPs with discrete state spaces, and there were no theoretical results on the policies obtained. The GP-based policy iteration method in [39] uses support points, which are usually selected by manual discretization of the state space, and policy evaluation is performed using a state transition model approximated by a GP. In [40], a model-free approximate policy iteration algorithm, called least-squares policy iteration (LSPI), was presented, which offers an RL method with good properties in convergence, stability, and sample complexity. Nevertheless, the approximation structures in LSPI may lead to degraded performance when the features are improperly selected. In [41], a kernel-based least-squares policy iteration (KLSPI) algorithm was presented for MDPs with large or continuous state spaces. However, both LSPI and KLSPI are mainly restricted to solving MDPs with discrete actions.

In this paper, a novel framework of ACDs with sparse kernel machines is presented by integrating kernel methods into the critic learning of ACD algorithms. A sparsification method based on approximately linear dependence (ALD) analysis [42] is used to sparsify the kernel machines when approximating the action value functions or their derivatives. Using the sparsified kernel machines, two kernel ACD algorithms, that is, kernel HDP (KHDP) and kernel DHP (KDHP), are proposed to realize efficient online learning control for dynamical systems. To the best of our knowledge, there are very few works on integrating kernel methods into online learning control based on ACDs. Simulation and experimental results on two nonlinear control problems, a continuous-action inverted pendulum problem and a ball and plate control problem, demonstrate that kernel ACDs can obtain much better performance than previous ACDs.

The main contributions of this paper include the following two aspects. One is automatic feature representation using kernels for VFA in ACDs. Because of the structure learning and nonlinear approximation ability of sparse kernel machines, KHDP and KDHP can obtain much better performance than previous HDP and DHP methods with manually designed neural networks.


The second is to combine sparsified kernel features with the recursive least-squares TD (RLS-TD) algorithm [42] so that a faster learning speed can be realized in the critic of kernel ACDs. As studied in [16] and [22], the convergence of actor-critic algorithms can be ensured based on the principle of two-timescale stochastic approximations, which are characterized by coupled stochastic recursions driven by two different step-size schedules. According to the results in [22], when linear function approximators are used, actor-critic algorithms can be proved to converge if the learning process in the critic is a faster recursion than that in the actor. Thus, when a faster learning speed is realized by using RLS-TD in the critic, kernel ACDs can be expected to have improved convergence performance. The idea of kernel-based VFA can also be applied to other ADP methods for learning control of dynamical systems [28]–[30]. In recent studies on ADP methods, VFA is still a central problem, and it can be expected that new kernel-based ADP algorithms will be developed. In the following, we focus on kernel methods in the popularly used ACDs, including HDP and DHP; the extension of kernel methods to other ADP algorithms is a promising direction for future work.

The rest of this paper is organized as follows. In Section II, background on MDPs and the ALD-based kernel sparsification process is introduced. In Section III, the framework of ACDs with sparse kernel machines is presented and the KHDP and KDHP algorithms are proposed. The performance of kernel ACDs is analyzed from two perspectives: one is the performance error of critic learning, and the other is the convergence of the actor-critic learning control process. In Section IV, simulation and experimental results on two nonlinear learning control systems are provided to illustrate the effectiveness of the proposed methods. Finally, conclusions and future work are summarized in Section V.

II. BACKGROUND

A. Markov Decision Processes

An MDP M is denoted as a quadruple {X, A, R, P}, where X is the state space, A is the action space, P is the state transition probability, and R is the reward function. A stochastic stationary policy π (or just stationary policy) maps states to distributions over the action space. When referring to such a policy π, we use π(a|x) to denote the probability of selecting action a in state x under π. A deterministic stationary policy directly maps states to actions, denoted as

$$a_t = \pi(x_t), \quad t \ge 0. \qquad (1)$$

When the actions a_t (t ≥ 0) satisfy (1), policy π is followed in the MDP M. A stochastic stationary policy π is said to be followed in the MDP M if a_t ∼ π(a|x_t), t ≥ 0.

The objective of a learning controller is to estimate the optimal policy π* satisfying

$$J_{\pi^*} = \max_{\pi} J_{\pi} = \max_{\pi} E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \qquad (2)$$

where 0 < γ < 1 is the discount factor, r_t is the reward at time step t, E_π[·] stands for the expectation with respect to the policy π and the state transition probabilities, and J_π is the expected total reward along the state trajectories obtained by following policy π. In this paper, J_π is also called the performance value of policy π.

The state value function V^π(x) of a policy π is the expected, discounted total reward when starting from x and following policy π thereafter

$$V^{\pi}(x) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x\right]. \qquad (3)$$

Similarly, the state–action value function Q^π(x, a) is defined as the expected, discounted total reward when taking action a in state x and following policy π thereafter

$$Q^{\pi}(x, a) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, a_0 = a\right]. \qquad (4)$$

For an MDP, a deterministic optimal policy π*(x) maximizes the expected, discounted total reward of state x

$$\pi^*(x) = \arg\max_{a} Q^{\pi^*}(x, a). \qquad (5)$$

B. ALD-Based Kernel Sparsification

Let X denote the original state space. A kernel function is a mapping from X × X to R, which is usually assumed to be continuous. A Mercer kernel is a kernel function that is positive definite, that is, for any finite set of points {x_1, x_2, ..., x_n}, the kernel matrix K = [k(x_i, x_j)] (1 ≤ i, j ≤ n) is positive definite. According to the Mercer theorem [32], there exists a Hilbert space H and a mapping φ from X to H such that

$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \qquad (6)$$

where ⟨·, ·⟩ is the inner product in H. Although the dimension of H may be infinite and the nonlinear mapping φ is usually unknown, all the computation in the feature space can still be performed if it is in the form of inner products.

As introduced in [42], in the ALD analysis, after the sample collection process, the kernel-based features are constructed in a data-driven way. Let S_n = {s_1, s_2, ..., s_n} denote a set of data samples and φ be a feature mapping on the data, which can be determined by the Mercer kernel function defined in (6). A feature vector set can be obtained as Φ_n = {φ(s_1), φ(s_2), ..., φ(s_n)}, φ(s_i) ∈ R^{m×1}, i = 1, 2, ..., n. To perform ALD analysis on the feature vector set, a data dictionary is defined as a subset of the feature vector set. The data dictionary D is initially empty, and the ALD analysis is implemented by testing every feature vector in Φ_n, one at a time. If a feature vector φ(s) cannot be approximated within a predefined precision by a linear combination of the feature vectors in the dictionary, it will be added to the dictionary; otherwise, it will not be added. Thus, after the ALD analysis process, all the feature vectors of the data samples in S_n can be approximately represented by linear combinations of the feature vectors in the dictionary within a given precision.


The ALD-based sparsification procedure mainly includes two steps. The first step is to compute the solution of the following optimization problem:

$$\delta_t = \min_{c} \left\| \sum_{s_j \in D_t} c_j \phi(s_j) - \phi(s_t) \right\|^2. \qquad (7)$$

Due to the kernel trick, after substituting (6) into (7), we obtain

$$\delta_t = \min_{c} \left\{ c^T K_{t-1} c - 2 c^T k_{t-1}(s_t) + k_{tt} \right\} \qquad (8)$$

where [K_{t-1}]_{i,j} = k(s_i, s_j), s_i (i = 1, 2, ..., d(t−1)) are the elements in the dictionary, d(t−1) is the size of the data dictionary, k_{t-1}(s_t) = [k(s_1, s_t), k(s_2, s_t), ..., k(s_{d(t-1)}, s_t)]^T, c = [c_1, c_2, ..., c_d]^T, and k_{tt} = k(s_t, s_t).

The optimal solution of (8) is

$$c_t = K_{t-1}^{-1} k_{t-1}(s_t) \qquad (9)$$
$$\delta_t = k_{tt} - k_{t-1}^T(s_t)\, c_t. \qquad (10)$$

The second step of the ALD-based sparsification is to update the data dictionary by comparing δ_t with a predefined threshold μ. If δ_t < μ, the dictionary is left unchanged; otherwise, s_t is added to the dictionary, that is, D_t = D_{t-1} ∪ {s_t}.
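As a concrete illustration, the following Python sketch implements the ALD test of (8)–(10) for a Gaussian kernel. The function and parameter names (ald_sparsify, sigma, mu) are illustrative choices, not identifiers from the original paper, and the brute-force recomputation of the kernel matrix is only for clarity.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / sigma^2), the Mercer kernel assumed here
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def ald_sparsify(samples, mu=0.001, sigma=1.0):
    """Build a kernel dictionary with the ALD test of Eqs. (8)-(10) (sketch)."""
    dictionary = [samples[0]]
    K_inv = np.linalg.inv(np.array([[gaussian_kernel(samples[0], samples[0], sigma)]]))
    for s in samples[1:]:
        k_vec = np.array([gaussian_kernel(d, s, sigma) for d in dictionary])
        c = K_inv @ k_vec                                    # Eq. (9)
        delta = gaussian_kernel(s, s, sigma) - k_vec @ c     # Eq. (10)
        if delta >= mu:                                      # ALD test: add s to the dictionary
            dictionary.append(s)
            K = np.array([[gaussian_kernel(a, b, sigma) for b in dictionary]
                          for a in dictionary])
            K_inv = np.linalg.inv(K)                         # recomputed for simplicity
    return dictionary
```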

After the sparsification procedure, a data dictionary D_n with a reduced number of data vectors is obtained, and the approximated state–action value function or its derivative is represented as follows:

$$Q(x, a) = \sum_{j=1}^{d(n)} \alpha_j k(s, s_j) \qquad (11)$$
$$\lambda(x) = \sum_{j=1}^{d(n)} \alpha_j k(x, x_j) \qquad (12)$$

where d(n), usually much smaller than the original sample size n, is the size of the dictionary D_n, s_j = s(x_j, a_j), and x_j (j = 1, 2, ..., d(n)) are the elements of the data dictionary.
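Evaluating the expansion (11) is then a simple weighted kernel sum over the dictionary. The short Python sketch below assumes a Gaussian kernel and a scalar or vector action concatenated to the state; the helper name q_value is hypothetical.

```python
import numpy as np

def q_value(x, a, dictionary, alpha, sigma=1.0):
    """Evaluate Q(x, a) = sum_j alpha_j k(s, s_j) as in Eq. (11),
    where s is the concatenated state-action vector."""
    s = np.concatenate([np.atleast_1d(x), np.atleast_1d(a)])
    k_vec = np.array([np.exp(-np.sum((s - s_j) ** 2) / sigma ** 2)
                      for s_j in dictionary])
    return float(np.dot(alpha, k_vec))
```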

III. ACDS WITH SPARSE KERNEL MACHINES

A. Framework of Kernel ACDs

A general framework of ACDs with sparse kernel machines is shown in Fig. 1. The main components of kernel ACDs include a critic, a kernel-based feature learning module, a reward function, an actor/controller, and a model of the plant. The kernel-based feature learning module implements data-driven feature representation and learning so that better learning efficiency and generalization performance can be obtained for ACDs. The critic is used to approximate the value functions or their derivatives. In the proposed framework, the kernel function and its induced feature space play important roles in the critic learning process. Since the kernel-based features are linear in form, RLS-TD learning algorithms can be employed in the critic. The actor or controller receives measurement data about the plant's current state x_t and outputs the control u_t. The output of the critic is used in the training process of the actor so that policy gradients can be computed.

Algorithm 1 Kernel ACDs
Input:
  k(·,·): a Mercer kernel function
  g(x, θ): the approximation structure in the actor
  S = {s_i | s_i = (x_i, a_i)}_{i=1}^N: a sample set
1) Initialize: kernel dictionary D = NULL, actor weights θ = θ_0, critic weights α = α_0, step size in the actor β = β_0.
2) For i = 1, 2, ..., N
     Compute δ_t using (8);
     If δ_t ≥ μ
       Add s_i to D;
     End if
   End for
3) Let t = 0;
4) Loop:
     t = t + 1;
     Draw action a_t = g(x_t, θ_t);
     Get reward r_t;
     Observe next state x_{t+1};
     Compute feature vectors k(s_t) and k(s_{t+1});
     Update θ and α according to (35) and (27), or (56) and (53);
   Until the termination criterion is satisfied
5) Return the final policy in the actor.
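A minimal Python skeleton of the online loop in Algorithm 1 is sketched below, assuming the kernel dictionary has already been built offline. The environment interface (env.reset, env.step) and the actor/critic method names are illustrative assumptions, not part of the original paper; the critic is assumed to implement the RLS-TD rules of Eq. (27) or Eqs. (52)–(54), and the actor the gradient rules of Eq. (35) or Eq. (56).

```python
def kernel_acd(env, actor, critic, max_steps=10000):
    """Online learning loop of Algorithm 1 (sketch).

    `actor` is assumed to expose act(x) and update(...); `critic` is assumed
    to expose feature(s) and update(...) following Eq. (27) or Eqs. (52)-(54)."""
    x = env.reset()
    for t in range(max_steps):
        a = actor.act(x)                              # a_t = g(x_t, theta_t)
        x_next, r, done = env.step(a)                 # get reward r_t, observe x_{t+1}
        k_t = critic.feature((x, a))                  # kernel feature vector k(s_t)
        k_next = critic.feature((x_next, actor.act(x_next)))
        critic.update(k_t, k_next, r)                 # critic recursion (faster timescale)
        actor.update(x, a, critic)                    # policy gradient step (slower timescale)
        x = x_next
        if done:
            x = env.reset()
    return actor
```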

[Fig. 1. Learning control structure of kernel ACDs: critic, actor, model, plant, reward function, and kernel-based feature learning module, with signals x_t, a_t, r_t, the predicted state x̂_{t+1}, and the critic outputs V(x_t) or λ(x_t).]

The plant model receives the control u_t and estimates the next state x̂_{t+1}. The state data are provided to the critic and to the reward function. In some ACDs, such as DHP, by making use of the plant model, x̂_{t+1} is provided for a second pass through the critic so that V(x_{t+1}) can be obtained for critic training.

Algorithm 1 shows the proposed kernel ACDs, which include two main procedures: a kernel-based feature construction process and an online learning control process. The sample collection process for kernel feature construction can be realized either by collecting data when a conventional controller is used or by observing the MDP running with an initially randomized policy in the actor. The data samples are in the form of state transitions {(x_1, a_1), (x_2, a_2), ..., (x_n, a_n)}. Based on the data samples, the ALD-based kernel sparsification procedure introduced in Section II-B can be performed offline before the online learning process of ACDs.

Since HDP and DHP are the most widely studied ACDs, we focus on integrating sparse kernel machines into these two online learning control methods. In HDP, the aim of the critic is to approximate the value functions or action value functions, whereas in DHP, the derivatives of the value functions are approximated in the critic. Thus, in the proposed kernel ACDs, a recursive form of the KLSTD algorithm [44] is used, and the action value function or the value function derivative is approximated as

$$Q(s) = \sum_{i=1}^{t} \alpha_i k(s, s_i) \qquad (13)$$
$$\lambda(x) = \sum_{i=1}^{t} \alpha_i k(x, x_i) \qquad (14)$$

where s and s_i are the combined features of the state–action pairs (x, a) and (x_i, a_i), respectively, α_i (i = 1, 2, ..., t) are the weights, and (x_i, a_i) (i = 1, 2, ..., t) are selected state–action pairs from the sample data, that is, trajectories generated from a Markov decision process.

B. KHDP Algorithm

In the critic of KHDP, the action value function Q(x, a) is approximated in a linear weighted form, where a Mercer kernel function k(x, y) = ⟨φ(x), φ(y)⟩ is employed to realize the feature mapping in a reproducing kernel Hilbert space (RKHS). Let s_t = (x_t, a_t) denote the state–action pair at time step t. Then, the action value function Q(x_t, a_t) can also be expressed as Q(s_t). As studied in [40], the regression equation for the linear LS-TD(0) (λ = 0) algorithm is

$$E_0\left[\phi(s_t)\left(Q(s_t) - \gamma Q(s_{t+1})\right)\right] = E_0\left[\phi(s_t)\, r_t\right] \qquad (15)$$

where E_0 is the expectation with respect to the state transition probability when following a stationary policy and

$$Q(s) = \phi^T(s) W, \quad \phi, W \in R^{q \times 1}. \qquad (16)$$

Equation (15) can be rewritten as

$$E_0\left[\phi(s_t)\left(\phi^T(s_t) - \gamma \phi^T(s_{t+1})\right)\right] W = E_0\left[\phi(s_t)\, r(s_t)\right]. \qquad (17)$$

The observation equation of (17) is as follows:

$$\phi(s_t)\left(\phi^T(s_t) - \gamma \phi^T(s_{t+1})\right) W = \phi(s_t)\, r_t + \varepsilon_t \qquad (18)$$

where ε_t is the one-step observation noise.

Due to the property of the RKHS, the weight vector W in (18) can be represented by a weighted sum of the state feature vectors

$$W = \sum_{i=1}^{T} \phi(s_i)\, \alpha_i \qquad (19)$$

where s_i (i = 1, 2, ..., T) are the selected state–action pairs after the ALD analysis, T is the number of selected samples, and α_i are the coefficients.

Let

$$\Phi_T = \left(\phi^T(s_1), \phi^T(s_2), \ldots, \phi^T(s_T)\right)^T \qquad (20)$$
$$k(s_t) = \left(k(s_1, s_t), k(s_2, s_t), \ldots, k(s_T, s_t)\right)^T. \qquad (21)$$

By multiplying both sides of the observation equation (18) by Φ_T and using the kernel trick, we get

$$k(s_t)\left[k^T(s_t)\alpha - \gamma k^T(s_{t+1})\alpha\right] = k(s_t)\, r_t + \nu_t \qquad (22)$$

where ν_t ∈ R^{T×1} is a transformed noise vector and

$$\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_T]^T. \qquad (23)$$

Let

$$A_T = \sum_{t=1}^{N} k(s_t)\left[k^T(s_t) - \gamma k^T(s_{t+1})\right] \qquad (24)$$
$$b_T = \sum_{t=1}^{N} k(s_t)\, r_t \qquad (25)$$

where N is the total number of samples. Then, the kernel-based least-squares fixed-point solution to the TD learning problem is as follows:

$$\alpha = A_T^{-1} b_T. \qquad (26)$$
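The batch solution (24)–(26) can be sketched in Python as follows. The transition format, the Gaussian kernel, and the small ridge term reg (added only to keep A_T numerically invertible) are assumptions for this sketch.

```python
import numpy as np

def kernel_lstd(transitions, dictionary, gamma=0.95, sigma=1.0, reg=1e-6):
    """Batch kernel LS-TD solution of Eqs. (24)-(26) (sketch).

    `transitions` is a list of (s_t, r_t, s_next) tuples, where each s is a
    concatenated state-action vector matching the dictionary entries."""
    def k_vec(s):
        return np.array([np.exp(-np.sum((s - s_j) ** 2) / sigma ** 2)
                         for s_j in dictionary])

    T = len(dictionary)
    A = np.zeros((T, T))
    b = np.zeros(T)
    for s, r, s_next in transitions:
        k_t, k_next = k_vec(s), k_vec(s_next)
        A += np.outer(k_t, k_t - gamma * k_next)    # Eq. (24)
        b += k_t * r                                # Eq. (25)
    return np.linalg.solve(A + reg * np.eye(T), b)  # Eq. (26)
```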

To realize online learning in the critic, the following update rules based on the kernel RLS-TD(0) algorithm are used in the critic of KHDP.

Critic Update in KHDP:

$$\beta_{t+1} = \frac{P_t\, k(s_t)}{\mu + \left(k^T(s_t) - \gamma k^T(s_{t+1})\right) P_t\, k(s_t)}$$
$$\alpha_{t+1} = \alpha_t + \beta_{t+1}\left(r_t - \left(k^T(s_t) - \gamma k^T(s_{t+1})\right)\alpha_t\right)$$
$$P_{t+1} = \frac{1}{\mu}\left[P_t - \frac{P_t\, k(s_t)\left(k^T(s_t) - \gamma k^T(s_{t+1})\right) P_t}{\mu + \left(k^T(s_t) - \gamma k^T(s_{t+1})\right) P_t\, k(s_t)}\right] \qquad (27)$$

where β_t is the step size in the critic, μ (0 < μ ≤ 1) is the forgetting factor, P_0 = δI, δ is a positive number, and I is the identity matrix.
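A compact sketch of this recursive critic, written as a Python class, is given below. The class name, the default parameter values, and the Gaussian kernel are illustrative assumptions; only the update equations follow (27).

```python
import numpy as np

class KernelRLSTDCritic:
    """Recursive critic update of Eq. (27) for KHDP (sketch).

    `dictionary` holds the state-action vectors selected by ALD analysis."""

    def __init__(self, dictionary, sigma=1.0, gamma=0.95, mu=1.0, delta=100.0):
        self.dictionary = dictionary
        self.sigma, self.gamma, self.mu = sigma, gamma, mu
        T = len(dictionary)
        self.alpha = np.zeros(T)        # coefficient vector alpha
        self.P = delta * np.eye(T)      # P_0 = delta * I

    def feature(self, s):
        s = np.concatenate([np.atleast_1d(v) for v in s]).astype(float)
        return np.array([np.exp(-np.sum((s - s_j) ** 2) / self.sigma ** 2)
                         for s_j in self.dictionary])

    def update(self, k_t, k_next, r):
        d = k_t - self.gamma * k_next                       # k^T(s_t) - gamma k^T(s_{t+1})
        denom = self.mu + d @ self.P @ k_t
        beta = self.P @ k_t / denom                         # gain vector beta_{t+1}
        self.alpha = self.alpha + beta * (r - d @ self.alpha)
        self.P = (self.P - np.outer(self.P @ k_t, d @ self.P) / denom) / self.mu
        return self.alpha
```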

The actor network in KHDP uses MLPNNs to approximate the policy function

$$a_t = g(x_t, \theta_t). \qquad (28)$$

In this paper, the learning control objective is to minimize or maximize the following total discounted reward:

$$J(x) = V(x) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x\right] \qquad (29)$$

where 0 < γ < 1 is the discount factor.

In this paper, we mainly focus on deterministic MDPs, and the reward function is defined to take either nonpositive or nonnegative values. For nonpositive reward functions, the learning objective is to maximize the expected total discounted reward; for nonnegative reward functions, it is to minimize the expected total discounted reward. Therefore, the following cost function is used in the actor to realize the learning control objective:

$$E_a = \frac{1}{2} Q^2(x, a). \qquad (30)$$

Since minimizing the cost function (30) is equivalent to minimizing J(x) when Q(x, a) is nonnegative or maximizing J(x) when Q(x, a) is nonpositive, the policy gradient learning rule in the actor can be designed as

$$\Delta\theta_t = \frac{\partial E_a}{\partial \theta_t} = Q(x_t, a_t)\,\frac{\partial Q(x_t, a_t)}{\partial a_t}\,\frac{\partial a_t}{\partial \theta_t}. \qquad (31)$$


When Gaussian kernels are used, the approximated action value function is

$$Q(x, a) = \sum_{i=1}^{T} \alpha_i k(s, s_i) = \sum_{i=1}^{T} \alpha_i\, e^{-\|s - s_i\|^2 / \sigma^2} \qquad (32)$$

where x = (x^{(1)}, x^{(2)}, ..., x^{(m)})^T, s = (x^{(1)}, x^{(2)}, ..., x^{(m)}, a) is the combined vector of the state–action pair (x, a), and ‖·‖ is defined as

$$\|s - s_i\| = \sqrt{\sum_{j=1}^{m} \left(x^{(j)} - x_i^{(j)}\right)^2 + (a - a_i)^2}. \qquad (33)$$

On the basis of the definition in (32), we have

$$\frac{\partial Q(x_t, a_t)}{\partial a_t} = \sum_{i=1}^{T} \frac{2\alpha_i (a_t - a_i)}{\sigma^2}\, e^{-\|s_t - s_i\|^2 / \sigma^2}. \qquad (34)$$

Then, the actor learning rule in KHDP is as follows.

Actor Update in KHDP:

$$\theta_{t+1} = \theta_t - \eta_t \Delta\theta_t = \theta_t - \eta_t\, Q(x_t, a_t)\,\frac{\partial a_t}{\partial \theta_t} \sum_{i=1}^{T} \frac{\alpha_i (a_t - a_i)}{\sigma^2}\, e^{-\|s_t - s_i\|^2 / \sigma^2} \qquad (35)$$

where η_t is the step size in the actor.
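For a scalar action, one actor step following the form of (35) might look as below. The helper actor_grad(x, theta), which returns ∂a_t/∂θ_t for the MLP actor, is a hypothetical function; the dictionary entries are assumed to store the action as their last component.

```python
import numpy as np

def khdp_actor_step(theta, x, a, dictionary, alpha, actor_grad, eta=0.1, sigma=1.0):
    """One KHDP actor update following Eq. (35) (sketch, scalar action)."""
    s = np.concatenate([np.atleast_1d(x), np.atleast_1d(a)])
    k_vec = np.array([np.exp(-np.sum((s - s_i) ** 2) / sigma ** 2)
                      for s_i in dictionary])
    q = float(alpha @ k_vec)                                  # Q(x_t, a_t), Eq. (32)
    kernel_term = sum(alpha[i] * (a - dictionary[i][-1]) / sigma ** 2 * k_vec[i]
                      for i in range(len(dictionary)))        # sum in Eq. (35)
    return theta - eta * q * actor_grad(x, theta) * kernel_term
```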

C. KDHP Algorithm

The critic learning in KDHP is to approximate the derivatives of the state value functions, which satisfy the following Bellman equation:

$$\frac{\partial V(x_t)}{\partial x_t} = \frac{\partial R_t}{\partial x_t} + \gamma\,\frac{\partial E\left[V(x_{t+1})\right]}{\partial x_t} \qquad (36)$$

where R_t is the expected reward and E[·] is taken with respect to the state transition probability when following a stationary policy.

Let

$$\lambda(x_t) = \frac{\partial V(x_t)}{\partial x_t} \qquad (37)$$
$$\lambda(x_{t+1}) = \frac{\partial V(x_{t+1})}{\partial x_{t+1}}. \qquad (38)$$

If x_t and a_t are one-dimensional, the following relation holds:

$$\frac{\partial V(x_{t+1})}{\partial x_t} = \frac{\partial V(x_{t+1})}{\partial x_{t+1}}\frac{\partial x_{t+1}}{\partial x_t} + \frac{\partial V(x_{t+1})}{\partial x_{t+1}}\frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial x_t} = \lambda(x_{t+1})\frac{\partial x_{t+1}}{\partial x_t} + \lambda(x_{t+1})\frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial x_t}. \qquad (39)$$

If x_t = [x_i(t)]_{n×1} and a_t = [u_i(t)]_{m×1} are multidimensional vectors, (39) becomes

$$\frac{\partial V(x_{t+1})}{\partial x_j(t)} = \sum_{i=1}^{n} \frac{\partial V(x_{t+1})}{\partial x_i(t+1)}\frac{\partial x_i(t+1)}{\partial x_j(t)} + \sum_{i=1}^{n}\sum_{k=1}^{m} \frac{\partial V(x_{t+1})}{\partial x_i(t+1)}\frac{\partial x_i(t+1)}{\partial a_k(t)}\frac{\partial a_k(t)}{\partial x_j(t)}$$
$$= \sum_{i=1}^{n} \lambda(x_i(t+1))\frac{\partial x_i(t+1)}{\partial x_j(t)} + \sum_{i=1}^{n}\sum_{k=1}^{m} \lambda(x_i(t+1))\frac{\partial x_i(t+1)}{\partial a_k(t)}\frac{\partial a_k(t)}{\partial x_j(t)} \qquad (40)$$

where m and n are the dimensions of a_t and x_t, respectively. To simplify the notation, we only show the results when x_t and a_t are one-dimensional variables; therefore, (39) is employed. The extension to multidimensional state and control vectors can be done by considering (40) instead of (39).

Then, (36) can be rewritten as

$$\lambda(x_t) = \frac{\partial R_t}{\partial x_t} + \gamma E\left[\lambda(x_{t+1})\left(\frac{\partial x_{t+1}}{\partial x_t} + \frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial x_t}\right)\right] \qquad (41)$$

where ∂x_{t+1}/∂x_t and ∂x_{t+1}/∂a_t can be computed based on the model network in Fig. 1, and ∂a_t/∂x_t can be computed on the basis of the actor network.

Suppose the following nonlinear mappings are implemented by the model network and the actor network, respectively:

$$x_{t+1} = f(x_t, a_t) \qquad (42)$$
$$a_t = g(x_t, \theta_t) \qquad (43)$$

where θ_t is the weight vector of the actor network. Then, the derivatives on the right-hand side of (41) can be obtained as

$$\frac{\partial x_{t+1}}{\partial x_t} = \frac{\partial f(x_t, a_t)}{\partial x_t} \qquad (44)$$
$$\frac{\partial a_t}{\partial x_t} = \frac{\partial g(x_t, \theta_t)}{\partial x_t}. \qquad (45)$$

The temporal difference can be defined as

$$\delta(t) = \frac{\partial r_t}{\partial x_t} + \gamma\left(\frac{\partial x_{t+1}}{\partial x_t} + \frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial x_t}\right)\lambda(x_{t+1}) - \lambda(x_t). \qquad (46)$$

In the critic learning of KDHP, a kernel-based approximation structure is considered to approximate λ(x_t). At first, consider the following approximation structure in linear form:

$$\lambda(x_t) = \frac{\partial V(x_t)}{\partial x_t} = \phi^T(x_t)\, W = \sum_{j=1}^{l} \phi_j(x_t)\, w_j \qquad (47)$$

where φ(x_t) = [φ_1(x_t), φ_2(x_t), ..., φ_l(x_t)]^T is a vector of basis functions, x_t is the input state of the critic, and W = [w_1, w_2, ..., w_l]^T is the weight vector.

By multiplying both sides of (41) by φ(x_t), the fixed-point equation for linear LS-TD(0) algorithms is derived as

$$E\left[\phi(x_t)\left[\lambda(x_t) - \gamma\left(\frac{\partial x_{t+1}}{\partial x_t} + \frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial x_t}\right)\lambda(x_{t+1})\right]\right] = \phi(x_t)\frac{\partial R_t}{\partial x_t}. \qquad (48)$$

Let

$$D(x_t) = \frac{\partial x_{t+1}}{\partial x_t} + \frac{\partial x_{t+1}}{\partial a_t}\frac{\partial a_t}{\partial x_t}. \qquad (49)$$

Equation (48) can be rewritten as

$$E\left[\phi(x_t)\left(\phi^T(x_t) - \gamma D(x_t)\,\phi^T(x_{t+1})\right)\right] W = E\left[\phi(x_t)\,\frac{\partial r_t}{\partial x_t}\right]. \qquad (50)$$

Assume that x_i (i = 1, 2, ..., T) are the selected states after the ALD analysis, k(x, y) = ⟨φ(x), φ(y)⟩ is a Mercer kernel, and T is the number of selected samples.


Similar to the derivation of (22), by using the kernel trick, we can also obtain the least-squares fixed-point equation for approximating λ(x) in terms of kernel-based features:

$$k(x_t)\left[k^T(x_t)\alpha - \gamma D(x_t)\, k^T(x_{t+1})\alpha\right] = k(x_t)\, r_t + \nu_t \qquad (51)$$

where ν_t ∈ R^{T×1} is a noise vector, α = [α_1, α_2, ..., α_T]^T is the coefficient vector for approximating λ(x), and

$$k(x_t) = \left(k(x_1, x_t), k(x_2, x_t), \ldots, k(x_T, x_t)\right)^T.$$

The kernel-based RLS-TD update rules for critic learning in KDHP are as follows.

Critic Update in KDHP:

$$\beta_{t+1} = \frac{P_t\, k(x_t)}{\mu + \left(k^T(x_t) - \gamma D(x_t)\, k^T(x_{t+1})\right) P_t\, k(x_t)} \qquad (52)$$
$$\alpha_{t+1} = \alpha_t + \beta_{t+1}\left(\frac{\partial r_t}{\partial x_t} - \left(k^T(x_t) - \gamma D(x_t)\, k^T(x_{t+1})\right)\alpha_t\right) \qquad (53)$$
$$P_{t+1} = \frac{1}{\mu}\left[P_t - \frac{P_t\, k(x_t)\left(k^T(x_t) - \gamma D(x_t)\, k^T(x_{t+1})\right) P_t}{\mu + \left(k^T(x_t) - \gamma D(x_t)\, k^T(x_{t+1})\right) P_t\, k(x_t)}\right] \qquad (54)$$

where β_t is the step size in the critic, μ (0 < μ ≤ 1) is the forgetting factor, P_0 = δI, δ is a positive number, and I is the identity matrix.
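For the one-dimensional case of (39), the KDHP critic recursion (52)–(54) could be sketched as a single Python function. Here D_t is the scalar D(x_t) of (49), assumed to be computed from the model and actor networks, and dr_dx is the derivative of the (differentiable) reward function; both names are illustrative.

```python
import numpy as np

def kdhp_critic_update(alpha, P, k_t, k_next, D_t, dr_dx, gamma=0.95, mu=1.0):
    """One KDHP critic step, Eqs. (52)-(54), for scalar state/action (sketch).

    k_t, k_next: kernel feature vectors k(x_t), k(x_{t+1});
    D_t: scalar D(x_t) from Eq. (49); dr_dx: reward derivative dr_t/dx_t."""
    d = k_t - gamma * D_t * k_next                    # k^T(x_t) - gamma D(x_t) k^T(x_{t+1})
    denom = mu + d @ P @ k_t
    beta = P @ k_t / denom                            # Eq. (52)
    alpha = alpha + beta * (dr_dx - d @ alpha)        # Eq. (53)
    P = (P - np.outer(P @ k_t, d @ P) / denom) / mu   # Eq. (54)
    return alpha, P
```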

The actor network is used to generate the control actions based on the observed states of the plant. The output of the actor is given by (43). The learning objective of the actor is to minimize the performance value of the closed-loop system, which can be computed from the value functions of the MDP

$$J(x) = V(x) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x\right]. \qquad (55)$$

In KDHP, based on the outputs of the critic, the following policy gradient method can be used to train the actor.

Actor Update in KDHP:

$$\theta_{t+1} = \theta_t - \eta_t \Delta\theta_t = \theta_t - \eta_t\,\frac{\partial V(x_{t+1})}{\partial a_t}\,\frac{\partial a_t}{\partial \theta_t} = \theta_t - \eta_t\, \lambda(x_{t+1})\,\frac{\partial x_{t+1}}{\partial a_t}\,\frac{\partial a_t}{\partial \theta_t}. \qquad (56)$$

Since λ(x_{t+1}) and ∂x_{t+1}/∂a_t can be computed by the critic and the model network, respectively, and ∂a_t/∂θ_t can be obtained from the actor network in (43), the above policy gradient learning can be implemented along with the critic learning.
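The corresponding actor step of (56) is a single gradient update once these three quantities are available; the short sketch below (with hypothetical argument names) simply combines them.

```python
def kdhp_actor_step(theta, lambda_next, dx_da, da_dtheta, eta=0.1):
    """KDHP actor update of Eq. (56) (sketch): theta <- theta - eta * lambda(x_{t+1})
    * dx_{t+1}/da_t * da_t/dtheta_t, for scalar state and action."""
    return theta - eta * lambda_next * dx_da * da_dtheta
```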

D. Performance Analysis and Discussions

Compared with recent attempts at ADP methods for model-free learning control, one advantage of kernel ACDs is that the manual selection of approximation structures in the critic is avoided, and automatic feature construction and selection can be realized to improve the approximation and generalization capability of ACDs. Furthermore, by making use of the generalization capability of sparse kernel machines, which has been verified in the literature [35], [41], better learning control performance can be obtained. In the critic training of ACDs, the TD(λ) algorithm has been widely used to approximate the value functions or their derivatives, where function approximators are employed to realize generalization in large or continuous spaces. However, for TD(λ) with nonlinear approximators, for example, MLPNNs, there are no convergence proofs, and some divergence counterexamples were found in previous studies [45].

According to the recent theoretical results in [16] and [22], the convergence of ACDs can be ensured based on two-timescale stochastic approximations, where the critic needs to implement a faster recursion than the actor. In kernel ACDs, by making use of the kernel-based features, which take the form of linear basis functions, the RLS-TD algorithm [43] is used to approximate the value functions or their derivatives with improved data efficiency and stability. As shown in [44], the kernel-based LS-TD algorithm is superior to conventional linear or nonlinear TD algorithms in terms of convergence rate. Therefore, with faster learning in the critic, kernel ACDs can achieve better convergence performance than previous ACDs.

1) Performance Error of Critic Learning in Kernel ACDs: In KHDP and KDHP, sparsification of the kernel machines is implemented based on the ALD analysis so that the kernel-based features are approximately linearly independent. The following Lemma 1 shows that the kernel dictionary obtained by the ALD-based sparsification procedure is finite even if infinitely many samples are used.

Lemma 1 [42]: For the ALD-based kernel sparsification procedure, assume that 1) k(·,·) is a continuous Mercer kernel and 2) S is a compact subset of a Banach space. Then, for any training sequence {x_i} ∈ X (i = 1, 2, ..., ∞) and for any μ > 0, the number of dictionary vectors is finite.

In Lemma 1, it is shown that if the original state space X is compact, the ultimate dictionary set will be finite regardless of the dimension of the Hilbert space H. In the following, to simplify the notation, a countable state–action space is considered, but the results on TD learning can also be extended to general spaces [45]. Let the cardinality of the state space be N. The kernel matrix can be denoted as

$$K = \left[k(x_1), k(x_2), \ldots, k(x_N)\right]^T \in R^{N \times d} \qquad (57)$$

where d is the number of dictionary vectors.

Let α be the critic's weight vector, V(α) be the approximated value function using kernel machines, and θ_t be the actor's weight vector. Since θ_t is updated on a slower timescale than the critic, the policy π(θ_t) determined by θ_t is also slowly varying. In the following, we analyze the approximation error of kernel-based RLS-TD learning when the actor's policy is stationary or changes very slowly.

An MDP with a stationary action policy π can be viewed as an equivalent Markov reward process with state transition probability P. Suppose μ is the unique distribution that satisfies μ^T P = μ^T with μ(i) > 0 for all i ∈ X, where μ is a finite or infinite vector depending on the cardinality of X. The theoretical results in [40] and [46] show that when LS-TD or RLS-TD converges, a fixed-point solution can be obtained that minimizes the projected Bellman residual error

$$\min_{\alpha} J_{\alpha} = \min_{\alpha}\left\| V(\alpha) - \Pi T V(\alpha)\right\| \qquad (58)$$


where Π = Φ(Φ^T D Φ)^{-1} Φ^T D, Φ = [φ_1, φ_2, ..., φ_n] ∈ R^{N×n}, T is the Bellman operator, and D = diag{μ(i)}.

In kernel ACDs, the projection operator Π is determined by the sparsified kernel features, and the ALD-based sparsification can be viewed as a regularization procedure for the optimization problem in (58), where the objective function becomes

$$\min_{\alpha} J_{\alpha} = \min_{\alpha}\left\| V(\alpha) - \Pi T V(\alpha)\right\| + h(\alpha) \qquad (59)$$

where h(α) is the structural risk of the kernel machines.

Although the combined objective function in (59) could be minimized jointly, it is optimized sequentially in kernel ACDs. First, by using the ALD-based sparsification criterion, the structural risk h(α) is reduced. Then, the combined objective function in (59) is optimized by kernel LS-TD using the kernel dictionary obtained by the ALD analysis.

In [22], it is proved that if the policy parameters change slowly, the critic weight vector can converge to a solution determined by the actor's policy π(θ). The next problem is the approximation error between the true value function V^{π(θ)}(x) and the solution of the least-squares fixed-point equation (58). Since kernel-based RLS-TD learning essentially implements linear TD learning using kernel-based feature vectors, (58) is equivalent to the fixed-point equation of linear LS-TD learning algorithms. Therefore, based on the analysis of TD learning with linear basis functions in [45], the following relation holds:

$$\left\| V(\alpha) - V^{\pi(\theta)} \right\|_D \le \frac{1 - \lambda\gamma}{1 - \gamma}\left\| \Pi V^{\pi(\theta)} - V^{\pi(\theta)} \right\|_D \qquad (60)$$

where D = diag{μ(i)}, Π = K(K^T D K)^{-1} K^T D, λ (0 ≤ λ ≤ 1) is the eligibility-trace parameter, and ‖X‖_D = √(X^T D X).

Since Π is determined by the sparsified kernel features, inequality (60) also shows that by appropriately selecting and sparsifying the kernel-based features, the approximation error bounds of the value functions can be reduced.

E. Convergence Analysis of Kernel ACDs

Similar to the analysis in [22], the update rules (53) and (56) in ACDs can be modeled in the general setting of two-timescale stochastic approximations

$$X_{t+1} = X_t + \beta_t\left(f(X_t, Y_t) + N^1_{t+1}\right) \qquad (61)$$
$$Y_{t+1} = Y_t + \gamma_t\left(g(X_t, Y_t) + N^2_{t+1}\right) \qquad (62)$$

where f and g are Lipschitz continuous functions, and {N^1_{t+1}} and {N^2_{t+1}} are martingale difference sequences with respect to the fields F_t, satisfying

$$E\left[\left\| N^i_{t+1} \right\|^2 \,\Big|\, F_t\right] \le D_1\left(1 + \|X_t\|^2 + \|Y_t\|^2\right), \quad i = 1, 2, \; t \ge 0 \qquad (63)$$

for some constant D_1 < ∞.

In KHDP and KDHP, the learning rules in the critic use recursive least-squares methods, and the step sizes are adaptively determined by the online computation rules (27) and (52), respectively. When the update in the critic is a faster recursion than the update in the actor, the weights in the critic have uniformly higher increments than the weights in the actor.

To analyze the convergence of kernel ACDs based on two-timescale stochastic approximations, the following ordinary differential equation can be considered:

$$\dot{X} = f(X(t), Y) \qquad (64)$$

where Assumptions (A1)–(A3) hold.

(A1) sup_t ‖X_t‖, sup_t ‖Y_t‖ < ∞;
(A2) Ẋ = f(X(t), Y) has a globally asymptotically stable equilibrium μ(Y), where μ(·) is a Lipschitz continuous function;
(A3) Ẏ = g(μ(Y(t)), Y(t)) has a globally asymptotically stable equilibrium Y*.

In [22], the main convergence result for two-timescale stochastic approximations was obtained as follows.

Theorem 1: Under Assumptions (A1)–(A3), the updates in (61) and (62) converge asymptotically to the equilibrium, that is, (X_t, Y_t) → (μ(Y*), Y*) as t → ∞, with probability one.

In KHDP and KDHP, by appropriately selecting the actor's step sizes, it can be expected that the update in the critic is a faster recursion than the update in the actor, and the weights in the critic have uniformly higher increments than the weights in the actor. In [22], it was proved that when the update in the critic is a faster recursion than that in the actor, a class of actor-critic algorithms with linear function approximators converges almost surely to a small neighborhood of a local minimum of the averaged reward J. In kernel ACDs, by making use of kernel-based features and the RLS-TD algorithm in the critic, the updates in the critic can be a faster recursion than those in the actor, which is beneficial for ensuring the convergence of the online learning process. In Section IV, extensive performance tests and comparisons were conducted, and they show that kernel ACDs have much better performance than conventional ACDs both in terms of convergence speed and in terms of the quality of the final policies.

IV. SIMULATION AND EXPERIMENTAL RESULTS

A. Inverted Pendulum Problem

The inverted pendulum problem has been widely studied as a benchmark control problem with nonlinearity and instability. In the following, simulation and experimental studies are conducted on the inverted pendulum problem to compare the performance of different RL algorithms. In simulation, the performance of kernel ACDs is compared with that of ACDs under different conditions and parameter settings. The near-optimal policies of the different algorithms are also implemented on a real inverted pendulum system to test the performance of the different controllers.

The aim of the learning controller is to balance the pole as long as possible and to keep the angle variations of the pendulum as small as possible. The dynamics equations are assumed to be unknown or only partially known to the learning controller.


[Fig. 2. Performance comparisons between K-ACDs and ACDs under different parameter settings: (a) success rates and (b) average trials, plotted versus the actor module learning rate, the cart mass (kg), and the pole length (m), for KDHP, KHDP, DHP, and HDP.]

For HDP and KHDP, the reward r is always 0 before the pole angle or the cart position exceeds the boundary conditions, that is, if |θ| ≤ 12° and |x| ≤ 1.2 m, r(t) = 0; otherwise r(t) = −1. For DHP and KDHP, a differentiable reward function is defined as r(t) = 0.5(x² + θ²). The simulation time step is 0.02 s. A learning controller is regarded as successful when its final policy can balance the pole for at least 10 000 time steps. A trial starts from an initial state near the equilibrium and ends when the controller balances the pole for 10 000 time steps or the pole angle or the cart position exceeds the boundary conditions.
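The two reward definitions above could be coded as below; angles are in radians and positions in meters, and converting the 12° bound to radians is an implementation detail of this sketch.

```python
import numpy as np

def hdp_reward(theta, x):
    """Failure-based reward used by HDP/KHDP: 0 inside the bounds, -1 on failure."""
    return 0.0 if (abs(theta) <= np.deg2rad(12.0) and abs(x) <= 1.2) else -1.0

def dhp_reward(theta, x):
    """Differentiable quadratic reward used by DHP/KDHP: r(t) = 0.5 (x^2 + theta^2)."""
    return 0.5 * (x ** 2 + theta ** 2)
```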

In Fig. 2, the performance of kernel ACDs and conventional ACDs is compared under different parameter settings, including variations of the actor learning rate, the cart mass, and the pole length. We use two performance measures to evaluate the learning efficiency of the different learning control methods. One is the success rate of a learning controller, defined as the percentage of learning runs that learn a policy to balance the pole for at least 10 000 time steps. The other is the averaged number of trials needed to learn a successful policy. The averaged number of trials was computed by running the learning control process for 10 independent runs. For each independent run, the maximum number of learning trials is 100. For KHDP and KDHP, 40 trials of samples were collected by a random policy to construct the dictionary of kernel features, and the threshold parameter for the ALD analysis is set as μ = 0.001. Fig. 2 shows that the performance of KDHP and KHDP is much better than that of DHP and HDP, respectively. In Fig. 2(a), we see that the success rates of KDHP are all 100% under different settings of the actor learning rate, whereas the performance of DHP and HDP is greatly influenced by the actor learning rate. KHDP has higher success rates than HDP and is also less sensitive to variations of the actor learning rate. Fig. 2(a) also illustrates that KDHP has the best performance (100% success rate) under different changes of the plant dynamics, including variations of the cart mass and the pole length. The performance of KHDP is also much more robust than that of HDP and DHP.


[Fig. 3. Performance comparisons between K-ACDs and ACDs under (a) different noise levels and (b) different numbers of hidden layer nodes in the actor networks, showing success rates and average trials for KDHP, KHDP, DHP, and HDP.]

[Fig. 4. Angle variations (theta, rad, versus t, s) of the real cart-pole system controlled by the different learning controllers (KDHP, KHDP, DHP, HDP) after convergence.]

In Fig. 2(b), it is shown that KDHP needs the smallest averaged number of trials to learn a successful control policy, which means that KDHP converges faster than the other learning control algorithms. Compared with HDP, KHDP converges to a good control policy much faster. However, compared with KHDP and HDP, DHP needs a smaller number of trials to balance the pole successfully.


Fig. 5. Ball and plate system.

[Fig. 6. Total cumulative squared errors (m²) of the four algorithms (DHP, KDHP, HDP, KHDP) over 18 trials.]

This is mainly because DHP makes use of some model information to estimate the policy gradient, which greatly reduces the variance of the policy gradients and increases the convergence speed of ACDs.

The performance comparisons between HDP and DHP were also studied in the simulation, where the performance of HDP and DHP was evaluated under different learning rates and hidden node numbers of the critic and the actor. It is observed that DHP consistently obtains better performance than HDP.

In Fig. 3, the performance of the different learning control algorithms is compared under different noise levels and different numbers of hidden nodes in the actor network. It is illustrated that KDHP has the best performance among all the algorithms and is very robust to sensor noise and structure variations in the actor network. It can be seen that KHDP has much better performance than HDP, and its performance is more robust than that of DHP.

Fig. 4 shows the angle variations of a real cart-pole system controlled by the different learning controllers after convergence. From Fig. 4, it is observed that the final policy obtained by KDHP can stabilize the system in a shorter time than the other learning controllers, which means that the quality of the final near-optimal policy of KDHP is better than that of the other algorithms. Moreover, the performance of KHDP is also better than that of HDP.

The above simulation and experimental results illustrate that by making use of sparse kernel machines in the critic of ACDs, the robustness and efficiency of learning controllers can be greatly improved.

[Fig. 7. Ball position (m) versus time steps under the final policies obtained by the four algorithms (DHP, KDHP, HDP, KHDP).]

B. Learning Control of the Ball and Plate System

The ball and plate system is a typical multivariable nonlinear plant, which has been used as an experimental device to test various learning control methods. The controller design for the ball and plate system becomes very difficult when there are model uncertainties and unknown disturbances in the plant. In the following, both simulation and experimental studies are conducted on the ball and plate problem to compare the performance of different ACDs.

As shown in Fig. 5, a typical ball and plate system comprises a ball, a round plate, a charge-coupled device (CCD) vidicon, two electromotors, and some other control devices. The CCD vidicon is used to detect the position of the ball, and the two electromotors drive the inclination of the plate so that the ball can roll arbitrarily on the plate. The control problems of the ball and plate system include rolling from point to point, route tracking, obstacle avoidance, and so on. In this paper, the learning control problem of rolling from point to point is studied to compare the performance of different ACDs.

The movement of the ball on the plate can be decomposed into two parts: the motion along the x-axis and the motion along the y-axis. Because the control actions are independent and the dynamics models along the x-axis and y-axis are identical, only the learning control problem along the x-axis is considered.

Let x stand for the position of the ball on the plate, and let θ denote the inclination angle of the plate. R is the diameter of the ball, m is the mass of the ball, and τ denotes the moment by which the inclination angle of the plate is changed. In the learning control process, the moment τ is defined as the action. If the state exceeds the boundary conditions, the current trial ends and the controller is regarded as unsuccessful.

For the state defined as (x1, x2, x3, x4) = (θ, θ , x, x), thedynamics equations of the ball and plate system on the x axescan be described as follows:

\dot{X} =
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{x}_4 \end{bmatrix}
=
\begin{bmatrix}
x_2 \\[4pt]
-\dfrac{g x_3 \cos x_1}{4R^2 + x_3^2} + \dfrac{1}{4mR^2 + m x_3^2}\,\tau \\[8pt]
x_4 \\[4pt]
-\dfrac{5g}{7}\sin x_1
\end{bmatrix}.
\qquad (65)
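To make the simulation setup concrete, the following Python sketch implements the x-axis dynamics in (65) with a forward-Euler integration step of 0.02 s; the numerical values of g, m, and R below are illustrative assumptions rather than the parameters of the experimental plant.

import numpy as np

# Illustrative parameters (assumed values, not those of the experimental plant)
G = 9.8     # gravitational acceleration (m/s^2)
M = 0.05    # ball mass m (kg), assumed
R = 0.02    # ball size parameter R in (65) (m), assumed
DT = 0.02   # simulation time step (s), as stated in the text

def ball_plate_dynamics(state, tau):
    # state = (x1, x2, x3, x4) = (theta, theta_dot, x, x_dot); tau is the applied moment
    x1, x2, x3, x4 = state
    dx1 = x2
    dx2 = -G * x3 * np.cos(x1) / (4 * R**2 + x3**2) + tau / (4 * M * R**2 + M * x3**2)
    dx3 = x4
    dx4 = -(5.0 * G / 7.0) * np.sin(x1)
    return np.array([dx1, dx2, dx3, dx4])

def euler_step(state, tau, dt=DT):
    # One forward-Euler step of the x-axis ball and plate dynamics
    return np.asarray(state) + dt * ball_plate_dynamics(state, tau)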

Fig. 8. Performance comparisons of KDHP and DHP algorithms for real-time control of the ball and plate system (ball position, in meters, versus time in seconds).

The reward function is defined as

r(t) = 0.25\bigl(\theta(t) - \theta_d(t)\bigr)^2 + 0.25\bigl(x(t) - x_d(t)\bigr)^2

where θ_d(t) is the desired inclination angle of the plate and x_d(t) is the desired position of the ball; both are zero in the simulation. The discount factor γ is set to 0.95.
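As a minimal illustration, the quadratic reward above and the discount factor can be coded as follows; the desired references default to zero, as in the simulation.

GAMMA = 0.95  # discount factor used in the simulation

def reward(theta, x, theta_d=0.0, x_d=0.0):
    # r(t) = 0.25*(theta - theta_d)^2 + 0.25*(x - x_d)^2
    return 0.25 * (theta - theta_d) ** 2 + 0.25 * (x - x_d) ** 2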

In the simulation, the time step is set to 0.02 s, and a trial starts from an initial position and ends after 10 000 time steps or when the controller fails. The initial state is randomly set around the zero vector and within 10% of the state boundary. If the ball can be stabilized at the desired position within 10 000 time steps, the controller is regarded as successful. One run of the learning control process consists of at most 200 trials, and the initial conditions are set independently in different runs. If a successful controller is obtained in a run, the run ends and a new run starts. The performance of the learning control algorithms is evaluated over 100 independent runs.
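The trial/run protocol described above can be summarized by the sketch below; the state bounds, the agent interface (run_trial returning True on success), and the random-seed handling are hypothetical placeholders for whichever ACD learner is being evaluated, and the 10% initialization band is applied to the assumed bounds.

import numpy as np

MAX_TRIALS_PER_RUN = 200
MAX_STEPS_PER_TRIAL = 10_000
NUM_RUNS = 100
STATE_BOUND = np.array([0.5, 2.0, 0.1, 0.5])  # hypothetical limits on |theta|, |theta_dot|, |x|, |x_dot|

def random_initial_state(rng):
    # Random start near the zero vector, within 10% of the state boundary
    return rng.uniform(-0.1, 0.1, size=4) * STATE_BOUND

def evaluate(make_agent, rng=None):
    # Returns the number of trials needed to obtain a successful controller in each run
    if rng is None:
        rng = np.random.default_rng(0)
    trials_to_success = []
    for _ in range(NUM_RUNS):
        agent = make_agent()                 # fresh learner per run
        result = None
        for trial in range(1, MAX_TRIALS_PER_RUN + 1):
            state = random_initial_state(rng)
            if agent.run_trial(state, MAX_STEPS_PER_TRIAL):  # True if the ball is stabilized
                result = trial
                break
        trials_to_success.append(result)     # None means the run was unsuccessful
    return trials_to_success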

In the four algorithms, the action modules are all constructed with neural networks that have the same structure and parameter settings. The network structure (number of nodes in each layer) of the action modules is 4-5-1. The transfer function from the input layer to the hidden layer is f(x) = (1 + e^{-x})^{-1}, and the transfer function from the hidden layer to the output layer is L(x) = kx. The learning rate of the actor is 0.3, and the actor weights are randomly initialized in [-0.5, 0.5].
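A minimal sketch of such a 4-5-1 actor network is given below, with sigmoid hidden units and a linear output L(x) = kx; the output gain k is an assumed value, and the weight-update rule (which uses the learning rate of 0.3 together with the critic signal) is omitted.

import numpy as np

class ActorNetwork:
    # 4-5-1 feedforward actor: sigmoid hidden layer, linear output L(x) = k*x
    def __init__(self, n_in=4, n_hidden=5, k=1.0, lr=0.3, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_in))  # input-to-hidden weights
        self.W2 = rng.uniform(-0.5, 0.5, size=(1, n_hidden))     # hidden-to-output weights
        self.k = k     # assumed output gain in L(x) = k*x
        self.lr = lr   # actor learning rate (0.3 in the experiments)

    def act(self, state):
        h = 1.0 / (1.0 + np.exp(-self.W1 @ np.asarray(state)))   # f(x) = (1 + e^{-x})^{-1}
        y = self.W2 @ h                                           # linear output layer
        return self.k * y[0]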

In the HDP and DHP algorithms, the critic modules are constructed with neural networks whose parameter settings are the same as those of the action module, except that the structure is 4-5-4. In the KHDP and KDHP algorithms, kernel-based methods are employed to approximate the value functions and the derivatives of the value functions, respectively.
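For concreteness, a kernel critic of the kind used here can be sketched as a weighted sum of Gaussian kernels over a sparse dictionary of states; the Gaussian kernel and its width are assumed choices, and the ALD-based dictionary construction and the recursive weight update are omitted. A KDHP critic would use a similar expansion to represent the derivatives of the value function.

import numpy as np

class KernelCritic:
    # Sparse kernel approximator: V(s) = sum_i alpha_i * k(s, s_i)
    def __init__(self, dictionary, sigma=1.0):
        self.dictionary = np.atleast_2d(np.asarray(dictionary, dtype=float))  # states kept by sparsification
        self.alpha = np.zeros(len(self.dictionary))  # kernel weights (learned by the critic update)
        self.sigma = sigma                            # assumed Gaussian kernel width

    def kernel(self, s, s_i):
        return np.exp(-np.sum((np.asarray(s) - s_i) ** 2) / (2.0 * self.sigma ** 2))

    def value(self, s):
        return float(sum(a * self.kernel(s, s_i) for a, s_i in zip(self.alpha, self.dictionary)))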

The performance of the four algorithms is compared in terms of tracking errors. In Fig. 6, the following performance index over each trial (T = 10 000) is used to compare the four algorithms:

J = \sum_{t=0}^{T} \bigl(x(t) - x_d(t)\bigr)^2. \qquad (66)
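Given the recorded ball positions of one trial, the index in (66) can be computed directly, for example:

import numpy as np

def tracking_error(x_traj, x_d=0.0):
    # J = sum_{t=0}^{T} (x(t) - x_d(t))^2 over one trial
    x_traj = np.asarray(x_traj, dtype=float)
    return float(np.sum((x_traj - x_d) ** 2))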

Fig. 6 shows that KDHP has the fastest convergence rate and the smallest tracking errors. KHDP also converges faster and has smaller tracking errors than HDP.

TABLE I
AVERAGED LEARNING CONTROL PERFORMANCE IN 100 RUNS

Algorithm   Min. No. of Trials   Max. No. of Trials   Averaged Trials   Success Rate
HDP         12                   36                   21.1              57%
KHDP        9                    41                   15.9              92%
DHP         4                    19                   8.4               84%
KDHP        3                    10                   5.8               100%

The final policies obtained by the four algorithms can stabilize the ball within a very small region around the plate center, as demonstrated in Fig. 7, which depicts the position variations of the ball under the final policies of the four algorithms. The final policy obtained by KDHP takes the shortest time to drive the ball to the plate center and stabilize it there. Compared with HDP, KHDP also takes less time to stabilize the ball.

Because online learning control requires fast and reliable convergence, the convergence rates and success rates of the four algorithms were also evaluated. Table I shows that, over 100 independent runs, KDHP needs the smallest number of trials to converge and KHDP needs fewer trials than HDP. The success rate of learning control is 100% for KDHP versus 84% for DHP, and 92% for KHDP versus only 57% for HDP.

As shown in Fig. 5, the ball and plate control system developed by Googol Technology is used for the experimental studies. In the experiments, the control policies obtained from simulation data are used for performance tests, and the time step is still 0.02 s.

Fig. 8 shows that, in the real-time control experiments, the final policy obtained by KDHP drives the ball to the plate center in fewer time steps and with a smaller tracking error than DHP.

Therefore, the simulation and experimental results clearly show that the proposed kernel ACDs obtain better performance than standard ACDs.

V. CONCLUSION

ACDs were among the first approaches to address reinforcement learning problems in a general setting, and they have recently gained renewed interest because of their ability to realize online learning control of dynamical systems. In ACDs, MLPNNs with manually designed structures are commonly used for function approximation in continuous state and action spaces. However, when the network structures are improperly designed, previous ACDs have difficulty achieving good generalization capability and learning efficiency. In this paper, a novel class of ACDs with sparse kernel machines, called kernel ACDs, was presented for online learning control problems. Based on this framework, two kernel ACD algorithms, KHDP and KDHP, were proposed and their performance was analyzed. Because of the representation learning and generalization capability of sparse kernel machines, as well as the fast recursion in the critic using RLS-TD, kernel ACDs obtain much better performance than previous ACDs with MLPNNs.

The research in this paper showed that it is very promising to integrate sparse kernel machines into online learning control. Several topics remain for future work. One is the application of kernel ACDs to more real-world online learning control problems, so that better performance can be obtained in real-time learning control systems. Another is to develop more rigorous theoretical results on the convergence of kernel ACDs; existing convergence analyses of ACDs still rely on various assumptions, and further work is needed to close the gap between theoretical assumptions and practical implementations.


Xin Xu (M'07–SM'12) received the B.S. degree in electrical engineering from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, China, in 1996, and the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation, NUDT, in 2002.

He is currently a Full Professor with the Institute of Unmanned Systems, College of Mechatronics and Automation, NUDT. He has been a Visiting Scientist for cooperative research with Hong Kong Polytechnic University, Hong Kong; the University of Alberta, Edmonton, AB, Canada; the University of Guelph, Guelph, ON, Canada; and the University of Strathclyde, Glasgow, U.K. He has authored or co-authored more than 90 papers in international journals and conferences, and co-authored four books. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles.

Dr. Xu was a recipient of the First-Class Natural Science Award of Hunan Province, China, in 2009 and the Fok Ying Tong Youth Teacher Fund of China in 2008. He currently serves as an Associate Editor of Information Sciences and as a Guest Editor of the International Journal of Adaptive Control and Signal Processing. He is a member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning (ADPRL) and the IEEE Technical Committee on Robot Learning. He has been a program committee member or session chair for many international conferences.

Zhongsheng Hou received the Bachelor's and Master's degrees from the Jilin University of Technology, Changchun, China, in 1983 and 1988, respectively, and the Ph.D. degree from Northeastern University, Shenyang, China, in 1994.

He was a Post-Doctoral Fellow with the Harbin Institute of Technology, Harbin, China, from 1995 to 1997, and a Visiting Scholar with Yale University, New Haven, CT, from 2002 to 2003. In 1997, he joined Beijing Jiaotong University, Beijing, China, where he is currently a Full Professor, the Founding Director of the Advanced Control Systems Laboratory, and the Head of the Department of Automatic Control, School of Electronic and Information Engineering. He has authored or co-authored over 100 papers in peer-reviewed journals and over 100 papers in prestigious conference proceedings, and has authored two monographs, Nonparametric Model and its Adaptive Control Theory and the forthcoming Model Free Adaptive Control: Theory and Applications (CRC Press, 2013). His current research interests include data-driven control, model-free adaptive control, learning control, and intelligent transportation systems.

Dr. Hou has been a committee member of over 40 international and Chinese conferences, and has served as an Associate Editor and Guest Editor for several international and Chinese journals.

Chuanqiang Lian received the Bachelor's degree from the Department of Automation, Qinghua University, Beijing, China, in 2008, and the Master's degree from the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China, in 2010, where he is currently pursuing the Ph.D. degree with the Institute of Unmanned Systems.

He has co-authored more than 10 papers in international journals and conferences. His current research interests include reinforcement learning, approximate dynamic programming, and autonomous vehicles.

Haibo He (SM'11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, in 2006.

He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 2006 to 2009. He is currently an Associate Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI. He has authored or co-authored over 100 peer-reviewed journal and conference papers, authored one research book (Wiley), and edited six conference proceedings (Springer). His research has been highlighted in numerous media outlets, such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, hardware design for machine intelligence, and various applications such as smart grid.

Dr. He was a recipient of the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator Award in 2011. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID.