
MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Leader-to-formation stability of multi-agent systems: An adaptive optimal control approach

Gao, W.; Jiang, Z.-P.; Lewis, F.; Wang, Y.

TR2018-013 January 2018

Abstract

This note proposes a novel data-driven solution to the cooperative adaptive optimal control problem of leader-follower multi-agent systems under switching network topology. The dynamics of all the followers are unknown, and the leader is modeled by a perturbed exosystem. Through the combination of adaptive dynamic programming and internal model principle, an approximate optimal controller is iteratively learned online using real-time input-state data. Rigorous stability analysis shows that the system in closed-loop with the developed control policy is leader-to-formation stable, with guaranteed robustness to unmeasurable leader disturbance. Numerical results illustrate the effectiveness of the proposed data-driven algorithm.

IEEE Transactions on Automatic Control

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2018
201 Broadway, Cambridge, Massachusetts 02139


Leader-to-formation stability of multi-agent systems: An adaptive optimal control approach

Weinan Gao, Zhong-Ping Jiang, Fellow, IEEE, Frank L. Lewis, Fellow, IEEE, and Yebin Wang, Member, IEEE

Abstract—This note proposes a novel data-driven solution to the cooperative adaptive optimal control problem of leader-follower multi-agent systems under switching network topology. The dynamics of all the followers are unknown, and the leader is modeled by a perturbed exosystem. Through the combination of adaptive dynamic programming and internal model principle, an approximate optimal controller is iteratively learned online using real-time input-state data. Rigorous stability analysis shows that the system in closed-loop with the developed control policy is leader-to-formation stable, with guaranteed robustness to unmeasurable leader disturbance. Numerical results illustrate the effectiveness of the proposed data-driven algorithm.

Index Terms—Adaptive dynamic programming (ADP), Optimal tracking control, Leader-to-formation stability, Switching network topology.

I. INTRODUCTION

The cooperative control problem of leader-follower multi-agent systems has been under extensive investigation in the last decade due to its wide applications in electrical, mechanical and biological systems; see [3], [4], [35] and many references therein. A general assumption in the present literature on this topic is that the leader is modeled by an autonomous system, without considering the influence of external signals. Reference [29] relaxes this assumption by developing distributed trackers for multi-agent systems with bounded unknown leader input. The notion of leader-to-formation stability (LFS) [41] is introduced to investigate how the leader inputs and disturbances affect the stability of the group. By taking unknown dynamics and partial measurements into account, reference [13] proposes an adaptive control design approach for a class of second-order leader-follower systems. However, the issue of adaptive optimal controller design for multi-agent systems with assured LFS remains open.

Over the last decade, a trend in adaptive optimal control is to invoke reinforcement learning [40] and approximate/adaptive dynamic programming (ADP) [2], [21], [27], [45] for feedback control of dynamical systems. Among all the different ADP approaches, much attention has been paid to achieving adaptive optimal stabilization of linear or nonlinear plants via state-feedback [6], [11], [12], [18], [19], [22], [27], [42], [43] and output-feedback [8], [9], [26]. The generalization to adaptive optimal tracking control is studied in [10], [33], [34]. For non-model-based optimal stabilization of large-scale and multi-agent systems, some interesting results appear in [1], [20] using (robust) ADP.

This work has been partly supported by the U.S. National Science Foundation grant ECCS-1501044 and Mitsubishi Electric Research Laboratories.

W. Gao is with the Department of Electrical and Computer Engineering, Allen E. Paulson College of Engineering and Computing, Georgia Southern University, Statesboro, GA 30460 USA (e-mail: [email protected]).

Z. P. Jiang is with the Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, Brooklyn, NY 11201 USA (e-mail: [email protected]).

F. Lewis is with the UTA Research Institute, The University of Texas at Arlington, Texas 76118 USA, and a Qian Ren Consulting Professor with Northeastern University, Shenyang 110036, China (e-mail: [email protected]).

Y. Wang is with Mitsubishi Electric Research Laboratories, Cambridge, MA 02139 USA (e-mail: [email protected]).

The main purpose of this note is to address the cooperative adaptive optimal control problem of leader-follower multi-agent systems via ADP. The contributions of this note are three-fold. First, considering the more general and realistic case where the leader model (i.e., the exosystem) is subject to external disturbance, we develop a data-driven distributed control policy that guarantees that the closed-loop system is leader-to-formation stable. Moreover, given a vanishing leader disturbance, the multi-agent system achieves cooperative output regulation [5], [14], [30], [38], [44], which means that each follower asymptotically tracks a desired trajectory while rejecting the disturbance ($D_i v$ in eq. (2) below) generated by the exosystem. Second, this note, for the first time, combines the ideas of ADP and the internal model principle to study cooperative adaptive optimal tracking control problems. By means of the internal model principle, we convert the tracking problem into a stabilization problem for an augmented system composed of the plant and a dynamic compensator named the internal model. Compared with our previous work [10], which needs to solve regulator equations from online data first and then design a feedback-feedforward controller, the algorithm proposed in this note has a reduced computational cost, since it need not solve regulator equations. Third, instead of assuming that the communication network remains static and connected, we study the more practical situation where the network is jointly connected [17]. In other words, the network is allowed to be disconnected at any time instant. To overcome this issue, the estimate of the exostate obtained from a distributed observer is used for feedback design.

The remainder of this note is organized as follows. Section II formulates the problem and introduces some basic results regarding LFS, the internal model principle, and optimal control. In Section III, a novel data-driven control approach based on ADP is presented to solve cooperative adaptive optimal control problems for leader-follower multi-agent systems under switching networks. The convergence of the proposed algorithm and the LFS of the closed-loop system are rigorously analyzed as well. An example validating our design is shown in Section IV. Section V provides concluding remarks. For the sake of clarity and readability, the relationship among our main results


is illustrated in Table I.

TABLE I
RELATIONSHIP BETWEEN THE PROPOSITION, LEMMA AND THEOREMS

Stability Analysis: Proposition 1 → Theorem 2
Convergence Analysis: Lemma 1 → Theorem 1
Optimality Analysis: Theorem 3

Notations. Throughout this note, $|\cdot|$ represents the Euclidean norm for vectors and the induced norm for matrices. $\mathbb{C}^-$ stands for the open left-half complex plane. For any piecewise continuous function $u:\mathbb{R}_+\to\mathbb{R}^m$, $\|u\|$ stands for $\sup_{t\ge 0}|u(t)|$. $\otimes$ indicates the Kronecker product. A continuous function $\alpha:\mathbb{R}_+\to\mathbb{R}_+$ belongs to class $\mathcal{K}$ if it is increasing and $\alpha(0)=0$. A continuous function $\beta:\mathbb{R}_+\times\mathbb{R}_+\to\mathbb{R}_+$ belongs to class $\mathcal{KL}$ if, for each fixed $t$, the function $\beta(\cdot,t)$ is of class $\mathcal{K}$ and, for each fixed $s$, the function $\beta(s,\cdot)$ is non-increasing and tends to 0 at infinity. $\mathrm{vec}(A) = [a_1^T, a_2^T, \cdots, a_m^T]^T$, where $a_i\in\mathbb{R}^n$ is the $i$th column of $A\in\mathbb{R}^{n\times m}$. When $n = m$, $\sigma(A)$ is its complex spectrum. For a symmetric matrix $P\in\mathbb{R}^{m\times m}$, $\mathrm{vecs}(P) = [p_{11}, 2p_{12}, \cdots, 2p_{1m}, p_{22}, 2p_{23}, \cdots, 2p_{m-1,m}, p_{mm}]^T \in \mathbb{R}^{\frac{1}{2}m(m+1)}$. For an arbitrary column vector $v\in\mathbb{R}^m$, $|v|_P$ stands for $v^T P v$, and $\mathrm{vecv}(v) = [v_1^2, v_1v_2, \cdots, v_1v_m, v_2^2, v_2v_3, \cdots, v_{m-1}v_m, v_m^2]^T \in \mathbb{R}^{\frac{1}{2}m(m+1)}$. $\lambda_M(P)$ and $\lambda_m(P)$ denote the maximum and the minimum eigenvalues of a real symmetric matrix $P$. $\rho(t)$ represents a piecewise constant switching signal $\rho:[0,+\infty)\to\{1,2,\cdots,n_\rho\}$ for some integer $n_\rho > 0$. We assume the switching instants $t_0 = 0, t_1, t_2, \cdots$ of $\rho$ satisfy $\inf_i(t_{i+1}-t_i) \ge \tau_d > 0$, $i = 0, 1, \cdots$, with $\lim_{i\to\infty} t_i = \infty$, where $\tau_d$ is called the dwell time.
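The $\mathrm{vecs}(\cdot)$ and $\mathrm{vecv}(\cdot)$ operators are defined precisely so that a quadratic form becomes linear in the unknown matrix entries, i.e., $|v|_P = \mathrm{vecv}(v)^T \mathrm{vecs}(P)$. A minimal NumPy sketch (the helper names are ours, not from the note):

```python
import numpy as np

def vecs(P):
    """vecs(P) = [p11, 2p12, ..., 2p1m, p22, 2p23, ..., pmm]^T
    for a symmetric m x m matrix P."""
    m = P.shape[0]
    out = []
    for i in range(m):
        out.append(P[i, i])
        for j in range(i + 1, m):
            out.append(2.0 * P[i, j])
    return np.array(out)

def vecv(v):
    """vecv(v) = [v1^2, v1v2, ..., v1vm, v2^2, ..., vm^2]^T,
    ordered so that v^T P v = vecv(v) . vecs(P)."""
    m = len(v)
    out = []
    for i in range(m):
        out.append(v[i] * v[i])
        for j in range(i + 1, m):
            out.append(v[i] * v[j])
    return np.array(out)
```

This linear-in-$P$ rewriting of quadratic forms is what later turns the integral identity (20) into a regression that is linear in the unknown controller parameters.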

II. PRELIMINARIES

A. Problem Formulation

Consider the following class of linear multi-agent systems

$$\dot v = Ev + Hw, \qquad (1)$$
$$\dot x_i = A_i x_i + B_i u_i + D_i v, \qquad e_i = C_i x_i + F_i v, \quad i = 1, 2, \cdots, N \qquad (2)$$

where $x_i\in\mathbb{R}^{n_i}$, $u_i\in\mathbb{R}$ and $e_i\in\mathbb{R}$ are the state, control input and tracking error of the $i$th subsystem (follower), respectively. $v\in\mathbb{R}^q$ is the state of the leader, modeled by the exosystem (1), which generates both the disturbance $D_i v$ and the reference signal $-F_i v$ (to be tracked by the output $y_i = C_i x_i$) of each follower. The leader is assumed to track a desired trajectory $y_0^* = -F_0 v^*(t)$, with the signal $v^*$ satisfying $\dot v^* = E v^*$. In this setting, we let the leader input contain two parts: $w = \bar w + \tilde w$, where $\bar w = -K_0(v - v^*)$ is a feedback control input with $\sigma(E - HK_0)\subset\mathbb{C}^-$, and $\tilde w$ is an external disturbance input. Given the exosystem (1) and the plant (2), define a time-varying digraph $\mathcal{G}_{\rho(t)} = \{\mathcal{V}, \mathcal{E}_{\rho(t)}\}$. $\mathcal{V} = \{0, 1, \cdots, N\}$ is the node set, with node 0 denoting the leader and the remaining $N$ nodes being the followers described by (2). $\mathcal{E}_{\rho(t)}\subset\mathcal{V}\times\mathcal{V}$ refers to the edge set. Denote by $\mathcal{N}_i(t)$ the set of all nodes $j$ such that $(j,i)\in\mathcal{E}_{\rho(t)}$. The adjacency matrix $\mathcal{A}_{\rho(t)} = [a_{ij}(t)]\in\mathbb{R}^{(N+1)\times(N+1)}$ is defined by $a_{ij}(t) > 0$ if $(j,i)\in\mathcal{E}_{\rho(t)}$ and $a_{ij}(t) = 0$ otherwise.

Some standard assumptions are made on the system (1)-(2). Similar assumptions can be found in [7], [31], [32], [38], [44] for solving (cooperative) output regulation problems.

Assumption 1. All the eigenvalues of E are simple with zeroreal part.

Assumption 2. (Ai, Bi) is stabilizable, ∀1 ≤ i ≤ N .

Assumption 3. $\mathrm{rank}\begin{bmatrix} A_i - \lambda I & B_i \\ C_i & 0 \end{bmatrix} = n_i + 1$, $\forall\lambda\in\sigma(E)$, $\forall 1\le i\le N$.

Assumption 4. There exists a subsequence $\{i_k\}$ of $\{i : i = 0, 1, \cdots\}$ with $t_{i_k+1} - t_{i_k} < T$ for some positive $T$ such that each node $j = 1, 2, \cdots, N$ is reachable from node 0 in the union graph $\cup_{l=i_k}^{i_{k+1}-1}\mathcal{G}_{\rho(t_l)}$.

Remark 1. Assumption 2 is made so that exponential stability can be achieved for each follower. Assumption 3 is a sufficient condition for the solvability of the regulator equations (6)-(7). Assumption 4 is a joint connectivity condition [17], [31], [39] that allows the network to be disconnected at any time instant.

B. Basic Results

Under Assumptions 1-3, LFS can be achieved by system (1)-(2) in closed-loop with the decentralized controller

$$u_i = -K_{xi} x_i - K_{zi} z_i, \qquad (3)$$
$$\dot z_i = G_1 z_i + G_2 e_i, \quad i = 1, 2, \cdots, N \qquad (4)$$

where the characteristic polynomial of $G_1$ is the same as the minimal polynomial of $E$, and the pair $(G_1, G_2)$ is controllable. In this setting, the pair $(G_1, G_2)$ incorporates an internal model of the matrix $E$, and (4) is an internal model of the $i$th follower. For $i = 1, 2, \cdots, N$, the matrices $K_{xi}, K_{zi}$ are chosen such that

$$A_{ci} = \begin{bmatrix} A_i - B_i K_{xi} & -B_i K_{zi} \\ G_2 C_i & G_1 \end{bmatrix}$$

is a Hurwitz matrix.

Remark 2. As shown in [15, Lemma 1.26], for $i = 1, 2, \cdots, N$, the augmented pair $(\mathcal{A}_i, \mathcal{B}_i)$ is stabilizable under Assumptions 1-2, where

$$\mathcal{A}_i = \begin{bmatrix} A_i & 0 \\ G_2 C_i & G_1 \end{bmatrix}, \qquad \mathcal{B}_i = \begin{bmatrix} B_i \\ 0 \end{bmatrix},$$

which implies that one can always find a $K_i := \begin{bmatrix} K_{xi} & K_{zi}\end{bmatrix}$ such that $\sigma(A_{ci})\subset\mathbb{C}^-$.
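As a concrete instance of Remark 2, the augmented pair can be assembled and its stabilizability checked numerically. A short sketch, assuming the double-integrator follower and oscillator internal model used later in the example of Section IV (here stabilizability is verified through the stronger property of full controllability, which holds in this case):

```python
import numpy as np

# Follower and internal-model data (from the example in Section IV)
A  = np.array([[0., 1.], [0., 0.]])          # A_i
B  = np.array([[0.], [1.]])                  # B_i
C  = np.array([[1., 0.]])                    # C_i
G1 = np.array([[0., 1.], [-1., 0.]])         # char. poly = minimal poly of E
G2 = np.array([[0.], [1.]])

# Augmented pair from Remark 2: calA = [[A, 0], [G2 C, G1]], calB = [B; 0]
calA = np.block([[A, np.zeros((2, 2))], [G2 @ C, G1]])
calB = np.vstack([B, np.zeros((2, 1))])

# Controllability matrix [B, AB, A^2 B, A^3 B]; full rank => stabilizable
ctrb = np.hstack([np.linalg.matrix_power(calA, k) @ calB for k in range(4)])
print(np.linalg.matrix_rank(ctrb))  # 4
```

Full rank of the controllability matrix guarantees that a gain $K_i$ rendering $A_{ci}$ Hurwitz exists.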

Let $\eta$ be the error between the lumped state of the multi-agent system (1)-(2) with (4) and its desired value. The LFS is then defined as follows. The definition in this note is in light of input-to-output stability [23], [37], which is slightly different from [41].

Definition 1. System (1)-(2) achieves LFS if there exist a function $\beta$ of class $\mathcal{KL}$ and a function $\gamma$ of class $\mathcal{K}$ such that, for any initial state error $\eta(0)$, any measurable, essentially bounded input $\tilde w$, and all $t \ge 0$:

$$|e(t)| \le \beta(|\eta(0)|, t) + \gamma(\|\tilde w\|) \qquad (5)$$

where $e(t) = \begin{bmatrix} e_1(t) & e_2(t) & \cdots & e_N(t)\end{bmatrix}^T$.

Remark 3. Note that LFS ensures that, given any bounded disturbance $\tilde w$, the tracking error $e$ will be bounded. It also implies that $\lim_{t\to\infty} e(t) = 0$ if $\lim_{t\to\infty} \tilde w(t) = 0$, which corresponds to the asymptotic tracking with disturbance rejection arising in cooperative output regulation problems [38].

The following proposition analyzes the LFS of the closed-loop system with respect to $\tilde w$.

Proposition 1. Under Assumptions 1-3, for $i = 1, 2, \cdots, N$, the multi-agent system (1)-(2) in closed-loop with (3)-(4) is leader-to-formation stable.

Proof. Under Assumptions 1-3, for $i = 1, 2, \cdots, N$, by [32], there exist unique matrices $X_i$ and $U_i$ solving the following regulator equations

$$X_i E = A_i X_i + B_i U_i + D_i, \qquad (6)$$
$$0 = C_i X_i + F_i. \qquad (7)$$

By [15, Lemma 1.27], the matrix equation (7) combined with

$$\bar X_i E = (A_i - B_i K_{xi})\bar X_i - B_i K_{zi} Z_i + D_i, \qquad (8)$$
$$Z_i E = G_1 Z_i + G_2(C_i\bar X_i + F_i) \qquad (9)$$

has a unique solution $\bar X_i$ and $Z_i$. Then, one can easily check that $\bar X_i = X_i$ and $U_i = -K_{xi}X_i - K_{zi}Z_i$. Let $\bar v = 1_N\otimes(v - v^*)$, $\bar x_i = x_i - X_i v^*$, $\bar z_i = z_i - Z_i v^*$, $\xi_i = [\bar x_i^T, \bar z_i^T]^T\in\mathbb{R}^{m_i}$, $B_{ci} = \big[D_i^T\;\; F_i^T G_2^T\big]^T$ and $\bar C_i = [C_i\;\; 0]\in\mathbb{R}^{1\times m_i}$. Also, let $\xi = [\xi_1^T, \xi_2^T, \cdots, \xi_N^T]^T$, $A_c = \mathrm{blockdiag}(A_{c1}, A_{c2}, \cdots, A_{cN})$, $B_c = \mathrm{blockdiag}(B_{c1}, B_{c2}, \cdots, B_{cN})$, $\bar C = \mathrm{blockdiag}(\bar C_1, \bar C_2, \cdots, \bar C_N)$, and $\bar F = \mathrm{blockdiag}(F_1, F_2, \cdots, F_N)$. Then, we have

$$\begin{bmatrix}\dot{\bar v}\\ \dot\xi\end{bmatrix} = \begin{bmatrix} I_N\otimes(E - HK_0) & 0\\ B_c & A_c\end{bmatrix}\begin{bmatrix}\bar v\\ \xi\end{bmatrix} + \begin{bmatrix} I_N\otimes H\\ 0\end{bmatrix}(1_N\otimes\tilde w),$$
$$e = \bar F\bar v + \bar C\xi. \qquad (10)$$

With $\eta = [\bar v^T, \xi^T]^T$, it is easy to check that the LFS of the original multi-agent system is achieved, since $(E - HK_0)$ and $A_c$ are Hurwitz matrices. The proof is thus completed.

In order to improve the transient performance of each subsystem, we develop a robust optimal controller such that the closed-loop system is leader-to-formation stable with respect to the leader disturbance $\tilde w$. Moreover, when $v \equiv v^*$, the developed controller is optimal in the sense that it minimizes the cost

$$J = \int_0^\infty \big(|\xi|_Q + |\bar u|_R\big)\,dt \qquad (11)$$

for the open-loop system

$$\dot\xi = \mathcal{A}\xi + \mathcal{B}\bar u \qquad (12)$$

where, for $i = 1, 2, \cdots, N$, $\bar u_i = u_i - U_i v^*$, $Q_i = Q_i^T > 0$, $R_i = R_i^T > 0$, $\bar u = [\bar u_1, \bar u_2, \cdots, \bar u_N]^T$, $\mathcal{A} = \mathrm{blockdiag}(\mathcal{A}_1, \mathcal{A}_2, \cdots, \mathcal{A}_N)$, $\mathcal{B} = \mathrm{blockdiag}(\mathcal{B}_1, \mathcal{B}_2, \cdots, \mathcal{B}_N)$, $Q = \mathrm{blockdiag}(Q_1, Q_2, \cdots, Q_N)$ and $R = \mathrm{blockdiag}(R_1, R_2, \cdots, R_N)$. Based upon optimal control theory, the locally optimal control policy is (4) with

$$u_i^* = \bar u_i^* + U_i v^* = -K_{xi}^*\bar x_i - K_{zi}^*\bar z_i + U_i v^* = -K_{xi}^* x_i - K_{zi}^* z_i, \quad i = 1, 2, \cdots, N. \qquad (13)$$

The optimal control gains are

$$\begin{bmatrix} K_{xi}^* & K_{zi}^*\end{bmatrix} = R_i^{-1}\mathcal{B}_i^T P_i^* := K_i^* \qquad (14)$$

where $P_i^*$ is the unique solution to the following Riccati equation

$$\mathcal{A}_i^T P_i^* + P_i^*\mathcal{A}_i + Q_i - P_i^*\mathcal{B}_i R_i^{-1}\mathcal{B}_i^T P_i^* = 0. \qquad (15)$$

A model-based algorithm, Algorithm 1, is given to seek the decentralized optimal controller. Note that, instead of solving (15), which is nonlinear in $P_i^*$, we employ the policy iteration technique [24] to approximate $P_i^*$ by solving linear Lyapunov equations iteratively.

Algorithm 1 Model-based Decentralized Optimal Controller Design

1: Find a pair $(G_1, G_2)$ such that it incorporates an internal model of $E$. $i \leftarrow 1$
2: repeat
3:   Find $K_i^{(0)}$ such that $\mathcal{A}_i - \mathcal{B}_i K_i^{(0)}$ is a Hurwitz matrix. $k \leftarrow 0$. Select a sufficiently small constant $\varepsilon > 0$.
4:   repeat
5:     Solve $P_i^{(k)}$ and $K_i^{(k+1)}$ from
$$0 = \big(\mathcal{A}_i - \mathcal{B}_i K_i^{(k)}\big)^T P_i^{(k)} + P_i^{(k)}\big(\mathcal{A}_i - \mathcal{B}_i K_i^{(k)}\big) + Q_i + \big(K_i^{(k)}\big)^T R_i K_i^{(k)} \qquad (16)$$
$$K_i^{(k+1)} = R_i^{-1}\mathcal{B}_i^T P_i^{(k)} \qquad (17)$$
6:     $k \leftarrow k + 1$
7:   until $\big|P_i^{(k)} - P_i^{(k-1)}\big| < \varepsilon$
8:   $i \leftarrow i + 1$
9: until $i = N + 1$
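Steps 4-7 of Algorithm 1 are the classical Kleinman policy iteration: each pass solves the linear Lyapunov equation (16) and updates the gain by (17), instead of attacking the nonlinear Riccati equation (15) directly. A minimal single-follower sketch in NumPy (the Lyapunov equation is solved by Kronecker vectorization; the double-integrator test system is ours, chosen for illustration):

```python
import numpy as np

def lyap(As, M):
    """Solve As^T P + P As + M = 0 for P via Kronecker vectorization."""
    n = As.shape[0]
    L = np.kron(np.eye(n), As.T) + np.kron(As.T, np.eye(n))
    p = np.linalg.solve(L, -M.flatten(order='F'))
    return p.reshape((n, n), order='F')

def policy_iteration(A, B, Q, R, K0, eps=1e-9, kmax=50):
    """Algorithm 1, steps 4-7, for one follower with known (A, B)."""
    K, P_prev = K0, None
    for _ in range(kmax):
        Ak = A - B @ K
        # (16): 0 = Ak^T P + P Ak + Q + K^T R K
        P = lyap(Ak, Q + K.T @ R @ K)
        # (17): K_{k+1} = R^{-1} B^T P_k
        K = np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.abs(P - P_prev).max() < eps:
            break
        P_prev = P
    return P, K

# Double integrator with Q = I, R = 1: the Riccati solution is known,
# P* = [[sqrt(3), 1], [1, sqrt(3)]], K* = [1, sqrt(3)]
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
P, K = policy_iteration(A, B, np.eye(2), np.eye(1), K0=np.array([[1., 1.]]))
```

Starting from any stabilizing $K^{(0)}$, the iterates decrease monotonically and converge quadratically to the Riccati solution, which is what Theorem 1 inherits in the data-driven setting.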

III. MAIN RESULTS

In this section, we design a data-driven distributed controller via ADP to achieve LFS under switching network topology. The developed approach is able to approximate the control gains $K_i^*$ of each follower without relying on knowledge of the system matrices $A_i$, $B_i$ and $D_i$. To begin with, the internal model (4) is modified as

$$\dot z_i = G_1 z_i + G_2\hat e_i, \quad i = 1, 2, \cdots, N \qquad (18)$$

where $\hat e_i = y_i + F_i\zeta_i$ is computed from the estimate $\zeta_i\in\mathbb{R}^q$ of the exostate, generated by the distributed observer

$$\dot\zeta_i = E\zeta_i + \sum_{j\in\mathcal{N}_i(t)} a_{ij}(t)(\zeta_j - \zeta_i), \quad i = 1, 2, \cdots, N \qquad (19)$$

with $\zeta_0 = v$. Then, we rewrite the $i$th subsystem augmented with the internal model (18):

$$\dot\xi_i = \mathcal{A}_i\xi_i + \mathcal{B}_i u_i + \mathcal{D}_i\psi_i = \mathcal{A}_i^{(k)}\xi_i + \mathcal{B}_i\big(K_i^{(k)}\xi_i + u_i\big) + \mathcal{D}_i\psi_i$$

where, for $i = 1, 2, \cdots, N$, $\mathcal{A}_i^{(k)} = \mathcal{A}_i - \mathcal{B}_i K_i^{(k)}$, $\mathcal{D}_i = \mathrm{blockdiag}(D_i, G_2 F_i)$, $\xi_i = \big[x_i^T\;\; z_i^T\big]^T\in\mathbb{R}^{m_i}$, and $\psi_i = \big[v^T\;\; \zeta_i^T\big]^T\in\mathbb{R}^{2q}$. By equation (16), we have

$$|\xi_i(t+\delta t)|_{P_i^{(k)}} - |\xi_i(t)|_{P_i^{(k)}} = \int_t^{t+\delta t}\Big[|\xi_i|_{(\mathcal{A}_i^{(k)})^T P_i^{(k)} + P_i^{(k)}\mathcal{A}_i^{(k)}} + 2\psi_i^T\mathcal{D}_i^T P_i^{(k)}\xi_i + 2\big(u_i + K_i^{(k)}\xi_i\big)^T\mathcal{B}_i^T P_i^{(k)}\xi_i\Big]\,d\tau$$
$$= \int_t^{t+\delta t}\Big[-|\xi_i|_{Q_i + (K_i^{(k)})^T R_i K_i^{(k)}} + 2\psi_i^T\mathcal{D}_i^T P_i^{(k)}\xi_i + 2\big(u_i + K_i^{(k)}\xi_i\big)^T R_i K_i^{(k+1)}\xi_i\Big]\,d\tau. \qquad (20)$$

By the Kronecker product representation, we obtain

$$|\xi_i|_{Q_i + (K_i^{(k)})^T R_i K_i^{(k)}} = (\xi_i^T\otimes\xi_i^T)\,\mathrm{vec}\big(Q_i + (K_i^{(k)})^T R_i K_i^{(k)}\big),$$
$$\psi_i^T\mathcal{D}_i^T P_i^{(k)}\xi_i = (\xi_i^T\otimes\psi_i^T)\,\mathrm{vec}\big(\mathcal{D}_i^T P_i^{(k)}\big),$$
$$\big(u_i + K_i^{(k)}\xi_i\big)^T R_i K_i^{(k+1)}\xi_i = \big[(\xi_i^T\otimes\xi_i^T)\big(I\otimes(K_i^{(k)})^T R_i\big) + (\xi_i^T\otimes u_i^T)(I\otimes R_i)\big]\,\mathrm{vec}\big(K_i^{(k+1)}\big).$$

Moreover, for any two vector-valued signals $a$, $b$ and a sufficiently large integer $s > 0$, define

$$\delta_a = \big[\mathrm{vecv}(a(t_1)) - \mathrm{vecv}(a(t_0)),\ \cdots,\ \mathrm{vecv}(a(t_s)) - \mathrm{vecv}(a(t_{s-1}))\big]^T,$$
$$\Gamma_{a,b} = \Big[\int_{t_0}^{t_1} a\otimes b\,d\tau,\ \int_{t_1}^{t_2} a\otimes b\,d\tau,\ \cdots,\ \int_{t_{s-1}}^{t_s} a\otimes b\,d\tau\Big]^T.$$

Equation (20) implies the following linear equation

$$\Psi_i^{(k)}\begin{bmatrix}\mathrm{vecs}\big(P_i^{(k)}\big)\\ \mathrm{vec}\big(K_i^{(k+1)}\big)\\ \mathrm{vec}\big(\mathcal{D}_i^T P_i^{(k)}\big)\end{bmatrix} = \Phi_i^{(k)} \qquad (21)$$

where

$$\Psi_i^{(k)} = \big[\delta_{\xi_i},\ -2\Gamma_{\xi_i\xi_i}\big(I\otimes(K_i^{(k)})^T R_i\big) - 2\Gamma_{\xi_i u_i}(I\otimes R_i),\ -2\Gamma_{\xi_i\psi_i}\big],$$
$$\Phi_i^{(k)} = -\Gamma_{\xi_i\xi_i}\,\mathrm{vec}\big(Q_i + (K_i^{(k)})^T R_i K_i^{(k)}\big).$$

The uniqueness of the solution to (21) is guaranteed under a rank condition, as shown below. For want of space, we omit the proof of Lemma 1, which follows the same lines as the proofs in [10], [19].

Lemma 1. For all $k\in\mathbb{Z}_+$, if there exists an $s^*\in\mathbb{Z}_+$ such that, for all $s > s^*$,

$$\mathrm{rank}\big(\big[\Gamma_{\xi_i\xi_i},\ \Gamma_{\xi_i u_i},\ \Gamma_{\xi_i\psi_i}\big]\big) = \frac{(m_i + 4q + 3)m_i}{2}, \qquad (22)$$

then the matrix $\Psi_i^{(k)}$ has full column rank for all $k\in\mathbb{Z}_+$.

Now, we are ready to present a data-driven ADP algorithm, Algorithm 2, which yields approximate solutions to the unknown optimal values $K_i^*$ and $P_i^*$.

Algorithm 2 Data-driven ADP Algorithm for Distributed Optimal Controller Design

1: Find a pair $(G_1, G_2)$ such that it incorporates an internal model of $E$.
2: Select a small $\varepsilon > 0$. Apply $u_i = -K_i^{(0)}\xi_i + \nu_i$ on $[t_0, t_s]$, with $\nu_i$ an exploration noise, such that (22) holds for $i = 1, 2, \cdots, N$.
3: $i \leftarrow 1$
4: repeat
5:   $k \leftarrow -1$
6:   repeat
7:     $k \leftarrow k + 1$
8:     Solve $P_i^{(k)}$ and $K_i^{(k+1)}$ from (21)
9:   until $|P_i^{(k)} - P_i^{(k-1)}| < \varepsilon$ for $k \ge 1$
10:  $P_i^\dagger \leftarrow P_i^{(k)}$
11:  The learned controller is (18), (19), and
$$u_i = -K_i^{(k+1)}\xi_i := -K_i^\dagger\xi_i \qquad (23)$$
12:  $i \leftarrow i + 1$
13: until $i = N + 1$
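The core of Algorithm 2 is an off-policy least-squares regression: trajectory data are collected once under a fixed exploratory input, and the same data matrices are reused to solve the (21)-type equation at every iteration. The following single-follower sketch is ours and simplifies the setting: it drops the exosystem term $\psi_i$, so the unknowns reduce to $\mathrm{vecs}(P)$ and $\mathrm{vec}(K)$ (the $-2\Gamma_{\xi\psi}$ block of (21) is handled analogously), and uses a double integrator of our choosing with a known stabilizing initial gain and scalar input:

```python
import numpy as np

A = np.array([[0., 1.], [0., 0.]])   # unknown to the learner
B = np.array([[0.], [1.]])           # unknown to the learner
Q, R = np.eye(2), 1.0
K0 = np.array([[1., 1.]])            # known stabilizing initial gain

def vv(z):  # vecv for m = 2, ordered to match vecs = [p11, 2p12, p22]
    return np.array([z[0]**2, z[0]*z[1], z[1]**2])

# --- collect data once: x' = Ax + B(-K0 x + exploration noise) ---
dt, n_int, pts = 5e-4, 100, 200      # 100 intervals of 0.1 s
x, t = np.array([1., -1.]), 0.0
dV, Gxx, Gxu = [], [], []            # per-interval data matrices
for _ in range(n_int):
    x0, gxx, gxu = x.copy(), np.zeros((2, 2)), np.zeros(2)
    for _ in range(pts):
        u = (-K0 @ x).item() + 0.8 * sum(
            np.sin(w * t) for w in (1., 3.7, 7.2, 11.3, 13.9))
        gxx += np.outer(x, x) * dt   # int x x^T dtau
        gxu += x * u * dt            # int x u dtau
        x = x + dt * (A @ x + B.flatten() * u)
        t += dt
    dV.append(vv(x) - vv(x0))        # increment of the quadratic form
    Gxx.append(gxx); Gxu.append(gxu)

# --- off-policy policy iteration: solve the (21)-style regression ---
K = K0.copy()
for _ in range(8):
    rows, rhs = [], []
    for d, gxx, gxu in zip(dV, Gxx, Gxu):
        a_K = -2. * R * (gxu + gxx @ K.flatten())  # coeff. of vec(K_{k+1})
        rows.append(np.concatenate([d, a_K]))
        M = Q + R * np.outer(K, K)
        rhs.append(-np.sum(M * gxx))               # -int |x|^2_{Q+K^T R K}
    th = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    P = np.array([[th[0], th[1] / 2.], [th[1] / 2., th[2]]])
    K = th[3:5].reshape(1, 2)
```

The regression holds for the *actual* applied input, which is why the data need only be collected once under the behavior policy; each iteration merely re-solves a least-squares problem, exactly as in step 8 of Algorithm 2.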

The convergence of Algorithm 2 is shown in Theorem 1, while the LFS of the closed-loop system is analyzed in Theorem 2.

Theorem 1. If (22) is satisfied, then, for $i = 1, 2, \cdots, N$, the sequences $\{P_i^{(k)}\}_{k=0}^\infty$ and $\{K_i^{(k)}\}_{k=1}^\infty$ computed by Algorithm 2 converge to $P_i^*$ and $K_i^*$, respectively.

Proof. For all $1\le i\le N$, let $P_i^{(k)} = \big(P_i^{(k)}\big)^T > 0$ be the solution to (16); then $K_i^{(k+1)}$ is uniquely determined by (17), with $T_i^{(k)} = \mathcal{D}_i^T P_i^{(k)}$. On the other hand, letting $P$, $K$, and $T$ solve (21), condition (22) ensures that $P_i^{(k)} = P$, $K_i^{(k+1)} = K$, and $T_i^{(k)} = T$ are uniquely determined. By [24], we have $\lim_{k\to\infty} K_i^{(k)} = K_i^*$ and $\lim_{k\to\infty} P_i^{(k)} = P_i^*$. The convergence of the sequences $\{P_i^{(k)}\}_{k=0}^\infty$ and $\{K_i^{(k)}\}_{k=1}^\infty$ obtained by the non-model-based Algorithm 2 is thus ensured.

Theorem 2. Under Assumptions 1-4, the multi-agent system (1)-(2) in closed-loop with the learned controller (18), (19) and (23) is leader-to-formation stable.


Proof. Write the closed-loop system in the compact form

$$\begin{bmatrix}\dot\xi\\ \dot\zeta\end{bmatrix} = \begin{bmatrix} A_c^\dagger & B_c^\dagger\\ 0 & (I_N\otimes E) - (H_{\rho(t)}\otimes I_q)\end{bmatrix}\begin{bmatrix}\xi\\ \zeta\end{bmatrix} + \begin{bmatrix} D\\ H_{\rho(t)}\otimes I_q\end{bmatrix}(1_N\otimes v) := \mathcal{A}_{c,\rho(t)}\begin{bmatrix}\xi\\ \zeta\end{bmatrix} + \mathcal{B}_{c,\rho(t)}(1_N\otimes v),$$
$$e = \bar C\xi + \bar F(1_N\otimes v) \qquad (24)$$

where $A_c^\dagger = \mathrm{blockdiag}\big(\mathcal{A}_1 - \mathcal{B}_1 K_1^\dagger, \cdots, \mathcal{A}_N - \mathcal{B}_N K_N^\dagger\big)$, $\zeta = \big[\zeta_1^T, \zeta_2^T, \cdots, \zeta_N^T\big]^T$, $D = \mathrm{blockdiag}\big(\big[D_1^T\;\; 0_{q\times(m_1-n_1)}\big]^T, \cdots, \big[D_N^T\;\; 0_{q\times(m_N-n_N)}\big]^T\big)$, and $B_c^\dagger = \mathrm{blockdiag}\big(\big[0_{q\times n_1}\;\; F_1^T G_2^T\big]^T, \cdots, \big[0_{q\times n_N}\;\; F_N^T G_2^T\big]^T\big)$. $H_{\rho(t)} = [h_{ij}]\in\mathbb{R}^{N\times N}$ is a submatrix of the Laplacian of the digraph $\mathcal{G}_{\rho(t)}$, with $h_{ii}(t) = \sum_{j=0}^N a_{ij}(t)$ and $h_{ij}(t) = -a_{ij}(t)$ if $j\ne i$.

Let $X = \mathrm{blockdiag}(X_1, Z_1, X_2, Z_2, \cdots, X_N, Z_N)$, where, for $i = 1, 2, \cdots, N$, the matrices $X_i$ and $Z_i$ solve the regulator equations (6)-(7) and equations (8)-(9). Defining $X_c = [X^T\;\; I_{qN}]^T$, we have

$$X_c(I_N\otimes E) = \mathcal{A}_{c,\rho(t)} X_c + \mathcal{B}_{c,\rho(t)}, \qquad 0 = \bar C X + \bar F. \qquad (25)$$

Let $\bar\xi = \xi - X(1_N\otimes v^*)$ and $\bar\zeta = \zeta - (1_N\otimes v^*)$. By equations (24) and (25), we obtain

$$\dot{\bar v} = [I_N\otimes(E - HK_0)]\bar v + (I_N\otimes H)(1_N\otimes\tilde w), \qquad (26)$$
$$\begin{bmatrix}\dot{\bar\xi}\\ \dot{\bar\zeta}\end{bmatrix} = \mathcal{A}_{c,\rho(t)}\begin{bmatrix}\bar\xi\\ \bar\zeta\end{bmatrix} + \mathcal{B}_{c,\rho(t)}\bar v, \qquad (27)$$
$$e = \bar F\bar v + \bar C\bar\xi. \qquad (28)$$

Define $\phi = [\bar\xi^T, \bar\zeta^T]^T$. By [39, Lemma 2], the origin of the following linear switched system

$$\dot\phi = \mathcal{A}_{c,\rho(t)}\phi \qquad (29)$$

is exponentially stable. Hence, the state transition matrix $\Phi(\tau, t)$ of (29) satisfies $|\Phi(\tau, t)| \le k e^{-\lambda(\tau - t)}$, $\forall\tau\ge t\ge 0$. Let $P^\dagger(t) = \int_t^\infty \Phi^T(\tau, t)\Phi(\tau, t)\,d\tau$. One can check that there exist $c_1, c_2 > 0$ such that $c_1|\phi|^2 \le \phi^T P^\dagger(t)\phi \le c_2|\phi|^2$, which implies that $P^\dagger(t)$ is positive definite and upper bounded, $\sup_{t\ge 0}|P^\dagger(t)| < c_3$ for some $c_3 > 0$. The definition of $P^\dagger(t)$ shows that it is symmetric and continuously differentiable. By the fact that $\partial\Phi(\tau, t)/\partial t = -\Phi(\tau, t)\mathcal{A}_{c,\rho(t)}$, we have

$$\dot P^\dagger(t) = -P^\dagger(t)\mathcal{A}_{c,\rho(t)} - \mathcal{A}_{c,\rho(t)}^T P^\dagger(t) - I. \qquad (30)$$

On the other hand, since $\sigma(E - HK_0)\subset\mathbb{C}^-$, there exists a positive definite matrix $P_v = P_v^T$ such that

$$[I_N\otimes(E - HK_0)]^T P_v + P_v[I_N\otimes(E - HK_0)] + I_{qN} = 0. \qquad (31)$$

Choose the Lyapunov function $V(t) = \phi^T P^\dagger(t)\phi + (2c_3^2c_4^2 + 1)\bar v^T P_v\bar v$, where $c_4 = \sup_{t\ge 0}|\mathcal{B}_{c,\rho(t)}|$ and $c_5 = |P_v(I_N\otimes H)|$. The derivative of $V$ along the solutions of system (26)-(27) is

$$\dot V = -|\phi|^2 + 2\phi^T P^\dagger\mathcal{B}_{c,\rho(t)}\bar v - (2c_3^2c_4^2 + 1)|\bar v|^2 + 2(2c_3^2c_4^2 + 1)\bar v^T P_v(I_N\otimes H)(1_N\otimes\tilde w)$$
$$\le -|\phi|^2 + 2c_3c_4|\phi||\bar v| - (2c_3^2c_4^2 + 1)|\bar v|^2 + 2N(2c_3^2c_4^2 + 1)c_5|\bar v||\tilde w|$$
$$\le -\frac{|\phi|^2}{2} - \frac{|\bar v|^2}{2} + c_6|\tilde w|^2$$

with $c_6 = 2N^2(2c_3^2c_4^2 + 1)^2c_5^2$.

The previous inequality implies that the system (26)-(27), with $\tilde w$ as the input, is input-to-state stable (ISS) [36]. Given $\eta = [\bar v^T, \phi^T]^T$, there exist a function $\beta_1$ of class $\mathcal{KL}$ and a function $\gamma_1$ of class $\mathcal{K}$ such that

$$|\eta(t)| \le \beta_1(|\eta(0)|, t) + \gamma_1(\|\tilde w\|). \qquad (32)$$

From (28), one can immediately check that the LFS condition (5) is satisfied. The proof is thus completed.

The following result compares the cost $J^\diamond$ of the decentralized controller (4), (13) with the cost $J^\dagger$ associated with the distributed controller (18), (19) and (23).

Theorem 3. There always exist constants $d_1, d_2 > 0$ such that $J^\dagger \le d_1 J^\diamond + d_2|\bar\zeta(0)|^2$ if $v \equiv v^*$.

Proof. Denoting $K^* = \mathrm{blockdiag}(K_1^*, K_2^*, \cdots, K_N^*)$ and $P^* = \mathrm{blockdiag}(P_1^*, P_2^*, \cdots, P_N^*)$, one can rewrite the system (2) with (4) and (13) as

$$\dot\xi = (\mathcal{A} - \mathcal{B}K^*)\xi. \qquad (33)$$

The corresponding cost is

$$J^\diamond = \xi(0)^T P^*\xi(0). \qquad (34)$$

When $v \equiv v^*$, the system (2) with (18), (19) and (23) can be written as (29). Along the trajectory of (29), from (30), we have

$$\int_0^\infty |\phi|^2\,d\tau \le \phi^T(0)P^\dagger(0)\phi(0) \le c_2|\phi(0)|^2. \qquad (35)$$

By the previous inequality, the cost $J^\dagger$ is upper bounded by

$$J^\dagger \le \lambda_M\big(Q + (K^\dagger)^T R K^\dagger\big)\int_0^\infty |\xi|^2\,d\tau \le \lambda_M\big(Q + (K^\dagger)^T R K^\dagger\big)\int_0^\infty |\phi|^2\,d\tau \le c_2\lambda_M\big(Q + (K^\dagger)^T R K^\dagger\big)|\phi(0)|^2$$
$$\le c_2\lambda_M\big(Q + (K^\dagger)^T R K^\dagger\big)\big(|\xi(0)|^2 + |\bar\zeta(0)|^2\big) \le \frac{c_2\lambda_M\big(Q + (K^\dagger)^T R K^\dagger\big)}{\lambda_m(P^*)}\,J^\diamond + c_2\lambda_M\big(Q + (K^\dagger)^T R K^\dagger\big)|\bar\zeta(0)|^2 := d_1 J^\diamond + d_2|\bar\zeta(0)|^2 \qquad (36)$$

where $K^\dagger = \mathrm{blockdiag}(K_1^\dagger, K_2^\dagger, \cdots, K_N^\dagger)$. The proof is thus completed.

Remark 4. Note that the proposed Algorithm 2 is a direct adaptive control approach that does not identify the system matrices. In each iteration, one can estimate the controller parameters by solving a linear matrix equation (21). This is different from indirect adaptive optimal control [25], [16, Chap. 7.4.4], in which the plant parameters are estimated online and used to find controller parameters by solving the corresponding Riccati equation, which is nonlinear in $P_i$.

Remark 5. Condition (22) is introduced to ensure the convergence of the controller parameters. As in traditional adaptive control [46] and existing work on ADP [19], [28], we add an exploration noise to the input during the learning phase in order to satisfy (22).

Remark 6. Although some nodes cannot obtain instantaneous information from the leader, Algorithm 2 is implementable since all the followers are reachable from node 0; that is, there always exists a $T > 0$ such that $v(t)$ over the period $[t_0, t_s]$ is receivable by all the other subsystems at $t = t_s + T$.

IV. EXAMPLE

In order to validate the effectiveness of the proposed data-driven Algorithm 2, we consider a system in the form of (1)-(2) with $N = 4$ and, for $i = 1, 2, 3, 4$,

$$A_i = \begin{bmatrix} 0 & 1\\ 0 & 0\end{bmatrix},\quad B_i = \begin{bmatrix} 0\\ 1\end{bmatrix},\quad D_i = \begin{bmatrix} 0 & 0\\ 0 & 0.5i\end{bmatrix},\quad C_i = \begin{bmatrix} 1\\ 0\end{bmatrix}^T,\quad F_i = \begin{bmatrix} -i-1\\ 0\end{bmatrix}^T,\quad E = \begin{bmatrix} 0 & 1\\ -1 & 0\end{bmatrix},\quad H = \begin{bmatrix} 1\\ 1\end{bmatrix}.$$
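The standing assumptions can be verified numerically for these matrices; a short sketch checking Assumption 1 (eigenvalues of $E$), Assumption 2 (via controllability of $(A_i, B_i)$, which implies stabilizability), and the rank condition of Assumption 3 at $\lambda = \pm j$:

```python
import numpy as np

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
C = np.array([[1., 0.]])
E = np.array([[0., 1.], [-1., 0.]])

# Assumption 1: eigenvalues of E are +j and -j (simple, zero real part)
eigE = np.linalg.eigvals(E)
print(np.sort_complex(eigE))           # [-1j, 1j], up to rounding

# Assumption 2: (A, B) controllable (hence stabilizable)
ctrb = np.hstack([B, A @ B])
print(np.linalg.matrix_rank(ctrb))     # 2

# Assumption 3: rank [A - lambda I, B; C, 0] = n_i + 1 = 3, lambda in sigma(E)
for lam in eigE:
    M = np.block([[A - lam * np.eye(2), B], [C, np.zeros((1, 1))]])
    print(np.linalg.matrix_rank(M))    # 3 for both eigenvalues
```

Since the same $(A_i, B_i, C_i)$ are shared by all four followers, a single check covers $i = 1, 2, 3, 4$; the $D_i$ and $F_i$ do not enter Assumptions 1-3.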

Suppose that the switching network topology $\mathcal{G}_{\rho(t)}$ is governed by the following switching signal

$$\rho(t) = \begin{cases} 1, & \text{if } 3sT_s \le t < (3s+1)T_s\\ 2, & \text{if } (3s+1)T_s \le t < (3s+2)T_s\\ 3, & \text{if } (3s+2)T_s \le t < (3s+3)T_s\end{cases}$$

where $s = 0, 1, 2, \cdots$ and $T_s = 0.2\,$s; the corresponding communication graphs are depicted in Fig. 1. It is easy to check that Assumptions 1-4 are satisfied. Suppose the system matrices $A_i$, $B_i$, and $D_i$ are unknown. Let the internal model for the exosystem dynamics $v$ be

$$G_1 = \begin{bmatrix} 0 & 1\\ -1 & 0\end{bmatrix}, \qquad G_2 = \begin{bmatrix} 0\\ 1\end{bmatrix}.$$

For the purpose of simulation, we choose all the weight matrices $Q_i$ and $R_i$ in the cost (11) to be identity matrices. The external disturbance of the leader is set to $\tilde w = \sin(3t)$. The exploration noise is taken as a sum of sinusoidal signals with disparate frequencies. We collect data from $t = 0\,$s to $t = 10\,$s, and then (21) is solved repeatedly until the convergence criterion is satisfied. The comparison of the $P_i^{(k)}$ of the $i$th follower at the $k$th iteration with its optimal value is shown in Fig. 2. We apply the updated control policy after $t = 10\,$s. The outputs of the leader and all followers are depicted in Fig. 3, together with their reference signals $y_i^* = -F_i v^*$. The plots of the distributed control inputs are shown in Fig. 4.

V. SUMMARY AND FUTURE WORK

This note has studied the cooperative adaptive optimal control problem by means of a combined use of the internal model principle and adaptive dynamic programming theory.

Fig. 1. Network topology

Fig. 2. The comparison of $P_i^{(k)}$ and their optimal values (each panel plots $\|P_i^{(k)} - P_i^*\|$ against the number of iterations, $i = 1, 2, 3, 4$)

Fig. 3. Plots of the system outputs $y_i$ and the reference signals $y_i^*$, $i = 0, 1, \cdots, 4$ (the control policy is updated at $t = 10\,$s)

Fig. 4. Plots of the distributed control inputs $u_i$, $i = 1, 2, 3, 4$

The communication network is assumed to be jointly connected; that is, it is allowed to be disconnected at any time instant. Instead of relying on accurate knowledge of the system dynamics, an internal-model-based control policy is learned from input-state data. The learned control policy achieves leader-to-formation stability and is robust to unmeasurable leader disturbance. Future work includes the generalization to nonlinear multi-agent systems and the combination of ADP and adaptive internal models.

REFERENCES

[1] M. I. Abouheaf, F. L. Lewis, K. G. Vamvoudakis, S. Haesaert, and R. Babuska, "Multi-agent discrete-time graphical games and reinforcement learning solutions," Automatica, vol. 50, no. 12, pp. 3038-3053, 2014.

[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.

[3] L. Consolini, F. Morbidi, D. Prattichizzo, and M. Tosques, "Leader-follower formation control of nonholonomic mobile robots with input constraints," Automatica, vol. 44, no. 5, pp. 1343-1349, 2008.

[4] D. V. Dimarogonas, P. Tsiotras, and K. J. Kyriakopoulos, "Leader-follower cooperative attitude control of multiple rigid bodies," Systems & Control Letters, vol. 58, no. 6, pp. 429-435, 2009.

[5] Z. Ding, "Consensus output regulation of a class of heterogeneous nonlinear systems," IEEE Transactions on Automatic Control, vol. 58, no. 10, pp. 2648-2653, 2013.

[6] Q. Y. Fan and G. H. Yang, "Adaptive actor-critic design-based integral sliding-mode control for partially unknown nonlinear systems with input disturbances," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 1, pp. 165-177, 2016.

[7] B. Francis, "The linear multivariable regulator problem," SIAM Journal on Control and Optimization, vol. 15, no. 3, pp. 486-505, 1977.

[8] W. Gao, M. Huang, Z. P. Jiang, and T. Chai, "Sampled-data-based adaptive optimal output-feedback control of a 2-degree-of-freedom helicopter," IET Control Theory and Applications, vol. 10, no. 12, pp. 1440-1447, 2016.

[9] W. Gao, Y. Jiang, Z. P. Jiang, and T. Chai, "Output-feedback adaptive optimal control of interconnected systems based on robust adaptive dynamic programming," Automatica, vol. 72, pp. 37-45, 2016.

[10] W. Gao and Z. P. Jiang, "Adaptive dynamic programming and adaptive optimal output regulation of linear systems," IEEE Transactions on Automatic Control, vol. 61, no. 12, pp. 4164-4169, 2016.

[11] ——, "Learning-based adaptive optimal tracking control of strict-feedback nonlinear systems," IEEE Transactions on Neural Networks and Learning Systems, in press, 2017.

[12] ——, "Nonlinear and adaptive suboptimal control of connected vehicles: A global adaptive dynamic programming approach," Journal of Intelligent & Robotic Systems, vol. 85, no. 3, pp. 597-611, 2017.

[13] J. Hu and W. X. Zheng, "Adaptive tracking control of leader-follower systems with unknown dynamics and partial measurements," Automatica, vol. 50, no. 5, pp. 1416-1423, 2014.

[14] C. Huang and X. Ye, "Cooperative output regulation of heterogeneous multi-agent systems: An H∞ criterion," IEEE Transactions on Automatic Control, vol. 59, no. 1, pp. 267-273, 2014.

[15] J. Huang, Nonlinear Output Regulation: Theory and Applications. Philadelphia, PA: SIAM, 2004.

[16] P. A. Ioannou and J. Sun, Robust Adaptive Control. Mineola, NY:Dover Publications, Inc., 2012.

[17] A. Jadbabaie, J. Lin, and A. S. Morse, “Coordination of groups of mobileautonomous agents using nearest neighbor rules,” IEEE Transactions onAutomatic Control, vol. 48, no. 6, pp. 988–1001, 2003.

[18] Y. Jiang, J. Fan, T. Chai, J. Li, and F. L. Lewis, “Data-driven flotationindustrial process operational optimal control based on reinforcementlearning,” IEEE Transactions on Industrial Informatics, in press, 2017.

[19] Y. Jiang and Z. P. Jiang, “Computational adaptive optimal control forcontinuous-time linear systems with completely unknown dynamics,”Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.

[20] ——, “Robust adaptive dynamic programming for large-scale systemswith an application to multimachine power systems,” IEEE Transactionson Circuits and Systems II: Express Briefs, vol. 59, no. 10, pp. 693–697,2012.

[21] ——, Robust Adaptive Dynamic Programming. Hoboken. NJ: Wiley-IEEE Press, 2017.

[22] Z. P. Jiang and Y. Jiang, “Robust adaptive dynamic programmingfor linear and nonlinear systems: An overview,” European Journal ofControl, vol. 19, no. 5, pp. 417–425, 2013.

[23] Z. P. Jiang, A. R. Teel, and L. Praly, “Small-gain theorem for ISSsystems and applications,” Mathematics of Control, Signals and Systems,vol. 7, no. 2, pp. 95–120, 1994.

[24] D. Kleinman, “On an iterative technique for Riccati equation compu-tations,” IEEE Transactions on Automatic Control, vol. 13, no. 1, pp.114–115, 1968.

[25] P. R. Kumar, “Optimal adaptive contorl of linear-quadratic-guassiansystems,” SIAM Journal on Control and Optimization, vol. 21, no. 2,pp. 163–178, 1983.

[26] F. L. Lewis and K. G. Vamvoudakis, “Reinforcement learning forpartially observable dynamic processes: Adaptive dynamic programmingusing measured output data,” IEEE Transactions on Systems, Man, andCybernetics, Part B: Cybernetics, vol. 41, no. 1, pp. 14–25, 2011.

[27] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptivedynamic programming for feedback control,” IEEE Circuits and SystemsMagazine, vol. 9, no. 3, pp. 32–50, 2009.

[28] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcementlearning and feedback control: Using natural decision methods to designoptimal adaptive controllers,” IEEE Control Systems Magazine, vol. 32,no. 6, pp. 76–105, 2012.

[29] Z. Li, X. Liu, W. Ren, and L. Xie, “Distributed tracking control forlinear multiagent systems with a leader of bounded unknown input,”IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 518–523,2013.

[30] L. Liu, “Adaptive cooperative output regulation for a class of nonlinearmulti-agent systems,” IEEE Transactions on Automatic Control, vol. 60,no. 6, pp. 1677–1682, 2015.

[31] W. Liu and J. Huang, “Cooperative adaptive output regulation forsecond-order nonlinear multiagent systems with jointly connectedswitching networks,” IEEE Transactions on Neural Networks and Learn-ing Systems, in press, 2017.

[32] R. Marino and P. Tomei, “Output regulation for linear systems viaadaptive internal model,” IEEE Transactions on Automatic Control,vol. 48, no. 12, pp. 2199–2202, 2003.

[33] H. Modares and F. L. Lewis, “Linear quadratic tracking control ofpartially-unknown continuous-time systems using reinforcement learn-ing,” IEEE Transactions on Automatic Control, vol. 59, no. 11, pp. 3051–3056, 2014.

[34] C. Mu, Z. Ni, C. Sun, and H. He, “Data-driven tracking control withadaptive dynamic programming for a class of continuous-time nonlinearsystems,” IEEE Transactions on Cybernetics, vol. 47, no. 6, pp. 1460–1470, 2017.

[35] W. Ren and R. Beard, Distributed Consensus in Multi-vehicle Cooper-ative Control, ser. Communications and Control Engineering. London,U.K.: Springer-Verlag London, 2008.

[36] E. D. Sontag, “Smooth stabilization implies coprime factorization,” IEEETransactions on Automatic Control, vol. 34, no. 4, pp. 435–443, 1989.

[37] ——, “Input to state stability: Basic concepts and results,” in Nonlinearand Optimal Control Theory, P. Nistri and G. Stefani, Eds. Berlin:Springer-Verlag, 2007, pp. 163–220.

[38] Y. Su and J. Huang, “Cooperative output regulation of linear multi-agentsystems,” IEEE Transactions on Automatic Control, vol. 57, no. 4, pp.1062–1066, 2012.

[39] ——, “Cooperative output regulation with application to multi-agentconsensus under switching network,” IEEE Transactions on Systems,Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 3, pp. 864–875, 2012.

Page 10: Leader-to-formation stability of multi-agent …1 Leader-to-formation stability of multi-agent systems: An adaptive optimal control approach Weinan Gao, Zhong-Ping Jiang, Fellow, IEEE,

8

[40] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning.Cambridge, MA: MIT Press, 1998.

[41] H. G. Tanner, G. J. Pappas, and V. Kumar, “Leader-to-formation stabil-ity,” IEEE Transactions on Robotics and Automation, vol. 20, no. 3, pp.443–455, 2004.

[42] K. G. Vamvoudakis, D. Vrabie, and F. L. Lewis, “Online adaptivealgorithm for optimal control with integral reinforcement learning,”International Journal of Robust and Nonlinear Control, vol. 24, no. 17,pp. 2686–2710, 2014.

[43] D. Wang, D. Liu, H. Li, B. Luo, and H. Ma, “An approximate optimalcontrol approach for robust stabilization of a class of discrete-timenonlinear systems with uncertainties,” IEEE Transactions on Systems,Man, and Cybernetics: Systems, vol. 46, no. 5, pp. 713–717, 2016.

[44] X. Wang, Y. Hong, J. Huang, and Z.-P. Jiang, “A distributed controlapproach to a robust output regulation problem for multi-agent linearsystems,” IEEE Transactions on Automatic Control, vol. 55, no. 12, pp.2891–2895, 2010.

[45] P. J. Werbos, “Beyond regression: New tools for prediction and analysisin the behavioral sciences,” Ph.D. dissertation, Harvard University, 1974.

[46] J. Yuan and W. Wonham, “Probing signals for model reference identi-fication,” IEEE Transactions on Automatic Control, vol. 22, no. 4, pp.530–538, 1977.