Optimization and control of large-scale networked...

Optimization and controlof large-scale networked systems

A DISSERTATION

SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

OF THE UNIVERSITY OF MINNESOTA

BY

Neil K. Dhingra

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

Doctor of Philosophy

December 2017

Copyright © 2017

Neil K. Dhingra

To Pallavi Kathuria, in loving memory.

i

Acknowledgements

No man is an island and in completing my graduate studies I have relied on the help and guidance

of innumerable friends, relations, and mentors. I will try to give a partial accounting of their

aid and support here.

I would first like to thank my advisor, Professor Mihailo R. Jovanovic for being an amazing

teacher, mentor, and friend. His relentless perfectionism and his insistence on detail has en-

hanced the quality and sophistication of my writing, research, and presentation. I am forever

indebted to him for his guidance and mentorship.

Many other faculty members have contributed greatly to my time here. I learned much

about convex optimization from Professor Tom Luo, both in his excellent classes and one-on-

one meetings. I also benefitted tremendously from interactions with Tryphon Georgiou, Murti

Salapaka, Peter J. Seiler, Andy Lamperski, Makan Fardad, and Anders Rantzer.

I am thankful for NASA and their amazing research scientists. Interacting with Marty

Brenner was always stimulating; working with him broadened my perspectives and gave me

insight into the practical applications of control theory.

I would also like to thank the professors who taught and mentored me at the University of

Michigan and all my classmates and colleagues who inspired me to pursue graduate school. In

particular, I would like to thank Sister Mary Elizabeth Merriam, under whose guidance I began

my journey into academia.

The countless hours spent in our stuffy Keller Hall office were made bearable only by the

incredible company I kept. I would like to thank Xiaofan Wu, Armin Zare, and Yongxin Chen,

my compatriots in this endeavor. Whether watching the Timberwolves or running races, it has

always been a pleasure. The senior statesmen of the lab – Rashad Moarref, Binh K. Lieu and Fu

Lin – all offered invaluable guidance and counsel. I would like to thank “Mr. Fu” in particular

for his help with optimization and sparsity-promoting optimal control. Much of the work in this

thesis builds upon the foundation that he laid. Finally I would like to thank the junior graduate

ii

students, Sepideh Hassan-Moghaddam, Wei Ran, and Dongsheng Ding – I am happy that I can

leave knowing that the future of the lab is in good hands.

Beyond our lab group, interactions with Sei Zhen Khong at the IMA and Marcello Colombino

at ETH were enjoyable socially and brought my research to a higher level. I also enjoyed my

time with the other IMA postdocs, particularly Kaoru Yamamoto and Rohit Gupta.

Of course I must thank my incredible network of friends in Minneapolis and across the coun-

try. Katie Hartl, in particular, has always been a great friend with whom I could commiserate.

Thank you to Kevin Nowland, for his friendship and mathematical advice and despite his dubi-

ous collegiate affiliation, and to Rohan Agarwal, for giving me a second home in Chicago. I am

grateful for all my friends in Minneapolis, including my roommate Aaron as well as everyone

I met through him, Brian and Julia and their wonderful children, Julius and Maria, Luke and

Lindsey, Caitlin and Eric, Jane, Liz, Niko, Alex, Sarah, Stephen, Ben, Emily, Angela, Kyle,

and the rest of the Scudder house crew, Taylor, Jordan, and everyone else at the Michigan

Alumni Association, Megan, Sandy, Elizabeth, Azra, and many, many more. I would also like to

thank Sisyphus, in whom I found a kindred spirit, and Albert Camus for providing an optimistic

interpretation of this somewhat concerning correlation.

Thank you to my family for constantly providing love and support. Thanks especially to my

parents, who have always been there for me, even when my car was stolen or broke down. Thank

you to my sister Nina, for her support as well as for helping me start and accompanying me

on different roadtrips from California to Minnesota. I am also indebted to my extended family,

grandparents, cousins, aunts, and uncles for their support and love over the years. In particular,

I would like to thank my little cousin – Pallavi, your wonderful smile and joyous laugh will stay

with me forever.

Finally and most importantly I would like to thank Amy. Her constant support and love has

given me strength and fortitude throughout my graduate studies. I loved our time in Minneapolis

and our biking, running, and roadtrip adventures to all the breweries, lakes, waterfalls, ice caves,

and stadia in the area. Thank you, Amy, I couldn’t have done it without you.

iii

Abstract

In this thesis, we design structured controllers for linear systems by solving regularized

optimal control problems. We develop tractable methods for solving nonconvex regularized

problems and then identify classes of problems for which regularized optimal control problems

can be placed into a convex form.

We first develop novel methods based on reformulating the regularized optimization prob-

lem with an auxiliary variable. By exploiting the properties of proximal operators, we bring

the associated augmented Lagrangian into a continuously differentiable form by constraining

it to the manifold that corresponds to explicit minimization over the auxiliary variable. The

new expression facilitates a method of multipliers algorithm that offers many advantages rel-

ative to existing methods, including guaranteed convergence for nonconvex problems and the

ability to impose regularization in alternate coordinates. We then apply primal-descent dual-

ascent Arrow-Hurwicz-Uzawa type gradient flow dynamics to solve regularized problems in a

distributed manner. We prove global convergence for convex problems and use the theory of In-

tegral Quadratic Constraints to establish conditions for exponential convergence for continuous-

and discrete-time updates applied to strongly convex probems. Finally, we take advantage of

generalizations of the Jacobian to develop a second-order algorithm which converges globally to

the optimal solution for convex problems. Moreover, we prove local quadratic convergence for

strongly convex problems.

We next study several classes of convex regularized optimal control problems. The problem

of designing symmetric modifications to symmetric linear systems is convex in the underlying

design variable and is thus appealing for the purpose of structured control. We show that even

when the system and controller are not symmetric, their symmetric components can be used

to perform structured design with stability and performance guarantees. We then examine the

problem of designing structured diagonal modifications to positive systems. We prove convexity

of the H2 and H∞ optimal control problems for this class of system and apply our results

to leader selection in directed consensus networks and combination drug therapy design for

HIV. We consider time-varying controllers and show that a constant controller is optimal for

an induced-power performance index. Finally, we develop customized algorithms for large-scale

actuator and sensor selection.

iv

Contents

Acknowledgements ii

Abstract iv

List of Tables x

List of Figures xi

1 Introduction 1

1.1 A historical perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Optimal control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.2 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3 Distributed Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3.1 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3.2 Lack of convexity for structured control . . . . . . . . . . . . . . . . . . . 12

1.4 Regularized problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.5 Design of structured controllers via regularization . . . . . . . . . . . . . . . . . . 17

1.5.1 Structure identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.5.2 Polishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.6 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

I Proximal augmented Lagrangian algorithms 24

2 Method of multipliers 25

2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

v

2.2 Problem formulation and background . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2.1 Composite optimization problem . . . . . . . . . . . . . . . . . . . . . . . 26

2.2.2 Background on proximal operators . . . . . . . . . . . . . . . . . . . . . . 27

2.2.3 Existing algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.3 MM with the proximal augmented Lagrangian . . . . . . . . . . . . . . . . . . . 30

2.3.1 Derivation of the proximal augmented Lagrangian . . . . . . . . . . . . . 30

2.3.2 MM based on the proximal augmented Lagrangian . . . . . . . . . . . . . 31

2.3.3 Minimization of the proximal augmented Lagrangian over x . . . . . . . . 33

2.4 Example: Edge addition in directed consensus networks . . . . . . . . . . . . . . 36

2.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.4.2 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.5 Example: Sparsity-promoting optimal control . . . . . . . . . . . . . . . . . . . . 41

2.5.1 Minimization of the proximal augmented Lagrangian . . . . . . . . . . . . 44

2.5.2 Proximal gradient applied directly to (2.1) . . . . . . . . . . . . . . . . . . 45

2.5.3 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3 First-order primal-dual algorithm 50

3.1 Arrow-Hurwicz-Uzawa gradient flow . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.2 Exponential convergence for strongly convex f . . . . . . . . . . . . . . . . . . . 54

3.2.1 Continuous-time dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2.2 Discrete time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.3 Distributed implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3.3.1 Example: Optimal placement . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 Second-order primal-dual method 66


4.1.1 Generalization of the gradient and Jacobian . . . . . . . . . . . . . . . . . 67

4.1.2 Semismoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.1.3 Existing second order methods . . . . . . . . . . . . . . . . . . . . . . . . 69

4.1.4 Second order updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2 A globally convergent differential inclusion . . . . . . . . . . . . . . . . . . . . . . 71

4.2.1 Asymptotic stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.2.2 Global exponential stability . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.3 A second order primal-dual algorithm . . . . . . . . . . . . . . . . . . . . . . . . 77

vi

4.3.1 Merit function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.3.2 Second order primal-dual algorithm . . . . . . . . . . . . . . . . . . . . . 80

4.4 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.4.1 Example: `1-regularized least squares . . . . . . . . . . . . . . . . . . . . 84

4.4.2 Example: Distributed control of a spatially-invariant system . . . . . . . . 87

4.5 Additional considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.5.1 Algorithm based on V (w) as a merit function . . . . . . . . . . . . . . . . 91

4.5.2 Bounding ∇f(x) for Algorithm 3 . . . . . . . . . . . . . . . . . . . . . . . 95

5 Connections with other methods and discussion 98

5.1 Second order updates as linearized KKT corrections . . . . . . . . . . . . . . . . 98

5.2 Connections with other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.2.1 MM and ADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.2.2 First order methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.2.3 Second order methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

II Structured optimal control 103

6 Symmetry and spatial invariance 104

6.1 Symmetric systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.1.1 Convex formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.1.2 Stability and performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.1.3 Performance bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.1.4 Small asymmetric perturbations . . . . . . . . . . . . . . . . . . . . . . . 110

6.2 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

6.2.1 Example: Directed Consensus Network . . . . . . . . . . . . . . . . . . . . 112

6.2.2 Example: Combination drug therapy design via symmetric systems . . . . 113

6.3 Computational advantages for structured problems . . . . . . . . . . . . . . . . . 118

6.3.1 Spatially-invariant systems . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.3.2 Swift-Hohenberg Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

7 Structured decentralized control of positive systems 125


7.1.1 Background on positive systems . . . . . . . . . . . . . . . . . . . . . . . 126

7.1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

vii

7.2 Convexity of optimal control problems . . . . . . . . . . . . . . . . . . . . . . . . 128

7.2.1 Convexity of f2 and f∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

7.2.2 Differentiability of the H∞ norm . . . . . . . . . . . . . . . . . . . . . . . 131

7.3 Leader selection in directed networks . . . . . . . . . . . . . . . . . . . . . . . . . 135


7.3.2 Stability for directed networks . . . . . . . . . . . . . . . . . . . . . . . . 137

7.3.3 Bounds for Problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7.3.4 Additional comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7.4 Computational experiments for leader selection . . . . . . . . . . . . . . . . . . . 141

7.4.1 Example: Bounds on leader selection for a small network . . . . . . . . . 141

7.4.2 Example: Leaders in the neural network of the worm C. Elegans . . . . . 141

7.4.3 Example: Ranking college football teams . . . . . . . . . . . . . . . . . . 143

7.5 Computational experiments for combination drug therapy . . . . . . . . . . . . . 144

7.5.1 Example: Simple problem with nondifferentiable f∞ . . . . . . . . . . . . 144

7.5.2 Example: Real world drug therapy problem . . . . . . . . . . . . . . . . . 145

7.6 Time-varying controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7.6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148


7.6.3 Solution to the optimal control problem . . . . . . . . . . . . . . . . . . . 151

8 Actuator or sensor selection 155

8.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

8.1.1 Actuator selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

8.1.2 Sensor selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

8.2 Customized algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

8.2.1 Alternating direction method of multipliers . . . . . . . . . . . . . . . . . 161

8.2.2 P -minimization step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

8.2.3 Z-minimization step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

8.2.4 Iterative reweighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

8.3 Example: Mass-spring system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

8.3.1 Algorithm speed and computational complexity . . . . . . . . . . . . . . . 164

8.3.2 Sensor selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

8.3.3 Iterative reweighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

8.4 Proximal gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

viii

8.4.1 Elimination of P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

8.4.2 Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

8.4.3 Example: Damped mass-spring systems . . . . . . . . . . . . . . . . . . . 172

8.5 Example: Flexible wing aircraft . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

9 Conclusions and future directions 176

References 178

Appendix A. Crossword 199

ix

List of Tables

5.1 Summary of different functions embedded in the augmented Lagrangian of (2.5)

and methods for solving (2.1) based on these functions. . . . . . . . . . . . . . . 101

7.1 Leaders selected for different values of γ. . . . . . . . . . . . . . . . . . . . . . . . 145

7.2 Optimal budgeted doses and H2/H∞ performance. . . . . . . . . . . . . . . . . . 147

x

List of Figures

1.1 HIV replication-mutation pattern for a set of 4 mutants with a single drug that

affects 2 mutants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.2 A network of 5 dynamical systems with associated local controllers. . . . . . . . . 11

1.3 Mass-spring system on a line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.4 (a) The optimal centralized position feedback gain matrix Xp in the system with

50 masses. Both Xp and Xv (not shown) have almost constant diagonals (modulo

edges) and exponential off-diagonal decay. (b) Optimal centralized position gains

for the middle mass n = 25. (c) Truncation of the optimal centralized position

gains for the middle mass n = 25. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.5 The change of variables that casts the unstructured state-feedback problem as an

SDP, in general, does not preserve the structural properties of X. . . . . . . . . . 14

1.6 Increased emphasis on sparsity encourages sparser control architectures at the

expense of deteriorating the closed-loop performance. For γ = 0 the optimal

centralized controller Xc is obtained from the positive definite solution of the

algebraic Riccati equation. Control architectures for γ > 0 are determined by

X(γ) := argminX (f(X) + γ g(X)) and they depend on interconnections in the

distributed plant and the state and control performance weights Q and R. . . . . 18

1.7 Cardinality function of a scalar variable x and the corresponding absolute value

and logarithmic approximations on x ∈ [−1, 1]. . . . . . . . . . . . . . . . . . . . 19

1.8 The solution x? of the constrained problem (1.10) is the intersection of the con-

straint set C := x | f(x) ≤ σ and the smallest sub-level set of g that touches

C. The penalty function g is the `1 norm (left); the weighted `1 norm with

appropriate weights (middle); and the nonconvex sum-of-logs function (right). . . 20

2.1 The regularization function g1(v) = |v| for a scalar argument v with associated

proximal operator, Moreau envelope, and Moreau envelope gradient when µ = 1. 28

xi

2.2 A balanced plant graph with 7 nodes and 10 directed edges (solid black lines). A

sparse set of 2 added edges (dashed red lines) is identified by solving (2.13) with

γ = 3.5 and R = I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.3 Tradeoff between performance and sparsity resulting from the solution to (2.13)-

(2.14) for the network shown in Fig. 2.2. Performance loss is measured relative

to the optimal centralized controller (i.e., the setup in which all edges in the

controller network are used). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.4 (a) Total time; (b) number of outer iterations; and (c) average time per outer

iteration required to solve (2.13) with γ = 0.01, 0.1, 0.2 for a cycle graph with

N = 5 to 50 nodes and m = 20 to 2450 potential added edges using PAL (solid

blue –×–), ADMM (dashed red - -- -), and ADMM with the adaptive µ-

update heuristic (dotted yellow · · · · · · ). PAL requires fewer outer iterations

and thus a smaller total solve time. . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.5 Comparison of proximal gradient with BB step-size selection (solid blue —),

proximal gradient without BB step-size selection (dotted black · · · ) and gradient

descent with BB step-size selection (dashed red - - -) for the x-minimization

step (2.10a) for an unstable network with 20 subsystems, γ = 0.0844, and µ =

0.10. The y-axis shows the distance from the optimal objective value relative to

initial distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.6 Computation time required to solve (2.1) for 10 evenly spaced values of γ from

0.001 to 1.0 for a mass-spring example with N = 5, 10, 20, 30, 40, 50, 100 masses.

Performance of direct proximal gradient (dashed green −−−−), the method

of multipliers (solid blue —×—) and ADMM (dash-dot red − · − ·) is

displayed. All algorithms use BB step-size initialization. . . . . . . . . . . . . . . 48

2.7 Computation time required to solve (2.1) for 20 evenly spaced values of γ from

0.001 to 0.05 for unstable network examples with N = 5, 10, 20, 30, 40, 50 nodes.

Performance of direct proximal gradient (dashed green −−−−), the method

of multipliers (solid blue —×—) and ADMM (dash-dot red − · − ·) is

displayed. All algorithms use BB step-size initialization. . . . . . . . . . . . . . . 49

3.1 Block diagram of primal-descent dual-ascent dynamics where G is a linear system

connected via feedback with nonlinearities. . . . . . . . . . . . . . . . . . . . . . 55

3.2 Block diagram of primal-descent dual-ascent dynamics where the nonlinearities

are replaced by an IQC imposed on their inputs and outputs. . . . . . . . . . . . 55

xii

3.3 Subfigures (a) and (b) show two optimal configurations of mobile agents () with

corresponding fixed agents (black ), Exx is (solid red — and solid blue —

edges) and Exb (dotted black · · · edges). Subfigures (c) and (d) show the agent

trajectories and distance from optimal configuration (a) until t = 2.5 and from

optimal configuration (b) until the final t = 10. . . . . . . . . . . . . . . . . . . 65

4.1 Distance from optimal solution as a function of the iteration number and solve

time when solving LASSO for two values of γ using ISTA, FISTA, and our algo-

rithm (2ndMM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.2 Solve times for LASSO with n = 1000 obtained using ISTA, FISTA, and our

algorithm (2ndMM) as a function of the sparsity-promoting parameter γ. . . . . 86

4.3 Comparison of our algorithm (2ndMM) with state-of-the-art methods for LASSO

with problem dimension varying from n = 100 to 2000. . . . . . . . . . . . . . . . 86

4.4 (a) The middle row of the circulant feedback gain matrix Z; and (b) the sparsity

level of z?γ (relative to the sparsity level of the optimal centralized controller z?0)

resulting from the solutions to (4.31a) for the linearized Swift-Hohenberg equation

with n = 64 Fourier modes and c = −0.01. . . . . . . . . . . . . . . . . . . . . . . 92

4.5 Performance degradation (in percents) of structured controllers relative to the

optimal centralized controller: polished optimal structured controller obtained by

solving (4.31a) and (4.31b) (solid blue ——); unpolished optimal structured

controller obtained by solving only (4.31a) (dashed red - -×- -); and optimal

structured controller obtained by solving (4.31b) for an a priori specified nearest

neighbor reference structure (dotted yellow · · ·+ · · · ). . . . . . . . . . . . . . 93

4.6 Total time to compute the solution to (4.31a) with γ = 0.004 using our algorithm

(2ndMM), proximal Newton, and ADMM. . . . . . . . . . . . . . . . . . . . . . . 94

4.7 Comparison of (a) times to compute an iteration (averaged over all iterations);

and (b) numbers of iterations required to solve (4.31a) with γ = 0.004. . . . . . 95

4.8 Distance from the optimal solution as a function of iteration number when solving

LASSO using Algorithm 4 for different values of µ. . . . . . . . . . . . . . . . . . 96

6.1 Directed network (solid black — arrows) with added undirected edges (dashed

red - - - arrows). Both the H2 and H∞ optimal structured control problems

yielded the same set of added edges. In addition to these edges, the controllers

tuned the weights of the edges (1)− (3) and (1)− (5). . . . . . . . . . . . . . . . 113

6.2 H2 and H∞ performance of the closed-loop symmetric system and the original

system subject to a controller designed at various values of γ. . . . . . . . . . . . 113

xiii

6.3 Sparsity structure of the matrix A and its symmetric counterpart. The elements

of A are shown with blue dots, and the elements in its symmetric component As

are overlaid in green circles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6.4 Difference in H2 norm between the symmetric and original systems with different

controllers designed as a function of γ, normalized by the H2 norm of original

system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

6.5 The solid — lines are the H2 norms of the symmetric systems and the dotted

- - - lines are the H2 norms of the original systems. The blue ×, red , and

magenta designate c = 105, 1.4× 106, and 1.9× 107 respectively. . . . . . . . 120

6.6 Computation time for the general formulation (6.3) (solid blue ——) and that

which takes advantage of spatial invariance (6.9) (dotted red · · · ∗ · · · ). . . . . . 124

6.7 Feedback gain −x(φ) for the node at position φ = 0, computed with N = 51 and

γ = 0 (solid black —), γ = 0.1 (dashed blue - - -), and γ = 10 (dotted red

· · · ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7.1 A directed network and the sparsity pattern of the corresponding graph Laplacian.

This network is stabilized if and only if either node 1 or node 2 are made leaders. 137

7.2 H2 performance of optimal leader set (blue ×) and upper bounds resulting from

“rounding” (yellow ) and the optimal leaders for the undirected network (red

+). Performance is shown as a percent increase in f2 relative to f lb2 (N). . . . . 142

7.3 Network of games in the 2015 − 2016 College Football season. The central con-

nected component corresponds to the top division, and the distal nodes are lower

division teams for whom there are less data. . . . . . . . . . . . . . . . . . . . . . 144

7.4 C. Elegans neural network with N = 10 (a) f2 and (b) f∞ leaders along with

the (c) f2 and (d) f∞ performance of varying numbers of leaders N relative to

f lb(N). In all cases, leaders are selected via “rounding”. . . . . . . . . . . . . . . 146

7.5 A directed network and corresponding A matrix for a virus with 4 mutants and

2 drugs. For this system, f∞ is nondifferentiable. . . . . . . . . . . . . . . . . . . 147

7.6 Comparison of different algorithms starting from initial condition [2.5 2.8]T . The

algorithms are the subgradient method with a constant step-size (dotted blue

· · · ), the subgradient method with a diminishing step-size (solid red —) and

our optimal subgradient method where the step-size is chosen via backtracking

to ensure descent of the objective function (dashed yellow - - -) . . . . . . . . 148

7.7 Mutation pattern of HIV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

xiv

7.8 Percent performance degradation for H2 (solid red —×—) and H∞ (dashed

blue −− −−) performance relative to using all 5 drugs. . . . . . . . . . . . . . 149

8.1 Scaling of computation time with the number of states for CVX and for ADMM for

mass-spring system with γ = 100 and position and velocity outputs. Empirically,

we observe that CVX scales roughly with n6 while ADMM scales roughly with n3.165

8.2 Scaling of computation time with the number of sensors for CVX and for ADMM

for mass-spring system with γ = 100 and n = 50. Outputs are random linear

combinations of the states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

8.3 Percent increase in fo(L) in terms of the number of sensors. . . . . . . . . . . . . 167

8.4 Retained position sensors as γ increases. A blue dot indicates that the position of

the corresponding mass is being measured. The top row shows the densest sensor

topology, and the bottom row shows the sparsest. . . . . . . . . . . . . . . . . . . 167

8.5 Retained velocity sensors as γ increases. A blue dot indicates that the velocity of

the corresponding mass is being measured. The top row shows the densest sensor

topology, and the bottom row shows the sparsest. . . . . . . . . . . . . . . . . . . 168

8.6 Number of sensors versus γ for a scheme which uses iterative reweighting and for

the scheme which uses constant weights. Iterative reweighting promotes sparser

structures earlier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

8.7 Computation time for mass-spring system damped with b = 0.1 and γ = 30. . . . 173

8.8 Body Freedom Flutter flexible wing testbed aircraft. . . . . . . . . . . . . . . . . 173

8.9 (a) Number of sensors as a function of the sparsity-promoting parameter γ; and

(b) Performance comparison of the Kalman filter associated with the sets of sen-

sors resulting from the regularized sensor selection problem and from truncation. 174

8.10 The `2 norm of each column of the Kalman gain for Kalman filters designed for

different sets of sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

xv

Chapter 1

Introduction

1.1 A historical perspective

Feedback control theory is the science of designing a controller that connects a system’s output

to its input in order to achieve some desired behavior. A simple example is a thermostat: this

controller achieves a target temperature by adjusting current through heating coils or strength

of air conditioning to counteract unwanted deviations from a desired temperature. The concept

of feedback control arguably traces back to irrigation canals in ancient Mesopotamia built to

facilitate farming [1]. Modern mathematical study of control systems began in the 19th cen-

tury with the study of centrifugal governors used to keep steam engines running at a constant

speed [1]. During the Second World War, applications of control theory for weapons guidance

systems drove the development of graphical single-input single-output control design techniques

based on loop-shaping, Bode plots, and Nyquist diagrams [2]. After the war, the challenge

of controlling rockets during the space race prompted the study of multi-input multi-output

systems and eventually led to the formalism of optimal control [3].

The optimal control of linear systems with quadratic performance measures, such as the

H2 and H∞ norms, is a cornerstone of systems theory. This framework provides a systematic

way to balance closed-loop performance, robustness, and control effort. In the conventional

formulation, an optimal controller is designed to minimize some measure of the amplification

from exogenous sources of excitation to a regulated output which penalizes both the system state

and the control effort. Typically, these optimal controllers do not have a particular structure

and require a centralized implementation in which measurement and control must involve all

1

2

outputs and inputs and must be performed at a single, central location.

While they provide a powerful framework, most traditional optimal control tools are ill-suited

to problems where controller structure is of tantamount importance. In many modern appli-

cations, constraints on communication, computation, and physics impose significant structural

restrictions on the controller. Designing controllers subject to such restrictions is a challenging

problem. In fact, even determining stabilizability was shown to be NP hard in general [4, 5].

Nevertheless, designing structured controllers is of growing importance for many modern

problems, such as the distributed control of large-scale networks of dynamical systems. Systems

of this type are found in applications ranging from distributed power generation, to deployment

of teams of robotic agents, to control of segmented mirrors in extremely large telescopes, to

control of fluid flows around wind turbines and vehicles. The development of analytical and

computational methods for tractable analysis and design of such networks is a major challenge.

Recent technological advances have allowed the individual components of large-scale systems

to be equipped with their own sensing, actuation, communication, computation, and decision

making capabilities. Advances in Micro-Electro-Mechanical-Systems (MEMS) have enabled the

development of arrays of sensors and actuators that can interact with one another. Strings of

vehicular platoons, unmanned aerial vehicles (UAVs), and robotic agents constitute another set

of examples of large-scale autonomous systems [6, 7].

In many of these applications, the scale of the problem, constraints on computing and com-

munication resources, and wide-spread sensing and actuating capabilities pose additional re-

quirements on controller complexity. Typically, these cannot be addressed using tools from

standard optimal control theory. For example, a dense state-feedback controller resulting from

the LQR framework would impose a prohibitive communication burden in large-scale networks.

This is because forming every control input requires information from every subsystem in the

distributed plant. The cost of creating and maintaining communication links makes such an

all-to-all topology infeasible in most large-scale and distributed systems.

This has motivated the design of structured (both decentralized and distributed) controllers.

Early efforts have centered on the design of decentralized strategies [8] and, during the last fifteen

years, the emphasis has shifted to the design of distributed controllers [9–37]. Incorporating

structure into systems theory has also enabled the analysis [38–41], control [42–45], and low

complexity modeling [46, 47] of large-scale fluid flows. Two major issues have emerged: the

identification of controller architectures of convex classes of structured control problems and

optimal control design under a priori specified structural constraints.

3

In some cases, these questions can be resolved using standard tools under additional assump-

tions. Optimal control problems are often reformulated using the Youla parameterization [48,49].

The mapping from the controller to the Youla parameter is nonlinear which typically compro-

mises convexity of the structural constraints in the distributed setup. It is thus important to

identify subspaces which remain invariant under this nonlinear mapping for distributed sys-

tems. In [11, 15], the subspaces of cone and funnel causal systems have been introduced; these

describe how information from every controller propagates through the distributed system. For

spatially-invariant systems, the design of quadratically optimal controllers can be cast into a con-

vex problem if the information in the controller propagates at least as fast as in the plant [11,15].

A similar but more general algebraic characterization of the constraint set was introduced and

convexity was established under the condition of quadratic invariance in [16]. Other classes

of convex distributed control problems include partially-nested systems [50–52], poset-causal

systems [19,25], and positive systems [53–59].

Since most of these convex formulations are expressed in terms of the impulse response pa-

rameters, they do not lend themselves easily to state-space characterization. Apart from very

special instances, the optimal distributed design problem remains challenging. For poset-causal

systems, explicit Riccati-based solutions for the optimal decentralized state-feedback problem

were obtained in [19, 25]. For a two-player problem with block-triangular state-space matrices,

the optimal decentralized output-feedback solution was recently provided in [23,29]. Character-

izing the structural properties of optimal distributed controllers is another important challenge.

For spatially invariant systems, the quadratic optimal controllers are also spatially invariant and

the information from other subsystems is exponentially discounted with the distance between

the controller and the subsystems [9]. For systems on graphs, this spatially decaying property

was studied in [60,61] and it motivates the search for inherently localized controllers.

Optimal control under structural constraints remains a challenging problem [32]. Since these

constraints are often combinatorial in nature (e.g., selecting k communication links between N

nodes), structured optimal control has two main aspects: the identification of effective structures

for the purpose of control (e.g., the communication topology between distributed controllers)

and the design of controllers given a specific structure (e.g., the control law implemented over a

given communication network).

4

1.2 Optimal control

The notion of controller structure can have different connotations and the design variable can

impact the dynamics in different ways. We first define the dynamics we study and the perfor-

mance metrics we use to quantify closed-loop performance. We consider the general dynamics,

ψ = (A + F (x))ψ + Bd

ζ =

C

R(x)

ψ (1.1)

where x ∈ Rm is a design parameter, F : Rm → Rn×n is a linear operator, ψ(t) ∈ Rn is the

state vector, d(t) represents an exogeneous input, and ζ is a regulated output. The matrix

C represents a mapping from ψ to a state penalty and R(x) is a mapping from the state to a

measure of control effort. In Section 1.2.1, we illustrate how this formulation generalizes standard

state feedback and encapsulates state feedback, output feedback, edge addition in directed and

undirected consensus networks, leader selection, and other problems.

In the optimal control problems we consider, the objective is to improve the open-loop (i.e.,

no controller, or x = 0) performance of a system by implementing a static controller x. Solving

optimal control problems amounts to designing controllers x that minimize a performance metric

that quantifies the performance of the closed-loop transfer function from the exogenous input

d to the regulated output ζ. This thesis mainly considers the H2- and H∞-optimal control

frameworks, where closed-loop performance is measured by variance amplification and the peak

of the frequency response, respectively.

We first provide examples of problems that are included in formulation (1.1), provide the

expressions for theH2 andH∞ norms, and then motivate the study of structured optimal control

with a distributed systems example.

5

1.2.1 Applications

Static feedback

The standard static state feedback formulation,

ψ = Aψ + B1d + B2u

u = −Xψ

ζ =

Cψ

R1/2u

(1.2)

with state-feedback matrix X is recovered from (1.1) by taking x = vec(X), F (x) = −B2X

and R(x) = −R1/2X for a positive semidefinite R. When the controller can only measure some

output ζ2 = C2ψ, taking F (x) = −B2XC2 and R(x) = −R1/2XC2 yields the static output

feedback formulation.

We describe and motivate this problem in the context of distributed systems in Section 1.3.

In Section 2.5, we develop an algorithm for designing structured controllers for (1.2) following

the formulation introduced by [62–64].

Edge addition in stochastically forced consensus networks

The problem of adding undirected ‘controller’ edges to an existing ‘plant’ consensus network

also involves dynamics of the form (1.1),

ψ = −(L + E diag(x)ET

)ψ + d

where each element of the state vector, ψi, represents a node in the network, L is a directed graph

Laplacian which contains information about how the nodes are connected in the ‘plant’ network,

E contains information about the locations of potential added edges, x is a vector of added edges

and F (x) = −E diag(x)ET is the corresponding weighted graph Laplacian associated with the

‘controller’ network [65].

Consensus networks have garnered much interest for problems dealing with collective decision-

making and collective sensing [66–69]. These networks have proven useful in applications as

varied as modeling animal group dynamics, control for spacecraft flying in formation, and data

fusion in sensor networks [70–72].

Adding edges to a consensus network improves its performance, but selecting a limited

6

number of edges to add is a combinatorial problem. Much recent work has focused on addressing

the optimal edge addition problem using convex optimization techniques [73–75] and developing

efficient algorithms that scale to very large networks [76]. However, most of these approaches

consider undirected networks. We extend this problem and study edge addition in directed

consensus networks in Sections 2.4 and 6.2.1.

Leader selection in consensus networks

This problem concerns the identification of influential nodes in consensus networks. These special

nodes, so-called leaders, can be equipped with additional information in order to influence the

network behavior in a beneficial way. One application is in vehicular formations, where the

objective is for the vehicles to gather at a certain point. Here, the ‘leaders’ are equipped absolute

measurements (e.g., from GPS units) and the other nodes must rely on relative measurements

(e.g., their distance from certain neighbors). The dynamics of the problem are,

ψ = − (L + diag(x)) ψ + d

where L is a graph Laplacian and x is a nonnegative vector whose nonzero entries identify the

leaders.

The question of how to optimally assign a predetermined number of nodes to act as leaders

in a network of dynamical systems with a given topology has also recently emerged as a useful

proxy for identifying important nodes in a network [77–84]. In [80], the authors develop a greedy

algorithm for leader selection in undirected networks and use convex relaxations to quantify

performance bounds. In [78,79], the authors derive an explicit expression for the set of optimal

leaders in terms of graph theoretic properties which is computationally tractable for very few for

very many leaders. In [85], the authors characterize bounds on the convergence rate based on

the distance between leaders and followers. Most of these results focus on undirected networks

with the exception of [86] in which the authors derive the optimal leaders for one-dimensional

directed path networks.

In contrast to earlier work, our framework can handle leader selection in directed consensus

networks [87,88]. We study this problem in detail in Section 7.3.

7

ψ1

ψ2 ψ4

ψ3

u1

A =

? ? ?? ?? ?

? ?

Fx =

????

Figure 1.1: HIV replication-mutation pattern for a set of 4 mutants with a single drug thataffects 2 mutants.

Combination drug therapy design for HIV treatment

The evolution-replication dynamics of HIV subject to treatment [89,90] can be cast as,

ψ = (A + diag (Fxx)) ψ + d.

Here, the elements of ψ represent populations of individual HIV mutants. The diagonal elements

of A represent each mutant’s replication rate and the off diagonal elements of A represent the

probability of mutation from one mutant to another. The linear function F (x) := diag (Fxx)

captures the effect of drug therapy. The components of the vector x are doses of different drugs

and the kth column of matrix Fx is describes how efficiently drug k kills each HIV mutant. See

Fig. 1.1 for an illustration.

Structure is particularly important for this problem since drug-drug interactions cannot

always be captured by the linear model. These interactions can be vital for efficacy of the

drug or may have fatal consequences, so it is important to apply constraints on dosages and to

impose penalties that promote the satisfaction of combinatorial conditions. However, imposing

constraints on drug dosages is challenging using existing positive systems tools for H2 and H∞design because the drug doses x do not appear explicitly as optimization variables [59, 89, 90],

making regularization difficult.

We first study design of structured controllers by designing a conservative structured con-

troller in Section 6.2.2. We then show convexity of the original problem in Section 7. Since

our formulations use the drug doses x as an optimization variable, our results apply naturally

8

structured combination drug therapy design; we explore this application in Section 7.5.

1.2.2 Performance metrics

In what follows, we use the H2 and H∞ norms to quantify the closed-loop performance. In

the centralized state-feedback problem, the optimal control law for LTI systems is given by

static state-feedback. Even though it is not clear if the optimal structured controller is also

memoryless, we design structured static controllers as a first iteration in design. In Chapter 7,

we explore the use of a time-varying controller for a certain class of structured control problem.

The H2 performance metric is the steady-state amplification from white-in-time stochastic

forcing at the disturbance input d to the performance output ζ of system (1.1),

f2(x) := limt→∞

E(ζT (t) ζ(t)

)= lim

t→∞E(ψT (t)CTC ψ(t) + uT (t)RT (x)R(x)u(t)

).

This quantity is determined by the square of theH2 norm of system (1.1) and it can be expressed

as a function of the controller x as

f2(x) =

trace((CTC + RT (x)R(x))P

), x stabilizing

+∞, otherwise(1.3a)

where x must be such that the closed-loop dynamical generator, A+F (x), corresponds to a stable

system (the spectrum of A+ F (x) must lie completely in the open left-half plane). The matrix

P is the closed-loop controllability gramian given by the solution to the Lyapunov equation,

(A + F (x))P + P (A + F (x))T + B1BT1 = 0.

When F (x) = −B2X, R(x) = −R1/2X, and there are no structural constraints on the matrix

X, the optimal H2 feedback gain is determined by the linear quadratic regulator (LQR) and it

can be computed via the solution of an algebraic Riccati equation.

The H∞ performance metric, which we denote by f∞(·), is the maximum induced L2 gain

from d to ζ in system (1.1),

f∞(x) := sup‖d‖L2≤1

‖ζ‖L2

‖d‖L2

,

where the L2 norm of a signal v is defined as,

‖v‖2L2:=

∫ ∞0

vT (t)v(t) dt.

9

This performance metric corresponds to the peak of the frequency response,

f∞(x) = supω∈R

σ(C (jωI − (A + F (x)))−1B

). (1.3b)

As with f2, the optimalH∞ control problem for unstructured state feedback can be cast in a con-

vex form and readily solved. However, as we describe next, incorporating structural restrictions

on x significantly complicates the design problem.

1.3 Distributed Control

To motivate the study of and explain the difficulty inherent to structured optimal control,

we begin with a discussion of networks of dynamical systems. In this problem, the controller

structure of interest is the communication topology induced by the control law. Traditional

design techniques, such as LQR, yield controllers which require a centralized implementation

where every subsystem communicates with a central node.

For large-scale systems, the computational and communication costs associated with such

an all-to-all communication topology may be prohibitively high. It is thus of interest to design

controllers with distributed structure and sparse communication topologies. We first draw a con-

nection between sparsity of the feedback gain matrix and the induced communication topology,

and then highlight challenges that arise in the design of structured state-feedback controllers.

Let us assume that (1.2) contains N individual subsystems, each with a local state and

control inputs. We partition the state and control input vectors into subvectors corresponding

to each subsystem, ψ := [ψT1 · · · ψTN ]T and u = [uT1 · · · uTN ]T , and write the subsystem

dynamics as,

ψi = Aii ψi +∑j 6= i

Aij ψj + B1i d + B2,ii ui. (1.4a)

The block-sparsity pattern of A determines the interaction topology between subsystems; when

Aij is zero, subsystem j has no direct effect on the evolution of the state of subsystem i.

With each subsystem we associate a controller that specifies the control input ui. Standard

optimal control techniques typically induce a communication topology which requires every local

controller to have access to the state of every subsystem. In large-scale networks of dynamical

systems, this may impose significant communication burden and implementation may be pro-

hibitively expensive. It is thus of interest to explore the design of feedback laws that utilize

limited information exchange within a large-scale network.

10

Under linear state-feedback u = −Xψ, the dynamics (1.4a) become,

ψi = Aii ψi +∑j 6= i

Aij ψj + B1i d − B2,ii

∑j

Xij ψj . (1.4b)

Thus, the block-sparsity pattern of the feedback gain matrix X determines the communication

topology of the static controller: forming the control input ui requires access to the states of

each subsystem j for which Xij is nonzero.

Figure 1.2a illustrates a network of coupled subsystems, associated controller topology, and

the sparsity patterns of the corresponding matrices A and X. The subsystems in the physical

layer are represented by blue octagons; their interaction topology is marked by the blue arrows

which correspond to the sparsity pattern of the matrix A. Each local controller is represented

by a yellow circle; the structure of the information exchange network between the two layers is

marked by the red arrows which correspond to the sparsity pattern of the feedback gain matrix

X.

In the more general setup where the local controllers are dynamic (perhaps because they

estimate the subsystem’s state rather than directly measure it), it is important to determine

the order of local controllers as well as the structure of the information exchange network in the

controller layer; see Fig. 1.2b for an illustration. Recent advances have been made for particular

classes of systems [23,29], but addressing these questions in general remains an open challenge.

1.3.1 An example

After permuting ψ relative to the definition in (1.4) for clarity of exposition, the state vector for

the mass-spring system shown in Fig. 1.3 is determined by ψ =[pT vT

]T, where p and v are

the vectors of positions and velocities of the N masses. We set all masses and spring constants

to unity. Assuming that the control and disturbance inputs enter as forces and partitioning

matrices in the state-space model (1.2) conformably with ψ yields

A =

0 I

T 0

, B1 = B2 =

0

I

,where T is an N ×N tridiagonal Toeplitz matrix with −2 on its main diagonal and 1 on its first

sub- and super-diagonal. In the absence of the structural constraints, the solution to the Riccati

equation yields the centralized H2-optimal controller, i.e., the linear quadratic regulator. In this

case, the LQR, X := [Xp Xv ], has dense position and velocity feedback gain matrices Xp and

11

X =

∗ ∗ ∗∗∗∗ ∗

∗

A =

∗ ∗∗ ∗ ∗∗ ∗ ∗∗ ∗

∗ ∗ ∗

(a) The local controllers are memoryless.

(b) The local controllers are dynamic.

Figure 1.2: A network of 5 dynamical systems with associated local controllers.

12

Figure 1.3: Mass-spring system on a line.

Xv, u1(t)

u2(t)

u3(t)

u4(t)

= −

∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗

︸︷︷︸

Xp

p1(t)

p2(t)

p3(t)

p4(t)

−∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗

︸︷︷︸

Xv

v1(t)

v2(t)

v3(t)

v4(t)

.

Even though these matrices are populated with non-zero elements, the gains that are used to

form control actions for individual masses display interesting patterns. Figure 1.4 illustrates the

optimal centralized position feedback gain matrix Xp in the system with 50 masses. Apart from

the edges, both Xp and Xv (not shown) have almost constant diagonals and exponential off-

diagonal decay. This suggests that good performance may be achievable if the smaller elements

of the feedback gain matrix are set to 0. Since the small feedback gains represent interactions

between masses that are spatially distant, such a strategy would allow the masses’ controllers

to interact in a distributed fashion.

More generally, for spatially invariant systems such as the mass-spring system on a circle,

the optimal controllers with respect to quadratic performance indices (e.g., H2, H∞) are also

spatially invariant and they exponentially discount information with spatial distance [9]. More-

over, it has been suggested that optimal controllers for spatially-decaying systems over general

graphs also possess spatially-decaying property [60, 61]. This motivates the search for inher-

ently localized controllers and suggests that localized information exchange in the distributed

controller may provide a viable strategy for controlling large-scale systems.

However, in general, low-magnitude elements in the feedback gain cannot reliably be assumed

unimportant. It is difficult to provide error bounds on the deviation from optimality as a result

of truncation (e.g., as in Fig. 1.4c) and it has been recognized that truncation of the centralized

controller could significantly compromise the closed-loop performance and even yield a controller

that does not guarantee closed-loop stability [60,64].

1.3.2 Lack of convexity for structured control

We consider the state feedback case, i.e., F (x) = B2X, for H2 optimal control. The design of the

optimal state-feedback gain X, subject to constraints on its sparsity pattern (equivalently, on

13

(a) (b) (c)

Figure 1.4: (a) The optimal centralized position feedback gain matrix Xp in the system with50 masses. Both Xp and Xv (not shown) have almost constant diagonals (modulo edges) andexponential off-diagonal decay. (b) Optimal centralized position gains for the middle massn = 25. (c) Truncation of the optimal centralized position gains for the middle mass n = 25.

the communication topology in Fig. 1.2a) has a rich history and was recently revisited in [91,92].

Let the subspace S encapsulate these structural constraints and let us assume that there is a

stabilizing X ∈ S. The optimal control problem of determining stabilizing X ∈ S that minimizes

the H2 norm of system (1.2) can be formulated as

minimize f(X)

subject to X ∈ S(1.5)

and brought into the following form

minimizeP,X

trace((Q + XTRX)P

)subject to (A − B2X)P + P (A − B2X)T + B1B

T1 = 0

P 0, X ∈ S

(1.6)

where Q := CTC and R 0. In the absence of the structural constraint X ∈ S, a standard

change of variables [49]

Z := X P (1.7)

can be used to express the square of the H2 norm as,

f(P, Y ) = trace (QP ) + trace (RZP−1 ZT )

14

Figure 1.5: The change of variables that casts the unstructured state-feedback problem as anSDP, in general, does not preserve the structural properties of X.

and the Schur complement can be employed to cast the optimal state-feedback H2 control

problem as an SDP,

minimizeP, Y, Z

trace (QP ) + trace (RΘ)

subject to (AP − B2 Z) + (AP − B2 Z)T

+ B1BT1 = 0 Θ Z

ZT P

0.

Since P is positive definite, it is invertible, and the optimal centralized (i.e., unstructured) X

is determined by Xc = ZP−1. This centralized solution coincides with the linear quadratic

regulator, which can be explicitly determined by Xc = R−1BT2 P where P is the unique positive

definite solution of the algebraic Riccati equation, ATP + PA+Q− PB2R−1BT2 P = 0.

The above change of variables is, in general, not suitable for imposing structure on X.

Although the constraint on the feedback gain matrix X ∈ S is linear and thus convex, the

corresponding constraint on P and Z is bilinear, ZP−1 ∈ S. This makes it difficult to translate

the sparsity patterns of X to the sparsity patterns of P and Z (see Fig. 1.5 for an illustration),

thereby limiting the use of these coordinates for structured design problems. In fact determin-

ing stabilizability, let alone achieving optimal closed-loop performance, is NP hard for general

structured problems [4, 5].

If P is restricted to be diagonal, the sparsity structure of X coincides with the sparsity

structure of Z. However, this may introduce considerable conservatism in the design and may

not even lead to a feasible SDP characterization (even when the original nonconvex problem is

feasible). One notable special case where this relaxation is tight appears in the H∞ optimal

15

control of positive systems [54].

1.4 Regularized problems

In order to design structured controllers, we draw inspiration from recent advances in compres-

sive sensing. Given some additional assumptions on data structure, it is possible to completely

recover signals even when they are sampled at frequencies below Nyquist rate. While it may

seem magical to a student fresh from an undergraduate signals and systems course, this result

makes intuitive sense; using additional information about the signal’s structure allows one to

recover it with less data from elsewhere. The magic sauce is the concept of regularization, which

allows one to incorporate additional knowledge about the signal structure into the recovery prob-

lem by augmenting the standard loss functions associated with signal sampling with nonsmooth

structure-promoting penalty terms.

Regularized problems have found applications in diverse fields including compressive sens-

ing [93], machine learning [94], statistics [95], image processing [96], and, as we develop in this

thesis, control theory [32]. This is a class of composite optimization problems in which the

objective function is a sum of a differentiable but possibly nonconvex component and a convex

nondifferentiable component,

minimize f(x) + γ g(x).

The differentiable component of the objective function, f(x), typically encodes some measure of

performance. In a least-squares setting, f(x) = ‖Ax− b‖2 measures how well a candidate set of

parameters x satisfies the linear relationship Ax = b. The nondifferentiable component of the

objective function, g(x), is a structure-promoting penalty function and γ specifies the emphasis

on this structure relative to performance. When γ = 0, an unstructured x will be recovered

and larger values of γ yield more structured optimal solutions. As an example, the `1 norm

g(x) =∑|xi| is a commonly used proxy for promoting sparsity of x.

In much of the work that utilizes regularization, γ can be chosen in a systematic way that

arises from assumptions on the problem structure and problem data, e.g. number of nonzero

entries and the statistics of noise affecting the measurements. In contrast, we use this framework

somewhat artificially. Instead of using regularization to harness a priori knowledge of the

structure of x in order to recover it, we use regularization to artificially impose structure on a

design variable x to satisfy engineering requirements.

In feedback synthesis, we augment a traditional performance metric (such as the H2 or H∞

16

norm) with a regularization function to promote certain structural properties in the optimal

controller. For example, the `1 norm and the nuclear norm are commonly used nonsmooth

convex regularizers that encourage sparse and low-rank optimal solutions, respectively.

Such regularized problems can be used to identify controller structure. This is particularly

important because recently, it has been demonstrated that the design of controller architectures

can have a more profound impact on the closed-loop performance than the optimal design under

a given pre-specified architecture [97]. In [62, 64], tools and ideas from control theory, opti-

mization, and compressive sensing have been combined to systematically address the challenge

of designing controller architectures. The proposed approach introduces regularized versions

of standard optimal control problems and aims to strike a balance between closed-loop perfor-

mance and controller complexity. As discussed earlier, when the state vector and control inputs

can be partitioned into subvectors that correspond to separate subsystems, promoting sparsity

of the feedback gain matrix limits information exchange between the physical system and the

controller. Sparse controller architectures can be designed by augmenting standard quadratic

performance measures with sparsity-promoting penalty functions which serve as measures of

controller complexity. Such an approach has received much recent attention [62,64,98–103].

Alongside sparse feedback synthesis, the critical question of sensor and actuator selection has

been recently considered in [104,105]. Although, in general, finding the solution to this problem

requires an intractable combinatorial search, by drawing upon recent developments in sparse

representations, this problem can be cast as a semidefinite program (SDP). Moreover, it is also

of interest to study problems where it is desired to impose structure on a linear function of the

design variable [106,107]. This broader framework covers a wide variety of problems ranging from

wide-area and distributed PI control of power networks [108–111], to combination drug therapy

for HIV treatment [112], to edge addition [76,113,114] and leader selection [77,78,80,81,83,84,87]

in consensus networks.

Several recent efforts have focused on establishing convexity for classes of these problems and

on developing efficient algorithms for optimal controller design for both convex and nonconvex

problems. Convex structured optimal control problems include symmetric modifications to sym-

metric linear systems [62,115,116], diagonal modifications to positive systems [87,112], optimal

sensor and actuator selection [104,105], and edge addition to undirected consensus [76,113,114]

and synchronization [117] networks. Algorithmic developments have employed alternating di-

rection method of multipliers [64,105], proximal gradient and Newton methods [114], as well as

first- and second-order method of multipliers [107, 118–120] to efficiently perform identification

of controller structure and structured feedback synthesis.

17

1.5 Design of structured controllers via regularization

The communication architecture S of the state-feedback controller in (1.6) is fixed and a priori

specified which may impose limits on the achievable performance. For problems where the com-

munication topology is not fixed, it is desirable to design a favorable communication topology

while promoting sparsity of the communication links. To achieve this, an optimization frame-

work which augments the H2 objective function with a penalty on the sparsity of the feedback

gain matrix (i.e., the number of communication links) was introduced in [62–64].

1.5.1 Structure identification

Our objective is to design controller architecture that achieves a desired tradeoff between the

quadratic performance of system (1.1) and the structure of the controller x. To address this

challenge we consider a regularized optimal control problem

minimizex

f(x) + γ g(Tx)

←−

←−

closed-loop

performance

controller

structure

(1.8)

where f(x) is the H2 or H∞ norm of system (1.1). In contrast to (1.6), no structural constraints

are imposed on x in (1.8); instead, the objective is to manage structure of the controller by

introducing a regularization term g(Tx) into the optimal control problem. The matrix T allows

regularization of structure in a different set of coordinates. The non-negative regularization

parameter γ encodes the emphasis on controller structure relative to the closed-loop performance.

For the state feedback problem F (x) = B2X regularized with a sparsity-promoting penalty

function, γ = 0 yields the centralized LQR solution. As γ increases, larger emphasis is placed

on obtaining a more structured feedback gain matrix X; see Fig. 1.6 for an illustration.

Problem (1.8) is difficult to solve directly because f is typically a nonconvex function of x

and g is convex but not differentiable. While the nonlinear change of coordinates (1.7) yields a

convex dependence of f on P and Y , in general, it introduces a nonconvex dependence of the

regularization term g on these optimization variables.

18

Figure 1.6: Increased emphasis on sparsity encourages sparser control architectures at the ex-pense of deteriorating the closed-loop performance. For γ = 0 the optimal centralized controllerXc is obtained from the positive definite solution of the algebraic Riccati equation. Control archi-tectures for γ > 0 are determined by X(γ) := argminX (f(X) + γ g(X)) and they depend on in-terconnections in the distributed plant and the state and control performance weights Q and R.

Sparsity-promoting regularizers

Elementwise sparsity of x can be promoted by incorporating the cardinality function into the

optimal control problem (1.8),

g0(x) = card (x) . (1.9a)

This regularizer counts the number of nonzero elements in x and it yields a combinatorial

optimization problem (1.8) whose solution typically requires an intractable combinatorial search.

A weighted `1 penalty,

g1(x) := ‖w x‖1 =∑i

wi |xi| (1.9b)

provides a convex proxy for promoting elementwise sparsity of x [121]. Here, w is the vector

whose elements are non-negative weights wi and is elementwise matrix multiplication. The

weights wi can be selected to place larger relative penalties on certain elements of x. Similarly,

the sum of the `2 norms (not the `2 norm squared) of the subvectors xi ∈ Rni ,

g2(x) =∑i

wi ‖xi‖2 (1.9c)

enhances group sparsity; i.e., sparsity at the level of subvectors [122].

The `1 norm is the largest convex function that underestimates the cardinality function

on the domain [−1, 1] [123]; see Fig. 1.7 for an illustration in the scalar case. Both the `1

19

Figure 1.7: Cardinality function of a scalar variable x and the corresponding absolute value andlogarithmic approximations on x ∈ [−1, 1].

norm and its weighted version are convex relaxations of card(X). On the other hand, better

approximation can be obtained with nonconvex functions, e.g., the sum-of-logs,

g4(x) =∑i

log

(1 +

|xi|ε

), 0 < ε 1. (1.9d)

The weighted `1 norm attempts to bridge the difference between the `1 norm and the cardinality

function. In contrast to the cardinality function that assigns the same cost to any nonzero

element, the `1 norm penalizes more heavily the elements of larger magnitudes. The positive

weights can be chosen to counteract this magnitude dependence of the `1 norm. For example,

if the weights wi are inversely proportional to the magnitude of xi, wi = 1/|xij |, xi 6= 0

wi = ∞, xi = 0

then there is no difference between the weighted `1 norm of x and the cardinality function of

x. This scheme, however, cannot be implemented, because the weights depend on the unknown

feedback gain. A re-weighted algorithm that solves a sequence of weighted `1 optimization

problems was proposed in [121]. In this, sequential linearization of the sum-of-logs function is

used and the weights are determined by the solution of the optimization problem in the previous

iteration. This algorithm has provided an effective heuristics for promoting sparsity in many

emerging applications.

Additional intuition about the role of sparsity-promoting regularizers can be gained by con-

sidering a problem in which it is desired to find the sparsest feedback gain that provides a given

20

`1 weighted `1 sum-of-logs

Figure 1.8: The solution x? of the constrained problem (1.10) is the intersection of the constraintset C := x | f(x) ≤ σ and the smallest sub-level set of g that touches C. The penalty function gis the `1 norm (left); the weighted `1 norm with appropriate weights (middle); and the nonconvexsum-of-logs function (right).

level of H2 performance σ > 0,

minimize card (x)

subject to f(x) ≤ σ.

Approximating card (x) with a penalty function g(x) yields

minimize g(x)

subject to f(x) ≤ σ.(1.10)

The solution to (1.10) is the intersection of the constraint set C := x | f(x) ≤ σ and the

smallest sub-level set of g that touches C; see Fig. 1.8 for an illustration. In contrast to the

`1 norm whose sub-level sets are determined by the convex `1 ball, the sub-level sets of the

nonconvex sum-of-logs function have a star-like shape.

Other regularization functions

The nuclear norm of a matricial variable X,

g∗(X) :=∑i

σi(X)

21

is the sum of the singular values of the matricial variable X and promotes a low-rank optimal

solution. Enforcing low-rank solutions is often important for low-complexity modelling [46,124–

127]

The indicator function associated with a convex set C,

gC(x) :=

0 x ∈ C∞ x 6∈ C

can be used to enforce that x lie in the convex set C. Such sets can encode structural properties

as simple as box constraints and can even be used to promote more complicated combinatorial

conditions such as mutual exclusivity or necessity between elements of x (i.e., either xi or xj

may be nonzero or xj may only be nonzero if xi is nonzero) [128]. These sorts of constraints

are very important in the context of the drug therapy example described in Section 1.2.1 since

linear models often cannot capture the full complexity of drug-drug interactions.

Finally, we note that recent work has used the framework of atomic norms [129] to penal-

ize communication between subsystems in large-scale distributed systems using more nuanced

measures of controller complexity [100–103].

1.5.2 Polishing

After having identified the controller architecture, we optimize the closed-loop H2 performance

over the identified structure S. This is necessary because the presence of the regularizer in (1.8)

often distorts the optimal solution. For example, in the `1 regularized case, the `1 norm imposes

an additional penalty on the magnitude of the feedback gains, resulting in worse closed-loop

performance. As a result, we obtain the final structured controller by solving a structured

problem,

minimize f(x)

subject to x ∈ S

which is equivalent to (1.8) where the regularizer is an indicator function corresponding to the

identified structure S.

For nuclear norm regularization, polishing amounts to optimizing over the singular values of

the matrix obtained by solving (1.8). Regularization with an indicator function does not require

a polishing step because it does not distort the value of the performance metric f in (1.8) as it

enforces an exact condition on x and is not a proxy for a nonconvex or combinatorial constraint.

22

1.6 Outline of thesis

This thesis approaches the challenge of designing structured controllers for linear systems by

solving regularized optimal control problems. In order to deal with the inherent difficulties of

solving these optimization problems, we tackle two overarching problems. The first is that of

developing tractable methods for solving regularized optimal control problems, which are in

general nonconvex. The second is that of identifying classes of problems for which regularized

optimal control can be placed into a convex form as well as developing methods to use convex

design problems to provide suboptimal solutions for the harder, nonconvex problems.

In Part I, we study algorithms for solving regularized problems. In Chapter 2, we develop a

novel splitting method based on a reformulation of the generalized regularized problem (1.8) with

an auxiliary variable. By exploiting the properties of proximal operators, we bring the associated

augmented Lagrangian into a continuously differentiable form by constraining it to a manifold

that corresponds to explicit minimization over the auxiliary variable. The new expression,

which we call the ‘proximal augmented Lagrangian’, facilitates a method of multipliers algorithm

that offers many advantages relative to existing methods, including guaranteed convergence for

nonconvex problems and the ability to regularize a linear function of the optimization variable.

We then study primal-descent dual-ascent Arrow-Hurwicz-Uzawa type gradient flow dynamics to

solve regularized problems in Chapter 3. We prove global convergence and establish conditions

for exponential convergence for continuous- and discrete-time updates. In Chapter 4, we take

advantage of generalizations of the gradient to apply second-order primal-dual updates to the

proximal augmented Lagrangian and prove global and local quadratic convergence. Finally, in

Chapter 5, we draw connections with other methods and provide additional insight.

In Part II, we identify classes of convex regularized optimal control problems. In Chapter 6,

we study the problem of designing symmetric modifications to symmetric linear systems. This

problem is convex in the underlying design variable and is thus appealing for the purpose of

structured control. We show that even when the system and controller are not symmetric, their

symmetric components can be used to perform structured design with stability and performance

guarantees. In Chapter 7, we examine the problem of designing structured diagonal modifica-

tions to positive systems. We prove that the H2 and H∞ optimal control problems are convex,

apply our results combination drug therapy design for HIV and leader selection in consensus

networks, and show that a constant controller is optimal for an induced-power performance

index. In Chapter 8, we examine the optimal actuator and sensor selection problems. Using

a change of variables, this problem can be cast as a semidefinite program. Although convex,

23

the computational complexity required to solve problems of this class scales poorly with the

problem dimension. We develop customized algorithms to solve this problem efficiently.

Finally, we provide concluding remarks and comment on future research in Chapter 9.

Notation

The set of real numbers is denoted by R and the set of nonnegative (positive) reals is denoted

by R+ (R++). The integers are represented by Z. The set of complex numbers are denoted by

C and j =√−1 is the imaginary unit. The operator <(·) (=(·)) extracts the real (imaginary)

component of a complex argument.

Given a matrix A, AT denotes its transpose. The vector λ(A) (σ(A)) indicates the eigenvalues

(singular values) of A, λi(A) denotes the ith largest eigenvalue of A and σ(A) denotes the princi-

pal (largest) singular value of A. We use trace(A) to denote its trace, and ‖A‖2F := trace(ATA

)to denote its Frobenius norm squared. The vector inner product is given by 〈x, y〉 := xT y

and the matricial inner product is given by 〈X, Y 〉 := trace(XTY ). The Kronecker product is

denoted by ⊗ and the Hadamard (entrywise) product is denoted by .We write A ≥ 0 (A > 0) if A is entrywise nonnegative (positive) and A 0 (A 0) to

denote that A is symmetric and positive semidefinite (definite).

We use the diag(·) operator to denote either the diagonal entries of a matrix or a diagonal

matrix with elements of the vector · on its diagonal, depending on its argument. The symbol E

denotes the expectation operator.

Given a set C we define the indicator function

IC(x) :=

0 if x ∈ C+∞ otherwise.

We define the sparsity pattern sp (u) of a vector u to be the set of indices for which ui is nonzero.

Definition 1. The adjoint of a linear operator F : Rm → Rn×n is the linear operator F †:

Rn×n → Rm which satisfies

〈X, F (x)〉 =⟨F †(X), x

⟩.

Any additional notation specific to particular chapters will be introduced as required.

Part I

Proximal augmented Lagrangian

algorithms

24

Chapter 2

Method of multipliers

We study a class of composite optimization problems in which the objective function is the sum

of a differentiable but possibly nonconvex component and a convex nondifferentiable component.

Problems of this form are encountered in diverse fields including compressive sensing [93], ma-

chine learning [94], statistics [95], image processing [96], and control [64]. In feedback synthesis,

they typically arise when a traditional performance metric (such as the H2 or H∞ norm) is

augmented with a regularization function to promote certain structural properties in the opti-

mal controller. For example, the `1 norm and the nuclear norm are commonly used nonsmooth

convex regularizers that encourage sparse and low-rank optimal solutions, respectively.

In this chapter, we derive the proximal augmented Lagrangian, which allows us to apply

the widely used method of multipliers (MM) to nondifferentiable composite optimization prob-

lems. We then illustrate the utility of this approach by applying it to edge addition in directed

consensus networks and sparse feedback synthesis.

2.1 Background

The lack of a differentiability in objective function (1.8) due to the regularization function

precludes the use of standard descent methods for smooth optimization. Proximal gradient

methods [130] and their accelerated variants [131] generalize gradient descent, but typically re-

quire the nonsmooth term to be separable over the optimization variable. Furthermore, standard

acceleration techniques are not well-suited for problems with constraint sets that do not admit

an easy projection (e.g., closed-loop stability).

An alternative approach is to split the smooth and nonsmooth components in the objective

25

26

function over separate variables which are coupled via an equality constraint. Such a reformu-

lation facilitates the use of the alternating direction method of multipliers (ADMM) [132]. This

augmented-Lagrangian-based method splits the optimization problem into subproblems which

are either smooth or easy to solve. It also allows for a broader class of regularizers than proximal

gradient and it is convenient for distributed implementation. However, there are limited conver-

gence guarantees for nonconvex problems and parameter tuning greatly affects its convergence

rate.

The method of multipliers (MM) is the most widely used algorithm for solving constrained

nonlinear programing problems [133–135]. In contrast to ADMM, it is guaranteed to converge for

noncconvex problems and there are systematic ways to adjust algorithmic parameters. However,

MM is not a splitting method and it requires joint minimization of the augmented Lagrangian

with respect to all primal optimization variables. This subproblem is typically nonsmooth and

as difficult to solve as the original optimization problem.

To treat this problem, we transform the augmented Lagrangian into a continuously differen-

tiable form by exploiting the structure of proximal operators associated with nonsmooth regular-

izers. This new form is obtained by constraining the augmented Lagrangian to a manifold that

corresponds to the explicit minimization over the variable in the nonsmooth term. The resulting

expression, that we call the proximal augmented Lagrangian, is given in terms of the Moreau

envelope of the nonsmooth regularizer and it is continuously differentiable. This allows us to

take advantage of standard optimization tools, including gradient descent and quasi-Newton

methods, and enjoy the convergence guarantees of the standard method of multipliers.

2.2 Problem formulation and background

2.2.1 Composite optimization problem

We consider a composite optimization problem,

minimizex

f(x) + g (T (x)) (2.1)

where the optimization variable x belongs to a finite-dimensional Hilbert space (e.g., Rm or

Rm×n) equipped with an inner product 〈·, ·〉 and associated norm ‖·‖. The function f is con-

tinuously differentiable but possibly nonconvex, the function g is convex but potentially nondif-

ferentiable, and T is a bounded linear operator. We further assume that g is proper and lower

semicontinuous, that (2.1) is feasible, and that its minimum is finite.

27

Regularization of T (x) instead of x is important in the situations where the desired struc-

ture has a simple characterization in the co-domain of T . Such problems arise in spatially-

invariant systems, where it is convenient to perform standard control design in the spatial

frequency domain [9] but necessary to promote structure in the physical space, and in consen-

sus/synchronization networks, where the objective function is expressed in terms of the deviation

of node values from the network average but it is desired to impose structure on the network

edge weights [106].

2.2.2 Background on proximal operators

Problem (2.1) is difficult to solve directly because f is, in general, a nonconvex function of x and

g is typically not differentiable. Since the existing approaches and our method utilize proximal

operators, we first provide a brief overview; for additional information, see [130].

The proximal operator of the function g is given by,

proxµg(v) := argminx

g(x) + 12µ ‖x − v‖2 (2.2a)

and the associated optimal value specifies its Moreau envelope,

Mµg(v) := infxg(x) + 1

2µ ‖x − v‖2 (2.2b)

where µ > 0. The Moreau envelope is a continuously differentiable function, even when g is not,

and its gradient is given by,

∇Mµg(v) = 1µ

(v − proxµg(v)

). (2.2c)

As a concrete example, when g is the `1 norm with unit weight w = 1, the proximal operator is

determined by soft-thresholding,

proxµg1(vi) = Sµ(vi) := sign(vi) max(|vi| − µ, 0) (2.3a)

the associated Moreau envelope is given by the Huber function,

Mµg1(vi) =

v2i /(2µ) |vi| ≤ µ

|vi| − µ/2 |vi| ≥ µ(2.3b)

28

(a) g1(v) (b) proxµg1 (v) (c) Mµg1 (v) (d) ∇Mµg1 (v)

Figure 2.1: The regularization function g1(v) = |v| for a scalar argument v with associatedproximal operator, Moreau envelope, and Moreau envelope gradient when µ = 1.

and the gradient of this Moreau envelope is the saturation function,

∇Mµg1(vi) = sign(vi) min(|vi|/µ, 1). (2.3c)

Figure 2.1 plots these functions for a scalar argument with µ = 1.

2.2.3 Existing algorithms

Proximal gradient

The proximal gradient method generalizes standard gradient descent to certain classes of nons-

mooth optimization problems. This method can be used to solve (2.1) when g(T ) has an easily

computable proximal operator. The standard gradient descent update to find a minimizer x? of

a differentiable function h(x) is given by

xl+1 = xl − αl∇h(xl)

where αl is a step size. When T = I, the proximal gradient method for problem (2.1) with

step-size αl is given by,

xl+1 = proxαlg(xl − αl∇f(xl)).

Clearly, this method is most effective when the proximal operator of g is easy to evaluate.

When g = 0, the proximal gradient method simplifies to standard gradient descent, and when g

is indicator function of a convex set, it simplifies to projected gradient descent.

In particular, the proximal gradient update for the `1-regularized least-squares problem

(LASSO),

minimizex

12 ‖Ax − b‖2 + γ ‖x‖1 (2.4)

29

where γ is a positive regularization parameter, is given by the Iterative Soft-Thresholding Al-

gorithm (ISTA),

xl+1 = Sγαl(xl − αlAT (Axl − b)).

When g(x) = 0, the proximal gradient method simplifies to standard gradient descent, and

when g(x) is the indicator function IC(x) of the convex set C, it simplifies to projected gradient

descent.

Except in special cases, e.g, when T = I or is a diagonal operator, efficient computation of

proxg(T ) does not necessarily follow from an efficiently computable proxg. This makes the use

of proximal gradient method challenging for many applications and its convergence can be slow.

Acceleration techniques [131, 136] improve the convergence rate, but they do not perform well

in the face of constraints such as closed-loop stability.

Augmented Lagrangian methods

A common approach for dealing with a nondiagonal linear operator T in (2.1) is to introduce

an additional optimization variable z

minimize f(x) + g(z)

subject to T (x) − z = 0.(2.5)

The augmented Lagrangian is obtained by adding a quadratic penalty on the violation of the

linear constraint to the regular Lagrangian associated with (2.5),

Lµ(x, z; y) = f(x) + g(z) + 〈y, T (x) − z〉 + 12µ ‖T (x) − z‖2

where y is the Lagrange multiplier, µ is a positive parameter, and Lµ is obtained by augmenting

the regular Lagrangian with a quadratic penalty on the violation of the linear constraint in (2.5).

ADMM solves (2.5) via an iteration which involves minimization of Lµ separately over x and

z and a gradient ascent update (with step size 1/µ) of y [132],

xk+1 = argminx

Lµ(x, zk; yk) (2.6a)

zk+1 = argminz

Lµ(xk+1, z; yk) (2.6b)

yk+1 = yk + 1µ (T (xk+1) − zk+1). (2.6c)

ADMM is appealing because, even when T is nondiagonal, the z-minimization step amounts

30

to evaluating proxµg, and the x-minimization step amounts to solving a smooth (but possibly

nonconvex) optimization problem. Although it was recently shown that ADMM is guaranteed

to converge to a stationary point of (2.5) for some classes of nonconvex problems [137], its rate

of convergence is strongly influenced by the choice of µ.

The method of multipliers (MM) is the most widely used algorithm for solving constrained

nonlinear programing problems and it guarantees convergence to a local minimum. In contrast

to ADMM, each MM iteration requires joint minimization of the augmented Lagrangian with

respect to the primal variables x and z,

(xk+1, zk+1) = argminx, z

Lµ(x, z; yk) (2.7a)

yk+1 = yk + 1µ (T (xk+1) − zk+1). (2.7b)

It is possible to refine MM to allow for inexact solutions to the (x, z)-minimization subproblem

and adaptive updates of the penalty parameter µ. However, until now, MM has not been a

feasible choice for solving (2.5) because the nonconvex and nondifferentiable (x, z)-minimization

subproblem is as difficult as the original problem (2.1).

In what follows, we exploit the structure of the linear constraint in (2.5) and utilize the

optimality conditions with respect to z to eliminate it from the augmented Lagrangian. This

brings the (x, z)-minimization problem in (2.7) into a form that is continuously differentiable

with respect to x.

2.3 MM with the proximal augmented Lagrangian

In this section, we derive the proximal augmented Lagrangian, a continuously differentiable

function resulting from explicit minimization of the augmented Lagrangian over the auxiliary

variable z. This brings the (x, z)-minimization problem in (2.7) into a form that is continuously

differentiable with respect to x and facilitates the use of a wide suite of standard optimization

tools, including the method of multipliers and the Arrow-Hurwicz-Uzawa method.

2.3.1 Derivation of the proximal augmented Lagrangian

The first main result of the paper is provided in Theorem 2.3.1. We use the proximal operator of

the function g to eliminate the auxiliary optimization variable z from the augmented Lagrangian

and transform (2.7a) into a tractable continuously differentiable problem.

31

Theorem 2.3.1. For a proper, lower semicontinuous, and convex function function g, mini-

mization of the augmented Lagrangian Lµ(x, z; y) associated with problem (2.5) over (x, z) is

equivalent to minimization of the proximal augmented Lagrangian

Lµ(x; y) := f(x) + Mµg(T (x) + µy) − µ2 ‖y‖

2 (2.8)

over x. Moreover, if f is continuously differentiable Lµ(x; y) is continuously differentiable over

x and y, and if f has a Lipschitz continuous gradient ∇Lµ(x; y) is Lipschitz continuous.

Proof. Through the completion of squares, the augmented Lagrangian Lµ associated with (2.5)

can be equivalently written as

Lµ(x, z; y) = f(x) + g(z) + 12µ ‖z − (T (x) + µy)‖2 − µ

2 ‖y‖2.

Minimization with respect to z yields an explicit expression,

z?µ(x, y) = proxµg(T (x) + µy) (2.9)

and substitution of z?µ into the augmented Lagrangian provides (2.8), i.e., Lµ(x; y) =

Lµ(x, z?µ(x, y); y). Continuous differentiability of Lµ(x; y) follows from continuous differentiabil-

ity of Mµg and Lipschitz continuity of ∇Lµ(x; y) follows from Lipschitz continuity of proxµg

and boundedness of the linear operator T ; see (2.2c).

Expression (2.8), that we refer to as the proximal augmented Lagrangian, characterizes

Lµ(x, z; y) on the manifold corresponding to explicit minimization over the auxilary variable

z. Theorem 2.3.1 allows joint minimization of the augmented Lagrangian with respect to x and

z, which is in general a nondifferentiable problem, to be achieved by minimizing differentiable

function (2.8) over x. It thus facilitates the use of the method of multipliers as follows and the

Arrow-Hurwicz-Uzawa gradient flow dynamics in Chapter 3.

2.3.2 MM based on the proximal augmented Lagrangian

Theorem 2.3.1 allows us to solve nondifferentiable subproblem (2.7a) by minimizing the con-

tinuously differentiable proximal augmented Lagrangian Lµ(x; yk) over x. Relative to ADMM,

our customized MM algorithm guarantees convergence to a local minimum and offers system-

atic update rules for the parameter µ. Relative to proximal gradient, we can solve (2.1) with a

general bounded linear operator T and can incorporate second order information about f .

32

Using reformulated expression (2.8) for the augmented Lagrangian, MM minimizes Lµ(x; yk)

over the primal variable x and updates the dual variable y using gradient ascent with step-size

1/µ,

xk+1 = argminx

Lµ(x; yk) (2.10a)

yk+1 = yk + 1µ ∇yLµ(xk+1; yk) (2.10b)

where

∇yLµ(xk+1; yk) = T (xk+1) − proxµg(T (xk+1) + µyk)

denotes the primal residual, i.e., the difference between T (xk+1) and z?µ(xk+1, yk).

In contrast to ADMM, our approach does not attempt to avoid the lack of differentiability of

g by fixing z to minimize over x. By constraining Lµ(x, z; y) to the manifold resulting from ex-

plicit minimization over z, we guarantee continuous differentiability of the proximal augmented

Lagrangian Lµ(x; y). MM is a gradient ascent algorithm on the Lagrangian dual of problem (2.5)

strengthened by a quadratic penalty on primal infeasibility. Since a closed-form expression of

the dual is typically unavailable, MM uses the (x, z)-minimization subproblem (2.7a) to evaluate

it computationally and then take a gradient ascent step (2.7b) in y. ADMM avoids this issue

by solving simpler, separate subproblems over x and z. However, the x and z minimization

steps (2.6a) and (2.6b) do not solve (2.7a) and thus unlike the y-update (2.7b) in MM, the

y-update (2.6c) in ADMM is not a gradient ascent step on the strengthened dual. MM thus has

stronger convergence results [132,133] and may lead to fewer y-update steps.

Remark 2.3.2. The proximal augmented Lagrangian enables MM because the x-minimization

subproblem in MM 2.10a is not more difficult than in ADMM (2.6a). For LASSO problem (2.4),

the z-update in ADMM (2.6b) is given by soft-thresholding, zk+1 = Sγµ(xk+1 + µyk), and

the x-update (2.6a) requires minimization of the quadratic function [132]. In contrast, the x-

update (2.10a) in MM requires minimization of (1/2) ‖Ax − b‖2 + Mµkg(x + µkyk), where

Mµkg(v) is the Moreau envelope associated with the `1 norm; i.e., the Huber function (2.3b).

Although in this case the solution to (2.6a) can be characterized explicitly by a matrix inversion,

this is not true in general. The computational cost associated with solving either (2.6a) or (2.10a)

using first-order methods scales at the same rate.

33

Algorithm

The procedure outlined in [135, Algorithm 17.4] allows minimization subproblem (2.10a) to be

inexact, provides a method for adaptively adjusting µk, and describes a more refined update

of the Lagrange multiplier y. We incorporate these refinements into our proximal augmented

Lagrangian algorithm for solving (2.5). In Algorithm 1, η and ω are convergence tolerances,

and µmin is a minimum value of the parameter µ. Because of the equivalence established in

Theorem 2.3.1, convergence to a local minimum follows from the convergence results for the

standard method of multipliers [135].

Algorithm 1 Proximal augmented Lagrangian algorithm for (2.5).

input: Initial point x0 and Lagrange multiplier y0

initialize: µ0 = 10−1, µmin = 10−5, ω0 = µ0, and η0 = µ0.10

for k = 0, 1, 2, . . .

Solve 2.10a such that‖∇xLµ(xk+1, yk)‖ ≤ ωk

if ‖∇yLµk(xk+1; yk)‖ ≤ ηk

if ‖∇yLµk(xk+1; yk)‖ ≤ η and ‖∇xLµ(xk+1, yk)‖ ≤ ωstop with approximate solution xk+1

else

yk+1 = yk + 1µk∇yLµk(xk+1; yk), µk+1 = µk

ηk+1 = ηk µ0.9k+1, ωk+1 = ωk µk+1

endif

else

yk+1 = yk, µk+1 = max(µk/5, µmin)ηk+1 = µ0.1

k+1, ωk+1 = µk+1

endifendfor

2.3.3 Minimization of the proximal augmented Lagrangian over x

MM based on the proximal augmented Lagrangian alternates between minimization of Lµ(x; y)

with respect to x (for fixed values of µ and y) and the update of µ and y. Since Lµ(x; y) is

once continuously differentiable, many techniques can be used to find a solution to subprob-

lem (2.10a). We next summarize three of them.

34

Gradient descent

The gradient of the proximal augmented Lagrangian with respect to x is given by,

∇xLµ(x; y) = ∇f(x) + 1µT

†(T (x) + µy − proxµg(T (x) + µy)).

Gradient descent with backtracking rules, such as the Armijo rule, can be used to find a solution

to (2.10a).

Proximal gradient

Gradient descent does not exploit the structure of the Moreau envelope of the function g; in some

cases, it may be advantageous to use proximal operator associated with the Moreau envelope to

solve 2.10a. In particular, when T = I, this requires evaluation of the proximal operator of the

Moreau envelope,

proxαMµg(v) := argmin

xMµg(x) + 1

2α ‖x − v‖2.

Since Mµg is continuously differentiable, the optimality conditions are given by

0 = ∇Mµg(x) + 1α (x − v)

= 1µ (x − proxµg(x)) + 1

α (x − v).

Thus, proxαMµg(v) = x∗ where x∗ satisfies,

x∗ = 1µ+α

(αproxµg(x

∗) + µ v).

For separable g, the proximal operators associated with the function g and its Moreau envelope

are easily computable. For example, the proximal operator of the `1 norm is determined by

soft-thresholding (2.3a) and the ith element of proxαMµg(v) is given by

proxαMµg(vi) =

µ

µ+α vi, |vi| ≤ µ + α

vi − α sign(vi), |vi| ≥ µ + α.

In [118], proximal gradient methods were used for subproblem (2.10a) to solve the sparse

feedback synthesis problem described in Section 2.5. For associated problem (2.1), computa-

tional savings were shown relative to standard proximal gradient and ADMM.

35

Quasi-Newton method

When the proximal operator associated with g is continuously differentiable, x ∈ Rm, and

T (x) = Tx, the Hessian of the proximal augmented Lagrangian is given by,

∇xxLµ(x; y) = ∇2f(x) + 1µ T

T (I − Bp)T

where (1/µ)(I−Bp) is the Hessian of the Moreau envelope and Bp is the Jacobian of proxµg(x+

µy). Although proxµg is not differentiable in general, it is Lipschitz continuous and therefore

differentiable almost everywhere [138]. When proxµg is not differentiable, the Dini derivative

or the Clarke subgradient [138, 139] can be used to obtain a generalized Jacobian Bp. For the

soft-thresholding operator (2.3a), Bp = diagβ, where the ith component of the vector β is

βi ∈ 0, |vi| < µ; 1, |vi| > µ; 0, 1, |vi| = µ . To improve computational efficiency, we employ

the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [135, Algorithm 7.4]

to estimate the Hessian ∇xxLµ(x; yk).

L-BFGS estimates the Hessian inverse, Hl, to compute the search direction via r = −Hl(∇Lµ).

Instead of explictly forming Hl, it computes r via a low-rank operation q = −Sl(∇Lµ) followed

by an easily-computable full-rank operation and another low-rank operation, r = S†l (H

0l (q)).

The operations Sl and S†l encode updates to an initial Hessian-inverse approximation H0

l based

on first-order information at previous iterates. The initial estimate of the Hessian inverse can be

adaptively updated and is commonly taken to be the scaling operation, H0l (v) = κv where [135]

κ =

⟨xl − xl−1, ∇Lµ(xl, y) − ∇Lµ(xl−1, y)

⟩‖∇Lµ(xl, y) − ∇Lµ(xl−1, y)‖2

. (2.11)

Remark 2.3.3. For regularization functions that do not admit simply computable proximal

operators, proxµg has to be evaluated numerically by solving (2.2a). If this is expensive, the

primal-descent dual-ascent algorithm of Chapter 3 offers an appealing alternative because it re-

quires one evaluation of proxµg per iteration. When the regularization function g is nonconvex,

the proximal operator may not be single-valued and the Moreau envelope may not be continu-

ously differentiable. In spite of this, the convergence of proximal algorithms has been established

for nonconvex, proper, lower semicontinuous regularizers that obey the Kurdyka- Lojasiewicz in-

equality [140].

36

Algorithm 2 Limited memory BFGS (L-BFGS)

input: Gradient ∇Lµ(xl; y), estimate of Hessian inverse H0l , and history sl = xl+1 − xl,

gradients tl = ∇Lµ(xl+1; y)−∇Lµ(xl; y) and ρl = 1/⟨tl, sl

⟩for last nl iterations

output: Search direction r = Hl(∇Lµ(xl; y))

q = ∇Lµ(xl; y)

for i = l − 1, . . . , l − nlαi = ρi

⟨si, q

⟩q = q − αiti

endfor

r = H0l (q)

for i = l − nl, . . . , l − 1τ = ρi

⟨ti, r

⟩r = r + si(αi − τ)

endfor

2.4 Example: Edge addition in directed consensus net-

works

To illustrate the utility of MM with the proximal augmented Lagrangian, we consider the prob-

lem of edge addition described in Section 1.2.1. The x minimization subproblem is solved using

L-BFGS.

It is desired to minimize the H2 norm of the closed-loop system,

ψ = −(Lp + Lx)ψ + d, ζ =

Q1/2

−R1/2Lx

ψwhere Lp and Lx are the Laplacians of the plant and controller networks, respectively, R 0,

and Q := I − (1/N)11T .

To ensure convergence of ψ to the average of the initial node values, we require that the

closed-loop graph Laplacian, L = Lp + Lx, is balanced. This condition amounts to the linear

constraint, 1TL = 0. We express the directed graph Laplacian of the controller network as,

Lx =∑i 6= j

Lij wij =:M∑l= 1

Ll wl

where wij ≥ 0 is the added edge weight that connects node j to node i, Lij := eieTi − eie

Tj , ei is

the ith basis vector in Rn, and the integer l indexes the edges such that wl = wij and Ll = Lij .

37

For simplicity, we assume that the plant network Lp is balanced and connected. Thus, enforcing

that L is balanced amounts to enforcing

1TLx = 1T

(∑l

Ll wl

)= 0.

This imposes a linear constraint on the vector of edge weights w ∈ RM , Ew = 0, where E is

the incidence matrix of the directed controller network [7]. When any edge may be added to a

network with N nodes, the number of potential added edges is M = N2 −N .

Any vector of edge weights w that corresponds to a balanced graph can be written as w = Tx

where the columns of T span the nullspace of the matrix E. The matrix T can be obtained

via the singular value decomposition of E and it provides a basis for the space of balanced

graphs, i.e., the cycle space [7] of the controller network. Each feasible controller network can

be expressed in terms of this basis,

Lx =∑l

Ll [Tx]l =∑l

Ll

[∑k

(T ek)xk

]l

=:∑k

Lk xk (2.12a)

where the matrices Lk are given by,

Lk :=∑l

Ll [T ek]l. (2.12b)

Since the controller and plant networks each have an eigenvalue at zero corresponding to 1,

we introduce the change of variables, φ

ψ

=

V T

1T

ψ ⇔ ψ = V φ + ψ 1 (2.12c)

where the columns of V form an orthonormal basis of the orthogonal complement to the subspace

spanned by the vector of all ones. The average mode ψ is marginally stable and decoupled from

the dynamics that govern the evolution of φ. Since Q = I− (1/N)11T , ψ is not detectable from

ζ and the dynamics that govern the evolution of deviations from average are,

φ = A φ + B d, ζ =

Q1/2 V

−R1/2 Lx V

φ (2.12d)

38

where A := −V T (Lp + Lx)V and B := V T . The square of the H2 norm of this system is

determined by,

f2(x) =⟨V T (Q + LTxRLx)V, Pc

⟩(2.12e)

where Pc is the controllability gramian of system (2.12d),

0 = A Pc + PcAT + BBT . (2.12f)

To balance the closed-loop H2 performance with the number of added edges, we introduce a

regularized optimization problem

xγ = argmin f2(x) + γ 1TTx + I+(Tx). (2.13)

Here, the regularization parameter γ > 0 specifies the emphasis on sparsity relative to the closed-

loop performance f , and I+ is the indicator function associated with the nonnegative orthant

RM+ . When the desired level of sparsity for the vector of the added edge weights wγ = Txγ has

been attained, optimal weights for the identified set of edges are obtained by solving,

minimizex

f2(x) + IWγ (Tx) + I+(Tx) (2.14)

where Wγ is the set of vectors with the same sparsity pattern as wγ and IWγis the indicator

function associated with this set.

2.4.1 Implementation

We next derive the gradient of the square of the H2 norm of system (2.12d) and provide im-

plementation details for MM with the proximal augmented Lagrangian with the regularization

functions in (2.13) and (2.14).

Lemma 2.4.1. Let a graph Laplacian of a directed plant network Lp be balanced and connected

and let A, B, Lx, and V be as defined in (2.12a)–(2.12d). The gradient of f2(x) defined in (2.12e)

is given by,

∇f2(x) = 2 vec(⟨

(RLxV − V Po)PcVT , Lk

⟩)

39

where Pc and Po are the controllability and observability gramians of system (2.12d),

A Pc + PcAT + BBT = 0

ATPo + PoA + V T (Q + LTxRLx)V = 0.

Proof. The Lagrangian associated with the minimization of the function f2 given by (2.12e)

subject to constraint (2.12f) is

⟨V T (Q + LTxRLx)V, Pc

⟩+⟨A Pc + PcA

T + BBT , Po

⟩where Po is the Lagrange multiplier. Variations with respect to Po and Pc yield the Lyapunov

equations for the controllability and observability gramians Pc and Po, respectively. Using the

definition of A, the Lagrangian can be rewritten as,

⟨(RLxV − 2V Po)PcV

T , Lx⟩

+⟨BBT − V TLpV Pc − PcV

TLTp V, Po

⟩+⟨V TQV, Pc

⟩.

Using definition (2.12a) of Lx and taking variation with respect to each xk yields the expression

for ∇f2(x).

The proximal operator associated with the regularization function gs(z) := γ1T z + I+(z)

in (2.13) is given by

proxµgs(vi) = max (0, vi − γµ).

The associated Moreau envelope for each element of v is

Mµgs(vi) =

1

2µv2i vi ≤ γµ

γ (vi − γ µ2 ) vi > γµ

and its gradient is

∇Mµgs(vi) = max ( 1µvi, γ).

The proximal operator associated with the regularization function in (2.14), gp(z) := IWγ (z) +

I+(z), is a projection onto the intersection of the set Wγ and the nonnegative orthant,

proxµgp(v) = PE(v)

40

where E :=Wγ ∩ RM+ . The associated Moreau envelope is

Mµgp(v) = 12µ ‖v − PE(v)‖2

and its gradient is determined by

∇Mµgp(v) = 1µ (v − PE(v)).

We solve (2.13) and (2.14) using Algorithm 1, where L-BFGS (i.e., Algorithm 2) is employed in

the x-minimization subproblem (2.10a).

2.4.2 Computational experiments

For the plant network shown in Fig. 2.2, Fig. 2.3 illustrates the tradeoff between the number

of added edges and the closed-loop H2 norm. The added edges are identified by computing the

γ-parameterized homotopy path for problem (2.13), and the optimal edge weights are obtained

by solving (2.14). The red dashed lines in Fig. 2.2 show the optimal set of 2 added edges. These

are obtained for γ = 3.5 and they yield 23.91% performance loss relative to the setup in which

all edges in the controller graph are used. We note that the same set of edges is obtained by

conducting an exhaustive search. This suggests that the proposed convex regularizers may offer

a good proxy for solving difficult combinatorial optimization problems.

We also consider simple directed cycle graphs with N = 5 to 50 nodes and M = N2 − Npotential added edges. We solve (2.13) for γ = 0.01, 0.1, 0.2, and R = I using MM with the

proximal augmented Lagrangian (PAL), ADMM, and ADMM with an adaptive heuristic for

updating µ [132] (ADMM µ). The x-minimization subproblems in each algorithm are solved

using L-BFGS. Since gs(Tx) and gp(Tx) are not separable in x, proximal gradient cannot be

used here.

Figure 2.4a shows the time required to solve problem (2.13) in terms of the total number

of potential added edges; Fig. 2.4b demonstrates that PAL requires fewer outer iterations; and

Fig. 2.4c illustrates that the average computation time per outer iteration is roughly equivalent

for all three methods. Even with an adaptive heuristic update of µ [132], ADMM requires more

outer iterations which increases overall solve time relative to the proximal augmented Lagrangian

method. Thus, compared to ADMM, PAL provides computational advantage by reducing the

number of outer iterations (indexed by k in Algorithm 1 and in (2.6)).

41

ψ1

ψ2ψ4

ψ3

ψ5 ψ6 ψ7

Figure 2.2: A balanced plant graph with 7 nodes and 10 directed edges (solid black lines). Asparse set of 2 added edges (dashed red lines) is identified by solving (2.13) with γ = 3.5 andR = I.

2.5 Example: Sparsity-promoting optimal control

Here, we apply the method of multipliers algorithm to apply a sparsity-promoting penalty on the

optimal state-feedback design problem described in Section 1.2.1. As described in Section 1.3,

sparsity of the feedback gain associated with a distributed system corresponds to a controller

with a distributed architecture. We use the proximal gradient method with BB step-size selection

to solve the Lµ minimization subproblem.

The design of state-feedback controllers which balance performance with sparsity has been

the subject of considerable attention in recent years [62, 64, 98–102, 104, 105, 141]. We consider

the sparsity-promoting optimal control problem applied to the linear system,

ψ = (A − B2X)ψ + B1 d, ζ =

Q1/2

−R1/2X

ψ (2.15)

where d is an exogenous disturbance, ζ is the performance output, and Q = QT 0 and

R = RT 0 are the state and control performance weights. System (2.15) describes closed-loop

dynamics under the state-feedback control law,

u = −Xψ,

We make the standard assumptions that (A,B2) is stabilizable and (A,Q1/2) is detectable.

The state-feedback gain X which minimizes the closed-loop H2 norm is, in general, a dense

42

per

form

ance

loss

(per

cent)

number of added edges

Figure 2.3: Tradeoff between performance and sparsity resulting from the solution to (2.13)-(2.14) for the network shown in Fig. 2.2. Performance loss is measured relative to the optimalcentralized controller (i.e., the setup in which all edges in the controller network are used).

M

(a) Total solve time (s)

M

(b) Number of outer iterations

M

(c) Solve time (s) per outer iteration

Figure 2.4: (a) Total time; (b) number of outer iterations; and (c) average time per outeriteration required to solve (2.13) with γ = 0.01, 0.1, 0.2 for a cycle graph with N = 5 to 50 nodesand m = 20 to 2450 potential added edges using PAL (solid blue –×–), ADMM (dashed red- -- -), and ADMM with the adaptive µ-update heuristic (dotted yellow · · · · · · ). PALrequires fewer outer iterations and thus a smaller total solve time.

43

matrix. In [62, 64], the authors studied the problem of designing feedback gain matrices which

balance H2 performance with the sparsity of X. This was achieved by considering a regularized

optimal control problem (2.1) where f = f2 is the closed-loop H2 norm, g(X) encodes some

structural constraint or penalty on X, and γ > 0 encodes the emphasis on this penalty relative

to the H2 performance. In particular we consider regularization by the `1 norm,

minimize f2(X) + γ‖X‖1

where f2 is the H2 norm (1.3a) of system (2.15),

f2(X) = trace(Pc(Q+XTRX))

and its gradient with respect to X is given by

∇f2(X) = 2(RX − BT2 Po)Pc

where Po and Pc are observability and controllability gramians of the closed-loop system,

ATclPo + PoAcl = − (Q + XTRX)

AclPc + PcATcl = −B1B

T1

and Acl := A − B2X. Furthermore, ∇f2(X) is a Lipschitz continuous function on the set of

stabilizing feedback gains [142].

For γ = 0, the problem simplifies to the H2 state-feedback problem whose solution is given

by the standard linear quadratic regulator. A typical approach is to solve (2.1) for a series of

different γ and to generate a set of feedback gains with different levels of sparsity. From this set,

a sparse feedback gain can be selected or γ can be refined to yield sparser or denser controllers.

The proximal augmented Lagrangian associated with this regularized problem is

f2(X) + γhκ(X + µY ) − µ2 ‖Y ‖

2F

where κ := µγ and the Huber function

hκ(V ) =∑i,j

12 V

2ij , |Vij | ≤ κ

κ (|Vij | − 12 κ), |Vij | ≥ κ

44

is the Moreau envelope of the `1 norm.

2.5.1 Minimization of the proximal augmented Lagrangian

The main computational burden in the method of multipliers lies in finding a solution to the

optimization problem (2.10a),

minimizeX

f2(X) + γhκ(X + µY k).

Although the differentiability of Lµ implies that gradient descent may be employed to update

X, we utilize the proximal gradient method to exploit the structure of the Moreau envelope hκ.

We use the notation X l to denote the sequence of inner iterates that converge to a solution

of (2.10a).

Proximal gradient step for minimizing Lµ

Proximal gradient descent, described in Section 2.2.2, provides a generalization of standard

gradient descent which can be applied to nonsmooth optimization problems. Here, we apply

it to solve subproblm (2.10a) in MM with the proximal augmented Lagrangian. The proximal

gradient update is given by X l+1 = X l + X where X minimizes

12α‖X‖

2F +

⟨∇f2(X l), X

⟩+ γ hκ(X l + X + µY )

over X and α is a step size [130]. Note that this problem is separable over the elements of X.

By defining a := Xij , b := (∇f(X l))ij , and c := (X l + µY k)ij , optimization over each element

of X can be expressed as,

minimizea

12α a

2 + b a + γ hκ(a+ c).

Setting the gradient to zero yields, a + α(b + γ satκ(a + c)) = 0 and considering the separate

cases when satκ(a+ c) = κ, a+ c, and −κ yields the optimal a,

a? =

−α(b + γκ), αb− c ≥ κ(αγ + 1)

− α

1 + αγ(b + γc), |αb− c| ≤ κ(αγ + 1)

−α(b − γκ), αb− c ≤ −κ(αγ + 1).

45

Substituting values for a, b, and c yields the proximal gradient update.

Step-size selection

Since the objective function is not smooth, an Armijo backtracking rule cannot be used. Instead,

we backtrack from αl,0 by selecting the smallest nonnegative integer r such that αl = crαl,0 with

c ∈ (0, 1) such that X l+1 is stabilizing and satisfies a descent condition based on the quadratic

approximation

f2(X l+1) ≤ f2(X l) +⟨∇f2(X l), X l+1 − X l

⟩+ 1

2αl‖X l+1 − X l‖2F .

When ∇f2 is Lipschitz continuous and α < 1Lf

where Lf is a Lipschitz constant of ∇f2, this

condition is always satisfied. This backtracking rule adaptively estimates the Lipschitz constant

of ∇f(X) to ensure convergence [131].

To improve the speed of the proximal gradient algorithm, we initialize the step-size using

the Barzilai-Borwein (BB) method [143],

αl,0 =‖X l − X l−1‖2F

〈X l−1 − X l, ∇f2(X l−1) − ∇f(X l)〉.

Figure 2.5 illustrates the utility of the proximal gradient method over standard gradient descent

and the advantage of BB step-size initialization.

2.5.2 Proximal gradient applied directly to (2.1)

For this problem, it is also possible to solve (2.1) directly using proximal gradient descent. This

algorithm is guaranteed to converge to a local optimal point [144], but we find that in practice

it takes longer to find a solution than the method of multipliers. The proximal operator for the

weighted `1-norm is the elementwise softhresholding operator defined in (2.3a) and the update

for solving (2.1) directly is given by

Xk+1 = Sβ(Xk − αk∇f2(Xk)

)where β := γαk and αk is the step-size. The backtracking and BB step-size initialization rules

described in Section 2.5.1 are also used here.

46

Lµ(xl ;yk)−Lµ(x?;yk)

(Lµ(x

0;yk)−Lµ(x?;yk))

Iteration

Figure 2.5: Comparison of proximal gradient with BB step-size selection (solid blue —),proximal gradient without BB step-size selection (dotted black · · · ) and gradient descent withBB step-size selection (dashed red - - -) for the x-minimization step (2.10a) for an unstablenetwork with 20 subsystems, γ = 0.0844, and µ = 0.10. The y-axis shows the distance from theoptimal objective value relative to initial distance.

47

2.5.3 Numerical experiments

We next illustrate the utility of our approach using two examples. We compare our method

of multipliers algorithm with the ADMM algorithm from [64] and a direct application of the

proximal gradient method.

Mass-spring system

Consider a series of N masses connected by linear springs. The dynamics of each mass are

described by

pi = − (pi − pi+1) − (pi − pi−1) + di + ui

where pi is the position of the ith mass. When the first and last masses are affixed to rigid

bodies, the aggregate dynamics are given by p

v

=

0 I

−T 0

p

v

+

0

I

d +

0

I

uwhere p, v, and d are the position, velocity and disturbance vectors, and T is a Toeplitz matrix

with 2 on the main diagonal and −1 on the first super- and sub-diagonals.

In Figure 2.6, we compare the time required to compute a series of sparse feedback gains for

10 values of γ, linearly spaced between 0.001 and 1.0. Taking γ = 1.0 corresponds to roughly

6% nonzero elements in the feedback gain matrix.

Among the three algorithms, ADMM is the fastest; however, the method of multipliers is

comparable and scales at the same rate. Direct proximal gradient was the slowest and exhibited

the worst scaling. Since the mass-spring system has benign dynamics, we next consider an

unstable network.

Unstable network

Let N nodes be uniformly randomly distributed in a box. Each node is an unstable second

order system coupled with nearby nodes via an exponentially decaying function of the Euclidean

distance δ(i, j) between them [60], ψ1i

ψ2i

=

1 1

1 2

ψ1i

ψ2i

+∑j 6= i

e−δ(i,j)

ψ1j

ψ2j

+

0

1

(di + ui)

48

Com

pu

tati

on

tim

e(s

)

Number of subsystems (N)

Figure 2.6: Computation time required to solve (2.1) for 10 evenly spaced values of γ from0.001 to 1.0 for a mass-spring example with N = 5, 10, 20, 30, 40, 50, 100 masses. Performanceof direct proximal gradient (dashed green −− −−), the method of multipliers (solid blue—×—) and ADMM (dash-dot red − · − ·) is displayed. All algorithms use BB step-sizeinitialization.

49

Com

pu

tati

on

tim

e(s

)

Number of subsystems (N)

Figure 2.7: Computation time required to solve (2.1) for 20 evenly spaced values of γ from0.001 to 0.05 for unstable network examples with N = 5, 10, 20, 30, 40, 50 nodes. Performanceof direct proximal gradient (dashed green −− −−), the method of multipliers (solid blue—×—) and ADMM (dash-dot red − · − ·) is displayed. All algorithms use BB step-sizeinitialization.

where Q and R are taken to be the identity. Note that simple truncation of the centralized

controller could result in a non-stabilizing feedback matrix [60]. We solve (2.5) for γ varying

from 0.001 to 0.05 in 20 linearly spaced increments. On average, γ = 0.05 corresponds to

approximately 25% nonzero entries in the feedback gain matrix.

Computation times for N varying from 5 to 50, are shown in Figure 2.7. Since networks

are randomly generated, we average the computation time for 5 networks of each size. For this

more complicated example, the method of multipliers algorithm is the fastest and ADMM is the

slowest.

Chapter 3

First-order primal-dual algorithm

We examine Arrow-Hurwicz-Uzawa gradient flow dynamics for the proximal augmented La-

grangian. Such dynamics can be used to identify saddle points of the Lagrangian [145] and

have enjoyed recent renewed interest in the context of networked optimization because, in many

cases, the gradient can be computed in a distributed manner [146]. Our approach yields a dy-

namical system with a continuous right-hand side for a broad class of nonsmooth optimization

problems. This is in contrast to existing techniques which employ subgradient methods [147] or

use discontinuous projected dynamics [148–150] for problems with inequality constraints. Fur-

thermore, since the proximal augmented Lagrangian is neither strictly concave nor linear in the

dual variable, we require additional developments relative to recent advances [151] to establish

convergence.

3.1 Arrow-Hurwicz-Uzawa gradient flow

In this section, we apply the continuous-time Arrow-Hurwicz-Uzawa gradient flow [145] to the

proximal augmented Lagrangian (2.8). Simultaneous update of the primal and dual variables in x

y

=

−∇xLµ(x; y)

∇yLµ(x; y)

(3.1)

should be compared and contrasted with the approach presented in Section 2.3 where we alter-

nate between minimization of Lµ(x; y) over x and gradient ascent over y. For convex, differen-

tiable f and convex g, we show that the gradient flow dynamics (3.1) globally converge to the

50

51

set of saddle points of the proximal augmented Lagrangian.

We first characterize the optimal primal-dual pairs of optimization problem (2.5) with the

Lagrangian,

f(x) + g(z) + 〈y, T (x) − z〉 .

The associated first-order optimality conditions are given by,

0 = ∇f(x?) + T †(y?) (3.2a)

0 ∈ ∂g(z?) − y? (3.2b)

0 = T (x?) − z? (3.2c)

where ∂g is the subgradient of g. Clearly, these are equivalent to the optimality condition

for (2.1), 0 ∈ ∇f(x?) + T †(∂g(T (x?))).

Remark 3.1.1. Since the proximal augmented Lagrangian (2.8) is neither linear nor strictly

concave in y, we cannot use [151] to show global convergence. For LASSO problem (2.4),

proxµg(x+µy) is differentiable at points where |xi+µyi| 6= γµ for all i. When it is differentiable,

Lµ(x; y) is locally quadratic in y and ∇yyLµ(x; y) = −µ diag(β) where the ith element βi = 1

if |xi + µyi| > γµ and 0 if |xi + µyi| < γµ. Since ∇yyLµ(x; y) 6= 0, the proximal augmented

Lagrangian (2.8) is not linear in y. Furthermore, since ∇yyLµ(x; y) is not negative definite,

Lµ(x; y) is not strictly concave in y.

To show convergence, we introduce a simple lemma about proximal operators which follows

almost directly from their definition. Even though we state the result for x ∈ Rm, g: RM → R

and T (x) = Tx where T ∈ RM×m is a given matrix, the proof for x in a Hilbert space and a

bounded linear operator T follows from similar arguments.

Lemma 3.1.2. Let g: RM → R be a proper, lower semicontinuous, convex function and let

proxµg: RM → RM be the corresponding proximal operator. Then, for any a, b ∈ RM , we can

write

proxµg(a) − proxµg(b) = D (a − b) (3.3)

where D is a symmetric matrix satisfying I D 0.

Proof. Let c := a− b, p := proxµg(a)−proxµg(b), and D := p pT /(pT c), p 6= 0; 0, otherwise.Then, by construction, D = DT 0 and (3.3) can be written as p = Dc. Since proxµg is firmly

nonexpansive [130], pT c ≥ ‖p‖2 or, equivalently, cTD(I − D)c ≥ 0 for every c ∈ RM . Positive

semi-definiteness of I −D thus follows from D 0 and commutativity of D and I −D.

52

Theorem 3.1.3. Let f be a continuously differentiable convex function, and let g be a proper,

lower semicontinuous, convex function. Then, for the primal-descent dual-ascent gradient flow

dynamics (3.1), x

y

=

− (∇f(x) + TT∇Mµg(Tx + µy))

µ (∇Mµg(Tx + µy) − y)

(3.4)

the set of optimal primal-dual pairs (x?, y?) of (2.5) is globally asymptotically stable, and x? is

an optimal solution of (2.1).

Proof. We introduce a change of coordinates x := x− x?, y := y − y? and a Lyapunov function

candidate,

V (x, y) = 12 〈x, x〉 + 1

2 〈y, y〉

where (x?, z?; y?) is an optimal solution to (2.5) that satisfies (3.2). The dynamics in the (x, y)-

coordinates are given by, ˙x

˙y

=

−(∇f(x) − ∇f(x?) + 1µ T

T m)

m − µ y

(3.5)

where

m := µ (∇Mµg(Tx + µy) − ∇Mµg(Tx? + µy?)). (3.6a)

Using expression (3.3) to construct D such that,

D(T x+ µy) = proxµg(Tx+ µy) − proxµg(Tx? + µy?) (3.6b)

and definition (2.2c) of ∇Mµg, we can write

m = (I − D)(T x + µy). (3.6c)

Thus, the derivative of V along the solutions of (3.5) is given by

V (x, y) = − 〈x, ∇f(x) − ∇f(x?)〉 − 1µ 〈T x, (I − D)T x〉 − µ 〈y, Dy〉 .

Since f is convex, 〈x, ∇f(x)−∇f(x?)〉 ≥ 0 and Lemma 3.1.2 imply V ≤ 0.

When ∇f(x) = ∇f(x?) and when matrices D and TT (I −D)T have nontrivial kernels, it is

possible that V = 0 for a nonzero y ∈ kerD and x such that T x ∈ ker(I −D). From (3.6c),

53

these conditions imply m = µy and (3.5) simplifies to,

˙x = −TT y, ˙y = 0.

Thus, the only invariant set for dynamics (3.5) is ∇f(x?) = ∇f(x), y ∈ kerTT ∩ kerD, and

T x ∈ kerI −D. Global asymptotic stability of these points follows from LaSalle’s invariance

principle. To complete the proof, we show that any x and y that lie in this invariant set also

satisfy the optimality conditions (3.2) of problem (2.5) with z = z?µ(x, y) and thus that x solves

problem (2.1).

Since x? and y? are optimal, (3.2a) implies ∇f(x?) + TT y? = 0. For any x and y that lie in

the invariant set of (3.5), we can replace ∇f(x?) with ∇f(x) and add TT y = 0 to obtain

∇f(x) + TT (y? + y) = ∇f(x) + TT y = 0

which implies that x and y also satisfy (3.2a). Furthermore, (I −D)T x = 0 and Dy = 0 can be

combined to yield T x−D(T x+ µy) = 0, which, by (3.6b), leads to

T x − (proxµg(Tx + µy) − proxµg(Tx? + µy?)) = 0.

By optimality condition (3.2b), Tx? = z?, and by definition (2.9), z? = z?µ(x?, y?) = proxµg(Tx?+

µy?). It thereby follows that Tx− proxµg(Tx+ µy) = 0 and thus

Tx = proxµg(Tx + µy) = z?µ(x, y) = z

which implies that x and z satisfy (3.2c).

Finally, the optimality condition of the minimization problem (2.2a) that defines proxµg(v)

is ∂g(z) + 1µ (z − v) 3 0. Setting v = Tx + µy from the expression (2.9) that characterizes the

z?µ-manifold and Tx = z by (3.2c) yields the final optimality condition (3.2b).

We note that gradient flow dynamics (3.1) are convenient for distributed implementation. If

the state vector x corresponds to the concatenated states of individual agents, xi, the sparsity

pattern of T and the structure of the gradient map ∇f : Rm → Rm dictate the communication

topology required to form ∇Lµ in (3.1). For example, if f(x) =∑fi(xi) is separable over

the agents, then ∇fi(xi) can be formed locally. If in addition TT is an incidence matrix of an

undirected network with the graph Laplacian TTT , each agent must share its state xi with its

neighbors and maintain dual variables yi that correspond to its edges.

54

Our approach provides several advantages over existing distributed optimization algorithms.

Even for problems (2.1) with non-differentiable regularizers g, a formulation based on the prox-

imal augmented Lagrangian yields gradient flow dynamics (3.1) with a continuous right-hand

side. This is in contrast to existing approaches which employ subgradient methods [147] or use

discontinuous projected dynamics [148–151]. Furthemore, for problems where T is not diagonal,

a distributed proximal gradient scheme cannot be implemented because the proximal operator

of g(Tx) may not be separable. Finally, relative to a distributed ADMM scheme, our method

does not require solving an x-minimization subproblem in each iteration.

3.2 Exponential convergence for strongly convex f

In this section, we will show that for a strongly convex f with a Lipschitz continuous gradient,

the dynamics described by (3.1) converge exponentially for a sufficiently large µ. Establishing

exponential convergence is particularly interesting since Lµ is neither strictly concave nor linear

in the dual variable. We then turn our attention to discrete-time dynamics, which are directly

related to an algorithmic implementation of the continuous dynamics (3.1) with a fixed step-

size. For this setup, we also show exponential convergence (i.e., linear convergence in standard

optimization terminology) for a suitably selected penalty-parameter µ and step-size.

To show these results, we employ the integral quadratic constraint (IQC) framework recently

formulated by [152] for studying optimization algorithms as linear systems G connected via

feedback with a nonlinear block corresponding to the gradient of f . To study (3.1), we introduce

an additional nonlinear block corresponding to the Moreau envelope; see Fig. 3.1.

We express 3.1, or equivalently 3.4, as a linear system G connected in feedback with non-

linearities that correspond to the gradients of f and of the Moreau envelope of g; see Fig. 3.1.

These nonlinearities can be conservatively characterized by IQCs. Exponential stability of G

connected in feedback with any nonlinearity that satisfies these IQCs implies exponential con-

vergence of 3.1 to the primal-dual optimal solution of (2.5). In what follows, we adjust the tools

of [152–155] to our setup and establish exponential convergence by evaluating the feasibility of

an LMI. We assume that the function f is mf -strongly convex with an Lf -Lipschitz continuous

gradient. Characterizing additional structural restrictions on f and g with IQCs may lead to

tighter bounds on the rate of convergence.

In this work, we use a static filter η = Ψ(u, ζ) = [uT ζT ]T and IQCs which correspond to

Lipschitz continuity. In general, filters with nontrivial dynamics and stricter IQCs may yield

tighter bounds on the exponential rate of convergence.

55

G

∇f −mI

µ∇Mµg

ζ1 = x

ζ2 = Tx+ µy

u1

u2

∆

Figure 3.1: Block diagram of primal-descent dual-ascent dynamics where G is a linear systemconnected via feedback with nonlinearities.

G

∆

Ψ η

ζ u

Figure 3.2: Block diagram of primal-descent dual-ascent dynamics where the nonlinearities arereplaced by an IQC imposed on their inputs and outputs.

3.2.1 Continuous-time dynamics

As illustrated in Fig. 3.1, 3.4 can be expressed as a linear system G connected via feedback to

a nonlinear block ∆,

w = Aw + B u, ξ = C w, u = ∆(ξ)

A =

−mfI

−µI

, B =

−I − 1µT

T

0 I

, C =

I 0

T µI

where w := [xT yT ]T , ξ := [ ξT1 ξT2 ]T , and u := [uT1 uT2 ]T . Nonlinearity ∆ maps the system

outputs ξ1 = x and ξ2 = Tx + µy to the inputs u1 and u2 via u1 = ∆1(ξ1) := ∇f(ξ1) −mfξ1

and u2 = ∆2(ξ2) := µ∇Mµg(ξ2) = ξ2 − proxµg(ξ2).

When the mapping ui = ∆i(ξi) is the L-Lipschitz continuous gradient of a convex function,

56

it satisfies the IQC [152, Lemma 6]

ξi − ξ0

ui − u0

T 0 LI

LI −2I

ξi − ξ0

ui − u0

≥ 0 (3.7)

where ξ0 is some reference point and u0 = ∆i(ξ0). Since f is strongly convex with parameter

mf , the mapping ∆1(ξ1) is the gradient of the convex function f(ξ1)− (mf/2)‖ξ1‖2. Lipschitz

continuity of ∇f with parameter Lf implies Lipschitz continuity of ∆1(ξ1) with parameter

L := Lf − mf ; thus, ∆1 satisfies (3.7) with L = L. Similarly, ∆2(ξ2) is the scaled gradient

of the convex Moreau envelope and is Lipschitz continuous with parameter 1; thus, ∆2 also

satisfies (3.7) with L = 1. These two IQCs can be combined into

(η − η0)TΠ (η − η0) ≥ 0, η := [ ξT uT ]T . (3.8)

For a linear system G connected in feedback with nonlinearities that satisfy IQC (3.8), [153,

Theorem 3] establishes ρ-exponential convergence, i.e., ‖w(t) − w?‖ ≤ τe−ρt‖w(0) − w?‖ for

some τ, ρ > 0, by verifying the existence of a matrix P 0 such that, ATρ P + PAρ PB

BTP 0

+

CT 0

0 I

Π

C 0

0 I

0, (3.9)

where Aρ := A+ρI. In Theorem 3.2.1, we determine a scalar condition that ensures exponential

stability when TTT is a full rank matrix.

Theorem 3.2.1. Let f be strongly convex with parameter mf , let its gradient be Lipschitz

continuous with parameter Lf , let g be proper, lower semicontinuous, and convex, and let TTT

be full rank. Then, if µ ≥ Lf − mf , there is a ρ > 0 such that the dynamics 3.1 converge

ρ-exponentially to the optimal point of (2.5).

Proof. Since any function that is Lipschitz continuous with parameter L is also Lipschitz contin-

uous with parameter L > L, we establish the result for µ = L := Lf −mf . We utilize [153, The-

orem 3] to show ρ-exponential convergence by verifying matrix inequality (3.9) through a series

of equivalent expressions (3.10). We first apply the KYP Lemma [156, Theorem 1] to (3.9) to

obtain an equivalent frequency domain characterization Gρ(jω)

I

∗Π

Gρ(jω)

I

0, ∀ ω ∈ R (3.10a)

57

where Gρ(jω) = C(jωI − Aρ)−1B. Evaluating the left-hand side of (3.10a) for L = µ and

dividing by −2 yields the matrix inequality

µm+ m2 + ω2

m2 + ω2I

m

m2 + ω2TT

∗ m/µ

m2 + ω2TTT +

ω2 − ρµµ2 + ω2

I

0 (3.10b)

where m := mf − ρ > 0 and µ := µ− ρ > 0 so that Aρ is Hurwitz, i.e., the system Gρ is stable.

Since the (1, 1) block in (3.10b) is positive definite for all ω, the matrix in (3.10b) is positive

definite if and only if the corresponding Schur complement is positive definite,

m/µ

µm+ m2 + ω2TTT +

ω2 − ρµµ2 + ω2

I 0. (3.10c)

We exploit the symmetry of TTT to diagonalize (3.10c) via a unitary coordinate transformation.

This yields m scalar inequalities parametrized by the eigenvalues λi of TTT . Multiplying the

left-hand side of these inequalities by the positive quantity (µ2 + ω2)(µm + m2 + ω2) yields a

set of equivalent, quadratic in ω2, conditions,

ω4 + ( mλiµ + m2 + µm− ρµ)ω2 + mµ2λiµ − ρµ(µm+ m2) > 0. (3.10d)

Condition (3.10d) is satisfied for all ω ∈ R if there are no ω2 ≥ 0 for which the left-hand side is

nonpositive. When ρ = 0, both the constant term and the coefficient of ω2 are strictly positive,

which implies that the roots of (3.10d) as a function of ω2 are either not real or lie in the domain

ω2 < 0, which cannot occur for ω ∈ R. Finally, continuity of (3.10d) with respect to ρ implies

the existence a positive ρ that satisfies (3.10d) for all ω ∈ R.

Remark 3.2.2. Each eigenvalue λi of a full rank matrix TTT is positive and hence to estimate

the exponential convergence rate it suffices to check (3.10d) only for the smallest λi. A sufficient

condition for (3.10d) to hold for each ω ∈ R is positivity of the constant term and the coefficient

multiplying ω2. These can be expressed as quadratic inequalities in ρ that admit explicit solutions,

thereby providing an estimate of the rate of exponential convergence.

58

3.2.2 Discrete time

The discrete time dynamics can be expressed as a discrete time linear system G connected via

feedback to a nonlinear block ∆,

wk+1 = Awk + B uk, ξ = C wk, u = ∆(ξ)

A =

αmIαµI

, B = α

−I − 1µT

T

0 I

, C =

I 0

T µI

(3.11)

where αm := 1 − αmf , αµ := 1 − αµ, w := [xT yT ]T , ξ := [ ξT1 ξT2 ]T , and u := [uT1 uT2 ]T . The

nonlinearity ∆ is the same as in the continuous-time case and it still satisfies IQC (3.8).

For a discrete time linear system G connected in feedback with nonlinearities that satisfy

IQC (3.8), [152, Theorem 4] establishes ρ-exponential convergence, i.e., ‖wk−w?‖ ≤ τρk‖w(0)−w?‖ for some τ ≥ 0, ρ ∈ [0, 1) is guaranteed by verifying the existence of a matrix P 0, λ ≥ 0

such that, ATρ PAρ − I ATρ PBρ

BTρ PAρ BTρ PBρ

+ λ

CT 0

0 I

Π

C 0

0 I

0, (3.12)

where Aρ := A/ρ and Bρ := B/ρ. In Theorem 3.2.3, we determine a scalar condition that

ensures exponential stability when TTT = I. The extension to the general case where TTT is

full rank is straightforward and the layout of the proof is meant to facilitate such development;

see Remark 3.2.4.

Theorem 3.2.3. Let f be strongly convex with parameter mf , let its gradient be Lipschitz

continuous with parameter Lf , let g be proper, lower semicontinuous, and convex, and let TTT =

I. Then, if µ ≥ Lf − mf and α ∈ (0,min( 12m ,

12µ , µ)), there is a ρ ∈ (0, 1) such that the

dynamics 3.11 converge ρ-exponentially to the optimal solution.

Proof. Since any function that is Lipschitz continuous with parameter L is also Lipschitz contin-

uous with parameter L > L, we establish the result for µ = L := Lf −mf . We utilize [152, The-

orem 4] to show ρ-exponential convergence by verifying matrix inequality (3.12) through a series

of equivalent expressions (3.13). We first apply the KYP Lemma [156, Theorem 2] to (3.12) to

obtain an equivalent frequency domain characterization Gρ(ejω)

I

∗Π

Gρ(ejω)

I

0, ∀ ω ∈ R (3.13a)

59

where Gρ(ejω) = C(ejωI−Aρ)−1Bρ = C(ρ ejωI−A)−1B. Evaluating the left-hand side of (3.13a)

for L = µ and dividing by −2 yields the matrix inequality a(ω)I b(ω)T

∗ c(ω)I + d(ω)TTT

0, ∀ω ∈ R (3.13b)

where

a(ω) := 1 + αµαm− 1 + ρ cosω

ρ2 + (αm− 1)2 + 2ρ(αm− 1) cosω

b(ω) := ααm− 1 + ρ cosω

ρ2 + (αm− 1)2 + 2ρ(αm− 1) cosω

c(ω) := 1− αµ αµ− 1 + ρ cosω

ρ2 + (αµ− 1)2 + 2ρ(αµ− 1) cosω

d(ω) := αµ

αm− 1ρ cosω

ρ2 + (αm− 1)22ρ(αm− 1) cosω.

Since the range of cosω over ω ∈ R is [−1, 1], we can replace cosω by ζ and check the matrix

inequality a(ζ)I b(ζ)T

∗ c(ζ)I + d(ζ)TTT

0, ∀ζ ∈ [−1, 1]

where

a(ζ) := 1 + αµαm− 1 + ρζ

ρ2 + (αm− 1)2 + 2ρ(αm− 1)ζ=: 1 + αµ

n

d

b(ζ) := ααm− 1 + ρζ

ρ2 + (αm− 1)2 + 2ρ(αm− 1)ζ=: α

n

d

c(ζ) := 1− αµ αµ− 1 + ρζ

ρ2 + (αµ− 1)2 + 2ρ(αµ− 1)ζ

d(ζ) := αµ

αm− 1 + ρζ

ρ2 + (αm− 1)2 + 2ρ(αm− 1)ζ=:

α

µ

n

d.

. (3.13c)

This is equivalent to checking positive definiteness of the (1, 1) block via,

a(ζ) > 0 (3.14a)

and the Schur complement. After diagonalizing TTT with a unitary transformation, this

amounts to checking

c(ζ) + d(ζ)λi −(b(ζ))∗b(ζ)

a(ζ)λi > 0 (3.14b)

for each λi where λi are the eigenvalues of TTT . To show exponential convergence at some rate

ρ, we show (3.12) via (3.14) for ρ = 1. By continuity of (3.14) in ρ, this establishes (3.12) for

some ρ < 1. For the remainder of the proof, we take ρ = 1.

60

Since a(ζ) is a linear fractional transformation of ζ, it is quasiconvex and thus condi-

tion (3.14a) may be established by checking a(1) > 0 and a(−1) > 0 [123],

1 + αµαm

1 + (αm− 1)2 + 2(αm− 1)= 1 +

µ

m> 0 (3.15a)

1 + αµαm− 2

1 + (αm− 1)2 − 2(αm− 1)= 1 +

αµ

αm− 2> 0 (3.15b)

Condition (3.15a) is clearly satisfied. By assumption, αµ ∈ (0, 12 ) and αm ∈ (0, 1

2 ) which implies

that αm− 2 ∈ (−2,− 32 ) and αµ

αm−2 ∈ (0,− 13 ), thereby establishing condition (3.15b).

Expressing the Schur complement (3.14b) in terms of the n and d from (3.13c) yields,

c(ζ) +α

µ

n

dλi −

(αn

d

)(1 + αµ

n

d

)−1 (αn

d

)λi = c(ζ) +

α

µ

n

dλi − α2n

2

d2

d

αµn+ dλi

= c(ζ) +α

µ

n

dλi −

α2n2

d(αµn+ d)λi

= c(ζ) +(α/µ)n(αµn+ d) − α2n2

d(αµn+ d)λi

= c(ζ) +(α/µ)n

αµn+ dλi

Substituting to remove n and d yields

s(ζ) := c(ζ) +(α/µ)n

αµn+ dλi =: 1 + s1(ζ) + s2(ζ)λi > 0. (3.16)

where

s1(ζ) := − αµ(αµ− 1 + ζ)

1 + (αµ− 1)2 + 2(αµ− 1)ζ

s2(ζ) :=(α/µ)(αm− 1 + ζ)

αµ(αm− 1 + ζ) + 1 + (αm− 1)2 + 2(αm− 1)ζ

Since both s1 and s2 are linear fractional transformations of ζ, they are quasiconvex which

implies s1(ζ) ∈ [s1(−1), s1(1)] and s2(ζ) ∈ [s2(−1), s2(1)] for all ζ ∈ [−1, 1]. Evaluating these

functions at ζ = ±1,

s1(−1) ∈ (0, 13αµ), s1(1) = −1, s2(−1) = (α/µ)

αm+αµ−2 > −1, s2(1) = (α/µ)αµ+αm > 1

(3.17)

yields

s1(ζ) ∈ (−1, 13αµ], s2(ζ) > −1, ∀ ζ ∈ [−1, 1] (3.18)

61

Since TTT = I, λi = 1 for all i. However, although s(1) > 0 and s(−1) > 0, the set of

quasiconvex functions is not closed under addition so inequality (3.16) does not follow for all

ζ ∈ [−1, 1]. From (3.18), we can only claim that s(ζ) > −1 for all ζ ∈ [−1, 1], which is clearly

insufficient. To obtain the result, we consider segments of [−1, 1]. These segments vary based

on two cases: µ ≥ m and µ ≤ m.

First, let us consider the case where µ ≤ m. We consider two segments [−1, η] and [η, 1]

where η = 1− αm. Note that the denominator of s1(ζ) is always positive so

s1(η) =−α2µ(µ−m)

1 + (αµ− 1)2 + 2(αµ− 1)(1− αm).

is positive because µ−m ≤ 0. Together with (3.17), this implies

s1(ζ) ≥ 0 ∀ζ ∈ [−1, η], s1(ζ) ≥ −1 ∀ζ ∈ [η, 1] (3.19a)

Combining (3.17) with s2(η) = 0 yields,

s2(ζ) ≥ −1 ∀ζ ∈ [−1, η], s2(ζ) ≥ 0 ∀ζ ∈ [η, 1]. (3.19b)

Together, equations (3.19) imply that (3.14b) holds when µ ≤ m.

Now consider the case µ ≥ m. We consider three segments [−1, δ], [δ, η], and [η, 1] where

δ = 1− αµ. Combining s1(δ) = 0 and s2(η) = 0 with (3.17) yields

s1(ζ) ≥ 0 ∀ζ ∈ [−1, δ], s2(ζ) ≥ 0 ∀ζ ∈ [η, 1]. (3.20a)

Together with (3.18), equation (3.20a) implies that (3.14b) holds for ζ ∈ [−1, δ] and ζ ∈ [η, 1].

To consider ζ ∈ [δ, η], we must consider s1(η) and s2(δ). For this case, where µ ≥ m,

s1(η) =(−αµ)α(µ−m)

1 + (αµ− 1)((αµ− 1) + 2(1− αm))

is negative. However, since |αµ| ≤ 12 , |αµ − 1| ≤ 1

2 , |α(m − µ)| ≤ 12 and |1 − αm| ≤ 1

2 , it is

readily shown that s1(η) > − 13 . Now consider,

s2(δ) =(α/µ)(α(m− µ))

αµ(α(m− µ)) + 1 + (αm− 1)2 + 2(αm− 1)(1− αµ).

which is also necessarily negative. Using the same bounds, |αµ| ≤ 1, |αµ−1| ≤ 1, |α(m−µ)| ≤ 1

62

and |1− αm| ≤ 1, it is readily shown that s2(δ) ≥ − 12 . This implies,

s1(ζ) ∈ [− 13 , 0] ∀ζ ∈ [δ, η], s2(ζ) ∈ [− 1

2 , 0] ∀ζ ∈ [δ, η]. (3.20b)

from which it follows that s(ζ) > 0 for all ζ ∈ [δ, η], establishing condition (3.14b) and completing

the proof.

Remark 3.2.4. The conditions on the step size α and the spectrum of TTT in Theorem 3.2.3

are conservative and lead to the simple verification of inequalities (3.15b) and (3.16). These

inequalities can be posed as quadratic and quartic equations in α, respectively. Since these

orders of polynomial have explicit solutions, more rigorous analysis of these conditions would

lead to explicit, tighter bounds on α and an overt dependence on λi.

3.3 Distributed implementation

Dynamics (3.1) and (3.11) are convenient for distributed implementation. If the state vector x

corresponds to the concatenated states of individual agents, xi, the sparsity pattern of T and

the structure of the gradient map ∇f : Rn → Rn dictate the communication topology required

to form ∇Lµ in (3.1). For example, if f(x) =∑fi(xi) is separable over the agents, then ∇fi(xi)

can be formed locally. If in addition TT is an incidence matrix of an undirected network with

M edges and the graph Laplacian TTT , each agent must share its state xi with its neighbors

and maintain dual variables yi that correspond to the edges to which it is connected.

Distributed implementation of a primal-descent dual-ascent dynamics on the proximal aug-

mented Lagrangian provides several advantages over existing distributed optimization algo-

rithms. For problems where T is not diagonal, a distributed proximal gradient scheme can-

not be implemented because the proximal operator of g(Tx) may not be separable. It has a

continuous right-hand side even for problems (2.1) with non-differentiable regularizers g, unlike

existing approaches which employ subgradient methods [147] or use discontinuous projected dy-

namics [148–151]. Finally, relative to a distributed ADMM scheme, our method does not require

solving an x-minimization subproblem in each iteration.

3.3.1 Example: Optimal placement

In this example [123], a number of agents try to minimize some weighted measure of distances

from their neighbors. The set of neighbors for each node is determined by an a priori specified

63

network. The optimization variable x contains the locations of the ‘mobile’ agents The locations

of fixed agents b enter as a problem parameter. The optimization problem can be written as,

minimize∑

(i,j)∈Exx

fij(xi − xj) +∑

(i,j)∈Exb

fij(xi − bj)

where xi, bi ∈ Rp are position vectors of mobile and fixed agents in a p-dimensional space,

Exx and Exb denote the sets of edges from mobile agents to each other and to the fixed agents

respectively, and fij is some measure of distance. In this example, we take fij = (1/2)‖·‖2 and

impose an additional constraint on the distances between neighboring mobile agents,

‖xi − xj‖ ≤ c, ∀(i, j) ∈ Exx.

We can reexpress the problem as,

minimize1

2

∥∥∥∥∥∥ A

T

x − b

0

∥∥∥∥∥∥2

︸︷︷︸f(x)

+ I[−c,c](Tx)︸︷︷︸g(T (x))

(3.21)

where x contains the positions of the mobile agents, Ax−b forms vectors between mobile agents

and neighboring fixed agents, Tx forms vectors connecting neighboring mobile agents, and I[−c,c]

is an indicator function associated with the set of vectors v whose subvectors vi ∈ Rp have norm

less than c.

Distributed implementation

We form the proximal augmented Lagrangian and implement the gradient dynamics described

in (3.1). The proximal operator assoicated with I[−c,c] is a projection which scales each subvector

with norm larger than c such that it has norm c. Each subvector xi of the state corresponds to

the position of the ith agent and each subvector yi of the Lagrange multiplier is associated with

the difference in position between two agents connected by an edge in Exx.

The continuous time dynamics described in (3.1) and the expression for the gradient of f ,

∇f(x) = (ATA + TTT )x + AT b

reveal the distributed nature of the dynamics. The matrix ATA = blkdiag(. . . , diIp, . . . ) where

64

di is the number of fixed agents to which the ith mobile agent is connected and Ip is a p-

dimensional identity matrix. The matrix T is of the form E ⊗ Ip where E is the incidence

matrix corresponding to Exx so TTT is of the form L ⊗ Ip where L is the graph laplacian

associated with Exx.

Each agent can form its dynamics only the positions of itself and its neighbors and the

Lagrange multipliers associated with the edges to which it is connected. The dynamics of yi

may be independently determined by agents.

Simulation results

We simulated the gradient dynamics for the constrained optimal placement problem for a small

system with 5 mobile agents and 6 fixed agents. Figs. 3.3a and 3.3b show the optimal mobile

agent positions (solution to (3.21)) with red or blue circles given two configurations of fixed

agents (black ) with Exx shown by solid red — or solid blue — edges, Exb is shown by

dotted black · · · edges, and c = 0.5. The second configuration corresponds to rotating the

fixed agents by π/8 and scaling their positions by 1.6. In Fig. 3.3a, 3 out of the 5 inequality

constraints are active and in Fig. 3.3b, all 5 inequality constraints are active.

To illustrate the dynamics, we simulated (3.1) from t = 0 to t = 10 with a discrete jump in

fixed agent positions. The mobile agents were initially placed at the origin. The fixed agents

were placed as in Fig. 3.3a until time t = 2.5, at which they were placed as in Fig. 3.3b. The

trajectories of each mobile agent are shown in Fig. 3.3c and the Euclidean distance from their

optimal position (i.e., as in Figs. 3.3a and 3.3b) with respect to time is plotted in Fig. 3.3d.

65

(a) First configuration. 3 out of 5 inequality con-straints active.

(b) Second configuration. 5 out of 5 inequality con-straints active.

(c) Trajectories with black × at position every 2.5time steps.

(d) Euclidean distance from optimal with respectto time.

Figure 3.3: Subfigures (a) and (b) show two optimal configurations of mobile agents () withcorresponding fixed agents (black ), Exx is (solid red — and solid blue — edges) and Exb(dotted black · · · edges). Subfigures (c) and (d) show the agent trajectories and distance fromoptimal configuration (a) until t = 2.5 and from optimal configuration (b) until the final t = 10.

Chapter 4

Second-order primal-dual method

Although the MM and Arrow-Hurwicz-Uzawa primal-descent dual-ascent gradient flow ap-

proaches are typically simple and computationally efficient, they tend to converge slowly to

high-accuracy solutions. As with smooth problems, second order information can enhance the

speed of convergence. A generalization of Newton’s method to non-smooth problems has been

developed in [157,158], but it requires solving a regularized quadratic subproblem to determine

a search direction. Except in cases where the Hessian of the smooth component has a special

structure, this problem may be difficult to solve. More recently, proximal Newton method for

non-smooth composite minimization was developed in [159]. It was shown that, even in the

situations when the search direction is computed approximately, the developed algorithm shares

convergence properties with traditional Newton’s method. Related ideas have been successfully

utilized in a number of applications, including sparse inverse covariance estimation in graphical

models [160,161].

Recent work has extended the method of multipliers to incorporate second-order updates

of the primal and dual variables [162–164]. Since the standard method of multipliers seeks the

saddle point of the augmented Lagrangian, it is challenging to assess joint progress of primal and

dual iterates. In [162], Gill and Robinson have introduced the notion of primal-dual augmented

Lagrangian that serves as the merit function for measuring progress of second-order updates.

The proposed approach is applicable to non-convex problems but it requires that the objective

function is twice continuously differentiable. In [165], a forward-backward envelope was utilized

to cast the original non-smooth optimization problem as the unconstrained minimization of

a continuously differentiable function and generalized sub-differential calculus was employed to

derive second-order updates. Generalized second order methods have also been applied to saddle

66

67

point problems directly [166].

We draw on these recent advances and develop a new algorithm for non-smooth composite

optimization which efficiently forms second-order updates of both primal and dual variables. To

motivate our approach, we show global exponential stability of the corresponding continuous-

time differential inclusion when f is strongly convex. Using the merit function employed in [163,

164], we refine the algorithm by adding adaptive updates for the penalty parameter. We show

global convergence and local quadratic convergence of this algorithm for strongly convex f .


We consider problem (2.1) with additional assumptions on the function f and linear operator

T . Although our strongest results require strong convexity of f , our theory and techniques are

applicable as long as the Hessian of f is positive definite.

Assumption 1. The function f is twice continuously differentiable, has an Lf Lipschitz con-

tinuous gradient ∇f , and is strictly convex with ∇2f 0; the function g is proper, lower

semicontinuous, and convex; and the matrix T has full row rank.

We now provide background on generalizations of the gradient for nondifferentiable functions

and briefly overview existing second order methods for solving (2.1).

4.1.1 Generalization of the gradient and Jacobian

Although proxµg is typically not differentiable, it is Lipschitz continuous and therefore differ-

entiable almost everywhere [138]. One generalization of the gradient for such functions is given

by the B-subdifferential set [167], which applies to locally Lipschitz continuous functions h:

Rm → R. Let Ch be a set at which h is differentiable. Each element in the set ∂Bh(z) is the

limit point of a sequence of gradients ∇h(zk) evaluated at a sequence of points zk ⊂ Ch

whose limit is z,

∂Bh(z) := J | ∃zk ⊂ Ch, zk → z, ∇h(zk)→ J . (4.1)

If h is continuously differentiable in the neighborhood of a point z, the B-subdifferential set

becomes single valued and it is given by the gradient, ∂Bh(z) = ∇h(z). In general, ∂Bh(z) is

not a convex set; e.g., if h(z) = |z|, ∂Bh(0) = −1, 1.

68

The Clarke subdifferential set of h: Rm → R at z is the convex hull of the B-subdifferential

set [139],

∂Ch(z) := conv (∂Bh(z)).

When h is a convex function, the Clarke subdifferential set is equal to the subdifferential set

∂h(z) which defines the supporting hyperplanes of h at z [168, Chapter VI]. For a function G:

Rm → Rn, the B-generalization of the Jacobian at a point z is given by

∂BG(z) :=[JT1 . . . JTn

]Twhere each Ji ∈ ∂BGi(z) is a member of the B-subdifferential set of the ith component of G

evaluated at z. The Clarke generalization of the Jacobian at a point z, ∂CG(z), has the same

structure where each Ji ∈ ∂CGi(z) is a member of the Clarke subdifferential of Gi(z).

4.1.2 Semismoothness

The mapping G: Rm → Rn is semismooth at z if for any sequence zk → z, the sequence of

Clarke generalized Jacobians JGk ∈ ∂CG(zk) provides a first order approximation of G,

‖G(zk) − G(z) + JGk(z − zk)‖ = o(‖zk − z‖). (4.2)

where φ(k) = o(ψ(k)) denotes that φ(k)/ψ(k)→ 0 as k tends to infinity [169]. The function G

is strongly semismooth if this approximation satisfies the stronger condition,

‖G(zk) − G(z) + JGk(z − zk)‖ = O(‖zk − z‖2),

where φ(k) = O(ψ(k)) signifies that |φ(k)| ≤ Lψ(k) for some positive constant L and positive

ψ(k) [169].

Remark 4.1.1. (Strong) semismoothness of the proximal operator leads to fast asymptotic

convergence of the differential inclusion (see Section 4.2) and the efficient algorithm (see Sec-

tion 4.3). Proximal operators associated with many typical regularization functions (e.g., the `1

and nuclear norms [170], piecewise quadratic functions [171], and indicator functions of affine

convex sets [171]) are strongly semismooth. In general, semismoothness of proxµg follows from

semismoothness of the projection onto the epigraph of g [171]. However, there are convex sets

onto which projection is not directionally differentiable [172]. The indicator functions associ-

ated with such sets or functions whose epigraphs are described by such sets may induce proximal

69

operators which are not semismooth.

4.1.3 Existing second order methods

We discussed existing first order methods in Section 2.2.3. While simple to implement, their slow

convergence to high-accuracy solutions motivates the development of second order methods for

solving (2.1). A generalization of Newton’s method for nonsmooth problems (2.1) with T = I

was developed in [157, 158, 165, 173]. A sequential quadratic approximation of the smooth part

of the objective function is utilized and a search direction x is obtained as the solution of a

regularized quadratic subproblem,

minimizex

12 x

THx + ∇f(xk)T x + g(xk + x) (4.3)

where xk is the current iterate and H is the Hessian of f . This method generalizes the projected

Newton method [174] to a broader class of regularizers. For example, when g is the `1 norm,

this amounts to solving a LASSO problem [175], which can be a challenging task. Coordinate

descent is often used to solve this subproblem [173] and it has been observed to perform well in

practice [76,160,161].

The Forward-Backward Envelope (FBE) was introduced in [176–178]. The FBE is a once-

continuously differentiable nonconvex function of x and its minimum corresponds to the solution

of (2.1) with T = I. As demonstrated in Chapter 5, the FBE can be obtained from the proximal

augmented Lagrangian that we introduce in Section 2.3. Since the generalized Hessian of FBE

involves third-order derivatives of f (which may be expensive to compute), references [176–178]

employ either truncated or quasi-Newton methods to obtain a second order update to x.

4.1.4 Second order updates

Even though Newton’s method is primarily used for solving minimization problems in modern

optimization, it was originally formulated as a root-finding technique and it has long been

employed for finding stationary points [179]. In [166], a generalized Jacobian was used to extend

Newton’s method to semismooth problems. We employ this generalization of Newton’s method

to ∇Lµ(x; y) in order to compute the saddle point of the proximal augmented Lagrangian. The

unique saddle point of Lµ(x; y) is given by the optimal primal-dual pair (x?, y?) and it thus

provides the solution to (2.1).

70

Generalized Newton updates

Let H := ∇2f(x). We use the B-generalized Jacobian of the proximal operator proxµg, PB :=

∂B proxµg(Tx + µy), to define the set of B-generalized Hessians of the proximal augmented

Lagrangian,

∂2BLµ :=

H + 1

µ TT (I − P )T TT (I − P )

(I − P )T −µP

, P ∈ PB

(4.4a)

and the Clarke generalized Jacobian PC := ∂C proxµg(Tx + µy) to define the set of Clarke

generalized Hessians of the proximal augmented Lagrangian,

∂2CLµ :=

H + 1

µ TT (I − P )T TT (I − P )

(I − P )T −µP

, P ∈ PC

. (4.4b)

Note that ∂2BLµ(x; y) ⊂ ∂2

CLµ(x; y) because PB ⊂ PC .

We introduce the composite variable, w := [xT yT ]T , use Lµ(w) interchangeably with

Lµ(x; y), and suppress the dependance of H and P on w to reduce notational clutter. For

simplicity of exposition, we assume that proxµg is semismooth and state the results for the

Clarke generalized Hessian (4.4b), i.e., ∂2Lµ(w) = ∂2CLµ(w). As described in Remark 4.2.6,

analogous convergence results for non-semismooth proxµg can be obtained for the B-generalized

Hessian (4.4a), i.e., ∂2Lµ(w) = ∂2BLµ(w).

We use the Clarke generalized Hessian (4.4b) to obtain a second order update w by linearizing

the stationarity condition ∇Lµ(w) = 0 around the current iterate wk,

∂2CLµ(wk) wk = −∇Lµ(wk). (4.5)

Since proxµg is firmly nonexpansive, 0 P I. In Lemma 4.1.2 we use this fact to prove

that the second order update w is well-defined for any generalized Hessian (4.4) of the proximal

augmented Lagrangian Lµ(x; y) as long as f is strictly convex with ∇2f(x) 0 for all x ∈ Rm.

Lemma 4.1.2. Let H ∈ Rn×n be symmetric positive definite, H 0, let P ∈ RM×M be

symmetric positive semidefinite with eigenvalues less than one, 0 P I, let T ∈ RM×m be

full row rank, and let µ > 0. Then, the matrix H + 1µ T

T (I − P )T TT (I − P )

(I − P )T −µP

71

is invertible and it has n positive and m negative eigenvalues.

Proof. The Haynsworth inertia additivity formula [180] implies that the inertia of matrix (4.4)

is determined by the sum of the inertias of matrices,

H + 1µ T

T (I − P )T (4.6a)

and

−µP − (I − P )T(H + 1

µ TT (I − P )T

)−1

TT (I − P ). (4.6b)

Matrix (4.6a) is positive definite because H 0 and both P and I − P are positive semidefi-

nite. Matrix (4.6b) is negative definite because the kernels of P and I − P have no nontrivial

intersection and T has full row rank.

Fast local convergence

The use of generalized Newton updates for solving the nonlinear equation G(x) = 0 for non-

differentiable G was studied in [166]. We apply this framework to the stationarity condition

∇Lµ(w) = 0 when proxµg is (strongly) semismooth and show that second order updates (4.5)

converge (quadratically) superlinearly within a neighborhood of the optimal primal-dual pair.

Proposition 4.1.1. Let proxµg be (strongly) semismooth, and let wk be defined by (4.5). Then,

there is a neighborhood of the optimal solution w? in which the second order iterates wk+1 =

wk + wk converge (quadratically) superlinearly to w?.

Proof. Lemma 4.1.2 establishes that ∂2C Lµ(w) is nonsingular for any P ∈ PC . Since the gradi-

ent ∇Lµ(w) of the proximal augmented Lagrangian is Lipschitz continuous by Theorem 2.3.1,

nonsingularity of ∂2CLµ(w) and (strong) semismoothness of the proximal operator guarantee

(quadratic) superlinear convergence of the iterates by [166, Theorem 3.2].

4.2 A globally convergent differential inclusion

Since we apply a generalization of Newton’s method to a saddle point problem and the second

order updates are set valued, convergence to the optimal point is not immediate. Although we

showed local convergence rates in Proposition 4.1.1 by leveraging the results of [166], proof of

the global convergence is more subtle and it is established next.

72

To justify the development of a discrete-time algorithm based on the search direction resulting

from (4.5), we first examine the corresponding differential inclusion,

w ∈ −(∂2CLµ(w))−1∇Lµ(w) (4.7)

where ∂2CLµ is the Clarke generalized Hessian (4.4b) of Lµ. We assume existence of a solution

and prove asymptotic stability of (4.7) under Assumption 1 and global exponential stability

under an additional assumption that f is strongly convex.

Assumption 2. Differential inclusion (4.7) has a solution.

4.2.1 Asymptotic stability

We first establish asymptotic stability of differential inclusion (4.7).

Theorem 4.2.1. Let Assumptions 1 and 2 hold and let proxµg be semismooth. Then, differ-

ential inclusion (4.7) is asymptotically stable. Moreover,

V (w) := 12 ‖∇Lµ(w)‖2 (4.8)

provides a Lyapunov function and

V (t) = − 2V (t). (4.9)

Proof. Lyapunov function candidate (4.8) is a positive function of w everywhere apart from the

optimal primal-dual pair w? at which it is zero. It remains to show that V is decreasing along

the solutions w(t) of (4.7), i.e., that V is strictly negative for all w(t) 6= w?,

V (t) := ddtV (w(t)) = − 2V (w(t))

For Lyapunov function candidates V (w) which are differentiable with respect to w,˙V =

wT∇V by the chain rule. Although (4.8) is not differentiable with respect to w, we show that

V (w(t)) is differentiable along the solutions of (4.7). Instead of employing the chain rule, we

use the limit that defines the derivative,

V (t) := ddtV (w(t)) = lim

s→ 0

V (w(t) + sw(t)) − V (w(t))

s(4.10)

73

to show that V exists and is negative along the solutions of (4.7). Here, w is determined by the

dynamics (4.7),

w(t) = −H−1C ∇Lµ(w(t)), (4.11)

for some HC ∈ ∂2CLµ(w(t)) and (4.10) is equivalent to the directional derivative of V (w) in the

direction w. We first introduce

hs(t) :=V (w(t) + sw(t)) − V (w(t))

s

which yields V in the limit s → 0. We then rewrite hs(t) as the limit point of a sequence of

functions hs,k(t) so that

V (t) = lims→ 0

hs(t) = lims→ 0

limk→∞

hs,k(t) (4.12)

and use the Moore-Osgood theorem [181, Theorem 7.11] to exchange the order of the limits and

establish that V (t) = −2V (t).

Let Cg denote a subset of Rm+M over which proxµg(Tx+µy) is differentiable (and therefore

V is differentiable with respect to w) and let wk be a sequence of points in Cg that converges

to w. We define the sequence of functions hs,k(t),

hs,k(t) :=V (wk(t) + sw(t)) − V (wk(t))

s.

To employ the Moore-Osgood theorem, it remains to show that hs,k(t) converges pointwise (for

any k) as s→ 0 and that hs,k(t) converges uniformly on some interval s ∈ [0, s] as k →∞.

Since wk ⊂ Cg, V (wk) is differentiable for every k ∈ Z+ and ∇V (wk) = ∇2Lµ(wk) =

∂2CLµ(wk). It thus follows that

lims→ 0

hs,k(t) = wTk (t)∇2Lµ(wk)∇Lµ(wk(t)) (4.13)

pointwise (for any k). Local Lipschitz continuity of V with respect to wk implies uniform

convergence of hs,k(t) to hs(t) on s ∈ (0, s ] where s > 0. Therefore, the Moore-Osgood theorem

74

on exchanging limits [181, Theorem 7.11] in conjunction with (4.13) implies

V (t) = lims→ 0

hs(t)

= lims→ 0

limk→∞

hs,k(t) = limk→∞

lims→ 0

hs,k(t)

= limk→∞

wT (t)∇2Lµ(wk)∇Lµ(wk(t))

= wT (t)HB∇Lµ(w(t))

= − (∇Lµ(w(t)))TH−1C HB∇Lµ(w(t))

(4.14a)

for some HB ∈ ∂2BLµ(w(t)) by the definition of the B-subdifferential (4.1). This immediately

establishes (4.9) for any w (4.11) such that C ∈ ∂2BLµ(w(t)) by choosing Cg such that HB = HC .

Semismoothness implies directional differentiability of V and thereby equivalence of (4.10)

and (4.14a) by [182, Proposition 2.3]. It follows that

V (t) = (∇Lµ(w(t)))TH−1C HBi∇Lµ(w(t)) = (∇Lµ(w(t)))TH−1

C HBj∇Lµ(w(t)) (4.14b)

for any HBi , HBj ∈ ∂2BLµ(w(t)). By definition of the Clarke subgradient, any HC 6∈ ∂2

BLµ(w(t))

can be expressed as

HC =∑i

αiHBi (4.14c)

where HBi ∈ ∂2BLµ(w(t)) and α ∈ P where P is the simplex P := α ∈ Rr+ |

∑i αi = 1 and r

is the cardinality of the set ∂2BLµ(w). By substituting (4.14c) into the final line of (4.14a) and

noting the equivalence (4.14b), the directional derivative (4.10) can be expressed as

V (t) = − (∇Lµ(w(t)))T

(∑i

αiHBi

)−1 (∑i

βiHBi

)∇Lµ(w(t))

for any β ∈ P. Taking β = α yields V (t) = −‖Lµ(w(t))‖2 = −2V (t), completing the proof.

4.2.2 Global exponential stability

To establish global asymptotic stability, we show that the Lyapunov function (4.8) is radially

unbounded, and to prove exponential stability we bound it with quadratic functions. We first

provide two lemmas that characterize the mappings proxµg and ∇f in terms of the spectral

properties of matrices that describe the corresponding input-output relations at given points.

75

Lemma 4.2.2. Let f be strongly convex with parameter mf and let its gradient ∇f be Lipschitz

continuous with parameter Lf . Then, for any a, b ∈ Rn there exists a symmetric matrix Ga,b

satisfying mfI Ga,b LfI such that

∇f(a) − ∇f(b) = Ga,b (a − b).

Proof. Let c := a− b, d := ∇f(a)−∇f(b), e := d−mfc, and

Ga,b := eeT /(eT c), e 6= 0; 0, otherwise (4.15a)

Ga,b := Ga,b + mfI. (4.15b)

Clearly, by construction, Ga,b = GTa,b 0. It is also readily verified that Ga,b c = d when

eT c 6= 0. It thus remains to show that (i) Ga,b c = d when eT c = 0; and (ii) Ga,b (Lf −mf )I.

(i) Since f is mf strongly convex and ∇f is Lf Lipschitz continuous, h(x) := f(x)− mf2 ‖x‖

2

is convex and ∇h(x) = ∇f(x) −mfx is Lf −mf Lipschitz continuous. Furthermore, we have

e = ∇h(a)−∇h(b), and [152, Proposition 5] implies

eT c ≥ 1Lf −mf ‖e‖

2, for all c ∈ Rn. (4.16)

This shows that eT c = 0 only if e := d −mfc = 0 and, thus, d = mfc = Ga,b c when eT c = 0.

Therefore, there always exist a symmetric matrix Ga,b such that Ga,b c = d.

(ii) When eT c 6= 0, Ga,b is a rank one matrix and its only nonzero eigenvalue is ‖e‖2/(eT c);this follows from Ga,b e = (‖e‖2/(eT c)) e. In this case, inequality (4.16) implies eT c > 0

and (4.16) is equivalent to 1/(eT c) ≤ (Lf − mf )/‖e‖2. Thus, ‖e‖2/(eT c) ≤ Lf − mf and

Ga,b (Lf −mf )I when eT c 6= 0. Since Ga,b = 0 when eT c = 0, Ga,b (Lf −mf )I for all a

and b. Finally, Ga,b 0 and (4.15b) imply mfI Ga,b LfI.

Remark 4.2.3. Although matrices Da,b and Ga,b in Lemmas 3.1.2 and 4.2.2 depend on the

operating point, their spectral properties, 0 Da,b I and mfI Ga,b LfI, hold for all a

and b. These lemmas can be interpreted as a combination between a generalization of the mean

value theorem [181, Theorem 5.9] to vector-valued functions and spectral bounds on the operators

proxµg: Rm → Rm and ∇f : Rn → Rn arising from firm nonexpansiveness of proxµg, strong

convexity of f , and Lipschitz continuity of ∇f .

We now combine Lemmas 3.1.2 and 4.2.2 to establish quadratic upper and lower bounds

76

for Lyapunov function (4.8) and thereby prove global exponential stability of differential inclu-

sion (4.7) for strongly convex f .

Theorem 4.2.4. Let Assumptions 1 and 2 hold, let proxµg be semismooth, and let f be mf

strongly convex. Then, differential inclusion (4.7) is globally exponentially stable, i.e., there

exists κ > 0 such that ‖w(t)− w?‖ ≤ κ e−t ‖w(0)− w?‖.

Proof. Given the assumptions, Theorem 4.2.1 establishes asymptotic stability of (4.7) with the

dissipation rate V (w) = − 2V (w). It remains to show the existence of positive constants κ1

and κ2 such that Lyapunov function (4.8) satisfies

κ1

2 ‖w‖2 ≤ V (w) ≤ κ2

2 ‖w‖2 (4.17)

where w := w − w? and w? := (x?, y?) is the optimal primal-dual pair. The upper bound

in (4.17) follows from Lipschitz continuity of ∇Lµ(w) (see Theorem 2.3.1), with κ2 determined

by the Lipschitz constant of ∇Lµ(w).

To show the lower bound in (4.17), and thus establish radial unboundedness of V (w), we

construct matrices that relate V (w) to w. Lemmas 3.1.2 and 4.2.2 imply the existence of

symmetric matrices Dw and Gw such that 0 Dw I, mfI Gw LfI, and

proxµg(Tx+ µy)− proxµg(Tx? + µy?) = Dw (T x+ µy)

f(x)− f(x?) = Gw x.

As noted in Remark 4.2.3, although Dw and Gw depend on the operating point, their spectral

properties hold for all w.

Since ∇Lµ(w?) = 0, we can write

∇Lµ(w) = ∇Lµ(w) − ∇Lµ(w?) = Qw w

and express Lyapunov function (4.8) as

V (w) = 12 w

TQTwQw w

where

Qw :=

Gw + 1µ T

T (I −Dw)T TT (I −Dw)

(I −Dw)T −µDw

77

for some (Dw, Gw) ∈ Ωw,

Ωw := (Dw, Gw) | 0 Dw I, mfI Gw LfI .

The set Ωw is closed and bounded and the minimum eigenvalue of QTwQw is a continuous function

of Gw and Dw. Thus, the extreme value theorem [181, Theorem 4.14] implies that its infimum

over Ωw,

κ1 = inf(Dw,Gw)∈Ωw

λmin

(QTwQw

)is achieved. By Lemma 4.1.2, Qw is a full rank matrix, which implies that QTwQw 0 for all w

and therefore that κ1 is positive. Thus, V (w) ≥ κ1

2 ‖w‖2, establishing condition (4.17).

Condition (4.9) and [183, Lemma 2.5] imply V (w(t)) = e−2tV (w(0)). It then follows

from (4.17) that

‖w(t)− w?‖2 ≤ (κ2/κ1) e−2t ‖w(0)− w?‖2.

Taking the square root completes the proof and provides an upper bound for the constant κ,

κ ≤√κ2/κ1.

Remark 4.2.5. The rate of exponential convergence established by Theorem 4.2.4 is indepen-

dent of mf , Lf , and µ. This is a consequence of insensitivity of Newton-like methods to poor

conditioning. In contrast, the first order primal-dual method considered in [107] requires a suf-

ficiently large µ for exponential convergence. In our second order primal-dual method, problem

conditioning and parameter selection affect the multiplicative constant κ but not the rate of

convergence.

Remark 4.2.6. When differential inclusion (4.7) is defined with the B-generalized Hessian (4.4a),

Theorems 4.2.1 and 4.2.4 hold even for proximal operators which are not semismooth. In this

case, Theorem 4.2.1 is complete at equation (4.14a) and Theorem 4.2.4 applies without modifi-

cation.

4.3 A second order primal-dual algorithm

An algorithm based on the second order updates (4.5) requires step size selection to ensure

global convergence. This is challenging for saddle point problems because standard notions,

such as sufficient descent, cannot be applied to assess the progress of the iterates. Instead, it

is necessary to identify a merit function whose minimum lies at the stationary point and whose

78

sufficient descent can be used to evaluate progress towards the saddle point.

An approach based on discretization of differential inclusion (4.7) and Lyapunov func-

tion (4.8) as a merit function leads to Algorithm 4 in Section 4.5.1. However, such a merit

function is nonconvex and nondifferentiable in general which makes the utility of backtracking

(e.g., the Armijo rule) unclear. Moreover, Algorithm 4 employs a fixed penalty parameter µ. A

priori selection of this parameter is difficult and it has a large effect on the convergence speed.

Instead, we employ the primal-dual augmented Lagrangian introduced in [162] as a merit

function and incorporate an adaptive µ update. This merit function is convex in both x and

y and it facilitates an implementation with outstanding practical performance. Drawing upon

recent advancements for constrained optimization of twice differentiable functions [163,164], we

show that our algorithm converges to the solution of (2.5). Finally, our algorithm exhibits local

(quadratic) superlinear convergence for (strongly) semismooth proxµg.

4.3.1 Merit function

The primal-dual augmented Lagrangian,

Mµ(x, z; y, λ) := Lµ(x, z;λ) + 12µ ‖Tx− z + µ (λ− y)‖2

was introduced in [162], where λ is an estimate of the optimal Lagrange multiplier y?. Fol-

lowing [162, Theorem 3.1], it can be shown that the optimal primal-dual pair (x?, z?; y?) of

optimization problem (2.5) is a stationary point of Mµ(x, z; y, y?). Furthermore, for any fixed

λ, Mµ is a convex function of x, z, and y and it has a unique global minimizer.

In contrast to [162], we study problems in which a component of the objective function is not

differentiable. As in Theorem 2.3.1, the Moreau envelope associated with the nondifferentiable

component g allows us to eliminate the dependence of the primal-dual augmented Lagrangian

Mµ on z,

z?µ(x; y, λ) = argminz

Mµ(x, z; y, λ)

= proxµ2 g

(Tx + µ2 (2λ − y))

and to express Mµ as a continuously differentiable function,

Mµ(x; y, λ) := Mµ(x, z?µ(x; y, λ); y, λ) = f(x) +Mµ2 g

(Tx + µ

2 (2λ − y))

+ µ4 ‖y‖

2 − µ2 ‖λ‖

2.

For notational compactness, we suppress the dependence on λ and writeMµ(w) when λ is fixed.

79

Remark 4.3.1. The primal-dual augmented Lagrangian is not a Lyapunov function unless

λ = y?. We establish convergence by minimizing Mµ(x; y, λ) over (x; y) – a convex problem –

while adaptively updating the Lagrange multiplier estimate λ.

In [162, 164], the authors obtain a search direction using the Hessian of the merit function,

∇2Mµ. Instead of implementing an analogous update using generalized Hessian ∂2Mµ, we take

advantage of the efficient inversion of ∂2Lµ (see Section 4.3.2) to define the update

∂2Lµ(wk) w = − blkdiag(I,−I)∇M2µ(wk) (4.18)

where the identity matrices are sized conformably with the dimensions of x and y, and

∇M2µ(w) =

∇f(x) + TT∇Mµg(Tx+ µ(2λ− y))

µ(y −∇Mµg(Tx+ µ(2λ− y)))

. (4.19)

Multiplication by blkdiag(I,−I) is used to ensure descent in the dual direction and M2µ is

employed because z?µ(x; y, λ) is determined by the proximal operator associated with (µ/2)g.

When λ = y, ∇xM2µ = ∇xLµ, ∇yM2µ = −∇yLµ, and (4.18) becomes equivalent to the second

order update (4.5).

Lemma 4.3.2. Let w solve (4.18). Then, for the fixed value of the Lagrange multiplier estimate

λ and any σ ∈ (0, 1],

d := (1 − σ) w − σ∇M2µ(w, λ) (4.20)

is a descent direction of the merit function M2µ(w, λ).

Proof. By multiplying (4.18) with the nonsingular matrix

Π :=

I − 1µ T

T

0 I

(4.21)

we can express it as H TT

(I − P )T −µP

x

y

=

−(∇f(x) + TT y)

∇yM2µ(x; y, λ)

(4.22)

80

where H := ∇2f(x) 0. Using (4.19) and (4.22), ∇M2µ(w) can be expressed as,

∇M2µ(w) =

−(H + 1µT

T (I − P )T )x − TT (I − P )y

(I − P )T x − µP y

.Thus,

wT∇M2µ(w) = −xT (H + 1µT

T (I − P )T )x − µyTP y

is negative semidefinite, and the inner product

dT∇M2µ(w) = (1− σ) wT∇M2µ(w) − σ‖∇M2µ(w)‖2

is negative definite when ∇M2µ is nonzero.

4.3.2 Second order primal-dual algorithm

We now develop a customized algorithm that alternates between minimizing the merit function

Mµ(x; y, λ) over (x; y) and updating λ. Near the optimal solution, the algorithm approaches sec-

ond order updates (4.5) with unit step size, leading to local (quadratic) superlinear convergence

for (strongly) semismooth proxµg.

Our approach builds on the sequential quadratic programming method described in [162–164]

and it uses the primal-dual augmented Lagrangian as a merit function to assess progress of

iterates to the optimal solution. Inspired by [184], we ensure sufficient progress with damped

second order updates.

The following two quantities

r := Tx − proxµg(Tx + µy)

s := Tx − proxµg(Tx + µ (2λ − y))

appear in the proof of global convergence. Note that r is the primal residual of optimization

problem (2.5) and that ∇M2µ can be equivalently expressed as

∇M2µ(w) =

∇f(x) + 1µ T

T (s+ µ(2λ− y))

−(s + 2µ(λ− y))

. (4.23)

81

Global convergence

We now establish global convergence of Algorithm 3 under an assumption that the sequence

of gradients generated by the algorithm, ∇f(xk), is bounded. This assumption is standard for

augmented Lagrangian based methods [164,185] and it does not lead to a loss of generality when

f is strongly convex.

Theorem 4.3.3. Let Assumption 1 hold and let the sequence ∇f(xk) resulting from Algo-

rithm 3 be bounded. Then, the sequence of iterateswk

converges to the optimal primal-dual

point of problem (2.5) and the Lagrange multiplier estimates λk converge to the optimal La-

grange multiplier.

Proof. Since V2µ(w, λ) is convex in w for any fixed λ, condition (4.26) in Algorithm 3 will be

satisfied after finite number of iterations. Combining (4.26) and (4.23) shows that sk+2µk(λk−yk) → 0 and ∇f(xk) + 1

µkTT (sk + µ(2λk − yk)) → 0. Together, these statements imply that

the dual residual ∇f(xk) + TT yk of (2.5) converges to zero.

To show that the primal residual rk converges to zero, we first show that sk → 0. If Step 2a

in Algorithm 3 is executed infinitely often, sk → 0 since it satisfies (4.25) at every iteration and

η ∈ (0, 1). If Step 2a is executed finitely often, there is k0 after which λk = λk0 . By adding and

subtracting 2µk∇f(xk) + TT sk + 4µkTT (λk0 − yk) and rearranging terms, we can write

TT sk = 2µk(∇f(xk) + 1µkTT (sk + µk(2λk0 − yk)))

− 2µk∇f(xk) − TT (sk + 2µk(λk0 − yk)) − 2µkTTλk0 .

Taking the norm of each side and applying the triangle inequality, (4.26) and (4.23) yields

‖TT sk‖ ≤ 2µkεk + 2µk‖∇f(xk)‖+ ‖TT ‖εk + 2µk‖TTλk0‖. (4.24)

This inequality implies that TT sk → 0 because ∇f(xk) is bounded, εk → 0, and µk → 0. Since

T has full row rank, TT has full column rank and it follows that sk → 0.

Substituting sk → 0 and ∇f(xk) +TT yk → 0 into the first row of (4.23) and applying (4.26)

implies λk → yk. Thus, sk → rk, implying that the iterates asymptotically drive the primal

residual rk to zero, thereby completing the proof.

Remark 4.3.4. Despite the assumption that ∇f(xk) is bounded, Theorem 4.3.3 can be used

to ensure global convergence whenever f is strongly convex. We show in Lemma 4.5.1 in Sec-

tion 4.5.2 that a bounded set Cf containing the optimal point can be identified a priori. One

82

can thus artificially bound ∇f(x) for all x 6∈ Cf to satisfy the conditions of Theorem 4.3.3 and

guarantee global convergence to the solution of (2.1).

Algorithm 3 Second order primal-dual algorithm for nonsmooth composite optimization.

input: Initial point w0 = (x0, y0), and parameters η ∈ (0, 1), β ∈ (0, 1), τa, τb ∈ (0, 1), εk ≥ 0such that εk → 0.initialize: Set λ0 = y0.

Step 1: If‖sk‖ ≤ η‖sk−1‖ (4.25)

go to Step 2a. If not, go to Step 2b.Step 2a: Set

µk+1 = τaµk, λk+1 = yk

Step 2b: Setµk+1 = τbµ

k, λk+1 = λk

Step 3: Using a backtracking line search, perform a sequence of inner iterations to choosewk+1 until

‖∇M2µk+1(wk+1, λk+1)‖ ≤ εk (4.26)

where the search direction d is obtained using (4.20)–(4.22) with

σ = 0,(wk)T∇M2µk+1(wk)

‖∇M2µk+1(wk)‖2≤ −β, (4.27a)

σ ∈ (0, 1], otherwise. (4.27b)

Asymptotic convergence rate

The invertibility of the generalized Hessian ∂2Lµ(w) allows us to establish local convergence

rates for the second order updates (4.5) when proxµg is (strongly) semismooth.

We now show that the updates in Algorithm 3 are equivalent to the second order updates (4.5)

as k → ∞. Thus, if proxµg is (strongly) semismooth, the sequence of iterates generated by

Algorithm 3 converges (quadratically) superlinearly to the optimal point in some neighborhood

of it.

Theorem 4.3.5. Let the conditions of Theorem 4.3.3 hold, let proxµg be (strongly) semismooth,

and let εk be such that ‖wk − w?‖ = O(εk). Then, in a neighborhood of the optimal point w?,

the iterates wk converge (quadratically) superlinearly to w?.

Proof. From Theorem 4.3.3, λk → yk and thus ∇M2µ(wk) → ∇Lµ(wk). Descent of the Lya-

punov function in Theorem 4.2.1 therefore implies that the update in Step 3 of Algorithm 3 is

83

given by (4.27a), which is equivalent to (4.5) because λ = y. The assumption on εk in con-

junction with Proposition 4.1.1, and [182, Theorem 3.2] imply that this update asymptotically

satisfies (4.26) in one iteration with a unit step size. Therefore, Step 3 reduces to (4.5) in some

neighborhood of the optimal solution and Proposition 4.1.1 implies that wk converges to w?

(quadratically) superlinearly.

Efficient computation of the Newton direction

When g is (block) separable, the matrix P in (4.4) is (block) diagonal. We next demonstrate

that the solution to (4.22) can be efficiently computed when T = I and P is a sparse diagonal

matrix whose entries are either 0 or 1. The extensions to a low rank P , to a P with entries

between 0 and 1, or to a general diagonal T follow from similar arguments.

These conditions occur, for example, when g(z) = γ‖z‖1. The matrix P is sparse when

proxµg(x + µy) = Sγµ(x + µy) is sparse. Larger values of γ are more likely to produce a

sequence of iterates wk for which P is sparse and thus the second order search directions (4.22)

are cheaper to compute.

We can write (4.22) as H I

I − P −µP

x

y

=

ϑ

θ

, (4.28a)

permute it according to the entries of P which are 1 and 0, respectively, and partition the

matrices H, P , and I − P conformably such that

H =

H11 H12

HT12 H22

, P =

I

0

.Let v denote either the primal variable x or the dual variable y. We use v1 to denote the

subvector of v corresponding to the entries of P which are equal to 1 and v2 to denote the

subvector corresponding to the zero diagonal entries of P .

Note that (I − P )v = 0 when v2 = 0 and Pv = 0 when v1 = 0. As a result, x2 and y1 are

explicitly determined by the bottom row of the system of equations (4.28a), 0 −µII 0

x2

y1

=

θ1

θ2

(4.28b)

84

Substitution of the subvectors x2 and y1 into (4.28a) yields,

H11x1 = ϑ1 + H12x2 + y1 (4.28c)

which must be solved for x1. Finally, the computation of y2 requires only matrix-vector products,

y2 = − (ϑ2 + H21x1 + H22x2) . (4.28d)

Thus, the major computational burden in solving (4.22) lies in performing a Cholesky factor-

ization to solve (4.28c), where H11 is a matrix of a much smaller size than H.

4.4 Computational experiments

In this section, we illustrate the merits and the effectiveness of our approach. We first apply our

algorithm to the `1-regularized least squares problem and then study a system theoretic problem

of controlling a spatially-invariant system. All computations were performed in Matlab R2014b

on a 2012 Macbook Air with a 2GHz Intel Core i7 processor and 8GB of RAM. All random

computational experiments are averaged over 20 trials.

4.4.1 Example: `1-regularized least squares

The LASSO problem (2.4) regularizes a least squares objective with a γ-weighted `1 penalty, As

described in Section 2.2.2, the associated proximal operator is given by soft-thresholding Sγµ,

the Moreau envelope is the Huber function, and its gradient is the saturation function. Thus,

P ∈ P is diagonal and Pii is 0 when |xi + µyi| < γµ, 1 outside this interval, and between 0 and

1 on the boundary. Larger values of the regularization parameter γ induce sparser solutions for

which one can expect a sparser sequence of iterates. Note that we require strong convexity of

the least squares penalty; i.e., that ATA is positive definite.

In Fig. 4.1, we show the distance of the iterates from the optimal for the standard proximal

gradient algorithm ISTA, its accelerated version FISTA, and our customized second order primal-

dual algorithm for a problem where ATA has condition number 3.26 × 104. We plot distance

from the optimal point as a function of both iteration number and solve time. Although our

method always requires much fewer iterations, it is most effective when γ is large. In this case

the most computationally demanding step (4.28c) required to determine the second order search

direction (4.22) involves a smaller matrix inversion; see Section 4.3.2 for details. In Fig. 4.2,

85

γ = 0.15γmax γ = 0.85γmax

‖x−x?‖

iteration iteration

‖x−x?‖

time (s) time (s)

Figure 4.1: Distance from optimal solution as a function of the iteration number and solve timewhen solving LASSO for two values of γ using ISTA, FISTA, and our algorithm (2ndMM).

86

com

pu

tati

on

tim

e(s

)

Percent of γmax

Figure 4.2: Solve times for LASSO with n = 1000 obtained using ISTA, FISTA, and ouralgorithm (2ndMM) as a function of the sparsity-promoting parameter γ.

γ = 0.15γmax γ = 0.85γmax

com

pu

tati

on

tim

e(s

)

n n

Figure 4.3: Comparison of our algorithm (2ndMM) with state-of-the-art methods for LASSOwith problem dimension varying from n = 100 to 2000.

87

we show the solve times for n = 1000 as the sparsity-promoting parameter γ ranges from 0

to γmax = ‖AT b‖, where γmax yields a zero solution. All numerical experiments consist of

20 averaged trials.

In Fig. 4.3, we compare the performance of our algorithm with the LASSO function in

Matlab (a coordinate descent method [186]), SpaRSA [187], an interior point method [188], and

YALL1 [189]. Problem instances were randomly generated with A ∈ Rm×n, n ranging from 100

to 2000, m = 3n, and γ = 0.15γmax or 0.85γmax. The solve times and scaling of our algorithm is

competitive with these state-of-the-art methods. For larger values of γ, the second order search

direction (4.22) is cheaper to compute and our algorithm is the fastest.

4.4.2 Example: Distributed control of a spatially-invariant system

We now apply our algorithm to a structured control design problem aimed at balancing closed-

loop H2 performance with spatial support of a state-feedback controller. Following the problem

formulation of [64], ADMM was used in [106,190] to design sparse feedback gains for spatially-

invariant systems. Herein, we demonstrate that our algorithm provides significant computational

advantage over both the ADMM algorithm and a proximal Newton scheme.

Spatially-invariant systems

Let us consider

ψ = Aψ + u + d

ζ =

Q1/2 ψ

R1/2 u

(4.29)

where ψ, u, d, and ζ are the system state, control input, white stochastic disturbance, and

performance output and A, Q 0, and R 0 are n × n circulant matrices. Such systems

evolve over a discrete spatially-periodic domain; they can be used to model spatially-invariant

vehicular platoons [97] and can result from a spatial discretization of fluid flows [38].

Any circulant matrix can be diagonalized via the discrete Fourier transform (DFT). Thus,

the coordinate transformation ψ := T ψ, u := T u, d := T d, where T−1 is the DFT matrix, brings

the state equation in (4.29) into,

˙ψ = A ψ + u + d. (4.30)

Here, A := T−1AT is a diagonal matrix whose main diagonal a is determined by the DFT of

88

the first row a of the matrix A,

ak :=

n−1∑i= 0

ai e−j2πikn , k ∈ 0, . . . , n− 1.

We are interested in designing a structured state-feedback controller, u = −Zψ, that min-

imizes the closed-loop H2 norm, i.e., the variance amplification from the disturbance d to the

regulated output ζ. Since the optimal unstructured Z for spatially-invariant system (4.29) is a

circulant matrix [9], we restrict our attention to circulant feedback gains Z. Thus, Z can also

be diagonalized via a DFT and we equivalently take x := z as our optimization variable where

T−1ZT = diag (z).

For simplicity, we assume that A and Z are symmetric. In this case, a and z are real vectors

and the closed-loop H2 norm of system (4.29) takes separable form f2(x) =∑n−1k= 0 f

k(xk),

fk(xk) =

qk + rkx

2k

2 (xk − ak), xk > ak

∞, otherwise

where xk > ak guarantees closed-loop stability. To promote sparsity of Z, we consider a regu-

larized optimization problem (2.5),

minimizex, z

f2(x) + γ ‖z‖1

subject to Tx − z = 0(4.31a)

where γ is a positive regularization parameter, T is the inverse DFT matrix, and z ∈ Rn denotes

the first row of the symmetric circulant matrix Z. Formulation (4.31a) signifies that while it

is convenient to quantify the H2 norm in the spatial frequency domain, sparsity has to be

promoted in the physical space.

By solving (4.31a) over a range of γ, we identify distributed controller structures which are

specified by the sparsity pattern of the solutions z?γ to (4.31a) at different values of γ. After

selecting a controller structure associated with a particular value of γ, we solve a ‘polishing’ or

‘debiasing’ problem,

minimizex,z

f2(x) + Isp(z?γ)(z)

subject to Tx − z = 0(4.31b)

where Isp(z?γ)(z) is the indicator function associated with the sparsity pattern of z?γ . The solution

to this problem is the optimal controller for system (4.29) with the desired structure, i.e., the

89

same sparsity pattern as z?γ . This step is necessary because the `1 norm in (4.31a) imposes an

additional penalty on z that compromises closed-loop performance.

Implementation

The elements of the gradient of f are

dfk(xk)

dxk=

rkx2k − 2akrkxk − qk2 (xk − ak)2

the Hessian ∇2f2 is a diagonal matrix with non-zero entries,

d2fk(xk)

dx2k

=qk + rka

2k

(xk − ak)3

and the proximal operator associated with the nonsmooth regularizer in (4.31a) is given by

soft-thresholding Sγµ.

While the optimal unstructured controller can be obtained by solving n uncoupled scalar

quadratic equations for xk, sparsity-promoting problem (4.31a) is not in a separable form (be-

cause of the linear constraint) and computing the second order update (4.5) requires solving a

system of equations H T ∗

(I − P )T −µP

x

y

=

ϑ

θ

(4.32)

Pre-multiplying by the matrix blkdiag(T, I) and changing variables to solve for z := T x brings (4.5)

into T H T−1 1n I

I − P −µP

z

y

=

Tϑ

θ

which is of the same form as (4.28a). This equation can be solved efficiently when P is sparse

(i.e., γµ is large; cf. Section 4.3.2) and x can be recovered from z via FFT.

Since the Hessian H of a separable function f is a diagonal matrix, the search direction can

also be efficiently computed when I − P is sparse (i.e., γµ is small). As in Section 4.3.2, the

component of y in the support of P is determined from the bottom row of (4.32). The top row

of (4.32) implies x = H−1(ϑ− T ∗y) and substitution into the bottom row yields

(I − P )T H−1 T ∗ (I − P ) y = θ

where θ := (I − P )(TH−1(ϑ − T ∗P y) − θ) is a known vector. Thus, the component of y in

90

the support of I − P can be determined by inverting a matrix whose size is determined by the

support of I−P and x is readily obtained from y and ϑ. The operations involving T and T ∗ can

be performed via FFT; since H is diagonal, multiplication by these matrices is cheap and the

computational burden in solving (4.5) again arises from a limited matrix inversion. In contrast

to Section 4.3.2, the computation of the search direction using this approach is efficient when

I − P is sparse, i.e., γµ is small.

Swift-Hohenberg equation

We consider the linearized Swift-Hohenberg equation [191],

∂tψ(t, ξ) = (cI − (I + ∂ξξ)2)ψ(t, ξ) + u(t, ξ) + d(t, ξ)

with periodic boundary conditions on a spatial domain ξ ∈ [−π, π]. Finite-dimensional approxi-

mation and diagonalization via the DFT (with an even number of Fourier modes n) yields (4.30)

with ak = c− (1− k2)2 where k = −n/2 + 1, . . . , n/2 is a spatial wavenumber.

Figure 4.4a shows the optimal centralized controller and solutions to (4.31a) for c = −0.01,

n = 64, Q = R = I, and γ = 4× 10−4, 4× 10−3, and 4. As further illustrated in Fig. 4.4b, the

optimal solutions to (4.31a) become sparser as γ is increased.

In Fig. 4.5, we demonstrate the utility of using regularized problems to navigate the tradeoff

between controller performance and structure, an approach pioneered by [64]. The polished op-

timal structured controllers (solid blue ——) were designed by first solving (4.31a) to identify

an optimal structure and then solving (4.31b) to further improve the closed-loop performance. To

illustrate the importance of polishing step (4.31b), we also show the closed-loop performance of

unpolished optimal structured controllers (dashed red - -×- -) resulting from (4.31a). Finally,

to evaluate the controller structures identified by (4.31a), we show the closed-loop performance of

polished ‘reference’ structured controllers (dotted yellow · · ·+ · · · ). Instead of solving (4.31a),

the reference structures are a priori specified as nearest neighbor symmetric controllers of the

same cardinality as controllers resulting from (4.31a). Among controllers with the same num-

ber of nonzero entries, the polished optimal structured controller consistently achieves the best

closed-loop performance.

We compare the computational efficiency of our approach with the proximal Newton

method [173] and ADMM [106, 190]. The proximal Newton method requires solving a LASSO

subproblem (4.3), for which we employ SpaRSA [187]. Since A is circulant, the x-minimization

step in ADMM (2.6) requires solving n uncoupled cubic scalar equations. In general, when A

91

is block circulant, the DFT only block diagonalizes the dynamics and thus the x-minimization

step has to be solved via an iterative procedure [106,190].

Figure 4.6 shows the time to solve (4.31a) with γ = 0.004 using our method, proximal

Newton, and ADMM. Our algorithm and ADMM were stopped when the primal and dual

residuals were below 1 × 10−8. The proximal Newton method was stopped when the norm of

the difference between two consecutive iterates was smaller than 1× 10−8. In Fig. 4.7, we show

the per iteration cost and the total number of iterations required to find the optimal solution

using each method. Our algorithm clearly outperforms proximal Newton and ADMM.

Although proximal Newton requires a similar number of iterations, the LASSO subprob-

lem (4.3) that determines its search direction is much more expensive; this increases the compu-

tation cost of each iteration and slows the overall algorithm. Moreover, for larger problem sizes,

the proximal Newton method struggles with finding a stabilizing search direction because ∇2f2

seems to bring it away from the set of stabilizing feedback gains. It appears that our method

circumvents this issue because its iterates lie in a larger lifted space in which stability is easier

to enforce via backtracking.

On the other hand, while the x- and z- minimization steps in ADMM are quite efficient, as a

first order method ADMM requires a large number of iterations to reach high-accuracy solutions.

Our algorithm achieves better performance because its use of second order information leads to

relatively few iterations and the structured matrix inversion leads to efficient computation of

the search direction.

4.5 Additional considerations

4.5.1 Algorithm based on V (w) as a merit function

Since Theorem 4.2.4 establishes global convergence of the differential inclusion, one algorithmic

approach is to implement a Forward Euler discretization of differential inclusion (4.7). A natural

choice of merit function is the Lyapunov function V defined in (4.8). A simple corollary of

Theorem 4.2.1 shows that the second order update (4.5) is a descent direction for V .

Corollary 4.5.0.1. The second order update (4.5) is a descent direction for the merit function

V defined in (4.8).

Proof. Follows from (4.9) in Theorem 4.2.1.

Corollary 4.5.0.1 enables the use of a backtracking Armijo rule for step size selection. A

92m

idd

lero

wofZ

(a)

spars

ity

level

ofz? γ

γ

(b)

Figure 4.4: (a) The middle row of the circulant feedback gain matrix Z; and (b) the sparsitylevel of z?γ (relative to the sparsity level of the optimal centralized controller z?0) resulting fromthe solutions to (4.31a) for the linearized Swift-Hohenberg equation with n = 64 Fourier modesand c = −0.01.

natural choice of stopping criterion for such an algorithm is a condition on the size of the primal

and dual residuals. Moreover, proposition 4.1.1 suggests fast asymptotic convergence when

proxµg is semismooth. The LASSO example in Fig. 4.8 verifies this intuition when solving

LASSO to a threshold of 1× 10−8 for the primal and dual residuals.

Algorithm 4 Second order primal-dual algorithm for nonsmooth composite optimization basedon discretizing (4.7).

input: Initial point x0, y0, backtracking constant α ∈ (0, 1), Armijo parameter σ ∈ (0, 1),and stopping tolerances ε1, ε2.While: ‖Txk − proxµg(Tx

k + µyk)‖ > ε1 or

‖∇f(xk) − TT yk‖ > ε2

Step 1: Compute wk as defined in (4.5)Step 2: Choose the smallest j ∈ Z+ such that

V (wk + αjwk) ≤ V (wk) − σ αj ‖∇Lµ(wk)‖2

Step 3: Update the primal and dual variables

wk+1 = wk + αjwk

However, such an implementation would require a fixed penalty parameter µ, which typically

has a large effect on the convergence speed of augmented Lagrangian algorithms and is difficult

to select a priori . Moreover, stability of the solution to a differential equation does not always

93

per

form

an

ced

egra

dati

on

(in

per

cents

)

sparsity level of z (in percents)

Figure 4.5: Performance degradation (in percents) of structured controllers relative to the op-timal centralized controller: polished optimal structured controller obtained by solving (4.31a)and (4.31b) (solid blue ——); unpolished optimal structured controller obtained by solvingonly (4.31a) (dashed red - -×- -); and optimal structured controller obtained by solving (4.31b)for an a priori specified nearest neighbor reference structure (dotted yellow · · ·+ · · · ).

94

tim

e

n

Figure 4.6: Total time to compute the solution to (4.31a) with γ = 0.004 using our algorithm(2ndMM), proximal Newton, and ADMM.

95avg

iter

ati

on

tim

e

n

(a)

nu

mb

erof

iter

ati

on

s

n

(b)

Figure 4.7: Comparison of (a) times to compute an iteration (averaged over all iterations); and(b) numbers of iterations required to solve (4.31a) with γ = 0.004.

imply stability of its discretization.

An example

We implement Algorithm 4 to solve the LASSO problem (2.4) studied in Section 4.4. The

LASSO problem was randomly generated with n = 500, m = 1000, and γ = 0.85γmax. Figure 4.8

illustrates the quadratic asymptotic convergence of Algorithm 4 and a strong influence of µ.

4.5.2 Bounding ∇f(x) for Algorithm 3

A bounded set Cf containing the solution to (2.1) can always be identified from any point x,

∇f(x), any element of the subgradient ∂g(x), and a lower bound on the parameter of strong

convexity.

Lemma 4.5.1. Let Assumption 1 hold and let f be mf -strongly convex. Then, for any x, the

optimal solution to (2.1) lies within the ball of radius 2mf‖∇f(x) + TT∂g(T x)‖ centered at x.

Proof. Given any point x, strong convexity of f and convexity of g imply that,

f(x) + g(Tx) − f(x) − g(T x) ≥⟨∇f(x) + TT∂g(T x), x − x

⟩+

mf2 ‖x − x‖2 (4.33)

for all x where ∂g(T x) is any member of the subgradient of g(T x). For any x with ‖x − x‖ ≥2mf‖∇f(x) + TT∂g(T x)‖, the right-hand side of (4.33) must be nonnegative which implies that

f(x) + g(Tx) ≥ f(x) + g(T x) and thus that x cannot solve (2.1).

96

‖x−x?‖

Iteration

Figure 4.8: Distance from the optimal solution as a function of iteration number when solvingLASSO using Algorithm 4 for different values of µ.

For any strongly convex function f and convex function g, a function f can be selected such

that

argmin f(x) + g(Tx) = argmin f(x) + g(Tx).

Here, f is identical to f in some closed set Cf containing x?,

f(x) :=

f(x), x ∈ Cf

f(x), x 6∈ Cf

and f(x) is chosen such that f(x) is convex and twice differentiable, ∇f(x) is uniformly bounded,

and ∇2f(x) 0. Lemma 4.5.1 implies that a set Cf that contains the optimal solution x?

of (2.1) can be identified a priori from any given point x. Since f(x) satisfies all the conditions

of Theorem 4.3.3, Algorithm 3 can be used to solve (2.1) and, since Cf contains x?, its optimal

solution will also solve (2.1).

97

An example

When x ∈ R, f(x) = 12x

2, and Cf = [−1, 1], a potential choice of f(x) is given by,

f(x) =

−2x + ex+1 − 2.5, x ≤ −1

2x + e−x+1 − 2.5, x ≥ 1.

For this choice of f , the gradient of f is continuous and bounded,

∇f(x) =

−2 + ex+1, x ≤ −1

x, x ∈ [−1, 1]

2− e−x+1, x ≥ 1

and the Hessian of f is determined by

∇2f(x) =

ex+1, x ≤ −1

1, x ∈ [−1, 1]

e−x+1, x ≥ 1.

Chapter 5

Connections with other methods

and discussion

The proximal augmented Lagrangian Lµ(x; y) is obtained by constraining Lµ(x, z; y) to the

manifold

Z := (x, z?µ; y) | z?µ = argminz

Lµ(x, z; y)

= (x, z?µ; y) | Tx + µy ∈ z?µ + µ∂Cg(z?µ)

which results from the explicit minimization over the auxiliary variable z. Herein, we interpret

the second order search direction as a linearized update to the KKT conditions for problem (2.5)

and discuss connections to the alternative algorithms.

5.1 Second order updates as linearized KKT corrections

The second order update (4.5) can be viewed as a first order correction to the KKT conditions

for optimization problem (2.5),

0 = ∇f(x) + TT y

0 = Tx − z

0 ∈ ∂Cg(z) − y.

(5.1)

Substitution of z?µ into (5.1) makes the last two conditions redundant; when combined with the

definition of the manifold Z, Tx = z implies y ∈ ∂Cg(z) and y ∈ ∂Cg(z) implies Tx = z.

98

99

Multiplying equation (4.5) for the second order update with the nonsingular matrix (4.21)

recovers the equivalent expression H TT

(I − P )T −µP

xk

yk

= −

∇f(xk) + TT yk

rk

(5.2)

where rk := Txk − z?µ(xk; yk) = Txk − proxµg(Txk + µyk) is the primal residual of (2.5).

Thus, (5.2) describes a first order correction to the first and second KKT conditions in (5.1).

5.2 Connections with other methods

We now discuss broader implications of our framework and draw connections to the existing

methods for solving versions of (2.1). Many techniques for solving composite minimization

problems of the form (2.1) can be expressed in terms of functions embedded in the augmented

Lagrangian; see Table 5.1. Trivially, the objective function in (2.1) corresponds to Lµ(x, z; y)

over the manifold z = Tx. The Lagrange dual of a problem equivalent to (2.5),

minimizex, z

f(x) + g(z) + 12µ ‖Tx − z‖2

subject to Tx − z = 0(5.3)

is recovered by collapsing Lµ(x, z; y) onto the intersection of the z-minimization manifold Z and

the x-minimization manifold,

X := (x?, z; y) | x? = argminx

Lµ(x, z; y)

= (x?, z; y) | µ (∇f(x?) + TT y) = TT (z − Tx?).

The Lagrange dual of (2.5) is recovered from the Lagrange dual of (5.3) in the limit µ→∞.

5.2.1 MM and ADMM

MM implements gradient ascent on the dual of (5.3) by collapsing Lµ(x, z; y) computationally

onto X ∩Z. The joint (x, z)-minimization step in (2.7) evaluates the Lagrange dual at discrete

iterates yk by finding the corresponding (x, z)-pair on X ∩ Z; i.e., the iterate (xk+1, zk+1; yk)

generated by (2.7) lies on the manifold X ∩Z. Note that, in this form, joint (x, z)-minimization

is a challenging nondifferentiable optimization problem.

ADMM avoids this challenge by collapsing Lµ(x, z; y) onto X and Z separately. While

100

the underlying x- and z-minimization subproblems in ADMM are relatively simple, the iterate

(xk+1, zk+1; yk) generated by (2.6) does not typically lie on the X ∩ Z manifold. Thus ADMM

does not represent gradient ascent on the dual of (5.3), causing looser theoretical guarantees

and often worse practical performance.

By collapsing the augmented Lagrangian onto Z, the proximal augmented Lagrangian (2.8)

allows us to express the (x, z)-minimization step in MM as a tractable continuously differen-

tiable problem (cf. Theorem 2.3.1). This avoids challenges associated with ADMM and it does

not increase computational complexity in the x-minimization step in (2.7) relative to ADMM

when using first order methods. We finally note that unlike the Rockafellar’s proximal method

of multipliers [192] which applies the proximal point algorithm to the primal-dual optimality

conditions, our framework reformulates the standard method of multipliers and develops second

order algorithm to solve nonsmooth composite optimization problems.

5.2.2 First order methods

A special instance of our framework has strong connections with the existing methods for dis-

tributed optimization on graphs; e.g., [146, 147, 193]. The networked optimization problem of

minimizing f(x) =∑f i(x) over a single variable x can be reformulated as

∑f i(xi) + g(Tx)

where the components f i of the objective function are distributed over independent agents xi,

x is the aggregate state, TT is the incidence matrix of a strongly connected and balanced graph,

and g is the indicator function associated with the set Tx = 0. The g(Tx) term ensures that at

feasible points, xi = xj = x for all i and j. It is easy to show that ∇Mµg(v) = 1µ v and that the

primal-descent dual-ascent dynamics 3.1 are given by,

x = −∇f(x) − 1µ Lx − y

˙y = Lx(5.4)

where y := TT y and L := TTT is the graph Laplacian. The only difference relative to [193,

Equation 11] is that −y appears instead of −Ly in the equation for the dynamics of the primal

variable x.

We also include a comparison with the EXTRA algorithm [194]. The discretized version of

the first order primal-dual dynamics (5.4) is equivalent to

xk+1 = xk − αx(∇f(xk) + 1µLx

k + yk)

yk+1 = yk + αyLxk

(5.5)

101

x z y function methods

z = Tx - objective function of (2.1) subgradient iteration [134]- - - augmented Lagrangian ADMM [64,132]X Z - Lagrange dual of (5.3) Dual ascent (if explicit

expression available)- Z - proximal augmented Lagrangian MM [107], Arrow-Hurwicz-Uzawa [107],

Second order primal-dual- Z Y Forward-Backward Envelope Forward-Backward Truncated

Newton [176], related toproximal gradient [176, Section 2.1]

Table 5.1: Summary of different functions embedded in the augmented Lagrangian of (2.5) andmethods for solving (2.1) based on these functions.

where y := TT y, L := TTT is the graph Laplacian, and αx and αy are the primal and dual step

sizes, respectively. By noting that that yk =∑k−1t=0 (W − W )xt, the EXTRA algorithm [194,

Equation (2.13)],

xk+1 = Wxk − α∇f(xk) + 1µLx

k +

k−1∑t=0

(W − W )xt

is recovered exactly from (5.5) by setting αx = α and αy = α2µ and taking W = I − α

µL,

W = 12 (I + W ). Of course, EXTRA does not require that W be stated in the form of a

Laplacian, but we can take T = (I −W )1/2 and still recover EXTRA.

5.2.3 Second order methods

We identify the saddle point of Lµ(x, z; y) by forming second order updates to x and y along

the Z manifold. Just as Newton’s method approximates an objective function with a convex

quadratic function, we approximate Lµ(x; y) with a quadratic saddle function.

Constraining the dual variable itself yields connections with other methods. When T =

I, (5.1) implies that the optimal dual variable is given by y? = −∇f(x?), so it is natural to

collapse the augmented Lagrangian onto the manifold

Y := (x, z; y?) | y? = −∇f(x).

The augmented Lagrangian over the manifold Z ∩Y corresponds to the Forward-Backward En-

velope (FBE) introduced in [176]. The proximal gradient algorithm with step size µ can be

102

recovered from a variable metric gradient iteration on the FBE [176]. In [176–178], the approx-

imate line search and quasi-Newton methods based on the FBE were developed to solve (2.1)

with T = I. Since the Hessian of the FBE involves third order derivatives of f , these techniques

employ either truncated- or quasi-Newton methods.

The approach advanced in the current paper applies a second order method to the aug-

mented Lagrangian that is constrained over the larger manifold Z. Relative to alternatives, our

framework offers several advantages. First, while the FBE is in general a nonconvex function of

x, Lµ(x; y) is always convex in x and concave in y. Furthermore, we can compute the Hessian

exactly using only second order derivatives of f and its structure allows for efficient computation

of the search direction. Finally, our formulation allows us to leverage recent advances in second

order methods for augmented Lagrangian methods, e.g., [162–164].

Part II

Structured optimal control

103

Chapter 6

Symmetry and spatial invariance

In this chapter, we propose a principled convex approach for structured H2 and H∞ optimal

controller design. We consider problems where the dynamical generator is an affine function of

the design variable. When the dynamical generator is symmetric, we show that the H2 and H∞norms of the closed-loop system are convex functions of the design variable. This allows us to

impose regularization penalties directly on the design variable in order to promote structure.

We then show that even when the dynamical generator is not symmetric, its symmetric

component can be used to provide an upper bound on its H2 and H∞ norm. Thus, we can

design structured controllers by solving the convex design problem for the symmetric component

of the system and implement the resulting controllers on the original system. We show that this

procedure guarantees stability and a level of H2/H∞ performance.

Although convex, the H2 and H∞ norms for symmetric systems are expressed using linear

matrix inequalities (LMIs) and the semidefinite programs (SDPs) that result from the H2 and

H∞ optimal control problems do not scale favorably with the problem dimension. To address

this challenge, we develop a customized ADMM algorithm and provide a procedure to gain com-

putational advantage when the system and its controller are jointly block-diagonalizable. Such

a structure arises, for example, in spatially-invariant systems [9]. In [190], we took advantage of

this property to develop an efficient and scalable algorithm for sparsity-promoting H2 optimal

feedback design.

104

105

6.1 Symmetric systems

We consider the class of systems (1.1) with R(x) = 0 and B1 = I,

ψ = (A + F (x))ψ + d

ζ = ψ.(6.1)

Instead of limiting the standard measure of control effort, we regularize x directly by imposing

a quadratic penalty, xTRx with R 0, to limit the magnitude of x, and an `1 penalty, ‖x‖1 :=∑i |xi|, to promote sparsity in x. This is because in many motivating examples for this section,

it is of interest to regulate the size of the controller x itself and not its effect on the system state,

R(x)ψ. For example, in the combination drug therapy example first described in Section 1.2.1,

it makes sense to penalize the size of the drug dose but not the quantity of virus each drug is

killing.

Under the above assumptions, we can design controllers for (6.1) by designing controllers for

ψ = (As + Fs(x))ψ + d

ζ = ψ(6.2)

where As := (A + AT )/2 is the symmetric part of A and Fs(x) := (F (x) + FT (x))/2. Since

any matrix A can be decomposed into its symmetric As and antisymmetric components, any

system (6.1) can be placed into the form of (6.2)

We first show that the H2 and H∞ optimal control problems for (6.2) are convex in x and

thus that we can impose regularization direclty on the controller. We then show that stability

of the symmetric system (6.2) implies stability of the corresponding original system (6.1) and

that the H2 and H∞ norms of the symmetric system (6.2) are an upper bound on the H2 and

H∞ norms of the original system (6.1). We then show that when the difference between A and

As of the order ε, the H2 and H∞ norms of systems (6.1) and (6.2) differ only by O(ε2).

We note that this method is more appropriate when F (x) is symmetric. If Fs(x) 6= F (x),

the neglected effect of the asymmetric component of F (x) makes the degree of conservatism

unpredictable.

6.1.1 Convex formulation

The H2 and H∞ norms of (6.2) can be expressed in a convex form. Although more general

symmetric systems can be cast as convex problems, here we assume B = C = I and R(x) = 0

106

to facilitate the transition to the discussion of spectral properties and performance bounds.

H2-optimal control

The H2 norm of system (6.2) is given by trace (Ps) where,

(As + Fs(x)

)Ps + Ps

(As + Fs(x)

)+ I = 0.

Since As = ATs is symmetric, the controllability gramian of system (6.2) can be explicitly

expressed as,

Ps = − 12 (As + Fs(x))

−1

and, by taking a Schur complement, the regularized optimal H2 control problem can be cast in

a convex function of x and an auxiliary variable Θ,

minimizex,Θ

12 trace(Θ) + g(x)

subject to

Θ I

I −(As + Fs(x))

0.(6.3)

The matrix As + Fs(x) is always invertible when it is Hurwitz. We note the structured LQR

problem (i.e., R(x) = R1/2Fs(x)) for symmetric systems can also be expressed as an SDP by

taking the Schur complement of Fs(x)RFs(x). Finally, we drop the constant factor 1/2 for

compactness.

H∞-optimal control

To formulate the H∞ optimal control problem as an SDP, we first state a simple lemma about

system (6.2).

Proposition 6.1.1. The disturbance that achieves the maximum induced L2 amplification for

system (6.2) corresponds to the constant signal d(t) = v where v is the right principal singular

vector of A−1s . I.e., the peak of the frequency response of system (6.2) occurs at the temporal

frequency ω = 0.

Proof. A symmetric matrix can be diagonalized as As = UΛUT where Λ is a diagonal matrix

with the eigenvalues of As on the main diagonal and the columns of U contain the corresponding

107

eigenvectors. For such a matrix,

(jωI − As)−1 = U diag

1

jω−λi

UT .

It is clear that ω = 0 maximizes the singular values of the above matrix. Thus, the H∞ norm

of (6.2) can be characterized by σmax

(−(As + Fs(x))−1

).

The regularized H∞-optimal control problem for symmetric systems can therefore be ex-

pressed as,

minimizex,Θ

σmax(Θ) + g(x)

subject to

Θ I

I −(As + Fs(x))

0.

(6.4)

As we show in the next section, this convex problem can be used for structured H∞ control

design. This is particularly advantageous because many of the existing algorithms for general

structured H2 control cannot be extended to the structured H∞ problem because the H∞ norm

is not differentiable in general.

In Section 6.1.2, we show that stability of the symmetric system (6.2) implies stability of

the corresponding original system (6.1). Furthermore, the H2 and H∞ norms of the symmetric

system are upper bounds on the H2 and H∞ norms of the original system. Finally, when the

difference between A and As is small (of the order ε, O(ε)), the H2/H∞ norms of systems (6.1)

and (6.2) differ only by O(ε2).

This formulation can be applied to the design of edges in consensus networks, leader selection,

and combination drug therapy design for HIV.

6.1.2 Stability and performance

We now justify the use of the symmetric component (6.2) to design controllers for the original

system (6.1). First, we show that the stability of the symmetric system implies stability of the

full system.

Lemma 6.1.1. Let the symmetric part of A be Hurwitz. Then, A is Hurwitz.

Proof. We show this by contradiction. Since the symmetric part of A, As := (A + AT )/2 is

symmetric and Hurwitz, it is negative definite,

v∗Asv < 0 for all v 6= 0.

108

Assume that A is not Hurwitz. Then there is a v such that Av = λv with <(λ) ≥ 0. Furthermore,

v∗Av = λv∗v. However,

v∗Av = v∗Asv + 12 v∗ (A − AT ) v

= v∗Asv + j Im(λ) ‖v‖22.

Since As ≺ 0, v∗Asv cannot have a nonnegative real part.

Remark 6.1.2. Lemma 6.1.1 is not a necessary condition; A may be Hurwitz even if As is not.

6.1.3 Performance bounds

We next show that the H2 norm of the symmetric system (6.2) is an upper bound on the H2

norm of the system (6.1). First, we present a useful theorem from linear algebra [195].

Theorem 6.1.3 (Theorem IX.3.1 in [195]). Let A be any matrix and let As = 12 (A + AT ) be

the symmetric component of A. Then

‖eA‖ ≤ ‖eAs‖

for every unitarily invariant norm.

The statement about the H2 norms of systems (6.1) and (6.2) is a simple corollary of Theo-

rem 6.1.3.

Corollary 6.1.3.1. When the systems (6.1) and (6.2) are stable, the H2 norm of (6.1) is

bounded from above by the H2 norm of system (6.2).

Proof. Recall that the H2 norm of a system,

ψ = Aψ + d

with A Hurwitz is given by trace(Pc) where

APc + PcAT + I = 0

and that X can be expressed as

Pc =

∫ ∞0

eAteAT tdt.

109

Using the linearity of the trace and of integration, we can rewrite the expression for the H2

norm as,

trace

(∫ ∞0

eAteAT tdt

)=

∫ ∞0

‖eAt‖2Fdt.

Since the Frobenius norm is unitarily invariant, by Theorem 6.1.3 ‖eAt‖2F ≤ ‖eAst‖2F for any t

and therefore, ∫ ∞0

‖eAt‖2F dt ≤∫ ∞

0

‖eAst‖2F dt.

Since the right hand side is the H2 norm of system (6.2), this completes the proof.

Note: Theorem 6.1.3 relies on the fact that the sum of the k largest eigenvalues of Ps are larger

than the sum of the k largest eigenvalues of P for any integer k. As a result, Corollary 6.1.3.1

does not extend to a general state penalty matrix Q where the H2 norm would be trace(QX).

For the same reason, the result requires E(ddT ) = I.

We now show that an analogous bound holds for the H∞ norm.

Proposition 6.1.2. When the systems (6.1) and (6.2) are stable, the H∞ norm of the general

system (6.1) is bounded from above by the H∞ norm of the symmetric system (6.2).

Proof. From the bounded real lemma [49], the H∞ norm of the general system (6.1) is less than

ζ if there exists a P 0 such that,

ATP + PA + I + ζ−2P 2 ≺ 0. (6.5)

From Proposition 6.1.1, the H∞ norm of symmetric system (6.2) is ζ = σmax(A−1s ). Taking

P = ζI for any ζ > σmax(A−1s ) satisfies the above LMI; substituting P and As into (6.5) yields,

2ζAs + 2I ≺ 0.

Since As is Hurwitz, As ≺ 0. Since ζ > −λmax(A−1s ), ζ−1 < −λmin(As), so As ≺ −ζ−1I.

Therefore the LMI is satisfied. Since Aa = −ATa , setting P = ζI implies,

ATP + PA = 2ζAs.

Substituting A and P = ζI into (6.5) yields,

ATP + PA + I + ζ−2P 2 = 2ζAs + 2I ≺ 0.

110

6.1.4 Small asymmetric perturbations

We next show that in addition to being an upper bound, the H2 andH∞ norms of the symmetric

and full systems are close when A is nearly normal. In what follows, we show that when a normal

system is subject to an antinormal perturbation of O(ε), the first order correction to the H2

and H∞ norms is zero. A similar but more restrictive result appeared in [117] for the design of

an interconnection graph for synchronizing oscillator networks. We present a result for systems

with normal dynamical generators An. Since a normal matrix An is a matrix that commutes

with ATn , this set includes symmetric matrices As.

Proposition 6.1.3. Let An be a normal matrix. The O(ε) correction to the H2 norm of the

system

x = Anx + d

from an O(ε) antisymmetric perturbation Aa is zero.

Proof. The H2 norm of the above system is given by trace(Pn) where,

AnPn + PnATn + I = 0. (6.6)

From Lemma 1 in [196], Pn = −(An +An)−1. Perturbing An by a small antisymmetric matrix

εAa yields a small correction term εP in the controllability gramian. Collecting the O(ε) terms

from the Lyapunov equation,

(An + εAa)(Pn + εP ) + (Pn + εP )(An + εAa)T + I = 0.

recovers the linear equation,

AnP + PAn + AaPn + PnATa = 0.

The O(ε) correction to the H2 norm is given by,

trace(P ) = trace(Pn (AaPn + PnA

Ta ))

= trace((Aa − Aa)P 2

n

)= 0

111

since Aa = −ATa .

To illustrate this result for a small system, we provide an explicit expression for the H2 norm

of a 2-dimensional system in terms of the corresponding symmetric system.

2-dimensional system

To build intuition, we give the analytical expression for the H2 norm of a 2-dimensional system

in terms of the norm of its symmetric component. The A and As matrices are,

A =

a e+ h

e− h d

, As =

a e

e d

.Solving for Ps = −0.5A−1

s and taking the trace gives the H2 norm of the symmetric system,

trace(Ps) = − a+d2(ad−e2)

Explicitly solving the Lyapunov equation APc + PcAT + I = 0 for P and taking its trace yields

the H2 norm of the original system,

trace(Pc) = − (b−c)2+(a+d)2

2(a+d)(ad−bc) .

With some algebra, the above simplifies to, trace(Ps) = −(a+ d)(2(ad− e2))−1 and,

trace(Pc) = trace(Ps)

((a+ d)2(ad− e2) + 4(ad− e2)h2

(a+ d)2(ad− e) + (a+ d)2h2

).

We show that a similar property holds for the H∞ norm.

Proposition 6.1.4. Let As be a symmetric matrix. The O(ε) correction to the H∞ norm of

the system

x = Asx + d

from an O(ε) antisymmetric perturbation Aa is zero.

Proof. From Proposition 6.1.1, the H∞ norm of the symmetric system is given by σmax(−A−1s ).

The maximum singular value of a matrix is equivalent to,

σmax(X) = sup‖v‖2≤1,‖w‖2≤1

vTXw.

112

Since As is symmetric, w = v. Taking an O(ε) antisymmetric perturbation Aa to the above

expression,

σmax(−(As + εAa)−1) = −vTA−1s v + εvTA−1

s AaA−1s v. + O(ε2)

Since Aa is antisymmetric,⟨A−1s vvTA−1

s , Aa⟩

= 0 and thus the first-order correction is 0.

6.2 Computational experiments

6.2.1 Example: Directed Consensus Network

In this example, we illustrate the utility of the approach described in Section 6.1.2. Consider

the network dynamics given by a directed network as described in Section 1.2.1,

ψ = −(L + E diag(x)ET )ψ

where L is a directed graph Laplacian, F (x) = E diag(x)ET represents the addition of undirected

links, v is a vector that contains weights of these added links, and the incidence matrix E

describes which edges may be added or altered. The regularization on x is given by,

g(x) = ‖x‖22 + γ∑i

|xi|

where the quadratic term limits the size of the edge weights, the `1 norm promotes sparsity of

added links, and γ > 0 parametrizes the importance of sparsity.

For this concrete example, the network topology is given by Figure 6.1. The potential added

edges can connect the following pairs of nodes: (1)− (2), (1)− (3), (1)− (5), (1)− (6), (2)− (5),

(2)− (6), (3)− (6), and (4)− (5).

Controllers were designed by solving problems (6.3) and (6.4) for the symmetric version

of the network over 50 log-distributed values of γ ∈ [10−4, 1]. The closed-loop H2 and H∞norms obtained by applying these controllers to the symmetric and original systems are shown

in Fig. 6.2. Fig. 6.1 also shows which edges were added for γ = 1.

113

ψ1

ψ2ψ3

ψ5

ψ4

ψ6

Figure 6.1: Directed network (solid black — arrows) with added undirected edges (dashedred - - - arrows). Both the H2 and H∞ optimal structured control problems yielded the sameset of added edges. In addition to these edges, the controllers tuned the weights of the edges(1)− (3) and (1)− (5).

f2(x

)

f∞

(x)

Figure 6.2: H2 and H∞ performance of the closed-loop symmetric system and the originalsystem subject to a controller designed at various values of γ.

6.2.2 Example: Combination drug therapy design via symmetric sys-

tems

In this section, we consider the combination drug therapy problem introduced in Section 1.2.1,

ψ = (A + diag (Fxx))ψ + d.

where ψ represents a vector of HIV mutant populations, A contains their evolutionary dynamics,

and Fx contains information about drug treatment. The entry Aii represents how fast mutant i

replicates and the entry Aij represents the probability of mutation from mutant j into mutant i.

Each element xk represents the amount of drug k administered to the patient. The ith element

of the kth column of of Fx specifies how quickly drug k destroys mutant i.

114

Combination drug therapy is desirable because using only one drug often leads to the mu-

tant population adapting to the weaknesses of that particular drug [197]. However, because of

potential side effects and drug-drug interactions, it is not desirable to use a large number of

different drugs. Furthermore, large doses can have additional side effects [198].

Since the probability of mutation is often orders of magnitude less than the rate of replica-

tion [199], Proposition 6.1.3 justifies design using the symmetric component of the system. We

therefore minimize the H2 norm of the system,

ψ = (As + diag(Fxx))ψ + d.

where As = 12 (A + AT ). Since this system is symmetric, it can be posed as problem (6.3).

Primal and dual optimization problems

We first state the problem in a form which is convenient for implementation of the alternating

direction method of multipliers (ADMM), a technique well suited to large scale problems that

has been recently successfully applied to sparse control synthesis problems [64, 105]. Using the

auxiliary variable G := −(As + diag(Fxx)), problem (6.3) becomes,

minimize trace(G−1) + 12x

TRx + γ1Tu

subject to G + As + diag(Fxx) = 0

G 0, x ≥ 0.

(6.7)

Since a negative drug dosage is not possible, the `1 norm of x can be written as 1Tx. The

Lagrangian is,

L(G, x, Y, λ) = trace(G−1) + 12x

TRx + γ1Tx − λTx + 〈Y, G + A + diag(Fxx)〉

where Y = Y T and λ ≥ 0. We omit the Lagrange multiplier associated with the positive

definiteness of G for compactness; it will become a slack variable and will result in a requirement

on the positive definiteness of the dual variable Y . We substitute the equivalent expression,

trace (Y diag(Fxx)) = − yTFxx

where y = diag(Y ) into the Lagrangian, differentiate it with respect to G and x and set the

115

gradients equal to zero,

0 = − G−2 + Y

0 = Rx + γ1 − λ + FTx y.

The optimal G and x are therefore,

G = Y −1/2

x = −R−1(FTx y + γ1− λ).

Substituting these expressions into the Lagrangian yields

2 trace(Y12 ) + trace(AY ) − 1

2 (FTx y + γ1− λ)TR−1(FTx y + γ1− λ).

The dual problem is to maximize the above expression over λ ≥ 0 and Y 0. Since the above

is concave and quadratic in λ, its maximum value is achieved at λ = FTx y + γ1. However, λ

must be nonnegative and thus

λ? =

FTx y + γ1, FTx y + γ1 > 0

0, otherwise.

Therefore, λ can be eliminated by substituting λ? and the dual problem can be written as,

maximize 2 trace(Y 1/2) + trace(AY ) − 12 max(−FTx y−γ1, 0)TR−1(−FTx y − γ1)

subject to Y 0.

(6.8)

Alternating direction method of multipliers

To apply ADMM to (6.7), we first form the corresponding augmented Lagrangian,

Lµ(G, x, Y ) := trace(G−1)+xTRx+γ 1Tx+〈Y, G+A+ diag(Fxx)〉+ 12µ‖G+A+diag(Fxx)‖2F .

Relative to the standard Lagrangian, Lµ contains an additional quadratic penalty on the viola-

tion of the linear constraint. The positive parameter µ specifies the magnitude on the constraint

violation penalty at each iteration.

116

The ADMM iteration uses the update sequence [132]

Gk+1 = argminG

Lρ(G, xk, Y k)

xk+1 = argminx

Lρ(Gk+1, x, Y k)

Y k+1 = Y k + 1µ

(Gk+1 +A+ diag(Fxx

k+1))

to find the optimal solution to the original problem. The stopping criteria depend on the primal

residual, which quantifies how well Gk and xk satisfy the linear constraint, and the dual residual,

which quantifies the difference between xk and xk−1. We refer the reader to [132] for details.

This algorithm is advantageous because the subproblems are much simpler than the original

problem. The G-minimization step has an explicit solution, the x-minimization step takes

a standard problem form for which there are efficient algorithms, and the Y -update step is

algebraic.

G-minimization

The G-minimization step amounts to solving,

minimizeG

trace(G−1) + 12µ‖G− V

k‖2F

subject to G 0

where Vk := −(A+ diag(Fxxk))− 1

2µYk. Seting the gradient,

−G−2 + 1µG − V k

to zero is an explicit exercise with the positive definiteness constraint. Since the powers of G

appear with no coefficients, the optimal G has the same eigenstructure as V k. Its eigenvalues

are determined by the positive real solution to the cubic equation

1µ λ

3i + σi λ

2i − 1

where σi is the corresponding eigenvalue of Vk. By the convexity of the G-minimization problem,

there can be only one positive real solution to the above equation.

117

x-minimization

The x-minimization step is equivalent to solving,

minimizex

γ1Tx + xTRx + 12µ‖Fxx − wk‖22.

where wk := gk+1 + a + µyk, gk = diag(Gk), a = diag(A), and yk = diag(Y k). The objective

function is the sum of a quadratic term and an `1 norm: a problem form is commonly referred

to as LASSO. This problem has attracted lots of attention in recent years and there are many

efficient methods for computing its solution; see Section 4.4.1 for a comparison of many standard

techniques with the second order method we develop in Chapter 4. We solve it using Iterative

Soft-Thresholding (ISTA) [131].

Computational complexity

Generic SDP solvers scale with n6, where n is the dimension of the positive definiteness con-

straint. The G-minimization requires O(n3) operations because it requires an eigenvalue decom-

position, the x-minimization step requires O(nr), operations and the Y -update step requires

O(n2) operations.

Numerical example

Following the example given in [89, 90] based on [200], we examine a system with 35 mutants

and 5 potential types of drugs. The diagonal entries are 0.5 and the off diagonal elements range

from the order of 10−8 to 10−6. The structure of A is shown in Fig. 6.3. It is evident that the

matrix is not symmetric.

We next use our algorithm to design control inputs u for the symmetric system with varying

levels of the sparsity promoting parameter γ. As γ is increased, sparsity is prioritized over H2

performance of the system and therefore the performance degrades. In Fig. 6.4, we show the

difference in H2 performance between the symmetric and original systems as a function of γ. In

this problem, the symmetric model is a very good approximation of the original system, even

up to extremely large levels of γ.

Since the off-diagonal entries of the matrix A are small, we artificially increase them by a

constant factor to study our approach for problems with larger degrees of asymmetry. We take,

Ac = c(A− I A) + 0.5(I A)

118

where c is a constant factor and is the Hadamard (element wise) product. This modification

means that Ac does not have physically relevant implications for the drug synthesis problem, but

it illustrates the utility of our approach with a realistic problem structure. Figure 6.5 compares

the H2 performance for c = 105, 1.4× 106, and 1.9× 107. Compared to the diagonal entries of

Ac, the maximum off-diagonal element is of the same order, one order of magnitude higher, and

two orders of magnitude higher respectively.

When the off-diagonal elements are of the same order of magnitude or smaller than the di-

agonal elements, there is almost no difference between the symmetric and full models. As the

off-diagonal elements get larger, the fidelity of the approximation suffers. Unsurprisingly, as

the system becomes more asymmetric, the symmetric approximation becomes more conserva-

tive [201].

We note that for a realistic synthesis problem, γ would be varied to find sparsity structures

for x. Once a desired sparsity structure is identified, (6.3) would be solved with γ = 0 but

x constrained to have that particular sparsity structure. This process, known as polishing in

compressive sensing literature, would yield a x which is sparse but has better performance than

simply obtaining x from (6.3) where g represents `1 regularization.

6.3 Computational advantages for structured problems

Structured control is often of interest for large-scale systems. As such, the computational scaling

of algorithms used to compute optimal controllers is very important. In this section, we identify

a class of systems which are amenable to scalable distributed algorithms.

When A and F (x) are always jointly block-diagonalizable, the dynamics of the system can

be expressed as the sum of independent subsystems. Define ψ := Πψ and let Π be a unitary

matrix such that,˙ψ = (A + F (x))ψ

where

A := ΠAΠT , F (x) := ΠF (x)ΠT ,

and, for any choice of x, A+ F (x) = blkdiag(A11 + F11, · · · , ANN + FNN ) is block-diagonal with

N blocks of size n× n each.

For problems of this form, computing optimal control strategies is much more efficient in

the ψ coordinates because the majority of the computational burden in solving problems (6.3)

and (6.4) comes from the nN × nN LMI constraint involved in minimizing the performance

119

Figure 6.3: Sparsity structure of the matrix A and its symmetric counterpart. The elements ofA are shown with blue dots, and the elements in its symmetric component As are overlaid ingreen circles.

120

γ

Figure 6.4: Difference in H2 norm between the symmetric and original systems with differentcontrollers designed as a function of γ, normalized by the H2 norm of original system.

f 2(x

)

γ

Figure 6.5: The solid — lines are the H2 norms of the symmetric systems and the dotted - - -lines are the H2 norms of the original systems. The blue ×, red , and magenta designatec = 105, 1.4× 106, and 1.9× 107 respectively.

121

metrics f2(x) or f∞(x).

For this class of system, the H2-optimal control problem (6.3) can be expressed as,

minimizev,Θi

1

2

∑i

trace(Θi) + g(x)

subject to

Θi I

I −((As)ii + (Fs(x))ii)

0.

(6.9)

which is an SDP with N separate n× n LMI blocks. Since SDPs scale with the sixth power of

the LMI blocks, solving this reformulation scales with n6 as opposed to n6N6.

Analogously, the structured H∞-optimal control problem (6.4) can be cast as,

minimizev,Θi

maxi

(σmax(Θi)) + g(x)

subject to

Θi I

I (As)ii + (Fs(x))ii

0.

(6.10)

One important class of system which satisfies these assumptions is spatially-invariant sys-

tems. This structure was used in [190] to develop efficient techniques for sparse feedback syn-

thesis.

6.3.1 Spatially-invariant systems

Spatially-invariant systems have a block-circulant structure which is block-diagonalizable by

a Discrete Fourier Transform (DFT). A spatially-invariant system can be represented by N

subsystems with n states each. The state vector ψ ∈ RnN is composed of N subvectors ψi ∈ Rn

which denotes the state of the subsystem. The matrix A ∈ RnN×nN is block-circulant with

blocks of the size n× n. For example, when N = 3,

A =

A0 A1 A−1

A−1 A0 A1

A1 A−1 A0

where the blocks A0, A−1, A1 ∈ Rn×n.

It was shown in [9] that the optimal feedback controller for a spatially-invariant system is

122

itself spatially-invariant. Assuming that the optimal sparse feedback controller is also spatially-

invariant is equivalent to assuming that F (x) is block-circulant. Block circulant matrices are

block-diagonalizable by the appropriate DFT. Let the block Fourier matrix be

Φ := ΦN ⊗ In,

where In is the n× n identity matrix, ΦN is the N ×N discrete Fourier transform matrix, and

⊗ represents the Kronecker product. By introducing the change of variables ψ := Φψ, where

ψ =[ψT1 · · · ψTN

]T,

and ψi ∈ Rn, the original system’s dynamics can be expressed as N independent n× n subsys-

tems,

A = blkdiag(A11, A22, A33)

Consequently, the optimal structured control problems (6.3) and (6.4) can be cast as (6.9)

and (6.10), which are more amenable to efficient computation.

6.3.2 Swift-Hohenberg Equation

Here we illustrate the utility of the block-diagonalization. Consider a particular realization of

the Swift-Hohenberg equation [191],

∂tψ(t, φ) = β ψ(t, φ) − (1 + ∂φφ)2 ψ(t, φ) + x(φ)ψ(t, φ) + d(t, φ)

where β ∈ R, and ψ(t, ·), v(·) ∈ L2(−∞,∞), d(t, φ) is a white-in-time-and-space stochastic

disturbance and x(φ) is a spatially-invariant feedback controller which is to be designed. A

finite dimensional approximation of this system can be obtained by using the differentiation

suite from [202] to discretize the problem into N points and approximating the infinite domain

with periodic boundary conditions over the domain L2[0, 2π]. A sparse H2 feedback controller

x(φ) can then be identified by solving problem (6.3).

We contrast this method with the approach we advocate in Section 6.3, where we use the

DFT to decompose the system into N first-order systems corresponding to eigenfunctions of the

Swift-Hohenberg equation and solve problem (6.9).

The state vector takes the form of ψ(φ) evaluated at grid points in φ where the dynamics

123

are given by

ψ = (A + X)ψ + d.

and A = βI − (I + D2)2. Here D is a discrete differentiation matrix from [202], and X is the

circulant state feedback matrix.

Using the DFT over φ, the Swift-Hohenberg equation can be expressed as a set of independent

first-order systems,˙ψφ = (aφ + xφ)ψφ + d

where aφ := β−(1−κ2φ)2, and the new coordinates are ψ := Πψ, Π is the DFT matrix, κφ is the

wavenumber (spatial frequency), and x represents X in the Fourier space; i.e., X = ΠT diag(x)Π.

We take the regularization term to be

g(x) = ‖X‖2F + γ‖X‖1

where ‖X‖1 :=∑ij |Xij | is the elementwise `1 norm and γ is a parameter which specifies the

emphasis on sparsity relative to performance.

For the H2 problem, the regularized optimal control problem is of the form of (6.3) with

Fs(x) = X and X is circulant. In that formulation, the problem is an SDP with one N×N LMI

block. In the Fourier space, the problem can be expressed as (6.9), which takes the particular

form,

minimizex

1

2

∑ 1

−(aφ + xφ)+ g

(ΠT diag(x)Π

)subject to − (aφ + xφ) ≥ 0

which does not require the large SDP constraints in (6.3).

We solved the regularizedH2 optimal control problem by solving the general formulation (6.3)

and the more efficient formulation (6.9) for β = 0.1, γ = 1 and N varying from 5 to 51 using

CVX, a general purpose convex optimization solver [203].

Taking advantage of spatial invariance yields a significant computational advantage, as can

be seen in Figure 6.6. Although both expressions of the problem yield the same solution, solving

the realization in (6.9) is much faster and allows us to examine much larger problem dimensions.

In Figure 6.7, we show the spatially-invariant feedback controller for one point in the domain,

i.e., one row of X, computed for N = 101 at γ = 0, γ = 0.1, and γ = 10..

124

Com

pu

tati

onti

me

(s)

Number of discretization points

Figure 6.6: Computation time for the general formulation (6.3) (solid blue ——) and thatwhich takes advantage of spatial invariance (6.9) (dotted red · · · ∗ · · · ).

feed

bac

kga

in−x

(φ)

φ

Figure 6.7: Feedback gain −x(φ) for the node at position φ = 0, computed with N = 51 andγ = 0 (solid black —), γ = 0.1 (dashed blue - - -), and γ = 10 (dotted red · · · ).

Chapter 7

Structured decentralized control

of positive systems

Positive systems have received much attention in recent years because of convenient properties

that arise from their structure. A system is called positive if, for every nonnegative initial

condition and input signal, its state and output remain nonnegative [53]. Such systems appear

in the models of heat transfer, chemical networks, and probabilistic networks. In [54], the

authors show that the KYP lemma greatly simplifies for positive systems, thereby allowing

for decentralized H∞ synthesis via Semidefinite Programming (SDP). In [57], it is shown that

static output-feedback can be solved via Linear Programming (LP) for a class of positive systems.

In [55,56], the authors develop necessary and sufficient conditions for robust stability of positive

systems with respect to induced L1–L∞ norm-bounded perturbations. In [204–206], it is shown

that the structured singular value is equal to its convex upper bound for positive systems.

Thus, assessing robust stability with respect to induced L2 norm-bounded perturbation is also

tractable.

Most of the recent literature focuses on control design for positive systems with respect to

induced L1–L∞ norms or induced L2 norms [54, 56, 58]. In this chapter, we show that a class

of H2 and H∞ optimal control problems which are not tractable for general systems are convex

for positive systems. Moreover, they are convex in the original controller variables which (i)

allows us to formulate convex optimization problems where the control parameter is kept as

an optimization variable; and (ii) facilitates a straightforward implementation of constraints or

regularizers on the control parameter in the optimal control formulation.

125

126


We begin with a review of positive systems and then pose the problem.

7.1.1 Background on positive systems

We first provide relevant notation and definitions to facilitate discussion.

Notation

The set of n× n Metzler matrices (matrices with nonnegative off diagonal elements) is denoted

by Mn×n. The set of n× n nonnegative (positive) diagonal matrices is denoted by Dn+ (Dn++).

Definition 2 (Graph associated to a matrix). Given A ∈ Rn×n we denote the graph associated

to A as G(A) = (V, E), with the set of vertices V = 1, ..., n and the set of edges E :=

(i, j) |Aji 6= 0 .

Definition 3 (Strongly connected graph). A graph (V, E) is strongly connected if there is a

directed path linking every two distinct nodes in V.

Definition 4 (Weakly connected graph). A graph (V, E) is weakly connected if replacing its

edges with undirected edges results in a strongly connected graph.

Positive systems

We begin with a definition of a linear positive system.

Definition 5. A linear system described by the state-space representation,

ψ = Aψ + Bd

ζ = Cψ + Dd,

is positive if and only if A is Metzler and B, C, and D are nonnegative matrices.

We now state two lemmas that are useful for the analysis of positive LTI systems.

Lemma 7.1.1. Let A ∈Mn×n and let Q ∈ Rn×n be a positive definite matrix with nonnegative

entries. Then

(a) eA ≥ 0;

127

(b) For Hurwitz A, the solution P to the algebraic Lyapunov equation,

AP + PAT + Q = 0

is elementwise nonnegative.

Proof. (a) The Metzler matrix A can be written as A − αI with A ≥ 0 and α > 0. Since

eA =∑∞k=0 A

k/k! and Ak ≥ 0 for all k, eA ≥ 0. Thus, e−α > 0 implies eA = e−αeA ≥ 0.

(b) Apply (a) to P =∫∞

0eAtQ eA

T t dt.

Lemma 7.1.2. The left and right principal singular vectors, w and v, of A ∈ Rn×n+ are non-

negative. If A ∈ Rn×n++ , w and v are positive and unique.

Proof. Follows from the application of the Perron theorem [180, Theorem 8.2.11] to AAT and

ATA.

7.1.2 Problem formulation

We consider a class of positive systems (1.1) subject to structured decentralized control,

ψ = (A + F (x))ψ + Bd

ζ = Cψ(7.1)

where A is a Metzler matrix, F (x): Rm → Dn, and B,C ≥ 0. Many applications fit into this

form, including leader selection and combination drug therapy design for HIV. Our objective is

to design a stabilizing diagonal matrix F (x) that minimizes amplification from d to z. To study

this class of problem, we introduce the following assumption.

Assumption 3. The matrix A in (7.1) is Metzler, the matrices B and C are nonnegative, and

the diagonal matrix F (x) := diag (Fxx) is a linear function of x with Fx ∈ Rn×m.

We now review some recent results. Under Assumption 3, the matrix A + F (x) is Metzler

and its largest eigenvalue is real and a convex function of x [207]. Recently, it has been shown

that the weighted L1 norm of the response of system (7.1) from a nonnegative initial condition

x0 ≥ 0, ∫ T

0

cTψ(t) dt

is a convex function of u for every c ∈ Rn+ [58, 59]. Furthermore, the approach in [54] can

be used to cast the problem of unstructured decentralized H∞ control of positive systems as a

128

semidefinite program (SDP) and [208] can be used to cast it as a Linear Program (LP). However,

both the SDP and the LP formulations require a change of variables that does not preserve the

structure of F (x). Consequently, it is often difficult to design controllers that are feasible for a

given noninvertible operator F or to impose structural constraints or penalties on x.

We show that both the H2 and the H∞ norms are convex functions of the original optimiza-

tion variable x and provide explicit expressions for the (sub)gradients. This allows us to develop

efficient descent algorithms that solve regularized optimal control problems of the form (1.8).

7.2 Convexity of optimal control problems

We next establish convexity of the H2 and H∞ norms for systems that satisfy Assumption 3,

derive a graph theoretic condition that guarantees continuous differentiability of f∞, and de-

velop a customized algorithm for solving optimization problem (1.8) even in the absence of

differentiability.

7.2.1 Convexity of f2 and f∞

We first establish convexity of the H2 optimal control problem and provide the expression for

the gradient of f2.

Proposition 7.2.1. Let Assumption 3 hold and let Acl(x) := A + F (x) be a Hurwitz matrix.

Then, f2 is a convex function of x and its gradient is given by

∇f2(x) = 2F †(PcPo) (7.2)

where Pc and Po are the controllability and observability gramians of the closed-loop system (7.1),

Acl(x)Pc + PcATcl(x) + BBT = 0 (7.3a)

ATcl(x)Po + PoAcl(x) + CTC = 0. (7.3b)

Proof. We first establish convexity of f2(x) and then derive its gradient. The square of the H2

norm is given by,

f2(x) =

⟨CTC, Pc

⟩, Acl(x) Hurwitz

∞, otherwise

129

where the controllability gramian Pc of the closed-loop system is determined by the solution to

Lyapunov equation (7.3a). For Hurwitz Acl(x), Pc can be expressed as,

Pc =

∫ ∞0

eAcl(x)tBBT eATcl(x)t dt.

Substituting into⟨CTC, Pc

⟩and rearranging terms yields,

f2(x) =

∫ ∞0

‖C eAcl(x)tB‖2F dt =

∫ ∞0

∑i, j

(cTi eAcl(x)t bj

)2

dt

where cTi is the ith row of C and bj is the jth column of B. From [58, Lemma 3] it follows that

cT eAcl(x)t b is a convex function of u for nonnegative vectors c and b. Since the range of this

function is R+ and (·)2 is nondecreasing over R+, the composition rules for convex functions [123]

imply that (cTi eAcl(x)t bj)2 is convex in x. Convexity of f2(x) follows from the linearity of the

sum and integral operators.

To derive ∇f2, we form the associated Lagrangian,

L(x, Pc, Po) =⟨CTC, Pc

⟩+⟨Po, Acl(x)Pc + PcA

Tcl(x) + BBT

⟩where Po is the Lagrange multiplier associated with equality constraint (7.3a). Taking variations

of L with respect to Po and Pc yields Lyapunov equations (7.3a) and (7.3b) for controllability

and observability gramians, respectively. Using Acl(x) = A + F (x) and the adjoint of F , we

rewrite the Lagrangian as

L(x, Pc, Po) = 2⟨F †(PcPo), x

⟩+⟨CTC, Pc

⟩+⟨Po, APc + PcA

T + BBT⟩.

Taking the variation of L with respect to x yields (7.2).

Remark 7.2.1. Convexity of the quadratic cost

∫ T

0

ψT (t)CTC ψ(t) dt

also holds over a finite or infinite time horizon for a piecewise constant x. This follows from [58,

Lemma 4] and suggests that an approach inspired by the Model Predictive Control (MPC) frame-

work can be used to compute a time-varying optimal control input for a finite horizon problem.

Remark 7.2.2. The expression for ∇f2 in Proposition 7.2.1 remains valid for any linear system

130

and any linear operator F : Rm → Rn×n. However, convexity of f2 holds under Assumption 3

and is not guaranteed in general.

We now establish the convexity of the H∞ control problem and provide expression for the

subdifferential set of f∞.

Proposition 7.2.2. Let Assumption 3 hold and let Acl(x) := A + F (x) be a Hurwitz matrix.

Then, f∞ is a convex function of x and its subdifferential set is given by

∂f∞(x) =∑

i

αi F†(A−1

cl (x)B vi wTi CA

−1cl (x)

)| wTi (CA−1

cl (x)B) vi = f∞(x), α ∈ P

(7.4)

where F † is the adjoint of the operator F and P is the simplex, P := αi | αi ≥ 0,∑i αi = 1.

Proof. We first establish convexity of f∞(x) and then derive the expression for its subdifferential

set. For positive systems, the H∞ norm achieves its largest value at ω = 0 [54] and from (1.3b)

we thus have f∞(x) = σ(−CA−1cl (x)B). To show convexity of f∞(x), we show that −CA−1

cl (x)B

is a convex and nonnegative function of x, that σ(M) is a convex and nondecreasing function of

a nonnegative argument M , and leverage the composition rules for convex functions [123].

Since Acl(x) is Hurwitz, its inverse can be expressed as

−A−1cl (x) =

∫ ∞0

eAcl(x)t dt. (7.5)

Convexity of cieAcl(x)t bj [58, Lemma 3] and linearity of integration implies that each element of

the matrix

−C A−1cl (x)B = C

∫ ∞0

eAcl(x)t dtB

is convex in x and, by part (a) of Lemma 7.1.1, nonnegative.

The largest singular value σ(M) is a convex function of the entries of M [123],

σ(M) = sup‖w‖=1, ‖v‖=1

wTMv (7.6)

and Lemma 7.1.2 implies that the principal singular vectors vi and wi that achieve the supremum

in (7.6) are nonnegative for M ≥ 0. Thus,

wTi (M + β ejeTk ) vi ≥ wTi Mvi

for any β ≥ 0, thereby implying that σ(M) is nondecreasing over M ≥ 0. Since each element of

131

−CA−1cl (x)B ≥ 0 is convex in x, these properties of σ(·) and the composition rules for convex

functions [123] imply convexity of f∞(x) = σ(−CA−1cl (x)B).

To derive ∂f∞(u), we note that the subdifferential set of the supremum over a set of differ-

entiable functions,

f(x) = supi∈I

fi(x)

is the convex hull of the gradients of the functions that achieve the supremum [209, Theorem

1.13],

∂f(x) = ∑j | fj(x) = f(x)

αj∇fj(x) | α ∈ P

Thus, the subgradient of σ(M) with respect to M is given by

∂ σ(M) =∑

j αjwjvTj | wTj Mvj = σ(M), α ∈ P

.

Finally, the matrix derivative of M−1 with respect to M in conjunction with the chain rule

yields (7.4).

Remark 7.2.3. The adjoint of the linear operator F , introduced in Assumption 3, with respect

to the standard inner product is F †(x) = FTx diag(x). For positive systems, Lemma 7.1.1 implies

that the gramians Pc and Po are nonnegative matrices. Thus, the diagonal of the matrix PcPo

is positive and it follows that f2 is a monotone function of the diagonal matrix F (x). Similarly,

−A−1cl (x)B vi and −wTi CA

−1cl (x) are nonnegative and thus f∞ is also a monotone function of

F (x).

7.2.2 Differentiability of the H∞ norm

In general, theH∞ norm is a nondifferentiable function of the control parameter x. Even though,

under Assumption 3, the decentralized H∞ optimal control problem (1.8) for positive systems is

convex, it is still difficult to solve because of the lack of differentiability of f∞. Nondifferentiable

objective functions often necessitate the use of subgradient methods, which can converge slowly

to the optimal solution.

In what follows, we prove that f∞ is a continuously differentiable function of x for weakly

connected G(A). Then, by noting that f∞ is nondifferentiable only when G(A) contains dis-

connected components, we develop a method for solving (1.8) that outperforms the standard

subgradient algorithm.

132

Differentiability under weak connectivity

To show the result, we first require two technical lemmas.

Lemma 7.2.4. Let M ≥ 0 be a matrix whose main diagonal is strictly positive and whose

associated graph G(M) is weakly connected. Then, the graphs associated with G(MMT ) and

G(MTM) have self loops and are strongly connected.

Proof. Positivity of the main diagonal of M implies that if Mij is nonzero, then (MTM)ij and

(MMT )ij are nonzero; by symmetry, (MTM)ji and (MMT )ji are also nonzero. Thus, G(MTM)

and G(MMT ) contain all the edges (i, j) in G(M) as well as their reversed counterparts (j, i).

Since G(M) is weakly connected, G(MTM) and G(MMT ) are strongly connected. The presence

of self loops follows directly from the positivity of the main diagonal of M .

Lemma 7.2.5. Let M ≥ 0 be a matrix whose main diagonal is strictly positive and whose

associated graph G(M) is weakly connected. Then, the principal singular value and the principal

singular vectors of M are unique.

Proof. Note that G(Mk) has an edge from i to j if M contains a directed path of length k from

i to j [6, Lemma 1.32]. Since G(MMT ) and G(MTM) are strongly connected with self loops,

Lemma 7.2.4 implies the existence of k such that (MTM)k > 0 and (MMT )k > 0 for all k ≥ k,

and Perron Theorem [180, Theorem 8.2.11] implies that (MTM)k and (MMT )k have unique

maximum eigenvalues for all k ≥ k.

The eigenvalues of (MTM)k and (MMT )k are related to the singular values of M by,

λi((MTM)k) = λi((MMT )k) = (σi(M))2k

and the eigenvectors of (MTM)k and (MMT )k are determined by the right and the left singular

vectors of M , respectively. Since the principal eigenvalue and eigenvectors of these matrices are

unique, the principal singular value and the associated singular vectors of M are also unique.

Theorem 7.2.6. Let Assumption 3 hold, let Acl(x) := A + F (x) be a Hurwitz matrix, and let

matrices B and C have strictly positive main diagonals. If the graph G(A) associated with A is

weakly connected, f∞(x) is continuously differentiable.

Proof. Lemma 7.1.1 implies that eAcl(x) ≥ 0. From [6, Lemma 1.32], G(Mk) has an edge from

i to j if there is a directed path of length k from i to j in G(M). Weak connectivity of G(A)

implies weak connectivity of G(A), G(Ak), eAcl(x)t and, by (7.5), of G(−A−1cl (x)).

133

Since Acl(x) is Hurwitz and Metzler, its main diagonal must be strictly negative; otherwise,

ddtψi ≥ 0 for some ψi, contradicting stability and thus the Hurwitz assumption. Equation (7.5)

and Lemma 7.1.1 imply A−1cl (x) ≤ 0 and, since Acl(x) is Metzler, A−1

cl (x)Acl(x) = I can only

hold if the main diagonal of −A−1cl (x) is strictly positive.

Moreover, since the diagonals of B and C are strictly positive, G(−CA−1cl (x)B) is weakly

connected and the diagonal of −CA−1cl (x)B is also strictly positive. Thus, Lemma 7.2.5 implies

that the principal singular value and singular vectors of −CA−1cl (x)B are unique, that (7.4) is

unique for each stabilizing x, and that f∞(x) is continuously differentiable.

Nondifferentiability for disconnected G(A)

Theorem 7.2.6 implies that under a mild assumption on B and C, f∞ is only nondifferentiable

when the graph associated with A has disjoint components. Since the proximal operator of

f∞ does not readily admit an explicit expression, neither the existing proximal methods dis-

cussed in Section 2.2.3 nor the novel algorithms we develop in Part I may be directly applied

to solve (1.8). Instead, subgradient methods [209] must be employed. Although proximal sub-

gradient algorithms [210] can be applied to problems of the form (1.8) where both f and g

are nondifferentiable and g has a cheaply computable proximal operator, subgradient methods

require a large number of iterations to converge.

To a large extent, subgradient methods are inefficient because they do not guarantee descent

of the objective function. However, under the following mild assumption, the subgradient of f∞,

∂f∞, can be conveniently expressed and a descent direction can be obtained by solving a linear

program.

Assumption 4. Without loss of generality, let Acl(x) be permuted such that

Acl(x) = blkdiag (A1cl(x), . . . , Amcl (x))

is block diagonal and let G(Aicl(x)) be weakly connected for every i. Moreover, the matrices

B = blkdiag (B1, . . . , Bm) and C = blkdiag (C1, . . . , Cm) are block diagonal and partitioned

conformably with the matrix Acl(x).

Theorem 7.2.7. Let Assumptions 3 and 4 hold and let Acl(x) := A+F (x) be a Hurwitz matrix.

Then,

f∞(x) = maxi

f i∞(x) (7.7a)

where f i∞(x) := σ(Ci(Aicl(x))−1Bi). Moreover, every element of the subgradient of f∞(x) can be

134

expressed as the convex combination of a finite number of vectors Gj := ∇f j∞(x) corresponding to

the gradients of the functions f j∞(x) that achieve the maximum in (7.7a), i.e., f∞(x) = f j∞(x),

∂f∞(x) = Gα | α ∈ P (7.7b)

where the columns of G are given by Gj and P is the simplex.

Proof. Since Acl(x) is a block diagonal matrix, so is A−1cl (x) and Assumption 4 implies that

−CA−1cl (x)B = blkdiag (−Ci(Aicl(x))−1Bi) is also block diagonal. Thus,

f∞(x) = σ(−CA−1cl (x)B) = max

iσ(−Ci(Aicl(x))−1Bi)

which proves (7.7a). Theorem 7.2.6 implies that each f i∞(x) is continuously differentiable which

establishes (7.7b) by [209, Theorem 1.13].

When g is differentiable, we leverage the above convenient expression for ∂f∞ to select

an element of the subdifferential set which, with an abuse of terminology, we call the optimal

subgradient. The optimal subgradient is guaranteed to be a descent direction for (1.8) and it is

defined as the member of ∂(f∞(x) + γg(x)) that solves

minimizev, α

maxj

(vT (Gj + γ∇g(x))) (7.8a)

subject to v = −(Gα + γ∇g(x)), α ∈ P (7.8b)

vT (Gj + γ∇g(x)) < 0, for all j (7.8c)

where F and f j are defined as in Theorem 7.2.7. While (7.8a) minimizes the directional

derivative of (1.8) in the search direction v, constraints (7.8b) and (7.8c) ensure that v ∈∂(f∞(x) + γ∇g(x)) and that it is a descent direction, respectively.

Problem (7.8) can be solved efficiently because it is a linear program. Moreover, the opti-

mality condition for (1.8), ∂f∞(x) + γ∇g(x) 3 0, can be checked by solving a linear program to

verify the existence of an α ∈ P such that Gα+ γ∇g(x) = 0.

Customized algorithm

Ensuring a descent direction enables principled rules for step-size selection and makes prob-

lem (1.8) with nondifferentiable g tractable via the augmented-Lagrangian-based approaches.

135

Introducing an auxiliary variable z and deriving the proximal augmented Lagrangian as de-

scribed in Section 2.3.1 yields

Lµ(x, z?µ(x, y); y) = f∞(x) + Mµg(x + µy) − µ2 ‖y‖

2.

Since f∞ is the only nondifferentiable component of the proximal augmented Lagrangian, the op-

timal subgradient (7.8) can be used to minimize Lµ(x; yk) over x to facilitate the MM algorithm

in Section (2.3) [107].

7.3 Leader selection in directed networks

We now consider the special case of system (7.1), in which the matrix A is given by a graph

Laplacian, and study the leader selection problem for directed consensus networks. The question

of how to optimally assign a predetermined number of nodes to act as leaders in a network of

dynamical systems with a given topology has recently emerged as a useful proxy for identifying

important nodes in a network [77–84]. Even though significant theoretical and algorithmic

advances for undirected networks have been made, the leader selection problem in directed

networks remains open. We first discussed this problem in Section 1.2.1; we describe it in more

detail here.


We describe consensus dynamics and state the problem.

Consensus dynamics

The weighted directed network G(L) with n nodes and the graph Laplacian L obeys consensus

dynamics in which each node i updates its state ψi using relative information exchange with its

neighbors,

ψi = −∑j ∈Ni

Lij(ψi − ψj) + di.

Here, Ni := j | (i, j) ∈ E, Lij ≥ 0 is a weight that quantifies the importance of the edge from

node j to node i, di is a disturbance, and the aggregate dynamics are [67],

ψ = −Lψ + d

136

where L is the graph Laplacian of the directed network [211].

The graph Laplacian always has an eigenvalue at zero that corresponds to a right eigenvector

of all ones, L1 = 0. If this eigenvalue is simple, all node values ψi converge to a constant ψ in

the absence of an external input d. When G(L) is balanced and conncted, ψ = 1n 1

Tψ(0) is the

average of the initial node values. In general, ψ = wTψ(0), where w is the left eigenvector of

L corresponding to zero eigenvalue, wTL = 0. If G(L) is not strongly connected, L may have

additional eigenvalues at zero and the node values converge to distinct groups whose number is

equal to or smaller than the multiplicity of the zero eigenvalue.

Leader selection

In consensus networks, the dynamics are governed by relative information exchange and the node

values converge to the network average. In the leader selection paradigm [80], certain “leader”

nodes are additionally equipped with absolute information which introduces negative feedback

on the states of these nodes. If suitable leader nodes are present, the dynamical generator

becomes a Hurwitz matrix and the states of all nodes asymptotically converge to zero.

An example is given by a kinematic model of vehicles where ψ represents the position vector

of a formation. Relative information exchange corresponds to maintaining constant distances

between neighboring vehicles and the leader nodes may have access to absolute information from

GPS units.

The node dynamics in a network with leaders is

ψi = −∑j ∈Ni

Lij(ψi − ψj) − xi ψi + di

where xi ≥ 0 is the weight that node i places on its absolute information. The node i is a leader

if xi > 0, otherwise it is a follower. The aggregate dynamics can be written as,

ψ = −(L + diag (x))ψ + d

and placed in the form (7.1) by taking A = −L, B = C = I, and F (x) = −diag (x). We

evaluate the performance of a leader vector x ∈ Rn using the H2 or H∞ performance metrics

f2 or f∞, respectively. We note that this system is marginally stable in the absence of leaders

and much work on consensus networks focuses on driving the deviations from the average node

values to zero [97] as we discussed in the edge addition example for directed consensus networks

in Section 2.4. Instead, we here focus on driving the node values themselves to zero.

137

1

2

3

4

? ?? ?

? ?? ?

︸︷︷︸

L

Figure 7.1: A directed network and the sparsity pattern of the corresponding graph Laplacian.This network is stabilized if and only if either node 1 or node 2 are made leaders.

We formulate the combinatorial problem of selecting N leaders to optimize either H2 or H∞norm as follows.

Problem 1. Given a network with a graph Laplacian L and a fixed leader weight κ, find the

optimal set of N leaders that solves

minimizex

f(x)

subject to 1Tx = N κ, xi ∈ 0, κ

where f is one of the performance metrics described in Section 1.2.2, with A = −L, B = C = I,

and F (x) = diag (x).

In [78, 79], the authors derive explicit expressions for leaders in undirected networks. How-

ever, these expressions are efficient only for very few or very many leaders. Instead, we follow [80]

and develop an algorithm which relaxes the integer constraint to obtain a lower bound on Prob-

lem 1 and use greedy heuristics to obtain an upper bound.

Considering leader selection in directed networks adds the challenge of ensuring stability. At

the same time, we can leverage existing results on leader selection in undirected networks to

derive efficient upper bounds on Problem 1.

7.3.2 Stability for directed networks

For a vector of leader weights x to be feasible for Problems 1, it must stabilize system (7.1); i.e.,

−(L+ diag (x)) must be a Hurwitz matrix. When G(L) is undirected and connected, any leader

will stabilize (7.1). However, this is not the case for directed networks. For example, making

node 1 or 2 a leader stabilizes the network in Fig. 7.1, but making nodes 3 or 4 a leader does

not. Theorem 7.3.1 provides a necessary and sufficient condition for stability.

138

Theorem 7.3.1. Let L be a weighted directed graph Laplacian and let x ≥ 0. The matrix

−(L+ diag (x)) is Hurwitz if and only if w x 6= 0 for all nonzero w with wTL = 0, where is

the elementwise product.

Proof. (⇐) If w x = 0, wT diag (x) = 0. If, in addition, wTL = 0, we have

−wT (L + diag (x)) = 0 (7.9)

and therefore zero is an eigenvalue of −(L+ diag (x)).

(⇒) Since the graph Laplacian L is row stochastic and diag (x) is diagonal and nonnegative,

the Gershgorin circle theorem [180] implies that the eigenvalues of −(L+diag (x)) are at most 0.

To show that −(L+diag (x)) is Hurwitz, we show that it has no eigenvalue at zero. Assume there

exists a nonzero w such that (7.9) holds. This implies that either wTL = wT diag (x) = 0 or that

wTL = −wT diag (x). The first case is not possible because, by assumption, wT diag (x) = (w x)T 6= 0 for any w such that wTL = 0. If the second case is true, then wTLv = −wT diag (x)v

must also hold for all v. However, if we take v = 1, then wTL1 = 0 but −wT diag (x)1 is

nonzero. This completes the proof.

Remark 7.3.2. Only the set of leader nodes is relevant to the question of stability. If x does

not stabilize (7.1), no positive weighting of the vector of leader nodes, α x with α ∈ Rn++, will

stabilize (7.1). Similarly if x stabilizes (7.1), every α x will.

Corollary 7.3.2.1. If G(L) is strongly connected, any choice of leader node will stabilize (7.1).

Proof. Since the graph Laplacian associated with a strongly connected graph is irreducible, the

Perron-Frobenius theorem [180] implies that the left eigenvector associated with −L is positive.

Thus, w x 6= 0 for any nonzero x and system (7.1) is stable by Theorem 7.3.1.

Remark 7.3.3. The condition in Theorem 7.3.1 requires that there is a path from the set of

leader nodes to every node in the network. This can be enforced by extracting disjoint “leader

subsets” Sj which are not influenced by the rest of the network, i.e., (vj)TL = 0 where vji = 1 if

i ∈ Sj and vji = 0 otherwise, and which are each strongly connected components of the original

network. Stability is guaranteed if there is at least one leader node in each such subset Sj; e.g.,

for the network in Fig. 7.1, there is one leader subset S1 = 1, 2. By Corollary 7.3.2.1, S1

contains all nodes when the network is strongly connected.

139

7.3.3 Bounds for Problem 1

To approach combinatorial Problem 1, we derive bounds on its optimal objective value. These

bounds can also be used to implement a branch-and-bound approach [212].

Lower bound

By relaxing the combinatorial constraint in Problem 1, we obtain a convex program,

minimizex

f(x)

subject to x ∈ κPN

where κPN := x |∑i xi = Nκ, xi ≤ κ is the “capped” simplex. Using a recent result on

efficient projection onto PN [213], this problem can be solved efficiently via proximal gradient

methods [131] to provide a lower bound on Problem 1.

When G(L) is not strongly connected, additional constraints can be added to enforce the

condition in Theorem 7.3.1 and thus guarantee stability. Let the sets Sj denote “leader subsets”

from which a leader must be chosen, as discussed in Remark 7.3.3. Then, the convex problem

minimize f(x)

subject to x ∈ κPN ,∑i∈Sj

xi ≥ κ, for all j

(7.10)

relaxes the combinatorial constraint and guarantees stability. We denote the resulting lower

bounds on the optimal values of the H2 and H∞ versions of Problem 1 with N leaders by

f lb2 (N) and f lb

∞(N), respectively.

Upper bounds for Problem 1

If k denotes the number of subsets Sj , a stabilizing candidate solution to Problem 1 can be

obtained by “rounding” the solution to (7.10) by taking the N leaders to contain the largest

element from each subset Sj and N − k largest remaining elements. The greedy swapping

algorithm proposed in [80] can further tighten this upper bound.

Recent work on leader selection in undirected networks can also provide upper bounds for

Problem 1 when G(L) is balanced. The symmetric component of the Laplacian of a balanced

graph, Ls := 12 (L+LT ), is the Laplacian of an undirected network. The exact optimal leader set

140

for an undirected network can be efficiently computed when N is either small or large [78, 79].

Since the performance of the symmetric component of a system provides an upper bound on

the performance of the original system, these sets of leaders will have better performance with

L than with Ls for both the H2 (Corollary 6.1.3.1) and H∞ norms (Proposition 6.1.2).

Even when L does not represent a balanced network, f2 and f∞ are respectively upper

bounded by the trace and the maximum eigenvalue of 12 (Ls + F (x))−1. For small numbers of

leaders, they can be efficiently computed using rank-one inversion updates. A similar approach

was used in [78, 79] to derive optimal leaders for undirected networks. Moreover, this approach

always yields a stabilizing set of leaders (Lemma 6.1.1).

7.3.4 Additional comments

We now provide additional discussion on interesting aspects of Problem 1. We first consider the

gradients of f2 and f∞.

Remark 7.3.4. When F (x) = −diag (x), we have ∇f2 = −2 diag (PcPo). The matrix PcPo

often appears in model reduction and (∇f2(x))i corresponds to the inner product between the ith

columns of Pc and Po.

Remark 7.3.5. When F (x) = −diag(x), (∂f∞(x))i is given by the product of −eTi A−1cl (x)v

and wTA−1cl (x)ei. The former quantifies how much the forcing which causes the largest overall

response of system (7.1) affects node i, and the latter captures how much the forcing at node i

affects the direction of the largest output response.

The optimal leader sets for balanced graphs are interesting because they are invariant under

reversal of all edge directions.

Proposition 7.3.1. Let G(L) be balanced, let L := LT so that G(L) contains the reversed edges

of the graph G(L), and let f2 and f∞ denote the performance metrics (7.13) with A = −L,

F (x) = − diag (x), and B = C = I as in Problem 1. Then, f2(x) = f2(x) and f∞(x) = f∞(x).

Proof. The controllability gramian of (7.1) defined with Acl = −(L+ diag(x)) solves Lyapunov

equation (7.3a), −(L + diag(x))Pc − Pc(L + diag(x))T + I = 0, and is also the observability

gramian Po of (7.1) defined with Acl = −(L + diag(x)) = −(LT + diag(x)) that solves (7.3b).

By definition of the H2 norm, f2(x) = trace(Po) = trace(Pc) = f2(x). Since σ(M) = σ(MT ),

f∞(x) = σ(−(L+ diag(x))−1) = σ(−(L+ diag(x))−1) = f∞(x).

This invariance is intriguing because the space of balanced graphs is spanned by cycles.

In [214], the authors explored how undirected cycles affect undirected consensus networks.

141

Proposition 7.3.1 suggests that directed cycles also play a fundamental role in directed con-

sensus networks.

7.4 Computational experiments for leader selection

Here, we illustrate our approach to Problem 1 with N leaders and the weight κ = 1. The

“rounding” approach that we employ is described in Section 7.3.3.

7.4.1 Example: Bounds on leader selection for a small network

For the directed network in Fig. 7.2a, let the edges from node 2 to node 7 and from node 7 to

node 8 have an edge weight of 2 and let all other edges have unit edge weights. We compare

the optimal set of leaders, determined by exhaustive search, to the set of leaders obtained by

(i) “rounding” the solution to relaxed problem (7.10); and by (ii) the optimal selection for

the undirected version of the graph via [78, 79], as discussed in Section 7.3.3. In Fig. 7.2a,

blue node 7 represents the optimal single leader, yellow node 4 represents the single leader

selected by “rounding”, and red node 8 represents the optimal single leader for the undirected

network. In Fig. 7.2b we show the H2 performance for 1 to 8 leader nodes resulting from

different methods. Since in general, we do not know the optimal performance a priori , we plot

performance degradation (in percents) relative to the lower bound on Problem 1 obtained by

solving problem (7.10).

Figure 7.2b shows that neither “rounding” (yellow ) nor the optimal selection for undirected

networks (red +) achieve unilaterally better H2 performance (performance of the optimal leader

sets are shown in blue ×). While the procedure for the undirected networks selects better sets

of 1, 2, and 5 leaders relative to “rounding”, identifying them is expensive except for large or

small number of leaders [78,79] and “rounding” identifies a better set of 4 leaders. This suggests

that, when possible, both sets of leaders should be computed and the one that achieves better

performance should be selected.

7.4.2 Example: Leaders in the neural network of the worm C. Elegans

We now consider the network of neurons in the brain of the worm C. Elegans with 297 nodes

and 2359 weighted directed edges. The data was compiled by [215] from [216]. Inspired by the

use of leader selection as a proxy for identifying important nodes in a network [77–80, 82], we

employ this framework to identify important neurons in the brain of C. Elegans.

142

8

7 2

1

3

6

4

5

2

2

(a) Balanced Network with 8 nodes,12 edges.

100∗

(f2(x

)/flb 2

(N)−

1)

Number of leaders N

(b) Performance of optimal leaders and two leader selectiontechniques.

Figure 7.2: H2 performance of optimal leader set (blue ×) and upper bounds resulting from“rounding” (yellow ) and the optimal leaders for the undirected network (red +). Performanceis shown as a percent increase in f2 relative to f lb

2 (N).

Three nodes in the network have zero in-degree, i.e., they are not influenced by the rest of the

network. Thus, as discussed in Remark 7.3.3, there are three “leader subsets”, each comprised

of one of these nodes. Theorem 7.3.1 implies that system (7.1) can only be stable if each of

these nodes are leaders.

In Figs. 7.4c and 7.4d, we show f2 and f∞ resulting from “rounding” the solution to prob-

lem (7.10) to select the additional 1 to 294 leaders. Performance is plotted as an increase (in

percents) relative to the lower bound f lb(N) obtained from (7.10). This provides an upper

bound on suboptimality of the identified set of leaders. While this value does not provide in-

formation about how f(x) changes with the number of leaders, Remark 7.2.3 implies that it

monotonically decreases with N .

For both f2 and f∞ performance metrics, Figs. 7.4c and 7.4d illustrates that the upper

bound is loosest for 25 leaders (1.56% and 0.48%, respectively). As seen in Fig. 7.2b from the

previous example, whose small size enabled exhaustive search to solve Problem 1 exactly, the

upper bound on suboptimality is not tight and the exact optimal solution to Problem 1 can

differ by as much 21.75% from the lower bound. This suggests that “rounding” selects very

good sets of leaders for the C. Elegans example.

In Figs. 7.4a and 7.4b, we show the network with ten identified f2 and f∞ optimal leaders.

The size of the nodes is related to their out-degree and the thickness of the edges is related to

143

the weight. The red marks nodes that must be leaders and the blue marks the 7 additional

leaders selected by “rounding”.

7.4.3 Example: Ranking college football teams

Inspired by the recent use of graph theoretic tools for ranking athletic teams, we consider the

problem of ranking college football teams. Here, we do not consider that each leader has the

same weight and instead use a sparsity penalty in (1.8) to select sparse subsets of nodes (teams).

Due to the number of teams in the top division of college football (128) and the relative

scarcity of games between them (around 13 per team), ranking these teams is an underdeter-

mined problem. The current practice of ranking by a committee is clearly subject to bias.

Recently, graph theoretic measures, such as average path length from each node, have been

explored for the purpose of objectively ranking teams or athletes [217,218].

We used the scores of college football games from the 2015−2016 season collected from [219]

to generate a network. If team A beat team B, a edge was placed from A to B with a weight

equal to the score difference in the game. There were 203 teams (nodes) and 863 games (edges)

in our data set. We select the top N teams by identifying N H2 optimal leaders. We selected

sparse sets of leaders by solving (1.8) with g as the `1 norm and increasing γ until its solution

had the desired number of nonzero elements. Interestingly, the metrics we use are biased against

selecting leaders which are close in the network. In this context, such proximity would correspond

to teams who have many common opponents.

Fig. 7.3 shows the network generated. The large connected component in the center repre-

sents the teams in the top division. Since our dataset included games played between teams

from the top division and lower divisions but not from games played between teams of the lower

divisions, there are a number of topgoraphically isolated nodes.

Table 7.1 shows sets of 2, 4, 6, and 8 leaders with the corresponding end-of-season rankings

from the Associated Press (AP) [220]. This approach selected teams which agree well with the

AP rankings with the notable exception of Southern Illinois. This team played only one game in

the our set – a close loss to a poorly ranked team. We ascribe this anomaly to the the topological

isolation of this team.

144

Figure 7.3: Network of games in the 2015−2016 College Football season. The central connectedcomponent corresponds to the top division, and the distal nodes are lower division teams forwhom there are less data.

7.5 Computational experiments for combination drug ther-

apy

System (7.1) also arises in the modeling of combination drug therapy [89,90,221–223], which we

first discuss in Section 1.2.1. It provides a model for the evolution of populations of mutants

of the HIV virus ψ in the presence of a combination of drugs x. The HIV virus is known to

be present in the body in the form of different mutant strands; in (7.1), the ith component of

the state vector ψ represents the population of the ith HIV mutant. The diagonal entries of

the matrix A represent the net replication rate of each mutant, and the off diagonal entries of

A, which are all nonnegative, represent the rate of mutation from one mutant to another. The

control input xk is the dose of drug k and the kth column of the matrix Fx in F (x) = diag (Fxx)

specifies at what rate drug k kills each HIV mutant.

7.5.1 Example: Simple problem with nondifferentiable f∞

The mutation patterns of viruses need not be connected. In Fig. 7.5, we show a sample mutation

network with 2 disconnected components. For this network, the H∞ norm is nondifferentiable

when x1 = x2. Nondifferentiability and the lack of an efficiently computable proximal operator

145

γ 0.225 0.6 1 10Teams 8 6 4 2

AP #1 Alabama Alabama Alabama AlabamaAP #4 Ohio State Ohio State Ohio State Ohio StateAP #2 Clemson Clemson ClemsonAP #8 Houston Houston HoustonAP #3 Stanford StanfordAP NR S. Illinois S. Illinois

AP #12 MichiganAP #10 Mississippi

Table 7.1: Leaders selected for different values of γ.

necessitates the use of subgradient methods for solving

minimizex

f∞(x) + xTx.

As shown in Fig. 7.6 with h(x) := f∞(x) + xTx, subgradient methods are not descent methods

so small constant or a divergent series of diminishing step-sizes must be employed.

We compare the performance of the subgradient method with a constant step-size of 10−2

(blue) and a diminishing step-size 7×10−2

k (red) with our optimal subgradient method in which

the step-size is chosen via backtracking to ensure descent of the objective function (yellow). We

show the objective function value with respect to iteration number in Fig. 7.6a and the iterates

xk in the (x1, x2)-plane in Fig. 7.6b.

The subgradient methods were run for 1000 iterations as there is no principled stopping

criterion. Our optimal subgradient method identified the optimal point, to an accuracy of 10−4

(i.e., there was a v ∈ ∂(f∞(x) + xTx) such that ‖v‖ ≤ 10−4) in 23 iterations.

7.5.2 Example: Real world drug therapy problem

Following [89, 90] and using data from [200], we study a system with 35 mutants and 5 drugs.

The sparsity pattern of the matrix A, shown in Fig. 7.7, corresponds to the mutation pattern

and replication rates of 33 mutants and F (x) specifies the effect of drug therapy. Two mutants

are not shown in Fig. 7.7a as they have no mutation pathways to or from other mutants.

Several clinically relevant properties, such as maximum dose or budget constraints, may

be directly enforced in our formulation. Other combinatorial conditions can be promoted via

convex penalties, such as drug j requiring drug i via xi ≥ xj or mutual exclusivity of drugs i

and j via xi + xj ≤ 1. We design optimal drug doses using two convex regularizers g.

146

(a) f2 optimal leaders (b) f∞ optimal leaders

100∗

(f(x

)/flb

(N)−

1)

Number of leaders N

(c) f2 optimal leaders

Number of leaders N

(d) f∞ optimal leaders

Figure 7.4: C. Elegans neural network with N = 10 (a) f2 and (b) f∞ leaders along with the(c) f2 and (d) f∞ performance of varying numbers of leaders N relative to f lb(N). In all cases,leaders are selected via “rounding”.

Budget constraint

We impose a unit budget constraint on the drug doses and solve the f2 and f∞ problems using

proximal gradient methods [131, 134]. These can be cast in the form (1.8), where g is the

indicator function associated with the probability simplex P. Table 7.2 contains the optimal

doses and illustrates the tradeoff between H2 and H∞ performance.

147

x1

x2 x4

x3

u2u1

1 11 1

1 11 1

︸︷︷︸

A

−1−1 .1.1 −1−1

︸︷︷︸

Fx

Figure 7.5: A directed network and corresponding A matrix for a virus with 4 mutants and 2drugs. For this system, f∞ is nondifferentiable.

Antibody x?2 x?∞3BC176 0.5952 0.9875

PG16 0 045-46G54W 0.2484 0.0125

PGT128 0.1564 010-1074 0 0

f2(x?2) 0.6017f2(x?∞) 1.1947

f∞(x?2) 0.1857f∞(x?∞) 0.1084

Table 7.2: Optimal budgeted doses and H2/H∞ performance.

Sparsity-promoting framework

Although the above budget constraint is naturally sparsity-promoting, in Algorithm 5 we aug-

ment a quadratically regularized optimal control problem with a reweighted `1 norm [121] to

select a homotopy path of successively sparser sets of drugs. We then perform a ‘polishing’

step to design the optimal doses of the selected set of drugs. We use 50 logarithmically spaced

increments of the regularization parameter γ between 0.01 and 10 to identify the drugs and then

replace the weighted `1 penalty with a constraint to prescribe the selected drugs. In Fig. 7.8, we

show performance degradation (in percents) relative to the optimal dose that uses all 5 drugs

with B = C = I and R = I.

7.6 Time-varying controllers

While references [89,90,112,115,222] and the previous section either assume a constant control

signal or use heuristics to introduce time dependence, we show that such a constant input is in

fact optimal for the induced power norm. We cast the optimal synthesis problem of constant

148h

(xk)−h

(x?)

Iteration k

(a) Descent of objective function.

x2

x1

(b) Iterates in the (x1, x2)-plane.

Figure 7.6: Comparison of different algorithms starting from initial condition [2.5 2.8]T . Thealgorithms are the subgradient method with a constant step-size (dotted blue · · · ), the subgra-dient method with a diminishing step-size (solid red —) and our optimal subgradient methodwhere the step-size is chosen via backtracking to ensure descent of the objective function (dashedyellow - - -)

Algorithm 5 Sparsity-promoting algorithm for N drugs

Set γ > 0, R 0, w = 1, ε > 0while card(xγ) > N do

xγ = argminx

f(x) + xTRx + γ∑i

wi |xi|

increase γ, set wi = 1/(xi + ε)endx?N = argmin

xf(x) + xTRx

subject to sp (x) ⊆ sp (xγ)

control inputs as a finite-dimensional non-smooth convex optimization problem and develop an

algorithm for designing the optimal controller.

7.6.1 Preliminaries

The space of square integrable signals is denoted by L2. The inner product in this space is given

by

〈u, v〉2 :=

∫ ∞0

uT (t) v(t) dt

149

(a) Network of HIV mutation pathways (b) Sparsity pattern of A

Figure 7.7: Mutation pattern of HIV.

Per

cent

deg

rad

ati

on

off

Number of antibodies

Figure 7.8: Percent performance degradation for H2 (solid red —×—) and H∞ (dashed blue−− −−) performance relative to using all 5 drugs.

with the associated norm ‖v‖22 = 〈v, v〉2. The L2 signals are those for which ‖v‖22 is finite and

the locally L2 signals are those for which∫

ΩvT (t) v(t) dt is finite over any compact set Ω, e.g.,

t ∈ [0, T ] for a finite T . The power semi-norm of a signal v is

‖v‖2pow := lim supT→∞

1

T

∫ T

0

vT (t) v(t) dt. (7.11)

The space of trigonometric polynomials is defined as

T :=

g : R→ Rn

∣∣∣∣∣ g(t) =

N∑k= 1

αkejλkt, λk ∈ R, αk ∈ C

,

where j is the imaginary unit.

The closure of T in the space of locally L2 integrable functions with bounded power norm

150

with respect to the metric ‖f − g‖pow is given by the space of Besicovitch almost periodic

functions. Since ‖·‖pow is a seminorm, we consider B2, the space of Besicovitch almost periodic

functions modulo functions with zero power norm [224]. Factoring out signals with zero power

norm is natural in this setting because our performance metric is the power norm of a regulated

output. The inner product associated with this Hilbert space is [225,226]

〈u, v〉 := lim supT→∞

1

T

∫ T

0

uT (t) v(t) dt

and the norm ‖ · ‖pow. The mean of a signal v ∈ B2

M(v) := limT →∞

1

T

∫ T

0

v(t) dt

is well defined for every v ∈ B2. Furthermore, each v ∈ B2 can be decomposed uniquely as

v = v + v, where v is a constant signal given by v = M(v) and M(v) = 0. Note that the inner

product between a constant signal v and a zero-mean signal v is zero, i.e., 〈v, v〉 = 0.

The space B2 contains all bounded L2 signals, periodic signals, and almost periodic signals.

At the same time, it alleviates challenges arising from the fact that the space of signals with

bounded power norm is not a Hilbert space; for additional discussion see [227].


Consider bilinear positive system (7.1) where x is now allowed to be time-varying and x(t) ∈ Rm,

ψ = (A + F (x(t)))ψ + B d. (7.12a)

For given control and disturbance signals x ∈ B2 and d ∈ B2, we associate the performance

output,

ζx,d =

Q1/2

0

ζ +

0

R1/2

x (7.12b)

with (7.12a), where Q 0 and R 0 are the state and control weights. Under Assump-

tion 3, (7.12) is a positive system. This implies that for every control input x, every nonnegative

disturbance d, and every nonnegative initial condition ψ(0), the state ψ and the output ζx,d of

system (7.12) remain nonnegative at all times.

151

The induced power norm of a stable system (7.12) is,

f(x) :=

sup

‖d‖2pow≤ 1

‖ζx,d‖2pow, system (7.12) stable

∞, otherwise

(7.13)

and it quantifies the response to the worst case persistent disturbance d. Since system (7.12) is

nonlinear, we cannot separate the initial condition and input response and we must define our

notion of stability. We assume that ψ(0) = 0 and consider stability in a bounded input-bounded

output sense.

Definition 6. A control signal x is stabilizing for nonlinear system (7.12) if for every bounded

d ∈ B2, ζx,d is bounded and has finite power, ‖ζx,d‖pow <∞.

For unstable open-loop systems (7.12), there may be no stabilizing control input x in L2.

Thus, the L2-induced gain does not provide a suitable measure of input-output amplification

for (7.12) and f(x) represents an appropriate generalization of the H∞ norm for this class of

bilinear positive systems.

We now formulate the optimal control problem.

Problem 2. Design a stabilizing bounded control signal x ∈ B2 to minimize f(x) for bilinear

positive system (7.12).

7.6.3 Solution to the optimal control problem

In this section, we prove that a constant control input solves Problem 2.

Since f(x) is given by (7.13), any x? which solves Problem 2 satisfies ‖ζx?,d‖2pow ≤ f(x?) ≤f(x) for all x ∈ B2 and d ∈ B2. In particular, ‖ζx?,d‖2pow ≤ f(x) where x is a constant control

input. As shown in [54], for constant control inputs the worst-case disturbance d is also constant,

i.e., f(x) = ‖ζx,d‖2pow. In what follows, we show that ‖ζx,d‖2pow is a convex function of x, that a

constant x minimizes it, and, thus, that a constant control input solves Problem 2.

We first establish convexity of ‖ζx,d‖2pow.

Lemma 7.6.1. Let d(t) = d be a constant non-negative disturbance. Then, under Assumption 3,

the power norm of the output ‖ζx,d‖2pow is a convex function of x ∈ B2.

Proof. We first show that

ψT (t)Qψ(t) + xT (t)Rx(t) (7.14)

152

is a convex function of x ∈ L2[0, t] for any t and then extend this statement to complete the

proof. The matrix R is positive definite so the second term on the right-hand side of (7.14) is a

convex function of x(t). By [58, Theorem 2] every component of ψ(t) is a convex function of the

control input x ∈ L2[0, t]. Since the matrix Q 0 is positive semidefinite and has nonnegative

entries, ψ(t)TQψ(t) is convex and nondecreasing in the elements of ψ(t). Thus, the composition

rules for convex functions [123, Section 3.2.4] imply that ψT (t)Qψ(t) is a convex function of

u ∈ L2[0, t].

Since the integral preserves convexity,

1

T

∫ T

0

(ψT (t)Qψ(t) + xT (t)Rx(t)

)dt

is a convex function of x ∈ L2[0, T ] Let PT denote the mapping that truncates the support of

a signal v to [0, T ]. If there is T such that PT v is not L2 integrable, the power seminorm (7.11)

of v is infinite. This implies that, for any v ∈ B2, PT v ∈ L2[0, T ]. Thus, any v ∈ B2 can be

written as limT→∞ PT v of PT v ∈ L2[0, T ]. Since the limit and supremum preserve convexity,

‖ζx,d‖2pow = lim supT→∞

1

T

∫ T

0

(ψT (t)Qψ(t) + xT (t)Rx(t)

)dt

is a convex function of x ∈ B2.

Lemma 7.6.2. Let x be a stabilizing constant control input for (7.12a) and d be a constant

non-negative disturbance. Then, the directional derivative of ‖ζx,d‖2pow evaluated at x is zero for

any bounded zero-mean variation x ∈ B2.

Proof. The dynamics (7.12a) with control input x+ εx and constant disturbance d are

ψ = (A + F (x + εx))ψ + B d. (7.15)

Since the unperturbed system (with ε = 0) is exponentially stable and boundedness of x implies

that the solution x(t) is continuous in ε [183, Theorem 3.5]. Therefore, (7.15) represents a system

in a regularly perturbed form [183, 228]. The Taylor series expansion can be used to write the

solution to the perturbed dynamics as,

ψ(t) = ψ(t) + ε ψ(t) + O(ε2) (7.16)

153

where ψ is the nominal solution that solves (7.15) for ε = 0

˙ψ = (A + F (x)) ψ + B d. (7.17a)

and ψ represents the first-order correction. By [183, Theorem 10.2], ψ is given by the solu-

tion to a differential equation corresponding to the O(ε) terms in the expression obtained by

substituting (7.16) into equation (7.15),

( ˙ψ + ε˙ψ + O(ε2)) =

[(A + F (x)) ψ + B d

]+ ε

[(A + F (x)) ψ + F (u) ψ

]+ O(ε2).

Collecting the O(ε) terms yields the dynamics that governs the evolution of ψ,

˙ψ = (A + F (x)) ψ + F (x) ψ. (7.17b)

Both (7.17a) and (7.17b) are stable LTI systems. Thus, for the constant disturbance d(t) = d, the

solution ψ to (7.17a) is asymptotically constant. Since x is zero-mean, the signal F (x) ψ is also

zero-mean, i.e., it has no component that contains the zero temporal frequency. Furthermore,

since (7.17b) is a stable LTI system, the frequency components in ψ must correspond to frequency

components in the input, F (x) ψ. Thus, ψ has no frequency component corresponding to the

zero temporal frequency and therefore ψ is also zero-mean.

Because ψ is asymptotically constant and ψ is zero-mean,⟨Q1/2ψ, Q1/2ψ

⟩= 0. Similarly,⟨

R1/2x, R1/2x⟩

= 0. Furthermore, ‖ζx,d‖2pow =⟨Q1/2ψ, Q1/2ψ

⟩+⟨R1/2x, R1/2x

⟩, and we have,

‖z(x+εx),d‖2pow − ‖zx,d‖2pow = 2 ε(⟨Q1/2ψ, Q1/2ψ

⟩+⟨R1/2x, R1/2x

⟩)+ O(ε2)

= O(ε2).(7.18)

Thus, the first order correction to ‖ζx,d‖2pow evaluated at a constant control input x is zero,

which completes the proof.

Remark 7.6.3. Lemma 7.6.2 does not need Assumption 3 and it holds for all bilinear systems

of the form (7.12a) for which there is a stabilizing constant control input.

Lemma 7.6.2 implies that a constant control signal x?

x? ∈ argminx constant

‖ζx,d‖2pow,

154

is a stationary point of ‖ζx,d‖2pow under Assumption 3. Since ‖ζx,d‖2pow is a convex function of

x by Lemma 7.6.1, x? is a local and therefore global minimizer. The following theorem relates

the power norm of the output of system (7.12) subject to a constant disturbance, ‖ζx,d‖2pow,

with the worst case power norm amplification f(x) and shows that the constant control signal

x? solves Problem 2.

Theorem 7.6.4. Let Assumption 3 hold. Then, a constant control input x(t) = x? solves

Problem 2.

Proof. Let x? minimize f(x) over the space of constant functions. Since (7.12a) with a constant

control input is an LTI system, the maximum power-amplification coincides with the H∞ norm.

Moreover, because (7.12a) is a positive system, the maximal singular value of the frequency

response matrix peaks at zero temporal frequency and the worst-case disturbance is a constant

signal [54]. This implies that f(x?) = ‖ζx?,d‖2pow where d = v ≥ 0 is a constant nonnegative

disturbance and v is the right principal singular vector of the matrix −Q 12 (A+ F (x?))−1B.

Suppose there exists a time varying signal x ∈ B2 such that f(x) < f(x?). Then, since f

measures the worst-case disturbance amplification, x must also decrease the power norm of the

output of system (7.12) for a constant disturbance d = v, i.e.,

‖ζx,d‖2pow ≤ f(x) < ‖ζx?,d‖2pow = f(x?). (7.19)

We next show that this is not possible.

Every bounded x ∈ B2 can be written as x = x+ x where x is constant and x is bounded and

zero-mean. Since x? minimizes ‖ζx,d‖2pow over constant control inputs by definition, Lemma 7.6.2

implies that x? is a stationary point of ‖ζx,d‖2pow over all bounded x ∈ B2. Convexity of ‖ζx,d‖2pow

in x by Lemma 7.6.1 thus implies that x? is a local and therefore global minimizer of ‖ζx,d‖2pow,

contradicting (7.19) and completing the proof.

Chapter 8

Actuator or sensor selection

In traditional applications, controller or observer design deals with the problem of how to use

a pre-specified configuration of sensors and actuators in order to attain the desired objective.

In general, the best performance is achieved by using all of the available sensors or actuators.

However, this option may be computationally or economically infeasible. We thus consider the

problem of selecting a subset of available sensors or actuators in order to gracefully degrade

performance relative to the setup where all of them are used.

Typically, sensor/actuator selection and placement is performed by a designer with expert

knowledge of the system. However, in large-scale applications and systems with complex inter-

actions, it can be difficult to do this effectively. For linear time-invariant dynamical systems, we

develop a framework and an efficient algorithm to systematically choose sensors and actuators

via convex optimization.

Our starting point is a formulation with an abundance of potential sensors or actuators.

This setup can encode information about different types or different placements of sensing and

actuating capabilities. We consider the problem of selecting subsets of available options from

this full model. Applications of this formulation range from placement of Phasor Measurement

Units (PMUs) in power systems, to placement of sensors and actuators along an aircraft wing,

to the distribution of GPS units in a formation of multi-vehicle systems.

The problem of interest is a difficult combinatorial optimization problem. Although there is

a wide body of previous work in this area, most of the available literature either uses heuristic

methods or does not consider dynamical models. In [128], the authors provide a convex sensor

selection formulation for a problem with linear measurements. The authors of [229] select sparse

subsets of sensors to minimize the Cramer-Rao bound of a class of nonlinear measurement

155

156

models. The placement of PMUs in power systems was formulated as a variation of the optimal

experiment design in [230]. Actuator selection via genetic algorithms was explored in [231]. A

non-convex formulation of the joint sensor and actuator placement was provided in [232,233] and

it was recently applied to the linearized Ginzburg-Landau equation [234]. The leader selection

problem in consensus networks can be seen as a type of a structured joint actuator and sensor

selection problem which admits a convex relaxation [80] and even an analytical solution for one

or two leaders [78]. However, this formulation does not extend naturally to broader classes of

problems.

The sparsity-promoting framework introduced in [62–64] can be used to obtain block-sparse

structured feedback and observer gains and select actuators or sensors. Indeed, algorithms de-

veloped in [64] have been used in [235] for the sensor selection in a target tracking problem.

However, these algorithms have been developed for general structured control/estimation prob-

lems and they do not exploit the hidden convexity of the actuator/sensor selection problem.

In [104], the authors introduced a convex semidefinite programming (SDP) characterization

of the problem formulation considered in [64] for enhancing certain forms of sparsity in the

feedback gain. Although sensor and actuator selection falls into the class of problems considered

by [104], generic SDP solvers cannot handle large-scale applications. Since we are interested in

high-dimensional systems with many sensors/actuators, we use the alternating direction method

of multipliers (ADMM) [132] and proximal gradient [131] methods to develop a customized solver

that is well-suited for large problems. In contrast to standard SDP solvers, whose computational

complexity scales unfavorably with the number of states/sensors/actuators, the worst case per-

iteration complexity of our method scales only with the number of states. Furthermore, our

algorithm performs much better than standard SDP solvers in numerical experiments.

8.1 Problem formulation

8.1.1 Actuator selection

Consider the standard state-space system first presented in (1.2),

ψ = Aψ + B1 d + B2 u

ζ =

Q1/2

0

ψ +

0

R1/2

u

157

where d is a zero-mean white stochastic process with covariance Vd and the pair (A,B2) is

controllable. The optimal H2 controller minimizes the steady-state variance,

limt→∞

E(ψT (t)Qψ(t) + uT (t)Ru(t)

)where Q = QT 0 specifies a weight on the system states, and R = RT 0 specifies the penalty

on the control input. The global optimal controller for this problem is a state feedback law of

the form u = −Xψ. Although this controller is readily computed by solving the corresponding

algebraic Riccati equation, it typically uses all input channels and thus all available actuators.

We are interested in designing an optimal controller which uses a subset of the available

actuators. This will be accomplished by augmenting the H2 performance index with a term

that promotes row-sparsity of the feedback gain matrix X. The resulting problem can be cast

as a semidefinite program and thus solved efficiently for small problems.

SDP Formulation

Under the state feedback control law, u = −Xψ, the closed-loop system is given by

ψ = (A − B2X)ψ + B1 d

z =

Q1/2

−R1/2X

ψ. (8.1)

The H2 norm of system (8.1) is determined by

f2(X) = trace(QP + XTRX P

)where P = PT 0 is the controllability gramian of the closed loop system,

(A − B2X)P + P (A − B2X)T + B1 VdBT1 = 0.

Since P is positive definite and therefore invertible, the standard change of coordinates Z := XP

can be used to express f(X) in terms of P and Z [49],

f2(P,Z) = trace(QP + P−1 ZTRZ

)

158

and to bring the optimal H2 problem into the following form

minimizeP,Z

f2(P,Z)

subject to AP + PAT −B2Z − ZTBT2 + V = 0

P 0

(8.2)

with V := B1VdBT1 . By taking the Schur complement of P−1ZTRZ this problem can be

expressed as an SDP [49]. Finally, the optimal feedback gain can be recovered by X = ZP−1.

In what follows, we use this SDP characterization to introduce the actuator selection problem.

Sparsity Structure

When the ith row of X is identically equal to zero, the ith control input is not used. Therefore,

obtaining a control law which uses only a subset of available actuators can be achieved by

promoting row-sparsity of X. Our developments are facilitated by the equivalence between the

row-sparsity of X and Z; the ith row of Z is equal to zero if and only if the ith row of Z = XP

is equal to zero [104].

Drawing on the group-sparsity paradigm [122], we augment (8.2) with a sparsity-promoting

penalty on the rows of Z,

minimize f2(P,Z) + γ g2(Z)

subject to AP + PAT −B2Z − ZTBT2 + V = 0

P 0.

(8.3)

where γ > 0 specifies the importance of sparsity relative to H2 performance and

g2(Z) :=

m∑i= 1

wi ‖eTi Z‖2

is a row-sparsity-promoting penalty function in which wi are nonzero weights and ei is the ith

unit vector.

This problem can still be cast as an SDP and standard solvers can be used to find its

solution. However, since generic SDP solvers do not exploit the structure of (8.3), they do not

scale gracefully with problem dimension.

159

8.1.2 Sensor selection

The sensor selection problem can be approached in a similar manner. Consider a linear time-

invariant system,

ψ = As ψ + B1 d

y = C ψ + η

where d and η are zero-mean white stochastic processes with covariances Vd and Vη, respectively,

and (As, C) is an observable pair. The observer,

˙ψ = As ψ + L (y − y)

= As ψ + LC(ψ − ψ

)+ Lη

estimates the state x from the noisy measurements y using a linear injection term with an

observer gain L. For the Hurwitz matrix As − LC, the zero-mean estimate of ψ is given by ψ

and the dynamics of the estimation error ψ := ψ − ψ are governed by

˙ψ = (As − LC) ψ + B1 d − Lη. (8.4)

The variance amplification from the noise sources d and η to the estimation error ψ is determined

by

fo(L) = trace (PoB1 VdBT1 + Po LVη L

T ) (8.5)

where Po is the observability gramian of the error system (8.4),

(As − LC)T Po + Po (As − LC) + I = 0.

The Kalman filter gain L, resulting from the observer Riccati equation, provides an optimal

observer with the smallest variance amplification.

Our objective is to design a Kalman filter which uses a subset of the available sensors.

This can be achieved by enhancing column-sparsity of the observer gain L. Since the change of

coordinates Zs := PoL preserves column sparsity of L, we formulate the sensor selection problem

160

as

minimizePo, Zs

fo(Po, Zs) + γ g2(ZTs )

subject to ATs Po + PoAs − CTZTs − ZsC + I = 0

Po 0

(8.6)

where

fo(Po, Zs) = trace(PoB1 VdB

T1 + P−1

o Zs Vη ZTs

).

We note that the sensor selection problem (8.6) can be obtained from the actuator selection

problem (8.3) by setting the problem data in (8.3) to

A = ATs , B2 = CT , Q = B1 VdBT1

V = I, R = Vη

(8.7)

and recovering the variables Po = P and Zs = ZT .

8.2 Customized algorithm

We next develop an efficient algorithm for solving the actuator selection problem (8.3); the

solution to the sensor selection problem (8.6) can be obtained by mapping it to (8.3) via (8.7).

The challenges in solving the optimization problem (8.3) arise from

• the positive definite constraint;

• the linear Lyapunov-like constraint;

• the non-smoothness of the sparsity-promoting term.

To ensure positive definiteness of P , we use projected descent techniques to optimize over the

positive definite cone. We dualize the linear constraint and split (8.3) into two simpler subprob-

lems over P and Z via the alternating direction method of multipliers (ADMM). This splitting

separates the objective function component associated with the positive definite constraint (the

subproblem over P ) from the non-differentiable component (the subproblem over Z). Since the

subproblem over Z is efficiently solvable, our method scales well with the number of sensors or

actuators.

We employ a projected version of Newton’s method to solve the P -subproblem. The Newton

search direction is obtained via a conjugate gradient algorithm. The Z-subproblem is solved

161

with a proximal method.

8.2.1 Alternating direction method of multipliers

To employ ADMM, we first form the augmented Lagrangian corresponding to the optimization

problem (8.3),

Lµ(P,Z;Y ) := f2(P,Z) + γ g2(Z) + 〈Y, h(P,Z)〉 + 12µ‖h(P,Z)‖2F

where h(X,Y ) = 0 is the linear Lyapunov-like constraint and

h(P,Z) := AP + PAT − B2Z − ZTBT2 + V.

For this problem, the ADMM iteration discussed in Section 2.2.3 [132] is,

P k+1 = argminP

Lµ(P, Zk, Y k)

Zk+1 = argminZ

Lµ(P k+1, Z, Y k)

Y k+1 = Y k + 1µ h(P k+1, Zk+1).

The stopping criteria depend on the primal residual, which quantifies how well P k and Zk

satisfy the linear constraint, and the dual residual, which quantifies the difference between Zk

and Zk−1. We refer the reader to [132] for details.

We note that the proximal augmented Lagrangian reformulation of Lµ described in Sec-

tion 2.3 is not possible here because the linear constraint involves nondiagonal operators which

act on both P and Z.

8.2.2 P -minimization step

Minimization of Lµ over P is step (2.6a) in ADMM and can be equivalently expressed as

minimizeP

f(P,Zk) + 12µ ‖h(P,Zk) + µY k‖2F

subject to P 0.(8.8)

We use a projected version of Newton’s method to solve a sequence of quadratic approximations

to (8.8). In what follows, P l denotes the sequence of inner iterations that converges to the

solution of (8.8).

162

Newton’s method

At each inner iteration P l, Newton’s method does a line search from P l in the direction P l

which minimizes the quadratic approximation of (8.8),

P l := argminP

12

⟨HP l(P ), P

⟩+⟨∇PLµ(P l, Zk, Y k), P

⟩(8.9)

where HP l(P ) is a linear function of P that contains information about the Hessian of Lµ. The

gradient of Lµ with respect to P is given by

∇PLµ(P l, Zk, Y k) := Q − P−1ZTRZP−1 + AT(Y + 1

µ h(P,Z))

+(Y + 1

µ h(P,Z))A.

The quadratic term used in (8.9) is,

12 vec(P )T (∇2

PLµ) vec(P )

where ∇2PLµ is the Hessian of the objective function in (8.8). It can be more conveniently

expressed as 12

⟨H(P ), P

⟩where HP is the linear operator

HP (P ) := H1,P (P ) + 1µ H2(P ). (8.10)

The first term in (8.10) comes from the performance index,

H1,P (P ) := P−1ZTRZP−1 P P−1 + P−1 P P−1ZTRZP−1

and the second term comes from the constraint penalties,

H2(P ) := ATA P + P ATA + A P A + AT P AT .

Solving the linear equation

HP l(P ) + ∇PLµ(P l, Zk, Y k) = 0

yields the Newton direction P l which is computed using the conjugate gradient method [135].

163

Projection

The set C := P |P εI approximates the positive definite cone P 0 for small ε > 0. Once

the Newton direction is determined, the step

P = PC(P l + α P l

)is taken, where PC is the projection on the set C and the step size α is chosen using an Armijo

backtracking search.

To project the symmetric matrix M onto C, its eigenvalues are projected onto the set

λi ≥ ε. From the eigenvalue decomposition M = U diag(λ)UT , where λ is a vector of the

eigenvalues and U is a matrix of the corresponding eigenvectors, the projection is PC(M) =

U diag (max (λ, ε))UT .

While Netwon’s method reduces the number of required steps, computing the search direc-

tions can be prohibitively expensive for large-scale systems. In Section 8.4, we will explore a

proximal gradient algorithm for solving problem (8.3).

8.2.3 Z-minimization step

The Z-minimization step is equivalent to solving,

minimizeZ

γ g2(Z) + trace ((P k+1)−1ZTRZ) + 12µ‖h(P k+1, Z) + µY k‖2F . (8.11)

The objective function is the sum of a quadratic term and a separable sum of `2 norms: a problem

form commonly referred to as group LASSO. We employ the proximal gradient method [131] to

solve this problem. The gradient of the smooth part of (8.11) evaluated at Z is

2RZP−1 + 2µ

(BT2 Z

TBT2 +BT2 B2Z)− 2BT2

(Y + 1

µ (AP + PAT + V )).

The proximal operator for g2 is the block shrinkage operator,

Sα(eTi Z) =

(1 − α/‖eTi Z‖2

)eTi Z, ‖eTi Z‖2 > α

0, ‖eTi Z‖2 ≤ α.

164

8.2.4 Iterative reweighting

Inspired by [121], we employ an iterative reweighing scheme to select the weights wi in the

sparsity-promoting term∑i wi‖ai‖2 to obtain sparser structures at lower values of γ. The

authors in [121] noted that if wi = 1/‖ai‖2, then there is an exact correspondence between the

weighted `1 norm and the cardinality function. However, implementing such weights requires a

priori knowledge of the values ‖ai‖2 at the optimal a. Consequently, we implement a reweighting

scheme in which we run the algorithm multiple times for each value of γ and update the weights

as,

wj+1i =

1

‖ eTi Z ‖2 + ε(8.12)

where ε > 0 ensures that the update is always well-defined.

8.3 Example: Mass-spring system

We use a simple mass-spring system to illustrate the utility of our algorithm in the sensor

selection problem. This system has a clear intuitive interpretation and its dimension can be

easily scaled while retaining the problem structure. Consider a series of masses connected by

linear springs. With unit masses, unit spring constants and no friction, the dynamics of each

mass are described by

pi = −(pi − pi+1) − (pi − pi−1) + di

where pi is the position of the ith mass. When the first and last masses are affixed to rigid

bodies, the aggregate dynamics are given by p

v

=

0 I

−T 0

p

v

+

0

I

dwhere p, v, and d are the position, velocity and disturbance vectors, and T is a Toeplitz matrix

with 2 on the main diagonal and −1 on the first super- and sub-diagonals. The possible sensor

outputs are the position and velocity vectors.

8.3.1 Algorithm speed and computational complexity

The complexity of solving the sensor selection SDP with interior point methods is O((n+ r)6),

where n is the dimension of the state-space and r is the number of sensors. In our algorithm,

165

Com

pu

tati

on

tim

e(s

)

Number of states

Figure 8.1: Scaling of computation time with the number of states for CVX and for ADMM formass-spring system with γ = 100 and position and velocity outputs. Empirically, we observethat CVX scales roughly with n6 while ADMM scales roughly with n3.

the greatest cost is incurred by computing the Newton direction in the P -minimization step.

Since P ∈ Rn×n, the worst case complexity of computing Newton direction is O(n6). This

is because each conjugate gradient step takes O(n3) operations and, in general, n2 conjugate

gradient steps are required to obtain convergence. In well-conditioned problems, the conjugate

gradient method achieves high accuracy much faster which significantly reduces computational

complexity. The Z-minimization step has a computational cost of only O(nr), so the overall

cost per ADMM iteration is O(n6) unless r ≥ n5.

Figure 8.1 shows the time required by ADMM and by CVX [203, 236] to solve (8.6) with

γ = 100 for mass-spring systems of increasing sizes. Our algorithm scales favorably here and it

scales much better when only the number of sensors is varied. Figure 8.2 shows scaling with just

the number of outputs. As the number of sensors increases, CVX’s computation time increases

while the computation time of ADMM barely changes.

166

Com

pu

tati

on

tim

e(s

)

Number of sensors

Figure 8.2: Scaling of computation time with the number of sensors for CVX and for ADMMfor mass-spring system with γ = 100 and n = 50. Outputs are random linear combinations ofthe states.

8.3.2 Sensor selection

We consider a system with 20 masses (40 states) and potential position and velocity measure-

ments for each mass. As γ increases, sparser observer structures are uncovered at the cost

of compromising quality of estimation. The tradeoff between the number of sensors and the

variance of the estimation error is shown in Figure 8.3.

Figures 8.4 and 8.5 show the position and velocity sensor topologies identified by the ADMM

algorithm as γ is increased. To save space, only novel topologies are shown. The selected sensor

configurations have symmetric topology, which is expected for this example. Notably, velocity

measurements are, in general, more important but the locations of the most important position

and velocity sensors differ.

167

Per

cent

incr

ease

inf o

(L)

Number of sensors

Figure 8.3: Percent increase in fo(L) in terms of the number of sensors.

Mass index

Figure 8.4: Retained position sensors as γ increases. A blue dot indicates that the position ofthe corresponding mass is being measured. The top row shows the densest sensor topology, andthe bottom row shows the sparsest.

168

Mass index

Figure 8.5: Retained velocity sensors as γ increases. A blue dot indicates that the velocity ofthe corresponding mass is being measured. The top row shows the densest sensor topology, andthe bottom row shows the sparsest.

8.3.3 Iterative reweighting

Figure 8.6 illustrates the utility of iterative reweighting. When constant sparsity-promoting

weights are used, large values of γ are required to identify sparse structures. Here, we run our

ADMM algorithm 3 times for each value of γ, updating the weights using (8.12) and retaining

them as we increase γ.

8.4 Proximal gradient method

When the system is open-loop stable, it is possible to express the objective function solely in

terms of Z and to solve the regularized problem using proximal gradient descent. For simple

problems, such as the mass-spring system, this shows good performance relative to ADMM.

8.4.1 Elimination of P

The linear constraint can be abstractly written as,

A(P ) − B(Z) + V = 0.

169

Nu

mb

erof

sen

sors

use

d

γ

Figure 8.6: Number of sensors versus γ for a scheme which uses iterative reweighting and for thescheme which uses constant weights. Iterative reweighting promotes sparser structures earlier.

where

A(P ) := AP + PAT

B(Z) := B2Z + ZTBT2 .

We assume that the linear map A is invertible, which occurs if and only if the matrix A has no

eigenvalues with real part 0 and no eigenvalues of A have opposite real part; i.e, no eigenvalue

of A has a real part which is the negative of the real part of any other eigenvalue of A. We can

then write,

P = A−1 (B(Z) − V ) .

8.4.2 Gradient

For notational convenience, we partition the smooth part of the objective function into two

components,

fa(P ) := 〈Q, P 〉

fb(P,Z) :=⟨P−1, ZTRZ

⟩

170

Gradient of fa

Using the explicit expression for P and the properties of linear operators, the first component

of the objective function can be rewritten in terms of Z,

〈Q, P 〉 =⟨Q, A−1 (B(Z) − V )

⟩= −

⟨V, A−†(Q)

⟩+⟨Z, B†(A−†(Q))

⟩where † indicates the adjoint of a linear operator, and

A†(P ) = ATP + PA

B†(P ) = 2BT2 P.

Therefore,

∇Zfa = − 2B2TQad

where

AT Qad + QadA + Q = 0.

Gradient of fb

The gradient of fb can be computed via perturbation analysis. Note that,

(P + ε P )−1 = P−1 − εP−1 P P−1 + O(ε2).

and that since P is specified by a linear constraint,

A(P + ε P ) − B(Z + ε Z) + V = 0,

the variation of P as a result of a variation in Z is given by

P = A−1(B(Z)).

Perturbing fb, ⟨(P + ε P )−1, (Z + ε Z)T R (Z + ε Z)

⟩and collecting the order ε terms, 2

⟨Z, RZP−1

⟩−⟨Z, B†(A−†(P−1ZTRZP−1))

⟩. From this,

it is clear that

∇Zfb = 2RZP−1 − B†(A−†(P−1ZTRZP−1))

171

and so ∇f(Z) = 2BT2 (Rad − Qad) + 2RZP−1 where

AT Qad + QadA + Q = 0

AT Rad + RadA + P−1 ZT RZ P−1 = 0

AP + P AT − B2 Z − ZT BT2 + V = 0.

BB stepsize initialization

We implement proximal gradient descent and initialize the step-size using the Barzilai-Borwein

(BB) method, which attempts to approximate the Hessian with a scaled version of the identity

matrix. The secant equation,

∇f(Y k) ≈ ∇f(Y k−1) + H(Y k, Y k − Y k−1)

predicts the change in the gradient ∇f(Y ) as a result of a change in Y . Quasi-Newton methods

use the above in conjunction with observed changes in the gradient to build an estimate H of

the Hessian and determine a search direction by solving the linear equation,

H(Y k, Y ) = − ∇f(Y k). (8.13)

BB step size selection restricts the Hessian approximation to be a scaled version of the identity,

H(Z, Z) := 1α Z

with α > 0 and minimizes the residual of the secant equation,

‖∇f(Zk) − ∇f(Zk−1) − 1α (Zk − Zk−1)‖2F

over α to obtain,

αm,0 =‖Zk − Zk−1‖2F

〈Zk−1 − Zk, ∇f(Zk−1) − ∇f(Zk)〉(8.14)

We backtrack from αm,0 by selecting the smallest nonnegative integer r such that αm = crαm,0

with c ∈ (0, 1) such that Zk+1 is stabilizing (i.e., yields a positive definite P k+1 = A−1(B(Zk+1−V ))) and so that Y Z+1 satisfies either an ISTA-like or a SpaRSA-like step size selection rule.

172

ISTA-like step size selection rule

The step size α is chosen so that the approximation of the objective function is an overestimate

of the actual objective function

f(Zk+1) + g(Zk+1) ≤ f(Zk) +⟨Y , ∇f(Zk)

⟩+ 1

2α ‖Z‖2F + g(Zk+1)

where Z := Zk+1 − Zk [131].

SpaRSA-like step size selection rule

In this approach [187], the objective function is only checked every p iterations. At that iteration,

backtracking is used to ensure that,

f(Zk+1) < maxm= k− p,...,k

(f(Zm)) .

Otherwise, backtracking is just used to ensure stability.

8.4.3 Example: Damped mass-spring systems

Since this method requires open-loop stability, we must introduce some damping into the mass-

spring example explored earlier. For the damped mass-spring model with N masses, p

v

=

0 I

T − bI −bI

p

v

+ u + d

where T is a Toeplitz matrix with −2 on the main diagonal and 1 on the first super- and sub-

diagonals. The parameter b represents the strength of damping. Experiments were performed

with γ = 30, Q = I and R = 10I. Figure 8.7 shows the computation time for proximal

gradient with both ISTA (PG) and SpaRSA (PGs) step-size selection, CVX, ADMM, and an

MM algorithm with the proximal augmented Lagrangian from Chapter 2.

8.5 Example: Flexible wing aircraft

Finally, we present an example of the sensor selection algorithm applied to a practical example

from aerospace engineering. One barrier in reducing aircraft weight in order to improve fuel

173

Com

pu

tati

on

tim

e(s

)

Number of states

Figure 8.7: Computation time for mass-spring system damped with b = 0.1 and γ = 30.

Figure 8.8: Body Freedom Flutter flexible wing testbed aircraft.

174N

um

ber

of

sen

sors

use

d

γ

Per

cent

per

form

an

ced

egra

dati

on

Number of sensors

Figure 8.9: (a) Number of sensors as a function of the sparsity-promoting parameter γ; and (b)Performance comparison of the Kalman filter associated with the sets of sensors resulting fromthe regularized sensor selection problem and from truncation.

` 2n

orm

ofro

w

Sensor

Figure 8.10: The `2 norm of each column of the Kalman gain for Kalman filters designed fordifferent sets of sensors.

175

efficiency is that lighter airframes are more flexible and thus susceptible to vibrational instabili-

ties [237]. These instabilities, known as flutter, were behind the famous Tacoma Narrows Bridge

collapse and have been identified as the likely cause of the loss of NASA’s Helios Prototype

aircraft [238].

Recent work has sought to approach this problem by actively damping flutter instabili-

ties [239]. Since active control requires reliable detection of instabilities, selection of sensors is

an important challenge. For the Body Freedom Flutter test aircraft shown in Fig. 8.8 [240], we

use the dual formulation to the actuator selection approach described in Section 8.1.2 in con-

junction with iterative re-weighting algorithm to select sparse sets of sensors. Figure 8.9 shows

the number of sensors as a function of the sparsity-promoting parameter γ and the performance

of Kalman filter with the limited sets of sensors [105].

We compare the performance of the Kalman filter corresponding to the sensors selected

by our approach to the Kalman filter associated with sensors selected by truncation. For the

truncation approach, the Kalman gain matrix corresponding to a set of sensors was computed.

The sensor corresponding to the row with the lowest `2 norm was discarded and the Kalman

gain was recomputed for the new set of sensors. This process was repeated iteratively from the

full set of sensors to a set of two sensors. Clearly, the regularized sensor selection algorithm

selects better subsets of sensors than the truncation approach.

To further justify the use of this algorithm over truncation, we show the `2 norm of different

columns of the Kalman gain in Fig. 8.10. The sensor which has the most effect on the state

estimate changes as the number of sensors is decreased, even when each smaller set of sensors

is a subset of a previous set.

Chapter 9

Conclusions and future directions

Proximal augmented Lagrangian methods

In this thesis, we have developed custom optimization algorithms for composite problems by

introducing an auxiliary variable and reformulating the associated augmented Lagrangian. The

resulting function, which we call the proximal augmented Lagrangian, is continuously differen-

tiable and thus opens the door to many methods for solving the original problem.

The proximal augmented Lagrangian facilitates the method of multipliers by transforming

the primal-minimization step into a tractable, differentiable problem. Differentiability also en-

ables primal-dual gradient updates which, for certain problems, are convenient for a distributed

implementation. Generalizations of the Jacobian for once-differentiable functions allow us to de-

rive second order updates which lead to a globally exponentially convergent differential inclusion

as well as a globally convergent algorithm with asympotitic quadratic convergence.

Structured optimal control via regularized optimization

The motivation behind developing these algorithms was to use regularization to design structured

controllers. We successfully applied those algorithms to the nonconvex problems of edge addition

in directed consensus networks and sparse optimal control. We also identified classes of systems

for which structure-promoting optimal control problems can be cast as convex problems.

We first showed how the symmetric component of a system can be used to pose a regularized

convex controller design problem. The resulting controller is structured and has stability and

performance guarantees for the original system. We used this theory to inform the problems

of edge addition and leader selection in directed consensus networks as well as for combination

176

177

drug therapy design for HIV treatment.

We also considered the decentralized control of positive systems. We showed that the H2

and H∞ optimal problems are convex in the original coordinates, allowing us to pose convex

regularized problems to design structured controllers. We applied this theory to leader selection

in consensus networks and combination drug therapy design for HIV. We then established that

a constant controller is optimal for an induced-power performance index.

Finally, we considered the problem of choosing a limited number of actuators or sensors to

effectively control or observe a system. We posed this problem as a semidefinite program and

developed customized methods to solve it efficiently. We applied this approach to select sensors

for a flexible wing aircraft.

Future directions

Future algorithmic research will investigate how to deal with nonconvexity. Although we can

find local minima of nonconvex problems using the method of multipliers with the proximal

augmented Lagrangian, we have no guarantees on the quality of these local minima. It would

be interesting to explore techniques to discover a variety of local minima or to characterize the

quality of any particular local minimum. Moreover, it would be interesting to explore the use

of nonconvex regularizers, especially in distributed methods.

We will also work to develop techniques for structured time-varying or dynamic controllers.

So far, we have considered exclusively static structured controllers. Although we show that

a static controller achieves the optimal for decentralized control of positive systems with the

induced power norm, we cannot expect this to hold for more general systems or performance

indices. Investigating model predictive control, dynamic programming, or periodic structured

control approaches would be an interesting avenue of further research.

References

[1] Enrique Fernandez Cara and Enrique Zuazua Iriondo. Control Theory: History, Mathe-

matical Achievements and Perspectives. Boletın de la Sociedad Espanola de Matematica

Aplicada, 26, 79-140., 2003.

[2] Christopher Bissell. A History of Automatic Control. In Springer Handbook of Automation,

pages 53–69. Springer, 2009.

[3] Neculai Andrei. Modern Control Theory. Studies in Informatics and Control, 15(1):51,

2006.

[4] Vincent Blondel and John N Tsitsiklis. NP-hardness of some linear control design prob-

lems. SIAM J. Control Optim., 35(6):2118–2127, 1997.

[5] Minyue Fu and Zhi-Quan Luo. Computational complexity of a problem arising in fixed

order output feedback design. Syst. Control Lett., 30(5):209–215, 1997.

[6] Francesco Bullo, Jorge Cortes, and Sonia Martınez. Distributed Control of Robotic Net-

works. Princeton University Press, 2009.

[7] Mehran Mesbahi and Magnus Egerstedt. Graph Theoretic Methods in Multiagent Net-

works. Princeton University Press, 2010.

[8] Dragoslav D Siljak. Decentralized control of complex systems. Academic Press, New York,

1991.

[9] Bassam Bamieh, Fernando Paganini, and Munther A Dahleh. Distributed control of spa-

tially invariant systems. IEEE Trans. Automat. Control, 47(7):1091–1107, 2002.

[10] Gustavo A de Castro and Fernando Paganini. Convex synthesis of localized controllers for

spatially invariant system. Automat., 38:445–456, 2002.

178

179

[11] Petros G Voulgaris, Gianni Bianchini, and Bassam Bamieh. Optimal H2 controllers for

spatially invariant systems with delayed communication requirements. Syst. Control Lett.,

50:347–361, 2003.

[12] Raffaello D’Andrea and Geir E Dullerud. Distributed control design for spatially inter-

connected systems. IEEE Trans. Automat. Control, 48(9):1478–1495, 2003.

[13] Cedric Langbort, Ramu S Chandra, and Raffaello D’Andrea. Distributed control de-

sign for systems interconnected over an arbitrary graph. IEEE Trans. Automat. Control,

49(9):1502–1519, 2004.

[14] Dragoslav D Siljak and Aleksandar I. Zecevic. Control of large-scale systems: Beyond

decentralized feedback. Annual Reviews in Control, 29(2):169–179, 2005.

[15] Bassam Bamieh and Petros G Voulgaris. A convex characterization of distributed control

problems in spatially invariant systems with communication constraints. Syst. Control

Lett., 54(6):575–583, 2005.

[16] Michael Rotkowitz and Sanjay Lall. A characterization of convex problems in decentralized

control. IEEE Trans. Automat. Control, 51(2):274–286, Feb 2006.

[17] Francesco Borrelli and Tamas Keviczky. Distributed LQR design for identical dynamically

decoupled systems. IEEE Trans. on Automat. Control, 53(8):1901–1912, 2008.

[18] Javad Lavaei and Amir G Aghdam. Control of continuous-time LTI systems by means of

structurally constrained controllers. Automat., 44(1):141–148, 2008.

[19] Parikshit Shah and Pablo A Parrilo. H2 optimal decentralized control over posets: A state

space solution for state-feedback. In Proceedings of the 49th IEEE Conference on Decision

and Control, pages 6722–6727, 2010.

[20] John Swigart and Sanjay Lall. An explicit state-space solution for a decentralized two-

player optimal linear-quadratic regulator. In Proceedings of the 2010 American Control

Conference, pages 6385–6390, 2010.

[21] Makan Fardad and Mihailo R Jovanovic. Design of optimal controllers for spatially invari-

ant systems with finite communication speed. Automat., 47(5):880–889, May 2011.

[22] Karl Martensson and Anders Rantzer. A scalable method for continuous-time distributed

control synthesis. In Proceedings of the 2012 American Control Conference, pages 6308–

6313, 2012.

180

[23] Laurent Lessard and Sanjay Lall. Optimal controller synthesis for the decentralized two-

player problem with output feedback. In Proceedings of the 2012 American Control Con-

ference, pages 6314–6321, 2012.

[24] Aditya Mahajan, Nuno C Martins, Michael C Rotkowitz, and Serdar Yuksel. Information

structures in optimal decentralized control. In Proceedings of the 51st IEEE Conference

on Decision and Control, pages 1291–1306, 2012.

[25] Parikshit Shah and Pablo A Parrilo. H2 optimal decentralized control over posets: A

state-space solution for state-feedback. IEEE Trans. Automat. Control, 58(12):3084–3096,

2013.

[26] Javad Lavaei. Optimal decentralized control problem as a rank-constrained optimization.

In Proceedings of the 51st Annual Allerton Conference on Communication, Control, and

Computing, pages 39–45, 2013.

[27] Ghazal Fazelnia, Ramtin Madani, and Javad Lavaei. Convex relaxation for optimal dis-

tributed control problem. In Proceedings of the 53rd IEEE Conference on Decision and

Control, pages 896–903, 2014.

[28] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. Localized LQR optimal control.

In Proceedings of the 53rd IEEE Conference on Decision and Control, pages 1661–1668.

IEEE, 2014.

[29] Laurent Lessard and Sanjay Lall. Optimal control of two-player systems with output

feedback. IEEE Trans. Automat. Control, 60(8):2129–2144, 2015.

[30] Andrew Lamperski and John C Doyle. The H2 control problem for quadratically invariant

systems with delays. IEEE Trans. Automat. Control, 60(7):1945–1950, 2015.

[31] Takuto Mori, Takayuki Wada, and Yasumasa Fujisaki. Structured feedback gain design via

stabilizable dilation and LMIs. In Proceedings of the 54th IEEE Conference on Decision


[32] Mihailo R Jovanovic and Neil K Dhingra. Controller architectures: Tradeoffs between

performance and structure. Eur. J. Control, 30:76–91, July 2016.

[33] Yuh-Shyang Wang. Localized LQR with adaptive constraint and performance guarantee.

In Proceedings of the 55th IEEE Conference on Decision and Control, pages 2769–2776.

IEEE, 2016.

181

[34] Yuh-Shyang Wang and Nikolai Matni. Localized LQG optimal control for large-scale

systems. In Proceedings of the 2016 American Control Conference, pages 1954–1961.

IEEE, 2016.

[35] Yuh-Shyang Wang. A system level approach to optimal controller design for large-scale

distributed systems. PhD thesis, California Institute of Technology, 2017.

[36] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. System level parameterizations,

constraints and synthesis. In Proceedings of the 2017 American Control Conference, pages

1308–1315, May 2017.

[37] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. Separable and localized system

level synthesis for large-scale systems. arXiv preprint arXiv:1701.05880, 2017.

[38] Mihailo R Jovanovic and Bassam Bamieh. Componentwise energy amplification in channel

flows. J. Fluid Mech., 534:145–183, July 2005.

[39] Binh K Lieu and Mihailo R Jovanovic. Computation of frequency responses for linear

time-invariant PDEs on a compact interval. J. Comput. Phys., 250:246–269, October

2013.

[40] Binh K Lieu, Mihailo R. Jovanovic, and Satish Kumar. Worst-case amplification of dis-

turbances in inertialess Couette flow of viscoelastic fluids. J. Fluid Mech., 723:232–263,

May 2013.

[41] Binh K Lieu. Dynamics and control of Newtonian and viscoelastic fluids. PhD thesis,

University of Minnesota, 2014.

[42] Rashad Moarref. Model-based control of transitional and turbulent wall-bounded shear

flows. PhD thesis, University of Minnesota, 2012.

[43] Rashad Moarref and Mihailo R Jovanovic. Model-based design of transverse wall oscilla-

tions for turbulent drag reduction. J. Fluid Mech., 707:205–240, September 2012.

[44] Rashad Moarref and Mihailo R Jovanovic. Controlling the onset of turbulence by stream-

wise traveling waves. Part 1: Receptivity analysis. J. Fluid Mech., 663:70–99, November

2010.

[45] Binh K Lieu, Rashad Moarref, and Mihailo R Jovanovic. Controlling the onset of turbu-

lence by streamwise traveling waves. Part 2: Direct numerical simulations. J. Fluid Mech.,

663:100–119, November 2010.

182

[46] Armin Zare, Mihailo R Jovanovic, and Tryphon T Georgiou. Colour of turbulence. J.

Fluid Mech., 812:636–680, February 2017.

[47] Armin Zare. Low-complexity stochastic modeling of wall-bounded shear flows. PhD thesis,

University of Minnesota, 2016.

[48] Bruce A Francis. A Course in H∞ Control Theory. Springer-Verlag, New York, 1987.

[49] Geir E Dullerud and Fernando Paganini. A Course in Robust Control Theory. Springer,

2000.

[50] Yu Chi Ho and Kai-Ching Chu. Team decision theory and information structures in

optimal control problems–part I. IEEE Trans. Automat. Control, 17(1):15–22, 1972.

[51] Petros G Voulgaris. Control of nested systems. In Proceedings of the 2001 American

Control Conference, pages 4442–4445, 2000.

[52] Petros G Voulgaris. A convex characterization of classes of problems in control with

specific interaction and communication structures. In Proceedings of the 2001 American


[53] Lorenzo Farina and Sergio Rinaldi. Positive Linear Systems: Theory and Applications.

John Wiley & Sons, 2011.

[54] Takashi Tanaka and Cedric Langbort. The bounded real lemma for internally positive sys-

tems andH∞ structured static state feedback. IEEE Trans. Automat. Control, 56(9):2218–

2223, 2011.

[55] Corentin Briat. Robust stability and stabilization of uncertain linear positive systems via

integral linear constraints: L1 gain and L∞ gain characterization. Int. J. Robust Nonlin.,

23(17):1932–1954, 2013.

[56] Yoshio Ebihara, Dimitri Peaucelle, and Denis Arzelier. L1 gain analysis of linear positive

systems and its application. In Proceedings of 50th IEEE Conference on Decision and

Control, pages 4029–4034. IEEE, 2011.

[57] Anders Rantzer. Distributed control of positive systems. In Proceedings of 50th IEEE

Conference on Decision and Control, pages 6608–6611, Dec 2011.

183

[58] Patrizio Colaneri, Richard H Middleton, Zhiyong Chen, Danilo Caporale, and Franco

Blanchini. Convexity of the cost functional in an optimal control problem for a class of

positive switched systems. Automat., 4(50):1227–1234, 2014.

[59] Anders Rantzer and Bo Bernhardsson. Control of convex monotone systems. In Proceedings

of the 53rd IEEE Conference on Decision and Control, pages 2378–2383, 2014.

[60] Nader Motee and Ali Jadbabaie. Optimal control of spatially distributed systems. IEEE

Trans. Automat. Control, 53(7):1616–1629, 2008.

[61] Nader Motee and Qiyu Sun. Sparsity measures for spatially decaying systems. In Pro-

ceedings of the 2014 American Control Conference, pages 5459–5464, 2014.

[62] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. Sparsity-promoting optimal control for

a class of distributed systems. In Proceedings of the 2011 American Control Conference,

pages 2050–2055, 2011.

[63] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Sparse feedback synthesis via the

alternating direction method of multipliers. In Proceedings of the 2012 American Control

Conference, pages 4765–4770, Montreal, Canada, 2012.

[64] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Design of optimal sparse feedback

gains via the alternating direction method of multipliers. IEEE Trans. Automat. Control,

58(9):2426–2431, September 2013.

[65] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. On optimal link creation for facilitation

of consensus in social networks. In Proceedings of the 2014 American Control Conference,

pages 3802–3807, Portland, OR, 2014.

[66] George F Young, L. Scardovi, and Naomi E Leonard. Robustness of noisy consensus

dynamics with directed communication. In Proceedings of the 2010 American Control

Conference, pages 6312–6317, 2010.

[67] Lin Xiao, Stephen P Boyd, and Seung-Jean Kim. Distributed average consensus with

least-mean-square deviation. J. Parallel Distrib. Comput., 67(1):33–46, 2007.

[68] Daniel Zelazo and Mehran Mesbahi. Edge agreement: Graph-theoretic performance

bounds and passivity analysis. IEEE Trans. Automat. Control, 56(3):544–555, 2011.

184

[69] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. On the optimal synchronization of

oscillator networks via sparse interconnection graphs. In Proceedings of the 2012 American

Control Conference, pages 4777–4782, Montreal, Canada, 2012.

[70] David JT Sumpter, Jens Krause, Richard James, Ian D Couzin, and Ashley JW Ward.

Consensus decision making by fish. Current Biology, 18(22):1773–1777, 2008.

[71] Mehran Mesbahi and Fred Y Hadaegh. Formation flying control of multiple spacecraft via

graphs, matrix inequalities, and switching. J. Guid. Control Dyn., 24(2):369–377, 2001.

[72] Lin Xiao, Stephen P Boyd, and Sanjay Lall. A scheme for robust distributed sensor

fusion based on average consensus. In Proceedings of the 4th International Symposium on

Information Processing in Sensor Networks, pages 63–70, 2005.

[73] Arpita Ghosh and Stephen P Boyd. Growing well-connected graphs. In Proceedings of the

45th IEEE Conference on Decision and Control, pages 6605–6611, 2006.

[74] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Identification of sparse communication

graphs in consensus networks. In Proceedings of the 50th Annual Allerton Conference on

Communication, Control, and Computing, pages 85–89, Monticello, IL, 2012.

[75] Tyler Summers, Iman Shames, John Lygeros, and Florian Dorfler. Topology design for

optimal network coherence. In Proceedings of the 2015 European Control Conference,

pages 575–580. IEEE, 2015.

[76] Sepideh Hassan-Moghaddam and Mihailo R Jovanovic. Topology design for

stochastically-forced consensus networks. IEEE Trans. Control Netw. Syst., 2017.

doi:10.1109/TCNS.2017.2674962.

[77] Stacy Patterson and Bassam Bamieh. Leader selection for optimal network coherence.

In Proceedings of the 49th IEEE Conference on Decision and Control, pages 2692–2697,

2010.

[78] Katherine Fitch and Naomi E Leonard. Information centrality and optimal leader selection

in noisy networks. In Proceedings of the 52nd IEEE Conference on Decision and Control,

pages 7510–7515, 2013.

[79] Katherine Fitch and Naomi E Leonard. Joint centrality distinguishes optimal leaders in

noisy networks. IEEE Trans. Control Netw. Syst., 2014.

185

[80] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Algorithms for leader selection in

stochastically forced consensus networks. IEEE Trans. Automat. Control, 59(7):1789–

1802, July 2014.

[81] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. Algorithms for leader selection in large

dynamical networks: noise-free leaders. In Proceedings of the 50th IEEE Conference on

Decision and Control and European Control Conference, pages 7188–7193, Orlando, FL,

2011.

[82] Andrew Clark, Linda Bushnell, and Radha Poovendran. A supermodular optimization

framework for leader selection under link noise in linear multi-agent systems. IEEE Trans.

Automat. Control, 59(2):283–296, 2014.

[83] Andrew Clark, Basel Alomair, Linda Bushnell, and Radha Poovendran. Submodularity in

Dynamics and Control of Networked Systems. Springer, 2016.

[84] Andrew Clark, Basel Alomair, Linda Bushnell, and Radha Poovendran. Submodularity

in input node selection for networked linear systems: Efficient algorithms for performance

and controllability. IEEE Control Syst. Mag., 37(6):52–74, Dec 2017.

[85] Guodong Shi, Kin Cheong Sou, Henrik Sandberg, and Karl Henrik Johansson. A graph-

theoretic approach on optimizing informed-node selection in multi-agent tracking control.

Physica D: Nonlinear Phenomena, 267:104–111, 2014.

[86] Stacy Patterson, Neil McGlohon, and Kirill Dyagilev. Optimal k-leader selection for

coherence and convergence rate in one-dimensional networks. IEEE Trans. Control Netw.

Syst., 2016.

[87] Neil K Dhingra, Marcello Colombino, and Mihailo R Jovanovic. Leader selection in di-

rected networks. In Proceedings of the 55th IEEE Conference on Decision and Control,

pages 2715–2720, Las Vegas, NV, 2016.

[88] Neil K Dhingra, Marcello Colombino, and Mihailo R Jovanovic. Structured decentralized

control of positive systems with applications to combination drug therapy and leader

selection in directed networks. IEEE Trans. Control Netw. Syst., 2017. submitted.

[89] Vanessa D Jonsson, Anders Rantzer, and Richard M Murray. A scalable formulation for

engineering combination therapies for evolutionary dynamics of disease. Proceedings of

the 2014 American Control Conference, pages 2771–2778, 2014.

186

[90] Vanessa D Jonsson, Nikolai Matni, and Richard M Murray. Reverse engineering combina-

tion therapies for evolutionary dynamics of disease: An H∞ approach. In Proceedings of

the 52nd IEEE Conference on Decision and Control, pages 2060–2065, 2013.

[91] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. On the optimal design of structured

feedback gains for interconnected systems. In Proceedings of the 48th IEEE Conference

on Decision and Control, pages 978–983, Shanghai, China, 2009.

[92] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Augmented Lagrangian approach

to design of structured optimal state feedback gains. IEEE Trans. Automat. Control,

56(12):2923–2929, 2011.

[93] David L Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.

[94] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical

Learning. Springer, 2009.

[95] Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices.

Ann. Stat., pages 199–227, 2008.

[96] Tom Goldstein and Stanley Osher. The split Bregman method for `1-regularized problems.

SIAM J. Imaging Sci., 2(2):323–343, 2009.

[97] Bassam Bamieh, Mihailo R Jovanovic, Partha Mitra, and Stacy Patterson. Coherence

in large-scale networks: dimension dependent limitations of local feedback. IEEE Trans.

Automat. Control, 57(9):2235–2249, September 2012.

[98] Simone Schuler, Ping Li, James Lam, and Frank Allgower. Design of structured dynamic

output-feedback controllers for interconnected systems. International Journal of Control,

84(12):2081–2091, 2011.

[99] Makan Fardad and Mihailo R Jovanovic. On the design of optimal structured and sparse

feedback gains using semidefinite programming. In Proceedings of the 2014 American

Control Conference, pages 2438–2443, Portland, OR, 2014.

[100] Nikolai Matni and Venkat Chandrasekaran. Regularization for design. In Proceedings of

the 53rd IEEE Conference on Decision and Control, pages 1111–1118, Los Angeles, CA,

2014.

187

[101] Nikolai Matni. Communication delay co-design in H2-distributed control using atomic

norm minimization. IEEE Transactions on Control of Network Systems, 4(2):267–278,

June 2017.

[102] Nikolai Matni and Venkat Chandrasekaran. Regularization for design. IEEE Trans. Au-

tomat. Control, 61(12):3991–4006, 2016.

[103] Nikolai Matni. Distributed optimal control of cyber-physical systems: controller synthesis,

architecture design and system identification. PhD thesis, California Institute of Technol-

ogy, 2016.

[104] Boris Polyak, Mikhail Khlebnikov, and Pavel Shcherbakov. An LMI approach to structured

sparse feedback design in linear control systems. In Proceedings of the 2013 European


[105] Neil K Dhingra, Mihailo R Jovanovic, and Z. Q. Luo. An ADMM algorithm for optimal

sensor and actuator selection. In Proceedings of the 53rd IEEE Conference on Decision

and Control, pages 4039–4044, Los Angeles, CA, 2014.

[106] Xiaofan Wu and Mihailo R Jovanovic. Sparsity-promoting optimal control of systems with

symmetries, consensus and synchronization networks. Syst. Control Lett., 103:1–8, May

2017.

[107] Neil K Dhingra, Sei Zhen Khong, and Mihailo R Jovanovic. The proximal augmented La-

grangian method for nonsmooth composite optimization. IEEE Trans. Automat. Control,

2016. submitted; also arXiv:1610.04514.

[108] Florian Dorfler, Mihailo R Jovanovic, Michael Chertkov, and Francesco Bullo. Sparse

and optimal wide-area damping control in power networks. In Proceedings of the 2013

American Control Conference, pages 4295–4300, Washington, DC, 2013.

[109] Florian Dorfler, Mihailo R Jovanovic, M. Chertkov, and F. Bullo. Sparsity-promoting

optimal wide-area control of power networks. IEEE Trans. Power Syst., 29(5):2281–2291,

September 2014.

[110] Xiaofan Wu, Florian Dorfler, and Mihailo R Jovanovic. Input-output analysis and decen-

tralized optimal control of inter-area oscillations in power systems. IEEE Trans. Power

Syst., 31(3):2434–2444, May 2016.

188

[111] Xiaofan Wu, Florian Dorfler, and Mihailo R Jovanovic. Topology identification and design

of distributed integral action in power networks. In Proceedings of the 2016 American

Control Conference, pages 5921–5926, Boston, MA, 2016.

[112] Neil K Dhingra, Marcello Colombino, and Mihailo R Jovanovic. On the convexity of a

class of structured optimal control problems for positive systems. In Proceedings of the

2016 European Control Conference, pages 825–830, Aalborg, Denmark, 2016.

[113] Sepideh Hassan-Moghaddam and Mihailo R Jovanovic. An interior point method for grow-

ing connected resistive networks. In Proceedings of the 2015 American Control Conference,

pages 1223–1228, Chicago, IL, 2015.

[114] Sepideh Hassan-Moghaddam and Mihailo R Jovanovic. Customized algorithms for growing

connected resistive networks. In Proceedings of the 10th IFAC Symposium on Nonlinear

Control Systems, pages 986–991, Monterey, CA, 2016.

[115] Neil K Dhingra and Mihailo R Jovanovic. Convex synthesis of symmetric modifications to

linear systems. In Proceedings of the 2015 American Control Conference, pages 3583–3588,

Chicago, IL, 2015.

[116] Neil K Dhingra, Xiaofan Wu, and Mihailo R Jovanovic. Sparsity-promoting optimal control

of systems with invariances and symmetries. In Proceedings of the 10th IFAC Symposium

on Nonlinear Control Systems, pages 648–653, Monterey, CA, 2016.

[117] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. Design of optimal sparse interconnec-

tion graphs for synchronization of oscillator networks. IEEE Trans. Automat. Control,

59(9):2457–2462, September 2014.

[118] Neil K Dhingra and Mihailo R Jovanovic. A method of multipliers algorithm for sparsity-

promoting optimal control. In Proceedings of the 2016 American Control Conference,

pages 1942–1947, Boston, MA, 2016.

[119] Neil K Dhingra, Sei Zhen Khong, and Mihailo R Jovanovic. A second order primal-dual

algorithm for non-smooth convex composite optimization. In Proceedings of the 56th IEEE

Conference on Decision and Control, Melbourne, Australia, 2017.

[120] Neil K Dhingra, Sei Zhen Khong, and Mihailo R Jovanovic. A second order primal-dual

method for nonsmooth convex composite optimization. IEEE Trans. Automat. Control,

2017. submitted; also arXiv:1709.01610.

189

[121] Emmanuel J Candes, Mike B Wakin, and Stephen P Boyd. Enhancing sparsity by

reweighted `1 minimization. J. Fourier Anal. Appl, 14:877–905, 2008.

[122] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped vari-

ables. J. R. Stat. Soc. Series B Stat. Methodol., 68(1):49–67, 2006.

[123] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University

Press, 2004.

[124] Athanasios C Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.

[125] Armin Zare, Yongxin Chen, Mihailo R Jovanovic, and Tryphon T Georgiou. Low-

complexity modeling of partially available second-order statistics: theory and an efficient

matrix completion algorithm. IEEE Trans. Automat. Control, 62(3):1368–1383, March

2017.

[126] Armin Zare, Mihailo R Jovanovic, and Tryphon T Georgiou. Alternating direction op-

timization algorithms for covariance completion problems. In Proceedings of the 2015

American Control Conference, pages 515–520, Chicago, IL, 2015.

[127] Armin Zare, Mihailo R Jovanovic, and Tryphon T Georgiou. Completion of partially

known turbulent flow statistics. In Proceedings of the 2014 American Control Conference,

pages 1680–1685, Portland, OR, 2014.

[128] Siddharth Joshi and Stephen P Boyd. Sensor selection via convex optimization. IEEE

Trans. Signal Process., 57(2):451–462, 2009.

[129] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The

convex geometry of linear inverse problems. Foundations of Computational Mathematics,

12(6):805–849, 2012.

[130] Neal Parikh and Stephen P Boyd. Proximal algorithms. Foundations and Trends in

Optimization, 1(3):123–231, 2013.

[131] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear

inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

[132] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed

optimization and statistical learning via the alternating direction method of multipliers.

Found. Trends Mach. Learning, 3(1):1–124, 2011.

190

[133] Dimitri P Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Aca-

demic Press, New York, 1982.

[134] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[135] Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer, 2006.

[136] Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex pro-

gramming. In Adv. Neural Inf. Process Syst., pages 379–387, 2015.

[137] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating

direction method of multipliers for a family of nonconvex problems. SIAM J. Optimiz.,

26(1):337–364, 2016.

[138] Ralph T Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer

Science & Business Media, 2009.

[139] Frank H Clarke. Optimization and Nonsmooth Analysis, volume 5. SIAM, 1990.

[140] Jerome Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized min-

imization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494,

2014.

[141] Yin Wang, Jose A Lopez, and Mario Sznaier. Sparse static output feedback controller

design via convex optimization. In Proceedings of the 53rd IEEE Conference on Decision


[142] Tankred Rautert and Ekkehard W Sachs. Computational design of optimal output feed-

back controllers. SIAM J. Optimiz., 7(3):837–852, 1997.

[143] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA

J. Numer. Anal., 8(1):141–148, 1988.

[144] Andrei Patrascu and Ion Necoara. Efficient random coordinate descent algorithms for

large-scale structured nonconvex optimization. J. Global Optim., 61(1):19–46, 2015.

[145] Kenneth J Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Studies in Linear and Non-

Linear Programming. Stanford University Press, 1958.

[146] Jing Wang and Nicola Elia. A control perspective for centralized and distributed convex

optimization. In Proceedings of the 50th IEEE Conference on Decision and Control and

the 10th European Control Conference, pages 3800–3805, 2011.

191

[147] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent

optimization. IEEE Trans. Automat. Control, 54(1):48–61, 2009.

[148] Diego Feijer and Fernando Paganini. Stability of primal–dual gradient dynamics and

applications to network optimization. Automat., 46(12):1974–1981, 2010.

[149] Ashish Cherukuri, Enrique Mallada, and Jorge Cortes. Asymptotic convergence of con-

strained primal–dual dynamics. Syst. Control Lett., 87:10–15, 2016.

[150] Ashish Cherukuri, Enrique Mallada, Steven Low, and Jorge Cortes. The role of con-

vexity on saddle-point dynamics: Lyapunov function and robustness. arXiv preprint

arXiv:1608.08586, 2016.

[151] Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: condi-

tions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization,

55(1):486–511, 2017.

[152] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of opti-

mization algorithms via integral quadratic constraints. SIAM J. Optimiz., 26(1):57–95,

2016.

[153] Bin Hu and Peter Seiler. Exponential decay rate conditions for uncertain linear systems

using integral quadratic constraints. IEEE Trans. Automat. Control, 61(11):3631–3637,

2016.

[154] Ross Boczar, Laurent Lessard, and Benjamin Recht. Exponential convergence bounds us-

ing integral quadratic constraints. In Proceedings of the 54th IEEE Conference on Decision

and Control, pages 7516–7521. IEEE, 2015.

[155] Ross Boczar, Laurent Lessard, Andrew Packard, and Benjamin Recht. Exponential sta-

bility analysis via integral quadratic constraints. arXiv preprint arXiv:1706.01337, 2017.

[156] Anders Rantzer. On the Kalman-Yakubovich-Popov lemma. Systems & Control Letters,

28(1):7–10, 1996.

[157] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth

separable minimization. Math. Program., 117(1-2):387–423, 2009.

[158] Richard H Byrd, Jorge Nocedal, and Figen Oztoprak. An inexact successive quadratic

approximation method for `1 regularized optimization. Math. Program., pages 1–22, 2015.

192

[159] Jason D Lee, Yuekai Sun, and Michael Saunders. Proximal Newton-type methods for

convex optimization. In Adv. Neural Inf. Process. Syst., pages 836–844, 2012.

[160] Cho-Jui Hsieh, Inderjit S Dhillon, Pradeep K Ravikumar, and Matyas A Sustik. Sparse

inverse covariance matrix estimation using quadratic approximation. In Adv. Neural Inf.

Process. Syst., pages 2330–2338, 2011.

[161] Cho-Jui Hsieh, Matyas A Sustik, Inderjit S Dhillon, and Pradeep Ravikumar. QUIC:

Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res.,

15:2911–2947, 2014.

[162] Philip E Gill and Daniel P Robinson. A primal-dual augmented Lagrangian. Comput.

Optim. Appl., 51(1):1–25, 2012.

[163] Paul Armand, Joel Benoist, Riadh Omheni, and Vincent Pateloup. Study of a primal-dual

algorithm for equality constrained minimization. Comput. Optim. Appl., 59(3):405–433,

2014.

[164] Paul Armand and Riadh Omheni. A globally and quadratically convergent primal-dual

augmented Lagrangian algorithm for equality constrained optimization. Optim. Method.

Softw., pages 1–21, 2015.

[165] Stephen Becker and Jalal Fadili. A quasi-Newton proximal splitting method. In Adv.

Neural Inf. Process. Syst., pages 2618–2626, 2012.

[166] Liqun Qi and Jie Sun. A nonsmooth version of Newton’s method. Math. Program.,

58(1):353–367, 1993.

[167] Stephen M Robinson. Local structure of feasible sets in nonlinear programming, Part III:

Stability and sensitivity. In Nonlinear Analysis and Optimization, pages 45–66. Springer,

1987.

[168] Jean-Baptiste Hiriart-Urruty and Claude Lemarechal. Convex analysis and minimization

algorithms I: Fundamentals, volume 305. Springer Science & Business Media, 2013.

[169] Godfrey Harold Hardy and Edward Maitland Wright. An Introduction to the Theory of

Numbers, volume 5. Oxford University Press, 1979.

[170] Kaifeng Jiang, Defeng Sun, and Kim-Chuan Toh. A partial proximal point algorithm for

nuclear norm regularized matrix least squares problems. Math. Program. Comp., 6(3):281–

325, 2014.

193

[171] Fanwen Meng, Defeng Sun, and Gongyun Zhao. Semismoothness of solutions to generalized

equations and the Moreau-Yosida regularization. Math. Program., 104(2):561–581, 2005.

[172] Joseph B Kruskal. Two convex counterexamples: A discontinuous envelope function and

a nondifferentiable nearest-point mapping. Proceedings of the American Mathematical

Society, 23(3):697–703, 1969.

[173] Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal Newton-type methods for

minimizing composite functions. SIAM J. Optimiz., 24(3):1420–1443, 2014.

[174] Dimitri P Bertsekas. Projected Newton methods for optimization problems with simple

constraints. SIAM J. Control Opt., 20(2):221–246, 1982.

[175] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc

B., 58(1):267–288, 1996.

[176] Panagiotis Patrinos, Lorenzo Stella, and Alberto Bemporad. Forward-backward truncated

Newton methods for large-scale convex composite optimization. 2014. arXiv:1402.6655.

[177] Lorenzo Stella, Andreas Themelis, and Panagiotis Patrinos. Forward–backward quasi-

Newton methods for nonsmooth optimization problems. Computational Optimization and

Applications, 67(3):443–487, 2017.

[178] Andreas Themelis, Lorenzo Stella, and Panagiotis Patrinos. Forward-backward envelope

for the sum of two nonconvex functions: Further properties and nonmonotone line-search

algorithms. 2016. arXiv:1606.06256.

[179] James M Ortega and Werner C Rheinboldt. Iterative Solution of Nonlinear Equations in

Several Variables. SIAM, 2000.

[180] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press,

1985.

[181] Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, Inc., 1964.

[182] Francisco Facchinei. Minimization of SC1 functions and the Maratos effect. Oper. Res.

Lett., 17(3):131–137, 1995.

[183] Hassan K Khalil. Nonlinear Systems. Prentice Hall, Third edition, 1996.

194

[184] Tecla De Luca, Francisco Facchinei, and Christian Kanzow. A semismooth equation ap-

proach to the solution of nonlinear complementarity problems. Math. Program., 75(3):407–

439, 1996.

[185] Andrew R Conn, Nicholas IM Gould, and Philippe Toint. A globally convergent augmented

Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM

J. Numer. Anal., 28(2):545–572, 1991.

[186] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized

linear models via coordinate descent. J. Stat. Softw., 33(1):1, 2010.

[187] Stephen J Wright, Robert D Nowak, and Mario AT Figueiredo. Sparse reconstruction by

separable approximation. IEEE Trans. Signal Processing, 57(7):2479–2493, 2009.

[188] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry

Gorinevsky. An interior-point method for large-scale `1-regularized least squares. IEEE

J. Sel. Topics Signal Process, 1(4):606–617, 2007.

[189] Junfeng Yang and Yin Zhang. Alternating direction algorithms for `1-problems in com-

pressive sensing. SIAM J. Sci. Comput., 33(1):250–278, 2011.

[190] David M Zoltowski, Neil K Dhingra, Fu Lin, and Mihailo R Jovanovic. Sparsity-promoting

optimal control of spatially-invariant systems. In Proceedings of the 2014 American Control

Conference, pages 1261–1266, Portland, OR, 2014.

[191] Ju Swift and Pierre C Hohenberg. Hydrodynamic fluctuations at the convective instability.

Physical Review A, 15(1):319, 1977.

[192] Ralph T Rockafellar. Augmented Lagrangians and applications of the proximal point

algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.

[193] Bahman Gharesifard and Jorge Cortes. Distributed continuous-time convex optimization

on weight-balanced digraphs. IEEE Trans. Automat. Control, 59(3):781–786, 2014.

[194] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm

for decentralized consensus optimization. SIAM J. Optimiz., 25(2):944–966, 2015.

[195] Rajendra Bhatia. Matrix Analysis, volume 169. Springer, 1997.

195

[196] Bassam Bamieh and Mohammed Dahleh. Exact computation of traces and H2 norms for

a class of infinite dimensional problems. IEEE Trans. Automat. Control, 48(4):646–649,

2003.

[197] Franziska Michor, Yoh Iwasa, and Martin A Nowak. Dynamics of cancer progression.

Nature Reviews Cancer, 4(3):197–205, 2004.

[198] Bissan Al-Lazikani, Udai Banerji, and Paul Workman. Combinatorial drug therapy for

cancer in the post-genomic era. Nature Biotechnology, 30(7):679–692, 2012.

[199] Daniel S Rosenbloom, Alison L Hill, S Alireza Rabi, Robert F Siliciano, and Martin A

Nowak. Antiretroviral dynamics determines hiv evolution and predicts therapy outcome.

Nature Medicine, 18(9):1378–1385, 2012.

[200] Florian Klein, Ariel Halper-Stromberg, Joshua A Horwitz, Henning Gruell, Johannes F

Scheid, Stylianos Bournazos, Hugo Mouquet, Linda A Spatz, Ron Diskin, Alexander

Abadir, et al. HIV therapy by a combination of broadly neutralizing antibodies in human-

ized mice. Nature, 492(7427):118–122, 2012.

[201] Lloyd N Trefethen and Mark Embree. Spectra and pseudospectra: the behavior of nonnor-

mal matrices and operators. Princeton University Press, 2005.

[202] JAC Weideman and Satish C Reddy. A MATLAB differentiation matrix suite. ACM

Transactions on Mathematical Software, 26(4):465–519, December 2000.

[203] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex program-

ming, version 2.0 beta. http://cvxr.com/cvx, September 2013.

[204] Marcello Colombino and Roy S Smith. Convex characterization of robust stability anal-

ysis and control synthesis for positive linear systems. In Proceedings of the 53rd IEEE

Conference on Decision and Control, Los Angeles, California, 2014.

[205] Marcello Colombino and Roy S Smith. A convex characterization of robust stability

for positive and positively dominated linear systems. IEEE Trans. Automat. Control,

61(7):1965–1971, July 2016.

[206] Marcello Colombino. Robust and Decentralized Control of Positive Systems: a Convex

Approach. PhD thesis, ETH Zurich, 2016.

http://cvxr.com/cvx

196

[207] Joel E Cohen. Convexity of the dominant eigenvalue of an essentially nonnegative matrix.

Proc. Am. Math. Soc., 81(4):657–658, 1981.

[208] Anders Rantzer. On the Kalman-Yakubovich-Popov lemma for positive systems. IEEE

Trans. Automat. Control, 61(5):1346–1349, 2016.

[209] Naum Z Shor. Minimization Methods for Non-differentiable Functions, volume 3. Springer

Science & Business Media, 2012.

[210] Jose YB Cruz. On proximal subgradient splitting method for minimizing the sum of two

nonsmooth convex functions. Set-Valued and Variational Analysis, 25(2):245–263, 2017.

[211] Dragos M Cvetkovic, Michael Doob, and Horst Sachs. Spectra of Graphs: Theory and

Application, volume 87. Academic Pr, 1980.

[212] Eugene L Lawler and David E Wood. Branch-and-bound methods: a survey. Operations

Research, 14(4):699–719, 1966.

[213] Weiran Wang and Canyi Lu. Projection onto the capped simplex. arXiv preprint

arXiv:1503.01002, 2015.

[214] Daniel Zelazo, Simone Schuler, and Frank Allgower. Performance and design of cycles in

consensus networks. Syst. Control Lett., 62(1):85–96, 2013.

[215] Duncan J Watts and Stephen H Strogatz. Collective dynamics of ‘small-world’ networks.

Nature, 393:440–442, 1998.

[216] John G White, Eileen Southgate, J Nichol Thomson, and Sydney Brenner. The structure

of the nervous system of the nematode Caenorhabditis Elegans: the mind of a worm. Phil.

Trans. R. Soc. Lond, 314:1–340, 1986.

[217] Shun Motegi and Naoki Masuda. A network-based dynamical system for competitive

sports. Scientific Reports, 2, 2012.

[218] Juyong Park and Mark EJ Newman. A network-based ranking system for US college

football. Journal of Statistical Mechanics: Theory and Experiment, 2005(10):P10014,

2005.

[219] Sports-Reference. Sports Reference – College Football 2015 schedule and results. http:

//www.sports-reference.com/cfb/years/2015-schedule.html, 2016. Accessed: 2016-

03-15.

http://www.sports-reference.com/cfb/years/2015-schedule.html

http://www.sports-reference.com/cfb/years/2015-schedule.html

197

[220] Associated-Press. The AP Top 25 Poll. http://collegefootball.ap.org/poll, 2016.

Accessed: 2016-03-15.

[221] Esteban Hernandez-Vargas, Patrizio Colaneri, Richard Middleton, and Franco Blanchini.

Discrete-time control for switched positive systems with application to mitigating viral

escape. Int. J. Robust Nonlin., 21(10):1093–1111, 2011.

[222] Vanessa D Jonsson, Nikolai Matni, and Richard M Murray. Synthesizing combination

therapies for evolutionary dynamics of disease for nonlinear pharmacodynamics. In Pro-

ceedings of the 53rd Conference on Decision and Control, pages 2352–2358. IEEE, 2014.

[223] Vanessa D Jonsson. Robust Control of Evolutionary Dynamics. PhD thesis, California

Institute of Technology, 2016.

[224] Abram S Besicovitch. Almost Periodic Functions. Dover Publications, 1954.

[225] Constantin Corduneanu. Almost Periodic Oscillations and Waves. Springer Science &

Business Media, 2009.

[226] Christian Constanda and Maria E Perez. Integral Methods in Science and Engineering.

Springer, 2010.

[227] Jorge Mari. A counterexample in power signals space. IEEE Trans. Automat. Control,

41(1):115–116, 1996.

[228] Petar Kokotovic, Hassan K Khalil, and John O’Reilly. Singular perturbation methods in

control: analysis and design. SIAM, 1999.

[229] Venkat Roy, Sundeep P Chepuri, and Geert Leus. Sparsity-enforcing sensor selection for

DOA estimation. In Proceedings of the 2013 IEEE International Workshop on Computa-

tional Advances in Multi-Sensor Adaptive Processing, pages 340–343, 2013.

[230] Vassilis Kekatos, Georgios B Giannakis, and Bruce Wollenberg. Optimal placement of

phasor measurement units via convex relaxation. IEEE Trans. Power Syst., 27(3):1521–

1530, 2012.

[231] James L Rogers. A parallel approach to optimum actuator selection with a genetic algo-

rithm. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, pages

14–17, 2000.

http://collegefootball.ap.org/poll

198

[232] Shinji Kondoh, Chikayoshi Yatomi, and Koichi Inoue. The positioning of sensors and

actuators in the vibration control of flexible systems. JSME Int. J., Series III, 33(2):145–

152, 1990.

[233] Kazuhiko Hiramoto, Hitoshi Doki, and Goro Obinata. Optimal sensor/actuator placement

for active vibration control using explicit solution of algebraic Riccati equation. J. Sound

Vib., 229(5):1057–1075, 2000.

[234] Kevin K Chen and Clarence W Rowley. H2 optimal actuator and sensor placement in the

linearised complex Ginzburg-Landau system. J. Fluid Mech., 681:241–260, 2011.

[235] Engin Masazade, Makan Fardad, and Pramod K Varshney. Sparsity-promoting extended

Kalman filtering for target tracking in wireless sensor networks. IEEE Signal Processing

Letters, 19:845–848, 2012.

[236] Michael Grant and Stephen P Boyd. Graph implementations for nonsmooth convex pro-

grams. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and

Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag

Limited, 2008.

[237] Martin J Brenner, Richard C Lind, and David F Voracek. Overview of Recent Flight

Flutter Testing Research at NASA Dryden, volume 4792. National Aeronautics and Space

Administration, Office of Management, Scientific and Technical Information Program,

1997.

[238] Thomas E Noll, John M Brown, Marla E Perez-Davis, Stephen D Ishmael, Geary C

Tiffany, and Matthew Gaier. Investigation of the Helios prototype aircraft mishap volume

I mishap report. NASA, 1, 2004.

[239] Jeffrey M Barker and Gary J Balas. Comparing linear parameter-varying gain-scheduled

control techniques for active flutter suppression. J. Guid. Control Dyn., 23(5):948–955,

2000.

[240] Claudia Moreno, Peter Seiler, and Gary J Balas. Linear parameter varying model reduction

for aeroservoelastic systems. In Proceedings of the AIAA Atmospheric Flight Mechanics

Conference, page 4859, 2012.

Appendix A

CrosswordSince the subject of my thesis is mathematical in nature, I thought it appropriate that I leave

something as an exercise to the reader.

199

200

Across1. Dirac , i.e. a 55-across

train5. In the style of

8. Preserves additivity andhomogeneity

14. Basilica part

15. Gaping hole

16. Non-rhyming fruit

17. It closes the loop

19. As a set which admitsa supporting hyperplaneat each point on itsboundary

20. Systems theory-focusedElsevier gathering acrossthe pond, abbr.

21. Sailor23. We Can!, 2008

Obama-Biden slogan

24. Stand for a golfer

25. McBeal, et. al

27. Ali who stole from fortythieves

30. Gruffly succinct

32. Road usage fee

36. All hot and bothered, ina bad way

38. Baboon genus

39. With 40-across, the sub-ject area of this thesis

40. See 39-across42. Sandwich cookies43. Generalizations of vol-

ume, such as those ofLebesgue or Borel

44. Western Cold War al-liance

45. Town official in charge ofmaking announcements

47. Trivial48. 2016 Green Party candi-

date Jill50. Private Twitter commu-

niques

52. Some Audi roadsters55. Dirac delta function57. Act whose wages is

death60. A Modest Proposal, e.g.

62. A property promoted by`1 regularization (e.g., inthe LASSO problem)

64. Ability

65. For profit diamond grad-ing agcy

66. Salt Lake City collegiateathletes

67. Functions as desired de-spite uncertainty

68. Professional organiza-tion whose members de-sign cars

69. Viral phenomenon

Down1. Small eatery with caf-

feinated beverages

2. Oil cartel

3. 1/60,000 min.

4. Crash site?5. Almond-flavored Italian

liqueur

6. Delicate fabric7. Maladroit8. Place9. Concept famously mis-

understood by AlanisMorissette in a 1996 hitsingle

10. What you get when youmultiply 0 by Inf in Mat-lab

11. Green emotion12. Author James13. Property of the green di-

nosaur in Toy Story

18. Abbr. for ancient dates22. American pro. football

assn.24. Snapbacks and

26. Some sculptures

27. Philosopher Francis

28. Ancient Greek publicsquare

29. French psychologist Al-fred who co-created IQtesting

31. Unagi

33. Carmen or Don Gio-vanni

34. Fruit famously used bythe British Navy to pre-vent scurvy

35. As a transmission linewith non-negligible re-sistnace

37. longa, vita brevis

38. Local frequency con-troller for a generator(abbr.)

40. Call, as a poker bet

41. Springtime householdliquidation

43. Opposite of plusses

45. Type of shoes favoredfor maritime Mafia exe-cutions

46. Hamilton formerly of theDetroit Pistons

49. Components of theMichelin Man’s body

51. French sea52. Former Winter Palace

resident53. With ”cat”, it’s a palin-

drome54. Retained ticket portion

56. Org. of female drivers?

57. Location58. Couple

59. Wall St. trading venue

61. Electronic sensing unitcomposed of accelerome-ters and gyroscopes

63. Result of addition

201

The solution to the crossword puzzle is on next page.

202

Solution

Optimization and control of large-scale networked...

Documents

Transcript of Optimization and control of large-scale networked...