Optimization and control of large-scale networked...
Transcript of Optimization and control of large-scale networked...
Optimization and controlof large-scale networked systems
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Neil K. Dhingra
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
Doctor of Philosophy
December 2017
Copyright © 2017
Neil K. Dhingra
To Pallavi Kathuria, in loving memory.
i
Acknowledgements
No man is an island and in completing my graduate studies I have relied on the help and guidance
of innumerable friends, relations, and mentors. I will try to give a partial accounting of their
aid and support here.
I would first like to thank my advisor, Professor Mihailo R. Jovanovic for being an amazing
teacher, mentor, and friend. His relentless perfectionism and his insistence on detail has en-
hanced the quality and sophistication of my writing, research, and presentation. I am forever
indebted to him for his guidance and mentorship.
Many other faculty members have contributed greatly to my time here. I learned much
about convex optimization from Professor Tom Luo, both in his excellent classes and one-on-
one meetings. I also benefitted tremendously from interactions with Tryphon Georgiou, Murti
Salapaka, Peter J. Seiler, Andy Lamperski, Makan Fardad, and Anders Rantzer.
I am thankful for NASA and their amazing research scientists. Interacting with Marty
Brenner was always stimulating; working with him broadened my perspectives and gave me
insight into the practical applications of control theory.
I would also like to thank the professors who taught and mentored me at the University of
Michigan and all my classmates and colleagues who inspired me to pursue graduate school. In
particular, I would like to thank Sister Mary Elizabeth Merriam, under whose guidance I began
my journey into academia.
The countless hours spent in our stuffy Keller Hall office were made bearable only by the
incredible company I kept. I would like to thank Xiaofan Wu, Armin Zare, and Yongxin Chen,
my compatriots in this endeavor. Whether watching the Timberwolves or running races, it has
always been a pleasure. The senior statesmen of the lab – Rashad Moarref, Binh K. Lieu and Fu
Lin – all offered invaluable guidance and counsel. I would like to thank “Mr. Fu” in particular
for his help with optimization and sparsity-promoting optimal control. Much of the work in this
thesis builds upon the foundation that he laid. Finally I would like to thank the junior graduate
ii
students, Sepideh Hassan-Moghaddam, Wei Ran, and Dongsheng Ding – I am happy that I can
leave knowing that the future of the lab is in good hands.
Beyond our lab group, interactions with Sei Zhen Khong at the IMA and Marcello Colombino
at ETH were enjoyable socially and brought my research to a higher level. I also enjoyed my
time with the other IMA postdocs, particularly Kaoru Yamamoto and Rohit Gupta.
Of course I must thank my incredible network of friends in Minneapolis and across the coun-
try. Katie Hartl, in particular, has always been a great friend with whom I could commiserate.
Thank you to Kevin Nowland, for his friendship and mathematical advice and despite his dubi-
ous collegiate affiliation, and to Rohan Agarwal, for giving me a second home in Chicago. I am
grateful for all my friends in Minneapolis, including my roommate Aaron as well as everyone
I met through him, Brian and Julia and their wonderful children, Julius and Maria, Luke and
Lindsey, Caitlin and Eric, Jane, Liz, Niko, Alex, Sarah, Stephen, Ben, Emily, Angela, Kyle,
and the rest of the Scudder house crew, Taylor, Jordan, and everyone else at the Michigan
Alumni Association, Megan, Sandy, Elizabeth, Azra, and many, many more. I would also like to
thank Sisyphus, in whom I found a kindred spirit, and Albert Camus for providing an optimistic
interpretation of this somewhat concerning correlation.
Thank you to my family for constantly providing love and support. Thanks especially to my
parents, who have always been there for me, even when my car was stolen or broke down. Thank
you to my sister Nina, for her support as well as for helping me start and accompanying me
on different roadtrips from California to Minnesota. I am also indebted to my extended family,
grandparents, cousins, aunts, and uncles for their support and love over the years. In particular,
I would like to thank my little cousin – Pallavi, your wonderful smile and joyous laugh will stay
with me forever.
Finally and most importantly I would like to thank Amy. Her constant support and love has
given me strength and fortitude throughout my graduate studies. I loved our time in Minneapolis
and our biking, running, and roadtrip adventures to all the breweries, lakes, waterfalls, ice caves,
and stadia in the area. Thank you, Amy, I couldn’t have done it without you.
iii
Abstract
In this thesis, we design structured controllers for linear systems by solving regularized
optimal control problems. We develop tractable methods for solving nonconvex regularized
problems and then identify classes of problems for which regularized optimal control problems
can be placed into a convex form.
We first develop novel methods based on reformulating the regularized optimization prob-
lem with an auxiliary variable. By exploiting the properties of proximal operators, we bring
the associated augmented Lagrangian into a continuously differentiable form by constraining
it to the manifold that corresponds to explicit minimization over the auxiliary variable. The
new expression facilitates a method of multipliers algorithm that offers many advantages rel-
ative to existing methods, including guaranteed convergence for nonconvex problems and the
ability to impose regularization in alternate coordinates. We then apply primal-descent dual-
ascent Arrow-Hurwicz-Uzawa type gradient flow dynamics to solve regularized problems in a
distributed manner. We prove global convergence for convex problems and use the theory of In-
tegral Quadratic Constraints to establish conditions for exponential convergence for continuous-
and discrete-time updates applied to strongly convex probems. Finally, we take advantage of
generalizations of the Jacobian to develop a second-order algorithm which converges globally to
the optimal solution for convex problems. Moreover, we prove local quadratic convergence for
strongly convex problems.
We next study several classes of convex regularized optimal control problems. The problem
of designing symmetric modifications to symmetric linear systems is convex in the underlying
design variable and is thus appealing for the purpose of structured control. We show that even
when the system and controller are not symmetric, their symmetric components can be used
to perform structured design with stability and performance guarantees. We then examine the
problem of designing structured diagonal modifications to positive systems. We prove convexity
of the H2 and H∞ optimal control problems for this class of system and apply our results
to leader selection in directed consensus networks and combination drug therapy design for
HIV. We consider time-varying controllers and show that a constant controller is optimal for
an induced-power performance index. Finally, we develop customized algorithms for large-scale
actuator and sensor selection.
iv
Contents
Acknowledgements ii
Abstract iv
List of Tables x
List of Figures xi
1 Introduction 1
1.1 A historical perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Optimal control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Distributed Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Lack of convexity for structured control . . . . . . . . . . . . . . . . . . . 12
1.4 Regularized problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Design of structured controllers via regularization . . . . . . . . . . . . . . . . . . 17
1.5.1 Structure identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5.2 Polishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
I Proximal augmented Lagrangian algorithms 24
2 Method of multipliers 25
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
v
2.2 Problem formulation and background . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Composite optimization problem . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2 Background on proximal operators . . . . . . . . . . . . . . . . . . . . . . 27
2.2.3 Existing algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 MM with the proximal augmented Lagrangian . . . . . . . . . . . . . . . . . . . 30
2.3.1 Derivation of the proximal augmented Lagrangian . . . . . . . . . . . . . 30
2.3.2 MM based on the proximal augmented Lagrangian . . . . . . . . . . . . . 31
2.3.3 Minimization of the proximal augmented Lagrangian over x . . . . . . . . 33
2.4 Example: Edge addition in directed consensus networks . . . . . . . . . . . . . . 36
2.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.2 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Example: Sparsity-promoting optimal control . . . . . . . . . . . . . . . . . . . . 41
2.5.1 Minimization of the proximal augmented Lagrangian . . . . . . . . . . . . 44
2.5.2 Proximal gradient applied directly to (2.1) . . . . . . . . . . . . . . . . . . 45
2.5.3 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 First-order primal-dual algorithm 50
3.1 Arrow-Hurwicz-Uzawa gradient flow . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Exponential convergence for strongly convex f . . . . . . . . . . . . . . . . . . . 54
3.2.1 Continuous-time dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.2 Discrete time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Distributed implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.1 Example: Optimal placement . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Second-order primal-dual method 66
4.1 Problem formulation and background . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 Generalization of the gradient and Jacobian . . . . . . . . . . . . . . . . . 67
4.1.2 Semismoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.3 Existing second order methods . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1.4 Second order updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 A globally convergent differential inclusion . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 Asymptotic stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.2 Global exponential stability . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 A second order primal-dual algorithm . . . . . . . . . . . . . . . . . . . . . . . . 77
vi
4.3.1 Merit function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Second order primal-dual algorithm . . . . . . . . . . . . . . . . . . . . . 80
4.4 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Example: `1-regularized least squares . . . . . . . . . . . . . . . . . . . . 84
4.4.2 Example: Distributed control of a spatially-invariant system . . . . . . . . 87
4.5 Additional considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5.1 Algorithm based on V (w) as a merit function . . . . . . . . . . . . . . . . 91
4.5.2 Bounding ∇f(x) for Algorithm 3 . . . . . . . . . . . . . . . . . . . . . . . 95
5 Connections with other methods and discussion 98
5.1 Second order updates as linearized KKT corrections . . . . . . . . . . . . . . . . 98
5.2 Connections with other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.1 MM and ADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.2 First order methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.3 Second order methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
II Structured optimal control 103
6 Symmetry and spatial invariance 104
6.1 Symmetric systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.1.1 Convex formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.1.2 Stability and performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.1.3 Performance bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.1.4 Small asymmetric perturbations . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Computational experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2.1 Example: Directed Consensus Network . . . . . . . . . . . . . . . . . . . . 112
6.2.2 Example: Combination drug therapy design via symmetric systems . . . . 113
6.3 Computational advantages for structured problems . . . . . . . . . . . . . . . . . 118
6.3.1 Spatially-invariant systems . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.2 Swift-Hohenberg Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7 Structured decentralized control of positive systems 125
7.1 Problem formulation and background . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.1.1 Background on positive systems . . . . . . . . . . . . . . . . . . . . . . . 126
7.1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
vii
7.2 Convexity of optimal control problems . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2.1 Convexity of f2 and f∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2.2 Differentiability of the H∞ norm . . . . . . . . . . . . . . . . . . . . . . . 131
7.3 Leader selection in directed networks . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3.2 Stability for directed networks . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3.3 Bounds for Problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.3.4 Additional comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4 Computational experiments for leader selection . . . . . . . . . . . . . . . . . . . 141
7.4.1 Example: Bounds on leader selection for a small network . . . . . . . . . 141
7.4.2 Example: Leaders in the neural network of the worm C. Elegans . . . . . 141
7.4.3 Example: Ranking college football teams . . . . . . . . . . . . . . . . . . 143
7.5 Computational experiments for combination drug therapy . . . . . . . . . . . . . 144
7.5.1 Example: Simple problem with nondifferentiable f∞ . . . . . . . . . . . . 144
7.5.2 Example: Real world drug therapy problem . . . . . . . . . . . . . . . . . 145
7.6 Time-varying controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.6.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.6.3 Solution to the optimal control problem . . . . . . . . . . . . . . . . . . . 151
8 Actuator or sensor selection 155
8.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.1.1 Actuator selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.1.2 Sensor selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.2 Customized algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.2.1 Alternating direction method of multipliers . . . . . . . . . . . . . . . . . 161
8.2.2 P -minimization step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.2.3 Z-minimization step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.2.4 Iterative reweighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3 Example: Mass-spring system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3.1 Algorithm speed and computational complexity . . . . . . . . . . . . . . . 164
8.3.2 Sensor selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3.3 Iterative reweighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4 Proximal gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
viii
8.4.1 Elimination of P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4.2 Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.4.3 Example: Damped mass-spring systems . . . . . . . . . . . . . . . . . . . 172
8.5 Example: Flexible wing aircraft . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9 Conclusions and future directions 176
References 178
Appendix A. Crossword 199
ix
List of Tables
5.1 Summary of different functions embedded in the augmented Lagrangian of (2.5)
and methods for solving (2.1) based on these functions. . . . . . . . . . . . . . . 101
7.1 Leaders selected for different values of γ. . . . . . . . . . . . . . . . . . . . . . . . 145
7.2 Optimal budgeted doses and H2/H∞ performance. . . . . . . . . . . . . . . . . . 147
x
List of Figures
1.1 HIV replication-mutation pattern for a set of 4 mutants with a single drug that
affects 2 mutants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 A network of 5 dynamical systems with associated local controllers. . . . . . . . . 11
1.3 Mass-spring system on a line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 (a) The optimal centralized position feedback gain matrix Xp in the system with
50 masses. Both Xp and Xv (not shown) have almost constant diagonals (modulo
edges) and exponential off-diagonal decay. (b) Optimal centralized position gains
for the middle mass n = 25. (c) Truncation of the optimal centralized position
gains for the middle mass n = 25. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 The change of variables that casts the unstructured state-feedback problem as an
SDP, in general, does not preserve the structural properties of X. . . . . . . . . . 14
1.6 Increased emphasis on sparsity encourages sparser control architectures at the
expense of deteriorating the closed-loop performance. For γ = 0 the optimal
centralized controller Xc is obtained from the positive definite solution of the
algebraic Riccati equation. Control architectures for γ > 0 are determined by
X(γ) := argminX (f(X) + γ g(X)) and they depend on interconnections in the
distributed plant and the state and control performance weights Q and R. . . . . 18
1.7 Cardinality function of a scalar variable x and the corresponding absolute value
and logarithmic approximations on x ∈ [−1, 1]. . . . . . . . . . . . . . . . . . . . 19
1.8 The solution x? of the constrained problem (1.10) is the intersection of the con-
straint set C := x | f(x) ≤ σ and the smallest sub-level set of g that touches
C. The penalty function g is the `1 norm (left); the weighted `1 norm with
appropriate weights (middle); and the nonconvex sum-of-logs function (right). . . 20
2.1 The regularization function g1(v) = |v| for a scalar argument v with associated
proximal operator, Moreau envelope, and Moreau envelope gradient when µ = 1. 28
xi
2.2 A balanced plant graph with 7 nodes and 10 directed edges (solid black lines). A
sparse set of 2 added edges (dashed red lines) is identified by solving (2.13) with
γ = 3.5 and R = I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Tradeoff between performance and sparsity resulting from the solution to (2.13)-
(2.14) for the network shown in Fig. 2.2. Performance loss is measured relative
to the optimal centralized controller (i.e., the setup in which all edges in the
controller network are used). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 (a) Total time; (b) number of outer iterations; and (c) average time per outer
iteration required to solve (2.13) with γ = 0.01, 0.1, 0.2 for a cycle graph with
N = 5 to 50 nodes and m = 20 to 2450 potential added edges using PAL (solid
blue –×–), ADMM (dashed red - -- -), and ADMM with the adaptive µ-
update heuristic (dotted yellow · · · · · · ). PAL requires fewer outer iterations
and thus a smaller total solve time. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5 Comparison of proximal gradient with BB step-size selection (solid blue —),
proximal gradient without BB step-size selection (dotted black · · · ) and gradient
descent with BB step-size selection (dashed red - - -) for the x-minimization
step (2.10a) for an unstable network with 20 subsystems, γ = 0.0844, and µ =
0.10. The y-axis shows the distance from the optimal objective value relative to
initial distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.6 Computation time required to solve (2.1) for 10 evenly spaced values of γ from
0.001 to 1.0 for a mass-spring example with N = 5, 10, 20, 30, 40, 50, 100 masses.
Performance of direct proximal gradient (dashed green −−−−), the method
of multipliers (solid blue —×—) and ADMM (dash-dot red − · − ·) is
displayed. All algorithms use BB step-size initialization. . . . . . . . . . . . . . . 48
2.7 Computation time required to solve (2.1) for 20 evenly spaced values of γ from
0.001 to 0.05 for unstable network examples with N = 5, 10, 20, 30, 40, 50 nodes.
Performance of direct proximal gradient (dashed green −−−−), the method
of multipliers (solid blue —×—) and ADMM (dash-dot red − · − ·) is
displayed. All algorithms use BB step-size initialization. . . . . . . . . . . . . . . 49
3.1 Block diagram of primal-descent dual-ascent dynamics where G is a linear system
connected via feedback with nonlinearities. . . . . . . . . . . . . . . . . . . . . . 55
3.2 Block diagram of primal-descent dual-ascent dynamics where the nonlinearities
are replaced by an IQC imposed on their inputs and outputs. . . . . . . . . . . . 55
xii
3.3 Subfigures (a) and (b) show two optimal configurations of mobile agents () with
corresponding fixed agents (black ), Exx is (solid red — and solid blue —
edges) and Exb (dotted black · · · edges). Subfigures (c) and (d) show the agent
trajectories and distance from optimal configuration (a) until t = 2.5 and from
optimal configuration (b) until the final t = 10. . . . . . . . . . . . . . . . . . . 65
4.1 Distance from optimal solution as a function of the iteration number and solve
time when solving LASSO for two values of γ using ISTA, FISTA, and our algo-
rithm (2ndMM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Solve times for LASSO with n = 1000 obtained using ISTA, FISTA, and our
algorithm (2ndMM) as a function of the sparsity-promoting parameter γ. . . . . 86
4.3 Comparison of our algorithm (2ndMM) with state-of-the-art methods for LASSO
with problem dimension varying from n = 100 to 2000. . . . . . . . . . . . . . . . 86
4.4 (a) The middle row of the circulant feedback gain matrix Z; and (b) the sparsity
level of z?γ (relative to the sparsity level of the optimal centralized controller z?0)
resulting from the solutions to (4.31a) for the linearized Swift-Hohenberg equation
with n = 64 Fourier modes and c = −0.01. . . . . . . . . . . . . . . . . . . . . . . 92
4.5 Performance degradation (in percents) of structured controllers relative to the
optimal centralized controller: polished optimal structured controller obtained by
solving (4.31a) and (4.31b) (solid blue ——); unpolished optimal structured
controller obtained by solving only (4.31a) (dashed red - -×- -); and optimal
structured controller obtained by solving (4.31b) for an a priori specified nearest
neighbor reference structure (dotted yellow · · ·+ · · · ). . . . . . . . . . . . . . 93
4.6 Total time to compute the solution to (4.31a) with γ = 0.004 using our algorithm
(2ndMM), proximal Newton, and ADMM. . . . . . . . . . . . . . . . . . . . . . . 94
4.7 Comparison of (a) times to compute an iteration (averaged over all iterations);
and (b) numbers of iterations required to solve (4.31a) with γ = 0.004. . . . . . 95
4.8 Distance from the optimal solution as a function of iteration number when solving
LASSO using Algorithm 4 for different values of µ. . . . . . . . . . . . . . . . . . 96
6.1 Directed network (solid black — arrows) with added undirected edges (dashed
red - - - arrows). Both the H2 and H∞ optimal structured control problems
yielded the same set of added edges. In addition to these edges, the controllers
tuned the weights of the edges (1)− (3) and (1)− (5). . . . . . . . . . . . . . . . 113
6.2 H2 and H∞ performance of the closed-loop symmetric system and the original
system subject to a controller designed at various values of γ. . . . . . . . . . . . 113
xiii
6.3 Sparsity structure of the matrix A and its symmetric counterpart. The elements
of A are shown with blue dots, and the elements in its symmetric component As
are overlaid in green circles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.4 Difference in H2 norm between the symmetric and original systems with different
controllers designed as a function of γ, normalized by the H2 norm of original
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.5 The solid — lines are the H2 norms of the symmetric systems and the dotted
- - - lines are the H2 norms of the original systems. The blue ×, red , and
magenta designate c = 105, 1.4× 106, and 1.9× 107 respectively. . . . . . . . 120
6.6 Computation time for the general formulation (6.3) (solid blue ——) and that
which takes advantage of spatial invariance (6.9) (dotted red · · · ∗ · · · ). . . . . . 124
6.7 Feedback gain −x(φ) for the node at position φ = 0, computed with N = 51 and
γ = 0 (solid black —), γ = 0.1 (dashed blue - - -), and γ = 10 (dotted red
· · · ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.1 A directed network and the sparsity pattern of the corresponding graph Laplacian.
This network is stabilized if and only if either node 1 or node 2 are made leaders. 137
7.2 H2 performance of optimal leader set (blue ×) and upper bounds resulting from
“rounding” (yellow ) and the optimal leaders for the undirected network (red
+). Performance is shown as a percent increase in f2 relative to f lb2 (N). . . . . 142
7.3 Network of games in the 2015 − 2016 College Football season. The central con-
nected component corresponds to the top division, and the distal nodes are lower
division teams for whom there are less data. . . . . . . . . . . . . . . . . . . . . . 144
7.4 C. Elegans neural network with N = 10 (a) f2 and (b) f∞ leaders along with
the (c) f2 and (d) f∞ performance of varying numbers of leaders N relative to
f lb(N). In all cases, leaders are selected via “rounding”. . . . . . . . . . . . . . . 146
7.5 A directed network and corresponding A matrix for a virus with 4 mutants and
2 drugs. For this system, f∞ is nondifferentiable. . . . . . . . . . . . . . . . . . . 147
7.6 Comparison of different algorithms starting from initial condition [2.5 2.8]T . The
algorithms are the subgradient method with a constant step-size (dotted blue
· · · ), the subgradient method with a diminishing step-size (solid red —) and
our optimal subgradient method where the step-size is chosen via backtracking
to ensure descent of the objective function (dashed yellow - - -) . . . . . . . . 148
7.7 Mutation pattern of HIV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
xiv
7.8 Percent performance degradation for H2 (solid red —×—) and H∞ (dashed
blue −− −−) performance relative to using all 5 drugs. . . . . . . . . . . . . . 149
8.1 Scaling of computation time with the number of states for CVX and for ADMM for
mass-spring system with γ = 100 and position and velocity outputs. Empirically,
we observe that CVX scales roughly with n6 while ADMM scales roughly with n3.165
8.2 Scaling of computation time with the number of sensors for CVX and for ADMM
for mass-spring system with γ = 100 and n = 50. Outputs are random linear
combinations of the states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3 Percent increase in fo(L) in terms of the number of sensors. . . . . . . . . . . . . 167
8.4 Retained position sensors as γ increases. A blue dot indicates that the position of
the corresponding mass is being measured. The top row shows the densest sensor
topology, and the bottom row shows the sparsest. . . . . . . . . . . . . . . . . . . 167
8.5 Retained velocity sensors as γ increases. A blue dot indicates that the velocity of
the corresponding mass is being measured. The top row shows the densest sensor
topology, and the bottom row shows the sparsest. . . . . . . . . . . . . . . . . . . 168
8.6 Number of sensors versus γ for a scheme which uses iterative reweighting and for
the scheme which uses constant weights. Iterative reweighting promotes sparser
structures earlier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.7 Computation time for mass-spring system damped with b = 0.1 and γ = 30. . . . 173
8.8 Body Freedom Flutter flexible wing testbed aircraft. . . . . . . . . . . . . . . . . 173
8.9 (a) Number of sensors as a function of the sparsity-promoting parameter γ; and
(b) Performance comparison of the Kalman filter associated with the sets of sen-
sors resulting from the regularized sensor selection problem and from truncation. 174
8.10 The `2 norm of each column of the Kalman gain for Kalman filters designed for
different sets of sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
xv
Chapter 1
Introduction
1.1 A historical perspective
Feedback control theory is the science of designing a controller that connects a system’s output
to its input in order to achieve some desired behavior. A simple example is a thermostat: this
controller achieves a target temperature by adjusting current through heating coils or strength
of air conditioning to counteract unwanted deviations from a desired temperature. The concept
of feedback control arguably traces back to irrigation canals in ancient Mesopotamia built to
facilitate farming [1]. Modern mathematical study of control systems began in the 19th cen-
tury with the study of centrifugal governors used to keep steam engines running at a constant
speed [1]. During the Second World War, applications of control theory for weapons guidance
systems drove the development of graphical single-input single-output control design techniques
based on loop-shaping, Bode plots, and Nyquist diagrams [2]. After the war, the challenge
of controlling rockets during the space race prompted the study of multi-input multi-output
systems and eventually led to the formalism of optimal control [3].
The optimal control of linear systems with quadratic performance measures, such as the
H2 and H∞ norms, is a cornerstone of systems theory. This framework provides a systematic
way to balance closed-loop performance, robustness, and control effort. In the conventional
formulation, an optimal controller is designed to minimize some measure of the amplification
from exogenous sources of excitation to a regulated output which penalizes both the system state
and the control effort. Typically, these optimal controllers do not have a particular structure
and require a centralized implementation in which measurement and control must involve all
1
2
outputs and inputs and must be performed at a single, central location.
While they provide a powerful framework, most traditional optimal control tools are ill-suited
to problems where controller structure is of tantamount importance. In many modern appli-
cations, constraints on communication, computation, and physics impose significant structural
restrictions on the controller. Designing controllers subject to such restrictions is a challenging
problem. In fact, even determining stabilizability was shown to be NP hard in general [4, 5].
Nevertheless, designing structured controllers is of growing importance for many modern
problems, such as the distributed control of large-scale networks of dynamical systems. Systems
of this type are found in applications ranging from distributed power generation, to deployment
of teams of robotic agents, to control of segmented mirrors in extremely large telescopes, to
control of fluid flows around wind turbines and vehicles. The development of analytical and
computational methods for tractable analysis and design of such networks is a major challenge.
Recent technological advances have allowed the individual components of large-scale systems
to be equipped with their own sensing, actuation, communication, computation, and decision
making capabilities. Advances in Micro-Electro-Mechanical-Systems (MEMS) have enabled the
development of arrays of sensors and actuators that can interact with one another. Strings of
vehicular platoons, unmanned aerial vehicles (UAVs), and robotic agents constitute another set
of examples of large-scale autonomous systems [6, 7].
In many of these applications, the scale of the problem, constraints on computing and com-
munication resources, and wide-spread sensing and actuating capabilities pose additional re-
quirements on controller complexity. Typically, these cannot be addressed using tools from
standard optimal control theory. For example, a dense state-feedback controller resulting from
the LQR framework would impose a prohibitive communication burden in large-scale networks.
This is because forming every control input requires information from every subsystem in the
distributed plant. The cost of creating and maintaining communication links makes such an
all-to-all topology infeasible in most large-scale and distributed systems.
This has motivated the design of structured (both decentralized and distributed) controllers.
Early efforts have centered on the design of decentralized strategies [8] and, during the last fifteen
years, the emphasis has shifted to the design of distributed controllers [9–37]. Incorporating
structure into systems theory has also enabled the analysis [38–41], control [42–45], and low
complexity modeling [46, 47] of large-scale fluid flows. Two major issues have emerged: the
identification of controller architectures of convex classes of structured control problems and
optimal control design under a priori specified structural constraints.
3
In some cases, these questions can be resolved using standard tools under additional assump-
tions. Optimal control problems are often reformulated using the Youla parameterization [48,49].
The mapping from the controller to the Youla parameter is nonlinear which typically compro-
mises convexity of the structural constraints in the distributed setup. It is thus important to
identify subspaces which remain invariant under this nonlinear mapping for distributed sys-
tems. In [11, 15], the subspaces of cone and funnel causal systems have been introduced; these
describe how information from every controller propagates through the distributed system. For
spatially-invariant systems, the design of quadratically optimal controllers can be cast into a con-
vex problem if the information in the controller propagates at least as fast as in the plant [11,15].
A similar but more general algebraic characterization of the constraint set was introduced and
convexity was established under the condition of quadratic invariance in [16]. Other classes
of convex distributed control problems include partially-nested systems [50–52], poset-causal
systems [19,25], and positive systems [53–59].
Since most of these convex formulations are expressed in terms of the impulse response pa-
rameters, they do not lend themselves easily to state-space characterization. Apart from very
special instances, the optimal distributed design problem remains challenging. For poset-causal
systems, explicit Riccati-based solutions for the optimal decentralized state-feedback problem
were obtained in [19, 25]. For a two-player problem with block-triangular state-space matrices,
the optimal decentralized output-feedback solution was recently provided in [23,29]. Character-
izing the structural properties of optimal distributed controllers is another important challenge.
For spatially invariant systems, the quadratic optimal controllers are also spatially invariant and
the information from other subsystems is exponentially discounted with the distance between
the controller and the subsystems [9]. For systems on graphs, this spatially decaying property
was studied in [60,61] and it motivates the search for inherently localized controllers.
Optimal control under structural constraints remains a challenging problem [32]. Since these
constraints are often combinatorial in nature (e.g., selecting k communication links between N
nodes), structured optimal control has two main aspects: the identification of effective structures
for the purpose of control (e.g., the communication topology between distributed controllers)
and the design of controllers given a specific structure (e.g., the control law implemented over a
given communication network).
4
1.2 Optimal control
The notion of controller structure can have different connotations and the design variable can
impact the dynamics in different ways. We first define the dynamics we study and the perfor-
mance metrics we use to quantify closed-loop performance. We consider the general dynamics,
ψ = (A + F (x))ψ + Bd
ζ =
C
R(x)
ψ (1.1)
where x ∈ Rm is a design parameter, F : Rm → Rn×n is a linear operator, ψ(t) ∈ Rn is the
state vector, d(t) represents an exogeneous input, and ζ is a regulated output. The matrix
C represents a mapping from ψ to a state penalty and R(x) is a mapping from the state to a
measure of control effort. In Section 1.2.1, we illustrate how this formulation generalizes standard
state feedback and encapsulates state feedback, output feedback, edge addition in directed and
undirected consensus networks, leader selection, and other problems.
In the optimal control problems we consider, the objective is to improve the open-loop (i.e.,
no controller, or x = 0) performance of a system by implementing a static controller x. Solving
optimal control problems amounts to designing controllers x that minimize a performance metric
that quantifies the performance of the closed-loop transfer function from the exogenous input
d to the regulated output ζ. This thesis mainly considers the H2- and H∞-optimal control
frameworks, where closed-loop performance is measured by variance amplification and the peak
of the frequency response, respectively.
We first provide examples of problems that are included in formulation (1.1), provide the
expressions for theH2 andH∞ norms, and then motivate the study of structured optimal control
with a distributed systems example.
5
1.2.1 Applications
Static feedback
The standard static state feedback formulation,
ψ = Aψ + B1d + B2u
u = −Xψ
ζ =
Cψ
R1/2u
(1.2)
with state-feedback matrix X is recovered from (1.1) by taking x = vec(X), F (x) = −B2X
and R(x) = −R1/2X for a positive semidefinite R. When the controller can only measure some
output ζ2 = C2ψ, taking F (x) = −B2XC2 and R(x) = −R1/2XC2 yields the static output
feedback formulation.
We describe and motivate this problem in the context of distributed systems in Section 1.3.
In Section 2.5, we develop an algorithm for designing structured controllers for (1.2) following
the formulation introduced by [62–64].
Edge addition in stochastically forced consensus networks
The problem of adding undirected ‘controller’ edges to an existing ‘plant’ consensus network
also involves dynamics of the form (1.1),
ψ = −(L + E diag(x)ET
)ψ + d
where each element of the state vector, ψi, represents a node in the network, L is a directed graph
Laplacian which contains information about how the nodes are connected in the ‘plant’ network,
E contains information about the locations of potential added edges, x is a vector of added edges
and F (x) = −E diag(x)ET is the corresponding weighted graph Laplacian associated with the
‘controller’ network [65].
Consensus networks have garnered much interest for problems dealing with collective decision-
making and collective sensing [66–69]. These networks have proven useful in applications as
varied as modeling animal group dynamics, control for spacecraft flying in formation, and data
fusion in sensor networks [70–72].
Adding edges to a consensus network improves its performance, but selecting a limited
6
number of edges to add is a combinatorial problem. Much recent work has focused on addressing
the optimal edge addition problem using convex optimization techniques [73–75] and developing
efficient algorithms that scale to very large networks [76]. However, most of these approaches
consider undirected networks. We extend this problem and study edge addition in directed
consensus networks in Sections 2.4 and 6.2.1.
Leader selection in consensus networks
This problem concerns the identification of influential nodes in consensus networks. These special
nodes, so-called leaders, can be equipped with additional information in order to influence the
network behavior in a beneficial way. One application is in vehicular formations, where the
objective is for the vehicles to gather at a certain point. Here, the ‘leaders’ are equipped absolute
measurements (e.g., from GPS units) and the other nodes must rely on relative measurements
(e.g., their distance from certain neighbors). The dynamics of the problem are,
ψ = − (L + diag(x)) ψ + d
where L is a graph Laplacian and x is a nonnegative vector whose nonzero entries identify the
leaders.
The question of how to optimally assign a predetermined number of nodes to act as leaders
in a network of dynamical systems with a given topology has also recently emerged as a useful
proxy for identifying important nodes in a network [77–84]. In [80], the authors develop a greedy
algorithm for leader selection in undirected networks and use convex relaxations to quantify
performance bounds. In [78,79], the authors derive an explicit expression for the set of optimal
leaders in terms of graph theoretic properties which is computationally tractable for very few for
very many leaders. In [85], the authors characterize bounds on the convergence rate based on
the distance between leaders and followers. Most of these results focus on undirected networks
with the exception of [86] in which the authors derive the optimal leaders for one-dimensional
directed path networks.
In contrast to earlier work, our framework can handle leader selection in directed consensus
networks [87,88]. We study this problem in detail in Section 7.3.
7
ψ1
ψ2 ψ4
ψ3
u1
A =
? ? ?? ?? ?
? ?
Fx =
????
Figure 1.1: HIV replication-mutation pattern for a set of 4 mutants with a single drug thataffects 2 mutants.
Combination drug therapy design for HIV treatment
The evolution-replication dynamics of HIV subject to treatment [89,90] can be cast as,
ψ = (A + diag (Fxx)) ψ + d.
Here, the elements of ψ represent populations of individual HIV mutants. The diagonal elements
of A represent each mutant’s replication rate and the off diagonal elements of A represent the
probability of mutation from one mutant to another. The linear function F (x) := diag (Fxx)
captures the effect of drug therapy. The components of the vector x are doses of different drugs
and the kth column of matrix Fx is describes how efficiently drug k kills each HIV mutant. See
Fig. 1.1 for an illustration.
Structure is particularly important for this problem since drug-drug interactions cannot
always be captured by the linear model. These interactions can be vital for efficacy of the
drug or may have fatal consequences, so it is important to apply constraints on dosages and to
impose penalties that promote the satisfaction of combinatorial conditions. However, imposing
constraints on drug dosages is challenging using existing positive systems tools for H2 and H∞design because the drug doses x do not appear explicitly as optimization variables [59, 89, 90],
making regularization difficult.
We first study design of structured controllers by designing a conservative structured con-
troller in Section 6.2.2. We then show convexity of the original problem in Section 7. Since
our formulations use the drug doses x as an optimization variable, our results apply naturally
8
structured combination drug therapy design; we explore this application in Section 7.5.
1.2.2 Performance metrics
In what follows, we use the H2 and H∞ norms to quantify the closed-loop performance. In
the centralized state-feedback problem, the optimal control law for LTI systems is given by
static state-feedback. Even though it is not clear if the optimal structured controller is also
memoryless, we design structured static controllers as a first iteration in design. In Chapter 7,
we explore the use of a time-varying controller for a certain class of structured control problem.
The H2 performance metric is the steady-state amplification from white-in-time stochastic
forcing at the disturbance input d to the performance output ζ of system (1.1),
f2(x) := limt→∞
E(ζT (t) ζ(t)
)= lim
t→∞E(ψT (t)CTC ψ(t) + uT (t)RT (x)R(x)u(t)
).
This quantity is determined by the square of theH2 norm of system (1.1) and it can be expressed
as a function of the controller x as
f2(x) =
trace((CTC + RT (x)R(x))P
), x stabilizing
+∞, otherwise(1.3a)
where x must be such that the closed-loop dynamical generator, A+F (x), corresponds to a stable
system (the spectrum of A+ F (x) must lie completely in the open left-half plane). The matrix
P is the closed-loop controllability gramian given by the solution to the Lyapunov equation,
(A + F (x))P + P (A + F (x))T + B1BT1 = 0.
When F (x) = −B2X, R(x) = −R1/2X, and there are no structural constraints on the matrix
X, the optimal H2 feedback gain is determined by the linear quadratic regulator (LQR) and it
can be computed via the solution of an algebraic Riccati equation.
The H∞ performance metric, which we denote by f∞(·), is the maximum induced L2 gain
from d to ζ in system (1.1),
f∞(x) := sup‖d‖L2≤1
‖ζ‖L2
‖d‖L2
,
where the L2 norm of a signal v is defined as,
‖v‖2L2:=
∫ ∞0
vT (t)v(t) dt.
9
This performance metric corresponds to the peak of the frequency response,
f∞(x) = supω∈R
σ(C (jωI − (A + F (x)))−1B
). (1.3b)
As with f2, the optimalH∞ control problem for unstructured state feedback can be cast in a con-
vex form and readily solved. However, as we describe next, incorporating structural restrictions
on x significantly complicates the design problem.
1.3 Distributed Control
To motivate the study of and explain the difficulty inherent to structured optimal control,
we begin with a discussion of networks of dynamical systems. In this problem, the controller
structure of interest is the communication topology induced by the control law. Traditional
design techniques, such as LQR, yield controllers which require a centralized implementation
where every subsystem communicates with a central node.
For large-scale systems, the computational and communication costs associated with such
an all-to-all communication topology may be prohibitively high. It is thus of interest to design
controllers with distributed structure and sparse communication topologies. We first draw a con-
nection between sparsity of the feedback gain matrix and the induced communication topology,
and then highlight challenges that arise in the design of structured state-feedback controllers.
Let us assume that (1.2) contains N individual subsystems, each with a local state and
control inputs. We partition the state and control input vectors into subvectors corresponding
to each subsystem, ψ := [ψT1 · · · ψTN ]T and u = [uT1 · · · uTN ]T , and write the subsystem
dynamics as,
ψi = Aii ψi +∑j 6= i
Aij ψj + B1i d + B2,ii ui. (1.4a)
The block-sparsity pattern of A determines the interaction topology between subsystems; when
Aij is zero, subsystem j has no direct effect on the evolution of the state of subsystem i.
With each subsystem we associate a controller that specifies the control input ui. Standard
optimal control techniques typically induce a communication topology which requires every local
controller to have access to the state of every subsystem. In large-scale networks of dynamical
systems, this may impose significant communication burden and implementation may be pro-
hibitively expensive. It is thus of interest to explore the design of feedback laws that utilize
limited information exchange within a large-scale network.
10
Under linear state-feedback u = −Xψ, the dynamics (1.4a) become,
ψi = Aii ψi +∑j 6= i
Aij ψj + B1i d − B2,ii
∑j
Xij ψj . (1.4b)
Thus, the block-sparsity pattern of the feedback gain matrix X determines the communication
topology of the static controller: forming the control input ui requires access to the states of
each subsystem j for which Xij is nonzero.
Figure 1.2a illustrates a network of coupled subsystems, associated controller topology, and
the sparsity patterns of the corresponding matrices A and X. The subsystems in the physical
layer are represented by blue octagons; their interaction topology is marked by the blue arrows
which correspond to the sparsity pattern of the matrix A. Each local controller is represented
by a yellow circle; the structure of the information exchange network between the two layers is
marked by the red arrows which correspond to the sparsity pattern of the feedback gain matrix
X.
In the more general setup where the local controllers are dynamic (perhaps because they
estimate the subsystem’s state rather than directly measure it), it is important to determine
the order of local controllers as well as the structure of the information exchange network in the
controller layer; see Fig. 1.2b for an illustration. Recent advances have been made for particular
classes of systems [23,29], but addressing these questions in general remains an open challenge.
1.3.1 An example
After permuting ψ relative to the definition in (1.4) for clarity of exposition, the state vector for
the mass-spring system shown in Fig. 1.3 is determined by ψ =[pT vT
]T, where p and v are
the vectors of positions and velocities of the N masses. We set all masses and spring constants
to unity. Assuming that the control and disturbance inputs enter as forces and partitioning
matrices in the state-space model (1.2) conformably with ψ yields
A =
0 I
T 0
, B1 = B2 =
0
I
,where T is an N ×N tridiagonal Toeplitz matrix with −2 on its main diagonal and 1 on its first
sub- and super-diagonal. In the absence of the structural constraints, the solution to the Riccati
equation yields the centralized H2-optimal controller, i.e., the linear quadratic regulator. In this
case, the LQR, X := [Xp Xv ], has dense position and velocity feedback gain matrices Xp and
11
X =
∗ ∗ ∗∗∗∗ ∗
∗
A =
∗ ∗∗ ∗ ∗∗ ∗ ∗∗ ∗
∗ ∗ ∗
(a) The local controllers are memoryless.
(b) The local controllers are dynamic.
Figure 1.2: A network of 5 dynamical systems with associated local controllers.
12
Figure 1.3: Mass-spring system on a line.
Xv, u1(t)
u2(t)
u3(t)
u4(t)
= −
∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗
︸ ︷︷ ︸
Xp
p1(t)
p2(t)
p3(t)
p4(t)
−∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗
︸ ︷︷ ︸
Xv
v1(t)
v2(t)
v3(t)
v4(t)
.
Even though these matrices are populated with non-zero elements, the gains that are used to
form control actions for individual masses display interesting patterns. Figure 1.4 illustrates the
optimal centralized position feedback gain matrix Xp in the system with 50 masses. Apart from
the edges, both Xp and Xv (not shown) have almost constant diagonals and exponential off-
diagonal decay. This suggests that good performance may be achievable if the smaller elements
of the feedback gain matrix are set to 0. Since the small feedback gains represent interactions
between masses that are spatially distant, such a strategy would allow the masses’ controllers
to interact in a distributed fashion.
More generally, for spatially invariant systems such as the mass-spring system on a circle,
the optimal controllers with respect to quadratic performance indices (e.g., H2, H∞) are also
spatially invariant and they exponentially discount information with spatial distance [9]. More-
over, it has been suggested that optimal controllers for spatially-decaying systems over general
graphs also possess spatially-decaying property [60, 61]. This motivates the search for inher-
ently localized controllers and suggests that localized information exchange in the distributed
controller may provide a viable strategy for controlling large-scale systems.
However, in general, low-magnitude elements in the feedback gain cannot reliably be assumed
unimportant. It is difficult to provide error bounds on the deviation from optimality as a result
of truncation (e.g., as in Fig. 1.4c) and it has been recognized that truncation of the centralized
controller could significantly compromise the closed-loop performance and even yield a controller
that does not guarantee closed-loop stability [60,64].
1.3.2 Lack of convexity for structured control
We consider the state feedback case, i.e., F (x) = B2X, for H2 optimal control. The design of the
optimal state-feedback gain X, subject to constraints on its sparsity pattern (equivalently, on
13
(a) (b) (c)
Figure 1.4: (a) The optimal centralized position feedback gain matrix Xp in the system with50 masses. Both Xp and Xv (not shown) have almost constant diagonals (modulo edges) andexponential off-diagonal decay. (b) Optimal centralized position gains for the middle massn = 25. (c) Truncation of the optimal centralized position gains for the middle mass n = 25.
the communication topology in Fig. 1.2a) has a rich history and was recently revisited in [91,92].
Let the subspace S encapsulate these structural constraints and let us assume that there is a
stabilizing X ∈ S. The optimal control problem of determining stabilizing X ∈ S that minimizes
the H2 norm of system (1.2) can be formulated as
minimize f(X)
subject to X ∈ S(1.5)
and brought into the following form
minimizeP,X
trace((Q + XTRX)P
)subject to (A − B2X)P + P (A − B2X)T + B1B
T1 = 0
P 0, X ∈ S
(1.6)
where Q := CTC and R 0. In the absence of the structural constraint X ∈ S, a standard
change of variables [49]
Z := X P (1.7)
can be used to express the square of the H2 norm as,
f(P, Y ) = trace (QP ) + trace (RZP−1 ZT )
14
Figure 1.5: The change of variables that casts the unstructured state-feedback problem as anSDP, in general, does not preserve the structural properties of X.
and the Schur complement can be employed to cast the optimal state-feedback H2 control
problem as an SDP,
minimizeP, Y, Z
trace (QP ) + trace (RΘ)
subject to (AP − B2 Z) + (AP − B2 Z)T
+ B1BT1 = 0 Θ Z
ZT P
0.
Since P is positive definite, it is invertible, and the optimal centralized (i.e., unstructured) X
is determined by Xc = ZP−1. This centralized solution coincides with the linear quadratic
regulator, which can be explicitly determined by Xc = R−1BT2 P where P is the unique positive
definite solution of the algebraic Riccati equation, ATP + PA+Q− PB2R−1BT2 P = 0.
The above change of variables is, in general, not suitable for imposing structure on X.
Although the constraint on the feedback gain matrix X ∈ S is linear and thus convex, the
corresponding constraint on P and Z is bilinear, ZP−1 ∈ S. This makes it difficult to translate
the sparsity patterns of X to the sparsity patterns of P and Z (see Fig. 1.5 for an illustration),
thereby limiting the use of these coordinates for structured design problems. In fact determin-
ing stabilizability, let alone achieving optimal closed-loop performance, is NP hard for general
structured problems [4, 5].
If P is restricted to be diagonal, the sparsity structure of X coincides with the sparsity
structure of Z. However, this may introduce considerable conservatism in the design and may
not even lead to a feasible SDP characterization (even when the original nonconvex problem is
feasible). One notable special case where this relaxation is tight appears in the H∞ optimal
15
control of positive systems [54].
1.4 Regularized problems
In order to design structured controllers, we draw inspiration from recent advances in compres-
sive sensing. Given some additional assumptions on data structure, it is possible to completely
recover signals even when they are sampled at frequencies below Nyquist rate. While it may
seem magical to a student fresh from an undergraduate signals and systems course, this result
makes intuitive sense; using additional information about the signal’s structure allows one to
recover it with less data from elsewhere. The magic sauce is the concept of regularization, which
allows one to incorporate additional knowledge about the signal structure into the recovery prob-
lem by augmenting the standard loss functions associated with signal sampling with nonsmooth
structure-promoting penalty terms.
Regularized problems have found applications in diverse fields including compressive sens-
ing [93], machine learning [94], statistics [95], image processing [96], and, as we develop in this
thesis, control theory [32]. This is a class of composite optimization problems in which the
objective function is a sum of a differentiable but possibly nonconvex component and a convex
nondifferentiable component,
minimize f(x) + γ g(x).
The differentiable component of the objective function, f(x), typically encodes some measure of
performance. In a least-squares setting, f(x) = ‖Ax− b‖2 measures how well a candidate set of
parameters x satisfies the linear relationship Ax = b. The nondifferentiable component of the
objective function, g(x), is a structure-promoting penalty function and γ specifies the emphasis
on this structure relative to performance. When γ = 0, an unstructured x will be recovered
and larger values of γ yield more structured optimal solutions. As an example, the `1 norm
g(x) =∑|xi| is a commonly used proxy for promoting sparsity of x.
In much of the work that utilizes regularization, γ can be chosen in a systematic way that
arises from assumptions on the problem structure and problem data, e.g. number of nonzero
entries and the statistics of noise affecting the measurements. In contrast, we use this framework
somewhat artificially. Instead of using regularization to harness a priori knowledge of the
structure of x in order to recover it, we use regularization to artificially impose structure on a
design variable x to satisfy engineering requirements.
In feedback synthesis, we augment a traditional performance metric (such as the H2 or H∞
16
norm) with a regularization function to promote certain structural properties in the optimal
controller. For example, the `1 norm and the nuclear norm are commonly used nonsmooth
convex regularizers that encourage sparse and low-rank optimal solutions, respectively.
Such regularized problems can be used to identify controller structure. This is particularly
important because recently, it has been demonstrated that the design of controller architectures
can have a more profound impact on the closed-loop performance than the optimal design under
a given pre-specified architecture [97]. In [62, 64], tools and ideas from control theory, opti-
mization, and compressive sensing have been combined to systematically address the challenge
of designing controller architectures. The proposed approach introduces regularized versions
of standard optimal control problems and aims to strike a balance between closed-loop perfor-
mance and controller complexity. As discussed earlier, when the state vector and control inputs
can be partitioned into subvectors that correspond to separate subsystems, promoting sparsity
of the feedback gain matrix limits information exchange between the physical system and the
controller. Sparse controller architectures can be designed by augmenting standard quadratic
performance measures with sparsity-promoting penalty functions which serve as measures of
controller complexity. Such an approach has received much recent attention [62,64,98–103].
Alongside sparse feedback synthesis, the critical question of sensor and actuator selection has
been recently considered in [104,105]. Although, in general, finding the solution to this problem
requires an intractable combinatorial search, by drawing upon recent developments in sparse
representations, this problem can be cast as a semidefinite program (SDP). Moreover, it is also
of interest to study problems where it is desired to impose structure on a linear function of the
design variable [106,107]. This broader framework covers a wide variety of problems ranging from
wide-area and distributed PI control of power networks [108–111], to combination drug therapy
for HIV treatment [112], to edge addition [76,113,114] and leader selection [77,78,80,81,83,84,87]
in consensus networks.
Several recent efforts have focused on establishing convexity for classes of these problems and
on developing efficient algorithms for optimal controller design for both convex and nonconvex
problems. Convex structured optimal control problems include symmetric modifications to sym-
metric linear systems [62,115,116], diagonal modifications to positive systems [87,112], optimal
sensor and actuator selection [104,105], and edge addition to undirected consensus [76,113,114]
and synchronization [117] networks. Algorithmic developments have employed alternating di-
rection method of multipliers [64,105], proximal gradient and Newton methods [114], as well as
first- and second-order method of multipliers [107, 118–120] to efficiently perform identification
of controller structure and structured feedback synthesis.
17
1.5 Design of structured controllers via regularization
The communication architecture S of the state-feedback controller in (1.6) is fixed and a priori
specified which may impose limits on the achievable performance. For problems where the com-
munication topology is not fixed, it is desirable to design a favorable communication topology
while promoting sparsity of the communication links. To achieve this, an optimization frame-
work which augments the H2 objective function with a penalty on the sparsity of the feedback
gain matrix (i.e., the number of communication links) was introduced in [62–64].
1.5.1 Structure identification
Our objective is to design controller architecture that achieves a desired tradeoff between the
quadratic performance of system (1.1) and the structure of the controller x. To address this
challenge we consider a regularized optimal control problem
minimizex
f(x) + γ g(Tx)
←−
←−
closed-loop
performance
controller
structure
(1.8)
where f(x) is the H2 or H∞ norm of system (1.1). In contrast to (1.6), no structural constraints
are imposed on x in (1.8); instead, the objective is to manage structure of the controller by
introducing a regularization term g(Tx) into the optimal control problem. The matrix T allows
regularization of structure in a different set of coordinates. The non-negative regularization
parameter γ encodes the emphasis on controller structure relative to the closed-loop performance.
For the state feedback problem F (x) = B2X regularized with a sparsity-promoting penalty
function, γ = 0 yields the centralized LQR solution. As γ increases, larger emphasis is placed
on obtaining a more structured feedback gain matrix X; see Fig. 1.6 for an illustration.
Problem (1.8) is difficult to solve directly because f is typically a nonconvex function of x
and g is convex but not differentiable. While the nonlinear change of coordinates (1.7) yields a
convex dependence of f on P and Y , in general, it introduces a nonconvex dependence of the
regularization term g on these optimization variables.
18
Figure 1.6: Increased emphasis on sparsity encourages sparser control architectures at the ex-pense of deteriorating the closed-loop performance. For γ = 0 the optimal centralized controllerXc is obtained from the positive definite solution of the algebraic Riccati equation. Control archi-tectures for γ > 0 are determined by X(γ) := argminX (f(X) + γ g(X)) and they depend on in-terconnections in the distributed plant and the state and control performance weights Q and R.
Sparsity-promoting regularizers
Elementwise sparsity of x can be promoted by incorporating the cardinality function into the
optimal control problem (1.8),
g0(x) = card (x) . (1.9a)
This regularizer counts the number of nonzero elements in x and it yields a combinatorial
optimization problem (1.8) whose solution typically requires an intractable combinatorial search.
A weighted `1 penalty,
g1(x) := ‖w x‖1 =∑i
wi |xi| (1.9b)
provides a convex proxy for promoting elementwise sparsity of x [121]. Here, w is the vector
whose elements are non-negative weights wi and is elementwise matrix multiplication. The
weights wi can be selected to place larger relative penalties on certain elements of x. Similarly,
the sum of the `2 norms (not the `2 norm squared) of the subvectors xi ∈ Rni ,
g2(x) =∑i
wi ‖xi‖2 (1.9c)
enhances group sparsity; i.e., sparsity at the level of subvectors [122].
The `1 norm is the largest convex function that underestimates the cardinality function
on the domain [−1, 1] [123]; see Fig. 1.7 for an illustration in the scalar case. Both the `1
19
Figure 1.7: Cardinality function of a scalar variable x and the corresponding absolute value andlogarithmic approximations on x ∈ [−1, 1].
norm and its weighted version are convex relaxations of card(X). On the other hand, better
approximation can be obtained with nonconvex functions, e.g., the sum-of-logs,
g4(x) =∑i
log
(1 +
|xi|ε
), 0 < ε 1. (1.9d)
The weighted `1 norm attempts to bridge the difference between the `1 norm and the cardinality
function. In contrast to the cardinality function that assigns the same cost to any nonzero
element, the `1 norm penalizes more heavily the elements of larger magnitudes. The positive
weights can be chosen to counteract this magnitude dependence of the `1 norm. For example,
if the weights wi are inversely proportional to the magnitude of xi, wi = 1/|xij |, xi 6= 0
wi = ∞, xi = 0
then there is no difference between the weighted `1 norm of x and the cardinality function of
x. This scheme, however, cannot be implemented, because the weights depend on the unknown
feedback gain. A re-weighted algorithm that solves a sequence of weighted `1 optimization
problems was proposed in [121]. In this, sequential linearization of the sum-of-logs function is
used and the weights are determined by the solution of the optimization problem in the previous
iteration. This algorithm has provided an effective heuristics for promoting sparsity in many
emerging applications.
Additional intuition about the role of sparsity-promoting regularizers can be gained by con-
sidering a problem in which it is desired to find the sparsest feedback gain that provides a given
20
`1 weighted `1 sum-of-logs
Figure 1.8: The solution x? of the constrained problem (1.10) is the intersection of the constraintset C := x | f(x) ≤ σ and the smallest sub-level set of g that touches C. The penalty function gis the `1 norm (left); the weighted `1 norm with appropriate weights (middle); and the nonconvexsum-of-logs function (right).
level of H2 performance σ > 0,
minimize card (x)
subject to f(x) ≤ σ.
Approximating card (x) with a penalty function g(x) yields
minimize g(x)
subject to f(x) ≤ σ.(1.10)
The solution to (1.10) is the intersection of the constraint set C := x | f(x) ≤ σ and the
smallest sub-level set of g that touches C; see Fig. 1.8 for an illustration. In contrast to the
`1 norm whose sub-level sets are determined by the convex `1 ball, the sub-level sets of the
nonconvex sum-of-logs function have a star-like shape.
Other regularization functions
The nuclear norm of a matricial variable X,
g∗(X) :=∑i
σi(X)
21
is the sum of the singular values of the matricial variable X and promotes a low-rank optimal
solution. Enforcing low-rank solutions is often important for low-complexity modelling [46,124–
127]
The indicator function associated with a convex set C,
gC(x) :=
0 x ∈ C∞ x 6∈ C
can be used to enforce that x lie in the convex set C. Such sets can encode structural properties
as simple as box constraints and can even be used to promote more complicated combinatorial
conditions such as mutual exclusivity or necessity between elements of x (i.e., either xi or xj
may be nonzero or xj may only be nonzero if xi is nonzero) [128]. These sorts of constraints
are very important in the context of the drug therapy example described in Section 1.2.1 since
linear models often cannot capture the full complexity of drug-drug interactions.
Finally, we note that recent work has used the framework of atomic norms [129] to penal-
ize communication between subsystems in large-scale distributed systems using more nuanced
measures of controller complexity [100–103].
1.5.2 Polishing
After having identified the controller architecture, we optimize the closed-loop H2 performance
over the identified structure S. This is necessary because the presence of the regularizer in (1.8)
often distorts the optimal solution. For example, in the `1 regularized case, the `1 norm imposes
an additional penalty on the magnitude of the feedback gains, resulting in worse closed-loop
performance. As a result, we obtain the final structured controller by solving a structured
problem,
minimize f(x)
subject to x ∈ S
which is equivalent to (1.8) where the regularizer is an indicator function corresponding to the
identified structure S.
For nuclear norm regularization, polishing amounts to optimizing over the singular values of
the matrix obtained by solving (1.8). Regularization with an indicator function does not require
a polishing step because it does not distort the value of the performance metric f in (1.8) as it
enforces an exact condition on x and is not a proxy for a nonconvex or combinatorial constraint.
22
1.6 Outline of thesis
This thesis approaches the challenge of designing structured controllers for linear systems by
solving regularized optimal control problems. In order to deal with the inherent difficulties of
solving these optimization problems, we tackle two overarching problems. The first is that of
developing tractable methods for solving regularized optimal control problems, which are in
general nonconvex. The second is that of identifying classes of problems for which regularized
optimal control can be placed into a convex form as well as developing methods to use convex
design problems to provide suboptimal solutions for the harder, nonconvex problems.
In Part I, we study algorithms for solving regularized problems. In Chapter 2, we develop a
novel splitting method based on a reformulation of the generalized regularized problem (1.8) with
an auxiliary variable. By exploiting the properties of proximal operators, we bring the associated
augmented Lagrangian into a continuously differentiable form by constraining it to a manifold
that corresponds to explicit minimization over the auxiliary variable. The new expression,
which we call the ‘proximal augmented Lagrangian’, facilitates a method of multipliers algorithm
that offers many advantages relative to existing methods, including guaranteed convergence for
nonconvex problems and the ability to regularize a linear function of the optimization variable.
We then study primal-descent dual-ascent Arrow-Hurwicz-Uzawa type gradient flow dynamics to
solve regularized problems in Chapter 3. We prove global convergence and establish conditions
for exponential convergence for continuous- and discrete-time updates. In Chapter 4, we take
advantage of generalizations of the gradient to apply second-order primal-dual updates to the
proximal augmented Lagrangian and prove global and local quadratic convergence. Finally, in
Chapter 5, we draw connections with other methods and provide additional insight.
In Part II, we identify classes of convex regularized optimal control problems. In Chapter 6,
we study the problem of designing symmetric modifications to symmetric linear systems. This
problem is convex in the underlying design variable and is thus appealing for the purpose of
structured control. We show that even when the system and controller are not symmetric, their
symmetric components can be used to perform structured design with stability and performance
guarantees. In Chapter 7, we examine the problem of designing structured diagonal modifica-
tions to positive systems. We prove that the H2 and H∞ optimal control problems are convex,
apply our results combination drug therapy design for HIV and leader selection in consensus
networks, and show that a constant controller is optimal for an induced-power performance
index. In Chapter 8, we examine the optimal actuator and sensor selection problems. Using
a change of variables, this problem can be cast as a semidefinite program. Although convex,
23
the computational complexity required to solve problems of this class scales poorly with the
problem dimension. We develop customized algorithms to solve this problem efficiently.
Finally, we provide concluding remarks and comment on future research in Chapter 9.
Notation
The set of real numbers is denoted by R and the set of nonnegative (positive) reals is denoted
by R+ (R++). The integers are represented by Z. The set of complex numbers are denoted by
C and j =√−1 is the imaginary unit. The operator <(·) (=(·)) extracts the real (imaginary)
component of a complex argument.
Given a matrix A, AT denotes its transpose. The vector λ(A) (σ(A)) indicates the eigenvalues
(singular values) of A, λi(A) denotes the ith largest eigenvalue of A and σ(A) denotes the princi-
pal (largest) singular value of A. We use trace(A) to denote its trace, and ‖A‖2F := trace(ATA
)to denote its Frobenius norm squared. The vector inner product is given by 〈x, y〉 := xT y
and the matricial inner product is given by 〈X, Y 〉 := trace(XTY ). The Kronecker product is
denoted by ⊗ and the Hadamard (entrywise) product is denoted by .We write A ≥ 0 (A > 0) if A is entrywise nonnegative (positive) and A 0 (A 0) to
denote that A is symmetric and positive semidefinite (definite).
We use the diag(·) operator to denote either the diagonal entries of a matrix or a diagonal
matrix with elements of the vector · on its diagonal, depending on its argument. The symbol E
denotes the expectation operator.
Given a set C we define the indicator function
IC(x) :=
0 if x ∈ C+∞ otherwise.
We define the sparsity pattern sp (u) of a vector u to be the set of indices for which ui is nonzero.
Definition 1. The adjoint of a linear operator F : Rm → Rn×n is the linear operator F †:
Rn×n → Rm which satisfies
〈X, F (x)〉 =⟨F †(X), x
⟩.
Any additional notation specific to particular chapters will be introduced as required.
Part I
Proximal augmented Lagrangian
algorithms
24
Chapter 2
Method of multipliers
We study a class of composite optimization problems in which the objective function is the sum
of a differentiable but possibly nonconvex component and a convex nondifferentiable component.
Problems of this form are encountered in diverse fields including compressive sensing [93], ma-
chine learning [94], statistics [95], image processing [96], and control [64]. In feedback synthesis,
they typically arise when a traditional performance metric (such as the H2 or H∞ norm) is
augmented with a regularization function to promote certain structural properties in the opti-
mal controller. For example, the `1 norm and the nuclear norm are commonly used nonsmooth
convex regularizers that encourage sparse and low-rank optimal solutions, respectively.
In this chapter, we derive the proximal augmented Lagrangian, which allows us to apply
the widely used method of multipliers (MM) to nondifferentiable composite optimization prob-
lems. We then illustrate the utility of this approach by applying it to edge addition in directed
consensus networks and sparse feedback synthesis.
2.1 Background
The lack of a differentiability in objective function (1.8) due to the regularization function
precludes the use of standard descent methods for smooth optimization. Proximal gradient
methods [130] and their accelerated variants [131] generalize gradient descent, but typically re-
quire the nonsmooth term to be separable over the optimization variable. Furthermore, standard
acceleration techniques are not well-suited for problems with constraint sets that do not admit
an easy projection (e.g., closed-loop stability).
An alternative approach is to split the smooth and nonsmooth components in the objective
25
26
function over separate variables which are coupled via an equality constraint. Such a reformu-
lation facilitates the use of the alternating direction method of multipliers (ADMM) [132]. This
augmented-Lagrangian-based method splits the optimization problem into subproblems which
are either smooth or easy to solve. It also allows for a broader class of regularizers than proximal
gradient and it is convenient for distributed implementation. However, there are limited conver-
gence guarantees for nonconvex problems and parameter tuning greatly affects its convergence
rate.
The method of multipliers (MM) is the most widely used algorithm for solving constrained
nonlinear programing problems [133–135]. In contrast to ADMM, it is guaranteed to converge for
noncconvex problems and there are systematic ways to adjust algorithmic parameters. However,
MM is not a splitting method and it requires joint minimization of the augmented Lagrangian
with respect to all primal optimization variables. This subproblem is typically nonsmooth and
as difficult to solve as the original optimization problem.
To treat this problem, we transform the augmented Lagrangian into a continuously differen-
tiable form by exploiting the structure of proximal operators associated with nonsmooth regular-
izers. This new form is obtained by constraining the augmented Lagrangian to a manifold that
corresponds to the explicit minimization over the variable in the nonsmooth term. The resulting
expression, that we call the proximal augmented Lagrangian, is given in terms of the Moreau
envelope of the nonsmooth regularizer and it is continuously differentiable. This allows us to
take advantage of standard optimization tools, including gradient descent and quasi-Newton
methods, and enjoy the convergence guarantees of the standard method of multipliers.
2.2 Problem formulation and background
2.2.1 Composite optimization problem
We consider a composite optimization problem,
minimizex
f(x) + g (T (x)) (2.1)
where the optimization variable x belongs to a finite-dimensional Hilbert space (e.g., Rm or
Rm×n) equipped with an inner product 〈·, ·〉 and associated norm ‖·‖. The function f is con-
tinuously differentiable but possibly nonconvex, the function g is convex but potentially nondif-
ferentiable, and T is a bounded linear operator. We further assume that g is proper and lower
semicontinuous, that (2.1) is feasible, and that its minimum is finite.
27
Regularization of T (x) instead of x is important in the situations where the desired struc-
ture has a simple characterization in the co-domain of T . Such problems arise in spatially-
invariant systems, where it is convenient to perform standard control design in the spatial
frequency domain [9] but necessary to promote structure in the physical space, and in consen-
sus/synchronization networks, where the objective function is expressed in terms of the deviation
of node values from the network average but it is desired to impose structure on the network
edge weights [106].
2.2.2 Background on proximal operators
Problem (2.1) is difficult to solve directly because f is, in general, a nonconvex function of x and
g is typically not differentiable. Since the existing approaches and our method utilize proximal
operators, we first provide a brief overview; for additional information, see [130].
The proximal operator of the function g is given by,
proxµg(v) := argminx
g(x) + 12µ ‖x − v‖2 (2.2a)
and the associated optimal value specifies its Moreau envelope,
Mµg(v) := infxg(x) + 1
2µ ‖x − v‖2 (2.2b)
where µ > 0. The Moreau envelope is a continuously differentiable function, even when g is not,
and its gradient is given by,
∇Mµg(v) = 1µ
(v − proxµg(v)
). (2.2c)
As a concrete example, when g is the `1 norm with unit weight w = 1, the proximal operator is
determined by soft-thresholding,
proxµg1(vi) = Sµ(vi) := sign(vi) max(|vi| − µ, 0) (2.3a)
the associated Moreau envelope is given by the Huber function,
Mµg1(vi) =
v2i /(2µ) |vi| ≤ µ
|vi| − µ/2 |vi| ≥ µ(2.3b)
28
(a) g1(v) (b) proxµg1 (v) (c) Mµg1 (v) (d) ∇Mµg1 (v)
Figure 2.1: The regularization function g1(v) = |v| for a scalar argument v with associatedproximal operator, Moreau envelope, and Moreau envelope gradient when µ = 1.
and the gradient of this Moreau envelope is the saturation function,
∇Mµg1(vi) = sign(vi) min(|vi|/µ, 1). (2.3c)
Figure 2.1 plots these functions for a scalar argument with µ = 1.
2.2.3 Existing algorithms
Proximal gradient
The proximal gradient method generalizes standard gradient descent to certain classes of nons-
mooth optimization problems. This method can be used to solve (2.1) when g(T ) has an easily
computable proximal operator. The standard gradient descent update to find a minimizer x? of
a differentiable function h(x) is given by
xl+1 = xl − αl∇h(xl)
where αl is a step size. When T = I, the proximal gradient method for problem (2.1) with
step-size αl is given by,
xl+1 = proxαlg(xl − αl∇f(xl)).
Clearly, this method is most effective when the proximal operator of g is easy to evaluate.
When g = 0, the proximal gradient method simplifies to standard gradient descent, and when g
is indicator function of a convex set, it simplifies to projected gradient descent.
In particular, the proximal gradient update for the `1-regularized least-squares problem
(LASSO),
minimizex
12 ‖Ax − b‖2 + γ ‖x‖1 (2.4)
29
where γ is a positive regularization parameter, is given by the Iterative Soft-Thresholding Al-
gorithm (ISTA),
xl+1 = Sγαl(xl − αlAT (Axl − b)).
When g(x) = 0, the proximal gradient method simplifies to standard gradient descent, and
when g(x) is the indicator function IC(x) of the convex set C, it simplifies to projected gradient
descent.
Except in special cases, e.g, when T = I or is a diagonal operator, efficient computation of
proxg(T ) does not necessarily follow from an efficiently computable proxg. This makes the use
of proximal gradient method challenging for many applications and its convergence can be slow.
Acceleration techniques [131, 136] improve the convergence rate, but they do not perform well
in the face of constraints such as closed-loop stability.
Augmented Lagrangian methods
A common approach for dealing with a nondiagonal linear operator T in (2.1) is to introduce
an additional optimization variable z
minimize f(x) + g(z)
subject to T (x) − z = 0.(2.5)
The augmented Lagrangian is obtained by adding a quadratic penalty on the violation of the
linear constraint to the regular Lagrangian associated with (2.5),
Lµ(x, z; y) = f(x) + g(z) + 〈y, T (x) − z〉 + 12µ ‖T (x) − z‖2
where y is the Lagrange multiplier, µ is a positive parameter, and Lµ is obtained by augmenting
the regular Lagrangian with a quadratic penalty on the violation of the linear constraint in (2.5).
ADMM solves (2.5) via an iteration which involves minimization of Lµ separately over x and
z and a gradient ascent update (with step size 1/µ) of y [132],
xk+1 = argminx
Lµ(x, zk; yk) (2.6a)
zk+1 = argminz
Lµ(xk+1, z; yk) (2.6b)
yk+1 = yk + 1µ (T (xk+1) − zk+1). (2.6c)
ADMM is appealing because, even when T is nondiagonal, the z-minimization step amounts
30
to evaluating proxµg, and the x-minimization step amounts to solving a smooth (but possibly
nonconvex) optimization problem. Although it was recently shown that ADMM is guaranteed
to converge to a stationary point of (2.5) for some classes of nonconvex problems [137], its rate
of convergence is strongly influenced by the choice of µ.
The method of multipliers (MM) is the most widely used algorithm for solving constrained
nonlinear programing problems and it guarantees convergence to a local minimum. In contrast
to ADMM, each MM iteration requires joint minimization of the augmented Lagrangian with
respect to the primal variables x and z,
(xk+1, zk+1) = argminx, z
Lµ(x, z; yk) (2.7a)
yk+1 = yk + 1µ (T (xk+1) − zk+1). (2.7b)
It is possible to refine MM to allow for inexact solutions to the (x, z)-minimization subproblem
and adaptive updates of the penalty parameter µ. However, until now, MM has not been a
feasible choice for solving (2.5) because the nonconvex and nondifferentiable (x, z)-minimization
subproblem is as difficult as the original problem (2.1).
In what follows, we exploit the structure of the linear constraint in (2.5) and utilize the
optimality conditions with respect to z to eliminate it from the augmented Lagrangian. This
brings the (x, z)-minimization problem in (2.7) into a form that is continuously differentiable
with respect to x.
2.3 MM with the proximal augmented Lagrangian
In this section, we derive the proximal augmented Lagrangian, a continuously differentiable
function resulting from explicit minimization of the augmented Lagrangian over the auxiliary
variable z. This brings the (x, z)-minimization problem in (2.7) into a form that is continuously
differentiable with respect to x and facilitates the use of a wide suite of standard optimization
tools, including the method of multipliers and the Arrow-Hurwicz-Uzawa method.
2.3.1 Derivation of the proximal augmented Lagrangian
The first main result of the paper is provided in Theorem 2.3.1. We use the proximal operator of
the function g to eliminate the auxiliary optimization variable z from the augmented Lagrangian
and transform (2.7a) into a tractable continuously differentiable problem.
31
Theorem 2.3.1. For a proper, lower semicontinuous, and convex function function g, mini-
mization of the augmented Lagrangian Lµ(x, z; y) associated with problem (2.5) over (x, z) is
equivalent to minimization of the proximal augmented Lagrangian
Lµ(x; y) := f(x) + Mµg(T (x) + µy) − µ2 ‖y‖
2 (2.8)
over x. Moreover, if f is continuously differentiable Lµ(x; y) is continuously differentiable over
x and y, and if f has a Lipschitz continuous gradient ∇Lµ(x; y) is Lipschitz continuous.
Proof. Through the completion of squares, the augmented Lagrangian Lµ associated with (2.5)
can be equivalently written as
Lµ(x, z; y) = f(x) + g(z) + 12µ ‖z − (T (x) + µy)‖2 − µ
2 ‖y‖2.
Minimization with respect to z yields an explicit expression,
z?µ(x, y) = proxµg(T (x) + µy) (2.9)
and substitution of z?µ into the augmented Lagrangian provides (2.8), i.e., Lµ(x; y) =
Lµ(x, z?µ(x, y); y). Continuous differentiability of Lµ(x; y) follows from continuous differentiabil-
ity of Mµg and Lipschitz continuity of ∇Lµ(x; y) follows from Lipschitz continuity of proxµg
and boundedness of the linear operator T ; see (2.2c).
Expression (2.8), that we refer to as the proximal augmented Lagrangian, characterizes
Lµ(x, z; y) on the manifold corresponding to explicit minimization over the auxilary variable
z. Theorem 2.3.1 allows joint minimization of the augmented Lagrangian with respect to x and
z, which is in general a nondifferentiable problem, to be achieved by minimizing differentiable
function (2.8) over x. It thus facilitates the use of the method of multipliers as follows and the
Arrow-Hurwicz-Uzawa gradient flow dynamics in Chapter 3.
2.3.2 MM based on the proximal augmented Lagrangian
Theorem 2.3.1 allows us to solve nondifferentiable subproblem (2.7a) by minimizing the con-
tinuously differentiable proximal augmented Lagrangian Lµ(x; yk) over x. Relative to ADMM,
our customized MM algorithm guarantees convergence to a local minimum and offers system-
atic update rules for the parameter µ. Relative to proximal gradient, we can solve (2.1) with a
general bounded linear operator T and can incorporate second order information about f .
32
Using reformulated expression (2.8) for the augmented Lagrangian, MM minimizes Lµ(x; yk)
over the primal variable x and updates the dual variable y using gradient ascent with step-size
1/µ,
xk+1 = argminx
Lµ(x; yk) (2.10a)
yk+1 = yk + 1µ ∇yLµ(xk+1; yk) (2.10b)
where
∇yLµ(xk+1; yk) = T (xk+1) − proxµg(T (xk+1) + µyk)
denotes the primal residual, i.e., the difference between T (xk+1) and z?µ(xk+1, yk).
In contrast to ADMM, our approach does not attempt to avoid the lack of differentiability of
g by fixing z to minimize over x. By constraining Lµ(x, z; y) to the manifold resulting from ex-
plicit minimization over z, we guarantee continuous differentiability of the proximal augmented
Lagrangian Lµ(x; y). MM is a gradient ascent algorithm on the Lagrangian dual of problem (2.5)
strengthened by a quadratic penalty on primal infeasibility. Since a closed-form expression of
the dual is typically unavailable, MM uses the (x, z)-minimization subproblem (2.7a) to evaluate
it computationally and then take a gradient ascent step (2.7b) in y. ADMM avoids this issue
by solving simpler, separate subproblems over x and z. However, the x and z minimization
steps (2.6a) and (2.6b) do not solve (2.7a) and thus unlike the y-update (2.7b) in MM, the
y-update (2.6c) in ADMM is not a gradient ascent step on the strengthened dual. MM thus has
stronger convergence results [132,133] and may lead to fewer y-update steps.
Remark 2.3.2. The proximal augmented Lagrangian enables MM because the x-minimization
subproblem in MM 2.10a is not more difficult than in ADMM (2.6a). For LASSO problem (2.4),
the z-update in ADMM (2.6b) is given by soft-thresholding, zk+1 = Sγµ(xk+1 + µyk), and
the x-update (2.6a) requires minimization of the quadratic function [132]. In contrast, the x-
update (2.10a) in MM requires minimization of (1/2) ‖Ax − b‖2 + Mµkg(x + µkyk), where
Mµkg(v) is the Moreau envelope associated with the `1 norm; i.e., the Huber function (2.3b).
Although in this case the solution to (2.6a) can be characterized explicitly by a matrix inversion,
this is not true in general. The computational cost associated with solving either (2.6a) or (2.10a)
using first-order methods scales at the same rate.
33
Algorithm
The procedure outlined in [135, Algorithm 17.4] allows minimization subproblem (2.10a) to be
inexact, provides a method for adaptively adjusting µk, and describes a more refined update
of the Lagrange multiplier y. We incorporate these refinements into our proximal augmented
Lagrangian algorithm for solving (2.5). In Algorithm 1, η and ω are convergence tolerances,
and µmin is a minimum value of the parameter µ. Because of the equivalence established in
Theorem 2.3.1, convergence to a local minimum follows from the convergence results for the
standard method of multipliers [135].
Algorithm 1 Proximal augmented Lagrangian algorithm for (2.5).
input: Initial point x0 and Lagrange multiplier y0
initialize: µ0 = 10−1, µmin = 10−5, ω0 = µ0, and η0 = µ0.10
for k = 0, 1, 2, . . .
Solve 2.10a such that‖∇xLµ(xk+1, yk)‖ ≤ ωk
if ‖∇yLµk(xk+1; yk)‖ ≤ ηk
if ‖∇yLµk(xk+1; yk)‖ ≤ η and ‖∇xLµ(xk+1, yk)‖ ≤ ωstop with approximate solution xk+1
else
yk+1 = yk + 1µk∇yLµk(xk+1; yk), µk+1 = µk
ηk+1 = ηk µ0.9k+1, ωk+1 = ωk µk+1
endif
else
yk+1 = yk, µk+1 = max(µk/5, µmin)ηk+1 = µ0.1
k+1, ωk+1 = µk+1
endifendfor
2.3.3 Minimization of the proximal augmented Lagrangian over x
MM based on the proximal augmented Lagrangian alternates between minimization of Lµ(x; y)
with respect to x (for fixed values of µ and y) and the update of µ and y. Since Lµ(x; y) is
once continuously differentiable, many techniques can be used to find a solution to subprob-
lem (2.10a). We next summarize three of them.
34
Gradient descent
The gradient of the proximal augmented Lagrangian with respect to x is given by,
∇xLµ(x; y) = ∇f(x) + 1µT
†(T (x) + µy − proxµg(T (x) + µy)).
Gradient descent with backtracking rules, such as the Armijo rule, can be used to find a solution
to (2.10a).
Proximal gradient
Gradient descent does not exploit the structure of the Moreau envelope of the function g; in some
cases, it may be advantageous to use proximal operator associated with the Moreau envelope to
solve 2.10a. In particular, when T = I, this requires evaluation of the proximal operator of the
Moreau envelope,
proxαMµg(v) := argmin
xMµg(x) + 1
2α ‖x − v‖2.
Since Mµg is continuously differentiable, the optimality conditions are given by
0 = ∇Mµg(x) + 1α (x − v)
= 1µ (x − proxµg(x)) + 1
α (x − v).
Thus, proxαMµg(v) = x∗ where x∗ satisfies,
x∗ = 1µ+α
(αproxµg(x
∗) + µ v).
For separable g, the proximal operators associated with the function g and its Moreau envelope
are easily computable. For example, the proximal operator of the `1 norm is determined by
soft-thresholding (2.3a) and the ith element of proxαMµg(v) is given by
proxαMµg(vi) =
µ
µ+α vi, |vi| ≤ µ + α
vi − α sign(vi), |vi| ≥ µ + α.
In [118], proximal gradient methods were used for subproblem (2.10a) to solve the sparse
feedback synthesis problem described in Section 2.5. For associated problem (2.1), computa-
tional savings were shown relative to standard proximal gradient and ADMM.
35
Quasi-Newton method
When the proximal operator associated with g is continuously differentiable, x ∈ Rm, and
T (x) = Tx, the Hessian of the proximal augmented Lagrangian is given by,
∇xxLµ(x; y) = ∇2f(x) + 1µ T
T (I − Bp)T
where (1/µ)(I−Bp) is the Hessian of the Moreau envelope and Bp is the Jacobian of proxµg(x+
µy). Although proxµg is not differentiable in general, it is Lipschitz continuous and therefore
differentiable almost everywhere [138]. When proxµg is not differentiable, the Dini derivative
or the Clarke subgradient [138, 139] can be used to obtain a generalized Jacobian Bp. For the
soft-thresholding operator (2.3a), Bp = diagβ, where the ith component of the vector β is
βi ∈ 0, |vi| < µ; 1, |vi| > µ; 0, 1, |vi| = µ . To improve computational efficiency, we employ
the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [135, Algorithm 7.4]
to estimate the Hessian ∇xxLµ(x; yk).
L-BFGS estimates the Hessian inverse, Hl, to compute the search direction via r = −Hl(∇Lµ).
Instead of explictly forming Hl, it computes r via a low-rank operation q = −Sl(∇Lµ) followed
by an easily-computable full-rank operation and another low-rank operation, r = S†l (H
0l (q)).
The operations Sl and S†l encode updates to an initial Hessian-inverse approximation H0
l based
on first-order information at previous iterates. The initial estimate of the Hessian inverse can be
adaptively updated and is commonly taken to be the scaling operation, H0l (v) = κv where [135]
κ =
⟨xl − xl−1, ∇Lµ(xl, y) − ∇Lµ(xl−1, y)
⟩‖∇Lµ(xl, y) − ∇Lµ(xl−1, y)‖2
. (2.11)
Remark 2.3.3. For regularization functions that do not admit simply computable proximal
operators, proxµg has to be evaluated numerically by solving (2.2a). If this is expensive, the
primal-descent dual-ascent algorithm of Chapter 3 offers an appealing alternative because it re-
quires one evaluation of proxµg per iteration. When the regularization function g is nonconvex,
the proximal operator may not be single-valued and the Moreau envelope may not be continu-
ously differentiable. In spite of this, the convergence of proximal algorithms has been established
for nonconvex, proper, lower semicontinuous regularizers that obey the Kurdyka- Lojasiewicz in-
equality [140].
36
Algorithm 2 Limited memory BFGS (L-BFGS)
input: Gradient ∇Lµ(xl; y), estimate of Hessian inverse H0l , and history sl = xl+1 − xl,
gradients tl = ∇Lµ(xl+1; y)−∇Lµ(xl; y) and ρl = 1/⟨tl, sl
⟩for last nl iterations
output: Search direction r = Hl(∇Lµ(xl; y))
q = ∇Lµ(xl; y)
for i = l − 1, . . . , l − nlαi = ρi
⟨si, q
⟩q = q − αiti
endfor
r = H0l (q)
for i = l − nl, . . . , l − 1τ = ρi
⟨ti, r
⟩r = r + si(αi − τ)
endfor
2.4 Example: Edge addition in directed consensus net-
works
To illustrate the utility of MM with the proximal augmented Lagrangian, we consider the prob-
lem of edge addition described in Section 1.2.1. The x minimization subproblem is solved using
L-BFGS.
It is desired to minimize the H2 norm of the closed-loop system,
ψ = −(Lp + Lx)ψ + d, ζ =
Q1/2
−R1/2Lx
ψwhere Lp and Lx are the Laplacians of the plant and controller networks, respectively, R 0,
and Q := I − (1/N)11T .
To ensure convergence of ψ to the average of the initial node values, we require that the
closed-loop graph Laplacian, L = Lp + Lx, is balanced. This condition amounts to the linear
constraint, 1TL = 0. We express the directed graph Laplacian of the controller network as,
Lx =∑i 6= j
Lij wij =:M∑l= 1
Ll wl
where wij ≥ 0 is the added edge weight that connects node j to node i, Lij := eieTi − eie
Tj , ei is
the ith basis vector in Rn, and the integer l indexes the edges such that wl = wij and Ll = Lij .
37
For simplicity, we assume that the plant network Lp is balanced and connected. Thus, enforcing
that L is balanced amounts to enforcing
1TLx = 1T
(∑l
Ll wl
)= 0.
This imposes a linear constraint on the vector of edge weights w ∈ RM , Ew = 0, where E is
the incidence matrix of the directed controller network [7]. When any edge may be added to a
network with N nodes, the number of potential added edges is M = N2 −N .
Any vector of edge weights w that corresponds to a balanced graph can be written as w = Tx
where the columns of T span the nullspace of the matrix E. The matrix T can be obtained
via the singular value decomposition of E and it provides a basis for the space of balanced
graphs, i.e., the cycle space [7] of the controller network. Each feasible controller network can
be expressed in terms of this basis,
Lx =∑l
Ll [Tx]l =∑l
Ll
[∑k
(T ek)xk
]l
=:∑k
Lk xk (2.12a)
where the matrices Lk are given by,
Lk :=∑l
Ll [T ek]l. (2.12b)
Since the controller and plant networks each have an eigenvalue at zero corresponding to 1,
we introduce the change of variables, φ
ψ
=
V T
1T
ψ ⇔ ψ = V φ + ψ 1 (2.12c)
where the columns of V form an orthonormal basis of the orthogonal complement to the subspace
spanned by the vector of all ones. The average mode ψ is marginally stable and decoupled from
the dynamics that govern the evolution of φ. Since Q = I− (1/N)11T , ψ is not detectable from
ζ and the dynamics that govern the evolution of deviations from average are,
φ = A φ + B d, ζ =
Q1/2 V
−R1/2 Lx V
φ (2.12d)
38
where A := −V T (Lp + Lx)V and B := V T . The square of the H2 norm of this system is
determined by,
f2(x) =⟨V T (Q + LTxRLx)V, Pc
⟩(2.12e)
where Pc is the controllability gramian of system (2.12d),
0 = A Pc + PcAT + BBT . (2.12f)
To balance the closed-loop H2 performance with the number of added edges, we introduce a
regularized optimization problem
xγ = argmin f2(x) + γ 1TTx + I+(Tx). (2.13)
Here, the regularization parameter γ > 0 specifies the emphasis on sparsity relative to the closed-
loop performance f , and I+ is the indicator function associated with the nonnegative orthant
RM+ . When the desired level of sparsity for the vector of the added edge weights wγ = Txγ has
been attained, optimal weights for the identified set of edges are obtained by solving,
minimizex
f2(x) + IWγ (Tx) + I+(Tx) (2.14)
where Wγ is the set of vectors with the same sparsity pattern as wγ and IWγis the indicator
function associated with this set.
2.4.1 Implementation
We next derive the gradient of the square of the H2 norm of system (2.12d) and provide im-
plementation details for MM with the proximal augmented Lagrangian with the regularization
functions in (2.13) and (2.14).
Lemma 2.4.1. Let a graph Laplacian of a directed plant network Lp be balanced and connected
and let A, B, Lx, and V be as defined in (2.12a)–(2.12d). The gradient of f2(x) defined in (2.12e)
is given by,
∇f2(x) = 2 vec(⟨
(RLxV − V Po)PcVT , Lk
⟩)
39
where Pc and Po are the controllability and observability gramians of system (2.12d),
A Pc + PcAT + BBT = 0
ATPo + PoA + V T (Q + LTxRLx)V = 0.
Proof. The Lagrangian associated with the minimization of the function f2 given by (2.12e)
subject to constraint (2.12f) is
⟨V T (Q + LTxRLx)V, Pc
⟩+⟨A Pc + PcA
T + BBT , Po
⟩where Po is the Lagrange multiplier. Variations with respect to Po and Pc yield the Lyapunov
equations for the controllability and observability gramians Pc and Po, respectively. Using the
definition of A, the Lagrangian can be rewritten as,
⟨(RLxV − 2V Po)PcV
T , Lx⟩
+⟨BBT − V TLpV Pc − PcV
TLTp V, Po
⟩+⟨V TQV, Pc
⟩.
Using definition (2.12a) of Lx and taking variation with respect to each xk yields the expression
for ∇f2(x).
The proximal operator associated with the regularization function gs(z) := γ1T z + I+(z)
in (2.13) is given by
proxµgs(vi) = max (0, vi − γµ).
The associated Moreau envelope for each element of v is
Mµgs(vi) =
1
2µv2i vi ≤ γµ
γ (vi − γ µ2 ) vi > γµ
and its gradient is
∇Mµgs(vi) = max ( 1µvi, γ).
The proximal operator associated with the regularization function in (2.14), gp(z) := IWγ (z) +
I+(z), is a projection onto the intersection of the set Wγ and the nonnegative orthant,
proxµgp(v) = PE(v)
40
where E :=Wγ ∩ RM+ . The associated Moreau envelope is
Mµgp(v) = 12µ ‖v − PE(v)‖2
and its gradient is determined by
∇Mµgp(v) = 1µ (v − PE(v)).
We solve (2.13) and (2.14) using Algorithm 1, where L-BFGS (i.e., Algorithm 2) is employed in
the x-minimization subproblem (2.10a).
2.4.2 Computational experiments
For the plant network shown in Fig. 2.2, Fig. 2.3 illustrates the tradeoff between the number
of added edges and the closed-loop H2 norm. The added edges are identified by computing the
γ-parameterized homotopy path for problem (2.13), and the optimal edge weights are obtained
by solving (2.14). The red dashed lines in Fig. 2.2 show the optimal set of 2 added edges. These
are obtained for γ = 3.5 and they yield 23.91% performance loss relative to the setup in which
all edges in the controller graph are used. We note that the same set of edges is obtained by
conducting an exhaustive search. This suggests that the proposed convex regularizers may offer
a good proxy for solving difficult combinatorial optimization problems.
We also consider simple directed cycle graphs with N = 5 to 50 nodes and M = N2 − Npotential added edges. We solve (2.13) for γ = 0.01, 0.1, 0.2, and R = I using MM with the
proximal augmented Lagrangian (PAL), ADMM, and ADMM with an adaptive heuristic for
updating µ [132] (ADMM µ). The x-minimization subproblems in each algorithm are solved
using L-BFGS. Since gs(Tx) and gp(Tx) are not separable in x, proximal gradient cannot be
used here.
Figure 2.4a shows the time required to solve problem (2.13) in terms of the total number
of potential added edges; Fig. 2.4b demonstrates that PAL requires fewer outer iterations; and
Fig. 2.4c illustrates that the average computation time per outer iteration is roughly equivalent
for all three methods. Even with an adaptive heuristic update of µ [132], ADMM requires more
outer iterations which increases overall solve time relative to the proximal augmented Lagrangian
method. Thus, compared to ADMM, PAL provides computational advantage by reducing the
number of outer iterations (indexed by k in Algorithm 1 and in (2.6)).
41
ψ1
ψ2ψ4
ψ3
ψ5 ψ6 ψ7
Figure 2.2: A balanced plant graph with 7 nodes and 10 directed edges (solid black lines). Asparse set of 2 added edges (dashed red lines) is identified by solving (2.13) with γ = 3.5 andR = I.
2.5 Example: Sparsity-promoting optimal control
Here, we apply the method of multipliers algorithm to apply a sparsity-promoting penalty on the
optimal state-feedback design problem described in Section 1.2.1. As described in Section 1.3,
sparsity of the feedback gain associated with a distributed system corresponds to a controller
with a distributed architecture. We use the proximal gradient method with BB step-size selection
to solve the Lµ minimization subproblem.
The design of state-feedback controllers which balance performance with sparsity has been
the subject of considerable attention in recent years [62, 64, 98–102, 104, 105, 141]. We consider
the sparsity-promoting optimal control problem applied to the linear system,
ψ = (A − B2X)ψ + B1 d, ζ =
Q1/2
−R1/2X
ψ (2.15)
where d is an exogenous disturbance, ζ is the performance output, and Q = QT 0 and
R = RT 0 are the state and control performance weights. System (2.15) describes closed-loop
dynamics under the state-feedback control law,
u = −Xψ,
We make the standard assumptions that (A,B2) is stabilizable and (A,Q1/2) is detectable.
The state-feedback gain X which minimizes the closed-loop H2 norm is, in general, a dense
42
per
form
ance
loss
(per
cent)
number of added edges
Figure 2.3: Tradeoff between performance and sparsity resulting from the solution to (2.13)-(2.14) for the network shown in Fig. 2.2. Performance loss is measured relative to the optimalcentralized controller (i.e., the setup in which all edges in the controller network are used).
M
(a) Total solve time (s)
M
(b) Number of outer iterations
M
(c) Solve time (s) per outer iteration
Figure 2.4: (a) Total time; (b) number of outer iterations; and (c) average time per outeriteration required to solve (2.13) with γ = 0.01, 0.1, 0.2 for a cycle graph with N = 5 to 50 nodesand m = 20 to 2450 potential added edges using PAL (solid blue –×–), ADMM (dashed red- -- -), and ADMM with the adaptive µ-update heuristic (dotted yellow · · · · · · ). PALrequires fewer outer iterations and thus a smaller total solve time.
43
matrix. In [62, 64], the authors studied the problem of designing feedback gain matrices which
balance H2 performance with the sparsity of X. This was achieved by considering a regularized
optimal control problem (2.1) where f = f2 is the closed-loop H2 norm, g(X) encodes some
structural constraint or penalty on X, and γ > 0 encodes the emphasis on this penalty relative
to the H2 performance. In particular we consider regularization by the `1 norm,
minimize f2(X) + γ‖X‖1
where f2 is the H2 norm (1.3a) of system (2.15),
f2(X) = trace(Pc(Q+XTRX))
and its gradient with respect to X is given by
∇f2(X) = 2(RX − BT2 Po)Pc
where Po and Pc are observability and controllability gramians of the closed-loop system,
ATclPo + PoAcl = − (Q + XTRX)
AclPc + PcATcl = −B1B
T1
and Acl := A − B2X. Furthermore, ∇f2(X) is a Lipschitz continuous function on the set of
stabilizing feedback gains [142].
For γ = 0, the problem simplifies to the H2 state-feedback problem whose solution is given
by the standard linear quadratic regulator. A typical approach is to solve (2.1) for a series of
different γ and to generate a set of feedback gains with different levels of sparsity. From this set,
a sparse feedback gain can be selected or γ can be refined to yield sparser or denser controllers.
The proximal augmented Lagrangian associated with this regularized problem is
f2(X) + γhκ(X + µY ) − µ2 ‖Y ‖
2F
where κ := µγ and the Huber function
hκ(V ) =∑i,j
12 V
2ij , |Vij | ≤ κ
κ (|Vij | − 12 κ), |Vij | ≥ κ
44
is the Moreau envelope of the `1 norm.
2.5.1 Minimization of the proximal augmented Lagrangian
The main computational burden in the method of multipliers lies in finding a solution to the
optimization problem (2.10a),
minimizeX
f2(X) + γhκ(X + µY k).
Although the differentiability of Lµ implies that gradient descent may be employed to update
X, we utilize the proximal gradient method to exploit the structure of the Moreau envelope hκ.
We use the notation X l to denote the sequence of inner iterates that converge to a solution
of (2.10a).
Proximal gradient step for minimizing Lµ
Proximal gradient descent, described in Section 2.2.2, provides a generalization of standard
gradient descent which can be applied to nonsmooth optimization problems. Here, we apply
it to solve subproblm (2.10a) in MM with the proximal augmented Lagrangian. The proximal
gradient update is given by X l+1 = X l + X where X minimizes
12α‖X‖
2F +
⟨∇f2(X l), X
⟩+ γ hκ(X l + X + µY )
over X and α is a step size [130]. Note that this problem is separable over the elements of X.
By defining a := Xij , b := (∇f(X l))ij , and c := (X l + µY k)ij , optimization over each element
of X can be expressed as,
minimizea
12α a
2 + b a + γ hκ(a+ c).
Setting the gradient to zero yields, a + α(b + γ satκ(a + c)) = 0 and considering the separate
cases when satκ(a+ c) = κ, a+ c, and −κ yields the optimal a,
a? =
−α(b + γκ), αb− c ≥ κ(αγ + 1)
− α
1 + αγ(b + γc), |αb− c| ≤ κ(αγ + 1)
−α(b − γκ), αb− c ≤ −κ(αγ + 1).
45
Substituting values for a, b, and c yields the proximal gradient update.
Step-size selection
Since the objective function is not smooth, an Armijo backtracking rule cannot be used. Instead,
we backtrack from αl,0 by selecting the smallest nonnegative integer r such that αl = crαl,0 with
c ∈ (0, 1) such that X l+1 is stabilizing and satisfies a descent condition based on the quadratic
approximation
f2(X l+1) ≤ f2(X l) +⟨∇f2(X l), X l+1 − X l
⟩+ 1
2αl‖X l+1 − X l‖2F .
When ∇f2 is Lipschitz continuous and α < 1Lf
where Lf is a Lipschitz constant of ∇f2, this
condition is always satisfied. This backtracking rule adaptively estimates the Lipschitz constant
of ∇f(X) to ensure convergence [131].
To improve the speed of the proximal gradient algorithm, we initialize the step-size using
the Barzilai-Borwein (BB) method [143],
αl,0 =‖X l − X l−1‖2F
〈X l−1 − X l, ∇f2(X l−1) − ∇f(X l)〉.
Figure 2.5 illustrates the utility of the proximal gradient method over standard gradient descent
and the advantage of BB step-size initialization.
2.5.2 Proximal gradient applied directly to (2.1)
For this problem, it is also possible to solve (2.1) directly using proximal gradient descent. This
algorithm is guaranteed to converge to a local optimal point [144], but we find that in practice
it takes longer to find a solution than the method of multipliers. The proximal operator for the
weighted `1-norm is the elementwise softhresholding operator defined in (2.3a) and the update
for solving (2.1) directly is given by
Xk+1 = Sβ(Xk − αk∇f2(Xk)
)where β := γαk and αk is the step-size. The backtracking and BB step-size initialization rules
described in Section 2.5.1 are also used here.
46
Lµ(xl ;yk)−Lµ(x?;yk)
(Lµ(x
0;yk)−Lµ(x?;yk))
Iteration
Figure 2.5: Comparison of proximal gradient with BB step-size selection (solid blue —),proximal gradient without BB step-size selection (dotted black · · · ) and gradient descent withBB step-size selection (dashed red - - -) for the x-minimization step (2.10a) for an unstablenetwork with 20 subsystems, γ = 0.0844, and µ = 0.10. The y-axis shows the distance from theoptimal objective value relative to initial distance.
47
2.5.3 Numerical experiments
We next illustrate the utility of our approach using two examples. We compare our method
of multipliers algorithm with the ADMM algorithm from [64] and a direct application of the
proximal gradient method.
Mass-spring system
Consider a series of N masses connected by linear springs. The dynamics of each mass are
described by
pi = − (pi − pi+1) − (pi − pi−1) + di + ui
where pi is the position of the ith mass. When the first and last masses are affixed to rigid
bodies, the aggregate dynamics are given by p
v
=
0 I
−T 0
p
v
+
0
I
d +
0
I
uwhere p, v, and d are the position, velocity and disturbance vectors, and T is a Toeplitz matrix
with 2 on the main diagonal and −1 on the first super- and sub-diagonals.
In Figure 2.6, we compare the time required to compute a series of sparse feedback gains for
10 values of γ, linearly spaced between 0.001 and 1.0. Taking γ = 1.0 corresponds to roughly
6% nonzero elements in the feedback gain matrix.
Among the three algorithms, ADMM is the fastest; however, the method of multipliers is
comparable and scales at the same rate. Direct proximal gradient was the slowest and exhibited
the worst scaling. Since the mass-spring system has benign dynamics, we next consider an
unstable network.
Unstable network
Let N nodes be uniformly randomly distributed in a box. Each node is an unstable second
order system coupled with nearby nodes via an exponentially decaying function of the Euclidean
distance δ(i, j) between them [60], ψ1i
ψ2i
=
1 1
1 2
ψ1i
ψ2i
+∑j 6= i
e−δ(i,j)
ψ1j
ψ2j
+
0
1
(di + ui)
48
Com
pu
tati
on
tim
e(s
)
Number of subsystems (N)
Figure 2.6: Computation time required to solve (2.1) for 10 evenly spaced values of γ from0.001 to 1.0 for a mass-spring example with N = 5, 10, 20, 30, 40, 50, 100 masses. Performanceof direct proximal gradient (dashed green −− −−), the method of multipliers (solid blue—×—) and ADMM (dash-dot red − · − ·) is displayed. All algorithms use BB step-sizeinitialization.
49
Com
pu
tati
on
tim
e(s
)
Number of subsystems (N)
Figure 2.7: Computation time required to solve (2.1) for 20 evenly spaced values of γ from0.001 to 0.05 for unstable network examples with N = 5, 10, 20, 30, 40, 50 nodes. Performanceof direct proximal gradient (dashed green −− −−), the method of multipliers (solid blue—×—) and ADMM (dash-dot red − · − ·) is displayed. All algorithms use BB step-sizeinitialization.
where Q and R are taken to be the identity. Note that simple truncation of the centralized
controller could result in a non-stabilizing feedback matrix [60]. We solve (2.5) for γ varying
from 0.001 to 0.05 in 20 linearly spaced increments. On average, γ = 0.05 corresponds to
approximately 25% nonzero entries in the feedback gain matrix.
Computation times for N varying from 5 to 50, are shown in Figure 2.7. Since networks
are randomly generated, we average the computation time for 5 networks of each size. For this
more complicated example, the method of multipliers algorithm is the fastest and ADMM is the
slowest.
Chapter 3
First-order primal-dual algorithm
We examine Arrow-Hurwicz-Uzawa gradient flow dynamics for the proximal augmented La-
grangian. Such dynamics can be used to identify saddle points of the Lagrangian [145] and
have enjoyed recent renewed interest in the context of networked optimization because, in many
cases, the gradient can be computed in a distributed manner [146]. Our approach yields a dy-
namical system with a continuous right-hand side for a broad class of nonsmooth optimization
problems. This is in contrast to existing techniques which employ subgradient methods [147] or
use discontinuous projected dynamics [148–150] for problems with inequality constraints. Fur-
thermore, since the proximal augmented Lagrangian is neither strictly concave nor linear in the
dual variable, we require additional developments relative to recent advances [151] to establish
convergence.
3.1 Arrow-Hurwicz-Uzawa gradient flow
In this section, we apply the continuous-time Arrow-Hurwicz-Uzawa gradient flow [145] to the
proximal augmented Lagrangian (2.8). Simultaneous update of the primal and dual variables in x
y
=
−∇xLµ(x; y)
∇yLµ(x; y)
(3.1)
should be compared and contrasted with the approach presented in Section 2.3 where we alter-
nate between minimization of Lµ(x; y) over x and gradient ascent over y. For convex, differen-
tiable f and convex g, we show that the gradient flow dynamics (3.1) globally converge to the
50
51
set of saddle points of the proximal augmented Lagrangian.
We first characterize the optimal primal-dual pairs of optimization problem (2.5) with the
Lagrangian,
f(x) + g(z) + 〈y, T (x) − z〉 .
The associated first-order optimality conditions are given by,
0 = ∇f(x?) + T †(y?) (3.2a)
0 ∈ ∂g(z?) − y? (3.2b)
0 = T (x?) − z? (3.2c)
where ∂g is the subgradient of g. Clearly, these are equivalent to the optimality condition
for (2.1), 0 ∈ ∇f(x?) + T †(∂g(T (x?))).
Remark 3.1.1. Since the proximal augmented Lagrangian (2.8) is neither linear nor strictly
concave in y, we cannot use [151] to show global convergence. For LASSO problem (2.4),
proxµg(x+µy) is differentiable at points where |xi+µyi| 6= γµ for all i. When it is differentiable,
Lµ(x; y) is locally quadratic in y and ∇yyLµ(x; y) = −µ diag(β) where the ith element βi = 1
if |xi + µyi| > γµ and 0 if |xi + µyi| < γµ. Since ∇yyLµ(x; y) 6= 0, the proximal augmented
Lagrangian (2.8) is not linear in y. Furthermore, since ∇yyLµ(x; y) is not negative definite,
Lµ(x; y) is not strictly concave in y.
To show convergence, we introduce a simple lemma about proximal operators which follows
almost directly from their definition. Even though we state the result for x ∈ Rm, g: RM → R
and T (x) = Tx where T ∈ RM×m is a given matrix, the proof for x in a Hilbert space and a
bounded linear operator T follows from similar arguments.
Lemma 3.1.2. Let g: RM → R be a proper, lower semicontinuous, convex function and let
proxµg: RM → RM be the corresponding proximal operator. Then, for any a, b ∈ RM , we can
write
proxµg(a) − proxµg(b) = D (a − b) (3.3)
where D is a symmetric matrix satisfying I D 0.
Proof. Let c := a− b, p := proxµg(a)−proxµg(b), and D := p pT /(pT c), p 6= 0; 0, otherwise.Then, by construction, D = DT 0 and (3.3) can be written as p = Dc. Since proxµg is firmly
nonexpansive [130], pT c ≥ ‖p‖2 or, equivalently, cTD(I − D)c ≥ 0 for every c ∈ RM . Positive
semi-definiteness of I −D thus follows from D 0 and commutativity of D and I −D.
52
Theorem 3.1.3. Let f be a continuously differentiable convex function, and let g be a proper,
lower semicontinuous, convex function. Then, for the primal-descent dual-ascent gradient flow
dynamics (3.1), x
y
=
− (∇f(x) + TT∇Mµg(Tx + µy))
µ (∇Mµg(Tx + µy) − y)
(3.4)
the set of optimal primal-dual pairs (x?, y?) of (2.5) is globally asymptotically stable, and x? is
an optimal solution of (2.1).
Proof. We introduce a change of coordinates x := x− x?, y := y − y? and a Lyapunov function
candidate,
V (x, y) = 12 〈x, x〉 + 1
2 〈y, y〉
where (x?, z?; y?) is an optimal solution to (2.5) that satisfies (3.2). The dynamics in the (x, y)-
coordinates are given by, ˙x
˙y
=
−(∇f(x) − ∇f(x?) + 1µ T
T m)
m − µ y
(3.5)
where
m := µ (∇Mµg(Tx + µy) − ∇Mµg(Tx? + µy?)). (3.6a)
Using expression (3.3) to construct D such that,
D(T x+ µy) = proxµg(Tx+ µy) − proxµg(Tx? + µy?) (3.6b)
and definition (2.2c) of ∇Mµg, we can write
m = (I − D)(T x + µy). (3.6c)
Thus, the derivative of V along the solutions of (3.5) is given by
V (x, y) = − 〈x, ∇f(x) − ∇f(x?)〉 − 1µ 〈T x, (I − D)T x〉 − µ 〈y, Dy〉 .
Since f is convex, 〈x, ∇f(x)−∇f(x?)〉 ≥ 0 and Lemma 3.1.2 imply V ≤ 0.
When ∇f(x) = ∇f(x?) and when matrices D and TT (I −D)T have nontrivial kernels, it is
possible that V = 0 for a nonzero y ∈ kerD and x such that T x ∈ ker(I −D). From (3.6c),
53
these conditions imply m = µy and (3.5) simplifies to,
˙x = −TT y, ˙y = 0.
Thus, the only invariant set for dynamics (3.5) is ∇f(x?) = ∇f(x), y ∈ kerTT ∩ kerD, and
T x ∈ kerI −D. Global asymptotic stability of these points follows from LaSalle’s invariance
principle. To complete the proof, we show that any x and y that lie in this invariant set also
satisfy the optimality conditions (3.2) of problem (2.5) with z = z?µ(x, y) and thus that x solves
problem (2.1).
Since x? and y? are optimal, (3.2a) implies ∇f(x?) + TT y? = 0. For any x and y that lie in
the invariant set of (3.5), we can replace ∇f(x?) with ∇f(x) and add TT y = 0 to obtain
∇f(x) + TT (y? + y) = ∇f(x) + TT y = 0
which implies that x and y also satisfy (3.2a). Furthermore, (I −D)T x = 0 and Dy = 0 can be
combined to yield T x−D(T x+ µy) = 0, which, by (3.6b), leads to
T x − (proxµg(Tx + µy) − proxµg(Tx? + µy?)) = 0.
By optimality condition (3.2b), Tx? = z?, and by definition (2.9), z? = z?µ(x?, y?) = proxµg(Tx?+
µy?). It thereby follows that Tx− proxµg(Tx+ µy) = 0 and thus
Tx = proxµg(Tx + µy) = z?µ(x, y) = z
which implies that x and z satisfy (3.2c).
Finally, the optimality condition of the minimization problem (2.2a) that defines proxµg(v)
is ∂g(z) + 1µ (z − v) 3 0. Setting v = Tx + µy from the expression (2.9) that characterizes the
z?µ-manifold and Tx = z by (3.2c) yields the final optimality condition (3.2b).
We note that gradient flow dynamics (3.1) are convenient for distributed implementation. If
the state vector x corresponds to the concatenated states of individual agents, xi, the sparsity
pattern of T and the structure of the gradient map ∇f : Rm → Rm dictate the communication
topology required to form ∇Lµ in (3.1). For example, if f(x) =∑fi(xi) is separable over
the agents, then ∇fi(xi) can be formed locally. If in addition TT is an incidence matrix of an
undirected network with the graph Laplacian TTT , each agent must share its state xi with its
neighbors and maintain dual variables yi that correspond to its edges.
54
Our approach provides several advantages over existing distributed optimization algorithms.
Even for problems (2.1) with non-differentiable regularizers g, a formulation based on the prox-
imal augmented Lagrangian yields gradient flow dynamics (3.1) with a continuous right-hand
side. This is in contrast to existing approaches which employ subgradient methods [147] or use
discontinuous projected dynamics [148–151]. Furthemore, for problems where T is not diagonal,
a distributed proximal gradient scheme cannot be implemented because the proximal operator
of g(Tx) may not be separable. Finally, relative to a distributed ADMM scheme, our method
does not require solving an x-minimization subproblem in each iteration.
3.2 Exponential convergence for strongly convex f
In this section, we will show that for a strongly convex f with a Lipschitz continuous gradient,
the dynamics described by (3.1) converge exponentially for a sufficiently large µ. Establishing
exponential convergence is particularly interesting since Lµ is neither strictly concave nor linear
in the dual variable. We then turn our attention to discrete-time dynamics, which are directly
related to an algorithmic implementation of the continuous dynamics (3.1) with a fixed step-
size. For this setup, we also show exponential convergence (i.e., linear convergence in standard
optimization terminology) for a suitably selected penalty-parameter µ and step-size.
To show these results, we employ the integral quadratic constraint (IQC) framework recently
formulated by [152] for studying optimization algorithms as linear systems G connected via
feedback with a nonlinear block corresponding to the gradient of f . To study (3.1), we introduce
an additional nonlinear block corresponding to the Moreau envelope; see Fig. 3.1.
We express 3.1, or equivalently 3.4, as a linear system G connected in feedback with non-
linearities that correspond to the gradients of f and of the Moreau envelope of g; see Fig. 3.1.
These nonlinearities can be conservatively characterized by IQCs. Exponential stability of G
connected in feedback with any nonlinearity that satisfies these IQCs implies exponential con-
vergence of 3.1 to the primal-dual optimal solution of (2.5). In what follows, we adjust the tools
of [152–155] to our setup and establish exponential convergence by evaluating the feasibility of
an LMI. We assume that the function f is mf -strongly convex with an Lf -Lipschitz continuous
gradient. Characterizing additional structural restrictions on f and g with IQCs may lead to
tighter bounds on the rate of convergence.
In this work, we use a static filter η = Ψ(u, ζ) = [uT ζT ]T and IQCs which correspond to
Lipschitz continuity. In general, filters with nontrivial dynamics and stricter IQCs may yield
tighter bounds on the exponential rate of convergence.
55
G
∇f −mI
µ∇Mµg
ζ1 = x
ζ2 = Tx+ µy
u1
u2
∆
Figure 3.1: Block diagram of primal-descent dual-ascent dynamics where G is a linear systemconnected via feedback with nonlinearities.
G
∆
Ψ η
ζ u
Figure 3.2: Block diagram of primal-descent dual-ascent dynamics where the nonlinearities arereplaced by an IQC imposed on their inputs and outputs.
3.2.1 Continuous-time dynamics
As illustrated in Fig. 3.1, 3.4 can be expressed as a linear system G connected via feedback to
a nonlinear block ∆,
w = Aw + B u, ξ = C w, u = ∆(ξ)
A =
−mfI
−µI
, B =
−I − 1µT
T
0 I
, C =
I 0
T µI
where w := [xT yT ]T , ξ := [ ξT1 ξT2 ]T , and u := [uT1 uT2 ]T . Nonlinearity ∆ maps the system
outputs ξ1 = x and ξ2 = Tx + µy to the inputs u1 and u2 via u1 = ∆1(ξ1) := ∇f(ξ1) −mfξ1
and u2 = ∆2(ξ2) := µ∇Mµg(ξ2) = ξ2 − proxµg(ξ2).
When the mapping ui = ∆i(ξi) is the L-Lipschitz continuous gradient of a convex function,
56
it satisfies the IQC [152, Lemma 6]
ξi − ξ0
ui − u0
T 0 LI
LI −2I
ξi − ξ0
ui − u0
≥ 0 (3.7)
where ξ0 is some reference point and u0 = ∆i(ξ0). Since f is strongly convex with parameter
mf , the mapping ∆1(ξ1) is the gradient of the convex function f(ξ1)− (mf/2)‖ξ1‖2. Lipschitz
continuity of ∇f with parameter Lf implies Lipschitz continuity of ∆1(ξ1) with parameter
L := Lf − mf ; thus, ∆1 satisfies (3.7) with L = L. Similarly, ∆2(ξ2) is the scaled gradient
of the convex Moreau envelope and is Lipschitz continuous with parameter 1; thus, ∆2 also
satisfies (3.7) with L = 1. These two IQCs can be combined into
(η − η0)TΠ (η − η0) ≥ 0, η := [ ξT uT ]T . (3.8)
For a linear system G connected in feedback with nonlinearities that satisfy IQC (3.8), [153,
Theorem 3] establishes ρ-exponential convergence, i.e., ‖w(t) − w?‖ ≤ τe−ρt‖w(0) − w?‖ for
some τ, ρ > 0, by verifying the existence of a matrix P 0 such that, ATρ P + PAρ PB
BTP 0
+
CT 0
0 I
Π
C 0
0 I
0, (3.9)
where Aρ := A+ρI. In Theorem 3.2.1, we determine a scalar condition that ensures exponential
stability when TTT is a full rank matrix.
Theorem 3.2.1. Let f be strongly convex with parameter mf , let its gradient be Lipschitz
continuous with parameter Lf , let g be proper, lower semicontinuous, and convex, and let TTT
be full rank. Then, if µ ≥ Lf − mf , there is a ρ > 0 such that the dynamics 3.1 converge
ρ-exponentially to the optimal point of (2.5).
Proof. Since any function that is Lipschitz continuous with parameter L is also Lipschitz contin-
uous with parameter L > L, we establish the result for µ = L := Lf −mf . We utilize [153, The-
orem 3] to show ρ-exponential convergence by verifying matrix inequality (3.9) through a series
of equivalent expressions (3.10). We first apply the KYP Lemma [156, Theorem 1] to (3.9) to
obtain an equivalent frequency domain characterization Gρ(jω)
I
∗Π
Gρ(jω)
I
0, ∀ ω ∈ R (3.10a)
57
where Gρ(jω) = C(jωI − Aρ)−1B. Evaluating the left-hand side of (3.10a) for L = µ and
dividing by −2 yields the matrix inequality
µm+ m2 + ω2
m2 + ω2I
m
m2 + ω2TT
∗ m/µ
m2 + ω2TTT +
ω2 − ρµµ2 + ω2
I
0 (3.10b)
where m := mf − ρ > 0 and µ := µ− ρ > 0 so that Aρ is Hurwitz, i.e., the system Gρ is stable.
Since the (1, 1) block in (3.10b) is positive definite for all ω, the matrix in (3.10b) is positive
definite if and only if the corresponding Schur complement is positive definite,
m/µ
µm+ m2 + ω2TTT +
ω2 − ρµµ2 + ω2
I 0. (3.10c)
We exploit the symmetry of TTT to diagonalize (3.10c) via a unitary coordinate transformation.
This yields m scalar inequalities parametrized by the eigenvalues λi of TTT . Multiplying the
left-hand side of these inequalities by the positive quantity (µ2 + ω2)(µm + m2 + ω2) yields a
set of equivalent, quadratic in ω2, conditions,
ω4 + ( mλiµ + m2 + µm− ρµ)ω2 + mµ2λiµ − ρµ(µm+ m2) > 0. (3.10d)
Condition (3.10d) is satisfied for all ω ∈ R if there are no ω2 ≥ 0 for which the left-hand side is
nonpositive. When ρ = 0, both the constant term and the coefficient of ω2 are strictly positive,
which implies that the roots of (3.10d) as a function of ω2 are either not real or lie in the domain
ω2 < 0, which cannot occur for ω ∈ R. Finally, continuity of (3.10d) with respect to ρ implies
the existence a positive ρ that satisfies (3.10d) for all ω ∈ R.
Remark 3.2.2. Each eigenvalue λi of a full rank matrix TTT is positive and hence to estimate
the exponential convergence rate it suffices to check (3.10d) only for the smallest λi. A sufficient
condition for (3.10d) to hold for each ω ∈ R is positivity of the constant term and the coefficient
multiplying ω2. These can be expressed as quadratic inequalities in ρ that admit explicit solutions,
thereby providing an estimate of the rate of exponential convergence.
58
3.2.2 Discrete time
The discrete time dynamics can be expressed as a discrete time linear system G connected via
feedback to a nonlinear block ∆,
wk+1 = Awk + B uk, ξ = C wk, u = ∆(ξ)
A =
αmIαµI
, B = α
−I − 1µT
T
0 I
, C =
I 0
T µI
(3.11)
where αm := 1 − αmf , αµ := 1 − αµ, w := [xT yT ]T , ξ := [ ξT1 ξT2 ]T , and u := [uT1 uT2 ]T . The
nonlinearity ∆ is the same as in the continuous-time case and it still satisfies IQC (3.8).
For a discrete time linear system G connected in feedback with nonlinearities that satisfy
IQC (3.8), [152, Theorem 4] establishes ρ-exponential convergence, i.e., ‖wk−w?‖ ≤ τρk‖w(0)−w?‖ for some τ ≥ 0, ρ ∈ [0, 1) is guaranteed by verifying the existence of a matrix P 0, λ ≥ 0
such that, ATρ PAρ − I ATρ PBρ
BTρ PAρ BTρ PBρ
+ λ
CT 0
0 I
Π
C 0
0 I
0, (3.12)
where Aρ := A/ρ and Bρ := B/ρ. In Theorem 3.2.3, we determine a scalar condition that
ensures exponential stability when TTT = I. The extension to the general case where TTT is
full rank is straightforward and the layout of the proof is meant to facilitate such development;
see Remark 3.2.4.
Theorem 3.2.3. Let f be strongly convex with parameter mf , let its gradient be Lipschitz
continuous with parameter Lf , let g be proper, lower semicontinuous, and convex, and let TTT =
I. Then, if µ ≥ Lf − mf and α ∈ (0,min( 12m ,
12µ , µ)), there is a ρ ∈ (0, 1) such that the
dynamics 3.11 converge ρ-exponentially to the optimal solution.
Proof. Since any function that is Lipschitz continuous with parameter L is also Lipschitz contin-
uous with parameter L > L, we establish the result for µ = L := Lf −mf . We utilize [152, The-
orem 4] to show ρ-exponential convergence by verifying matrix inequality (3.12) through a series
of equivalent expressions (3.13). We first apply the KYP Lemma [156, Theorem 2] to (3.12) to
obtain an equivalent frequency domain characterization Gρ(ejω)
I
∗Π
Gρ(ejω)
I
0, ∀ ω ∈ R (3.13a)
59
where Gρ(ejω) = C(ejωI−Aρ)−1Bρ = C(ρ ejωI−A)−1B. Evaluating the left-hand side of (3.13a)
for L = µ and dividing by −2 yields the matrix inequality a(ω)I b(ω)T
∗ c(ω)I + d(ω)TTT
0, ∀ω ∈ R (3.13b)
where
a(ω) := 1 + αµαm− 1 + ρ cosω
ρ2 + (αm− 1)2 + 2ρ(αm− 1) cosω
b(ω) := ααm− 1 + ρ cosω
ρ2 + (αm− 1)2 + 2ρ(αm− 1) cosω
c(ω) := 1− αµ αµ− 1 + ρ cosω
ρ2 + (αµ− 1)2 + 2ρ(αµ− 1) cosω
d(ω) := αµ
αm− 1ρ cosω
ρ2 + (αm− 1)22ρ(αm− 1) cosω.
Since the range of cosω over ω ∈ R is [−1, 1], we can replace cosω by ζ and check the matrix
inequality a(ζ)I b(ζ)T
∗ c(ζ)I + d(ζ)TTT
0, ∀ζ ∈ [−1, 1]
where
a(ζ) := 1 + αµαm− 1 + ρζ
ρ2 + (αm− 1)2 + 2ρ(αm− 1)ζ=: 1 + αµ
n
d
b(ζ) := ααm− 1 + ρζ
ρ2 + (αm− 1)2 + 2ρ(αm− 1)ζ=: α
n
d
c(ζ) := 1− αµ αµ− 1 + ρζ
ρ2 + (αµ− 1)2 + 2ρ(αµ− 1)ζ
d(ζ) := αµ
αm− 1 + ρζ
ρ2 + (αm− 1)2 + 2ρ(αm− 1)ζ=:
α
µ
n
d.
. (3.13c)
This is equivalent to checking positive definiteness of the (1, 1) block via,
a(ζ) > 0 (3.14a)
and the Schur complement. After diagonalizing TTT with a unitary transformation, this
amounts to checking
c(ζ) + d(ζ)λi −(b(ζ))∗b(ζ)
a(ζ)λi > 0 (3.14b)
for each λi where λi are the eigenvalues of TTT . To show exponential convergence at some rate
ρ, we show (3.12) via (3.14) for ρ = 1. By continuity of (3.14) in ρ, this establishes (3.12) for
some ρ < 1. For the remainder of the proof, we take ρ = 1.
60
Since a(ζ) is a linear fractional transformation of ζ, it is quasiconvex and thus condi-
tion (3.14a) may be established by checking a(1) > 0 and a(−1) > 0 [123],
1 + αµαm
1 + (αm− 1)2 + 2(αm− 1)= 1 +
µ
m> 0 (3.15a)
1 + αµαm− 2
1 + (αm− 1)2 − 2(αm− 1)= 1 +
αµ
αm− 2> 0 (3.15b)
Condition (3.15a) is clearly satisfied. By assumption, αµ ∈ (0, 12 ) and αm ∈ (0, 1
2 ) which implies
that αm− 2 ∈ (−2,− 32 ) and αµ
αm−2 ∈ (0,− 13 ), thereby establishing condition (3.15b).
Expressing the Schur complement (3.14b) in terms of the n and d from (3.13c) yields,
c(ζ) +α
µ
n
dλi −
(αn
d
)(1 + αµ
n
d
)−1 (αn
d
)λi = c(ζ) +
α
µ
n
dλi − α2n
2
d2
d
αµn+ dλi
= c(ζ) +α
µ
n
dλi −
α2n2
d(αµn+ d)λi
= c(ζ) +(α/µ)n(αµn+ d) − α2n2
d(αµn+ d)λi
= c(ζ) +(α/µ)n
αµn+ dλi
Substituting to remove n and d yields
s(ζ) := c(ζ) +(α/µ)n
αµn+ dλi =: 1 + s1(ζ) + s2(ζ)λi > 0. (3.16)
where
s1(ζ) := − αµ(αµ− 1 + ζ)
1 + (αµ− 1)2 + 2(αµ− 1)ζ
s2(ζ) :=(α/µ)(αm− 1 + ζ)
αµ(αm− 1 + ζ) + 1 + (αm− 1)2 + 2(αm− 1)ζ
Since both s1 and s2 are linear fractional transformations of ζ, they are quasiconvex which
implies s1(ζ) ∈ [s1(−1), s1(1)] and s2(ζ) ∈ [s2(−1), s2(1)] for all ζ ∈ [−1, 1]. Evaluating these
functions at ζ = ±1,
s1(−1) ∈ (0, 13αµ), s1(1) = −1, s2(−1) = (α/µ)
αm+αµ−2 > −1, s2(1) = (α/µ)αµ+αm > 1
(3.17)
yields
s1(ζ) ∈ (−1, 13αµ], s2(ζ) > −1, ∀ ζ ∈ [−1, 1] (3.18)
61
Since TTT = I, λi = 1 for all i. However, although s(1) > 0 and s(−1) > 0, the set of
quasiconvex functions is not closed under addition so inequality (3.16) does not follow for all
ζ ∈ [−1, 1]. From (3.18), we can only claim that s(ζ) > −1 for all ζ ∈ [−1, 1], which is clearly
insufficient. To obtain the result, we consider segments of [−1, 1]. These segments vary based
on two cases: µ ≥ m and µ ≤ m.
First, let us consider the case where µ ≤ m. We consider two segments [−1, η] and [η, 1]
where η = 1− αm. Note that the denominator of s1(ζ) is always positive so
s1(η) =−α2µ(µ−m)
1 + (αµ− 1)2 + 2(αµ− 1)(1− αm).
is positive because µ−m ≤ 0. Together with (3.17), this implies
s1(ζ) ≥ 0 ∀ζ ∈ [−1, η], s1(ζ) ≥ −1 ∀ζ ∈ [η, 1] (3.19a)
Combining (3.17) with s2(η) = 0 yields,
s2(ζ) ≥ −1 ∀ζ ∈ [−1, η], s2(ζ) ≥ 0 ∀ζ ∈ [η, 1]. (3.19b)
Together, equations (3.19) imply that (3.14b) holds when µ ≤ m.
Now consider the case µ ≥ m. We consider three segments [−1, δ], [δ, η], and [η, 1] where
δ = 1− αµ. Combining s1(δ) = 0 and s2(η) = 0 with (3.17) yields
s1(ζ) ≥ 0 ∀ζ ∈ [−1, δ], s2(ζ) ≥ 0 ∀ζ ∈ [η, 1]. (3.20a)
Together with (3.18), equation (3.20a) implies that (3.14b) holds for ζ ∈ [−1, δ] and ζ ∈ [η, 1].
To consider ζ ∈ [δ, η], we must consider s1(η) and s2(δ). For this case, where µ ≥ m,
s1(η) =(−αµ)α(µ−m)
1 + (αµ− 1)((αµ− 1) + 2(1− αm))
is negative. However, since |αµ| ≤ 12 , |αµ − 1| ≤ 1
2 , |α(m − µ)| ≤ 12 and |1 − αm| ≤ 1
2 , it is
readily shown that s1(η) > − 13 . Now consider,
s2(δ) =(α/µ)(α(m− µ))
αµ(α(m− µ)) + 1 + (αm− 1)2 + 2(αm− 1)(1− αµ).
which is also necessarily negative. Using the same bounds, |αµ| ≤ 1, |αµ−1| ≤ 1, |α(m−µ)| ≤ 1
62
and |1− αm| ≤ 1, it is readily shown that s2(δ) ≥ − 12 . This implies,
s1(ζ) ∈ [− 13 , 0] ∀ζ ∈ [δ, η], s2(ζ) ∈ [− 1
2 , 0] ∀ζ ∈ [δ, η]. (3.20b)
from which it follows that s(ζ) > 0 for all ζ ∈ [δ, η], establishing condition (3.14b) and completing
the proof.
Remark 3.2.4. The conditions on the step size α and the spectrum of TTT in Theorem 3.2.3
are conservative and lead to the simple verification of inequalities (3.15b) and (3.16). These
inequalities can be posed as quadratic and quartic equations in α, respectively. Since these
orders of polynomial have explicit solutions, more rigorous analysis of these conditions would
lead to explicit, tighter bounds on α and an overt dependence on λi.
3.3 Distributed implementation
Dynamics (3.1) and (3.11) are convenient for distributed implementation. If the state vector x
corresponds to the concatenated states of individual agents, xi, the sparsity pattern of T and
the structure of the gradient map ∇f : Rn → Rn dictate the communication topology required
to form ∇Lµ in (3.1). For example, if f(x) =∑fi(xi) is separable over the agents, then ∇fi(xi)
can be formed locally. If in addition TT is an incidence matrix of an undirected network with
M edges and the graph Laplacian TTT , each agent must share its state xi with its neighbors
and maintain dual variables yi that correspond to the edges to which it is connected.
Distributed implementation of a primal-descent dual-ascent dynamics on the proximal aug-
mented Lagrangian provides several advantages over existing distributed optimization algo-
rithms. For problems where T is not diagonal, a distributed proximal gradient scheme can-
not be implemented because the proximal operator of g(Tx) may not be separable. It has a
continuous right-hand side even for problems (2.1) with non-differentiable regularizers g, unlike
existing approaches which employ subgradient methods [147] or use discontinuous projected dy-
namics [148–151]. Finally, relative to a distributed ADMM scheme, our method does not require
solving an x-minimization subproblem in each iteration.
3.3.1 Example: Optimal placement
In this example [123], a number of agents try to minimize some weighted measure of distances
from their neighbors. The set of neighbors for each node is determined by an a priori specified
63
network. The optimization variable x contains the locations of the ‘mobile’ agents The locations
of fixed agents b enter as a problem parameter. The optimization problem can be written as,
minimize∑
(i,j)∈Exx
fij(xi − xj) +∑
(i,j)∈Exb
fij(xi − bj)
where xi, bi ∈ Rp are position vectors of mobile and fixed agents in a p-dimensional space,
Exx and Exb denote the sets of edges from mobile agents to each other and to the fixed agents
respectively, and fij is some measure of distance. In this example, we take fij = (1/2)‖·‖2 and
impose an additional constraint on the distances between neighboring mobile agents,
‖xi − xj‖ ≤ c, ∀(i, j) ∈ Exx.
We can reexpress the problem as,
minimize1
2
∥∥∥∥∥∥ A
T
x − b
0
∥∥∥∥∥∥2
︸ ︷︷ ︸f(x)
+ I[−c,c](Tx)︸ ︷︷ ︸g(T (x))
(3.21)
where x contains the positions of the mobile agents, Ax−b forms vectors between mobile agents
and neighboring fixed agents, Tx forms vectors connecting neighboring mobile agents, and I[−c,c]
is an indicator function associated with the set of vectors v whose subvectors vi ∈ Rp have norm
less than c.
Distributed implementation
We form the proximal augmented Lagrangian and implement the gradient dynamics described
in (3.1). The proximal operator assoicated with I[−c,c] is a projection which scales each subvector
with norm larger than c such that it has norm c. Each subvector xi of the state corresponds to
the position of the ith agent and each subvector yi of the Lagrange multiplier is associated with
the difference in position between two agents connected by an edge in Exx.
The continuous time dynamics described in (3.1) and the expression for the gradient of f ,
∇f(x) = (ATA + TTT )x + AT b
reveal the distributed nature of the dynamics. The matrix ATA = blkdiag(. . . , diIp, . . . ) where
64
di is the number of fixed agents to which the ith mobile agent is connected and Ip is a p-
dimensional identity matrix. The matrix T is of the form E ⊗ Ip where E is the incidence
matrix corresponding to Exx so TTT is of the form L ⊗ Ip where L is the graph laplacian
associated with Exx.
Each agent can form its dynamics only the positions of itself and its neighbors and the
Lagrange multipliers associated with the edges to which it is connected. The dynamics of yi
may be independently determined by agents.
Simulation results
We simulated the gradient dynamics for the constrained optimal placement problem for a small
system with 5 mobile agents and 6 fixed agents. Figs. 3.3a and 3.3b show the optimal mobile
agent positions (solution to (3.21)) with red or blue circles given two configurations of fixed
agents (black ) with Exx shown by solid red — or solid blue — edges, Exb is shown by
dotted black · · · edges, and c = 0.5. The second configuration corresponds to rotating the
fixed agents by π/8 and scaling their positions by 1.6. In Fig. 3.3a, 3 out of the 5 inequality
constraints are active and in Fig. 3.3b, all 5 inequality constraints are active.
To illustrate the dynamics, we simulated (3.1) from t = 0 to t = 10 with a discrete jump in
fixed agent positions. The mobile agents were initially placed at the origin. The fixed agents
were placed as in Fig. 3.3a until time t = 2.5, at which they were placed as in Fig. 3.3b. The
trajectories of each mobile agent are shown in Fig. 3.3c and the Euclidean distance from their
optimal position (i.e., as in Figs. 3.3a and 3.3b) with respect to time is plotted in Fig. 3.3d.
65
(a) First configuration. 3 out of 5 inequality con-straints active.
(b) Second configuration. 5 out of 5 inequality con-straints active.
(c) Trajectories with black × at position every 2.5time steps.
(d) Euclidean distance from optimal with respectto time.
Figure 3.3: Subfigures (a) and (b) show two optimal configurations of mobile agents () withcorresponding fixed agents (black ), Exx is (solid red — and solid blue — edges) and Exb(dotted black · · · edges). Subfigures (c) and (d) show the agent trajectories and distance fromoptimal configuration (a) until t = 2.5 and from optimal configuration (b) until the final t = 10.
Chapter 4
Second-order primal-dual method
Although the MM and Arrow-Hurwicz-Uzawa primal-descent dual-ascent gradient flow ap-
proaches are typically simple and computationally efficient, they tend to converge slowly to
high-accuracy solutions. As with smooth problems, second order information can enhance the
speed of convergence. A generalization of Newton’s method to non-smooth problems has been
developed in [157,158], but it requires solving a regularized quadratic subproblem to determine
a search direction. Except in cases where the Hessian of the smooth component has a special
structure, this problem may be difficult to solve. More recently, proximal Newton method for
non-smooth composite minimization was developed in [159]. It was shown that, even in the
situations when the search direction is computed approximately, the developed algorithm shares
convergence properties with traditional Newton’s method. Related ideas have been successfully
utilized in a number of applications, including sparse inverse covariance estimation in graphical
models [160,161].
Recent work has extended the method of multipliers to incorporate second-order updates
of the primal and dual variables [162–164]. Since the standard method of multipliers seeks the
saddle point of the augmented Lagrangian, it is challenging to assess joint progress of primal and
dual iterates. In [162], Gill and Robinson have introduced the notion of primal-dual augmented
Lagrangian that serves as the merit function for measuring progress of second-order updates.
The proposed approach is applicable to non-convex problems but it requires that the objective
function is twice continuously differentiable. In [165], a forward-backward envelope was utilized
to cast the original non-smooth optimization problem as the unconstrained minimization of
a continuously differentiable function and generalized sub-differential calculus was employed to
derive second-order updates. Generalized second order methods have also been applied to saddle
66
67
point problems directly [166].
We draw on these recent advances and develop a new algorithm for non-smooth composite
optimization which efficiently forms second-order updates of both primal and dual variables. To
motivate our approach, we show global exponential stability of the corresponding continuous-
time differential inclusion when f is strongly convex. Using the merit function employed in [163,
164], we refine the algorithm by adding adaptive updates for the penalty parameter. We show
global convergence and local quadratic convergence of this algorithm for strongly convex f .
4.1 Problem formulation and background
We consider problem (2.1) with additional assumptions on the function f and linear operator
T . Although our strongest results require strong convexity of f , our theory and techniques are
applicable as long as the Hessian of f is positive definite.
Assumption 1. The function f is twice continuously differentiable, has an Lf Lipschitz con-
tinuous gradient ∇f , and is strictly convex with ∇2f 0; the function g is proper, lower
semicontinuous, and convex; and the matrix T has full row rank.
We now provide background on generalizations of the gradient for nondifferentiable functions
and briefly overview existing second order methods for solving (2.1).
4.1.1 Generalization of the gradient and Jacobian
Although proxµg is typically not differentiable, it is Lipschitz continuous and therefore differ-
entiable almost everywhere [138]. One generalization of the gradient for such functions is given
by the B-subdifferential set [167], which applies to locally Lipschitz continuous functions h:
Rm → R. Let Ch be a set at which h is differentiable. Each element in the set ∂Bh(z) is the
limit point of a sequence of gradients ∇h(zk) evaluated at a sequence of points zk ⊂ Ch
whose limit is z,
∂Bh(z) := J | ∃zk ⊂ Ch, zk → z, ∇h(zk)→ J . (4.1)
If h is continuously differentiable in the neighborhood of a point z, the B-subdifferential set
becomes single valued and it is given by the gradient, ∂Bh(z) = ∇h(z). In general, ∂Bh(z) is
not a convex set; e.g., if h(z) = |z|, ∂Bh(0) = −1, 1.
68
The Clarke subdifferential set of h: Rm → R at z is the convex hull of the B-subdifferential
set [139],
∂Ch(z) := conv (∂Bh(z)).
When h is a convex function, the Clarke subdifferential set is equal to the subdifferential set
∂h(z) which defines the supporting hyperplanes of h at z [168, Chapter VI]. For a function G:
Rm → Rn, the B-generalization of the Jacobian at a point z is given by
∂BG(z) :=[JT1 . . . JTn
]Twhere each Ji ∈ ∂BGi(z) is a member of the B-subdifferential set of the ith component of G
evaluated at z. The Clarke generalization of the Jacobian at a point z, ∂CG(z), has the same
structure where each Ji ∈ ∂CGi(z) is a member of the Clarke subdifferential of Gi(z).
4.1.2 Semismoothness
The mapping G: Rm → Rn is semismooth at z if for any sequence zk → z, the sequence of
Clarke generalized Jacobians JGk ∈ ∂CG(zk) provides a first order approximation of G,
‖G(zk) − G(z) + JGk(z − zk)‖ = o(‖zk − z‖). (4.2)
where φ(k) = o(ψ(k)) denotes that φ(k)/ψ(k)→ 0 as k tends to infinity [169]. The function G
is strongly semismooth if this approximation satisfies the stronger condition,
‖G(zk) − G(z) + JGk(z − zk)‖ = O(‖zk − z‖2),
where φ(k) = O(ψ(k)) signifies that |φ(k)| ≤ Lψ(k) for some positive constant L and positive
ψ(k) [169].
Remark 4.1.1. (Strong) semismoothness of the proximal operator leads to fast asymptotic
convergence of the differential inclusion (see Section 4.2) and the efficient algorithm (see Sec-
tion 4.3). Proximal operators associated with many typical regularization functions (e.g., the `1
and nuclear norms [170], piecewise quadratic functions [171], and indicator functions of affine
convex sets [171]) are strongly semismooth. In general, semismoothness of proxµg follows from
semismoothness of the projection onto the epigraph of g [171]. However, there are convex sets
onto which projection is not directionally differentiable [172]. The indicator functions associ-
ated with such sets or functions whose epigraphs are described by such sets may induce proximal
69
operators which are not semismooth.
4.1.3 Existing second order methods
We discussed existing first order methods in Section 2.2.3. While simple to implement, their slow
convergence to high-accuracy solutions motivates the development of second order methods for
solving (2.1). A generalization of Newton’s method for nonsmooth problems (2.1) with T = I
was developed in [157, 158, 165, 173]. A sequential quadratic approximation of the smooth part
of the objective function is utilized and a search direction x is obtained as the solution of a
regularized quadratic subproblem,
minimizex
12 x
THx + ∇f(xk)T x + g(xk + x) (4.3)
where xk is the current iterate and H is the Hessian of f . This method generalizes the projected
Newton method [174] to a broader class of regularizers. For example, when g is the `1 norm,
this amounts to solving a LASSO problem [175], which can be a challenging task. Coordinate
descent is often used to solve this subproblem [173] and it has been observed to perform well in
practice [76,160,161].
The Forward-Backward Envelope (FBE) was introduced in [176–178]. The FBE is a once-
continuously differentiable nonconvex function of x and its minimum corresponds to the solution
of (2.1) with T = I. As demonstrated in Chapter 5, the FBE can be obtained from the proximal
augmented Lagrangian that we introduce in Section 2.3. Since the generalized Hessian of FBE
involves third-order derivatives of f (which may be expensive to compute), references [176–178]
employ either truncated or quasi-Newton methods to obtain a second order update to x.
4.1.4 Second order updates
Even though Newton’s method is primarily used for solving minimization problems in modern
optimization, it was originally formulated as a root-finding technique and it has long been
employed for finding stationary points [179]. In [166], a generalized Jacobian was used to extend
Newton’s method to semismooth problems. We employ this generalization of Newton’s method
to ∇Lµ(x; y) in order to compute the saddle point of the proximal augmented Lagrangian. The
unique saddle point of Lµ(x; y) is given by the optimal primal-dual pair (x?, y?) and it thus
provides the solution to (2.1).
70
Generalized Newton updates
Let H := ∇2f(x). We use the B-generalized Jacobian of the proximal operator proxµg, PB :=
∂B proxµg(Tx + µy), to define the set of B-generalized Hessians of the proximal augmented
Lagrangian,
∂2BLµ :=
H + 1
µ TT (I − P )T TT (I − P )
(I − P )T −µP
, P ∈ PB
(4.4a)
and the Clarke generalized Jacobian PC := ∂C proxµg(Tx + µy) to define the set of Clarke
generalized Hessians of the proximal augmented Lagrangian,
∂2CLµ :=
H + 1
µ TT (I − P )T TT (I − P )
(I − P )T −µP
, P ∈ PC
. (4.4b)
Note that ∂2BLµ(x; y) ⊂ ∂2
CLµ(x; y) because PB ⊂ PC .
We introduce the composite variable, w := [xT yT ]T , use Lµ(w) interchangeably with
Lµ(x; y), and suppress the dependance of H and P on w to reduce notational clutter. For
simplicity of exposition, we assume that proxµg is semismooth and state the results for the
Clarke generalized Hessian (4.4b), i.e., ∂2Lµ(w) = ∂2CLµ(w). As described in Remark 4.2.6,
analogous convergence results for non-semismooth proxµg can be obtained for the B-generalized
Hessian (4.4a), i.e., ∂2Lµ(w) = ∂2BLµ(w).
We use the Clarke generalized Hessian (4.4b) to obtain a second order update w by linearizing
the stationarity condition ∇Lµ(w) = 0 around the current iterate wk,
∂2CLµ(wk) wk = −∇Lµ(wk). (4.5)
Since proxµg is firmly nonexpansive, 0 P I. In Lemma 4.1.2 we use this fact to prove
that the second order update w is well-defined for any generalized Hessian (4.4) of the proximal
augmented Lagrangian Lµ(x; y) as long as f is strictly convex with ∇2f(x) 0 for all x ∈ Rm.
Lemma 4.1.2. Let H ∈ Rn×n be symmetric positive definite, H 0, let P ∈ RM×M be
symmetric positive semidefinite with eigenvalues less than one, 0 P I, let T ∈ RM×m be
full row rank, and let µ > 0. Then, the matrix H + 1µ T
T (I − P )T TT (I − P )
(I − P )T −µP
71
is invertible and it has n positive and m negative eigenvalues.
Proof. The Haynsworth inertia additivity formula [180] implies that the inertia of matrix (4.4)
is determined by the sum of the inertias of matrices,
H + 1µ T
T (I − P )T (4.6a)
and
−µP − (I − P )T(H + 1
µ TT (I − P )T
)−1
TT (I − P ). (4.6b)
Matrix (4.6a) is positive definite because H 0 and both P and I − P are positive semidefi-
nite. Matrix (4.6b) is negative definite because the kernels of P and I − P have no nontrivial
intersection and T has full row rank.
Fast local convergence
The use of generalized Newton updates for solving the nonlinear equation G(x) = 0 for non-
differentiable G was studied in [166]. We apply this framework to the stationarity condition
∇Lµ(w) = 0 when proxµg is (strongly) semismooth and show that second order updates (4.5)
converge (quadratically) superlinearly within a neighborhood of the optimal primal-dual pair.
Proposition 4.1.1. Let proxµg be (strongly) semismooth, and let wk be defined by (4.5). Then,
there is a neighborhood of the optimal solution w? in which the second order iterates wk+1 =
wk + wk converge (quadratically) superlinearly to w?.
Proof. Lemma 4.1.2 establishes that ∂2C Lµ(w) is nonsingular for any P ∈ PC . Since the gradi-
ent ∇Lµ(w) of the proximal augmented Lagrangian is Lipschitz continuous by Theorem 2.3.1,
nonsingularity of ∂2CLµ(w) and (strong) semismoothness of the proximal operator guarantee
(quadratic) superlinear convergence of the iterates by [166, Theorem 3.2].
4.2 A globally convergent differential inclusion
Since we apply a generalization of Newton’s method to a saddle point problem and the second
order updates are set valued, convergence to the optimal point is not immediate. Although we
showed local convergence rates in Proposition 4.1.1 by leveraging the results of [166], proof of
the global convergence is more subtle and it is established next.
72
To justify the development of a discrete-time algorithm based on the search direction resulting
from (4.5), we first examine the corresponding differential inclusion,
w ∈ −(∂2CLµ(w))−1∇Lµ(w) (4.7)
where ∂2CLµ is the Clarke generalized Hessian (4.4b) of Lµ. We assume existence of a solution
and prove asymptotic stability of (4.7) under Assumption 1 and global exponential stability
under an additional assumption that f is strongly convex.
Assumption 2. Differential inclusion (4.7) has a solution.
4.2.1 Asymptotic stability
We first establish asymptotic stability of differential inclusion (4.7).
Theorem 4.2.1. Let Assumptions 1 and 2 hold and let proxµg be semismooth. Then, differ-
ential inclusion (4.7) is asymptotically stable. Moreover,
V (w) := 12 ‖∇Lµ(w)‖2 (4.8)
provides a Lyapunov function and
V (t) = − 2V (t). (4.9)
Proof. Lyapunov function candidate (4.8) is a positive function of w everywhere apart from the
optimal primal-dual pair w? at which it is zero. It remains to show that V is decreasing along
the solutions w(t) of (4.7), i.e., that V is strictly negative for all w(t) 6= w?,
V (t) := ddtV (w(t)) = − 2V (w(t))
For Lyapunov function candidates V (w) which are differentiable with respect to w,˙V =
wT∇V by the chain rule. Although (4.8) is not differentiable with respect to w, we show that
V (w(t)) is differentiable along the solutions of (4.7). Instead of employing the chain rule, we
use the limit that defines the derivative,
V (t) := ddtV (w(t)) = lim
s→ 0
V (w(t) + sw(t)) − V (w(t))
s(4.10)
73
to show that V exists and is negative along the solutions of (4.7). Here, w is determined by the
dynamics (4.7),
w(t) = −H−1C ∇Lµ(w(t)), (4.11)
for some HC ∈ ∂2CLµ(w(t)) and (4.10) is equivalent to the directional derivative of V (w) in the
direction w. We first introduce
hs(t) :=V (w(t) + sw(t)) − V (w(t))
s
which yields V in the limit s → 0. We then rewrite hs(t) as the limit point of a sequence of
functions hs,k(t) so that
V (t) = lims→ 0
hs(t) = lims→ 0
limk→∞
hs,k(t) (4.12)
and use the Moore-Osgood theorem [181, Theorem 7.11] to exchange the order of the limits and
establish that V (t) = −2V (t).
Let Cg denote a subset of Rm+M over which proxµg(Tx+µy) is differentiable (and therefore
V is differentiable with respect to w) and let wk be a sequence of points in Cg that converges
to w. We define the sequence of functions hs,k(t),
hs,k(t) :=V (wk(t) + sw(t)) − V (wk(t))
s.
To employ the Moore-Osgood theorem, it remains to show that hs,k(t) converges pointwise (for
any k) as s→ 0 and that hs,k(t) converges uniformly on some interval s ∈ [0, s] as k →∞.
Since wk ⊂ Cg, V (wk) is differentiable for every k ∈ Z+ and ∇V (wk) = ∇2Lµ(wk) =
∂2CLµ(wk). It thus follows that
lims→ 0
hs,k(t) = wTk (t)∇2Lµ(wk)∇Lµ(wk(t)) (4.13)
pointwise (for any k). Local Lipschitz continuity of V with respect to wk implies uniform
convergence of hs,k(t) to hs(t) on s ∈ (0, s ] where s > 0. Therefore, the Moore-Osgood theorem
74
on exchanging limits [181, Theorem 7.11] in conjunction with (4.13) implies
V (t) = lims→ 0
hs(t)
= lims→ 0
limk→∞
hs,k(t) = limk→∞
lims→ 0
hs,k(t)
= limk→∞
wT (t)∇2Lµ(wk)∇Lµ(wk(t))
= wT (t)HB∇Lµ(w(t))
= − (∇Lµ(w(t)))TH−1C HB∇Lµ(w(t))
(4.14a)
for some HB ∈ ∂2BLµ(w(t)) by the definition of the B-subdifferential (4.1). This immediately
establishes (4.9) for any w (4.11) such that C ∈ ∂2BLµ(w(t)) by choosing Cg such that HB = HC .
Semismoothness implies directional differentiability of V and thereby equivalence of (4.10)
and (4.14a) by [182, Proposition 2.3]. It follows that
V (t) = (∇Lµ(w(t)))TH−1C HBi∇Lµ(w(t)) = (∇Lµ(w(t)))TH−1
C HBj∇Lµ(w(t)) (4.14b)
for any HBi , HBj ∈ ∂2BLµ(w(t)). By definition of the Clarke subgradient, any HC 6∈ ∂2
BLµ(w(t))
can be expressed as
HC =∑i
αiHBi (4.14c)
where HBi ∈ ∂2BLµ(w(t)) and α ∈ P where P is the simplex P := α ∈ Rr+ |
∑i αi = 1 and r
is the cardinality of the set ∂2BLµ(w). By substituting (4.14c) into the final line of (4.14a) and
noting the equivalence (4.14b), the directional derivative (4.10) can be expressed as
V (t) = − (∇Lµ(w(t)))T
(∑i
αiHBi
)−1 (∑i
βiHBi
)∇Lµ(w(t))
for any β ∈ P. Taking β = α yields V (t) = −‖Lµ(w(t))‖2 = −2V (t), completing the proof.
4.2.2 Global exponential stability
To establish global asymptotic stability, we show that the Lyapunov function (4.8) is radially
unbounded, and to prove exponential stability we bound it with quadratic functions. We first
provide two lemmas that characterize the mappings proxµg and ∇f in terms of the spectral
properties of matrices that describe the corresponding input-output relations at given points.
75
Lemma 4.2.2. Let f be strongly convex with parameter mf and let its gradient ∇f be Lipschitz
continuous with parameter Lf . Then, for any a, b ∈ Rn there exists a symmetric matrix Ga,b
satisfying mfI Ga,b LfI such that
∇f(a) − ∇f(b) = Ga,b (a − b).
Proof. Let c := a− b, d := ∇f(a)−∇f(b), e := d−mfc, and
Ga,b := eeT /(eT c), e 6= 0; 0, otherwise (4.15a)
Ga,b := Ga,b + mfI. (4.15b)
Clearly, by construction, Ga,b = GTa,b 0. It is also readily verified that Ga,b c = d when
eT c 6= 0. It thus remains to show that (i) Ga,b c = d when eT c = 0; and (ii) Ga,b (Lf −mf )I.
(i) Since f is mf strongly convex and ∇f is Lf Lipschitz continuous, h(x) := f(x)− mf2 ‖x‖
2
is convex and ∇h(x) = ∇f(x) −mfx is Lf −mf Lipschitz continuous. Furthermore, we have
e = ∇h(a)−∇h(b), and [152, Proposition 5] implies
eT c ≥ 1Lf −mf ‖e‖
2, for all c ∈ Rn. (4.16)
This shows that eT c = 0 only if e := d −mfc = 0 and, thus, d = mfc = Ga,b c when eT c = 0.
Therefore, there always exist a symmetric matrix Ga,b such that Ga,b c = d.
(ii) When eT c 6= 0, Ga,b is a rank one matrix and its only nonzero eigenvalue is ‖e‖2/(eT c);this follows from Ga,b e = (‖e‖2/(eT c)) e. In this case, inequality (4.16) implies eT c > 0
and (4.16) is equivalent to 1/(eT c) ≤ (Lf − mf )/‖e‖2. Thus, ‖e‖2/(eT c) ≤ Lf − mf and
Ga,b (Lf −mf )I when eT c 6= 0. Since Ga,b = 0 when eT c = 0, Ga,b (Lf −mf )I for all a
and b. Finally, Ga,b 0 and (4.15b) imply mfI Ga,b LfI.
Remark 4.2.3. Although matrices Da,b and Ga,b in Lemmas 3.1.2 and 4.2.2 depend on the
operating point, their spectral properties, 0 Da,b I and mfI Ga,b LfI, hold for all a
and b. These lemmas can be interpreted as a combination between a generalization of the mean
value theorem [181, Theorem 5.9] to vector-valued functions and spectral bounds on the operators
proxµg: Rm → Rm and ∇f : Rn → Rn arising from firm nonexpansiveness of proxµg, strong
convexity of f , and Lipschitz continuity of ∇f .
We now combine Lemmas 3.1.2 and 4.2.2 to establish quadratic upper and lower bounds
76
for Lyapunov function (4.8) and thereby prove global exponential stability of differential inclu-
sion (4.7) for strongly convex f .
Theorem 4.2.4. Let Assumptions 1 and 2 hold, let proxµg be semismooth, and let f be mf
strongly convex. Then, differential inclusion (4.7) is globally exponentially stable, i.e., there
exists κ > 0 such that ‖w(t)− w?‖ ≤ κ e−t ‖w(0)− w?‖.
Proof. Given the assumptions, Theorem 4.2.1 establishes asymptotic stability of (4.7) with the
dissipation rate V (w) = − 2V (w). It remains to show the existence of positive constants κ1
and κ2 such that Lyapunov function (4.8) satisfies
κ1
2 ‖w‖2 ≤ V (w) ≤ κ2
2 ‖w‖2 (4.17)
where w := w − w? and w? := (x?, y?) is the optimal primal-dual pair. The upper bound
in (4.17) follows from Lipschitz continuity of ∇Lµ(w) (see Theorem 2.3.1), with κ2 determined
by the Lipschitz constant of ∇Lµ(w).
To show the lower bound in (4.17), and thus establish radial unboundedness of V (w), we
construct matrices that relate V (w) to w. Lemmas 3.1.2 and 4.2.2 imply the existence of
symmetric matrices Dw and Gw such that 0 Dw I, mfI Gw LfI, and
proxµg(Tx+ µy)− proxµg(Tx? + µy?) = Dw (T x+ µy)
f(x)− f(x?) = Gw x.
As noted in Remark 4.2.3, although Dw and Gw depend on the operating point, their spectral
properties hold for all w.
Since ∇Lµ(w?) = 0, we can write
∇Lµ(w) = ∇Lµ(w) − ∇Lµ(w?) = Qw w
and express Lyapunov function (4.8) as
V (w) = 12 w
TQTwQw w
where
Qw :=
Gw + 1µ T
T (I −Dw)T TT (I −Dw)
(I −Dw)T −µDw
77
for some (Dw, Gw) ∈ Ωw,
Ωw := (Dw, Gw) | 0 Dw I, mfI Gw LfI .
The set Ωw is closed and bounded and the minimum eigenvalue of QTwQw is a continuous function
of Gw and Dw. Thus, the extreme value theorem [181, Theorem 4.14] implies that its infimum
over Ωw,
κ1 = inf(Dw,Gw)∈Ωw
λmin
(QTwQw
)is achieved. By Lemma 4.1.2, Qw is a full rank matrix, which implies that QTwQw 0 for all w
and therefore that κ1 is positive. Thus, V (w) ≥ κ1
2 ‖w‖2, establishing condition (4.17).
Condition (4.9) and [183, Lemma 2.5] imply V (w(t)) = e−2tV (w(0)). It then follows
from (4.17) that
‖w(t)− w?‖2 ≤ (κ2/κ1) e−2t ‖w(0)− w?‖2.
Taking the square root completes the proof and provides an upper bound for the constant κ,
κ ≤√κ2/κ1.
Remark 4.2.5. The rate of exponential convergence established by Theorem 4.2.4 is indepen-
dent of mf , Lf , and µ. This is a consequence of insensitivity of Newton-like methods to poor
conditioning. In contrast, the first order primal-dual method considered in [107] requires a suf-
ficiently large µ for exponential convergence. In our second order primal-dual method, problem
conditioning and parameter selection affect the multiplicative constant κ but not the rate of
convergence.
Remark 4.2.6. When differential inclusion (4.7) is defined with the B-generalized Hessian (4.4a),
Theorems 4.2.1 and 4.2.4 hold even for proximal operators which are not semismooth. In this
case, Theorem 4.2.1 is complete at equation (4.14a) and Theorem 4.2.4 applies without modifi-
cation.
4.3 A second order primal-dual algorithm
An algorithm based on the second order updates (4.5) requires step size selection to ensure
global convergence. This is challenging for saddle point problems because standard notions,
such as sufficient descent, cannot be applied to assess the progress of the iterates. Instead, it
is necessary to identify a merit function whose minimum lies at the stationary point and whose
78
sufficient descent can be used to evaluate progress towards the saddle point.
An approach based on discretization of differential inclusion (4.7) and Lyapunov func-
tion (4.8) as a merit function leads to Algorithm 4 in Section 4.5.1. However, such a merit
function is nonconvex and nondifferentiable in general which makes the utility of backtracking
(e.g., the Armijo rule) unclear. Moreover, Algorithm 4 employs a fixed penalty parameter µ. A
priori selection of this parameter is difficult and it has a large effect on the convergence speed.
Instead, we employ the primal-dual augmented Lagrangian introduced in [162] as a merit
function and incorporate an adaptive µ update. This merit function is convex in both x and
y and it facilitates an implementation with outstanding practical performance. Drawing upon
recent advancements for constrained optimization of twice differentiable functions [163,164], we
show that our algorithm converges to the solution of (2.5). Finally, our algorithm exhibits local
(quadratic) superlinear convergence for (strongly) semismooth proxµg.
4.3.1 Merit function
The primal-dual augmented Lagrangian,
Mµ(x, z; y, λ) := Lµ(x, z;λ) + 12µ ‖Tx− z + µ (λ− y)‖2
was introduced in [162], where λ is an estimate of the optimal Lagrange multiplier y?. Fol-
lowing [162, Theorem 3.1], it can be shown that the optimal primal-dual pair (x?, z?; y?) of
optimization problem (2.5) is a stationary point of Mµ(x, z; y, y?). Furthermore, for any fixed
λ, Mµ is a convex function of x, z, and y and it has a unique global minimizer.
In contrast to [162], we study problems in which a component of the objective function is not
differentiable. As in Theorem 2.3.1, the Moreau envelope associated with the nondifferentiable
component g allows us to eliminate the dependence of the primal-dual augmented Lagrangian
Mµ on z,
z?µ(x; y, λ) = argminz
Mµ(x, z; y, λ)
= proxµ2 g
(Tx + µ2 (2λ − y))
and to express Mµ as a continuously differentiable function,
Mµ(x; y, λ) := Mµ(x, z?µ(x; y, λ); y, λ) = f(x) +Mµ2 g
(Tx + µ
2 (2λ − y))
+ µ4 ‖y‖
2 − µ2 ‖λ‖
2.
For notational compactness, we suppress the dependence on λ and writeMµ(w) when λ is fixed.
79
Remark 4.3.1. The primal-dual augmented Lagrangian is not a Lyapunov function unless
λ = y?. We establish convergence by minimizing Mµ(x; y, λ) over (x; y) – a convex problem –
while adaptively updating the Lagrange multiplier estimate λ.
In [162, 164], the authors obtain a search direction using the Hessian of the merit function,
∇2Mµ. Instead of implementing an analogous update using generalized Hessian ∂2Mµ, we take
advantage of the efficient inversion of ∂2Lµ (see Section 4.3.2) to define the update
∂2Lµ(wk) w = − blkdiag(I,−I)∇M2µ(wk) (4.18)
where the identity matrices are sized conformably with the dimensions of x and y, and
∇M2µ(w) =
∇f(x) + TT∇Mµg(Tx+ µ(2λ− y))
µ(y −∇Mµg(Tx+ µ(2λ− y)))
. (4.19)
Multiplication by blkdiag(I,−I) is used to ensure descent in the dual direction and M2µ is
employed because z?µ(x; y, λ) is determined by the proximal operator associated with (µ/2)g.
When λ = y, ∇xM2µ = ∇xLµ, ∇yM2µ = −∇yLµ, and (4.18) becomes equivalent to the second
order update (4.5).
Lemma 4.3.2. Let w solve (4.18). Then, for the fixed value of the Lagrange multiplier estimate
λ and any σ ∈ (0, 1],
d := (1 − σ) w − σ∇M2µ(w, λ) (4.20)
is a descent direction of the merit function M2µ(w, λ).
Proof. By multiplying (4.18) with the nonsingular matrix
Π :=
I − 1µ T
T
0 I
(4.21)
we can express it as H TT
(I − P )T −µP
x
y
=
−(∇f(x) + TT y)
∇yM2µ(x; y, λ)
(4.22)
80
where H := ∇2f(x) 0. Using (4.19) and (4.22), ∇M2µ(w) can be expressed as,
∇M2µ(w) =
−(H + 1µT
T (I − P )T )x − TT (I − P )y
(I − P )T x − µP y
.Thus,
wT∇M2µ(w) = −xT (H + 1µT
T (I − P )T )x − µyTP y
is negative semidefinite, and the inner product
dT∇M2µ(w) = (1− σ) wT∇M2µ(w) − σ‖∇M2µ(w)‖2
is negative definite when ∇M2µ is nonzero.
4.3.2 Second order primal-dual algorithm
We now develop a customized algorithm that alternates between minimizing the merit function
Mµ(x; y, λ) over (x; y) and updating λ. Near the optimal solution, the algorithm approaches sec-
ond order updates (4.5) with unit step size, leading to local (quadratic) superlinear convergence
for (strongly) semismooth proxµg.
Our approach builds on the sequential quadratic programming method described in [162–164]
and it uses the primal-dual augmented Lagrangian as a merit function to assess progress of
iterates to the optimal solution. Inspired by [184], we ensure sufficient progress with damped
second order updates.
The following two quantities
r := Tx − proxµg(Tx + µy)
s := Tx − proxµg(Tx + µ (2λ − y))
appear in the proof of global convergence. Note that r is the primal residual of optimization
problem (2.5) and that ∇M2µ can be equivalently expressed as
∇M2µ(w) =
∇f(x) + 1µ T
T (s+ µ(2λ− y))
−(s + 2µ(λ− y))
. (4.23)
81
Global convergence
We now establish global convergence of Algorithm 3 under an assumption that the sequence
of gradients generated by the algorithm, ∇f(xk), is bounded. This assumption is standard for
augmented Lagrangian based methods [164,185] and it does not lead to a loss of generality when
f is strongly convex.
Theorem 4.3.3. Let Assumption 1 hold and let the sequence ∇f(xk) resulting from Algo-
rithm 3 be bounded. Then, the sequence of iterateswk
converges to the optimal primal-dual
point of problem (2.5) and the Lagrange multiplier estimates λk converge to the optimal La-
grange multiplier.
Proof. Since V2µ(w, λ) is convex in w for any fixed λ, condition (4.26) in Algorithm 3 will be
satisfied after finite number of iterations. Combining (4.26) and (4.23) shows that sk+2µk(λk−yk) → 0 and ∇f(xk) + 1
µkTT (sk + µ(2λk − yk)) → 0. Together, these statements imply that
the dual residual ∇f(xk) + TT yk of (2.5) converges to zero.
To show that the primal residual rk converges to zero, we first show that sk → 0. If Step 2a
in Algorithm 3 is executed infinitely often, sk → 0 since it satisfies (4.25) at every iteration and
η ∈ (0, 1). If Step 2a is executed finitely often, there is k0 after which λk = λk0 . By adding and
subtracting 2µk∇f(xk) + TT sk + 4µkTT (λk0 − yk) and rearranging terms, we can write
TT sk = 2µk(∇f(xk) + 1µkTT (sk + µk(2λk0 − yk)))
− 2µk∇f(xk) − TT (sk + 2µk(λk0 − yk)) − 2µkTTλk0 .
Taking the norm of each side and applying the triangle inequality, (4.26) and (4.23) yields
‖TT sk‖ ≤ 2µkεk + 2µk‖∇f(xk)‖+ ‖TT ‖εk + 2µk‖TTλk0‖. (4.24)
This inequality implies that TT sk → 0 because ∇f(xk) is bounded, εk → 0, and µk → 0. Since
T has full row rank, TT has full column rank and it follows that sk → 0.
Substituting sk → 0 and ∇f(xk) +TT yk → 0 into the first row of (4.23) and applying (4.26)
implies λk → yk. Thus, sk → rk, implying that the iterates asymptotically drive the primal
residual rk to zero, thereby completing the proof.
Remark 4.3.4. Despite the assumption that ∇f(xk) is bounded, Theorem 4.3.3 can be used
to ensure global convergence whenever f is strongly convex. We show in Lemma 4.5.1 in Sec-
tion 4.5.2 that a bounded set Cf containing the optimal point can be identified a priori. One
82
can thus artificially bound ∇f(x) for all x 6∈ Cf to satisfy the conditions of Theorem 4.3.3 and
guarantee global convergence to the solution of (2.1).
Algorithm 3 Second order primal-dual algorithm for nonsmooth composite optimization.
input: Initial point w0 = (x0, y0), and parameters η ∈ (0, 1), β ∈ (0, 1), τa, τb ∈ (0, 1), εk ≥ 0such that εk → 0.initialize: Set λ0 = y0.
Step 1: If‖sk‖ ≤ η‖sk−1‖ (4.25)
go to Step 2a. If not, go to Step 2b.Step 2a: Set
µk+1 = τaµk, λk+1 = yk
Step 2b: Setµk+1 = τbµ
k, λk+1 = λk
Step 3: Using a backtracking line search, perform a sequence of inner iterations to choosewk+1 until
‖∇M2µk+1(wk+1, λk+1)‖ ≤ εk (4.26)
where the search direction d is obtained using (4.20)–(4.22) with
σ = 0,(wk)T∇M2µk+1(wk)
‖∇M2µk+1(wk)‖2≤ −β, (4.27a)
σ ∈ (0, 1], otherwise. (4.27b)
Asymptotic convergence rate
The invertibility of the generalized Hessian ∂2Lµ(w) allows us to establish local convergence
rates for the second order updates (4.5) when proxµg is (strongly) semismooth.
We now show that the updates in Algorithm 3 are equivalent to the second order updates (4.5)
as k → ∞. Thus, if proxµg is (strongly) semismooth, the sequence of iterates generated by
Algorithm 3 converges (quadratically) superlinearly to the optimal point in some neighborhood
of it.
Theorem 4.3.5. Let the conditions of Theorem 4.3.3 hold, let proxµg be (strongly) semismooth,
and let εk be such that ‖wk − w?‖ = O(εk). Then, in a neighborhood of the optimal point w?,
the iterates wk converge (quadratically) superlinearly to w?.
Proof. From Theorem 4.3.3, λk → yk and thus ∇M2µ(wk) → ∇Lµ(wk). Descent of the Lya-
punov function in Theorem 4.2.1 therefore implies that the update in Step 3 of Algorithm 3 is
83
given by (4.27a), which is equivalent to (4.5) because λ = y. The assumption on εk in con-
junction with Proposition 4.1.1, and [182, Theorem 3.2] imply that this update asymptotically
satisfies (4.26) in one iteration with a unit step size. Therefore, Step 3 reduces to (4.5) in some
neighborhood of the optimal solution and Proposition 4.1.1 implies that wk converges to w?
(quadratically) superlinearly.
Efficient computation of the Newton direction
When g is (block) separable, the matrix P in (4.4) is (block) diagonal. We next demonstrate
that the solution to (4.22) can be efficiently computed when T = I and P is a sparse diagonal
matrix whose entries are either 0 or 1. The extensions to a low rank P , to a P with entries
between 0 and 1, or to a general diagonal T follow from similar arguments.
These conditions occur, for example, when g(z) = γ‖z‖1. The matrix P is sparse when
proxµg(x + µy) = Sγµ(x + µy) is sparse. Larger values of γ are more likely to produce a
sequence of iterates wk for which P is sparse and thus the second order search directions (4.22)
are cheaper to compute.
We can write (4.22) as H I
I − P −µP
x
y
=
ϑ
θ
, (4.28a)
permute it according to the entries of P which are 1 and 0, respectively, and partition the
matrices H, P , and I − P conformably such that
H =
H11 H12
HT12 H22
, P =
I
0
.Let v denote either the primal variable x or the dual variable y. We use v1 to denote the
subvector of v corresponding to the entries of P which are equal to 1 and v2 to denote the
subvector corresponding to the zero diagonal entries of P .
Note that (I − P )v = 0 when v2 = 0 and Pv = 0 when v1 = 0. As a result, x2 and y1 are
explicitly determined by the bottom row of the system of equations (4.28a), 0 −µII 0
x2
y1
=
θ1
θ2
(4.28b)
84
Substitution of the subvectors x2 and y1 into (4.28a) yields,
H11x1 = ϑ1 + H12x2 + y1 (4.28c)
which must be solved for x1. Finally, the computation of y2 requires only matrix-vector products,
y2 = − (ϑ2 + H21x1 + H22x2) . (4.28d)
Thus, the major computational burden in solving (4.22) lies in performing a Cholesky factor-
ization to solve (4.28c), where H11 is a matrix of a much smaller size than H.
4.4 Computational experiments
In this section, we illustrate the merits and the effectiveness of our approach. We first apply our
algorithm to the `1-regularized least squares problem and then study a system theoretic problem
of controlling a spatially-invariant system. All computations were performed in Matlab R2014b
on a 2012 Macbook Air with a 2GHz Intel Core i7 processor and 8GB of RAM. All random
computational experiments are averaged over 20 trials.
4.4.1 Example: `1-regularized least squares
The LASSO problem (2.4) regularizes a least squares objective with a γ-weighted `1 penalty, As
described in Section 2.2.2, the associated proximal operator is given by soft-thresholding Sγµ,
the Moreau envelope is the Huber function, and its gradient is the saturation function. Thus,
P ∈ P is diagonal and Pii is 0 when |xi + µyi| < γµ, 1 outside this interval, and between 0 and
1 on the boundary. Larger values of the regularization parameter γ induce sparser solutions for
which one can expect a sparser sequence of iterates. Note that we require strong convexity of
the least squares penalty; i.e., that ATA is positive definite.
In Fig. 4.1, we show the distance of the iterates from the optimal for the standard proximal
gradient algorithm ISTA, its accelerated version FISTA, and our customized second order primal-
dual algorithm for a problem where ATA has condition number 3.26 × 104. We plot distance
from the optimal point as a function of both iteration number and solve time. Although our
method always requires much fewer iterations, it is most effective when γ is large. In this case
the most computationally demanding step (4.28c) required to determine the second order search
direction (4.22) involves a smaller matrix inversion; see Section 4.3.2 for details. In Fig. 4.2,
85
γ = 0.15γmax γ = 0.85γmax
‖x−x?‖
iteration iteration
‖x−x?‖
time (s) time (s)
Figure 4.1: Distance from optimal solution as a function of the iteration number and solve timewhen solving LASSO for two values of γ using ISTA, FISTA, and our algorithm (2ndMM).
86
com
pu
tati
on
tim
e(s
)
Percent of γmax
Figure 4.2: Solve times for LASSO with n = 1000 obtained using ISTA, FISTA, and ouralgorithm (2ndMM) as a function of the sparsity-promoting parameter γ.
γ = 0.15γmax γ = 0.85γmax
com
pu
tati
on
tim
e(s
)
n n
Figure 4.3: Comparison of our algorithm (2ndMM) with state-of-the-art methods for LASSOwith problem dimension varying from n = 100 to 2000.
87
we show the solve times for n = 1000 as the sparsity-promoting parameter γ ranges from 0
to γmax = ‖AT b‖, where γmax yields a zero solution. All numerical experiments consist of
20 averaged trials.
In Fig. 4.3, we compare the performance of our algorithm with the LASSO function in
Matlab (a coordinate descent method [186]), SpaRSA [187], an interior point method [188], and
YALL1 [189]. Problem instances were randomly generated with A ∈ Rm×n, n ranging from 100
to 2000, m = 3n, and γ = 0.15γmax or 0.85γmax. The solve times and scaling of our algorithm is
competitive with these state-of-the-art methods. For larger values of γ, the second order search
direction (4.22) is cheaper to compute and our algorithm is the fastest.
4.4.2 Example: Distributed control of a spatially-invariant system
We now apply our algorithm to a structured control design problem aimed at balancing closed-
loop H2 performance with spatial support of a state-feedback controller. Following the problem
formulation of [64], ADMM was used in [106,190] to design sparse feedback gains for spatially-
invariant systems. Herein, we demonstrate that our algorithm provides significant computational
advantage over both the ADMM algorithm and a proximal Newton scheme.
Spatially-invariant systems
Let us consider
ψ = Aψ + u + d
ζ =
Q1/2 ψ
R1/2 u
(4.29)
where ψ, u, d, and ζ are the system state, control input, white stochastic disturbance, and
performance output and A, Q 0, and R 0 are n × n circulant matrices. Such systems
evolve over a discrete spatially-periodic domain; they can be used to model spatially-invariant
vehicular platoons [97] and can result from a spatial discretization of fluid flows [38].
Any circulant matrix can be diagonalized via the discrete Fourier transform (DFT). Thus,
the coordinate transformation ψ := T ψ, u := T u, d := T d, where T−1 is the DFT matrix, brings
the state equation in (4.29) into,
˙ψ = A ψ + u + d. (4.30)
Here, A := T−1AT is a diagonal matrix whose main diagonal a is determined by the DFT of
88
the first row a of the matrix A,
ak :=
n−1∑i= 0
ai e−j2πikn , k ∈ 0, . . . , n− 1.
We are interested in designing a structured state-feedback controller, u = −Zψ, that min-
imizes the closed-loop H2 norm, i.e., the variance amplification from the disturbance d to the
regulated output ζ. Since the optimal unstructured Z for spatially-invariant system (4.29) is a
circulant matrix [9], we restrict our attention to circulant feedback gains Z. Thus, Z can also
be diagonalized via a DFT and we equivalently take x := z as our optimization variable where
T−1ZT = diag (z).
For simplicity, we assume that A and Z are symmetric. In this case, a and z are real vectors
and the closed-loop H2 norm of system (4.29) takes separable form f2(x) =∑n−1k= 0 f
k(xk),
fk(xk) =
qk + rkx
2k
2 (xk − ak), xk > ak
∞, otherwise
where xk > ak guarantees closed-loop stability. To promote sparsity of Z, we consider a regu-
larized optimization problem (2.5),
minimizex, z
f2(x) + γ ‖z‖1
subject to Tx − z = 0(4.31a)
where γ is a positive regularization parameter, T is the inverse DFT matrix, and z ∈ Rn denotes
the first row of the symmetric circulant matrix Z. Formulation (4.31a) signifies that while it
is convenient to quantify the H2 norm in the spatial frequency domain, sparsity has to be
promoted in the physical space.
By solving (4.31a) over a range of γ, we identify distributed controller structures which are
specified by the sparsity pattern of the solutions z?γ to (4.31a) at different values of γ. After
selecting a controller structure associated with a particular value of γ, we solve a ‘polishing’ or
‘debiasing’ problem,
minimizex,z
f2(x) + Isp(z?γ)(z)
subject to Tx − z = 0(4.31b)
where Isp(z?γ)(z) is the indicator function associated with the sparsity pattern of z?γ . The solution
to this problem is the optimal controller for system (4.29) with the desired structure, i.e., the
89
same sparsity pattern as z?γ . This step is necessary because the `1 norm in (4.31a) imposes an
additional penalty on z that compromises closed-loop performance.
Implementation
The elements of the gradient of f are
dfk(xk)
dxk=
rkx2k − 2akrkxk − qk2 (xk − ak)2
the Hessian ∇2f2 is a diagonal matrix with non-zero entries,
d2fk(xk)
dx2k
=qk + rka
2k
(xk − ak)3
and the proximal operator associated with the nonsmooth regularizer in (4.31a) is given by
soft-thresholding Sγµ.
While the optimal unstructured controller can be obtained by solving n uncoupled scalar
quadratic equations for xk, sparsity-promoting problem (4.31a) is not in a separable form (be-
cause of the linear constraint) and computing the second order update (4.5) requires solving a
system of equations H T ∗
(I − P )T −µP
x
y
=
ϑ
θ
(4.32)
Pre-multiplying by the matrix blkdiag(T, I) and changing variables to solve for z := T x brings (4.5)
into T H T−1 1n I
I − P −µP
z
y
=
Tϑ
θ
which is of the same form as (4.28a). This equation can be solved efficiently when P is sparse
(i.e., γµ is large; cf. Section 4.3.2) and x can be recovered from z via FFT.
Since the Hessian H of a separable function f is a diagonal matrix, the search direction can
also be efficiently computed when I − P is sparse (i.e., γµ is small). As in Section 4.3.2, the
component of y in the support of P is determined from the bottom row of (4.32). The top row
of (4.32) implies x = H−1(ϑ− T ∗y) and substitution into the bottom row yields
(I − P )T H−1 T ∗ (I − P ) y = θ
where θ := (I − P )(TH−1(ϑ − T ∗P y) − θ) is a known vector. Thus, the component of y in
90
the support of I − P can be determined by inverting a matrix whose size is determined by the
support of I−P and x is readily obtained from y and ϑ. The operations involving T and T ∗ can
be performed via FFT; since H is diagonal, multiplication by these matrices is cheap and the
computational burden in solving (4.5) again arises from a limited matrix inversion. In contrast
to Section 4.3.2, the computation of the search direction using this approach is efficient when
I − P is sparse, i.e., γµ is small.
Swift-Hohenberg equation
We consider the linearized Swift-Hohenberg equation [191],
∂tψ(t, ξ) = (cI − (I + ∂ξξ)2)ψ(t, ξ) + u(t, ξ) + d(t, ξ)
with periodic boundary conditions on a spatial domain ξ ∈ [−π, π]. Finite-dimensional approxi-
mation and diagonalization via the DFT (with an even number of Fourier modes n) yields (4.30)
with ak = c− (1− k2)2 where k = −n/2 + 1, . . . , n/2 is a spatial wavenumber.
Figure 4.4a shows the optimal centralized controller and solutions to (4.31a) for c = −0.01,
n = 64, Q = R = I, and γ = 4× 10−4, 4× 10−3, and 4. As further illustrated in Fig. 4.4b, the
optimal solutions to (4.31a) become sparser as γ is increased.
In Fig. 4.5, we demonstrate the utility of using regularized problems to navigate the tradeoff
between controller performance and structure, an approach pioneered by [64]. The polished op-
timal structured controllers (solid blue ——) were designed by first solving (4.31a) to identify
an optimal structure and then solving (4.31b) to further improve the closed-loop performance. To
illustrate the importance of polishing step (4.31b), we also show the closed-loop performance of
unpolished optimal structured controllers (dashed red - -×- -) resulting from (4.31a). Finally,
to evaluate the controller structures identified by (4.31a), we show the closed-loop performance of
polished ‘reference’ structured controllers (dotted yellow · · ·+ · · · ). Instead of solving (4.31a),
the reference structures are a priori specified as nearest neighbor symmetric controllers of the
same cardinality as controllers resulting from (4.31a). Among controllers with the same num-
ber of nonzero entries, the polished optimal structured controller consistently achieves the best
closed-loop performance.
We compare the computational efficiency of our approach with the proximal Newton
method [173] and ADMM [106, 190]. The proximal Newton method requires solving a LASSO
subproblem (4.3), for which we employ SpaRSA [187]. Since A is circulant, the x-minimization
step in ADMM (2.6) requires solving n uncoupled cubic scalar equations. In general, when A
91
is block circulant, the DFT only block diagonalizes the dynamics and thus the x-minimization
step has to be solved via an iterative procedure [106,190].
Figure 4.6 shows the time to solve (4.31a) with γ = 0.004 using our method, proximal
Newton, and ADMM. Our algorithm and ADMM were stopped when the primal and dual
residuals were below 1 × 10−8. The proximal Newton method was stopped when the norm of
the difference between two consecutive iterates was smaller than 1× 10−8. In Fig. 4.7, we show
the per iteration cost and the total number of iterations required to find the optimal solution
using each method. Our algorithm clearly outperforms proximal Newton and ADMM.
Although proximal Newton requires a similar number of iterations, the LASSO subprob-
lem (4.3) that determines its search direction is much more expensive; this increases the compu-
tation cost of each iteration and slows the overall algorithm. Moreover, for larger problem sizes,
the proximal Newton method struggles with finding a stabilizing search direction because ∇2f2
seems to bring it away from the set of stabilizing feedback gains. It appears that our method
circumvents this issue because its iterates lie in a larger lifted space in which stability is easier
to enforce via backtracking.
On the other hand, while the x- and z- minimization steps in ADMM are quite efficient, as a
first order method ADMM requires a large number of iterations to reach high-accuracy solutions.
Our algorithm achieves better performance because its use of second order information leads to
relatively few iterations and the structured matrix inversion leads to efficient computation of
the search direction.
4.5 Additional considerations
4.5.1 Algorithm based on V (w) as a merit function
Since Theorem 4.2.4 establishes global convergence of the differential inclusion, one algorithmic
approach is to implement a Forward Euler discretization of differential inclusion (4.7). A natural
choice of merit function is the Lyapunov function V defined in (4.8). A simple corollary of
Theorem 4.2.1 shows that the second order update (4.5) is a descent direction for V .
Corollary 4.5.0.1. The second order update (4.5) is a descent direction for the merit function
V defined in (4.8).
Proof. Follows from (4.9) in Theorem 4.2.1.
Corollary 4.5.0.1 enables the use of a backtracking Armijo rule for step size selection. A
92m
idd
lero
wofZ
(a)
spars
ity
level
ofz? γ
γ
(b)
Figure 4.4: (a) The middle row of the circulant feedback gain matrix Z; and (b) the sparsitylevel of z?γ (relative to the sparsity level of the optimal centralized controller z?0) resulting fromthe solutions to (4.31a) for the linearized Swift-Hohenberg equation with n = 64 Fourier modesand c = −0.01.
natural choice of stopping criterion for such an algorithm is a condition on the size of the primal
and dual residuals. Moreover, proposition 4.1.1 suggests fast asymptotic convergence when
proxµg is semismooth. The LASSO example in Fig. 4.8 verifies this intuition when solving
LASSO to a threshold of 1× 10−8 for the primal and dual residuals.
Algorithm 4 Second order primal-dual algorithm for nonsmooth composite optimization basedon discretizing (4.7).
input: Initial point x0, y0, backtracking constant α ∈ (0, 1), Armijo parameter σ ∈ (0, 1),and stopping tolerances ε1, ε2.While: ‖Txk − proxµg(Tx
k + µyk)‖ > ε1 or
‖∇f(xk) − TT yk‖ > ε2
Step 1: Compute wk as defined in (4.5)Step 2: Choose the smallest j ∈ Z+ such that
V (wk + αjwk) ≤ V (wk) − σ αj ‖∇Lµ(wk)‖2
Step 3: Update the primal and dual variables
wk+1 = wk + αjwk
However, such an implementation would require a fixed penalty parameter µ, which typically
has a large effect on the convergence speed of augmented Lagrangian algorithms and is difficult
to select a priori . Moreover, stability of the solution to a differential equation does not always
93
per
form
an
ced
egra
dati
on
(in
per
cents
)
sparsity level of z (in percents)
Figure 4.5: Performance degradation (in percents) of structured controllers relative to the op-timal centralized controller: polished optimal structured controller obtained by solving (4.31a)and (4.31b) (solid blue ——); unpolished optimal structured controller obtained by solvingonly (4.31a) (dashed red - -×- -); and optimal structured controller obtained by solving (4.31b)for an a priori specified nearest neighbor reference structure (dotted yellow · · ·+ · · · ).
94
tim
e
n
Figure 4.6: Total time to compute the solution to (4.31a) with γ = 0.004 using our algorithm(2ndMM), proximal Newton, and ADMM.
95avg
iter
ati
on
tim
e
n
(a)
nu
mb
erof
iter
ati
on
s
n
(b)
Figure 4.7: Comparison of (a) times to compute an iteration (averaged over all iterations); and(b) numbers of iterations required to solve (4.31a) with γ = 0.004.
imply stability of its discretization.
An example
We implement Algorithm 4 to solve the LASSO problem (2.4) studied in Section 4.4. The
LASSO problem was randomly generated with n = 500, m = 1000, and γ = 0.85γmax. Figure 4.8
illustrates the quadratic asymptotic convergence of Algorithm 4 and a strong influence of µ.
4.5.2 Bounding ∇f(x) for Algorithm 3
A bounded set Cf containing the solution to (2.1) can always be identified from any point x,
∇f(x), any element of the subgradient ∂g(x), and a lower bound on the parameter of strong
convexity.
Lemma 4.5.1. Let Assumption 1 hold and let f be mf -strongly convex. Then, for any x, the
optimal solution to (2.1) lies within the ball of radius 2mf‖∇f(x) + TT∂g(T x)‖ centered at x.
Proof. Given any point x, strong convexity of f and convexity of g imply that,
f(x) + g(Tx) − f(x) − g(T x) ≥⟨∇f(x) + TT∂g(T x), x − x
⟩+
mf2 ‖x − x‖2 (4.33)
for all x where ∂g(T x) is any member of the subgradient of g(T x). For any x with ‖x − x‖ ≥2mf‖∇f(x) + TT∂g(T x)‖, the right-hand side of (4.33) must be nonnegative which implies that
f(x) + g(Tx) ≥ f(x) + g(T x) and thus that x cannot solve (2.1).
96
‖x−x?‖
Iteration
Figure 4.8: Distance from the optimal solution as a function of iteration number when solvingLASSO using Algorithm 4 for different values of µ.
For any strongly convex function f and convex function g, a function f can be selected such
that
argmin f(x) + g(Tx) = argmin f(x) + g(Tx).
Here, f is identical to f in some closed set Cf containing x?,
f(x) :=
f(x), x ∈ Cf
f(x), x 6∈ Cf
and f(x) is chosen such that f(x) is convex and twice differentiable, ∇f(x) is uniformly bounded,
and ∇2f(x) 0. Lemma 4.5.1 implies that a set Cf that contains the optimal solution x?
of (2.1) can be identified a priori from any given point x. Since f(x) satisfies all the conditions
of Theorem 4.3.3, Algorithm 3 can be used to solve (2.1) and, since Cf contains x?, its optimal
solution will also solve (2.1).
97
An example
When x ∈ R, f(x) = 12x
2, and Cf = [−1, 1], a potential choice of f(x) is given by,
f(x) =
−2x + ex+1 − 2.5, x ≤ −1
2x + e−x+1 − 2.5, x ≥ 1.
For this choice of f , the gradient of f is continuous and bounded,
∇f(x) =
−2 + ex+1, x ≤ −1
x, x ∈ [−1, 1]
2− e−x+1, x ≥ 1
and the Hessian of f is determined by
∇2f(x) =
ex+1, x ≤ −1
1, x ∈ [−1, 1]
e−x+1, x ≥ 1.
Chapter 5
Connections with other methods
and discussion
The proximal augmented Lagrangian Lµ(x; y) is obtained by constraining Lµ(x, z; y) to the
manifold
Z := (x, z?µ; y) | z?µ = argminz
Lµ(x, z; y)
= (x, z?µ; y) | Tx + µy ∈ z?µ + µ∂Cg(z?µ)
which results from the explicit minimization over the auxiliary variable z. Herein, we interpret
the second order search direction as a linearized update to the KKT conditions for problem (2.5)
and discuss connections to the alternative algorithms.
5.1 Second order updates as linearized KKT corrections
The second order update (4.5) can be viewed as a first order correction to the KKT conditions
for optimization problem (2.5),
0 = ∇f(x) + TT y
0 = Tx − z
0 ∈ ∂Cg(z) − y.
(5.1)
Substitution of z?µ into (5.1) makes the last two conditions redundant; when combined with the
definition of the manifold Z, Tx = z implies y ∈ ∂Cg(z) and y ∈ ∂Cg(z) implies Tx = z.
98
99
Multiplying equation (4.5) for the second order update with the nonsingular matrix (4.21)
recovers the equivalent expression H TT
(I − P )T −µP
xk
yk
= −
∇f(xk) + TT yk
rk
(5.2)
where rk := Txk − z?µ(xk; yk) = Txk − proxµg(Txk + µyk) is the primal residual of (2.5).
Thus, (5.2) describes a first order correction to the first and second KKT conditions in (5.1).
5.2 Connections with other methods
We now discuss broader implications of our framework and draw connections to the existing
methods for solving versions of (2.1). Many techniques for solving composite minimization
problems of the form (2.1) can be expressed in terms of functions embedded in the augmented
Lagrangian; see Table 5.1. Trivially, the objective function in (2.1) corresponds to Lµ(x, z; y)
over the manifold z = Tx. The Lagrange dual of a problem equivalent to (2.5),
minimizex, z
f(x) + g(z) + 12µ ‖Tx − z‖2
subject to Tx − z = 0(5.3)
is recovered by collapsing Lµ(x, z; y) onto the intersection of the z-minimization manifold Z and
the x-minimization manifold,
X := (x?, z; y) | x? = argminx
Lµ(x, z; y)
= (x?, z; y) | µ (∇f(x?) + TT y) = TT (z − Tx?).
The Lagrange dual of (2.5) is recovered from the Lagrange dual of (5.3) in the limit µ→∞.
5.2.1 MM and ADMM
MM implements gradient ascent on the dual of (5.3) by collapsing Lµ(x, z; y) computationally
onto X ∩Z. The joint (x, z)-minimization step in (2.7) evaluates the Lagrange dual at discrete
iterates yk by finding the corresponding (x, z)-pair on X ∩ Z; i.e., the iterate (xk+1, zk+1; yk)
generated by (2.7) lies on the manifold X ∩Z. Note that, in this form, joint (x, z)-minimization
is a challenging nondifferentiable optimization problem.
ADMM avoids this challenge by collapsing Lµ(x, z; y) onto X and Z separately. While
100
the underlying x- and z-minimization subproblems in ADMM are relatively simple, the iterate
(xk+1, zk+1; yk) generated by (2.6) does not typically lie on the X ∩ Z manifold. Thus ADMM
does not represent gradient ascent on the dual of (5.3), causing looser theoretical guarantees
and often worse practical performance.
By collapsing the augmented Lagrangian onto Z, the proximal augmented Lagrangian (2.8)
allows us to express the (x, z)-minimization step in MM as a tractable continuously differen-
tiable problem (cf. Theorem 2.3.1). This avoids challenges associated with ADMM and it does
not increase computational complexity in the x-minimization step in (2.7) relative to ADMM
when using first order methods. We finally note that unlike the Rockafellar’s proximal method
of multipliers [192] which applies the proximal point algorithm to the primal-dual optimality
conditions, our framework reformulates the standard method of multipliers and develops second
order algorithm to solve nonsmooth composite optimization problems.
5.2.2 First order methods
A special instance of our framework has strong connections with the existing methods for dis-
tributed optimization on graphs; e.g., [146, 147, 193]. The networked optimization problem of
minimizing f(x) =∑f i(x) over a single variable x can be reformulated as
∑f i(xi) + g(Tx)
where the components f i of the objective function are distributed over independent agents xi,
x is the aggregate state, TT is the incidence matrix of a strongly connected and balanced graph,
and g is the indicator function associated with the set Tx = 0. The g(Tx) term ensures that at
feasible points, xi = xj = x for all i and j. It is easy to show that ∇Mµg(v) = 1µ v and that the
primal-descent dual-ascent dynamics 3.1 are given by,
x = −∇f(x) − 1µ Lx − y
˙y = Lx(5.4)
where y := TT y and L := TTT is the graph Laplacian. The only difference relative to [193,
Equation 11] is that −y appears instead of −Ly in the equation for the dynamics of the primal
variable x.
We also include a comparison with the EXTRA algorithm [194]. The discretized version of
the first order primal-dual dynamics (5.4) is equivalent to
xk+1 = xk − αx(∇f(xk) + 1µLx
k + yk)
yk+1 = yk + αyLxk
(5.5)
101
x z y function methods
z = Tx - objective function of (2.1) subgradient iteration [134]- - - augmented Lagrangian ADMM [64,132]X Z - Lagrange dual of (5.3) Dual ascent (if explicit
expression available)- Z - proximal augmented Lagrangian MM [107], Arrow-Hurwicz-Uzawa [107],
Second order primal-dual- Z Y Forward-Backward Envelope Forward-Backward Truncated
Newton [176], related toproximal gradient [176, Section 2.1]
Table 5.1: Summary of different functions embedded in the augmented Lagrangian of (2.5) andmethods for solving (2.1) based on these functions.
where y := TT y, L := TTT is the graph Laplacian, and αx and αy are the primal and dual step
sizes, respectively. By noting that that yk =∑k−1t=0 (W − W )xt, the EXTRA algorithm [194,
Equation (2.13)],
xk+1 = Wxk − α∇f(xk) + 1µLx
k +
k−1∑t=0
(W − W )xt
is recovered exactly from (5.5) by setting αx = α and αy = α2µ and taking W = I − α
µL,
W = 12 (I + W ). Of course, EXTRA does not require that W be stated in the form of a
Laplacian, but we can take T = (I −W )1/2 and still recover EXTRA.
5.2.3 Second order methods
We identify the saddle point of Lµ(x, z; y) by forming second order updates to x and y along
the Z manifold. Just as Newton’s method approximates an objective function with a convex
quadratic function, we approximate Lµ(x; y) with a quadratic saddle function.
Constraining the dual variable itself yields connections with other methods. When T =
I, (5.1) implies that the optimal dual variable is given by y? = −∇f(x?), so it is natural to
collapse the augmented Lagrangian onto the manifold
Y := (x, z; y?) | y? = −∇f(x).
The augmented Lagrangian over the manifold Z ∩Y corresponds to the Forward-Backward En-
velope (FBE) introduced in [176]. The proximal gradient algorithm with step size µ can be
102
recovered from a variable metric gradient iteration on the FBE [176]. In [176–178], the approx-
imate line search and quasi-Newton methods based on the FBE were developed to solve (2.1)
with T = I. Since the Hessian of the FBE involves third order derivatives of f , these techniques
employ either truncated- or quasi-Newton methods.
The approach advanced in the current paper applies a second order method to the aug-
mented Lagrangian that is constrained over the larger manifold Z. Relative to alternatives, our
framework offers several advantages. First, while the FBE is in general a nonconvex function of
x, Lµ(x; y) is always convex in x and concave in y. Furthermore, we can compute the Hessian
exactly using only second order derivatives of f and its structure allows for efficient computation
of the search direction. Finally, our formulation allows us to leverage recent advances in second
order methods for augmented Lagrangian methods, e.g., [162–164].
Part II
Structured optimal control
103
Chapter 6
Symmetry and spatial invariance
In this chapter, we propose a principled convex approach for structured H2 and H∞ optimal
controller design. We consider problems where the dynamical generator is an affine function of
the design variable. When the dynamical generator is symmetric, we show that the H2 and H∞norms of the closed-loop system are convex functions of the design variable. This allows us to
impose regularization penalties directly on the design variable in order to promote structure.
We then show that even when the dynamical generator is not symmetric, its symmetric
component can be used to provide an upper bound on its H2 and H∞ norm. Thus, we can
design structured controllers by solving the convex design problem for the symmetric component
of the system and implement the resulting controllers on the original system. We show that this
procedure guarantees stability and a level of H2/H∞ performance.
Although convex, the H2 and H∞ norms for symmetric systems are expressed using linear
matrix inequalities (LMIs) and the semidefinite programs (SDPs) that result from the H2 and
H∞ optimal control problems do not scale favorably with the problem dimension. To address
this challenge, we develop a customized ADMM algorithm and provide a procedure to gain com-
putational advantage when the system and its controller are jointly block-diagonalizable. Such
a structure arises, for example, in spatially-invariant systems [9]. In [190], we took advantage of
this property to develop an efficient and scalable algorithm for sparsity-promoting H2 optimal
feedback design.
104
105
6.1 Symmetric systems
We consider the class of systems (1.1) with R(x) = 0 and B1 = I,
ψ = (A + F (x))ψ + d
ζ = ψ.(6.1)
Instead of limiting the standard measure of control effort, we regularize x directly by imposing
a quadratic penalty, xTRx with R 0, to limit the magnitude of x, and an `1 penalty, ‖x‖1 :=∑i |xi|, to promote sparsity in x. This is because in many motivating examples for this section,
it is of interest to regulate the size of the controller x itself and not its effect on the system state,
R(x)ψ. For example, in the combination drug therapy example first described in Section 1.2.1,
it makes sense to penalize the size of the drug dose but not the quantity of virus each drug is
killing.
Under the above assumptions, we can design controllers for (6.1) by designing controllers for
ψ = (As + Fs(x))ψ + d
ζ = ψ(6.2)
where As := (A + AT )/2 is the symmetric part of A and Fs(x) := (F (x) + FT (x))/2. Since
any matrix A can be decomposed into its symmetric As and antisymmetric components, any
system (6.1) can be placed into the form of (6.2)
We first show that the H2 and H∞ optimal control problems for (6.2) are convex in x and
thus that we can impose regularization direclty on the controller. We then show that stability
of the symmetric system (6.2) implies stability of the corresponding original system (6.1) and
that the H2 and H∞ norms of the symmetric system (6.2) are an upper bound on the H2 and
H∞ norms of the original system (6.1). We then show that when the difference between A and
As of the order ε, the H2 and H∞ norms of systems (6.1) and (6.2) differ only by O(ε2).
We note that this method is more appropriate when F (x) is symmetric. If Fs(x) 6= F (x),
the neglected effect of the asymmetric component of F (x) makes the degree of conservatism
unpredictable.
6.1.1 Convex formulation
The H2 and H∞ norms of (6.2) can be expressed in a convex form. Although more general
symmetric systems can be cast as convex problems, here we assume B = C = I and R(x) = 0
106
to facilitate the transition to the discussion of spectral properties and performance bounds.
H2-optimal control
The H2 norm of system (6.2) is given by trace (Ps) where,
(As + Fs(x)
)Ps + Ps
(As + Fs(x)
)+ I = 0.
Since As = ATs is symmetric, the controllability gramian of system (6.2) can be explicitly
expressed as,
Ps = − 12 (As + Fs(x))
−1
and, by taking a Schur complement, the regularized optimal H2 control problem can be cast in
a convex function of x and an auxiliary variable Θ,
minimizex,Θ
12 trace(Θ) + g(x)
subject to
Θ I
I −(As + Fs(x))
0.(6.3)
The matrix As + Fs(x) is always invertible when it is Hurwitz. We note the structured LQR
problem (i.e., R(x) = R1/2Fs(x)) for symmetric systems can also be expressed as an SDP by
taking the Schur complement of Fs(x)RFs(x). Finally, we drop the constant factor 1/2 for
compactness.
H∞-optimal control
To formulate the H∞ optimal control problem as an SDP, we first state a simple lemma about
system (6.2).
Proposition 6.1.1. The disturbance that achieves the maximum induced L2 amplification for
system (6.2) corresponds to the constant signal d(t) = v where v is the right principal singular
vector of A−1s . I.e., the peak of the frequency response of system (6.2) occurs at the temporal
frequency ω = 0.
Proof. A symmetric matrix can be diagonalized as As = UΛUT where Λ is a diagonal matrix
with the eigenvalues of As on the main diagonal and the columns of U contain the corresponding
107
eigenvectors. For such a matrix,
(jωI − As)−1 = U diag
1
jω−λi
UT .
It is clear that ω = 0 maximizes the singular values of the above matrix. Thus, the H∞ norm
of (6.2) can be characterized by σmax
(−(As + Fs(x))−1
).
The regularized H∞-optimal control problem for symmetric systems can therefore be ex-
pressed as,
minimizex,Θ
σmax(Θ) + g(x)
subject to
Θ I
I −(As + Fs(x))
0.
(6.4)
As we show in the next section, this convex problem can be used for structured H∞ control
design. This is particularly advantageous because many of the existing algorithms for general
structured H2 control cannot be extended to the structured H∞ problem because the H∞ norm
is not differentiable in general.
In Section 6.1.2, we show that stability of the symmetric system (6.2) implies stability of
the corresponding original system (6.1). Furthermore, the H2 and H∞ norms of the symmetric
system are upper bounds on the H2 and H∞ norms of the original system. Finally, when the
difference between A and As is small (of the order ε, O(ε)), the H2/H∞ norms of systems (6.1)
and (6.2) differ only by O(ε2).
This formulation can be applied to the design of edges in consensus networks, leader selection,
and combination drug therapy design for HIV.
6.1.2 Stability and performance
We now justify the use of the symmetric component (6.2) to design controllers for the original
system (6.1). First, we show that the stability of the symmetric system implies stability of the
full system.
Lemma 6.1.1. Let the symmetric part of A be Hurwitz. Then, A is Hurwitz.
Proof. We show this by contradiction. Since the symmetric part of A, As := (A + AT )/2 is
symmetric and Hurwitz, it is negative definite,
v∗Asv < 0 for all v 6= 0.
108
Assume that A is not Hurwitz. Then there is a v such that Av = λv with <(λ) ≥ 0. Furthermore,
v∗Av = λv∗v. However,
v∗Av = v∗Asv + 12 v∗ (A − AT ) v
= v∗Asv + j Im(λ) ‖v‖22.
Since As ≺ 0, v∗Asv cannot have a nonnegative real part.
Remark 6.1.2. Lemma 6.1.1 is not a necessary condition; A may be Hurwitz even if As is not.
6.1.3 Performance bounds
We next show that the H2 norm of the symmetric system (6.2) is an upper bound on the H2
norm of the system (6.1). First, we present a useful theorem from linear algebra [195].
Theorem 6.1.3 (Theorem IX.3.1 in [195]). Let A be any matrix and let As = 12 (A + AT ) be
the symmetric component of A. Then
‖eA‖ ≤ ‖eAs‖
for every unitarily invariant norm.
The statement about the H2 norms of systems (6.1) and (6.2) is a simple corollary of Theo-
rem 6.1.3.
Corollary 6.1.3.1. When the systems (6.1) and (6.2) are stable, the H2 norm of (6.1) is
bounded from above by the H2 norm of system (6.2).
Proof. Recall that the H2 norm of a system,
ψ = Aψ + d
with A Hurwitz is given by trace(Pc) where
APc + PcAT + I = 0
and that X can be expressed as
Pc =
∫ ∞0
eAteAT tdt.
109
Using the linearity of the trace and of integration, we can rewrite the expression for the H2
norm as,
trace
(∫ ∞0
eAteAT tdt
)=
∫ ∞0
‖eAt‖2Fdt.
Since the Frobenius norm is unitarily invariant, by Theorem 6.1.3 ‖eAt‖2F ≤ ‖eAst‖2F for any t
and therefore, ∫ ∞0
‖eAt‖2F dt ≤∫ ∞
0
‖eAst‖2F dt.
Since the right hand side is the H2 norm of system (6.2), this completes the proof.
Note: Theorem 6.1.3 relies on the fact that the sum of the k largest eigenvalues of Ps are larger
than the sum of the k largest eigenvalues of P for any integer k. As a result, Corollary 6.1.3.1
does not extend to a general state penalty matrix Q where the H2 norm would be trace(QX).
For the same reason, the result requires E(ddT ) = I.
We now show that an analogous bound holds for the H∞ norm.
Proposition 6.1.2. When the systems (6.1) and (6.2) are stable, the H∞ norm of the general
system (6.1) is bounded from above by the H∞ norm of the symmetric system (6.2).
Proof. From the bounded real lemma [49], the H∞ norm of the general system (6.1) is less than
ζ if there exists a P 0 such that,
ATP + PA + I + ζ−2P 2 ≺ 0. (6.5)
From Proposition 6.1.1, the H∞ norm of symmetric system (6.2) is ζ = σmax(A−1s ). Taking
P = ζI for any ζ > σmax(A−1s ) satisfies the above LMI; substituting P and As into (6.5) yields,
2ζAs + 2I ≺ 0.
Since As is Hurwitz, As ≺ 0. Since ζ > −λmax(A−1s ), ζ−1 < −λmin(As), so As ≺ −ζ−1I.
Therefore the LMI is satisfied. Since Aa = −ATa , setting P = ζI implies,
ATP + PA = 2ζAs.
Substituting A and P = ζI into (6.5) yields,
ATP + PA + I + ζ−2P 2 = 2ζAs + 2I ≺ 0.
110
6.1.4 Small asymmetric perturbations
We next show that in addition to being an upper bound, the H2 andH∞ norms of the symmetric
and full systems are close when A is nearly normal. In what follows, we show that when a normal
system is subject to an antinormal perturbation of O(ε), the first order correction to the H2
and H∞ norms is zero. A similar but more restrictive result appeared in [117] for the design of
an interconnection graph for synchronizing oscillator networks. We present a result for systems
with normal dynamical generators An. Since a normal matrix An is a matrix that commutes
with ATn , this set includes symmetric matrices As.
Proposition 6.1.3. Let An be a normal matrix. The O(ε) correction to the H2 norm of the
system
x = Anx + d
from an O(ε) antisymmetric perturbation Aa is zero.
Proof. The H2 norm of the above system is given by trace(Pn) where,
AnPn + PnATn + I = 0. (6.6)
From Lemma 1 in [196], Pn = −(An +An)−1. Perturbing An by a small antisymmetric matrix
εAa yields a small correction term εP in the controllability gramian. Collecting the O(ε) terms
from the Lyapunov equation,
(An + εAa)(Pn + εP ) + (Pn + εP )(An + εAa)T + I = 0.
recovers the linear equation,
AnP + PAn + AaPn + PnATa = 0.
The O(ε) correction to the H2 norm is given by,
trace(P ) = trace(Pn (AaPn + PnA
Ta ))
= trace((Aa − Aa)P 2
n
)= 0
111
since Aa = −ATa .
To illustrate this result for a small system, we provide an explicit expression for the H2 norm
of a 2-dimensional system in terms of the corresponding symmetric system.
2-dimensional system
To build intuition, we give the analytical expression for the H2 norm of a 2-dimensional system
in terms of the norm of its symmetric component. The A and As matrices are,
A =
a e+ h
e− h d
, As =
a e
e d
.Solving for Ps = −0.5A−1
s and taking the trace gives the H2 norm of the symmetric system,
trace(Ps) = − a+d2(ad−e2)
Explicitly solving the Lyapunov equation APc + PcAT + I = 0 for P and taking its trace yields
the H2 norm of the original system,
trace(Pc) = − (b−c)2+(a+d)2
2(a+d)(ad−bc) .
With some algebra, the above simplifies to, trace(Ps) = −(a+ d)(2(ad− e2))−1 and,
trace(Pc) = trace(Ps)
((a+ d)2(ad− e2) + 4(ad− e2)h2
(a+ d)2(ad− e) + (a+ d)2h2
).
We show that a similar property holds for the H∞ norm.
Proposition 6.1.4. Let As be a symmetric matrix. The O(ε) correction to the H∞ norm of
the system
x = Asx + d
from an O(ε) antisymmetric perturbation Aa is zero.
Proof. From Proposition 6.1.1, the H∞ norm of the symmetric system is given by σmax(−A−1s ).
The maximum singular value of a matrix is equivalent to,
σmax(X) = sup‖v‖2≤1,‖w‖2≤1
vTXw.
112
Since As is symmetric, w = v. Taking an O(ε) antisymmetric perturbation Aa to the above
expression,
σmax(−(As + εAa)−1) = −vTA−1s v + εvTA−1
s AaA−1s v. + O(ε2)
Since Aa is antisymmetric,⟨A−1s vvTA−1
s , Aa⟩
= 0 and thus the first-order correction is 0.
6.2 Computational experiments
6.2.1 Example: Directed Consensus Network
In this example, we illustrate the utility of the approach described in Section 6.1.2. Consider
the network dynamics given by a directed network as described in Section 1.2.1,
ψ = −(L + E diag(x)ET )ψ
where L is a directed graph Laplacian, F (x) = E diag(x)ET represents the addition of undirected
links, v is a vector that contains weights of these added links, and the incidence matrix E
describes which edges may be added or altered. The regularization on x is given by,
g(x) = ‖x‖22 + γ∑i
|xi|
where the quadratic term limits the size of the edge weights, the `1 norm promotes sparsity of
added links, and γ > 0 parametrizes the importance of sparsity.
For this concrete example, the network topology is given by Figure 6.1. The potential added
edges can connect the following pairs of nodes: (1)− (2), (1)− (3), (1)− (5), (1)− (6), (2)− (5),
(2)− (6), (3)− (6), and (4)− (5).
Controllers were designed by solving problems (6.3) and (6.4) for the symmetric version
of the network over 50 log-distributed values of γ ∈ [10−4, 1]. The closed-loop H2 and H∞norms obtained by applying these controllers to the symmetric and original systems are shown
in Fig. 6.2. Fig. 6.1 also shows which edges were added for γ = 1.
113
ψ1
ψ2ψ3
ψ5
ψ4
ψ6
Figure 6.1: Directed network (solid black — arrows) with added undirected edges (dashedred - - - arrows). Both the H2 and H∞ optimal structured control problems yielded the sameset of added edges. In addition to these edges, the controllers tuned the weights of the edges(1)− (3) and (1)− (5).
f2(x
)
f∞
(x)
Figure 6.2: H2 and H∞ performance of the closed-loop symmetric system and the originalsystem subject to a controller designed at various values of γ.
6.2.2 Example: Combination drug therapy design via symmetric sys-
tems
In this section, we consider the combination drug therapy problem introduced in Section 1.2.1,
ψ = (A + diag (Fxx))ψ + d.
where ψ represents a vector of HIV mutant populations, A contains their evolutionary dynamics,
and Fx contains information about drug treatment. The entry Aii represents how fast mutant i
replicates and the entry Aij represents the probability of mutation from mutant j into mutant i.
Each element xk represents the amount of drug k administered to the patient. The ith element
of the kth column of of Fx specifies how quickly drug k destroys mutant i.
114
Combination drug therapy is desirable because using only one drug often leads to the mu-
tant population adapting to the weaknesses of that particular drug [197]. However, because of
potential side effects and drug-drug interactions, it is not desirable to use a large number of
different drugs. Furthermore, large doses can have additional side effects [198].
Since the probability of mutation is often orders of magnitude less than the rate of replica-
tion [199], Proposition 6.1.3 justifies design using the symmetric component of the system. We
therefore minimize the H2 norm of the system,
ψ = (As + diag(Fxx))ψ + d.
where As = 12 (A + AT ). Since this system is symmetric, it can be posed as problem (6.3).
Primal and dual optimization problems
We first state the problem in a form which is convenient for implementation of the alternating
direction method of multipliers (ADMM), a technique well suited to large scale problems that
has been recently successfully applied to sparse control synthesis problems [64, 105]. Using the
auxiliary variable G := −(As + diag(Fxx)), problem (6.3) becomes,
minimize trace(G−1) + 12x
TRx + γ1Tu
subject to G + As + diag(Fxx) = 0
G 0, x ≥ 0.
(6.7)
Since a negative drug dosage is not possible, the `1 norm of x can be written as 1Tx. The
Lagrangian is,
L(G, x, Y, λ) = trace(G−1) + 12x
TRx + γ1Tx − λTx + 〈Y, G + A + diag(Fxx)〉
where Y = Y T and λ ≥ 0. We omit the Lagrange multiplier associated with the positive
definiteness of G for compactness; it will become a slack variable and will result in a requirement
on the positive definiteness of the dual variable Y . We substitute the equivalent expression,
trace (Y diag(Fxx)) = − yTFxx
where y = diag(Y ) into the Lagrangian, differentiate it with respect to G and x and set the
115
gradients equal to zero,
0 = − G−2 + Y
0 = Rx + γ1 − λ + FTx y.
The optimal G and x are therefore,
G = Y −1/2
x = −R−1(FTx y + γ1− λ).
Substituting these expressions into the Lagrangian yields
2 trace(Y12 ) + trace(AY ) − 1
2 (FTx y + γ1− λ)TR−1(FTx y + γ1− λ).
The dual problem is to maximize the above expression over λ ≥ 0 and Y 0. Since the above
is concave and quadratic in λ, its maximum value is achieved at λ = FTx y + γ1. However, λ
must be nonnegative and thus
λ? =
FTx y + γ1, FTx y + γ1 > 0
0, otherwise.
Therefore, λ can be eliminated by substituting λ? and the dual problem can be written as,
maximize 2 trace(Y 1/2) + trace(AY ) − 12 max(−FTx y−γ1, 0)TR−1(−FTx y − γ1)
subject to Y 0.
(6.8)
Alternating direction method of multipliers
To apply ADMM to (6.7), we first form the corresponding augmented Lagrangian,
Lµ(G, x, Y ) := trace(G−1)+xTRx+γ 1Tx+〈Y, G+A+ diag(Fxx)〉+ 12µ‖G+A+diag(Fxx)‖2F .
Relative to the standard Lagrangian, Lµ contains an additional quadratic penalty on the viola-
tion of the linear constraint. The positive parameter µ specifies the magnitude on the constraint
violation penalty at each iteration.
116
The ADMM iteration uses the update sequence [132]
Gk+1 = argminG
Lρ(G, xk, Y k)
xk+1 = argminx
Lρ(Gk+1, x, Y k)
Y k+1 = Y k + 1µ
(Gk+1 +A+ diag(Fxx
k+1))
to find the optimal solution to the original problem. The stopping criteria depend on the primal
residual, which quantifies how well Gk and xk satisfy the linear constraint, and the dual residual,
which quantifies the difference between xk and xk−1. We refer the reader to [132] for details.
This algorithm is advantageous because the subproblems are much simpler than the original
problem. The G-minimization step has an explicit solution, the x-minimization step takes
a standard problem form for which there are efficient algorithms, and the Y -update step is
algebraic.
G-minimization
The G-minimization step amounts to solving,
minimizeG
trace(G−1) + 12µ‖G− V
k‖2F
subject to G 0
where Vk := −(A+ diag(Fxxk))− 1
2µYk. Seting the gradient,
−G−2 + 1µG − V k
to zero is an explicit exercise with the positive definiteness constraint. Since the powers of G
appear with no coefficients, the optimal G has the same eigenstructure as V k. Its eigenvalues
are determined by the positive real solution to the cubic equation
1µ λ
3i + σi λ
2i − 1
where σi is the corresponding eigenvalue of Vk. By the convexity of the G-minimization problem,
there can be only one positive real solution to the above equation.
117
x-minimization
The x-minimization step is equivalent to solving,
minimizex
γ1Tx + xTRx + 12µ‖Fxx − wk‖22.
where wk := gk+1 + a + µyk, gk = diag(Gk), a = diag(A), and yk = diag(Y k). The objective
function is the sum of a quadratic term and an `1 norm: a problem form is commonly referred
to as LASSO. This problem has attracted lots of attention in recent years and there are many
efficient methods for computing its solution; see Section 4.4.1 for a comparison of many standard
techniques with the second order method we develop in Chapter 4. We solve it using Iterative
Soft-Thresholding (ISTA) [131].
Computational complexity
Generic SDP solvers scale with n6, where n is the dimension of the positive definiteness con-
straint. The G-minimization requires O(n3) operations because it requires an eigenvalue decom-
position, the x-minimization step requires O(nr), operations and the Y -update step requires
O(n2) operations.
Numerical example
Following the example given in [89, 90] based on [200], we examine a system with 35 mutants
and 5 potential types of drugs. The diagonal entries are 0.5 and the off diagonal elements range
from the order of 10−8 to 10−6. The structure of A is shown in Fig. 6.3. It is evident that the
matrix is not symmetric.
We next use our algorithm to design control inputs u for the symmetric system with varying
levels of the sparsity promoting parameter γ. As γ is increased, sparsity is prioritized over H2
performance of the system and therefore the performance degrades. In Fig. 6.4, we show the
difference in H2 performance between the symmetric and original systems as a function of γ. In
this problem, the symmetric model is a very good approximation of the original system, even
up to extremely large levels of γ.
Since the off-diagonal entries of the matrix A are small, we artificially increase them by a
constant factor to study our approach for problems with larger degrees of asymmetry. We take,
Ac = c(A− I A) + 0.5(I A)
118
where c is a constant factor and is the Hadamard (element wise) product. This modification
means that Ac does not have physically relevant implications for the drug synthesis problem, but
it illustrates the utility of our approach with a realistic problem structure. Figure 6.5 compares
the H2 performance for c = 105, 1.4× 106, and 1.9× 107. Compared to the diagonal entries of
Ac, the maximum off-diagonal element is of the same order, one order of magnitude higher, and
two orders of magnitude higher respectively.
When the off-diagonal elements are of the same order of magnitude or smaller than the di-
agonal elements, there is almost no difference between the symmetric and full models. As the
off-diagonal elements get larger, the fidelity of the approximation suffers. Unsurprisingly, as
the system becomes more asymmetric, the symmetric approximation becomes more conserva-
tive [201].
We note that for a realistic synthesis problem, γ would be varied to find sparsity structures
for x. Once a desired sparsity structure is identified, (6.3) would be solved with γ = 0 but
x constrained to have that particular sparsity structure. This process, known as polishing in
compressive sensing literature, would yield a x which is sparse but has better performance than
simply obtaining x from (6.3) where g represents `1 regularization.
6.3 Computational advantages for structured problems
Structured control is often of interest for large-scale systems. As such, the computational scaling
of algorithms used to compute optimal controllers is very important. In this section, we identify
a class of systems which are amenable to scalable distributed algorithms.
When A and F (x) are always jointly block-diagonalizable, the dynamics of the system can
be expressed as the sum of independent subsystems. Define ψ := Πψ and let Π be a unitary
matrix such that,˙ψ = (A + F (x))ψ
where
A := ΠAΠT , F (x) := ΠF (x)ΠT ,
and, for any choice of x, A+ F (x) = blkdiag(A11 + F11, · · · , ANN + FNN ) is block-diagonal with
N blocks of size n× n each.
For problems of this form, computing optimal control strategies is much more efficient in
the ψ coordinates because the majority of the computational burden in solving problems (6.3)
and (6.4) comes from the nN × nN LMI constraint involved in minimizing the performance
119
Figure 6.3: Sparsity structure of the matrix A and its symmetric counterpart. The elements ofA are shown with blue dots, and the elements in its symmetric component As are overlaid ingreen circles.
120
γ
Figure 6.4: Difference in H2 norm between the symmetric and original systems with differentcontrollers designed as a function of γ, normalized by the H2 norm of original system.
f 2(x
)
γ
Figure 6.5: The solid — lines are the H2 norms of the symmetric systems and the dotted - - -lines are the H2 norms of the original systems. The blue ×, red , and magenta designatec = 105, 1.4× 106, and 1.9× 107 respectively.
121
metrics f2(x) or f∞(x).
For this class of system, the H2-optimal control problem (6.3) can be expressed as,
minimizev,Θi
1
2
∑i
trace(Θi) + g(x)
subject to
Θi I
I −((As)ii + (Fs(x))ii)
0.
(6.9)
which is an SDP with N separate n× n LMI blocks. Since SDPs scale with the sixth power of
the LMI blocks, solving this reformulation scales with n6 as opposed to n6N6.
Analogously, the structured H∞-optimal control problem (6.4) can be cast as,
minimizev,Θi
maxi
(σmax(Θi)) + g(x)
subject to
Θi I
I (As)ii + (Fs(x))ii
0.
(6.10)
One important class of system which satisfies these assumptions is spatially-invariant sys-
tems. This structure was used in [190] to develop efficient techniques for sparse feedback syn-
thesis.
6.3.1 Spatially-invariant systems
Spatially-invariant systems have a block-circulant structure which is block-diagonalizable by
a Discrete Fourier Transform (DFT). A spatially-invariant system can be represented by N
subsystems with n states each. The state vector ψ ∈ RnN is composed of N subvectors ψi ∈ Rn
which denotes the state of the subsystem. The matrix A ∈ RnN×nN is block-circulant with
blocks of the size n× n. For example, when N = 3,
A =
A0 A1 A−1
A−1 A0 A1
A1 A−1 A0
where the blocks A0, A−1, A1 ∈ Rn×n.
It was shown in [9] that the optimal feedback controller for a spatially-invariant system is
122
itself spatially-invariant. Assuming that the optimal sparse feedback controller is also spatially-
invariant is equivalent to assuming that F (x) is block-circulant. Block circulant matrices are
block-diagonalizable by the appropriate DFT. Let the block Fourier matrix be
Φ := ΦN ⊗ In,
where In is the n× n identity matrix, ΦN is the N ×N discrete Fourier transform matrix, and
⊗ represents the Kronecker product. By introducing the change of variables ψ := Φψ, where
ψ =[ψT1 · · · ψTN
]T,
and ψi ∈ Rn, the original system’s dynamics can be expressed as N independent n× n subsys-
tems,
A = blkdiag(A11, A22, A33)
Consequently, the optimal structured control problems (6.3) and (6.4) can be cast as (6.9)
and (6.10), which are more amenable to efficient computation.
6.3.2 Swift-Hohenberg Equation
Here we illustrate the utility of the block-diagonalization. Consider a particular realization of
the Swift-Hohenberg equation [191],
∂tψ(t, φ) = β ψ(t, φ) − (1 + ∂φφ)2 ψ(t, φ) + x(φ)ψ(t, φ) + d(t, φ)
where β ∈ R, and ψ(t, ·), v(·) ∈ L2(−∞,∞), d(t, φ) is a white-in-time-and-space stochastic
disturbance and x(φ) is a spatially-invariant feedback controller which is to be designed. A
finite dimensional approximation of this system can be obtained by using the differentiation
suite from [202] to discretize the problem into N points and approximating the infinite domain
with periodic boundary conditions over the domain L2[0, 2π]. A sparse H2 feedback controller
x(φ) can then be identified by solving problem (6.3).
We contrast this method with the approach we advocate in Section 6.3, where we use the
DFT to decompose the system into N first-order systems corresponding to eigenfunctions of the
Swift-Hohenberg equation and solve problem (6.9).
The state vector takes the form of ψ(φ) evaluated at grid points in φ where the dynamics
123
are given by
ψ = (A + X)ψ + d.
and A = βI − (I + D2)2. Here D is a discrete differentiation matrix from [202], and X is the
circulant state feedback matrix.
Using the DFT over φ, the Swift-Hohenberg equation can be expressed as a set of independent
first-order systems,˙ψφ = (aφ + xφ)ψφ + d
where aφ := β−(1−κ2φ)2, and the new coordinates are ψ := Πψ, Π is the DFT matrix, κφ is the
wavenumber (spatial frequency), and x represents X in the Fourier space; i.e., X = ΠT diag(x)Π.
We take the regularization term to be
g(x) = ‖X‖2F + γ‖X‖1
where ‖X‖1 :=∑ij |Xij | is the elementwise `1 norm and γ is a parameter which specifies the
emphasis on sparsity relative to performance.
For the H2 problem, the regularized optimal control problem is of the form of (6.3) with
Fs(x) = X and X is circulant. In that formulation, the problem is an SDP with one N×N LMI
block. In the Fourier space, the problem can be expressed as (6.9), which takes the particular
form,
minimizex
1
2
∑ 1
−(aφ + xφ)+ g
(ΠT diag(x)Π
)subject to − (aφ + xφ) ≥ 0
which does not require the large SDP constraints in (6.3).
We solved the regularizedH2 optimal control problem by solving the general formulation (6.3)
and the more efficient formulation (6.9) for β = 0.1, γ = 1 and N varying from 5 to 51 using
CVX, a general purpose convex optimization solver [203].
Taking advantage of spatial invariance yields a significant computational advantage, as can
be seen in Figure 6.6. Although both expressions of the problem yield the same solution, solving
the realization in (6.9) is much faster and allows us to examine much larger problem dimensions.
In Figure 6.7, we show the spatially-invariant feedback controller for one point in the domain,
i.e., one row of X, computed for N = 101 at γ = 0, γ = 0.1, and γ = 10..
124
Com
pu
tati
onti
me
(s)
Number of discretization points
Figure 6.6: Computation time for the general formulation (6.3) (solid blue ——) and thatwhich takes advantage of spatial invariance (6.9) (dotted red · · · ∗ · · · ).
feed
bac
kga
in−x
(φ)
φ
Figure 6.7: Feedback gain −x(φ) for the node at position φ = 0, computed with N = 51 andγ = 0 (solid black —), γ = 0.1 (dashed blue - - -), and γ = 10 (dotted red · · · ).
Chapter 7
Structured decentralized control
of positive systems
Positive systems have received much attention in recent years because of convenient properties
that arise from their structure. A system is called positive if, for every nonnegative initial
condition and input signal, its state and output remain nonnegative [53]. Such systems appear
in the models of heat transfer, chemical networks, and probabilistic networks. In [54], the
authors show that the KYP lemma greatly simplifies for positive systems, thereby allowing
for decentralized H∞ synthesis via Semidefinite Programming (SDP). In [57], it is shown that
static output-feedback can be solved via Linear Programming (LP) for a class of positive systems.
In [55,56], the authors develop necessary and sufficient conditions for robust stability of positive
systems with respect to induced L1–L∞ norm-bounded perturbations. In [204–206], it is shown
that the structured singular value is equal to its convex upper bound for positive systems.
Thus, assessing robust stability with respect to induced L2 norm-bounded perturbation is also
tractable.
Most of the recent literature focuses on control design for positive systems with respect to
induced L1–L∞ norms or induced L2 norms [54, 56, 58]. In this chapter, we show that a class
of H2 and H∞ optimal control problems which are not tractable for general systems are convex
for positive systems. Moreover, they are convex in the original controller variables which (i)
allows us to formulate convex optimization problems where the control parameter is kept as
an optimization variable; and (ii) facilitates a straightforward implementation of constraints or
regularizers on the control parameter in the optimal control formulation.
125
126
7.1 Problem formulation and background
We begin with a review of positive systems and then pose the problem.
7.1.1 Background on positive systems
We first provide relevant notation and definitions to facilitate discussion.
Notation
The set of n× n Metzler matrices (matrices with nonnegative off diagonal elements) is denoted
by Mn×n. The set of n× n nonnegative (positive) diagonal matrices is denoted by Dn+ (Dn++).
Definition 2 (Graph associated to a matrix). Given A ∈ Rn×n we denote the graph associated
to A as G(A) = (V, E), with the set of vertices V = 1, ..., n and the set of edges E :=
(i, j) |Aji 6= 0 .
Definition 3 (Strongly connected graph). A graph (V, E) is strongly connected if there is a
directed path linking every two distinct nodes in V.
Definition 4 (Weakly connected graph). A graph (V, E) is weakly connected if replacing its
edges with undirected edges results in a strongly connected graph.
Positive systems
We begin with a definition of a linear positive system.
Definition 5. A linear system described by the state-space representation,
ψ = Aψ + Bd
ζ = Cψ + Dd,
is positive if and only if A is Metzler and B, C, and D are nonnegative matrices.
We now state two lemmas that are useful for the analysis of positive LTI systems.
Lemma 7.1.1. Let A ∈Mn×n and let Q ∈ Rn×n be a positive definite matrix with nonnegative
entries. Then
(a) eA ≥ 0;
127
(b) For Hurwitz A, the solution P to the algebraic Lyapunov equation,
AP + PAT + Q = 0
is elementwise nonnegative.
Proof. (a) The Metzler matrix A can be written as A − αI with A ≥ 0 and α > 0. Since
eA =∑∞k=0 A
k/k! and Ak ≥ 0 for all k, eA ≥ 0. Thus, e−α > 0 implies eA = e−αeA ≥ 0.
(b) Apply (a) to P =∫∞
0eAtQ eA
T t dt.
Lemma 7.1.2. The left and right principal singular vectors, w and v, of A ∈ Rn×n+ are non-
negative. If A ∈ Rn×n++ , w and v are positive and unique.
Proof. Follows from the application of the Perron theorem [180, Theorem 8.2.11] to AAT and
ATA.
7.1.2 Problem formulation
We consider a class of positive systems (1.1) subject to structured decentralized control,
ψ = (A + F (x))ψ + Bd
ζ = Cψ(7.1)
where A is a Metzler matrix, F (x): Rm → Dn, and B,C ≥ 0. Many applications fit into this
form, including leader selection and combination drug therapy design for HIV. Our objective is
to design a stabilizing diagonal matrix F (x) that minimizes amplification from d to z. To study
this class of problem, we introduce the following assumption.
Assumption 3. The matrix A in (7.1) is Metzler, the matrices B and C are nonnegative, and
the diagonal matrix F (x) := diag (Fxx) is a linear function of x with Fx ∈ Rn×m.
We now review some recent results. Under Assumption 3, the matrix A + F (x) is Metzler
and its largest eigenvalue is real and a convex function of x [207]. Recently, it has been shown
that the weighted L1 norm of the response of system (7.1) from a nonnegative initial condition
x0 ≥ 0, ∫ T
0
cTψ(t) dt
is a convex function of u for every c ∈ Rn+ [58, 59]. Furthermore, the approach in [54] can
be used to cast the problem of unstructured decentralized H∞ control of positive systems as a
128
semidefinite program (SDP) and [208] can be used to cast it as a Linear Program (LP). However,
both the SDP and the LP formulations require a change of variables that does not preserve the
structure of F (x). Consequently, it is often difficult to design controllers that are feasible for a
given noninvertible operator F or to impose structural constraints or penalties on x.
We show that both the H2 and the H∞ norms are convex functions of the original optimiza-
tion variable x and provide explicit expressions for the (sub)gradients. This allows us to develop
efficient descent algorithms that solve regularized optimal control problems of the form (1.8).
7.2 Convexity of optimal control problems
We next establish convexity of the H2 and H∞ norms for systems that satisfy Assumption 3,
derive a graph theoretic condition that guarantees continuous differentiability of f∞, and de-
velop a customized algorithm for solving optimization problem (1.8) even in the absence of
differentiability.
7.2.1 Convexity of f2 and f∞
We first establish convexity of the H2 optimal control problem and provide the expression for
the gradient of f2.
Proposition 7.2.1. Let Assumption 3 hold and let Acl(x) := A + F (x) be a Hurwitz matrix.
Then, f2 is a convex function of x and its gradient is given by
∇f2(x) = 2F †(PcPo) (7.2)
where Pc and Po are the controllability and observability gramians of the closed-loop system (7.1),
Acl(x)Pc + PcATcl(x) + BBT = 0 (7.3a)
ATcl(x)Po + PoAcl(x) + CTC = 0. (7.3b)
Proof. We first establish convexity of f2(x) and then derive its gradient. The square of the H2
norm is given by,
f2(x) =
⟨CTC, Pc
⟩, Acl(x) Hurwitz
∞, otherwise
129
where the controllability gramian Pc of the closed-loop system is determined by the solution to
Lyapunov equation (7.3a). For Hurwitz Acl(x), Pc can be expressed as,
Pc =
∫ ∞0
eAcl(x)tBBT eATcl(x)t dt.
Substituting into⟨CTC, Pc
⟩and rearranging terms yields,
f2(x) =
∫ ∞0
‖C eAcl(x)tB‖2F dt =
∫ ∞0
∑i, j
(cTi eAcl(x)t bj
)2
dt
where cTi is the ith row of C and bj is the jth column of B. From [58, Lemma 3] it follows that
cT eAcl(x)t b is a convex function of u for nonnegative vectors c and b. Since the range of this
function is R+ and (·)2 is nondecreasing over R+, the composition rules for convex functions [123]
imply that (cTi eAcl(x)t bj)2 is convex in x. Convexity of f2(x) follows from the linearity of the
sum and integral operators.
To derive ∇f2, we form the associated Lagrangian,
L(x, Pc, Po) =⟨CTC, Pc
⟩+⟨Po, Acl(x)Pc + PcA
Tcl(x) + BBT
⟩where Po is the Lagrange multiplier associated with equality constraint (7.3a). Taking variations
of L with respect to Po and Pc yields Lyapunov equations (7.3a) and (7.3b) for controllability
and observability gramians, respectively. Using Acl(x) = A + F (x) and the adjoint of F , we
rewrite the Lagrangian as
L(x, Pc, Po) = 2⟨F †(PcPo), x
⟩+⟨CTC, Pc
⟩+⟨Po, APc + PcA
T + BBT⟩.
Taking the variation of L with respect to x yields (7.2).
Remark 7.2.1. Convexity of the quadratic cost
∫ T
0
ψT (t)CTC ψ(t) dt
also holds over a finite or infinite time horizon for a piecewise constant x. This follows from [58,
Lemma 4] and suggests that an approach inspired by the Model Predictive Control (MPC) frame-
work can be used to compute a time-varying optimal control input for a finite horizon problem.
Remark 7.2.2. The expression for ∇f2 in Proposition 7.2.1 remains valid for any linear system
130
and any linear operator F : Rm → Rn×n. However, convexity of f2 holds under Assumption 3
and is not guaranteed in general.
We now establish the convexity of the H∞ control problem and provide expression for the
subdifferential set of f∞.
Proposition 7.2.2. Let Assumption 3 hold and let Acl(x) := A + F (x) be a Hurwitz matrix.
Then, f∞ is a convex function of x and its subdifferential set is given by
∂f∞(x) =∑
i
αi F†(A−1
cl (x)B vi wTi CA
−1cl (x)
)| wTi (CA−1
cl (x)B) vi = f∞(x), α ∈ P
(7.4)
where F † is the adjoint of the operator F and P is the simplex, P := αi | αi ≥ 0,∑i αi = 1.
Proof. We first establish convexity of f∞(x) and then derive the expression for its subdifferential
set. For positive systems, the H∞ norm achieves its largest value at ω = 0 [54] and from (1.3b)
we thus have f∞(x) = σ(−CA−1cl (x)B). To show convexity of f∞(x), we show that −CA−1
cl (x)B
is a convex and nonnegative function of x, that σ(M) is a convex and nondecreasing function of
a nonnegative argument M , and leverage the composition rules for convex functions [123].
Since Acl(x) is Hurwitz, its inverse can be expressed as
−A−1cl (x) =
∫ ∞0
eAcl(x)t dt. (7.5)
Convexity of cieAcl(x)t bj [58, Lemma 3] and linearity of integration implies that each element of
the matrix
−C A−1cl (x)B = C
∫ ∞0
eAcl(x)t dtB
is convex in x and, by part (a) of Lemma 7.1.1, nonnegative.
The largest singular value σ(M) is a convex function of the entries of M [123],
σ(M) = sup‖w‖=1, ‖v‖=1
wTMv (7.6)
and Lemma 7.1.2 implies that the principal singular vectors vi and wi that achieve the supremum
in (7.6) are nonnegative for M ≥ 0. Thus,
wTi (M + β ejeTk ) vi ≥ wTi Mvi
for any β ≥ 0, thereby implying that σ(M) is nondecreasing over M ≥ 0. Since each element of
131
−CA−1cl (x)B ≥ 0 is convex in x, these properties of σ(·) and the composition rules for convex
functions [123] imply convexity of f∞(x) = σ(−CA−1cl (x)B).
To derive ∂f∞(u), we note that the subdifferential set of the supremum over a set of differ-
entiable functions,
f(x) = supi∈I
fi(x)
is the convex hull of the gradients of the functions that achieve the supremum [209, Theorem
1.13],
∂f(x) = ∑j | fj(x) = f(x)
αj∇fj(x) | α ∈ P
Thus, the subgradient of σ(M) with respect to M is given by
∂ σ(M) =∑
j αjwjvTj | wTj Mvj = σ(M), α ∈ P
.
Finally, the matrix derivative of M−1 with respect to M in conjunction with the chain rule
yields (7.4).
Remark 7.2.3. The adjoint of the linear operator F , introduced in Assumption 3, with respect
to the standard inner product is F †(x) = FTx diag(x). For positive systems, Lemma 7.1.1 implies
that the gramians Pc and Po are nonnegative matrices. Thus, the diagonal of the matrix PcPo
is positive and it follows that f2 is a monotone function of the diagonal matrix F (x). Similarly,
−A−1cl (x)B vi and −wTi CA
−1cl (x) are nonnegative and thus f∞ is also a monotone function of
F (x).
7.2.2 Differentiability of the H∞ norm
In general, theH∞ norm is a nondifferentiable function of the control parameter x. Even though,
under Assumption 3, the decentralized H∞ optimal control problem (1.8) for positive systems is
convex, it is still difficult to solve because of the lack of differentiability of f∞. Nondifferentiable
objective functions often necessitate the use of subgradient methods, which can converge slowly
to the optimal solution.
In what follows, we prove that f∞ is a continuously differentiable function of x for weakly
connected G(A). Then, by noting that f∞ is nondifferentiable only when G(A) contains dis-
connected components, we develop a method for solving (1.8) that outperforms the standard
subgradient algorithm.
132
Differentiability under weak connectivity
To show the result, we first require two technical lemmas.
Lemma 7.2.4. Let M ≥ 0 be a matrix whose main diagonal is strictly positive and whose
associated graph G(M) is weakly connected. Then, the graphs associated with G(MMT ) and
G(MTM) have self loops and are strongly connected.
Proof. Positivity of the main diagonal of M implies that if Mij is nonzero, then (MTM)ij and
(MMT )ij are nonzero; by symmetry, (MTM)ji and (MMT )ji are also nonzero. Thus, G(MTM)
and G(MMT ) contain all the edges (i, j) in G(M) as well as their reversed counterparts (j, i).
Since G(M) is weakly connected, G(MTM) and G(MMT ) are strongly connected. The presence
of self loops follows directly from the positivity of the main diagonal of M .
Lemma 7.2.5. Let M ≥ 0 be a matrix whose main diagonal is strictly positive and whose
associated graph G(M) is weakly connected. Then, the principal singular value and the principal
singular vectors of M are unique.
Proof. Note that G(Mk) has an edge from i to j if M contains a directed path of length k from
i to j [6, Lemma 1.32]. Since G(MMT ) and G(MTM) are strongly connected with self loops,
Lemma 7.2.4 implies the existence of k such that (MTM)k > 0 and (MMT )k > 0 for all k ≥ k,
and Perron Theorem [180, Theorem 8.2.11] implies that (MTM)k and (MMT )k have unique
maximum eigenvalues for all k ≥ k.
The eigenvalues of (MTM)k and (MMT )k are related to the singular values of M by,
λi((MTM)k) = λi((MMT )k) = (σi(M))2k
and the eigenvectors of (MTM)k and (MMT )k are determined by the right and the left singular
vectors of M , respectively. Since the principal eigenvalue and eigenvectors of these matrices are
unique, the principal singular value and the associated singular vectors of M are also unique.
Theorem 7.2.6. Let Assumption 3 hold, let Acl(x) := A + F (x) be a Hurwitz matrix, and let
matrices B and C have strictly positive main diagonals. If the graph G(A) associated with A is
weakly connected, f∞(x) is continuously differentiable.
Proof. Lemma 7.1.1 implies that eAcl(x) ≥ 0. From [6, Lemma 1.32], G(Mk) has an edge from
i to j if there is a directed path of length k from i to j in G(M). Weak connectivity of G(A)
implies weak connectivity of G(A), G(Ak), eAcl(x)t and, by (7.5), of G(−A−1cl (x)).
133
Since Acl(x) is Hurwitz and Metzler, its main diagonal must be strictly negative; otherwise,
ddtψi ≥ 0 for some ψi, contradicting stability and thus the Hurwitz assumption. Equation (7.5)
and Lemma 7.1.1 imply A−1cl (x) ≤ 0 and, since Acl(x) is Metzler, A−1
cl (x)Acl(x) = I can only
hold if the main diagonal of −A−1cl (x) is strictly positive.
Moreover, since the diagonals of B and C are strictly positive, G(−CA−1cl (x)B) is weakly
connected and the diagonal of −CA−1cl (x)B is also strictly positive. Thus, Lemma 7.2.5 implies
that the principal singular value and singular vectors of −CA−1cl (x)B are unique, that (7.4) is
unique for each stabilizing x, and that f∞(x) is continuously differentiable.
Nondifferentiability for disconnected G(A)
Theorem 7.2.6 implies that under a mild assumption on B and C, f∞ is only nondifferentiable
when the graph associated with A has disjoint components. Since the proximal operator of
f∞ does not readily admit an explicit expression, neither the existing proximal methods dis-
cussed in Section 2.2.3 nor the novel algorithms we develop in Part I may be directly applied
to solve (1.8). Instead, subgradient methods [209] must be employed. Although proximal sub-
gradient algorithms [210] can be applied to problems of the form (1.8) where both f and g
are nondifferentiable and g has a cheaply computable proximal operator, subgradient methods
require a large number of iterations to converge.
To a large extent, subgradient methods are inefficient because they do not guarantee descent
of the objective function. However, under the following mild assumption, the subgradient of f∞,
∂f∞, can be conveniently expressed and a descent direction can be obtained by solving a linear
program.
Assumption 4. Without loss of generality, let Acl(x) be permuted such that
Acl(x) = blkdiag (A1cl(x), . . . , Amcl (x))
is block diagonal and let G(Aicl(x)) be weakly connected for every i. Moreover, the matrices
B = blkdiag (B1, . . . , Bm) and C = blkdiag (C1, . . . , Cm) are block diagonal and partitioned
conformably with the matrix Acl(x).
Theorem 7.2.7. Let Assumptions 3 and 4 hold and let Acl(x) := A+F (x) be a Hurwitz matrix.
Then,
f∞(x) = maxi
f i∞(x) (7.7a)
where f i∞(x) := σ(Ci(Aicl(x))−1Bi). Moreover, every element of the subgradient of f∞(x) can be
134
expressed as the convex combination of a finite number of vectors Gj := ∇f j∞(x) corresponding to
the gradients of the functions f j∞(x) that achieve the maximum in (7.7a), i.e., f∞(x) = f j∞(x),
∂f∞(x) = Gα | α ∈ P (7.7b)
where the columns of G are given by Gj and P is the simplex.
Proof. Since Acl(x) is a block diagonal matrix, so is A−1cl (x) and Assumption 4 implies that
−CA−1cl (x)B = blkdiag (−Ci(Aicl(x))−1Bi) is also block diagonal. Thus,
f∞(x) = σ(−CA−1cl (x)B) = max
iσ(−Ci(Aicl(x))−1Bi)
which proves (7.7a). Theorem 7.2.6 implies that each f i∞(x) is continuously differentiable which
establishes (7.7b) by [209, Theorem 1.13].
When g is differentiable, we leverage the above convenient expression for ∂f∞ to select
an element of the subdifferential set which, with an abuse of terminology, we call the optimal
subgradient. The optimal subgradient is guaranteed to be a descent direction for (1.8) and it is
defined as the member of ∂(f∞(x) + γg(x)) that solves
minimizev, α
maxj
(vT (Gj + γ∇g(x))) (7.8a)
subject to v = −(Gα + γ∇g(x)), α ∈ P (7.8b)
vT (Gj + γ∇g(x)) < 0, for all j (7.8c)
where F and f j are defined as in Theorem 7.2.7. While (7.8a) minimizes the directional
derivative of (1.8) in the search direction v, constraints (7.8b) and (7.8c) ensure that v ∈∂(f∞(x) + γ∇g(x)) and that it is a descent direction, respectively.
Problem (7.8) can be solved efficiently because it is a linear program. Moreover, the opti-
mality condition for (1.8), ∂f∞(x) + γ∇g(x) 3 0, can be checked by solving a linear program to
verify the existence of an α ∈ P such that Gα+ γ∇g(x) = 0.
Customized algorithm
Ensuring a descent direction enables principled rules for step-size selection and makes prob-
lem (1.8) with nondifferentiable g tractable via the augmented-Lagrangian-based approaches.
135
Introducing an auxiliary variable z and deriving the proximal augmented Lagrangian as de-
scribed in Section 2.3.1 yields
Lµ(x, z?µ(x, y); y) = f∞(x) + Mµg(x + µy) − µ2 ‖y‖
2.
Since f∞ is the only nondifferentiable component of the proximal augmented Lagrangian, the op-
timal subgradient (7.8) can be used to minimize Lµ(x; yk) over x to facilitate the MM algorithm
in Section (2.3) [107].
7.3 Leader selection in directed networks
We now consider the special case of system (7.1), in which the matrix A is given by a graph
Laplacian, and study the leader selection problem for directed consensus networks. The question
of how to optimally assign a predetermined number of nodes to act as leaders in a network of
dynamical systems with a given topology has recently emerged as a useful proxy for identifying
important nodes in a network [77–84]. Even though significant theoretical and algorithmic
advances for undirected networks have been made, the leader selection problem in directed
networks remains open. We first discussed this problem in Section 1.2.1; we describe it in more
detail here.
7.3.1 Problem formulation
We describe consensus dynamics and state the problem.
Consensus dynamics
The weighted directed network G(L) with n nodes and the graph Laplacian L obeys consensus
dynamics in which each node i updates its state ψi using relative information exchange with its
neighbors,
ψi = −∑j ∈Ni
Lij(ψi − ψj) + di.
Here, Ni := j | (i, j) ∈ E, Lij ≥ 0 is a weight that quantifies the importance of the edge from
node j to node i, di is a disturbance, and the aggregate dynamics are [67],
ψ = −Lψ + d
136
where L is the graph Laplacian of the directed network [211].
The graph Laplacian always has an eigenvalue at zero that corresponds to a right eigenvector
of all ones, L1 = 0. If this eigenvalue is simple, all node values ψi converge to a constant ψ in
the absence of an external input d. When G(L) is balanced and conncted, ψ = 1n 1
Tψ(0) is the
average of the initial node values. In general, ψ = wTψ(0), where w is the left eigenvector of
L corresponding to zero eigenvalue, wTL = 0. If G(L) is not strongly connected, L may have
additional eigenvalues at zero and the node values converge to distinct groups whose number is
equal to or smaller than the multiplicity of the zero eigenvalue.
Leader selection
In consensus networks, the dynamics are governed by relative information exchange and the node
values converge to the network average. In the leader selection paradigm [80], certain “leader”
nodes are additionally equipped with absolute information which introduces negative feedback
on the states of these nodes. If suitable leader nodes are present, the dynamical generator
becomes a Hurwitz matrix and the states of all nodes asymptotically converge to zero.
An example is given by a kinematic model of vehicles where ψ represents the position vector
of a formation. Relative information exchange corresponds to maintaining constant distances
between neighboring vehicles and the leader nodes may have access to absolute information from
GPS units.
The node dynamics in a network with leaders is
ψi = −∑j ∈Ni
Lij(ψi − ψj) − xi ψi + di
where xi ≥ 0 is the weight that node i places on its absolute information. The node i is a leader
if xi > 0, otherwise it is a follower. The aggregate dynamics can be written as,
ψ = −(L + diag (x))ψ + d
and placed in the form (7.1) by taking A = −L, B = C = I, and F (x) = −diag (x). We
evaluate the performance of a leader vector x ∈ Rn using the H2 or H∞ performance metrics
f2 or f∞, respectively. We note that this system is marginally stable in the absence of leaders
and much work on consensus networks focuses on driving the deviations from the average node
values to zero [97] as we discussed in the edge addition example for directed consensus networks
in Section 2.4. Instead, we here focus on driving the node values themselves to zero.
137
1
2
3
4
? ?? ?
? ?? ?
︸ ︷︷ ︸
L
Figure 7.1: A directed network and the sparsity pattern of the corresponding graph Laplacian.This network is stabilized if and only if either node 1 or node 2 are made leaders.
We formulate the combinatorial problem of selecting N leaders to optimize either H2 or H∞norm as follows.
Problem 1. Given a network with a graph Laplacian L and a fixed leader weight κ, find the
optimal set of N leaders that solves
minimizex
f(x)
subject to 1Tx = N κ, xi ∈ 0, κ
where f is one of the performance metrics described in Section 1.2.2, with A = −L, B = C = I,
and F (x) = diag (x).
In [78, 79], the authors derive explicit expressions for leaders in undirected networks. How-
ever, these expressions are efficient only for very few or very many leaders. Instead, we follow [80]
and develop an algorithm which relaxes the integer constraint to obtain a lower bound on Prob-
lem 1 and use greedy heuristics to obtain an upper bound.
Considering leader selection in directed networks adds the challenge of ensuring stability. At
the same time, we can leverage existing results on leader selection in undirected networks to
derive efficient upper bounds on Problem 1.
7.3.2 Stability for directed networks
For a vector of leader weights x to be feasible for Problems 1, it must stabilize system (7.1); i.e.,
−(L+ diag (x)) must be a Hurwitz matrix. When G(L) is undirected and connected, any leader
will stabilize (7.1). However, this is not the case for directed networks. For example, making
node 1 or 2 a leader stabilizes the network in Fig. 7.1, but making nodes 3 or 4 a leader does
not. Theorem 7.3.1 provides a necessary and sufficient condition for stability.
138
Theorem 7.3.1. Let L be a weighted directed graph Laplacian and let x ≥ 0. The matrix
−(L+ diag (x)) is Hurwitz if and only if w x 6= 0 for all nonzero w with wTL = 0, where is
the elementwise product.
Proof. (⇐) If w x = 0, wT diag (x) = 0. If, in addition, wTL = 0, we have
−wT (L + diag (x)) = 0 (7.9)
and therefore zero is an eigenvalue of −(L+ diag (x)).
(⇒) Since the graph Laplacian L is row stochastic and diag (x) is diagonal and nonnegative,
the Gershgorin circle theorem [180] implies that the eigenvalues of −(L+diag (x)) are at most 0.
To show that −(L+diag (x)) is Hurwitz, we show that it has no eigenvalue at zero. Assume there
exists a nonzero w such that (7.9) holds. This implies that either wTL = wT diag (x) = 0 or that
wTL = −wT diag (x). The first case is not possible because, by assumption, wT diag (x) = (w x)T 6= 0 for any w such that wTL = 0. If the second case is true, then wTLv = −wT diag (x)v
must also hold for all v. However, if we take v = 1, then wTL1 = 0 but −wT diag (x)1 is
nonzero. This completes the proof.
Remark 7.3.2. Only the set of leader nodes is relevant to the question of stability. If x does
not stabilize (7.1), no positive weighting of the vector of leader nodes, α x with α ∈ Rn++, will
stabilize (7.1). Similarly if x stabilizes (7.1), every α x will.
Corollary 7.3.2.1. If G(L) is strongly connected, any choice of leader node will stabilize (7.1).
Proof. Since the graph Laplacian associated with a strongly connected graph is irreducible, the
Perron-Frobenius theorem [180] implies that the left eigenvector associated with −L is positive.
Thus, w x 6= 0 for any nonzero x and system (7.1) is stable by Theorem 7.3.1.
Remark 7.3.3. The condition in Theorem 7.3.1 requires that there is a path from the set of
leader nodes to every node in the network. This can be enforced by extracting disjoint “leader
subsets” Sj which are not influenced by the rest of the network, i.e., (vj)TL = 0 where vji = 1 if
i ∈ Sj and vji = 0 otherwise, and which are each strongly connected components of the original
network. Stability is guaranteed if there is at least one leader node in each such subset Sj; e.g.,
for the network in Fig. 7.1, there is one leader subset S1 = 1, 2. By Corollary 7.3.2.1, S1
contains all nodes when the network is strongly connected.
139
7.3.3 Bounds for Problem 1
To approach combinatorial Problem 1, we derive bounds on its optimal objective value. These
bounds can also be used to implement a branch-and-bound approach [212].
Lower bound
By relaxing the combinatorial constraint in Problem 1, we obtain a convex program,
minimizex
f(x)
subject to x ∈ κPN
where κPN := x |∑i xi = Nκ, xi ≤ κ is the “capped” simplex. Using a recent result on
efficient projection onto PN [213], this problem can be solved efficiently via proximal gradient
methods [131] to provide a lower bound on Problem 1.
When G(L) is not strongly connected, additional constraints can be added to enforce the
condition in Theorem 7.3.1 and thus guarantee stability. Let the sets Sj denote “leader subsets”
from which a leader must be chosen, as discussed in Remark 7.3.3. Then, the convex problem
minimize f(x)
subject to x ∈ κPN ,∑i∈Sj
xi ≥ κ, for all j
(7.10)
relaxes the combinatorial constraint and guarantees stability. We denote the resulting lower
bounds on the optimal values of the H2 and H∞ versions of Problem 1 with N leaders by
f lb2 (N) and f lb
∞(N), respectively.
Upper bounds for Problem 1
If k denotes the number of subsets Sj , a stabilizing candidate solution to Problem 1 can be
obtained by “rounding” the solution to (7.10) by taking the N leaders to contain the largest
element from each subset Sj and N − k largest remaining elements. The greedy swapping
algorithm proposed in [80] can further tighten this upper bound.
Recent work on leader selection in undirected networks can also provide upper bounds for
Problem 1 when G(L) is balanced. The symmetric component of the Laplacian of a balanced
graph, Ls := 12 (L+LT ), is the Laplacian of an undirected network. The exact optimal leader set
140
for an undirected network can be efficiently computed when N is either small or large [78, 79].
Since the performance of the symmetric component of a system provides an upper bound on
the performance of the original system, these sets of leaders will have better performance with
L than with Ls for both the H2 (Corollary 6.1.3.1) and H∞ norms (Proposition 6.1.2).
Even when L does not represent a balanced network, f2 and f∞ are respectively upper
bounded by the trace and the maximum eigenvalue of 12 (Ls + F (x))−1. For small numbers of
leaders, they can be efficiently computed using rank-one inversion updates. A similar approach
was used in [78, 79] to derive optimal leaders for undirected networks. Moreover, this approach
always yields a stabilizing set of leaders (Lemma 6.1.1).
7.3.4 Additional comments
We now provide additional discussion on interesting aspects of Problem 1. We first consider the
gradients of f2 and f∞.
Remark 7.3.4. When F (x) = −diag (x), we have ∇f2 = −2 diag (PcPo). The matrix PcPo
often appears in model reduction and (∇f2(x))i corresponds to the inner product between the ith
columns of Pc and Po.
Remark 7.3.5. When F (x) = −diag(x), (∂f∞(x))i is given by the product of −eTi A−1cl (x)v
and wTA−1cl (x)ei. The former quantifies how much the forcing which causes the largest overall
response of system (7.1) affects node i, and the latter captures how much the forcing at node i
affects the direction of the largest output response.
The optimal leader sets for balanced graphs are interesting because they are invariant under
reversal of all edge directions.
Proposition 7.3.1. Let G(L) be balanced, let L := LT so that G(L) contains the reversed edges
of the graph G(L), and let f2 and f∞ denote the performance metrics (7.13) with A = −L,
F (x) = − diag (x), and B = C = I as in Problem 1. Then, f2(x) = f2(x) and f∞(x) = f∞(x).
Proof. The controllability gramian of (7.1) defined with Acl = −(L+ diag(x)) solves Lyapunov
equation (7.3a), −(L + diag(x))Pc − Pc(L + diag(x))T + I = 0, and is also the observability
gramian Po of (7.1) defined with Acl = −(L + diag(x)) = −(LT + diag(x)) that solves (7.3b).
By definition of the H2 norm, f2(x) = trace(Po) = trace(Pc) = f2(x). Since σ(M) = σ(MT ),
f∞(x) = σ(−(L+ diag(x))−1) = σ(−(L+ diag(x))−1) = f∞(x).
This invariance is intriguing because the space of balanced graphs is spanned by cycles.
In [214], the authors explored how undirected cycles affect undirected consensus networks.
141
Proposition 7.3.1 suggests that directed cycles also play a fundamental role in directed con-
sensus networks.
7.4 Computational experiments for leader selection
Here, we illustrate our approach to Problem 1 with N leaders and the weight κ = 1. The
“rounding” approach that we employ is described in Section 7.3.3.
7.4.1 Example: Bounds on leader selection for a small network
For the directed network in Fig. 7.2a, let the edges from node 2 to node 7 and from node 7 to
node 8 have an edge weight of 2 and let all other edges have unit edge weights. We compare
the optimal set of leaders, determined by exhaustive search, to the set of leaders obtained by
(i) “rounding” the solution to relaxed problem (7.10); and by (ii) the optimal selection for
the undirected version of the graph via [78, 79], as discussed in Section 7.3.3. In Fig. 7.2a,
blue node 7 represents the optimal single leader, yellow node 4 represents the single leader
selected by “rounding”, and red node 8 represents the optimal single leader for the undirected
network. In Fig. 7.2b we show the H2 performance for 1 to 8 leader nodes resulting from
different methods. Since in general, we do not know the optimal performance a priori , we plot
performance degradation (in percents) relative to the lower bound on Problem 1 obtained by
solving problem (7.10).
Figure 7.2b shows that neither “rounding” (yellow ) nor the optimal selection for undirected
networks (red +) achieve unilaterally better H2 performance (performance of the optimal leader
sets are shown in blue ×). While the procedure for the undirected networks selects better sets
of 1, 2, and 5 leaders relative to “rounding”, identifying them is expensive except for large or
small number of leaders [78,79] and “rounding” identifies a better set of 4 leaders. This suggests
that, when possible, both sets of leaders should be computed and the one that achieves better
performance should be selected.
7.4.2 Example: Leaders in the neural network of the worm C. Elegans
We now consider the network of neurons in the brain of the worm C. Elegans with 297 nodes
and 2359 weighted directed edges. The data was compiled by [215] from [216]. Inspired by the
use of leader selection as a proxy for identifying important nodes in a network [77–80, 82], we
employ this framework to identify important neurons in the brain of C. Elegans.
142
8
7 2
1
3
6
4
5
2
2
(a) Balanced Network with 8 nodes,12 edges.
100∗
(f2(x
)/flb 2
(N)−
1)
Number of leaders N
(b) Performance of optimal leaders and two leader selectiontechniques.
Figure 7.2: H2 performance of optimal leader set (blue ×) and upper bounds resulting from“rounding” (yellow ) and the optimal leaders for the undirected network (red +). Performanceis shown as a percent increase in f2 relative to f lb
2 (N).
Three nodes in the network have zero in-degree, i.e., they are not influenced by the rest of the
network. Thus, as discussed in Remark 7.3.3, there are three “leader subsets”, each comprised
of one of these nodes. Theorem 7.3.1 implies that system (7.1) can only be stable if each of
these nodes are leaders.
In Figs. 7.4c and 7.4d, we show f2 and f∞ resulting from “rounding” the solution to prob-
lem (7.10) to select the additional 1 to 294 leaders. Performance is plotted as an increase (in
percents) relative to the lower bound f lb(N) obtained from (7.10). This provides an upper
bound on suboptimality of the identified set of leaders. While this value does not provide in-
formation about how f(x) changes with the number of leaders, Remark 7.2.3 implies that it
monotonically decreases with N .
For both f2 and f∞ performance metrics, Figs. 7.4c and 7.4d illustrates that the upper
bound is loosest for 25 leaders (1.56% and 0.48%, respectively). As seen in Fig. 7.2b from the
previous example, whose small size enabled exhaustive search to solve Problem 1 exactly, the
upper bound on suboptimality is not tight and the exact optimal solution to Problem 1 can
differ by as much 21.75% from the lower bound. This suggests that “rounding” selects very
good sets of leaders for the C. Elegans example.
In Figs. 7.4a and 7.4b, we show the network with ten identified f2 and f∞ optimal leaders.
The size of the nodes is related to their out-degree and the thickness of the edges is related to
143
the weight. The red marks nodes that must be leaders and the blue marks the 7 additional
leaders selected by “rounding”.
7.4.3 Example: Ranking college football teams
Inspired by the recent use of graph theoretic tools for ranking athletic teams, we consider the
problem of ranking college football teams. Here, we do not consider that each leader has the
same weight and instead use a sparsity penalty in (1.8) to select sparse subsets of nodes (teams).
Due to the number of teams in the top division of college football (128) and the relative
scarcity of games between them (around 13 per team), ranking these teams is an underdeter-
mined problem. The current practice of ranking by a committee is clearly subject to bias.
Recently, graph theoretic measures, such as average path length from each node, have been
explored for the purpose of objectively ranking teams or athletes [217,218].
We used the scores of college football games from the 2015−2016 season collected from [219]
to generate a network. If team A beat team B, a edge was placed from A to B with a weight
equal to the score difference in the game. There were 203 teams (nodes) and 863 games (edges)
in our data set. We select the top N teams by identifying N H2 optimal leaders. We selected
sparse sets of leaders by solving (1.8) with g as the `1 norm and increasing γ until its solution
had the desired number of nonzero elements. Interestingly, the metrics we use are biased against
selecting leaders which are close in the network. In this context, such proximity would correspond
to teams who have many common opponents.
Fig. 7.3 shows the network generated. The large connected component in the center repre-
sents the teams in the top division. Since our dataset included games played between teams
from the top division and lower divisions but not from games played between teams of the lower
divisions, there are a number of topgoraphically isolated nodes.
Table 7.1 shows sets of 2, 4, 6, and 8 leaders with the corresponding end-of-season rankings
from the Associated Press (AP) [220]. This approach selected teams which agree well with the
AP rankings with the notable exception of Southern Illinois. This team played only one game in
the our set – a close loss to a poorly ranked team. We ascribe this anomaly to the the topological
isolation of this team.
144
Figure 7.3: Network of games in the 2015−2016 College Football season. The central connectedcomponent corresponds to the top division, and the distal nodes are lower division teams forwhom there are less data.
7.5 Computational experiments for combination drug ther-
apy
System (7.1) also arises in the modeling of combination drug therapy [89,90,221–223], which we
first discuss in Section 1.2.1. It provides a model for the evolution of populations of mutants
of the HIV virus ψ in the presence of a combination of drugs x. The HIV virus is known to
be present in the body in the form of different mutant strands; in (7.1), the ith component of
the state vector ψ represents the population of the ith HIV mutant. The diagonal entries of
the matrix A represent the net replication rate of each mutant, and the off diagonal entries of
A, which are all nonnegative, represent the rate of mutation from one mutant to another. The
control input xk is the dose of drug k and the kth column of the matrix Fx in F (x) = diag (Fxx)
specifies at what rate drug k kills each HIV mutant.
7.5.1 Example: Simple problem with nondifferentiable f∞
The mutation patterns of viruses need not be connected. In Fig. 7.5, we show a sample mutation
network with 2 disconnected components. For this network, the H∞ norm is nondifferentiable
when x1 = x2. Nondifferentiability and the lack of an efficiently computable proximal operator
145
γ 0.225 0.6 1 10Teams 8 6 4 2
AP #1 Alabama Alabama Alabama AlabamaAP #4 Ohio State Ohio State Ohio State Ohio StateAP #2 Clemson Clemson ClemsonAP #8 Houston Houston HoustonAP #3 Stanford StanfordAP NR S. Illinois S. Illinois
AP #12 MichiganAP #10 Mississippi
Table 7.1: Leaders selected for different values of γ.
necessitates the use of subgradient methods for solving
minimizex
f∞(x) + xTx.
As shown in Fig. 7.6 with h(x) := f∞(x) + xTx, subgradient methods are not descent methods
so small constant or a divergent series of diminishing step-sizes must be employed.
We compare the performance of the subgradient method with a constant step-size of 10−2
(blue) and a diminishing step-size 7×10−2
k (red) with our optimal subgradient method in which
the step-size is chosen via backtracking to ensure descent of the objective function (yellow). We
show the objective function value with respect to iteration number in Fig. 7.6a and the iterates
xk in the (x1, x2)-plane in Fig. 7.6b.
The subgradient methods were run for 1000 iterations as there is no principled stopping
criterion. Our optimal subgradient method identified the optimal point, to an accuracy of 10−4
(i.e., there was a v ∈ ∂(f∞(x) + xTx) such that ‖v‖ ≤ 10−4) in 23 iterations.
7.5.2 Example: Real world drug therapy problem
Following [89, 90] and using data from [200], we study a system with 35 mutants and 5 drugs.
The sparsity pattern of the matrix A, shown in Fig. 7.7, corresponds to the mutation pattern
and replication rates of 33 mutants and F (x) specifies the effect of drug therapy. Two mutants
are not shown in Fig. 7.7a as they have no mutation pathways to or from other mutants.
Several clinically relevant properties, such as maximum dose or budget constraints, may
be directly enforced in our formulation. Other combinatorial conditions can be promoted via
convex penalties, such as drug j requiring drug i via xi ≥ xj or mutual exclusivity of drugs i
and j via xi + xj ≤ 1. We design optimal drug doses using two convex regularizers g.
146
(a) f2 optimal leaders (b) f∞ optimal leaders
100∗
(f(x
)/flb
(N)−
1)
Number of leaders N
(c) f2 optimal leaders
Number of leaders N
(d) f∞ optimal leaders
Figure 7.4: C. Elegans neural network with N = 10 (a) f2 and (b) f∞ leaders along with the(c) f2 and (d) f∞ performance of varying numbers of leaders N relative to f lb(N). In all cases,leaders are selected via “rounding”.
Budget constraint
We impose a unit budget constraint on the drug doses and solve the f2 and f∞ problems using
proximal gradient methods [131, 134]. These can be cast in the form (1.8), where g is the
indicator function associated with the probability simplex P. Table 7.2 contains the optimal
doses and illustrates the tradeoff between H2 and H∞ performance.
147
x1
x2 x4
x3
u2u1
1 11 1
1 11 1
︸ ︷︷ ︸
A
−1−1 .1.1 −1−1
︸ ︷︷ ︸
Fx
Figure 7.5: A directed network and corresponding A matrix for a virus with 4 mutants and 2drugs. For this system, f∞ is nondifferentiable.
Antibody x?2 x?∞3BC176 0.5952 0.9875
PG16 0 045-46G54W 0.2484 0.0125
PGT128 0.1564 010-1074 0 0
f2(x?2) 0.6017f2(x?∞) 1.1947
f∞(x?2) 0.1857f∞(x?∞) 0.1084
Table 7.2: Optimal budgeted doses and H2/H∞ performance.
Sparsity-promoting framework
Although the above budget constraint is naturally sparsity-promoting, in Algorithm 5 we aug-
ment a quadratically regularized optimal control problem with a reweighted `1 norm [121] to
select a homotopy path of successively sparser sets of drugs. We then perform a ‘polishing’
step to design the optimal doses of the selected set of drugs. We use 50 logarithmically spaced
increments of the regularization parameter γ between 0.01 and 10 to identify the drugs and then
replace the weighted `1 penalty with a constraint to prescribe the selected drugs. In Fig. 7.8, we
show performance degradation (in percents) relative to the optimal dose that uses all 5 drugs
with B = C = I and R = I.
7.6 Time-varying controllers
While references [89,90,112,115,222] and the previous section either assume a constant control
signal or use heuristics to introduce time dependence, we show that such a constant input is in
fact optimal for the induced power norm. We cast the optimal synthesis problem of constant
148h
(xk)−h
(x?)
Iteration k
(a) Descent of objective function.
x2
x1
(b) Iterates in the (x1, x2)-plane.
Figure 7.6: Comparison of different algorithms starting from initial condition [2.5 2.8]T . Thealgorithms are the subgradient method with a constant step-size (dotted blue · · · ), the subgra-dient method with a diminishing step-size (solid red —) and our optimal subgradient methodwhere the step-size is chosen via backtracking to ensure descent of the objective function (dashedyellow - - -)
Algorithm 5 Sparsity-promoting algorithm for N drugs
Set γ > 0, R 0, w = 1, ε > 0while card(xγ) > N do
xγ = argminx
f(x) + xTRx + γ∑i
wi |xi|
increase γ, set wi = 1/(xi + ε)endx?N = argmin
xf(x) + xTRx
subject to sp (x) ⊆ sp (xγ)
control inputs as a finite-dimensional non-smooth convex optimization problem and develop an
algorithm for designing the optimal controller.
7.6.1 Preliminaries
The space of square integrable signals is denoted by L2. The inner product in this space is given
by
〈u, v〉2 :=
∫ ∞0
uT (t) v(t) dt
149
(a) Network of HIV mutation pathways (b) Sparsity pattern of A
Figure 7.7: Mutation pattern of HIV.
Per
cent
deg
rad
ati
on
off
Number of antibodies
Figure 7.8: Percent performance degradation for H2 (solid red —×—) and H∞ (dashed blue−− −−) performance relative to using all 5 drugs.
with the associated norm ‖v‖22 = 〈v, v〉2. The L2 signals are those for which ‖v‖22 is finite and
the locally L2 signals are those for which∫
ΩvT (t) v(t) dt is finite over any compact set Ω, e.g.,
t ∈ [0, T ] for a finite T . The power semi-norm of a signal v is
‖v‖2pow := lim supT→∞
1
T
∫ T
0
vT (t) v(t) dt. (7.11)
The space of trigonometric polynomials is defined as
T :=
g : R→ Rn
∣∣∣∣∣ g(t) =
N∑k= 1
αkejλkt, λk ∈ R, αk ∈ C
,
where j is the imaginary unit.
The closure of T in the space of locally L2 integrable functions with bounded power norm
150
with respect to the metric ‖f − g‖pow is given by the space of Besicovitch almost periodic
functions. Since ‖·‖pow is a seminorm, we consider B2, the space of Besicovitch almost periodic
functions modulo functions with zero power norm [224]. Factoring out signals with zero power
norm is natural in this setting because our performance metric is the power norm of a regulated
output. The inner product associated with this Hilbert space is [225,226]
〈u, v〉 := lim supT→∞
1
T
∫ T
0
uT (t) v(t) dt
and the norm ‖ · ‖pow. The mean of a signal v ∈ B2
M(v) := limT →∞
1
T
∫ T
0
v(t) dt
is well defined for every v ∈ B2. Furthermore, each v ∈ B2 can be decomposed uniquely as
v = v + v, where v is a constant signal given by v = M(v) and M(v) = 0. Note that the inner
product between a constant signal v and a zero-mean signal v is zero, i.e., 〈v, v〉 = 0.
The space B2 contains all bounded L2 signals, periodic signals, and almost periodic signals.
At the same time, it alleviates challenges arising from the fact that the space of signals with
bounded power norm is not a Hilbert space; for additional discussion see [227].
7.6.2 Problem formulation
Consider bilinear positive system (7.1) where x is now allowed to be time-varying and x(t) ∈ Rm,
ψ = (A + F (x(t)))ψ + B d. (7.12a)
For given control and disturbance signals x ∈ B2 and d ∈ B2, we associate the performance
output,
ζx,d =
Q1/2
0
ζ +
0
R1/2
x (7.12b)
with (7.12a), where Q 0 and R 0 are the state and control weights. Under Assump-
tion 3, (7.12) is a positive system. This implies that for every control input x, every nonnegative
disturbance d, and every nonnegative initial condition ψ(0), the state ψ and the output ζx,d of
system (7.12) remain nonnegative at all times.
151
The induced power norm of a stable system (7.12) is,
f(x) :=
sup
‖d‖2pow≤ 1
‖ζx,d‖2pow, system (7.12) stable
∞, otherwise
(7.13)
and it quantifies the response to the worst case persistent disturbance d. Since system (7.12) is
nonlinear, we cannot separate the initial condition and input response and we must define our
notion of stability. We assume that ψ(0) = 0 and consider stability in a bounded input-bounded
output sense.
Definition 6. A control signal x is stabilizing for nonlinear system (7.12) if for every bounded
d ∈ B2, ζx,d is bounded and has finite power, ‖ζx,d‖pow <∞.
For unstable open-loop systems (7.12), there may be no stabilizing control input x in L2.
Thus, the L2-induced gain does not provide a suitable measure of input-output amplification
for (7.12) and f(x) represents an appropriate generalization of the H∞ norm for this class of
bilinear positive systems.
We now formulate the optimal control problem.
Problem 2. Design a stabilizing bounded control signal x ∈ B2 to minimize f(x) for bilinear
positive system (7.12).
7.6.3 Solution to the optimal control problem
In this section, we prove that a constant control input solves Problem 2.
Since f(x) is given by (7.13), any x? which solves Problem 2 satisfies ‖ζx?,d‖2pow ≤ f(x?) ≤f(x) for all x ∈ B2 and d ∈ B2. In particular, ‖ζx?,d‖2pow ≤ f(x) where x is a constant control
input. As shown in [54], for constant control inputs the worst-case disturbance d is also constant,
i.e., f(x) = ‖ζx,d‖2pow. In what follows, we show that ‖ζx,d‖2pow is a convex function of x, that a
constant x minimizes it, and, thus, that a constant control input solves Problem 2.
We first establish convexity of ‖ζx,d‖2pow.
Lemma 7.6.1. Let d(t) = d be a constant non-negative disturbance. Then, under Assumption 3,
the power norm of the output ‖ζx,d‖2pow is a convex function of x ∈ B2.
Proof. We first show that
ψT (t)Qψ(t) + xT (t)Rx(t) (7.14)
152
is a convex function of x ∈ L2[0, t] for any t and then extend this statement to complete the
proof. The matrix R is positive definite so the second term on the right-hand side of (7.14) is a
convex function of x(t). By [58, Theorem 2] every component of ψ(t) is a convex function of the
control input x ∈ L2[0, t]. Since the matrix Q 0 is positive semidefinite and has nonnegative
entries, ψ(t)TQψ(t) is convex and nondecreasing in the elements of ψ(t). Thus, the composition
rules for convex functions [123, Section 3.2.4] imply that ψT (t)Qψ(t) is a convex function of
u ∈ L2[0, t].
Since the integral preserves convexity,
1
T
∫ T
0
(ψT (t)Qψ(t) + xT (t)Rx(t)
)dt
is a convex function of x ∈ L2[0, T ] Let PT denote the mapping that truncates the support of
a signal v to [0, T ]. If there is T such that PT v is not L2 integrable, the power seminorm (7.11)
of v is infinite. This implies that, for any v ∈ B2, PT v ∈ L2[0, T ]. Thus, any v ∈ B2 can be
written as limT→∞ PT v of PT v ∈ L2[0, T ]. Since the limit and supremum preserve convexity,
‖ζx,d‖2pow = lim supT→∞
1
T
∫ T
0
(ψT (t)Qψ(t) + xT (t)Rx(t)
)dt
is a convex function of x ∈ B2.
Lemma 7.6.2. Let x be a stabilizing constant control input for (7.12a) and d be a constant
non-negative disturbance. Then, the directional derivative of ‖ζx,d‖2pow evaluated at x is zero for
any bounded zero-mean variation x ∈ B2.
Proof. The dynamics (7.12a) with control input x+ εx and constant disturbance d are
ψ = (A + F (x + εx))ψ + B d. (7.15)
Since the unperturbed system (with ε = 0) is exponentially stable and boundedness of x implies
that the solution x(t) is continuous in ε [183, Theorem 3.5]. Therefore, (7.15) represents a system
in a regularly perturbed form [183, 228]. The Taylor series expansion can be used to write the
solution to the perturbed dynamics as,
ψ(t) = ψ(t) + ε ψ(t) + O(ε2) (7.16)
153
where ψ is the nominal solution that solves (7.15) for ε = 0
˙ψ = (A + F (x)) ψ + B d. (7.17a)
and ψ represents the first-order correction. By [183, Theorem 10.2], ψ is given by the solu-
tion to a differential equation corresponding to the O(ε) terms in the expression obtained by
substituting (7.16) into equation (7.15),
( ˙ψ + ε˙ψ + O(ε2)) =
[(A + F (x)) ψ + B d
]+ ε
[(A + F (x)) ψ + F (u) ψ
]+ O(ε2).
Collecting the O(ε) terms yields the dynamics that governs the evolution of ψ,
˙ψ = (A + F (x)) ψ + F (x) ψ. (7.17b)
Both (7.17a) and (7.17b) are stable LTI systems. Thus, for the constant disturbance d(t) = d, the
solution ψ to (7.17a) is asymptotically constant. Since x is zero-mean, the signal F (x) ψ is also
zero-mean, i.e., it has no component that contains the zero temporal frequency. Furthermore,
since (7.17b) is a stable LTI system, the frequency components in ψ must correspond to frequency
components in the input, F (x) ψ. Thus, ψ has no frequency component corresponding to the
zero temporal frequency and therefore ψ is also zero-mean.
Because ψ is asymptotically constant and ψ is zero-mean,⟨Q1/2ψ, Q1/2ψ
⟩= 0. Similarly,⟨
R1/2x, R1/2x⟩
= 0. Furthermore, ‖ζx,d‖2pow =⟨Q1/2ψ, Q1/2ψ
⟩+⟨R1/2x, R1/2x
⟩, and we have,
‖z(x+εx),d‖2pow − ‖zx,d‖2pow = 2 ε(⟨Q1/2ψ, Q1/2ψ
⟩+⟨R1/2x, R1/2x
⟩)+ O(ε2)
= O(ε2).(7.18)
Thus, the first order correction to ‖ζx,d‖2pow evaluated at a constant control input x is zero,
which completes the proof.
Remark 7.6.3. Lemma 7.6.2 does not need Assumption 3 and it holds for all bilinear systems
of the form (7.12a) for which there is a stabilizing constant control input.
Lemma 7.6.2 implies that a constant control signal x?
x? ∈ argminx constant
‖ζx,d‖2pow,
154
is a stationary point of ‖ζx,d‖2pow under Assumption 3. Since ‖ζx,d‖2pow is a convex function of
x by Lemma 7.6.1, x? is a local and therefore global minimizer. The following theorem relates
the power norm of the output of system (7.12) subject to a constant disturbance, ‖ζx,d‖2pow,
with the worst case power norm amplification f(x) and shows that the constant control signal
x? solves Problem 2.
Theorem 7.6.4. Let Assumption 3 hold. Then, a constant control input x(t) = x? solves
Problem 2.
Proof. Let x? minimize f(x) over the space of constant functions. Since (7.12a) with a constant
control input is an LTI system, the maximum power-amplification coincides with the H∞ norm.
Moreover, because (7.12a) is a positive system, the maximal singular value of the frequency
response matrix peaks at zero temporal frequency and the worst-case disturbance is a constant
signal [54]. This implies that f(x?) = ‖ζx?,d‖2pow where d = v ≥ 0 is a constant nonnegative
disturbance and v is the right principal singular vector of the matrix −Q 12 (A+ F (x?))−1B.
Suppose there exists a time varying signal x ∈ B2 such that f(x) < f(x?). Then, since f
measures the worst-case disturbance amplification, x must also decrease the power norm of the
output of system (7.12) for a constant disturbance d = v, i.e.,
‖ζx,d‖2pow ≤ f(x) < ‖ζx?,d‖2pow = f(x?). (7.19)
We next show that this is not possible.
Every bounded x ∈ B2 can be written as x = x+ x where x is constant and x is bounded and
zero-mean. Since x? minimizes ‖ζx,d‖2pow over constant control inputs by definition, Lemma 7.6.2
implies that x? is a stationary point of ‖ζx,d‖2pow over all bounded x ∈ B2. Convexity of ‖ζx,d‖2pow
in x by Lemma 7.6.1 thus implies that x? is a local and therefore global minimizer of ‖ζx,d‖2pow,
contradicting (7.19) and completing the proof.
Chapter 8
Actuator or sensor selection
In traditional applications, controller or observer design deals with the problem of how to use
a pre-specified configuration of sensors and actuators in order to attain the desired objective.
In general, the best performance is achieved by using all of the available sensors or actuators.
However, this option may be computationally or economically infeasible. We thus consider the
problem of selecting a subset of available sensors or actuators in order to gracefully degrade
performance relative to the setup where all of them are used.
Typically, sensor/actuator selection and placement is performed by a designer with expert
knowledge of the system. However, in large-scale applications and systems with complex inter-
actions, it can be difficult to do this effectively. For linear time-invariant dynamical systems, we
develop a framework and an efficient algorithm to systematically choose sensors and actuators
via convex optimization.
Our starting point is a formulation with an abundance of potential sensors or actuators.
This setup can encode information about different types or different placements of sensing and
actuating capabilities. We consider the problem of selecting subsets of available options from
this full model. Applications of this formulation range from placement of Phasor Measurement
Units (PMUs) in power systems, to placement of sensors and actuators along an aircraft wing,
to the distribution of GPS units in a formation of multi-vehicle systems.
The problem of interest is a difficult combinatorial optimization problem. Although there is
a wide body of previous work in this area, most of the available literature either uses heuristic
methods or does not consider dynamical models. In [128], the authors provide a convex sensor
selection formulation for a problem with linear measurements. The authors of [229] select sparse
subsets of sensors to minimize the Cramer-Rao bound of a class of nonlinear measurement
155
156
models. The placement of PMUs in power systems was formulated as a variation of the optimal
experiment design in [230]. Actuator selection via genetic algorithms was explored in [231]. A
non-convex formulation of the joint sensor and actuator placement was provided in [232,233] and
it was recently applied to the linearized Ginzburg-Landau equation [234]. The leader selection
problem in consensus networks can be seen as a type of a structured joint actuator and sensor
selection problem which admits a convex relaxation [80] and even an analytical solution for one
or two leaders [78]. However, this formulation does not extend naturally to broader classes of
problems.
The sparsity-promoting framework introduced in [62–64] can be used to obtain block-sparse
structured feedback and observer gains and select actuators or sensors. Indeed, algorithms de-
veloped in [64] have been used in [235] for the sensor selection in a target tracking problem.
However, these algorithms have been developed for general structured control/estimation prob-
lems and they do not exploit the hidden convexity of the actuator/sensor selection problem.
In [104], the authors introduced a convex semidefinite programming (SDP) characterization
of the problem formulation considered in [64] for enhancing certain forms of sparsity in the
feedback gain. Although sensor and actuator selection falls into the class of problems considered
by [104], generic SDP solvers cannot handle large-scale applications. Since we are interested in
high-dimensional systems with many sensors/actuators, we use the alternating direction method
of multipliers (ADMM) [132] and proximal gradient [131] methods to develop a customized solver
that is well-suited for large problems. In contrast to standard SDP solvers, whose computational
complexity scales unfavorably with the number of states/sensors/actuators, the worst case per-
iteration complexity of our method scales only with the number of states. Furthermore, our
algorithm performs much better than standard SDP solvers in numerical experiments.
8.1 Problem formulation
8.1.1 Actuator selection
Consider the standard state-space system first presented in (1.2),
ψ = Aψ + B1 d + B2 u
ζ =
Q1/2
0
ψ +
0
R1/2
u
157
where d is a zero-mean white stochastic process with covariance Vd and the pair (A,B2) is
controllable. The optimal H2 controller minimizes the steady-state variance,
limt→∞
E(ψT (t)Qψ(t) + uT (t)Ru(t)
)where Q = QT 0 specifies a weight on the system states, and R = RT 0 specifies the penalty
on the control input. The global optimal controller for this problem is a state feedback law of
the form u = −Xψ. Although this controller is readily computed by solving the corresponding
algebraic Riccati equation, it typically uses all input channels and thus all available actuators.
We are interested in designing an optimal controller which uses a subset of the available
actuators. This will be accomplished by augmenting the H2 performance index with a term
that promotes row-sparsity of the feedback gain matrix X. The resulting problem can be cast
as a semidefinite program and thus solved efficiently for small problems.
SDP Formulation
Under the state feedback control law, u = −Xψ, the closed-loop system is given by
ψ = (A − B2X)ψ + B1 d
z =
Q1/2
−R1/2X
ψ. (8.1)
The H2 norm of system (8.1) is determined by
f2(X) = trace(QP + XTRX P
)where P = PT 0 is the controllability gramian of the closed loop system,
(A − B2X)P + P (A − B2X)T + B1 VdBT1 = 0.
Since P is positive definite and therefore invertible, the standard change of coordinates Z := XP
can be used to express f(X) in terms of P and Z [49],
f2(P,Z) = trace(QP + P−1 ZTRZ
)
158
and to bring the optimal H2 problem into the following form
minimizeP,Z
f2(P,Z)
subject to AP + PAT −B2Z − ZTBT2 + V = 0
P 0
(8.2)
with V := B1VdBT1 . By taking the Schur complement of P−1ZTRZ this problem can be
expressed as an SDP [49]. Finally, the optimal feedback gain can be recovered by X = ZP−1.
In what follows, we use this SDP characterization to introduce the actuator selection problem.
Sparsity Structure
When the ith row of X is identically equal to zero, the ith control input is not used. Therefore,
obtaining a control law which uses only a subset of available actuators can be achieved by
promoting row-sparsity of X. Our developments are facilitated by the equivalence between the
row-sparsity of X and Z; the ith row of Z is equal to zero if and only if the ith row of Z = XP
is equal to zero [104].
Drawing on the group-sparsity paradigm [122], we augment (8.2) with a sparsity-promoting
penalty on the rows of Z,
minimize f2(P,Z) + γ g2(Z)
subject to AP + PAT −B2Z − ZTBT2 + V = 0
P 0.
(8.3)
where γ > 0 specifies the importance of sparsity relative to H2 performance and
g2(Z) :=
m∑i= 1
wi ‖eTi Z‖2
is a row-sparsity-promoting penalty function in which wi are nonzero weights and ei is the ith
unit vector.
This problem can still be cast as an SDP and standard solvers can be used to find its
solution. However, since generic SDP solvers do not exploit the structure of (8.3), they do not
scale gracefully with problem dimension.
159
8.1.2 Sensor selection
The sensor selection problem can be approached in a similar manner. Consider a linear time-
invariant system,
ψ = As ψ + B1 d
y = C ψ + η
where d and η are zero-mean white stochastic processes with covariances Vd and Vη, respectively,
and (As, C) is an observable pair. The observer,
˙ψ = As ψ + L (y − y)
= As ψ + LC(ψ − ψ
)+ Lη
estimates the state x from the noisy measurements y using a linear injection term with an
observer gain L. For the Hurwitz matrix As − LC, the zero-mean estimate of ψ is given by ψ
and the dynamics of the estimation error ψ := ψ − ψ are governed by
˙ψ = (As − LC) ψ + B1 d − Lη. (8.4)
The variance amplification from the noise sources d and η to the estimation error ψ is determined
by
fo(L) = trace (PoB1 VdBT1 + Po LVη L
T ) (8.5)
where Po is the observability gramian of the error system (8.4),
(As − LC)T Po + Po (As − LC) + I = 0.
The Kalman filter gain L, resulting from the observer Riccati equation, provides an optimal
observer with the smallest variance amplification.
Our objective is to design a Kalman filter which uses a subset of the available sensors.
This can be achieved by enhancing column-sparsity of the observer gain L. Since the change of
coordinates Zs := PoL preserves column sparsity of L, we formulate the sensor selection problem
160
as
minimizePo, Zs
fo(Po, Zs) + γ g2(ZTs )
subject to ATs Po + PoAs − CTZTs − ZsC + I = 0
Po 0
(8.6)
where
fo(Po, Zs) = trace(PoB1 VdB
T1 + P−1
o Zs Vη ZTs
).
We note that the sensor selection problem (8.6) can be obtained from the actuator selection
problem (8.3) by setting the problem data in (8.3) to
A = ATs , B2 = CT , Q = B1 VdBT1
V = I, R = Vη
(8.7)
and recovering the variables Po = P and Zs = ZT .
8.2 Customized algorithm
We next develop an efficient algorithm for solving the actuator selection problem (8.3); the
solution to the sensor selection problem (8.6) can be obtained by mapping it to (8.3) via (8.7).
The challenges in solving the optimization problem (8.3) arise from
• the positive definite constraint;
• the linear Lyapunov-like constraint;
• the non-smoothness of the sparsity-promoting term.
To ensure positive definiteness of P , we use projected descent techniques to optimize over the
positive definite cone. We dualize the linear constraint and split (8.3) into two simpler subprob-
lems over P and Z via the alternating direction method of multipliers (ADMM). This splitting
separates the objective function component associated with the positive definite constraint (the
subproblem over P ) from the non-differentiable component (the subproblem over Z). Since the
subproblem over Z is efficiently solvable, our method scales well with the number of sensors or
actuators.
We employ a projected version of Newton’s method to solve the P -subproblem. The Newton
search direction is obtained via a conjugate gradient algorithm. The Z-subproblem is solved
161
with a proximal method.
8.2.1 Alternating direction method of multipliers
To employ ADMM, we first form the augmented Lagrangian corresponding to the optimization
problem (8.3),
Lµ(P,Z;Y ) := f2(P,Z) + γ g2(Z) + 〈Y, h(P,Z)〉 + 12µ‖h(P,Z)‖2F
where h(X,Y ) = 0 is the linear Lyapunov-like constraint and
h(P,Z) := AP + PAT − B2Z − ZTBT2 + V.
For this problem, the ADMM iteration discussed in Section 2.2.3 [132] is,
P k+1 = argminP
Lµ(P, Zk, Y k)
Zk+1 = argminZ
Lµ(P k+1, Z, Y k)
Y k+1 = Y k + 1µ h(P k+1, Zk+1).
The stopping criteria depend on the primal residual, which quantifies how well P k and Zk
satisfy the linear constraint, and the dual residual, which quantifies the difference between Zk
and Zk−1. We refer the reader to [132] for details.
We note that the proximal augmented Lagrangian reformulation of Lµ described in Sec-
tion 2.3 is not possible here because the linear constraint involves nondiagonal operators which
act on both P and Z.
8.2.2 P -minimization step
Minimization of Lµ over P is step (2.6a) in ADMM and can be equivalently expressed as
minimizeP
f(P,Zk) + 12µ ‖h(P,Zk) + µY k‖2F
subject to P 0.(8.8)
We use a projected version of Newton’s method to solve a sequence of quadratic approximations
to (8.8). In what follows, P l denotes the sequence of inner iterations that converges to the
solution of (8.8).
162
Newton’s method
At each inner iteration P l, Newton’s method does a line search from P l in the direction P l
which minimizes the quadratic approximation of (8.8),
P l := argminP
12
⟨HP l(P ), P
⟩+⟨∇PLµ(P l, Zk, Y k), P
⟩(8.9)
where HP l(P ) is a linear function of P that contains information about the Hessian of Lµ. The
gradient of Lµ with respect to P is given by
∇PLµ(P l, Zk, Y k) := Q − P−1ZTRZP−1 + AT(Y + 1
µ h(P,Z))
+(Y + 1
µ h(P,Z))A.
The quadratic term used in (8.9) is,
12 vec(P )T (∇2
PLµ) vec(P )
where ∇2PLµ is the Hessian of the objective function in (8.8). It can be more conveniently
expressed as 12
⟨H(P ), P
⟩where HP is the linear operator
HP (P ) := H1,P (P ) + 1µ H2(P ). (8.10)
The first term in (8.10) comes from the performance index,
H1,P (P ) := P−1ZTRZP−1 P P−1 + P−1 P P−1ZTRZP−1
and the second term comes from the constraint penalties,
H2(P ) := ATA P + P ATA + A P A + AT P AT .
Solving the linear equation
HP l(P ) + ∇PLµ(P l, Zk, Y k) = 0
yields the Newton direction P l which is computed using the conjugate gradient method [135].
163
Projection
The set C := P |P εI approximates the positive definite cone P 0 for small ε > 0. Once
the Newton direction is determined, the step
P = PC(P l + α P l
)is taken, where PC is the projection on the set C and the step size α is chosen using an Armijo
backtracking search.
To project the symmetric matrix M onto C, its eigenvalues are projected onto the set
λi ≥ ε. From the eigenvalue decomposition M = U diag(λ)UT , where λ is a vector of the
eigenvalues and U is a matrix of the corresponding eigenvectors, the projection is PC(M) =
U diag (max (λ, ε))UT .
While Netwon’s method reduces the number of required steps, computing the search direc-
tions can be prohibitively expensive for large-scale systems. In Section 8.4, we will explore a
proximal gradient algorithm for solving problem (8.3).
8.2.3 Z-minimization step
The Z-minimization step is equivalent to solving,
minimizeZ
γ g2(Z) + trace ((P k+1)−1ZTRZ) + 12µ‖h(P k+1, Z) + µY k‖2F . (8.11)
The objective function is the sum of a quadratic term and a separable sum of `2 norms: a problem
form commonly referred to as group LASSO. We employ the proximal gradient method [131] to
solve this problem. The gradient of the smooth part of (8.11) evaluated at Z is
2RZP−1 + 2µ
(BT2 Z
TBT2 +BT2 B2Z)− 2BT2
(Y + 1
µ (AP + PAT + V )).
The proximal operator for g2 is the block shrinkage operator,
Sα(eTi Z) =
(1 − α/‖eTi Z‖2
)eTi Z, ‖eTi Z‖2 > α
0, ‖eTi Z‖2 ≤ α.
164
8.2.4 Iterative reweighting
Inspired by [121], we employ an iterative reweighing scheme to select the weights wi in the
sparsity-promoting term∑i wi‖ai‖2 to obtain sparser structures at lower values of γ. The
authors in [121] noted that if wi = 1/‖ai‖2, then there is an exact correspondence between the
weighted `1 norm and the cardinality function. However, implementing such weights requires a
priori knowledge of the values ‖ai‖2 at the optimal a. Consequently, we implement a reweighting
scheme in which we run the algorithm multiple times for each value of γ and update the weights
as,
wj+1i =
1
‖ eTi Z ‖2 + ε(8.12)
where ε > 0 ensures that the update is always well-defined.
8.3 Example: Mass-spring system
We use a simple mass-spring system to illustrate the utility of our algorithm in the sensor
selection problem. This system has a clear intuitive interpretation and its dimension can be
easily scaled while retaining the problem structure. Consider a series of masses connected by
linear springs. With unit masses, unit spring constants and no friction, the dynamics of each
mass are described by
pi = −(pi − pi+1) − (pi − pi−1) + di
where pi is the position of the ith mass. When the first and last masses are affixed to rigid
bodies, the aggregate dynamics are given by p
v
=
0 I
−T 0
p
v
+
0
I
dwhere p, v, and d are the position, velocity and disturbance vectors, and T is a Toeplitz matrix
with 2 on the main diagonal and −1 on the first super- and sub-diagonals. The possible sensor
outputs are the position and velocity vectors.
8.3.1 Algorithm speed and computational complexity
The complexity of solving the sensor selection SDP with interior point methods is O((n+ r)6),
where n is the dimension of the state-space and r is the number of sensors. In our algorithm,
165
Com
pu
tati
on
tim
e(s
)
Number of states
Figure 8.1: Scaling of computation time with the number of states for CVX and for ADMM formass-spring system with γ = 100 and position and velocity outputs. Empirically, we observethat CVX scales roughly with n6 while ADMM scales roughly with n3.
the greatest cost is incurred by computing the Newton direction in the P -minimization step.
Since P ∈ Rn×n, the worst case complexity of computing Newton direction is O(n6). This
is because each conjugate gradient step takes O(n3) operations and, in general, n2 conjugate
gradient steps are required to obtain convergence. In well-conditioned problems, the conjugate
gradient method achieves high accuracy much faster which significantly reduces computational
complexity. The Z-minimization step has a computational cost of only O(nr), so the overall
cost per ADMM iteration is O(n6) unless r ≥ n5.
Figure 8.1 shows the time required by ADMM and by CVX [203, 236] to solve (8.6) with
γ = 100 for mass-spring systems of increasing sizes. Our algorithm scales favorably here and it
scales much better when only the number of sensors is varied. Figure 8.2 shows scaling with just
the number of outputs. As the number of sensors increases, CVX’s computation time increases
while the computation time of ADMM barely changes.
166
Com
pu
tati
on
tim
e(s
)
Number of sensors
Figure 8.2: Scaling of computation time with the number of sensors for CVX and for ADMMfor mass-spring system with γ = 100 and n = 50. Outputs are random linear combinations ofthe states.
8.3.2 Sensor selection
We consider a system with 20 masses (40 states) and potential position and velocity measure-
ments for each mass. As γ increases, sparser observer structures are uncovered at the cost
of compromising quality of estimation. The tradeoff between the number of sensors and the
variance of the estimation error is shown in Figure 8.3.
Figures 8.4 and 8.5 show the position and velocity sensor topologies identified by the ADMM
algorithm as γ is increased. To save space, only novel topologies are shown. The selected sensor
configurations have symmetric topology, which is expected for this example. Notably, velocity
measurements are, in general, more important but the locations of the most important position
and velocity sensors differ.
167
Per
cent
incr
ease
inf o
(L)
Number of sensors
Figure 8.3: Percent increase in fo(L) in terms of the number of sensors.
Mass index
Figure 8.4: Retained position sensors as γ increases. A blue dot indicates that the position ofthe corresponding mass is being measured. The top row shows the densest sensor topology, andthe bottom row shows the sparsest.
168
Mass index
Figure 8.5: Retained velocity sensors as γ increases. A blue dot indicates that the velocity ofthe corresponding mass is being measured. The top row shows the densest sensor topology, andthe bottom row shows the sparsest.
8.3.3 Iterative reweighting
Figure 8.6 illustrates the utility of iterative reweighting. When constant sparsity-promoting
weights are used, large values of γ are required to identify sparse structures. Here, we run our
ADMM algorithm 3 times for each value of γ, updating the weights using (8.12) and retaining
them as we increase γ.
8.4 Proximal gradient method
When the system is open-loop stable, it is possible to express the objective function solely in
terms of Z and to solve the regularized problem using proximal gradient descent. For simple
problems, such as the mass-spring system, this shows good performance relative to ADMM.
8.4.1 Elimination of P
The linear constraint can be abstractly written as,
A(P ) − B(Z) + V = 0.
169
Nu
mb
erof
sen
sors
use
d
γ
Figure 8.6: Number of sensors versus γ for a scheme which uses iterative reweighting and for thescheme which uses constant weights. Iterative reweighting promotes sparser structures earlier.
where
A(P ) := AP + PAT
B(Z) := B2Z + ZTBT2 .
We assume that the linear map A is invertible, which occurs if and only if the matrix A has no
eigenvalues with real part 0 and no eigenvalues of A have opposite real part; i.e, no eigenvalue
of A has a real part which is the negative of the real part of any other eigenvalue of A. We can
then write,
P = A−1 (B(Z) − V ) .
8.4.2 Gradient
For notational convenience, we partition the smooth part of the objective function into two
components,
fa(P ) := 〈Q, P 〉
fb(P,Z) :=⟨P−1, ZTRZ
⟩
170
Gradient of fa
Using the explicit expression for P and the properties of linear operators, the first component
of the objective function can be rewritten in terms of Z,
〈Q, P 〉 =⟨Q, A−1 (B(Z) − V )
⟩= −
⟨V, A−†(Q)
⟩+⟨Z, B†(A−†(Q))
⟩where † indicates the adjoint of a linear operator, and
A†(P ) = ATP + PA
B†(P ) = 2BT2 P.
Therefore,
∇Zfa = − 2B2TQad
where
AT Qad + QadA + Q = 0.
Gradient of fb
The gradient of fb can be computed via perturbation analysis. Note that,
(P + ε P )−1 = P−1 − εP−1 P P−1 + O(ε2).
and that since P is specified by a linear constraint,
A(P + ε P ) − B(Z + ε Z) + V = 0,
the variation of P as a result of a variation in Z is given by
P = A−1(B(Z)).
Perturbing fb, ⟨(P + ε P )−1, (Z + ε Z)T R (Z + ε Z)
⟩and collecting the order ε terms, 2
⟨Z, RZP−1
⟩−⟨Z, B†(A−†(P−1ZTRZP−1))
⟩. From this,
it is clear that
∇Zfb = 2RZP−1 − B†(A−†(P−1ZTRZP−1))
171
and so ∇f(Z) = 2BT2 (Rad − Qad) + 2RZP−1 where
AT Qad + QadA + Q = 0
AT Rad + RadA + P−1 ZT RZ P−1 = 0
AP + P AT − B2 Z − ZT BT2 + V = 0.
BB stepsize initialization
We implement proximal gradient descent and initialize the step-size using the Barzilai-Borwein
(BB) method, which attempts to approximate the Hessian with a scaled version of the identity
matrix. The secant equation,
∇f(Y k) ≈ ∇f(Y k−1) + H(Y k, Y k − Y k−1)
predicts the change in the gradient ∇f(Y ) as a result of a change in Y . Quasi-Newton methods
use the above in conjunction with observed changes in the gradient to build an estimate H of
the Hessian and determine a search direction by solving the linear equation,
H(Y k, Y ) = − ∇f(Y k). (8.13)
BB step size selection restricts the Hessian approximation to be a scaled version of the identity,
H(Z, Z) := 1α Z
with α > 0 and minimizes the residual of the secant equation,
‖∇f(Zk) − ∇f(Zk−1) − 1α (Zk − Zk−1)‖2F
over α to obtain,
αm,0 =‖Zk − Zk−1‖2F
〈Zk−1 − Zk, ∇f(Zk−1) − ∇f(Zk)〉(8.14)
We backtrack from αm,0 by selecting the smallest nonnegative integer r such that αm = crαm,0
with c ∈ (0, 1) such that Zk+1 is stabilizing (i.e., yields a positive definite P k+1 = A−1(B(Zk+1−V ))) and so that Y Z+1 satisfies either an ISTA-like or a SpaRSA-like step size selection rule.
172
ISTA-like step size selection rule
The step size α is chosen so that the approximation of the objective function is an overestimate
of the actual objective function
f(Zk+1) + g(Zk+1) ≤ f(Zk) +⟨Y , ∇f(Zk)
⟩+ 1
2α ‖Z‖2F + g(Zk+1)
where Z := Zk+1 − Zk [131].
SpaRSA-like step size selection rule
In this approach [187], the objective function is only checked every p iterations. At that iteration,
backtracking is used to ensure that,
f(Zk+1) < maxm= k− p,...,k
(f(Zm)) .
Otherwise, backtracking is just used to ensure stability.
8.4.3 Example: Damped mass-spring systems
Since this method requires open-loop stability, we must introduce some damping into the mass-
spring example explored earlier. For the damped mass-spring model with N masses, p
v
=
0 I
T − bI −bI
p
v
+ u + d
where T is a Toeplitz matrix with −2 on the main diagonal and 1 on the first super- and sub-
diagonals. The parameter b represents the strength of damping. Experiments were performed
with γ = 30, Q = I and R = 10I. Figure 8.7 shows the computation time for proximal
gradient with both ISTA (PG) and SpaRSA (PGs) step-size selection, CVX, ADMM, and an
MM algorithm with the proximal augmented Lagrangian from Chapter 2.
8.5 Example: Flexible wing aircraft
Finally, we present an example of the sensor selection algorithm applied to a practical example
from aerospace engineering. One barrier in reducing aircraft weight in order to improve fuel
173
Com
pu
tati
on
tim
e(s
)
Number of states
Figure 8.7: Computation time for mass-spring system damped with b = 0.1 and γ = 30.
Figure 8.8: Body Freedom Flutter flexible wing testbed aircraft.
174N
um
ber
of
sen
sors
use
d
γ
Per
cent
per
form
an
ced
egra
dati
on
Number of sensors
Figure 8.9: (a) Number of sensors as a function of the sparsity-promoting parameter γ; and (b)Performance comparison of the Kalman filter associated with the sets of sensors resulting fromthe regularized sensor selection problem and from truncation.
` 2n
orm
ofro
w
Sensor
Figure 8.10: The `2 norm of each column of the Kalman gain for Kalman filters designed fordifferent sets of sensors.
175
efficiency is that lighter airframes are more flexible and thus susceptible to vibrational instabili-
ties [237]. These instabilities, known as flutter, were behind the famous Tacoma Narrows Bridge
collapse and have been identified as the likely cause of the loss of NASA’s Helios Prototype
aircraft [238].
Recent work has sought to approach this problem by actively damping flutter instabili-
ties [239]. Since active control requires reliable detection of instabilities, selection of sensors is
an important challenge. For the Body Freedom Flutter test aircraft shown in Fig. 8.8 [240], we
use the dual formulation to the actuator selection approach described in Section 8.1.2 in con-
junction with iterative re-weighting algorithm to select sparse sets of sensors. Figure 8.9 shows
the number of sensors as a function of the sparsity-promoting parameter γ and the performance
of Kalman filter with the limited sets of sensors [105].
We compare the performance of the Kalman filter corresponding to the sensors selected
by our approach to the Kalman filter associated with sensors selected by truncation. For the
truncation approach, the Kalman gain matrix corresponding to a set of sensors was computed.
The sensor corresponding to the row with the lowest `2 norm was discarded and the Kalman
gain was recomputed for the new set of sensors. This process was repeated iteratively from the
full set of sensors to a set of two sensors. Clearly, the regularized sensor selection algorithm
selects better subsets of sensors than the truncation approach.
To further justify the use of this algorithm over truncation, we show the `2 norm of different
columns of the Kalman gain in Fig. 8.10. The sensor which has the most effect on the state
estimate changes as the number of sensors is decreased, even when each smaller set of sensors
is a subset of a previous set.
Chapter 9
Conclusions and future directions
Proximal augmented Lagrangian methods
In this thesis, we have developed custom optimization algorithms for composite problems by
introducing an auxiliary variable and reformulating the associated augmented Lagrangian. The
resulting function, which we call the proximal augmented Lagrangian, is continuously differen-
tiable and thus opens the door to many methods for solving the original problem.
The proximal augmented Lagrangian facilitates the method of multipliers by transforming
the primal-minimization step into a tractable, differentiable problem. Differentiability also en-
ables primal-dual gradient updates which, for certain problems, are convenient for a distributed
implementation. Generalizations of the Jacobian for once-differentiable functions allow us to de-
rive second order updates which lead to a globally exponentially convergent differential inclusion
as well as a globally convergent algorithm with asympotitic quadratic convergence.
Structured optimal control via regularized optimization
The motivation behind developing these algorithms was to use regularization to design structured
controllers. We successfully applied those algorithms to the nonconvex problems of edge addition
in directed consensus networks and sparse optimal control. We also identified classes of systems
for which structure-promoting optimal control problems can be cast as convex problems.
We first showed how the symmetric component of a system can be used to pose a regularized
convex controller design problem. The resulting controller is structured and has stability and
performance guarantees for the original system. We used this theory to inform the problems
of edge addition and leader selection in directed consensus networks as well as for combination
176
177
drug therapy design for HIV treatment.
We also considered the decentralized control of positive systems. We showed that the H2
and H∞ optimal problems are convex in the original coordinates, allowing us to pose convex
regularized problems to design structured controllers. We applied this theory to leader selection
in consensus networks and combination drug therapy design for HIV. We then established that
a constant controller is optimal for an induced-power performance index.
Finally, we considered the problem of choosing a limited number of actuators or sensors to
effectively control or observe a system. We posed this problem as a semidefinite program and
developed customized methods to solve it efficiently. We applied this approach to select sensors
for a flexible wing aircraft.
Future directions
Future algorithmic research will investigate how to deal with nonconvexity. Although we can
find local minima of nonconvex problems using the method of multipliers with the proximal
augmented Lagrangian, we have no guarantees on the quality of these local minima. It would
be interesting to explore techniques to discover a variety of local minima or to characterize the
quality of any particular local minimum. Moreover, it would be interesting to explore the use
of nonconvex regularizers, especially in distributed methods.
We will also work to develop techniques for structured time-varying or dynamic controllers.
So far, we have considered exclusively static structured controllers. Although we show that
a static controller achieves the optimal for decentralized control of positive systems with the
induced power norm, we cannot expect this to hold for more general systems or performance
indices. Investigating model predictive control, dynamic programming, or periodic structured
control approaches would be an interesting avenue of further research.
References
[1] Enrique Fernandez Cara and Enrique Zuazua Iriondo. Control Theory: History, Mathe-
matical Achievements and Perspectives. Boletın de la Sociedad Espanola de Matematica
Aplicada, 26, 79-140., 2003.
[2] Christopher Bissell. A History of Automatic Control. In Springer Handbook of Automation,
pages 53–69. Springer, 2009.
[3] Neculai Andrei. Modern Control Theory. Studies in Informatics and Control, 15(1):51,
2006.
[4] Vincent Blondel and John N Tsitsiklis. NP-hardness of some linear control design prob-
lems. SIAM J. Control Optim., 35(6):2118–2127, 1997.
[5] Minyue Fu and Zhi-Quan Luo. Computational complexity of a problem arising in fixed
order output feedback design. Syst. Control Lett., 30(5):209–215, 1997.
[6] Francesco Bullo, Jorge Cortes, and Sonia Martınez. Distributed Control of Robotic Net-
works. Princeton University Press, 2009.
[7] Mehran Mesbahi and Magnus Egerstedt. Graph Theoretic Methods in Multiagent Net-
works. Princeton University Press, 2010.
[8] Dragoslav D Siljak. Decentralized control of complex systems. Academic Press, New York,
1991.
[9] Bassam Bamieh, Fernando Paganini, and Munther A Dahleh. Distributed control of spa-
tially invariant systems. IEEE Trans. Automat. Control, 47(7):1091–1107, 2002.
[10] Gustavo A de Castro and Fernando Paganini. Convex synthesis of localized controllers for
spatially invariant system. Automat., 38:445–456, 2002.
178
179
[11] Petros G Voulgaris, Gianni Bianchini, and Bassam Bamieh. Optimal H2 controllers for
spatially invariant systems with delayed communication requirements. Syst. Control Lett.,
50:347–361, 2003.
[12] Raffaello D’Andrea and Geir E Dullerud. Distributed control design for spatially inter-
connected systems. IEEE Trans. Automat. Control, 48(9):1478–1495, 2003.
[13] Cedric Langbort, Ramu S Chandra, and Raffaello D’Andrea. Distributed control de-
sign for systems interconnected over an arbitrary graph. IEEE Trans. Automat. Control,
49(9):1502–1519, 2004.
[14] Dragoslav D Siljak and Aleksandar I. Zecevic. Control of large-scale systems: Beyond
decentralized feedback. Annual Reviews in Control, 29(2):169–179, 2005.
[15] Bassam Bamieh and Petros G Voulgaris. A convex characterization of distributed control
problems in spatially invariant systems with communication constraints. Syst. Control
Lett., 54(6):575–583, 2005.
[16] Michael Rotkowitz and Sanjay Lall. A characterization of convex problems in decentralized
control. IEEE Trans. Automat. Control, 51(2):274–286, Feb 2006.
[17] Francesco Borrelli and Tamas Keviczky. Distributed LQR design for identical dynamically
decoupled systems. IEEE Trans. on Automat. Control, 53(8):1901–1912, 2008.
[18] Javad Lavaei and Amir G Aghdam. Control of continuous-time LTI systems by means of
structurally constrained controllers. Automat., 44(1):141–148, 2008.
[19] Parikshit Shah and Pablo A Parrilo. H2 optimal decentralized control over posets: A state
space solution for state-feedback. In Proceedings of the 49th IEEE Conference on Decision
and Control, pages 6722–6727, 2010.
[20] John Swigart and Sanjay Lall. An explicit state-space solution for a decentralized two-
player optimal linear-quadratic regulator. In Proceedings of the 2010 American Control
Conference, pages 6385–6390, 2010.
[21] Makan Fardad and Mihailo R Jovanovic. Design of optimal controllers for spatially invari-
ant systems with finite communication speed. Automat., 47(5):880–889, May 2011.
[22] Karl Martensson and Anders Rantzer. A scalable method for continuous-time distributed
control synthesis. In Proceedings of the 2012 American Control Conference, pages 6308–
6313, 2012.
180
[23] Laurent Lessard and Sanjay Lall. Optimal controller synthesis for the decentralized two-
player problem with output feedback. In Proceedings of the 2012 American Control Con-
ference, pages 6314–6321, 2012.
[24] Aditya Mahajan, Nuno C Martins, Michael C Rotkowitz, and Serdar Yuksel. Information
structures in optimal decentralized control. In Proceedings of the 51st IEEE Conference
on Decision and Control, pages 1291–1306, 2012.
[25] Parikshit Shah and Pablo A Parrilo. H2 optimal decentralized control over posets: A
state-space solution for state-feedback. IEEE Trans. Automat. Control, 58(12):3084–3096,
2013.
[26] Javad Lavaei. Optimal decentralized control problem as a rank-constrained optimization.
In Proceedings of the 51st Annual Allerton Conference on Communication, Control, and
Computing, pages 39–45, 2013.
[27] Ghazal Fazelnia, Ramtin Madani, and Javad Lavaei. Convex relaxation for optimal dis-
tributed control problem. In Proceedings of the 53rd IEEE Conference on Decision and
Control, pages 896–903, 2014.
[28] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. Localized LQR optimal control.
In Proceedings of the 53rd IEEE Conference on Decision and Control, pages 1661–1668.
IEEE, 2014.
[29] Laurent Lessard and Sanjay Lall. Optimal control of two-player systems with output
feedback. IEEE Trans. Automat. Control, 60(8):2129–2144, 2015.
[30] Andrew Lamperski and John C Doyle. The H2 control problem for quadratically invariant
systems with delays. IEEE Trans. Automat. Control, 60(7):1945–1950, 2015.
[31] Takuto Mori, Takayuki Wada, and Yasumasa Fujisaki. Structured feedback gain design via
stabilizable dilation and LMIs. In Proceedings of the 54th IEEE Conference on Decision
and Control, pages 3343–3348, 2015.
[32] Mihailo R Jovanovic and Neil K Dhingra. Controller architectures: Tradeoffs between
performance and structure. Eur. J. Control, 30:76–91, July 2016.
[33] Yuh-Shyang Wang. Localized LQR with adaptive constraint and performance guarantee.
In Proceedings of the 55th IEEE Conference on Decision and Control, pages 2769–2776.
IEEE, 2016.
181
[34] Yuh-Shyang Wang and Nikolai Matni. Localized LQG optimal control for large-scale
systems. In Proceedings of the 2016 American Control Conference, pages 1954–1961.
IEEE, 2016.
[35] Yuh-Shyang Wang. A system level approach to optimal controller design for large-scale
distributed systems. PhD thesis, California Institute of Technology, 2017.
[36] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. System level parameterizations,
constraints and synthesis. In Proceedings of the 2017 American Control Conference, pages
1308–1315, May 2017.
[37] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. Separable and localized system
level synthesis for large-scale systems. arXiv preprint arXiv:1701.05880, 2017.
[38] Mihailo R Jovanovic and Bassam Bamieh. Componentwise energy amplification in channel
flows. J. Fluid Mech., 534:145–183, July 2005.
[39] Binh K Lieu and Mihailo R Jovanovic. Computation of frequency responses for linear
time-invariant PDEs on a compact interval. J. Comput. Phys., 250:246–269, October
2013.
[40] Binh K Lieu, Mihailo R. Jovanovic, and Satish Kumar. Worst-case amplification of dis-
turbances in inertialess Couette flow of viscoelastic fluids. J. Fluid Mech., 723:232–263,
May 2013.
[41] Binh K Lieu. Dynamics and control of Newtonian and viscoelastic fluids. PhD thesis,
University of Minnesota, 2014.
[42] Rashad Moarref. Model-based control of transitional and turbulent wall-bounded shear
flows. PhD thesis, University of Minnesota, 2012.
[43] Rashad Moarref and Mihailo R Jovanovic. Model-based design of transverse wall oscilla-
tions for turbulent drag reduction. J. Fluid Mech., 707:205–240, September 2012.
[44] Rashad Moarref and Mihailo R Jovanovic. Controlling the onset of turbulence by stream-
wise traveling waves. Part 1: Receptivity analysis. J. Fluid Mech., 663:70–99, November
2010.
[45] Binh K Lieu, Rashad Moarref, and Mihailo R Jovanovic. Controlling the onset of turbu-
lence by streamwise traveling waves. Part 2: Direct numerical simulations. J. Fluid Mech.,
663:100–119, November 2010.
182
[46] Armin Zare, Mihailo R Jovanovic, and Tryphon T Georgiou. Colour of turbulence. J.
Fluid Mech., 812:636–680, February 2017.
[47] Armin Zare. Low-complexity stochastic modeling of wall-bounded shear flows. PhD thesis,
University of Minnesota, 2016.
[48] Bruce A Francis. A Course in H∞ Control Theory. Springer-Verlag, New York, 1987.
[49] Geir E Dullerud and Fernando Paganini. A Course in Robust Control Theory. Springer,
2000.
[50] Yu Chi Ho and Kai-Ching Chu. Team decision theory and information structures in
optimal control problems–part I. IEEE Trans. Automat. Control, 17(1):15–22, 1972.
[51] Petros G Voulgaris. Control of nested systems. In Proceedings of the 2001 American
Control Conference, pages 4442–4445, 2000.
[52] Petros G Voulgaris. A convex characterization of classes of problems in control with
specific interaction and communication structures. In Proceedings of the 2001 American
Control Conference, pages 3128–3133, 2001.
[53] Lorenzo Farina and Sergio Rinaldi. Positive Linear Systems: Theory and Applications.
John Wiley & Sons, 2011.
[54] Takashi Tanaka and Cedric Langbort. The bounded real lemma for internally positive sys-
tems andH∞ structured static state feedback. IEEE Trans. Automat. Control, 56(9):2218–
2223, 2011.
[55] Corentin Briat. Robust stability and stabilization of uncertain linear positive systems via
integral linear constraints: L1 gain and L∞ gain characterization. Int. J. Robust Nonlin.,
23(17):1932–1954, 2013.
[56] Yoshio Ebihara, Dimitri Peaucelle, and Denis Arzelier. L1 gain analysis of linear positive
systems and its application. In Proceedings of 50th IEEE Conference on Decision and
Control, pages 4029–4034. IEEE, 2011.
[57] Anders Rantzer. Distributed control of positive systems. In Proceedings of 50th IEEE
Conference on Decision and Control, pages 6608–6611, Dec 2011.
183
[58] Patrizio Colaneri, Richard H Middleton, Zhiyong Chen, Danilo Caporale, and Franco
Blanchini. Convexity of the cost functional in an optimal control problem for a class of
positive switched systems. Automat., 4(50):1227–1234, 2014.
[59] Anders Rantzer and Bo Bernhardsson. Control of convex monotone systems. In Proceedings
of the 53rd IEEE Conference on Decision and Control, pages 2378–2383, 2014.
[60] Nader Motee and Ali Jadbabaie. Optimal control of spatially distributed systems. IEEE
Trans. Automat. Control, 53(7):1616–1629, 2008.
[61] Nader Motee and Qiyu Sun. Sparsity measures for spatially decaying systems. In Pro-
ceedings of the 2014 American Control Conference, pages 5459–5464, 2014.
[62] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. Sparsity-promoting optimal control for
a class of distributed systems. In Proceedings of the 2011 American Control Conference,
pages 2050–2055, 2011.
[63] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Sparse feedback synthesis via the
alternating direction method of multipliers. In Proceedings of the 2012 American Control
Conference, pages 4765–4770, Montreal, Canada, 2012.
[64] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Design of optimal sparse feedback
gains via the alternating direction method of multipliers. IEEE Trans. Automat. Control,
58(9):2426–2431, September 2013.
[65] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. On optimal link creation for facilitation
of consensus in social networks. In Proceedings of the 2014 American Control Conference,
pages 3802–3807, Portland, OR, 2014.
[66] George F Young, L. Scardovi, and Naomi E Leonard. Robustness of noisy consensus
dynamics with directed communication. In Proceedings of the 2010 American Control
Conference, pages 6312–6317, 2010.
[67] Lin Xiao, Stephen P Boyd, and Seung-Jean Kim. Distributed average consensus with
least-mean-square deviation. J. Parallel Distrib. Comput., 67(1):33–46, 2007.
[68] Daniel Zelazo and Mehran Mesbahi. Edge agreement: Graph-theoretic performance
bounds and passivity analysis. IEEE Trans. Automat. Control, 56(3):544–555, 2011.
184
[69] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. On the optimal synchronization of
oscillator networks via sparse interconnection graphs. In Proceedings of the 2012 American
Control Conference, pages 4777–4782, Montreal, Canada, 2012.
[70] David JT Sumpter, Jens Krause, Richard James, Ian D Couzin, and Ashley JW Ward.
Consensus decision making by fish. Current Biology, 18(22):1773–1777, 2008.
[71] Mehran Mesbahi and Fred Y Hadaegh. Formation flying control of multiple spacecraft via
graphs, matrix inequalities, and switching. J. Guid. Control Dyn., 24(2):369–377, 2001.
[72] Lin Xiao, Stephen P Boyd, and Sanjay Lall. A scheme for robust distributed sensor
fusion based on average consensus. In Proceedings of the 4th International Symposium on
Information Processing in Sensor Networks, pages 63–70, 2005.
[73] Arpita Ghosh and Stephen P Boyd. Growing well-connected graphs. In Proceedings of the
45th IEEE Conference on Decision and Control, pages 6605–6611, 2006.
[74] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Identification of sparse communication
graphs in consensus networks. In Proceedings of the 50th Annual Allerton Conference on
Communication, Control, and Computing, pages 85–89, Monticello, IL, 2012.
[75] Tyler Summers, Iman Shames, John Lygeros, and Florian Dorfler. Topology design for
optimal network coherence. In Proceedings of the 2015 European Control Conference,
pages 575–580. IEEE, 2015.
[76] Sepideh Hassan-Moghaddam and Mihailo R Jovanovic. Topology design for
stochastically-forced consensus networks. IEEE Trans. Control Netw. Syst., 2017.
doi:10.1109/TCNS.2017.2674962.
[77] Stacy Patterson and Bassam Bamieh. Leader selection for optimal network coherence.
In Proceedings of the 49th IEEE Conference on Decision and Control, pages 2692–2697,
2010.
[78] Katherine Fitch and Naomi E Leonard. Information centrality and optimal leader selection
in noisy networks. In Proceedings of the 52nd IEEE Conference on Decision and Control,
pages 7510–7515, 2013.
[79] Katherine Fitch and Naomi E Leonard. Joint centrality distinguishes optimal leaders in
noisy networks. IEEE Trans. Control Netw. Syst., 2014.
185
[80] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Algorithms for leader selection in
stochastically forced consensus networks. IEEE Trans. Automat. Control, 59(7):1789–
1802, July 2014.
[81] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. Algorithms for leader selection in large
dynamical networks: noise-free leaders. In Proceedings of the 50th IEEE Conference on
Decision and Control and European Control Conference, pages 7188–7193, Orlando, FL,
2011.
[82] Andrew Clark, Linda Bushnell, and Radha Poovendran. A supermodular optimization
framework for leader selection under link noise in linear multi-agent systems. IEEE Trans.
Automat. Control, 59(2):283–296, 2014.
[83] Andrew Clark, Basel Alomair, Linda Bushnell, and Radha Poovendran. Submodularity in
Dynamics and Control of Networked Systems. Springer, 2016.
[84] Andrew Clark, Basel Alomair, Linda Bushnell, and Radha Poovendran. Submodularity
in input node selection for networked linear systems: Efficient algorithms for performance
and controllability. IEEE Control Syst. Mag., 37(6):52–74, Dec 2017.
[85] Guodong Shi, Kin Cheong Sou, Henrik Sandberg, and Karl Henrik Johansson. A graph-
theoretic approach on optimizing informed-node selection in multi-agent tracking control.
Physica D: Nonlinear Phenomena, 267:104–111, 2014.
[86] Stacy Patterson, Neil McGlohon, and Kirill Dyagilev. Optimal k-leader selection for
coherence and convergence rate in one-dimensional networks. IEEE Trans. Control Netw.
Syst., 2016.
[87] Neil K Dhingra, Marcello Colombino, and Mihailo R Jovanovic. Leader selection in di-
rected networks. In Proceedings of the 55th IEEE Conference on Decision and Control,
pages 2715–2720, Las Vegas, NV, 2016.
[88] Neil K Dhingra, Marcello Colombino, and Mihailo R Jovanovic. Structured decentralized
control of positive systems with applications to combination drug therapy and leader
selection in directed networks. IEEE Trans. Control Netw. Syst., 2017. submitted.
[89] Vanessa D Jonsson, Anders Rantzer, and Richard M Murray. A scalable formulation for
engineering combination therapies for evolutionary dynamics of disease. Proceedings of
the 2014 American Control Conference, pages 2771–2778, 2014.
186
[90] Vanessa D Jonsson, Nikolai Matni, and Richard M Murray. Reverse engineering combina-
tion therapies for evolutionary dynamics of disease: An H∞ approach. In Proceedings of
the 52nd IEEE Conference on Decision and Control, pages 2060–2065, 2013.
[91] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. On the optimal design of structured
feedback gains for interconnected systems. In Proceedings of the 48th IEEE Conference
on Decision and Control, pages 978–983, Shanghai, China, 2009.
[92] Fu Lin, Makan Fardad, and Mihailo R Jovanovic. Augmented Lagrangian approach
to design of structured optimal state feedback gains. IEEE Trans. Automat. Control,
56(12):2923–2929, 2011.
[93] David L Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.
[94] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning. Springer, 2009.
[95] Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices.
Ann. Stat., pages 199–227, 2008.
[96] Tom Goldstein and Stanley Osher. The split Bregman method for `1-regularized problems.
SIAM J. Imaging Sci., 2(2):323–343, 2009.
[97] Bassam Bamieh, Mihailo R Jovanovic, Partha Mitra, and Stacy Patterson. Coherence
in large-scale networks: dimension dependent limitations of local feedback. IEEE Trans.
Automat. Control, 57(9):2235–2249, September 2012.
[98] Simone Schuler, Ping Li, James Lam, and Frank Allgower. Design of structured dynamic
output-feedback controllers for interconnected systems. International Journal of Control,
84(12):2081–2091, 2011.
[99] Makan Fardad and Mihailo R Jovanovic. On the design of optimal structured and sparse
feedback gains using semidefinite programming. In Proceedings of the 2014 American
Control Conference, pages 2438–2443, Portland, OR, 2014.
[100] Nikolai Matni and Venkat Chandrasekaran. Regularization for design. In Proceedings of
the 53rd IEEE Conference on Decision and Control, pages 1111–1118, Los Angeles, CA,
2014.
187
[101] Nikolai Matni. Communication delay co-design in H2-distributed control using atomic
norm minimization. IEEE Transactions on Control of Network Systems, 4(2):267–278,
June 2017.
[102] Nikolai Matni and Venkat Chandrasekaran. Regularization for design. IEEE Trans. Au-
tomat. Control, 61(12):3991–4006, 2016.
[103] Nikolai Matni. Distributed optimal control of cyber-physical systems: controller synthesis,
architecture design and system identification. PhD thesis, California Institute of Technol-
ogy, 2016.
[104] Boris Polyak, Mikhail Khlebnikov, and Pavel Shcherbakov. An LMI approach to structured
sparse feedback design in linear control systems. In Proceedings of the 2013 European
Control Conference, pages 833–838, 2013.
[105] Neil K Dhingra, Mihailo R Jovanovic, and Z. Q. Luo. An ADMM algorithm for optimal
sensor and actuator selection. In Proceedings of the 53rd IEEE Conference on Decision
and Control, pages 4039–4044, Los Angeles, CA, 2014.
[106] Xiaofan Wu and Mihailo R Jovanovic. Sparsity-promoting optimal control of systems with
symmetries, consensus and synchronization networks. Syst. Control Lett., 103:1–8, May
2017.
[107] Neil K Dhingra, Sei Zhen Khong, and Mihailo R Jovanovic. The proximal augmented La-
grangian method for nonsmooth composite optimization. IEEE Trans. Automat. Control,
2016. submitted; also arXiv:1610.04514.
[108] Florian Dorfler, Mihailo R Jovanovic, Michael Chertkov, and Francesco Bullo. Sparse
and optimal wide-area damping control in power networks. In Proceedings of the 2013
American Control Conference, pages 4295–4300, Washington, DC, 2013.
[109] Florian Dorfler, Mihailo R Jovanovic, M. Chertkov, and F. Bullo. Sparsity-promoting
optimal wide-area control of power networks. IEEE Trans. Power Syst., 29(5):2281–2291,
September 2014.
[110] Xiaofan Wu, Florian Dorfler, and Mihailo R Jovanovic. Input-output analysis and decen-
tralized optimal control of inter-area oscillations in power systems. IEEE Trans. Power
Syst., 31(3):2434–2444, May 2016.
188
[111] Xiaofan Wu, Florian Dorfler, and Mihailo R Jovanovic. Topology identification and design
of distributed integral action in power networks. In Proceedings of the 2016 American
Control Conference, pages 5921–5926, Boston, MA, 2016.
[112] Neil K Dhingra, Marcello Colombino, and Mihailo R Jovanovic. On the convexity of a
class of structured optimal control problems for positive systems. In Proceedings of the
2016 European Control Conference, pages 825–830, Aalborg, Denmark, 2016.
[113] Sepideh Hassan-Moghaddam and Mihailo R Jovanovic. An interior point method for grow-
ing connected resistive networks. In Proceedings of the 2015 American Control Conference,
pages 1223–1228, Chicago, IL, 2015.
[114] Sepideh Hassan-Moghaddam and Mihailo R Jovanovic. Customized algorithms for growing
connected resistive networks. In Proceedings of the 10th IFAC Symposium on Nonlinear
Control Systems, pages 986–991, Monterey, CA, 2016.
[115] Neil K Dhingra and Mihailo R Jovanovic. Convex synthesis of symmetric modifications to
linear systems. In Proceedings of the 2015 American Control Conference, pages 3583–3588,
Chicago, IL, 2015.
[116] Neil K Dhingra, Xiaofan Wu, and Mihailo R Jovanovic. Sparsity-promoting optimal control
of systems with invariances and symmetries. In Proceedings of the 10th IFAC Symposium
on Nonlinear Control Systems, pages 648–653, Monterey, CA, 2016.
[117] Makan Fardad, Fu Lin, and Mihailo R Jovanovic. Design of optimal sparse interconnec-
tion graphs for synchronization of oscillator networks. IEEE Trans. Automat. Control,
59(9):2457–2462, September 2014.
[118] Neil K Dhingra and Mihailo R Jovanovic. A method of multipliers algorithm for sparsity-
promoting optimal control. In Proceedings of the 2016 American Control Conference,
pages 1942–1947, Boston, MA, 2016.
[119] Neil K Dhingra, Sei Zhen Khong, and Mihailo R Jovanovic. A second order primal-dual
algorithm for non-smooth convex composite optimization. In Proceedings of the 56th IEEE
Conference on Decision and Control, Melbourne, Australia, 2017.
[120] Neil K Dhingra, Sei Zhen Khong, and Mihailo R Jovanovic. A second order primal-dual
method for nonsmooth convex composite optimization. IEEE Trans. Automat. Control,
2017. submitted; also arXiv:1709.01610.
189
[121] Emmanuel J Candes, Mike B Wakin, and Stephen P Boyd. Enhancing sparsity by
reweighted `1 minimization. J. Fourier Anal. Appl, 14:877–905, 2008.
[122] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped vari-
ables. J. R. Stat. Soc. Series B Stat. Methodol., 68(1):49–67, 2006.
[123] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
[124] Athanasios C Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.
[125] Armin Zare, Yongxin Chen, Mihailo R Jovanovic, and Tryphon T Georgiou. Low-
complexity modeling of partially available second-order statistics: theory and an efficient
matrix completion algorithm. IEEE Trans. Automat. Control, 62(3):1368–1383, March
2017.
[126] Armin Zare, Mihailo R Jovanovic, and Tryphon T Georgiou. Alternating direction op-
timization algorithms for covariance completion problems. In Proceedings of the 2015
American Control Conference, pages 515–520, Chicago, IL, 2015.
[127] Armin Zare, Mihailo R Jovanovic, and Tryphon T Georgiou. Completion of partially
known turbulent flow statistics. In Proceedings of the 2014 American Control Conference,
pages 1680–1685, Portland, OR, 2014.
[128] Siddharth Joshi and Stephen P Boyd. Sensor selection via convex optimization. IEEE
Trans. Signal Process., 57(2):451–462, 2009.
[129] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The
convex geometry of linear inverse problems. Foundations of Computational Mathematics,
12(6):805–849, 2012.
[130] Neal Parikh and Stephen P Boyd. Proximal algorithms. Foundations and Trends in
Optimization, 1(3):123–231, 2013.
[131] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[132] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed
optimization and statistical learning via the alternating direction method of multipliers.
Found. Trends Mach. Learning, 3(1):1–124, 2011.
190
[133] Dimitri P Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Aca-
demic Press, New York, 1982.
[134] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[135] Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer, 2006.
[136] Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex pro-
gramming. In Adv. Neural Inf. Process Syst., pages 379–387, 2015.
[137] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating
direction method of multipliers for a family of nonconvex problems. SIAM J. Optimiz.,
26(1):337–364, 2016.
[138] Ralph T Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer
Science & Business Media, 2009.
[139] Frank H Clarke. Optimization and Nonsmooth Analysis, volume 5. SIAM, 1990.
[140] Jerome Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized min-
imization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494,
2014.
[141] Yin Wang, Jose A Lopez, and Mario Sznaier. Sparse static output feedback controller
design via convex optimization. In Proceedings of the 53rd IEEE Conference on Decision
and Control, pages 376–381, 2014.
[142] Tankred Rautert and Ekkehard W Sachs. Computational design of optimal output feed-
back controllers. SIAM J. Optimiz., 7(3):837–852, 1997.
[143] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA
J. Numer. Anal., 8(1):141–148, 1988.
[144] Andrei Patrascu and Ion Necoara. Efficient random coordinate descent algorithms for
large-scale structured nonconvex optimization. J. Global Optim., 61(1):19–46, 2015.
[145] Kenneth J Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Studies in Linear and Non-
Linear Programming. Stanford University Press, 1958.
[146] Jing Wang and Nicola Elia. A control perspective for centralized and distributed convex
optimization. In Proceedings of the 50th IEEE Conference on Decision and Control and
the 10th European Control Conference, pages 3800–3805, 2011.
191
[147] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent
optimization. IEEE Trans. Automat. Control, 54(1):48–61, 2009.
[148] Diego Feijer and Fernando Paganini. Stability of primal–dual gradient dynamics and
applications to network optimization. Automat., 46(12):1974–1981, 2010.
[149] Ashish Cherukuri, Enrique Mallada, and Jorge Cortes. Asymptotic convergence of con-
strained primal–dual dynamics. Syst. Control Lett., 87:10–15, 2016.
[150] Ashish Cherukuri, Enrique Mallada, Steven Low, and Jorge Cortes. The role of con-
vexity on saddle-point dynamics: Lyapunov function and robustness. arXiv preprint
arXiv:1608.08586, 2016.
[151] Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: condi-
tions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization,
55(1):486–511, 2017.
[152] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of opti-
mization algorithms via integral quadratic constraints. SIAM J. Optimiz., 26(1):57–95,
2016.
[153] Bin Hu and Peter Seiler. Exponential decay rate conditions for uncertain linear systems
using integral quadratic constraints. IEEE Trans. Automat. Control, 61(11):3631–3637,
2016.
[154] Ross Boczar, Laurent Lessard, and Benjamin Recht. Exponential convergence bounds us-
ing integral quadratic constraints. In Proceedings of the 54th IEEE Conference on Decision
and Control, pages 7516–7521. IEEE, 2015.
[155] Ross Boczar, Laurent Lessard, Andrew Packard, and Benjamin Recht. Exponential sta-
bility analysis via integral quadratic constraints. arXiv preprint arXiv:1706.01337, 2017.
[156] Anders Rantzer. On the Kalman-Yakubovich-Popov lemma. Systems & Control Letters,
28(1):7–10, 1996.
[157] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth
separable minimization. Math. Program., 117(1-2):387–423, 2009.
[158] Richard H Byrd, Jorge Nocedal, and Figen Oztoprak. An inexact successive quadratic
approximation method for `1 regularized optimization. Math. Program., pages 1–22, 2015.
192
[159] Jason D Lee, Yuekai Sun, and Michael Saunders. Proximal Newton-type methods for
convex optimization. In Adv. Neural Inf. Process. Syst., pages 836–844, 2012.
[160] Cho-Jui Hsieh, Inderjit S Dhillon, Pradeep K Ravikumar, and Matyas A Sustik. Sparse
inverse covariance matrix estimation using quadratic approximation. In Adv. Neural Inf.
Process. Syst., pages 2330–2338, 2011.
[161] Cho-Jui Hsieh, Matyas A Sustik, Inderjit S Dhillon, and Pradeep Ravikumar. QUIC:
Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res.,
15:2911–2947, 2014.
[162] Philip E Gill and Daniel P Robinson. A primal-dual augmented Lagrangian. Comput.
Optim. Appl., 51(1):1–25, 2012.
[163] Paul Armand, Joel Benoist, Riadh Omheni, and Vincent Pateloup. Study of a primal-dual
algorithm for equality constrained minimization. Comput. Optim. Appl., 59(3):405–433,
2014.
[164] Paul Armand and Riadh Omheni. A globally and quadratically convergent primal-dual
augmented Lagrangian algorithm for equality constrained optimization. Optim. Method.
Softw., pages 1–21, 2015.
[165] Stephen Becker and Jalal Fadili. A quasi-Newton proximal splitting method. In Adv.
Neural Inf. Process. Syst., pages 2618–2626, 2012.
[166] Liqun Qi and Jie Sun. A nonsmooth version of Newton’s method. Math. Program.,
58(1):353–367, 1993.
[167] Stephen M Robinson. Local structure of feasible sets in nonlinear programming, Part III:
Stability and sensitivity. In Nonlinear Analysis and Optimization, pages 45–66. Springer,
1987.
[168] Jean-Baptiste Hiriart-Urruty and Claude Lemarechal. Convex analysis and minimization
algorithms I: Fundamentals, volume 305. Springer Science & Business Media, 2013.
[169] Godfrey Harold Hardy and Edward Maitland Wright. An Introduction to the Theory of
Numbers, volume 5. Oxford University Press, 1979.
[170] Kaifeng Jiang, Defeng Sun, and Kim-Chuan Toh. A partial proximal point algorithm for
nuclear norm regularized matrix least squares problems. Math. Program. Comp., 6(3):281–
325, 2014.
193
[171] Fanwen Meng, Defeng Sun, and Gongyun Zhao. Semismoothness of solutions to generalized
equations and the Moreau-Yosida regularization. Math. Program., 104(2):561–581, 2005.
[172] Joseph B Kruskal. Two convex counterexamples: A discontinuous envelope function and
a nondifferentiable nearest-point mapping. Proceedings of the American Mathematical
Society, 23(3):697–703, 1969.
[173] Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal Newton-type methods for
minimizing composite functions. SIAM J. Optimiz., 24(3):1420–1443, 2014.
[174] Dimitri P Bertsekas. Projected Newton methods for optimization problems with simple
constraints. SIAM J. Control Opt., 20(2):221–246, 1982.
[175] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc
B., 58(1):267–288, 1996.
[176] Panagiotis Patrinos, Lorenzo Stella, and Alberto Bemporad. Forward-backward truncated
Newton methods for large-scale convex composite optimization. 2014. arXiv:1402.6655.
[177] Lorenzo Stella, Andreas Themelis, and Panagiotis Patrinos. Forward–backward quasi-
Newton methods for nonsmooth optimization problems. Computational Optimization and
Applications, 67(3):443–487, 2017.
[178] Andreas Themelis, Lorenzo Stella, and Panagiotis Patrinos. Forward-backward envelope
for the sum of two nonconvex functions: Further properties and nonmonotone line-search
algorithms. 2016. arXiv:1606.06256.
[179] James M Ortega and Werner C Rheinboldt. Iterative Solution of Nonlinear Equations in
Several Variables. SIAM, 2000.
[180] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press,
1985.
[181] Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, Inc., 1964.
[182] Francisco Facchinei. Minimization of SC1 functions and the Maratos effect. Oper. Res.
Lett., 17(3):131–137, 1995.
[183] Hassan K Khalil. Nonlinear Systems. Prentice Hall, Third edition, 1996.
194
[184] Tecla De Luca, Francisco Facchinei, and Christian Kanzow. A semismooth equation ap-
proach to the solution of nonlinear complementarity problems. Math. Program., 75(3):407–
439, 1996.
[185] Andrew R Conn, Nicholas IM Gould, and Philippe Toint. A globally convergent augmented
Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM
J. Numer. Anal., 28(2):545–572, 1991.
[186] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized
linear models via coordinate descent. J. Stat. Softw., 33(1):1, 2010.
[187] Stephen J Wright, Robert D Nowak, and Mario AT Figueiredo. Sparse reconstruction by
separable approximation. IEEE Trans. Signal Processing, 57(7):2479–2493, 2009.
[188] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry
Gorinevsky. An interior-point method for large-scale `1-regularized least squares. IEEE
J. Sel. Topics Signal Process, 1(4):606–617, 2007.
[189] Junfeng Yang and Yin Zhang. Alternating direction algorithms for `1-problems in com-
pressive sensing. SIAM J. Sci. Comput., 33(1):250–278, 2011.
[190] David M Zoltowski, Neil K Dhingra, Fu Lin, and Mihailo R Jovanovic. Sparsity-promoting
optimal control of spatially-invariant systems. In Proceedings of the 2014 American Control
Conference, pages 1261–1266, Portland, OR, 2014.
[191] Ju Swift and Pierre C Hohenberg. Hydrodynamic fluctuations at the convective instability.
Physical Review A, 15(1):319, 1977.
[192] Ralph T Rockafellar. Augmented Lagrangians and applications of the proximal point
algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
[193] Bahman Gharesifard and Jorge Cortes. Distributed continuous-time convex optimization
on weight-balanced digraphs. IEEE Trans. Automat. Control, 59(3):781–786, 2014.
[194] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm
for decentralized consensus optimization. SIAM J. Optimiz., 25(2):944–966, 2015.
[195] Rajendra Bhatia. Matrix Analysis, volume 169. Springer, 1997.
195
[196] Bassam Bamieh and Mohammed Dahleh. Exact computation of traces and H2 norms for
a class of infinite dimensional problems. IEEE Trans. Automat. Control, 48(4):646–649,
2003.
[197] Franziska Michor, Yoh Iwasa, and Martin A Nowak. Dynamics of cancer progression.
Nature Reviews Cancer, 4(3):197–205, 2004.
[198] Bissan Al-Lazikani, Udai Banerji, and Paul Workman. Combinatorial drug therapy for
cancer in the post-genomic era. Nature Biotechnology, 30(7):679–692, 2012.
[199] Daniel S Rosenbloom, Alison L Hill, S Alireza Rabi, Robert F Siliciano, and Martin A
Nowak. Antiretroviral dynamics determines hiv evolution and predicts therapy outcome.
Nature Medicine, 18(9):1378–1385, 2012.
[200] Florian Klein, Ariel Halper-Stromberg, Joshua A Horwitz, Henning Gruell, Johannes F
Scheid, Stylianos Bournazos, Hugo Mouquet, Linda A Spatz, Ron Diskin, Alexander
Abadir, et al. HIV therapy by a combination of broadly neutralizing antibodies in human-
ized mice. Nature, 492(7427):118–122, 2012.
[201] Lloyd N Trefethen and Mark Embree. Spectra and pseudospectra: the behavior of nonnor-
mal matrices and operators. Princeton University Press, 2005.
[202] JAC Weideman and Satish C Reddy. A MATLAB differentiation matrix suite. ACM
Transactions on Mathematical Software, 26(4):465–519, December 2000.
[203] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex program-
ming, version 2.0 beta. http://cvxr.com/cvx, September 2013.
[204] Marcello Colombino and Roy S Smith. Convex characterization of robust stability anal-
ysis and control synthesis for positive linear systems. In Proceedings of the 53rd IEEE
Conference on Decision and Control, Los Angeles, California, 2014.
[205] Marcello Colombino and Roy S Smith. A convex characterization of robust stability
for positive and positively dominated linear systems. IEEE Trans. Automat. Control,
61(7):1965–1971, July 2016.
[206] Marcello Colombino. Robust and Decentralized Control of Positive Systems: a Convex
Approach. PhD thesis, ETH Zurich, 2016.
196
[207] Joel E Cohen. Convexity of the dominant eigenvalue of an essentially nonnegative matrix.
Proc. Am. Math. Soc., 81(4):657–658, 1981.
[208] Anders Rantzer. On the Kalman-Yakubovich-Popov lemma for positive systems. IEEE
Trans. Automat. Control, 61(5):1346–1349, 2016.
[209] Naum Z Shor. Minimization Methods for Non-differentiable Functions, volume 3. Springer
Science & Business Media, 2012.
[210] Jose YB Cruz. On proximal subgradient splitting method for minimizing the sum of two
nonsmooth convex functions. Set-Valued and Variational Analysis, 25(2):245–263, 2017.
[211] Dragos M Cvetkovic, Michael Doob, and Horst Sachs. Spectra of Graphs: Theory and
Application, volume 87. Academic Pr, 1980.
[212] Eugene L Lawler and David E Wood. Branch-and-bound methods: a survey. Operations
Research, 14(4):699–719, 1966.
[213] Weiran Wang and Canyi Lu. Projection onto the capped simplex. arXiv preprint
arXiv:1503.01002, 2015.
[214] Daniel Zelazo, Simone Schuler, and Frank Allgower. Performance and design of cycles in
consensus networks. Syst. Control Lett., 62(1):85–96, 2013.
[215] Duncan J Watts and Stephen H Strogatz. Collective dynamics of ‘small-world’ networks.
Nature, 393:440–442, 1998.
[216] John G White, Eileen Southgate, J Nichol Thomson, and Sydney Brenner. The structure
of the nervous system of the nematode Caenorhabditis Elegans: the mind of a worm. Phil.
Trans. R. Soc. Lond, 314:1–340, 1986.
[217] Shun Motegi and Naoki Masuda. A network-based dynamical system for competitive
sports. Scientific Reports, 2, 2012.
[218] Juyong Park and Mark EJ Newman. A network-based ranking system for US college
football. Journal of Statistical Mechanics: Theory and Experiment, 2005(10):P10014,
2005.
[219] Sports-Reference. Sports Reference – College Football 2015 schedule and results. http:
//www.sports-reference.com/cfb/years/2015-schedule.html, 2016. Accessed: 2016-
03-15.
197
[220] Associated-Press. The AP Top 25 Poll. http://collegefootball.ap.org/poll, 2016.
Accessed: 2016-03-15.
[221] Esteban Hernandez-Vargas, Patrizio Colaneri, Richard Middleton, and Franco Blanchini.
Discrete-time control for switched positive systems with application to mitigating viral
escape. Int. J. Robust Nonlin., 21(10):1093–1111, 2011.
[222] Vanessa D Jonsson, Nikolai Matni, and Richard M Murray. Synthesizing combination
therapies for evolutionary dynamics of disease for nonlinear pharmacodynamics. In Pro-
ceedings of the 53rd Conference on Decision and Control, pages 2352–2358. IEEE, 2014.
[223] Vanessa D Jonsson. Robust Control of Evolutionary Dynamics. PhD thesis, California
Institute of Technology, 2016.
[224] Abram S Besicovitch. Almost Periodic Functions. Dover Publications, 1954.
[225] Constantin Corduneanu. Almost Periodic Oscillations and Waves. Springer Science &
Business Media, 2009.
[226] Christian Constanda and Maria E Perez. Integral Methods in Science and Engineering.
Springer, 2010.
[227] Jorge Mari. A counterexample in power signals space. IEEE Trans. Automat. Control,
41(1):115–116, 1996.
[228] Petar Kokotovic, Hassan K Khalil, and John O’Reilly. Singular perturbation methods in
control: analysis and design. SIAM, 1999.
[229] Venkat Roy, Sundeep P Chepuri, and Geert Leus. Sparsity-enforcing sensor selection for
DOA estimation. In Proceedings of the 2013 IEEE International Workshop on Computa-
tional Advances in Multi-Sensor Adaptive Processing, pages 340–343, 2013.
[230] Vassilis Kekatos, Georgios B Giannakis, and Bruce Wollenberg. Optimal placement of
phasor measurement units via convex relaxation. IEEE Trans. Power Syst., 27(3):1521–
1530, 2012.
[231] James L Rogers. A parallel approach to optimum actuator selection with a genetic algo-
rithm. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, pages
14–17, 2000.
198
[232] Shinji Kondoh, Chikayoshi Yatomi, and Koichi Inoue. The positioning of sensors and
actuators in the vibration control of flexible systems. JSME Int. J., Series III, 33(2):145–
152, 1990.
[233] Kazuhiko Hiramoto, Hitoshi Doki, and Goro Obinata. Optimal sensor/actuator placement
for active vibration control using explicit solution of algebraic Riccati equation. J. Sound
Vib., 229(5):1057–1075, 2000.
[234] Kevin K Chen and Clarence W Rowley. H2 optimal actuator and sensor placement in the
linearised complex Ginzburg-Landau system. J. Fluid Mech., 681:241–260, 2011.
[235] Engin Masazade, Makan Fardad, and Pramod K Varshney. Sparsity-promoting extended
Kalman filtering for target tracking in wireless sensor networks. IEEE Signal Processing
Letters, 19:845–848, 2012.
[236] Michael Grant and Stephen P Boyd. Graph implementations for nonsmooth convex pro-
grams. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and
Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag
Limited, 2008.
[237] Martin J Brenner, Richard C Lind, and David F Voracek. Overview of Recent Flight
Flutter Testing Research at NASA Dryden, volume 4792. National Aeronautics and Space
Administration, Office of Management, Scientific and Technical Information Program,
1997.
[238] Thomas E Noll, John M Brown, Marla E Perez-Davis, Stephen D Ishmael, Geary C
Tiffany, and Matthew Gaier. Investigation of the Helios prototype aircraft mishap volume
I mishap report. NASA, 1, 2004.
[239] Jeffrey M Barker and Gary J Balas. Comparing linear parameter-varying gain-scheduled
control techniques for active flutter suppression. J. Guid. Control Dyn., 23(5):948–955,
2000.
[240] Claudia Moreno, Peter Seiler, and Gary J Balas. Linear parameter varying model reduction
for aeroservoelastic systems. In Proceedings of the AIAA Atmospheric Flight Mechanics
Conference, page 4859, 2012.
Appendix A
CrosswordSince the subject of my thesis is mathematical in nature, I thought it appropriate that I leave
something as an exercise to the reader.
199
200
Across1. Dirac , i.e. a 55-across
train5. In the style of
8. Preserves additivity andhomogeneity
14. Basilica part
15. Gaping hole
16. Non-rhyming fruit
17. It closes the loop
19. As a set which admitsa supporting hyperplaneat each point on itsboundary
20. Systems theory-focusedElsevier gathering acrossthe pond, abbr.
21. Sailor23. We Can!, 2008
Obama-Biden slogan
24. Stand for a golfer
25. McBeal, et. al
27. Ali who stole from fortythieves
30. Gruffly succinct
32. Road usage fee
36. All hot and bothered, ina bad way
38. Baboon genus
39. With 40-across, the sub-ject area of this thesis
40. See 39-across42. Sandwich cookies43. Generalizations of vol-
ume, such as those ofLebesgue or Borel
44. Western Cold War al-liance
45. Town official in charge ofmaking announcements
47. Trivial48. 2016 Green Party candi-
date Jill50. Private Twitter commu-
niques
52. Some Audi roadsters55. Dirac delta function57. Act whose wages is
death60. A Modest Proposal, e.g.
62. A property promoted by`1 regularization (e.g., inthe LASSO problem)
64. Ability
65. For profit diamond grad-ing agcy
66. Salt Lake City collegiateathletes
67. Functions as desired de-spite uncertainty
68. Professional organiza-tion whose members de-sign cars
69. Viral phenomenon
Down1. Small eatery with caf-
feinated beverages
2. Oil cartel
3. 1/60,000 min.
4. Crash site?5. Almond-flavored Italian
liqueur
6. Delicate fabric7. Maladroit8. Place9. Concept famously mis-
understood by AlanisMorissette in a 1996 hitsingle
10. What you get when youmultiply 0 by Inf in Mat-lab
11. Green emotion12. Author James13. Property of the green di-
nosaur in Toy Story
18. Abbr. for ancient dates22. American pro. football
assn.24. Snapbacks and
26. Some sculptures
27. Philosopher Francis
28. Ancient Greek publicsquare
29. French psychologist Al-fred who co-created IQtesting
31. Unagi
33. Carmen or Don Gio-vanni
34. Fruit famously used bythe British Navy to pre-vent scurvy
35. As a transmission linewith non-negligible re-sistnace
37. longa, vita brevis
38. Local frequency con-troller for a generator(abbr.)
40. Call, as a poker bet
41. Springtime householdliquidation
43. Opposite of plusses
45. Type of shoes favoredfor maritime Mafia exe-cutions
46. Hamilton formerly of theDetroit Pistons
49. Components of theMichelin Man’s body
51. French sea52. Former Winter Palace
resident53. With ”cat”, it’s a palin-
drome54. Retained ticket portion
56. Org. of female drivers?
57. Location58. Couple
59. Wall St. trading venue
61. Electronic sensing unitcomposed of accelerome-ters and gyroscopes
63. Result of addition
201
The solution to the crossword puzzle is on next page.
202
Solution