Nonlinear Programming - Department of Mathematical Sciences
Nonlinear Programming
Marc Teboulle
School of Mathematical Sciences
Tel-Aviv University, Ramat-Aviv, Israel
[email protected], http://www.math.tau.ac.il/teboulle
Tutorial Talk presented at the Summer School of ICCOPT-I, July 30-August 4, 2004, RPI, Troy
Marc Teboulle–Tel-Aviv University – p. 1
Opening Remarks...
Most optimization problems are not solvable!
This is very good news for us....
An Example: Present in 2 hours the basic and recent results on NLP!..... A typical constrained nonlinear (ill-posed) problem.....
What "minimal" (another optimization problem!) material should we learn/know?
Two main issues: Theory and Computation
Many commercial packages exist to solve NLP. However, these come in black-box form.
To understand how optimization methods work, their power and their limitations, and whether they do (or do not) solve a problem, we must understand the basic underlying theory.
Contents
Part A. Optimization Theory
Ideas and Principles
Convexity and Duality
Optimality Conditions
Part B. Optimization Algorithms
Basic and Classical Iterative Schemes
Convergence and Complexity issues
Modern Interior and Polynomial Methods
Part C. Some Recent Developments (..Biased of course!..)
Interior Proximal Algorithms
Smooth Lagrangian Multiplier Methods
Elementary Algorithms: Interior Gradient-like Schemes, suitable for very large scale problems
A Short History of Optimization....
Fermat (1629): Unconstrained Minimization Principle
...+160...Lagrange (1789) Equality Constrained Problems (Mechanics)
Calculus of Variations, 18-19th Century [Euler, Lagrange, Legendre, Hamilton...]
...+150...Karush (1939), Fritz-John (47), Kuhn-Tucker(51)
KKT Theorem for Inequality Constraints: Modern Optimization Era begins...
Engineering Applications (1960)
Optimal Control Bellman, Pontryagin...
Major Developments (50’s with LP) and 60-80’s for NLP
Polynomial Interior Points Methods for Convex Optimization Nesterov-Nemirovsky
(1988)
Combinatorial Problems via continuous approximations 90’s
....More theory, algorithmic developments, and many more specific models and applications ....
Nonlinear Programming: Formulation
(O) minimize {f(x) : x ∈ X ∩ C}
• X ⊂ Rn for implicit or simple constraints (here X ≡ Rn)
• C a set of explicit constraints described by
C = {x ∈ Rn : gi(x) ≤ 0, i = 1, . . . , m; hi(x) = 0, i = 1, . . . , p}.
All the functions in problem (O) are real-valued functions on Rn.
Important Special Case: X ∩ C ≡ Rn
The unconstrained minimization problem
(U) minimize{f(x) : x ∈ Rn}
Many methods for constrained problems eventually need to solve some type of problem (U)
Applications of NLP
OPTIMIZATION APPEARS TO BE PRESENT "ALMOST" EVERYWHERE....
Planning, Management Operations, Logistics (It all started with LP..)
Data Networks, Finance-Economics, VLSI Design
Pattern Recognition, Data Analysis/Mining, Resource Allocation
Mechanical/structural design, Chemical Engineering,...
Machine Learning, Classification
Signal Processing, Communication Systems, Tomography......
.....and of course in Mathematics Itself...!
Definitions and Terminology
(O) minimize{f(x) : x ∈ C}
A point x ∈ C is called a feasible solution of (O).
An optimal solution is any feasible point where the local or global minimum of f relative to C is actually attained.
Definition: Let Nε := Nε(x∗) ≡ a neighborhood of x∗. Then,
x∗ local minimum : f(x∗) ≤ f(x), ∀x ∈ C ∩ Nε
x∗ global minimum : f(x∗) ≤ f(x), ∀x ∈ C
x∗ strict local minimum : f(x∗) < f(x), ∀x ∈ C ∩ Nε, x ≠ x∗
Note: There are also ”max” problems...But...
max F ≡ −min[−F ]
How to Solve an Optimization Problem ?
Analytically/Explicitly: Very rarely....or Never....
We try to generate an Iterative (Descent) Algorithm to approximately solve the problem to a prescribed accuracy
Algorithm: a map A : x → y (start with x to get some new point y)
Iterative: generate a sequence of points, each calculated from the prior point (or points)
Descent: each new point y is such that f(y) < f(x)
Accuracy: eventually, we find some x̂ such that
f(x̂) − f(x∗) ≤ ε
A Powerful Algorithm..
Set k = 0, start with x0 somewhere
While xk ∉ D ≡ {set of desirable points} Do {
xk+1 = A(xk)
k ← k + 1
}
Stop
Expected Output(s): {xk} is a minimizing sequence:
f(xk) → f∗ (the optimal value) as k → ∞,
or, even more, xk → x∗, an optimal solution, denoted via
x∗ ∈ argmin{f(x) : x ∈ C} ≡ {x ∈ C : f(x) = inf f}
Some Basic Questions
How do we pick the initial starting point?
How to construct A so that xk converges to optimal x∗?
How do we stop the algorithm?
How close is the approximate solution to the optimal one? (which we do not know!!)
How sensitive is the whole process to data perturbations (small and large!)?
How do we measure the efficiency of an algorithm converging to optimality?
Computational cost per-iteration? Total complexity ?
Emerging Topics and Tools
To answer these questions, we need an appropriate mathematical theory and tools. For example:
Existence of optimal solutions
Optimality conditions
Convexity and Duality
Convergence and Numerical Analysis
Error and Complexity Analysis
While each algorithm for each type of problem will often require a specific analysis (e.g., exploiting special structure of the problem), the above tools remain essential and fundamental.
Convexity—–(See More in [A2-A4])
S ⊂ Rn is convex if the line segment joining any two points of S is
contained in it:
∀x, y ∈ S, ∀λ ∈ [0, 1] =⇒ λx + (1 − λ)y ∈ S
f : S → R is convex if for any x, y ∈ S and any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
A Key Fact: Local Minima are also Global under convexity
♣ Convexity plays a fundamental role in optimization, even in nonconvex problems...!
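The defining inequality is easy to probe numerically. The sketch below (a minimal check, assuming only NumPy; the quadratic test function is an arbitrary convex choice, not one from the lecture) samples random point pairs and verifies the convexity inequality at each:

```python
import numpy as np

# Numerically probe the convexity inequality for f(x) = ||x||^2 (convex):
# f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y), for all x, y, lam in [0, 1].
def f(x):
    return float(np.dot(x, x))

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    if lhs > rhs + 1e-12:
        violations += 1

print(violations)  # → 0: the inequality holds at every sampled pair
```

Such a sampled check can of course only refute convexity, never prove it; a single violation suffices to show a function is nonconvex.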
A Simple and Powerful Geometric Result: Separation
Any point outside a nonempty closed convex set C ⊂ Rn can be separated from C by a hyperplane H, i.e., let y ∉ C; then
∃ 0 ≠ a ∈ Rn, α ∈ R : 〈a, x〉 ≤ α < 〈a, y〉, ∀x ∈ C,
and H := {x ∈ Rn : 〈a, x〉 = α}
This result is fundamental and with far reaching consequences, e.g.,
Alternative Theorems [see A6]
Optimality Conditions
Duality
Existence of Minimizers: inf{f(x) : x ∈ C}
When will f : C → R attain its infimum over C ⊂ Rn?
Classical Answer–Weierstrass Theorem: A continuous function defined on a compact subset of Rn attains its minimum.
This is a topological problem
How do we get "useful" conditions for testing existence? Mimic Weierstrass..
Pick x0 ∈ C, Lf := {x | f(x) ≤ f(x0)}, and consider the equivalent problem
inf{f(x) : x ∈ Lf}
Suitable "compactness" and "continuity" w.r.t. Lf , i.e.,
♣ Study behavior of subsets of Rn at infinity
Asymptotic Cones and Functions: A short appetizer
Let ∅ ≠ C ⊂ Rn be closed and convex,
f : Rn → R ∪ {+∞} proper, lsc, convex [see A1]
Definition–[A-Cone] The asymptotic cone of C,
C∞ := {d ∈ Rn : d + C ⊂ C}
Proposition A set C ⊂ Rn is bounded iff C∞ = {0}
Definition–[A-Function] The asymptotic function f∞ of f is defined by
epi (f∞) = (epi f)∞
Proposition The A-function is also convex, and for any d ∈ Rn,
f∞(d) = lim_{t→∞} [f(x + td) − f(x)]/t, ∀x ∈ dom f
Back to the existence of minimizers
One has to study (Lf )∞. It turns out that (in the convex case)
(Lf )∞ = {d ∈ Rn | f∞(d) ≤ 0}
♣ Topological questions can be handled via Calculus Rules at infinity
Example: (P ) inf{f0(x) : fi(x) ≤ 0, i ∈ [1, m]}; convex
Result: The optimal solution set is nonempty and compact iff
(fi)∞(d) ≤ 0, ∀i ∈ [0, m] =⇒ d = 0
A shameless commercial! For general results and applications, see
Asymptotic Cones and Functions in Optimization and Variational Inequalities
A. Auslender and M. Teboulle, Springer Monographs in Mathematics, 2003.
Optimality for Unconstrained Minimization
(U) inf{f(x) : x ∈ Rn}, f : Rn → R a smooth function.
Fermat Principle: Let x∗ ∈ Rn be a local minimum. Then,
♠ ∇f(x∗) = 0
This is a First Order Necessary condition
If ∇f(x∗) = 0, then x∗ is a stationary point
A local minimum must be a stationary point
Second Order Necessary Cond.: Nonnegative curvature at x∗:
the Hessian matrix ∇2f(x∗) ⪰ 0 (positive semidefinite)
Sufficient condition for x∗ to be a local minimum: ∇f(x∗) = 0 and ∇2f(x∗) ≻ 0 (positive definite)
Whenever f is assumed convex, ♠ becomes a sufficient condition for x∗ to be a global minimum of f.
Equality constraints: Lagrange Theorem
(E) min{f(x) : h(x) = 0, x ∈ Rn}
where f : Rn → R, h : Rn → Rp with (f, h) ∈ C1
Lagrange Theorem (necessary conditions): Let x∗ be a local minimum for problem (E). Assume:
(A) {∇h1(x∗), . . . ,∇hp(x∗)} are linearly independent
Then there exists a unique y∗ ∈ Rp satisfying:
∇f(x∗) + Σ_{k=1}^{p} y∗k ∇hk(x∗) = 0
Inequality constraints lead to more complications....
Optimality Conditions in NLP
(P ) inf{f(x) : x ∈ C}, C ⊂ Rn; f : C → R an arbitrary function
A Basic Optimality Criteria-BOC Let x∗ ∈ C and assume that the directional derivative
of f at x∗ exists.
a. Necessary condition: If x∗ is a local minimum of f over C then
f ′(x∗, x − x∗) ≥ 0, ∀x ∈ C.
If f is C1, then the above reduces to 〈x − x∗, ∇f(x∗)〉 ≥ 0, ∀x ∈ C.
b. Sufficient condition: Suppose that f is also convex. Then, condition (a) is also a
sufficient condition for x∗ to be a (global) minimum.
Geometric reformulation: through the Normal Cone to a set C at a point x̄, and defined by
NC(x̄) := {d ∈ Rn |〈d, x − x̄〉 ≤ 0, ∀x ∈ C}
Thus one has 〈x − x∗, ∇f(x∗)〉 ≥ 0, ∀x ∈ C ⇐⇒ 0 ∈ ∇f(x∗) + NC(x∗).
A Variational Inequality and a Generalized Equation
A Useful Formula for Directional Derivatives
Let {gi}, i = 0, . . . , m, be smooth functions over Rn and define g(x) := max_{0≤i≤m} gi(x).
Note: even though gi are smooth, this is not necessarily the case for g.
Proposition Assume for each i that gi is continuous and differentiable at x∗.
Then,
g′(x∗, d) = max{〈d,∇gi(x∗)〉 : i ∈ K(x∗)}, ∀d ∈ Rn
where K(x∗) := {i ∈ [0, m] : gi(x∗) = g(x∗) = max0≤i≤m gi(x∗)}
Now observe that with F(x) := max{f(x) − f(x∗), g1(x), . . . , gm(x)},
x∗ ∈ argmin{f(x) : g(x) ≤ 0} solves inf_x F(x)
Applying the BOC with an Alternative Theorem [namely, the use of separation!] + the above formula is all we need to derive the fundamental theorems characterizing optimality for constrained problems
First Order Optimality Conditions-Fritz-John Theorem
(P ) inf{f(x) : g(x) ≤ 0, x ∈ Rn}
where f : Rn → R, g : Rn → Rm and (f, g) ∈ C1
Let x∗ be a local minimum for problem (P).
Primal form: Then ∄ d ∈ Rn s.t.:
〈d, ∇f(x∗)〉 < 0 and 〈d, ∇gi(x∗)〉 < 0, ∀i ∈ I(x∗) := {i : gi(x∗) = 0}
Dual Form: Then there exist λ0, λi ∈ R+ (i ∈ I(x∗)), not all zero, satisfying:
λ0 ∇f(x∗) + Σ_{i∈I(x∗)} λi ∇gi(x∗) = 0
The weakness of the FJ conditions: λ0 ∈ R+ can be equal to zero. To avoid this, we need a further hypothesis on the problem's data, called a constraint qualification
Constraint Qualifications
(P ) inf{f(x) : g(x) ≤ 0, x ∈ Rn}
with f : Rn → R, g : Rn → Rm, smooth.
I(x) := {i : gi(x) = 0} is the set of active constraints.
CQ are crucial regularity conditions on the problem's data to derive optimality and duality results.
Linear independence ( LI): {∇gi(x∗)}i∈I(x∗) are linearly indep.
Mangasarian-Fromovitch (MF):
∃ d ∈ Rn : 〈d,∇gi(x∗)〉 < 0, ∀i ∈ I(x∗)
Slater (S): ∃ x̂ : gi(x̂) < 0, ∀i = 1, . . . , m
One has the following relations between these (CQ):
(LI) =⇒ (MF ) =⇒ (S)
The KKT Theorem – A System of Eqs and Inequalities
(P ) inf{f(x) : g(x) ≤ 0, x ∈ Rn}
Let x∗ be a local minimum for problem (P), and assume that (MF-CQ) holds. Then ∃ y∗ ∈ Rm+ s.t.
∇f(x∗) + Σ_{i=1}^{m} y∗i ∇gi(x∗) = 0, [Saddle pt. in x∗]
gi(x∗) ≤ 0, ∀i ∈ [1, m], [Feasibility ≡ Sad. pt. in y∗]
y∗i gi(x∗) = 0, i = 1, . . . , m [Complementarity]
With convex data + (CQ), the KKT conditions become necessary and sufficient for global optimality
For general NLP (mixed eqs./ineqs.), more optimality conditions (first and second order), see [A7]
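The KKT system can be checked numerically on a concrete instance. A minimal sketch (this toy problem and its known solution are assumptions for illustration, not from the slides): minimize x1² + x2² subject to g(x) = 1 − x1 − x2 ≤ 0, whose solution is x∗ = (1/2, 1/2) with multiplier y∗ = 1.

```python
import numpy as np

# Toy instance: min x1^2 + x2^2  s.t.  g(x) = 1 - x1 - x2 <= 0.
# Claimed solution: x* = (1/2, 1/2), y* = 1; verify the three KKT conditions.
x_star = np.array([0.5, 0.5])
y_star = 1.0

grad_f = 2.0 * x_star            # gradient of the objective at x*
grad_g = np.array([-1.0, -1.0])  # gradient of the constraint
g_val = 1.0 - x_star.sum()       # constraint value g(x*)

stationarity = grad_f + y_star * grad_g          # should be the zero vector
feasible = g_val <= 1e-12                        # g(x*) <= 0
complementarity = abs(y_star * g_val) <= 1e-12   # y* g(x*) = 0

print(np.allclose(stationarity, 0.0), feasible, complementarity)
```

Since the data here are convex and Slater's condition holds (e.g., x̂ = (1, 1)), satisfying the KKT system certifies global optimality of x∗.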
Duality: The Lagrangian
(P ) f∗ := inf{f(x) : g(x) ≤ 0, x ∈ Rn}, f : Rn → R, g : Rn → Rm
We assume that there exists a feasible solution for (P) and f∗ ∈ R.
Observation: (P ) ⇐⇒ inf_{x∈Rn} sup_{y≥0} {f(x) + 〈y, g(x)〉}
Lagrangian associated with (P), L : Rn × Rm+ → R:
L(x, y) = f(x) + 〈y, g(x)〉 ≡ f(x) + Σ_{i=1}^{m} yi gi(x).
Definition A vector y∗ ∈ Rm is called a Lagrangian multiplier for (P) if
y∗ ≥ 0 and f∗ = inf{L(x, y∗) : x ∈ Rn}.
Lagrangian Duality • inf_{x∈Rn} sup_{y∈Rm+} L(x, y)
Hidden in this equivalent min-max formulation of (P) is another problem, called the Dual.
Suppose we reverse the inf and sup operations:
sup_{y∈Rm+} inf_{x∈Rn} L(x, y)
Define the Dual Function:
h(y) := inf_{x∈Rn} L(x, y), dom h = {y ∈ Rm : h(y) > −∞},
and the Dual Problem:
(D) h∗ := sup{h(y) : y ∈ Rm+ ∩ dom h}
Note: To avoid h(·) = −∞, additional constraints often emerge through y ∈ dom h.
Dual problem Properties
The Dual Problem: Uses the same data
(D) h∗ = sup_y {h(y) : y ∈ Rm+ ∩ dom h}, h(y) = inf_x L(x, y)
Properties of (P)-(D)
Dual objective h is always concave
Dual problem (D) is always convex (a max of concave functions)
Weak duality holds: f∗ ≥ h∗ for any feasible pair of (P)-(D)
Valid for any optimization problem. No convexity or any other assumptions on the primal data!
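Weak duality is easy to see on a one-dimensional instance (an assumed toy example, not from the slides): for f∗ = inf{x² : 1 − x ≤ 0} = 1, the Lagrangian L(x, y) = x² + y(1 − x) gives, minimizing over x at x = y/2, the concave dual function h(y) = y − y²/4; every dual value sits below f∗, and here the gap closes at y∗ = 2.

```python
import numpy as np

# Primal: f* = inf{x^2 : 1 - x <= 0} = 1 (attained at x = 1).
# Dual function (closed form for this instance): h(y) = y - y^2/4.
def h(y):
    return y - y**2 / 4.0

f_star = 1.0
ys = np.linspace(0.0, 10.0, 1001)              # sample of the dual feasible ray y >= 0
weak_duality = bool(np.all(h(ys) <= f_star + 1e-12))
h_star = float(h(ys).max())                    # best sampled dual bound, near h(2) = 1

print(weak_duality, round(h_star, 6))
```

This also illustrates the zero-duality-gap situation for convex data with a constraint qualification: h∗ = f∗ = 1.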
Duality: Key Questions for the pair (P)-(D)
f∗ = inf{f(x) : g(x) ≤ 0, x ∈ Rn}; h∗ = sup_y {h(y) : y ∈ Rm+}
• Zero Duality Gap: when is f∗ = h∗?
• Strong Duality: when are the inf/sup attained?
• Structure/Relations of Primal-Dual Optimal Sets/Solutions
Convex data + a Constraint Qualification on the constraints deliver the answers.
Proof.
Based on the simple geometric separation argument.
inf/sup attainment + structure of optimal sets via asymptotic function calculus
Convex problems are the ”Nice NLP”...and much more...
Are there Many Convex Problems?
..More than we used to think... [sometimes after a transformation, e.g., Geometric Programs]
Remember, the dual of any optimization problem is always convex... it can be used at least to approximate the original primal...
Useful Convex Models: Conic Problems
min{〈c, x〉 : A(x) = b, x ∈ K}
K is a closed convex cone in some finite dimensional space X
〈·, ·〉 an appropriate inner product on X
A is a linear map
Example: Linear Programming
X ≡ Rn, K ≡ Rn+, A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and 〈·, ·〉 the scalar product in Rn
....Other Examples...?
Semidefinite Programming
Primal: min_{x∈Rm} {cT x : A(x) ⪰ 0}
Dual: max_{Z∈Sn} {− tr(A0 Z) : tr(Ai Z) = ci, i ∈ [1, m], Z ⪰ 0}
Here, tr is the trace operator and
A(x) := A0 + Σ_{i=1}^{m} xi Ai, each Ai ∈ Sn ≡ the symmetric matrices
Primal: x ∈ Rm decision variables; A(x) ⪰ 0 is a linear matrix inequality.
Dual in Conic Form: Z ∈ Sn decision variables; K ≡ Sn+ is the closed convex cone of p.s.d. matrices
SDP Features and Applications
♦ Features
SDPs are a special class of convex (nondifferentiable) problems
Computationally tractable: can be approximately solved to a desired accuracy in polynomial time
A very active research area since mid 90’s
♦ Applications–A Short list...!
Combinatorial optimization, Computational Geometry
Control theory, Statistics, Classification problems
Other useful conic model : Second order cone programming...
Part B. Optimization Algorithms
Tractability is a key Issue
What optimization problems can we solve?
How do we solve them?
At what cost? [Our computers have limited memory... and we do not want to wait too long [..forever..] for a solution!]
We need to draw a line between Easy and Hard Problems
Convexity plays a key role in this distinction
Easy/Hard: Example
(P1) max{Σ_{j=1}^{n} xj : xj² − xj = 0, j = 1, . . . , n; xi xj = 0 ∀(i, j) ∈ E}

(P2) inf x0 subject to

Σ_{j=1}^{m} xj = 1,   Σ_{j=1}^{m} aj xj^l = bl, l = 1, . . . , k

λmin ⎛ x1               x1^l ⎞
     ⎜      ⋱           ⋮  ⎟
     ⎜           xm     xm^l ⎟  ≥ 0,  l = 1, . . . , k
     ⎝ x1^l  ⋯  xm^l   x0  ⎠

x ∈ Rm+1, x^l ∈ Rm, l = 1, . . . , k

(P1) "looks" much easier than (P2)...
Easy/Hard: Example Ctd.
(P1) max{Σ_{j=1}^{n} xj : xj² − xj = 0, j = 1, . . . , n; xi xj = 0 ∀i ≠ j ∈ Γ}

(P2) min{x0 : λmin(A(x, x^l)) ≥ 0, Σ_{j=1}^{m} aj xj^l = bl, l ∈ [1, k], Σ_{j=1}^{m} xj = 1}

where A(x, x^l) is affine in x0, x1, . . . , xm, x1^l, . . . , xm^l.
♠ (P1) has an easy formulation but is as difficult as an optimization problem can be! The worst-case computational effort for n = 256 is 2^256 ≈ 10^77 ≈ +∞!
♠ (P2) has a complicated formulation but is easy to solve! For m = 100, k = 6 =⇒ 701 variables (≈ 3 times larger), solved in less than 2 minutes to 6 digits of accuracy!
convex (P2) [slow growth in (n, ε)] vs. nonconvex (P1) [very fast growth in (n, ε)]
Toward Computation: Approximation Models
Approximation: replace a complicated function by a "simpler" one, close enough to the original. This is the "bread" (and butter..) of numerical analysis
Linear approximation: Suppose f is differentiable at x. Then, for any y ∈ Rn (notation f′(x) ≡ ∇f(x)):
f(y) = f(x) + 〈f′(x), y − x〉 + o(‖y − x‖); lim_{t↘0} t⁻¹ o(t) = 0
Quadratic approximation: Suppose f is twice differentiable at x (with Hessian f″(x) ≡ ∇2f(x)). Then,
f(y) = f(x) + 〈f′(x), y − x〉 + (1/2)〈f″(x)(y − x), y − x〉 + o(‖y − x‖²).
These models are local. Thus, the resulting schemes based on them will share the same local character.
A Generic Unconstrained Minimization Algorithm
(U) min{f(x) : x ∈ Rn}, f ∈ C1(Rn)
Start: with x ∈ Rn such that ∇f(x) ≠ 0.
Compute new point: x+ = x + td, with d ∈ Rn and t > 0 chosen such that we can guarantee
f(x+) = f(x + td) < f(x)
Since f(x + td) = f(x) + t〈d, ∇f(x)〉 + o(t),
a simple choice could be as follows:
d ∈ Rn with 〈d, ∇f(x)〉 < 0 is called a descent direction
t ∈ (0, +∞) is a stepsize. How far to go in direction d is decided by a line search.
This leads to the simplest scheme: The Gradient Method
The Gradient Method
x0 ∈ Rn, xk+1 = xk + tk dk
where tk > 0 is the step size. Indeed, with dk = −f′(xk) ≠ 0, we have
〈dk, f′(xk)〉 = −‖f′(xk)‖² < 0
Thus, it is reasonable to choose a positive step size. There exist many variants for the choice of tk:
fixed step size: tk := t > 0, ∀k.
full/exact line search: find tk := argmin_{t≥0} f(xk + t dk). Used only when this can be solved analytically or efficiently.
Inexact line search: step size chosen to approximately minimize f along the ray {x + td | t ≥ 0}. This is the most commonly used in practical algorithms, e.g., the Armijo line search, see [B3].
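A minimal sketch of the gradient method with an Armijo-type backtracking line search, on an assumed quadratic test problem f(x) = ½xᵀAx − bᵀx (the matrix, tolerances, and parameter names here are illustrative choices, not from the lecture):

```python
import numpy as np

# Assumed test problem: f(x) = 1/2 x^T A x - b^T x, with A positive definite,
# so the unique minimizer is x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def armijo_gradient(x, tol=1e-8, beta=0.5, sigma=1e-4, max_iter=500):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        d = -g                      # descent direction: <d, grad f(x)> < 0
        t = 1.0
        # Backtrack until the Armijo sufficient-decrease condition holds
        while f(x + t * d) > f(x) + sigma * t * (g @ d):
            t *= beta
        x = x + t * d
    return x

x_star = armijo_gradient(np.zeros(2))
print(np.round(x_star, 6))  # ≈ A^{-1} b = (0.2, 0.4)
```

The backtracking loop only needs function values along the ray, which is what makes the inexact line search practical when the exact minimization over t has no closed form.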
A Convergence Result for Gradient Method
Theorem CGM: Assume f ∈ C1,1_L(Rn) and bounded below. Then, with tk = t, 0 < t < 2L⁻¹, one has
lim_{k→∞} ‖f′(xk)‖ = 0.
Moreover, we have the rate of convergence:
gk := min_{0≤l≤k} ‖f′(x^l)‖ ≤ (k + 1)^{−1/2} [2L(f(x0) − f∗)]^{1/2}
Thus, we can obtain an (upper) complexity estimate to achieve gk ≤ ε:
k + 1 ≥ (2L/ε²)(f(x0) − f∗) =⇒ gk ≤ ε
Note: estimate does not depend on the problem’s dimension n!
Newton’s Method
(U) minimize {f(x) : x ∈ Rn}
Assumptions: Let x∗ be a local minimum of f.
Let f ∈ C2(Rn) with ∇2f(x∗) ⪰ lI, l > 0;
‖∇2f(x) − ∇2f(y)‖ ≤ M‖x − y‖, ∀x, y;
x0 is close enough to x∗: ‖x0 − x∗‖ ≤ r̄ ≡ 2l(3M)⁻¹.
Then, the sequence {xk} produced by
xk+1 = xk − (∇2f(xk))⁻¹ ∇f(xk)
converges locally quadratically to x∗ [see B1-2]
Note the dependence on knowledge of specific (and generally unknown/hard to compute) constants....
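A minimal sketch of the pure Newton iteration on an assumed smooth, strictly convex test function f(x1, x2) = e^{x1} + e^{x2} + x1² + x2² (an illustrative choice, not from the lecture), started close enough to the minimizer:

```python
import numpy as np

# For f(x1, x2) = exp(x1) + exp(x2) + x1^2 + x2^2, the gradient and Hessian
# separate coordinatewise; the minimizer solves exp(x) + 2x = 0 in each
# coordinate (x ≈ -0.3517).
def grad(x):
    return np.exp(x) + 2.0 * x

def hess(x):
    return np.diag(np.exp(x) + 2.0)

x = np.array([1.0, -1.0])  # assumed to be in the region of attraction
for _ in range(20):
    # Pure Newton step: solve the linear system instead of forming the inverse
    x = x - np.linalg.solve(hess(x), grad(x))

print(np.allclose(grad(x), 0.0))  # stationarity to high precision
```

The quadratic local convergence shows up in practice as the number of correct digits roughly doubling per iteration once the iterate is near x∗; 20 iterations is far more than needed here.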
Basic Unconstrained Schemes-Summary
x0 ∈ Rn, xk+1 = xk + tk Wk dk
where
Wk ⪰ 0, tk ≈ argmin_t f(xk + t Wk dk)
Wk ≡ I, dk ≡ −∇f(xk): Gradient Method
Wk ≡ (∇2f(xk))⁻¹, tk ≡ 1: Newton's Method; fast local convergence, but it can diverge and even break down (∇2f(xk) degenerate), and is quite expensive...
Global rate of convergence needs information on topological properties of ∇f, ∇2f... Soon we will see how to avoid that...
Other Methods: Quasi-Newton [e.g., BFGS] (replaces f″(xk) by some PD matrix), Conjugate Gradient, Trust Region.... see [B4]
Constrained Optimization Algorithms
Richer but much more Difficult....
In most algorithms we will face one (or both) of the following:
To solve a sequence of unconstrained/constrained minimization problems
To solve a nonlinear system of equations and inequalities
Thus the importance of having efficient linear algebra methods and software, and a fast and reliable unconstrained routine.
Numerical Optimization relies on Numerical Linear Algebra
Some Classes of Constrained Optimization Algorithms
Sequential Unconstrained Minimization: Penalty and BarrierMethods
Sequential Linear/Quadratic/Convex Programming
Lagrangian Multiplier Methods
Interior point/Primal-Dual Methods
Dual Methods: Decomposition/Subgradient/Cutting Plane
Active set methods
....and more...
Sequential Unconstrained Minimization
(C) min{f(x) : x ∈ S ⊂ Rn}
Idea: approximate (C) by a sequence of solutions of unconstrained minimization problems
• Penalty [Courant 1943]: A continuous P(·) is a penalty function for S if P(x) ≥ 0 for all x, with P(x) = 0 if and only if x ∈ S.
Replace (C) by
(Ct) min_{x∈Rn} {Ft(x) ≡ f(x) + tP(x)}; x(t) = argmin{Ft(x)} (t > 0)
For large t, the minimum of (Ct) will be in a region where P is small. We thus expect that, as t → ∞:
tP(x(t)) → 0; x(t) → x∗
The Penalty Method
Examples of Penalty Functions
For Inequality Constraints S = {x : gi(x) ≤ 0, i = 1, . . . , m}:
P(x) = Σ_{i=1}^{m} max(0, gi(x));   P(x) = Σ_{i=1}^{m} max(0, gi(x))² ← smooth
For Equality Constraints S = {x : hi(x) = 0, i = 1, . . . , m}:
P(x) = ‖h(x)‖², h : Rn → Rm
The Penalty Algorithm: Let 0 < tk < tk+1, ∀k, with tk → ∞. For each k solve
xk = argmin_x {Ftk(x) ≡ f(x) + tk P(x)}.
Convergence: If xk is an exact global minimizer of Ftk, then every limit point of {xk} is a solution of (C).
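The t → ∞ behavior can be traced on an assumed toy instance with a closed-form inner minimizer (the instance is an illustration, not from the slides): min{x² : x − 1 = 0} with P(x) = (x − 1)², so F_t(x) = x² + t(x − 1)² is minimized at x(t) = t/(1 + t).

```python
# Penalty method on min{x^2 : x - 1 = 0}, P(x) = (x - 1)^2.
# Setting F_t'(x) = 2x + 2t(x - 1) = 0 gives x(t) = t/(1 + t) in closed form.
def x_of_t(t):
    return t / (1.0 + t)

ts = [10.0 ** k for k in range(7)]   # t_k increasing toward infinity
xs = [x_of_t(t) for t in ts]

# As predicted: x(t) -> x* = 1 and the residual t*P(x(t)) = t/(1+t)^2 -> 0.
print(round(xs[-1], 6), ts[-1] * (xs[-1] - 1.0) ** 2 < 1e-5)
```

Note that every x(t) is infeasible (x(t) < 1): the penalty method approaches the solution from the exterior of S, which is exactly the behavior the barrier method avoids.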
The Barrier Method: Frisch 58, Fiacco-McCormick 68
Similar idea, but acting from the interior to prevent leaving the feasible region.
• Barrier [Interior]: A barrier function for S with int S ≠ ∅ is a continuous function s.t.
B(x) → ∞ as x → boundaryS
Examples: B(x) = −Σ_{i=1}^{m} [gi(x)]⁻¹,   B(x) = −Σ_{i=1}^{m} log(−gi(x))
Barrier Algorithm: Let 0 < tk < tk+1, ∀k, with tk → ∞. For each k solve
xk = argmin_x {f(x) + (1/tk) B(x)}.
Convergence: Every limit point of {xk} is a solution of (C).
In both Penalty/Barrier Methods there is a compromise: t must be chosen sufficiently large so that x(t) approaches S from the exterior (interior).... BUT.. we do not know how to pick t.. (if chosen too large, Ill-Conditioning may occur)
To avoid IC, do not send t → ∞; one approach: .....use augmented Lagrangian/Multiplier methods.....
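The interior behavior is visible on an assumed toy instance with a closed-form inner minimizer (illustrative, not from the slides): min{x : 1 − x ≤ 0} with the log barrier B(x) = −log(x − 1), for which minimizing x + (1/t)B(x) over the interior x > 1 gives x(t) = 1 + 1/t.

```python
# Log-barrier method on min{x : 1 - x <= 0}, B(x) = -log(x - 1).
# Setting d/dx [x - (1/t) log(x - 1)] = 1 - 1/(t(x - 1)) = 0 gives
# x(t) = 1 + 1/t in closed form.
def x_of_t(t):
    return 1.0 + 1.0 / t

ts = [10.0 ** k for k in range(7)]   # t_k increasing toward infinity
xs = [x_of_t(t) for t in ts]

interior = all(x > 1.0 for x in xs)  # iterates stay strictly feasible
print(interior, round(xs[-1], 6))    # x(t) -> x* = 1 from the interior
```

In contrast to the penalty iterates, every x(t) here is strictly feasible; the price is the ill-conditioning of the subproblems as t grows, which motivates the multiplier methods below.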
A Generic Multiplier Method
(P ) min{f(x) : g(x) ≤ 0}, g : Rn → Rm
Lagrangian: L(x, u) = f(x) + uT g(x) (Linear in u)
An Augmented/General Lagrangian:
A(x, u, c) = f(x) + G(g(x), u, c), (u ≥ 0, c > 0)
Multiplier Method: Given (uk, ck), generate (xk+1, uk+1) via:
Find xk+1 = argmin{A(x, uk, ck) : x ∈ Rn}
Dual Update Rule: uk+1 = E(g(xk+1), uk, ck)
Increase ck > 0 (if necessary).
• G should be explicit and preserve the data properties of (P) (e.g., smoothness)
• E should be a simple explicit formula to update u
• How do we get these objects?
Example: Multiplier Method for Ineqs. Constraints
(C) min{f(x) : gi(x) ≤ 0, i = 1, . . . , m}, g := (g1, . . . , gm)T
Quadratic Method of Multipliers [see B5 for eq. constraints]:
xk+1 ∈ argmin{A(x, uk, ck) : x ∈ Rn}
uk+1 = (uk + ck g(xk+1))+, (ck > 0, z+ := max{0, z})
A(x, u, c) := f(x) + (2c)⁻¹{‖(u + cg(x))+‖² − ‖u‖²}
Drawbacks/Advantages:
Separability is lost (if the original problem is separable)
A is not C2, and Newton's method can break down
No need to increase the penalty parameter; more robust; uses dual information
More recent approaches allow for constructing smooth Lagrangians, so that Newton's method can be applied. Later on.....
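The quadratic method of multipliers can be traced on an assumed toy instance (illustrative, not from the slides): min{x² : g(x) = 1 − x ≤ 0}, whose solution is x∗ = 1 with multiplier u∗ = 2. For this instance the inner minimization of A(x, u, c) has a closed form, since the "plus" branch stays active along the iteration.

```python
# Quadratic method of multipliers on min{x^2 : 1 - x <= 0} (x* = 1, u* = 2).
# With A(x, u, c) = x^2 + (2c)^{-1}[((u + c(1 - x))_+)^2 - u^2], setting
# dA/dx = 2x - (u + c(1 - x))_+ = 0 on the active branch gives
# x = (u + c)/(2 + c) in closed form.
c = 1.0    # penalty parameter, kept FIXED: no need to drive it to infinity
u = 0.0    # initial multiplier estimate
for _ in range(80):
    x = (u + c) / (2.0 + c)           # xk+1 = argmin_x A(x, uk, c)
    u = max(0.0, u + c * (1.0 - x))   # uk+1 = (uk + c g(xk+1))_+

print(round(x, 6), round(u, 6))  # → 1.0 2.0, i.e., (x*, u*)
```

Unlike the pure penalty/barrier schemes, convergence here is obtained with a bounded penalty parameter; the dual update does the work, which is the point of the multiplier approach.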
Sequential "Simpler" Constrained Problems
Idea: Given xk ∈ Rn, uk ∈ Rm, solve a sequence of approximate simpler problems:
inf{fk(x) : gk(x) ≤ 0}
where (fk, gk) are (local) approximations of the objective and constraint functions.
Possible Choices Include: Linear [SLP], Convex [SCP], Quadratic[SQP] approximations of f , or g or both.
SQP: quadratic approximation of the objective and linearized constraints, i.e., it solves a sequence of Quadratic Programs.
Most attractive feature of SQP: superlinear convergence in a neighborhood of a solution.
A drawback: needs functions/gradients with high precision
Back to Newton: inf{f(x) : x ∈ dom f}
Self-Concordance Theory–[Nesterov-Nemirovsky-90]
Idea: to make the convergence analysis coordinate invariant
[Newton’s method is coordinate invariant..but conv. analysis is not!]
Achieved for self-concordant convex functions
θ is SC ⇐⇒ ∃M : |θ‴(t)| ≤ M θ″(t)^{3/2}, ∀t ∈ dom θ
Newton Revisited with an SC function: The Damped Newton Method:
DNM: Start with x0 ∈ dom f. Generate {xk} via
xk+1 = xk − (1 + λ(xk))⁻¹ (f″(xk))⁻¹ f′(xk);  λ(x) := 〈(f″(x))⁻¹ f′(x), f′(x)〉^{1/2}
♣ ∀x ∈ dom f with λ(x) > η > 0, one iteration of DNM decreases the value of f by at least the constant h(η) := η − log(1 + η). This is a global result.
Newton for Self-Concordant functions–[See B6-8]
f∗ = inf{f(x) : x ∈ dom f}. Given some γ > 0:
Damped Phase: λ(xk) > γ, apply Damped Newton that ensures
f(xk+1) − f(xk) ≤ −h(γ)
Quadratic Phase: λ(xk) ≤ γ, then apply pure Newton, whichconverges quadratically
Complexity Analysis: Total number of iterations to ε accuracy:
# of Newton steps ≤ (f(x0) − f∗)/h(γ) + log log(1/ε)
Note: Absence of unknown constants and problem dimension
Interior Point Methods for Convex Programs
inf{cT x : x ∈ C}, C ⊂ Rn closed convex
The idea goes back to Barrier Methods, but within a different methodology: basically, one tries to approximately follow the central path generated within the interior of the corresponding feasible set.
Computation of the Central Path:
x∗(µ) = argmin_x {µ〈c, x〉 + S(x)}
where S is a Self-Concordant Barrier for the closed convex feasible set of the given optimization problem.
x∗(µ) remains strictly feasible for every µ > 0
x∗(µ) → x∗ optimal as µ → ∞
Can be computed in polynomial time using Newton's method and a suitable updating rule for µ
Primal-Dual Interior Methods
(C) inf{f(x) : g(x) ≥ 0} ⇐⇒ inf{f(x) : g(x)−v = 0, v ≥ 0} (g := (g1, . . . , gm)T )
The KKT system with perturbed complementarity
∇f(x) −∇g(x)y = 0, V := Diag(v), Y := Diag(y)
V Y e = µe; µ > 0, e := (1, . . . , 1)T
g(x) − v = 0
Apply Newton’s Method to generate new pt:
(x+, v+, y+) = (x, v, y) + t(∆x, ∆v, ∆y)
t chosen to ensure (v+, y+) > 0 and the merit function sufficiently reduced [see B10]
Advantage: feasibility of x not required (infeasible interior point variants)
Suitable for large problem sizes (n + m ≈ 10,000 and more)
Optimization–Summary
Nonconvex problems
most are just not solvable....Lacking theory....
Convex problems
Local minima are global
Computationally Tractable: can be approximately solved to a desired accuracy in polynomial time [not always efficiently, e.g., the Ellipsoid Method]
Model many interesting problems
Enjoy a powerful Duality Theory that can be used to find bounds/approximations for hard problems; in particular for nonconvex quadratic models arising in combinatorial problems.
Mathematical and Computational Challenges
To solve very large scale optimization problems, keeping in mind the trade-off between Efficiency and Practicality/Simplicity
For self-concordant convex problems, we have efficient polynomialalgorithms with high accuracy
Polynomial algorithms are highly sophisticated: they require information on the Hessian of the objective and constraints [often not available], and incur a heavy computational cost at each iteration [e.g., computing Newton's step], which is not affordable for very large scale problems.
..... Thus the need to:
Further study the potential of elementary/simple methods (e.g., first order methods, using function or/and gradient information only).
Produce more efficient algorithms within these methods
Part C–Interior Gradient/Prox Schemes
Lecture based on joint research works with A. Auslender, University of Lyon I, France
Details, proofs and more results can be found in our two recent works
Interior gradient and epsilon-subgradient descent methods for constrained convex minimization, Mathematics of Operations Research, 29, 2004, 1–26.
A unified framework for interior gradient/subgradient and proximal methods in convex optimization. February 2003 (submitted for publication)
More references on related works we developed on Lagrangian methods, decomposition schemes, semidefinite programming, and variational inequalities are listed at the end of these notes.
Gradient Based Methods: Why?
A main drawback: Can be very slow....But...
Main advantages
Use minimal information, e.g., (f, g)
Often lead to very simple iterative schemes
Complexity per iteration mildly dependent on the problem's dimension
Suitable when high accuracy is not crucial [in many large scale applications, the data is anyway known only roughly..]
For very large scale problems, often the only remaining choice
Examples: Three gradient-based algorithms widely used in applications
Clustering: The k-means algorithm
Neuro-computation: The backpropagation (perceptron) algorithm
The EM (Expectation-Maximization) algorithm in statistical estimation
Main Results: Overview
A unifying framework for analyzing interior gradient and proximal-based algorithms for constrained minimization
Global convergence results under minimal assumptions
Smooth Lagrangian Multiplier Methods
Derive and analyze corresponding new (sub) gradient interiorschemes for constrained problems.
Modified methods with better complexity/efficiency
Applications of the results to specific instances, in particular to conic optimization, i.e., semidefinite and second-order cone programs, producing elementary algorithms.
Part I. Interior Proximal methods
Ideas and A Unifying Framework
Convergence analysis mechanism
Applications/Examples
Two Classical Algorithms
(P ) f∗ = inf{f(x) : x ∈ C̄},
f : Rn → R ∪ {+∞} is a lsc convex proper function.
C ⊂ Rn is nonempty, convex, and open; C̄ denotes the closure of C.
Let d(x, y) := 2⁻¹‖x − y‖², λk > 0
• Prox: xk ∈ argmin{λk f(x) + d(x, xk−1) : x ∈ C̄} ⇐⇒
0 ∈ λk ∂f(xk) + xk − xk−1 + NC̄(xk)
• (Sub)Grad: xk ∈ argmin{λk〈gk−1, x〉 + d(x, xk−1) : x ∈ C̄} ⇐⇒
0 ∈ λk gk−1 + xk − xk−1 + NC̄(xk) ⇐⇒ xk = ΠC̄(xk−1 − λk gk−1)
where we use ΠC̄ ≡ (I + NC̄)⁻¹ ≡ the projection map
** Difference: Implicit –versus– Explicit Schemes **
Note: {xk} produced by either one of the above algorithms does not necessarily belong to C
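The explicit scheme xk = ΠC̄(xk−1 − λk gk−1) can be sketched on an assumed instance where the projection is cheap (the box constraint and the quadratic objective below are illustrative choices, not from the lecture):

```python
import numpy as np

# Projected-gradient scheme x_k = Pi_C(x_{k-1} - lam * grad f(x_{k-1})) on
# minimize f(x) = ||x - z||^2 over the box C = [0, 1]^n, where Pi_C is the
# coordinatewise clipping projection.
z = np.array([2.0, -1.0, 0.5])
proj = lambda x: np.clip(x, 0.0, 1.0)  # Pi_C for the box
grad = lambda x: 2.0 * (x - z)

x = np.zeros(3)
lam = 0.25                              # fixed step, lam < 2/L with L = 2 here
for _ in range(100):
    x = proj(x - lam * grad(x))

print(x)  # → [1.  0.  0.5], the projection of z onto C
```

The gradient step may leave C̄, but the projection pulls each iterate back, so the whole sequence stays feasible; the proximal (implicit) variant replaces the explicit gradient step by solving a small minimization at each iteration.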
A proximal term exploiting the geometry of C
We use a proximal term d(x, y) that plays the role of a distance-like function satisfying certain desirable properties; in particular it:
Will force the iterates of the produced sequence to stay in C, and thus will automatically eliminate the constraints (hence Interior).
Allows us to derive explicit and simple iterative schemes for various interesting optimization models.
Leads to convergent and improved methods
Minimal required properties for d:d(·, v) is a convex function, ∀vd(·, ·) ≥ 0, and d(u, v) = 0 iff u = v ∀u, v.
• d is not a distance: no symmetry or/and triangle inequality
Marc Teboulle–Tel-Aviv University – p. 59
The Basic Ingredients
(P)  f∗ = inf{f(x) : x ∈ C̄}
Interior Proximal Algorithm–IPA:
x0 ∈ C;  xk ∈ argmin{λkf(x) + d(x, xk−1) : x ∈ C̄},  k = 1, 2, . . .  (λk > 0),
where d is some proximal distance.
The basic ingredients needed to achieve our goals are:
To pick an appropriate proximal distance d which allows us to eliminate the constraints.
Given d, to find an induced proximal distance H, which will control the behavior of the resulting method, in order to analyze convergence and complexity.
We begin by defining an appropriate proximal distance d for problem (P).
A Family of Proximal Distances F
Definition A function d : Rn × Rn → R+ ∪ {+∞} is called a proximal distance with respect to an open convex set C ⊂ Rn if for each y ∈ C it satisfies the following properties:
(P1) d(·, y) is proper, lsc, convex, and C1 on C.
(P2) dom d(·, y) ⊂ C̄, and dom ∂1d(·, y) = C.
(P3) d(·, y) is level bounded on Rn, i.e., lim‖u‖→∞ d(u, y) = +∞.
(P4) d(y, y) = 0.
We denote by F the family of functions d satisfying the Definition.
(P1) is needed to preserve convexity of d
(P2) will force the iterate xk to stay in C
(P3) will guarantee the existence of such an iterate
Note: By definition d(·, ·) ≥ 0, so that from (P4) we get ∇1d(y, y) = 0, ∀y ∈ C.
The Main Tool
For each given d ∈ F, we generate an induced proximal distance satisfying some desirable properties.
Definition Given C ⊂ Rn, open and convex, and d ∈ F, a function H : Rn × Rn → R+ ∪ {+∞} is called the induced proximal distance to d if
(1) H is finite valued on C × C with H(a, a) = 0, ∀a ∈ C
(2) 〈c − b, ∇1d(b, a)〉 ≤ H(c, a) − H(c, b), ∀a, b, c ∈ C  ♠
We write (d, H) ∈ F(C) to quantify such a triple [C, d, H].
Likewise, we will write (d, H) ∈ F(C̄) for the triple [C, d, H] s.t.
there exists H which is finite valued on C̄ × C,
satisfies (1)-(2) for any c ∈ C̄,
and such that ∀c ∈ C̄ one has H(c, ·) level bounded on C.
Clearly, one thus has F(C̄) ⊂ F(C).
Mechanism’s Motivation
Not as mysterious as it might look at first sight...
Example: Quad-prox corresponds to the special case
C = C̄ = Rn,  d(x, y) = 2−1‖x − y‖2,  ∇1d(x, y) = x − y
Then,
♠  〈c − b, ∇1d(b, a)〉 = H(c, a) − H(c, b) − H(b, a)
holds with equality (Pythagoras Theorem!), with induced H ≡ d.
Methods with d ≡ H are called self-proximal.
There are several examples of more general self-proximal methods, for various types of constraint sets C, [see C].
Main Results
Within this framework, we can derive:
Global rate of convergence/efficiency estimates in terms of function values
Convergence of the limit points of the sequence produced by IPA
Global convergence of the sequence {xk} to an optimal solution of (P), under additional assumptions on the induced proximal distance H, akin to the properties of norms; this class is denoted by F+(C̄), see [C1-5].
A Typical Convergence Result
Theorem (Prox-convergence) Let (d, H) ∈ F+(C̄) and let {xk} be the sequence generated by the interior prox:
xk ∈ C,  gk ∈ ∂f(xk)  s.t.  λkgk + ∇1d(xk, xk−1) = 0.
Set σn := ∑_{k=1}^n λk. Then the following hold:
limn→∞ σn = +∞ =⇒ {f(xk)} converges to f∗
Global rate of convergence estimate: f(xn) − f(x) = O(1/σn), ∀x ∈ C̄
If X∗ ≠ ∅, then xk → x∗, an optimal solution of (P).
Self-proximal methods
We take d(x, y) := H(x, y) = Dh(x, y), with Dh a Bregman proximal distance given by:
Dh(x, y) := h(x) − [h(y) + 〈∇h(y), x − y〉]
It can be verified that (P1)–(P4) hold for H.
Depending on the choice of h, one has either (d, H) = (Dh, Dh) ∈ F(C̄) or (d, H) = (Dh, Dh) ∈ F+(C̄).
When C = Rn, with h = 2−1‖ · ‖2, then Dh(x, y) = 2−1‖x − y‖2, and with (d, H) = (Dh, Dh) ∈ F+(Rn), the IPA is exactly the classical prox.
Several interesting special cases of the pair (d, H) leading to self-proximal schemes for various types of constraints include: SDP, SOC, Convex Programs, see [C7-11].
Example: Semidefinite Constraints: the Cone C = Sn+
Sn+ (Sn++) ≡ symmetric p.s.d. (p.d.) matrices. Let
h1 : Sn+ → R,  h1(x) = tr(x log x);
h3 : Sn++ → R,  h3(x) = − tr(log x) = − log det(x)
For any y ∈ Sn++, let
d1(x, y) = tr(x log x − x log y + y − x),  with dom d1(·, y) = Sn+,
d3(x, y) = − log det(xy−1) + tr(xy−1) − n,  with dom d3(·, y) = Sn++.
With H ≡ di, one has (d1, H) ∈ F(Sn+) and (d3, H) ∈ F(Sn++).
Similarly we can handle
C = {x ∈ Rm : B(x) − B0 ∈ Sn+};  B(x) = ∑_{i=1}^m xiBi,  Bi ∈ Sn, ∀i ∈ [0, m]
IPA that are Not Self-Proximal: ϕ-Divergences
Given a scalar convex function ϕ satisfying some conditions (call this class Φr, see [C12-13]), we define a ϕ-divergence proximal distance by
dϕ(x, y) = ∑_{i=1}^n yir ϕ(xi/yi),  r = 1, 2
For any ϕ ∈ Φr, one verifies that
dϕ(·, ·) ≥ 0, and "= 0" when the arguments coincide.
Can be easily extended to handle polyhedral constraints.
Examples of functions in Φ1, Φ2:
ϕ1(t) = t log t − t + 1,  dom ϕ = [0, +∞),
ϕ2(t) = − log t + t − 1,  dom ϕ = (0, +∞),
ϕ3(t) = 2(√t − 1)2,  dom ϕ = [0, +∞).
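For instance, taking ϕ1 with r = 1, the resulting dϕ is the Kullback–Leibler divergence ∑i xi log(xi/yi) − xi + yi. A small numerical sketch (the helper names are ours):

```python
import numpy as np

def phi1(t):
    # phi1(t) = t log t - t + 1, extended by continuity with phi1(0) = 1
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, safe * np.log(safe) - safe + 1.0, 1.0)

def d_phi(x, y, phi, r=1):
    # d_phi(x, y) = sum_i y_i^r * phi(x_i / y_i), for y in the open orthant
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(y ** r * phi(x / y)))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.25, 0.25, 0.5])
val = d_phi(x, y, phi1)   # equals sum x*log(x/y) - x + y here
```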
Example: The Class Φ2, with dϕ(x, y) = ∑_{j=1}^n yj2 ϕ(xj/yj)
Let ϕ(t) = µp(t) + (ν/2)(t − 1)2, with p ∈ Φ2.
One has dϕ ∈ F, and it can be proven: ∀a, b ∈ Rn++, ∀c ∈ Rn+,
♣  〈c − b, ∇1dϕ(b, a)〉 ≤ η(‖c − a‖2 − ‖c − b‖2)
where η = 2−1(µ + ν). With H(x, y) := η‖x − y‖2 one obtains (dϕ, H) ∈ F+(C̄), and all the convergence results hold.
Note: ♣ behaves like the usual prox in Rn... but here it is valid on the non-negative orthant!
A Powerful Application of Interior Prox: General Augmented Lagrangian Methods
An Example: Take dϕ(u, v) = ∑_{j=1}^m vj2 ϕ(uj/vj), with ϕ ∈ Φ2.
Applying IPA to the dual (D) of (...remember, (D) is defined on Rm+...)
(P)  min{f0(x) : fi(x) ≤ 0, i ∈ [1, m]}
yields the Multiplier Method:
LMM: Let u0 ∈ Rm++ and λk ≥ λ > 0, ∀k ≥ 1; generate {xk, uk} via
xk ∈ argmin{H(x, uk−1, λk) : x ∈ Rn}
uki = uk−1i (ϕ∗)′(λkfi(xk)/uk−1i),  i = 1, . . . , m
Here H(x, u, λ) := f0(x) + λ−1 ∑_{i=1}^m ui2 ϕ∗(λfi(x)/ui),  (λ > 0, u > 0)
Using other d yields many other old and new LMMs: this provides a unified framework for their analysis, and can be extended to SDP and VI problems [see refs].
Important Example: The Log-Quad Proximal Kernel
Let ν > µ > 0 be given fixed parameters, and
Φ2 ∋ ϕ(t) = (ν/2)(t − 1)2 + µ(t − log t − 1),  t > 0
The conjugate ϕ∗ is explicitly given [see A-LQM] and satisfies some remarkable properties:
dom ϕ∗ = R, and ϕ∗ ∈ C∞(R)
(ϕ∗)′ = (ϕ′)−1 is Lipschitz on R, with constant ν−1
(ϕ∗)′′(s) ≤ ν−1, ∀s ∈ R.
For given data in (P) with fi ∈ C∞(Rn), the resulting H ∈ C∞(Rn).
Computation with the Log-Quad Multiplier Method: An Example
Robust [no parameter tuning]
+ computational effort does not increase with dimension
n × m = 1000 × 50,000
Average results over 100 executions on a quadratically constrained random model
A Typical Convergence Result
Applying Theorem (Prox-convergence), under very mild and standard assumptions on the primal (P), e.g.,
• optimal solution set of (P) bounded + Slater, one obtains:
Theorem (Convergence for LMM) Let {xk, uk} be the sequence generated by the previous LMM with ϕ ∈ Φ2. Then
All limit points of {xk, uk} are optimal solutions of (P) × (D)
The dual sequence {uk} → u∗, an optimal solution of the dual (D).
In particular, [and without any further assumptions, such as strict complementarity], the following improved global convergence rate estimate holds for the dual objective h:
h(u∗) − h(un) = o(1/n)
Part II. Interior (Sub)-Gradient Methods for Constrained Minimization
Basic Interior (sub) gradient methods
A General Convergence Result
Algorithms for Conic Optimization: Theory and Examples
A More efficient O(1/k2) Interior Gradient Algorithm
A Basic Interior-Gradient Algorithm for Constrained Minimization over C
The basic step of a (sub)gradient method over Rn is
xk = xk−1 − λkgk−1 ⇐⇒ xk ∈ argmin{λk〈gk−1, x〉 + 2−1‖x − xk−1‖2}
Thus, to solve min{f(x) : x ∈ C̄}, replace 2−1‖ · ‖2 by some d ∈ F.
Basic Interior Gradient–BIG
Take d ∈ F. Let λk > 0 and generate the sequence {xk} via
xk ∈ argmin{λk〈gk−1, x〉 + d(x, xk−1)}
Building on the material previously developed, it is possible to establish various types of convergence results for various instances of the triple [C, d, H]. We focus on conic models.
Conic Optimization Models
(M)  inf{f(x) : x ∈ C ∩ V},  where
V := {x : Ax = b}, with b ∈ Rm and A ∈ Rm×n, n ≥ m
f : Rn → R ∪ {+∞} is convex, lsc.
We assume that ∃x0 ∈ dom f ∩ C : Ax0 = b.
We assume also that f is continuously differentiable with ∇f Lipschitz on C ∩ V with Lipschitz constant L, i.e., there exists L > 0 such that
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ C ∩ V
Notation: f ∈ C1,1(C ∩ V).
Applying Basic Scheme BIG to Solve Problem (M)
For solving Problem (M), we propose the following basic iteration.
Given d(·, y) σ-strongly convex over C ∩ V
Given a step-size rule for choosing λk (various step-size rules are possible)
At each step k, starting from a point x0 ∈ C ∩ V, the sequence xk ∈ C ∩ V is computed via the relation
xk = u(λk∇f(xk−1), xk−1) = argmin{〈λk∇f(xk−1), z〉 + d(z, xk−1) | z ∈ V}
Theorem (Convergence of BIG) With (d, H) ∈ F+(C̄), {xk} converges to an optimal solution of (M), and the following global rate estimate holds:
f(xn) − f∗ = O(1/n)
Application examples–[C14-17]
Consider the functions d with (d, H) ∈ F(C̄) which are regularized distances of the following form:
d(x, y) = p(x, y) + (σ/2)‖x − y‖2,  with p ∈ F
We can derive explicit gradient-like algorithms via the formula
u(v, x) = argminz {〈v, z〉 + d(z, x)}
for
Semidefinite programs
Second-order conic problems
Convex minimization over the unit simplex
Convex minimization over the unit simplex, C = ∆
Take C = Rn+, A = eT, b = 1, i.e., V = {x : ∑_{j=1}^n xj = 1}; then
(M)  inf{f(x) : x ∈ ∆},  with
∆ = {x ∈ Rn : ∑_{j=1}^n xj = 1, x ≥ 0}
This is an interesting special case of standard conic optimization, which arises in applications.
We will concentrate on Mirror Descent Type Algorithms (MDA) [Nemirovsky-Yudin-1983]
[Beck-Teboulle-2003] have shown that the MDA can be simply viewed as a projected subgradient algorithm with strongly convex Bregman proximal distances. As a result, they proposed to use the entropy kernel.
The Entropic Mirror Descent Algorithm EMDA
h(x) := ∑_{j=1}^n xj log xj if x ∈ ∆, +∞ otherwise
The entropy kernel is 1-strongly convex w.r.t. the norm ‖ · ‖1, i.e.,
〈∇h(x) − ∇h(y), x − y〉 ≥ ‖x − y‖12, ∀x, y ∈ ∆
Hence so is the resulting dh ≡ E, defined by:
d(z, x) ≡ E(z, x) = ∑_{j=1}^n zj log(zj/xj) if (z, x) ∈ ∆ × ∆+, +∞ otherwise
This produces the Entropic Mirror Descent Algorithm (EMDA).
Simple Formula for the EMDA with d ≡ E
The problem u(v, x) = argminz∈∆ {〈v, z〉 + E(z, x)} can be easily solved:
uj(v, x) = xj exp(−vj) / ∑_{i=1}^n xi exp(−vi),  j = 1, . . . , n
EMDA: Start with x0 = n−1e. For k = 1, 2, . . ., with v = gk−1,
xk = u(λkv, xk−1),  λk = √(2 log n)/(Lf√k)
Theorem The sequence generated by EMDA satisfies, for all k ≥ 1,
min1≤s≤k f(xs) − minx∈∆ f(x) ≤ √(2 log n) max1≤s≤k ‖gs‖∞ / √k
Here the objective function is assumed Lf-Lipschitz on ∆.
Outperforms the classical projected gradient by a factor of (n/ log n)1/2.
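The closed-form update makes EMDA a few lines of code. A sketch under the stated step-size rule (the function `emda` and the toy linear objective are ours; for a linear objective the iterates concentrate on the smallest cost coefficient):

```python
import numpy as np

def emda(grad, n, L_f, steps):
    """Entropic Mirror Descent on the unit simplex.
    grad(x) returns a (sub)gradient g_{k-1}; L_f bounds its sup-norm."""
    x = np.full(n, 1.0 / n)                  # x0 = e/n
    for k in range(1, steps + 1):
        lam = np.sqrt(2.0 * np.log(n)) / (L_f * np.sqrt(k))
        v = lam * grad(x)
        w = x * np.exp(-(v - v.min()))       # shift for numerical stability
        x = w / w.sum()                      # closed-form argmin over the simplex
    return x

# Toy test: min <c, x> over the simplex; optimum puts all mass on argmin(c).
c = np.array([0.3, 0.1, 0.7, 0.4])
x = emda(lambda x: c, 4, float(np.max(np.abs(c))), 3000)
```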
Improving efficiency of Interior Gradient Method
The classical gradient method for minimizing a C1,1 function with constant L over Rn exhibits an O(1/k) global convergence rate estimate.
[Nesterov-1988] developed what he called an "optimal algorithm" for smooth convex minimization.
He was able to improve the efficiency of the gradient method by constructing a method that keeps the simplicity of the gradient method, but with the faster rate O(1/k2).
♣ Question: Can this be extended to constrained problems using interior gradient methods?
We answer this question positively for a class of interior gradient methods, leading to an equally simple but more efficient interior gradient algorithm for convex conic problems.
Problem’s Setting: Back to The Conic Model
(M)  inf{f(x) : x ∈ C ∩ V},  where
V := {x : Ax = b}, with b ∈ Rm and A ∈ Rm×n, n ≥ m
f : Rn → R ∪ {+∞} is convex, lsc.
∃x0 ∈ dom f ∩ C : Ax0 = b
The optimal solution set X∗ is nonempty
f is continuously differentiable with ∇f Lipschitz on C ∩ V with Lipschitz constant L
Generating the Sequence {qk}
Basic Idea: Build a sequence of functions {qk}k≥0 that approximates f.
We take d ≡ H ∈ F, with H a Bregman proximal distance whose kernel h is σ-strongly convex on C ∩ V.
For every k ≥ 0, we construct the sequence {qk(x)} recursively via:
q0(x) = f(x0) + cH(x, x0),
qk+1(x) = (1 − αk)qk(x) + αklk(x, yk),
lk(x, yk) = f(yk) + 〈x − yk, ∇f(yk)〉.
Here, c > 0 and αk ∈ [0, 1). The point x0 is chosen such that x0 ∈ C ∩ V. The point yk ∈ C is built-in within the algorithm, [see C18].
The Improved Interior Gradient Algorithm-IGA
Step 0. Choose a point x0 ∈ C ∩ V and a constant c > 0.
Set z0 = x0 = y0, c0 = c, λ = σL−1.
Step k. For k ≥ 0, compute:
αk = (√((ckλ)2 + 4ckλ) − ckλ)/2,
yk = (1 − αk)xk + αkzk,
• zk+1 = argminx∈C∩V {〈x, (λ/αk)∇f(yk)〉 + H(x, zk)} = u((λ/αk)∇f(yk), zk),
xk+1 = (1 − αk)xk + αkzk+1, and ck+1 = (1 − αk)ck.
• Computational work is exactly that of the interior gradient method, through zk+1.
The remaining steps involve trivial computations.
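In the unconstrained case C = Rn with the quadratic kernel H(x, z) = 2−1‖x − z‖2, the map u is explicit (u(v, z) = z − v) and IGA reduces to a Nesterov-type accelerated gradient scheme. A sketch on a least-squares instance (the function name and test data are ours):

```python
import numpy as np

def iga(grad, L, x0, steps, c=1.0, sigma=1.0):
    """Sketch of IGA for C = Rn with H(x, z) = 2^{-1}||x - z||^2,
    so u(v, z) = z - v in closed form."""
    lam = sigma / L
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    ck = c
    for _ in range(steps):
        a = ck * lam
        alpha = (np.sqrt(a * a + 4.0 * a) - a) / 2.0
        y = (1.0 - alpha) * x + alpha * z
        z = z - (lam / alpha) * grad(y)   # z_{k+1} = u((lam/alpha_k) grad f(y_k), z_k)
        x = (1.0 - alpha) * x + alpha * z
        ck = (1.0 - alpha) * ck
    return x

# Least-squares test problem: f(x) = 1/2 ||Ax - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
f = lambda x: 0.5 * float(np.sum((A @ x - b) ** 2))
x = iga(lambda x: A.T @ (A @ x - b), L, np.zeros(5), 2000)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```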
The Improved Convergence Rate Estimate for IGA
Theorem Let {xk}, {yk} be the sequences generated by IGA and let x∗ be an optimal solution of (M). Then, for any k ≥ 0, we have
f(xk) − f(x∗) = O(1/k2)
and the sequence {xk} is minimizing, i.e., f(xk) → f(x∗).
Thus to solve (M) to accuracy ε > 0, one needs no more than O(1/√ε) iterations of IGA. This is a reduction by a square-root factor in comparison to BIG.
Note: IGA can be used to solve convex minimization over the unit simplex and the spectrahedron in Sn+ with this improved global convergence rate estimate.
Extension with d ≠ H: Open?
Conclusion
Optimizers are not (yet!) out of job......
Thank you for listening.
Additional Material and References
In the following, you will find some complementary material with more details and results, on which I will talk very quickly (or not at all!) during the Lecture. This is organized in 3 appendices [A, B, C] referring to the corresponding parts of the Lecture, and [R] for the references.
Appendix A Some complements on optimization theory
Appendix B More on Algorithms
Appendix C Further details on interior gradient/prox
References R Pointers to some basic books, and our recent works related to Part C
A1. General Defs/Properties for Arbitrary Functions
In optimization problems it is convenient to work with extended-real-valued functions, i.e., functions which take values in R ∪ {+∞} = (−∞, +∞], instead of just finite-valued functions, i.e., taking values in R = (−∞, +∞). This allows us to rewrite the constrained problem:
inf{h(x) : x ∈ C} ⇐⇒ inf{f(x) : x ∈ Rn}
with f := h + δC, where δC(x) = 0 if x ∈ C and +∞ otherwise. Rules of arithmetic are thus extended to include ∞ + ∞ = ∞, α · ∞ = ∞ ∀α > 0, and 0 · ∞ = 0.
Let f : Rn → R ∪ {+∞}. The effective domain of f is the set
dom f := {x ∈ Rn | f(x) < +∞}.
A function is called proper if f(x) < ∞ for at least one x ∈ Rn and f(x) > −∞ ∀x ∈ Rn; otherwise the function is called improper.
Geometrical objects associated with f: the epigraph and level sets of f,
epi f := {(x, α) ∈ Rn × R | α ≥ f(x)},
lev(f, α) := {x ∈ Rn | f(x) ≤ α}.
A2. Lower-Semicontinuity (lsc)
For f : Rn → R ∪ {+∞}, we write
inf f := inf{f(x) : x ∈ Rn},
argmin f = argmin{f(x) : x ∈ Rn} := {x ∈ Rn : f(x) = inf f}.
Lower limits are characterized via
lim infx→y f(x) = min{α ∈ R̄ := [−∞, ∞] | ∃xn → y with f(xn) → α}.
Note that one always has lim infx→y f(x) ≤ f(y).
Definition The function f : Rn → R ∪ {+∞} is lower semicontinuous (lsc) at x if
f(x) = lim infy→x f(y),
and lower semicontinuous on Rn if this holds for every x ∈ Rn.
Lower semicontinuity of a function on Rn can be characterized through its level sets and epigraph.
Theorem Let f : Rn → R̄. The following statements are equivalent:
(a) f is lsc on Rn;
(b) the epigraph epi f is closed in Rn × R;
(c) the level sets lev(f, α) are closed in Rn.
A3. Few more Defs. in Convex Analysis
For a proper, convex, and lower semicontinuous (lsc) function f : Rn → R ∪ {+∞}:
dom f = {x | f(x) < +∞} ≠ ∅ is its effective domain
f∗(y) = sup{〈x, y〉 − f(x) | x ∈ Rn} is its conjugate
For all ε ≥ 0, its ε-subdifferential is
∂εf(x) = {g ∈ Rn | ∀z ∈ Rn, f(z) + ε ≥ f(x) + 〈g, z − x〉}
It coincides with the usual subdifferential ∂f ≡ ∂0f whenever ε = 0, and we set dom ∂f = {x ∈ Rn | ∂f(x) ≠ ∅}.
For any closed convex set S ⊂ Rn:
δS denotes the indicator function of S, ri S its relative interior
NS(x) = ∂δS(x) = {ν ∈ Rn | 〈ν, z − x〉 ≤ 0, ∀z ∈ S}
is the normal cone to S at x ∈ S.
A4. Differentiability of Convex Functions
Under differentiability assumptions, one can check the convexity of a function via the following useful tests: f is convex iff
(a) when f ∈ C1: f(x) − f(y) ≥ 〈x − y, ∇f(y)〉, ∀x, y;
(b) when f ∈ C2: ∇2f(x) is positive semidefinite. (∇2f(x) positive definite =⇒ f strictly convex; the converse is false.)
Directional derivative Let f : Rn → R ∪ {+∞} be a convex function, and let x be any point where f is finite and d ∈ Rn. Then the limit
f′(x; d) := limτ→0+ [f(x + τd) − f(x)]/τ
exists (finite or equal to −∞) for all d ∈ Rn and is called the directional derivative of f at x.
A5. Coercivity and Asymptotic Function
Definition The function f : Rn → R ∪ {+∞} is called
(a) level bounded if for each λ > inf f, the level set lev(f, λ) is bounded,
(b) coercive if f∞(d) > 0 ∀d ≠ 0, where f∞ denotes the asymptotic function of f.
As an immediate consequence of the definition, we remark that f is level bounded if and only if lim‖x‖→∞ f(x) = +∞, which means that the values of f(x) cannot remain bounded on any unbounded subset of Rn.
In the convex case, all these concepts are in fact equivalent.
Proposition Let f : Rn → R ∪ {+∞} be lsc and proper. If f is coercive, then it is level bounded. Furthermore, if f is also convex, then the following statements are equivalent:
(a) f is coercive.
(b) f is level bounded.
(c) The optimal set {x ∈ Rn | f(x) = inf f} is nonempty and compact.
(d) 0 ∈ int dom f∗.
A6. Alternative Theorems
Two theorems of the alternative are very useful in the study of optimality conditions.
Farkas Exactly one of the following two systems has a solution:
(F1) Ax = b, x ≥ 0,  (A ∈ Rm×n, b ∈ Rm the given data)
(F2) bTy > 0, ATy ≤ 0, y ∈ Rm
Gordan Exactly one of the following two systems has a solution:
(G1) Ax < 0, x ∈ Rn
(G2) ATy = 0, 0 ≠ y ∈ Rm+  (i.e., y ≥ 0, not all components zero)
A7. More Optimality Conditions for General NLP
(NLP)  min{f(x) : g(x) ≤ 0, h(x) = 0, x ∈ Rn}
with smooth (C2(Rn)) f : Rn → R, g : Rn → Rm, h : Rn → Rp.
Define L : Rn × Rm+ × Rp → R,
L(x, λ, µ) = f(x) + ∑_{i=1}^m λigi(x) + ∑_{k=1}^p µkhk(x),
I(x) = {i : gi(x) = 0},
∇2L(x∗, λ∗, µ∗) = Hessian of L at (x∗, λ∗, µ∗) w.r.t. x.
The tangent subspace:
M(x) = {d : dT∇gi(x) = 0, i ∈ I(x); dT∇hk(x) = 0, k ∈ [1, p]}
A7-b First and Second Order Opt. Conditions
Theorem [NC]-Necessary Conditions Let x∗ be a local minimum for (NLP). Assume x∗ is regular, namely for k = 1, . . . , p, i ∈ I(x∗): {∇hk(x∗), ∇gi(x∗)} are linearly independent. Then there exist unique λ∗, µ∗ such that:
∇xL(x∗, λ∗, µ∗) = 0,
λ∗i ≥ 0, i = 1, . . . , m;  λ∗i = 0 ∀i ∉ I(x∗)  (first order conditions)
dT∇2L(x∗, λ∗, µ∗)d ≥ 0, ∀d ∈ M(x∗)  (second order conditions)
Theorem [SC]-Sufficient Conditions Suppose that a feasible point x∗ for (NLP) satisfies:
∇xL(x∗, λ∗, µ∗) = 0,
λ∗i ≥ 0, i = 1, . . . , m;  λ∗i = 0 ∀i ∉ I(x∗)  (first order conditions)
dT∇2L(x∗, λ∗, µ∗)d > 0, ∀0 ≠ d ∈ M(x∗)  (second order conditions)
λ∗i > 0, ∀i ∈ I(x∗)  (strict complementarity)
Then x∗ is a strict local minimum point for (NLP), i.e.,
∃Nε(x∗) s.t. f(x∗) < f(x), ∀x ∈ Nε(x∗) ∩ S, x ≠ x∗
A8. Primal-Dual Optimal Solution
Definition The pair (x∗, y∗) ∈ Rn × Rm+ is called a saddle point for L if
L(x∗, y) ≤ L(x∗, y∗) ≤ L(x, y∗), ∀x ∈ Rn, ∀y ∈ Rm+.
Proposition (Saddle point characterization) (x∗, y∗) ∈ Rn × Rm+ is a saddle point for L iff
(a) x∗ = argminx∈Rn L(x, y∗)  (L-optimality)
(b) x∗ ∈ Rn, g(x∗) ≤ 0  (Primal feasibility)
(c) y∗ ∈ Rm+  (Dual feasibility)
(d) y∗i gi(x∗) = 0, i = 1, . . . , m  (Complementarity).
Proposition (Sufficient condition for optimality) If (x∗, y∗) ∈ Rn × Rm+ is a saddle point for L, then x∗ is a global optimal solution for NLP.
Note: valid with zero assumptions on the problem's data! However, for nonconvex problems it is in general difficult to find a saddle point.
A9. Closing the Loop...
In the case of convex data (f, gi), the KKT theorem becomes necessary and sufficient for optimality for the convex program:
(CP)  min{f(x) : g(x) ≤ 0, x ∈ Rn} = f∗
with f : Rn → R, g : Rn → Rm convex.
KKT-Theorem for Convex Programs Let (CP) be convex with optimal value f∗ < ∞, and assume that a (CQ) holds (for example, Slater). Then x∗ is a global minimum for problem (CP) if and only if there exists λ∗ ∈ Rm+ satisfying the KKT system.
...Equivalent to Duality (Zero Gap)...
Note: Linear equality constraints can also be treated easily, with multipliers λ∗ ∈ Rm (no sign restriction in that case).
B1. Convergence and Rate of Convergence
Convergence of an algorithm by itself is important but not enough. We want to know how fast/efficiently it happens. Possible approaches:
Computational Complexity: theoretically estimates the number of elementary operations needed by a given method to find an exact/approximate optimal solution. Provides worst-case estimates, i.e., an upper bound on the number of required operations for a class of problems.
Informational Complexity: estimates the number of function/gradient evaluations needed to find an optimal solution [as opposed to the number of computational operations].
Local Analysis: local behavior of a method near an optimal solution, but ignores behavior when far from a solution.
Which approach is best and/or should be used? Each has advantages and drawbacks!
B2. Local Asymptotic Rate of Convergence Measures
Let {sk} ⊂ R be a positive real sequence converging to zero: limk→∞ sk = 0 (e.g. sk := ‖x∗ − xk‖; sk := |f(xk) − f(x∗)|)
(Q)-Linear [fairly fast]: ∃ρ ∈ (0, 1) : sk+1/sk ≤ ρ, ∀k sufficiently large
Superlinear [faster]: limk→∞ sk+1/sk = 0
Quadratic [very fast]: ∃ρ : sk+1 ≤ ρsk2, ∀k sufficiently large
Quadratic =⇒ Superlinear =⇒ Linear
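These definitions can be illustrated on model error sequences (the sample sequences below are ours):

```python
import math

# Model error sequences s_k for each rate
linear   = [0.5 ** k for k in range(1, 15)]                 # ratio s_{k+1}/s_k = 0.5
superlin = [1.0 / math.factorial(k) for k in range(1, 15)]  # ratio 1/(k+1) -> 0
quad     = [10.0 ** -(2 ** k) for k in range(1, 6)]         # s_{k+1} ~ s_k^2

lin_ratios = [b / a for a, b in zip(linear, linear[1:])]
sup_ratios = [b / a for a, b in zip(superlin, superlin[1:])]
```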
B3. The Armijo Line Search
This is a successive step-size rule where we require more than just reducing the cost.
Armijo Rule Fix scalars s, β, σ with β ∈ (0, 1), σ ∈ (0, 1). Set tk := βmk s, where mk is the first integer m ≥ 0 such that
♦  f(xk + βms dk) − f(xk) ≤ σβms〈f′(xk), dk〉
Stepsizes βms, m = 0, 1, . . . are tried successively until ♦ is satisfied for m = mk.
So, here we are not satisfied with just "cost improvement"; the amount of improvement has to be sufficiently large, as defined in ♦.
PRACTICAL CHOICES: σ ∈ [10−5, 10−1], β = 0.5 or 0.1.
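A backtracking implementation of the rule is a few lines (the helper name and the toy quadratic are ours):

```python
import numpy as np

def armijo_step(f, gradf, x, d, s=1.0, beta=0.5, sigma=1e-4):
    """Armijo rule: return t = beta^m * s for the smallest integer m >= 0 with
    f(x + t d) - f(x) <= sigma * t * <grad f(x), d>, d a descent direction."""
    g_dot_d = float(np.dot(gradf(x), d))
    assert g_dot_d < 0, "d must be a descent direction"
    t, fx = s, f(x)
    while f(x + t * d) - fx > sigma * t * g_dot_d:
        t *= beta
    return t

# Steepest-descent step on f(x) = x1^2 + 10 x2^2
f  = lambda x: x[0] ** 2 + 10 * x[1] ** 2
gf = lambda x: np.array([2 * x[0], 20 * x[1]])
x0 = np.array([1.0, 1.0])
t  = armijo_step(f, gf, x0, -gf(x0))
```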
B4. Other Unconstrained Minimization Algorithms
Quasi-Newton Methods Idea: replace the Hessian (or its inverse) by some P.D. matrix. Done by "mimicking" the minimization of a quadratic function f(x) = 2−1xTQx − bTx. In that case one has
f′(x) − f′(y) = Q(x − y), ∀x, y ∈ Rn
Thus, given a P.D. matrix Hk, we search for a matrix Hk+1 s.t.
Hk+1(f′(xk+1) − f′(xk)) = xk+1 − xk  ←− [QN–Quasi-Newton condition]
There are many solutions satisfying [QN]. One example is BFGS, given by
Hk+1 = Hk + (1 + ykT Hk yk / dkT yk)(dk dkT / dkT yk) − (dk ykT Hk + Hk yk dkT)/(dkT yk)
where yk := f′(xk+1) − f′(xk), dk = xk+1 − xk.
QN scheme: Start with x0 ∈ Rn, H0 ≻ 0
Iteration k: Set dk = −Hkf′(xk); xk+1 = xk + tkdk  [tk: stepsize rules]
Compute: dk = xk+1 − xk; yk := f′(xk+1) − f′(xk)
Update matrix: Hk −→ Hk+1  [e.g. via BFGS or other rules]
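The BFGS update and the [QN] condition can be checked directly; on a quadratic with Hessian Q one has yk = Q dk, and the secant condition Hk+1 yk = dk holds exactly after one update (the function name and test data are ours):

```python
import numpy as np

def bfgs_update(H, d, y):
    """BFGS update of the inverse-Hessian approximation Hk -> Hk+1, with
    d = x_{k+1} - x_k, y = f'(x_{k+1}) - f'(x_k) (requires d.y > 0)."""
    rho = 1.0 / float(d @ y)
    Hy = H @ y
    return (H + (1.0 + rho * float(y @ Hy)) * rho * np.outer(d, d)
              - rho * (np.outer(d, Hy) + np.outer(Hy, d)))

# On a quadratic f with Hessian Q: y = Q d
Q = np.diag([1.0, 2.0, 5.0])
d = np.array([1.0, -1.0, 0.5])
y = Q @ d
H1 = bfgs_update(np.eye(3), d, y)   # satisfies the secant condition H1 @ y == d
```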
B4-b Other Unconstrained Minimization AlgorithmsCtd.
Conjugate Gradients Initially designed for quadratic problems.
CG: Start with x0 ∈ Rn; compute f(x0), d0 = −f′(x0)
Iteration k: xk+1 = xk + tkdk  [tk: exact line search]
Compute: f(xk+1), f′(xk+1), and βk = ‖f′(xk)‖−2 f′(xk+1)T(f′(xk+1) − f′(xk))
Update: dk+1 = −f′(xk+1) + βkdk
Various other formulas exist for βk.
Trust Region Methods Replace Hk in Newton with Ak := Hk + µkI, with µk > 0 such that Ak ≻ 0. Equivalent to imposing a constraint on the length of the direction ("trust region") in the quadratic approximation model:
mind∈Rn {2−1dTHd + gTd : ‖d‖ ≤ l}  (l > 0)
If the Newton direction dk = −Ak−1f′(xk) is inactive, we set µk = 0; otherwise µk > 0 is chosen such that ‖tkdk‖ = lk.
B5. A Basic Multiplier Method for Equality Constraints
min{f(x) : h(x) = 0},  h : Rn → Rm
Lagrangian: L(x, u) = f(x) + uTh(x)
Augmented Lagrangian: A(x, u, c) = L(x, u) + 2−1c‖h(x)‖2
AL = Penalized Lagrangian
Multiplier Method Given {uk, ck}:
1. Find xk+1 = argmin{A(x, uk, ck) : x ∈ Rn}
2. Update rule: uk+1 = uk + ckh(xk+1)
3. Increase ck > 0 if necessary.
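On a toy equality-constrained problem the inner minimization is available in closed form, so the whole scheme is a short loop. A sketch for min x2 s.t. x − 1 = 0, with c kept fixed (the function name is ours; the KKT pair is x∗ = 1, u∗ = −2):

```python
def multiplier_method(u0=0.0, c=1.0, iters=30):
    """Multiplier method for min x^2 s.t. x - 1 = 0 (toy instance).
    A(x, u, c) = x^2 + u(x - 1) + (c/2)(x - 1)^2; the inner argmin solves
    2x + u + c(x - 1) = 0, i.e. x = (c - u)/(2 + c)."""
    u = u0
    for _ in range(iters):
        x = (c - u) / (2.0 + c)    # step 1: minimize A(., u, c) exactly
        u = u + c * (x - 1.0)      # step 2: multiplier update
    return x, u

x, u = multiplier_method()
```

Note that c never needs to grow: the dual update alone drives h(xk) → 0.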
B5-b Features of Multipliers Method
A key advantage: it is not necessary to increase ck to ∞ for convergence (as opposed to "Penalty/Barrier methods").
As a result, A is less subject to ill-conditioning, and more robust.
The AL depends on c but also on the dual multiplier u: better/faster convergence can be expected (rather than keeping u constant).
Useful for designing well-behaved decomposition/splitting schemes.
Extendible to various models such as variational inequalities and semidefinite programs.
B6. Self-Concordance Theory
Idea: make the convergence analysis coordinate invariant [Newton's method is coordinate invariant... but the convergence analysis is not!]
Achieved for self-concordant convex functions.
Definition-SCF Let f ∈ C3(dom f) be convex. Then f is called self-concordant if ∃Mf ≥ 0 such that:
(SC)  |D3f(x)[d, d, d]| ≤ Mf(D2f(x)[d, d])3/2, ∀x ∈ dom f, d ∈ Rn.
(Here (d3/dt3)f(x + td)|t=0 = D3f(x)[d, d, d] = 〈f′′′(x)[d]d, d〉; set θ(t) ≡ f(x + td).)
θ is SC ⇐⇒ |θ′′′(t)| ≤ Mθ′′(t)3/2, ∀t ∈ dom θ,
i.e., the Hessian does not vary too fast in its own metric.
B7. Examples of Self-Concordant functions
1. Linear and convex quadratic
f(x) = 2−1xTAx − bTx + c, A ∈ Sn+, dom f = Rn
Then f′(x) = Ax − b, f′′(x) = A, f′′′(x) = 0 =⇒ Mf = 0
2. Logarithmic Barrier f(x) = − log x, dom f = (0, +∞)
Then f′(x) = −x−1, f′′(x) = x−2, f′′′(x) = −2x−3 =⇒ Mf = 2
3. Log-barrier of a quadratic region
f(x) = − log q(x), q(x) = c + bTx − 0.5xTAx, dom f = {x : q(x) > 0}. Then one verifies that f is SC with Mf = 2.
4. The following functions on R are NOT SC:
ex, x−p (x, p > 0).
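Example 2 can be verified numerically: for f(x) = − log x the SC inequality holds with equality, since |f′′′(x)| = 2x−3 = 2(x−2)3/2 = 2 f′′(x)3/2 (the check below is ours):

```python
# Check |f'''(x)| = 2 * f''(x)^{3/2} for f(x) = -log x, i.e. Mf = 2 with equality.
xs = [0.1, 0.5, 1.0, 3.0, 10.0]
checks = [(abs(-2.0 * x ** -3), 2.0 * (x ** -2) ** 1.5) for x in xs]
```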
B8. Calculus Rules for SC functions
Affine Invariance of SC Let A : Rn → Rm be an affine map, A(x) = Ax + b. If f is Mf-self-concordant, then g(x) := f(Ax + b) is self-concordant with Mg ≡ Mf.
f SC =⇒ g = af is SC ∀a ≥ 1, with Mg = a−1/2Mf
f, g SC =⇒ h = f + g SC, with Mh = max{Mf, Mg}
Composition with logarithm Let h : R → R be convex with dom h = (0, +∞) and such that
(L)  |h′′′(x)| ≤ 3x−1h′′(x), ∀x > 0
Then f(x) := − log(−h(x)) − log x is SC on R++ ∩ {x : h(x) < 0}.
Many functions satisfy (L):
−xp (0 < p ≤ 1); x log x; − log x; x−2(ax + b)2; xp (−1 ≤ p ≤ 0).
Useful to establish self-concordance of the following (important) functions:
f(x) = −∑_{i=1}^m log(bi − aiTx);  dom f = {x : aiTx < bi, i ∈ [1, m]}
f(x) = − log det X;  dom f = Sn++
f(x) = − log(t2 − ‖x‖2);  dom f = {(x, t) : ‖x‖ < t}
B9. Newton with Self-Concordant Functions
Consider the problem min{f(x) : x ∈ dom f} and the Newton scheme
x+ = x − f′′(x)−1f′(x)
Theorem - Existence f attains its minimum over dom f iff there exists x ∈ dom f such that λ(x) < 1 (λ(x) denotes the Newton decrement).
For every x with the latter property we can establish the following key results (all estimates are parameter free!):
f(x) − f(x∗) ≤ h∗(λ(x))  [conjugate of h, h∗(s) := −s − log(1 − s)]
(x − x∗)Tf′′(x)(x − x∗) ≤ (h∗)′(λ(x))
λ(x+) ≤ 2λ2(x)
The last result provides the region of quadratic convergence (with γ ∈ (0, q), where q solves λ = (1 − λ)2):
λ(x) < q = 2−1(3 − √5) =⇒ λ(x+) < λ(x)
B9.b Self Concordant Barrier
Definition Let F be a self-concordant function. The function F is called a ν-self-concordant barrier [SCB] for the set dom F if for any x ∈ dom F:
maxu∈Rn {2〈F′(x), u〉 − 〈F′′(x)u, u〉} ≤ ν
ν is called the parameter of the barrier.
This is a very general definition that can be simplified, assuming that F′′(x) is non-singular, to:
〈F′′(x)−1F′(x), F′(x)〉 ≤ ν
or to:
〈F′(x), u〉2 ≤ ν〈F′′(x)u, u〉, ∀u ∈ Rn, ∀x ∈ dom F
Linear and quadratic functions are not SCBs.
Examples of SCB: F(x) = − log x, dom F = R+, and F(x) = − log q(x), q(x) = −0.5xTQx + 〈c, x〉 + d, dom F = {x : q(x) > 0}, Q ∈ Sn+, are 1-SCBs.
B10. Primal-Dual Interior Methods
The KKT system with perturbed complementarity
⇐⇒ ♠ argminx,v {f(x) − (g(x) − v)Ty − µ ∑_{i=1}^m log vi}
∇f(x) − ∇g(x)y = 0,  V := Diag(v), Y := Diag(y)
V Y e = µe;  µ > 0, e := (1, . . . , 1)T
g(x) − v = 0
Apply Newton's Method to generate the new point (x+, v+, y+) = (x, v, y) + t(∆x, ∆v, ∆y):
[ −∇2L(x, v)  −∇g   ] [ ∆x ]   [ ∇f(x) − ∇g(x)y ]
[ −∇gT       V Y−1  ] [ ∆y ] = [ µY−1e − g(x)   ]
t is chosen to ensure (v+, y+) > 0 and that the merit function is sufficiently reduced:
M(x, v) = f(x) − µ ∑_{i=1}^m log vi + (β/2)‖g(x) − v‖2;  µ = δ vTy/m;  δ ∈ (0, 1), β > 0
See LOQO for convex and nonconvex problems, [Vanderbei-Shanno, 1999]
C1. The Class F+(C̄)
This class allows us to derive pointwise convergence results; the requested properties below mimic "norms".
We write (d, H) ∈ F+(C̄)(⊂ F(C̄)) when the function H satisfies the following two additional properties:
(a1) ∀y ∈ C̄ and ∀{yk} ⊂ C bounded with limk→+∞ H(y, yk) = 0, one has limk→+∞ yk = y
(a2) ∀y ∈ C̄ and ∀{yk} ⊂ C with yk → y, we have limk→+∞ H(y, yk) = 0.
C2. The Interior Proximal Algorithm–IPA
Given d ∈ F, λk > 0, εk ≥ 0. (IPA is well defined, see [].)
Start from a point x0 ∈ C.
Generate a sequence {xk} ⊂ C, with gk ∈ ∂εkf(xk), such that
λkgk + ∇1d(xk, xk−1) = 0.
The IPA can be viewed as
an approximate interior proximal method when εk > 0 ∀k ∈ N,
which becomes exact in the special case εk = 0 ∀k ∈ N.
C3. Convergence Results I: Global Rate
Theorem G1 Let (d, H) ∈ F(C) and let {xk} be the sequence generated by IPA. Set σn = Σ_{k=1}^n λk. Then the following hold:

(i) f(xn) − f(x) ≤ σn⁻¹ H(x, x0) + σn⁻¹ Σ_{k=1}^n σk εk, ∀x ∈ C.

(ii) If lim_{n→∞} σn = +∞ and εk → 0, then lim inf_{n→∞} f(xn) = f∗, and the sequence {f(xk)} converges to f∗ whenever Σ_{k=1}^∞ εk < ∞.

(iii) Furthermore, suppose X∗ ≠ ∅, and consider the following cases: (a) X∗ is bounded; (b) Σ_{k=1}^∞ λk εk < ∞ and (d, H) ∈ F(C̄). Then, under either (a) or (b), the sequence {xk} is bounded with all its limit points in X∗.

An immediate by-product yields the following global rate of convergence estimate for the exact version of IPA (εk = 0, ∀k).

Theorem G2 Let (d, H) ∈ F(C) and let {xk} be the sequence generated by IPA with εk = 0, ∀k. Then f(xn) − f(x) = O(σn⁻¹), ∀x ∈ C.
C4. Convergence Results II: Pointwise Convergence
To establish the global convergence of the sequence {xk} to an optimal solution of problem (P), we use the class F+(C).

Theorem G3 Let (d, H) ∈ F+(C) and let {xk} be the sequence generated by IPA. Suppose that the optimal set X∗ of (P) is nonempty, σn = Σ_{k=1}^n λk → ∞, Σ_{k=1}^∞ λk εk < ∞, and Σ_{k=1}^∞ εk < ∞. Then the sequence {xk} converges to an optimal solution of (P).
C5. Comments
Note that we have separated the two types of convergence results to emphasize:

The differences and the roles played by each of the three classes

F+(C) ⊂ F(C̄) ⊂ F(C)

That the largest, and least demanding, class F(C) already provides reasonable convergence properties for IPA, with minimal assumptions on the problem's data.
These aspects are now illustrated by several application examples.
C6. Proximal Distances (d, H): Application Examples
In most situations, when constructing an IPA for solving the convex problem (P), the proximal distance H induced by d will have a special structure, known as a Bregman proximal distance Dh, generated by some convex kernel h.

We first recall the special features of a Bregman proximal distance.

We then consider various types of constraint sets C for problem (P), and give many examples of pairs (d, H) for which our convergence results hold.
C7. Bregman-proximal distances: Definition
Let h : Rn → R ∪ {+∞} be a proper, lsc, convex function with dom h ⊂ C̄ and dom ∇h = C, strictly convex and continuous on dom h, and C¹ on int dom h = C. Define, ∀x ∈ Rn, ∀y ∈ dom ∇h:

H(x, y) := Dh(x, y) := h(x) − [h(y) + 〈∇h(y), x − y〉]   (1)

The function Dh enjoys a remarkable three-point identity that plays a central role in the analysis:

H(c, a) = H(c, b) + H(b, a) + 〈c − b, ∇₁H(b, a)〉, ∀a, b ∈ C, ∀c ∈ dom h

To handle the constraint cases C versus C̄, we need to consider two types of convex kernels h.
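The three-point identity can be checked numerically; below we use the entropy kernel on R^n_{++}, with the kernel and the three test points being our own choices.

```python
import numpy as np

# Numerical check of the three-point identity, using the entropy kernel
# h(x) = sum_j x_j log x_j on C = R^n_{++} (kernel and points are our choices).
def h(x):      return float(np.sum(x * np.log(x)))
def grad_h(x): return np.log(x) + 1.0

def D(x, y):   # Bregman proximal distance: h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - float(grad_h(y) @ (x - y))

rng = np.random.default_rng(0)
a, b, c = rng.uniform(0.1, 2.0, size=(3, 4))          # three interior points
# note grad_1 D(b, a) = grad_h(b) - grad_h(a)
lhs = D(c, a)
rhs = D(c, b) + D(b, a) + float((c - b) @ (grad_h(b) - grad_h(a)))
print(abs(lhs - rhs) < 1e-10)   # True: the identity holds
```

Expanding the definition of Dh shows the identity is purely algebraic, which is why it holds for any admissible kernel h, not just the entropy.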
C8. Difference between F(C) and F+(C): Some Examples
We let C = Rn++. More examples will follow.
Example: Separable Bregman proximal distances are the most commonly used in the literature. Let θ : R → R ∪ {+∞} be a proper, convex, lsc function with (0, +∞) ⊂ dom θ ⊂ [0, +∞) and

θ ∈ C²(0, +∞),  θ′′(t) > 0 ∀t > 0,  lim_{t→0+} θ′(t) = −∞

We denote this class by Θ0 if θ(0) < +∞, and by Θ+ whenever θ(0) = +∞ and θ is nonincreasing.

Given θ in either class, define

h(x) = Σ_{j=1}^n θ(xj)  =⇒  Dh is separable
C9. Typical Choices for θ
The first two examples are functions θ ∈ Θ0, i.e., with dom θ = [0, +∞)and the last two are in Θ+, i.e., with dom θ = (0, +∞):
θ1(t) = t log t, (Shannon entropy).
θ2(t) = (pt − tp)/(1 − p), with p ∈ (0, 1).
θ3(t) = − log t (Burg’s entropy).
θ4(t) = t⁻¹.
Then, one can verify that for the corresponding proximal distances:
Dh1 , Dh2 ∈ F+(C), while Dh3 , Dh4 ∈ F(C)
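For θ1, the separable construction recovers a familiar object: Dh1 is the Kullback-Leibler divergence. A quick numerical confirmation (the test points in R^n_{++} are our own):

```python
import numpy as np

# For theta_1(t) = t log t (Shannon entropy), the separable Bregman distance
# D_h reduces to the Kullback-Leibler divergence.
def breg(theta, dtheta, x, y):
    return float(np.sum(theta(x) - theta(y) - dtheta(y) * (x - y)))

theta1  = lambda t: t * np.log(t)
dtheta1 = lambda t: np.log(t) + 1.0

x = np.array([0.5, 1.5, 2.0])     # our own interior test points
y = np.array([1.0, 1.0, 0.5])
kl = float(np.sum(x * np.log(x / y) - x + y))
print(abs(breg(theta1, dtheta1, x, y) - kl) < 1e-12)   # True
```

The same `breg` helper applied to θ3(t) = −log t would produce the Itakura-Saito-type distance used in the convex programming example that follows.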
C.10 Convex Programming: C = {x : fi(x) ≥ 0, i ∈ [1, m]}

[CP] min{〈c, x〉 : fi(x) ≥ 0, i ∈ [1, m]}

Let fi : Rn → R be concave and C¹ on Rn for each i ∈ [1, m].

We suppose that Slater's condition holds: ∃x0 ∈ Rn : fi(x0) > 0, ∀i ∈ [1, m].
For θ ∈ Θ+ and x ∈ C let

hν(x) = Σ_{i=1}^m θ(fi(x)) + (ν/2)‖x‖²,  with ν > 0

Set d(x, y) = Dhν(x, y); then (d, Dhν) ∈ F(C).
Convex Programming–Continued
An interesting algorithm for solving [CP] is then obtained by choosing θ(t) ≡ θ3(t) = − log t. In this case we obtain:

d(x, y) = Dhν(x, y) = Σ_{i=1}^m [ − log(fi(x)/fi(y)) + 〈∇fi(y), x − y〉/fi(y) ] + (ν/2)‖x − y‖²

The constrained convex program has thus been reduced to performing, at each step, an unconstrained minimization with an objective of the form:

− Σ_{i=1}^m log fi(x) + (ν/2)‖x‖² + 〈x, Lk〉

(All "constant" terms depending on k through yk are collected in Lk.)

This bears similarity with barrier and center methods....

Note: this d(·, y) enjoys other interesting properties, e.g., when the fi are concave quadratic, d(·, y) is self-concordant for each y ∈ C.
C11. Second order cone constraints: C = Ln+

Let Ln+ := {x ∈ Rn : xn ≥ (x1² + · · · + xn−1²)^{1/2}} be the Lorentz cone.

Let Dn be the diagonal matrix Dn = diag(−1, . . . , −1, 1).

Define h : Ln++ → R by h(x) = − ln(x^T Dn x) + (ν/2)‖x‖².

Then h is proper, lsc and convex on dom h = Ln++. The Bregman proximal distance associated with h is given by

Dh(x, y) = − log(x^T Dn x / y^T Dn y) + 2 x^T Dn y / y^T Dn y − 2 + (ν/2)‖x − y‖².

Thus, with d = Dh, we have (Dh, Dh) ∈ F(Ln++).

Similarly, we can handle the case C = {x ∈ Rn : Ax − b ∈ Ln+}.
C12. Not Self-Proximal: ϕ-Divergence Kernels on Rn+
Let ϕ : R → R ∪ {+∞} be a lsc, convex, proper function such that dom ϕ ⊂ R+ and dom ∂ϕ = R++. We suppose in addition that ϕ is C², strictly convex, and nonnegative on R++, with ϕ(1) = ϕ′(1) = 0. We denote by Φ the class of such kernels, and by:

Φ1 the subclass of kernels satisfying  ϕ′′(1)(1 − t⁻¹) ≤ ϕ′(t) ≤ ϕ′′(1) log t, ∀t > 0;

Φ2 the subclass satisfying  ϕ′′(1)(1 − t⁻¹) ≤ ϕ′(t) ≤ ϕ′′(1)(t − 1), ∀t > 0.

Examples of functions in Φ1, Φ2 are:

ϕ1(t) = t log t − t + 1,  dom ϕ = [0, +∞),
ϕ2(t) = − log t + t − 1,  dom ϕ = (0, +∞),
ϕ3(t) = 2(√t − 1)²,  dom ϕ = [0, +∞).
C13. An Important Example: The Log-Quad ProximalKernel
Let ν > µ > 0 be given fixed parameters, and

Φ2 ∋ ϕ(t) = (ν/2)(t − 1)² + µ(t − log t − 1);  t > 0

Proposition
(i) ϕ is strongly convex on R++ with modulus ν > 0.
(ii) The conjugate of ϕ is given by

ϕ∗(s) = (ν/2) t²(s) + µ log t(s) − ν/2,
t(s) := (2ν)⁻¹ { (ν − µ) + s + √( ((ν − µ) + s)² + 4µν ) } = (ϕ∗)′(s).

(iv) dom ϕ∗ = R, and ϕ∗ ∈ C∞(R).
(v) (ϕ∗)′(s) = (ϕ′)⁻¹(s) is Lipschitz on R, with constant ν⁻¹.
(vi) (ϕ∗)′′(s) ≤ ν⁻¹, ∀s ∈ R.

Smooth Lagrangian Multiplier methods with Log-Quad: handle easily very large scale instances, e.g., (n × m = 1000 × 50,000).

The number of Newton steps does not increase with dimension...

Also solve (to a local min.) nonconvex problems.... (No proofs for that....!)
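The key relation (ϕ∗)′ = (ϕ′)⁻¹ = t(·) in the proposition can be checked numerically; the parameter values ν, µ below are our own test choices.

```python
import math

# Check numerically that t(s) from the proposition inverts phi', i.e.
# phi'(t(s)) = s, so (phi*)'(s) = (phi')^{-1}(s) = t(s).
nu, mu = 2.0, 1.0                 # any nu > mu > 0 will do
phi_prime = lambda t: nu * (t - 1.0) + mu * (1.0 - 1.0 / t)

def t_of_s(s):
    a = (nu - mu) + s
    return (a + math.sqrt(a * a + 4.0 * mu * nu)) / (2.0 * nu)

ok = all(abs(phi_prime(t_of_s(s)) - s) < 1e-9 for s in [-5.0, -1.0, 0.0, 3.0, 10.0])
print(ok)   # True
```

Indeed, setting ϕ′(t) = s and multiplying through by t gives the quadratic νt² − ((ν − µ) + s)t − µ = 0, whose positive root is exactly t(s).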
C14. BIG with Armijo-Goldstein stepsize rule
We use a generalized stepsize rule, reminiscent of the one used in the classical projected gradient method.

Algorithm 1: Armijo-Goldstein stepsize rule. Let β ∈ (0, 1), m ∈ (0, 1) and s > 0 be fixed chosen scalars.

Step 0 Start from a point x0 ∈ C ∩ V.

Step k Generate the sequence {xk} ⊂ C ∩ V as follows:
If ∇f(xk−1) ∈ V⊥, stop.
Otherwise, with xk(λ) = u(λ∇f(xk−1), xk−1), set λk = β^{jk} s, where jk is the first nonnegative integer j such that

f(xk(β^j s)) − f(xk−1) ≤ m 〈∇f(xk−1), xk(β^j s) − xk−1〉.   [AG]

Set xk = xk(λk); k ← k + 1; go to step k.

With some work... it can be proved that the [AG] stepsize rule is well defined.
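A minimal sketch of the [AG] rule, taking the entropy kernel on the unit simplex so that u(v, x) is the EMDA-style multiplicative update; the objective and its data are our own toy choices, with the minimizer lying inside the simplex.

```python
import numpy as np

def u(v, x):                      # argmin_z { <v, z> + D_h(z, x) : z in simplex }
    w = x * np.exp(-v)            # entropy kernel => multiplicative update
    return w / w.sum()

def ag_step(f, grad, x, s=1.0, beta=0.5, m=0.1, max_backtracks=60):
    g = grad(x)
    lam = s
    for _ in range(max_backtracks):   # backtrack until the [AG] test is met
        x_new = u(lam * g, x)
        if f(x_new) - f(x) <= m * (g @ (x_new - x)):
            return x_new
        lam *= beta
    return x                      # stepsize underflow: near-stationary, stay put

a = np.array([0.2, 0.5, 0.3])     # our toy data: f minimized over the simplex at a
f = lambda x: 0.5 * float(np.sum((x - a) ** 2))
grad = lambda x: x - a
x = np.ones(3) / 3
for k in range(100):
    x = ag_step(f, grad, x)
print(np.allclose(x, a, atol=1e-3))   # True: iterates converge to the optimum a
```

Note that no Lipschitz constant of ∇f is needed: the backtracking loop discovers an admissible stepsize, which is exactly the practical advantage exploited in the modified EMDA below.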
C15. Convergence of Algorithm 1
Theorem A1 Let (d, H) ∈ F(C), and let {xk} be the sequence produced by Algorithm 1. Then:

The sequence {f(xk)} is nonincreasing and converges to f∗.

Suppose that the optimal set X∗ of problem (M) is nonempty. Then:
(a) if X∗ is bounded, {xk} is bounded with all its limit points in X∗;
(b) if (d, H) ∈ F+(C), {xk} converges to an optimal solution of (P), and the following global rate estimate holds:

f(xn) − f∗ = O(n⁻¹)
C16. Cases C = Rn++; Sn++; Ln++

Let d(x, y) = p(x, y) + (σ/2)‖x − y‖².

• p(z, x) = µ Σ_{j=1}^n xj^r ϕ(xj⁻¹ zj), σ ≥ µ > 0, for (z, x) ∈ C × C, and ϕ(t) = − log t + t − 1 [r = 2; the log-quad function]; one obtains:

ui(v, x) = xi (ϕ∗)′(−vi xi⁻¹),  i = 1, . . . , n.

• (SDP) Take p(x, y) = tr(− log x + log y + xy⁻¹) − n, ∀x, y ∈ Sn++. One has, ∀x ∈ Sn++, v ∈ Sn:

u(v, x) = (2σ)⁻¹ ( A(v, x) + √( A²(v, x) + 4σI ) ),  with A(v, x) := σx − v − x⁻¹.

• (SOC) Take p(x, y) = − log(x^T Dn x / y^T Dn y) + 2 x^T Dn y / y^T Dn y − 2, ∀x, y ∈ Ln++. It can be shown that:

u(v, x) = s w,  with s := (2σ)⁻¹ ( √(1 + 8σ‖w‖⁻²) − 1 ),  w := 2τ(x)⁻¹ Dn x + v − σx,  τ(x) = x^T Dn x.
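The SDP update formula can be verified numerically: u = (2σ)⁻¹(A + √(A² + 4σI)) should satisfy σu − u⁻¹ = A (the matrices commute, since u is a function of A). The test data below are our own random choices.

```python
import numpy as np

def mat_sqrt(M):                  # square root of a symmetric PSD matrix
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.sqrt(w)) @ Q.T

sigma = 1.5
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
x = B @ B.T + 3.0 * np.eye(3)     # x in S^n_{++}
v = 0.5 * (B + B.T)               # arbitrary symmetric v
A = sigma * x - v - np.linalg.inv(x)
u = (A + mat_sqrt(A @ A + 4.0 * sigma * np.eye(3))) / (2.0 * sigma)
print(np.allclose(sigma * u - np.linalg.inv(u), A))   # True, and u is PD
```

Eigenvalue by eigenvalue this is the scalar identity σg(a) − 1/g(a) = a for g(a) = (a + √(a² + 4σ))/(2σ), the same quadratic that underlies the Rn++ and SOC cases.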
C17. Modifying The EMDA
We can modify the EMDA with an Armijo-Goldstein step-size rule(since here d ≡ E is 1-strongly convex).
Therefore, we can apply Theorem A1, proving that the sequence {xk}of EMDA with λk defined by the Armijo-Goldstein stepsize rule [AG]converges to an optimal solution of (M).
This modified version can be more practical since, in particular, we do not need to know or compute the Lipschitz constant Lf.
Another advantage of the entropy kernel: it extends to SDP constraints

∆ ≡ {x ∈ Sn : tr(x) = 1, x ⪰ 0}  with  d(x, y) := tr(x log x − x log y) on ∆
C18. The Key Result to Update {xk}, {yk} in IGA
Theorem Let σ > 0, L > 0 be given. Suppose that for some k ≥ 0 we have a point xk ∈ C ∩ V such that f(xk) ≤ q∗k = min{qk(x) : x ∈ C ∩ V}. Let αk ∈ [0, 1), ck+1 = (1 − αk)ck, and let {zk} ⊂ C ∩ V be generated by

zk+1 = argmin { 〈x, (αk/ck+1) ∇f(yk)〉 + H(x, zk) : x ∈ C ∩ V }

Define

yk = (1 − αk)xk + αk zk,
xk+1 = (1 − αk)xk + αk zk+1.

Then

q∗k+1 ≥ f(xk+1) + (1/2) ( ck+1 σ / α²k − L ) ‖xk+1 − yk‖².

Therefore, by taking for example Lα²k = σck(1 − αk) = σck+1, we can guarantee that q∗k+1 ≥ f(xk+1). This leads to the desired interior gradient algorithm.
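The update scheme of the theorem can be sketched with the entropy kernel on the simplex (σ = 1 in the ℓ1 norm) and the stepsize coupling Lα²k = σck+1; the toy objective and its data below are our own choices.

```python
import numpy as np

def u_ent(v, z):                  # argmin { <v, x> + D_h(x, z) : x in simplex }
    w = z * np.exp(-v)
    return w / w.sum()

a = np.array([0.2, 0.5, 0.3])     # our toy data: f minimized on the simplex at a
f = lambda x: 0.5 * float(np.sum((x - a) ** 2))
grad = lambda x: x - a
L, sigma = 1.0, 1.0               # grad Lipschitz constant; entropy modulus

x = z = np.ones(3) / 3
c = 1.0
for k in range(500):
    # alpha_k solves L*alpha^2 = sigma*c*(1 - alpha) = sigma*c_{k+1}
    alpha = (-sigma * c + np.sqrt(sigma**2 * c**2 + 4 * L * sigma * c)) / (2 * L)
    c_next = (1 - alpha) * c
    y = (1 - alpha) * x + alpha * z
    z = u_ent((alpha / c_next) * grad(y), z)   # z_{k+1}
    x = (1 - alpha) * x + alpha * z            # x_{k+1}
    c = c_next
print(f(x) < 1e-5)   # True: f(x_k) - f* decays like O(L/k^2)
```

With this coupling ck = O(k⁻²), which is the accelerated (Nesterov-type) rate delivered by the interior gradient algorithm while every iterate stays in C ∩ V.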
R. Short Bibliography—Some Books for parts [A]-[B]
A. Auslender and M. Teboulle, Asymptotic Cones and Functions in Optimization and Variational
Inequalities, Springer Monographs in Mathematics, Springer-Verlag New-York, 2003.
A. Ben-Tal, A. Nemirovski, Lectures on modern convex optimization. Analysis, algorithms, and
engineering applications, SIAM Publications, 2001.
D. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, Massachusetts, 1999.
A. V. Fiacco and G. P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization
Techniques, Classics in Applied Mathematics, SIAM , Philadelphia, 1990.
O. L. Mangasarian, Nonlinear programming, McGraw-Hill Publishing Company, 1969.
A. Nemirovski and D. Yudin, Problem complexity and Method Efficiency in Optimization, John Wiley
New York, 1983.
Y. Nesterov, A. Nemirovski, Interior point polynomial algorithms in convex programming, SIAM
Publications, Philadelphia, PA, 1994.
J. Nocedal, S.J. Wright, Numerical Optimization, Springer Verlag, New York, 1999.
J. M. Ortega and W. C. Rheinboldt, Iterative solution of nonlinear equations in several variables,
Academic Press, 1970.
R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
Refs. on some of our recent works related to Part C
A. Auslender, M. Teboulle and S. Ben-Tiba, “Interior Proximal and Multiplier Methods based on
Second Order Homogeneous Kernels”, Mathematics of Operations Research, 24, (1999) 645–668.
A. Auslender, M. Teboulle, “Lagrangian duality and related multiplier methods for variational
inequalities”, SIAM J. Optimization, 10, (2000), 1097–1115.
A. Auslender, M. Teboulle, “Entropic proximal decomposition methods for convex programs and
variational inequalities”, Mathematical Programming, 91, (2001), 33-47.
A. Auslender and M. Teboulle, “Interior gradient and epsilon-subgradient descent methods for
constrained convex minimization”, Mathematics of Operations Research, 29, 2004, 1–26.
A. Auslender and M. Teboulle, “A unified framework for interior gradient/subgradient and proximal
methods in convex optimization”, February 2003 (submitted for publication).
A. Beck and M.Teboulle “Mirror descent and nonlinear projected subgradient methods for convex
optimization”, Operations Research Letters, 31, 2003, 167-175.
J. Bolte and M. Teboulle, “Barrier operators and associated gradient-like dynamical systems for
constrained minimization problems”, SIAM J. of Control Optimization, 42, (2003), 1266–1292.
M. Doljanski and M. Teboulle, “An Interior Proximal Algorithm and the exponential multiplier method
for Semidefinite Programming”, SIAM J. of Optimization, 9, 1998, 1–13.