Nonlinear Programming - Department of Mathematical Sciences
Nonlinear Programming
Marc Teboulle
School of Mathematical Sciences
Tel-Aviv University, Ramat-Aviv, Israel
[email protected], http://www.math.tau.ac.il/teboulle
Tutorial Talk presented at the Summer School of ICCOPT-I, July 30-August 4, 2004, RPI, Troy
Marc Teboulle–Tel-Aviv University – p. 1
Opening Remarks...
Most optimization problems are not solvable!
This is very good news for us....
An Example: Present in 2 hours the basic and recent results on NLP!..... A typical constrained nonlinear (ill-posed) problem.....
What "minimal" (another optimization problem!) material should we learn/know?
Two main issues: Theory and Computation
Many commercial packages exist to solve NLP. However, these come in black-box form.
To understand how optimization methods work, their power and their limitations, and whether they do (or do not) solve a problem, we must understand the basic underlying theory.
Contents
Part A. Optimization Theory
Ideas and Principles
Convexity and Duality
Optimality Conditions
Part B. Optimization Algorithms
Basic and Classical Iterative Schemes
Convergence and Complexity issues
Modern Interior and Polynomial Methods
Part C. Some Recent Developments (..Biased of course!..)
Interior Proximal Algorithms
Smooth Lagrangian Multiplier Methods
Elementary Algorithms: Interior Gradient-like Schemes, suitable for very large scale problems
A Short History of Optimization....
Fermat (1629): Unconstrained Minimization Principle
...+160...Lagrange (1789) Equality Constrained Problems (Mechanics)
Calculus of Variations, 18-19th Century [Euler, Lagrange, Legendre, Hamilton...]
...+150...Karush (1939), Fritz-John (47), Kuhn-Tucker(51)
KKT Theorem for Inequality Constraints: Modern Optimization Era begins...
Engineering Applications (1960)
Optimal Control Bellman, Pontryagin...
Major Developments (50’s with LP) and 60-80’s for NLP
Polynomial Interior Points Methods for Convex Optimization Nesterov-Nemirovsky
(1988)
Combinatorial Problems via continuous approximations 90’s
....More theory, algorithmic developments, and many more specific models and applications ....
Nonlinear Programming: Formulation
(O) minimize {f(x) : x ∈ X ∩ C}
• X ⊂ Rn for implicit or simple constraints (here X ≡ Rn)
• C a set of explicit constraints described by
C = {x ∈ Rn : gi(x) ≤ 0, i = 1, . . . , m; hi(x) = 0, i = 1, . . . , p}.
All the functions in problem (O) are real-valued functions on Rn.
Important Special Case: X ∩ C ≡ Rn
The unconstrained minimization problem
(U) minimize{f(x) : x ∈ Rn}
Many methods for constrained problems eventually need to solve some type of problem (U)
Applications of NLP
OPTIMIZATION APPEARS TO BE PRESENT "ALMOST" EVERYWHERE....
Planning, Management Operations, Logistics (It all started with LP..)
Data Networks, Finance-Economics, VLSI Design
Pattern Recognition, Data Analysis/Mining, Resource Allocation
Mechanical/structural design, Chemical Engineering,...
Machine Learning, Classification
Signal Processing, Communication Systems, Tomography......
.....and of course in Mathematics Itself...!
Definitions and Terminology
(O) minimize{f(x) : x ∈ C}
A point x ∈ C is called a feasible solution of (O).
An optimal solution is any feasible point where the local or global minimum of f relative to C is actually attained.
Definition: Let Nε := Nε(x∗) ≡ a neighborhood of x∗. Then,
x∗ local minimum : f(x∗) ≤ f(x), ∀x ∈ C ∩ Nε
x∗ global minimum : f(x∗) ≤ f(x), ∀x ∈ C
x∗ strict local minimum : f(x∗) < f(x), ∀x ∈ C ∩ Nε, x ≠ x∗
Note: There are also ”max” problems...But...
max F ≡ −min[−F ]
How to Solve an Optimization Problem ?
Analytically/Explicitly: Very rarely....or Never....
We try to generate an Iterative (Descent) Algorithm to approximately solve the problem to a prescribed accuracy
Algorithm: a map A : x → y (start with x to get some new point y)
Iterative: generate a sequence of points, each calculated from the prior point (or points)
Descent: each new point y is such that f(y) < f(x)
Accuracy: eventually, we find some x̂ such that
f(x̂) − f(x∗) ≤ ε
A Powerful Algorithm..
Set k = 0, start with x0 somewhere
While xk ∉ D ≡ {set of desirable points} Do {
xk+1 = A(xk)
k ← k + 1
}
Stop
Expected Output(s): {xk} is a minimizing sequence:
f(xk) → f∗ (the optimal value) as k → ∞,
or, even more, xk → x∗, an optimal solution, denoted via
x∗ ∈ argmin{f(x) : x ∈ C} ≡ {x ∈ C : f(x) = inf f}
Some Basic Questions
How do we pick the initial starting point?
How to construct A so that xk converges to optimal x∗?
How do we stop the algorithm?
How close is the approximate solution to the optimal one? (which we do not know!!)
How sensitive is the whole process to data perturbations (small and large!)?
How do we measure the efficiency of an algorithm converging to optimality?
Computational cost per-iteration? Total complexity ?
Emerging Topics and Tools
To answer these questions, we need an appropriate mathematical theory and tools. For example:
Existence of optimal solutions
Optimality conditions
Convexity and Duality
Convergence and Numerical Analysis
Error and Complexity Analysis
While each algorithm for each type of problem will often require a specific analysis (e.g., exploiting special structure of the problem), the above tools remain essential and fundamental.
Convexity—–(See More in [A2-A4])
S ⊂ Rn is convex if the line segment joining any two points of S is
contained in it:
∀x, y ∈ S, ∀λ ∈ [0, 1] =⇒ λx + (1 − λ)y ∈ S
f : S → R is convex if for any x, y ∈ S and any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
A Key Fact: Local Minima are also Global under convexity
♣ Convexity plays a fundamental role in optimization, even in nonconvex problems...!
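The defining inequality is easy to probe numerically. The sketch below (a minimal check, assuming only NumPy; the quadratic test function is an arbitrary convex choice, not one from the lecture) samples random point pairs and verifies the convexity inequality at each:

```python
import numpy as np

# Numerically probe the convexity inequality for f(x) = ||x||^2 (convex):
# f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y), for all x, y, lam in [0, 1].
def f(x):
    return float(np.dot(x, x))

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    if lhs > rhs + 1e-12:
        violations += 1

print(violations)  # → 0: the inequality holds at every sampled pair
```

Such a sampled check can of course only refute convexity, never prove it; a single violation suffices to show a function is nonconvex.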
A Simple and Powerful Geometric Result: Separation
Any point outside a nonempty closed convex set C ⊂ Rn can be separated from C by a hyperplane H, i.e., let y ∉ C; then
∃ 0 ≠ a ∈ Rn, α ∈ R : 〈a, x〉 ≤ α < 〈a, y〉, ∀x ∈ C,
and H := {x ∈ Rn : 〈a, x〉 = α}
This result is fundamental and with far reaching consequences, e.g.,
Alternative Theorems [see A6]
Optimality Conditions
Duality
Existence of Minimizers: inf{f(x) : x ∈ C}
When will f : C → R attain its infimum over C ⊂ Rn?
Classical Answer–Weierstrass Theorem: A continuous function defined on a compact subset of Rn attains its minimum.
This is a topological problem
How do we get "useful" conditions for testing existence? Mimic Weierstrass..
Pick x0 ∈ C, Lf := {x | f(x) ≤ f(x0)}, and consider the equivalent problem
inf{f(x) : x ∈ Lf}
Suitable "compactness" and "continuity" w.r.t. Lf , i.e.,
♣ Study behavior of subsets of Rn at infinity
Asymptotic Cones and Functions: A short appetizer
Let ∅ ≠ C ⊂ Rn be closed and convex,
f : Rn → R ∪ {+∞} proper, lsc, convex [see A1]
Definition–[A-Cone] The asymptotic cone of C,
C∞ := {d ∈ Rn : d + C ⊂ C}
Proposition A set C ⊂ Rn is bounded iff C∞ = {0}
Definition–[A-Function] The asymptotic function f∞ of f is defined by
epi (f∞) = (epi f)∞
Proposition The A-function is also convex, and for any d ∈ Rn,
f∞(d) = lim_{t→∞} [f(x + td) − f(x)]/t, ∀x ∈ dom f
Back to the existence of minimizers
One has to study (Lf )∞. It turns out that (in the convex case)
(Lf )∞ = {d ∈ Rn | f∞(d) ≤ 0}
♣ Topological questions can be handled via Calculus Rules at infinity
Example: (P ) inf{f0(x) : fi(x) ≤ 0, i ∈ [1, m]}; convex
Result: The optimal solution set is nonempty and compact iff
(fi)∞(d) ≤ 0, ∀i ∈ [0, m] =⇒ d = 0
A shameless commercial! For general results and applications, see
Asymptotic Cones and Functions in Optimization and Variational Inequalities
A. Auslender and M. Teboulle, Springer Monographs in Mathematics, 2003.
Optimality for Unconstrained Minimization
(U) inf{f(x) : x ∈ Rn}, f : Rn → R a smooth function.
Fermat Principle: Let x∗ ∈ Rn be a local minimum. Then,
♠ ∇f(x∗) = 0
This is a First Order Necessary condition
If ∇f(x∗) = 0, then x∗ is a stationary point
A local minimum must be a stationary point
Second Order Necessary Cond.: Nonnegative curvature at x∗:
the Hessian matrix ∇2f(x∗) ⪰ 0 (positive semidefinite)
Sufficient condition for x∗ to be a local minimum: ∇f(x∗) = 0 and ∇2f(x∗) ≻ 0 (positive definite)
Whenever f is assumed convex, ♠ becomes a sufficient condition for x∗ to be a global minimum of f.
Equality constraints: Lagrange Theorem
(E) min{f(x) : h(x) = 0, x ∈ Rn}
where f : Rn → R, h : Rn → Rp with (f, h) ∈ C1
Lagrange Theorem (necessary conditions): Let x∗ be a local minimum for problem (E). Assume:
(A) {∇h1(x∗), . . . ,∇hp(x∗)} are linearly independent
Then there exists a unique y∗ ∈ Rp satisfying:
∇f(x∗) + Σ_{k=1}^{p} y∗k ∇hk(x∗) = 0
Inequality constraints lead to more complications....
Optimality Conditions in NLP
(P ) inf{f(x) : x ∈ C}, C ⊂ Rn; f : C → R an arbitrary function
A Basic Optimality Criteria-BOC Let x∗ ∈ C and assume that the directional derivative
of f at x∗ exists.
a. Necessary condition: If x∗ is a local minimum of f over C then
f ′(x∗, x − x∗) ≥ 0, ∀x ∈ C.
If f is C1, then the above reduces to 〈x − x∗, ∇f(x∗)〉 ≥ 0, ∀x ∈ C.
b. Sufficient condition: Suppose that f is also convex. Then, condition (a) is also a
sufficient condition for x∗ to be a (global) minimum.
Geometric reformulation: through the Normal Cone to a set C at a point x̄, and defined by
NC(x̄) := {d ∈ Rn |〈d, x − x̄〉 ≤ 0, ∀x ∈ C}
Thus one has 〈x − x∗, ∇f(x∗)〉 ≥ 0, ∀x ∈ C ⇐⇒ 0 ∈ ∇f(x∗) + NC(x∗).
A Variational Inequality and a Generalized Equation
A Useful Formula for Directional Derivatives
Let {gi}, i = 0, . . . , m, be smooth functions over Rn and define g(x) := max_{0≤i≤m} gi(x).
Note: even though gi are smooth, this is not necessarily the case for g.
Proposition Assume for each i that gi is continuous and differentiable at x∗.
Then,
g′(x∗, d) = max{〈d,∇gi(x∗)〉 : i ∈ K(x∗)}, ∀d ∈ Rn
where K(x∗) := {i ∈ [0, m] : gi(x∗) = g(x∗) = max0≤i≤m gi(x∗)}
Now observe that with F(x) := max{f(x) − f(x∗), g1(x), . . . , gm(x)},
x∗ ∈ argmin{f(x) : g(x) ≤ 0} solves inf_x F(x)
Applying the BOC with an Alternative Theorem [namely, the use of separation!] + the above formula is all we need to derive the fundamental theorems characterizing optimality for constrained problems
First Order Optimality Conditions-Fritz-John Theorem
(P ) inf{f(x) : g(x) ≤ 0, x ∈ Rn}
where f : Rn → R, g : Rn → Rm and (f, g) ∈ C1
Let x∗ be a local minimum for problem (P).
Primal form: Then ∄ d ∈ Rn s.t.:
〈d, ∇f(x∗)〉 < 0 and 〈d, ∇gi(x∗)〉 < 0, ∀i ∈ I(x∗) := {i : gi(x∗) = 0}
Dual Form: Then there exist λ0, λi ∈ R+ (i ∈ I(x∗)), not all zero, satisfying:
λ0 ∇f(x∗) + Σ_{i∈I(x∗)} λi ∇gi(x∗) = 0
The weakness of the FJ conditions: λ0 ∈ R+ can be equal to zero. To avoid this, we need a further hypothesis on the problem's data, called a constraint qualification
Constraint Qualifications
(P ) inf{f(x) : g(x) ≤ 0, x ∈ Rn}
with f : Rn → R, g : Rn → Rm, smooth.
I(x) := {i : gi(x) = 0} is the set of active constraints.
CQ are crucial regularity conditions on the problem's data to derive optimality and duality results.
Linear independence ( LI): {∇gi(x∗)}i∈I(x∗) are linearly indep.
Mangasarian-Fromovitch (MF):
∃ d ∈ Rn : 〈d,∇gi(x∗)〉 < 0, ∀i ∈ I(x∗)
Slater (S): ∃ x̂ : gi(x̂) < 0, ∀i = 1, . . . , m
One has the following relations between these (CQ):
(LI) =⇒ (MF ) =⇒ (S)
The KKT Theorem – A System of Eqs and Inequalities
(P ) inf{f(x) : g(x) ≤ 0, x ∈ Rn}
Let x∗ be a local minimum for problem (P), and assume that (MF-CQ) holds. Then ∃ y∗ ∈ Rm+ s.t.
∇f(x∗) + Σ_{i=1}^{m} y∗i ∇gi(x∗) = 0, [Saddle pt. in x∗]
gi(x∗) ≤ 0, ∀i ∈ [1, m], [Feasibility ≡ Sad. pt. in y∗]
y∗i gi(x∗) = 0, i = 1, . . . , m [Complementarity]
With convex data + (CQ), the KKT conditions become necessary and sufficient for global optimality
For general NLP (mixed eqs./ineqs.), more optimality conditions (first and second order), see [A7]
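The KKT system can be checked numerically on a concrete instance. A minimal sketch (this toy problem and its known solution are assumptions for illustration, not from the slides): minimize x1² + x2² subject to g(x) = 1 − x1 − x2 ≤ 0, whose solution is x∗ = (1/2, 1/2) with multiplier y∗ = 1.

```python
import numpy as np

# Toy instance: min x1^2 + x2^2  s.t.  g(x) = 1 - x1 - x2 <= 0.
# Claimed solution: x* = (1/2, 1/2), y* = 1; verify the three KKT conditions.
x_star = np.array([0.5, 0.5])
y_star = 1.0

grad_f = 2.0 * x_star            # gradient of the objective at x*
grad_g = np.array([-1.0, -1.0])  # gradient of the constraint
g_val = 1.0 - x_star.sum()       # constraint value g(x*)

stationarity = grad_f + y_star * grad_g          # should be the zero vector
feasible = g_val <= 1e-12                        # g(x*) <= 0
complementarity = abs(y_star * g_val) <= 1e-12   # y* g(x*) = 0

print(np.allclose(stationarity, 0.0), feasible, complementarity)
```

Since the data here are convex and Slater's condition holds (e.g., x̂ = (1, 1)), satisfying the KKT system certifies global optimality of x∗.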
Duality: The Lagrangian
(P ) f∗ := inf{f(x) : g(x) ≤ 0, x ∈ Rn}, f : Rn → R, g : Rn → Rm
We assume that there exists a feasible solution for (P) and f∗ ∈ R.
Observation: (P ) ⇐⇒ inf_{x∈Rn} sup_{y≥0} {f(x) + 〈y, g(x)〉}
Lagrangian associated with (P), L : Rn × Rm+ → R:
L(x, y) = f(x) + 〈y, g(x)〉 ≡ f(x) + Σ_{i=1}^{m} yi gi(x).
Definition A vector y∗ ∈ Rm is called a Lagrangian multiplier for (P) if
y∗ ≥ 0 and f∗ = inf{L(x, y∗) : x ∈ Rn}.
Lagrangian Duality • inf_{x∈Rn} sup_{y∈Rm+} L(x, y)
Hidden in this equivalent min-max formulation of (P) is another problem, called the Dual.
Suppose we reverse the inf and sup operations:
sup_{y∈Rm+} inf_{x∈Rn} L(x, y)
Define the Dual Function:
h(y) := inf_{x∈Rn} L(x, y), dom h = {y ∈ Rm : h(y) > −∞},
and the Dual Problem:
(D) h∗ := sup{h(y) : y ∈ Rm+ ∩ dom h}
Note: To avoid h(·) = −∞, additional constraints often emerge through y ∈ dom h.
Dual problem Properties
The Dual Problem: Uses the same data
(D) h∗ = sup_y {h(y) : y ∈ Rm+ ∩ dom h}, h(y) = inf_x L(x, y)
Properties of (P)-(D)
Dual objective h is always concave
Dual problem (D) is always convex (a max of concave functions)
Weak duality holds: f∗ ≥ h∗ for any feasible pair of (P)-(D)
Valid for any optimization problem. No convexity or any other assumptions on the primal data!
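Weak duality is easy to see on a one-dimensional instance (an assumed toy example, not from the slides): for f∗ = inf{x² : 1 − x ≤ 0} = 1, the Lagrangian L(x, y) = x² + y(1 − x) gives, minimizing over x at x = y/2, the concave dual function h(y) = y − y²/4; every dual value sits below f∗, and here the gap closes at y∗ = 2.

```python
import numpy as np

# Primal: f* = inf{x^2 : 1 - x <= 0} = 1 (attained at x = 1).
# Dual function (closed form for this instance): h(y) = y - y^2/4.
def h(y):
    return y - y**2 / 4.0

f_star = 1.0
ys = np.linspace(0.0, 10.0, 1001)              # sample of the dual feasible ray y >= 0
weak_duality = bool(np.all(h(ys) <= f_star + 1e-12))
h_star = float(h(ys).max())                    # best sampled dual bound, near h(2) = 1

print(weak_duality, round(h_star, 6))
```

This also illustrates the zero-duality-gap situation for convex data with a constraint qualification: h∗ = f∗ = 1.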
Duality: Key Questions for the pair (P)-(D)
f∗ = inf{f(x) : g(x) ≤ 0, x ∈ Rn}; h∗ = sup_y {h(y) : y ∈ Rm+}
• Zero Duality Gap: when is f∗ = h∗?
• Strong Duality: when are the inf/sup attained?
• Structure/Relations of Primal-Dual Optimal Sets/Solutions
Convex data + a Constraint Qualification on the constraints deliver the answers.
Proof.
Based on the simple geometric separation argument.
inf/sup attainment + structure of optimal sets via asymptotic function calculus
Convex problems are the ”Nice NLP”...and much more...
Are there Many Convex Problems?
..More than we used to think... [sometimes after a transformation, e.g., Geometric Programs]
Remember, the dual of any optimization problem is always convex... it can be used at least to approximate the original primal...
Useful Convex Models: Conic Problems
min{〈c, x〉 : A(x) = b, x ∈ K}
K is a closed convex cone in some finite dimensional space X
〈·, ·〉 an appropriate inner product on X
A is a linear map
Example: Linear Programming
X ≡ Rn, K ≡ Rn+, A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and 〈·, ·〉 the scalar product in Rn
....Other Examples...?
Semidefinite Programming
Primal: min_{x∈Rm} {cT x : A(x) ⪰ 0}
Dual: max_{Z∈Sn} {− tr(A0 Z) : tr(Ai Z) = ci, i ∈ [1, m], Z ⪰ 0}
Here, tr is the trace operator and
A(x) := A0 + Σ_{i=1}^{m} xi Ai, each Ai ∈ Sn ≡ the symmetric matrices
Primal: x ∈ Rm decision variables; A(x) ⪰ 0 is a linear matrix inequality.
Dual in Conic Form: Z ∈ Sn decision variables; K ≡ Sn+ is the closed convex cone of p.s.d. matrices
SDP Features and Applications
♦ Features
SDPs are a special class of convex (nondifferentiable) problems
Computationally tractable: can be approximately solved to a desired accuracy in polynomial time
A very active research area since mid 90’s
♦ Applications–A Short list...!
Combinatorial optimization, Computational Geometry
Control theory, Statistics, Classification problems
Other useful conic model : Second order cone programming...
Part B. Optimization Algorithms
Tractability is a key Issue
What optimization problems can we solve?
How do we solve them?
At what cost? [Our computers have limited memory... and we do not want to wait too long [..forever..] for a solution!]
We need to draw a line between Easy and Hard Problems
Convexity plays a key role in this distinction
Easy/Hard: Example
(P1) max{Σ_{j=1}^{n} xj : xj² − xj = 0, j = 1, . . . , n; xi xj = 0 ∀(i, j) ∈ E}

(P2) inf x0 subject to

Σ_{j=1}^{m} xj = 1,   Σ_{j=1}^{m} aj xj^l = bl, l = 1, . . . , k

λmin ⎛ x1               x1^l ⎞
     ⎜      ⋱           ⋮  ⎟
     ⎜           xm     xm^l ⎟  ≥ 0,  l = 1, . . . , k
     ⎝ x1^l  ⋯  xm^l   x0  ⎠

x ∈ Rm+1, x^l ∈ Rm, l = 1, . . . , k

(P1) "looks" much easier than (P2)...
Easy/Hard: Example Ctd.
(P1) max{Σ_{j=1}^{n} xj : xj² − xj = 0, j = 1, . . . , n; xi xj = 0 ∀i ≠ j ∈ Γ}

(P2) min{x0 : λmin(A(x, x^l)) ≥ 0, Σ_{j=1}^{m} aj xj^l = bl, l ∈ [1, k], Σ_{j=1}^{m} xj = 1}

where A(x, x^l) is affine in x0, x1, . . . , xm, x1^l, . . . , xm^l.
♠ (P1) has an easy formulation but is as difficult as an optimization problem can be! The worst-case computational effort for n = 256 is 2^256 ≈ 10^77 ≈ +∞!
♠ (P2) has a complicated formulation but is easy to solve! For m = 100, k = 6 =⇒ 701 variables (≈ 3 times larger), solved in less than 2 minutes to 6 digits of accuracy!
convex (P2) [slow growth in (n, ε)] vs. nonconvex (P1) [very fast growth in (n, ε)]
Toward Computation: Approximation Models
Approximation: replace a complicated function by a "simpler" one, close enough to the original. This is the "bread" (and butter..) of numerical analysis
Linear approximation: Suppose f is differentiable at x. Then, for any y ∈ Rn (notation f′(x) ≡ ∇f(x)):
f(y) = f(x) + 〈f′(x), y − x〉 + o(‖y − x‖); lim_{t↘0} t⁻¹ o(t) = 0
Quadratic approximation: Suppose f is twice differentiable at x (with Hessian f″(x) ≡ ∇2f(x)). Then,
f(y) = f(x) + 〈f′(x), y − x〉 + (1/2)〈f″(x)(y − x), y − x〉 + o(‖y − x‖²).
These models are local. Thus, the resulting schemes based on them will share the same local character.
A Generic Unconstrained Minimization Algorithm
(U) min{f(x) : x ∈ Rn}, f ∈ C1(Rn)
Start: with x ∈ Rn such that ∇f(x) ≠ 0.
Compute new point: x+ = x + td, with d ∈ Rn and t > 0 chosen such that we can guarantee
f(x+) = f(x + td) < f(x)
Since f(x + td) = f(x) + t〈d, ∇f(x)〉 + o(t),
a simple choice could be as follows:
d ∈ Rn with 〈d, ∇f(x)〉 < 0 is called a descent direction
t ∈ (0, +∞) is a stepsize. How far to go in direction d is decided by a line search.
This leads to the simplest scheme: The Gradient Method
The Gradient Method
x0 ∈ Rn, xk+1 = xk + tk dk
where tk > 0 is the step size. Indeed, with dk = −f′(xk) ≠ 0, we have
〈dk, f′(xk)〉 = −‖f′(xk)‖² < 0
Thus, it is reasonable to choose a positive step size. There exist many variants for the choice of tk:
fixed step size: tk := t > 0, ∀k.
full/exact line search: find tk := argmin_{t≥0} f(xk + t dk). Used only when this can be solved analytically or efficiently.
Inexact line search: step size chosen to approximately minimize f along the ray {x + td | t ≥ 0}. This is the most commonly used in practical algorithms, e.g., the Armijo line search, see [B3].
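A minimal sketch of the gradient method with an Armijo-type backtracking line search, on an assumed quadratic test problem f(x) = ½xᵀAx − bᵀx (the matrix, tolerances, and parameter names here are illustrative choices, not from the lecture):

```python
import numpy as np

# Assumed test problem: f(x) = 1/2 x^T A x - b^T x, with A positive definite,
# so the unique minimizer is x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def armijo_gradient(x, tol=1e-8, beta=0.5, sigma=1e-4, max_iter=500):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        d = -g                      # descent direction: <d, grad f(x)> < 0
        t = 1.0
        # Backtrack until the Armijo sufficient-decrease condition holds
        while f(x + t * d) > f(x) + sigma * t * (g @ d):
            t *= beta
        x = x + t * d
    return x

x_star = armijo_gradient(np.zeros(2))
print(np.round(x_star, 6))  # ≈ A^{-1} b = (0.2, 0.4)
```

The backtracking loop only needs function values along the ray, which is what makes the inexact line search practical when the exact minimization over t has no closed form.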
A Convergence Result for Gradient Method
Theorem CGM: Assume f ∈ C1,1_L(Rn) and bounded below. Then, with tk = t, 0 < t < 2L⁻¹, one has
lim_{k→∞} ‖f′(xk)‖ = 0.
Moreover, we have the rate of convergence:
gk := min_{0≤l≤k} ‖f′(x^l)‖ ≤ (k + 1)^{−1/2} [2L(f(x0) − f∗)]^{1/2}
Thus, we can obtain an (upper) complexity estimate to achieve gk ≤ ε:
k + 1 ≥ (2L/ε²)(f(x0) − f∗) =⇒ gk ≤ ε
Note: estimate does not depend on the problem’s dimension n!
Newton’s Method
(U) minimize {f(x) : x ∈ Rn}
Assumptions: Let x∗ be a local minimum of f.
Let f ∈ C2(Rn) with ∇2f(x∗) ⪰ lI, l > 0;
‖∇2f(x) − ∇2f(y)‖ ≤ M‖x − y‖, ∀x, y;
x0 is close enough to x∗: ‖x0 − x∗‖ ≤ r̄ ≡ 2l(3M)⁻¹.
Then, the sequence {xk} produced by
xk+1 = xk − (∇2f(xk))⁻¹ ∇f(xk)
converges locally quadratically to x∗ [see B1-2]
Note the dependence on knowledge of specific (and generally unknown/hard to compute) constants....
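A minimal sketch of the pure Newton iteration on an assumed smooth, strictly convex test function f(x1, x2) = e^{x1} + e^{x2} + x1² + x2² (an illustrative choice, not from the lecture), started close enough to the minimizer:

```python
import numpy as np

# For f(x1, x2) = exp(x1) + exp(x2) + x1^2 + x2^2, the gradient and Hessian
# separate coordinatewise; the minimizer solves exp(x) + 2x = 0 in each
# coordinate (x ≈ -0.3517).
def grad(x):
    return np.exp(x) + 2.0 * x

def hess(x):
    return np.diag(np.exp(x) + 2.0)

x = np.array([1.0, -1.0])  # assumed to be in the region of attraction
for _ in range(20):
    # Pure Newton step: solve the linear system instead of forming the inverse
    x = x - np.linalg.solve(hess(x), grad(x))

print(np.allclose(grad(x), 0.0))  # stationarity to high precision
```

The quadratic local convergence shows up in practice as the number of correct digits roughly doubling per iteration once the iterate is near x∗; 20 iterations is far more than needed here.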
Basic Unconstrained Schemes-Summary
x0 ∈ Rn, xk+1 = xk + tk Wk dk
where
Wk ⪰ 0, tk ≈ argmin_t f(xk + t Wk dk)
Wk ≡ I, dk ≡ −∇f(xk): Gradient Method
Wk ≡ (∇2f(xk))⁻¹, tk ≡ 1: Newton's Method; fast local convergence, but it can diverge and even break down (∇2f(xk) degenerate), and is quite expensive...
Global rate of convergence needs information on topological properties of ∇f, ∇2f... Soon we will see how to avoid that...
Other Methods: Quasi-Newton [e.g., BFGS] (replaces f″(xk) by some PD matrix), Conjugate Gradient, Trust Region.... see [B4]
Constrained Optimization Algorithms
Richer but much more Difficult....
In most algorithms we will face one (or both) of the following:
To solve a sequence of unconstrained/constrained minimization problems
To solve a nonlinear system of equations and inequalities
Thus the importance of having efficient linear algebra methods and software, and a fast and reliable unconstrained routine.
Numerical Optimization relies on Numerical Linear Algebra
Some Classes of Constrained Optimization Algorithms
Sequential Unconstrained Minimization: Penalty and BarrierMethods
Sequential Linear/Quadratic/Convex Programming
Lagrangian Multiplier Methods
Interior point/Primal-Dual Methods
Dual Methods: Decomposition/Subgradient/Cutting Plane
Active set methods
....and more...
Sequential Unconstrained Minimization
(C) min{f(x) : x ∈ S ⊂ Rn}
Idea: approximate (C) by a sequence of solutions of unconstrained minimization problems
• Penalty [Courant 1943]: A continuous P(·) is a penalty function for S if P(x) ≥ 0 for all x, with P(x) = 0 if and only if x ∈ S.
Replace (C) by
(Ct) min_{x∈Rn} {Ft(x) ≡ f(x) + tP(x)}; x(t) = argmin{Ft(x)} (t > 0)
For large t, the minimum of (Ct) will be in a region where P is small. We thus expect that, as t → ∞:
tP(x(t)) → 0; x(t) → x∗
The Penalty Method
Examples of Penalty Functions
For Inequality Constraints S = {x : gi(x) ≤ 0, i = 1, . . . , m}:
P(x) = Σ_{i=1}^{m} max(0, gi(x));   P(x) = Σ_{i=1}^{m} max(0, gi(x))² ← smooth
For Equality Constraints S = {x : hi(x) = 0, i = 1, . . . , m}:
P(x) = ‖h(x)‖², h : Rn → Rm
The Penalty Algorithm: Let 0 < tk < tk+1, ∀k, with tk → ∞. For each k solve
xk = argmin_x {Ftk(x) ≡ f(x) + tk P(x)}.
Convergence: If xk is an exact global minimizer of Ftk, then every limit point of {xk} is a solution of (C).
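The t → ∞ behavior can be traced on an assumed toy instance with a closed-form inner minimizer (the instance is an illustration, not from the slides): min{x² : x − 1 = 0} with P(x) = (x − 1)², so F_t(x) = x² + t(x − 1)² is minimized at x(t) = t/(1 + t).

```python
# Penalty method on min{x^2 : x - 1 = 0}, P(x) = (x - 1)^2.
# Setting F_t'(x) = 2x + 2t(x - 1) = 0 gives x(t) = t/(1 + t) in closed form.
def x_of_t(t):
    return t / (1.0 + t)

ts = [10.0 ** k for k in range(7)]   # t_k increasing toward infinity
xs = [x_of_t(t) for t in ts]

# As predicted: x(t) -> x* = 1 and the residual t*P(x(t)) = t/(1+t)^2 -> 0.
print(round(xs[-1], 6), ts[-1] * (xs[-1] - 1.0) ** 2 < 1e-5)
```

Note that every x(t) is infeasible (x(t) < 1): the penalty method approaches the solution from the exterior of S, which is exactly the behavior the barrier method avoids.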
The Barrier Method: Frisch 58, Fiacco-McCormick 68
Similar idea, but acting from the interior to prevent leaving the feasible region.
• Barrier [Interior]: A barrier function for S with int S ≠ ∅ is a continuous function s.t.
B(x) → ∞ as x → boundaryS
Examples: B(x) = −Σ_{i=1}^{m} [gi(x)]⁻¹,   B(x) = −Σ_{i=1}^{m} log(−gi(x))
Barrier Algorithm: Let 0 < tk < tk+1, ∀k, with tk → ∞. For each k solve
xk = argmin_x {f(x) + (1/tk) B(x)}.
Convergence: Every limit point of {xk} is a solution of (C).
In both Penalty/Barrier Methods there is a compromise: t must be chosen sufficiently large so that x(t) approaches S from the exterior (interior).... BUT.. we do not know how to pick t.. (if chosen too large, Ill-Conditioning may occur)
To avoid IC, do not send t → ∞; one approach: .....use augmented Lagrangian/Multiplier methods.....
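The interior behavior is visible on an assumed toy instance with a closed-form inner minimizer (illustrative, not from the slides): min{x : 1 − x ≤ 0} with the log barrier B(x) = −log(x − 1), for which minimizing x + (1/t)B(x) over the interior x > 1 gives x(t) = 1 + 1/t.

```python
# Log-barrier method on min{x : 1 - x <= 0}, B(x) = -log(x - 1).
# Setting d/dx [x - (1/t) log(x - 1)] = 1 - 1/(t(x - 1)) = 0 gives
# x(t) = 1 + 1/t in closed form.
def x_of_t(t):
    return 1.0 + 1.0 / t

ts = [10.0 ** k for k in range(7)]   # t_k increasing toward infinity
xs = [x_of_t(t) for t in ts]

interior = all(x > 1.0 for x in xs)  # iterates stay strictly feasible
print(interior, round(xs[-1], 6))    # x(t) -> x* = 1 from the interior
```

In contrast to the penalty iterates, every x(t) here is strictly feasible; the price is the ill-conditioning of the subproblems as t grows, which motivates the multiplier methods below.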
A Generic Multiplier Method
(P ) min{f(x) : g(x) ≤ 0}, g : Rn → Rm
Lagrangian: L(x, u) = f(x) + uT g(x) (Linear in u)
An Augmented/General Lagrangian:
A(x, u, c) = f(x) + G(g(x), u, c), (u ≥ 0, c > 0)
Multiplier Method: Given (uk, ck), generate (xk+1, uk+1) via:
Find xk+1 = argmin{A(x, uk, ck) : x ∈ Rn}
Dual Update Rule: uk+1 = E(g(xk+1), uk, ck)
Increase ck > 0 (if necessary).
• G should be explicit and preserve the data properties of (P) (e.g., smoothness)
• E should be a simple explicit formula to update u
• How do we get these objects?
Example: Multiplier Method for Ineqs. Constraints
(C) min{f(x) : gi(x) ≤ 0, i = 1, . . . , m}, g := (g1, . . . , gm)T
Quadratic Method of Multipliers [see B5 for eq. constraints]:
xk+1 ∈ argmin{A(x, uk, ck) : x ∈ Rn}
uk+1 = (uk + ck g(xk+1))+, (ck > 0, z+ := max{0, z})
A(x, u, c) := f(x) + (2c)⁻¹{‖(u + cg(x))+‖² − ‖u‖²}
Drawbacks/Advantages:
Separability is lost (if the original problem is separable)
A is not C2, and Newton's method can break down
No need to increase the penalty parameter; more robust; uses dual information
More recent approaches allow for constructing smooth Lagrangians, so that Newton's method can be applied. Later on.....
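The quadratic method of multipliers can be traced on an assumed toy instance (illustrative, not from the slides): min{x² : g(x) = 1 − x ≤ 0}, whose solution is x∗ = 1 with multiplier u∗ = 2. For this instance the inner minimization of A(x, u, c) has a closed form, since the "plus" branch stays active along the iteration.

```python
# Quadratic method of multipliers on min{x^2 : 1 - x <= 0} (x* = 1, u* = 2).
# With A(x, u, c) = x^2 + (2c)^{-1}[((u + c(1 - x))_+)^2 - u^2], setting
# dA/dx = 2x - (u + c(1 - x))_+ = 0 on the active branch gives
# x = (u + c)/(2 + c) in closed form.
c = 1.0    # penalty parameter, kept FIXED: no need to drive it to infinity
u = 0.0    # initial multiplier estimate
for _ in range(80):
    x = (u + c) / (2.0 + c)           # xk+1 = argmin_x A(x, uk, c)
    u = max(0.0, u + c * (1.0 - x))   # uk+1 = (uk + c g(xk+1))_+

print(round(x, 6), round(u, 6))  # → 1.0 2.0, i.e., (x*, u*)
```

Unlike the pure penalty/barrier schemes, convergence here is obtained with a bounded penalty parameter; the dual update does the work, which is the point of the multiplier approach.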
Sequential "Simpler" Constrained Problems
Idea: Given xk ∈ Rn, uk ∈ Rm, solve a sequence of approximate simpler problems:
inf{fk(x) : gk(x) ≤ 0}
where (fk, gk) are (local) approximations of the objective and constraint functions.
Possible Choices Include: Linear [SLP], Convex [SCP], Quadratic[SQP] approximations of f , or g or both.
SQP: quadratic approximation of the objective and linearized constraints, i.e., it solves a sequence of Quadratic Programs.
Most attractive feature of SQP: superlinear convergence in a neighborhood of a solution.
A drawback: needs functions/gradients with high precision
Back to Newton: inf{f(x) : x ∈ dom f}
Self-Concordance Theory–[Nesterov-Nemirovsky-90]
Idea: to make the convergence analysis coordinate invariant
[Newton’s method is coordinate invariant..but conv. analysis is not!]
Achieved for self-concordant convex functions
θ is SC ⇐⇒ ∃M : |θ‴(t)| ≤ M θ″(t)^{3/2}, ∀t ∈ dom θ
Newton Revisited with an SC function: The Damped Newton Method:
DNM: Start with x0 ∈ dom f. Generate {xk} via
xk+1 = xk − (1 + λ(xk))⁻¹ (f″(xk))⁻¹ f′(xk);  λ(x) := 〈(f″(x))⁻¹ f′(x), f′(x)〉^{1/2}
♣ ∀x ∈ dom f with λ(x) > η > 0, one iteration of DNM decreases the value of f by at least the constant h(η) := η − log(1 + η). This is a global result.
Newton for Self-Concordant functions–[See B6-8]
f∗ = inf{f(x) : x ∈ dom f}. Given some γ > 0:
Damped Phase: λ(xk) > γ, apply Damped Newton that ensures
f(xk+1) − f(xk) ≤ −h(γ)
Quadratic Phase: λ(xk) ≤ γ, then apply pure Newton, whichconverges quadratically
Complexity Analysis: Total number of iterations to ε accuracy:
# of Newton steps ≤ (f(x0) − f∗)/h(γ) + log log(1/ε)
Note: Absence of unknown constants and problem dimension
Interior Point Methods for Convex Programs
inf{cT x : x ∈ C}, C ⊂ Rn closed convex
The idea goes back to Barrier Methods, but within a different methodology: basically, one tries to approximately follow the central path generated within the interior of the corresponding feasible set.
Computation of the Central Path:
x∗(µ) = argmin_x {µ〈c, x〉 + S(x)}
where S is a Self-Concordant Barrier for the closed convex feasible set of the given optimization problem.
x∗(µ) remains strictly feasible for every µ > 0
x∗(µ) → x∗ optimal as µ → ∞
Can be computed in polynomial time using Newton's method and a suitable updating rule for µ
Primal-Dual Interior Methods
(C) inf{f(x) : g(x) ≥ 0} ⇐⇒ inf{f(x) : g(x)−v = 0, v ≥ 0} (g := (g1, . . . , gm)T )
The KKT system with perturbed complementarity
∇f(x) −∇g(x)y = 0, V := Diag(v), Y := Diag(y)
V Y e = µe; µ > 0, e := (1, . . . , 1)T
g(x) − v = 0
Apply Newton’s Method to generate new pt:
(x+, v+, y+) = (x, v, y) + t(∆x, ∆v, ∆y)
t chosen to ensure (v+, y+) > 0 and the merit function sufficiently reduced [see B10]
Advantage: feasibility of x not required (infeasible interior point variants)
Suitable for large problem sizes (n + m ≈ 10,000 and more)
Optimization–Summary
Nonconvex problems
most are just not solvable....Lacking theory....
Convex problems
Local minima are global
Computationally Tractable: can be approximately solved to a desired accuracy in polynomial time [not always efficiently, e.g., the Ellipsoid Method]
Model many interesting problems
Enjoy a powerful Duality Theory that can be used to find bounds/approximations for hard problems; in particular for nonconvex quadratic models arising in combinatorial problems.
Mathematical and Computational Challenges
To solve very large scale optimization problems, keeping in mind the trade-off between Efficiency and Practicality/Simplicity
For self-concordant convex problems, we have efficient polynomialalgorithms with high accuracy
Polynomial algorithms are highly sophisticated: they require information on the Hessian of the objective and constraints [often not available], and incur a heavy computational cost at each iteration [e.g., computing Newton's step], which is not affordable for very large scale problems.
..... Thus the need to:
Further study the potential of elementary/simple methods (e.g., first order methods, using function or/and gradient information only).
Produce more efficient algorithms within these methods
Part C–Interior Gradient/Prox Schemes
Lecture based on joint research works with A. Auslender, University of Lyon I, France
Details, proofs and more results can be found in our two recent works
Interior gradient and epsilon-subgradient descent methods for constrained convex minimization, Mathematics of Operations Research, 29, 2004, 1–26.
A unified framework for interior gradient/subgradient and proximal methods in convex optimization. February 2003 (submitted for publication)
More references on related works we developed on Lagrangian methods, decomposition schemes, semidefinite programming, and variational inequalities are listed at the end of these notes.
Gradient Based Methods: Why?
A main drawback: Can be very slow....But...
Main advantages
Use minimal information, e.g., (f, g)
Often lead to very simple iterative schemes
Complexity per iteration mildly dependent on the problem's dimension
Suitable when high accuracy is not crucial [in many large scale applications, the data is anyway known only roughly..]
For very large scale problems, often the only remaining choice
Examples: Three gradient-based algorithms widely used in applications
Clustering: The k-means algorithm
Neuro-computation: The backpropagation (perceptron) algorithm
The EM (Expectation-Maximization) algorithm in statistical estimation
Main Results: Overview
A unifying framework for analyzing interior gradient and proximal-based algorithms for constrained minimization
Global convergence results under minimal assumptions
Smooth Lagrangian Multiplier Methods
Derive and analyze corresponding new (sub) gradient interiorschemes for constrained problems.
Modified methods with better complexity/efficiency
Applications of the results to specific instances, in particular to conic optimization, i.e., semidefinite and second-order cone programs, producing elementary algorithms.
Part I. Interior Proximal methods
Ideas and A Unifying Framework
Convergence analysis mechanism
Applications/Examples
Two Classical Algorithms
(P ) f∗ = inf{f(x) : x ∈ C̄},
f : Rn → R ∪ {+∞} is a lsc convex proper function.
C ⊂ Rn is nonempty, convex, and open; C̄ denotes the closure of C.
Let d(x, y) := 2⁻¹‖x − y‖², λk > 0
• Prox: xk ∈ argmin{λk f(x) + d(x, xk−1) : x ∈ C̄} ⇐⇒
0 ∈ λk ∂f(xk) + xk − xk−1 + NC̄(xk)
• (Sub)Grad: xk ∈ argmin{λk〈gk−1, x〉 + d(x, xk−1) : x ∈ C̄} ⇐⇒
0 ∈ λk gk−1 + xk − xk−1 + NC̄(xk) ⇐⇒ xk = ΠC̄(xk−1 − λk gk−1)
where we use ΠC̄ ≡ (I + NC̄)⁻¹ ≡ the projection map
** Difference: Implicit –versus– Explicit Schemes **
Note: {xk} produced by either one of the above algorithms does not necessarily belong to C
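The explicit scheme xk = ΠC̄(xk−1 − λk gk−1) can be sketched on an assumed instance where the projection is cheap (the box constraint and the quadratic objective below are illustrative choices, not from the lecture):

```python
import numpy as np

# Projected-gradient scheme x_k = Pi_C(x_{k-1} - lam * grad f(x_{k-1})) on
# minimize f(x) = ||x - z||^2 over the box C = [0, 1]^n, where Pi_C is the
# coordinatewise clipping projection.
z = np.array([2.0, -1.0, 0.5])
proj = lambda x: np.clip(x, 0.0, 1.0)  # Pi_C for the box
grad = lambda x: 2.0 * (x - z)

x = np.zeros(3)
lam = 0.25                              # fixed step, lam < 2/L with L = 2 here
for _ in range(100):
    x = proj(x - lam * grad(x))

print(x)  # → [1.  0.  0.5], the projection of z onto C
```

The gradient step may leave C̄, but the projection pulls each iterate back, so the whole sequence stays feasible; the proximal (implicit) variant replaces the explicit gradient step by solving a small minimization at each iteration.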
A proximal term exploiting the geometry of C
We use a proximal term d(x, y) that plays the role of a distance-like function satisfying certain desirable properties; in particular it:
Will force the iterates of the produced sequence to stay in C, and thus will automatically eliminate the constraints (hence Interior).
Allows us to derive explicit and simple iterative schemes for various interesting optimization models.
Leads to convergent and improved methods
Minimal required properties for d:d(·, v) is a convex function, ∀vd(·, ·) ≥ 0, and d(u, v) = 0 iff u = v ∀u, v.
• d is not a distance: no symmetry or/and triangle inequality
Marc Teboulle–Tel-Aviv University – p. 59
The Basic Ingredients
(P)  f∗ = inf{f(x) : x ∈ C̄}
Interior Proximal Algorithm–IPA:
x0 ∈ C;  xk ∈ argmin{λkf(x) + d(x, xk−1) : x ∈ C̄},  k = 1, 2, . . .  (λk > 0),
where d is some proximal distance.
The basic ingredients needed to achieve our goals are:
To pick an appropriate proximal distance d which allows us to eliminate the constraints.
Given d, to find an induced proximal distance H, which will control the behavior of the resulting method, in order to analyze convergence and complexity.
We begin by defining an appropriate proximal distance d for problem (P).
A Family of Proximal Distances F
Definition A function d : Rn × Rn → R+ ∪ {+∞} is called a proximal distance with respect to an open convex set C ⊂ Rn if for each y ∈ C it satisfies the following properties:
(P1) d(·, y) is proper, lsc, convex, and C1 on C.
(P2) dom d(·, y) ⊂ C̄, and dom ∂1d(·, y) = C.
(P3) d(·, y) is level bounded on Rn, i.e., lim‖u‖→∞ d(u, y) = +∞.
(P4) d(y, y) = 0.
We denote by F the family of functions d satisfying the Definition.
(P1) is needed to preserve convexity of d
(P2) will force the iterate xk to stay in C
(P3) will guarantee the existence of such an iterate
Note: By definition d(·, ·) ≥ 0, so that from (P4) we get ∇1d(y, y) = 0, ∀y ∈ C.
The Main Tool
For each given d ∈ F, we generate an induced proximal distance satisfying some desirable properties.
Definition Given C ⊂ Rn, open and convex, and d ∈ F, a function H : Rn × Rn → R+ ∪ {+∞} is called the induced proximal distance to d if
(1) H is finite valued on C × C with H(a, a) = 0, ∀a ∈ C
(2) 〈c − b, ∇1d(b, a)〉 ≤ H(c, a) − H(c, b), ∀a, b, c ∈ C  ♠
We write (d, H) ∈ F(C) to quantify such a triple [C, d, H].
Likewise, we will write (d, H) ∈ F(C̄) for the triple [C, d, H] s.t.
there exists H which is finite valued on C̄ × C,
satisfies (1)-(2) for any c ∈ C̄,
and such that ∀c ∈ C̄ one has H(c, ·) level bounded on C.
Clearly, one thus has F(C̄) ⊂ F(C).
Mechanism’s Motivation
Not as mysterious as it might look at first sight...
Example: Quad-prox corresponds to the special case
C = C̄ = Rn,  d(x, y) = 2−1‖x − y‖2,  ∇1d(x, y) = x − y
Then,
♠  〈c − b, ∇1d(b, a)〉 = H(c, a) − H(c, b) − H(b, a)
holds with equality (Pythagoras Theorem!), with induced H ≡ d.
Methods with d ≡ H are called self-proximal.
There are several examples of more general self-proximal methods, for various types of constraint sets C, [see C].
Main Results
Within this framework, we can derive:
Global rate of convergence/efficiency estimates in terms of function values
Convergence of the limit points of the sequence produced by IPA
Global convergence of the sequence {xk} to an optimal solution of (P), under additional assumptions on the induced proximal distance H, akin to the properties of norms; this class is denoted by F+(C̄), see [C1-5].
A Typical Convergence Result
Theorem (Prox-convergence) Let (d, H) ∈ F+(C̄) and let {xk} be the sequence generated by the interior prox:
xk ∈ C,  gk ∈ ∂f(xk)  s.t.  λkgk + ∇1d(xk, xk−1) = 0.
Set σn := ∑_{k=1}^n λk. Then the following hold:
limn→∞ σn = +∞ =⇒ {f(xk)} converges to f∗
Global rate of convergence estimate: f(xn) − f(x) = O(1/σn), ∀x ∈ C̄
If X∗ ≠ ∅, then xk → x∗, an optimal solution of (P).
Self-proximal methods
We take d(x, y) := H(x, y) = Dh(x, y), with Dh a Bregman proximal distance given by:
Dh(x, y) := h(x) − [h(y) + 〈∇h(y), x − y〉]
It can be verified that (P1)–(P4) hold for H.
Depending on the choice of h, one has either (d, H) = (Dh, Dh) ∈ F(C̄) or (d, H) = (Dh, Dh) ∈ F+(C̄).
When C = Rn, with h = 2−1‖ · ‖2, then Dh(x, y) = 2−1‖x − y‖2, and with (d, H) = (Dh, Dh) ∈ F+(Rn), the IPA is exactly the classical prox.
Several interesting special cases of the pair (d, H) leading to self-proximal schemes for various types of constraints include: SDP, SOC, Convex Programs, see [C7-11].
Example: Semidefinite Constraints: the Cone C = Sn+
Sn+ (Sn++) ≡ symmetric p.s.d. (p.d.) matrices. Let
h1 : Sn+ → R,  h1(x) = tr(x log x);
h3 : Sn++ → R,  h3(x) = − tr(log x) = − log det(x)
For any y ∈ Sn++, let
d1(x, y) = tr(x log x − x log y + y − x),  with dom d1(·, y) = Sn+,
d3(x, y) = − log det(xy−1) + tr(xy−1) − n,  with dom d3(·, y) = Sn++.
With H ≡ di, one has (d1, H) ∈ F(Sn+) and (d3, H) ∈ F(Sn++).
Similarly we can handle
C = {x ∈ Rm : B(x) − B0 ∈ Sn+};  B(x) = ∑_{i=1}^m xiBi,  Bi ∈ Sn, ∀i ∈ [0, m]
IPA that are Not Self-Proximal: ϕ-Divergences
Given a scalar convex function ϕ satisfying some conditions (call this class Φr, see [C12-13]), we define a ϕ-divergence proximal distance by
dϕ(x, y) = ∑_{i=1}^n yir ϕ(xi/yi),  r = 1, 2
For any ϕ ∈ Φr, one verifies that
dϕ(·, ·) ≥ 0, and "= 0" when the arguments coincide.
Can be easily extended to handle polyhedral constraints.
Examples of functions in Φ1, Φ2:
ϕ1(t) = t log t − t + 1,  dom ϕ = [0, +∞),
ϕ2(t) = − log t + t − 1,  dom ϕ = (0, +∞),
ϕ3(t) = 2(√t − 1)2,  dom ϕ = [0, +∞).
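For instance, taking ϕ1 with r = 1, the resulting dϕ is the Kullback–Leibler divergence ∑i xi log(xi/yi) − xi + yi. A small numerical sketch (the helper names are ours):

```python
import numpy as np

def phi1(t):
    # phi1(t) = t log t - t + 1, extended by continuity with phi1(0) = 1
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, safe * np.log(safe) - safe + 1.0, 1.0)

def d_phi(x, y, phi, r=1):
    # d_phi(x, y) = sum_i y_i^r * phi(x_i / y_i), for y in the open orthant
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(y ** r * phi(x / y)))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.25, 0.25, 0.5])
val = d_phi(x, y, phi1)   # equals sum x*log(x/y) - x + y here
```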
Example: The Class Φ2, with dϕ(x, y) = ∑_{j=1}^n yj2 ϕ(xj/yj)
Let ϕ(t) = µp(t) + (ν/2)(t − 1)2, with p ∈ Φ2.
One has dϕ ∈ F, and it can be proven: ∀a, b ∈ Rn++, ∀c ∈ Rn+,
♣  〈c − b, ∇1dϕ(b, a)〉 ≤ η(‖c − a‖2 − ‖c − b‖2)
where η = 2−1(µ + ν). With H(x, y) := η‖x − y‖2 one obtains (dϕ, H) ∈ F+(C̄), and all the convergence results hold.
Note: ♣ behaves like the usual prox in Rn... but here it is valid on the non-negative orthant!
A Powerful Application of Interior Prox: General Augmented Lagrangian Methods
An Example: Take dϕ(u, v) = ∑_{j=1}^m vj2 ϕ(uj/vj), with ϕ ∈ Φ2.
Applying IPA to the dual (D) of (...remember, (D) is defined on Rm+...)
(P)  min{f0(x) : fi(x) ≤ 0, i ∈ [1, m]}
yields the Multiplier Method:
LMM: Let u0 ∈ Rm++ and λk ≥ λ > 0, ∀k ≥ 1; generate {xk, uk} via
xk ∈ argmin{H(x, uk−1, λk) : x ∈ Rn}
uki = uk−1i (ϕ∗)′(λkfi(xk)/uk−1i),  i = 1, . . . , m
Here H(x, u, λ) := f0(x) + λ−1 ∑_{i=1}^m ui2 ϕ∗(λfi(x)/ui),  (λ > 0, u > 0)
Using other d yields many other old and new LMMs: this provides a unified framework for their analysis, and can be extended to SDP and VI problems [see refs].
Important Example: The Log-Quad Proximal Kernel
Let ν > µ > 0 be given fixed parameters, and
Φ2 ∋ ϕ(t) = (ν/2)(t − 1)2 + µ(t − log t − 1),  t > 0
The conjugate ϕ∗ is explicitly given [see A-LQM] and satisfies some remarkable properties:
dom ϕ∗ = R, and ϕ∗ ∈ C∞(R)
(ϕ∗)′ = (ϕ′)−1 is Lipschitz on R, with constant ν−1
(ϕ∗)′′(s) ≤ ν−1, ∀s ∈ R.
For given data in (P) with fi ∈ C∞(Rn), the resulting H ∈ C∞(Rn).
Computation with the Log-Quad Multiplier Method: An Example
Robust [no parameter tuning]
+ computational effort does not increase with dimension
n × m = 1000 × 50,000
Average results over 100 executions on a quadratically constrained random model
A Typical Convergence Result
Applying Theorem (Prox-convergence), under very mild and standard assumptions on the primal (P), e.g.,
• optimal solution set of (P) bounded + Slater, one obtains:
Theorem (Convergence for LMM) Let {xk, uk} be the sequence generated by the previous LMM with ϕ ∈ Φ2. Then
All limit points of {xk, uk} are optimal solutions of (P) × (D)
The dual sequence {uk} → u∗, an optimal solution of the dual (D).
In particular, [and without any further assumptions, such as strict complementarity], the following improved global convergence rate estimate holds for the dual objective h:
h(u∗) − h(un) = o(1/n)
Part II. Interior (Sub)-Gradient Methods for Constrained Minimization
Basic Interior (sub) gradient methods
A General Convergence Result
Algorithms for Conic Optimization: Theory and Examples
A More efficient O(1/k2) Interior Gradient Algorithm
A Basic Interior-Gradient Algorithm for Constrained Minimization over C
The basic step of a (sub)gradient method over Rn is
xk = xk−1 − λkgk−1 ⇐⇒ xk ∈ argmin{λk〈gk−1, x〉 + 2−1‖x − xk−1‖2}
Thus, to solve min{f(x) : x ∈ C̄}, replace 2−1‖ · ‖2 by some d ∈ F.
Basic Interior Gradient–BIG
Take d ∈ F. Let λk > 0 and generate the sequence {xk} via
xk ∈ argmin{λk〈gk−1, x〉 + d(x, xk−1)}
Building on the material previously developed, it is possible to establish various types of convergence results for various instances of the triple [C, d, H]. We focus on conic models.
Conic Optimization Models
(M)  inf{f(x) : x ∈ C ∩ V},  where
V := {x : Ax = b}, with b ∈ Rm and A ∈ Rm×n, n ≥ m
f : Rn → R ∪ {+∞} is convex, lsc.
We assume that ∃x0 ∈ dom f ∩ C : Ax0 = b.
We assume also that f is continuously differentiable with ∇f Lipschitz on C ∩ V with Lipschitz constant L, i.e., there exists L > 0 such that
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ C ∩ V
Notation: f ∈ C1,1(C ∩ V).
Applying Basic Scheme BIG to Solve Problem (M)
For solving Problem (M), we propose the following basic iteration.
Given d(·, y) σ-strongly convex over C ∩ V
Given a step-size rule for choosing λk (various step-size rules are possible)
At each step k, starting from a point x0 ∈ C ∩ V, the sequence xk ∈ C ∩ V is computed via the relation
xk = u(λk∇f(xk−1), xk−1) = argmin{〈λk∇f(xk−1), z〉 + d(z, xk−1) | z ∈ V}
Theorem (Convergence of BIG) With (d, H) ∈ F+(C̄), {xk} converges to an optimal solution of (M), and the following global rate estimate holds:
f(xn) − f∗ = O(1/n)
Application examples–[C14-17]
Consider the functions d with (d, H) ∈ F(C̄) which are regularized distances of the following form:
d(x, y) = p(x, y) + (σ/2)‖x − y‖2,  with p ∈ F
We can derive explicit gradient-like algorithms via the formula
u(v, x) = argminz {〈v, z〉 + d(z, x)}
for
Semidefinite programs
Second-order conic problems
Convex minimization over the unit simplex
Convex minimization over the unit simplex, C = ∆
Take C = Rn+, A = eT, b = 1, i.e., V = {x : ∑_{j=1}^n xj = 1}; then
(M)  inf{f(x) : x ∈ ∆},  with
∆ = {x ∈ Rn : ∑_{j=1}^n xj = 1, x ≥ 0}
This is an interesting special case of standard conic optimization, which arises in applications.
We will concentrate on Mirror Descent Type Algorithms (MDA) [Nemirovsky-Yudin-1983]
[Beck-Teboulle-2003] have shown that the MDA can be simply viewed as a projected subgradient algorithm with strongly convex Bregman proximal distances. As a result, they proposed to use the entropy kernel.
The Entropic Mirror Descent Algorithm EMDA
h(x) := ∑_{j=1}^n xj log xj if x ∈ ∆, +∞ otherwise
The entropy kernel is 1-strongly convex w.r.t. the norm ‖ · ‖1, i.e.,
〈∇h(x) − ∇h(y), x − y〉 ≥ ‖x − y‖12, ∀x, y ∈ ∆
Hence so is the resulting dh ≡ E, defined by:
d(z, x) ≡ E(z, x) = ∑_{j=1}^n zj log(zj/xj) if (z, x) ∈ ∆ × ∆+, +∞ otherwise
This produces the Entropic Mirror Descent Algorithm (EMDA).
Simple Formula for the EMDA with d ≡ E
The problem u(v, x) = argminz∈∆ {〈v, z〉 + E(z, x)} can be easily solved:
uj(v, x) = xj exp(−vj) / ∑_{i=1}^n xi exp(−vi),  j = 1, . . . , n
EMDA: Start with x0 = n−1e. For k = 1, 2, . . ., with v = gk−1,
xk = u(λkv, xk−1),  λk = √(2 log n)/(Lf√k)
Theorem The sequence generated by EMDA satisfies, for all k ≥ 1,
min1≤s≤k f(xs) − minx∈∆ f(x) ≤ √(2 log n) max1≤s≤k ‖gs‖∞ / √k
Here the objective function is assumed Lf-Lipschitz on ∆.
Outperforms the classical projected gradient by a factor of (n/ log n)1/2.
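The closed-form update makes EMDA a few lines of code. A sketch under the stated step-size rule (the function `emda` and the toy linear objective are ours; for a linear objective the iterates concentrate on the smallest cost coefficient):

```python
import numpy as np

def emda(grad, n, L_f, steps):
    """Entropic Mirror Descent on the unit simplex.
    grad(x) returns a (sub)gradient g_{k-1}; L_f bounds its sup-norm."""
    x = np.full(n, 1.0 / n)                  # x0 = e/n
    for k in range(1, steps + 1):
        lam = np.sqrt(2.0 * np.log(n)) / (L_f * np.sqrt(k))
        v = lam * grad(x)
        w = x * np.exp(-(v - v.min()))       # shift for numerical stability
        x = w / w.sum()                      # closed-form argmin over the simplex
    return x

# Toy test: min <c, x> over the simplex; optimum puts all mass on argmin(c).
c = np.array([0.3, 0.1, 0.7, 0.4])
x = emda(lambda x: c, 4, float(np.max(np.abs(c))), 3000)
```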
Improving efficiency of Interior Gradient Method
The classical gradient method for minimizing a C1,1 function with constant L over Rn exhibits an O(1/k) global convergence rate estimate.
[Nesterov-1988] developed what he called an "optimal algorithm" for smooth convex minimization.
He was able to improve the efficiency of the gradient method by constructing a method that keeps the simplicity of the gradient method, but with the faster rate O(1/k2).
♣ Question: Can this be extended to constrained problems using interior gradient methods?
We answer this question positively for a class of interior gradient methods, leading to an equally simple but more efficient interior gradient algorithm for convex conic problems.
Problem’s Setting: Back to The Conic Model
(M)  inf{f(x) : x ∈ C ∩ V},  where
V := {x : Ax = b}, with b ∈ Rm and A ∈ Rm×n, n ≥ m
f : Rn → R ∪ {+∞} is convex, lsc.
∃x0 ∈ dom f ∩ C : Ax0 = b
The optimal solution set X∗ is nonempty
f is continuously differentiable with ∇f Lipschitz on C ∩ V with Lipschitz constant L
Generating the Sequence {qk}
Basic Idea: Build a sequence of functions {qk}k≥0 that approximates f.
We take d ≡ H ∈ F, with H a Bregman proximal distance whose kernel h is σ-strongly convex on C ∩ V.
For every k ≥ 0, we construct the sequence {qk(x)} recursively via:
q0(x) = f(x0) + cH(x, x0),
qk+1(x) = (1 − αk)qk(x) + αklk(x, yk),
lk(x, yk) = f(yk) + 〈x − yk, ∇f(yk)〉.
Here, c > 0 and αk ∈ [0, 1). The point x0 is chosen such that x0 ∈ C ∩ V. The point yk ∈ C is built-in within the algorithm, [see C18].
The Improved Interior Gradient Algorithm-IGA
Step 0. Choose a point x0 ∈ C ∩ V and a constant c > 0.
Set z0 = x0 = y0, c0 = c, λ = σL−1.
Step k. For k ≥ 0, compute:
αk = (√((ckλ)2 + 4ckλ) − ckλ)/2,
yk = (1 − αk)xk + αkzk,
• zk+1 = argminx∈C∩V {〈x, (λ/αk)∇f(yk)〉 + H(x, zk)} = u((λ/αk)∇f(yk), zk),
xk+1 = (1 − αk)xk + αkzk+1, and ck+1 = (1 − αk)ck.
• Computational work is exactly that of the interior gradient method, through zk+1.
The remaining steps involve trivial computations.
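In the unconstrained case C = Rn with the quadratic kernel H(x, z) = 2−1‖x − z‖2, the map u is explicit (u(v, z) = z − v) and IGA reduces to a Nesterov-type accelerated gradient scheme. A sketch on a least-squares instance (the function name and test data are ours):

```python
import numpy as np

def iga(grad, L, x0, steps, c=1.0, sigma=1.0):
    """Sketch of IGA for C = Rn with H(x, z) = 2^{-1}||x - z||^2,
    so u(v, z) = z - v in closed form."""
    lam = sigma / L
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    ck = c
    for _ in range(steps):
        a = ck * lam
        alpha = (np.sqrt(a * a + 4.0 * a) - a) / 2.0
        y = (1.0 - alpha) * x + alpha * z
        z = z - (lam / alpha) * grad(y)   # z_{k+1} = u((lam/alpha_k) grad f(y_k), z_k)
        x = (1.0 - alpha) * x + alpha * z
        ck = (1.0 - alpha) * ck
    return x

# Least-squares test problem: f(x) = 1/2 ||Ax - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
f = lambda x: 0.5 * float(np.sum((A @ x - b) ** 2))
x = iga(lambda x: A.T @ (A @ x - b), L, np.zeros(5), 2000)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```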
The Improved Convergence Rate Estimate for IGA
Theorem Let {xk}, {yk} be the sequences generated by IGA and let x∗ be an optimal solution of (M). Then, for any k ≥ 0, we have
f(xk) − f(x∗) = O(1/k2)
and the sequence {xk} is minimizing, i.e., f(xk) → f(x∗).
Thus to solve (M) to accuracy ε > 0, one needs no more than O(1/√ε) iterations of IGA. This is a reduction by a square-root factor in comparison to BIG.
Note: IGA can be used to solve convex minimization over the unit simplex and the spectrahedron in Sn+ with this improved global convergence rate estimate.
Extension with d ≠ H: Open?
Conclusion
Optimizers are not (yet!) out of job......
Thank you for listening.
Additional Material and References
In the following, you will find some complementary material with more details and results, on which I will talk very quickly (or not at all!) during the Lecture. This is organized in 3 appendices [A, B, C] referring to the corresponding parts of the Lecture, and [R] for the references.
Appendix A Some complements on optimization theory
Appendix B More on Algorithms
Appendix C Further details on interior gradient/prox
References R Pointers to some basic books, and our recent works related to Part C
A1. General Defs/Properties for Arbitrary Functions
In optimization problems it is convenient to work with extended-real-valued functions, i.e., functions which take values in R ∪ {+∞} = (−∞, +∞], instead of just finite-valued functions, i.e., taking values in R = (−∞, +∞). This allows us to rewrite the constrained problem:
inf{h(x) : x ∈ C} ⇐⇒ inf{f(x) : x ∈ Rn}
with f := h + δC, where δC(x) = 0 if x ∈ C and +∞ otherwise. Rules of arithmetic are thus extended to include ∞ + ∞ = ∞, α · ∞ = ∞ ∀α > 0, and 0 · ∞ = 0.
Let f : Rn → R ∪ {+∞}. The effective domain of f is the set
dom f := {x ∈ Rn | f(x) < +∞}.
A function is called proper if f(x) < ∞ for at least one x ∈ Rn and f(x) > −∞ ∀x ∈ Rn; otherwise the function is called improper.
Geometrical objects associated with f: the epigraph and level sets of f,
epi f := {(x, α) ∈ Rn × R | α ≥ f(x)},
lev(f, α) := {x ∈ Rn | f(x) ≤ α}.
A2. Lower-Semicontinuity (lsc)
For f : Rn → R ∪ {+∞}, we write
inf f := inf{f(x) : x ∈ Rn},
argmin f = argmin{f(x) : x ∈ Rn} := {x ∈ Rn : f(x) = inf f}.
Lower limits are characterized via
lim infx→y f(x) = min{α ∈ R̄ := [−∞, ∞] | ∃xn → y with f(xn) → α}.
Note that one always has lim infx→y f(x) ≤ f(y).
Definition The function f : Rn → R ∪ {+∞} is lower semicontinuous (lsc) at x if
f(x) = lim infy→x f(y),
and lower semicontinuous on Rn if this holds for every x ∈ Rn.
Lower semicontinuity of a function on Rn can be characterized through its level sets and epigraph.
Theorem Let f : Rn → R̄. The following statements are equivalent:
(a) f is lsc on Rn;
(b) the epigraph epi f is closed in Rn × R;
(c) the level sets lev(f, α) are closed in Rn.
A3. Few more Defs. in Convex Analysis
For a proper, convex, and lower semicontinuous (lsc) function f : Rn → R ∪ {+∞}:
dom f = {x | f(x) < +∞} ≠ ∅ is its effective domain
f∗(y) = sup{〈x, y〉 − f(x) | x ∈ Rn} is its conjugate
For all ε ≥ 0, its ε-subdifferential is
∂εf(x) = {g ∈ Rn | ∀z ∈ Rn, f(z) + ε ≥ f(x) + 〈g, z − x〉}
It coincides with the usual subdifferential ∂f ≡ ∂0f whenever ε = 0, and we set dom ∂f = {x ∈ Rn | ∂f(x) ≠ ∅}.
For any closed convex set S ⊂ Rn:
δS denotes the indicator function of S, ri S its relative interior
NS(x) = ∂δS(x) = {ν ∈ Rn | 〈ν, z − x〉 ≤ 0, ∀z ∈ S}
is the normal cone to S at x ∈ S.
A4. Differentiability of Convex Functions
Under differentiability assumptions, one can check the convexity of a function via the following useful tests: f is convex iff
(a) when f ∈ C1: f(x) − f(y) ≥ 〈x − y, ∇f(y)〉, ∀x, y;
(b) when f ∈ C2: ∇2f(x) is positive semidefinite. (∇2f(x) positive definite =⇒ f strictly convex; the converse is false.)
Directional derivative Let f : Rn → R ∪ {+∞} be a convex function, and let x be any point where f is finite and d ∈ Rn. Then the limit
f′(x; d) := limτ→0+ [f(x + τd) − f(x)]/τ
exists (finite or equal to −∞) for all d ∈ Rn and is called the directional derivative of f at x.
A5. Coercivity and Asymptotic Function
Definition The function f : Rn → R ∪ {+∞} is called
(a) level bounded if for each λ > inf f, the level set lev(f, λ) is bounded,
(b) coercive if f∞(d) > 0 ∀d ≠ 0, where f∞ denotes the asymptotic function of f.
As an immediate consequence of the definition, we remark that f is level bounded if and only if lim‖x‖→∞ f(x) = +∞, which means that the values of f(x) cannot remain bounded on any unbounded subset of Rn.
In the convex case, all these concepts are in fact equivalent.
Proposition Let f : Rn → R ∪ {+∞} be lsc and proper. If f is coercive, then it is level bounded. Furthermore, if f is also convex, then the following statements are equivalent:
(a) f is coercive.
(b) f is level bounded.
(c) The optimal set {x ∈ Rn | f(x) = inf f} is nonempty and compact.
(d) 0 ∈ int dom f∗.
A6. Alternative Theorems
Two theorems of the alternative are very useful in the study of optimality conditions.
Farkas Exactly one of the following two systems has a solution:
(F1) Ax = b, x ≥ 0,  (A ∈ Rm×n, b ∈ Rm the given data)
(F2) bTy > 0, ATy ≤ 0, y ∈ Rm
Gordan Exactly one of the following two systems has a solution:
(G1) Ax < 0, x ∈ Rn
(G2) ATy = 0, 0 ≠ y ∈ Rm+  (i.e., y ≥ 0, not all components zero)
A7. More Optimality Conditions for General NLP
(NLP)  min{f(x) : g(x) ≤ 0, h(x) = 0, x ∈ Rn}
with smooth (C2(Rn)) f : Rn → R, g : Rn → Rm, h : Rn → Rp.
Define L : Rn × Rm+ × Rp → R,
L(x, λ, µ) = f(x) + ∑_{i=1}^m λigi(x) + ∑_{k=1}^p µkhk(x),
I(x) = {i : gi(x) = 0},
∇2L(x∗, λ∗, µ∗) = Hessian of L at (x∗, λ∗, µ∗) w.r.t. x.
The tangent subspace:
M(x) = {d : dT∇gi(x) = 0, i ∈ I(x); dT∇hk(x) = 0, k ∈ [1, p]}
A7-b First and Second Order Opt. Conditions
Theorem [NC]-Necessary Conditions Let x∗ be a local minimum for (NLP). Assume x∗ is regular, namely for k = 1, . . . , p, i ∈ I(x∗): {∇hk(x∗), ∇gi(x∗)} are linearly independent. Then there exist unique λ∗, µ∗ such that:
∇xL(x∗, λ∗, µ∗) = 0,
λ∗i ≥ 0, i = 1, . . . , m;  λ∗i = 0 ∀i ∉ I(x∗)  (first order conditions)
dT∇2L(x∗, λ∗, µ∗)d ≥ 0, ∀d ∈ M(x∗)  (second order conditions)
Theorem [SC]-Sufficient Conditions Suppose that a feasible point x∗ for (NLP) satisfies:
∇xL(x∗, λ∗, µ∗) = 0,
λ∗i ≥ 0, i = 1, . . . , m;  λ∗i = 0 ∀i ∉ I(x∗)  (first order conditions)
dT∇2L(x∗, λ∗, µ∗)d > 0, ∀0 ≠ d ∈ M(x∗)  (second order conditions)
λ∗i > 0, ∀i ∈ I(x∗)  (strict complementarity)
Then x∗ is a strict local minimum point for (NLP), i.e.,
∃Nε(x∗) s.t. f(x∗) < f(x), ∀x ∈ Nε(x∗) ∩ S, x ≠ x∗
A8. Primal-Dual Optimal Solution
Definition The pair (x∗, y∗) ∈ Rn × Rm+ is called a saddle point for L if
L(x∗, y) ≤ L(x∗, y∗) ≤ L(x, y∗), ∀x ∈ Rn, ∀y ∈ Rm+.
Proposition (Saddle point characterization) (x∗, y∗) ∈ Rn × Rm+ is a saddle point for L iff
(a) x∗ = argminx∈Rn L(x, y∗)  (L-optimality)
(b) x∗ ∈ Rn, g(x∗) ≤ 0  (Primal feasibility)
(c) y∗ ∈ Rm+  (Dual feasibility)
(d) y∗i gi(x∗) = 0, i = 1, . . . , m  (Complementarity).
Proposition (Sufficient condition for optimality) If (x∗, y∗) ∈ Rn × Rm+ is a saddle point for L, then x∗ is a global optimal solution for NLP.
Note: valid with zero assumptions on the problem's data! However, for nonconvex problems it is in general difficult to find a saddle point.
A9. Closing the Loop...
In the case of convex data (f, gi), the KKT theorem becomes necessary and sufficient for optimality for the convex program:
(CP)  min{f(x) : g(x) ≤ 0, x ∈ Rn} = f∗
with f : Rn → R, g : Rn → Rm convex.
KKT-Theorem for Convex Programs Let (CP) be convex with optimal value f∗ < ∞, and assume that a (CQ) holds (for example, Slater). Then x∗ is a global minimum for problem (CP) if and only if there exists λ∗ ∈ Rm+ satisfying the KKT system.
...Equivalent to Duality (Zero Gap)...
Note: Linear equality constraints can also be treated easily, with multipliers λ∗ ∈ Rm (no sign restriction in that case).
B1. Convergence and Rate of Convergence
Convergence of an algorithm by itself is important but not enough. We want to know how fast/efficiently it happens. Possible approaches:
Computational Complexity: theoretically estimates the number of elementary operations needed by a given method to find an exact/approximate optimal solution. Provides worst-case estimates, i.e., an upper bound on the number of required operations for a class of problems.
Informational Complexity: estimates the number of function/gradient evaluations needed to find an optimal solution [as opposed to the number of computational operations].
Local Analysis: local behavior of a method near an optimal solution, but ignores behavior when far from a solution.
Which approach is best and/or should be used? Each has advantages and drawbacks!
B2. Local Asymptotic Rate of Convergence Measures
Let {sk} ⊂ R be a positive real sequence converging to zero: limk→∞ sk = 0 (e.g. sk := ‖x∗ − xk‖; sk := |f(xk) − f(x∗)|)
(Q)-Linear [fairly fast]: ∃ρ ∈ (0, 1) : sk+1/sk ≤ ρ, ∀k sufficiently large
Superlinear [faster]: limk→∞ sk+1/sk = 0
Quadratic [very fast]: ∃ρ : sk+1 ≤ ρsk2, ∀k sufficiently large
Quadratic =⇒ Superlinear =⇒ Linear
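These definitions can be illustrated on model error sequences (the sample sequences below are ours):

```python
import math

# Model error sequences s_k for each rate
linear   = [0.5 ** k for k in range(1, 15)]                 # ratio s_{k+1}/s_k = 0.5
superlin = [1.0 / math.factorial(k) for k in range(1, 15)]  # ratio 1/(k+1) -> 0
quad     = [10.0 ** -(2 ** k) for k in range(1, 6)]         # s_{k+1} ~ s_k^2

lin_ratios = [b / a for a, b in zip(linear, linear[1:])]
sup_ratios = [b / a for a, b in zip(superlin, superlin[1:])]
```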
B3. The Armijo Line Search
This is a successive step-size rule where we require more than just reducing the cost.
Armijo Rule Fix scalars s, β, σ with β ∈ (0, 1), σ ∈ (0, 1). Set tk := βmk s, where mk is the first integer m ≥ 0 such that
♦  f(xk + βms dk) − f(xk) ≤ σβms〈f′(xk), dk〉
Stepsizes βms, m = 0, 1, . . . are tried successively until ♦ is satisfied for m = mk.
So, here we are not satisfied with just "cost improvement"; the amount of improvement has to be sufficiently large, as defined in ♦.
PRACTICAL CHOICES: σ ∈ [10−5, 10−1], β = 0.5 or 0.1.
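A backtracking implementation of the rule is a few lines (the helper name and the toy quadratic are ours):

```python
import numpy as np

def armijo_step(f, gradf, x, d, s=1.0, beta=0.5, sigma=1e-4):
    """Armijo rule: return t = beta^m * s for the smallest integer m >= 0 with
    f(x + t d) - f(x) <= sigma * t * <grad f(x), d>, d a descent direction."""
    g_dot_d = float(np.dot(gradf(x), d))
    assert g_dot_d < 0, "d must be a descent direction"
    t, fx = s, f(x)
    while f(x + t * d) - fx > sigma * t * g_dot_d:
        t *= beta
    return t

# Steepest-descent step on f(x) = x1^2 + 10 x2^2
f  = lambda x: x[0] ** 2 + 10 * x[1] ** 2
gf = lambda x: np.array([2 * x[0], 20 * x[1]])
x0 = np.array([1.0, 1.0])
t  = armijo_step(f, gf, x0, -gf(x0))
```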
B4. Other Unconstrained Minimization Algorithms
Quasi-Newton Methods Idea: replace the Hessian (or its inverse) by some P.D. matrix. Done by "mimicking" the minimization of a quadratic function f(x) = 2−1xTQx − bTx. In that case one has
f′(x) − f′(y) = Q(x − y), ∀x, y ∈ Rn
Thus, given a P.D. matrix Hk, we search for a matrix Hk+1 s.t.
Hk+1(f′(xk+1) − f′(xk)) = xk+1 − xk  ←− [QN–Quasi-Newton condition]
There are many solutions satisfying [QN]. One example is BFGS, given by
Hk+1 = Hk + (1 + ykT Hk yk / dkT yk)(dk dkT / dkT yk) − (dk ykT Hk + Hk yk dkT)/(dkT yk)
where yk := f′(xk+1) − f′(xk), dk = xk+1 − xk.
QN scheme: Start with x0 ∈ Rn, H0 ≻ 0
Iteration k: Set dk = −Hkf′(xk); xk+1 = xk + tkdk  [tk: stepsize rules]
Compute: dk = xk+1 − xk; yk := f′(xk+1) − f′(xk)
Update matrix: Hk −→ Hk+1  [e.g. via BFGS or other rules]
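The BFGS update and the [QN] condition can be checked directly; on a quadratic with Hessian Q one has yk = Q dk, and the secant condition Hk+1 yk = dk holds exactly after one update (the function name and test data are ours):

```python
import numpy as np

def bfgs_update(H, d, y):
    """BFGS update of the inverse-Hessian approximation Hk -> Hk+1, with
    d = x_{k+1} - x_k, y = f'(x_{k+1}) - f'(x_k) (requires d.y > 0)."""
    rho = 1.0 / float(d @ y)
    Hy = H @ y
    return (H + (1.0 + rho * float(y @ Hy)) * rho * np.outer(d, d)
              - rho * (np.outer(d, Hy) + np.outer(Hy, d)))

# On a quadratic f with Hessian Q: y = Q d
Q = np.diag([1.0, 2.0, 5.0])
d = np.array([1.0, -1.0, 0.5])
y = Q @ d
H1 = bfgs_update(np.eye(3), d, y)   # satisfies the secant condition H1 @ y == d
```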
B4-b Other Unconstrained Minimization AlgorithmsCtd.
Conjugate Gradients Initially designed for quadratic problems.
CG: Start with x0 ∈ Rn; compute f(x0), d0 = −f′(x0)
Iteration k: xk+1 = xk + tkdk  [tk: exact line search]
Compute: f(xk+1), f′(xk+1), and βk = ‖f′(xk)‖−2 f′(xk+1)T(f′(xk+1) − f′(xk))
Update: dk+1 = −f′(xk+1) + βkdk
Various other formulas exist for βk.
Trust Region Methods Replace Hk in Newton with Ak := Hk + µkI, with µk > 0 such that Ak ≻ 0. Equivalent to imposing a constraint on the length of the direction ("trust region") in the quadratic approximation model:
mind∈Rn {2−1dTHd + gTd : ‖d‖ ≤ l}  (l > 0)
If the Newton direction dk = −Ak−1f′(xk) is inactive, we set µk = 0; otherwise µk > 0 is chosen such that ‖tkdk‖ = lk.
B5. A Basic Multiplier Method for Equality Constraints
min{f(x) : h(x) = 0},  h : Rn → Rm
Lagrangian: L(x, u) = f(x) + uTh(x)
Augmented Lagrangian: A(x, u, c) = L(x, u) + 2−1c‖h(x)‖2
AL = Penalized Lagrangian
Multiplier Method Given {uk, ck}:
1. Find xk+1 = argmin{A(x, uk, ck) : x ∈ Rn}
2. Update rule: uk+1 = uk + ckh(xk+1)
3. Increase ck > 0 if necessary.
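On a toy equality-constrained problem the inner minimization is available in closed form, so the whole scheme is a short loop. A sketch for min x2 s.t. x − 1 = 0, with c kept fixed (the function name is ours; the KKT pair is x∗ = 1, u∗ = −2):

```python
def multiplier_method(u0=0.0, c=1.0, iters=30):
    """Multiplier method for min x^2 s.t. x - 1 = 0 (toy instance).
    A(x, u, c) = x^2 + u(x - 1) + (c/2)(x - 1)^2; the inner argmin solves
    2x + u + c(x - 1) = 0, i.e. x = (c - u)/(2 + c)."""
    u = u0
    for _ in range(iters):
        x = (c - u) / (2.0 + c)    # step 1: minimize A(., u, c) exactly
        u = u + c * (x - 1.0)      # step 2: multiplier update
    return x, u

x, u = multiplier_method()
```

Note that c never needs to grow: the dual update alone drives h(xk) → 0.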
B5-b Features of Multipliers Method
A key advantage: it is not necessary to increase ck to ∞ for convergence (as opposed to "Penalty/Barrier methods").
As a result, A is less subject to ill-conditioning, and more robust.
The AL depends on c but also on the dual multiplier u: better/faster convergence can be expected (rather than keeping u constant).
Useful for designing well-behaved decomposition/splitting schemes.
Extendible to various models such as variational inequalities and semidefinite programs.
B6. Self-Concordance Theory
Idea: make the convergence analysis coordinate invariant [Newton's method is coordinate invariant... but the convergence analysis is not!]
Achieved for self-concordant convex functions.
Definition-SCF Let f ∈ C3(dom f) be convex. Then f is called self-concordant if ∃Mf ≥ 0 such that:
(SC)  |D3f(x)[d, d, d]| ≤ Mf(D2f(x)[d, d])3/2, ∀x ∈ dom f, d ∈ Rn.
(Here (d3/dt3)f(x + td)|t=0 = D3f(x)[d, d, d] = 〈f′′′(x)[d]d, d〉; set θ(t) ≡ f(x + td).)
θ is SC ⇐⇒ |θ′′′(t)| ≤ Mθ′′(t)3/2, ∀t ∈ dom θ,
i.e., the Hessian does not vary too fast in its own metric.
B7. Examples of Self-Concordant functions
1. Linear and convex quadratic
f(x) = 2−1xTAx − bTx + c, A ∈ Sn+, dom f = Rn
Then f′(x) = Ax − b, f′′(x) = A, f′′′(x) = 0 =⇒ Mf = 0
2. Logarithmic Barrier f(x) = − log x, dom f = (0, +∞)
Then f′(x) = −x−1, f′′(x) = x−2, f′′′(x) = −2x−3 =⇒ Mf = 2
3. Log-barrier of a quadratic region
f(x) = − log q(x), q(x) = c + bTx − 0.5xTAx, dom f = {x : q(x) > 0}. Then one verifies that f is SC with Mf = 2.
4. The following functions on R are NOT SC:
ex, x−p (x, p > 0).
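Example 2 can be verified numerically: for f(x) = − log x the SC inequality holds with equality, since |f′′′(x)| = 2x−3 = 2(x−2)3/2 = 2 f′′(x)3/2 (the check below is ours):

```python
# Check |f'''(x)| = 2 * f''(x)^{3/2} for f(x) = -log x, i.e. Mf = 2 with equality.
xs = [0.1, 0.5, 1.0, 3.0, 10.0]
checks = [(abs(-2.0 * x ** -3), 2.0 * (x ** -2) ** 1.5) for x in xs]
```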
B8. Calculus Rules for SC functions
Affine Invariance of SC Let A : Rn → Rm be an affine map, A(x) = Ax + b. If f is Mf-self-concordant, then g(x) := f(Ax + b) is self-concordant with Mg ≡ Mf.
f SC =⇒ g = af is SC ∀a ≥ 1, with Mg = a−1/2Mf
f, g SC =⇒ h = f + g SC, with Mh = max{Mf, Mg}
Composition with logarithm Let h : R → R be convex with dom h = (0, +∞) and such that
(L)  |h′′′(x)| ≤ 3x−1h′′(x), ∀x > 0
Then f(x) := − log(−h(x)) − log x is SC on R++ ∩ {x : h(x) < 0}.
Many functions satisfy (L):
−xp (0 < p ≤ 1); x log x; − log x; x−2(ax + b)2; xp (−1 ≤ p ≤ 0).
Useful to establish self-concordance of the following (important) functions:
f(x) = −∑_{i=1}^m log(bi − aiTx);  dom f = {x : aiTx < bi, i ∈ [1, m]}
f(x) = − log det X;  dom f = Sn++
f(x) = − log(t2 − ‖x‖2);  dom f = {(x, t) : ‖x‖ < t}
B9. Newton with Self-Concordant Functions
Consider the problem min{f(x) : x ∈ dom f} and the Newton scheme
x+ = x − f′′(x)−1f′(x)
Theorem - Existence f attains its minimum over dom f iff there exists x ∈ dom f such that λ(x) < 1 (λ(x) denotes the Newton decrement).
For every x with the latter property we can establish the following key results (all estimates are parameter free!):
f(x) − f(x∗) ≤ h∗(λ(x))  [conjugate of h, h∗(s) := −s − log(1 − s)]
(x − x∗)Tf′′(x)(x − x∗) ≤ (h∗)′(λ(x))
λ(x+) ≤ 2λ2(x)
The last result provides the region of quadratic convergence (with γ ∈ (0, q), where q solves λ = (1 − λ)2):
λ(x) < q = 2−1(3 − √5) =⇒ λ(x+) < λ(x)
B9.b Self Concordant Barrier
Definition Let F be a self-concordant function. The function F is called a ν-self-concordant barrier [SCB] for the set dom F if for any x ∈ dom F:
maxu∈Rn {2〈F′(x), u〉 − 〈F′′(x)u, u〉} ≤ ν
ν is called the parameter of the barrier.
This is a very general definition that can be simplified, assuming that F′′(x) is non-singular, to:
〈F′′(x)−1F′(x), F′(x)〉 ≤ ν
or to:
〈F′(x), u〉2 ≤ ν〈F′′(x)u, u〉, ∀u ∈ Rn, ∀x ∈ dom F
Linear and quadratic functions are not SCBs.
Examples of SCB: F(x) = − log x, dom F = R+, and F(x) = − log q(x), q(x) = −0.5xTQx + 〈c, x〉 + d, dom F = {x : q(x) > 0}, Q ∈ Sn+, are 1-SCBs.
B10. Primal-Dual Interior Methods
The KKT system with perturbed complementarity
⇐⇒ ♠ argminx,v {f(x) − (g(x) − v)Ty − µ ∑_{i=1}^m log vi}
∇f(x) − ∇g(x)y = 0,  V := Diag(v), Y := Diag(y)
V Y e = µe;  µ > 0, e := (1, . . . , 1)T
g(x) − v = 0
Apply Newton's Method to generate the new point (x+, v+, y+) = (x, v, y) + t(∆x, ∆v, ∆y):
[ −∇2L(x, v)  −∇g   ] [ ∆x ]   [ ∇f(x) − ∇g(x)y ]
[ −∇gT       V Y−1  ] [ ∆y ] = [ µY−1e − g(x)   ]
t is chosen to ensure (v+, y+) > 0 and that the merit function is sufficiently reduced:
M(x, v) = f(x) − µ ∑_{i=1}^m log vi + (β/2)‖g(x) − v‖2;  µ = δ vTy/m;  δ ∈ (0, 1), β > 0
See LOQO for convex and nonconvex problems, [Vanderbei-Shanno, 1999]
C1. The Class F+(C̄)
This class allows us to derive pointwise convergence results; the requested properties below mimic "norms".
We write (d, H) ∈ F+(C̄)(⊂ F(C̄)) when the function H satisfies the following two additional properties:
(a1) ∀y ∈ C̄ and ∀{yk} ⊂ C bounded with limk→+∞ H(y, yk) = 0, one has limk→+∞ yk = y
(a2) ∀y ∈ C̄ and ∀{yk} ⊂ C with yk → y, we have limk→+∞ H(y, yk) = 0.
C2. The Interior Proximal Algorithm–IPA
Given d ∈ F, λk > 0, εk ≥ 0. (IPA is well defined, see [].)
Start from a point x0 ∈ C.
Generate a sequence {xk} ⊂ C, with gk ∈ ∂εkf(xk), such that
λkgk + ∇1d(xk, xk−1) = 0.
The IPA can be viewed as
an approximate interior proximal method when εk > 0 ∀k ∈ N,
which becomes exact in the special case εk = 0 ∀k ∈ N.
C3. Convergence Results I: Global Rate
Theorem G1 Let (d, H) ∈ F(C) and let {xk} be the sequence generated by IPA. Set σn = Σ_{k=1}^n λk. Then the following hold:

(i) f(xn) − f(x) ≤ σn⁻¹ H(x, x0) + σn⁻¹ Σ_{k=1}^n σk εk, ∀x ∈ C.

(ii) If lim_{n→∞} σn = +∞ and εk → 0, then lim inf_{n→∞} f(xn) = f∗, and the sequence {f(xk)} converges to f∗ whenever Σ_{k=1}^∞ εk < ∞.

(iii) Furthermore, suppose X∗ ≠ ∅, and consider the following cases: (a) X∗ is bounded; (b) Σ_{k=1}^∞ λk εk < ∞ and (d, H) ∈ F(C̄). Then, under either (a) or (b), the sequence {xk} is bounded with all its limit points in X∗.

An immediate by-product yields the following global rate of convergence estimate for the exact version of IPA (εk = 0, ∀k).

Theorem G2 Let (d, H) ∈ F(C) and let {xk} be the sequence generated by IPA with εk = 0, ∀k. Then f(xn) − f(x) = O(σn⁻¹), ∀x ∈ C.
C4. Convergence Results II: Pointwise Convergence
To establish the global convergence of the sequence {xk} to an optimal solution of problem (P), we use the class F+(C).

Theorem G3 Let (d, H) ∈ F+(C) and let {xk} be the sequence generated by IPA. Suppose that the optimal set X∗ of (P) is nonempty, σn = Σ_{k=1}^n λk → ∞, Σ_{k=1}^∞ λk εk < ∞, and Σ_{k=1}^∞ εk < ∞. Then the sequence {xk} converges to an optimal solution of (P).
C5. Comments
Note that we have separated the two types of convergence results to emphasize:

The differences and the roles played by each of the three classes

F+(C) ⊂ F(C̄) ⊂ F(C)

That the largest, and least demanding, class F(C) already provides reasonable convergence properties for IPA, with minimal assumptions on the problem's data.
These aspects are now illustrated by several application examples.
C6. Proximal Distances (d, H): Application Examples
In most situations, when constructing an IPA for solving the convex problem (P), the proximal distance H induced by d will have a special structure, known as a Bregman proximal distance Dh, generated by some convex kernel h.

We first recall the special features of a Bregman proximal distance.

We then consider various types of constraint sets C for problem (P), and give many examples of pairs (d, H) for which our convergence results hold.
C7. Bregman-proximal distances: Definition
Let h : Rn → R ∪ {+∞} be a proper, lsc, convex function with dom h ⊂ C̄ and dom ∇h = C, strictly convex and continuous on dom h, and C¹ on int dom h = C. Define, ∀x ∈ Rn, ∀y ∈ dom ∇h:

H(x, y) := Dh(x, y) := h(x) − [h(y) + 〈∇h(y), x − y〉]   (1)

The function Dh enjoys a remarkable three-point identity that plays a central role in the analysis:

H(c, a) = H(c, b) + H(b, a) + 〈c − b, ∇₁H(b, a)〉, ∀a, b ∈ C, ∀c ∈ dom h

To handle the constraint cases C versus C̄, we need to consider two types of convex kernels h.
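The three-point identity can be checked numerically; below we use the entropy kernel on R^n_{++}, with the kernel and the three test points being our own choices.

```python
import numpy as np

# Numerical check of the three-point identity, using the entropy kernel
# h(x) = sum_j x_j log x_j on C = R^n_{++} (kernel and points are our choices).
def h(x):      return float(np.sum(x * np.log(x)))
def grad_h(x): return np.log(x) + 1.0

def D(x, y):   # Bregman proximal distance: h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - float(grad_h(y) @ (x - y))

rng = np.random.default_rng(0)
a, b, c = rng.uniform(0.1, 2.0, size=(3, 4))          # three interior points
# note grad_1 D(b, a) = grad_h(b) - grad_h(a)
lhs = D(c, a)
rhs = D(c, b) + D(b, a) + float((c - b) @ (grad_h(b) - grad_h(a)))
print(abs(lhs - rhs) < 1e-10)   # True: the identity holds
```

Expanding the definition of Dh shows the identity is purely algebraic, which is why it holds for any admissible kernel h, not just the entropy.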
C8. Difference between F(C) and F+(C): Some Examples
We let C = Rn++. More examples will follow.
Example: Separable Bregman proximal distances are the most commonly used in the literature. Let θ : R → R ∪ {+∞} be a proper, convex, lsc function with (0, +∞) ⊂ dom θ ⊂ [0, +∞) and

θ ∈ C²(0, +∞),  θ′′(t) > 0 ∀t > 0,  lim_{t→0+} θ′(t) = −∞

We denote this class by Θ0 if θ(0) < +∞, and by Θ+ whenever θ(0) = +∞ and θ is nonincreasing.

Given θ in either class, define

h(x) = Σ_{j=1}^n θ(xj)  =⇒  Dh is separable
C9. Typical Choices for θ
The first two examples are functions θ ∈ Θ0, i.e., with dom θ = [0, +∞)and the last two are in Θ+, i.e., with dom θ = (0, +∞):
θ1(t) = t log t, (Shannon entropy).
θ2(t) = (pt − tp)/(1 − p), with p ∈ (0, 1).
θ3(t) = − log t (Burg’s entropy).
θ4(t) = t⁻¹.
Then, one can verify that for the corresponding proximal distances:
Dh1 , Dh2 ∈ F+(C), while Dh3 , Dh4 ∈ F(C)
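For θ1, the separable construction recovers a familiar object: Dh1 is the Kullback-Leibler divergence. A quick numerical confirmation (the test points in R^n_{++} are our own):

```python
import numpy as np

# For theta_1(t) = t log t (Shannon entropy), the separable Bregman distance
# D_h reduces to the Kullback-Leibler divergence.
def breg(theta, dtheta, x, y):
    return float(np.sum(theta(x) - theta(y) - dtheta(y) * (x - y)))

theta1  = lambda t: t * np.log(t)
dtheta1 = lambda t: np.log(t) + 1.0

x = np.array([0.5, 1.5, 2.0])     # our own interior test points
y = np.array([1.0, 1.0, 0.5])
kl = float(np.sum(x * np.log(x / y) - x + y))
print(abs(breg(theta1, dtheta1, x, y) - kl) < 1e-12)   # True
```

The same `breg` helper applied to θ3(t) = −log t would produce the Itakura-Saito-type distance used in the convex programming example that follows.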
C.10 Convex Programming: C = {x : fi(x) ≥ 0, i ∈ [1, m]}

[CP] min{〈c, x〉 : fi(x) ≥ 0, i ∈ [1, m]}

Let fi : Rn → R be concave and C¹ on Rn for each i ∈ [1, m].

We suppose that Slater's condition holds: ∃x0 ∈ Rn : fi(x0) > 0, ∀i ∈ [1, m].
For θ ∈ Θ+ and x ∈ C let

hν(x) = Σ_{i=1}^m θ(fi(x)) + (ν/2)‖x‖²,  with ν > 0

Set d(x, y) = Dhν(x, y); then (d, Dhν) ∈ F(C).
Convex Programming–Continued
An interesting algorithm for solving [CP] is then obtained by choosing θ(t) ≡ θ3(t) = − log t. In this case we obtain:

d(x, y) = Dhν(x, y) = Σ_{i=1}^m [ − log(fi(x)/fi(y)) + 〈∇fi(y), x − y〉/fi(y) ] + (ν/2)‖x − y‖²

The constrained convex program has thus been reduced to performing, at each step, an unconstrained minimization with an objective of the form:

− Σ_{i=1}^m log fi(x) + (ν/2)‖x‖² + 〈x, Lk〉

(All "constant" terms depending on k through yk are collected in Lk.)

This bears similarity with barrier and center methods....

Note: this d(·, y) enjoys other interesting properties, e.g., when the fi are concave quadratic, d(·, y) is self-concordant for each y ∈ C.
C11. Second order cone constraints: C = Ln+

Let Ln+ := {x ∈ Rn : xn ≥ (x1² + · · · + xn−1²)^{1/2}} be the Lorentz cone.

Let Dn be the diagonal matrix Dn = diag(−1, . . . , −1, 1).

Define h : Ln++ → R by h(x) = − ln(x^T Dn x) + (ν/2)‖x‖².

Then h is proper, lsc and convex on dom h = Ln++. The Bregman proximal distance associated with h is given by

Dh(x, y) = − log(x^T Dn x / y^T Dn y) + 2 x^T Dn y / y^T Dn y − 2 + (ν/2)‖x − y‖².

Thus, with d = Dh, we have (Dh, Dh) ∈ F(Ln++).

Similarly, we can handle the case C = {x ∈ Rn : Ax − b ∈ Ln+}.
C12. Not Self-Proximal: ϕ-Divergence Kernels on Rn+
Let ϕ : R → R ∪ {+∞} be a lsc, convex, proper function such that dom ϕ ⊂ R+ and dom ∂ϕ = R++. We suppose in addition that ϕ is C², strictly convex, and nonnegative on R++, with ϕ(1) = ϕ′(1) = 0. We denote by Φ the class of such kernels, and by:

Φ1 the subclass of kernels satisfying  ϕ′′(1)(1 − t⁻¹) ≤ ϕ′(t) ≤ ϕ′′(1) log t, ∀t > 0;

Φ2 the subclass satisfying  ϕ′′(1)(1 − t⁻¹) ≤ ϕ′(t) ≤ ϕ′′(1)(t − 1), ∀t > 0.

Examples of functions in Φ1, Φ2 are:

ϕ1(t) = t log t − t + 1,  dom ϕ = [0, +∞),
ϕ2(t) = − log t + t − 1,  dom ϕ = (0, +∞),
ϕ3(t) = 2(√t − 1)²,  dom ϕ = [0, +∞).
C13. An Important Example: The Log-Quad ProximalKernel
Let ν > µ > 0 be given fixed parameters, and

Φ2 ∋ ϕ(t) = (ν/2)(t − 1)² + µ(t − log t − 1);  t > 0

Proposition
(i) ϕ is strongly convex on R++ with modulus ν > 0.
(ii) The conjugate of ϕ is given by

ϕ∗(s) = (ν/2) t²(s) + µ log t(s) − ν/2,
t(s) := (2ν)⁻¹ { (ν − µ) + s + √( ((ν − µ) + s)² + 4µν ) } = (ϕ∗)′(s).

(iv) dom ϕ∗ = R, and ϕ∗ ∈ C∞(R).
(v) (ϕ∗)′(s) = (ϕ′)⁻¹(s) is Lipschitz on R, with constant ν⁻¹.
(vi) (ϕ∗)′′(s) ≤ ν⁻¹, ∀s ∈ R.

Smooth Lagrangian Multiplier methods with Log-Quad: handle easily very large scale instances, e.g., (n × m = 1000 × 50,000).

The number of Newton steps does not increase with dimension...

Also solve (to a local min.) nonconvex problems.... (No proofs for that....!)
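The key relation (ϕ∗)′ = (ϕ′)⁻¹ = t(·) in the proposition can be checked numerically; the parameter values ν, µ below are our own test choices.

```python
import math

# Check numerically that t(s) from the proposition inverts phi', i.e.
# phi'(t(s)) = s, so (phi*)'(s) = (phi')^{-1}(s) = t(s).
nu, mu = 2.0, 1.0                 # any nu > mu > 0 will do
phi_prime = lambda t: nu * (t - 1.0) + mu * (1.0 - 1.0 / t)

def t_of_s(s):
    a = (nu - mu) + s
    return (a + math.sqrt(a * a + 4.0 * mu * nu)) / (2.0 * nu)

ok = all(abs(phi_prime(t_of_s(s)) - s) < 1e-9 for s in [-5.0, -1.0, 0.0, 3.0, 10.0])
print(ok)   # True
```

Indeed, setting ϕ′(t) = s and multiplying through by t gives the quadratic νt² − ((ν − µ) + s)t − µ = 0, whose positive root is exactly t(s).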
C14. BIG with Armijo-Goldstein stepsize rule
We use a generalized stepsize rule, reminiscent of the one used in the classical projected gradient method.

Algorithm 1: Armijo-Goldstein stepsize rule. Let β ∈ (0, 1), m ∈ (0, 1) and s > 0 be fixed chosen scalars.

Step 0 Start from a point x0 ∈ C ∩ V.

Step k Generate the sequence {xk} ⊂ C ∩ V as follows:
If ∇f(xk−1) ∈ V⊥, stop.
Otherwise, with xk(λ) = u(λ∇f(xk−1), xk−1), set λk = β^{jk} s, where jk is the first nonnegative integer j such that

f(xk(β^j s)) − f(xk−1) ≤ m 〈∇f(xk−1), xk(β^j s) − xk−1〉.   [AG]

Set xk = xk(λk); k ← k + 1; go to step k.

With some work... it can be proved that the [AG] stepsize rule is well defined.
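A minimal sketch of the [AG] rule, taking the entropy kernel on the unit simplex so that u(v, x) is the EMDA-style multiplicative update; the objective and its data are our own toy choices, with the minimizer lying inside the simplex.

```python
import numpy as np

def u(v, x):                      # argmin_z { <v, z> + D_h(z, x) : z in simplex }
    w = x * np.exp(-v)            # entropy kernel => multiplicative update
    return w / w.sum()

def ag_step(f, grad, x, s=1.0, beta=0.5, m=0.1, max_backtracks=60):
    g = grad(x)
    lam = s
    for _ in range(max_backtracks):   # backtrack until the [AG] test is met
        x_new = u(lam * g, x)
        if f(x_new) - f(x) <= m * (g @ (x_new - x)):
            return x_new
        lam *= beta
    return x                      # stepsize underflow: near-stationary, stay put

a = np.array([0.2, 0.5, 0.3])     # our toy data: f minimized over the simplex at a
f = lambda x: 0.5 * float(np.sum((x - a) ** 2))
grad = lambda x: x - a
x = np.ones(3) / 3
for k in range(100):
    x = ag_step(f, grad, x)
print(np.allclose(x, a, atol=1e-3))   # True: iterates converge to the optimum a
```

Note that no Lipschitz constant of ∇f is needed: the backtracking loop discovers an admissible stepsize, which is exactly the practical advantage exploited in the modified EMDA below.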
C15. Convergence of Algorithm 1
Theorem A1 Let (d, H) ∈ F(C), and let {xk} be the sequence produced by Algorithm 1. Then:

The sequence {f(xk)} is nonincreasing and converges to f∗.

Suppose that the optimal set X∗ of problem (M) is nonempty. Then:
(a) if X∗ is bounded, {xk} is bounded with all its limit points in X∗;
(b) if (d, H) ∈ F+(C), {xk} converges to an optimal solution of (P), and the following global rate estimate holds:

f(xn) − f∗ = O(n⁻¹)
C16. Cases C = Rn++; Sn++; Ln++

Let d(x, y) = p(x, y) + (σ/2)‖x − y‖².

• p(z, x) = µ Σ_{j=1}^n xj^r ϕ(xj⁻¹ zj), σ ≥ µ > 0, for (z, x) ∈ C × C, and ϕ(t) = − log t + t − 1 [r = 2; the log-quad function]; one obtains:

ui(v, x) = xi (ϕ∗)′(−vi xi⁻¹),  i = 1, . . . , n.

• (SDP) Take p(x, y) = tr(− log x + log y + xy⁻¹) − n, ∀x, y ∈ Sn++. One has, ∀x ∈ Sn++, v ∈ Sn:

u(v, x) = (2σ)⁻¹ ( A(v, x) + √( A²(v, x) + 4σI ) ),  with A(v, x) := σx − v − x⁻¹.

• (SOC) Take p(x, y) = − log(x^T Dn x / y^T Dn y) + 2 x^T Dn y / y^T Dn y − 2, ∀x, y ∈ Ln++. It can be shown that:

u(v, x) = s w,  with s := (2σ)⁻¹ ( √(1 + 8σ‖w‖⁻²) − 1 ),  w := 2τ(x)⁻¹ Dn x + v − σx,  τ(x) = x^T Dn x.
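The SDP update formula can be verified numerically: u = (2σ)⁻¹(A + √(A² + 4σI)) should satisfy σu − u⁻¹ = A (the matrices commute, since u is a function of A). The test data below are our own random choices.

```python
import numpy as np

def mat_sqrt(M):                  # square root of a symmetric PSD matrix
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.sqrt(w)) @ Q.T

sigma = 1.5
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
x = B @ B.T + 3.0 * np.eye(3)     # x in S^n_{++}
v = 0.5 * (B + B.T)               # arbitrary symmetric v
A = sigma * x - v - np.linalg.inv(x)
u = (A + mat_sqrt(A @ A + 4.0 * sigma * np.eye(3))) / (2.0 * sigma)
print(np.allclose(sigma * u - np.linalg.inv(u), A))   # True, and u is PD
```

Eigenvalue by eigenvalue this is the scalar identity σg(a) − 1/g(a) = a for g(a) = (a + √(a² + 4σ))/(2σ), the same quadratic that underlies the Rn++ and SOC cases.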
C17. Modifying The EMDA
We can modify the EMDA with an Armijo-Goldstein step-size rule(since here d ≡ E is 1-strongly convex).
Therefore, we can apply Theorem A1, proving that the sequence {xk}of EMDA with λk defined by the Armijo-Goldstein stepsize rule [AG]converges to an optimal solution of (M).
This modified version can be more practical since, in particular, we do not need to know or compute the Lipschitz constant Lf.
Another advantage of the entropy kernel: it extends to SDP constraints

∆ ≡ {x ∈ Sn : tr(x) = 1, x ⪰ 0}  with  d(x, y) := tr(x log x − x log y) on ∆
C18. The Key Result to Update {xk}, {yk} in IGA
Theorem Let σ > 0, L > 0 be given. Suppose that for some k ≥ 0 we have a point xk ∈ C ∩ V such that f(xk) ≤ q∗k = min{qk(x) : x ∈ C ∩ V}. Let αk ∈ [0, 1), ck+1 = (1 − αk)ck, and let {zk} ⊂ C ∩ V be generated by

zk+1 = argmin { 〈x, (αk/ck+1) ∇f(yk)〉 + H(x, zk) : x ∈ C ∩ V }

Define

yk = (1 − αk)xk + αk zk,
xk+1 = (1 − αk)xk + αk zk+1.

Then

q∗k+1 ≥ f(xk+1) + (1/2) ( ck+1 σ / α²k − L ) ‖xk+1 − yk‖².

Therefore, by taking for example Lα²k = σck(1 − αk) = σck+1, we can guarantee that q∗k+1 ≥ f(xk+1). This leads to the desired interior gradient algorithm.
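The update scheme of the theorem can be sketched with the entropy kernel on the simplex (σ = 1 in the ℓ1 norm) and the stepsize coupling Lα²k = σck+1; the toy objective and its data below are our own choices.

```python
import numpy as np

def u_ent(v, z):                  # argmin { <v, x> + D_h(x, z) : x in simplex }
    w = z * np.exp(-v)
    return w / w.sum()

a = np.array([0.2, 0.5, 0.3])     # our toy data: f minimized on the simplex at a
f = lambda x: 0.5 * float(np.sum((x - a) ** 2))
grad = lambda x: x - a
L, sigma = 1.0, 1.0               # grad Lipschitz constant; entropy modulus

x = z = np.ones(3) / 3
c = 1.0
for k in range(500):
    # alpha_k solves L*alpha^2 = sigma*c*(1 - alpha) = sigma*c_{k+1}
    alpha = (-sigma * c + np.sqrt(sigma**2 * c**2 + 4 * L * sigma * c)) / (2 * L)
    c_next = (1 - alpha) * c
    y = (1 - alpha) * x + alpha * z
    z = u_ent((alpha / c_next) * grad(y), z)   # z_{k+1}
    x = (1 - alpha) * x + alpha * z            # x_{k+1}
    c = c_next
print(f(x) < 1e-5)   # True: f(x_k) - f* decays like O(L/k^2)
```

With this coupling ck = O(k⁻²), which is the accelerated (Nesterov-type) rate delivered by the interior gradient algorithm while every iterate stays in C ∩ V.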
R. Short Bibliography—Some Books for parts [A]-[B]
A. Auslender and M. Teboulle, Asymptotic Cones and Functions in Optimization and Variational
Inequalities, Springer Monographs in Mathematics, Springer-Verlag New-York, 2003.
A. Ben-Tal, A. Nemirovski, Lectures on modern convex optimization. Analysis, algorithms, and
engineering applications, SIAM Publications, 2001.
D. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, Massachusetts, 1999.
A. V. Fiacco and G. P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization
Techniques, Classics in Applied Mathematics, SIAM , Philadelphia, 1990.
O. L. Mangasarian, Nonlinear programming, McGraw-Hill Publishing Company, 1969.
A. Nemirovski and D. Yudin, Problem complexity and Method Efficiency in Optimization, John Wiley
New York, 1983.
Y. Nesterov, A. Nemirovski, Interior point polynomial algorithms in convex programming, SIAM
Publications, Philadelphia, PA, 1994.
J. Nocedal, S.J. Wright, Numerical Optimization, Springer Verlag, New York, 1999.
J. M. Ortega and W. C. Rheinboldt, Iterative solution of nonlinear equations in several variables,
Academic Press, 1970.
R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
Refs. on some of our recent works related to Part C
A. Auslender, M. Teboulle and S. Ben-Tiba, “Interior Proximal and Multiplier Methods based on
Second Order Homogeneous Kernels”, Mathematics of Operations Research, 24, (1999) 645–668.
A. Auslender, M. Teboulle, “Lagrangian duality and related multiplier methods for variational
inequalities”, SIAM J. Optimization, 10, (2000), 1097–1115.
A. Auslender, M. Teboulle, “Entropic proximal decomposition methods for convex programs and
variational inequalities”, Mathematical Programming, 91, (2001), 33-47.
A. Auslender and M. Teboulle, “Interior gradient and epsilon-subgradient descent methods for
constrained convex minimization”, Mathematics of Operations Research, 29, 2004, 1–26.
A. Auslender and M. Teboulle, “A unified framework for interior gradient/subgradient and proximal
methods in convex optimization”, February 2003 (submitted for publication).
A. Beck and M.Teboulle “Mirror descent and nonlinear projected subgradient methods for convex
optimization”, Operations Research Letters, 31, 2003, 167-175.
J. Bolte and M. Teboulle, “Barrier operators and associated gradient-like dynamical systems for
constrained minimization problems”, SIAM J. of Control Optimization, 42, (2003), 1266–1292.
M. Doljanski and M. Teboulle, “An Interior Proximal Algorithm and the exponential multiplier method
for Semidefinite Programming”, SIAM J. of Optimization, 9, 1998, 1–13.