Curs Tehnici de Optimizare

Optimization Techniques

Ion Necoara

Automation and Systems Engineering Department, University Politehnica Bucharest

Email: [email protected]

2009


Foreword

This course on numerical optimization techniques is intended for second-year bachelor students of the Automation and Systems Engineering Department, as well as for interested master and PhD students from neighboring subjects. The aim of the course is to give an introduction to numerical methods for the solution of optimization problems, in order to prepare students for using and developing these methods for specific applications in engineering. The focus is on continuous optimization (rather than discrete optimization), with special emphasis on nonlinear programming. For this reason, the course is divided into two major parts:

I. unconstrained optimization

II. constrained optimization

As for bibliography, I recommend the textbook “Numerical Optimization” by J. Nocedal and S. Wright [?] and the excellent textbooks “Convex Optimization” by S. Boyd and L. Vandenberghe [?] (this book is also freely available and can be downloaded from the home page of Stephen Boyd) and “Introductory Lectures on Convex Optimization: A Basic Course” by Y. Nesterov [?].

Background: Students are required to have a solid knowledge of linear algebra (e.g. matrix theory, concepts from vector spaces) and calculus (notions of differentiable functions, convergence of sequences).

Acknowledgement: I would like to thank Prof. M. Diehl (K.U. Leuven) and my students C. Hristescu, A. Manciu, V. Caiter, F. Panait and F. Georgescu (UPB) for their help in writing these notes.


Part I

Introduction


Chapter 1

Background

1.1 Review of matrix analysis

In this course we adopt the convention of considering elements $x \in \mathbb{R}^n$ to be column vectors, i.e. $x = [x_1 \cdots x_n]^T$. In $\mathbb{R}^n$ the inner product is defined as

\[ \langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i. \]

Unless otherwise specified, the norm on $\mathbb{R}^n$ is the standard Euclidean norm (i.e. the norm induced by this inner product):

\[ \|x\| = \sqrt{\langle x, x \rangle}. \]

The angle $\theta$ between two non-zero vectors $x$ and $y$ may be defined by:

\[ \cos\theta = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}, \quad 0 \le \theta \le \pi. \]

The fundamental Cauchy-Schwarz inequality states that for any inner product and the corresponding induced norm the following inequality holds:

\[ |\langle x, y \rangle| \le \|x\| \, \|y\| \quad \forall x, y \in \mathbb{R}^n, \]

with equality if and only if $x$ and $y$ are linearly dependent.

Any norm $\|\cdot\|$ has a dual norm $\|\cdot\|_*$ defined by:

\[ \|x\|_* = \max_{\|y\| = 1} \langle x, y \rangle. \]

The trace of a square matrix $Q = [Q_{ij}]_{ij} \in \mathbb{R}^{n \times n}$ is defined as

\[ \mathrm{Trace}(Q) = \sum_{i=1}^n Q_{ii}. \]

A scalar $\lambda \in \mathbb{C}$ and a non-zero vector $x$ that satisfy the equation $Qx = \lambda x$ are called an eigenvalue and an eigenvector of $Q$, respectively. The eigenvalue-eigenvector equation may be written equivalently as

\[ (\lambda I_n - Q)x = 0, \quad x \neq 0, \]

i.e. the matrix $\lambda I_n - Q$ is singular, that is,

\[ \det(\lambda I_n - Q) = 0. \]

Accordingly, the characteristic polynomial of $Q$ is defined as

\[ p_Q(\lambda) = \det(\lambda I_n - Q). \]

Clearly the set of roots of $p_Q(\lambda) = 0$ coincides with the set of eigenvalues of $Q$. The set of all eigenvalues of $Q$ is called the spectrum of $Q$ and is denoted by $\sigma(Q) = \{\lambda_1, \cdots, \lambda_n\}$. Using this notation we have

\[ p_Q(\lambda) = (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \]

and thus $p_Q(0) = \prod_i (-\lambda_i)$. Since also $p_Q(0) = \det(-Q) = (-1)^n \det(Q)$, we obtain from the previous discussion the following lemma:

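Lemma 1.1.1 The following equalities hold:

\[ \det(Q) = \prod_i \lambda_i \quad \text{and} \quad \mathrm{Trace}(Q) = \sum_i \lambda_i, \]

\[ \lambda_i(Q^k) = \lambda_i^k \quad \text{and} \quad \lambda_i(aI_n + bQ) = a + b\lambda_i \quad \forall i, \ \forall a, b \in \mathbb{R}. \]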
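The identities of Lemma 1.1.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix `Q`, the power `k` and the scalars `a`, `b` are hypothetical test data, not taken from the course.

```python
import numpy as np

# Hypothetical symmetric test matrix (illustration only).
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
n = Q.shape[0]

# Eigenvalues of a symmetric matrix, in ascending order.
eigs = np.linalg.eigvalsh(Q)

# det(Q) is the product of the eigenvalues, Trace(Q) is their sum.
det_ok = bool(np.isclose(np.linalg.det(Q), np.prod(eigs)))
trace_ok = bool(np.isclose(np.trace(Q), np.sum(eigs)))

# The eigenvalues of Q^k are lambda_i^k (here k = 3).
k = 3
eigs_power = np.linalg.eigvalsh(np.linalg.matrix_power(Q, k))
power_ok = bool(np.allclose(np.sort(eigs_power), np.sort(eigs ** k)))

# The eigenvalues of a*I + b*Q are a + b*lambda_i.
a, b = 2.0, -0.5
eigs_affine = np.linalg.eigvalsh(a * np.eye(n) + b * Q)
affine_ok = bool(np.allclose(np.sort(eigs_affine), np.sort(a + b * eigs)))

print(det_ok, trace_ok, power_ok, affine_ok)
```

A symmetric matrix is used only so that `eigvalsh` applies and all eigenvalues are real; the lemma itself holds for any square matrix.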


We denote with $S^n$ the vector space of symmetric matrices:

\[ S^n = \{ Q \in \mathbb{R}^{n \times n} : Q = Q^T \}. \]

On this space we define the inner product using the trace:

\[ \langle Q, P \rangle = \mathrm{Trace}(QP) \quad \forall Q, P \in S^n. \]

Using the well-known cyclic property of the trace we have:

\[ \mathrm{Trace}(QPR) = \mathrm{Trace}(RQP) = \mathrm{Trace}(PRQ), \]

for any matrices $Q$, $P$ and $R$ of appropriate dimensions. As a consequence, we also have

\[ x^T Q x = \mathrm{Trace}(Q x x^T) \quad \forall x \in \mathbb{R}^n. \]

For a symmetric matrix $Q \in S^n$ the corresponding eigenvalues are real, i.e. $\sigma(Q) \subset \mathbb{R}$.

A symmetric matrix $Q \in S^n$ is positive semidefinite (notation $Q \succeq 0$) if

\[ x^T Q x \ge 0 \quad \forall x \in \mathbb{R}^n \]

and positive definite (notation $Q \succ 0$) if $x^T Q x > 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$. We say that $Q \succeq P$ if $Q - P \succeq 0$. We denote the set of positive semidefinite matrices with $S^n_+$ and the set of positive definite matrices with $S^n_{++}$.

We have the following characterization of a positive semidefinite matrix:

Lemma 1.1.2 The following statements are equivalent:

(i) The matrix $Q$ is positive semidefinite.

(ii) All the eigenvalues of $Q$ are non-negative.

(iii) All the principal minors of $Q$ are non-negative.

(iv) There exists a matrix $L$ such that $Q = L^T L$.

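Let us denote with $\lambda_{\min}$ and $\lambda_{\max}$ the smallest and the largest eigenvalue of a symmetric matrix $Q \in S^n$. Then,

\[ \lambda_{\min} = \min_{x \neq 0} \frac{x^T Q x}{x^T x} \quad \text{and} \quad \lambda_{\max} = \max_{x \neq 0} \frac{x^T Q x}{x^T x}. \]

We conclude that $\lambda_{\min} I_n \preceq Q \preceq \lambda_{\max} I_n$.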
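The eigenvalue test of Lemma 1.1.2 and the bounds on the quotient $x^T Q x / x^T x$ can be illustrated with a short numerical experiment. This is only a sketch under stated assumptions: NumPy is assumed, the matrix `Q` is a hypothetical example, and random sampling merely illustrates the bounds without proving them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric test matrix; its eigenvalues are 2 - sqrt(2), 2, 2 + sqrt(2).
Q = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])

eigs = np.linalg.eigvalsh(Q)          # ascending order
lam_min, lam_max = float(eigs[0]), float(eigs[-1])

# Lemma 1.1.2 (ii): Q is positive semidefinite iff all eigenvalues are >= 0.
is_psd = lam_min >= 0.0

def rayleigh(Q, x):
    """Quotient x^T Q x / x^T x for a non-zero vector x."""
    return float(x @ Q @ x) / float(x @ x)

# Sample random directions and check lam_min <= quotient <= lam_max.
samples = rng.standard_normal((1000, 3))
quotients = np.array([rayleigh(Q, x) for x in samples])
within_bounds = bool(np.all(quotients >= lam_min - 1e-12)
                     and np.all(quotients <= lam_max + 1e-12))

print(is_psd, within_bounds)
```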


We can derive definitions for certain matrix norms from vector norms. Given a vector norm $\|\cdot\|$, we define the corresponding matrix norm as:

\[ \|Q\| = \sup_{x \neq 0} \frac{\|Qx\|}{\|x\|}. \]

For the Euclidean norm the corresponding matrix norm is as follows:

\[ \|Q\| = \left( \lambda_{\max}(Q^T Q) \right)^{1/2}. \]

The Frobenius norm of a matrix $Q \in \mathbb{R}^{m \times n}$ is defined by

\[ \|Q\|_F = \Big( \sum_{i=1}^m \sum_{j=1}^n Q_{ij}^2 \Big)^{1/2}. \]
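Both matrix norms above can be computed directly from their definitions and compared with a library routine. A minimal sketch, assuming NumPy; the rectangular matrix `Q` is a hypothetical example.

```python
import numpy as np

# Hypothetical 3x2 test matrix.
Q = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Induced Euclidean norm: square root of the largest eigenvalue of Q^T Q.
spectral = float(np.sqrt(np.linalg.eigvalsh(Q.T @ Q)[-1]))

# Frobenius norm: square root of the sum of the squared entries.
frobenius = float(np.sqrt(np.sum(Q ** 2)))

# Reference values from NumPy's own norm routine.
spectral_ref = float(np.linalg.norm(Q, 2))
frobenius_ref = float(np.linalg.norm(Q, "fro"))

print(spectral, spectral_ref)
print(frobenius, frobenius_ref)
```

For this example the induced Euclidean norm does not exceed the Frobenius norm, which is a general relation between the two.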

1.2 Review of mathematical analysis

It is important to extend a function $f$ to all of $\mathbb{R}^n$ by defining its value to be $+\infty$ outside its domain. In the following we assume that all functions are implicitly extended. A scalar function $f : \mathbb{R}^n \to \mathbb{R}$ has as its effective domain the set

\[ \mathrm{dom} f = \{ x \in \mathbb{R}^n : f(x) < \infty \}. \]

The function $f$ is said to be differentiable at a point $x \in \mathbb{R}^n$ if there exists a vector $g \in \mathbb{R}^n$ such that for all $y \in \mathbb{R}^n$:

\[ f(x + y) = f(x) + \langle g, y \rangle + R(\|y\|), \]

where $\lim_{y \to 0} \frac{R(\|y\|)}{\|y\|} = 0$ and $R(0) = 0$. The vector $g$ is called the derivative or the gradient of $f$ at the point $x$ and is written as $\nabla f(x)$. In other words, a function is differentiable at a point $x$ if it admits a first-order linear approximation at $x$. It is clear that the gradient is uniquely determined and we define it as a column vector with components

\[ \nabla f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}. \]

The function $f$ is said to be differentiable on a set $X \subseteq \mathrm{dom} f$ if it is differentiable at all points of $X$.


The quantity (whenever the limit exists)

\[ f'(x; d) = \lim_{t \downarrow 0} \frac{f(x + td) - f(x)}{t} \]

is called the directional derivative of $f$ at $x$ along the direction $d$. Note that the directional derivative may exist for a non-differentiable function:

Example 1.2.1 For the function $f(x) = \|x\|$ we have that $f'(0; d) = \|d\|$, but this function is not differentiable at $x = 0$.

If the function is differentiable then

\[ f'(x; d) = \langle \nabla f(x), d \rangle. \]
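Example 1.2.1 can be reproduced with a one-sided difference quotient. The sketch below uses only the Python standard library; the direction `d` and the step `t` are hypothetical choices. If $f$ were differentiable at $0$, the formula $f'(0; d) = \langle \nabla f(0), d \rangle$ would force $f'(0; -d) = -f'(0; d)$, which is not what is observed.

```python
import math

def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))

def directional_derivative(f, x, d, t=1e-8):
    """One-sided difference quotient (f(x + t*d) - f(x)) / t for a small t > 0."""
    shifted = [xi + t * di for xi, di in zip(x, d)]
    return (f(shifted) - f(x)) / t

origin = [0.0, 0.0]
d = [3.0, -4.0]          # hypothetical direction with norm 5
minus_d = [-3.0, 4.0]

along_d = directional_derivative(norm, origin, d)
along_minus_d = directional_derivative(norm, origin, minus_d)

# Both quotients are close to ||d|| = 5, so f'(0; -d) is not -f'(0; d).
print(along_d, along_minus_d)
```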

A scalar function $f$ on $\mathbb{R}^n$ is said to be twice differentiable at $x$ if it is differentiable at $x$ and we can find a symmetric matrix $H \in \mathbb{R}^{n \times n}$ such that for all $y \in \mathbb{R}^n$

\[ f(x + y) = f(x) + \langle \nabla f(x), y \rangle + \frac{1}{2} y^T H y + R(\|y\|^2), \]

where $\lim_{y \to 0} \frac{R(\|y\|^2)}{\|y\|^2} = 0$. This matrix $H$ is called the Hessian and is denoted $\nabla^2 f(x)$.

In conclusion, a function is twice differentiable at $x$ if it admits a second-order quadratic approximation in a neighborhood of $x$. As for the gradient, the Hessian is unique, whenever it exists, and is a symmetric matrix with the components

\[ \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}. \]

The function $f$ is said to be twice differentiable on a set $X \subseteq \mathrm{dom} f$ if it is twice differentiable at all points of $X$. The Hessian can be seen as a derivative of the vector function $\nabla f$:

\[ \nabla f(x + y) = \nabla f(x) + \nabla^2 f(x) y + R(\|y\|). \]

Example 1.2.2 Let $f$ be a quadratic function

\[ f(x) = \frac{1}{2} x^T Q x - q^T x + r, \]

where $Q \in \mathbb{R}^{n \times n}$ is a symmetric matrix. Then, it is clear that the gradient of $f$ at $x$ is

\[ \nabla f(x) = Qx - q \]

and the Hessian at $x$ is $\nabla^2 f(x) = Q$.
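The gradient formula of Example 1.2.2 can be compared against central finite differences. A minimal sketch, assuming NumPy; the data `Q`, `q`, `r`, the point `x` and the helper `numerical_gradient` are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical problem data with symmetric Q.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
q = np.array([1.0, -1.0])
r = 0.5

def f(x):
    """Quadratic function f(x) = 1/2 x^T Q x - q^T x + r."""
    return 0.5 * x @ Q @ x - q @ x + r

def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

x = np.array([0.7, -1.3])
analytic = Q @ x - q                  # gradient formula from the example
numeric = numerical_gradient(f, x)
max_error = float(np.max(np.abs(analytic - numeric)))

print(analytic, numeric, max_error)
```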

A function that is at least once differentiable is said to be a smooth function. A function that is $k$ times differentiable with the $k$-th derivative continuous is said to belong to the class $C^k$.

For a differentiable function $g : \mathbb{R} \to \mathbb{R}$, we have the classical first-order Taylor approximation in mean value form, for some $\tau$ in the interval $(a, b)$, and in integral form:

\[ g(b) - g(a) = g'(\tau)(b - a), \qquad g(b) - g(a) = \int_a^b g'(s) \, ds. \]

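These equalities can be extended to any differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ using the previous relations for the function $g(t) = f(x + t(y - x))$:

\[ f(y) = f(x) + \langle \nabla f(x + \tau(y - x)), y - x \rangle \quad \text{for some } \tau \in (0, 1), \]

\[ f(y) = f(x) + \int_0^1 \langle \nabla f(x + \tau(y - x)), y - x \rangle \, d\tau. \]

The reader should note that, by the chain rule, we used:

\[ g'(\tau) = \langle \nabla f(x + \tau(y - x)), y - x \rangle. \]

Some extensions to twice differentiable functions are possible:

\[ \nabla f(y) = \nabla f(x) + \int_0^1 \nabla^2 f(x + \tau(y - x)) (y - x) \, d\tau, \]

\[ f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} (y - x)^T \nabla^2 f(x + \tau(y - x)) (y - x) \quad \text{for some } \tau \in (0, 1). \]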

A differentiable function has a Lipschitz continuous gradient if there exists a constant $L > 0$ such that

\[ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \forall x, y. \]


Using Taylor’s approximations given above we obtain the following lemma:

Lemma 1.2.3 (i) A twice differentiable function $f$ has a Lipschitz continuous gradient if and only if the following inequality holds:

\[ \|\nabla^2 f(x)\| \le L \quad \forall x. \]

(ii) If a differentiable function has a Lipschitz continuous gradient then

\[ |f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L}{2} \|y - x\|^2 \quad \forall x, y. \]

Interpretation: From Lemma ?? it follows that a differentiable function with a Lipschitz continuous gradient is bounded from above by a special quadratic function having the Hessian $L I_n$ (here $I_n$ is the identity matrix in $\mathbb{R}^{n \times n}$):

\[ f(y) \le \frac{L}{2} \|y - x\|^2 + \langle \nabla f(x), y - x \rangle + f(x) \quad \forall y. \]
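The quadratic bound can be observed on a scalar example. The sketch below is an illustration with an assumed test function, not one used in the course: the softplus function $f(x) = \log(1 + e^x)$, whose second derivative $s(x)(1 - s(x))$, with $s$ the logistic sigmoid, never exceeds $1/4$, so $L = 1/4$ is a valid Lipschitz constant of the gradient. Only the standard library is used.

```python
import math
import random

def f(x):
    """Softplus log(1 + exp(x)), written in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def grad_f(x):
    """Derivative of softplus: the logistic sigmoid."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

L = 0.25  # upper bound on |f''|, hence a Lipschitz constant of grad_f

rng = random.Random(0)
worst_violation = float("-inf")
for _ in range(10000):
    x = rng.uniform(-10.0, 10.0)
    y = rng.uniform(-10.0, 10.0)
    gap = abs(f(y) - f(x) - grad_f(x) * (y - x))
    bound = 0.5 * L * (y - x) ** 2
    worst_violation = max(worst_violation, gap - bound)

# A non-positive value (up to rounding) means the bound held on every sample.
print(worst_violation)
```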

A twice differentiable function has a Lipschitz continuous Hessian if there exists a constant $M > 0$ such that

\[ \|\nabla^2 f(x) - \nabla^2 f(y)\| \le M \|x - y\| \quad \forall x, y. \]

For this class of functions we have the following characterization:

Lemma 1.2.4 For a twice differentiable function $f$ which has a Lipschitz continuous Hessian we have:

\[ \|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\| \le \frac{M}{2} \|y - x\|^2 \quad \forall x, y. \]

Moreover, the following inequality also holds:

\[ -M \|x - y\| I_n \preceq \nabla^2 f(x) - \nabla^2 f(y) \preceq M \|x - y\| I_n \quad \forall x, y. \]


Chapter 2

Convex Theory

2.1 Convex sets

Definition 2.1.1 A set $S$ is an affine set if for any two points $x_1, x_2 \in S$ and any $\alpha \in \mathbb{R}$ we have $\alpha x_1 + (1 - \alpha) x_2 \in S$ (i.e. the line generated by any two points from $S$ is included in $S$).

Figure 2.1: Affine set generated by two points $x_1$ and $x_2$.

Example 2.1.2 The solution set of a linear system is an affine set, i.e. the set $\{x \in \mathbb{R}^n : Ax = b\}$ is affine.

The affine combination of $p$ points $x_1, \cdots, x_p$ is defined as:

\[ \sum_{i=1}^p \alpha_i x_i, \quad \text{where} \quad \sum_{i=1}^p \alpha_i = 1, \ \alpha_i \in \mathbb{R}. \]

The affine hull of a set $S \subseteq \mathbb{R}^n$, denoted $\mathrm{Aff}(S)$, is the set containing all the possible affine combinations of points from $S$:

\[ \mathrm{Aff}(S) = \Big\{ \sum_{i \in I, \, I \text{ finite}} \alpha_i x_i : x_i \in S, \ \sum_i \alpha_i = 1, \ \alpha_i \in \mathbb{R} \Big\}. \]

In other words Aff(S) is the smallest affine set that contains S.

Definition 2.1.3 The set $S$ is called convex if for any two points $x_1, x_2 \in S$ and $\alpha \in [0, 1]$ we have $\alpha x_1 + (1 - \alpha) x_2 \in S$ (i.e. the segment generated by any two points from $S$ is included in $S$).

Figure 2.2: Convex set.

It follows immediately that any affine set is convex. Furthermore, the convex combination of $p$ points $x_1, \cdots, x_p$ is defined as:

\[ \sum_{i=1}^p \alpha_i x_i, \quad \text{where} \quad \sum_{i=1}^p \alpha_i = 1, \ \alpha_i \ge 0. \]

The convex hull of a set $S$, denoted $\mathrm{Conv}(S)$, is the set containing all possible convex combinations of points from $S$:

\[ \mathrm{Conv}(S) = \Big\{ \sum_{i \in I, \, I \text{ finite}} \alpha_i x_i : x_i \in S, \ \sum_i \alpha_i = 1, \ \alpha_i \ge 0 \Big\}. \]

Note that the convex hull of a set is the smallest convex set that contains the given set. It follows that if $S$ is convex, the convex hull of $S$ coincides with $S$.

Theorem 2.1.4 (Caratheodory’s Theorem) If $S \subseteq \mathbb{R}^n$, then every element of $\mathrm{Conv}(S)$ is a convex combination of at most $n + 1$ points of $S$.


Figure 2.3: Convex hull.

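A hyperplane is the convex set defined as

\[ \{ x \in \mathbb{R}^n : a^T x = b \}, \quad a \neq 0, \ b \in \mathbb{R}. \]

A halfspace is the convex set defined as

\[ \{ x \in \mathbb{R}^n : a^T x \ge b \} \quad \text{or} \quad \{ x \in \mathbb{R}^n : a^T x \le b \}, \]

where $a \neq 0$ and $b \in \mathbb{R}$.

Figure 2.4: Hyperplane with normal vector $a$ and the two regions $a^T x > b$ and $a^T x < b$.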

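A ball with center $x_0 \in \mathbb{R}^n$ and radius $r > 0$ is a convex set defined as:

\[ B(x_0, r) = \{ x \in \mathbb{R}^n : \|x - x_0\| \le r \} = \{ x \in \mathbb{R}^n : x = x_0 + ru, \ \|u\| \le 1 \}. \]

An ellipsoid is the convex set defined as:

\[ \{ x \in \mathbb{R}^n : (x - x_0)^T Q^{-1} (x - x_0) \le 1 \} = \{ x_0 + L^T u : \|u\| \le 1 \}, \]

where $Q \succ 0$ and $Q = L^T L$ (for a symmetric factor, e.g. $L = Q^{1/2}$, the second set is simply $\{ x_0 + Lu : \|u\| \le 1 \}$).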

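A polyhedron is the convex set described by a finite set of hyperplanes and/or halfspaces:

\[ \{ x \in \mathbb{R}^n : a_i^T x \le b_i, \ i = 1, \cdots, m, \quad c_j^T x = d_j, \ j = 1, \cdots, p \}. \]

A polytope is a bounded polyhedron (a polygon in the two-dimensional case). Another representation of a polyhedron is given in terms of vertices:

\[ \Big\{ \sum_{i=1}^{n_1} \alpha_i v_i + \sum_{j=1}^{n_2} \beta_j r_j : \sum_i \alpha_i = 1, \ \alpha_i \ge 0 \ \forall i, \ \beta_j \ge 0 \ \forall j \Big\}, \]

where $v_i$ are called vertices and $r_j$ are called affine rays.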


Figure 2.5: Unbounded polygon generated by vertices and affine rays.

Figure 2.6: Bounded polygon.

Definition 2.1.5 A set $K$ is called a cone if for any $x \in K$ and $\alpha \in \mathbb{R}$, $\alpha \ge 0$, we have $\alpha x \in K$. We say that $K$ is a convex cone if $K$ is a convex set and a cone.

The conic combination of $p$ points $x_1, \cdots, x_p$ is defined as:

\[ \sum_{i=1}^p \alpha_i x_i, \quad \text{where} \quad \alpha_i \ge 0. \]


The conic hull of a set $S$, denoted $\mathrm{Con}(S)$, is the set containing all possible conic combinations of elements from $S$:

\[ \mathrm{Con}(S) = \Big\{ \sum_{i \in I, \, I \text{ finite}} \alpha_i x_i : x_i \in S, \ \alpha_i \ge 0 \Big\}. \]

Note that the conic hull of a set is the smallest convex cone that contains the given set.

Figure 2.7: Conic hull generated by two points x1 and x2.

Figure 2.8: Conic hull generated by a set S.

For a given cone $K$ (with an associated scalar product $\langle \cdot, \cdot \rangle$) its dual cone, denoted $K^*$, is defined as:

\[ K^* = \{ y : \langle x, y \rangle \ge 0 \ \ \forall x \in K \}. \]

Note that the dual cone is always a closed set. Using the fact that $\langle x, y \rangle = \|x\| \, \|y\| \cos \angle(x, y)$ we conclude that the angle between a non-zero vector from $K$ and a non-zero vector from $K^*$ is at most $\frac{\pi}{2}$.

If the cone $K$ satisfies the condition $K = K^*$, then we say that $K$ is a self-dual cone.

Example 2.1.6 The following sets are cones:

• $\mathbb{R}^n$, with dual cone $(\mathbb{R}^n)^* = \{0\}$.

• $\mathbb{R}^n_+ = \{ x \in \mathbb{R}^n : x \ge 0 \}$ is called the orthant cone and is self-dual with respect to the usual scalar product $\langle x, y \rangle = x^T y$, i.e. $(\mathbb{R}^n_+)^* = \mathbb{R}^n_+$.

• $L^n = \{ [x^T \ t]^T \in \mathbb{R}^{n+1} : \|x\| \le t \}$ is called the Lorentz cone or ice-cream cone and it is also self-dual with the scalar product $\langle [x^T \ t]^T, [y^T \ v]^T \rangle = x^T y + tv$, i.e. $(L^n)^* = L^n$.

• $S^n_+ = \{ X \in S^n : X \succeq 0 \}$ is the semidefinite cone and is also self-dual with the scalar product $\langle X, Y \rangle = \mathrm{Trace}(XY)$, i.e. $(S^n_+)^* = S^n_+$.

Figure 2.9: Ice cream cone.
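The self-duality of the Lorentz cone from Example 2.1.6 can be illustrated by sampling. The sketch below uses only the standard library; the helpers `inner`, `in_lorentz_cone` and `sample_cone_point` are hypothetical names introduced for this illustration. It checks that sampled cone points have pairwise non-negative scalar products, and that a chosen point outside the cone has a negative scalar product with some cone element, so it cannot belong to the dual cone. Sampling illustrates the property but does not prove it.

```python
import math
import random

def inner(p, q):
    """Scalar product <[x; t], [y; v]> = x^T y + t*v on R^(n+1)."""
    (x, t), (y, v) = p, q
    return sum(xi * yi for xi, yi in zip(x, y)) + t * v

def in_lorentz_cone(p, tol=1e-12):
    """Membership test ||x|| <= t for a point p = (x, t)."""
    x, t = p
    return math.sqrt(sum(xi * xi for xi in x)) <= t + tol

def sample_cone_point(n, rng):
    """Random point (x, t) with t >= ||x||, i.e. a point of the Lorentz cone."""
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    t = math.sqrt(sum(xi * xi for xi in x)) + rng.uniform(0.0, 1.0)
    return (x, t)

rng = random.Random(0)
points = [sample_cone_point(3, rng) for _ in range(200)]

# Cone points have pairwise non-negative scalar products (L^n is inside its dual).
min_inner = min(inner(p, q) for p in points for q in points)

# A point outside the cone is excluded from the dual by a suitable cone element.
outside = ([1.0, 0.0, 0.0], 0.5)      # ||x|| = 1 > t = 0.5
witness = ([-1.0, 0.0, 0.0], 1.0)     # ||x|| = 1 <= t = 1, a cone element
separating_value = inner(outside, witness)

print(min_inner, separating_value)
```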

2.1.1 Operations that preserve convexity of sets

• The intersection of convex sets is convex, i.e. if all sets in the family $\{S_i\}_{i \in I}$ are convex, then $\bigcap_{i \in I} S_i$ is also convex.

• The sum of two convex sets $S_1$ and $S_2$ is also convex: $S_1 + S_2 = \{ x + y : x \in S_1, \ y \in S_2 \}$. Moreover, $\alpha S = \{ \alpha x : x \in S \}$ is convex if the set $S$ is convex and $\alpha \in \mathbb{R}$.

• The image of a convex set under an affine map (in particular, a translation) is also convex, i.e. given an affine function $f(x) = Qx + b$ and a convex set $S$, the image of $S$ through $f$, $f(S) = \{ f(x) : x \in S \}$, is also convex. Similarly, the pre-image $f^{-1}(S) = \{ x : f(x) \in S \}$ is also convex.

Linear Matrix Inequalities (LMI): It can be easily proved that the set of positive semidefinite matrices $S^n_+$ is convex. Let us now regard an affine map $G : \mathbb{R}^m \to S^n$, $G(x) = A_0 + \sum_{i=1}^m x_i A_i$, with symmetric matrices $A_0, \cdots, A_m \in S^n$. The expression

\[ G(x) \succeq 0 \]

is called a linear matrix inequality (LMI). It defines a convex set $\{ x \in \mathbb{R}^m : G(x) \succeq 0 \}$, as the pre-image of $S^n_+$ under the affine map $G(x)$.
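Whether a given point satisfies an LMI can be tested through the smallest eigenvalue of $G(x)$. A minimal sketch, assuming NumPy; the matrices `A0`, `A1`, `A2` are hypothetical data chosen so that $G(x)$ has eigenvalues $1 \pm \|x\|$, which makes the feasible set the unit disk in $\mathbb{R}^2$.

```python
import numpy as np

# Hypothetical LMI data: G(x) = A0 + x1*A1 + x2*A2 = [[1 + x1, x2], [x2, 1 - x1]].
A0 = np.array([[1.0, 0.0], [0.0, 1.0]])
A1 = np.array([[1.0, 0.0], [0.0, -1.0]])
A2 = np.array([[0.0, 1.0], [1.0, 0.0]])

def G(x):
    """Affine map from R^2 into the symmetric 2x2 matrices."""
    return A0 + x[0] * A1 + x[1] * A2

def satisfies_lmi(x, tol=1e-10):
    """True if G(x) is positive semidefinite (smallest eigenvalue >= -tol)."""
    return bool(np.linalg.eigvalsh(G(x))[0] >= -tol)

x_a = np.array([0.6, 0.0])
x_b = np.array([0.0, -0.8])
midpoint = 0.5 * (x_a + x_b)          # convex combination of two feasible points
x_out = np.array([1.5, 0.0])

print(satisfies_lmi(x_a), satisfies_lmi(x_b),
      satisfies_lmi(midpoint), satisfies_lmi(x_out))
```

The midpoint of two feasible points is feasible again, as expected for a convex set.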


Theorem 2.1.7 (Hyperplane separation theorem) Let $S_1$ and $S_2$ be two non-empty convex sets such that $S_1 \cap S_2 = \emptyset$. Then there exists a hyperplane that separates these two sets, i.e. there exist $a \neq 0$ and $b \in \mathbb{R}$ such that $a^T x \ge b$ for all $x \in S_1$ and $a^T x \le b$ for all $x \in S_2$.

Figure 2.10: Separation theorem.

Theorem 2.1.8 (Hyperplane support theorem) Let $S$ be a convex set and $x_0 \in \mathrm{bd}(S) = \mathrm{cl}(S) \setminus \mathrm{int}(S)$. Then there exists a supporting hyperplane for $S$ at $x_0$, i.e. there exists $a \neq 0$ such that $a^T x \ge a^T x_0$ for all $x \in S$.

2.2 Convex functions

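Definition 2.2.1 The function $f : \mathbb{R}^n \to \mathbb{R}$ is called convex if its effective domain $\mathrm{dom} f$ is a convex set and

\[ f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \]

for all $x, y \in \mathrm{dom} f$ and $\alpha \in [0, 1]$.

If

\[ f(\alpha x + (1 - \alpha) y) < \alpha f(x) + (1 - \alpha) f(y), \]

for all $x \neq y \in \mathrm{dom} f$ and $\alpha \in (0, 1)$, then $f$ is called a strictly convex function.

If there is a constant $\sigma > 0$ such that

\[ f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\sigma}{2} \alpha (1 - \alpha) \|x - y\|^2, \]

for all $x, y \in \mathrm{dom} f$ and $\alpha \in [0, 1]$, then $f$ is called a strongly convex function.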


Figure 2.11: Convex function.

The Jensen inequality tells us that $f$ is a convex function if and only if

\[ f\Big( \sum_{i=1}^p \alpha_i x_i \Big) \le \sum_{i=1}^p \alpha_i f(x_i) \]

for all $x_i \in \mathrm{dom} f$ and $\alpha_i \in [0, 1]$, $\sum_i \alpha_i = 1$.

The geometrical interpretation of convexity is very simple. For a convex function the function values are below the corresponding chord, that is, the values of a convex function at points on the line segment $\alpha x + (1 - \alpha) y$ are less than or equal to the height of the chord joining the points $(x, f(x))$ and $(y, f(y))$.

Remark 2.2.2 A function is convex if and only if it is convex when restricted to any line that intersects its domain. Rephrased, $f$ is convex if and only if for all $x \in \mathrm{dom} f$ and for all $d$, the function $g(\alpha) = f(x + \alpha d)$ is convex on $\{ \alpha \in \mathbb{R} : x + \alpha d \in \mathrm{dom} f \}$. This property is very useful in testing whether a function is convex by restricting it to a line.

A function $f : \mathbb{R}^n \to \mathbb{R}$ is called concave if $-f$ is convex.
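The Jensen inequality stated above can be checked on random convex combinations. The sketch below uses only the standard library and takes $f(x) = e^x$ as an assumed example of a convex function; the helper `random_weights` is a hypothetical name. A sampled check like this can refute convexity but cannot establish it.

```python
import math
import random

def f(x):
    """Assumed convex test function: the exponential."""
    return math.exp(x)

def random_weights(p, rng):
    """Random weights alpha_i >= 0 that sum to one."""
    raw = [rng.random() + 1e-3 for _ in range(p)]
    total = sum(raw)
    return [w / total for w in raw]

rng = random.Random(1)
worst_gap = float("inf")
for _ in range(1000):
    p = rng.randint(2, 6)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(p)]
    alphas = random_weights(p, rng)
    lhs = f(sum(a * x for a, x in zip(alphas, xs)))     # f of the combination
    rhs = sum(a * f(x) for a, x in zip(alphas, xs))     # combination of the values
    worst_gap = min(worst_gap, rhs - lhs)

# A non-negative value (up to rounding) means Jensen's inequality held on every sample.
print(worst_gap)
```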

2.2.1 First-order conditions for convex functions

Theorem 2.2.3 (Convexity for $C^1$ functions) Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $\mathrm{dom} f$ is a convex set. Then $f$ is convex if and only if

\[ f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \forall x, y \in \mathrm{dom} f. \tag{2.1} \]


Proof: “⇒” From the convexity of $f$ we have that for any $x, y \in \mathrm{dom} f$ and for any $\alpha \in (0, 1]$:

\[ f(x + \alpha(y - x)) - f(x) \le \alpha (f(y) - f(x)) \]

and therefore, dividing by $\alpha$ and letting $\alpha \downarrow 0$,

\[ \nabla f(x)^T (y - x) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha(y - x)) - f(x)}{\alpha} \le f(y) - f(x). \]

“⇐” To prove that for $z = x + \alpha(y - x) = (1 - \alpha)x + \alpha y$ it holds that $f(z) \le (1 - \alpha) f(x) + \alpha f(y)$, let us use (??) twice at $z$, in order to obtain $f(x) \ge f(z) + \nabla f(z)^T (x - z)$ and $f(y) \ge f(z) + \nabla f(z)^T (y - z)$, which yield, when weighted with $(1 - \alpha)$ and $\alpha$ and added to each other,

\[ (1 - \alpha) f(x) + \alpha f(y) \ge f(z) + \nabla f(z)^T \underbrace{[(1 - \alpha)(x - z) + \alpha(y - z)]}_{= (1 - \alpha)x + \alpha y - z = 0}. \]

The interpretation is simple: the tangents are below the graph for a convex function. A straightforward consequence of this theorem is the following statement: assume that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and convex, then

\[ (\nabla f(x) - \nabla f(y))^T (x - y) \ge 0 \quad \forall x, y \in \mathrm{dom} f. \]
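Both the tangent inequality and the monotonicity of the gradient can be tested on samples. A minimal sketch with an assumed convex function $f(x) = e^{x_1} + x_2^2$ (not taken from the course), using only the standard library; `dot` is a hypothetical helper.

```python
import math
import random

def f(x):
    """Assumed convex test function f(x) = exp(x1) + x2^2."""
    return math.exp(x[0]) + x[1] ** 2

def grad_f(x):
    """Gradient of the test function."""
    return [math.exp(x[0]), 2.0 * x[1]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

rng = random.Random(0)
min_tangent_gap = float("inf")
min_monotonicity = float("inf")
for _ in range(5000):
    x = [rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)]
    y = [rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)]
    step = [yi - xi for xi, yi in zip(x, y)]
    # First-order condition: f(y) - f(x) - grad f(x)^T (y - x) >= 0.
    min_tangent_gap = min(min_tangent_gap, f(y) - f(x) - dot(grad_f(x), step))
    # Monotone gradient: (grad f(x) - grad f(y))^T (x - y) >= 0.
    grad_diff = [gx - gy for gx, gy in zip(grad_f(x), grad_f(y))]
    point_diff = [xi - yi for xi, yi in zip(x, y)]
    min_monotonicity = min(min_monotonicity, dot(grad_diff, point_diff))

print(min_tangent_gap, min_monotonicity)
```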

2.2.2 Second-order conditions for convex functions

Theorem 2.2.4 (Convexity for $C^2$ functions) Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable and $\mathrm{dom} f$ is convex. Then $f$ is convex if and only if for all $x \in \mathrm{dom} f$ the Hessian is positive semidefinite, i.e.

\[ \nabla^2 f(x) \succeq 0 \quad \forall x \in \mathrm{dom} f. \tag{2.2} \]

Proof: To prove (??) ⇒ (??) we use a second-order Taylor expansion of $f$ at $x$ in an arbitrary direction $d$:

\[ f(x + td) = f(x) + t \nabla f(x)^T d + \frac{1}{2} t^2 d^T \nabla^2 f(x) d + o(t^2 \|d\|^2). \]

From this we obtain

\[ d^T \nabla^2 f(x) d = \lim_{t \to 0} \frac{2}{t^2} \underbrace{\left( f(x + td) - f(x) - t \nabla f(x)^T d \right)}_{\ge 0, \text{ because of (??)}} \ge 0. \]

Conversely, to prove (??) ⇐ (??) we use the Taylor remainder formula with some $\theta \in [0, 1]$:

\[ f(y) = f(x) + \nabla f(x)^T (y - x) + \underbrace{\frac{1}{2} (y - x)^T \nabla^2 f(x + \theta(y - x)) (y - x)}_{\ge 0, \text{ due to (??)}}. \]

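Example 2.2.5

1. The function $f(x) = -\log(x)$ is convex on $(0, \infty)$ because $f''(x) = \frac{1}{x^2} > 0$ for all $x > 0$.

2. The quadratic function $f(x) = r + q^T x + \frac{1}{2} x^T Q x$ is convex on $\mathbb{R}^n$ if and only if $Q \succeq 0$, because $\nabla^2 f(x) = Q$ for all $x \in \mathbb{R}^n$. Note that any affine function is convex and concave at the same time.

3. The function $f(x, t) = \frac{x^T x}{t}$ is convex on $\mathbb{R}^n \times (0, \infty)$ because its Hessian

\[ \nabla^2 f(x, t) = \begin{bmatrix} \frac{2}{t} I_n & -\frac{2}{t^2} x \\ -\frac{2}{t^2} x^T & \frac{2}{t^3} x^T x \end{bmatrix} \]

is positive semidefinite. To see this, multiply it from the left and right with $v = [z^T \ s]^T \in \mathbb{R}^{n+1}$, which yields $v^T \nabla^2 f(x, t) v = \frac{2}{t^3} \|tz - sx\|^2 \ge 0$ if $t > 0$.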
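The closed-form expression for the quadratic form in item 3 can be verified numerically. A minimal sketch, assuming NumPy; the point `(x, t)` and the vector `v = [z; s]` are hypothetical random data. The sketch also evaluates the quadratic form along the direction $[x^T \ t]^T$, where $tz - sx = 0$, so the form vanishes and the Hessian is singular there.

```python
import numpy as np

def hessian(x, t):
    """Hessian of f(x, t) = x^T x / t at a point with t > 0."""
    n = x.size
    H = np.empty((n + 1, n + 1))
    H[:n, :n] = (2.0 / t) * np.eye(n)
    H[:n, n] = -(2.0 / t ** 2) * x
    H[n, :n] = -(2.0 / t ** 2) * x
    H[n, n] = (2.0 / t ** 3) * (x @ x)
    return H

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
t = 1.7
H = hessian(x, t)

# Quadratic form v^T H v versus the closed form (2 / t^3) * ||t z - s x||^2.
z = rng.standard_normal(3)
s = 0.4
v = np.append(z, s)
quad_form = float(v @ H @ v)
closed_form = float((2.0 / t ** 3) * np.sum((t * z - s * x) ** 2))

# Smallest eigenvalue (non-negative up to rounding) and the zero direction [x; t].
min_eig = float(np.linalg.eigvalsh(H)[0])
v0 = np.append(x, t)
zero_direction_value = float(v0 @ H @ v0)

print(quad_form, closed_form, min_eig, zero_direction_value)
```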

Theorem 2.2.6 (Convexity of sublevel sets) The sublevel set $\{ x \in \mathrm{dom} f : f(x) \le c \}$ of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ with respect to any constant $c \in \mathbb{R}$ is convex.

Proof: If $f(x) \le c$ and $f(y) \le c$ then for any $\alpha \in [0, 1]$ it also holds that

\[ f((1 - \alpha)x + \alpha y) \le (1 - \alpha) f(x) + \alpha f(y) \le (1 - \alpha)c + \alpha c = c. \]

Epigraph of a function: Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. We define its epigraph as the following set:

\[ \mathrm{epi} f = \{ [x^T \ t]^T \in \mathbb{R}^{n+1} : x \in \mathrm{dom} f, \ f(x) \le t \}. \]

Theorem 2.2.7 (Convexity of epigraph) A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if its epigraph is a convex set.


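2.2.3 Operations that preserve convexity of functions

1. If $f_1$ and $f_2$ are convex functions and $\alpha_1, \alpha_2 \ge 0$ then $\alpha_1 f_1 + \alpha_2 f_2$ is also convex.

2. If $f$ is convex then $g(x) = f(Ax + b)$ (i.e. the composition of a convex function with an affine function) is also convex.

3. Let $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ be such that $f(\cdot, y)$ is convex for any $y \in S \subseteq \mathbb{R}^m$. Then the new function

\[ g(x) = \sup_{y \in S} f(x, y) \]

is also convex.

4. The composition with a monotone convex function: if $f : \mathbb{R}^n \to \mathbb{R}$ is convex and $g : \mathbb{R} \to \mathbb{R}$ is convex and monotonically increasing, then the function $g \circ f : \mathbb{R}^n \to \mathbb{R}$, $x \mapsto g(f(x))$ is also convex.

Proof (for twice differentiable $f$ and $g$):

\[ \nabla^2 (g \circ f)(x) = \underbrace{g''(f(x))}_{\ge 0} \underbrace{\nabla f(x) \nabla f(x)^T}_{\succeq 0} + \underbrace{g'(f(x))}_{\ge 0} \underbrace{\nabla^2 f(x)}_{\succeq 0} \succeq 0. \]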

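Conjugate functions: Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. We define its conjugate, denoted $f^*$, as the function

\[ f^*(y) = \sup_{x \in \mathbb{R}^n} \underbrace{y^T x - f(x)}_{F(x, y)}. \]

Since $F(x, y)$ is affine (hence convex) in $y$ for every fixed $x$, it follows from the previous discussion on pointwise suprema that $f^*$ is convex regardless of the properties of $f$. Moreover, $\mathrm{dom} f^* = \{ y : f^*(y) \text{ finite} \}$. Another straightforward consequence of the definition is the Fenchel inequality:

\[ f(x) + f^*(y) \ge y^T x \quad \forall x, y. \]

Example 2.2.8 For the convex quadratic function $f(x) = \frac{1}{2} x^T Q x$, where $Q \succ 0$, we have $f^*(y) = \frac{1}{2} y^T Q^{-1} y$.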
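The conjugate of the quadratic in Example 2.2.8 can be checked numerically: the supremum of $y^T x - f(x)$ is attained where the gradient $y - Qx$ vanishes, i.e. at $x = Q^{-1} y$, and the Fenchel inequality must hold at every other point. A minimal sketch, assuming NumPy; `Q` and `y` are hypothetical data.

```python
import numpy as np

# Hypothetical positive definite matrix and dual point.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
y = np.array([1.0, -2.0])

def f(x):
    """Convex quadratic f(x) = 1/2 x^T Q x."""
    return 0.5 * x @ Q @ x

def f_conj(y):
    """Closed form from the example: f*(y) = 1/2 y^T Q^{-1} y."""
    return 0.5 * y @ np.linalg.solve(Q, y)

# The maximizer of y^T x - f(x) solves Q x = y.
x_star = np.linalg.solve(Q, y)
value_at_maximizer = float(y @ x_star - f(x_star))

# Fenchel inequality f(x) + f*(y) >= y^T x on random points x.
rng = np.random.default_rng(0)
xs = 3.0 * rng.standard_normal((2000, 2))
fenchel_gaps = np.array([f(x) + f_conj(y) - y @ x for x in xs])
min_gap = float(fenchel_gaps.min())

print(value_at_maximizer, float(f_conj(y)), min_gap)
```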


Chapter 3

Fundamental Concepts of Optimization

3.1 Introduction

Why are we interested in optimization problems? Optimization is used in many applications from diverse areas:

• Business: allocation of resources in logistics, investment...

• Science: estimation and fitting of models to measurement data, design of experiments...

• Engineering: design and operation of technological systems such as bridges, cars, aircraft, digital devices...

3.1.1 The evolution of optimization

The birth of the theory of extremal problems (minimum/maximum) dates back to centuries before Christ. Ancient mathematicians were interested in a number of questions of isoperimetric type: e.g. what closed curve of a given length encloses the maximum area? This was the age of geometric approaches for solving such optimization problems, and the ancients used these approaches to derive the solutions. However, a rigorous answer to this type of problem was not given until the nineteenth century!

The isoperimetric problem can be traced back to the legendary story of Queen Dido, told by Virgil in his “Aeneid”. In the first chapter of the Aeneid, Virgil tells of the escape of Dido from her treacherous brother. Dido had to choose a tract of land near the future city of Carthage, while satisfying the famous constraint of selecting “a space of ground, which (from the bull’s hide) they first inclosed”. According to the legend, the Phoenicians cut the oxhide into thin strips and enclosed a large expanse. It is now customary to think that Dido's decision reduced to the isoperimetric problem of finding a figure of greatest area among those surrounded by a curve whose length is given. It is not excluded that Dido and her subjects solved the practical versions of the problem, in which the tower was to be located at the sea coast and part of the boundary coastline of the tract was somehow prescribed in advance. The foundation of Carthage is usually dated to the ninth century before Christ, when there was no hint of Euclidean geometry. Rope stretching around stakes leads to convex figures. The Dido problem has a unique solution in the class of convex figures provided that the fixed nonempty part of the boundary is a convex polygonal line.

Figure 3.1: Dido's problem

There are yet other methods that mathematicians in the days before calculus could have used to solve optimization problems, namely algebraic approaches. One of the most elegant is the arithmetic-geometric mean inequality:

\[ \frac{x_1 + x_2 + \cdots + x_n}{n} \ge (x_1 x_2 \cdots x_n)^{1/n} \quad \forall x_i \ge 0, \ n \ge 1, \]

with equality if and only if $x_1 = x_2 = \cdots = x_n$.

For example, to show that of all rectangles with a given area it is the square that has the smallest perimeter, we can use this simple algebraic inequality: if we call the sides of the rectangle $x$ and $y$, then the problem is to determine them so that we minimize

\[ 2(x + y) \quad \text{subject to} \quad xy = A, \]

where $A$ is given. From the arithmetic-geometric mean inequality we get

\[ \frac{x + y}{2} \ge \sqrt{xy} = \sqrt{A}, \]

with equality if and only if $x = y = \sqrt{A}$.
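The rectangle problem can be checked by brute force: eliminating $y = A/x$ leaves the perimeter as a function of $x$ alone, and the bound from the arithmetic-geometric mean inequality says it can never drop below $4\sqrt{A}$. A minimal sketch using only the standard library; the area `A` and the candidate grid are hypothetical choices.

```python
import math

A = 9.0  # hypothetical prescribed area

def perimeter(x):
    """Perimeter 2(x + y) of the rectangle with sides x and y = A / x."""
    return 2.0 * (x + A / x)

# The square has side sqrt(A) and perimeter 4 * sqrt(A).
side = math.sqrt(A)
square_perimeter = perimeter(side)

# Compare against a grid of other admissible side lengths.
candidates = [0.5 + 0.01 * k for k in range(2000)]
smallest_on_grid = min(perimeter(x) for x in candidates)

print(square_perimeter, smallest_on_grid)
```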

With the development of calculus such problems could be treated systematically, and in the twentieth century decision making became a science. A simple civil engineering problem that can be solved using calculus is as follows: given two cities on opposite sides of a river with constant width $w$, located at distances $a$ and $b$ from the river and with lateral separation $d$, we need to find the optimal location where we should build a bridge so as to make the journey between the two cities as short as possible:

\[ \min_x f(x), \]

where $f(x) = \sqrt{x^2 + a^2} + w + \sqrt{b^2 + (d - x)^2}$. Imposing $f'(x) = 0$, i.e. $\frac{x}{\sqrt{x^2 + a^2}} = \frac{d - x}{\sqrt{b^2 + (d - x)^2}}$, we get the optimal location $x^* = \frac{ad}{a + b}$.

Figure 3.2: Optimal location application.
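The closed-form bridge location can be compared with a direct numerical minimization of $f$. The sketch below uses only the standard library and a ternary search, which is a swapped-in technique that works here because $f$ is convex on $[0, d]$; the values of `a`, `b`, `w`, `d` are hypothetical.

```python
import math

# Hypothetical problem data: distances to the river, river width, lateral separation.
a, b, w, d = 2.0, 3.0, 1.0, 10.0

def journey(x):
    """Length of the journey when the bridge is built at lateral position x."""
    return math.sqrt(x * x + a * a) + w + math.sqrt(b * b + (d - x) ** 2)

def ternary_search(f, lo, hi, iters=200):
    """Minimize a convex (unimodal) function on [lo, hi] by interval shrinking."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

x_numeric = ternary_search(journey, 0.0, d)
x_formula = a * d / (a + b)           # closed form obtained from f'(x) = 0

print(x_numeric, x_formula)
```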

3.2 What Characterizes an Optimization Problem?

An optimization problem consists of the following three ingredients:

• an objective function, $f(x)$, that shall be minimized or maximized,

• decision variables, $x$, that can be chosen, and

• constraints that shall be respected, e.g. of the form $g(x) \ge 0$ (inequality constraints) or $h(x) = 0$ (equality constraints).

3.2.1 Mathematical Formulation in Standard Form

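\[ \begin{aligned} \min_{x \in \mathbb{R}^n} \quad & f(x) \\ \text{s.t.} \quad & g(x) \le 0, \\ & h(x) = 0. \end{aligned} \tag{3.1} \]

Here, $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^p$ are usually assumed to be differentiable. Note that the inequalities hold for all components, i.e.

\[ g(x) \le 0 \ \Leftrightarrow \ g_i(x) \le 0, \ i = 1, \ldots, m, \]

\[ h(x) = 0 \ \Leftrightarrow \ h_j(x) = 0, \ j = 1, \ldots, p. \]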

Example 3.2.1

\[ \begin{aligned} \min_{x \in \mathbb{R}^2} \quad & x_1^2 + x_2^2 \\ \text{s.t.} \quad & x_2 - 1 + x_1^2 \le 0 \\ & x_1 x_2 - 1 = 0. \end{aligned} \]

Definition 3.2.2

1. The set $\{ x \in \mathbb{R}^n : f(x) = c \}$ is the level set of $f$ for the value $c$.

2. The feasible set is $X = \{ x \in \mathbb{R}^n : g(x) \le 0, \ h(x) = 0 \}$.

3. The point $x^* \in \mathbb{R}^n$ is a global minimizer (often also called a global minimum) if and only if (iff) $x^* \in X$ and $f(x^*) \le f(x)$ for all $x \in X$.

4. The point $x^* \in \mathbb{R}^n$ is a strict global minimizer iff $x^* \in X$ and $f(x^*) < f(x)$ for all $x \in X \setminus \{x^*\}$.

5. The point $x^* \in \mathbb{R}^n$ is a local minimizer iff $x^* \in X$ and there exists a neighborhood $N$ of $x^*$ (e.g. an open ball around $x^*$) so that $f(x^*) \le f(x)$ for all $x \in X \cap N$.

6. The point $x^* \in \mathbb{R}^n$ is a strict local minimizer iff $x^* \in X$ and there exists a neighborhood $N$ of $x^*$ so that $f(x^*) < f(x)$ for all $x \in (X \cap N) \setminus \{x^*\}$.

Figure 3.3: Local and global minima.

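Example 3.2.3 For the following one-dimensional problem

\[ \begin{aligned} \min_{x \in \mathbb{R}} \quad & \sin x \, \exp x \\ \text{s.t.} \quad & x \ge 0, \ x \le 4\pi, \end{aligned} \]

we have:

• $X = \{ x \in \mathbb{R} : x \ge 0, \ x \le 4\pi \} = [0, 4\pi]$

• Three local minimizers (which?)

• One global minimizer (which?)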
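One way to explore the two questions of Example 3.2.3 is a fine grid search over the feasible interval. The sketch below uses only the standard library; the grid size `N` is a hypothetical choice, and a grid point counts as a local minimizer when it is not larger than its feasible neighbors, so that the interval endpoints are treated consistently with Definition 3.2.2. This is an exploratory check, not a rigorous argument.

```python
import math

def f(x):
    """Objective of the example: sin(x) * exp(x)."""
    return math.sin(x) * math.exp(x)

lo, hi = 0.0, 4.0 * math.pi
N = 40000  # hypothetical grid resolution
xs = [lo + (hi - lo) * k / N for k in range(N + 1)]
vals = [f(x) for x in xs]

# Grid points that are not larger than their neighbors inside [0, 4*pi].
local_minimizers = []
for k in range(N + 1):
    left_ok = k == 0 or vals[k] <= vals[k - 1]
    right_ok = k == N or vals[k] <= vals[k + 1]
    if left_ok and right_ok:
        local_minimizers.append(xs[k])

# Grid point with the smallest objective value.
global_minimizer = min(zip(vals, xs))[1]

print([round(x / math.pi, 4) for x in local_minimizers])  # in multiples of pi
print(round(global_minimizer / math.pi, 4))
```

On this grid the candidates appear at $x = 0$ (a boundary point, where $f$ increases to the right), near $7\pi/4$ and near $15\pi/4$, the last one giving the smallest value.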

An important issue in optimization theory is when minimizers exist.

Theorem 3.2.4 (Weierstrass) If the feasible set $X \subset \mathbb{R}^n$ is non-empty and compact (i.e. bounded and closed) and $f : X \to \mathbb{R}$ is continuous, then there exists a global minimizer of the optimization problem $\min_{x \in X} f(x)$.

Proof: Regard the graph of $f$, $G = \{ (x, t) \in \mathbb{R}^n \times \mathbb{R} : x \in X, \ f(x) = t \}$. Note that $G$ is a compact set, and so is the projection of $G$ onto its last coordinate, the set $\mathrm{Proj}_{\mathbb{R}} G = \{ t \in \mathbb{R} : \exists x \text{ such that } (x, t) \in G \}$, which is a compact subset of $\mathbb{R}$ and therefore contains its smallest element $f_{\min}$. By construction, there must be at least one $x^*$ so that $(x^*, f_{\min}) \in G$.

Thus, minimizers exist under fairly mild circumstances. Though the proof was constructive, it does not lend itself to an efficient algorithm. The topic of this lecture is how to practically find minimizers with the help of computer algorithms.

3.3 Types of Optimization Problems

In order to choose the right algorithm for a practical problem, we should know how to classify it and which mathematical structures can be exploited. Replacing an inadequate algorithm by a suitable one can make solution times many orders of magnitude shorter.

3.3.1 Nonlinear Programming (NLP)

In this lecture we mainly treat algorithms for general Nonlinear Programming (NLP) problems that are given in the form

\[ \begin{aligned} \min_{x \in \mathbb{R}^n} \quad & f(x) \\ \text{s.t.} \quad & g(x) \le 0 \\ & h(x) = 0, \end{aligned} \tag{3.2} \]

where $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^m$, $h : \mathbb{R}^n \to \mathbb{R}^p$ are assumed to be continuously differentiable at least once, often twice and sometimes more.

Many problems have more structure, which we should exploit in order to solve problems faster.


3.3.2 Linear Programming (LP)

When the functions $f$, $g$ and $h$ are affine in the general formulation (??), the general NLP becomes something easier to solve, namely a Linear Program (LP). Explicitly, an LP can be written as follows:

\[ \begin{aligned} \min_{x \in \mathbb{R}^n} \quad & c^T x \\ \text{s.t.} \quad & Ax - b = 0, \\ & Cx - d \le 0. \end{aligned} \tag{3.3} \]

Here, the problem data are given by $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, $C \in \mathbb{R}^{m \times n}$, and $d \in \mathbb{R}^m$. Note that we could also have a constant contribution to the objective, i.e. have $f(x) = c^T x + c_0$, but this would not change the minimizers $x^*$.

LPs have been solved very efficiently since the 1940s, when G. Dantzig invented the famous simplex method, an active set method that is still widely used but has gained an equally efficient competitor in the so-called interior-point methods. Nowadays LPs can be solved even if they have millions of variables and constraints, every business student knows how to use them, and LPs arise in myriads of applications. LP algorithms are not treated in detail in this lecture, but please recognize LPs if you encounter them in practice and use the right software.

Software: CPLEX, SoPlex, lp_solve, LINGO, MATLAB (linprog), SeDuMi, YALMIP.

3.3.3 Quadratic Programming (QP)

If in the general NLP formulation (??) the constraints $g$, $h$ are affine (as for an LP), but the objective is a linear-quadratic function, we call the resulting problem a Quadratic Programming Problem or Quadratic Program (QP). A general QP can be formulated as follows:

\[ \begin{aligned} \min_{x \in \mathbb{R}^n} \quad & \frac{1}{2} x^T Q x + q^T x + r \\ \text{s.t.} \quad & Ax - b = 0 \\ & Cx - d \le 0. \end{aligned} \tag{3.4} \]

Here, in addition to the same problem data as in the LP, we also have the Hessian matrix $Q \in \mathbb{R}^{n \times n}$. Its name stems from the fact that $\nabla_x^2 f(x) = Q$, where $f(x) = \frac{1}{2} x^T Q x + q^T x + r$.

Convex QP. If the Hessian matrix $Q$ is positive semidefinite (i.e. $Q \succeq 0$) we call the QP (??) a convex QP. Convex QPs are tremendously easier to solve globally than non-convex QPs (i.e. where the Hessian $Q$ is not positive semidefinite), which might have different local minima.

Strictly convex QP. If the Hessian matrix $Q$ is positive definite (i.e. $Q \succ 0$) we call the QP (??) a strictly convex QP. Strictly convex QPs are a subclass of convex QPs, and are often still a bit easier to solve than convex QPs that are not strictly convex.

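Example 3.3.1 (non-convex QP)

$$
\begin{aligned}
\min_{x \in \mathbb{R}^2} \quad & \frac{1}{2} x^T \begin{bmatrix} 5 & 0 \\ 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 2 \end{bmatrix}^T x \\
\text{s.t.} \quad & -1 \le x_1 \le 1 \\
& -1 \le x_2 \le 10.
\end{aligned}
$$

This problem has local minimizers at $x_1^* = [0 \;\; -1]^T$ and $x_2^* = [0 \;\; 10]^T$, but only $x_2^*$ is a global minimizer.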
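The claim of Example 3.3.1 can be checked numerically. The following is only a rough sketch in plain Python: the grid step of 0.1 and the helper `is_grid_local_min` are illustrative assumptions, and a grid test is of course no proof of local optimality.

```python
# Objective of Example 3.3.1: f(x) = 0.5*(5*x1^2 - x2^2) + 2*x2
def f(x1, x2):
    return 0.5 * (5.0 * x1**2 - x2**2) + 2.0 * x2

# Feasible box -1 <= x1 <= 1, -1 <= x2 <= 10, sampled with step 0.1 (assumed resolution)
grid = [(i / 10.0, j / 10.0) for i in range(-10, 11) for j in range(-10, 101)]

f_local = f(0.0, -1.0)    # value at x1* = [0, -1]
f_global = f(0.0, 10.0)   # value at x2* = [0, 10]
best_point = min(grid, key=lambda p: f(*p))

def is_grid_local_min(x1, x2, h=0.1):
    """True if no feasible neighbour at distance h per coordinate has a smaller value."""
    for dx1 in (-h, 0.0, h):
        for dx2 in (-h, 0.0, h):
            y1, y2 = x1 + dx1, x2 + dx2
            if -1.0 <= y1 <= 1.0 and -1.0 <= y2 <= 10.0:
                if f(y1, y2) < f(x1, x2) - 1e-12:
                    return False
    return True

print(f_local, f_global, best_point)
```

Both candidate points pass the neighbourhood test, while the interior stationary point $[0 \;\; 2]^T$ does not, because the objective is concave in $x_2$.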

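Example 3.3.2 (strictly convex QP)

$$
\begin{aligned}
\min_{x \in \mathbb{R}^2} \quad & \frac{1}{2} x^T \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 2 \end{bmatrix}^T x \\
\text{s.t.} \quad & -1 \le x_1 \le 1 \\
& -1 \le x_2 \le 10.
\end{aligned}
$$

This problem has only one (strict) local minimizer at $x^* = [0 \;\; -1]^T$, which is also the global minimizer.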

Software: MOSEK, MATLAB (quadprog), SeDuMi, YALMIP.

3.3.4 Convex Optimization (CP)

Both LPs and convex QPs are part of an important class of optimization problems, namely the convex optimization problems. An optimization problem with convex feasible set $X$ and convex objective function $f : \mathbb{R}^n \to \mathbb{R}$ is called a convex optimization problem (CP), i.e.

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{s.t.} \quad & g(x) \le 0 \\
& Ax - b = 0,
\end{aligned}
\tag{3.5}
$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}^m$ are convex functions and the equality constraints are described by an affine function $h(x) = Ax - b$.

Example 3.3.3 (Quadratically Constrained Quadratic Program (QCQP)) A convex optimization problem of the form (3.5) with the functions $f$ and $g_i$ being convex quadratic is called a Quadratically Constrained Quadratic Program (QCQP):

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & \frac{1}{2} x^T Q_0 x + q_0^T x + r_0 \\
\text{s.t.} \quad & \frac{1}{2} x^T Q_i x + q_i^T x + r_i \le 0, \quad i = 1, \dots, m \\
& Ax - b = 0.
\end{aligned}
\tag{3.6}
$$

By choosing $Q_1 = \dots = Q_m = 0$ we obtain a usual QP, and by also setting $Q_0 = 0$ we obtain an LP. Therefore, the class of QCQPs contains both LPs and QPs as subclasses.

Example 3.3.4 (Semidefinite Programming (SDP)) An interesting class of convex optimization problems makes use of linear matrix inequalities (LMI) in order to describe the feasible set. As it involves the constraint that some matrices should remain positive semidefinite, this problem class is called Semidefinite Programming (SDP). A general SDP can be formulated as:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & c^T x \\
\text{s.t.} \quad & A_0 + \sum_{i=1}^n A_i x_i \preceq 0 \\
& Ax - b = 0,
\end{aligned}
\tag{3.7}
$$

where $A_i \in \mathbb{S}^k$ (the symmetric $k \times k$ matrices) for all $i = 0, \dots, n$. It turns out that all LPs, QPs, and QCQPs can also be formulated as SDPs, besides several other convex problems. Semidefinite Programming is a very powerful tool in convex optimization.


Minimizing the largest eigenvalue can be formulated as an SDP. We regard a symmetric matrix $G(x)$ that depends affinely on some design variables $x \in \mathbb{R}^n$, i.e. $G(x) = A_0 + \sum_{i=1}^n A_i x_i$ with $A_i \in \mathbb{S}^n$ for all $i = 0, \dots, n$. If we want to minimize the largest eigenvalue of $G(x)$, i.e. to solve

$$\min_x \; \lambda_{\max}(G(x)),$$

we can formulate this problem as an SDP by adding a slack variable $t \in \mathbb{R}$, as follows:

$$\min_{t \in \mathbb{R},\, x \in \mathbb{R}^n} \; t \quad \text{s.t.} \quad tI - \sum_{i=1}^n A_i x_i - A_0 \succeq 0.$$
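The equivalence behind this reformulation, namely that $tI - G(x) \succeq 0$ holds exactly when $t \ge \lambda_{\max}(G(x))$, can be illustrated for $2 \times 2$ matrices, where eigenvalues have a closed form. This is a small sketch in plain Python; the matrices $A_0$, $A_1$ and the scalar design variable are hypothetical choices, not data from the text.

```python
import math

def lambda_max_2x2(a, b, c):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) + math.sqrt(0.25 * (a - c) ** 2 + b ** 2)

def is_psd_2x2(a, b, c, tol=1e-12):
    """[[a, b], [b, c]] is positive semidefinite iff a >= 0, c >= 0 and det >= 0."""
    return a >= -tol and c >= -tol and a * c - b * b >= -tol

def G(x):
    """Hypothetical affine map G(x) = A0 + A1*x, A0 = [[1, 0], [0, -1]], A1 = [[0, 1], [1, 0]]."""
    return (1.0, x, -1.0)  # entries (a, b, c) of [[a, b], [b, c]]

def lmi_holds(t, x):
    """Check the constraint t*I - G(x) >= 0 in the semidefinite sense."""
    a, b, c = G(x)
    return is_psd_2x2(t - a, -b, t - c)

x = 0.75
lam = lambda_max_2x2(*G(x))   # smallest t that is feasible for this x
```

For a fixed $x$, any $t$ slightly above `lam` satisfies the matrix inequality and any $t$ below it does not, so minimizing $t$ over the feasible set minimizes the largest eigenvalue.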

Software: Excellent tools to formulate and solve convex optimization problems in a MATLAB environment are YALMIP and CVX, which are available as open-source codes and are easy to install. These are interfaces that use e.g. SeDuMi as a solver.

3.3.5 Unconstrained Optimization Problems

Any NLP without constraints is called an unconstrained optimization problem. It has the general form

$$\min_{x \in \mathbb{R}^n} f(x). \tag{3.8}$$

Unconstrained nonlinear optimization will be the focus of Part II of these lecture notes, while general constrained optimization problems are the focus of Part III.

3.3.6 Non-Differentiable Optimization Problems

If one or more of the problem functions $f$, $g$ and $h$ are not differentiable in an optimization problem (??), we speak of a non-differentiable or non-smooth optimization problem. Non-differentiable optimization problems are much harder to solve than general NLPs. A few solvers exist (Microsoft Excel solver, Nelder-Mead method, random search, genetic algorithms, ...), but they are typically much slower than derivative-based methods (which are the topic of this course).


3.3.7 Mixed-Integer Programming (MIP)

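A mixed-integer problem is a problem with integer decisions. An MIP can be formulated as follows:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n,\, z \in \mathbb{Z}^m} \quad & f(x, z) \\
\text{s.t.} \quad & g(x, z) \le 0 \\
& h(x, z) = 0.
\end{aligned}
\tag{3.9}
$$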

Mixed integer non-linear program (MINLP): if the functions $f$, $g$ and $h$ are twice differentiable in $x$ and $z$ we speak of a mixed integer non-linear program. Generally speaking, these problems are very hard to solve, due to the combinatorial nature of the variable $z$. However, if a relaxed problem, where the variables $z$ are no longer restricted to the integers but to the real numbers, is convex, often very efficient solution algorithms exist. More specifically, we would require that the following problem is convex:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m} \quad & f(x, z) \\
\text{s.t.} \quad & g(x, z) \le 0 \\
& h(x, z) = 0.
\end{aligned}
$$

The efficient solution algorithms are often based on the technique of branch-and-bound, which uses partially relaxed problems where some of the $z$ are fixed to specific integer values and some of them are relaxed, and exploits the fact that the optimal value of a relaxed problem can never be worse than that of the best integer solution. This way, the search through the combinatorial tree can be made more efficient than pure enumeration. Two important examples of such problems are given in the following.

Mixed integer linear program (MILP): if the functions $f$, $g$ and $h$ are affine in both $x$ and $z$ we speak of a mixed integer linear program. A famous problem in this class is the traveling salesman problem.

Software: small to medium size (i.e. $n < 100$) problems of this form can be solved efficiently with codes such as the commercial code CPLEX or the free code lp_solve, which has a nice manual at http://lpsolve.sourceforge.net/5.5/.


Mixed integer quadratic programs (MIQP): if $g$ and $h$ are affine functions and $f$ is convex quadratic in both $x$ and $z$ we speak of a mixed integer QP (MIQP).

Software: small to medium size (i.e. $n < 100$) problems of this form are also efficiently solvable, mostly by commercial solvers (e.g. those in TOMLAB).


Part II

Unconstrained Optimization


Chapter 4

Optimality Conditions

In this part of the course we regard unconstrained optimization problems of the form

$$\min_{x \in \mathbb{R}^n} f(x), \tag{4.1}$$

where we assume that the objective function $f : \mathbb{R}^n \to \mathbb{R}$ has an open effective domain $\operatorname{dom} f \subseteq \mathbb{R}^n$. We recall that in this course we assume the function $f$ to be extended to all of $\mathbb{R}^n$ by defining its value to be $+\infty$ outside its domain. Therefore, we are only interested in minimizers that lie inside $\operatorname{dom} f$. We might have $\operatorname{dom} f = \mathbb{R}^n$, but often this is not the case, as in the following example, where $\operatorname{dom} f = (0, \infty)$ and the function takes the value $+\infty$ outside $\operatorname{dom} f$:

$$\min_{x \in \mathbb{R}} \; \frac{1}{x} + x. \tag{4.2}$$

4.1 Necessary Optimality Conditions

A direction $d \in \mathbb{R}^n$ is called a descent direction for the function $f \in C^1$ at the point $x \in \operatorname{dom} f$ if

$$\nabla f(x)^T d < 0.$$


Interpretation: if $d$ is a descent direction at $x \in \operatorname{dom} f$, then the objective function can be improved around $x$. Indeed, as $\operatorname{dom} f$ is open, we can find a $t > 0$ small enough so that for all $\tau \in [0, t]$ we have $x + \tau d \in \operatorname{dom} f$ and $\nabla f(x + \tau d)^T d < 0$ (due to continuity of $\nabla f(\cdot)$ around $x$). By Taylor's Theorem, there exists some $\theta \in (0, t)$ such that

$$f(x + td) = f(x) + t \underbrace{\nabla f(x + \theta d)^T d}_{< 0} < f(x).$$

Theorem 4.1.1 (First Order Necessary Conditions (FONC)) Let $f$ be a continuously differentiable function (i.e. $f \in C^1$) and $x^* \in \operatorname{dom} f$ be a local minimizer of the optimization problem (4.1). Then

$$\nabla f(x^*) = 0. \tag{4.3}$$

Proof: Let us assume by contradiction that $\nabla f(x^*) \ne 0$. Then $d = -\nabla f(x^*)$ is a descent direction, i.e. the objective function can be improved around $x^*$. Indeed, we can find a $t > 0$ small enough such that for all $\tau \in [0, t]$ we have $\nabla f(x^* + \tau d)^T d = -\nabla f(x^* - \tau \nabla f(x^*))^T \nabla f(x^*) < 0$. Moreover, there exists some $\theta \in (0, t)$ satisfying

$$f(x^* - t\nabla f(x^*)) = f(x^*) - t \underbrace{\nabla f(x^* - \theta \nabla f(x^*))^T \nabla f(x^*)}_{> 0} < f(x^*).$$

This is a contradiction with our hypothesis that x∗ is a local minimizer.

Any point x∗ satisfying ∇f(x∗) = 0 is called a stationary point of f .
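As a small worked check of the first order condition, consider the scalar example $f(x) = \frac{1}{x} + x$ with $\operatorname{dom} f = (0, \infty)$ from the beginning of this chapter. The sketch below (plain Python; the finite-difference step and the comparison grid are arbitrary illustrative choices) verifies that $x^* = 1$ is a stationary point and that no grid point has a smaller value.

```python
import math

def f(x):
    """f(x) = 1/x + x on the open domain (0, inf); +inf outside of it."""
    return 1.0 / x + x if x > 0 else math.inf

def df(x):
    """Analytic derivative f'(x) = 1 - 1/x^2, valid for x > 0."""
    return 1.0 - 1.0 / x**2

x_star = 1.0                                     # candidate stationary point
h = 1e-6
fd = (f(x_star + h) - f(x_star - h)) / (2 * h)   # central finite difference

# Values of f on a grid of points in (0, 5]
grid_values = [f(k / 100.0) for k in range(1, 501)]
```

Here stationarity does identify the minimizer because $f$ is convex on its domain; in general a stationary point can also be a maximizer or a saddle point.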

Theorem 4.1.2 (Second Order Necessary Conditions (SONC)) Let $f$ be a twice continuously differentiable function (i.e. $f \in C^2$) and $x^* \in \operatorname{dom} f$ be a local minimizer of the optimization problem (4.1). Then

$$\nabla^2 f(x^*) \succeq 0. \tag{4.4}$$

Proof: If (4.4) does not hold, there exists some $d \in \mathbb{R}^n$ so that $d^T \nabla^2 f(x^*) d < 0$. Then the objective can be improved in the direction $d$, by choosing again a sufficiently small $t > 0$ so that for all $\tau \in [0, t]$ we have $d^T \nabla^2 f(x^* + \tau d) d < 0$ (due to continuity of $\nabla^2 f(\cdot)$ around $x^*$). By Taylor's Theorem, we have for some $\theta \in (0, t)$ that

$$f(x^* + td) = f(x^*) + t \underbrace{\nabla f(x^*)^T d}_{= 0} + \frac{1}{2} t^2 \underbrace{d^T \nabla^2 f(x^* + \theta d) d}_{< 0} < f(x^*),$$

which is a contradiction with x∗ being a local minimizer.

Note that the second order necessary condition (4.4) is not sufficient for a stationary point $x^*$ to be a minimizer. This is illustrated by the functions $f(x) = x^3$ and $f(x) = -x^4$, for which $x^* = 0$ is a saddle point and a maximizer, respectively, although both fulfil SONC.

4.2 Sufficient Optimality Conditions

Theorem 4.2.1 (Second Order Sufficient Conditions (SOSC)) Let $f$ be a twice continuously differentiable function (i.e. $f \in C^2$) and let $x^* \in \operatorname{dom} f$ be a stationary point (i.e. $\nabla f(x^*) = 0$) with $\nabla^2 f(x^*) \succ 0$. Then $x^*$ is a strict local minimizer of $f$.

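Proof: Let $\lambda_{\min}$ be the smallest eigenvalue of $\nabla^2 f(x^*)$. Clearly, $\lambda_{\min} > 0$ since $\nabla^2 f(x^*) \succ 0$, and moreover

$$d^T \nabla^2 f(x^*) d \ge \lambda_{\min} \|d\|^2 \quad \forall d \in \mathbb{R}^n.$$

Using a Taylor expansion and $\nabla f(x^*) = 0$ we have:

$$
\begin{aligned}
f(x^* + d) - f(x^*) &= \nabla f(x^*)^T d + \frac{1}{2} d^T \nabla^2 f(x^*) d + R(\|d\|^2) \\
&\ge \frac{\lambda_{\min}}{2} \|d\|^2 + R(\|d\|^2) = \left( \frac{\lambda_{\min}}{2} + \frac{R(\|d\|^2)}{\|d\|^2} \right) \|d\|^2.
\end{aligned}
$$

Since $\lambda_{\min} > 0$ and $R(\|d\|^2)/\|d\|^2 \to 0$ as $d \to 0$, there exist $\epsilon > 0$ and $\delta > 0$ such that $\frac{\lambda_{\min}}{2} + \frac{R(\|d\|^2)}{\|d\|^2} \ge \frac{\delta}{2}$ for all $0 < \|d\| \le \epsilon$. Hence $f(x^* + d) > f(x^*)$ for all such $d$, i.e. $x^*$ is a strict local minimizer.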

Note that the second order sufficient condition (SOSC) is not necessary for a stationary point $x^*$ to be a strict local minimizer. This is illustrated by the function $f(x) = x^4$, for which $x^* = 0$ is a strict local minimizer with $\nabla^2 f(x^*) = 0$.
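The gap between the necessary and the sufficient second order conditions can be made concrete for scalar functions. The helper `second_order_test` below is a hypothetical illustration in plain Python: it takes the first and second derivative at a point and reports what the conditions allow us to conclude.

```python
def second_order_test(d1, d2, tol=1e-12):
    """Classify a point of a scalar function from f'(x) = d1 and f''(x) = d2."""
    if abs(d1) > tol:
        return "not stationary"
    if d2 > tol:
        return "SOSC holds: strict local minimizer"
    if d2 < -tol:
        return "SONC violated: not a local minimizer"
    return "SONC holds, SOSC fails: inconclusive"

# (f'(0), f''(0)) for four test functions at x* = 0
cases = {
    "x^2":  (0.0, 2.0),   # strict minimizer, detected by SOSC
    "x^3":  (0.0, 0.0),   # saddle point, although SONC holds
    "-x^4": (0.0, 0.0),   # maximizer, although SONC holds
    "x^4":  (0.0, 0.0),   # strict minimizer, although SOSC fails
}
results = {name: second_order_test(d1, d2) for name, (d1, d2) in cases.items()}
```

The three functions with vanishing second derivative are indistinguishable for this test, which is exactly why SONC is not sufficient and SOSC is not necessary.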


4.3 Optimality Condition for Convex Problems

In this section we discuss sufficient conditions for optimality in the convex case. The first result refers to the following constrained optimization problem:

Theorem 4.3.1 Let $X$ be a convex set and $f \in C^1$ (not necessarily convex). For the constrained optimization problem

$$\min_{x \in X} f(x)$$

the following conditions hold:

(i) If $x^*$ is a local minimum, then $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in X$.

(ii) If $f$ is a convex function, then $x^*$ is a local (and hence global) minimum if and only if $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in X$.

Proof: (i) Suppose that there exists a $y \in X$ such that

$$\nabla f(x^*)^T (y - x^*) < 0.$$

By Taylor's theorem, for every $t > 0$ there exists some $\epsilon \in [0, 1]$ such that

$$f(x^* + t(y - x^*)) = f(x^*) + t \nabla f(x^* + \epsilon t (y - x^*))^T (y - x^*).$$

Since $\nabla f$ is continuous, taking $t$ small enough we have $\nabla f(x^* + \epsilon t(y - x^*))^T (y - x^*) < 0$ and thus $f(x^* + t(y - x^*)) < f(x^*)$, which contradicts the fact that $x^*$ is a local minimizer (note that $x^* + t(y - x^*) \in X$ for $t \in [0, 1]$ by convexity of $X$).

(ii) If $f$ is convex, then $f(x) \ge f(x^*) + \nabla f(x^*)^T (x - x^*)$ for all $x \in X$, and since $\nabla f(x^*)^T (x - x^*) \ge 0$ it follows that $f(x) \ge f(x^*)$ for all $x \in X$, i.e. $x^*$ is a global minimizer. The converse implication follows from (i).

Theorem 4.3.2 For a convex optimization problem $\min_{x \in X} f(x)$ (i.e. $X$ is a convex set and $f$ a convex function), every local minimum is also a global one.

Proof: Let $x^*$ be a local minimum of the above convex optimization problem. We will show that for any given point $y \in X$ we have $f(y) \ge f(x^*)$. Indeed, since $x^*$ is a local minimizer, there exists a neighborhood $\mathcal{N}$ of $x^*$ so that for all $x \in X \cap \mathcal{N}$ we have $f(x) \ge f(x^*)$. Let us consider the segment with end points $x^*$ and $y$. This segment is completely contained in $X$ due to convexity of $X$. Now we choose a point $x$ on this segment that is in the neighborhood $\mathcal{N}$ but not equal to $x^*$, i.e. $x = x^* + t(y - x^*)$ with $0 < t \le 1$ and $x \in X \cap \mathcal{N}$. Due to local optimality we have $f(x^*) \le f(x)$, and due to convexity of $f$ we have

$$f(x) = f(x^* + t(y - x^*)) \le f(x^*) + t(f(y) - f(x^*)).$$

It follows that $t(f(y) - f(x^*)) \ge 0$, implying $f(y) - f(x^*) \ge 0$, as desired.

Theorem 4.3.3 (Convex First Order Sufficient Conditions (cFOSC)) Let $f \in C^1$ be convex. If $x^*$ is a stationary point of $f$ (i.e. $\nabla f(x^*) = 0$), then $x^*$ is a global minimizer of the unconstrained convex optimization problem $\min_{x \in \mathbb{R}^n} f(x)$.

Proof: Since $f$ is convex we have

$$f(x) \ge f(x^*) + \underbrace{\nabla f(x^*)^T}_{= 0} (x - x^*) = f(x^*) \quad \forall x \in \mathbb{R}^n.$$

We conclude that for unconstrained optimization problems $\min_{x \in \mathbb{R}^n} f(x)$, where $f \in C^1$, a necessary condition for $x^*$ to be a local optimizer is

$$\nabla f(x^*) = 0. \tag{4.5}$$

In general we will solve the non-linear system of equations $\nabla f(x^*) = 0$ and then check whether the solution is a local minimizer or not using the second order sufficient conditions for optimality. However, for unconstrained convex problems $\min_{x \in \mathbb{R}^n} f(x)$, where $f$ is convex, a necessary and sufficient condition for $x^*$ to be a global optimizer is

$$\nabla f(x^*) = 0. \tag{4.6}$$

Citation [Rockafellar]: “The true watershed in optimization is not between linear andnon-linear, but between convex and non-convex.”.


4.4 Perturbation Analysis

In numerical mathematics, we can never evaluate functions at precisions higher than machine precision. Thus, we usually compute only solutions to slightly perturbed problems, and are most interested in minimizers that are stable against small perturbations. This is the case for strict local minimizers that satisfy the second order sufficient condition.

For this purpose we regard functions $f(x, a)$ that depend not only on $x \in \mathbb{R}^n$ but also on some "disturbance parameter" $a \in \mathbb{R}^m$. We are interested in the parametric family of problems $\min_x f(x, a)$ yielding minimizers $x^*(a)$ depending on $a$.

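Theorem 4.4.1 (Stability of Parametric Solutions) Assume that $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is $C^2$, and regard the minimization of $f(\cdot, \bar a)$ for a given fixed value $\bar a \in \mathbb{R}^m$. If $\bar x$ satisfies the SOSC conditions, i.e. $\nabla_x f(\bar x, \bar a) = 0$ and $\nabla_x^2 f(\bar x, \bar a) \succ 0$, then there is a neighborhood $\mathcal{N} \subset \mathbb{R}^m$ around $\bar a$ so that the parametric minimizer function $x^*(a)$ is well defined for all $a \in \mathcal{N}$, is differentiable in $\mathcal{N}$, and $x^*(\bar a) = \bar x$. Its derivative at $\bar a$ is given by

$$\frac{\partial x^*(\bar a)}{\partial a} = -\left( \nabla_x^2 f(\bar x, \bar a) \right)^{-1} \frac{\partial (\nabla_x f(\bar x, \bar a))}{\partial a}. \tag{4.7}$$

Moreover, each such $x^*(a)$ with $a \in \mathcal{N}$ again satisfies the SOSC conditions and is thus a strict local minimizer.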

Proof: The existence of the differentiable map $x^* : \mathcal{N} \to \mathbb{R}^n$ follows from the implicit function theorem applied to the stationarity condition $\nabla_x f(x^*(a), a) = 0$. We recall the derivation of Eq. (4.7) via

$$0 = \frac{d(\nabla_x f(x^*(a), a))}{da} = \underbrace{\frac{\partial (\nabla_x f(x^*(a), a))}{\partial x}}_{= \nabla_x^2 f} \cdot \frac{\partial x^*(a)}{\partial a} + \frac{\partial (\nabla_x f(x^*(a), a))}{\partial a}.$$

The fact that all points $x^*(a)$ satisfy the SOSC conditions follows from continuity of the second derivative.
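The sensitivity formula (4.7) can be checked on a scalar toy problem. In the sketch below (plain Python) the function $f(x, a) = \frac{1}{2}x^2 + \frac{1}{4}x^4 - ax$ is a hypothetical example chosen so that the SOSC holds at every stationary point, and the parametric minimizer is computed by Newton's method on the stationarity condition.

```python
def grad_x(x, a):
    """d/dx of the hypothetical f(x, a) = x^2/2 + x^4/4 - a*x."""
    return x + x**3 - a

def hess_x(x, a):
    """Second derivative in x; positive everywhere, so SOSC holds at every stationary point."""
    return 1.0 + 3.0 * x**2

def minimizer(a, x=0.0, iters=50):
    """Solve grad_x(x, a) = 0 by Newton's method to obtain x*(a)."""
    for _ in range(iters):
        x -= grad_x(x, a) / hess_x(x, a)
    return x

a_bar = 2.0
x_bar = minimizer(a_bar)               # x + x^3 = 2 has the solution x = 1

# Formula (4.7): dx*/da = -(hess_x)^{-1} * d(grad_x)/da, with d(grad_x)/da = -1 here
sensitivity = -(1.0 / hess_x(x_bar, a_bar)) * (-1.0)

# Finite-difference check on the parametric minimizer
h = 1e-5
fd = (minimizer(a_bar + h) - minimizer(a_bar - h)) / (2 * h)
```

The analytic sensitivity and the finite-difference quotient agree up to the discretization error, without differentiating the solver itself.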


Chapter 5

Convergence of descent direction methods

In Chapter 4 we have seen that to find a local/global minimal point of an unconstrained optimization problem

$$\min_{x \in \mathbb{R}^n} f(x), \tag{5.1}$$

we have to solve a system of $n$ equations $\nabla f(x^*) = 0$ with $n$ unknowns $x^*$. In some cases this system can be solved analytically:

Example 5.0.2 (Unconstrained QP) Let us consider the unconstrained convex QP

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x + q^T x + r, \tag{5.2}$$

where $Q \succ 0$. Due to the condition $0 = \nabla f(x^*) = Qx^* + q$, its unique optimizer is $x^* = -Q^{-1} q$. The optimal value of (5.2) is given by the following basic relation:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x + q^T x + r = -\frac{1}{2} q^T Q^{-1} q + r. \tag{5.3}$$
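Relation (5.3) is easy to confirm numerically on a small instance. The sketch below uses plain Python with a hand-written $2 \times 2$ solve; the data $Q$, $q$, $r$ are hypothetical values chosen only so that $Q$ is positive definite.

```python
def solve_2x2(Q, rhs):
    """Solve Q z = rhs for a 2x2 matrix Q by Cramer's rule."""
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

def f(Q, q, r, x):
    """Quadratic objective 0.5*x^T Q x + q^T x + r."""
    Qx = [Q[0][0] * x[0] + Q[0][1] * x[1], Q[1][0] * x[0] + Q[1][1] * x[1]]
    return 0.5 * (x[0] * Qx[0] + x[1] * Qx[1]) + q[0] * x[0] + q[1] * x[1] + r

# Hypothetical data with Q positive definite
Q = [[4.0, 1.0], [1.0, 3.0]]
q = [1.0, 2.0]
r = 5.0

Qinv_q = solve_2x2(Q, q)
x_star = [-v for v in Qinv_q]                               # x* = -Q^{-1} q
f_star = -0.5 * (q[0] * Qinv_q[0] + q[1] * Qinv_q[1]) + r   # right-hand side of (5.3)
grad = [Q[0][0] * x_star[0] + Q[0][1] * x_star[1] + q[0],
        Q[1][0] * x_star[0] + Q[1][1] * x_star[1] + q[1]]   # Q x* + q, should vanish
```

Evaluating the objective at `x_star` reproduces `f_star`, and the gradient there vanishes up to rounding.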

However, in most cases $\nabla f(x^*) = 0$ is a system of nonlinear equations that cannot be solved analytically, and we have to resort to iterative methods for solving such a system. For these cases numerical methods were developed, which we will present and analyze in the sequel.


5.1 Numerical methods

Let us consider the optimization problem (5.1). There are different iterative methods for solving such a problem, and we will discuss some of them in the next chapters. Since there are different methods, we search for the most "efficient" one. In general we assume that our problem belongs to a certain class of problems $\mathcal{F}$: a numerical method is usually developed for solving different problems with similar characteristics (smoothness, convexity, ...). What is known about our problem is called a model (i.e. the formulation of the problem, the functionals that describe the problem, etc.). To solve the problem, a numerical method has to collect specific information, and this process of collecting data is carried out with an oracle (i.e. a unit that answers the successive questions of the method). The numerical method solves the problem by collecting data and manipulating the answers of the oracle. We have different types of oracles:

0. zero-order oracles $\mathcal{O}_0$, which only evaluate the function $f(x)$

1. first-order oracles $\mathcal{O}_1$, which evaluate the function and its gradient, $f(x)$ and $\nabla f(x)$

2. second-order oracles $\mathcal{O}_2$, which evaluate the function, its gradient and its Hessian, $f(x)$, $\nabla f(x)$ and $\nabla^2 f(x)$

The efficiency of a numerical method is measured by the numerical effort it requires to solve a certain class of problems. In some cases solving a problem means finding an exact solution, but in most cases only an approximation of the solution is possible. Therefore, we are content with an approximate solution of prescribed accuracy $\epsilon$. In general this accuracy also serves as the stopping criterion of the chosen numerical method. For the particular case of (5.1) the stopping criterion is

$$\|\nabla f(x)\| \le \epsilon.$$

The general scheme of an iterative algorithm contains the following steps:

0. start with a given point $x_0$, an accuracy $\epsilon > 0$ and the counter $k = 0$

1. at step $k$ we denote by $I_k$ the set containing all the information accumulated from the oracle up to the $k$-th iteration

1.1 call the oracle $\mathcal{O}$ at the point $x_k$

1.2 update the information $I_{k+1} = I_k \cup \mathcal{O}(x_k)$

1.3 apply the rules of the numerical method using the latest information $I_{k+1}$ to compute the next iterate $x_{k+1}$

2. check the stopping criterion; if it is not satisfied, set $k = k + 1$ and repeat step 1.
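The scheme above can be sketched as a generic loop in which the oracle and the update rule are passed in as functions. This is a minimal illustration in plain Python, not a reference implementation: the test function, the names `first_order_oracle` and `gradient_rule`, and the fixed step 0.2 are all assumptions. For convenience the stopping criterion is evaluated right after the oracle call, since the gradient is available there.

```python
import math

def first_order_oracle(x):
    """Hypothetical oracle O1 for f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2: returns f(x) and its gradient."""
    value = (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2
    gradient = [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]
    return value, gradient

def gradient_rule(x, info, step=0.2):
    """Rule of the method: a plain gradient step using only the latest oracle answer."""
    _, g = info[-1]
    return [xi - step * gi for xi, gi in zip(x, g)]

def iterative_scheme(x0, eps, oracle, rule, max_iter=10_000):
    x, info = list(x0), []                  # step 0: starting point, empty information set
    for k in range(max_iter):
        info.append(oracle(x))              # steps 1.1-1.2: call the oracle, store the answer
        _, g = info[-1]
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:   # step 2: stopping criterion
            return x, len(info)
        x = rule(x, info)                   # step 1.3: compute the next iterate
    return x, len(info)

x_approx, oracle_calls = iterative_scheme([0.0, 0.0], 1e-8, first_order_oracle, gradient_rule)
```

The returned number of oracle calls is exactly the analytical complexity of this run, while the arithmetic inside `oracle` and `rule` accounts for the arithmetical complexity.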

The complexity of a numerical method can be measured as:

• analytical complexity, which is given by the total number of oracle calls

• arithmetical complexity, which is given by the total number of arithmetical operations

The speed or rate of convergence refers to how fast the sequence $\{x_k\}_k$ approaches the solution $x^*$. The order of convergence is the largest positive number $p$ that satisfies the following relation:

$$0 \le \limsup_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} < \infty,$$

where we recall that the "lim sup" of a sequence $\{z_k\}_k$ is defined as:

$$\limsup_{k \to \infty} z_k = \lim_{n \to \infty} y_n, \quad \text{where } y_n = \sup_{k \ge n} z_k.$$

Assuming that the limit exists, $p$ indicates the behavior of the tail of the sequence. When $p$ is large, the convergence is fast, because the number of correct digits is multiplied by roughly $p$ in each step:

$$\|x_{k+1} - x^*\| \approx \beta \|x_k - x^*\|^p.$$

Linear convergence: if $\beta \in (0, 1)$ and $p = 1$ then

$$\|x_{k+1} - x^*\| \le \beta \|x_k - x^*\|$$

and thus $\|x_k - x^*\| \le c \beta^k$ with $c = \|x_0 - x^*\|$. For example, $x_k = \beta^k$, where $\beta \in (0, 1)$.


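Superlinear convergence: if $\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0$ (here $p = 1$ as well), or equivalently

$$\|x_{k+1} - x^*\| \le \beta_k \|x_k - x^*\| \quad \text{with } \beta_k \to 0.$$

For example, $x_k = \frac{1}{k!}$ converges superlinearly, as $\frac{x_{k+1}}{x_k} = \frac{1}{k+1}$.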

Quadratic convergence: if $\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} = \beta$, where $\beta \in (0, \infty)$ and $p = 2$, which means that for large $k$

$$\|x_{k+1} - x^*\| \approx \beta \|x_k - x^*\|^2.$$

For example, $x_k = \frac{1}{2^{2^k}}$ converges quadratically, because $\frac{x_{k+1}}{(x_k)^2} = \frac{(2^{2^k})^2}{2^{2^{k+1}}} = 1 < \infty$. For $k = 6$, $x_k = \frac{1}{2^{64}} \approx 0$, so in practice convergence up to machine precision is reached after roughly 6 iterations.
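The three example sequences can be compared by computing the quotients from the definitions directly. The sketch below uses exact rational arithmetic from the Python standard library, so the quotients are not affected by rounding; the helper `ratios` and the sequence lengths are illustrative choices.

```python
from fractions import Fraction
from math import factorial

def ratios(seq, p):
    """Quotients |x_{k+1}| / |x_k|^p for a sequence converging to x* = 0."""
    return [seq[k + 1] / seq[k] ** p for k in range(len(seq) - 1)]

linear      = [Fraction(1, 2) ** k for k in range(8)]         # x_k = beta^k with beta = 1/2
superlinear = [Fraction(1, factorial(k)) for k in range(8)]   # x_k = 1/k!
quadratic   = [Fraction(1, 2 ** (2 ** k)) for k in range(7)]  # x_k = 2^(-2^k)

r_lin  = ratios(linear, 1)        # constant quotient beta
r_sup  = ratios(superlinear, 1)   # quotients 1/(k+1), tending to 0
r_quad = ratios(quadratic, 2)     # constant quotient 1
x6 = float(quadratic[6])          # 2^(-64), far below double-precision resolution near 1
```

The first-order quotients of the linear sequence stay at $\beta$, those of the superlinear one tend to zero, and only the quadratic sequence keeps a bounded quotient with $p = 2$.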

R-convergence: If the norm sequence $\|x_k - x^*\|$ is upper bounded by some sequence $y_k \to 0$, i.e. $\|x_k - x^*\| \le y_k$, and if $y_k$ converges with a given rate, i.e. linearly, superlinearly or quadratically, then $x_k$ is said to converge R-linearly, R-superlinearly or R-quadratically to $x^*$. Here, "R" indicates "root", because, e.g., R-linear convergence can also be defined via the root criterion $\limsup_{k \to \infty} \sqrt[k]{\|x_k - x^*\|} < 1$.

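Example 5.1.1

$$x_k = \begin{cases} \dfrac{1}{2^k} & \text{if } k \text{ is even} \\ 0 & \text{else} \end{cases} \tag{5.4}$$

This sequence converges R-linearly and fast, but not as regularly as a Q-linearly convergent sequence, i.e. one that converges linearly in the quotient sense defined above.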

Remark 5.1.2 The three different convergence rates and the three different R-convergence rates are related to each other as follows. Here, $X \Rightarrow Y$ should be read as "if a sequence converges with rate $X$, then it also converges with rate $Y$".

$$
\begin{array}{ccccc}
\text{quadratically} & \Rightarrow & \text{superlinearly} & \Rightarrow & \text{linearly} \\
\Downarrow & & \Downarrow & & \Downarrow \\
\text{R-quadratically} & \Rightarrow & \text{R-superlinearly} & \Rightarrow & \text{R-linearly}
\end{array}
$$

Note that the quadratic rate is the fastest of these.


5.2 Convergence theorems

Considering a metric space $(X, \rho)$, a numerical method can be viewed as a point-to-set map $M : X \to 2^X$, with iterates satisfying $x_{k+1} \in M(x_k)$. The degree of freedom given by the choice $x_{k+1} \in M(x_k)$ makes it possible to account for the specific details of various methods. However, a numerical method is not a random process, since it generates the same sequence $\{x_k\}_k$ whenever we start from the same initial point $x_0$.

Example 5.2.1 We consider the following point-to-set map

$$x_{k+1} \in \left[ -\frac{|x_k|}{n}, \frac{|x_k|}{n} \right]$$

and a particular realization is, for given $x_0$,

$$x_{k+1} = \frac{|x_k|}{n}.$$

Definition 5.2.2 Given a metric space $(X, \rho)$, a subset $S \subseteq X$ and a numerical method given as a point-to-set map $M : X \to 2^X$, a function $\phi : X \to \mathbb{R}$ is called a decreasing function for the pair $(S, M)$ if it satisfies the following two conditions:

(i) for all $x \in S$ and $y \in M(x)$ we have $\phi(y) \le \phi(x)$

(ii) for all $x \in X \setminus S$ and $y \in M(x)$ we have $\phi(y) < \phi(x)$

Example 5.2.3 Consider the optimization problem $\min_{x \in X} f(x)$, where $X$ is a convex set and $f$ is differentiable. We define $S = \{x^* \in \mathbb{R}^n : \langle \nabla f(x^*), x - x^* \rangle \ge 0 \;\; \forall x \in X\}$, the set of stationary points (i.e. the set of possible solutions: local minimal points, global minimal points, local maximal points, etc.). Note also that in general we choose $\phi = f$, i.e. the method chooses $x_{k+1}$ such that $f(x_{k+1}) \le f(x_k)$.

Definition 5.2.4 A point-to-set map $M : X \to 2^X$ is called closed at the point $x_0$ if for all sequences $x_k \to x_0$ and $y_k \to y_0$ with $y_k \in M(x_k)$ we have $y_0 \in M(x_0)$. The map $M$ is closed if it is closed at all points.


Theorem 5.2.5 (Convergence theorem) Let $M$ be a numerical method on the metric space $(X, \rho)$ and $\{x_k\}_k$ a sequence with $x_{k+1} \in M(x_k)$. Let $S$ be the solution set and assume that the following conditions hold:

(i) the sequence $\{x_k\}_k$ is contained in a compact set

(ii) $M$ is a closed point-to-set map on $X \setminus S$

(iii) there exists a continuous $\phi$ that is a decreasing function for $(S, M)$

Then all the limit points of the sequence $\{x_k\}_k$ belong to the set $S$.

5.2.1 Convergence of descent direction methods

We now consider an iterative method

$$x_{k+1} = x_k + \alpha_k d_k,$$

where we assume that $d_k$ is a descent direction for $f$ at $x_k$ and $\alpha_k \in [0, 1]$ is the step size. We recall that if $d_k$ is a descent direction for $f$ at $x_k$, then there exists $\alpha_k > 0$ sufficiently small such that $f(x_{k+1}) < f(x_k)$.

In the ideal case the step size is chosen as

$$\alpha_k = \arg\min_{\alpha \ge 0} f(x_k + \alpha d_k).$$

However, in many situations this univariate optimization problem is very difficult to solve. Therefore, other rules for choosing $\alpha_k$ were developed, among which the most practical are the Wolfe conditions: choose $\alpha_k$ such that the following two conditions are satisfied

(W1) $f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k$, where $c_1 \in (0, 1)$

(W2) $\nabla f(x_k + \alpha_k d_k)^T d_k \ge c_2 \nabla f(x_k)^T d_k$, where $0 < c_1 < c_2 < 1$.
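To see what the two conditions exclude, one can evaluate them for several trial steps along a fixed descent direction. The following plain-Python sketch uses a hypothetical quadratic test function; the constants $c_1 = 10^{-4}$ and $c_2 = 0.9$ are commonly used values, assumed here only for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Return the pair (W1, W2) for the trial step alpha along the direction d."""
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    slope0 = dot(grad(x), d)                       # grad f(x)^T d, negative for a descent direction
    w1 = f(x_new) <= f(x) + c1 * alpha * slope0    # sufficient decrease
    w2 = dot(grad(x_new), d) >= c2 * slope0        # curvature condition
    return w1, w2

# Hypothetical test problem: f(x) = x1^2 + 10*x2^2 with the anti-gradient direction
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: [2.0 * x[0], 20.0 * x[1]]
x = [1.0, 1.0]
d = [-g for g in grad(x)]

too_short = satisfies_wolfe(f, grad, x, d, 1e-4)   # tiny step: W1 holds, W2 fails
too_long  = satisfies_wolfe(f, grad, x, d, 1.0)    # long step: W1 fails
accepted  = satisfies_wolfe(f, grad, x, d, 0.05)   # intermediate step: both hold
```

Condition (W1) rules out steps that are too long to give sufficient decrease, while (W2) rules out steps that are so short that the slope along $d_k$ has barely changed.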



We can also choose the step size $\alpha_k$ in another way, which avoids searching for an $\alpha_k$ satisfying the second Wolfe condition, namely by backtracking:

0. choose $\alpha > 0$ and $\rho, c_1 \in (0, 1)$

1. while $f(x_k + \alpha d_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^T d_k$, update $\alpha = \rho \alpha$

2. set $\alpha_k = \alpha$.

In general the initial value is $\alpha = 1$. Note that backtracking finds $\alpha_k$ in a finite number of steps. Moreover, the $\alpha_k$ found with this method is not too small, since it is close to $\alpha_k / \rho$, which was rejected at the previous iteration because it was too long a step.
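The backtracking steps above translate almost line by line into code. This is a minimal sketch in plain Python; the default values $\rho = 0.5$ and $c_1 = 10^{-4}$, the safeguard `max_halvings` and the test function are assumptions made for the illustration.

```python
def backtracking(f, grad_fx, x, d, alpha=1.0, rho=0.5, c1=1e-4, max_halvings=60):
    """Shrink alpha by the factor rho until the first Wolfe condition (W1) holds."""
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(grad_fx, d))   # grad f(x)^T d, negative for descent
    for _ in range(max_halvings):
        x_trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(x_trial) <= fx + c1 * alpha * slope:
            break
        alpha *= rho
    return alpha

# Hypothetical test problem: f(x) = x1^2 + 10*x2^2, anti-gradient direction at x = (1, 1)
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
x = [1.0, 1.0]
g = [2.0 * x[0], 20.0 * x[1]]
d = [-gi for gi in g]

alpha_k = backtracking(f, g, x, d)
x_next = [xi + alpha_k * di for xi, di in zip(x, d)]
```

On this example the initial step $\alpha = 1$ is reduced four times before (W1) is satisfied, and the last rejected step $\alpha_k / \rho$ indeed violates the sufficient decrease condition.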

Theorem 5.2.6 (Convergence theorem for descent direction methods) Let $\min_{x \in \mathbb{R}^n} f(x)$ be an optimization problem where $f \in C^1$, and consider the iterative method $x_{k+1} = x_k + \alpha_k d_k$, where $d_k$ is a descent direction. Furthermore, we assume that $f$ is bounded from below, that the step size $\alpha_k$ is chosen to satisfy the two Wolfe conditions (W1) and (W2), and that $\nabla f$ is Lipschitz. Then

$$\sum_{k=0}^{\infty} \cos^2 \theta_k \|\nabla f(x_k)\|^2 < \infty,$$

where $\theta_k$ is the angle between $d_k$ and $-\nabla f(x_k)$.

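Proof: Since $\nabla f$ is Lipschitz, there exists $L > 0$ so that

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \forall x, y.$$

From (W2) we have:

$$(\nabla f(x_{k+1}) - \nabla f(x_k))^T d_k \ge (c_2 - 1) \nabla f(x_k)^T d_k.$$

Using the Cauchy-Schwarz inequality we obtain

$$\|\nabla f(x_{k+1}) - \nabla f(x_k)\| \|d_k\| \ge (c_2 - 1) \nabla f(x_k)^T d_k.$$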


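Using now the Lipschitz property of the gradient and $x_{k+1} - x_k = \alpha_k d_k$ we obtain

$$L \alpha_k \|d_k\|^2 \ge (c_2 - 1) \nabla f(x_k)^T d_k,$$

i.e.

$$\alpha_k \ge \frac{c_2 - 1}{L} \frac{\nabla f(x_k)^T d_k}{\|d_k\|^2}.$$

From (W1), using $\nabla f(x_k)^T d_k < 0$, we have:

$$f(x_{k+1}) \le f(x_k) + c_1 \frac{c_2 - 1}{L} \frac{(\nabla f(x_k)^T d_k)^2}{\|d_k\|^2},$$

which leads to

$$f(x_{k+1}) \le f(x_k) - c_1 \frac{1 - c_2}{L} \frac{(\nabla f(x_k)^T d_k)^2}{\|d_k\|^2 \|\nabla f(x_k)\|^2} \|\nabla f(x_k)\|^2.$$

In conclusion, using the notation $c = c_1 \frac{1 - c_2}{L}$ we get:

$$f(x_{k+1}) \le f(x_k) - c \cos^2 \theta_k \|\nabla f(x_k)\|^2,$$

and thus by summing these inequalities we get:

$$f(x_N) \le f(x_0) - c \sum_{j=0}^{N-1} \cos^2 \theta_j \|\nabla f(x_j)\|^2.$$

Since $f$ is bounded from below, letting $N \to \infty$ we obtain

$$\sum_{k=0}^{\infty} \cos^2 \theta_k \|\nabla f(x_k)\|^2 < \infty,$$

i.e. $\cos^2 \theta_k \|\nabla f(x_k)\|^2 \to 0$.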

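Note that if $\theta_k \in [0, \frac{\pi}{2} - \delta]$ for all $k$ and some $\delta > 0$, then $\cos^2 \theta_k \ge \sin^2 \delta > 0$ and thus $\|\nabla f(x_k)\| \to 0$, i.e. every limit point of $\{x_k\}$ is a stationary point.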


Chapter 6

First order methods

In this chapter we present first order methods (i.e. methods based on evaluation of the function and its gradient) for solving the following unconstrained optimization problem:

$$f^* = \min_{x \in \mathbb{R}^n} f(x),$$

with $f \in C^2$.

6.1 The gradient method (steepest descent method)

The gradient method is based on the following iteration:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k),$$

where the step size $\alpha_k$ is chosen using one of the three methods presented in the previous chapter: either the ideal step size, or a step satisfying the Wolfe conditions, or backtracking.

Interpretation:

1. The direction $d = -\nabla f(x)$ used in the gradient method is a descent direction, since $\nabla f(x)^T d = -\|\nabla f(x)\|^2 < 0$ for all $x$ satisfying $\nabla f(x) \ne 0$.


2. The iterate $x_{k+1}$ is obtained by solving the following convex quadratic problem:

$$x_{k+1} = \arg\min_{y \in \mathbb{R}^n} \; f(x_k) + \nabla f(x_k)^T (y - x_k) + \frac{1}{2\alpha_k} \|y - x_k\|^2,$$

i.e. we approximate the objective function $f$ locally around $x_k$ using a quadratic model with the Hessian given by $Q = \frac{1}{\alpha_k} I_n$.

3. The gradient method has the fastest local decrease, which is why it is also called the steepest descent method: indeed, for all directions $d$ with $\|d\| = 1$ we have

$$f(x + \alpha d) = f(x) + \alpha \nabla f(x)^T d + R(\alpha).$$

From the Cauchy-Schwarz inequality we also get

$$\nabla f(x)^T d \ge -\|\nabla f(x)\| \|d\| = -\|\nabla f(x)\|.$$

Using this inequality and also taking the particular direction $d_0 = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$ we get

$$f(x + \alpha d) \ge f(x) - \alpha \|\nabla f(x)\| + R(\alpha),$$

while

$$f(x + \alpha d_0) = f(x) - \alpha \|\nabla f(x)\| + R(\alpha),$$

i.e. to first order the largest decrease is obtained for the anti-gradient direction $d_0$.

6.1.1 Convergence of the gradient method

Theorem 6.1.1 If the following conditions hold:

(i) $f$ is differentiable with $\nabla f$ continuous (i.e. $f \in C^1$)

(ii) the level set $S_f(x_0) = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is compact for all initial points $x_0$

(iii) the step size satisfies the first Wolfe condition (W1),

then any limit point of the sequence $\{x_k\}_k$ is a stationary point.


Proof: We base our proof on the general convergence theorem presented in the previous chapter. We define the map

$$M(x) = x - \alpha \nabla f(x).$$

Since $f$ is differentiable with $\nabla f$ continuous, it follows that $M$ is a continuous point-to-point map and thus a closed map. We define $S = \{x^* \in \mathbb{R}^n : \nabla f(x^*) = 0\}$, the solution set (i.e. the set of stationary points). Moreover, $\{x_k\}_k \subseteq S_f(x_0)$, i.e. the sequence generated by the gradient method is contained in a compact set. We also choose $\phi = f$, which is a decreasing function since the first Wolfe condition holds. In conclusion, the general convergence theorem can be applied and thus any limit point of the sequence lies in $S$. Note that since $\{x_k\}$ is bounded there exists at least one convergent subsequence.

Theorem 6.1.2 Let $f$ be a differentiable function with Lipschitz gradient (with Lipschitz constant $L > 0$) that is bounded from below. Moreover, let the step size $\alpha_k$ be chosen to satisfy the Wolfe conditions. Then the gradient method is convergent in the sense that $\nabla f(x_k) \to 0$.

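Proof: Note that in this particular case the angle between $d_k = -\nabla f(x_k)$ and $-\nabla f(x_k)$ is

$$\theta_k = 0.$$

In conclusion, by Theorem 5.2.6,

$$\sum_{k \ge 0} \cos^2 \theta_k \|\nabla f(x_k)\|^2 = \sum_{k \ge 0} \|\nabla f(x_k)\|^2 < \infty.$$

It follows that the sequence $x_k$ has the property $\nabla f(x_k) \to 0$ as $k \to \infty$.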

Remark 6.1.3 We observe that from the first convergence theorem for the gradient method we obtain that some subsequence of $\{x_k\}_k$ converges to a stationary point $x^*$, while from the second convergence theorem we obtain $\nabla f(x_k) \to 0$.

6.1.2 Choosing optimal step size α

In the case when the step size is constant over all iterations, i.e. $x_{k+1} = x_k - \alpha \nabla f(x_k)$, we are interested in finding the optimal $\alpha$ that guarantees the fastest convergence rate. We assume $\nabla f$ to be Lipschitz with constant $L > 0$. It follows that

$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|x - y\|^2 \quad \forall x, y.$$

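Therefore, taking $x = x_k$ and $y = x_{k+1}$,

$$f(x_{k+1}) \le f(x_k) - \alpha \|\nabla f(x_k)\|^2 + \frac{L}{2} \alpha^2 \|\nabla f(x_k)\|^2 = f(x_k) - \alpha \left( 1 - \frac{L}{2} \alpha \right) \|\nabla f(x_k)\|^2.$$

The step size that guarantees the largest decrease per iteration is obtained from the condition

$$\max_{\alpha > 0} \; \alpha \left( 1 - \frac{L}{2} \alpha \right),$$

i.e.

$$\alpha^* = \frac{1}{L}.$$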

For the gradient method with constant step the optimal step size is thus $\alpha = \frac{1}{L}$. In this case the decrease at each step is given by

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2,$$

and summing up these inequalities we obtain

$$f(x_{N+1}) \le f(x_0) - \frac{1}{2L} \sum_{k=0}^{N} \|\nabla f(x_k)\|^2,$$

i.e.

$$\frac{1}{2L} \sum_{k=0}^{N} \|\nabla f(x_k)\|^2 \le f(x_0) - f(x_{N+1}) \le f(x_0) - f^*.$$

Let us define

$$\|\nabla f_N\| = \min_{k = 0, \dots, N} \|\nabla f(x_k)\|.$$

It follows that

$$\frac{1}{2L} (N + 1) \|\nabla f_N\|^2 \le f(x_0) - f^*.$$

In conclusion, after $N$ steps the following convergence rate is obtained:

$$\|\nabla f_N\| \le \frac{1}{\sqrt{N + 1}} \sqrt{2L (f(x_0) - f^*)},$$

i.e. the gradient method has in this case a sublinear convergence rate.
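The per-step decrease and the resulting sublinear bound can be observed on a simple test problem. The sketch below (plain Python) assumes the hypothetical quadratic $f(x) = \frac{1}{2}(x_1^2 + 10 x_2^2)$, for which the Lipschitz constant of the gradient is $L = 10$ and $f^* = 0$; the starting point and the number of steps are arbitrary.

```python
import math

# Hypothetical test problem: f(x) = 0.5*(x1^2 + 10*x2^2), so L = 10 and f* = 0
L = 10.0
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]
norm = lambda v: math.sqrt(sum(vi * vi for vi in v))

x0 = [4.0, 1.0]
f_star = 0.0
N = 50

x, grad_norms, values = list(x0), [], []
for k in range(N + 1):                          # iterates x_0, ..., x_N
    g = grad(x)
    grad_norms.append(norm(g))
    values.append(f(x))
    x = [xi - gi / L for xi, gi in zip(x, g)]   # constant step alpha = 1/L

best_grad_norm = min(grad_norms)                                   # smallest gradient norm over x_0..x_N
bound = math.sqrt(2.0 * L * (f(x0) - f_star)) / math.sqrt(N + 1)   # right-hand side of the rate
```

On this convex example the observed gradient norm decays much faster than the bound, which is a worst-case guarantee over all functions with $L$-Lipschitz gradient.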

Note that nothing can be said in this case about the convergence of $x_k$ to some stationary point $x^*$ or of $f(x_k)$ to the optimal value $f^*$. This type of convergence can be derived in the convex case.


6.2 The conjugate directions method

The conjugate directions method is also a first order method, i.e. it only uses the value of the function and of the gradient (first order oracle), but it improves on the convergence rate of the gradient method, at least for the quadratic case. Let us consider the following strictly convex QP:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x - q^T x, \quad Q \succ 0,$$

whose optimal solution is obtained by solving the following linear system of equations:

$$Q x^* = q.$$

Since $Q$ is invertible, $x^* = Q^{-1} q$. However, in many cases the computation of the inverse is very expensive; in the general case it has complexity $O(n^3)$. In the sequel we will present a less expensive numerical method for computing the solution $x^*$.

Definition 6.2.1 Two vectors $d_1$ and $d_2$ are called Q-orthogonal if $d_1^T Q d_2 = 0$. A set of vectors $\{d_1, d_2, \dots, d_k\}$ is called Q-orthogonal if $d_i^T Q d_j = 0$ for all $i \ne j$.

Note that if $Q \succ 0$ and $\{d_1, d_2, \dots, d_k\}$ are Q-orthogonal and different from zero, then they are linearly independent. Moreover, for $k = n$ they form a basis of $\mathbb{R}^n$. In conclusion, if $\{d_1, d_2, \dots, d_n\}$ is Q-orthogonal with nonzero elements, there exist $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ such that $x^* = \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_n d_n$ (i.e. a linear combination of the basis vectors). Multiplying this expansion by $d_i^T Q$ and using Q-orthogonality we find

$$\alpha_i = \frac{d_i^T Q x^*}{d_i^T Q d_i} = \frac{d_i^T q}{d_i^T Q d_i}.$$


We conclude that

\[ x^* = \sum_{i=1}^{n} \frac{d_i^T q}{d_i^T Q d_i}\, d_i \]

and thus $x^*$ can be obtained through an iterative process in which at step $i$ we add the term $\alpha_i d_i$.

Theorem 6.2.2 Let $d_0, d_1, \dots, d_{n-1}$ be a Q-orthogonal set of nonzero vectors. For any $x_0 \in \mathbb{R}^n$ the sequence $x_k$ generated by

\[ x_{k+1} = x_k + \alpha_k d_k, \qquad \alpha_k = -\frac{r_k^T d_k}{d_k^T Q d_k}, \qquad r_k = Q x_k - q, \]

converges to $x^*$ after $n$ steps, i.e. $x_n = x^*$.

Note that the residual $r_k = Q x_k - q$ coincides with the gradient of the quadratic objective function.

Theorem 6.2.3 Let $d_0, d_1, \dots, d_{n-1}$ be a Q-orthogonal set of nonzero vectors and define the subspace $S_k = \mathrm{Span}\{d_0, d_1, \dots, d_k\}$. Then for any $x_0 \in \mathbb{R}^n$ the sequence

\[ x_{k+1} = x_k + \alpha_k d_k, \quad \text{where} \quad \alpha_k = -\frac{r_k^T d_k}{d_k^T Q d_k}, \]

has the following properties:

(i) $x_{k+1} = \arg\min_{x \in x_0 + S_k} \frac{1}{2} x^T Q x - q^T x$

(ii) the residual at step $k$ is orthogonal to all the previous directions, i.e.

\[ r_k^T d_i = 0 \quad \forall i < k. \]

From (ii) we obtain that $\nabla f(x_k) \perp S_{k-1}$.

The conjugate directions method for solving a strictly convex QP consists of the following steps:


0. given $x_0 \in \mathbb{R}^n$ we define $d_0 = -\nabla f(x_0) = -r_0 = -(Q x_0 - q)$

1. $x_{k+1} = x_k + \alpha_k d_k$ with $\alpha_k = -\dfrac{r_k^T d_k}{d_k^T Q d_k}$

2. $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$, where $\beta_k = \dfrac{r_{k+1}^T Q d_k}{d_k^T Q d_k}$.

Note that at every step the new direction is a linear combination of the current gradient and the previous direction. The conjugate directions method is computationally cheap since it uses simple update formulas (vector operations and one matrix-vector product per iteration).

Theorem 6.2.4 (Properties of the conjugate directions method) The following properties hold for the conjugate directions method:

(i) $\mathrm{Span}\{d_0, d_1, \dots, d_k\} = \mathrm{Span}\{r_0, r_1, \dots, r_k\} = \mathrm{Span}\{r_0, Q r_0, \dots, Q^k r_0\}$

(ii) $d_k^T Q d_i = 0$ for all $i < k$

(iii) $\alpha_k = \dfrac{r_k^T r_k}{d_k^T Q d_k}$ (since $r_k^T d_k = -r_k^T r_k$)

(iv) $\beta_k = \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$.
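The steps 0 to 2 above can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy; the function name `conjugate_directions_qp` and the small test matrix are hypothetical and not taken from these notes.

```python
import numpy as np

def conjugate_directions_qp(Q, q, x0):
    """Minimize 0.5 x^T Q x - q^T x for symmetric positive definite Q."""
    x = np.asarray(x0, dtype=float)
    r = Q @ x - q                 # residual = gradient of the objective
    d = -r                        # step 0: start with the steepest descent direction
    for _ in range(len(q)):       # at most n steps (Theorem 6.2.2)
        if np.linalg.norm(r) < 1e-12:
            break                 # already at the solution
        Qd = Q @ d
        alpha = -(r @ d) / (d @ Qd)          # step 1
        x = x + alpha * d
        r = Q @ x - q
        beta = (r @ Qd) / (d @ Qd)           # step 2
        d = -r + beta * d
    return x

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
q = np.array([1.0, 2.0, 3.0])
x_star = conjugate_directions_qp(Q, q, np.zeros(3))
print(x_star, Q @ x_star - q)
```

The residual `Q @ x - q` is recomputed from scratch for clarity; an implementation that cares about cost would update it as `r + alpha * Qd` and so use a single matrix-vector product per iteration.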

6.2.1 Extension to general unconstrained optimization problems

For a general unconstrained optimization problem $\min_{x \in \mathbb{R}^n} f(x)$, we repeat the same iterations as in the quadratic case with the following identifications:

\[ Q = \nabla^2 f(x_k), \qquad r_k = \nabla f(x_k) \]

for $n$ steps and then we reinitialize $x_0 = x_n$ and repeat until $\|\nabla f(x_k)\| \le \epsilon$:

0. $r_0 = \nabla f(x_0)$ and $d_0 = -\nabla f(x_0)$

1. $x_{k+1} = x_k + \alpha_k d_k$ for all $k = 0, \dots, n-1$, where $\alpha_k = -\dfrac{r_k^T d_k}{d_k^T \nabla^2 f(x_k) d_k}$

2. $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$, where $\beta_k = \dfrac{r_{k+1}^T \nabla^2 f(x_k) d_k}{d_k^T \nabla^2 f(x_k) d_k}$

3. after $n$ iterations we replace $x_0$ with $x_n$ and repeat the whole process.

Note that in the general case this method is not guaranteed to converge. The algorithm can be made convergent by suitably modifying $\beta_k$. We have the following updating rules:

Fletcher–Reeves: $\beta_k = \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$

Polak–Ribière: $\beta_k = \dfrac{(r_{k+1} - r_k)^T r_{k+1}}{r_k^T r_k}$.


Chapter 7

Newton Type Optimization

In this chapter we will treat how to solve a general unconstrained nonlinear optimization problem using also information about the Hessian (second order information) or some approximation of it (i.e. still based on first order information):

\[ \min_{x \in \mathbb{R}^n} f(x) \tag{7.1} \]

with $f \in C^2$.

7.1 Newton method

In numerical analysis, Newton's method (or the Newton-Raphson method) is a method for finding roots of a system of equations in one or more dimensions. We consider the first order necessary conditions for optimality, which reduce to the system of equations:

\[ \nabla f(x^*) = 0 \]

with $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$, which has as many components as variables.

The Newton idea consists of linearizing the nonlinear equations at $x_k$ to find $x_{k+1} = x_k + d_k$:

\[ \nabla f(x_k) + \nabla^2 f(x_k) d_k = 0 \quad \Longleftrightarrow \quad d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k). \]


In general, we call $d_k$ the Newton direction, and the Newton method consists of the following iteration:

\[ x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k). \]

[Figure: visualization of the problem]

Another interpretation of the Newton method for optimization can be obtained from a second order Taylor approximation of the objective function $f$. We recall the second order sufficient conditions for optimality (SOSC): if there exists an $x^*$ satisfying

\[ \nabla f(x^*) = 0 \quad \text{and} \quad \nabla^2 f(x^*) \succ 0 \]

then $x^*$ is a local minimum. If $x^*$ is a point satisfying (SOSC), then there exists a neighborhood $\mathcal{N}$ of $x^*$ such that for all $x \in \mathcal{N}$ we have $\nabla^2 f(x) \succ 0$. From the Taylor approximation we have that

\[ f(x_k + d) \approx f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T \nabla^2 f(x_k) d \]

and thus we define the Newton direction as

\[ d_k = \arg\min_d \; f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T \nabla^2 f(x_k) d. \]

Note that if $x_k$ is sufficiently close to $x^*$ then $\nabla^2 f(x_k) \succ 0$, and thus from the optimality conditions for a strictly convex QP we obtain again $d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$, i.e. the same formula, but with a different interpretation.

7.1.1 Local convergence rates

In this section we will analyze the local convergence rate of the Newton method:

\[ x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k). \]

Theorem 7.1.1 (Quadratic convergence of Newton method) Let $f \in C^2$ and let $x^*$ be a local minimum satisfying SOSC (i.e. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$). Let $l > 0$ be such that

\[ \nabla^2 f(x^*) \succeq l I_n. \]

Moreover, we assume that $\nabla^2 f(x)$ is Lipschitz, i.e.

\[ \|\nabla^2 f(x) - \nabla^2 f(y)\| \le M \|x - y\| \quad \forall x, y \in \mathrm{dom} f, \]

where $M > 0$. If $x_0$ is sufficiently close to $x^*$, i.e.

\[ \|x_0 - x^*\| \le \frac{2}{3} \frac{l}{M}, \]

then the Newton iteration $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$ has the property that the sequence $\{x_k\}_k$ converges to $x^*$ quadratically.

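Proof: Since $x^*$ is a local minimum, $\nabla f(x^*) = 0$. Furthermore, using the Taylor theorem in integral form we have

\[ \nabla f(x_k) = \nabla f(x^*) + \int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau. \]

We obtain:

\[
\begin{aligned}
x_{k+1} - x^* &= x_k - x^* - \nabla^2 f(x_k)^{-1} \nabla f(x_k) \\
&= \nabla^2 f(x_k)^{-1} \left[ \nabla^2 f(x_k)(x_k - x^*) - \nabla f(x_k) + \nabla f(x^*) \right] \\
&= \nabla^2 f(x_k)^{-1} \left[ \nabla^2 f(x_k)(x_k - x^*) - \int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau \right] \\
&= \nabla^2 f(x_k)^{-1} \int_0^1 \left[ \nabla^2 f(x_k) - \nabla^2 f(x^* + \tau(x_k - x^*)) \right](x_k - x^*)\, d\tau.
\end{aligned}
\]

Since

\[ \|\nabla^2 f(x_k) - \nabla^2 f(x^*)\| \le M \|x_k - x^*\| \]

it follows that

\[ -M\|x_k - x^*\| I_n \preceq \nabla^2 f(x_k) - \nabla^2 f(x^*) \preceq M\|x_k - x^*\| I_n. \]

It follows that

\[ \nabla^2 f(x_k) \succeq \nabla^2 f(x^*) - M\|x_k - x^*\| I_n \succeq l I_n - M\|x_k - x^*\| I_n \succ 0, \]

provided that $\|x_k - x^*\| \le \frac{2}{3}\frac{l}{M}$, which leads to

\[ 0 \prec \nabla^2 f(x_k)^{-1} \preceq \frac{1}{l - M\|x_k - x^*\|} I_n. \]

We conclude that

\[
\begin{aligned}
\|x_{k+1} - x^*\| &\le \|\nabla^2 f(x_k)^{-1}\| \left\| \int_0^1 \left[ \nabla^2 f(x_k) - \nabla^2 f(x^* + \tau(x_k - x^*)) \right] d\tau \right\| \|x_k - x^*\| \\
&\le \frac{1}{l - M\|x_k - x^*\|} \int_0^1 M(1 - \tau)\|x_k - x^*\|\, d\tau \; \|x_k - x^*\| \\
&= \frac{1}{l - M\|x_k - x^*\|} \int_0^1 M(1 - \tau)\, d\tau \; \|x_k - x^*\|^2 \\
&= \frac{1}{l - M\|x_k - x^*\|} \frac{M}{2} \|x_k - x^*\|^2.
\end{aligned}
\]

Since $\|x_k - x^*\| \le \frac{2}{3}\frac{l}{M}$ implies $l - M\|x_k - x^*\| \ge \frac{l}{3}$, we get $\|x_{k+1} - x^*\| \le \frac{3M}{2l}\|x_k - x^*\|^2$, which is the quadratic rate.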

If we start from a point $x_0$ which is not close to $x^*$, then the Newton method must be modified to guarantee convergence by using a step size $\alpha_k$ satisfying the Wolfe conditions:

\[ x_{k+1} = x_k - \alpha_k (\nabla^2 f(x_k))^{-1} \nabla f(x_k). \]

When $x_k$ is sufficiently close to $x^*$ the step size $\alpha_k$ will become 1.

Remark 7.1.2

1. Note that the Newton method converges in one step for convex quadratic problems.

2. Note that the Newton direction is a descent direction as long as $\nabla^2 f(x_k) \succ 0$. Otherwise, in place of $\nabla^2 f(x_k)$ we consider $\epsilon_k I_n + \nabla^2 f(x_k)$ for some adequate $\epsilon_k$ such that the new matrix becomes positive definite. In that case the update rule is

\[ x_{k+1} = x_k - \alpha_k (\epsilon_k I_n + \nabla^2 f(x_k))^{-1} \nabla f(x_k). \]

3. The main disadvantage of this method is that we need to calculate the Hessian of $f$ and then to invert this matrix.
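Items 1 and 2 of the remark can be illustrated with a short sketch. It assumes NumPy; the helper name `newton_direction`, the threshold `delta` used to choose $\epsilon_k$ from the smallest eigenvalue, and the test data are illustrative assumptions, not a rule prescribed by these notes.

```python
import numpy as np

def newton_direction(g, H, delta=1e-3):
    """Return d = -(eps*I + H)^{-1} g, with eps = 0 if H is already positive definite.

    Assumption: eps is chosen so that the smallest eigenvalue of eps*I + H is delta.
    """
    lam_min = np.linalg.eigvalsh(H).min()
    eps = 0.0 if lam_min >= delta else delta - lam_min
    return -np.linalg.solve(H + eps * np.eye(len(g)), g)

# Item 1: for f(x) = 0.5 x^T Q x - q^T x one full Newton step reaches x*.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
x0 = np.array([5.0, -7.0])
x1 = x0 + newton_direction(Q @ x0 - q, Q)

# Item 2: with an indefinite Hessian the shifted direction is still a descent direction.
H_indef = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = newton_direction(g, H_indef)
print(x1, g @ d)
```

Computing all eigenvalues costs as much as a factorization, so this choice of $\epsilon_k$ is meant for illustration on small problems only.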


7.2 Quasi Newton methods

As we have mentioned before, the main disadvantage of the Newton method is that the computation of the Hessian and its inverse is expensive in general (at least for large problems). In quasi Newton methods the purpose is to replace $\nabla^2 f(x_k)^{-1}$ with a matrix $H_k$ which can be calculated more easily, and thus we get the following iteration:

\[ x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k). \]

Note that the direction $d_k = -H_k \nabla f(x_k)$ is a descent direction if $H_k \succ 0$, since for $\nabla f(x_k) \neq 0$

\[ \nabla f(x_k)^T d_k = -\nabla f(x_k)^T H_k \nabla f(x_k) < 0. \]

Furthermore, the step size $\alpha_k$ is chosen to satisfy the Wolfe conditions. In general, as in the Newton method, when $x_k$ is sufficiently close to the solution $x^*$ we have $\alpha_k = 1$. The goal is to find update rules for the matrix $H_k$ such that asymptotically it converges to the true inverse of the Hessian, i.e.

\[ H_k \to \nabla^2 f(x^*)^{-1}. \]

From the Taylor approximation we have, with $\Delta_k = x_{k+1} - x_k$:

\[ \nabla f(x_{k+1}) = \nabla f(x_k + \Delta_k) \approx \nabla f(x_k) + \nabla^2 f(x_k)(x_{k+1} - x_k). \]

In conclusion, approximating the true Hessian $\nabla^2 f(x_k)$ with a matrix $B_{k+1}$ we obtain the following relation

\[ \nabla f(x_{k+1}) - \nabla f(x_k) = B_{k+1}(x_{k+1} - x_k) \]

or equivalently, using the notation $H_{k+1} = B_{k+1}^{-1}$,

\[ H_{k+1}\left[\nabla f(x_{k+1}) - \nabla f(x_k)\right] = x_{k+1} - x_k, \tag{7.2} \]

which is called the secant equation.

For $H_k^{-1} = \nabla^2 f(x_k)$ we recover Newton's method. Note that we have a similar interpretation as for the Newton method, namely in each iteration a convex quadratic approximation of the function is considered (i.e. $B_k \succeq 0$) and we minimize it to obtain the next direction:

\[ d_k = \arg\min_d \; f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T B_k d. \tag{7.3} \]

Since the Hessian is symmetric, we require the matrices $B_k$ and $H_k$ to be symmetric as well. In conclusion we have $n$ equations (from (??)) with $\frac{n(n+1)}{2}$ unknowns (by imposing symmetry), and thus for $n > 1$ we obtain an infinite number of solutions. In the sequel we will derive different update rules that satisfy (??) and symmetry.


7.2.1 Rank one updates

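In this case we update the matrix $H_k$ using the following formula

\[ H_{k+1} = H_k + \beta_k u_k u_k^T, \]

where we choose $\beta_k \in \mathbb{R}$ and $u_k \in \mathbb{R}^n$ such that the secant equation holds.

We start with a symmetric matrix $H_0 \succ 0$ and we denote

\[ \Delta_k = x_{k+1} - x_k \quad \text{and} \quad \delta_k = \nabla f(x_{k+1}) - \nabla f(x_k). \]

We update the matrix $H_{k+1}$ as follows

\[ H_{k+1} = H_k + \frac{(\Delta_k - H_k \delta_k)(\Delta_k - H_k \delta_k)^T}{\delta_k^T (\Delta_k - H_k \delta_k)}. \]

To guarantee that $H_{k+1} \succ 0$ whenever $H_k \succ 0$, we require the following inequality to hold:

\[ \delta_k^T (\Delta_k - H_k \delta_k) > 0. \]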
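As a small numerical check, the sketch below applies the rank one update and verifies the secant equation $H_{k+1}\delta_k = \Delta_k$. It assumes NumPy; the function name `rank_one_update`, the safeguard that skips the update for a tiny denominator, and the sample vectors are hypothetical.

```python
import numpy as np

def rank_one_update(H, Delta, delta):
    """Symmetric rank one update of the inverse Hessian approximation H."""
    v = Delta - H @ delta
    denom = delta @ v
    if abs(denom) < 1e-12:        # assumption: skip the update when it is ill-defined
        return H
    return H + np.outer(v, v) / denom

H0 = np.eye(2)
Delta = np.array([1.0, 0.5])      # x_{k+1} - x_k
delta = np.array([0.5, 0.2])      # grad f(x_{k+1}) - grad f(x_k)
H1 = rank_one_update(H0, Delta, delta)
print(H1 @ delta, Delta)
```

For these vectors $\delta_k^T(\Delta_k - H_k\delta_k) = 0.31 > 0$, so the updated matrix stays positive definite.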

7.2.2 Rank two updates

In this case we update the matrix $H_k$ using the following formula:

\[ H_{k+1} = H_k + \frac{\Delta_k \Delta_k^T}{\Delta_k^T \delta_k} - \frac{H_k \delta_k \delta_k^T H_k}{\delta_k^T H_k \delta_k}. \]

This update is called the Davidon-Fletcher-Powell (DFP) update.

We again start with an initial matrix $H_0 \succ 0$. The following properties hold for the (DFP) update:

(i) $H_k \succ 0$.

(ii) if $f(x) = \frac{1}{2} x^T Q x + q^T x$ is quadratic and strictly convex, then the (DFP) update yields conjugate directions, i.e. $d_k = -H_k \nabla f(x_k)$ are Q-conjugate directions. Moreover, $H_n = Q^{-1}$, and in particular if $H_0 = I_n$ then the directions $d_k$ coincide with the directions from the conjugate directions method. Therefore, we can find the solution of a quadratic problem in at most $n$ steps by using the (DFP) iterations.


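If in the (DFP) method we do not use (??) but the equation $\nabla f(x_{k+1}) - \nabla f(x_k) = B_{k+1}(x_{k+1} - x_k)$, we obtain the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update, which in terms of $H_k$ reads:

\[ H_{k+1} = H_k - \frac{H_k \delta_k \Delta_k^T + \Delta_k \delta_k^T H_k}{\Delta_k^T \delta_k} + \beta_k \frac{\Delta_k \Delta_k^T}{\Delta_k^T \delta_k}, \qquad \beta_k = 1 + \frac{\delta_k^T H_k \delta_k}{\Delta_k^T \delta_k}. \]

The same properties are valid for the (BFGS) method as for the (DFP) method. However, from a numerical point of view (BFGS) is considered the most stable.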
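Both rank two updates can be checked against the secant equation. The following is a sketch under stated assumptions: NumPy is used, the formulas are the standard DFP and BFGS updates written for the inverse approximation $H_k$, and the function names and the test data (a step on a hypothetical quadratic with Hessian `Q`) are illustrative only.

```python
import numpy as np

def dfp_update(H, Delta, delta):
    """DFP update of the inverse Hessian approximation H (H symmetric)."""
    Hd = H @ delta
    return H + np.outer(Delta, Delta) / (Delta @ delta) - np.outer(Hd, Hd) / (delta @ Hd)

def bfgs_update(H, Delta, delta):
    """BFGS update of the inverse Hessian approximation H (H symmetric)."""
    Hd = H @ delta
    s = Delta @ delta
    beta = 1.0 + (delta @ Hd) / s
    return H - (np.outer(Hd, Delta) + np.outer(Delta, Hd)) / s + beta * np.outer(Delta, Delta) / s

# Hypothetical data from a quadratic with Hessian Q: delta = Q @ Delta, Delta^T delta = 2 > 0.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
Delta = np.array([1.0, -1.0])
delta = Q @ Delta
H_dfp = dfp_update(np.eye(2), Delta, delta)
H_bfgs = bfgs_update(np.eye(2), Delta, delta)
print(H_dfp @ delta, H_bfgs @ delta)   # both should reproduce Delta
```

Both updates keep the matrix positive definite here because the curvature condition $\Delta_k^T \delta_k > 0$ holds.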

Remark 7.2.1 Note that quasi Newton methods require only first order information. Moreover, the directions generated by the quasi Newton method are descent directions if we ensure that $H_k \succ 0$. In general $H_k \to (\nabla^2 f(x^*))^{-1}$, and under certain assumptions we will show that we have superlinear convergence.

7.2.3 Local convergence for quasi Newton method

Theorem 7.2.2 Let $x^*$ be a point that satisfies (SOSC), and assume that locally around $x^*$ the quasi Newton iteration takes the form $x_{k+1} = x_k - H_k \nabla f(x_k)$, where $H_k$ is invertible for all $k$ and satisfies the following Lipschitz condition

\[ \|H_k(\nabla^2 f(x_k) - \nabla^2 f(y))\| \le M \|x_k - y\| \quad \forall k \in \mathbb{N},\; y \in \mathbb{R}^n, \]

and the following compatibility condition

\[ \|H_k(\nabla^2 f(x_k) - H_k^{-1})\| \le \kappa_k \quad \forall k \in \mathbb{N} \tag{7.4} \]

with $0 < M < \infty$ and $\kappa_k \le \kappa < 1$. We also assume that

\[ \|x_0 - x^*\| < \frac{2(1 - \kappa)}{M}. \tag{7.5} \]

Then $x_k$ converges to $x^*$ superlinearly if $\kappa_k \to 0$.


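Proof: We will show that $\|x_{k+1} - x^*\| \le \beta_k \|x_k - x^*\|$ with $\beta_k < 1$. For this aim let us compute

\[
\begin{aligned}
x_{k+1} - x^* &= x_k - x^* - H_k \nabla f(x_k) \\
&= x_k - x^* - H_k(\nabla f(x_k) - \nabla f(x^*)) \\
&= H_k H_k^{-1}(x_k - x^*) - H_k \int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau \\
&= H_k(H_k^{-1} - \nabla^2 f(x_k))(x_k - x^*) - H_k \int_0^1 \left[\nabla^2 f(x^* + \tau(x_k - x^*)) - \nabla^2 f(x_k)\right](x_k - x^*)\, d\tau.
\end{aligned}
\]

Taking the norm on both sides we obtain:

\[
\begin{aligned}
\|x_{k+1} - x^*\| &\le \kappa_k \|x_k - x^*\| + \int_0^1 M \|x^* + \tau(x_k - x^*) - x_k\|\, d\tau \; \|x_k - x^*\| \\
&= \Big(\kappa_k + M \underbrace{\int_0^1 (1 - \tau)\, d\tau}_{= \frac{1}{2}} \|x_k - x^*\|\Big) \|x_k - x^*\| \\
&= \underbrace{\Big(\kappa_k + \frac{M}{2}\|x_k - x^*\|\Big)}_{= \beta_k} \|x_k - x^*\|.
\end{aligned}
\]

By (7.5) we have $\beta_0 < \kappa + (1 - \kappa) = 1$, and by induction $\|x_k - x^*\|$ decreases, so $\beta_k < 1$ for all $k$. If $\kappa_k \to 0$ we obtain the following superlinear convergence rate:

\[ \|x_{k+1} - x^*\| \le \underbrace{\Big(\kappa_k + \frac{M}{2}\|x_k - x^*\|\Big)}_{\to 0} \|x_k - x^*\|. \]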

Truncated Newton method: This approach is suitable for large scale problems and consists in solving the linear system

\[ \nabla^2 f(x_k) d = -\nabla f(x_k) \tag{7.6} \]

inexactly, e.g. by iterative linear algebra.


Chapter 8

Estimation and Fitting Problems

Estimation and fitting problems are optimization problems with a special objective, namely a "least squares objective"

\[ \min_{x \in \mathbb{R}^n} \frac{1}{2} \|\eta - M(x)\|^2. \tag{8.1} \]

Here, $\eta \in \mathbb{R}^m$ are the $m$ "measurements", $M : \mathbb{R}^n \to \mathbb{R}^m$ is a "model", and $x \in \mathbb{R}^n$ are called "model parameters". If the true value of $x$ were known, we could evaluate the model $M(x)$ to obtain model predictions for the measurements. The computation of $M(x)$, which might be a very complex function and for example involve the solution of a differential equation, is sometimes called the "forward problem": for given model inputs, we determine the model outputs.

In estimation and fitting problems, as in (??), the situation is the opposite: we want to find those model parameters $x$ that yield a prediction $M(x)$ that is as close as possible to the actual measurements $\eta$. This problem is often called an "inverse problem": for given model outputs $\eta$, we want to find the corresponding model inputs $x$.

This type of optimization problem arises in applications like

• function approximation

• online estimation for process control

• weather forecast (weather data reconciliation)


• parameter estimation

8.1 Linear least squares

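Definition 8.1.1 (Moore-Penrose Pseudo Inverse) Assume $J \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(J) = k$, and that the singular value decomposition (SVD) of $J$ is given by $J = U \Sigma V^T$. Then, the Moore-Penrose pseudo inverse $J^+$ is given by:

\[ J^+ = V \Sigma^+ U^T, \]

where for

\[ \Sigma = \begin{bmatrix} \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k) & 0 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{m \times n} \quad \text{holds} \quad \Sigma^+ = \begin{bmatrix} \mathrm{diag}(\sigma_1^{-1}, \sigma_2^{-1}, \dots, \sigma_k^{-1}) & 0 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{n \times m}. \]

Theorem 8.1.2 If $\mathrm{rank}(J) = n$, then

\[ J^+ = (J^T J)^{-1} J^T. \]

If $\mathrm{rank}(J) = m$, then

\[ J^+ = J^T (J J^T)^{-1}. \]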


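Proof: Let us compute

\[
\begin{aligned}
(J^T J)^{-1} J^T &= (V \Sigma^T U^T U \Sigma V^T)^{-1} V \Sigma^T U^T \\
&= V (\Sigma^T \Sigma)^{-1} V^T V \Sigma^T U^T \\
&= V (\Sigma^T \Sigma)^{-1} \Sigma^T U^T \\
&= V \, \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2)^{-1} \begin{bmatrix} \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n) & 0 \end{bmatrix} U^T \\
&= V \Sigma^+ U^T.
\end{aligned}
\]

Similarly for the other case.

Note that if $\mathrm{rank}(J) = n$, i.e. the columns of $J$ are linearly independent, then $J^T J$ is invertible.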
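Definition 8.1.1 and Theorem 8.1.2 can be verified numerically. The sketch below assumes NumPy; the helper `pinv_svd`, its tolerance `tol` for deciding which singular values count as zero, and the small matrices are illustrative assumptions.

```python
import numpy as np

def pinv_svd(J, tol=1e-12):
    """Moore-Penrose pseudo inverse J^+ = V Sigma^+ U^T via the (thin) SVD."""
    U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
    sigma_plus = np.array([1.0 / s if s > tol else 0.0 for s in sigma])
    return Vt.T @ np.diag(sigma_plus) @ U.T

J_tall = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # rank(J) = n = 2
J_wide = J_tall.T                                          # rank(J) = m = 2

left = np.linalg.inv(J_tall.T @ J_tall) @ J_tall.T         # (J^T J)^{-1} J^T
right = J_wide.T @ np.linalg.inv(J_wide @ J_wide.T)        # J^T (J J^T)^{-1}
print(np.abs(pinv_svd(J_tall) - left).max(), np.abs(pinv_svd(J_wide) - right).max())
```

NumPy's own `np.linalg.pinv` is also SVD based and should agree with `pinv_svd` on these inputs.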

Many models in estimation and fitting problems are linear functions of $x$. If $M$ is linear, i.e. $M(x) = Jx$, then the objective function becomes $f(x) = \frac{1}{2}\|\eta - Jx\|^2$, which is a convex function since $\nabla^2 f(x) = J^T J \succeq 0$. Assuming that $\mathrm{rank}(J) = n$, the global minimizer is found by

\[ J^T J x^* - J^T \eta = 0 \quad \Longleftrightarrow \quad x^* = \underbrace{(J^T J)^{-1} J^T}_{= J^+} \eta. \tag{8.2} \]

Example [Average linear least squares]: Let us regard the simple optimization problem:

\[ \min_{x \in \mathbb{R}} \frac{1}{2} \sum_{i=1}^{m} (\eta_i - x)^2. \]

This is a linear least squares problem, where the vector $\eta$ and the matrix $J \in \mathbb{R}^{m \times 1}$ are given by

\[ \eta = \begin{bmatrix} \eta_1 \\ \eta_2 \\ \vdots \\ \eta_m \end{bmatrix}, \qquad J = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}. \tag{8.3} \]

Because $J^T J = m$, it can be easily seen that

\[ J^+ = (J^T J)^{-1} J^T = \frac{1}{m} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} \tag{8.4} \]

so we conclude that the minimizer equals the average $\bar{\eta}$ of the given points $\eta_i$:

\[ x^* = J^+ \eta = \frac{1}{m} \sum_{i=1}^{m} \eta_i = \bar{\eta}. \tag{8.5} \]

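Example [Linear Regression]:

Figure: Given a set of points which tend to a linear relation between two units

Given data points $\{t_i\}_{i=1}^{m}$ with corresponding values $\{\eta_i\}_{i=1}^{m}$, find the 2-dimensional parameter vector $x = (x_1, x_2)$ so that the polynomial of degree one $p(t; x) = x_1 + x_2 t$ provides a prediction of $\eta$ at time $t$. The corresponding optimization problem looks like:

\[ \min_{x \in \mathbb{R}^2} \frac{1}{2} \sum_{i=1}^{m} (\eta_i - p(t_i; x))^2 = \min_{x \in \mathbb{R}^2} \frac{1}{2} \left\| \eta - J \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right\|_2^2 \tag{8.6} \]

where $\eta$ is the same vector as in (??) and $J$ is given by

\[ J = \begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_m \end{bmatrix}. \tag{8.7} \]

The minimizer is found by equation (??), whereas the calculation of $J^T J$ is straightforward:

\[ J^T J = \begin{bmatrix} m & \sum t_i \\ \sum t_i & \sum t_i^2 \end{bmatrix} = m \begin{bmatrix} 1 & \bar{t} \\ \bar{t} & \overline{t^2} \end{bmatrix}, \tag{8.8} \]

where $\bar{t} = \frac{1}{m}\sum t_i$ and $\overline{t^2} = \frac{1}{m}\sum t_i^2$ denote averages. In order to obtain $x^*$, first $(J^T J)^{-1}$ is calculated$^1$:

\[ (J^T J)^{-1} = \frac{1}{\det(J^T J)} \mathrm{adj}(J^T J) = \frac{1}{m(\overline{t^2} - \bar{t}^2)} \begin{bmatrix} \overline{t^2} & -\bar{t} \\ -\bar{t} & 1 \end{bmatrix}. \tag{8.9} \]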

$^1$ Recall that the adjugate of a matrix $A \in \mathbb{R}^{n \times n}$ is given by taking the transpose of the cofactor matrix, $\mathrm{adj}(A) = C^T$, where $C_{ij} = (-1)^{i+j} M_{ij}$ with $M_{ij}$ the $(i, j)$ minor of $A$.


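Second, we compute $J^T \eta$ as follows:

\[ J^T \eta = \begin{bmatrix} 1 & \cdots & 1 \\ t_1 & \cdots & t_m \end{bmatrix} \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_m \end{bmatrix} = \begin{bmatrix} \sum \eta_i \\ \sum \eta_i t_i \end{bmatrix} = m \begin{bmatrix} \bar{\eta} \\ \overline{t\eta} \end{bmatrix}. \tag{8.10} \]

Hence, the minimizer is found by combining the expressions (??) and (??). Note that

\[ \overline{t^2} - \bar{t}^2 = \frac{1}{m} \sum (t_i - \bar{t})^2 = \sigma_t^2, \tag{8.11} \]

where we used in the last transformation a standard definition of the variance $\sigma_t^2$. The correlation coefficient $\rho$ is similarly defined by

\[ \rho = \frac{\sum (\eta_i - \bar{\eta})(t_i - \bar{t})}{m \sigma_t \sigma_\eta} = \frac{\overline{t\eta} - \bar{\eta}\,\bar{t}}{\sigma_t \sigma_\eta}. \tag{8.12} \]

The two-dimensional parameter vector $x = (x_1, x_2)$ is found:

\[ x^* = \frac{1}{\sigma_t^2} \begin{bmatrix} \overline{t^2}\,\bar{\eta} - \bar{t}\;\overline{t\eta} \\ -\bar{t}\,\bar{\eta} + \overline{t\eta} \end{bmatrix} = \begin{bmatrix} \bar{\eta} - \bar{t}\,\frac{\sigma_\eta}{\sigma_t}\rho \\ \frac{\sigma_\eta}{\sigma_t}\rho \end{bmatrix}. \tag{8.13} \]

Finally, this can be written as a polynomial of first degree:

\[ p(t; x^*) = \bar{\eta} + (t - \bar{t}) \frac{\sigma_\eta}{\sigma_t} \rho. \tag{8.14} \]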
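The closed-form regression coefficients can be compared with a direct solution of the normal equations. This is only a sketch: it assumes NumPy, and the five data points are made up for illustration.

```python
import numpy as np

# Hypothetical data (not from these notes).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
eta = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Normal equations: x* = (J^T J)^{-1} J^T eta with J = [1, t].
J = np.column_stack([np.ones_like(t), t])
x_normal = np.linalg.solve(J.T @ J, J.T @ eta)

# Closed form via averages, standard deviations and the correlation coefficient.
t_bar, eta_bar = t.mean(), eta.mean()
sigma_t, sigma_eta = t.std(), eta.std()          # np.std divides by m by default
rho = np.mean((t - t_bar) * (eta - eta_bar)) / (sigma_t * sigma_eta)
x_closed = np.array([eta_bar - t_bar * rho * sigma_eta / sigma_t,
                     rho * sigma_eta / sigma_t])
print(x_normal, x_closed)
```

Note that `np.std` uses the $\frac{1}{m}$ normalization by default, which matches the variance definition used in the text.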


8.2 Ill posed linear least squares

When $J^T J$ is invertible, the set of optimal solutions $X^*$ contains only one optimal point $x^*$, given by equation (??): $X^* = \{(J^T J)^{-1} J^T \eta\}$. If $J^T J$ is not invertible, the set of solutions $X^*$ is given by

\[ X^* = \{x : \nabla f(x) = 0\} = \{x : J^T J x - J^T \eta = 0\}. \tag{8.15} \]

In order to pick a unique point out of this set, we might choose to search for the minimum norm solution, i.e. the vector $x^*$ with minimum norm satisfying $x^* \in X^*$:

\[ \min_{x \in \mathbb{R}^n} \frac{1}{2}\|x\|^2 \quad \text{s.t.} \quad x \in X^*. \tag{8.16} \]

We will show below that this minimal norm solution is given by the Moore-Penrose pseudo inverse.

8.2.1 Regularization for least squares

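The minimum norm solution can be approximated by a "regularized problem"

\[ \min_x \frac{1}{2}\|\eta - Jx\|_2^2 + \frac{\epsilon}{2}\|x\|_2^2, \tag{8.17} \]

with small $\epsilon > 0$. Setting the gradient of this objective to zero gives a unique solution:

\[
\begin{aligned}
\nabla f(x) &= J^T J x - J^T \eta + \epsilon x && (8.18) \\
&= (J^T J + \epsilon I)x - J^T \eta && (8.19) \\
x^* &= (J^T J + \epsilon I)^{-1} J^T \eta && (8.20)
\end{aligned}
\]

Lemma 8.2.1

\[ \lim_{\epsilon \to 0} (J^T J + \epsilon I)^{-1} J^T = J^+. \]

Proof: Taking the SVD $J = U \Sigma V^T$ and using $I = V V^T$ and $J^T = V \Sigma^T U^T$, the matrix $(J^T J + \epsilon I)^{-1} J^T$ can be written in the form:

\[
\begin{aligned}
(J^T J + \epsilon I)^{-1} J^T &= (V \Sigma^T U^T U \Sigma V^T + \epsilon V V^T)^{-1} V \Sigma^T U^T \\
&= V (\Sigma^T \Sigma + \epsilon I)^{-1} V^T V \Sigma^T U^T \\
&= V (\Sigma^T \Sigma + \epsilon I)^{-1} \Sigma^T U^T.
\end{aligned}
\]

Rewriting the right hand side of the equation explicitly, with $r = \mathrm{rank}(J)$:

\[ = V \, \mathrm{diag}(\sigma_1^2 + \epsilon, \dots, \sigma_r^2 + \epsilon, \epsilon, \dots, \epsilon)^{-1} \begin{bmatrix} \mathrm{diag}(\sigma_1, \dots, \sigma_r) & 0 \\ 0 & 0 \end{bmatrix} U^T. \]

Calculating the matrix product simplifies the equation:

\[ = V \begin{bmatrix} \mathrm{diag}\!\left(\frac{\sigma_1}{\sigma_1^2 + \epsilon}, \dots, \frac{\sigma_r}{\sigma_r^2 + \epsilon}\right) & 0 \\ 0 & 0 \end{bmatrix} U^T, \]

where the remaining diagonal entries are $\frac{0}{\epsilon} = 0$. It can be easily seen that for $\epsilon \to 0$ each diagonal element has the limit:

\[ \lim_{\epsilon \to 0} \frac{\sigma_i}{\sigma_i^2 + \epsilon} = \begin{cases} \frac{1}{\sigma_i} & \text{if } \sigma_i \neq 0 \\ 0 & \text{if } \sigma_i = 0 \end{cases} \tag{8.21} \]

We have shown that the Moore-Penrose inverse $J^+$ solves the problem (??) for infinitely small $\epsilon > 0$. Thus it selects $x^* \in X^*$ with minimal norm.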
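Lemma 8.2.1 can be observed numerically on a rank deficient matrix. The sketch assumes NumPy; the function name `regularized_map` and the rank one test matrix are hypothetical.

```python
import numpy as np

def regularized_map(J, eps):
    """Return (J^T J + eps*I)^{-1} J^T without forming the inverse explicitly."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + eps * np.eye(n), J.T)

# Rank one matrix: J^T J is singular, so (J^T J)^{-1} J^T does not exist.
J = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 2.0]])
J_plus = np.linalg.pinv(J)

errors = [np.linalg.norm(regularized_map(J, eps) - J_plus)
          for eps in (1e-2, 1e-4, 1e-6)]
print(errors)   # the distance to J^+ shrinks as eps decreases
```

For very small $\epsilon$ the regularized system becomes ill-conditioned, which is why the sketch stops at $10^{-6}$ instead of going to machine precision.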

8.3 Statistical derivation of least squares

A least squares problem (??) can be interpreted as finding the $x$ that "explains" the noisy measurements $\eta$ "best".

Definition [Maximum-Likelihood Estimate] A maximum-likelihood estimate of the unknown parameter $x$ maximizes the probability $P(\eta|x)$ of obtaining the (given) measurements $\eta$ if the parameter had the value $x$.

Assume $\eta_i = M_i(x) + \epsilon_i$ with $x$ the "true" parameter, and $\epsilon_i$ Gaussian noise with expectation value $E(\epsilon_i) = 0$, $E(\epsilon_i \epsilon_i) = \sigma_i^2$, and $\epsilon_i, \epsilon_j$ independent. Then

\[
\begin{aligned}
P(\eta|x) &= \prod_{i=1}^{m} P(\eta_i \,|\, x) && (8.22) \\
&= C \prod_{i=1}^{m} \exp\left( \frac{-(\eta_i - M_i(x))^2}{2\sigma_i^2} \right) && (8.23) \\
\log P(\eta|x) &= \log(C) + \sum_{i=1}^{m} -\frac{(\eta_i - M_i(x))^2}{2\sigma_i^2} && (8.24)
\end{aligned}
\]

with a constant $C$. Due to the monotonicity of the logarithm, the argument maximizing $P(\eta|x)$ is given by

\[
\begin{aligned}
\arg\max_{x \in \mathbb{R}^n} P(\eta|x) &= \arg\min_{x \in \mathbb{R}^n} -\log(P(\eta|x)) && (8.25) \\
&= \arg\min_{x \in \mathbb{R}^n} \sum_{i=1}^{m} \frac{(\eta_i - M_i(x))^2}{2\sigma_i^2} && (8.26) \\
&= \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|S^{-1}(\eta - M(x))\|_2^2 && (8.27)
\end{aligned}
\]

Thus, the least squares problem has a statistical interpretation. Note that since we might have different standard deviations $\sigma_i$ for different measurements $\eta_i$, we need to scale both measurements and model functions in order to obtain an objective in the usual least squares form $\|\hat{\eta} - \hat{M}(x)\|_2^2$, as

\[
\begin{aligned}
\min_x \frac{1}{2} \sum_{i=1}^{m} \left( \frac{\eta_i - M_i(x)}{\sigma_i} \right)^2 &= \min_x \frac{1}{2}\|S^{-1}(\eta - M(x))\|_2^2 && (8.28) \\
&= \min_x \frac{1}{2}\|S^{-1}\eta - S^{-1}M(x)\|_2^2 && (8.29)
\end{aligned}
\]

with $S = \mathrm{diag}(\sigma_1, \dots, \sigma_m)$, $\hat{\eta} = S^{-1}\eta$ and $\hat{M}(x) = S^{-1}M(x)$.


Statistical Interpretation of Regularization terms: Note that a regularization term like $\alpha\|x - \bar{x}\|_2^2$ that is added to the objective can be interpreted as a "pseudo measurement" $\bar{x}$ of the parameter value $x$, which includes a statistical assumption: the smaller $\alpha$, the larger we implicitly assume the standard deviation of this pseudo-measurement. As the data of a regularization term are usually given before the actual measurements, regularization is also often interpreted as "a priori knowledge". Note that not only the Euclidean norm with one scalar weighting $\alpha$ can be chosen, but many other forms of regularization are possible, e.g. terms of the form $\|A(x - \bar{x})\|_2^2$ with some matrix $A$.

8.4 L1-estimation

Instead of using $\|\cdot\|_2^2$, i.e. the L2-norm in equation (??), we might alternatively use $\|\cdot\|_1$, i.e. the L1-norm. This gives rise to the so called L1-estimation problem:

\[ \min_x \|\eta - M(x)\|_1 = \min_x \sum_{i=1}^{m} |\eta_i - M_i(x)| \tag{8.30} \]

Like the L2-estimation problem, the L1-estimation problem can also be interpreted statistically as a maximum-likelihood estimate. However, in the L1-case, the measurement errors are assumed to follow a Laplace distribution instead of a Gaussian.

An interesting observation is that the optimal L1-fit of a constant $x$ to a sample of different scalar values $\eta_1, \dots, \eta_m$ just gives the median of this sample, i.e.

\[ \arg\min_{x \in \mathbb{R}} \sum_{i=1}^{m} |\eta_i - x| = \text{median of } \{\eta_1, \dots, \eta_m\}. \tag{8.31} \]

Remember that the same problem with the L2-norm gave the average of the $\eta_i$. Generally speaking, the median is less sensitive to outliers than the average, and a detailed analysis shows that the solution to general L1-estimation problems is also less sensitive to a few outliers. Therefore, L1-estimation is sometimes also called "robust" parameter estimation.
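The difference between the two fits shows up clearly on a sample with a single outlier. The following sketch assumes NumPy; the sample values and the brute-force grid search (used here only to avoid relying on a dedicated L1 solver) are illustrative choices.

```python
import numpy as np

eta = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # hypothetical sample with one outlier

def l1_cost(x):
    """Objective of the constant L1 fit: sum_i |eta_i - x|."""
    return np.sum(np.abs(eta - x))

# Brute-force minimization of the L1 cost on a fine grid.
grid = np.linspace(0.0, 100.0, 10001)
x_l1 = grid[np.argmin([l1_cost(x) for x in grid])]

x_l2 = eta.mean()                                # minimizer of the L2 fit
print(x_l1, np.median(eta), x_l2)
```

The L1 fit stays at the median of the sample, while the average is pulled strongly towards the outlier.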


8.5 Gauss-Newton (GN) Method

Linear least squares problems can be solved easily. Solving nonlinear least squares problems globally is in general NP-hard, but in order to find a local minimum we can solve them iteratively, approximating the problem in each iteration by its linearization at the current guess. This way we obtain a better guess for the next iterate, etc., just as in Newton's method for root finding problems.

For nonlinear least squares problems of the form

\[ \min_x \underbrace{\frac{1}{2}\|\eta - M(x)\|_2^2}_{= f(x)} \tag{8.32} \]

the so called "Gauss-Newton (GN) method" is used. To describe this method, let us first for notational convenience introduce the shorthand $F(x) = \eta - M(x)$ and redefine the objective as

\[ f(x) = \frac{1}{2}\|F(x)\|_2^2 \tag{8.33} \]

where $F$ is a nonlinear function $F : \mathbb{R}^n \to \mathbb{R}^m$ with $m > n$ (more measurements than parameters). At a given point $x_k$ (iterate $k$), $F(x)$ is linearized, and the next iterate $x_{k+1}$ is obtained by solving a linear least squares problem. We expand

\[ F(x) \approx F(x_k) + J(x_k)(x - x_k) \tag{8.34} \]

where $J(x)$ is the Jacobian of $F(x)$, which is defined as

\[ J(x) = \frac{\partial F(x)}{\partial x}. \tag{8.35} \]

Then, $x_{k+1}$ can be found as the solution of the following linear least squares problem:

\[ x_{k+1} = \arg\min_x \frac{1}{2}\|F(x_k) + J(x_k)(x - x_k)\|_2^2 \tag{8.36} \]

For simplicity, we write $J(x_k)$ as $J$ and $F(x_k)$ as $F$:

\[
\begin{aligned}
x_{k+1} &= \arg\min_x \frac{1}{2}\|F + J(x - x_k)\|_2^2 && (8.37) \\
&= x_k + \arg\min_d \frac{1}{2}\|F + Jd\|_2^2 && (8.38) \\
&= x_k - (J^T J)^{-1} J^T F && (8.39) \\
&= x_k + d_k^{GN} && (8.40)
\end{aligned}
\]


The Gauss-Newton method is only applicable to least-squares problems, because the method linearizes the nonlinear function inside the L2-norm. Note that the matrix $J^T J$ appearing in the Gauss-Newton step might not always be invertible.
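The Gauss-Newton iteration can be sketched as follows. This is a minimal illustration under stated assumptions: NumPy is used, the function name `gauss_newton` and its stopping rule on $\|J^T F\|$ are hypothetical choices, and the one-parameter exponential model with noise-free data is made up for the example.

```python
import numpy as np

def gauss_newton(F, Jac, x0, tol=1e-10, max_iter=50):
    """Full-step Gauss-Newton for min 0.5*||F(x)||^2 (local convergence only)."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        Fx, J = F(x), Jac(x)
        if np.linalg.norm(J.T @ Fx) <= tol:      # gradient J^T F is (almost) zero
            break
        # d solves the linear least squares problem min_d 0.5*||Fx + J d||^2
        d = np.linalg.lstsq(J, -Fx, rcond=None)[0]
        x = x + d
    return x

# Hypothetical model M_i(x) = exp(x * t_i) with data generated for x = 0.5.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
eta = np.exp(0.5 * t)
F = lambda x: eta - np.exp(x[0] * t)                       # residual F(x) = eta - M(x)
Jac = lambda x: (-t * np.exp(x[0] * t)).reshape(-1, 1)     # Jacobian of F, shape (m, 1)

x_gn = gauss_newton(F, Jac, x0=[0.4])
print(x_gn)
```

Using a least squares solver for the step avoids forming $(J^T J)^{-1}$ explicitly, which is usually preferable numerically.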

8.6 Levenberg-Marquardt (LM) Method

This method is a generalization of the Gauss-Newton method that is in particular applicable if $J^T J$ is not invertible, and can lead to more robust convergence far from a solution. The Levenberg-Marquardt (LM) method makes the step smaller by penalizing the norm of the step. It defines the step as:

\[
\begin{aligned}
d_k^{LM} &= \arg\min_d \frac{1}{2}\|F(x_k) + J(x_k)d\|_2^2 + \frac{\alpha_k}{2}\|d\|_2^2 && (8.41) \\
&= -(J^T J + \alpha_k I)^{-1} J^T F && (8.42)
\end{aligned}
\]

with some $\alpha_k > 0$. Using this step, it iterates as usual

\[ x_{k+1} = x_k + d_k^{LM}. \tag{8.43} \]

If we made $\alpha_k$ very big, we would not correct the point, but would stay where we are: for $\alpha_k \to \infty$ we get $d_k^{LM} \to 0$. More precisely, $d_k^{LM} = -\frac{1}{\alpha_k} J^T F + O\!\left(\frac{1}{\alpha_k^2}\right)$. On the other hand, for small $\alpha_k$, i.e. for $\alpha_k \to 0$, we get $d_k^{LM} \to -J^+ F$.
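The two limiting cases of the LM step can be checked directly. The sketch assumes NumPy; the function name `lm_step` and the small Jacobian and residual are hypothetical test data.

```python
import numpy as np

def lm_step(J, F, alpha):
    """Levenberg-Marquardt step d = -(J^T J + alpha*I)^{-1} J^T F."""
    n = J.shape[1]
    return -np.linalg.solve(J.T @ J + alpha * np.eye(n), J.T @ F)

J = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
F = np.array([1.0, -1.0, 2.0])

d_small = lm_step(J, F, 1e-10)     # alpha -> 0: approaches the Gauss-Newton step -J^+ F
d_large = lm_step(J, F, 1e8)       # alpha -> infinity: approaches -(1/alpha) J^T F
print(d_small, -np.linalg.pinv(J) @ F)
print(d_large * 1e8, -J.T @ F)
```

For large $\alpha_k$ the step is thus a short step along the negative gradient $-J^T F$, which explains the more robust behaviour far from a solution.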

It is interesting to note that the gradient of the least squares objective function $f(x) = \frac{1}{2}\|F(x)\|_2^2$ equals

\[ \nabla f(x) = J(x)^T F(x), \tag{8.44} \]

which is the rightmost term in the step of both the Gauss-Newton and the Levenberg-Marquardt method. Thus, if the gradient equals zero, then also $d_k^{GN} = d_k^{LM} = 0$. This is a necessary condition for convergence to stationary points: the GN and LM methods both stay at a point $x_k$ with $\nabla f(x_k) = 0$. In the following chapter we will analyse in much more detail the convergence properties of these two methods, which are in fact part of a larger family, namely the "Newton type optimization methods".

We now discuss the convergence theory for these methods (we use a similar reasoning as in Theorem ??):


Theorem 8.6.1 Let $x^*$ be a point that satisfies (SOSC), and assume that locally around $x^*$ the quasi Newton iteration takes the form $x_{k+1} = x_k - H_k \nabla f(x_k)$, where $H_k$ is invertible for all $k$ and satisfies the following Lipschitz condition

\[ \|H_k(\nabla^2 f(x_k) - \nabla^2 f(y))\| \le M \|x_k - y\| \quad \forall k \in \mathbb{N},\; y \in \mathbb{R}^n, \]

and the following compatibility condition

\[ \|H_k(\nabla^2 f(x_k) - H_k^{-1})\| \le \kappa_k \quad \forall k \in \mathbb{N} \tag{8.45} \]

with $0 < M < \infty$ and $\kappa_k \le \kappa < 1$. We also assume that

\[ \|x_0 - x^*\| < \frac{2(1 - \kappa)}{M}. \tag{8.46} \]

Then $x_k$ converges to $x^*$ linearly if $\kappa_k > \kappa > 0$.

Proof: See the proof of Theorem ??.


Chapter 9

Globalisation Strategies

A Newton-type method only converges locally, namely if

\[
\begin{aligned}
\kappa + \frac{\omega}{2}\|x_0 - x^*\| &< 1 && (9.1) \\
&\Updownarrow && (9.2) \\
\|x_0 - x^*\| &< \frac{2(1 - \kappa)}{\omega} && (9.3)
\end{aligned}
\]

Recall that $\omega$ is a Lipschitz constant of the Hessian that bounds the nonlinearity of the problem, and $\kappa$ is a measure of the approximation error of the Hessian.

But what if $\|x_0 - x^*\|$ is too big to make Newton's method converge locally?

The general idea is to make the steps in the iteration shorter and to ensure descent: $f(x_{k+1}) < f(x_k)$. This shall result in $\nabla f(x_k) \to 0$. While doing this, we should not take too small steps and get stuck.

In this chapter two methods will be described to solve this problem: Line-search andTrust-region.


9.1 Line-search method

Each iteration of a line search method first computes a search direction $p_k$. The idea is to require $p_k$ to be a descent direction$^1$. The iteration is then given by

\[ x_{k+1} = x_k + t_k p_k \tag{9.4} \]

with $t_k \in (0, 1]$ a scalar called the step length ($t_k = 1$ in case of a full step Newton type method).

Computing the step length $t_k$ requires a tradeoff between a substantial reduction of $f$ and the computing speed of this minimization problem. Regard the ideal line search minimization:

\[ \min_t f(x_k + t p_k) \quad \text{s.t.} \quad t \in (0, 1] \tag{9.5} \]

Exact line search is not necessary; instead we ensure that (a) the steps are short enough to get sufficient decrease (descending must be relevant, see Figure 3.2 in [?]) and (b) long enough to not get stuck.

a) "Armijo's" sufficient decrease condition

Armijo stipulates that $t_k$ should give sufficient decrease in $f$:

\[ f(x_k + t_k p_k) \le f(x_k) + \gamma t_k \nabla f(x_k)^T p_k \tag{9.6} \]

with $\gamma \in (0, \frac{1}{2})$ the relaxation of the gradient. In practice $\gamma$ is chosen quite small, say $\gamma = 0.1$ or even smaller. Note that with $\gamma = 1$ the right hand side of equation ?? would be a first order Taylor expansion.

This condition alone, however, only ensures that the steps are not too long, and it is not sufficient to ensure that the algorithm makes fast enough progress.

Many ways exist to make sure that the steps do not get too short either, and we will just learn one of them.

b) Backtracking

$^1$ $p_k$ is a descent direction iff $\nabla f(x_k)^T p_k < 0$.

Backtracking chooses the step length by starting with $t = 1$ and checking it against Armijo's condition. If the Armijo condition is not satisfied, $t$ is reduced by a factor $\beta \in (0, 1)$. In practice $\beta$ is chosen to be not too small, e.g. $\beta = 0.8$.

A basic implementation of a) and b) can be found in Algorithm ??.

Algorithm 1 Backtracking with Armijo Condition

Inputs: $x_k$, $p_k$, $f(x_k)$, $\nabla f(x_k)^T p_k$, $\gamma$, $\beta$

Output: step length $t_k$

    t ← 1
    while f(x_k + t p_k) > f(x_k) + γ t ∇f(x_k)^T p_k do
        t ← β t
    end while
    t_k ← t
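Algorithm 1 translates almost line by line into code. The sketch below assumes NumPy; the function name `backtracking`, the argument names and the test function $f(x) = \|x\|^2$ with a deliberately overlong direction are hypothetical.

```python
import numpy as np

def backtracking(f, x, p, fx, slope, gamma=0.1, beta=0.8):
    """Backtracking line search with the Armijo condition.

    fx = f(x) and slope = grad f(x)^T p; slope must be negative (descent direction).
    """
    t = 1.0
    while f(x + t * p) > fx + gamma * t * slope:
        t *= beta
    return t

f = lambda x: x @ x                 # f(x) = ||x||^2
x = np.array([1.0, 1.0])
g = 2.0 * x                         # gradient of f at x
p = -5.0 * g                        # descent direction, but far too long for a full step
t = backtracking(f, x, p, f(x), g @ p)
print(t)
```

For this example the Armijo condition holds exactly for $t \le 0.18$, so the full step is rejected and the loop shrinks $t$ several times before accepting.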

9.2 Global convergence of the line search method

We will now state a general algorithm for Newton type line search, Algorithm ??.

Theorem [Global Convergence of Line-Search] Assume $f \in C^1$ (once continuously differentiable) with $\nabla f$ Lipschitz and $c_1 I \preceq B_k^{-1} \preceq c_2 I$ (eigenvalues of $B_k^{-1}$: $c_2 \ge \mathrm{eig}(B_k^{-1}) \ge c_1$) with $0 < c_1 \le c_2$. Then either Algorithm ?? stops with success, i.e. $\|\nabla f(x_k)\| \le TOL$, or $f(x_k) \to -\infty$, i.e. the problem is unbounded below.

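Algorithm 2 Newton type line search

Inputs: $x_0$, $TOL > 0$, $\beta \in (0, 1)$, $\gamma \in (0, \frac{1}{2})$

Output: $x^*$

    k ← 0
    while ‖∇f(x_k)‖ > TOL do
        obtain B_k ≻ 0
        p_k ← −B_k^{-1} ∇f(x_k)
        get t_k from the backtracking algorithm
        x_{k+1} ← x_k + t_k p_k
        k ← k + 1
    end while
    x* ← x_k

Note: For computational efficiency, $\nabla f(x_k)$ should only be evaluated once in each iteration.

Proof by contradiction: Assume that Algorithm ?? does not stop, i.e. $\|\nabla f(x_k)\| > TOL$ for all $k$, but that $f(x_k)$ is bounded below.

Because $f(x_{k+1}) \le f(x_k)$ and $f(x_k)$ is bounded below, we have $f(x_k) \to f^*$ for some finite $f^*$, which implies $[f(x_k) - f(x_{k+1})] \to 0$.

From Armijo (??), we have

\[
\begin{aligned}
f(x_k) - f(x_{k+1}) &\ge -\gamma t_k \nabla f(x_k)^T p_k && (9.7) \\
&= \gamma t_k \nabla f(x_k)^T B_k^{-1} \nabla f(x_k) && (9.8) \\
&\ge \gamma c_1 t_k \|\nabla f(x_k)\|_2^2 && (9.9)
\end{aligned}
\]

So we have already:

\[ \gamma c_1 t_k \|\nabla f(x_k)\|_2^2 \to 0. \tag{9.10} \]

If we can show that $t_k \ge t_{min} > 0$ for all $k$, our contradiction is complete ($\Rightarrow \|\nabla f(x_k)\|_2^2 \to 0$).

We show that $t_k \ge t_{min}$ with $t_{min} = \min\left(1, \frac{(1 - \gamma)\beta}{L c_2}\right) > 0$, where $L$ is the Lipschitz constant of $\nabla f$, i.e. $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.

For full steps $t_k = 1$, obviously $t_k \ge t_{min}$. In the other case, due to backtracking, the Armijo condition must have failed for the previous trial step $t = \frac{t_k}{\beta}$, otherwise we would have accepted it. Using the mean value theorem, $f(x_k + \frac{t_k}{\beta} p_k) - f(x_k) = \nabla f(x_k + \tau p_k)^T p_k \frac{t_k}{\beta}$ for some $\tau \in (0, \frac{t_k}{\beta})$, we get

\[
\begin{aligned}
& f\!\left(x_k + \tfrac{t_k}{\beta} p_k\right) > f(x_k) + \gamma \tfrac{t_k}{\beta} \nabla f(x_k)^T p_k && (9.11) \\
\Leftrightarrow\; & f\!\left(x_k + \tfrac{t_k}{\beta} p_k\right) - f(x_k) > \gamma \tfrac{t_k}{\beta} \nabla f(x_k)^T p_k && (9.12) \\
\Leftrightarrow\; & \nabla f(x_k + \tau p_k)^T p_k > \gamma \nabla f(x_k)^T p_k && (9.13) \\
\Leftrightarrow\; & \underbrace{(\nabla f(x_k + \tau p_k) - \nabla f(x_k))^T}_{\|\cdot\| \le L \tau \|p_k\|} p_k > (1 - \gamma) \underbrace{(-\nabla f(x_k)^T p_k)}_{= p_k^T B_k p_k} && (9.14) \\
\Rightarrow\; & L \tfrac{t_k}{\beta} \|p_k\|_2^2 > (1 - \gamma) \underbrace{p_k^T B_k p_k}_{\ge \frac{1}{c_2}\|p_k\|_2^2} && (9.15) \\
\Rightarrow\; & t_k > \frac{(1 - \gamma)\beta}{c_2 L} && (9.16)
\end{aligned}
\]

(Recall that $\frac{1}{c_1} \ge \mathrm{eig}(B_k) \ge \frac{1}{c_2}$.)

We have shown that the step length will not be shorter than $\frac{(1 - \gamma)\beta}{c_2 L}$, and will thus never become zero.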
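The Newton type line search with backtracking can be sketched in code as follows. This is an illustration under stated assumptions: NumPy is used, $B_k$ is taken to be the exact Hessian, and the function names and the test function $f(x) = \sum_i \log\cosh(x_i)$ (chosen because the full-step Newton method is known to diverge on it for starting points of moderately large magnitude) are hypothetical.

```python
import numpy as np

def newton_type_line_search(f, grad, hess_approx, x0, tol=1e-6,
                            beta=0.8, gamma=0.1, max_iter=200):
    """Newton type iteration x_{k+1} = x_k + t_k p_k with Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                                # evaluated once per iteration
        if np.linalg.norm(g) <= tol:
            break
        p = -np.linalg.solve(hess_approx(x), g)    # p_k = -B_k^{-1} grad f(x_k)
        t, fx, slope = 1.0, f(x), g @ p
        while f(x + t * p) > fx + gamma * t * slope:   # backtracking
            t *= beta
        x = x + t * p
    return x

f = lambda x: np.sum(np.log(np.cosh(x)))
grad = lambda x: np.tanh(x)
hess = lambda x: np.diag(1.0 / np.cosh(x) ** 2)    # B_k = exact Hessian, positive definite

x_min = newton_type_line_search(f, grad, hess, x0=[2.0, -1.5])
print(x_min)
```

From this starting point the first full Newton step would increase $f$, so backtracking shortens it; close to the minimizer $x^* = 0$ full steps are accepted again.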

9.3 Trust-region methods (TR)

"Line-search methods and trust-region methods both generate steps with the help of a quadratic model of the objective function, but they use this model in different ways. Line search methods use it to generate a search direction and then focus their efforts on finding a suitable step length α along this direction. Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function and then choose the step to be the approximate minimizer of the model in this region. In effect, they choose the direction and length of the step simultaneously. If a step is not acceptable, they reduce the size of the region and find a new minimizer. In general, the direction of the step changes whenever the size of the trust region is altered. The size of the trust region is critical to the effectiveness of each step." (cited from [?]).

The idea is to iterate xk+1 = xk + pk with

\[ p_k = \arg\min_{p \in \mathbb{R}^n} m_k(x_k + p) \quad \text{s.t.} \quad \|p\| \le \Delta_k \tag{9.17} \]

Equation (??) is called the TR-subproblem, and $\Delta_k > 0$ is called the TR-radius.

One particular advantage of this new type of subproblem is that we can even use indefinite Hessians without problems. Remember that, for an indefinite Hessian, the unconstrained quadratic model is not bounded below. A trust-region constraint will always ensure that the feasible set of the subproblem is bounded, so that it always has a well-defined minimizer.

Before defining the “trustworthiness” of a model, recall that:

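\[ m_k(x_k + p) = f(x_k) + \nabla f(x_k)^T p + \tfrac{1}{2} p^T B_k p \tag{9.18} \]

Definition [Trustworthiness]: A measure for the trustworthiness of a model is the ratio of actual and predicted decrease,

\[ \rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(x_k) - m_k(x_k + p_k)} = \frac{\mathrm{Ared}}{\mathrm{Pred}}, \tag{9.19} \]

where the denominator (the predicted decrease) is positive if $\|\nabla f(x_k)\| \ne 0$.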

We have f(xk + pk) < f(xk) only if ρk > 0. ρk ≈ 1 means a very trustworthy model.

The trust region algorithm is described in Algorithm ??.

The general convergence of the TR algorithm can be found in Theorem 4.5 in [?]

Algorithm 3 Trust Region

Inputs: $\Delta_{\max}$, $\eta \in [0, \tfrac{1}{4}]$ (threshold for accepting a step), $\Delta_0$, $x_0$, TOL > 0

Output: $x^*$

k = 0
while $\|\nabla f(x_k)\|$ > TOL do
    Solve the TR-subproblem (??) (approximately) and get $p_k$
    Compute $\rho_k$
    Adapt $\Delta_{k+1}$:
    if $\rho_k < \tfrac{1}{4}$ then
        $\Delta_{k+1} \leftarrow \tfrac{1}{4} \Delta_k$ (bad model: reduce radius)
    else if $\rho_k > \tfrac{3}{4}$ and $\|p_k\| = \Delta_k$ then
        $\Delta_{k+1} \leftarrow \min(2 \Delta_k, \Delta_{\max})$ (good model: increase radius, but not too much)
    else
        $\Delta_{k+1} \leftarrow \Delta_k$
    end if
    Decide on acceptance of step:
    if $\rho_k > \eta$ then
        $x_{k+1} \leftarrow x_k + p_k$ (we trust the model)
    else
        $x_{k+1} \leftarrow x_k$ ("null" step)
    end if
    $k \leftarrow k + 1$
end while
$x^* \leftarrow x_k$
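The trust-region loop can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the notes' reference implementation: the TR-subproblem is solved only approximately by the Cauchy point (the minimizer of the model along $-\nabla f(x_k)$ inside the trust region), $B_k$ is taken to be the exact Hessian, and the test function and all function names are hypothetical.

```python
import math

def cauchy_point(g, B, delta):
    """Minimize the quadratic model along -g subject to ||p|| <= delta."""
    n = len(g)
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    gBg = sum(g[i] * B[i][j] * g[j] for i in range(n) for j in range(n))
    tau = 1.0 if gBg <= 0 else min(1.0, gnorm ** 3 / (delta * gBg))
    return [-tau * delta / gnorm * gi for gi in g]

def trust_region(f, grad, hess, x, delta=1.0, delta_max=10.0, eta=0.1,
                 tol=1e-6, max_iter=500):
    n = len(x)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            break
        B = hess(x)
        p = cauchy_point(g, B, delta)
        pBp = sum(p[i] * B[i][j] * p[j] for i in range(n) for j in range(n))
        pred = -(sum(gi * pi for gi, pi in zip(g, p)) + 0.5 * pBp)  # predicted decrease
        x_trial = [xi + pi for xi, pi in zip(x, p)]
        rho = (f(x) - f(x_trial)) / pred                            # actual / predicted
        pnorm = math.sqrt(sum(pi * pi for pi in p))
        if rho > eta:                       # accept the step, otherwise "null" step
            x = x_trial
        if rho < 0.25:                      # bad model: reduce radius
            delta *= 0.25
        elif rho > 0.75 and pnorm >= (1 - 1e-12) * delta:
            delta = min(2 * delta, delta_max)   # good model: increase radius
    return x

# Hypothetical convex quadratic with minimizer (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]
hess = lambda x: [[2.0, 0.0], [0.0, 20.0]]
x_star = trust_region(f, grad, hess, [0.0, 0.0])
```

The Cauchy point only gives steepest-descent-like convergence. In practice a more accurate subproblem solver would be used, but the radius update and acceptance logic stay the same.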


Chapter 10

Calculating Derivatives

In the previous chapters we saw that we regularly need to calculate $\nabla f$ and $\nabla^2 f$. There are several methods for calculating these derivatives:

1. By hand

Expensive and error prone.

2. Symbolic differentiation

Using Mathematica or Maple. The disadvantage is that the result is often very long code that is expensive to evaluate.

3. Numerical differentiation (finite differences)

"Easy and fast, but inaccurate"

\[ \frac{f(x + tp) - f(x)}{t} \approx \nabla f(x)^T p \tag{10.1} \]

How should we choose $t$? If we take $t$ too small, the derivative will suffer from numerical noise. On the other hand, if we take $t$ too large, the linearization error will be dominant. A good rule of thumb is to use $t = \sqrt{\varepsilon_{\mathrm{mach}}}$, with $\varepsilon_{\mathrm{mach}}$ the machine precision (or the precision of $f$, if it is lower than the machine precision).

The accuracy of this method is $\sqrt{\varepsilon_{\mathrm{mach}}}$, which means in practice that we lose half the valid digits compared to the function evaluation. Second order derivatives are therefore even more difficult to calculate accurately.

4. "Imaginary trick" in MATLAB

If $f : \mathbb{R}^n \to \mathbb{R}$ is analytic, then for $t = 10^{-100}$ we have

\[ \nabla f(x)^T p = \frac{\Im(f(x + itp))}{t} \tag{10.2} \]

which can be calculated up to machine precision.

Proof: Define $g(z) = f(x + zp)$ and expand around $z = 0$:

\[ g(z) = g(0) + g'(0) z + \tfrac{1}{2} g''(0) z^2 + O(z^3) \]

\[ g(it) = g(0) + g'(0)\, it + \tfrac{1}{2} g''(0)\, i^2 t^2 + O(t^3) = g(0) - \tfrac{1}{2} g''(0) t^2 + g'(0)\, it + O(t^3) \]

\[ \Im(g(it)) = g'(0)\, t + O(t^3) \]

5. Automatic differentiation in forward and reverse mode

This is the main topic of this chapter.
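To compare methods 3 and 4 concretely, here is a small sketch in Python (rather than MATLAB) that evaluates the directional derivative of the hypothetical test function $f(x) = \sin(x_1 x_2) + \exp(x_1 x_2 x_3)$ both by a forward difference with $t = \sqrt{\varepsilon_{\mathrm{mach}}}$ and by the imaginary trick with $t = 10^{-100}$. The helper names are illustrative assumptions.

```python
import cmath
import math

def f(x):
    """f(x) = sin(x1*x2) + exp(x1*x2*x3); accepts real or complex entries."""
    return cmath.sin(x[0] * x[1]) + cmath.exp(x[0] * x[1] * x[2])

def fd_derivative(f, x, p, t=math.sqrt(2.0 ** -52)):
    """Forward difference (10.1) with t = sqrt(machine precision)."""
    x_shift = [xi + t * pi for xi, pi in zip(x, p)]
    return (f(x_shift).real - f(x).real) / t

def complex_step_derivative(f, x, p, t=1e-100):
    """Imaginary trick (10.2): Im(f(x + i t p)) / t."""
    x_shift = [xi + 1j * t * pi for xi, pi in zip(x, p)]
    return f(x_shift).imag / t

x = [1.0, 2.0, 0.5]
p = [1.0, 0.0, 0.0]                                # derivative with respect to x1
exact = 2.0 * math.cos(2.0) + math.exp(1.0)        # x2*cos(x1*x2) + x2*x3*exp(x1*x2*x3)
err_fd = abs(fd_derivative(f, x, p) - exact)
err_cs = abs(complex_step_derivative(f, x, p) - exact)
```

The finite-difference error `err_fd` is expected to be on the order of $\sqrt{\varepsilon_{\mathrm{mach}}}$, while the complex-step error `err_cs` is close to machine precision, because no subtraction of nearly equal numbers takes place.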

10.1 Automatic Differentiation (AD): Forward mode

Regard $f : \mathbb{R}^n \to \mathbb{R}$, composed of $m$ elementary operations:

"Independent variables" (input): $x_1, \dots, x_n$

"Intermediate variables": $x_{n+1}, \dots, x_{n+m-1}$

"Dependent variable" (output): $x_{n+m}$

Algorithm 4 Automatic Differentiation

Input: $x_1, \dots, x_n$

Output: $x_{n+m}$

for i = n + 1 to n + m do
    $x_i \leftarrow \phi_i(x_1, \dots, x_{i-1})$
end for

Note: each $\phi_i$ depends on at most two of $x_1, \dots, x_{i-1}$.

Idea of AD: use chain rule and differentiate each ϕi separately.

Notation: In AD we use a particularly convenient notation using “dot quantities” for all “forward derivatives”. We think of all functions as being dependent on a virtual “time” $t$, e.g. for $x$ or $f$ we assume $x(t)$, $f(x(t))$, even though we never really use the argument $t$ in the AD notation. But the virtual time allows us to understand the meaning of the “dot quantities”, which are then defined as $\dot{x} \equiv \frac{dx}{dt}$ and $\dot{f} \equiv \frac{df}{dt} = \nabla f(x)^T \dot{x}$.

The derivatives are given by

\[ \frac{dx_{n+i}}{dt} = \sum_{j < n+i} \frac{\partial \phi_{n+i}}{\partial x_j} \frac{dx_j}{dt}, \quad i = 1, \dots, m \tag{10.3} \]

with $\frac{dx_j}{dt} \equiv \dot{x}_j$.

Algorithm 5 Forward Automatic Differentiation

Input: $\dot{x}_1, \dots, \dot{x}_n$ (and all partial derivatives $\frac{\partial \phi_{n+i}}{\partial x_j}$)

Output: $\dot{x}_{n+m}$

for i = 1 to m do
    $\dot{x}_{n+i} \leftarrow \sum_{j < n+i} \frac{\partial \phi_{n+i}}{\partial x_j} \dot{x}_j$
end for

Remarks:

• in each sum, only one or two terms are non-zero,

• the previous two algorithms can be combined, eliminating the need to store the intermediate variables.

Example [Forward Automatic Differentiation]:

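\[ f(x_1, x_2, x_3) = \sin(x_1 x_2) + \exp(x_1 x_2 x_3) \]

Function evaluation:

$x_4 = x_1 x_2$

$x_5 = \sin(x_4)$

$x_6 = x_4 x_3$

$x_7 = \exp(x_6)$

$x_8 = x_5 + x_7$

Forward derivatives:

$\dot{x}_4 = \dot{x}_1 x_2 + x_1 \dot{x}_2$

$\dot{x}_5 = \cos(x_4) \dot{x}_4$

$\dot{x}_6 = \dot{x}_4 x_3 + x_4 \dot{x}_3$

$\dot{x}_7 = \exp(x_6) \dot{x}_6$

$\dot{x}_8 = \dot{x}_5 + \dot{x}_7$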

We can prove that cost(algo ??) ≤ cost(algo ??), or in other words that $\mathrm{cost}(\nabla f^T p) \le 2\, \mathrm{cost}(f)$ (note that $p \equiv \dot{x}$).

How do we get the full gradient of $f$? Call the algorithm $n$ times with $n$ different "seed" vectors $\dot{x} \in \mathbb{R}^n$:

\[ \dot{x} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \dots, \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} \tag{10.4} \]

And hence we have $\mathrm{cost}(\nabla f) \le 2n\, \mathrm{cost}(f)$.

AD forward is slightly more expensive than numerical finite differences (FD), but is exact up to machine precision.

Software: ADOL-C, Adic, Adifor.
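Forward AD can be illustrated with a tiny operator-overloading sketch. It is not how ADOL-C, Adic or Adifor work internally; the class `Dual` and the helpers `sin` and `exp` are hypothetical names introduced only for this example. Each variable carries its value together with its dot quantity, and the full gradient is obtained by calling the function once per seed vector.

```python
import math

class Dual:
    """Pair (value, dot) propagated through elementary operations."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # product rule: d(ab) = da * b + a * db
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

def exp(a):
    e = math.exp(a.val)
    return Dual(e, e * a.dot)

def f(x1, x2, x3):
    x4 = x1 * x2
    x5 = sin(x4)
    x6 = x4 * x3
    x7 = exp(x6)
    return x5 + x7            # x8

def gradient(x):
    """Call forward AD n times, once per unit seed vector as in (10.4)."""
    grad = []
    for i in range(len(x)):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(x)]
        grad.append(f(*args).dot)
    return grad

g = gradient([1.0, 2.0, 0.5])
```

Because value and derivative are updated together, no intermediate variables need to be stored beyond those currently in use.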

10.2 Automatic Differentiation: Reverse (Backward) Mode

Recall that in forward AD we used $\frac{dx_{n+i}}{dt} = \sum_{j < n+i} \frac{\partial \phi_{n+i}}{\partial x_j} \frac{dx_j}{dt}$. In reverse AD we will instead use

\[ \frac{df}{dx_i} = \sum_{j > \max(i, n)} \frac{df}{dx_j} \frac{\partial \phi_j}{\partial x_i} \tag{10.5} \]

Notation: In the reverse mode of AD we use “bar quantities” instead of the “dot quantities” that we used in the forward mode. These quantities can be interpreted as derivatives of the final output with respect to the respective intermediate quantity. We write $\bar{x}_i \equiv \frac{df}{dx_i}$, so that e.g. $\nabla f(x) = (\bar{x}_1, \dots, \bar{x}_n)^T$.

Algorithm 6 Reverse Automatic Differentiation

Input: all partial derivatives $\frac{\partial \phi_j}{\partial x_i}$

Output: $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n$

$\bar{x}_1, \bar{x}_2, \dots, \bar{x}_{n+m-1} \leftarrow 0$
$\bar{x}_{n+m} \leftarrow 1$
for j = n + m down to n + 1 do
    for all i < j do
        $\bar{x}_i \leftarrow \bar{x}_i + \bar{x}_j \frac{\partial \phi_j}{\partial x_i}$
    end for
end for

Example [Reverse Automatic Differentiation]:

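$\bar{x}_1, \bar{x}_2, \dots, \bar{x}_7 \leftarrow 0$

$\bar{x}_8 \leftarrow 1$

($j = 8$: $x_8 = x_5 + x_7$)

$\bar{x}_5 \leftarrow \bar{x}_5 + 1 \cdot \bar{x}_8$

$\bar{x}_7 \leftarrow \bar{x}_7 + 1 \cdot \bar{x}_8$

($j = 7$: $x_7 = \exp(x_6)$)

$\bar{x}_6 \leftarrow \bar{x}_6 + \exp(x_6) \bar{x}_7$

($j = 6$: $x_6 = x_4 x_3$)

$\bar{x}_4 \leftarrow \bar{x}_4 + x_3 \bar{x}_6$

$\bar{x}_3 \leftarrow \bar{x}_3 + x_4 \bar{x}_6$

($j = 5$: $x_5 = \sin(x_4)$)

$\bar{x}_4 \leftarrow \bar{x}_4 + \cos(x_4) \bar{x}_5$

($j = 4$: $x_4 = x_1 x_2$)

$\bar{x}_1 \leftarrow \bar{x}_1 + x_2 \bar{x}_4$

$\bar{x}_2 \leftarrow \bar{x}_2 + x_1 \bar{x}_4$

Output: $\bar{x}_1, \bar{x}_2, \bar{x}_3$ with $\nabla f(x) = (\bar{x}_1, \bar{x}_2, \bar{x}_3)^T$.

The gradient is returned in one reverse sweep. Furthermore, it can be shown that cost(algo ??) ≤ 5 cost(algo ??). In other words $\mathrm{cost}(\nabla f) \le 5\, \mathrm{cost}(f)$, regardless of the dimension $n$!

The only disadvantage is that, unlike in forward AD, you have to store all intermediate variables and partial derivatives in reverse AD.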
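The reverse sweep for the example $f(x_1, x_2, x_3) = \sin(x_1 x_2) + \exp(x_1 x_2 x_3)$ can be written out by hand as follows. This is a hand-coded sketch of the reverse-mode recipe for this one function, not a general AD tool; the function name `f_and_gradient` and the dictionary `bar` holding the bar quantities are illustrative choices.

```python
import math

def f_and_gradient(x1, x2, x3):
    # Forward sweep: evaluate and store all intermediate variables.
    x4 = x1 * x2
    x5 = math.sin(x4)
    x6 = x4 * x3
    x7 = math.exp(x6)
    x8 = x5 + x7

    # Reverse sweep: bar[i] accumulates df/dx_i.
    bar = {i: 0.0 for i in range(1, 8)}
    bar[8] = 1.0
    bar[5] += 1.0 * bar[8]              # j = 8: x8 = x5 + x7
    bar[7] += 1.0 * bar[8]
    bar[6] += math.exp(x6) * bar[7]     # j = 7: x7 = exp(x6)
    bar[4] += x3 * bar[6]               # j = 6: x6 = x4 * x3
    bar[3] += x4 * bar[6]
    bar[4] += math.cos(x4) * bar[5]     # j = 5: x5 = sin(x4)
    bar[1] += x2 * bar[4]               # j = 4: x4 = x1 * x2
    bar[2] += x1 * bar[4]
    return x8, [bar[1], bar[2], bar[3]]

value, grad = f_and_gradient(1.0, 2.0, 0.5)
```

One forward sweep plus one reverse sweep yields all three partial derivatives, whereas the forward mode needs one sweep per seed vector.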


Part III

Constrained Optimization


Chapter 11

The Lagrangian Function and Duality

Let us in this section regard a (not necessarily convex) NLP in standard form (??) with functions $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^p$, and $h : \mathbb{R}^n \to \mathbb{R}^q$.

Definition [Primal Optimization Problem]: We will denote the globally optimal value of the objective function subject to the constraints as the “primal optimal value” $p^*$, i.e.,

\[ p^* = \left( \min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad g(x) = 0, \ h(x) \ge 0 \right), \tag{11.1} \]

and we will denote this optimization problem as the “primal optimization problem”.

Definition [Lagrangian Function and Lagrange Multipliers]: We define the so-called “Lagrangian function” to be

\[ \mathcal{L}(x, \lambda, \mu) = f(x) - \lambda^T g(x) - \mu^T h(x). \tag{11.2} \]

Here, we have introduced the so-called “Lagrange multipliers” or “dual variables” $\lambda \in \mathbb{R}^p$ and $\mu \in \mathbb{R}^q$. The Lagrangian function plays a crucial role in both convex and general nonlinear optimization. We typically require the inequality multipliers $\mu$ to be nonnegative, $\mu \ge 0$, while the sign of the equality multipliers $\lambda$ is arbitrary. This is motivated by the following basic lemma.

Lemma [Lower Bound Property of Lagrangian]: If $x$ is a feasible point of (??) and $\mu \ge 0$, then

\[ \mathcal{L}(x, \lambda, \mu) \le f(x). \tag{11.3} \]

Proof: $\mathcal{L}(x, \lambda, \mu) = f(x) - \lambda^T g(x) - \mu^T h(x) \le f(x)$, because $g(x) = 0$, $\mu \ge 0$ and $h(x) \ge 0$.

Definition [Lagrange Dual Function]: We define the so-called “Lagrange dual function” as the unconstrained infimum of the Lagrangian over $x$, for fixed multipliers $\lambda, \mu$:

\[ q(\lambda, \mu) = \inf_{x \in \mathbb{R}^n} \mathcal{L}(x, \lambda, \mu). \tag{11.4} \]

This function will often take the value $-\infty$, in which case we will say that the pair $(\lambda, \mu)$ is “dual infeasible” for reasons that we motivate in the last example of this subsection.

Lemma [Lower Bound Property of Lagrange Dual]: If µ ≥ 0, then

q(λ, µ) ≤ p∗ (11.5)

Proof: The lemma is an immediate consequence of Eq. (??), which implies that $q(\lambda, \mu) \le f(x)$ holds for any feasible $x$. This inequality holds in particular for the global minimizer $x^*$ (which must be feasible), yielding $q(\lambda, \mu) \le f(x^*) = p^*$.

Theorem [Concavity of Lagrange Dual]: The function $q : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$ is concave, even if the original NLP was not convex.

Proof: We will show that $-q$ is convex. The Lagrangian $\mathcal{L}$ is an affine function in the multipliers $\lambda$ and $\mu$, which in particular implies that $-\mathcal{L}$ is convex in $(\lambda, \mu)$. Thus, the function $-q(\lambda, \mu) = \sup_x -\mathcal{L}(x, \lambda, \mu)$ is the supremum of convex functions in $(\lambda, \mu)$ that are indexed by $x$, and therefore convex.

A natural question to ask is what is the best lower bound that we can get from the Lagrange dual function. We obtain it by maximizing the Lagrange dual over all possible multiplier values, yielding the so-called “dual problem”.

Definition [Dual Problem]: The “dual problem” with “dual optimal value” $d^*$ is defined as the convex maximization problem

\[ d^* = \left( \max_{\lambda \in \mathbb{R}^p, \mu \in \mathbb{R}^q} q(\lambda, \mu) \quad \text{s.t.} \quad \mu \ge 0 \right). \tag{11.6} \]

It is interesting to note that the dual problem is always convex, even if the so-called “primal problem” is not.

As an immediate consequence of the last lemma, we obtain a very fundamental result that is called “weak duality”.

Theorem [Weak Duality]:

\[ d^* \le p^* \tag{11.7} \]

This theorem holds for any arbitrary optimization problem, but only unfolds its full strength in convex optimization, where very often a strong version of duality holds, which we will not prove in this course.

Theorem [Strong Duality]: If the primal optimization problem (??) is convex and a technical constraint qualification (e.g. Slater’s condition) holds, then primal and dual objective are equal to each other,

\[ d^* = p^*. \tag{11.8} \]

Strong duality allows us to reformulate a convex optimization problem into its dual, which looks very different but gives the same optimal value. We illustrate this with two examples.

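Example [Dual of a strictly convex QP]: We regard the following strictly convex QP (i.e., with $B \succ 0$)

\[ p^* = \min_{x \in \mathbb{R}^n} c^T x + \tfrac{1}{2} x^T B x \tag{11.9a} \]

\[ \text{subject to} \quad Ax - b = 0, \tag{11.9b} \]

\[ Cx - d \ge 0. \tag{11.9c} \]

Its Lagrangian function is given by

\[ \mathcal{L}(x, \lambda, \mu) = c^T x + \tfrac{1}{2} x^T B x - \lambda^T (Ax - b) - \mu^T (Cx - d) \]

\[ = \lambda^T b + \mu^T d + \tfrac{1}{2} x^T B x + \left(c - A^T \lambda - C^T \mu\right)^T x. \]

The Lagrange dual function is the infimum value of the Lagrangian with respect to $x$, which only enters the last two terms in the above expression. We obtain

\[ q(\lambda, \mu) = \lambda^T b + \mu^T d + \inf_{x \in \mathbb{R}^n} \left( \tfrac{1}{2} x^T B x + \left(c - A^T \lambda - C^T \mu\right)^T x \right) \]

\[ = \lambda^T b + \mu^T d - \tfrac{1}{2} \left(c - A^T \lambda - C^T \mu\right)^T B^{-1} \left(c - A^T \lambda - C^T \mu\right) \]

where we have made use of the basic result (??) in the last row.

Therefore, the dual optimization problem of the QP (??) is given by

\[ d^* = \max_{\lambda \in \mathbb{R}^p, \mu \in \mathbb{R}^q} \ -\tfrac{1}{2} c^T B^{-1} c + \begin{bmatrix} b + A B^{-1} c \\ d + C B^{-1} c \end{bmatrix}^T \begin{bmatrix} \lambda \\ \mu \end{bmatrix} - \tfrac{1}{2} \begin{bmatrix} \lambda \\ \mu \end{bmatrix}^T \begin{bmatrix} A \\ C \end{bmatrix} B^{-1} \begin{bmatrix} A \\ C \end{bmatrix}^T \begin{bmatrix} \lambda \\ \mu \end{bmatrix} \tag{11.10a} \]

\[ \text{subject to} \quad \mu \ge 0. \tag{11.10b} \]

Due to the fact that the objective is concave, this problem is again a convex QP, but not necessarily a strictly convex one. Note that the first term is a constant, but that we have to keep it in order to make sure that $d^* = p^*$, i.e. strong duality, holds.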
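Strong duality for a strictly convex QP can be verified on a small instance. The sketch below assumes a hypothetical problem with $B = \mathrm{diag}(2, 4)$, $c = (1, -1)^T$, a single equality constraint $x_1 + x_2 - 1 = 0$ and no inequality constraints, so that the dual has one scalar variable $\lambda$ and can be maximized in closed form. All data and names are illustrative assumptions.

```python
# Hypothetical strictly convex QP with one equality constraint and diagonal B.
B_diag = [2.0, 4.0]          # B = diag(2, 4), so B^{-1} = diag(1/2, 1/4)
c = [1.0, -1.0]
A = [1.0, 1.0]               # single constraint row: A x - b = 0
b = 1.0

def primal_objective(x):
    return sum(ci * xi for ci, xi in zip(c, x)) + 0.5 * sum(
        bi * xi * xi for bi, xi in zip(B_diag, x))

def dual_function(lam):
    """q(lam) = lam*b - 1/2 (c - A^T lam)^T B^{-1} (c - A^T lam)."""
    r = [ci - ai * lam for ci, ai in zip(c, A)]
    return lam * b - 0.5 * sum(ri * ri / bi for ri, bi in zip(r, B_diag))

# Dual maximizer from dq/dlam = b + A B^{-1} (c - A^T lam) = 0.
ABc = sum(ai * ci / bi for ai, ci, bi in zip(A, c, B_diag))
ABA = sum(ai * ai / bi for ai, bi in zip(A, B_diag))
lam_star = (b + ABc) / ABA

# Primal minimizer from stationarity of the Lagrangian: B x = A^T lam - c.
x_star = [(ai * lam_star - ci) / bi for ai, ci, bi in zip(A, c, B_diag)]

p_star = primal_objective(x_star)
d_star = dual_function(lam_star)
```

Here `p_star` and `d_star` coincide, and `dual_function(lam)` evaluated at any other multiplier gives a lower bound on `p_star`, in line with weak duality.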

Example [Dual of an LP]: Let us now regard the following LP

\[ p^* = \min_{x \in \mathbb{R}^n} c^T x \tag{11.11a} \]

\[ \text{subject to} \quad Ax - b = 0, \tag{11.11b} \]

\[ Cx - d \ge 0. \tag{11.11c} \]

Its Lagrangian function is given by

\[ \mathcal{L}(x, \lambda, \mu) = c^T x - \lambda^T (Ax - b) - \mu^T (Cx - d) = \lambda^T b + \mu^T d + \left(c - A^T \lambda - C^T \mu\right)^T x. \]

Here, the Lagrange dual is

\[ q(\lambda, \mu) = \lambda^T b + \mu^T d + \inf_{x \in \mathbb{R}^n} \left(c - A^T \lambda - C^T \mu\right)^T x = \lambda^T b + \mu^T d + \begin{cases} 0 & \text{if } c - A^T \lambda - C^T \mu = 0 \\ -\infty & \text{else.} \end{cases} \]

Thus, the objective function $q(\lambda, \mu)$ of the dual optimization problem is $-\infty$ at all points that do not satisfy the linear equality $c - A^T \lambda - C^T \mu = 0$. As we want to maximize, these points can be regarded as infeasible points of the dual problem (that is why we called them “dual infeasible”), and we can explicitly write the dual of the above LP (??) as

\[ d^* = \max_{\lambda \in \mathbb{R}^p, \mu \in \mathbb{R}^q} \begin{bmatrix} b \\ d \end{bmatrix}^T \begin{bmatrix} \lambda \\ \mu \end{bmatrix} \tag{11.12a} \]

\[ \text{subject to} \quad c - A^T \lambda - C^T \mu = 0, \tag{11.12b} \]

\[ \mu \ge 0. \tag{11.12c} \]

This is again an LP, and it can be proven that strong duality holds for all LPs for which at least one feasible point exists, i.e. we have $d^* = p^*$, even though the two problems look quite different.

Chapter 12

Optimality Conditions for Constrained Optimization

From now on, regard the general constrained minimization problem in standard form

\[ \min_{x \in \mathbb{R}^n} f(x) \tag{12.1a} \]

\[ \text{subject to} \quad g(x) = 0, \tag{12.1b} \]

\[ h(x) \ge 0, \tag{12.1c} \]

in which $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^q$ are smooth. Recall that the feasible set for this problem is defined as $\Omega = \{x \in \mathbb{R}^n \mid g(x) = 0, \ h(x) \ge 0\}$.

Definition [Tangent]: $p \in \mathbb{R}^n$ is called a "tangent" to $\Omega$ at $x^* \in \Omega$ if there exists a smooth curve $x(t) : [0, \epsilon) \to \mathbb{R}^n$ with $x(0) = x^*$, $x(t) \in \Omega \ \forall t \in [0, \epsilon)$ and $\frac{dx}{dt}(0) = p$.

Definition [Tangent Cone]: The "tangent cone" $T_\Omega(x^*)$ of $\Omega$ at $x^*$ is the set of all tangent vectors at $x^*$.

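Example [Tangent Cone]:

\[ h(x) = \begin{bmatrix} (x_1 - 1)^2 + x_2^2 - 1 \\ -(x_2 - 2)^2 - x_1^2 + 4 \end{bmatrix} \tag{12.2} \]

\[ x^* = \begin{bmatrix} 0 \\ 4 \end{bmatrix}: \quad T_\Omega(x^*) = \left\{ p \ \middle|\ p^T \begin{bmatrix} 0 \\ -1 \end{bmatrix} \ge 0 \right\} = \{ p \mid p_2 \le 0 \} \tag{12.3} \]

\[ x^* = \begin{bmatrix} 0 \\ 0 \end{bmatrix}: \quad T_\Omega(x^*) = \left\{ p \ \middle|\ p^T \begin{bmatrix} -1 \\ 0 \end{bmatrix} \ge 0 \ \& \ p^T \begin{bmatrix} 0 \\ 1 \end{bmatrix} \ge 0 \right\} = \{ p \mid p_1 \le 0, \ p_2 \ge 0 \} \tag{12.4} \]

At $x^* = (0, 4)^T$ only $h_2$ is active, with $\nabla h_2(x^*) = (0, -4)^T$. At $x^* = (0, 0)^T$ both constraints are active, with $\nabla h_1(x^*) = (-2, 0)^T$ and $\nabla h_2(x^*) = (0, 4)^T$.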

Insert figure for this example.

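12.1 Karush-Kuhn-Tucker (KKT) Necessary Optimality Conditions

Theorem [First Order Necessary Conditions, variant 0]: If $x^*$ is a local minimum of the NLP (??) then

1. $x^* \in \Omega$

2. for all tangents $p \in T_\Omega(x^*)$ holds: $\nabla f(x^*)^T p \ge 0$

Proof: (By contradiction) If $\exists p \in T_\Omega(x^*)$ with $\nabla f(x^*)^T p < 0$, there would exist a feasible curve $x(t)$ with $\left.\frac{d f(x(t))}{dt}\right|_{t=0} = \nabla f(x^*)^T p < 0$. Then $f$ would take values smaller than $f(x^*)$ at feasible points arbitrarily close to $x^*$, contradicting local optimality.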

12.2 Active constraints and constraint qualification

How can we characterize TΩ(x∗)?

Definition [Active/Inactive Constraint]: An inequality constraint $h_i(x) \ge 0$ is called "active" at $x^* \in \Omega$ iff $h_i(x^*) = 0$ and otherwise "inactive".

Definition [Active Set]: The index set $\mathcal{A}(x^*) \subset \{1, \dots, q\}$ of active constraints is called the "active set".

Remark: Inactive constraints do not influence $T_\Omega(x^*)$.

Definition [LICQ]: The "linear independence constraint qualification" (LICQ) holds at $x^* \in \Omega$ iff all vectors $\nabla g_i(x^*)$ for $i \in \{1, \dots, m\}$ and $\nabla h_i(x^*)$ for $i \in \mathcal{A}(x^*)$ are linearly independent.

Note: this is a technical condition, and is usually satisfied.

Insert figure that illustrates LICQ, and also illustrates why one should usually avoid replacing an equality with two inequalities.

Definition [Linearized Feasible Cone]: $\mathcal{F}(x^*) = \{p \mid \nabla g_i(x^*)^T p = 0, \ i = 1, \dots, m \ \& \ \nabla h_i(x^*)^T p \ge 0, \ i \in \mathcal{A}(x^*)\}$ is called the "linearized feasible cone" at $x^* \in \Omega$.

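Example [Linearized Feasible Cone]:

\[ h(x) = \begin{bmatrix} (x_1 - 1)^2 + x_2^2 - 1 \\ -(x_2 - 2)^2 - x_1^2 + 4 \end{bmatrix} \tag{12.5} \]

\[ x^* = \begin{bmatrix} 0 \\ 4 \end{bmatrix}, \quad \mathcal{A}(x^*) = \{2\} \tag{12.6} \]

\[ \nabla h_2(x) = \begin{bmatrix} -2 x_1 \\ -2 (x_2 - 2) \end{bmatrix} \tag{12.7} \]

\[ \nabla h_2(x^*) = \begin{bmatrix} 0 \\ -4 \end{bmatrix} \tag{12.8} \]

\[ \mathcal{F}(x^*) = \left\{ p \ \middle|\ \begin{bmatrix} 0 \\ -4 \end{bmatrix}^T p \ge 0 \right\} \tag{12.9} \]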

Lemma: At any x∗ ∈ Ω holds

1. TΩ(x∗) ⊂ F(x∗)

2. If LICQ holds at x∗ then TΩ(x∗) = F(x∗).

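Proof:

1. Let $p \in T_\Omega(x^*)$. Then

\[ \exists x(t) \text{ with } p = \left.\frac{dx}{dt}\right|_{t=0} \ \& \ x(0) = x^* \ \& \ x(t) \in \Omega \tag{12.10} \]

\[ \Rightarrow g(x(t)) = 0 \quad \forall t \in [0, \epsilon) \tag{12.11} \]

\[ h(x(t)) \ge 0 \tag{12.12} \]

\[ \Rightarrow \left.\frac{d g_i(x(t))}{dt}\right|_{t=0} = \nabla g_i(x^*)^T p = 0, \quad i = 1, \dots, m \tag{12.13} \]

For $i \in \mathcal{A}(x^*)$ we have $h_i(x^*) = 0$ and $h_i(x(t)) \ge 0$, so that

\[ \left.\frac{d h_i(x(t))}{dt}\right|_{t=0} = \lim_{t \to 0^+} \frac{h_i(x(t)) - h_i(x^*)}{t} \ge 0 \tag{12.14} \]

\[ \Leftrightarrow \left.\frac{d h_i(x(t))}{dt}\right|_{t=0} = \nabla h_i(x^*)^T p \ge 0 \tag{12.15} \]

\[ \Rightarrow p \in \mathcal{F}(x^*) \tag{12.16} \]

2. For the full proof see [Noc2006]. The idea is to use the implicit function theorem to construct a curve $x(t)$ which has a given vector $p \in \mathcal{F}(x^*)$ as tangent.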

Theorem (variant 1): If LICQ holds at x∗ and x∗ is a local minimizer for the NLP (??)then

1. x∗ ∈ Ω

2. ∀p ∈ F(x∗) : ∇f(x∗)Tp ≥ 0.

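How can we simplify the second condition? The following lemma helps. To interpret it, remember that $\mathcal{F}(x^*) = \{p \mid Gp = 0, \ Hp \ge 0\}$ with $G = \frac{dg}{dx}(x^*)$ and $H$ the matrix whose rows are $\nabla h_i(x^*)^T$ for $i \in \mathcal{A}(x^*)$.

Farkas’ Lemma: For any matrices $G \in \mathbb{R}^{m \times n}$, $H \in \mathbb{R}^{q \times n}$ and vector $c \in \mathbb{R}^n$ holds

\[ \text{either} \quad \exists \lambda \in \mathbb{R}^m, \mu \in \mathbb{R}^q \text{ with } \mu \ge 0 \ \& \ c = G^T \lambda + H^T \mu \tag{12.17} \]

\[ \text{or} \quad \exists p \in \mathbb{R}^n \text{ with } Gp = 0 \ \& \ Hp \ge 0 \ \& \ c^T p < 0 \tag{12.18} \]

but never both ("theorem of alternatives").

Proof: In the proof we use the "separating hyperplane theorem" with respect to the point $c \in \mathbb{R}^n$ and the set $S = \{G^T \lambda + H^T \mu \mid \lambda \in \mathbb{R}^m, \mu \in \mathbb{R}^q, \mu \ge 0\}$. $S$ is a closed convex cone. The separating hyperplane theorem states that two disjoint convex sets, in our case the set $S$ and the point $c$, can always be separated by a hyperplane. In our case, the hyperplane touches the set $S$ at the origin, and is described by a normal vector $p$. Separation of $S$ and $c$ means that $y^T p \ge 0$ holds for all $y \in S$ and, on the other hand, $c^T p < 0$.

\[ \text{Either} \quad c \in S \ \Leftrightarrow \ (??) \tag{12.19} \]

\[ \text{or} \quad c \notin S \tag{12.20} \]

\[ \Leftrightarrow \exists p \in \mathbb{R}^n : \forall y \in S : p^T y \ge 0 \ \& \ p^T c < 0 \tag{12.21} \]

\[ \Leftrightarrow \exists p \in \mathbb{R}^n : \forall \lambda, \mu \text{ with } \mu \ge 0 : p^T (G^T \lambda + H^T \mu) \ge 0 \ \& \ p^T c < 0 \tag{12.22} \]

\[ \Leftrightarrow \exists p \in \mathbb{R}^n : Gp = 0 \ \& \ Hp \ge 0 \ \& \ p^T c < 0 \ \Leftrightarrow \ (??) \tag{12.23} \]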

From Farkas’ lemma follows the desired simplification of the previous theorem:

Theorem (variant 2) [KKT Conditions]: If $x^*$ is a local minimizer of the NLP (??) and LICQ holds at $x^*$ then there exist $\lambda \in \mathbb{R}^m$ and $\mu \in \mathbb{R}^q$ with

\[ \nabla f(x^*) - \nabla g(x^*) \lambda - \nabla h(x^*) \mu = 0 \tag{12.24a} \]

\[ g(x^*) = 0 \tag{12.24b} \]

\[ h(x^*) \ge 0 \tag{12.24c} \]

\[ \mu \ge 0 \tag{12.24d} \]

\[ \mu_i h_i(x^*) = 0, \quad i = 1, \dots, q. \tag{12.24e} \]

Note: The KKT conditions are the first order necessary conditions for optimality (FONC) for constrained optimization, and are thus the equivalent of $\nabla f(x^*) = 0$ in unconstrained optimization.

Proof: We know already that (??), (??) $\Leftrightarrow x^* \in \Omega$. We have to show that (??), (??), (??) $\Leftrightarrow \forall p \in \mathcal{F}(x^*) : p^T \nabla f(x^*) \ge 0$. Using Farkas’ lemma we have

\[ \forall p \in \mathcal{F}(x^*) : p^T \nabla f(x^*) \ge 0 \ \Leftrightarrow \ \text{it is not true that } \exists p \in \mathcal{F}(x^*) : p^T \nabla f(x^*) < 0 \tag{12.25} \]

\[ \Leftrightarrow \exists \lambda, \mu_i \ge 0 : \nabla f(x^*) = \sum_i \nabla g_i(x^*) \lambda_i + \sum_{i \in \mathcal{A}(x^*)} \nabla h_i(x^*) \mu_i \tag{12.26} \]

Now we set all components of $\mu$ whose index is not an element of $\mathcal{A}(x^*)$ to zero, i.e. $\mu_i = 0$ if $h_i(x^*) > 0$, and conditions (??) and (??) are trivially satisfied, as well as (??) due to $\sum_{i \in \mathcal{A}(x^*)} \nabla h_i(x^*) \mu_i = \sum_{i = 1, \dots, q} \nabla h_i(x^*) \mu_i$ if $\mu_i = 0$ for $i \notin \mathcal{A}(x^*)$.

Though it is not necessary for the proof of the necessity of the optimality conditions of the above theorem (variant 2), we point out that the theorem is fully equivalent to variant 1, but has the computational advantage that its conditions can be checked easily: if someone gives you a triple $(x^*, \lambda, \mu)$ you can check if it is a KKT point or not.

Note: Using the definition of the Lagrangian, we have (??) $\Leftrightarrow \nabla_x \mathcal{L}(x^*, \lambda, \mu) = 0$. In absence of inequalities, the KKT conditions simplify to $\nabla_x \mathcal{L}(x, \lambda) = 0$, $g(x) = 0$, a formulation that is due to Lagrange and was known much earlier than the KKT conditions.

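Example [KKT Condition]:

\[ \min_{x \in \mathbb{R}^2} \begin{bmatrix} 0 \\ -1 \end{bmatrix}^T x \tag{12.28} \]

\[ \text{subject to} \quad \begin{bmatrix} x_1^2 + x_2^2 - 1 \\ -(x_2 - 2)^2 - x_1^2 + 4 \end{bmatrix} \ge 0 \tag{12.29} \]

Does the local minimizer $x^* = \begin{bmatrix} 0 \\ 4 \end{bmatrix}$ satisfy the KKT conditions?

First:

\[ \mathcal{A}(x^*) = \{2\} \tag{12.31} \]

\[ \nabla f(x^*) = \begin{bmatrix} 0 \\ -1 \end{bmatrix} \tag{12.32} \]

\[ \nabla h_2(x^*) = \begin{bmatrix} 0 \\ -4 \end{bmatrix} \tag{12.33} \]

Then we write down the KKT conditions, which for the specific dimensions of this example are equivalent to the right hand side terms:

\[ (??) \ \Leftrightarrow \ \nabla f(x^*) - \nabla h_1(x^*) \mu_1 - \nabla h_2(x^*) \mu_2 = 0 \tag{12.34} \]

\[ (??) \ \text{not applicable (no equality constraints)} \tag{12.35} \]

\[ (??) \ \Leftrightarrow \ h_1(x^*) \ge 0 \ \& \ h_2(x^*) \ge 0 \tag{12.36} \]

\[ (??) \ \Leftrightarrow \ \mu_1 \ge 0 \ \& \ \mu_2 \ge 0 \tag{12.37} \]

\[ (??) \ \Leftrightarrow \ \mu_1 h_1(x^*) = 0 \ \& \ \mu_2 h_2(x^*) = 0 \tag{12.38} \]

Finally we check that indeed, all five conditions are satisfied:

\[ (??) \ \Leftarrow \ \begin{bmatrix} 0 \\ -1 \end{bmatrix} - \begin{bmatrix} * \\ * \end{bmatrix} \mu_1 - \begin{bmatrix} 0 \\ -4 \end{bmatrix} \mu_2 = 0 \quad (h_1 \text{ is inactive, use } \mu_1 = 0, \ \mu_2 = \tfrac{1}{4}) \tag{12.39} \]

\[ (??) \ \text{not applicable} \tag{12.40} \]

\[ (??) \ \Leftarrow \ h_1(x^*) > 0 \ \& \ h_2(x^*) = 0 \tag{12.41} \]

\[ (??) \ \Leftarrow \ \mu_1 = 0 \ \& \ \mu_2 = \tfrac{1}{4} \ge 0 \tag{12.42} \]

\[ (??) \ \Leftarrow \ \mu_1 h_1(x^*) = 0 \cdot h_1(x^*) = 0 \ \& \ \mu_2 h_2(x^*) = \mu_2 \cdot 0 = 0 \tag{12.43} \]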
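Checking whether a given pair $(x^*, \mu)$ is a KKT point is purely mechanical, which the following sketch shows for the example above (objective $-x_2$, constraints $h_1(x) = x_1^2 + x_2^2 - 1$ and $h_2(x) = -(x_2 - 2)^2 - x_1^2 + 4$). The function name `is_kkt_point` and the tolerance are assumptions made for illustration.

```python
def is_kkt_point(x, mu, tol=1e-12):
    """Check stationarity, primal and dual feasibility, and complementarity."""
    grad_f = [0.0, -1.0]
    h = [x[0] ** 2 + x[1] ** 2 - 1.0,
         -(x[1] - 2.0) ** 2 - x[0] ** 2 + 4.0]
    grad_h = [[2.0 * x[0], 2.0 * x[1]],
              [-2.0 * x[0], -2.0 * (x[1] - 2.0)]]
    # Stationarity: grad f - sum_i mu_i * grad h_i = 0
    residual = [grad_f[k] - sum(mu[i] * grad_h[i][k] for i in range(2))
                for k in range(2)]
    stationary = all(abs(r) <= tol for r in residual)
    primal_feasible = all(hi >= -tol for hi in h)
    dual_feasible = all(mi >= 0.0 for mi in mu)
    complementary = all(abs(mi * hi) <= tol for mi, hi in zip(mu, h))
    return stationary and primal_feasible and dual_feasible and complementary

ok = is_kkt_point([0.0, 4.0], [0.0, 0.25])          # multipliers from the example
wrong_mu = is_kkt_point([0.0, 4.0], [0.0, 0.5])     # violates stationarity
not_compl = is_kkt_point([0.0, 4.0], [0.1, 0.25])   # mu_1 > 0 although h_1 is inactive
```

Only the multipliers found in the example pass all four checks.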

12.3 Convex problems

Theorem: Regard a convex NLP and a point $x^*$ at which LICQ holds. Then:

$x^*$ is a global minimizer $\iff \exists \lambda, \mu$ so that the KKT conditions hold.

Recall that the NLP

\[ \min_{x \in \mathbb{R}^n} f(x) \tag{12.44} \]

\[ \text{subject to} \quad g(x) = 0, \tag{12.45} \]

\[ -h(x) \le 0 \tag{12.46} \]

is convex if $f$ and all $-h_i$ are convex and $g$ is affine, i.e., $g(x) = Gx + a$.

Sketch of proof: We only need the "⇐"-direction.

• Assume $(x^*, \lambda, \mu)$ satisfies the KKT conditions.

• $\mathcal{L}(x, \lambda, \mu) = f(x) - \sum_i g_i(x) \lambda_i - \sum_i h_i(x) \mu_i$

• For fixed $\lambda$ and $\mu \ge 0$, $\mathcal{L}$ is a convex function of $x$, and its gradient is zero at $x^*$, $\nabla_x \mathcal{L}(x^*, \lambda, \mu) = 0$. Therefore, $x^*$ is a global minimizer of the unconstrained minimization problem $\min_x \mathcal{L}(x, \lambda, \mu) = q(\lambda, \mu)$.

• We know that $q(\lambda, \mu) \le d^* \le p^* = \min f(x) \ \text{s.t.} \ g(x) = 0, \ h(x) \ge 0$.

• $q(\lambda, \mu) = \mathcal{L}(x^*, \lambda, \mu) = f(x^*) - \sum_i g_i(x^*) \lambda_i - \sum_i h_i(x^*) \mu_i = f(x^*)$, because both sums are zero (by feasibility and complementarity), and $x^*$ is feasible, so $f(x^*) \ge p^*$. Hence $p^* = d^* = f(x^*)$ and $x^*$ is a global minimizer.

12.4 Complementarity

The last KKT condition is called the complementarity condition. Visualized, the situation for $h_i(x)$ and $\mu_i$ that satisfy the three conditions $h_i \ge 0$, $\mu_i \ge 0$ and $h_i \mu_i = 0$ is the following:

figure

Definition: Regard a KKT point $(x^*, \lambda, \mu)$. For $i \in \mathcal{A}(x^*)$ we say $h_i$ is weakly active if $\mu_i = 0$, otherwise, if $\mu_i > 0$, we call it strictly active. We say that strict complementarity holds at this KKT point iff all active constraints are strictly active. We define the set of weakly active constraints to be $\mathcal{A}_0(x^*, \mu)$ and the set of strictly active constraints $\mathcal{A}_+(x^*, \mu)$.

The sets are disjoint and $\mathcal{A}(x^*) = \mathcal{A}_0(x^*, \mu) \cup \mathcal{A}_+(x^*, \mu)$.

Note: strict complementarity makes many theorems easier.

12.5 Second order conditions

Definition: Regard the KKT point $(x^*, \lambda, \mu)$. The critical cone $C(x^*, \mu)$ is the following set:

\[ C(x^*, \mu) = \{p \mid \nabla g(x^*)^T p = 0, \ \nabla h_i(x^*)^T p = 0 \text{ if } i \in \mathcal{A}_+(x^*, \mu), \ \nabla h_i(x^*)^T p \ge 0 \text{ if } i \in \mathcal{A}_0(x^*, \mu)\} \tag{12.47} \]

Note: $C(x^*, \mu) \subset \mathcal{F}(x^*)$. In case that LICQ holds, even $C(x^*, \mu) \subset T_\Omega(x^*)$. Thus, the critical cone is a subset of all feasible directions. In fact, it contains all feasible directions which are, from first order information, neither uphill nor downhill directions, as the following theorem shows.

Theorem: Regard the KKT point $(x^*, \lambda, \mu)$ with LICQ, then $\forall p \in T_\Omega(x^*)$ holds

\[ p \in C(x^*, \mu) \ \Leftrightarrow \ \nabla f(x^*)^T p = 0. \tag{12.48} \]

Proof: Use $\nabla_x \mathcal{L}(x^*, \lambda, \mu) = 0$ to get for any $p \in C(x^*, \mu)$:

\[ \nabla f(x^*)^T p = \underbrace{\lambda^T \nabla g(x^*)^T p}_{=0} + \sum_{i, \mu_i > 0} \mu_i \underbrace{\nabla h_i(x^*)^T p}_{=0} + \sum_{i, \mu_i = 0} \underbrace{\mu_i \nabla h_i(x^*)^T p}_{=0} = 0 \tag{12.49} \]

Conversely, if $p \in T_\Omega(x^*)$ then all terms on the right hand side must be non-negative, so that $\nabla f(x^*)^T p = 0$ implies in particular $\sum_{i, \mu_i > 0} \mu_i \nabla h_i(x^*)^T p = 0$, which implies $\nabla h_i(x^*)^T p = 0$ for all $i \in \mathcal{A}_+(x^*, \mu)$, i.e. $p \in C(x^*, \mu)$.

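Example:

\[ \min x_2 \quad \text{s.t.} \quad 1 - x_1^2 - x_2^2 \ge 0 \tag{12.50} \]

\[ x^* = \begin{pmatrix} 0 \\ -1 \end{pmatrix} \tag{12.51} \]

\[ \nabla h(x) = \begin{pmatrix} -2 x_1 \\ -2 x_2 \end{pmatrix} \tag{12.52} \]

\[ \nabla f(x) = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \tag{12.53} \]

What is $\mu$?

\[ \nabla f(x^*) - \nabla h(x^*) \mu = 0 \tag{12.54} \]

\[ \begin{pmatrix} 0 \\ 1 \end{pmatrix} - \begin{pmatrix} 0 \\ 2 \end{pmatrix} \mu = 0 \ \Leftrightarrow \ \mu = \tfrac{1}{2} \tag{12.55} \]

So $x^* = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$, $\mu = \tfrac{1}{2}$ is a KKT point.

\[ T_\Omega(x^*) = \mathcal{F}(x^*) = \{p \mid \nabla h(x^*)^T p \ge 0\} = \left\{p \ \middle|\ \begin{pmatrix} 0 \\ 2 \end{pmatrix}^T p \ge 0\right\} \tag{12.56} \]

\[ C(x^*, \mu) = \{p \mid \nabla h(x^*)^T p = 0 \text{ if } \mu > 0\} \tag{12.57} \]

\[ = \left\{p \ \middle|\ \begin{pmatrix} 0 \\ 2 \end{pmatrix}^T p = 0\right\} \tag{12.58} \]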

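Theorem (SONC): Regard $x^*$ with LICQ. If $x^*$ is a local minimizer of the NLP, then:

i) $\exists \lambda, \mu$ so that the KKT conditions hold;

ii) $\forall p \in C(x^*, \mu)$ holds that $p^T \nabla_x^2 \mathcal{L}(x^*, \lambda, \mu) p \ge 0$.

Theorem (SOSC): If $x^*$ satisfies LICQ and

i) $\exists \lambda, \mu$ so that the KKT conditions hold;

ii) $\forall p \in C(x^*, \mu)$, $p \ne 0$, holds that $p^T \nabla_x^2 \mathcal{L}(x^*, \lambda, \mu) p > 0$

then $x^*$ is a local minimizer.

Note: $\nabla_x^2 \mathcal{L}(x^*, \lambda, \mu) = \nabla^2 f(x^*) - \sum_i \lambda_i \nabla^2 g_i(x^*) - \sum_i \mu_i \nabla^2 h_i(x^*)$, i.e. $\nabla_x^2 \mathcal{L}$ contains the curvature of the constraints.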

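Sketch of proof of both theorems: Regard the following restriction of the feasible set ($\bar{\Omega} \subset \Omega$):

\[ \bar{\Omega} = \{x \mid g(x) = 0, \ h_i(x) = 0 \text{ if } i \in \mathcal{A}_+(x^*, \mu), \ h_i(x) \ge 0 \text{ if } i \in \mathcal{A}_0(x^*, \mu)\} \tag{12.59} \]

The critical cone is the tangent cone of this set $\bar{\Omega}$.

First, for any feasible direction $p \in T_\Omega(x^*) \setminus C(x^*, \mu)$ we have $\nabla f(x^*)^T p > 0$. Thus, the difficult directions are those in the critical cone only.

So let us regard points in the set $\bar{\Omega}$. For fixed $\lambda, \mu$ we have for all $x \in \bar{\Omega}$:

\[ \mathcal{L}(x, \lambda, \mu) = f(x) - \sum_i \lambda_i \underbrace{g_i(x)}_{=0} - \sum_{i, \mu_i > 0} \mu_i \underbrace{h_i(x)}_{=0} - \sum_{i, \mu_i = 0} \underbrace{\mu_i h_i(x)}_{=0} \tag{12.60} \]

\[ = f(x) \tag{12.61} \]

Also: $\nabla_x \mathcal{L}(x^*, \lambda, \mu) = 0$.

So for all $x \in \bar{\Omega}$ we have:

\[ f(x) = \mathcal{L}(x, \lambda, \mu) \tag{12.62} \]

\[ = \underbrace{\mathcal{L}(x^*, \lambda, \mu)}_{= f(x^*)} + \underbrace{\nabla_x \mathcal{L}(x^*, \lambda, \mu)^T}_{=0} (x - x^*) + \tfrac{1}{2} (x - x^*)^T \nabla_x^2 \mathcal{L}(x^*, \lambda, \mu) (x - x^*) + o(\|x - x^*\|^2) \tag{12.63} \]

\[ = f(x^*) + \tfrac{1}{2} (x - x^*)^T \nabla_x^2 \mathcal{L}(x^*, \lambda, \mu) (x - x^*) + o(\|x - x^*\|^2) \tag{12.64} \]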

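Example: Regard the example from before:

\[ \mathcal{L}(x, \mu) = x_2 - \mu (1 - x_1^2 - x_2^2) \tag{12.65} \]

\[ \nabla_x \mathcal{L} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \mu \begin{pmatrix} 2 x_1 \\ 2 x_2 \end{pmatrix} \tag{12.66} \]

\[ \nabla_x^2 \mathcal{L} = 0 + \mu \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \tag{12.67} \]

For $\mu = \tfrac{1}{2}$ and $x^* = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$ we have:

\[ C(x^*, \mu) = \{p \mid \nabla h(x^*)^T p = 0\} = \left\{p \ \middle|\ \begin{pmatrix} 0 \\ 2 \end{pmatrix}^T p = 0\right\} = \left\{\begin{pmatrix} p_1 \\ 0 \end{pmatrix}\right\} \tag{12.68} \]

\[ p \in C \ \Rightarrow \ p = \begin{pmatrix} p_1 \\ 0 \end{pmatrix} \tag{12.69} \]

\[ \nabla_x^2 \mathcal{L}(x^*, \lambda, \mu) = \tfrac{1}{2} \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{12.70} \]

SONC:

\[ \begin{pmatrix} p_1 \\ 0 \end{pmatrix}^T \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} p_1 \\ 0 \end{pmatrix} = p_1^2 \ge 0 \tag{12.71} \]

SOSC:

\[ \text{if } p \ne 0, \ p \in C : \quad p^T \nabla_x^2 \mathcal{L}\, p > 0 \tag{12.72} \]

\[ \text{if } p_1 \ne 0 : \quad p_1^2 > 0 \tag{12.73} \]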

Example 2:

\[ \min x_2 \quad \text{s.t.} \quad 2 x_2 \ge x_1^2 - 2 - (x_2 + 1)^2 \tag{12.74} \]

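Here $x^* = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$, $\mu = \tfrac{1}{2}$ is still a KKT point.

\[ \nabla_x \mathcal{L}(x^*, \mu) = 0 \tag{12.75} \]

\[ \nabla_x^2 \mathcal{L}(x^*, \mu) = \mu \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix} \tag{12.76} \]

The Hessian of the Lagrangian is now indefinite, but on the critical cone, $p = (p_1, 0)^T$, we still have $p^T \nabla_x^2 \mathcal{L}(x^*, \mu)\, p = p_1^2 > 0$ for $p \ne 0$, so SOSC holds.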
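The second order conditions only involve the curvature of the Lagrangian along the critical cone. A minimal sketch for the two examples, assuming $\mu = \tfrac{1}{2}$ and the critical cone spanned by $p = (1, 0)^T$ as computed in (12.68); the helper name `quad_form` is hypothetical:

```python
def quad_form(H, p):
    """Return p^T H p for a small dense matrix H given as nested lists."""
    n = len(p)
    return sum(p[i] * H[i][j] * p[j] for i in range(n) for j in range(n))

mu = 0.5
H1 = [[2.0 * mu, 0.0], [0.0, 2.0 * mu]]     # Hessian (12.67) of the first example
H2 = [[2.0 * mu, 0.0], [0.0, -2.0 * mu]]    # Hessian (12.76) of Example 2

p_critical = [1.0, 0.0]    # spans the critical cone {p | (0, 2)^T p = 0}
p_normal = [0.0, 1.0]      # direction orthogonal to the critical cone

curv1 = quad_form(H1, p_critical)
curv2 = quad_form(H2, p_critical)
curv2_normal = quad_form(H2, p_normal)
```

Both `curv1` and `curv2` are positive, so SOSC holds in both examples, although `curv2_normal` is negative: the Hessian of Example 2 is indefinite, but the negative curvature lies outside the critical cone.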


Chapter 13

Equality constrained optimization

In this chapter the problem

\[ \min_{x \in \mathbb{R}^n} f(x) \tag{13.1a} \]

\[ \text{subject to} \quad g(x) = 0 \tag{13.1b} \]

with $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^m$, where $f$ and $g$ are both smooth functions, will be treated in further detail.

13.1 Optimality conditions

KKT condition

With the Lagrangian

\[ \mathcal{L}(x, \lambda) = f(x) - \lambda^T g(x) \tag{13.2} \]

the necessary KKT optimality conditions read

\[ \nabla_x \mathcal{L}(x^*, \lambda^*) = 0 \tag{13.3} \]

\[ g(x^*) = 0 \tag{13.4} \]

Keep in mind that these conditions are only guaranteed to hold at a local minimizer if LICQ holds, i.e. if the vectors $\nabla g_i(x^*)$ are linearly independent. Recall the definition of the gradient

\[ \nabla g(x) = (\nabla g_1(x), \nabla g_2(x), \dots, \nabla g_m(x)) \tag{13.5} \]

\[ = \frac{\partial g}{\partial x}(x)^T. \tag{13.6} \]

The rank of the matrix $\nabla g(x^*)$ must be $m$ for LICQ to hold. The tangent space is defined as

\[ T_\Omega(x^*) = \{p \mid \nabla g(x^*)^T p = 0\} \tag{13.7} \]

\[ = \ker(\nabla g(x^*)^T) \tag{13.8} \]

An explicit form of $\ker(\nabla g(x^*)^T)$ can be obtained from a basis of this space, collected in a matrix $Z \in \mathbb{R}^{n \times (n-m)}$ such that $\ker(\nabla g(x^*)^T) = \mathrm{Im}(Z)$, i.e. $\nabla g(x^*)^T Z = 0$ and $\mathrm{rank}(Z) = n - m$. This basis $(Z_1 Z_2 \dots Z_{n-m})$ can be obtained by using a QR-factorization of the matrix $\nabla g(x^*)$.

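SONC and SOSC

The SONC requires $p^T \nabla_x^2 \mathcal{L}(x^*, \lambda^*) p \ge 0$ for all $p$ in the tangent space, i.e. for all

\[ p = Zv \quad \text{with} \quad v \in \mathbb{R}^{n-m}, \tag{13.9} \]

which is equivalent to

\[ Z^T \nabla_x^2 \mathcal{L}(x^*, \lambda^*) Z \succeq 0. \tag{13.10} \]

The SOSC states that if

\[ Z^T \nabla_x^2 \mathcal{L}(x^*, \lambda^*) Z \succ 0 \tag{13.11} \]

and the LICQ and KKT conditions are satisfied, then $x^*$ is a local minimizer.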

13.2 Equality constrained QP

Regard the optimization problem

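\[ \min_x \ \tfrac{1}{2} x^T B x + g^T x \tag{13.12a} \]

\[ \text{subject to} \quad b + Ax = 0 \tag{13.12b} \]

with $B \in \mathbb{R}^{n \times n}$, $A \in \mathbb{R}^{m \times n}$, $B = B^T$. The KKT conditions lead to the equations

\[ Bx + g - A^T \lambda = 0 \tag{13.13a} \]

\[ b + Ax = 0. \tag{13.13b} \]

In matrix notation

\[ \begin{bmatrix} B & -A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = -\begin{bmatrix} g \\ b \end{bmatrix} \tag{13.14} \]

The left hand side matrix is nearly symmetric. Changing the sign of the multiplier yields a symmetric matrix:

\[ \begin{bmatrix} B & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ -\lambda \end{bmatrix} = -\begin{bmatrix} g \\ b \end{bmatrix} \tag{13.15} \]

Lemma [KKT-Matrix-Lemma]: Define the matrix

\[ \begin{bmatrix} B & A^T \\ A & 0 \end{bmatrix} \tag{13.16} \]

as the KKT matrix. Regard some matrices $B \in \mathbb{R}^{n \times n}$, $B = B^T$, and $A \in \mathbb{R}^{m \times n}$ with $m \le n$. If $\mathrm{rank}(A) = m$ ($A$ has full rank, i.e. LICQ holds) and $p^T B p > 0$ holds for all $p \in \ker(A)$, $p \ne 0$ (SOSC), then the KKT matrix is invertible.

For a proof, refer to Nocedal and Wright [Noc2006], 16.1.

Remark that for a QP

\[ B = \nabla_x^2 \mathcal{L}(x^*, \lambda^*) \tag{13.17} \]

\[ A = \nabla g(x)^T \tag{13.18} \]

so that the above invertibility condition is equivalent to SOSC.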


13.2.1 Solving the KKT system

Solving KKT systems is an important research topic, and there exist many ways to solve the system (??). Some methods are:

(i) Brute force: obtain an LU-factorization of the KKT matrix.

(ii) As the KKT matrix is not definite, a standard Cholesky decomposition does not work. Use an indefinite Cholesky decomposition.

(iii) Schur complement method or so-called “Range Space method”: first eliminate $x$ by the equation

\[ x = B^{-1}(A^T \lambda - g) \tag{13.19} \]

and plug it into the second equation (??). Get $\lambda$ from

\[ b + A\left(B^{-1}(A^T \lambda - g)\right) = 0. \tag{13.20} \]

This method requires that $B$ is invertible, which is not always true.

(iv) Null Space Method: First find a basis $Z \in \mathbb{R}^{n \times (n-m)}$ of $\ker(A)$ and set $x = Zv + y$ with $b + Ay = 0$ (a special solution). Every $x = Zv + y$ satisfies $b + Ax = 0$, so we have to regard only (??). This is an unconstrained problem

\[ \min_{v \in \mathbb{R}^{n-m}} \ g^T (Zv + y) + \tfrac{1}{2} (Zv + y)^T B (Zv + y) \tag{13.21a} \]

\[ \Leftrightarrow Z^T B Z v + Z^T g + Z^T B y = 0 \tag{13.21b} \]

\[ \Leftrightarrow v = -(Z^T B Z)^{-1} (Z^T g + Z^T B y). \tag{13.21c} \]

The matrix $Z^T B Z$ is called the “Reduced Hessian”. This method is always possible if SOSC holds. In practice, the matrices are often sparse.

(v) Sparse direct methods like sparse LU decomposition.

(vi) Iterative methods of linear algebra.
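Methods (i) and (iv) can be compared on a tiny equality constrained QP. The sketch below assumes hypothetical data $B = \mathrm{diag}(2, 4)$, $g = (1, -1)^T$, $A = (1 \ 1)$, $b = -1$, a hand-picked null space basis $Z = (1, -1)^T$ and the particular solution $y = (1, 0)^T$. The small dense solver `solve` stands in for an LU-factorization and is written out only to keep the example self-contained.

```python
def solve(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

B = [[2.0, 0.0], [0.0, 4.0]]
g = [1.0, -1.0]
A = [1.0, 1.0]        # one constraint: b + A x = 0
b = -1.0

# (i) Direct solve of [B -A^T; A 0] [x; lam] = -[g; b].
K = [[B[0][0], B[0][1], -A[0]],
     [B[1][0], B[1][1], -A[1]],
     [A[0], A[1], 0.0]]
x1, x2, lam = solve(K, [-g[0], -g[1], -b])
x_direct = [x1, x2]

# (iv) Null space method: x = Z v + y with A Z = 0 and b + A y = 0.
Z = [1.0, -1.0]
y = [1.0, 0.0]
By = [B[i][0] * y[0] + B[i][1] * y[1] for i in range(2)]
ZBZ = sum(Z[i] * B[i][j] * Z[j] for i in range(2) for j in range(2))  # reduced Hessian
v = -sum(Z[i] * (g[i] + By[i]) for i in range(2)) / ZBZ
x_null = [Z[i] * v + y[i] for i in range(2)]
```

Both approaches return the same minimizer. The null space method only needs the (here scalar) reduced Hessian, while the direct method also returns the multiplier.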


13.3 Newton Lagrange method

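Regard again the optimization problem (??) as stated at the beginning of the chapter. The idea now is to apply Newton’s method to solve the nonlinear KKT conditions

\[ \nabla_x \mathcal{L}(x, \lambda) = 0 \tag{13.22a} \]

\[ g(x) = 0 \tag{13.22b} \]

Define

\[ w = \begin{bmatrix} x \\ \lambda \end{bmatrix} \quad \text{and} \quad F(w) = \begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda) \\ g(x) \end{bmatrix} \tag{13.23} \]

with $w \in \mathbb{R}^{n+m}$, $F : \mathbb{R}^{n+m} \to \mathbb{R}^{n+m}$, so that the optimization is just a nonlinear root finding problem

\[ F(w) = 0, \tag{13.24} \]

which we solve again by Newton’s method:

\[ F(w_k) + \frac{\partial F}{\partial w}(w_k)(w - w_k) = 0 \tag{13.25} \]

Written in terms of gradients, the first block row reads

\[ \nabla_x \mathcal{L}(x_k, \lambda_k) + \nabla_x^2 \mathcal{L}(x_k, \lambda_k)(x - x_k) - \nabla g(x_k)(\lambda - \lambda_k) = 0 \tag{13.26} \]

Here $\nabla_x^2 \mathcal{L}(x_k, \lambda_k)(x - x_k)$ is the linearisation with respect to $x$, and $-\nabla g(x_k)(\lambda - \lambda_k)$ the linearisation with respect to $\lambda$. Recall that $\nabla_x \mathcal{L} = \nabla f - \nabla g \lambda$. The second block row reads

\[ g(x_k) + \nabla g(x_k)^T (x - x_k) = 0 \tag{13.27} \]

Written in matrix form an interesting result is obtained:

\[ \begin{bmatrix} \nabla_x \mathcal{L} \\ g \end{bmatrix} + \underbrace{\begin{bmatrix} \nabla_x^2 \mathcal{L} & \nabla g \\ \nabla g^T & 0 \end{bmatrix}}_{\text{KKT-matrix}} \begin{bmatrix} x - x_k \\ -(\lambda - \lambda_k) \end{bmatrix} = 0 \tag{13.28} \]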

The KKT-matrix is invertible if the KKT-matrix lemma holds. From this it is clear that at a solution $(x^*, \lambda^*)$ satisfying LICQ and SOSC, the KKT-matrix is invertible. This also holds in a neighborhood of $(x^*, \lambda^*)$. Thus, if $(x^*, \lambda^*)$ satisfies LICQ and SOSC, then the Newton method is well defined for all $(x_0, \lambda_0)$ in a neighborhood of $(x^*, \lambda^*)$ and converges Q-quadratically.

The method is stated as an algorithm in Algorithm ??.

Algorithm 7 Equality constrained Newton Lagrange method

    Choose: $x_0$, $\lambda_0$, $\epsilon$
    Set: $k = 0$
    while $\left\| \begin{bmatrix} \nabla_x L(x_k, \lambda_k) \\ g(x_k) \end{bmatrix} \right\| \ge \epsilon$ do
        get $\Delta x_k$ and $\Delta \lambda_k$ from (??)
        $x_{k+1} = x_k + \Delta x_k$
        $\lambda_{k+1} = \lambda_k + \Delta \lambda_k$
        $k = k + 1$
    end while
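The Newton Lagrange iteration can be sketched in a few lines of pure Python. The test problem (minimize x1 + x2 on the circle x1^2 + x2^2 = 2, with solution x* = (-1, -1) and multiplier -1/2 under the sign convention L = f - lambda g), the starting point, and the helper `solve` are assumptions chosen for illustration only.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def newton_lagrange(x, lam, eps=1e-10, max_iter=50):
    """Newton's method on the KKT conditions of
    min x1 + x2  s.t.  g(x) = x1^2 + x2^2 - 2 = 0  (assumed test problem)."""
    for k in range(max_iter):
        grad_L = [1.0 - 2.0 * lam * x[0], 1.0 - 2.0 * lam * x[1]]  # grad f - grad g * lam
        g = x[0] ** 2 + x[1] ** 2 - 2.0
        if max(abs(grad_L[0]), abs(grad_L[1]), abs(g)) < eps:
            break
        grad_g = [2.0 * x[0], 2.0 * x[1]]
        h = -2.0 * lam  # Hessian of the Lagrangian is h * I for this problem
        kkt = [[h, 0.0, grad_g[0]],
               [0.0, h, grad_g[1]],
               [grad_g[0], grad_g[1], 0.0]]
        # Unknowns are (dx1, dx2, -dlam), as in the matrix form (13.28).
        dx1, dx2, minus_dlam = solve(kkt, [-grad_L[0], -grad_L[1], -g])
        x = [x[0] + dx1, x[1] + dx2]
        lam = lam - minus_dlam
    return x, lam, k


x_star, lam_star, iterations = newton_lagrange([-1.2, -0.8], -0.4)
print(x_star, lam_star, iterations)
```

The iteration is only local: started far from a point satisfying LICQ and SOSC, the KKT matrix may become singular or the iterates may diverge.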

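Using the definitions

\[
\lambda_{k+1} = \lambda_k + \Delta \lambda_k \tag{13.29}
\]
\[
\nabla_x L(x_k, \lambda_k) = \nabla f(x_k) - \nabla g(x_k) \lambda_k \tag{13.30}
\]

the system (??) needed for calculating $\Delta \lambda_k$ and $\Delta x_k$ is equivalent to

\[
\begin{bmatrix} \nabla f(x_k) \\ g(x_k) \end{bmatrix}
+
\begin{bmatrix} \nabla_x^2 L & \nabla g \\ \nabla g^T & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_k \\ -\lambda_{k+1} \end{bmatrix}
= 0. \tag{13.31}
\]

This formulation shows that the new iterate does not depend strongly on the old multiplier guess, only via the Hessian matrix. We will later see that we can approximate the Hessian with different methods.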


13.4 Quadratic model interpretation

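Theorem: $x_{k+1}$ and $\lambda_{k+1}$ are obtained from the solution of a QP:

\[
\min_{x \in \mathbb{R}^n} \; \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla_x^2 L(x_k, \lambda_k)(x - x_k) \tag{13.32a}
\]
\[
\text{subject to} \quad g(x_k) + \nabla g(x_k)^T (x - x_k) = 0 \tag{13.32b}
\]

So we can compute a QP solution $x^{QP}$ with multiplier $\lambda^{QP}$ and take it as the next NLP solution guess $x_{k+1}$ and $\lambda_{k+1}$.

Proof: The KKT conditions of the QP are

\[
\nabla f(x_k) + \nabla_x^2 L(x_k, \lambda_k)(x^{QP} - x_k) - \nabla g(x_k) \lambda^{QP} = 0 \tag{13.33}
\]
\[
g(x_k) + \nabla g(x_k)^T (x^{QP} - x_k) = 0 \tag{13.34}
\]

which coincide with (13.31) for $\Delta x_k = x^{QP} - x_k$ and $\lambda_{k+1} = \lambda^{QP}$.

More generally, one can replace $\nabla_x^2 L(x_k, \lambda_k)$ by some approximation $B_k$ (with $B_k = B_k^T$, often $B_k \succ 0$), obtained by Quasi-Newton updates or other means.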

13.5 Constrained Gauss Newton

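Regard:

\[
\min_{x \in \mathbb{R}^n} \; \frac{1}{2}\|F(x)\|_2^2 \tag{13.35a}
\]
\[
\text{subject to} \quad g(x) = 0 \tag{13.35b}
\]

As in the unconstrained case, linearize both $F$ and $g$. Get the approximation

\[
\min_{x \in \mathbb{R}^n} \; \frac{1}{2}\|F(x_k) + J(x_k)(x - x_k)\|_2^2 \tag{13.36a}
\]
\[
\text{subject to} \quad g(x_k) + \nabla g(x_k)^T (x - x_k) = 0 \tag{13.36b}
\]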


This is a least-squares QP (LS-QP), which is convex. We call this the constrained Gauss-Newton method: the new iterate $x_{k+1}$ is obtained as the solution of (??)–(??) in each iteration. Note that no multipliers $\lambda_{k+1}$ are needed. Since

\[
\nabla_x \frac{1}{2}\|F + J(x - x_k)\|_2^2 = J^T J (x - x_k) + J^T F, \tag{13.37}
\]

the KKT conditions of the LS-QP are

\[
J^T J (x - x_k) + J^T F - \nabla g \, \lambda = 0 \tag{13.38}
\]
\[
g + \nabla g^T (x - x_k) = 0 \tag{13.39}
\]

This is the same system as in the Newton iteration, except that the Hessian is replaced: the constrained Gauss-Newton method gives a Newton-type iteration with $B_k = J^T J$. For LS, the exact Hessian is

\[
\nabla_x^2 L(x, \lambda) = J(x)^T J(x) + \sum_i F_i(x) \nabla^2 F_i(x) - \sum_i \lambda_i \nabla^2 g_i(x) \tag{13.40}
\]

One can show that $\|\lambda\|$ gets small if $\|F\|$ is small. As in the unconstrained case, CGN converges well if $\|F\| \approx 0$.

13.6 An equality constrained BFGS method


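Algorithm 8 Equality constrained BFGS method

    Choose $x_0$, $B_0$, tolerance
    $k = 0$
    Evaluate $\nabla f(x_0)$, $g(x_0)$, $\frac{\partial g}{\partial x}(x_0)$
    while $\|g(x_k)\| >$ tolerance or $\|\nabla_x L(x_k, \lambda_k)\| >$ tolerance do
        Solve the KKT-system (e.g. with CVX):
            $\begin{bmatrix} \nabla f \\ g \end{bmatrix} + \begin{bmatrix} B_k & \frac{\partial g}{\partial x}^T \\ \frac{\partial g}{\partial x} & 0 \end{bmatrix} \begin{bmatrix} p_k \\ -\tilde{\lambda}_k \end{bmatrix} = 0$
        Set $\Delta \lambda_k = \tilde{\lambda}_k - \lambda_k$
        Choose step length $t_k \in (0, 1]$ (details 11.7)
        $x_{k+1} = x_k + t_k p_k$
        $\lambda_{k+1} = \lambda_k + t_k \Delta \lambda_k$
        Compute old Lagrange gradient: $\nabla_x L(x_k, \lambda_{k+1}) = \nabla f(x_k) - \frac{\partial g}{\partial x}(x_k)^T \lambda_{k+1}$
        Evaluate $\nabla f(x_{k+1})$, $g(x_{k+1})$, $\frac{\partial g}{\partial x}(x_{k+1})$
        Compute new Lagrange gradient $\nabla_x L(x_{k+1}, \lambda_{k+1})$
        Set $s_k = x_{k+1} - x_k$
        Set $y_k = \nabla_x L(x_{k+1}, \lambda_{k+1}) - \nabla_x L(x_k, \lambda_{k+1})$
        Calculate $B_{k+1}$ (e.g. with a BFGS update) using $s_k$ and $y_k$
        $k = k + 1$
    end while

Remark: $B_{k+1}$ can alternatively be obtained by either calculating the exact Hessian $\nabla^2 L(x_{k+1}, \lambda_{k+1})$ or by calculating the Gauss-Newton Hessian ($J(x_{k+1})^T J(x_{k+1})$ for a LS objective function).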


13.7 Local convergence

Recall:

Theorem [Newton type convergence]: Regard the root finding problem $F(x) = 0$, $F : \mathbb{R}^n \to \mathbb{R}^n$, with $x^*$ a local solution, i.e. $F(x^*) = 0$, the Jacobian $J(x) = \frac{\partial F}{\partial x}(x)$, and the iteration $x_{k+1} = x_k - M_k^{-1} F(x_k)$ with $M_k \in \mathbb{R}^{n \times n}$ invertible for all $k$. Assume a Lipschitz condition $\|M_k^{-1}(J(x_k) - J(x^*))\| \le \omega \|x_k - x^*\|$, a compatibility condition $\|M_k^{-1}(J(x_k) - M_k)\| \le \kappa_k < \kappa$ with $\kappa < 1$, and $\|x_0 - x^*\| \le \frac{2}{\omega}(1 - \kappa)$. Then $x_k \to x^*$ with linear rate, or even quadratic rate if $\kappa = 0$, or superlinear rate if $\kappa_k \to 0$ (proof as before).

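Corollary: Newton-type constrained optimization converges

• quadratically if $B_k = \nabla^2 L(x_k, \lambda_k)$,

• superlinearly if $B_k \to \nabla^2 L(x_k, \lambda_k)$ (BFGS),

• linearly if $\|B_k - \nabla^2 L(x_k, \lambda_k)\|$ is not too big (Gauss-Newton).

Proof:

\[
J_k = \begin{bmatrix} \nabla^2 L(x_k, \lambda_k) & -\frac{\partial g}{\partial x}(x_k)^T \\ \frac{\partial g}{\partial x}(x_k) & 0 \end{bmatrix} \tag{13.41}
\]
\[
M_k = \begin{bmatrix} B_k & -\frac{\partial g}{\partial x}(x_k)^T \\ \frac{\partial g}{\partial x}(x_k) & 0 \end{bmatrix} \tag{13.42}
\]
\[
J_k - M_k = \begin{bmatrix} \nabla^2 L(x_k, \lambda_k) - B_k & 0 \\ 0 & 0 \end{bmatrix} \tag{13.43}
\]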


13.8 Globalization by line search

Idea: use a "merit function" to measure progress in both objective and constraints.

Definition [L1-merit function]: the "L1-merit function" is defined to be $T_1(x) = f(x) + \sigma \|g(x)\|_1$ with $\sigma > 0$.

Insert visualization.

Definition [directional derivative]: the "directional derivative of $F$ at $x$ in direction $p$" is

\[
DF(x)[p] = \lim_{t \to 0,\, t > 0} \frac{F(x + tp) - F(x)}{t}.
\]

Example [directional derivative]:

\[
F(x) = |x - 1| \tag{13.44}
\]
\[
DF(1)[2] = \lim_{t \to 0,\, t > 0} \frac{|1 + t \cdot 2 - 1| - |1 - 1|}{t} = 2 \tag{13.45}
\]
\[
DF(1)[-3] = \lim_{t \to 0,\, t > 0} \frac{|1 + t \cdot (-3) - 1| - |1 - 1|}{t} = 3 \tag{13.46}
\]
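The two limits in the example can be checked numerically with a one-sided difference quotient. This is only a sketch: the function name `directional_derivative` and the step size `t` are assumptions, and a finite `t` merely approximates the limit.

```python
def directional_derivative(F, x, p, t=1e-6):
    """One-sided difference quotient approximating DF(x)[p] for t -> 0, t > 0."""
    return (F(x + t * p) - F(x)) / t


def F(x):
    """The non-smooth example function F(x) = |x - 1|."""
    return abs(x - 1.0)


d_plus = directional_derivative(F, 1.0, 2.0)    # direction p = 2
d_minus = directional_derivative(F, 1.0, -3.0)  # direction p = -3
print(d_plus, d_minus)
```

Both values are positive although the directions have opposite signs, which reflects the kink of F at x = 1: the directional derivative is not linear in p there.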

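Lemma: If $p$ and $\lambda$ solve

\[
\begin{bmatrix} \nabla f \\ g \end{bmatrix}
+
\begin{bmatrix} B & \frac{\partial g}{\partial x}^T \\ \frac{\partial g}{\partial x} & 0 \end{bmatrix}
\begin{bmatrix} p \\ -\lambda \end{bmatrix}
= 0
\]

then

\[
DT_1(x)[p] = \nabla f(x)^T p - \sigma \|g(x)\|_1 \tag{13.47}
\]
\[
DT_1(x)[p] \le -p^T B p - (\sigma - \|\lambda\|_\infty)\|g(x)\|_1 \tag{13.48}
\]

Corollary: If $B \succ 0$ and $\sigma \ge \|\lambda\|_\infty$ then $p$ is a descent direction of $T_1$.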


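Proof of the lemma: Using $\frac{\partial g}{\partial x}(x) p = -g(x)$ from the second row of the KKT system, we get for $t \in (0, 1]$

\[
T_1(x + tp) = f(x + tp) + \sigma \|g(x + tp)\|_1 \tag{13.49}
\]
\[
= f(x) + t \nabla f(x)^T p + \sigma \left\| g(x) + t \frac{\partial g}{\partial x}(x) p \right\|_1 + O(t^2) \tag{13.50}
\]
\[
= f(x) + t \nabla f(x)^T p + \sigma \|g(x)(1 - t)\|_1 + O(t^2) \tag{13.51}
\]
\[
= f(x) + t \nabla f(x)^T p + \sigma (1 - t)\|g(x)\|_1 + O(t^2) \tag{13.52}
\]
\[
= T_1(x) + t \left( \nabla f(x)^T p - \sigma \|g(x)\|_1 \right) + O(t^2) \tag{13.53}
\]
\[
\Rightarrow (??) \tag{13.54}
\]

From the first row of the KKT system we get

\[
\nabla f(x) + B p - \frac{\partial g}{\partial x}(x)^T \lambda = 0 \tag{13.55}
\]
\[
\nabla f(x)^T p = \lambda^T \frac{\partial g}{\partial x}(x) p - p^T B p \tag{13.56}
\]
\[
= -\lambda^T g(x) - p^T B p \tag{13.57}
\]
\[
\nabla f(x)^T p \le \|\lambda\|_\infty \|g(x)\|_1 - p^T B p \tag{13.58}
\]
\[
\Rightarrow (??) \tag{13.59}
\]

In Algorithm ?? use Armijo backtracking with the L1-merit function and ensure $\sigma \ge \|\lambda\|_\infty$ (if not, increase $\sigma$).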

13.9 Careful BFGS updating

How can we make sure that Bk remains positive definite?

Lemma: If $B_k \succ 0$ and $y_k^T s_k > 0$ then $B_{k+1}$ from the BFGS update is positive definite.

Update Nocedal proof reference.


Proof: Nocedal?

This is as good as we can desire because:

Lemma: If $y_k^T s_k < 0$ and $B_{k+1} s_k = y_k$ then $B_{k+1}$ is not positive semidefinite.

Proof: $s_k^T B_{k+1} s_k = s_k^T y_k < 0$, i.e. $s_k$ is a direction of negative curvature of $B_{k+1}$.

Insert figure with proof.

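Powell's trick: If $y_k^T s_k < 0.2\, s_k^T B_k s_k$ then do the update with a $\tilde{y}_k$ instead of $y_k$, where $\tilde{y}_k = y_k + \theta (B_k s_k - y_k)$ with $\theta$ chosen so that $\tilde{y}_k^T s_k = 0.2\, s_k^T B_k s_k > 0$.

Remark: If $\theta = 1$ then $\tilde{y}_k = B_k s_k$ and $B_{k+1} = B_k$ (thus, the choice of $\theta$ between 0 and 1 damps the BFGS update if necessary).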
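Powell's trick can be sketched as follows. Solving the condition on the damped vector for the damping parameter gives theta = (0.2 s^T B s - y^T s) / (s^T B s - y^T s), which is the value used in the code. The function name `damped_bfgs_update`, the helpers and the 2x2 test data are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [dot(row, v) for row in M]


def damped_bfgs_update(B, s, y):
    """BFGS update of B with Powell's damping; returns (B_new, theta, y_tilde)."""
    Bs = matvec(B, s)
    sBs = dot(s, Bs)
    sy = dot(s, y)
    if sy < 0.2 * sBs:
        # Damping needed: choose theta so that y_tilde^T s = 0.2 * s^T B s.
        theta = (0.2 * sBs - sy) / (sBs - sy)
    else:
        theta = 0.0  # curvature condition is fine, use the plain BFGS update
    y_tilde = [yi + theta * (bsi - yi) for yi, bsi in zip(y, Bs)]
    yts = dot(y_tilde, s)
    n = len(s)
    # BFGS formula: B - (B s s^T B) / (s^T B s) + (y y^T) / (y^T s), with y = y_tilde
    B_new = [[B[i][j] - Bs[i] * Bs[j] / sBs + y_tilde[i] * y_tilde[j] / yts
              for j in range(n)] for i in range(n)]
    return B_new, theta, y_tilde


# Assumed test data with negative curvature y^T s = -1 < 0.
B0 = [[1.0, 0.0], [0.0, 1.0]]
s0 = [1.0, 0.0]
y0 = [-1.0, 0.5]
B1, theta0, y_tilde0 = damped_bfgs_update(B0, s0, y0)
print(theta0, y_tilde0, B1)
```

In this example the undamped update would produce an indefinite matrix, while the damped one keeps the updated matrix positive definite and still satisfies the secant equation with the damped vector.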


Chapter 14

Inequality Constrained Optimization Algorithms

For simplicity, drop equalities and regard:

\[
\min_{x} \; f(x) \quad \text{s.t.} \quad h(x) \ge 0 \tag{14.1}
\]

In the KKT conditions we had (for $i = 1, \ldots, q$):

1. $\nabla f(x) - \sum_{i=1}^{q} \nabla h_i(x) \mu_i = 0$

2. $h_i(x) \ge 0$

3. $\mu_i \ge 0$

4. $\mu_i h_i(x) = 0$

Conditions 2, 3 and 4 are non-smooth, which implies that Newton's method cannot be applied directly here.


14.1 Quadratic Programming via active set method

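Regard the QP problem to be solved:

\[
\min_{x} \; g^T x + \frac{1}{2} x^T B x \quad \text{s.t.} \quad Ax + b \ge 0 \tag{14.2}
\]

Assume a convex QP ($B \succeq 0$). The KKT conditions are necessary and sufficient for global optimality (this is the basis for the algorithm):

\[
B x^* + g - A^T \mu^* = 0 \tag{14.3}
\]
\[
A x^* + b \ge 0 \tag{14.4}
\]
\[
\mu^* \ge 0 \tag{14.5}
\]
\[
\mu_i^* (A x^* + b)_i = 0 \tag{14.6}
\]

for $i = 1, \ldots, q$.

How do we find $x^*$, $\mu^*$ and the corresponding active set $\mathcal{A}(x^*) \subset \{1, \ldots, q\}$ so that KKT holds?

Definition: Index Set

\[
\mathcal{A} \subset \{1, \ldots, q\} \quad \text{"Active"} \tag{14.7}
\]
\[
\mathcal{I} = \{1, \ldots, q\} \setminus \mathcal{A} \quad \text{"Inactive"} \tag{14.8}
\]

Vector division

\[
b = \begin{pmatrix} b_{\mathcal{A}} \\ b_{\mathcal{I}} \end{pmatrix}, \quad b \in \mathbb{R}^q \tag{14.9}
\]

Matrix division

\[
A = \begin{pmatrix} A_{\mathcal{A}} \\ A_{\mathcal{I}} \end{pmatrix} \tag{14.10}
\]

i.e.

\[
Ax + b \ge 0 \iff A_{\mathcal{A}} x + b_{\mathcal{A}} \ge 0 \;\text{ and }\; A_{\mathcal{I}} x + b_{\mathcal{I}} \ge 0 \tag{14.11}
\]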


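Lemma: $x^*$ is a global minimizer of the QP iff there exist index sets $\mathcal{A}$ and $\mathcal{I}$ and a vector $\mu_{\mathcal{A}}^*$ so that:

\[
B x^* + g - A_{\mathcal{A}}^T \mu_{\mathcal{A}}^* = 0 \tag{14.12}
\]
\[
A_{\mathcal{A}} x^* + b_{\mathcal{A}} = 0 \tag{14.13}
\]
\[
A_{\mathcal{I}} x^* + b_{\mathcal{I}} \ge 0 \tag{14.14}
\]
\[
\mu_{\mathcal{A}}^* \ge 0 \tag{14.15}
\]

and

\[
\mu^* = \begin{pmatrix} \mu_{\mathcal{A}}^* \\ \mu_{\mathcal{I}}^* \end{pmatrix} \quad \text{with } \mu_{\mathcal{I}}^* = 0 \tag{14.16}
\]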

Active set method idea:

• Choose a set A

• Solve (??) and (??) to get x∗ and µ∗

• Check if (??) and (??) are satisfied

– YES: Solution found

– NO: Change set A by adding or removing constraint indices

For the last step many variants exist: primal, dual, primal-dual, online, ... Implementations include e.g. QPSOL, quadprog (Matlab) and qpOASES.

Primal active set method in detail

1. Choose a feasible starting point $x_0$ with corresponding active set $\mathcal{A}_0$

2. Set $k = 0$ and start the iteration

3. Solve for $\tilde{x}_k$ and $\tilde{\mu}_k$:

\[
g + B \tilde{x}_k - A_{\mathcal{A}_k}^T \tilde{\mu}_k = 0 \tag{14.17}
\]
\[
A_{\mathcal{A}_k} \tilde{x}_k + b_{\mathcal{A}_k} = 0 \tag{14.18}
\]

4. Go on a line from $x_k$ to $\tilde{x}_k$: $x_{k+1} = x_k + t_k (\tilde{x}_k - x_k)$ with some $t_k \in [0, 1]$ so that $x_{k+1}$ is feasible

(a) If $t_k < 1$, add a blocking constraint $i^*$ to $\mathcal{A}$: $\mathcal{A}_{k+1} = \mathcal{A}_k \cup \{i^*\}$, set $k = k + 1$ and go back to (3)

(b) If $t_k = 1$ is possible then $\tilde{x}_k$ is feasible, and we only need to check if $\tilde{\mu}_k \ge 0$.

• If YES: Solution is found

• If NO: Drop an index $i^{**}$ in $\mathcal{A}_k$ with $\tilde{\mu}_{k,i^{**}} < 0$, i.e. $\mathcal{A}_{k+1} = \mathcal{A}_k \setminus \{i^{**}\}$, set $k = k + 1$ and go back to (3)

Remark: we can prove that $f(x_{k+1}) \le f(x_k)$ (with $f$ the quadratic performance index).
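The steps of the primal active set method can be sketched on a deliberately simple special case: a separable QP with diagonal B and lower bounds x - lb >= 0 (so A is the identity and b = -lb). In that case the equality-constrained subproblem of step 3 decouples and needs no linear solver. The function name `active_set_box_qp`, this simplification, and the test data are assumptions for illustration; a general implementation would solve the KKT system (14.17)-(14.18) instead.

```python
def active_set_box_qp(d, g, lb, x, max_iter=100):
    """Primal active set method for
        min sum_i (g_i x_i + 0.5 d_i x_i^2)  s.t.  x_i - lb_i >= 0,
    with d_i > 0 and a feasible starting point x."""
    n = len(x)
    active = {i for i in range(n) if x[i] == lb[i]}
    for _ in range(max_iter):
        # Step 3: solve the equality-constrained QP for the current active set.
        x_tilde = [lb[i] if i in active else -g[i] / d[i] for i in range(n)]
        mu = {i: d[i] * x_tilde[i] + g[i] for i in active}  # from B x + g - A^T mu = 0
        # Step 4: largest t in [0, 1] keeping x + t (x_tilde - x) feasible.
        t, blocking = 1.0, None
        for i in range(n):
            p = x_tilde[i] - x[i]
            if i not in active and p < 0.0:
                t_i = (lb[i] - x[i]) / p
                if t_i < t:
                    t, blocking = t_i, i
        x = [x[i] + t * (x_tilde[i] - x[i]) for i in range(n)]
        if blocking is not None:
            # (a) t < 1: add the blocking constraint to the active set.
            x[blocking] = lb[blocking]
            active.add(blocking)
            continue
        # (b) full step taken: check the multiplier signs.
        negative = [i for i in active if mu[i] < 0.0]
        if not negative:
            return x, mu, active
        active.remove(min(negative, key=lambda i: mu[i]))
    raise RuntimeError("no solution found within max_iter iterations")


x_opt, mu_opt, active_opt = active_set_box_qp(
    d=[1.0, 1.0, 2.0], g=[-2.0, 1.0, -2.0], lb=[0.0, 0.0, 2.0], x=[0.0, 3.0, 2.0])
print(x_opt, mu_opt, active_opt)
```

On this data the method first hits the bound of the second variable (a blocking constraint is added), then drops the bound of the first variable because its multiplier is negative, and finally takes a full step to the solution.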

14.2 Sequential Quadratic Programming (SQP)

Regard the NLP:

\[
\min_{x} \; f(x) \quad \text{s.t.} \quad h(x) \ge 0 \tag{14.19}
\]

The SQP idea is to solve in each iteration the QP:

\[
\min_{p} \; \nabla f(x_k)^T p + \frac{1}{2} p^T B p \quad \text{s.t.} \quad h(x_k) + \frac{\partial h}{\partial x}(x_k) p \ge 0 \tag{14.20}
\]

Local convergence would follow from equality constrained optimization if the active set of the QP is the same as the active set of the NLP, at least in the last iterations.


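Theorem [ROBINSON]: If $x^*$ is a local minimizer of the NLP with LICQ and strict complementarity, and if $x_k$ is close enough to $x^*$, $B \succeq 0$, and $B$ is positive definite on $\ker\left(\frac{\partial h}{\partial x}\right)$, then the solution of the QP has the same active set as the NLP.

Proof of [ROBINSON]: Define $\mathcal{A} = \mathcal{A}(x^*)$ and regard:

\[
\nabla f(x) + B p - \frac{\partial h_{\mathcal{A}}}{\partial x}(x)^T \mu_{\mathcal{A}}^{QP} = 0 \tag{14.21}
\]
\[
h_{\mathcal{A}}(x) + \frac{\partial h_{\mathcal{A}}}{\partial x}(x) p = 0 \tag{14.22}
\]

This defines an implicit function

\[
\begin{pmatrix} p(x, B) \\ \mu_{\mathcal{A}}^{QP}(x, B) \end{pmatrix} \tag{14.23}
\]

with

\[
p(x^*, B) = 0 \quad \text{and} \quad \mu_{\mathcal{A}}^{QP}(x^*, B) = \mu_{\mathcal{A}}^* \tag{14.24}
\]

This follows from

\[
\nabla f(x^*) + B \cdot 0 - \frac{\partial h_{\mathcal{A}}}{\partial x}(x^*)^T \mu_{\mathcal{A}}^* = 0 \iff \nabla_x L(x^*, \mu^*) = 0 \tag{14.25}
\]
\[
h_{\mathcal{A}}(x^*) + \frac{\partial h_{\mathcal{A}}}{\partial x}(x^*) \cdot 0 = 0 \tag{14.26}
\]

which hold because of

\[
h_{\mathcal{A}}(x^*) = 0 \tag{14.27}
\]
\[
h_{\mathcal{I}}(x^*) > 0 \tag{14.28}
\]
\[
\mu_{\mathcal{I}}^* = 0 \tag{14.29}
\]

Note that $\mu_{\mathcal{A}}^* > 0$ because of strict complementarity.

For $x \simeq x^*$, due to continuity of $p(x, B)$ and $\mu_{\mathcal{A}}^{QP}(x, B)$, we still have $h_{\mathcal{I}}(x) > 0$ and

\[
\mu_{\mathcal{A}}^{QP}(x, B) > 0 \tag{14.30}
\]

and even more:

\[
h_{\mathcal{I}}(x) + \frac{\partial h_{\mathcal{I}}}{\partial x}(x) p(x, B) > 0 \tag{14.31}
\]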


Therefore a solution of the QP has the same active set as the NLP and also satisfies strict complementarity.

Remark: we can generalise this theorem to the case where the Jacobian $\frac{\partial h}{\partial x}(x_k)$ is only approximated.

14.3 Powell’s classical SQP algorithm

For an equality and inequality constrained NLP, we can use the BFGS algorithm as before, but:

1. We solve an inequality constrained QP instead of a linear system.

2. We use the merit function $T_1(x) = f(x) + \sigma \|g(x)\|_1 + \sigma \sum_{i=1}^{q} |\min(0, h_i(x))|$.

3. We use the full Lagrange gradient $\nabla_x L(x, \lambda, \mu)$ in the BFGS formula

(e.g. "fmincon" in Matlab).

14.4 Interior Point Methods

The IP method is an alternative to the active set method for QPs or LPs and to the SQP method. The previous methods had problems with the non-smoothness in the KKT-conditions (2), (3) and (4) (for $i = 1, \ldots, q$):

1. $\nabla f(x) - \sum_{i=1}^{q} \nabla h_i(x) \mu_i = 0$

2. $h_i(x) \ge 0$

3. $\mu_i \ge 0$

4. $\mu_i h_i(x) = 0$.

The IP idea is to replace 2, 3 and 4 by a smooth condition (which is an approximation): $h_i(x) \mu_i = \tau$ with $\tau > 0$ but small. The KKT-conditions now become a smooth root finding problem:

\[
\nabla f(x) - \sum_{i=1}^{q} \nabla h_i(x) \mu_i = 0 \tag{14.32}
\]
\[
h_i(x) \mu_i - \tau = 0, \quad i = 1, \ldots, q \tag{14.33}
\]

These conditions are called the IP-KKT conditions. They can be solved by Newton's method, which yields solutions $x(\tau)$ and $\mu(\tau)$.

We can show that for τ → 0

x(τ) → x∗ (14.34)

µ(τ) → µ∗ (14.35)

The IP algorithm:

1. Start with a big τ ≫ 0, choose β ∈ (0, 1)

2. Solve IP-KKT to get x(τ) and µ(τ)

3. Replace τ ← βτ and go to 2 (initialize the Newton iteration with the old solution).

Remark: The set of solutions $\begin{pmatrix} x(\tau) \\ \mu(\tau) \end{pmatrix}$ for $\tau \in (0, \infty)$ is called the central path.

Remark 2: In fact, the IP-KKT is equivalent to the FONC of the Barrier Problem (BP):

\[
\min_{x} \; f(x) - \tau \sum_{i=1}^{q} \log h_i(x) \tag{14.36}
\]
\[
\text{FONC of BP} \iff \nabla f(x) - \tau \sum_{i=1}^{q} \frac{1}{h_i(x)} \nabla h_i(x) = 0 \tag{14.37}
\]

With $\mu_i = \frac{\tau}{h_i(x)}$ this is equivalent to IP-KKT.

For convex problems IP methods are well understood with strong complexity results (theyare used e.g. in CVX).
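A minimal sketch of the IP algorithm on a one-dimensional problem: minimize f(x) = 0.5 (x + 1)^2 subject to h(x) = x >= 0, whose solution is x* = 0 with multiplier mu* = 1. The problem, the starting values and the function name `solve_ip_kkt` are assumptions for illustration. The IP-KKT system is then x + 1 - mu = 0 and x mu - tau = 0, and its 2x2 Newton system is solved by Cramer's rule.

```python
def solve_ip_kkt(x, mu, tau, tol=1e-12, max_iter=50):
    """Newton's method on the IP-KKT conditions of the assumed 1-D problem:
        r1 = (x + 1) - mu = 0,   r2 = x * mu - tau = 0."""
    for _ in range(max_iter):
        r1 = (x + 1.0) - mu
        r2 = x * mu - tau
        if max(abs(r1), abs(r2)) < tol:
            break
        # Jacobian is [[1, -1], [mu, x]] with determinant x + mu.
        det = x + mu
        dx = (-r1 * x - r2) / det
        dmu = (-r2 + mu * r1) / det
        x, mu = x + dx, mu + dmu
    return x, mu


# Path following: start with a large tau and shrink it by the factor beta,
# warm-starting each Newton solve with the previous solution.
x, mu = 1.0, 1.0
tau, beta = 1.0, 0.1
while tau > 1e-9:
    x, mu = solve_ip_kkt(x, mu, tau)
    tau *= beta
print(x, mu)
```

The iterates (x(tau), mu(tau)) follow the central path towards (0, 1). Practical IP codes add safeguards not shown here, in particular a step length rule that keeps h_i(x) and mu_i strictly positive.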


Chapter 15

Optimal Control Problems

We regard a dynamical system with dynamics

\[
x_{k+1} = f(x_k, u_k) \tag{15.1}
\]

with $u_k$ the "controls" or "inputs" and $x_k$ the "states". Let $x_k \in \mathbb{R}^{n_x}$ and $u_k \in \mathbb{R}^{n_u}$ with $k = 0, \ldots, N - 1$.

15.1 Optimal control problem (OCP) formulation

\[
\min_{x_0, u_0, x_1, \ldots, u_{N-1}, x_N} \; \sum_{k=0}^{N-1} L(x_k, u_k) + E(x_N) \tag{15.2a}
\]
\[
\text{subject to} \quad x_{k+1} - f(x_k, u_k) = 0 \quad \text{for } k = 0, \ldots, N - 1 \tag{15.2b}
\]

Note that (??) comprises many constraints. Sometimes these constraints are not enough and one adds extra constraints, for example if the first state and the last state are fixed:

\[
r(x_0, x_N) = 0 \tag{15.3}
\]

In that case we would also say that the controls are constrained. Another type of constraint are inequalities of the form

\[
h(x_k, u_k) \ge 0, \quad k = 0, \ldots, N - 1 \tag{15.4}
\]

Note that a free parameter can be added to the optimisation formulation, e.g. the constant size of a pot in a chemical reactor. For this we define an extra dummy state $p_k$ for $k = 0, \ldots, N - 1$ with

\[
p_{k+1} = p_k \tag{15.5}
\]

As an example for equation (??), consider

\[
r(x_0, x_N) = x_0 - \bar{x}_0 \tag{15.6}
\]

where $\bar{x}_0$ is a fixed initial value. Another example would be considering both ends fixed:

\[
r(x_0, x_N) = \begin{bmatrix} x_0 - \bar{x}_0 \\ x_N - \bar{x}_N \end{bmatrix}. \tag{15.7}
\]

In many applications, cycles are optimized by adding periodic boundary conditions in the form

\[
r(x_0, x_N) = x_0 - x_N \tag{15.8}
\]

15.2 KKT conditions of optimal control problems

First summarize the variables $w = (x_0, u_0, x_1, u_1, \ldots, u_{N-1}, x_N)$ and summarize the multipliers $\lambda = (\lambda_1, \ldots, \lambda_N, \lambda_r)$. The optimal control problem has the form

\[
\min_{w} \; F(w) \tag{15.9a}
\]
\[
\text{subject to} \quad G(w) = 0 \tag{15.9b}
\]

where

\[
G(w) = \begin{bmatrix} x_1 - f(x_0, u_0) \\ x_2 - f(x_1, u_1) \\ \vdots \\ x_N - f(x_{N-1}, u_{N-1}) \\ r(x_0, x_N) \end{bmatrix} \tag{15.9c}
\]


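The Lagrangian function has the form

\[
\mathcal{L}(w, \lambda) = F(w) - \lambda^T G(w)
= \sum_{k=0}^{N-1} L(x_k, u_k) + E(x_N) - \sum_{k=0}^{N-1} \lambda_{k+1}^T \left( x_{k+1} - f(x_k, u_k) \right) - \lambda_r^T r(x_0, x_N) \tag{15.10}
\]

The KKT-conditions of the problem are

\[
\nabla_w \mathcal{L}(w, \lambda) = 0 \tag{15.11a}
\]
\[
G(w) = 0 \tag{15.11b}
\]

In more detail, we regard the derivative of $\mathcal{L}$ with respect to $x_k$, where $k = 0$ and $k = N$ are special cases. First $k = 0$ is treated:

\[
\nabla_{x_0} \mathcal{L}(w, \lambda) = \nabla_{x_0} L(x_0, u_0) + \frac{\partial f}{\partial x_0}(x_0, u_0)^T \lambda_1 - \frac{\partial r}{\partial x_0}(x_0, x_N)^T \lambda_r = 0. \tag{15.12a}
\]

Then the case $k = 1, \ldots, N - 1$ is treated:

\[
\nabla_{x_k} \mathcal{L}(w, \lambda) = \nabla_{x_k} L(x_k, u_k) - \lambda_k + \frac{\partial f}{\partial x_k}(x_k, u_k)^T \lambda_{k+1} = 0. \tag{15.12b}
\]

Now the special case $k = N$:

\[
\nabla_{x_N} \mathcal{L}(w, \lambda) = \nabla_{x_N} E(x_N) - \lambda_N - \frac{\partial r}{\partial x_N}(x_0, x_N)^T \lambda_r = 0. \tag{15.12c}
\]

The derivative of the Lagrangian with respect to $u_k$ is calculated for $k = 0, \ldots, N - 1$:

\[
\nabla_{u_k} \mathcal{L}(w, \lambda) = \nabla_{u_k} L(x_k, u_k) + \frac{\partial f}{\partial u_k}(x_k, u_k)^T \lambda_{k+1} = 0. \tag{15.12d}
\]

The last two conditions are

\[
x_{k+1} - f(x_k, u_k) = 0, \quad k = 0, \ldots, N - 1 \tag{15.12e}
\]
\[
r(x_0, x_N) = 0 \tag{15.12f}
\]

The equations (??) to (??) are the KKT-system of the OCP. There exist different approaches to solve this system. One method is to solve equations (??) to (??) directly; this is called the simultaneous approach. The other approach is to calculate all the states in (??) by forward elimination. This is called the sequential approach and is treated first.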


15.3 Sequential approach to optimal control

This method is also called "single shooting" or "reduced approach". The idea is to keep only $x_0$ and $U = [u_0^T, \ldots, u_{N-1}^T]^T$ as variables. The states $x_1, \ldots, x_N$ are eliminated recursively by

\[
\tilde{x}_0(x_0, U) = x_0 \tag{15.13}
\]
\[
\tilde{x}_{k+1}(x_0, U) = f(\tilde{x}_k(x_0, U), u_k) \tag{15.14}
\]

Then the optimal control problem is equivalent to a problem with fewer variables

\[
\min_{x_0, U} \; \sum_{k=0}^{N-1} L(\tilde{x}_k(x_0, U), u_k) + E(\tilde{x}_N(x_0, U)) \tag{15.15a}
\]
\[
\text{subject to} \quad r(x_0, \tilde{x}_N(x_0, U)) = 0 \tag{15.15b}
\]

Note that equation (??) is implicitly satisfied. This is called the reduced optimal control problem. It can be solved by e.g. a Newton-type method (SQP if inequalities are present). If $r(x_0, x_N) = x_0 - \bar{x}_0$ one can also eliminate $x_0 \equiv \bar{x}_0$. The optimality conditions for this problem are found in the next subsection.

15.4 Backward differentiation of sequential Lagrangian

The Lagrangian function is given by

\[
\mathcal{L}(x_0, U, \lambda_r) = \sum_{k=0}^{N-1} L(\tilde{x}_k(x_0, U), u_k) + E(\tilde{x}_N(x_0, U)) - \lambda_r^T r(x_0, \tilde{x}_N(x_0, U)) \tag{15.16}
\]

so the KKT conditions for the reduced optimal control problem are

\[
\nabla_{x_0} \mathcal{L}(x_0, U, \lambda_r) = 0 \tag{15.17a}
\]
\[
\nabla_{u_k} \mathcal{L}(x_0, U, \lambda_r) = 0, \quad k = 0, \ldots, N - 1 \tag{15.17b}
\]
\[
r(x_0, \tilde{x}_N(x_0, U)) = 0 \tag{15.17c}
\]

Usually derivatives are computed by finite differences, the I-Trick or forward automatic differentiation (AD). But here, backward automatic differentiation (AD) is more efficient.


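Algorithm 9 Result of backward AD to KKT-ROCP

    Inputs: $x_0, u_0, \ldots, u_{N-1}, \lambda_r$
    Outputs: $\nabla_{x_0} \mathcal{L}$, $\nabla_{u_k} \mathcal{L}$ and $r$

    Set $\tilde{x}_0 \leftarrow x_0$
    Set $k = 0$, execute forward sweep:
    repeat
        $\tilde{x}_{k+1} = f(\tilde{x}_k, u_k)$
        $k = k + 1$
    until $k = N$
    Get $r(x_0, \tilde{x}_N)$
    Compute "intermediate" quantities $\bar{\lambda}_N, \ldots, \bar{\lambda}_1$, starting with
        $\bar{\lambda}_N = \nabla E(\tilde{x}_N) - \frac{\partial r}{\partial x_N}(x_0, \tilde{x}_N)^T \lambda_r$
    Set $k = N - 1$, execute backward sweep:
    repeat
        $\bar{\lambda}_k = \nabla_{x_k} L(\tilde{x}_k, u_k) + \frac{\partial f}{\partial x_k}(\tilde{x}_k, u_k)^T \bar{\lambda}_{k+1}$
        $k = k - 1$
    until $k = 0$
    Compute $\nabla_{x_0} \mathcal{L} = \nabla_{x_0} L(x_0, u_0) - \frac{\partial r}{\partial x_0}(x_0, \tilde{x}_N)^T \lambda_r + \frac{\partial f}{\partial x_0}(x_0, u_0)^T \bar{\lambda}_1$
    Set $k = 0$
    repeat
        $\nabla_{u_k} \mathcal{L} = \nabla_{u_k} L(\tilde{x}_k, u_k) + \frac{\partial f}{\partial u_k}(\tilde{x}_k, u_k)^T \bar{\lambda}_{k+1}$
        $k = k + 1$
    until $k = N$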


The result of applying backward AD to the equations (??) to (??) to get $\nabla_{x_0} \mathcal{L}$ and $\nabla_{u_k} \mathcal{L}$ is stated in Algorithm ??. Compare the equations (??) to (??), with $\lambda_k \equiv \bar{\lambda}_k$, with the algorithm.

We get a second interpretation of the sequential approach with backward AD: when solving (??) to (??) we eliminate all equations that can be eliminated by (??), (??) and (??). Only the equations (??), (??) and (??) remain. Backward automatic differentiation (AD) gives the gradient at a cost scaling linearly with $N$, whereas forward differences with respect to $u_0, \ldots, u_{N-1}$ would grow with $N^2$.

The sequential approach with backward automatic differentiation (AD) leads to a small dense (the Jacobians are dense matrices) nonlinear system in the variables $(x_0, u_0, \ldots, u_{N-1}, \lambda_r)$. The next section tries to avoid the dense Jacobians.
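The forward and backward sweeps can be sketched for a scalar system with fixed initial state, so that the boundary constraint r and its multiplier drop out and only the gradient with respect to the controls remains. The dynamics, the stage cost, the terminal cost and all function names below are assumptions chosen for illustration.

```python
def f(x, u):
    """Assumed scalar dynamics x_{k+1} = f(x_k, u_k)."""
    return x + 0.1 * (-x ** 3 + u)


def f_x(x, u):
    """Partial derivative of f with respect to x."""
    return 1.0 - 0.3 * x ** 2


def f_u(x, u):
    """Partial derivative of f with respect to u."""
    return 0.1


def objective(x0, U):
    """Reduced objective: sum of 0.5 (x_k^2 + u_k^2) plus terminal cost 0.5 x_N^2."""
    x, total = x0, 0.0
    for u in U:
        total += 0.5 * (x * x + u * u)
        x = f(x, u)
    return total + 0.5 * x * x


def gradient_backward(x0, U):
    """Gradient of the reduced objective with respect to u_0, ..., u_{N-1}."""
    N = len(U)
    xs = [x0]
    for u in U:                      # forward sweep: simulate the states
        xs.append(f(xs[-1], u))
    lam = xs[N]                      # lambda_N = gradient of the terminal cost
    grad = [0.0] * N
    for k in range(N - 1, -1, -1):   # backward sweep
        grad[k] = U[k] + f_u(xs[k], U[k]) * lam   # uses lambda_{k+1}
        lam = xs[k] + f_x(xs[k], U[k]) * lam      # lambda_k for the next step
    return grad


controls = [0.5, -0.2, 0.1, 0.0]
grad = gradient_backward(1.0, controls)
print(grad)
```

One forward and one backward sweep deliver all N gradient entries, whereas a finite difference approximation needs one extra simulation of length N per control.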

15.5 Simultaneous optimal control

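This method is also called "multiple shooting" or "one shot optimization". The idea is to solve (??) to (??) directly by a sparsity exploiting Newton-type method. If we regard the original OCP, it is an NLP in the variables $w = (x_0, u_0, x_1, u_1, \ldots, u_{N-1}, x_N)$ with multipliers $\lambda = (\lambda_1, \ldots, \lambda_N, \lambda_r)$. In the SQP method we get

\[
w_{k+1} = w_k + \Delta w_k \tag{15.18}
\]
\[
\lambda_{k+1} = \lambda_k^{QP} \tag{15.19}
\]

by solving

\[
\min_{\Delta w} \; \nabla_w F(w_k)^T \Delta w + \frac{1}{2} \Delta w^T B_k \Delta w \tag{15.20a}
\]
\[
\text{subject to} \quad G(w_k) + \frac{\partial G}{\partial w}(w_k) \Delta w = 0 \tag{15.20b}
\]

If we use

\[
B_k = \nabla_w^2 \mathcal{L}(w_k, \lambda_k) \tag{15.21}
\]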


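this QP is very structured and equivalent to

\[
\min_{\Delta x_0, \Delta u_0, \ldots, \Delta x_N} \;
\frac{1}{2} \sum_{k=0}^{N-1} \begin{bmatrix} \Delta x_k \\ \Delta u_k \end{bmatrix}^T Q_k \begin{bmatrix} \Delta x_k \\ \Delta u_k \end{bmatrix}
+ \frac{1}{2} \Delta x_N^T Q_N \Delta x_N
+ \sum_{k=0}^{N-1} \begin{bmatrix} \Delta x_k \\ \Delta u_k \end{bmatrix}^T g_k
+ \Delta x_N^T g_N \tag{15.22}
\]
\[
\text{subject to} \quad r(x_0, x_N) + \frac{\partial r(x_0, x_N)}{\partial x_0} \Delta x_0 + \frac{\partial r(x_0, x_N)}{\partial x_N} \Delta x_N = 0 \tag{15.23}
\]
\[
x_{k+1} - f(x_k, u_k) + \Delta x_{k+1} - \frac{\partial f}{\partial x_k}(x_k, u_k) \Delta x_k - \frac{\partial f}{\partial u_k}(x_k, u_k) \Delta u_k = 0 \quad \text{for } k = 0, \ldots, N - 1 \tag{15.24}
\]

with

\[
Q_k = \nabla_{(x_k, u_k)}^2 \mathcal{L} \tag{15.25}
\]
\[
Q_N = \nabla_{x_N}^2 \mathcal{L} \tag{15.26}
\]
\[
g_k = \nabla_{(x_k, u_k)} L(x_k, u_k) \tag{15.27}
\]
\[
g_N = \nabla E(x_N) \tag{15.28}
\]

Note that for $k \ne m$

\[
\frac{\partial^2}{\partial x_k \partial x_m} \mathcal{L} = 0 \tag{15.29a}
\]
\[
\frac{\partial^2}{\partial x_k \partial u_m} \mathcal{L} = 0 \tag{15.29b}
\]
\[
\frac{\partial^2}{\partial u_k \partial u_m} \mathcal{L} = 0 \tag{15.29c}
\]

This QP leads to a very sparse linear system and can be solved at a cost linear in $N$. Also, simultaneous approaches can deal better with unstable systems $x_{k+1} = f(x_k, u_k)$.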


Chapter 16

Summary of the Lecture

To be added...


Bibliography

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.

[3] J. Nocedal and S. Wright. Numerical Optimization. Springer Verlag, 2006.
