Convex Optimization Theory and Applications

Lecture 6: Unconstrained Minimization (Chapter 9 in Textbook)

Prof. Chun-Hung Liu
Dept. of Electrical and Computer Engineering, National Chiao Tung University
Fall 2016 (2016/11/30)


Outline

• Terminology and assumptions

• Gradient descent method

• Steepest descent method

• Newton’s method

• Self-concordant functions (Optional)

Closed Functions

• A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be closed if, for each $\alpha \in \mathbb{R}$, the sublevel set
  $\{x \in \operatorname{dom} f \mid f(x) \le \alpha\}$
  is closed. This is equivalent to the condition that the epigraph of f,
  $\operatorname{epi} f = \{(x, t) \in \mathbb{R}^{n+1} \mid x \in \operatorname{dom} f,\ f(x) \le t\},$
  is closed. (This definition is general, but is usually only applied to convex functions.)

• If $f : \mathbb{R}^n \to \mathbb{R}$ is continuous and $\operatorname{dom} f$ is closed, then f is closed.

• If $f : \mathbb{R}^n \to \mathbb{R}$ is continuous with $\operatorname{dom} f$ open, then f is closed if and only if f converges to $\infty$ along every sequence converging to a boundary point of $\operatorname{dom} f$. In other words, if $x_i \in \operatorname{dom} f$ with $x_i \to x \in \operatorname{bd} \operatorname{dom} f$, we have $\lim_{i \to \infty} f(x_i) = \infty$.

Example of Closed Functions

• The following are examples of closed functions on $\mathbb{R}$: for instance, $f(x) = -\log x$ with $\operatorname{dom} f = \mathbb{R}_{++}$ is closed, since every sublevel set $\{x \mid -\log x \le \alpha\} = [e^{-\alpha}, \infty)$ is closed (equivalently, $f(x) \to \infty$ as $x \to 0^+$).

Unconstrained Minimization Problems

• In this chapter, we discuss methods for solving the unconstrained optimization problem
  $\text{minimize } f(x)$    (9.1)
  where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice continuously differentiable (which implies that $\operatorname{dom} f$ is open).

• We assume that the problem is solvable, i.e., there exists an optimal point $x^\star$; we denote the optimal value by $p^\star = \inf_x f(x) = f(x^\star)$.

• Since f is differentiable and convex, a necessary and sufficient condition for a point $x^\star$ to be optimal is
  $\nabla f(x^\star) = 0.$    (9.2)

• Solving the unconstrained minimization problem (9.1) is the same as finding a solution of (9.2).

Initial Point and Sublevel Set

• The methods described in this chapter require a suitable starting point $x^{(0)}$. The starting point must lie in $\operatorname{dom} f$, and in addition the sublevel set
  $S = \{x \in \operatorname{dom} f \mid f(x) \le f(x^{(0)})\}$
  must be closed. This condition is satisfied for all $x^{(0)} \in \operatorname{dom} f$ if the function f is closed, i.e., all its sublevel sets are closed.

• The closedness of the sublevel set S is hard to verify in general, except when all sublevel sets are closed:
  • equivalent to the condition that $\operatorname{epi} f$ is closed;
  • true if $\operatorname{dom} f = \mathbb{R}^n$;
  • true if $f(x) \to \infty$ as $x \to \operatorname{bd} \operatorname{dom} f$.

• Examples of differentiable functions with closed sublevel sets:
  $f(x) = \log\left(\sum_{i=1}^m \exp(a_i^T x + b_i)\right), \qquad f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x).$

Examples

• Quadratic minimization and least-squares: The general convex quadratic minimization problem has the form
  $\text{minimize } \tfrac{1}{2}x^T P x + q^T x + r,$    (9.4)
  where $P \in \mathbb{S}^n_+$, $q \in \mathbb{R}^n$, and $r \in \mathbb{R}$.
  • This problem can be solved via the optimality condition $P x^\star + q = 0$, which is a set of linear equations.
  • When $P \succ 0$, there is a unique solution $x^\star = -P^{-1} q$.
  • When P is not positive definite, any solution of $P x^\star = -q$ is optimal.
  • If $P x^\star = -q$ does not have a solution, then the problem (9.4) is unbounded below.
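As a quick numerical illustration (a sketch of my own, not part of the slides; the function name quadratic_minimizer is made up here), the quadratic case reduces to one linear solve, and a least-squares solve covers the semidefinite case:

    import numpy as np

    def quadratic_minimizer(P, q):
        """Minimize (1/2) x^T P x + q^T x + r via the optimality condition P x = -q.

        P is assumed symmetric positive semidefinite. If P x = -q has no solution,
        the problem is unbounded below and None is returned.
        """
        x, *_ = np.linalg.lstsq(P, -q, rcond=None)   # handles singular P as well
        if np.allclose(P @ x, -q):                   # check the optimality condition
            return x
        return None                                  # P x = -q has no solution

    # When P is positive definite the unique solution is x* = -P^{-1} q.
    P = np.array([[2.0, 0.0], [0.0, 4.0]])
    q = np.array([1.0, -2.0])
    print(quadratic_minimizer(P, q))                 # [-0.5  0.5]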

Examples

• Unconstrained geometric programming: Consider an unconstrained geometric program in convex form,
  $\text{minimize } f(x) = \log\left(\sum_{i=1}^m \exp(a_i^T x + b_i)\right).$
  The optimality condition is
  $\nabla f(x^\star) = \dfrac{\sum_{i=1}^m \exp(a_i^T x^\star + b_i)\, a_i}{\sum_{i=1}^m \exp(a_i^T x^\star + b_i)} = 0,$
  which in general has no analytical solution, so here we must resort to an iterative algorithm.

• Analytic center of linear inequalities:
  $\text{minimize } f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x),$
  where the domain of f is the open set $\operatorname{dom} f = \{x \mid a_i^T x < b_i,\ i = 1, \dots, m\}$. The solution, if it exists, is called the analytic center of the inequalities.

Strong Convexity and Implications

• We assume the objective function is strongly convex on S, which means that there exists an $m > 0$ such that
  $\nabla^2 f(x) \succeq mI$ for all $x \in S$.    (9.7)

• Strong convexity has several interesting consequences. For $x, y \in S$, we have
  $f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(z)\,(y - x)$
  for some z on the line segment [x, y].

• Combining this with the strong convexity condition in (9.7), we have
  $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{m}{2}\|y - x\|_2^2$    (9.8)
  for all x and y in S.

Strong Convexity and Implications

• Setting the gradient of the righthand side of (9.8) with respect to y equal to zero, we find that $\tilde y = x - (1/m)\nabla f(x)$ minimizes the righthand side of (9.8). Therefore we have
  $f(y) \ge f(x) - \tfrac{1}{2m}\|\nabla f(x)\|_2^2.$
  Since this holds for any $y \in S$, we have
  $p^\star \ge f(x) - \tfrac{1}{2m}\|\nabla f(x)\|_2^2, \qquad \text{i.e.,} \qquad \tfrac{1}{2m}\|\nabla f(x)\|_2^2 \ge f(x) - p^\star.$
  This inequality shows that if the gradient is small at a point, then the point is nearly optimal.

Strong Convexity and Implications

• We can also derive a bound on the distance between x and any optimal point $x^\star$, in terms of $\|\nabla f(x)\|_2$:
  $\|x - x^\star\|_2 \le \tfrac{2}{m}\|\nabla f(x)\|_2.$    (9.11)
  To see this, we apply (9.8) with $y = x^\star$ to obtain
  $p^\star = f(x^\star) \ge f(x) + \nabla f(x)^T(x^\star - x) + \tfrac{m}{2}\|x^\star - x\|_2^2 \ge f(x) - \|\nabla f(x)\|_2\|x^\star - x\|_2 + \tfrac{m}{2}\|x^\star - x\|_2^2$
  (the second step is due to the Cauchy–Schwarz inequality). Since we must have $p^\star \le f(x)$, it follows that
  $-\|\nabla f(x)\|_2\|x^\star - x\|_2 + \tfrac{m}{2}\|x^\star - x\|_2^2 \le 0,$
  from which (9.11) follows.

Strong Convexity and Implications

• Upper bound on $\nabla^2 f(x)$: The maximum eigenvalue of $\nabla^2 f(x)$, which is a continuous function of x on S, is bounded above on S, i.e., there exists a constant M such that
  $\nabla^2 f(x) \preceq M I$    (9.12)
  for all $x \in S$. This upper bound on the Hessian implies, for any $x, y \in S$,
  $f(y) \le f(x) + \nabla f(x)^T (y - x) + \tfrac{M}{2}\|y - x\|_2^2.$    (9.13)
  Minimizing each side over y yields
  $p^\star \le f(x) - \tfrac{1}{2M}\|\nabla f(x)\|_2^2.$

Condition Number of Sublevel Sets

• From the strong convexity inequality (9.7) and the inequality (9.12), we have
  $mI \preceq \nabla^2 f(x) \preceq MI$    (9.14)
  for all $x \in S$. The ratio $\kappa = M/m$ is thus an upper bound on the condition number of the matrix $\nabla^2 f(x)$, i.e., the ratio of its largest eigenvalue to its smallest eigenvalue.

• Geometrical interpretation of (9.14): We define the width of a convex set $C \subseteq \mathbb{R}^n$ in the direction q, where $\|q\|_2 = 1$, as
  $W(C, q) = \sup_{z \in C} q^T z - \inf_{z \in C} q^T z.$

• The minimum width and maximum width of C are given by
  $W_{\min} = \inf_{\|q\|_2 = 1} W(C, q), \qquad W_{\max} = \sup_{\|q\|_2 = 1} W(C, q).$

Condition Number of Sublevel Sets

• The condition number of the convex set C is defined as
  $\operatorname{cond}(C) = \dfrac{W_{\max}^2}{W_{\min}^2},$
  i.e., the square of the ratio of its maximum width to its minimum width.

• The condition number of C gives a measure of its anisotropy or eccentricity:
  • If the condition number of a set C is small (say, near one), it means that the set has approximately the same width in all directions, i.e., it is nearly spherical.
  • If the condition number is large, it means that the set is far wider in some directions than in others.

Condition Number of Sublevel Sets

• Example: for an ellipsoid $\mathcal{E} = \{x \mid (x - x_0)^T A^{-1}(x - x_0) \le 1\}$ with $A \in \mathbb{S}^n_{++}$, the width in the direction q is
  $W(\mathcal{E}, q) = 2\,(q^T A q)^{1/2},$
  so $W_{\min} = 2\sqrt{\lambda_{\min}(A)}$, $W_{\max} = 2\sqrt{\lambda_{\max}(A)}$, and $\operatorname{cond}(\mathcal{E}) = \lambda_{\max}(A)/\lambda_{\min}(A) = \kappa(A)$.

Condition Number of Sublevel Sets

• Suppose f satisfies $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. We will derive a bound on the condition number of the $\alpha$-sublevel set $C_\alpha = \{x \mid f(x) \le \alpha\}$, where $p^\star < \alpha \le f(x^{(0)})$.

• Applying (9.13) and (9.8) with $x = x^\star$, we have
  $p^\star + \tfrac{M}{2}\|y - x^\star\|_2^2 \ge f(y) \ge p^\star + \tfrac{m}{2}\|y - x^\star\|_2^2.$
  This implies that $B_{\rm inner} \subseteq C_\alpha \subseteq B_{\rm outer}$, where
  $B_{\rm inner} = \{y \mid \|y - x^\star\|_2 \le (2(\alpha - p^\star)/M)^{1/2}\}, \qquad B_{\rm outer} = \{y \mid \|y - x^\star\|_2 \le (2(\alpha - p^\star)/m)^{1/2}\}.$
  The ratio of the radii squared gives an upper bound on the condition number of $C_\alpha$:
  $\operatorname{cond}(C_\alpha) \le \dfrac{M}{m}.$

Condition Number of Sublevel Sets

• We can also give a geometric interpretation of the condition number $\kappa(\nabla^2 f(x^\star))$ of the Hessian at the optimum. From the Taylor series expansion of f around $x^\star$,
  $f(y) \approx p^\star + \tfrac{1}{2}(y - x^\star)^T \nabla^2 f(x^\star)\,(y - x^\star),$
  we see that, for $\alpha$ close to $p^\star$,
  $C_\alpha \approx \{y \mid (y - x^\star)^T \nabla^2 f(x^\star)\,(y - x^\star) \le 2(\alpha - p^\star)\},$
  i.e., the sublevel set is well approximated by an ellipsoid with center $x^\star$. Therefore,
  $\operatorname{cond}(C_\alpha) \to \kappa\left(\nabla^2 f(x^\star)\right) \quad \text{as } \alpha \to p^\star.$

Descent Methods

• The algorithms described in this chapter produce a minimizing sequence $x^{(k)}$, $k = 1, \dots$, where
  $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$
  and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal).

• $\Delta x^{(k)} \in \mathbb{R}^n$ is called the step or search direction (even though it need not have unit norm). The scalar $t^{(k)} \ge 0$ is called the step size or step length at iteration k.

• When we focus on one iteration of an algorithm, we sometimes drop the superscripts and use the lighter notation $x := x + t\Delta x$, or $x^+ = x + t\Delta x$, in place of $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$.

• All the methods we study are descent methods, which means that
  $f(x^{(k+1)}) < f(x^{(k)}),$
  except when $x^{(k)}$ is optimal.

Descent Methods

• From convexity we know that $\nabla f(x^{(k)})^T(y - x^{(k)}) \ge 0$ implies $f(y) \ge f(x^{(k)})$, so the search direction in a descent method must satisfy
  $\nabla f(x^{(k)})^T\Delta x^{(k)} < 0,$
  i.e., it must make an acute angle with the negative gradient. We call such a direction a descent direction.

• The outline of a general descent method is as follows. It alternates between two steps: determining a descent direction $\Delta x$, and the selection of a step size t.
  General descent method: given a starting point $x \in \operatorname{dom} f$, repeat
    1. Determine a descent direction $\Delta x$.
    2. Line search: choose a step size $t > 0$.
    3. Update: $x := x + t\Delta x$.
  until a stopping criterion is satisfied.

Descent Methods

• Exact line search: One line search method sometimes used in practice is exact line search, in which t is chosen to minimize f along the ray $\{x + t\Delta x \mid t \ge 0\}$:
  $t = \operatorname*{argmin}_{s \ge 0} f(x + s\Delta x).$    (9.16)
  An exact line search is used when the cost of this one-variable minimization problem is low compared to the cost of computing the search direction itself.

• Most line searches used in practice are inexact: the step length is chosen to approximately minimize f along the ray $\{x + t\Delta x \mid t \ge 0\}$, or even just to reduce f "enough".

• Backtracking line search: One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$ and $0 < \beta < 1$.

Descent Methods

• The line search is called backtracking because it starts with unit step size (t = 1) and then reduces it by the factor $\beta$ until the stopping condition
  $f(x + t\Delta x) \le f(x) + \alpha t\nabla f(x)^T\Delta x$
  holds. Since $\Delta x$ is a descent direction, we have $\nabla f(x)^T\Delta x < 0$, so for small enough t we have
  $f(x + t\Delta x) \approx f(x) + t\nabla f(x)^T\Delta x < f(x) + \alpha t\nabla f(x)^T\Delta x,$
  which shows that the backtracking line search eventually terminates.
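A minimal sketch (my own code, not from the slides) of this backtracking loop; it assumes f returns a finite value (or +inf) at every trial point so the comparison below is well defined:

    import numpy as np

    def backtracking(f, grad_fx, x, dx, alpha=0.1, beta=0.7):
        """Shrink t by beta until f(x + t*dx) <= f(x) + alpha*t*grad_fx^T dx."""
        t = 1.0
        fx = f(x)
        slope = grad_fx @ dx        # directional derivative; negative for a descent direction
        while f(x + t * dx) > fx + alpha * t * slope:
            t *= beta
        return t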


Gradient Descent Method

• A natural choice for the search direction is the negative gradient, $\Delta x = -\nabla f(x)$. The resulting algorithm is called the gradient algorithm or gradient descent method.

• The stopping criterion is usually of the form $\|\nabla f(x)\|_2 \le \eta$, where $\eta$ is small and positive.
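Putting the pieces together, a sketch (my own) of gradient descent that reuses the backtracking routine above and stops when the gradient norm drops below eta:

    import numpy as np

    def gradient_descent(f, grad, x0, eta=1e-6, alpha=0.1, beta=0.7, max_iter=5000):
        """Gradient descent with backtracking; stops when ||grad f(x)||_2 <= eta."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eta:
                break
            dx = -g                                      # search direction: negative gradient
            t = backtracking(f, g, x, dx, alpha, beta)   # backtracking sketch above
            x = x + t * dx
        return x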

Gradient Descent Method

• Convergence analysis: Here we present a simple convergence analysis for the gradient method, using the lighter notation $x^+ = x + t\Delta x$ with $\Delta x = -\nabla f(x)$.

• Assume f is strongly convex on S, so that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$.

• Define the function $\tilde f : \mathbb{R} \to \mathbb{R}$ by $\tilde f(t) = f(x - t\nabla f(x))$, i.e., f restricted to the ray in the negative gradient direction.

• With $y = x - t\nabla f(x)$ in (9.13), we obtain a quadratic upper bound on $\tilde f$:
  $\tilde f(t) \le f(x) - t\|\nabla f(x)\|_2^2 + \tfrac{M t^2}{2}\|\nabla f(x)\|_2^2.$    (9.17)

• Analysis for exact line search: Assume that an exact line search is used, and minimize over t both sides of the inequality (9.17).
  • On the lefthand side we get $\tilde f(t_{\rm exact})$, where $t_{\rm exact}$ is the step length that minimizes $\tilde f$.
  • The righthand side is a simple quadratic in t, which is minimized by $t = 1/M$ and has minimum value $f(x) - \tfrac{1}{2M}\|\nabla f(x)\|_2^2$.

Gradient Descent Method

• Therefore we have
  $f(x^+) = \tilde f(t_{\rm exact}) \le f(x) - \tfrac{1}{2M}\|\nabla f(x)\|_2^2.$
  Subtracting $p^\star$ from both sides and combining this with $\|\nabla f(x)\|_2^2 \ge 2m\left(f(x) - p^\star\right)$, we conclude
  $f(x^+) - p^\star \le \left(1 - \tfrac{m}{M}\right)\left(f(x) - p^\star\right).$
  Applying this inequality recursively, we find that
  $f(x^{(k)}) - p^\star \le c^k\left(f(x^{(0)}) - p^\star\right),$
  where $c = 1 - m/M < 1$, which shows that $f(x^{(k)})$ converges to $p^\star$ as $k \to \infty$.

Gradient Descent Method

• Analysis for backtracking line search: Consider the case where a backtracking line search is used in the gradient descent method. We will show that the backtracking exit condition,
  $\tilde f(t) \le f(x) - \alpha t\|\nabla f(x)\|_2^2,$
  is satisfied whenever $0 \le t \le 1/M$. First note that
  $0 \le t \le 1/M \implies -t + \tfrac{M t^2}{2} \le -\tfrac{t}{2}.$
  Using this result and the bound (9.17), for $0 \le t \le 1/M$ we have
  $\tilde f(t) \le f(x) - \tfrac{t}{2}\|\nabla f(x)\|_2^2 \le f(x) - \alpha t\|\nabla f(x)\|_2^2,$
  since $\alpha < 1/2$.

Gradient Descent Method

• Therefore the backtracking line search terminates either with $t = 1$ or with a value $t \ge \beta/M$. This provides a lower bound on the decrease in the objective function:
  $f(x^+) \le f(x) - \min\{\alpha,\ \beta\alpha/M\}\,\|\nabla f(x)\|_2^2.$
  Combine this with $\|\nabla f(x)\|_2^2 \ge 2m\left(f(x) - p^\star\right)$ to obtain
  $f(x^+) - p^\star \le \left(1 - \min\{2m\alpha,\ 2\beta\alpha m/M\}\right)\left(f(x) - p^\star\right) = c\left(f(x) - p^\star\right).$
  (What does this c tell you? It is the per-iteration contraction factor: since $c < 1$, the error again decreases at least geometrically.)

Example: A Quadratic Problem in $\mathbb{R}^2$

• Consider the quadratic objective function on $\mathbb{R}^2$
  $f(x) = \tfrac{1}{2}\left(x_1^2 + \gamma x_2^2\right),$
  where $\gamma > 0$. Clearly, the optimal point is $x^\star = 0$ and the optimal value is 0. We apply the gradient descent method with exact line search, starting at the point $x^{(0)} = (\gamma, 1)$. One can derive the following closed-form expressions for the iterates $x^{(k)}$ and their function values:
  $x_1^{(k)} = \gamma\left(\dfrac{\gamma - 1}{\gamma + 1}\right)^k, \qquad x_2^{(k)} = \left(-\dfrac{\gamma - 1}{\gamma + 1}\right)^k, \qquad f(x^{(k)}) = \left(\dfrac{\gamma - 1}{\gamma + 1}\right)^{2k} f(x^{(0)}).$
  This is illustrated in Figure 9.2, for $\gamma = 10$.
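A quick numerical check (my own, not from the slides) that exact-line-search gradient descent on this quadratic reproduces the closed-form iterates for $\gamma = 10$:

    import numpy as np

    gamma = 10.0
    A = np.diag([1.0, gamma])                 # f(x) = (1/2) x^T A x
    x = np.array([gamma, 1.0])                # x^(0) = (gamma, 1)
    r = (gamma - 1.0) / (gamma + 1.0)
    for k in range(1, 6):
        g = A @ x                             # gradient at the current iterate
        t = (g @ g) / (g @ (A @ g))           # exact line search step for a quadratic
        x = x - t * g
        x_closed = np.array([gamma * r**k, (-r)**k])
        print(k, x, x_closed)                 # the two columns agree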


(Slide showing Figures 9.3 and 9.5.)

Example: A Non-quadratic Problem in $\mathbb{R}^2$

• Consider the non-quadratic example
  $f(x) = e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1}.$    (9.20)
  We apply the gradient method with a backtracking line search, with $\alpha = 0.1$, $\beta = 0.7$. Figure 9.3 shows some level curves of f and the iterates $x^{(k)}$ generated by the gradient method (shown as small circles). The lines connecting successive iterates show the scaled steps.

• Figure 9.4 shows the error $f(x^{(k)}) - p^\star$ versus iteration k. The plot reveals that the error converges to zero approximately as a geometric series, i.e., the convergence is approximately linear.

• To compare backtracking line search with an exact line search, we use the gradient method with an exact line search on the same problem, with the same starting point. The results are given in Figures 9.5 and 9.4.
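Running the gradient-descent sketch above on (9.20) with the slide's parameters $\alpha = 0.1$, $\beta = 0.7$ (the starting point here is an arbitrary choice of mine):

    import numpy as np

    f = lambda x: (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
                   + np.exp(-x[0] - 0.1))

    def grad(x):
        e1 = np.exp(x[0] + 3*x[1] - 0.1)
        e2 = np.exp(x[0] - 3*x[1] - 0.1)
        e3 = np.exp(-x[0] - 0.1)
        return np.array([e1 + e2 - e3, 3*(e1 - e2)])

    x_gd = gradient_descent(f, grad, [-1.0, 1.0], alpha=0.1, beta=0.7)
    print(x_gd, f(x_gd))      # the optimal value is p* ≈ 2.559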


Steepest Descent Method

• The first-order Taylor approximation of $f(x + v)$ around x is
  $f(x + v) \approx f(x) + \nabla f(x)^T v.$
  The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of f at x in the direction v. It gives the approximate change in f for a small step v. The step v is a descent direction if the directional derivative is negative.

• The question is how to choose v to make the directional derivative as negative as possible. Since the directional derivative is linear in v, we must limit the size of v (e.g., fix its norm) for the question to make sense.

• We define a normalized steepest descent direction (with respect to a norm $\|\cdot\|$) as
  $\Delta x_{\rm nsd} = \operatorname*{argmin}\{\nabla f(x)^T v \mid \|v\| = 1\}.$    (9.23)
  A normalized steepest descent direction $\Delta x_{\rm nsd}$ is a step of unit norm that gives the largest decrease in the linear approximation of f.

Steepest Descent Method

• A normalized steepest descent direction can be interpreted geometrically as follows. We can just as well define $\Delta x_{\rm nsd}$ as
  $\Delta x_{\rm nsd} = \operatorname*{argmax}\{-\nabla f(x)^T v \mid \|v\| \le 1\},$
  that is, the direction in the unit ball of $\|\cdot\|$ that extends farthest in the direction $-\nabla f(x)$.

• We also consider a steepest descent step that is unnormalized, obtained by scaling the normalized steepest descent direction in a particular way:
  $\Delta x_{\rm sd} = \|\nabla f(x)\|_*\,\Delta x_{\rm nsd},$
  where $\|\cdot\|_*$ denotes the dual norm.

• Note that for the steepest descent step, we have
  $\nabla f(x)^T\Delta x_{\rm sd} = \|\nabla f(x)\|_*\,\nabla f(x)^T\Delta x_{\rm nsd} = -\|\nabla f(x)\|_*^2.$

Steepest Descent Method

• The steepest descent method uses the steepest descent direction as the search direction; otherwise it is identical to the general descent method (compute $\Delta x_{\rm sd}$, line search, update).

Steepest Descent for Euclidean and Quadratic Norms

• Steepest descent for the Euclidean norm: If we take the norm $\|\cdot\|$ to be the Euclidean norm, we find that the steepest descent direction is simply the negative gradient, i.e., $\Delta x_{\rm sd} = -\nabla f(x)$. The steepest descent method for the Euclidean norm coincides with the gradient descent method.

• Steepest descent for a quadratic norm: We consider the quadratic norm
  $\|z\|_P = \left(z^T P z\right)^{1/2} = \|P^{1/2} z\|_2,$
  where $P \in \mathbb{S}^n_{++}$. The normalized steepest descent direction is given by
  $\Delta x_{\rm nsd} = -\left(\nabla f(x)^T P^{-1}\nabla f(x)\right)^{-1/2} P^{-1}\nabla f(x),$
  and the unnormalized steepest descent step is $\Delta x_{\rm sd} = -P^{-1}\nabla f(x)$.
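A small sketch (mine, not from the slides) of the steepest descent directions for the quadratic norm $\|z\|_P$; it solves with P rather than forming $P^{-1}$:

    import numpy as np

    def steepest_descent_quadratic_norm(grad_fx, P):
        """Return (dx_nsd, dx_sd) for the quadratic norm defined by P (symmetric positive definite)."""
        Pinv_g = np.linalg.solve(P, grad_fx)         # P^{-1} grad f(x)
        dx_sd = -Pinv_g                              # unnormalized steepest descent step
        dx_nsd = dx_sd / np.sqrt(grad_fx @ Pinv_g)   # unit length in the ||.||_P norm
        return dx_nsd, dx_sd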

Steepest Descent for a Quadratic Norm

• The normalized steepest descent direction for a quadratic norm is illustrated in Figure 9.9.

Steepest Descent for the $\ell_1$-Norm

• Consider the steepest descent method for the $\ell_1$-norm. A normalized steepest descent (nsd) direction,
  $\Delta x_{\rm nsd} = \operatorname*{argmin}\{\nabla f(x)^T v \mid \|v\|_1 \le 1\},$
  is easily characterized. Let i be any index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$.

• A normalized steepest descent direction $\Delta x_{\rm nsd}$ for the $\ell_1$-norm is given by
  $\Delta x_{\rm nsd} = -\operatorname{sign}\!\left(\dfrac{\partial f(x)}{\partial x_i}\right) e_i,$
  where $e_i$ is the ith standard basis vector. An unnormalized steepest descent step is then
  $\Delta x_{\rm sd} = \Delta x_{\rm nsd}\,\|\nabla f(x)\|_\infty = -\dfrac{\partial f(x)}{\partial x_i}\, e_i.$
  Thus, the normalized steepest descent step in the $\ell_1$-norm can always be chosen to be a (signed) standard basis vector (see Figure 9.10).
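A sketch (mine) of the $\ell_1$ steepest descent step: it touches only the coordinate with the largest gradient magnitude.

    import numpy as np

    def l1_steepest_descent_step(grad_fx):
        """Return (dx_nsd, dx_sd) for the l1 norm."""
        g = np.asarray(grad_fx, dtype=float)
        i = int(np.argmax(np.abs(g)))       # index where |df/dx_i| equals ||grad f||_inf
        dx_nsd = np.zeros_like(g)
        dx_nsd[i] = -np.sign(g[i])          # normalized: a signed standard basis vector
        dx_sd = np.zeros_like(g)
        dx_sd[i] = -g[i]                    # unnormalized: -(df/dx_i) e_i
        return dx_nsd, dx_sd

    print(l1_steepest_descent_step([0.5, -2.0, 1.0]))   # steps along the 2nd coordinate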


Newton's Method

• The Newton step: For $x \in \operatorname{dom} f$, the vector
  $\Delta x_{\rm nt} = -\nabla^2 f(x)^{-1}\nabla f(x)$
  is called the Newton step (for f, at x). Positive definiteness of $\nabla^2 f(x)$ implies that
  $\nabla f(x)^T \Delta x_{\rm nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1}\nabla f(x) < 0$
  unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless x is optimal).

• Minimizer of second-order approximation: The second-order Taylor approximation (or model) $\hat f$ of f at x is
  $\hat f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v,$    (9.28)
  which is a convex quadratic function of v, and is minimized when $v = \Delta x_{\rm nt}$.
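A minimal sketch (mine) of the Newton step, computed by solving the linear system $\nabla^2 f(x)\,\Delta x_{\rm nt} = -\nabla f(x)$ rather than inverting the Hessian, together with a check that it minimizes the quadratic model (9.28):

    import numpy as np

    def newton_step(grad_fx, hess_fx):
        """Newton step: solve hess_fx @ dx = -grad_fx."""
        return np.linalg.solve(hess_fx, -grad_fx)

    # For f_hat(v) = f(x) + g^T v + (1/2) v^T H v, the minimizer satisfies H v = -g.
    H = np.array([[2.0, 0.5], [0.5, 1.0]])
    g = np.array([1.0, -1.0])
    v = newton_step(g, H)
    print(np.allclose(H @ v, -g))            # True: v is the minimizer of the model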

Newton's Method

• The Newton step $\Delta x_{\rm nt}$ is what should be added to the point x to minimize the second-order approximation of f at x. This is illustrated in Figure 9.16.

Newton's Method

• Steepest descent direction in Hessian norm: The Newton step is also the steepest descent direction at x, for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, i.e.,
  $\|u\|_{\nabla^2 f(x)} = \left(u^T\nabla^2 f(x)\,u\right)^{1/2}.$

• This gives another insight into why the Newton step should be a good search direction, and a very good search direction when x is near $x^\star$.

• In particular, near $x^\star$, a very good choice of quadratic norm for steepest descent is $P = \nabla^2 f(x^\star)$. When x is near $x^\star$, we have $\nabla^2 f(x) \approx \nabla^2 f(x^\star)$, which explains why the Newton step is a very good choice of search direction. This is illustrated in Figure 9.17.


Newton's Method

• Solution of linearized optimality condition: If we linearize the optimality condition $\nabla f(x^\star) = 0$ near x, we obtain
  $\nabla f(x + v) \approx \nabla f(x) + \nabla^2 f(x)\,v = 0,$
  which is a linear equation in v, with solution $v = \Delta x_{\rm nt}$. So the Newton step $\Delta x_{\rm nt}$ is what must be added to x so that the linearized optimality condition holds.

• When x is near $x^\star$ (so the optimality conditions almost hold), the update $x + \Delta x_{\rm nt}$ should be a very good approximation of $x^\star$.

• When n = 1, i.e., $f : \mathbb{R} \to \mathbb{R}$, this interpretation is particularly simple. The solution $x^\star$ of the minimization problem is characterized by $f'(x^\star) = 0$, i.e., it is the zero-crossing of the derivative $f'$, which is monotonically increasing since f is convex.

• Given our current approximation x of the solution, we form a first-order Taylor approximation of $f'$ at x. The zero-crossing of this affine approximation is then $x + \Delta x_{\rm nt}$ (see Figure 9.18).


Newton's Method

• The Newton decrement: The quantity
  $\lambda(x) = \left(\nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x)\right)^{1/2}$
  is called the Newton decrement at x.

• We can relate the Newton decrement to the quantity $f(x) - \inf_y \hat f(y)$, where $\hat f$ is the second-order approximation of f at x:
  $f(x) - \inf_y \hat f(y) = f(x) - \hat f(x + \Delta x_{\rm nt}) = \tfrac{1}{2}\lambda(x)^2.$

• So $\lambda^2/2$ is an estimate of $f(x) - p^\star$, based on the quadratic approximation of f at x. We can also express the Newton decrement as
  $\lambda(x) = \left(\Delta x_{\rm nt}^T\nabla^2 f(x)\,\Delta x_{\rm nt}\right)^{1/2}.$
  This shows that $\lambda$ is the norm of the Newton step, in the quadratic norm defined by the Hessian, i.e., the norm $\|u\|_{\nabla^2 f(x)} = \left(u^T\nabla^2 f(x)\,u\right)^{1/2}$.

Newton's Method

• The Newton decrement comes up in backtracking line search as well, since we have
  $\nabla f(x)^T\Delta x_{\rm nt} = -\lambda(x)^2.$    (9.30)
  This is the constant used in a backtracking line search, and can be interpreted as the directional derivative of f at x in the direction of the Newton step:
  $-\lambda(x)^2 = \nabla f(x)^T\Delta x_{\rm nt} = \left.\dfrac{d}{dt} f(x + t\Delta x_{\rm nt})\right|_{t=0}.$
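Assembling the pieces, a sketch (mine, not the slides' algorithm box) of Newton's method with backtracking line search, using (9.30) for both the line search and the stopping test $\lambda(x)^2/2 \le \epsilon$; f is assumed to return +inf (or a very large value) outside its domain so the backtracking stays feasible:

    import numpy as np

    def newton_method(f, grad, hess, x0, eps=1e-10, alpha=0.1, beta=0.7, max_iter=100):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g, H = grad(x), hess(x)
            dx_nt = np.linalg.solve(H, -g)     # Newton step
            lam2 = -g @ dx_nt                  # lambda(x)^2 = -grad f(x)^T dx_nt, by (9.30)
            if lam2 / 2.0 <= eps:
                break
            t = 1.0                            # backtracking line search
            while f(x + t * dx_nt) > f(x) - alpha * t * lam2:
                t *= beta
            x = x + t * dx_nt
        return x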

Classical Convergence Analysis

• Assumptions:
  • f is twice continuously differentiable and strongly convex on S, with $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$;
  • the Hessian of f is Lipschitz continuous on S with constant L, i.e.,
    $\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2$ for all $x, y \in S$.    (9.31)

• Outline: there exist numbers $\eta$ and $\gamma$, with $0 < \eta \le m^2/L$ and $\gamma > 0$, such that
  • if $\|\nabla f(x^{(k)})\|_2 \ge \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$;    (9.32)
  • if $\|\nabla f(x^{(k)})\|_2 < \eta$, then $\dfrac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \le \left(\dfrac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2$.    (9.33)

Classical Convergence Analysis

• Damped Newton phase ($\|\nabla f(x)\|_2 \ge \eta$): iterations may require backtracking steps, and by (9.32) the objective decreases by at least $\gamma$ per iteration.

• Quadratically convergent phase ($\|\nabla f(x)\|_2 < \eta$): the line search selects full steps t = 1 and, by (9.33), the gradient norm converges to zero quadratically.

Classical Convergence Analysis

• Conclusion: the number of iterations until $f(x) - p^\star \le \epsilon$ is bounded above by
  $\dfrac{f(x^{(0)}) - p^\star}{\gamma} + \log_2\log_2\left(\dfrac{\epsilon_0}{\epsilon}\right),$
  where $\epsilon_0 = 2m^3/L^2$; the second term grows extremely slowly with the required accuracy.

Example in $\mathbb{R}^2$

• We first apply Newton's method with backtracking line search on the test function (9.20), with line search parameters $\alpha = 0.1$, $\beta = 0.7$. Figure 9.19 shows the Newton iterates, and also the ellipsoids
  $\{x \mid \|x - x^{(k)}\|_{\nabla^2 f(x^{(k)})} \le 1\}$
  for the first two iterates k = 0, 1. The method works well because these ellipsoids give good approximations of the shape of the sublevel sets.

• Figure 9.20 shows the error $f(x^{(k)}) - p^\star$ versus iteration number for the same example. This plot shows that convergence to a very high accuracy is achieved in only five iterations. Quadratic convergence is clearly apparent: the last step reduces the error from about $10^{-5}$ to $10^{-10}$.
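Reusing f and grad from the gradient-method sketch on (9.20) and adding the Hessian, the Newton sketch above reproduces this behaviour (high accuracy after a handful of iterations):

    import numpy as np

    def hess(x):
        e1 = np.exp(x[0] + 3*x[1] - 0.1)
        e2 = np.exp(x[0] - 3*x[1] - 0.1)
        e3 = np.exp(-x[0] - 0.1)
        return np.array([[e1 + e2 + e3, 3*(e1 - e2)],
                         [3*(e1 - e2), 9*(e1 + e2)]])

    x_nt = newton_method(f, grad, hess, [-1.0, 1.0], alpha=0.1, beta=0.7)
    print(x_nt, f(x_nt))       # matches the gradient-method minimizer, to higher accuracy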


Example in $\mathbb{R}^{100}$

• Figure 9.21 shows the convergence of Newton's method with backtracking and exact line search for a problem in $\mathbb{R}^{100}$,
  $f(x) = c^T x - \sum_{i=1}^m \log(b_i - a_i^T x),$    (9.21)
  with n = 100 variables and m = 500 inequalities.

Example in $\mathbb{R}^{100}$

• The plot for the backtracking line search shows that a very high accuracy is attained in eight iterations.

• Like the example in $\mathbb{R}^2$, quadratic convergence is clearly evident after about the third iteration. The number of iterations in Newton's method with exact line search is only one smaller than with a backtracking line search.

• This is also typical: an exact line search usually gives a very small improvement in convergence of Newton's method.

• Figure 9.22 shows the step sizes for this example. After two damped steps, the steps taken by the backtracking line search are all full, i.e., t = 1.


Summary for Newton's Method

Newton's method has several very strong advantages over gradient and steepest descent methods:

• Convergence of Newton's method is rapid in general, and quadratic near $x^\star$. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.

• Newton's method is affine invariant. It is insensitive to the choice of coordinates, or the condition number of the sublevel sets of the objective.

• Newton's method scales well with problem size. Its performance on problems in $\mathbb{R}^{10000}$ is similar to its performance on problems in $\mathbb{R}^{10}$, with only a modest increase in the number of steps required.

• The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.

Summary for the Gradient Method

• The gradient method often exhibits approximately linear convergence, i.e., the error $f(x^{(k)}) - p^\star$ converges to zero approximately as a geometric series.

• The choice of backtracking parameters $\alpha, \beta$ has a noticeable but not dramatic effect on the convergence. An exact line search sometimes improves the convergence of the gradient method, but the effect is not large.

• The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well conditioned. When the condition number is larger, the gradient method is so slow that it is useless in practice.

• The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.

Classical Convergence Proof of Newton's Method

• Damped Newton phase: Assume $\|\nabla f(x)\|_2 \ge \eta$. Strong convexity implies that $\nabla^2 f(x) \preceq MI$ on S, and therefore
  $\tilde f(t) = f(x + t\Delta x_{\rm nt}) \le f(x) + t\nabla f(x)^T\Delta x_{\rm nt} + \tfrac{M\|\Delta x_{\rm nt}\|_2^2}{2}t^2 \le f(x) - t\lambda(x)^2 + \tfrac{M}{2m}t^2\lambda(x)^2,$
  where we use (9.30) and
  $\lambda(x)^2 = \Delta x_{\rm nt}^T \nabla^2 f(x)\,\Delta x_{\rm nt} \ge m\|\Delta x_{\rm nt}\|_2^2.$
  The step size $\hat t = m/M$ satisfies the exit condition of the line search, since
  $\tilde f(\hat t) \le f(x) - \tfrac{m}{2M}\lambda(x)^2 \le f(x) - \alpha\hat t\,\lambda(x)^2$
  (because $\alpha < 1/2$).

Classical Convergence Proof of Newton's Method

• So the line search returns a step size $t \ge \beta m/M$, resulting in a decrease of the objective function
  $f(x^+) - f(x) \le -\alpha t\lambda(x)^2 \le -\alpha\beta\tfrac{m}{M}\lambda(x)^2 \le -\alpha\beta\tfrac{m}{M^2}\|\nabla f(x)\|_2^2 \le -\alpha\beta\eta^2\tfrac{m}{M^2},$
  where we use
  $\lambda(x)^2 = \nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x) \ge \tfrac{1}{M}\|\nabla f(x)\|_2^2.$
  Therefore, (9.32) is satisfied with
  $\gamma = \alpha\beta\eta^2\tfrac{m}{M^2}.$

Classical Convergence Proof of Newton's Method

• Quadratically convergent phase: Assume $\|\nabla f(x)\|_2 < \eta$. We first show that the backtracking line search selects unit steps, provided
  $\eta \le 3(1 - 2\alpha)\tfrac{m^2}{L}.$
  By the Lipschitz condition (9.31), we have, for $t \ge 0$,
  $\|\nabla^2 f(x + t\Delta x_{\rm nt}) - \nabla^2 f(x)\|_2 \le tL\|\Delta x_{\rm nt}\|_2,$
  and therefore
  $\left|\Delta x_{\rm nt}^T\left(\nabla^2 f(x + t\Delta x_{\rm nt}) - \nabla^2 f(x)\right)\Delta x_{\rm nt}\right| \le tL\|\Delta x_{\rm nt}\|_2^3$
  (since $|u^T A u| \le \|A\|_2\|u\|_2^2$ for symmetric A).

Classical Convergence Proof of Newton's Method

• Also, with $\tilde f(t) = f(x + t\Delta x_{\rm nt})$, we know
  $\tilde f''(t) = \Delta x_{\rm nt}^T\nabla^2 f(x + t\Delta x_{\rm nt})\,\Delta x_{\rm nt} \le \lambda(x)^2 + t\,\tfrac{L}{m^{3/2}}\lambda(x)^3,$
  where we use the inequality above, $\Delta x_{\rm nt}^T\nabla^2 f(x)\Delta x_{\rm nt} = \lambda(x)^2$, and $\|\Delta x_{\rm nt}\|_2 \le \lambda(x)/\sqrt{m}$. We integrate the inequality to get
  $\tilde f'(t) \le \tilde f'(0) + t\lambda(x)^2 + \tfrac{t^2 L}{2m^{3/2}}\lambda(x)^3 = -\lambda(x)^2 + t\lambda(x)^2 + \tfrac{t^2 L}{2m^{3/2}}\lambda(x)^3,$
  using $\tilde f'(0) = -\lambda(x)^2$. We integrate once more to get
  $\tilde f(t) \le \tilde f(0) - t\lambda(x)^2 + \tfrac{t^2}{2}\lambda(x)^2 + \tfrac{t^3 L}{6m^{3/2}}\lambda(x)^3.$

Classical Convergence Proof of Newton's Method

• Finally, we take t = 1 to obtain
  $f(x + \Delta x_{\rm nt}) \le f(x) - \tfrac{1}{2}\lambda(x)^2 + \tfrac{L}{6m^{3/2}}\lambda(x)^3.$    (9.39)

• Now suppose $\|\nabla f(x)\|_2 \le \eta \le 3(1 - 2\alpha)\tfrac{m^2}{L}$. Due to the strong convexity, we have
  $\lambda(x) \le \tfrac{1}{\sqrt{m}}\|\nabla f(x)\|_2 \le 3(1 - 2\alpha)\tfrac{m^{3/2}}{L},$
  and substituting this into (9.39) gives
  $f(x + \Delta x_{\rm nt}) \le f(x) - \alpha\lambda(x)^2 = f(x) + \alpha\nabla f(x)^T\Delta x_{\rm nt},$
  so the backtracking line search accepts the unit step t = 1.

Classical Convergence Proof of Newton's Method

• Then, with $x^+ = x + \Delta x_{\rm nt}$ and $\nabla f(x) + \nabla^2 f(x)\Delta x_{\rm nt} = 0$, applying the Lipschitz condition we have
  $\|\nabla f(x^+)\|_2 = \|\nabla f(x + \Delta x_{\rm nt}) - \nabla f(x) - \nabla^2 f(x)\Delta x_{\rm nt}\|_2 \le \tfrac{L}{2}\|\Delta x_{\rm nt}\|_2^2 = \tfrac{L}{2}\|\nabla^2 f(x)^{-1}\nabla f(x)\|_2^2 \le \tfrac{L}{2m^2}\|\nabla f(x)\|_2^2,$
  which is exactly the result in (9.33).

Self-Concordance

• Shortcomings of the classical convergence analysis: it depends on the constants m, M, and L, which are almost never known in practice, and the resulting bound is not affine invariant, even though Newton's method itself is.

• Convergence analysis via self-concordance (Nesterov and Nemirovski): gives an iteration bound that depends only on the line search parameters and the required accuracy, and that is affine invariant.

Definition and Examples

• Self-concordant functions on $\mathbb{R}$: We start by considering functions on $\mathbb{R}$. A convex function $f : \mathbb{R} \to \mathbb{R}$ is self-concordant if
  $|f'''(x)| \le 2 f''(x)^{3/2}$    (9.41)
  for all $x \in \operatorname{dom} f$.

• Since linear and (convex) quadratic functions have zero third derivative, they are evidently self-concordant.
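A standard one-line check (not in the transcript, but immediate from the definition): the negative logarithm is self-concordant, with equality in (9.41).

    % f(x) = -\log x, \quad \operatorname{dom} f = \mathbb{R}_{++}:
    f''(x) = \frac{1}{x^{2}}, \qquad
    f'''(x) = -\frac{2}{x^{3}}, \qquad
    |f'''(x)| = \frac{2}{x^{3}} = 2\Big(\frac{1}{x^{2}}\Big)^{3/2} = 2\, f''(x)^{3/2}.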


Definition and Examples

• About the constant 2 that appears in the definition: In fact, this constant is chosen for convenience, in order to simplify the formulas later on; any other positive constant could be used instead.

Definition and Examples

• For example, suppose the convex function $f : \mathbb{R} \to \mathbb{R}$ satisfies
  $|f'''(x)| \le k f''(x)^{3/2},$    (9.42)
  where k is some positive constant. Then the function $\tilde f(x) = (k^2/4)f(x)$ satisfies
  $|\tilde f'''(x)| = \tfrac{k^2}{4}|f'''(x)| \le \tfrac{k^3}{4} f''(x)^{3/2} = 2\left(\tfrac{k^2}{4}f''(x)\right)^{3/2} = 2\,\tilde f''(x)^{3/2},$
  and therefore is self-concordant.

• This shows that a function that satisfies (9.42) for some positive k can be scaled to satisfy the standard self-concordance inequality (9.41).

Why Self-Concordance?

• Why is self-concordance so important? Because it is affine invariant.

• If we define the function $\tilde f$ by $\tilde f(y) = f(ay + b)$, where $a \ne 0$, then $\tilde f$ is self-concordant if and only if f is. To see this, we substitute $x = ay + b$ into the self-concordance inequality for $\tilde f$, i.e., $|\tilde f'''(y)| \le 2\tilde f''(y)^{3/2}$: since $\tilde f''(y) = a^2 f''(x)$ and $\tilde f'''(y) = a^3 f'''(x)$, we obtain
  $|a|^3\,|f'''(x)| \le 2\left(a^2 f''(x)\right)^{3/2} = 2\,|a|^3 f''(x)^{3/2},$
  which is the self-concordance inequality for f.

• Self-concordant functions on $\mathbb{R}^n$: We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is self-concordant if it is self-concordant along every line in its domain, i.e., if the function $\tilde f(t) = f(x + tv)$ is a self-concordant function of t for all $x \in \operatorname{dom} f$ and for all v.

Self-Concordant Calculus

• Self-concordance is preserved by scaling by a factor exceeding one: if f is self-concordant and $a \ge 1$, then af is self-concordant.

• Self-concordance is also preserved by addition: if $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant. To show this, it is sufficient to consider functions $f_1, f_2 : \mathbb{R} \to \mathbb{R}$, and thus
  $|f_1'''(x) + f_2'''(x)| \le |f_1'''(x)| + |f_2'''(x)| \le 2\left(f_1''(x)^{3/2} + f_2''(x)^{3/2}\right) \le 2\left(f_1''(x) + f_2''(x)\right)^{3/2}.$
  In the last step we use the inequality
  $u^{3/2} + v^{3/2} \le (u + v)^{3/2},$
  which holds for $u, v \ge 0$.

Self-Concordant Calculus

• Composition with an affine function: If $f : \mathbb{R}^n \to \mathbb{R}$ is self-concordant, and $A \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^n$, then $f(Ax + b)$ is self-concordant. For instance, combining this rule with addition and the self-concordance of $-\log u$, the log-barrier $f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$ (the analytic-center objective from the earlier example) is self-concordant.


Self-Concordant Calculus

• Let be a convex function with , and for all xg : R ! Rdomg = R++

Then

is self-concordant on . The condition (9.43) is homogeneous and preserved under addition. It is satisfied by all (convex) quadratic functions, i.e., functions of the form where .

{x|x > 0, g(x) < 0}

(9.43)

ax

2 + bx+ c a � 0


Properties of Self-Concordant Functions

• For strictly convex self-concordant functions, we can obtain bounds (analogous to the earlier ones in terms of $\|\nabla f(x)\|_2$) in terms of the Newton decrement
  $\lambda(x) = \left(\nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x)\right)^{1/2}.$
  Unlike the bounds based on the norm of the gradient, the bounds based on the Newton decrement are not affected by an affine change of coordinates.

• For future reference we note that the Newton decrement can also be expressed as
  $\lambda(x) = \sup_{v \ne 0}\ \dfrac{-\nabla f(x)^T v}{\left(v^T\nabla^2 f(x)\,v\right)^{1/2}}.$    (9.44)
  In other words, we have
  $-\nabla f(x)^T v \le \lambda(x)\left(v^T\nabla^2 f(x)\,v\right)^{1/2}$
  for any nonzero v, with equality for $v = \Delta x_{\rm nt}$.

Upper and Lower Bounds on Second Derivatives

• Suppose $f : \mathbb{R} \to \mathbb{R}$ is a strictly convex self-concordant function. We can write the self-concordance inequality (9.41) as
  $\left|\dfrac{d}{dt}\left(f''(t)^{-1/2}\right)\right| \le 1, \qquad t \in \operatorname{dom} f.$    (9.45)
  We can integrate (9.45) between 0 and t to obtain
  $-t \le f''(t)^{-1/2} - f''(0)^{-1/2} \le t.$

• From this we obtain lower and upper bounds on $f''(t)$:
  $\dfrac{f''(0)}{\left(1 + t f''(0)^{1/2}\right)^2} \le f''(t) \le \dfrac{f''(0)}{\left(1 - t f''(0)^{1/2}\right)^2},$    (9.46)
  where the lower bound holds for $t \ge 0$ in $\operatorname{dom} f$ and the upper bound holds for $0 \le t < f''(0)^{-1/2}$.

Analysis of Newton's Method for Self-Concordant Functions

• In this analysis, we need to show that there are numbers $\eta$ and $\gamma$ with $0 < \eta \le 1/4$ and $\gamma > 0$, that depend only on the line search parameters $\alpha$ and $\beta$, such that the following hold:
  • if $\lambda(x^{(k)}) > \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$;    (9.51)
  • if $\lambda(x^{(k)}) \le \eta$, then the backtracking line search selects $t^{(k)} = 1$ and $2\lambda(x^{(k+1)}) \le \left(2\lambda(x^{(k)})\right)^2$.    (9.52)

• As with the previous results (9.32) and (9.33), once $\lambda(x^{(k)}) \le \eta$ we have $\lambda(x^{(l)}) \le \eta$ for all $l \ge k$, and
  $2\lambda(x^{(l)}) \le \left(2\lambda(x^{(k)})\right)^{2^{\,l-k}} \le \left(\tfrac{1}{2}\right)^{2^{\,l-k}}.$

Analysis of Newton's Method for Self-Concordant Functions

• Therefore, for $l \ge k$,
  $f(x^{(l)}) - p^\star \le \lambda(x^{(l)})^2 \le \left(\tfrac{1}{2}\right)^{2^{\,l-k+1}},$
  which indicates a quadratic convergence speed (that is a rapid speed!).

• Damped Newton phase: Let $\tilde f(t) = f(x + t\Delta x_{\rm nt})$, so we know $\tilde f'(0) = -\lambda(x)^2$ and $\tilde f''(0) = \lambda(x)^2$. If we integrate the upper bound in (9.46) twice, we obtain an upper bound for $\tilde f(t)$:
  $\tilde f(t) \le f(x) - t\lambda(x)^2 - t\lambda(x) - \log\left(1 - t\lambda(x)\right), \qquad 0 \le t < 1/\lambda(x).$    (9.54)

Analysis of Newton's Method for Self-Concordant Functions

• We can use this bound to show that the backtracking line search always results in a step size $t \ge \beta/(1 + \lambda(x))$. To prove this we note that the point $\hat t = 1/(1 + \lambda(x))$ satisfies the exit condition of the line search:
  $\tilde f(\hat t) \le f(x) - \hat t\lambda(x)^2 - \hat t\lambda(x) - \log\left(1 - \hat t\lambda(x)\right) = f(x) - \lambda(x) + \log\left(1 + \lambda(x)\right) \le f(x) - \dfrac{\lambda(x)^2}{2(1 + \lambda(x))} \le f(x) - \alpha\hat t\,\lambda(x)^2$
  (due to $-x + \log(1 + x) \le -\dfrac{x^2}{2(1 + x)}$, for $x > 0$, and $\alpha \le 1/2$).

• Since the line search returns $t \ge \beta/(1 + \lambda(x))$, we have
  $f(x^+) - f(x) \le -\alpha\beta\dfrac{\lambda(x)^2}{1 + \lambda(x)},$
  so (9.51) holds with
  $\gamma = \alpha\beta\dfrac{\eta^2}{1 + \eta}.$

Analysis of Newton's Method for Self-Concordant Functions

• Quadratically convergent phase: Here we want to show that if we take $\eta = (1 - 2\alpha)/4$, that is, if $\lambda(x^{(k)}) \le (1 - 2\alpha)/4$, then the backtracking line search accepts the unit step and (9.52) holds.

• First note that the upper bound (9.54) implies that a unit step t = 1 yields a point in $\operatorname{dom} f$ if $\lambda(x) < 1$. If $\lambda(x) \le (1 - 2\alpha)/2$, we have, using (9.54),
  $f(x + \Delta x_{\rm nt}) \le f(x) - \lambda(x)^2 - \lambda(x) - \log\left(1 - \lambda(x)\right) \le f(x) - \tfrac{1}{2}\lambda(x)^2 + \lambda(x)^3 \le f(x) - \alpha\lambda(x)^2,$
  so the unit step satisfies the condition of sufficient decrease.

Analysis of Newton's Method for Self-Concordant Functions

• Theorem (see exercises 9.17 and 9.18): Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a strictly convex self-concordant function and $\lambda(x) < 1$. If we define $x^+ = x - \nabla^2 f(x)^{-1}\nabla f(x)$, then we have
  $\lambda(x^+) \le \dfrac{\lambda(x)^2}{\left(1 - \lambda(x)\right)^2}.$

• The inequality (9.52) follows from this theorem: if $\lambda(x) \le 1/4$, then
  $\lambda(x^+) \le \dfrac{\lambda(x)^2}{(1 - \lambda(x))^2} \le \tfrac{16}{9}\lambda(x)^2 \le 2\lambda(x)^2,$
  i.e., $2\lambda(x^+) \le \left(2\lambda(x)\right)^2$, and (9.52) holds when $\lambda(x^{(k)}) \le \eta$ (recall $\eta = (1 - 2\alpha)/4 \le 1/4$).