Convex Optimization Theory and Applications

Lecture 6: Unconstrained Minimization (Chapter 9 in Textbook)

Prof. Chun-Hung Liu
Dept. of Electrical and Computer Engineering, National Chiao Tung University
Fall 2016 (2016/11/30)


Outline

• Terminology and assumptions

• Gradient descent method

• Steepest descent method

• Newton’s method

• Self-concordant functions (Optional)

Closed Functions

• A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be closed if, for each $\alpha \in \mathbb{R}$, the sublevel set
  $\{x \in \operatorname{dom} f \mid f(x) \le \alpha\}$
  is closed. This is equivalent to the condition that the epigraph of f,
  $\operatorname{epi} f = \{(x, t) \in \mathbb{R}^{n+1} \mid x \in \operatorname{dom} f,\ f(x) \le t\},$
  is closed. (This definition is general, but is usually only applied to convex functions.)

• If $f : \mathbb{R}^n \to \mathbb{R}$ is continuous and $\operatorname{dom} f$ is closed, then f is closed.

• If $f : \mathbb{R}^n \to \mathbb{R}$ is continuous with $\operatorname{dom} f$ open, then f is closed if and only if f converges to $\infty$ along every sequence converging to a boundary point of $\operatorname{dom} f$. In other words, if $x_i \in \operatorname{dom} f$ with $x_i \to x \in \operatorname{bd} \operatorname{dom} f$, we have $\lim_{i \to \infty} f(x_i) = \infty$.

Example of Closed Functions

• The following are examples of closed functions on $\mathbb{R}$: for instance, $f(x) = -\log x$ with $\operatorname{dom} f = \mathbb{R}_{++}$ is closed, since every sublevel set $\{x \mid -\log x \le \alpha\} = [e^{-\alpha}, \infty)$ is closed (equivalently, $f(x) \to \infty$ as $x \to 0^+$).

Unconstrained Minimization Problems

• In this chapter, we discuss methods for solving the unconstrained optimization problem
  $\text{minimize } f(x)$    (9.1)
  where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice continuously differentiable (which implies that $\operatorname{dom} f$ is open).

• We assume that the problem is solvable, i.e., there exists an optimal point $x^\star$; we denote the optimal value by $p^\star = \inf_x f(x) = f(x^\star)$.

• Since f is differentiable and convex, a necessary and sufficient condition for a point $x^\star$ to be optimal is
  $\nabla f(x^\star) = 0.$    (9.2)

• Solving the unconstrained minimization problem (9.1) is the same as finding a solution of (9.2).

Initial Point and Sublevel Set

• The methods described in this chapter require a suitable starting point $x^{(0)}$. The starting point must lie in $\operatorname{dom} f$, and in addition the sublevel set
  $S = \{x \in \operatorname{dom} f \mid f(x) \le f(x^{(0)})\}$
  must be closed. This condition is satisfied for all $x^{(0)} \in \operatorname{dom} f$ if the function f is closed, i.e., all its sublevel sets are closed.

• The closedness of the sublevel set S is hard to verify in general, except when all sublevel sets are closed:
  • equivalent to the condition that $\operatorname{epi} f$ is closed;
  • true if $\operatorname{dom} f = \mathbb{R}^n$;
  • true if $f(x) \to \infty$ as $x \to \operatorname{bd} \operatorname{dom} f$.

• Examples of differentiable functions with closed sublevel sets:
  $f(x) = \log\left(\sum_{i=1}^m \exp(a_i^T x + b_i)\right), \qquad f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x).$

Examples

• Quadratic minimization and least-squares: The general convex quadratic minimization problem has the form
  $\text{minimize } \tfrac{1}{2}x^T P x + q^T x + r,$    (9.4)
  where $P \in \mathbb{S}^n_+$, $q \in \mathbb{R}^n$, and $r \in \mathbb{R}$.
  • This problem can be solved via the optimality condition $P x^\star + q = 0$, which is a set of linear equations.
  • When $P \succ 0$, there is a unique solution $x^\star = -P^{-1} q$.
  • When P is not positive definite, any solution of $P x^\star = -q$ is optimal.
  • If $P x^\star = -q$ does not have a solution, then the problem (9.4) is unbounded below.
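As a quick numerical illustration (a sketch of my own, not part of the slides; the function name quadratic_minimizer is made up here), the quadratic case reduces to one linear solve, and a least-squares solve covers the semidefinite case:

    import numpy as np

    def quadratic_minimizer(P, q):
        """Minimize (1/2) x^T P x + q^T x + r via the optimality condition P x = -q.

        P is assumed symmetric positive semidefinite. If P x = -q has no solution,
        the problem is unbounded below and None is returned.
        """
        x, *_ = np.linalg.lstsq(P, -q, rcond=None)   # handles singular P as well
        if np.allclose(P @ x, -q):                   # check the optimality condition
            return x
        return None                                  # P x = -q has no solution

    # When P is positive definite the unique solution is x* = -P^{-1} q.
    P = np.array([[2.0, 0.0], [0.0, 4.0]])
    q = np.array([1.0, -2.0])
    print(quadratic_minimizer(P, q))                 # [-0.5  0.5]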

Examples

• Unconstrained geometric programming: Consider an unconstrained geometric program in convex form,
  $\text{minimize } f(x) = \log\left(\sum_{i=1}^m \exp(a_i^T x + b_i)\right).$
  The optimality condition is
  $\nabla f(x^\star) = \dfrac{\sum_{i=1}^m \exp(a_i^T x^\star + b_i)\, a_i}{\sum_{i=1}^m \exp(a_i^T x^\star + b_i)} = 0,$
  which in general has no analytical solution, so here we must resort to an iterative algorithm.

• Analytic center of linear inequalities:
  $\text{minimize } f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x),$
  where the domain of f is the open set $\operatorname{dom} f = \{x \mid a_i^T x < b_i,\ i = 1, \dots, m\}$. The solution, if it exists, is called the analytic center of the inequalities.

Strong Convexity and Implications

• We assume the objective function is strongly convex on S, which means that there exists an $m > 0$ such that
  $\nabla^2 f(x) \succeq mI$ for all $x \in S$.    (9.7)

• Strong convexity has several interesting consequences. For $x, y \in S$, we have
  $f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(z)\,(y - x)$
  for some z on the line segment [x, y].

• Combining this with the strong convexity condition in (9.7), we have
  $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{m}{2}\|y - x\|_2^2$    (9.8)
  for all x and y in S.

Strong Convexity and Implications

• Setting the gradient of the righthand side of (9.8) with respect to y equal to zero, we find that $\tilde y = x - (1/m)\nabla f(x)$ minimizes the righthand side of (9.8). Therefore we have
  $f(y) \ge f(x) - \tfrac{1}{2m}\|\nabla f(x)\|_2^2.$
  Since this holds for any $y \in S$, we have
  $p^\star \ge f(x) - \tfrac{1}{2m}\|\nabla f(x)\|_2^2, \qquad \text{i.e.,} \qquad \tfrac{1}{2m}\|\nabla f(x)\|_2^2 \ge f(x) - p^\star.$
  This inequality shows that if the gradient is small at a point, then the point is nearly optimal.

Strong Convexity and Implications

• We can also derive a bound on the distance between x and any optimal point $x^\star$, in terms of $\|\nabla f(x)\|_2$:
  $\|x - x^\star\|_2 \le \tfrac{2}{m}\|\nabla f(x)\|_2.$    (9.11)
  To see this, we apply (9.8) with $y = x^\star$ to obtain
  $p^\star = f(x^\star) \ge f(x) + \nabla f(x)^T(x^\star - x) + \tfrac{m}{2}\|x^\star - x\|_2^2 \ge f(x) - \|\nabla f(x)\|_2\|x^\star - x\|_2 + \tfrac{m}{2}\|x^\star - x\|_2^2$
  (the second step is due to the Cauchy–Schwarz inequality). Since we must have $p^\star \le f(x)$, it follows that
  $-\|\nabla f(x)\|_2\|x^\star - x\|_2 + \tfrac{m}{2}\|x^\star - x\|_2^2 \le 0,$
  from which (9.11) follows.

Strong Convexity and Implications

• Upper bound on $\nabla^2 f(x)$: The maximum eigenvalue of $\nabla^2 f(x)$, which is a continuous function of x on S, is bounded above on S, i.e., there exists a constant M such that
  $\nabla^2 f(x) \preceq M I$    (9.12)
  for all $x \in S$. This upper bound on the Hessian implies, for any $x, y \in S$,
  $f(y) \le f(x) + \nabla f(x)^T (y - x) + \tfrac{M}{2}\|y - x\|_2^2.$    (9.13)
  Minimizing each side over y yields
  $p^\star \le f(x) - \tfrac{1}{2M}\|\nabla f(x)\|_2^2.$

Condition Number of Sublevel Sets

• From the strong convexity inequality (9.7) and the inequality (9.12), we have
  $mI \preceq \nabla^2 f(x) \preceq MI$    (9.14)
  for all $x \in S$. The ratio $\kappa = M/m$ is thus an upper bound on the condition number of the matrix $\nabla^2 f(x)$, i.e., the ratio of its largest eigenvalue to its smallest eigenvalue.

• Geometrical interpretation of (9.14): We define the width of a convex set $C \subseteq \mathbb{R}^n$ in the direction q, where $\|q\|_2 = 1$, as
  $W(C, q) = \sup_{z \in C} q^T z - \inf_{z \in C} q^T z.$

• The minimum width and maximum width of C are given by
  $W_{\min} = \inf_{\|q\|_2 = 1} W(C, q), \qquad W_{\max} = \sup_{\|q\|_2 = 1} W(C, q).$

Condition Number of Sublevel Sets

• The condition number of the convex set C is defined as
  $\operatorname{cond}(C) = \dfrac{W_{\max}^2}{W_{\min}^2},$
  i.e., the square of the ratio of its maximum width to its minimum width.

• The condition number of C gives a measure of its anisotropy or eccentricity:
  • If the condition number of a set C is small (say, near one), it means that the set has approximately the same width in all directions, i.e., it is nearly spherical.
  • If the condition number is large, it means that the set is far wider in some directions than in others.

Condition Number of Sublevel Sets

• Example: for an ellipsoid $\mathcal{E} = \{x \mid (x - x_0)^T A^{-1}(x - x_0) \le 1\}$ with $A \in \mathbb{S}^n_{++}$, the width in the direction q is
  $W(\mathcal{E}, q) = 2\,(q^T A q)^{1/2},$
  so $W_{\min} = 2\sqrt{\lambda_{\min}(A)}$, $W_{\max} = 2\sqrt{\lambda_{\max}(A)}$, and $\operatorname{cond}(\mathcal{E}) = \lambda_{\max}(A)/\lambda_{\min}(A) = \kappa(A)$.

Condition Number of Sublevel Sets

• Suppose f satisfies $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. We will derive a bound on the condition number of the $\alpha$-sublevel set $C_\alpha = \{x \mid f(x) \le \alpha\}$, where $p^\star < \alpha \le f(x^{(0)})$.

• Applying (9.13) and (9.8) with $x = x^\star$, we have
  $p^\star + \tfrac{M}{2}\|y - x^\star\|_2^2 \ge f(y) \ge p^\star + \tfrac{m}{2}\|y - x^\star\|_2^2.$
  This implies that $B_{\rm inner} \subseteq C_\alpha \subseteq B_{\rm outer}$, where
  $B_{\rm inner} = \{y \mid \|y - x^\star\|_2 \le (2(\alpha - p^\star)/M)^{1/2}\}, \qquad B_{\rm outer} = \{y \mid \|y - x^\star\|_2 \le (2(\alpha - p^\star)/m)^{1/2}\}.$
  The ratio of the radii squared gives an upper bound on the condition number of $C_\alpha$:
  $\operatorname{cond}(C_\alpha) \le \dfrac{M}{m}.$

Condition Number of Sublevel Sets

• We can also give a geometric interpretation of the condition number $\kappa(\nabla^2 f(x^\star))$ of the Hessian at the optimum. From the Taylor series expansion of f around $x^\star$,
  $f(y) \approx p^\star + \tfrac{1}{2}(y - x^\star)^T \nabla^2 f(x^\star)\,(y - x^\star),$
  we see that, for $\alpha$ close to $p^\star$,
  $C_\alpha \approx \{y \mid (y - x^\star)^T \nabla^2 f(x^\star)\,(y - x^\star) \le 2(\alpha - p^\star)\},$
  i.e., the sublevel set is well approximated by an ellipsoid with center $x^\star$. Therefore,
  $\operatorname{cond}(C_\alpha) \to \kappa\left(\nabla^2 f(x^\star)\right) \quad \text{as } \alpha \to p^\star.$

Descent Methods

• The algorithms described in this chapter produce a minimizing sequence $x^{(k)}$, $k = 1, \dots$, where
  $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$
  and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal).

• $\Delta x^{(k)} \in \mathbb{R}^n$ is called the step or search direction (even though it need not have unit norm). The scalar $t^{(k)} \ge 0$ is called the step size or step length at iteration k.

• When we focus on one iteration of an algorithm, we sometimes drop the superscripts and use the lighter notation $x := x + t\Delta x$, or $x^+ = x + t\Delta x$, in place of $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$.

• All the methods we study are descent methods, which means that
  $f(x^{(k+1)}) < f(x^{(k)}),$
  except when $x^{(k)}$ is optimal.

Descent Methods

• From convexity we know that $\nabla f(x^{(k)})^T(y - x^{(k)}) \ge 0$ implies $f(y) \ge f(x^{(k)})$, so the search direction in a descent method must satisfy
  $\nabla f(x^{(k)})^T\Delta x^{(k)} < 0,$
  i.e., it must make an acute angle with the negative gradient. We call such a direction a descent direction.

• The outline of a general descent method is as follows. It alternates between two steps: determining a descent direction $\Delta x$, and the selection of a step size t.
  General descent method: given a starting point $x \in \operatorname{dom} f$, repeat
    1. Determine a descent direction $\Delta x$.
    2. Line search: choose a step size $t > 0$.
    3. Update: $x := x + t\Delta x$.
  until a stopping criterion is satisfied.

Descent Methods

• Exact line search: One line search method sometimes used in practice is exact line search, in which t is chosen to minimize f along the ray $\{x + t\Delta x \mid t \ge 0\}$:
  $t = \operatorname*{argmin}_{s \ge 0} f(x + s\Delta x).$    (9.16)
  An exact line search is used when the cost of this one-variable minimization problem is low compared to the cost of computing the search direction itself.

• Most line searches used in practice are inexact: the step length is chosen to approximately minimize f along the ray $\{x + t\Delta x \mid t \ge 0\}$, or even just to reduce f "enough".

• Backtracking line search: One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$ and $0 < \beta < 1$.

Descent Methods

• The line search is called backtracking because it starts with unit step size (t = 1) and then reduces it by the factor $\beta$ until the stopping condition
  $f(x + t\Delta x) \le f(x) + \alpha t\nabla f(x)^T\Delta x$
  holds. Since $\Delta x$ is a descent direction, we have $\nabla f(x)^T\Delta x < 0$, so for small enough t we have
  $f(x + t\Delta x) \approx f(x) + t\nabla f(x)^T\Delta x < f(x) + \alpha t\nabla f(x)^T\Delta x,$
  which shows that the backtracking line search eventually terminates.
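A minimal sketch (my own code, not from the slides) of this backtracking loop; it assumes f returns a finite value (or +inf) at every trial point so the comparison below is well defined:

    import numpy as np

    def backtracking(f, grad_fx, x, dx, alpha=0.1, beta=0.7):
        """Shrink t by beta until f(x + t*dx) <= f(x) + alpha*t*grad_fx^T dx."""
        t = 1.0
        fx = f(x)
        slope = grad_fx @ dx        # directional derivative; negative for a descent direction
        while f(x + t * dx) > fx + alpha * t * slope:
            t *= beta
        return t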


Gradient Descent Method

• A natural choice for the search direction is the negative gradient, $\Delta x = -\nabla f(x)$. The resulting algorithm is called the gradient algorithm or gradient descent method.

• The stopping criterion is usually of the form $\|\nabla f(x)\|_2 \le \eta$, where $\eta$ is small and positive.
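Putting the pieces together, a sketch (my own) of gradient descent that reuses the backtracking routine above and stops when the gradient norm drops below eta:

    import numpy as np

    def gradient_descent(f, grad, x0, eta=1e-6, alpha=0.1, beta=0.7, max_iter=5000):
        """Gradient descent with backtracking; stops when ||grad f(x)||_2 <= eta."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eta:
                break
            dx = -g                                      # search direction: negative gradient
            t = backtracking(f, g, x, dx, alpha, beta)   # backtracking sketch above
            x = x + t * dx
        return x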

Gradient Descent Method

• Convergence analysis: Here we present a simple convergence analysis for the gradient method, using the lighter notation $x^+ = x + t\Delta x$ with $\Delta x = -\nabla f(x)$.

• Assume f is strongly convex on S, so that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$.

• Define the function $\tilde f : \mathbb{R} \to \mathbb{R}$ by $\tilde f(t) = f(x - t\nabla f(x))$, i.e., f restricted to the ray in the negative gradient direction.

• With $y = x - t\nabla f(x)$ in (9.13), we obtain a quadratic upper bound on $\tilde f$:
  $\tilde f(t) \le f(x) - t\|\nabla f(x)\|_2^2 + \tfrac{M t^2}{2}\|\nabla f(x)\|_2^2.$    (9.17)

• Analysis for exact line search: Assume that an exact line search is used, and minimize over t both sides of the inequality (9.17).
  • On the lefthand side we get $\tilde f(t_{\rm exact})$, where $t_{\rm exact}$ is the step length that minimizes $\tilde f$.
  • The righthand side is a simple quadratic in t, which is minimized by $t = 1/M$ and has minimum value $f(x) - \tfrac{1}{2M}\|\nabla f(x)\|_2^2$.

Gradient Descent Method

• Therefore we have
  $f(x^+) = \tilde f(t_{\rm exact}) \le f(x) - \tfrac{1}{2M}\|\nabla f(x)\|_2^2.$
  Subtracting $p^\star$ from both sides and combining this with $\|\nabla f(x)\|_2^2 \ge 2m\left(f(x) - p^\star\right)$, we conclude
  $f(x^+) - p^\star \le \left(1 - \tfrac{m}{M}\right)\left(f(x) - p^\star\right).$
  Applying this inequality recursively, we find that
  $f(x^{(k)}) - p^\star \le c^k\left(f(x^{(0)}) - p^\star\right),$
  where $c = 1 - m/M < 1$, which shows that $f(x^{(k)})$ converges to $p^\star$ as $k \to \infty$.

Gradient Descent Method

• Analysis for backtracking line search: Consider the case where a backtracking line search is used in the gradient descent method. We will show that the backtracking exit condition,
  $\tilde f(t) \le f(x) - \alpha t\|\nabla f(x)\|_2^2,$
  is satisfied whenever $0 \le t \le 1/M$. First note that
  $0 \le t \le 1/M \implies -t + \tfrac{M t^2}{2} \le -\tfrac{t}{2}.$
  Using this result and the bound (9.17), for $0 \le t \le 1/M$ we have
  $\tilde f(t) \le f(x) - \tfrac{t}{2}\|\nabla f(x)\|_2^2 \le f(x) - \alpha t\|\nabla f(x)\|_2^2,$
  since $\alpha < 1/2$.

Gradient Descent Method

• Therefore the backtracking line search terminates either with $t = 1$ or with a value $t \ge \beta/M$. This provides a lower bound on the decrease in the objective function:
  $f(x^+) \le f(x) - \min\{\alpha,\ \beta\alpha/M\}\,\|\nabla f(x)\|_2^2.$
  Combine this with $\|\nabla f(x)\|_2^2 \ge 2m\left(f(x) - p^\star\right)$ to obtain
  $f(x^+) - p^\star \le \left(1 - \min\{2m\alpha,\ 2\beta\alpha m/M\}\right)\left(f(x) - p^\star\right) = c\left(f(x) - p^\star\right).$
  (What does this c tell you? It is the per-iteration contraction factor: since $c < 1$, the error again decreases at least geometrically.)

Example: A Quadratic Problem in $\mathbb{R}^2$

• Consider the quadratic objective function on $\mathbb{R}^2$
  $f(x) = \tfrac{1}{2}\left(x_1^2 + \gamma x_2^2\right),$
  where $\gamma > 0$. Clearly, the optimal point is $x^\star = 0$ and the optimal value is 0. We apply the gradient descent method with exact line search, starting at the point $x^{(0)} = (\gamma, 1)$. One can derive the following closed-form expressions for the iterates $x^{(k)}$ and their function values:
  $x_1^{(k)} = \gamma\left(\dfrac{\gamma - 1}{\gamma + 1}\right)^k, \qquad x_2^{(k)} = \left(-\dfrac{\gamma - 1}{\gamma + 1}\right)^k, \qquad f(x^{(k)}) = \left(\dfrac{\gamma - 1}{\gamma + 1}\right)^{2k} f(x^{(0)}).$
  This is illustrated in Figure 9.2, for $\gamma = 10$.
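A quick numerical check (my own, not from the slides) that exact-line-search gradient descent on this quadratic reproduces the closed-form iterates for $\gamma = 10$:

    import numpy as np

    gamma = 10.0
    A = np.diag([1.0, gamma])                 # f(x) = (1/2) x^T A x
    x = np.array([gamma, 1.0])                # x^(0) = (gamma, 1)
    r = (gamma - 1.0) / (gamma + 1.0)
    for k in range(1, 6):
        g = A @ x                             # gradient at the current iterate
        t = (g @ g) / (g @ (A @ g))           # exact line search step for a quadratic
        x = x - t * g
        x_closed = np.array([gamma * r**k, (-r)**k])
        print(k, x, x_closed)                 # the two columns agree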


(Slide showing Figures 9.3 and 9.5.)

Example: A Non-quadratic Problem in $\mathbb{R}^2$

• Consider the non-quadratic example
  $f(x) = e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1}.$    (9.20)
  We apply the gradient method with a backtracking line search, with $\alpha = 0.1$, $\beta = 0.7$. Figure 9.3 shows some level curves of f and the iterates $x^{(k)}$ generated by the gradient method (shown as small circles). The lines connecting successive iterates show the scaled steps.

• Figure 9.4 shows the error $f(x^{(k)}) - p^\star$ versus iteration k. The plot reveals that the error converges to zero approximately as a geometric series, i.e., the convergence is approximately linear.

• To compare backtracking line search with an exact line search, we use the gradient method with an exact line search on the same problem, with the same starting point. The results are given in Figures 9.5 and 9.4.
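Running the gradient-descent sketch above on (9.20) with the slide's parameters $\alpha = 0.1$, $\beta = 0.7$ (the starting point here is an arbitrary choice of mine):

    import numpy as np

    f = lambda x: (np.exp(x[0] + 3*x[1] - 0.1) + np.exp(x[0] - 3*x[1] - 0.1)
                   + np.exp(-x[0] - 0.1))

    def grad(x):
        e1 = np.exp(x[0] + 3*x[1] - 0.1)
        e2 = np.exp(x[0] - 3*x[1] - 0.1)
        e3 = np.exp(-x[0] - 0.1)
        return np.array([e1 + e2 - e3, 3*(e1 - e2)])

    x_gd = gradient_descent(f, grad, [-1.0, 1.0], alpha=0.1, beta=0.7)
    print(x_gd, f(x_gd))      # the optimal value is p* ≈ 2.559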


Steepest Descent Method

• The first-order Taylor approximation of $f(x + v)$ around x is
  $f(x + v) \approx f(x) + \nabla f(x)^T v.$
  The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of f at x in the direction v. It gives the approximate change in f for a small step v. The step v is a descent direction if the directional derivative is negative.

• The question is how to choose v to make the directional derivative as negative as possible. Since the directional derivative is linear in v, we must limit the size of v (e.g., fix its norm) for the question to make sense.

• We define a normalized steepest descent direction (with respect to a norm $\|\cdot\|$) as
  $\Delta x_{\rm nsd} = \operatorname*{argmin}\{\nabla f(x)^T v \mid \|v\| = 1\}.$    (9.23)
  A normalized steepest descent direction $\Delta x_{\rm nsd}$ is a step of unit norm that gives the largest decrease in the linear approximation of f.

Steepest Descent Method

• A normalized steepest descent direction can be interpreted geometrically as follows. We can just as well define $\Delta x_{\rm nsd}$ as
  $\Delta x_{\rm nsd} = \operatorname*{argmax}\{-\nabla f(x)^T v \mid \|v\| \le 1\},$
  that is, the direction in the unit ball of $\|\cdot\|$ that extends farthest in the direction $-\nabla f(x)$.

• We also consider a steepest descent step that is unnormalized, obtained by scaling the normalized steepest descent direction in a particular way:
  $\Delta x_{\rm sd} = \|\nabla f(x)\|_*\,\Delta x_{\rm nsd},$
  where $\|\cdot\|_*$ denotes the dual norm.

• Note that for the steepest descent step, we have
  $\nabla f(x)^T\Delta x_{\rm sd} = \|\nabla f(x)\|_*\,\nabla f(x)^T\Delta x_{\rm nsd} = -\|\nabla f(x)\|_*^2.$

Steepest Descent Method

• The steepest descent method uses the steepest descent direction as the search direction; otherwise it is identical to the general descent method (compute $\Delta x_{\rm sd}$, line search, update).

Steepest Descent for Euclidean and Quadratic Norms

• Steepest descent for the Euclidean norm: If we take the norm $\|\cdot\|$ to be the Euclidean norm, we find that the steepest descent direction is simply the negative gradient, i.e., $\Delta x_{\rm sd} = -\nabla f(x)$. The steepest descent method for the Euclidean norm coincides with the gradient descent method.

• Steepest descent for a quadratic norm: We consider the quadratic norm
  $\|z\|_P = \left(z^T P z\right)^{1/2} = \|P^{1/2} z\|_2,$
  where $P \in \mathbb{S}^n_{++}$. The normalized steepest descent direction is given by
  $\Delta x_{\rm nsd} = -\left(\nabla f(x)^T P^{-1}\nabla f(x)\right)^{-1/2} P^{-1}\nabla f(x),$
  and the unnormalized steepest descent step is $\Delta x_{\rm sd} = -P^{-1}\nabla f(x)$.
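A small sketch (mine, not from the slides) of the steepest descent directions for the quadratic norm $\|z\|_P$; it solves with P rather than forming $P^{-1}$:

    import numpy as np

    def steepest_descent_quadratic_norm(grad_fx, P):
        """Return (dx_nsd, dx_sd) for the quadratic norm defined by P (symmetric positive definite)."""
        Pinv_g = np.linalg.solve(P, grad_fx)         # P^{-1} grad f(x)
        dx_sd = -Pinv_g                              # unnormalized steepest descent step
        dx_nsd = dx_sd / np.sqrt(grad_fx @ Pinv_g)   # unit length in the ||.||_P norm
        return dx_nsd, dx_sd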

Steepest Descent for a Quadratic Norm

• The normalized steepest descent direction for a quadratic norm is illustrated in Figure 9.9.

Steepest Descent for the $\ell_1$-Norm

• Consider the steepest descent method for the $\ell_1$-norm. A normalized steepest descent (nsd) direction,
  $\Delta x_{\rm nsd} = \operatorname*{argmin}\{\nabla f(x)^T v \mid \|v\|_1 \le 1\},$
  is easily characterized. Let i be any index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$.

• A normalized steepest descent direction $\Delta x_{\rm nsd}$ for the $\ell_1$-norm is given by
  $\Delta x_{\rm nsd} = -\operatorname{sign}\!\left(\dfrac{\partial f(x)}{\partial x_i}\right) e_i,$
  where $e_i$ is the ith standard basis vector. An unnormalized steepest descent step is then
  $\Delta x_{\rm sd} = \Delta x_{\rm nsd}\,\|\nabla f(x)\|_\infty = -\dfrac{\partial f(x)}{\partial x_i}\, e_i.$
  Thus, the normalized steepest descent step in the $\ell_1$-norm can always be chosen to be a (signed) standard basis vector (see Figure 9.10).
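A sketch (mine) of the $\ell_1$ steepest descent step: it touches only the coordinate with the largest gradient magnitude.

    import numpy as np

    def l1_steepest_descent_step(grad_fx):
        """Return (dx_nsd, dx_sd) for the l1 norm."""
        g = np.asarray(grad_fx, dtype=float)
        i = int(np.argmax(np.abs(g)))       # index where |df/dx_i| equals ||grad f||_inf
        dx_nsd = np.zeros_like(g)
        dx_nsd[i] = -np.sign(g[i])          # normalized: a signed standard basis vector
        dx_sd = np.zeros_like(g)
        dx_sd[i] = -g[i]                    # unnormalized: -(df/dx_i) e_i
        return dx_nsd, dx_sd

    print(l1_steepest_descent_step([0.5, -2.0, 1.0]))   # steps along the 2nd coordinate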


Newton's Method

• The Newton step: For $x \in \operatorname{dom} f$, the vector
  $\Delta x_{\rm nt} = -\nabla^2 f(x)^{-1}\nabla f(x)$
  is called the Newton step (for f, at x). Positive definiteness of $\nabla^2 f(x)$ implies that
  $\nabla f(x)^T \Delta x_{\rm nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1}\nabla f(x) < 0$
  unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless x is optimal).

• Minimizer of second-order approximation: The second-order Taylor approximation (or model) $\hat f$ of f at x is
  $\hat f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x)\, v,$    (9.28)
  which is a convex quadratic function of v, and is minimized when $v = \Delta x_{\rm nt}$.
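A minimal sketch (mine) of the Newton step, computed by solving the linear system $\nabla^2 f(x)\,\Delta x_{\rm nt} = -\nabla f(x)$ rather than inverting the Hessian, together with a check that it minimizes the quadratic model (9.28):

    import numpy as np

    def newton_step(grad_fx, hess_fx):
        """Newton step: solve hess_fx @ dx = -grad_fx."""
        return np.linalg.solve(hess_fx, -grad_fx)

    # For f_hat(v) = f(x) + g^T v + (1/2) v^T H v, the minimizer satisfies H v = -g.
    H = np.array([[2.0, 0.5], [0.5, 1.0]])
    g = np.array([1.0, -1.0])
    v = newton_step(g, H)
    print(np.allclose(H @ v, -g))            # True: v is the minimizer of the model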

Newton's Method

• The Newton step $\Delta x_{\rm nt}$ is what should be added to the point x to minimize the second-order approximation of f at x. This is illustrated in Figure 9.16.

Newton's Method

• Steepest descent direction in Hessian norm: The Newton step is also the steepest descent direction at x, for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, i.e.,
  $\|u\|_{\nabla^2 f(x)} = \left(u^T\nabla^2 f(x)\,u\right)^{1/2}.$

• This gives another insight into why the Newton step should be a good search direction, and a very good search direction when x is near $x^\star$.

• In particular, near $x^\star$, a very good choice of quadratic norm for steepest descent is $P = \nabla^2 f(x^\star)$. When x is near $x^\star$, we have $\nabla^2 f(x) \approx \nabla^2 f(x^\star)$, which explains why the Newton step is a very good choice of search direction. This is illustrated in Figure 9.17.


Newton's Method

• Solution of linearized optimality condition: If we linearize the optimality condition $\nabla f(x^\star) = 0$ near x, we obtain
  $\nabla f(x + v) \approx \nabla f(x) + \nabla^2 f(x)\,v = 0,$
  which is a linear equation in v, with solution $v = \Delta x_{\rm nt}$. So the Newton step $\Delta x_{\rm nt}$ is what must be added to x so that the linearized optimality condition holds.

• When x is near $x^\star$ (so the optimality conditions almost hold), the update $x + \Delta x_{\rm nt}$ should be a very good approximation of $x^\star$.

• When n = 1, i.e., $f : \mathbb{R} \to \mathbb{R}$, this interpretation is particularly simple. The solution $x^\star$ of the minimization problem is characterized by $f'(x^\star) = 0$, i.e., it is the zero-crossing of the derivative $f'$, which is monotonically increasing since f is convex.

• Given our current approximation x of the solution, we form a first-order Taylor approximation of $f'$ at x. The zero-crossing of this affine approximation is then $x + \Delta x_{\rm nt}$ (see Figure 9.18).


Newton's Method

• The Newton decrement: The quantity
  $\lambda(x) = \left(\nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x)\right)^{1/2}$
  is called the Newton decrement at x.

• We can relate the Newton decrement to the quantity $f(x) - \inf_y \hat f(y)$, where $\hat f$ is the second-order approximation of f at x:
  $f(x) - \inf_y \hat f(y) = f(x) - \hat f(x + \Delta x_{\rm nt}) = \tfrac{1}{2}\lambda(x)^2.$

• So $\lambda^2/2$ is an estimate of $f(x) - p^\star$, based on the quadratic approximation of f at x. We can also express the Newton decrement as
  $\lambda(x) = \left(\Delta x_{\rm nt}^T\nabla^2 f(x)\,\Delta x_{\rm nt}\right)^{1/2}.$
  This shows that $\lambda$ is the norm of the Newton step, in the quadratic norm defined by the Hessian, i.e., the norm $\|u\|_{\nabla^2 f(x)} = \left(u^T\nabla^2 f(x)\,u\right)^{1/2}$.

Newton's Method

• The Newton decrement comes up in backtracking line search as well, since we have
  $\nabla f(x)^T\Delta x_{\rm nt} = -\lambda(x)^2.$    (9.30)
  This is the constant used in a backtracking line search, and can be interpreted as the directional derivative of f at x in the direction of the Newton step:
  $-\lambda(x)^2 = \nabla f(x)^T\Delta x_{\rm nt} = \left.\dfrac{d}{dt} f(x + t\Delta x_{\rm nt})\right|_{t=0}.$
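Assembling the pieces, a sketch (mine, not the slides' algorithm box) of Newton's method with backtracking line search, using (9.30) for both the line search and the stopping test $\lambda(x)^2/2 \le \epsilon$; f is assumed to return +inf (or a very large value) outside its domain so the backtracking stays feasible:

    import numpy as np

    def newton_method(f, grad, hess, x0, eps=1e-10, alpha=0.1, beta=0.7, max_iter=100):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g, H = grad(x), hess(x)
            dx_nt = np.linalg.solve(H, -g)     # Newton step
            lam2 = -g @ dx_nt                  # lambda(x)^2 = -grad f(x)^T dx_nt, by (9.30)
            if lam2 / 2.0 <= eps:
                break
            t = 1.0                            # backtracking line search
            while f(x + t * dx_nt) > f(x) - alpha * t * lam2:
                t *= beta
            x = x + t * dx_nt
        return x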

Classical Convergence Analysis

• Assumptions:
  • f is twice continuously differentiable and strongly convex on S, with $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$;
  • the Hessian of f is Lipschitz continuous on S with constant L, i.e.,
    $\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2$ for all $x, y \in S$.    (9.31)

• Outline: there exist numbers $\eta$ and $\gamma$, with $0 < \eta \le m^2/L$ and $\gamma > 0$, such that
  • if $\|\nabla f(x^{(k)})\|_2 \ge \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$;    (9.32)
  • if $\|\nabla f(x^{(k)})\|_2 < \eta$, then $\dfrac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \le \left(\dfrac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2$.    (9.33)

Classical Convergence Analysis

• Damped Newton phase ($\|\nabla f(x)\|_2 \ge \eta$): iterations may require backtracking steps, and by (9.32) the objective decreases by at least $\gamma$ per iteration.

• Quadratically convergent phase ($\|\nabla f(x)\|_2 < \eta$): the line search selects full steps t = 1 and, by (9.33), the gradient norm converges to zero quadratically.

Classical Convergence Analysis

• Conclusion: the number of iterations until $f(x) - p^\star \le \epsilon$ is bounded above by
  $\dfrac{f(x^{(0)}) - p^\star}{\gamma} + \log_2\log_2\left(\dfrac{\epsilon_0}{\epsilon}\right),$
  where $\epsilon_0 = 2m^3/L^2$; the second term grows extremely slowly with the required accuracy.

Example in $\mathbb{R}^2$

• We first apply Newton's method with backtracking line search on the test function (9.20), with line search parameters $\alpha = 0.1$, $\beta = 0.7$. Figure 9.19 shows the Newton iterates, and also the ellipsoids
  $\{x \mid \|x - x^{(k)}\|_{\nabla^2 f(x^{(k)})} \le 1\}$
  for the first two iterates k = 0, 1. The method works well because these ellipsoids give good approximations of the shape of the sublevel sets.

• Figure 9.20 shows the error $f(x^{(k)}) - p^\star$ versus iteration number for the same example. This plot shows that convergence to a very high accuracy is achieved in only five iterations. Quadratic convergence is clearly apparent: the last step reduces the error from about $10^{-5}$ to $10^{-10}$.
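Reusing f and grad from the gradient-method sketch on (9.20) and adding the Hessian, the Newton sketch above reproduces this behaviour (high accuracy after a handful of iterations):

    import numpy as np

    def hess(x):
        e1 = np.exp(x[0] + 3*x[1] - 0.1)
        e2 = np.exp(x[0] - 3*x[1] - 0.1)
        e3 = np.exp(-x[0] - 0.1)
        return np.array([[e1 + e2 + e3, 3*(e1 - e2)],
                         [3*(e1 - e2), 9*(e1 + e2)]])

    x_nt = newton_method(f, grad, hess, [-1.0, 1.0], alpha=0.1, beta=0.7)
    print(x_nt, f(x_nt))       # matches the gradient-method minimizer, to higher accuracy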


Example in $\mathbb{R}^{100}$

• Figure 9.21 shows the convergence of Newton's method with backtracking and exact line search for a problem in $\mathbb{R}^{100}$,
  $f(x) = c^T x - \sum_{i=1}^m \log(b_i - a_i^T x),$    (9.21)
  with n = 100 variables and m = 500 inequalities.

Example in $\mathbb{R}^{100}$

• The plot for the backtracking line search shows that a very high accuracy is attained in eight iterations.

• Like the example in $\mathbb{R}^2$, quadratic convergence is clearly evident after about the third iteration. The number of iterations in Newton's method with exact line search is only one smaller than with a backtracking line search.

• This is also typical: an exact line search usually gives a very small improvement in convergence of Newton's method.

• Figure 9.22 shows the step sizes for this example. After two damped steps, the steps taken by the backtracking line search are all full, i.e., t = 1.


Summary for Newton's Method

Newton's method has several very strong advantages over gradient and steepest descent methods:

• Convergence of Newton's method is rapid in general, and quadratic near $x^\star$. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.

• Newton's method is affine invariant. It is insensitive to the choice of coordinates, or the condition number of the sublevel sets of the objective.

• Newton's method scales well with problem size. Its performance on problems in $\mathbb{R}^{10000}$ is similar to its performance on problems in $\mathbb{R}^{10}$, with only a modest increase in the number of steps required.

• The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.

Summary for the Gradient Method

• The gradient method often exhibits approximately linear convergence, i.e., the error $f(x^{(k)}) - p^\star$ converges to zero approximately as a geometric series.

• The choice of backtracking parameters $\alpha, \beta$ has a noticeable but not dramatic effect on the convergence. An exact line search sometimes improves the convergence of the gradient method, but the effect is not large.

• The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well conditioned. When the condition number is larger, the gradient method is so slow that it is useless in practice.

• The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.

Classical Convergence Proof of Newton's Method

• Damped Newton phase: Assume $\|\nabla f(x)\|_2 \ge \eta$. Strong convexity implies that $\nabla^2 f(x) \preceq MI$ on S, and therefore
  $\tilde f(t) = f(x + t\Delta x_{\rm nt}) \le f(x) + t\nabla f(x)^T\Delta x_{\rm nt} + \tfrac{M\|\Delta x_{\rm nt}\|_2^2}{2}t^2 \le f(x) - t\lambda(x)^2 + \tfrac{M}{2m}t^2\lambda(x)^2,$
  where we use (9.30) and
  $\lambda(x)^2 = \Delta x_{\rm nt}^T \nabla^2 f(x)\,\Delta x_{\rm nt} \ge m\|\Delta x_{\rm nt}\|_2^2.$
  The step size $\hat t = m/M$ satisfies the exit condition of the line search, since
  $\tilde f(\hat t) \le f(x) - \tfrac{m}{2M}\lambda(x)^2 \le f(x) - \alpha\hat t\,\lambda(x)^2$
  (because $\alpha < 1/2$).

Classical Convergence Proof of Newton's Method

• So the line search returns a step size $t \ge \beta m/M$, resulting in a decrease of the objective function
  $f(x^+) - f(x) \le -\alpha t\lambda(x)^2 \le -\alpha\beta\tfrac{m}{M}\lambda(x)^2 \le -\alpha\beta\tfrac{m}{M^2}\|\nabla f(x)\|_2^2 \le -\alpha\beta\eta^2\tfrac{m}{M^2},$
  where we use
  $\lambda(x)^2 = \nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x) \ge \tfrac{1}{M}\|\nabla f(x)\|_2^2.$
  Therefore, (9.32) is satisfied with
  $\gamma = \alpha\beta\eta^2\tfrac{m}{M^2}.$

Classical Convergence Proof of Newton's Method

• Quadratically convergent phase: Assume $\|\nabla f(x)\|_2 < \eta$. We first show that the backtracking line search selects unit steps, provided
  $\eta \le 3(1 - 2\alpha)\tfrac{m^2}{L}.$
  By the Lipschitz condition (9.31), we have, for $t \ge 0$,
  $\|\nabla^2 f(x + t\Delta x_{\rm nt}) - \nabla^2 f(x)\|_2 \le tL\|\Delta x_{\rm nt}\|_2,$
  and therefore
  $\left|\Delta x_{\rm nt}^T\left(\nabla^2 f(x + t\Delta x_{\rm nt}) - \nabla^2 f(x)\right)\Delta x_{\rm nt}\right| \le tL\|\Delta x_{\rm nt}\|_2^3$
  (since $|u^T A u| \le \|A\|_2\|u\|_2^2$ for symmetric A).

Classical Convergence Proof of Newton's Method

• Also, with $\tilde f(t) = f(x + t\Delta x_{\rm nt})$, we know
  $\tilde f''(t) = \Delta x_{\rm nt}^T\nabla^2 f(x + t\Delta x_{\rm nt})\,\Delta x_{\rm nt} \le \lambda(x)^2 + t\,\tfrac{L}{m^{3/2}}\lambda(x)^3,$
  where we use the inequality above, $\Delta x_{\rm nt}^T\nabla^2 f(x)\Delta x_{\rm nt} = \lambda(x)^2$, and $\|\Delta x_{\rm nt}\|_2 \le \lambda(x)/\sqrt{m}$. We integrate the inequality to get
  $\tilde f'(t) \le \tilde f'(0) + t\lambda(x)^2 + \tfrac{t^2 L}{2m^{3/2}}\lambda(x)^3 = -\lambda(x)^2 + t\lambda(x)^2 + \tfrac{t^2 L}{2m^{3/2}}\lambda(x)^3,$
  using $\tilde f'(0) = -\lambda(x)^2$. We integrate once more to get
  $\tilde f(t) \le \tilde f(0) - t\lambda(x)^2 + \tfrac{t^2}{2}\lambda(x)^2 + \tfrac{t^3 L}{6m^{3/2}}\lambda(x)^3.$

Classical Convergence Proof of Newton's Method

• Finally, we take t = 1 to obtain
  $f(x + \Delta x_{\rm nt}) \le f(x) - \tfrac{1}{2}\lambda(x)^2 + \tfrac{L}{6m^{3/2}}\lambda(x)^3.$    (9.39)

• Now suppose $\|\nabla f(x)\|_2 \le \eta \le 3(1 - 2\alpha)\tfrac{m^2}{L}$. Due to the strong convexity, we have
  $\lambda(x) \le \tfrac{1}{\sqrt{m}}\|\nabla f(x)\|_2 \le 3(1 - 2\alpha)\tfrac{m^{3/2}}{L},$
  and substituting this into (9.39) gives
  $f(x + \Delta x_{\rm nt}) \le f(x) - \alpha\lambda(x)^2 = f(x) + \alpha\nabla f(x)^T\Delta x_{\rm nt},$
  so the backtracking line search accepts the unit step t = 1.

Classical Convergence Proof of Newton's Method

• Then, with $x^+ = x + \Delta x_{\rm nt}$ and $\nabla f(x) + \nabla^2 f(x)\Delta x_{\rm nt} = 0$, applying the Lipschitz condition we have
  $\|\nabla f(x^+)\|_2 = \|\nabla f(x + \Delta x_{\rm nt}) - \nabla f(x) - \nabla^2 f(x)\Delta x_{\rm nt}\|_2 \le \tfrac{L}{2}\|\Delta x_{\rm nt}\|_2^2 = \tfrac{L}{2}\|\nabla^2 f(x)^{-1}\nabla f(x)\|_2^2 \le \tfrac{L}{2m^2}\|\nabla f(x)\|_2^2,$
  which is exactly the result in (9.33).

Self-Concordance

• Shortcomings of the classical convergence analysis: it depends on the constants m, M, and L, which are almost never known in practice, and the resulting bound is not affine invariant, even though Newton's method itself is.

• Convergence analysis via self-concordance (Nesterov and Nemirovski): gives an iteration bound that depends only on the line search parameters and the required accuracy, and that is affine invariant.

Definition and Examples

• Self-concordant functions on $\mathbb{R}$: We start by considering functions on $\mathbb{R}$. A convex function $f : \mathbb{R} \to \mathbb{R}$ is self-concordant if
  $|f'''(x)| \le 2 f''(x)^{3/2}$    (9.41)
  for all $x \in \operatorname{dom} f$.

• Since linear and (convex) quadratic functions have zero third derivative, they are evidently self-concordant.
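A standard one-line check (not in the transcript, but immediate from the definition): the negative logarithm is self-concordant, with equality in (9.41).

    % f(x) = -\log x, \quad \operatorname{dom} f = \mathbb{R}_{++}:
    f''(x) = \frac{1}{x^{2}}, \qquad
    f'''(x) = -\frac{2}{x^{3}}, \qquad
    |f'''(x)| = \frac{2}{x^{3}} = 2\Big(\frac{1}{x^{2}}\Big)^{3/2} = 2\, f''(x)^{3/2}.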


Definition and Examples

• About the constant 2 that appears in the definition: In fact, this constant is chosen for convenience, in order to simplify the formulas later on; any other positive constant could be used instead.

Definition and Examples

• For example, suppose the convex function $f : \mathbb{R} \to \mathbb{R}$ satisfies
  $|f'''(x)| \le k f''(x)^{3/2},$    (9.42)
  where k is some positive constant. Then the function $\tilde f(x) = (k^2/4)f(x)$ satisfies
  $|\tilde f'''(x)| = \tfrac{k^2}{4}|f'''(x)| \le \tfrac{k^3}{4} f''(x)^{3/2} = 2\left(\tfrac{k^2}{4}f''(x)\right)^{3/2} = 2\,\tilde f''(x)^{3/2},$
  and therefore is self-concordant.

• This shows that a function that satisfies (9.42) for some positive k can be scaled to satisfy the standard self-concordance inequality (9.41).

Why Self-Concordance?

• Why is self-concordance so important? Because it is affine invariant.

• If we define the function $\tilde f$ by $\tilde f(y) = f(ay + b)$, where $a \ne 0$, then $\tilde f$ is self-concordant if and only if f is. To see this, we substitute $x = ay + b$ into the self-concordance inequality for $\tilde f$, i.e., $|\tilde f'''(y)| \le 2\tilde f''(y)^{3/2}$: since $\tilde f''(y) = a^2 f''(x)$ and $\tilde f'''(y) = a^3 f'''(x)$, we obtain
  $|a|^3\,|f'''(x)| \le 2\left(a^2 f''(x)\right)^{3/2} = 2\,|a|^3 f''(x)^{3/2},$
  which is the self-concordance inequality for f.

• Self-concordant functions on $\mathbb{R}^n$: We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is self-concordant if it is self-concordant along every line in its domain, i.e., if the function $\tilde f(t) = f(x + tv)$ is a self-concordant function of t for all $x \in \operatorname{dom} f$ and for all v.

Self-Concordant Calculus

• Self-concordance is preserved by scaling by a factor exceeding one: if f is self-concordant and $a \ge 1$, then af is self-concordant.

• Self-concordance is also preserved by addition: if $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant. To show this, it is sufficient to consider functions $f_1, f_2 : \mathbb{R} \to \mathbb{R}$, and thus
  $|f_1'''(x) + f_2'''(x)| \le |f_1'''(x)| + |f_2'''(x)| \le 2\left(f_1''(x)^{3/2} + f_2''(x)^{3/2}\right) \le 2\left(f_1''(x) + f_2''(x)\right)^{3/2}.$
  In the last step we use the inequality
  $u^{3/2} + v^{3/2} \le (u + v)^{3/2},$
  which holds for $u, v \ge 0$.

Self-Concordant Calculus

• Composition with an affine function: If $f : \mathbb{R}^n \to \mathbb{R}$ is self-concordant, and $A \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^n$, then $f(Ax + b)$ is self-concordant. For instance, combining this rule with addition and the self-concordance of $-\log u$, the log-barrier $f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$ (the analytic-center objective from the earlier example) is self-concordant.


Self-Concordant Calculus

• Let be a convex function with , and for all xg : R ! Rdomg = R++

Then

is self-concordant on . The condition (9.43) is homogeneous and preserved under addition. It is satisfied by all (convex) quadratic functions, i.e., functions of the form where .

{x|x > 0, g(x) < 0}

(9.43)

ax

2 + bx+ c a � 0


Properties of Self-Concordant Functions

• For strictly convex self-concordant functions, we can obtain bounds (analogous to the earlier ones in terms of $\|\nabla f(x)\|_2$) in terms of the Newton decrement
  $\lambda(x) = \left(\nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x)\right)^{1/2}.$
  Unlike the bounds based on the norm of the gradient, the bounds based on the Newton decrement are not affected by an affine change of coordinates.

• For future reference we note that the Newton decrement can also be expressed as
  $\lambda(x) = \sup_{v \ne 0}\ \dfrac{-\nabla f(x)^T v}{\left(v^T\nabla^2 f(x)\,v\right)^{1/2}}.$    (9.44)
  In other words, we have
  $-\nabla f(x)^T v \le \lambda(x)\left(v^T\nabla^2 f(x)\,v\right)^{1/2}$
  for any nonzero v, with equality for $v = \Delta x_{\rm nt}$.

Upper and Lower Bounds on Second Derivatives

• Suppose $f : \mathbb{R} \to \mathbb{R}$ is a strictly convex self-concordant function. We can write the self-concordance inequality (9.41) as
  $\left|\dfrac{d}{dt}\left(f''(t)^{-1/2}\right)\right| \le 1, \qquad t \in \operatorname{dom} f.$    (9.45)
  We can integrate (9.45) between 0 and t to obtain
  $-t \le f''(t)^{-1/2} - f''(0)^{-1/2} \le t.$

• From this we obtain lower and upper bounds on $f''(t)$:
  $\dfrac{f''(0)}{\left(1 + t f''(0)^{1/2}\right)^2} \le f''(t) \le \dfrac{f''(0)}{\left(1 - t f''(0)^{1/2}\right)^2},$    (9.46)
  where the lower bound holds for $t \ge 0$ in $\operatorname{dom} f$ and the upper bound holds for $0 \le t < f''(0)^{-1/2}$.

Analysis of Newton's Method for Self-Concordant Functions

• In this analysis, we need to show that there are numbers $\eta$ and $\gamma$ with $0 < \eta \le 1/4$ and $\gamma > 0$, that depend only on the line search parameters $\alpha$ and $\beta$, such that the following hold:
  • if $\lambda(x^{(k)}) > \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$;    (9.51)
  • if $\lambda(x^{(k)}) \le \eta$, then the backtracking line search selects $t^{(k)} = 1$ and $2\lambda(x^{(k+1)}) \le \left(2\lambda(x^{(k)})\right)^2$.    (9.52)

• As with the previous results (9.32) and (9.33), once $\lambda(x^{(k)}) \le \eta$ we have $\lambda(x^{(l)}) \le \eta$ for all $l \ge k$, and
  $2\lambda(x^{(l)}) \le \left(2\lambda(x^{(k)})\right)^{2^{\,l-k}} \le \left(\tfrac{1}{2}\right)^{2^{\,l-k}}.$

Analysis of Newton's Method for Self-Concordant Functions

• Therefore, for $l \ge k$,
  $f(x^{(l)}) - p^\star \le \lambda(x^{(l)})^2 \le \left(\tfrac{1}{2}\right)^{2^{\,l-k+1}},$
  which indicates a quadratic convergence speed (that is a rapid speed!).

• Damped Newton phase: Let $\tilde f(t) = f(x + t\Delta x_{\rm nt})$, so we know $\tilde f'(0) = -\lambda(x)^2$ and $\tilde f''(0) = \lambda(x)^2$. If we integrate the upper bound in (9.46) twice, we obtain an upper bound for $\tilde f(t)$:
  $\tilde f(t) \le f(x) - t\lambda(x)^2 - t\lambda(x) - \log\left(1 - t\lambda(x)\right), \qquad 0 \le t < 1/\lambda(x).$    (9.54)

Analysis of Newton's Method for Self-Concordant Functions

• We can use this bound to show that the backtracking line search always results in a step size $t \ge \beta/(1 + \lambda(x))$. To prove this we note that the point $\hat t = 1/(1 + \lambda(x))$ satisfies the exit condition of the line search:
  $\tilde f(\hat t) \le f(x) - \hat t\lambda(x)^2 - \hat t\lambda(x) - \log\left(1 - \hat t\lambda(x)\right) = f(x) - \lambda(x) + \log\left(1 + \lambda(x)\right) \le f(x) - \dfrac{\lambda(x)^2}{2(1 + \lambda(x))} \le f(x) - \alpha\hat t\,\lambda(x)^2$
  (due to $-x + \log(1 + x) \le -\dfrac{x^2}{2(1 + x)}$, for $x > 0$, and $\alpha \le 1/2$).

• Since the line search returns $t \ge \beta/(1 + \lambda(x))$, we have
  $f(x^+) - f(x) \le -\alpha\beta\dfrac{\lambda(x)^2}{1 + \lambda(x)},$
  so (9.51) holds with
  $\gamma = \alpha\beta\dfrac{\eta^2}{1 + \eta}.$

Analysis of Newton's Method for Self-Concordant Functions

• Quadratically convergent phase: Here we want to show that if we take $\eta = (1 - 2\alpha)/4$, that is, if $\lambda(x^{(k)}) \le (1 - 2\alpha)/4$, then the backtracking line search accepts the unit step and (9.52) holds.

• First note that the upper bound (9.54) implies that a unit step t = 1 yields a point in $\operatorname{dom} f$ if $\lambda(x) < 1$. If $\lambda(x) \le (1 - 2\alpha)/2$, we have, using (9.54),
  $f(x + \Delta x_{\rm nt}) \le f(x) - \lambda(x)^2 - \lambda(x) - \log\left(1 - \lambda(x)\right) \le f(x) - \tfrac{1}{2}\lambda(x)^2 + \lambda(x)^3 \le f(x) - \alpha\lambda(x)^2,$
  so the unit step satisfies the condition of sufficient decrease.

Analysis of Newton's Method for Self-Concordant Functions

• Theorem (see exercises 9.17 and 9.18): Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a strictly convex self-concordant function and $\lambda(x) < 1$. If we define $x^+ = x - \nabla^2 f(x)^{-1}\nabla f(x)$, then we have
  $\lambda(x^+) \le \dfrac{\lambda(x)^2}{\left(1 - \lambda(x)\right)^2}.$

• The inequality (9.52) follows from this theorem: if $\lambda(x) \le 1/4$, then
  $\lambda(x^+) \le \dfrac{\lambda(x)^2}{(1 - \lambda(x))^2} \le \tfrac{16}{9}\lambda(x)^2 \le 2\lambda(x)^2,$
  i.e., $2\lambda(x^+) \le \left(2\lambda(x)\right)^2$, and (9.52) holds when $\lambda(x^{(k)}) \le \eta$ (recall $\eta = (1 - 2\alpha)/4 \le 1/4$).