    Newton-type Methods

    Walter Murray

    Department of Management Science and Engineering,

    Stanford University, Stanford, CA

    July 5, 2010

    Abstract

Newton's method is one of the most famous numerical methods. Its origins, as the name suggests, lie in part with Newton, but the form familiar to us today is due to Simpson of Simpson's Rule fame. Originally proposed to find the roots of polynomials, it was Simpson who proposed using it for general nonlinear equations and for its use in optimization by finding a root of the gradient. Here we discuss Newton's method for both of these problems, with the focus on the somewhat harder optimization problem. A disadvantage of Newton's method, in its original form, is that it has only local convergence unless the class of functions is severely restricted. Much of the discussion here will be about the many ways Newton's method may be modified to achieve global convergence. A key aim of all these methods is that once the iterates become sufficiently close to a solution the method takes Newton steps.

    Keywords: nonlinear equations, optimization methods, modified Newton.

    1 Introduction

As noted, Newton's method is famous. Type it into the search on YouTube and you will get 144 videos, one of which has had over 30,000 hits. Despite its historic origins it would not be so well known today were it not for its widespread utility. It lies at the heart not just of methods to find roots, but also of many other methods, such as those for solving partial differential equations. Even Karmarkar's geometric-based algorithm [9] for linear programming, which received such acclaim when it was

discovered, was subsequently shown in [8] to be equivalent to Newton's method applied to a sequence of barrier functions. Newton's method also serves as a model for other algorithms. It has a theoretical purpose, enabling rates of convergence to be determined easily by showing that the algorithm of interest behaves asymptotically like Newton's method. Naturally a lot has been written about the method, and a classic book well worth reading is that by Ortega and Rheinboldt [11]. Other books that cover the material here and much more are [7], [2], and [10].

There would not be so much to read were it not for the fact that Newton's method is only locally convergent. Moreover, as in most cases of algorithms that are only locally convergent, it is usually not known a priori whether an initial estimate is indeed close enough. Ideally algorithms should be globally convergent (converge from an arbitrary initial point). However, it should be remembered that proofs of global convergence do need assumptions about the character of the problem that may not be verifiable for specific problems.

There are many approaches to incorporating Newton's method into a more complex algorithm to ensure global convergence, and that is the issue we focus on here. It is somewhat more difficult to deal with the case of using Newton's method to minimize a function rather than to solve a system of nonlinear equations, so it is that class of problems to which we give our main attention. Note that solving f(x) = 0 can be transformed into the minimization problem min_x (1/2) f(x)^T f(x), but minimizing a function cannot be transformed into solving a system of equations. Simply finding a stationary point by equating the gradient to zero does not do the job. Consequently, Newton's method for optimization, in addition to the deficiencies faced when solving systems of equations, needs to be augmented to enable iterates to move off saddle points. This is the key augmentation that is needed for minimization problems.

Note that there may not be a solution to f(x) = 0. Any algorithm will fail in some sense on such problems. It is useful to know how it fails and, should it converge, what the properties of the point of convergence are. Identifying a solution of a minimization problem is more problematic. We may converge to a point for which we can neither confirm nor deny that the point is a minimizer. That is another instance of why minimization is a more difficult problem.

Much is made, and rightly so, of Newton's method having a rapid rate of convergence. However, while little can be proved, observations show that when Newton's method converges it usually behaves well even away from the solution. It is to the fact that it is a good local model of the behavior of a function that its popularity and success can be attributed. It is also the reason why alternative methods endeavor to emulate Newton and why it is worthwhile trying to preserve the elements of Newton's method when modifying it to be robust.

We start by addressing the problem in one variable. While it does not highlight all the issues, it does illustrate the critical issue of local convergence being the best that can be achieved. In some of the text we drop the indices that denote the iteration when doing so does not cause confusion.

    2 Functions of one variable

Newton's method for finding the root of a function of one variable is very simple to appreciate. Given some point, say x_k, we may estimate the root of a function, say f(x), by constructing the tangent to the curve of f(x) at x_k and noting where that linear function is zero. Clearly for Newton's method to be defined we need f(x) to be differentiable, otherwise the tangent may not exist. Figure 1 shows a single step of Newton's method. Figure 2 illustrates that Newton's method may not give an improved estimate.

Algebraically the method is that of approximating the nonlinear function at the current iterate by a linear one and using the location of the zero of the linear approximation as the next iterate.

Figure 1: The tangent to f(x) at x_k.

Figure 2: When f'_k is small, x_{k+1} may be poor.

Consider

f(x) ≈ ax + b,

where a and b are scalars. Since the two function values and their gradients are equal at x_k we have a = f'_k ≡ f'(x_k) and b = f_k − f'_k x_k, giving

f(x) ≈ f'_k x + f_k − f'_k x_k.

It follows that the zero of the linear function, say x_{k+1}, is given by

x_{k+1} = x_k − f_k/f'_k.     (1)

We can now repeat the process replacing x_k by x_{k+1}, which is Newton's method for finding the zero or root of a function of one variable. Note that the iteration given by (1) is only defined if f'_k ≠ 0, and it is this issue that is the flaw in Newton's method. It is not only f'_k = 0 that is a cause of concern. If |f'_k| is small compared to |f_k| then the step may be so large as to make the new iterate a much worse approximation. It is not hard to construct a sequence of iterates that cycle.
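To make the iteration concrete, the following is a minimal sketch of (1) in Python (an illustration only, with none of the safeguards discussed below; the user supplies f and f'):

    def newton_1d(f, fprime, x0, tol=1e-10, max_iter=50):
        """Basic Newton iteration (1): x_{k+1} = x_k - f(x_k)/f'(x_k).

        Unsafeguarded: it may diverge or cycle, exactly as discussed above."""
        x = x0
        for _ in range(max_iter):
            fx, fpx = f(x), fprime(x)
            if abs(fx) <= tol:            # converged on |f|
                return x
            if abs(fpx) < 1e-16:          # derivative too small; step (1) undefined
                raise ZeroDivisionError("f'(x) vanishes; Newton step undefined")
            x = x - fx / fpx
        return x

    # example: root of x^3 - 1 starting from x0 = 2
    root = newton_1d(lambda x: x**3 - 1.0, lambda x: 3.0*x**2, 2.0)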

Theorem 1.1 (Local convergence of Newton's method). Let f(x) be a univariate continuously differentiable real-valued function on an open convex set D ⊆ IR^1. Assume that f(x*) = 0 for some x* ∈ D and that f'(x*) ≠ 0. Then there exists an open interval S ⊆ D containing x* such that, for any x_0 in S, the Newton iterates (1) are well defined, remain in S and converge to x*.

Proof. Let τ be a fixed constant in (0, 1). Since f' is continuous at x* and f'(x*) is non-zero, there is an open interval S = (x* − ε, x* + ε) and a positive constant μ such that

1/|f'(x)| ≤ μ   and   |f'(y) − f'(x)| ≤ τ/μ     (2)

for every x and y in S. Suppose that x_k ∈ S. Since f(x*) = 0, it follows from (1) that

x* − x_{k+1} = (f_k − f(x*) − f'_k(x_k − x*))/f'_k.


Hence the first bound in (2) implies

|x_{k+1} − x*| ≤ μ |f_k − f(x*) − f'_k(x_k − x*)|.

From the integral mean-value theorem we have

f(x_k) − f(x*) − f'_k(x_k − x*) = ∫_0^1 (f'(x* + ξ(x_k − x*)) − f'_k)(x_k − x*) dξ.

Hence,

|x_{k+1} − x*| ≤ μ max_{0≤ξ≤1} |f'(x* + ξ(x_k − x*)) − f'_k| · |x_k − x*|.     (3)

Thus, the second bound in (2) gives

|x_{k+1} − x*| ≤ τ |x_k − x*|

provided x_k ∈ S. Since 0 < τ < 1, x_{k+1} also lies in S and, by induction, the iterates remain in S and converge to x*.

Theorem 1.2 (Quadratic convergence of Newton's method). Let the assumptions of Theorem 1.1 hold and assume in addition that f' is Lipschitz continuous on S, that is, there is a constant γ such that

|f'(x) − f'(y)| ≤ γ|x − y|     (4)

for all x and y in S. Then the sequence converges q-quadratically to x*.

Proof. The linear convergence of the sequence {x_k} to x* is established in Theorem 1.1. Define

κ_k = μ max_{0≤ξ≤1} |f'(x* + ξ(x_k − x*)) − f'_k|,     (5)

and assume that x_0 ∈ S, with S defined as in Theorem 1.1. The continuity of f' and the convergence of {x_k} to x* imply that κ_k converges to zero. Since (3) can be written as

|x_{k+1} − x*| ≤ κ_k |x_k − x*|,

the definition of q-superlinear convergence to x* is satisfied by {x_k}. Let ξ̄ denote the value of ξ for which the maximum in (5) is achieved, and let y_k denote x* + ξ̄(x_k − x*). Then

|f'(y_k) − f'_k| ≤ |f'(y_k) − f'(x*)| + |f'(x*) − f'_k|.     (6)

Applying (4) in (6) and (5) gives

κ_k ≤ 2μγ |x_k − x*|.


Hence, {x_k} satisfies the definition that the rate of convergence to x* is at least quadratic.

It should be stressed that Newton's method can converge even if f'(x*) = 0. For example, suppose f(x) = x^3. Clearly x* = 0. The Newton iteration is:

x_{k+1} = x_k − x_k^3/(3x_k^2) = 2x_k/3,

which obviously converges to 0, but at a linear rate. We can see from Figure 2 that simply shifting the function to x^3 − 1 does not result in such a smooth set of iterates, although in that case when it converges it does so at a quadratic rate. What this example also illustrates is that the interval of fast convergence may well be small when f'(x*) is small. In the example above x* is a point of inflection. It could well be that x* is a local minimizer or a local maximizer of f(x), in which case there exists a neighboring function that does not have a zero in the neighborhood of x*. Typically when problems are solved numerically the best that can be achieved is to solve exactly a neighboring problem. Even if the iterates converge numerically to a point for which |f(x)| is small, there will remain some uncertainty as to whether a zero exists when f'(x) is also very small. In the case where the function changes sign at x* this may be checked. More importantly, knowing an interval for which the signs of the function at the end points differ provides a means of modifying Newton's method to enable it to find a zero. Specifically, when the Newton step lies outside the interval it needs to be replaced by a point in the interval. A more sophisticated approach is to replace the Newton step even when it lies in the interval. Since we are taking the Newton step from the better of the two end points we expect the next iterate to be close to that point.

    2.1 A hybrid algorithm

If an interval, say [a, b], is known for which sign(f(a)) = −sign(f(b)), then a zero may be found by the method of bisection. The function is evaluated at the midpoint of the interval and the end point with the same sign is then discarded to give a new interval half the previous size. The bisection point is optimal in the sense that it gives the best worst-case performance. This approach may be combined with Newton's method by discarding the Newton iterate whenever it lies outside of the interval in which it is known a root lies. Some care needs to be exercised to ensure the algorithm converges and to ensure that the convergence is not very slow. For example, we need to reject the Newton iterate even if it lies within the interval if it is very close to an end point. When performing bisection alone the optimal placement of the new iterate is the midpoint. When combining the method with Newton we are assuming that eventually Newton's method will make bisection steps unnecessary. If the bisection step is such that the best iterate remains the same then at the next iteration we will perform yet one more bisection step. Consequently, we may wish to choose instead a point that looks more likely to replace the best iterate by making it closer to that iterate than the bisection point.
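One way such a hybrid could be organized is sketched below (Python; a minimal illustration, not the tuned variant discussed above): keep a bracketing interval, try the Newton step from the end point with the smaller |f|, and fall back to bisection whenever the Newton step leaves the interval.

    def newton_bisection(f, fprime, a, b, tol=1e-12, max_iter=100):
        """Safeguarded Newton: keep a bracket [a, b] with sign(f(a)) != sign(f(b)).

        Take the Newton step from the better end point; if it falls outside the
        current bracket (or f' is tiny), use the bisection point instead."""
        fa, fb = f(a), f(b)
        assert fa * fb < 0, "initial interval must bracket a sign change"
        for _ in range(max_iter):
            # choose the end point with the smaller residual as the Newton base
            x, fx = (a, fa) if abs(fa) <= abs(fb) else (b, fb)
            fpx = fprime(x)
            x_new = x - fx / fpx if abs(fpx) > 1e-16 else None
            if x_new is None or not (min(a, b) < x_new < max(a, b)):
                x_new = 0.5 * (a + b)           # fall back to bisection
            f_new = f(x_new)
            if abs(f_new) <= tol or abs(b - a) <= tol:
                return x_new
            # discard the end point whose sign matches the new point
            if fa * f_new < 0:
                b, fb = x_new, f_new
            else:
                a, fa = x_new, f_new
        return x_new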


2.2 Modifying Newton's method to ensure global convergence

In one variable the hybrid algorithm described above is likely the best way to go. However, interval reduction does not generalize to n dimensions, so it is worthwhile to discuss other means of making the method more robust. A necessary feature we need to ensure is the existence of the iterates. One way of achieving that is to replace the Newton iteration whenever |f'_k| ≤ ε, where ε > 0. For example,

x_{k+1} = x_k − f_k/δ,     (7)

where δ = ε sign(f'_k) if |f'_k| ≤ ε, otherwise δ = f'_k. This modification alone does

not ensure convergence. What is needed is something that ensures the iterates are improving. One means of defining improvement is a reduction in f(x)^2. We could have chosen |f(x)|, but f(x)^2 has the advantage of being differentiable. The basic idea is to define the iterate as

x_{k+1} = x_k − α_k f_k/δ,     (8)

where α_k is chosen such that f_{k+1}^2 < f_k^2. We need something slightly stronger to prove convergence, but it is enough to choose α_k = (1/2)^j, where j is the smallest index ≥ 1 such that f(x_k − (1/2)^j f_k/δ)^2 < f_k^2. Determining α_k is a common procedure in n-dimensional problems, and more elaborate and more efficient methods are known than the simple backtracking just described. However, they also require more elaborate termination conditions, which we discuss later.
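A minimal sketch of iteration (8) in Python (the guard ε and the merit function f(x)^2 are as described above; starting the backtracking from α = 1 rather than 1/2 is an illustrative variant):

    import math

    def guarded_newton_1d(f, fprime, x0, eps=1e-8, tol=1e-10, max_iter=100):
        """Iteration (8): x_{k+1} = x_k - alpha_k f_k / delta, where delta guards
        against a tiny derivative and alpha_k is halved until f^2 decreases."""
        x = x0
        for _ in range(max_iter):
            fx = f(x)
            if abs(fx) <= tol:
                return x
            fpx = fprime(x)
            delta = fpx if abs(fpx) > eps else math.copysign(eps, fpx)   # as in (7)
            step = fx / delta
            alpha = 1.0
            # backtrack: halve alpha until the squared residual decreases
            while f(x - alpha * step) ** 2 >= fx ** 2 and alpha > 1e-16:
                alpha *= 0.5
            x = x - alpha * step
        return x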

    2.3 Minimizing a function of one variable

Consider now the problem of

min_x f(x).

As noted earlier we could apply Newton's method to g(x), where g(x) = f'(x). The Newton iteration is then given by

x_{k+1} = x_k − g_k/g'_k.     (9)

Clearly, we can only expect this iteration to converge to x*, where g(x*) = 0. However, if g'(x*) < 0 we would know that x* is a maximizer and not a minimizer. Another interpretation is that we are approximating f(x) by a quadratic function, for example

f(x) ≈ (1/2) a x^2 + b x + c.

The parameters a, b and c may be determined by the three pieces of information f_k, g_k and g'_k. The next iterate is defined to be the minimizer of this quadratic function. However, when a = g'_k < 0, the quadratic function does not have a minimizer. Even when a = 0, unless b = 0, this function does not have a minimizer. An essential first step then is to define the Newton iteration for minimizing a function as a modification to (9), namely

x_{k+1} = x_k − g_k/|g'_k|.     (10)


In essence we reverse the step when g'_k is negative and go in the direction away from the maximizer. To amend the algorithm when |g'_k| = 0 or is very small we can make the same changes as in (7), except δ = ε if |g'_k| ≤ ε, otherwise δ = |g'_k|. Also, α_k is chosen to reduce f(x) rather than g(x)^T g(x). However, there is one more case to consider and that is if g_k = 0. In this case, if g'_k ≥ 0, we may not be able to make progress. If g'_k < 0 then we are at a maximum and a tiny step in either direction from x_k will produce a new iterate x_{k+1} such that f_{k+1} < f_k.

As in the case of finding a zero we could also combine Newton with an interval reduction algorithm such as bisection. We know a minimizer exists in an interval [a, b] if g(a) < 0 and g(b) > 0. Consequently, we can always maintain an interval containing a minimizer by discarding the appropriate end point. There is a minor complication if g(x) = 0 at the new point. Also, if there is more than one minimizer in the interval there may be a choice.

    3 Functions with n variables

The extension to n variables is straightforward. The Newton iteration for f(x) = 0 becomes

x_{k+1} = x_k − J_k^{-1} f_k,     (11)

where J_k ≡ J(x_k), and J(x) is the Jacobian matrix of f(x). Likewise, the iteration for min_x f(x) is

x_{k+1} = x_k − H_k^{-1} g_k,     (12)

where H_k ≡ H(x_k), and H(x) is the Hessian matrix of f(x). Obviously the iterations are only defined if the relevant inverses exist. The conditions for convergence and for the rate of convergence to be at least quadratic in the n-dimensional case are similar to those for the one-dimensional case. Rather than an interval we now require J(x) (and H(x)) to be nonsingular in a ball about x*. Despite being in n dimensions, the Taylor expansions used in the proofs are still in one dimension, since we are relating x_{k+1} to x_k along the vector connecting them.

Rather than define the iterations as in (11) and (12) it is more typical to define them as

x_{k+1} = x_k + p_k,

where p_k is thought of as the Newton step. When modified as

x_{k+1} = x_k + α_k p_k,

the Newton step p_k is viewed as a search direction and the scalar α_k > 0 is the steplength. We are considering the Newton step as a trial to be rejected if it proves worse (in some sense) than the current approximation.

Rather than using inverses we define p_k as the solution of a system of linear equations, which for solving f(x) = 0 becomes

J_k p_k = −f_k,


and for minimizing a function

H_k p_k = −g_k.

In the case of minimizing, the iteration only makes sense when H_k is positive definite. Unlike spotting whether g'_k is positive, determining whether a matrix is positive definite is nontrivial. Since H_k is symmetric, if it were known to be positive definite we would usually determine p_k using Cholesky's algorithm. The algorithm can fail when H_k is indefinite, and it is that failure that is the simplest way of determining whether or not H_k is positive definite.
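As an illustration (Python with NumPy assumed; not from the paper), the Newton step can be computed from the Cholesky factorization, and the factorization's failure signals an indefinite Hessian:

    import numpy as np

    def newton_step(H, g):
        """Solve H p = -g via Cholesky; report failure if H is not positive definite."""
        try:
            L = np.linalg.cholesky(H)        # H = L L^T; raises LinAlgError if H is indefinite
        except np.linalg.LinAlgError:
            return None                      # caller must modify H or use another direction
        y = np.linalg.solve(L, -g)           # forward substitution
        return np.linalg.solve(L.T, y)       # back substitution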

    4 Linesearch methods

Cast in the manner above, Newton's method may be viewed as belonging to the class of linesearch methods for minimizing a function (see [7]). From now on we shall treat the problem f(x) = 0 as min_x (1/2) f(x)^T f(x). A key requirement for linesearch methods is that p_k is a direction of sufficient descent, which is defined as

−g_k^T p_k / (‖g_k‖ ‖p_k‖) ≥ σ,     (13)

where σ > 0, and ‖a‖^2 ≡ a^T a. This condition bounds the elements of the sequence {p_k} away from being arbitrarily close to orthogonality to the gradient. Typically methods are such that p_k is defined in a manner that satisfies (13) even though an explicit value of σ is not known.

The other key requirement is that the steplength α_k is chosen to be neither too large nor too small. This may be achieved by a suitable termination condition in the search for α_k. The termination criterion used in the optimization routine SNOPT (see [6]) is:

|g(x_k + α p_k)^T p_k| ≤ −η g_k^T p_k,     (14)

where 0 ≤ η < 1. Typically η is chosen in the range [.1, .9]. This termination criterion ensures the step α_k is not too small. Usually it will also ensure it is not too large. However, to be absolutely sure we need an additional condition, and

f(x_k + α p_k) ≤ f_k + μ α g(x_k + α p_k)^T p_k,     (15)

where 0 < μ < 1/2, will suffice. Typically μ is chosen quite small and as a consequence this condition is almost always satisfied once (14) is satisfied. To find a step that satisfies (14) one can search for a minimizer along p_k and terminate at the first point (14) is satisfied. Since we are generating iterates that decrease f(x), it follows that (15) is satisfied with μ replaced by some μ_k > 0.
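As a crude illustration of these tests (Python/NumPy; not SNOPT's actual safeguarded search, which uses polynomial interpolation), the sketch below accepts the first step satisfying conditions of the form (14) and (15), doubling the step when it looks too short and halving it otherwise:

    import numpy as np

    def crude_linesearch(f, grad, x, p, eta=0.9, mu=1e-4, max_trials=30):
        """Return a steplength alpha intended to satisfy (14) and (15)."""
        g0p = grad(x) @ p
        assert g0p < 0, "p must be a descent direction"
        f0 = f(x)
        alpha = 1.0                                    # always try the unit (Newton) step first
        for _ in range(max_trials):
            xt = x + alpha * p
            gtp = grad(xt) @ p
            flat_enough = abs(gtp) <= -eta * g0p       # condition (14)
            decreased = f(xt) <= f0 + mu * alpha * gtp # condition in the spirit of (15)
            if flat_enough and decreased:
                return alpha
            # a strongly negative slope at the trial point means the step is too
            # short; otherwise we have gone too far and backtrack
            alpha = 2.0 * alpha if gtp < eta * g0p else 0.5 * alpha
        return alpha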

It can be shown that under modest assumptions linesearch methods as described generate a sequence such that lim_{k→∞} g_k = 0. For nonlinear equations, since g_k = J_k^T f_k, this implies f_k → 0 provided J* is nonsingular. In the minimization case we are only sure we are at a minimizer if lim H_k = H* and H* is positive definite. When H* is positive semidefinite with at least one zero eigenvalue and when higher derivatives of f(x) are unknown, there is usually nothing further that can be done


to confirm the iterates have converged to a minimizer. Indeed it is not known just how many orders of higher derivatives will be needed to resolve the issue.

A more general issue is what to do in any iteration when either J_k is singular or H_k is indefinite, since under such circumstances the Newton step is not necessarily a sufficient descent direction. If p_k is defined to be the solution of a system

B_k p_k = −g_k,

where {B_k} is a sequence of bounded positive definite symmetric matrices whose condition number is also bounded (smallest eigenvalue bounded away from 0), then {p_k} is a sequence of sufficient descent directions. It is easy to see why. Let A be a symmetric matrix with eigenvalues

0 < λ_1 ≤ λ_2 ≤ ··· ≤ λ_n.

Let p and g denote vectors in IR^n such that Ap = −g. From the definition of the smallest eigenvalue λ_1 we have

u^T A u ≥ λ_1 ‖u‖^2.

From the relationship Ap = −g and the above inequality we have

−g^T p/(‖p‖ ‖g‖) = p^T A p/(‖p‖ ‖g‖) ≥ λ_1 ‖p‖/‖g‖.

Moreover, from norm inequalities we may assert that

‖g‖ = ‖Ap‖ ≤ ‖A‖ ‖p‖ = λ_n ‖p‖.

Using this bound gives

−g^T p/(‖p‖ ‖g‖) ≥ λ_1/λ_n.

Applying this result to each matrix in the sequence {B_k} gives

cos θ_k = −g_k^T p_k/(‖g_k‖ ‖p_k‖) ≥ λ_1^{(k)}/λ_n^{(k)} ≥ 1/M.

Consequently {p_k} is a sequence of sufficient descent directions as required. The main idea in modifying the Newton system defining p_k is to ensure that it is the solution of a system whose matrix has the same properties as B_k.

Consider first the nonlinear equation case. When J_k is nonsingular the search direction p_k is the solution of

J_k^T J_k p_k = −J_k^T f_k.

Obviously if J_k is singular then so is J_k^T J_k. However, we may modify the symmetric system above by adding a multiple of the identity, giving

(J_k^T J_k + λ_k I) p_k = −J_k^T f_k.

The above system will be nonsingular provided λ_k > 0. Hence p_k is always well defined and, provided λ_k is chosen appropriately, is a sufficient descent direction for


(1/2) f(x)^T f(x). There are details that need explaining, such as how best to solve this system and how to determine a good value for λ_k, but the issue of convergence is resolved. When J(x*) is nonsingular there exists an index, say K, such that for k ≥ K we may set λ_k = 0. By doing so the terminal search directions are pure Newton steps.

We have left one detail out and that is the assumption that α_k = 1 for k ≥ K. Since we do not in general know K, when we compute the pure Newton direction in the neighborhood of the solution we need the linesearch procedure to automatically take the unit step. It is easy to show this will happen provided the initial step taken is 1. Setting α = 1 in the LHS of (14) and noting there exists a constant M < ∞ such that ‖g(x_k + p_k)‖ ≤ M ‖g_k‖^2 when J(x*) is nonsingular, we get

|g(x_k + p_k)^T p_k| ≤ ‖g(x_k + p_k)‖ ‖p_k‖ ≤ M ‖g_k‖^2 ‖p_k‖.     (16)

Since p_k is a sufficient descent direction, the RHS of (14) satisfies

−g_k^T p_k ≥ σ ‖g_k‖ ‖p_k‖.     (17)

Since lim g_k → 0 it follows there exists K for which the unit step satisfies (14) for k ≥ K. Note that for the problem f(x) = 0 we have g_k = J_k^T f_k.

By a similar argument we can show that (15) is also satisfied, which gives us the desired result.

4.1 Obtaining a direction of sufficient descent from the Newton equations

It may be thought that a simple way would be to solve the Newton equations and then reverse the direction if g^T p > 0. While this works in the one-variable case it is flawed in the n-dimensional case. It is not hard to see why, although doubtless this approach is still used. Suppose we have two variables and f(x_1, x_2) is a separable function, that is, f(x) = f_1(x_1) + f_2(x_2). It follows that the Hessian is a diagonal matrix. Suppose H_{1,1} > 0 and H_{2,2} < 0 and the Newton step is p. In the x_1 variable the step p_1 is correctly stepping towards the minimizer, while the step p_2 is stepping away from a minimizer. It could happen that |p_1 g_1| < |p_2 g_2|, in which case p is not a descent direction. What we would do if we knew this was a separable function is reverse the sign of p_2 only. More generally, what we wish to do is to preserve the Newton step in the subspace for which H is positive definite and reverse it in the subspace for which it is not. Quite how to do that we now explain. Note that reversing the sign of p does not give an assured descent direction, since p^T g could be zero even when g is not zero.

Consider the spectral decomposition H = VΛV^T, where V is the matrix of eigenvectors, Λ = Diag(λ), and λ is the vector of eigenvalues. Define

H̄ ≡ V Λ̄ V^T,

where λ̄_i = |λ_i| if |λ_i| ≥ δ, otherwise λ̄_i = δ > 0, and Λ̄ = Diag(λ̄). A means of achieving our objective is to define p as

p ≡ −H̄^{-1} g.


Provided the minimum eigenvalue of H is bigger than δ, Newton steps will be taken in the neighborhood of the solution. It may be thought that δ could be allowed to converge to zero as ‖g_k‖ → 0, which would assure that Newton steps will always be taken in the neighborhood of the solution provided H* is positive definite. For example, we could define δ ≡ min{‖g_k‖, ε}, where ε > 0. However, in practice p is computed numerically and the error in the computed solution will overwhelm the Newton step when H̄ is ill conditioned. Indeed the error analysis establishing that Cholesky is a stable algorithm assumes the matrix is sufficiently positive definite.
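A sketch of the eigenvalue modification (Python/NumPy assumed; δ is the small threshold introduced above):

    import numpy as np

    def modified_newton_direction(H, g, delta=1e-8):
        """Replace the eigenvalues of H by max(|lambda_i|, delta) and solve for p.

        This reverses the Newton step on the subspace of negative curvature and
        bounds the condition number of the modified Hessian."""
        lam, V = np.linalg.eigh(H)                 # H = V diag(lam) V^T, H symmetric
        lam_bar = np.maximum(np.abs(lam), delta)
        p = -V @ ((V.T @ g) / lam_bar)             # p = -V diag(lam_bar)^{-1} V^T g
        return p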

The drawback of the scheme above is that when n is large it is expensive, both in computational effort and memory, to compute the spectral decomposition. When H is sufficiently positive definite it is also unnecessary. Since Cholesky is so cheap it is always worth trying first. An alternative is to compute a direction based on modifying the Cholesky factorization. Cholesky fails on indefinite matrices due to the need to compute a square root of a negative number when computing the diagonal elements of the factor. An obvious change is to define the diagonal elements in the same vein as the modification made to the eigenvalues. If no zero pivots occur, the factorization LDL^T exists, where L is a unit lower triangular matrix and D is diagonal. We can now define D̄ in relation to D in the identical manner to the relationship of Λ̄ to Λ. It can be shown that H̄ ≡ L D̄ L^T = H + E, where E is also diagonal. In practice the occurrence of a zero diagonal is of no consequence since, should one arise, it is replaced by δ. Unfortunately, it can be shown that the bound on ‖E‖ is O(1/δ) (see [7]).
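The following is a minimal sketch of this pivot modification (Python/NumPy, written for clarity rather than efficiency): each pivot d_j is computed as in the ordinary LDL^T factorization and then replaced by max(|d_j|, δ) before it is used, so that L D̄ L^T = H + E with E diagonal.

    import numpy as np

    def modified_ldl(H, delta=1e-8):
        """Return L (unit lower triangular), d_bar, and the diagonal modification e
        such that L @ np.diag(d_bar) @ L.T equals H + np.diag(e)."""
        n = H.shape[0]
        L = np.eye(n)
        d_bar = np.zeros(n)
        e = np.zeros(n)
        for j in range(n):
            d_j = H[j, j] - L[j, :j] ** 2 @ d_bar[:j]      # unmodified pivot
            d_bar[j] = max(abs(d_j), delta)                # modified pivot
            e[j] = d_bar[j] - d_j
            for i in range(j + 1, n):
                L[i, j] = (H[i, j] - L[i, :j] * L[j, :j] @ d_bar[:j]) / d_bar[j]
        return L, d_bar, e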

In order to obtain a Cholesky factor that is not close to being unbounded it is necessary to perform the column form of the algorithm. The basic idea is not simply to alter pivots that are tiny or negative, but to alter any pivot for which it is necessary to ensure that the magnitude of the elements of the Cholesky factor is suitably bounded. If on first computing an element it exceeds a prespecified bound, say β, then it is recomputed using an increase to the pivot that makes it match the bound. Care needs to be exercised in defining β to ensure that matrices that are sufficiently positive definite are not modified. Let R̄ denote the Cholesky factor generated by this algorithm; then β is chosen such that |R̄_{i,j}| ≤ β, where

β^2 = max{γ, ξ/n, δ},

γ and ξ are the largest in modulus of the diagonal and off-diagonal elements of H respectively, and as before δ is some small positive scalar. It can be shown (see [7]) that H̄ ≡ R̄^T R̄ = H + E, where E is a diagonal matrix, ‖E‖ is bounded and H̄ has a bounded condition number. The algorithm requires almost the same computational effort as the regular Cholesky algorithm. Let R̄_k denote the modified Cholesky factor of H_k. It follows that a suitable sequence of sufficient descent directions, {p_k}, is obtained by solving:

R̄_k^T R̄_k p_k = −g_k.

When H* is sufficiently positive definite then in the neighborhood of the solution H̄_k = H_k and Newton steps are taken.

Generally, using a modified direction as defined avoids saddle points. We are deliberately defining the direction not to be a step to such points. However, we


could start at one or end up at one by chance. There are also problems (see [4]) that have a form of symmetry that results in the gradient being in the space of the eigenvectors corresponding to the positive eigenvalues of H. It is a consequence of variables whose values, when they are switched, give the same objective. For such problems the only way of breaking the tie is to use a direction of negative curvature.

    4.2 Directions of negative curvature

If g = 0 then there is no direction of descent, and if H is indefinite then x is a saddle point. The key to moving off a saddle point (or avoiding converging to one) is to use directions of negative curvature. A vector d is termed a direction of negative curvature if d^T H d < 0. Obviously the Hessian needs to be indefinite for such a direction to exist. It may be thought that it is relatively easy to find such a direction, and in some cases it is. If H has half its eigenvalues negative and equal in value to the positive set, then a random direction has a 50% chance of being a direction of negative curvature (DNC). However, it can be shown (see [3]) that the probability is extremely small when the number of positive eigenvalues dominates (approximately 10^{-14} when there are 99 eigenvalues of 1 and one of -1). Typically in optimization problems there are only a few negative eigenvalues (the iterates are converging to a point where the Hessian is positive semidefinite) and even these are small. A single large negative eigenvalue would signify that there is a DNC along which the objective will decrease rapidly, and we are rarely that lucky.

Unlike directions of descent there is a best DNC, namely, the eigenvector corresponding to the smallest eigenvalue. If the smallest eigenvalue is not unique then any vector spanned by the set of corresponding eigenvectors is equally good. It is best in the sense that this direction has the smallest value of d^T H d/d^T d and it is independent of a linear transformation of variables. A DNC predicts a potentially infinite reduction in the objective. Of course in practice the step will be finite, but there is no good initial estimate of what the finite step should be. This contrasts with a direction of descent, which is typically derived from a positive definite quadratic model for which the unit step is predicted to be optimal.

In order to show the iterates will converge to a point for which the Hessian is at least positive semidefinite it is necessary for the DNC to be a direction of sufficient negative curvature. A sufficient condition for {d_k} to be a sequence of directions of sufficient negative curvature is that

d_k^T H_k d_k ≤ θ λ_min^{(k)} d_k^T d_k,

where λ_min^{(k)} is the smallest eigenvalue of H_k, and θ > 0. Obviously θ ≤ 1. What the requirement does is to prevent d_k from becoming arbitrarily close to being a direction of zero curvature.

If d_k is a DNC then so is −d_k. The sign is chosen so that d_k^T g_k ≤ 0. If H(x*) is positive definite then there exists an index K such that for k ≥ K we have d_k = 0.

A DNC may be computed from the modified Cholesky factorization. It can be shown that an index j exists (it can be identified during the factorization) such that d, where

R̄ d = e_j,     (18)


is a DNC. However, it is not a direction of sufficient negative curvature. Fortunately, this is of little consequence. Rather than applying the modified Cholesky algorithm to H it is better to apply it to H + δI, where δ > 0 is suitably small. This modification serves a number of purposes. If this matrix is indefinite then (18) does give a sufficient DNC. When H is singular and positive semidefinite it avoids tripping the modifications, and avoids getting a DNC that is not needed in the neighborhood of x*. Given a DNC, it can be improved by using it as a starting point to minimize d^T H d/d^T d, for example by just using a single pass through all the variables of a univariate search (see [3]).

An issue is how best to use a DNC. An obvious choice is to define the search direction as

p̄ = p + d,

where p is a direction of sufficient descent. It is prudent to have scaled d. There are various schemes that can be used, but the general principle is that if d^T H d/d^T d is small in magnitude then ‖d‖ should be small.
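A sketch of one such combination (Python/NumPy; here the DNC is taken from the eigenvector of the smallest eigenvalue purely for illustration, rather than from the modified Cholesky factorization, and the scaling rule is an arbitrary illustrative choice):

    import numpy as np

    def descent_plus_negative_curvature(H, g, delta=1e-8):
        """Combine a sufficient descent direction with a direction of negative
        curvature, p_bar = p + d, signing d so that d^T g <= 0."""
        lam, V = np.linalg.eigh(H)
        lam_bar = np.maximum(np.abs(lam), delta)
        p = -V @ ((V.T @ g) / lam_bar)            # modified Newton (descent) direction
        d = np.zeros_like(g)
        if lam[0] < -delta:                       # negative curvature present
            d = V[:, 0]                           # eigenvector of the smallest eigenvalue
            if d @ g > 0:
                d = -d                            # choose the downhill sign
            d = d * min(1.0, abs(lam[0]))         # scale d by the curvature magnitude
        return p + d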

It can be shown under suitable assumptions that this algorithm converges to a point satisfying the second-order necessary conditions for optimality.

    5 Trust-region methods

An alternative to modifying the Hessian is to minimize the quadratic approximation subject to a limit on the size of the step. This is similar to using the interval to limit the step in problems with one variable. However, unlike the one-variable case the solution does not usually lie within the limit. If at this new point the objective is not an improvement then the limit is reduced. We wish to solve:

min_s { ψ(s) ≡ (1/2) s^T H s + g^T s  :  s^T s ≤ Δ },     (19)

where Δ > 0. Clearly a minimizer exists. It can be shown (see [10]) that there exists a scalar λ ≥ 0 such that the global minimizer, s*, satisfies

(H + λI) s* = −g,     (20)

where H + λI is positive semidefinite and λ(s*^T s* − Δ) = 0. It is rare for H + λI to be singular. It is immediately apparent that the definition of s* is similar to that of p̄, since the matrix is symmetric, at least positive semidefinite and is a diagonal modification of H. If H is positive definite and the solution of Hs = −g is such that s^T s ≤ Δ, then s is the Newton step. When f(x + s) ≥ f(x) the step is rejected and the system (20) re-solved with a smaller value of Δ, typically Δ/2. It can be shown only a finite number of reductions need be taken. If a satisfactory step is obtained without the need to reduce Δ and ψ(s) is a reasonable prediction of f(x) − f(x + s), then at the next iteration Δ is increased, typically to 2Δ. It can be shown under typical assumptions that the sequence generated by the trust-region algorithm converges to a point satisfying second-order necessary conditions for optimality. It can be shown also that when H* is positive definite then Δ does


not converge to zero. Consequently, in the neighborhood of the solution Newton steps are taken.

In order to solve (20) it is necessary to determine λ. To do so requires solving (20) with estimates of λ. Also, if Δ is unsatisfactory the process has to be repeated, which requires more solves of (20). It is not necessary to determine λ accurately, and often a rough estimate is satisfactory. Nonetheless it is clear a trust-region algorithm requires more solves of similar linear systems than a linesearch method.

Note that when g = 0 and H is indefinite the solution of (20) is the eigenvector corresponding to the minimum eigenvalue (assuming it is unique).
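For illustration (Python/NumPy; a straightforward bisection on λ via the eigendecomposition, ignoring the so-called hard case and all efficiency concerns), λ in (20) might be determined as follows:

    import numpy as np

    def trust_region_step(H, g, Delta):
        """Approximately solve min 0.5 s^T H s + g^T s subject to s^T s <= Delta
        by finding lambda >= 0 with (H + lambda I) s = -g and s^T s close to Delta."""
        lam, V = np.linalg.eigh(H)
        gt = V.T @ g

        def step_sq(shift):                   # s(shift)^T s(shift) for H + shift*I
            return float(np.sum((gt / (lam + shift)) ** 2))

        if lam[0] > 0 and step_sq(0.0) <= Delta:
            return V @ (-gt / lam)            # unconstrained Newton step is acceptable

        lo = max(0.0, -lam[0]) + 1e-12        # H + lo*I is (barely) positive definite
        hi = lo + 1.0
        while step_sq(hi) > Delta:            # enlarge until the step is short enough
            hi *= 2.0
        for _ in range(100):                  # bisection on the multiplier
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if step_sq(mid) > Delta else (lo, mid)
        return V @ (-gt / (lam + 0.5 * (lo + hi)))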

    6 Gradient-flow methods

Unlike the other two classes of methods, these are not in widespread use. The methods were originally proposed by Courant to solve PDEs. Their simplest form is to compute the arc x(t), where x(t) satisfies

ẋ(t) = −g(x(t))   and   x(0) = x_0.     (21)

The minimization problem has been replaced by the problem of solving a system of nonlinear differential equations. This approach is grossly inefficient. The method is made reasonable when g(x(t)) is approximated at x_k by g_k + H_k(x(t) − x_k). The idea now is to compute x_{k+1} by searching along the arc p(t), where p(t) is the solution of a linear system of differential equations:

ṗ(t) = −g_k − H_k p(t)   and   p(0) = 0.     (22)

The connection to Newton's method is now apparent. When H_k is positive definite the arc is finite and the end point is the Newton step from x_k. If H(x*) is positive definite then when the iterates are in a small neighborhood of the solution the Newton step is taken. An interesting property of this approach is that the arcs and iterates generated are invariant under an affine transformation of the variables, a property that is typically not true for either linesearch or trust-region methods when H_k is not positive definite for all k.
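For illustration, when H_k is nonsingular the arc defined by (22) has the closed form p(t) = (I − e^{−H_k t}) s_N, where s_N = −H_k^{-1} g_k is the Newton step; the sketch below (Python/NumPy, symmetric H assumed) evaluates the arc via the eigendecomposition:

    import numpy as np

    def gradient_flow_arc(H, g, t):
        """Point p(t) on the arc solving p'(t) = -g - H p(t), p(0) = 0.

        As t -> infinity with H positive definite, p(t) tends to the Newton step."""
        lam, V = np.linalg.eigh(H)
        gt = V.T @ g
        lam_safe = np.where(np.abs(lam) > 1e-12, lam, 1.0)
        # coefficient (1 - exp(-lam*t))/lam, with the lam -> 0 limit equal to t
        coef = np.where(np.abs(lam) > 1e-12, (1.0 - np.exp(-lam * t)) / lam_safe, t)
        return V @ (-coef * gt)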

    7 Estimating derivatives

For minimization problems Newton-type methods assume the availability of second derivatives. When these derivatives are not available there are alternatives such as quasi-Newton methods. Usually these alternatives are to be preferred, although there are problems where the Newton-type methods described, except with the Hessian approximated by finite differences of the gradients, are better. A key difference with quasi-Newton methods is that the Hessian approximation in those methods is usually positive definite, hence directions of negative curvature are not available. All the methods discussed may be applied to the case in which H(x) is approximated by finite differences with little difficulty except for the need to compute the differences.


The performance of such algorithms is almost indistinguishable, in most cases, from the case when an exact Hessian is used. Of course the exact derivatives do themselves contain errors since they are computed in finite-precision arithmetic. There are not the same numerical issues in approximating second derivatives as there are in approximating first derivatives. It can be shown that when finite differences are used and the finite-difference interval does not converge to zero, which for numerical purposes it cannot, the quadratic rate of convergence is lost. However, this result is misleading. The iterates, except in pathological cases, mimic those of the exact case. Indeed the quadratic convergence of the exact case also breaks down, but typically at a point where the algorithm terminates.

When n is large the number of differences could grow with n. When H(x) (or J(x) in the nonlinear equation case) is a sparse matrix there are clever algorithms that can choose the direction of the differences in a manner that significantly reduces the number of differences required (see [10]). For example, if H(x) is tridiagonal only two differences are required.
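A dense finite-difference Hessian approximation from gradients might look as follows (Python/NumPy; one gradient evaluation per coordinate direction, with symmetrization; the sparse schemes cited above reduce the number of differences dramatically):

    import numpy as np

    def fd_hessian(grad, x, h=1e-6):
        """Approximate the Hessian by forward differences of the gradient."""
        n = x.size
        g0 = grad(x)
        H = np.zeros((n, n))
        for i in range(n):
            e = np.zeros(n)
            e[i] = h
            H[:, i] = (grad(x + e) - g0) / h      # i-th column ~ d(grad)/dx_i
        return 0.5 * (H + H.T)                    # symmetrize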

    8 Constrained problems

The general nonlinearly constrained problem may be stated as:

min_{x ∈ IR^n} f(x)   such that   c(x) ≥ 0,     (23)

where c(x) ∈ IR^m is a vector of smooth functions. It is usual to distinguish linear from nonlinear constraints and, further, to distinguish simple bounds on the variables from linear constraints. Generally equality constraints are easier to handle than inequalities and, for simplicity, are often omitted. For problems with only linear constraints the sequence of iterates generated is usually feasible, and that has great benefits. Constraints complicate the adaptation of Newton's method, as does size, and not surprisingly large constrained problems are the most difficult of all. It is beyond the scope of this paper to consider all the issues in detail, but we shall cover some of the salient ones.

Consider first a problem whose only constraints are simple equalities. By that we mean some subset of the variables are fixed at some value. Clearly this creates no issues, since the problem is unconstrained in the remaining variables. If there are simple inequalities then at each iteration an active-set method fixes some subset of the variables. Consequently, Newton-type methods using an active-set strategy are relatively easy to adapt to such problems. However, consider a trust-region method whose only constraints are x ≥ 0. Adding the constraints x_k + s_k ≥ 0 to the subproblem (19) makes the subproblem intractable. Typically the quadratic constraint is replaced by bounds on the step, which then gives a pure bound-constrained subproblem. However, finding the global minimizer of even this subproblem is intractable. It is still possible to find a step that reduces the objective. What is lost is that the modified algorithm is no longer assured of generating a sequence that converges to a point that satisfies the second-order necessary conditions.


As noted, for an active-set linesearch method there are no complications for problems with simple bounds. There are also no complications with general linear constraints provided the problem is not too large. Issues do arise when the problem is large. These are not insurmountable but require some ingenuity. When we have linear constraints that are active we need to find a direction of negative curvature that lies on the constraints, or to put it another way, a DNC that is orthogonal to the constraint normals. We also need to modify the reduced Hessian rather than the Hessian. Suppose at the kth iteration we wish to remain on a set of constraints A_k x = b_k. The subproblem we need to solve to obtain the Newton step is

min_p { (1/2) p^T H_k p + g_k^T p  :  A_k p = 0 }.

Let Z_k be an n × (n − m_k) full-rank matrix such that A_k Z_k = 0, where m_k is the number of rows of A_k. We are assuming A_k is of full rank. The solution to this problem is given by p = Z_k p_z, where

Z_k^T H_k Z_k p_z = −Z_k^T g_k.     (24)
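A small sketch of this reduced-Hessian computation (Python/NumPy; Z_k is obtained here from an SVD purely for illustration, and the reduced Hessian is assumed positive definite, otherwise it would be modified as described next):

    import numpy as np

    def null_space_newton_step(H, g, A):
        """Newton step on the subspace A p = 0: p = Z p_z with Z^T H Z p_z = -Z^T g."""
        # columns of Z form an orthonormal basis for the null space of A
        _, s, Vt = np.linalg.svd(A)
        rank = int(np.sum(s > 1e-12 * s[0]))
        Z = Vt[rank:].T
        Hr = Z.T @ H @ Z                        # reduced Hessian
        p_z = np.linalg.solve(Hr, -Z.T @ g)     # assumes Hr positive definite
        return Z @ p_z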

However, if Z_k^T H_k Z_k is not positive definite it needs to be modified in a similar manner to how H_k is modified in the unconstrained case. There is little difficulty in doing this when n is of modest size. There are not a lot of issues if n − m_k is of modest size even if n is large, although forming the product may be expensive. Fortunately, H_k is typically sparse and an implicit form of Z_k may be found that is also sparse. In some cases advantage can be taken of structure both in A_k and H_k. However, there are cases where both n and n − m_k are large and Z_k^T H_k Z_k is not sparse. Under such circumstances there are two approaches. One is to consider,

instead of using the reduced system (24), solving the equivalent KKT system

( H_k   A_k^T ) ( p_k )     ( −g_k )
( A_k    0    ) ( π_k )  =  (   0  ),     (25)

where π_k is the Lagrange multiplier estimate. This matrix is indefinite regardless of the character of H_k. However, it may be deduced from the inertia of this matrix whether or not the reduced Hessian is positive definite. Factorizing this matrix in the form P L B L^T P^T, where P is a permutation matrix, L is a unit lower triangular matrix, and B is a block-diagonal matrix with 1 × 1 and 2 × 2 blocks, reveals the inertia. Unfortunately, if the reduced Hessian is not positive definite this factorization does not reveal how to obtain a suitable sufficient descent direction. How to obtain suitable descent directions by modifying this factorization is described in [5]. They also suggest how to modify the original matrix prior to using an unmodified factorization.
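For illustration (Python/NumPy; a dense sketch, whereas the discussion above concerns factorizations that scale), the KKT matrix can be assembled and its inertia compared with the expected signature (n, m, 0):

    import numpy as np

    def kkt_step(H, g, A):
        """Solve the KKT system (25) and report whether the reduced Hessian is
        positive definite by comparing the inertia of K with (n, m, 0)."""
        n, m = H.shape[0], A.shape[0]
        K = np.block([[H, A.T], [A, np.zeros((m, m))]])
        eig = np.linalg.eigvalsh(K)            # a sparse code would use an LBL^T factorization
        n_pos, n_neg = int(np.sum(eig > 0)), int(np.sum(eig < 0))
        reduced_hessian_pd = (n_pos == n and n_neg == m)
        sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(m)]))
        return sol[:n], sol[n:], reduced_hessian_pd    # p, multipliers, PD flag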

An alternative to factorizations is to attempt to solve (24) using the conjugate gradient (CG) method. Since the CG algorithm requires only matrix-vector products, the reduced Hessian need not be formed explicitly. The CG method may fail on indefinite matrices, but it is precisely because it may fail that makes it useful. It can also be shown that when it fails a direction of sufficient descent may be constructed from the partial solution. While it may not give a direction of sufficient negative curvature in general, it will usually give a direction of at least zero curvature, which, if


need be, may be improved by using it as an initial vector in minimizing d^T Z^T H Z d/d^T d, either by a few iterations of the Lanczos algorithm or a few iterations of univariate search.

When there are nonlinear inequality constraints, adapting Newton's method is considerably more problematic, except if the nonlinearly constrained problem is transformed into solving a sequence of linearly constrained problems. Assuming that is not the case, most methods generate infeasible iterates. As a consequence the iterate is typically composed of two components: one is a Newton step to get feasible and the other to reduce (it may be increasing) the objective. If there are only equality constraints it is somewhat simpler, so we shall describe that first. We are now concerned not with the Hessian of the objective, but with the Hessian of the Lagrangian, where the Lagrangian L(x, λ) is defined as L(x, λ) ≡ f(x) − λ^T c(x). Assuming there are no issues with the reduced Hessian of the Lagrangian, the system

of equations we need to solve is:

g(x) − J(x)^T λ = 0   and   c(x) = 0,     (26)

which is a system of n + m equations in n + m unknowns. In other words, we are now applying Newton's method in both the primal and dual variables. The Newton step in these variables is given by:

( H_{L,k}   −J_k^T ) (  p_k  )     ( −g_k + J_k^T λ_k )
( J_k          0   ) ( δλ_k  )  =  ( −c_k             ),     (27)

where H_{L,k} is the Hessian of the Lagrangian at (x_k, λ_k), and δλ_k is the Newton step in the dual variables. The issues in addressing the case when the reduced Hessian is not positive definite in the space of the x-variables are similar to the linearly constrained case. We have no concern about the character of the Hessian in the space of the dual variables. However, since the search is in the space of both variables, there needs to be a search direction defined for those variables. How to do that is described in [12]. Note that we are leaving out the fact that we are minimizing a merit function, so care needs to be taken to ensure directions of negative curvature for the Hessian of the Lagrangian are also directions of negative curvature for the merit function. Again, how this may be done is described in [12].

The issue with an active-set method for nonlinear constraints is that the most efficient approach is a sequential quadratic programming (SQP) method. Such methods construct a quadratic objective whose Hessian is an approximation to the Hessian of the Lagrangian. When second derivatives of both the original objective and the nonlinear constraint functions are known, only the Lagrange multipliers need be estimated. If the QP is not convex the solution of the QP is not necessarily a direction of sufficient descent. Indeed the QP may not have a bounded solution, and even if it does, and the global minimizer is found, that may not yield a direction of descent. In solving the QP the iterates move on to different sets of active constraints. It is possible for the reduced Hessian to be positive definite on every subspace and yet the solution is not a descent direction. Simply observing the character of the reduced Hessians may not reveal nonconvexity. Unlike the equality-constrained case,


we do not know initially the null space of the active set of constraints, since we do not know which constraints are active at the solution of the QP. Moreover, altering the Hessian on that subspace will alter the set of constraints active at the solution. This conundrum may be resolved by monitoring the total step being taken in the QP from the initial feasible point. Assuming the initial reduced Hessian is positive definite (if it is not, the Hessian may be modified by applying the modified Cholesky factorization to the reduced Hessian), there are no problems provided only constraints are added. It is when a constraint is deleted that indefiniteness may arise. While it may not arise in the resulting reduced Hessian, the indefiniteness in the direction from the initial point must be checked. While the Hessian still has positive curvature along that direction a descent direction is assured. When it does not, the situation is similar to that when applying the CG algorithm and it breaks down: the prior step is a descent direction and the new step yields a direction of negative curvature.

Approximating second derivatives is sometimes cheaper for constrained problems, especially in the linearly constrained case. Since we only need to approximate Z^T H Z, we can approximate HZ directly by differencing along the column vectors of Z. When n − m is small this is quite efficient. Also, if the system (24) is solved using the conjugate gradient algorithm we need only difference (this holds for the unconstrained case also) along the CG directions, which may be only a few before a sufficiently good search direction is obtained.

    8.1 A homotopy approach

Rather than change the algorithm, another way to extend the range of convergence of Newton's method is to change the problem, or rather to replace the problem by a sequence of problems. For example, we can use a homotopy approach. The basic idea is to replace the original problem with a parameterized problem. At one end of the parameter range the problem is easy to solve and at the other end it is the original problem. Such an approach may be used with globally convergent algorithms also, since it can be more efficient than simply solving the original problem directly. Here our use is not because the problem is necessarily hard to solve but because Newton's method may not work at all. Note that when you do start close to the solution with Newton's method the problem is usually easy to solve, since the iterative sequence converges quickly.

Constructing a homotopy for the case of solving f(x) = 0 is particularly easy. Instead of solving f(x) = 0 we could solve f(x) = γ f(x_0), where x_0 is our initial estimate, and 0 ≤ γ < 1. Obviously if γ is very close to 1 then x_0 is very close to x(γ), where x(γ) is a zero of f(x) − γ f(x_0). Since we can make γ as close to 1 as we wish, we can be within the interval of convergence provided the gradient of f at x(γ) is nonzero. It is not necessary to find the solution to the intermediate problems precisely. The main difficulty is assessing how conservative to make the initial choice of γ and how quickly it can be reduced. In one variable this approach makes the Newton step smaller and as such plays a similar role to α_k in (8). However, note that it is not identical to a fixed scaling of the Newton step.
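A small sketch of this homotopy (Python; newton_1d is the basic iteration sketched earlier, and the schedule for reducing γ is an arbitrary illustrative choice):

    def homotopy_solve(f, fprime, x0, gammas=(0.9, 0.7, 0.5, 0.3, 0.1, 0.0)):
        """Solve f(x) = gamma * f(x0) for a decreasing sequence of gamma values,
        warm-starting each subproblem from the previous solution."""
        f0 = f(x0)
        x = x0
        for gamma in gammas:
            # each intermediate problem has its zero near the previous iterate
            x = newton_1d(lambda t: f(t) - gamma * f0, fprime, x)
        return x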


Another way of modifying functions is to make them more like the ideal case for Newton's method. When solving f(x) = 0 that can be done by adding a linear function; for example, solve (1 − γ)f(x) + γ(x − x_0) = 0. Here we change the gradient of the function as well as move the zero closer to x_0. A similar approach may be taken for the problem min_x f(x) by adding a quadratic function.

If at the initial estimate f'_0 = 0 then a different initial estimate is needed regardless of the approach being used to solve the problem.

It can be particularly useful to use a homotopy approach when using an SQP algorithm. As noted, when the Hessian is indefinite difficulties arise. Adding a term (1/2)γ(x − x̄)^T(x − x̄) to the objective, where x̄ is the initial approximation, adds γI to the Hessian. Not only will this reduce the need to modify the Hessian, it also almost always reduces the number of iterations required to solve the QP subproblem. If γ were very large then we would be taking a very small optimization step from the initial feasible point of the QP subproblem. It is advisable to change x̄ periodically (perhaps even every iteration) and also to reduce γ to zero. When the iterates are close to the solution and the reduced Hessian of the Lagrangian is positive definite, indefiniteness will not arise when solving the QP. Indeed, close to the solution only one iteration of the QP is required since the correct active set has been identified.

    9 Acknowledgement and comments

There is much more that could be written, in particular on extensions of Newton's method that both preserve and improve upon the rate of convergence. It would also be remiss not to cite the classic work of Kantorovich [1]. Thanks are due to Nick Henderson for reading the manuscript and to the referees for their prompt reviews and insightful comments. Support for this work was provided by the Office of Naval Research (Grant: N00014-02-1-0076) and the Army (Grant: W911NF-07-2-0027-1).

    References

[1] L. V. Kantorovich (1948), Functional Analysis and Applied Mathematics, Uspekhi Matematicheskikh Nauk 3, 89-185.

[2] D. Bertsekas (1999), Nonlinear Programming, Athena Scientific.

[3] E. Boman (1999), Infeasibility and Negative Curvature in Optimization, PhD thesis, Stanford University, http://www.stanford.edu/group/SOL/dissertations.html.

[4] V. Ejov, J. A. Filar, W. Murray, and G. T. Nguyen (2009), Determinants and longest cycles of graphs, SIAM Journal on Discrete Mathematics 22 (3), 1215-1225.

[5] A. Forsgren and W. Murray (1993), Newton methods for large-scale linear equality-constrained minimization, SIAM J. Matrix Anal. Appl. 14, 560-587.

[6] P. E. Gill, W. Murray, and M. A. Saunders (2005), SNOPT: An SQP algorithm for large-scale constrained optimization, SIAM Review 47, 99-131.

[7] P. E. Gill, W. Murray, and M. H. Wright (1981), Practical Optimization, Academic Press, London and New York.

[8] P. E. Gill, W. Murray, M. A. Saunders, J. A. Tomlin, and M. H. Wright (1986), On projected Newton barrier methods for linear programming and an equivalence to Karmarkar's projective method, Mathematical Programming 36 (2), 183-209.

[9] N. Karmarkar (1984), A new polynomial-time algorithm for linear programming, Combinatorica 4 (4), 373-395.

[10] J. Nocedal and S. J. Wright (2006), Numerical Optimization, Springer.

[11] J. M. Ortega and W. C. Rheinboldt (2000), Iterative Solution of Nonlinear Equations in Several Variables, SIAM Classics in Applied Mathematics.

[12] W. Murray and F. Prieto (1995), A second-derivative method for nonlinearly constrained optimization, Report SOL 95-3, 48 pages.