Transcript of Gradient-based Methods for Optimization. Part I
Prof. Nathan L. Gibson
Department of Mathematics
Applied Math and Computation Seminar, October 21, 2011
Outline
Unconstrained Optimization
Newton's Method
Inexact Newton
Quasi-Newton
Nonlinear Least Squares
Gauss-Newton Method
Steepest Descent Method
Levenberg-Marquardt Method
Unconstrained Optimization
Minimize function f of N variables.
I.e., find a local minimizer x* such that
f(x*) ≤ f(x) for all x near x*.
Different from constrained optimization:
f(x*) ≤ f(x) for all x ∈ U near x*.
Different from a global minimizer:
f(x*) ≤ f(x) for all x (possibly in U).
Unconstrained Optimization
Sample Problem
Parameter Identification

Consider

    u'' + c u' + k u = 0,   u(0) = u_0,   u'(0) = 0,                    (1)

where u represents the motion of an unforced harmonic oscillator (e.g., a spring). We may assume u_0 is known, and data {u_j}_{j=1}^{M} are given for some times t_j on the interval [0, T].

Now we can state a parameter identification problem: find x = [c, k]^T such that the solution u(t) to (1) using parameters x is (as close as possible to) u_j when evaluated at the times t_j.
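For a concrete feel (this sketch is not from the slides), the forward problem (1) can be simulated by rewriting it as a first-order system; the Python code below assumes SciPy's solve_ivp and uses illustrative values for c, k, and u_0.

import numpy as np
from scipy.integrate import solve_ivp

def simulate(x, t_eval, u0=1.0):
    """Solve u'' + c*u' + k*u = 0, u(0)=u0, u'(0)=0, and return u at the times t_eval."""
    c, k = x
    def rhs(t, y):
        u, v = y                      # y = [u, u']
        return [v, -c * v - k * u]
    sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), [u0, 0.0],
                    t_eval=t_eval, rtol=1e-8, atol=1e-10)
    return sol.y[0]

# Hypothetical data {u_j}: sample the "true" parameters c = k = 1 on [0, 10]
t_data = np.linspace(0.0, 10.0, 100)
u_data = simulate([1.0, 1.0], t_data)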
Unconstrained Optimization Definitions
Objective Function
Consider the following formulation of the Parameter Identification problem: find x = [c, k]^T such that the following objective function is minimized:

    f(x) = \frac{1}{2} \sum_{j=1}^{M} |u(t_j; x) - u_j|^2 .

This is an example of a nonlinear least squares problem.
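As an illustrative sketch (not from the slides), the objective can be coded directly from the residuals. To keep the example self-contained, the model u(t; x) below uses the closed-form solution of (1) for the underdamped case c^2 < 4k; that choice is an assumption made only for this sketch.

import numpy as np

def u_model(t, x, u0=1.0):
    """Closed-form solution of u'' + c u' + k u = 0, u(0)=u0, u'(0)=0 (underdamped, c^2 < 4k)."""
    c, k = x
    w = np.sqrt(k - c**2 / 4.0)
    return u0 * np.exp(-c * t / 2.0) * (np.cos(w * t) + c / (2.0 * w) * np.sin(w * t))

def residual(x, t_data, u_data):
    """R(x): vector of misfits u(t_j; x) - u_j."""
    return u_model(t_data, x) - u_data

def objective(x, t_data, u_data):
    """f(x) = 1/2 * sum_j |u(t_j; x) - u_j|^2."""
    r = residual(x, t_data, u_data)
    return 0.5 * np.dot(r, r)

# Example: data generated with x* = [1, 1], objective evaluated at a nearby guess
t_data = np.linspace(0.0, 10.0, 100)
u_data = u_model(t_data, [1.0, 1.0])
print(objective([1.1, 1.05], t_data, u_data))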
Unconstrained Optimization Definitions
Iterative Methods
An iterative method for minimizing a function f(x) usually has the following parts:

Choose an initial iterate x_0.
For k = 0, 1, . . .
    If x_k is optimal, stop.
    Determine a search direction d and a step size λ.
    Set x_{k+1} = x_k + λ d.

(A generic sketch of this loop follows below.)
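A minimal generic version of this loop, with the direction and step-size rules left as method-specific callables (hypothetical names, not from the slides):

import numpy as np

def iterative_minimize(grad, x0, direction, step_size, tol=1e-8, max_iter=100):
    """Generic gradient-based loop: stop when ||grad f|| is small, else step along d."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # "if x_k optimal, stop"
            break
        d = direction(x, g)              # search direction (method-specific)
        lam = step_size(x, d)            # step size lambda (method-specific)
        x = x + lam * d
    return x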
Unconstrained Optimization Definitions
Convergence Rates
The sequence {x_k}_{k=1}^∞ is said to converge to x* with rate p and rate constant C if

    \lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} = C .

Linear: p = 1 and 0 < C < 1, so that the error decreases.
Quadratic: p = 2; the number of correct digits roughly doubles per iteration.
Superlinear: p = 1, C = 0. Faster than linear. Includes quadratic convergence, but also intermediate rates.

(A small numerical illustration of these rates follows.)
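A tiny numerical illustration (not from the slides), assuming C = 0.5 for the linear sequence and C = 1 for the quadratic one:

e_lin, e_quad = 0.5, 0.5
for k in range(5):
    print(f"k={k}  linear: {e_lin:.3e}   quadratic: {e_quad:.3e}")
    e_lin *= 0.5        # p = 1, C = 0.5: error halves each step
    e_quad = e_quad**2  # p = 2, C = 1: correct digits roughly double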
Unconstrained Optimization Necessary and Sufficient Conditions
Necessary Conditions
Theorem

Let f be twice continuously differentiable, and let x* be a local minimizer of f. Then

    \nabla f(x^*) = 0                                                    (2)

and the Hessian of f, ∇²f(x*), is positive semidefinite.

Recall: A positive semidefinite means

    x^T A x \ge 0 \quad \forall x \in \mathbb{R}^N .

Equation (2) is called the first-order necessary condition.
Unconstrained Optimization Necessary and Sufficient Conditions
Hessian
Let f : R^N → R be twice continuously differentiable (C²). Then:

The gradient of f is

    \nabla f = \left[ \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_N} \right]^T .

The Hessian of f is

    \nabla^2 f = \begin{bmatrix}
        \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\
        \vdots & \ddots & \vdots \\
        \frac{\partial^2 f}{\partial x_N \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_N^2}
    \end{bmatrix} .
Unconstrained Optimization Necessary and Sufficient Conditions
Sufficient Conditions
Theorem

Let f be twice continuously differentiable in a neighborhood of x*, let

    \nabla f(x^*) = 0 ,

and let the Hessian of f, ∇²f(x*), be positive definite. Then x* is a local minimizer of f.

Note: second derivative information is required to be certain; consider, for instance, f(x) = x³.
Newton's Method
Quadratic Objective Functions
Suppose

    f(x) = \frac{1}{2} x^T H x - x^T b ;

then we have that

    \nabla^2 f(x) = H ,

and if H is symmetric (assume it is),

    \nabla f(x) = H x - b .

Therefore, if H is positive definite, then the unique minimizer x* is the solution to

    H x^* = b .
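As a quick sketch (with a hypothetical symmetric positive definite H), the minimizer is obtained by one linear solve, and the gradient vanishes there:

import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])              # symmetric positive definite (hypothetical)
b = np.array([1.0, 2.0])

x_star = np.linalg.solve(H, b)          # minimizer of f(x) = 1/2 x^T H x - x^T b
grad_at_min = H @ x_star - b            # gradient should be (numerically) zero
print(x_star, np.linalg.norm(grad_at_min))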
Newton's Method
Newton's Method solves for the minimizer of the local quadratic model of f about the current iterate x_k, given by

    m_k(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k) .

If ∇²f(x_k) is positive definite, then the minimizer x_{k+1} of m_k is the unique solution to

    0 = \nabla m_k(x) = \nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) .        (3)
Newton's Method
Newton Step
The solution to (3) is computed by solving

    \nabla^2 f(x_k) \, s_k = -\nabla f(x_k)

for the Newton step s_k^N. Then the Newton update is defined by

    x_{k+1} = x_k + s_k^N .

Note: the step s_k^N has both a direction and a length. Variants of Newton's Method modify one or both of these.
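A bare-bones Newton iteration in Python, using a hypothetical test function (the Rosenbrock function, not the slide's example) with its analytic gradient and Hessian:

import numpy as np

def grad(x):
    # Rosenbrock f(x) = (1 - x1)^2 + 100*(x2 - x1^2)^2 (hypothetical test function)
    x1, x2 = x
    return np.array([-2*(1 - x1) - 400*x1*(x2 - x1**2),
                     200*(x2 - x1**2)])

def hess(x):
    x1, x2 = x
    return np.array([[2 + 1200*x1**2 - 400*x2, -400*x1],
                     [-400*x1, 200.0]])

x = np.array([1.1, 1.2])                      # close enough to x* = (1, 1)
for k in range(6):
    s = np.linalg.solve(hess(x), -grad(x))    # Newton step: hess * s = -grad
    x = x + s
    print(k, x, np.linalg.norm(grad(x)))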
Newton's Method
Standard Assumptions
Assume that f and x* satisfy the following:

1. f is twice continuously differentiable and ∇²f is Lipschitz continuous with constant γ:

    \|\nabla^2 f(x) - \nabla^2 f(y)\| \le \gamma \|x - y\| .

2. ∇f(x*) = 0.

3. ∇²f(x*) is positive definite.
Newton's Method
Convergence Rate
Theorem
Let the Standard Assumptions hold. Then there exists a δ > 0 such that if x_0 ∈ B_δ(x*), the Newton iteration converges quadratically to x*.

I.e., ‖e_{k+1}‖ ≤ K ‖e_k‖².
If x0 is not close enough, Hessian may not be positive definite.
If you start close enough, you stay close enough.
Newton's Method
Problems (and solutions)
Need derivatives:
Use finite difference approximations (a forward-difference sketch follows this list).
Need the solution of a linear system at each iteration:
Use an iterative linear solver like CG (Inexact Newton).
Hessians are expensive to find (and to solve/factor):
Use the chord method (factor once) or Shamanskii's method.
Use Quasi-Newton (update H_k to get H_{k+1}).
Use Gauss-Newton (first-order approximate Hessian).
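A forward-difference gradient, as mentioned in the first item above, can be sketched as follows (illustrative code, not from the slides):

import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    f0 = f(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f0) / h
    return g

# Example on a hypothetical quadratic with known gradient 2*x
print(fd_gradient(lambda x: np.dot(x, x), np.array([1.0, -2.0, 3.0])))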
Nonlinear Least Squares
Recall,

    f(x) = \frac{1}{2} \sum_{j=1}^{M} |u(t_j; x) - u_j|^2 .

Then for x = [c, k]^T,

    \nabla f(x) = \begin{bmatrix}
        \sum_{j=1}^{M} \frac{\partial u(t_j; x)}{\partial c} \,(u(t_j; x) - u_j) \\
        \sum_{j=1}^{M} \frac{\partial u(t_j; x)}{\partial k} \,(u(t_j; x) - u_j)
    \end{bmatrix} = R'(x)^T R(x) ,

where R(x) = [u(t_1; x) - u_1, . . . , u(t_M; x) - u_M]^T is called the residual and R'_{ij}(x) = ∂R_i(x)/∂x_j.
Nonlinear Least Squares
Approximate Hessian
In terms of the residual R, the Hessian of f becomes

    \nabla^2 f(x) = R'(x)^T R'(x) + R''(x) R(x) ,

where R''(x) R(x) = \sum_{j=1}^{M} r_j(x) \nabla^2 r_j(x) and r_j(x) is the jth element of the vector R(x).

The second-order term requires the computation of M Hessians, each of size N × N. However, if we happen to be solving a zero-residual problem, this second-order term goes to zero. One can argue that for small-residual problems (and good initial iterates) the second-order term is negligible.
Gauss-Newton Method
The equation defining the Newton step,

    \nabla^2 f(x_k) \, s_k = -\nabla f(x_k) ,

becomes

    R'(x_k)^T R'(x_k) \, s_k = -\nabla f(x_k) = -R'(x_k)^T R(x_k) .

We define the Gauss-Newton step as the solution s_k^{GN} to this equation.

You can expect close to quadratic convergence for small-residual problems. Otherwise, not even linear convergence is guaranteed.
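A small Gauss-Newton sketch (not from the slides), with a finite-difference Jacobian and a hypothetical zero-residual decaying-cosine model standing in for the ODE solution:

import numpy as np

def fd_jacobian(R, x, h=1e-7):
    """Forward-difference Jacobian R'(x), one column per parameter."""
    r0 = R(x)
    J = np.zeros((r0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (R(x + e) - r0) / h
    return J

def gauss_newton(R, x0, tol=1e-10, max_iter=20):
    """Gauss-Newton: solve (J^T J) s = -J^T R at each iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r, J = R(x), fd_jacobian(R, x)
        g = J.T @ r                       # gradient of f = 1/2 ||R||^2
        if np.linalg.norm(g) < tol:
            break
        s = np.linalg.solve(J.T @ J, -g)  # Gauss-Newton step
        x = x + s
    return x

# Example: small-residual fit of a hypothetical decaying-cosine model
t = np.linspace(0.0, 10.0, 100)
data = np.exp(-0.5 * t) * np.cos(t)
R = lambda x: np.exp(-x[0] * t) * np.cos(x[1] * t) - data
print(gauss_newton(R, np.array([0.55, 1.05])))   # should approach [0.5, 1.0]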
Gauss-Newton Method
Numerical Example
Recall

    u'' + c u' + k u = 0,   u(0) = u_0,   u'(0) = 0 .

Let the true parameters be x* = [c, k]^T = [1, 1]^T. Assume we have M = 100 data u_j from equally spaced time points on [0, 10].

We will use the initial iterate x_0 = [1.1, 1.05]^T with Newton's Method and Gauss-Newton.

We compute gradients with forward differences, use the analytical 2 × 2 matrix inverse, and use ode15s for time stepping the ODE.
Gauss-Newton Method
[Figure: comparison of the data and the model output at the initial iterate (u versus t on [0, 10]).]
Gauss-Newton Method
[Figures: gradient norm and function value versus iteration (log scale) for Newton's Method and for Gauss-Newton.]
Gauss-Newton Method
         Newton                          Gauss-Newton
k    ||∇f(x_k)||    f(x_k)          ||∇f(x_k)||    f(x_k)
0    2.330e+01      7.881e-01       2.330e+01      7.881e-01
1    6.852e+00      9.817e-02       1.767e+00      6.748e-03
2    4.577e-01      6.573e-04       1.016e-02      4.656e-07
3    3.242e-03      3.852e-08       1.844e-06      2.626e-13
4    4.213e-07      2.471e-13

Table: Parameter identification problem, locally convergent iterations. CPU time Newton: 3.4 s, Gauss-Newton: 1 s.
Gauss-Newton Method
[Figure: iteration history in the (c, k) plane for Newton's Method and Gauss-Newton.]
Gauss-Newton Method
[Figure: search directions in the (c, k) plane for Newton's Method and Gauss-Newton.]
Gauss-Newton Method
[Figure: search directions in the (c, k) plane for Newton's Method and Gauss-Newton, over a wider range of c.]
Gauss-Newton Method
Global Convergence
Newton direction may not be a descent direction (if the Hessian is not positive definite).
Thus Newton (or any Newton-based method) may increase f if x_0 is not close enough. Not globally convergent.
Globally convergent methods ensure (sufficient) decrease in f.
The steepest descent direction is always a descent direction.
Steepest Descent Method
We define the steepest descent direction to be d_k = -∇f(x_k). This defines a direction but not a step size.

We define the Steepest Descent update step to be s_k^{SD} = λ_k d_k for some λ_k > 0.

We will talk later about ways of choosing λ. (A sketch with one simple choice of λ follows.)
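The backtracking rule below (halve λ until f decreases) is only one simple way to choose λ_k; it is an illustrative assumption, not the slides' prescription:

import numpy as np

def steepest_descent(f, grad, x0, lam0=1.0, tol=1e-6, max_iter=500):
    """Steepest descent with simple backtracking on the step size lambda."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                      # steepest descent direction
        lam = lam0
        while f(x + lam * d) >= f(x) and lam > 1e-12:
            lam *= 0.5              # shrink until f decreases
        x = x + lam * d
    return x

# Example on a hypothetical quadratic
f = lambda x: 0.5 * (4 * x[0]**2 + x[1]**2)
grad = lambda x: np.array([4 * x[0], x[1]])
print(steepest_descent(f, grad, [2.0, 3.0]))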
Steepest Descent Method
[Figure: iteration history in the (c, k) plane for Newton's Method, Gauss-Newton, and Steepest Descent.]
Steepest Descent Method
[Figure: search directions in the (c, k) plane for Newton's Method, Gauss-Newton, and Steepest Descent.]
Steepest Descent Method
[Figure: search directions in the (c, k) plane for Newton's Method, Gauss-Newton, and Steepest Descent, over a wider range of c.]
Steepest Descent Method
Steepest Descent Comments
Steepest Descent direction is the best direction locally.
The negative gradient is perpendicular to level curves.
Solving for s_k^{SD} is equivalent to assuming ∇²f(x_k) = I/λ_k.
In general you can only expect linear convergence.
Would be good to combine the global convergence property of Steepest Descent with the superlinear convergence rate of Gauss-Newton.
Levenberg-Marquardt Method
Recall the objective function

    f(x) = \frac{1}{2} R(x)^T R(x) ,

where R is the residual. We define the Levenberg-Marquardt update step s_k^{LM} to be the solution of

    \left( R'(x_k)^T R'(x_k) + \nu_k I \right) s_k = -R'(x_k)^T R(x_k) ,

where the regularization parameter ν_k is called the Levenberg-Marquardt parameter, and it is chosen such that the approximate Hessian R'(x_k)^T R'(x_k) + ν_k I is positive definite.
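A one-line version of the Levenberg-Marquardt step (a sketch, not from the slides), applied to a hypothetical nearly rank-deficient Jacobian to show how ν changes the step:

import numpy as np

def lm_step(J, r, nu):
    """Levenberg-Marquardt step: solve (J^T J + nu*I) s = -J^T r."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + nu * np.eye(n), -J.T @ r)

# Illustration with a hypothetical Jacobian whose columns are nearly parallel
J = np.array([[1.0, 2.0],
              [0.5, 1.01],
              [1.0, 2.02]])
r = np.array([0.3, -0.1, 0.2])
for nu in [0.0, 1e-3, 1e-1, 10.0]:
    print(nu, lm_step(J, r, nu))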
Levenberg-Marquardt Method
[Figure: search directions in the (c, k) plane for Newton's Method, Gauss-Newton, Steepest Descent, and Levenberg-Marquardt.]
Levenberg-Marquardt Method
[Figure: search directions in the (c, k) plane for Newton's Method, Gauss-Newton, Steepest Descent, and Levenberg-Marquardt, over a wider range of c.]
Levenberg-Marquardt Method
Levenberg-Marquardt Notes
Robust with respect to poor initial conditions and larger residual problems.
Varying ν involves interpolation between the GN direction (ν = 0) and the SD direction (large ν).
See

    doc lsqnonlin

for MATLAB instructions for LM and GN.
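For readers working in Python rather than MATLAB, a roughly analogous routine (an aside, not from the slides) is scipy.optimize.least_squares with method='lm':

import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 10.0, 100)
data = np.exp(-0.5 * t) * np.cos(t)                      # hypothetical data
R = lambda x: np.exp(-x[0] * t) * np.cos(x[1] * t) - data

result = least_squares(R, x0=[0.6, 1.1], method='lm')    # Levenberg-Marquardt (MINPACK)
print(result.x)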
Levenberg-Marquardt Method
Levenberg-Marquardt Idea
If the iterate is not close enough to the minimizer, so that GN does not give a descent direction, increase ν to take more of a SD direction.
As you get closer to the minimizer, decrease ν to take more of a GN step.
For zero-residual problems:
GN converges quadratically (if at all).
SD converges linearly (guaranteed).
(A sketch of one common ν-update heuristic follows.)
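One common heuristic implementing this idea (an assumption; the slides do not prescribe a specific ν update) is to accept the step and shrink ν when f decreases, and otherwise reject the step and grow ν:

import numpy as np

def levenberg_marquardt(R, jac, x0, nu=1e-2, tol=1e-10, max_iter=100):
    """LM with a simple nu update: shrink nu on a successful step, grow it on a failed one."""
    x = np.asarray(x0, dtype=float)
    f = lambda z: 0.5 * np.dot(R(z), R(z))
    for _ in range(max_iter):
        r, J = R(x), jac(x)
        g = J.T @ r                                   # gradient of f = 1/2 ||R||^2
        if np.linalg.norm(g) < tol:
            break
        s = np.linalg.solve(J.T @ J + nu * np.eye(x.size), -g)
        if f(x + s) < f(x):
            x, nu = x + s, nu / 10.0                  # good step: drift toward Gauss-Newton
        else:
            nu *= 10.0                                # bad step: drift toward steepest descent
    return x

# Hypothetical decaying-cosine model with analytic Jacobian
t = np.linspace(0.0, 10.0, 100)
data = np.exp(-0.5 * t) * np.cos(t)
R = lambda x: np.exp(-x[0] * t) * np.cos(x[1] * t) - data
jac = lambda x: np.column_stack([-t * np.exp(-x[0] * t) * np.cos(x[1] * t),
                                 -t * np.exp(-x[0] * t) * np.sin(x[1] * t)])
print(levenberg_marquardt(R, jac, [0.6, 1.1]))        # should approach [0.5, 1.0]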
Levenberg-Marquardt Method
LM Alternative Perspective
Approximate Hessian may not be positive definite (or well-conditioned); increase ν to add regularity.
As you get closer to the minimizer, the Hessian will become positive definite (by the Standard Assumptions). Decrease ν, as less regularization is necessary.
The regularized problem is a "nearby problem"; we want to solve the actual problem as soon as is feasible.
Summary
Taylor series with remainder:

    f(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(\xi) (x - x_k)

Newton:

    m_k^N(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k)

Gauss-Newton:

    m_k^{GN}(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T R'(x_k)^T R'(x_k) (x - x_k)

Steepest Descent:

    m_k^{SD}(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \frac{1}{\lambda_k} I (x - x_k)

Levenberg-Marquardt:

    m_k^{LM}(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \left( R'(x_k)^T R'(x_k) + \nu_k I \right) (x - x_k)
References
1. Levenberg, K., "A Method for the Solution of Certain Problems in Least-Squares", Quarterly Applied Math. 2, pp. 164-168, 1944.
2. Marquardt, D., "An Algorithm for Least-Squares Estimation of Nonlinear Parameters", SIAM Journal Applied Math., Vol. 11, pp. 431-441, 1963.
3. Moré, J. J., "The Levenberg-Marquardt Algorithm: Implementation and Theory", Numerical Analysis, ed. G. A. Watson, Lecture Notes in Mathematics 630, Springer Verlag, 1977.
4. Kelley, C. T., "Iterative Methods for Optimization", Frontiers in Applied Mathematics 18, SIAM, 1999. http://www4.ncsu.edu/~ctk/matlab darts.html.
5. Wadbro, E., "Additional Lecture Material", Optimization 1 / MN1, Uppsala Universitet, http://www.it.uu.se/edu/course/homepage/opt1/ht07/.