Optimization Problems, One-Dimensional Optimization, and Multi-Dimensional Optimization
Transcript of lecture slides (Michael T. Heath, Scientific Computing)
Outline
1 Optimization Problems
2 One-Dimensional Optimization
3 Multi-Dimensional Optimization
Michael T. Heath Scientific Computing 2 / 74
Optimization
Given function f : R^n → R and set S ⊆ R^n, find x* ∈ S such that f(x*) ≤ f(x) for all x ∈ S
x* is called minimizer or minimum of f
It suffices to consider only minimization, since maximum of f is minimum of −f
Objective function f is usually differentiable, and may be linear or nonlinear
Constraint set S is defined by system of equations and inequalities, which may be linear or nonlinear
Points x ∈ S are called feasible points
If S = R^n, problem is unconstrained
Optimization Problems
General continuous optimization problem:
min f(x) subject to g(x) = 0 and h(x) ≤ 0
where f : R^n → R, g : R^n → R^m, h : R^n → R^p
Linear programming: f, g, and h are all linear
Nonlinear programming: at least one of f, g, and h is nonlinear
Examples: Optimization Problems
Minimize weight of structure subject to constraint on its strength, or maximize its strength subject to constraint on its weight
Minimize cost of diet subject to nutritional constraints
Minimize surface area of cylinder subject to constraint on its volume:
min over x1, x2 of f(x1, x2) = 2π x1 (x1 + x2)
subject to g(x1, x2) = π x1² x2 − V = 0
where x1 and x2 are radius and height of cylinder, and V is required volume
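As a concrete check, here is a minimal Python sketch (not from the slides) that eliminates the volume constraint, giving A(r) = 2πr² + 2V/r, and minimizes it by bisecting A′(r); the known optimal proportion for a closed cylinder is height equal to diameter, h = 2r:

```python
import math

def min_cylinder_area(V, lo=1e-3, hi=10.0, tol=1e-10):
    """Minimize surface area A(r) = 2*pi*r*(r + h) of a closed cylinder of
    volume V. Eliminating the constraint pi*r^2*h = V gives
    A(r) = 2*pi*r^2 + 2*V/r; find the root of A'(r) by bisection."""
    dA = lambda r: 4.0 * math.pi * r - 2.0 * V / r**2   # A'(r)
    # A' is negative for small r and positive for large r, so bisect:
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dA(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    r = 0.5 * (lo + hi)
    h = V / (math.pi * r**2)        # height recovered from the constraint
    return r, h

r, h = min_cylinder_area(V=1.0)
# Optimal closed cylinder has height equal to its diameter: h = 2r
```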
Local vs Global Optimization
x* ∈ S is global minimum if f(x*) ≤ f(x) for all x ∈ S
x* ∈ S is local minimum if f(x*) ≤ f(x) for all feasible x in some neighborhood of x*
Global Optimization
Finding, or even verifying, global minimum is difficult in general
Most optimization methods are designed to find local minimum, which may or may not be global minimum
If global minimum is desired, one can try several widely separated starting points and see if all produce same result
For some problems, such as linear programming, global optimization is more tractable
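The multistart heuristic above can be sketched in Python; the function (with two local minima) and the crude gradient-descent local method here are illustrative assumptions, not from the slides:

```python
# Multistart heuristic: run a local method from widely separated starts
# and keep the best result. Function and method are illustrative choices.
def f(x):
    return x**4 - 3 * x**2 + x      # assumed example with two local minima

def df(x):
    return 4 * x**3 - 6 * x + 1

def local_min(x, step=0.01, iters=5000):
    """Crude fixed-step gradient descent; finds a nearby local minimum."""
    for _ in range(iters):
        x -= step * df(x)
    return x

starts = [-2.0, -0.5, 0.5, 2.0]          # widely separated starting points
candidates = [local_min(x0) for x0 in starts]
best = min(candidates, key=f)            # global minimum is near x = -1.30
```

Different starts converge to different local minima; comparing their function values picks out the (likely) global one.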
Existence of Minimum
If f is continuous on closed and bounded set S ⊆ R^n, then f has global minimum on S
If S is not closed or is unbounded, then f may have no local or global minimum on S
Continuous function f on unbounded set S ⊆ R^n is coercive if
lim_{||x|| → ∞} f(x) = +∞
i.e., f(x) must be large whenever ||x|| is large
If f is coercive on closed, unbounded set S ⊆ R^n, then f has global minimum on S
An Example of a Coercive Function
[Figure: level sets (contours, isosurfaces) of a coercive function, which goes to ∞ as ||x|| → ∞]
Level Sets
Level set for function f : S ⊆ R^n → R is set of all points in S for which f has some given constant value
For given γ ∈ R, sublevel set is
L_γ = {x ∈ S : f(x) ≤ γ}
If continuous function f on S ⊆ R^n has nonempty sublevel set that is closed and bounded, then f has global minimum on S
If S is unbounded, then f is coercive on S if, and only if, all of its sublevel sets are bounded
(No open contours…)
Uniqueness of Minimum
Set S ⊆ R^n is convex if it contains line segment between any two of its points
Function f : S ⊆ R^n → R is convex on convex set S if its graph along any line segment in S lies on or below chord connecting function values at endpoints of segment
Any local minimum of convex function f on convex set S ⊆ R^n is global minimum of f on S
Any local minimum of strictly convex function f on convex set S ⊆ R^n is unique global minimum of f on S
Convex Domains and Convex Functions
First-Order Optimality Condition
For function of one variable, one can find extremum by differentiating function and setting derivative to zero
Generalization to function of n variables is to find critical point, i.e., solution of nonlinear system
∇f(x) = 0
where ∇f(x) is gradient vector of f, whose ith component is ∂f(x)/∂x_i
For continuously differentiable f : S ⊆ R^n → R, any interior point x* of S at which f has local minimum must be critical point of f
But not all critical points are minima: they can also be maxima or saddle points
• Saddle points in higher dimensions are more common than zero-slope inflection points in 1D
[Figure: contour plot; note the open contours]
• We use the solution techniques of Chapter 5 to solve this nonlinear system.
Second-Order Optimality Condition
For twice continuously differentiable f : S ⊆ R^n → R, we can distinguish among critical points by considering Hessian matrix H_f(x) defined by
{H_f(x)}_ij = ∂²f(x) / (∂x_i ∂x_j)
which is symmetric
At critical point x*, if H_f(x*) is
  positive definite, then x* is minimum of f
  negative definite, then x* is maximum of f
  indefinite, then x* is saddle point of f
  singular, then various pathological situations are possible
Classifying Critical Points
• Example 6.5 from the text:
f(x) = 2x1³ + 3x1² + 12x1x2 + 3x2² − 6x2 + 6.
• Gradient:
∇f(x) = [6x1² + 6x1 + 12x2,  12x1 + 6x2 − 6]ᵀ.
• Hessian is symmetric:
H_f = [∂²f/∂x1∂x1  ∂²f/∂x1∂x2; ∂²f/∂x1∂x2  ∂²f/∂x2∂x2] = [12x1 + 6  12; 12  6].
• Critical points, solutions to ∇f(x) = 0, are
x* = [1, −1]ᵀ and x* = [2, −3]ᵀ.
• At first critical point,
H_f(1, −1) = [18  12; 12  6],
which does not have all positive eigenvalues (it is indefinite), so [1, −1]ᵀ is a saddle point. Hessian is not SPD.
• At second critical point,
H_f(2, −3) = [30  12; 12  6],
which does have all positive eigenvalues (it is positive definite), so [2, −3]ᵀ is a local minimum. Hessian is SPD.
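The example can be checked numerically. The sketch below (mine, not the text's) applies Newton's method to the nonlinear system ∇f(x) = 0, as suggested, and classifies each critical point from the eigenvalues of the 2×2 Hessian:

```python
import math

# Gradient and Hessian of f(x) = 2*x1^3 + 3*x1^2 + 12*x1*x2 + 3*x2^2 - 6*x2 + 6
def grad(x1, x2):
    return (6*x1**2 + 6*x1 + 12*x2, 12*x1 + 6*x2 - 6)

def hess(x1, x2):
    return ((12*x1 + 6, 12.0), (12.0, 6.0))

def newton(x1, x2, iters=20):
    """Newton's method for the system grad f = 0 (techniques of Chapter 5)."""
    for _ in range(iters):
        g1, g2 = grad(x1, x2)
        (a, b), (c, d) = hess(x1, x2)
        det = a*d - b*c
        # Solve H * s = -g for the Newton step s by Cramer's rule (2x2)
        s1 = (-g1*d + g2*b) / det
        s2 = (-a*g2 + c*g1) / det
        x1, x2 = x1 + s1, x2 + s2
    return x1, x2

def eigenvalues_2x2(H):
    (a, b), (c, d) = H
    t, disc = a + d, math.sqrt((a - d)**2 + 4*b*c)
    return (t - disc) / 2, (t + disc) / 2

p = newton(0.5, -0.5)   # converges to the saddle point (1, -1)
q = newton(2.5, -2.5)   # converges to the local minimum (2, -3)
```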
Example of Singular Hessian in 1D
• Consider
f(x) = (x − 1)⁴
∇f = df/dx = 4(x − 1)³ = 0 when x = x* = 1.
H = d²f/dx² = 12(x − 1)² = 0 at x = x*.
Here, the Hessian is singular at x = x*, and x* is a minimizer of f.
• However, if
f(x) = (x − 1)³
∇f = df/dx = 3(x − 1)² = 0 when x = x* = 1.
H = d²f/dx² = 6(x − 1) = 0 at x = x*.
Here, the Hessian is singular at x = x*, and x* is not a minimizer of f.
Constrained Optimality
If problem is constrained, only feasible directions are relevant
For equality-constrained problem
min f(x) subject to g(x) = 0
where f : R^n → R and g : R^n → R^m, with m ≤ n, necessary condition for feasible point x* to be solution is that negative gradient of f lie in space spanned by constraint normals,
−∇f(x*) = J_g^T(x*) λ
where J_g is Jacobian matrix of g, and λ is vector of Lagrange multipliers
This condition says we cannot reduce objective function without violating constraints
Constrained Optimality, continued
Lagrangian function L : R^{n+m} → R is defined by
L(x, λ) = f(x) + λᵀ g(x)
Its gradient is given by
∇L(x, λ) = [∇f(x) + J_g^T(x) λ; g(x)]
Its Hessian is given by
H_L(x, λ) = [B(x, λ)  J_g^T(x); J_g(x)  O]
where
B(x, λ) = H_f(x) + Σ_{i=1}^m λ_i H_{g_i}(x)
Constrained Optimality, continued
Together, necessary condition and feasibility imply critical point of Lagrangian function,
∇L(x, λ) = [∇f(x) + J_g^T(x) λ; g(x)] = 0
Hessian of Lagrangian is symmetric, but not positive definite, so critical point of L is saddle point rather than minimum or maximum
Critical point (x*, λ*) of L is constrained minimum of f if B(x*, λ*) is positive definite on null space of J_g(x*)
If columns of Z form basis for null space, then test projected Hessian ZᵀBZ for positive definiteness
Equality Constraints: Case g(x) a Scalar
min f(x) subject to g(x) = 0.
• Lagrangian:
L(x, λ) := f(x) + λ g(x),   λ := Lagrange multiplier
• First-order conditions:
∇L = [∂L/∂x_i; ∂L/∂λ] = [∇f + λ∇g; g] = [0; 0]
at critical point (x*, λ*).
• Note, ∇f(x*) ≠ 0.
• Instead, ∇f(x*) = −λ* ∇g(x*).
• The gradient of f at x* is a multiple of the gradient of g.
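A quick numerical check (mine, not from the slides) of this condition for the earlier cylinder example, whose optimum is known to satisfy h = 2r:

```python
import math

# Check -grad f(x*) = lam * grad g(x*) for the cylinder problem:
# f(x1, x2) = 2*pi*x1*(x1 + x2),  g(x1, x2) = pi*x1^2*x2 - V
V = 1.0
r = (V / (2 * math.pi)) ** (1.0 / 3.0)   # optimal radius, for which h = 2r
h = 2.0 * r

grad_f = (2 * math.pi * (2 * r + h), 2 * math.pi * r)   # partials of f
grad_g = (2 * math.pi * r * h, math.pi * r**2)          # partials of g

lam = -grad_f[0] / grad_g[0]             # multiplier from first component
residual = max(abs(grad_f[i] + lam * grad_g[i]) for i in range(2))
# residual ~ 0: grad f is indeed a multiple of grad g at x*
```

For equality constraints the sign of λ is unrestricted; here λ = −2/r.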
Constrained Optimality Example, g(x) a Scalar
❑ Comment on: “This condition says we cannot reduce objective function without violating constraints.”
❑ A key point is that, at x*, ∇f is parallel to ∇g
❑ If there are multiple constraints, then ∇f lies in span{∇g_k}, i.e., −∇f = J_g^T λ
[Figure: contour g(x) = 0 with vectors ∇g and −∇f]
Constrained Optimality Example, g(x) a Scalar
❑ Here, we see the gradients of f and g at a point that is not x*.
❑ Clearly, we can make more progress in reducing f(x) by moving along the g(x) = 0 contour until ∇f is parallel to ∇g.
[Figure: contour g(x) = 0 with vectors ∇g and −∇f, and the point x*]
Example: Two Equality Constraints
min f(x) subject to g1(x) = 0 and g2(x) = 0.
• Lagrangian:
L(x, λ) := f(x) + λ1 g1(x) + λ2 g2(x).
• First-order conditions:
∇L = [∂L/∂x_i; ∂L/∂λ_k] = [∂f/∂x_i + Σ_k λ_k ∂g_k/∂x_i; (g1; g2)] = [∇f + J_g^T λ; g] = 0
at critical point (x*, λ*).
• Note: null space of J_g(x*) is the set of vectors tangent to the constraint surface (or surfaces, if more than one constraint g is present).
Constrained Optimality, continued
If inequalities are present, then KKT optimality conditions also require nonnegativity of Lagrange multipliers corresponding to inequalities, and complementarity condition
Karush-Kuhn-Tucker
∇_x L(x*, λ*) = 0,   (1)
g(x*) = 0,   (2)
h(x*) ≤ 0,   (3)
λ*_i ≥ 0,  i = m+1, …, m+p   (inequality Lagrange multipliers)   (4)
λ*_i h_i(x*) = 0,  i = m+1, …, m+p   (complementarity condition)   (5)
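A minimal illustration of conditions (1)–(5) on a hypothetical one-variable problem (my example, not the text's): min x² subject to h(x) = 1 − x ≤ 0, whose solution x* = 1 has an active constraint:

```python
# Toy inequality-constrained problem (illustrative assumption):
#   min f(x) = x^2  subject to  h(x) = 1 - x <= 0   (i.e., x >= 1)
# The unconstrained minimum x = 0 is infeasible, so the constraint is
# active and the solution is x* = 1, with lam* from grad_x L = 0.

x_star = 1.0
h = 1.0 - x_star            # h(x*) = 0: constraint active
lam = 2.0 * x_star          # from dL/dx = 2x - lam = 0

kkt_stationarity = 2.0 * x_star - lam   # (1) grad_x L(x*, lam*) = 0
kkt_feasible = h <= 0.0                 # (3) h(x*) <= 0
kkt_sign = lam >= 0.0                   # (4) lam* >= 0
kkt_complementarity = lam * h           # (5) lam* h(x*) = 0
```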
Inequality Constraint and Sign of λ
❑ A key point is that, at x*, ∇f is parallel to ∇h
❑ If there are multiple constraints, then ∇f lies in span{∇h_k}, i.e., −∇f = J_h^T λ
[Figure: region h(x) < 0 with boundary h(x) = 0, vectors −∇h and −∇f at x*]
Inequality Constraint and Sign of λ
[Figure: region h(x) < 0 with boundary h(x) = 0, vectors −∇h and −∇f at x*]
• If constraint is active (h(x*) = 0), then ∇f and ∇h point in opposite directions.
• ∇f + λ* ∇h = 0.
• λ* > 0.
Sensitivity and Conditioning
Function minimization and equation solving are closely related problems, but their sensitivities differ
In one dimension, absolute condition number of root x* of equation f(x) = 0 is 1/|f′(x*)|, so if |f(x̂)| ≤ ε, then |x̂ − x*| may be as large as ε/|f′(x*)|
For minimizing f, Taylor series expansion
f(x̂) = f(x* + h) = f(x*) + f′(x*) h + (1/2) f′′(x*) h² + O(h³)
shows that, since f′(x*) = 0, if |f(x̂) − f(x*)| ≤ ε, then |x̂ − x*| may be as large as √(2ε/|f′′(x*)|)
Thus, based on function values alone, minima can be computed to only about half precision
Consider f = 1 − cos(x)
❑ As you zoom in, f(x) → 0 quadratically, but x → x* only linearly.
❑ So, if |f(x) − f(x*)| ≈ ε, then ||x − x*|| = O(ε^{1/2})
[Figure: successive zooms near x* = 0; the interval where f is within ε of the minimum has half-width ≈ ε^{1/2}]
Sensitivity of Minimization
Terminate search when
|f(x* + Δx) − f(x*)| ≤ ε.
Taylor series:
f(x* + Δx) = f(x*) + Δx f′(x*) + (Δx²/2) f′′(x*) + O(Δx³)
Since f′(x*) = 0,
Δx²/2 ≈ (f(x* + Δx) − f(x*)) / f′′(x*)
|Δx| ≈ √(2ε / |f′′(x*)|)
• So, if ε ≈ ε_M, can expect accuracy to approximately √ε_M.
• Question: Is a small f′′ good? Or bad?
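A numeric illustration of this √ε effect (my sketch, not from the slides), using f(x) = 1 − cos x from the figures above, for which f′′(x*) = 1:

```python
import math

# f(x) = 1 - cos(x): minimum f(x*) = 0 at x* = 0, with f''(x*) = 1.
# March away from x* and record how far x gets while f stays within eps.
eps = 1e-12
x = 0.0
while 1.0 - math.cos(x) <= eps:
    x += 1e-8
# Half-width predicted by |dx| ~ sqrt(2*eps/|f''(x*)|)
predicted = math.sqrt(2.0 * eps)
```

The measured half-width matches the prediction: a function-value tolerance of 10⁻¹² locates x* only to about 10⁻⁶.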
Example: Minimize Cost over Some Design Parameter, x
[Figure: cost vs. design parameter, showing a broad minimum]
❑ While having small f′′(x*) makes it difficult to find the optimal x*, it in fact is a happy circumstance, because it gives you liberty to add additional constraints at no cost.
❑ So, often, having a broad minimum is good.
Methods for One-Dimensional Problems
❑ Demonstrate basic techniques: bracketing, convergence rates
❑ Useful for line search in multi-dimensional problems
Unimodality
For minimizing function of one variable, we need “bracket” for solution analogous to sign change for nonlinear equation
Real-valued function f is unimodal on interval [a, b] if there is unique x* ∈ [a, b] such that f(x*) is minimum of f on [a, b], and f is strictly decreasing for x ≤ x*, strictly increasing for x* ≤ x
Unimodality enables discarding portions of interval based on sample function values, analogous to interval bisection
Golden Section Search
Suppose f is unimodal on [a, b], and let x1 and x2 be two points within [a, b], with x1 < x2
Evaluating and comparing f(x1) and f(x2), we can discard either (x2, b] or [a, x1), with minimum known to lie in remaining subinterval
To repeat process, we need to compute only one new function evaluation
To reduce length of interval by fixed fraction at each iteration, each new pair of points must have same relationship with respect to new interval that previous pair had with respect to previous interval
Golden Section Search, continued
To accomplish this, we choose relative positions of two points as τ and 1 − τ, where τ² = 1 − τ, so
τ = (√5 − 1)/2 ≈ 0.618 and 1 − τ ≈ 0.382
Whichever subinterval is retained, its length will be τ relative to previous interval, and interior point retained will be at position either τ or 1 − τ relative to new interval
To continue iteration, we need to compute only one new function value, at complementary point
This choice of sample points is called golden section search
Golden section search is safe, but convergence rate is only linear, with constant C ≈ 0.618
Requires Unimodality!
Golden Section Search
❑ f(x) unimodal on [a, b].
❑ Subdivide [a, b] into 3 parts.
❑ If f1 > f2, discard (a, x1).
❑ Choose new point in larger of remaining two intervals: (x1, x2) or (x2, b).
❑ → x1 and x2 should be closer to center and not at 1/3, 2/3.
[Figure: points a, x1, x2, b on an axis; when f1 > f2, the new bracket has endpoints a′, x1′, x2′]
Golden Section Geometry
❑ Want new section [1 − τ, τ] to have same relation as [0, 1 − τ] to [0, 1].
[Figure: points 0, 0.382, 0.618, 1 on an axis, with segments of length 1 − τ and τ marked]
(τ − (1 − τ)) / τ = 1 − τ
2τ − 1 = τ − τ²
τ² + τ − 1 = 0
τ = (−1 + √(1 + 4))/2 = (√5 − 1)/2 ≈ 0.618
Golden Section Search, continued
τ = (√5 − 1)/2
x1 = a + (1 − τ)(b − a); f1 = f(x1)
x2 = a + τ(b − a); f2 = f(x2)
while ((b − a) > tol) do
  if (f1 > f2) then
    a = x1
    x1 = x2; f1 = f2
    x2 = a + τ(b − a); f2 = f(x2)
  else
    b = x2
    x2 = x1; f2 = f1
    x1 = a + (1 − τ)(b − a); f1 = f(x1)
  end
end
Example: Golden Section Search
Use golden section search to minimize
f(x) = 0.5 − x exp(−x²)
Example, continued
x1 f1 x2 f2
0.764 0.074 1.236 0.232
0.472 0.122 0.764 0.074
0.764 0.074 0.944 0.113
0.652 0.074 0.764 0.074
0.584 0.085 0.652 0.074
0.652 0.074 0.695 0.071
0.695 0.071 0.721 0.071
0.679 0.072 0.695 0.071
0.695 0.071 0.705 0.071
0.705 0.071 0.711 0.071
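The pseudocode and the table above can be reproduced by a direct Python translation (the starting interval [0, 2] is an assumption inferred from the table's first row):

```python
import math

def golden_section(f, a, b, tol=1e-3):
    """Golden section search, translated from the slide's pseudocode."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = a + (1 - tau) * (b - a); f1 = f(x1)
    x2 = a + tau * (b - a);       f2 = f(x2)
    while (b - a) > tol:
        if f1 > f2:
            a = x1
            x1, f1 = x2, f2
            x2 = a + tau * (b - a); f2 = f(x2)
        else:
            b = x2
            x2, f2 = x1, f1
            x1 = a + (1 - tau) * (b - a); f1 = f(x1)
    return 0.5 * (a + b)

f = lambda x: 0.5 - x * math.exp(-x * x)
x_min = golden_section(f, 0.0, 2.0)     # minimum at x* = 1/sqrt(2)
```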
Successive Parabolic Interpolation
Fit quadratic polynomial to three function values
Take minimum of quadratic to be new approximation to minimum of function
New point replaces oldest of three previous points and process is repeated until convergence
Convergence rate of successive parabolic interpolation is superlinear, with r ≈ 1.324
Example: Successive Parabolic Interpolation
Use successive parabolic interpolation to minimize
f(x) = 0.5 − x exp(−x²)
Example, continued
x_k      f(x_k)
0.000    0.500
0.600    0.081
1.200    0.216
0.754    0.073
0.721    0.071
0.692    0.071
0.707    0.071
Convergence is not monotone, but it is superlinear.
Successive parabolic interpolation is Newton’s method applied to a quadratic model of the data (r ≈ 1.324). We turn to Newton’s method next (r = 2).
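A Python sketch of the method (mine, not from the slides), starting from the table's three points 0.0, 0.6, 1.2 and replacing the oldest point each iteration:

```python
import math

def f(x):
    return 0.5 - x * math.exp(-x * x)

# Three starting points taken from the table above (oldest first)
pts = [0.0, 0.6, 1.2]
history = []
for _ in range(20):
    x0, x1, x2 = pts
    f0, f1, f2 = f(x0), f(x1), f(x2)
    # Vertex of the parabola interpolating the three points
    num = (x1 - x0)**2 * (f1 - f2) - (x1 - x2)**2 * (f1 - f0)
    den = (x1 - x0) * (f1 - f2) - (x1 - x2) * (f1 - f0)
    x_new = x1 - 0.5 * num / den
    history.append(x_new)
    pts = [x1, x2, x_new]       # new point replaces oldest of the three
    if abs(x_new - x2) < 1e-6:  # stop when successive iterates agree
        break
x_min = history[-1]
```

The first computed point is ≈ 0.754, matching the table, and the iterates converge superlinearly to x* = 1/√2 ≈ 0.7071.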
Matlab: Successive Parabolic Interpolation: Naive Bracketing
Convergence Behavior
❑ Question 1: What type of convergence is this?
❑ Question 2: What does this plot say about conditioning of the problem?
Matlab: Successive Parabolic Interpolation – Replace Oldest
Newton’s Method
Another local quadratic approximation is truncated Taylor series
f(x + h) ≈ f(x) + f'(x) h + (f''(x)/2) h^2
By differentiation, minimum of this quadratic function of h is given by h = -f'(x)/f''(x)
Suggests iteration scheme
x_{k+1} = x_k - f'(x_k)/f''(x_k)
which is Newton’s method for solving nonlinear equation f'(x) = 0
Newton’s method for finding a minimum normally has quadratic convergence rate, but must be started close enough to solution to converge
< interactive example >
Michael T. Heath Scientific Computing 27 / 74
Example: Newton’s Method
Use Newton’s method to minimize f(x) = 0.5 - x exp(-x^2)
First and second derivatives of f are given by
f'(x) = (2x^2 - 1) exp(-x^2)
and
f''(x) = 2x(3 - 2x^2) exp(-x^2)
Newton iteration for zero of f' is given by
x_{k+1} = x_k - (2x_k^2 - 1) / (2x_k(3 - 2x_k^2))
Using starting guess x_0 = 1, we obtain

x_k     f(x_k)
1.000   0.132
0.500   0.111
0.700   0.071
0.707   0.071
Michael T. Heath Scientific Computing 28 / 74
Matlab Example: newton.m
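The newton.m demo itself is not reproduced in this transcript; the iteration is short enough to check with a Python stand-in (illustrative, not Heath's code). Note the exponential factors cancel in f'/f'', leaving the rational update from the slide:

```python
import math

# Minimize f(x) = 0.5 - x*exp(-x^2) by finding a zero of f'(x)
# via Newton's method: x_{k+1} = x_k - f'(x_k)/f''(x_k).
def fprime(x):
    return (2 * x * x - 1) * math.exp(-x * x)

def fsecond(x):
    return 2 * x * (3 - 2 * x * x) * math.exp(-x * x)

x = 1.0
for _ in range(5):
    x = x - fprime(x) / fsecond(x)   # iterates: 0.5, 0.7, 0.707072, ...
print(round(x, 6))                   # 0.707107, i.e. 1/sqrt(2)
```

The first three iterates reproduce the table above; the doubling of correct digits per step is the quadratic convergence rate.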
Safeguarded Methods
As with nonlinear equations in one dimension, slow-but-sure and fast-but-risky optimization methods can be combined to provide both safety and efficiency
Most library routines for one-dimensional optimization are based on this hybrid approach
Popular combination is golden section search and successive parabolic interpolation, for which no derivatives are required
Michael T. Heath Scientific Computing 29 / 74
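This hybrid is exactly what Brent's method implements, and library routines expose it directly; for example, SciPy's minimize_scalar (shown here as an illustration; SciPy is not part of the lecture, and the bracket triple is my choice):

```python
import math
from scipy.optimize import minimize_scalar

f = lambda x: 0.5 - x * math.exp(-x * x)

# Brent's method: golden-section safety net + parabolic-interpolation speed.
# bracket=(a, b, c) requires f(b) < f(a) and f(b) < f(c).
res = minimize_scalar(f, bracket=(0.0, 0.6, 1.2), method='brent')
print(round(res.x, 4))   # 0.7071
```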
Methods for Multi-Dimensional Problems
Unconstrained Optimization, Nonlinear Least Squares, Constrained Optimization
Steepest Descent Method
Let f : R^n → R be real-valued function of n real variables
At any point x where gradient vector is nonzero, negative gradient, -∇f(x), points downhill toward lower values of f
In fact, -∇f(x) is locally direction of steepest descent: f decreases more rapidly along direction of negative gradient than along any other
Steepest descent method: starting from initial guess x_0, successive approximate solutions given by
x_{k+1} = x_k - α_k ∇f(x_k)
where α_k is line search parameter that determines how far to go in given direction
Michael T. Heath Scientific Computing 31 / 74
Steepest Descent, continued
Given descent direction, such as negative gradient, determining appropriate value for α_k at each iteration is one-dimensional minimization problem
min over α_k of f(x_k - α_k ∇f(x_k))
that can be solved by methods already discussed
Steepest descent method is very reliable: it can always make progress provided gradient is nonzero
But method is myopic in its view of function’s behavior, and resulting iterates can zigzag back and forth, making very slow progress toward solution
In general, convergence rate of steepest descent is only linear, with constant factor that can be arbitrarily close to 1
Michael T. Heath Scientific Computing 32 / 74
Example: Steepest Descent
Use steepest descent method to minimize
f(x) = 0.5 x_1^2 + 2.5 x_2^2

Gradient is given by ∇f(x) = [x_1, 5 x_2]^T

Taking x_0 = [5, 1]^T, we have ∇f(x_0) = [5, 5]^T

Performing line search along negative gradient direction,
min over α_0 of f(x_0 - α_0 ∇f(x_0))
exact minimum along line is given by α_0 = 1/3, so next approximation is x_1 = [3.333, -0.667]^T
Michael T. Heath Scientific Computing 33 / 74
Example, continued
x_k               f(x_k)    ∇f(x_k)
 5.000   1.000    15.000     5.000   5.000
 3.333  -0.667     6.667     3.333  -3.333
 2.222   0.444     2.963     2.222   2.222
 1.481  -0.296     1.317     1.481  -1.481
 0.988   0.198     0.585     0.988   0.988
 0.658  -0.132     0.260     0.658  -0.658
 0.439   0.088     0.116     0.439   0.439
 0.293  -0.059     0.051     0.293  -0.293
 0.195   0.039     0.023     0.195   0.195
 0.130  -0.026     0.010     0.130  -0.130
Michael T. Heath Scientific Computing 34 / 74
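The zigzag iterates can be reproduced with a short NumPy sketch (illustrative; for a quadratic f(x) = 0.5 x^T A x, the exact line-search parameter has the closed form α = g^T g / g^T A g, which is used here in place of a 1-D solver):

```python
import numpy as np

# f(x) = 0.5*x1^2 + 2.5*x2^2 = 0.5 * x^T A x  with  A = diag(1, 5)
A = np.diag([1.0, 5.0])

x = np.array([5.0, 1.0])
for _ in range(3):
    g = A @ x                        # gradient: [x1, 5*x2]
    alpha = (g @ g) / (g @ (A @ g))  # exact line search for a quadratic
    x = x - alpha * g
print(np.round(x, 3))                # [ 1.481 -0.296], matching the table
```

Each step scales the iterate by 2/3 while flipping the sign of the second component, which is the zigzag visible in the table.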
Example, continued
< interactive example >
Michael T. Heath Scientific Computing 35 / 74
Newton’s Method
Broader view can be obtained by local quadratic approximation, which is equivalent to Newton’s method
In multidimensional optimization, we seek zero of gradient, so Newton iteration has form
x_{k+1} = x_k - H_f^{-1}(x_k) ∇f(x_k)
where H_f(x) is Hessian matrix of second partial derivatives of f,
{H_f(x)}_ij = ∂^2 f(x) / (∂x_i ∂x_j)
Michael T. Heath Scientific Computing 36 / 74
Newton’s Method, continued
Do not explicitly invert Hessian matrix, but instead solve linear system
H_f(x_k) s_k = -∇f(x_k)
for Newton step s_k, then take as next iterate
x_{k+1} = x_k + s_k
Convergence rate of Newton’s method for minimization is normally quadratic
As usual, Newton’s method is unreliable unless started close enough to solution to converge
< interactive example >
Michael T. Heath Scientific Computing 37 / 74
Example: Newton’s Method
Use Newton’s method to minimize
f(x) = 0.5 x_1^2 + 2.5 x_2^2

Gradient and Hessian are given by
∇f(x) = [x_1, 5 x_2]^T  and  H_f(x) = [1 0; 0 5]

Taking x_0 = [5, 1]^T, we have ∇f(x_0) = [5, 5]^T

Linear system for Newton step is [1 0; 0 5] s_0 = [-5, -5]^T, so
x_1 = x_0 + s_0 = [5, 1]^T + [-5, -1]^T = [0, 0]^T, which is exact solution for this problem, as expected for quadratic function
Michael T. Heath Scientific Computing 38 / 74
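The example can be checked with NumPy (an illustration, not the lecture's code), solving H s = -∇f as the slide prescribes rather than forming H^{-1}:

```python
import numpy as np

# f(x) = 0.5*x1^2 + 2.5*x2^2: gradient [x1, 5*x2], constant Hessian diag(1, 5)
def grad(x):
    return np.array([x[0], 5.0 * x[1]])

H = np.diag([1.0, 5.0])

x = np.array([5.0, 1.0])
s = np.linalg.solve(H, -grad(x))   # Newton step: solve, don't invert
x = x + s
print(x)                           # [0. 0.] -- exact minimum in one step
```

One step suffices because the quadratic model is not an approximation here: it is the objective itself.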
Newton’s Method, continued
In principle, line search parameter is unnecessary with Newton’s method, since quadratic model determines length, as well as direction, of step to next approximate solution
When started far from solution, however, it may still be advisable to perform line search along direction of Newton step s_k to make method more robust (damped Newton)
Once iterates are near solution, then α_k = 1 should suffice for subsequent iterations
Michael T. Heath Scientific Computing 39 / 74
Newton’s Method, continued
If objective function f has continuous second partial derivatives, then Hessian matrix H_f is symmetric, and near minimum it is positive definite
Thus, linear system for step to next iterate can be solved by Cholesky factorization in only about half of work required for LU factorization
Far from minimum, H_f(x_k) may not be positive definite, so Newton step s_k may not be descent direction for function, i.e., we may not have
∇f(x_k)^T s_k < 0
In this case, alternative descent direction can be computed, such as negative gradient or direction of negative curvature, and line search performed along it
Michael T. Heath Scientific Computing 40 / 74
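That safeguard can be sketched as follows (illustrative Python, not from the lecture: the backtracking parameters, function names, and the nonconvex test function are my own choices). Attempting a Cholesky factorization doubles as the positive-definiteness test: if it fails, fall back to the negative gradient, then apply a crude backtracking line search.

```python
import numpy as np

def safeguarded_step(grad, hess, x):
    """Return a descent direction: Newton step if the Hessian is positive
    definite (checked via Cholesky), otherwise the negative gradient."""
    g, H = grad(x), hess(x)
    try:
        L = np.linalg.cholesky(H)          # raises LinAlgError if H not SPD
        s = np.linalg.solve(L.T, np.linalg.solve(L, -g))
    except np.linalg.LinAlgError:
        s = -g                             # fallback: steepest descent
    if g @ s >= 0:                         # insist on a descent direction
        s = -g
    return s

def minimize(f, grad, hess, x, iters=20):
    for _ in range(iters):
        s = safeguarded_step(grad, hess, x)
        alpha = 1.0
        while f(x + alpha * s) > f(x):     # crude backtracking line search
            alpha *= 0.5
        x = x + alpha * s
    return x

# Hypothetical test function, nonconvex near x1 = 0: the Hessian there is
# indefinite, so early iterations exercise the gradient fallback.
f = lambda x: x[0]**4 - 2*x[0]**2 + x[0] + 2.5*x[1]**2
grad = lambda x: np.array([4*x[0]**3 - 4*x[0] + 1, 5*x[1]])
hess = lambda x: np.array([[12*x[0]**2 - 4, 0.0], [0.0, 5.0]])

x = minimize(f, grad, hess, np.array([0.1, 1.0]))
print(np.round(grad(x), 6))   # ≈ [0, 0]: iterates reach a stationary point
```

Once the iterates enter the region where the Hessian is positive definite, the method reverts to pure Newton steps and converges quadratically.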