Nonsmooth, Nonconvex Optimization: Algorithms and Examples

Michael L. Overton
Courant Institute of Mathematical Sciences
New York University

Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture

Based on my research work with Jim Burke (Washington), Adrian Lewis (Cornell) and others
Introduction

■ Nonsmooth, Nonconvex Optimization
■ A Simple Nonconvex Example
■ Failure of Gradient Descent in Nonsmooth Case
■ Armijo-Wolfe Line Search
■ Failure of Gradient Method: Simple Convex Example
■ Illustration of Failure and Success
■ Methods Suitable for Nonsmooth Functions
■ Gradient Sampling
■ Quasi-Newton Methods
■ A Difficult Nonconvex Problem from Nesterov
■ Limited Memory Methods
■ Concluding Remarks
Nonsmooth, Nonconvex Optimization
Problem: find x that locally minimizes f, where f : R^n → R is

■ Continuous
■ Not differentiable everywhere; in particular, often not differentiable at local minimizers
■ Not convex
■ Usually, but not always, locally Lipschitz: for all x there exists L_x s.t. |f(y) − f(z)| ≤ L_x ‖y − z‖ for all y, z near x
■ Otherwise, no structure assumed

Lots of interesting applications.

Any locally Lipschitz function is differentiable almost everywhere on its domain. So, with high probability, we can evaluate the gradient at any given point.

What happens if we simply use gradient descent (steepest descent) with a standard line search?
A Simple Nonconvex Example
f(x) = 10|x2 − x1²| + (1 − x1)²

[Contour plot over [−2, 2] × [−2, 2] showing the steepest descent iterates]
On this example, the steepest descent iterates invariably converge to a nonstationary point.
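This stall is easy to reproduce numerically. Below is a minimal sketch (Python/NumPy; the starting point and line search parameters are my own choices, not the slides') applying steepest descent with Armijo backtracking to this example: the iterates reach the curve x2 = x1² and then stagnate at a nonstationary point, well away from the minimizer (1, 1).

```python
import numpy as np

def f(x):
    # the example from the slide: f(x) = 10|x2 - x1^2| + (1 - x1)^2
    return 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

def grad(x):
    # gradient, defined wherever x2 != x1^2 (i.e., almost everywhere)
    s = np.sign(x[1] - x[0] ** 2)
    return np.array([-20.0 * s * x[0] - 2.0 * (1.0 - x[0]), 10.0 * s])

x = np.array([-1.0, 1.5])
for _ in range(200):
    g = grad(x)
    d = -g                                  # steepest descent direction
    t = 1.0
    # Armijo backtracking: accept t once f(x + t d) <= f(x) + c1 t g^T d
    while f(x + t * d) > f(x) + 1e-4 * t * (g @ d) and t > 1e-16:
        t *= 0.5
    x = x + t * d

print(x, f(x))   # far from (1, 1), with f(x) well above the optimal value 0
```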
Failure of Gradient Descent in Nonsmooth Case
It has been known for decades that gradient descent may converge to nonstationary points when f is nonsmooth, even if it is convex.
■ V.F. Dem'janov and V.N. Malozemov, 1970
■ P. Wolfe, 1975
■ J.-B. Hiriart-Urruty and C. Lemaréchal, 1993
But these are all examples cooked up to defeat exact line searches from a specific starting point.
Failure can be avoided by using sufficiently short steplengths (N.Z. Shor, 1970s), but this is slow.
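Shor's remedy can be sketched in a few lines: a subgradient method with preset, diminishing steplengths and no line search at all. The test function and the steplength schedule t_k = 1/(k+1) below are my own illustrative choices; note that the iterates only approach the minimizer at a slow O(1/k) rate.

```python
import numpy as np

def f(x):
    # a simple convex nonsmooth function (my choice), minimized at the origin
    return abs(x[0]) + 2.0 * abs(x[1])

def subgrad(x):
    # a subgradient of f; equals the gradient wherever f is differentiable
    return np.array([np.sign(x[0]), 2.0 * np.sign(x[1])])

x = np.array([1.3, 0.7])
for k in range(1000):
    # preset diminishing steplengths t_k = 1/(k+1): no line search at all
    x = x - (1.0 / (k + 1)) * subgrad(x)

print(x, f(x))   # near the minimizer, but progress per step shrinks like 1/k
```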
Armijo-Wolfe Line Search
Given x with f differentiable at x and d with ∇f(x)^T d < 0, and parameters 0 < c1 < c2 < 1, find a steplength t so that

■ sufficient decrease in function value: f(x + td) < f(x) + c1 t ∇f(x)^T d (L. Armijo, 1966)
■ sufficient increase in directional derivative: f is differentiable at x + td and ∇f(x + td)^T d > c2 ∇f(x)^T d (P. Wolfe, 1969)

Assuming inf_t f(x + td) is bounded below,

■ the Armijo condition holds for sufficiently small t as long as f is continuous
■ the Wolfe condition holds for sufficiently large t as long as f is differentiable
■ the intervals where each holds overlap

so combining the two conditions leads to a convenient, convergent bracketing line search (M.J.D. Powell, 1976).

This extends to the locally Lipschitz case (A.S. Lewis and M.L.O., 2013).

Searching for "Armijo-Wolfe" on the web, we found Melissa Armijo-Wolfe's LinkedIn page!
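The bracketing scheme just described can be sketched as follows — a simplified bisection version in Python/NumPy (the function name and the defaults c1 = 1e-4, c2 = 0.9 are my choices, not the slides'). It doubles t until an upper bracket exists and bisects whenever either condition fails.

```python
import numpy as np

def armijo_wolfe(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=60):
    """Bracketing search for a steplength t with
         f(x + t d) <  f(x) + c1 t grad(x)^T d   (Armijo: sufficient decrease)
         grad(x + t d)^T d > c2 grad(x)^T d      (Wolfe: derivative increase)
       assuming grad(x)^T d < 0."""
    f0, g0d = f(x), grad(x) @ d
    lo, hi, t = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + t * d) >= f0 + c1 * t * g0d:
            hi = t                               # Armijo fails: t too long
        elif grad(x + t * d) @ d <= c2 * g0d:
            lo = t                               # Wolfe fails: t too short
        else:
            return t                             # both conditions hold
        t = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * t
    return t                                     # fallback: best bracketed t

# usage on a smooth quadratic, along the steepest descent direction
fq = lambda x: x @ x
gq = lambda x: 2.0 * x
x0 = np.array([1.0, 0.0])
d0 = -gq(x0)
t = armijo_wolfe(fq, gq, x0, d0)   # returns t = 0.5 here
```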
Failure of Gradient Method: Simple Convex Example
Let f(x) = a|x1| + x2, with a ≥ 1. Although f is unbounded below, it is bounded below along any direction d = −∇f(x).
[Surface plot of f(u, v) = 5|u| + v over [−10, 10] × [−10, 10]]
Failure of Gradient Method: Simple Convex Example
IntroductionNonsmooth,NonconvexOptimization
A Simple NonconvexExample
Failure of GradientDescent inNonsmooth CaseArmijo-Wolfe LineSearchFailure of GradientMethod: SimpleConvex Example
Illustration of Failureand SuccessMethods Suitable forNonsmoothFunctions
Gradient Sampling
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
Concluding Remarks
7 / 71
Let f(x) = a|x1|+ x2, with a ≥ 1. Although f is unboundedbelow, it is bounded below along any direction d = −∇f(x).
105
0
u
-5-10-10
-5
v
05
10
60
50
40
30
20
10
0
-10
5|u|
+v
Theorem. Let x^(0) satisfy x^(0)_1 ≠ 0 and define x^(k) ∈ R^2 by

x^(k+1) = x^(k) + t_k d^(k), where d^(k) = −∇f(x^(k)),

and t_k is any steplength satisfying the Armijo and Wolfe conditions with Armijo parameter c1. If

c1(a^2 + 1) > 1,

then x^(k) converges to a point x̄ with x̄_1 = 0, even though f is unbounded below.

(Azam Asl and M.L.O., 2017)
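A minimal experiment illustrates the theorem (Python/NumPy; the bisection line search and all names below are my own sketch of an Armijo-Wolfe search, and the tie-break at x1 = 0 is a hypothetical detail, not from the slides). With a = 5 and c1 = 0.1 we have c1(a² + 1) = 2.6 > 1, and the first coordinate of the iterates collapses to zero while f stalls, even though f is unbounded below.

```python
import numpy as np

a, c1, c2 = 5.0, 0.1, 0.9        # c1 * (a**2 + 1) = 2.6 > 1: failure predicted

def f(x):
    return a * abs(x[0]) + x[1]

def grad(x):
    s = np.sign(x[0]) if x[0] != 0.0 else 1.0   # tie-break at the kink
    return np.array([a * s, 1.0])

def armijo_wolfe(x, d, max_iter=60):
    # bisection for the Armijo and (weak) Wolfe conditions
    f0, g0d = f(x), grad(x) @ d
    lo, hi, t = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + t * d) >= f0 + c1 * t * g0d:
            hi = t                               # Armijo fails
        elif grad(x + t * d) @ d <= c2 * g0d:
            lo = t                               # Wolfe fails
        else:
            return t
        t = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * t
    return t

x = np.array([-2.264, 5.0])
for _ in range(100):
    d = -grad(x)
    x = x + armijo_wolfe(x, d) * d

print(x, f(x))   # x[0] has collapsed toward 0 and f stalls above 0
```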
Illustration of Failure and Success
[Iterate plots for f(u, v) = a|u| + v from x^(0) = (−2.264, 5) with c1 = 0.1; the panel titles report τ = 0.064 and τ = −0.125]

a = 5, c1 = 0.1: x^(k) → x̄, a nonstationary point (failure; c1(a^2 + 1) = 2.6 > 1)
a = 2, c1 = 0.1: f(x^(k)) ↓ −∞ (success; c1(a^2 + 1) = 0.5 < 1)
Methods Suitable for Nonsmooth Functions
Exploit the gradient information obtained at several points, not just at one point:
■ Bundle methods (C. Lemaréchal, K.C. Kiwiel, 1980s–): extensive practical use and theoretical analysis, but complicated in the nonconvex case
■ Gradient sampling: an easily stated method with a nice convergence theory (J.V. Burke, A.S. Lewis, M.L.O., 2005; K.C. Kiwiel, 2007), but computationally intensive
■ BFGS: the traditional workhorse for smooth optimization; works amazingly well for nonsmooth optimization too, but has very limited convergence theory
Gradient Sampling
■ The Gradient Sampling Method
■ With First Phase of Gradient Sampling
■ With Second Phase of Gradient Sampling
■ The Clarke Subdifferential
■ Example: note that 0 ∈ ∂_C f(x) at x = [1; 1]^T
■ Gradient Sampling: A Stabilized Steepest Descent Method
■ Convergence of Gradient Sampling Method
■ Extensions
■ Some Gradient Sampling Success Stories
The Gradient Sampling Method
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
![Page 34: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/34.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
![Page 35: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/35.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
◆ Set G = {∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}, samplingu1, · · · , um uniformly from the unit ball
![Page 36: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/36.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
◆ Set G = {∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}, samplingu1, · · · , um uniformly from the unit ball
◆ Set g = argmin{||g|| : g ∈ conv(G)}
![Page 37: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/37.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
◆ Set G = {∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}, samplingu1, · · · , um uniformly from the unit ball
◆ Set g = argmin{||g|| : g ∈ conv(G)}◆ If ‖g ≤ τ , break out of loop.
![Page 38: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/38.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
◆ Set G = {∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}, samplingu1, · · · , um uniformly from the unit ball
◆ Set g = argmin{||g|| : g ∈ conv(G)}◆ If ‖g ≤ τ , break out of loop.◆ Backtracking line search: set d = −g and replace x by x+ td,
with t ∈ {1, 12 ,
14 , . . .} and f(x+ td) < f(x)− βt‖g‖
![Page 39: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/39.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
◆ Set G = {∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}, samplingu1, · · · , um uniformly from the unit ball
◆ Set g = argmin{||g|| : g ∈ conv(G)}◆ If ‖g ≤ τ , break out of loop.◆ Backtracking line search: set d = −g and replace x by x+ td,
with t ∈ {1, 12 ,
14 , . . .} and f(x+ td) < f(x)− βt‖g‖
◆ If f is not differentiable at x+ td, replace x+ td by a nearbypoint where f is differentiable.1
![Page 40: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/40.jpg)
The Gradient Sampling Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
11 / 71
Fix sample size m ≥ n+ 1, line search parameter β ∈ (0, 1), reductionfactors µ ∈ (0, 1) and θ ∈ (0, 1).
Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.
Repeat (outer loop)
■ Repeat (inner loop: gradient sampling with fixed ǫ):
◆ Set G = {∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}, samplingu1, · · · , um uniformly from the unit ball
◆ Set g = argmin{||g|| : g ∈ conv(G)}◆ If ‖g ≤ τ , break out of loop.◆ Backtracking line search: set d = −g and replace x by x+ td,
with t ∈ {1, 12 ,
14 , . . .} and f(x+ td) < f(x)− βt‖g‖
◆ If f is not differentiable at x+ td, replace x+ td by a nearbypoint where f is differentiable.1
■ New phase: set ǫ = µǫ and τ = θτ .
![Page 41: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/41.jpg)
The Gradient Sampling Method
11 / 71
Fix sample size m ≥ n + 1, line search parameter β ∈ (0, 1), reduction factors µ ∈ (0, 1) and θ ∈ (0, 1).

Initialize sampling radius ε > 0, tolerance τ > 0, iterate x.

Repeat (outer loop)

■ Repeat (inner loop: gradient sampling with fixed ε):

◆ Set G = {∇f(x), ∇f(x + εu₁), . . . , ∇f(x + εuₘ)}, sampling u₁, . . . , uₘ uniformly from the unit ball
◆ Set g = argmin{‖g‖ : g ∈ conv(G)}
◆ If ‖g‖ ≤ τ, break out of loop.
◆ Backtracking line search: set d = −g and replace x by x + td, with t ∈ {1, 1/2, 1/4, . . .} and f(x + td) < f(x) − βt‖g‖²
◆ If f is not differentiable at x + td, replace x + td by a nearby point where f is differentiable.¹

■ New phase: set ε = µε and τ = θτ.

J.V. Burke, A.S. Lewis and M.L.O., SIOPT, 2005.

¹Needed in theory, but typically not in practice.
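The loop above can be sketched numerically. This is a minimal illustration, not the reference implementation from the paper: it runs the method on the nonsmooth example f(x) = 10|x₂ − x₁²| + (1 − x₁)² used later in the lecture, and solves the min-norm subproblem over conv(G) by projected gradient on the simplex, a crude stand-in for the QP solver one would normally use.

```python
import numpy as np

def f(x):
    # Nonsmooth, nonconvex example used later in the lecture
    return 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

def grad_f(x):
    # Gradient of f wherever x2 != x1^2 (f is differentiable there)
    s = np.sign(x[1] - x[0] ** 2)
    return 10.0 * s * np.array([-2.0 * x[0], 1.0]) + np.array([-2.0 * (1.0 - x[0]), 0.0])

def project_simplex(v):
    # Euclidean projection onto {lam >= 0, sum(lam) = 1} (sort-based rule)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def min_norm_element(G, iters=300):
    # Smallest-norm element of conv(rows of G): minimize ||G^T lam||^2 over the
    # simplex by projected gradient (stand-in for a proper QP solver)
    lam = np.full(len(G), 1.0 / len(G))
    L = max(np.linalg.norm(G @ G.T, 2), 1e-12)   # step size from spectral norm
    for _ in range(iters):
        lam = project_simplex(lam - (G @ (G.T @ lam)) / L)
    return G.T @ lam

def gradient_sampling(x, m=3, beta=1e-4, mu=0.5, theta=0.5, eps=0.1, tau=1e-6,
                      phases=10, inner=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    for _ in range(phases):                      # outer loop
        for _ in range(inner):                   # inner loop: fixed eps
            U = rng.normal(size=(m, n))          # sample m points ...
            U /= np.linalg.norm(U, axis=1, keepdims=True)
            U *= rng.uniform(size=(m, 1)) ** (1.0 / n)   # ... uniform in the unit ball
            G = np.vstack([grad_f(x)] + [grad_f(x + eps * u) for u in U])
            g = min_norm_element(G)
            if np.linalg.norm(g) <= tau:
                break                            # approximately eps-stationary
            t = 1.0                              # backtracking line search, d = -g
            while f(x - t * g) >= f(x) - beta * t * (g @ g) and t > 1e-20:
                t *= 0.5
            x = x - t * g
        eps *= mu                                # new phase: shrink radius
        tau *= theta
    return x
```

Starting from the classic point [−1.2, 1], the iterates track the curve x₂ = x₁² toward the minimizer [1, 1]; a robust implementation would also perturb x when f happens to be nondifferentiable there, as the algorithm statement requires.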
With First Phase of Gradient Sampling
12 / 71
[Figure: contour plot of f(x) = 10|x₂ − x₁²| + (1 − x₁)² over [−2, 2] × [−2, 2], with the first phase of gradient sampling shown.]
With Second Phase of Gradient Sampling
13 / 71
[Figure: contour plot of f(x) = 10|x₂ − x₁²| + (1 − x₁)² over [−2, 2] × [−2, 2], with the second phase of gradient sampling shown.]
The Clarke Subdifferential
14 / 71
Assume f : Rⁿ → R is locally Lipschitz, and let D = {x ∈ Rⁿ : f is differentiable at x}.

Rademacher's Theorem: Rⁿ \ D has measure zero.

The Clarke subdifferential of f at x is

∂Cf(x) = conv { lim_j ∇f(x_j) : x_j → x, x_j ∈ D }.

F.H. Clarke, 1973 (he used the name "generalized gradient").

If f is continuously differentiable at x, then ∂Cf(x) = {∇f(x)}.
If f is convex, ∂Cf is the subdifferential of convex analysis.

We say x is Clarke stationary for f if 0 ∈ ∂Cf(x).

Key point: the convex hull of the set G generated by Gradient Sampling is a surrogate for ∂Cf.
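To see why sampled gradients act as a surrogate for ∂Cf, consider a one-dimensional illustration (not from the slides): for f(x) = |x|, gradients sampled near the kink at 0 take both values ±1, so their convex hull is an interval containing 0, while away from the kink the hull collapses to the single gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_abs(x):
    return np.sign(x)        # gradient of |x| wherever x != 0

eps = 0.1
# Sample gradients at 50 random points within eps of the kink (x = 0)
# and within eps of a smooth point (x = 1).
g_kink = grad_abs(0.0 + eps * rng.uniform(-1, 1, size=50))
g_far = grad_abs(1.0 + eps * rng.uniform(-1, 1, size=50))

# In 1-D the convex hull of the sampled gradients is just an interval.
hull_kink = (g_kink.min(), g_kink.max())   # spans both -1 and +1, contains 0
hull_far = (g_far.min(), g_far.max())      # collapses to {+1}
```

Near the kink the min-norm element of the hull is 0, which is exactly how the inner loop of gradient sampling detects approximate Clarke stationarity.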
Example
15 / 71
Let

f(x) = 10 |x₂ − x₁²| + (1 − x₁)².

For x with x₂ ≠ x₁², f is differentiable with gradient

∇f(x) = 10 sgn(x₂ − x₁²) [−2x₁; 1] + [−2(1 − x₁); 0],

so ∂Cf(x) = {∇f(x)}. For x with x₂ = x₁², there are two limiting gradients, namely

±10 [−2x₁; 1] + [−2(1 − x₁); 0],

so ∂Cf(x) consists of the convex hull of these two vectors.

The unique x for which 0 ∈ ∂Cf(x) is x = [1; 1]T, so this is the unique Clarke stationary point of f (it follows that it is the global minimizer).
Note that 0 ∈ ∂Cf(x) at x = [1; 1]T
16 / 71
[Figure: contour plot of f(x) = 10|x₂ − x₁²| + (1 − x₁)² over [−2, 2] × [−2, 2]; the minimizer x = [1; 1]T is the unique Clarke stationary point.]
Grad. Samp.: A Stabilized Steepest Descent Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
17 / 71
Lemma. Let C be a compact convex set and ‖ · ‖ = ‖ · ‖2. Then
−dist(0, C) = min‖d‖≤1
maxg∈C
gTd
Proof.−dist(0, C) = −min
g∈C‖g‖
= −ming∈C
max‖d‖≤1
gTd
= − max‖d‖≤1
ming∈C
gTd
= − max‖d‖≤1
ming∈C
gT (−d)
= min‖d‖≤1
maxg∈C
gT d.
Note: the distance is nonnegative, and zero iff 0 ∈ C.Otherwise, equality is attained by g = ΠC(0), d = −g/|g‖.
![Page 65: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/65.jpg)
Grad. Samp.: A Stabilized Steepest Descent Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
17 / 71
Lemma. Let C be a compact convex set and ‖ · ‖ = ‖ · ‖2. Then
−dist(0, C) = min‖d‖≤1
maxg∈C
gTd
Proof.−dist(0, C) = −min
g∈C‖g‖
= −ming∈C
max‖d‖≤1
gTd
= − max‖d‖≤1
ming∈C
gTd
= − max‖d‖≤1
ming∈C
gT (−d)
= min‖d‖≤1
maxg∈C
gT d.
Note: the distance is nonnegative, and zero iff 0 ∈ C.Otherwise, equality is attained by g = ΠC(0), d = −g/|g‖.Ordinary steepest descent: C = {∇f(x)}.
![Page 66: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/66.jpg)
Grad. Samp.: A Stabilized Steepest Descent Method
Introduction
Gradient Sampling
The GradientSampling Method
With First Phase ofGradient Sampling
With Second Phaseof GradientSampling
The ClarkeSubdifferential
Example
Note that0 ∈ ∂Cf(x) at
x = [1; 1]T
Grad. Samp.: AStabilized SteepestDescent MethodConvergence ofGradient SamplingMethod
ExtensionsSome GradientSampling SuccessStories
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethods
17 / 71
Lemma. Let C be a compact convex set and ‖ · ‖ = ‖ · ‖2. Then
−dist(0, C) = min‖d‖≤1
maxg∈C
gTd
Proof.−dist(0, C) = −min
g∈C‖g‖
= −ming∈C
max‖d‖≤1
gTd
= − max‖d‖≤1
ming∈C
gTd
= − max‖d‖≤1
ming∈C
gT (−d)
= min‖d‖≤1
maxg∈C
gT d.
Note: the distance is nonnegative, and zero iff 0 ∈ C.Otherwise, equality is attained by g = ΠC(0), d = −g/|g‖.Ordinary steepest descent: C = {∇f(x)}.Gradient sampling: C = conv(G)
= conv({∇f(x),∇f(x+ ǫu1), . . . ,∇f(x+ ǫum)}) .
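The min-norm computation in the lemma is exactly what the method solves at each step. A minimal numpy sketch (the function names and the Frank-Wolfe solver are my choices, not from the slides): find g = Π_C(0) for C = conv of the sampled gradients by minimizing ‖Gᵀλ‖² over the simplex, then take d = −g/‖g‖.

```python
import numpy as np

def min_norm_in_hull(G, iters=50000):
    """Minimum-norm point of conv(rows of G): minimize ||G^T lam||^2
    over the simplex {lam >= 0, sum(lam) = 1} by Frank-Wolfe."""
    m = G.shape[0]
    lam = np.full(m, 1.0 / m)
    Q = G @ G.T                       # Gram matrix of the sampled gradients
    for k in range(iters):
        i = int(np.argmin(Q @ lam))   # simplex vertex minimizing the linearization
        gamma = 2.0 / (k + 2.0)       # standard Frank-Wolfe step size
        lam *= 1.0 - gamma
        lam[i] += gamma
    return G.T @ lam                  # the point g = Pi_C(0)

# Two sampled gradients; the min-norm point of their hull is the midpoint.
G = np.array([[1.0, 0.0], [0.0, 1.0]])
g = min_norm_in_hull(G)
d = -g / np.linalg.norm(g)            # stabilized steepest descent direction
```

Frank-Wolfe is chosen here only because it keeps the sketch numpy-only; any small QP solver applied to the simplex problem gives the same g.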
Convergence of Gradient Sampling Method

18 / 71

Theorem. Suppose that f : Rⁿ → R

■ is locally Lipschitz,
■ is continuously differentiable on an open, full-measure subset of Rⁿ,
■ has bounded level sets.

Then, with probability one, f is differentiable at the sampled points and the line search always terminates. Moreover, if the sequence of iterates {xₖ} converges to some point x̄, then, with probability one,

■ the inner loop always terminates, so the sequences of sampling radii {ε} and tolerances {τ} converge to zero, and
■ x̄ is Clarke stationary for f, i.e., 0 ∈ ∂Cf(x̄).

J.V. Burke, A.S. Lewis and M.L.O., SIOPT, 2005.

Drop the assumption that f has bounded level sets. Then, with probability one, either f(xₖ) → −∞, or every cluster point of the sequence of iterates is Clarke stationary.

K.C. Kiwiel, SIOPT, 2007.
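To make the theorem's setting concrete, here is a toy gradient sampling loop on f(x) = |x₁| + 2|x₂|, which is locally Lipschitz, continuously differentiable off the axes (an open, full-measure set), and has bounded level sets. This is a sketch with my own naming, not the authors' implementation; SciPy's SLSQP stands in for the min-norm subproblem, and the line search is a simple Armijo-style backtracking.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return abs(x[0]) + 2.0 * abs(x[1])

def grad(x):
    # defined wherever neither component is zero (an open, full-measure set)
    return np.array([np.sign(x[0]), 2.0 * np.sign(x[1])])

def min_norm_point(G):
    """Minimum-norm point of conv(rows of G), via a small simplex QP."""
    m = G.shape[0]
    res = minimize(lambda lam: ((G.T @ lam) ** 2).sum(),
                   np.full(m, 1.0 / m),
                   jac=lambda lam: 2.0 * (G @ (G.T @ lam)),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
                   method="SLSQP")
    return G.T @ res.x

def gradient_sampling(x, eps=0.1, tau=1e-3, iters=200, m=6, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        # gradients at x and at m random points within the sampling radius
        pts = [x] + [x + eps * rng.uniform(-1.0, 1.0, size=x.size) for _ in range(m)]
        G = np.array([grad(p) for p in pts])
        g = min_norm_point(G)
        if np.linalg.norm(g) <= tau:        # approximately Clarke stationary at this radius:
            eps, tau = 0.5 * eps, 0.5 * tau  # shrink sampling radius and tolerance
            continue
        d = -g / np.linalg.norm(g)
        t = 1.0                              # backtracking (Armijo-style) line search
        while f(x + t * d) > f(x) - 1e-4 * t * np.linalg.norm(g) and t > 1e-12:
            t *= 0.5
        x = x + t * d
    return x

x = gradient_sampling(np.array([1.0, 0.7]))
```

Near the minimizer at the origin, the sampled hull contains gradients from all four quadrants, so the min-norm point is (nearly) zero and the sampling radius shrinks, mirroring the inner-loop behavior in the theorem.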
Extensions

19 / 71

A more efficient version: Adaptive Gradient Sampling (F.E. Curtis and X. Que, 2013).

Problems with Nonsmooth Constraints:

    min f(x)
    subject to cᵢ(x) ≤ 0, i = 1, …, p,

where f and c₁, …, cₚ are locally Lipschitz but may not be differentiable at local minimizers.

A successive quadratic programming gradient sampling method with convergence theory.

F.E. Curtis and M.L.O., SIOPT, 2012.
Some Gradient Sampling Success Stories

20 / 71

■ Non-Lipschitz eigenvalue optimization for non-normal matrices (J.V. Burke, A.S. Lewis and M.L.O., 2002 – )
■ Design of fixed-order controllers for linear dynamical systems with input and output (D. Henrion and M.L.O., 2006, and many subsequent users of our HIFOO (H-Infinity Fixed Order Optimization) toolbox)
■ Path planning for robots: avoids the “chattering” that otherwise arises from nonsmoothness (I. Mitchell et al., 2017)
Quasi-Newton Methods

21 / 71
Bill Davidon

22 / 71

W. Davidon, a physicist at Argonne, had the breakthrough idea in 1959: since it is too expensive to compute and factor the Hessian ∇²f(x) at every iteration, update an approximation to its inverse using information from gradient differences, and multiply this onto the negative gradient to approximate Newton's method.

Each inverse Hessian approximation differs from the previous one by a rank-two correction.

Ahead of its time: the paper was rejected by the physics journals, but was published 30 years later in the first issue of SIAM J. Optimization.

Davidon was a well-known anti-war activist during the Vietnam War. In December 2013, it was revealed that he was the mastermind behind the break-in at the FBI office in Media, PA, on March 8, 1971, during the Muhammad Ali - Joe Frazier world heavyweight boxing championship.
![Page 90: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/90.jpg)
Fletcher and Powell
Introduction
Gradient Sampling
Quasi-NewtonMethods
Bill Davidon
Fletcher and Powell
BFGSThe BFGS Method(“Full” Version)
BFGS forNonsmoothOptimization
With BFGSExample:Minimizing aProduct ofEigenvalues
BFGS from 10Randomly GeneratedStarting Points
Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H
Regularity
Partly SmoothFunctionsPartly SmoothFunctions, continued
Same ExampleAgain
Relation of Partial
23 / 71
In 1963, R. Fletcher and M.J.D. Powell improved Davidon’smethod and established convergence for convex quadraticfunctions.
![Page 91: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/91.jpg)
Fletcher and Powell
Introduction
Gradient Sampling
Quasi-NewtonMethods
Bill Davidon
Fletcher and Powell
BFGSThe BFGS Method(“Full” Version)
BFGS forNonsmoothOptimization
With BFGSExample:Minimizing aProduct ofEigenvalues
BFGS from 10Randomly GeneratedStarting Points
Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H
Regularity
Partly SmoothFunctionsPartly SmoothFunctions, continued
Same ExampleAgain
Relation of Partial
23 / 71
In 1963, R. Fletcher and M.J.D. Powell improved Davidon’smethod and established convergence for convex quadraticfunctions.
They applied it to solve problems in 100 variables: a lot at thetime.
The method became known as the DFP method.
Davidon, Fletcher and Powell all died during 2013–2016.
BFGS
In 1970, C.G. Broyden, R. Fletcher, D. Goldfarb and D. Shanno all independently proposed the BFGS method, which is a kind of dual of the DFP method. It was soon recognized that this was a remarkably effective method for smooth optimization.
In 1973, C.G. Broyden, J.E. Dennis and J.J. Moré proved generic local superlinear convergence of BFGS, DFP and other quasi-Newton methods.
In 1976, M.J.D. Powell established convergence of BFGS with an inexact Armijo-Wolfe line search for a general class of smooth convex functions. In 1987, R.H. Byrd, J. Nocedal and Y.-X. Yuan extended this result to the whole "Broyden" class of methods interpolating BFGS and DFP, except for the DFP end point.
Pathological counterexamples to convergence in the smooth, nonconvex case are known to exist (Y.-H. Dai, 2002, 2013; W. Mascarenhas, 2004), but it is widely accepted that the method works well in practice in the smooth, nonconvex case.
The BFGS Method (“Full” Version)
Initialize iterate x and a positive-definite symmetric matrix H (which is supposed to approximate the inverse Hessian of f).
Repeat
■ Set d = −H∇f(x).
■ Obtain t from an Armijo-Wolfe line search.
■ Set s = td, y = ∇f(x + td) − ∇f(x).
■ Replace x by x + td.
■ Replace H by V H V^T + (1/(s^T y)) s s^T, where V = I − (1/(s^T y)) s y^T.
Note that the new H can be computed in O(n^2) operations since V is a rank-one perturbation of the identity. The Wolfe condition guarantees that s^T y > 0 and hence that the new H is positive definite.
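The loop above can be sketched in a few dozen lines (a minimal illustration assuming numpy; the `armijo_wolfe` helper below is a simple bracketing/bisection weak-Wolfe search written for this sketch, not the line search of any particular production code):

```python
import numpy as np

def armijo_wolfe(f, grad, x, d, c1=1e-4, c2=0.9):
    """Bracketing/bisection search for a step t satisfying the weak
    Armijo and Wolfe conditions (illustrative helper, not production code)."""
    f0, g0 = f(x), grad(x) @ d          # g0 < 0 since d is a descent direction
    t, lo, hi = 1.0, 0.0, np.inf
    for _ in range(100):
        if f(x + t * d) > f0 + c1 * t * g0:      # Armijo fails: step too long
            hi = t
        elif grad(x + t * d) @ d < c2 * g0:      # Wolfe fails: step too short
            lo = t
        else:
            return t
        t = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return t

def bfgs(f, grad, x, tol=1e-8, max_iter=500):
    n = x.size
    H = np.eye(n)                        # inverse Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g
        t = armijo_wolfe(f, grad, x, d)
        s = t * d
        x = x + s
        g_new = grad(x)
        y = g_new - g
        rho = 1.0 / (s @ y)              # s^T y > 0 is guaranteed by the Wolfe condition
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)   # O(n^2): V is a rank-one perturbation of I
        g = g_new
    return x

# Smooth test problem: the Rosenbrock function, minimized at (1, 1).
rosen = lambda x: 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2
rosen_grad = lambda x: np.array([
    -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
    200.0 * (x[1] - x[0] ** 2),
])
xstar = bfgs(rosen, rosen_grad, np.array([-1.2, 1.0]))
```

Note how the update is applied as two matrix-matrix products rather than by ever forming or inverting a Hessian; the rank-one structure of V is what keeps each iteration at O(n^2) cost.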
BFGS for Nonsmooth Optimization
In 1982, C. Lemaréchal observed that quasi-Newton methods can be effective for nonsmooth optimization, but dismissed them because there was no theory behind them and no good way to terminate them.
Otherwise, there is not much in the literature on the subject until A.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues in detail, but our convergence results are limited to special cases.
Key point: use an Armijo-Wolfe line search. Do not insist on reducing the magnitude of the directional derivative along the line!
In the nonsmooth case, BFGS builds a very ill-conditioned inverse "Hessian" approximation, with some tiny eigenvalues converging to zero, corresponding to "infinitely large" curvature in the directions defined by the associated eigenvectors.
Remarkably, the condition number of the inverse Hessian approximation typically reaches 10^16 before the method breaks down.
We have never seen convergence to non-stationary points that cannot be explained by numerical difficulties.
The convergence rate of BFGS is typically linear (not superlinear) in the nonsmooth case.
With BFGS
[Figure: contours of f(x) = 10|x2 − x1^2| + (1 − x1)^2 on [−2, 2] × [−2, 2], showing the starting point, the optimal point, and the iterates of steepest descent, gradient sampling (first and second phases), and BFGS.]
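For a quick experiment with the plotted example f(x) = 10|x2 − x1^2| + (1 − x1)^2, one can hand f to an off-the-shelf BFGS implementation such as scipy's, which also uses a Wolfe-type line search (a sketch assuming scipy is available; the stopping behavior described in the comment is typical behavior, not a result from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# The plotted example: a nonsmooth, nonconvex variant of the Rosenbrock function.
f = lambda x: 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

x0 = np.array([-1.5, 1.5])
# On a nonsmooth function, BFGS typically terminates with a line-search
# warning rather than by a gradient-norm test, but usually only after
# making substantial progress toward the minimizer (1, 1).
res = minimize(f, x0, method="BFGS")
```

The absence of a clean stopping test is exactly the termination issue Lemaréchal raised: the gradient norm does not go to zero at a nonsmooth minimizer.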
Example: Minimizing a Product of Eigenvalues
Let S^N denote the space of real symmetric N × N matrices, and let

λ1(X) ≥ λ2(X) ≥ · · · ≥ λN(X)

denote the eigenvalues of X ∈ S^N.
We wish to minimize

f(X) = log ∏_{i=1}^{N/2} λi(A ◦ X)

where A ∈ S^N is fixed and ◦ is the Hadamard (componentwise) matrix product, subject to the constraints that X is positive semidefinite and has diagonal entries equal to 1.
If we replaced the product ∏ by a sum ∑, we would have a semidefinite program.
Since f is not convex, we may as well replace X by Y Y^T where Y ∈ R^{N×N}: this eliminates the psd constraint, and it is then also easy to eliminate the diagonal constraint.
Application: entropy minimization in an environmental application (K.M. Anstreicher and J. Lee, 2004)
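As a concrete sanity check, the objective and the Y Yᵀ reparameterization can be sketched as follows. This is our own minimal sketch, not code from the lecture: the helper name is hypothetical, and we take A positive semidefinite (as in a covariance setting) so that the eigenvalue product is positive.

```python
import numpy as np

# Hypothetical helper (not from the slides): evaluate
#   f(X) = log prod_{i=1}^{N/2} lambda_i(A o X)
# via the Y Y^T parameterization. We set X = D (Y Y^T) D with D diagonal
# chosen so that diag(X) = 1; this keeps X positive semidefinite with
# unit diagonal by construction, eliminating both constraints.
def f_eig_product(A, Y):
    N = A.shape[0]
    X = Y @ Y.T
    d = 1.0 / np.sqrt(np.diag(X))            # rescale so diag(X) = 1
    X = X * np.outer(d, d)
    lam = np.sort(np.linalg.eigvalsh(A * X))[::-1]   # A * X is the Hadamard product
    return np.log(lam[: N // 2]).sum()

rng = np.random.default_rng(0)
N = 6
B = rng.standard_normal((N, N))
A = B @ B.T                                  # assumed psd, so A o X is psd (Schur product theorem)
Y = rng.standard_normal((N, N))
val = f_eig_product(A, Y)
```

Note that f depends on Y only through the normalized X, so rescaling Y leaves the objective unchanged.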
BFGS from 10 Randomly Generated Starting Points
[Plot: f − f_opt vs. iteration for the 10 runs, on a log scale from 10^−15 to 10^5. Title: Log eigenvalue product, N = 20, n = 400, f_opt = −4.37938. Here f_opt is the least value of f found over all runs.]
Evolution of Eigenvalues of A ◦X
[Plot: eigenvalues of A ◦ X vs. iteration, ranging over roughly 0.2 to 1.8. Title: Log eigenvalue product, N = 20, n = 400, f_opt = −4.37938.]
Note that λ6(A ◦ X), . . . , λ14(A ◦ X) coalesce
Evolution of Eigenvalues of H
[Plot: eigenvalues of H vs. iteration, on a log scale from 10^−15 to 10^5. Title: Log eigenvalue product, N = 20, n = 400, f_opt = −4.37938.]
44 eigenvalues of H converge to zero...why???
Regularity
A locally Lipschitz, directionally differentiable function f is regular (Clarke, 1970s) near a point x when its directional derivative f′(· ; d) is upper semicontinuous there for every fixed direction d.

In this case 0 ∈ ∂Cf(x) is equivalent to the first-order optimality condition f′(x; d) ≥ 0 for all directions d.

■ All convex functions are regular
■ All smooth functions are regular
■ Nonsmooth concave functions are not regular
Example: f(x) = −|x|
Note: this is a somewhat simpler definition of regularity than the one in Lecture 12, but it is less precise: it defines regularity in a neighborhood, not at a point.
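The concave example can be checked numerically. A minimal sketch (the helper names are ours, not from the lecture): for f(x) = −|x|, the gradients at nearby smooth points are ±1, so their convex hull, the Clarke subdifferential at 0, is [−1, 1] and contains 0; yet f′(0; d) = −|d| < 0 for every nonzero d, so 0 ∈ ∂Cf(0) does not certify first-order optimality.

```python
# Hypothetical numerical check for the nonsmooth concave function
# f(x) = -|x|: 0 lies in the Clarke subdifferential at 0, but both
# one-sided directional derivatives at 0 are negative.
def f(x):
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    # one-sided directional derivative f'(x; d) via a small forward step
    return (f(x + t * d) - f(x)) / t

dd_right = dir_deriv(f, 0.0, 1.0)    # f'(0; 1)  = -1
dd_left = dir_deriv(f, 0.0, -1.0)    # f'(0; -1) = -1
```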
Partly Smooth Functions
A regular function f is partly smooth at x relative to a manifold M containing x (A.S. Lewis, 2003) if

■ its restriction to M is twice continuously differentiable near x
■ the Clarke subdifferential ∂f is continuous on M near x
■ par ∂f(x), the subspace parallel to the affine hull of the subdifferential of f at x, is exactly the subspace normal to M at x.

We refer to par ∂f(x) as the V-space for f at x (with respect to M), and to its orthogonal complement, the subspace tangent to M at x, as the U-space for f at x.

When we refer to the V-space and U-space without reference to a point x, we mean at a minimizer.

For nonzero y in the V-space, the mapping t ↦ f(x + ty) is necessarily nonsmooth at t = 0, while for nonzero y in the U-space, t ↦ f(x + ty) is differentiable at t = 0 as long as f is locally Lipschitz.
![Page 137: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/137.jpg)
Partly Smooth Functions, continued
Example: f(x) = 10|x2 − x1²| + (1 − x1)².
Question: What is M and what are the U and V spaces at the minimizer?

Example: f(x) = ‖x‖2.
Question: What is M and what are the U and V spaces at the minimizer?
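For the first example, the partly smooth structure can be probed numerically. A hedged sketch under our reading of the example: the minimizer is x* = (1, 1) and the active manifold is M = {x : x2 = x1²}, so f should vary smoothly along M ("U-shaped") and have a kink transverse to it ("V-shaped").

```python
# Hypothetical numerical sketch of the U/V behavior of
# f(x) = 10|x2 - x1^2| + (1 - x1)^2 at its (assumed) minimizer (1, 1),
# taking M = {x : x2 = x1^2} as the (assumed) active manifold.
def f(x1, x2):
    return 10.0 * abs(x2 - x1 ** 2) + (1.0 - x1) ** 2

t = 1e-4
# Tangent to M at (1, 1) the direction is (1, 2): along it
# x2 - x1^2 = -t^2, so f grows like 11 t^2 -- smooth, "U-shaped".
f_tangent = f(1.0 + t, 1.0 + 2.0 * t)
# Stepping off M in x2 alone: |x2 - x1^2| = t, so f grows like
# 10 t -- a kink at t = 0, "V-shaped".
f_normal = f(1.0, 1.0 + t)
```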
Same Example Again
[Plot: contours of f(x) = 10|x2 − x1²| + (1 − x1)², showing the starting point, the optimal point, and the iterates of steepest descent, gradient sampling (first and second phases), and BFGS.]
Relation of Partial Smoothness to Earlier Work
Partial smoothness is closely related to earlier work of J.V. Burke and J.J. Moré (1990, 1994) and S.J. Wright (1993) on identification of constraint structure by algorithms.

When f is convex, the partly smooth nomenclature is consistent with the usage of V-space and U-space by C. Lemaréchal, F. Oustry and C. Sagastizábal (2000), but partial smoothness does not imply convexity and convexity does not imply partial smoothness.
![Page 142: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/142.jpg)
Why Did 44 Eigenvalues of H Converge to Zero?
The eigenvalue product is regular and also partly smooth (in the sense of A.S. Lewis, 2003) with respect to the manifold of matrices with an eigenvalue of given multiplicity. This implies that tangent to this manifold (preserving the multiplicity to first order) the function is smooth ("U-shaped"), and normal to it, the function is nonsmooth ("V-shaped").

Recall that at the computed minimizer,

λ6(A ◦ X) ≈ · · · ≈ λ14(A ◦ X).

Matrix theory says that imposing multiplicity m on an eigenvalue of a matrix in S^N amounts to m(m+1)/2 − 1 conditions, or 44 when m = 9, so the dimension of the V-space at this minimizer is 44.
Tiny eigenvalues of H correspond to huge curvature, which corresponds to V-space directions.

Thus BFGS automatically detected the U- and V-space partitioning without knowing anything about the mathematical structure of f!
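The dimension count above is easy to verify directly (the helper name is ours):

```python
# Codimension of the manifold of symmetric matrices having an eigenvalue
# of multiplicity m: imposing the multiplicity amounts to
# m(m+1)/2 - 1 scalar conditions.
def multiplicity_codim(m):
    return m * (m + 1) // 2 - 1

codim = multiplicity_codim(9)   # lambda_6, ..., lambda_14 coalescing: m = 9
```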
![Page 146: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/146.jpg)
Variation of f from Minimizer, along EigVecs of H
[Plot: f(x_opt + t w) − f_opt vs. t ∈ [−10, 10], where w is the eigenvector for eigenvalue 10, 20, 30, 40, 50, or 60 of the final H, with the eigenvalues of H numbered smallest to largest. Title: Log eigenvalue product, N = 20, n = 400, f_opt = −4.37938.]
![Page 147: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/147.jpg)
BFGS Theory for Special Nonsmooth Functions
Introduction
Gradient Sampling
Quasi-NewtonMethods
Bill Davidon
Fletcher and Powell
BFGSThe BFGS Method(“Full” Version)
BFGS forNonsmoothOptimization
With BFGSExample:Minimizing aProduct ofEigenvalues
BFGS from 10Randomly GeneratedStarting Points
Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H
Regularity
Partly SmoothFunctionsPartly SmoothFunctions, continued
Same ExampleAgain
Relation of Partial
39 / 71
Convergence results for BFGS with Armijo-Wolfe line searchwhen f is nonsmooth are limited to very special cases.
■ f(x) = |x| (one variable!): sequence generated converging to0 is related to a certain binary expansion of the startingpoint (A.S. Lewis and M.L.O., 2013)
■ f(x) = |x1|+ x2: f(x) ↓ −∞ (A.S. Lewis and ShanshanZhang, 2015)
■ f(x) = |x1|+∑n
i=2 xi: eventually a direction is identified onwhich f is unbounded below (Yuchen Xie and A. Waechter,2017)
![Page 150: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/150.jpg)
BFGS Theory for Special Nonsmooth Functions
Introduction
Gradient Sampling
Quasi-NewtonMethods
Bill Davidon
Fletcher and Powell
BFGSThe BFGS Method(“Full” Version)
BFGS forNonsmoothOptimization
With BFGSExample:Minimizing aProduct ofEigenvalues
BFGS from 10Randomly GeneratedStarting Points
Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H
Regularity
Partly SmoothFunctionsPartly SmoothFunctions, continued
Same ExampleAgain
Relation of Partial
39 / 71
Convergence results for BFGS with Armijo-Wolfe line searchwhen f is nonsmooth are limited to very special cases.
■ f(x) = |x| (one variable!): sequence generated converging to0 is related to a certain binary expansion of the startingpoint (A.S. Lewis and M.L.O., 2013)
■ f(x) = |x1|+ x2: f(x) ↓ −∞ (A.S. Lewis and ShanshanZhang, 2015)
■ f(x) = |x1|+∑n
i=2 xi: eventually a direction is identified onwhich f is unbounded below (Yuchen Xie and A. Waechter,2017)
■ f(x) =√∑n
i=1 x2i : iterates converge to [0, . . . , 0] (Jiayi Guo
and A.S. Lewis, 2017) (proof based on Powell (1976))
![Page 151: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/151.jpg)
BFGS Theory for Special Nonsmooth Functions
Introduction
Gradient Sampling
Quasi-NewtonMethods
Bill Davidon
Fletcher and Powell
BFGSThe BFGS Method(“Full” Version)
BFGS forNonsmoothOptimization
With BFGSExample:Minimizing aProduct ofEigenvalues
BFGS from 10Randomly GeneratedStarting Points
Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H
Regularity
Partly SmoothFunctionsPartly SmoothFunctions, continued
Same ExampleAgain
Relation of Partial
39 / 71
Convergence results for BFGS with Armijo-Wolfe line search when f is nonsmooth are limited to very special cases.

■ f(x) = |x| (one variable!): the sequence generated, converging to 0, is related to a certain binary expansion of the starting point (A.S. Lewis and M.L.O., 2013)
■ f(x) = |x1| + x2: f(x) ↓ −∞ (A.S. Lewis and Shanshan Zhang, 2015)
■ f(x) = |x1| + ∑_{i=2}^n xi: eventually a direction is identified on which f is unbounded below (Yuchen Xie and A. Waechter, 2017)
■ f(x) = √(∑_{i=1}^n xi²): iterates converge to [0, . . . , 0] (Jiayi Guo and A.S. Lewis, 2017) (proof based on Powell (1976))
■ f(x) = |x1| + x2²: remains open!
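The |x| result is easy to experiment with. Below is a minimal sketch (not the authors' code; all parameter and tolerance choices are illustrative assumptions) of BFGS with an Armijo-Wolfe bisection line search in one dimension, where the inverse Hessian approximation H is a scalar and the BFGS update reduces to H = s/y:

```python
def bfgs_abs(x, max_iter=100, c1=1e-4, c2=0.9):
    """Minimize f(x) = |x| by BFGS with an Armijo-Wolfe bisection
    line search; in 1-D the inverse Hessian approximation H is a
    scalar and the BFGS update reduces to H = s/y."""
    f = abs
    g = lambda z: 1.0 if z > 0 else -1.0   # gradient (a subgradient at 0)
    H = 1.0
    for _ in range(max_iter):
        if abs(x) < 1e-12:                 # effectively at the minimizer
            break
        gx = g(x)
        p = -H * gx                        # quasi-Newton direction
        d = gx * p                         # directional derivative (< 0)
        lo, hi, t = 0.0, float("inf"), 1.0
        for _ in range(60):                # doubling/bisection line search
            if f(x + t * p) > f(x) + c1 * t * d:   # Armijo fails: shrink
                hi = t
            elif g(x + t * p) * p < c2 * d:        # Wolfe fails: grow
                lo = t
            else:
                break
            t = (lo + hi) / 2.0 if hi < float("inf") else 2.0 * t
        s = t * p
        y = g(x + s) - gx
        x = x + s
        if s * y > 0:                      # curvature condition holds
            H = s / y                      # 1-D BFGS update
    return x

x_final = bfgs_abs(0.9)
```

Consistent with the Lewis-O. analysis, the iterates oscillate in sign (the Wolfe condition forces the gradient, i.e. the sign, to flip at each accepted step) while |x| shrinks toward 0.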
Challenge: General Nonsmooth Case
40 / 71
Assume f is locally Lipschitz with bounded level sets and is semi-algebraic (its graph is a finite union of sets, each defined by a finite list of polynomial inequalities).

Assume the initial x and H are generated randomly (e.g. from normal and Wishart distributions).

Prove or disprove that the following hold with probability one:

1. BFGS generates an infinite sequence {x} with f differentiable at all iterates
2. Any cluster point x̄ is Clarke stationary
3. The sequence of function values generated (including all of the line search iterates) converges to f(x̄) R-linearly
4. If {x} converges to x̄ where f is "partly smooth" w.r.t. a manifold M, then the subspace defined by the eigenvectors corresponding to eigenvalues of H converging to zero converges to the "V-space" of f w.r.t. M at x̄
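Clarke stationarity (item 2) has a standard numerical surrogate, used by gradient sampling: a point is approximately Clarke stationary if the minimum-norm element of the convex hull of gradients sampled near it is close to zero. A minimal sketch (illustrative function f(x) = |x1| + x2², a basic Frank-Wolfe iteration for the min-norm problem; the sample stencil and tolerances are arbitrary assumptions for illustration):

```python
def min_norm_in_hull(grads, iters=20000):
    """Frank-Wolfe for the minimum-norm point of conv{grads}:
    minimize ||sum_i w_i g_i||^2 over the unit simplex."""
    m, dim = len(grads), len(grads[0])
    w = [1.0 / m] * m                      # start from uniform weights
    for k in range(iters):
        v = [sum(w[i] * grads[i][j] for i in range(m)) for j in range(dim)]
        # the linear minimization oracle over the simplex picks one vertex
        scores = [sum(v[j] * gi[j] for j in range(dim)) for gi in grads]
        i_best = min(range(m), key=lambda i: scores[i])
        gamma = 2.0 / (k + 2.0)
        w = [(1.0 - gamma) * wi for wi in w]
        w[i_best] += gamma
    v = [sum(w[i] * grads[i][j] for i in range(m)) for j in range(dim)]
    return sum(vj * vj for vj in v) ** 0.5

# gradients of f(x) = |x1| + x2^2, sampled on a stencil around a point
# close to the minimizer (0, 0); the sign of x1 flips across the stencil
def grad_f(x1, x2):
    return (1.0 if x1 > 0 else -1.0, 2.0 * x2)

eps, x0 = 0.01, (0.001, 0.002)
samples = [grad_f(x0[0] + du, x0[1] + dv)
           for du in (-eps, eps) for dv in (-eps, eps)]
residual = min_norm_in_hull(samples)       # small: approximately Clarke stationary
```

When the hull of sampled gradients does not contain a point near zero (the point is far from stationary), the returned residual stays bounded away from zero, which is exactly the dichotomy the gradient sampling convergence theory exploits.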
Some BFGS Nonsmooth Success Stories
41 / 71
■ Design of fixed-order controllers for linear dynamical systems with input and output (D. Henrion and M.L.O., 2006, and many subsequent users of our HIFOO (H-Infinity Fixed Order Optimization) toolbox)
■ Shape optimization for spectral functions of Dirichlet-Laplacian operators (B. Osting, 2010)
■ Condition metric optimization (P. Boito and J. Dedieu, 2010)

Software is available: HANSO
Extensions of BFGS for Nonsmooth Optimization
42 / 71
A combined BFGS-Gradient Sampling method with convergence theory (F.E. Curtis and X. Que, 2015)

Constrained Problems

min f(x)
subject to ci(x) ≤ 0, i = 1, . . . , p

where f and c1, . . . , cp are locally Lipschitz but may not be differentiable at local minimizers.

A successive quadratic programming (SQP) BFGS method applied to challenging problems in static-output-feedback control design (F.E. Curtis, T. Mitchell and M.L.O., 2015).

Although there are no theoretical results, it is much more efficient and effective than the SQP Gradient Sampling method, which does have convergence results.

Software is available: GRANSO
A Difficult Nonconvex Problem from Nesterov
43 / 71
An Aside: Chebyshev Polynomials
44 / 71
A sequence of orthogonal polynomials defined on [−1, 1] by

T_0(x) = 1,  T_1(x) = x,  T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x).

So T_2(x) = 2x² − 1, T_3(x) = 4x³ − 3x, etc.
Important properties that can be proved easily include
■ T_n(x) = cos(n cos⁻¹(x))
■ T_m(T_n(x)) = T_{mn}(x)
■ ∫_{−1}^{1} T_i(x) T_j(x) / √(1 − x²) dx = 0 if i ≠ j
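The first two properties are easy to check numerically. A minimal sketch (not from the lecture; the helper name `cheb` is my own), evaluating T_n by the three-term recurrence:

```python
import math

def cheb(n, x):
    """Evaluate T_n(x) via T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

x = 0.3
# T_n(x) = cos(n arccos x)
assert abs(cheb(5, x) - math.cos(5 * math.acos(x))) < 1e-12
# Composition: T_m(T_n(x)) = T_{mn}(x)
assert abs(cheb(3, cheb(4, x)) - cheb(12, x)) < 1e-12
```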
Plots of Chebyshev Polynomials
45 / 71
[Figure: two plots over [−1, 1]. Left: plots of T_0(x), . . . , T_4(x). Right: plot of T_8(x).]
Question: How many extrema does T_n(x) have in [−1, 1]?
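The question can be answered numerically. A sketch (my own, not from the slides), counting sign changes of the sampled slope of T_n, under the assumption that a fine uniform grid separates neighboring extrema:

```python
import math

def cheb(n, x):
    """T_n(x) = cos(n arccos x), valid on [-1, 1]."""
    return math.cos(n * math.acos(x))

def interior_extrema(n, samples=100001):
    xs = [-1 + 2 * k / (samples - 1) for k in range(samples)]
    ys = [cheb(n, x) for x in xs]
    # each sampled slope sign change marks one interior extremum
    return sum(1 for k in range(1, samples - 1)
               if (ys[k] - ys[k - 1]) * (ys[k + 1] - ys[k]) < 0)

# T_n has n - 1 extrema in (-1, 1), hence n + 1 on [-1, 1]
# once the two endpoints are included.
assert interior_extrema(8) == 7
```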
Nesterov’s Chebyshev-Rosenbrock Functions
46 / 71
Consider the function

N_p(x) = (1/4)(x_1 − 1)² + ∑_{i=1}^{n−1} |x_{i+1} − 2x_i² + 1|^p,  where p ≥ 1.

The unique minimizer is x∗ = [1, 1, . . . , 1]^T with N_p(x∗) = 0.
Define x̄ = [−1, 1, 1, . . . , 1]^T with N_p(x̄) = 1 and the manifold

M_N = {x : x_{i+1} = 2x_i² − 1, i = 1, . . . , n − 1}.

For x ∈ M_N, e.g. x = x∗ or x = x̄, the second term of N_p is zero. Starting at x̄, BFGS needs to approximately follow M_N to reach x∗ (unless it “gets lucky”).
When p = 2: N_2 is smooth but not convex. Starting at x̄:
■ n = 5: BFGS needs 370 iterations to reduce N_2 below 10⁻¹⁵
■ n = 10: needs ∼ 50,000 iterations to reduce N_2 below 10⁻¹⁵
even though N_2 is smooth! . . . In the last few iterations, we observe superlinear convergence!
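The claimed values at x∗ and x̄, and the vanishing of the second term on M_N, can be checked directly. A minimal sketch (my own code, with the hypothetical helper name `nesterov_cr`):

```python
def nesterov_cr(x, p=2):
    """Nesterov's Chebyshev-Rosenbrock function N_p from the slide."""
    penalty = sum(abs(x[i + 1] - 2 * x[i] ** 2 + 1) ** p
                  for i in range(len(x) - 1))
    return 0.25 * (x[0] - 1) ** 2 + penalty

n = 5
x_star = [1.0] * n
x_bar = [-1.0] + [1.0] * (n - 1)
assert nesterov_cr(x_star) == 0.0
assert nesterov_cr(x_bar) == 1.0   # only the smooth first term is nonzero

# On the manifold M_N the penalty term vanishes, whatever x_1 is:
x1 = 0.3
m = [x1]
for _ in range(n - 1):
    m.append(2 * m[-1] ** 2 - 1)
assert abs(nesterov_cr(m) - 0.25 * (x1 - 1) ** 2) < 1e-15
```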
Why BFGS Takes So Many Iterations to Minimize N2
47 / 71
Let T_i(x) denote the ith Chebyshev polynomial. For x ∈ M_N,

x_{i+1} = 2x_i² − 1 = T_2(x_i) = T_2(T_2(x_{i−1})) = T_2(T_2(· · · T_2(x_1) · · ·)) = T_{2^i}(x_1).

To move from x̄ to x∗ along the manifold M_N exactly requires
■ x_1 to change from −1 to 1
■ x_2 = 2x_1² − 1 to trace the graph of T_2(x_1) on [−1, 1]
■ x_3 = T_2(T_2(x_1)) to trace the graph of T_4(x_1) on [−1, 1]
■ x_n = T_{2^{n−1}}(x_1) to trace the graph of T_{2^{n−1}}(x_1) on [−1, 1]
which has 2^{n−1} − 1 extrema in (−1, 1).
Even though BFGS will not track the manifold M_N exactly, it will follow it approximately. So, since the manifold is highly oscillatory, BFGS must take relatively short steps to obtain reduction in N_2 in the line search, and hence many iterations!
Newton’s method is not much faster, although it converges quadratically at the end.
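The composition identity driving this argument can be verified numerically. A sketch (mine, not the lecture's code), assuming T_n may be evaluated as cos(n arccos x) on [−1, 1]:

```python
import math

def cheb(n, x):
    """T_n(x) = cos(n arccos x) on [-1, 1]."""
    return math.cos(n * math.acos(x))

# Walk the manifold recurrence x_{i+1} = 2 x_i^2 - 1 and compare with
# the closed form x_{i+1} = T_{2^i}(x_1) from the slide.
x1 = 0.123
xi = x1
for i in range(1, 6):
    xi = 2 * xi ** 2 - 1          # one step along M_N
    assert abs(xi - cheb(2 ** i, x1)) < 1e-9
```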
First Nonsmooth Variant of Nesterov’s Function
48 / 71
N_1(x) = (1/4)(x_1 − 1)² + ∑_{i=1}^{n−1} |x_{i+1} − 2x_i² + 1|

N_1 is nonsmooth (though locally Lipschitz) as well as nonconvex. The second term is still zero on the manifold M_N, but N_1 is not differentiable on M_N.
However, N_1 is regular at x ∈ M_N and partly smooth at x w.r.t. M_N.
We cannot initialize BFGS at x̄, so starting at normally distributed random points:
■ n = 5: BFGS reduces N_1 only to about 10⁻² in 10,000 iterations
■ n = 10: BFGS reduces N_1 only to about 5 × 10⁻² in 10,000 iterations
The method appears to be converging, very slowly, but may be having numerical difficulties.
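The nondifferentiability on M_N shows up already in one-sided difference quotients. A small sketch (my own, with `n1` a hypothetical helper): at a point of M_N, the slopes of N_1 from the right and from the left in the x_2 direction are +1 and −1.

```python
def n1(x):
    """First nonsmooth variant (p = 1) of Nesterov's function."""
    return 0.25 * (x[0] - 1) ** 2 + sum(
        abs(x[i + 1] - 2 * x[i] ** 2 + 1) for i in range(len(x) - 1))

x1 = 0.3
m = [x1, 2 * x1 ** 2 - 1]          # a point on M_N for n = 2
h = 1e-6
right = (n1([m[0], m[1] + h]) - n1(m)) / h
left = (n1([m[0], m[1] - h]) - n1(m)) / (-h)
assert abs(right - left) > 1.9     # kink: one-sided slopes +1 and -1
```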
Second Nonsmooth Variant of Nesterov’s Function
49 / 71
N̂_1(x) = (1/4)|x_1 − 1| + ∑_{i=1}^{n−1} |x_{i+1} − 2|x_i| + 1|.

Again, the unique global minimizer is x∗. The second term is zero on the set

S = {x : x_{i+1} = 2|x_i| − 1, i = 1, . . . , n − 1}

but S is not a manifold: it has “corners”.
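As with N_p, the second variant and the set S can be checked directly. A sketch (my own code; `n1_hat` is a hypothetical name for the function written N̂_1 above):

```python
def n1_hat(x):
    """Second nonsmooth variant of Nesterov's function."""
    return 0.25 * abs(x[0] - 1) + sum(
        abs(x[i + 1] - 2 * abs(x[i]) + 1) for i in range(len(x) - 1))

n = 5
assert n1_hat([1.0] * n) == 0.0    # global minimizer x*

# Points of S satisfy x_{i+1} = 2|x_i| - 1; only the first term survives.
x1 = -0.4
s = [x1]
for _ in range(n - 1):
    s.append(2 * abs(s[-1]) - 1)
assert abs(n1_hat(s) - 0.25 * abs(x1 - 1)) < 1e-15
```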
Contour Plots of the Nonsmooth Variants for n = 2
50 / 71
[Figure: two contour plots over x1, x2 ∈ [−2, 2]; left panel "Nesterov–Chebyshev–Rosenbrock, first variant", right panel "Nesterov–Chebyshev–Rosenbrock, second variant".]
Contour plots of the nonsmooth Chebyshev–Rosenbrock functions, first variant (left) and second variant (right), with n = 2, with iterates generated by BFGS initialized at 7 different randomly generated points. On the left, we always get convergence to x∗ = [1, 1]^T. On the right, most runs converge to [1, 1]^T, but some converge to x = [0, −1]^T.
Properties of the Second Nonsmooth Variant N1
51 / 71
When n = 2, the point x = [0, −1]^T is Clarke stationary for the second nonsmooth variant N1. We can see this because zero is in the convex hull of the gradient limits of N1 at the point x.

However, x = [0, −1]^T is not a local minimizer, because d = [1, 2]^T is a direction of linear descent: N1′(x; d) < 0.

These two properties mean that N1 is not regular at [0, −1]^T.

In fact, for n ≥ 2:

■ N1 has 2^{n−1} Clarke stationary points
■ the only local minimizer is the global minimizer x∗
■ x∗ is the only stationary point in the sense of Mordukhovich (i.e., with 0 ∈ ∂N1(x), where we defined ∂ in Lecture 12; see also Rockafellar and Wets, Variational Analysis, 1998).

(M. Gurbuzbalaban and M.L.O., 2012)
Furthermore, starting from enough randomly generated starting points, BFGS finds all 2^{n−1} Clarke stationary points!
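The descent claim above is easy to verify numerically (a sketch, not from the slides): along x + t d with d = [1, 2]^T, the summation term of N1 stays zero, so the forward difference quotients approach N1′(x; d) = −1/4.

```python
import numpy as np

def n1_second(x):
    # Second nonsmooth variant: |x1 - 1|/4 + sum_i |x_{i+1} - 2|x_i| + 1|
    x = np.asarray(x, dtype=float)
    return 0.25 * abs(x[0] - 1.0) + np.sum(np.abs(x[1:] - 2.0 * np.abs(x[:-1]) + 1.0))

xbar = np.array([0.0, -1.0])  # Clarke stationary, but not a local minimizer
d = np.array([1.0, 2.0])      # direction of linear descent

# Along xbar + t*d = [t, -1 + 2t] with t > 0, the sum term is
# |(-1 + 2t) - 2t + 1| = 0, so N1 decreases linearly at rate 1/4:
for t in [1e-2, 1e-4, 1e-6]:
    print((n1_second(xbar + t * d) - n1_second(xbar)) / t)  # each ≈ -0.25
```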
Behavior of BFGS on the Second Nonsmooth Variant
52 / 71
[Figure: two plots, "Nesterov–Chebyshev–Rosenbrock, n=5" (left) and "n=6" (right); x-axis: different starting points; y-axis: sorted final value of f, from 0 to 0.5.]
Left: sorted final values of N1 for 1000 randomly generated starting points, when n = 5: BFGS finds all 16 Clarke stationary points. Right: the same with n = 6: BFGS finds all 32 Clarke stationary points.
Convergence to Non-Locally-Minimizing Points
53 / 71
When f is smooth, convergence of methods such as BFGS to non-locally-minimizing stationary points or local maxima is possible but not likely, because of the line search, and such convergence will not be stable under perturbation.

However, this kind of convergence is what we are seeing for the non-regular, nonsmooth Nesterov–Chebyshev–Rosenbrock example, and it is stable under perturbation. The same behavior occurs for gradient sampling and bundle methods.

Kiwiel (private communication): the Nesterov example is the first he had seen that causes his bundle code to exhibit this behavior.

Nonetheless, we don’t know whether, in exact arithmetic, the methods would actually generate sequences converging to the nonminimizing Clarke stationary points. Experiments by Kaku (2011) suggest that the higher the precision used, the more likely BFGS is to eventually move away from such a point.
Limited Memory Methods
54 / 71
Limited Memory BFGS
55 / 71
“Full” BFGS requires storing an n × n matrix and doing matrix–vector multiplies, which is not possible when n is large.

In the 1980s, J. Nocedal and others developed a “limited memory” version of BFGS, with O(n) space and time requirements, which is very widely used for minimizing smooth functions of many variables. At the kth iteration, it applies only the m most recent rank-two updates, defined by the pairs

(s_j, y_j), j = k − m, . . . , k − 1,

to an initial inverse Hessian approximation H_0^{(k)}.

There are two variants: with “scaling” (H_0^{(k)} = (s_{k−1}^T y_{k−1} / y_{k−1}^T y_{k−1}) I) and without scaling (H_0^{(k)} = I).

The convergence rate of limited memory BFGS is linear, not superlinear, on smooth problems.

Question: how effective is it on nonsmooth problems?
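The update-application step is usually implemented with the standard two-loop recursion; here is a sketch (not taken from the slides) covering both the scaled and unscaled choices of H_0^{(k)}.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, scale=True):
    """Compute d = -H_k * grad by the two-loop recursion, applying the
    stored rank-two updates (s_j, y_j), oldest first in the lists, to an
    initial inverse Hessian approximation H0."""
    q = grad.astype(float).copy()
    alphas = []  # collected newest pair first
    for s, y in reversed(list(zip(s_list, y_list))):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if scale and s_list:
        # "scaling" variant: H0 = (s^T y / y^T y) I using the newest pair
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0  # "no scaling" variant: H0 = I
    r = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return -r

g = np.array([3.0, 4.0])
print(lbfgs_direction(g, [], []))  # no stored pairs: steepest descent, [-3. -4.]
```

Only the 2m vectors (s_j, y_j) are stored, so each direction costs O(mn) work instead of the O(n^2) of full BFGS.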
Limited Memory BFGS on the Eigenvalue Product
56 / 71
[Figure: "Log eig prod, N=20, r=20, K=10, n=400, maxit = 400"; x-axis: number of vectors k; y-axis: median reduction in f − f_opt over 10 runs, on a log scale from 10^0 down to 10^−5; curves: Lim Mem BFGS (scaling), Lim Mem BFGS (no scaling), full BFGS (scaling once), full BFGS (no scaling).]
Limited Memory is not nearly as good as full BFGS.
No significant improvement when k reaches 44.
A More Basic Example
57 / 71
Let x = [y; z;w] ∈ RnA+nB+nR and consider the test function
f(x) = (y − e)TA(y − e) + {(z − e)TB(z − e)}1/2 +R1(w)
where A = AT ≻ 0, B = BT ≻ 0, e = [1; 1; . . . ; 1].
The first term is quadratic, the second is nonsmooth but convex,and the third is a nonsmooth, nonconvex Rosenbrock function.
The optimal value is 0, with x = e. The function f is partlysmooth and the dimension of the V-space is nB + nR − 1.
Set A = XXT where xij are normally distributed, with conditionnumber about 106 when nA = 200. Similarly B with nB < nA.
![Page 232: Nonsmooth, Nonconvex Optimization€¦ · Convex and Nonsmooth Optimization Class, Spring 2018, Final Lecture Based on my research ... Optimization A Simple Nonconvex Example Failure](https://reader034.fdocuments.in/reader034/viewer/2022051815/60409e009481d364de769a3b/html5/thumbnails/232.jpg)
A More Basic Example
Introduction
Gradient Sampling
Quasi-NewtonMethods
A DifficultNonconvex Problemfrom Nesterov
Limited MemoryMethodsLimited MemoryBFGSLimited MemoryBFGS on theEigenvalue Product
A More BasicExample
Smooth, Convex:nA = 200, nB =0, nR = 1
Nonsmooth, Convex:nA = 200, nB =10, nR = 1
Nonsmooth,Nonconvex:nA = 200, nB =10, nR = 5
A NonsmoothConvex Function,Unbounded BelowL-BFGS-1 vs.Gradient DescentConvergence of theL-BFGS-1 Search
57 / 71
Let x = [y; z; w] ∈ R^{nA+nB+nR} and consider the test function

f(x) = (y − e)^T A (y − e) + ((z − e)^T B (z − e))^{1/2} + R1(w)

where A = A^T ≻ 0, B = B^T ≻ 0, e = [1; 1; . . . ; 1].

The first term is quadratic, the second is nonsmooth but convex, and the third is a nonsmooth, nonconvex Rosenbrock function.

The optimal value is 0, attained at x = e. The function f is partly smooth and the dimension of the V-space is nB + nR − 1.

Set A = XX^T where the xij are normally distributed, giving condition number about 10^6 when nA = 200. Construct B similarly, with nB < nA.

Besides limited memory BFGS and full BFGS, we also compare limited memory Gradient Sampling, where we sample k ≪ n gradients per iteration.
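The test function above is easy to sketch in code. The helper names below are ours, and the nonsmooth Rosenbrock term R1 is a stand-in (absolute values replacing the squares of the classical Rosenbrock), since the slide does not spell out its exact form:

```python
import numpy as np

def make_spd(n, rng):
    # A = X X^T with i.i.d. standard normal entries, as on the slide;
    # for n = 200 the condition number is typically around 1e6.
    X = rng.standard_normal((n, n))
    return X @ X.T

def nonsmooth_rosenbrock(w):
    # Stand-in for R1 (an assumption): absolute values replace the squares
    # of the classical Rosenbrock, so it is nonsmooth and nonconvex,
    # with minimum value 0 at w = (1, ..., 1).
    w = np.asarray(w, dtype=float)
    if w.size == 1:
        return abs(1.0 - w[0])
    return np.sum(np.abs(w[1:] - 2.0 * w[:-1] ** 2 + 1.0)) + np.sum(np.abs(1.0 - w[:-1]))

def f(x, A, B, nA, nB):
    # f(x) = (y-e)' A (y-e) + ((z-e)' B (z-e))^{1/2} + R1(w), with x = [y; z; w]
    y, z, w = x[:nA], x[nA:nA + nB], x[nA + nB:]
    quad = (y - 1.0) @ A @ (y - 1.0)
    nsm = np.sqrt((z - 1.0) @ B @ (z - 1.0)) if nB > 0 else 0.0
    return quad + nsm + nonsmooth_rosenbrock(w)
```

Since A and B are positive definite and each term is nonnegative, f(x) ≥ 0 with f(e) = 0, matching the stated optimal value.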
Smooth, Convex: nA = 200, nB = 0, nR = 1

[Figure: median reduction in f (over 10 runs) vs. number of vectors k, for nA = 200, nB = 0, nR = 1, maxit = 201; curves for Grad Samp (1e−02, 1e−04), Lim Mem BFGS (scaling), Lim Mem BFGS (no scaling), full BFGS (scaling once), full BFGS (no scaling).]
LM-BFGS with scaling even better than full BFGS
Nonsmooth, Convex: nA = 200, nB = 10, nR = 1

[Figure: median reduction in f (over 10 runs) vs. number of vectors k, for nA = 200, nB = 10, nR = 1, maxit = 211; curves for Grad Samp (1e−02, 1e−04), Lim Mem BFGS (scaling), Lim Mem BFGS (no scaling), full BFGS (scaling once), full BFGS (no scaling).]
LM-BFGS much worse than full BFGS
Nonsmooth, Nonconvex: nA = 200, nB = 10, nR = 5

[Figure: median reduction in f (over 10 runs) vs. number of vectors k, for nA = 200, nB = 10, nR = 5, maxit = 215; curves for Grad Samp (1e−02, 1e−04), Lim Mem BFGS (scaling), Lim Mem BFGS (no scaling), full BFGS (scaling once), full BFGS (no scaling).]
LM-BFGS with scaling even worse than LM-Grad-Samp
A Nonsmooth Convex Function, Unbounded Below
Let’s reconsider

f(x) = a|x1| + x2

with a ≥ 1.
[Figure: surface plot of f(u, v) = 5|u| + v.]
Turns out that L-BFGS-1 (saving just one update) with scaling fails for smaller values of a than the critical value beyond which Gradient Descent fails!
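For reference, the L-BFGS-1 search direction is cheap to write down: it is the standard two-loop recursion with memory m = 1 and initial Hessian approximation γI, γ = sᵀy / yᵀy (the "scaling"). A minimal sketch, with our own variable names:

```python
import numpy as np

def lbfgs1_direction(g, s, y):
    # Two-loop L-BFGS recursion with memory m = 1 and scaled initial
    # matrix gamma*I, gamma = s'y / y'y.  Here s = x_k - x_{k-1} and
    # y = g_k - g_{k-1}; a Wolfe line search guarantees s'y > 0.
    rho = 1.0 / (y @ s)
    alpha = rho * (s @ g)
    q = g - alpha * y            # first loop (one update)
    gamma = (s @ y) / (y @ y)    # with "no scaling", gamma = 1
    r = gamma * q
    beta = rho * (y @ r)
    return -(r + s * (alpha - beta))  # second loop, negated for descent
```

Sanity check: when y = s (as happens for f(x) = ½‖x‖² with exact gradients), γ = 1 and the recursion collapses to the steepest-descent direction −g.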
L-BFGS-1 vs. Gradient Descent
Red: path of L-BFGS-1 with scaling, which converges to a non-stationary point.
Blue: path of the gradient method with the same Armijo-Wolfe line search, which generates f(x) ↓ −∞.
[Figure: paths of LBFGS-1 and the gradient method in the (u, v)-plane for f(u, v) = 3|u| + v, x0 = (8.284; 2.177), c1 = 0.05, τ = −0.056.]
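The blue path can be reproduced in outline: steepest descent under a bracketing Armijo-Wolfe (weak Wolfe) line search on f(u, v) = 3|u| + v. The bisection scheme below is a standard one and the parameters are assumptions; it is a sketch, not the exact code behind the figure:

```python
import numpy as np

def f(x):
    return 3.0 * abs(x[0]) + x[1]

def subgrad(x):
    # gradient wherever x[0] != 0; a subgradient at x[0] == 0
    return np.array([3.0 * np.sign(x[0]), 1.0])

def armijo_wolfe(x, d, c1=0.05, c2=0.5, max_bisect=60):
    # Bracketing/bisection search for a step t satisfying the weak
    # Wolfe conditions: sufficient decrease (Armijo) and a directional
    # derivative increase (Wolfe).
    fx, dd = f(x), subgrad(x) @ d
    lo, hi, t = 0.0, np.inf, 1.0
    for _ in range(max_bisect):
        if f(x + t * d) > fx + c1 * t * dd:       # Armijo fails: shrink
            hi = t
        elif subgrad(x + t * d) @ d < c2 * dd:    # Wolfe fails: expand
            lo = t
        else:
            return t
        t = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
    return t

x = np.array([8.284, 2.177])
vals = [f(x)]
for _ in range(25):
    d = -subgrad(x)
    x = x + armijo_wolfe(x, d) * d
    vals.append(f(x))
# f keeps decreasing; for a = 3 the slides report f(x) -> -inf
```

Each accepted step satisfies the Armijo condition, so f decreases monotonically while x drifts down the unbounded direction.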
Convergence of the L-BFGS-1 Search Direction
Theorem. Let d(k) be the search direction generated by L-BFGS-1 with scaling applied to f(x) = a|x1| + Σ_{i=2}^n xi, using an Armijo-Wolfe line search. If √(4(n − 1)) ≤ a, then d(k)/‖d(k)‖ converges to some constant direction d. Furthermore, if

a(a + √(a² − 3(n − 1))) > (1/c1 − 1)(n − 1),

where c1 is the Armijo parameter, then the iterates x(k) converge to a non-stationary point.
Azam Asl, 2018.
Experiment, with n = 2 and a = √3
In practice we observe that √(3(n − 1)) ≤ a suffices for the method to fail, which is a weaker condition than the previous one. Below, with n = 2 and a = √3, the method fails:
[Figure: paths of LBFGS-1 and the gradient method for f(u, v) = 1.7321|u| + v, x0 = (8.284; 2.177), c1 = 0.05, τ = −0.27.]
Experiment: slightly smaller a
But if we set a =√3− 0.001, it succeeds “at the last minute”.
[Figure: paths of LBFGS-1 and the gradient method for f(u, v) = 1.7311|u| + v, x0 = (8.284; 2.177), c1 = 0.05, τ = −0.27.]
Experiments: Top, scaling on; Bottom, scaling off
n = 30, √(3(n − 1)) = 9.327

[Figure: failure rate vs. a over 9.315 ≤ a ≤ 9.34, for N = 30, f(x) = a|x(1)| + Σ_{i=2}^N x(i), c1 = 0.05, (3(N − 1))^{1/2} = 9.33, nrand = 5000; top panel: scaling on; bottom panel: scaling off.]
Limited Effectiveness of Limited Memory BFGS
We have observed that the addition of nonsmoothness to a problem, convex or nonconvex, creates great difficulties for Limited Memory BFGS, with and without scaling, even when the dimension of the V-space is less than the size of the memory.

Azam Asl’s result establishes failure of L-BFGS-1 for a specific f when scaling is on; no such result is proved yet when scaling is off.

We have also investigated Limited Memory Gradient Sampling, which does not work well either.
Other Ideas for Large Scale Nonsmooth Optimization
■ Exploit structure! Lots of work on this has been done, e.g. using proximal point methods or ADMM (Alternating Direction Method of Multipliers)

■ Smoothing! Lots of work on this has been done too, most notably by Yu. Nesterov

■ Automatic Differentiation (AD): A. Griewank et al.

■ Stochastic Subgradient Method (D. Davis and D. Drusvyatskiy, 2018)
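As a tiny illustration of the smoothing idea: a Huber-type smooth approximation of |t| (our choice for illustration; Nesterov's smoothing framework is more general) replaces the kink at 0 with a quadratic, at the cost of a controllable approximation error:

```python
import numpy as np

def huber_abs(t, mu):
    # Smooth approximation to |t| with parameter mu > 0:
    # quadratic for |t| <= mu, slope +-1 outside.  It is C^1 everywhere,
    # underestimates |t| by at most mu/2, and has a gradient that is
    # Lipschitz with constant 1/mu -- so smooth methods apply, with the
    # usual trade-off: smaller mu means better accuracy but worse conditioning.
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= mu, t * t / (2.0 * mu), np.abs(t) - mu / 2.0)
```

A common scheme is to minimize the smoothed surrogate for a decreasing sequence of mu values, warm-starting each solve from the previous solution.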
Concluding Remarks
Summary
Gradient descent frequently fails on nonsmooth problems.

Gradient Sampling is a simple method for nonsmooth, nonconvex optimization for which a convergence theory is known, but it is too expensive to use in most applications.

BFGS — the full version — is remarkably effective on nonsmooth problems, but little theory is known.

Limited Memory BFGS is not so effective on nonsmooth problems.

Diabolical nonconvex problems such as Nesterov’s Chebyshev-Rosenbrock problems can be very difficult, especially in the nonsmooth case.

Our software, HANSO and GRANSO, is available (unconstrained and constrained), along with HIFOO (H-infinity fixed order optimization) for controller design, which has been used successfully in many applications.
Our Papers
J. V. Burke, A. S. Lewis and M. L. Overton, A Robust Gradient Sampling Method for Nonsmooth, Nonconvex Optimization, SIAM J. Optimization, 2005

M. Gurbuzbalaban and M. L. Overton, On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functions, SIAM J. Optimization, 2012

A. S. Lewis and M. L. Overton, Nonsmooth Optimization via Quasi-Newton Methods, Math. Programming, 2013

F. E. Curtis, T. Mitchell and M. L. Overton, A BFGS-SQP Method for Nonsmooth, Nonconvex, Constrained Optimization and its Evaluation using Relative Minimization Profiles, Optimization Methods and Software, 2016

A. Asl and M. L. Overton, Analysis of the Gradient Method with an Armijo-Wolfe Line Search on a Class of Nonsmooth Convex Functions, arXiv, 2017

J. V. Burke, F. E. Curtis, A. S. Lewis, M. L. Overton and L. Simoes, Gradient Sampling Methods for Nonsmooth Optimization, arXiv, 2018
Papers, software are available at www.cs.nyu.edu/overton.