Optimization 101
CSE P576
David M. Rosen
Recap
First half:
• Photogrammetry and bundle adjustment
• Maximum likelihood estimation
This half:
• Basic theory of optimization
(i.e. how to actually do MLE)
The Main Idea
Given: f: ℝⁿ → ℝ, we want to
min_x f(x)
Problem: We have no idea how to actually do this…
Main idea: Let's approximate f with a simple model function m, and use that to search for a minimizer of f.
Optimization Meta-Algorithm
Given: A function f: ℝⁿ → ℝ and an initial guess x_0 ∈ ℝⁿ for a minimizer
Iterate:
• Construct a model m_i(h) ≈ f(x_i + h) of f near x_i.
• Use m_i to search for a descent direction h (f(x + h) < f(x))
• Update x_{i+1} ← x_i + h
until convergence
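To make the loop concrete, here is a minimal Python sketch of this meta-algorithm; the names optimize and propose_step are hypothetical stand-ins (not from the slides), with propose_step(x) playing the role of "build a model near x and use it to pick a step h".

```python
# Minimal skeleton of the meta-algorithm; each concrete method below fills in
# how the local model is built and how the step h is chosen.
def optimize(f, propose_step, x0, max_iters=100):
    x = x0
    for _ in range(max_iters):
        h = propose_step(x)      # model construction + search for a descent step
        if f(x + h) >= f(x):     # crude stand-in for "until convergence"
            break
        x = x + h                # x_{i+1} <- x_i + h
    return x
```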
A first example
Let's consider applying the meta-algorithm above to minimize
f(x) = x⁴ − 3x² + x + 2
starting at x_0 = −1/2.
Q: How can we approximate (model) f near x_0?
A: Let's try linearizing! Take
m_0(h) = f(x_0) + f′(x_0)h
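As a quick numerical check of this linearization (numbers worked out by hand, not taken from the slides): f(x_0) = 0.8125 and f′(x_0) = 3.5, so the model m_0 predicts that f decreases for h < 0, and a small step to the left does decrease f.

```python
# Hand-checkable numbers for the linearization at x0 = -1/2.
def f(x):
    return x**4 - 3*x**2 + x + 2

def fprime(x):
    return 4*x**3 - 6*x + 1

x0 = -0.5
print(f(x0))         # 0.8125
print(fprime(x0))    # 3.5, so m0(h) = f(x0) + f'(x0) h decreases for h < 0
print(f(x0 - 0.1))   # 0.4496: a small step against the slope really does decrease f
```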
Gradient descent
Given:
• A function f: ℝⁿ → ℝ
• An initial guess x_0 ∈ ℝⁿ for a minimizer
• Sufficient decrease parameter c ∈ (0, 1), stepsize shrinkage parameter β ∈ (0, 1)
• Gradient tolerance ε > 0
Iterate:
• Compute the search direction d = −∇f(x_i) at x_i
• Set the initial stepsize α = 1
• Backtracking line search: update α ← βα until the Armijo-Goldstein sufficient decrease condition
f(x_i + αd) < f(x_i) − cα‖d‖²
is satisfied
• Update x_{i+1} ← x_i + αd
until ‖∇f(x_i)‖ < ε
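A minimal Python sketch of this algorithm, assuming f and ∇f are supplied as callables; the function name gradient_descent, the max_iters cap, and the stepsize floor are safeguards I added, not part of the slide.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, c=0.5, beta=0.5, eps=1e-3, max_iters=10_000):
    """Gradient descent with Armijo-Goldstein backtracking line search."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]                          # record f(x_i) so it can be plotted vs. i
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:           # convergence test: ||grad f(x_i)|| < eps
            break
        d = -g                                # search direction d = -grad f(x_i)
        alpha, fx = 1.0, f(x)                 # initial stepsize alpha = 1
        # Shrink alpha until f(x + alpha d) < f(x) - c alpha ||d||^2
        while f(x + alpha * d) >= fx - c * alpha * np.dot(d, d):
            alpha *= beta
            if alpha < 1e-16:                 # safeguard, not on the slide
                break
        x = x + alpha * d                     # x_{i+1} <- x_i + alpha d
        history.append(f(x))
    return x, history
```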
Exercise: Minimizing a quadratic
Try minimizing the quadratic:
f(x, y) = x² − xy + s·y²
using gradient descent, starting at x_0 = (1, 1) and using c, β = 1/2 and ε = 10⁻³, for a few different values of s, say
s ∈ {1, 10, 100, 1000}
Q: If you plot the function value f(x_i) vs. the iteration number i, what do you notice?
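One possible driver for this exercise, reusing the gradient_descent sketch above. The exact counts depend on the line-search details, but you should see the number of iterations grow sharply with s; this is the conditioning issue discussed next.

```python
import numpy as np

for s in [1, 10, 100, 1000]:
    f = lambda v, s=s: v[0]**2 - v[0]*v[1] + s * v[1]**2
    grad_f = lambda v, s=s: np.array([2*v[0] - v[1], -v[0] + 2*s*v[1]])
    x, history = gradient_descent(f, grad_f, x0=[1.0, 1.0],
                                  c=0.5, beta=0.5, eps=1e-3, max_iters=100_000)
    # history holds f(x_i) for each iterate, ready to plot vs. iteration number
    print(f"s = {s:5d}: {len(history) - 1} iterations, final f = {history[-1]:.3e}")
```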
Exercise: Minimizing a quadratic
[Figures: gradient descent results on this quadratic for s = 1, s = 10, and s = 100.]
The problem of conditioning
Gradient descent doesn't perform well when f is poorly conditioned (has "stretched" contours).
Q: How can we improve our local model
m_i(h) = f(x_i) + ∇f(x_i)ᵀh
so that it handles curvature better?
Second-order methods
Let's try adding in curvature information using a second-order model for f:
m_i(h) = f(x_i) + ∇f(x_i)ᵀh + ½ hᵀ∇²f(x_i)h
NB: If ∇²f(x_i) ≻ 0, then m_i(h) has a unique minimizer:
h_N = −[∇²f(x_i)]⁻¹ ∇f(x_i)
In that case, using the update rule
x_{i+1} ← x_i + h_N
gives Newton's method.
Exercise: Minimizing a quadratic
Newton's method
Given:
• A function f: ℝⁿ → ℝ
• An initial guess x_0 ∈ ℝⁿ for a minimizer
• Gradient tolerance ε > 0
Iterate:
• Compute the gradient ∇f(x_i) and Hessian ∇²f(x_i)
• Compute the Newton step: h_N = −[∇²f(x_i)]⁻¹ ∇f(x_i)
• Update x_{i+1} ← x_i + h_N
until ‖∇f(x_i)‖ < ε
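A minimal sketch of Newton's method as stated above; note that it solves the Newton system with numpy.linalg.solve rather than forming an explicit inverse (see the linear algebra slides below), and max_iters is an added safeguard.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, eps=1e-3, max_iters=100):
    """Newton's method as stated above (no line search or other globalization)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:        # until ||grad f(x_i)|| < eps
            break
        H = hess_f(x)
        h_N = np.linalg.solve(H, -g)       # Newton step: solve H h_N = -g
        x = x + h_N                        # x_{i+1} <- x_i + h_N
    return x
```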
Let's try minimizing the quadratic
f(x, y) = x² − xy + s·y²
again, this time using Newton's method, starting at x_0 = (1, 1) and using ε = 10⁻³, for
s ∈ {1, 10, 100, 1000}
If you plot the function value f(x_i) vs. the iteration number i, what do you notice?
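A possible driver, reusing the newton_method sketch above. Because this f is a positive-definite quadratic for every s in the list, the second-order model is exact, so a single Newton step should land on the minimizer (the origin) regardless of s.

```python
import numpy as np

for s in [1, 10, 100, 1000]:
    grad_f = lambda v, s=s: np.array([2*v[0] - v[1], -v[0] + 2*s*v[1]])
    hess_f = lambda v, s=s: np.array([[2.0, -1.0], [-1.0, 2.0*s]])   # constant Hessian
    x = newton_method(grad_f, hess_f, x0=[1.0, 1.0])
    print(f"s = {s:5d}: minimizer found at {x}")   # (0, 0) after one Newton step
```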
Quasi-Newton methods
Newton's method is fast! (It has a quadratic convergence rate.)
But:
• h_N is only guaranteed to be a descent direction if ∇²f(x_i) ≻ 0
• Computing exact Hessians can be expensive!
Quasi-Newton methods: use a positive-definite approximate Hessian B_i in the model function:
m_i(h) = f(x_i) + ∇f(x_i)ᵀh + ½ hᵀB_i h
m_i(h) always has a unique minimizer:
h_QN = −B_i⁻¹ ∇f(x_i)
and h_QN is always a descent direction!
Quasi-Newton method with line search
Given:
• A function f: ℝⁿ → ℝ
• An initial guess x_0 ∈ ℝⁿ for a minimizer
• Sufficient decrease parameter c ∈ (0, 1), stepsize shrinkage parameter β ∈ (0, 1)
• Gradient tolerance ε > 0
Iterate:
• Compute the gradient g_i = ∇f(x_i) and a positive-definite Hessian approximation B_i at x_i
• Compute the quasi-Newton step: h_QN = −B_i⁻¹ g_i
• Set the initial stepsize α = 1
• Backtracking line search: update α ← βα until the Armijo-Goldstein sufficient decrease condition
f(x_i + α h_QN) < f(x_i) + cα g_iᵀ h_QN
is satisfied
• Update x_{i+1} ← x_i + α h_QN
until ‖g_i‖ < ε
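A generic sketch of this loop, assuming the caller supplies a hess_approx(x) callback that returns a positive-definite B_i (the Gauss-Newton matrix below is one concrete choice); the default c = 1e-4 is a common choice rather than a value from the slides.

```python
import numpy as np

def quasi_newton(f, grad_f, hess_approx, x0, c=1e-4, beta=0.5, eps=1e-3, max_iters=1000):
    """Quasi-Newton method with Armijo-Goldstein backtracking, as on the slide above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:                 # until ||g_i|| < eps
            break
        B = hess_approx(x)                          # positive-definite Hessian approximation
        h = np.linalg.solve(B, -g)                  # quasi-Newton step h_QN = -B^{-1} g
        alpha, fx = 1.0, f(x)
        # Shrink alpha until f(x + alpha h) < f(x) + c alpha g^T h
        while f(x + alpha * h) >= fx + c * alpha * np.dot(g, h):
            alpha *= beta
            if alpha < 1e-16:                       # safeguard, not on the slide
                break
        x = x + alpha * h                           # x_{i+1} <- x_i + alpha h_QN
    return x
```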
Quasi-Newton methods (cont'd)
Different choices of B_i give different QN algorithms.
Can trade off accuracy of B_i with computational cost.
LOTS of possibilities here!
• Gauss-Newton
• Levenberg-Marquardt
• (L-)BFGS
• Broyden
• etc.
Don't be afraid to experiment!
Special case: The Gauss-Newton method
A quasi-Newton algorithm for minimizing a nonlinear least-squares objective:
f(x) = ‖r(x)‖²
It uses the local quadratic model obtained by linearizing r:
m_i(h) = ‖r(x_i) + J(x_i)h‖²
where J(x_i) ≜ ∂r/∂x(x_i) is the Jacobian of r.
Equivalently: g_i = 2 J(x_i)ᵀ r(x_i) and B_i = 2 J(x_i)ᵀ J(x_i).
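A minimal sketch under these definitions (no line search or damping, so it is a bare Gauss-Newton iteration rather than a production solver).

```python
import numpy as np

def gauss_newton(r, jac, x0, eps=1e-3, max_iters=100):
    """Gauss-Newton sketch for f(x) = ||r(x)||^2, with g = 2 J^T r and B = 2 J^T J."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        J, res = jac(x), r(x)
        g = 2.0 * J.T @ res                 # gradient of ||r(x)||^2
        if np.linalg.norm(g) < eps:
            break
        B = 2.0 * J.T @ J                   # Gauss-Newton Hessian approximation
        h = np.linalg.solve(B, -g)          # minimizer of ||r(x_i) + J(x_i) h||^2
        x = x + h
    return x
```

In practice the step is usually computed by solving the linear least-squares problem J(x_i)h ≈ −r(x_i) directly (e.g. with numpy.linalg.lstsq) rather than by forming the normal equations as above, which is more numerically robust; that is exactly the kind of linear-algebra consideration the next slides raise.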
A word on linear algebra
The dominant cost (memory + time) in a QN method is linear algebra:
• Constructing the Hessian approximation B_i
• Solving the linear system B_i h_QN = −g_i
Fast/robust linear algebra is essential for efficient QN methods:
• Take advantage of sparsity in B_i!
• NEVER, NEVER, NEVER INVERT B_i directly!!!
• It's incredibly expensive and unnecessary
• Use instead [cf. Golub & Van Loan's Matrix Computations]:
  • Matrix factorizations: QR, Cholesky, LDLᵀ
  • Iterative linear methods: conjugate gradient
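A small sketch of the "factor, don't invert" advice using SciPy's dense Cholesky routines; the matrix B and vector g here are made-up stand-ins. For large sparse B_i, a sparse factorization or an iterative method such as conjugate gradient would be used instead.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Stand-in positive-definite Hessian approximation B and gradient g (illustration only).
B = np.array([[4.0, 1.0],
              [1.0, 3.0]])
g = np.array([1.0, 2.0])

factor = cho_factor(B)         # Cholesky factorization B = L L^T, computed once
h = cho_solve(factor, -g)      # solve B h = -g by substitution; no inverse is formed

assert np.allclose(B @ h, -g)
```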
Optimization methods: Cheat sheet
First-order methods
Use only gradient information
β’ Pro: Local model is inexpensive
β’ Con: Slow (linear) convergence rate
Canonical example: Gradient descent
Best for:
β’ Moderate accuracy
β’ Very large problems
Second-order methods
Use (some) 2nd-order information
β’ Pro: Fast (superlinear) convergence
β’ Con: Local model can be expensive
Canonical example: Newton's method
Best for:
β’ High accuracy
β’ Small to moderately large problems