COM S 578X: Optimization for Machine Learning
Lecture Note 7: Accelerated First-Order Methods
Jia (Kevin) Liu
Assistant Professor, Department of Computer Science
Iowa State University, Ames, Iowa, USA
Fall 2019
JKL (CS@ISU) COM S 578X: Lecture 7 1 / 26
Outline
In this lecture:
Heavy-ball method
Conjugate gradient
Nesterov optimal first-order method
FISTA
Barzilai-Borwein
Adding simple constraints
Extending to regularized optimization
First-Order Methods
So far, you’ve seen gradient descent – the most natural first-order method
GD has a sublinear O(1/k) rate for F^{∞,1}_L and a slow linear rate for S^{2,1}_{µ,L}. Can we do better?
First-Order Method (Nesterov): An iterative method that generates a sequence of test points x_k such that
x_k ∈ x_0 + span{∇f(x_0), ∇f(x_1), . . . , ∇f(x_{k−1})}, k ≥ 1
Theorem 1 (Nesterov)
For any k with 1 ≤ k ≤ (1/2)(n − 1) and any x_0 ∈ R^n, there exists a function f ∈ F^{∞,1}_L such that, for any first-order method, we have
f(x_k) − f(x*) ≥ 3L‖x_0 − x*‖₂² / (32(k + 1)²) = O(1/k²)
A pretty large gap between O(1/k) and O(1/k²). So the answer is: yes, we can!
Heavy-Ball
Historical Perspective
- First proposed by Polyak in the 1960s [Polyak ’64]
- Original goal: accelerate the gradient descent method
- Inspired by the 2nd-order ODE of heavy-body motion
- Adopted in ML very early [Rumelhart et al. ’86]
Basic Idea:
- Search direction: linear combination of the current gradient (gravity) and the previous search direction (momentum)
- Also called a two-step method; could be N-step in the general case
The Heavy-Ball Algorithm
Consider the unconstrained optimization problem, with f smooth and convex:
min_{x∈R^n} f(x)
Denote the optimal value as f* = min_x f(x) and an optimal solution as x*.
Heavy-Ball Method
Choose initial point x_0 ∈ R^n and let x_{−1} = x_0. Repeat:
x_{k+1} = x_k − s_k ∇f(x_k) + β_k (x_k − x_{k−1}), k = 0, 1, 2, . . .
Stop if some stopping criterion is satisfied.
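The iteration above takes only a few lines of code. A minimal NumPy sketch follows; the quadratic test function and Polyak's constant parameters below are illustrative choices, not part of the slide:

```python
import numpy as np

def heavy_ball(grad_f, x0, s, beta, iters=300):
    """Heavy-ball: x_{k+1} = x_k - s*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = x0.copy()   # x_{-1} = x_0
    x = x0.copy()
    for _ in range(iters):
        x_next = x - s * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Example: f(x) = 0.5 x^T diag(mu, L) x with mu = 1, L = 100 (minimizer x* = 0),
# using Polyak's constant step-size and momentum from Theorem 2 below.
mu, L = 1.0, 100.0
H = np.diag([mu, L])
s = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
x = heavy_ball(lambda v: H @ v, np.array([1.0, 1.0]), s, beta)
```

With these parameters the iterates contract at roughly the (√κ−1)/(√κ+1) rate analyzed next.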
Convergence Rate Analysis: Strongly Convex Case
Theorem 2 (Polyak)
If f ∈ S^{2,1}_{µ,L}, Heavy-Ball with constant step-size s = 4/(√L + √µ)² and constant momentum coefficient β = ((√L − √µ)/(√L + √µ))² satisfies:
‖x_k − x*‖₂ ≤ C ((√κ − 1)/(√κ + 1))^k ‖x_0 − x*‖₂,
where κ ≜ L/µ and C is a constant (the proof below gives C = √2).
HB Convergence Rate Analysis: Strongly Convex Case
Proof Sketch.
Consider the new variable: z_k = [(x_k − x*)ᵀ, (x_{k−1} − x*)ᵀ]ᵀ
Under constant step-size s and momentum coefficient β, the HB dynamics can be rewritten in terms of z_k:
‖z_{k+1}‖₂ = ‖[x_{k+1} − x*; x_k − x*]‖₂
= ‖[x_k − s∇f(x_k) + β(x_k − x_{k−1}) − x*; x_k − x*]‖₂
= ‖[(1+β)I, −βI; I, 0][x_k − x*; x_{k−1} − x*] − s[∇f(x_k); 0]‖₂   (1)
Note that, since ∇f(x*) = 0 at the unconstrained minimum,
∇f(x_k) = ∇f(x*) + ∫₀¹ ∇²f(x* + τ(x_k − x*))(x_k − x*) dτ
= [∫₀¹ ∇²f(x* + τ(x_k − x*)) dτ](x_k − x*) ≜ B(x_k)(x_k − x*)
Plugging this into (1), we obtain:
HB Convergence Rate Analysis: Strongly Convex Case
z_{k+1} = [x_{k+1} − x*; x_k − x*] = [x_k − s∇f(x_k) + β(x_k − x_{k−1}) − x*; x_k − x*]
= [(1+β)I − sB(x_k), −βI; I, 0] z_k ≜ Γ(x_k) z_k   (2)
So, (2) implies ‖z_k‖₂ = ‖∏_{i=0}^{k−1} Γ(x_i) z_0‖₂ ≤ max_{x_i, ∀i} ‖∏_{i=0}^{k−1} Γ(x_i) z_0‖₂.
Noting the identical structure of Γ(x_i) for all i (implying index independence), let
Γ ≜ [(1+β)I − sB(x△), −βI; I, 0],
where x△ is a maximizer of the function ‖(Γ(x))^k z_0‖. Then it can be shown that
ρ(Γ) ≤ √β, where √β = max{|1 − √(sµ)|, |1 − √(sL)|}.
HB Convergence Rate Analysis: Strongly Convex Case
Plugging this into (2), we have:
‖z_{k+1}‖₂ ≤ max{|1 − √(sµ)|, |1 − √(sL)|} ‖z_k‖₂.   (3)
If we let s = 4/(√µ + √L)², we have:
|1 − √(sµ)| = |1 − √(sL)| = (√L − √µ)/(√L + √µ) = (√κ − 1)/(√κ + 1).
Substituting this back into (3), and using ‖z_0‖₂ = √2 ‖x_0 − x*‖₂ (since x_{−1} = x_0), yields:
‖x_k − x*‖₂ ≤ ‖z_k‖₂ ≤ ((√κ − 1)/(√κ + 1))^k √2 ‖x_0 − x*‖₂.
GD vs. HB under Strong Convexity
Gradient Descent: linear rate (κ − 1)/(κ + 1)
Heavy-Ball: linear rate (√κ − 1)/(√κ + 1)
Big difference! For ‖x_k − x*‖ to reach ε-accuracy, need k large enough so that
((κ − 1)/(κ + 1))^k ≤ ε ⇒ k ≥ (κ/2)|log ε|   (Gradient Descent)
((√κ − 1)/(√κ + 1))^k ≤ ε ⇒ k ≥ (√κ/2)|log ε|   (Heavy-Ball)
A factor-of-√κ difference; e.g., if κ = 1000 (not at all uncommon in ML problems), we need approximately 30 times fewer iterations.
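A quick sanity check of these two iteration counts; the snippet below just evaluates the closed-form rates for κ = 1000 (no optimizer is run):

```python
import numpy as np

def iters_to_eps(rate, eps):
    # smallest k with rate**k <= eps, i.e., k >= |log eps| / |log rate|
    return int(np.ceil(np.log(eps) / np.log(rate)))

kappa, eps = 1000.0, 1e-6
gd_rate = (kappa - 1) / (kappa + 1)
hb_rate = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
k_gd = iters_to_eps(gd_rate, eps)   # on the order of (kappa/2)|log eps|
k_hb = iters_to_eps(hb_rate, eps)   # on the order of (sqrt(kappa)/2)|log eps|
# the ratio k_gd / k_hb comes out close to sqrt(kappa) ~ 31.6
```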
Convergence Rate Analysis: Weakly Convex Case
Theorem 3 (Ghadimi-Feyzmahdavian-Johansson)
If f ∈ F^{1,1}_L, Heavy-Ball with β_k = k/(k + 2) and α_k = α_0/(k + 2), where α_0 ∈ (0, 1/L], satisfies:
f(x_k) − f(x*) ≤ ‖x_0 − x*‖₂² / (2α_0(k + 1)) = O(1/k)
Remark:
HB has essentially the same sublinear rate as GD in the weakly convex case.
Conjugate Gradient
Definition 4 (The Notion of Conjugacy)
Let H ∈ R^{n×n} be symmetric. The vectors d_1, . . . , d_n are called H-conjugate (or simply conjugate) if they are linearly independent and if d_iᵀ H d_j = 0 for all i ≠ j.
Conjugacy is significant in minimizing quadratic functions (which arise often in ML):
Let f(x) = cᵀx + (1/2)xᵀHx, and suppose d_1, . . . , d_n are H-conjugate. Then f can be minimized by successively minimizing along d_1, . . . , d_n via exact line searches (see Theorem 5).
Note: Conjugate directions are not unique
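As one concrete instance, the orthonormal eigenvectors of a symmetric H form a valid set of H-conjugate directions, which is easy to verify numerically; the random positive definite H below is just an example:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
H = M.T @ M + 4.0 * np.eye(4)     # a symmetric positive definite example
_, V = np.linalg.eigh(H)          # columns of V: orthonormal eigenvectors of H
G = V.T @ H @ V                   # pairwise products d_i^T H d_j
offdiag = G - np.diag(np.diag(G))
# off-diagonal entries are ~0: d_i^T H d_j = 0 for all i != j
```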
Conjugate Directions: Geometric Interpretation
Quadratic Minimization w/ Conjugacy: Finite Convergence
Theorem 5 (Expanding subspace property)
Let f(x) = cᵀx + (1/2)xᵀHx, where H ∈ R^{n×n} is symmetric. Let d_1, . . . , d_n be H-conjugate, and let x_1 be an arbitrary starting point. Let s_k = arg min_{s∈R} f(x_k + s d_k), k = 1, . . . , n, and let x_{k+1} = x_k + s_k d_k. Then, for k = 1, . . . , n, we must have:
∇f(x_{k+1})ᵀd_j = 0, for j = 1, . . . , k
∇f(x_1)ᵀd_k = ∇f(x_k)ᵀd_k
x_{k+1} = arg min { f(x) : x − x_1 ∈ span{d_1, . . . , d_k} }
x_{n+1} is a minimizing point of f(x) over R^n
Geometric Interpretation
[Figure: iterates x_1 → x_2 → x_3 generated along conjugate directions d_1, d_2, with gradients ∇f(x_2) and ∇f(x_3) shown.]
Conjugate Gradient (Standard Version)
1. Initial point x_1, y_1 = x_1, d_1 = −∇f(y_1), k = j = 1.
2. If ‖∇f(y_j)‖ ≤ ε, stop. Otherwise, let s_j = arg min_{s≥0} f(y_j + s d_j) and y_{j+1} = y_j + s_j d_j. If j = n, go to Step 4.
3. Let d_{j+1} = −∇f(y_{j+1}) + γ_j d_j. Let j = j + 1 and go to Step 2.
4. Let x_{k+1} = y_1 = y_{n+1}, d_1 = −∇f(y_1). Let j = 1, k = k + 1. Go to Step 2.
Can be viewed as heavy-ball, with β_j = s_j γ_j / s_{j−1}. But CG can be implemented without requiring knowledge (or estimation) of L and µ.
- Choose s_j to (approximately) minimize f along d_j. Variants of choosing γ_j:
  Fletcher-Reeves: γ_j^FR = ‖∇f(y_{j+1})‖² / ‖∇f(y_j)‖²
  Polak-Ribière: γ_j^PR = ∇f(y_{j+1})ᵀ(∇f(y_{j+1}) − ∇f(y_j)) / ‖∇f(y_j)‖²
  Hestenes-Stiefel: γ_j^HS = s_j ∇f(y_{j+1})ᵀ(∇f(y_{j+1}) − ∇f(y_j)) / ((y_{j+1} − y_j)ᵀ(∇f(y_{j+1}) − ∇f(y_j)))
- All three are equivalent if f is convex quadratic & exact line search is used ([BSS Ch. 8])
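For the quadratic case, where the exact line search has a closed form, the standard version above can be sketched as follows (the test problem is made up; on a quadratic all three γ choices coincide, so Fletcher-Reeves is used):

```python
import numpy as np

def cg_quadratic(H, c, x0):
    """CG for f(x) = c^T x + 0.5 x^T H x (H symmetric positive definite).

    Exact line search; Fletcher-Reeves gamma. Terminates in <= n steps
    in exact arithmetic (Theorem 5)."""
    x = x0.copy()
    g = H @ x + c                     # gradient of f at x
    d = -g
    for _ in range(len(x0)):
        s = (g @ g) / (d @ H @ d)     # exact line search along d
        x = x + s * d
        g_new = H @ x + c
        gamma = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
        d = -g_new + gamma * d
        g = g_new
    return x

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
H = M.T @ M + np.eye(5)               # symmetric positive definite
c = rng.standard_normal(5)
x = cg_quadratic(H, c, np.zeros(5))   # x should satisfy Hx + c = 0
```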
Conjugate Gradient
Nonlinear CG: variants include Fletcher-Reeves, Polak-Ribière, Hestenes-Stiefel, etc.
Restart periodically with d_1 = −∇f(x_k) (e.g., every n iterations, or when d_j fails to be a descent direction)
For quadratic f(·), convergence analysis is based on the eigenvalues of H and Chebyshev polynomials, via min-max arguments. We get:
- Finite termination in at most as many iterations as there are distinct eigenvalues
- Asymptotic linear convergence with rate approximately (√κ − 1)/(√κ + 1) (like heavy-ball; see [Nocedal and Wright, Ch. 5])
- A similar sublinear O(1/k) rate for f ∈ F^{1,1}_L
Can we close the gap between O(1/k) and O(1/k²) for F^{1,1}_L?
Nesterov Accelerated First-Order Method
Consider an unconstrained convex optimization problem:
min_{x∈R^n} f(x), where f(·) has an L-Lipschitz continuous gradient.
Nesterov’s AGD method [Nesterov ’83]:
1. Choose initial x_0. Let y_0 = x_0. Choose α_0 ∈ (0, 1).
2. Iteration k ≥ 0:
   a) Compute ∇f(y_k) and let
      x_{k+1} = y_k − (1/L)∇f(y_k).   (regular gradient step)
   b) Compute α_{k+1} ∈ (0, 1) by solving α_{k+1}² + [α_k² − (1/κ)]α_{k+1} − α_k² = 0.
      Let β_k = α_k(1 − α_k)/(α_k² + α_{k+1}), and
      y_{k+1} = x_{k+1} + β_k(x_{k+1} − x_k).   (“slide” in the direction of x)
Still works for weakly convex functions (κ = ∞)
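A direct transcription of this scheme; the quadratic test function is an arbitrary example, and in the weakly convex case one would set q = 1/κ = 0:

```python
import numpy as np

def nesterov_agd(grad_f, x0, L, kappa, alpha0=0.5, iters=500):
    """Nesterov's AGD: gradient step at y_k, then momentum extrapolation."""
    x = x0.copy()
    y = x0.copy()
    alpha = alpha0
    q = 1.0 / kappa                    # set q = 0 for the weakly convex case
    for _ in range(iters):
        x_next = y - grad_f(y) / L     # regular gradient step
        # alpha_{k+1}: positive root of a^2 + (alpha_k^2 - q) a - alpha_k^2 = 0
        b = alpha ** 2 - q
        alpha_next = 0.5 * (-b + np.sqrt(b ** 2 + 4.0 * alpha ** 2))
        beta = alpha * (1.0 - alpha) / (alpha ** 2 + alpha_next)
        y = x_next + beta * (x_next - x)
        x, alpha = x_next, alpha_next
    return x

mu, L = 1.0, 100.0
H = np.diag([mu, L])                   # f(x) = 0.5 x^T H x, minimizer x* = 0
x = nesterov_agd(lambda v: H @ v, np.array([1.0, 1.0]), L, L / mu)
```

Note α_0 = 0.5 satisfies the condition α_0 ≥ 1/√κ of Theorem 6 below.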
Nesterov’s Accelerated First-Order Method
[Figure: the interleaved sequences x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2} of Nesterov’s method.]
In simplest form (constant step-sizes):
x_{k+1} = y_k − (1/L)∇f(y_k)
y_{k+1} = x_{k+1} + β(x_{k+1} − x_k).
Nesterov’s AGD achieves:
- Strongly convex: linear convergence rate depending only on √κ, where κ ≜ L/µ
- Weakly convex: sublinear O(1/k²) (order-optimal).
Convergence Performance of Nesterov’s Method
Theorem 6
The Nesterov accelerated first-order method with α_0 ≥ 1/√κ satisfies:
f(x_k) − f(x*) ≤ c_1 min{ (1 − 1/√κ)^k, 4L/(√L + c_2 k)² },
where the constants c_1 and c_2 depend on x_0, α_0, and L.
Remark:
- Linear convergence rate similar to Heavy-Ball if f ∈ S^{2,1}_{µ,L}
- Sublinear O(1/k²) if f ∈ F^{1,1}_L
In the special case where α_0 = 1/√κ, this scheme yields:
α_k = 1/√κ,   β_k = (√κ − 1)/(√κ + 1)
FISTA
Beck and Teboulle proposed a similar momentum-based algorithm:
1. Choose initial x_0. Let y_1 = x_0, t_1 = 1.
2. Iteration k ≥ 1: do the following computations:
   x_k ← y_k − (1/L)∇f(y_k);
   t_{k+1} ← (1/2)(1 + √(1 + 4t_k²));
   y_{k+1} ← x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1})
For weakly convex f , converges with f(x_k) − f(x*) = O(1/k²)
When L is unknown, increase an estimate of L until it’s big enough
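A minimal FISTA sketch for a smooth least-squares f; the problem data are random placeholders, and with a nonsmooth regularizer the gradient step would become a proximal step:

```python
import numpy as np

def fista_smooth(grad_f, x0, L, iters=5000):
    """FISTA momentum applied to a smooth f (no regularizer), as above."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = y - grad_f(y) / L                        # gradient step at y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

# Least squares: f(x) = 0.5 ||Ax - b||^2, so L = lambda_max(A^T A)
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A.T @ A)[-1]
x = fista_smooth(lambda v: A.T @ (A @ v - b), np.zeros(5), L)
```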
Nonmonotone Gradient Method: Barzilai-Borwein
Barzilai and Borwein (BB) proposed an unusual choice of s_k
Allows f to increase (sometimes a lot) on some steps: non-monotone
x_{k+1} = x_k − s_k∇f(x_k),   s_k = arg min_s ‖r_k − s q_k‖₂²,
where r_k ≜ x_k − x_{k−1} and q_k ≜ ∇f(x_k) − ∇f(x_{k−1})
Explicitly, we have
s_k = r_kᵀq_k / (q_kᵀq_k)
Note that for f(x) = (1/2)xᵀAx, we have
s_k = r_kᵀA r_k / (r_kᵀA² r_k) ∈ [1/L, 1/µ]
BB can be viewed as quasi-Newton, with the Hessian approximated by s_k^{−1} I
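The BB step needs only the last two iterates and gradients. A sketch on a toy quadratic; the initialization step size s0 and the test function are arbitrary choices:

```python
import numpy as np

def bb_gradient(grad_f, x0, s0=1e-3, iters=200):
    """Gradient descent with the BB step s_k = r_k^T q_k / q_k^T q_k."""
    x_prev = x0.copy()
    g_prev = grad_f(x_prev)
    x = x_prev - s0 * g_prev           # one plain step to make r_k, q_k available
    for _ in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) < 1e-12:  # stop before q_k underflows to zero
            break
        r = x - x_prev                 # r_k = x_k - x_{k-1}
        q = g - g_prev                 # q_k = grad f(x_k) - grad f(x_{k-1})
        s = (r @ q) / (q @ q)
        x_prev, g_prev = x, g
        x = x - s * g
    return x

mu, L = 1.0, 100.0
H = np.diag([mu, L])                   # f(x) = 0.5 x^T H x, minimizer x* = 0
x = bb_gradient(lambda v: H @ v, np.array([1.0, 1.0]))
```

The objective is not monotonically decreasing along these iterates, consistent with the non-monotone behavior noted above.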
Comparison: BB vs Gradient Descent
Extension to Simple Constraints
Constraint set Ω: a (relatively simple) closed convex set
Some algorithms and theory stay largely the same, if we can involve the constraint x ∈ Ω explicitly in the subproblems
Example: Nesterov’s constant-step scheme requires just one calculation to be changed from the unconstrained version, as follows:
1. Choose initial x_0. Let y_0 = x_0. Choose α_0 ∈ (0, 1).
2. Iteration k ≥ 0:
   a) Compute ∇f(y_k) and let
      x_{k+1} = arg min_{y∈Ω} ‖y − (y_k − (1/L)∇f(y_k))‖₂².   (projected gradient step)
   b) Compute α_{k+1} ∈ (0, 1) by solving α_{k+1}² + [α_k² − (1/κ)]α_{k+1} − α_k² = 0.
      Let β_k = α_k(1 − α_k)/(α_k² + α_{k+1}), and
      y_{k+1} = x_{k+1} + β_k(x_{k+1} − x_k).   (“slide” in the direction of x)
Convergence theory is unchanged.
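For a box constraint Ω = [lo, hi]^n, the projected gradient step in a) is just a componentwise clip. A tiny illustration (β_k = 0 below, so only the modified calculation is exercised; the objective is a made-up example):

```python
import numpy as np

def projected_step(y, grad_y, L, lo, hi):
    # arg min_{x in Omega} ||x - (y - (1/L) grad f(y))||^2 for Omega = [lo, hi]^n
    return np.clip(y - grad_y / L, lo, hi)

# f(x) = 0.5 ||x - p||^2 with p outside the box; the constrained minimizer
# is the Euclidean projection of p onto the box.
p = np.array([2.0, -3.0])
lo, hi = -1.0, 1.0
x = np.zeros(2)
for _ in range(50):
    x = projected_step(x, x - p, 1.0, lo, hi)   # grad f(x) = x - p, L = 1
# x converges to clip(p) = [1, -1]
```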
Extension to Regularized Optimization
Consider the following optimization problem with regularization:
min_x f(x) + τψ(x),
where f ∈ F^{1,1}_L and ψ is convex but usually nonsmooth
Often, we only need to change the update to:
x_{k+1} = arg min_x (1/(2s_k))‖x − (x_k + s_k d_k)‖₂² + τψ(x),
where d_k could be a scaled gradient descent step or a deflected gradient (e.g., heavy-ball, Nesterov, etc.), and s_k is the step size
This is also referred to as a shrinkage/thresholding step. More on how to solve the above problem later, when we discuss sparse/regularized optimization
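For the common case ψ(x) = ‖x‖₁, the update above has a closed-form soft-thresholding solution; a sketch, where prox_l1(v, t) solves min_x (1/2)‖x − v‖₂² + t‖x‖₁ elementwise, so the slide’s update becomes prox_l1(x_k + s_k d_k, s_k τ):

```python
import numpy as np

def prox_l1(v, t):
    # arg min_x 0.5*||x - v||^2 + t*||x||_1  (elementwise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def regularized_update(x, d, s, tau):
    # x_{k+1} = arg min_z (1/(2s))||z - (x + s*d)||^2 + tau*||z||_1
    return prox_l1(x + s * d, s * tau)

v = np.array([3.0, -0.5, 1.2])
out = prox_l1(v, 1.0)   # entries shrink toward zero by 1; small ones hit 0
```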
Next Class
Subgradient Method