Gradient methods for unconstrained problems
ELE 522: Large-Scale Optimization for Data Science
Gradient methods for unconstrained problems
Yuxin Chen
Princeton University, Fall 2019
Outline
• Quadratic minimization problems
• Strongly convex and smooth problems
• Convex and smooth problems
• Nonconvex problems
Differentiable unconstrained minimization

minimize_x f(x)   subject to x ∈ Rⁿ

• f (objective or cost function) is differentiable
Iterative descent algorithms

Start with a point x0, and construct a sequence {xt} s.t.

f(xt+1) < f(xt),  t = 0, 1, · · ·

• d is said to be a descent direction at x if

f′(x; d) := lim_{τ↓0} [f(x + τd) − f(x)] / τ = ∇f(x)⊤d < 0   (2.1)

where the limit is the directional derivative of f at x along d

• In each iteration, search in a descent direction:

xt+1 = xt + ηt dt   (2.2)

where dt is a descent direction at xt, and ηt > 0 is the stepsize
Gradient descent (GD)

One of the most important examples of (2.2): gradient descent

xt+1 = xt − ηt∇f(xt)   (2.3)

(figure: GD iterates x0, x1, x2, . . . moving along the negative gradient across the level sets of f)

• traced to Augustin-Louis Cauchy (1847) ...
• descent direction: dt = −∇f(xt)
• a.k.a. steepest descent, since from (2.1) and Cauchy–Schwarz,

arg min_{d: ‖d‖2 ≤ 1} f′(x; d) = arg min_{d: ‖d‖2 ≤ 1} ∇f(x)⊤d = −∇f(x)/‖∇f(x)‖2

i.e., the direction with the greatest rate of objective value improvement
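As an illustration, the update rule (2.3) takes only a few lines of Python. This is a minimal sketch (the helper name, the quadratic test function, and the stepsize are chosen for this example and are not from the slides):

```python
def gradient_descent(grad, x0, eta, num_iters):
    """Iterate x_{t+1} = x_t - eta * grad(x_t), as in (2.3)."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Example: f(x) = 0.5*(x1^2 + 10*x2^2), whose gradient is (x1, 10*x2);
# the minimizer is the origin.
grad = lambda x: [x[0], 10.0 * x[1]]
x_final = gradient_descent(grad, [1.0, 1.0], eta=2.0 / 11.0, num_iters=200)
```

Here eta = 2/11 anticipates the stepsize rule 2/(λ1 + λn) discussed next, with eigenvalues 10 and 1.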
Quadratic minimization problems
Quadratic minimization

To get a sense of the convergence rate of GD, let's begin with quadratic objective functions

minimize_x f(x) := (1/2)(x − x∗)⊤Q(x − x∗)

for some n × n matrix Q ≻ 0, where ∇f(x) = Q(x − x∗)
Convergence for constant stepsizes

Convergence rate: if ηt ≡ η = 2/(λ1(Q) + λn(Q)), then

‖xt − x∗‖2 ≤ ((λ1(Q) − λn(Q))/(λ1(Q) + λn(Q)))^t ‖x0 − x∗‖2

where λ1(Q) (resp. λn(Q)) is the largest (resp. smallest) eigenvalue of Q

• as we will see, η is chosen s.t. |1 − ηλn(Q)| = |1 − ηλ1(Q)|
• the convergence rate is dictated by the condition number λ1(Q)/λn(Q) of Q, or equivalently, max_x λ1(∇²f(x)) / min_x λn(∇²f(x))
• often called linear convergence or geometric convergence, since the error lies below a line on a log-linear plot of error vs. iteration count
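The stated rate is easy to check numerically. A small sketch for a diagonal Q (values chosen for illustration): with η = 2/(λ1 + λn), the error contracts by exactly (λ1 − λn)/(λ1 + λn) per iteration.

```python
import math

# f(x) = 0.5 * x^T Q x with Q = diag(lam1, lamn); the minimizer is x* = 0.
lam1, lamn = 10.0, 1.0
eta = 2.0 / (lam1 + lamn)            # constant stepsize from the theorem
rho = (lam1 - lamn) / (lam1 + lamn)  # predicted contraction factor

x = [1.0, 1.0]
norm0 = math.hypot(x[0], x[1])
for t in range(1, 51):
    # GD step: x <- x - eta * Q x (the gradient of the quadratic)
    x = [(1.0 - eta * lam1) * x[0], (1.0 - eta * lamn) * x[1]]
    # the error never exceeds rho^t * ||x0 - x*||_2
    assert math.hypot(x[0], x[1]) <= rho ** t * norm0 + 1e-12
```

Note that 1 − η·λ1 = −9/11 and 1 − η·λn = 9/11: the stepsize equalizes the two moduli, which is exactly why this η is optimal.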
Convergence for constant stepsizes

Proof: According to the GD update rule,

xt+1 − x∗ = xt − x∗ − ηt∇f(xt) = (I − ηtQ)(xt − x∗)
⟹ ‖xt+1 − x∗‖2 ≤ ‖I − ηtQ‖ ‖xt − x∗‖2

The claim then follows by observing that

‖I − ηQ‖ = max{|1 − ηλ1(Q)|, |1 − ηλn(Q)|}   (remark: the optimal choice is ηt = 2/(λ1(Q) + λn(Q)))
= 1 − 2λn(Q)/(λ1(Q) + λn(Q)) = (λ1(Q) − λn(Q))/(λ1(Q) + λn(Q))

Apply the above bound recursively to complete the proof □
Exact line search

The stepsize rule ηt ≡ η = 2/(λ1(Q) + λn(Q)) relies on the spectrum of Q, which requires preliminary experimentation

Another, more practical strategy is the exact line search rule

ηt = arg min_{η≥0} f(xt − η∇f(xt))   (2.4)
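For the quadratic objective above, the one-dimensional minimization in (2.4) has the closed form ηt = gt⊤gt / gt⊤Qgt (this is derived in the proof two slides ahead). A sketch with a diagonal Q chosen for illustration:

```python
def exact_stepsize(q, g):
    """Closed-form minimizer of f(x - eta*g) for f(x) = 0.5 x^T diag(q) x."""
    num = sum(gi * gi for gi in g)
    den = sum(qi * gi * gi for qi, gi in zip(q, g))
    return num / den if den > 0.0 else 0.0

q = [1.0, 10.0]   # eigenvalues of Q
x = [1.0, 1.0]    # take x* = 0, so grad f(x) = Q x
for _ in range(100):
    g = [qi * xi for qi, xi in zip(q, x)]
    eta = exact_stepsize(q, g)
    x = [xi - eta * gi for xi, gi in zip(x, g)]
```

The iterates zig-zag between two alternating directions, a classical picture for exact line search on ill-conditioned quadratics.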
Convergence for exact line search

Convergence rate: if ηt = arg min_{η≥0} f(xt − η∇f(xt)), then

f(xt) − f(x∗) ≤ ((λ1(Q) − λn(Q))/(λ1(Q) + λn(Q)))^{2t} (f(x0) − f(x∗))

• stated in terms of the objective values
• convergence rate no faster than that of the constant stepsize rule
Convergence for exact line search

Proof: For notational simplicity, let gt = ∇f(xt) = Q(xt − x∗). It can be verified that exact line search gives

ηt = gt⊤gt / (gt⊤Qgt)

This gives

f(xt+1) = (1/2)(xt − ηtgt − x∗)⊤Q(xt − ηtgt − x∗)
= (1/2)(xt − x∗)⊤Q(xt − x∗) − ηt‖gt‖2² + (ηt²/2) gt⊤Qgt
= (1/2)(xt − x∗)⊤Q(xt − x∗) − ‖gt‖2⁴ / (2 gt⊤Qgt)
= (1 − ‖gt‖2⁴ / ((gt⊤Qgt)(gt⊤Q⁻¹gt))) f(xt)

where the last line uses f(xt) = (1/2)(xt − x∗)⊤Q(xt − x∗) = (1/2) gt⊤Q⁻¹gt
Convergence for exact line search

Proof (cont.): From Kantorovich's inequality

‖y‖2⁴ / ((y⊤Qy)(y⊤Q⁻¹y)) ≥ 4λ1(Q)λn(Q) / (λ1(Q) + λn(Q))²,

we arrive at

f(xt+1) ≤ (1 − 4λ1(Q)λn(Q)/(λ1(Q) + λn(Q))²) f(xt) = ((λ1(Q) − λn(Q))/(λ1(Q) + λn(Q)))² f(xt)

This concludes the proof since f(x∗) = min_x f(x) = 0 □
Strongly convex and smooth problems
Strongly convex and smooth problems

Let's now generalize quadratic minimization to a broader class of problems

minimize_x f(x)

where f(·) is strongly convex and smooth

• a twice-differentiable function f is said to be µ-strongly convex and L-smooth if

0 ≺ µI ⪯ ∇²f(x) ⪯ LI, ∀x
Convergence rate for strongly convex and smooth problems

Theorem 2.1 (GD for strongly convex and smooth functions)
Let f be µ-strongly convex and L-smooth. If ηt ≡ η = 2/(µ + L), then

‖xt − x∗‖2 ≤ ((κ − 1)/(κ + 1))^t ‖x0 − x∗‖2,

where κ := L/µ is the condition number and x∗ is the minimizer

• generalization of quadratic minimization problems
  ◦ stepsize: η = 2/(µ + L) (vs. η = 2/(λ1(Q) + λn(Q)))
  ◦ contraction rate: (κ − 1)/(κ + 1) (vs. (λ1(Q) − λn(Q))/(λ1(Q) + λn(Q)))
• dimension-free: iteration complexity is O(log(1/ε) / log((κ + 1)/(κ − 1))), which is independent of the problem size n if κ does not depend on n
• a direct consequence of Theorem 2.1 (using smoothness):

f(xt) − f(x∗) ≤ (L/2) ((κ − 1)/(κ + 1))^{2t} ‖x0 − x∗‖2²
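Theorem 2.1 can be sanity-checked on a non-quadratic example. For the one-dimensional f(x) = x²/2 + log(1 + eˣ) we have f″(x) = 1 + σ(x)(1 − σ(x)) ∈ (1, 1.25], so µ = 1, L = 1.25, and κ = 1.25; this function is chosen for illustration and does not appear in the slides:

```python
import math

mu, L = 1.0, 1.25
eta = 2.0 / (mu + L)
rho = (L / mu - 1.0) / (L / mu + 1.0)   # (kappa-1)/(kappa+1) = 1/9

# f(x) = x^2/2 + log(1 + exp(x)), so f'(x) = x + sigmoid(x)
grad = lambda x: x + 1.0 / (1.0 + math.exp(-x))

# high-accuracy reference minimizer obtained by running many GD steps
x_ref = 0.0
for _ in range(200):
    x_ref -= eta * grad(x_ref)

x = 3.0
err0 = abs(x - x_ref)
for t in range(1, 31):
    x -= eta * grad(x)
    # Theorem 2.1: |x_t - x*| <= rho^t * |x_0 - x*|
    assert abs(x - x_ref) <= rho ** t * err0 + 1e-12
```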
Proof of Theorem 2.1

It is seen from the fundamental theorem of calculus that

∇f(xt) = ∇f(xt) − ∇f(x∗) = (∫₀¹ ∇²f(xτ) dτ)(xt − x∗),

using ∇f(x∗) = 0, where xτ := xt + τ(x∗ − xt). Here, {xτ}0≤τ≤1 forms a line segment between xt and x∗. Therefore,

‖xt+1 − x∗‖2 = ‖xt − x∗ − η∇f(xt)‖2
= ‖(I − η ∫₀¹ ∇²f(xτ) dτ)(xt − x∗)‖2
≤ sup_{0≤τ≤1} ‖I − η∇²f(xτ)‖ ‖xt − x∗‖2
≤ ((L − µ)/(L + µ)) ‖xt − x∗‖2

Repeat this argument for all iterations to conclude the proof
More on strong convexity

f(·) is said to be µ-strongly convex if

(i) f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (µ/2)‖x − y‖2², ∀x, y
    (the first two terms form the first-order Taylor expansion)

Equivalent characterizations ((ii)–(iii) are first-order, (iv) is second-order):

(ii) for all x and y and all 0 ≤ λ ≤ 1,
     f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) − (µ/2)λ(1 − λ)‖x − y‖2²

(iii) ⟨∇f(x) − ∇f(y), x − y⟩ ≥ µ‖x − y‖2², ∀x, y

(iv) ∇²f(x) ⪰ µI, ∀x (for twice differentiable functions)
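Characterizations (i) and (iii) can be spot-checked numerically. A sketch on f(x) = (1/2)(x1² + 4x2²), which is 1-strongly convex since its smallest Hessian eigenvalue is 1 (the function and sampling ranges are chosen for this example):

```python
import random

mu = 1.0
f = lambda x: 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)
grad = lambda x: [x[0], 4.0 * x[1]]

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5), random.uniform(-5, 5)]
    y = [random.uniform(-5, 5), random.uniform(-5, 5)]
    gx, gy = grad(x), grad(y)
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    inner = sum(g * (b - a) for g, a, b in zip(gx, x, y))
    # (i): f(y) >= f(x) + <grad f(x), y - x> + (mu/2)||x - y||^2
    assert f(y) >= f(x) + inner + 0.5 * mu * dist2 - 1e-9
    # (iii): <grad f(x) - grad f(y), x - y> >= mu ||x - y||^2
    mono = sum((gx[i] - gy[i]) * (x[i] - y[i]) for i in range(2))
    assert mono >= mu * dist2 - 1e-9
```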
More on smoothness

A convex function f(·) is said to be L-smooth if

(i) f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (L/2)‖x − y‖2², ∀x, y
    (the first two terms form the first-order Taylor expansion)

Equivalent characterizations for convex functions ((ii)–(iv) are first-order, (v) is second-order):

(ii) for all x and y and all 0 ≤ λ ≤ 1,
     f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y) − (L/2)λ(1 − λ)‖x − y‖2²

(iii) ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖2², ∀x, y

(iv) ‖∇f(x) − ∇f(y)‖2 ≤ L‖x − y‖2, ∀x, y (L-Lipschitz gradient)

(v) ‖∇²f(x)‖ ≤ L, ∀x (for twice differentiable functions)
Backtracking line search

Practically, one often performs line searches rather than adopting constant stepsizes. Most line searches in practice are, however, inexact

A simple and effective scheme: backtracking line search
Backtracking line search

Armijo condition: for some 0 < α < 1,

f(xt − η∇f(xt)) < f(xt) − αη‖∇f(xt)‖2²   (2.5)

• f(xt) − αη‖∇f(xt)‖2² lies above f(xt − η∇f(xt)) for small η
• ensures sufficient decrease of objective values

Algorithm 2.1 Backtracking line search for GD
1: Initialize η = 1, 0 < α ≤ 1/2, 0 < β < 1
2: while f(xt − η∇f(xt)) > f(xt) − αη‖∇f(xt)‖2² do
3:   η ← βη
Backtracking line search

(figure: f(xt − η∇f(xt)) as a function of η, together with the smoothness upper bound f(xt) − η‖∇f(xt)‖2² + (Lη²/2)‖∇f(xt)‖2² and the Armijo line f(xt) − αη‖∇f(xt)‖2²)

Under the smoothness condition, we learn from the above figure that ηt ≥ βη̃t, where η̃t is the solution to

f(xt) − αη‖∇f(xt)‖2² = f(xt) − η‖∇f(xt)‖2² + (Lη²/2)‖∇f(xt)‖2²
⟹ ηt ≥ 2(1 − α)β/L

Practically, backtracking line search often (but not always) provides good estimates on the local Lipschitz constants of gradients
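Algorithm 2.1 is short enough to state directly in code. A sketch (the quadratic test function is chosen for illustration), together with a check of the stepsize lower bound ηt ≥ 2(1 − α)β/L:

```python
def backtracking(f, grad, x, alpha=0.25, beta=0.5):
    """Algorithm 2.1: shrink eta by beta until the Armijo condition (2.5) holds."""
    g = grad(x)
    g2 = sum(gi * gi for gi in g)   # ||grad f(x)||_2^2
    eta = 1.0
    while f([xi - eta * gi for xi, gi in zip(x, g)]) > f(x) - alpha * eta * g2:
        eta *= beta
    return eta

# f(x) = 0.5*(x1^2 + 10*x2^2) is L-smooth with L = 10
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]

alpha, beta, L = 0.25, 0.5, 10.0
eta = backtracking(f, grad, [1.0, 1.0], alpha, beta)
assert eta >= 2.0 * (1.0 - alpha) * beta / L   # guaranteed lower bound
```

On this instance the search halves η three times, returning η = 0.125, comfortably above the guaranteed bound 0.075.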
Convergence for backtracking line search

Theorem 2.2 (Boyd, Vandenberghe '04)
Let f be µ-strongly convex and L-smooth. With backtracking line search,

f(xt) − f(x∗) ≤ (1 − min{2µα, 2βαµ/L})^t (f(x0) − f(x∗))

where x∗ is the minimizer
Is strong convexity necessary for linear convergence?

So far we have established linear convergence under strong convexity and smoothness

The strong convexity requirement can often be relaxed:
• local strong convexity
• regularity condition
• Polyak–Łojasiewicz condition
Example: logistic regression

Suppose we obtain m independent binary samples

yi = 1 with prob. 1/(1 + exp(−ai⊤x♮));  yi = −1 with prob. 1/(1 + exp(ai⊤x♮))

where {ai}: known design vectors; x♮ ∈ Rⁿ: unknown parameters
Example: logistic regression
The maximum likelihood estimate (MLE) is given by (after a little manipulation)

minimizex∈Rn f(x) = (1/m) ∑mi=1 log(1 + exp(−yia⊤i x))

• ∇2f(x) = (1/m) ∑mi=1 [ exp(−yia⊤i x) / (1 + exp(−yia⊤i x))^2 ] aia⊤i → 0 as ‖x‖2 → ∞

=⇒ f is only 0-strongly convex
• Does this mean we no longer have linear convergence?
Gradient methods (unconstrained case) 2-26
![Page 35: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/35.jpg)
Local strong convexity
Theorem 2.3 (GD for locally strongly convex and smoothfunctions)
Let f be locally µ-strongly convex and L-smooth such that

µI ⪯ ∇2f(x) ⪯ LI, ∀x ∈ B0

where B0 := {x : ‖x − x∗‖2 ≤ ‖x0 − x∗‖2} and x∗ is the minimizer. Then Theorem 2.1 continues to hold
Gradient methods (unconstrained case) 2-27
![Page 36: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/36.jpg)
Local strong convexity
(figure: GD iterates x0, x1, x2, x3 stay within the ball {x : ‖x − x∗‖2 ≤ ‖x0 − x∗‖2})
• Suppose xt ∈ B0. Then repeating our previous analysis yields

‖xt+1 − x∗‖2 ≤ ((κ − 1)/(κ + 1)) ‖xt − x∗‖2

• This also means xt+1 ∈ B0, so the above bound continues to hold for the next iteration ...
Gradient methods (unconstrained case) 2-28
![Page 37: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/37.jpg)
Local strong convexity
Back to the logistic regression example: the local strong convexity parameter is given by

inf{x : ‖x−x∗‖2 ≤ ‖x0−x∗‖2} λmin( (1/m) ∑mi=1 [ exp(−yia⊤i x) / (1 + exp(−yia⊤i x))^2 ] aia⊤i )    (2.6)

which is often strictly bounded away from 0,¹ thus enabling linear convergence
¹For example, when x♮ = 0 and ai i.i.d. ∼ N(0, In), one often has (2.6) ≥ c0 for some universal constant c0 > 0 with high prob. if m/n > 2 (Sur et al. ’17)

Gradient methods (unconstrained case) 2-29
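This can be checked numerically; the sketch below estimates the smallest Hessian eigenvalue of the logistic loss at a few random points (the synthetic Gaussian data and the choice of test points are assumptions of mine, mimicking the footnote's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 20                        # m/n = 10 > 2
A = rng.standard_normal((m, n))       # rows a_i ~ N(0, I_n)

def logistic_hessian(x):
    """Hessian (1/m) sum_i w_i a_i a_i^T with w_i = exp(-u_i)/(1+exp(-u_i))^2."""
    s = 1.0 / (1.0 + np.exp(-A @ x))  # sigmoid(a_i^T x)
    w = s * (1.0 - s)                 # Hessian weights; independent of y_i in {+1,-1}
    return (A * w[:, None]).T @ A / m

# smallest Hessian eigenvalue at a few random points near x* = 0
lam_min = min(np.linalg.eigvalsh(logistic_hessian(0.5 * rng.standard_normal(n)))[0]
              for _ in range(5))
```

Here `lam_min` stays strictly positive, in line with the claim that (2.6) is often bounded away from 0.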
![Page 38: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/38.jpg)
Regularity condition
Another way is to replace strong convexity and smoothness by the following regularity condition:

⟨∇f(x), x − x∗⟩ ≥ (µ/2)‖x − x∗‖22 + (1/2L)‖∇f(x)‖22, ∀x    (2.7)

or equivalently (since ∇f(x∗) = 0),

⟨∇f(x) − ∇f(x∗), x − x∗⟩ ≥ (µ/2)‖x − x∗‖22 + (1/2L)‖∇f(x) − ∇f(x∗)‖22, ∀x

• compared to strong convexity (which involves any pair (x,y)), we only restrict ourselves to (x,x∗)

Gradient methods (unconstrained case) 2-30
![Page 40: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/40.jpg)
Convergence under regularity condition

Theorem 2.4

Suppose f satisfies (2.7). If ηt ≡ η = 1/L, then

‖xt − x∗‖22 ≤ (1 − µ/L)^t ‖x0 − x∗‖22
Gradient methods (unconstrained case) 2-31
![Page 41: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/41.jpg)
Proof of Theorem 2.4
It follows that

‖xt+1 − x∗‖22 = ‖xt − x∗ − (1/L)∇f(xt)‖22

= ‖xt − x∗‖22 + (1/L^2)‖∇f(xt)‖22 − (2/L)⟨xt − x∗, ∇f(xt)⟩

(i)≤ ‖xt − x∗‖22 − (µ/L)‖xt − x∗‖22

= (1 − µ/L)‖xt − x∗‖22

where (i) comes from (2.7)
Apply it recursively to complete the proof
Gradient methods (unconstrained case) 2-32
![Page 42: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/42.jpg)
Polyak-Lojasiewicz condition
Another alternative is the Polyak-Lojasiewicz (PL) condition

‖∇f(x)‖22 ≥ 2µ (f(x) − f(x∗)), ∀x    (2.8)

where x∗ is the minimizer
• guarantees that the gradient grows fast as we move away from the optimal objective value
• guarantees that every stationary point is a global minimum
Gradient methods (unconstrained case) 2-33
![Page 43: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/43.jpg)
Convergence under PL condition
Theorem 2.5
Suppose f satisfies (2.8) and is L-smooth. If ηt ≡ η = 1/L, then

f(xt) − f(x∗) ≤ (1 − µ/L)^t (f(x0) − f(x∗))

• guarantees linear convergence to the optimal objective value
• does NOT imply the uniqueness of global minima
• proof deferred to Page 2-45
Gradient methods (unconstrained case) 2-34
![Page 44: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/44.jpg)
Example: over-parameterized linear regression
• m data samples {ai ∈ Rn, yi ∈ R}1≤i≤m
• linear regression: find a linear model that best fits the data

minimizex∈Rn f(x) := (1/2) ∑mi=1 (a⊤i x − yi)^2

Over-parameterization: model dimension > sample size (i.e. n > m)
— a regime of particular importance in deep learning
Gradient methods (unconstrained case) 2-35
![Page 45: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/45.jpg)
Example: over-parametrized linear regression
While this is a convex problem, it is not strongly convex, since

∇2f(x) = ∑mi=1 aia⊤i is rank-deficient if n > m

But for most “non-degenerate” cases, one has f(x∗) = 0 (why?) and the PL condition is met, and hence GD converges linearly
Gradient methods (unconstrained case) 2-36
![Page 46: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/46.jpg)
Example: over-parametrized linear regression
Fact 2.6

Suppose that A = [a1, · · · , am]⊤ ∈ Rm×n has rank m, and that ηt ≡ η = 1/λmax(AA⊤). Then GD obeys

f(xt) − f(x∗) ≤ (1 − λmin(AA⊤)/λmax(AA⊤))^t (f(x0) − f(x∗)), ∀t

• very mild assumption on {ai}
• no assumption on {yi}
• (aside) while there are many global minima for this over-parametrized problem, GD has implicit bias
  ◦ GD converges to the global min closest to the initialization x0!
Gradient methods (unconstrained case) 2-37
![Page 48: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/48.jpg)
Proof of Fact 2.6
Everything boils down to showing the PL condition

‖∇f(x)‖22 ≥ 2λmin(AA⊤) f(x)    (2.9)

If this holds, then the claim follows immediately from Theorem 2.5 and the fact f(x∗) = 0

To prove (2.9), let y = [yi]1≤i≤m, and observe ∇f(x) = A⊤(Ax − y). Then

‖∇f(x)‖22 = (Ax − y)⊤AA⊤(Ax − y) ≥ λmin(AA⊤)‖Ax − y‖22 = 2λmin(AA⊤) f(x),

which is the PL condition (2.9) with µ = λmin(AA⊤)

Gradient methods (unconstrained case) 2-38
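A quick simulation of Fact 2.6 (the problem sizes, data, and iteration count are my own choices); it also illustrates the implicit-bias remark, since GD started at x0 = 0 approaches the minimum-norm interpolating solution:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 50                          # over-parameterized: n > m
A = rng.standard_normal((m, n))        # rank m with probability 1
y = rng.standard_normal(m)             # no assumption on y

evals = np.linalg.eigvalsh(A @ A.T)
eta = 1.0 / evals[-1]                  # eta = 1 / lambda_max(A A^T)
rate = 1.0 - evals[0] / evals[-1]      # contraction factor from Fact 2.6

f = lambda x: 0.5 * np.sum((A @ x - y) ** 2)
x = np.zeros(n)                        # initialize at the origin
f0 = f(x)
T = 500
for _ in range(T):
    x = x - eta * A.T @ (A @ x - y)    # gradient step on f

# minimum-norm interpolating solution (the global min closest to x0 = 0)
x_min_norm = A.T @ np.linalg.solve(A @ A.T, y)
```

Since every update adds a vector in the row space of A, the iterates never pick up a null-space component, which is why GD lands on the minimum-norm solution here.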
![Page 49: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/49.jpg)
Convex and smooth problems
![Page 50: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/50.jpg)
Dropping strong convexity
What happens if we completely drop (local) strong convexity?
minimizex f(x)
• f(x) is convex and smooth
Gradient methods (unconstrained case) 2-40
![Page 51: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/51.jpg)
Dropping strong convexity
Without strong convexity, it may often be better to focus on objective improvement (rather than improvement on estimation error)

Example: consider f(x) = 1/x (x > 0). GD iterates {xt} might never converge to x∗ = ∞. In comparison, f(xt) might approach f(x∗) = 0 rapidly
Gradient methods (unconstrained case) 2-41
![Page 52: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/52.jpg)
Objective improvement and stepsize
Question:
• can we ensure reduction of the objective value (i.e. f(xt+1) < f(xt)) without strong convexity?
• what stepsizes guarantee sufficient decrease?

Key idea: majorization-minimization
• find a simple majorizing function of f(x) and optimize it instead
Gradient methods (unconstrained case) 2-42
![Page 53: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/53.jpg)
Objective improvement and stepsize
From the smoothness assumption,

f(xt+1) − f(xt) ≤ ∇f(xt)⊤(xt+1 − xt) + (L/2)‖xt+1 − xt‖22

= −ηt‖∇f(xt)‖22 + (ηt^2 L/2)‖∇f(xt)‖22    ← majorizing function of the objective reduction, due to smoothness

(pick ηt = 1/L to minimize the majorizing function)

= −(1/2L)‖∇f(xt)‖22
Gradient methods (unconstrained case) 2-43
![Page 54: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/54.jpg)
Objective improvement
Fact 2.7
Suppose f is L-smooth. Then GD with ηt = 1/L obeys

f(xt+1) ≤ f(xt) − (1/2L)‖∇f(xt)‖22

• for ηt sufficiently small, GD results in improvement in the objective value
• does NOT rely on convexity!
Gradient methods (unconstrained case) 2-44
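Fact 2.7 is easy to test numerically on a nonconvex function; the 1-D test function below is an illustrative choice of mine, with L = 8 read off from its second derivative:

```python
import numpy as np

# nonconvex but L-smooth test function (my own choice):
# f(x) = x^2 + 3 sin^2(x), with f''(x) = 2 + 6 cos(2x) in [-4, 8], so L = 8
L = 8.0
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)

x = 4.0
gaps = []
for _ in range(30):
    g = grad(x)
    x_new = x - g / L                       # eta = 1/L
    # Fact 2.7 predicts f(x_new) <= f(x) - g^2/(2L); record the slack
    gaps.append(f(x) - g**2 / (2 * L) - f(x_new))
    x = x_new
min_gap = min(gaps)                         # should be nonnegative
```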
![Page 55: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/55.jpg)
A byproduct: proof of Theorem 2.5
f(xt+1) − f(x∗) (i)≤ f(xt) − f(x∗) − (1/2L)‖∇f(xt)‖22

(ii)≤ f(xt) − f(x∗) − (µ/L)(f(xt) − f(x∗))

= (1 − µ/L)(f(xt) − f(x∗))

where (i) follows from Fact 2.7, and (ii) comes from the PL condition (2.8)
Apply it recursively to complete the proof
Gradient methods (unconstrained case) 2-45
![Page 56: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/56.jpg)
Improvement in estimation accuracy
GD is not only improving the objective value, but is also dragging the iterates towards minimizer(s), as long as ηt is not too large

We start with a monotonicity result:

Lemma 2.5

Let f be convex and L-smooth. If ηt ≡ η = 1/L, then

‖xt+1 − x∗‖2 ≤ ‖xt − x∗‖2

where x∗ is any minimizer of f(·); in other words, ‖xt − x∗‖2 is monotonically nonincreasing in t

Treating f as 0-strongly convex, we can see from our previous analysis for strongly convex problems that ‖xt+1 − x∗‖2 ≤ ‖xt − x∗‖2
Gradient methods (unconstrained case) 2-46
![Page 57: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/57.jpg)
Improvement in estimation accuracy
One can further show that ‖xt − x∗‖2 is strictly decreasing unless xt is already a minimizer

Fact 2.8

Let f be convex and L-smooth. If ηt ≡ η = 1/L, then

‖xt+1 − x∗‖22 ≤ ‖xt − x∗‖22 − (1/L^2)‖∇f(xt)‖22

where x∗ is any minimizer of f(·)
Gradient methods (unconstrained case) 2-47
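A numerical check of Fact 2.8 on a convex but non-strongly-convex quadratic (the matrix A and the starting point are illustrative choices of mine):

```python
import numpy as np

# convex but NOT strongly convex: f(x) = 0.5 * ||A x||^2 with rank-deficient A
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])              # rank 2 < 3 unknowns
L = np.linalg.eigvalsh(A.T @ A)[-1]          # smoothness constant lambda_max(A^T A)
grad = lambda x: A.T @ (A @ x)
x_star = np.zeros(3)                         # one of infinitely many minimizers

x0 = np.array([3.0, -1.0, 2.0])
x = x0.copy()
holds = []
for _ in range(20):
    g = grad(x)
    x_new = x - g / L                        # eta = 1/L
    lhs = np.sum((x_new - x_star) ** 2)
    rhs = np.sum((x - x_star) ** 2) - g @ g / L**2
    holds.append(lhs <= rhs + 1e-10)         # Fact 2.8 at this iteration
    x = x_new
all_hold = all(holds)
```

Note the distance to x∗ shrinks only in the row space of A; the null-space component of x0 is frozen, which is consistent with monotone (but not necessarily vanishing) distance.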
![Page 58: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/58.jpg)
Proof of Fact 2.8
It follows that

‖xt+1 − x∗‖22 = ‖xt − x∗ − η(∇f(xt) − ∇f(x∗))‖22    (using ∇f(x∗) = 0)

= ‖xt − x∗‖22 − 2η⟨xt − x∗, ∇f(xt) − ∇f(x∗)⟩ + η^2‖∇f(xt) − ∇f(x∗)‖22

≤ ‖xt − x∗‖22 − (2η/L)‖∇f(xt) − ∇f(x∗)‖22 + η^2‖∇f(xt) − ∇f(x∗)‖22

(smoothness + convexity: ⟨xt − x∗, ∇f(xt) − ∇f(x∗)⟩ ≥ (1/L)‖∇f(xt) − ∇f(x∗)‖22)

= ‖xt − x∗‖22 − (1/L^2)‖∇f(xt)‖22    (since η = 1/L and ∇f(x∗) = 0)
Gradient methods (unconstrained case) 2-48
![Page 59: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/59.jpg)
Convergence rate for convex and smooth problems
However, without strong convexity, convergence is typically muchslower than linear (or geometric) convergence
Theorem 2.9 (GD for convex and smooth problems)
Let f be convex and L-smooth. If ηt ≡ η = 1/L, then GD obeys

f(xt) − f(x∗) ≤ 2L‖x0 − x∗‖22 / t

where x∗ is any minimizer of f(·)

• attains ε-accuracy within O(1/ε) iterations (vs. O(log(1/ε)) iterations for linear convergence)
Gradient methods (unconstrained case) 2-49
![Page 60: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/60.jpg)
Proof of Theorem 2.9

From Fact 2.7,

f(xt+1) − f(xt) ≤ −(1/2L)‖∇f(xt)‖22

To infer f(xt) recursively, it is often easier to replace ‖∇f(xt)‖2 with simpler functions of f(xt). Use convexity and Cauchy–Schwarz to get

f(x∗) − f(xt) ≥ ∇f(xt)⊤(x∗ − xt) ≥ −‖∇f(xt)‖2 ‖xt − x∗‖2

=⇒ ‖∇f(xt)‖2 ≥ (f(xt) − f(x∗)) / ‖xt − x∗‖2 ≥ (f(xt) − f(x∗)) / ‖x0 − x∗‖2    (by Fact 2.8)

Setting ∆t := f(xt) − f(x∗) and combining the above bounds yield

∆t+1 − ∆t ≤ −∆t^2 / (2L‖x0 − x∗‖22) =: −∆t^2 / w0    (2.10)
Gradient methods (unconstrained case) 2-50
![Page 61: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/61.jpg)
Proof of Theorem 2.9 (cont.)
∆t+1 ≤ ∆t − ∆t^2 / w0

Dividing both sides by ∆t∆t+1 and rearranging terms give

1/∆t+1 ≥ 1/∆t + (1/w0) · ∆t/∆t+1

=⇒ 1/∆t+1 ≥ 1/∆t + 1/w0    (since ∆t ≥ ∆t+1 by Fact 2.7)

=⇒ 1/∆t ≥ 1/∆0 + t/w0 ≥ t/w0

=⇒ ∆t ≤ w0/t = 2L‖x0 − x∗‖22 / t

as claimed

Gradient methods (unconstrained case) 2-51
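A small sanity check of the O(1/t) bound (the test function f(x) = x^4 is my own choice; it is L-smooth with L = 12 on [−1, 1], and GD started at x0 = 1 with η = 1/L stays in that interval):

```python
import numpy as np

# f(x) = x^4 is convex; on [-1, 1] it is L-smooth with L = f''(1) = 12,
# and GD started at x0 = 1 with eta = 1/L keeps the iterates in (0, 1]
L = 12.0
f = lambda x: x**4
grad = lambda x: 4 * x**3

x0 = 1.0
x = x0
checks = []
for t in range(1, 501):
    x = x - grad(x) / L
    # Theorem 2.9 bound with minimizer x* = 0 and f(x*) = 0
    checks.append(f(x) <= 2 * L * x0**2 / t)
bound_holds = all(checks)
```

Here convergence really is sublinear: the iterates decay roughly like t^(−1/2), so the objective decays polynomially rather than geometrically.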
![Page 62: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/62.jpg)
Nonconvex problems
![Page 63: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/63.jpg)
Nonconvex problems are everywhere
Many empirical risk minimization tasks are nonconvex
minimizex f(x; data)
• low-rank matrix completion• blind deconvolution• dictionary learning• mixture models• learning deep neural nets• ...
Gradient methods (unconstrained case) 2-53
![Page 64: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/64.jpg)
Challenges
• there may be bumps and local minima everywhere
  ◦ e.g. 1-layer neural net (Auer, Herbster, Warmuth ’96; Vu ’98)
• no algorithm can solve nonconvex problems efficiently in all cases
Gradient methods (unconstrained case) 2-54
![Page 65: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/65.jpg)
Typical convergence guarantees
We cannot hope for efficient global convergence to global minima in general, but we may have

• convergence to stationary points (i.e. ∇f(x) = 0)
• convergence to local minima
• local convergence to global minima (i.e. when initialized suitably)
Gradient methods (unconstrained case) 2-55
![Page 66: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/66.jpg)
Making gradients small
Suppose we are content with any (approximate) stationary point ...
This means that our goal is merely to find a point x with
‖∇f(x)‖2 ≤ ε (called ε-approximate stationary point)
Question: can GD achieve this goal? If so, how fast?
Gradient methods (unconstrained case) 2-56
![Page 67: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/67.jpg)
Making gradients small
Theorem 2.10
Let f be L-smooth and ηk ≡ η = 1/L. Assume t is even.

• In general, GD obeys

min0≤k<t ‖∇f(xk)‖2 ≤ √( 2L(f(x0) − f(x∗)) / t )

• If f(·) is convex, then GD obeys

mint/2≤k<t ‖∇f(xk)‖2 ≤ 4L‖x0 − x∗‖2 / t

• GD finds an ε-approximate stationary point in O(1/ε^2) iterations
• this does not imply GD converges to stationary points; it only says that ∃ an approximate stationary point in the GD trajectory
Gradient methods (unconstrained case) 2-57
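The first bound can be checked numerically; below, the unknown f(x∗) is replaced by the valid lower bound 0 (the nonconvex test function and the starting point are illustrative choices of mine):

```python
import numpy as np

# nonconvex, L-smooth test function (my own choice): f(x) = x^2 + 3 sin^2(x),
# with f''(x) = 2 + 6 cos(2x) in [-4, 8], so L = 8; also f >= 0, so f(x*) >= 0
L = 8.0
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)

x = 4.0
f0 = f(x)
grad_norms = []
for _ in range(100):
    grad_norms.append(abs(grad(x)))
    x = x - grad(x) / L                 # eta = 1/L

t = len(grad_norms)                     # t = 100, even
# Theorem 2.10 (general case), with f(x*) replaced by the lower bound 0,
# which only loosens the right-hand side
bound = np.sqrt(2 * L * f0 / t)
min_grad = min(grad_norms)
```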
![Page 68: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/68.jpg)
Proof of Theorem 2.10
From Fact 2.7, we know

(1/2L)‖∇f(xk)‖22 ≤ f(xk) − f(xk+1), ∀k

This leads to a telescoping sum when summed over k = t0 to k = t − 1:

(1/2L) ∑t−1k=t0 ‖∇f(xk)‖22 ≤ ∑t−1k=t0 (f(xk) − f(xk+1)) = f(xt0) − f(xt) ≤ f(xt0) − f(x∗)

=⇒ mint0≤k<t ‖∇f(xk)‖2 ≤ √( 2L(f(xt0) − f(x∗)) / (t − t0) )    (2.11)
Gradient methods (unconstrained case) 2-58
![Page 69: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/69.jpg)
Proof of Theorem 2.10 (cont.)
For a general f(·), taking t0 = 0 immediately establishes the claim

If f(·) is convex, invoke Theorem 2.9 to obtain

f(xt0) − f(x∗) ≤ 2L‖x0 − x∗‖22 / t0

Taking t0 = t/2 and combining it with (2.11) give

mint0≤k<t ‖∇f(xk)‖2 ≤ (2L / √(t0(t − t0))) ‖x0 − x∗‖2 = 4L‖x0 − x∗‖2 / t
Gradient methods (unconstrained case) 2-59
![Page 70: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/70.jpg)
Escaping saddles
There are at least two kinds of points with vanishing gradients
(figures: a surface with a saddle point; a surface with a global and a local minimum)
Saddle points look like “unstable” critical points; can we hope to at least avoid saddle points?
Gradient methods (unconstrained case) 2-60
![Page 71: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/71.jpg)
Escaping saddle points
GD cannot always escape saddles
• e.g. if x0 happens to be a saddle (which can often be prevented by random initialization), then GD gets trapped, since ∇f(x0) = 0
Fortunately, under mild conditions, randomly initialized GD converges to a local (sometimes even global) minimum almost surely (Lee et al. ’16)!
Gradient methods (unconstrained case) 2-61
![Page 72: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/72.jpg)
Example
Consider a simple nonconvex quadratic minimization problem
minimizex f(x) = (1/2) x⊤Ax

• A = u1u⊤1 − u2u⊤2 , where ‖u1‖2 = ‖u2‖2 = 1 and u⊤1 u2 = 0

This problem has (at least) one saddle point: x = 0 (why?)
• if x0 = 0, then GD gets stuck at 0 (i.e. xt ≡ 0)
• what if we initialize GD randomly? can we hope to avoid saddles?
Gradient methods (unconstrained case) 2-62
![Page 73: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/73.jpg)
Example (cont.)
Fact 2.11
If x0 ∼ N (0, I), then with prob. approaching 1, GD with η < 1 obeys
‖xt‖2 →∞ as t→∞
• Interestingly, GD (almost) never gets trapped in the saddle 0!
Gradient methods (unconstrained case) 2-63
![Page 74: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/74.jpg)
Example (cont.)
Proof of Fact 2.11: Observe that

I − ηA = I⊥ + (1 − η)u1u⊤1 + (1 + η)u2u⊤2

where I⊥ := I − u1u⊤1 − u2u⊤2 . It can easily be verified that

(I − ηA)^t = I⊥ + (1 − η)^t u1u⊤1 + (1 + η)^t u2u⊤2

=⇒ xt = (I − ηA)xt−1 = · · · = (I − ηA)^t x0 = I⊥x0 + (1 − η)^t (u⊤1 x0) u1 + (1 + η)^t (u⊤2 x0) u2

Setting αt := (1 − η)^t (u⊤1 x0) and βt := (1 + η)^t (u⊤2 x0), clearly αt → 0 as t → ∞, and |βt| → ∞ (hence ‖xt‖2 → ∞) as long as β0 ≠ 0, which happens with prob. 1
Gradient methods (unconstrained case) 2-64
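A small simulation of Fact 2.11 (the dimension, stepsize η = 0.5 < 1, and random seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
# build orthonormal u1, u2 (illustrative construction via QR)
Q, _ = np.linalg.qr(rng.standard_normal((n, 2)))
u1, u2 = Q[:, 0], Q[:, 1]
A = np.outer(u1, u1) - np.outer(u2, u2)   # f(x) = 0.5 x^T A x has a saddle at 0

eta = 0.5                                 # any eta < 1 works in Fact 2.11
x = rng.standard_normal(n)                # random init: u2-component != 0 w.p. 1
for _ in range(100):
    x = x - eta * (A @ x)                 # GD: x_{t+1} = (I - eta*A) x_t

norm_after = np.linalg.norm(x)            # should blow up
u1_component = abs(u1 @ x)                # should vanish
```

The u2-component is multiplied by (1 + η) every step, so ‖xt‖2 blows up geometrically, while the u1-component contracts by (1 − η) per step and dies out, exactly as in the proof.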
![Page 75: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/75.jpg)
Reference
[1] ”Convex optimization and algorithms,” D. Bertsekas, 2015.
[2] ”Convex optimization: algorithms and complexity,” S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[3] ”First-order methods in optimization,” A. Beck, Vol. 25, SIAM, 2017.
[4] ”Convex optimization,” S. Boyd, L. Vandenberghe, Cambridge University Press, 2004.
[5] ”Introductory lectures on convex optimization: a basic course,” Y. Nesterov, Springer Science & Business Media, 2013.
[6] ”The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square,” P. Sur, Y. Chen, E. Candes, Probability Theory and Related Fields, 2017.
[7] ”How to make the gradients small,” Y. Nesterov, Optima, 2012.
Gradient methods (unconstrained case) 2-65
![Page 76: Gradient methods for unconstrained problems](https://reader030.fdocuments.in/reader030/viewer/2022012506/61825812aa2ad83c0c047fd4/html5/thumbnails/76.jpg)
Reference
[8] ”Gradient descent converges to minimizers,” J. Lee, M. Simchowitz, M. Jordan, B. Recht, COLT, 2016.
[9] ”Nonconvex optimization meets low-rank matrix factorization: an overview,” Y. Chi, Y. Lu, Y. Chen, IEEE Transactions on Signal Processing, 2019.
Gradient methods (unconstrained case) 2-66