Intro Smooth NonSmooth
Stochastic Second-Order Optimization Methods, Part I: Convex
Fred Roosta
School of Mathematics and Physics, University of Queensland
Disclaimer

To preserve brevity and flow, many cited results are simplified, slightly modified, or generalized; please see the cited works for the details.

Unfortunately, due to time constraints, many relevant and interesting works are not mentioned in this presentation.
Problem Statement

Problem:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} \; F(x) = \mathbb{E}_w\big[f(x; w)\big] = \overbrace{\int_\Omega \underbrace{f\big(x; w(\omega)\big)}_{\text{e.g., loss/penalty + prediction function}} \, dP\big(w(\omega)\big)}^{\text{Risk}}$$

- F: (non-)convex / (non-)smooth
- High-dimensional: d ≫ 1
- With the empirical (counting) measure over $\{w_i\}_{i=1}^n \subset \mathbb{R}^p$, i.e., $\int_A dP(w) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{w_i \in A}$ for all $A \subseteq \Omega$, we get

$$F(x) = \frac{1}{n}\sum_{i=1}^n \underbrace{f(x; w_i)}_{\text{finite-sum / empirical risk}}$$

- "Big data": n ≫ 1
Humongous Data / High Dimension
Classical algorithms =⇒ High per-iteration costs
Humongous Data / High Dimension
Modern variants:
1. Low per-iteration costs
2. Maintain the original convergence rate
First Order Methods
Use only gradient information
E.g., gradient descent:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla F(x^{(k)})$$

- Smooth convex F ⇒ sublinear, O(1/k)
- Smooth strongly convex F ⇒ linear, O(ρᵏ), ρ < 1
- However, the per-iteration cost is high!
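As a concrete sketch, here is gradient descent on a small strongly convex quadratic; the quadratic $F(x) = \tfrac{1}{2}x^T A x$ (minimizer $x^\star = 0$), the step size, and the iteration count are illustrative choices for the demo, not part of any cited result.

```python
import numpy as np

def gradient_descent(grad, x0, alpha, iters):
    # Iterate x_{k+1} = x_k - alpha * grad(x_k)
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Toy strongly convex quadratic: F(x) = 0.5 x^T A x, so grad F(x) = A x
A = np.array([[2.0, 0.0], [0.0, 1.0]])
x = gradient_descent(lambda v: A @ v, np.array([1.0, 1.0]), alpha=0.5, iters=50)
```

With a fixed step size the error contracts by a constant factor per iteration, matching the linear O(ρᵏ) rate above.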
First Order Methods

Stochastic variants, e.g., (mini-batch) SGD: for some small s, choose $\{w_i\}_{i=1}^s$ at random and set

$$x^{(k+1)} = x^{(k)} - \frac{\alpha^{(k)}}{s} \sum_{i=1}^{s} \nabla f(x^{(k)}; w_i)$$

Cheap per-iteration costs! However, slower to converge:

- Smooth convex F ⇒ O(1/√k)
- Smooth strongly convex F ⇒ O(1/k)
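A minimal mini-batch SGD sketch on a synthetic least-squares finite sum; the planted solution, batch size, and decaying step-size schedule are all made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 200, 5, 10                 # n data points, dimension d, batch size s
A = rng.standard_normal((n, d))
b = A @ np.ones(d)                   # planted solution x* = (1, ..., 1)

def sgd(x0, iters):
    x = x0.copy()
    for k in range(iters):
        alpha = 1.0 / (k + 10)                    # decaying step size
        S = rng.choice(n, size=s, replace=False)  # random mini-batch
        g = A[S].T @ (A[S] @ x - b[S]) / s        # mini-batch gradient
        x = x - alpha * g
    return x

x = sgd(np.zeros(d), iters=2000)
```

Each step touches only s = 10 of the n = 200 terms, which is the whole point: the per-iteration cost is independent of n, at the price of the slower rates quoted above.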
Modifications...
Achieve the convergence rate of the full GD
Preserve the per-iteration cost of SGD
E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...
Second Order Methods
Use both gradient and Hessian information
E.g., classical Newton's method:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \underbrace{\big[\nabla^2 F(x^{(k)})\big]^{-1} \nabla F(x^{(k)})}_{\text{non-uniformly scaled gradient}}$$

- Smooth convex F ⇒ locally superlinear
- Smooth strongly convex F ⇒ locally quadratic and globally linear
- However, the per-iteration cost is much higher!
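A minimal sketch of the Newton update on a small smooth, strongly convex toy function (the function and starting point are illustrative assumptions); a handful of iterations already drives the gradient to machine precision, reflecting the local quadratic rate.

```python
import numpy as np

def newton_step(grad, hess, x):
    # x_{k+1} = x_k - [grad^2 F(x_k)]^{-1} grad F(x_k)  (unit step size)
    return x - np.linalg.solve(hess(x), grad(x))

# Toy problem: F(x) = sum_i exp(x_i) + ||x||^2 / 2  (smooth, strongly convex)
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x)) + np.eye(len(x))

x = np.array([1.0, -1.0])
for _ in range(10):
    x = newton_step(grad, hess, x)
```

Note the cost drivers relative to first-order methods: forming the Hessian and solving a d × d linear system at every iteration.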
Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples
Finite Sum / Empirical Risk Minimization

FSM/ERM:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} \; F(x) = \frac{1}{n}\sum_{i=1}^n f(x; w_i) = \frac{1}{n}\sum_{i=1}^n f_i(x)$$

- $f_i$: convex and smooth
- F: strongly convex ⇒ unique minimizer $x^\star$
- n ≫ 1 and/or d ≫ 1
Iterative Scheme
$$y^{(k)} = \operatorname*{argmin}_{x \in \mathcal{X}} \; (x - x^{(k)})^T g^{(k)} + \frac{1}{2}(x - x^{(k)})^T H^{(k)} (x - x^{(k)}), \qquad x^{(k+1)} = x^{(k)} + \alpha_k \big(y^{(k)} - x^{(k)}\big)$$
- Newton: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = \nabla^2 F(x^{(k)})$
- Gradient descent: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = I$
- Frank-Wolfe: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = 0$
- (mini-batch) SGD: $\mathcal{S}_g \subset \{1, 2, \dots, n\}$ ⇒ $g^{(k)} = \frac{1}{|\mathcal{S}_g|}\sum_{j \in \mathcal{S}_g} \nabla f_j(x^{(k)})$ and $H^{(k)} = I$
- Sub-sampled Newton (Hessian sub-sampling): $g^{(k)} = \nabla F(x^{(k)})$, and $\mathcal{S}_H \subset \{1, 2, \dots, n\}$ ⇒ $H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j \in \mathcal{S}_H} \nabla^2 f_j(x^{(k)})$
- Sub-sampled Newton (gradient and Hessian sub-sampling): $\mathcal{S}_g \subset \{1, 2, \dots, n\}$ ⇒ $g^{(k)} = \frac{1}{|\mathcal{S}_g|}\sum_{j \in \mathcal{S}_g} \nabla f_j(x^{(k)})$, and $\mathcal{S}_H \subset \{1, 2, \dots, n\}$ ⇒ $H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j \in \mathcal{S}_H} \nabla^2 f_j(x^{(k)})$
Sub-sampled Newton's Method

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization.

Iterative Scheme:

$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \; p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Hessian sub-sampling:

$$g^{(k)} = \nabla F(x^{(k)}), \qquad H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j \in \mathcal{S}_H} \nabla^2 f_j(x^{(k)})$$
Sub-sampled Newton’s Method
Global convergence, i.e., starting from any initial point
Sub-sampled Newton’s Method
Iterative Scheme
- Descent direction: $H^{(k)} p^{(k)} \approx -g^{(k)}$
- Step size: $\alpha^{(k)} = \operatorname*{argmax}_{\alpha \le 1} \alpha$ s.t. $F(x^{(k)} + \alpha p^{(k)}) \le F(x^{(k)}) + \alpha \beta \, p^{(k)T} \nabla F(x^{(k)})$
- Update: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
Sub-sampled Newton’s Method
Algorithm: Newton's Method with Hessian Sub-Sampling
1: Input: $x^{(0)}$
2: for k = 0, 1, 2, ... until termination do
3:   $g^{(k)} = \nabla F(x^{(k)})$
4:   Choose $\mathcal{S}_H^{(k)} \subseteq \{1, 2, \dots, n\}$
5:   $H^{(k)} = \frac{1}{|\mathcal{S}_H^{(k)}|}\sum_{j \in \mathcal{S}_H^{(k)}} \nabla^2 f_j(x^{(k)})$
6:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
7:   Find $\alpha^{(k)}$ that passes the Armijo line search
8:   Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
9: end for
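The algorithm above can be sketched in a few lines of NumPy. The test problem (a small ridge-regularized logistic regression on synthetic data), the sample size $|\mathcal{S}_H| = 50$, and the backtracking constants are illustrative assumptions, not prescriptions from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = 2.0 * rng.integers(0, 2, n) - 1.0      # labels in {-1, +1}
gam = 0.1                                   # ridge parameter

# F(x) = (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (gam/2) ||x||^2
def F(x):
    return np.mean(np.log1p(np.exp(-b * (A @ x)))) + 0.5 * gam * x @ x

def grad(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))   # sigmoid(-b_i a_i^T x)
    return -(A.T @ (b * s)) / n + gam * x

def sub_hess(x, S):
    # Uniformly sub-sampled Hessian over the index set S
    w = 1.0 / (1.0 + np.exp(-(A[S] @ x)))
    D = w * (1.0 - w)                       # second derivative of each loss
    return (A[S].T * D) @ A[S] / len(S) + gam * np.eye(d)

def ssn(x, iters=30, s_H=50, beta=1e-4):
    for _ in range(iters):
        g = grad(x)
        S = rng.choice(n, size=s_H, replace=False)
        p = np.linalg.solve(sub_hess(x, S), -g)   # H^(k) p^(k) = -g^(k)
        alpha = 1.0                               # backtracking Armijo line search
        while F(x + alpha * p) > F(x) + alpha * beta * (p @ g):
            alpha *= 0.5
        x = x + alpha * p
    return x

x = ssn(np.zeros(d))
```

The gradient here is exact while the Hessian uses only 50 of the 500 terms, matching the "Hessian sub-sampling" variant of the scheme.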
Sub-sampled Newton’s Method
Theorem ([Byrd et al., 2011])
If each $f_i$ is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

$$\lim_{k \to \infty} \nabla F(x^{(k)}) = 0.$$
Sub-sampled Newton’s Method
Theorem ([Bollapragada et al., 2016])
If each $f_i$ is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

$$\mathbb{E}\big(F(x^{(k)}) - F(x^\star)\big) \le \rho^k \big(F(x^{(0)}) - F(x^\star)\big),$$

where $\rho \le 1 - 1/\kappa^2$, with κ being the "condition number" of the problem.
Sub-sampled Newton’s Method
What if F is strongly convex, but each $f_i$ is only weakly convex? E.g.,

$$F(x) = \frac{1}{n}\sum_{i=1}^n \ell_i(a_i^T x)$$

- $a_i \in \mathbb{R}^d$ and $\mathrm{Range}\big(\{a_i\}_{i=1}^n\big) = \mathbb{R}^d$
- $\ell_i : \mathbb{R} \to \mathbb{R}$ and $\ell_i'' \ge \gamma > 0$
- Each $\nabla^2 f_i(x) = \ell_i''(a_i^T x)\, a_i a_i^T$ is rank one!
- But $\nabla^2 F(x) \succeq \frac{\gamma}{n}\, \underbrace{\lambda_{\min}\Big(\sum_{i=1}^n a_i a_i^T\Big)}_{>0}\, I$
Sub-Sampling Hessian
$$H = \frac{1}{|\mathcal{S}|}\sum_{j \in \mathcal{S}} \nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016a])
Suppose $\nabla^2 F(x) \succeq \gamma I$ and $0 \preceq \nabla^2 f_i(x) \preceq L I$. Given any 0 < ε < 1 and 0 < δ < 1, if the Hessian is uniformly sub-sampled with

$$|\mathcal{S}| \ge \frac{2\kappa \log(d/\delta)}{\varepsilon^2},$$

then

$$\Pr\big(H \succeq (1-\varepsilon)\gamma I\big) \ge 1 - \delta,$$

where κ = L/γ.
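The lemma's sample-size bound is a one-liner; note that $|\mathcal{S}|$ depends on κ, d, ε, and δ but not on n. The numeric inputs below are made-up examples.

```python
import math

def hessian_sample_size(kappa, d, eps, delta):
    # |S| >= 2 * kappa * log(d / delta) / eps^2  (uniform Hessian sub-sampling)
    return math.ceil(2 * kappa * math.log(d / delta) / eps**2)

# Illustrative values: kappa = 10, d = 1000, eps = 0.5, delta = 0.01
s = hessian_sample_size(kappa=10, d=1000, eps=0.5, delta=0.01)
```

The logarithmic dependence on d and the absence of n are what make the bound useful in the "big data" regime n ≫ 1.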
Sub-Sampling Hessian
Inexact update: $Hp \approx -g \implies \|Hp + g\| \le \theta \|g\|$, θ < 1, using CG.

Why CG and not other solvers (at least theoretically)?

- H is SPD
- $p^{(t)} = \operatorname*{argmin}_{p \in \mathcal{K}_t} \; p^T g + \frac{1}{2} p^T H p \implies \big\langle p^{(t)}, g \big\rangle \le -\frac{1}{2} \big\langle p^{(t)}, H p^{(t)} \big\rangle$
Sub-Sampling Hessian
Theorem ([Roosta and Mahoney, 2016a])
If $\theta \le \sqrt{\dfrac{1-\varepsilon}{\kappa}}$, then, w.h.p.,

$$F(x^{(k+1)}) - F(x^\star) \le \rho\big(F(x^{(k)}) - F(x^\star)\big),$$

where $\rho = 1 - (1-\varepsilon)/\kappa^2$.
Sub-sampled Newton’s Method
Local convergence, i.e., in a neighborhood of x?, and with α(k) = 1
$$\big\|x^{(k+1)} - x^\star\big\| \le \xi_1 \big\|x^{(k)} - x^\star\big\| + \xi_2 \big\|x^{(k)} - x^\star\big\|^2$$

Linear-Quadratic Error Recursion
Sub-sampled Newton’s Method
[Erdogdu and Montanari, 2015]:

$$[U_{r+1}, \Lambda_{r+1}] = \mathrm{TSVD}\big(H, r+1\big)$$

$$H^{-1} = \lambda_{r+1}^{-1} I + U_r \Big(\Lambda_r^{-1} - \frac{1}{\lambda_{r+1}} I\Big) U_r^T$$

$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \; \Big\| x - \big(x^{(k)} - \alpha^{(k)} H^{-1} \nabla F(x^{(k)})\big) \Big\|$$

Theorem ([Erdogdu and Montanari, 2015])

$$\big\|x^{(k+1)} - x^\star\big\| \le \xi_1^{(k)} \big\|x^{(k)} - x^\star\big\| + \xi_2^{(k)} \big\|x^{(k)} - x^\star\big\|^2,$$

where

$$\xi_1^{(k)} = 1 - \frac{\lambda_{\min}(H^{(k)})}{\lambda_{r+1}(H^{(k)})} + \frac{L_g}{\lambda_{r+1}(H^{(k)})} \sqrt{\frac{\log d}{|\mathcal{S}_H^{(k)}|}}, \qquad \xi_2^{(k)} = \frac{L_H}{2\lambda_{r+1}(H^{(k)})}.$$

Note: for constrained optimization, the method is based on two-metric gradient projection, so starting from an arbitrary point the algorithm might not recognize, i.e., may fail to stop at, a stationary point.
Sub-sampling Hessian
$$H = \frac{1}{|\mathcal{S}|}\sum_{j \in \mathcal{S}} \nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016b])
Given any 0 < ε < 1 and 0 < δ < 1, if the Hessian is uniformly sub-sampled with

$$|\mathcal{S}| \ge \frac{2\kappa^2 \log(d/\delta)}{\varepsilon^2},$$

then

$$\Pr\Big((1-\varepsilon)\nabla^2 F(x) \preceq H(x) \preceq (1+\varepsilon)\nabla^2 F(x)\Big) \ge 1 - \delta.$$
Sub-sampled Newton's Method [Roosta and Mahoney, 2016b]

Theorem ([Roosta and Mahoney, 2016b])
With high probability,

$$\big\|x^{(k+1)} - x^\star\big\| \le \xi_1 \big\|x^{(k)} - x^\star\big\| + \xi_2 \big\|x^{(k)} - x^\star\big\|^2,$$

where

$$\xi_1 = \frac{\varepsilon}{1-\varepsilon} + \sqrt{\frac{\kappa}{1-\varepsilon}}\,\theta, \qquad \xi_2 = \frac{L_H}{2(1-\varepsilon)\gamma}.$$

If θ = ε/√κ, then $\xi_1$ is problem-independent! ⇒ Can be made arbitrarily small!
Sub-sampled Newton’s Method
Theorem ([Roosta and Mahoney, 2016b])
Consider any $0 < \rho_0 < \rho < 1$ and $\varepsilon \le \rho_0/(1 + \rho_0)$. If $\|x^{(0)} - x^\star\| \le (\rho - \rho_0)/\xi_2$ and

$$\theta \le \rho_0 \sqrt{\frac{1-\varepsilon}{\kappa}},$$

we get locally Q-linear convergence

$$\|x^{(k)} - x^\star\| \le \rho \|x^{(k-1)} - x^\star\|, \qquad k = 1, \dots, k_0,$$

with probability $(1-\delta)^{k_0}$.

- Problem-independent local convergence rate
- By increasing the Hessian accuracy, a superlinear rate is possible
Putting it all together
Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any $x^{(0)}$, we have:

- linear convergence,
- after a certain number of iterations, "problem-independent" linear convergence,
- after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations.
"Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works."

Prof. Jorge Nocedal, IPAM Summer School, 2012
Sub-sampled Newton’s Method
Theorem ([Bollapragada et al., 2016])
Suppose

$$\Big\| \mathbb{E}_i \big(\nabla^2 f_i(x) - \nabla^2 F(x)\big)^2 \Big\| \le \sigma^2.$$

Then

$$\mathbb{E}\big\|x^{(k+1)} - x^\star\big\| \le \xi_1 \big\|x^{(k)} - x^\star\big\| + \xi_2 \big\|x^{(k)} - x^\star\big\|^2,$$

where

$$\xi_1 = \frac{\sigma}{\gamma \sqrt{|\mathcal{S}_H^{(k)}|}} + \kappa\theta, \qquad \xi_2 = \frac{L_H}{2\gamma}.$$
Exploiting the structure...

Example:

$$F(x) = \frac{1}{n}\sum_{i=1}^n \ell_i(a_i^T x) \implies \nabla^2 F(x) = A^T D_x A,$$

where

$$A \triangleq \begin{pmatrix} a_1^T \\ \vdots \\ a_n^T \end{pmatrix} \in \mathbb{R}^{n \times d}, \qquad D_x \triangleq \frac{1}{n} \begin{pmatrix} \ell_1''(a_1^T x) & & \\ & \ddots & \\ & & \ell_n''(a_n^T x) \end{pmatrix} \in \mathbb{R}^{n \times n}.$$
Sketching [Pilanci and Wainwright, 2017]
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \; (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2}(x - x^{(k)})^T \nabla^2 F(x^{(k)}) (x - x^{(k)})$$

With $\nabla^2 F(x^{(k)}) = B^{(k)T} B^{(k)}$, $B^{(k)} \in \mathbb{R}^{n \times d}$:

$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \; (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2}\Big\|\underbrace{B^{(k)}}_{n \times d}(x - x^{(k)})\Big\|^2$$

Sketched: $\nabla^2 F(x^{(k)}) \approx B^{(k)T} S^{(k)T} S^{(k)} B^{(k)}$, with $S^{(k)} \in \mathbb{R}^{s \times n}$ and $d \le s \ll n$:

$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \; (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2}\Big\|\underbrace{S^{(k)} B^{(k)}}_{s \times d}(x - x^{(k)})\Big\|^2$$
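A sketch of the sketched Hessian using a dense Gaussian (sub-Gaussian) sketch matrix; the dimensions and the error threshold below are illustrative, not tuned to any theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 2000, 10, 400

B = rng.standard_normal((n, d)) / np.sqrt(n)   # square-root factor: H = B^T B
H = B.T @ B

S = rng.standard_normal((s, n)) / np.sqrt(s)   # Gaussian sketch, E[S^T S] = I
SB = S @ B                                      # s x d sketched factor
H_sk = SB.T @ SB                                # sketched Hessian B^T S^T S B

rel_err = np.linalg.norm(H_sk - H, 2) / np.linalg.norm(H, 2)
```

All downstream work (forming and solving the d × d system) uses only the s × d matrix SB, replacing the n-dependent cost with an s-dependent one.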
Sketching [Pilanci and Wainwright, 2017]
- Sub-Gaussian sketches: well-known concentration properties, but involve dense/unstructured matrix operations
- Randomized orthonormal systems, e.g., Hadamard or Fourier: sub-optimal sample sizes, but fast matrix multiplication
- Random row sampling: uniform or non-uniform (more on this later)
- Sparse JL sketches
Non-uniform Sampling [Xu et al., 2016]
- Non-uniformity among the $\nabla^2 f_i$ ⇒ O(n) uniform samples!
- Goal: find $|\mathcal{S}_H|$ independent of n, immune to non-uniformity
- Non-uniform sampling schemes based on:
  1. Row norms
  2. Leverage scores
LiSSA [Agarwal et al., 2017]
- Key idea: use a Taylor expansion (Neumann series) to construct an estimator of the Hessian inverse:

$$\|A\| \le 1,\; A \succ 0 \implies A^{-1} = \sum_{i=0}^{\infty} (I - A)^i$$

$$A_j^{-1} = \sum_{i=0}^{j} (I - A)^i = I + (I - A) A_{j-1}^{-1}, \qquad \lim_{j \to \infty} A_j^{-1} = A^{-1}$$

- Uniformly sub-sampled Hessian + truncated Neumann series
- Three nested loops involving Hessian-vector products, i.e., $\nabla^2 f_{i,j}(x^{(k)})\, v_{i,j}^{(k)}$
- Leverages fast multiplication by Hessians of GLMs
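The deterministic core of the estimator, the truncated Neumann recursion $A_j^{-1} = I + (I - A)A_{j-1}^{-1}$, can be checked directly. LiSSA itself replaces A by sub-sampled Hessians and works only through Hessian-vector products; this sketch uses a small explicit SPD matrix instead, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + np.eye(6)                  # SPD
A = A / (1.1 * np.linalg.norm(A, 2))     # rescale so ||A|| < 1 and A > 0

I = np.eye(6)
inv_j = I.copy()                         # A_0^{-1} = I
for _ in range(5000):
    inv_j = I + (I - A) @ inv_j          # A_j^{-1} = I + (I - A) A_{j-1}^{-1}

err = np.linalg.norm(inv_j - np.linalg.inv(A), 2)
```

The recursion converges at a linear rate governed by ‖I − A‖ < 1, which is why LiSSA's complexity carries an extra condition-number factor.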
name | complexity per iteration | reference
Newton-CG | $O(\mathrm{nnz}(A)\sqrt{\kappa})$ | folklore
SSN-LS | $O(\mathrm{nnz}(A)\log n + p^2\kappa^{3/2})$ | [Xu et al., 2016]
SSN-RNS | $O(\mathrm{nnz}(A) + \mathrm{sr}(A)\,p\,\kappa^{5/2})$ | [Xu et al., 2016]
SRHT | $O(np(\log n)^4 + p^2(\log n)^4\kappa^{3/2})$ | [Pilanci et al., 2016]
SSN-Uniform | $O(\mathrm{nnz}(A) + p\,\bar\kappa\,\kappa^{3/2})$ | [Roosta et al., 2016]
LiSSA | $O(\mathrm{nnz}(A) + p\,\bar\kappa\,\hat\kappa^{2})$ | [Agarwal et al., 2017]

where

$$\kappa = \max_x \frac{\lambda_{\max}\big(\nabla^2 F(x)\big)}{\lambda_{\min}\big(\nabla^2 F(x)\big)}, \qquad \bar\kappa = \max_x \frac{\max_i \lambda_{\max}\big(\nabla^2 f_i(x)\big)}{\lambda_{\min}\big(\nabla^2 F(x)\big)}, \qquad \hat\kappa = \max_x \frac{\max_i \lambda_{\max}\big(\nabla^2 f_i(x)\big)}{\min_i \lambda_{\min}\big(\nabla^2 f_i(x)\big)},$$

so that $\kappa \le \bar\kappa \le \hat\kappa$.
Sub-sampled Newton’s Method
Iterative Scheme:

$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \; p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Gradient and Hessian sub-sampling:

$$g^{(k)} = \frac{1}{|\mathcal{S}_g|}\sum_{j \in \mathcal{S}_g} \nabla f_j(x^{(k)}), \qquad H^{(k)} = \frac{1}{|\mathcal{S}_H|}\sum_{j \in \mathcal{S}_H} \nabla^2 f_j(x^{(k)})$$
Sub-sampled Newton’s Method
Algorithm: Newton's Method with Hessian and Gradient Sub-Sampling
1: Input: $x^{(0)}$
2: for k = 0, 1, 2, ... until termination do
3:   Choose $\mathcal{S}_g^{(k)} \subseteq \{1, 2, \dots, n\}$ and set $g^{(k)} = \frac{1}{|\mathcal{S}_g^{(k)}|}\sum_{j \in \mathcal{S}_g^{(k)}} \nabla f_j(x^{(k)})$
4:   if $\|g^{(k)}\| \le \sigma \varepsilon_g$ then
5:     STOP
6:   end if
7:   Choose $\mathcal{S}_H^{(k)} \subseteq \{1, 2, \dots, n\}$ and set $H^{(k)} = \frac{1}{|\mathcal{S}_H^{(k)}|}\sum_{j \in \mathcal{S}_H^{(k)}} \nabla^2 f_j(x^{(k)})$
8:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
9:   Find $\alpha^{(k)}$ that passes the Armijo line search
10:  Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
11: end for
Sub-sampled Newton’s Method
Global convergence, i.e., starting from any initial point
[Byrd et al., 2012, Roosta and Mahoney, 2016a, Bollapragada et al., 2016]
Sub-sampled Newton’s Method
We can write $\nabla F(x) = AB$ where

$$A := \begin{pmatrix} | & | & & | \\ \nabla f_1(x) & \nabla f_2(x) & \cdots & \nabla f_n(x) \\ | & | & & | \end{pmatrix} \quad \text{and} \quad B := \begin{pmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{pmatrix}$$

Lemma ([Roosta and Mahoney, 2016a])
Let $\|\nabla f_i(x)\| \le G(x) < \infty$. For any 0 < ε, δ < 1, if

$$|\mathcal{S}| \ge \frac{G(x)^2 \big(1 + \sqrt{8\ln(1/\delta)}\big)^2}{\varepsilon^2},$$

then

$$\Pr\big(\|\nabla F(x) - g(x)\| \le \varepsilon\big) \ge 1 - \delta.$$
Sub-sampled Newton’s Method
Theorem ([Roosta and Mahoney, 2016a])
If

$$\theta \le \sqrt{\frac{1-\varepsilon_H}{\kappa}},$$

then, w.h.p.,

$$F(x^{(k+1)}) - F(x^\star) \le \rho\big(F(x^{(k)}) - F(x^\star)\big),$$

where $\rho = 1 - (1-\varepsilon_H)/\kappa^2$, and upon "STOP" we have $\|\nabla F(x^{(k)})\| < (1+\sigma)\varepsilon_g$.
Sub-sampled Newton’s Method
Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$:

$$\big\|x^{(k+1)} - x^\star\big\| \le \xi_0 + \xi_1 \big\|x^{(k)} - x^\star\big\| + \xi_2 \big\|x^{(k)} - x^\star\big\|^2$$

Error Recursion [Roosta and Mahoney, 2016b, Bollapragada et al., 2016]
Sub-sampled Newton’s Method
Theorem ([Roosta and Mahoney, 2016b])
With high probability,

$$\big\|x^{(k+1)} - x^\star\big\| \le \xi_0 + \xi_1 \big\|x^{(k)} - x^\star\big\| + \xi_2 \big\|x^{(k)} - x^\star\big\|^2,$$

where

$$\xi_0 = \frac{\varepsilon_g}{(1-\varepsilon_H)\gamma}, \qquad \xi_1 = \frac{\varepsilon_H}{1-\varepsilon_H} + \sqrt{\frac{\kappa}{1-\varepsilon_H}}\,\theta, \qquad \xi_2 = \frac{L_H}{2(1-\varepsilon_H)\gamma}.$$
Sub-sampled Newton’s Method
Theorem ([Roosta and Mahoney, 2016b])
Consider any $0 < \rho_0 + \rho_1 < \rho < 1$. If $\|x^{(0)} - x^\star\| \le c(\rho_0, \rho_1, \rho)$, $\varepsilon_g^{(k)} = \rho^k \varepsilon_g$, and

$$\theta \le \rho_0 \sqrt{\frac{1-\varepsilon_H}{\kappa}},$$

we get locally R-linear convergence

$$\|x^{(k)} - x^\star\| \le c\rho^k,$$

with probability $(1-\delta)^{2k}$.

Problem-independent local convergence rate
Putting it all together
Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any $x^{(0)}$, we have:

- linear convergence,
- after a certain number of iterations, "problem-independent" linear convergence,
- after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations,
- upon "STOP" at iteration k, $\|\nabla F(x^{(k)})\| < (1+\sigma)\sqrt{\rho^k}\,\varepsilon_g$.
Newton-type Methods for Non-Smooth Problems
Convex Composite Problem:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} \; F(x) + R(x)$$

- F: convex and smooth
- R: convex and non-smooth
Non-Smooth Newton-type [Yu et al., 2010]
Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization.

Iterative Scheme (Non-Quadratic Sub-Problem):

$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \; \sup_{g^{(k)} \in \partial(F+R)(x^{(k)})} p^T g^{(k)} + \frac{1}{2} p^T \underbrace{H^{(k)}}_{\text{e.g., QN}} p, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Iterative Scheme:

$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \; p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T \underbrace{H^{(k)}}_{\approx \nabla^2 F(x^{(k)})} p + R(x^{(k)} + p), \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Notable examples of proximal Newton methods:

- glmnet ([Friedman et al., 2010]): $\ell_1$-regularized GLMs; sub-problems are solved using coordinate descent
- QUIC ([Hsieh et al., 2014]): graphical Lasso problem with factorization tricks; sub-problems are solved using coordinate descent
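For $R = \lambda\|\cdot\|_1$ and the special case $H^{(k)} = I$, the sub-problem has a closed-form soft-thresholding solution (this reduces proximal Newton to proximal gradient); the numbers below are made up for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1 : argmin_x  0.5 ||x - z||^2 + t ||x||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# One proximal Newton step with H = I:
#   p = argmin_p  p^T grad F(x) + 0.5 ||p||^2 + lam ||x + p||_1
#     = soft_threshold(x - grad F(x), lam) - x
lam = 0.5
x = np.array([1.0, -0.2, 0.05])
gradF = np.array([0.3, -0.1, 0.0])      # hypothetical gradient at x
p = soft_threshold(x - gradF, lam) - x
x_new = x + p
```

With a general $H^{(k)}$ the sub-problem has no closed form, which is why glmnet and QUIC resort to coordinate descent on it.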
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Inexactness of the sub-problem solver:

$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \; p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p)$$

$p^\star$ is the minimizer of the above sub-problem iff $p^\star = \mathrm{Prox}_\lambda^{(k)}(p^\star)$, where

$$\mathrm{Prox}_\lambda^{(k)}(p) \triangleq \operatorname*{argmin}_{q \in \mathbb{R}^d} \; \big\langle \nabla F(x^{(k)}) + H^{(k)} p, \, q \big\rangle + R(x^{(k)} + q) + \frac{\lambda}{2}\|q - p\|^2.$$

Inexactness condition:

$$\big\| p - \mathrm{Prox}_\lambda^{(k)}(p) \big\| \le \theta \big\| \mathrm{Prox}_\lambda^{(k)}(0) \big\|,$$

plus some sufficient decrease condition on the sub-problem.
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Step-size (Line-Search)

x(k+1) = x(k) + α(k) p(k), where α(k) satisfies

(F + R)(x(k) + α p(k)) − (F + R)(x(k)) ≤ β ( ℓ_k(x(k) + α p(k)) − ℓ_k(x(k)) ),

ℓ_k(x) = F(x(k)) + (x − x(k))^T ∇F(x(k)) + R(x)
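The condition above is a composite analogue of the Armijo rule, with the first-order model ℓ_k replacing the usual linear decrease. A backtracking sketch (ours; β and the backtracking factor are illustrative choices):

```python
import numpy as np

def line_search(x, p, F, R, grad, beta=1e-4, rho=0.5, max_iter=50):
    """Backtracking on the composite sufficient-decrease condition

        (F+R)(x + a*p) - (F+R)(x) <= beta * ( l(x + a*p) - l(x) ),

    where l(y) = F(x) + (y - x)^T grad + R(y), so l(x) = F(x) + R(x).
    """
    phi0 = F(x) + R(x)
    l0 = phi0                         # value of the model l at x itself
    a = 1.0
    for _ in range(max_iter):
        y = x + a * p
        model_decrease = (F(x) + (y - x) @ grad + R(y)) - l0
        if F(y) + R(y) - phi0 <= beta * model_decrease:
            return a
        a *= rho
    return a

# Toy composite problem: F(z) = 0.5*||z||^2, R(z) = ||z||_1
F = lambda z: 0.5 * z @ z
R = lambda z: np.abs(z).sum()
x = np.array([2.0, -3.0])
grad = x                              # gradient of F at x
p = -x                                # a descent direction for the composite model
alpha = line_search(x, p, F, R, grad)
```

For this toy problem the full step already satisfies the condition, so no backtracking occurs.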
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Global convergence: x(k) converges to an optimal solution starting from any x(0)

Local convergence (for x(0) close enough to x*):

H(k) = ∇²F(x(k)):

Exact solve: quadratic convergence

Approximate solve with decaying forcing term: superlinear convergence

Approximate solve with fixed forcing term: linear convergence

H(k) satisfying the Dennis–Moré condition: superlinear convergence
Sub-Sampled Proximal Newton-type [Liu et al., 2017]
FSM/ERM

min_{x∈X⊆R^d}  F(x) + R(x) = (1/n) ∑_{i=1}^n f_i(a_i^T x) + R(x)

f_i: strongly convex, smooth, and self-concordant

R: convex and non-smooth

n ≫ 1 and/or d ≫ 1

Dennis–Moré condition:

|p(k)^T ( H(k) − ∇²F(x(k)) ) p(k)| ≤ η_k p(k)^T ∇²F(x(k)) p(k)

Leverage-score sampling to ensure the Dennis–Moré condition

Inexact sub-problem solver
Finite Sum / Empirical Risk Minimization

Self-Concordant

|v^T ( ∇³f(x)[v] ) v| ≤ M ( v^T ∇²f(x) v )^{3/2},    ∀ x, v ∈ R^d

M = 2 is called standard self-concordant.
Theorem ([Zhang and Lin, 2015])

Suppose there exist γ > 0 and η ∈ [0, 1) such that |f_i'''(t)| ≤ γ ( f_i''(t) )^{1−η}. Then

F(x) = (1/n) ∑_{i=1}^n f_i(a_i^T x) + (λ/2) ‖x‖²

is self-concordant with

M = ( γ max_i ‖a_i‖^{1+2η} ) / λ^{η+1/2}.
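As a worked instance (a sketch of the standard logistic-loss case), the theorem's condition can be checked directly:

```latex
% Logistic loss: f_i(t) = \log(1 + e^{-t}), with \sigma(t) = 1/(1+e^{-t}):
f_i''(t)  = \sigma(t)\bigl(1-\sigma(t)\bigr), \qquad
f_i'''(t) = f_i''(t)\bigl(1-2\sigma(t)\bigr)
\;\Longrightarrow\;
|f_i'''(t)| \le f_i''(t),
% i.e. the condition holds with \gamma = 1 and \eta = 0, so the theorem gives
M = \frac{\max_i \|a_i\|}{\sqrt{\lambda}}.
```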
Proximal Newton-type Methods

General treatment: [Fukushima and Mine, 1981, Lee et al., 2014, Byrd et al., 2016, Becker and Fadili, 2012, Schmidt et al., 2012, Schmidt et al., 2009, Shi and Liu, 2015]

Tailored to specific problems: [Friedman et al., 2010, Hsieh et al., 2014, Yuan et al., 2012, Oztoprak et al., 2012]

Self-Concordant: [Li et al., 2017, Kyrillidis et al., 2014, Tran-Dinh et al., 2013]
Outline

Part I: Convex
  Smooth
    Newton-CG
  Non-Smooth
    Proximal Newton
    Semi-smooth Newton

Part II: Non-Convex
  Line-Search Based Methods
    L-BFGS
    Gauss-Newton
    Natural Gradient
  Trust-Region Based Methods
    Trust-Region
    Cubic Regularization

Part III: Discussion and Examples
Semi-Smooth Newton-type Methods
Convex Composite Problem

min_{x∈X⊆R^d}  F(x) + R(x)

F: (non-)convex and (non-)smooth

R: convex and non-smooth
Semi-Smooth Newton-type Methods
Recall:

Proximal Newton-type Methods

p(k) ≈ argmin_{p∈R^d}  p^T ∇F(x(k)) + (1/2) p^T H(k) p + R(x(k) + p)

     ≡ prox_R^{H(k)} ( x(k) − H(k)^{−1} ∇F(x(k)) ) − x(k)

prox_R^{H(k)}(x) ≜ argmin_{y∈R^d}  R(y) + (1/2) ‖y − x‖²_{H(k)}

‖y − x‖²_H = ⟨y − x, H (y − x)⟩
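For diagonal H the scaled proximal mapping separates across coordinates and is available in closed form. A sketch for R = μ‖·‖₁ (our illustration; for general non-diagonal H the scaled prox itself requires an inner solver):

```python
import numpy as np

def scaled_prox_l1(x, h, mu):
    """prox_R^H(x) with R = mu*||.||_1 and H = diag(h), h > 0.

    min_y mu*||y||_1 + 0.5*(y - x)^T H (y - x) separates across
    coordinates into soft-thresholding with per-coordinate thresholds mu/h_i.
    """
    t = mu / h
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Usage: thresholds mu/h = [0.2, 0.4, 0.1]
y = scaled_prox_l1(np.array([1.0, -0.2, 0.5]),
                   np.array([2.0, 1.0, 4.0]),
                   mu=0.4)
```

This is why diagonal (or diagonal-plus-low-rank) choices of H(k) are popular: the scaled prox stays as cheap as the unscaled one.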
Semi-Smooth Newton-type Methods
Recall Newton's method for smooth root-finding problems:

r(x) = 0  =⇒  J(x(k)) p(k) = −r(x(k))  =⇒  x(k+1) = x(k) + p(k)

Key Idea

For solving non-smooth systems of non-linear equations, replace the Jacobian in the Newton system with an element of Clarke's generalized Jacobian, which is nonempty whenever the map is locally Lipschitz continuous, even where the true Jacobian does not exist.

Example:

r(x) = prox_R^B ( x − B^{−1} ∇F(x) ) − x
Semi-Smooth Newton-type Methods
Clarke’s generalized Jacobian

Let r : R^d → R^p be locally Lipschitz continuous at x (by Rademacher’s theorem, r is then differentiable almost everywhere). Let D_r be the set of points at which r is differentiable. The Bouligand sub-differential of r at x is defined as

∂_B r(x) ≜ { lim_{k→∞} r′(x_k) : x_k ∈ D_r, x_k → x }.

Clarke’s generalized Jacobian is ∂r(x) = conv( ∂_B r(x) ).
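A one-dimensional example illustrating both sets:

```latex
% r(x) = |x| is differentiable on D_r = \mathbb{R}\setminus\{0\}
% with r'(x) = \operatorname{sign}(x), so at the kink
\partial_B r(0) = \{-1,\,+1\},
\qquad
\partial r(0) = \operatorname{conv}\bigl(\partial_B r(0)\bigr) = [-1,\,1],
% which coincides with the convex subdifferential of |\cdot| at 0.
```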
Semi-Smooth Newton-type Methods
Semi-Smooth Map

r : R^d → R^p is semi-smooth at x if

r is locally Lipschitz continuous,

r is directionally differentiable at x, and

‖r(x + v) − r(x) − Jv‖ = o(‖v‖) as v → 0, for all J ∈ ∂r(x + v).

Example: piecewise continuously differentiable functions.

A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.
Semi-Smooth Newton-type Methods
First-order stationarity is written as a fixed-point-type non-linear system of equations, e.g.,

r_B(x) ≜ x − prox_R^B ( x − B^{−1} ∇F(x) ) = 0,    B ≻ 0

A (stochastic) semi-smooth Newton method is used to (approximately) solve this system of non-linear equations, e.g., as in [Milzarek et al., 2018]:

r_B(x) ≜ x − prox_R^B ( x − B^{−1} g(x) )

J(k) p(k) = −r_B(x(k)),    J(k) ∈ 𝒥_B(x(k))

𝒥_B^(k) ≜ { J ∈ R^{d×d} : J = (I − A) + A B^{−1} H(k),  A ∈ ∂prox_R^B(u(x)) }

u(x) ≜ x − B^{−1} g(x)
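A minimal deterministic sketch of these steps (ours, with B = I, exact gradient and Hessian, and R = μ‖·‖₁; [Milzarek et al., 2018] treat the stochastic case with sub-sampled g(x) and globalization safeguards, all omitted here):

```python
import numpy as np

def soft_threshold(z, t):
    """prox of t*||.||_1, i.e., prox_R^B with B = I."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semismooth_newton(x, grad_F, hess_F, mu, iters=20):
    """Semi-smooth Newton on r(x) = x - prox_{mu||.||_1}(x - grad F(x)).

    One element A of the generalized Jacobian of the prox at u is the 0/1
    diagonal matrix marking coordinates with |u_i| > mu (where the shrinkage
    is locally affine with slope 1); the chain rule on u(x) = x - grad F(x)
    then gives J = (I - A) + A @ H as an element of the generalized
    Jacobian of r.
    """
    d = x.size
    for _ in range(iters):
        u = x - grad_F(x)
        r = x - soft_threshold(u, mu)
        A = np.diag((np.abs(u) > mu).astype(float))  # one admissible choice of A
        J = (np.eye(d) - A) + A @ hess_F(x)
        x = x + np.linalg.solve(J, -r)
    return x

# Toy problem: F(z) = 0.5 z^T Q z - c^T z with R = mu*||.||_1
Q = np.array([[1.2, 0.1],
              [0.1, 1.1]])
c = np.array([1.0, -0.05])
mu = 0.5
x = semismooth_newton(np.zeros(2), lambda z: Q @ z - c, lambda z: Q, mu)
# unique minimizer: x2 = 0 and 1.2*x1 - 1.0 + mu = 0, i.e., x = [5/12, 0]
```

On this piecewise-affine system the iteration lands on the exact solution as soon as the active set is identified correctly.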
Semi-Smooth Newton-type Methods
Semi-smoothness of the proximal mapping does not hold in general.

In certain settings, however, these vector-valued functions are monotone and can be shown to be semi-smooth due to the properties of the proximal mappings.

In such cases, their generalized Jacobians are positive semi-definite due to monotonicity.

Lemma

For a monotone and Lipschitz continuous mapping r : R^d → R^d and any x ∈ R^d, each element of ∂_B r(x) is positive semi-definite.
Semi-Smooth Newton-type Methods
General treatment: [Patrinos et al., 2014, Milzarek and Ulbrich, 2014, Xiao et al., 2016, Patrinos and Bemporad, 2013]

Tailored to a specific problem: [Yuan et al., 2018]

Stochastic: [Milzarek et al., 2018]
THANK YOU!
Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.

Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2618–2626.

Bollapragada, R., Byrd, R. H., and Nocedal, J. (2016). Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis.

Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995.
Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155.

Byrd, R. H., Nocedal, J., and Oztoprak, F. (2016). An inexact successive quadratic approximation method for ℓ1-regularized optimization. Mathematical Programming, 157(2):375–396.

Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 3052–3060. MIT Press.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
Fukushima, M. and Mine, H. (1981). A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947.

Kyrillidis, A., Mahabadi, R. K., Tran-Dinh, Q., and Cevher, V. (2014). Scalable sparse covariance estimation via self-concordance. In AAAI, pages 1946–1952.

Lee, J. D., Sun, Y., and Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443.
Li, J., Andersen, M. S., and Vandenberghe, L. (2017). Inexact proximal Newton methods for self-concordant functions. Mathematical Methods of Operations Research, 85(1):19–41.

Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. (2017). An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552.

Milzarek, A. and Ulbrich, M. (2014). A semismooth Newton method with multidimensional filter globalization for ℓ1-optimization. SIAM Journal on Optimization, 24(1):298–333.

Milzarek, A., Xiao, X., Cen, S., Wen, Z., and Ulbrich, M. (2018). A stochastic semismooth Newton method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1803.03466.
Oztoprak, F., Nocedal, J., Rennie, S., and Olsen, P. A. (2012). Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763.

Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 2358–2363. IEEE.

Patrinos, P., Stella, L., and Bemporad, A. (2014). Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.
Roosta, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: globally convergent algorithms. arXiv preprint arXiv:1601.04737.

Roosta, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: local convergence rates. arXiv preprint arXiv:1601.04738.

Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.

Schmidt, M., Kim, D., and Sra, S. (2012). Projected Newton-type methods in machine learning. Optimization for Machine Learning, (1).
Shi, Z. and Liu, R. (2015). Large scale optimization with proximal stochastic Newton-type gradient descent. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 691–704. Springer.
Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. (2013). Composite self-concordant minimization. arXiv preprint arXiv:1308.2867.

Xiao, X., Li, Y., Wen, Z., and Zhang, L. (2016). A regularized semi-smooth Newton method with projection steps for composite convex programs. Journal of Scientific Computing, pages 1–26.

Xu, P., Yang, J., Roosta, F., Re, C., and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NIPS), pages 3000–3008.
Yu, J., Vishwanathan, S., Gunter, S., and Schraudolph, N. N. (2010). A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11(Mar):1145–1200.

Yuan, G.-X., Ho, C.-H., and Lin, C.-J. (2012). An improved GLMNET for ℓ1-regularized logistic regression. Journal of Machine Learning Research, 13(Jun):1999–2030.

Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. arXiv preprint arXiv:1802.07091.

Zhang, Y. and Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.