Intro Line-Search Based Methods Trust-Region Based Methods Discussion and Examples
Stochastic Second-Order Optimization Methods
Part II: Non-Convex
Fred Roosta
School of Mathematics and Physics, University of Queensland
Now, moving onto non-convex problems...
Non-Convex Is Hard!
Saddle points, Local Minima, Local Maxima
2nd Order Necessary Condition
∇F(x⋆) = 0,  ∇²F(x⋆) ⪰ 0
2nd Order Sufficient Condition
∇F(x⋆) = 0,  ∇²F(x⋆) ≻ 0
Non-Convex Is Hard!
Additional complexity issues...
Optimization of a degree-four polynomial: NP-hard [Hillar et al., 2013]
Checking the sufficient optimality condition: co-NP-complete [Murty et al., 1987]
Checking whether a point is not a local minimum: NP-complete [Murty et al., 1987]
All convex problems are the same, while every non-convex problem is different.
Not sure whose quote this is!
(ε_g, ε_H)-Optimality
‖∇F(x)‖ ≤ ε_g  and  λ_min(∇²F(x)) ≥ −ε_H
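For small problems this condition can be checked numerically; a minimal sketch (the function name and tolerances are illustrative, and λ_min is computed with a dense eigensolver):

```python
import numpy as np

def is_eps_optimal(grad, hess, eps_g=1e-4, eps_H=1e-2):
    """(eps_g, eps_H)-optimality: gradient norm at most eps_g and
    smallest Hessian eigenvalue at least -eps_H."""
    if np.linalg.norm(grad) > eps_g:
        return False
    lam_min = np.linalg.eigvalsh(hess)[0]  # eigvalsh returns eigenvalues in ascending order
    return lam_min >= -eps_H

# Saddle point of F(x, y) = x^2 - y^2 at the origin: the gradient vanishes,
# but the Hessian has eigenvalue -2, so the point is rejected.
print(is_eps_optimal(np.zeros(2), np.diag([2.0, -2.0])))  # False
```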
Outline
Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton, Semi-smooth Newton
Part II: Non-Convex
  Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
  Trust-Region Based Methods: Trust-Region, Cubic Regularization
Part III: Discussion and Examples
Sad note: BFGS may fail on non-convex problems with both exact line search [Mascarenhas, 2004] and inexact (e.g., Wolfe) variants [Dai, 2002]
Happy note: BFGS dominates in many practical non-convex applications
Newton's Method: Scalar Case
Finding a root of r: ℝ → ℝ, i.e., find x⋆ for which r(x⋆) = 0:
0 = r(x⋆) = r(x^(k)) + (x⋆ − x^(k)) r′(x^(k)) + o(|x⋆ − x^(k)|)
0 = r(x^(k)) + (x^(k+1) − x^(k)) r′(x^(k))
x^(k+1) = x^(k) − r(x^(k)) / r′(x^(k)).
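The scalar iteration above fits in a few lines; a minimal sketch (names and tolerances are illustrative):

```python
def newton_root(r, r_prime, x0, tol=1e-12, max_iter=50):
    """Newton's method for a scalar root: x <- x - r(x)/r'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = r(x)
        if abs(fx) < tol:
            break
        x -= fx / r_prime(x)
    return x

# Root of r(x) = x^2 - 2, i.e., sqrt(2); quadratic convergence means
# a handful of iterations suffice.
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```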
Secant Method: Scalar Case
Finding a root of r: ℝ → ℝ, i.e., find x⋆ for which r(x⋆) = 0:
Approximate the derivative: r′(x^(k)) ≈ (r(x^(k)) − r(x^(k−1))) / (x^(k) − x^(k−1))
x^(k+1) = x^(k) − [ (x^(k) − x^(k−1)) / (r(x^(k)) − r(x^(k−1))) ] r(x^(k)).
Local convergence rate is
|x^(k+1) − x⋆| ≤ C |x^(k) − x⋆|^φ,  where φ = (1 + √5)/2 is the "Golden Ratio".
In contrast, the rate of convergence of Newton's method is quadratic!
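The same iteration with the derivative replaced by the finite difference above; a minimal sketch:

```python
def secant_root(r, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: Newton's iteration with r'(x_k) replaced by the
    finite difference (r(x_k) - r(x_{k-1})) / (x_k - x_{k-1})."""
    f0, f1 = r(x0), r(x1)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, r(x1)
    return x1

# Same example as Newton's method: the root of r(x) = x^2 - 2.
root = secant_root(lambda x: x * x - 2.0, 1.0, 2.0)
```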
Quasi-Newton Method
Quasi-Newton optimization methods extend the secant method to the multivariable case!
Quasi-Newton Method
For r(x) = f′(x), we have
f″(x^(k)) ≈ (f′(x^(k)) − f′(x^(k−1))) / (x^(k) − x^(k−1)),
i.e.,
f″(x^(k)) (x^(k) − x^(k−1)) ≈ f′(x^(k)) − f′(x^(k−1)).
Quasi-Newton Method
∇f(x + p) = ∇f(x) + ∇²f(x) p + ∫₀¹ [∇²f(x + tp) − ∇²f(x)] p dt
          = ∇f(x) + ∇²f(x) p + o(‖p‖),
i.e., when x = x^(k), p = x^(k−1) − x^(k), and ‖p‖ ≪ 1, we have
∇²f(x^(k)) (x^(k) − x^(k−1)) ≈ ∇f(x^(k)) − ∇f(x^(k−1))
Quasi-Newton Method
∇²f(x^(k)) (x^(k) − x^(k−1)) ≈ ∇f(x^(k)) − ∇f(x^(k−1))
So, look for H^(k) ≈ ∇²f(x^(k)) such that
H^(k) (x^(k) − x^(k−1)) = ∇f(x^(k)) − ∇f(x^(k−1))   (Secant Condition)
Quasi-Newton Method: Another Interpretation
Another interpretation of the secant condition...
Quasi-Newton Method: Another Interpretation
Recall:
Iterative Scheme
y^(k) = argmin_{x∈X} (x − x^(k))ᵀ g^(k) + ½ (x − x^(k))ᵀ H^(k) (x − x^(k))
x^(k+1) = x^(k) + α_k (y^(k) − x^(k))
Quasi-Newton Method: Another Interpretation
Suppose we have an H^(k) and x^(k+1).
How to update H^(k) to obtain a new quadratic approximation to f(x) at x^(k+1)?
m_{k+1}(p) ≜ f(x^(k+1)) + ⟨∇f(x^(k+1)), p⟩ + ½ ⟨p, H^(k+1) p⟩
One reasonable requirement, suggested by Davidon, is
∇m_{k+1}(0) = ∇f(x^(k+1)) → trivially satisfied
∇m_{k+1}(−α p_k) = ∇f(x^(k)) → secant condition
Quasi-Newton Method: DFP
The revolution began with...
William C. Davidon
DFP: Davidon-Fletcher-Powell scheme
Quasi-Newton Method
H^(k) (x^(k) − x^(k−1)) = ∇f(x^(k)) − ∇f(x^(k−1)).
d equations vs. d² unknowns
The difference between QNMs boils down to how they update H^(k) (or its inverse)!
Quasi-Newton Method
Typical notation in the QN literature:
s_k ≜ x^(k+1) − x^(k)
y_k ≜ ∇f(x^(k+1)) − ∇f(x^(k))
Quasi-Newton Method: BFGS
Updating B^(k) ≜ [H^(k)]⁻¹:
min_{B∈ℝ^{d×d}} ‖B − B^(k)‖
s.t. B = Bᵀ, s_k = B y_k
With ‖A‖ = ‖W^{1/2} A W^{1/2}‖_F for a particular W:
B^(k+1) = (I − s_k y_kᵀ / y_kᵀs_k) B^(k) (I − y_k s_kᵀ / y_kᵀs_k) + s_k s_kᵀ / y_kᵀs_k
Positive definiteness of B^(k) is preserved iff y_kᵀ s_k > 0 (Curvature Condition) ⟹ guaranteed by an appropriate line search, e.g., Armijo + (strong) Wolfe
Under strong convexity (or if the iterates satisfy certain properties), asymptotic super-linear rate of convergence
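The inverse update above takes only a few lines; a minimal, unoptimized sketch (not the memory-efficient form used in practice):

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """One BFGS update of the inverse-Hessian approximation B.
    Skips the update when the curvature condition y^T s > 0 fails,
    which preserves positive definiteness."""
    rho = y @ s
    if rho <= 0:
        return B  # curvature condition violated: keep the old B
    V = np.eye(len(s)) - np.outer(s, y) / rho
    return V @ B @ V.T + np.outer(s, s) / rho

# The updated matrix satisfies the secant condition B_new @ y = s.
B_new = bfgs_inverse_update(np.eye(2), np.array([1.0, 0.5]), np.array([2.0, 1.5]))
```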
Quasi-Newton Method: Limited Memory
General QNM Update:
B^(k+1) = B^(k) + [something]
Problem: Memory storage is O(d²)
Solution: Limited-Memory QNMs, which are low-storage methods
L-BFGS
Instead of storing the inverse Hessian approximation B^(k), L-BFGS maintains a history of iterates and gradients:
s_{k−m}, s_{k−m+1}, …, s_{k−1} and y_{k−m}, y_{k−m+1}, …, y_{k−1}.
B^(k) depends on B^(k−1), y_{k−1} and s_{k−1}.
B^(k−1) depends on B^(k−2), y_{k−2} and s_{k−2}.
and so on...
So B^(k) is defined implicitly in terms of B^(k−m), which is initialized to γI, and the stored pairs.
These are used to implicitly compute B^(k)–vector products.
Linear rate of convergence
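The implicit B^(k)–vector product is the standard two-loop recursion; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist, gamma=1.0):
    """Two-loop recursion: computes the product of the implicit inverse
    Hessian with `grad`, using only the m stored (s_i, y_i) pairs and
    the initialization B^(k-m) = gamma * I."""
    q = grad.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    r = gamma * q
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = (y @ r) / (y @ s)
        r = r + (a - b) * s
    return r  # the search direction is -r

# With an empty history, the result reduces to gamma * grad.
d = lbfgs_direction(np.array([1.0, -2.0]), [], [], gamma=0.5)
```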
L-BFGS
Curvature Condition: ⟨∇f(x^(k+1)) − ∇f(x^(k)), x^(k+1) − x^(k)⟩ > 0
If f(x) is (strictly) convex ⟹ automatically satisfied ✓
Otherwise, enforce the (strong) Wolfe condition on α (a nonlinear inequality):
⟨∇f(x + αp), p⟩ ≥ β ⟨∇f(x), p⟩,  β < 1
When g ≈ ∇f ⟹ noisy curvature estimate ⟹ many issues arise!
L-BFGS [Byrd et al., 2014]
Decoupling of the stochastic gradient and curvature estimations ⟹ different sample subsets for estimating y_k
s_k = x̄_k − x̄_{k−1}, where x̄_k = (1/L) Σ_{j=k−L+1}^{k} x^(j).
y_k = ∇²f_{S_H}(x̄_k) s_k ≈ ∇f_{S_H}(x̄_k) − ∇f_{S_H}(x̄_{k−1}).
Update H^(k) every L ≥ 1 iterations
Under strong convexity, they show that 0 ≺ μ₁I ⪯ H^(k) ⪯ μ₂I
Setting α_k ∝ 1/k, they show
E(f(x^(k)) − f(x⋆)) ≤ C/k
L-BFGS [Mokhtari and Ribeiro, 2014]
s_k = x^(k+1) − x^(k)
Enforce gradient consistency, i.e., use the same samples S: y_k = ∇f_S(x^(k+1)) − ∇f_S(x^(k)).
For some δ > 0, set ỹ_k = y_k − δ s_k
Update B_{k+1} as in the usual case with s_k and ỹ_k
Add regularization: B̃_{k+1} = B_{k+1} + m I
Add regularization again: (B̂_{k+1})⁻¹ = (B̃_{k+1})⁻¹ + M I
⟹ spectrum of B̂_{k+1} is bounded above and away from zero
Strong convexity of each f_i with δ < min_x λ_min(∇²f_i(x)) ⟹ curvature condition holds
Setting α_k ∝ 1/k, they show
E(f(x^(k)) − f(x⋆)) ≤ C/k
L-BFGS [Moritz et al., 2015]
Combine the ideas of [Byrd et al., 2014] with the variance reduction of [Johnson and Zhang, 2013]
Recall SVRG: for s and k, the inner and outer iteration counters, respectively, estimate the gradient as
g^(s) = ∇f_j(x^(s)) − ∇f_j(x^(k)) + ∇F(x^(k))
No need to diminish the step-size any more!
Under strong convexity:
E(f(x^(k)) − f(x⋆)) ≤ ρᵏ E(f(x^(0)) − f(x⋆)), ρ < 1
As in SVRG, convergence is with respect to the outer iterations
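The SVRG estimator is easy to state in code; a toy least-squares sketch (the problem data and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

def grad_i(x, i):
    """Gradient of the i-th component f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    """Gradient of F(x) = (1/n) * sum_i f_i(x)."""
    return A.T @ (A @ x - b) / len(b)

def svrg_gradient(x, x_snap, g_snap, i):
    """SVRG estimate: unbiased for full_grad(x), with variance that
    vanishes as x approaches the snapshot point x_snap."""
    return grad_i(x, i) - grad_i(x_snap, i) + g_snap

x, x_snap = rng.standard_normal(5), rng.standard_normal(5)
g_snap = full_grad(x_snap)
# Averaging the estimator over all i recovers the exact gradient at x.
avg = np.mean([svrg_gradient(x, x_snap, g_snap, i) for i in range(100)], axis=0)
```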
L-BFGS [Berahas et al., 2016, Berahas and Takac, 2017]
Idea: Perform the QN update on overlapping consecutive batches
Idea: O_k = S_k ∩ S_{k+1} ≠ ∅
y_k = ∇f_{O_k}(x^(k+1)) − ∇f_{O_k}(x^(k))
Using a constant step-size α
Strongly convex:
E(f(x^(k)) − f(x⋆)) ≤ ρᵏ (f(x^(0)) − f(x⋆)) + O(α)
Non-convex: skip updating H^(k) if y_kᵀ s_k ≤ ε ‖s_k‖²
E( (1/T) Σ_{k=0}^{T−1} ‖∇f(x^(k))‖² ) ≤ O(1/(Tα)) + O(α)
If α ≤ O(1/√T) ⟹ min_{k≤T−1} E‖∇f(x^(k))‖² ≤ O(√(1/T))
Gauss-Newton
Problem
min_{x∈ℝ^d} F(x) = f(h(x)),
where h: ℝ^d → ℝ^p, and f: ℝ^p → ℝ is convex.
Gauss-Newton
Let J_h(x) ∈ ℝ^{p×d} be the Jacobian of h at x.
∇F(x) = J_hᵀ(x) ∇f(h(x))
∇²F(x) = J_hᵀ(x) ∇²f(h(x)) J_h(x) + ∂²h(x) ∇f(h(x))
(Generalized) Gauss-Newton Matrix:
∇²F(x) ≈ J_hᵀ(x) ∇²f(h(x)) J_h(x) ≜ G(x) ⪰ 0
(Generalized) Gauss-Newton Update:
G(x^(k)) p ≈ −∇F(x^(k))
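For the common least-squares case f(·) = ½‖·‖², the update reduces to solving JᵀJ p = −Jᵀr. A small sketch on a hypothetical one-parameter exponential fit (the example problem and damping constant are assumptions, not from the slides):

```python
import numpy as np

def gauss_newton_step(x, residual, jacobian, damping=1e-10):
    """One Gauss-Newton step for F(x) = 0.5 * ||h(x)||^2, where
    G = J^T J approximates the Hessian (the second-derivative term
    of h is dropped). A tiny damping keeps the solve well-posed."""
    r = residual(x)                     # h(x)
    J = jacobian(x)                     # J_h(x), shape (p, d)
    g = J.T @ r                         # gradient of F
    G = J.T @ J + damping * np.eye(len(x))
    return x - np.linalg.solve(G, g)

# Hypothetical example: recover c = 0.7 from noiseless data y = exp(c * t).
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t)
residual = lambda c: np.exp(c[0] * t) - y
jacobian = lambda c: (t * np.exp(c[0] * t)).reshape(-1, 1)
c = np.array([0.0])
for _ in range(50):
    c = gauss_newton_step(c, residual, jacobian)
```

Since the residual vanishes at the solution, the neglected ∂²h term is harmless here and the iteration converges fast, matching the "small ‖∇f(h(x))‖" case discussed below.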
Gauss-Newton
Another interpretation:
f(h(x)) ≈ f( h(x^(k)) + J_h(x^(k)) (x − x^(k)) ) ≜ ℓ(x; x^(k))
∇F(x^(k)) = ∇ℓ(x^(k); x^(k)) = J_hᵀ(x^(k)) ∇f(h(x^(k)))
∇²F(x^(k)) ≈ ∇²ℓ(x^(k); x^(k)) = J_hᵀ(x^(k)) ∇²f(h(x^(k))) J_h(x^(k)) = G(x^(k))
Gauss-Newton
(Generalized) Gauss-Newton Matrix
∇²F(x) ≈ J_hᵀ(x) ∇²f(h(x)) J_h(x)
Properties:
G(x) ⪰ 0, ∀x
In some applications, after computing ∇F(x) = J_hᵀ(x)∇f(h(x)), the approximation G(x) does not involve any additional derivative evaluations
G(x) is a good approximation if ‖∂²h(x)∇f(h(x))‖ is small, i.e.,
‖∇f(h(x))‖ is small, or
‖∂²h(x)‖ is small, i.e., h is nearly affine
Gauss-Newton
Gauss-Newton Convergence
Under some regularity assumptions:
Damped Gauss-Newton is globally convergent, i.e., lim_{k→∞} ‖∇F(x^(k))‖ = 0
The rate of convergence can be shown to be linear
Local convergence (with S(x⋆) ≜ ∂²h(x⋆)∇f(h(x⋆))):
‖x^(k+1) − x⋆‖ ≤ ‖G⁻¹(x⋆)‖ ‖S(x⋆)‖ ‖x^(k) − x⋆‖ + O(‖x^(k) − x⋆‖²)
Gauss-Newton
Finite-Sum Problem
min_{x∈ℝ^d} F(x) = Σ_{i=1}^n f_i(h_i(x))
Machine Learning (e.g., deep/recurrent/reinforcement learning): [Martens, 2010, Martens and Sutskever, 2011, Chapelle and Erhan, 2011, Wu et al., 2017, Botev et al., 2017]... more on this later
Scientific Computing (e.g., PDE inverse problems): [Doel and Ascher, 2012, Roosta et al., 2014b, Roosta et al., 2014a, Roosta et al., 2015, Haber et al., 2000, Haber et al., 2012]
PDE Inverse Problems with Many R.H.S.
∇·(x(z) ∇u_i(z)) = q_i(z),  z ∈ Ω
∂u_i(z)/∂ν = 0,  z ∈ ∂Ω,
for i = 1, …, n, with Ω ⊂ ℝ² or ℝ³
(a) True x: 2D  (b) True x: 3D
Forward Problem
Discretize-Then-Optimize
v_i = P_i A⁻¹(x) q_i + ε_i,  i = 1, 2, …, n,  x ∈ ℝ^d
n: No. of measurements
d: Mesh size
Inverse Problem
ε_i ∼ N(0, Σ_i)
⟹
f_i(h_i(x)) = ‖ Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i) ‖²,  with h_i(x) = Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i)
⟹
min_x F(x) = (1/n) Σ_{i=1}^n ‖ Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i) ‖²
Calculating "A⁻¹(x) q_i" for each i is costly!
A remedy: SAA
S ⊂ [n] and |S| = s
⟹
F(x) ≈ F_s(x) = (1/|S|) Σ_{i∈S} ‖ Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i) ‖²
Trace estimation: [Roosta and Ascher, 2015, Roosta et al., 2015]
Find s such that, for a given ε and δ, we get
Pr( |F_s(x) − F(x)| ≤ ε F(x) ) ≥ 1 − δ
n = 961, Noise 1%, σ₁ = 0.1, σ₂ = 1
Method      Vanilla   Sub-Sampled
PDE Solves  128,774   3,921
(c) True Model  (d) Sub-Sampled GN
n = 512, Noise 2%, σ_I = 1, σ_II = 0.1
Method      Vanilla   Sub-Sampled
PDE Solves  45,056    2,264
(e) True Model  (f) Sub-Sampled GN
Natural Gradient
Cross Entropy Minimization
For p_x(z), a density parametrized by x, cross-entropy minimization with respect to a target density p̄(z) is
min_{x∈X} L(x) = −E_{p̄}(log p_x(z)) = −∫ p̄(z) log p_x(z) dμ(z).
NB: p̄(z) dμ(z) can be the empirical measure over the training data.
Fisher Information Matrix
Suppose z ∼ p_x. Under some weak regularity assumptions:
F(x) ≜ E_x( ∇log p_x(z) (∇log p_x(z))ᵀ ) = −E_x( ∇²log p_x(z) ).
Natural Gradient Descent
F(x^(k)) p^(k) ≈ −g^(k) ⟹ x^(k+1) = x^(k) + α^(k) p^(k)
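A single natural-gradient step with an empirical Fisher estimate can be sketched as follows (the damping constant and names are illustrative assumptions):

```python
import numpy as np

def natural_gradient_step(x, scores, grad_L, alpha=0.1, damping=1e-6):
    """One natural-gradient step. `scores` holds per-sample score vectors
    grad_x log p_x(z_i); their outer products estimate the Fisher matrix."""
    S = np.asarray(scores)                        # shape (n, d)
    F = S.T @ S / len(S) + damping * np.eye(x.size)
    return x - alpha * np.linalg.solve(F, grad_L)

# When the estimated Fisher is (close to) the identity, the step
# reduces to plain gradient descent.
x0 = np.array([1.0, -1.0])
scores = [np.array([np.sqrt(2.0), 0.0]), np.array([0.0, np.sqrt(2.0)])]
x1 = natural_gradient_step(x0, scores, grad_L=np.array([1.0, 2.0]))
```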
Natural Gradient
Interpretation I:
min_{x∈X} L(x) = −E_{p̄}(log p_x(z))
Natural Gradient vs. Newton's Method
For a given x:
Hessian Matrix: ∇²L(x) = −E_{p̄}( ∇²log p_x(z) )
Fisher Matrix: F(x) = −E_x( ∇²log p_x(z) )
The two differ only in the distribution under which the expectation is taken: the target p̄ for the Hessian, the model p_x for the Fisher.
Natural Gradient
Interpretation I:
Let p̄ be the empirical measure over the given training set {z_i}_{i=1}^n, where z_i ∼ p_{x⋆} for some true, but unknown, parameter x⋆, i.e., empirical risk minimization:
min_{x∈X} L(x) = −(1/n) Σ_{i=1}^n log p_x(z_i)
Approximation I: Natural Gradient vs. Newton's Method
For a given x:
Hessian: ∇²L(x) = −(1/n) Σ_{i=1}^n ∇²log p_x(z_i),  z_i ∼ p_{x⋆}
Approximate Fisher: F(x) = −(1/n) Σ_{i=1}^n ∇²log p_x(z_i),  z_i ∼ p_x
Natural Gradient
Interpretation II:
min_{x∈X} L(x) = −(1/n) Σ_{i=1}^n log p_x(z_i)
In Gauss-Newton, we had L(x) = f(h(x)). Here, we can consider f(t) = −log t ∈ ℝ and h(x) = p_x(z) ∈ ℝ. So,
G(x) = f″(h(x)) ∇h(x) ∇h(x)ᵀ = (1/p_x(z)²) ∇p_x(z) (∇p_x(z))ᵀ
     = ( (1/p_x(z)) ∇p_x(z) ) ( (1/p_x(z)) ∇p_x(z) )ᵀ
     = ∇log p_x(z) (∇log p_x(z))ᵀ
Approximation II: Natural Gradient vs. Gauss-Newton
G(x) = (1/n) Σ_{i=1}^n ∇log p_x(z_i) (∇log p_x(z_i))ᵀ,  z_i ∼ p_{x⋆}
F(x) = (1/n) Σ_{i=1}^n ∇log p_x(z_i) (∇log p_x(z_i))ᵀ,  z_i ∼ p_x
Natural Gradient
Interpretation III:
More generally, consider fitting probabilistic models
min_{x∈ℝ^d} L(x) = L(p_x)
Recall: steepest descent in Euclidean space
Ideally, we want
p⋆ = argmin_{‖p‖≤1} L(x + p),
but it is easier to do
−∇L(x)/‖∇L(x)‖ = argmin_{‖p‖≤1} ⟨∇L(x), p⟩
Natural Gradient
Interpretation III:
Kullback-Leibler divergence
For given x and x̄, the Kullback-Leibler divergence from p_{x̄} to p_x is
KL(x̄ ‖ x) ≜ E_{x̄}( log( p_{x̄}(z)/p_x(z) ) ) = ∫ log( p_{x̄}(z)/p_x(z) ) p_{x̄}(z) dμ(z).
F(x̄) = ∇²_x KL(x̄ ‖ x) |_{x=x̄}
If F(x̄) ≻ 0, then in a neighborhood of x̄, we have KL(x̄ ‖ x) > 0, and
KL(x̄ ‖ x) ≈ ½ (x − x̄)ᵀ F(x̄) (x − x̄)
Natural Gradient
Interpretation III:
Ideally, we want
p⋆ = argmin_{KL(x ‖ x+p) ≤ 1} L(x + p)
But, if ‖p‖ ≪ 1, we can approximate
−F⁻¹(x)∇L(x) ∝ argmin_{pᵀF(x)p ≤ 1} ⟨∇L(x), p⟩
Natural Gradient
Classical: [Amari, 1998]
Overview: [Martens, 2014]
On manifolds: [Song and Ermon, 2018]
Deep learning: [Pascanu and Bengio, 2013, Martens andGrosse, 2015, Grosse and Salakhudinov, 2015]
Problem Statement
Minimizing Finite-Sum Problem
min_{x∈X⊆ℝ^d} F(x) = (1/n) Σ_{i=1}^n f_i(x)
f_i: (non-)convex and smooth
n ≫ 1 and/or d ≫ 1
Trust Region: [Sorensen, 1982, Conn et al., 2000]
s^(k) = argmin_{‖s‖≤Δ_k} ⟨s, ∇F(x^(k))⟩ + ½ ⟨s, ∇²F(x^(k)) s⟩
Cubic Regularization: [Griewank, 1981, Nesterov et al., 2006, Cartis et al., 2011a, Cartis et al., 2011b]
s^(k) = argmin_{s∈ℝ^d} ⟨s, ∇F(x^(k))⟩ + ½ ⟨s, ∇²F(x^(k)) s⟩ + (σ_k/3) ‖s‖³
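A cheap surrogate for the trust-region subproblem is the Cauchy point, the model minimizer along −∇F within the radius; a minimal sketch:

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Minimizer of <s, g> + 0.5 <s, H s> along -g subject to ||s|| <= delta.
    Provides the sufficient model decrease used in trust-region theory."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gHg = g @ H @ g
    if gHg <= 0.0:
        tau = 1.0                      # negative curvature: step to the boundary
    else:
        tau = min(gnorm ** 3 / (delta * gHg), 1.0)
    return -tau * (delta / gnorm) * g

# Convex model with a large radius: recovers the exact minimizer along -g.
s = cauchy_point(np.array([1.0, 0.0]), np.eye(2), delta=10.0)
```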
Trust Region (with inexact Hessian H^(k)):
s^(k) = argmin_{‖s‖≤Δ_k} ⟨s, ∇F(x^(k))⟩ + ½ ⟨s, H^(k) s⟩
Cubic Regularization (with inexact Hessian H^(k)):
s^(k) = argmin_{s∈ℝ^d} ⟨s, ∇F(x^(k))⟩ + ½ ⟨s, H^(k) s⟩ + (σ_k/3) ‖s‖³
To get iteration complexity, previous work required:
‖(H^(k) − ∇²F(x^(k))) s^(k)‖ ≤ C ‖s^(k)‖²   (1)
Stronger than "Dennis-Moré":
lim_{k→∞} ‖(H^(k) − ∇²F(x^(k))) s^(k)‖ / ‖s^(k)‖ = 0
We relaxed (1) to
‖(H^(k) − ∇²F(x^(k))) s^(k)‖ ≤ ε ‖s^(k)‖   (2)
Quasi-Newton, sketching, and sub-sampling satisfy Dennis-Moré and (2), but not necessarily (1)
‖H(x) − ∇²F(x)‖ ≤ ε ⟹ ‖(H(x) − ∇²F(x)) s‖ ≤ ε ‖s‖
H(x) = (1/|S|) Σ_{j∈S} ∇²f_j(x)
Lemma (Uniform Sampling [Xu et al., 2017])
Suppose ‖∇²f_i(x)‖ ≤ K_i, i = 1, 2, …, n. Let K = max_{i=1,…,n} K_i.
Given any 0 < ε < 1, 0 < δ < 1, and x ∈ ℝ^d, if
|S| ≥ (16 K² / ε²) log(2d/δ),
then for
H(x) = (1/|S|) Σ_{j∈S} ∇²f_j(x),
we have
Pr( ‖H(x) − ∇²F(x)‖ ≤ ε ) ≥ 1 − δ.
Only the top eigenvalues/eigenvectors need to be preserved.
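Uniform sub-sampling of the Hessian is straightforward to implement; a sketch with randomly generated per-component Hessians (the sample size and data are illustrative, not the lemma's bound):

```python
import numpy as np

def subsampled_hessian(hessians, sample_size, rng):
    """Average of sample_size per-component Hessians drawn uniformly
    with replacement -- the estimator H(x) in the lemma."""
    idx = rng.integers(0, len(hessians), size=sample_size)
    return sum(hessians[j] for j in idx) / sample_size

rng = np.random.default_rng(0)
hessians = [np.diag(rng.uniform(0.5, 1.5, size=4)) for _ in range(1000)]
H_full = sum(hessians) / len(hessians)
H_sub = subsampled_hessian(hessians, 200, rng)
# Spectral-norm error of the estimate, the quantity the lemma controls.
spectral_err = np.linalg.norm(H_sub - H_full, 2)
```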
F(x) = (1/n) Σ_{i=1}^n f_i(a_iᵀx)
p_i = |f_i″(a_iᵀx)| ‖a_i‖₂² / Σ_{j=1}^n |f_j″(a_jᵀx)| ‖a_j‖₂²
H(x) = (1/|S|) Σ_{j∈S} (1/(n p_j)) ∇²f_j(x)
Lemma (Non-Uniform Sampling [Xu et al., 2017])
Suppose ‖∇²f_i(x)‖ ≤ K_i, i = 1, 2, …, n. Let K̄ = (1/n) Σ_{i=1}^n K_i.
Given any 0 < ε < 1, 0 < δ < 1, and x ∈ ℝ^d, if
|S| ≥ (16 K̄² / ε²) log(2d/δ),
then for
H(x) = (1/|S|) Σ_{j∈S} (1/(n p_j)) ∇²f_j(x),
we have
Pr( ‖H(x) − ∇²F(x)‖ ≤ ε ) ≥ 1 − δ.
(1/n) Σ_{i=1}^n K_i ≤ max_{i=1,…,n} K_i
Theorem ([Xu et al., 2017])
If ε ∈ O(ε_H), then Stochastic TR terminates after
T ∈ O( max{ ε_g⁻² ε_H⁻¹, ε_H⁻³ } )
iterations, upon which, with high probability, we have that
‖∇F(x)‖ ≤ ε_g, and λ_min(∇²F(x)) ≥ −(ε + ε_H).
This is tight! [Cartis et al., 2012]
Theorem ([Xu et al., 2017])
If ε ∈ O(ε_g, ε_H), then Stochastic ARC terminates after
T ∈ O( max{ ε_g^{−3/2}, ε_H⁻³ } )
iterations, upon which, with high probability, we have that
‖∇F(x)‖ ≤ ε_g, and λ_min(∇²F(x)) ≥ −(ε + ε_H).
This is tight! [Cartis et al., 2012]
For ε_H² = ε_g = ε:
Stochastic TR: T ∈ O(ε^{−2.5})
Stochastic ARC: T ∈ O(ε^{−1.5})
But why 1st Order Methods?
Q: But Why 1st Order Methods?
Cheap Iterations
Easy To Implement
“Good” Worst-Case Complexities
Good Generalization
But Why Not 2nd Order Methods?
Q: But Why Not 2nd Order Methods?
Cheap → Expensive Iterations
Easy → Hard To Implement
"Good" → "Bad" Worst-Case Complexities
Good → Bad Generalization
Our Goal...
Goal: Improve 2nd Order Methods...
Expensive → Cheap Iterations
Hard → Easy To Use
"Bad" Worst-Case → "Good" Average(?)-Case Complexities
Bad → Good Generalization
Our Goal...
Any Other Advantages?
Effective Iterations ⇒ Fewer Iterations ⇒ Less Communication
Escaping Saddle Points for Non-Convex Problems
Less Sensitive to Parameter Tuning
Less Sensitive to Initialization
Simulations: ℓ₂-Regularized Logistic Regression
F(x) = (1/n) Σ_{i=1}^n ( log(1 + exp(a_iᵀx)) − b_i a_iᵀx ) + (λ/2) ‖x‖²

Data  n       p       nnz      κ(F)
D1    10⁶     10⁴     0.02%    ≈ 10⁴
D2    5×10⁴   5×10³   Dense    ≈ 10⁶
D3    10⁷     2×10⁴   0.006%   ≈ 10¹⁰
D1, n = 10⁶, p = 10⁴, sparsity: 0.02%, κ ≈ 10⁴
(g) Function Relative Error
D2, n = 5×10⁴, p = 5×10³, sparsity: Dense, κ ≈ 10⁶
(i) Function Relative Error
D3, n = 10⁷, p = 2×10⁴, sparsity: 0.006%, κ ≈ 10¹⁰
(k) Function Relative Error
Newton GPU vs. TensorFlow
Data: Cover Type, n = 4.5×10⁵, d = 378
[Figure: Test Accuracy vs. time (seconds) and Training Objective vs. time (seconds). Step sizes — Adam: 1.00e+01, Adadelta: 1.00e+04, Adagrad: 1.00e+03, RMSProp: 1.00e+01, Momentum: 1.00e-02]
Newton GPU vs. TensorFlow
Data: Newsgroup20, n = 10⁴, d = 10⁶
[Figure: Test Accuracy vs. time (seconds) and Training Objective vs. time (seconds). Step sizes — Adam: 1.00e-01, Adadelta: 1.00e+02, Adagrad: 1.00e+00, RMSProp: 1.00e-01, Momentum: 1.00e-03]
Figure: Skew Param = 0
Figure: Skew Param = 2
Figure: Skew Param = 4
Figure: Skew Param = 6
Figure: Skew Param = 8
Figure: Skew Param = 9
Numerical Examples: Deep Learning

Dataset   Size     Network                   # Parameters
curves    20,000   784-400-200-100-50-25-6   842,340
Cifar10   50,000   ResNet18                  270,896
Deep Auto-Encoder
[Figure: Training Loss vs. # of Props, Autoencoder: curves]
Figure: Random Initialization
Deep Auto-Encoder
[Figure: Test Error vs. # of Props, Autoencoder: curves]
Figure: Random Initialization
Deep Auto-Encoder
[Figure: Training Loss vs. # of Props, Autoencoder: curves]
Figure: Zero Initialization
Deep Auto-Encoder
[Figure: Test Error vs. # of Props, Autoencoder: curves]
Figure: Zero Initialization
Deep Auto-Encoder
[Figure: Training Loss vs. # of Props, Autoencoder: curves. Legend: TR Uniform (Δ₀ = 500, 1200, 3000), Momentum SGD (α = 0.001, 0.01, 0.1, 0.5), GN Uniform]
Figure: Scaled Random Initialization
Deep Auto-Encoder
[Figure: Test Error vs. # of Props, Autoencoder: curves]
Figure: Scaled Random Initialization
Is it all rosy?
ResNet18
Worst Case Complexity
My bone to pick with worst-case complexity results!!!
Q: What do "Newton's method" and "air travel" have in common?
A: Both are very fast, but their worst case is bad!!!
THANK YOU!
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.
Berahas, A. S., Nocedal, J., and Takac, M. (2016). A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063.
Berahas, A. S. and Takac, M. (2017). A robust multi-batch L-BFGS method for machine learning. arXiv preprint arXiv:1707.08552.
Botev, A., Ritter, H., and Barber, D. (2017). Practical Gauss-Newton optimisation for deep learning. arXiv preprint arXiv:1706.03662.
Byrd, R. H., Hansen, S., Nocedal, J., and Singer, Y. (2014). A stochastic quasi-Newton method for large-scale optimization. arXiv preprint arXiv:1401.7020.
Cartis, C., Gould, N. I., and Toint, P. L. (2012). Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108.
Chapelle, O. and Erhan, D. (2011). Improved preconditioner for Hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 201.
Dai, Y.-H. (2002). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 13(3):693–701.
Doel, K. v. d. and Ascher, U. (2012). Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements. SIAM J. Scient. Comput., 34: DOI: 10.1137/110826692.
Grosse, R. and Salakhudinov, R. (2015). Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In International Conference on Machine Learning, pages 2304–2313.
Haber, E., Ascher, U. M., and Oldenburg, D. (2000). On optimization techniques for solving nonlinear inverse problems. Inverse Problems, 16(5):1263.
Haber, E., Chung, M., and Herrmann, F. (2012). An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM Journal on Optimization, 22(3):739–757.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.
Martens, J. (2010). Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742.
Martens, J. (2014). New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.
Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417.
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer.
Mascarenhas, W. F. (2004). The BFGS method with exact line searches fails for non-convex objective functions. Mathematical Programming, 99(1):49–61.
Mokhtari, A. and Ribeiro, A. (2014). RES: Regularized stochastic BFGS algorithm. Signal Processing, IEEE Transactions on, 62(23):6089–6104.
Moritz, P., Nishihara, R., and Jordan, M. I. (2015). A linearly-convergent stochastic L-BFGS algorithm. arXiv preprint arXiv:1508.02087.
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.
Roosta, F. and Ascher, U. (2015). Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212.
Roosta, F., Szekely, G. J., and Ascher, U. (2015). Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA Journal on Uncertainty Quantification, 3(1):61–90.
Roosta, F., van den Doel, K., and Ascher, U. (2014a). Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electronic Transactions on Numerical Analysis, 42:177–196.
Roosta, F., van den Doel, K., and Ascher, U. (2014b). Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM J. Scientific Computing, 36(5):S3–S22.
Song, Y. and Ermon, S. (2018). Accelerating natural gradient with higher-order invariance. arXiv preprint arXiv:1803.01273.
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5279–5288.
Xu, P., Roosta, F., and Mahoney, M. W. (2017). Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164.