Transcript of "Stochastic Second-Order Optimization Methods, Part II: Non-Convex" by Fred Roosta, School of Mathematics and Physics, University of Queensland (105 pages).

Page 1:

Stochastic Second-Order Optimization Methods, Part II: Non-Convex

Fred Roosta

School of Mathematics and Physics, University of Queensland

Page 2:

Now, moving onto non-convex problems...

Page 3:

Non-Convex Is Hard!

Saddle points, Local Minima, Local Maxima

Page 4:

2nd Order Necessary Condition

∇F(x*) = 0,  ∇²F(x*) ⪰ 0

2nd Order Sufficient Condition

∇F(x*) = 0,  ∇²F(x*) ≻ 0

Page 5:

Non-Convex Is Hard!

Additional complexity issues...

Optimization of a degree-four polynomial: NP-hard [Hillar et al., 2013]

Checking the sufficient optimality condition: co-NP-complete [Murty et al., 1987]

Checking whether a point is not a local minimum: NP-complete [Murty et al., 1987]

Page 6:

All convex problems are the same, while every non-convex problem is different.

Not sure whose quote this is!

Page 7:

(ε_g, ε_H)-Optimality

‖∇F(x)‖ ≤ ε_g  and  λ_min(∇²F(x)) ≥ −ε_H

Page 8:

Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Page 9:

Sad note: BFGS may fail on non-convex problems, with both exact line search [Mascarenhas, 2004] and inexact (e.g., Wolfe) variants [Dai, 2002]

Happy note: BFGS dominates in many practical non-convex applications

Page 10:

Newton’s Method: Scalar Case

Finding a root of r : R → R, i.e., find x* for which r(x*) = 0:

0 = r(x*) = r(x^(k)) + (x* − x^(k)) r′(x^(k)) + o(|x* − x^(k)|)

Dropping the remainder and solving for the next iterate x^(k+1):

0 = r(x^(k)) + (x^(k+1) − x^(k)) r′(x^(k))

x^(k+1) = x^(k) − r(x^(k)) / r′(x^(k)).
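As a quick illustration, here is a minimal Python sketch of this scalar Newton iteration; the names (newton_scalar, r, r_prime) are ours, not from the slides:

```python
def newton_scalar(r, r_prime, x0, tol=1e-10, max_iter=50):
    """Scalar Newton iteration: x <- x - r(x) / r'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = r(x)
        if abs(fx) <= tol:
            break
        x = x - fx / r_prime(x)
    return x

# Example: the root of r(x) = x^2 - 2 is sqrt(2).
root = newton_scalar(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
```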

Page 11:

Secant Method: Scalar Case

Finding a root of r : R → R, i.e., find x* for which r(x*) = 0:

Approximate the derivative:

r′(x^(k)) ≈ (r(x^(k)) − r(x^(k−1))) / (x^(k) − x^(k−1))

x^(k+1) = x^(k) − (x^(k) − x^(k−1)) / (r(x^(k)) − r(x^(k−1))) · r(x^(k)).

Local convergence rate:

|x^(k+1) − x*| ≤ C |x^(k) − x*|^φ,  where φ = (1 + √5)/2 is the "Golden Ratio".

In contrast, the rate of convergence of Newton's method is quadratic!
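A matching Python sketch of the secant iteration, which replaces r′ with the finite difference above; again a hypothetical helper, not the speaker's code:

```python
def secant_scalar(r, x0, x1, tol=1e-10, max_iter=50):
    """Scalar secant iteration; assumes r(x0) != r(x1) at each step."""
    for _ in range(max_iter):
        f0, f1 = r(x0), r(x1)
        if abs(f1) <= tol:
            break
        # x_{k+1} = x_k - (x_k - x_{k-1}) / (r(x_k) - r(x_{k-1})) * r(x_k)
        x0, x1 = x1, x1 - (x1 - x0) / (f1 - f0) * f1
    return x1

root = secant_scalar(lambda x: x**2 - 2, x0=1.0, x1=2.0)
```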

Page 12:

Quasi-Newton Method

Quasi-Newton optimization methods extend the secant method to the multivariable case!

Page 13:

Quasi-Newton Method

For r(x) = f′(x), we have

f″(x^(k)) ≈ (f′(x^(k)) − f′(x^(k−1))) / (x^(k) − x^(k−1)),

i.e.,

f″(x^(k)) (x^(k) − x^(k−1)) ≈ f′(x^(k)) − f′(x^(k−1)).

Page 14:

Quasi-Newton Method

∇f(x + p) = ∇f(x) + ∇²f(x) p + ∫₀¹ [∇²f(x + tp) − ∇²f(x)] p dt
          = ∇f(x) + ∇²f(x) p + o(‖p‖),

i.e., when x = x^(k), p = x^(k−1) − x^(k), and ‖p‖ ≪ 1, we have

∇²f(x^(k)) (x^(k) − x^(k−1)) ≈ ∇f(x^(k)) − ∇f(x^(k−1))

Page 15:

Quasi-Newton Method

∇²f(x^(k)) (x^(k) − x^(k−1)) ≈ ∇f(x^(k)) − ∇f(x^(k−1))

So, look for H^(k) ≈ ∇²f(x^(k)) such that

H^(k) (x^(k) − x^(k−1)) = ∇f(x^(k)) − ∇f(x^(k−1))   (Secant Condition)

Page 16:

Quasi-Newton Method: Another Interpretation

Another interpretation of the secant condition...

Page 17:

Quasi-Newton Method: Another Interpretation

Recall:

Iterative Scheme

y^(k) = argmin_{x∈X} (x − x^(k))ᵀ g^(k) + (1/2) (x − x^(k))ᵀ H^(k) (x − x^(k))

x^(k+1) = x^(k) + α_k (y^(k) − x^(k))

Page 18:

Quasi-Newton Method: Another Interpretation

Suppose we have H^(k) and x^(k+1).

How do we update H^(k) to obtain a new quadratic approximation to F(x) at x^(k+1)?

m_{k+1}(p) ≜ f(x^(k+1)) + ⟨∇f(x^(k+1)), p⟩ + (1/2) ⟨p, H^(k+1) p⟩

One reasonable requirement, suggested by Davidon, is

∇m_{k+1}(0) = ∇f(x^(k+1))  →  trivially satisfied

∇m_{k+1}(−α p_k) = ∇f(x^(k))  →  secant condition

Page 19:

Quasi-Newton Method: DFP

The revolution began with...

William C. Davidon

DFP: Davidon-Fletcher-Powell scheme

Page 20:

Quasi-Newton Method

H^(k) (x^(k) − x^(k−1)) = ∇f(x^(k)) − ∇f(x^(k−1)).

d equations vs. d² unknowns!

The difference between QNMs boils down to how they update H^(k) (or its inverse)!

Page 21:

Quasi-Newton Method

Typical notation in the QN literature:

s_k ≜ x^(k+1) − x^(k)

y_k ≜ ∇f(x^(k+1)) − ∇f(x^(k))

Page 22:

Quasi-Newton Method: BFGS

Updating B^(k) ≜ [H^(k)]⁻¹:

min_{B∈R^{d×d}} ‖B − B^(k)‖  s.t.  B = Bᵀ,  s_k = B y_k

With ‖A‖ = ‖W^{1/2} A W^{1/2}‖_F for a particular W:

B^(k+1) = ( I − (s_k y_kᵀ)/(y_kᵀ s_k) ) B^(k) ( I − (y_k s_kᵀ)/(y_kᵀ s_k) ) + (s_k s_kᵀ)/(y_kᵀ s_k)

B^(k+1) ≻ 0 (given B^(k) ≻ 0) iff y_kᵀ s_k > 0 (Curvature Condition) ⟹ guaranteed by an appropriate line search, e.g., Armijo + (strong) Wolfe

Under strong convexity (or if the iterates satisfy certain properties), asymptotic super-linear rate of convergence
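For concreteness, a minimal NumPy sketch of this inverse-Hessian update (our naming; it assumes the curvature condition y_kᵀ s_k > 0 already holds, e.g., via a Wolfe line search):

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """One BFGS update of the inverse-Hessian approximation B.

    Enforces the secant condition B_new @ y = s. With y @ s > 0 and
    B positive definite, B_new stays positive definite.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)
```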

Page 23:

Quasi-Newton Method: Limited Memory

General QNM update:

B^(k+1) = B^(k) + [something]

Problem: memory storage is O(d²)

Solution: limited-memory QNMs, which are low-storage methods

Page 24:

L-BFGS

Instead of storing the inverse Hessian B^(k), L-BFGS maintains a history of iterate and gradient differences

s_{k−m}, s_{k−m+1}, ..., s_k  and  y_{k−m}, y_{k−m+1}, ..., y_k.

B^(k) depends on B^(k−1), y_{k−1}, and s_{k−1}.

B^(k−1) depends on B^(k−2), y_{k−2}, and s_{k−2}.

And so on...

So B^(k−1) is defined implicitly in terms of B^(k−2), y_{k−2}, and s_{k−2}, and we continue down to B^(k−m), which is initialized to γI.

These pairs are used to implicitly compute B^(k)–vector products, as sketched below.

Linear rate of convergence
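These implicit products are typically realized with the standard two-loop recursion; a textbook-style NumPy sketch (assuming s_hist and y_hist hold the m most recent pairs, oldest first), not the speaker's code:

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist, gamma):
    """L-BFGS two-loop recursion: applies the inverse-Hessian
    approximation to grad without ever forming a d-by-d matrix."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):  # newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    r = gamma * q  # initial matrix B^(k-m) = gamma * I
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):  # oldest to newest
        b = (y @ r) / (y @ s)
        r += (a - b) * s
    return -r  # search direction -B^(k) @ grad
```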

Page 25:

L-BFGS

Curvature Condition:

⟨∇f(x^(k+1)) − ∇f(x^(k)), x^(k+1) − x^(k)⟩ > 0

If f(x) is (strictly) convex ⟹ holds automatically

Otherwise, enforce the (strong) Wolfe condition on α (a nonlinear inequality):

⟨∇f(x + αp), p⟩ ≥ β ⟨∇f(x), p⟩,  β < 1

When g ≈ ∇f ⟹ noisy curvature estimates ⟹ many issues arise!

Page 26:

L-BFGS [Byrd et al., 2014]

Decoupling the stochastic gradient and curvature estimates ⟹ different sample subsets for estimating y_k:

s_k = x̄_k − x̄_{k−1}, where x̄_k = (1/L) Σ_{j=k−L+1}^{k} x^(j)

y_k = ∇²f_{S_H}(x̄_k) s_k ≈ ∇f_{S_H}(x̄_k) − ∇f_{S_H}(x̄_{k−1})

Update H^(k) every L ≥ 1 iterations

Under strong convexity, they show that 0 ≺ µ₁I ⪯ H^(k) ⪯ µ₂I

Setting α_k ∝ 1/k, they show

E[f(x^(k)) − f(x*)] ≤ C/k

Page 27:

L-BFGS [Mokhtari and Ribeiro, 2014]

s_k = x^(k+1) − x^(k)

Enforce gradient consistency, i.e., use the same samples S:

y_k = ∇f_S(x^(k+1)) − ∇f_S(x^(k)).

For some δ > 0, set ỹ_k = y_k − δ s_k

Update B_{k+1} as in the usual case, with s_k and ỹ_k

Add regularization: B̃_{k+1} = B_{k+1} + mI

Add regularization again: (B̂_{k+1})⁻¹ = (B̃_{k+1})⁻¹ + MI

⟹ the spectrum of B̂_{k+1} is bounded above and away from zero

Strong convexity of each f_i, together with δ < min_x λ_min(∇²f_i(x)), ⟹ the curvature condition holds

Setting α_k ∝ 1/k, they show

E[f(x^(k)) − f(x*)] ≤ C/k

Page 28:

L-BFGS [Moritz et al., 2015]

Combine the ideas of [Byrd et al., 2014] with the variance reduction of [Johnson and Zhang, 2013].

Recall SVRG: for s and k, the inner and outer iteration counters, respectively, estimate the gradient as

g^(s) = ∇f_j(x^(s)) − ∇f_j(x^(k)) + ∇F(x^(k))

No need to diminish the step-size any more!

Under strong convexity:

E[f(x^(k)) − f(x*)] ≤ ρ^k E[f(x^(0)) − f(x*)],  ρ < 1

As in SVRG, convergence is with respect to the outer iterations.
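The estimator itself is one line; a hedged Python sketch, where grad_i(j, x) returning ∇f_j(x) and the anchor quantities are assumed interfaces:

```python
def svrg_gradient(grad_i, j, x_inner, x_anchor, full_grad_anchor):
    """SVRG gradient estimate: unbiased, and its variance shrinks as
    the inner iterate approaches the (outer) anchor point."""
    return grad_i(j, x_inner) - grad_i(j, x_anchor) + full_grad_anchor
```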

Page 29:

L-BFGS [Berahas et al., 2016, Berahas and Takac, 2017]

Idea: Perform QN updates on overlapping consecutive batches, with overlap O_k = S_k ∩ S_{k+1} ≠ ∅:

y_k = ∇f_{O_k}(x^(k+1)) − ∇f_{O_k}(x^(k))

Using a constant step-size α:

Strongly convex:

E[f(x^(k)) − f(x*)] ≤ ρ^k (f(x^(0)) − f(x*)) + O(α)

Non-convex: skip updating H^(k) whenever y_kᵀ s_k ≤ ε ‖s_k‖²:

E[(1/T) Σ_{k=0}^{T−1} ‖∇f(x^(k))‖²] ≤ O(1/T) + O(α)

If α ≤ O(1/√T) ⟹ min_{k≤T−1} E‖∇f(x^(k))‖² ≤ O(1/√T)

Page 30:

Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Page 31:

Gauss-Newton

Problem

min_{x∈R^d} F(x) = f(h(x))

- h : R^d → R^p
- f : R^p → R, convex

Page 32:

Gauss-Newton

Let J_h denote the Jacobian of h, i.e., J_h(x) ∈ R^{p×d}.

∇F(x) = J_h(x)ᵀ ∇f(h(x))

∇²F(x) = J_h(x)ᵀ ∇²f(h(x)) J_h(x) + ∂²h(x) ∇f(h(x))

(Generalized) Gauss-Newton Matrix:

∇²F(x) ≈ J_h(x)ᵀ ∇²f(h(x)) J_h(x) ≜ G(x) ⪰ 0   (the Gauss-Newton matrix)

(Generalized) Gauss-Newton Update:

G(x^(k)) p ≈ −∇F(x^(k))
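A minimal NumPy sketch of solving this update for p; the small damping term is our addition to keep the linear system solvable when G is singular, not part of the slide:

```python
import numpy as np

def gauss_newton_step(J, grad_f_h, hess_f_h, damping=1e-8):
    """Generalized Gauss-Newton step: solve G p = -grad, where
    G = J^T hess_f_h J and grad = J^T grad_f_h.

    J:         Jacobian of h at x, shape (p, d)
    grad_f_h:  gradient of f at h(x), shape (p,)
    hess_f_h:  Hessian of f at h(x), shape (p, p)
    """
    G = J.T @ hess_f_h @ J
    grad = J.T @ grad_f_h
    d = G.shape[0]
    return np.linalg.solve(G + damping * np.eye(d), -grad)
```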

Page 33:

Gauss-Newton

Another interpretation:

f(h(x)) ≈ f( h(x^(k)) + J_h(x^(k)) (x − x^(k)) ) ≜ ℓ(x; x^(k))

∇F(x^(k)) = ∇ℓ(x^(k); x^(k)) = J_h(x^(k))ᵀ ∇f(h(x^(k)))

∇²F(x^(k)) ≈ ∇²ℓ(x^(k); x^(k)) = J_h(x^(k))ᵀ ∇²f(h(x^(k))) J_h(x^(k)) = G(x^(k))

Page 34:

Gauss-Newton

(Generalized) Gauss-Newton Matrix

∇²F(x) ≈ J_h(x)ᵀ ∇²f(h(x)) J_h(x)

Properties:

- G(x) ⪰ 0 for all x
- In some applications, after computing ∇F(x) = J_h(x)ᵀ ∇f(h(x)), forming G(x) involves no additional derivative evaluations
- G(x) is a good approximation if ‖∂²h(x) ∇f(h(x))‖ is small, i.e., if ‖∇f(h(x))‖ is small, or if ‖∂²h(x)‖ is small (h is nearly affine)

Page 35:

Gauss-Newton

Gauss-Newton Convergence

Under some regularity assumptions:

- Damped Gauss-Newton is globally convergent, i.e., lim_{k→∞} ‖∇F(x^(k))‖ = 0
- The rate of convergence can be shown to be linear
- Local convergence, with S(x*) ≜ ∂²h(x*) ∇f(h(x*)):

‖x^(k+1) − x*‖ ≤ ‖G(x*)⁻¹‖ ‖S(x*)‖ ‖x^(k) − x*‖ + O(‖x^(k) − x*‖²)

Page 36:

Gauss-Newton

Finite-Sum Problem

min_{x∈R^d} F(x) = Σ_{i=1}^{n} f_i(h_i(x))

- Machine learning (e.g., deep/recurrent/reinforcement learning): [Martens, 2010, Martens and Sutskever, 2011, Chapelle and Erhan, 2011, Wu et al., 2017, Botev et al., 2017]... more on this later
- Scientific computing (e.g., PDE inverse problems): [Doel and Ascher, 2012, Roosta et al., 2014b, Roosta et al., 2014a, Roosta et al., 2015, Haber et al., 2000, Haber et al., 2012]

Page 37:

PDE Inverse Problems with Many R.H.S

∇ · (x(z) ∇u_i(z)) = q_i(z),  z ∈ Ω
∂u_i(z)/∂ν = 0,  z ∈ ∂Ω

for i = 1, ..., n, with Ω ⊂ R² or R³

(a) True x: 2D   (b) True x: 3D

Page 38:

Forward Problem

Discretize-Then-Optimize

v_i = P_i A⁻¹(x) q_i + ε_i,  i = 1, 2, ..., n,  x ∈ R^d

- n: number of measurements
- d: mesh size

Page 39:

Inverse Problem

ε_i ∼ N(0, Σ_i)

f_i(h_i(x)) = ‖ Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i) ‖²,  with h_i(x) = Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i)

min_x F(x) = (1/n) Σ_{i=1}^{n} ‖ Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i) ‖²

Calculating A⁻¹(x) q_i for each i is costly!

Page 40:

A remedy: SAA

S ⊂ [n], |S| = s

F(x) ≈ F_s(x) = (1/|S|) Σ_{i∈S} ‖ Σ_i^{−1/2} (P_i A⁻¹(x) q_i − v_i) ‖²

Trace estimation: [Roosta and Ascher, 2015, Roosta et al., 2015]

Find s such that, for a given ε and δ,

Pr( |F_s(x) − F(x)| ≤ ε F(x) ) ≥ 1 − δ

Page 41:

n = 961, noise 1%, σ₁ = 0.1, σ₂ = 1

Method     | Vanilla | Sub-Sampled
PDE Solves | 128,774 | 3,921

(c) True Model   (d) Sub-Sampled GN

Page 42:

n = 512, noise 2%, σ_I = 1, σ_II = 0.1

Method     | Vanilla | Sub-Sampled
PDE Solves | 45,056  | 2,264

(e) True Model   (f) Sub-Sampled GN

Page 43:

Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Page 44:

Natural Gradient

Cross Entropy Minimization

For p_x(z), a density parametrized by x, cross-entropy minimization with respect to a target density p̃(z) is

min_{x∈X} L(x) = −E_{p̃}(log p_x(z)) = −∫ p̃(z) log p_x(z) dµ(z).

NB: p̃(z) dµ(z) can be the empirical measure over the training data.

Fisher Information Matrix

Suppose z ∼ p_x. Under some weak regularity assumptions:

F(x) ≜ E_x( ∇ log p_x(z) (∇ log p_x(z))ᵀ ) = −E_x( ∇² log p_x(z) ).

Natural Gradient Descent

F(x^(k)) p^(k) ≈ g^(k)  ⟹  x^(k+1) = x^(k) + α^(k) p^(k)
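A Python sketch of one step using the empirical (outer-product) Fisher estimate; the interface and the damping term are our assumptions, not the speaker's method:

```python
import numpy as np

def natural_gradient_step(scores, grad_L, alpha, damping=1e-4):
    """One natural-gradient step with an empirical Fisher estimate.

    scores: array (n, d); row i is the score grad_x log p_x(z_i).
    grad_L: gradient of the loss L at x, shape (d,).
    Estimates F(x) as (1/n) sum_i s_i s_i^T (damped to stay invertible)
    and returns the update -alpha * F^{-1} grad_L.
    """
    n, d = scores.shape
    F = scores.T @ scores / n + damping * np.eye(d)
    return -alpha * np.linalg.solve(F, grad_L)
```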

Page 45:

Natural Gradient

Interpretation I:

min_{x∈X} L(x) = −E(log p_x(z))

Natural Gradient vs. Newton's Method

For a given x:

Hessian matrix: ∇²L(x) = −E( ∇² log p_x(z) )

Fisher matrix: F(x) = −E_x( ∇² log p_x(z) )

Page 46:

Natural Gradient

Interpretation I:

Let p̃ be the empirical measure over the given training set {z_i}_{i=1}^n, where z_i ∼ p_{x*} for some true, but unknown, parameter x*; i.e., empirical risk minimization:

min_{x∈X} L(x) = −(1/n) Σ_{i=1}^{n} log p_x(z_i)

Approximation I: Natural Gradient vs. Newton's Method

For a given x:

Hessian: ∇²L(x) = −(1/n) Σ_{i=1}^{n} ∇² log p_x(z_i),  z_i ∼ p_{x*}

Approximate Fisher: F(x) = −(1/n) Σ_{i=1}^{n} ∇² log p_x(z_i),  z_i ∼ p_x

Page 47:

Natural Gradient

Interpretation II:

min_{x∈X} L(x) = −(1/n) Σ_{i=1}^{n} log p_x(z_i)

In Gauss-Newton, we had L(x) = f(h(x)). Here, we can consider f(t) = −log t ∈ R and h(x) = p_x(z) ∈ R. So,

G(x) = f″(h(x)) ∇h(x) ∇h(x)ᵀ = (1/p_x(z)²) ∇p_x(z) (∇p_x(z))ᵀ
     = ( (1/p_x(z)) ∇p_x(z) ) ( (1/p_x(z)) ∇p_x(z) )ᵀ
     = ∇ log p_x(z) (∇ log p_x(z))ᵀ

Approximation II: Natural Gradient vs. Gauss-Newton

G(x) = (1/n) Σ_{i=1}^{n} ∇ log p_x(z_i) (∇ log p_x(z_i))ᵀ,  z_i ∼ p_{x*}

F(x) = (1/n) Σ_{i=1}^{n} ∇ log p_x(z_i) (∇ log p_x(z_i))ᵀ,  z_i ∼ p_x

Page 48:

Natural Gradient

Interpretation III: More generally, consider fitting probabilistic models

min_{x∈R^d} L(x) = L(p_x)

Recall: steepest descent in Euclidean space. Ideally, we want

p* = argmin_{‖p‖≤1} L(x + p),

but it is easier to do

−∇L(x) / ‖∇L(x)‖ = argmin_{‖p‖≤1} ⟨∇L(x), p⟩

Page 49:

Natural Gradient

Interpretation III:

Kullback-Leibler distance

For given x and x̄, the Kullback-Leibler distance from p_x to p_x̄ is

KL(x ‖ x̄) ≜ E_x( log( p_x(z) / p_x̄(z) ) ) = ∫ log( p_x(z) / p_x̄(z) ) p_x(z) dµ(z).

F(x) = ∇²_x̄ KL(x ‖ x̄) |_{x̄=x}

If F(x) ≻ 0, then in a neighborhood of x we have KL(x ‖ x̄) > 0, and

KL(x ‖ x̄) ≈ (1/2) (x̄ − x)ᵀ F(x) (x̄ − x)

Page 50:

Natural Gradient

Interpretation III:

Ideally, we want

p* = argmin_{KL(x ‖ x+p) ≤ 1} L(x + p)

But if ‖p‖ ≪ 1, we can approximate:

F⁻¹(x) ∇L(x) ∝ argmin_{pᵀ F(x) p ≤ 1} ⟨∇L(x), p⟩

Page 51:

Natural Gradient

Classical: [Amari, 1998]

Overview: [Martens, 2014]

On manifolds: [Song and Ermon, 2018]

Deep learning: [Pascanu and Bengio, 2013, Martens and Grosse, 2015, Grosse and Salakhudinov, 2015]

Page 52:

Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Page 53:

Problem Statement

Minimizing a Finite-Sum Problem

min_{x∈X⊆R^d} F(x) = (1/n) Σ_{i=1}^{n} f_i(x)

- f_i: (non-)convex and smooth
- n ≫ 1 and/or d ≫ 1

Page 54:

Trust Region: [Sorensen, 1982, Conn et al., 2000]

s^(k) = argmin_{‖s‖≤Δ_k} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩

Cubic Regularization: [Griewank, 1981, Nesterov et al., 2006, Cartis et al., 2011a, Cartis et al., 2011b]

s^(k) = argmin_{s∈R^d} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩ + (σ_k/3) ‖s‖³
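For intuition about the trust-region subproblem, here is a Python sketch of the classical Cauchy point: the minimizer of the quadratic model along the steepest-descent direction within the radius. This is standard trust-region machinery, not the specific stochastic method analyzed below:

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Minimize <s, g> + 0.5 <s, H s> over s = -t * g with ||s|| <= delta."""
    gnorm = np.linalg.norm(g)
    t_max = delta / gnorm               # step length that hits the boundary
    gHg = g @ H @ g
    if gHg <= 0:
        t = t_max                       # nonpositive curvature: go to the boundary
    else:
        t = min(gnorm**2 / gHg, t_max)  # interior minimizer, clipped to the region
    return -t * g
```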

Page 55:

Trust Region:

s^(k) = argmin_{‖s‖≤Δ_k} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩

Cubic Regularization:

s^(k) = argmin_{s∈R^d} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩ + (σ_k/3) ‖s‖³

Page 56:

Trust Region:

s^(k) = argmin_{‖s‖≤Δ_k} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, H^(k) s⟩

Cubic Regularization:

s^(k) = argmin_{s∈R^d} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, H^(k) s⟩ + (σ_k/3) ‖s‖³

Page 57:

To get iteration complexity, previous work required

‖(H^(k) − ∇²F(x^(k))) s^(k)‖ ≤ C ‖s^(k)‖²   (1)

which is stronger than "Dennis-Moré":

lim_{k→∞} ‖(H^(k) − ∇²F(x^(k))) s^(k)‖ / ‖s^(k)‖ = 0

We relaxed (1) to

‖(H^(k) − ∇²F(x^(k))) s^(k)‖ ≤ ε ‖s^(k)‖   (2)

Quasi-Newton, sketching, and sub-sampling satisfy Dennis-Moré and (2), but not necessarily (1).

Page 58:

‖H(x) − ∇²F(x)‖ ≤ ε  ⟹  ‖(H(x) − ∇²F(x)) s‖ ≤ ε ‖s‖

H(x) = (1/|S|) Σ_{j∈S} ∇²f_j(x)

Page 59:

Lemma (Uniform Sampling [Xu et al., 2017])

Suppose ‖∇²f_i(x)‖ ≤ K_i for i = 1, 2, ..., n, and let K = max_{i=1,...,n} K_i. Given any 0 < ε < 1, 0 < δ < 1, and x ∈ R^d, if

|S| ≥ (16 K² / ε²) log(2d / δ),

then for

H(x) = (1/|S|) Σ_{j∈S} ∇²f_j(x)

we have

Pr( ‖H(x) − ∇²F(x)‖ ≤ ε ) ≥ 1 − δ.

Only the top eigenvalues/eigenvectors need to be preserved.
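A Python sketch of this uniform estimator, with hess_i(j, x) returning ∇²f_j(x) as an assumed interface; the lemma's bound dictates how large sample_size should be:

```python
import numpy as np

def subsampled_hessian(hess_i, x, n, sample_size, seed=0):
    """Uniform sub-sampled Hessian: H(x) = (1/|S|) sum_{j in S} hess_j(x),
    with S drawn uniformly (with replacement) from {0, ..., n-1}."""
    rng = np.random.default_rng(seed)
    S = rng.integers(0, n, size=sample_size)
    H = sum(hess_i(int(j), x) for j in S)
    return H / sample_size
```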

Page 60:

F(x) = (1/n) Σ_{i=1}^{n} f_i(a_iᵀ x)

p_i = |f_i″(a_iᵀ x)| ‖a_i‖₂² / Σ_{j=1}^{n} |f_j″(a_jᵀ x)| ‖a_j‖₂²

H(x) = (1/|S|) Σ_{j∈S} (1/(n p_j)) ∇²f_j(x)
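The corresponding sketch for this non-uniform scheme, specialized to f_i(a_iᵀx) so that ∇²f_j(x) = f_j″(a_jᵀx) a_j a_jᵀ; A holding the rows a_i and fpp holding the values f_i″(a_iᵀx) are assumed inputs:

```python
import numpy as np

def nonuniform_subsampled_hessian(A, fpp, sample_size, seed=0):
    """Non-uniform sub-sampled Hessian for F(x) = (1/n) sum_i f_i(a_i^T x).

    Rows are sampled with p_i proportional to |f_i''| * ||a_i||^2, and each
    sampled term is reweighted by 1/(n p_j), so the estimator is unbiased.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    p = np.abs(fpp) * np.sum(A * A, axis=1)
    p = p / p.sum()
    S = rng.choice(n, size=sample_size, p=p)
    H = np.zeros((d, d))
    for j in S:
        H += (fpp[j] / (n * p[j])) * np.outer(A[j], A[j])
    return H / sample_size
```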

Page 61:

Lemma (Non-Uniform Sampling [Xu et al., 2017])

Suppose ‖∇²f_i(x)‖ ≤ K_i for i = 1, 2, ..., n, and let K = (1/n) Σ_{i=1}^{n} K_i. Given any 0 < ε < 1, 0 < δ < 1, and x ∈ R^d, if

|S| ≥ (16 K² / ε²) log(2d / δ),

then for

H(x) = (1/|S|) Σ_{j∈S} (1/(n p_j)) ∇²f_j(x)

we have

Pr( ‖H(x) − ∇²F(x)‖ ≤ ε ) ≥ 1 − δ.

Page 62:

(1/n) Σ_{i=1}^{n} K_i ≤ max_{i=1,...,n} K_i

Page 63:

Theorem ([Xu et al., 2017])

If ε ∈ O(ε_H), then stochastic TR terminates after

T ∈ O( max{ ε_g⁻² ε_H⁻¹, ε_H⁻³ } )

iterations, upon which, with high probability, we have

‖∇F(x)‖ ≤ ε_g  and  λ_min(∇²F(x)) ≥ −(ε + ε_H).

This is tight! [Cartis et al., 2012]

Page 64:

Theorem ([Xu et al., 2017])

If ε ∈ O(ε_g, ε_H), then stochastic ARC terminates after

T ∈ O( max{ ε_g^{−3/2}, ε_H⁻³ } )

iterations, upon which, with high probability, we have

‖∇F(x)‖ ≤ ε_g  and  λ_min(∇²F(x)) ≥ −(ε + ε_H).

This is tight! [Cartis et al., 2012]

Page 65:

For ε_H² = ε_g = ε:

- Stochastic TR: T ∈ O(ε^{−2.5})
- Stochastic ARC: T ∈ O(ε^{−1.5})

Page 66:

Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples

Page 67:

But why 1st Order Methods?

Q: But Why 1st Order Methods?

Cheap Iterations

Easy To Implement

“Good” Worst-Case Complexities

Good Generalization

Page 68:

But Why Not 2nd Order Methods?

Q: But Why Not 2nd Order Methods?

- Cheap → Expensive Iterations
- Easy → Hard To Implement
- "Good" → "Bad" Worst-Case Complexities
- Good → Bad Generalization

Page 69:

Our Goal...

Goal: Improve 2nd Order Methods...

- Expensive → Cheap Iterations
- Hard → Easy To Use
- "Bad" Worst-Case → "Good" Average(?)-Case Complexities
- Bad → Good Generalization

Page 70:

Our Goal...

Any Other Advantages?

Effective Iterations ⇒ Fewer Iterations ⇒ Less Communication

Saddle Points For Non-Convex Problems

Less Sensitive to Parameter Tuning

Less Sensitive to Initialization

Page 71:

Simulations: ℓ2-Regularized LR

F(x) = (1/n) Σ_{i=1}^{n} [ log(1 + exp(a_iᵀ x)) − b_i a_iᵀ x ] + (λ/2) ‖x‖²

Data | n      | p      | nnz    | κ(F)
D1   | 10^6   | 10^4   | 0.02%  | ≈ 10^4
D2   | 5×10^4 | 5×10^3 | dense  | ≈ 10^6
D3   | 10^7   | 2×10^4 | 0.006% | ≈ 10^10
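For reference, a NumPy sketch of this objective with its gradient and a Hessian-vector product; lam stands in for the ℓ2 weight λ, and all names are ours:

```python
import numpy as np

def logistic_objective(x, A, b, lam):
    """F(x) = (1/n) sum_i [log(1 + exp(a_i^T x)) - b_i a_i^T x] + (lam/2)||x||^2.
    Returns the value, the gradient, and a Hessian-vector product closure."""
    n = A.shape[0]
    t = A @ x
    val = np.mean(np.logaddexp(0.0, t) - b * t) + 0.5 * lam * (x @ x)
    sig = 1.0 / (1.0 + np.exp(-t))             # sigmoid(a_i^T x)
    grad = A.T @ (sig - b) / n + lam * x
    def hvp(v):                                 # (1/n) A^T D A v + lam v
        return A.T @ (sig * (1.0 - sig) * (A @ v)) / n + lam * v
    return val, grad, hvp
```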

Page 72:

D1: n = 10^6, p = 10^4, sparsity 0.02%, κ ≈ 10^4

(g) [Figure: function relative error]

Page 73:

D2: n = 5×10^4, p = 5×10^3, dense, κ ≈ 10^6

(i) [Figure: function relative error]

Page 74:

D3: n = 10^7, p = 2×10^4, sparsity 0.006%, κ ≈ 10^10

(k) [Figure: function relative error]

Page 75:

Newton GPU vs. TensorFlow

Data: Cover Type, n = 4.5×10^5, d = 378

[Figure: test accuracy vs. time (seconds), and training objective vs. time (seconds). Step sizes: Adam 1.00e+01, Adadelta 1.00e+04, Adagrad 1.00e+03, RMSProp 1.00e+01, Momentum 1.00e-02]

Page 76:

Newton GPU vs. TensorFlow

Data: Newsgroup20, n = 10^4, d = 10^6

[Figure: test accuracy vs. time (seconds), and training objective vs. time (seconds). Step sizes: Adam 1.00e-01, Adadelta 1.00e+02, Adagrad 1.00e+00, RMSProp 1.00e-01, Momentum 1.00e-03]

Page 77:

Figure: Skew Param = 0

Page 78:

Figure: Skew Param = 2

Page 79:

Figure: Skew Param = 4

Page 80:

Figure: Skew Param = 6

Page 81:

Figure: Skew Param = 8

Page 82:

Figure: Skew Param = 9

Page 83:

Numerical Examples: Deep Learning

Dataset | Size   | Network                 | # parameters
curves  | 20,000 | 784-400-200-100-50-25-6 | 842,340
Cifar10 | 50,000 | ResNet18                | 270,896

Page 84:

Deep Auto-Encoder

[Figure: training loss vs. # of props; Autoencoder: curves]

Figure: Random Initialization

Page 85:

Deep Auto-Encoder

[Figure: test error vs. # of props; Autoencoder: curves]

Figure: Random Initialization

Page 86:

Deep Auto-Encoder

[Figure: training loss vs. # of props; Autoencoder: curves]

Figure: Zero Initialization

Page 87:

Deep Auto-Encoder

[Figure: test error vs. # of props; Autoencoder: curves]

Figure: Zero Initialization

Page 88:

Deep Auto-Encoder

[Figure: training loss vs. # of props; Autoencoder: curves. Legend: TR Uniform (∆0 = 500, 1200, 3000), Momentum SGD (α = 0.001, 0.01, 0.1, 0.5), GN Uniform]

Figure: Scaled Random Initialization

Page 89:

Deep Auto-Encoder

[Figure: test error vs. # of props; Autoencoder: curves]

Figure: Scaled Random Initialization

Page 90:

Is it all rosy?

Page 91:

ResNet18

Page 92:

ResNet18

Page 93:

Worst Case Complexity

My bone to pick with worst-case complexity results!!!

Page 94:

Worst Case Complexity

Page 95:

Worst Case Complexity

Page 96:

Worst Case Complexity

Page 97:

Worst Case Complexity

Q: What do "Newton's method" and "air travel" have in common?

A: Both are very fast, but their worst case is bad!!!

Page 98:

THANK YOU!

Page 99:

References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

Berahas, A. S., Nocedal, J., and Takac, M. (2016). A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063.

Berahas, A. S. and Takac, M. (2017). A robust multi-batch L-BFGS method for machine learning. arXiv preprint arXiv:1707.08552.

Botev, A., Ritter, H., and Barber, D. (2017). Practical Gauss-Newton optimisation for deep learning. arXiv preprint arXiv:1706.03662.

Byrd, R. H., Hansen, S., Nocedal, J., and Singer, Y. (2014). A stochastic quasi-Newton method for large-scale optimization. arXiv preprint arXiv:1401.7020.

Cartis, C., Gould, N. I., and Toint, P. L. (2012). Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108.

Chapelle, O. and Erhan, D. (2011). Improved preconditioner for Hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 201.

Dai, Y.-H. (2002). Convergence properties of the BFGS algoritm. SIAM Journal on Optimization, 13(3):693–701.

Doel, K. v. d. and Ascher, U. (2012). Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements. SIAM Journal on Scientific Computing, 34. DOI: 10.1137/110826692.


Grosse, R. and Salakhudinov, R. (2015). Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In International Conference on Machine Learning, pages 2304–2313.

Haber, E., Ascher, U. M., and Oldenburg, D. (2000). On optimization techniques for solving nonlinear inverse problems. Inverse Problems, 16(5):1263.

Haber, E., Chung, M., and Herrmann, F. (2012). An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM Journal on Optimization, 22(3):739–757.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.


Martens, J. (2010). Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742.

Martens, J. (2014). New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.

Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417.

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer.


Mascarenhas, W. F. (2004). The BFGS method with exact line searches fails for non-convex objective functions. Mathematical Programming, 99(1):49–61.

Mokhtari, A. and Ribeiro, A. (2014). RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing, 62(23):6089–6104.

Moritz, P., Nishihara, R., and Jordan, M. I. (2015). A linearly-convergent stochastic L-BFGS algorithm. arXiv preprint arXiv:1508.02087.

Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.

Roosta, F. and Ascher, U. (2015). Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212.


Roosta, F., Szekely, G. J., and Ascher, U. (2015). Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA Journal on Uncertainty Quantification, 3(1):61–90.

Roosta, F., van den Doel, K., and Ascher, U. (2014a). Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electronic Transactions on Numerical Analysis, 42:177–196.

Roosta, F., van den Doel, K., and Ascher, U. (2014b). Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM Journal on Scientific Computing, 36(5):S3–S22.


Song, Y. and Ermon, S. (2018). Accelerating natural gradient with higher-order invariance. arXiv preprint arXiv:1803.01273.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5279–5288.

Xu, P., Roosta, F., and Mahoney, M. W. (2017). Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164.