Transcript of: Shadowing Properties of Optimization Algorithms
(slides: people.inf.ethz.ch/orvietoa/shadowing_slides.pdf)

Page 1:

Shadowing Properties of Optimization Algorithms

Antonio Orvieto, Aurelien Lucchi

Accepted at NeurIPS 2019

Page 2:

Unconstrained non-convex optimization

For some regular f : R^d → R, find x* := argmin_{x ∈ R^d} f(x).

[Figure: training loss of ResNet-110 with no skip connections, on CIFAR-10 (for more details, check [Li et al., 2018])]

Usual Assumption: f is L-smooth, i.e. ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

A recent trend is to model the dynamics of iterative gradient-based optimization algorithms with differential equations.


Page 3:

How is an ODE model constructed?

x_{k+1} = x_k − h∇f(x_k)   (GD)

Define the curve y(t) as a smooth interpolation of the iterates: y(kh) = x_k.

[Figure: GD iterates x_0, x_1, ..., x_5 and the interpolating curve sampled at y(0), y(h), ..., y(5h)]

What is the law for y(t)?
1) by construction: y(t + h) = y(t) − h∇f(y(t));
2) thanks to smoothness: y(t + h) = y(t) + h ẏ(t) + O(h²).

In the limit h → 0:   ẏ(t) = −∇f(y(t))   (GD-ODE)

*The solution y ∈ C¹(R_+, R^d) exists and is unique since ∇f is globally Lipschitz.
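To make the h → 0 reasoning concrete, here is a minimal numerical sketch (not from the slides; the quadratic test function, the step size h, and the horizon K are my own arbitrary choices). It compares the GD iterates x_k with samples y(kh) of the GD-ODE flow, the latter approximated by Euler steps with a much finer step size.

```python
import numpy as np

# Toy smooth objective f(x) = 0.5 * x^T A x (hypothetical test case), with grad(x) = A x.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda x: A @ x

h, K = 0.1, 50                      # GD step size and number of iterations
x = np.array([1.0, -1.0])
gd_iterates = [x.copy()]
for _ in range(K):                  # (GD): x_{k+1} = x_k - h * grad(x_k)
    x = x - h * grad(x)
    gd_iterates.append(x.copy())

# Approximate the GD-ODE flow y' = -grad(y) with a much finer Euler step,
# recording y(kh) at the same times as the GD iterates.
sub = 1000                          # fine sub-steps per GD step
y = np.array([1.0, -1.0])
ode_samples = [y.copy()]
for _ in range(K):
    for _ in range(sub):
        y = y - (h / sub) * grad(y)
    ode_samples.append(y.copy())

gap = max(np.linalg.norm(a - b) for a, b in zip(gd_iterates, ode_samples))
print(f"max_k ||x_k - y(kh)|| ≈ {gap:.4f}")
```

On this well-conditioned toy run the gap stays of order h for all k, which is the kind of behavior the rest of the talk makes precise.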

Page 4:

Some important ODE/SDE models

For each algorithm: the discrete update, its ODE model, and the perturbed model (stochastic gradients).

(GD): x_{k+1} = x_k − h∇f(x_k)
(GD-ODE): Ẋ = −∇f(X)
(GD-SDE): dY = −∇f(Y) dt + σ dB   [Mertikopoulos and Staudigl, 2016]

(HB) [Polyak, 1964]: x_{k+1} = x_k + β(x_k − x_{k−1}) − h∇f(x_k)
(HB-ODE) [Polyak, 1964]: ÿ = −α ẏ − ∇f(y), or equivalently { v̇ = −α v − ∇f(y), ẏ = v }
(HB-SDE) [Orvieto et al., 2019]: { dV = −α V dt − ∇f(Y) dt + σ dB, dY = V dt }

(NAG) [Nesterov, 1983]: { x_{k+1} = u_k − h∇f(u_k), u_{k+1} = x_{k+1} + (k/(k+3)) (x_{k+1} − x_k) }
(NAG-ODE) [Su et al., 2016]: ÿ = −(3/t) ẏ − ∇f(y), or equivalently { v̇ = −(3/t) v − ∇f(y), ẏ = v }
(NAG-SDE) [Krichene and Bartlett, 2017]: { dV = −(3/t) V dt − ∇f(Y) dt + σ dB, dY = V dt }

and many more: primal-dual algorithms, adaptive methods, etc.
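For concreteness, a minimal sketch (mine, not from the slides) of the three discrete updates above, assuming a generic gradient oracle `grad` and placeholder hyperparameters:

```python
import numpy as np

def gd_step(x, grad, h):
    """(GD): x_{k+1} = x_k - h * grad(x_k)."""
    return x - h * grad(x)

def hb_step(x, x_prev, grad, h, beta):
    """(HB): x_{k+1} = x_k + beta * (x_k - x_{k-1}) - h * grad(x_k)."""
    return x + beta * (x - x_prev) - h * grad(x)

def nag_step(x, u, k, grad, h):
    """(NAG): x_{k+1} = u_k - h * grad(u_k);  u_{k+1} = x_{k+1} + k/(k+3) * (x_{k+1} - x_k)."""
    x_new = u - h * grad(u)
    u_new = x_new + (k / (k + 3)) * (x_new - x)
    return x_new, u_new

# Tiny usage example on the placeholder objective f(x) = 0.5 * ||x||^2, so grad(x) = x.
grad = lambda x: x
x = u = np.ones(2)
for k in range(100):
    x, u = nag_step(x, u, k, grad, h=0.1)
print(np.linalg.norm(x))   # close to 0, the minimizer
```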


Page 5:

Why should we care? Do we really need to introduce these objects? What's the gain?

- GD-ODE/SDE is the basis for many seminal contributions to the theory of SGD:
  1. asymptotic behavior [Kushner and Yin, 2003];
  2. connection to Bayesian inference [Mandt et al., 2017];
  3. generalization, width of minima [Jastrzebski et al., 2017].

- NAG-ODE recently provided us with some novel insights into the acceleration phenomenon¹:
  1. [Su et al., 2016] studied acceleration with Bessel functions;
  2. [Wibisono et al., 2016] connected NAG to meta-learning and physics via the minimum action principle;
  3. [Krichene and Bartlett, 2017] studied the non-trivial interplay between noise and acceleration in NAG using stochastic analysis on the NAG-SDE;
  4. [Orvieto et al., 2019] showed NAG is equivalent to a linear gradient averaging system after the time-stretch τ = t²/8.

¹ For convex functions, there is a method (NAG) that is strictly faster than GD.

Page 6:

All in all, ODE models seem to be useful. But how can we motivate this success?

Common answer: well, GD is just forward Euler numerical integration (with step size h) applied to GD-ODE:

ẏ = −∇f(y)   → (explicit Euler) →   x_{k+1} = x_k − h∇f(x_k),

hence y(kh) ≈ x_k if h is small.

This explanation is not good enough! And this is well known in numerical analysis...

ẏ = ( 0  1 ; −1  0 ) y,   which is stable;

explicit Euler:   x_{k+1} = x_k + h ( 0  1 ; −1  0 ) x_k = ( 1  h ; −h  1 ) x_k,   which is unstable;

i.e. the true system and its discretization are asymptotically extremely different.

This seems like bad news for optimization...
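To see the instability numerically, a short sketch (mine; the step size and horizon are arbitrary) of explicit Euler applied to the rotation field above, whose true solutions keep a constant norm:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # the skew-symmetric field from the example
h, K = 0.1, 500
x = np.array([1.0, 0.0])
for _ in range(K):
    x = x + h * (A @ x)                   # explicit Euler: x_{k+1} = (I + h A) x_k

# The exact solution keeps norm 1; the Euler iterates grow like (1 + h^2)^(K/2).
print(np.linalg.norm(x), (1 + h**2) ** (K / 2))   # both ≈ 12.0 instead of 1.0
```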

Page 7:

[Figure: image adapted from [Hairer et al., 2006]]

Page 8:

Wait, this makes no sense! How is it then possible that these models are useful?! The only reasonable explanation is that this bad situation is not realistic in our setting.

The purpose of this presentation is to convince you that this is the case, i.e.

[Two figures: discrete iterates x_0, ..., x_5 plotted against ODE points y_0, ..., y_5 in two scenarios]

no chaos,

but shadowing!

- Here, we tackle the problem from a numerical-analysis perspective, using tools from Anosov's and Smale's theory of hyperbolicity (from the 1960s).

- We have another paper at NeurIPS 2019 that approaches the problem from an algebraic perspective (arXiv:1810.02565).
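In code, the property sketched above (two trajectories staying within some ε of each other at every index) can be checked with a tiny helper; this is my own illustration, not the paper's formal definition of shadowing:

```python
import numpy as np

def is_eps_shadow(xs, ys, eps):
    """True if max_k ||x_k - y_k|| <= eps for two equally long trajectories."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    return float(np.max(np.linalg.norm(xs - ys, axis=-1))) <= eps

# e.g. is_eps_shadow(gd_iterates, ode_samples, 0.1) with the lists from the earlier sketch.
```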

Page 9:

So, why should this happen?

No clear signal: errors propagate. Clear signal: errors get corrected.

In particular, this happens when we get close to a minimum.


Page 10:

A short geometric proof

We want to show that, if x_0 = y(0), there exists ε such that ‖x_k − y(kh)‖ ≤ ε for all k. Let's call Ψ the map defined by one GD iteration.

[Figure: one step starting from x_k and y_k, comparing Ψ(x_k), Ψ(y_k), and the ODE point y_{k+1}]

Assumptions:
- f is L-smooth;
- ‖∇f‖ ≤ ℓ;
- f is µ-strongly convex.

Result: GD-ODE is an ε-shadow of GD, with ε = hℓL / (2µ).
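A rough numerical sanity check of this bound on my own toy setup (a diagonal quadratic; ℓ is taken as the gradient-norm bound along the trajectory, since the gradient of a quadratic is not globally bounded):

```python
import numpy as np

A = np.diag([4.0, 1.0])                 # f(x) = 0.5 x^T A x:  L = 4, mu = 1
L, mu = 4.0, 1.0
grad = lambda x: A @ x

h, K = 0.05, 200
x0 = np.array([1.0, 1.0])
ell = np.linalg.norm(grad(x0))          # ||grad f|| only decreases along both trajectories here

# GD iterates and exact gradient-flow samples y(kh) = exp(-A k h) x0 (A is diagonal).
xs, x = [x0], x0
for _ in range(K):
    x = x - h * grad(x)
    xs.append(x)
ys = [np.exp(-np.diag(A) * k * h) * x0 for k in range(K + 1)]

gap = max(np.linalg.norm(a - b) for a, b in zip(xs, ys))
print(f"max gap = {gap:.3f}  <=  bound h*ell*L/(2*mu) = {h * ell * L / (2 * mu):.3f}")
```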


Page 11:

What happens if the function is non-convex?


Page 12:

Solution: slightly perturb the initial condition! (idea in [Anosov, 1967])

Last question: can we glue these two landscapes and get a more general result?

[Figure: contour plot of the loss landscape (a ResNet-110 from above); source: [Li et al., 2018]]

Page 13:

Experiment on learning halfspaces

Problem: binary classification of MNIST digits 3 and 5.

We take n = 10000 training images {(a_i, l_i)}_{i=1}^n;

we use the regularized sigmoid loss (non-convex):

f(x) = (λ/2) ‖x‖² + (1/n) Σ_{i=1}^n φ(⟨a_i, x⟩ l_i),   with   φ(t) = 1 / (1 + e^t).
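A minimal sketch of this loss and its gradient (mine, not the paper's code), using synthetic data as a stand-in for the MNIST 3-vs-5 images and a placeholder regularization λ:

```python
import numpy as np
from scipy.special import expit   # expit(t) = 1 / (1 + exp(-t)), numerically stable

def f(x, A, l, lam):
    """Regularized sigmoid loss: lam/2 * ||x||^2 + mean_i phi(<a_i, x> * l_i), phi(t) = 1/(1+e^t)."""
    margins = (A @ x) * l
    return 0.5 * lam * np.dot(x, x) + np.mean(expit(-margins))      # phi(t) = expit(-t)

def grad_f(x, A, l, lam):
    """Gradient of f above, using phi'(t) = -expit(t) * expit(-t)."""
    margins = (A @ x) * l
    phi_prime = -expit(margins) * expit(-margins)
    return lam * x + A.T @ (phi_prime * l) / len(l)

# Synthetic stand-in data (placeholder for the n = 10000 MNIST images).
rng = np.random.default_rng(0)
n, d, lam = 1000, 50, 0.1
A = rng.standard_normal((n, d))
l = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

x = np.zeros(d)
for _ in range(200):              # plain GD on the non-convex loss
    x = x - 0.5 * grad_f(x, A, l, lam)
print(f(x, A, l, lam))
```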


Page 14:

In the paper, you can find

- detailed proofs of the results sketched here;

- the connection to hyperbolic theory;

- proofs of similar (yet weaker) results for the heavy ball.

Thanks for your attention! :)

"Truth is much too complicated to allow anything but approximations."

– John von Neumann

"A major task of mathematics today is to harmonize the continuous and the discrete, to include them in one comprehensive mathematics, and to eliminate obscurity from both."

– E.T. Bell, Men of Mathematics, 1937


Page 15:

References


Page 16:

Anosov, D. V. (1967). Geodesic flows on closed Riemannian manifolds of negative curvature. Trudy Matematicheskogo Instituta Imeni V. A. Steklova, 90:3–210.

Hairer, E., Lubich, C., and Wanner, G. (2006). Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, volume 31. Springer Science & Business Media.

Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2017). Three factors influencing minima in SGD. ArXiv e-prints.

Krichene, W. and Bartlett, P. L. (2017). Acceleration and averaging in stochastic mirror descent dynamics. ArXiv e-prints.

Page 17:

Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399.

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907.

Mertikopoulos, P. and Staudigl, M. (2016). On the convergence of gradient-like flows with noisy gradient input. ArXiv e-prints.

Page 18:

Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Orvieto, A., Kohler, J., and Lucchi, A. (2019). The role of memory in stochastic optimization. arXiv preprint arXiv:1907.01678.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.

Su, W., Boyd, S., and Candes, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43.

Page 19:

Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. ArXiv e-prints.
