Transcript of: Shadowing Properties of Optimization Algorithms
(slides: people.inf.ethz.ch/orvietoa/shadowing_slides.pdf)

Page 1:

Shadowing Properties of Optimization Algorithms

Antonio Orvieto, Aurelien Lucchi

Accepted at NeurIPS 2019

Page 2:

Unconstrained non-convex optimization

For some regular f : R^d → R, find x* := argmin_{x ∈ R^d} f(x).

[Figure: training loss of ResNet-110 with no skip connections, on CIFAR-10 (for more details, check [Li et al., 2018])]

Usual Assumption: f is L-smooth, i.e. ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

A recent trend is to model the dynamics of iterative gradient-based optimization algorithms with differential equations.


Page 3:

How is an ODE model constructed?

x_{k+1} = x_k − h∇f(x_k)   (GD)

Define the curve y(t) as a smooth interpolation of the iterates: y(kh) = x_k.

[Figure: GD iterates x_0, x_1, ..., x_5 and the interpolating curve sampled at y(0), y(h), ..., y(5h)]

What is the law for y(t)?
1) by construction: y(t + h) = y(t) − h∇f(y(t));
2) thanks to smoothness: y(t + h) = y(t) + h ẏ(t) + O(h²).

In the limit h → 0:   ẏ(t) = −∇f(y(t))   (GD-ODE)

*The solution y ∈ C¹(R_+, R^d) exists and is unique since ∇f is globally Lipschitz.
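To make the h → 0 reasoning concrete, here is a minimal numerical sketch (not from the slides; the quadratic test function, the step size h, and the horizon K are my own arbitrary choices). It compares the GD iterates x_k with samples y(kh) of the GD-ODE flow, the latter approximated by Euler steps with a much finer step size.

```python
import numpy as np

# Toy smooth objective f(x) = 0.5 * x^T A x (hypothetical test case), with grad(x) = A x.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda x: A @ x

h, K = 0.1, 50                      # GD step size and number of iterations
x = np.array([1.0, -1.0])
gd_iterates = [x.copy()]
for _ in range(K):                  # (GD): x_{k+1} = x_k - h * grad(x_k)
    x = x - h * grad(x)
    gd_iterates.append(x.copy())

# Approximate the GD-ODE flow y' = -grad(y) with a much finer Euler step,
# recording y(kh) at the same times as the GD iterates.
sub = 1000                          # fine sub-steps per GD step
y = np.array([1.0, -1.0])
ode_samples = [y.copy()]
for _ in range(K):
    for _ in range(sub):
        y = y - (h / sub) * grad(y)
    ode_samples.append(y.copy())

gap = max(np.linalg.norm(a - b) for a, b in zip(gd_iterates, ode_samples))
print(f"max_k ||x_k - y(kh)|| ≈ {gap:.4f}")
```

On this well-conditioned toy run the gap stays of order h for all k, which is the kind of behavior the rest of the talk makes precise.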

Page 4:

Some important ODE/SDE models

For each algorithm: the discrete update, its ODE model, and the perturbed model (stochastic gradients).

(GD): x_{k+1} = x_k − h∇f(x_k)
(GD-ODE): Ẋ = −∇f(X)
(GD-SDE): dY = −∇f(Y) dt + σ dB   [Mertikopoulos and Staudigl, 2016]

(HB) [Polyak, 1964]: x_{k+1} = x_k + β(x_k − x_{k−1}) − h∇f(x_k)
(HB-ODE) [Polyak, 1964]: ÿ = −α ẏ − ∇f(y), or equivalently { v̇ = −α v − ∇f(y), ẏ = v }
(HB-SDE) [Orvieto et al., 2019]: { dV = −α V dt − ∇f(Y) dt + σ dB, dY = V dt }

(NAG) [Nesterov, 1983]: { x_{k+1} = u_k − h∇f(u_k), u_{k+1} = x_{k+1} + (k/(k+3)) (x_{k+1} − x_k) }
(NAG-ODE) [Su et al., 2016]: ÿ = −(3/t) ẏ − ∇f(y), or equivalently { v̇ = −(3/t) v − ∇f(y), ẏ = v }
(NAG-SDE) [Krichene and Bartlett, 2017]: { dV = −(3/t) V dt − ∇f(Y) dt + σ dB, dY = V dt }

and many more: primal-dual algorithms, adaptive methods, etc.
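For concreteness, a minimal sketch (mine, not from the slides) of the three discrete updates above, assuming a generic gradient oracle `grad` and placeholder hyperparameters:

```python
import numpy as np

def gd_step(x, grad, h):
    """(GD): x_{k+1} = x_k - h * grad(x_k)."""
    return x - h * grad(x)

def hb_step(x, x_prev, grad, h, beta):
    """(HB): x_{k+1} = x_k + beta * (x_k - x_{k-1}) - h * grad(x_k)."""
    return x + beta * (x - x_prev) - h * grad(x)

def nag_step(x, u, k, grad, h):
    """(NAG): x_{k+1} = u_k - h * grad(u_k);  u_{k+1} = x_{k+1} + k/(k+3) * (x_{k+1} - x_k)."""
    x_new = u - h * grad(u)
    u_new = x_new + (k / (k + 3)) * (x_new - x)
    return x_new, u_new

# Tiny usage example on the placeholder objective f(x) = 0.5 * ||x||^2, so grad(x) = x.
grad = lambda x: x
x = u = np.ones(2)
for k in range(100):
    x, u = nag_step(x, u, k, grad, h=0.1)
print(np.linalg.norm(x))   # close to 0, the minimizer
```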


Page 5:

Why should we care? Do we really need to introduce these objects? What's the gain?

- GD-ODE/SDE is the basis for many seminal contributions to the theory of SGD:
  1. asymptotic behavior [Kushner and Yin, 2003];
  2. connection to Bayesian inference [Mandt et al., 2017];
  3. generalization, width of minima [Jastrzebski et al., 2017].

- NAG-ODE recently provided us with some novel insights into the acceleration phenomenon¹:
  1. [Su et al., 2016] studied acceleration with Bessel functions;
  2. [Wibisono et al., 2016] connected NAG to meta-learning and physics via the minimum action principle;
  3. [Krichene and Bartlett, 2017] studied the non-trivial interplay between noise and acceleration in NAG using stochastic analysis on the NAG-SDE;
  4. [Orvieto et al., 2019] showed NAG is equivalent to a linear gradient averaging system after the time-stretch τ = t²/8.

¹ For convex functions, there is a method (NAG) that is strictly faster than GD.

Page 6:

All in all, ODE models seem to be useful. But how can we motivate this success?

Common answer: well, GD is just forward Euler numerical integration (with step size h) applied to GD-ODE:

ẏ = −∇f(y)   → (explicit Euler) →   x_{k+1} = x_k − h∇f(x_k),

hence y(kh) ≈ x_k if h is small.

This explanation is not good enough! And this is well known in numerical analysis...

ẏ = ( 0  1 ; −1  0 ) y,   which is stable;

explicit Euler:   x_{k+1} = x_k + h ( 0  1 ; −1  0 ) x_k = ( 1  h ; −h  1 ) x_k,   which is unstable;

i.e. the true system and its discretization are asymptotically extremely different.

This seems like bad news for optimization...
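To see the instability numerically, a short sketch (mine; the step size and horizon are arbitrary) of explicit Euler applied to the rotation field above, whose true solutions keep a constant norm:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # the skew-symmetric field from the example
h, K = 0.1, 500
x = np.array([1.0, 0.0])
for _ in range(K):
    x = x + h * (A @ x)                   # explicit Euler: x_{k+1} = (I + h A) x_k

# The exact solution keeps norm 1; the Euler iterates grow like (1 + h^2)^(K/2).
print(np.linalg.norm(x), (1 + h**2) ** (K / 2))   # both ≈ 12.0 instead of 1.0
```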

Page 7:

[Figure: image adapted from [Hairer et al., 2006]]

Page 8:

Wait, this makes no sense! How is it then possible that these models are useful?! The only reasonable explanation is that this bad situation is not realistic in our setting.

The purpose of this presentation is to convince you that this is the case, i.e.

[Two figures: discrete iterates x_0, ..., x_5 plotted against ODE points y_0, ..., y_5 in two scenarios]

no chaos,

but shadowing!

- Here, we tackle the problem from a numerical-analysis perspective, using tools from Anosov's and Smale's theory of hyperbolicity (from the 1960s).

- We have another paper at NeurIPS 2019 that approaches the problem from an algebraic perspective (arXiv:1810.02565).
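In code, the property sketched above (two trajectories staying within some ε of each other at every index) can be checked with a tiny helper; this is my own illustration, not the paper's formal definition of shadowing:

```python
import numpy as np

def is_eps_shadow(xs, ys, eps):
    """True if max_k ||x_k - y_k|| <= eps for two equally long trajectories."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    return float(np.max(np.linalg.norm(xs - ys, axis=-1))) <= eps

# e.g. is_eps_shadow(gd_iterates, ode_samples, 0.1) with the lists from the earlier sketch.
```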

Page 9:

So, why should this happen?

No clear signal: errors propagate. Clear signal: errors get corrected.

In particular, this happens when we get close to a minimum.


Page 10:

A short geometric proof

We want to show that, if x_0 = y(0), there exists ε such that ‖x_k − y(kh)‖ ≤ ε for all k. Let's call Ψ the map defined by one GD iteration.

[Figure: one step starting from x_k and y_k, comparing Ψ(x_k), Ψ(y_k), and the ODE point y_{k+1}]

Assumptions:
- f is L-smooth;
- ‖∇f‖ ≤ ℓ;
- f is µ-strongly convex.

Result: GD-ODE is an ε-shadow of GD, with ε = hℓL / (2µ).
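A rough numerical sanity check of this bound on my own toy setup (a diagonal quadratic; ℓ is taken as the gradient-norm bound along the trajectory, since the gradient of a quadratic is not globally bounded):

```python
import numpy as np

A = np.diag([4.0, 1.0])                 # f(x) = 0.5 x^T A x:  L = 4, mu = 1
L, mu = 4.0, 1.0
grad = lambda x: A @ x

h, K = 0.05, 200
x0 = np.array([1.0, 1.0])
ell = np.linalg.norm(grad(x0))          # ||grad f|| only decreases along both trajectories here

# GD iterates and exact gradient-flow samples y(kh) = exp(-A k h) x0 (A is diagonal).
xs, x = [x0], x0
for _ in range(K):
    x = x - h * grad(x)
    xs.append(x)
ys = [np.exp(-np.diag(A) * k * h) * x0 for k in range(K + 1)]

gap = max(np.linalg.norm(a - b) for a, b in zip(xs, ys))
print(f"max gap = {gap:.3f}  <=  bound h*ell*L/(2*mu) = {h * ell * L / (2 * mu):.3f}")
```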


Page 11:

What happens if the function is non-convex?


Page 12:

Solution: slightly perturb the initial condition! (idea in [Anosov, 1967])

Last question: can we glue these two landscapes and get a more general result?

[Figure: contour plot of the loss landscape (a ResNet-110 from above); source: [Li et al., 2018]]

Page 13:

Experiment on learning halfspaces

Problem: binary classification of MNIST digits 3 and 5.

We take n = 10000 training images {(a_i, l_i)}_{i=1}^n;

we use the regularized sigmoid loss (non-convex):

f(x) = (λ/2) ‖x‖² + (1/n) Σ_{i=1}^n φ(⟨a_i, x⟩ l_i),   with   φ(t) = 1 / (1 + e^t).
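A minimal sketch of this loss and its gradient (mine, not the paper's code), using synthetic data as a stand-in for the MNIST 3-vs-5 images and a placeholder regularization λ:

```python
import numpy as np
from scipy.special import expit   # expit(t) = 1 / (1 + exp(-t)), numerically stable

def f(x, A, l, lam):
    """Regularized sigmoid loss: lam/2 * ||x||^2 + mean_i phi(<a_i, x> * l_i), phi(t) = 1/(1+e^t)."""
    margins = (A @ x) * l
    return 0.5 * lam * np.dot(x, x) + np.mean(expit(-margins))      # phi(t) = expit(-t)

def grad_f(x, A, l, lam):
    """Gradient of f above, using phi'(t) = -expit(t) * expit(-t)."""
    margins = (A @ x) * l
    phi_prime = -expit(margins) * expit(-margins)
    return lam * x + A.T @ (phi_prime * l) / len(l)

# Synthetic stand-in data (placeholder for the n = 10000 MNIST images).
rng = np.random.default_rng(0)
n, d, lam = 1000, 50, 0.1
A = rng.standard_normal((n, d))
l = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

x = np.zeros(d)
for _ in range(200):              # plain GD on the non-convex loss
    x = x - 0.5 * grad_f(x, A, l, lam)
print(f(x, A, l, lam))
```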


Page 14:

In the paper, you can find

- detailed proofs of the results sketched here;

- the connection to hyperbolic theory;

- proofs of similar (yet weaker) results for the heavy ball.

Thanks for your attention! :)

"Truth is much too complicated to allow anything but approximations."

– John von Neumann

"A major task of mathematics today is to harmonize the continuous and the discrete, to include them in one comprehensive mathematics, and to eliminate obscurity from both."

– E.T. Bell, Men of Mathematics, 1937


Page 15:

References


Page 16:

Anosov, D. V. (1967). Geodesic flows on closed Riemannian manifolds of negative curvature. Trudy Matematicheskogo Instituta Imeni V. A. Steklova, 90:3–210.

Hairer, E., Lubich, C., and Wanner, G. (2006). Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, volume 31. Springer Science & Business Media.

Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2017). Three factors influencing minima in SGD. ArXiv e-prints.

Krichene, W. and Bartlett, P. L. (2017). Acceleration and averaging in stochastic mirror descent dynamics. ArXiv e-prints.

Page 17:

Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399.

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907.

Mertikopoulos, P. and Staudigl, M. (2016). On the convergence of gradient-like flows with noisy gradient input. ArXiv e-prints.

Page 18:

Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Orvieto, A., Kohler, J., and Lucchi, A. (2019). The role of memory in stochastic optimization. arXiv preprint arXiv:1907.01678.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.

Su, W., Boyd, S., and Candes, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43.

Page 19:

Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. ArXiv e-prints.
