
Page 1.

A random perturbation approach to some stochastic approximation algorithms in optimization.

Wenqing Hu. 1

(Presentation based on joint works with Chris Junchi Li 2, Weijie Su 3, Haoyi Xiong 4.)

1. Department of Mathematics and Statistics, Missouri S&T.
2. Department of Operations Research and Financial Engineering, Princeton University.
3. Department of Statistics, University of Pennsylvania.
4. Department of Computer Science, Missouri S&T.

Page 2.

Stochastic Optimization.

- Stochastic optimization problem:

  min_{x ∈ U ⊂ R^d} F(x).

- F(x) ≡ E[F(x; ζ)].

- The index random variable ζ follows some prescribed distribution D.

Page 3.

Stochastic Gradient Descent Algorithm.

- We aim to find a local minimum point x∗ of the expectation function F(x) ≡ E[F(x; ζ)]:

  x∗ = arg min_{x ∈ U ⊂ R^d} E[F(x; ζ)].

- We consider a general nonconvex stochastic loss function F(x; ζ) that is twice differentiable with respect to x. Together with some additional regularity assumptions 5, this guarantees that ∇E[F(x; ζ)] = E[∇F(x; ζ)].

5. Such as control of the growth of the gradient in expectation.

Page 4.

Stochastic Gradient Descent Algorithm.

- Gradient Descent (GD) has the iteration

  x^(t) = x^(t−1) − η∇F(x^(t−1)),

  or, in other words,

  x^(t) = x^(t−1) − ηE[∇F(x^(t−1); ζ)].

- Stochastic Gradient Descent (SGD) has the iteration

  x^(t) = x^(t−1) − η∇F(x^(t−1); ζ_t),

  where {ζ_t} are i.i.d. random variables that have the same distribution as ζ ∼ D.
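
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of one SGD run, assuming a toy least-squares loss F(x; ζ) = ½(a·x − b)² with ζ = (a, b); the data model and step size are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 5, 0.1
x_true = rng.normal(size=d)        # hypothetical "ground truth" weights

def grad_F(x, zeta):
    # Stochastic gradient of F(x; zeta) = 0.5*(a @ x - b)**2, with zeta = (a, b).
    a, b = zeta
    return (a @ x - b) * a

x = np.zeros(d)                    # x^(0)
for t in range(1000):
    a = rng.normal(size=d)                     # draw zeta_t = (a_t, b_t) i.i.d. from D
    b = a @ x_true + 0.01 * rng.normal()
    x = x - eta * grad_F(x, (a, b))            # x^(t) = x^(t-1) - eta * grad F(x^(t-1); zeta_t)

print(np.linalg.norm(x - x_true))  # close to zero after enough iterations
```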

Page 5.

Machine Learning Background : DNN.

- Deep Neural Network (DNN). The goal is to solve the stochastic optimization problem

  min_{x ∈ R^d} F(x) ≡ (1/M) Σ_{i=1}^M F_i(x),

  where each component F_i corresponds to the loss function for data point i ∈ {1, ..., M}, and x is the vector of weights being optimized.

Page 6.

Machine Learning Background : DNN.

- Let B be a minibatch of prescribed size, sampled uniformly from {1, ..., M}; then the objective function can be further written as the expectation of a stochastic function

  (1/M) Σ_{i=1}^M F_i(x) = E_B [ (1/|B|) Σ_{i ∈ B} F_i(x) ] ≡ E_B F(x; B).

- SGD updates as

  x^(t) = x^(t−1) − η (1/|B_t|) Σ_{i ∈ B_t} ∇F_i(x^(t−1)),

  which is the classical mini-batch version of SGD.
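
Continuing the illustrative least-squares setup from the previous sketch (the data matrix A, labels y, and batch size are assumptions for illustration), the mini-batch update reads:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, eta, batch = 200, 5, 0.1, 16
A = rng.normal(size=(M, d))        # row i plays the role of data point i
y = A @ rng.normal(size=d)         # labels from a planted linear model
x = np.zeros(d)

for t in range(500):
    B = rng.choice(M, size=batch, replace=False)       # minibatch B_t, uniform over {1, ..., M}
    grad = A[B].T @ (A[B] @ x - y[B]) / batch          # (1/|B_t|) * sum_{i in B_t} grad F_i(x)
    x = x - eta * grad                                 # mini-batch SGD step

print(np.linalg.norm(A @ x - y))   # residual should be near zero
```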

Page 7.

Machine Learning Background : Online learning.

- Online learning via SGD: the standard sequential prediction problem, where for i = 1, 2, ...
  1. An unlabeled example a_i arrives;
  2. We make a prediction b_i based on the current weights x_i = [x_i^1, ..., x_i^d] ∈ R^d;
  3. We observe b_i, let ζ_i = (a_i, b_i), and incur some known loss F(x_i; ζ_i) which is convex in x_i;
  4. We update the weights according to some rule x_{i+1} ← S(x_i).

- SGD rule:

  S(x_i) = x_i − η∇_1 F(x_i; ζ_i),

  where ∇_1 F(u; v) is a subgradient of F(u; v) with respect to the first variable u, and the parameter η > 0 is often referred to as the learning rate.
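
A minimal sketch of this online loop, assuming a squared loss and a synthetic stream (the data-generating model is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta = 3, 0.05
w_star = rng.normal(size=d)       # hidden model generating the stream (assumption)
x = np.zeros(d)                   # current weights x_i

for i in range(2000):
    a = rng.normal(size=d)        # 1. unlabeled example a_i arrives
    b_hat = a @ x                 # 2. prediction from the current weights
    b = a @ w_star                # 3. observe b_i and set zeta_i = (a_i, b_i)
    grad1 = (b_hat - b) * a       #    (sub)gradient of F(x; zeta) = 0.5*(a @ x - b)**2 in x
    x = x - eta * grad1           # 4. update rule S(x_i) = x_i - eta * grad1

print(np.linalg.norm(x - w_star))  # small after enough rounds
```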

Page 8.

Closer look at SGD.

- SGD:

  x^(t) = x^(t−1) − η∇F(x^(t−1); ζ_t).

- F(x) ≡ EF(x; ζ), in which ζ ∼ D.

- Let

  e_t = ∇F(x^(t−1); ζ_t) − ∇F(x^(t−1)),

  so we can rewrite the SGD as

  x^(t) = x^(t−1) − η(∇F(x^(t−1)) + e_t).

Page 9.

Local statistical characteristics of SGD path.

- In difference form it reads

  x^(t) − x^(t−1) = −η∇F(x^(t−1)) − ηe_t.

- We see that

  E[x^(t) − x^(t−1) | x^(t−1)] = −η∇F(x^(t−1))

  and

  [Cov(x^(t) − x^(t−1) | x^(t−1))]^{1/2} = η[Cov(∇F(x^(t−1); ζ))]^{1/2}.
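
These identities follow from the definition of e_t; a short derivation, added here for completeness:

```latex
% Since the \zeta_t are i.i.d., e_t has conditional mean zero given x^{(t-1)}:
\mathbb{E}\left[x^{(t)} - x^{(t-1)} \mid x^{(t-1)}\right]
  = -\eta \nabla F(x^{(t-1)}) - \eta\,\mathbb{E}\left[e_t \mid x^{(t-1)}\right]
  = -\eta \nabla F(x^{(t-1)}).
% The deterministic drift does not contribute to the conditional covariance,
% and e_t = \nabla F(x^{(t-1)}; \zeta_t) - \nabla F(x^{(t-1)}), so
\operatorname{Cov}\left(x^{(t)} - x^{(t-1)} \mid x^{(t-1)}\right)
  = \eta^2 \operatorname{Cov}\left(\nabla F(x^{(t-1)}; \zeta)\right).
```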

Page 10.

SGD approximating diffusion process.

- Very roughly speaking, we can approximate x^(t) by a diffusion process X_t driven by the stochastic differential equation

  dX_t = −η∇F(X_t)dt + ησ(X_t)dW_t, X_0 = x^(0),

  where σ(x) = [Cov(∇F(x; ζ))]^{1/2} and W_t is a standard Brownian motion in R^d.

- Slogan: a continuous Markov process is characterized by its local statistical characteristics only through the first and second moments (conditional mean and (co)variance).
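
A minimal Euler–Maruyama sketch of this approximating SDE; the objective F(x) = ½|x|² and the constant diffusion σ(x) = I are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
d, eta, dt, T = 2, 0.1, 0.01, 50.0

def grad_F(x):
    return x                     # toy objective F(x) = 0.5 * |x|^2

def sigma(x):
    return np.eye(d)             # stand-in for [Cov(grad F(x; zeta))]^{1/2}

X = np.ones(d)                   # X_0 = x^(0)
for _ in range(int(T / dt)):
    dW = rng.normal(scale=np.sqrt(dt), size=d)
    # Euler-Maruyama step for dX = -eta * grad F(X) dt + eta * sigma(X) dW
    X = X - eta * grad_F(X) * dt + eta * sigma(X) @ dW

print(X)                         # hovers near the minimizer 0 with small random fluctuations
```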

Page 11.

Diffusion Approximation of SGD : Justification.

- Such an approximation has been justified in the weak sense in the classical literature 6 7 8.

- It can also be thought of as a normal deviation result.

- One can call the continuous process X_t the "continuous SGD" 9.

- We also have a work in this direction (Hu–Li–Li–Liu, 2018).

- I will come back to this topic by the end of the talk.

6. Benveniste, A., Métivier, M., Priouret, P., Adaptive Algorithms and Stochastic Approximations, Applications of Mathematics, 22, Springer, 1990.

7. Borkar, V. S., Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008.

8. Kushner, H. J., Yin, G., Stochastic Approximation and Recursive Algorithms and Applications, Stochastic Modeling and Applied Probability, 35, Second Edition, Springer, 2003.

9. Sirignano, J., Spiliopoulos, K., Stochastic Gradient Descent in Continuous Time, SIAM Journal on Financial Mathematics, 8(1), pp. 933–961.

Page 12.

SGD approximating diffusion process : convergence time.

- Recall that we have set F(x) = EF(x; ζ).

- Thus we can formulate the original optimization problem as

  x∗ = arg min_{x ∈ U} F(x).

- Instead of SGD, in this work let us consider its approximating diffusion process

  dX_t = −η∇F(X_t)dt + ησ(X_t)dW_t, X_0 = x^(0).

- The dynamics of X_t is used as an alternative optimization procedure to find x∗.

- Remark: if there is no noise, then

  dX_t = −η∇F(X_t)dt, X_0 = x^(0),

  is just the gradient flow S^t x^(0). Thus it approaches x∗ in the strictly convex case.

Page 13.

SGD approximating diffusion process : convergence time.

- Hitting time

  τ_η = inf{t ≥ 0 : F(X_t) ≤ F(x∗) + e}

  for some small e > 0.

- What is the asymptotic behavior of Eτ_η as η → 0?
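
This asymptotic can be probed numerically; the sketch below estimates Eτ_η by simulation for a toy convex quadratic F(x) = ½|x|² with σ = I (all choices illustrative). In this saddle-free case the hitting time grows like η⁻¹; the saddle structure discussed later contributes the extra ln(η⁻¹) factor.

```python
import numpy as np

rng = np.random.default_rng(4)
d, dt, e = 2, 0.01, 0.05                 # dimension, time step, accuracy level e

def mean_hitting_time(eta, x0, n_runs=20):
    # Average of tau_eta = inf{t : F(X_t) <= F(x*) + e} over n_runs sample paths,
    # for dX = -eta * X dt + eta * dW (i.e. F(x) = 0.5*|x|^2, x* = 0, sigma = I).
    times = []
    for _ in range(n_runs):
        X, t = x0.copy(), 0.0
        while 0.5 * X @ X > e:
            X += -eta * X * dt + eta * rng.normal(scale=np.sqrt(dt), size=d)
            t += dt
        times.append(t)
    return np.mean(times)

x0 = np.full(d, 2.0)
for eta in (0.4, 0.2, 0.1):
    print(eta, mean_hitting_time(eta, x0))   # grows roughly like 1/eta as eta decreases
```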

Page 14.

SGD approximating diffusion process : convergence time.

- Approximating diffusion process

  dX_t = −η∇F(X_t)dt + ησ(X_t)dW_t, X_0 = x^(0).

- Let Y_t = X_{t/η} (a time rescaling); then

  dY_t = −∇F(Y_t)dt + √η σ(Y_t)dW_t, Y_0 = x^(0).

  (Random perturbations of the gradient flow!)

- Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x∗) + e}

  for some small e > 0.

- τ_η = η^{−1} T^η.
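
The rescaling is a standard Brownian time change; a short verification, added for completeness:

```latex
% Set Y_t = X_{t/\eta} and substitute s = t/\eta in the SDE for X:
dY_t = -\eta \nabla F(Y_t)\,\frac{dt}{\eta} + \eta\,\sigma(Y_t)\,dW_{t/\eta}
     = -\nabla F(Y_t)\,dt + \eta\,\sigma(Y_t)\,dW_{t/\eta}.
% By Brownian scaling, \widetilde{W}_t := \sqrt{\eta}\,W_{t/\eta} is again a standard
% Brownian motion, so \eta\,dW_{t/\eta} = \sqrt{\eta}\,d\widetilde{W}_t, and
dY_t = -\nabla F(Y_t)\,dt + \sqrt{\eta}\,\sigma(Y_t)\,d\widetilde{W}_t.
% If Y_t first reaches the level set {F \le F(x^*) + e} at time T^\eta, then X_t
% does so at time T^\eta/\eta, whence \tau_\eta = \eta^{-1} T^\eta.
```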

Page 15.

Random perturbations of the gradient flow : convergence time.

- Let Y_t = X_{t/η}; then

  dY_t = −∇F(Y_t)dt + √η σ(Y_t)dW_t, Y_0 = x^(0).

- Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x∗) + e}

  for some small e > 0.

- What is the asymptotic behavior of ET^η as η → 0?

Page 16.

Where is the difficulty ?

Figure 1: Various critical points and the landscape of F .

Page 17.

Where is the difficulty ?

Figure 2: Higher order critical points.

Page 18.

Strict saddle property.

- Definition (strict saddle property) 10 11. Given fixed γ_1 > 0 and γ_2 > 0, we say a Morse function F defined on R^n satisfies the "strict saddle property" if each point x ∈ R^n belongs to one of the following: (i) |∇F(x)| ≥ γ_2 > 0; (ii) |∇F(x)| < γ_2 and λ_min(∇²F(x)) ≤ −γ_1 < 0; (iii) |∇F(x)| < γ_2 and λ_min(∇²F(x)) ≥ γ_1 > 0.

10. Ge, R., Huang, F., Jin, C., Yuan, Y., Escaping from saddle points – online stochastic gradient descent for tensor decomposition, arXiv:1503.02101v1 [cs.LG].

11. Sun, J., Qu, Q., Wright, J., When are nonconvex problems not scary? arXiv:1510.06096 [math.DC].
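
As a sanity check on the definition, the toy function F(x, y) = x⁴/4 − x²/2 + y²/2 (an illustrative choice, with a strict saddle at the origin and minima at (±1, 0)) can be classified point by point via the gradient norm and the smallest Hessian eigenvalue:

```python
import numpy as np

gamma1, gamma2 = 0.5, 0.1   # illustrative strict-saddle parameters

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])          # gradient of F(x, y) = x^4/4 - x^2/2 + y^2/2

def hessian(p):
    x, y = p
    return np.array([[3 * x**2 - 1, 0.0], [0.0, 1.0]])

def classify(p):
    if np.linalg.norm(grad(p)) >= gamma2:
        return "(i) large gradient"
    lam = np.linalg.eigvalsh(hessian(p)).min()
    if lam <= -gamma1:
        return "(ii) strict saddle region"
    if lam >= gamma1:
        return "(iii) strongly convex region"
    return "violates the strict saddle property"

for p in [(2.0, 0.0), (0.0, 0.0), (1.0, 0.0)]:
    print(p, classify(p))   # (i), (ii), (iii) respectively
```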

Page 19.

Strict saddle property.

Figure 3: Types of local landscape geometry with only strict saddles.

Page 20.

Strong saddle property.

- Definition (strong saddle property)

Let the Morse function F(•) satisfy the strict saddle property with parameters γ_1 > 0 and γ_2 > 0. We say that F(•) satisfies the "strong saddle property" if, for some γ_3 > 0 and any x ∈ R^n with ∇F(x) = 0 that satisfies (ii) in Definition 1, all eigenvalues λ_i, i = 1, 2, ..., n, of the Hessian ∇²F(x) at x are bounded away from zero in absolute value, i.e., |λ_i| ≥ γ_3 > 0 for any 1 ≤ i ≤ n.

Page 21.

Escape from saddle points.

Figure 4: Escape from a strong saddle point.

Page 22.

Escape from saddle points : behavior of process near one specific saddle.

- The problem was first studied by Kifer in 1981 12.

- Recall that

  dY_t = −∇F(Y_t)dt + √η σ(Y_t)dW_t, Y_0 = x^(0),

  is a random perturbation of the gradient flow. Kifer needs to assume further that the diffusion matrix σ(•)σ^T(•) = Cov(∇F(•; ζ)) is strictly uniformly positive definite.

- Roughly speaking, Kifer's result states that the exit from a neighborhood of a strong saddle point happens along the "most unstable" direction of the Hessian matrix, and that the exit time is asymptotically ∼ C log(η^{−1}).

12. Kifer, Y., The exit problem for small random perturbations of dynamical systems with a hyperbolic fixed point, Israel Journal of Mathematics, 40(1), pp. 74–96.

Page 23.

Escape from saddle points : behavior of process near one specific saddle.

Figure 5: Escape from a strong saddle point : Kifer’s result.

Page 24.

Escape from saddle points : behavior of process near one specific saddle.

- G ∪ ∂G = {0} ∪ A_1 ∪ A_2 ∪ A_3.

- If x ∈ A_2 ∪ A_3, then for the deterministic gradient flow there is a finite

  t(x) = inf{t > 0 : S^t x ∈ ∂G}.

- In this case, as η → 0, the expected first exit time converges to t(x), and the first exit position converges to S^{t(x)} x.

- Recall that the gradient flow S^t x^(0) is given by

  dX_t = −η∇F(X_t)dt, X_0 = x^(0).

Page 25.

Escape from saddle points : behavior of process near one specific saddle.

Figure 6: Escape from a strong saddle point : Kifer’s result.

Page 26.

Escape from saddle points : behavior of process near one specific saddle.

- If x ∈ ({0} ∪ A_1) \ ∂G, the situation is more interesting.

- The first exit occurs along the direction pointed out by the most negative eigenvalue(s) of the Hessian ∇²F(0).

- ∇²F(0) has spectrum

  −λ_1 = −λ_2 = ... = −λ_q < −λ_{q+1} ≤ ... ≤ −λ_p < 0 < λ_{p+1} ≤ ... ≤ λ_d.

- The exit time will be asymptotically (1/(2λ_1)) ln(η^{−1}) as η → 0.

- The exit position will converge (in probability) to the intersection of ∂G with the invariant manifold corresponding to −λ_1, ..., −λ_q.

- Slogan: when the process Y_t comes close to 0, it will "choose" the most unstable directions and move along them.
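
The (1/(2λ_1)) ln(η⁻¹) rate can be observed in a toy linearized simulation (all parameters illustrative): take F(y) = −½λ_1 y_1² + ½λ_2 y_2² near the saddle at 0, σ = I, and record when Y_t leaves the unit ball. The exit indeed happens along the unstable y_1 axis.

```python
import numpy as np

rng = np.random.default_rng(5)
lam1, lam2, dt, R = 1.0, 1.0, 1e-3, 1.0    # Hessian eigenvalues, time step, exit radius

def mean_exit_time(eta, n_runs=100):
    # dY = -grad F(Y) dt + sqrt(eta) dW with F(y) = -0.5*lam1*y1^2 + 0.5*lam2*y2^2,
    # started exactly at the saddle point Y_0 = 0.
    times = []
    for _ in range(n_runs):
        y, t = np.zeros(2), 0.0
        while np.linalg.norm(y) < R:
            drift = np.array([lam1 * y[0], -lam2 * y[1]])    # -grad F(y)
            y += drift * dt + np.sqrt(eta * dt) * rng.normal(size=2)
            t += dt
        times.append(t)
    return np.mean(times)

for eta in (1e-2, 1e-3, 1e-4):
    T = mean_exit_time(eta)
    print(eta, T, T / np.log(1.0 / eta))    # last column approaches 1/(2*lam1) = 0.5
```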

Page 27.

Escape from saddle points : behavior of process near one specific saddle.

Figure 7: Escape from a strong saddle point : Kifer’s result.

Page 28.

More technical aspects of Kifer’s result...

- In Kifer's result, all convergences are point-wise with respect to the initial point x; they are not uniform in the initial point x as η → 0.

- The "exit along most unstable directions" happens only when the initial point x lies exactly on A_1.

- The result does not allow small perturbations of the initial point x; one runs into messy calculations...

Page 29.

Why do these technicalities matter?

- Our landscape may have a "chain of saddles".

- Recall that

  dY_t = −∇F(Y_t)dt + √η σ(Y_t)dW_t, Y_0 = x^(0),

  is a random perturbation of the gradient flow.

Page 30.

Global landscape : chain of saddle points.

Figure 8: Chain of saddle points.

Page 31.

Linearization of the gradient flow near a strong saddle point.

- Hartman–Grobman Theorem: for any strong saddle point O that we consider, there exist an open neighborhood U of O and a C^0 homeomorphism h : U → R^d such that under h the gradient flow is mapped into a linear flow.

- Linearization Assumption 13: the homeomorphism h provided by the Hartman–Grobman theorem can be taken to be C^2.

- A sufficient condition for the validity of the C^2 linearization assumption is the so-called non-resonance condition (Sternberg linearization theorem).

- I will also come back to this topic at the end of the talk.

13. Bakhtin, Y., Noisy heteroclinic networks, Probability Theory and Related Fields, 150 (2011), no. 1–2, pp. 1–42.

Page 32.

Linearization of the gradient flow near a strong saddle point.

Figure 9: Linearization of the gradient flow near a strong saddle point.

Page 33.

Uniform version of Kifer’s exit asymptotic.

- Let U ⊂ G be an open neighborhood of the saddle point O. Take the initial point x ∈ U ∪ ∂U.

- Assume dist(U ∪ ∂U, ∂G) > 0. Let t(x) = inf{t > 0 : S^t x ∈ ∂G}.

- W_max is the invariant manifold corresponding to the most negative eigenvalues −λ_1, ..., −λ_q of ∇²F(O).

- Define Q_max = W_max ∩ ∂G.

- Set

  ∂G_{U∪∂U→out} = {S^{t(x)} x for some x ∈ U ∪ ∂U with finite t(x)} ∪ Q_max.

- For small µ > 0 let

  Q_µ = {x ∈ ∂G : dist(x, ∂G_{U∪∂U→out}) < µ}.

- τ^η_x is the first exit time to ∂G for the process Y_t starting from x.

Page 34.

Uniform version of Kifer’s exit asymptotic.

Figure 10: Uniform exit dynamics near a specific strong saddle point under the linearization assumption.

Page 35.

Uniform version of Kifer’s exit asymptotic.

- Theorem. For any r > 0, there exists some η_0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η_0 we have

  E_x τ^η_x / ln(η^{−1}) ≤ 1/(2λ_1) + r.

  For any small µ > 0 and any ρ > 0, there exists some η_0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η_0 we have

  P_x(Y_{τ^η_x} ∈ Q_µ) ≥ 1 − ρ.

Page 36.

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

- To demonstrate our analysis, we will assume that x∗ is a local minimum point of F(•) in the sense that for some open neighborhood U(x∗) of x∗ we have

  x∗ = arg min_{x ∈ U(x∗)} F(x).

- Assume that there are k strong saddle points O_1, ..., O_k in U(x∗) such that F(O_1) > F(O_2) > ... > F(O_k) > F(x∗).

- Start the process at the initial point Y_0 = x ∈ U(x∗).

- Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x∗) + e}

  for some small e > 0.

- What is the asymptotic behavior of ET^η as η → 0?

Page 37.

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

Figure 11: Sequence of stopping times.

Page 38.

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

- Standard Markov cycle type argument.

- But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book 14.

- ET^η ≲ (k/(2γ_1)) ln(η^{−1}), conditioned upon convergence.

14. Freidlin, M., Wentzell, A., Random Perturbations of Dynamical Systems, 2nd edition, Springer, 1998.

Page 39.

SGD approximating diffusion process : convergence time.

- Recall that we have set F(x) = EF(x; ζ).

- Thus we can formulate the original optimization problem as

  x∗ = arg min_{x ∈ U(x∗)} F(x).

- Instead of SGD, in this work let us consider its approximating diffusion process

  dX_t = −η∇F(X_t)dt + ησ(X_t)dW_t, X_0 = x^(0).

- The dynamics of X_t is used as a surrogate optimization procedure to find x∗.

Page 40.

SGD approximating diffusion process : convergence time.

- Hitting time

  τ_η = inf{t ≥ 0 : F(X_t) ≤ F(x∗) + e}

  for some small e > 0.

- What is the asymptotic behavior of Eτ_η as η → 0?

Page 41.

SGD approximating diffusion process : convergence time.

- Approximating diffusion process

  dX_t = −η∇F(X_t)dt + ησ(X_t)dW_t, X_0 = x^(0).

- Let Y_t = X_{t/η}; then

  dY_t = −∇F(Y_t)dt + √η σ(Y_t)dW_t, Y_0 = x^(0).

  (Random perturbations of the gradient flow!)

- Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x∗) + e}

  for some small e > 0.

- τ_η = η^{−1} T^η.

Page 42.

Convergence analysis in a basin containing a local minimum.

- Theorem.
  (i) For any small ρ > 0, with probability at least 1 − ρ, the SGD approximating diffusion process X_t converges to the minimizer x∗ for sufficiently small η after passing through all k saddle points O_1, ..., O_k;
  (ii) Consider the stopping time τ_η. Then as η ↓ 0, conditioned on the above convergence of the SGD approximating diffusion process X_t, we have

  lim_{η→0} Eτ_η / (η^{−1} ln η^{−1}) ≤ k/(2γ_1).

Page 43.

Philosophy.

discrete stochastic               ←   analysis of
optimization algorithms               convergence time
        ↕                                   ↕
diffusion                         ←   random perturbations
approximation                         of dynamical systems

Page 44.

Other Examples : Stochastic heavy–ball method.

- min_{x ∈ R^d} f(x).

- "Heavy ball method" (B. Polyak, 1964):

  x_t = x_{t−1} + √s v_t;
  v_t = (1 − α√s) v_{t−1} − √s ∇_x f(x_{t−1} + √s(1 − α√s) v_{t−1});
  x_0, v_0 ∈ R^d.

- Adding momentum leads to the second-order equation

  Ẍ(t) + αẊ(t) + ∇_X f(X(t)) = 0, X(0) ∈ R^d.

- Hamiltonian system for a dissipative nonlinear oscillator:

  dX(t) = V(t)dt, X(0) ∈ R^d;
  dV(t) = (−αV(t) − ∇_X f(X(t)))dt, V(0) ∈ R^d.
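
A minimal sketch of the heavy-ball iteration written exactly in the form above, for an illustrative quadratic f (the matrix A, step s, and friction α are assumptions):

```python
import numpy as np

s, alpha = 1e-2, 1.0               # step parameter s and friction alpha (illustrative)
A = np.diag([1.0, 10.0])           # toy quadratic f(x) = 0.5 * x^T A x

def grad_f(x):
    return A @ x

x, v = np.array([1.0, 1.0]), np.zeros(2)
rs = np.sqrt(s)
for t in range(2000):
    # v_t = (1 - alpha*sqrt(s)) v_{t-1} - sqrt(s) * grad f(x_{t-1} + sqrt(s)(1 - alpha*sqrt(s)) v_{t-1})
    v = (1 - alpha * rs) * v - rs * grad_f(x + rs * (1 - alpha * rs) * v)
    # x_t = x_{t-1} + sqrt(s) v_t
    x = x + rs * v

print(x)                           # momentum dynamics spiral into the minimizer 0
```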

Page 45.

Other Examples : Stochastic heavy–ball method.

- How to escape from saddle points of f(x)?

- Add noise → randomly perturbed Hamiltonian system for a dissipative nonlinear oscillator:

  dX^ε_t = V^ε_t dt + εσ_1(X^ε_t, V^ε_t)dW^1_t;
  dV^ε_t = (−αV^ε_t − ∇_X f(X^ε_t))dt + εσ_2(X^ε_t, V^ε_t)dW^2_t;
  X^ε_0, V^ε_0 ∈ R^d.

Page 46.

Diffusion limit of stochastic heavy–ball method : Randomly perturbed Hamiltonian system.

Figure 12: Randomly perturbed Hamiltonian system.

Page 47.

Diffusion limit of stochastic heavy–ball method : Randomly perturbed Hamiltonian system.

Figure 13: Phase Diagram of a Hamiltonian System.

Page 48.

Diffusion limit of stochastic heavy–ball method : Randomly perturbed Hamiltonian system.

Figure 14: Behavior near a saddle point.

Page 49.

Diffusion limit of stochastic heavy–ball method : Randomly perturbed Hamiltonian system.

Figure 15: Chain of saddle points in a randomly perturbed Hamiltonian system.

Page 50.

Stochastic Composite Gradient Descent.

- By introducing approximating diffusion processes, methods from stochastic dynamical systems, in particular random perturbations of dynamical systems, can be effectively used to analyze these limiting processes.

- For example, the averaging principle can be used to analyze Stochastic Composite Gradient Descent (SCGD) (Hu–Li, 2017). We skip the details here.

Page 51.

Philosophy.

discrete stochastic               ←   analysis of
optimization algorithms               convergence time
        ↕                                   ↕
diffusion                         ←   random perturbations
approximation                         of dynamical systems

Page 52.

Inspiring practical algorithm design.

- The above philosophy of "random perturbations" also inspires practical algorithm design.

- What if the objective function f(x) is itself unknown and needs to be learned?

- Work in progress with Xiong, H. from Missouri S&T Computer Science.

Page 53.

Epilogue : A few remaining problems.

- Approximating diffusion process X_t: how does it approximate x^(t)? There are various ways 15. One may need a correction term 16 from numerical SDE theory.

- Linearization assumption: typical in dynamical systems, but it should be removable at the cost of much harder work.

- Attacking the discrete iteration directly: only works for special cases such as Principal Component Analysis (PCA) 17.

15. Li, Q., Tai, C., E, W., Stochastic modified equations and adaptive stochastic gradient algorithms, arXiv:1511.06251v3.

16. Milstein, G. N., Approximate integration of stochastic differential equations, Theory of Probability and its Applications, 19(3), pp. 557–562, 1975.

17. Li, C. J., Wang, M., Liu, H., Zhang, T., Near-Optimal Stochastic Approximation for Online Principal Component Estimation, Mathematical Programming, 167(1), pp. 75–97, 2018.

Page 54.

Thank you for your attention !