A random perturbation approach to some stochastic approximation algorithms in optimization.
Wenqing Hu. 1
(Presentation based on joint works with Chris Junchi Li 2, Weijie Su 3, Haoyi Xiong 4.)
1. Department of Mathematics and Statistics, Missouri S&T. 2. Department of Operations Research and Financial Engineering, Princeton University. 3. Department of Statistics, University of Pennsylvania. 4. Department of Computer Science, Missouri S&T.
Stochastic Optimization.
I Stochastic Optimization Problem

min_{x ∈ U ⊂ R^d} F(x) .

I

F(x) ≡ E[F(x; ζ)] .

I Index random variable ζ follows some prescribed distribution D.
Stochastic Gradient Descent Algorithm.
I We aim to find a local minimum point x* of the expectation function F(x) ≡ E[F(x; ζ)]:

x* = arg min_{x ∈ U ⊂ R^d} E[F(x; ζ)] .

I We consider a general nonconvex stochastic loss function F(x; ζ) that is twice differentiable with respect to x. Together with some additional regularity assumptions 5, this guarantees that ∇E[F(x; ζ)] = E[∇F(x; ζ)].
5. Such as control of the growth of the gradient in expectation.
Stochastic Gradient Descent Algorithm.
I Gradient Descent (GD) has iteration:

x^(t) = x^(t−1) − η ∇F(x^(t−1)) ,

or in other words

x^(t) = x^(t−1) − η E[∇F(x^(t−1); ζ)] .

I Stochastic Gradient Descent (SGD) has iteration:

x^(t) = x^(t−1) − η ∇F(x^(t−1); ζ_t) ,

where {ζ_t} are i.i.d. random variables that have the same distribution as ζ ∼ D.
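To make the iteration concrete, here is a minimal NumPy sketch (my own toy example, not from the talk): take F(x; ζ) = ½‖x − ζ‖² with ζ ∼ N(x*, I), so that F(x) = E[F(x; ζ)] is minimized at x* and the stochastic gradient is ∇F(x; ζ) = x − ζ.

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0])    # minimizer of F(x) = E F(x; zeta)

def grad_F(x, zeta):
    # stochastic gradient of F(x; zeta) = 0.5 * ||x - zeta||^2
    return x - zeta

eta = 0.05                        # learning rate
x = np.zeros(2)                   # x^(0)
for t in range(2000):
    zeta = x_star + rng.standard_normal(2)   # zeta_t ~ D, with mean x_star
    x = x - eta * grad_F(x, zeta)            # SGD iteration

print(np.linalg.norm(x - x_star))
```

The iterate does not converge to x* exactly: it fluctuates in a neighborhood of x* whose size scales with √η, which is exactly the noise scale that the diffusion approximation below captures.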
Machine Learning Background : DNN.
I Deep Neural Network (DNN). The goal is to solve the following stochastic optimization problem

min_{x ∈ R^d} F(x) ≡ (1/M) Σ_{i=1}^M F_i(x)

where each component F_i corresponds to the loss function for data point i ∈ {1, ..., M}, and x is the vector of weights being optimized.
Machine Learning Background : DNN.
I Let B be the minibatch of prescribed size uniformly sampled from {1, ..., M}; then the objective function can be further written as the expectation of a stochastic function

(1/M) Σ_{i=1}^M F_i(x) = E_B [ (1/|B|) Σ_{i∈B} F_i(x) ] ≡ E_B F(x; B) .
I SGD updates as

x^(t) = x^(t−1) − η (1/|B_t|) Σ_{i∈B_t} ∇F_i(x^(t−1)) ,

which is the classical mini-batch version of SGD.
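A minimal sketch of this mini-batch update, assuming a synthetic least-squares model F_i(x) = ½(a_i · x − b_i)² with noiseless data (the data and parameter values are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 200, 3
A = rng.standard_normal((M, d))
x_true = np.array([1.0, 2.0, -1.0])
b = A @ x_true                    # noiseless least-squares data

def minibatch_grad(x, idx):
    # gradient of (1/|B|) * sum_{i in B} 0.5 * (a_i . x - b_i)^2
    Ab = A[idx]
    return Ab.T @ (Ab @ x - b[idx]) / len(idx)

eta, batch = 0.05, 16
x = np.zeros(d)
for t in range(3000):
    idx = rng.choice(M, size=batch, replace=False)   # minibatch B_t
    x = x - eta * minibatch_grad(x, idx)

print(np.linalg.norm(x - x_true))
```

Because the data here are noiseless, every component gradient vanishes at x_true, so the mini-batch iteration converges to the exact minimizer.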
Machine Learning Background : Online learning.
I Online learning via SGD: a standard sequential prediction problem where, for i = 1, 2, ...
1. An unlabeled example a_i arrives;
2. We make a prediction b_i based on the current weights x_i = [x_i^1, ..., x_i^d] ∈ R^d;
3. We observe b_i, let ζ_i = (a_i, b_i), and incur some known loss F(x_i; ζ_i) which is convex in x_i;
4. We update the weights according to some rule x_{i+1} ← S(x_i).

I SGD rule:

S(x_i) = x_i − η ∇_1 F(x_i; ζ_i) ,

where ∇_1 F(u; v) is a subgradient of F(u; v) with respect to the first variable u, and the parameter η > 0 is often referred to as the learning rate.
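The four steps above can be sketched for online least-squares prediction (a toy setup of my own, with noiseless labels and an illustrative true weight vector):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
w_true = np.array([0.5, -1.0, 2.0])  # hypothetical ground-truth weights
eta = 0.1
x = np.zeros(d)                      # current weights x_i

losses = []
for i in range(2000):
    a = rng.standard_normal(d)       # 1. unlabeled example a_i arrives
    pred = a @ x                     # 2. predict based on current weights x_i
    b = a @ w_true                   # 3. observe the label b_i; zeta_i = (a_i, b_i)
    losses.append(0.5 * (pred - b) ** 2)   # incurred loss F(x_i; zeta_i), convex in x_i
    x = x - eta * (pred - b) * a     # 4. SGD update rule S(x_i)

print(np.mean(losses[-100:]))
```

For the squared loss the subgradient ∇_1 F(x_i; ζ_i) is just (a_i · x_i − b_i) a_i, which is what the update uses.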
Closer look at SGD.
I SGD:

x^(t) = x^(t−1) − η ∇F(x^(t−1); ζ_t) .

I F(x) ≡ E F(x; ζ), in which ζ ∼ D.

I Let

e_t = ∇F(x^(t−1); ζ_t) − ∇F(x^(t−1))

and we can rewrite the SGD as

x^(t) = x^(t−1) − η (∇F(x^(t−1)) + e_t) .
Local statistical characteristics of SGD path.
I In difference form it looks like

x^(t) − x^(t−1) = −η ∇F(x^(t−1)) − η e_t .

I We see that

E(x^(t) − x^(t−1) | x^(t−1)) = −η ∇F(x^(t−1))

and

[Cov(x^(t) − x^(t−1) | x^(t−1))]^{1/2} = η [Cov(∇F(x^(t−1); ζ))]^{1/2} .
SGD approximating diffusion process.
I Very roughly speaking, we can approximate x^(t) by a diffusion process X_t driven by the stochastic differential equation

dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) ,

where σ(x) = [Cov(∇F(x; ζ))]^{1/2} and W_t is a standard Brownian motion in R^d.

I Slogan: continuous Markov processes are characterized by their local statistical characteristics in only the first and second moments (conditional mean and (co)variance).
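The approximating SDE can be simulated with an Euler–Maruyama scheme. The sketch below is my own illustration, assuming for simplicity a quadratic F(x) = ½‖x‖² and a constant diffusion σ(x) ≡ 0.5·I (the slides allow a general state-dependent σ):

```python
import numpy as np

rng = np.random.default_rng(3)
eta, dt, T = 0.1, 0.01, 200.0
sigma = 0.5                        # assumed constant diffusion sigma(x) = 0.5 * I

def grad_F(x):
    return x                       # F(x) = 0.5 * ||x||^2, minimum at the origin

x = np.array([3.0, -3.0])          # X_0 = x^(0)
for _ in range(int(T / dt)):
    dW = np.sqrt(dt) * rng.standard_normal(2)            # Brownian increment
    x = x - eta * grad_F(x) * dt + eta * sigma * dW      # Euler-Maruyama step

print(np.linalg.norm(x))
```

For this quadratic case the SDE is an Ornstein–Uhlenbeck process: the drift pulls X_t toward the minimizer while the η-scaled noise keeps it fluctuating in a small neighborhood, mirroring the behavior of the discrete SGD iterates.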
Diffusion Approximation of SGD : Justification.
I Such an approximation has been justified in the weak sense in much of the classical literature 6 7 8.
I It can also be thought of as a normal deviation result.
I One can call the continuous process X_t the "continuous SGD" 9.
I We also have a work in this direction (Hu–Li–Li–Liu, 2018).
I I will come back to this topic at the end of the talk.
6. Benveniste, A., Métivier, M., Priouret, P., Adaptive Algorithms and Stochastic Approximations, Applications of Mathematics, 22, Springer, 1990.
7. Borkar, Vivek S., Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2008.
8. Kushner, H.J., Yin, G., Stochastic Approximation and Recursive Algorithms and Applications, Stochastic Modeling and Applied Probability, 35, Second Edition, Springer, 2003.
9. Sirignano, J., Spiliopoulos, K., Stochastic Gradient Descent in Continuous Time, SIAM Journal on Financial Mathematics, Vol. 8, Issue 1, pp. 933–961.
SGD approximating diffusion process : convergence time.
I Recall that we have set F(x) = E F(x; ζ).
I Thus we can formulate the original optimization problem as

x* = arg min_{x ∈ U} F(x) .

I Instead of SGD, in this work let us consider its approximating diffusion process

dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

I The dynamics of X_t is used as an alternative optimization procedure to find x*.
I Remark: if there is no noise, then

dX_t = −η ∇F(X_t) dt , X_0 = x^(0)

is just the gradient flow S^t x^(0). Thus it approaches x* in the strictly convex case.
SGD approximating diffusion process : convergence time.
I Hitting time

τ^η = inf{t ≥ 0 : F(X_t) ≤ F(x*) + e}

for some small e > 0.
I What is the asymptotic of E τ^η as η → 0?
SGD approximating diffusion process : convergence time.
I Approximating diffusion process

dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

I Let Y_t = X_{t/η} (a time change); then

dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

(Random perturbations of the gradient flow!)
I Hitting time

T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.
I

τ^η = η^{−1} T^η .
Random perturbations of the gradient flow : convergencetime.
I Let Y_t = X_{t/η} (a time change); then

dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

I Hitting time

T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.
I What is the asymptotic of E T^η as η → 0?
Where is the difficulty ?
Figure 1: Various critical points and the landscape of F .
Where is the difficulty ?
Figure 2: Higher order critical points.
Strict saddle property.
I Definition (strict saddle property) 10 11
Given fixed γ1 > 0 and γ2 > 0, we say a Morse function F defined on R^n satisfies the "strict saddle property" if each point x ∈ R^n belongs to one of the following:
(i) |∇F(x)| ≥ γ2 > 0;
(ii) |∇F(x)| < γ2 and λ_min(∇²F(x)) ≤ −γ1 < 0;
(iii) |∇F(x)| < γ2 and λ_min(∇²F(x)) ≥ γ1 > 0.
10. Ge, R., Huang, F., Jin, C., Yuan, Y., Escaping from saddle points: online stochastic gradient descent for tensor decomposition, arXiv:1503.02101v1 [cs.LG]
11. Sun, J., Qu, Q., Wright, J., When are nonconvex problems not scary? arXiv:1510.06096 [math.DC]
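The three cases of the definition can be checked numerically from the gradient and the smallest Hessian eigenvalue. Below is my own sketch for the toy Morse function F(x, y) = x⁴/4 − x²/2 + y²/2, whose critical points (±1, 0) are minima and (0, 0) is a strict saddle; the thresholds γ1, γ2 are chosen for illustration:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])        # gradient of F(x, y) = x^4/4 - x^2/2 + y^2/2

def hessian(p):
    x, y = p
    return np.diag([3 * x**2 - 1.0, 1.0])

def classify(p, gamma1=0.5, gamma2=0.1):
    # decide which of the cases (i)-(iii) of the strict saddle property holds at p
    if np.linalg.norm(grad(p)) >= gamma2:
        return "large gradient"           # case (i)
    lam_min = np.linalg.eigvalsh(hessian(p)).min()
    if lam_min <= -gamma1:
        return "strict saddle"            # case (ii)
    if lam_min >= gamma1:
        return "local minimum region"     # case (iii)
    return "violates strict saddle property"

print(classify(np.array([0.0, 0.0])))     # strict saddle
print(classify(np.array([1.0, 0.0])))     # local minimum region
print(classify(np.array([2.0, 0.5])))     # large gradient
```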
Strict saddle property.
Figure 3: Types of local landscape geometry with only strict saddles.
Strong saddle property.
I Definition (strong saddle property)
Let the Morse function F(•) satisfy the strict saddle property with parameters γ1 > 0 and γ2 > 0. We say the Morse function F(•) satisfies the "strong saddle property" if, for some γ3 > 0 and any x ∈ R^n with ∇F(x) = 0 satisfying (ii) in Definition 1, all eigenvalues λ_i, i = 1, 2, ..., n of the Hessian ∇²F(x) at x are bounded away from zero in absolute value, i.e., |λ_i| ≥ γ3 > 0 for any 1 ≤ i ≤ n.
Escape from saddle points.
Figure 4: Escape from a strong saddle point.
Escape from saddle points : behavior of process near onespecific saddle.
I The problem was first studied by Kifer in 1981 12.
I Recall that

dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) ,

is a random perturbation of the gradient flow. Kifer needs to further assume that the diffusion matrix σ(•)σ^T(•) = Cov(∇F(•; ζ)) is strictly uniformly positive definite.
I Roughly speaking, Kifer's result states that the exit from a neighborhood of a strong saddle point happens along the "most unstable" direction for the Hessian matrix, and the exit time is asymptotically ∼ C log(η^{−1}).
12. Kifer, Y., The exit problem for small random perturbations of dynamical systems with a hyperbolic fixed point, Israel Journal of Mathematics, 40(1), pp. 74–96.
Escape from saddle points : behavior of process near onespecific saddle.
Figure 5: Escape from a strong saddle point : Kifer’s result.
Escape from saddle points : behavior of process near onespecific saddle.
I G ∪ ∂G = 0 ∪ A1 ∪ A2 ∪ A3.
I If x ∈ A2 ∪ A3, then for the deterministic gradient flow there is a finite

t(x) = inf{t > 0 : S^t x ∈ ∂G} .

I In this case, as η → 0, the expected first exit time converges to t(x), and the first exit position converges to S^{t(x)} x.
I Recall that the gradient flow S^t x^(0) is

dX_t = −η ∇F(X_t) dt , X_0 = x^(0) .
Escape from saddle points : behavior of process near onespecific saddle.
Figure 6: Escape from a strong saddle point : Kifer’s result.
Escape from saddle points : behavior of process near onespecific saddle.
I If x ∈ (0 ∪ A1) \ ∂G, the situation is more interesting.
I The first exit occurs along the direction pointed out by the most negative eigenvalue(s) of the Hessian ∇²F(0).
I ∇²F(0) has spectrum −λ1 = −λ2 = ... = −λq < −λ_{q+1} ≤ ... ≤ −λp < 0 < λ_{p+1} ≤ ... ≤ λd.
I The exit time will be asymptotically (1/(2λ1)) ln(η^{−1}) as η → 0.
I The exit position will converge (in probability) to the intersection of ∂G with the invariant manifold corresponding to −λ1, ..., −λq.
I Slogan: when the process Y_t comes close to 0, it will "choose" the most unstable directions and move along them.
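Kifer's picture can be observed in a toy simulation (my own illustration, not from the talk): take the quadratic saddle F(x, y) = −x² + y²/2, so that ∇²F(0) = diag(−2, 1) and the most unstable direction is the x-axis. The perturbed gradient flow started at the saddle should then exit the unit ball G near that axis, after a time of order (1/(2λ1)) ln(η^{−1}) with λ1 = 2.

```python
import numpy as np

rng = np.random.default_rng(4)
eta, dt = 1e-4, 1e-3

def grad_F(p):
    x, y = p
    return np.array([-2.0 * x, y])   # F(x, y) = -x^2 + y^2/2: unstable along x

def exit_from_saddle():
    p = np.zeros(2)                  # start at the saddle point 0
    while np.linalg.norm(p) < 1.0:   # run until exit from the unit ball G
        dW = np.sqrt(dt) * rng.standard_normal(2)
        p = p - grad_F(p) * dt + np.sqrt(eta) * dW   # dY = -grad F dt + sqrt(eta) dW
    return p

exits = np.array([exit_from_saddle() for _ in range(20)])
# exit positions concentrate near the unstable x-axis: |x| ~ 1, |y| small
print(np.abs(exits[:, 0]).min(), np.abs(exits[:, 1]).max())
```

Here σ is taken to be the identity for simplicity, consistent with Kifer's uniform positive-definiteness assumption on σσ^T.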
Escape from saddle points : behavior of process near onespecific saddle.
Figure 7: Escape from a strong saddle point : Kifer’s result.
More technical aspects of Kifer’s result...
I In Kifer's result all convergences are pointwise with respect to the initial point x; they are not uniform, as η → 0, with respect to the initial point x.
I The "exit along most unstable directions" happens only when the initial point x lies strictly on A1.
I The result does not allow small perturbations of the initial point x; one runs into messy calculations...
Why do these technicalities matter?
I Our landscape may have a "chain of saddles".
I Recall that

dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0)

is a random perturbation of the gradient flow.
Global landscape : chain of saddle points.
Figure 8: Chain of saddle points.
Linearization of the gradient flow near a strong saddle point.
I Hartman–Grobman Theorem: for any strong saddle point O that we consider, there exists an open neighborhood U of O and a C^0 homeomorphism h : U → R^d such that the gradient flow under h is mapped into a linear flow.
I Linearization Assumption 13: the homeomorphism h provided by the Hartman–Grobman theorem can be taken to be C^2.
I A sufficient condition for the validity of the C^2 linearization assumption is the so-called non-resonance condition (Sternberg linearization theorem).
I I will also come back to this topic at the end of the talk.
13. Bakhtin, Y., Noisy heteroclinic networks, Probability Theory and RelatedFields, 150(2011), no.1–2, pp. 1–42.
Linearization of the gradient flow near a strong saddle point.
Figure 9: Linearization of the gradient flow near a strong saddle point.
Uniform version of Kifer’s exit asymptotic.
I Let U ⊂ G be an open neighborhood of the saddle point O. Set the initial point x ∈ U ∪ ∂U.
I Assume dist(U ∪ ∂U, ∂G) > 0. Let t(x) = inf{t > 0 : S^t x ∈ ∂G}.
I W_max is the invariant manifold corresponding to the most negative eigenvalues −λ1, ..., −λq of ∇²F(O).
I Define Q_max = W_max ∩ ∂G.
I Set

∂G_{U∪∂U→out} = {S^{t(x)} x for some x ∈ U ∪ ∂U with finite t(x)} ∪ Q_max .

I For small µ > 0 let

Q_µ = {x ∈ ∂G : dist(x, ∂G_{U∪∂U→out}) < µ} .

I τ^η_x is the first exit time to ∂G of the process Y_t starting from x.
Uniform version of Kifer’s exit asymptotic.
Figure 10: Uniform exit dynamics near a specific strong saddle point under the linearization assumption.
Uniform version of Kifer’s exit asymptotic.
I Theorem
For any r > 0, there exists some η0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η0 we have

E_x τ^η_x / ln(η^{−1}) ≤ 1/(2λ1) + r .

For any small µ > 0 and any ρ > 0, there exists some η0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η0 we have

P_x(Y_{τ^η_x} ∈ Q_µ) ≥ 1 − ρ .
Convergence analysis in a basin containing a localminimum : Sequence of stopping times.
I To demonstrate our analysis, we will assume that x* is a local minimum point of F(•) in the sense that for some open neighborhood U(x*) of x* we have

x* = arg min_{x ∈ U(x*)} F(x) .

I Assume that there are k strong saddle points O1, ..., Ok in U(x*) such that F(O1) > F(O2) > ... > F(Ok) > F(x*).
I Start the process with initial point Y0 = x ∈ U(x*).
I Hitting time

T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.
I What is the asymptotic of E T^η as η → 0?
Convergence analysis in a basin containing a localminimum : Sequence of stopping times.
Figure 11: Sequence of stopping times.
Convergence analysis in a basin containing a localminimum : Sequence of stopping times.
I Standard Markov cycle type argument.
I But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book 14.
I E T^η ≲ (k/(2γ1)) ln(η^{−1}), conditioned upon convergence.
14. Freidlin, M., Wentzell, A., Random perturbations of dynamical systems,2nd edition, Springer, 1998.
SGD approximating diffusion process : convergence time.
I Recall that we have set F(x) = E F(x; ζ).
I Thus we can formulate the original optimization problem as

x* = arg min_{x ∈ U(x*)} F(x) .

I Instead of SGD, in this work let us consider its approximating diffusion process

dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

I The dynamics of X_t is used as a surrogate optimization procedure to find x*.
SGD approximating diffusion process : convergence time.
I Hitting time

τ^η = inf{t ≥ 0 : F(X_t) ≤ F(x*) + e}

for some small e > 0.
I What is the asymptotic of E τ^η as η → 0?
SGD approximating diffusion process : convergence time.
I Approximating diffusion process

dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

I Let Y_t = X_{t/η} (a time change); then

dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

(Random perturbations of the gradient flow!)
I Hitting time

T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.
I

τ^η = η^{−1} T^η .
Convergence analysis in a basin containing a localminimum.
I Theorem
(i) For any small ρ > 0, with probability at least 1 − ρ, the SGD approximating diffusion process X_t converges to the minimizer x* for sufficiently small η, after passing through all k saddle points O1, ..., Ok;
(ii) Consider the stopping time τ^η. Then as η ↓ 0, conditioned on the above convergence of the SGD approximating diffusion process X_t, we have

lim_{η→0} E τ^η / (η^{−1} ln η^{−1}) ≤ k/(2γ1) .
Philosophy.
discrete stochastic optimization algorithms   ←   analysis of convergence time
                    ↕                                        ↕
      diffusion approximation                 ←   random perturbations of dynamical systems
Other Examples : Stochastic heavy–ball method.
I

min_{x ∈ R^d} f(x) .

I "Heavy ball method" (B. Polyak, 1964):

x_t = x_{t−1} + √s v_t ;
v_t = (1 − α√s) v_{t−1} − √s ∇f( x_{t−1} + √s (1 − α√s) v_{t−1} ) ;
x_0, v_0 ∈ R^d .

I Adding momentum leads to the second-order equation

Ẍ(t) + α Ẋ(t) + ∇f(X(t)) = 0 , X(0) ∈ R^d .

I Hamiltonian system for a dissipative nonlinear oscillator:

dX(t) = V(t) dt , X(0) ∈ R^d ;
dV(t) = (−α V(t) − ∇f(X(t))) dt , V(0) ∈ R^d .
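Polyak's iteration above can be sketched directly; this is my own toy run on the quadratic f(x) = ½‖x‖² with illustrative parameter values (note v_t is computed first and then used in the x_t update):

```python
import numpy as np

def grad_f(x):
    return x                          # f(x) = 0.5 * ||x||^2, minimizer at 0

s, alpha = 0.01, 1.0                  # step size s and friction alpha (illustrative)
sqrt_s = np.sqrt(s)
x = np.array([5.0, -5.0])             # x_0
v = np.zeros(2)                       # v_0
for t in range(5000):
    # heavy-ball iteration: v_t first, then x_t = x_{t-1} + sqrt(s) * v_t
    v = (1 - alpha * sqrt_s) * v - sqrt_s * grad_f(x + sqrt_s * (1 - alpha * sqrt_s) * v)
    x = x + sqrt_s * v

print(np.linalg.norm(x))
```

The iterates spiral into the minimizer, which is the discrete counterpart of the damped oscillation of the second-order ODE Ẍ + αẊ + ∇f(X) = 0.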
Other Examples : Stochastic heavy–ball method.
I How to escape from saddle points of f(x)?
I Add noise → randomly perturbed Hamiltonian system for a dissipative nonlinear oscillator:

dX^ε_t = V^ε_t dt + ε σ1(X^ε_t, V^ε_t) dW^1_t ;
dV^ε_t = (−α V^ε_t − ∇f(X^ε_t)) dt + ε σ2(X^ε_t, V^ε_t) dW^2_t ;
X^ε_0, V^ε_0 ∈ R^d .
Diffusion limit of stochastic heavy–ball method : Randomlyperturbed Hamiltonian system.
Figure 12: Randomly perturbed Hamiltonian system.
Diffusion limit of stochastic heavy–ball method : Randomlyperturbed Hamiltonian system.
Figure 13: Phase Diagram of a Hamiltonian System.
Diffusion limit of stochastic heavy–ball method : Randomlyperturbed Hamiltonian system.
Figure 14: Behavior near a saddle point.
Diffusion limit of stochastic heavy–ball method : Randomlyperturbed Hamiltonian system.
Figure 15: Chain of saddle points in a randomly perturbed Hamiltoniansystem.
Stochastic Composite Gradient Descent.
I By introducing approximating diffusion processes, methods from stochastic dynamical systems, in particular random perturbations of dynamical systems, can be used effectively to analyze these limiting processes.
I For example, the averaging principle can be used to analyze Stochastic Composite Gradient Descent (SCGD) (Hu–Li, 2017). We skip the details here.
Philosophy.
discrete stochastic optimization algorithms   ←   analysis of convergence time
                    ↕                                        ↕
      diffusion approximation                 ←   random perturbations of dynamical systems
Inspiring practical algorithm design.
I The above philosophy of "random perturbations" also inspires practical algorithm design.
I What if the objective function f(x) is unknown and needs to be learned?
I Work in progress with H. Xiong from Missouri S&T Computer Science.
Epilogue : A few remaining problems.
I Approximating diffusion process X_t: how does it approximate x^(t)? There are various ways 15; one may need a correction term 16 from numerical SDE theory.
I Linearization assumption: typical in dynamical systems, but should be removable with much harder work.
I Attacking the discrete iteration directly: only works for special cases such as Principal Component Analysis (PCA) 17.
15. Li, Q., Tai, C., E, W., Stochastic modified equations and adaptive stochastic gradient algorithms, arXiv:1511.06251v3
16. Milstein, G.N., Approximate integration of stochastic differential equations, Theory of Probability and its Applications, 19(3), pp. 557–562, 1975.
17. Li, C.J., Wang, M., Liu, H., Zhang, T., Near-Optimal Stochastic Approximation for Online Principal Component Estimation, Mathematical Programming, 167(1), pp. 75–97, 2018.
Thank you for your attention !