Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems

– a probabilistic approach to machine learning.

Wenqing Hu. 1

1. Department of Mathematics and Statistics, Missouri S&T.

Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).

I Deep Neural Network (DNN). The goal is to solve the following stochastic optimization problem

$$\min_{x \in \mathbb{R}^d} F(x) \equiv \frac{1}{M} \sum_{i=1}^{M} F_i(x)$$

where each component $F_i$ corresponds to the loss function for data point $i \in \{1, \dots, M\}$, and $x$ is the vector of weights being optimized.


I Naive Thinking : Gradient Descent (GD) updates as

$$x^{(t)} = x^{(t-1)} - \eta \left( \frac{1}{M} \sum_{i=1}^{M} \nabla F_i(x^{(t-1)}) \right).$$

I Learning rate η > 0 is usually a small stepsize.

I Access to all gradients $\nabla F_i(x^{(t-1)})$ for $i = 1, \dots, M$ is in general very expensive in large–scale machine learning problems.


I Let $B$ be the minibatch of prescribed size uniformly sampled from $\{1, \dots, M\}$; then the objective function can be rewritten as the expectation of a stochastic function

$$\frac{1}{M} \sum_{i=1}^{M} F_i(x) = \mathbb{E}_B \left( \frac{1}{|B|} \sum_{i \in B} F_i(x) \right) \equiv \mathbb{E}_B F(x; B).$$

I SGD updates as

$$x^{(t)} = x^{(t-1)} - \eta \left( \frac{1}{|B_t|} \sum_{i \in B_t} \nabla F_i(x^{(t-1)}) \right),$$

which is the classical mini–batch version of SGD.
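The two update rules can be compared on a small example. The sketch below is illustrative only: the least-squares losses $F_i(x) = \frac{1}{2}(a_i \cdot x - b_i)^2$, the sizes `M` and `d`, and all function names are assumptions, not part of the talk. It averages the minibatch gradient over every minibatch of size 2 and compares the result with the full gradient, which is the unbiasedness behind the identity $F(x) = \mathbb{E}_B F(x; B)$.

```python
import itertools
import numpy as np

# Hypothetical toy data: F_i(x) = 0.5 * (a_i . x - b_i)^2.
rng = np.random.default_rng(0)
M, d = 6, 3
A = rng.normal(size=(M, d))
b = rng.normal(size=M)

def grad_i(x, i):
    """Gradient of the i-th component loss F_i at x."""
    return (A[i] @ x - b[i]) * A[i]

def full_gradient(x):
    """Gradient of F(x) = (1/M) * sum_i F_i(x)."""
    return np.mean([grad_i(x, i) for i in range(M)], axis=0)

def minibatch_gradient(x, batch):
    """Gradient of F(x; B) = (1/|B|) * sum_{i in B} F_i(x)."""
    return np.mean([grad_i(x, i) for i in batch], axis=0)

def gd_step(x, eta):
    """One gradient descent step using all M gradients."""
    return x - eta * full_gradient(x)

def sgd_step(x, eta, batch_size, rng):
    """One minibatch SGD step; the batch is sampled uniformly without replacement."""
    batch = rng.choice(M, size=batch_size, replace=False)
    return x - eta * minibatch_gradient(x, batch)

x0 = np.ones(d)
# Average the minibatch gradient over every possible minibatch of size 2.
all_batches = list(itertools.combinations(range(M), 2))
mean_minibatch_grad = np.mean([minibatch_gradient(x0, B) for B in all_batches], axis=0)
x_gd = gd_step(x0, 0.1)
x_sgd = sgd_step(x0, 0.1, 2, rng)
```

Each SGD step here touches 2 gradients instead of all 6, which is the cost saving that motivates the minibatch version.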

Stochastic Optimization.

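I Stochastic Optimization Problem

$$\min_{x \in U \subset \mathbb{R}^d} F(x).$$

I The objective is an expectation :

$$F(x) \equiv \mathbb{E}[F(x; \zeta)].$$

I The index random variable $\zeta$ follows some prescribed distribution $\mathcal{D}$.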

Stochastic Optimization.

I We aim to find a local minimum point $x^*$ of the expectation function $F(x) \equiv \mathbb{E}[F(x; \zeta)]$ :

$$x^* = \arg\min_{x \in U \subset \mathbb{R}^d} \mathbb{E}[F(x; \zeta)].$$

Stochastic Approximation : Stochastic Gradient Descent (SGD) Algorithm.

I Stochastic Gradient Descent (SGD) has iteration :

$$x^{(t)} = x^{(t-1)} - \eta \nabla F(x^{(t-1)}; \zeta_t),$$

where $\{\zeta_t\}$ are i.i.d. random variables that have the same distribution as $\zeta \sim \mathcal{D}$.

Closer look at SGD.

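I SGD :

$$x^{(t)} = x^{(t-1)} - \eta \nabla F(x^{(t-1)}; \zeta_t).$$

I $F(x) \equiv \mathbb{E} F(x; \zeta)$ in which $\zeta \sim \mathcal{D}$.

I Let

$$e_t = \nabla F(x^{(t-1)}; \zeta_t) - \nabla F(x^{(t-1)})$$

and we can rewrite the SGD as

$$x^{(t)} = x^{(t-1)} - \eta \left( \nabla F(x^{(t-1)}) + e_t \right).$$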

Diffusion Limit of SGD.

I Stochastic difference equation :

$$x^{(t)} - x^{(t-1)} = -\eta \nabla F(x^{(t-1)}) - \eta e_t.$$

I When $\eta$ is small, we can approximate $x^{(t)}$ by a diffusion process $X_t$ driven by the stochastic differential equation

$$dX_t = -\eta \nabla F(X_t)\,dt + \eta \sigma(X_t)\,dW_t, \quad X_0 = x^{(0)},$$

where $W_t$ is a standard Brownian motion in $\mathbb{R}^d$ and

$$\sigma(x) \sigma^T(x) = D(x).$$

I The diffusion matrix (noise covariance matrix)

$$D(x) = \mathbb{E}\left[ (\nabla F(x; \zeta) - \nabla F(x)) (\nabla F(x; \zeta) - \nabla F(x))^T \right].$$
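A one-dimensional sanity check makes the approximation concrete. The sketch below is a toy case under stated assumptions, not the general result: it takes $F(x) = x^2/2$ with gradient noise of constant variance $\sigma^2$, independent of the past, and a deterministic starting point. In that case the variance of the SGD iterate satisfies an exact recursion, and the diffusion limit is an Ornstein–Uhlenbeck process with a closed-form variance. The function names are hypothetical.

```python
import math

def sgd_variance(eta, sigma, steps, v0=0.0):
    """Exact variance of x^(t) for F(x) = x^2 / 2 with noise variance sigma^2.

    The SGD step x^(t) = (1 - eta) * x^(t-1) - eta * e_t gives the recursion
    v_t = (1 - eta)^2 * v_{t-1} + eta^2 * sigma^2.
    """
    v = v0
    for _ in range(steps):
        v = (1.0 - eta) ** 2 * v + (eta * sigma) ** 2
    return v

def diffusion_variance(eta, sigma, t, v0=0.0):
    """Variance of X_t for dX = -eta * X dt + eta * sigma dW (Ornstein-Uhlenbeck)."""
    decay = math.exp(-2.0 * eta * t)
    return v0 * decay + 0.5 * eta * sigma ** 2 * (1.0 - decay)

eta, sigma, steps = 0.01, 1.0, 5000
v_sgd = sgd_variance(eta, sigma, steps)             # discrete iteration
v_sde = diffusion_variance(eta, sigma, float(steps))  # diffusion limit at t = steps
rel_gap = abs(v_sgd - v_sde) / v_sde
```

In this toy case the long-run variances are $\eta \sigma^2 / (2 - \eta)$ for the iteration and $\eta \sigma^2 / 2$ for the diffusion, so the relative gap is of order $\eta$.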

Diffusion Limit of SGD : Justification.

I Slogan : Continuous Markov processes are characterized by their local statistical characteristics only in the first and second moments (conditional mean and (co)variance).

I Such an approximation has been justified in the weak sense in much of the classical literature. It can also be thought of as a normal deviation result (Hu–Li–Li–Liu, 2018).

I Conversely, the discrete iteration can be viewed as a numerical scheme for the diffusion limit $X_t$.

I In much of the CS literature, people simply refer to the diffusion limit as the SGD algorithm.

Small random perturbations of dynamical systems : gradient flow case.

I Diffusion Limit of SGD :

$$dX_t = -\eta \nabla F(X_t)\,dt + \eta \sigma(X_t)\,dW_t, \quad X_0 = x^{(0)}.$$

I Let $Y_t = X_{t/\eta}$, then

$$dY_t = -\nabla F(Y_t)\,dt + \sqrt{\eta}\,\sigma(Y_t)\,dW_t, \quad Y_0 = x^{(0)}.$$

(Random perturbations of the gradient flow !)

I In much of the CS literature, people simply refer to the randomly perturbed process $Y_t$ as the SGD algorithm.
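The time change can be checked in a few lines. The derivation below is a sketch that relies on the scaling property of Brownian motion; the symbol $\widetilde{W}$ is introduced here for the rescaled Brownian motion and is not notation from the talk, which writes $W_t$ again. Start from the integral form of the diffusion limit evaluated at time $t/\eta$:

$$X_{t/\eta} = x^{(0)} - \eta \int_0^{t/\eta} \nabla F(X_s)\,ds + \eta \int_0^{t/\eta} \sigma(X_s)\,dW_s.$$

Substitute $s = u/\eta$ and set $\widetilde{W}_u = \sqrt{\eta}\,W_{u/\eta}$, which is again a standard Brownian motion, so that $dW_{u/\eta} = \eta^{-1/2}\,d\widetilde{W}_u$:

$$Y_t = x^{(0)} - \int_0^{t} \nabla F(Y_u)\,du + \sqrt{\eta} \int_0^{t} \sigma(Y_u)\,d\widetilde{W}_u.$$

The drift loses its factor $\eta$ and the noise amplitude drops from $\eta$ to $\sqrt{\eta}$, which is the small-noise perturbation of the gradient flow on the time scale $t = \eta s$.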

Summary : Probabilistic Approach.

stochastic approximation → (training algorithm) → statistical properties of learning model (convergence, generalization, ...)

↕ (small learning rate) ↕

diffusion limit ← (time change) ← small random perturbations of dynamical systems

SGD as a randomly perturbed gradient flow.

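I SGD as a randomly perturbed gradient flow :

$$dY_t = -\nabla F(Y_t)\,dt + \sqrt{\eta}\,\sigma(Y_t)\,dW_t, \quad Y_0 = x^{(0)}.$$

I Here

$$\sigma(x) \sigma^T(x) = D(x), \qquad D(x) = \mathbb{E}\left[ (\nabla F(x; \zeta) - \nabla F(x)) (\nabla F(x; \zeta) - \nabla F(x))^T \right].$$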

Target Problem 1 : Convergence Time.

I Assume isotropic noise $D(x) = I_d$.

I Diffusion Limit :

$$dX_t = -\eta \nabla F(X_t)\,dt + \eta\,dW_t, \quad X_0 = x^{(0)}.$$

I Convergence Time = Hitting time

$$\tau^\eta = \inf\{t \geq 0 : F(X_t) \leq F(x^*) + e\}$$

for some small $e > 0$.

I Asymptotics of $\mathbb{E}\tau^\eta$ as $\eta \to 0$ ?

Target Problem 1 : Convergence Time.

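I Random perturbations : Let $Y_t = X_{t/\eta}$, then

$$dY_t = -\nabla F(Y_t)\,dt + \sqrt{\eta}\,dW_t, \quad Y_0 = x^{(0)}.$$

I Hitting time

$$T^\eta = \inf\{t \geq 0 : F(Y_t) \leq F(x^*) + e\}$$

for some small $e > 0$.

I The two hitting times are related by

$$\tau^\eta = \eta^{-1} T^\eta.$$

I Asymptotics of $\mathbb{E}T^\eta$ as $\eta \to 0$ ?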
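The relation between the two hitting times can be illustrated numerically. The sketch below is a minimal Euler–Maruyama simulation under strong simplifying assumptions: a one-dimensional convex loss $F(y) = y^2/2$ with minimizer $x^* = 0$, hypothetical parameter values, and a hypothetical function name. This landscape has no saddle points, so it only illustrates the time change $\tau^\eta = \eta^{-1} T^\eta$, not the saddle-point asymptotics discussed in the talk.

```python
import math
import random

def hitting_time_Y(eta, y0, eps, dt, rng, t_max=50.0):
    """Euler-Maruyama for dY = -F'(Y) dt + sqrt(eta) dW with F(y) = y^2 / 2.

    Returns the first grid time at which F(Y_t) <= F(x*) + eps, where x* = 0,
    or t_max if the level is not reached before then.
    """
    y, t = y0, 0.0
    while 0.5 * y * y > eps and t < t_max:
        y += -y * dt + math.sqrt(eta * dt) * rng.gauss(0.0, 1.0)
        t += dt
    return t

rng = random.Random(0)
eta, y0, eps, dt = 1e-4, 1.0, 0.01, 1e-3
samples = [hitting_time_Y(eta, y0, eps, dt, rng) for _ in range(20)]
T_mean = sum(samples) / len(samples)   # Monte Carlo estimate of E T^eta
tau_mean = T_mean / eta                # corresponding estimate of E tau^eta
# Hitting time of the noise-free gradient flow Y_t = y0 * exp(-t), for comparison.
T_noise_free = 0.5 * math.log(y0 ** 2 / (2.0 * eps))
```

For this convex toy loss, $T^\eta$ stays close to the deterministic value as $\eta \to 0$, so the $\eta^{-1}$ factor in $\tau^\eta$ comes entirely from the time change.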

Approach to Target Problem 1 : Where is the difficulty ?

Figure 1 – Various critical points and the landscape of F .

Approach to Target Problem 1 : Where is the difficulty ?

I The algorithm spends most of its time near critical points of the loss function $F$.

Approach to Target Problem 1 : Escape from a saddle point.

Figure 2 – Escape from a saddle point.

Approach to Target Problem 1 : Chain of saddle points.

Figure 3 – Chain of saddle points.

Approach to Target Problem 1 : Sequence of stopping times.

I Standard Markov cycle type argument.

I But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book.

I $\mathbb{E}T^\eta \lesssim \dfrac{k}{2\gamma_1} \ln(\eta^{-1})$ conditioned upon convergence.

Approach to Target Problem 1 : Convergence Time.

I Theorem (Hu-Li, 2017). (i) For any small $\rho > 0$, with probability at least $1 - \rho$, the diffusion limit $X_t$ of SGD converges to the minimizer $x^*$ for sufficiently small $\eta$ after passing through all $k$ saddle points $O_1, \dots, O_k$ ; (ii) Consider the stopping time $\tau^\eta$. Then as $\eta \downarrow 0$, conditioned on the above convergence of the diffusion limit $X_t$ of SGD, we have

$$\lim_{\eta \to 0} \frac{\mathbb{E}\tau^\eta}{\eta^{-1} \ln \eta^{-1}} \leq \frac{k}{2\gamma_1}.$$

Target Problem 2 : Effect of batchsize on generalization.

I It is believed that escape from “bad minima” to “good minima” is responsible for good empirical properties under SGD training, e.g. generalization.

I DNN training : How does batchsize affect the escape properties from local minimum points ?

I Large batch training vs. small batch training.

I Large–batch methods tend to converge to sharp minimizers of the training and testing functions.

I Small–batch methods consistently converge to flat minimizers.

I Sharp minima → Poorer generalization ; Flat minima → Better generalization.



Approach to Target Problem 2 : Relationship between batchsize and the diffusion matrix D(x).

I Relationship between batchsize and the diffusion matrix $D(x)$ (Hu-Li-Li-Liu, 2018).

I For a minibatch $B$ sampled uniformly from $\{1, \dots, M\}$,

$$D(x) = \left( \frac{1}{|B|} - \frac{1}{M} \right) D_0(x),$$

where

$$D_0(x) = \frac{1}{M - 1} \sum_{i=1}^{M} (\nabla F_i(x) - \nabla F(x)) (\nabla F_i(x) - \nabla F(x))^T.$$

I Naively, the larger $D(x)$ is, the easier it is to escape, due to the injection of noise.
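The batchsize formula can be verified exactly on a small example by enumerating every minibatch. In the sketch below the per-sample gradients `G` are random placeholder vectors standing in for $\nabla F_i(x)$ at a fixed $x$, and the sizes are hypothetical; the only assumption on the sampling is that the minibatch is drawn uniformly without replacement.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
M, d, batch_size = 6, 2, 3
G = rng.normal(size=(M, d))      # placeholder per-sample gradients at a fixed x
g_bar = G.mean(axis=0)           # full gradient: average of the M rows

# D0(x): sample covariance of the per-sample gradients, normalised by M - 1.
centered = G - g_bar
D0 = centered.T @ centered / (M - 1)

# Exact covariance of the minibatch gradient, averaging over every
# equally likely minibatch of the given size.
batches = list(itertools.combinations(range(M), batch_size))
dev = np.array([G[list(B)].mean(axis=0) - g_bar for B in batches])
D_exact = dev.T @ dev / len(batches)

# Formula from the slide: D(x) = (1/|B| - 1/M) * D0(x).
D_formula = (1.0 / batch_size - 1.0 / M) * D0
```

The factor $1/|B| - 1/M$ vanishes when $|B| = M$, which recovers noise-free gradient descent, and it grows as the batch shrinks.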





Approach to Target Problem 2 : Escape from basin of attractors via Large Deviations Theory.

I Theorem (Hu-Li-Li-Liu, 2018). Let $U$ be a basin of attraction such that $O \in U$ is the only minimum of $F(x)$, and assume that the boundary $\partial U$ of the domain $U$ is smooth so that $\nabla F(x)$, $x \in \partial U$, points to the interior of $U$. Then for $x_0 \in U$ we have the following two asymptotics

$$P(x_0, \partial U) \asymp \exp\left( -\frac{1}{\eta} \min_{x \in \partial U} \phi^{QP}_{loc}(x; O) \right),$$

and

$$\mathbb{E}\tau(x_0, \partial U) \asymp \frac{1}{\eta} \exp\left( \frac{1}{\eta} \min_{x \in \partial U} \phi^{QP}_{loc}(x; O) \right).$$

Approach to Target Problem 2 : The quasi–potential.

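I The local quasi–potential

$$\phi^{QP}_{loc}(x; x_0) = \inf_{T > 0} \; \inf_{\psi(0) = x_0,\; \psi(T) = x} S_{0T}(\psi).$$

I Action functional (Large Deviations Rate Function) :

$$S_{0T}(\psi) = \begin{cases} \dfrac{1}{2} \displaystyle\int_0^T (\dot{\psi}_t + \nabla F(\psi_t))^T D^{-1}(\psi_t) (\dot{\psi}_t + \nabla F(\psi_t))\,dt, & \text{if } \psi_t \text{ is absolutely continuous for } t \in [0, T] ; \\ +\infty, & \text{otherwise}. \end{cases}$$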
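A worked special case shows how the quasi–potential relates to the loss. The computation below is the standard one for gradient systems and assumes isotropic noise $D(x) = I_d$, with $x_0$ a local minimum of $F$ and $x$ in its basin of attraction; it is a sketch of that special case, not the general anisotropic result. Completing the square in the action functional gives

$$\frac{1}{2} |\dot{\psi}_t + \nabla F(\psi_t)|^2 = \frac{1}{2} |\dot{\psi}_t - \nabla F(\psi_t)|^2 + 2\, \dot{\psi}_t \cdot \nabla F(\psi_t),$$

and integrating the last term along a path from $x_0$ to $x$ yields

$$S_{0T}(\psi) = \frac{1}{2} \int_0^T |\dot{\psi}_t - \nabla F(\psi_t)|^2\,dt + 2 \left( F(x) - F(x_0) \right) \geq 2 \left( F(x) - F(x_0) \right).$$

The lower bound is approached as $T \to \infty$ along the time-reversed gradient flow $\dot{\psi}_t = \nabla F(\psi_t)$, so in this case

$$\phi^{QP}_{loc}(x; x_0) = 2 \left( F(x) - F(x_0) \right).$$

With isotropic noise the escape cost is therefore set by the barrier height of $F$ alone; an anisotropic $D(x)$ changes the weighting through $D^{-1}$ and breaks this simple relation.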

Target Problem 3 : Algorithm’s implicit selection of specific local minima.

I Implicit Regularization : It is widely believed that SGD is an implicit regularizer, which helps it search for a local minimum that generalizes well.

I How to understand this phenomenon ?

Approach to Target Problem 3 : Structure of diffusion matrix D(x).

I Covariance structure (diffusion matrix) $D(x)$ depends implicitly on the model architecture :

$$D(x) = \mathbb{E}\left[ (\nabla F(x; \zeta) - \nabla F(x)) (\nabla F(x; \zeta) - \nabla F(x))^T \right].$$

I Anisotropic Noise vs. Isotropic Noise.

I $D(x)$ may have a very large condition number !






Approach to Target Problem 3 : Relation between the quasi–potential and the diffusion matrix D(x).

I Theorem (Hu-Zhu-Xiong-Huan, 2019). The local quasi–potential $\phi^{QP}_{loc}(x; x_0)$ is a solution to the Hamilton–Jacobi equation

$$\frac{1}{2} \left( \nabla \phi^{QP}_{loc}(x; x_0) \right)^T D(x)\, \nabla \phi^{QP}_{loc}(x; x_0) - \nabla F(x) \cdot \nabla \phi^{QP}_{loc}(x; x_0) = 0,$$

with boundary condition

$$\phi^{QP}_{loc}(O; x_0) = 0.$$

Approach to Target Problem 3 : Escape from local minimum point : Isotropic Noise vs. Anisotropic Noise.

Figure 4 – Escape from local minimum point : Isotropic Noise vs. Anisotropic Noise.

Target Problem 3 : Algorithm’s implicit selection of specific local minima.

I Suppose that $F(x)$ is non–convex and admits several local minimum points.

I Does SGD always select the global minimum of F (x) ?





Approach to Target Problem 3 : Variational Inference.

I Variational Inference : Invariant density

$$\rho^{SS}(x) = \frac{1}{Z(\eta)} \exp\left( -\frac{2}{\eta} \Phi(x) \right).$$

I SGD selects the global minimum of $\Phi(x)$ : $\rho(x, t) \to \rho^{SS}(x)$ as $t \to \infty$.

I If the diffusion matrix $D(x) = I_d$ is isotropic, then $\Phi(x) \propto F(x)$.

I Anisotropic noise is common !

I So in general $\Phi(x) \neq F(x)$.

Approach to Target Problem 3 : Variational Inference via Large Deviation Theory.

I Large Deviation Theory provides another way to express the invariant density

$$\rho^{SS}(x) \asymp \exp\left( -\frac{1}{\eta} \phi^{QP}(x) \right).$$

I (global) quasi–potential : $\phi^{QP}(x)$.

Approach to Target Problem 3 : Variational Inference via Large Deviation Theory.

I Theorem (Hu-Zhu-Xiong-Huan, 2019). SGD selects the global minimum $x^*$ of $\phi^{QP}(x)$ such that $\phi^{QP}(x^*) = 0$.

Approach to Target Problem 3 : Global quasi-potential.

I The (global) quasi-potential $\phi^{QP}(x)$ can be calculated from the local quasi–potential $\phi^{QP}_{loc}(x; x_0)$.

I We have seen that the local quasi–potential $\phi^{QP}_{loc}(x; x_0)$ depends on the diffusion matrix $D(x)$ via the Hamilton-Jacobi equation.

I Markov chain on local minimum points (Freidlin–Wentzell theory).

Approach to Target Problem 3 : Implicit regularization via covariance structure.

Figure 5 – Implicit regularization via covariance structure. The loss function is symmetric with respect to two local minima (−2, 0) and (2, 0). Left : Process starts from (−2, 0), anisotropic noise in the potential well ; Right : Process starts from (2, 0), isotropic noise in the potential well. SGD tends to select (2, 0).


Stochastic Approximation : Stochastic heavy–ball method.

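I Optimization problem :

$$\min_{x \in \mathbb{R}^d} f(x).$$

I “Heavy ball method” (B. Polyak, 1964) :

$$\begin{cases} x_t = x_{t-1} + \sqrt{s}\, v_t, \\ v_t = (1 - \alpha \sqrt{s})\, v_{t-1} - \sqrt{s}\, \nabla_X f(x_{t-1}), \\ x_0, v_0 \in \mathbb{R}^d. \end{cases}$$

I Stochastic heavy ball method :

$$\begin{cases} x_t = x_{t-1} + \sqrt{s}\, v_t, \\ v_t = (1 - \alpha \sqrt{s})\, v_{t-1} - \sqrt{s}\, \nabla_X f(x_{t-1}; \zeta_t), \\ x_0, v_0 \in \mathbb{R}^d. \end{cases}$$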
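Both iterations translate directly into code. The sketch below is a minimal illustration: the quadratic objective $f(x) = \frac{1}{2}|x|^2$, the additive Gaussian gradient noise, the parameter values, and the function names are all assumptions made for the example. The velocity is updated first from $x_{t-1}$, and the position then uses the new velocity $v_t$, matching the order of the two equations.

```python
import numpy as np

def heavy_ball(grad, x0, v0, s, alpha, steps):
    """Deterministic heavy ball iteration in (x, v) form."""
    x, v = np.asarray(x0, dtype=float), np.asarray(v0, dtype=float)
    rs = np.sqrt(s)
    for _ in range(steps):
        v = (1.0 - alpha * rs) * v - rs * grad(x)   # v_t from v_{t-1} and x_{t-1}
        x = x + rs * v                              # x_t from x_{t-1} and v_t
    return x, v

def stochastic_heavy_ball(stoch_grad, x0, v0, s, alpha, steps, rng):
    """Same iteration with a noisy gradient sample at each step."""
    x, v = np.asarray(x0, dtype=float), np.asarray(v0, dtype=float)
    rs = np.sqrt(s)
    for _ in range(steps):
        v = (1.0 - alpha * rs) * v - rs * stoch_grad(x, rng)
        x = x + rs * v
    return x, v

# Hypothetical toy objective f(x) = 0.5 * |x|^2, so grad f(x) = x.
def grad(x):
    return x

def stoch_grad(x, rng, noise=0.1):
    # Exact gradient plus Gaussian noise (an assumed noise model).
    return x + noise * rng.normal(size=x.shape)

rng = np.random.default_rng(0)
x_det, v_det = heavy_ball(grad, [1.0, -1.0], [0.0, 0.0], s=0.01, alpha=1.0, steps=2000)
x_sto, v_sto = stochastic_heavy_ball(stoch_grad, [1.0, -1.0], [0.0, 0.0],
                                     s=0.01, alpha=1.0, steps=2000, rng=rng)
```

On this toy objective the deterministic iterate contracts to the minimizer at the origin, while the stochastic iterate keeps fluctuating in a small neighbourhood of it.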

Diffusion Limit of Stochastic heavy–ball method.

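I Diffusion limit of the stochastic heavy ball method :

$$\begin{cases} dx(t) = \sqrt{s}\, v(t)\,dt, \\ dv(t) = -\sqrt{s}\, \alpha\, v(t)\,dt - \sqrt{s}\, \nabla_X f(x(t))\,dt + \sqrt{s}\, \sigma(x(t))\,dW_t, \\ x(0), v(0) \in \mathbb{R}^d. \end{cases}$$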

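Small random perturbations of dynamical systems : dissipative Hamiltonian system.

I Small random perturbations of a dissipative Hamiltonian system :

$$\begin{cases} dX^\varepsilon_t = V^\varepsilon_t\,dt, \\ dV^\varepsilon_t = \left( -\alpha V^\varepsilon_t - \nabla_X f(X^\varepsilon_t) \right) dt + \varepsilon\, \sigma(X^\varepsilon_t)\,dW_t, \\ X^\varepsilon_0, V^\varepsilon_0 \in \mathbb{R}^d. \end{cases}$$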

I Hu-Li-Su, 2017 and work in progress.

Many further interesting directions at the interplay of probability and data science ...

I Escape from saddle points for anisotropic noise.

I Can the program on the Hamilton–Jacobi equation be carried over to specific deep neural network structures ?

I Empirical Risk vs. Population Risk : How does this affect SGD’s variational inference ? Relation with Generalization.

I Many more applied problems in collaborations with CS/ECE/ORIE people...

Thank you for your attention !
