Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
– a probabilistic approach to machine learning.
Wenqing Hu. 1
1. Department of Mathematics and Statistics, Missouri S&T.
Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).
I Deep Neural Network (DNN). The goal is to solve the following stochastic optimization problem
\[ \min_{x \in \mathbb{R}^d} F(x) \equiv \frac{1}{M} \sum_{i=1}^{M} F_i(x) , \]
where each component $F_i$ corresponds to the loss function for data point $i \in \{1, ..., M\}$, and $x$ is the vector of weights being optimized.
Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).
I Naive Thinking : Gradient Descent (GD) updates as
\[ x^{(t)} = x^{(t-1)} - \eta \left( \frac{1}{M} \sum_{i=1}^{M} \nabla F_i(x^{(t-1)}) \right) . \]
I The learning rate $\eta > 0$ is usually a small stepsize.
I Access to all gradients $\nabla F_i(x^{(t-1)})$ for $i = 1, ..., M$ is in general very expensive in large-scale machine learning problems.
Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).
I Let $B$ be the minibatch of prescribed size uniformly sampled from $\{1, ..., M\}$, then the objective function can be further written as the expectation of a stochastic function
\[ \frac{1}{M} \sum_{i=1}^{M} F_i(x) = \mathbb{E}_B \left( \frac{1}{|B|} \sum_{i \in B} F_i(x) \right) \equiv \mathbb{E}_B F(x; B) . \]
I SGD updates as
\[ x^{(t)} = x^{(t-1)} - \eta \left( \frac{1}{|B_t|} \sum_{i \in B_t} \nabla F_i(x^{(t-1)}) \right) , \]
which is the classical mini-batch version of the SGD.
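The GD and minibatch SGD updates above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the least-squares losses $F_i(x) = \frac{1}{2}(a_i \cdot x - b_i)^2$, the synthetic data `A`, `b`, and all function names are hypothetical and do not come from the slides.

```python
import numpy as np

def grad_Fi(x, A, b):
    """Per-sample gradients of F_i(x) = 0.5 * (a_i . x - b_i)^2, shape (M, d)."""
    residual = A @ x - b              # shape (M,)
    return residual[:, None] * A      # row i is grad F_i(x)

def full_gradient(x, A, b):
    """Gradient of F(x) = (1/M) * sum_i F_i(x), as used by plain GD."""
    return grad_Fi(x, A, b).mean(axis=0)

def minibatch_gradient(x, A, b, batch):
    """(1/|B|) * sum_{i in B} grad F_i(x) for an index array `batch`."""
    return grad_Fi(x, A[batch], b[batch]).mean(axis=0)

def sgd(x0, A, b, eta, batch_size, n_steps, rng):
    """Minibatch SGD: x_t = x_{t-1} - eta * (minibatch gradient at x_{t-1})."""
    x = x0.copy()
    M = A.shape[0]
    for _ in range(n_steps):
        batch = rng.choice(M, size=batch_size, replace=False)
        x = x - eta * minibatch_gradient(x, A, b, batch)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))          # synthetic data (assumption)
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                        # noiseless targets, so x_true minimizes F
x_hat = sgd(np.zeros(3), A, b, eta=0.05, batch_size=8, n_steps=2000, rng=rng)
```

Each SGD step touches only `batch_size` gradients instead of all `M`, which is the cost saving the slide refers to.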
Stochastic Optimization.
I Stochastic Optimization Problem
\[ \min_{x \in U \subset \mathbb{R}^d} F(x) . \]
I $F(x) \equiv \mathbb{E}[F(x; \zeta)]$.
I The index random variable $\zeta$ follows some prescribed distribution $\mathcal{D}$.
Stochastic Optimization.
I We aim to find a local minimum point $x^*$ of the expectation function $F(x) \equiv \mathbb{E}[F(x; \zeta)]$ :
\[ x^* = \arg\min_{x \in U \subset \mathbb{R}^d} \mathbb{E}[F(x; \zeta)] . \]
Stochastic Approximation : Stochastic Gradient Descent (SGD) Algorithm.
I Stochastic Gradient Descent (SGD) has iteration :
\[ x^{(t)} = x^{(t-1)} - \eta \nabla F(x^{(t-1)}; \zeta_t) , \]
where $\{\zeta_t\}$ are i.i.d. random variables that have the same distribution as $\zeta \sim \mathcal{D}$.
Closer look at SGD.
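I SGD : $x^{(t)} = x^{(t-1)} - \eta \nabla F(x^{(t-1)}; \zeta_t)$.
I $F(x) \equiv \mathbb{E} F(x; \zeta)$ in which $\zeta \sim \mathcal{D}$.
I Let
\[ e_t = \nabla F(x^{(t-1)}; \zeta_t) - \nabla F(x^{(t-1)}) \]
and we can rewrite the SGD as
\[ x^{(t)} = x^{(t-1)} - \eta \left( \nabla F(x^{(t-1)}) + e_t \right) . \]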
Diffusion Limit of SGD.
I Stochastic difference equation :
\[ x^{(t)} - x^{(t-1)} = -\eta \nabla F(x^{(t-1)}) - \eta e_t . \]
I When $\eta$ is small, we can approximate $x^{(t)}$ by a diffusion process $X_t$ driven by the stochastic differential equation
\[ dX_t = -\eta \nabla F(X_t)\, dt + \eta \sigma(X_t)\, dW_t , \quad X_0 = x^{(0)} , \]
where $W_t$ is a standard Brownian motion in $\mathbb{R}^d$ and
\[ \sigma(x) \sigma^T(x) = D(x) . \]
I The diffusion matrix (noise covariance matrix)
\[ D(x) = \mathbb{E} \left[ (\nabla F(x; \zeta) - \nabla F(x)) (\nabla F(x; \zeta) - \nabla F(x))^T \right] . \]
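As a rough illustration of the two objects just defined, the sketch below estimates $D(x)$ from a table of per-sample gradients and builds one factor $\sigma$ with $\sigma \sigma^T = D$. It assumes $\zeta$ is uniform over $M$ samples; the random array `grads` is a stand-in for $\nabla F(x; \zeta)$, and the function names are hypothetical.

```python
import numpy as np

def noise_covariance(per_sample_grads):
    """D(x) = E[(g - E g)(g - E g)^T] when zeta is uniform over the M samples."""
    centered = per_sample_grads - per_sample_grads.mean(axis=0)
    return centered.T @ centered / per_sample_grads.shape[0]

def sqrt_factor(D):
    """Symmetric sigma with sigma @ sigma.T == D, via eigendecomposition.

    Eigendecomposition is used instead of Cholesky because D may be singular.
    """
    eigvals, eigvecs = np.linalg.eigh(D)
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(1)
# Stand-in for grad F(x; zeta) at a fixed x: 200 samples in R^3 with unequal scales.
grads = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
D = noise_covariance(grads)
sigma = sqrt_factor(D)
```

The factor $\sigma$ is not unique; the symmetric square root is just one convenient choice.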
Diffusion Limit of SGD : Justification.
I Slogan : Continuous Markov processes are characterized by their local statistical characteristics only in the first and second moments (conditional mean and (co)variance).
I Such an approximation has been justified in the weak sense in much of the classical literature. It can also be thought of as a normal deviation result (Hu–Li–Li–Liu, 2018).
I Conversely, the discrete iteration can be viewed as a numerical scheme for the diffusion limit $X_t$.
I In much of the CS literature, people simply refer to the diffusion limit as the SGD algorithm.
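Small random perturbations of dynamical systems : gradient flow case.
I Diffusion Limit of SGD :
\[ dX_t = -\eta \nabla F(X_t)\, dt + \eta \sigma(X_t)\, dW_t , \quad X_0 = x^{(0)} . \]
I Let $Y_t = X_{t/\eta}$ (a time change), then
\[ dY_t = -\nabla F(Y_t)\, dt + \sqrt{\eta}\, \sigma(Y_t)\, dW_t , \quad Y_0 = x^{(0)} . \]
(Random perturbations of the gradient flow !) The factor $\sqrt{\eta}$ comes from Brownian scaling : $W_{t/\eta}$ has the same law as $\eta^{-1/2} W_t$, so the noise coefficient $\eta$ becomes $\eta \cdot \eta^{-1/2} = \sqrt{\eta}$.
I In much of the CS literature, people simply refer to the randomly perturbed process $Y_t$ as the SGD algorithm.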
Summary : Probabilistic Approach.
stochastic approximation --(training algorithm)--> statistical properties of learning model (convergence, generalization, ...)
↕ small learning rate ↕
diffusion limit <--(time change)-- small random perturbations of dynamical systems
SGD as a randomly perturbed gradient flow.
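I SGD as a randomly perturbed gradient flow :
\[ dY_t = -\nabla F(Y_t)\, dt + \sqrt{\eta}\, \sigma(Y_t)\, dW_t , \quad Y_0 = x^{(0)} . \]
I $\sigma(x) \sigma^T(x) = D(x)$.
I The diffusion matrix (noise covariance matrix)
\[ D(x) = \mathbb{E} \left[ (\nabla F(x; \zeta) - \nabla F(x)) (\nabla F(x; \zeta) - \nabla F(x))^T \right] . \]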
Target Problem 1 : Convergence Time.
I Assume isotropic noise $D(x) = I_d$.
I Diffusion Limit :
\[ dX_t = -\eta \nabla F(X_t)\, dt + \eta\, dW_t , \quad X_0 = x^{(0)} . \]
I Convergence Time = Hitting time
\[ \tau^\eta = \inf\{ t \ge 0 : F(X_t) \le F(x^*) + e \} \]
for some small $e > 0$.
I Asymptotics of $\mathbb{E}\tau^\eta$ as $\eta \to 0$ ?
Target Problem 1 : Convergence Time.
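I Random perturbations : Let $Y_t = X_{t/\eta}$, then
\[ dY_t = -\nabla F(Y_t)\, dt + \sqrt{\eta}\, dW_t , \quad Y_0 = x^{(0)} . \]
I Hitting time
\[ T^\eta = \inf\{ t \ge 0 : F(Y_t) \le F(x^*) + e \} \]
for some small $e > 0$.
I $\tau^\eta = \eta^{-1} T^\eta$.
I Asymptotics of $\mathbb{E} T^\eta$ as $\eta \to 0$ ?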
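The hitting time $T^\eta$ and the conversion $\tau^\eta = \eta^{-1} T^\eta$ can be explored numerically. The following is a minimal Euler-Maruyama sketch, assuming isotropic noise and a hypothetical convex loss $F(y) = |y|^2/2$ with $x^* = 0$; the step size, threshold and sample count are arbitrary illustrative choices, and the slides themselves concern the much harder non-convex case.

```python
import numpy as np

def hitting_time_Y(grad_F, F, y0, eta, level, dt, max_steps, rng):
    """Euler-Maruyama for dY = -grad F(Y) dt + sqrt(eta) dW.

    Returns the first grid time t with F(Y_t) <= level, or inf if none is found.
    """
    y = np.array(y0, dtype=float)
    for n in range(max_steps + 1):
        if F(y) <= level:
            return n * dt
        noise = rng.normal(size=y.shape)
        y = y - grad_F(y) * dt + np.sqrt(eta * dt) * noise
    return np.inf   # level not reached within the simulated horizon

F = lambda y: 0.5 * float(y @ y)      # hypothetical loss with minimizer x* = 0
grad_F = lambda y: y

rng = np.random.default_rng(2)
eta, e = 0.01, 0.05                   # learning rate and threshold F(x*) + e
T_samples = [hitting_time_Y(grad_F, F, [1.0, 1.0], eta, e, 1e-3, 20_000, rng)
             for _ in range(20)]
T_mean = float(np.mean(T_samples))    # Monte Carlo estimate of E T^eta
tau_mean = T_mean / eta               # estimate of E tau^eta via tau^eta = T^eta / eta
```

Simulating $Y_t$ rather than $X_t$ keeps the number of time steps independent of $\eta$; the factor $\eta^{-1}$ is applied afterwards.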
Approach to Target Problem 1 : Where is the difficulty ?
Figure 1 – Various critical points and the landscape of F .
Approach to Target Problem 1 : Where is the difficulty ?
I The algorithm spends most of its time near critical points of the loss function $F$.
Approach to Target Problem 1 : Escape from a saddle point.
Figure 2 – Escape from a saddle point.
Approach to Target Problem 1 : Chain of saddle points.
Figure 3 – Chain of saddle points.
Approach to Target Problem 1 : Sequence of stopping times.
I Standard Markov cycle type argument.
I But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book.
I $\mathbb{E} T^\eta \lesssim \frac{k}{2\gamma_1} \ln(\eta^{-1})$ conditioned upon convergence.
Approach to Target Problem 1 : Convergence Time.
I Theorem (Hu-Li, 2017). (i) For any small $\rho > 0$, with probability at least $1 - \rho$, the diffusion limit $X_t$ of SGD converges to the minimizer $x^*$ for sufficiently small $\eta$ after passing through all $k$ saddle points $O_1, ..., O_k$ ; (ii) Consider the stopping time $\tau^\eta$. Then as $\eta \downarrow 0$, conditioned on the above convergence of the diffusion limit $X_t$ of SGD, we have
\[ \lim_{\eta \to 0} \frac{\mathbb{E}\tau^\eta}{\eta^{-1} \ln \eta^{-1}} \le \frac{k}{2\gamma_1} . \]
Target Problem 2 : Effect of batchsize on generalization.
I It is believed that escape from “bad minima” to “good minima” is responsible for good empirical properties under SGD training, e.g. generalization.
I DNN training : How does batch size affect the escape properties from local minimum points ?
I Large batch training vs. small batch training.
I Large-batch methods tend to converge to sharp minimizers of the training and testing functions.
I Small-batch methods consistently converge to flat minimizers.
I Sharp minima → Poorer generalization ; Flat minima → Better generalization.
Approach to Target Problem 2 : Relationship between batchsize and the diffusion matrix D(x).
I Relationship between batchsize and the diffusion matrix $D(x)$ (Hu-Li-Li-Liu, 2018).
I
\[ D(x) = \left( \frac{1}{|B|} - \frac{1}{M} \right) D_0(x) , \]
where
\[ D_0(x) = \frac{1}{M - 1} \sum_{i=1}^{M} (\nabla F_i(x) - \nabla F(x)) (\nabla F_i(x) - \nabla F(x))^T . \]
I Naively, the larger $D(x)$ is, the easier it is to escape, due to the injected noise.
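The batch-size formula can be checked by brute force on a tiny example. The sketch below assumes the minibatch is drawn uniformly without replacement (under which the identity is the classical finite-population correction) and enumerates every batch of a given size; the random array `grads` stands in for the per-sample gradients $\nabla F_i(x)$, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def D0(grads):
    """Covariance of the per-sample gradients with the 1/(M-1) normalization."""
    centered = grads - grads.mean(axis=0)
    return centered.T @ centered / (grads.shape[0] - 1)

def minibatch_covariance(grads, batch_size):
    """Exact covariance of the minibatch gradient over all batches of a given size
    drawn uniformly without replacement (brute-force enumeration, small M only)."""
    M = grads.shape[0]
    full = grads.mean(axis=0)
    batch_means = np.array([grads[list(c)].mean(axis=0)
                            for c in combinations(range(M), batch_size)])
    centered = batch_means - full
    return centered.T @ centered / batch_means.shape[0]

rng = np.random.default_rng(3)
M = 8
grads = rng.normal(size=(M, 2))        # stand-in for grad F_i(x), i = 1..M
checks = {}
for B in (1, 2, 4, 8):
    lhs = minibatch_covariance(grads, B)
    rhs = (1.0 / B - 1.0 / M) * D0(grads)
    checks[B] = bool(np.allclose(lhs, rhs))
```

The prefactor $1/|B| - 1/M$ shrinks as the batch grows and vanishes at $|B| = M$, where the minibatch gradient is the full gradient and carries no noise.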
Approach to Target Problem 2 : Escape from basin of attractors via Large Deviations Theory.
I Theorem (Hu-Li-Li-Liu, 2018). Let $U$ be a basin of attraction such that $O \in U$ is the only minimum of $F(x)$, and assume that the boundary $\partial U$ of the domain $U$ is smooth so that $\nabla F(x)$, $x \in \partial U$, points to the interior of $U$. Then for $x_0 \in U$ we have the following two asymptotics
\[ P(x_0, \partial U) \asymp \exp\left( -\frac{1}{\eta} \min_{x \in \partial U} \phi^{QP}_{loc}(x; O) \right) , \]
and
\[ \mathbb{E}\tau(x_0, \partial U) \asymp \frac{1}{\eta} \exp\left( \frac{1}{\eta} \min_{x \in \partial U} \phi^{QP}_{loc}(x; O) \right) . \]
Approach to Target Problem 2 : The quasi–potential.
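I The local quasi-potential
\[ \phi^{QP}_{loc}(x; x_0) = \inf_{T > 0} \; \inf_{\psi(0) = x_0,\, \psi(T) = x} S_{0T}(\psi) . \]
I Action functional (Large Deviations Rate Function) :
\[ S_{0T}(\psi) = \begin{cases} \dfrac{1}{2} \displaystyle\int_0^T (\dot\psi_t + \nabla F(\psi_t))^T D^{-1}(\psi_t) (\dot\psi_t + \nabla F(\psi_t))\, dt , & \text{if } \psi_t \text{ is absolutely continuous on } t \in [0, T] ; \\ +\infty , & \text{otherwise} . \end{cases} \]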
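To make the action functional concrete, here is a small numerical sketch that evaluates $S_{0T}(\psi)$ with a midpoint rule. It assumes isotropic noise $D = I$ and the hypothetical loss $F(x) = |x|^2/2$ with minimum $O = 0$. In that special case the gradient-flow path costs nothing, while the time-reversed (uphill) path $\dot\psi = +\nabla F(\psi)$ has action $2(F(\psi(T)) - F(\psi(0)))$, which the code checks. None of the names come from the slides.

```python
import numpy as np

def action(psi, T, grad_F, D_inv):
    """Midpoint-rule approximation of S_{0T}(psi) for a path on a uniform grid.

    psi has shape (n + 1, d): psi[k] approximates psi(k * T / n).
    """
    n = psi.shape[0] - 1
    dt = T / n
    velocity = (psi[1:] - psi[:-1]) / dt
    midpoint = 0.5 * (psi[1:] + psi[:-1])
    total = 0.0
    for p, v in zip(midpoint, velocity):
        r = v + grad_F(p)                      # psi' + grad F(psi)
        total += 0.5 * (r @ D_inv(p) @ r) * dt
    return total

F = lambda x: 0.5 * float(x @ x)               # hypothetical loss, minimum O = 0
grad_F = lambda x: x
D_inv = lambda x: np.eye(x.shape[0])           # isotropic noise, D = I (assumption)

T, n = 10.0, 5000
t = np.linspace(0.0, T, n + 1)[:, None]
direction = np.array([1.0, 0.0])
uphill = np.exp(t - T) * direction             # solves psi' = +grad F, ends at (1, 0)
downhill = np.exp(-t) * direction              # solves psi' = -grad F (gradient flow)

S_up = action(uphill, T, grad_F, D_inv)
S_down = action(downhill, T, grad_F, D_inv)
expected_up = 2.0 * (F(uphill[-1]) - F(uphill[0]))
```

Paths that follow the drift are free, and only motion against the drift is penalized, weighted by $D^{-1}$; this is why directions with large noise are cheap to escape along.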
Target Problem 3 : Algorithm’s implicit selection of specific local minima.
I Implicit Regularization : It is widely believed that SGD acts as an implicit regularizer, steering the search toward local minima that generalize well.
I How to understand this phenomenon ?
Approach to Target Problem 3 : Structure of diffusion matrix D(x).
I Covariance structure (diffusion matrix) $D(x)$ depends implicitly on the model architecture :
\[ D(x) = \mathbb{E} \left[ (\nabla F(x; \zeta) - \nabla F(x)) (\nabla F(x; \zeta) - \nabla F(x))^T \right] . \]
I Anisotropic Noise vs. Isotropic Noise.
I $D(x)$ may have a very large condition number !
Approach to Target Problem 3 : Relation between the quasi-potential and the diffusion matrix D(x).
I Theorem (Hu-Zhu-Xiong-Huan, 2019). The local quasi-potential $\phi^{QP}_{loc}(x; x_0)$ is a solution to the Hamilton–Jacobi equation
\[ \frac{1}{2} \left( \nabla \phi^{QP}_{loc}(x; x_0) \right)^T D(x)\, \nabla \phi^{QP}_{loc}(x; x_0) - \nabla F(x) \cdot \nabla \phi^{QP}_{loc}(x; x_0) = 0 , \]
with boundary condition
\[ \phi^{QP}_{loc}(O; x_0) = 0 . \]
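One way to get a feel for the Hamilton–Jacobi equation is to test a closed-form candidate in a case simple enough to solve by hand. The sketch below assumes a constant diagonal diffusion matrix $D = \mathrm{diag}(d_1, d_2)$ and a separable loss $F(x) = F_1(x_1) + F_2(x_2)$, for which $\phi(x) = 2F_1(x_1)/d_1 + 2F_2(x_2)/d_2$ (up to an additive constant) makes the left-hand side vanish. This toy setup and all names are illustrative assumptions, not the construction from the cited paper; the code only verifies that the candidate solves the equation.

```python
import numpy as np

d1, d2 = 4.0, 0.25                          # hypothetical constant D = diag(d1, d2)
D = np.diag([d1, d2])

def grad_F(x):
    """Gradient of the separable loss F(x) = (x1^2 - 1)^2 / 4 + x2^2 / 2."""
    return np.array([x[0] * (x[0] ** 2 - 1.0), x[1]])

def grad_phi(x):
    """Gradient of the candidate phi(x) = 2 F1(x1) / d1 + 2 F2(x2) / d2."""
    g = grad_F(x)
    return np.array([2.0 * g[0] / d1, 2.0 * g[1] / d2])

def hj_residual(x):
    """Left-hand side (1/2) grad phi^T D grad phi - grad F . grad phi."""
    p = grad_phi(x)
    return 0.5 * p @ D @ p - grad_F(x) @ p

rng = np.random.default_rng(4)
points = rng.uniform(-2.0, 2.0, size=(100, 2))
max_residual = max(abs(hj_residual(x)) for x in points)
```

In this toy case the candidate weights each coordinate of $F$ by the inverse noise strength, so it is proportional to $F$ only when $d_1 = d_2$.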
Approach to Target Problem 3 : Escape from local minimum point : Isotropic Noise vs. Anisotropic Noise.
Figure 4 – Escape from local minimum point : Isotropic Noise vs. Anisotropic Noise.
Target Problem 3 : Algorithm’s implicit selection of specific local minima.
I Suppose that $F(x)$ is non-convex and admits several local minimum points.
I Does SGD always select the global minimum of $F(x)$ ?
Approach to Target Problem 3 : Variational Inference.
I Variational Inference : Invariant density
\[ \rho^{SS}(x) = \frac{1}{Z(\eta)} \exp\left( -\frac{2}{\eta} \Phi(x) \right) . \]
I SGD selects the global minimum of $\Phi(x)$ : $\rho(x, t) \to \rho^{SS}(x)$ as $t \to \infty$.
I If the diffusion matrix $D(x) = I_d$ is isotropic, then $\Phi(x) \propto F(x)$.
I Anisotropic noise is common !
I So in general $\Phi(x) \neq F(x)$.
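The isotropic statement can be checked by simulation in the simplest setting. The sketch below assumes $D = I$, one dimension and the hypothetical loss $F(x) = x^2/2$, so that the invariant density of $dY_t = -F'(Y_t)\,dt + \sqrt{\eta}\,dW_t$ is proportional to $\exp(-\frac{2}{\eta} F(x))$, a Gaussian with variance $\eta/2$. The Euler-Maruyama step size and particle count are arbitrary choices.

```python
import numpy as np

def simulate_particles(grad_F, eta, y0, dt, n_steps, rng):
    """Euler-Maruyama for dY = -grad F(Y) dt + sqrt(eta) dW, vectorized over particles."""
    y = np.array(y0, dtype=float)
    for _ in range(n_steps):
        y = y - grad_F(y) * dt + np.sqrt(eta * dt) * rng.normal(size=y.shape)
    return y

eta = 0.5
rng = np.random.default_rng(5)
# 20,000 independent one-dimensional particles, all started at the minimum x = 0.
samples = simulate_particles(lambda y: y, eta, np.zeros(20_000), 0.01, 1_000, rng)
empirical_var = float(samples.var())
gibbs_var = eta / 2.0      # variance of the density proportional to exp(-(2/eta) * x^2 / 2)
```

With anisotropic or state-dependent noise the same experiment would generally not match $\exp(-\frac{2}{\eta} F)$, which is the point of the last two bullets.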
Approach to Target Problem 3 : Variational Inference via Large Deviation Theory.
I Large Deviation Theory provides another way to express the invariant density
\[ \rho^{SS}(x) \asymp \exp\left( -\frac{1}{\eta} \phi^{QP}(x) \right) . \]
I (global) quasi-potential : $\phi^{QP}(x)$.
Approach to Target Problem 3 : Variational Inference via Large Deviation Theory.
I Theorem (Hu-Zhu-Xiong-Huan, 2019). SGD selects the global minimum $x^*$ of $\phi^{QP}(x)$ such that $\phi^{QP}(x^*) = 0$.
Approach to Target Problem 3 : Global quasi-potential.
I The (global) quasi-potential $\phi^{QP}(x)$ can be calculated from the local quasi-potential $\phi^{QP}_{loc}(x; x_0)$.
I We have seen that the local quasi-potential $\phi^{QP}_{loc}(x; x_0)$ depends on the diffusion matrix $D(x)$ via the Hamilton–Jacobi equation.
I Markov chain on local minimum points (Freidlin–Wentzell theory).
Approach to Target Problem 3 : Implicit regularization via covariance structure.
Figure 5 – Implicit regularization via covariance structure. The loss function is symmetric with respect to two local minima (−2, 0) and (2, 0). Left : Process starts from (−2, 0), anisotropic noise in the potential well ; Right : Process starts from (2, 0), isotropic noise in the potential well. SGD tends to select (2, 0).
Stochastic Approximation : Stochastic heavy–ball method.
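I
\[ \min_{x \in \mathbb{R}^d} f(x) . \]
I “Heavy ball method” (B. Polyak, 1964) :
\[ x_t = x_{t-1} + \sqrt{s}\, v_t , \quad v_t = (1 - \alpha\sqrt{s})\, v_{t-1} - \sqrt{s}\, \nabla_X f(x_{t-1}) , \quad x_0, v_0 \in \mathbb{R}^d . \]
I Stochastic heavy ball method :
\[ x_t = x_{t-1} + \sqrt{s}\, v_t , \quad v_t = (1 - \alpha\sqrt{s})\, v_{t-1} - \sqrt{s}\, \nabla_X f(x_{t-1}; \zeta_t) , \quad x_0, v_0 \in \mathbb{R}^d . \]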
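The two recursions translate directly into code. Below is a minimal sketch of the stochastic heavy ball iteration; the quadratic objective, the Gaussian gradient noise, and the parameter values are hypothetical choices for illustration only. Setting the noise level to zero recovers the deterministic heavy ball method.

```python
import numpy as np

def stochastic_heavy_ball(stoch_grad, x0, v0, s, alpha, n_steps, rng):
    """Iterate v_t = (1 - alpha*sqrt(s)) v_{t-1} - sqrt(s) grad f(x_{t-1}; zeta_t),
    then x_t = x_{t-1} + sqrt(s) v_t."""
    x = np.array(x0, dtype=float)
    v = np.array(v0, dtype=float)
    root_s = np.sqrt(s)
    for _ in range(n_steps):
        g = stoch_grad(x, rng)                 # evaluated at x_{t-1}
        v = (1.0 - alpha * root_s) * v - root_s * g
        x = x + root_s * v                     # uses the freshly updated v_t
    return x, v

def make_quadratic_grad(noise_std):
    """Hypothetical oracle: gradient of f(x) = |x|^2 / 2 plus Gaussian noise."""
    def stoch_grad(x, rng):
        return x + noise_std * rng.normal(size=x.shape)
    return stoch_grad

rng = np.random.default_rng(6)
x_det, v_det = stochastic_heavy_ball(make_quadratic_grad(0.0), [1.0, -1.0], [0.0, 0.0],
                                     s=0.01, alpha=1.0, n_steps=500, rng=rng)
x_sto, v_sto = stochastic_heavy_ball(make_quadratic_grad(0.1), [1.0, -1.0], [0.0, 0.0],
                                     s=0.01, alpha=1.0, n_steps=500, rng=rng)
```

Note the update order: the velocity is refreshed first because $x_t$ depends on $v_t$, not on $v_{t-1}$.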
Diffusion Limit of Stochastic heavy–ball method.
I Diffusion limit of the stochastic heavy ball method :
\[ dx(t) = \sqrt{s}\, v(t)\, dt , \]
\[ dv(t) = -\sqrt{s}\, \alpha v(t)\, dt - \sqrt{s}\, \nabla_X f(x(t))\, dt + \sqrt{s}\, \sigma(x(t))\, dW_t , \]
\[ x(0), v(0) \in \mathbb{R}^d . \]
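Small random perturbations of dynamical systems : dissipative Hamiltonian system.
I Small random perturbations of a dissipative Hamiltonian system :
\[ dX^\varepsilon_t = V^\varepsilon_t\, dt ; \]
\[ dV^\varepsilon_t = \left( -\alpha V^\varepsilon_t - \nabla_X f(X^\varepsilon_t) \right) dt + \varepsilon \sigma(X^\varepsilon_t)\, dW_t , \]
\[ X^\varepsilon_0, V^\varepsilon_0 \in \mathbb{R}^d . \]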
I Hu-Li-Su, 2017 and work in progress.
Many further interesting directions at the interplay of probability and data science ...
I Escape from saddle points for anisotropic noise.
I Can the program on the Hamilton–Jacobi equation be carried over to specific deep neural network structures ?
I Empirical Risk vs. Population Risk : How does this affect SGD’s variational inference ? Relation with Generalization.
I Many more applied problems in collaborations with CS/ECE/ORIE people...
Thank you for your attention !