
Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

Daniel M. Roy, University of Toronto; Vector Institute

Joint work with

Gintarė K. Dziugaitė, University of Cambridge

(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights

Long Beach, CA. December 2017


How does SGD work?

• Growing body of work arguing that SGD performs implicit regularization.

• Problem: no matching generalization bounds that are nonvacuous when applied to real data and networks.

• We focus on “flat minima”: weights w such that the training error is “insensitive” to “large” perturbations.

• We show that the size/flatness/location of minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds.

• Focusing on MNIST, we show how to compute generalization bounds that are nonvacuous for stochastic networks with millions of weights.

• We obtain our (data-dependent, PAC-Bayesian) generalization bounds via a fair bit of computation with SGD. Our approach is a modern take on Langford and Caruana (2002).

Dziugaite and Roy 1/19


Nonvacuous generalization bounds

risk: $L_D(h) := \mathbb{E}_{(x,y)\sim D}[\ell(h(x), y)]$, with $D$ unknown

empirical risk: $L_S(h) := \frac{1}{m}\sum_{i=1}^m \ell(h(x_i), y_i)$, where $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$

generalization error: $L_D(h) - L_S(h)$

$$\forall D \quad \mathbb{P}_{S \sim D^m}\Big( L_D(h) - L_S(h) < \underbrace{\varepsilon(\mathcal{H}, m, \delta, S, h)}_{\text{generalization error bound}} \Big) > 1 - \delta$$

[Plots: behavior of a generalization error bound as model complexity grows with # data fixed, and as # data grows with the model class H fixed.]

Dziugaite and Roy 2/19
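As a concrete reading of these definitions, here is a minimal sketch (ours, not the speakers'; the classifier h and the data arrays are hypothetical placeholders) that computes the empirical 0–1 risk and estimates the generalization gap using held-out data as a stand-in for the unknown D:

    import numpy as np

    def empirical_risk(h, X, y):
        # Empirical 0-1 risk L_S(h) = (1/m) * sum_i 1[h(x_i) != y_i].
        return float(np.mean(h(X) != y))

    def generalization_gap_estimate(h, X_train, y_train, X_heldout, y_heldout):
        # L_D(h) is unknown because D is unknown; a held-out sample gives an
        # unbiased estimate of it, so this approximates L_D(h) - L_S(h).
        return empirical_risk(h, X_heldout, y_heldout) - empirical_risk(h, X_train, y_train)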


SGD is X and X implies generalization

[Handwritten overlay: I(w; D) = H(w) − H(w|D) = H(D) − H(D|w); diagram of the two-step claim “SGD is X” and “X ⇒ generalization”.]

“SGD is Empirical Risk Minimization for large enough networks”
“SGD is (Implicit) Regularized Loss Minimization”
“SGD is Approximate Bayesian Inference”

. . .

No statement of the form “SGD is X” explains generalization in deep learning until we know that X implies generalization under real-world conditions.

Dziugaite and Roy 3/19


SGD is (not simply) empirical risk minimization

Training error of SGD at convergence.

Test error at convergence and for early stopping: identical.

SGD ≈ Empirical Risk Minimization: $\arg\min_{w \in \mathcal{H}} L_S(w)$

MNIST has 60,000 training examples.
A two-layer fully connected ReLU network has >1M parameters
⇒ PAC bounds are vacuous
⇒ PAC bounds can’t explain this curve.

Dziugaite and Roy 4/19


Our focus: Statistical Learning Aspect

On MNIST, with realistic networks, . . .

• VC bounds don’t imply generalization

• Classic margin + norm-bounded Rademacher complexity bounds don’t imply generalization

• Being “Bayesian” does not necessarily imply generalization (sorry!)

Using PAC-Bayes bounds, we show that the size/flatness/location of minima, found by SGD on MNIST, imply generalization for MNIST.

Our bounds require a fair bit of computation/optimization to evaluate. Strictly speaking, they bound the error of a random perturbation of the SGD solution.

Dziugaite and Roy 5/19


Flat minima... training error in flat minima is “insensitive” to “large” perturbations

(Hochreiter and Schmidhuber, 1997)

... meets the PAC-Bayes theorem (McAllester)

$$\forall D\ \forall P \quad \mathbb{P}_{S \sim D^m}\left[\ \forall Q \quad \Delta\big(L_S(Q), L_D(Q)\big) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{I_\Delta(m)}{\delta}}{m}\ \right] \ge 1 - \delta$$

For any data distribution $D$ (i.e., no assumptions),
for any “prior” randomized classifier $P$, even nonsense,
with high probability over $m$ i.i.d. samples $S \sim D^m$,
for any “posterior” randomized classifier $Q$, not necessarily given by Bayes’ rule,

the generalization error of $Q$ is bounded approximately by $\frac{1}{m}\,\mathrm{KL}(Q\|P)$.

Dziugaite and Roy 6/19
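To make the bound above concrete, here is a small sketch (ours, not the authors' code) that evaluates a PAC-Bayes bound of this kl form: given $L_S(Q)$, $\mathrm{KL}(Q\|P)$, $m$ and $\delta$, it inverts the binary KL divergence by bisection. The constant $I_\Delta(m) = 2\sqrt{m}$ is an assumption, corresponding to the standard Langford–Seeger variant of the theorem.

    import math

    def binary_kl(q, p):
        # kl(q || p) for Bernoulli parameters q, p.
        eps = 1e-12
        q = min(max(q, eps), 1 - eps)
        p = min(max(p, eps), 1 - eps)
        return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

    def kl_inverse(emp_risk, bound_rhs, tol=1e-9):
        # Largest p with kl(emp_risk || p) <= bound_rhs, found by bisection.
        lo, hi = emp_risk, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binary_kl(emp_risk, mid) <= bound_rhs:
                lo = mid
            else:
                hi = mid
        return lo

    def pac_bayes_bound(emp_risk, kl_q_p, m, delta):
        # Bound on L_D(Q) from kl(L_S(Q) || L_D(Q)) <= (KL(Q||P) + log(2*sqrt(m)/delta)) / m.
        rhs = (kl_q_p + math.log(2 * math.sqrt(m) / delta)) / m
        return kl_inverse(emp_risk, rhs)

Plugging in numbers of the order reported later in the talk (SNN train error ≈ 0.028, KL ≈ 5144, δ = 0.05, m around the MNIST training-set size) gives a bound around 0.15–0.16, the same ballpark as the 0.161 in the results table; the exact constants and data split used in the paper may differ.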


Controlling generalization error of randomized classifiers

Let $\mathcal{H}$ be a hypothesis class of binary classifiers $\mathbb{R}^k \to \{-1, 1\}$. A randomized classifier is a distribution $Q$ on $\mathcal{H}$. Its risk is

$$L_D(Q) = \mathbb{E}_{w \sim Q}\big[L_D(h_w)\big].$$

Among the sharpest generalization bounds for randomized classifiers are PAC-Bayes bounds (McAllester, 1999).

Theorem (PAC-Bayes; Catoni, 2007).

Let $\delta > 0$ and $m \in \mathbb{N}$, and assume $L_D$ is bounded. Then

$$\forall P,\ \forall D, \quad \mathbb{P}_{S \sim D^m}\left(\forall Q,\ \ L_D(Q) \le 2\, L_S(Q) + 2\,\frac{\mathrm{KL}(Q\|P) + \log\frac{1}{\delta}}{m}\right) \ge 1 - \delta$$

Dziugaite and Roy 7/19
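For comparison with the kl-form bound sketched earlier, Catoni's linear form is even simpler to evaluate; a one-function sketch (ours), under the same caveats:

    import math

    def catoni_bound(emp_risk_Q, kl_q_p, m, delta):
        # L_D(Q) <= 2 * L_S(Q) + 2 * (KL(Q||P) + log(1/delta)) / m.
        return 2.0 * emp_risk_Q + 2.0 * (kl_q_p + math.log(1.0 / delta)) / m

Note the factor of 2 on the empirical term: this form is linear in $L_S(Q)$ and $\mathrm{KL}(Q\|P)$, which makes it convenient to optimize, at the price of some slack when $L_S(Q)$ is not small.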


Our approach

given $m$ i.i.d. data $S \sim D^m$

empirical error surface $w \mapsto L_S(h_w)$

• $w_{\mathrm{SGD}} \in \mathbb{R}^{472000}$: weights learned by SGD on MNIST

• $Q = \mathcal{N}(w_{\mathrm{SGD}} + w', \Sigma')$: stochastic neural net

generalization/error bound: $\forall D \quad \mathbb{P}_{S \sim D^m}\big(L_D(Q) < 0.17\big) > 0.95$

Dziugaite and Roy 8/19
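A minimal sketch of the object being bounded here (ours; `zero_one_error(w)` is a hypothetical helper that evaluates the network with flat weight vector w on the training set): the stochastic neural network Q is a Gaussian over weight vectors centered near $w_{\mathrm{SGD}}$, and $L_S(Q)$ is estimated by Monte Carlo.

    import numpy as np

    def estimate_snn_train_error(zero_one_error, w_mean, var_diag, n_samples=100, seed=0):
        # Q = N(w_mean, diag(var_diag)); L_S(Q) = E_{w ~ Q}[L_S(h_w)], estimated by
        # averaging the empirical error of randomly perturbed weight vectors.
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(n_samples):
            w = w_mean + rng.normal(size=w_mean.shape) * np.sqrt(var_diag)
            errors.append(zero_one_error(w))
        return float(np.mean(errors))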


Optimizing PAC-Bayes bounds

Given data $S$, we can find a provably good classifier $Q$ by optimizing the PAC-Bayes bound w.r.t. $Q$.

For Catoni’s PAC-Bayes bound, the optimization problem is of the form

$$\sup_Q\ -\tau L_S(Q) - \mathrm{KL}(Q\|P).$$

Lemma. The optimal $Q$ satisfies
$$\frac{dQ}{dP}(w) = \underbrace{\frac{\exp(-\tau L_S(w))}{\int \exp(-\tau L_S(w'))\, P(dw')}}_{\text{generalized Bayes rule}}.$$

Observation. Under log loss and $\tau = m$, the term $-\tau L_S(Q)$ is the expected log likelihood under $Q$, and the objective is the ELBO.

Lemma.
$$\log \int \exp(-\tau L_S(w))\, P(dw) = \sup_Q\ -\tau L_S(Q) - \mathrm{KL}(Q\|P).$$

Observation. Under log loss and $\tau = m$, the l.h.s. is the log marginal likelihood. Cf. Zhang 2004, 2006; Alquier et al. 2015; Germain et al. 2016.

Dziugaite and Roy 9/19
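The two lemmas are two sides of the same (Donsker–Varadhan / Gibbs variational) identity; a short derivation in the slide's notation, writing $Z := \int \exp(-\tau L_S(w))\, P(dw)$:

    \[
      -\tau L_S(Q) - \mathrm{KL}(Q\|P)
        \;=\; \int \Big(-\tau L_S(w) - \log\tfrac{dQ}{dP}(w)\Big)\, Q(dw)
        \;=\; \log Z \;-\; \mathrm{KL}(Q\,\|\,Q^*),
      \qquad \frac{dQ^*}{dP}(w) := \frac{e^{-\tau L_S(w)}}{Z}.
    \]

Since $\mathrm{KL}(Q\|Q^*) \ge 0$, with equality iff $Q = Q^*$, the supremum over $Q$ equals $\log Z$ and is attained at the generalized Bayes posterior $Q^*$, which gives both lemmas at once.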


PAC-Bayes Bound optimization

$$\inf_Q\ L_S(Q) + \frac{\mathrm{KL}(Q\|P) + \log\frac{1}{\delta}}{m}$$

Let $\tilde{L}_S(Q) \ge L_S(Q)$ with $\tilde{L}_S$ differentiable (a surrogate for the 0–1 loss).

$$\inf_Q\ m\, \tilde{L}_S(Q) + \mathrm{KL}(Q\|P)$$

Let $Q_{w,s} = \mathcal{N}(w, \mathrm{diag}(s))$.

$$\min_{w \in \mathbb{R}^d,\ s \in \mathbb{R}^d_+}\ m\, \tilde{L}_S(Q_{w,s}) + \mathrm{KL}(Q_{w,s}\|P)$$

Take $P = \mathcal{N}(w_0, \lambda I_d)$ with $\lambda = c \exp\{-j/b\}$.

$$\min_{w \in \mathbb{R}^d,\ s \in \mathbb{R}^d_+,\ \lambda \in (0,c)}\ m\, \tilde{L}_S(Q_{w,s}) + \underbrace{\mathrm{KL}(Q_{w,s}\|\mathcal{N}(w_0, \lambda I))}_{\frac{1}{2}\left(\frac{1}{\lambda}\|s\|_1 + \frac{1}{\lambda}\|w - w_0\|_2^2 + d \log\lambda - \mathbf{1}_d \cdot \log s - d\right)} + 2\log\Big(b \log\frac{c}{\lambda}\Big)$$

Dziugaite and Roy 10/19
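A condensed sketch of the resulting training step (ours, not the authors' released code; assumptions: PyTorch-style reparameterized sampling, one Monte Carlo sample per step, and a hypothetical helper `surrogate_loss(w, batch)` that evaluates the differentiable surrogate for the network with flat weight vector w; the default values of the grid hyperparameters b and c are illustrative):

    import torch

    def pac_bayes_objective(w, rho, log_lambda, w0, batch, m, surrogate_loss, b=100.0, c=0.1):
        # Posterior Q_{w,s} = N(w, diag(s)) with s = exp(2*rho) (rho = log std dev);
        # prior P = N(w0, lambda * I) with lambda = exp(log_lambda), lambda in (0, c).
        s = torch.exp(2.0 * rho)
        lam = torch.exp(log_lambda)
        d = w.numel()

        # One reparameterized sample w' ~ Q gives an unbiased estimate of the surrogate loss of Q.
        w_sample = w + torch.exp(rho) * torch.randn_like(w)
        emp_surrogate = surrogate_loss(w_sample, batch)

        # Closed-form KL(N(w, diag(s)) || N(w0, lambda I)), as on the slide.
        kl = 0.5 * (s.sum() / lam + ((w - w0) ** 2).sum() / lam
                    + d * torch.log(lam) - (2.0 * rho).sum() - d)

        # Penalty for the union bound over the grid lambda = c * exp(-j/b);
        # the constant log(1/delta) term is omitted since it does not affect optimization.
        penalty = 2.0 * torch.log(b * torch.log(c / lam))

        return m * emp_surrogate + kl + penalty

In the full recipe, $w$ is initialized at $w_{\mathrm{SGD}}$ and $(w, \rho, \log\lambda)$ are trained jointly with a gradient-based optimizer on this objective; the reported certificate is then computed on the 0–1 loss, using a Monte Carlo estimate of $L_S(Q)$ in the non-surrogate PAC-Bayes bound, in the spirit of Langford and Caruana (2002).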


Numerical generalization bounds on MNIST

# Hidden Layers      1        2        3        1 (R)

Train error          0.001    0.000    0.000    0.007
Test error           0.018    0.016    0.013    0.508
SNN train error      0.028    0.028    0.027    0.112
SNN test error       0.034    0.033    0.032    0.503
PAC-Bayes bound      0.161    0.186    0.201    1.352
KL divergence        5144     6534     7861     201131
# parameters         472k     832k     1193k    472k
VC dimension         26m      66m      121m     26m

We have shown that the type of flat minima found in practice can be turned into a generalization guarantee.

The bounds are loose, but they are the only nonvacuous bounds in this setting.

Dziugaite and Roy 11/19


Actually, SGD is pretty dangerous

• SGD achieves zero training error reliably

• Despite no explicit regularization, training and test error are very close

• Explicit regularization has a minor effect

• SGD can reliably obtain zero training error on randomized labels

• Hence, the Rademacher complexity of the model class is near maximal w.h.p.

Dziugaite and Roy 12/19


Entropy-SGD (Chaudhari et al., 2017)

Entropy-SGD replaces stochastic gradient descent on $L_S$ by stochastic gradient ascent applied to the optimization problem:

$$\arg\max_{w \in \mathbb{R}^d} F_{\gamma,\tau}(w; S), \quad \text{where} \quad F_{\gamma,\tau}(w; S) = \log \int_{\mathbb{R}^d} \exp\Big\{-\tau L_S(w') - \tau\,\frac{\gamma}{2}\,\|w' - w\|_2^2\Big\}\, dw'.$$

The local entropy $F_{\gamma,\tau}(\cdot\,; S)$ emphasizes flat minima of $L_S$.

Dziugaite and Roy 13/19
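For intuition, the gradient of the local entropy has a simple form, $\nabla_w F_{\gamma,\tau}(w; S) = \tau\gamma\,(\mathbb{E}[w'] - w)$, where the expectation is over the Gibbs measure proportional to $\exp\{-\tau L_S(w') - \tau\frac{\gamma}{2}\|w'-w\|_2^2\}$; Entropy-SGD estimates $\mathbb{E}[w']$ with a short inner SGLD chain. A rough sketch (ours, following Chaudhari et al. (2017) only loosely; `grad_LS(w, batch)` is a hypothetical stochastic-gradient helper, and the step sizes and loop lengths are illustrative):

    import numpy as np

    def entropy_sgd_step(w, grad_LS, batches, gamma=0.03, tau=1.0,
                         inner_steps=20, eta=1e-4, outer_lr=0.1, seed=0):
        # Estimate mu = E[w'] under the Gibbs measure with a short SGLD chain,
        # then move w along grad_w F = tau * gamma * (mu - w).
        rng = np.random.default_rng(seed)
        w_prime, mu = w.copy(), w.copy()
        for t in range(inner_steps):
            g = tau * grad_LS(w_prime, batches[t % len(batches)]) \
                + tau * gamma * (w_prime - w)                 # gradient of the Gibbs potential
            w_prime = w_prime - eta * g + np.sqrt(2.0 * eta) * rng.normal(size=w.shape)
            mu = 0.75 * mu + 0.25 * w_prime                   # running average of the chain
        return w + outer_lr * tau * gamma * (mu - w)          # gradient ascent on F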


Entropy-SGD optimizes PAC-Bayes bound w.r.t. prior

Entropy-SGD optimizes the local entropy

$$F_{\gamma,\tau}(w; S) = \log \int_{\mathbb{R}^d} \exp\Big\{-\tau L_S(w') - \tau\,\frac{\gamma}{2}\,\|w' - w\|_2^2\Big\}\, dw'.$$

Theorem. Maximizing $F_{\gamma,\tau}(w; S)$ w.r.t. $w$ corresponds to minimizing a PAC-Bayes risk bound w.r.t. the prior’s mean $w$.

Theorem. Let P(S) be an ε-differentially private distribution. Then

$$\forall D, \quad \mathbb{P}_{S \sim D^m}\left((\forall Q)\ \ \mathrm{KL}\big(L_S(Q)\,\|\,L_D(Q)\big) \le \frac{\mathrm{KL}(Q\|P(S)) + \ln 2m + 2\max\{\ln\frac{3}{\delta},\, m\varepsilon^2\}}{m - 1}\right) \ge 1 - \delta.$$

We optimize $F_{\gamma,\tau}(w; S)$ using SGLD, obtaining $(\varepsilon, \delta)$-differential privacy.

SGLD is known to converge weakly to the ε-differentially private exponential mechanism. Our analysis makes a coarse approximation: the privacy of SGLD is taken to be that of the exponential mechanism.

Dziugaite and Roy 14/19
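Numerically, the private bound is evaluated the same way as the earlier kl-form bound; a sketch (ours), reusing the `kl_inverse` helper defined above, with ε the privacy parameter of the data-dependent prior P(S):

    import math

    def private_pac_bayes_bound(emp_risk_Q, kl_q_p, m, delta, eps):
        # RHS of the differentially private PAC-Bayes bound on KL(L_S(Q) || L_D(Q)).
        rhs = (kl_q_p + math.log(2 * m) + 2 * max(math.log(3 / delta), m * eps ** 2)) / (m - 1)
        return kl_inverse(emp_risk_Q, rhs)  # invert the binary KL, as before

Note that when the $m\varepsilon^2$ term dominates, the privacy penalty contributes roughly $2\varepsilon^2$ to the right-hand side, so a modestly small ε already keeps the bound's extra cost small.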


Conclusion

• We show that the size/flatness/location of minima (that were found by SGD on MNIST) imply generalization using PAC-Bayes bounds;

• We show that Entropy-SGD optimizes the prior in a PAC-Bayes bound, which is not valid, because the prior then depends on the data;

• We give a differentially private version of the PAC-Bayes theorem and modify Entropy-SGD so that the prior is privately optimized.

Dziugaite and Roy 18/19


Achille, A. and S. Soatto (2017). “On the Emergence of Invariance and Disentangling in Deep Representations”. arXiv preprint arXiv:1706.01350.

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv: 0712.0248 [stat.ML].

Chaudhari, P., A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017). “Entropy-SGD: Biasing Gradient Descent Into Wide Valleys”. In: International Conference on Learning Representations (ICLR). arXiv: 1611.01838v4 [cs.LG].

Hinton, G. E. and D. van Camp (1993). “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights”. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT ’93. Santa Cruz, California, USA: ACM, pp. 5–13.

Hochreiter, S. and J. Schmidhuber (1997). “Flat Minima”. Neural Comput. 9.1, pp. 1–42.

Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”. In: International Conference on Learning Representations (ICLR). arXiv: 1609.04836v2 [cs.LG].

Langford, J. and R. Caruana (2002). “(Not) Bounding the True Error”. In: Advances in Neural Information Processing Systems 14. Ed. by T. G. Dietterich, S. Becker, and Z. Ghahramani. MIT Press, pp. 809–816.

Langford, J. and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS-01-102. Carnegie Mellon University.

McAllester, D. A. (1999). “PAC-Bayesian Model Averaging”. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory. COLT ’99. Santa Cruz, California, USA: ACM, pp. 164–170.

Dziugaite and Roy 19/19