
Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

Daniel M. Roy, University of Toronto; Vector Institute

Joint work with

Gintare K. Dziugaite, University of Cambridge

(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights

Long Beach, CA. December 2017

How does SGD work?

• Growing body of work arguing that SGD performs implicit regularization.

• Problem: no matching generalization bounds that are nonvacuous when applied to real data and networks.

• We focus on "flat minima" – weights w such that training error is "insensitive" to "large" perturbations.

• We show that the size/flatness/location of minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds.

• Focusing on MNIST, we show how to compute generalization bounds that are nonvacuous for stochastic networks with millions of weights.

• We obtain our (data-dependent, PAC-Bayesian) generalization bounds via a fair bit of computation with SGD. Our approach is a modern take on Langford and Caruana (2002).


Nonvacuous generalization bounds

risk: L_D(h) := E_{(x,y)∼D}[ℓ(h(x), y)],   D unknown

empirical risk: L_S(h) := (1/m) ∑_{i=1}^m ℓ(h(x_i), y_i),   S = {(x_1, y_1), . . . , (x_m, y_m)}

generalization error: L_D(h) − L_S(h)

∀D   P_{S∼D^m}( L_D(h) − L_S(h) < ε(H, m, δ, S, h) ) > 1 − δ,   where ε(H, m, δ, S, h) is the generalization error bound.

[Hand-drawn sketches of the generalization error bound: vs. model complexity (# data fixed), vs. # data (model complexity fixed), vs. complexity of H (# data fixed), and vs. # data (H fixed).]
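To make these quantities concrete, here is a minimal NumPy sketch (the synthetic data, the finite threshold class, and all constants are illustrative, not from the talk) that computes L_S(h) for an empirical risk minimizer and evaluates a Hoeffding-plus-union-bound instance of ε(H, m, δ, S, h) for a finite class H.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
m = 1000
X = rng.normal(size=(m, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))

# A small finite hypothesis class H: axis-aligned threshold classifiers h(x) = sign(x[j] - t).
thresholds = np.linspace(-2, 2, 41)
H = [(j, t) for j in range(2) for t in thresholds]

def empirical_risk(h, X, y):
    j, t = h
    preds = np.sign(X[:, j] - t)
    return np.mean(preds != y)          # L_S(h) under 0-1 loss

# Pick the empirical risk minimizer over H.
h_hat = min(H, key=lambda h: empirical_risk(h, X, y))

# Hoeffding + union bound over the finite class H:
# with prob >= 1 - delta, L_D(h) - L_S(h) <= sqrt(log(|H|/delta) / (2m)) for all h in H.
delta = 0.05
eps = np.sqrt(np.log(len(H) / delta) / (2 * m))

print("L_S(h_hat) =", empirical_risk(h_hat, X, y))
print("high-probability bound on L_D(h_hat):", empirical_risk(h_hat, X, y) + eps)
```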


SGD is X and X implies generalization

[Hand-drawn: I(w; D) = H(w) − H(w | D) = H(D) − H(D | w); "SGD is X", "X ⇒ generalization".]

"SGD is Empirical Risk Minimization for large enough networks"
"SGD is (Implicit) Regularized Loss Minimization"
"SGD is Approximate Bayesian Inference"

. . .

No statement of the form "SGD is X" explains generalization in deep learning until we know that X implies generalization under real-world conditions.


SGD is (not simply) empirical risk minimization

[Figure: training error of SGD at convergence; test error at convergence and with early stopping are identical.]

SGD ≈ Empirical Risk Minimization: argmin_{w∈H} L_S(w)

MNIST has 60,000 training examples.
A two-layer fully connected ReLU network has >1M parameters.
⇒ PAC bounds are vacuous.
⇒ PAC bounds can't explain this curve.
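A rough back-of-the-envelope check of the vacuousness claim, ignoring constants and logarithmic factors, and using the VC dimension reported for the smallest network in the table later in the talk:

```python
import math

# A VC-type bound scales roughly like sqrt(VC_dim / m) (constants and logs dropped).
m, vc_dim = 60_000, 26_000_000     # MNIST training set size; VC dim from the later table
print(math.sqrt(vc_dim / m))       # ~20.8, far above 1, so the bound is vacuous
```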


Our focus: Statistical Learning Aspect

[Hand-drawn: I(w; D) = H(w) − H(w | D) = H(D) − H(D | w); "SGD is X", "X ⇒ generalization".]

On MNIST, with realistic networks, . . .

• VC bounds don't imply generalization.

• Classic margin + norm-bounded Rademacher complexity bounds don't imply generalization.

• Being "Bayesian" does not necessarily imply generalization (sorry!).

Using PAC-Bayes bounds, we show that the size/flatness/location of minima, found by SGD on MNIST, imply generalization for MNIST.

Our bounds require a fair bit of computation/optimization to evaluate. Strictly speaking, they bound the error of a random perturbation of the SGD solution.


Flat minima... training error in flat minima is "insensitive" to "large" perturbations

(Hochreiter and Schmidhuber, 1997)

... meets the PAC-Bayes theorem (McAllester)

∀D ∀P   P_{S∼D^m}[ ∀Q:  Δ( L_S(Q), L_D(Q) ) ≤ ( KL(Q||P) + log(I_Δ(m)/δ) ) / m ] ≥ 1 − δ

For any data distribution D, i.e., no assumptions,
for any "prior" randomized classifier P, even nonsense,
with high probability over m i.i.d. samples S ∼ D^m,
for any "posterior" randomized classifier Q, not necessarily given by Bayes rule:

the generalization error of Q is bounded approximately by (1/m) KL(Q||P).
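As one concrete instance, taking Δ to be the binary KL divergence kl(q||p) (with I_Δ(m) ≤ 2√m, the Langford–Seeger form), the bound on L_D(Q) is obtained by numerically inverting the kl. A small Python sketch, with purely illustrative numbers at the end:

```python
import math

def binary_kl(q, p):
    """kl(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp_risk, bound, tol=1e-9):
    """Largest p in [emp_risk, 1] with kl(emp_risk || p) <= bound, by bisection."""
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_risk, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(emp_risk, kl_qp, m, delta):
    # With Delta = kl, the slide's bound reads
    #   kl(L_S(Q) || L_D(Q)) <= (KL(Q||P) + log(I(m)/delta)) / m,
    # and the Langford-Seeger form allows I(m) = 2*sqrt(m).
    rhs = (kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
    return kl_inverse(emp_risk, rhs)

# Illustrative numbers only (not the talk's):
print(pac_bayes_kl_bound(emp_risk=0.03, kl_qp=5000.0, m=60000, delta=0.05))
```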


Controlling generalization error of randomized classifiers

Let H be a hypothesis class of binary classifiers R^k → {−1, 1}. A randomized classifier is a distribution Q on H. Its risk is

L_D(Q) = E_{w∼Q}[ L_D(h_w) ].

Among the sharpest generalization bounds for randomized classifiers are PAC-Bayes bounds (McAllester, 1999).

Theorem (PAC-Bayes; Catoni, 2007).
Let δ > 0 and m ∈ N, and assume L_D is bounded. Then

∀P, ∀D,   P_{S∼D^m}( ∀Q,  L_D(Q) ≤ 2 L_S(Q) + 2 ( KL(Q||P) + log(1/δ) ) / m ) ≥ 1 − δ.
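As an illustration of the theorem, the sketch below builds a toy randomized classifier Q (a Gaussian over a single threshold parameter on synthetic 1-D data, with a wider Gaussian prior P; all numbers illustrative), estimates L_S(Q) by Monte Carlo, and evaluates the displayed bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data (illustrative only).
m = 2000
x = rng.normal(size=m)
y = np.sign(x - 0.2)

def L_S(t):
    """Empirical 0-1 risk of the threshold classifier h_t(x) = sign(x - t)."""
    return np.mean(np.sign(x - t) != y)

# Randomized classifier Q = N(mu, sigma^2) over the threshold t,
# prior P = N(0, sigma0^2). KL(Q||P) has the usual Gaussian closed form.
mu, sigma = 0.25, 0.05
sigma0 = 1.0
kl_qp = np.log(sigma0 / sigma) + (sigma**2 + mu**2) / (2 * sigma0**2) - 0.5

# Monte Carlo estimate of L_S(Q) = E_{t~Q}[L_S(h_t)].
ts = rng.normal(mu, sigma, size=500)
LS_Q = np.mean([L_S(t) for t in ts])

# Catoni-style bound from the slide: L_D(Q) <= 2 L_S(Q) + 2 (KL(Q||P) + log(1/delta)) / m.
delta = 0.05
bound = 2 * LS_Q + 2 * (kl_qp + np.log(1 / delta)) / m
print(f"L_S(Q) ~= {LS_Q:.4f}, KL = {kl_qp:.3f}, bound on L_D(Q): {bound:.4f}")
```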


Our approach

given m i.i.d. data S ∼ D^m, and the empirical error surface w ↦ L_S(h_w):

• w_SGD ∈ R^472000: weights learned by SGD on MNIST

⊕ Q = N(w_SGD + w′, Σ′): stochastic neural net

generalization/error bound:   ∀D   P_{S∼D^m}( L_D(Q) < 0.17 ) > 0.95
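A minimal PyTorch sketch of the stochastic-network step: it assumes an already-trained `model` and a data `loader`, and for simplicity uses an isotropic perturbation scale `sigma`, whereas the talk's Q has a learned mean shift w′ and diagonal covariance Σ′.

```python
import copy
import torch

@torch.no_grad()
def snn_error(model, loader, sigma=0.02, n_samples=10, device="cpu"):
    """Monte Carlo estimate of the 0-1 error of the stochastic net
    Q = N(w_trained, sigma^2 I): sample weight perturbations, average errors."""
    errors = []
    for _ in range(n_samples):
        noisy = copy.deepcopy(model).to(device)
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))   # w ~ N(w_trained, sigma^2 I)
        wrong, total = 0, 0
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = noisy(xb).argmax(dim=1)
            wrong += (pred != yb).sum().item()
            total += yb.numel()
        errors.append(wrong / total)
    return sum(errors) / len(errors)
```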


Optimizing PAC-Bayes bounds

Given data S, we can find a provably good classifier Q by optimizing the PAC-Bayes bound w.r.t. Q.

For Catoni's PAC-Bayes bound, the optimization problem is of the form

sup_Q  −τ L_S(Q) − KL(Q||P).

Lemma. The optimal Q satisfies

dQ/dP(w) = exp(−τ L_S(w)) / ∫ exp(−τ L_S(w)) P(dw)      (generalized Bayes rule).

Observation. Under log loss and τ = m, the term −τ L_S(w) is the expected log likelihood under Q and the objective is the ELBO.

Lemma.  log ∫ exp(−τ L_S(w)) P(dw) = sup_Q  −τ L_S(Q) − KL(Q||P).

Observation. Under log loss and τ = m, the l.h.s. is the log marginal likelihood. Cf. Zhang 2004, 2006; Alquier et al. 2015; Germain et al. 2016.
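Both lemmas can be checked numerically on a finite hypothesis space. The sketch below (uniform prior, random empirical risks, illustrative τ) verifies that the Gibbs posterior attains the supremum and that the attained value equals the log partition function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite hypothesis space with k hypotheses (illustrative).
k = 6
P = np.full(k, 1.0 / k)                 # uniform "prior" P
L = rng.uniform(0.0, 0.5, size=k)       # empirical risks L_S(w) of each hypothesis
tau = 50.0

def objective(Q):
    """-tau * L_S(Q) - KL(Q || P)"""
    return -tau * Q @ L - np.sum(Q * np.log(Q / P))

# Gibbs / generalized-Bayes posterior: dQ/dP proportional to exp(-tau * L_S).
gibbs = P * np.exp(-tau * L)
log_partition = np.log(gibbs.sum())     # log of  sum_w exp(-tau L_S(w)) P(w)
gibbs /= gibbs.sum()

# Lemma check: the supremum equals the log partition function and is attained by Gibbs.
print("objective(Gibbs)  =", objective(gibbs))
print("log partition     =", log_partition)

# Any other Q does no better.
Q_rand = rng.dirichlet(np.ones(k))
print("objective(random) =", objective(Q_rand))
```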


PAC-Bayes Bound optimization

inf_Q   L_S(Q) + ( KL(Q||P) + log(1/δ) ) / m

Let L̃_S(Q) ≥ L_S(Q) with L̃_S differentiable.

inf_Q   m L̃_S(Q) + KL(Q||P)

Let Q_{w,s} = N(w, diag(s)).

min_{w ∈ R^d, s ∈ R^d_+}   m L̃_S(Q_{w,s}) + KL(Q_{w,s}||P)

Take P = N(w_0, λ I_d) with λ = c exp{−j/b}.

min_{w ∈ R^d, s ∈ R^d_+, λ ∈ (0,c)}   m L̃_S(Q_{w,s}) + KL(Q_{w,s}||N(w_0, λI)) + 2 log( b log(c/λ) ),

where KL(Q_{w,s}||N(w_0, λI)) = (1/2) ( (1/λ)‖s‖_1 + (1/λ)‖w − w_0‖_2^2 + d log λ − 1_d · log s − d ).
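A sketch of this final objective as a differentiable PyTorch function. It assumes a caller-supplied `surrogate_loss` playing the role of L̃_S, flat parameter tensors, and an unconstrained reparametrization of s and λ; names and default hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def pac_bayes_objective(w, rho, psi, w0, m, surrogate_loss, b=100.0, c=0.1):
    """
    One stochastic evaluation of the slide's final objective
        m * L~_S(Q_{w,s}) + KL(Q_{w,s} || N(w0, lambda I)) + 2 log(b log(c/lambda)),
    with Q_{w,s} = N(w, diag(s)),  s = exp(rho),  lambda = c * exp(-softplus(psi)) in (0, c).
    `surrogate_loss(w_sample)` is assumed to return a differentiable upper bound on L_S
    averaged over the training set (e.g., a bounded cross-entropy).
    """
    d = w.numel()
    s = rho.exp()                               # diagonal posterior variances s_i > 0
    lam = c * torch.exp(-F.softplus(psi))       # prior variance, constrained to (0, c)

    # Reparameterized sample w' ~ Q_{w,s} for the data term.
    w_sample = w + s.sqrt() * torch.randn_like(w)
    data_term = m * surrogate_loss(w_sample)

    # Closed-form KL( N(w, diag(s)) || N(w0, lam I) ), as on the slide.
    kl = 0.5 * (s.sum() / lam
                + (w - w0).pow(2).sum() / lam
                + d * lam.log()
                - rho.sum()                     # = sum_i log s_i
                - d)

    # Union-bound penalty for optimizing over the (discretized) prior scale lambda.
    penalty = 2 * torch.log(b * torch.log(c / lam))
    return data_term + kl + penalty
```

Typical use would be to make `w`, `rho`, and `psi` leaf tensors with `requires_grad=True`, minimize this objective with a stochastic gradient method, and then compute the final certificate with a PAC-Bayes-kl bound rather than this training surrogate.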


Numerical generalization bounds on MNIST

                     # Hidden Layers
                     1         2         3         1 (R)
Train error          0.001     0.000     0.000     0.007
Test error           0.018     0.016     0.013     0.508
SNN train error      0.028     0.028     0.027     0.112
SNN test error       0.034     0.033     0.032     0.503
PAC-Bayes bound      0.161     0.186     0.201     1.352
KL divergence        5144      6534      7861      201131
# parameters         472k      832k      1193k     472k
VC dimension         26m       66m       121m      26m

We have shown that the type of flat minima found in practice can be turned into a generalization guarantee.

The bounds are loose, but they are the only nonvacuous bounds in this setting.


Actually, SGD is pretty dangerous

• SGD achieves zero training error reliably.

• Despite no explicit regularization, training and test error are very close.

• Explicit regularization has a minor effect.

• SGD can reliably obtain zero training error on randomized labels (see the sketch below).

• Hence, the Rademacher complexity of the model class is near maximal w.h.p.
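A sketch of the randomized-label experiment behind the last two bullets (architecture, hyperparameters, and epoch count are illustrative; memorizing random labels typically takes many more epochs than fitting the true labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn
from torchvision import datasets, transforms

# Fit MNIST with randomly permuted labels: SGD can still drive training error
# toward zero, so uniform complexity measures for this class cannot be small.
train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
train.targets = train.targets[torch.randperm(len(train.targets))]  # destroy label signal
loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 1200), nn.ReLU(),
                      nn.Linear(1200, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(100):
    wrong = 0
    for xb, yb in loader:
        opt.zero_grad()
        out = model(xb)
        Fn.cross_entropy(out, yb).backward()
        opt.step()
        wrong += (out.argmax(1) != yb).sum().item()
    print(epoch, "train error on random labels:", wrong / len(train))
```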


Entropy-SGD (Chaudhari et al., 2017)

Entropy-SGD replaces stochastic gradient descent on L_S by stochastic gradient ascent applied to the optimization problem

argmax_{w ∈ R^d}  F_{γ,τ}(w; S),

where

F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w′) − τ (γ/2) ‖w′ − w‖_2^2 } dw′.

The local entropy F_{γ,τ}(·; S) emphasizes flat minima of L_S.
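A simplified PyTorch sketch of one Entropy-SGD step, following the structure described by Chaudhari et al. (2017), with τ absorbed into the loss: an inner SGLD loop samples around the current weights, and the outer update moves the weights toward the SGLD running mean, since ∇_w F_{γ,τ}(w; S) is proportional to the difference between that mean and w. Hyperparameters, the noise scale, and the moving-average constant are illustrative, not the paper's.

```python
import copy
import torch

def entropy_sgd_step(model, loss_fn, batches, outer_lr=0.1, gamma=0.03,
                     sgld_lr=0.01, noise=1e-4, inner_steps=20, alpha=0.75):
    """One simplified Entropy-SGD outer step.

    The inner SGLD loop samples w' approximately from the local Gibbs measure
    proportional to exp(-L_S(w') - (gamma/2)||w' - w||^2); a running average mu
    tracks E[w'], and the outer update moves w toward mu.
    `batches` must yield at least `inner_steps` (inputs, targets) minibatches.
    """
    w = [p.detach().clone() for p in model.parameters()]
    inner = copy.deepcopy(model)
    mu = [p.detach().clone() for p in inner.parameters()]

    batch_iter = iter(batches)
    for _ in range(inner_steps):
        xb, yb = next(batch_iter)
        inner.zero_grad()
        loss_fn(inner(xb), yb).backward()
        with torch.no_grad():
            for p, w_i, mu_i in zip(inner.parameters(), w, mu):
                grad = p.grad + gamma * (p - w_i)                      # local Gibbs energy gradient
                p.add_(-sgld_lr * grad + noise * torch.randn_like(p))  # SGLD step
                mu_i.mul_(alpha).add_((1 - alpha) * p)                 # running mean of SGLD iterates

    with torch.no_grad():
        for p, mu_i in zip(model.parameters(), mu):
            p.add_(outer_lr * gamma * (mu_i - p))   # ascend local entropy: grad_w F is prop. to (mu - w)
```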


Entropy-SGD optimizes PAC-Bayes bound w.r.t. prior

Entropy-SGD optimizes the local entropy

F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w′) − τ (γ/2) ‖w′ − w‖_2^2 } dw′.

Theorem. Maximizing F_{γ,τ}(w; S) w.r.t. w corresponds to minimizing a PAC-Bayes risk bound w.r.t. the prior's mean w.

Theorem. Let P(S) be an ε-differentially private distribution. Then

∀D,   P_{S∼D^m}( (∀Q)  KL( L_S(Q) || L_D(Q) ) ≤ ( KL(Q||P(S)) + ln(2m) + 2 max{ ln(3/δ), m ε^2 } ) / (m − 1) ) ≥ 1 − δ.

We optimize F_{γ,τ}(w; S) using SGLD, obtaining (ε, δ)-differential privacy.

SGLD is known to converge weakly to the ε-differentially private exponential mechanism. Our analysis makes a coarse approximation: the privacy of SGLD is taken to be that of the exponential mechanism.
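A small numerical sketch of how the privacy parameter ε enters this bound (illustrative numbers; the binary kl could be inverted by bisection as in the earlier sketch, but here the weaker Pinsker relaxation is used for brevity):

```python
import math

def dp_pac_bayes_rhs(kl_qp, m, eps, delta):
    """Right-hand side of the differentially private PAC-Bayes bound on
    kl(L_S(Q) || L_D(Q)) from the slide, where P(S) is eps-DP."""
    return (kl_qp + math.log(2 * m) + 2 * max(math.log(3 / delta), m * eps**2)) / (m - 1)

def risk_bound_pinsker(emp_risk, kl_qp, m, eps, delta):
    """Weaker but closed-form consequence via Pinsker: kl(q||p) >= 2 (p - q)^2."""
    return emp_risk + math.sqrt(dp_pac_bayes_rhs(kl_qp, m, eps, delta) / 2)

# Illustrative numbers (not the talk's): the m * eps^2 term dominates unless the
# privacy parameter eps is on the order of 1/sqrt(m).
print(risk_bound_pinsker(emp_risk=0.03, kl_qp=5000.0, m=60000, eps=0.005, delta=0.05))
```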



Conclusion

• We show that the size/flatness/location of minima (that were found by SGD on MNIST) imply generalization, using PAC-Bayes bounds;

• We show that Entropy-SGD optimizes the prior in a PAC-Bayes bound, which is not valid (the optimized prior depends on the data);

• We give a differentially private version of the PAC-Bayes theorem and modify Entropy-SGD so that the prior is privately optimized.


Achille, A. and S. Soatto (2017). "On the Emergence of Invariance and Disentangling in Deep Representations". arXiv preprint arXiv:1706.01350.

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv: 0712.0248 [stat.ML].

Chaudhari, P., A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017). "Entropy-SGD: Biasing Gradient Descent Into Wide Valleys". In: International Conference on Learning Representations (ICLR). arXiv: 1611.01838v4 [cs.LG].

Hinton, G. E. and D. van Camp (1993). "Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights". In: Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT '93. Santa Cruz, California, USA: ACM, pp. 5–13.

Hochreiter, S. and J. Schmidhuber (1997). "Flat Minima". Neural Comput. 9.1, pp. 1–42.

Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima". In: International Conference on Learning Representations (ICLR). arXiv: 1609.04836v2 [cs.LG].

Langford, J. and R. Caruana (2002). "(Not) Bounding the True Error". In: Advances in Neural Information Processing Systems 14. Ed. by T. G. Dietterich, S. Becker, and Z. Ghahramani. MIT Press, pp. 809–816.

Langford, J. and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS-01-102. Carnegie Mellon University.

McAllester, D. A. (1999). "PAC-Bayesian Model Averaging". In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory. COLT '99. Santa Cruz, California, USA: ACM, pp. 164–170.
