
Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

Daniel M. Roy, University of Toronto; Vector Institute

Joint work with

Gintare K. Dziugaite, University of Cambridge

(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights

Long Beach, CA. December 2017

How does SGD work?

• Growing body of work arguing that SGD performs implicit regularization.

• Problem: no matching generalization bounds that are nonvacuous when applied to real data and networks.

• We focus on "flat minima" – weights w such that training error is "insensitive" to "large" perturbations.

• We show that the size/flatness/location of minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds.

• Focusing on MNIST, we show how to compute generalization bounds that are nonvacuous for stochastic networks with millions of weights.

• We obtain our (data-dependent, PAC-Bayesian) generalization bounds via a fair bit of computation with SGD. Our approach is a modern take on Langford and Caruana (2002).


Nonvacuous generalization bounds

risk: L_D(h) := E_{(x,y)∼D}[ℓ(h(x), y)],   D unknown

empirical risk: L_S(h) := (1/m) ∑_{i=1}^m ℓ(h(x_i), y_i),   S = {(x_1, y_1), . . . , (x_m, y_m)}

generalization error: L_D(h) − L_S(h)

∀D   P_{S∼D^m}( L_D(h) − L_S(h) < ε(H, m, δ, S, h) ) > 1 − δ,   where ε(H, m, δ, S, h) is the generalization error bound.

[Hand-drawn sketches of the generalization error bound: vs. model complexity (# data fixed), vs. # data (model complexity fixed), vs. complexity of H (# data fixed), and vs. # data (H fixed).]
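To make these quantities concrete, here is a minimal NumPy sketch (the synthetic data, the finite threshold class, and all constants are illustrative, not from the talk) that computes L_S(h) for an empirical risk minimizer and evaluates a Hoeffding-plus-union-bound instance of ε(H, m, δ, S, h) for a finite class H.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
m = 1000
X = rng.normal(size=(m, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))

# A small finite hypothesis class H: axis-aligned threshold classifiers h(x) = sign(x[j] - t).
thresholds = np.linspace(-2, 2, 41)
H = [(j, t) for j in range(2) for t in thresholds]

def empirical_risk(h, X, y):
    j, t = h
    preds = np.sign(X[:, j] - t)
    return np.mean(preds != y)          # L_S(h) under 0-1 loss

# Pick the empirical risk minimizer over H.
h_hat = min(H, key=lambda h: empirical_risk(h, X, y))

# Hoeffding + union bound over the finite class H:
# with prob >= 1 - delta, L_D(h) - L_S(h) <= sqrt(log(|H|/delta) / (2m)) for all h in H.
delta = 0.05
eps = np.sqrt(np.log(len(H) / delta) / (2 * m))

print("L_S(h_hat) =", empirical_risk(h_hat, X, y))
print("high-probability bound on L_D(h_hat):", empirical_risk(h_hat, X, y) + eps)
```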


SGD is X and X implies generalization

[Hand-drawn: I(w; D) = H(w) − H(w | D) = H(D) − H(D | w); "SGD is X", "X ⇒ generalization".]

"SGD is Empirical Risk Minimization for large enough networks"
"SGD is (Implicit) Regularized Loss Minimization"
"SGD is Approximate Bayesian Inference"

. . .

No statement of the form "SGD is X" explains generalization in deep learning until we know that X implies generalization under real-world conditions.


SGD is (not simply) empirical risk minimization

[Figure: training error of SGD at convergence; test error at convergence and with early stopping are identical.]

SGD ≈ Empirical Risk Minimization: argmin_{w∈H} L_S(w)

MNIST has 60,000 training examples.
A two-layer fully connected ReLU network has >1M parameters.
⇒ PAC bounds are vacuous.
⇒ PAC bounds can't explain this curve.
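A rough back-of-the-envelope check of the vacuousness claim, ignoring constants and logarithmic factors, and using the VC dimension reported for the smallest network in the table later in the talk:

```python
import math

# A VC-type bound scales roughly like sqrt(VC_dim / m) (constants and logs dropped).
m, vc_dim = 60_000, 26_000_000     # MNIST training set size; VC dim from the later table
print(math.sqrt(vc_dim / m))       # ~20.8, far above 1, so the bound is vacuous
```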


Our focus: Statistical Learning Aspect

[Hand-drawn: I(w; D) = H(w) − H(w | D) = H(D) − H(D | w); "SGD is X", "X ⇒ generalization".]

On MNIST, with realistic networks, . . .

• VC bounds don't imply generalization.

• Classic margin + norm-bounded Rademacher complexity bounds don't imply generalization.

• Being "Bayesian" does not necessarily imply generalization (sorry!).

Using PAC-Bayes bounds, we show that the size/flatness/location of minima, found by SGD on MNIST, imply generalization for MNIST.

Our bounds require a fair bit of computation/optimization to evaluate. Strictly speaking, they bound the error of a random perturbation of the SGD solution.


Flat minima... training error in flat minima is "insensitive" to "large" perturbations

(Hochreiter and Schmidhuber, 1997)

... meets the PAC-Bayes theorem (McAllester)

∀D ∀P   P_{S∼D^m}[ ∀Q:  Δ( L_S(Q), L_D(Q) ) ≤ ( KL(Q||P) + log(I_Δ(m)/δ) ) / m ] ≥ 1 − δ

For any data distribution D, i.e., no assumptions,
for any "prior" randomized classifier P, even nonsense,
with high probability over m i.i.d. samples S ∼ D^m,
for any "posterior" randomized classifier Q, not necessarily given by Bayes rule:

the generalization error of Q is bounded approximately by (1/m) KL(Q||P).
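As one concrete instance, taking Δ to be the binary KL divergence kl(q||p) (with I_Δ(m) ≤ 2√m, the Langford–Seeger form), the bound on L_D(Q) is obtained by numerically inverting the kl. A small Python sketch, with purely illustrative numbers at the end:

```python
import math

def binary_kl(q, p):
    """kl(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp_risk, bound, tol=1e-9):
    """Largest p in [emp_risk, 1] with kl(emp_risk || p) <= bound, by bisection."""
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_risk, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(emp_risk, kl_qp, m, delta):
    # With Delta = kl, the slide's bound reads
    #   kl(L_S(Q) || L_D(Q)) <= (KL(Q||P) + log(I(m)/delta)) / m,
    # and the Langford-Seeger form allows I(m) = 2*sqrt(m).
    rhs = (kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
    return kl_inverse(emp_risk, rhs)

# Illustrative numbers only (not the talk's):
print(pac_bayes_kl_bound(emp_risk=0.03, kl_qp=5000.0, m=60000, delta=0.05))
```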


Controlling generalization error of randomized classifiers

Let H be a hypothesis class of binary classifiers R^k → {−1, 1}. A randomized classifier is a distribution Q on H. Its risk is

L_D(Q) = E_{w∼Q}[ L_D(h_w) ].

Among the sharpest generalization bounds for randomized classifiers are PAC-Bayes bounds (McAllester, 1999).

Theorem (PAC-Bayes; Catoni, 2007).
Let δ > 0 and m ∈ N, and assume L_D is bounded. Then

∀P, ∀D,   P_{S∼D^m}( ∀Q,  L_D(Q) ≤ 2 L_S(Q) + 2 ( KL(Q||P) + log(1/δ) ) / m ) ≥ 1 − δ.
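As an illustration of the theorem, the sketch below builds a toy randomized classifier Q (a Gaussian over a single threshold parameter on synthetic 1-D data, with a wider Gaussian prior P; all numbers illustrative), estimates L_S(Q) by Monte Carlo, and evaluates the displayed bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1-D data (illustrative only).
m = 2000
x = rng.normal(size=m)
y = np.sign(x - 0.2)

def L_S(t):
    """Empirical 0-1 risk of the threshold classifier h_t(x) = sign(x - t)."""
    return np.mean(np.sign(x - t) != y)

# Randomized classifier Q = N(mu, sigma^2) over the threshold t,
# prior P = N(0, sigma0^2). KL(Q||P) has the usual Gaussian closed form.
mu, sigma = 0.25, 0.05
sigma0 = 1.0
kl_qp = np.log(sigma0 / sigma) + (sigma**2 + mu**2) / (2 * sigma0**2) - 0.5

# Monte Carlo estimate of L_S(Q) = E_{t~Q}[L_S(h_t)].
ts = rng.normal(mu, sigma, size=500)
LS_Q = np.mean([L_S(t) for t in ts])

# Catoni-style bound from the slide: L_D(Q) <= 2 L_S(Q) + 2 (KL(Q||P) + log(1/delta)) / m.
delta = 0.05
bound = 2 * LS_Q + 2 * (kl_qp + np.log(1 / delta)) / m
print(f"L_S(Q) ~= {LS_Q:.4f}, KL = {kl_qp:.3f}, bound on L_D(Q): {bound:.4f}")
```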


Our approach

given m i.i.d. data S ∼ D^m, and the empirical error surface w ↦ L_S(h_w):

• w_SGD ∈ R^472000: weights learned by SGD on MNIST

⊕ Q = N(w_SGD + w′, Σ′): stochastic neural net

generalization/error bound:   ∀D   P_{S∼D^m}( L_D(Q) < 0.17 ) > 0.95
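A minimal PyTorch sketch of the stochastic-network step: it assumes an already-trained `model` and a data `loader`, and for simplicity uses an isotropic perturbation scale `sigma`, whereas the talk's Q has a learned mean shift w′ and diagonal covariance Σ′.

```python
import copy
import torch

@torch.no_grad()
def snn_error(model, loader, sigma=0.02, n_samples=10, device="cpu"):
    """Monte Carlo estimate of the 0-1 error of the stochastic net
    Q = N(w_trained, sigma^2 I): sample weight perturbations, average errors."""
    errors = []
    for _ in range(n_samples):
        noisy = copy.deepcopy(model).to(device)
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))   # w ~ N(w_trained, sigma^2 I)
        wrong, total = 0, 0
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = noisy(xb).argmax(dim=1)
            wrong += (pred != yb).sum().item()
            total += yb.numel()
        errors.append(wrong / total)
    return sum(errors) / len(errors)
```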


Optimizing PAC-Bayes bounds

Given data S, we can find a provably good classifier Q by optimizing the PAC-Bayes bound w.r.t. Q.

For Catoni's PAC-Bayes bound, the optimization problem is of the form

sup_Q  −τ L_S(Q) − KL(Q||P).

Lemma. The optimal Q satisfies

dQ/dP(w) = exp(−τ L_S(w)) / ∫ exp(−τ L_S(w)) P(dw)      (generalized Bayes rule).

Observation. Under log loss and τ = m, the term −τ L_S(w) is the expected log likelihood under Q and the objective is the ELBO.

Lemma.  log ∫ exp(−τ L_S(w)) P(dw) = sup_Q  −τ L_S(Q) − KL(Q||P).

Observation. Under log loss and τ = m, the l.h.s. is the log marginal likelihood. Cf. Zhang 2004, 2006; Alquier et al. 2015; Germain et al. 2016.
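Both lemmas can be checked numerically on a finite hypothesis space. The sketch below (uniform prior, random empirical risks, illustrative τ) verifies that the Gibbs posterior attains the supremum and that the attained value equals the log partition function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite hypothesis space with k hypotheses (illustrative).
k = 6
P = np.full(k, 1.0 / k)                 # uniform "prior" P
L = rng.uniform(0.0, 0.5, size=k)       # empirical risks L_S(w) of each hypothesis
tau = 50.0

def objective(Q):
    """-tau * L_S(Q) - KL(Q || P)"""
    return -tau * Q @ L - np.sum(Q * np.log(Q / P))

# Gibbs / generalized-Bayes posterior: dQ/dP proportional to exp(-tau * L_S).
gibbs = P * np.exp(-tau * L)
log_partition = np.log(gibbs.sum())     # log of  sum_w exp(-tau L_S(w)) P(w)
gibbs /= gibbs.sum()

# Lemma check: the supremum equals the log partition function and is attained by Gibbs.
print("objective(Gibbs)  =", objective(gibbs))
print("log partition     =", log_partition)

# Any other Q does no better.
Q_rand = rng.dirichlet(np.ones(k))
print("objective(random) =", objective(Q_rand))
```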


PAC-Bayes Bound optimization

inf_Q   L_S(Q) + ( KL(Q||P) + log(1/δ) ) / m

Let L̃_S(Q) ≥ L_S(Q) with L̃_S differentiable.

inf_Q   m L̃_S(Q) + KL(Q||P)

Let Q_{w,s} = N(w, diag(s)).

min_{w ∈ R^d, s ∈ R^d_+}   m L̃_S(Q_{w,s}) + KL(Q_{w,s}||P)

Take P = N(w_0, λ I_d) with λ = c exp{−j/b}.

min_{w ∈ R^d, s ∈ R^d_+, λ ∈ (0,c)}   m L̃_S(Q_{w,s}) + KL(Q_{w,s}||N(w_0, λI)) + 2 log( b log(c/λ) ),

where KL(Q_{w,s}||N(w_0, λI)) = (1/2) ( (1/λ)‖s‖_1 + (1/λ)‖w − w_0‖_2^2 + d log λ − 1_d · log s − d ).
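A sketch of this final objective as a differentiable PyTorch function. It assumes a caller-supplied `surrogate_loss` playing the role of L̃_S, flat parameter tensors, and an unconstrained reparametrization of s and λ; names and default hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def pac_bayes_objective(w, rho, psi, w0, m, surrogate_loss, b=100.0, c=0.1):
    """
    One stochastic evaluation of the slide's final objective
        m * L~_S(Q_{w,s}) + KL(Q_{w,s} || N(w0, lambda I)) + 2 log(b log(c/lambda)),
    with Q_{w,s} = N(w, diag(s)),  s = exp(rho),  lambda = c * exp(-softplus(psi)) in (0, c).
    `surrogate_loss(w_sample)` is assumed to return a differentiable upper bound on L_S
    averaged over the training set (e.g., a bounded cross-entropy).
    """
    d = w.numel()
    s = rho.exp()                               # diagonal posterior variances s_i > 0
    lam = c * torch.exp(-F.softplus(psi))       # prior variance, constrained to (0, c)

    # Reparameterized sample w' ~ Q_{w,s} for the data term.
    w_sample = w + s.sqrt() * torch.randn_like(w)
    data_term = m * surrogate_loss(w_sample)

    # Closed-form KL( N(w, diag(s)) || N(w0, lam I) ), as on the slide.
    kl = 0.5 * (s.sum() / lam
                + (w - w0).pow(2).sum() / lam
                + d * lam.log()
                - rho.sum()                     # = sum_i log s_i
                - d)

    # Union-bound penalty for optimizing over the (discretized) prior scale lambda.
    penalty = 2 * torch.log(b * torch.log(c / lam))
    return data_term + kl + penalty
```

Typical use would be to make `w`, `rho`, and `psi` leaf tensors with `requires_grad=True`, minimize this objective with a stochastic gradient method, and then compute the final certificate with a PAC-Bayes-kl bound rather than this training surrogate.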


Numerical generalization bounds on MNIST

                     # Hidden Layers
                     1         2         3         1 (R)
Train error          0.001     0.000     0.000     0.007
Test error           0.018     0.016     0.013     0.508
SNN train error      0.028     0.028     0.027     0.112
SNN test error       0.034     0.033     0.032     0.503
PAC-Bayes bound      0.161     0.186     0.201     1.352
KL divergence        5144      6534      7861      201131
# parameters         472k      832k      1193k     472k
VC dimension         26m       66m       121m      26m

We have shown that the type of flat minima found in practice can be turned into a generalization guarantee.

The bounds are loose, but they are the only nonvacuous bounds in this setting.


Actually, SGD is pretty dangerous

• SGD achieves zero training error reliably.

• Despite no explicit regularization, training and test error are very close.

• Explicit regularization has a minor effect.

• SGD can reliably obtain zero training error on randomized labels (see the sketch below).

• Hence, the Rademacher complexity of the model class is near maximal w.h.p.
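A sketch of the randomized-label experiment behind the last two bullets (architecture, hyperparameters, and epoch count are illustrative; memorizing random labels typically takes many more epochs than fitting the true labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn
from torchvision import datasets, transforms

# Fit MNIST with randomly permuted labels: SGD can still drive training error
# toward zero, so uniform complexity measures for this class cannot be small.
train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
train.targets = train.targets[torch.randperm(len(train.targets))]  # destroy label signal
loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 1200), nn.ReLU(),
                      nn.Linear(1200, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(100):
    wrong = 0
    for xb, yb in loader:
        opt.zero_grad()
        out = model(xb)
        Fn.cross_entropy(out, yb).backward()
        opt.step()
        wrong += (out.argmax(1) != yb).sum().item()
    print(epoch, "train error on random labels:", wrong / len(train))
```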


Entropy-SGD (Chaudhari et al., 2017)

Entropy-SGD replaces stochastic gradient descent on L_S by stochastic gradient ascent applied to the optimization problem

argmax_{w ∈ R^d}  F_{γ,τ}(w; S),

where

F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w′) − τ (γ/2) ‖w′ − w‖_2^2 } dw′.

The local entropy F_{γ,τ}(·; S) emphasizes flat minima of L_S.
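A simplified PyTorch sketch of one Entropy-SGD step, following the structure described by Chaudhari et al. (2017), with τ absorbed into the loss: an inner SGLD loop samples around the current weights, and the outer update moves the weights toward the SGLD running mean, since ∇_w F_{γ,τ}(w; S) is proportional to the difference between that mean and w. Hyperparameters, the noise scale, and the moving-average constant are illustrative, not the paper's.

```python
import copy
import torch

def entropy_sgd_step(model, loss_fn, batches, outer_lr=0.1, gamma=0.03,
                     sgld_lr=0.01, noise=1e-4, inner_steps=20, alpha=0.75):
    """One simplified Entropy-SGD outer step.

    The inner SGLD loop samples w' approximately from the local Gibbs measure
    proportional to exp(-L_S(w') - (gamma/2)||w' - w||^2); a running average mu
    tracks E[w'], and the outer update moves w toward mu.
    `batches` must yield at least `inner_steps` (inputs, targets) minibatches.
    """
    w = [p.detach().clone() for p in model.parameters()]
    inner = copy.deepcopy(model)
    mu = [p.detach().clone() for p in inner.parameters()]

    batch_iter = iter(batches)
    for _ in range(inner_steps):
        xb, yb = next(batch_iter)
        inner.zero_grad()
        loss_fn(inner(xb), yb).backward()
        with torch.no_grad():
            for p, w_i, mu_i in zip(inner.parameters(), w, mu):
                grad = p.grad + gamma * (p - w_i)                      # local Gibbs energy gradient
                p.add_(-sgld_lr * grad + noise * torch.randn_like(p))  # SGLD step
                mu_i.mul_(alpha).add_((1 - alpha) * p)                 # running mean of SGLD iterates

    with torch.no_grad():
        for p, mu_i in zip(model.parameters(), mu):
            p.add_(outer_lr * gamma * (mu_i - p))   # ascend local entropy: grad_w F is prop. to (mu - w)
```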


Entropy-SGD optimizes PAC-Bayes bound w.r.t. prior

Entropy-SGD optimizes the local entropy

F_{γ,τ}(w; S) = log ∫_{R^p} exp{ −τ L_S(w′) − τ (γ/2) ‖w′ − w‖_2^2 } dw′.

Theorem. Maximizing F_{γ,τ}(w; S) w.r.t. w corresponds to minimizing a PAC-Bayes risk bound w.r.t. the prior's mean w.

Theorem. Let P(S) be an ε-differentially private distribution. Then

∀D,   P_{S∼D^m}( (∀Q)  KL( L_S(Q) || L_D(Q) ) ≤ ( KL(Q||P(S)) + ln(2m) + 2 max{ ln(3/δ), m ε^2 } ) / (m − 1) ) ≥ 1 − δ.

We optimize F_{γ,τ}(w; S) using SGLD, obtaining (ε, δ)-differential privacy.

SGLD is known to converge weakly to the ε-differentially private exponential mechanism. Our analysis makes a coarse approximation: the privacy of SGLD is taken to be that of the exponential mechanism.
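A small numerical sketch of how the privacy parameter ε enters this bound (illustrative numbers; the binary kl could be inverted by bisection as in the earlier sketch, but here the weaker Pinsker relaxation is used for brevity):

```python
import math

def dp_pac_bayes_rhs(kl_qp, m, eps, delta):
    """Right-hand side of the differentially private PAC-Bayes bound on
    kl(L_S(Q) || L_D(Q)) from the slide, where P(S) is eps-DP."""
    return (kl_qp + math.log(2 * m) + 2 * max(math.log(3 / delta), m * eps**2)) / (m - 1)

def risk_bound_pinsker(emp_risk, kl_qp, m, eps, delta):
    """Weaker but closed-form consequence via Pinsker: kl(q||p) >= 2 (p - q)^2."""
    return emp_risk + math.sqrt(dp_pac_bayes_rhs(kl_qp, m, eps, delta) / 2)

# Illustrative numbers (not the talk's): the m * eps^2 term dominates unless the
# privacy parameter eps is on the order of 1/sqrt(m).
print(risk_bound_pinsker(emp_risk=0.03, kl_qp=5000.0, m=60000, eps=0.005, delta=0.05))
```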



Conclusion

• We show that the size/flatness/location of minima (that were found by SGD on MNIST) imply generalization, using PAC-Bayes bounds;

• We show that Entropy-SGD optimizes the prior in a PAC-Bayes bound, which is not valid (the optimized prior depends on the data);

• We give a differentially private version of the PAC-Bayes theorem and modify Entropy-SGD so that the prior is privately optimized.


Achille, A. and S. Soatto (2017). "On the Emergence of Invariance and Disentangling in Deep Representations". arXiv preprint arXiv:1706.01350.

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv: 0712.0248 [stat.ML].

Chaudhari, P., A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017). "Entropy-SGD: Biasing Gradient Descent Into Wide Valleys". In: International Conference on Learning Representations (ICLR). arXiv: 1611.01838v4 [cs.LG].

Hinton, G. E. and D. van Camp (1993). "Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights". In: Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT '93. Santa Cruz, California, USA: ACM, pp. 5–13.

Hochreiter, S. and J. Schmidhuber (1997). "Flat Minima". Neural Comput. 9.1, pp. 1–42.

Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima". In: International Conference on Learning Representations (ICLR). arXiv: 1609.04836v2 [cs.LG].

Langford, J. and R. Caruana (2002). "(Not) Bounding the True Error". In: Advances in Neural Information Processing Systems 14. Ed. by T. G. Dietterich, S. Becker, and Z. Ghahramani. MIT Press, pp. 809–816.

Langford, J. and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS-01-102. Carnegie Mellon University.

McAllester, D. A. (1999). "PAC-Bayesian Model Averaging". In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory. COLT '99. Santa Cruz, California, USA: ACM, pp. 164–170.
