
Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

Daniel M. Roy, University of Toronto; Vector Institute

Joint work with

Gintarė K. Dziugaitė, University of Cambridge

(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights

Long Beach, CA. December 2017


How does SGD work?

• Growing body of work arguing that SGD performs implicit regularization.

• Problem: no matching generalization bounds that are nonvacuous when applied to real data and networks.

• We focus on “flat minima”: weights w such that the training error is “insensitive” to “large” perturbations.

• We show that the size/flatness/location of minima found by SGD on MNIST imply generalization, using PAC-Bayes bounds.

• Focusing on MNIST, we show how to compute generalization bounds that are nonvacuous for stochastic networks with millions of weights.

• We obtain our (data-dependent, PAC-Bayesian) generalization bounds via a fair bit of computation with SGD. Our approach is a modern take on Langford and Caruana (2002).

Dziugaite and Roy 1/19


Nonvacuous generalization bounds

risk: $L_D(h) := \mathbb{E}_{(x,y)\sim D}[\ell(h(x), y)]$, with $D$ unknown

empirical risk: $L_S(h) := \frac{1}{m}\sum_{i=1}^m \ell(h(x_i), y_i)$, where $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$

generalization error: $L_D(h) - L_S(h)$

$$\forall D \quad \mathbb{P}_{S \sim D^m}\Big( L_D(h) - L_S(h) < \underbrace{\varepsilon(\mathcal{H}, m, \delta, S, h)}_{\text{generalization error bound}} \Big) > 1 - \delta$$

[Plots: behavior of a generalization error bound as model complexity grows with # data fixed, and as # data grows with the model class H fixed.]

Dziugaite and Roy 2/19
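As a concrete reading of these definitions, here is a minimal sketch (ours, not the speakers'; the classifier h and the data arrays are hypothetical placeholders) that computes the empirical 0–1 risk and estimates the generalization gap using held-out data as a stand-in for the unknown D:

    import numpy as np

    def empirical_risk(h, X, y):
        # Empirical 0-1 risk L_S(h) = (1/m) * sum_i 1[h(x_i) != y_i].
        return float(np.mean(h(X) != y))

    def generalization_gap_estimate(h, X_train, y_train, X_heldout, y_heldout):
        # L_D(h) is unknown because D is unknown; a held-out sample gives an
        # unbiased estimate of it, so this approximates L_D(h) - L_S(h).
        return empirical_risk(h, X_heldout, y_heldout) - empirical_risk(h, X_train, y_train)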


SGD is X and X implies generalization

[Handwritten overlay: I(w; D) = H(w) − H(w|D) = H(D) − H(D|w); diagram of the two-step claim “SGD is X” and “X ⇒ generalization”.]

“SGD is Empirical Risk Minimization for large enough networks”
“SGD is (Implicit) Regularized Loss Minimization”
“SGD is Approximate Bayesian Inference”

. . .

No statement of the form “SGD is X” explains generalization in deep learning until we know that X implies generalization under real-world conditions.

Dziugaite and Roy 3/19


SGD is (not simply) empirical risk minimization

Training error of SGD at convergence.

Test error at convergence and for early stopping: identical.

SGD ≈ Empirical Risk Minimization: $\arg\min_{w \in \mathcal{H}} L_S(w)$

MNIST has 60,000 training examples.
A two-layer fully connected ReLU network has >1M parameters
⇒ PAC bounds are vacuous
⇒ PAC bounds can’t explain this curve.

Dziugaite and Roy 4/19


Our focus: Statistical Learning Aspect

On MNIST, with realistic networks, . . .

• VC bounds don’t imply generalization

• Classic margin + norm-bounded Rademacher complexity bounds don’t imply generalization

• Being “Bayesian” does not necessarily imply generalization (sorry!)

Using PAC-Bayes bounds, we show that the size/flatness/location of minima, found by SGD on MNIST, imply generalization for MNIST.

Our bounds require a fair bit of computation/optimization to evaluate. Strictly speaking, they bound the error of a random perturbation of the SGD solution.

Dziugaite and Roy 5/19


Flat minima... training error in flat minima is “insensitive” to “large” perturbations

(Hochreiter and Schmidhuber, 1997)

... meets the PAC-Bayes theorem (McAllester)

$$\forall D\ \forall P \quad \mathbb{P}_{S \sim D^m}\left[\ \forall Q \quad \Delta\big(L_S(Q), L_D(Q)\big) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{I_\Delta(m)}{\delta}}{m}\ \right] \ge 1 - \delta$$

For any data distribution $D$ (i.e., no assumptions),
for any “prior” randomized classifier $P$, even nonsense,
with high probability over $m$ i.i.d. samples $S \sim D^m$,
for any “posterior” randomized classifier $Q$, not necessarily given by Bayes’ rule,

the generalization error of $Q$ is bounded approximately by $\frac{1}{m}\,\mathrm{KL}(Q\|P)$.

Dziugaite and Roy 6/19
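To make the bound above concrete, here is a small sketch (ours, not the authors' code) that evaluates a PAC-Bayes bound of this kl form: given $L_S(Q)$, $\mathrm{KL}(Q\|P)$, $m$ and $\delta$, it inverts the binary KL divergence by bisection. The constant $I_\Delta(m) = 2\sqrt{m}$ is an assumption, corresponding to the standard Langford–Seeger variant of the theorem.

    import math

    def binary_kl(q, p):
        # kl(q || p) for Bernoulli parameters q, p.
        eps = 1e-12
        q = min(max(q, eps), 1 - eps)
        p = min(max(p, eps), 1 - eps)
        return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

    def kl_inverse(emp_risk, bound_rhs, tol=1e-9):
        # Largest p with kl(emp_risk || p) <= bound_rhs, found by bisection.
        lo, hi = emp_risk, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binary_kl(emp_risk, mid) <= bound_rhs:
                lo = mid
            else:
                hi = mid
        return lo

    def pac_bayes_bound(emp_risk, kl_q_p, m, delta):
        # Bound on L_D(Q) from kl(L_S(Q) || L_D(Q)) <= (KL(Q||P) + log(2*sqrt(m)/delta)) / m.
        rhs = (kl_q_p + math.log(2 * math.sqrt(m) / delta)) / m
        return kl_inverse(emp_risk, rhs)

Plugging in numbers of the order reported later in the talk (SNN train error ≈ 0.028, KL ≈ 5144, δ = 0.05, m around the MNIST training-set size) gives a bound around 0.15–0.16, the same ballpark as the 0.161 in the results table; the exact constants and data split used in the paper may differ.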


Controlling generalization error of randomized classifiers

Let $\mathcal{H}$ be a hypothesis class of binary classifiers $\mathbb{R}^k \to \{-1, 1\}$. A randomized classifier is a distribution $Q$ on $\mathcal{H}$. Its risk is

$$L_D(Q) = \mathbb{E}_{w \sim Q}\big[L_D(h_w)\big].$$

Among the sharpest generalization bounds for randomized classifiers are PAC-Bayes bounds (McAllester, 1999).

Theorem (PAC-Bayes; Catoni, 2007).

Let $\delta > 0$ and $m \in \mathbb{N}$, and assume $L_D$ is bounded. Then

$$\forall P,\ \forall D, \quad \mathbb{P}_{S \sim D^m}\left(\forall Q,\ \ L_D(Q) \le 2\, L_S(Q) + 2\,\frac{\mathrm{KL}(Q\|P) + \log\frac{1}{\delta}}{m}\right) \ge 1 - \delta$$

Dziugaite and Roy 7/19
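For comparison with the kl-form bound sketched earlier, Catoni's linear form is even simpler to evaluate; a one-function sketch (ours), under the same caveats:

    import math

    def catoni_bound(emp_risk_Q, kl_q_p, m, delta):
        # L_D(Q) <= 2 * L_S(Q) + 2 * (KL(Q||P) + log(1/delta)) / m.
        return 2.0 * emp_risk_Q + 2.0 * (kl_q_p + math.log(1.0 / delta)) / m

Note the factor of 2 on the empirical term: this form is linear in $L_S(Q)$ and $\mathrm{KL}(Q\|P)$, which makes it convenient to optimize, at the price of some slack when $L_S(Q)$ is not small.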


Our approach

given $m$ i.i.d. data $S \sim D^m$

empirical error surface $w \mapsto L_S(h_w)$

• $w_{\mathrm{SGD}} \in \mathbb{R}^{472000}$: weights learned by SGD on MNIST

• $Q = \mathcal{N}(w_{\mathrm{SGD}} + w', \Sigma')$: stochastic neural net

generalization/error bound: $\forall D \quad \mathbb{P}_{S \sim D^m}\big(L_D(Q) < 0.17\big) > 0.95$

Dziugaite and Roy 8/19
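A minimal sketch of the object being bounded here (ours; `zero_one_error(w)` is a hypothetical helper that evaluates the network with flat weight vector w on the training set): the stochastic neural network Q is a Gaussian over weight vectors centered near $w_{\mathrm{SGD}}$, and $L_S(Q)$ is estimated by Monte Carlo.

    import numpy as np

    def estimate_snn_train_error(zero_one_error, w_mean, var_diag, n_samples=100, seed=0):
        # Q = N(w_mean, diag(var_diag)); L_S(Q) = E_{w ~ Q}[L_S(h_w)], estimated by
        # averaging the empirical error of randomly perturbed weight vectors.
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(n_samples):
            w = w_mean + rng.normal(size=w_mean.shape) * np.sqrt(var_diag)
            errors.append(zero_one_error(w))
        return float(np.mean(errors))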


Optimizing PAC-Bayes bounds

Given data $S$, we can find a provably good classifier $Q$ by optimizing the PAC-Bayes bound w.r.t. $Q$.

For Catoni’s PAC-Bayes bound, the optimization problem is of the form

$$\sup_Q\ -\tau L_S(Q) - \mathrm{KL}(Q\|P).$$

Lemma. The optimal $Q$ satisfies
$$\frac{dQ}{dP}(w) = \underbrace{\frac{\exp(-\tau L_S(w))}{\int \exp(-\tau L_S(w'))\, P(dw')}}_{\text{generalized Bayes rule}}.$$

Observation. Under log loss and $\tau = m$, the term $-\tau L_S(Q)$ is the expected log likelihood under $Q$, and the objective is the ELBO.

Lemma.
$$\log \int \exp(-\tau L_S(w))\, P(dw) = \sup_Q\ -\tau L_S(Q) - \mathrm{KL}(Q\|P).$$

Observation. Under log loss and $\tau = m$, the l.h.s. is the log marginal likelihood. Cf. Zhang 2004, 2006; Alquier et al. 2015; Germain et al. 2016.

Dziugaite and Roy 9/19
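The two lemmas are two sides of the same (Donsker–Varadhan / Gibbs variational) identity; a short derivation in the slide's notation, writing $Z := \int \exp(-\tau L_S(w))\, P(dw)$:

    \[
      -\tau L_S(Q) - \mathrm{KL}(Q\|P)
        \;=\; \int \Big(-\tau L_S(w) - \log\tfrac{dQ}{dP}(w)\Big)\, Q(dw)
        \;=\; \log Z \;-\; \mathrm{KL}(Q\,\|\,Q^*),
      \qquad \frac{dQ^*}{dP}(w) := \frac{e^{-\tau L_S(w)}}{Z}.
    \]

Since $\mathrm{KL}(Q\|Q^*) \ge 0$, with equality iff $Q = Q^*$, the supremum over $Q$ equals $\log Z$ and is attained at the generalized Bayes posterior $Q^*$, which gives both lemmas at once.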


PAC-Bayes Bound optimization

$$\inf_Q\ L_S(Q) + \frac{\mathrm{KL}(Q\|P) + \log\frac{1}{\delta}}{m}$$

Let $\tilde{L}_S(Q) \ge L_S(Q)$ with $\tilde{L}_S$ differentiable (a surrogate for the 0–1 loss).

$$\inf_Q\ m\, \tilde{L}_S(Q) + \mathrm{KL}(Q\|P)$$

Let $Q_{w,s} = \mathcal{N}(w, \mathrm{diag}(s))$.

$$\min_{w \in \mathbb{R}^d,\ s \in \mathbb{R}^d_+}\ m\, \tilde{L}_S(Q_{w,s}) + \mathrm{KL}(Q_{w,s}\|P)$$

Take $P = \mathcal{N}(w_0, \lambda I_d)$ with $\lambda = c \exp\{-j/b\}$.

$$\min_{w \in \mathbb{R}^d,\ s \in \mathbb{R}^d_+,\ \lambda \in (0,c)}\ m\, \tilde{L}_S(Q_{w,s}) + \underbrace{\mathrm{KL}(Q_{w,s}\|\mathcal{N}(w_0, \lambda I))}_{\frac{1}{2}\left(\frac{1}{\lambda}\|s\|_1 + \frac{1}{\lambda}\|w - w_0\|_2^2 + d \log\lambda - \mathbf{1}_d \cdot \log s - d\right)} + 2\log\Big(b \log\frac{c}{\lambda}\Big)$$

Dziugaite and Roy 10/19
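A condensed sketch of the resulting training step (ours, not the authors' released code; assumptions: PyTorch-style reparameterized sampling, one Monte Carlo sample per step, and a hypothetical helper `surrogate_loss(w, batch)` that evaluates the differentiable surrogate for the network with flat weight vector w; the default values of the grid hyperparameters b and c are illustrative):

    import torch

    def pac_bayes_objective(w, rho, log_lambda, w0, batch, m, surrogate_loss, b=100.0, c=0.1):
        # Posterior Q_{w,s} = N(w, diag(s)) with s = exp(2*rho) (rho = log std dev);
        # prior P = N(w0, lambda * I) with lambda = exp(log_lambda), lambda in (0, c).
        s = torch.exp(2.0 * rho)
        lam = torch.exp(log_lambda)
        d = w.numel()

        # One reparameterized sample w' ~ Q gives an unbiased estimate of the surrogate loss of Q.
        w_sample = w + torch.exp(rho) * torch.randn_like(w)
        emp_surrogate = surrogate_loss(w_sample, batch)

        # Closed-form KL(N(w, diag(s)) || N(w0, lambda I)), as on the slide.
        kl = 0.5 * (s.sum() / lam + ((w - w0) ** 2).sum() / lam
                    + d * torch.log(lam) - (2.0 * rho).sum() - d)

        # Penalty for the union bound over the grid lambda = c * exp(-j/b);
        # the constant log(1/delta) term is omitted since it does not affect optimization.
        penalty = 2.0 * torch.log(b * torch.log(c / lam))

        return m * emp_surrogate + kl + penalty

In the full recipe, $w$ is initialized at $w_{\mathrm{SGD}}$ and $(w, \rho, \log\lambda)$ are trained jointly with a gradient-based optimizer on this objective; the reported certificate is then computed on the 0–1 loss, using a Monte Carlo estimate of $L_S(Q)$ in the non-surrogate PAC-Bayes bound, in the spirit of Langford and Caruana (2002).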


Numerical generalization bounds on MNIST

# Hidden Layers      1        2        3        1 (R)

Train error          0.001    0.000    0.000    0.007
Test error           0.018    0.016    0.013    0.508
SNN train error      0.028    0.028    0.027    0.112
SNN test error       0.034    0.033    0.032    0.503
PAC-Bayes bound      0.161    0.186    0.201    1.352
KL divergence        5144     6534     7861     201131
# parameters         472k     832k     1193k    472k
VC dimension         26m      66m      121m     26m

We have shown that the type of flat minima found in practice can be turned into a generalization guarantee.

The bounds are loose, but they are the only nonvacuous bounds in this setting.

Dziugaite and Roy 11/19


Actually, SGD is pretty dangerous

• SGD achieves zero training error reliably

• Despite no explicit regularization, training and test error are very close

• Explicit regularization has a minor effect

• SGD can reliably obtain zero training error on randomized labels

• Hence, the Rademacher complexity of the model class is near maximal w.h.p.

Dziugaite and Roy 12/19


Entropy-SGD (Chaudhari et al., 2017)

Entropy-SGD replaces stochastic gradient descent on $L_S$ by stochastic gradient ascent applied to the optimization problem:

$$\arg\max_{w \in \mathbb{R}^d} F_{\gamma,\tau}(w; S), \quad \text{where} \quad F_{\gamma,\tau}(w; S) = \log \int_{\mathbb{R}^d} \exp\Big\{-\tau L_S(w') - \tau\,\frac{\gamma}{2}\,\|w' - w\|_2^2\Big\}\, dw'.$$

The local entropy $F_{\gamma,\tau}(\cdot\,; S)$ emphasizes flat minima of $L_S$.

Dziugaite and Roy 13/19
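For intuition, the gradient of the local entropy has a simple form, $\nabla_w F_{\gamma,\tau}(w; S) = \tau\gamma\,(\mathbb{E}[w'] - w)$, where the expectation is over the Gibbs measure proportional to $\exp\{-\tau L_S(w') - \tau\frac{\gamma}{2}\|w'-w\|_2^2\}$; Entropy-SGD estimates $\mathbb{E}[w']$ with a short inner SGLD chain. A rough sketch (ours, following Chaudhari et al. (2017) only loosely; `grad_LS(w, batch)` is a hypothetical stochastic-gradient helper, and the step sizes and loop lengths are illustrative):

    import numpy as np

    def entropy_sgd_step(w, grad_LS, batches, gamma=0.03, tau=1.0,
                         inner_steps=20, eta=1e-4, outer_lr=0.1, seed=0):
        # Estimate mu = E[w'] under the Gibbs measure with a short SGLD chain,
        # then move w along grad_w F = tau * gamma * (mu - w).
        rng = np.random.default_rng(seed)
        w_prime, mu = w.copy(), w.copy()
        for t in range(inner_steps):
            g = tau * grad_LS(w_prime, batches[t % len(batches)]) \
                + tau * gamma * (w_prime - w)                 # gradient of the Gibbs potential
            w_prime = w_prime - eta * g + np.sqrt(2.0 * eta) * rng.normal(size=w.shape)
            mu = 0.75 * mu + 0.25 * w_prime                   # running average of the chain
        return w + outer_lr * tau * gamma * (mu - w)          # gradient ascent on F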


Entropy-SGD optimizes PAC-Bayes bound w.r.t. prior

Entropy-SGD optimizes the local entropy

$$F_{\gamma,\tau}(w; S) = \log \int_{\mathbb{R}^d} \exp\Big\{-\tau L_S(w') - \tau\,\frac{\gamma}{2}\,\|w' - w\|_2^2\Big\}\, dw'.$$

Theorem. Maximizing $F_{\gamma,\tau}(w; S)$ w.r.t. $w$ corresponds to minimizing a PAC-Bayes risk bound w.r.t. the prior’s mean $w$.

Theorem. Let P(S) be an ε-differentially private distribution. Then

$$\forall D, \quad \mathbb{P}_{S \sim D^m}\left((\forall Q)\ \ \mathrm{KL}\big(L_S(Q)\,\|\,L_D(Q)\big) \le \frac{\mathrm{KL}(Q\|P(S)) + \ln 2m + 2\max\{\ln\frac{3}{\delta},\, m\varepsilon^2\}}{m - 1}\right) \ge 1 - \delta.$$

We optimize $F_{\gamma,\tau}(w; S)$ using SGLD, obtaining $(\varepsilon, \delta)$-differential privacy.

SGLD is known to converge weakly to the ε-differentially private exponential mechanism. Our analysis makes a coarse approximation: the privacy of SGLD is taken to be that of the exponential mechanism.

Dziugaite and Roy 14/19
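Numerically, the private bound is evaluated the same way as the earlier kl-form bound; a sketch (ours), reusing the `kl_inverse` helper defined above, with ε the privacy parameter of the data-dependent prior P(S):

    import math

    def private_pac_bayes_bound(emp_risk_Q, kl_q_p, m, delta, eps):
        # RHS of the differentially private PAC-Bayes bound on KL(L_S(Q) || L_D(Q)).
        rhs = (kl_q_p + math.log(2 * m) + 2 * max(math.log(3 / delta), m * eps ** 2)) / (m - 1)
        return kl_inverse(emp_risk_Q, rhs)  # invert the binary KL, as before

Note that when the $m\varepsilon^2$ term dominates, the privacy penalty contributes roughly $2\varepsilon^2$ to the right-hand side, so a modestly small ε already keeps the bound's extra cost small.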


Conclusion

• We show that the size/flatness/location of minima (that were found by SGD on MNIST) imply generalization using PAC-Bayes bounds;

• We show that Entropy-SGD optimizes the prior in a PAC-Bayes bound, which is not valid, because the prior then depends on the data;

• We give a differentially private version of the PAC-Bayes theorem and modify Entropy-SGD so that the prior is privately optimized.

Dziugaite and Roy 18/19


Achille, A. and S. Soatto (2017). “On the Emergence of Invariance and Disentangling in Deep Representations”. arXiv preprint arXiv:1706.01350.

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv: 0712.0248 [stat.ML].

Chaudhari, P., A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2017). “Entropy-SGD: Biasing Gradient Descent Into Wide Valleys”. In: International Conference on Learning Representations (ICLR). arXiv: 1611.01838v4 [cs.LG].

Hinton, G. E. and D. van Camp (1993). “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights”. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT ’93. Santa Cruz, California, USA: ACM, pp. 5–13.

Hochreiter, S. and J. Schmidhuber (1997). “Flat Minima”. Neural Comput. 9.1, pp. 1–42.

Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”. In: International Conference on Learning Representations (ICLR). arXiv: 1609.04836v2 [cs.LG].

Langford, J. and R. Caruana (2002). “(Not) Bounding the True Error”. In: Advances in Neural Information Processing Systems 14. Ed. by T. G. Dietterich, S. Becker, and Z. Ghahramani. MIT Press, pp. 809–816.

Langford, J. and M. Seeger (2001). Bounds for Averaging Classifiers. Tech. rep. CMU-CS-01-102. Carnegie Mellon University.

McAllester, D. A. (1999). “PAC-Bayesian Model Averaging”. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory. COLT ’99. Santa Cruz, California, USA: ACM, pp. 164–170.

Dziugaite and Roy 19/19