Dropout as a Bayesian Approximation: Insights and Applications



Page 1


Dropout as a Bayesian Approximation: Insights and Applications

Yarin Gal and Zoubin Ghahramani

Discussion by: Chunyuan Li

Jan. 15, 2016


Page 2


Main idea

- In the framework of variational inference, the authors show that the standard SGD training algorithm with dropout is essentially optimizing a stochastic lower bound of a Gaussian process whose covariance function is defined by the neural network (its nonlinearity and weight prior).


Page 3


Outline

- Preliminaries
  - Dropout
  - Gaussian Processes
  - Variational Inference

- A Gaussian Process Approximation
  - A Gaussian Process Approximation
  - Evaluating the Log Evidence Lower Bound


Page 4


Dropout

Procedure
- Training stage: a unit is present with probability p
- Testing stage: the unit is always present and its weights are multiplied by p

Intuition
- Training a neural network with dropout can be seen as training a collection of 2^K thinned networks (for a network with K units) with extensive weight sharing
- At test time, a single neural network approximates averaging the outputs of these thinned networks
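To make the procedure concrete, here is a minimal numpy sketch of one dropout layer at training and test time (standard, non-inverted dropout). The ReLU activation and layer sizes are illustrative placeholders, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, p, train=True):
    """One hidden layer h = g(x W) with dropout keep-probability p."""
    h = np.maximum(x @ W, 0.0)                # g = ReLU (illustrative choice)
    if train:
        b = rng.binomial(1, p, size=h.shape)  # each unit present with probability p
        return h * b
    return h * p                              # test time: scale by p instead of sampling

# toy usage: 3 inputs, 5 hidden units
x = rng.normal(size=(1, 3))
W = rng.normal(size=(3, 5))
h_train = dropout_layer(x, W, p=0.5, train=True)
h_test = dropout_layer(x, W, p=0.5, train=False)
```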

Page 5


Dropout for one-hidden-layer Neural Networks
- Dropout on local units*:

  y = ( g((x ⊙ b1) W1) ⊙ b2 ) W2    (1)

- Input x and output y
- g: activation function; weights W_ℓ ∈ R^{K_{ℓ−1}×K_ℓ}
- b_ℓ are binary dropout variables
- Equivalent to multiplying the global weight matrices by the binary vectors, dropping out entire rows:

  y = g( x (diag(b1) W1) ) (diag(b2) W2)    (2)

- Application to regression:

  L = (1 / 2N) Σ_{n=1}^N ‖y_n − ŷ_n‖_2^2 + λ ( ‖W1‖_2^2 + ‖W2‖_2^2 )    (3)

* bias terms are ignored for simplicity
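A minimal numpy sketch of the forward pass in eq. (2) and the regression objective in eq. (3); the tanh activation, the shapes, and λ are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, W1, W2, b1, b2, g=np.tanh):
    """Eq. (2): y = g(x (diag(b1) W1)) (diag(b2) W2) for every row x of X."""
    return g(X @ (np.diag(b1) @ W1)) @ (np.diag(b2) @ W2)

def dropout_objective(X, Y, W1, W2, b1, b2, lam=1e-4):
    """Eq. (3): squared error averaged over N plus L2 weight decay."""
    N = X.shape[0]
    err = np.sum((Y - forward(X, W1, W2, b1, b2)) ** 2) / (2 * N)
    return err + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

# toy shapes: Q=4, K=8, D=2, N=10; fresh Bernoulli masks are drawn at each SGD step
Q, K, D, N = 4, 8, 2, 10
X, Y = rng.normal(size=(N, Q)), rng.normal(size=(N, D))
W1, W2 = rng.normal(size=(Q, K)), rng.normal(size=(K, D))
b1, b2 = rng.binomial(1, 0.5, size=Q), rng.binomial(1, 0.5, size=K)
print(dropout_objective(X, Y, W1, W2, b1, b2))
```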

Page 6


Gaussian Processes
- f is the GP function:

  p(f | X, Y) ∝ p(f) · p(Y | X, f)
  (posterior ∝ prior × likelihood)

- Applications
  - Regression:

    F | X ~ N(0, K(X, X)),   Y | F ~ N(F, τ^{-1} I_N)
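For reference, a minimal numpy sketch of regression under this model: the posterior predictive follows from the joint Gaussian over training and test outputs. The squared-exponential kernel is an assumed placeholder; the talk later replaces it with a kernel built from the network.

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """Squared-exponential kernel (placeholder choice of K(x, x'))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, Y, X_star, tau=25.0):
    """Predictive mean/covariance under F|X ~ N(0, K(X,X)), Y|F ~ N(F, tau^-1 I_N)."""
    N = X.shape[0]
    Kyy = rbf_kernel(X, X) + np.eye(N) / tau      # marginal covariance of Y
    Ks = rbf_kernel(X, X_star)
    mean = Ks.T @ np.linalg.solve(Kyy, Y)
    cov = rbf_kernel(X_star, X_star) - Ks.T @ np.linalg.solve(Kyy, Ks)
    return mean, cov

# toy usage on a 1-D sine curve
X = np.linspace(0, 1, 20)[:, None]
Y = np.sin(2 * np.pi * X)
mean, cov = gp_predict(X, Y, np.array([[0.25], [0.75]]))
```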


Page 7


Variational Inference
- The predictive distribution:

  p(y* | x*, X, Y) = ∫ p(y* | x*, w) p(w | X, Y) dw,   replacing p(w | X, Y) ≈ q(w)    (4)

- Objective: argmin_q KL( q(w) ‖ p(w | X, Y) )
- Variational prediction: q(y* | x*) = ∫ p(y* | x*, w) q(w) dw
- Log evidence lower bound:

  L_VI = ∫ q(w) log p(Y | X, w) dw − KL( q(w) ‖ p(w) )    (5)

- Objective: argmax_q L_VI
- Intuition: a variational distribution q(w) that explains the data well while still staying close to the prior
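A minimal sketch of how the variational prediction above is used in practice: a Monte Carlo average over samples from q(w). The sample_q and predict callables are hypothetical placeholders standing in for a concrete q(w) and likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_predictive_mean(x_star, sample_q, predict, T=200):
    """q(y*|x*) = ∫ p(y*|x*, w) q(w) dw, estimated by averaging predictions over w_t ~ q(w)."""
    return np.mean([predict(x_star, sample_q()) for _ in range(T)], axis=0)

# toy usage: linear model y = w x with q(w) = N(1.0, 0.1^2)
mean = mc_predictive_mean(2.0,
                          sample_q=lambda: rng.normal(1.0, 0.1),
                          predict=lambda x, w: w * x)
```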

Page 8


A single-layer neural network example

- Setup:
  - Q: input dimension, K: number of hidden units, D: output dimension
- Goal: learn W1 ∈ R^{Q×K} and W2 ∈ R^{K×D} to map X ∈ R^{N×Q} to Y ∈ R^{N×D}
- Idea: introduce W1 and W2 into the GP approximation

Page 9


Introduce W1
- Define the kernel of the GP:

  K(x, x') = ∫ p(w) g(w^T x) g(w^T x') dw    (6)

- Resort to Monte Carlo integration, with generative process:

  w_k ~ p(w),   W1 = [w_k]_{k=1}^K,   K̂(x, x') = (1/K) Σ_{k=1}^K g(w_k^T x) g(w_k^T x')

  F | X, W1 ~ N(0, K̂),   Y | F ~ N(F, τ^{-1} I_N)    (7)

  where K is the number of hidden units

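A quick numpy sketch of the Monte Carlo kernel estimate in eq. (7): with w_k drawn from p(w), the feature map Φ = sqrt(1/K) g(X W1) gives K̂ = ΦΦ^T. The standard-normal p(w) and tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_kernel(X, W1, g=np.tanh):
    """Eq. (7): K_hat(x, x') = (1/K) sum_k g(w_k^T x) g(w_k^T x'), i.e. Phi Phi^T."""
    Phi = g(X @ W1) / np.sqrt(W1.shape[1])
    return Phi @ Phi.T

# toy check: with many hidden units K, K_hat approaches the integral kernel of eq. (6)
Q, K, N = 2, 10000, 5
X = rng.normal(size=(N, Q))
W1 = rng.normal(size=(Q, K))     # columns w_k ~ p(w) = N(0, I_Q) (assumed prior)
K_hat = mc_kernel(X, W1)
```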

Page 10


Introduce W2
- Analytically integrate with respect to F. The predictive distribution

  p(Y | X) = ∫ p(Y | F) p(F | X, W1) p(W1) dW1 dF    (8)

  can be rewritten as

  p(Y | X) = ∫ N(Y; 0, ΦΦ^T + τ^{-1} I_N) p(W1) dW1    (9)

  where Φ = [φ_n]_{n=1}^N,  φ_n = sqrt(1/K) g(W1^T x_n)

- For W2 = [w_d]_{d=1}^D with w_d ~ N(0, I_K), each output column y_d satisfies

  N(y_d; 0, ΦΦ^T + τ^{-1} I_N) = ∫ N(y_d; Φ w_d, τ^{-1} I_N) N(w_d; 0, I_K) dw_d

  so that

  p(Y | X) = ∫ p(Y | X, W1, W2) p(W1) p(W2) dW1 dW2    (10)
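A small numerical check of the identity used above: sampling y_d = Φ w_d + noise with w_d ~ N(0, I_K) and noise precision τ reproduces the covariance ΦΦ^T + τ^{-1} I_N of eq. (9). The shapes and τ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, tau = 4, 3, 2.0
Phi = rng.normal(size=(N, K))

# Draw y_d = Phi w_d + eps with w_d ~ N(0, I_K), eps ~ N(0, tau^-1 I_N) ...
S = 200000
w = rng.normal(size=(K, S))
eps = rng.normal(scale=1.0 / np.sqrt(tau), size=(N, S))
y = Phi @ w + eps

# ... and compare the empirical covariance with Phi Phi^T + tau^-1 I_N from eq. (9).
print(np.allclose(np.cov(y), Phi @ Phi.T + np.eye(N) / tau, atol=0.05))
```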


Page 11


Variational Inference in the Approximate Model

  q(W1, W2) = q(W1) q(W2)    (11)

- To mimic dropout, q(W1) is factorised over the input dimensions, each factor being a mixture of two Gaussians:

  q(W1) = Π_{q=1}^Q q(w_q),   q(w_q) = p1 N(m_q, σ² I_K) + (1 − p1) N(0, σ² I_K)    (12)

  with p1 ∈ [0, 1], m_q ∈ R^K

- Similarly for q(W2):

  q(W2) = Π_{k=1}^K q(w_k),   q(w_k) = p2 N(m_k, σ² I_D) + (1 − p2) N(0, σ² I_D)    (13)

  with p2 ∈ [0, 1], m_k ∈ R^D

- Optimise over the variational parameters, in particular M1 = [m_q]_{q=1}^Q and M2 = [m_k]_{k=1}^K
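Drawing a sample from the factorised mixture in eqs. (12)-(13) is simple: for each row, pick the component with mean m (probability p) or the zero-mean component, then add N(0, σ²) noise. A minimal sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_q(M, p, sigma):
    """One draw from eq. (12)/(13): row i uses mean M[i] with probability p, else mean 0."""
    b = rng.binomial(1, p, size=M.shape[0])
    return np.diag(b) @ M + sigma * rng.normal(size=M.shape)

# toy usage: Q=4, K=8, D=2
M1, M2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))
W1 = sample_q(M1, p=0.5, sigma=0.01)
W2 = sample_q(M2, p=0.5, sigma=0.01)
```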


Page 12


Evaluating the Log Evidence Lower Bound for Regression
- Log evidence lower bound:

  L_GP-VI = ∫ q(W1, W2) log p(Y | X, W1, W2) dW1 dW2 − KL( q(W1, W2) ‖ p(W1, W2) )    (14)

  (the first term is L_1, the second is L_2)

- Approximation of L_2†:
  - For large enough K we can approximate the KL divergence term as

    KL( q(W1) ‖ p(W1) ) ≈ QK (σ² − log σ² − 1) + (p1 / 2) Σ_{q=1}^Q m_q^T m_q    (15)

  - Similarly for KL( q(W2) ‖ p(W2) )

† Following Proposition 1 on the KL of a mixture of Gaussians in the Appendix

Page 13


L_1: Monte Carlo integration
- Re-parameterization:

  L_1 = ∫ q(b1, ε1, b2, ε2) log p(Y | X, W1(b1, ε1), W2(b2, ε2)) db1 db2 dε1 dε2    (16)

  W1 = diag(b1)(M1 + σ ε1) + (1 − diag(b1)) σ ε1
  W2 = diag(b2)(M2 + σ ε2) + (1 − diag(b2)) σ ε2    (17)

  where ε1 ~ N(0, I_{Q×K}), each b1,q ~ Bernoulli(p1),
        ε2 ~ N(0, I_{K×D}), each b2,k ~ Bernoulli(p2)    (18)

- Take a single sample:

  L_GP-MC = log p(Y | X, W1, W2) − KL( q(W1, W2) ‖ p(W1, W2) )    (19)

  where the first term is L_1-MC. Optimising L_GP-MC converges to the same limit as L_GP-VI.
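A single-sample estimate of eq. (19) can be computed directly: draw (W1, W2) via the re-parameterization in eq. (17), evaluate the Gaussian regression log-likelihood (eq. (20) on the next slide), and subtract the approximate KL of eq. (15), dropping its σ-only constant. A hedged numpy sketch with placeholder τ, σ, p1, p2:

```python
import numpy as np

rng = np.random.default_rng(0)

def L_gp_mc(X, Y, M1, M2, p1, p2, sigma, tau, g=np.tanh):
    """One-sample estimate of the bound in eq. (19)."""
    Q, K = M1.shape
    D = M2.shape[1]
    N = X.shape[0]
    # eq. (17)-(18): W = diag(b)(M + sigma*eps) + (1 - diag(b))*sigma*eps = diag(b) M + sigma*eps
    b1, b2 = rng.binomial(1, p1, size=Q), rng.binomial(1, p2, size=K)
    W1 = np.diag(b1) @ M1 + sigma * rng.normal(size=M1.shape)
    W2 = np.diag(b2) @ M2 + sigma * rng.normal(size=M2.shape)
    # Gaussian log-likelihood with Y_hat = sqrt(1/K) g(X W1) W2 and precision tau (eq. (20))
    Y_hat = np.sqrt(1.0 / K) * g(X @ W1) @ W2
    log_lik = 0.5 * N * D * (np.log(tau) - np.log(2 * np.pi)) - 0.5 * tau * np.sum((Y - Y_hat) ** 2)
    # eq. (15) without the sigma-only constant: (p/2) * sum of squared means, per weight matrix
    kl = 0.5 * p1 * np.sum(M1 ** 2) + 0.5 * p2 * np.sum(M2 ** 2)
    return log_lik - kl
```

Gradients of this estimate with respect to M1 and M2 drive the stochastic optimisation that the next two slides reduce to SGD with dropout.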


Page 14


L_1: Monte Carlo integration
- Regression:

  L_1-MC = log p(Y | X, W1, W2)
         = Σ_{d=1}^D log N(y_d; Φ w_d, τ^{-1} I_N)
         = −(ND/2) log(2π) + (ND/2) log τ − Σ_{d=1}^D (τ/2) ‖y_d − Φ w_d‖_2^2    (20)

- Sum over the rows instead of the columns of Y:

  Σ_{d=1}^D (τ/2) ‖y_d − Φ w_d‖_2^2 = Σ_{n=1}^N (τ/2) ‖y_n − ŷ_n‖_2^2    (21)

  where ŷ_n = φ_n W2 = sqrt(1/K) g(x_n W1) W2


Page 15


Recover SGD training with Dropout
- Take the approximations of L_1 and L_2 in the bound and ignore the additive constants in τ and σ:

  L_GP-MC = −(τ/2) Σ_{n=1}^N ‖y_n − ŷ_n‖_2^2 − (p1/2) ‖M1‖_2^2 − (p2/2) ‖M2‖_2^2    (22)

- Let σ tend to 0:

  W1 = diag(b1)(M1 + σ ε1) + (1 − diag(b1)) σ ε1  ⇒  W1 ≈ diag(b1) M1,   W2 ≈ diag(b2) M2    (23)

  ŷ_n = sqrt(1/K) g(x_n W1) W2 = sqrt(1/K) g( x_n (diag(b1) M1) ) (diag(b2) M2)    (24)

- Up to a constant factor, maximising (22) is minimising the dropout objective of eq. (3), and (24) is the binary-masked forward pass of eq. (2): SGD training with dropout is recovered.
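The practical payoff, developed in the arXiv:1506.02142 paper cited on the next slide, is that keeping dropout switched on at test time yields predictive uncertainty almost for free. A minimal sketch of this MC-dropout prediction for the network of eq. (24), with illustrative dropout probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, M1, M2, p1, p2, g=np.tanh):
    """Stochastic forward pass of eq. (24): fresh Bernoulli masks on every call."""
    K = M1.shape[1]
    b1 = rng.binomial(1, p1, size=M1.shape[0])
    b2 = rng.binomial(1, p2, size=K)
    return np.sqrt(1.0 / K) * g(x @ (np.diag(b1) @ M1)) @ (np.diag(b2) @ M2)

def mc_dropout_predict(x, M1, M2, p1=0.5, p2=0.5, T=100):
    """Predictive mean and variance from T stochastic forward passes with dropout left on."""
    samples = np.stack([dropout_forward(x, M1, M2, p1, p2) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)
```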


Page 16


More Applications
- Convolutional Neural Networks:
  Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, arXiv:1506.02158, 2015
- Recurrent Neural Networks:
  A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, arXiv:1512.05287, 2015
- Reinforcement Learning:
  Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, arXiv:1506.02142, 2015