Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf ·...

52
Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Transcript of Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf ·...

Page 1: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

Deep Variational Inference

FLARE Reading Group PresentationWesley Tansey9/28/2016

Page 2: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

What is Variational Inference?

Page 3: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Want to estimate some distribution, p*(x)

What is Variational Inference? p*(x)

Page 4: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Want to estimate some distribution, p*(x)

● Too expensive to estimate

What is Variational Inference? p*(x)

Page 5: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Want to estimate some distribution, p*(x)

● Too expensive to estimate

● Approximate it with a tractable distribution, q(x)

What is Variational Inference? p*(x) q(x)

Page 6: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Fit q(x) inside of p*(x)● Centered at a single

mode ○ q(x) is unimodal

here○ VI is a MAP

estimate

What is Variational Inference? p*(x) q(x)

Page 7: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Mathematically:

KL(q || p*)

= Σxq(x)log(q(x) / p*(x))

What is Variational Inference?

Still hard!

p*(x) usually has a tricky normalizing constant

Page 8: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Mathematically:

KL(q || p*)

= Σxq(x)log(q(x) / p*(x))

● Use unnormalized p~ instead

What is Variational Inference?

Page 9: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

log(q(x) / p*(x))

= log(q(x)) - log(p*(x))

= log(q(x)) - log(p~(x) / Z)

= log(q(x)) - log(p~(x)) - log(Z)

● Mathematically:

KL(q || p*)

= Σxq(x)log(q(x) / p*(x))

● Use unnormalized p~ instead

What is Variational Inference?

Page 10: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

log(q(x) / p*(x))

= log(q(x)) - log(p*(x))

= log(q(x)) - log(p~(x) / Z)

= log(q(x)) - log(p~(x)) - log(Z)

● Mathematically:

KL(q || p*)

= Σxq(x)log(q(x) / p*(x))

● Use unnormalized p~ instead

What is Variational Inference?

Constant=> Can ignore in our optimization problem

Page 11: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Classical method

● Uses a factorized q:

q(x) = ∏i q

i(x

i)

Mean Field VI

[1] Blei, Ng, Jordan, “Latent Dirichlet Allocation”, JMLR, 2003.

Page 12: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Example: Multivariate Gaussian

● Product of independent Gaussians for q

● Spherical covariance underestimates true covariance

Mean Field VI

Page 13: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Vanilla mean field VI assumes you know all the parameters, θ, of the true distribution, p*(x)

Variational Bayes

[1] Blei, Ng, Jordan, “Latent Dirichlet Allocation”, JMLR, 2003.

Page 14: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Vanilla mean field VI assumes you know all the parameters, θ, of the true distribution, p*(x)

● Enter: Variational Bayes (VB)

Variational Bayes

[1] Blei, Ng, Jordan, “Latent Dirichlet Allocation”, JMLR, 2003.

Page 15: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● VB infers both the latent q(x) variables, z, and the p*(x) parameters, θ

● VB-EM was popularized for LDA1

○ E for z, M for θ

Variational Bayes

[1] Blei, Ng, Jordan, “Latent Dirichlet Allocation”, JMLR, 2003.

Page 16: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● VB usually uses a mean field approximation of the form:

q(x) = q(zi | θ)∏

i q

i(x

i | z

i)

Variational Bayes

Page 17: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Requires analytical solutions of expectations w.r.t. q

i○ Intractable in

general● Factored form limits

the power of the approximation

Issues with Mean Field VB

Page 18: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Requires analytical solutions of expectations w.r.t. q

i○ Intractable in

general● Factored form limits

the power of the approximation

Issues with Mean Field VB

Solution: Auto-Encoding Variational Bayes(Kingma and Welling, 2013)

Page 19: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Requires analytical solutions of expectations w.r.t. q

i○ Intractable in

general● Factored form limits

the power of the approximation

Issues with Mean Field VB

Solution:Variational Inference with Normalizing Flows(Rezende and Mohamed, 2015)

Solution: Auto-Encoding Variational Bayes(Kingma and Welling, 2014)

Page 20: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

Auto-Encoding Variational Bayes1

High-level idea:

1) Optimizing the same lower bound that we get in VB

2) Data augmentation trick leads to lower-variance estimator

3) Lots of choices of q(z|x) and p(z) lead to partial closed-form

4) Use a neural network to parameterize qϕ(z | x) and pθ(x | z)

5) SGD to fit everything

[1] Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR, 2014.

Page 21: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Given N iid data points, (x1, ... , xn)

● Maximize the marginal likelihood:

log pθ(x1,...,xn) = Σi log pθ(x(i))

1) VB Lower Bound

Page 22: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Given N iid data points, (x1, ... , xn)

● Maximize the marginal likelihood:

log pθ(x1,...,xn) = Σi log pθ(x(i))

1) VB Lower Bound

Page 23: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Given N iid data points, (x1, ... , xn)

● Maximize the marginal likelihood:

log pθ(x1,...,xn) = Σi log pθ(x(i))

1) VB Lower Bound

Always positive

Page 24: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Given N iid data points, (x1, ... , xn)

● Maximize the marginal likelihood:

log pθ(x1,...,xn) = Σi log pθ(x(i))

1) VB Lower Bound

Always positive

Lower bound

Page 25: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

1) VB Lower Bound

Page 26: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

1) VB Lower Bound

Anyone want the derivation?

Page 27: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

● Rewrite lower bound

1) VB Lower Bound

Page 28: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

● Rewrite lower bound

1) VB Lower Bound

Page 29: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

● Rewrite lower bound

1) VB Lower Bound

Derivation?

Page 30: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

● Rewrite lower bound

● Monte Carlo gradient estimator of expectation part

1) VB Lower Bound

Page 31: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Write lower bound

● Rewrite lower bound

● Monte Carlo gradient estimator of expectation part○ Too high variance

1) VB Lower Bound

Page 32: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Rewrite qϕ(z(l) | x)

● Separate q into a deterministic function of x and an auxiliary noise variable ϵ

● Leads to lower variance estimator

2) Reparameterization trick

Page 33: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Example: univariate Gaussian

● Can rewrite as sum of mean and a scaled noise variable

2) Reparameterization trick

Page 34: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Lots of distributions like this. Three classes given:○ Tractable inverse

CDF○ Location-scale○ Composition

2) Reparameterization trick Exponential, Cauchy, Logistic,

Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang

Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular, Gaussian

Log-Normal (exponentiated normal)Gamma (sum of exponentials)Dirichlet (sum of Gammas)Beta, Chi-Squared, F

Page 35: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Yields a new MC estimator

2) Reparameterization trick

Page 36: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Plug estimator into the lower bound eq.

● KL term often can be integrated analytically○ Careful choice of

priors

2) Reparameterization trick

Page 37: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Plug estimator into the lower bound eq.

● KL term often can be integrated analytically○ Careful choice of

priors

2) Reparameterization trick

Page 38: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● KL term often can be integrated analytically○ Careful choice of

priors○ E.g. both Gaussian

3) Partial closed form

Page 39: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Regularizer ● Reconstruction error

● Neural nets○ Encode: q(z | x)○ Decode: p(x | z)

4) Auto-encoder connection

Page 40: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● q(z | x) encodes● p(x | z) decodes● “Information layer(s)”

need to compress○ Reals = infinite info○ Reals + random

noise = finite info

4) Auto-encoder connection (alt.)

More info in Karol Gregor’s Deep Mind lecture: https://www.youtube.com/watch?v=P78QYjWh5sM

Page 41: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Deep networks parameterize both q(z | x) and p(x | z)

● Lower-variance estimator of expected log-likelihood

● Can choose from lots of families of q(z | x) and p(z)

Where are we with VI now? (2013’ish)

Page 42: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Problem:○ Most parametric families

available are simple○ E.g. product of independent

univariate Gaussians○ Most posteriors are complex

Where are we with VI now? (2013’ish)

Page 43: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

Variational Inference with Normalizing Flows1

High-level idea:

1) VAEs are great, but our posterior q(z|x) needs to be simple

2) Take simple q(z | x) and apply series of k transformations to z to get q_k(z | x). Metaphor: z “flows” through each transform.

3) Be clever in choice of transforms (computational issue)

4) Variational posterior q now converges to true posterior p

5) Deep NN now parameterizes q and flow parameters[1] Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing flows." arXiv preprint arXiv:1505.05770 (2015)..

Page 44: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Function that transforms a probability density through a sequence of invertible mappings

What is a normalizing flow?

q0(z | x)

qk(z | x)

Page 45: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Chain rule lets us write q

k as product of

q0 and inverted determinants

Key equations (1)

Page 46: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Density qk(z’)

obtained by successively composing k transforms

Key equations (2)

Page 47: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Log likelihood of qk(z’)

has a nice additive form

Key equations (3)

Page 48: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Expectation over qk

can be written as an expectation under q

0

● Cute name: law of the unconscious statistician (LOTUS)

Key equations (4)

Page 49: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

Types of flows

1) Infinitesimal Flows:○ Can show convergence in the limit○ Skipping (theoretical; computationally

expensive)

2) Invertible Linear-Time Flows:○ log-det can be calculated efficiently

Page 50: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Applies the transform:

where:

Planar Flows

Page 51: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● Applies the transform:

where:

Radial Flows

Page 52: Deep Variational Inference - University of Texas at Austinml/flare/extra/slides-deepvarinf.pdf · Deep Variational Inference FLARE Reading Group Presentation ... Auto-Encoding Variational

● VI approx. p(x) via latent variable model ○ p(x) = Σ

z p(z)p(x | z)

● VAE introduces an auto-encoder approach○ Reparameterization trick makes it feasible○ Deep NNs parameterize q(z | x) and p(x | z)

● NF takes q(z|x) from simple to complex○ Series of linear-time transforms○ Convergence in the limit

Summary