Stochastic Gradient VB and the Variational Auto-Encoder


Transcript of "Stochastic Gradient VB and the Variational Auto-Encoder"

Page 1

D.P. Kingma

Stochastic Gradient VB and the Variational Auto-Encoder

Durk Kingma, Ph.D. Candidate (2nd year), advised by Max Welling

Kingma, Diederik P., and Max Welling. "Stochastic Gradient VB and the variational auto-encoder." (arXiv)

Quite similar: Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and variational inference in deep latent Gaussian models." (arXiv)

Page 2

Contents

● Stochastic variational inference and learning

– SGVB algorithm

● Variational auto-encoder

– Experiments

● Reparameterizations

– Effect on posterior correlations

Page 3

General setup

● Setup:

– x : observed variables

– z : unobserved/latent variables

– θ : model parameters

– pθ(x,z): joint PDF

● Factorized, differentiable

● Factors can be anything, e.g. neural nets

● We want:

– Fast approximate inference of the posterior pθ(z|x)

– Learn the parameters θ (e.g. MAP estimate)

Example:
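The example on this slide is a figure that did not survive the transcript. As an illustration only (not the slide's actual example), a model of this form with neural-net factors might be sketched as follows; all layer sizes and parameter names here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny generative model p_theta(x, z) = p(z) * p(x|z):
# prior p(z) = N(0, I); likelihood p(x|z) = Bernoulli(sigmoid(...)),
# with a one-hidden-layer neural net mapping z to pixel probabilities.
d_z, d_h, d_x = 2, 8, 16
theta = {
    "W1": rng.standard_normal((d_h, d_z)) * 0.1, "b1": np.zeros(d_h),
    "W2": rng.standard_normal((d_x, d_h)) * 0.1, "b2": np.zeros(d_x),
}

def sample(theta):
    z = rng.standard_normal(d_z)                   # z ~ p(z)
    h = np.tanh(theta["W1"] @ z + theta["b1"])     # differentiable factor
    p = 1.0 / (1.0 + np.exp(-(theta["W2"] @ h + theta["b2"])))
    x = (rng.uniform(size=d_x) < p).astype(float)  # x ~ p(x|z)
    return x, z

x, z = sample(theta)
print(x.shape, z.shape)
```

Every factor is differentiable in θ, which is what the SGVB machinery below exploits.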

Page 4

Example

Page 5

Learning

– Regular EM requires tractable pθ(z|x)

– Monte Carlo EM (MCEM) requires sampling from the posterior (slow)

– Mean-field VB requires closed-form solutions to certain expectations of the joint PDF

Page 6

Naive pure MAP optimization approach

Overfits when the latent space is high-dimensional

Page 7

Novel approach: Stochastic Gradient VB

● Optimizes a lower bound of the marginal likelihood of the data

● Scales to very large datasets

● Scales to high-dimensional latent spaces

● Simple

● Fast!

● Applies to almost any normalized model with continuous latent variables

Page 8

The Variational Bound

● We introduce a variational approximation qφ(z|x):

– Distribution can be almost anything (we use Gaussian)

– Will approximate the true (but intractable) posterior

Marginal likelihood can be written as:

This bound is exactly what we want to optimize (w.r.t. φ and θ)!
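The equations themselves were shown on the slide and did not survive the transcript. Reconstructed in the paper's notation, the marginal likelihood decomposes as:

```latex
\log p_\theta(\mathbf{x}) = D_{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big) + \mathcal{L}(\theta, \phi; \mathbf{x}),
\qquad
\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}, \mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x})\big]
```

Since the KL divergence is non-negative, L(θ, φ; x) ≤ log pθ(x): a lower bound on the marginal likelihood.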

Page 9

“Naive” Monte Carlo estimator of the bound

Problem: not appropriate for differentiation w.r.t. φ! (Cannot differentiate through the sampling process.)
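The estimator itself was an equation on the slide; reconstructed from the paper's notation, the naive Monte Carlo estimate draws samples from the approximate posterior and averages:

```latex
\mathcal{L}(\theta, \phi; \mathbf{x}) \simeq \frac{1}{L} \sum_{l=1}^{L} \Big[ \log p_\theta(\mathbf{x}, \mathbf{z}^{(l)}) - \log q_\phi(\mathbf{z}^{(l)}|\mathbf{x}) \Big],
\qquad \mathbf{z}^{(l)} \sim q_\phi(\mathbf{z}|\mathbf{x})
```

The samples z⁽ˡ⁾ depend on φ through the sampling process itself, so this estimate cannot be differentiated w.r.t. φ directly.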

Recently proposed solutions (2013–2014):

– Michael Jordan / David Blei (very high variance)

– Tim Salimans (2013) (only applies to exponential-family q)

– Rajesh Ranganath et al,“Black Box Variational Inference”, arXiv 2014

– Andriy Mnih & Karol Gregor, “Neural Variational Inference and Learning”, arXiv 2014

Page 10

Key “reparameterization trick”

Example | qφ(z) | p(ε) | g(φ, ε) | Also...
Normal dist. | z ~ N(μ,σ) | ε ~ N(0,1) | z = μ + σ·ε | Location-scale family: Laplace, Elliptical, Student's t, Logistic, Uniform, Triangular, ...
Exponential | z ~ Exp(λ) | ε ~ U(0,1) | z = -log(1 - ε)/λ | Invertible CDF: Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang, ...
Other | z ~ logN(μ,σ) | ε ~ N(0,1) | z = exp(μ + σ·ε) | Gamma, Dirichlet, Beta, Chi-squared, and F distributions

Alternative way of sampling from qφ(z):

1. Choose some ε ~ p(ε) (independent of φ!)

2. Choose some z = g(φ, ε)

Such that z ~ qφ(z) (the correct distribution)
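A minimal numeric check of two rows of the table (variable names here are illustrative, not from the slides): sampling ε independently of φ and pushing it through g(φ, ε) reproduces the target distribution, while making z a deterministic, differentiable function of φ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5           # phi = (mu, sigma)

# Normal row: eps ~ N(0,1), z = mu + sigma * eps, so z ~ N(mu, sigma).
eps = rng.standard_normal(100_000)   # independent of phi
z = mu + sigma * eps                 # z = g(phi, eps)
print(z.mean(), z.std())             # close to mu = 1.5, sigma = 0.5

# Exponential row (inverse CDF): u ~ U(0,1), z = -log(1 - u)/lam ~ Exp(lam).
lam = 2.0
u = rng.uniform(size=100_000)
z_exp = -np.log(1.0 - u) / lam
print(z_exp.mean())                  # close to 1/lam = 0.5
```

Because ε carries all the randomness, gradients w.r.t. φ pass through g unobstructed.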

Page 11

SGVB estimator

Really simple and appropriate for differentiation w.r.t. φ and θ!
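The estimator was shown as an equation on the slide; reconstructed from the paper, it substitutes z = g(φ, ε) into the naive estimate so that the dependence on φ sits inside a differentiable function rather than in the sampling distribution:

```latex
\widetilde{\mathcal{L}}(\theta, \phi; \mathbf{x}) = \frac{1}{L} \sum_{l=1}^{L} \Big[ \log p_\theta\big(\mathbf{x}, g(\phi, \boldsymbol{\epsilon}^{(l)})\big) - \log q_\phi\big(g(\phi, \boldsymbol{\epsilon}^{(l)}) \,\big|\, \mathbf{x}\big) \Big],
\qquad \boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon})
```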

Page 12

Basic SGVB Algorithm (L=1)

repeat

until convergence

Torch7 / Theano
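The slide's Torch7 and Theano listings did not survive the transcript. A self-contained sketch of the loop under simple assumptions (model p(z) = N(0,1), p(x|z) = N(z,1), Gaussian q with parameters φ = (μ, log σ); gradients of the reparameterized bound derived by hand here rather than by Theano):

```python
import numpy as np

rng = np.random.default_rng(1)
x = 2.0                        # one observed datapoint
mu, log_sigma = 0.0, 0.0       # variational parameters phi
lr, batch = 0.01, 100

# Model: p(z) = N(0,1), p(x|z) = N(z,1); the true posterior is N(x/2, 1/2).
for step in range(5000):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(batch)
    z = mu + sigma * eps                      # reparameterization trick
    # d/dz [log p(z) + log p(x|z)] = -z + (x - z) = x - 2z
    f_prime = x - 2.0 * z
    grad_mu = f_prime.mean()                  # dL/dmu
    grad_sigma = (f_prime * eps).mean() + 1.0 / sigma   # entropy term: +1/sigma
    mu += lr * grad_mu                        # gradient *ascent* on the bound
    log_sigma += lr * grad_sigma * sigma      # chain rule for the log-sigma parameterization

print(mu, np.exp(log_sigma))   # approaches x/2 = 1.0 and sqrt(1/2) ~ 0.707
```

After convergence, q matches the exact posterior here because the model is conjugate; in general it is only the best Gaussian approximation.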

Page 13

“Auto-Encoding” VB: efficient on-line version of SGVB

● Special case of SGVB:

Large i.i.d. dataset (large N) => many variational parameters to learn

● Solution:

– Use a conditional qφ(z|x):

● Neural network

– Doubly stochastic optimization

Avoid local parameters!

Page 14

“Auto-encoding” Stochastic VB (L=1)

repeat

until convergence

Scales to very large datasets!

Page 15

Experiments with the “variational auto-encoder”

[Architecture diagram: x → h1 → z with injected noise ε ~ p(ε), z = g(φ, ε, x); z → h2 → x]

Generative model p(x|z) (neural net)

Posterior approximation q(z|x) (neural net)

Objective:

(noisy) negative reconstruction error + regularization term
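A compact numpy sketch of this objective for a Gaussian q(z|x) and a Bernoulli p(x|z); the layer sizes and names are illustrative, since the slides' actual code is not in the transcript. The regularization term is the negative KL to the N(0, I) prior, which has the closed form ½ Σ (1 + log σ² − μ² − σ²) used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_z = 16, 8, 2

def layer(d_in, d_out):
    return rng.standard_normal((d_out, d_in)) * 0.1, np.zeros(d_out)

# Encoder q(z|x): x -> h1 -> (mu, log_var); decoder p(x|z): z -> h2 -> probs.
We, be = layer(d_x, d_h)
Wm, bm = layer(d_h, d_z)
Wv, bv = layer(d_h, d_z)
Wd, bd = layer(d_z, d_h)
Wo, bo = layer(d_h, d_x)

def elbo(x):
    h1 = np.tanh(We @ x + be)
    mu, log_var = Wm @ h1 + bm, Wv @ h1 + bv
    eps = rng.standard_normal(d_z)
    z = mu + np.exp(0.5 * log_var) * eps        # reparameterization
    h2 = np.tanh(Wd @ z + bd)
    p = 1.0 / (1.0 + np.exp(-(Wo @ h2 + bo)))
    # (noisy) negative reconstruction error: Bernoulli log-likelihood
    rec = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    # regularization term: -KL(q(z|x) || N(0, I)), in closed form
    kl = 0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return rec + kl

x = (rng.uniform(size=d_x) < 0.5).astype(float)
print(elbo(x))   # a finite scalar; maximized during training
```

Training would ascend this objective w.r.t. all weights, e.g. with the automatic gradients the slides' Theano version would provide.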

Page 16

Experiments

Page 17

Results: Marginal likelihood lower bound

Page 18

Results: Marginal log-likelihood

MCEM does not scale well to large datasets

Page 19

Robustness to high-dimensional latent space

Page 20

Page 21

Learned 2D manifolds

Page 22

Learned 3D manifold

Page 23

Samples from MNIST

Page 24

Reparameterizations of latent variables

Page 25

Reparameterization of continuous latent variables

● Alternative parameterization of latent variables.

Choose some:

– ε ~ p(ε)

– z = g(φ, ε) (invertible)

– z | pa ~ p(z|pa) (the correct distribution)

● z's become deterministic given ε's

● ε's are a priori independent

Large differences in posterior dependencies and efficiency

Centered form vs. non-centered form (neural net with injected noise)
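In symbols, for a Gaussian latent variable the two forms are:

```latex
\text{Centered:} \quad z \sim \mathcal{N}(\mu, \sigma^2)
\qquad\qquad
\text{Non-centered:} \quad \epsilon \sim \mathcal{N}(0, 1),\; z = \mu + \sigma\epsilon
```

In the non-centered form the ε's are a priori independent and z is deterministic given ε, which is what turns the Bayes net into a neural net with injected noise.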

Page 26

Experiment: MCMC sampling in a DBN

[Figure: samples and autocorrelation for each form. Centered form: terribly slow mixing. Non-centered form: fast mixing.]

Page 27

For more information and analysis see:

“Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets”

Diederik P. Kingma, Max Welling

Page 28

Conclusion

● SGVB: efficient stochastic variational algorithm for inference and learning with continuous latent variables.

Theano and pure numpy implementations:

https://github.com/y0ast/Variational-Autoencoder.git

(includes scikit-learn wrappers)

Thanks!

Page 29

Appendix: The regular SVB gradient estimator