
Stochastic Gradient VB and the Variational Auto-Encoder

Durk Kingma, Ph.D. candidate (2nd year), advised by Max Welling

Kingma, Diederik P., and Max Welling. “Stochastic Gradient VB and the variational auto-encoder.” (arXiv)

Quite similar: Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic back-propagation and variational inference in deep latent Gaussian models.” (arXiv)


Contents

● Stochastic variational inference and learning

– SGVB algorithm

● Variational auto-encoder

– Experiments

● Reparameterizations

– Effect on posterior correlations


General setup

● Setup:

– x : observed variables

– z : unobserved/latent variables

– θ : model parameters

– pθ(x,z): joint PDF

● Factorized, differentiable

● Factors can be anything, e.g. neural nets

● We want:

– Fast approximate inference of the posterior pθ(z|x)

– Learn the parameters θ (e.g. MAP estimate)

Example:


Example
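The example figure on this slide is not in the transcript. As a stand-in, here is a minimal hypothetical sketch (not necessarily the model shown) of a factorized, differentiable pθ(x, z) whose likelihood factor is a small neural net, in plain numpy:

import numpy as np

rng = np.random.RandomState(0)
D_z, D_h, D_x = 2, 16, 5                              # latent, hidden, observed dimensions
W1, b1 = 0.1 * rng.randn(D_h, D_z), np.zeros(D_h)     # theta: weights of the p(x|z) net
W2, b2 = 0.1 * rng.randn(D_x, D_h), np.zeros(D_x)

def log_p_z(z):
    # log of the prior factor p(z) = N(0, I)
    return -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2 * np.pi)

def log_p_x_given_z(x, z):
    # log of the likelihood factor p(x|z) = N(x; MLP(z), I), an MLP with one tanh layer
    mean = W2 @ np.tanh(W1 @ z + b1) + b2
    return -0.5 * np.sum((x - mean)**2) - 0.5 * len(x) * np.log(2 * np.pi)

z = rng.randn(D_z)                                    # ancestral sample from the joint
x = (W2 @ np.tanh(W1 @ z + b1) + b2) + rng.randn(D_x)
print(log_p_z(z) + log_p_x_given_z(x, z))             # log p_theta(x, z), differentiable in theta and z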


Learning

– Regular EM requires tractable pθ(z|x)

– Monte Carlo EM (MCEM) requires sampling from the posterior (slow....)

– Mean-field VB requires closed-form solutions to certain expectations of the joint PDF


Naive pure MAP optimization approach

Overfits when the latent space is high-dimensional


Novel approach: Stochastic Gradient VB

● Optimizes a lower bound of the marginal likelihood of the data

● Scales to very large datasets

● Scales to high-dimensional latent space

● Simple

● Fast!

● Applies to almost any normalized model with continuous latent variables


The Variational Bound

● We introduce a variational approximation, qφ(z):

– Distribution can be almost anything (we use Gaussian)

– Will approximate the true (but intractable) posterior

Marginal likelihood can be written as:
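The equations on this slide did not survive the transcript; in the notation used here, the decomposition from the paper is:

log pθ(x) = DKL( qφ(z) || pθ(z|x) ) + L(θ, φ)

L(θ, φ) = E_qφ(z)[ log pθ(x, z) − log qφ(z) ]  ≤  log pθ(x)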

This bound is exactly what we want to optimize! (w.r.t. φ and θ)


“Naive” Monte Carlo estimator of the bound
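The estimator itself is not reproduced in the transcript; from the paper, the naive Monte Carlo estimate of the bound (with L samples) is:

L(θ, φ) ≈ (1/L) Σ_l [ log pθ(x, z(l)) − log qφ(z(l)) ],   where z(l) ~ qφ(z)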

Problem: not appropriate for differentiation w.r.t. φ! (We cannot differentiate through the sampling process.)

Recently proposed solutions (2013)

– Michael Jordan / David Blei (very high variance)

– Tim Salimans (2013) (only applies to exponential-family q)

– Rajesh Ranganath et al,“Black Box Variational Inference”, arXiv 2014

– Andriy Mnih & Karol Gregor, “Neural Variational Inference and Learning”, arXiv 2014


Key “reparameterization trick”

Example        qφ(z)           p(ε)          g(φ, ε)              Also applies to
Normal dist.   z ~ N(μ,σ)      ε ~ N(0,1)    z = μ + σ * ε        Location-scale family: Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular, ...
Exponential    z ~ exp(λ)      ε ~ U(0,1)    z = -log(1 – ε)/λ    Invertible CDF: Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang, ...
Other          z ~ logN(μ,σ)   ε ~ N(0,1)    z = exp(μ + σ * ε)   Gamma, Dirichlet, Beta, Chi-Squared, and F distributions

Alternative way of sampling from qφ(z):

1. Sample ε ~ p(ε) (independent of φ!)

2. Compute z = g(φ, ε)

Such that z ~ qφ(z) (the correct distribution)
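A minimal runnable sketch of the trick for the Gaussian row of the table above (the values of μ and σ are arbitrary, chosen only for illustration):

import numpy as np

rng = np.random.RandomState(0)
mu, sigma = 1.5, 0.5          # phi = (mu, sigma), the variational parameters

eps = rng.randn(100000)       # 1. sample noise eps ~ N(0, 1), independent of phi
z = mu + sigma * eps          # 2. deterministic transform z = g(phi, eps)

# z is distributed as q_phi(z) = N(mu, sigma^2), but is a differentiable
# function of phi for every fixed eps (dz/dmu = 1, dz/dsigma = eps).
print(z.mean(), z.std())      # approximately mu and sigma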


SGVB estimator
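The estimator is not reproduced in the transcript; from the paper, plugging the reparameterized samples into the naive estimator gives the SGVB estimator:

L(θ, φ) ≈ (1/L) Σ_l [ log pθ(x, z(l)) − log qφ(z(l)) ],   where z(l) = g(φ, ε(l)),  ε(l) ~ p(ε)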

Really simple and appropriate for differentiation w.r.t. φ and θ!


Basic SGVB Algorithm (L=1)

repeat

until convergence

Torch7 / Theano
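The loop body on this slide is not in the transcript. Below is a minimal runnable sketch of the L=1 loop for a hypothetical toy model (not the slide's code): prior p(z) = N(0,1), likelihood pθ(x|z) = N(θz, 1), and qφ(z) = N(μ, σ²) with φ = (μ, log σ). The gradients of the single-sample bound estimate are written out by hand; in practice Torch7 or Theano would compute them by automatic differentiation.

import numpy as np

rng = np.random.RandomState(0)
x = 2.0                                  # a single observed datapoint
theta, mu, log_sigma = 0.5, 0.0, 0.0     # model parameter theta, variational parameters phi
lr = 0.01

for step in range(5000):
    eps = rng.randn()                    # 1. sample noise, independent of phi
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                 # 2. reparameterized sample z ~ q_phi(z)

    # gradients of the single-sample bound L~ = log p(z) + log p(x|z) - log q(z)
    dL_dz = -z + theta * (x - theta * z)          # d/dz [log p(z) + log p(x|z)]
    grad_theta = (x - theta * z) * z
    grad_mu = dL_dz                               # dz/dmu = 1
    grad_log_sigma = dL_dz * sigma * eps + 1.0    # dz/dsigma = eps, plus the entropy term

    # 3. stochastic gradient ascent on the bound w.r.t. theta and phi = (mu, log_sigma)
    theta += lr * grad_theta
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

# at a (stochastic) optimum, q matches the exact Gaussian posterior for the learned theta
print("theta, mu, sigma:", theta, mu, np.exp(log_sigma))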


“Auto-Encoding” VB: efficient on-line version of SGVB

● Special case of SGVB:

Large i.i.d. dataset (large N) => many variational parameters to learn

● Solution:

– Use a conditional, qφ(z|x):

● Neural network

– Doubly stochastic optimization

Avoids local (per-datapoint) variational parameters!
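“Doubly stochastic” refers to noise from both ε and minibatch subsampling. For a random minibatch of M datapoints from the N-point dataset, the paper estimates the full-dataset bound as (equation not in the transcript):

L(θ, φ; X) ≈ (N/M) Σ_{i=1..M} L̃(θ, φ; x(i)),   where L̃ is the per-datapoint SGVB estimate from the previous slides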


“Auto-encoding” Stochastic VB (L=1)

repeat

until convergence

Scales to very large datasets!


Experiments with the “variational auto-encoder”

[Figure: two neural networks. The generative model p(x|z) (neural net) maps z to x; the posterior approximation q(z|x) (neural net) maps x and noise ε ~ p(ε) through hidden layers h1 and h2 to z = g(φ, ε, x).]

Generative model p(x|z) (neural net)

Posterior approximation q(z|x) (neural net)

Objective: (noisy) negative reconstruction error + regularization term
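A minimal numpy sketch of this architecture and objective, for a Bernoulli (MNIST-like) decoder, a Gaussian encoder, and a N(0, I) prior. Layer sizes, initialization, and names are illustrative assumptions, not taken from the slide; only the forward computation of the single-sample bound is shown, not training.

import numpy as np

rng = np.random.RandomState(0)
D_x, D_h, D_z = 784, 200, 20

# q(z|x): encoder net producing mu and log-variance (phi)
We, be = 0.01 * rng.randn(D_h, D_x), np.zeros(D_h)
Wm, bm = 0.01 * rng.randn(D_z, D_h), np.zeros(D_z)
Wv, bv = 0.01 * rng.randn(D_z, D_h), np.zeros(D_z)
# p(x|z): decoder net producing Bernoulli probabilities (theta)
Wd, bd = 0.01 * rng.randn(D_h, D_z), np.zeros(D_h)
Wo, bo = 0.01 * rng.randn(D_x, D_h), np.zeros(D_x)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def elbo(x):
    # encoder: q(z|x) = N(mu, diag(exp(logvar)))
    h = np.tanh(We @ x + be)
    mu, logvar = Wm @ h + bm, Wv @ h + bv

    # reparameterized sample: z = g(phi, eps, x) = mu + sigma * eps
    eps = rng.randn(D_z)
    z = mu + np.exp(0.5 * logvar) * eps

    # decoder: p(x|z) = Bernoulli(y)
    y = sigmoid(Wo @ np.tanh(Wd @ z + bd) + bo)

    # (noisy) negative reconstruction error: log p(x|z)
    log_px_z = np.sum(x * np.log(y + 1e-10) + (1 - x) * np.log(1 - y + 1e-10))
    # regularization term: -KL( q(z|x) || N(0, I) ), available in closed form for Gaussians
    neg_kl = 0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return log_px_z + neg_kl   # single-sample estimate of the variational lower bound

x = (rng.rand(D_x) < 0.5).astype(float)   # a dummy binary "image" in place of MNIST
print(elbo(x))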


Experiments


Results: Marginal likelihood lower bound


Results: Marginal log-likelihood

MCEM does not scale well to large datasets


Robustness to high-dimensional latent space


Learned 2D manifolds


Learned 3D manifold


Samples from MNIST


Reparameterizations of latent variables


Reparameterization of continuous latent variables

● Alternative parameterization of latent variables.

Choose some:

– ε ~ p(ε)

– z = g(φ, ε) (invertible)

– such that z | pa ~ p(z|pa) (the correct distribution)

● z's become deterministic given ε's

● ε's are a priori independent

This makes a large difference in posterior dependencies and efficiency

[Figure: centered form vs. non-centered form (a neural net with injected noise)]
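As a concrete (hypothetical) Gaussian instance of the two forms, with pa the parents of z:

Centered form:      z | pa ~ N( μ(pa), σ² )

Non-centered form:  ε ~ N(0, 1),   z = μ(pa) + σ * ε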


Experiment: MCMC sampling in a DBN

[Figure: MCMC samples and autocorrelation for the centered form vs. the non-centered form; one mixes fast, the other terribly slowly]


For more information and analysis, see:

Diederik P. Kingma and Max Welling, “Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets”


Conclusion

● SGVB: efficient stochastic variational algorithm for inference and learning with continuous latent variables.

Theano and pure numpy implementations:

https://github.com/y0ast/Variational-Autoencoder.git

(includes scikit-learn wrappers)

Thanks!


Appendix: The regular SVB gradient estimator