Stochastic Gradient VB and the Variational Auto-Encoder


  • D.P. Kingma

    Stochastic Gradient VB and the Variational Auto-Encoder

    Durk Kingma, Ph.D. candidate (2nd year), advised by Max Welling

    Kingma, Diederik P., and Max Welling. “Stochastic Gradient VB and the variational auto-encoder.” (arXiv)

    Quite similar: Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic back propagation and variational inference in deep latent Gaussian models.” (arXiv)

  • Slide 2


    ● Stochastic Variational Inference and learning – SGVB algorithm

    ● Variational auto-encoder – Experiments

    ● Reparameterizations – Effect on posterior correlations

  • Slide 4

    General setup

    ● Setup:
    – x : observed variables
    – z : unobserved/latent variables
    – θ : model parameters
    – pθ(x,z) : joint PDF
    ● Factorized, differentiable
    ● Factors can be anything, e.g. neural nets
    ● We want:
    – Fast approximate posterior inference of pθ(z|x)
    – To learn the parameters θ (e.g. a MAP estimate)
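As a minimal concrete instance of this setup (my own toy example, not from the slides): a scalar latent z with a standard-normal prior and a Gaussian likelihood around z, giving a factorized, differentiable joint PDF. In general, each factor could be a neural net.

```python
import numpy as np

# Toy factorized model (assumed for illustration):
#   prior       p(z)   = N(0, 1)
#   likelihood  p(x|z) = N(z, 1)
# Both factors are differentiable in their arguments.

def log_normal(v, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def log_joint(x, z):
    """log p_theta(x, z) = log p(z) + log p(x|z): the factorized joint PDF."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

print(log_joint(2.0, 1.0))
```

For this model the posterior p(z|x) happens to be tractable, which is convenient for checking approximations; the point of SGVB is that it does not rely on that.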


  • Slide 5


  • Slide 6

    ● Regular EM requires a tractable pθ(z|x)
    ● Monte Carlo EM (MCEM) requires sampling from the posterior (slow...)
    ● Mean-field VB requires closed-form solutions to certain expectations of the joint PDF

  • Slide 7

    Naive pure MAP optimization approach

    Overfits with high dimensionality of the latent space

  • Slide 8

    Novel approach: Stochastic Gradient VB

    ● Optimizes a lower bound of the marginal likelihood of the data
    ● Scales to very large datasets
    ● Scales to high-dimensional latent spaces
    ● Simple
    ● Fast!
    ● Applies to almost any normalized model with continuous latent variables


  • D.P. Kingma

    The Variational Bound

    ● We introduce the variational approximation qφ(z|x):
    – The distribution can be almost anything (we use a Gaussian)
    – It will approximate the true (but intractable) posterior

    The marginal likelihood can be written as:

    log pθ(x) = D_KL( qφ(z|x) || pθ(z|x) ) + L(θ, φ; x)

    where the variational lower bound is L(θ, φ; x) = E_qφ(z|x)[ log pθ(x, z) − log qφ(z|x) ].

    This bound is exactly what we want to optimize! (w.r.t. φ and θ)
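The bound can be checked numerically. The sketch below assumes a toy model (my own choice, not from the slides) with p(z) = N(0,1) and p(x|z) = N(z,1), so the exact marginal p(x) = N(0,2) is available in closed form, and a deliberately suboptimal Gaussian q; the Monte Carlo estimate of the bound sits below the exact marginal likelihood, the gap being the KL term.

```python
import numpy as np

np.random.seed(0)

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

x = 2.0
log_px = log_normal(x, 0.0, np.sqrt(2.0))   # exact log marginal likelihood

# Variational approximation q_phi(z) = N(mu, sigma^2), deliberately suboptimal
# (the exact posterior for this model is N(1, 0.5)).
mu, sigma = 0.8, 0.8
z = mu + sigma * np.random.randn(200_000)   # samples from q
elbo = np.mean(log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
               - log_normal(z, mu, sigma))  # E_q[log p(x,z) - log q(z)]

print(elbo, log_px)  # the bound lies below the exact value
```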

  • D.P. Kingma

    “Naive” Monte Carlo estimator of the bound

    Problem: not appropriate for differentiation w.r.t. φ! (We cannot differentiate through the sampling process.)

    Recently proposed solutions (2013–2014):
    – Michael Jordan / David Blei (very high variance)
    – Tim Salimans (2013): only applies to exponential-family q
    – Rajesh Ranganath et al., “Black Box Variational Inference”, arXiv 2014
    – Andriy Mnih & Karol Gregor, “Neural Variational Inference and Learning”, arXiv 2014

  • Slide 11

    Key “reparameterization trick”

    Alternative way of sampling from qφ(z):
    1. Sample ε ~ p(ε) (independent of φ!)
    2. Set z = g(φ, ε)
    such that z ~ qφ(z) (the correct distribution)

    Example        qφ(z)            p(ε)          g(φ, ε)              Also...
    Normal dist.   z ~ N(μ,σ)       ε ~ N(0,1)    z = μ + σ·ε          Location-scale family: Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular, ...
    Exponential    z ~ Exp(λ)       ε ~ U(0,1)    z = -log(1 - ε)/λ    Invertible CDF: Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang, ...
    Other          z ~ logN(μ,σ)    ε ~ N(0,1)    z = exp(μ + σ·ε)     Gamma, Dirichlet, Beta, Chi-Squared, and F distributions
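The two rows of the table with explicit formulas can be verified with a quick simulation (plain NumPy, my own sketch):

```python
import numpy as np

np.random.seed(0)
n = 1_000_000

# Location-scale row: z = mu + sigma * eps gives z ~ N(mu, sigma^2).
mu, sigma = 2.0, 0.5
eps = np.random.randn(n)         # eps ~ N(0,1), independent of (mu, sigma)
z = mu + sigma * eps             # deterministic and differentiable in (mu, sigma)
print(z.mean(), z.std())         # close to mu and sigma

# Invertible-CDF row: z = -log(1 - eps)/lam gives z ~ Exponential(lam).
lam = 3.0
u = np.random.rand(n)            # eps ~ U(0,1)
z_exp = -np.log(1.0 - u) / lam
print(z_exp.mean())              # close to 1/lam
```

The key property is visible in the code: randomness enters only through ε, so z is a deterministic, differentiable function of the parameters.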

  • D.P. Kingma

    SGVB estimator

    L(θ, φ; x) ≈ (1/L) Σ_{l=1..L} [ log pθ(x, z(l)) − log qφ(z(l)|x) ],  where z(l) = g(φ, ε(l)) and ε(l) ~ p(ε)

    Really simple, and appropriate for differentiation w.r.t. φ and θ!

  • Slide 13

    Basic SGVB Algorithm (L=1)

    Repeat until convergence:
    1. Sample ε ~ p(ε) and set z = g(φ, ε)
    2. Compute the SGVB estimate of the bound and its gradients w.r.t. θ and φ
    3. Take a gradient ascent step on θ and φ

    (Gradients can be computed with automatic differentiation, e.g. Torch7 or Theano.)
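A sketch of what such a loop can look like, for an assumed toy model (p(z) = N(0,1), p(x|z) = N(z,1), one observed x; not from the slides), with the gradients of the estimator written out by hand rather than by Torch7/Theano. The exact posterior here is N(1, 0.5), so the fitted q should approach it:

```python
import numpy as np

np.random.seed(0)

# SGVB loop with L = 64 samples per step; phi = (mu, log_sigma) parameterizes
# q(z) = N(mu, sigma^2), and z = mu + sigma * eps is the reparameterized sample.
x = 2.0
mu, log_sigma = 0.0, 0.0
lr, L = 0.02, 64

for step in range(5000):
    sigma = np.exp(log_sigma)
    eps = np.random.randn(L)
    z = mu + sigma * eps
    # For this model, d/dz log p(x, z) = -z + (x - z) = x - 2z.
    dlogp_dz = x - 2.0 * z
    grad_mu = np.mean(dlogp_dz)                              # dz/dmu = 1
    grad_log_sigma = np.mean(dlogp_dz * sigma * eps) + 1.0   # + entropy gradient
    mu += lr * grad_mu                                       # gradient *ascent*
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))  # approaches the exact posterior N(1, sqrt(0.5))
```

In practice the two gradient lines are exactly what an autodiff framework produces from the estimator, which is why the algorithm stays this short.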

  • D.P. Kingma

    “Auto-Encoding” VB: efficient on-line version of SGVB

    ● Special case of SGVB: a large i.i.d. dataset (large N) => many variational parameters to learn

    ● Solution:
    – Use a conditional qφ(z|x) (a neural network)
    – Doubly stochastic optimization

    Avoid local parameters!

  • Slide 16

    “Auto-encoding” Stochastic VB (L=1)

    Repeat until convergence.

    Scales to very large datasets!
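A sketch of the “avoid local parameters” idea, with assumed (untrained) weights and shapes of my own choosing: a single recognition network maps every datapoint to its own variational parameters, and each step is doubly stochastic (a random minibatch plus fresh noise ε).

```python
import numpy as np

np.random.seed(0)

# Instead of one (mu_i, sigma_i) per datapoint, one small network q_phi(z|x)
# produces them; only the network weights are learned.
N, D, H, Dz = 10_000, 5, 8, 2
X = np.random.randn(N, D)            # stand-in for a large i.i.d. dataset

W1, b1 = 0.1 * np.random.randn(D, H), np.zeros(H)
Wm, bm = 0.1 * np.random.randn(H, Dz), np.zeros(Dz)
Ws, bs = 0.1 * np.random.randn(H, Dz), np.zeros(Dz)

def recognition_model(x):
    """q_phi(z|x): maps a batch of x's to per-datapoint (mu, sigma)."""
    h = np.tanh(x @ W1 + b1)
    return h @ Wm + bm, np.exp(h @ Ws + bs)

# One doubly stochastic step: random minibatch of data AND random noise eps.
batch = X[np.random.choice(N, size=64, replace=False)]
mu, sigma = recognition_model(batch)
z = mu + sigma * np.random.randn(64, Dz)   # reparameterized z ~ q(z|x)
print(z.shape)                             # one latent sample per minibatch point
```

The number of variational parameters is now the (fixed) size of the network, independent of N, which is what makes the method on-line.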

  • Slide 17

    Experiments with the “variational auto-encoder”

    [Figure: the variational auto-encoder. A generative model p(x|z) (a neural net) is combined with a posterior approximation q(z|x) (also a neural net); z is sampled via z = g(φ, ε, x) with ε ~ p(ε). The objective decomposes into a (noisy) negative reconstruction error and a regularization term.]

  • Slide 18


  • Slide 19

    Results: Marginal likelihood lower bound

  • Slide 20

    Results: Marginal log-likelihood

    MCEM does not scale well to large datasets

  • Slide 21

    Robustness to high-dimensional latent space

  • Slide 22

  • Slide 23

    Learned 2D manifolds

  • Slide 24

    Learned 3D manifold

  • Slide 25

    Samples from MNIST

  • Slide 26

    Reparameterizations of latent variables

  • Slide 27

    Reparameterization of continuous latent variables

    ● Alternative parameterization of the latent variables. Choose:
    – ε ~ p(ε)
    – z = g(φ, ε) (invertible)
    such that z | pa(z) ~ p(z | pa(z)) (the correct distribution)

    ● The z's become deterministic given the ε's
    ● The ε's are a priori independent

    Large difference in posterior dependencies and efficiency

    Centered form vs. non-centered form (a neural net with injected noise)
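A small simulation (my own two-variable example, not the DBN of the next slide) of why the two forms differ: in the centered form the latent variables are strongly dependent a priori, while the ε's of the non-centered form are independent by construction.

```python
import numpy as np

np.random.seed(0)

# Centered form:      z1 ~ N(0,1),  z2 | z1 ~ N(z1, 0.1^2)
# Non-centered form:  eps1, eps2 ~ N(0,1),  z1 = eps1,  z2 = z1 + 0.1 * eps2
n = 100_000
eps1 = np.random.randn(n)
eps2 = np.random.randn(n)
z1 = eps1
z2 = z1 + 0.1 * eps2                 # z's are deterministic given the eps's

corr_centered = np.corrcoef(z1, z2)[0, 1]         # close to 1: strong dependence
corr_noncentered = np.corrcoef(eps1, eps2)[0, 1]  # close to 0: independent
print(corr_centered, corr_noncentered)
```

Samplers and gradient methods that operate on the ε's therefore face a much better-conditioned distribution, which is one way to read the mixing difference on the next slide.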

  • D.P. Kingma

    Experiment: MCMC sampling in a DBN

    [Figure: samples and their autocorrelation under both parameterizations. Centered form: terribly slow mixing. Non-centered form: fast mixing.]

  • D.P. Kingma

    For more information and analysis, see:

    Diederik P. Kingma and Max Welling, “Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets”

  • Slide 30

    ● SGVB: an efficient stochastic variational algorithm for inference and learning with continuous latent variables.

    ● Theano and pure numpy implementations are available (including scikit-learn wrappers).


  • Slide 31

    Appendix: The regular SVB gradient estimator
