Stochastic Gradient VB and the Variational Auto-Encoder
Durk Kingma, Ph.D. candidate (2nd year), advised by Max Welling
Kingma, Diederik P., and Max Welling. “Stochastic Gradient VB and the Variational Auto-Encoder.” (arXiv)
Quite similar: Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models.” (arXiv)
Contents
● Stochastic variational inference and learning
– SGVB algorithm
● Variational auto-encoder
– Experiments
● Reparameterizations
– Effect on posterior correlations
General setup
● Setup:
– x : observed variables
– z : unobserved/latent variables
– θ : model parameters
– pθ(x,z): joint PDF
● Factorized, differentiable
● Factors can be anything, e.g. neural nets
● We want:
– Fast approximate posterior inference p(z|x)
– Learn the parameters θ (e.g. MAP estimate)
Example:
Example
Learning
– Regular EM requires tractable pθ(z|x)
– Monte Carlo EM (MCEM) requires sampling from the posterior (slow...)
– Mean-field VB requires closed-form solutions to certain expectations of the joint PDF
Naive pure MAP optimization approach
Overfits when the latent space is high-dimensional
Novel approach: Stochastic Gradient VB
● Optimizes a lower bound of the marginal likelihood of the data
● Scales to very large datasets
● Scales to high-dimensional latent spaces
● Simple
● Fast!
● Applies to almost any normalized model with continuous latent variables
The Variational Bound
● We introduce the variational approximation qφ(z|x):
– Distribution can be almost anything (we use Gaussian)
– Will approximate the true (but intractable) posterior
Marginal likelihood can be written as:
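Following the paper:

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) + \mathcal{L}(\theta, \phi; x)$$

where the variational lower bound is

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x, z) - \log q_\phi(z|x)\big] \leq \log p_\theta(x)$$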
This bound is exactly what we want to optimize! (w.r.t. φ and θ)
“Naive” Monte Carlo estimator of the bound
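Drawing L samples directly from the approximate posterior gives:

$$\mathcal{L}(\theta, \phi; x) \simeq \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\big(x, z^{(l)}\big) - \log q_\phi\big(z^{(l)}|x\big), \qquad z^{(l)} \sim q_\phi(z|x)$$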
Problem: not appropriate for differentiation w.r.t. φ! (We cannot differentiate through the sampling process.)
Recently proposed solutions (2013–2014):
– Michael Jordan / David Blei (very high variance)
– Tim Salimans (2013): only applies to exponential-family q
– Rajesh Ranganath et al., “Black Box Variational Inference” (arXiv 2014)
– Andriy Mnih & Karol Gregor, “Neural Variational Inference and Learning” (arXiv 2014)
Key “reparameterization trick”
Alternative way of sampling from qφ(z):
1. Choose some ε ~ p(ε) (independent of φ!)
2. Let z = g(φ, ε)
such that z ~ qφ(z) (the correct distribution).

| Example | qφ(z) | p(ε) | g(φ, ε) | Also... |
| --- | --- | --- | --- | --- |
| Normal dist. | z ~ N(μ,σ) | ε ~ N(0,1) | z = μ + σ·ε | Location-scale family: Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular, ... |
| Exponential | z ~ Exp(λ) | ε ~ U(0,1) | z = −log(1 − ε)/λ | Invertible CDF: Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang, ... |
| Other | z ~ logN(μ,σ) | ε ~ N(0,1) | z = exp(μ + σ·ε) | Gamma, Dirichlet, Beta, Chi-Squared, and F distributions |
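A minimal numpy sketch of the first two table rows (an illustration, not from the slides; the parameter values and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Row 1: Normal via location-scale. eps is independent of (mu, sigma),
# so gradients can flow through the deterministic map g(phi, eps).
mu, sigma = 2.0, 0.5
eps = rng.standard_normal(100_000)      # eps ~ N(0, 1)
z = mu + sigma * eps                    # z ~ N(mu, sigma^2)
print(z.mean(), z.std())                # ~2.0, ~0.5

# Row 2: Exponential via the inverse CDF.
lam = 3.0
u = rng.uniform(size=100_000)           # eps ~ U(0, 1)
z_exp = -np.log(1.0 - u) / lam          # z ~ Exp(lam)
print(z_exp.mean())                     # ~1/lam ≈ 0.333
```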
SGVB estimator
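Plugging the reparameterization z = g(φ, ε) into the naive estimator gives the SGVB estimator from the paper:

$$\tilde{\mathcal{L}}(\theta, \phi; x) = \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\big(x, g(\phi, \varepsilon^{(l)})\big) - \log q_\phi\big(g(\phi, \varepsilon^{(l)})\,\big|\,x\big), \qquad \varepsilon^{(l)} \sim p(\varepsilon)$$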
Really simple and appropriate for differentiation w.r.t. φ and θ!
Basic SGVB Algorithm (L=1)
repeat
1. ε ~ p(ε)
2. Compute gradients ∇θ,φ L̃(θ, φ; x), with z = g(φ, ε)
3. Update parameters θ and φ (e.g. SGD)
until convergence
Gradients are easily computed with automatic differentiation (e.g. Torch7, Theano).
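A self-contained numpy sketch of this loop (an illustration, not the paper's code): the toy model p(z) = N(0,1), p(x|z) = N(z,1), the Gaussian q, the learning rate, and the step count are all my assumptions, and the gradients are written out by hand instead of with autodiff.

```python
import numpy as np

# Toy model (assumed for illustration): p(z) = N(0,1), p(x|z) = N(z,1).
# True posterior is then p(z|x) = N(x/2, 1/2).
# Variational family: q_phi(z) = N(mu, sigma^2), phi = (mu, log_sigma).
# Reparameterization: z = g(phi, eps) = mu + sigma * eps, eps ~ N(0,1).

rng = np.random.default_rng(0)
x = 3.0                                   # a single observed datapoint
mu, log_sigma = 0.0, 0.0                  # initial variational parameters
lr = 0.01                                 # learning rate (assumption)

for step in range(20_000):
    eps = rng.standard_normal()           # 1. eps ~ p(eps)
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                  # 2. z = g(phi, eps)

    # d/dz log p(x,z) = d/dz [ -z^2/2 - (x-z)^2/2 ] = -z + (x - z)
    dlogp_dz = -z + (x - z)

    # With z = mu + sigma*eps: log q(z) = -log_sigma - eps^2/2 + const,
    # so -log q contributes +1 to the log_sigma gradient and 0 to mu.
    grad_mu = dlogp_dz * 1.0                        # dz/dmu = 1
    grad_log_sigma = dlogp_dz * sigma * eps + 1.0   # dz/dlog_sigma = sigma*eps

    mu += lr * grad_mu                    # 3. gradient ascent on the bound
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma) ** 2)         # approaches x/2 = 1.5 and 1/2
```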
“Auto-Encoding” VB: efficient on-line version of SGVB
● Special case of SGVB:
– Large i.i.d. dataset (large N) => many variational parameters to learn
● Solution:
– Use a conditional qφ(z|x) (a neural network)
– Doubly stochastic optimization
– Avoid local parameters!
“Auto-encoding” Stochastic VB (L=1)
repeat
1. Sample a random minibatch of M datapoints
2. ε ~ p(ε)
3. Compute gradients ∇θ,φ of the minibatch estimator (below)
4. Update parameters θ and φ (e.g. SGD)
until convergence
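With a minibatch of M datapoints drawn from the full dataset X of size N, the full-dataset bound is estimated as in the paper:

$$\mathcal{L}(\theta, \phi; X) \simeq \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}\big(\theta, \phi; x^{(i)}\big)$$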
Scales to very large datasets!
Experiments with “variational auto-encoder”
[Figure: the two networks, with hidden layers h1 and h2 and injected noise ε ~ p(ε), so that z = g(φ, ε, x)]
● Generative model p(x|z) (neural net)
● Posterior approximation q(z|x) (neural net)
● Objective: (noisy) negative reconstruction error + regularization term
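Written out for a Gaussian qφ(z|x) and a standard-normal prior p(z) (the closed-form KL term is from the paper's appendix):

$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]}_{\text{(noisy) negative reconstruction error}} \; \underbrace{-\, D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)}_{\text{regularization term}}$$

with

$$-D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) = \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$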
Experiments
Results: Marginal likelihood lower bound
Results: Marginal log-likelihood
MCEM does not scale well to large datasets
Robustness to high-dimensional latent space
Learned 2D manifolds
Learned 3D manifold
Samples from MNIST
Reparameterizations of latent variables
Reparameterization of continuous latent variables
● Alternative parameterization of latent variables. Choose some:
– ε ~ p(ε)
– z = g(φ, ε) (invertible)
such that z | pa(z) ~ p(z | pa(z)) (the correct distribution)
● The z's become deterministic given the ε's
● The ε's are a priori independent
● Large difference in posterior dependencies and efficiency (see the example below)
[Figure: centered form vs. non-centered form (a neural net with injected noise)]
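As a minimal concrete example (my illustration, assuming a two-level Gaussian chain):

$$\text{Centered:}\quad z_1 \sim \mathcal{N}(0,1), \qquad z_2 \mid z_1 \sim \mathcal{N}(z_1, \sigma^2)$$

$$\text{Non-centered:}\quad \epsilon_1, \epsilon_2 \sim \mathcal{N}(0,1), \qquad z_1 = \epsilon_1, \qquad z_2 = z_1 + \sigma\,\epsilon_2$$

For small σ the centered pair (z1, z2) is strongly correlated under the posterior, while the ε's are a priori independent, so the two forms present very different geometry to a sampler or optimizer.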
Experiment: MCMC sampling in a DBN
[Figure: MCMC samples and autocorrelation plots for the centered form vs. the non-centered form; one form mixes fast while the other mixes terribly slowly.]
For more information and analysis, see:
Kingma, Diederik P., and Max Welling. “Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets.” (arXiv)
Conclusion
● SGVB: efficient stochastic variational algorithm for inference and learning with continuous latent variables.
Theano and pure numpy implementations:
https://github.com/y0ast/Variational-Autoencoder.git
(includes scikit-learn wrappers)
Thanks!
Appendix: The regular SGVB gradient estimator