
A Unifying Review of Efficient Variational Inference and Learning in Deep Directed Latent Variable Models

Riashat Islam [email protected]

University of Cambridge

Jiameng Gao [email protected]

University of Cambridge

Vera Johne [email protected]

University of Cambridge

Abstract

Deep generative latent variable models have recently gained significant interest due to the development of efficient and scalable variational inference methods. Variational methods involve maximization of a lower bound on the log-likelihood. Until recently, however, directed latent variable models were difficult to train on large datasets. In this work, we provide an overview of several recent methods that have been developed for performing stochastic variational inference on large datasets, together with theoretical and experimental results that serve as a benchmark comparison of variational inference methods based on feedforward neural networks. All of the approaches compared maximize the variational lower bound by jointly training the model and the inference network, and all are evaluated on the MNIST dataset. We implemented our own version of the Auto-Encoding Variational Bayes algorithm (Kingma & Welling, 2013) and compared it with the other approaches. Our experimental results show the significance of the different variance reduction techniques for the gradient estimator of the lower bound on the log-likelihood.

Submitted as part of the Advanced Machine Learning coursework 2015/16, MPhil Machine Learning, Speech and Language Technology

1. Introduction

In our work we provide a unifying review of efficient inference and learning algorithms in directed generative models with many layers of hidden variables. Directed latent variable models are known to be difficult to train on large datasets because exact inference in such models is intractable. We compare the different approaches to performing inference in deep directed graphical models.

Although directed graphical models can generate observations directly, there has been a lack of efficient learning algorithms for directed latent variable models. Recent work proposed approximate inference methods based on feedforward neural networks to maximize the variational lower bound on the log-likelihood (Mnih & Gregor, 2014). We aim to provide a unifying review of such methods based on feedforward networks, and to compare the efficiency of the different learning algorithms. In particular, we focus on the Auto-Encoding Variational Bayes (AEVB) algorithm (Kingma & Welling, 2013).

Recent efforts in machine learning research have focused on developing scalable probabilistic models: using directed graphical models, we want to develop generative models that can scale to large datasets. Deep generative models are known to generalize better to unseen data, since their directed structure can capture high-level abstractions in the dataset. However, the lack of efficient inference algorithms for directed generative models has been a major problem.


2. Background

2.1. Variational Inference

The task of probabilistic inference in graphical models is to compute conditional probability distributions over hidden variables. Latent variable models provide a powerful approach to probabilistic modelling: a probabilistic model considers the joint distribution Pθ(x, h), where x are the visible variables and h are the hidden variables, and the goal is to train the latent variable model Pθ(x, h) parameterized by θ. Since the posterior distribution Pθ(h|x) is complicated to work with, exact inference in such models is intractable, and hence maximum likelihood learning cannot be performed directly.

Variational methods provide an approach to approximate inference. Since the exact posterior Pθ(h|x) is intractable, the key idea is to approximate this intractable distribution with a simpler, tractable distribution Qφ(h|x) with parameters φ, which serves as an approximation to the exact posterior Pθ(h|x).

We can lower bound the marginal likelihood using Jensen's inequality:

log Pθ(x) = log Σ_h Pθ(x, h) ≥ Σ_h Qφ(h|x) log [Pθ(x, h) / Qφ(h|x)] = EQ[log Pθ(x, h) − log Qφ(h|x)] = L(x, θ, φ)   (1)

The lower bound can therefore be written as:

L(x, θ, φ) = log Pθ(x) − KL(Qφ(h|x), Pθ(h|x))   (2)

where the Kullback-Leibler (KL) divergence is a non-symmetric measure of the difference between the two probability distributions. The goal of variational inference is to maximize the variational lower bound with respect to the approximate distribution Qφ(h|x), or equivalently to minimize the KL divergence. The bound is tight, i.e. the KL divergence is zero, when Qφ(h|x) = Pθ(h|x).
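To make the bound in equations 1 and 2 concrete, the short numpy sketch below (our own illustration, not taken from any of the papers reviewed) estimates L(x, θ, φ) by Monte Carlo for a toy model in which the prior, likelihood and approximate posterior are all one-dimensional Gaussians, so that log Pθ(x) is available in closed form. Any choice of Qφ(h|x) gives a value below log Pθ(x), and the gap is exactly the KL term in equation 2.

# A minimal sketch of equations (1)-(2) on a toy model (our own example):
#   p(h) = N(0, 1),  p(x | h) = N(h, sigma_x^2)  =>  p(x) = N(0, 1 + sigma_x^2).
import numpy as np

rng = np.random.default_rng(0)
sigma_x = 0.5
x = 1.3                                              # a single observation

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def elbo_estimate(m, s, n_samples=100_000):
    """Monte Carlo estimate of L(x) = E_Q[log p(x, h) - log q(h|x)] with q = N(m, s^2)."""
    h = m + s * rng.standard_normal(n_samples)       # samples from Q(h|x)
    log_joint = log_normal(h, 0.0, 1.0) + log_normal(x, h, sigma_x ** 2)
    log_q = log_normal(h, m, s ** 2)
    return np.mean(log_joint - log_q)

log_px = log_normal(x, 0.0, 1.0 + sigma_x ** 2)      # exact marginal log-likelihood

print("log p(x)             :", log_px)
print("ELBO, poor Q          :", elbo_estimate(m=0.0, s=1.0))
# the exact Gaussian posterior closes the gap (KL = 0), so the bound becomes tight
post_var = 1.0 / (1.0 + 1.0 / sigma_x ** 2)
post_mean = post_var * x / sigma_x ** 2
print("ELBO, exact posterior :", elbo_estimate(post_mean, np.sqrt(post_var)))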

2.2. Recognition Model

A recognition model or inference network can learn an inverse mapping from the observations x to the hidden variables h. Recognition models allow for faster convergence during training and at test time. Previously, in most approaches, the variational posterior Qφ(h|x) for each observation was defined using its own set of variational parameters φ. Recent methods instead define a recognition model or inference network: a feedforward neural network is used to compute the variational distribution from the observation. The inference network performs the mapping from x to Qφ(h|x), with the constraint that it must be easy to sample from the inference network. Both the generative model Pθ(x, h) and the recognition model Qφ(h|x) are parameterized by neural networks.
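The sketch below (an illustrative architecture of our own choosing, with made-up layer sizes, rather than the exact networks used in the papers reviewed) shows the basic structure of such an inference network: a small feedforward network maps each observation directly to the mean and log-variance of a diagonal Gaussian Qφ(h|x), from which samples are cheap to draw.

# A minimal sketch of a recognition model: a feedforward net maps each observation
# x to the parameters of a diagonal Gaussian Q_phi(h|x); biases of the output
# layers are omitted for brevity.
import numpy as np

rng = np.random.default_rng(1)
x_dim, hid_dim, z_dim = 784, 400, 20        # MNIST-like sizes, chosen for illustration

# phi = weights of the inference network
W1 = rng.standard_normal((x_dim, hid_dim)) * 0.01
b1 = np.zeros(hid_dim)
W_mu = rng.standard_normal((hid_dim, z_dim)) * 0.01
W_logvar = rng.standard_normal((hid_dim, z_dim)) * 0.01

def recognition_model(x):
    """Map a batch of observations to the mean and log-variance of Q_phi(h|x)."""
    a = np.tanh(x @ W1 + b1)
    return a @ W_mu, a @ W_logvar            # (mu, log sigma^2), one row per datapoint

def sample_q(x, rng):
    mu, logvar = recognition_model(x)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps   # easy to sample from, as required

x_batch = rng.random((100, x_dim))           # stand-in for a minibatch of images
h = sample_q(x_batch, rng)
print(h.shape)                               # (100, 20)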

3. Approach

Having obtained the variational lower bound in equation 2, the task is to optimize it; i.e., we want to train the model by locally maximizing L(x, θ, φ) with respect to the model and inference parameters θ and φ. We want to optimize the recognition model in addition to learning the model parameters, in order to perform efficient approximate posterior inference.

In order to optimize the lower bound, we want to differentiate L with respect to the inference network parameters φ and the generative model parameters θ. The gradients of the variational bound are given as:

∇θL(x) = EQ[∇θ log Pθ(x, h)]   (3)

for the model parameters θ, while the gradient with respect to the inference network parameters is given as:

∇φL(x) = EQ[(log Pθ(x, h) − log Qφ(h|x)) ∇φ log Qφ(h|x)]   (4)

However, since the gradients in equations 3 and 4 both involve expectations, they are intractable. Even though Monte Carlo approximations to the gradients can be used, the variance of such gradient estimators is usually very high. In general, optimisation-based variational approaches are better suited to large datasets than purely sampling-based Monte Carlo methods, and are usually more efficient. In section 4, we therefore consider the different approaches that can be taken to maximize this lower bound on the log-likelihood with respect to the parameters, such that a better approximation to the intractable posterior distribution can be obtained, for both discrete and continuous latent variables in directed generative models.
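The following self-contained numpy sketch (again our own toy one-dimensional Gaussian model, not an experiment from the papers) implements the score-function form of equation 4 and illustrates the problem with the naive single-sample estimator: its spread across samples is large compared with its mean.

# A sketch of the score-function estimator of equation (4),
#   grad_phi L = E_Q[(log p(x,h) - log q(h|x)) * grad_phi log q(h|x)],
# for a toy 1-D Gaussian model (our own example).
import numpy as np

rng = np.random.default_rng(2)
sigma_x, x = 0.5, 1.3
m, s = 0.3, 0.8                              # current variational parameters phi = (m, s)

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def single_sample_grad_m():
    """One-sample estimate of dL/dm using the score-function identity."""
    h = m + s * rng.standard_normal()
    signal = (log_normal(h, 0, 1) + log_normal(x, h, sigma_x ** 2)
              - log_normal(h, m, s ** 2))    # log p(x,h) - log q(h|x)
    score = (h - m) / s ** 2                 # d/dm log q(h|x)
    return signal * score

estimates = np.array([single_sample_grad_m() for _ in range(20_000)])
print("mean gradient estimate        :", estimates.mean())
print("std of single-sample estimates:", estimates.std())   # note how large the spread is relative to the mean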

4. Methods

4.1. Auto-Encoding Variational Bayes

A variational autoencoder for approximate posterior inference was proposed in (Kingma & Welling, 2013), involving a generative network and a recognition network. Both networks are trained jointly so as to maximize the variational lower bound on the log-likelihood. Since optimization involves the gradients shown in equations 3 and 4, AEVB shows that better optimization can be achieved by reparameterizing the variational lower bound to obtain a reparameterized lower bound estimator. The random hidden variable h is reparameterized as h = gφ(ε, x), where gφ is a differentiable transformation and ε is an auxiliary noise variable. Applying this trick to the lower bound L(θ, φ; x) gives the Stochastic Gradient Variational Bayes (SGVB) estimator L^A(θ, φ; x), such that the reparameterized lower bound for a datapoint x_i is given as:

L^A(θ, φ; x_i) = (1/L) Σ_{l=1}^{L} [log pθ(x_i, h_{i,l}) − log qφ(h_{i,l}|x_i)]   (5)

Considering minibatches, the corresponding estimator of the marginal likelihood lower bound is given by:

L(θ, φ; X) ≈ L^M(θ, φ; X^M) = (N/M) Σ_{i=1}^{M} L(θ, φ; x_i)   (6)

We can therefore obtain gradient estimates ∇θ,φL(θ, φ; x_i) and perform stochastic optimization with gradient-based methods. Obtaining lower-variance estimates of the gradient leads to better optimization of the bound, and hence to an approximate posterior Qφ(h|x) that is closer to the true posterior as measured by the KL divergence KL(Qφ(h|x), Pθ(h|x)).

In the variational autoencoder (AEVB), the training procedure therefore trades off the data log-likelihood log p(x) against the KL divergence from the true posterior. By doing this, the model learns a representation in which approximate posterior inference is easy. Additionally, instead of directly computing the gradient of the log-likelihood with respect to the recognition network parameters, (Kingma & Welling, 2013) considered a reparameterization of the recognition distribution in terms of auxiliary variables ε, such that samples from the recognition model Qφ(h|x) are a deterministic function of the inputs and the auxiliary variables. This approach is valid for a variety of distributions, although only the Gaussian case was considered in the experiments. The AEVB algorithm based on stochastic gradients for variational Bayes (SGVB) has been successfully applied to learning to draw images in a realistic manner (Gregor et al., 2015). Figure 1 further illustrates the reparameterization trick and the role of backpropagation through the neural network parameterized by φ in computing the approximate gradient estimator ∇φL.

Figure 1. Explaining the reparameterization trick in the Auto-Encoding Variational Bayes method
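The sketch below (our own illustration, on the same toy one-dimensional Gaussian model as in the earlier sketches) shows the pathwise estimator that the reparameterization h = m + s·ε with ε ∼ N(0, 1) makes possible: the gradient passes through the sample itself, and the resulting single-sample estimates are noticeably less noisy than the score-function estimates of equation 4.

# A sketch of the reparameterized (pathwise) gradient estimator behind equation (5)
# for the toy 1-D Gaussian model (our own example).
import numpy as np

rng = np.random.default_rng(3)
sigma_x, x = 0.5, 1.3
m, s = 0.3, 0.8                              # variational parameters phi = (m, s)

def single_sample_grad_m_pathwise():
    """One-sample reparameterized estimate of dL/dm."""
    eps = rng.standard_normal()
    h = m + s * eps                          # h = g_phi(eps, x), so dh/dm = 1
    # d/dh [log N(h; 0, 1) + log N(x; h, sigma_x^2)], applied via the chain rule;
    # the -log q(h|x) term drops out of d/dm entirely, because
    # log N(m + s*eps; m, s^2) does not depend on m.
    dlogp_dh = -h + (x - h) / sigma_x ** 2
    return dlogp_dh

estimates = np.array([single_sample_grad_m_pathwise() for _ in range(20_000)])
print("mean gradient estimate        :", estimates.mean())
print("std of single-sample estimates:", estimates.std())
# the spread is noticeably smaller than for the score-function estimator above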

4.1.1. Local Reparameterization Trick

A local reparameterization trick for reducing the variance of the gradient estimators was further proposed in (Kingma et al., 2015). In practice, the performance of stochastic gradient ascent depends strongly on the variance of the gradient estimates. The optimization of the parameters θ and φ for maximizing the lower bound involves uncertainty estimates in the parameters. With the local reparameterization trick, global parameter uncertainty is translated into local uncertainty per datapoint: uncertainty in the global model and inference network parameters is treated as local noise that is independent across the datapoints in a minibatch. Such a parameterization ensures that the variance of the gradient estimates is inversely proportional to the minibatch size. A gradient estimator of the lower bound with reduced variance and low computational complexity can therefore be obtained by using the local reparameterization trick of (Kingma et al., 2015). This approach to deep generative models and scalable variational inference was further used to develop a probabilistic model for semi-supervised learning (Kingma et al., 2014).
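A minimal numpy sketch of the idea is given below; it assumes a single fully connected layer with a factorized Gaussian posterior N(M, S²) over its weights, which is our own simplified setting rather than the full variational dropout model of (Kingma et al., 2015).

# A sketch of the local reparameterization trick for one fully connected layer.
import numpy as np

rng = np.random.default_rng(4)
batch, d_in, d_out = 100, 784, 400
A = rng.random((batch, d_in))                     # minibatch of input activations
M = rng.standard_normal((d_in, d_out)) * 0.01     # posterior means of the weights
S = np.full((d_in, d_out), 0.05)                  # posterior std devs of the weights

# Global reparameterization: one weight sample shared by the whole minibatch,
# which correlates the noise across datapoints and inflates gradient variance.
W = M + S * rng.standard_normal((d_in, d_out))
B_global = A @ W

# Local reparameterization: the implied pre-activation distribution is Gaussian
# with mean A @ M and variance A^2 @ S^2, so sample it directly, per datapoint.
mean = A @ M
var = (A ** 2) @ (S ** 2)
B_local = mean + np.sqrt(var) * rng.standard_normal((batch, d_out))

print(B_global.shape, B_local.shape)              # both (100, 400)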

4.1.2. AEVB: Experimental Results and Discussion

Using the MNIST dataset, we trained a generative model of images and obtained results for the variational lower bound. In our implementation, the same neural network architecture was used for both the generative model and the recognition model, with 400 hidden units in each layer. Following the stochastic variational Bayes approach, the parameters θ and φ were obtained by following the gradient ∇θ,φL(θ, φ; X) of the lower bound estimator. For all experiments, we used minibatches of size 100 and a learning rate of 0.01.

In the variational autoencoder, the hidden units referred to above belong to the deterministic hidden layers of the neural networks used for the encoder and decoder. We then analysed how the generalization performance depends on the dimensionality of the latent space Nz. Figure 3 shows that generalisation on MNIST indeed improves with more latent variables and does not lead to overfitting, even though a larger number of latent variables has to be inferred in the generative model.

Initially, we used one sample (L = 1) per datapoint. For comparison with the other methods described below, we also considered using more than one sample per datapoint; note, however, that the AEVB approach does not use any importance sampling or weighting. The results below show how the maximization of the training lower bound depends on the number of samples per datapoint. Interestingly, the test results for different numbers of samples are the same, showing that the generalisation performance of AEVB is independent of the number of samples taken from the recognition model. This is consistent with the fact that AEVB does not use a weighted sampling approach, and hence taking L > 1 does not affect test set performance.

Figure 4. Training lower bound results for different numbers of samples L per datapoint

4.2. Neural Variational Inference (NVIL)

Another approach to reducing the variance of the gradient estimates when maximizing the lower bound with gradient-based optimisation is Neural Variational Inference and Learning (NVIL) (Mnih & Gregor, 2014). NVIL also trains a recognition network, parameterized by a neural network, to approximate the posterior distribution. However, in addition to the model and the inference network, NVIL trains a third network to predict reward baselines in the sense of the REINFORCE algorithm (Williams, 1992). It uses the same variational objective as the variational autoencoder (AEVB), except that a baseline function (inspired by the reinforcement learning literature) is used to reduce variance instead of a reparameterization trick. One drawback of the NVIL approach is that the estimator requires learning the additional parameters of the baseline function used to reduce the variance of the gradient estimates.

Similar to the variational autoencoder (AEVB), NVIL uses a feedforward network from which exact samples of the variational posterior can be drawn. One major difference is that NVIL combines sampling-based and variational methods for training the directed graphical model: samples from the inference (recognition) network are used to obtain Monte Carlo estimates of the gradients, and the inference network is trained jointly with the model by maximizing the variational lower bound.

Considering the gradient with respect to the inference network parameters φ shown in equation 4, the difference between the two log-probabilities, log Pθ(x, h) − log Qφ(h|x), can be treated as a learning signal for the inference network parameters:

lφ(x, h) = log Pθ(x, h) − log Qφ(h|x)   (7)

Subtracting a baseline c from the learning signal lφ(x, h), the gradient from equation 4 can be written as:

∇φL(x) = EQ[(lφ(x, h) − c) ∇φ log Qφ(h|x)]   (8)

The gradient variance can be reduced further by making the baseline a function of the observations, Cψ(x), with its own parameters ψ. This input-dependent baseline is implemented as a neural network and trained to minimise the expected squared error of the centred learning signal, EQ[(lφ(x, h) − Cψ(x) − c)²]. This use of baselines for variance reduction is closely related to the use of control variates.
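The following self-contained numpy sketch (a toy model with a single Bernoulli latent variable and a fixed generative model, chosen by us for illustration rather than the sigmoid belief networks of Mnih & Gregor, 2014) shows the structure of the resulting update: the learning signal is centred by both a running-average baseline c and an input-dependent baseline Cψ(x), here assumed to be a simple linear function, before being multiplied by the score.

# A toy sketch of the NVIL-style update of equation (8) with baselines.
# Generative model (held fixed): p(h=1) = 0.5, p(x|h) = N(x; mu[h], 1).
# Recognition model (learned):   q(h=1|x) = sigmoid(w*x + b).
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([-1.0, 1.0])
w, b = 0.5, 0.0                               # recognition parameters phi
c = 0.0                                       # running-average baseline
v0, v1 = 0.0, 0.0                             # input-dependent baseline C_psi(x) = v0 + v1*x
lr, lr_baseline = 0.01, 0.01

def log_normal(z, mean):
    return -0.5 * (np.log(2 * np.pi) + (z - mean) ** 2)

for step in range(20_000):
    x = mu[rng.integers(2)] + rng.standard_normal()        # a draw from the data
    pi = 1.0 / (1.0 + np.exp(-(w * x + b)))
    h = int(rng.random() < pi)                              # sample h ~ q(h|x)
    log_joint = np.log(0.5) + log_normal(x, mu[h])
    log_q = h * np.log(pi) + (1 - h) * np.log(1 - pi)
    l = log_joint - log_q                                   # learning signal
    centred = l - c - (v0 + v1 * x)                         # subtract both baselines
    score = h - pi                                          # d/d(w*x+b) of log q(h|x)
    w += lr * centred * score * x                           # eq. (8), single sample
    b += lr * centred * score
    c += lr_baseline * (l - c)                              # track E[l]
    v0 += lr_baseline * centred                             # minimise E[(l - c - C(x))^2]
    v1 += lr_baseline * centred * x

# with enough steps q(h=1|x) approaches the true posterior sigmoid(2x), i.e. w -> 2, b -> 0
print("learned q(h=1|x): w = %.2f, b = %.2f" % (w, b))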

4.3. NVIL: Experimental Results and Discussion

In this section, we present results on the effectiveness of the NVIL algorithm on the MNIST dataset. For the initial set of experiments, we used 200 latent variables and trained for 1500 epochs with a batch size of 100. The input-dependent baseline function for reducing variance was implemented as a neural network with a single hidden layer of 400 tanh units.

We first trained the generative model along with the inference network using different optimisation approaches, comparing stochastic gradient ascent against the adaptive-learning-rate methods RMSProp and AdaGrad. Since tuning the learning rate is an expensive process when training large generative models, AdaGrad and RMSProp can be effective in adjusting the learning rate automatically. For stochastic gradient ascent, the learning rate was chosen to be 1e−3, as in the original work. Due to time constraints, we did not use a validation set for fine-tuning and selecting the learning rate. Figure 5 shows how the convergence of the lower bound maximization depends on the optimisation technique. We favoured adaptive learning rates since fine-tuning the learning rate is a difficult task requiring a separate validation set.

Figure 5. Fixed and adaptive learning rate based optimisation of the lower bound estimator using the Neural Variational Inference algorithm

Based on the results in figure 5, while both adaptive methods converge significantly more quickly than SGD, in this application RMSProp settles into a significantly higher log-likelihood than AdaGrad, whose convergence rate quickly slows down. This may be because AdaGrad converges quickly for sparse parameters, whereas in a neural network setting many of the parameters are dense.

We further show the estimates of the lower bound depending on the number of latent variables used in the directed generative model, considering a single layer with a varying number of latent variables. Figure 6 shows that the optimisation of the lower bound is almost independent of the number of latent variables used in the model. The generalisation performance (not shown here) also shows no signs of overfitting for higher numbers of latent variables, similar to the results previously observed in figure 3 for the variational autoencoder. Figure 7 also shows that the performance of the NVIL method is largely independent of the number of hidden units used in each layer of the inference network.

Figure 6. Dependence of the NVIL algorithm on the number of latent variables used in the model

Figure 7. NVIL optimisation of the lower bound estimator depending on the number of hidden units

The significance of the variance reduction technique using a baseline network can be observed most clearly when comparing with the other variance reduction techniques, as discussed later in section 5.

4.4. Importance Weighted Auto-Encoders (IWAE)

Perhaps most closely related to the variational autoencoder is the Importance Weighted Autoencoder (IWAE) (Burda et al., 2015). The IWAE uses the same architecture as the VAE, except that its lower bound is derived from importance weighting. Instead of using a single sample from the inference network, as in the VAE, the recognition network of the IWAE produces multiple approximate posterior samples whose weights are then averaged. By doing so, the IWAE obtains a more flexible approximate posterior with which to model the true posterior of the generative model.

The IWAE also considers multiple stochastic hidden layers, whereas the VAE experiments considered only one stochastic layer. As in the VAE, the conditional distributions are Gaussian. In the VAE the means and variances are computed by a deterministic feedforward neural network; in the IWAE, the Gaussian recognition distribution q(h^l | h^{l−1}) at each stochastic layer has means and covariances computed from the states of the hidden units at the previous layer.

The importance weighted autoencoder also consists of a generative network and a recognition network. However, the training procedure for the IWAE is based on a different lower bound on log p(x). Since it draws multiple samples from the recognition network, the lower bound is based on the k-sample importance weighted estimate of the log-likelihood, given by:

L_k(x) = E_{h_1,...,h_k ∼ q(h|x)}[ log (1/k) Σ_{i=1}^{k} p(x, h_i)/q(h_i|x) ]   (9)

where h_1, h_2, ..., h_k are independent samples from the inference network q(h|x). The term inside the sum, w_i = p(x, h_i)/q(h_i|x), is the unnormalized importance weight under the joint distribution. The lower bound on the marginal log-likelihood, by Jensen's inequality, can therefore be written as:

L_k = EQ[ log (1/k) Σ_{i=1}^{k} w_i ] ≤ log EQ[ (1/k) Σ_{i=1}^{k} w_i ]   (10)
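The numpy sketch below (our toy one-dimensional Gaussian example once more, not the MNIST model of Burda et al., 2015) evaluates the k-sample bound of equation 10 with a numerically stable log-mean-exp and illustrates how the bound tightens towards log p(x) as k grows, even for a deliberately poor q(h|x).

# A sketch of the k-sample importance weighted bound of equation (10)
# on the toy model p(h)=N(0,1), p(x|h)=N(h, sigma_x^2), q(h|x)=N(m, s^2).
import numpy as np

rng = np.random.default_rng(6)
sigma_x, x = 0.5, 1.3
m, s = 0.3, 0.8                                   # a deliberately poor q(h|x)

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def L_k(k, n_outer=5000):
    """Monte Carlo estimate of E[log (1/k) sum_i w_i], w_i = p(x,h_i)/q(h_i|x)."""
    h = m + s * rng.standard_normal((n_outer, k))             # h_i ~ q(h|x)
    log_w = (log_normal(h, 0, 1) + log_normal(x, h, sigma_x ** 2)
             - log_normal(h, m, s ** 2))
    mx = log_w.max(axis=1, keepdims=True)                     # stable log-mean-exp
    log_mean_w = mx[:, 0] + np.log(np.mean(np.exp(log_w - mx), axis=1))
    return log_mean_w.mean()

log_px = log_normal(x, 0, 1 + sigma_x ** 2)
for k in (1, 5, 50):
    print("k = %2d : L_k = %.3f   (log p(x) = %.3f)" % (k, L_k(k), log_px))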

Since the generative model is parameterized by θ, we write the weights as w(x, h, θ) = p(x, h|θ)/q(h|x, θ). Again, to optimize the lower bound we need an estimate of ∇θL_k(x). The gradient estimate based on importance weighting is therefore given as:

∇θL_k(x) = Σ_{i=1}^{k} w̃_i ∇θ log w(x, h(ε_i; x, θ), θ),   w̃_i = w_i / Σ_{j=1}^{k} w_j   (11)

where w̃_i are the normalized importance weights, the mapping h(ε; x, θ) is computed by a neural network, and equation 11 is a Monte Carlo estimator of the gradient of the lower bound for the importance weighted autoencoder. The IWAE, being a variant of the VAE, has been shown to achieve better generalisation performance than the variational autoencoder due to its ability to learn richer latent feature representations.

4.4.1. IWAE: Experimental Results and Discussion

The importance weighted autoencoder provides a good benchmark for comparing the generative performance of the VAE and the IWAE on the MNIST dataset. In our experiments, all the stochastic hidden layers used Gaussian distributions with diagonal covariance, and a tanh non-linearity was used for the deterministic hidden units. Following the original work (Burda et al., 2015), we used Adam for optimisation with minibatches of size 20 and ε = 10^−4, and the training proceeded with 3^i passes over the data with a learning rate of 0.001.

For the IWAE, we ran our models with 50 units per stochastic hidden layer, considering 1 or 2 stochastic hidden layers in both the generative and the recognition model. As in the original work, we used minibatches of size 20 with Adam optimisation. For the importance sampling, we compared how the number of samples taken from the inference network affects the performance of the IWAE under the gradient estimator above, taking k = 1, 5 or 50 samples for both L = 1 and L = 2 stochastic hidden layers.

Figure 8 shows that taking more than one sample (k > 1) considerably improves the performance of the IWAE on the MNIST dataset. Previously, for AEVB, we showed that generalisation performance does not improve with the number of samples. Using the code available for the IWAE, which includes a comparison with AEVB, figure 8 further shows that with k = 1 both methods achieve the same performance. The IWAE can thus improve generalisation performance by taking a larger number of samples, while following a similar encoder-decoder architecture to AEVB.

Figure 8. Significance of the number of samples from the inference network, comparing AEVB and IWAE for 1 and 2 stochastic hidden layers

4.5. Reweighted Wake-Sleep

The reweighted wake-sleep (RWS) algorithm (Bornschein & Bengio, 2014) extends the wake-sleep algorithm (Hinton et al., 1995) and, similar to the importance weighted autoencoder, obtains good estimates of the gradient of the lower bound by sampling the latent variables multiple times from the recognition model. The updates to the generative network are therefore similar to the importance-weighted gradient estimates of equation 11 used in the IWAE, and, like the IWAE, the RWS estimator of the log-likelihood is based on importance sampling.

The wake-sleep algorithm was originally proposed by (Hinton et al., 1995) for training generative models such as Helmholtz machines and deep belief networks, and can be viewed as optimising a biased estimator of the gradient. The original wake-sleep algorithm considers the following variational bound:

log p(x) ≥ Σ_h q(h|x) log [ p(x, h) / q(h|x) ]   (12)

The wake-sleep algorithm maximizes this variational bound, where the wake phase corresponds to maximizing with respect to p, while the sleep-phase update of q minimises the reversed KL divergence KL(p(h|x)||q(h|x)).

The reweighted wake-sleep algorithm instead formulates the likelihood as an importance weighted average, such that an unbiased estimator of the marginal likelihood is given as:

p(x) = Σ_h q(h|x) [ p(x, h) / q(h|x) ] ≈ (1/K) Σ_{k=1}^{K} p(x, h_k)/q(h_k|x)   (13)

Training with the reweighted wake-sleep algorithm consists of two phases. Since the generative model p and the recognition model q are parameterized by θ and φ respectively, the model pθ is first updated for a given qφ. The gradient ∇θLp(θ, x) of the marginal log-likelihood Lp(θ, x) = log pθ(x) is given by:

∇θLp(θ, x) = (1/p(x)) E_{h∼q(h|x)}[ (p(x, h)/q(h|x)) ∇θ log pθ(x, h) ] ≈ Σ_{k=1}^{K} w̃_k ∇θ log pθ(x, h_k)   (14)

where h_k ∼ q(h|x) and w̃_k ∝ p(x, h_k)/q(h_k|x) are again the normalized importance weights.

Once the model pθ has been updated, the variance of the estimator can be reduced further by updating qφ for the given pθ. The recognition network qφ is then trained by maximum-likelihood-style learning on the weighted samples, with objective Lq(φ, x, h) = log qφ(h|x). As above, the gradient with respect to φ is obtained by importance sampling:

∇φLq(φ, x) ≈ Σ_{k=1}^{K} w̃_k ∇φ log qφ(h_k|x)   (15)

Additionally, RWS draws K samples from the inference network, whereas NVIL draws a single sample. One advantage of the reweighted wake-sleep algorithm over NVIL is that it does not require maintaining a baseline network with additional parameters in order to reduce the variance of the gradient estimates.

The reweighted wake-sleep algorithm therefore provides a better training procedure for deep generative models and obtains a lower-bias, lower-variance estimator of the log-likelihood gradient, at the expense of a larger number of samples from the inference network.
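The sketch below (our own toy one-dimensional Gaussian model, in which θ is the prior mean and φ = (m, s) parameterizes the recognition model) shows the structure of one reweighted wake-sleep update: both equation 14 and equation 15 reduce to importance-weighted averages over the K samples drawn from q(h|x).

# A sketch of one RWS update on a toy model:
# generative model p(h)=N(theta,1), p(x|h)=N(h,sigma_x^2); recognition q(h|x)=N(m,s^2).
import numpy as np

rng = np.random.default_rng(7)
sigma_x, x = 0.5, 1.3
theta = 0.0                              # generative parameter (prior mean)
m, s = 0.3, 0.8                          # recognition parameters phi = (m, s)
K = 5

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

h = m + s * rng.standard_normal(K)                       # h_k ~ q(h|x)
log_w = (log_normal(h, theta, 1) + log_normal(x, h, sigma_x ** 2)
         - log_normal(h, m, s ** 2))                     # unnormalized log weights
w = np.exp(log_w - log_w.max())
w /= w.sum()                                             # normalized importance weights

grad_theta = np.sum(w * (h - theta))                     # eq. (14): sum_k w_k d/dtheta log p(x, h_k)
grad_m = np.sum(w * (h - m) / s ** 2)                    # eq. (15): wake-phase update of q
print("grad wrt theta: %.3f,  grad wrt m: %.3f" % (grad_theta, grad_m))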

4.5.1. RWS: Experimental Results and Discussion

Using code available online, we re-ran the experiments for the reweighted wake-sleep algorithm. As in the original work, we used the binarized MNIST dataset and stochastic gradient descent with minibatches of size 25. The RWS algorithm used K = 5 samples during training, and the p and q networks consisted of three hidden layers. The original work analysed the significance of the number of samples used during training; due to time constraints, here we only analyse how the lower bound estimator on the test set varies with the number of epochs when training the model with the RWS algorithm. Figure 9 shows the test set lower bound against the number of training iterations; this is used in section 5 for comparison with the other methods.

Figure 9. Lower bound estimator on MNIST using the Reweighted Wake-Sleep algorithm

Figure 10. Significance of the number of samples from the inference network on the final negative log-likelihood

Figure 10 further shows the impact on the final negative log-likelihood as the number of samples drawn from the inference network is increased. It suggests that the training of the model, including the optimization of the θ and φ parameters, depends strongly on the number of samples taken.


4.6. Related Work

Similar to the algorithms considered above, (Rezende et al., 2014) also consider scalable inference and learning in directed generative models. They developed stochastic backpropagation for backpropagating gradients under a joint parameterisation of the generative and recognition models. Stochastic backpropagation (Rezende et al., 2014) enables efficient inference by computing gradients of expectations through random variables. It addresses the same task of inference with continuous latent variables by introducing a recognition model and deriving a lower bound estimator of the marginal likelihood, the main difference being a modified backpropagation procedure for the inference network.

Another related line of work applies the multiple-sample approach of the IWAE to generative models with discrete latent variables. Maximization of the lower bound on the intractable marginal log-likelihood is often done by estimating gradients using samples from the inference network, i.e. the variational posterior, which is often not flexible enough. If the bound is based on single-sample estimates, samples that explain the observations poorly are heavily penalised, so the variational posterior ends up covering only the high-probability areas of the true posterior. VIMCO (Mnih & Rezende, 2016) instead uses multiple samples, which is related to the IWAE approach: the above effect is mitigated by averaging over multiple samples when computing the marginal likelihood estimates, so that the lower bound becomes tighter as the number of samples increases. Objectives based on averaging over independent samples are called Monte Carlo objectives in (Mnih & Rezende, 2016). VIMCO introduces an unbiased gradient estimator for multiple-sample objective functions and can therefore reduce the variance of the gradient estimator without introducing a baseline network with additional parameters, as in NVIL.

5. Comparison of Experimental Results

Figure 11 shows the generalisation performance of the different scalable variational inference methods on the MNIST dataset, together with the convergence behaviour of the lower bound maximization when jointly optimizing the parameters θ and φ of the generative and recognition models.

Figure 11. Comparison of generalisation performance of AEVB, NVIL and RWS

Comparing the methods above, our results show that the reweighted wake-sleep algorithm, which optimises using multiple samples, can outperform the other methods based on the reparameterisation trick and the baseline network. Since the IWAE is a similar multiple-sample approach, and due to time constraints, we only compare AEVB, NVIL and RWS here. The algorithms in figure 11 use different parameter configurations: AEVB uses a latent space dimension of 100, while NVIL uses 500 latent variables with a single layer of 400 hidden units and optimisation by stochastic gradient ascent. For the reweighted wake-sleep algorithm, we trained the model with K = 100 training samples and used K = 5000 samples with a minibatch of size 25. For the experimental results comparing the IWAE and AEVB shown in figure 8, we also showed that the importance-sampling-based approach can obtain better gradient estimates for maximizing the lower bound than the AEVB method.

6. Discussion and Future Work

In our work, we provided a unifying review of some of the recent approaches for estimating the variational lower bound for efficient approximate inference in deep directed generative models. For AEVB, we demonstrated how it uses the SGVB gradient estimator and a reparameterization trick so that the gradient can be computed with standard stochastic gradient ascent methods. While AEVB requires continuous latent variable models, NVIL is applicable to discrete (binary) latent variables, for which the reparameterization trick cannot be used. Instead of a reparameterization trick, NVIL includes a baseline network in the gradient estimator to reduce variance and jointly optimizes both the inference and the generative network. One considerable drawback, however, is the need to also estimate the parameters of the feedforward baseline network. NVIL can nevertheless approximate the true posterior with more flexible distributions, since it combines both sampling and variational methods.

The importance weighted auto-encoder is an elegant variant of the VAE: it uses the same architecture but considers multiple samples to maximize a tighter log-likelihood lower bound derived from importance sampling. Comparing AEVB with the IWAE, it was also shown that for the same number of stochastic hidden layers the IWAE can learn richer latent representations and achieve better test performance than AEVB. The use of multiple samples to approximate the posterior gives the IWAE greater flexibility to model complex posterior distributions.

The comparison of the different methods in figure 11 shows that the multiple-sample, importance-sampling-based approach can generalise better than the other methods, which use a baseline network or the reparameterisation trick. Figure 11 further suggests that the gradient estimates of the lower bound are better when importance sampling is used: even though NVIL and AEVB employ variance reduction techniques, these do not outweigh the benefits of optimisation based on multiple samples. Figures 11 and 8 together suggest that approaches based on multiple samples can achieve gradient estimates with lower variance than the other variance reduction techniques considered.

6.1. Extensions

Variational Inference for Monte Carlo Objectives (Mnih & Rezende, 2016): VIMCO is another scalable variational inference method that could be considered as a benchmark for comparison. VIMCO is related to the IWAE in that it also uses an importance sampling approach to estimate the log-likelihood; however, it extends the multiple-sample approach to discrete latent variables. It would be interesting to analyse how VIMCO performs relative to the IWAE, and why it may provide a more effective gradient estimator than single-sample approaches such as NVIL and AEVB.

Wake-Sleep Algorithm (Hinton et al., 1995): The original AEVB work uses the wake-sleep algorithm as a benchmark for comparison, as it employs the same kind of recognition model. It would be interesting to observe how NVIL and the multiple-sample approaches such as the IWAE and RWS compare to the wake-sleep algorithm on the MNIST dataset.

Other Datasets: We only compared the methods above on the full MNIST dataset. Given more time, it would be interesting to see whether the same trends in the results are also observed on other datasets, such as the Frey Faces dataset or the Omniglot dataset of handwritten characters. By applying the methods to other datasets, a more uniform comparison benchmark could be established for all the scalable variational inference methods considered for deep directed generative models.

6.2. Future Work

All the recognition models used in the above methods use a neural network with weights φ to find an approximation to the true posterior; for example, the AEVB algorithm applies backpropagation to obtain the approximate gradient estimator ∇φL. All the weights in these networks are treated as point estimates, and variational inference is applied only to the stochastic hidden units of the autoencoder. One drawback of point estimates in deep neural networks is that optimisation becomes much more difficult at larger scale, making the network prone to overfitting.

Uncertainty in Weights of Neural Network: An interesting extension would be to consider Bayesian neural networks for the parameterisation of the recognition model; in other words, to treat the weights φ of the inference network as distributions and use Bayes by Backprop (Blundell et al., 2015). This introduces uncertainty in the weights of the network and performs a variational approximation to exact Bayesian updates. It would be interesting to analyse how uncertainty in the weights affects the methods considered above, as the inference network would then maintain weight distributions.

Probabilistic Backpropagation: As discussed above, backpropagation is used in the inference network to estimate an approximation to the gradient ∇φL. If we want to consider Bayesian neural networks for the recognition model, we can consider learning in Bayesian neural networks using the probabilistic backpropagation algorithm (Hernandez-Lobato & Adams, 2015). Probabilistic backpropagation works by propagating probabilities forward through the network and then propagating the gradients of the marginal likelihood backwards with respect to the parameters of the posterior approximation. Using PBP, we could therefore consider uncertainty in the weights of the inference network and analyse how uncertainty calibration in the inference network affects the performance of scalable variational inference in deep directed generative models.

7. Summary

In this work, we have provided a unifying review and comparison of different methods for training directed latent variable models. We compared variance reduction techniques for obtaining better gradient estimates when maximizing the lower bound estimator of the marginal log-likelihood in deep generative models. All the algorithms train an auxiliary neural network to perform inference by optimizing the variational bound, and all jointly optimise the generative and the recognition model. Our experiments were carried out on the MNIST dataset, where we compared the different variance reduction techniques used in scalable variational inference methods, as well as the generality and flexibility of the different approaches for performing inference in directed latent variable models. We compared the different methods against the Auto-Encoding Variational Bayes (AEVB) method for performing approximate inference in directed generative models.

Acknowledgments

We would like to thank the MPhil MLSALT programme at the University of Cambridge for providing this course on Advanced Machine Learning. We thank Richard Turner and Zoubin Ghahramani, along with the other guest lecturers, for delivering this course. We thank Yingzhen Li and Richard Turner for providing useful feedback and suggestions for directions of work in this project.

References

Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural networks. CoRR, abs/1505.05424, 2015. URL http://arxiv.org/abs/1505.05424.

Bornschein, Jorg and Bengio, Yoshua. Reweighted wake-sleep. CoRR, abs/1406.2751, 2014. URL http://arxiv.org/abs/1406.2751.

Burda, Yuri, Grosse, Roger B., and Salakhutdinov, Ruslan. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015. URL http://arxiv.org/abs/1509.00519.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1462-1471, 2015. URL http://jmlr.org/proceedings/papers/v37/gregor15.html.

Hernandez-Lobato, Jose Miguel and Adams, Ryan. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1861-1869, 2015. URL http://jmlr.org/proceedings/papers/v37/hernandez-lobatoc15.html.

Hinton, Geoffrey E., Dayan, Peter, Frey, Brendan J., and Neal, Radford M. The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158-1161, 1995.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013. URL http://arxiv.org/abs/1312.6114.

Kingma, Diederik P., Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27 (NIPS 2014), December 8-13 2014, Montreal, Quebec, Canada, pp. 3581-3589, 2014. URL http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.

Kingma, Diederik P., Salimans, Tim, and Welling, Max. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015. URL http://arxiv.org/abs/1506.02557.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1791-1799, 2014. URL http://jmlr.org/proceedings/papers/v32/mnih14.html.


Mnih, Andriy and Rezende, Danilo Jimenez. Variational inference for Monte Carlo objectives. CoRR, abs/1602.06725, 2016. URL http://arxiv.org/abs/1602.06725.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278-1286, 2014. URL http://jmlr.org/proceedings/papers/v32/rezende14.html.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992. doi: 10.1007/BF00992696. URL http://dx.doi.org/10.1007/BF00992696.