
Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing

Yuri Burda1 (Fields Institute, University of Toronto), Roger B. Grosse1 (Department of Computer Science, University of Toronto), Ruslan Salakhutdinov (Department of Computer Science, University of Toronto)

1 Authors contributed equally.

Abstract

Markov random fields (MRFs) are difficult to evaluate as generative models because computing the test log-probabilities requires the intractable partition function. Annealed importance sampling (AIS) is widely used to estimate MRF partition functions, and often yields quite accurate results. However, AIS is prone to overestimate the log-likelihood with little indication that anything is wrong. We present the Reverse AIS Estimator (RAISE), a stochastic lower bound on the log-likelihood of an approximation to the original MRF model. RAISE requires only the same MCMC transition operators as standard AIS. Experimental results indicate that RAISE agrees closely with AIS log-probability estimates for RBMs, DBMs, and DBNs, but typically errs on the side of underestimating, rather than overestimating, the log-likelihood.

1 Introduction

In recent years, there has been a resurgence of interest in learning deep representations due to the impressive performance of deep neural networks across a range of tasks. Generative modeling is an appealing method of learning representations, partly because one can directly evaluate a model by measuring the probability it assigns to held-out test data. Restricted Boltzmann machines (RBMs; Smolensky, 1986) and deep Boltzmann machines (DBMs; Salakhutdinov and Hinton, 2009) are highly effective at modeling various complex visual datasets (e.g. Salakhutdinov and Murray, 2008; Salakhutdinov and Hinton, 2009). Unfortunately, measuring their likelihood exactly is intractable because it requires computing the partition function of a Markov random field (MRF).

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

Annealed importance sampling (AIS; Neal, 2001) has emerged as the state-of-the-art algorithm for estimating MRF partition functions, and is widely used to evaluate MRFs as generative models (Salakhutdinov and Murray, 2008; Theis et al., 2011). AIS is a consistent estimator of the partition function (Neal, 2001), and often performs very well in practice. However, it has a property which makes it unreliable: it tends to underestimate the partition function, which leads to overly optimistic measures of the model likelihood. In some cases, it can overestimate the log-likelihood by tens of nats (e.g. Grosse et al., 2013), and one cannot be sure whether impressive test log-probabilities result from a good model or a bad partition function estimator. The difficulty of evaluating likelihoods has led researchers to propose alternative generative models for which the log-likelihood can be computed exactly (Larochelle and Murray, 2011; Poon and Domingos, 2011) or lower bounded (Gregor et al., 2014; Mnih and Gregor, 2014), but RBMs and DBMs remain the state-of-the-art for modeling complex data distributions.

Bengio et al. (2013) highlighted the problem of optimistic RBM log-likelihood estimates and proposed a pessimistic estimator based on nonparametric density estimation. Unfortunately, they reported that their method tends to underestimate log-likelihoods by tens of nats on standard benchmarks, which is insufficient accuracy since the difference between competing models is often on the order of one nat.

We introduce the Reverse AIS Estimator (RAISE), an algorithm which computes conservative estimates of MRF log-likelihoods, but which achieves similar accuracy to AIS in practice. In particular, consider an approximate generative model defined as the distribution of approximate samples computed by AIS. Using importance sampling with a carefully chosen proposal distribution, RAISE computes a stochastic lower bound on the log-likelihood of the approximate model. RAISE is simple to implement, as it requires only the same MCMC transition operators as standard AIS.

We evaluated RAISE by using it to estimate test log-probabilities of several RBMs, DBMs, and Deep Belief Networks (DBNs). The RAISE estimates agree closely with the true log-probabilities on small RBMs where the partition function can be computed exactly. Furthermore, they agree closely with the standard AIS estimates for full-size RBMs, DBMs, and DBNs. Since one estimate is optimistic and one is pessimistic, this agreement is an encouraging sign that both estimates are close to the correct value. Our results suggest that AIS and RAISE, used in conjunction, can provide a practical way of estimating MRF test log-probabilities.

2 Background

2.1 Restricted Boltzmann Machines

While our proposed method applies to general MRFs, we use as our running example a particular type of MRF called the restricted Boltzmann machine (RBM; Smolensky, 1986). An RBM is an MRF with a bipartite structure over a set of visible units v = (v_1, ..., v_{N_v}) and hidden units h = (h_1, ..., h_{N_h}). In this paper, for purposes of exposition, we assume that all of the variables are binary valued. In this case, the distribution over the joint state {v, h} can be written as f(v, h)/Z, where

f(v, h) = exp(a^T v + b^T h + v^T W h),     (1)

and a, b, and W denote the visible biases, hidden biases, and weights, respectively. The weights and biases are the RBM's trainable parameters.

To train the RBM's weights and biases, one can maximize the log-probability of a set of training examples v_tr^(1), ..., v_tr^(M_tr). Since the log-likelihood gradient is intractable to compute exactly, it is typically approximated using contrastive divergence (Hinton, 2002) or persistent contrastive divergence (Tieleman, 2008). The performance of the RBM is then measured in terms of the average log-probability of a set of test examples v_test^(1), ..., v_test^(M_test).

It remains challenging to evaluate the probability p(v) = f(v)/Z of an example. The unnormalized probability f(v) = ∑_h f(v, h) can be computed exactly since the conditional distribution factorizes over the h_j. However, Z is intractable to compute exactly, and must be approximated.
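Because the sum over h factorizes, log f(v) reduces to a dot product plus a sum of softplus terms. The sketch below is our own illustration (the function name and NumPy layout are not from the paper); a, b, and W are as in (1) and v is a binary vector:

import numpy as np

def rbm_log_unnormalized_prob(v, a, b, W):
    # f(v) = sum_h f(v, h) = exp(a^T v) * prod_j (1 + exp(b_j + (W^T v)_j)),
    # so log f(v) is a^T v plus one softplus term per hidden unit.
    return v @ a + np.sum(np.logaddexp(0.0, b + v @ W))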

RBMs can also be extended to deep Boltzmann machines (Salakhutdinov and Hinton, 2009) by adding one or more additional hidden layers. For instance, the joint distribution of a DBM with two hidden layers h_1 and h_2 can be written as f(v, h_1, h_2)/Z, where

f(v, h_1, h_2) = exp(a^T v + b_1^T h_1 + b_2^T h_2 + v^T W_1 h_1 + h_1^T W_2 h_2).     (2)

DBMs can be evaluated similarly to RBMs. The main difference is that the unnormalized probability f(v) = ∑_{h_1, h_2} f(v, h_1, h_2) is intractable to compute exactly. However, Salakhutdinov and Hinton (2009) showed that, in practice, the mean-field approximation yields an accurate lower bound. Therefore, similarly to RBMs, the main difficulty in evaluating DBMs is estimating the partition function.

RBMs are also used as building blocks for training Deep Belief Networks (DBNs; Hinton et al., 2006). For example, a DBN with two hidden layers h_1 and h_2 is defined as the probability distribution

p(v, h_1, h_2) = p_2(h_1, h_2) p_1(v | h_1),     (3)

where p_2(h_1, h_2) is the probability distribution of an RBM, and p_1(v | h_1) is a product of independent logistic units. The unnormalized probability f(v) = ∑_{h_1, h_2} p_1(v | h_1) f_2(h_1, h_2) cannot be computed analytically, but can be approximated using importance sampling or a variational lower bound that utilizes a recognition distribution q(h_1 | v) approximating the posterior p(h_1 | v) (Hinton et al., 2006).

2.2 Partition Function Estimation

Often we have a probability distribution p_tgt(x) = f_tgt(x)/Z_tgt (which we call the target distribution) defined on a space X, where f_tgt(x) can be computed efficiently for a given x ∈ X, and Z_tgt is an intractable normalizing constant. There are two particular cases which concern us here. First, p_tgt may correspond to a Markov random field (MRF), such as an RBM, where f_tgt(x) denotes the product of all potentials, and Z_tgt = ∑_x f_tgt(x) is the partition function of the graphical model.

The second case is where one has a directed graphical model with latent variables h and observed variables v. Here, the joint distribution p(h, v) = p(h) p(v | h) can be tractably computed for any particular pair (h, v). However, one often wants to compute the likelihood of a test example p(v_test) = ∑_h p(h, v_test). This can be placed in the above framework with

f_tgt(h) = p(h) p(v_test | h)   and   Z_tgt = p(v_test).     (4)

Mathematically, the two partition function estimation problems outlined above are closely related, and the same classes of algorithms are applicable to each. However, they differ in terms of the behavior of approximate inference algorithms in the context of model selection. In particular, many algorithms, such as annealed importance sampling (Neal, 2001) and sequential Monte Carlo (del Moral et al., 2006), yield unbiased estimates Ẑ_tgt of the partition function, i.e. E[Ẑ_tgt] = Z_tgt. Jensen's inequality shows that such an estimator tends to underestimate the log partition function on average:

E[log Ẑ_tgt] ≤ log E[Ẑ_tgt] = log Z_tgt.     (5)
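As a toy illustration of this point (our own example, not from the paper), an unbiased importance sampling estimate of a Gaussian normalizing constant has a log that falls below log Z on average:

import numpy as np

rng = np.random.default_rng(0)
log_Z_true = 0.5 * np.log(2.0 * np.pi)        # f(x) = exp(-x^2/2) has Z = sqrt(2*pi)

def log_Z_hat(num_samples):
    x = rng.standard_cauchy(num_samples)                  # proposal q: standard Cauchy
    log_w = -0.5 * x**2 + np.log(np.pi * (1.0 + x**2))    # log f(x) - log q(x)
    return np.logaddexp.reduce(log_w) - np.log(num_samples)

estimates = np.array([log_Z_hat(10) for _ in range(2000)])
# The weights are unbiased for Z, yet the average of log Z-hat typically sits below log Z.
print(log_Z_true, estimates.mean())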


Algorithm 1 Annealed Importance Sampling
  for i = 1 to M do
    x_0 ← sample from p_0(x) = f_ini(x)/Z_ini
    w^(i) ← Z_ini
    for k = 1 to K do
      w^(i) ← w^(i) · f_k(x_{k−1}) / f_{k−1}(x_{k−1})
      x_k ← sample from T_k(· | x_{k−1})
    end for
  end for
  return Ẑ_tgt = (1/M) ∑_{i=1}^{M} w^(i)

In addition, Markov's inequality shows that it is unlikely to substantially overestimate log Z_tgt:

Pr(log Ẑ_tgt > log Z_tgt + b) < e^{−b}.     (6)

For these reasons, we will refer to the estimator as a stochastic lower bound on log Z_tgt.

In the MRF situation, Z_tgt appears in the denominator, so underestimates of the log partition function translate into overestimates of the log-likelihood. This is problematic, since inaccurate partition function estimates can lead one to dramatically overestimate the performance of one's model. This problem has led researchers to consider alternative generative models where the likelihood can be tractably computed. By contrast, in the directed case, the partition function is the test probability (4), so underestimates correspond to overly conservative measures of performance. For example, the fact that sigmoid belief networks (Neal, 1992) have tractable lower (rather than upper) bounds is commonly cited as a reason to prefer them over RBMs and DBMs (e.g. Mnih and Gregor, 2014).

We note that it is possible to achieve stronger tail bounds than (6) by combining multiple unbiased estimates in clever ways (Gogate et al., 2007).

2.3 Annealed Importance Sampling

Annealed importance sampling (AIS) is an algorithm which estimates Z_tgt by gradually changing, or "annealing," a distribution. In particular, one must specify a sequence of K + 1 intermediate distributions p_k(x) = f_k(x)/Z_k for k = 0, ..., K, where p_ini(x) = p_0(x) is a tractable initial distribution, and p_tgt(x) = p_K(x) is the intractable target distribution. For simplicity, assume all distributions are strictly positive on X. For each p_k, one must also specify an MCMC transition operator T_k (e.g. Gibbs sampling) which leaves p_k invariant. AIS alternates between MCMC transitions and importance sampling updates, as shown in Algorithm 1.
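A compact NumPy sketch of Algorithm 1 follows. The callables log_f, transition, and sample_p0, the schedule betas, and log_Z0 are placeholders we introduce here; the paper itself only specifies the pseudocode above:

import numpy as np

def ais_log_Z_estimate(log_f, transition, sample_p0, log_Z0, betas, num_chains):
    # log_f(x, beta): log of the intermediate unnormalized density f_beta(x)
    # transition(x, beta): one MCMC step (e.g. Gibbs) leaving p_beta invariant
    # sample_p0(n): n exact samples from the tractable initial distribution p_0
    # betas: annealing schedule with 0 = beta_0 < ... < beta_K = 1
    x = sample_p0(num_chains)
    log_w = np.full(num_chains, log_Z0)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        log_w += log_f(x, beta) - log_f(x, beta_prev)   # importance weight update
        x = transition(x, beta)                         # move under the new distribution
    # log of the average weight over chains: a stochastic lower bound on log Z_tgt
    return np.logaddexp.reduce(log_w) - np.log(num_chains)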

The output of AIS is an unbiased estimate Ẑ_tgt of Z_tgt. Importantly, unbiasedness is not an asymptotic property, but holds for any K (Neal, 2001; Jarzynski, 1997). Neal (2001) demonstrated this by viewing AIS as an importance sampling estimator over an extended state space. In particular, define the distributions

q_fwd(x_{0:K−1}) = p_0(x_0) ∏_{k=1}^{K−1} T_k(x_k | x_{k−1}),     (7)
f_rev(x_{0:K−1}) = f_tgt(x_{K−1}) ∏_{k=1}^{K−1} T̃_k(x_{k−1} | x_k),     (8)

where T̃_k(x′ | x) = T_k(x | x′) p_k(x′)/p_k(x) is the reverse transition operator for T_k. Here, q_fwd represents the sequence of states generated by AIS, and f_rev is a fictitious (unnormalized) reverse chain which begins with an exact sample from p_tgt and applies the transitions in reverse order. Neal (2001) showed that the AIS weights correspond to the importance weights for f_rev with q_fwd as the proposal distribution.

The mathematical formulation of AIS leaves much flexibility for choosing intermediate distributions. The choice of distributions can have a large effect on the performance of AIS (Grosse et al., 2013), but the most common choice is to take geometric averages of the initial and target distributions:

p_β(x) = f_β(x)/Z(β) = f_ini(x)^{1−β} f_tgt(x)^{β} / Z(β),     (9)

where 0 = β_0 < β_1 < ... < β_K = 1 defines the annealing schedule. Commonly, f_ini is the uniform distribution, and (9) reduces to p_β(x) = f_tgt(x)^{β} / Z(β). This motivates the term "annealing", and β resembles an inverse temperature parameter. As in simulated annealing, the "hotter" distributions often allow faster mixing between modes which are isolated in p_tgt. Geometric averages are widely used because they often have a simple form; for instance, the geometric average of two RBMs is obtained by linearly averaging the weights and biases. The values of β can be spaced evenly between 0 and 1, although other schedules have been explored (Neal, 1996; Behrens et al., 2012; Calderhead and Girolami, 2009).
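For two RBMs over the same variables, the geometric average in (9) is again an RBM whose parameters are the linear blend of the endpoints; a minimal sketch under our own parameter-dictionary convention:

def interpolate_rbm_params(params_ini, params_tgt, beta):
    # f_ini(v,h)^(1-beta) * f_tgt(v,h)^beta is an RBM with linearly blended parameters.
    return {key: (1.0 - beta) * params_ini[key] + beta * params_tgt[key]
            for key in ("a", "b", "W")}

When f_ini is uniform (all parameters zero), this reduces to scaling the target RBM's parameters by β, matching the simplified form of (9) above.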

3 Reverse AIS Estimator

A significant difficulty in evaluating MRFs is that it is intractable to compute the partition function. Furthermore, the commonly used algorithms, such as AIS, tend to overestimate the log-likelihood. If we cannot hope to obtain provably accurate partition function estimates, it would be far preferable for algorithms to underestimate, rather than overestimate, the log-likelihoods. This would save us from the embarrassment of reporting unrealistically high test log-probability scores for a given dataset. In this section, we define an approximate generative model which becomes equivalent to the MRF in the limit of infinite computation. We then present a procedure for obtaining unbiased estimates of the probability of a test example (and therefore a stochastic lower bound on the test log-probability) under the approximate model.


Algorithm 2 Reverse AIS Estimator (RAISE)
  for i = 1 to M do
    h_K ← sample from p_tgt(h | v_test)
    w^(i) ← f_tgt(v_test) / Z_0
    for k = K − 1 to 0 do
      x_k ← sample from T̃_k(· | x_{k+1})
      w^(i) ← w^(i) · f_k(x_k) / f_{k+1}(x_k)
    end for
  end for
  return p̂_ann(v_test) = (1/M) ∑_{i=1}^{M} w^(i)

3.1 Case of Tractable Posterior

In this section, we denote the model state as x = (v, h), with v observed and h unobserved. Let us first assume the conditional distribution p_tgt(h | v) is tractable, as is the case for RBMs. Define the following generative process, which corresponds to the sequence of transitions in AIS:

p_fwd(x_{0:K}) = p_0(x_0) ∏_{k=1}^{K} T_k(x_k | x_{k−1}).     (10)

By taking the final visible states of this process, we obtain a generative model (which we term the annealing model) which approximates p_tgt(v):

p_ann(v_K) = ∑_{x_{0:K−1}, h_K} p_fwd(x_{0:K−1}, h_K, v_K).     (11)

Suppose we are interested in estimating the probability of a test example v_test. We use as a proposal distribution a reverse chain starting from v_test. In the annealing metaphor, this corresponds to gradually "melting" the distribution:

q_rev(x_{0:K−1}, h_K | v_test) = p_tgt(h_K | v_test) ∏_{k=1}^{K} T̃_k(x_{k−1} | x_k),

where we identify v_K = v_test, and T̃_k(x′ | x) = T_k(x | x′) p_k(x′)/p_k(x) is the reverse transition operator for T_k. We then obtain the following identity:

p_ann(v_test) = E_{q_rev} [ p_fwd(x_{0:K−1}, h_K, v_test) / q_rev(x_{0:K−1}, h_K | v_test) ]
             = E_{q_rev} [ (p_0(x_0) / p_tgt(h_K | v_test)) ∏_{k=1}^{K} T_k(x_k | x_{k−1}) / T̃_k(x_{k−1} | x_k) ]
             = E_{q_rev} [ (p_0(x_0) / p_tgt(h_K | v_test)) ∏_{k=1}^{K} f_k(x_k) / f_k(x_{k−1}) ]
             = E_{q_rev} [ (p_0(x_0) / p_tgt(h_K | v_test)) (f_tgt(x_K) / f_0(x_0)) ∏_{k=0}^{K−1} f_k(x_k) / f_{k+1}(x_k) ]
             = E_{q_rev} [ (f_K(v_test) / Z_0) ∏_{k=0}^{K−1} f_k(x_k) / f_{k+1}(x_k) ] ≜ E_{q_rev}[w].     (12)

This yields the following algorithm: generate M samples from q_rev, and average the values w defined in (12). There is no need to store the full chains, since the weights can be updated online. We refer to this algorithm as the Reverse AIS Estimator, or RAISE. The full algorithm is given in Algorithm 2. We note that RAISE is straightforward to implement, as it requires only the same MCMC transition operators as standard AIS.

Our derivation (12) mirrors the derivation of AIS by Neal (2001). The difference is that in AIS, the reverse chain is merely hypothetical; in RAISE, the reverse chain is simulated, and it is the forward chain which is hypothetical.

By (12), the weights w are an unbiased estimator of the probability p_ann(v_test). Therefore, following the discussion of Section 2.2, log w is a stochastic lower bound on log p_ann(v_test). Furthermore, since p_ann converges to p_tgt in probability as K → ∞ (Neal, 2001), we would heuristically expect RAISE to yield a conservative estimate of log p_tgt(v_test). This is not strictly guaranteed, however; RAISE may overestimate log p_tgt(v_test) for finite K if p_ann(v_test) > p_tgt(v_test), which is possible if the AIS approximation somehow attenuates pathologies in the original MRF. (One such example is described in Section 5.1.) However, since RAISE is a stochastic lower bound on the log-probabilities under the annealing model, we can strictly rule out the possibility of RAISE reporting unrealistically high test log-probabilities for a given dataset, a situation frequently observed with AIS.
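The sketch below mirrors the AIS sketch in Section 2.3, but runs the chain in reverse as in Algorithm 2; the callables are again placeholders of ours. For Gibbs transitions, the reverse operator T̃_k is itself a Gibbs sweep with the block updates applied in the opposite order:

import numpy as np

def raise_log_prob_estimate(v_test, init_states, log_f, log_f_tgt_v,
                            reverse_transition, log_Z0, betas, num_chains):
    # init_states(v, n): n states x_K = (v, h) with h drawn exactly from p_tgt(h | v)
    # log_f(x, beta): log of the intermediate unnormalized density at state x
    # log_f_tgt_v(v): log f_tgt(v) with the hidden units summed out
    # reverse_transition(x, beta): one step of the reverse operator for T_beta
    x = init_states(v_test, num_chains)
    log_w = np.full(num_chains, log_f_tgt_v(v_test) - log_Z0)
    for k in range(len(betas) - 2, -1, -1):              # k = K-1, ..., 0
        x = reverse_transition(x, betas[k])
        log_w += log_f(x, betas[k]) - log_f(x, betas[k + 1])
    # log of the average weight: a stochastic lower bound on log p_ann(v_test)
    return np.logaddexp.reduce(log_w) - np.log(num_chains)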

3.2 Extension to Intractable Posterior Distributions

Because Algorithm 2 begins with an exact sample from the conditional distribution p_tgt(h | v_test), it requires that this distribution be tractable. However, many models of interest, such as DBMs, have intractable posterior distributions. To deal with this case, we augment the forward chain with an additional heating step, such that the conditional distribution in the final step is tractable, but the distribution over v agrees with that of p_ann in (11). We make the further (weak) assumption that p_0(h | v) is tractable. Let T^(v)_k denote an MCMC transition operator which preserves p_k(v, h), but does not change v. For example, it may cycle through Gibbs updates to all variables except v. The forward chain then has the following distribution:

p_fwd(x_{0:K}, h′_{0:K−1}) = p_0(x_0) ∏_{k=1}^{K} T_k(x_k | x_{k−1}) ∏_{k=0}^{K−1} T^(v_K)_k(h′_k | h′_{k+1}),

where we identify h′_K = h_K. The reverse distribution is given by:

q_rev(x_{0:K−1}, h_K, h′_{0:K−1} | v_test) = p_0(h′_0 | v_test) ∏_{k=0}^{K−1} T^(v_test)_k(h′_{k+1} | h′_k) ∏_{k=1}^{K} T̃_k(x_{k−1} | x_k).


Figure 1: A schematic of RAISE for intractable distributions, applied to DBMs. Green: generative model. Blue: proposal distribution. At the top is shown which distribution the variables at each step are meant to approximate.

Algorithm 3 RAISE with intractable posterior
  for i = 1 to M do
    h′_0 ← sample from p_0(h | v_test)
    w^(i) ← p_0(v_test)
    for k = 1 to K do
      h′_k ← sample from T^(v_test)_k(· | h′_{k−1})
      w^(i) ← w^(i) · f_k(h′_k, v_test) / f_{k−1}(h′_k, v_test)
    end for
    x_K ← (v_test, h′_K)
    for k = K − 1 to 0 do
      x_k ← sample from T̃_k(· | x_{k+1})
      w^(i) ← w^(i) · f_k(x_k) / f_{k+1}(x_k)
    end for
  end for
  return p̂_ann(v_test) = (1/M) ∑_{i=1}^{M} w^(i)

The unbiased estimator is derived similarly to that of Section 3.1:

w ≜ p_fwd(x_{0:K−1}, h_K, v_test, h′_{0:K−1}) / q_rev(x_{0:K−1}, h_K, h′_{0:K−1} | v_test)     (13)
  = p_0(v_test) ∏_{k=0}^{K−1} f_k(x_k) / f_{k+1}(x_k) ∏_{k=1}^{K} f_k(h′_k, v_test) / f_{k−1}(h′_k, v_test).

The full algorithm is shown in Algorithm 3, and a schematic for the case of DBMs is shown in Figure 1.

3.3 Interpretation as Unrolling

Hinton et al. (2006) showed that the Gibbs sampling procedure for a binary RBM could be interpreted as generating from an infinitely deep sigmoid belief net with shared weights. They used this insight to derive a greedy training procedure for Deep Belief Nets (DBNs), where one unties the weights of a single layer at a time. Furthermore, they observed that one could perform approximate inference in the belief net using the transpose of the generative weights to compute a variational approximation.

We note that, for RBMs, RAISE can similarly be viewed as a form of unrolling: the annealed generative model p_ann can be viewed as a belief net with K + 1 layers. Furthermore, the RAISE proposal distribution can be viewed as using the transpose of the weights to perform approximate inference. (The difference from approximate inference in DBNs is that RAISE samples the units rather than using the mean-field approximation.)

This interpretation of RAISE suggests a method of applying it to DBNs. The generative model is obtained by unrolling the RBM on top of the directed layers as shown in Figure 2. The proposal distribution uses the transposes of the DBN weights for each of the directed layers. The rest is the same as the ordinary RAISE for the unrolled part of the model.

Figure 2: RAISE applied to a DBN unrolled into a very deep sigmoid belief net, for K = 1000 intermediate distributions. Green: generative model. Blue: proposal distribution.

4 Variance Reduction using Control Variates

One of the virtues of log-likelihood estimation using AIS is its speed: the partition function need only be estimated once. RAISE, unfortunately, must be run separately for every test example. We would therefore prefer to compute the RAISE estimate for only a small number of test examples. Unfortunately, subsampling the test examples introduces a significant source of variability: as different test examples can have wildly different log-likelihoods,2 the estimate of the average log-likelihood can vary significantly depending on which batch of examples is selected. We attenuate this variability using the method of control variates (Ross, 2006), a variance reduction technique which has also been applied to black-box variational inference (Ranganath et al., 2014).

If Y_1, ..., Y_n are independent samples of a random variable Y, then the sample average (1/n) ∑_{i=1}^{n} Y_i is an unbiased estimator of E[Y] with variance Var[Y]/n. If X is another random variable (which ideally is both cheap to compute and highly correlated with Y), then for any scalar α,

(1/n) ∑_{i=1}^{n} (Y_i − αX_i) + (α/N) ∑_{i=1}^{N} X_i     (14)

is an unbiased estimator of E[Y] with variance

Var[Y − αX]/n + α² Var[X]/N + 2α Cov[Y − αX, X]/n.

2 This effect can be counterintuitively large due to different complexities of different categories; e.g., for the mnistCD25-500 RBM, the average log-likelihood of handwritten digits "1" was 56.6 nats higher than the average log-likelihood of digits "8".

In our experiments, Y is the RAISE estimate of the log-probability of a test example, and X is the (exact or estimated) log unnormalized probability under the original MRF. Since the unnormalized probability under the MRF is significantly easier to evaluate than the log-probability under the annealing model, we can let N be much larger than n; we set n = 100 and let N be the total number of test examples. Since the annealing model is an approximation to the MRF, the two models should assign similar log-probabilities, so we set α = 1. Hence we expect the variance of Y − X to be smaller than the variance of Y, and thus (14) to have a significantly smaller variance than the sample average. Empirically, we have found that Y − X has significantly smaller variance than Y, even when the number of intermediate distributions is relatively small.
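In code, the estimator (14) is a one-liner; the helper below (our own naming) takes the RAISE estimates Y on the small batch together with the control variate X evaluated on both the batch and the full test set:

import numpy as np

def control_variate_estimate(y_batch, x_batch, x_full, alpha=1.0):
    # (1/n) * sum_i (Y_i - alpha * X_i) + (alpha/N) * sum_i X_i, as in (14)
    return np.mean(y_batch - alpha * x_batch) + alpha * np.mean(x_full)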

5 Experimental Results

We have evaluated RAISE on several MRFs to determine if its log-probability estimates are both accurate and conservative. We compared our estimates against those obtained from standard AIS. We also compared against the exact log-probabilities of small models for which the partition function can be computed exactly (Salakhutdinov and Murray, 2008). AIS is expected to overestimate the true log-probabilities while RAISE is expected to underestimate them. Hence, a close agreement between the two estimators would be a strong indication of accurate estimates.

We considered two datasets: (1) the MNIST handwritten digit dataset (LeCun et al., 1998), which has long served as a benchmark for both classification and density modeling, and (2) the Omniglot dataset (Lake et al., 2013), which contains images of handwritten characters across many world alphabets.3

Both AIS and RAISE can be used with any sequence of intermediate distributions. For simplicity, in all of our experiments, we used the geometric averages path (9) with linear spacing of the parameter β. We tested two choices of initial distribution p_ini: the uniform distribution, and the data base rate (DBR) distribution (Salakhutdinov and Murray, 2008), where all units are independent, all hidden units are uniform, and the visible biases are set to match the average pixel values in the training set. In all cases, our MCMC transition operator was Gibbs sampling.

3 We used the standard split of MNIST into 60,000 training and 10,000 test examples and a random split of Omniglot into 24,345 training and 8,070 test examples. In both cases, the inputs are 28 × 28 binary images.

Figure 3: RAISE estimates of average test log-probabilities using uniform p_ini, plotted against the number of intermediate distributions for the mnistPCD-20, mnistCD1-20, mnistCD1-500, mnistPCD-500, mnistCD25-500, mnistDBN, mnistDBM, and omniPCD-1000 models. The log-probability estimates tend to increase with the number of intermediate distributions, suggesting that RAISE is a conservative estimator.
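As an illustration, the DBR initial distribution can be built directly from the training images. The sketch below is our own (including the function name and the smoothing constant); it returns RBM parameters in the same layout as (1):

import numpy as np

def data_base_rate_params(train_images, num_hidden, eps=1e-2):
    # Independent units: zero weights; uniform hidden units: zero hidden biases;
    # visible biases set to the logits of the (smoothed) mean training pixel values.
    p = np.clip(train_images.mean(axis=0), eps, 1.0 - eps)
    return {"a": np.log(p / (1.0 - p)),
            "b": np.zeros(num_hidden),
            "W": np.zeros((train_images.shape[1], num_hidden))}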

We estimated the log-probabilities of a random sample of 100 examples from the test set using RAISE and used the method of control variates (Sec. 4) to estimate the average log-probabilities on the full test dataset. For RBM experiments, the control variate was the RBM log unnormalized probability, log f(v), whereas for DBMs and DBNs, we used an estimate based on simple importance sampling as described below. For each of the 100 test examples, RAISE was run with 50 independent chains, while the AIS partition function estimates used 5,000 chains; this closely matched the computation time per intermediate distribution between the two methods. Each method required about 1.5 hours with the largest number of intermediate distributions (K = 100,000).

5.1 Restricted Boltzmann Machines

We considered models trained using two algorithms: contrastive divergence (CD; Hinton, 2002) with both 1 and 25 CD steps, and persistent contrastive divergence (PCD; Tieleman, 2008). We will refer to the RBMs by the dataset, training algorithm, and the number of hidden units. For example, "mnistCD25-500" denotes an RBM with 500 hidden units, trained on MNIST using 25 CD steps. The MNIST-trained RBMs are the same ones evaluated by Grosse et al. (2013). We also provide comparisons to the Conservative Sampling-based Log-likelihood (CSL) estimator of Bengio et al. (2013).4

Figure 3 shows the average RAISE test log-probability estimates for all of the RBMs as a function of the number of intermediate distributions. In all of these examples, as expected, the estimated log-probabilities tended to increase with the number of intermediate distributions, consistent with RAISE being a conservative log-probability estimator.

4 The number of chains and number of Gibbs steps for CSL were chosen to match the total number of Gibbs steps required by RAISE and AIS for K = 100,000.

Table 1: RAISE and AIS average test log-probabilities using 100,000 intermediate distributions and both choices of p_ini. CSL: the estimator of Bengio et al. (2013). gap: the difference AIS − RAISE.

                                       uniform                         data base rates
Model            exact      CSL        RAISE      AIS        gap       RAISE      AIS        gap
mnistCD1-20      -164.50    -185.74    -165.33    -164.51    0.82      -164.11    -164.50    -0.39
mnistPCD-20      -150.11    -152.13    -150.58    -150.04    0.54      -150.17    -150.10    0.07
mnistCD1-500     —          -566.91    -150.78    -106.52    44.26     -124.77    -124.09    0.68
mnistPCD-500     —          -138.76    -101.07    -99.99     1.08      -101.26    -101.28    -0.02
mnistCD25-500    —          -145.26    -88.51     -86.42     2.09      -86.39     -86.35     0.04
omniPCD-1000     —          -144.25    -100.47    -100.45    0.02      -100.46    -100.46    0.00

Figure 4: AIS and RAISE estimates of mnistCD1-500 average test log-probabilities, annealing from (a) the uniform initial distribution and (b) the data base rates. The estimates have a significant gap when annealing from a uniform initial distribution, but agree closely when annealing from the data base rates.

Table 1 shows the final average test log-probability estimates obtained using CSL as well as both RAISE and AIS with 100,000 intermediate distributions. In all of the trials using the DBR initial distribution, the estimates of AIS and RAISE agreed to within 1 nat, and in many cases, to within 0.1 nats. The CSL estimator, on the other hand, underestimated log p_tgt by tens of nats in almost all cases, which is insufficient accuracy since well-trained models often differ by only a few nats.

We observed that the DBR initial distribution gave consistently better agreement between the two methods compared with the uniform distribution, consistent with the results of Salakhutdinov and Murray (2008). The largest discrepancy, 44.26 nats, was for mnistCD1-500 with uniform p_ini; with DBR, the two methods differed by only 0.68. Figure 4 plots both estimates as a function of the number of intermediate distributions. In the uniform case, one might not notice the inaccuracy from running AIS alone, as the AIS estimates may appear to level off. One could be tricked into reporting results that are tens of nats too high! By contrast, when both methods are run in conjunction, the inaccuracy of at least one of the methods becomes obvious.

As discussed in Section 3.1, RAISE is a stochastic lower bound on the log-likelihood of the annealing model p_ann, but not necessarily of the RBM itself. When p_ann is a good approximation to the RBM, RAISE gives a conservative estimate of the RBM log-likelihood. However, it is possible for RAISE to overestimate the RBM log-likelihood if p_ann models the data distribution better than the RBM itself, for instance if the approximation attenuates pathologies of the RBM. We observed a single instance of this in our RBM experiments: the mnistCD1-20 RBM, with the data base rate initialization. As shown in Figure 5, the RAISE estimates exceeded the AIS estimates for small K, and declined as K was increased. Since RAISE gives a stochastic lower bound on log p_ann and AIS gives a stochastic upper bound on log p_tgt, this inversion implies that p_ann significantly outperformed the RBM itself. Indeed, the RBM (mistakenly) assigned 93% of its probability mass to a single hidden configuration, while the RAISE model spreads its probability mass among more diverse configurations.

Figure 5: The mnistCD1-20 RBM, where we observed RAISE to overestimate the RBM's test log-probabilities. Left: Average test log-probability estimates as a function of K. Top right: 10 independent samples from the RBM. Bottom right: 10 independent samples from the annealing model p_ann with 10 intermediate distributions. The p_ann samples, while poor, show greater diversity compared to the RBM samples, consistent with p_ann better matching the data distribution.

In all of our other RBM experiments, the AIS and RAISE estimates with DBR initialization and K = 100,000 agreed to within 0.1 nats. Figure 6 shows one such case, for an RBM trained on the challenging Omniglot dataset.

Figure 6: Left: AIS and RAISE estimates of omniPCD-1000 RBM average test log-probabilities, annealing from a uniform initial distribution, as a function of the number of intermediate distributions. Top right: 32 training samples from the Omniglot training set. Bottom right: 32 independent samples from the omniPCD-1000 RAISE model with 100,000 intermediate distributions.

Overall, the RAISE and AIS estimates using DBR initialization agreed closely in all cases, and RAISE gave conservative estimates in all but one case, suggesting that RAISE typically gives accurate and conservative estimates of RBM test log-probabilities.

5.2 Deep Boltzmann Machines

We used RAISE to estimate the average test log-probabilities of two DBM models trained on MNIST and Omniglot. The MNIST DBM has 2 hidden layers of size 500 and 1000, and the Omniglot DBM has 2 hidden layers each of size 1000. As with RBMs, we ran RAISE on 100 random test examples and used the DBM log unnormalized probability, log f(v), as a control variate. To obtain estimates of the DBM unnormalized probability f(v) = ∑_{h_1, h_2} f(v, h_1, h_2), we used simple importance sampling, f(v) = E_q[f(v, h_2) / q(h_2 | v)], with 500 samples, where the proposal distribution q was the mean-field approximation to the conditional distribution p(h_2 | v). The term f(v, h_2) was computed by summing out h_1 analytically, which is efficient because the conditional distribution factorizes.5

5 Previous work (e.g. Salakhutdinov and Hinton, 2009) estimated log f(v) using the mean-field lower bound. We found importance sampling to give more accurate results in the context of AIS. However, it made less difference for RAISE, where the log unnormalized probabilities are merely used as a control variate.
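A sketch of this simple importance sampling estimator for a two-layer binary DBM, in the notation of (2); the function name, array layout, and the clipping of the mean-field means mu2 are our own choices:

import numpy as np

def dbm_log_f_v(v, a, b1, b2, W1, W2, mu2, num_samples=500, rng=None):
    # f(v) = E_q[ f(v, h2) / q(h2 | v) ] with q(h2 | v) a factorized Bernoulli
    # whose means mu2 come from the mean-field approximation to p(h2 | v).
    rng = np.random.default_rng() if rng is None else rng
    mu2 = np.clip(mu2, 1e-6, 1.0 - 1e-6)
    h2 = (rng.random((num_samples, b2.size)) < mu2).astype(float)
    log_q = h2 @ np.log(mu2) + (1.0 - h2) @ np.log(1.0 - mu2)
    # f(v, h2): sum out h1 analytically (the conditional over h1 factorizes)
    pre_h1 = b1 + v @ W1 + h2 @ W2.T
    log_f_vh2 = v @ a + h2 @ b2 + np.sum(np.logaddexp(0.0, pre_h1), axis=1)
    return np.logaddexp.reduce(log_f_vh2 - log_q) - np.log(num_samples)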

We compared the RAISE estimates to those obtained using AIS. All results for K = 100,000 are shown in Table 2, and the estimates for the MNIST DBM are plotted as a function of K in Figure 7 (left). All estimates for the MNIST DBM with K = 100,000 agreed quite closely, which is a piece of evidence in favor of the accuracy of the estimates. Furthermore, RAISE provided conservative estimates of log-probabilities for small K, in contrast with AIS, which gave overly optimistic estimates. For the Omniglot DBM, RAISE overestimated the DBM log-probabilities by at least 6 nats, implying that the annealing model fit the data distribution better than the DBM, analogously to the case of the mnistCD1-20 RBM discussed in Section 5.1. This shows that RAISE does not completely eliminate the possibility of overestimating an MRF's test log-probabilities.

Table 2: Test log-probability estimates for deep models with K = 100,000. gap: the difference AIS − RAISE.

                       uniform                         data base rates
Model             RAISE      AIS        gap       RAISE      AIS        gap
MNIST DBM         -85.69     -85.72     -0.03     -85.74     -85.67     0.07
Omniglot DBM      -104.48    -110.86    -6.38     -102.64    -103.27    -0.63
MNIST DBN         -84.67     -84.49     0.18      —          —          —
Omniglot DBN      -100.78    -100.45    0.33      —          —          —

5.3 Deep Belief Networks

In our final set of experiments, we used RAISE to estimate the average test log-probabilities of DBNs trained on MNIST and Omniglot. The MNIST DBN had two hidden layers of size 500 and 2000, and the Omniglot DBN had two hidden layers each of size 1000. For the initial distribution p_0 we used the uniform distribution, as the DBR distribution is not defined for DBNs. To obtain estimates of DBN unnormalized probabilities f(v) = ∑_{h_1} p(v | h_1) f(h_1), we used importance sampling, f(v) = E_q[p(v | h_1) f(h_1) / q(h_1 | v)], with 500 samples, where q was the DBN recognition distribution (Hinton et al., 2006).

Figure 7: Average test log-probability estimates for MNIST models as a function of K. Left: the DBM. Right: the DBN.

All results for K = 100,000 are shown in Table 2, and Figure 7 shows the estimates for the MNIST DBN as a function of K. For both DBNs, RAISE and AIS agreed to within 1 nat for K = 100,000, and RAISE gave conservative log-probability estimates for all values of K.

5.4 Summary

Between our RBM, DBM, and DBN experiments, we compared 10 different models using both uniform and data base rate initial distributions. In all but two cases (the mnistCD1-20 RBM and the Omniglot DBM), RAISE gave estimates at or below the smallest log-probability estimates produced by AIS, suggesting that RAISE typically gives conservative estimates. Furthermore, in all but one case (the Omniglot DBM), the final RAISE estimate agreed with the lowest AIS estimate to within 1 nat, suggesting that it is typically accurate.

6 Conclusion

In this paper, we presented RAISE, a stochastic lower bound on the log-likelihood of an approximation to an MRF model. Our experimental results show that RAISE typically produces accurate, yet conservative, estimates of log-probabilities for RBMs, DBMs, and DBNs. More importantly, by using RAISE and AIS in conjunction, one can judge the accuracy of one's results by measuring the agreement of the two estimators.

Acknowledgements

This research was supported by NSERC, Google, and Samsung.


References

Gundula Behrens, Nial Friel, and Merrilee Hurn. Tuning tempered transitions. Statistics and Computing, 22:65–78, 2012.

Y. Bengio, L. Yao, and K. Cho. Bounding the test log-likelihood of generative models. arXiv:1311.6184, 2013.

Ben Calderhead and Mark Girolami. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53(12):4028–4045, 2009.

P. del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Methodology), 68(3):411–436, 2006.

V. Gogate, B. Bidyuk, and R. Dechter. Studies in lower bounding probability of evidence using the Markov inequality. In Conference on Uncertainty in AI, 2007.

K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. In Int'l Conf. on Machine Learning, 2014.

R. B. Grosse, C. J. Maddison, and R. Salakhutdinov. Annealing between distributions by averaging moments. In Neural Information Processing Systems, 2013.

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Christopher Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56:5018–5035, 1997.

Brenden M. Lake, Ruslan Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2526–2534. Curran Associates, Inc., 2013.

H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Artificial Intelligence and Statistics, 2011.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In Int'l Conf. on Machine Learning, 2014.

R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Radford Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353–366, 1996.

Radford M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 1992.

H. Poon and P. Domingos. Sum-product networks: a new deep architecture. In Uncertainty in Artificial Intelligence, 2011.

R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.

S. M. Ross. Simulation. Academic Press, 2006.

Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.

Ruslan Salakhutdinov and Ian Murray. On the quantitative analysis of deep belief networks. In Int'l Conf. on Machine Learning, pages 6424–6429, 2008.

P. Smolensky. Information processing in dynamical systems: foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.

Lucas Theis, Sebastian Gerwinn, Fabian Sinz, and Matthias Bethge. In all likelihood, deep belief is not enough. Journal of Machine Learning Research, 12:3071–3096, 2011.

Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Int'l Conf. on Machine Learning, 2008.
