
Stochastic Expectation Maximization for Latent Variable Models

Manzil Zaheer and Satwik Kottur
Carnegie Mellon University
Pittsburgh, PA 15213
{manzil, skottur}@cmu.edu

Abstract

In this project we implement and study a type of stochastic optimization. The method, based on expectation-maximization, is asynchronous and embarrassingly parallel, and is therefore well suited to inference in latent variable models. The motivation comes from the desire to design an inference procedure directly from a "comptastical" (computational + statistical) perspective, one capable of leveraging modern computational resources such as GPUs or cloud computing that offer massive parallelism. We also find an interesting connection between stochastic expectation-maximization and stochastic gradient descent that strengthens the validity of the proposed method.

1 Introduction

In the past decade, frameworks such as stochastic gradient descent (SGD) [1] and map-reduce [2] have enabled machine learning algorithms to scale to larger and larger datasets. However, these frameworks are not always applicable to Bayesian latent variable models (LVMs) with rich statistical dependencies and intractable gradients. Markov chain Monte Carlo (MCMC) [3] is an appealing alternative, but traditional algorithms such as the Gibbs sampler do not match modern computational resources, as they are inherently sequential and the extent to which they can be parallelized depends heavily upon how the structure of the statistical model interacts with the data. Sometimes, due to the concentration of measure phenomenon associated with large sample sizes, computing the full posterior is unnecessary and maximum a posteriori (MAP) estimates suffice. EM and variational methods [4] have thus become the sine qua non for inference in these models.

In practice, it is often difficult to determine which inference algorithm will work best since this depends largely on the task, the data, and the model. Therefore, we must often turn to empirical studies to help understand their strengths and weaknesses. Furthermore, for latent variable mixture models, Gibbs and EM possess remarkably different computational properties. For example, collapsed Gibbs sampling is sequential and difficult to parallelize, but has a relatively small memory footprint and is easy to distribute because we only need to communicate imputed values. In contrast, EM is easy to parallelize, but difficult to distribute because the dense conditional puts tremendous pressure on communication bandwidth and memory. However, the two algorithms have complementary strengths.

Table 1: Comparison with other scalable LDA frameworks.

Method         Dataset                                      Infrastructure   Processing speed
YahooLDA [5]   140K vocab, 8.2M docs, 797M tokens (2011)    10 machines      12.87M tokens/s
lightLDA [6]   50K vocab, 1.2B docs, 200B tokens (2014)     24 machines      60M tokens/s
F+LDA [7]      1M vocab, 29M docs, 1.5B tokens (2014)       32 machines      110M tokens/s
SEM            140K vocab, 3B docs, 171B tokens (2015)      8 machines       503M tokens/s

[Plot: samples per second (×10^8, roughly 2 to 14) versus number of machines (5 to 20), with the maximum sampling rate indicated.]

Figure 1: Scalability.


We observe that a relatively esoteric algorithm, stochastic EM (SEM), has the potential to combine the strengths of Gibbs and EM. In particular, SEM replaces the full expectation from the E-step with a sample from it, enabling the subsequent maximization step to operate only on the imputed data and current sufficient statistics. As a result, SEM substantially reduces memory and communication costs (even more than the Gibbs sampler, which must also store the latent variables). Furthermore, since SEM is based on EM, it is embarrassingly parallel. For example, a simple 300-line C++ implementation of SEM for latent Dirichlet allocation (LDA) [8] easily beats the state of the art and is highly scalable, as shown in Table 1 and Figure 1. From a scalable systems perspective SEM has clear computational advantages, but does SEM find good quality solutions to the MAP inference optimization problem in practice?

In this project, we derive Gibbs, EM, and SEM for a somewhat general and representative class of latent variable models. More importantly, we derive the algorithms for a special instance of the class that highlights the computational advantages of SEM over the other two algorithms. Empirically as well as theoretically, we study the performance of these algorithms and find that SEM performs as well as (if not better than) EM and Gibbs across many experimental conditions. Further, we find that SEM may be more robust to poor initialization conditions than either the Gibbs sampler or EM.

2 Proposed Approach

2.1 Latent Variable Exponential Family

Latent variable models are useful when reasoning about partially observed data such as collections of text or images in which each i.i.d. data point is a document or image. Since the same local model is applied to each data point, they have the following form

p(z, x, η) = p(η) \prod_i p(z_i, x_i | η).   (1)

Our goal is to obtain a MAP estimate for the parameters η that explain the data x through the latent variables z. To expose maximum parallelism, we want each cell in the automaton to correspond to a data point and its latent variable. However, this is problematic because in general all latent variables depend on each other via the global parameters η, and a naive approach to updating a single cell would then require examining every other cell in the automaton.

Fortunately, if we further suppose that the complete data likelihood is in the exponential family, i.e., p(z_i, x_i | η) = \exp(\langle T(z_i, x_i), η \rangle − g(η)), then the sufficient statistics are given by T(z, x) = \sum_i T(z_i, x_i), and we can thus express any estimator of interest as a function of just T(z, x), which factorizes over the data. Further, when employing expectation maximization (EM), the M-step is possible in closed form for many members of the exponential family. This allows us to reformulate the cell-level updates to depend only upon the sufficient statistics instead of the neighboring cells. The idea is that, unlike MCMC in general, which produces a sequence of states corresponding to complete variable assignments s^0, s^1, \ldots via a transition kernel q(s^{t+1} | s^t), we can produce a sequence of sufficient statistics T^0, T^1, \ldots directly via an evolution function Φ(T^t) \mapsto T^{t+1}.
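As a concrete illustration of this factorization (ours, not part of the original derivation), consider a K-component Gaussian mixture with indicators z_i ∈ {1, …, K}: each datum contributes to the global statistic only through its own indicator and features,

T(z_i, x_i) = \big( \mathbb{1}[z_i = k],\; \mathbb{1}[z_i = k]\, x_i,\; \mathbb{1}[z_i = k]\, x_i x_i^\top \big)_{k=1}^{K}, \qquad T(z, x) = \sum_{i=1}^{n} T(z_i, x_i),

so re-imputing a single z_i changes only that datum's contribution to the K slots of the statistic.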

2.2 Stochastic EM

Now we describe stochastic EM (SEM). Suppose we want the MAP estimate for η, i.e. we maximize p(x, η) = \int p(z, x, η) µ(dz) over η, and employ expectation maximization (EM):

E-step: Compute in parallel p(z_i | x_i, η^{(t)}).

M-step: Find η^{(t+1)} that maximizes the expected log-likelihood with respect to the conditional:

η^{(t+1)} = \arg\max_η E_{z|x, η^{(t)}}[\log p(z, x, η)] = ξ^{-1}\!\left( \frac{1}{n + n_0} \left( \sum_i E_{z|x, η^{(t)}}[T(z_i, x_i)] + T_0 \right) \right)

where ξ(η) = ∇g(η) is invertible since ∇^2 g(η) ≻ 0, and n_0, T_0 parametrize the conjugate prior. Although EM exposes substantial parallelism, it is difficult to scale, since the dense structure p(z_i | x_i, η^{(t)}) defines values for all possible outcomes of z and thus puts tremendous pressure on memory bandwidth. To overcome this we introduce sparsity by employing stochastic EM (SEM) [9]. SEM introduces an S-step after the E-step that replaces the full distribution with a single sample:

S-step: Sample z_i^{(t)} ∼ p(z_i | x_i; η^{(t)}) in parallel.

Subsequently, we perform the M-step using the imputed data instead of the expectation. This simple modification overcomes the computational drawbacks of EM for cases in which sampling from p(z_i | x_i; η^{(t)}) is feasible. (More details about EM and SEM are provided in the appendix.) We can now employ fast samplers, such as the alias method, exploit sparsity, and reduce CPU-RAM bandwidth while still maintaining massive parallelism. More importantly, the S-step also enables all three steps to be expressed in terms of the current sufficient statistics. This enables distributed and parallel implementations that execute efficiently on modern computational resources offering massive parallelism.
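To make the E/S/M loop concrete, the following single-threaded sketch runs SEM on the simple Gaussian mixture of Section 2.4 (known mixing weights and precision, flat prior on the means). It is our own illustration with our own variable names, not the authors' implementation, and it omits the parallelism and double buffering discussed below.

```cpp
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Illustrative single-threaded SEM for a 1-D Gaussian mixture with known
// weights pi_k and known precision tau; only the means mu_k are estimated.
int main() {
    const int K = 10, N = 100000, iters = 50;
    const double tau = 1.0;
    std::mt19937 gen(0);

    // Synthetic data from well-separated true means (purely for illustration).
    std::vector<double> mu_true(K), x(N);
    for (int k = 0; k < K; ++k) mu_true[k] = 3.0 * k;
    std::uniform_int_distribution<int> pick(0, K - 1);
    std::normal_distribution<double> noise(0.0, 1.0 / std::sqrt(tau));
    for (int i = 0; i < N; ++i) x[i] = mu_true[pick(gen)] + noise(gen);

    std::vector<double> pi(K, 1.0 / K), mu(K);
    std::uniform_int_distribution<int> idx(0, N - 1);
    for (int k = 0; k < K; ++k) mu[k] = x[idx(gen)];  // initialize means at random data points

    std::vector<double> q(K);
    for (int t = 0; t < iters; ++t) {
        // Sufficient statistics: per-component counts and sums of x.
        std::vector<double> cnt(K, 0.0), sum(K, 0.0);
        for (int i = 0; i < N; ++i) {
            // E-step: q_k proportional to pi_k * exp(-tau/2 * (x_i - mu_k)^2).
            // (A production version would work in log space for robustness.)
            for (int k = 0; k < K; ++k)
                q[k] = pi[k] * std::exp(-0.5 * tau * (x[i] - mu[k]) * (x[i] - mu[k]));
            // S-step: impute a single component for this datum.
            std::discrete_distribution<int> cat(q.begin(), q.end());
            int z = cat(gen);
            cnt[z] += 1.0;
            sum[z] += x[i];
        }
        // M-step: closed-form update from the imputed sufficient statistics.
        for (int k = 0; k < K; ++k)
            if (cnt[k] > 0.0) mu[k] = sum[k] / cnt[k];
    }
    for (int k = 0; k < K; ++k) std::cout << "mu[" << k << "] = " << mu[k] << "\n";
    return 0;
}
```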

2.3 Implementation

Our implementation has two copies of the data structure containing sufficient statistics, T^{(0)} and T^{(1)}. We do not compute the values T(z, x) from scratch but maintain their sum as we impute values of the cells/latent variables. During iteration 2t of the evolution function, we apply Φ by reading from T^{(0)} and incrementing T^{(1)} as we sample the latent variables (Figure 1). Then, in the next iteration 2t + 1, we reverse the roles of the data structures, i.e. read from T^{(1)} and increment T^{(0)}. See Algorithm 1 and Figure 2.

Algorithm 1 SEM for LVM

1: Randomly initialize each cell
2: for t = 0 → num_iterations do
3:   for all cells z independently in parallel do
4:     Read sufficient statistics from T^{(t mod 2)}
5:     Compute stochastic updates using p_z(k|s)
6:     Write sufficient statistics to T^{((t+1) mod 2)}
7:   end for
8: end for

(a) Phase 1   (b) Phase 2
Figure 2: Efficient (re)use of buffers.

Use of such read/write buffers offers a virtually lock-free (assuming atomic increments) implementation scheme for SEM and is analogous to double-buffering in computer graphics. Although there is a synchronization barrier after each round, its effect is mitigated because each cell's update depends only upon the sufficient statistics, and thus every cell does the same amount of work. Therefore, evenly balancing the workload across computation nodes is trivial, even for a heterogeneous cluster.
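A sketch of this scheme follows (hypothetical names and structure, ours rather than the authors' code): two copies of the sufficient statistics are kept, the read copy stays immutable during a sweep, the write copy is rebuilt with atomic increments, and the roles swap at the barrier.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Illustrative double-buffered SEM sweep. Iteration t reads T[t % 2] and
// atomically increments T[(t + 1) % 2], so readers and writers never collide.
void sem_sweeps(int K, int num_cells, int iters,
                const std::function<int(int, const std::atomic<int64_t>*)>& sample_cell) {
    std::vector<std::atomic<int64_t>> buf0(K), buf1(K);
    std::atomic<int64_t>* T[2] = {buf0.data(), buf1.data()};
    for (int b = 0; b < 2; ++b)
        for (int k = 0; k < K; ++k) T[b][k].store(0);

    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    for (int t = 0; t < iters; ++t) {
        const std::atomic<int64_t>* read = T[t % 2];
        std::atomic<int64_t>* write = T[(t + 1) % 2];
        for (int k = 0; k < K; ++k) write[k].store(0);   // write buffer is rebuilt each sweep

        std::vector<std::thread> pool;
        for (unsigned w = 0; w < nthreads; ++w)
            pool.emplace_back([&, w] {
                // Each thread handles a strided subset of cells; per-cell work is
                // identical because it depends only on the read-only statistics.
                for (int c = (int)w; c < num_cells; c += (int)nthreads) {
                    int k = sample_cell(c, read);                      // stochastic update
                    write[k].fetch_add(1, std::memory_order_relaxed);  // lock-free increment
                }
            });
        for (auto& th : pool) th.join();   // barrier; buffers swap roles next sweep
    }
}
```

The caller supplies sample_cell, which implements the E/S-step for one cell using only the read buffer.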

2.4 Intuition for why SEM works

In this section we present a pedagogical example that highlights the differences between the three algorithms (Gibbs, EM, SEM) and provides intuition for the robustness of SEM. We choose an example that is both simple and representative of the general class. Simplicity is important because it makes it much easier to see the computational differences. In particular, we choose a mixture of Gaussians in which the mixing coefficient π_0 is known, the precision τ of each Gaussian is known, its priors (µ_0, κ_0 τ) are known, and inference must infer the means µ of the Gaussians. That is, each datum x_i has an associated latent variable z_i that determines the Gaussian from which it is drawn. The sufficient statistics comprise just a single array of size K (one element per component). More formally, we have the following generative model:

µ_k ∼ N(µ_0, κ_0 τ)
z_i ∼ Categorical(π_0)
x_i ∼ N(µ_{z_i}, τ)

For this simple example, we take K = 10 clusters and n = 100,000 training examples generated synthetically to remove any error caused by an incorrect model choice:

µ_0 = 0,  κ_0 = 0.1,  τ = 1,  π ∼ Dirichlet(1).
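For reference, synthetic data of this form can be generated along the following lines (our sketch; we read the second argument of N(·, ·) as a precision, so the means are drawn with standard deviation 1/√(κ_0 τ) and the data with 1/√τ):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch of a synthetic-data generator for the K = 10, n = 100,000 experiment.
// Assumption: the second argument of N(., .) is a precision.
struct Synthetic { std::vector<double> mu, pi, x; std::vector<int> z; };

Synthetic generate(int K = 10, int n = 100000, double mu0 = 0.0,
                   double kappa0 = 0.1, double tau = 1.0, unsigned seed = 0) {
    std::mt19937 gen(seed);
    Synthetic s;

    // mu_k ~ N(mu0, kappa0 * tau) (precision parameterization).
    std::normal_distribution<double> mu_prior(mu0, 1.0 / std::sqrt(kappa0 * tau));
    for (int k = 0; k < K; ++k) s.mu.push_back(mu_prior(gen));

    // pi ~ Dirichlet(1): normalize K independent Exponential(1) = Gamma(1,1) draws.
    std::exponential_distribution<double> expo(1.0);
    double total = 0.0;
    for (int k = 0; k < K; ++k) { s.pi.push_back(expo(gen)); total += s.pi.back(); }
    for (double& p : s.pi) p /= total;

    // z_i ~ Categorical(pi), x_i ~ N(mu_{z_i}, tau).
    std::discrete_distribution<int> cat(s.pi.begin(), s.pi.end());
    std::normal_distribution<double> noise(0.0, 1.0 / std::sqrt(tau));
    for (int i = 0; i < n; ++i) {
        int zi = cat(gen);
        s.z.push_back(zi);
        s.x.push_back(s.mu[zi] + noise(gen));
    }
    return s;
}
```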

We compare Gibbs, EM and SEM (T = 50 iterations) under several experimental conditions. Further, we include an ad-hoc method in which the first 15 iterations are EM and the rest are collapsed Gibbs sampling (EM+Gibbs). We repeat each condition 10 times using random initialization and report the average log-likelihood on a held-out test set of 10,000 instances. First, as a control, in Figure 2a we run all three algorithms on a simple one-dimensional model. As expected, they perform equally well.

Next, we study the robustness of each algorithm to initialization in a two-dimensional model. For this, we choose two extreme initialization conditions under which the algorithms are likely to get stuck in poor local optima. In the first case we initialize to a low entropy configuration (Figure 2b), and in the second case to a high entropy configuration (Figure 2c).


[Plots: test log-likelihood versus iteration (0 to 60) for Truth, EM, SEM, Gibbs, and EM+Gibbs.]
(a) Simple 1D case   (b) Initialization to single component   (c) Uniformly random initialization

Figure 2: Simple Cases

The former causes trouble for EM because it is deterministic; in contrast, the stochasticity in Gibbs and SEM allows them to be robust to this condition. The latter causes trouble for Gibbs because the conditional distribution has high entropy, forcing the sampler to rely on sheer luck in the early iterations. In summary, we find SEM is more robust than either algorithm since it works well in both extremes.

2.5 Theory

It is desirable to study the theoretical behavior of SEM and understand why it works so well in practice (Sec. 3). Guarantees on the performance of SEM would make it the algorithm of choice for scalable LVMs. We first describe some connections between SEM and stochastic gradient descent (SGD), inspired by the connection between EM and gradient descent. We next briefly discuss related literature helpful for a theoretical understanding of the convergence properties of SEM. However, note that we could not obtain concrete guarantees for the performance of SEM; this remains future work.

2.5.1 Connections between SEM and SGD

We can view SEM as implicit SGD on the MAP objective. This connection alludes to the convergence rate of SEM. To illustrate this, we consider latent Dirichlet allocation (LDA), a well-known LVM used for topic modelling. For simplicity, we consider only the topic mixture proportions (θ). As pointed out in [10, 11], one EM step is:

θ_m^{+} = θ_m + M \frac{\partial \log p}{\partial θ_m}

which is gradient descent with a Frank-Wolfe-type update and line search. Similarly, for SEM, one step is

θ_{mk}^{+} = \frac{D_{mk}}{N_m} = \frac{1}{N_m} \sum_{n=1}^{N_m} δ(z_{mn} = k)

Again vectorizing and rewriting as before:

θ_m^{+} = θ_m + M g

where M = \frac{1}{N_m}\left[\mathrm{diag}(θ_m) − θ_m θ_m^{\top}\right] and the k-th component of g is g_k = \frac{1}{θ_{mk}} \sum_{n=1}^{N_m} δ(z_{mn} = k). The vector g can be shown to be an unbiased noisy estimate of the gradient, i.e.

E[g_k] = \frac{1}{θ_{mk}} \sum_{n=1}^{N_m} E[δ(z_{mn} = k)] = \frac{\partial \log p}{\partial θ_{mk}}

Thus, a single step of SEM is equivalent to a single step of SGD. Consequently, we could further embrace the connection to SGD and use a subset of the data for the S and M steps, similar to incremental EM [12]. Note that in the limit in which batches comprise just a single token, the algorithm emulates a collapsed Gibbs sampler. This interpretation strengthens the theoretical justification for many existing approximate Gibbs sampling approaches. More details are in Appendix E.

2.5.2 Understanding Convergence

We now address the critical question of how the invariant measure of SEM for the model presented in Section 2.1 is related to the true MAP estimates. First, note that stochastic cellular automata (SCA) are ergodic [13], a result that immediately applies if we ignore the deterministic components of our automata (corresponding to the observations). Now that we have established ergodicity, we next study the properties of the stationary distribution and find that the modes correspond to MAP estimates.

We make a few mild assumptions about the model:

• The observed data Fisher information is non-singular, i.e. I(η) ≻ 0.
• The Fisher information for z|x is non-singular and a central limit theorem / law of large numbers holds, i.e. E_{η_0}[I_Z(η_0)] ≻ 0 and

  \sup_η \left| \frac{1}{n} \sum_{i=1}^{n} I_{z_i}(η) − E_{η_0}[I_X(η)] \right| → 0 \quad \text{as } n → ∞.

• We assume that \frac{1}{n} \sum_{i=1}^{n} ∇_η \log p(x_i; η) = 0 has at least one solution; let \bar{η} be a solution.

These assumptions are reasonable. For example, in the case of mixture models (or topic models), they just mean that every component must be exhibited at least once and all components are unique. The details of this case are worked out in Appendix F. Also, when the number of parameters grows with the data, e.g., for topic models, the second assumption still holds. In this case, we resort to the corresponding result from high dimensional statistics by replacing the law of large numbers with Donsker's theorem, and everything else falls into place.

Consequently, we show SEM converges weakly to a distribution with mean equal to some root of the score function (∇_η \log p(x_i; η)), and thus a MAP fixed point, by borrowing results known for SEM [14]. In particular, we have:

Theorem 1 Let the assumptions stated above hold and let \hat{η} be the estimate from SEM. Then as the number of i.i.d. data points goes to infinity, i.e. n → ∞, we have

\sqrt{n}(\hat{η} − \bar{η}) \xrightarrow{D} N\!\left(0,\; I(η_0)^{-1}\left[I − F(η_0)\right]^{-1}\right)   (2)

where F(η_0) = E_{η_0}[I_X(η_0)]\left(I(η_0) + E_{η_0}[I_X(η_0)]\right)^{-1}.

This result implies that SEM flocks around a stationary point under very reasonable assumptions while retaining tremendous computational benefits. Also, for such complicated models, reaching a stationary point is the best that most methods achieve anyway. We now switch gears to adapt SEM for LDA and perform some simple experimental evaluations.

3 Large Scale Application

To evaluate the strengths and weaknesses of SEM for real world applications rather than toy examples, we compare against parallel and distributed implementations of a collapsed Gibbs sampler (CGS) and a variational inference method (CVB0) for LDA. We also compare our results to performance numbers reported in the literature, including those of F+LDA and lightLDA. We choose not to evaluate directly against highly tuned, optimized, and proprietary systems such as lightLDA for which no public code is available, as it would not be fair to implement them ourselves. Thus, we only take the best numbers reported by them.

3.1 SEM for LDA

Topic modeling, and latent Dirichlet allocation (LDA) [8] in particular, has become a must-have of analytics platforms and consequently needs to scale to larger and larger datasets. In LDA, we model each document m of a corpus of M documents as a distribution θ_m that represents a mixture of topics. There are K such topics, and we model each topic k as a distribution φ_k over the vocabulary of words that appear in our corpus. Each document m contains N_m words w_mn from a vocabulary of size V, and we associate a latent variable z_mn with each of the words. The latent variables can take one of K values that indicate which topic the word belongs to. Both distributions θ_m and φ_k have a Dirichlet prior, parameterized respectively with constants α and β. See Appendix D for more details.

3.2 Existing systems

Many of the scalable systems for topic modeling are based on one of two core inference methods: the collapsed Gibbs sampler (CGS) [15], and variational inference (VI) [8] and approximations thereof [16]. To scale LDA to large datasets, or for efficiency reasons, we may need to distribute and parallelize them. Both algorithms can be further approximated to meet such implementation requirements.


Collapsed Gibbs Sampling   In collapsed Gibbs sampling the full conditional distribution of a latent topic indicator given all the others is

p(z_{mn} = k | z^{¬mn}, w) ∝ (D_{mk} + α) \frac{W_{k w_{mn}} + β}{T_k + β V}   (3)

where D_mk is the number of latent variables in document m that equal k, W_kv is the number of latent variables equal to k and whose corresponding word equals v, and T_k is the number of latent variables that equal k, all excluding the current z_mn.

CGS is a sequential algorithm in which we draw latent variables in turn, and repeat the process for several iterations. The algorithm performs well statistically, and has further benefited from breakthroughs that lead to a reduction of the sampling complexity [17, 18]. This algorithm can be approximated to enable distribution and parallelism, primarily in two ways. One is to partition the data, perform one sampling pass and then assimilate the sampler states, thus yielding an approximate distributed version of CGS (AD-LDA) [19]. Another way is to partition the data and allow each sampler to communicate with distributed central storage continuously. Here, each sampler sends the differential to the global state-keeper and receives from it the latest global value. A very scalable system built on this principle and leveraging the inherent sparsity of LDA is YahooLDA [20]. Further improvements and sampling using an alias table were incorporated in lightLDA [6]. Contemporaneously, a nomadic distribution scheme and sampling using a Fenwick tree were proposed in F+LDA [7].

Variational Inference   In variational inference (VI), we seek to optimize the parameters of an approximate distribution that assumes independence of the latent variables, so as to find a member of the family that is close to the true posterior. Typically, for LDA, the document-topic proportions and topic indicators are latent variables and the topics are parameters. Then, coordinate ascent alternates between them.

One way to scale VI is stochastic variational inference (SVI), which employs SGD by repeatedly updating the topics via randomly chosen document subsets [21]. Adding a Gibbs step to SVI introduces sparsity for additional efficiency [22]. In some ways this is analogous to our S-step, but in the context of variational inference, the conditional is much more expensive to compute, requiring several rounds of sampling.

Another approach, CVB0, achieves scalability by approximating the collapsed posterior [23]. Here, one minimizes the free energy of the approximate distribution for a given parameter γ_mnk and then uses the zero-order Taylor expansion [16].

γ_{mnk} ∝ (D_{mk} + α) × \frac{W_{k w_{mn}} + β}{T_k + β V}   (4)

where D_mk is the fractional contribution of latent variables in document m for topic k, W_kv is the contribution of latent variables for topic k whose corresponding word equals v, and T_k is the contribution of latent variables for topic k. Inference updates the variational parameters until convergence. It is possible to distribute and parallelize CVB0 over tokens [16]. VI and CVB0 are the core algorithms behind several scalable topic modeling systems including Mr.LDA [24] and the Apache Spark machine-learning suite.

Remark   It is worth noticing that Gibbs sampling and variational inference, despite being justified very differently, have at their core the very same formulas (shown boxed in (3) and (4)). Each of them literally decides how important some topic k is to the word v appearing in document m by asking: "How many times does topic k occur in document m?", "How many times is word v associated with topic k?", and "How prominent is topic k overall?". It is reassuring that behind all the beautiful mathematics, something simple and intuitive is happening. As we see next, SEM addresses the same questions via analogous formulas.

3.3 An SEM Algorithm for LDA

To reiterate, the point of using such a method for LDA is that the parallel update dynamics of SEM give us an algorithm that is simple to parallelize, distribute, and scale. In the next section, we will evaluate how it works in practice. For now, let us explain how we design our SCA to analyze data.

We begin by writing the stochastic EM steps for LDA (derivation is in Appendix D):

E-step: independently in parallel compute the conditional distribution locally:

q_{mnk} = \frac{θ_{mk} φ_{k w_{mn}}}{\sum_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}}}   (5)

S-step: independently in parallel draw z_{mn} from the categorical distribution:

z_{mn} ∼ Categorical(q_{mn1}, \ldots, q_{mnK})   (6)

M-step: independently in parallel compute the new parameter estimates:

θ_{mk} = \frac{D_{mk} + α − 1}{N_m + Kα − K}, \qquad φ_{kv} = \frac{W_{kv} + β − 1}{T_k + Vβ − V}   (7)
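A minimal, dense, single-threaded sketch of one such sweep is given below. It is our illustration of (5)-(7) with hypothetical data structures, not the 300-line production code; it omits double buffering and sparsity, and, like (7), it assumes α, β > 1 so the estimates stay non-negative (the cell-level rule introduced later drops the −1 offsets to allow α, β > 0).

```cpp
#include <random>
#include <vector>

// Illustrative single sweep of SEM for LDA (hypothetical names, dense counts).
// docs[m][n] is the vocabulary id of the n-th token of document m; z[m][n] is its topic.
// D[m][k], W[k][v], T[k] are the count statistics used by the M-step in (7).
void sem_sweep_lda(const std::vector<std::vector<int>>& docs,
                   std::vector<std::vector<int>>& z,
                   int K, int V, double alpha, double beta, std::mt19937& gen) {
    const int M = (int)docs.size();
    std::vector<std::vector<double>> theta(M, std::vector<double>(K)),
                                     phi(K, std::vector<double>(V));
    std::vector<std::vector<long>> D(M, std::vector<long>(K, 0)),
                                   W(K, std::vector<long>(V, 0));
    std::vector<long> T(K, 0);

    // Accumulate counts from the topic assignments imputed in the previous S-step.
    for (int m = 0; m < M; ++m)
        for (size_t n = 0; n < docs[m].size(); ++n) {
            int k = z[m][n], v = docs[m][n];
            ++D[m][k]; ++W[k][v]; ++T[k];
        }

    // M-step (7): closed-form parameter estimates from the counts (needs alpha, beta > 1).
    for (int m = 0; m < M; ++m) {
        double Nm = (double)docs[m].size();
        for (int k = 0; k < K; ++k)
            theta[m][k] = (D[m][k] + alpha - 1.0) / (Nm + K * alpha - K);
    }
    for (int k = 0; k < K; ++k)
        for (int v = 0; v < V; ++v)
            phi[k][v] = (W[k][v] + beta - 1.0) / (T[k] + V * beta - V);

    // E-step (5) + S-step (6): resample every token's topic in place.
    std::vector<double> q(K);
    for (int m = 0; m < M; ++m)
        for (size_t n = 0; n < docs[m].size(); ++n) {
            int v = docs[m][n];
            for (int k = 0; k < K; ++k) q[k] = theta[m][k] * phi[k][v];
            std::discrete_distribution<int> cat(q.begin(), q.end());
            z[m][n] = cat(gen);
        }
}
```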

We simulate these inference steps in SEM, which is a dynamical system with evolution function Φ : S → S over the state space S. For LDA, the state space S is

S = Z → K × M × V   (8)

where Z is the set of cell identifiers (one per token in our corpus), K is a set of K topics, M is a set of M document identifiers, and V is a set of V identifiers for the vocabulary words.

The initial state s^0 is the map defined as follows: for every occurrence of the word v in document m, we associate a cell z with the triple (k_z, m, v), where k_z is chosen uniformly at random from K and independently of k_{z'} for all z' ≠ z. This gives us

s^0 = z ↦ (k_z, m, v)   (9)

We now need to describe the evolution function Φ. First, assuming that we have a state s and a cell z, we define the following distribution:

p_z(k | s) ∝ (D_{mk} + α) × \frac{W_{kv} + β}{T_k + β V}   (10)

where D_{mk} = |{ z | ∃v. s(z) = (k, m, v) }|, W_{kv} = |{ z | ∃m. s(z) = (k, m, v) }|, and T_k = |{ z | ∃m. ∃v. s(z) = (k, m, v) }|. Note that we have chosen our local update rule slightly differently, without the offset of −1 on the counts corresponding to the mode of the Dirichlet distributions, which would require α, β > 1. Instead, our local update rule allows the relaxed requirement α, β > 0, which is more common for LDA inference algorithms.

Assuming that s(z) = (k, m, v) and that k' is a sample from p_z (hence the name "stochastic" cellular automaton), we define the local update function as:

φ(s, z) = (k', m, v)  where s(z) = (k, m, v) and k' ∼ p_z(· | s)   (11)

That is, the document and word of the cell remain unchanged, but we choose a new topic according to the distribution p_z induced by the state. We obtain the evolution function of the stochastic cellular automaton by applying the function φ uniformly on every cell.

Φ(s) = z ↦ φ(s, z)   (12)

Finally, the SCA algorithm simulates the evolution function Φ starting with s^0. Of course, since LDA's complete data likelihood is in the exponential family, we never have to represent the states explicitly, and instead employ the sufficient statistics.

Our implementation has two copies of the count matrices D^i, W^i, and T^i for i = 0 or 1 (as in CGS or CVB0, we do not compute the values D_mk, W_kv, and T_k from scratch but keep track of the counts as we assign topics to the cells/latent variables). During iteration i of the evolution function, we apply Φ by reading D^{i mod 2}, W^{i mod 2}, and T^{i mod 2} and incrementing D^{(i+1) mod 2}, W^{(i+1) mod 2}, and T^{(i+1) mod 2} as we assign topics.

3.4 Experimental Results

Software & hardware   All three algorithms are implemented in simple C++11. We implement multithreaded parallelization within a node using the work-stealing Fork/Join framework, and distribution across multiple nodes by binding one process to each socket over MPI. We also implemented a version of SEM with a sparse representation for the array D of counts of topics per document, and Vose's alias method to draw from discrete distributions. We run our experiments on a small cluster of 4 nodes connected through 10 Gb/s Ethernet. Each node has two 9-core Intel Xeon E5 processors for a total of 36 hardware threads per node. For random number generation we employ Intel Digital Random Number Generators through the RDRAND instruction, which uses thermal noise within the silicon to output a random stream of bits at 3 Gbit/s, producing true random numbers.
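For completeness, one textbook construction of such an alias table (Vose's method) looks as follows; this is our own sketch, not the code used in the experiments.

```cpp
#include <random>
#include <vector>

// Vose's alias method: O(K) table construction, O(1) sampling from a discrete
// distribution with K outcomes. Sketch for illustration only.
struct AliasTable {
    std::vector<double> prob;   // scaled probabilities per bucket
    std::vector<int> alias;     // alias outcome for each bucket

    explicit AliasTable(const std::vector<double>& w) {
        const int K = (int)w.size();
        prob.assign(K, 0.0);
        alias.assign(K, 0);
        double total = 0.0;
        for (double x : w) total += x;

        std::vector<double> scaled(K);
        std::vector<int> small, large;
        for (int k = 0; k < K; ++k) {
            scaled[k] = w[k] * K / total;
            (scaled[k] < 1.0 ? small : large).push_back(k);
        }
        // Pair each under-full bucket with an over-full one.
        while (!small.empty() && !large.empty()) {
            int s = small.back(); small.pop_back();
            int l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] = (scaled[l] + scaled[s]) - 1.0;
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (int s : small) prob[s] = 1.0;   // numerical leftovers
        for (int l : large) prob[l] = 1.0;
    }

    // Draw one sample: pick a bucket uniformly, then flip a biased coin.
    int sample(std::mt19937& gen) const {
        std::uniform_int_distribution<int> bucket(0, (int)prob.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        int k = bucket(gen);
        return coin(gen) < prob[k] ? k : alias[k];
    }
};
```

Building the table costs O(K); each subsequent draw costs O(1), which is what makes per-token sampling cheap when K is large.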


Table 2: Some statistics of the datasets used in experiment

Dataset       V          M            Tokens
PubMed        141,043    8,200,000    737,869,085
Wikipedia     210,233    6,631,176    1,133,050,514
Proprietary   ~140,000   ~3 billion   ~171 billion

[Plots: per-word log-likelihood for SCA, CGS, and CVB0.]
(a) PubMed, K = 1000, α = 0.05, β = 0.1 (vs. iteration)   (b) Wikipedia, K = 1000, α = 0.05, β = 0.1 (vs. iteration)
(c) PubMed, K = 1000, α = 0.05, β = 0.1 (vs. time [min])   (d) Wikipedia, K = 1000, α = 0.05, β = 0.1 (vs. time [min])

Figure 3: Evolution of log likelihood on Wikipedia and Pubmed over number of iterations and time.

Datasets   We experiment on two public datasets, both of which are cleaned by removing stop words and rare words: PubMed abstracts and English Wikipedia. We also run on a third, proprietary dataset. Details are presented in Table 2.

Evaluation   To evaluate the proposed method we use predictive power as a metric, calculating the per-word log-likelihood (equivalent to the negative log of perplexity) on 10,000 held-out documents conditioned on the trained model. We set K = 1000 to demonstrate performance for a large number of topics. The hyperparameters are set as α = 50/K and β = 0.1 as suggested in [25]; other systems such as YahooLDA and Mallet also use this as the default parameter setting. The results are presented in Figure 3.
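For reference, the quantity being reported corresponds to the usual per-word log-likelihood (we state only the generic formula; the exact protocol for fitting the held-out document proportions is not restated here):

\frac{1}{\sum_{m} N_m} \sum_{m}\sum_{n=1}^{N_m} \log \sum_{k=1}^{K} \hat{θ}_{mk}\, \hat{φ}_{k w_{mn}},

where \hat{θ} and \hat{φ} denote the estimated document-topic and topic-word distributions.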

Finally, for the large dataset, our implementation (only 300 lines of C++) processes 503 million tokens per second (tps) on our modest 8-node cluster. In comparison, some of the best existing systems achieve 112 million tps (F+LDA, personal communication) and 60 million tps (lightLDA) [6].

4 Conclusion

We have described a novel inference method for latent variable models that is a stochastic version of the expectation-maximization algorithm. The equilibria of the dynamics are MAP fixed points, and the algorithm has many desirable computational properties: it is embarrassingly parallel, memory efficient, and, like HOGWILD!, virtually lock-free. Further, for many models, it enables the use of approximate counters and the alias method. Thus, we were able to achieve an order of magnitude speed-up over the current state-of-the-art inference algorithms for LDA with accuracy comparable to collapsed Gibbs sampling.

In general, we canno’t always guarantee the correct invariant measure [26], and found that parallelizing improperly causesconvergence to incorrect MAP fixed points. Even so, SEM is used for simulating Ising models in statistical physics [27].

4.1 Advantages of SEM for LDA

The positive consequences of SEM as a choice for inference on LDA are many:

• Our memory footprint is minimal since we only store the data and sufficient statistics. In contrast to MCMC methods, we do not store the assignments to latent variables z. In contrast to variational methods, we do not store the variational parameters γ. Further, variational methods require K memory accesses (one for each topic) per word. In contrast, the S-step ensures we only have a single access (for the sampled topic) per word. Such reduced pressure on the memory bandwidth can improve performance significantly for highly parallel applications.

• We can further reduce the memory footprint by compressing the sufficient statistics with approximate counters [28, 29] (a minimal sketch of such a counter follows this list). This is possible because updating the sufficient statistics only requires increments, as in Mean-for-Mode [30]. In contrast, CGS decrements counts, preventing the use of approximate counters.

• Our implementation is lock-free (in that it does not use locks, but assumes atomic increments) because the double buffering ensures we never read from and write to the same data structures. There is less synchronization, which at scale is significant.
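As an illustration of the approximate-counter idea from [28] (our sketch, not the counters used in the system), a Morris-style counter stores only an exponent and increments it probabilistically, so a single byte can summarize counts far beyond 255:

```cpp
#include <cmath>
#include <cstdint>
#include <random>

// Morris-style approximate counter (sketch). Stores an exponent c and interprets
// the count as roughly 2^c - 1; each increment succeeds with probability 2^{-c}.
// Only increments are needed, which is why SEM (unlike CGS) can use such counters.
struct MorrisCounter {
    uint8_t c = 0;

    void increment(std::mt19937& gen) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (u(gen) < std::ldexp(1.0, -(int)c))   // success probability 2^{-c}
            ++c;
    }

    double estimate() const {
        return std::ldexp(1.0, (int)c) - 1.0;    // unbiased estimate 2^c - 1
    }
};
```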


References

[1] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, Sep. 1951.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[3] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.
[4] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Mach. Learn., vol. 37, no. 2, pp. 183–233, Nov. 1999.
[5] A. Smola and S. Narayanamurthy, "An architecture for parallel topic models," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 703–710, Sep. 2010. [Online]. Available: http://dx.doi.org/10.14778/1920841.1920931
[6] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma, "LightLDA: Big topic models on modest computer clusters," in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1351–1361.
[7] H.-F. Yu, C.-J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon, "A scalable asynchronous distributed algorithm for topic modeling," in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1340–1350.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003.
[9] G. Celeux and J. Diebolt, "The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem," Computational Statistics Quarterly, vol. 2, no. 1, pp. 73–82, 1985.
[10] L. Xu and M. I. Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures," Neural Computation, vol. 8, no. 1, pp. 129–151, 1996.
[11] R. Salakhutdinov, S. Roweis, and Z. Ghahramani, "Relationship between gradient and EM steps in latent variable models."
[12] R. M. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models. Springer, 1998, pp. 355–368.
[13] P.-Y. Louis, "Automates cellulaires probabilistes : mesures stationnaires, mesures de Gibbs associées et ergodicité," Ph.D. dissertation, Université des Sciences et Technologies de Lille and il Politecnico di Milano, September 2002.
[14] S. F. Nielsen, "The stochastic EM algorithm: estimation and asymptotic results," Bernoulli, pp. 457–489, 2000.
[15] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proc. National Academy of Sciences of the United States of America, vol. 101, no. suppl 1, pp. 5228–5235, 2004.
[16] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, "On smoothing and inference for topic models," in Proc. Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, ser. UAI '09. Arlington, Virginia, USA: AUAI Press, 2009, pp. 27–34.
[17] L. Yao, D. Mimno, and A. McCallum, "Efficient methods for topic model inference on streaming document collections," in Proc. 15th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, ser. KDD '09. New York: ACM, 2009, pp. 937–946.
[18] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola, "Reducing the sampling complexity of topic models," in 20th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, 2014.
[19] D. Newman, A. Asuncion, P. Smyth, and M. Welling, "Distributed algorithms for topic models," J. Machine Learning Research, vol. 10, pp. 1801–1828, Dec. 2009, http://dl.acm.org/citation.cfm?id=1577069.1755845.
[20] A. Smola and S. Narayanamurthy, "An architecture for parallel topic models," Proc. VLDB Endowment, vol. 3, no. 1-2, pp. 703–710, Sep. 2010.
[21] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, "Stochastic variational inference," Journal of Machine Learning Research, vol. 14, pp. 1303–1347, May 2013.
[22] D. Mimno, M. Hoffman, and D. Blei, "Sparse stochastic inference for latent Dirichlet allocation," in Proceedings of the 29th International Conference on Machine Learning (ICML-12), ser. ICML '12, J. Langford and J. Pineau, Eds. New York, NY, USA: Omnipress, July 2012, pp. 1599–1606.
[23] Y. W. Teh, D. Newman, and M. Welling, "A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation," in Advances in Neural Information Processing Systems 19, ser. NIPS 2006. MIT Press, 2007, pp. 1353–1360.
[24] K. Zhai, J. Boyd-Graber, N. Asadi, and M. L. Alkhouja, "Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce," in Proceedings of the 21st International Conference on World Wide Web. ACM, 2012, pp. 879–888.
[25] T. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.
[26] D. A. Dawson, "Synchronous and asynchronous reversible Markov systems," Canadian Mathematical Bulletin, vol. 17, pp. 633–649, 1974.
[27] G. Y. Vichniac, "Simulating physics with cellular automata," Physica D: Nonlinear Phenomena, vol. 10, no. 1-2, pp. 96–116, Jan. 1984.
[28] R. Morris, "Counting large numbers of events in small registers," Commun. ACM, vol. 21, no. 10, pp. 840–842, Oct. 1978.
[29] M. Csűrös, "Approximate counting with a floating-point counter," in Computing and Combinatorics (COCOON 2010), ser. Lecture Notes in Computer Science, no. 6196, M. T. Thai and S. Sahni, Eds. Springer Berlin Heidelberg, 2010, pp. 358–367; see also http://arxiv.org/pdf/0904.3062.pdf.
[30] J.-B. Tristan, J. Tassarotti, and G. L. Steele Jr., "Efficient training of LDA on a GPU by Mean-for-Mode Gibbs sampling," in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), JMLR Workshop and Conference Proceedings, vol. 37, 2015.


A (Stochastic) EM in General

Expectation-Maximization (EM) is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of the parameters in statistical models when the data is only partially observed, or when the model depends on unobserved latent variables. This section is inspired by http://www.ece.iastate.edu/~namrata/EE527 Spring08/emlecture.pdf

We derive the EM algorithm for a very general class of models. Let us define all the quantities of interest.

Table 3: Notation

Symbol                   Meaning
x                        Observed data
z                        Unobserved data
(x, z)                   Complete data
f_{X;η}(x; η)            Marginal observed-data density
f_{Z;η}(z; η)            Marginal unobserved-data density
f_{X,Z;η}(x, z; η)       Complete-data density/likelihood
f_{Z|X;η}(z|x; η)        Conditional unobserved-data (missing-data) density

Objective: To maximize the marginal log-likelihood or posterior, i.e.

L(η) = \log f_{X;η}(x; η).   (13)

Assumptions:

1. The z_i are independent given η, so

   f_{Z;η}(z; η) = \prod_{i=1}^{N} f_{Z_i;η}(z_i; η),   (14)

2. The x_i are independent given the missing data z_i and η, so

   f_{X,Z;η}(x, z; η) = \prod_{i=1}^{N} f_{X_i,Z_i;η}(x_i, z_i; η).   (15)

As a consequence we obtain:

   f_{Z|X;η}(z|x; η) = \prod_{i=1}^{N} f_{Z_i|X_i;η}(z_i|x_i; η).   (16)

Now,

L(η) = \log f_{X;η}(x; η) = \log f_{X,Z;η}(x, z; η) − \log f_{Z|X;η}(z|x; η)   (17)

or, summing across observations,

L(η) = \sum_{i=1}^{N} \log f_{X_i;η}(x_i; η) = \sum_{i=1}^{N} \log f_{X_i,Z_i;η}(x_i, z_i; η) − \sum_{i=1}^{N} \log f_{Z_i|X_i;η}(z_i|x_i; η).   (18)

Let us take the expectation of the above expression with respect to f_{Z_i|X_i;η}(z_i|x_i; η_p), where we choose η = η_p:

\sum_{i=1}^{N} E_{Z_i|X_i;η}[\log f_{X_i;η}(x_i; η) | x_i; η_p]
= \sum_{i=1}^{N} E_{Z_i|X_i;η}[\log f_{X_i,Z_i;η}(x_i, z_i; η) | x_i; η_p] − \sum_{i=1}^{N} E_{Z_i|X_i;η}[\log f_{Z_i|X_i;η}(z_i|x_i; η) | x_i; η_p]   (19)

Since L(η) = log fX;η(x; η) does not depend on z, it is invariant for this expectation. So we recover:

L(η) = \sum_{i=1}^{N} E_{Z_i|X_i;η}[\log f_{X_i,Z_i;η}(x_i, z_i; η) | x_i; η_p] − \sum_{i=1}^{N} E_{Z_i|X_i;η}[\log f_{Z_i|X_i;η}(z_i|x_i; η) | x_i; η_p]
= Q(η|η_p) − H(η|η_p).   (20)


Now, (20) may be written as

Q(η|η_p) = L(η) + \underbrace{H(η|η_p)}_{\leq H(η_p|η_p)}   (21)

Here, observe that H(η|η_p) is maximized (with respect to η) by η = η_p, i.e.

H(η|η_p) ≤ H(η_p|η_p)   (22)

A simple proof follows from Jensen's inequality.

As our objective is to maximize L(η) with respect to η, if we maximize Q(η|η_p) with respect to η, it will force L(η) to increase. This is what is done repeatedly in EM. To summarize, we have:

E-step: Compute f_{Z_i|X_i;η}(z_i|x_i; η_p) using the current estimate η = η_p.

M-step: Maximize Q(η|η_p) to obtain the next estimate η_{p+1}.

Now assume that the complete data likelihood belongs to the exponential family, i.e.

f_{X_i,Z_i;η}(x_i, z_i; η) = \exp\{\langle T(z_i, x_i), η \rangle − g(η)\}   (23)

then

Q(η|η_p) = \sum_{i=1}^{N} E_{Z_i|X_i;η}[\log f_{X_i,Z_i;η}(x_i, z_i; η) | x_i; η_p]
= \sum_{i=1}^{N} E_{Z_i|X_i;η}[\langle T(z_i, x_i), η \rangle − g(η) | x_i; η_p]   (24)

To find the maximizer, differentiate and set it to zero:

\frac{1}{N} \sum_{i} E_{Z_i|X_i;η}[T(z_i, x_i) | x_i; η_p] = \frac{dg(η)}{dη}   (25)

and one can obtain the maximizer by solving this equation.

Stochastic EM (SEM) introduces an additional simulation step after the E-step that replaces the full distribution with a single sample:

S-step: Sample z_i ∼ f_{Z_i|X_i;η}(z_i|x_i; η_p)

This essentially means we replace E[·] with an empirical estimate. Thus, instead of solving (25), we simply have:

\frac{1}{N} \sum_{i} T(z_i, x_i) = \frac{dg(η)}{dη}.   (26)

Computing and solving this system of equations is considerably easier than (25).
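As an illustrative special case (ours): for a mixture of unit-variance Gaussians with labels z_i imputed in the S-step, the components of T(z_i, x_i) are (\mathbb{1}[z_i = k], \mathbb{1}[z_i = k]\, x_i), and (26) reduces to ordinary maximum likelihood on the imputed complete data,

N_k = \sum_{i=1}^{N} \mathbb{1}[z_i = k], \qquad π_k = \frac{N_k}{N}, \qquad µ_k = \frac{1}{N_k} \sum_{i:\, z_i = k} x_i.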

Now, to demonstrate that SEM is well behaved and works in practice, we run a small experiment. Consider the problem of estimating the parameters of a Gaussian mixture. We choose a 2-dimensional Gaussian mixture with K = 30 clusters, 100,000 training points, and 1,000 test points. We run EM and SEM with the following initializations:

(a) Same initialization (b) Bad initialization for SEM

Figure 4: Performance of SEM


• Both SEM and EM are provided the same initialization.
• SEM is deliberately provided a bad initialization, while EM is not.

The log-likelihood on the held-out test set is shown in Figure 4.


B (S)EM Derivation for GMM

Endless flow of equations. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood (ℓ) evaluated using the current estimate of the parameters (initially, random values), and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

We are in a Bayesian world, so parameters are treated as Random Variables:

\log p(X, θ, π | m_0, φ_0, α) = \log \sum_Z p(X, Z, θ, π | m_0, φ_0, α)
= \log \sum_Z p(X, Z | θ, π) + \sum_k \log p(θ_k | m_0, φ_0) + \log p(π | α)
= \sum_i \log \sum_k p(x_i, z_i = k | θ, π) + \sum_k \log p(θ_k | m_0, φ_0) + \log p(π | α)
= \sum_i \log \sum_k \frac{q(z_i = k | x_i)}{q(z_i = k | x_i)}\, p(x_i, z_i = k | θ, π) + \sum_k \log p(θ_k | m_0, φ_0) + \log p(π | α)
= \sum_i \log \sum_k q(z_i = k | x_i)\, \frac{p(x_i, z_i = k | θ, π)}{q(z_i = k | x_i)} + \sum_k \log p(θ_k | m_0, φ_0) + \log p(π | α)
≥ \sum_i \sum_k q(z_i = k | x_i) \log \frac{p(x_i, z_i = k | θ, π)}{q(z_i = k | x_i)} + \sum_k \log p(θ_k | m_0, φ_0) + \log p(π | α)   (27)

Let’s denote

F (q, θ, π) =∑i

∑k

q(zi = k|xi) logp(xi, zi = k|θ, π)

q(zi = k|xi)+∑k

log p(θk|m0, φ0) + p(π|α)

The EM algorithm (in a coordinate ascent manner) works as follows:

• In the E-step, fix θ and π and maximize F over q. On rearranging one obtains:

F(q, θ, π) = \sum_i \left[ −D_{KL}\big(q(z_i|x_i)\,\|\, p(z_i|x_i, θ, π)\big) + \log p(x_i|θ, π) \right] + \sum_k \log p(θ_k|m_0, φ_0) + \log p(π|α)

D_{KL} denotes the Kullback-Leibler divergence, a measure of the divergence of two distributions, defined as D_{KL}(P\|Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}. Since θ and π are fixed in the E-step, F(q, θ, π) is maximized only by the q that maximizes the negative KL divergence. Because the KL divergence is always non-negative, D_{KL} = 0 happens only when p and q are the same.

Summary: In the E-step of the t-th iteration, we derive q^{(t)} = \arg\max_q F(q, θ^{(t−1)}, π^{(t−1)}), namely

n_{ik} = E_{q|x}[z_i = k] = q^{(t)}(z_i = k | x_i) = p(z_i = k | x_i; θ^{(t−1)}, π^{(t−1)})
∝ p(x_i | θ_k^{(t−1)}, z_i = k)\, p(z_i = k | π^{(t−1)})
= \frac{π_k^{(t−1)} p(x_i | θ_k^{(t−1)})}{\sum_{k'} π_{k'}^{(t−1)} p(x_i | θ_{k'}^{(t−1)})}   (28)

• In the M-step, fix q and maximize F over θ and π.

We begin by maximizing over θ, in which case we can drop other terms:

F(q, θ, π) = \sum_i \sum_k q(z_i = k | x_i) \log p(x_i, z_i = k | θ, π) + \sum_k \log p(θ_k | m_0, φ_0) + \text{const}
= \sum_i \sum_k q(z_i = k | x_i)\big(\langle φ(x_i), θ_k \rangle − g(θ_k)\big) + \sum_k \big(\langle φ_0, θ_k \rangle − m_0 g(θ_k)\big) + \text{const}   (29)


Taking the derivative with respect to θ_k and setting it to 0 yields

∇g(θ_k) = \frac{1}{m_0 + \sum_i q(z_i = k | x_i)} \left( φ_0 + \sum_i q(z_i = k | x_i)\, φ(x_i) \right)

θ_k = η^{-1}\!\left( \frac{1}{m_0 + \sum_i q(z_i = k | x_i)} \left( φ_0 + \sum_i q(z_i = k | x_i)\, φ(x_i) \right) \right)   (30)

Similarly, solving for π, we first summarize the equation with the terms related to π_k as follows:

F(q, θ, π) = \sum_i \sum_k q(z_i = k | x_i) \log p(x_i, z_i = k | θ, π) + \sum_k \log p(θ_k | m_0, φ_0) + \log p(π | α)
= \sum_i \sum_k q(z_i = k | x_i) \log π_k + \sum_k (α_k − 1) \log π_k + \text{const}   (31)

Now, solving for π_k leads to the following optimization problem:

π = \arg\max_π \sum_i \sum_k q(z_i = k | x_i) \log π_k + \sum_k (α_k − 1) \log π_k
\quad \text{s.t.} \quad \sum_{k=1}^{K} π_k = 1.   (32)

Writing the Lagrangian for this optimization problem,

\mathcal{L}(π, λ) = \sum_i \sum_k q(z_i = k | x_i) \log π_k + \sum_k (α_k − 1) \log π_k + λ\left(1 − \sum_{k=1}^{K} π_k\right)   (33)

Now, setting the gradient with respect to π_k to zero gives us

0 = \frac{1}{π_k}\left[\sum_i q(z_i = k | x_i) + (α_k − 1)\right] − λ
\quad \Leftrightarrow \quad
π_k = \frac{\sum_i q(z_i = k | x_i) + (α_k − 1)}{λ}   (34)

Since q(z_i = k | x_i) + (α_k − 1) ≥ 0 and the π_k have to sum to 1, we solve for λ, thereby obtaining the solution for π_k as

π_k = \frac{\sum_i q(z_i = k | x_i) + α_k − 1}{\sum_{i,k'} q(z_i = k' | x_i) + \sum_{k'} α_{k'} − K}
= \frac{\sum_i q(z_i = k | x_i) + α_k − 1}{N + \sum_{k'} α_{k'} − K}.   (35)

The density of the multivariate normal N(µ, Σ) is given by

p(x | µ, Σ) = \frac{1}{\sqrt{(2π)^d |Σ|}} \exp\left( −\frac{1}{2} (x − µ)^\top Σ^{-1} (x − µ) \right)   (36)

where µ ∈ R^d and Σ ≻ 0 is a symmetric positive definite d × d matrix.

The conjugate prior for the multivariate normal distribution can be parametrized as the normal-inverse-Wishart distribution NIW(µ_0, κ_0, Σ_0, ν_0). The density is given by:

p(µ, Σ; µ_0, κ_0, Σ_0, ν_0) = N(µ | µ_0, Σ/κ_0)\, W^{-1}(Σ | Σ_0, ν_0)
= \frac{κ_0^{d/2} |Σ_0|^{ν_0/2} |Σ|^{−\frac{ν_0 + d + 2}{2}}}{2^{\frac{(ν_0 + 1)d}{2}} π^{d/2} Γ_d(\frac{ν_0}{2})}\, \exp\!\left( −\frac{κ_0}{2} (µ − µ_0)^\top Σ^{-1} (µ − µ_0) − \frac{1}{2} \mathrm{tr}(Σ_0 Σ^{-1}) \right)   (37)

Now, we derive the expectation-maximization rules for the mixture of multivariate normals, N(µ_k, Σ_k) for k = 1, \ldots, K, with the shared prior NIW(µ_0, κ_0, Σ_0, ν_0).


E step:

n_{ik}^{(t)} = \frac{\exp\!\big[−\frac{1}{2}(x_i − µ_k^{(t)})^\top Σ_k^{(t)-1} (x_i − µ_k^{(t)})\big]\, π_k^{(t)}}{\sum_{j=1}^{K} \exp\!\big[−\frac{1}{2}(x_i − µ_j^{(t)})^\top Σ_j^{(t)-1} (x_i − µ_j^{(t)})\big]\, π_j^{(t)}}   (38)

M step: First,

π_k^{(t+1)} = \frac{α_k − 1 + \sum_i n_{ik}^{(t)}}{N + \sum_{j=1}^{K} α_j − K}.   (39)

Now, in the natural parameter space, θ_1 = Σ^{-1}µ and θ_2 = −\frac{1}{2}Σ^{-1}. Thus,

Σ = −\frac{1}{2}θ_2^{-1}, \qquad µ = −\frac{1}{2}θ_2^{-1}θ_1   (40)

and

g(θ) = \begin{bmatrix} −\frac{1}{4}θ_1^\top θ_2^{-1} θ_1 \\ \frac{d}{2}\log(2π) − \frac{1}{2}\log|−2θ_2| \end{bmatrix}.   (41)

So

\frac{\partial g}{\partial θ_1} = \begin{bmatrix} −\frac{1}{2}θ_2^{-1}θ_1 \\ 0 \end{bmatrix} = \begin{bmatrix} µ \\ 0 \end{bmatrix}, \qquad
\frac{\partial g}{\partial θ_2} = \begin{bmatrix} \frac{1}{4}θ_2^{-1}θ_1θ_1^\top θ_2^{-1} \\ −\frac{1}{2}θ_2^{-1} \end{bmatrix} = \begin{bmatrix} µµ^\top \\ Σ \end{bmatrix}.   (42)

The derivative \partial g_1/\partial θ_2 comes from the identity \partial\,\mathrm{tr}(X^{-1}A)/\partial X = −(X^{-1})^\top A (X^{-1})^\top and the invariance of the trace operator under cyclic permutation; see 'Wikipedia/Matrix calculus'. Also recall that

φ_0 = \begin{pmatrix} κ_0 µ_0 \\ Σ_0 + κ_0 µ_0 µ_0^\top \end{pmatrix}, \qquad
m_0 = \begin{pmatrix} κ_0 \\ ν_0 + d + 2 \end{pmatrix}, \qquad
φ(x) = \begin{pmatrix} x \\ x x^\top \end{pmatrix}.   (43)

Denote n_k^{(t)} = \sum_i n_{ik}^{(t)}. Combining all of this with (??), we see that

\frac{\partial F}{\partial θ_1} = κ_0 µ_0 + \sum_i n_{ik}^{(t)} x_i − \big(κ_0 + n_k^{(t)}\big)\, µ_k^{(t+1)} = 0

\frac{\partial F}{\partial θ_2} = Σ_0 + κ_0 µ_0 µ_0^\top + \sum_i n_{ik}^{(t)} x_i x_i^\top − \big(κ_0 + n_k^{(t)}\big)\, µ_k^{(t+1)} µ_k^{(t+1)\top} − \big(ν_0 + d + 2 + n_k^{(t)}\big)\, Σ_k^{(t+1)} = 0   (44)

and hence

µ_k^{(t+1)} = \frac{κ_0 µ_0 + \sum_i n_{ik}^{(t)} x_i}{κ_0 + n_k^{(t)}}

Σ_k^{(t+1)} = \frac{Σ_0 + κ_0 µ_0 µ_0^\top + \sum_i n_{ik}^{(t)} x_i x_i^\top − \big(κ_0 + n_k^{(t)}\big)\, µ_k^{(t+1)} µ_k^{(t+1)\top}}{ν_0 + d + 2 + n_k^{(t)}}.   (45)

B.1 Introducing Stochasticity

After performing the E-step, we add an extra simulation step, i.e. we draw and impute the values of the latent variables from their distribution conditioned on the data and the current estimate of the parameters. This basically means n_{ik} gets transformed into δ(z_i − k), where k is the value drawn from the conditional distribution. Then we proceed to perform the M-step, which is even simpler now. To summarize, SEM for a GMM has the following steps:

E-step: in parallel compute the conditional distribution locally:

n_{ik}^{(t)} = \frac{\exp\!\big[−\frac{1}{2}(x_i − µ_k^{(t)})^\top Σ_k^{(t)-1} (x_i − µ_k^{(t)})\big]\, π_k^{(t)}}{\sum_{j=1}^{K} \exp\!\big[−\frac{1}{2}(x_i − µ_j^{(t)})^\top Σ_j^{(t)-1} (x_i − µ_j^{(t)})\big]\, π_j^{(t)}}   (46)


S-step: in parallel draw z_i from the categorical distribution:

z_i^{(t)} ∼ Categorical\big(n_{i1}^{(t)}, \ldots, n_{iK}^{(t)}\big)   (47)

M-step: in parallel compute the new parameter estimates:

π_k^{(t+1)} = \frac{α_k − 1 + T_k^{(t)}}{N + \sum_{j=1}^{K} α_j − K}

µ_k^{(t+1)} = \frac{κ_0 µ_0 + \sum_{i: z_i^{(t)} = k} x_i}{κ_0 + T_k^{(t)}}

Σ_k^{(t+1)} = \frac{Σ_0 + κ_0 µ_0 µ_0^\top + \sum_{i: z_i^{(t)} = k} x_i x_i^\top − \big(κ_0 + T_k^{(t)}\big)\, µ_k^{(t+1)} µ_k^{(t+1)\top}}{ν_0 + d + 2 + T_k^{(t)}}   (48)

where T_k^{(t)} = \big|\{\, i \mid z_i^{(t)} = k \,\}\big|.
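To make (48) concrete, the sketch below (ours, not the authors' code) implements the M-step for the one-dimensional case, where Σ_0, Σ_k, and x_i x_i^⊤ reduce to scalars and d = 1:

```cpp
#include <vector>

// One-dimensional M-step of SEM for a Gaussian mixture with the shared
// conjugate prior of this appendix (the d = 1 case of (48)).
// Inputs: data x, imputed labels z from the S-step, prior (mu0, kappa0, Sigma0, nu0, alpha).
void m_step_1d(const std::vector<double>& x, const std::vector<int>& z, int K,
               double mu0, double kappa0, double Sigma0, double nu0, double alpha,
               std::vector<double>& pi, std::vector<double>& mu, std::vector<double>& Sigma) {
    const int N = (int)x.size();
    std::vector<double> T(K, 0.0), sum_x(K, 0.0), sum_xx(K, 0.0);
    for (int i = 0; i < N; ++i) {          // sufficient statistics of the imputed data
        int k = z[i];
        T[k] += 1.0;
        sum_x[k] += x[i];
        sum_xx[k] += x[i] * x[i];
    }
    pi.assign(K, 0.0); mu.assign(K, 0.0); Sigma.assign(K, 0.0);
    for (int k = 0; k < K; ++k) {
        pi[k] = (alpha - 1.0 + T[k]) / (N + K * alpha - K);        // first line of (48)
        mu[k] = (kappa0 * mu0 + sum_x[k]) / (kappa0 + T[k]);       // second line of (48)
        Sigma[k] = (Sigma0 + kappa0 * mu0 * mu0 + sum_xx[k]        // third line of (48), d = 1
                    - (kappa0 + T[k]) * mu[k] * mu[k])
                   / (nu0 + 1.0 + 2.0 + T[k]);
    }
}
```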


C Gibbs Sampler Derivation for GMM

The Markov blanket for z_i becomes x_{-i}, z_{-i}, x_i, φ_0, m_0, and α in this case. We obtain

P(z_i | \text{rest}) = \frac{P(x_i | z_i, \{x_j : z_j = z_i\}, m_0, φ_0)\, P(z_i | z_{-i}, α)}{P(x_i | x_{-i}, m_0, φ_0)}

= \frac{\exp[h(m_{z_i}, φ_{z_i}) − h(m_{z_i} − 1, φ_{z_i} − φ(x_i))]\, \exp[\log B(\vec{n}) − \log B(\vec{n} − \vec{e}_{z_i})]}{\sum_{j=1}^{K} \exp[h(m_j, φ_j) − h(m_j − 1, φ_j − φ(x_i))]\, \exp[\log B(\vec{n}) − \log B(\vec{n} − \vec{e}_j)]}

= \frac{\exp[h(m_{z_i}, φ_{z_i}) − h(m_{z_i} − 1, φ_{z_i} − φ(x_i))]\, \frac{\bar{n}_{z_i} − 1}{(\sum_k \bar{n}_k) − 1}}{\sum_{j=1}^{K} \exp[h(m_j, φ_j) − h(m_j − 1, φ_j − φ(x_i))]\, \frac{\bar{n}_j − 1}{(\sum_k \bar{n}_k) − 1}}

= \frac{\exp[h(m_{z_i}, φ_{z_i}) − h(m_{z_i} − 1, φ_{z_i} − φ(x_i))]\, (\bar{n}_{z_i} − 1)}{\sum_{j=1}^{K} \exp[h(m_j, φ_j) − h(m_j − 1, φ_j − φ(x_i))]\, (\bar{n}_j − 1)}   (49)

which yields the conditional needed in Gibbs sampling to sample a latent variable zi.

As a pedagogical example, we derive the Gibbs sampler for the multivariate Gaussian with a normal-inverse-Wishart prior.

First, apply the downdate equations in the following order:

κ_{z_i} ← κ_{z_i} − 1, \qquad ν_{z_i} ← ν_{z_i} − 1

µ_{z_i} ← \frac{(κ_{z_i} − 1)\, µ_{z_i} − x_i}{κ_{z_i}}

Σ_{z_i} ← Σ_{z_i} − \frac{κ_{z_i}}{κ_{z_i} − 1}\,(x_i − µ_{z_i})(x_i − µ_{z_i})^\top   (50)

Update z_i by sampling from the distribution:

P(z_i = k | \text{rest}) = \frac{P(x_i | z_i = k, x_{-i})\,(n_k − 1)}{\sum_j P(x_i | z_i = j, x_{-i})\,(n_j − 1)}
= \frac{t_{ν_k − d + 1}\!\left(x_i \,\Big|\, µ_k, \frac{(κ_k + 1)Σ_k}{κ_k(ν_k − d + 1)}\right)(n_k − 1)}{\sum_{j=1}^{K} t_{ν_j − d + 1}\!\left(x_i \,\Big|\, µ_j, \frac{(κ_j + 1)Σ_j}{κ_j(ν_j − d + 1)}\right)(n_j − 1)}   (51)

then apply the update equations in the following order:

Σ_{z_i} ← Σ_{z_i} + \frac{κ_{z_i}}{κ_{z_i} − 1}\,(x_i − µ_{z_i})(x_i − µ_{z_i})^\top

µ_{z_i} ← \frac{κ_{z_i} µ_{z_i} + x_i}{κ_{z_i} + 1}

ν_{z_i} ← ν_{z_i} + 1, \qquad κ_{z_i} ← κ_{z_i} + 1   (52)

At the end of the procedure, we obtain

µ_k = \frac{κ_0 µ_0 + \sum_{i: z_i = k} x_i}{κ_0 + n_k}

Σ_k = \frac{Σ_0 + \sum_{i: z_i = k} x_i x_i^\top + κ_0 µ_0 µ_0^\top − (κ_0 + n_k)\, µ_k µ_k^\top}{ν_0 + n_k − d − 1}   (53)

where again n_k = |\{i : z_i = k\}|.

C.1 Gibbs Derivation

In the last section we used EM for inference; now we turn to Gibbs sampling, another popular method. Gibbs sampling is a variety of MCMC sampling in which we cycle through all our latent random variables, resampling each conditioned on the currently sampled values of all other random variables.

Gibbs sampling is an MCMC method that traditionally sweeps over all the variables in each iteration and, one at a time, samples each variable conditioned on the rest (using p(z_i | rest), the full conditional). We can often do better (a consequence of the Rao-Blackwell theorem) by collapsing, i.e. integrating out θ_k and π. See Algorithm 2, and see Appendix C for more details.

The Markov blanket for z_i becomes x_{-i}, z_{-i}, x_i, φ_0, m_0, and α in this case. We obtain

P(z_i | \text{rest}) = \frac{P(x_i | z_i, \{x_j : z_j = z_i\}, m_0, φ_0)\, P(z_i | z_{-i}, α)}{P(x_i | x_{-i}, m_0, φ_0)}
= \frac{\exp[h(m_{z_i}, φ_{z_i}) − h(m_{z_i} − 1, φ_{z_i} − φ(x_i))]\,(\bar{n}_{z_i} − 1)}{\sum_{j=1}^{K} \exp[h(m_j, φ_j) − h(m_j − 1, φ_j − φ(x_i))]\,(\bar{n}_j − 1)}   (54)

Note that m_k − m_0 = n_k − α_k. Let n_k = |{i : z_i = k}|, so that m_k = m_0 + n_k and \bar{n}_k = n_k + α. Thus, we need to maintain only two invariants, n_k and φ_k, per component in the inference procedure.

Algorithm 2 Collapsed Gibbs sampling for mixture models

1: Initialize z randomly and evaluate the initial counts n_k and statistics φ_k.
2: t ← 0
3: while t ≤ T do
4:   for i = 1 → N do
5:     Remove datum from its current component and update statistics: n_{z_i} ← n_{z_i} − 1, φ_{z_i} ← φ_{z_i} − φ(x_i)
6:     Sample z_i using the PMF stored in
         p[k] ← (α + n_k − 1) exp(h(m_0 + n_k + 1, φ_k + φ(x_i)) − h(m_0 + n_k, φ_k));  p ← p / sum(p)
7:     Add datum to the new component and update statistics: n_{z_i} ← n_{z_i} + 1, φ_{z_i} ← φ_{z_i} + φ(x_i)
8:   end for
9:   t ← t + 1
10: end while


D (S)EM Derivation for LDA

We derive an EM procedure for LDA.

D.1 LDA Model

In LDA, we model each document m of a corpus of M documents as a distribution θ_m that represents a mixture of topics. There are K such topics, and we model each topic k as a distribution φ_k over the vocabulary of words that appear in our corpus. Each document m contains N_m words w_mn from a vocabulary of size V, and we associate a latent variable z_mn with each of the words. The latent variables can take one of K values that indicate which topic the word belongs to. We give each of the distributions θ_m and φ_k a Dirichlet prior, parameterized respectively with constants α and β. More concisely, LDA has the following mixed density.

p(w, z, θ, φ) = [ ∏_{m=1}^{M} ∏_{n=1}^{N_m} Cat(w_{mn} | φ_{z_{mn}}) Cat(z_{mn} | θ_m) ] [ ∏_{m=1}^{M} Dir(θ_m | α) ] [ ∏_{k=1}^{K} Dir(φ_k | β) ]        (55)

The choice of a Dirichlet prior is not a coincidence: we can integrate out all of the variables θ_m and φ_k and obtain the following closed-form solution.

p(w, z) = [ ∏_{m=1}^{M} Pol({z_{m'n} | m' = m}, K, α) ] [ ∏_{k=1}^{K} Pol({w_{mn} | z_{mn} = k}, V, β) ]        (56)

where Pol is the Polya distribution

Pol(S, X, η) = ( Γ(ηX) / Γ(|S| + ηX) ) ∏_{x=1}^{X} ( Γ(|{z | z ∈ S, z = x}| + η) / Γ(η) )        (57)
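For reference, the logarithm of Eq. (57) can be evaluated directly with log-gamma functions; the short sketch below (our naming) assumes S is given as a list of outcomes in {1, ..., X} and η is a symmetric scalar parameter.

from collections import Counter
from scipy.special import gammaln

def log_polya(S, X, eta):
    # log Pol(S, X, eta) as in Eq. (57): Dirichlet-multinomial marginal of the multiset S.
    counts = Counter(S)
    out = gammaln(eta * X) - gammaln(len(S) + eta * X)
    for x in range(1, X + 1):
        out += gammaln(counts.get(x, 0) + eta) - gammaln(eta)
    return out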

Figure 5: LDA graphical model (plates over topics k, documents m, and words n; variables α, θ_m, z_{mn}, w_{mn}, φ_k, β).

Algorithm 3 LDA Generative Model
input: α, β
1: for k = 1 → K do
2:   Choose topic φ_k ∼ Dir(β)
3: end for
4: for all documents m in corpus D do
5:   Choose a topic distribution θ_m ∼ Dir(α)
6:   for all word indices n from 1 to N_m do
7:     Choose a topic z_{mn} ∼ Categorical(θ_m)
8:     Choose a word w_{mn} ∼ Categorical(φ_{z_{mn}})
9:   end for

10: end for
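A small Python sketch of this generative process, useful for producing synthetic corpora on which to test the inference procedures, is given below; the function name, its defaults, and the fixed document length are our own choices for illustration.

import numpy as np

def generate_lda(M, K, V, alpha, beta, doc_len=100, seed=0):
    # Sample a synthetic corpus from the LDA generative model (Algorithm 3).
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)        # topic-word distributions phi_k
    theta = rng.dirichlet(np.full(K, alpha), size=M)     # document-topic distributions theta_m
    docs, topics = [], []
    for m in range(M):
        z = rng.choice(K, size=doc_len, p=theta[m])      # z_mn ~ Categorical(theta_m)
        w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w_mn ~ Categorical(phi_{z_mn})
        docs.append(w); topics.append(z)
    return docs, topics, theta, phi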

The joint probability density can be expressed as:

p(W, Z, θ, φ | α, β) = [ ∏_{k=1}^{K} p(φ_k | β) ] [ ∏_{m=1}^{M} p(θ_m | α) ∏_{n=1}^{N_m} p(z_{mn} | θ_m) p(w_{mn} | φ_{z_{mn}}) ]

                     ∝ [ ∏_{k=1}^{K} ∏_{v=1}^{V} φ_{kv}^{β−1} ] [ ∏_{m=1}^{M} ( ∏_{k=1}^{K} θ_{mk}^{α−1} ) ∏_{n=1}^{N_m} θ_{m z_{mn}} φ_{z_{mn} w_{mn}} ]        (58)


D.2 Expectation Maximization

We begin by marginalizing the latent variable Z and finding the lower bound for the likelihood/posterior:

log p(W, θ, φ | α, β) = log ∑_Z p(W, Z, θ, φ | α, β)

= ∑_{m=1}^{M} ∑_{n=1}^{N_m} log ∑_{k=1}^{K} p(z_{mn} = k | θ_m) p(w_{mn} | φ_k) + ∑_{k=1}^{K} log p(φ_k | β) + ∑_{m=1}^{M} log p(θ_m | α)

= ∑_{m=1}^{M} ∑_{n=1}^{N_m} log ∑_{k=1}^{K} q(z_{mn} = k | w_{mn}) [ p(z_{mn} = k | θ_m) p(w_{mn} | φ_k) / q(z_{mn} = k | w_{mn}) ] + ∑_{k=1}^{K} log p(φ_k | β) + ∑_{m=1}^{M} log p(θ_m | α)

(Jensen's inequality) ≥ ∑_{m=1}^{M} ∑_{n=1}^{N_m} ∑_{k=1}^{K} q(z_{mn} = k | w_{mn}) log [ p(z_{mn} = k | θ_m) p(w_{mn} | φ_k) / q(z_{mn} = k | w_{mn}) ] + ∑_{k=1}^{K} log p(φ_k | β) + ∑_{m=1}^{M} log p(θ_m | α)        (59)

Let us define the following functional:

F(q, θ, φ) := − ∑_{m=1}^{M} ∑_{n=1}^{N_m} D_KL( q(z_{mn} | w_{mn}) || p(z_{mn} | w_{mn}, θ_m, φ) )
              + ∑_{m=1}^{M} ∑_{n=1}^{N_m} log p(w_{mn} | θ_m, φ) + ∑_{k=1}^{K} log p(φ_k | β) + ∑_{m=1}^{M} log p(θ_m | α)        (60)

D.2.1 E-Step

In the E-step, we fix θ, φ and maximize F for q. As q appears only in the KL-divergence term, this is equivalent to minimizing the KL-divergence between q(z_{mn} | w_{mn}) and p(z_{mn} | w_{mn}, θ_m, φ). We know that for any distributions f and g the KL-divergence is minimized when f = g and is then equal to 0. Thus, we have

q(z_{mn} = k | w_{mn}) = p(z_{mn} = k | w_{mn}, θ_m, φ) = θ_{mk} φ_{k w_{mn}} / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}}        (61)

For simplicity of notation, let us define

q_{mnk} = θ_{mk} φ_{k w_{mn}} / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}}        (62)


D.2.2 M-Step

In the M-step, we fix q and maximize F for θ, φ. As this is a constrained optimization (θ and φ must lie on the simplex), we use the standard constrained-optimization procedure of Lagrange multipliers. The Lagrangian can be expressed as:

L(θ, φ, λ, µ) = ∑_{m=1}^{M} ∑_{n=1}^{N_m} ∑_{k=1}^{K} q(z_{mn} = k | w_{mn}) log [ p(z_{mn} = k | θ_m) p(w_{mn} | φ_k) / q(z_{mn} = k | w_{mn}) ] + ∑_{k=1}^{K} log p(φ_k | β)
                + ∑_{m=1}^{M} log p(θ_m | α) + ∑_{k=1}^{K} λ_k ( 1 − ∑_{v=1}^{V} φ_{kv} ) + ∑_{m=1}^{M} µ_m ( 1 − ∑_{k=1}^{K} θ_{mk} )

              = ∑_{m=1}^{M} ∑_{n=1}^{N_m} ∑_{k=1}^{K} q_{mnk} log θ_{mk} φ_{k w_{mn}} + ∑_{k=1}^{K} ∑_{v=1}^{V} (β_v − 1) log φ_{kv} + ∑_{m=1}^{M} ∑_{k=1}^{K} (α_k − 1) log θ_{mk}
                + ∑_{k=1}^{K} λ_k ( 1 − ∑_{v=1}^{V} φ_{kv} ) + ∑_{m=1}^{M} µ_m ( 1 − ∑_{k=1}^{K} θ_{mk} ) + const.        (63)

Maximizing θ. Taking the derivative with respect to θ_{mk} and setting it to 0, we obtain

∂L/∂θ_{mk} = 0 = ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 ) / θ_{mk} − µ_m

µ_m θ_{mk} = ∑_{n=1}^{N_m} q_{mnk} + α_k − 1        (64)

After solving for µm, we finally obtain

θ_{mk} = ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 ) / ∑_{k'=1}^{K} ( ∑_{n=1}^{N_m} q_{mnk'} + α_{k'} − 1 )        (65)

Noting that ∑_{k'=1}^{K} q_{mnk'} = 1, we arrive at the optimizer:

θ_{mk} = ( 1 / (N_m + ∑_{k'} (α_{k'} − 1)) ) ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 )        (66)

Maximizing φ. Taking the derivative with respect to φ_{kv} and setting it to 0, we obtain

∂L/∂φ_{kv} = 0 = ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1 ) / φ_{kv} − λ_k

λ_k φ_{kv} = ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1        (67)

After solving for λk, we finally obtain

φ_{kv} = ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1 ) / ∑_{v'=1}^{V} ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v' − w_{mn}) + β_{v'} − 1 )        (68)

Noting that ∑_{v'=1}^{V} δ(v' − w_{mn}) = 1, we arrive at the optimizer:

φ_{kv} = ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1 ) / ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} + ∑_{v'} (β_{v'} − 1) )        (69)
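Putting the E-step (62) and the M-step (66), (69) together, a minimal NumPy sketch of batch EM for LDA is given below. The function name is ours; it assumes symmetric scalar hyperparameters α, β ≥ 1 (so that the MAP numerators stay nonnegative) and represents each document as an array of integer word ids.

import numpy as np

def em_lda(docs, K, V, alpha, beta, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    M = len(docs)
    theta = rng.dirichlet(np.ones(K), size=M)             # theta_{mk}
    phi = rng.dirichlet(np.ones(V), size=K)               # phi_{kv}
    for _ in range(iters):
        W_soft = np.zeros((V, K))                         # soft counts, rows indexed by word id
        for m, w in enumerate(docs):
            q = theta[m][:, None] * phi[:, w]             # unnormalized q_{mnk}, shape K x N_m
            q /= q.sum(axis=0, keepdims=True)             # E-step, Eq. (62)
            theta[m] = (q.sum(axis=1) + alpha - 1) / (len(w) + K * (alpha - 1))   # Eq. (66)
            np.add.at(W_soft, w, q.T)                     # accumulate sum_n q_{mnk} per word id
        Wkv = W_soft.T                                    # sum_{m,n} q_{mnk} delta(v - w_{mn})
        phi = (Wkv + beta - 1) / (Wkv.sum(axis=1, keepdims=True) + V * (beta - 1))   # Eq. (69)
    return theta, phi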

D.3 Introducing Stochasticity

After performing the E-step, we add an extra simulation step, i.e., we draw and impute values for the latent variables from their distribution conditioned on the data and the current estimate of the parameters. This basically means that q_{mnk} gets transformed into δ(z_{mn} − k), where k is the value drawn from the conditional distribution. Then we proceed to perform the M-step, which is even simpler now. To summarize, SEM for LDA has the following steps:


E-step: in parallel, compute the conditional distribution locally:

q_{mnk} = θ_{mk} φ_{k w_{mn}} / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}}        (70)

S-step: in parallel, draw z_{mn} from the categorical distribution:

z_{mn} ∼ Categorical(q_{mn1}, ..., q_{mnK})        (71)

M-step: in parallel, compute the new parameter estimates:

θ_{mk} = ( D_{mk} + α_k − 1 ) / ( N_m + ∑_{k'} (α_{k'} − 1) )

φ_{kv} = ( W_{kv} + β_v − 1 ) / ( T_k + ∑_{v'} (β_{v'} − 1) )        (72)

where D_{mk} = |{n : z_{mn} = k}|, W_{kv} = |{(m, n) : w_{mn} = v, z_{mn} = k}|, and T_k = |{(m, n) : z_{mn} = k}| = ∑_{v=1}^{V} W_{kv}.
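A minimal NumPy sketch of one SEM sweep, combining (70)-(72) through the hard counts D_{mk}, W_{kv}, and T_k, is given below; the function name and the in-place update of θ and φ are our own choices, and symmetric scalar α, β ≥ 1 are assumed.

import numpy as np

def sem_sweep(docs, theta, phi, alpha, beta, rng):
    M, K = theta.shape
    V = phi.shape[1]
    W = np.zeros((K, V))                                   # W_{kv}
    for m, w in enumerate(docs):
        q = theta[m][:, None] * phi[:, w]
        q /= q.sum(axis=0, keepdims=True)                  # E-step, Eq. (70)
        z = np.array([rng.choice(K, p=q[:, n]) for n in range(len(w))])   # S-step, Eq. (71)
        D = np.bincount(z, minlength=K)                    # D_{mk}
        theta[m] = (D + alpha - 1) / (len(w) + K * (alpha - 1))           # M-step for theta
        np.add.at(W, (z, w), 1.0)                          # hard word-topic counts
    T = W.sum(axis=1, keepdims=True)                       # T_k
    phi[:] = (W + beta - 1) / (T + V * (beta - 1))         # M-step for phi, Eq. (72)
    return theta, phi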


E Equivalency between (S)EM and (S)GD for LDA

We study the equivalency between (S)EM and (S)GD for LDA.

E.1 EM for LDA

EM for LDA can be summarized as follows:

E-Step

q_{mnk} = θ_{mk} φ_{k w_{mn}} / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}}        (73)

M-Step

θ_{mk} = ( 1 / (N_m + ∑_{k'} (α_{k'} − 1)) ) ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 )

φ_{kv} = ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1 ) / ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} + ∑_{v'} (β_{v'} − 1) )        (74)

E.2 GD for LDA

The joint probability density can be expressed as:

p(W, Z, θ, φ | α, β) = [ ∏_{k=1}^{K} p(φ_k | β) ] [ ∏_{m=1}^{M} p(θ_m | α) ∏_{n=1}^{N_m} p(z_{mn} | θ_m) p(w_{mn} | φ_{z_{mn}}) ]

                     ∝ [ ∏_{k=1}^{K} ∏_{v=1}^{V} φ_{kv}^{β−1} ] [ ∏_{m=1}^{M} ( ∏_{k=1}^{K} θ_{mk}^{α−1} ) ∏_{n=1}^{N_m} θ_{m z_{mn}} φ_{z_{mn} w_{mn}} ]        (75)

The log-probability of the joint model with Z marginalized out can be written as:

log p(W, θ, φ | α, β) = log ∑_Z p(W, Z, θ, φ | α, β)

= ∑_{m=1}^{M} ∑_{n=1}^{N_m} log ∑_{k=1}^{K} p(z_{mn} = k | θ_m) p(w_{mn} | φ_k) + ∑_{k=1}^{K} log p(φ_k | β) + ∑_{m=1}^{M} log p(θ_m | α)

= ∑_{m=1}^{M} ∑_{n=1}^{N_m} log ∑_{k=1}^{K} θ_{mk} φ_{k w_{mn}} + ∑_{m=1}^{M} ∑_{k=1}^{K} (α_k − 1) log θ_{mk} + ∑_{k=1}^{K} ∑_{v=1}^{V} (β_v − 1) log φ_{kv} + const.        (76)

Gradient for topic per document. Taking the derivative with respect to θ_{mk}:

∂ log p / ∂θ_{mk} = ∑_{n=1}^{N_m} φ_{k w_{mn}} / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}} + (α_k − 1) / θ_{mk}

                 = ( 1 / θ_{mk} ) ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 )        (77)


Gradient for word per topic. Taking the derivative with respect to φ_{kv}:

∂ log p / ∂φ_{kv} = ∑_{m=1}^{M} ∑_{n=1}^{N_m} θ_{mk} δ(v − w_{mn}) / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}} + (β_v − 1) / φ_{kv}

                 = ( 1 / φ_{kv} ) ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1 )        (78)

E.3 Equivalency

If we look at one step of EM:

For topic per document

θ^+_{mk} = ( 1 / (N_m + ∑_{k'} (α_{k'} − 1)) ) ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 )

         = ( θ_{mk} / (N_m + ∑_{k'} (α_{k'} − 1)) ) ∂ log p / ∂θ_{mk}

Vectorizing, this can be re-written as:

θ^+_m = θ_m + ( 1 / (N_m + ∑_{k'} (α_{k'} − 1)) ) [ diag(θ_m) − θ_m θ_m^⊤ ] ∂ log p / ∂θ_m        (79)

For word per topic

φ^+_{kv} = ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} δ(v − w_{mn}) + β_v − 1 ) / ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} + ∑_{v'} (β_{v'} − 1) )

         = ( φ_{kv} / ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} + ∑_{v'} (β_{v'} − 1) ) ) ∂ log p / ∂φ_{kv}

Vectorizing, this can be re-written as:

φ^+_k = φ_k + ( 1 / ( ∑_{m=1}^{M} ∑_{n=1}^{N_m} q_{mnk} + ∑_{v'} (β_{v'} − 1) ) ) [ diag(φ_k) − φ_k φ_k^⊤ ] ∂ log p / ∂φ_k        (80)
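The identity (79) is easy to check numerically. The short NumPy snippet below uses arbitrary test values of q, α, and θ_m (our own, for illustration) and confirms that the EM update (66) coincides with the preconditioned gradient step (79).

import numpy as np

rng = np.random.default_rng(0)
K, Nm = 5, 40
alpha = np.full(K, 1.5)
q = rng.dirichlet(np.ones(K), size=Nm).T           # q_{mnk}, shape K x N_m
theta = rng.dirichlet(np.ones(K))                  # current theta_m

c = q.sum(axis=1) + alpha - 1                      # sum_n q_{mnk} + alpha_k - 1
denom = Nm + (alpha - 1).sum()
theta_em = c / denom                               # EM update, Eq. (66)

grad = c / theta                                   # d log p / d theta_{mk}, Eq. (77)
precond = np.diag(theta) - np.outer(theta, theta)  # diag(theta_m) - theta_m theta_m^T
theta_gd = theta + precond @ grad / denom          # preconditioned gradient step, Eq. (79)

print(np.allclose(theta_em, theta_gd))             # True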

E.4 SEM for LDA

We summarize our SEM derivation for LDA as follows:

E-Step

q_{mnk} = θ_{mk} φ_{k w_{mn}} / ∑_{k'=1}^{K} θ_{mk'} φ_{k' w_{mn}}        (81)

S-step

z_{mn} ∼ Categorical(q_{mn1}, ..., q_{mnK})        (82)

M-step

θ_{mk} = ( D_{mk} + α_k − 1 ) / ( N_m + ∑_{k'} (α_{k'} − 1) )

φ_{kv} = ( W_{kv} + β_v − 1 ) / ( T_k + ∑_{v'} (β_{v'} − 1) )        (83)

Here D_{mk} is the total number of tokens that belong to topic k in document m, and W_{kv} is the number of times word v belongs to topic k, i.e.,

D_{mk} = ∑_{n=1}^{N_m} z_{mnk}        (84)

W_{kv} = ∑_{m=1}^{M} ∑_{n=1}^{N_m} z_{mnk} δ(w_{mn} = v)        (85)


However, observe that all our z_{mn} are one-hot categorical random variables, and hence the above sums are simple counts that can be accumulated on the fly without an additional pass over the entire dataset. This is where the stochastic nature of SEM helps in reducing the training time. We next show the equivalency of SEM to SGD.

E.5 Equivalency

In the case of LDA, let us begin with θ, for which the update over one step of stochastic EM is:

θ^+_{mk} = ( D_{mk} + α_k − 1 ) / ( N_m + ∑_{k'=1}^{K} (α_{k'} − 1) )

         = ( 1 / (N_m + ∑_{k'=1}^{K} (α_{k'} − 1)) ) ( ∑_{n=1}^{N_m} δ(z_{mnk} = 1) + α_k − 1 )

Again vectorizing and re-writing as before: θ^+_m = θ_m + M g

where M = ( 1 / (N_m + ∑_{k'=1}^{K} (α_{k'} − 1)) ) [ diag(θ_m) − θ_m θ_m^⊤ ] and g has components g_k = ( 1 / θ_{mk} ) ( ∑_{n=1}^{N_m} δ(z_{mnk} = 1) + α_k − 1 ). The vector g can be

shown to be an unbiased noisy estimate of the gradient, i.e.

E[g_k] = ( 1 / θ_{mk} ) ( ∑_{n=1}^{N_m} E[δ(z_{mnk} = 1)] + α_k − 1 )

       = ( 1 / θ_{mk} ) ( ∑_{n=1}^{N_m} q_{mnk} + α_k − 1 ) = ∂ log p / ∂θ_{mk}

Thus, SEM is SGD with constraints. We have a similar result for φ_{kv}: an unbiased, noisy estimator of the gradient is used in place of the exact gradient in the SEM parameter update. Note, however, that the stochasticity does not arise from sub-sampling the data, as it usually does in SGD, but from the randomness introduced in the S-step. This immediately suggests an online/incremental version in which we subsample the data as well. Such a version could remove the barrier in the current implementation and give a revolver-like pipelined structure that is well suited to the hardware.
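The unbiasedness claim E[g] = ∂ log p / ∂θ_m can also be checked by simulation. The snippet below (with arbitrary test values and our own variable names) averages g over many draws of the S-step and compares the average against the exact gradient (77).

import numpy as np

rng = np.random.default_rng(1)
K, V, Nm, S = 4, 20, 30, 100000
theta = rng.dirichlet(np.ones(K))                  # current theta_m
phi = rng.dirichlet(np.ones(V), size=K)            # current phi
w = rng.integers(V, size=Nm)                       # a synthetic document
alpha = np.full(K, 1.2)

q = theta[:, None] * phi[:, w]
q /= q.sum(axis=0, keepdims=True)                  # q_{mnk}, shape K x N_m
grad = (q.sum(axis=1) + alpha - 1) / theta         # exact gradient, Eq. (77)

g_mean = np.zeros(K)
for _ in range(S):
    z = (rng.random(Nm)[None, :] < q.cumsum(axis=0)).argmax(axis=0)   # S-step: z_mn ~ Cat(q_mn)
    g_mean += (np.bincount(z, minlength=K) + alpha - 1) / theta
print(np.allclose(g_mean / S, grad, rtol=1e-2))    # True up to Monte Carlo error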


F Non-singularity of Fisher Information for Mixture Models

Let us consider a general mixture model:

p(x | θ, φ) = ∑_{k=1}^{K} θ_k f(x | φ_k)        (86)

Then the log-likelihood can be written as:

log p(x | θ, φ) = log ( ∑_{k=1}^{K} θ_k f(x | φ_k) )        (87)

The Fisher Information is given by:

I(θ, φ) = E[ (∇ log p(x | θ, φ)) (∇ log p(x | θ, φ))^⊤ ],   where ∇ log p(x | θ, φ) stacks ∂/∂θ log p(x | θ, φ) and ∂/∂φ log p(x | θ, φ).

These derivatives can be computed as follows:

∂/∂θ_k log p(x | θ, φ) = ∂/∂θ_k log ( ∑_{k'=1}^{K} θ_{k'} f(x | φ_{k'}) ) = f(x | φ_k) / ∑_{k'=1}^{K} θ_{k'} f(x | φ_{k'})

∂/∂φ_k log p(x | θ, φ) = ∂/∂φ_k log ( ∑_{k'=1}^{K} θ_{k'} f(x | φ_{k'}) ) = θ_k ( ∂/∂φ_k f(x | φ_k) ) / ∑_{k'=1}^{K} θ_{k'} f(x | φ_{k'})        (88)

For any u, v ∈ R^K with at least one of them nonzero, the Fisher information is positive definite, since

[u^⊤ v^⊤] I [u; v] = E[ ( [u^⊤ v^⊤] ∇ log p(X | θ, φ) )^2 ]

= E[ ( u^⊤ ∂/∂θ log( ∑_{k=1}^{K} θ_k f(X | φ_k) ) + v^⊤ ∂/∂φ log( ∑_{k=1}^{K} θ_k f(X | φ_k) ) )^2 ]

= E[ ( ∑_{k=1}^{K} ( u_k f(X | φ_k) + v_k θ_k ∂/∂φ_k f(X | φ_k) ) / ∑_{k=1}^{K} θ_k f(X | φ_k) )^2 ]

This can be 0 if and only if

∑_{k=1}^{K} ( u_k f(x | φ_k) + v_k θ_k ∂/∂φ_k f(x | φ_k) ) = 0   for all x.        (89)

In the case of exponential-family emission models, this cannot hold if all components are unique and all θ_k > 0. Thus, if we assume that all components are unique and that every component has been observed at least once, the Fisher information matrix is non-singular.
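As a sanity check, the Fisher information can be computed numerically for a toy mixture of two unit-variance Gaussians, whose mean parameters play the role of φ_k (so that ∂f/∂φ_k = f(x | φ_k)(x − φ_k)). The setup below is ours, for illustration only; the smallest eigenvalue comes out strictly positive, as the argument above predicts.

import numpy as np

theta = np.array([0.3, 0.7])                      # mixture weights theta_k
phi = np.array([-2.0, 2.0])                       # Gaussian means phi_k (unit variance)

x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
f = np.exp(-0.5 * (x[None, :] - phi[:, None]) ** 2) / np.sqrt(2 * np.pi)   # f(x | phi_k)
p = theta @ f                                     # mixture density, Eq. (86)

s_theta = f / p                                   # d/d theta_k log p, Eq. (88)
s_phi = theta[:, None] * f * (x[None, :] - phi[:, None]) / p               # d/d phi_k log p
s = np.vstack([s_theta, s_phi])                   # stacked score vector, shape 4 x len(x)

I = (s * p) @ s.T * dx                            # I = E[s s^T] by quadrature
print(np.linalg.eigvalsh(I).min() > 0)            # True: the Fisher information is PD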
