
Complexity of Inference in Latent Dirichlet Allocation

David Sontag
New York University∗

Daniel M. Roy
University of Cambridge

Abstract

We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document's topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.

1 Introduction

Probabilistic models of text and topics, known as topic models, are powerful tools for exploring large data sets and for making inferences about the content of documents. Topic models are frequently used for deriving low-dimensional representations of documents that are then used for information retrieval, document summarization, and classification [Blei and McAuliffe, 2008; Lacoste-Julien et al., 2009]. In this paper, we consider the computational complexity of inference in topic models, beginning with one of the simplest and most popular models, Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. The LDA model is arguably one of the most important probabilistic models in widespread use today.

Almost all uses of topic models require probabilistic inference. For example, unsupervised learning of topic models using Expectation Maximization requires the repeated computation of marginal probabilities of what topics are present in the documents. For applications in information retrieval and classification, each new document necessitates inference to determine what topics are present.

Although there is a wealth of literature on approximate inference algorithms for topic models, such as Gibbs sampling and variational inference [Blei et al., 2003; Griffiths and Steyvers, 2004; Mukherjee and Blei, 2009; Porteous et al., 2008; Teh et al., 2007], little is known about the computational complexity of exact inference. Furthermore, the existing inference algorithms, although well-motivated, do not provide guarantees of optimality. We choose to study LDA because we believe that it captures the essence of what makes inference easy or hard in topic models. We believe that a careful analysis of the complexity of popular probabilistic models like LDA will ultimately help us build a methodology for spanning the gap between theory and practice in probabilistic AI.

Our hope is that our results will motivate discussion of the following questions, guiding both research on new topic models and the design of new approximate inference and learning algorithms. First, what is the structure of real-world LDA inference problems? Might there be structure in "natural" problem instances that makes them different from hard instances (e.g., those used in our reductions)? Second, how strongly does the prior distribution bias the results of inference? How do the hyperparameters affect the structure of the posterior and the hardness of inference?

∗ This work was partially carried out while D.S. was at Microsoft Research New England.

We study the complexity of finding assignments of topics to words with high posterior probability and the complexity of summarizing the posterior distributions on topics in a document by either its expectation or points with high posterior density. In the former case, we show that the number of topics in the maximum a posteriori assignment determines the hardness. In the latter case, we quantify the sense in which the Dirichlet prior can be seen to enforce sparsity and use this result to show hardness via a reduction from set cover.

2 MAP inference of word assignments

We will consider the inference problem for a single document. The LDA model states that the document, represented as a collection of words w = (w_1, w_2, . . . , w_N), is generated as follows: a distribution over the T topics is sampled from a Dirichlet distribution, θ ∼ Dir(α); then, for i ∈ [N] := {1, . . . , N}, we sample a topic z_i ∼ Multinomial(θ) and word w_i ∼ Ψ_{z_i}, where Ψ_t, t ∈ [T], are distributions on a dictionary of words. Assume that the word distributions Ψ_t are fixed (e.g., they have been previously estimated), and let l_{it} = log Pr(w_i | z_i = t) be the log probability of the ith word being generated from topic t. After integrating out the topic distribution vector, the joint distribution of the topic assignments conditioned on the words w is given by

\Pr(z_1, \dots, z_N \mid w) \propto \frac{\Gamma\left(\sum_t \alpha_t\right)}{\prod_t \Gamma(\alpha_t)} \cdot \frac{\prod_t \Gamma(n_t + \alpha_t)}{\Gamma\left(\sum_t \alpha_t + N\right)} \prod_{i=1}^{N} \Pr(w_i \mid z_i),    (1)

where n_t is the total number of words assigned to topic t.

In this section, we focus on the inference problem of finding the most likely assignment of topics to words, i.e. the maximum a posteriori (MAP) assignment. This has many possible applications. For example, it can be used to cluster the words of a document, or as part of a larger system such as part-of-speech tagging [Li and McCallum, 2005]. More broadly, for many classification tasks involving topic models it may be useful to have word-level features for whether a particular word was assigned to a given topic. From both an algorithm design and complexity analysis point of view, this MAP problem has the additional advantage of involving only discrete random variables.

Taking the logarithm of Eq. 1 and ignoring constants, finding the MAP assignment is seen to be equivalent to the following combinatorial optimization problem:

\Phi = \max_{x_{it} \in \{0,1\},\, n_t} \; \sum_t \log \Gamma(n_t + \alpha_t) + \sum_{i,t} x_{it} l_{it}    (2)

subject to \sum_t x_{it} = 1, \qquad \sum_i x_{it} = n_t,

where the indicator variable x_{it} = I[z_i = t] denotes the assignment of word i to topic t.
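As a concrete illustration of Eq. 2, the sketch below (ours, assuming NumPy/SciPy; the function and variable names are not from the paper) evaluates the collapsed objective Φ for a candidate assignment z by accumulating the log-gamma prior term and the log-likelihood term.

import numpy as np
from scipy.special import gammaln  # log Gamma

def word_lda_objective(z, log_lik, alpha):
    # z: length-N integer array with z[i] in {0, ..., T-1}
    # log_lik: N x T matrix of l_it = log Pr(w_i | z_i = t)
    # alpha: length-T vector of Dirichlet hyperparameters
    N, T = log_lik.shape
    counts = np.bincount(z, minlength=T)        # n_t = number of words assigned to topic t
    prior_term = gammaln(counts + alpha).sum()  # sum_t log Gamma(n_t + alpha_t)
    lik_term = log_lik[np.arange(N), z].sum()   # sum_{i,t} x_it * l_it
    return prior_term + lik_term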

2.1 Exact maximization for small number of topics

Suppose a document only uses τ ≪ T topics. That is, T could be large, but we are guaranteed that the MAP assignment for a document uses at most τ different topics. In this section, we show how we can use this knowledge to efficiently find a maximizing assignment of words to topics. It is important to note that we only restrict the maximum number of topics per document, letting the Dirichlet prior and the likelihood guide the choice of the actual number of topics present.

We first observe that, if we knew the number of words assigned to each topic, finding the MAP assignment is easy. For t ∈ [T], let n*_t be the number of words assigned to topic t in the MAP assignment.


Figure 1: (Left) An LDA instance derived from a k-set packing instance. (Center) Plot of F(n_t) = log Γ(n_t + α) for various values of α. The x-axis varies n_t, the number of words assigned to topic t, and the y-axis shows F(n_t). (Right) Behavior of log Γ(n_t + α) as α → 0. The function is stable everywhere but at zero, where the reward for sparsity increases without bound.

Then, the MAP assignment x is found by solving the following optimization problem:

\max_{x_{it} \in \{0,1\}} \; \sum_{i,t} x_{it} l_{it}    (3)

subject to \sum_t x_{it} = 1, \qquad \sum_i x_{it} = n^*_t,

which is equivalent to weighted b-matching in a bipartite graph (the words are on one side, the topics on the other) and can be optimally solved in time O(bm^3), where b = max_t n^*_t = O(N) and m = N + T [Schrijver, 2003].
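For intuition, here is a minimal sketch (ours, assuming NumPy/SciPy) of solving Eq. 3 once the counts n*_t are fixed. It expands each topic t into n*_t identical "slots" and solves the resulting square assignment problem, a special case of the weighted b-matching mentioned above rather than the general solver referenced in Schrijver [2003].

import numpy as np
from scipy.optimize import linear_sum_assignment

def map_assignment_given_counts(log_lik, counts):
    # log_lik: N x T matrix of l_it; counts: length-T integer vector with counts.sum() == N.
    # Returns z maximizing sum_i l_{i, z_i} subject to |{i : z_i = t}| = counts[t].
    N, T = log_lik.shape
    slots = np.repeat(np.arange(T), counts)      # one column ("slot") per word a topic may receive
    weights = log_lik[:, slots]                  # N x N matrix of slot weights
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return slots[cols]                           # rows is 0..N-1 for a square matrix, so this is in word order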

We call (n_1, . . . , n_T) a valid partition when n_t ≥ 0 and Σ_t n_t = N. Using weighted b-matching, we can find a MAP assignment of words to topics by trying all (T choose τ) = Θ(T^τ) choices of τ topics and all possible valid partitions with at most τ non-zeros.

for all subsets A ⊆ [T] such that |A| = τ do
    for all valid partitions n = (n_1, n_2, . . . , n_T) such that n_t = 0 for t ∉ A do
        Φ_{A,n} ← Weighted-B-Matching(A, n, l) + Σ_t log Γ(n_t + α_t)
    end for
end for
return arg max_{A,n} Φ_{A,n}

There are at most N^{τ−1} valid partitions with τ non-zero counts. For each of these, we solve the b-matching problem to find the most likely assignment of words to topics that satisfies the cardinality constraints. Thus, the total running time is O((NT)^τ (N + τ)^3). This is polynomial when the number of topics τ appearing in a document is a constant.
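A direct, unoptimized transcription of this enumeration (ours; it reuses the two helpers sketched earlier and is not the authors' implementation) is:

import itertools
import numpy as np
from scipy.special import gammaln

def compositions(total, parts):
    # Yield all tuples of `parts` non-negative integers summing to `total`.
    if parts == 1:
        yield (total,)
        return
    for head in range(total + 1):
        for rest in compositions(total - head, parts - 1):
            yield (head,) + rest

def map_words_small_tau(log_lik, alpha, tau):
    # Exhaustive version of the loop above.
    N, T = log_lik.shape
    best_val, best_z = -np.inf, None
    for A in itertools.combinations(range(T), tau):
        for comp in compositions(N, tau):
            counts = np.zeros(T, dtype=int)
            counts[list(A)] = comp
            z = map_assignment_given_counts(log_lik, counts)   # defined in the earlier sketch
            val = gammaln(counts + alpha).sum() + log_lik[np.arange(N), z].sum()
            if val > best_val:
                best_val, best_z = val, z
    return best_val, best_z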

2.2 Inference is NP-hard for large numbers of topics

In this section, we show that probabilistic inference is NP-hard in the general setting where a document may have a large number of topics in its MAP assignment. Let WORD-LDA(α) denote the decision problem of whether Φ > V (see Eq. 2) for some V ∈ ℝ, where the hyperparameters satisfy α_t = α for all topics. We consider both α < 1 and α ≥ 1 because, as shown in Figure 1, the optimization problem is qualitatively different in these two cases.

Theorem 1. WORD-LDA(α) is NP-hard for all α > 0.

Proof. Our proof is a straightforward generalization of the approach used by Halperin and Karp [2005] to show that the minimum entropy set cover problem is hard to approximate.

The proof is by reduction from k-set packing (k-SP), for k ≥ 3. In k-SP, we are given a collection of k-element sets over some universe of elements Σ with |Σ| = n. The goal is to find the largest collection of disjoint sets. There exists a constant c < 1 such that it is NP-hard to decide whether a k-SP instance has (i) a solution with n/k disjoint sets covering all elements (called a perfect matching), or (ii) at most cn/k disjoint sets (called a (cn/k)-matching).

We now describe how to construct an LDA inference problem from a k-SP instance. This requires specifying the words in the document, the number of topics, and the word log probabilities l_{it}. Let each element i ∈ Σ correspond to a word w_i, and let each set correspond to one topic. The document consists of all of the words (i.e., Σ). We assign uniform probability to the words in each topic, so that Pr(w_i | z_i = t) = 1/k for i ∈ t, and 0 otherwise. Figure 1 illustrates the resulting LDA model. The topics are on the top, and the words from the document are on the bottom. An edge is drawn between a topic (set) and a word (element) if the corresponding set contains that element.
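The construction is mechanical; a short sketch (ours, assuming NumPy) that builds the matrix of word log probabilities l_{it} from a k-SP instance follows.

import numpy as np

def ksp_to_word_lda(sets, n, k):
    # sets: list of k-element sets over elements {0, ..., n-1}; one topic per set.
    # Returns the n x T matrix of l_it = log Pr(w_i | z_i = t).
    T = len(sets)
    log_lik = np.full((n, T), -np.inf)     # zero probability outside the set
    for t, S in enumerate(sets):
        for i in S:
            log_lik[i, t] = -np.log(k)     # uniform over the k elements of set t
    return log_lik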

What remains is to show that we can solve some k-SP problem by using this reduction and solving a WORD-LDA(α) problem. For technical reasons involving α > 1, we require that k is sufficiently large. We will use the following result (the proof is given in the supplementary material).

Lemma 2. Let P be a k-SP instance for k > (1 + α)^2, and let P′ be the derived WORD-LDA(α) instance. There exist constants C_U and C_L < C_U such that, if there is a perfect matching in P, then Φ ≥ C_U. If, on the other hand, there is at most a (cn/k)-matching in P, then Φ < C_L.

Let P be a k-SP instance for k > (3 + α)^2, let P′ be the derived WORD-LDA(α) instance, and let C_U and C_L < C_U be as in Lemma 2. Then, by testing Φ < C_L and Φ > C_U, we can decide whether P has a perfect matching or at best a (cn/k)-matching. Hence k-SP reduces to WORD-LDA(α).

The bold lines in Figure 1 indicate the MAP assignment, which for this example corresponds to a perfect matching for the original k-set packing instance. More realistic documents would have significantly more words than topics used. Although this is not possible while keeping k = 3, since the MAP assignment always has τ ≥ N/k, we can instead reduce from a k-set packing problem with k ≫ 3. Lemma 2 shows that this is hard as well.

3 MAP inference of the topic distribution

In this section we consider the task of finding the mode of Pr(θ|w). This MAP problem involves integrating out the topic assignments, z_i, as opposed to the previously considered MAP problem of integrating out the topic distribution θ. We will see that the MAP topic distribution is not always well-defined, which will lead us to define and study alternative formulations. In particular, we give a precise characterization of the MAP problem as one of finding sparse topic distributions, and use this fact to give hardness results for several settings. We also show settings for which MAP inference is tractable.

There are many potential applications of MAP inference of the document's topic distribution. For example, the distribution may be used for topic-based information retrieval or as the feature vector for classification. As we will make clear later, this type of inference results in sparse solutions. Thus, the MAP topic distribution provides a compact summary of the document that could be useful for document summarization.

Let θ = (θ_1, . . . , θ_T). A straightforward application of Bayes' rule allows us to write the posterior density of θ given w as

\Pr(\theta \mid w) \propto \left( \prod_{t=1}^{T} \theta_t^{\alpha_t - 1} \right) \left( \prod_{i=1}^{N} \sum_{t=1}^{T} \theta_t \psi_{it} \right),    (4)

where ψ_{it} = Pr(w_i | z_i = t). Taking the logarithm of the posterior and ignoring constants, we obtain

\Phi(\theta) = \sum_{t=1}^{T} (\alpha_t - 1) \log(\theta_t) + \sum_{i=1}^{N} \log\left( \sum_{t=1}^{T} \theta_t \psi_{it} \right).    (5)


We will use the shorthand Φ(θ) = P(θ) + L(θ), where P(θ) = Σ_{t=1}^T (α_t − 1) log(θ_t) and L(θ) = Σ_{i=1}^N log(Σ_{t=1}^T ψ_{it} θ_t).

To find the MAP θ, we maximize (5) subject to the constraints that Σ_{t=1}^T θ_t = 1 and θ_t ≥ 0.

Unfortunately, this maximization problem can be degenerate. In particular, note that if θ_t = 0 for α_t < 1, then the corresponding term in P(θ) will take the value ∞, overwhelming the likelihood term. Thus, any feasible solution with the above property could be considered 'optimal'.

A similar problem arises during the maximum-likelihood estimation of a normal mixture model, where the likelihood diverges to infinity as the variance of a mixture component with a single data point approaches zero [Biernacki and Chretien, 2003; Kiefer and Wolfowitz, 1956]. In practice, one can enforce a lower bound on the variance or penalize such configurations. Here we consider a similar tactic.

For ε > 0, let TOPIC-LDA(ε) denote the optimization problem

\max_{\theta} \; \Phi(\theta) \quad \text{subject to} \quad \sum_t \theta_t = 1, \;\; \varepsilon \le \theta_t \le 1.    (6)

For ε = 0, we will denote the corresponding optimization problem by TOPIC-LDA. When α_t = α, i.e. the prior distribution on the topic distribution is a symmetric Dirichlet, we write TOPIC-LDA(ε, α) for the corresponding optimization problem. In the following sections we will study the structure and hardness of TOPIC-LDA, TOPIC-LDA(ε), and TOPIC-LDA(ε, α).

3.1 Polynomial-time inference for large hyperparameters (αt ≥ 1)

When α_t ≥ 1, Eq. 5 is a concave function of θ. As a result, we can efficiently find θ* using a number of techniques from convex optimization. Note that this is in contrast to the MAP inference problem discussed in Section 2, which we showed was hard for all choices of α.

Since we are optimizing over the simplex (θ must be non-negative and sum to 1), we can apply the exponentiated gradient method [Kivinen and Warmuth, 1995]. Initializing θ^0 to be the uniform vector, the update for time s is given by

\theta_t^{s+1} = \frac{\theta_t^s \exp(\eta \nabla_t^s)}{\sum_{t'} \theta_{t'}^s \exp(\eta \nabla_{t'}^s)}, \qquad \nabla_t^s = \frac{\alpha_t - 1}{\theta_t^s} + \sum_{i=1}^{N} \frac{\psi_{it}}{\sum_{t'=1}^{T} \theta_{t'}^s \psi_{it'}},    (7)

where η is the step size and ∇^s is the gradient.

When α = 1 the prior disappears altogether and this algorithm simply corresponds to optimizing the likelihood term. When α ≫ 1, the prior corresponds to a bias toward a particular topic distribution θ.
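A minimal sketch of these updates (ours, assuming NumPy; the step size and iteration count are illustrative choices, not values from the paper) keeps θ strictly positive and on the simplex at every iteration:

import numpy as np

def map_topic_distribution(psi, alpha, eta=0.01, iters=2000):
    # psi: N x T matrix with psi[i, t] = Pr(w_i | z_i = t); alpha: length-T vector with alpha_t >= 1.
    N, T = psi.shape
    theta = np.full(T, 1.0 / T)                 # start from the uniform vector
    for _ in range(iters):
        mix = psi @ theta                       # length-N vector sum_t theta_t psi_it
        grad = (alpha - 1.0) / theta + (psi / mix[:, None]).sum(axis=0)   # the gradient of Eq. (7)
        w = theta * np.exp(eta * grad)
        theta = w / w.sum()                     # multiplicative update, renormalized onto the simplex
    return theta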

3.2 Small hyperparameters encourage sparsity (α < 1)

On the other hand, when α_t < 1, the first term in Eq. 5 is convex whereas the second term is concave. This setting, of α much smaller than 1, occurs frequently in practice. For example, learning an LDA model on a large corpus of NIPS abstracts with T = 200 topics, we find that the learned hyperparameters range from α_t = 0.0009 to 0.135, with the median being 0.01. Although in this setting it is difficult to find the global optimum (we will make this precise in Theorem 6), one possibility for finding a local maximum is the Concave-Convex Procedure [Yuille and Rangarajan, 2003].

In this section we prove structural results about the TOPIC-LDA(ε, α) solution space when α < 1. These results illustrate that the Dirichlet prior encourages sparse MAP solutions: the topic distribution will be large on as few topics as necessary to explain every word of the document, and otherwise will be close to zero.

The following lemma shows that in any optimal solution to TOPIC-LDA(ε, α), for every word, there is at least one topic that both has large probability and gives non-trivial probability to this word. We use K(α, T, N) = e^{−3/α} N^{−1} T^{−1/α} to refer to the lower bound on the topic's probability.


Lemma 3. Let α < 1. All optimal solutions θ* to TOPIC-LDA(ε, α) have the following property: for every word i, θ*_{t̄} ≥ K(α, T, N), where t̄ = arg max_t ψ_{it} θ*_t.

Proof sketch. If ε ≥ K(α, T, N) the claim trivially holds. Assume for the purpose of contradiction that there exists a word i such that θ*_{t̄} < K(α, T, N), where t̄ = arg max_t ψ_{it} θ*_t.

Let Y denote the set of topics t ≠ t̄ such that θ*_t ≥ 2ε. Let β_1 = Σ_{t∈Y} θ*_t and β_2 = Σ_{t∉Y, t≠t̄} θ*_t. Note that β_2 < 2Tε. Consider

\theta_{\bar t} = \frac{1}{N}, \qquad \theta_t = \left( \frac{1 - \beta_2 - \frac{1}{N}}{\beta_1} \right) \theta^*_t \;\text{ for } t \in Y, \qquad \theta_t = \theta^*_t \;\text{ for } t \notin Y,\; t \neq \bar t.    (8)

It is easy to show that ∀t, θ_t ≥ ε, and Σ_t θ_t = 1. Finally, we show that Φ(θ) > Φ(θ*), contradicting the optimality of θ*. The full proof is given in the supplementary material.

Next, we show that if a topic is not sufficiently "used" then it will be given a probability very close to zero. By used, we mean that for at least one word, the topic is close in probability to that of the largest contributor to the likelihood of the word. To do this, we need to define the notion of the dynamic range of a word, given as κ_i = max_{t,t' : ψ_{it} > 0, ψ_{it'} > 0} ψ_{it}/ψ_{it'}. We let the maximum dynamic range be κ = max_i κ_i. Note that κ ≥ 1 and, for most applications, it is reasonable to expect κ to be small (e.g., less than 1000).

Lemma 4. Let α < 1, and let θ* be any optimal solution to TOPIC-LDA(ε, α). Suppose topic t̂ has θ*_{t̂} < (κN)^{−1} K(α, T, N). Then, θ*_{t̂} ≤ e^{1/(1−α)+2} ε.

Proof. Suppose for the purpose of contradiction that θ*_{t̂} > e^{1/(1−α)+2} ε. Consider θ defined as follows: θ_{t̂} = ε, and θ_t = ((1 − ε)/(1 − θ*_{t̂})) θ*_t for t ≠ t̂. We have:

\Phi(\theta) - \Phi(\theta^*) = (1-\alpha)\log\left(\frac{\theta^*_{\hat t}}{\varepsilon}\right) + (T-1)(1-\alpha)\log\left(\frac{1-\theta^*_{\hat t}}{1-\varepsilon}\right) + \sum_{i=1}^{N} \log\left(\frac{\sum_t \theta_t \psi_{it}}{\sum_t \theta^*_t \psi_{it}}\right).

Using the fact that log(1 − z) ≥ −2z for z ∈ [0, 1/2], it follows that

(T-1)(1-\alpha)\log\left(\frac{1-\theta^*_{\hat t}}{1-\varepsilon}\right) \ge (T-1)(1-\alpha)\log\left(1-\theta^*_{\hat t}\right) \ge 2(T-1)(\alpha-1)\,\theta^*_{\hat t}    (9)
\ge 2(T-1)(\alpha-1)(\kappa N)^{-1} K(\alpha,T,N) \ge 2(\alpha-1).    (10)

We have θ_t ≥ θ*_t for t ≠ t̂, and so

\frac{\sum_t \theta_t \psi_{it}}{\sum_t \theta^*_t \psi_{it}} = \frac{\sum_{t \neq \hat t} \theta_t \psi_{it} + \varepsilon \psi_{i\hat t}}{\sum_{t \neq \hat t} \theta^*_t \psi_{it} + \theta^*_{\hat t} \psi_{i\hat t}} \ge \frac{\sum_{t \neq \hat t} \theta^*_t \psi_{it}}{\sum_{t \neq \hat t} \theta^*_t \psi_{it} + \theta^*_{\hat t} \psi_{i\hat t}}.    (11)

Recall from Lemma 3 that, for each word i and t̄ = arg max_t ψ_{it} θ*_t, we have θ*_{t̄} ≥ K(α, T, N). Necessarily t̄ ≠ t̂. Therefore, using the fact that log(1/(1 + z)) ≥ −z,

\log\left(\frac{\sum_{t \neq \hat t} \theta^*_t \psi_{it}}{\sum_{t \neq \hat t} \theta^*_t \psi_{it} + \theta^*_{\hat t} \psi_{i\hat t}}\right) \ge -\frac{\theta^*_{\hat t} \psi_{i\hat t}}{\sum_{t \neq \hat t} \theta^*_t \psi_{it}} \ge -\frac{(\kappa N)^{-1} K(\alpha,T,N)\, \psi_{i\hat t}}{K(\alpha,T,N)\, \psi_{i\bar t}} \ge -\frac{1}{N}.    (12)

Thus, Φ(θ) − Φ(θ*) > (1 − α) log e^{1/(1−α)+2} + 2(α − 1) − 1 = 0, completing the proof.

Finally, putting together what we showed in the previous two lemmas, we conclude that all optimal solutions to TOPIC-LDA(ε, α) either have θ_t large or small, but not in between (that is, we have demonstrated a gap). We have the immediate corollary:

Theorem 5. For α < 1, all optimal solutions to TOPIC-LDA(ε, α) have θ_t ≤ e^{1/(1−α)+2} ε or θ_t ≥ κ^{−1} e^{−3/α} N^{−2} T^{−1/α}.
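This gap is easy to exploit operationally; a small sketch (ours, assuming NumPy and the two thresholds from the theorem) classifies the components of a candidate θ and returns the induced support set:

import numpy as np

def appreciable_topics(theta, alpha, kappa, N, T, eps):
    # Thresholds from Theorem 5: components are either near eps or bounded well away from it.
    low = np.exp(1.0 / (1.0 - alpha) + 2.0) * eps
    high = (1.0 / kappa) * np.exp(-3.0 / alpha) * N**(-2.0) * T**(-1.0 / alpha)
    assert low < high, "the two regimes only separate when eps is small enough"
    return np.where(theta >= high)[0]            # topics carrying appreciable mass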


3.3 Inference is NP-hard for small hyperparameters (α < 1)

The previous results characterize optimal solutions to TOPIC-LDA(ε, α) and highlight the fact that optimal solutions are sparse. In this section we show that these same properties can be the source of computational hardness during inference. In particular, it is possible to encode set cover instances as TOPIC-LDA(ε, α) instances, where the set cover corresponds to those topics assigned appreciable probability.

Theorem 6. TOPIC-LDA(ε, α) is NP-hard for ε ≤ K(α, T, N)^{T/(1−α)} T^{N/(1−α)} and α < 1.

Proof. Consider a set cover instance consisting of a universe of elements and a family of sets, where we assume for convenience that the minimum cover is neither a singleton, all but one of the family of sets, nor the entire family of sets, and that there are at least two elements in the universe. As with our previous reduction, we have one topic per set and one word in the document for each element. We let Pr(w_i | z_i = t) = 0 when element w_i is not in set t, and a constant otherwise (we make every topic have the uniform distribution over the same number of words, some of which may be dummy words not appearing in the document). Let S_i ⊆ [T] denote the set of topics to which word i belongs. Then, up to additive constants, we have P(θ) = −(1 − α) Σ_{t=1}^T log(θ_t) and L(θ) = Σ_{i=1}^N log(Σ_{t∈S_i} θ_t).
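For concreteness, a sketch of this construction (ours, assuming NumPy; the padding with dummy words follows the parenthetical remark above) builds the matrix ψ used in Eq. 4 from a set cover instance:

import numpy as np

def set_cover_to_topic_lda(sets, n):
    # sets: list of subsets of {0, ..., n-1} (one topic per set); n words, one per element.
    # Dummy words only pad each topic to a common size; since they never occur in the
    # document, they do not appear as rows of psi.
    T = len(sets)
    width = max(len(S) for S in sets)     # every topic is uniform over `width` words
    psi = np.zeros((n, T))
    for t, S in enumerate(sets):
        for i in S:
            psi[i, t] = 1.0 / width
    return psi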

Let C_{θ*} ⊆ [T] be those topics t ∈ [T] such that θ*_t ≥ K(α, T, N), where θ* is an optimal solution to TOPIC-LDA(ε, α). It immediately follows from Lemma 3 that C_{θ*} is a cover.

Suppose for the purpose of contradiction that C_{θ*} is not a minimal cover. Let C be a minimal cover, and let θ_t = ε for t ∉ C and θ_t = (1 − ε(T − |C|))/|C| > 1/T otherwise. We will show that Φ(θ) > Φ(θ*), contradicting the optimality of θ*, and thus proving that C_{θ*} is in fact minimal. This suffices to show that TOPIC-LDA(ε, α) is NP-hard in this regime.

For all θ in the simplex, we have Σ_i log(max_{t∈S_i} θ_t) ≤ L(θ) ≤ 0. Thus it follows that L(θ*) − L(θ) ≤ N log T. Likewise, using the assumption that T ≥ |C| + 1, we have

\frac{P(\theta) - P(\theta^*)}{(1-\alpha)} \ge -(T - |C|)\log\varepsilon - (|C|+1)\log K(\alpha,T,N) + (T - |C| - 1)\log\varepsilon    (13)
\ge \log\frac{1}{\varepsilon} - T \log K(\alpha,T,N),    (14)

where we have conservatively only included the terms t ∉ C for P(θ) and taken θ*_t ∈ {ε, K(α, T, N)} with |C| + 1 terms taking the latter value. It follows that

\bigl(P(\theta) + L(\theta)\bigr) - \bigl(P(\theta^*) + L(\theta^*)\bigr) > (1-\alpha)\log\frac{1}{\varepsilon} - \bigl(T \log K(\alpha,T,N) + N \log T\bigr).    (15)

This is greater than 0 precisely when (1 − α) log(1/ε) > log(T^N K(α, T, N)^T).

Note that although ε is exponentially small in N and T, the size of its representation in binary is polynomial in N and T, and thus polynomial in the size of the set cover instance.

It can be shown that as ε → 0, the solutions to TOPIC-LDA(ε, α) become degenerate, concentrating their support on the minimal set of topics C ⊆ [T] such that ∀i, ∃t ∈ C s.t. ψ_{it} > 0. A generalization of this result holds for TOPIC-LDA(ε) and suggests that, while it may be possible to give a more sensible definition of TOPIC-LDA as the set of solutions for TOPIC-LDA(ε) as ε → 0, these solutions are unlikely to be of any practical use.

4 Sampling from the posterior

The previous sections of the paper focused on MAP inference problems. In this section, we study the problem of marginal inference in LDA.

Theorem 7. For α > 1, one can approximately sample from Pr(θ | w) in polynomial time.

Proof sketch. The density given in Eq. 4 is log-concave when α ≥ 1. The algorithm given in Lovasz and Vempala [2006] can be used to approximately sample from the posterior.


Although polynomial, it is not clear whether the algorithm given in Lovasz and Vempala [2006], based on random walks, is of practical interest (e.g., the running time bound has a constant of 10^30). However, we believe our observation provides insight into the complexity of sampling when α is not too small, and may be a starting point towards explaining the empirical success of using Markov chain Monte Carlo to do inference in LDA.
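For intuition only, the sketch below (ours, assuming NumPy/SciPy) runs a plain independence Metropolis-Hastings sampler with a Dirichlet proposal against the density of Eq. 4; it is not the Lovasz-Vempala random-walk algorithm and carries no polynomial-time guarantee.

import numpy as np
from scipy.stats import dirichlet

def log_posterior(theta, psi, alpha):
    # Unnormalized log density of Eq. (4).
    return ((alpha - 1.0) * np.log(theta)).sum() + np.log(psi @ theta).sum()

def mh_sample_theta(psi, alpha, prop_conc, iters=5000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    T = psi.shape[1]
    theta = np.full(T, 1.0 / T)
    samples = []
    for _ in range(iters):
        prop = rng.dirichlet(prop_conc)                   # independence proposal
        log_acc = (log_posterior(prop, psi, alpha) - log_posterior(theta, psi, alpha)
                   + dirichlet.logpdf(theta, prop_conc) - dirichlet.logpdf(prop, prop_conc))
        if np.log(rng.uniform()) < log_acc:
            theta = prop
        samples.append(theta)
    return np.array(samples)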

Next, we show that when α is extremely small, it is NP-hard to sample from the posterior. We again reduce from set cover. The intuition behind the proof is that, when α is small enough, an appreciable amount of the probability mass corresponds to the sparsest possible θ vectors where the supported topics together cover all of the words. As a result, we could directly read off the minimal set cover from the posterior marginals E[θ_t | w].

Theorem 8. When α < ((4N + 4) T^N Γ(N))^{−1}, it is NP-hard to approximately sample from Pr(θ | w), under randomized reductions.

The full proof can be found in the supplementary material. Note that it is likely that one would need an extremely large and unusual corpus to learn an α so small. Our results illustrate a large gap in our knowledge about the complexity of sampling as a function of α. We feel that tightening this gap is a particularly exciting open problem.

5 Discussion

In this paper, we have shown that the complexity of MAP inference in LDA strongly depends on the effective number of topics per document. When a document is generated from a small number of topics (regardless of the number of topics in the model), WORD-LDA can be solved in polynomial time. We believe this is representative of many real-world applications. On the other hand, if a document can use an arbitrary number of topics, WORD-LDA is NP-hard. The choice of hyperparameters for the Dirichlet does not affect these results.

We have also studied the problem of computing MAP estimates and expectations of the topic distribution. In the former case, the Dirichlet prior enforces sparsity in a sense that we make precise. In the latter case, we show that extreme parameterizations can similarly cause the posterior to concentrate on sparse solutions. In both cases, this sparsity is shown to be a source of computational hardness.

In related work, Seppanen et al. [2003] suggest a heuristic for inference that is also applicable to LDA: if there exists a word that can only be generated with high probability from one of the topics, then the corresponding topic must appear in the MAP assignment whenever that word appears in a document. Miettinen et al. [2008] give a hardness reduction and greedy algorithm for learning topic models. Although the models they consider are very different from LDA, some of the ideas may still be applicable. More broadly, it would be interesting to consider the complexity of learning the per-topic word distributions Ψ_t.

Our paper suggests a number of directions for future study. First, our exact algorithms can be used to evaluate the accuracy of approximate inference algorithms, for example by comparing to the MAP of the variational posterior. On the algorithmic side, it would be interesting to improve the running time of the exact algorithm from Section 2.1. Also, note that we did not give an analogous exact algorithm for the MAP topic distribution when the posterior has support on only a small number of topics. In this setting, it may be possible to find this set of topics by trying all S ⊆ [T] of small cardinality and then doing a (non-uniform) grid search over the topic distribution restricted to support S.

Finally, our structural results on the sparsity induced by the Dirichlet prior draw connections between inference in topic models and sparse signal recovery. We proved that the MAP topic distribution has, for each topic t, either θ_t ≈ ε or θ_t bounded below by some value (much larger than ε). Because of this gap, we can approximately view the MAP problem as searching for a set corresponding to the support of θ. Our work motivates the study of greedy algorithms for MAP inference in topic models, analogous to those used for set cover. One could even consider learning algorithms that use this greedy algorithm within the inner loop [Krause and Cevher, 2010].


Acknowledgments D.M.R. is supported by a Newton International Fellowship.

References

C. Biernacki and S. Chretien. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Statist. Probab. Lett., 61(4):373–382, 2003. ISSN 0167-7152.

D. Blei and J. McAuliffe. Supervised topic models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Adv. in Neural Inform. Processing Syst. 20, pages 121–128. MIT Press, Cambridge, MA, 2008.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003. ISSN 1532-4435.

T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. Natl. Acad. Sci. USA, 101(Suppl. 1):5228–5235, 2004. doi: 10.1073/pnas.0307752101.

E. Halperin and R. M. Karp. The minimum-entropy set cover problem. Theor. Comput. Sci., 348(2):240–250, 2005. ISSN 0304-3975. doi: 10.1016/j.tcs.2005.09.015.

J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27:887–906, 1956. ISSN 0003-4851.

J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inform. and Comput., 132, 1995.

A. Krause and V. Cevher. Submodular dictionary selection for sparse representation. In Proc. Int. Conf. on Machine Learning (ICML), 2010.

S. Lacoste-Julien, F. Sha, and M. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Adv. in Neural Inform. Processing Syst. 21, pages 897–904. 2009.

W. Li and A. McCallum. Semi-supervised sequence modeling with syntactic topic models. In Proc. of the 20th Nat. Conf. on Artificial Intelligence, volume 2, pages 813–818. AAAI Press, 2005.

L. Lovasz and S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Proc. of the 47th Ann. IEEE Symp. on Foundations of Comput. Sci., pages 57–68. IEEE Computer Society, 2006. ISBN 0-7695-2720-5.

P. Miettinen, T. Mielikainen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348–1362, 2008.

I. Mukherjee and D. M. Blei. Relative performance guarantees for approximate inference in latent Dirichlet allocation. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Adv. in Neural Inform. Processing Syst. 21, pages 1129–1136. 2009.

I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 569–577, New York, NY, USA, 2008. ACM.

A. Schrijver. Combinatorial Optimization. Polyhedra and Efficiency. Vol. A, volume 24 of Algorithms and Combinatorics. Springer-Verlag, Berlin, 2003. ISBN 3-540-44389-4. Paths, flows, matchings, Chapters 1–38.

J. K. Seppanen, E. Bingham, and H. Mannila. A simple algorithm for topic identification in 0-1 data. In Proc. of the 7th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pages 423–434. Springer-Verlag, 2003.

Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Adv. in Neural Inform. Processing Syst. 19, volume 19, 2007.

A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Comput., 15:915–936, April 2003. ISSN 0899-7667.


Supplementary Materials for Complexity of Inference in Latent Dirichlet Allocation

A Proof of Lemma 2

Proof of Lemma 2. Assume there are T sets, each having k ≥ 3 elements, and let Φ be the optimal LDA objective. Define F(n) = log Γ(n + α). Since l_{it} is constant across all topics that can generate word i, the linear term in Eq. 2 will be a constant K. First, note that, if there is a perfect matching,

\Phi \ge \frac{n}{k} F(k) + \left(T - \frac{n}{k}\right) F(0) + K.    (16)

The F(0) term is the contribution of unused topics. Otherwise, assume that the best packing has γ ≤ cn/k sets, each with k elements. Then, by the properties of the log-gamma function,

\Phi \le \gamma F(k) + \frac{n - \gamma k}{k - 1} F(k - 1) + \left(T - \frac{n}{k}\right) F(0) + K,    (17)

where we assume, conservatively, that all of the remaining words are explained by topics assigned (k − 1) words. Also, since there was no perfect matching, there were at most T − n/k unused topics. Using our bound on γ, we have

\Phi \le \frac{cn}{k} F(k) + \frac{n - \frac{cn}{k} k}{k - 1} F(k - 1) + \left(T - \frac{n}{k}\right) F(0) + K    (18)
= \frac{cn}{k} F(k) + \frac{n(1 - c)}{k - 1} F(k - 1) + \left(T - \frac{n}{k}\right) F(0) + K    (19)
= \frac{dn}{k} F(k) + \left(T - \frac{n}{k}\right) F(0) + K,    (20)

where

d := c + (1 - c)\beta, \qquad \text{for } \beta := \frac{k}{F(k)} \cdot \frac{F(k-1)}{k-1}.    (21)

Note that F(k)/k → ∞ as k → ∞. Along with the convexity of F, it follows that there exists a k_0 such that β < 1 for all k > k_0. Note that k > (3 + α)^2 suffices. This implies that d < 1, which shows that there is a non-zero gap between the possible values of Φ.

Note that the maximum concentration objective, F(n) = n log n, satisfies the conditions on F and, in particular, we have β < 1 for k = 3.

B Derivation of MAP θ objective

\Pr(\theta \mid w) \propto \sum_{z_1, \dots, z_N} \Pr(\theta) \Pr(w_1, \dots, w_N, z_1, \dots, z_N \mid \theta)    (22)
= \Pr(\theta) \sum_{z_1, \dots, z_N} \prod_{i=1}^{N} \Pr(z_i \mid \theta) \Pr(w_i \mid z_i)    (23)
= \Pr(\theta) \prod_{i=1}^{N} \sum_{z_i} \Pr(z_i \mid \theta) \Pr(w_i \mid z_i)    (24)
= \Pr(\theta) \prod_{i=1}^{N} \sum_{t=1}^{T} \theta_t \Pr(w_i \mid z_i = t)    (25)
\propto \prod_{t=1}^{T} \theta_t^{\alpha_t - 1} \prod_{i=1}^{N} \sum_{t=1}^{T} \theta_t \Pr(w_i \mid z_i = t).    (26)


C Proof of Lemma 3

If ε ≥ K(α, T, N) the claim trivially holds. Assume for the purpose of contradiction that there exists a word i such that θ*_{t̄} < K(α, T, N), where t̄ = arg max_t ψ_{it} θ*_t.

Let Y denote the set of topics t ≠ t̄ such that θ*_t ≥ 2ε. Let β_1 = Σ_{t∈Y} θ*_t and β_2 = Σ_{t∉Y, t≠t̄} θ*_t. Note that β_2 < 2Tε. Consider θ defined as follows:

\theta_{\bar t} = \frac{1}{N}    (27)
\theta_t = \left( \frac{1 - \beta_2 - \frac{1}{N}}{\beta_1} \right) \theta^*_t \quad \text{for } t \in Y    (28)
\theta_t = \theta^*_t \quad \text{for } t \notin Y,\; t \neq \bar t.    (29)

Note that this construction implies the bound θ_t ≥ (1 − 2Tε − 1/N) θ*_t for t ∈ Y. Assuming N ≥ 4 and ε ≤ 1/(2TN), we have that θ_t ≥ (1/2) θ*_t ≥ ε for t ∈ Y, so θ is feasible.

We will show that Φ(θ) > Φ(θ*), contradicting the optimality of θ*. First we need the following upper bound, which uses the fact that θ*_{t̄} < 1/N:

\frac{1 - \beta_2 - \frac{1}{N}}{\beta_1} = \frac{1 - \beta_2 - \frac{1}{N}}{1 - \beta_2 - \theta^*_{\bar t}}    (30)
< 1.    (31)

Then, we have:

\frac{P(\theta)}{\alpha - 1} = \sum_{t=1}^{T} \log(\theta_t)    (32)
= \log\frac{1}{N} + \sum_{t \in Y} \log\left( \frac{1 - \beta_2 - \frac{1}{N}}{\beta_1} \right) + \sum_{t \neq \bar t} \log \theta^*_t    (33)
\le \log\frac{1}{N} + \sum_{t \neq \bar t} \log \theta^*_t.    (34)

Thus,

\frac{P(\theta) - P(\theta^*)}{\alpha - 1} \le \log\frac{1}{N} - \log \theta^*_{\bar t},    (35)

which when α < 1 gives the inequality:

P(\theta) - P(\theta^*) \ge (\alpha - 1)\left( \log\frac{1}{N} - \log \theta^*_{\bar t} \right)    (36)
= (1 - \alpha)\left( \log N + \log \theta^*_{\bar t} \right).    (37)

Moving on to the second term, we have:

L(\theta) = \sum_{j \in [N] : j \neq i} \log\left( \sum_t \psi_{jt} \theta_t \right) + \log\left( \sum_t \psi_{it} \theta_t \right)    (38)
\ge (N - 1) \log\left( \frac{1 - \beta_2 - \frac{1}{N}}{\beta_1} \right) + \sum_{j \in [N] : j \neq i} \log\left( \sum_t \psi_{jt} \theta^*_t \right) + \log\left( \frac{\psi_{i\bar t}}{N} \right)    (39)
\ge (N - 1) \log\left( 1 - \frac{2}{N} \right) + \sum_{j \in [N] : j \neq i} \log\left( \sum_t \psi_{jt} \theta^*_t \right) + \log\left( \frac{\psi_{i\bar t}}{N} \right).    (40)


L(\theta) - L(\theta^*) \ge (N - 1) \log\left( 1 - \frac{2}{N} \right) + \log\left( \frac{\psi_{i\bar t}}{N} \right) - \log\left( \sum_t \psi_{it} \theta^*_t \right)    (41)
\ge (N - 1) \log\left( 1 - \frac{2}{N} \right) + \log\left( \frac{\psi_{i\bar t}}{N} \right) - \log\left( T \psi_{i\bar t} \theta^*_{\bar t} \right)    (42)
= (N - 1)\left( \log(N - 2) - \log N \right) + \log\left( \frac{1}{N} \right) - \log\left( T \theta^*_{\bar t} \right)    (43)
\ge (N - 1)\left( -\frac{2}{N - 2} \right) + \log\left( \frac{1}{N} \right) - \log\left( T \theta^*_{\bar t} \right)    (44)
\ge -3 + \log\left( \frac{1}{N} \right) - \log\left( T \theta^*_{\bar t} \right),    (45)

where we used the lower bound log(N − 2) ≥ log N − 2/(N − 2), which arises from the concavity of the log(x) function, and again assumed N ≥ 4.

Finally, putting these two together, we have:

\Phi(\theta) - \Phi(\theta^*) \ge -3 - \alpha \log N - \log T - \alpha \log \theta^*_{\bar t}.    (46)

Plugging in θ*_{t̄} < K(α, T, N) results in Φ(θ) − Φ(θ*) > 0, giving the contradiction.

D Proof of Theorem 8

Here we reduce from the unique set cover problem, where we are guaranteed that there is only one minimum-size collection of sets that covers all elements. It can be shown that Unique Set Cover is NP-hard (under randomized reductions) by using standard reductions from Unique SAT to Vertex Cover, and then from Vertex Cover to Set Cover.

Consider a Unique Set Cover instance and our standard reduction to an LDA instance as described in earlier sections. In particular, let w = (w_1, . . . , w_N) denote a Unique Set Cover reduction instance and let C ⊆ [T] denote the unique minimum cover. Let S_i be those sets (topics) that cover element w_i. We will show that for sufficiently small hyperparameters α_t, we can determine whether a set (topic) t ∈ C by testing the value of E[θ_t | X], thus proving that the computation of the latter is NP-hard.

We have

p(\theta \mid X) \propto \prod_t \theta_t^{\alpha_t - 1} \prod_i \sum_{t' \in S_i} \theta_{t'}    (47)
= \sum_r \prod_t \theta_t^{\alpha_t - 1 + \eta_t(r)},    (48)

where the final summation is over elements r ∈ R := S_1 × · · · × S_N and η_t(r) := |{i : r_i = t}|. For r ∈ R, we write |r| to denote the number of topics t such that η_t(r) ≠ 0.

Let 𝒩 denote the set of sequences n = (n_1, . . . , n_T) such that n_t ≥ 0 and Σ_t n_t = N. For n ∈ 𝒩, define

Z(n) := \int \cdots \int \prod_t \theta_t^{\alpha_t - 1 + n_t} \, d\theta_1 \cdots d\theta_T = \frac{\prod_t \Gamma(\alpha_t + n_t)}{\Gamma(\bar\alpha + N)},    (49)

where ᾱ = Σ_t α_t. It follows that

\mathbb{E}[\theta_t \mid X] = \int \cdots \int \theta_t \cdot p(\theta \mid X) \, d\theta_1 \cdots d\theta_T    (50)
= \frac{1}{\sum_r Z(\eta(r))} \int \cdots \int \theta_t \sum_r \prod_\tau \theta_\tau^{\alpha_\tau - 1 + \eta_\tau(r)} \, d\theta_1 \cdots d\theta_T    (51)
= \frac{1}{\sum_r Z(\eta(r))} \sum_r Z(\eta(r, t)),    (52)

where η(r, t) is given by η(r, t)_τ = η_τ(r) + 1(τ = t). By the identity Γ(z + 1) = z Γ(z), we have Z(η(r, t)) = Z(η(r)) (α_t + η_t(r))/(ᾱ + N), and so it follows that

\mathbb{E}[\theta_t \mid X] = \sum_{r \in R} \bar{Z}(\eta(r)) \, \frac{\alpha_t + \eta_t(r)}{\bar\alpha + N},    (53)

where

\bar{Z}(n) := \frac{Z(n)}{\sum_{r' \in R} Z(\eta(r'))}.    (54)
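Eqs. (53)-(54) can be evaluated directly on tiny instances by enumerating R; a sketch (ours, assuming NumPy/SciPy and the reduction's uniform ψ within each S_i, so the word-likelihood factors cancel) is:

import itertools
import numpy as np
from scipy.special import gammaln, logsumexp

def exact_topic_means(word_topic_sets, alpha, T):
    # word_topic_sets: list of length N; entry i lists the topics S_i with psi_it > 0.
    # alpha: length-T vector. Returns the length-T vector of posterior means E[theta_t | X].
    N = len(word_topic_sets)
    log_Z, count_vecs = [], []
    for r in itertools.product(*word_topic_sets):           # enumerate R = S_1 x ... x S_N
        counts = np.bincount(np.array(r), minlength=T)      # eta_t(r)
        log_Z.append(gammaln(alpha + counts).sum())          # log Z(eta(r)) up to a common constant
        count_vecs.append(counts)
    log_Z = np.array(log_Z)
    weights = np.exp(log_Z - logsumexp(log_Z))              # normalized Z-bar(eta(r))
    count_vecs = np.array(count_vecs)
    return weights @ ((alpha + count_vecs) / (alpha.sum() + N))   # Eq. (53)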

For c ∈ [T], let R_c denote the set of those r ∈ R such that |r| = c. Recall that C ⊆ [T] is the unique minimum set cover associated with the reduction X. Thus, for t ∈ C,

\mathbb{E}[\theta_t \mid X] = \sum_{r \in R_{|C|}} \bar{Z}(\eta(r)) \, \frac{\alpha_t + \eta_t(r)}{\bar\alpha + N} + \sum_{r \in R \setminus R_{|C|}} \bar{Z}(\eta(r)) \, \frac{\alpha_t + \eta_t(r)}{\bar\alpha + N}    (55)
\ge \frac{\alpha_t + 1}{\bar\alpha + N} \sum_{r \in R_{|C|}} \bar{Z}(\eta(r)),    (56)

where we have used the observation that η_t(r) ≥ 1 for r ∈ R_{|C|}. Let β = Σ_{r∈R_{|C|}} Z̄(η(r)). If t ∉ C, then η_t(r) = 0 for r ∈ R_{|C|} and so we have E[θ_t | X] ≤ β α_t/(ᾱ + N) + (1 − β). It follows that if

\beta > \frac{1}{2}\left( 1 + \frac{\bar\alpha + N}{\bar\alpha + N + 1} \right)    (57)

then the minimum cover C contains a topic t if and only if E[θ_t | X] ≥ (1/4)(1 + 3(ᾱ + N)/(ᾱ + N + 1)) (α_t + 1)/(ᾱ + N), and, moreover, we can determine the minimum cover from a bound on β and polynomial approximations to the marginal distributions of the components of θ. We will take α_t = α henceforth, and show that for α small enough, the bound (57) indeed holds.

Let n, n′ ∈ 𝒩 be topic counts associated with the minimal cover and some non-minimal cover, respectively. That is, let n = η(r) for some r ∈ R_{|C|} and let n′ = η(r′) for some r′ ∈ R_k and k > |C|. We will bound Z(n′)/Z(n) in order to bound β. We have

\prod_t \Gamma(\alpha + n_t) \ge \Gamma(\alpha)^{T - |C|} \, \Gamma(\alpha + 1)^{|C|},    (58)

whereas

\prod_t \Gamma(\alpha + n'_t) \le \Gamma(\alpha)^{T - |C| - 1} \, \Gamma(\alpha + 1)^{|C|} \, \Gamma(\alpha + N - |C|).    (59)

Therefore,

\frac{Z(n')}{Z(n)} \le \frac{\Gamma(\alpha)^{T - |C| - 1} \, \Gamma(\alpha + 1)^{|C|} \, \Gamma(\alpha + N - |C|)}{\Gamma(\alpha)^{T - |C|} \, \Gamma(\alpha + 1)^{|C|}}    (60)
= \frac{\Gamma(\alpha + N - |C|)}{\Gamma(\alpha)}.    (61)

By the convexity of Γ(1 + c) in c, we have Γ(α) ≥ α^{−1} − γ, where γ ≈ 0.577 is the Euler constant. Therefore,

\frac{Z(n')}{Z(n)} \le \frac{\Gamma(\alpha + N - 1)}{\alpha^{-1} - \gamma} =: \kappa(\alpha).    (62)

Then by conservatively assuming that there is only one responsibility corresponding to the minimum cover, we have that

\beta \ge \frac{Z(n)}{Z(n) + T^N \kappa(\alpha) Z(n)}.    (63)


Therefore, the bound (57) is achieved when

\kappa(\alpha) \le \frac{1}{T^N (2\bar\alpha + 2N + 1)}.    (64)

In particular, when N, T ≥ 2 and

\alpha^{-1} > 2 T^N \Gamma(N) (2N + 2)    (65)

the marginal expectations can be used to read off the unique minimal set cover.
