
Latent Dirichlet Allocation: Towards a Deeper Understanding

Colorado Reed

January 2012

Abstract

The aim of this tutorial is to introduce the reader to Latent Dirichlet Allocation (LDA) for topic modeling. This tutorial is not all-inclusive and should be accompanied/cross-referenced with Blei et al. (2003). The unique aspect of this tutorial is that I provide a full pseudo-code implementation of variational expectation-maximization LDA and an R code implementation at http://highenergy.physics.uiowa.edu/~creed/#code. The R code is arguably the simplest variational expectation-maximization LDA implementation I've come across. Unfortunately, the simple implementation makes it very slow and unrealistic for actual application, but it's designed to serve as an educational tool.

Contents

1 Prerequisites

2 Introduction

3 Latent Dirichlet Allocation
    3.1 Higher-level Details
    3.2 Formal details and LDA inference
        3.2.1 Variational Inference for LDA

4 But how does LDA work?

5 Further Study

6 Appendix: EM Algorithm Refresher

1 Prerequisites

This tutorial is most useful if you have the following background:


• basic background in probability, statistics, and inference, i.e. understand Bayes' rule and the concept of statistical inference

• understand the Dirichlet distribution

• have a basic understanding of probabilistic graphical models

• understand the expectation-maximization (EM) algorithm

• familiarity with the Kullback-Leibler (KL) divergence will be moderately helpful

If you do not have some or all of the above background, this tutorial can still be helpful. In the text I mention specific resources the interested reader can use to acquire the necessary background.

2 Introduction

In many different fields we are faced with an overwhelming amount of information: think Wikipedia articles, blogs, Flickr images, astronomical survey data, <insert some problem from your area of research here>, and we need algorithmic tools to organize, search, and understand this information. Topic modeling is a method for analyzing large quantities of unlabeled data. For our purposes, a topic is a probability distribution over a collection of words, and a topic model is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics, i.e. a generative model. The central goal of a topic model is to provide a "thematic summary" of a collection of documents. In other words, it answers the question: what themes are these documents discussing? A collection of news articles could, for example, discuss political, sports, and business related themes.

3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is arguably the most popular topic model in application; it is also the simplest. Let's examine the generative model for LDA, then I'll discuss inference techniques and provide some [pseudo]code and simple examples that you can try in the comfort of your home.

3.1 Higher-level Details

First and foremost, LDA provides a generative model that describes how the documents in a dataset were created.¹ In this context, a dataset is a collection of D documents. But what is a document? It's a collection of words. So our generative model describes how each document obtains its words. Initially, let's assume we know K topic distributions for our dataset, meaning K multinomials containing V elements each, where V is the number of terms in our corpus. Let βi represent the multinomial for the i-th topic, where the size of βi is V: |βi| = V. Given these distributions, the LDA generative process is as follows:

¹ Not literally, of course; this is a simplification of how the documents were actually created.

1. For each document:

(a) randomly choose a distribution over topics (a multinomial of length K)

(b) for each word in the document:

(i) Probabilistically draw one of the K topics from the distribution over topics obtained in (a), say topic βj

(ii) Probabilistically draw one of the V words from βj

This generative model emphasizes that documents contain multiple topics. For instance, a health article might have words drawn from a topic related to seasons, such as winter, and words drawn from a topic related to illnesses, such as flu. Step (a) reflects that each document contains topics in different proportions; e.g., one document may contain a lot of words drawn from the topic on seasons and no words drawn from the topic about illnesses, while a different document may have an equal number of words drawn from both topics. Step (i) reflects that each individual word in the document is drawn from one of the K topics in proportion to the document's distribution over topics obtained in Step (a). The selection of each word in Step (ii) then depends on the distribution over the V words in our vocabulary determined by the selected topic, βj. Note that the generative model does not make any assumptions about the order of the words in the documents; this is known as the bag-of-words assumption.
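To make the generative story concrete, here is a minimal R sketch (my own illustration, not the accompanying implementation) that generates a single toy document given assumed inputs: a K × V matrix beta whose rows are the topic multinomials, a scalar Dirichlet parameter alpha, and a document length N.

# Minimal sketch of the LDA generative process for one document.
# Assumed inputs: beta (K x V matrix, rows sum to 1), alpha (scalar), N (word count).
generate_document <- function(beta, alpha, N) {
  K <- nrow(beta)
  # (a) draw the document's distribution over topics from a uniform Dirichlet,
  #     via the standard gamma-draw construction (base R has no Dirichlet sampler)
  g     <- rgamma(K, shape = alpha)
  theta <- g / sum(g)
  words <- integer(N)
  for (n in seq_len(N)) {
    z        <- sample(K, 1, prob = theta)               # (i) draw a topic for this word
    words[n] <- sample(ncol(beta), 1, prob = beta[z, ])  # (ii) draw a word from that topic
  }
  words  # vector of vocabulary indices (a bag of words; order is irrelevant)
}

# Example: 2 topics over a 4-term vocabulary
beta <- rbind(c(0.5, 0.5, 0.0, 0.0),
              c(0.0, 0.0, 0.5, 0.5))
generate_document(beta, alpha = 1, N = 10)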

The central goal of topic modeling is to automatically discover the topics from a collection of documents. Therefore our assumption that we know the K topic distributions is not very helpful; we must learn these topic distributions. This is accomplished through statistical inference, and I will discuss some of these techniques in the next section. Figure 1 visually displays the difference between a generative model (what I just described) and statistical inference (the process of learning the topic distributions).

3.2 Formal details and LDA inference

To formalize LDA, let's first restate the generative process in more detail (compare with the previous description):

1. For each document:

(a) draw a topic distribution θd ∼ Dir(α), i.e. a draw from a uniform (symmetric) Dirichlet distribution with scaling parameter α

(b) for each word in the document:

(i) Draw a specific topic zd,n ∼ multi(θd) where multi(·) is a multinomial

(ii) Draw a word wd,n ∼ βzd,n


Figure 1: Left: a visualization of the probabilistic generative process for three documents, i.e. DOC1 draws from Topic 1 with probability 1, DOC2 draws from Topic 1 with probability 0.5 and from Topic 2 with probability 0.5, and DOC3 draws from Topic 2 with probability 1. The topics are represented by β1:K (where K = 2 in this case) in Figure 2, and the per-document distributions over the two topics ({1, 0}, {0.5, 0.5}, {0, 1}) are represented by θd. Right: in the inferential problem we are interested in learning the topics and topic distributions. Image taken from Steyvers and Griffiths (2007).

Figure 2 displays the graphical model describing this generative process. A draw from a k-dimensional Dirichlet distribution returns a k-dimensional multinomial, θ in this case, whose k values must sum to one. The normalization requirement for θ ensures that θ lies on a (k − 1)-dimensional simplex and has the probability density:

$$
p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}. \tag{1}
$$

If you are not familiar with the Dirichlet distribution and Dirichlet sampling techniques, I encourage you to read Frigyik et al. (2010).
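As a quick companion to Eq. (1), the following R sketch (again my own, with hypothetical inputs) evaluates the log of the Dirichlet density and draws a sample by normalizing independent gamma variates:

# Log of the Dirichlet density in Eq. (1); lgamma avoids overflow of Gamma(.)
log_ddirichlet <- function(theta, alpha) {
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(theta))
}

# Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

alpha <- rep(0.5, 3)          # a "uniform" Dirichlet with scaling parameter 0.5
theta <- rdirichlet1(alpha)   # a point on the 2-simplex (components sum to one)
log_ddirichlet(theta, alpha)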

For reference purposes, let’s formalize some notation before moving on:

• w represents a word, and wv represents the vth word in the vocabulary, where wv = 1 and wu = 0 for u ≠ v (i.e. a word is a unit-basis vector); this superscript notation will be used with other variables as well.

• w represents a document (a vector of words) where w = (w1, w2, . . . , wN)

• α is the parameter of the Dirichlet distribution; technically α = (α1, α2, . . . , αk), but unless otherwise noted, all elements of α will be the same, and so in the included R code α is simply a number.

• z represents a vector of topics, where if the ith element of z is 1 then w draws from the ith topic


• β is a k × V word-probability matrix with one row per topic and one column per term, where βij = p(wj = 1|zi = 1)

The central inferential problem for LDA is determining the posterior distribution of the latent variables given the document:

$$
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}.
$$

This distribution is the crux of LDA, so let's break it down step by step for each individual document (the probability of the entire corpus is then obtained by multiplying the individual document probabilities, which assumes the documents are independent). First, we can decompose the numerator into a hierarchy by examining the graphical model:

$$
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \theta)\, p(\theta \mid \alpha).
$$

It should be clear that p(w|z, β) represents the probability of observing a document with N words given a topic vector of length N that assigns a topic to each word, with word probabilities taken from the k × V matrix β. We can therefore decompose this probability into the individual word probabilities and multiply them together, yielding:

$$
p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{n=1}^{N} \beta_{z_n, w_n}.
$$

Next, p(z|θ) is trivial once we note that p(zn|θ) = θi for the i such that zin = 1 (i.e. the ith component of zn is one), since after all, θ is just a multinomial. Finally, p(θ|α) is given by Eq. (1). Bringing this all together we have:

$$
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = \left( \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \prod_{n=1}^{N} \beta_{z_n, w_n}\, \theta_{z_n}
$$

where I am using θzn to represent the component of θ chosen for zn. Let's rephrase this probability using the superscript notation mentioned above; for convenience we run the products over the entire vocabulary of size V and all k topics, relying on the exponent wnj zni to pick out the word and topic actually used at each position:

$$
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = \left( \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \prod_{n=1}^{N} \prod_{i=1}^{k} \prod_{j=1}^{V} \left(\theta_i \beta_{i,j}\right)^{w_n^j z_n^i} \tag{2}
$$

We marginalize over θ and z in order to obtain the denominator (often referred to as the evidence) of the posterior given above:

$$
p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} \left(\theta_i \beta_{ij}\right)^{w_n^j} \right) d\theta
$$

Notice this expression is the same as Eq. (2) except that we integrate over θ and sum over z. Unfortunately, computing this distribution is intractable: the coupling between θ and β means that when we take the log of this function we are unable to separate θ and β. So while exact inference is not tractable, various approximate inference techniques can be used. Here I examine variational inference in some detail.


3.2.1 Variational Inference for LDA

The essential idea of variational inference is to use a simpler, convex distribution that obtains an adjustable lower bound on the log likelihood of the actual distribution. Variational parameters describe the family of simpler distributions used to determine a lower bound on the log likelihood and are optimized to create the tightest possible lower bound.

As discussed in Blei et al. (2003), an easy way to obtain a tractable family of lower bounds is to modify the original graphical model by removing the troublesome edges and nodes. In the LDA model in Figure 2 the coupling between θ and β makes the inference intractable. By dropping the problematic edges and nodes we obtain the simplified graphical model in Figure 3. This variational distribution has a posterior for each document of the form:

$$
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n). \tag{3}
$$

The next step is to formally specify an optimization problem to determine the values of γ and φ. For the sake of brevity I refer the reader to A.3 of Blei et al. (2003) for the derivation of the following results. In particular, finding an optimal lower bound on the log likelihood results in the following optimization problem:

$$
(\gamma^*, \phi^*) = \operatorname*{arg\,min}_{(\gamma, \phi)} \; D\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\big\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right) \tag{4}
$$

which is a minimization of the Kullback-Leibler (KL) divergence between the variational distribution and the actual posterior distribution. The KL divergence between the variational and actual distributions is the name of the game for variational Bayesian inference, and if you are not already familiar with KL divergence or the basics of variational inference, I encourage you to read sections 1.6 and 10.1 of Bishop (2006), respectively.
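If the KL divergence is new to you, the quantity being minimized in Eq. (4) is easiest to see in the discrete case; the tiny R illustration below (my own, with made-up distributions q and p) computes it directly:

# Discrete KL divergence D(q || p) = sum_i q_i * log(q_i / p_i).
kl_divergence <- function(q, p) {
  stopifnot(all(q >= 0), all(p > 0))
  nz <- q > 0                        # terms with q_i = 0 contribute nothing
  sum(q[nz] * log(q[nz] / p[nz]))
}

q <- c(0.7, 0.2, 0.1)
p <- c(0.3, 0.4, 0.3)
kl_divergence(q, p)   # > 0; equals 0 only when q and p are identical

Note that D(q||p) is not symmetric; variational inference specifically minimizes the divergence with the variational distribution q in the first slot.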

One method to minimize this function is to use an iterative fixed-point method, yielding the update equations:

$$
\phi_{ni} \propto \beta_{i w_n} \exp\left\{ \mathrm{E}_q[\log(\theta_i) \mid \gamma] \right\} \tag{5}
$$

$$
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni} \tag{6}
$$

where, as shown in Blei et al. (2003), the expectation in the φ update is computed as

$$
\mathrm{E}_q[\log(\theta_i) \mid \gamma] = \Psi(\gamma_i) - \Psi\!\left( \sum_{j=1}^{k} \gamma_j \right)
$$

where Ψ is the digamma function, the first derivative of the log Γ function, which is "simply" this crazy looking numerical approximation (a Taylor-style expansion applied after shifting the argument by 6):

$$
\Psi(x) \approx \left( \left( \left( \frac{0.0041\overline{6}}{(x+6)^2} - 0.003968 \right) \frac{1}{(x+6)^2} + 0.008\overline{3} \right) \frac{1}{(x+6)^2} - 0.08\overline{3} \right) \frac{1}{(x+6)^2} + \log(x+6) - \frac{1}{2(x+6)} - \sum_{i=0}^{5} \frac{1}{x+i}
$$

I've just bustled through quite a few equations, so let's pause for a moment and gain some intuition about these results. The Dirichlet update in Eq. (6) is a posterior Dirichlet given the expected observations taken under the variational distribution, E[zn|φn]. In other words, we iteratively update the variational Dirichlet topic parameter, γ, using the multinomial that best describes the observed words. The multinomial update essentially uses Bayes' theorem, p(zn|wn) ∝ p(wn|zn)p(zn), where p(wn|zn) = βiwn and p(zn) is approximated by the exponential of the expectation of its logarithm under the variational distribution. Finally, be sure to note that the variational parameters are actually a function of w, although this was not explicitly expressed in the above equations; the optimization of Eq. (4) relies on a specific w (see A.3 of Blei et al. (2003)).
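To see how Eqs. (5) and (6) play out in code, here is a minimal per-document E-step in R (a sketch with my own variable names, not the accompanying implementation); words holds the vocabulary indices of the document's words, beta is the K × V topic matrix, and digamma is base R's Ψ:

# Fixed-point updates of Eqs. (5) and (6) for a single document.
variational_estep <- function(words, beta, alpha, tol = 1e-6, max_iter = 100) {
  K       <- nrow(beta)
  N       <- length(words)
  phi     <- matrix(1 / K, nrow = N, ncol = K)   # phi[n, i] = q(z_n = i)
  gamma_d <- rep(alpha + N / K, K)               # initialized as alpha + N/K (cf. Algorithm 1)
  for (iter in seq_len(max_iter)) {
    gamma_old <- gamma_d
    for (n in seq_len(N)) {
      # Eq. (5): phi_ni proportional to beta_{i, w_n} * exp(Psi(gamma_i))
      phi[n, ] <- beta[, words[n]] * exp(digamma(gamma_d))
      phi[n, ] <- phi[n, ] / sum(phi[n, ])       # normalize to a multinomial
    }
    gamma_d <- alpha + colSums(phi)              # Eq. (6)
    if (sum(abs(gamma_d - gamma_old)) < tol) break
  }
  list(phi = phi, gamma = gamma_d)
}

The Ψ(Σj γj) term from the expectation is omitted in the update because it is constant across topics for a given word and cancels when φn is normalized.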

The fastidious reader may be wondering: "but how in the world do we find β or α? The previous optimization assumed we knew these parameters..." The answer is that we estimate β and α assuming we know φ and γ. Hopefully this approach sounds familiar to you: it's the Expectation-Maximization (EM) algorithm on the variational distribution! If you need a brief refresher on the EM algorithm please consult the appendix. If you need to learn the EM algorithm, please see sections 9.2-9.4 of Bishop (2006).

In the E-step of the EM algorithm we determine the expectation of the log likelihood of the complete data, in other words, assuming we know the hidden parameters (β and α in this case). In the M-step we maximize the lower bound on the log likelihood with respect to α and β. More formally:

• E-Step: Find the optimal values of the variational parameters γ*d and φ*d for every document in the corpus. Knowing these parameters allows us to compute the expectation of the log likelihood of the complete data.

• M-Step: Maximize the resulting lower bound on the log likelihood

$$
\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)
$$

with respect to α and β. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document.

Here I quote the results for the M-step updates but encourage the curious reader to peruse A.3 and A.4 of Blei et al. (2003) for the derivations; they involve some neat tricks to make the inference scalable.

$$
\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^{*}_{dni}\, w^{j}_{dn} \tag{7}
$$


Figure 2: The LDA graphical model. (Nodes: α, θd, zd,n, wd,n, and β1:k, with plates over the N words in each document and the M documents in the corpus.)

Note that the "proportional to" symbol (∝) in Eq. (7) simply means that we normalize each βi to sum to one. The α update is a bit trickier and uses a linear-scaling Newton-Raphson algorithm to determine the optimal α, with updates carried out in log-space (assuming a uniform α):

$$
\log(\alpha_{t+1}) = \log(\alpha_t) - \frac{\frac{dL}{d\alpha}}{\frac{d^2L}{d\alpha^2}\,\alpha_t + \frac{dL}{d\alpha}} \tag{8}
$$

$$
\frac{dL}{d\alpha} = M\left( k\,\Psi(k\alpha) - k\,\Psi(\alpha) \right) + \sum_{d=1}^{M} \sum_{i=1}^{k} \left( \Psi(\gamma_{di}) - \Psi\!\left( \sum_{j=1}^{k} \gamma_{dj} \right) \right) \tag{9}
$$

$$
\frac{d^2L}{d\alpha^2} = M\left( k^2\,\Psi'(k\alpha) - k\,\Psi'(\alpha) \right) \tag{10}
$$

where Ψ' denotes the trigamma function, the derivative of the digamma function Ψ.
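To illustrate the M-step, the R fragment below sketches the β update of Eq. (7) and a single Newton step for a uniform α from Eqs. (8)-(10); the names phi, gamma_mat, and docs are my own assumptions for the per-document E-step outputs and word indices, and trigamma is base R's Ψ':

# M-step for beta, Eq. (7): accumulate expected counts, then normalize each row.
update_beta <- function(docs, phi, K, V) {
  beta <- matrix(0, nrow = K, ncol = V)
  for (d in seq_along(docs)) {
    for (n in seq_along(docs[[d]])) {
      j <- docs[[d]][n]                        # vocabulary index of word n in document d
      beta[, j] <- beta[, j] + phi[[d]][n, ]   # add phi*_{dni} for every topic i
    }
  }
  beta / rowSums(beta)                         # each topic's word probabilities sum to one
}

# One Newton-Raphson step for a uniform alpha in log-space, Eqs. (8)-(10).
update_alpha <- function(alpha, gamma_mat, K) {
  # gamma_mat: M x K matrix of per-document variational Dirichlet parameters
  M   <- nrow(gamma_mat)
  ss  <- sum(digamma(gamma_mat) - digamma(rowSums(gamma_mat)))     # sufficient statistics
  dL  <- M * (K * digamma(K * alpha) - K * digamma(alpha)) + ss    # Eq. (9)
  d2L <- M * (K^2 * trigamma(K * alpha) - K * trigamma(alpha))     # Eq. (10)
  exp(log(alpha) - dL / (d2L * alpha + dL))                        # Eq. (8)
}

In a full implementation the Newton step would be iterated until |dL/dα| is small, with a safeguard to keep α positive.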

This completes the nitty-gritty of variational inference for LDA. Algorithm 1 provides the pseudocode for this variational inference, and the accompanying R code can be found at http://highenergy.physics.uiowa.edu/~creed/#code.


Figure 3: The variational distribution used to approximate the posterior distribution in LDA; γ and φ are the variational parameters. (Nodes: γd, θd, φd,n, and zd,n, with plates over the N words and M documents.)


Algorithm 1: Variational Expectation-Maximization LDA

Input: number of topics K; corpus with M documents and Nd words in document d
Output: model parameters β and α, plus variational parameters γd and φd (which approximate θd and zd)

initialize φ0_dni := 1/K for all i in 1..K and n in 1..Nd
initialize γ_di := α + Nd/K for all i in 1..K
initialize α := 50/K
initialize β_ij for all i in 1..K and j in 1..V (e.g. randomly, with each row normalized to sum to one)

// E-step (determine φ and γ for every document and compute the expected log likelihood)
loglikelihood := 0
for d = 1 to M
    repeat
        for n = 1 to Nd
            for i = 1 to K
                φ(t+1)_dni := β_{i,w_dn} exp(Ψ(γ(t)_di))
            endfor
            normalize φ(t+1)_dn to sum to 1
        endfor
        γ(t+1)_d := α + Σ_{n=1}^{Nd} φ(t+1)_dn
    until convergence of φd and γd
    loglikelihood := loglikelihood + L(γ, φ; α, β)   // the variational lower bound; see A.3 of Blei et al. (2003)
endfor

// M-step (maximize the lower bound on the log likelihood with respect to β and α)
β_ij := 0 for all i in 1..K and j in 1..V
for d = 1 to M
    for n = 1 to Nd
        for i = 1 to K
            for j = 1 to V
                β_ij := β_ij + φ_dni w(j)_dn
            endfor
        endfor
    endfor
endfor
normalize each β_i to sum to 1
estimate α via Eq. (8)

if loglikelihood has converged then
    return parameters
else
    go back to the E-step
endif


4 But how does LDA work?

Figure 4 shows example topics obtained by running LDA on the Touchstone Applied Science Associates (TASA) corpus (a collection of nearly 40,000 short documents from educational writings). Each topic is summarized by the 16 words with the highest probability under that topic, i.e. the largest β values for the topic (a short code sketch for extracting such a summary follows Figure 4). A natural question to ask is, "how does LDA work?" or "why does LDA produce groups of words with similar themes?" For instance, for the four provided topics we note that the left-most topic is related to drugs and medicine, the next topic is related to color, the third topic is related to memory, and the fourth topic is related to medical care. How did LDA extract these topics from a collection of texts? The simple and straightforward answer is co-occurrence. LDA extracts clusters of co-occurring words to form topics.

Figure 4: Taken from Steyvers and Griffiths (2007).
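For reference, a topic summary like the one in Figure 4 can be read directly off a fitted β matrix; the short R sketch below (assuming a hypothetical fitted beta and a character vector vocab of the V terms) lists the 16 highest-probability terms for each topic:

# Top 16 words per topic: sort each row of beta and look up the corresponding terms.
top_words <- function(beta, vocab, n_top = 16) {
  t(apply(beta, 1, function(p) vocab[order(p, decreasing = TRUE)[1:n_top]]))
}
# top_words(beta, vocab)   # returns a K x 16 character matrix, one row per topic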

Let's probe a bit deeper than simply saying LDA works because of co-occurrence; let's figure out why. To do this, we must reexamine the joint distribution (which is proportional to the posterior) and break down what each member of the hierarchy implies:

$$
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \propto p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = \underbrace{p(\mathbf{w} \mid \mathbf{z}, \beta)}_{1}\; \underbrace{p(\mathbf{z} \mid \theta)}_{2}\; \underbrace{p(\theta \mid \alpha)}_{3} \tag{11}
$$

1. implies that making β a sparse matrix will increase the probability of certain words; remember that the β values for a given topic must sum to one, so the more terms we assign a non-zero β value, the thinner we have to spread the probability for that topic.

2. implies that making θ have concentrated components will increase the probability

3. implies that using a small α will increase the probability

The three factors form an interesting dynamic: (1) implies that having sparsely distributed topics can result in a high probability for a document, where the ideal way to form the sparse components is to make them non-overlapping clusters of words that co-occur in different documents (why is this?); (2) encourages a sparse θ, so that the probability of choosing a given z value will be large, e.g. θ = (0.25, 0.25, 0.25, 0.25) would yield smaller probabilities than θ = (0.5, 0.5, 0, 0). In other words, (2) penalizes documents for having too many possible topics. (3) again penalizes using a large number of possible topics for a given document; small α values yield sparse θs. In summary: (1) wants to form sparse, segregated word clusters, while (2) and (3) want to give a small number of possible topics to each document. But if we only have a few topics to choose from, and each topic has a small number of non-zero word probabilities, then we had better form meaningful clusters that can represent a diverse collection of documents. How should we do this, you ask? Form clusters of co-occurring terms, which is largely what LDA accomplishes. So while the joint distribution encourages each document to use a small number of topics, each built from a small set of meaningful (co-occurring) words, the data itself demands enough topics to cover all of these small "topic clusters". As desired, we have a push and pull between the joint distribution and the actual data in determining the assignment of topics.

Other notes to include in future revisions:

• φ approximates p(zn|w), so we can use it to assign topics to individual words (e.g. if qn(zn = 1) > 0.9); these are the variational posterior multinomial parameters.

• γ(w) (the variational posterior Dirichlet parameters) can be used to model documents for classification. γi approximates αi except for the most significant topics, where γi − αi > 1; γ represents the document on the topic simplex.

• LDA combats overfitting, unlike mixture models (LDA is a mixed membership model)

• LDA can really be considered a dimensionality reduction technique — discrete PCA

• Comparison/evaluation methods

5 Further Study

• Blei et al. (2003): The original LDA paper; much of this tutorial was extracted from this paper. This paper also includes several excellent examples of LDA in application.

• Blei (2011): A recent overview and review of topic models; also includes directions for future research.

• Steyvers and Griffiths (2007) provides a well-constructed introduction to probabilistic topic models and details on using topic models to compute similarity between documents and similarity between words.

• Dave Blei’s video lecture on topic models: http://videolectures.net/mlss09uk_blei_tm


References

C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

D. Blei. Introduction to probabilistic topic models. Communications of the ACM, 2011.

D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.

B.A. Frigyik, A. Kapila, and M.R. Gupta. Introduction to the Dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington, UWEETR-2010-0006, 2010.

M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424-440, 2007.

6 Appendix: EM Algorithm Refresher

The General EM Algorithm Summary: Given a joint distribution p(X, Z|θ) over observed variables X and latent variables Z, with parameters θ, the goal is to maximize the likelihood function p(X|θ) with respect to θ.

1. Initialize the parameters θold

2. E Step: Construct

$$
Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)
$$

which is the conditional expectation of the complete-data log-likelihood.

3. M Step: Evaluate θ^new via

$$
\theta^{\text{new}} = \operatorname*{arg\,max}_{\theta}\, Q(\theta, \theta^{\text{old}}) \tag{12}
$$

4. Check the log likelihood and parameter values for convergence; if not converged, let θ^old ← θ^new and return to step 2.
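As a compact, self-contained instance of these four steps (unrelated to LDA; the model choice is mine, purely for illustration), the R function below fits a two-component Gaussian mixture by EM, where X is observed and the component labels Z are latent:

em_gauss_mix <- function(x, max_iter = 200, tol = 1e-8) {
  # 1. initialize theta_old = (w1, mu, sigma): mixing weight, means, standard deviations
  w1     <- 0.5
  mu     <- as.numeric(quantile(x, c(0.25, 0.75)))
  sigma  <- rep(sd(x), 2)
  ll_old <- -Inf
  ll     <- ll_old
  for (iter in seq_len(max_iter)) {
    # 2. E step: likelihood terms and responsibilities under the current parameters
    d1 <- w1 * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - w1) * dnorm(x, mu[2], sigma[2])
    ll <- sum(log(d1 + d2))                      # log p(X | theta_old)
    # 4. check the log likelihood for convergence
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
    r <- d1 / (d1 + d2)                          # p(Z_n = 1 | x_n, theta_old)
    # 3. M step: closed-form argmax of Q(theta, theta_old)
    w1       <- mean(r)
    mu[1]    <- sum(r * x) / sum(r)
    mu[2]    <- sum((1 - r) * x) / sum(1 - r)
    sigma[1] <- sqrt(sum(r * (x - mu[1])^2) / sum(r))
    sigma[2] <- sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r))
  }
  list(weight = w1, mu = mu, sigma = sigma, loglik = ll)
}

# x <- c(rnorm(200, mean = -2), rnorm(200, mean = 3))   # example data
# em_gauss_mix(x)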
