Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments


Jonathan Chang
Facebook
1601 S. California Ave.
Palo Alto, CA 94304
jonchang@facebook.com

Abstract

Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Recent studies have found that while there are suggestive connections between topic models and the way humans interpret data, these two often disagree. In this paper, we explore this disagreement from the perspective of the learning process rather than the output. We present a novel task, tag-and-cluster, which asks subjects to simultaneously annotate documents and cluster those annotations. We use these annotations as a novel approach for constructing a topic model, grounded in human interpretations of documents. We demonstrate that these topic models have features which distinguish them from traditional topic models.

1 Introduction

Probabilistic topic models have become popular tools for the unsupervised analysis of large document collections (Deerwester et al., 1990; Griffiths and Steyvers, 2002; Blei and Lafferty, 2009). These models posit a set of latent topics, multinomial distributions over words, and assume that each document can be described as a mixture of these topics. With algorithms for fast approximate posterior inference, we can use topic models to discover both the topics and an assignment of topics to documents from a collection of documents. (See Figure 1.)

These modeling assumptions are useful in the sense that, empirically, they lead to good models of documents (Wallach et al., 2009). However, recent work has explored how these assumptions correspond to humans' understanding of language (Chang et al., 2009; Griffiths and Steyvers, 2006; Mei et al., 2007). Focusing on the latent space, i.e., the inferred mappings between topics and words and between documents and topics, this work has discovered that although there are some suggestive correspondences between human semantics and topic models, they are often discordant.

In this paper we build on this work to further explore how humans relate to topic models. But whereas previous work has focused on the results of topic models, here we focus on the process by which these models are learned. Topic models lend themselves to sequential procedures through which the latent space is inferred; these procedures are in effect programmatic encodings of the modeling assumptions. By substituting key steps in this program with human judgments, we obtain insights into the semantic model conceived by humans.

Here we present a novel task, tag-and-cluster, which asks subjects to simultaneously annotate a document and cluster that annotation. This task simulates the sampling step of the collapsed Gibbs sampler (described in the next section), except that the posterior defined by the model has been replaced by human judgments. The task is quick to complete and is robust against noise. We report the results of a large-scale human study of this task, and show that humans are indeed able to construct a topic model in this fashion, and that the learned topic model has semantic properties distinct from existing topic models. We also demonstrate that the judgments can be used to guide computer-learned topic models towards models which are more concordant with human intuitions.

2 Topic models and inference

Topic models posit that each document is expressed as a mixture of topics. These topic proportions are drawn once per document, and the topics are shared across the corpus. In this paper we focus on latent Dirichlet allocation (LDA) (Blei et al., 2003), a topic model which treats each document's topic assignment as a multinomial random variable drawn from a symmetric Dirichlet prior. LDA, when applied to a collection of documents, will build a latent space: a collection of topics for the corpus and a collection of topic proportions for each of its documents.


[Figure 1 (image). (a) Topics. TOPIC 1: computer, technology, system, service, site, phone, internet, machine. TOPIC 2: play, film, movie, theater, production, star, director, stage. TOPIC 3: sell, sale, store, product, business, advertising, market, consumer. (b) Document Assignments to Topics: seven New York Times article titles placed on the topic simplex.]

Figure 1: The latent space of a topic model consists of topics, which are distributions over words, and a distribution over these topics for each document. On the left are three topics from a fifty-topic LDA model trained on articles from the New York Times. On the right is a simplex depicting the distribution over topics associated with seven documents. The line from each document's title shows the document's position in the topic space.

LDA can be described by the following generative process:

1. For each topic k,
   (a) Draw topic β_k ∼ Dir(η)
2. For each document d,
   (a) Draw topic proportions θ_d ∼ Dir(α)
   (b) For each word w_{d,n},
       i. Draw topic assignment z_{d,n} ∼ Mult(θ_d)
       ii. Draw word w_{d,n} ∼ Mult(β_{z_{d,n}})

This process is depicted graphically in Figure 2. The parameters of the model are the number of topics, K, as well as the Dirichlet priors on the topic-word and document-topic distributions, η and α. The only observed variables of the model are the words, w_{d,n}. The remaining variables must be learned.
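As a concrete illustration, the generative process can be sketched in a few lines of NumPy (corpus sizes, document lengths, and hyperparameter values below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 10, 8263, 100   # topics, vocabulary size, documents (illustrative)
alpha, eta = 0.2, 0.01    # symmetric Dirichlet hyperparameters (illustrative)

# 1. Draw each topic's word distribution beta_k ~ Dir(eta).
beta = rng.dirichlet(np.full(V, eta), size=K)

docs = []
for d in range(D):
    # 2a. Draw this document's topic proportions theta_d ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    n_words = rng.poisson(150)  # document length (illustrative)
    # 2b. For each word: draw a topic z ~ Mult(theta), then a word w ~ Mult(beta_z).
    z = rng.choice(K, size=n_words, p=theta)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])
    docs.append((z, w))
```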

There are several techniques for performing posterior inference, i.e., inferring the distribution over hidden variables given a collection of documents, including variational inference (Blei et al., 2003) and Gibbs sampling (Griffiths and Steyvers, 2006). In the sequel, we focus on the latter approach.

[Figure 2 (image): graphical model for LDA.]

Figure 2: A graphical model depiction of latent Dirichlet allocation (LDA). Plates denote replication. The shaded circle denotes an observed variable and unshaded circles denote hidden variables.

Collapsed Gibbs sampling for LDA treats the document-topic and topic-word distributions, θ_d and β_k, as nuisance variables to be marginalized out. The posterior distribution over the remaining latent variables, the topic assignments z_{d,n}, can be expressed as

\[ p(z \mid \alpha, \eta, w) \propto \prod_d \prod_k \Gamma(n_{d,k} + \alpha_k) \prod_k \frac{\prod_w \Gamma(n_{w,k} + \eta_w)}{\Gamma\left(\sum_w n_{w,k} + \eta_w\right)}, \]

where n_{d,k} denotes the number of words in document d assigned to topic k and n_{w,k} the number of times word w is assigned to topic k. This leads to the sampling equation,

\[ p(z_{d,i} = k \mid \alpha, \eta, w, z_{\neg d,i}) \propto \left(n^{\neg d,i}_{d,k} + \alpha_k\right) \frac{\eta_{w_{d,i}} + n^{\neg d,i}_{w_{d,i},k}}{\sum_w \left(n^{\neg d,i}_{w,k} + \eta_w\right)}, \tag{1} \]

where the superscript ¬d,i indicates that these statistics should exclude the current variable under consideration, z_{d,i}.
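For concreteness, here is a minimal NumPy sketch of the update in Equation 1 (a sketch under our own naming conventions, assuming symmetric priors and integer word ids; it is not code from the paper):

```python
import numpy as np

def gibbs_step(d, i, docs, z, n_dk, n_wk, n_k, alpha, eta, rng):
    """Resample the topic of word i in document d according to Equation 1."""
    w = docs[d][i]
    k_old = z[d][i]
    # Remove the current assignment from the counts (the "¬d,i" statistics).
    n_dk[d, k_old] -= 1
    n_wk[w, k_old] -= 1
    n_k[k_old] -= 1
    # Unnormalized conditional: (n_dk + alpha) * (n_wk + eta) / (n_k + V*eta),
    # where n_k[k] = sum_w n_wk[w, k] and the priors are symmetric.
    V = n_wk.shape[0]
    p = (n_dk[d] + alpha) * (n_wk[w] + eta) / (n_k + V * eta)
    k_new = rng.choice(len(p), p=p / p.sum())
    # Record the new assignment.
    n_dk[d, k_new] += 1
    n_wk[w, k_new] += 1
    n_k[k_new] += 1
    z[d][i] = k_new
```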

In essence, the model performs inference by looking at each word in succession, and probabilistically assigning it to a topic according to Equation 1. Equation 1 is derived through the modeling assumptions and choice of parameters. By replacing Equation 1 with a different equation or with empirical observations, we may construct new models which reflect different assumptions about the underlying data.

3 Constructing topics using human judgments

In this section we propose a task which creates a formal setting where humans can create a latent space representation of the corpus. Our task, tag-and-cluster, replaces the collapsed Gibbs sampling step of Equation 1 with a human judgment. In essence, we are constructing a gold-standard series of samples from the posterior.¹

Figure 3 shows how tag-and-cluster is presented to users. The user is shown a document along with its title; the document is randomly selected from a pool of available documents. The user is asked to select a word from the document which is discriminative, i.e., a word which would help someone looking for the document find it. Once the word is selected, the user is then asked to assign the word to the topic which best suits the sense of the word used in the document. Users are specifically instructed to focus on the meanings of words, not their syntactic usage or orthography.

The user assigns a word to a topic by selecting an entry out of a menu of topics. Each topic is represented by the five words occurring most frequently in that topic. The order of the topics presented to the user is determined by the number of words in that document already assigned to each topic. Once an instance of a word in a document has been assigned, it cannot be reassigned and will be marked in red when subsequent users encounter this document. In practice, we also prohibit users from selecting infrequently occurring words and stop words.

¹ Additionally, since Gibbs sampling is by nature stochastic, we believe that the task is robust against small perturbations in the quality of the assignments, so long as in aggregate they tend toward the mode.
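Schematically, tag-and-cluster replaces the probabilistic draw of Equation 1 with a worker's decision. A hedged sketch, where ask_worker is a hypothetical stand-in for the Mechanical Turk interface described above:

```python
def human_gibbs_step(d, docs, z, n_dk, n_wk, n_k, ask_worker):
    # ask_worker is a hypothetical callback: shown the document and a menu of
    # topics (each summarized by its five most frequent words, ordered by the
    # number of words in this document already assigned to each topic), the
    # worker picks a discriminative word position i and a topic k for it.
    i, k = ask_worker(docs[d], n_wk, n_dk[d])
    w = docs[d][i]
    # Unlike the sampler, an assignment is final: it is recorded once and the
    # word is marked unavailable to subsequent workers.
    z[d][i] = k
    n_dk[d, k] += 1
    n_wk[w, k] += 1
    n_k[k] += 1
```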

4 Experimental results

We conducted our experiments using Amazon Mechanical Turk, which allows workers (our pool of prospective subjects) to perform small jobs for a fee through a Web interface. No specialized training or knowledge is typically expected of the workers. Amazon Mechanical Turk has been successfully used in the past to develop gold-standard data for natural language processing (Snow et al., 2008).

We prepare two randomly-chosen, 100-document subsets of English Wikipedia. For convenience, we denote these two sets of documents as set1 and set2. For each document, we keep only the first 150 words for our experiments. Because of the encyclopedic nature of the corpus, the first 150 words typically provide a broad overview of the themes in the article.

We also removed from the corpus stop words and words which occur infrequently², leading to a lexicon of 8263 words. After this pruning set1 contained 11614 words and set2 contained 11318 words.
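A minimal sketch of this preprocessing, assuming a generic stop list (the paper's exact list is unspecified) and the thresholds given above and in footnote 2:

```python
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}  # assumed stop list

def preprocess(articles, reference_articles, min_count=8, max_len=150):
    # Keep only the first 150 words of each article.
    docs = [a.lower().split()[:max_len] for a in articles]
    # Identify infrequent words on a larger reference collection
    # (fewer than eight occurrences, per footnote 2).
    counts = Counter(w for a in reference_articles for w in a.lower().split())
    keep = lambda w: w not in STOP_WORDS and counts[w] >= min_count
    return [[w for w in doc if keep(w)] for doc in docs]
```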

Workers were asked to perform twenty of the taggings described in Section 3 for each task; workers were paid $0.25 for each such task. The number of latent topics, K, is a free parameter. Here we explore two values of this parameter, K = 10 and K = 15, leading to a total of four experiments: two for each set of documents and two for each value of K.

4.1 Tagging behavior

For each experiment we issued 100 HITs, leading to a total of 2000 tags per experiment. Figure 4 shows the number of HITs performed per person in each experiment. Between 12 and 30 distinct workers participated in each experiment. The number of HITs performed per person is heavily skewed, with the most active participants completing an order of magnitude more HITs than other participants.

Figure 5 shows the amount of time taken per tag, in log seconds. Each color represents a different experiment. The bulk of the tags took less than a minute to perform and more than a few seconds.

4.2 Comparison with LDA

Learned topics. As described in Section 3, the tag-and-cluster task is a way of allowing humans to construct a topic model.

² Infrequently occurring words were identified as those appearing fewer than eight times on a larger collection of 7726 articles.


Figure 3: Screenshots of our task. In the center, the document along with its title is shown. Words which cannot be selected, e.g., distractors and words previously selected, are shown in red. Once a word is selected, the user is asked to find a topic in which to place the word. The user selects a topic by clicking on an entry in a menu of topics, where each topic is expressed by the five words which occur most frequently in that topic.

[Figure 4 (image): histograms of HITs per participant for (a) set1, K = 10; (b) set1, K = 15; (c) set2, K = 10; (d) set2, K = 15.]

Figure 4: The number of HITs performed (y-axis) by each participant (x-axis). Between 12 and 30 people participated in each experiment.

[Figure 6 (image): boxplots of entropy for Dirichlet draws at α ∈ {0.05, 0.1, 0.2, 0.5, 1, 2} and for worker-inferred proportions on set1 and set2; (a) K = 10, (b) K = 15.]

Figure 6: A comparison of the entropy of distributions drawn from a Dirichlet distribution versus the entropy of the topic proportions inferred by workers. Each column of the boxplot shows the distribution of entropies for 100 draws from a Dirichlet distribution with parameter α. The two rightmost columns show the distribution of the entropy of the topic proportions inferred by workers on set1 and set2. The α of workers typically falls between 0.2 and 0.5.


Table 1: The five words with the highest probability mass in each topic inferred by humans using the task described in Section 3. Each subtable shows the results for a particular experimental setup. Each row is a topic; the most probable words are ordered from left to right.

(a) set1, K = 10

railway lighthouse rail huddersfield station
school college education history conference
catholic church film music actor
runners team championships match racing
engine company power dwight engines
university london british college county
food novel book series superman
november february april august december
paint photographs american austin black
war history army american battle

(b) set2, K = 10

president emperor politician election government
american players swedish team zealand
war world navy road torpedo
system pop microsoft music singer
september 2007 october december 1999
television dog name george film
people malay town tribes cliff
diet chest enzyme hair therapy
british city london english county
school university college church center

(c) set1, K = 15

australia knee british israel set
catholic roman island village columbia
john devon michael austin charles
school university class community district
november february 2007 2009 2005
lighthouse period architects construction design
railway rail huddersfield ownership services
cyprus archdiocese diocese king miss
carson gordon hugo ward whitney
significant application campaign comic considered
born london american england black
war defense history military artillery
actor film actress band designer
york michigan florida north photographs
church catholic county 2001 agricultural

(d) set2, K = 15

music pop records singer artist
film paintings movie painting art
school university english students british
drama headquarters chess poet stories
family church sea christmas emperor
dog broadcast television bbc breed
champagne regular character characteristic common
election government parliament minister politician
enzyme diet protein hair oxygen
war navy weapons aircraft military
september october december 2008 1967
district town marin america american
car power system device devices
hockey players football therapy champions
california zealand georgia india kolkata

[Figure 7 (image): log-likelihood versus iteration for (a) set1, K = 10; (b) set2, K = 10; (c) set1, K = 15; (d) set2, K = 15.]

Figure 7: The log-likelihood achieved by LDA as a function of iteration. There is one series for each value of α ∈ {0.05, 0.1, 0.2, 0.5, 1.0, 2.0} from top to bottom.


Table 2: The five words with the highest probability mass in each topic inferred by LDA, with α = 0.2. Each subtable shows the results for a particular experimental setup. Each row is a topic; the most probable words are ordered from left to right.

(a) set1, K = 10

born 2004 team award sydney
regiment army artillery served scouting
line station main island railway
region street located site knee
food february conference day 2009
pride greek knowledge portland study
catholic church roman black time
class series film actor engine
travel human office management defense
school born war world university

(b) set2, K = 10

september english edit nord hockey
black hole current england model
training program war election navy
school university district city college
family word international road japan
publication time day india bridge
born pop world released march
won video microsoft project hungary
film hair bank national town
people name french therapy artist

(c) set1, K = 15

time michael written experience match
line station railway branch knowledge
film land pass set battle
william florida carson virginia newfoundland
war regiment british army south
reaction terminal copper running complex
born school world college black
food conference flight medium rail
township scouting census square county
travel defense training management edges
series actor engine november award
pride portland band northwest god
team knee 2004 sydney israel
catholic located site region church
class february time public king

(d) set2, K = 15

family protein enzyme acting oxygen
england producer popular canadian sea
system death artist running car
character series dark main village
english word publication stream day
training program hair students electrical
district town city local kolkata
september edit music records recorded
black pop bank usually hole
people choir road diet related
war built navy british service
center million cut champagne players
born television current drama won
school university college election born
film nord played league hockey

[Figure 5 (image): histogram of log(seconds between tags), one series per experiment.]

Figure 5: The distribution of times taken per HIT. Each series represents a different experiment. The bulk of the tags took less than one minute and more than a few seconds.

One way of visualizing a learned topic model is by examining its topics. Table 1 shows the topics constructed by human judgments. Each subtable shows a different experimental setup and each row shows an individual topic. The five most frequently occurring words in each topic are shown, ordered from left to right.

Many of the topics inferred by humans have straightforward interpretations. For example, the {november, february, april, august, december} topic for the set1 corpus with K = 10 is simply a collection of months. Similar topics (with years and months combined) can be found in the other experimental configurations. Other topics also cover specific semantic domains, such as {president, emperor, politician, election, government} or {music, pop, records, singer, artist}. Several of the topics are combinations of distinct concepts, such as {catholic, church, film, music, actor}, which is often indicative of the number of clusters, K, being too low.


Table 2 shows the topics learned by LDA under the same experimental conditions, with the Dirichlet hyperparameter α = 0.2 (we justify this choice in the following section). These topics are more difficult to interpret than the ones created by humans. Some topics seem to largely make sense except for some anomalous words, such as {district, town, city, local, kolkata} or {school, university, college, election, born}. But the small amount of data means that it is difficult for a model which does not leverage prior knowledge to infer meaningful topics. In contrast, several humans, even working independently, can leverage prior knowledge to construct meaningful topics with little data.

There is another qualitative difference between the topics found by the tag-and-cluster task and LDA. Whereas LDA must rely on co-occurrence, humans can use ontological information. Thus, a topic which has ontological meaning, such as a list of months, may rarely be discovered by LDA since the co-occurrence patterns of months do not form a strong pattern. But users in every experimental configuration constructed this topic, suggesting that the users were consistently leveraging information that would not be available to LDA, even with a larger corpus.

Hyperparameter values. A persistent question among practitioners of topic models is how to set or learn the value of the hyperparameter α. α is a Dirichlet parameter which acts as a control on sparsity: smaller values of α lead to sparser document-topic distributions. By comparing the sparsity patterns of human judgments to those of LDA for different settings of α, we can infer the value of α that would best match human judgments.

Figure 6 shows a boxplot comparison of the entropy of draws from a Dirichlet distribution (the generative process in LDA), versus the observed entropy of the models learned by humans. The first six columns show the distributions for the Dirichlet draws for various values of α; the last two columns show the observed entropy distributions on the two corpora, set1 and set2.

The empirical entropy distributions across the corpora are comparable to those of a Dirichlet distribution with α between approximately 0.2 and 0.5.
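This comparison can be sketched as follows (an illustrative reconstruction with our own variable names, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

K = 10
# Entropy of 100 topic-proportion draws from a symmetric Dirichlet, per alpha.
for alpha in [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]:
    draws = rng.dirichlet(np.full(K, alpha), size=100)
    print(f"alpha={alpha}: median entropy "
          f"{np.median([entropy(t) for t in draws]):.2f}")

# For the workers, normalize each document's tag counts n_dk (documents x
# topics) into empirical topic proportions and compute entropies the same way.
def worker_entropies(n_dk):
    theta_hat = n_dk / n_dk.sum(axis=1, keepdims=True)
    return [entropy(t) for t in theta_hat]
```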

Table 3: The five words with the highest probability mass in each topic inferred by LDA on set1 with α = 0.2, K = 10, and initialized using human judgments. Each row is a topic; the most probable words are ordered from left to right.

line station lighthouse local main
school history greek knowledge university
catholic church city roman york
team club 2004 scouting career
engine knee series medium reaction
located south site land region
food film conference north little
february class born august 2009
pride portland time northwest june
war regiment army civil black

These settings of α are slightly higher than, but still in line with, a common rule of thumb of α = 1/K. Figure 7 shows the log-likelihood, a measure of model fit, achieved by LDA for each value of α. Higher log-likelihoods indicate better fits. Commensurate with the rule of thumb, using log-likelihoods to select α would encourage values smaller than human judgments.

However, while the entropy of the Dirichlet draws increases significantly when the number of clusters is increased from K = 10 to K = 15, the entropies assigned by humans do not vary as dramatically. This suggests that for any given document, humans are likely to pull words from only a small number of topics, regardless of how many topics are available, whereas a model will continue to spread probability mass across all topics even as the number of topics increases.

Improving LDA using human judgments. The results of the previous sections suggest that human behavior differs from that of LDA, and that humans conceptualize documents in ways LDA does not. This motivates using the human judgments to augment the information available to LDA. To do so, we initialize the topic assignments used by LDA's Gibbs sampler to those made by humans. We then run the LDA sampler until convergence. This provides a method to weakly incorporate human knowledge into the model.
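A sketch of this procedure, reusing gibbs_step from Section 2 (human_z maps a (document, position) pair to a worker-chosen topic; initializing unjudged words at random is our assumption, as the paper does not specify it):

```python
import numpy as np

def run_lda_with_human_init(docs, human_z, K, alpha, eta, n_iters=40, seed=0):
    rng = np.random.default_rng(seed)
    V = 1 + max(w for doc in docs for w in doc)
    n_dk = np.zeros((len(docs), K), dtype=int)
    n_wk = np.zeros((V, K), dtype=int)
    n_k = np.zeros(K, dtype=int)
    z = []
    for d, doc in enumerate(docs):
        # Seed each word with the human judgment where one exists;
        # fall back to a random topic otherwise (our assumption).
        zd = [human_z.get((d, i), int(rng.integers(K))) for i in range(len(doc))]
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1
            n_wk[w, k] += 1
            n_k[k] += 1
        z.append(zd)
    # Run the collapsed Gibbs sampler from this initialization.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i in range(len(doc)):
                gibbs_step(d, i, docs, z, n_dk, n_wk, n_k, alpha, eta, rng)
    return z, n_dk, n_wk
```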

Table 3 shows the topics inferred by LDA when initialized with human judgments. These topics resemble those directly inferred by humans, although as we predicted in the previous sections, the topic consisting of months has largely disappeared. Other semantically coherent topics, such as {located, south, site, land, region}, have appeared in its place.


[Figure 8 (image): log-likelihood versus iteration, human-initialized (blue) versus standard (red).]

Figure 8: The log likelihood achieved by LDA on set1 with α = 0.2, K = 10, and initialized using human judgments (blue). The red line shows the log likelihood without incorporating human judgments. LDA with human judgments dominates LDA without human judgments and helps the model converge more quickly.


Figure 8 shows the log-likelihood course of LDA when initialized by human judgments (blue), versus LDA without human judgments (red). Adding human judgments strictly helps the model converge to a higher likelihood and converge more quickly. In short, incorporating human judgments shows promise at improving both the interpretability and convergence of LDA.

5 Discussion

We presented a new method for constructing topic models using human judgments. Our approach relies on a novel task, tag-and-cluster, which asks users to simultaneously annotate a document with one of its words and to cluster those annotations. We demonstrate using experiments on Amazon Mechanical Turk that our method constructs topic models quickly and robustly. We also show that while our topic models bear many similarities to traditionally constructed topic models, our human-learned topic models have unique features such as fixed sparsity and a tendency for topics to be constructed around concepts which models such as LDA typically fail to find.

We also underscore that the collapsed Gibbs sampling framework is expressive enough to use as the basis for human-guided topic model inference. This may motivate, as future work, the construction of different modeling assumptions which lead to sampling equations which more closely match the empirically observed sampling performed by humans. In effect, our method constructs a series of samples from the posterior, a gold standard which future topic models can aim to emulate.

Acknowledgments

The author would like to thank Eytan Bakshy and Professor Jordan Boyd-Graber-Ying for their helpful comments and discussions.

References

David Blei and John Lafferty. 2009. Text Mining: Theory and Applications, chapter Topic Models. Taylor and Francis.

David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In NIPS.

Scott Deerwester, Susan Dumais, Thomas Landauer, George Furnas, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407.

Thomas L. Griffiths and Mark Steyvers. 2002. A probabilistic approach to semantic representation. In Proceedings of the 24th Annual Conference of the Cognitive Science Society.

T. Griffiths and M. Steyvers. 2006. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.

Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In KDD.

R. Snow, Brendan O'Connor, D. Jurafsky, and A. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP.

Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In ICML.