Latent Dirichlet Allocation - Stanford University (statweb.stanford.edu/~kriss1/lda_intro.pdf)


Latent Dirichlet Allocation (Blei et al.)

Kris Sankaran

2016-11-14


Introduction


Agenda

- Generative Mechanism (15 minutes): What is the proposed model, and how does it differ from what existed before?
- Interpretations (10 minutes): What are alternative ways to understand the model?
- Model Inference (15 minutes): How would we fit this model in practice?
- Examples and Conclusion (10 minutes): Why might we fit LDA in practice, and what are its limitations?


Context and Motivation

- Motivated by topic modeling:
  - Building interpretable representations of text data
  - Designing preprocessing steps for classification or information retrieval
- This said, LDA is not necessarily tied to text analysis
- Generative Modeling: design unified probabilistic models
  - Is explicit about assumptions, feels less ad hoc
  - Gives access to a (large) Bayesian inference literature
  - Can be used as a module in larger probabilistic models


Generative Model


Latent Dirichlet Allocation

- For the nth word in document d (a simulation sketch follows below),

$$
\begin{aligned}
w_{dn} \mid z_{dn} = k,\ \beta &\sim \mathrm{Cat}(\beta_{\cdot k}) \\
z_{dn} \mid \theta_d &\sim \mathrm{Cat}(\theta_d) \\
\theta_d \mid \alpha &\sim \mathrm{Dir}(\alpha)
\end{aligned}
$$

- Mnemonics:
  - w_dn ∈ {1, ..., V} is the term used as the nth word in document d
  - z_dn ∈ {1, ..., K} is the topic associated with the nth word in document d
  - θ_d ∈ S^{K−1} are the topic mixture proportions for document d
  - β_{·k} ∈ S^{V−1} are the term mixture proportions for topic k
  - α is the topic shrinkage parameter
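To make the generative mechanism concrete, here is a minimal NumPy sketch of the sampling scheme above. The corpus sizes D, N, K, V, the Dirichlet parameters, and the choice to store β with topics as rows (beta[k] = β_{·k}) are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K, V = 20, 50, 3, 100                       # hypothetical corpus dimensions
alpha = np.full(K, 0.5)                           # topic shrinkage parameter
beta = rng.dirichlet(np.full(V, 0.1), size=K)     # beta[k] = term proportions for topic k

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)                            # theta_d | alpha ~ Dir(alpha)
    z_d = rng.choice(K, size=N, p=theta_d)                    # z_dn | theta_d ~ Cat(theta_d)
    w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])   # w_dn | z_dn = k ~ Cat(beta[k])
    docs.append(w_d)
```

Each entry of docs is a length-N vector of term indices, i.e., one simulated document.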


Latent Dirichlet Allocation

[Plate diagram: α → θ_d → z_dn → w_dn ← β, with plates over the N words in each document and the D documents]

- w are observed data
- α, β are fixed, global parameters
- θ, z are random, local parameters

Observed Counts (sum of w_dn's)

[Figure: heatmap of observed term counts across documents and words; legend: count 0–30]

Mixing Proportions (θ_d's)

[Figure: heatmap of mixing proportions θ across documents and topics; legend: theta 0.25–0.75]

Topic Counts (sum of z_dn's)

[Figure: heatmaps of topic counts (sums of z_dn) across documents and words, one panel per topic (1–3); legend: z counts 10–30]

Latent Dirichlet Allocation (β)

[Figure: heatmap of β across topics and words; legend: beta 0.01–0.03]

Unigram Model

- It can be illustrative to compare with earlier topic modeling approaches
- The unigram model draws all words from the same multinomial, w_dn ∼ Cat(β)

[Plate diagram: w_dn ∼ Cat(β), with plates over N words and D documents]

Mixture of Unigrams

- This is the multinomial analog of Gaussian mixture models
- Each word is drawn from a mixture of K topics (sampled in the sketch below):

$$
z_d \sim p(z), \qquad w_{dn} \mid z_d \sim \mathrm{Cat}(\beta_{\cdot z_d})
$$

- Topic assignment is drawn at the document level

[Plate diagram: z_d → w_dn ← β, with z_d shared by all words in document d; plates over N words and D documents]
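For contrast with LDA, here is a minimal sketch of sampling from the mixture of unigrams under the same illustrative sizes as the earlier simulation: the topic z_d is drawn once per document and reused for every word in it.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K, V = 20, 50, 3, 100
topic_probs = rng.dirichlet(np.ones(K))            # a single global p(z)
beta = rng.dirichlet(np.full(V, 0.1), size=K)      # beta[k] = term proportions for topic k

docs = []
for d in range(D):
    z_d = rng.choice(K, p=topic_probs)             # z_d ~ p(z), drawn once per document
    w_d = rng.choice(V, size=N, p=beta[z_d])       # w_dn | z_d ~ Cat(beta[z_d]) for every word
    docs.append(w_d)
```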

Probabilistic Latent Semantic Indexing (pLSI)

- pLSI draws a different topic for each word in the document,

$$
z_{dn} \mid d \sim p(z_{dn} \mid d), \qquad w_{dn} \mid z_{dn} \sim \mathrm{Cat}(\beta_{\cdot z_{dn}})
$$

- The per-document topic mixture proportions are nonrandom and different for each document
- The number of fixed parameters grows linearly with the number of documents

[Plate diagram: d → z_dn → w_dn ← β, with plates over N words and D documents]

Back to LDA

- Essential difference: randomness in the topic mixture proportions lets us share information across documents
- The number of fixed parameters does not grow with the number of documents

[Plate diagram: α → θ_d → z_dn → w_dn ← β, with plates over N words and D documents]

Interpretations


Geometric

- Each topic is a point on the simplex, and the K topics determine a topics simplex
- The mixture of unigrams model gives each document a corner of the topics simplex
- pLSI estimates the empirical distribution of observed mixing proportions
- LDA estimates a smooth density over the topics simplex


Matrix Factorization

- We can think of topics as latent factors and mixing proportions as document scores,

$$
p(w_{dn} = v \mid \theta_d, \beta) = \sum_{k=1}^{K} p(w_{dn} = v \mid z_{dn} = k, \beta)\, p(z_{dn} = k) = \beta_{v\cdot}^{T}\, p(z_{dn})
$$

- The different models treat the p(z_dn)'s differently
- In LDA, this probability is β_{v·}^T θ_d (see the code sketch after the figure)

[Figure: the D × V matrix of probabilities p(w_dn = v) factors into document scores θ_dk (D × K) times topic loadings β_kv (K × V)]
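A small numerical sketch of this factorization, with hypothetical Θ and B filled in at random: the D × V matrix of per-document term probabilities is the product of the document scores (D × K) and the topic loadings (K × V).

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, V = 20, 3, 100
Theta = rng.dirichlet(np.full(K, 0.5), size=D)     # rows: document mixing proportions theta_d
B = rng.dirichlet(np.full(V, 0.1), size=K)         # rows: topic term proportions, B[k, v] = beta_vk

P = Theta @ B                             # P[d, v] = sum_k theta_dk * beta_vk = p(w_dn = v | theta_d, beta)
assert np.allclose(P.sum(axis=1), 1.0)    # each row of P is a distribution over terms
```

Each row of P sums to one because every row of Theta and of B does.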

Inference


Variational Bayes

- As scientists / modelers, our primary interest is in the posterior p(θ, z | w, α, β) after observing the words w
- This is not available in closed form (the normalizing constant is intractable)
- In practice, we also need to estimate α and β – more on this later


Variational Bayes

- (Blei et al.) propose a variational approach
  - Turns Bayesian inference into an optimization problem
- Specifically, consider the family Γ of q's that factor like

$$
q(\theta, z \mid \gamma, \phi) = \prod_{d=1}^{D} \left[ \mathrm{Dir}(\theta_d \mid \gamma_d) \prod_{n=1}^{N} \mathrm{Cat}(z_{dn} \mid \phi_{dn}) \right],
$$

and try to identify

$$
\operatorname*{argmin}_{q \in \Gamma} \ \mathrm{KL}\left( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \right)
$$


KL Minimization

- Note that

$$
\mathrm{KL}(q, p) = \mathrm{E}_q\left[ \log \frac{q(\theta, z \mid \gamma, \phi)}{p(\theta, z \mid w, \alpha, \beta)} \right] = -H(q) + \log p(w \mid \alpha, \beta) - \mathrm{E}_q\left[ \log p(\theta, z, w \mid \alpha, \beta) \right],
$$

and that the middle term (the "evidence") is irrelevant to our optimization.

- Hence, find γ*, φ* that maximize

$$
\mathrm{E}_q\left[ \log p(\theta, z, w \mid \alpha, \beta) \right] + H(q),
$$

the "evidence lower bound" (ELBO).


KL Minimization

- The ELBO can be written explicitly (though it's not pretty),

$$
\begin{aligned}
&\sum_{d=1}^{D} \sum_{k=1}^{K} (\alpha_k - 1)\, \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d]
+ \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{dnk}\, \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d]
+ \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \mathbb{I}(w_{dn} = v)\, \phi_{dnk} \log \beta_{vk} \\
&\quad - \sum_{d=1}^{D} \left[ \log \Gamma\Big( \sum_{k=1}^{K} \gamma_{dk} \Big)
- \sum_{k=1}^{K} \log \Gamma(\gamma_{dk})
+ \sum_{k=1}^{K} (\gamma_{dk} - 1)\, \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d] \right]
- \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{dnk} \log \phi_{dnk},
\end{aligned}
$$

where we have omitted constants in γ, φ.


KL Minimization

- The point is that we can perform coordinate ascent on the parameters φ and γ to find locally optimal φ* and γ*
- The updates (sketched in code below) look like

$$
\phi_{dnk} \propto \beta_{w_{dn},k} \exp\left( \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d] \right), \qquad
\gamma_{dk} = \alpha_k + \sum_{n=1}^{N} \phi_{dnk}
$$

- Interpretation:
  - The first update is like p(z_dn | w_dn) ∝ p(w_dn | z_dn) p(z_dn)
  - The second update is like a Dirichlet posterior update upon observing data φ_dnk
  - φ_dnk are the same across occurrences of the same term → saves memory
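A minimal sketch of these per-document updates, assuming β is stored as a K × V array with rows summing to one (as in the earlier simulation) and using the identity E_q[log θ_dk | γ_d] = ψ(γ_dk) − ψ(Σ_j γ_dj); the initialization and fixed iteration count are conventional choices rather than anything prescribed here.

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(w_d, alpha, beta, n_iters=50):
    """Coordinate ascent on (phi, gamma) for one document w_d of term indices."""
    K = alpha.shape[0]
    N = w_d.shape[0]
    phi = np.full((N, K), 1.0 / K)                 # phi[n, k] approximates q(z_dn = k); uniform start
    gamma = alpha + N / K                          # conventional starting point for gamma_d
    for _ in range(n_iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())   # E_q[log theta_dk | gamma_d]
        phi = beta[:, w_d].T * np.exp(e_log_theta)             # beta_{w_dn, k} * exp(E_q[log theta_dk])
        phi /= phi.sum(axis=1, keepdims=True)                  # normalize each word's topic weights
        gamma = alpha + phi.sum(axis=0)                        # Dirichlet-style update from expected counts
    return phi, gamma
```

With the simulated corpus from the earlier sketch, phi, gamma = variational_e_step(docs[0], alpha, beta) returns the approximate local posterior for the first document.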


Estimating α, β

- So far, we have assumed the fixed parameters α, β are known, when in practice they aren't
- (Blei et al.) propose two approaches (the β update is sketched below):
  - Variational EM: here, the ELBO takes the place of the usual expected complete log-likelihood, and we alternate between optimizing φ_dnk, γ_d (variational E-step) and α, β (variational M-step)
  - Smoothed Variational Bayes: place a Dirichlet prior on β and introduce it into the variational approximation. The variational M-step now only optimizes α.
- The smoothed Bayesian approach is better when ML estimates of β are unreliable (e.g., when data are sparse)
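As a companion to the E-step sketch, here is a minimal (unsmoothed) M-step for β that accumulates expected term counts from the per-document φ's; the Newton-style update for α is omitted, and the function and argument names are illustrative.

```python
import numpy as np

def m_step_beta(docs, phis, K, V):
    """docs: list of term-index arrays; phis: matching list of (N_d, K) phi arrays."""
    expected_counts = np.zeros((K, V))
    for w_d, phi_d in zip(docs, phis):
        for n, v in enumerate(w_d):
            expected_counts[:, v] += phi_d[n]      # expected number of times each topic emits term v
    return expected_counts / expected_counts.sum(axis=1, keepdims=True)   # normalize each topic over terms
```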


Conclusion


Examples

(Blei et al.) compare approaches to a variety of topic modeling tasks (a present-day fitting sketch follows the list):

- Directly fitting to the Associated Press corpus, evaluated using held-out likelihood
- As preprocessing for classification on the Reuters data
- Collaborative filtering – evaluate likelihood on held-out movies instead of words
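As a present-day counterpart to these experiments, here is a minimal sketch of fitting LDA with scikit-learn's variational implementation; the toy corpus and parameter settings are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the stock market fell sharply",
    "the team won the final game",
    "investors sold shares as prices fell",
    "the coach praised the players after the game",
]
X = CountVectorizer().fit_transform(corpus)              # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                             # rows: per-document topic proportions
print(theta.round(2))                                    # mixing proportions for each document
print(lda.components_.shape)                             # (n_components, vocabulary size) topic-term weights
```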


Conclusion

- The basic LDA model can be easily extended by removing various exchangeability assumptions (D. M. Blei and J. D. Lafferty; D. Blei and J. Lafferty; Lacoste-Julien et al.)
- More generally, the three-level hierarchical Bayesian idea opens the door to a variety of "mixed-membership" models (Airoldi et al.; Erosheva and Fienberg; Mackey et al.; Fox and Jordan)
- Alternative MCMC, variational inference, and method of moments techniques are still an active area of research (M. Hoffman et al.; Anandkumar et al.; M. D. Hoffman et al.; Teh et al.)


References

Airoldi, Edoardo M., et al. "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research, vol. 9, no. Sep, 2008, pp. 1981–2014.

Anandkumar, Anima, et al. "A Spectral Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2012, pp. 917–925.

Blei, David M., and John D. Lafferty. "Dynamic Topic Models." Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

Blei, David M., et al. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, no. Jan, 2003, pp. 993–1022.

Blei, David, and John Lafferty. "Correlated Topic Models." Advances in Neural Information Processing Systems, vol. 18, 2006, p. 147.

Erosheva, Elena A., and Stephen E. Fienberg. "Bayesian Mixed Membership Models for Soft Clustering and Classification." Classification—The Ubiquitous Challenge, Springer, 2005, pp. 11–26.

Fox, Emily B., and Michael I. Jordan. "Mixed Membership Models for Time Series." ArXiv Preprint ArXiv:1309.3533, 2013.

Hoffman, Matthew D., et al. "Stochastic Variational Inference." Journal of Machine Learning Research, vol. 14, no. 1, 2013, pp. 1303–1347.

Hoffman, Matthew, et al. "Online Learning for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2010, pp. 856–864.

Lacoste-Julien, Simon, et al. "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification." Advances in Neural Information Processing Systems, 2009, pp. 897–904.

Mackey, Lester W., et al. "Mixed Membership Matrix Factorization." Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 711–718.

Teh, Yee W., et al. "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2006, pp. 1353–1360.
