Latent Dirichlet Allocation - Stanford University (statweb.stanford.edu/~kriss1/lda_intro.pdf)


Latent Dirichlet Allocation (Blei et al.)

Kris Sankaran

2016-11-14


Introduction


Agenda

- Generative Mechanism (15 minutes): What is the proposed model, and how does it differ from what existed before?
- Interpretations (10 minutes): What are alternative ways to understand the model?
- Model Inference (15 minutes): How would we fit this model in practice?
- Examples and Conclusion (10 minutes): Why might we fit LDA in practice, and what are its limitations?


Context and Motivation

- Motivated by topic modeling:
  - Building interpretable representations of text data
  - Designing preprocessing steps for classification or information retrieval
- This said, LDA is not necessarily tied to text analysis
- Generative Modeling: design unified probabilistic models
  - Is explicit about assumptions, feels less ad hoc
  - Gives access to a (large) Bayesian inference literature
  - Can be used as a module in larger probabilistic models


Generative Model


Latent Dirichlet Allocation

- For the nth word in document d (a simulation sketch follows below),

$$
\begin{aligned}
w_{dn} \mid z_{dn} = k,\ \beta &\sim \mathrm{Cat}(\beta_{\cdot k}) \\
z_{dn} \mid \theta_d &\sim \mathrm{Cat}(\theta_d) \\
\theta_d \mid \alpha &\sim \mathrm{Dir}(\alpha)
\end{aligned}
$$

- Mnemonics:
  - w_dn ∈ {1, ..., V} is the term used as the nth word in document d
  - z_dn ∈ {1, ..., K} is the topic associated with the nth word in document d
  - θ_d ∈ S^{K−1} are the topic mixture proportions for document d
  - β_{·k} ∈ S^{V−1} are the term mixture proportions for topic k
  - α is the topic shrinkage parameter
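To make the generative mechanism concrete, here is a minimal NumPy sketch of the sampling scheme above. The corpus sizes D, N, K, V, the Dirichlet parameters, and the choice to store β with topics as rows (beta[k] = β_{·k}) are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K, V = 20, 50, 3, 100                       # hypothetical corpus dimensions
alpha = np.full(K, 0.5)                           # topic shrinkage parameter
beta = rng.dirichlet(np.full(V, 0.1), size=K)     # beta[k] = term proportions for topic k

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)                            # theta_d | alpha ~ Dir(alpha)
    z_d = rng.choice(K, size=N, p=theta_d)                    # z_dn | theta_d ~ Cat(theta_d)
    w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])   # w_dn | z_dn = k ~ Cat(beta[k])
    docs.append(w_d)
```

Each entry of docs is a length-N vector of term indices, i.e., one simulated document.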


Latent Dirichlet Allocation

[Plate diagram: α → θ_d → z_dn → w_dn ← β, with plates over the N words in each document and the D documents]

- w are observed data
- α, β are fixed, global parameters
- θ, z are random, local parameters

Observed Counts (sum of w_dn's)

[Figure: heatmap of observed term counts across documents and words; legend: count 0–30]

Mixing Proportions (θ_d's)

[Figure: heatmap of mixing proportions θ across documents and topics; legend: theta 0.25–0.75]

Topic Counts (sum of z_dn's)

[Figure: heatmaps of topic counts (sums of z_dn) across documents and words, one panel per topic (1–3); legend: z counts 10–30]

Latent Dirichlet Allocation (β)

[Figure: heatmap of β across topics and words; legend: beta 0.01–0.03]

Unigram Model

- It can be illustrative to compare with earlier topic modeling approaches
- The unigram model draws all words from the same multinomial, w_dn ∼ Cat(β)

[Plate diagram: w_dn ∼ Cat(β), with plates over N words and D documents]

Mixture of Unigrams

- This is the multinomial analog of Gaussian mixture models
- Each word is drawn from a mixture of K topics (sampled in the sketch below):

$$
z_d \sim p(z), \qquad w_{dn} \mid z_d \sim \mathrm{Cat}(\beta_{\cdot z_d})
$$

- Topic assignment is drawn at the document level

[Plate diagram: z_d → w_dn ← β, with z_d shared by all words in document d; plates over N words and D documents]
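For contrast with LDA, here is a minimal sketch of sampling from the mixture of unigrams under the same illustrative sizes as the earlier simulation: the topic z_d is drawn once per document and reused for every word in it.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K, V = 20, 50, 3, 100
topic_probs = rng.dirichlet(np.ones(K))            # a single global p(z)
beta = rng.dirichlet(np.full(V, 0.1), size=K)      # beta[k] = term proportions for topic k

docs = []
for d in range(D):
    z_d = rng.choice(K, p=topic_probs)             # z_d ~ p(z), drawn once per document
    w_d = rng.choice(V, size=N, p=beta[z_d])       # w_dn | z_d ~ Cat(beta[z_d]) for every word
    docs.append(w_d)
```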

Probabilistic Latent Semantic Indexing (pLSI)

- pLSI draws a different topic for each word in the document,

$$
z_{dn} \mid d \sim p(z_{dn} \mid d), \qquad w_{dn} \mid z_{dn} \sim \mathrm{Cat}(\beta_{\cdot z_{dn}})
$$

- The per-document topic mixture proportions are nonrandom and different for each document
- The number of fixed parameters grows linearly with the number of documents

[Plate diagram: d → z_dn → w_dn ← β, with plates over N words and D documents]

Back to LDA

- Essential difference: randomness in the topic mixture proportions lets us share information across documents
- The number of fixed parameters does not grow with the number of documents

[Plate diagram: α → θ_d → z_dn → w_dn ← β, with plates over N words and D documents]

Interpretations


Geometric

- Each topic is a point on the simplex, and the K topics determine a topics simplex
- The mixture of unigrams model gives each document a corner of the topics simplex
- pLSI estimates the empirical distribution of observed mixing proportions
- LDA estimates a smooth density over the topics simplex


Matrix Factorization

- We can think of topics as latent factors and mixing proportions as document scores,

$$
p(w_{dn} = v \mid \theta_d, \beta) = \sum_{k=1}^{K} p(w_{dn} = v \mid z_{dn} = k, \beta)\, p(z_{dn} = k) = \beta_{v\cdot}^{T}\, p(z_{dn})
$$

- The different models treat the p(z_dn)'s differently
- In LDA, this probability is β_{v·}^T θ_d (see the code sketch after the figure)

[Figure: the D × V matrix of probabilities p(w_dn = v) factors into document scores θ_dk (D × K) times topic loadings β_kv (K × V)]
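A small numerical sketch of this factorization, with hypothetical Θ and B filled in at random: the D × V matrix of per-document term probabilities is the product of the document scores (D × K) and the topic loadings (K × V).

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, V = 20, 3, 100
Theta = rng.dirichlet(np.full(K, 0.5), size=D)     # rows: document mixing proportions theta_d
B = rng.dirichlet(np.full(V, 0.1), size=K)         # rows: topic term proportions, B[k, v] = beta_vk

P = Theta @ B                             # P[d, v] = sum_k theta_dk * beta_vk = p(w_dn = v | theta_d, beta)
assert np.allclose(P.sum(axis=1), 1.0)    # each row of P is a distribution over terms
```

Each row of P sums to one because every row of Theta and of B does.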

Inference


Variational Bayes

- As scientists / modelers, our primary interest is in the posterior p(θ, z | w, α, β) after observing the words w
- This is not available in closed form (the normalizing constant is intractable)
- In practice, we also need to estimate α and β – more on this later


Variational Bayes

- (Blei et al.) propose a variational approach
  - Turns Bayesian inference into an optimization problem
- Specifically, consider the family Γ of q's that factor like

$$
q(\theta, z \mid \gamma, \phi) = \prod_{d=1}^{D} \left[ \mathrm{Dir}(\theta_d \mid \gamma_d) \prod_{n=1}^{N} \mathrm{Cat}(z_{dn} \mid \phi_{dn}) \right],
$$

and try to identify

$$
\operatorname*{argmin}_{q \in \Gamma} \ \mathrm{KL}\left( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \right)
$$


KL Minimization

- Note that

$$
\mathrm{KL}(q, p) = \mathrm{E}_q\left[ \log \frac{q(\theta, z \mid \gamma, \phi)}{p(\theta, z \mid w, \alpha, \beta)} \right] = -H(q) + \log p(w \mid \alpha, \beta) - \mathrm{E}_q\left[ \log p(\theta, z, w \mid \alpha, \beta) \right],
$$

and that the middle term (the "evidence") is irrelevant to our optimization.

- Hence, find γ*, φ* that maximize

$$
\mathrm{E}_q\left[ \log p(\theta, z, w \mid \alpha, \beta) \right] + H(q),
$$

the "evidence lower bound" (ELBO).


KL Minimization

- The ELBO can be written explicitly (though it's not pretty),

$$
\begin{aligned}
&\sum_{d=1}^{D} \sum_{k=1}^{K} (\alpha_k - 1)\, \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d]
+ \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{dnk}\, \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d]
+ \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \mathbb{I}(w_{dn} = v)\, \phi_{dnk} \log \beta_{vk} \\
&\quad - \sum_{d=1}^{D} \left[ \log \Gamma\Big( \sum_{k=1}^{K} \gamma_{dk} \Big)
- \sum_{k=1}^{K} \log \Gamma(\gamma_{dk})
+ \sum_{k=1}^{K} (\gamma_{dk} - 1)\, \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d] \right]
- \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{dnk} \log \phi_{dnk},
\end{aligned}
$$

where we have omitted constants in γ, φ.


KL Minimization

- The point is that we can perform coordinate ascent on the parameters φ and γ to find locally optimal φ* and γ*
- The updates (sketched in code below) look like

$$
\phi_{dnk} \propto \beta_{w_{dn},k} \exp\left( \mathrm{E}_q[\log \theta_{dk} \mid \gamma_d] \right), \qquad
\gamma_{dk} = \alpha_k + \sum_{n=1}^{N} \phi_{dnk}
$$

- Interpretation:
  - The first update is like p(z_dn | w_dn) ∝ p(w_dn | z_dn) p(z_dn)
  - The second update is like a Dirichlet posterior update upon observing data φ_dnk
  - φ_dnk are the same across occurrences of the same term → saves memory
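A minimal sketch of these per-document updates, assuming β is stored as a K × V array with rows summing to one (as in the earlier simulation) and using the identity E_q[log θ_dk | γ_d] = ψ(γ_dk) − ψ(Σ_j γ_dj); the initialization and fixed iteration count are conventional choices rather than anything prescribed here.

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(w_d, alpha, beta, n_iters=50):
    """Coordinate ascent on (phi, gamma) for one document w_d of term indices."""
    K = alpha.shape[0]
    N = w_d.shape[0]
    phi = np.full((N, K), 1.0 / K)                 # phi[n, k] approximates q(z_dn = k); uniform start
    gamma = alpha + N / K                          # conventional starting point for gamma_d
    for _ in range(n_iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())   # E_q[log theta_dk | gamma_d]
        phi = beta[:, w_d].T * np.exp(e_log_theta)             # beta_{w_dn, k} * exp(E_q[log theta_dk])
        phi /= phi.sum(axis=1, keepdims=True)                  # normalize each word's topic weights
        gamma = alpha + phi.sum(axis=0)                        # Dirichlet-style update from expected counts
    return phi, gamma
```

With the simulated corpus from the earlier sketch, phi, gamma = variational_e_step(docs[0], alpha, beta) returns the approximate local posterior for the first document.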


Estimating α, β

- So far, we have assumed the fixed parameters α, β are known, when in practice they aren't
- (Blei et al.) propose two approaches (the β update is sketched below):
  - Variational EM: here, the ELBO takes the place of the usual expected complete log-likelihood, and we alternate between optimizing φ_dnk, γ_d (variational E-step) and α, β (variational M-step)
  - Smoothed Variational Bayes: place a Dirichlet prior on β and introduce it into the variational approximation. The variational M-step now only optimizes α.
- The smoothed Bayesian approach is better when ML estimates of β are unreliable (e.g., when data are sparse)
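As a companion to the E-step sketch, here is a minimal (unsmoothed) M-step for β that accumulates expected term counts from the per-document φ's; the Newton-style update for α is omitted, and the function and argument names are illustrative.

```python
import numpy as np

def m_step_beta(docs, phis, K, V):
    """docs: list of term-index arrays; phis: matching list of (N_d, K) phi arrays."""
    expected_counts = np.zeros((K, V))
    for w_d, phi_d in zip(docs, phis):
        for n, v in enumerate(w_d):
            expected_counts[:, v] += phi_d[n]      # expected number of times each topic emits term v
    return expected_counts / expected_counts.sum(axis=1, keepdims=True)   # normalize each topic over terms
```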


Conclusion


Examples

(Blei et al.) compare approaches to a variety of topic modeling tasks (a present-day fitting sketch follows the list):

- Directly fitting to the Associated Press corpus, evaluated using held-out likelihood
- As preprocessing for classification on the Reuters data
- Collaborative filtering – evaluate likelihood on held-out movies instead of words
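As a present-day counterpart to these experiments, here is a minimal sketch of fitting LDA with scikit-learn's variational implementation; the toy corpus and parameter settings are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the stock market fell sharply",
    "the team won the final game",
    "investors sold shares as prices fell",
    "the coach praised the players after the game",
]
X = CountVectorizer().fit_transform(corpus)              # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                             # rows: per-document topic proportions
print(theta.round(2))                                    # mixing proportions for each document
print(lda.components_.shape)                             # (n_components, vocabulary size) topic-term weights
```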


Conclusion

- The basic LDA model can be easily extended by removing various exchangeability assumptions (D. M. Blei and J. D. Lafferty; D. Blei and J. Lafferty; Lacoste-Julien et al.)
- More generally, the three-level hierarchical Bayesian idea opens the door to a variety of "mixed-membership" models (Airoldi et al.; Erosheva and Fienberg; Mackey et al.; Fox and Jordan)
- Alternative MCMC, variational inference, and method of moments techniques are still an active area of research (M. Hoffman et al.; Anandkumar et al.; M. D. Hoffman et al.; Teh et al.)


References

Airoldi, Edoardo M., et al. "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research, vol. 9, no. Sep, 2008, pp. 1981–2014.

Anandkumar, Anima, et al. "A Spectral Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2012, pp. 917–925.

Blei, David M., and John D. Lafferty. "Dynamic Topic Models." Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

Blei, David M., et al. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, no. Jan, 2003, pp. 993–1022.

Blei, David, and John Lafferty. "Correlated Topic Models." Advances in Neural Information Processing Systems, vol. 18, 2006, p. 147.

Erosheva, Elena A., and Stephen E. Fienberg. "Bayesian Mixed Membership Models for Soft Clustering and Classification." Classification—The Ubiquitous Challenge, Springer, 2005, pp. 11–26.

Fox, Emily B., and Michael I. Jordan. "Mixed Membership Models for Time Series." ArXiv Preprint ArXiv:1309.3533, 2013.

Hoffman, Matthew D., et al. "Stochastic Variational Inference." Journal of Machine Learning Research, vol. 14, no. 1, 2013, pp. 1303–1347.

Hoffman, Matthew, et al. "Online Learning for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2010, pp. 856–864.

Lacoste-Julien, Simon, et al. "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification." Advances in Neural Information Processing Systems, 2009, pp. 897–904.

Mackey, Lester W., et al. "Mixed Membership Matrix Factorization." Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 711–718.

Teh, Yee W., et al. "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2006, pp. 1353–1360.
