Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng, Michael I. Jordan
Presenter: Suin Kim, UILAB, KAIST
March 08, 2011
CS774, March 08, 2011 Suin Kim, KAIST
Today I’ll talk about...
• Mathematical background
• Variational Bayesian method
• Parameter estimation
• Examples, applications and empirical results
• Preliminary parts are omitted.
• You are assumed to be familiar with the basic LDA model.
• Please interrupt me if you have any question.
Exchangeability
• Let {z1, z2, ..., zn} be a finite set of random variables.
• {z1, z2, ..., zn} is exchangeable if the joint distribution is invariant under permutation.
• “Future samples behave like earlier samples.”
• “The future being predictable on the basis of past experience.”
• Exchangeable variables are conditionally independent and identically distributed given a latent parameter (de Finetti); i.i.d. ⇒ exchangeable, but not conversely.
• Exchangeability is a major simplifying assumption.
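The “exchangeable but not i.i.d.” distinction can be checked numerically. The sketch below (my own illustration, not from the slides) uses a Beta-Bernoulli mixture: integrating out the latent θ gives a closed-form joint probability that depends only on the count of ones, so every permutation of a sequence has the same probability, yet the variables are not independent.

```python
import itertools
from math import exp, lgamma

def betaln(x, y):
    # log of the Beta function, via log-gamma
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def joint_prob(seq, a=2.0, b=2.0):
    """P(z_1..z_n) for Bernoulli variables mixed over a Beta(a, b) prior:
    integrating theta^k (1-theta)^(n-k) against Beta(a, b) gives
    B(a+k, b+n-k) / B(a, b), a function of the count k only."""
    n, k = len(seq), sum(seq)
    return exp(betaln(a + k, b + n - k) - betaln(a, b))

seq = (1, 1, 0, 1, 0)
probs = [joint_prob(p) for p in itertools.permutations(seq)]
# Exchangeable: every permutation of the sequence has the same probability.
assert max(probs) - min(probs) < 1e-15

# ...but the variables are NOT independent: seeing z1 = 1 raises P(z2 = 1).
p_z2 = joint_prob((1,))                               # P(z2 = 1) = 0.5
p_z2_given_z1 = joint_prob((1, 1)) / joint_prob((1,)) # = 0.6
```

Conditioning on the latent θ, the zi become i.i.d., which is exactly the structure LDA exploits within a document.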
Exchangeability
• In LDA, what variables are exchangeable?
• Words are generated by topics
• Those topics are infinitely exchangeable within a document
Perplexity
• Perplexity is a measure from information theory.
• For an unknown true distribution p, how well does a proposed model q predict the data x?
• The perplexity of model q is 2^{−(1/N) Σ_{i=1}^N log2 q(x_i)}.
• A better q assigns higher probabilities q(x_i), resulting in lower perplexity.
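As a quick sanity check (my own sketch, not from the slides), perplexity can be computed directly from the probabilities a model assigns to held-out observations; a uniform model over 8 outcomes behaves like an 8-sided die:

```python
import math

def perplexity(q_probs):
    """Perplexity 2^(-(1/N) * sum_i log2 q(x_i)) of the probabilities
    q(x_i) a model assigns to N held-out observations."""
    n = len(q_probs)
    return 2 ** (-sum(math.log2(q) for q in q_probs) / n)

# A model assigning uniform probability 1/8 to every observation has
# perplexity 8: it is "as confused as" a fair 8-sided die.
print(perplexity([1 / 8] * 10))   # 8.0
# A sharper model puts more probability on the data -> lower perplexity.
print(perplexity([1 / 2] * 10))   # 2.0
```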
Approximate Inference
• Model contains latent variables Z and observable X
• Sometimes the posterior distribution p(Z|X) is intractable because
• Dimensionality of the latent space is too high
• Posterior distribution has a highly complex form
• Stochastic approximations
• MCMC
• Deterministic approximations
• Variational inference (variational Bayes)
Variational Bayesian methods
• Also called ensemble learning
• A family of techniques for approximating intractable integrals
• VB methods can
• provide an analytical approximation to the posterior probability of the unobserved variables
• derive a lower bound on the marginal likelihood
• Alternative to Monte Carlo sampling methods
Variational optimization on a Bayesian model
• Set of all latent variables and parameters Z
• Set of all observed variables X
• Our model specifies joint distribution P(X, Z)
• We want to find an approximation for P(Z|X) and P(X)
• p(Z|X) = p(Z,X) / p(X)
• In LDA, p(θ, z|w, α, β) = p(θ, z, w|α, β) / p(w|α, β) is intractable, since p(w|α, β) is intractable.
Variational optimization on a Bayesian model
• We can decompose the log marginal probability as
ln p(X) = L(q) + KL(q||p), where
L(q) = ∫ q(Z) ln( p(X,Z) / q(Z) ) dZ
KL(q||p) = −∫ q(Z) ln( p(Z|X) / q(Z) ) dZ
• L(q) is a lower bound on ln p(X).
• KL(q||p) is the divergence between q and p: the smaller the divergence, the better the approximation.
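The decomposition can be verified exactly on a toy discrete model (my own sketch with made-up probabilities, not from the slides): with one observation x and a binary latent z, the integrals become two-term sums.

```python
import math

# Hypothetical joint values p(x, z) for one observation x, z in {0, 1}.
p_joint = {0: 0.12, 1: 0.28}
p_x = sum(p_joint.values())                        # marginal p(x) = 0.4
p_post = {z: p / p_x for z, p in p_joint.items()}  # posterior p(z|x)

q = {0: 0.5, 1: 0.5}   # an arbitrary variational distribution

L = sum(q[z] * math.log(p_joint[z] / q[z]) for z in q)   # lower bound L(q)
KL = sum(q[z] * math.log(q[z] / p_post[z]) for z in q)   # KL(q || p(z|x))

# ln p(x) = L(q) + KL(q||p) holds exactly, and KL >= 0 gives L(q) <= ln p(x).
assert abs(math.log(p_x) - (L + KL)) < 1e-12
assert L <= math.log(p_x)
```

Since ln p(x) is fixed, shrinking KL(q||p) and raising L(q) are the same optimization, which is the basis of the next slide.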
Variational optimization on a Bayesian model
• ln p(X) = L(q) + KL(q||p)
• We want to minimize the KL divergence
• by optimizing with respect to q(Z)
• This is equivalent to maximizing the lower bound L(q)
• We choose a tractable family of distributions q(Z) and seek the member of this family that minimizes the KL divergence
• No overfitting!
• Why?
Variational optimization on a Bayesian model
• How can we choose approximating distribution?
1. Use a parametric distribution q(Z|ω) governed by parameters ω
• L(q) becomes a function of ω
2. Choose q from a family of factorized distributions.
• Example: the variational distribution is a Gaussian, optimized with respect to μ and σ.
Parameter estimation on LDA
• p(θ, z|w, α, β) = p(θ, z, w|α, β) / p(w|α, β) is intractable
• We define q(θ, z|γ, φ) = q(θ|γ) ∏_{n=1}^N q(z_n|φ_n) as a tractable surrogate of p.
• We approximate α, β, γ, φ with a variational EM algorithm
• E-step: minimize the KL divergence with respect to the variational parameters, yielding (γ*_d, φ*_d) for each document
• M-step: find the α and β that maximize the resulting lower bound on the likelihood
Parameter estimation on LDA
• E-step
• Setting the derivatives ∂KL(q||p)/∂γ_i and ∂KL(q||p)/∂φ_ni to 0 minimizes the KL divergence:
• φ_ni ∝ β_{i,w_n} exp(E_q[log θ_i | γ])
• γ_i = α_i + Σ_{n=1}^N φ_ni
• We compute these coupled updates for all n and i until convergence.
• In the E-step, we have inferred φ and γ with α and β held fixed.
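The two coupled updates can be sketched as a fixed-point iteration for a single document (a minimal illustration assuming NumPy and SciPy are available; the tiny α, β values below are hypothetical, not from the paper). It uses E_q[log θ_i | γ] = Ψ(γ_i) − Ψ(Σ_j γ_j), the standard expectation of a log-Dirichlet coordinate.

```python
import numpy as np
from scipy.special import digamma

def e_step(doc_words, alpha, beta, n_iter=50):
    """Coordinate-ascent updates for one document:
        phi_ni  ∝ beta[i, w_n] * exp(E_q[log theta_i | gamma])
        gamma_i = alpha_i + sum_n phi_ni
    doc_words: word indices; alpha: shape (k,); beta: shape (k, V)."""
    N, k = len(doc_words), len(alpha)
    phi = np.full((N, k), 1.0 / k)   # uniform initialization
    gamma = alpha + N / k
    for _ in range(n_iter):
        # E_q[log theta_i | gamma] = digamma(gamma_i) - digamma(sum_j gamma_j)
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, doc_words].T * np.exp(e_log_theta)   # shape (N, k)
        phi /= phi.sum(axis=1, keepdims=True)              # normalize each row
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

# Hypothetical example: k = 2 topics over a 3-word vocabulary.
alpha = np.array([1.0, 1.0])
beta = np.array([[0.7, 0.2, 0.1],    # topic 0 favors word 0
                 [0.1, 0.2, 0.7]])   # topic 1 favors word 2
gamma, phi = e_step([0, 0, 2], alpha, beta)
print(gamma)   # more mass on topic 0: two of the three words are word 0
```

Note that Σ_i γ_i = Σ_i α_i + N regardless of the data, so only the split of γ across topics is informative.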
Parameter estimation on LDA
• M-step
• We wish to find the parameters that maximize the marginal log likelihood
l(α, β) = Σ_{d=1}^M log p(w_d|α, β)
• β has a closed-form update: β_ij ∝ Σ_{d=1}^M Σ_{n=1}^{N_d} φ_dni w_dn^j
• To find α, we must use an iterative method.
• We use a linear-time Newton-Raphson algorithm:
• the Hessian has the form ∂²l / ∂α_i ∂α_j = δ(i,j) M Ψ′(α_i) − M Ψ′(Σ_{j=1}^k α_j), i.e. a diagonal matrix plus a constant rank-one term.
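The reason each Newton step is linear-time: a Hessian of the form diag(h) + z·11ᵀ can be applied in inverse to the gradient in O(k) via the matrix inversion lemma, with no k×k solve. A sketch with hypothetical values (assuming NumPy; this illustrates the structure, not the full α update):

```python
import numpy as np

def solve_diag_plus_rank_one(h, z, g):
    """Solve (diag(h) + z * 1 1^T) x = g in O(k) time.
    By the Sherman-Morrison formula,
        x_i = (g_i - c) / h_i,  where
        c = (sum_j g_j / h_j) / (1/z + sum_j 1 / h_j)."""
    c = (g / h).sum() / (1.0 / z + (1.0 / h).sum())
    return (g - c) / h

# Check the O(k) solve against a dense solver on random values.
rng = np.random.default_rng(0)
k = 5
h = rng.uniform(1.0, 2.0, size=k)   # diagonal entries
z = 0.7                              # rank-one coefficient
g = rng.normal(size=k)               # gradient
H = np.diag(h) + z * np.ones((k, k))
assert np.allclose(solve_diag_plus_rank_one(h, z, g), np.linalg.solve(H, g))
```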
Parameter estimation on LDA
• Detailed derivation
• Please refer to appendix A.1, A.3 and A.4.
Smoothing
• A new document is likely to contain words that did not appear in the training corpus.
• P(new word) = 0
• P(document containing a new word) = ∏_i P(word_i) = 0
• “Smooth” the multinomial parameters
• Always assign positive probability to every word
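The failure mode and the fix fit in a few lines (my own sketch with made-up counts): one unseen word zeroes the whole document probability, while additive (Laplace-style) smoothing keeps it positive.

```python
def word_probs(counts, vocab_size, pseudocount=0.0):
    """Multinomial word probabilities from counts, with optional
    additive smoothing: p(w) = (count(w) + c) / (total + c * V)."""
    total = sum(counts.values()) + pseudocount * vocab_size
    return lambda w: (counts.get(w, 0) + pseudocount) / total

counts = {"topic": 3, "model": 2}   # tiny hypothetical training corpus
V = 1000                            # vocabulary size

unsmoothed = word_probs(counts, V)
smoothed = word_probs(counts, V, pseudocount=0.1)

doc = ["topic", "model", "dirichlet"]   # "dirichlet" was never seen
def doc_prob(p):
    out = 1.0
    for w in doc:
        out *= p(w)
    return out

print(doc_prob(unsmoothed))   # 0.0 -- one unseen word zeroes the product
print(doc_prob(smoothed))     # small but positive
```

The paper's fuller treatment places a Dirichlet prior on each row of β rather than adding a fixed pseudocount, but the effect on unseen words is the same.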
Experiment
• TREC AP corpus, 16K documents
• 100-topic LDA model
• Using variational inference, the authors computed
• the Dirichlet parameters γ for the article
• the multinomial parameters φ_n for each word in the article
• They listed the topics with the largest expected number of words in the article.
• Since γ_i = α_i + Σ_{n=1}^N φ_ni, significant topics have large γ_i − α_i ≈ E[number of words generated by topic i].
Experiment
• What is the problem of the “bag-of-words” model?
Experiment
• What is the problem of the “bag-of-words” model?
• Words are separated into different topics.
• How can we extend basic LDA model to overcome this?
Applications
• Document modeling
• Generalization performance of models
• Achieve high likelihood on a held-out test set
• Document classification
• Classify a document into two or more mutually exclusive classes
• Collaborative filtering
• Train a model on a fully observed set of users
• Predict movies that a user prefers
Document modeling
• For a test set of M documents, perplexity(D_test) = exp{ − Σ_{d=1}^M log p(w_d) / Σ_{d=1}^M N_d }
• Compared LDA, unigram, pLSI, and mixture-of-unigrams models.
• pLSI and the mixture of unigrams suffered from overfitting.
• Why is LDA free from overfitting?
pLSI vs LDA
• What is the key difference of pLSI and LDA?
Diagram from Yohan Jo’s slide
pLSI vs LDA
• If θ and φ are not constrained by the Dirichlet priors α and β, the model is pLSI.
• Since θ and φ are then free to fit the training corpus as closely as possible, pLSI overfits to the training data.
[Plate diagram with nodes α, θ, z, w, φ, β and plates D, T, Nd, from Yohan Jo’s slide]
Document classification
• Generative model for classification
• Use one LDA model per class
• Choice of features
• Use LDA to reduce the feature set to a fixed set of real values
• The posterior Dirichlet parameters γ*(w)
• Document classifier using LDA:
1. Estimate the parameters of LDA on all documents
2. Train an SVM on the resulting γ*(w) features
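The LDA-features-then-SVM pipeline might be sketched with scikit-learn (an analogue, not the authors' original setup: `LatentDirichletAllocation.fit_transform` returns the per-document topic distribution, which plays the role of the posterior parameters; the four toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

docs = ["stock market trading price", "market price stock gains",
        "match team score goal", "team goal score league"]
labels = [0, 0, 1, 1]   # e.g. finance vs. sports

# Bag-of-words counts, then reduce each document to k topic proportions.
X_counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_counts)   # shape (n_docs, 2), rows sum to 1

# Train an SVM on the low-dimensional topic representation.
clf = SVC(kernel="linear").fit(X_topics, labels)
print(clf.predict(X_topics))
```

The point of the experiment is that a k-dimensional topic vector can replace a |V|-dimensional word-count vector with little loss in classification accuracy.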
Collaborative filtering
• Uses EachMovie collaborative filtering data
• A collection of users’ indications of their preferred movies
• User : Document, Movie : Word
[Diagram: for an unseen document (user), show the observed movies and predict a held-out movie]
Conclusion
• LDA is based on a simple exchangeability assumption for the words and topics in a document
• LDA as a dimensionality reduction technique
• Exact inference is intractable for LDA
• We can use various approximate inference algorithms
• LDA has high modularity and extensibility
• A variety of extensions of LDA can be considered