Page 1:

Latent Dirichlet Allocation

Kuan-Yu Menphis Chen

SLP Laboratory, NTNU, [email protected]

Main References:
1. D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003.
2. D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," in Proc. NIPS, 2002.
3. M. Steyvers and T. Griffiths, "Probabilistic topic models," in T. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch (Eds.), Handbook of Latent Semantic Analysis. Hillsdale, NJ: Erlbaum, 2007.
4. J. Huang, "Maximum likelihood estimation of Dirichlet distribution parameters," Manuscript, 2006.

Page 2:

Outline

• Introduction

• Latent Dirichlet Allocation

• Relationship with Other Latent Variable Models

• Inference and Parameter Estimation

• Discussion

Page 3:

Introduction

• The tf-idf reduction has some appealing features
  – Notably in its basic identification of sets of words that are discriminative for documents in the collection
  – The approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure

• LSI uses a singular value decomposition of the word-document matrix to identify a linear subspace in the space of tf-idf features
  – LSI captures most of the variance in the collection
  – It can achieve significant compression in large collections

Page 4:

Introduction

• In PLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers

• This leads to several problems:
  – The number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting
  – It is not clear how to assign probability to a document outside of the training set

Page 5:

Introduction

• All of these methods (tf-idf, LSI, PLSI) are based on the "bag-of-words" assumption
  – The order of words in a document can be neglected

• In the language of probability theory, this is an assumption of exchangeability for the words in a document

• Moreover, although less often stated formally, these methods also assume that documents are exchangeable
  – The specific ordering of the documents in a corpus can also be neglected

Page 6:

Latent Dirichlet Allocation - Notation

• A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}
  – Using superscripts to denote components, the v-th word in the vocabulary is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v

• A document is a sequence of N words denoted by w = (w_1, w_2, ..., w_N), where w_n is the n-th word in the sequence

• A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}

Page 7:

Latent Dirichlet Allocation

• The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words

• LDA assumes the following generative process for each document w in a corpus D:
  1. Choose N ~ Poisson(ξ)
  2. Choose θ ~ Dir(α)
  3. For each of the N words w_n:
     a) Choose a topic z_n ~ Multinomial(θ)
     b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
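To make the generative process concrete, here is a minimal NumPy sketch that samples one document under exactly these three steps; the topic count k, vocabulary size V, and the values of α, β, and ξ are illustrative choices, not anything prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 4, 10                               # illustrative: 4 topics, 10-word vocabulary
alpha = np.full(k, 0.5)                    # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word probabilities
xi = 8                                     # Poisson mean for document length

def sample_document():
    """Sample one document (a list of word indices) from the LDA generative process."""
    N = rng.poisson(xi)                    # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z_n = rng.choice(k, p=theta)       # 3a. z_n ~ Multinomial(theta)
        w_n = rng.choice(V, p=beta[z_n])   # 3b. w_n from p(w | z_n, beta)
        words.append(w_n)
    return words

print(sample_document())
```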

Page 8:

Latent Dirichlet Allocation

• Several simplifying assumptions are made:
  – The dimensionality k of the Dirichlet distribution is assumed known and fixed
  – The word probabilities are parameterized by a k × V matrix β, which we treat as a fixed quantity that is to be estimated
  – The Poisson assumption is not critical to anything; note that the document length N is independent of all the other data-generating variables (θ and z)

• A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex, and has the following probability density:

p(θ | α) = ( Γ(∑_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) ) · θ_1^{α_1 − 1} ⋯ θ_k^{α_k − 1}
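As a quick numerical check of this density, the sketch below evaluates the formula directly and compares it against scipy.stats.dirichlet; the particular α and θ values are arbitrary examples.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

alpha = np.array([0.8, 1.5, 2.0])   # example Dirichlet parameters (k = 3)
theta = np.array([0.2, 0.3, 0.5])   # a point on the 2-simplex

# log p(theta | alpha) = log Gamma(sum a_i) - sum log Gamma(a_i) + sum (a_i - 1) log theta_i
log_p = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * np.log(theta)).sum()

print(np.exp(log_p), dirichlet.pdf(theta, alpha))   # the two values should agree
```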

Page 9:

Latent Dirichlet Allocation

• By placing a Dirichlet prior on the topic distribution θ, the result is a smoothed topic distribution, with the amount of smoothing determined by the parameter α

• The Dirichlet prior can be interpreted as forces on the topic combinations, with higher α moving the topics away from the corners of the simplex, leading to more smoothing

Page 10:

Latent Dirichlet Allocation

• Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

• Integrating over θ and summing over z, we obtain the marginal distribution of a document:

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} ∑_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

• Taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} ∑_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d
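The integral over θ has no closed form, but for a toy model the document marginal can be approximated by Monte Carlo: sample θ from Dir(α) and average the bracketed product of sums. This is only an illustrative sketch with made-up α and β, not an efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 4, 10
alpha = np.full(k, 0.5)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word matrix
doc = [2, 7, 7, 1, 4]                      # word indices of one toy document

def marginal_likelihood(doc, alpha, beta, n_samples=100_000):
    """Monte Carlo estimate of p(w | alpha, beta) = E_theta[ prod_n sum_z theta_z * beta_{z, w_n} ]."""
    thetas = rng.dirichlet(alpha, size=n_samples)   # samples from p(theta | alpha)
    word_probs = thetas @ beta[:, doc]              # (n_samples, N): sum_z theta_z * beta_{z, w_n}
    return word_probs.prod(axis=1).mean()           # average the product over words

print(marginal_likelihood(doc, alpha, beta))
```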

Page 11:

Latent Dirichlet Allocation

• The parameters α and β are corpus-level parameters

• The variables θ_d are document-level variables

• The variables z_dn and w_dn are word-level variables

Page 12:

Latent Dirichlet Allocation - Exchangeability

• A finite set of random variables {z_1, ..., z_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:

p(z_1, ..., z_N) = p(z_π(1), ..., z_π(N))

• It is important to emphasize that an assumption of exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed

• Rather, exchangeability essentially can be interpreted as meaning "conditionally independent and identically distributed," where the conditioning is with respect to an underlying latent parameter of a probability distribution

(cf. De Finetti's representation theorem: http://en.wikipedia.org/wiki/De_Finetti's_theorem)

Page 13:

Latent Dirichlet Allocation - Exchangeability

• In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document

• By de Finetti's theorem, the probability of a sequence of words and topics must therefore have the form:

p(w, z) = ∫ p(θ) ( ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n) ) dθ

(cf. De Finetti's representation theorem: http://en.wikipedia.org/wiki/De_Finetti's_theorem)

Page 14:

Latent Dirichlet Allocation

• An example density on unigram distributions p(w | θ, β) under LDA for three words and four topics:
  – The triangle embedded in the x-y plane is the 2-D simplex representing all possible multinomial distributions over three words
  – The four points marked with an x are the locations of the multinomial distributions p(w | z) for each of the four topics
  – The surface shown on top of the simplex is an example of a density over the (V −1)-simplex (multinomial distributions of words) given by LDA

Page 15:

Relationship with Other Latent Variable Models

• Unigram model
  – Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:

p(w) = ∏_{n=1}^{N} p(w_n)

• Mixture of unigrams
  – Under this mixture model, each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial:

p(w) = ∑_z p(z) ∏_{n=1}^{N} p(w_n | z)

  – The LDA model allows documents to exhibit multiple topics
  – There are k−1 parameters associated with p(z) in the mixture of unigrams, versus the k parameters associated with p(θ | α) in LDA
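For comparison, a tiny sketch of the mixture-of-unigrams document likelihood p(w) = ∑_z p(z) ∏_n p(w_n | z); the p(z) table, the conditional multinomials, and the document are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 6
p_z = np.array([0.5, 0.3, 0.2])                   # topic prior: k - 1 free parameters
p_w_given_z = rng.dirichlet(np.ones(V), size=k)   # k x V conditional multinomials
doc = [0, 4, 4, 2]                                # word indices of a toy document

# mixture of unigrams: p(w) = sum_z p(z) * prod_n p(w_n | z)
per_topic = p_w_given_z[:, doc].prod(axis=1)      # prod_n p(w_n | z) for each topic z
print(float(p_z @ per_topic))
```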

Page 16:

Relationship with Other Latent Variable Models

• Probabilistic latent semantic indexing (PLSI/PLSA)
  – The PLSI model posits that a document label d and a word w_n are conditionally independent given an unobserved topic z:

p(d, w_n) = p(d) ∑_z p(w_n | z) p(z | d)

  – It is important to note that d is a dummy index into the list of documents in the training set
    • The model learns the topic mixtures p(z | d) only for those documents on which it is trained, so PLSI is not a well-defined generative model
  – A further difficulty is that the number of parameters which must be estimated grows linearly with the number of training documents
    • It gives kV + kM parameters and therefore linear growth in M

Page 17:

Relationship with Other Latent Variable Models

• LDA overcomes both of PLSI's problems by:
  – Treating the topic mixture weights as a k-parameter hidden random variable
    • Rather than a large set of individual parameters which are explicitly linked to the training set
  – The k + kV parameters in a k-topic LDA model do not grow with the size of the training corpus

Page 18:

Relationship with Other Latent Variable Models

• A geometric interpretation:
  – The mixture of unigrams places each document at one of the corners of the topic simplex
  – The PLSI model induces an empirical distribution on the topic simplex denoted by x
  – LDA places a smooth distribution on the topic simplex denoted by the contour lines

Page 19:

Inference and Parameter Estimation - Inference

• The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

• Unfortunately, this distribution is intractable to compute in general. Writing out the marginal likelihood in terms of the model parameters:

p(w | α, β) = ( Γ(∑_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) ) ∫ ( ∏_{i=1}^{k} θ_i^{α_i − 1} ) ( ∏_{n=1}^{N} ∑_{i=1}^{k} ∏_{j=1}^{V} (θ_i β_ij)^{w_n^j} ) dθ

a function which is intractable due to the coupling between θ and β in the summation over latent topics

• Although the posterior distribution is intractable for exact inference, a wide variety of approximate inference algorithms can be considered for LDA

Page 20:

Inference and Parameter Estimation - Variational Inference

• The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood

• The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible lower bound

• A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed

Page 21:

Inference and Parameter Estimation - Variational Inference

• This family is characterized by the following variational distribution:

q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^{N} q(z_n | φ_n)

• The desideratum of finding a tight lower bound on the log likelihood translates directly into the following optimization problem:

(γ*, φ*) = argmin_{(γ, φ)} D( q(θ, z | γ, φ) || p(θ, z | w, α, β) )

  – i.e., by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior

Page 22:

Inference and Parameter Estimation - Variational Inference (1/3)

Show that: log p(w | α, β) ≥ L(γ, φ; α, β) = E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)]

Proof:

log p(w | α, β) = log ∫ ∑_z p(θ, z, w | α, β) dθ

= log ∫ ∑_z [ p(θ, z, w | α, β) q(θ, z | γ, φ) / q(θ, z | γ, φ) ] dθ

≥ ∫ ∑_z q(θ, z | γ, φ) log [ p(θ, z, w | α, β) / q(θ, z | γ, φ) ] dθ      (Jensen's inequality: log is concave, so log E_q[·] ≥ E_q[log(·)])

= ∫ ∑_z q(θ, z | γ, φ) log p(θ, z, w | α, β) dθ − ∫ ∑_z q(θ, z | γ, φ) log q(θ, z | γ, φ) dθ

= E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z | γ, φ)]

= L(γ, φ; α, β)      (lower bound)

Page 23:

Inference and Parameter Estimation - Variational Inference (2/3)

Show that: log p(w | α, β) = L(γ, φ; α, β) + D( q(θ, z | γ, φ) || p(θ, z | w, α, β) )

Proof:

D( q(θ, z | γ, φ) || p(θ, z | w, α, β) ) = ∫ ∑_z q(θ, z | γ, φ) log [ q(θ, z | γ, φ) / p(θ, z | w, α, β) ] dθ

= E_q[log q(θ, z | γ, φ)] − E_q[log p(θ, z | w, α, β)]

= E_q[log q(θ, z | γ, φ)] − E_q[log p(θ, z, w | α, β)] + log p(w | α, β)      (since p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β))

= −L(γ, φ; α, β) + log p(w | α, β)

∴ log p(w | α, β) = L(γ, φ; α, β) + D( q(θ, z | γ, φ) || p(θ, z | w, α, β) ), so maximizing the lower bound L with respect to (γ, φ) is equivalent to minimizing the KL divergence between the variational distribution and the true posterior

Page 24:

Inference and Parameter Estimation - Variational Inference (3/3)

• We can expand the lower bound by using the factorizations of p and q:

L(γ, φ; α, β) = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)] − E_q[log q(θ)] − E_q[log q(z)]

= log Γ(∑_{j=1}^{k} α_j) − ∑_{i=1}^{k} log Γ(α_i) + ∑_{i=1}^{k} (α_i − 1)( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )

 + ∑_{n=1}^{N} ∑_{i=1}^{k} φ_ni ( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )

 + ∑_{n=1}^{N} ∑_{i=1}^{k} ∑_{j=1}^{V} φ_ni w_n^j log β_ij

 − log Γ(∑_{j=1}^{k} γ_j) + ∑_{i=1}^{k} log Γ(γ_i) − ∑_{i=1}^{k} (γ_i − 1)( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )

 − ∑_{n=1}^{N} ∑_{i=1}^{k} φ_ni log φ_ni

where Ψ is the digamma function (the first derivative of log Γ)
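As a concrete check that this expression really is a lower bound, the following sketch evaluates it for an arbitrary (γ, φ) on a toy model and compares it with a Monte Carlo estimate of log p(w | α, β); every parameter value here is illustrative.

```python
import numpy as np
from scipy.special import digamma, gammaln

rng = np.random.default_rng(0)

k, V = 4, 10
alpha = np.full(k, 0.5)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word matrix
doc = [2, 7, 7, 1, 4]
N = len(doc)

# arbitrary (not yet optimized) variational parameters
phi = rng.dirichlet(np.ones(k), size=N)    # N x k, rows sum to 1
gamma = alpha + phi.sum(axis=0)

# expanded lower bound L(gamma, phi; alpha, beta), term by term
e_log_theta = digamma(gamma) - digamma(gamma.sum())
elbo = (
    gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * e_log_theta).sum()      # E[log p(theta|alpha)]
    + (phi * e_log_theta).sum()                                                          # E[log p(z|theta)]
    + (phi * np.log(beta[:, doc].T)).sum()                                               # E[log p(w|z,beta)]
    - (gammaln(gamma.sum()) - gammaln(gamma).sum() + ((gamma - 1) * e_log_theta).sum())  # -E[log q(theta)]
    - (phi * np.log(phi)).sum()                                                          # -E[log q(z)]
)

# Monte Carlo estimate of the true log marginal likelihood log p(w | alpha, beta)
thetas = rng.dirichlet(alpha, size=200_000)
log_p_w = np.log((thetas @ beta[:, doc]).prod(axis=1).mean())

print(elbo, log_p_w)   # the bound should satisfy elbo <= log_p_w
```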

Page 25:

Inference and Parameter Estimation - Variational Inference

• By computing the derivatives of the KL divergence and setting them equal to zero, we obtain the following pair of update equations:

φ_ni ∝ β_{i,w_n} exp( E_q[log θ_i | γ] )

γ_i = α_i + ∑_{n=1}^{N} φ_ni

• It is important to note that the variational distribution is actually a conditional distribution, varying as a function of w, because the optimization problem

(γ*, φ*) = argmin_{(γ, φ)} D( q(θ, z | γ, φ) || p(θ, z | w, α, β) )

is conducted for fixed w

• We can therefore write the resulting variational distribution as q(θ, z | γ*(w), φ*(w)), so the variational distribution can be viewed as an approximation to the posterior distribution p(θ, z | w, α, β)

Page 26:

Inference and Parameter Estimation - Variational Inference (1/3)

• Computing E[log θ_i | α]:
  – A distribution is in the exponential family if it can be written in the form:

p(x | η) = h(x) exp{ η^T T(x) − A(η) }

    where η is the natural (canonical) parameter, T(x) is the sufficient statistic, and A(η) is the log normalizer
  – The Dirichlet can be written in this form:

p(θ | α) = exp{ ∑_{i=1}^{k} (α_i − 1) log θ_i + log Γ(∑_{i=1}^{k} α_i) − ∑_{i=1}^{k} log Γ(α_i) }

    with natural parameter η_i = α_i − 1 and sufficient statistic T_i(θ) = log θ_i
  – The noteworthy point is that A(η) is the cumulant generating function for the sufficient statistic, so in particular A′(η) is the expectation and A″(η) is the variance:

E[log θ_i | α] = E_η[T_i(x)] = Ψ(α_i) − Ψ(∑_{j=1}^{k} α_j)
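A quick sanity check of this identity: estimate E[log θ_i] by sampling from the Dirichlet and compare it with Ψ(α_i) − Ψ(∑_j α_j); the α values are arbitrary.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([0.7, 1.3, 2.5])                 # arbitrary example parameters

analytic = digamma(alpha) - digamma(alpha.sum())  # E[log theta_i | alpha]
monte_carlo = np.log(rng.dirichlet(alpha, size=200_000)).mean(axis=0)

print(analytic)
print(monte_carlo)   # should match the analytic values to a few decimal places
```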

Page 27:

Inference and Parameter Estimation - Variational Inference (2/3)

• We form the Lagrangian by isolating the terms of the lower bound which contain φ_ni and adding the appropriate Lagrange multipliers (here v denotes the index of the word such that w_n^v = 1):

L_[φ_ni] = φ_ni ( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) ) + φ_ni log β_iv − φ_ni log φ_ni + λ_n ( ∑_{j=1}^{k} φ_nj − 1 )

∂L/∂φ_ni = Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) + log β_iv − log φ_ni − 1 + λ_n

• Setting this derivative to zero yields the maximizing value of the variational parameter φ_ni:

Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) + log β_iv − log φ_ni − 1 + λ_n = 0

⇒ log φ_ni = Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) + log β_iv + λ_n − 1

⇒ φ_ni = β_iv exp( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) ) · exp(λ_n − 1)

∴ φ_ni ∝ β_iv exp( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) ) = β_iv exp( E[log θ_i | γ] )

Page 28:

Inference and Parameter Estimation - Variational Inference (3/3)

• Maximize with respect to γ_i, the i-th component of the posterior Dirichlet parameter. The terms of the lower bound containing γ are:

L_[γ] = ∑_{i=1}^{k} (α_i − 1)( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) ) + ∑_{n=1}^{N} ∑_{i=1}^{k} φ_ni ( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) ) − log Γ(∑_{j=1}^{k} γ_j) + ∑_{i=1}^{k} log Γ(γ_i) − ∑_{i=1}^{k} (γ_i − 1)( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )

= ∑_{i=1}^{k} ( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )( α_i + ∑_{n=1}^{N} φ_ni − γ_i ) − log Γ(∑_{j=1}^{k} γ_j) + ∑_{i=1}^{k} log Γ(γ_i)

• Taking the derivative with respect to γ_i and setting it to zero yields the maximizing value of the variational parameter:

∂L/∂γ_i = Ψ′(γ_i)( α_i + ∑_{n=1}^{N} φ_ni − γ_i ) − Ψ′(∑_{j=1}^{k} γ_j) ∑_{j=1}^{k} ( α_j + ∑_{n=1}^{N} φ_nj − γ_j ) = 0

⇒ γ_i = α_i + ∑_{n=1}^{N} φ_ni

Page 29:

Inference and Parameter Estimation - parameter estimation

• In particular, given a corpus of documents D = {w_1, w_2, ..., w_M}, we wish to find parameters α and β that maximize the (marginal) log likelihood of the data:

ℓ(α, β) = ∑_{d=1}^{M} log p(w_d | α, β)

• We can thus find approximate empirical Bayes estimates for the LDA model via an alternating variational EM procedure

• The derivation yields the following iterative algorithm:
  – (E-step) For each document, find the optimizing values of the variational parameters {γ_d*, φ_d* : d ∈ D}
  – (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β

Page 30:

Inference and Parameter Estimation - parameter estimation

• The M-step update for the conditional multinomial parameter β can be written out analytically:

β_ij ∝ ∑_{d=1}^{M} ∑_{n=1}^{N_d} φ*_dni w_dn^j

• The update for the Dirichlet parameter α can be implemented using an efficient Newton-Raphson method
  – The general update rule can be written as:

α_new = α_old − H(α_old)^{−1} ∇(α_old)

    where H(α) and ∇(α) are the Hessian matrix and the gradient, respectively, at the point α
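A minimal sketch of the β M-step under this formula, assuming the E-step has already produced φ*_d for every document (the φ arrays below are random stand-ins for that output); the α Newton-Raphson step is only indicated by its general form above and is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 4, 10
docs = [[2, 7, 7, 1], [0, 3, 9], [5, 5, 6, 6, 6]]            # toy corpus: word indices per document
# stand-ins for the E-step output: phi_d is N_d x k, rows sum to 1
phis = [rng.dirichlet(np.ones(k), size=len(d)) for d in docs]

# M-step: beta_ij proportional to sum_d sum_n phi*_{dni} * w_{dn}^j
beta = np.zeros((k, V))
for doc, phi in zip(docs, phis):
    for w_dn, phi_dn in zip(doc, phi):
        beta[:, w_dn] += phi_dn                              # add expected topic counts for this word
beta /= beta.sum(axis=1, keepdims=True)                      # normalize each topic's word distribution

print(beta.sum(axis=1))                                      # each row now sums to 1
```

For the α update, the Hessian of the lower bound has the form of a diagonal matrix plus a constant matrix, which the Blei et al. paper exploits to compute the Newton step in linear time.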

Page 31:

Inference and Parameter Estimation - Smoothing

• We treat β as a k × V random matrix (one row for each mixture component), where we assume that each row is independently drawn from an exchangeable Dirichlet distribution with parameter η

• We consider a variational approach to Bayesian inference that places a separable distribution on the random variables β, θ, and z:

q(β_{1:k}, z_{1:M}, θ_{1:M} | λ, φ, γ) = ∏_{i=1}^{k} Dir(β_i | λ_i) ∏_{d=1}^{M} q_d(θ_d, z_d | φ_d, γ_d)

• This yields an additional update for the new variational parameter λ:

λ_ij = η + ∑_{d=1}^{M} ∑_{n=1}^{N_d} φ*_dni w_dn^j

Page 32:

Discussion

• We can view LDA as a dimensionality reduction technique, in the spirit of LSI
  – But LDA has proper underlying generative probabilistic semantics that make sense for the type of data that it models

• Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used
  – Laplace approximation, higher-order variational techniques, and Monte Carlo methods

• A variety of extensions of LDA can be considered in which the distributions on the topic variables are elaborated
  – We could arrange the topics in a time series, essentially relaxing the full exchangeability assumption to one of partial exchangeability