Notes on Latent Dirichlet Allocation (LDA) for Beginners


Jun Wang

Version 0.2, September 2015

1 Introduction

This is a brief introduction to Latent Dirichlet Allocation (LDA), one of the most widely used topic models. We also give a detailed derivation of Gibbs sampling for LDA.

1.1 Notations

We list the related notation as follows.

• α and β are hyper-parameters with pre-specified values.

• M is the number of documents. K is the number of topics. V is the number of distinct words.

• Nm is the length of the m-th document.

• wm,n (1 ≤ m ≤ M; 1 ≤ n ≤ Nm) is the n-th word in the m-th document, and the words contained in documents are observed variables.

• zm,n (1 ≤ m ≤ M; 1 ≤ n ≤ Nm) is the topic assigned to the n-th word in the m-th document, which is a latent (or hidden) variable.

• θm (1 ≤ m ≤ M) is the topic distribution of the m-th document, which is a latent (or hidden) K-dimensional vector.

• φk (1 ≤ k ≤ K) is the word distribution of the k-th topic, which is a latent (or hidden) V-dimensional vector.

1.1.1 θm and Θ

θm is sampled from a Dirichlet distribution with hyper-parameter α.

θm ∼ Dir(α); 1 ≤ m ≤ M    (1)


θm,k is the k-th element of θm, which corresponds to the proportion of topic k in the m-th document.

∑_{k=1}^{K} θm,k = 1    (2)

We can represent θ1 ⋯ θm ⋯ θM as an M × K matrix Θ.

Θ = ⎛ θ1 ⎞   ⎛ θ1,1 … θ1,k … θ1,K ⎞
    ⎜ ⋮  ⎟   ⎜  ⋮      ⋮      ⋮   ⎟
    ⎜ θm ⎟ = ⎜ θm,1 … θm,k … θm,K ⎟
    ⎜ ⋮  ⎟   ⎜  ⋮      ⋮      ⋮   ⎟
    ⎝ θM ⎠   ⎝ θM,1 … θM,k … θM,K ⎠

1.1.2 φk and Φ

φk is sampled from a Dirichlet distribution with hyper-parameter β.

φk ∼ Dir(β); 1 ≤ k ≤ K (3)

φk,v is the v-th element of φk, which corresponds to the proportion of word v in the k-th topic. Each v is an element of the dictionary of V distinct words.

∑_{v=1}^{V} φk,v = 1    (4)

We can represent φ1 ⋯ φk ⋯ φK as a K × V matrix Φ.

Φ = ⎛ φ1 ⎞   ⎛ φ1,1 … φ1,v … φ1,V ⎞
    ⎜ ⋮  ⎟   ⎜  ⋮      ⋮      ⋮   ⎟
    ⎜ φk ⎟ = ⎜ φk,1 … φk,v … φk,V ⎟
    ⎜ ⋮  ⎟   ⎜  ⋮      ⋮      ⋮   ⎟
    ⎝ φK ⎠   ⎝ φK,1 … φK,v … φK,V ⎠
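
As a quick illustration of these shapes, here is a minimal NumPy sketch; the toy sizes and symmetric hyper-parameter values below are our own choices, not from the notes.

import numpy as np

rng = np.random.default_rng(0)
M, K, V, alpha, beta = 5, 3, 10, 0.1, 0.01          # toy sizes and symmetric hyper-parameters
Theta = rng.dirichlet(np.full(K, alpha), size=M)    # M x K matrix, row m is theta_m ~ Dir(alpha)
Phi = rng.dirichlet(np.full(V, beta), size=K)       # K x V matrix, row k is phi_k ~ Dir(beta)
print(Theta.shape, Phi.shape)                       # (5, 3) (3, 10)
print(Theta.sum(axis=1), Phi.sum(axis=1))           # each row sums to 1 (Equations 2 and 4)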

1.1.3 Words

The m-th document Wm can be represented by the words it contains. wm,n is the n-th word in the m-th document, and Nm is the number of words in the m-th document. wm,n at different positions can be instances of the same dictionary word v.

Wm = ( wm,1 … wm,n … wm,Nm )


The whole document set can be represented as W.

W = ⎛ W1 ⎞   ⎛ w1,1 … w1,n … w1,N1 ⎞
    ⎜ ⋮  ⎟   ⎜          ⋮          ⎟
    ⎜ Wm ⎟ = ⎜ wm,1 … wm,n … wm,Nm ⎟
    ⎜ ⋮  ⎟   ⎜          ⋮          ⎟
    ⎝ WM ⎠   ⎝ wM,1 … wM,n … wM,NM ⎠

Please be aware that W is not a matrix, because each row, corresponding to the m-th document, may have a different number of words (length Nm). For example, N1 ≠ Nm ≠ NM.

1.1.4 Latent Variables for Topics assigned to Words

The latent topic variables assigned to words in the m-th document can be represented as Zm. zm,n is the topic assigned to the n-th word in the m-th document, and Nm is the number of words in the m-th document.

Zm = ( zm,1 … zm,n … zm,Nm )

The latent topic variables assigned to words in all documents can be represented as Z.

Z = ⎛ Z1 ⎞   ⎛ z1,1 … z1,n … z1,N1 ⎞
    ⎜ ⋮  ⎟   ⎜          ⋮          ⎟
    ⎜ Zm ⎟ = ⎜ zm,1 … zm,n … zm,Nm ⎟
    ⎜ ⋮  ⎟   ⎜          ⋮          ⎟
    ⎝ ZM ⎠   ⎝ zM,1 … zM,n … zM,NM ⎠

Please be aware that Z is not a matrix, because each row, corresponding to the m-th document, may have a different number of words (length Nm). For example, N1 ≠ Nm ≠ NM.

1.2 Graphical Models

Based on the generative process of LDA, we can represent the LDA model using "collapsed" plate notation, as shown in Figure 1.

For easier understanding, the corresponding "expanded" model is also shown in Figure 2.
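
To make the generative process behind Figures 1 and 2 concrete, the following sketch continues the NumPy snippet from Section 1.1.2: given Theta and Phi, each word position first draws a topic from θm and then a word from the corresponding φ. The function name, the fixed document lengths, and the variable names are ours, not from the notes.

def generate_words(Theta, Phi, doc_lengths, rng):
    # For each document m and position n: z_{m,n} ~ Cat(theta_m), then w_{m,n} ~ Cat(phi_{z_{m,n}}).
    K, V = Phi.shape
    Z, W = [], []
    for m, N_m in enumerate(doc_lengths):
        z_m = rng.choice(K, size=N_m, p=Theta[m])
        w_m = np.array([rng.choice(V, p=Phi[k]) for k in z_m])
        Z.append(z_m)
        W.append(w_m)
    return Z, W

Z, W = generate_words(Theta, Phi, doc_lengths=[8, 12, 5, 9, 7], rng=rng)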

2 Joint Distribution

The original joint distribution of the LDA model is represented as follows.


Figure 1: The graphical model for LDA using plate notation.

Figure 2: The graphical model for LDA using “expanded” representation.

p(θ1 ⋯ θm ⋯ θM, φ1 ⋯ φk ⋯ φK, w1,1 ⋯ wm,n ⋯ wM,NM, z1,1 ⋯ zm,n ⋯ zM,NM; α, β)    (5)

Because the hyper-parameters α and β are pre-specified fixed values, they can be omitted, and the compact representation of Equation 5 is as follows.

p(Θ,Φ,W,Z;α, β) = p(Θ,Φ,W,Z) (6)

Based on the Bayesian network structure shown in Figure 2, we can expand the joint distribution of LDA as follows.


p(θ1; α) ⋯ p(θm; α) ⋯ p(θM; α)
× p(φ1; β) ⋯ p(φk; β) ⋯ p(φK; β)
× p(w1,1/z1,1, Φ) ⋯ p(w1,n/z1,n, Φ) ⋯ p(w1,N1/z1,N1, Φ)
  ⋯ p(wm,1/zm,1, Φ) ⋯ p(wm,n/zm,n, Φ) ⋯ p(wm,Nm/zm,Nm, Φ)
  ⋯ p(wM,1/zM,1, Φ) ⋯ p(wM,n/zM,n, Φ) ⋯ p(wM,NM/zM,NM, Φ)
× p(z1,1/θ1) ⋯ p(z1,n/θ1) ⋯ p(z1,N1/θ1)
  ⋯ p(zm,1/θm) ⋯ p(zm,n/θm) ⋯ p(zm,Nm/θm)
  ⋯ p(zM,1/θM) ⋯ p(zM,n/θM) ⋯ p(zM,NM/θM)    (7)

The compact representation of Equation 7 is

p(Θ,Φ,W,Z) = p(Θ;α)p(Φ;β)p(Z/Θ)p(W/Z,Φ) (8)

The components of Equation 8 are defined as follows. If zm,n = k, then φ_{zm,n} is φk.

p(Θ; α) = p(θ1; α) ⋯ p(θm; α) ⋯ p(θM; α) = ∏_{m=1}^{M} p(θm; α)    (9)

p(Φ; β) = p(φ1; β) ⋯ p(φk; β) ⋯ p(φK; β) = ∏_{k=1}^{K} p(φk; β)    (10)

p(W/Z, Φ) = p(w1,1/z1,1, Φ) ⋯ p(w1,n/z1,n, Φ) ⋯ p(w1,N1/z1,N1, Φ)
            ⋯ p(wm,1/zm,1, Φ) ⋯ p(wm,n/zm,n, Φ) ⋯ p(wm,Nm/zm,Nm, Φ)
            ⋯ p(wM,1/zM,1, Φ) ⋯ p(wM,n/zM,n, Φ) ⋯ p(wM,NM/zM,NM, Φ)
          = ∏_{m=1}^{M} ∏_{n=1}^{Nm} p(wm,n/zm,n, Φ)
          = ∏_{m=1}^{M} ∏_{n=1}^{Nm} p(wm,n/zm,n, φ_{zm,n})    (11)

p(Z/Θ) = p(z1,1/θ1) ⋯ p(z1,n/θ1) ⋯ p(z1,N1/θ1)
         ⋯ p(zm,1/θm) ⋯ p(zm,n/θm) ⋯ p(zm,Nm/θm)
         ⋯ p(zM,1/θM) ⋯ p(zM,n/θM) ⋯ p(zM,NM/θM)
       = ∏_{m=1}^{M} ∏_{n=1}^{Nm} p(zm,n/θm)    (12)

2.1 p(Θ;α)

Based on Equation 1, we can get the prior probability of θm, which follows a Dirichlet distribution with parameter α.


p(θm; α) = [Γ(Kα) / (Γ(α))^K] θm,1^{α−1} ⋯ θm,k^{α−1} ⋯ θm,K^{α−1}
         = [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{α−1}    (13)

p(Θ; α) = ∏_{m=1}^{M} p(θm; α) = ∏_{m=1}^{M} [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{α−1}    (14)
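
As a numerical sanity check of Equation 13, the sketch below evaluates the log-density of a symmetric Dirichlet, working in log space to avoid overflowing the Gamma function (the function name is ours):

import numpy as np
from scipy.special import gammaln

def log_dirichlet_pdf(theta_m, alpha):
    # log of Equation 13 for a symmetric Dirichlet with scalar hyper-parameter alpha
    K = len(theta_m)
    log_norm = gammaln(K * alpha) - K * gammaln(alpha)
    return log_norm + np.sum((alpha - 1.0) * np.log(theta_m))

print(log_dirichlet_pdf(np.array([0.2, 0.5, 0.3]), alpha=0.1))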

2.2 p(Φ; β)

Based on Equation 3, we can get

p(φk; β) = [Γ(Vβ) / (Γ(β))^V] φk,1^{β−1} ⋯ φk,v^{β−1} ⋯ φk,V^{β−1}
         = [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{β−1}    (15)

p(Φ; β) = ∏_{k=1}^{K} p(φk; β) = ∏_{k=1}^{K} [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{β−1}    (16)

2.3 p(Z/Θ)

θm is sampled from a Dirichlet distribution, so ∑_{k=1}^{K} θm,k = 1. p(zm,n/θm) is a categorical distribution, which is a special case of the multinomial distribution (just as the Bernoulli distribution is a special case of the binomial distribution).

p(zm,n = k/θm) = θm,k (17)

In the m-th document, let im,k be the total number of words assigned to topic k, with ∑_{k=1}^{K} im,k = Nm. We can store the values of im,k in an M × K matrix I.

p(Zm/θm) = ∏_{n=1}^{Nm} p(zm,n/θm) = ∏_{k=1}^{K} θm,k^{im,k}    (18)

Equation 18 above treats the words as an ordered sequence. Alternatively, we can use the bag-of-words model, which ignores the order of words; in that case we need the multinomial distribution, and get


[Nm! / ∏_{k=1}^{K} im,k!] ∏_{k=1}^{K} θm,k^{im,k}

or

[Γ(Nm + 1) / ∏_{k=1}^{K} Γ(im,k + 1)] ∏_{k=1}^{K} θm,k^{im,k}

The choice between the two does not change the final result: the factor Γ(Nm + 1) / ∏_{k=1}^{K} Γ(im,k + 1) appears in both the numerator and the denominator of the Bayesian integral, so it cancels out and does not affect the result. In these notes we use the formulation that considers the order of words, and based on Equation 18 we can get

p(Z/Θ) = ∏_{m=1}^{M} p(Zm/θm) = ∏_{m=1}^{M} ∏_{n=1}^{Nm} p(zm,n/θm) = ∏_{m=1}^{M} ∏_{k=1}^{K} θm,k^{im,k}    (19)

2.4 p(W/Z,Φ)

If zm,n = k, then φ_{zm,n} is φk, and p(wm,n/zm,n, φ_{zm,n}) = p(wm,n/φk).

φk is sampled from a Dirichlet distribution, so ∑_{v=1}^{V} φk,v = 1. p(wm,n/φk) is a categorical distribution, which is a special case of the multinomial distribution (just as the Bernoulli distribution is a special case of the binomial distribution).

p(wm,n = v/φk) = φk,v (20)

We can treat all M documents as one single large document of length L = ∑_{m=1}^{M} Nm, and let l be the position of a word wl in this large document. Let jk,v be the total number of occurrences of word v assigned to topic k, with ∑_{k=1}^{K} ∑_{v=1}^{V} jk,v = L. We can store the values of jk,v in a K × V matrix J.

p(W/Z, Φ) = ∏_{m=1}^{M} ∏_{n=1}^{Nm} p(wm,n/zm,n, φ_{zm,n})
          = ∏_{l=1}^{L} φ_{(zl),(wl)}
          = ∏_{k=1}^{K} ∏_{v=1}^{V} φk,v^{jk,v}    (21)
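
The count matrices I and J defined above are what the Gibbs sampler in the later sections actually manipulates. A small sketch for building them from Z and W (the function name is ours; documents are assumed to be sequences of word ids in 0..V−1):

import numpy as np

def count_matrices(W, Z, M, K, V):
    # I[m, k] = i_{m,k}: number of words in document m assigned to topic k
    # J[k, v] = j_{k,v}: number of times dictionary word v is assigned to topic k
    I = np.zeros((M, K), dtype=int)
    J = np.zeros((K, V), dtype=int)
    for m in range(M):
        for w, z in zip(W[m], Z[m]):
            I[m, z] += 1
            J[z, w] += 1
    return I, J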


2.5 Calculation of Joint Distribution

p(Θ, Φ, W, Z)
= p(Θ; α) p(Φ; β) p(Z/Θ) p(W/Z, Φ)
= ∏_{m=1}^{M} [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{α−1}
  × ∏_{k=1}^{K} [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{β−1}
  × ∏_{m=1}^{M} ∏_{k=1}^{K} θm,k^{im,k}
  × ∏_{k=1}^{K} ∏_{v=1}^{V} φk,v^{jk,v}    (22)
= ∏_{m=1}^{M} [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{im,k+α−1}        [this factor is p(Z/Θ) p(Θ; α)]
  × ∏_{k=1}^{K} [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{jk,v+β−1}      [this factor is p(W/Z, Φ) p(Φ; β)]    (23)

Because α is a fixed value, we can omit it.

p(Z/Θ)p(Θ;α) = p(Z/Θ)p(Θ) = p(ZΘ) (24)

Or we can get the same result in an alternative way. Because, given Θ, Z does not depend on α, p(Z/Θ) = p(Z/Θ; α).

p(Z/Θ)p(Θ;α) = p(Z/Θ;α)p(Θ;α) = p(ZΘ;α) = p(ZΘ) (25)

β is also a fixed value, so we can omit it. And Z is independent of Φ, so p(ZΦ) = p(Z)p(Φ).

p(W/Z, Φ) p(Φ; β) = p(W/Z, Φ) p(Φ) = [p(WZΦ) / p(ZΦ)] × p(Φ)
                  = [p(WZΦ) / (p(Z) p(Φ))] × p(Φ) = p(WZΦ) / p(Z) = p(WΦ/Z)    (26)

Similarly, we can also obtain the result for p(W/Z, Φ)p(Φ; β) in an alternative way.

3 Marginal Distribution

p(Θ, Φ, W, Z) = p(ZΘ) × p(WΦ/Z)    (27)


p(W, Z) = ∫∫ p(Z, W, Θ, Φ) dΘ dΦ
        = ∫∫ p(ZΘ) p(WΦ/Z) dΘ dΦ
        = ∫ p(ZΘ) dΘ × ∫ p(WΦ/Z) dΦ    (28)
        = p(Z) × p(W/Z)    (29)

p(Z, W)
= ∏_{m=1}^{M} [Γ(Kα) / (Γ(α))^K] ∫ ∏_{k=1}^{K} θm,k^{im,k+α−1} dθm
  × ∏_{k=1}^{K} [Γ(Vβ) / (Γ(β))^V] ∫ ∏_{v=1}^{V} φk,v^{jk,v+β−1} dφk    (30)
= ∏_{m=1}^{M} [Γ(Kα) / (Γ(α))^K] × [∏_{k=1}^{K} Γ(im,k + α)] / Γ(∑_{k=1}^{K}(im,k + α))
  × ∏_{k=1}^{K} [Γ(Vβ) / (Γ(β))^V] × [∏_{v=1}^{V} Γ(jk,v + β)] / Γ(∑_{v=1}^{V}(jk,v + β))    (31)
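
The step from Equation 30 to Equation 31 uses the normalizing constant of the Dirichlet distribution. For completeness, the identity, stated for a general parameter vector a1, …, aK and integrating over the probability simplex (θk ≥ 0, ∑_{k} θk = 1), is

∫ ∏_{k=1}^{K} θk^{ak−1} dθ = [∏_{k=1}^{K} Γ(ak)] / Γ(∑_{k=1}^{K} ak)

Setting ak = im,k + α gives the first factor of Equation 31, and setting the parameters to jk,v + β gives the second.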

Because Γ(Kα) / (Γ(α))^K and Γ(Vβ) / (Γ(β))^V are constant values, we can drop them and get

p(Z, W) ∝ ∏_{m=1}^{M} [∏_{k=1}^{K} Γ(im,k + α)] / Γ(∑_{k=1}^{K}(im,k + α)) × ∏_{k=1}^{K} [∏_{v=1}^{V} Γ(jk,v + β)] / Γ(∑_{v=1}^{V}(jk,v + β))    (32)
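
Equation 32 is easy to evaluate numerically in log space, which can be useful, for example, for monitoring the sampler introduced below. A minimal sketch, assuming the count matrices I (M × K) and J (K × V) defined earlier (the function name is ours):

import numpy as np
from scipy.special import gammaln

def log_joint_zw(I, J, alpha, beta):
    # log of Equation 32 (up to an additive constant)
    doc_part = np.sum(gammaln(I + alpha)) - np.sum(gammaln(np.sum(I + alpha, axis=1)))
    topic_part = np.sum(gammaln(J + beta)) - np.sum(gammaln(np.sum(J + beta, axis=1)))
    return doc_part + topic_part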

4 Full Conditional Probability

The goal is to infer the topics assigned to all words in all documents. This can be done by iteratively inferring the topic of each single word from its full conditional probability.

Let wa,b be the current word and za,b the topic assigned to wa,b. Z¬a,b represents the topic assignments of all other words, excluding za,b.

Z = {za,b, Z¬a,b}


p(za,b/Z¬a,b, W) = p(za,b, Z¬a,b, W) / p(Z¬a,b, W)
                 = p(za,b, Z¬a,b, W) / ∑_{za,b} p(za,b, Z¬a,b, W)
                 ∝ p(za,b, Z¬a,b, W) = p(Z, W)    (33)

Because p(za,b/Z¬a,b, W) ∝ p(Z, W), we can use p(Z, W) to calculate p(za,b/Z¬a,b, W).

p(za,b = k/Z¬a,b, W) ∝ p(za,b = k, Z¬a,b, W)

p(za,b = k/Z¬a,b, W) = p(za,b = k, Z¬a,b, W) / ∑_{k=1}^{K} p(za,b = k, Z¬a,b, W)    (34)

5 MCMC using Gibbs Sampling

Gibbs sampling is a widely used MCMC method, which infers the value of the whole Z by iteratively sampling from the full conditional probability of each single za,b.

As mentioned in the previous section, if we treat all M documents as one single long document of length L, then Z = z1 ⋯ zl ⋯ zL, and za,b corresponds to a specific zl.

The Gibbs sampling procedure is shown in Algorithm 1.

input : Z^(0) = z1^(0) ⋯ zl^(0) ⋯ zL^(0), W = w1 ⋯ wl ⋯ wL
output: stable Z
for t ← 1 to T do
    for l ← 1 to L do
        zl^(t+1) ∼ p(zl / z1^(t+1) ⋯ z(l−1)^(t+1), z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL)
    end
end
Algorithm 1: Gibbs sampling algorithm for LDA

zl^(0) is initialized with a random integer between 1 and K. zl^(t) is the instance of zl sampled in the t-th round, and zl^(t+1) is the instance sampled in the (t+1)-th round.

p(zl / z1^(t+1) ⋯ z(l−1)^(t+1), z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL)
∝ p(z1^(t+1) ⋯ z(l−1)^(t+1), zl, z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL)    (35)

or


p(zl = k / z1^(t+1) ⋯ z(l−1)^(t+1), z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL)
∝ p(z1^(t+1) ⋯ z(l−1)^(t+1), zl = k, z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL)    (36)

We define f(zl = k) as

p(z1^(t+1) ⋯ z(l−1)^(t+1), zl = k, z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL)    (37)

We can calculate all K values of f(zl = k) (1 ≤ k ≤ K), and then normalize to get the value of

p(zl = k / z1^(t+1) ⋯ z(l−1)^(t+1), z(l+1)^(t) ⋯ zL^(t), w1 ⋯ wl−1 wl wl+1 ⋯ wL) = f(zl = k) / ∑_{k′=1}^{K} f(zl = k′)    (38)

Based on the above probability distribution, we can sample the value of zl^(t+1).

Note: In the standard literature on Bayesian inference, a large number of samples is usually generated for the multi-dimensional variable Z, and these samples are used to estimate the mean (or expectation) of Z, or the expectation of some function q(Z). But in Gibbs sampling for LDA, we only keep one single (stable?) sample of Z. This seems inconsistent with what we learn from the standard literature on Bayesian inference. Why does nobody discuss this issue in the LDA literature?

6 Calculation of f(zl = k)

Based on the structure of the LDA model shown in Figure 2, we can simplify the calculation of f(zl = k).

In Section 5, we try to infer the value of the current zl. If l corresponds to the b-th word in the a-th document, then zl = za,b.

p(Z, W) ∝ ∏_{m=1}^{M} [∏_{k=1}^{K} Γ(im,k + α)] / Γ(∑_{k=1}^{K}(im,k + α)) × ∏_{k=1}^{K} [∏_{v=1}^{V} Γ(jk,v + β)] / Γ(∑_{v=1}^{V}(jk,v + β))

= {∏_{m≠a} [∏_{k=1}^{K} Γ(im,k + α)] / Γ(∑_{k=1}^{K}(im,k + α))}            [excluding the a-th document]
  × {[∏_{k=1}^{K} Γ(ia,k + α)] / Γ(∑_{k=1}^{K}(ia,k + α))}                  [only the a-th document]
  × ∏_{k=1}^{K} {[∏_{v≠wa,b} Γ(jk,v + β)] × Γ(jk,(wa,b) + β)} / Γ(∑_{v=1}^{V}(jk,v + β))    [splitting v ≠ wa,b and v = wa,b]    (39)


We can cancel the factors which do not depend on a and b (they do not change when za,b varies).

p(Z, W) ∝ [∏_{k=1}^{K} Γ(ia,k + α)] / Γ(∑_{k=1}^{K}(ia,k + α)) × ∏_{k=1}^{K} Γ(jk,(wa,b) + β) / Γ(∑_{v=1}^{V}(jk,v + β))    (40)

6.1 Simplifying ∏_{k=1}^{K} Γ(ia,k + α)

In the a-th document, if the b-th word wa,b is excluded, we recalculate the number of words assigned to each topic k, which is defined as ia,k^{¬a,b}.

The topic of the word wa,b is za,b, and we can compare ∏_{k=1}^{K} Γ(ia,k + α) with ∏_{k=1}^{K} Γ(ia,k^{¬a,b} + α), in which m = a (the a-th document) is fixed.

When k ≠ za,b, then ia,k = ia,k^{¬a,b}, so Γ(ia,k + α) = Γ(ia,k^{¬a,b} + α).

When k = za,b, obviously ia,k = ia,k^{¬a,b} + 1 (or ia,(za,b) = ia,(za,b)^{¬a,b} + 1), so Γ(ia,k + α) = Γ(ia,k^{¬a,b} + α + 1) (or Γ(ia,(za,b) + α) = Γ(ia,(za,b)^{¬a,b} + α + 1)).

Because Γ(x + 1) = x × Γ(x), we have Γ(ia,(za,b)^{¬a,b} + α + 1) = (ia,(za,b)^{¬a,b} + α) × Γ(ia,(za,b)^{¬a,b} + α), and we can get

∏_{k=1}^{K} Γ(ia,k + α) = ∏_{k≠za,b} Γ(ia,k^{¬a,b} + α) × Γ(ia,(za,b)^{¬a,b} + α + 1)
                        = ∏_{k≠za,b} Γ(ia,k^{¬a,b} + α) × Γ(ia,(za,b)^{¬a,b} + α) × (ia,(za,b)^{¬a,b} + α)
                        = (ia,(za,b)^{¬a,b} + α) × ∏_{k=1}^{K} Γ(ia,k^{¬a,b} + α)    (41)

When we calculate f(za,b = k) for 1 ≤ k ≤ K, ∏_{k=1}^{K} Γ(ia,k^{¬a,b} + α) is a constant value and can be cancelled.

∏_{k=1}^{K} Γ(ia,k + α) ∝ ia,(za,b)^{¬a,b} + α    (42)

6.2 Simplifying Γ(∑_{k=1}^{K}(ia,k + α))

Similarly, m = a (a-th document) is fixed, and we can get


∑_{k=1}^{K}(ia,k + α) = ∑_{k≠za,b}(ia,k^{¬a,b} + α) + (ia,(za,b)^{¬a,b} + α + 1)
                      = 1 + ∑_{k=1}^{K}(ia,k^{¬a,b} + α)    (43)

We can further get

Γ(∑_{k=1}^{K}(ia,k + α)) = Γ(1 + ∑_{k=1}^{K}(ia,k^{¬a,b} + α))
                         = [∑_{k=1}^{K}(ia,k^{¬a,b} + α)] × Γ(∑_{k=1}^{K}(ia,k^{¬a,b} + α))    (44)

When we calculate f(za,b = k) for 1 ≤ k ≤ K, Γ(∑_{k=1}^{K}(ia,k^{¬a,b} + α)) is a constant value and can be cancelled.

Actually, ∑_{k=1}^{K} ia,k^{¬a,b} is the total number of words in the a-th document without counting wa,b, and this number can be defined as Na^{¬a,b}. Obviously, Na^{¬a,b} = Na − 1, so

∑_{k=1}^{K}(ia,k^{¬a,b} + α) = Na^{¬a,b} + Kα = Na + Kα − 1    (45)

∑_{k=1}^{K}(ia,k^{¬a,b} + α) is also a constant value, so the whole factor Γ(∑_{k=1}^{K}(ia,k + α)) can be cancelled.

Γ(∑_{k=1}^{K}(ia,k + α)) ∝ ∑_{k=1}^{K}(ia,k^{¬a,b} + α) = Na + Kα − 1    (46)

6.3 Simplifying ∏_{k=1}^{K} Γ(jk,(wa,b) + β)

Similarly, if we treat all documents as one single large document and exclude wa,b, the b-th word in the a-th document, we recalculate the number of occurrences of each distinct word v assigned to each topic k in this large document, which is defined as jk,v^{¬a,b}.

We can compare ∏_{k=1}^{K} Γ(jk,(wa,b) + β) with ∏_{k=1}^{K} Γ(jk,(wa,b)^{¬a,b} + β), in which the word v = wa,b is fixed.

When k ≠ za,b, for any v we have jk,v = jk,v^{¬a,b} and Γ(jk,v + β) = Γ(jk,v^{¬a,b} + β). So in particular, for v = wa,b, we have jk,(wa,b) = jk,(wa,b)^{¬a,b} and Γ(jk,(wa,b) + β) = Γ(jk,(wa,b)^{¬a,b} + β).


When k = za,b and v = wa,b, jk,v = jk,v^{¬a,b} + 1 (or j(za,b),(wa,b) = j(za,b),(wa,b)^{¬a,b} + 1), so Γ(jk,v + β) = Γ(jk,v^{¬a,b} + β + 1) (or Γ(j(za,b),(wa,b) + β) = Γ(j(za,b),(wa,b)^{¬a,b} + β + 1)).

Because Γ(x + 1) = x × Γ(x), we have Γ(j(za,b),(wa,b)^{¬a,b} + β + 1) = (j(za,b),(wa,b)^{¬a,b} + β) × Γ(j(za,b),(wa,b)^{¬a,b} + β), and we get

∏_{k=1}^{K} Γ(jk,(wa,b) + β) = ∏_{k≠za,b} Γ(jk,(wa,b) + β) × Γ(j(za,b),(wa,b) + β)    (47)
                             = ∏_{k≠za,b} Γ(jk,(wa,b)^{¬a,b} + β) × Γ(j(za,b),(wa,b)^{¬a,b} + β) × (j(za,b),(wa,b)^{¬a,b} + β)
                             = (j(za,b),(wa,b)^{¬a,b} + β) × ∏_{k=1}^{K} Γ(jk,(wa,b)^{¬a,b} + β)    (48)

When we calculate f(za,b = k) for 1 ≤ k ≤ K, ∏_{k=1}^{K} Γ(jk,(wa,b)^{¬a,b} + β) is a constant value and can be cancelled.

∏_{k=1}^{K} Γ(jk,(wa,b) + β) ∝ j(za,b),(wa,b)^{¬a,b} + β    (49)

6.4 Simplifying ∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v + β))

We can also compare ∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v + β)) with ∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v^{¬a,b} + β)).

When k ≠ za,b, for any v we have jk,v = jk,v^{¬a,b} and ∑_{v=1}^{V} jk,v = ∑_{v=1}^{V} jk,v^{¬a,b}. So Γ(∑_{v=1}^{V}(jk,v + β)) = Γ(∑_{v=1}^{V}(jk,v^{¬a,b} + β)), and we can get

∏_{k≠za,b} Γ(∑_{v=1}^{V}(jk,v + β)) = ∏_{k≠za,b} Γ(∑_{v=1}^{V}(jk,v^{¬a,b} + β))    (50)

When k = za,b, if v ≠ wa,b we still have jk,v = jk,v^{¬a,b}, and we get

∑_{v≠wa,b} j(za,b),v = ∑_{v≠wa,b} j(za,b),v^{¬a,b}    (51)

Only when k = za,b and v = wa,b do we have jk,v = jk,v^{¬a,b} + 1 (or j(za,b),(wa,b) = j(za,b),(wa,b)^{¬a,b} + 1).


∑_{v=1}^{V} j(za,b),v = j(za,b),(wa,b) + ∑_{v≠wa,b} j(za,b),v
                      = j(za,b),(wa,b)^{¬a,b} + 1 + ∑_{v≠wa,b} j(za,b),v^{¬a,b} = 1 + ∑_{v=1}^{V} j(za,b),v^{¬a,b}    (52)

∑_{v=1}^{V}(j(za,b),v + β) = 1 + ∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)    (53)

Γ(∑_{v=1}^{V}(j(za,b),v + β)) = Γ(1 + ∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β))
                              = [∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)] × Γ(∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β))    (54)

Finally, we can get

∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v + β)) = ∏_{k≠za,b} Γ(∑_{v=1}^{V}(jk,v + β)) × Γ(∑_{v=1}^{V}(j(za,b),v + β))    [the last factor is k = za,b]
  = ∏_{k≠za,b} Γ(∑_{v=1}^{V}(jk,v^{¬a,b} + β)) × Γ(∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)) × [∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)]
  = ∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v^{¬a,b} + β)) × [∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)]    (55)

When we calculate f(za,b = k) for 1 ≤ k ≤ K, ∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v^{¬a,b} + β)) is a constant value and can be cancelled.

∏_{k=1}^{K} Γ(∑_{v=1}^{V}(jk,v + β)) ∝ ∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)    (56)

6.5 Simplifying f(za,b = k)

Based on the above steps, we can get

f(za,b = k) ∝ [ia,(za,b)^{¬a,b} + α] / [∑_{k=1}^{K}(ia,k^{¬a,b} + α)] × [j(za,b),(wa,b)^{¬a,b} + β] / [∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)]    (57)

            ∝ (ia,(za,b)^{¬a,b} + α) × (j(za,b),(wa,b)^{¬a,b} + β) / [∑_{v=1}^{V}(j(za,b),v^{¬a,b} + β)]    (58)
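
In code, Equation 58 amounts to: remove the current word from the counts, score every topic, normalize, and sample. A minimal sketch of this single-word update (variable and function names are ours; TW[k] plays the role of ∑_{v} jk,v):

import numpy as np

def sample_topic(a, v, I, J, TW, z_ab, alpha, beta, rng):
    # a: document index, v: dictionary id of w_{a,b}, z_ab: current topic of w_{a,b}
    # I: M x K counts i_{m,k}, J: K x V counts j_{k,v}, TW[k] = sum_v J[k, v]
    V = J.shape[1]
    I[a, z_ab] -= 1; J[z_ab, v] -= 1; TW[z_ab] -= 1              # the "¬a,b" statistics
    f = (I[a, :] + alpha) * (J[:, v] + beta) / (TW + V * beta)   # Equation 58 for every k
    new_z = rng.choice(len(f), p=f / f.sum())                    # Equation 38: normalize and sample
    I[a, new_z] += 1; J[new_z, v] += 1; TW[new_z] += 1           # add the word back with its new topic
    return new_z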


7 Inference of Θ and Φ

After we have obtained the value of Z, the topics assigned to all words, we can further infer Θ and Φ.

7.1 Inference of Θ

The posterior probability of θm given the observations of the corresponding Zm is p(θm/Zm; α). Given θm, Zm does not depend on α, so p(Zm/θm; α) = p(Zm/θm).

p(θm/Zm; α) = p(θm, Zm; α) / p(Zm; α) = p(θm, Zm; α) / ∫ p(θm, Zm; α) dθm
            = p(Zm/θm; α) p(θm; α) / ∫ p(Zm/θm; α) p(θm; α) dθm
            = p(Zm/θm) p(θm; α) / ∫ p(Zm/θm) p(θm; α) dθm    (59)

Based on Equations 13 and 18, we can get

p(Zm/θm) p(θm; α) = ∏_{k=1}^{K} θm,k^{im,k} × [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{α−1}
                  = [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{im,k+α−1}    (60)

p(θm/Zm; α) = [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{im,k+α−1} / ∫ [Γ(Kα) / (Γ(α))^K] ∏_{k=1}^{K} θm,k^{im,k+α−1} dθm
            = ∏_{k=1}^{K} θm,k^{im,k+α−1} / ∫ ∏_{k=1}^{K} θm,k^{im,k+α−1} dθm
            = [Γ(∑_{k=1}^{K}(α + im,k)) / ∏_{k=1}^{K} Γ(α + im,k)] ∏_{k=1}^{K} θm,k^{im,k+α−1}    (61)

So p(θm/Zm; α) is a Dirichlet distribution, and using the expectation of the Dirichlet distribution we can get

θm,k = (α + im,k) / ∑_{k=1}^{K}(α + im,k)    (62)


7.2 Inference of Φ

For a word wl, once we know its assigned topic zl, we can organize all words W into K groups. Each group Wk (1 ≤ k ≤ K) contains all words sampled from φk, i.e., all words assigned topic k.

The posterior probability of φk given the observations of the corresponding Wk is p(φk/Wk; β). Given φk, Wk does not depend on β, so p(Wk/φk; β) = p(Wk/φk).

p(φk/Wk; β) = p(φk, Wk; β) / p(Wk; β) = p(φk, Wk; β) / ∫ p(φk, Wk; β) dφk
            = p(Wk/φk; β) p(φk; β) / ∫ p(Wk/φk; β) p(φk; β) dφk
            = p(Wk/φk) p(φk; β) / ∫ p(Wk/φk) p(φk; β) dφk    (63)

Based on Equations 15 and 21, we can get

p(Wk/φk; β) p(φk; β) = ∏_{v=1}^{V} φk,v^{jk,v} × [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{β−1}
                     = [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{jk,v+β−1}    (64)

p(φk/Wk; β) = [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{jk,v+β−1} / ∫ [Γ(Vβ) / (Γ(β))^V] ∏_{v=1}^{V} φk,v^{jk,v+β−1} dφk
            = ∏_{v=1}^{V} φk,v^{jk,v+β−1} / ∫ ∏_{v=1}^{V} φk,v^{jk,v+β−1} dφk
            = [Γ(∑_{v=1}^{V}(β + jk,v)) / ∏_{v=1}^{V} Γ(β + jk,v)] ∏_{v=1}^{V} φk,v^{jk,v+β−1}    (65)

So p(φk/Wk; β) is a Dirichlet distribution, and using the expectation of the Dirichlet distribution we can get

φk,v = (β + jk,v) / ∑_{v=1}^{V}(β + jk,v)    (66)
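
Equations 62 and 66 translate directly into the point estimates computed at the end of Algorithm 2 below. A one-function sketch, assuming the count matrices I and J (the function name is ours):

import numpy as np

def point_estimates(I, J, alpha, beta):
    # Posterior means of theta_{m,k} (Equation 62) and phi_{k,v} (Equation 66)
    Theta = (I + alpha) / (I + alpha).sum(axis=1, keepdims=True)   # M x K
    Phi = (J + beta) / (J + beta).sum(axis=1, keepdims=True)       # K x V
    return Theta, Phi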

8 Concrete Implementation of Gibbs Sampling of LDA

In Section 5 we introduced the general idea of Gibbs sampling for LDA; we now illustrate the implementation in more detail in Algorithm 2.


The total number of words assigned to topic k is twk, which is an element of a vector of length K, TW = (tw1 ⋯ twk ⋯ twK). The total number of words in document m is dwm, which is an element of a vector of length M, DW = (dw1 ⋯ dwm ⋯ dwM).

ITERATIONS is the maximal number of iterations of Gibbs sampling. BURNIN is the number of burn-in iterations. SAMPLELAG is the number of iterations between successive readouts of Θ and Φ.

9 Topic Inference of new documents

LDA is convenient for topic inference on newly arriving documents. For a new document Wx, we can infer the corresponding Zx using Algorithm 3.

10 Conclusion

“I always thought something was fundamentally wrong with the universe” [1]

References

[1] D. Adams. The Hitchhiker’s Guide to the Galaxy. San Val, 1995.


input       : W, α, β, K
global data : I, J, TW, DW
output      : Z, Θ, Φ

// Initialization
zero all elements in I, J, TW, DW
for m ← 1 to M do
    for n ← 1 to Nm do
        assign zm,n a random integer k (1 ≤ k ≤ K); v = wm,n;
        im,k = im,k + 1; jk,v = jk,v + 1;
        dwm = dwm + 1; twk = twk + 1;
    end
end

// Gibbs sampling over the burn-in period and the sampling period
for r ← 1 to ITERATIONS do
    for m ← 1 to M do
        for n ← 1 to Nm do
            k = zm,n; v = wm,n;
            im,k = im,k − 1; jk,v = jk,v − 1;
            dwm = dwm − 1; twk = twk − 1;
            for k ← 1 to K do
                // Equation 58
                fk = (im,k + α) ∗ (jk,v + β) / (twk + Vβ);
            end
            // p(zm,n = k/Z¬m,n, W) = fk / ∑_{k=1}^{K} fk, Equations 34 and 38
            sample k̂ ∼ p(zm,n = k/Z¬m,n, W); zm,n = k̂;
            im,k̂ = im,k̂ + 1; jk̂,v = jk̂,v + 1;
            dwm = dwm + 1; twk̂ = twk̂ + 1;
        end
    end
    if (r > BURNIN) and (r % SAMPLELAG == 0) then
        // Equations 62 and 66
        for m ← 1 to M do
            for k ← 1 to K do
                θm,k = (im,k + α) / (dwm + Kα);
            end
        end
        for k ← 1 to K do
            for v ← 1 to V do
                φk,v = (jk,v + β) / (twk + Vβ);
            end
        end
    end
end
Algorithm 2: Implementation of Gibbs sampling for LDA
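
For readers who prefer runnable code to pseudocode, the following is a compact NumPy sketch of Algorithm 2. It keeps only the most recent Θ and Φ readout rather than averaging samples; the function signature and default settings are our own choices, not from the notes.

import numpy as np

def lda_gibbs(docs, K, V, alpha, beta, iterations=1000, burnin=500, samplelag=50, seed=0):
    # docs: list of M documents, each a list of word ids in 0..V-1
    rng = np.random.default_rng(seed)
    M = len(docs)
    I = np.zeros((M, K))                                  # i_{m,k}
    J = np.zeros((K, V))                                  # j_{k,v}
    TW = np.zeros(K)                                      # tw_k = sum_v j_{k,v}
    DW = np.array([len(d) for d in docs], dtype=float)    # dw_m = N_m
    z = [rng.integers(K, size=len(d)) for d in docs]      # random initialization
    for m, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = z[m][n]
            I[m, k] += 1; J[k, v] += 1; TW[k] += 1

    Theta, Phi = None, None
    for r in range(1, iterations + 1):
        for m, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[m][n]
                I[m, k] -= 1; J[k, v] -= 1; TW[k] -= 1    # remove the current word
                f = (I[m, :] + alpha) * (J[:, v] + beta) / (TW + V * beta)   # Equation 58
                k_new = rng.choice(K, p=f / f.sum())      # Equations 34 and 38
                z[m][n] = k_new
                I[m, k_new] += 1; J[k_new, v] += 1; TW[k_new] += 1
        if r > burnin and r % samplelag == 0:
            Theta = (I + alpha) / (DW[:, None] + K * alpha)   # Equation 62
            Phi = (J + beta) / (TW[:, None] + V * beta)       # Equation 66
    return z, Theta, Phi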


input  : W, Z, Θ, Φ, I, J, TW, DW, α, β, K, Wx
output : Zx

// Initialization
for n ← 1 to Nx do
    assign zx,n a random integer k (1 ≤ k ≤ K);
end
for n ← 1 to Nx do
    v = wx,n;
    for k ← 1 to K do
        // Equation 58
        fk = (ix,k + α) ∗ (jk,v + β) / (twk + Vβ);
    end
    // p(zx,n = k/Z¬x,n, W) = fk / ∑_{k=1}^{K} fk, Equations 34 and 38
    sample k̂ ∼ p(zx,n = k/Z¬x,n, W); zx,n = k̂;
    ix,k̂ = ix,k̂ + 1; jk̂,v = jk̂,v + 1;
    dwx = dwx + 1; twk̂ = twk̂ + 1;
end
Algorithm 3: Topic inference of new documents using LDA
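
A corresponding sketch of Algorithm 3: the trained topic-word counts J and TW are held fixed, and only the new document's topic counts are updated. Running several sweeps instead of the single pass in the pseudocode above is our addition, as are the function and variable names.

import numpy as np

def infer_new_document(doc_x, J, TW, K, V, alpha, beta, sweeps=20, seed=0):
    # doc_x: list of word ids of the new document; J, TW come from the trained model
    rng = np.random.default_rng(seed)
    i_x = np.zeros(K)                                    # topic counts for the new document
    z_x = rng.integers(K, size=len(doc_x))
    for n, v in enumerate(doc_x):
        i_x[z_x[n]] += 1
    for _ in range(sweeps):
        for n, v in enumerate(doc_x):
            k = z_x[n]
            i_x[k] -= 1
            f = (i_x + alpha) * (J[:, v] + beta) / (TW + V * beta)   # Equation 58
            k_new = rng.choice(K, p=f / f.sum())
            z_x[n] = k_new
            i_x[k_new] += 1
    theta_x = (i_x + alpha) / (len(doc_x) + K * alpha)   # Equation 62 for the new document
    return z_x, theta_x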
