A Note on Expectation-Propagation for Latent Dirichlet Allocation


Tomonari MASADA @ Nagasaki University

January 15, 2014

1

This manuscript contains a derivation of the update formulae of expectation-propagation (EP) presented in the following paper:

Thomas Minka and John Lafferty. Expectation-Propagation for the Generative Aspect Model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 352-359, 2002. (Available at http://research.microsoft.com/en-us/um/people/minka/papers/aspect/)

2

A joint distribution for document d in LDA can be written as below.

\[
p(\mathbf{x}_d, \theta_d \mid \alpha)
= p(\theta_d \mid \alpha)\, p(\mathbf{x}_d \mid \theta_d)
= p(\theta_d \mid \alpha) \prod_{i=1}^{n_d} p(x_{di} \mid \theta_d)
= p(\theta_d \mid \alpha) \prod_{i=1}^{n_d} \Bigl( \sum_k \theta_{dk}\, \phi_{k x_{di}} \Bigr)
= p(\theta_d \mid \alpha) \prod_w \Bigl( \sum_k \theta_{dk}\, \phi_{kw} \Bigr)^{n_{dw}} , \tag{1}
\]

where $p(\theta_d \mid \alpha)$ is a Dirichlet prior distribution.

We approximate $p(w \mid \theta_d) = \sum_k \theta_{dk} \phi_{kw}$ by $t_w(\theta_d) = s_w \prod_k \theta_{dk}^{\beta_{wk}}$.
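For concreteness, here is a minimal NumPy sketch of the two functional forms involved; the arrays theta_d and phi_w and the site parameters s_w and beta_w are illustrative placeholders, not quantities produced by EP.

\begin{verbatim}
import numpy as np

# Hypothetical toy values for K = 3 topics and one word w.
theta_d = np.array([0.5, 0.3, 0.2])    # topic proportions theta_dk
phi_w = np.array([0.02, 0.10, 0.01])   # topic-word probabilities phi_kw

exact = theta_d @ phi_w                # p(w | theta_d) = sum_k theta_dk phi_kw

# The EP site approximation has the factorized form
#   t_w(theta_d) = s_w * prod_k theta_dk^{beta_wk}.
# s_w and beta_w below are arbitrary placeholders, not fitted values.
s_w = 0.05
beta_w = np.array([0.3, 0.6, 0.1])
approx = s_w * np.prod(theta_d ** beta_w)
print(exact, approx)
\end{verbatim}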

Then we obtain an approximated joint distribution as follows:

\[
p(\mathbf{x}_d, \theta_d \mid \alpha)
\approx p(\theta_d \mid \alpha) \prod_w t_w(\theta_d)^{n_{dw}}
= p(\theta_d \mid \alpha) \prod_w \Bigl( s_w \prod_k \theta_{dk}^{\beta_{wk}} \Bigr)^{n_{dw}}
= \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}
  \prod_w s_w^{n_{dw}}
  \prod_k \theta_{dk}^{\alpha_k - 1 + \sum_w \beta_{wk} n_{dw}} . \tag{2}
\]

Let $\gamma_{dk} = \alpha_k + \sum_w \beta_{wk} n_{dw}$. That is,

\[
p(\mathbf{x}_d, \theta_d \mid \alpha)
\approx \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}
  \prod_w s_w^{n_{dw}}
  \prod_k \theta_{dk}^{\gamma_{dk} - 1} . \tag{3}
\]

An approximated posterior q(θd) can be obtained as follows:

\[
q(\theta_d)
= \frac{ \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_w s_w^{n_{dw}} \prod_k \theta_{dk}^{\gamma_{dk}-1} }
       { \int \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_w s_w^{n_{dw}} \prod_k \theta_{dk}^{\gamma_{dk}-1} \, d\theta_d }
= \frac{ \prod_k \theta_{dk}^{\gamma_{dk}-1} }{ \int \prod_k \theta_{dk}^{\gamma_{dk}-1} \, d\theta_d }
= \frac{\Gamma(\sum_k \gamma_{dk})}{\prod_k \Gamma(\gamma_{dk})} \prod_k \theta_{dk}^{\gamma_{dk}-1} . \tag{4}
\]
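The parameters of this approximate posterior are just a count-weighted sum of the site exponents. Below is a minimal NumPy sketch of Eq. (4)'s parameters; the array names alpha, n_d, beta, and gamma_d are hypothetical, and beta is set to zero purely for illustration.

\begin{verbatim}
import numpy as np

K, V = 3, 5                            # hypothetical numbers of topics and words
alpha = np.full(K, 0.1)                # Dirichlet prior parameters alpha_k
n_d = np.array([2, 0, 1, 3, 0])        # word counts n_dw for document d
beta = np.zeros((V, K))                # site exponents beta_wk, zero here for illustration

# Eq. (4): q(theta_d) = Dir(gamma_d) with gamma_dk = alpha_k + sum_w beta_wk n_dw.
gamma_d = alpha + n_d @ beta
\end{verbatim}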



3

For each word token in document d, we remove its contribution to $\prod_w t_w(\theta_d)^{n_{dw}} = \prod_w \bigl( s_w \prod_k \theta_{dk}^{\beta_{wk}} \bigr)^{n_{dw}}$. When the token is an occurrence of word w, this corresponds to a division by $s_w \prod_k \theta_{dk}^{\beta_{wk}}$. Note that we remove a single word token here, not a word type. After this division we obtain an 'old' posterior as follows:

\[
q^{\setminus w}(\theta_d)
= \frac{\Gamma\bigl(\sum_k (\gamma_{dk} - \beta_{wk})\bigr)}{\prod_k \Gamma(\gamma_{dk} - \beta_{wk})}
  \prod_k \theta_{dk}^{\gamma_{dk} - \beta_{wk} - 1}
= \frac{\Gamma(\sum_k \gamma_{dk}^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
  \prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1} , \tag{5}
\]

where $\gamma_{dk}^{\setminus w} = \gamma_{dk} - \beta_{wk}$.
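In code, this deletion step is a single subtraction; a short sketch continuing the hypothetical arrays gamma_d and beta from the previous sketch:

\begin{verbatim}
# Remove one token of word w from the approximate posterior (Eq. 5):
# gamma_dk^{\w} = gamma_dk - beta_wk for every topic k.
w = 2
gamma_minus_w = gamma_d - beta[w]
\end{verbatim}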

By combining $p(w \mid \theta_d) = \sum_k \theta_{dk} \phi_{kw}$ with $q^{\setminus w}(\theta_d)$, we obtain something similar to a joint distribution:

\[
\Bigl( \sum_k \theta_{dk} \phi_{kw} \Bigr) \cdot
\frac{\Gamma(\sum_k \gamma_{dk}^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
\prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1} .
\]

Therefore, by integrating $\theta_d$ out, we obtain something similar to an evidence:

\[
\int \sum_k \theta_{dk} \phi_{kw}\,
\frac{\Gamma(\sum_k \gamma_{dk}^{\setminus w})}{\prod_{k'} \Gamma(\gamma_{dk'}^{\setminus w})}
\prod_{k'} \theta_{dk'}^{\gamma_{dk'}^{\setminus w} - 1}\, d\theta_d
= \frac{\sum_k \phi_{kw}\, \gamma_{dk}^{\setminus w}}{\sum_k \gamma_{dk}^{\setminus w}} ,
\]

which follows because the mean of $\theta_{dk}$ under the Dirichlet $q^{\setminus w}(\theta_d)$ is $\gamma_{dk}^{\setminus w} / \sum_{k'} \gamma_{dk'}^{\setminus w}$.

Consequently, we obtain a new posterior as follows:

\[
\tilde q(\theta_d)
= \frac{\sum_k \gamma_{dk}^{\setminus w}}{\sum_k \phi_{kw}\, \gamma_{dk}^{\setminus w}}
  \Bigl( \sum_k \theta_{dk} \phi_{kw} \Bigr)
  \frac{\Gamma(\sum_k \gamma_{dk}^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
  \prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1} . \tag{6}
\]

4

Based on Eq. (6), we calculate $\mathbb{E}_{\tilde q}[\theta_{dk}]$ and $\mathbb{E}_{\tilde q}[\theta_{dk}^2]$.

\begin{align*}
\mathbb{E}_{\tilde q}[\theta_{dk}]
&= \int \theta_{dk}\, \tilde q(\theta_d)\, d\theta_d \\
&= \int \theta_{dk}\,
   \frac{\hat\gamma_d^{\setminus w}}{\sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
   \Bigl( \sum_k \theta_{dk} \phi_{kw} \Bigr)
   \frac{\Gamma(\hat\gamma_d^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
   \prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1}\, d\theta_d \\
&= \frac{\hat\gamma_d^{\setminus w}}{\sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
   \Biggl\{
     \phi_{kw} \frac{\gamma_{dk}^{\setminus w} (\gamma_{dk}^{\setminus w} + 1)}
                    {\hat\gamma_d^{\setminus w} (\hat\gamma_d^{\setminus w} + 1)}
     + \sum_{k' \neq k} \phi_{k'w}
       \frac{\gamma_{dk}^{\setminus w} \gamma_{dk'}^{\setminus w}}
            {\hat\gamma_d^{\setminus w} (\hat\gamma_d^{\setminus w} + 1)}
   \Biggr\} \\
&= \frac{\hat\gamma_d^{\setminus w}}{\sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
   \cdot \frac{\gamma_{dk}^{\setminus w}}{\hat\gamma_d^{\setminus w}}
   \cdot \frac{\phi_{kw} + \sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
              {\hat\gamma_d^{\setminus w} + 1} \tag{7}
\end{align*}

\begin{align*}
\mathbb{E}_{\tilde q}[\theta_{dk}^2]
&= \int \theta_{dk}^2\, \tilde q(\theta_d)\, d\theta_d \\
&= \int \theta_{dk}^2\,
   \frac{\hat\gamma_d^{\setminus w}}{\sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
   \Bigl( \sum_k \theta_{dk} \phi_{kw} \Bigr)
   \frac{\Gamma(\hat\gamma_d^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
   \prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1}\, d\theta_d \\
&= \frac{\hat\gamma_d^{\setminus w}}{\sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
   \Biggl\{
     \phi_{kw} \frac{\gamma_{dk}^{\setminus w} (\gamma_{dk}^{\setminus w}+1)(\gamma_{dk}^{\setminus w}+2)}
                    {\hat\gamma_d^{\setminus w} (\hat\gamma_d^{\setminus w}+1)(\hat\gamma_d^{\setminus w}+2)}
     + \sum_{k' \neq k} \phi_{k'w}
       \frac{\gamma_{dk}^{\setminus w} (\gamma_{dk}^{\setminus w}+1)\, \gamma_{dk'}^{\setminus w}}
            {\hat\gamma_d^{\setminus w} (\hat\gamma_d^{\setminus w}+1)(\hat\gamma_d^{\setminus w}+2)}
   \Biggr\} \\
&= \frac{\hat\gamma_d^{\setminus w}}{\sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
   \cdot \frac{\gamma_{dk}^{\setminus w}}{\hat\gamma_d^{\setminus w}}
   \cdot \frac{\gamma_{dk}^{\setminus w}+1}{\hat\gamma_d^{\setminus w}+1}
   \cdot \frac{2\phi_{kw} + \sum_k \phi_{kw} \gamma_{dk}^{\setminus w}}
              {\hat\gamma_d^{\setminus w}+2} , \tag{8}
\end{align*}

where $\hat\gamma_d^{\setminus w} = \sum_k \gamma_{dk}^{\setminus w}$.
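Both moments have a direct vectorized form. The following is a minimal NumPy sketch (the function name tilted_moments and its arguments are mine, not from the paper) evaluating Eqs. (7) and (8) for all topics k at once; the variable z is the normalizer $\sum_k \phi_{kw}\, \gamma_{dk}^{\setminus w}$ appearing in Eq. (6).

\begin{verbatim}
import numpy as np

def tilted_moments(gamma_minus_w, phi_w):
    """Eqs. (7)-(8): E[theta_dk] and E[theta_dk^2] under the tilted
    posterior for one word w.

    gamma_minus_w : shape (K,), parameters gamma_dk with word w's token removed
    phi_w         : shape (K,), topic-word probabilities phi_kw
    """
    gamma_hat = gamma_minus_w.sum()        # hat{gamma}_d^{\w}
    z = phi_w @ gamma_minus_w              # sum_k phi_kw gamma_dk^{\w}
    m1 = (gamma_minus_w / z) * (phi_w + z) / (gamma_hat + 1.0)          # Eq. (7)
    m2 = ((gamma_minus_w / z) * (gamma_minus_w + 1.0) / (gamma_hat + 1.0)
          * (2.0 * phi_w + z) / (gamma_hat + 2.0))                      # Eq. (8)
    return m1, m2
\end{verbatim}

A quick sanity check on such an implementation: the entries of m1 sum to one over k, as they must since $\theta_d$ lies on the probability simplex.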


We determine a Dirichlet distribution $\operatorname{Dir}(\gamma'_d)$ so that its mean $\gamma'_{dk} / \hat\gamma'_d$ and its average second moment $\frac{1}{K} \sum_k \frac{\gamma'_{dk}(\gamma'_{dk}+1)}{\hat\gamma'_d(\hat\gamma'_d+1)}$ are equal to those of $\tilde q(\theta_d)$, where $\hat\gamma'_d = \sum_k \gamma'_{dk}$. This procedure is called moment matching (see http://www.ucl.ac.uk/statistics/research/pdfs/135.zip). To be precise, we obtain $\gamma'_d$ by solving the following equations:

\[
\frac{\gamma'_{dk}}{\hat\gamma'_d} = \mathbb{E}_{\tilde q}[\theta_{dk}] \quad \text{for each } k, \text{ and} \tag{9}
\]
\[
\frac{1}{K} \sum_k \frac{\gamma'_{dk}(\gamma'_{dk}+1)}{\hat\gamma'_d(\hat\gamma'_d+1)}
= \frac{1}{K} \sum_k \mathbb{E}_{\tilde q}[\theta_{dk}^2] . \tag{10}
\]

We can obtain γ′dk as below.

From Eq. (9) we have $\gamma'_{dk} = \mathbb{E}_{\tilde q}[\theta_{dk}]\, \hat\gamma'_d$. Substituting this into Eq. (10) and cancelling the factor $1/K$ gives

\begin{align*}
\sum_k \frac{\mathbb{E}_{\tilde q}[\theta_{dk}]\,\hat\gamma'_d\, \bigl(\mathbb{E}_{\tilde q}[\theta_{dk}]\,\hat\gamma'_d + 1\bigr)}{\hat\gamma'_d(\hat\gamma'_d + 1)}
&= \sum_k \mathbb{E}_{\tilde q}[\theta_{dk}^2] \\
\sum_k \mathbb{E}_{\tilde q}[\theta_{dk}] \bigl(\mathbb{E}_{\tilde q}[\theta_{dk}]\,\hat\gamma'_d + 1\bigr)
&= \sum_k \mathbb{E}_{\tilde q}[\theta_{dk}^2]\, (\hat\gamma'_d + 1) \\
\sum_k \mathbb{E}_{\tilde q}[\theta_{dk}]^2\, \hat\gamma'_d + \sum_k \mathbb{E}_{\tilde q}[\theta_{dk}]
&= \sum_k \mathbb{E}_{\tilde q}[\theta_{dk}^2]\, \hat\gamma'_d + \sum_k \mathbb{E}_{\tilde q}[\theta_{dk}^2] \\
\sum_k \bigl( \mathbb{E}_{\tilde q}[\theta_{dk}]^2 - \mathbb{E}_{\tilde q}[\theta_{dk}^2] \bigr)\, \hat\gamma'_d
&= \sum_k \bigl( \mathbb{E}_{\tilde q}[\theta_{dk}^2] - \mathbb{E}_{\tilde q}[\theta_{dk}] \bigr) .
\end{align*}

\[
\therefore\ \hat\gamma'_d
= \frac{\sum_k \bigl( \mathbb{E}_{\tilde q}[\theta_{dk}^2] - \mathbb{E}_{\tilde q}[\theta_{dk}] \bigr)}
       {\sum_k \bigl( \mathbb{E}_{\tilde q}[\theta_{dk}]^2 - \mathbb{E}_{\tilde q}[\theta_{dk}^2] \bigr)} \tag{11}
\]

\[
\therefore\ \gamma'_{dk}
= \frac{\sum_k \bigl( \mathbb{E}_{\tilde q}[\theta_{dk}^2] - \mathbb{E}_{\tilde q}[\theta_{dk}] \bigr)}
       {\sum_k \bigl( \mathbb{E}_{\tilde q}[\theta_{dk}]^2 - \mathbb{E}_{\tilde q}[\theta_{dk}^2] \bigr)}
  \cdot \mathbb{E}_{\tilde q}[\theta_{dk}] \tag{12}
\]
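A minimal sketch of this moment-matching step (the function name match_dirichlet is hypothetical), taking the moment arrays m1 and m2 returned by tilted_moments above:

\begin{verbatim}
def match_dirichlet(m1, m2):
    """Eqs. (11)-(12): Dirichlet parameters gamma'_dk whose mean and average
    second moment match those of the tilted posterior."""
    gamma_hat_new = (m2 - m1).sum() / (m1 ** 2 - m2).sum()   # Eq. (11)
    return m1 * gamma_hat_new                                # Eq. (12)
\end{verbatim}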

5

We now have a Dirichlet distribution that approximates the posterior $\tilde q(\theta_d)$ in Eq. (6), i.e.,

\[
\tilde q(\theta_d)
= \frac{\sum_k \gamma_{dk}^{\setminus w}}{\sum_k \phi_{kw}\, \gamma_{dk}^{\setminus w}}
  \Bigl( \sum_k \theta_{dk} \phi_{kw} \Bigr)
  \frac{\Gamma(\sum_k \gamma_{dk}^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
  \prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1} .
\]

To obtain the density function of a Dirichlet distribution from Eq. (6), we replace the term $\bigl( \sum_k \theta_{dk} \phi_{kw} \bigr)$ by $t'_w(\theta_d) = s'_w \prod_k \theta_{dk}^{\beta'_{wk}}$. Setting the resulting function equal to the density function of $\operatorname{Dir}(\gamma'_d)$, we obtain the following equation:

\[
\frac{\sum_k \gamma_{dk}^{\setminus w}}{\sum_k \phi_{kw}\, \gamma_{dk}^{\setminus w}}\,
s'_w \prod_k \theta_{dk}^{\beta'_{wk}}\,
\frac{\Gamma(\sum_k \gamma_{dk}^{\setminus w})}{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}
\prod_k \theta_{dk}^{\gamma_{dk}^{\setminus w} - 1}
= \frac{\Gamma(\sum_k \gamma'_{dk})}{\prod_k \Gamma(\gamma'_{dk})}
  \prod_k \theta_{dk}^{\gamma'_{dk} - 1} . \tag{13}
\]

Consequently, $s'_w$ and $\beta'_{wk}$ are determined as follows:

\[
\beta'_{wk} = \gamma'_{dk} - \gamma_{dk}^{\setminus w} , \tag{14}
\]
\[
s'_w = \frac{\sum_k \phi_{kw}\, \gamma_{dk}^{\setminus w}}{\sum_k \gamma_{dk}^{\setminus w}}
\cdot \frac{\Gamma(\sum_k \gamma'_{dk})}{\prod_k \Gamma(\gamma'_{dk})}
\cdot \frac{\prod_k \Gamma(\gamma_{dk}^{\setminus w})}{\Gamma(\sum_k \gamma_{dk}^{\setminus w})} . \tag{15}
\]
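As a final sketch (again with hypothetical names), the site update of Eqs. (14) and (15) can be computed as below, handling $s'_w$ in log space via scipy.special.gammaln to avoid overflow of the Gamma functions:

\begin{verbatim}
import numpy as np
from scipy.special import gammaln

def update_site(gamma_minus_w, gamma_new, phi_w):
    """Eqs. (14)-(15): new site exponents beta'_wk and log of the new scale s'_w."""
    beta_new = gamma_new - gamma_minus_w                                # Eq. (14)
    log_s = (np.log(phi_w @ gamma_minus_w) - np.log(gamma_minus_w.sum())
             + gammaln(gamma_new.sum()) - gammaln(gamma_new).sum()
             + gammaln(gamma_minus_w).sum() - gammaln(gamma_minus_w.sum()))
    return beta_new, log_s                                              # Eq. (15) in log form
\end{verbatim}

In a full EP sweep, this site update would be applied for each word token in turn, after which $\gamma_d$ is recomputed; the sketch only covers the single-word update derived above.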

