
Variational Bayesian inference for the supervised LDA

Tomonari MASADA @ Nagasaki University

August 21, 2015

1 Full joint distribution

Let M, K, and V denote the numbers of documents, topics, and distinct words, respectively. n_d is the length, i.e., the number of word tokens, of the d-th document. x = {x_1, ..., x_M} are the documents, and z = {z_1, ..., z_M} are the topic assignments. x_{di} = v means that the word v appears as the i-th token of the d-th document, and z_{di} = k means that the i-th token of the d-th document is assigned to the k-th topic. y = {y_1, ..., y_M} are the outputs of the normal regression model.

The full joint distribution of the supervised LDA is:

p(x, y, z, θ, ϕ | α, β, η, σ)
  = ∏_{d=1}^M p(θ_d|α) · ∏_{k=1}^K p(ϕ_k|β) · ∏_{d=1}^M ∏_{i=1}^{n_d} p(z_{di}|θ_d) p(x_{di}|ϕ_{z_{di}}) · ∏_{d=1}^M p(y_d | η^⊤ z̄_d, σ)
  = ∏_{d=1}^M { Γ(Σ_k α_k) / ∏_k Γ(α_k) } ∏_{k=1}^K θ_{dk}^{α_k−1} · ∏_{k=1}^K { Γ(Σ_v β_v) / ∏_v Γ(β_v) } ∏_{v=1}^V ϕ_{kv}^{β_v−1}
    · ∏_{d=1}^M ∏_{i=1}^{n_d} θ_{d,z_{di}} ϕ_{z_{di},x_{di}} · ∏_{d=1}^M (1/√(2πσ²)) exp{ −(y_d − η^⊤ z̄_d)² / (2σ²) },   (1)

where z̄_{dk} ≡ (Σ_{i=1}^{n_d} δ(z_{di} = k)) / n_d.
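Before turning to inference, it may help to see Eq. (1) operationally. The following is a minimal sketch of the generative process it encodes, using numpy; the corpus sizes and hyperparameter values are illustrative assumptions, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V = 100, 5, 1000                  # documents, topics, vocabulary (illustrative sizes)
alpha, beta = np.full(K, 0.1), np.full(V, 0.01)
eta, sigma = rng.normal(size=K), 1.0    # regression weights and noise scale

phi = rng.dirichlet(beta, size=K)       # phi[k]: word distribution of topic k
docs, ys = [], []
for d in range(M):
    n_d = rng.integers(50, 200)         # document length n_d
    theta_d = rng.dirichlet(alpha)      # topic proportions of document d
    z_d = rng.choice(K, size=n_d, p=theta_d)                 # topic per token
    x_d = np.array([rng.choice(V, p=phi[k]) for k in z_d])   # word per token
    zbar_d = np.bincount(z_d, minlength=K) / n_d             # empirical topic frequencies
    y_d = rng.normal(eta @ zbar_d, sigma)                    # response from the normal model
    docs.append(x_d)
    ys.append(y_d)
```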

2 Log of evidence

The log of the evidence is obtained by integrating all unknown variables out as follows:

log p(x, y | α, β, η, σ) = log ∫ Σ_z p(x, y, z, θ, ϕ | α, β, η, σ) dθ dϕ
  = log ∫ Σ_z { ∏_{d=1}^M p(θ_d|α) · ∏_{k=1}^K p(ϕ_k|β) · ∏_{d=1}^M ∏_{i=1}^{n_d} p(z_{di}|θ_d) p(x_{di}|ϕ_{z_{di}}) · ∏_{d=1}^M p(y_d | η^⊤ z̄_d, σ) } dθ dϕ   (2)

However, maximizing this log evidence directly is intractable, because the summation over z sits inside the logarithm.

3 Variational posterior distribution

Jensen’s inequality lower-bounds a log-of-sum by a sum-of-logs, and the sum-of-logs is much easier to manipulate. In our case, a lower bound of Eq. (2) is obtained as follows:

log p(x, y | α, β, η, σ) = log ∫ Σ_z p(x, y, z, θ, ϕ | α, β, η, σ) dθ dϕ
  = log ∫ Σ_z { ∏_d p(θ_d|α) · ∏_k p(ϕ_k|β) · ∏_d ∏_{i=1}^{n_d} p(z_{di}|θ_d) p(x_{di}|ϕ_{z_{di}}) · ∏_d p(y_d | η^⊤ z̄_d, σ) } dθ dϕ
  = log ∫ Σ_z q(z, θ, ϕ) [ ∏_d p(θ_d|α) · ∏_k p(ϕ_k|β) · ∏_d ∏_{i=1}^{n_d} p(z_{di}|θ_d) p(x_{di}|ϕ_{z_{di}}) · ∏_d p(y_d | η^⊤ z̄_d, σ) ] / q(z, θ, ϕ) dθ dϕ
  ≥ ∫ Σ_z q(z, θ, ϕ) log { [ ∏_d p(θ_d|α) · ∏_k p(ϕ_k|β) · ∏_d ∏_{i=1}^{n_d} p(z_{di}|θ_d) p(x_{di}|ϕ_{z_{di}}) · ∏_d p(y_d | η^⊤ z̄_d, σ) ] / q(z, θ, ϕ) } dθ dϕ,   (3)


where q(z, θ, ϕ) is a probability distribution introduced when we apply Jensen's inequality. This distribution approximates the true posterior distribution and is called the variational posterior distribution.

In variational Bayesian inference, we maximize this lower bound, i.e., the right-hand side of Eq. (3), in place of the log of the evidence, i.e., the left-hand side of Eq. (3).
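As a quick sanity check of the bound (not part of the original derivation), the snippet below verifies numerically that log Σ_z q(z) w(z) ≥ Σ_z q(z) log w(z) for an arbitrary distribution q and positive weights w; this is a toy discrete version of the step taken in Eq. (3), with w playing the role of the ratio p(...)/q(...).

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.dirichlet(np.ones(10))    # an arbitrary distribution over 10 states
w = rng.gamma(2.0, size=10)       # arbitrary positive weights w(z)

log_of_sum = np.log(np.sum(q * w))   # log of the expectation (left-hand side)
sum_of_log = np.sum(q * np.log(w))   # expectation of the log (Jensen lower bound)
assert log_of_sum >= sum_of_log      # holds for any q and positive w, since log is concave
print(log_of_sum, sum_of_log)
```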

4 Factorization of variational posterior

To achieve a tractable inference, we make an assumption about the variational posterior: we assume that q(z, θ, ϕ) can be factorized as follows:

q(z, θ, ϕ) = q(z) q(θ) q(ϕ) = ∏_{d=1}^M ∏_{i=1}^{n_d} q(z_{di}) · ∏_{d=1}^M q(θ_d) · ∏_{k=1}^K q(ϕ_k).   (4)

This factorization makes the inference tractable, but it also introduces an approximation. The lower bound of the evidence, often abbreviated as the ELBO, is obtained as follows:

log p(x, y | α, β, η, σ)
  ≥ ∫ Σ_z q(z) q(θ) q(ϕ) log { [ ∏_d p(θ_d|α) · ∏_k p(ϕ_k|β) · ∏_d ∏_i p(z_{di}|θ_d) p(x_{di}|ϕ_{z_{di}}) · ∏_d p(y_d | η^⊤ z̄_d, σ) ] / (q(z) q(θ) q(ϕ)) } dθ dϕ
  = Σ_{d=1}^M ∫ q(θ_d) log p(θ_d|α) dθ_d + Σ_{k=1}^K ∫ q(ϕ_k) log p(ϕ_k|β) dϕ_k
    + Σ_{d=1}^M Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) q(θ_d) log p(z_{di}|θ_d) dθ_d + Σ_{d=1}^M Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) q(ϕ_{z_{di}}) log p(x_{di}|ϕ_{z_{di}}) dϕ_{z_{di}}
    + Σ_{d=1}^M Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ)
    − Σ_{d=1}^M Σ_{i=1}^{n_d} Σ_{z_{di}=1}^K q(z_{di}) log q(z_{di}) − Σ_{d=1}^M ∫ q(θ_d) log q(θ_d) dθ_d − Σ_{k=1}^K ∫ q(ϕ_k) log q(ϕ_k) dϕ_k   (5)

We denote this lower bound by F[q(z, θ, ϕ)].
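Concretely, under the factorization (4) the lower bound F[q(z, θ, ϕ)] is a function of a finite set of variational parameters, and the updates in the following sections operate on these. A minimal sketch of the containers one might use (the array names zeta, gamma, and lam are ours, and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, V = 100, 5, 1000                     # illustrative corpus sizes
n = rng.integers(50, 200, size=M)          # document lengths n_d

# q(z_di) is a categorical distribution per token: zeta[d][i, k] = q(z_di = k)
zeta = [np.full((n_d, K), 1.0 / K) for n_d in n]
# q(theta_d) and q(phi_k) turn out to be Dirichlet (Section 5); these arrays
# hold their parameters, initialized uniformly here.
gamma = np.ones((M, K))
lam = np.ones((K, V))
```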

5 Functional derivative

We obtain the variational posterior distribution q(z, θ, ϕ) that maximizes F[q(z, θ, ϕ)] by using the functional derivative (cf. Wikipedia). We extract the terms including q(θ_d) from F[q(z, θ, ϕ)] as follows:

F̃[q(θ_d)] = ∫ q(θ_d) log p(θ_d|α) dθ_d + Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) q(θ_d) log p(z_{di}|θ_d) dθ_d − ∫ q(θ_d) log q(θ_d) dθ_d   (6)

By using the functional derivative,

δF̃[q(θ_d)] / δq(θ̂_d) = lim_{ε→0} { F̃[q(θ_d) + ε δ(θ_d − θ̂_d)] − F̃[q(θ_d)] } / ε
  = lim_{ε→0} { ∫ {q(θ_d) + ε δ(θ_d − θ̂_d)} log p(θ_d|α) dθ_d − ∫ q(θ_d) log p(θ_d|α) dθ_d } / ε
    + lim_{ε→0} { Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) {q(θ_d) + ε δ(θ_d − θ̂_d)} log p(z_{di}|θ_d) dθ_d − Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) q(θ_d) log p(z_{di}|θ_d) dθ_d } / ε
    − lim_{ε→0} { ∫ {q(θ_d) + ε δ(θ_d − θ̂_d)} log{q(θ_d) + ε δ(θ_d − θ̂_d)} dθ_d − ∫ q(θ_d) log q(θ_d) dθ_d } / ε
  = log p(θ̂_d|α) + Σ_{i=1}^{n_d} Σ_{z_{di}=1}^K q(z_{di}) log p(z_{di}|θ̂_d) − lim_{ε→0} { ∫ q(θ_d) log[ (q(θ_d) + ε δ(θ_d − θ̂_d)) / q(θ_d) ] dθ_d } / ε − log q(θ̂_d)
  = log p(θ̂_d|α) + Σ_{i=1}^{n_d} Σ_{z_{di}=1}^K q(z_{di}) log p(z_{di}|θ̂_d) − 1 − log q(θ̂_d)   (7)


By setting δF̃[q(θ_d)] / δq(θ̂_d) = 0, we obtain

q(θ_d) ∝ p(θ_d|α) · exp[ Σ_{i=1}^{n_d} Σ_{z_{di}=1}^K q(z_{di}) log p(z_{di}|θ_d) ]
  ∝ ∏_{k=1}^K θ_{dk}^{α_k−1} · ∏_{k=1}^K exp[ Σ_{i=1}^{n_d} q(z_{di} = k) log θ_{dk} ] = ∏_{k=1}^K θ_{dk}^{Σ_{i=1}^{n_d} q(z_{di}=k) + α_k − 1}   (8)

Eq. (8) shows that the variational posterior distribution q(θ_d) is a Dirichlet distribution whose parameters are Σ_{i=1}^{n_d} q(z_{di} = k) + α_k for k = 1, ..., K, where Σ_{i=1}^{n_d} q(z_{di} = k) is the expectation of the number of word tokens in the d-th document that are assigned to the k-th topic.

In a similar manner, we can show that the variational posterior q(ϕ_k) is a Dirichlet distribution whose parameters are Σ_{d=1}^M Σ_{i=1}^{n_d} δ(x_{di} = v) q(z_{di} = k) + β_v for v = 1, ..., V, where Σ_{d=1}^M Σ_{i=1}^{n_d} δ(x_{di} = v) q(z_{di} = k) is the expectation of the number of tokens of the v-th word that are assigned to the k-th topic.
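Both Dirichlet updates are a few lines of code once the per-token probabilities ζ_{dik} = q(z_{di} = k) are available. A minimal sketch, reusing the array shapes from Section 4 (docs[d] is assumed to hold the word ids x_{di}; the function names are ours):

```python
import numpy as np

def update_gamma(zeta_d, alpha):
    # Eq. (8): gamma_dk = sum_i q(z_di = k) + alpha_k
    return zeta_d.sum(axis=0) + alpha

def update_lambda(docs, zeta, beta, K, V):
    # lambda_kv = sum_d sum_i delta(x_di = v) q(z_di = k) + beta_v
    lam = np.tile(beta, (K, 1)).astype(float)
    for x_d, zeta_d in zip(docs, zeta):
        for k in range(K):
            # scatter-add over word ids; np.add.at accumulates correctly
            # even when a word id repeats within x_d
            np.add.at(lam[k], x_d, zeta_d[:, k])
    return lam
```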

6 Considerations specific to supervised LDA

6.1 Topic assignment probabilities

The discussion so far applies both to the vanilla LDA and to the supervised LDA. However, the posterior distribution q(z_d) of the supervised LDA differs from that of LDA, because the term Σ_{d=1}^M Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) appears in the lower bound F[q(z, θ, ϕ)] for the supervised LDA, but not in the lower bound for the vanilla LDA.

We extract the terms including q(z_d) from F[q(z, θ, ϕ)] as follows:

F̃[q(z_d)] = Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) q(θ_d) log p(z_{di}|θ_d) dθ_d + Σ_{i=1}^{n_d} ∫ Σ_{z_{di}=1}^K q(z_{di}) q(ϕ_{z_{di}}) log p(x_{di}|ϕ_{z_{di}}) dϕ_{z_{di}}
    + Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) − Σ_{i=1}^{n_d} Σ_{z_{di}=1}^K q(z_{di}) log q(z_{di})
  = Σ_{i=1}^{n_d} Σ_{k=1}^K q(z_{di} = k) ∫ q(θ_d) log θ_{dk} dθ_d + Σ_{i=1}^{n_d} Σ_{k=1}^K q(z_{di} = k) ∫ q(ϕ_k) log ϕ_{k,x_{di}} dϕ_k
    + Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) − Σ_{i=1}^{n_d} Σ_{k=1}^K q(z_{di} = k) log q(z_{di} = k)   (9)

The term Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) in Eq. (9) can be rewritten as follows:

Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) = − (1/(2σ²)) Σ_{z_d} q(z_d) (y_d − η^⊤ z̄_d)² − (1/2) log(2πσ²)
  = − (1/(2σ²)) Σ_{z_d} q(z_d) [ y_d − (1/n_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)} ]² − (1/2) log(2πσ²),   (10)

where ∆(P ) = 1 if the proposition P is true, and ∆(P ) = 0 otherwise.

We can rewrite [ y_d − (1/n_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)} ]² as follows. Note that the square of the sum splits into a diagonal part and a cross part: Δ(z_{di} = k) Δ(z_{di} = l) vanishes unless k = l, and Δ(z_{di} = k)² = Δ(z_{di} = k), which yields the second term below.

[ y_d − (1/n_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)} ]²
  = y_d² + (1/n_d²) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} Δ(z_{di} = k)} − (2y_d/n_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)}
    + (1/n_d²) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Δ(z_{di} = k)} {Σ_{i'≠i} Δ(z_{di'} = l)}   (11)


Therefore, the term Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) can be rewritten as follows:

Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ)
  = − (1/(2σ²)) Σ_{z_d} q(z_d) [ y_d² + (1/n_d²) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} Δ(z_{di} = k)} − (2y_d/n_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)}
      + (1/n_d²) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Δ(z_{di} = k)} {Σ_{i'≠i} Δ(z_{di'} = l)} ] − (1/2) log(2πσ²)
  = − y_d²/(2σ²) − (1/(2σ²n_d²)) Σ_{z_d} q(z_d) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} Δ(z_{di} = k)} + (y_d/(σ²n_d)) Σ_{z_d} q(z_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)}
      − (1/(2σ²n_d²)) Σ_{z_d} q(z_d) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Δ(z_{di} = k)} {Σ_{i'≠i} Δ(z_{di'} = l)} − (1/2) log(2πσ²)
  = − y_d²/(2σ²) − (1/(2σ²n_d²)) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} q(z_{di} = k)} + (y_d/(σ²n_d)) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} q(z_{di} = k)}
      − (1/(2σ²n_d²)) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l)} − (1/2) log(2πσ²),   (12)

where Σ_{i'≠i} means the summation over the indices {1, ..., n_d} \ {i}.

In Eq. (12), for example, the expectation Σ_{z_d} q(z_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)} is rewritten as follows, where z_d^{¬i} denotes z_d with the i-th component removed:

Σ_{z_d} q(z_d) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} Δ(z_{di} = k)} = Σ_{z_d} q(z_d) Σ_{k=1}^K η_k {Δ(z_{d1} = k) + ··· + Δ(z_{dn_d} = k)}
  = Σ_{z_d} q(z_d) {Σ_{k=1}^K η_k Δ(z_{d1} = k)} + ··· + Σ_{z_d} q(z_d) {Σ_{k=1}^K η_k Δ(z_{dn_d} = k)}
  = Σ_{z_{d1}=1}^K q(z_{d1}) [ {Σ_{z_d^{¬1}} q(z_d^{¬1})} · {Σ_{k=1}^K η_k Δ(z_{d1} = k)} ] + ··· + Σ_{z_{dn_d}=1}^K q(z_{dn_d}) [ {Σ_{z_d^{¬n_d}} q(z_d^{¬n_d})} · {Σ_{k=1}^K η_k Δ(z_{dn_d} = k)} ]
  = Σ_{z_{d1}=1}^K q(z_{d1}) {Σ_{k=1}^K η_k Δ(z_{d1} = k)} + ··· + Σ_{z_{dn_d}=1}^K q(z_{dn_d}) {Σ_{k=1}^K η_k Δ(z_{dn_d} = k)}
  = Σ_{k=1}^K q(z_{d1} = k) η_k + ··· + Σ_{k=1}^K q(z_{dn_d} = k) η_k = Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} q(z_{di} = k)}   (13)
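Identity (13) (the expectation of a weighted count under the factorized q equals the sum of the per-token probabilities) is easy to verify by brute force on a tiny example. The snippet below enumerates all K^{n_d} assignments; it is an illustrative check, not something one would run at scale.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
K, n_d = 3, 4
eta = rng.normal(size=K)
zeta_d = rng.dirichlet(np.ones(K), size=n_d)   # zeta_d[i, k] = q(z_di = k)

# Left-hand side: sum over all K**n_d assignments z_d of
# q(z_d) * sum_k eta_k * (number of tokens assigned to topic k)
lhs = 0.0
for z_d in itertools.product(range(K), repeat=n_d):
    q_zd = np.prod([zeta_d[i, k] for i, k in enumerate(z_d)])
    counts = np.bincount(z_d, minlength=K)
    lhs += q_zd * (eta @ counts)

# Right-hand side of Eq. (13): sum_k eta_k * sum_i q(z_di = k)
rhs = eta @ zeta_d.sum(axis=0)
assert np.isclose(lhs, rhs)
```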

The other terms in Eq. (12) are rewritten in a similar manner.

After introducing a Lagrange multiplier term λ_{di} {1 − Σ_k q(z_{di} = k)} that represents the constraint Σ_k q(z_{di} = k) = 1 for each z_{di}, we obtain the following derivative:

δF̃[q(z_d)] / δq(z_{di} = k) = ∫ q(θ_d) log θ_{dk} dθ_d + ∫ q(ϕ_k) log ϕ_{k,x_{di}} dϕ_k
    − η_k²/(2σ²n_d²) + y_d η_k/(σ²n_d) − (1/(σ²n_d²)) Σ_{l=1}^K η_k η_l Σ_{i'≠i} q(z_{di'} = l) − log q(z_{di} = k) − 1 − λ_{di}   (14)

By setting δF̃[q(z_d)] / δq(z_{di} = k) = 0, we obtain the variational probability q(z_{di} = k) that the i-th word token of the d-th document is assigned to the k-th topic as follows:

q(z_{di} = k) ∝ exp[ ∫ q(θ_d) log θ_{dk} dθ_d ] · exp[ ∫ q(ϕ_k) log ϕ_{k,x_{di}} dϕ_k ]
    · exp[ − η_k²/(2σ²n_d²) + y_d η_k/(σ²n_d) − (η_k/(σ²n_d²)) Σ_{l=1}^K η_l {Σ_{i'≠i} q(z_{di'} = l)} ]   (15)


The integrals ∫ q(θ_d) log θ_{dk} dθ_d and ∫ q(ϕ_k) log ϕ_{k,x_{di}} dϕ_k can be obtained from Eq. (B.21) of Pattern Recognition and Machine Learning by Christopher M. Bishop: for a Dirichlet distribution with parameters γ, E[log θ_k] = ψ(γ_k) − ψ(Σ_l γ_l), where ψ is the digamma function.
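Putting Eq. (15) together with Bishop's (B.21), the per-token update can be sketched as follows. The arrays gamma_d, lam, and zeta_d follow the shapes introduced in Section 4, sigma2 stands for σ², and the function name is ours.

```python
import numpy as np
from scipy.special import digamma

def update_zeta_d(x_d, y_d, zeta_d, gamma_d, lam, eta, sigma2):
    """One coordinate-ascent sweep of Eq. (15) over the tokens of document d (sketch)."""
    n_d = len(x_d)
    # Bishop (B.21): E[log theta_dk] and E[log phi_kv] under the Dirichlet posteriors
    Elog_theta = digamma(gamma_d) - digamma(gamma_d.sum())
    Elog_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    for i, v in enumerate(x_d):
        others = zeta_d.sum(axis=0) - zeta_d[i]   # sum over i' != i of q(z_di' = l)
        log_q = (Elog_theta + Elog_phi[:, v]
                 - eta ** 2 / (2 * sigma2 * n_d ** 2)
                 + y_d * eta / (sigma2 * n_d)
                 - eta * (eta @ others) / (sigma2 * n_d ** 2))
        log_q -= log_q.max()                      # subtract the max before exponentiating, for stability
        zeta_d[i] = np.exp(log_q) / np.exp(log_q).sum()
    return zeta_d
```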

6.2 Regression parameters

The parameters η = (η_1, ..., η_K) and σ appear only in the term Σ_{d=1}^M Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ) in Eq. (5) (cf. Eq. (12)). By differentiating it with respect to η_k, we obtain the following:

∂/∂η_k Σ_{d=1}^M Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ)
  = Σ_{d=1}^M ∂/∂η_k [ − (1/(2σ²n_d²)) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} q(z_{di} = k)} + (y_d/(σ²n_d)) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} q(z_{di} = k)}
      − (1/(2σ²n_d²)) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l)} ]
  = − Σ_{d=1}^M (η_k/(σ²n_d²)) Σ_{i=1}^{n_d} q(z_{di} = k) + Σ_{d=1}^M (y_d/(σ²n_d)) Σ_{i=1}^{n_d} q(z_{di} = k) − Σ_{d=1}^M (η_k/(σ²n_d²)) Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = k)
      − Σ_{d=1}^M (1/(σ²n_d²)) Σ_{l≠k} η_l {Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l)}
  = − (1/σ²) [ η_k Σ_{d=1}^M (1/n_d²) Σ_{i=1}^{n_d} q(z_{di} = k) + Σ_{l=1}^K η_l Σ_{d=1}^M (1/n_d²) {Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l)} ] + Σ_{d=1}^M (y_d/(σ²n_d)) Σ_{i=1}^{n_d} q(z_{di} = k)   (16)

By setting Eq. (16) equal to 0, we obtain the following equation for each k = 1, . . . ,K:

Σ_{l=1}^K η_l Σ_{d=1}^M (1/n_d²) Σ_{i=1}^{n_d} q(z_{di} = k) {Δ(k = l) + Σ_{i'≠i} q(z_{di'} = l)} = Σ_{d=1}^M (y_d/n_d) Σ_{i=1}^{n_d} q(z_{di} = k)   (17)

Let q(z_{di} = k) be denoted by ζ_{dik}, and let ζ_{di} = (ζ_{di1}, ..., ζ_{diK})^⊤. Then Eq. (17) can be rewritten as follows:

[ Σ_{d=1}^M (1/n_d²) Σ_{i=1}^{n_d} { diag(ζ_{di}) + Σ_{i'≠i} ζ_{di} ζ_{di'}^⊤ } ] η = Σ_{d=1}^M (y_d/n_d) Σ_{i=1}^{n_d} ζ_{di}   (18)

Consequently, an estimate of η can be obtained as follows:

η = [ Σ_{d=1}^M (1/n_d²) Σ_{i=1}^{n_d} { diag(ζ_{di}) + Σ_{i'≠i} ζ_{di} ζ_{di'}^⊤ } ]^{−1} ( Σ_{d=1}^M (y_d/n_d) Σ_{i=1}^{n_d} ζ_{di} )   (19)
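Eq. (19) is a K×K linear solve once the ζ statistics are accumulated. A sketch under the same array conventions as above (the helper name solve_eta is ours; note the 1/n_d² factor carried over from Eq. (18)):

```python
import numpy as np

def solve_eta(zeta, ys):
    """Assemble and solve the linear system of Eqs. (18)-(19) for eta (sketch)."""
    K = zeta[0].shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for zeta_d, y_d in zip(zeta, ys):
        n_d = zeta_d.shape[0]
        s = zeta_d.sum(axis=0)                  # s_k = sum_i q(z_di = k)
        # sum_i sum_{i' != i} zeta_di zeta_di'^T  =  s s^T - sum_i zeta_di zeta_di^T
        cross = np.outer(s, s) - zeta_d.T @ zeta_d
        A += (np.diag(s) + cross) / n_d ** 2
        b += (y_d / n_d) * s
    return np.linalg.solve(A, b)
```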

We next differentiate with respect to σ and obtain the following:

∂/∂σ Σ_{d=1}^M Σ_{z_d} q(z_d) log p(y_d | η^⊤ z̄_d, σ)
  = Σ_{d=1}^M ∂/∂σ [ − y_d²/(2σ²) − (1/(2σ²n_d²)) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} q(z_{di} = k)} + (y_d/(σ²n_d)) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} q(z_{di} = k)}
      − (1/(2σ²n_d²)) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l)} − (1/2) log(2πσ²) ]
  = Σ_{d=1}^M [ y_d²/σ³ + (1/(σ³n_d²)) Σ_{k=1}^K η_k² {Σ_{i=1}^{n_d} q(z_{di} = k)} − (2y_d/(σ³n_d)) Σ_{k=1}^K η_k {Σ_{i=1}^{n_d} q(z_{di} = k)}
      + (1/(σ³n_d²)) Σ_{k=1}^K Σ_{l=1}^K η_k η_l {Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l)} ] − M/σ   (20)


Therefore, by setting Eq. (20) equal to 0, an estimate of σ² is obtained as follows:

σ² = (1/M) Σ_{d=1}^M { y_d² + Σ_{k=1}^K η_k² (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d² − 2y_d Σ_{k=1}^K η_k (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d
    + Σ_{k=1}^K Σ_{l=1}^K η_k η_l (Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l))/n_d² }   (21)

By setting Eq. (16) equal to 0, we obtain the following equation for each k = 1, . . . ,K:

Σ_{l=1}^K η_l Σ_{d=1}^M (Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l))/n_d² = − η_k Σ_{d=1}^M (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d² + Σ_{d=1}^M y_d (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d   (22)

Therefore, Eq. (21) can be rewritten as follows:

σ² = (1/M) { Σ_{d=1}^M y_d² + Σ_{k=1}^K η_k² Σ_{d=1}^M (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d² − 2 Σ_{k=1}^K η_k Σ_{d=1}^M y_d (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d
    + Σ_{k=1}^K η_k Σ_{l=1}^K η_l Σ_{d=1}^M (Σ_{i=1}^{n_d} Σ_{i'≠i} q(z_{di} = k) q(z_{di'} = l))/n_d² }
  = (1/M) [ Σ_{d=1}^M y_d² + Σ_{k=1}^K η_k² Σ_{d=1}^M (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d² − 2 Σ_{k=1}^K η_k Σ_{d=1}^M y_d (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d
    + Σ_{k=1}^K η_k { − η_k Σ_{d=1}^M (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d² + Σ_{d=1}^M y_d (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d } ]
  = (1/M) { Σ_{d=1}^M y_d² − Σ_{k=1}^K η_k Σ_{d=1}^M y_d (Σ_{i=1}^{n_d} q(z_{di} = k))/n_d }   (23)
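Given η from Eq. (19), the simplified form (23) reduces the σ² update to simple per-document statistics. A sketch under the same array conventions as before (the function name is ours):

```python
import numpy as np

def update_sigma2(zeta, ys, eta):
    # Eq. (23): sigma^2 = (1/M) * sum_d ( y_d^2 - y_d * eta . E[zbar_d] )
    M = len(ys)
    total = 0.0
    for zeta_d, y_d in zip(zeta, ys):
        zbar_d = zeta_d.mean(axis=0)   # (sum_i q(z_di = k)) / n_d = E[zbar_dk]
        total += y_d ** 2 - y_d * (eta @ zbar_d)
    return total / M
```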

(This document may contain errors.)
