Variational Bayesian inference for the supervised LDA
Tomonari MASADA @ Nagasaki University
August 21, 2015
1 Full joint distribution
Let $M$, $K$, and $V$ denote the numbers of documents, topics, and different words, respectively. $n_d$ is the length, i.e., the number of word tokens, of the $d$-th document. $x = \{x_1, \ldots, x_M\}$ are documents, and $z = \{z_1, \ldots, z_M\}$ are topic assignments. $x_{di} = v$ means that the word $v$ appears as the $i$-th token of the $d$-th document. $z_{di} = k$ means that the $i$-th token of the $d$-th document is assigned to the $k$-th topic. $y = \{y_1, \ldots, y_M\}$ are the outputs for the normal regression model.
The full joint distribution of the supervised LDA is:
$$
\begin{aligned}
p(x, y, z, \theta, \phi \mid \alpha, \beta, \eta, \sigma)
&= \prod_{d=1}^{M} p(\theta_d \mid \alpha) \cdot \prod_{k=1}^{K} p(\phi_k \mid \beta)
\cdot \prod_{d=1}^{M} \prod_{i=1}^{n_d} p(z_{di} \mid \theta_d)\, p(x_{di} \mid \phi_{z_{di}})
\cdot \prod_{d=1}^{M} p(y_d \mid \eta^\top \bar{z}_d, \sigma) \\
&= \prod_{d=1}^{M} \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{dk}^{\alpha_k - 1}
\cdot \prod_{k=1}^{K} \frac{\Gamma(\sum_v \beta_v)}{\prod_v \Gamma(\beta_v)} \prod_{v=1}^{V} \phi_{kv}^{\beta_v - 1}
\cdot \prod_{d=1}^{M} \prod_{i=1}^{n_d} \theta_{d z_{di}}\, \phi_{z_{di} x_{di}}
\cdot \prod_{d=1}^{M} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_d - \eta^\top \bar{z}_d)^2}{2\sigma^2} \right\},
\end{aligned} \tag{1}
$$
where $\bar{z}_{dk} \equiv \frac{\sum_{i=1}^{n_d} \delta(z_{di}=k)}{n_d}$.
2 Log of evidence
The log of the evidence is obtained by integrating all unknown variables out as follows:
$$
\begin{aligned}
\log p(x, y \mid \alpha, \beta, \eta, \sigma)
&= \log \int \sum_{z} p(x, y, z, \theta, \phi \mid \alpha, \beta, \eta, \sigma)\, d\theta\, d\phi \\
&= \log \int \sum_{z} \left\{ \prod_{d=1}^{M} p(\theta_d \mid \alpha) \cdot \prod_{k=1}^{K} p(\phi_k \mid \beta)
\cdot \prod_{d=1}^{M} \prod_{i=1}^{n_d} p(z_{di} \mid \theta_d)\, p(x_{di} \mid \phi_{z_{di}})
\cdot \prod_{d=1}^{M} p(y_d \mid \eta^\top \bar{z}_d, \sigma) \right\} d\theta\, d\phi
\end{aligned} \tag{2}
$$
However, the maximization of the log of the evidence given above is intractable.
3 Variational posterior distribution
Jensen’s inequality provides the sum-of-log as a lower bound of the log-of-sum. The sum-of-log is relatively easy to manipulate. In our case, a lower bound of Eq. (2) is obtained as follows:
$$
\begin{aligned}
\log p(x, y \mid \alpha, \beta, \eta, \sigma)
&= \log \int \sum_{z} p(x, y, z, \theta, \phi \mid \alpha, \beta, \eta, \sigma)\, d\theta\, d\phi \\
&= \log \int \sum_{z} q(z, \theta, \phi)\,
\frac{\prod_d p(\theta_d \mid \alpha) \cdot \prod_k p(\phi_k \mid \beta)
\cdot \prod_d \prod_{i=1}^{n_d} p(z_{di} \mid \theta_d)\, p(x_{di} \mid \phi_{z_{di}})
\cdot \prod_d p(y_d \mid \eta^\top \bar{z}_d, \sigma)}{q(z, \theta, \phi)}\, d\theta\, d\phi \\
&\ge \int \sum_{z} q(z, \theta, \phi) \log
\frac{\prod_d p(\theta_d \mid \alpha) \cdot \prod_k p(\phi_k \mid \beta)
\cdot \prod_d \prod_{i=1}^{n_d} p(z_{di} \mid \theta_d)\, p(x_{di} \mid \phi_{z_{di}})
\cdot \prod_d p(y_d \mid \eta^\top \bar{z}_d, \sigma)}{q(z, \theta, \phi)}\, d\theta\, d\phi,
\end{aligned} \tag{3}
$$
where $q(z, \theta, \phi)$ is a probability distribution introduced when we apply Jensen’s inequality. This distribution approximates the true posterior distribution and is called the variational posterior distribution.
In variational Bayesian inference, we maximize the lower bound, i.e., the right-hand side of Eq. (3), in place of the log of the evidence, i.e., the left-hand side of Eq. (3).
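The direction of the bound can be checked numerically on a toy discrete example; the numbers below are arbitrary assumptions. Writing the log-of-sum as $\log \sum_z q(z)\, p(z)/q(z)$, Jensen’s inequality gives $\log \sum_z p(z) \ge \sum_z q(z) \log \frac{p(z)}{q(z)}$ for any distribution $q$:

```python
from math import log

# Toy discrete latent variable: q is a variational distribution,
# p holds the joint values being summed (arbitrary illustrative numbers).
q = [0.2, 0.5, 0.3]
p = [0.05, 0.40, 0.10]

log_of_sum = log(sum(p))                                     # left-hand side
sum_of_log = sum(qi * log(pi / qi) for qi, pi in zip(q, p))  # Jensen lower bound
print(log_of_sum, sum_of_log)
```

The gap between the two quantities is the KL divergence from $q$ to the normalized $p$, which is why maximizing the lower bound over $q$ tightens the approximation.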
4 Factorization of variational posterior
We make an assumption for the variational posterior to achieve a tractable inference. We assume that $q(z, \theta, \phi)$ can be factorized as follows:
$$
q(z, \theta, \phi) = q(z)\, q(\theta)\, q(\phi)
= \prod_{d=1}^{M} \prod_{i=1}^{n_d} q(z_{di}) \cdot \prod_{d=1}^{M} q(\theta_d) \cdot \prod_{k=1}^{K} q(\phi_k). \tag{4}
$$
This factorization makes the inference tractable. However, we introduce an approximation at the same time. The evidence lower bound, often abbreviated as ELBO, is obtained as follows:
$$
\begin{aligned}
&\log p(x, y \mid \alpha, \beta, \eta, \sigma) \\
&\ge \int \sum_{z} q(z)\, q(\theta)\, q(\phi) \log
\frac{\prod_d p(\theta_d \mid \alpha) \cdot \prod_k p(\phi_k \mid \beta)
\cdot \prod_d \prod_i p(z_{di} \mid \theta_d)\, p(x_{di} \mid \phi_{z_{di}})
\cdot \prod_d p(y_d \mid \eta^\top \bar{z}_d, \sigma)}{q(z)\, q(\theta)\, q(\phi)}\, d\theta\, d\phi \\
&= \sum_{d=1}^{M} \int q(\theta_d) \log p(\theta_d \mid \alpha)\, d\theta_d
+ \sum_{k=1}^{K} \int q(\phi_k) \log p(\phi_k \mid \beta)\, d\phi_k \\
&\quad + \sum_{d=1}^{M} \sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di})\, q(\theta_d) \log p(z_{di} \mid \theta_d)\, d\theta_d
+ \sum_{d=1}^{M} \sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di})\, q(\phi_{z_{di}}) \log p(x_{di} \mid \phi_{z_{di}})\, d\phi_{z_{di}} \\
&\quad + \sum_{d=1}^{M} \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma) \\
&\quad - \sum_{d=1}^{M} \sum_{i=1}^{n_d} \sum_{z_{di}=1}^{K} q(z_{di}) \log q(z_{di})
- \sum_{d=1}^{M} \int q(\theta_d) \log q(\theta_d)\, d\theta_d
- \sum_{k=1}^{K} \int q(\phi_k) \log q(\phi_k)\, d\phi_k
\end{aligned} \tag{5}
$$
We denote this lower bound by $F[q(z, \theta, \phi)]$.
5 Functional derivative
We obtain the variational posterior distribution $q(z, \theta, \phi)$ that maximizes $F[q(z, \theta, \phi)]$ by using the functional derivative (cf. Wikipedia). We extract the terms including $q(\theta_d)$ from $F[q(z, \theta, \phi)]$ as follows:
$$
\tilde{F}[q(\theta_d)] = \int q(\theta_d) \log p(\theta_d \mid \alpha)\, d\theta_d
+ \sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di})\, q(\theta_d) \log p(z_{di} \mid \theta_d)\, d\theta_d
- \int q(\theta_d) \log q(\theta_d)\, d\theta_d \tag{6}
$$
By using the functional derivative,
$$
\begin{aligned}
\frac{\delta \tilde{F}[q(\theta_d)]}{\delta q(\hat{\theta}_d)}
&= \lim_{\varepsilon \to 0} \frac{\tilde{F}[q(\theta_d) + \varepsilon\, \delta(\theta_d - \hat{\theta}_d)] - \tilde{F}[q(\theta_d)]}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{\int \{q(\theta_d) + \varepsilon\, \delta(\theta_d - \hat{\theta}_d)\} \log p(\theta_d \mid \alpha)\, d\theta_d - \int q(\theta_d) \log p(\theta_d \mid \alpha)\, d\theta_d}{\varepsilon} \\
&\quad + \lim_{\varepsilon \to 0} \frac{\sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di}) \{q(\theta_d) + \varepsilon\, \delta(\theta_d - \hat{\theta}_d)\} \log p(z_{di} \mid \theta_d)\, d\theta_d - \sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di})\, q(\theta_d) \log p(z_{di} \mid \theta_d)\, d\theta_d}{\varepsilon} \\
&\quad - \lim_{\varepsilon \to 0} \frac{\int \{q(\theta_d) + \varepsilon\, \delta(\theta_d - \hat{\theta}_d)\} \log \{q(\theta_d) + \varepsilon\, \delta(\theta_d - \hat{\theta}_d)\}\, d\theta_d - \int q(\theta_d) \log q(\theta_d)\, d\theta_d}{\varepsilon} \\
&= \log p(\hat{\theta}_d \mid \alpha) + \sum_{i=1}^{n_d} \sum_{z_{di}=1}^{K} q(z_{di}) \log p(z_{di} \mid \hat{\theta}_d)
- \lim_{\varepsilon \to 0} \frac{\int q(\theta_d) \log \frac{q(\theta_d) + \varepsilon\, \delta(\theta_d - \hat{\theta}_d)}{q(\theta_d)}\, d\theta_d}{\varepsilon} - \log q(\hat{\theta}_d) \\
&= \log p(\hat{\theta}_d \mid \alpha) + \sum_{i=1}^{n_d} \sum_{z_{di}=1}^{K} q(z_{di}) \log p(z_{di} \mid \hat{\theta}_d) - 1 - \log q(\hat{\theta}_d)
\end{aligned} \tag{7}
$$
By setting $\frac{\delta \tilde{F}[q(\theta_d)]}{\delta q(\hat{\theta}_d)} = 0$, we obtain
$$
\begin{aligned}
q(\theta_d) &\propto p(\theta_d \mid \alpha) \cdot \exp\left[ \sum_{i=1}^{n_d} \sum_{z_{di}=1}^{K} q(z_{di}) \log p(z_{di} \mid \theta_d) \right] \\
&\propto \prod_{k=1}^{K} \theta_{dk}^{\alpha_k - 1} \cdot \prod_{k=1}^{K} \exp\left[ \sum_{i=1}^{n_d} q(z_{di} = k) \log \theta_{dk} \right]
= \prod_{k=1}^{K} \theta_{dk}^{\sum_{i=1}^{n_d} q(z_{di}=k) + \alpha_k - 1}
\end{aligned} \tag{8}
$$
Eq. (8) shows that the variational posterior distribution $q(\theta_d)$ is a Dirichlet distribution whose parameters are $\sum_{i=1}^{n_d} q(z_{di}=k) + \alpha_k$ for $k = 1, \ldots, K$, where $\sum_{i=1}^{n_d} q(z_{di}=k)$ is the expectation of the number of word tokens in the $d$-th document that are assigned to the $k$-th topic.
In a similar manner, we can show that the variational posterior $q(\phi_k)$ is a Dirichlet distribution whose parameters are $\sum_{d=1}^{M} \sum_{i=1}^{n_d} \delta(x_{di}=v)\, q(z_{di}=k) + \beta_v$ for $v = 1, \ldots, V$, where $\sum_{d=1}^{M} \sum_{i=1}^{n_d} \delta(x_{di}=v)\, q(z_{di}=k)$ is the expectation of the number of the tokens of the $v$-th word that are assigned to the $k$-th topic.
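Given responsibilities $\zeta_{dik} = q(z_{di}=k)$, both Dirichlet updates reduce to simple accumulations. The sketch below uses toy dimensions and random responsibilities; the array names `a`, `b`, and `zeta` are illustrative assumptions, not notation from this note:

```python
import numpy as np

rng = np.random.default_rng(2)

M, K, V = 4, 3, 10
n_d = 20
alpha = np.full(K, 0.5)
beta = np.full(V, 0.1)

# Random responsibilities zeta[d][i, k] = q(z_di = k) and word tokens (illustrative).
zeta = [rng.dirichlet(np.ones(K), size=n_d) for _ in range(M)]
x = [rng.integers(0, V, size=n_d) for _ in range(M)]

# q(theta_d): Dirichlet with parameters sum_i q(z_di = k) + alpha_k  (Eq. (8))
a = np.stack([zd.sum(axis=0) + alpha for zd in zeta])  # shape M x K

# q(phi_k): Dirichlet with parameters sum_d sum_i delta(x_di = v) q(z_di = k) + beta_v
b = np.tile(beta, (K, 1))                              # shape K x V
for d in range(M):
    for i in range(n_d):
        b[:, x[d][i]] += zeta[d][i]                    # scatter soft counts per word

print(a.shape, b.shape)
```

Each row of `a` sums to $n_d + \sum_k \alpha_k$, i.e., the soft topic counts plus the prior mass, which is a quick consistency check on the update.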
6 Considerations specific to supervised LDA
6.1 Topic assignment probabilities
The discussion so far is applicable either to the vanilla LDA or to the supervised LDA. However, the posterior distribution $q(z_d)$ of the supervised LDA is different from that of LDA, because the term $\sum_{d=1}^{M} \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)$ appears in the lower bound $F[q(z, \theta, \phi)]$ for the supervised LDA, not in the lower bound for the vanilla LDA.
We extract the terms including $q(z_d)$ from $F[q(z, \theta, \phi)]$ as follows:
$$
\begin{aligned}
\tilde{F}[q(z_d)] &= \sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di})\, q(\theta_d) \log p(z_{di} \mid \theta_d)\, d\theta_d
+ \sum_{i=1}^{n_d} \int \sum_{z_{di}=1}^{K} q(z_{di})\, q(\phi_{z_{di}}) \log p(x_{di} \mid \phi_{z_{di}})\, d\phi_{z_{di}} \\
&\quad + \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)
- \sum_{i=1}^{n_d} \sum_{z_{di}=1}^{K} q(z_{di}) \log q(z_{di}) \\
&= \sum_{i=1}^{n_d} \sum_{k=1}^{K} q(z_{di}=k) \int q(\theta_d) \log \theta_{dk}\, d\theta_d
+ \sum_{i=1}^{n_d} \sum_{k=1}^{K} q(z_{di}=k) \int q(\phi_k) \log \phi_{k x_{di}}\, d\phi_k \\
&\quad + \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)
- \sum_{i=1}^{n_d} \sum_{k=1}^{K} q(z_{di}=k) \log q(z_{di}=k)
\end{aligned} \tag{9}
$$
The term $\sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)$ in Eq. (9) can be rewritten as follows:
$$
\begin{aligned}
\sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)
&= -\frac{1}{2\sigma^2} \sum_{z_d} q(z_d)\, (y_d - \eta^\top \bar{z}_d)^2 - \frac{1}{2} \log(2\pi\sigma^2) \\
&= -\frac{1}{2\sigma^2} \sum_{z_d} q(z_d) \left[ y_d - \frac{1}{n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\} \right]^2 - \frac{1}{2} \log(2\pi\sigma^2),
\end{aligned} \tag{10}
$$
where $\Delta(P) = 1$ if the proposition $P$ is true, and $\Delta(P) = 0$ otherwise.
We can rewrite $\left[ y_d - \frac{1}{n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\} \right]^2$ as follows:
$$
\begin{aligned}
\left[ y_d - \frac{1}{n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\} \right]^2
&= y_d^2 + \frac{1}{n_d^2} \sum_{k=1}^{K} \eta_k^2 \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\}
- \frac{2 y_d}{n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\} \\
&\quad + \frac{1}{n_d^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} \Delta(z_{di}=k)\, \Delta(z_{di'}=l) \right\}
\end{aligned} \tag{11}
$$
Therefore, the term $\sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)$ can be rewritten as follows:
$$
\begin{aligned}
&\sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma) \\
&= -\frac{1}{2\sigma^2} \sum_{z_d} q(z_d) \left[ y_d^2 + \frac{1}{n_d^2} \sum_{k=1}^{K} \eta_k^2 \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\}
- \frac{2 y_d}{n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\} \right. \\
&\qquad \left. + \frac{1}{n_d^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} \Delta(z_{di}=k)\, \Delta(z_{di'}=l) \right\} \right] - \frac{1}{2} \log(2\pi\sigma^2) \\
&= -\frac{y_d^2}{2\sigma^2} - \frac{1}{2\sigma^2 n_d^2} \sum_{z_d} q(z_d) \sum_{k=1}^{K} \eta_k^2 \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\}
+ \frac{y_d}{\sigma^2 n_d} \sum_{z_d} q(z_d) \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\} \\
&\qquad - \frac{1}{2\sigma^2 n_d^2} \sum_{z_d} q(z_d) \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} \Delta(z_{di}=k)\, \Delta(z_{di'}=l) \right\} - \frac{1}{2} \log(2\pi\sigma^2) \\
&= -\frac{y_d^2}{2\sigma^2} - \frac{1}{2\sigma^2 n_d^2} \sum_{k=1}^{K} \eta_k^2 \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\}
+ \frac{y_d}{\sigma^2 n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\} \\
&\qquad - \frac{1}{2\sigma^2 n_d^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l) \right\} - \frac{1}{2} \log(2\pi\sigma^2),
\end{aligned} \tag{12}
$$
where $\sum_{i' \ne i}^{n_d}$ means the summation over the indices $\{1, \ldots, n_d\} \setminus \{i\}$. In Eq. (12), for example, the expectation $\sum_{z_d} q(z_d) \sum_{k=1}^{K} \eta_k \{\sum_{i=1}^{n_d} \Delta(z_{di}=k)\}$ is rewritten as follows:
$$
\begin{aligned}
\sum_{z_d} q(z_d) \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} \Delta(z_{di}=k) \right\}
&= \sum_{z_d} q(z_d) \sum_{k=1}^{K} \eta_k \left\{ \Delta(z_{d1}=k) + \cdots + \Delta(z_{d n_d}=k) \right\} \\
&= \sum_{z_d} q(z_d) \left\{ \sum_{k=1}^{K} \eta_k \Delta(z_{d1}=k) \right\} + \cdots + \sum_{z_d} q(z_d) \left\{ \sum_{k=1}^{K} \eta_k \Delta(z_{d n_d}=k) \right\} \\
&= \sum_{z_{d1}=1}^{K} q(z_{d1}) \left[ \left\{ \sum_{z_d^{\neg 1}} q(z_d^{\neg 1}) \right\} \cdot \left\{ \sum_{k=1}^{K} \eta_k \Delta(z_{d1}=k) \right\} \right]
+ \cdots + \sum_{z_{d n_d}=1}^{K} q(z_{d n_d}) \left[ \left\{ \sum_{z_d^{\neg n_d}} q(z_d^{\neg n_d}) \right\} \cdot \left\{ \sum_{k=1}^{K} \eta_k \Delta(z_{d n_d}=k) \right\} \right] \\
&= \sum_{z_{d1}=1}^{K} q(z_{d1}) \left\{ \sum_{k=1}^{K} \eta_k \Delta(z_{d1}=k) \right\} + \cdots + \sum_{z_{d n_d}=1}^{K} q(z_{d n_d}) \left\{ \sum_{k=1}^{K} \eta_k \Delta(z_{d n_d}=k) \right\} \\
&= \sum_{k=1}^{K} q(z_{d1}=k)\, \eta_k + \cdots + \sum_{k=1}^{K} q(z_{d n_d}=k)\, \eta_k
= \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\}
\end{aligned} \tag{13}
$$
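The identity in Eq. (13) can be verified by Monte Carlo: sampling $z_d$ from the factorized $q(z_d)$ and averaging $\sum_k \eta_k \sum_i \Delta(z_{di}=k)$ should reproduce $\sum_k \eta_k \sum_i q(z_{di}=k)$. The toy responsibilities and $\eta$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

K, n_d = 3, 5
eta = np.array([1.0, -1.0, 0.5])
zeta = rng.dirichlet(np.ones(K), size=n_d)   # zeta[i, k] = q(z_di = k)

# Exact value of the expectation from Eq. (13).
exact = float(eta @ zeta.sum(axis=0))

# Monte Carlo estimate: sample z_d token-wise from the factorized q and average.
S = 200_000
samples = np.stack([rng.choice(K, size=S, p=zeta[i]) for i in range(n_d)])  # n_d x S
counts = np.stack([(samples == k).sum(axis=0) for k in range(K)])           # K x S
mc = float((eta @ counts).mean())

print(exact, mc)
```

The agreement reflects linearity of expectation: the factorization of $q(z_d)$ lets each token's indicator be averaged independently.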
Other terms in Eq. (12) are also rewritten in a similar manner. After introducing a Lagrange multiplier term $\lambda_{di}\{1 - \sum_k q(z_{di}=k)\}$ that represents the constraint $\sum_k q(z_{di}=k) = 1$ for each $z_{di}$, we obtain the following derivative:
$$
\begin{aligned}
\frac{\delta \tilde{F}[q(z_d)]}{\delta q(z_{di}=k)}
&= \int q(\theta_d) \log \theta_{dk}\, d\theta_d + \int q(\phi_k) \log \phi_{k x_{di}}\, d\phi_k \\
&\quad - \frac{\eta_k^2}{2\sigma^2 n_d^2} + \frac{y_d}{\sigma^2 n_d} \eta_k
- \frac{1}{\sigma^2 n_d^2} \sum_{l=1}^{K} \eta_k \eta_l \sum_{i' \ne i}^{n_d} q(z_{di'}=l)
- \log q(z_{di}=k) - 1 - \lambda_{di}
\end{aligned} \tag{14}
$$
By setting $\frac{\delta \tilde{F}[q(z_d)]}{\delta q(z_{di}=k)} = 0$, we obtain the variational probability $q(z_{di}=k)$ that the $i$-th word token of the $d$-th document is assigned to the $k$-th topic as follows:
$$
q(z_{di}=k) \propto \exp\left[ \int q(\theta_d) \log \theta_{dk}\, d\theta_d \right]
\cdot \exp\left[ \int q(\phi_k) \log \phi_{k x_{di}}\, d\phi_k \right]
\cdot \exp\left[ -\frac{\eta_k^2}{2\sigma^2 n_d^2} + \frac{y_d \eta_k}{\sigma^2 n_d}
- \frac{\eta_k}{\sigma^2 n_d^2} \sum_{l=1}^{K} \eta_l \left\{ \sum_{i' \ne i}^{n_d} q(z_{di'}=l) \right\} \right] \tag{15}
$$
The integrals $\int q(\theta_d) \log \theta_{dk}\, d\theta_d$ and $\int q(\phi_k) \log \phi_{k x_{di}}\, d\phi_k$ can be obtained based on Eq. (B.21) of Pattern Recognition and Machine Learning by Christopher M. Bishop.
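For a Dirichlet $q(\theta_d)$ with parameters $a_d$, that identity gives $\int q(\theta_d)\log\theta_{dk}\,d\theta_d = \psi(a_{dk}) - \psi(\sum_{k'} a_{dk'})$, where $\psi$ is the digamma function, and similarly for $q(\phi_k)$. A sketch of the resulting token update of Eq. (15) follows; the function name, toy inputs, and the finite-difference digamma are my assumptions:

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-6):
    # Central difference of log-gamma; accurate enough for this sketch.
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

psi = np.vectorize(digamma)

def update_zeta_token(a_d, b, x_di, zeta_d, i, eta, y_d, sigma2, n_d):
    """One coordinate update of q(z_di = .) following Eq. (15).

    a_d    : parameters of the Dirichlet q(theta_d), shape (K,)
    b      : parameters of the Dirichlets q(phi_k), shape (K, V)
    zeta_d : current responsibilities of document d, shape (n_d, K)
    """
    e_log_theta = psi(a_d) - digamma(a_d.sum())        # E[log theta_dk]
    e_log_phi = psi(b[:, x_di]) - psi(b.sum(axis=1))   # E[log phi_{k, x_di}]
    others = zeta_d.sum(axis=0) - zeta_d[i]            # sum_{i' != i} q(z_di' = l)
    reg = (-eta**2 / (2 * sigma2 * n_d**2)             # regression term of Eq. (15)
           + y_d * eta / (sigma2 * n_d)
           - eta * (eta @ others) / (sigma2 * n_d**2))
    log_q = e_log_theta + e_log_phi + reg
    q = np.exp(log_q - log_q.max())                    # normalize stably
    return q / q.sum()

rng = np.random.default_rng(4)
K, V, n_d = 3, 10, 5
a_d = rng.uniform(0.5, 2.0, size=K)
b = rng.uniform(0.5, 2.0, size=(K, V))
zeta_d = rng.dirichlet(np.ones(K), size=n_d)
q = update_zeta_token(a_d, b, x_di=2, zeta_d=zeta_d, i=0,
                      eta=np.array([1.0, -1.0, 0.5]), y_d=0.3, sigma2=0.25, n_d=n_d)
print(q)
```

Subtracting the maximum before exponentiating avoids overflow; the normalization implements the Lagrange constraint $\sum_k q(z_{di}=k) = 1$.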
6.2 Regression parameters
The parameters $\eta = (\eta_1, \ldots, \eta_K)$ and $\sigma$ appear only in the term $\sum_{d=1}^{M} \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma)$ in Eq. (5) (cf. Eq. (12)). By differentiating it with respect to $\eta_k$, we obtain the following:
$$
\begin{aligned}
&\frac{\partial}{\partial \eta_k} \sum_{d=1}^{M} \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma) \\
&= \sum_{d=1}^{M} \frac{\partial}{\partial \eta_k} \left[ -\frac{1}{2\sigma^2 n_d^2} \sum_{k'=1}^{K} \eta_{k'}^2 \left\{ \sum_{i=1}^{n_d} q(z_{di}=k') \right\}
+ \frac{y_d}{\sigma^2 n_d} \sum_{k'=1}^{K} \eta_{k'} \left\{ \sum_{i=1}^{n_d} q(z_{di}=k') \right\} \right. \\
&\qquad \left. - \frac{1}{2\sigma^2 n_d^2} \sum_{k'=1}^{K} \sum_{l=1}^{K} \eta_{k'} \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k')\, q(z_{di'}=l) \right\} \right] \\
&= -\sum_{d=1}^{M} \frac{\eta_k}{\sigma^2 n_d^2} \sum_{i=1}^{n_d} q(z_{di}=k) + \sum_{d=1}^{M} \frac{y_d}{\sigma^2 n_d} \sum_{i=1}^{n_d} q(z_{di}=k)
- \sum_{d=1}^{M} \frac{\eta_k}{\sigma^2 n_d^2} \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=k) \\
&\qquad - \sum_{d=1}^{M} \frac{1}{\sigma^2 n_d^2} \sum_{l \ne k}^{K} \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l) \right\} \\
&= -\frac{1}{\sigma^2} \left[ \eta_k \sum_{d=1}^{M} \frac{1}{n_d^2} \sum_{i=1}^{n_d} q(z_{di}=k)
+ \sum_{l=1}^{K} \eta_l \sum_{d=1}^{M} \frac{1}{n_d^2} \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l) \right\} \right]
+ \sum_{d=1}^{M} \frac{y_d}{\sigma^2 n_d} \sum_{i=1}^{n_d} q(z_{di}=k)
\end{aligned} \tag{16}
$$
By setting Eq. (16) equal to 0, we obtain the following equation for each $k = 1, \ldots, K$:
$$
\sum_{l=1}^{K} \eta_l \sum_{d=1}^{M} \frac{1}{n_d^2} \sum_{i=1}^{n_d} q(z_{di}=k) \left\{ \Delta(k=l) + \sum_{i' \ne i}^{n_d} q(z_{di'}=l) \right\}
= \sum_{d=1}^{M} \frac{y_d}{n_d} \sum_{i=1}^{n_d} q(z_{di}=k) \tag{17}
$$
Let $q(z_{di}=k)$ be denoted by $\zeta_{dik}$. Then Eq. (17) can be rewritten as follows:
$$
\left[ \sum_{d=1}^{M} \frac{1}{n_d^2} \sum_{i=1}^{n_d} \left\{ \mathrm{diag}(\zeta_{di}) + \sum_{i' \ne i}^{n_d} \zeta_{di} \zeta_{di'}^\top \right\} \right] \eta
= \sum_{d=1}^{M} \frac{y_d}{n_d} \sum_{i=1}^{n_d} \zeta_{di} \tag{18}
$$
Consequently, an estimate of $\eta$ can be obtained as follows:
$$
\eta = \left[ \sum_{d=1}^{M} \frac{1}{n_d^2} \sum_{i=1}^{n_d} \left\{ \mathrm{diag}(\zeta_{di}) + \sum_{i' \ne i}^{n_d} \zeta_{di} \zeta_{di'}^\top \right\} \right]^{-1}
\left( \sum_{d=1}^{M} \frac{y_d}{n_d} \sum_{i=1}^{n_d} \zeta_{di} \right) \tag{19}
$$
We next perform a differentiation with respect to $\sigma$ and obtain the following:
$$
\begin{aligned}
&\frac{\partial}{\partial \sigma} \sum_{d=1}^{M} \sum_{z_d} q(z_d) \log p(y_d \mid \eta^\top \bar{z}_d, \sigma) \\
&= \sum_{d=1}^{M} \frac{\partial}{\partial \sigma} \left[ -\frac{y_d^2}{2\sigma^2} - \frac{1}{2\sigma^2 n_d^2} \sum_{k=1}^{K} \eta_k^2 \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\}
+ \frac{y_d}{\sigma^2 n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\} \right. \\
&\qquad \left. - \frac{1}{2\sigma^2 n_d^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l) \right\} - \frac{1}{2} \log(2\pi\sigma^2) \right] \\
&= \sum_{d=1}^{M} \left[ \frac{y_d^2}{\sigma^3} + \frac{1}{\sigma^3 n_d^2} \sum_{k=1}^{K} \eta_k^2 \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\}
- \frac{2 y_d}{\sigma^3 n_d} \sum_{k=1}^{K} \eta_k \left\{ \sum_{i=1}^{n_d} q(z_{di}=k) \right\} \right. \\
&\qquad \left. + \frac{1}{\sigma^3 n_d^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \left\{ \sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l) \right\} \right] - \frac{M}{\sigma}
\end{aligned} \tag{20}
$$
Therefore, by setting Eq. (20) equal to 0, an estimate of $\sigma$ is obtained as follows:
$$
\sigma^2 = \frac{1}{M} \sum_{d=1}^{M} \left\{ y_d^2 + \sum_{k=1}^{K} \eta_k^2 \frac{\sum_{i=1}^{n_d} q(z_{di}=k)}{n_d^2}
- 2 y_d \sum_{k=1}^{K} \eta_k \frac{\sum_{i=1}^{n_d} q(z_{di}=k)}{n_d}
+ \sum_{k=1}^{K} \sum_{l=1}^{K} \eta_k \eta_l \frac{\sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l)}{n_d^2} \right\} \tag{21}
$$
By setting Eq. (16) equal to 0, we obtain the following equation for each $k = 1, \ldots, K$:
$$
\sum_{l=1}^{K} \eta_l \sum_{d=1}^{M} \frac{\sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l)}{n_d^2}
= -\eta_k \sum_{d=1}^{M} \frac{\sum_{i=1}^{n_d} q(z_{di}=k)}{n_d^2}
+ \sum_{d=1}^{M} \frac{y_d \sum_{i=1}^{n_d} q(z_{di}=k)}{n_d} \tag{22}
$$
Therefore, Eq. (21) can be rewritten as follows:
$$
\begin{aligned}
\sigma^2 &= \frac{1}{M} \left\{ \sum_{d=1}^{M} y_d^2 + \sum_{k=1}^{K} \eta_k^2 \sum_{d=1}^{M} \frac{\sum_{i=1}^{n_d} q(z_{di}=k)}{n_d^2}
- 2 \sum_{k=1}^{K} \eta_k \sum_{d=1}^{M} \frac{y_d \sum_{i=1}^{n_d} q(z_{di}=k)}{n_d} \right. \\
&\qquad \left. + \sum_{k=1}^{K} \eta_k \sum_{l=1}^{K} \eta_l \sum_{d=1}^{M} \frac{\sum_{i=1}^{n_d} \sum_{i' \ne i}^{n_d} q(z_{di}=k)\, q(z_{di'}=l)}{n_d^2} \right\} \\
&= \frac{1}{M} \left[ \sum_{d=1}^{M} y_d^2 + \sum_{k=1}^{K} \eta_k^2 \sum_{d=1}^{M} \frac{\sum_{i=1}^{n_d} q(z_{di}=k)}{n_d^2}
- 2 \sum_{k=1}^{K} \eta_k \sum_{d=1}^{M} \frac{y_d \sum_{i=1}^{n_d} q(z_{di}=k)}{n_d} \right. \\
&\qquad \left. + \sum_{k=1}^{K} \eta_k \left\{ -\eta_k \sum_{d=1}^{M} \frac{\sum_{i=1}^{n_d} q(z_{di}=k)}{n_d^2}
+ \sum_{d=1}^{M} \frac{y_d \sum_{i=1}^{n_d} q(z_{di}=k)}{n_d} \right\} \right] \\
&= \frac{1}{M} \left\{ \sum_{d=1}^{M} y_d^2 - \sum_{k=1}^{K} \eta_k \sum_{d=1}^{M} \frac{y_d \sum_{i=1}^{n_d} q(z_{di}=k)}{n_d} \right\}
\end{aligned} \tag{23}
$$
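The $\sigma^2$ update can also be evaluated directly from Eq. (21), which is a mean of expected squared residuals $E_q[(y_d - \eta^\top \bar{z}_d)^2]$ and hence nonnegative for any $\eta$; the simplified Eq. (23) coincides with it only when $\eta$ satisfies Eq. (17). The sketch below evaluates Eq. (21) on illustrative data (all names and numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
M, K = 4, 3
n = [20, 15, 25, 10]
zeta = [rng.dirichlet(np.ones(K), size=n_d) for n_d in n]
y = rng.normal(size=M)
eta = np.array([1.0, -1.0, 0.5])

# Eq. (21): sigma^2 is the mean over documents of E_q[(y_d - eta^T zbar_d)^2].
total = 0.0
for d in range(M):
    zd, n_d = zeta[d], n[d]
    s = zd.sum(axis=0)                      # sum_i q(z_di = k)
    cross = np.outer(s, s) - zd.T @ zd      # sum_i sum_{i'!=i} zeta_di zeta_di'^T
    total += (y[d] ** 2
              + (eta**2 @ s) / n_d ** 2
              - 2 * y[d] * (eta @ s) / n_d
              + eta @ cross @ eta / n_d ** 2)
sigma2 = total / M
print(sigma2)
```

In an actual inference loop one would alternate the $\zeta$, $\eta$, and $\sigma^2$ updates until the ELBO converges.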
(This document may contain errors.)