A Note on Latent LSTM Allocation
Tomonari MASADA @ Nagasaki University
August 31, 2017
(I am not fully confident in this note.)
1 ELBO
In latent LSTM allocation, the topic assignments $z_d = \{z_{d,1}, \ldots, z_{d,N_d}\}$ for each document $d$ are drawn from categorical distributions whose parameters are obtained as a softmax output of an LSTM. Based on the description of the generative process given in the paper [1], we obtain the full joint distribution as follows:

$$p(\{w_1, \ldots, w_D\}, \{z_1, \ldots, z_D\}, \phi; \mathrm{LSTM}, \beta) = p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM}) \tag{1}$$
We maximize the evidence $p(\{w_1, \ldots, w_D\}; \mathrm{LSTM}, \beta)$, which is obtained as below:

$$\begin{aligned}
p(\{w_1, \ldots, w_D\}; \mathrm{LSTM}, \beta)
&= \int \sum_{\{z_1, \ldots, z_D\}} p(\{w_1, \ldots, w_D\}, \{z_1, \ldots, z_D\}, \phi; \mathrm{LSTM}, \beta) \, d\phi \\
&= \int \sum_{\{z_1, \ldots, z_D\}} p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM}) \, d\phi,
\end{aligned}\tag{2}$$
where

$$p(w_d, z_d \mid \phi; \mathrm{LSTM}) = p(w_d \mid z_d, \phi) \, p(z_d; \mathrm{LSTM}) = \prod_t p(w_{d,t} \mid z_{d,t}, \phi) \, p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}). \tag{3}$$
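The factorization in Eq. (3) can be sketched numerically as a generative process: draw each topic from a softmax conditioned on the history, then draw the word from the chosen topic's word distribution. The snippet below is a toy illustration only: `lstm_softmax` is a hypothetical stand-in for the LSTM (here it ignores its input), and the sizes `K`, `V`, `N_d` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 4, 10, 6  # topics, vocabulary size, document length

# Topic-word distributions phi_k drawn from a symmetric Dirichlet(beta) prior.
beta = 0.1
phi = rng.dirichlet(np.full(V, beta), size=K)  # shape (K, V)

def lstm_softmax(z_history):
    """Toy stand-in for the LSTM: returns a categorical distribution
    over the K topics given z_{d,1:t-1}. A real model would condition
    on z_history; this placeholder does not."""
    logits = rng.standard_normal(K)
    return np.exp(logits) / np.exp(logits).sum()

# Generative process of Eq. (3):
#   z_{d,t} ~ Cat(LSTM(z_{d,1:t-1})),  w_{d,t} ~ Cat(phi_{z_{d,t}})
z, w = [], []
for t in range(N_d):
    theta_t = lstm_softmax(z)
    z_t = rng.choice(K, p=theta_t)
    z.append(z_t)
    w.append(rng.choice(V, p=phi[z_t]))
print(z, w)
```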
Jensen’s inequality gives the following lower bound on the log of the evidence:

$$\begin{aligned}
\log p(\{w_1, \ldots, w_D\}; \mathrm{LSTM}, \beta)
&= \log \int \sum_{Z} p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM}) \, d\phi \\
&= \log \int \sum_{Z} q(Z, \phi) \, \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})}{q(Z, \phi)} \, d\phi \\
&\geq \int \sum_{Z} q(Z, \phi) \log \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})}{q(Z, \phi)} \, d\phi \equiv \mathcal{L}
\end{aligned}\tag{4}$$
Let this lower bound, i.e., the ELBO, be denoted by $\mathcal{L}$. We assume that the variational posterior $q(Z, \phi)$ factorizes as $\prod_k q(\phi_k) \times \prod_d q(z_d)$. The $q(\phi_k)$ are Dirichlet distributions whose parameters are $\xi_k = \{\xi_{k,1}, \ldots, \xi_{k,V}\}$. Then the ELBO $\mathcal{L}$ can be rewritten as below:
$$\begin{aligned}
\mathcal{L} &= \int q(\phi) \log p(\phi; \beta) \, d\phi
+ \sum_d \sum_{z_d} q(z_d) \log p(z_d; \mathrm{LSTM})
+ \sum_d \int \sum_{z_d} q(z_d) \, q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi \\
&\quad - \sum_d \sum_{z_d} q(z_d) \log q(z_d)
- \int q(\phi) \log q(\phi) \, d\phi
\end{aligned}\tag{5}$$
Further we assume that $q(z_d)$ factorizes as $\prod_t q(z_{d,t})$, where the $q(z_{d,t})$ are categorical distributions satisfying $\sum_{k=1}^K q(z_{d,t} = k) = 1$. We let $\gamma_{d,t,k}$ denote $q(z_{d,t} = k)$.
The second term of $\mathcal{L}$ in Eq. (5) can be rewritten as below:

$$\begin{aligned}
\sum_{z_d} q(z_d) \log p(z_d; \mathrm{LSTM})
&= \sum_{z_d} \Big\{ \prod_t q(z_{d,t}) \Big\} \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \\
&= \sum_{z_d} \Big\{ \prod_t q(z_{d,t}) \Big\} \Big\{ \log p(z_{d,1}; \mathrm{LSTM}) + \log p(z_{d,2} \mid z_{d,1}; \mathrm{LSTM}) \\
&\qquad + \log p(z_{d,3} \mid z_{d,1}, z_{d,2}; \mathrm{LSTM}) + \cdots + \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d-1}; \mathrm{LSTM}) \Big\} \\
&= \sum_{z_{d,1}=1}^{K} q(z_{d,1}) \log p(z_{d,1}; \mathrm{LSTM})
+ \sum_{z_{d,1}=1}^{K} \sum_{z_{d,2}=1}^{K} q(z_{d,1}) \, q(z_{d,2}) \log p(z_{d,2} \mid z_{d,1}; \mathrm{LSTM}) \\
&\quad + \cdots + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d}) \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d-1}; \mathrm{LSTM})
\end{aligned}\tag{6}$$
The evaluation of Eq. (6) is intractable. However, for each $t$, the $z_{d,1:t-1}$ in $p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})$ can be regarded as free variables whose values are set by some procedure having nothing to do with the generative model. We obtain the values of the $z_{d,1:t-1}$ by an LSTM forward pass and denote them as $\bar{z}_{d,1:t-1}$. Then we can simplify Eq. (6) as follows:
$$\sum_{z_d} q(z_d) \log p(z_d; \mathrm{LSTM})
\approx \sum_{t=1}^{N_d} \sum_{z_{d,t}=1}^{K} q(z_{d,t}) \log p(z_{d,t} \mid \bar{z}_{d,1:t-1}; \mathrm{LSTM})
= \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \log p(z_{d,t} = k \mid \bar{z}_{d,1:t-1}; \mathrm{LSTM}) \tag{7}$$
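Under this approximation the second term reduces to an expected log-probability under the per-position softmax outputs. A minimal numpy sketch for a single document, with random placeholders standing in for the LSTM outputs $\theta$ and the responsibilities $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(1)
N_d, K = 6, 4  # document length, number of topics

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# theta[t, k] = p(z_{d,t} = k | zbar_{d,1:t-1}; LSTM): random placeholders
# for the softmax outputs collected along one LSTM forward pass.
theta = softmax(rng.standard_normal((N_d, K)))
# gamma[t, k] = q(z_{d,t} = k), the variational responsibilities.
gamma = softmax(rng.standard_normal((N_d, K)))

# Eq. (7): sum_t sum_k gamma_{d,t,k} * log p(z_{d,t} = k | zbar; LSTM)
second_term = (gamma * np.log(theta)).sum()
print(second_term)
```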
The third term of $\mathcal{L}$ in Eq. (5) can be rewritten as below:

$$\begin{aligned}
\sum_d \int \sum_{z_d} q(z_d) \, q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi
&= \sum_d \int q(\phi) \sum_{z_d} q(z_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}} \, d\phi \\
&= \int q(\phi) \sum_d \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log \phi_{k, w_{d,t}} \, d\phi \\
&= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \Big\{ \int q(\phi_k) \log \phi_{k, w_{d,t}} \, d\phi_k \Big\} \\
&= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) \Big\}
\end{aligned}\tag{8}$$
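The Dirichlet expectation in Eq. (8), $\mathbb{E}_q[\log \phi_{k,v}] = \Psi(\xi_{k,v}) - \Psi(\sum_v \xi_{k,v})$, is just a digamma difference. A sketch using `scipy.special.digamma`, with randomly initialized $\xi$, words, and $\gamma$ as stand-ins:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)
K, V = 4, 10
# Variational Dirichlet parameters xi_{k,v} (arbitrary positive values).
xi = rng.gamma(2.0, 1.0, size=(K, V)) + 0.1

# E_q[log phi_{k,v}] = Psi(xi_{k,v}) - Psi(sum_v xi_{k,v})
Elog_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))

# Third term of Eq. (8) for one toy document:
words = rng.integers(0, V, size=6)            # w_{d,t}
gamma = rng.dirichlet(np.ones(K), size=6)     # gamma[t, k]
third_term = (gamma * Elog_phi[:, words].T).sum()
print(third_term)
```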
The first term of $\mathcal{L}$ in Eq. (5) can be rewritten as below:

$$\begin{aligned}
\int q(\phi) \log p(\phi; \beta) \, d\phi
&= \sum_k \int q(\phi_k) \log p(\phi_k; \beta) \, d\phi_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + \sum_k \sum_v (\beta - 1) \int q(\phi_k) \log \phi_{k,v} \, d\phi_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + (\beta - 1) \sum_k \sum_v \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\}
\end{aligned}\tag{9}$$
The fourth term of $\mathcal{L}$ in Eq. (5) can be rewritten as below:

$$\sum_d \sum_{z_d} q(z_d) \log q(z_d) = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log q(z_{d,t} = k) \tag{10}$$
The last term of $\mathcal{L}$ can be rewritten as below:

$$\begin{aligned}
\int q(\phi) \log q(\phi) \, d\phi
&= \sum_k \int q(\phi_k) \log q(\phi_k) \, d\phi_k \\
&= \sum_k \log \Gamma\Big( \sum_v \xi_{k,v} \Big) - \sum_k \sum_v \log \Gamma(\xi_{k,v})
+ \sum_k \sum_v (\xi_{k,v} - 1) \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\}
\end{aligned}\tag{11}$$
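Eqs. (9) and (11) together give the $\phi$-dependent part of the ELBO; their difference is minus the KL divergence from $q(\phi)$ to the Dirichlet prior, so it should never be positive. A numerical sketch under arbitrary $\xi$ and $\beta$:

```python
import numpy as np
from scipy.special import digamma, gammaln

rng = np.random.default_rng(3)
K, V, beta = 4, 10, 0.1
xi = rng.gamma(2.0, 1.0, size=(K, V)) + 0.1  # variational Dirichlet params

Elog_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))

# Eq. (9): E_q[log p(phi; beta)]
term1 = (K * gammaln(V * beta) - K * V * gammaln(beta)
         + (beta - 1.0) * Elog_phi.sum())

# Eq. (11): E_q[log q(phi)] (minus the Dirichlet entropies)
term5 = (gammaln(xi.sum(axis=1)).sum() - gammaln(xi).sum()
         + ((xi - 1.0) * Elog_phi).sum())

# term1 - term5 = -KL(q(phi) || p(phi; beta)) <= 0
print(term1 - term5)
```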
2 Inference
The partial derivative of $\mathcal{L}$ with respect to $\gamma_{d,t,k}$ is

$$\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}}
= \log p(z_{d,t} = k \mid \bar{z}_{d,1:t-1}; \mathrm{LSTM})
+ \Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) \Big\}
- \log \gamma_{d,t,k} + \text{const.} \tag{12}$$
By solving $\partial \mathcal{L} / \partial \gamma_{d,t,k} = 0$ subject to $\sum_k \gamma_{d,t,k} = 1$, we obtain

$$\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \bar{z}_{d,1:t-1}; \mathrm{LSTM}), \tag{13}$$

where $\tilde{\phi}_{k, w_{d,t}} \equiv \exp\big(\Psi(\xi_{k, w_{d,t}})\big) \big/ \exp\big(\Psi(\sum_v \xi_{k,v})\big)$. When $t = 1$, $\gamma_{d,1,k} \propto \tilde{\phi}_{k, w_{d,1}} \, p(z_{d,1} = k; \mathrm{LSTM})$. Therefore, $q(z_{d,1})$ does not depend on the $z_{d,t}$ for $t > 1$, and we can draw a sample from $q(z_{d,1})$ without seeing the $z_{d,t}$ for $t > 1$. When $t = 2$, $\gamma_{d,2,k} \propto \tilde{\phi}_{k, w_{d,2}} \, p(z_{d,2} = k \mid \bar{z}_{d,1}; \mathrm{LSTM})$. That is, $q(z_{d,2})$ depends only on $\bar{z}_{d,1}$. One possible way to determine $\bar{z}_{d,1}$ is to draw a sample from $q(z_{d,1})$, because this drawing can be performed without seeing the $z_{d,t}$ for $t > 1$. For each $t$ such that $t > 2$, we may repeat a similar argument. However, this procedure for determining the $\bar{z}_{d,t}$ is made possible only by the assumption that led to the approximation given in Eq. (7), because we cannot obtain the simple update $\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \bar{z}_{d,1:t-1}; \mathrm{LSTM})$ without this assumption. Moreover, this assumption tells us nothing about how we should sample the $\bar{z}_{d,t}$. For example, we may draw the $\bar{z}_{d,t}$ simply based on the softmax output of the LSTM at each $t$, without using $\tilde{\phi}$. In any case, it is certain that the assumption leading to the approximation given in Eq. (7) provides no answer to the question of why we should use $\tilde{\phi}$ when sampling the $\bar{z}_{d,t}$.
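The update of Eq. (13) is a pointwise product of the $\exp(\Psi(\cdot))$-transformed topic-word weights (the column for the observed word) and the LSTM softmax output, normalized over topics. A sketch with random stand-ins for $\xi$ and the softmax outputs $\theta$:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(4)
K, V, N_d = 4, 10, 6
xi = rng.gamma(2.0, 1.0, size=(K, V)) + 0.1  # variational Dirichlet params
words = rng.integers(0, V, size=N_d)         # observed words w_{d,t}
theta = rng.dirichlet(np.ones(K), size=N_d)  # LSTM softmax outputs per t

# phi_tilde[k, v] = exp(Psi(xi_{k,v})) / exp(Psi(sum_v xi_{k,v}))
phi_tilde = np.exp(digamma(xi) - digamma(xi.sum(axis=1, keepdims=True)))

# Eq. (13): gamma_{d,t,k} proportional to phi_tilde_{k, w_{d,t}} * theta_{t,k}
gamma = phi_tilde[:, words].T * theta        # shape (N_d, K)
gamma /= gamma.sum(axis=1, keepdims=True)    # normalize over topics k
print(gamma)
```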
For $\xi_{k,v}$, we obtain the usual estimate

$$\xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k}.$$
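This estimate just accumulates, for each topic $k$ and word type $v$, the responsibilities of the tokens whose word is $v$, on top of the prior $\beta$. A sketch over a toy corpus with random responsibilities:

```python
import numpy as np

rng = np.random.default_rng(5)
K, V, beta = 4, 10, 0.1
docs = [rng.integers(0, V, size=n) for n in (5, 7, 3)]    # toy w_{d,t}
gammas = [rng.dirichlet(np.ones(K), size=len(d)) for d in docs]

# xi_{k,v} = beta + sum_d sum_{t: w_{d,t} = v} gamma_{d,t,k}
xi = np.full((K, V), beta)
for w_d, gamma_d in zip(docs, gammas):
    for t, v in enumerate(w_d):
        xi[:, v] += gamma_d[t]
print(xi)
```

Since each token's responsibilities sum to one over $k$, the total mass added to $\xi$ equals the number of tokens in the corpus.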
Let $\theta_{d,t,k}$ denote $p(z_{d,t} = k \mid \bar{z}_{d,1:t-1}; \mathrm{LSTM})$, which is a softmax output of the LSTM. The partial derivative of $\mathcal{L}$ with respect to any LSTM parameter is

$$\frac{\partial \mathcal{L}}{\partial \mathrm{LSTM}}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \frac{\partial}{\partial \mathrm{LSTM}} \log \theta_{d,t,k}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial \mathrm{LSTM}}, \tag{14}$$

where $B$ denotes a minibatch of documents.
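Eq. (14) is the gradient of the $\gamma$-weighted log-likelihood $\sum_{d,t,k} \gamma_{d,t,k} \log \theta_{d,t,k}$, so in practice one can feed the $\gamma$ as soft targets to a cross-entropy loss and backpropagate through the LSTM. Taking the gradient with respect to the pre-softmax logits at one position gives $\gamma - \theta$ (since each $\gamma$ row sums to one, the softmax Jacobian collapses); the sketch below checks this against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(6)
N_d, K = 5, 4

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = rng.standard_normal((N_d, K))        # pre-softmax LSTM outputs
gamma = rng.dirichlet(np.ones(K), size=N_d)   # fixed responsibilities

# Objective maximized in Eq. (14): sum_t sum_k gamma_{t,k} log theta_{t,k}
def objective(logits):
    return (gamma * np.log(softmax(logits))).sum()

# Analytic gradient w.r.t. the logits: gamma - theta
grad_analytic = gamma - softmax(logits)

# Finite-difference check of one entry.
eps = 1e-6
e = np.zeros_like(logits); e[0, 0] = eps
grad_fd = (objective(logits + e) - objective(logits - e)) / (2 * eps)
print(grad_analytic[0, 0], grad_fd)
```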
References
[1] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3967–3976, Sydney, Australia, August 2017. PMLR.