
A Note on Latent LSTM Allocation

Tomonari MASADA @ Nagasaki University

August 31, 2017

(I’m not fully confident in this note.)

1 ELBO

In latent LSTM allocation, the topic assignments z_d = \{z_{d,1}, \ldots, z_{d,N_d}\} for each document d are drawn from categorical distributions whose parameters are obtained as softmax outputs of an LSTM. Based on the description of the generative process given in the paper [1], we obtain the full joint distribution as follows:

p(\{w_1, \ldots, w_D\}, \{z_1, \ldots, z_D\}, \phi; \text{LSTM}, \beta) = p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \text{LSTM})    (1)

We maximize the evidence p(\{w_1, \ldots, w_D\}; \text{LSTM}, \beta), which is obtained as below.

p(\{w_1, \ldots, w_D\}; \text{LSTM}, \beta)
  = \int \sum_{\{z_1, \ldots, z_D\}} p(\{w_1, \ldots, w_D\}, \{z_1, \ldots, z_D\}, \phi; \text{LSTM}, \beta) \, d\phi
  = \int \sum_{\{z_1, \ldots, z_D\}} p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \text{LSTM}) \, d\phi,    (2)

where

p(w_d, z_d \mid \phi; \text{LSTM}) = p(w_d \mid z_d, \phi) \, p(z_d; \text{LSTM})
  = \prod_t p(w_{d,t} \mid z_{d,t}, \phi) \, p(z_{d,t} \mid z_{d,1:t-1}; \text{LSTM})    (3)
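For concreteness, the generative process can be sketched in Python as follows. Here lstm_step is a hypothetical stand-in for one step of the LSTM, mapping the previous topic assignment and hidden state to a K-dimensional softmax output and a new state, and phi is the K x V topic-word matrix; neither name comes from the paper [1].

import numpy as np

def generate_document(N_d, phi, lstm_step, h0, rng=np.random.default_rng()):
    # phi       : (K, V) topic-word probabilities, each row sums to one
    # lstm_step : hypothetical (z_prev, h) -> (theta, h_new), theta a K-dim softmax output
    # h0        : initial LSTM state
    K, V = phi.shape
    z, w, h, z_prev = [], [], h0, None
    for t in range(N_d):
        theta, h = lstm_step(z_prev, h)      # p(z_{d,t} | z_{d,1:t-1}; LSTM)
        z_t = rng.choice(K, p=theta)         # draw the topic assignment z_{d,t}
        w_t = rng.choice(V, p=phi[z_t])      # draw the word w_{d,t} from topic z_t
        z.append(z_t)
        w.append(w_t)
        z_prev = z_t
    return w, z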

Jensen’s inequality gives the following lower bound on the log of the evidence:

\log p(\{w_1, \ldots, w_D\}; \text{LSTM}, \beta)
  = \log \int \sum_Z p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \text{LSTM}) \, d\phi
  = \log \int \sum_Z q(Z, \phi) \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \text{LSTM})}{q(Z, \phi)} \, d\phi
  \geq \int \sum_Z q(Z, \phi) \log \frac{p(\phi; \beta) \prod_d p(w_d, z_d \mid \phi; \text{LSTM})}{q(Z, \phi)} \, d\phi \equiv \mathcal{L}    (4)

Let this lower bound, i.e., the ELBO, be denoted by \mathcal{L}. We assume that the variational posterior q(Z, \phi) factorizes as \prod_k q(\phi_k) \times \prod_d q(z_d). The q(\phi_k) are Dirichlet distributions whose parameters are \xi_k = \{\xi_{k,1}, \ldots, \xi_{k,V}\}, where V is the vocabulary size. Then the ELBO \mathcal{L} can be rewritten as below.

\mathcal{L} = \int q(\phi) \log p(\phi; \beta) \, d\phi + \sum_d \sum_{z_d} q(z_d) \log p(z_d; \text{LSTM}) + \sum_d \int \sum_{z_d} q(z_d) q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi
  - \sum_d \sum_{z_d} q(z_d) \log q(z_d) - \int q(\phi) \log q(\phi) \, d\phi    (5)


Further, we assume that q(z_d) factorizes as \prod_t q(z_{d,t}), where the q(z_{d,t}) are categorical distributions satisfying \sum_{k=1}^{K} q(z_{d,t} = k) = 1, with K the number of topics. We let \gamma_{d,t,k} denote q(z_{d,t} = k).

The second term of \mathcal{L} in Eq. (5) can be rewritten as below.

\sum_{z_d} q(z_d) \log p(z_d; \text{LSTM})
  = \sum_{z_d} \Big\{ \prod_t q(z_{d,t}) \Big\} \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \text{LSTM})
  = \sum_{z_d} \Big\{ \prod_t q(z_{d,t}) \Big\} \Big\{ \log p(z_{d,1}; \text{LSTM}) + \log p(z_{d,2} \mid z_{d,1}; \text{LSTM}) + \log p(z_{d,3} \mid z_{d,1}, z_{d,2}; \text{LSTM})
      + \cdots + \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d-1}; \text{LSTM}) \Big\}
  = \sum_{z_{d,1}=1}^{K} q(z_{d,1}) \log p(z_{d,1}; \text{LSTM}) + \sum_{z_{d,1}=1}^{K} \sum_{z_{d,2}=1}^{K} q(z_{d,1}) q(z_{d,2}) \log p(z_{d,2} \mid z_{d,1}; \text{LSTM})
      + \cdots + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d-1}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d-1}) \log p(z_{d,N_d-1} \mid z_{d,1}, \ldots, z_{d,N_d-2}; \text{LSTM})
      + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d}=1}^{K} q(z_{d,1}) \cdots q(z_{d,N_d}) \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d-1}; \text{LSTM})    (6)

The evaluation of Eq. (6) is intractable. However, for each t, the z_{d,1:t-1} in p(z_{d,t} \mid z_{d,1:t-1}; \text{LSTM}) can be regarded as free variables whose values are set by some procedure having nothing to do with the generative model. We obtain the values of the z_{d,1:t-1} by an LSTM forward pass and denote them by \hat{z}_{d,1:t-1}. Then we can simplify Eq. (6) as follows:

\sum_{z_d} q(z_d) \log p(z_d; \text{LSTM})
  = \sum_{t=1}^{N_d} \sum_{z_{d,t}=1}^{K} q(z_{d,t}) \log p(z_{d,t} \mid \hat{z}_{d,1:t-1}; \text{LSTM})
  = \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \text{LSTM})    (7)
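Under this simplification, the second term can be evaluated directly from the variational parameters \gamma_{d,t,k} and the softmax outputs of the LSTM run on the fixed assignments \hat{z}_{d,1:t-1}. A minimal sketch, assuming both are stored as N_d x K arrays for a given document d:

import numpy as np

def second_term(gamma_d, theta_d):
    # gamma_d : (N_d, K) array of q(z_{d,t} = k)
    # theta_d : (N_d, K) array of p(z_{d,t} = k | hat{z}_{d,1:t-1}; LSTM)
    # Returns the right-hand side of Eq. (7) for document d.
    return float(np.sum(gamma_d * np.log(theta_d)))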

The third term of \mathcal{L} in Eq. (5) can be rewritten as below.

\sum_d \int \sum_{z_d} q(z_d) q(\phi) \log p(w_d \mid z_d, \phi) \, d\phi
  = \sum_d \int q(\phi) \sum_{z_d} q(z_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}} \, d\phi
  = \int q(\phi) \sum_d \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log \phi_{k, w_{d,t}} \, d\phi
  = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \Big\{ \int q(\phi_k) \log \phi_{k, w_{d,t}} \, d\phi_k \Big\}
  = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) \Big\}    (8)
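The bracketed quantity in Eq. (8) is the expectation of \log \phi_{k,v} under the Dirichlet q(\phi_k), a standard identity. A minimal sketch, using scipy.special.digamma for \Psi:

import numpy as np
from scipy.special import digamma

def expected_log_phi(xi):
    # xi : (K, V) array of Dirichlet parameters of the q(phi_k)
    # Returns the (K, V) array of Psi(xi_{k,v}) - Psi(sum_v xi_{k,v}).
    return digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))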

The first term of \mathcal{L} in Eq. (5) can be rewritten as below.

\int q(\phi) \log p(\phi; \beta) \, d\phi = \sum_k \int q(\phi_k) \log p(\phi_k; \beta) \, d\phi_k
  = K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + \sum_k \sum_v (\beta - 1) \int q(\phi_k) \log \phi_{k,v} \, d\phi_k
  = K \log \Gamma(V\beta) - KV \log \Gamma(\beta) + (\beta - 1) \sum_k \sum_v \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\}    (9)


The fourth term of \mathcal{L} in Eq. (5) can be rewritten as below.

\sum_d \sum_{z_d} q(z_d) \log q(z_d) = \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log q(z_{d,t} = k)    (10)

The last term of \mathcal{L} can be rewritten as below.

\int q(\phi) \log q(\phi) \, d\phi = \sum_k \int q(\phi_k) \log q(\phi_k) \, d\phi_k
  = \sum_k \log \Gamma\Big( \sum_v \xi_{k,v} \Big) - \sum_k \sum_v \log \Gamma(\xi_{k,v}) + \sum_k \sum_v (\xi_{k,v} - 1) \Big\{ \Psi(\xi_{k,v}) - \Psi\Big( \sum_{v'} \xi_{k,v'} \Big) \Big\}    (11)
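For reference, the two \phi-related terms (9) and (11) depend only on \xi and \beta, so they can be evaluated in a few lines. A minimal sketch, using scipy.special.gammaln for \log \Gamma and digamma for \Psi:

import numpy as np
from scipy.special import digamma, gammaln

def phi_terms(xi, beta):
    # xi : (K, V) Dirichlet parameters, beta : scalar symmetric prior
    # Returns (term9, term11), the values of Eq. (9) and Eq. (11).
    K, V = xi.shape
    elog_phi = digamma(xi) - digamma(xi.sum(axis=1, keepdims=True))
    term9 = K * gammaln(V * beta) - K * V * gammaln(beta) + (beta - 1.0) * elog_phi.sum()
    term11 = (gammaln(xi.sum(axis=1)).sum()
              - gammaln(xi).sum()
              + ((xi - 1.0) * elog_phi).sum())
    return term9, term11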

2 Inference

The partial derivative of \mathcal{L} with respect to \gamma_{d,t,k} is

\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}} = \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \text{LSTM}) + \Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big( \sum_v \xi_{k,v} \Big) \Big\} - \log \gamma_{d,t,k} + \text{const.}    (12)

By solving \frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}} = 0, we obtain

\gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \text{LSTM}),    (13)

where \tilde{\phi}_{k, w_{d,t}} \equiv \frac{\exp(\Psi(\xi_{k, w_{d,t}}))}{\exp(\Psi(\sum_v \xi_{k,v}))}. When t = 1, \gamma_{d,1,k} \propto \tilde{\phi}_{k, w_{d,1}} \, p(z_{d,1} = k; \text{LSTM}). Therefore, q(z_{d,1}) does not depend on the z_{d,t} for t > 1, and we can draw a sample from q(z_{d,1}) without seeing the z_{d,t} for t > 1. When t = 2, \gamma_{d,2,k} \propto \tilde{\phi}_{k, w_{d,2}} \, p(z_{d,2} = k \mid \hat{z}_{d,1}; \text{LSTM}). That is, q(z_{d,2}) depends only on \hat{z}_{d,1}. One possible way to determine \hat{z}_{d,1} is to draw a sample from q(z_{d,1}), because this drawing can be performed without seeing the z_{d,t} for t > 1. For each t such that t > 2, we may repeat a similar argument. However, this procedure for determining the \hat{z}_{d,t} is made possible only by the assumption that led to the approximation given in Eq. (7), because without this assumption we cannot obtain the simple update \gamma_{d,t,k} \propto \tilde{\phi}_{k, w_{d,t}} \, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \text{LSTM}). Moreover, this assumption tells us nothing about how we should sample the \hat{z}_{d,t}. For example, we may draw the \hat{z}_{d,t} simply from the softmax output of the LSTM at each t without using \phi. In any case, the assumption that leads to the approximation given in Eq. (7) provides no answer to the question of why we should use \phi when sampling the \hat{z}_{d,t}.
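A minimal sketch of the resulting per-document procedure, in which \hat{z}_{d,t} is fixed by sampling from q(z_{d,t}) as discussed above. As before, lstm_step is a hypothetical one-step interface of the LSTM, and exp_elog_phi[k, v] stands for \tilde{\phi}_{k,v} = \exp(\Psi(\xi_{k,v}) - \Psi(\sum_{v'} \xi_{k,v'})):

import numpy as np

def update_document(words, exp_elog_phi, lstm_step, h0, rng=np.random.default_rng()):
    # words        : word ids w_{d,1}, ..., w_{d,N_d}
    # exp_elog_phi : (K, V) array of tilde{phi}_{k,v}
    # Returns gamma as an (N_d, K) array and the sampled assignments hat_z.
    K = exp_elog_phi.shape[0]
    gamma, hat_z, h, z_prev = [], [], h0, None
    for w in words:
        theta, h = lstm_step(z_prev, h)   # p(z_{d,t} = k | hat{z}_{d,1:t-1}; LSTM)
        g = exp_elog_phi[:, w] * theta    # Eq. (13), unnormalized
        g = g / g.sum()
        z_t = rng.choice(K, p=g)          # one possible way of fixing hat{z}_{d,t}
        gamma.append(g)
        hat_z.append(z_t)
        z_prev = z_t
    return np.array(gamma), hat_z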

For \xi_{k,v}, we obtain the estimate \xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k} as usual.
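A minimal sketch of this update, where docs[d] is the list of word ids of document d and gammas[d] the corresponding (N_d, K) array of \gamma_{d,t,k}:

import numpy as np

def update_xi(docs, gammas, K, V, beta):
    # Returns the (K, V) array with xi_{k,v} = beta + sum_d sum_{t : w_{d,t} = v} gamma_{d,t,k}.
    xi = np.full((K, V), beta, dtype=float)
    for words, gamma in zip(docs, gammas):
        for t, v in enumerate(words):
            xi[:, v] += gamma[t]
    return xi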

Let \theta_{d,t,k} denote p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \text{LSTM}), which is a softmax output of the LSTM. The partial derivative of \mathcal{L} with respect to any LSTM parameter is

\frac{\partial \mathcal{L}}{\partial \text{LSTM}} = \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \frac{\partial}{\partial \text{LSTM}} \log \theta_{d,t,k} = \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial \text{LSTM}},    (14)

where B denotes the set of documents in the current mini-batch.
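In practice, Eq. (14) is most easily implemented through the softmax logits. Writing a_{d,t,k} for the logits that the LSTM produces at step t, so that \theta_{d,t,k} = \exp(a_{d,t,k}) / \sum_{k'} \exp(a_{d,t,k'}) (the notation a_{d,t,k} is introduced here, not in the paper), the chain rule gives

\frac{\partial}{\partial a_{d,t,k}} \sum_{k'} \gamma_{d,t,k'} \log \theta_{d,t,k'} = \gamma_{d,t,k} - \theta_{d,t,k} \sum_{k'} \gamma_{d,t,k'} = \gamma_{d,t,k} - \theta_{d,t,k},

since \sum_{k'} \gamma_{d,t,k'} = 1. That is, the LSTM is updated as an ordinary softmax classifier trained by backpropagation, with the soft targets \gamma_{d,t,\cdot} in place of one-hot labels.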

References

[1] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
