Applying Dynamic Language Models for Streaming Text to LDA


Tomonari MASADA @ Nagasaki University

August 27, 2014

The evidence is given as follows:

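$$
p(w \mid \alpha, \phi, \gamma, \delta) = \int \sum_{z} \prod_{d=1}^{D} p(\theta_d \mid \gamma) \prod_{t=1}^{T} \prod_{k=1}^{K} p(\beta_{tk} \mid \beta_{1:t-1,k}, \alpha_k, \delta_{1:t-1}, \phi) \prod_{d=1}^{D} p(z_d \mid \theta_d)\, p(w_d \mid z_d, \beta_{t_d})\, d\theta\, d\beta . \tag{1}
$$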

A lower bound of the log of the evidence is obtained as follows based on Jensen’s inequality:

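$$
\begin{aligned}
\ln p(w \mid \alpha, \phi, \gamma, \delta)
&\ge \sum_d \Bigl\{ \int q(\theta_d) \ln p(\theta_d \mid \gamma_d)\, d\theta_d - \int q(\theta_d) \ln q(\theta_d)\, d\theta_d \Bigr\} \\
&\quad + \sum_t \sum_k \sum_v \int q(\beta_{kv}) \ln p(\beta_{tkv} \mid \beta_{1:t-1,kv}, \alpha_k, \delta, \phi_v)\, d\beta_{kv}
 - \sum_t \sum_k \sum_v \int q(\beta_{tkv}) \ln q(\beta_{tkv})\, d\beta_{tkv} \\
&\quad + \sum_d \int \sum_{z_d} q(\beta_{t_d})\, q(z_d) \ln p(w_d \mid z_d, \beta_{t_d})\, d\beta_{t_d}
 + \sum_d \int \sum_{z_d} q(\theta_d)\, q(z_d) \ln p(z_d \mid \theta_d)\, d\theta_d \\
&\quad - \sum_d \sum_{z_d} q(z_d) \ln q(z_d) .
\end{aligned} \tag{2}
$$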

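Let this lower bound be denoted by $L$. We assume that

$$
q(\theta_d) = \frac{\Gamma\bigl(\sum_k \eta_{dk}\bigr)}{\prod_k \Gamma(\eta_{dk})} \prod_k \theta_{dk}^{\eta_{dk}-1} .
$$

Then the first term of $L$ can be rewritten as follows:

$$
\begin{aligned}
&\int q(\theta_d) \ln p(\theta_d \mid \gamma_d)\, d\theta_d - \int q(\theta_d) \ln q(\theta_d)\, d\theta_d \\
&\quad = \ln \Gamma\Bigl(\sum_k \gamma_k\Bigr) - \sum_k \ln \Gamma(\gamma_k) - \ln \Gamma\Bigl(\sum_k \eta_{dk}\Bigr) + \sum_k \ln \Gamma(\eta_{dk})
 + \sum_k (\gamma_k - \eta_{dk}) \Bigl\{ \psi(\eta_{dk}) - \psi\Bigl(\sum_{k'} \eta_{dk'}\Bigr) \Bigr\} .
\end{aligned} \tag{3}
$$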

Define

$$
C_{k:t,s} \equiv \frac{\exp\bigl(\alpha_k^\top f(x_t, x_s)\bigr)}{\sum_{s'=t-c}^{t-1} \exp\bigl(\alpha_k^\top f(x_t, x_{s'})\bigr)} .
$$
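The weights $C_{k:t,s}$ form a softmax over the $c$ preceding time steps, so for fixed $k$ and $t$ they are positive and sum to one. The following is a minimal sketch of that computation; the function name `window_weights` and the layout of `feats` (one row per $s$ in the window, holding the feature vector $f(x_t, x_s)$) are assumptions made for illustration, since the note does not specify $f$ or any data layout.

```python
import math

def window_weights(alpha_k, feats):
    """Softmax weights C_{k:t,s} for s = t-c, ..., t-1 (hypothetical layout).

    alpha_k: list of M floats, the parameter vector alpha_k.
    feats:   list of c rows; each row is the feature vector f(x_t, x_s).
    """
    scores = [sum(a * f for a, f in zip(alpha_k, row)) for row in feats]
    top = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with c = 3 previous time steps and M = 2 features.
alpha_k = [0.5, -1.0]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = window_weights(alpha_k, feats)
```

Subtracting the largest score before exponentiating leaves the ratio unchanged and only guards against overflow.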

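We assume that

$$
q(\beta_{tkv}) = \frac{1}{\sqrt{2\pi\sigma_{tkv}}} \exp\Bigl\{ -\frac{(\beta_{tkv} - \mu_{tkv})^2}{2\sigma_{tkv}} \Bigr\} .
$$

Then the second and the third terms of $L$ can be rewritten as follows:

$$
\begin{aligned}
&\sum_{t=1}^{T} \int q(\beta_{kv}) \ln p(\beta_{tkv} \mid \beta_{1:t-1,kv}, \alpha_k, \delta, \phi_v)\, d\beta_{kv} - \sum_{t=1}^{T} \int q(\beta_{tkv}) \ln q(\beta_{tkv})\, d\beta_{tkv} \\
&\quad = \sum_{t=1}^{T} \int q(\beta_{kv}) \ln \Bigl[ \frac{1}{\sqrt{2\pi\phi_v}} \exp\Bigl\{ -\frac{\bigl(\beta_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\beta_{skv}\bigr)^2}{2\phi_v} \Bigr\} \Bigr] d\beta_{kv}
 + \frac{T}{2} + \frac{T}{2}\ln(2\pi) + \frac{1}{2}\sum_{t=1}^{T} \ln \sigma_{tkv} \\
&\quad = \frac{1}{2}\sum_{t=1}^{T} \ln \frac{\sigma_{tkv}}{\phi_v}
 - \frac{1}{2\phi_v} \sum_{t=1}^{T} \int q(\beta_{kv}) \Bigl( \beta_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\beta_{skv} \Bigr)^2 d\beta_{kv} + \text{const.} \\
&\quad = \frac{1}{2}\sum_{t=1}^{T} \ln \frac{\sigma_{tkv}}{\phi_v}
 - \sum_{t=1}^{T} \frac{\bigl(\mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv}\bigr)^2}{2\phi_v}
 - \sum_{t=1}^{T} \frac{\sigma_{tkv} + \sum_{s=t-c}^{t-1} C_{k:t,s}^2 \sigma_{skv}}{2\phi_v} + \text{const.} ,
\end{aligned} \tag{4}
$$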


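where the last rewrite can be obtained based on the following equation:

$$
\int q(\beta_{tkv})\, \beta_{tkv}^2\, d\beta_{tkv} = \int q(\beta_{tkv}) \bigl\{ (\beta_{tkv} - \mu_{tkv})^2 + 2\beta_{tkv}\mu_{tkv} - \mu_{tkv}^2 \bigr\}\, d\beta_{tkv} = \sigma_{tkv} + \mu_{tkv}^2 . \tag{5}
$$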

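We denote the posterior probability that the word $v$ is assigned to topic $k$ in document $d$ as $\iota_{dvk}$, where $\sum_{k=1}^{K} \iota_{dvk} = 1$ holds. Then the fourth term of $L$ can be rewritten as follows:

$$
\begin{aligned}
&\int \sum_{z_d} q(\beta_{t_d})\, q(z_d) \ln p(w_d \mid z_d, \beta_{t_d})\, d\beta_{t_d} \\
&\quad = \int q(\beta_{t_d}) \sum_{i=1}^{n_d} \sum_{k=1}^{K} \iota_{d w_{di} k} \ln \Bigl\{ \frac{\exp(a_{1:t_d-1,k w_{di}} + \beta_{t_d k w_{di}})}{\sum_{v=1}^{V} \exp(a_{1:t_d-1,kv} + \beta_{t_d kv})} \Bigr\}\, d\beta_{t_d} \\
&\quad = \int q(\beta_{t_d}) \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk} \ln \Bigl\{ \frac{\exp(a_{1:t_d-1,kv} + \beta_{t_d kv})}{\sum_{v'=1}^{V} \exp(a_{1:t_d-1,kv'} + \beta_{t_d kv'})} \Bigr\}\, d\beta_{t_d} \\
&\quad = \int q(\beta_{t_d}) \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk} (a_{1:t_d-1,kv} + \beta_{t_d kv})\, d\beta_{t_d}
 - \int q(\beta_{t_d}) \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk} \ln \Bigl\{ \sum_{v'=1}^{V} \exp(a_{1:t_d-1,kv'} + \beta_{t_d kv'}) \Bigr\}\, d\beta_{t_d} .
\end{aligned} \tag{6}
$$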

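The first term of the RHS of Eq. (6) can be rewritten as follows:

$$
\int q(\beta_{t_d}) \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk} (a_{1:t_d-1,kv} + \beta_{t_d kv})\, d\beta_{t_d}
= \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk}\, a_{1:t_d-1,kv} + \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk}\, \mu_{t_d kv} . \tag{7}
$$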

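We can obtain an upper bound of the second term of the RHS of Eq. (6) by using the inequality $\ln(x) \le -1 + x/\zeta + \ln(\zeta)$, which holds for any $\zeta > 0$, as follows:

$$
\begin{aligned}
&\int q(\beta_{t_d}) \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk} \ln \Bigl\{ \sum_{v'=1}^{V} \exp(a_{1:t_d-1,kv'} + \beta_{t_d kv'}) \Bigr\}\, d\beta_{t_d} \\
&\quad = \sum_{k=1}^{K} \Bigl[ \Bigl( \sum_v n_{dv}\iota_{dvk} \Bigr) \int q(\beta_{t_d k}) \ln \Bigl\{ \sum_{v'=1}^{V} \exp(a_{1:t_d-1,kv'} + \beta_{t_d kv'}) \Bigr\}\, d\beta_{t_d k} \Bigr] \\
&\quad \le \sum_{k=1}^{K} \Bigl[ \Bigl( \sum_v n_{dv}\iota_{dvk} \Bigr) \Bigl\{ -1 + \ln(\zeta_{t_d k}) + \frac{1}{\zeta_{t_d k}} \sum_{v'=1}^{V} \int q(\beta_{t_d k}) \exp(a_{1:t_d-1,kv'} + \beta_{t_d kv'})\, d\beta_{t_d k} \Bigr\} \Bigr] \\
&\quad = -\sum_{v=1}^{V} n_{dv} + \sum_{k=1}^{K} \Bigl[ \Bigl( \sum_v n_{dv}\iota_{dvk} \Bigr) \Bigl\{ \ln(\zeta_{t_d k}) + \frac{1}{\zeta_{t_d k}} \sum_{v} \exp(a_{1:t_d-1,kv}) \exp\Bigl( \mu_{t_d kv} + \frac{\sigma_{t_d kv}}{2} \Bigr) \Bigr\} \Bigr] ,
\end{aligned} \tag{8}
$$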

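where the last step uses $\sum_k \iota_{dvk} = 1$ and the following expectation:

$$
\begin{aligned}
\int q(\beta_{tkv}) \exp(\beta_{tkv})\, d\beta_{tkv}
&= \frac{1}{\sqrt{2\pi\sigma_{tkv}}} \int \exp\Bigl\{ -\frac{(\beta_{tkv} - \mu_{tkv})^2}{2\sigma_{tkv}} \Bigr\} \exp(\beta_{tkv})\, d\beta_{tkv} \\
&= \frac{1}{\sqrt{2\pi\sigma_{tkv}}} \int \exp\Bigl\{ \frac{-(\beta_{tkv} - \mu_{tkv})^2 + 2\sigma_{tkv}\beta_{tkv}}{2\sigma_{tkv}} \Bigr\}\, d\beta_{tkv} \\
&= \frac{1}{\sqrt{2\pi\sigma_{tkv}}} \int \exp\Bigl\{ \frac{-(\beta_{tkv} - \mu_{tkv} - \sigma_{tkv})^2 + 2\mu_{tkv}\sigma_{tkv} + \sigma_{tkv}^2}{2\sigma_{tkv}} \Bigr\}\, d\beta_{tkv} \\
&= \exp\Bigl( \frac{2\mu_{tkv}\sigma_{tkv} + \sigma_{tkv}^2}{2\sigma_{tkv}} \Bigr) \times \frac{1}{\sqrt{2\pi\sigma_{tkv}}} \int \exp\Bigl\{ -\frac{(\beta_{tkv} - \mu_{tkv} - \sigma_{tkv})^2}{2\sigma_{tkv}} \Bigr\}\, d\beta_{tkv} \\
&= \exp\Bigl( \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) .
\end{aligned} \tag{9}
$$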
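Eq. (9) is the mean of a log-normal variable, with $\sigma_{tkv}$ playing the role of a variance. As a sanity check, the sketch below compares a plain trapezoidal estimate of the integral with the closed form. The function name `expected_exp`, the grid size, and the integration width are arbitrary choices made for this illustration.

```python
import math

def expected_exp(mu, sigma, n=20001, width=12.0):
    """Trapezoidal estimate of the integral of q(beta) * exp(beta),
    where q is Gaussian with mean mu and variance sigma."""
    sd = math.sqrt(sigma)
    lo, hi = mu - width * sd, mu + width * sd
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        b = lo + i * h
        q = math.exp(-(b - mu) ** 2 / (2.0 * sigma)) / math.sqrt(2.0 * math.pi * sigma)
        weight = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end points
        total += weight * q * math.exp(b)
    return total * h

mu, sigma = 0.3, 0.8
numeric = expected_exp(mu, sigma)
closed_form = math.exp(mu + sigma / 2.0)   # right-hand side of Eq. (9)
```

The two values should agree to many digits, because the integrand is smooth and negligible at both ends of the grid.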

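The fifth term of $L$ can be rewritten as follows:

$$
\begin{aligned}
\int \sum_{z_d} q(\theta_d)\, q(z_d) \ln p(z_d \mid \theta_d)\, d\theta_d
&= \int q(\theta_d) \sum_{i=1}^{n_d} \sum_{k=1}^{K} \iota_{d w_{di} k} \ln \theta_{dk}\, d\theta_d
 = \int q(\theta_d) \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk} \ln \theta_{dk}\, d\theta_d \\
&= \sum_k \int q(\theta_{dk}) \Bigl( \sum_v n_{dv}\iota_{dvk} \Bigr) \ln \theta_{dk}\, d\theta_{dk}
 = \sum_k \Bigl( \sum_v n_{dv}\iota_{dvk} \Bigr) \Bigl\{ \psi(\gamma_{dk}) - \psi\Bigl(\sum_{k'} \gamma_{dk'}\Bigr) \Bigr\} .
\end{aligned} \tag{10}
$$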


And the sixth term of $L$ can be rewritten as follows:

$$
\sum_{z_d} q(z_d) \ln q(z_d) = \sum_{k=1}^{K} \sum_{v=1}^{V} \iota_{dvk} \ln \iota_{dvk} . \tag{11}
$$

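Consequently, we obtain a lower bound of $L$ as follows:

$$
\begin{aligned}
L &\ge \sum_{d=1}^{N} \Bigl[ \ln \Gamma\Bigl(\sum_k \gamma_k\Bigr) - \sum_{k=1}^{K} \ln \Gamma(\gamma_k) - \ln \Gamma\Bigl(\sum_k \eta_{dk}\Bigr) + \sum_{k=1}^{K} \ln \Gamma(\eta_{dk}) + \sum_{k=1}^{K} (\gamma_k - \eta_{dk}) \Bigl\{ \psi(\eta_{dk}) - \psi\Bigl(\sum_{k'} \eta_{dk'}\Bigr) \Bigr\} \Bigr] \\
&\quad + \frac{1}{2} \sum_{k=1}^{K} \sum_{v=1}^{V} \sum_{t=1}^{T} \ln \frac{\sigma_{tkv}}{\phi_v}
 - \sum_{k=1}^{K} \sum_{v=1}^{V} \sum_{t=1}^{T} \frac{\bigl(\mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv}\bigr)^2}{2\phi_v}
 - \sum_{k=1}^{K} \sum_{v=1}^{V} \sum_{t=1}^{T} \frac{\sigma_{tkv} + \sum_{s=t-c}^{t-1} C_{k:t,s}^2 \sigma_{skv}}{2\phi_v} \\
&\quad + \sum_{d=1}^{N} \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk}\, a_{1:t_d-1,kv} + \sum_{d=1}^{N} \sum_{v=1}^{V} n_{dv} \sum_{k=1}^{K} \iota_{dvk}\, \mu_{t_d kv} \\
&\quad - \sum_{d=1}^{N} \sum_{k=1}^{K} \Bigl[ \Bigl( \sum_{v=1}^{V} n_{dv}\iota_{dvk} \Bigr) \Bigl\{ \ln(\zeta_{t_d k}) + \frac{1}{\zeta_{t_d k}} \sum_{v=1}^{V} \exp(a_{1:t_d-1,kv}) \exp\Bigl( \mu_{t_d kv} + \frac{\sigma_{t_d kv}}{2} \Bigr) \Bigr\} \Bigr] \\
&\quad + \sum_{d=1}^{N} \sum_{k=1}^{K} \Bigl( \sum_{v=1}^{V} n_{dv}\iota_{dvk} \Bigr) \Bigl\{ \psi(\gamma_{dk}) - \psi\Bigl(\sum_{k'} \gamma_{dk'}\Bigr) \Bigr\}
 - \sum_{d=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \iota_{dvk} \ln \iota_{dvk} + \text{const.}
\end{aligned} \tag{12}
$$

From here on, $L$ denotes this newly obtained lower bound.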

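$$
\frac{\partial L}{\partial \zeta_{tk}} = -\frac{1}{\zeta_{tk}} \sum_{\{d : t_d = t\}} \Bigl( \sum_{v=1}^{V} n_{dv}\iota_{dvk} \Bigr)
+ \frac{1}{\zeta_{tk}^2} \Bigl\{ \sum_{\{d : t_d = t\}} \Bigl( \sum_{v=1}^{V} n_{dv}\iota_{dvk} \Bigr) \Bigr\} \sum_{v=1}^{V} \exp(a_{1:t-1,kv}) \exp\Bigl( \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) . \tag{13}
$$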

$\partial L / \partial \zeta_{tk} = 0$ gives the following update:

$$
\zeta_{tk} = \sum_{v=1}^{V} \exp(a_{1:t-1,kv}) \exp\Bigl( \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) .
$$

This can be used in the formulas presented below.
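The update for $\zeta_{tk}$ picks the value that minimizes the right-hand side of $\ln(x) \le -1 + x/\zeta + \ln(\zeta)$ at $x$ equal to the expected sum. The sketch below computes $\zeta_{tk}$ for a single pair $(t, k)$ and evaluates the bound; the function names and the toy numbers are assumptions for illustration only, and `a_prev` stands for the quantities $a_{1:t-1,kv}$, which the note does not define further.

```python
import math

def zeta_update(a_prev, mu, sigma):
    """zeta_tk = sum_v exp(a_{1:t-1,kv}) * exp(mu_tkv + sigma_tkv / 2)
    for one fixed (t, k); all arguments are lists indexed by v."""
    return sum(math.exp(a + m + s / 2.0) for a, m, s in zip(a_prev, mu, sigma))

def log_upper_bound(x, zeta):
    """Right-hand side of ln(x) <= -1 + x / zeta + ln(zeta)."""
    return -1.0 + x / zeta + math.log(zeta)

# Toy values for a vocabulary of size V = 3.
a_prev = [0.1, -0.4, 0.0]
mu = [0.2, 0.0, -0.3]
sigma = [0.5, 1.0, 0.2]
zeta = zeta_update(a_prev, mu, sigma)

# With x fixed to the expected sum, the bound is smallest at zeta = x,
# where it equals ln(x); any other zeta gives a looser bound.
tight = log_upper_bound(zeta, zeta)
looser = [log_upper_bound(zeta, z) for z in (0.5 * zeta, 2.0 * zeta)]
```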

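$$
\begin{aligned}
\frac{\partial L}{\partial \iota_{dvk}} &= n_{dv}\, a_{1:t_d-1,kv} + n_{dv}\, \mu_{t_d kv}
 - n_{dv} \Bigl\{ \ln(\zeta_{t_d k}) + \frac{1}{\zeta_{t_d k}} \sum_{v'=1}^{V} \exp(a_{1:t_d-1,kv'}) \exp\Bigl( \mu_{t_d kv'} + \frac{\sigma_{t_d kv'}}{2} \Bigr) \Bigr\} \\
&\quad + n_{dv} \Bigl\{ \psi(\gamma_{dk}) - \psi\Bigl(\sum_{k'} \gamma_{dk'}\Bigr) \Bigr\} - \ln \iota_{dvk} + \text{const.}
\end{aligned} \tag{14}
$$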

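Therefore,

$$
\begin{aligned}
\iota_{dvk} &\propto \frac{1}{\zeta_{t_d k}} \exp\Bigl\{ a_{1:t_d-1,kv} + \mu_{t_d kv} + \psi(\gamma_{dk}) - \psi\Bigl(\sum_{k'} \gamma_{dk'}\Bigr) - \frac{1}{\zeta_{t_d k}} \sum_{v'=1}^{V} \exp(a_{1:t_d-1,kv'}) \exp\Bigl( \mu_{t_d kv'} + \frac{\sigma_{t_d kv'}}{2} \Bigr) \Bigr\} \\
&\propto \frac{1}{\zeta_{t_d k}} \exp(a_{1:t_d-1,kv} + \mu_{t_d kv}) \exp\Bigl\{ \psi(\gamma_{dk}) - \psi\Bigl(\sum_{k'} \gamma_{dk'}\Bigr) \Bigr\} .
\end{aligned} \tag{15}
$$

The second proportionality holds because, once $\zeta_{t_d k}$ is set by the update given above, the last term inside the exponent equals 1 for every $k$.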
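The update in Eq. (15) can be sketched as a normalization over topics in log space. Everything about the interface below is an assumption made for illustration: the function names, the per-topic list layout, and the hand-written `digamma` (a recurrence plus a standard asymptotic series, used only because the Python standard library has no digamma function). The argument `gamma_d` holds the parameters written $\gamma_{dk}$ in Eq. (15).

```python
import math

def digamma(x):
    """Digamma psi(x) for x > 0: shift x upward with psi(x) = psi(x+1) - 1/x,
    then apply the asymptotic series in 1/x."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def iota_update(a_prev, mu, zeta, gamma_d):
    """iota_dvk over topics k for one document d and one word v.
    a_prev[k], mu[k], zeta[k] are the (t_d, k, v) entries of Eq. (15)."""
    psi_sum = digamma(sum(gamma_d))
    logits = [a + m - math.log(z) + digamma(g) - psi_sum
              for a, m, z, g in zip(a_prev, mu, zeta, gamma_d)]
    top = max(logits)                       # normalize in a stable way
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example with K = 2 topics.
iota = iota_update([0.0, 0.0], [0.1, -0.2], [3.0, 3.0], [1.0, 2.0])
```

Since $\psi(\sum_{k'} \gamma_{dk'})$ does not depend on $k$, it cancels in the normalization and is kept here only to mirror the formula.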

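$$
\begin{aligned}
\frac{\partial L}{\partial \mu_{tkv}} &= -\frac{1}{\phi_v} \Bigl\{ \mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv}
 + \sum_{u=t+1}^{\min(T,t+c)} \Bigl( -C_{k:u,t}\mu_{ukv} + C_{k:u,t}^2 \mu_{tkv} + \sum_{s=u-c}^{t-1} C_{k:u,t} C_{k:u,s} \mu_{skv} \Bigr) \Bigr\} \\
&\quad + \sum_{\{d : t_d = t\}} n_{dv}\iota_{dvk}
 - \frac{1}{\zeta_{tk}} \Bigl( \sum_{\{d : t_d = t\}} \sum_{v'=1}^{V} n_{dv'}\iota_{dv'k} \Bigr) \exp\Bigl( a_{1:t-1,kv} + \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) .
\end{aligned} \tag{16}
$$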


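However, when we update $\mu_{tkv}$ only at timestep $t$,

$$
\frac{\partial L}{\partial \mu_{tkv}} = -\frac{1}{\phi_v} \Bigl( \mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv} \Bigr)
+ \sum_{\{d : t_d = t\}} n_{dv}\iota_{dvk}
- \frac{1}{\zeta_{tk}} \Bigl( \sum_{\{d : t_d = t\}} \sum_{v'=1}^{V} n_{dv'}\iota_{dv'k} \Bigr) \exp\Bigl( a_{1:t-1,kv} + \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) , \tag{17}
$$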

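where we can regard $\mu_{skv}$ for $s < t$ as a constant, because $\mu_{skv}$ is not updated at any timestep $t > s$. $\partial L / \partial \mu_{tkv} = 0$ gives the following equation:

$$
0 = \mu_{tkv} + \frac{n_{tk}\, \phi_v \exp(a_{1:t-1,kv} + \sigma_{tkv}/2)}{\zeta_{tk}} \exp(\mu_{tkv}) - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv} - \phi_v n_{tvk} , \tag{18}
$$

where $n_{tvk} \equiv \sum_{\{d : t_d = t\}} n_{dv}\iota_{dvk}$ and $n_{tk} \equiv \sum_v n_{tvk}$. The RHS of this equation has the form $f(x) = x + Ae^x - B$ with $A > 0$, and $f'(x) = 1 + Ae^x > 0$. Therefore $f$ is strictly increasing, and $f(x) = 0$ has a unique root that can be found by the bisection method. For example, initialize $x$ to $B - A$: since $f(B - A) = A(e^{B-A} - 1)$, this point satisfies $f(B - A) > 0$ whenever $B > A$.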
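A minimal bisection sketch for $f(x) = x + Ae^x - B$ follows. In the notation of Eq. (18), $A = n_{tk}\phi_v \exp(a_{1:t-1,kv} + \sigma_{tkv}/2)/\zeta_{tk}$ and $B$ collects the remaining terms that do not depend on $\mu_{tkv}$. The bracketing strategy is an assumption of this sketch rather than the initialization suggested in the text: it uses $x = B$ as the upper end, where $f(B) = Ae^B > 0$ for any $A > 0$, and walks left until $f$ changes sign.

```python
import math

def solve_mu(A, B, tol=1e-12, max_iter=200):
    """Root of f(x) = x + A * exp(x) - B for A > 0, by bisection.
    f is strictly increasing, so the root is unique."""
    def f(x):
        return x + A * math.exp(x) - B

    hi = B                 # f(B) = A * exp(B) > 0
    lo = B - 1.0
    step = 1.0
    while f(lo) > 0.0:     # f(x) -> -infinity as x -> -infinity
        step *= 2.0
        lo -= step

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

x = solve_mu(A=2.0, B=1.5)
```

For very large $B$ the evaluation of $e^B$ can overflow, so a practical implementation would need a more careful upper end; this sketch ignores that case.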

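$$
\frac{\partial L}{\partial \sigma_{tkv}} = \frac{1}{2\sigma_{tkv}} - \frac{1 + \sum_{u=t+1}^{\min(T,t+c)} C_{k:u,t}^2}{2\phi_v} - \frac{n_{tk}}{2\zeta_{tk}} \exp\Bigl( a_{1:t-1,kv} + \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) . \tag{19}
$$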

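However, when we update $\sigma_{tkv}$ only at timestep $t$,

$$
\frac{\partial L}{\partial \sigma_{tkv}} = \frac{1}{2\sigma_{tkv}} - \frac{1}{2\phi_v} - \frac{n_{tk}}{2\zeta_{tk}} \exp\Bigl( a_{1:t-1,kv} + \mu_{tkv} + \frac{\sigma_{tkv}}{2} \Bigr) . \tag{20}
$$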

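The RHS has the form $f(x) = \frac{1}{2x} - Ae^{x/2} - B$ with $A \ge 0$ and $B > 0$, and $f'(x) = -\frac{1}{2x^2} - \frac{A}{2}e^{x/2} < 0$. Since $f(x) \to +\infty$ as $x \to 0^+$ and $f(x) \to -\infty$ as $x \to \infty$, $f(x) = 0$ has a unique positive root, which can be found by the bisection method. For example, initialize $x$ to $\frac{1}{2B}$: since $f\bigl(\frac{1}{2B}\bigr) = -Ae^{1/(4B)} \le 0$, the root lies in $\bigl(0, \frac{1}{2B}\bigr]$.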
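The same idea can be sketched for $f(x) = \frac{1}{2x} - Ae^{x/2} - B$. In the notation of Eq. (20), $B = 1/(2\phi_v)$ and $A = \frac{n_{tk}}{2\zeta_{tk}}\exp(a_{1:t-1,kv} + \mu_{tkv})$. The bracket $(0, \frac{1}{2B}]$ used below follows from $f(\frac{1}{2B}) = -Ae^{1/(4B)} \le 0$; the function name and tolerances are arbitrary choices for this illustration.

```python
import math

def solve_sigma(A, B, tol=1e-14, max_iter=200):
    """Positive root of f(x) = 1/(2x) - A * exp(x/2) - B for A >= 0, B > 0.
    f is strictly decreasing on (0, infinity)."""
    def f(x):
        return 1.0 / (2.0 * x) - A * math.exp(x / 2.0) - B

    lo = 0.0                  # f -> +infinity as x -> 0+ (never evaluated at 0)
    hi = 1.0 / (2.0 * B)      # f(hi) = -A * exp(hi / 2) <= 0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

s = solve_sigma(A=1.0, B=2.0)
```

Because the upper end equals $\phi_v$ in the notation of Eq. (20), the resulting variance never exceeds $\phi_v$ under this update.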

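$$
\frac{\partial L}{\partial \phi_v} = -\frac{KT}{2\phi_v}
+ \sum_{k=1}^{K} \sum_{t=1}^{T} \frac{\bigl(\mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv}\bigr)^2}{2\phi_v^2}
+ \sum_{k=1}^{K} \sum_{t=1}^{T} \frac{\sigma_{tkv} + \sum_{s=t-c}^{t-1} C_{k:t,s}^2 \sigma_{skv}}{2\phi_v^2} . \tag{21}
$$

$\partial L / \partial \phi_v = 0$ gives the following formula:

$$
\phi_v = \frac{1}{KT} \sum_{k=1}^{K} \sum_{t=1}^{T} \Bigl( \mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv} \Bigr)^2
+ \frac{1}{KT} \sum_{k=1}^{K} \sum_{t=1}^{T} \Bigl( \sigma_{tkv} + \sum_{s=t-c}^{t-1} C_{k:t,s}^2 \sigma_{skv} \Bigr) . \tag{22}
$$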
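The closed-form update for $\phi_v$ averages, over all topics and time steps, the squared prediction residual of the means plus the accumulated variance. The sketch below computes it for a single word $v$. The data layout is hypothetical: `C[k][t]` is a dictionary mapping each available previous step `s` to the weight $C_{k:t,s}$, which also sidesteps the boundary case $t \le c$ that the note leaves unspecified. The variance term uses squared weights, matching the corresponding term of Eq. (12).

```python
def phi_update(mu, sigma, C):
    """phi_v for one word v.

    mu[k][t], sigma[k][t]: variational mean and variance of beta_tkv.
    C[k][t]: dict {s: C_{k:t,s}} over the previous steps in the window
             (hypothetical layout; empty when there is no history).
    """
    K, T = len(mu), len(mu[0])
    total = 0.0
    for k in range(K):
        for t in range(T):
            pred = sum(w * mu[k][s] for s, w in C[k][t].items())
            var = sigma[k][t] + sum(w * w * sigma[k][s] for s, w in C[k][t].items())
            total += (mu[k][t] - pred) ** 2 + var
    return total / (K * T)

# Toy example: K = 1 topic, T = 2 time steps, window of size 1.
mu = [[0.0, 1.0]]
sigma = [[0.5, 0.5]]
C = [[{}, {0: 1.0}]]
phi = phi_update(mu, sigma, C)
```

In the toy example the two time steps contribute 0.5 and 2.0, so the update returns their average.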

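$$
\frac{\partial L}{\partial \alpha_{km}} = \sum_{v=1}^{V} \frac{1}{\phi_v} \sum_{t=1}^{T} \Bigl( \sum_{s=t-c}^{t-1} \frac{\partial C_{k:t,s}}{\partial \alpha_{km}} \mu_{skv} \Bigr) \Bigl( \mu_{tkv} - \sum_{s=t-c}^{t-1} C_{k:t,s}\mu_{skv} \Bigr)
- \sum_{v=1}^{V} \frac{1}{\phi_v} \sum_{t=1}^{T} \sum_{s=t-c}^{t-1} \frac{\partial C_{k:t,s}}{\partial \alpha_{km}} C_{k:t,s} \sigma_{skv} , \tag{23}
$$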

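where

$$
\begin{aligned}
\frac{\partial C_{k:t,s}}{\partial \alpha_{km}}
&= \frac{\partial}{\partial \alpha_{km}} \frac{\exp(\alpha_k^\top f(x_t, x_s))}{\sum_{s'=t-c}^{t-1} \exp(\alpha_k^\top f(x_t, x_{s'}))} \\
&= \frac{f_m(x_t, x_s) \exp(\alpha_k^\top f(x_t, x_s))}{\sum_{s'=t-c}^{t-1} \exp(\alpha_k^\top f(x_t, x_{s'}))}
 - \frac{\exp(\alpha_k^\top f(x_t, x_s)) \sum_{s'=t-c}^{t-1} f_m(x_t, x_{s'}) \exp(\alpha_k^\top f(x_t, x_{s'}))}{\bigl\{ \sum_{s'=t-c}^{t-1} \exp(\alpha_k^\top f(x_t, x_{s'})) \bigr\}^2} \\
&= C_{k:t,s} f_m(x_t, x_s) - C_{k:t,s} \frac{\sum_{s'=t-c}^{t-1} f_m(x_t, x_{s'}) \exp(\alpha_k^\top f(x_t, x_{s'}))}{\sum_{s'=t-c}^{t-1} \exp(\alpha_k^\top f(x_t, x_{s'}))} .
\end{aligned} \tag{24}
$$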
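The last line of Eq. (24) is the usual softmax gradient: the ratio in the second term equals $\sum_{s'} C_{k:t,s'} f_m(x_t, x_{s'})$, the feature average under the weights. The sketch below implements that form and checks it against central finite differences. The function names and the feature layout (one row of `feats` per $s$ in the window) are assumptions for illustration.

```python
import math

def weights(alpha, feats):
    """Softmax weights C_{k:t,s} over the rows of feats."""
    scores = [sum(a * f for a, f in zip(alpha, row)) for row in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dC_dalpha(alpha, feats, m):
    """Eq. (24): dC_s / dalpha_m = C_s * (f_m(s) - sum_s' C_s' * f_m(s'))."""
    C = weights(alpha, feats)
    mean_f = sum(c * row[m] for c, row in zip(C, feats))
    return [c * (row[m] - mean_f) for c, row in zip(C, feats)]

def finite_difference(alpha, feats, m, eps=1e-6):
    """Central-difference approximation of the same derivative."""
    up = list(alpha)
    down = list(alpha)
    up[m] += eps
    down[m] -= eps
    return [(u - d) / (2.0 * eps)
            for u, d in zip(weights(up, feats), weights(down, feats))]

alpha = [0.5, -1.0]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
analytic = dC_dalpha(alpha, feats, m=0)
numeric = finite_difference(alpha, feats, m=0)
```

Because the weights sum to one, the entries of `analytic` sum to zero, which gives a second quick consistency check.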
