Variational Inference for LDA

Zhao Zhou

The Hong Kong University of Science and Technology

Source slides: lzhang/teach/6931a/slides/lda-zhou.pdf

Outline

1 Generative Process of LDA

2 Exponential Family

3 Newton Method

4 Variational Inference (E-Step, M-Step)

5 Conclusion

Notations and terminology

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \ldots, V\}$.

A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the $n$th word in the sequence.

A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.

Latent Dirichlet allocation

LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$:

Choose $N \sim \mathrm{Poisson}(\xi)$.

Choose $\theta \sim \mathrm{Dir}(\alpha)$.

For each of the $N$ words $w_n$:

  Choose a topic $z_n \sim \mathrm{Mult}(\theta)$.

  Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$.

Note that

The dimension $K$ of the Dirichlet distribution (and thus of the topic variable $z$) is known and fixed.

The word probabilities are parameterized by a $K \times V$ matrix $\beta$, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$.

The randomness of $N$ is ignored in subsequent slides.
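
The generative process can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from the slides: the toy values of `alpha`, `beta` and `xi`, and the helper names `sample_dirichlet`, `sample_poisson` and `generate_document`, are assumptions made for the example.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(xi, rng):
    """Draw N ~ Poisson(xi) with Knuth's multiplication method (fine for small xi)."""
    threshold = math.exp(-xi)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def generate_document(alpha, beta, xi, rng):
    """Sample one document; returns (theta, topic indices z, word indices w)."""
    n_words = sample_poisson(xi, rng)
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]      # z_n ~ Mult(theta)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy model: K = 2 topics over a vocabulary of V = 4 words.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.05, 0.05],
        [0.05, 0.05, 0.2, 0.7]]
theta, topics, words = generate_document(alpha, beta, xi=8.0, rng=rng)
print(theta, topics, words)
```

Words are stored as vocabulary indices rather than one-hot vectors, so `words[n] == j` corresponds to $w_n^j = 1$ in the notation above.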

Latent Dirichlet allocation

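Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

The marginal distribution of a document is:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta.$$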

Exponential family

An exponential family distribution has the form

$$p(x \mid \eta) = h(x) \exp\{\eta^T t(x) - a(\eta)\}$$

The different parts of this equation are

The natural parameter $\eta$

The sufficient statistic $t(x)$

The underlying measure $h(x)$

The log normalizer $a(\eta)$

$$a(\eta) = \log \int h(x) \exp\{\eta^T t(x)\}\, dx$$

First Moment

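The derivatives of the log normalizer give the moments of the sufficient statistics:

$$\begin{aligned}
\frac{d}{d\eta} a(\eta) &= \frac{d}{d\eta} \left( \log \int \exp\{\eta^T t(x)\}\, h(x)\, dx \right) \\
&= \frac{\int t(x) \exp\{\eta^T t(x)\}\, h(x)\, dx}{\int \exp\{\eta^T t(x)\}\, h(x)\, dx} \\
&= \int t(x) \exp\{\eta^T t(x) - a(\eta)\}\, h(x)\, dx \\
&= E[t(X)]
\end{aligned}$$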
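
As a quick sanity check of this identity, consider the Poisson distribution with rate $\lambda$. This worked example is not part of the slides; it is added only to illustrate the first-moment result on a familiar case (the integral becomes a sum over $x = 0, 1, 2, \ldots$).

```latex
p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}
                  = \frac{1}{x!} \exp\{ x \log \lambda - \lambda \}
\quad\Longrightarrow\quad
\eta = \log \lambda,\quad t(x) = x,\quad h(x) = \frac{1}{x!},\quad a(\eta) = e^{\eta}

\frac{d}{d\eta} a(\eta) = e^{\eta} = \lambda = E[X]
```

The derivative of the log normalizer recovers the mean of the sufficient statistic $t(X) = X$, as the derivation predicts.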

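Computing $E[\log \theta \mid \alpha]$

The Dirichlet distribution $p(\theta \mid \alpha)$:

$$\begin{aligned}
p(\theta \mid \alpha) &= \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \\
&= \exp\left\{ \sum_{i=1}^{K} (\alpha_i - 1) \log \theta_i + \log \Gamma\left(\sum_{i=1}^{K} \alpha_i\right) - \sum_{i=1}^{K} \log \Gamma(\alpha_i) \right\}
\end{aligned}$$

Sufficient statistics: $\log \theta_i$.

Log normalizer: $a(\alpha) = \sum_{i=1}^{K} \log \Gamma(\alpha_i) - \log \Gamma\left(\sum_{i=1}^{K} \alpha_i\right)$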

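Computing $E[\log \theta \mid \alpha]$

The expectation $E[\log \theta_i \mid \alpha]$ is:

$$\begin{aligned}
E[\log \theta_i \mid \alpha] &= \frac{\partial a(\alpha)}{\partial \alpha_i} = \frac{\partial}{\partial \alpha_i} \left( \sum_{j=1}^{K} \log \Gamma(\alpha_j) - \log \Gamma\left(\sum_{j=1}^{K} \alpha_j\right) \right) \\
&= \psi(\alpha_i) - \psi\left(\sum_{j=1}^{K} \alpha_j\right),
\end{aligned}$$

where $\psi$ is the digamma function, the first derivative of the log Gamma function.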
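
The closed form can be checked numerically by comparing it with a Monte Carlo average of $\log \theta_i$ over Dirichlet samples. The sketch below uses only the Python standard library; the `digamma` helper is a hand-written approximation (recurrence plus asymptotic series) introduced for this example, and the values of `alpha` and `n_samples` are arbitrary. In practice a library routine such as `scipy.special.digamma` would normally be used instead.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
alpha = [2.0, 3.0, 5.0]      # arbitrary example parameters
n_samples = 20000

# Monte Carlo estimate of E[log theta_i | alpha].
mc = [0.0] * len(alpha)
for _ in range(n_samples):
    theta = sample_dirichlet(alpha, rng)
    for i, t in enumerate(theta):
        mc[i] += math.log(t) / n_samples

# Closed form: psi(alpha_i) - psi(sum_j alpha_j).
exact = [digamma(a) - digamma(sum(alpha)) for a in alpha]
for i in range(len(alpha)):
    print(f"i={i}: Monte Carlo {mc[i]:.4f}, closed form {exact[i]:.4f}")
```

The two columns should agree up to sampling noise, which shrinks as `n_samples` grows.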

Unconstrained minimization

Suppose $f$ is convex and twice continuously differentiable.

Assume the optimal value $p^* = \inf_x f(x)$ is attained.

Minimization methods can be interpreted as iterative methods for solving the optimality condition

$$\nabla f(x^*) = 0.$$

Newton Step

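Newton step:

$$\Delta x_{nt} = -\nabla^2 f(x)^{-1} \nabla f(x).$$

Interpretations:

$x + \Delta x_{nt}$ minimizes the second-order approximation

$$\hat{f}(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x)\, v$$

$x + \Delta x_{nt}$ solves the linearized optimality condition

$$\nabla f(x + v) \approx \nabla \hat{f}(x + v) = \nabla f(x) + \nabla^2 f(x)\, v = 0.$$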

Newton decrement

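A measure of the proximity of $x$ to $x^*$:

$$\lambda(x) = \left( \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \right)^{1/2}$$

It equals the norm of the Newton step in the quadratic norm defined by the Hessian:

$$\lambda(x) = \left( \Delta x_{nt}^T \nabla^2 f(x)\, \Delta x_{nt} \right)^{1/2}$$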

Backtracking Line Search

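Exact line search:

$$t = \arg\min_{t > 0} f(x + t \Delta x)$$

Backtracking line search (with parameters $\alpha \in (0, 1/2)$ and $\beta \in (0, 1)$; these are line-search constants, unrelated to the LDA parameters $\alpha$ and $\beta$):

Starting at $t = 1$, repeat $t := \beta t$ until

$$f(x + t \Delta x) < f(x) + \alpha t \nabla f(x)^T \Delta x.$$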

Newton Method

Repeat

  Compute the Newton step and decrement.

  $$\Delta x_{nt} = -\nabla^2 f(x)^{-1} \nabla f(x)$$

  $$\lambda^2 = \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)$$

  Stopping criterion. Quit if $\lambda^2 / 2 \le \epsilon$.

  Line search. Choose step size $t$ by backtracking line search.

  Update. $x := x + t \Delta x_{nt}$.
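
The iteration can be illustrated with a one-dimensional sketch in Python. To stay within the standard library it is simplified to the scalar case, so the Hessian inverse becomes a division. The test function $f(x) = \log(e^x + e^{-x})$ and the constants `a`, `b` and `eps` are choices made for this example only. From the starting point $x = 3$ a full Newton step overshoots badly, so the backtracking loop is exercised.

```python
import math

def f(x):
    """f(x) = log(e^x + e^-x), convex with minimizer x* = 0 (written in a stable form)."""
    return abs(x) + math.log1p(math.exp(-2.0 * abs(x)))

def grad(x):
    """First derivative of f."""
    return math.tanh(x)

def hess(x):
    """Second derivative of f."""
    return 1.0 - math.tanh(x) ** 2

def newton(x, eps=1e-10, a=0.25, b=0.5, max_iter=50):
    """Newton's method with backtracking; a and b are the line-search parameters."""
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        step = -g / h              # Newton step
        lam2 = g * g / h           # squared Newton decrement
        if lam2 / 2.0 <= eps:      # stopping criterion
            break
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds.
        while f(x + t * step) >= f(x) + a * t * g * step:
            t *= b
        x = x + t * step           # update
    return x

x_star = newton(3.0)
print(x_star)
```

The vector case replaces `-g / h` by the solution of the linear system $\nabla^2 f(x)\, \Delta x_{nt} = -\nabla f(x)$.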

Inference

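The posterior distribution of the hidden variables:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

This distribution is intractable to compute, since

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{j} \alpha_j\right)}{\prod_{j} \Gamma(\alpha_j)} \int \left( \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{K} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$$

cannot be evaluated in closed form due to the coupling between $\theta$ and $\beta$.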

Variational distribution

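The variational distribution on the latent variables:

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n).$$

The values of $\gamma$ and $\phi$ are determined by an optimization problem with respect to the KL-Divergence $D$:

$$(\gamma^*, \phi^*) = \arg\min_{\gamma, \phi} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$$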

KL-Divergence

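Now, we denote $q(\theta, \mathbf{z} \mid \gamma, \phi)$ by $q$.

The KL-Divergence between $q$ and $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$ is

$$\begin{aligned}
D(q \,\|\, p) &= E_q[\log q] - E_q[\log p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)] \\
&= E_q[\log q] - E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] + \log p(\mathbf{w} \mid \alpha, \beta)
\end{aligned}$$

Using Jensen's inequality, we bound $\log p(\mathbf{w} \mid \alpha, \beta)$ from below:

$$\begin{aligned}
\log p(\mathbf{w} \mid \alpha, \beta) &= \log \int \sum_{\mathbf{z}} p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, d\theta \\
&= \log \int \sum_{\mathbf{z}} \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\, q(\theta, \mathbf{z})}{q(\theta, \mathbf{z})}\, d\theta \\
&\ge \int \sum_{\mathbf{z}} q(\theta, \mathbf{z}) \log \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{q(\theta, \mathbf{z})}\, d\theta \\
&= E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})].
\end{aligned}$$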

KL-Divergence

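We denote $E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})]$ by $L(\gamma, \phi; \alpha, \beta)$.

Then we have

$$\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big).$$

Because the left-hand side does not depend on $\gamma$ or $\phi$, maximizing the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to $\gamma$ and $\phi$ is equivalent to minimizing the KL-Divergence between the variational posterior and the true posterior.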
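
The decomposition of the log evidence into lower bound plus KL-Divergence holds for any model, so it can be verified on a model small enough to enumerate. The sketch below is not LDA: it assumes a hypothetical toy model with a single discrete latent variable `z` (three states) and one fixed observation, with made-up probabilities, purely to make the identity concrete.

```python
import math

# Hypothetical toy model: one discrete latent z with three states, one fixed observation w.
prior = [0.5, 0.3, 0.2]            # p(z)
likelihood = [0.10, 0.40, 0.70]    # p(w | z) for the observed w
joint = [p * l for p, l in zip(prior, likelihood)]   # p(z, w)
evidence = sum(joint)                                # p(w)
posterior = [j / evidence for j in joint]            # p(z | w)

def elbo(q):
    """Lower bound L(q) = E_q[log p(z, w)] - E_q[log q(z)]."""
    return sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

def kl(q, p):
    """KL-Divergence D(q || p) = E_q[log q(z)] - E_q[log p(z)]."""
    return sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, p))

q = [0.2, 0.5, 0.3]                # an arbitrary variational distribution
print("log p(w)        =", math.log(evidence))
print("L(q) + D(q||p)  =", elbo(q) + kl(q, posterior))
print("L(q)            =", elbo(q))
```

Setting `q` equal to `posterior` drives the KL term to zero, at which point the bound equals the log evidence.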

Variational Inference

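Expand $L(\gamma, \phi; \alpha, \beta)$ using the factorizations of $p$ and $q$:

$$\begin{aligned}
L(\gamma, \phi; \alpha, \beta) &= E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] \\
&= E_q[\log p(\theta \mid \alpha)] + E_q[\log p(\mathbf{z} \mid \theta)] + E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] \\
&\quad - E_q[\log q(\theta)] - E_q[\log q(\mathbf{z})]
\end{aligned}$$

The five terms are computed in turn.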

Computing Eq[log p(θ|α)]

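$E_q[\log p(\theta \mid \alpha)]$ is given by

$$E_q[\log p(\theta \mid \alpha)] = \sum_{i=1}^{K} (\alpha_i - 1)\, E_q[\log \theta_i] + \log \Gamma\left(\sum_{i=1}^{K} \alpha_i\right) - \sum_{i=1}^{K} \log \Gamma(\alpha_i).$$

Under $q$, $\theta$ is generated by $\mathrm{Dir}(\theta \mid \gamma)$, so $E_q[\log \theta_i] = \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right)$.

Then we have:

$$E_q[\log p(\theta \mid \alpha)] = \sum_{i=1}^{K} (\alpha_i - 1) \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right) + \log \Gamma\left(\sum_{i=1}^{K} \alpha_i\right) - \sum_{i=1}^{K} \log \Gamma(\alpha_i).$$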

Computing Eq[log p(z |θ)]

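$E_q[\log p(\mathbf{z} \mid \theta)]$ is given by

$$\begin{aligned}
E_q[\log p(\mathbf{z} \mid \theta)] &= E_q\left[ \sum_{n=1}^{N} \sum_{i=1}^{K} z_{ni} \log \theta_i \right] \\
&= \sum_{n=1}^{N} \sum_{i=1}^{K} E_q[z_{ni}]\, E_q[\log \theta_i] \\
&= \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{ni} \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right)
\end{aligned}$$

where, under $q$, each $z_n$ is generated from $\mathrm{Mult}(z_n \mid \phi_n)$ and $\theta$ is generated from $\mathrm{Dir}(\theta \mid \gamma)$, independently of each other, which is why the expectation of the product factorizes.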

Computing Eq[log p(w |z , β)]

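$E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)]$ is given by

$$\begin{aligned}
E_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)] &= E_q\left[ \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{V} z_{ni}\, w_n^j \log \beta_{ij} \right] \\
&= \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{V} E_q[z_{ni}]\, w_n^j \log \beta_{ij} \\
&= \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{V} \phi_{ni}\, w_n^j \log \beta_{ij}
\end{aligned}$$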

Computing Eq[log q(θ|γ)]

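$E_q[\log q(\theta \mid \gamma)]$ is given by

$$E_q[\log q(\theta \mid \gamma)] = \sum_{i=1}^{K} (\gamma_i - 1)\, E_q[\log \theta_i] + \log \Gamma\left(\sum_{i=1}^{K} \gamma_i\right) - \sum_{i=1}^{K} \log \Gamma(\gamma_i)$$

Then, we have

$$E_q[\log q(\theta \mid \gamma)] = \log \Gamma\left(\sum_{i=1}^{K} \gamma_i\right) - \sum_{i=1}^{K} \log \Gamma(\gamma_i) + \sum_{i=1}^{K} (\gamma_i - 1) \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right)$$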

Computing Eq[log q(z |φ)]

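$E_q[\log q(\mathbf{z} \mid \phi)]$ is given by

$$\begin{aligned}
E_q[\log q(\mathbf{z} \mid \phi)] &= E_q\left[ \sum_{n=1}^{N} \sum_{i=1}^{K} z_{ni} \log \phi_{ni} \right] \\
&= \sum_{n=1}^{N} \sum_{i=1}^{K} E_q[z_{ni}] \log \phi_{ni} \\
&= \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{ni} \log \phi_{ni}
\end{aligned}$$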

Variational Inference

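Finally, $L(\gamma, \phi; \alpha, \beta)$ is

$$\begin{aligned}
L(\gamma, \phi; \alpha, \beta) &= \log \Gamma\left(\sum_{i=1}^{K} \alpha_i\right) - \sum_{i=1}^{K} \log \Gamma(\alpha_i) + \sum_{i=1}^{K} (\alpha_i - 1) \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right) \\
&\quad + \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{ni} \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right) \\
&\quad + \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{V} \phi_{ni}\, w_n^j \log \beta_{ij} \\
&\quad - \left( \log \Gamma\left(\sum_{i=1}^{K} \gamma_i\right) - \sum_{i=1}^{K} \log \Gamma(\gamma_i) + \sum_{i=1}^{K} (\gamma_i - 1) \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right) \right) \\
&\quad - \sum_{n=1}^{N} \sum_{i=1}^{K} \phi_{ni} \log \phi_{ni}.
\end{aligned}$$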
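
The bound translates almost line by line into code. The following Python sketch evaluates $L(\gamma, \phi; \alpha, \beta)$ for a single document using only the standard library. The function name `elbo`, the hand-written `digamma` approximation, and the toy parameter values are assumptions made for illustration; words are represented by their vocabulary index, so the sum over $j$ of $w_n^j \log \beta_{ij}$ collapses to a single lookup.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def elbo(gamma, phi, alpha, beta, words):
    """Lower bound L(gamma, phi; alpha, beta) for one document.

    words[n] is the vocabulary index of the n-th word; phi[n][i] is phi_ni.
    """
    K = len(alpha)
    e_log_theta = [digamma(g) - digamma(sum(gamma)) for g in gamma]
    # E_q[log p(theta | alpha)]
    value = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    value += sum((alpha[i] - 1.0) * e_log_theta[i] for i in range(K))
    for n, w in enumerate(words):
        for i in range(K):
            if phi[n][i] > 0.0:
                # E_q[log p(z | theta)] + E_q[log p(w | z, beta)] - E_q[log q(z)]
                value += phi[n][i] * (e_log_theta[i] + math.log(beta[i][w])
                                      - math.log(phi[n][i]))
    # - E_q[log q(theta)]
    value -= math.lgamma(sum(gamma)) - sum(math.lgamma(g) for g in gamma)
    value -= sum((gamma[i] - 1.0) * e_log_theta[i] for i in range(K))
    return value

# Hypothetical toy setting: K = 2 topics, V = 4 words, a document of N = 4 words.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.05, 0.05],
        [0.05, 0.05, 0.2, 0.7]]
words = [0, 0, 1, 3]
K, N = len(alpha), len(words)
phi = [[1.0 / K] * K for _ in range(N)]
gamma = [a + N / K for a in alpha]
print(elbo(gamma, phi, alpha, beta, words))
```

With a single topic ($K = 1$) the variational family contains the exact posterior, and the bound reduces to $\sum_n \log \beta_{1 w_n}$, which gives a convenient correctness check.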

Variational Multinomial

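Maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to $\phi_{ni}$. Collecting the terms that involve $\phi_{ni}$ and adding a Lagrange multiplier $\lambda$ for the constraint $\sum_{j=1}^{K} \phi_{nj} = 1$:

$$L_{[\phi_{ni}]} = \phi_{ni} \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right) + \phi_{ni} \log \beta_{iv} - \phi_{ni} \log \phi_{ni} + \lambda \left( \sum_{j=1}^{K} \phi_{nj} - 1 \right),$$

where $v$ is the vocabulary index of the word $w_n$, i.e. $w_n^v = 1$.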

Variational Multinomial

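Taking derivatives with respect to $\phi_{ni}$:

$$\frac{\partial L}{\partial \phi_{ni}} = \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) + \log \beta_{iv} - \log \phi_{ni} - 1 + \lambda.$$

Setting this derivative to zero yields

$$\phi_{ni} \propto \beta_{iv} \exp\left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right).$$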
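
The step from the stationarity condition to the proportionality can be spelled out as follows. This expansion is an added worked derivation rather than part of the slides; the summation index $l$ is introduced here only to avoid clashing with $i$ and $j$.

```latex
\log \phi_{ni} = \psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K} \gamma_j\Big) + \log \beta_{iv} - 1 + \lambda
\;\Longrightarrow\;
\phi_{ni} = e^{\lambda - 1}\, \beta_{iv} \exp\Big( \psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K} \gamma_j\Big) \Big)

\sum_{i=1}^{K} \phi_{ni} = 1
\;\Longrightarrow\;
\phi_{ni} = \frac{\beta_{iv} \exp(\psi(\gamma_i))}{\sum_{l=1}^{K} \beta_{lv} \exp(\psi(\gamma_l))}
```

The factor $\exp(-\psi(\sum_j \gamma_j))$ is the same for every $i$, so it cancels in the normalization. This is why an implementation only needs $\exp(\psi(\gamma_i))$ followed by a normalization over $i$.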

Variational Dirichlet

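Maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to $\gamma_i$. The terms that involve $\gamma$ are:

$$L_{[\gamma]} = \sum_{i=1}^{K} \left( \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right) \right) \left( \alpha_i + \sum_{n=1}^{N} \phi_{ni} - \gamma_i \right) - \log \Gamma\left(\sum_{j=1}^{K} \gamma_j\right) + \sum_{i=1}^{K} \log \Gamma(\gamma_i)$$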

Variational Dirichlet

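Taking the derivative with respect to $\gamma_i$:

$$\frac{\partial L}{\partial \gamma_i} = \psi'(\gamma_i) \left( \alpha_i + \sum_{n=1}^{N} \phi_{ni} - \gamma_i \right) - \psi'\left(\sum_{j=1}^{K} \gamma_j\right) \sum_{j=1}^{K} \left( \alpha_j + \sum_{n=1}^{N} \phi_{nj} - \gamma_j \right)$$

Setting this equation to zero yields:

$$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}.$$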

Variational Inference Algorithm

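1 initialize $\phi^0_{ni} := 1/K$ for all $i$ and $n$
2 initialize $\gamma_i := \alpha_i + N/K$ for all $i$
3 repeat
4   for $n = 1$ to $N$
5     for $i = 1$ to $K$
6       $\phi^{t+1}_{ni} := \beta_{i w_n} \exp(\psi(\gamma^t_i))$
7     normalize $\phi^{t+1}_n$ to sum to 1
8   $\gamma^{t+1} := \alpha + \sum_{n=1}^{N} \phi^{t+1}_n$
9 until convergence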
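
The loop above can be written as a short Python function. This is a sketch under stated assumptions rather than a reference implementation: the function name `variational_e_step`, the hand-written `digamma` approximation, and the convergence test (largest change in any $\gamma_i$ below `tol`) are choices made for this example, since the slides do not specify a convergence criterion.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def variational_e_step(words, alpha, beta, tol=1e-8, max_iter=200):
    """Coordinate ascent on (gamma, phi) for one document of word indices."""
    K, N = len(alpha), len(words)
    phi = [[1.0 / K] * K for _ in range(N)]        # phi_ni := 1/K
    gamma = [a + N / K for a in alpha]             # gamma_i := alpha_i + N/K
    for _ in range(max_iter):
        for n, w in enumerate(words):
            # phi_ni proportional to beta_{i, w_n} * exp(psi(gamma_i))
            row = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]      # normalize phi_n to sum to 1
        # gamma := alpha + sum_n phi_n
        new_gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
        change = max(abs(a - b) for a, b in zip(new_gamma, gamma))
        gamma = new_gamma
        if change < tol:                           # assumed convergence criterion
            break
    return gamma, phi

# Hypothetical toy setting: K = 2 topics, V = 4 words, a document of 4 words.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.05, 0.05],
        [0.05, 0.05, 0.2, 0.7]]
words = [0, 0, 1, 3]
gamma, phi = variational_e_step(words, alpha, beta)
print(gamma)
print(phi)
```

Because each row of `phi` sums to one, the entries of `gamma` always sum to $\sum_i \alpha_i + N$, which is a cheap invariant to check when debugging.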

Parameter Estimation

In the variational E-step, maximize the lower bound $L(\gamma, \phi; \alpha, \beta)$ with respect to the variational parameters $\gamma$ and $\phi$.

In the M-step, maximize the bound with respect to the model parameters $\alpha$ and $\beta$.

Conditional Multinomials

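Maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to $\beta$. Collecting the terms that involve $\beta$ over all $M$ documents and adding Lagrange multipliers $\lambda_i$ for the constraints $\sum_{j=1}^{V} \beta_{ij} = 1$:

$$L_{[\beta]} = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \sum_{i=1}^{K} \sum_{j=1}^{V} \phi_{dni}\, w_{dn}^j \log \beta_{ij} + \sum_{i=1}^{K} \lambda_i \left( \sum_{j=1}^{V} \beta_{ij} - 1 \right).$$

Taking the derivative with respect to $\beta_{ij}$ and setting it to zero:

$$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^j.$$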
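
In code, this update amounts to accumulating expected topic-word counts and normalizing each row. The sketch below is illustrative: the function name `update_beta` and the nested-list data layout are assumptions, and no smoothing is applied, so a topic that receives zero total weight would cause a division by zero.

```python
def update_beta(docs, phis, K, V):
    """M-step for beta.

    docs[d][n] is the vocabulary index of word n in document d;
    phis[d][n][i] is the variational parameter phi_dni from the E-step.
    """
    counts = [[0.0] * V for _ in range(K)]
    for words, phi in zip(docs, phis):
        for n, w in enumerate(words):
            for i in range(K):
                counts[i][w] += phi[n][i]      # w_dn^j = 1 only for j = w
    beta = []
    for row in counts:
        total = sum(row)
        beta.append([c / total for c in row])  # normalize so that sum_j beta_ij = 1
    return beta

# Hypothetical toy input: two documents, K = 2 topics, V = 3 words.
docs = [[0, 1], [2, 2, 1]]
phis = [[[1.0, 0.0], [0.5, 0.5]],
        [[0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]]
beta = update_beta(docs, phis, K=2, V=3)
print(beta)
```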

Dirichlet

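Maximize $L(\gamma, \phi; \alpha, \beta)$ with respect to $\alpha$. The terms that involve $\alpha$, summed over all $M$ documents, are:

$$L_{[\alpha]} = \sum_{d=1}^{M} \left( \log \Gamma\left(\sum_{j=1}^{K} \alpha_j\right) - \sum_{i=1}^{K} \log \Gamma(\alpha_i) + \sum_{i=1}^{K} (\alpha_i - 1) \left( \psi(\gamma_{di}) - \psi\left(\sum_{j=1}^{K} \gamma_{dj}\right) \right) \right)$$

Taking the derivative with respect to $\alpha_i$:

$$\frac{\partial L}{\partial \alpha_i} = M \left( \psi\left(\sum_{j=1}^{K} \alpha_j\right) - \psi(\alpha_i) \right) + \sum_{d=1}^{M} \left( \psi(\gamma_{di}) - \psi\left(\sum_{j=1}^{K} \gamma_{dj}\right) \right).$$

Because this derivative depends on every $\alpha_j$ through $\psi(\sum_j \alpha_j)$, setting it to zero does not give a closed-form solution for $\alpha_i$.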

Newton Method

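Compute the Hessian matrix by

$$\frac{\partial^2 L}{\partial \alpha_i\, \partial \alpha_j} = M \left( \psi'\left(\sum_{l=1}^{K} \alpha_l\right) - \delta(i, j)\, \psi'(\alpha_i) \right).$$

Pass this Hessian matrix and the derivative to the Newton method (applied to $-L$, since the Newton method was stated for minimization).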
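
One possible realization of this Newton update is sketched below in Python. Several parts are assumptions not found in the slides. The Hessian is a diagonal matrix plus a constant matrix, so the sketch solves the Newton system with the Sherman-Morrison formula instead of a general linear solver. A simple backtracking loop keeps every $\alpha_i$ positive and requires the objective to increase. The `digamma` and `trigamma` helpers are hand-written approximations, and the sufficient statistics `ss` are synthetic, generated from a known `alpha_true` so the result can be checked.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """Approximate psi'(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def objective(alpha, ss, M):
    """Terms of the bound involving alpha; ss[i] = sum_d (psi(gamma_di) - psi(sum_j gamma_dj))."""
    return (M * (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha))
            + sum((a - 1.0) * s for a, s in zip(alpha, ss)))

def newton_alpha(ss, M, alpha, eps=1e-10, max_iter=100):
    """Maximize objective(alpha) with Newton steps and backtracking."""
    cand = alpha
    for _ in range(max_iter):
        total = sum(alpha)
        g = [M * (digamma(total) - digamma(a)) + s for a, s in zip(alpha, ss)]
        h = [-M * trigamma(a) for a in alpha]        # diagonal part of the Hessian
        z = M * trigamma(total)                      # constant added to every entry
        # Hessian = diag(h) + z * 1 1^T, so H^{-1} g follows from Sherman-Morrison.
        c = sum(gi / hi for gi, hi in zip(g, h)) / (1.0 / z + sum(1.0 / hi for hi in h))
        step = [(gi - c) / hi for gi, hi in zip(g, h)]       # H^{-1} g
        lam2 = -sum(gi * si for gi, si in zip(g, step))      # g^T (-H)^{-1} g >= 0
        if lam2 / 2.0 <= eps:
            break
        t, base = 1.0, objective(alpha, ss, M)
        for _ in range(60):                          # backtracking (maximization form)
            cand = [a - t * s for a, s in zip(alpha, step)]
            if min(cand) > 0.0 and objective(cand, ss, M) >= base + 0.25 * t * lam2:
                break
            t *= 0.5
        alpha = cand
    return alpha

# Synthetic check: build ss so that the gradient vanishes exactly at alpha_true.
alpha_true = [0.5, 2.0, 1.2]
M = 50
ss = [M * (digamma(a) - digamma(sum(alpha_true))) for a in alpha_true]
alpha_hat = newton_alpha(ss, M, [1.0, 1.0, 1.0])
print(alpha_hat)
```

In a full EM loop, `ss` would instead be accumulated from the per-document $\gamma_d$ produced by the E-step.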

Conclusion

Variational inference is used to approximate intractable integrals arising in Bayesian networks.

Variational inference can be seen as an extension of the EM algorithm that computes an approximation to the entire posterior distribution of the latent variables.

Usually, the derived "best" variational distribution is in the same family as the corresponding prior distribution over the variable.

The derivation above is a good template for variational inference on other topic models.