Generative Process of LDA Exponential Family Newton Method Variational Inference Conclusion

Variational Inference for LDA

Zhao Zhou

The Hong Kong University of Science and Technology

Outline

1 Generative Process of LDA

2 Exponential Family

3 Newton Method

4 Variational Inference: E-Step, M-Step

5 Conclusion

Notations and terminology

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, . . . , V}. A document is a sequence of N words denoted by w = (w_1, w_2, . . . , w_N), where w_n is the nth word in the sequence.

A corpus is a collection of M documents denoted by D = {w_1, w_2, . . . , w_M}.

Latent Dirichlet allocation

LDA assumes the following generative process for each document w in a corpus D:

Choose N ∼ Poisson(ξ).

Choose θ ∼ Dir(α).

For each of the N words wn:

Choose a topic z_n ∼ Mult(θ).

Choose a word w_n from p(w_n | z_n, β).

Note that

The dimension K of the Dirichlet distribution (and hence of the topic variable) is known and fixed.

The word probabilities are parameterized by a K × V matrix β, where β_ij = p(w^j = 1 | z^i = 1).

The randomness of N is ignored in subsequent slides.
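The generative process above can be sketched in Python. This is a minimal illustration, not the paper's code: the function name is ours, the document length N is taken as given (the Poisson draw is ignored, as noted above), and the Dirichlet draw is obtained by normalizing Gamma samples.

```python
import random

def sample_document(alpha, beta, N, rng=random):
    """Sample one document from the LDA generative process (illustrative sketch).

    alpha: length-K Dirichlet parameter.
    beta:  K x V topic-word matrix, rows summing to 1.
    N:     document length, taken as given.
    """
    # theta ~ Dir(alpha), via normalized Gamma draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    words = []
    for _ in range(N):
        # z_n ~ Mult(theta), then w_n ~ Mult(beta[z_n])
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        words.append(w)
    return words
```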

Latent Dirichlet allocation

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is:

p(θ, z, w | α, β) = p(θ|α) ∏_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

The marginal distribution of a document is:

p(w | α, β) = ∫ p(θ|α) ( ∏_{n=1}^N ∑_{z_n} p(z_n|θ) p(w_n|z_n, β) ) dθ.

Exponential family

An exponential family distribution has the form

p(x|η) = h(x) exp{η^T t(x) − a(η)}

The components of this form are:

The natural parameter η

The sufficient statistic t(x)

The underlying measure h(x)

The log normalizer a(η)

a(η) = log ∫ h(x) exp{η^T t(x)} dx

First Moment

The derivatives of the log normalizer give the moments of the sufficient statistic:

d/dη a(η) = d/dη ( log ∫ exp{η^T t(x)} h(x) dx )

          = ∫ t(x) exp{η^T t(x)} h(x) dx / ∫ exp{η^T t(x)} h(x) dx

          = ∫ t(x) exp{η^T t(x) − a(η)} h(x) dx

          = E[t(X)]
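This identity can be checked numerically on the Bernoulli distribution, whose exponential-family form (an example of ours, not from the slides) has t(x) = x, η = log(p/(1−p)), and a(η) = log(1 + e^η); a finite-difference derivative of the log normalizer should recover E[t(X)] = E[X] = p:

```python
import math

# Bernoulli(p) as an exponential family: t(x) = x, h(x) = 1,
# natural parameter eta = log(p / (1 - p)), log normalizer a(eta) = log(1 + e^eta)
def log_normalizer(eta):
    return math.log(1.0 + math.exp(eta))

p = 0.3
eta = math.log(p / (1.0 - p))

# central finite-difference approximation of da/deta
eps = 1e-6
deriv = (log_normalizer(eta + eps) - log_normalizer(eta - eps)) / (2 * eps)
# the identity says deriv should equal E[t(X)] = p
```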

Computing E[log θ | α]

The Dirichlet distribution p(θ|α):

p(θ|α) = ( Γ(∑_{i=1}^K α_i) / ∏_{i=1}^K Γ(α_i) ) ∏_{i=1}^K θ_i^{α_i − 1}

       = exp{ ∑_{i=1}^K (α_i − 1) log θ_i + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) }

Sufficient statistics: log θ_i.

Log normalizer: a(α) = ∑_{i=1}^K log Γ(α_i) − log Γ(∑_{i=1}^K α_i)

Computing E[log θ | α]

The expectation E[log θ_i | α] is the partial derivative of the log normalizer:

E[log θ_i | α] = ∂a(α)/∂α_i = ∂/∂α_i ( ∑_{j=1}^K log Γ(α_j) − log Γ(∑_{j=1}^K α_j) )

             = ψ(α_i) − ψ(∑_{j=1}^K α_j),

where ψ is the digamma function, the first derivative of the log Gamma function.
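This identity can be verified by Monte Carlo: sample from a Dirichlet (via normalized Gamma draws), average log θ_i, and compare against ψ(α_i) − ψ(∑_j α_j). The pure-Python digamma below (recurrence plus asymptotic series) is an illustrative stand-in for a library routine:

```python
import math
import random

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

random.seed(1)
alpha = [2.0, 3.0, 4.0]
n_samples = 20000
total = 0.0
for _ in range(n_samples):
    # one Dirichlet(alpha) draw via normalized Gamma samples
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total += math.log(g[0] / sum(g))          # log theta_1
mc_estimate = total / n_samples
exact = digamma(alpha[0]) - digamma(sum(alpha))   # psi(2) - psi(9)
```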

Unconstrained minimization

Suppose f is convex and twice continuously differentiable.

Assume the optimal value p* = inf_x f(x) is attained.

Descent methods can be interpreted as iterative methods for solving the optimality condition

∇f(x*) = 0.

Newton Step

Newton step:

Δx_nt = −∇²f(x)⁻¹ ∇f(x).

Interpretations:

x + Δx_nt minimizes the second-order approximation

f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v

x + Δx_nt solves the linearized optimality condition

∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x) v = 0.

Newton decrement

A measure of the proximity of x to x*:

λ(x) = (∇f(x)^T ∇²f(x)⁻¹ ∇f(x))^{1/2},

equal to the norm of the Newton step in the quadratic norm defined by the Hessian:

λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2}

Backtracking Line Search

Exact line search:

t = argmin_{t>0} f(x + tΔx)

Backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1)):

Starting at t = 1, repeat t := βt until

f(x + tΔx) < f(x) + αt ∇f(x)^T Δx.

Newton Method

Repeat:

Compute the Newton step and decrement:

Δx_nt = −∇²f(x)⁻¹ ∇f(x),  λ² = ∇f(x)^T ∇²f(x)⁻¹ ∇f(x).

Stopping criterion: quit if λ²/2 ≤ ε.

Line search: choose step size t by backtracking line search.

Update: x := x + t Δx_nt.
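The loop above translates directly into code. The sketch below runs damped Newton in one dimension on f(x) = eˣ + e⁻ˣ (convex, minimized at x* = 0); the function name and the test objective are illustrative choices of ours:

```python
import math

def newton_minimize(f, grad, hess, x, eps=1e-8, alpha=0.25, beta=0.5):
    """1-D Newton's method with backtracking line search."""
    while True:
        g, h = grad(x), hess(x)
        dx = -g / h                        # Newton step
        lam2 = g * g / h                   # squared Newton decrement
        if lam2 / 2.0 <= eps:              # stopping criterion
            return x
        t = 1.0                            # backtracking line search
        while f(x + t * dx) >= f(x) + alpha * t * g * dx:
            t *= beta
        x = x + t * dx                     # update

f = lambda x: math.exp(x) + math.exp(-x)   # convex, minimized at x* = 0
xstar = newton_minimize(f,
                        grad=lambda x: math.exp(x) - math.exp(-x),
                        hess=lambda x: math.exp(x) + math.exp(-x),
                        x=1.5)
```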

Inference

The posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

This distribution is intractable to compute, since

p(w | α, β) = ( Γ(∑_j α_j) / ∏_j Γ(α_j) ) ∫ ( ∏_{i=1}^K θ_i^{α_i − 1} ) ( ∏_{n=1}^N ∑_{i=1}^K ∏_{j=1}^V (θ_i β_ij)^{w_n^j} ) dθ

due to the coupling between θ and β.

Variational distribution

The variational distribution on the latent variables:

q(θ, z | γ, φ) = q(θ|γ) ∏_{n=1}^N q(z_n|φ_n).

The values of γ and φ are determined by an optimization problem with respect to the KL divergence D:

(γ*, φ*) = argmin_{γ,φ} D(q(θ, z|γ, φ) || p(θ, z|w, α, β))

KL-Divergence

From now on, we denote q(θ, z|γ, φ) by q.

The KL divergence between q and p(θ, z|w, α, β) is

D(q || p) = E_q[log q] − E_q[log p(θ, z|w, α, β)]

         = E_q[log q] − E_q[log p(θ, z, w|α, β)] + log p(w|α, β)

Using Jensen's inequality, we lower-bound log p(w|α, β):

log p(w|α, β) = log ∫ ∑_z p(θ, z, w|α, β) dθ

             = log ∫ ∑_z p(θ, z, w|α, β) (q(θ, z)/q(θ, z)) dθ

             ≥ ∫ ∑_z q(θ, z) log ( p(θ, z, w|α, β) / q(θ, z) ) dθ

             = E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)].

KL-Divergence

We denote E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)] by L(γ, φ; α, β).

Then we have

log p(w|α, β) = L(γ, φ; α, β) + D(q(θ, z|γ, φ) || p(θ, z|w, α, β)).

Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior and the true posterior.

Variational Inference

Expand L(γ, φ; α, β) using the factorizations of p and q:

L(γ, φ; α, β) = E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)]

             = E_q[log p(θ|α)] + E_q[log p(z|θ)] + E_q[log p(w|z, β)]
               − E_q[log q(θ)] − E_q[log q(z)]

We compute each of the five terms in turn.

Computing Eq[log p(θ|α)]

E_q[log p(θ|α)] is given by

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).

Under q, θ is generated by Dir(θ|γ), so E_q[log θ_i] = ψ(γ_i) − ψ(∑_{j=1}^K γ_j).

Then we have:

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).

Computing Eq[log p(z |θ)]

E_q[log p(z|θ)] is given by

E_q[log p(z|θ)] = E_q[ ∑_{n=1}^N ∑_{i=1}^K z_n^i log θ_i ]

             = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] E_q[log θ_i]

             = ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)),

where, under q, z_n is generated from Mult(φ_n) and θ is generated from Dir(γ).

Computing Eq[log p(w |z , β)]

E_q[log p(w|z, β)] is given by

E_q[log p(w|z, β)] = E_q[ ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V z_n^i w_n^j log β_ij ]

                = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V E_q[z_n^i] w_n^j log β_ij

                = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij

Computing Eq[log q(θ|γ)]

E_q[log q(θ|γ)] is given by

E_q[log q(θ|γ)] = ∑_{i=1}^K (γ_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i)

Then, we have

E_q[log q(θ|γ)] = log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

Computing Eq[log q(z |φ)]

E_q[log q(z|φ)] is given by

E_q[log q(z|φ)] = E_q[ ∑_{n=1}^N ∑_{i=1}^K z_n^i log φ_ni ]

             = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] log φ_ni

             = ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni

Variational Inference

Finally, L(γ, φ; α, β) is

L(γ, φ; α, β) = log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

  + ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

  + ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij

  − ( log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) )

  − ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni.
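The bound can be evaluated directly in code, term by term. The sketch below is ours (function names and the pure-Python digamma approximation are illustrative stand-ins for library routines), for a single document given as a list of word ids:

```python
import math
from math import lgamma

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

def elbo(doc, alpha, beta, gamma, phi):
    """L(gamma, phi; alpha, beta) for one document."""
    K = len(alpha)
    dg_sum = digamma(sum(gamma))
    e_log_theta = [digamma(g) - dg_sum for g in gamma]   # E_q[log theta_i]
    # E_q[log p(theta|alpha)]
    L = (lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
         + sum((alpha[i] - 1.0) * e_log_theta[i] for i in range(K)))
    # E_q[log p(z|theta)] + E_q[log p(w|z,beta)] - E_q[log q(z)]
    for n, w in enumerate(doc):
        for i in range(K):
            L += phi[n][i] * (e_log_theta[i] + math.log(beta[i][w])
                              - math.log(phi[n][i]))
    # - E_q[log q(theta|gamma)]
    L -= (lgamma(sum(gamma)) - sum(lgamma(g) for g in gamma)
          + sum((gamma[i] - 1.0) * e_log_theta[i] for i in range(K)))
    return L
```

For a one-word document, p(w|α, β) = ∑_i (α_i/∑_j α_j) β_{i,w} exactly, so any valid (γ, φ) must give an ELBO value at or below log p(w|α, β).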

Variational Multinomial

Maximize L(γ, φ; α, β) with respect to φ_ni. Collect the terms containing φ_ni (writing v for the word index with w_n^v = 1) and add a Lagrange multiplier for the constraint ∑_{j=1}^K φ_nj = 1:

L_[φ_ni] = φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + φ_ni log β_iv − φ_ni log φ_ni + λ_n (∑_{j=1}^K φ_nj − 1).

Variational Multinomial

Taking the derivative with respect to φ_ni:

∂L/∂φ_ni = ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + log β_iv − log φ_ni − 1 + λ_n.

Setting this derivative to zero yields

φ_ni ∝ β_iv exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)).

Variational Dirichlet

Maximize L(γ, φ; α, β) with respect to γ_i. The terms containing γ are:

L_[γ] = ∑_{i=1}^K (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) (α_i + ∑_{n=1}^N φ_ni − γ_i) − log Γ(∑_{j=1}^K γ_j) + ∑_{i=1}^K log Γ(γ_i)

Variational Dirichlet

Taking the derivative with respect to γ_i:

∂L/∂γ_i = ψ′(γ_i) (α_i + ∑_{n=1}^N φ_ni − γ_i) − ψ′(∑_{j=1}^K γ_j) ∑_{j=1}^K (α_j + ∑_{n=1}^N φ_nj − γ_j)

Setting this derivative to zero yields:

γ_i = α_i + ∑_{n=1}^N φ_ni.

Variational Inference Algorithm

1 initialize φ_ni^0 = 1/K for all i and n

2 initialize γ_i^0 = α_i + N/K for all i

3 repeat

4   for n = 1 to N
5     for i = 1 to K

        1 φ_ni^{t+1} = β_{i,w_n} exp(ψ(γ_i^t))

        2 normalize φ_n^{t+1} to sum to 1

6   γ^{t+1} = α + ∑_{n=1}^N φ_n^{t+1}

7 until convergence
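The algorithm above translates almost line for line into Python. This is a sketch for a single document; the function name is ours, and the pure-Python digamma approximation stands in for a library routine. (After normalization the common ψ(∑_j γ_j) factor cancels, so only ψ(γ_i) is needed in the φ update.)

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

def e_step(doc, alpha, beta, max_iters=100, tol=1e-6):
    """Variational E-step for one document, given as a list of word ids."""
    K, N = len(alpha), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]           # initialize phi
    gamma = [alpha[i] + N / K for i in range(K)]      # initialize gamma
    for _ in range(max_iters):
        old = list(gamma)
        dg = [digamma(g) for g in gamma]
        for n, w in enumerate(doc):
            # phi_ni ∝ beta_{i,w_n} * exp(psi(gamma_i)), then normalize
            raw = [beta[i][w] * math.exp(dg[i]) for i in range(K)]
            s = sum(raw)
            phi[n] = [r / s for r in raw]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
        if max(abs(g - o) for g, o in zip(gamma, old)) < tol:
            break                                     # until convergence
    return gamma, phi
```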

Parameter Estimation

In the variational E-step, we maximize the lower bound L(γ, φ; α, β) with respect to the variational parameters γ and φ.

In the M-step, we maximize the bound with respect to the model parameters α and β.

Conditional Multinomials

Maximize L(γ, φ; α, β) with respect to β, with Lagrange multipliers for the constraints ∑_{j=1}^V β_ij = 1:

L_[β] = ∑_{d=1}^M ∑_{n=1}^{N_d} ∑_{i=1}^K ∑_{j=1}^V φ_dni w_dn^j log β_ij + ∑_{i=1}^K λ_i (∑_{j=1}^V β_ij − 1).

Taking the derivative with respect to β_ij and setting it to zero:

β_ij ∝ ∑_{d=1}^M ∑_{n=1}^{N_d} φ_dni w_dn^j.
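This update has a direct reading: β_ij accumulates the variational responsibility φ_dni of topic i over every token of word j, then each row is normalized. A sketch (names ours; docs as lists of word ids, phis as per-token topic weights from the E-step, assuming every topic receives some mass):

```python
def m_step_beta(docs, phis, K, V):
    """beta_ij ∝ sum_d sum_n phi_dni * w_dn^j, with each row normalized."""
    beta = [[0.0] * V for _ in range(K)]
    for doc, phi in zip(docs, phis):
        for w, p in zip(doc, phi):        # w: word id, p: topic weights for token
            for i in range(K):
                beta[i][w] += p[i]
    for i in range(K):
        s = sum(beta[i])                  # normalize topic i's word distribution
        beta[i] = [b / s for b in beta[i]]
    return beta
```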

Dirichlet

Maximize L(γ, φ; α, β) with respect to α. The terms containing α are:

L_[α] = ∑_{d=1}^M ( log Γ(∑_{j=1}^K α_j) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)) )

Taking the derivative with respect to α_i:

∂L/∂α_i = M (ψ(∑_{j=1}^K α_j) − ψ(α_i)) + ∑_{d=1}^M (ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)).

Setting this derivative to zero yields no closed-form solution for α_i, so we resort to Newton's method.

Newton Method

Compute the Hessian matrix:

∂²L/∂α_i ∂α_j = M (ψ′(∑_{l=1}^K α_l) − δ(i, j) ψ′(α_i)).

Feed this Hessian and the gradient into Newton's method.
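This Hessian has the special form diag(h) + z·11^T (up to sign conventions, with h_i = −M ψ′(α_i) and z = M ψ′(∑_l α_l)), so the Newton step can be computed in O(K) with the Sherman–Morrison formula instead of a full matrix inversion. A sketch of the linear solve, with names of our choosing, checked against a direct 2×2 solve in the test:

```python
def newton_step(g, h, z):
    """Solve (diag(h) + z * 1 1^T) x = g in O(K) via Sherman-Morrison.

    g: gradient vector; h: diagonal entries; z: constant rank-one part.
    """
    # common correction term c = z * (1^T A^{-1} g) / (1 + z * 1^T A^{-1} 1)
    c = (sum(gi / hi for gi, hi in zip(g, h))
         / (1.0 / z + sum(1.0 / hi for hi in h)))
    return [(gi - c) / hi for gi, hi in zip(g, h)]
```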

Conclusion

Variational inference is used to approximate intractable integrals arising in Bayesian networks.

Variational inference can be seen as an extension of the EM algorithm that computes the entire posterior distribution of the latent variables.

Usually, the derived "best" variational distribution lies in the same family as the corresponding prior distribution over the variable.

This derivation serves as a good template for variational inference in other topic models.