Generative Process of LDA Exponential Family Newton Method Variational Inference Conclusion

Variational Inference for LDA

Zhao Zhou

The Hong Kong University of Science and Technology

Outline

1 Generative Process of LDA

2 Exponential Family

3 Newton Method

4 Variational Inference: E-Step, M-Step

5 Conclusion

Notations and terminology

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, . . . , V}. A document is a sequence of N words denoted by w = (w_1, w_2, . . . , w_N), where w_n is the nth word in the sequence.

A corpus is a collection of M documents denoted by D = {w_1, w_2, . . . , w_M}.

Latent Dirichlet allocation

LDA assumes the following generative process for each document w in a corpus D:

Choose N ∼ Poisson(ξ).

Choose θ ∼ Dir(α).

For each of the N words wn:

Choose a topic z_n ∼ Mult(θ).

Choose a word w_n from p(w_n | z_n, β).

Note that

The dimension K of the Dirichlet distribution (and hence of the topic variable) is known and fixed.

The word probabilities are parameterized by a K × V matrix β, where β_ij = p(w^j = 1 | z^i = 1).

The randomness of N is ignored in subsequent slides.
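The generative process above can be sketched in Python. This is a minimal illustration, not the paper's code: the function name is ours, the document length N is taken as given (the Poisson draw is ignored, as noted above), and the Dirichlet draw is obtained by normalizing Gamma samples.

```python
import random

def sample_document(alpha, beta, N, rng=random):
    """Sample one document from the LDA generative process (illustrative sketch).

    alpha: length-K Dirichlet parameter.
    beta:  K x V topic-word matrix, rows summing to 1.
    N:     document length, taken as given.
    """
    # theta ~ Dir(alpha), via normalized Gamma draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    words = []
    for _ in range(N):
        # z_n ~ Mult(theta), then w_n ~ Mult(beta[z_n])
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        words.append(w)
    return words
```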

Latent Dirichlet allocation

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is:

p(θ, z, w | α, β) = p(θ|α) ∏_{n=1}^N p(z_n|θ) p(w_n|z_n, β)

The marginal distribution of a document is:

p(w | α, β) = ∫ p(θ|α) ( ∏_{n=1}^N ∑_{z_n} p(z_n|θ) p(w_n|z_n, β) ) dθ.

Exponential family

An exponential family distribution has the form

p(x|η) = h(x) exp{η^T t(x) − a(η)}

The components of this form are:

The natural parameter η

The sufficient statistic t(x)

The underlying measure h(x)

The log normalizer a(η)

a(η) = log ∫ h(x) exp{η^T t(x)} dx

First Moment

The derivatives of the log normalizer give the moments of the sufficient statistic:

d/dη a(η) = d/dη ( log ∫ exp{η^T t(x)} h(x) dx )

          = ∫ t(x) exp{η^T t(x)} h(x) dx / ∫ exp{η^T t(x)} h(x) dx

          = ∫ t(x) exp{η^T t(x) − a(η)} h(x) dx

          = E[t(X)]
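This identity can be checked numerically on the Bernoulli distribution, whose exponential-family form (an example of ours, not from the slides) has t(x) = x, η = log(p/(1−p)), and a(η) = log(1 + e^η); a finite-difference derivative of the log normalizer should recover E[t(X)] = E[X] = p:

```python
import math

# Bernoulli(p) as an exponential family: t(x) = x, h(x) = 1,
# natural parameter eta = log(p / (1 - p)), log normalizer a(eta) = log(1 + e^eta)
def log_normalizer(eta):
    return math.log(1.0 + math.exp(eta))

p = 0.3
eta = math.log(p / (1.0 - p))

# central finite-difference approximation of da/deta
eps = 1e-6
deriv = (log_normalizer(eta + eps) - log_normalizer(eta - eps)) / (2 * eps)
# the identity says deriv should equal E[t(X)] = p
```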

Computing E[log θ | α]

The Dirichlet distribution p(θ|α):

p(θ|α) = ( Γ(∑_{i=1}^K α_i) / ∏_{i=1}^K Γ(α_i) ) ∏_{i=1}^K θ_i^{α_i − 1}

       = exp{ ∑_{i=1}^K (α_i − 1) log θ_i + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) }

Sufficient statistics: log θ_i.

Log normalizer: a(α) = ∑_{i=1}^K log Γ(α_i) − log Γ(∑_{i=1}^K α_i)

Computing E[log θ | α]

The expectation E[log θ_i | α] is the partial derivative of the log normalizer:

E[log θ_i | α] = ∂a(α)/∂α_i = ∂/∂α_i ( ∑_{j=1}^K log Γ(α_j) − log Γ(∑_{j=1}^K α_j) )

             = ψ(α_i) − ψ(∑_{j=1}^K α_j),

where ψ is the digamma function, the first derivative of the log Gamma function.
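This identity can be verified by Monte Carlo: sample from a Dirichlet (via normalized Gamma draws), average log θ_i, and compare against ψ(α_i) − ψ(∑_j α_j). The pure-Python digamma below (recurrence plus asymptotic series) is an illustrative stand-in for a library routine:

```python
import math
import random

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

random.seed(1)
alpha = [2.0, 3.0, 4.0]
n_samples = 20000
total = 0.0
for _ in range(n_samples):
    # one Dirichlet(alpha) draw via normalized Gamma samples
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total += math.log(g[0] / sum(g))          # log theta_1
mc_estimate = total / n_samples
exact = digamma(alpha[0]) - digamma(sum(alpha))   # psi(2) - psi(9)
```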

Unconstrained minimization

Suppose f is convex and twice continuously differentiable.

Assume the optimal value p* = inf_x f(x) is attained.

Descent methods can be interpreted as iterative methods for solving the optimality condition

∇f(x*) = 0.

Newton Step

Newton step:

Δx_nt = −∇²f(x)⁻¹ ∇f(x).

Interpretations:

x + Δx_nt minimizes the second-order approximation

f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v

x + Δx_nt solves the linearized optimality condition

∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x) v = 0.

Newton decrement

A measure of the proximity of x to x*:

λ(x) = (∇f(x)^T ∇²f(x)⁻¹ ∇f(x))^{1/2},

equal to the norm of the Newton step in the quadratic norm defined by the Hessian:

λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2}

Backtracking Line Search

Exact line search:

t = argmin_{t>0} f(x + tΔx)

Backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1)):

Starting at t = 1, repeat t := βt until

f(x + tΔx) < f(x) + αt ∇f(x)^T Δx.

Newton Method

Repeat:

Compute the Newton step and decrement:

Δx_nt = −∇²f(x)⁻¹ ∇f(x),  λ² = ∇f(x)^T ∇²f(x)⁻¹ ∇f(x).

Stopping criterion: quit if λ²/2 ≤ ε.

Line search: choose step size t by backtracking line search.

Update: x := x + t Δx_nt.
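The loop above translates directly into code. The sketch below runs damped Newton in one dimension on f(x) = eˣ + e⁻ˣ (convex, minimized at x* = 0); the function name and the test objective are illustrative choices of ours:

```python
import math

def newton_minimize(f, grad, hess, x, eps=1e-8, alpha=0.25, beta=0.5):
    """1-D Newton's method with backtracking line search."""
    while True:
        g, h = grad(x), hess(x)
        dx = -g / h                        # Newton step
        lam2 = g * g / h                   # squared Newton decrement
        if lam2 / 2.0 <= eps:              # stopping criterion
            return x
        t = 1.0                            # backtracking line search
        while f(x + t * dx) >= f(x) + alpha * t * g * dx:
            t *= beta
        x = x + t * dx                     # update

f = lambda x: math.exp(x) + math.exp(-x)   # convex, minimized at x* = 0
xstar = newton_minimize(f,
                        grad=lambda x: math.exp(x) - math.exp(-x),
                        hess=lambda x: math.exp(x) + math.exp(-x),
                        x=1.5)
```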

Inference

The posterior distribution of the hidden variables:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

This distribution is intractable to compute, since

p(w | α, β) = ( Γ(∑_j α_j) / ∏_j Γ(α_j) ) ∫ ( ∏_{i=1}^K θ_i^{α_i − 1} ) ( ∏_{n=1}^N ∑_{i=1}^K ∏_{j=1}^V (θ_i β_ij)^{w_n^j} ) dθ

due to the coupling between θ and β.

Variational distribution

The variational distribution on the latent variables:

q(θ, z | γ, φ) = q(θ|γ) ∏_{n=1}^N q(z_n|φ_n).

The values of γ and φ are determined by an optimization problem with respect to the KL divergence D:

(γ*, φ*) = argmin_{γ,φ} D(q(θ, z|γ, φ) || p(θ, z|w, α, β))

KL-Divergence

From now on, we denote q(θ, z|γ, φ) by q.

The KL divergence between q and p(θ, z|w, α, β) is

D(q || p) = E_q[log q] − E_q[log p(θ, z|w, α, β)]

         = E_q[log q] − E_q[log p(θ, z, w|α, β)] + log p(w|α, β)

Using Jensen's inequality, we lower-bound log p(w|α, β):

log p(w|α, β) = log ∫ ∑_z p(θ, z, w|α, β) dθ

             = log ∫ ∑_z p(θ, z, w|α, β) (q(θ, z)/q(θ, z)) dθ

             ≥ ∫ ∑_z q(θ, z) log ( p(θ, z, w|α, β) / q(θ, z) ) dθ

             = E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)].

KL-Divergence

We denote E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)] by L(γ, φ; α, β).

Then we have

log p(w|α, β) = L(γ, φ; α, β) + D(q(θ, z|γ, φ) || p(θ, z|w, α, β)).

Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior and the true posterior.

Variational Inference

Expand L(γ, φ; α, β) using the factorizations of p and q:

L(γ, φ; α, β) = E_q[log p(θ, z, w|α, β)] − E_q[log q(θ, z)]

             = E_q[log p(θ|α)] + E_q[log p(z|θ)] + E_q[log p(w|z, β)]
               − E_q[log q(θ)] − E_q[log q(z)]

We compute each of the five terms in turn.

Computing Eq[log p(θ|α)]

E_q[log p(θ|α)] is given by

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).

Under q, θ is generated by Dir(θ|γ), so E_q[log θ_i] = ψ(γ_i) − ψ(∑_{j=1}^K γ_j).

Then we have:

E_q[log p(θ|α)] = ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i).

Computing Eq[log p(z |θ)]

E_q[log p(z|θ)] is given by

E_q[log p(z|θ)] = E_q[ ∑_{n=1}^N ∑_{i=1}^K z_n^i log θ_i ]

             = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] E_q[log θ_i]

             = ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)),

where, under q, z_n is generated from Mult(φ_n) and θ is generated from Dir(γ).

Computing Eq[log p(w |z , β)]

E_q[log p(w|z, β)] is given by

E_q[log p(w|z, β)] = E_q[ ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V z_n^i w_n^j log β_ij ]

                = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V E_q[z_n^i] w_n^j log β_ij

                = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij

Computing Eq[log q(θ|γ)]

E_q[log q(θ|γ)] is given by

E_q[log q(θ|γ)] = ∑_{i=1}^K (γ_i − 1) E_q[log θ_i] + log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i)

Then, we have

E_q[log q(θ|γ)] = log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

Computing Eq[log q(z |φ)]

E_q[log q(z|φ)] is given by

E_q[log q(z|φ)] = E_q[ ∑_{n=1}^N ∑_{i=1}^K z_n^i log φ_ni ]

             = ∑_{n=1}^N ∑_{i=1}^K E_q[z_n^i] log φ_ni

             = ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni

Variational Inference

Finally, L(γ, φ; α, β) is

L(γ, φ; α, β) = log Γ(∑_{i=1}^K α_i) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

  + ∑_{n=1}^N ∑_{i=1}^K φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

  + ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^V φ_ni w_n^j log β_ij

  − ( log Γ(∑_{i=1}^K γ_i) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) )

  − ∑_{n=1}^N ∑_{i=1}^K φ_ni log φ_ni.
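The bound can be evaluated directly in code, term by term. The sketch below is ours (function names and the pure-Python digamma approximation are illustrative stand-ins for library routines), for a single document given as a list of word ids:

```python
import math
from math import lgamma

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

def elbo(doc, alpha, beta, gamma, phi):
    """L(gamma, phi; alpha, beta) for one document."""
    K = len(alpha)
    dg_sum = digamma(sum(gamma))
    e_log_theta = [digamma(g) - dg_sum for g in gamma]   # E_q[log theta_i]
    # E_q[log p(theta|alpha)]
    L = (lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
         + sum((alpha[i] - 1.0) * e_log_theta[i] for i in range(K)))
    # E_q[log p(z|theta)] + E_q[log p(w|z,beta)] - E_q[log q(z)]
    for n, w in enumerate(doc):
        for i in range(K):
            L += phi[n][i] * (e_log_theta[i] + math.log(beta[i][w])
                              - math.log(phi[n][i]))
    # - E_q[log q(theta|gamma)]
    L -= (lgamma(sum(gamma)) - sum(lgamma(g) for g in gamma)
          + sum((gamma[i] - 1.0) * e_log_theta[i] for i in range(K)))
    return L
```

For a one-word document, p(w|α, β) = ∑_i (α_i/∑_j α_j) β_{i,w} exactly, so any valid (γ, φ) must give an ELBO value at or below log p(w|α, β).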

Variational Multinomial

Maximize L(γ, φ; α, β) with respect to φ_ni. Collect the terms containing φ_ni (writing v for the word index with w_n^v = 1) and add a Lagrange multiplier for the constraint ∑_{j=1}^K φ_nj = 1:

L_[φ_ni] = φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) + φ_ni log β_iv − φ_ni log φ_ni + λ_n (∑_{j=1}^K φ_nj − 1).

Variational Multinomial

Taking the derivative with respect to φ_ni:

∂L/∂φ_ni = ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + log β_iv − log φ_ni − 1 + λ_n.

Setting this derivative to zero yields

φ_ni ∝ β_iv exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j)).

Variational Dirichlet

Maximize L(γ, φ; α, β) with respect to γ_i. The terms containing γ are:

L_[γ] = ∑_{i=1}^K (ψ(γ_i) − ψ(∑_{j=1}^K γ_j)) (α_i + ∑_{n=1}^N φ_ni − γ_i) − log Γ(∑_{j=1}^K γ_j) + ∑_{i=1}^K log Γ(γ_i)

Variational Dirichlet

Taking the derivative with respect to γ_i:

∂L/∂γ_i = ψ′(γ_i) (α_i + ∑_{n=1}^N φ_ni − γ_i) − ψ′(∑_{j=1}^K γ_j) ∑_{j=1}^K (α_j + ∑_{n=1}^N φ_nj − γ_j)

Setting this derivative to zero yields:

γ_i = α_i + ∑_{n=1}^N φ_ni.

Variational Inference Algorithm

1 initialize φ_ni^0 = 1/K for all i and n

2 initialize γ_i^0 = α_i + N/K for all i

3 repeat

4   for n = 1 to N
5     for i = 1 to K

        1 φ_ni^{t+1} = β_{i,w_n} exp(ψ(γ_i^t))

        2 normalize φ_n^{t+1} to sum to 1

6   γ^{t+1} = α + ∑_{n=1}^N φ_n^{t+1}

7 until convergence
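The algorithm above translates almost line for line into Python. This is a sketch for a single document; the function name is ours, and the pure-Python digamma approximation stands in for a library routine. (After normalization the common ψ(∑_j γ_j) factor cancels, so only ψ(γ_i) is needed in the φ update.)

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2.0 * x)
            - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4))

def e_step(doc, alpha, beta, max_iters=100, tol=1e-6):
    """Variational E-step for one document, given as a list of word ids."""
    K, N = len(alpha), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]           # initialize phi
    gamma = [alpha[i] + N / K for i in range(K)]      # initialize gamma
    for _ in range(max_iters):
        old = list(gamma)
        dg = [digamma(g) for g in gamma]
        for n, w in enumerate(doc):
            # phi_ni ∝ beta_{i,w_n} * exp(psi(gamma_i)), then normalize
            raw = [beta[i][w] * math.exp(dg[i]) for i in range(K)]
            s = sum(raw)
            phi[n] = [r / s for r in raw]
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
        if max(abs(g - o) for g, o in zip(gamma, old)) < tol:
            break                                     # until convergence
    return gamma, phi
```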

Parameter Estimation

In the variational E-step, we maximize the lower bound L(γ, φ; α, β) with respect to the variational parameters γ and φ.

In the M-step, we maximize the bound with respect to the model parameters α and β.

Conditional Multinomials

Maximize L(γ, φ; α, β) with respect to β, with Lagrange multipliers for the constraints ∑_{j=1}^V β_ij = 1:

L_[β] = ∑_{d=1}^M ∑_{n=1}^{N_d} ∑_{i=1}^K ∑_{j=1}^V φ_dni w_dn^j log β_ij + ∑_{i=1}^K λ_i (∑_{j=1}^V β_ij − 1).

Taking the derivative with respect to β_ij and setting it to zero:

β_ij ∝ ∑_{d=1}^M ∑_{n=1}^{N_d} φ_dni w_dn^j.
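This update has a direct reading: β_ij accumulates the variational responsibility φ_dni of topic i over every token of word j, then each row is normalized. A sketch (names ours; docs as lists of word ids, phis as per-token topic weights from the E-step, assuming every topic receives some mass):

```python
def m_step_beta(docs, phis, K, V):
    """beta_ij ∝ sum_d sum_n phi_dni * w_dn^j, with each row normalized."""
    beta = [[0.0] * V for _ in range(K)]
    for doc, phi in zip(docs, phis):
        for w, p in zip(doc, phi):        # w: word id, p: topic weights for token
            for i in range(K):
                beta[i][w] += p[i]
    for i in range(K):
        s = sum(beta[i])                  # normalize topic i's word distribution
        beta[i] = [b / s for b in beta[i]]
    return beta
```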

Dirichlet

Maximize L(γ, φ; α, β) with respect to α. The terms containing α are:

L_[α] = ∑_{d=1}^M ( log Γ(∑_{j=1}^K α_j) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)) )

Taking the derivative with respect to α_i:

∂L/∂α_i = M (ψ(∑_{j=1}^K α_j) − ψ(α_i)) + ∑_{d=1}^M (ψ(γ_di) − ψ(∑_{j=1}^K γ_dj)).

Setting this derivative to zero yields no closed-form solution for α_i, so we resort to Newton's method.

Newton Method

Compute the Hessian matrix:

∂²L/∂α_i ∂α_j = M (ψ′(∑_{l=1}^K α_l) − δ(i, j) ψ′(α_i)).

Feed this Hessian and the gradient into Newton's method.
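This Hessian has the special form diag(h) + z·11^T (up to sign conventions, with h_i = −M ψ′(α_i) and z = M ψ′(∑_l α_l)), so the Newton step can be computed in O(K) with the Sherman–Morrison formula instead of a full matrix inversion. A sketch of the linear solve, with names of our choosing, checked against a direct 2×2 solve in the test:

```python
def newton_step(g, h, z):
    """Solve (diag(h) + z * 1 1^T) x = g in O(K) via Sherman-Morrison.

    g: gradient vector; h: diagonal entries; z: constant rank-one part.
    """
    # common correction term c = z * (1^T A^{-1} g) / (1 + z * 1^T A^{-1} 1)
    c = (sum(gi / hi for gi, hi in zip(g, h))
         / (1.0 / z + sum(1.0 / hi for hi in h)))
    return [(gi - c) / hi for gi, hi in zip(g, h)]
```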

Conclusion

Variational inference is used to approximate intractable integrals arising in Bayesian networks.

Variational inference can be seen as an extension of the EM algorithm that computes the entire posterior distribution of the latent variables.

Usually, the derived "best" variational distribution lies in the same family as the corresponding prior distribution over the variable.

This derivation serves as a good template for variational inference in other topic models.