Stochastic Variational Inference

Jesus Fernandez Bes

Machine Learning Group

March 27, 2014

http://arxiv.org/abs/1206.7051



1 Introduction

2 Stochastic Variational Inference
    Models with local and global hidden variables
    Mean-field variational inference
    The natural gradient of the ELBO
    Stochastic Variational Inference

3 Stochastic Variational Inference in Topic Models
    Topic Models
    Latent Dirichlet Allocation
    Hierarchical Dirichlet Process

4 Some Bibliography




Motivation

Challenges of modern data analysis

Massive

Complex

High-dimensional

Probability Models (and Graphical Models) deal with complexity.

Scale is the problem.

“Traditional” Variational Inference

1 Inference ⇒ High-dimensional optimization.

2 Solved using coordinate ascent algorithms:

Analyze ALL the data.
Re-estimate hidden structure.
Analyze ALL the data.
...

DO NOT SCALE WITH BIG DATA




Main Ideas

How to make a general variational method that scales:

Use stochastic optimization. Follow cheap noisy estimates of the gradient.

Use the natural gradient. With it, Stochastic Variational Inference has an attractive form.

Structure of SVI

1 Subsample one or more data points from the data.

2 Analyze the subsample using current variational parameters.

3 Implement a closed-form update of the parameters.

4 Repeat.




Models with local and global hidden variables

$$p(x, z, \beta \mid \alpha) = p(\beta \mid \alpha) \prod_{n=1}^{N} p(x_n, z_n \mid \beta)$$

$N$ observations $x = x_{1:N}$.

Vector of global hidden variables $\beta$.

$N$ local hidden variables $z = z_{1:N}$, each of which is a collection of $J$ variables $z_n = z_{n,1:J}$.

Vector of fixed parameters $\alpha$.





Complete Conditional assumption

Complete conditionals are in the exponential family

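$$p(\beta \mid x, z, \alpha) = h(\beta) \exp\{\eta_g(x, z, \alpha)^\top t(\beta) - a_g(\eta_g(x, z, \alpha))\}$$

$$p(z_{nj} \mid x_n, z_{n,-j}, \beta) = h(z_{nj}) \exp\{\eta_\ell(x_n, z_{n,-j}, \beta)^\top t(z_{nj}) - a_\ell(\eta_\ell(x_n, z_{n,-j}, \beta))\}$$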

h(·) is the base measure.

a(·) is the log normalizer.

η(·) is the natural parameter vector.

t(·) are the sufficient statistics.

Several distributions in the exponential family

Bernoulli, Gaussian, Multinomial, Dirichlet, Gamma, Poisson, Beta, ...
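As a small illustration of the notation above, the Bernoulli distribution can be written in exponential-family form with base measure h(x) = 1, sufficient statistic t(x) = x, natural parameter η = log(p / (1 − p)) and log normalizer a(η) = log(1 + e^η). The sketch below checks that this form reproduces the usual probabilities; the function names are illustrative and not part of the slides.

```python
import math

def bernoulli_natural_param(p):
    """Natural parameter eta = log(p / (1 - p)), i.e. the log-odds."""
    return math.log(p / (1.0 - p))

def log_normalizer(eta):
    """Log normalizer a(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def bernoulli_expfam_pmf(x, eta):
    """p(x | eta) = h(x) * exp(eta * t(x) - a(eta)), with h(x) = 1 and t(x) = x."""
    return math.exp(eta * x - log_normalizer(eta))

p = 0.3
eta = bernoulli_natural_param(p)
pmf = [bernoulli_expfam_pmf(x, eta) for x in (0, 1)]  # probabilities of x = 0 and x = 1
```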





Examples of this kind of model

Bayesian Mixture Models.

Latent Dirichlet Allocation.

Hidden Markov Models (+ many variants).

Kalman filters (+ many variants).

Hierarchical linear regression models.

Hierarchical probit classification models.

Probabilistic factor analysis/matrix factorization models.

Certain Bayesian nonparametric mixture models.





GOAL

Approximate the posterior distribution of the hidden variables given the observations.

$$p(z, \beta \mid x) = \frac{p(x, z, \beta)}{\int p(x, z, \beta)\, dz\, d\beta}$$

The problem is the denominator: it is intractable to compute.





The evidence lower bound (ELBO)

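$$\begin{aligned}
\log p(x) &= \log \int p(x, z, \beta)\, dz\, d\beta \\
&= \log \int \frac{p(x, z, \beta)}{q(z, \beta)}\, q(z, \beta)\, dz\, d\beta \\
&= \log \left( \mathbb{E}_q\!\left[ \frac{p(x, z, \beta)}{q(z, \beta)} \right] \right) \\
&\geq \mathbb{E}_q[\log p(x, z, \beta)] - \mathbb{E}_q[\log q(z, \beta)] \\
&\triangleq \mathcal{L}(q).
\end{aligned}$$

$$\mathrm{KL}(q(z, \beta) \,\|\, p(z, \beta \mid x)) = -\mathcal{L}(q) + \text{const}.$$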
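The bound and the KL identity can be checked numerically on a toy problem. The sketch below assumes a single discrete hidden variable with three states and made-up joint probabilities (neither is from the slides); it computes the log evidence, the ELBO and the KL divergence to the exact posterior. In this setting the constant in the KL identity is log p(x).

```python
import math

# Joint p(x, z) evaluated at the observed x, for a hidden variable z in {0, 1, 2}.
# These numbers are arbitrary and chosen only for illustration.
joint = [0.10, 0.25, 0.05]

# An arbitrary variational distribution q(z).
q = [0.5, 0.3, 0.2]

# log p(x) = log sum_z p(x, z)
log_evidence = math.log(sum(joint))

# ELBO = E_q[log p(x, z)] - E_q[log q(z)]
elbo = sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, joint))

# Exact posterior p(z | x) and KL(q || p(z | x))
posterior = [pz / sum(joint) for pz in joint]
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))
```

Maximizing the ELBO over q is therefore the same as minimizing the KL divergence to the posterior, since the two always sum to log p(x).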





Mean-field Approximation

Assumption on q(z, β):

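$$q(z, \beta) = q(\beta \mid \lambda) \prod_{n=1}^{N} \prod_{j=1}^{J} q(z_{nj} \mid \phi_{nj})$$

with $q(\beta \mid \lambda)$ and $q(z_{nj} \mid \phi_{nj})$ in the same exponential family as the complete conditionals.

$$q(\beta \mid \lambda) = h(\beta) \exp\{\lambda^\top t(\beta) - a_g(\lambda)\}$$

$$q(z_{nj} \mid \phi_{nj}) = h(z_{nj}) \exp\{\phi_{nj}^\top t(z_{nj}) - a_\ell(\phi_{nj})\}$$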

Easy coordinate ascent algorithm.





Gradient of the ELBO and Coordinate Ascent Inference

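$$\nabla_\lambda \mathcal{L} = \nabla^2_\lambda a_g(\lambda) \left( \mathbb{E}_q[\eta_g(x, z, \alpha)] - \lambda \right)$$

$$\nabla_{\phi_{nj}} \mathcal{L} = \nabla^2_{\phi_{nj}} a_\ell(\phi_{nj}) \left( \mathbb{E}_q[\eta_\ell(x_n, z_{n,-j}, \beta)] - \phi_{nj} \right)$$

Both of them equal 0 by setting

$$\lambda = \mathbb{E}_q[\eta_g(x, z, \alpha)]$$

$$\phi_{nj} = \mathbb{E}_q[\eta_\ell(x_n, z_{n,-j}, \beta)]$$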





The gradient, if it exists, points in the direction of steepest ascent,

$$\arg\max_{d\lambda} f(\lambda + d\lambda) \quad \text{subject to } \|d\lambda\|^2 < \epsilon$$

for small $\epsilon$. The gradient depends on the Euclidean distance metric in the parameter space.

For probability distributions, the Euclidean metric can be a bad metric.




The natural gradient accounts for the information geometry of its parameter space.

Symmetrized KL divergence

Natural measure of dissimilarity between probability distributions

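$$D^{\mathrm{sym}}_{\mathrm{KL}}(\lambda, \lambda') = \mathbb{E}_{\lambda}\!\left[ \log \frac{q(\beta \mid \lambda)}{q(\beta \mid \lambda')} \right] + \mathbb{E}_{\lambda'}\!\left[ \log \frac{q(\beta \mid \lambda')}{q(\beta \mid \lambda)} \right]$$

Using this distance, the direction of steepest ascent is

$$\arg\max_{d\lambda} f(\lambda + d\lambda) \quad \text{subject to } D^{\mathrm{sym}}_{\mathrm{KL}}(\lambda, \lambda + d\lambda) < \epsilon$$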
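To see why the Euclidean metric can mislead, one can compare two pairs of univariate Gaussians, similar to the illustration used in Hoffman et al. (2013). The sketch below uses the standard closed-form KL divergence between Gaussians; the helper names are hypothetical. The pair whose means are 10 apart is nearly indistinguishable, while the pair whose means are only 0.1 apart barely overlaps.

```python
import math

def kl_gaussian(mu1, var1, mu2, var2):
    """KL(N(mu1, var1) || N(mu2, var2)) for univariate Gaussians."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetrized_kl(mu1, var1, mu2, var2):
    """Sum of the KL divergences taken in both directions."""
    return kl_gaussian(mu1, var1, mu2, var2) + kl_gaussian(mu2, var2, mu1, var1)

# N(0, 10000) vs N(10, 10000): Euclidean distance between the means is 10.
wide = symmetrized_kl(0.0, 10000.0, 10.0, 10000.0)

# N(0, 0.01) vs N(0.1, 0.01): Euclidean distance between the means is 0.1.
narrow = symmetrized_kl(0.0, 0.01, 0.1, 0.01)
```

Here `wide` is about 0.01 and `narrow` is about 1, so the symmetrized KL ranks the two pairs in the opposite order to the Euclidean distance between their parameters.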





Natural Gradient

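The natural gradient points in the direction of steepest ascent in the Riemannian space.

$$\hat{\nabla}_\lambda f(\lambda) = G(\lambda)^{-1} \nabla_\lambda f(\lambda)$$

where $G(\lambda) = \mathbb{E}_\lambda\!\left[ (\nabla_\lambda \log q(\beta \mid \lambda)) (\nabla_\lambda \log q(\beta \mid \lambda))^\top \right]$ is the Fisher information matrix of $q(\beta \mid \lambda)$.

For the exponential family: $G(\lambda) = \nabla^2_\lambda a_g(\lambda)$

For our mean-field model:

$$\hat{\nabla}_\lambda \mathcal{L} = \mathbb{E}_\phi[\eta_g(x, z, \alpha)] - \lambda$$

$$\hat{\nabla}_{\phi_{nj}} \mathcal{L} = \mathbb{E}_{\lambda, \phi_{n,-j}}[\eta_\ell(x_n, z_{n,-j}, \beta)] - \phi_{nj}$$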
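The identity between the Fisher information and the Hessian of the log normalizer can be checked on a one-parameter case. The sketch below assumes a Bernoulli distribution in its natural parameterization, with a(η) = log(1 + e^η) and score x − σ(η); it compares the expected squared score with a finite-difference second derivative of a(η). The function names are illustrative only.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fisher_from_score(eta):
    """G(eta) = E[(d/d eta log q(x | eta))^2], where the score is x - sigmoid(eta)."""
    mu = sigmoid(eta)
    # x = 0 with probability 1 - mu, x = 1 with probability mu
    return (1 - mu) * (0 - mu) ** 2 + mu * (1 - mu) ** 2

def hessian_log_normalizer(eta, step=1e-4):
    """Central finite-difference second derivative of a(eta) = log(1 + exp(eta))."""
    a = lambda e: math.log1p(math.exp(e))
    return (a(eta + step) - 2 * a(eta) + a(eta - step)) / step ** 2

def natural_gradient(euclidean_grad, eta):
    """Premultiply the ordinary gradient by G(eta)^{-1} (a scalar in this example)."""
    return euclidean_grad / fisher_from_score(eta)

eta = 0.8
G_score = fisher_from_score(eta)
G_hess = hessian_log_normalizer(eta)
```

Because the ordinary ELBO gradient already carries a factor of the Hessian of the log normalizer, dividing by G cancels it, which is why the natural gradients above have such a simple form.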





Why Natural Gradients?

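Traditional Gradients

$$\nabla_\lambda \mathcal{L} = \nabla^2_\lambda a_g(\lambda) \left( \mathbb{E}_q[\eta_g(x, z, \alpha)] - \lambda \right)$$

$$\nabla_{\phi_{nj}} \mathcal{L} = \nabla^2_{\phi_{nj}} a_\ell(\phi_{nj}) \left( \mathbb{E}_q[\eta_\ell(x_n, z_{n,-j}, \beta)] - \phi_{nj} \right)$$

Natural Gradients

$$\hat{\nabla}_\lambda \mathcal{L} = \mathbb{E}_\phi[\eta_g(x, z, \alpha)] - \lambda$$

$$\hat{\nabla}_{\phi_{nj}} \mathcal{L} = \mathbb{E}_{\lambda, \phi_{n,-j}}[\eta_\ell(x_n, z_{n,-j}, \beta)] - \phi_{nj}$$

Coordinate ascent is equal to taking a natural gradient step of length one.

Easier to compute. Use them to develop scalable variational inference algorithms.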





Stochastic Optimization

We have a random function $B(\lambda)$ with $\mathbb{E}_q[B(\lambda)] = \nabla_\lambda f(\lambda)$. We can optimize $f(\lambda)$ iteratively as

$$\lambda^{(t)} = \lambda^{(t-1)} + \rho_t\, b_t(\lambda^{(t-1)})$$

where $b_t$ is an independent draw from $B$. The sequence of $\rho_t$ must satisfy the Robbins-Monro conditions ($\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$).

Follow noisy estimates of the gradient with a decreasing step size.
If the gradient can be written as a sum of terms (one per data point), a fast noisy approximation can be computed by subsampling data.
$\lambda^{(t)}$ will converge to the optimal $\lambda^*$ (if $f$ is convex) or a local optimum of $f$ (if not convex).
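A minimal sketch of this iteration, under assumptions that are not in the slides: the objective is the toy function f(λ) = −½(λ − 3)², the noise is Gaussian, and the step size is ρ_t = (t + τ)^(−κ), which satisfies the Robbins-Monro conditions for κ in (0.5, 1].

```python
import random

def step_size(t, tau=1.0, kappa=0.7):
    """rho_t = (t + tau)^(-kappa); kappa in (0.5, 1] satisfies Robbins-Monro."""
    return (t + tau) ** (-kappa)

def noisy_gradient(lam, rng, target=3.0, noise_std=1.0):
    """Unbiased noisy gradient of the toy objective f(lam) = -0.5 * (lam - target)^2."""
    return (target - lam) + rng.gauss(0.0, noise_std)

def stochastic_ascent(num_steps=20000, seed=0):
    """Follow noisy gradients with a decreasing step size, starting from lam = 0."""
    rng = random.Random(seed)
    lam = 0.0
    for t in range(1, num_steps + 1):
        lam = lam + step_size(t) * noisy_gradient(lam, rng)
    return lam

lam_final = stochastic_ascent()  # ends close to the maximizer lam = 3
```

Each individual gradient is as noisy as the signal itself, yet the shrinking step size averages the noise out over the iterations.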





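$$\mathcal{L}(\lambda) = \underbrace{\mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]}_{\text{global}} + \underbrace{\sum_{n=1}^{N} \max_{\phi_n} \left( \mathbb{E}_q[\log p(x_n, z_n \mid \beta)] - \mathbb{E}_q[\log q(z_n)] \right)}_{\text{sum of local}}$$

We choose $I \sim \mathrm{Unif}(1, \dots, N)$ and define $\mathcal{L}_I(\lambda)$ as the random function

$$\mathcal{L}_I(\lambda) = \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)] + N \max_{\phi_I} \left( \mathbb{E}_q[\log p(x_I, z_I \mid \beta)] - \mathbb{E}_q[\log q(z_I)] \right)$$

The expectation of $\mathcal{L}_I$ is equal to the objective, and consequently $\hat{\nabla}_\lambda \mathcal{L}_I$ is a noisy but unbiased estimate of the natural gradient of the objective.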
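The unbiasedness claim follows from averaging over the uniform index I. Each data point is selected with probability 1/N, which cancels the factor N in front of the local term:

```latex
\begin{aligned}
\mathbb{E}_I[\mathcal{L}_I(\lambda)]
  &= \mathbb{E}_q[\log p(\beta)] - \mathbb{E}_q[\log q(\beta)]
     + \sum_{n=1}^{N} \frac{1}{N} \cdot N \max_{\phi_n}
       \left( \mathbb{E}_q[\log p(x_n, z_n \mid \beta)] - \mathbb{E}_q[\log q(z_n)] \right) \\
  &= \mathcal{L}(\lambda).
\end{aligned}
```

Since taking the natural gradient is a linear operation on the objective, the same averaging argument carries over to the natural gradient of the random function.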





Stochastic Optimization for global parameters

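$$\hat{\nabla}_\lambda \mathcal{L}_i = \mathbb{E}_q\!\left[ \eta_g(x_i^{(N)}, z_i^{(N)}, \alpha) \right] - \lambda$$

Here $x_i^{(N)}$ and $z_i^{(N)}$ denote $N$ replicates of $x_i$ and $z_i$.

$$\eta_g(x_i^{(N)}, z_i^{(N)}, \alpha) = \alpha + N \cdot (t(x_i, z_i), 1)$$

$$\hat{\nabla}_\lambda \mathcal{L}_i = \alpha + N \cdot \left( \mathbb{E}_q[t(x_i, z_i)], 1 \right) - \lambda$$

Using stochastic optimization

$$\hat{\lambda}_t \triangleq \alpha + N\, \mathbb{E}_{\phi(\lambda)}\!\left[ (t(x_i, z_i), 1) \right]$$

$$\lambda^{(t)} = \lambda^{(t-1)} + \rho_t \left( \hat{\lambda}_t - \lambda^{(t-1)} \right) = (1 - \rho_t)\, \lambda^{(t-1)} + \rho_t\, \hat{\lambda}_t$$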
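The update can be sketched on a deliberately simple model that is an assumption of this example and not taken from the slides: a Beta prior on a coin bias with Bernoulli observations and no local hidden variables. The variational parameter λ is written as the two Beta shape parameters, which are an affine shift of the natural parameters, so the convex combination applies to them unchanged. The step size (t + τ)^(−κ) is also an assumed choice.

```python
import random

def svi_beta_bernoulli(data, alpha=(1.0, 1.0), num_steps=50000,
                       tau=1.0, kappa=0.7, seed=0):
    """SVI for beta ~ Beta(alpha), x_n ~ Bernoulli(beta); returns shape params of q(beta)."""
    rng = random.Random(seed)
    N = len(data)
    lam = list(alpha)  # start q(beta) = Beta(lam[0], lam[1]) at the prior
    for t in range(1, num_steps + 1):
        # 1. Subsample one data point uniformly at random.
        x_i = data[rng.randrange(N)]
        # 2. Intermediate parameter: the posterior if x_i were replicated N times.
        lam_hat = (alpha[0] + N * x_i, alpha[1] + N * (1 - x_i))
        # 3. Closed-form update: convex combination of old and intermediate values.
        rho = (t + tau) ** (-kappa)
        lam = [(1 - rho) * l + rho * lh for l, lh in zip(lam, lam_hat)]
    return lam

data = [1] * 70 + [0] * 30
lam = svi_beta_bernoulli(data)
exact = (1.0 + 70, 1.0 + 30)  # exact conjugate posterior Beta(71, 31), for comparison
```

In this conjugate toy case the exact posterior is available, so it gives a reference point for how closely the noisy iterates settle around the batch answer.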





Stochastic Variational Inference





Extensions

Minibatches

Pick more than one data point each time ($S$ points per minibatch),

$$\lambda^{(t)} = (1 - \rho_t)\, \lambda^{(t-1)} + \frac{\rho_t}{S} \sum_{s} \hat{\lambda}_s.$$
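A minimal sketch of the minibatch step, with parameters stored as plain Python lists and a hypothetical helper name. It averages the S intermediate parameters before mixing them with the current value.

```python
def minibatch_update(lam, lam_hats, rho):
    """lam^(t) = (1 - rho) * lam^(t-1) + (rho / S) * sum_s lam_hat_s."""
    S = len(lam_hats)
    # Average the intermediate parameters computed from the S sampled points.
    mean_hat = [sum(h[k] for h in lam_hats) / S for k in range(len(lam))]
    return [(1 - rho) * l + rho * m for l, m in zip(lam, mean_hat)]

# Example with S = 3 intermediate parameter vectors of dimension 2.
lam_new = minibatch_update(
    [1.0, 1.0],
    [(101.0, 1.0), (1.0, 101.0), (101.0, 1.0)],
    rho=0.5,
)
```

Averaging over a minibatch keeps the estimate unbiased while reducing its variance relative to a single sampled point.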

Empirical Bayes estimation of hyperparameters

Get a point estimate of the value of the hyperparameters $\alpha$:

$$\alpha^{(t)} = \alpha^{(t-1)} + \rho_t \nabla_\alpha \mathcal{L}_t(\lambda^{(t-1)}, \phi, \alpha^{(t-1)}).$$





Topic Models

Observations:

Word $w_{dn}$ is the $n$th word in the $d$th document. It is an element of a fixed vocabulary of $V$ terms.

Latent Variables:

A topic $\beta_k$ is a distribution over the vocabulary, a point in the $(V-1)$-simplex.
Topic proportions $\theta_d$ are associated with each document: a distribution over topics.
Each word in each document comes from a single topic. Topic assignments $z_{dn}$ are topic indices.

Consider two models: Latent Dirichlet Allocation (LDA) has a fixed number $K$ of topics. The Hierarchical Dirichlet Process (HDP) has an infinite number of topics.





Analyzing the documents

Posterior inference of p(β, θ, z|w)





Generative model

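1 Draw topics $\beta_k \sim \mathrm{Dirichlet}(\eta, \dots, \eta)$ for $k \in \{1, \dots, K\}$.
2 For each document $d \in \{1, \dots, D\}$:
    1 Draw topic proportions $\theta_d \sim \mathrm{Dirichlet}(\alpha, \dots, \alpha)$.
    2 For each word $n \in \{1, \dots, N\}$:
        1 Draw topic assignment $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$.
        2 Draw word $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$.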
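The generative process can be sketched with the Python standard library alone. The corpus sizes and hyperparameter values below are arbitrary assumptions, and the Dirichlet draws use the standard construction of normalizing independent Gamma variates.

```python
import random

def sample_dirichlet(concentration, dim, rng):
    """Symmetric Dirichlet draw via normalized Gamma(concentration, 1) variates."""
    g = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(g)
    return [v / total for v in g]

def sample_categorical(probs, rng):
    """Draw one index with the given probabilities (a single multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(D=5, N=20, K=3, V=10, eta=0.5, alpha=0.5, seed=0):
    """Sample K topics, then D documents of N words each over a vocabulary of V terms."""
    rng = random.Random(seed)
    topics = [sample_dirichlet(eta, V, rng) for _ in range(K)]    # beta_k
    docs, assignments = [], []
    for _ in range(D):
        theta = sample_dirichlet(alpha, K, rng)                   # theta_d
        z = [sample_categorical(theta, rng) for _ in range(N)]    # z_dn
        w = [sample_categorical(topics[k], rng) for k in z]       # w_dn
        docs.append(w)
        assignments.append(z)
    return topics, docs, assignments

topics, docs, assignments = generate_corpus()
```

Posterior inference runs this process in reverse: only `docs` is observed, and the topics, proportions and assignments have to be inferred.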





Variational Inference in LDA

Mean-field for LDA

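$$q(z_{dn}) = \mathrm{Multinomial}(\phi_{dn})$$

$$q(\theta_d) = \mathrm{Dirichlet}(\gamma_d)$$

$$q(\beta_k) = \mathrm{Dirichlet}(\lambda_k)$$

1 Update the local variational parameters for each document $d$:

$$\phi_{dn}^{k} \propto \exp\left\{ \Psi(\gamma_{dk}) + \Psi(\lambda_{k, w_{dn}}) - \Psi\!\left( \sum_{v} \lambda_{kv} \right) \right\} \quad \text{for } n \in \{1, \dots, N\}$$

$$\gamma_d = \alpha + \sum_{n=1}^{N} \phi_{dn}$$

2 Update the global parameters:

$$\lambda_k = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N} \phi_{dn}^{k}\, w_{dn}$$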
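The two batch updates can be sketched in plain Python under a few assumptions: documents are lists of word indices, w_dn in the global update is read as a one-hot indicator vector over the vocabulary, the digamma function Ψ is approximated with the usual recurrence plus asymptotic series, and the local step runs a fixed number of iterations instead of testing convergence. All function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def local_step(doc, lam, alpha, num_iters=50):
    """Coordinate ascent on (phi_d, gamma_d) for one document of word indices."""
    K = len(lam)
    # E_q[log beta_kv] = Psi(lam_kv) - Psi(sum_v lam_kv)
    elog_beta = []
    for row in lam:
        psi_total = digamma(sum(row))
        elog_beta.append([digamma(l) - psi_total for l in row])
    gamma = [alpha + len(doc) / K] * K
    for _ in range(num_iters):
        phi = []
        for w in doc:
            logits = [digamma(gamma[k]) + elog_beta[k][w] for k in range(K)]
            m = max(logits)  # subtract the max for numerical stability
            unnorm = [math.exp(l - m) for l in logits]
            total = sum(unnorm)
            phi.append([u / total for u in unnorm])
        gamma = [alpha + sum(p[k] for p in phi) for k in range(K)]
    return phi, gamma

def global_step(docs, lam, alpha, eta):
    """Batch update lam_kv = eta + sum_d sum_n phi_dn^k * 1[w_dn = v]."""
    K, V = len(lam), len(lam[0])
    new_lam = [[eta] * V for _ in range(K)]
    for doc in docs:
        phi, _ = local_step(doc, lam, alpha)
        for n, w in enumerate(doc):
            for k in range(K):
                new_lam[k][w] += phi[n][k]
    return new_lam

# Tiny example: 2 documents, vocabulary of 4 terms, K = 2 topics.
docs = [[0, 0, 1, 0], [2, 3, 3, 2]]
lam0 = [[1.0, 1.0, 0.1, 0.1], [0.1, 0.1, 1.0, 1.0]]
phi, gamma = local_step(docs[0], lam0, alpha=0.5)
new_lam = global_step(docs, lam0, alpha=0.5, eta=0.1)
```

Note that `global_step` has to visit every document before the topics change once, which is the cost that the stochastic variant avoids.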





Stochastic Variational Inference in LDA





Results LDA

DATA

Nature: 350k docs, 58M words, 4200 terms.

New York Times: 1.8M docs, 461M words, 8000 terms.

Wikipedia: 3.8M docs, 482M words, 7700 terms.

* Batch Variational uses a subset of 100k docs.




Results HDP




Some Bibliography

Main Paper

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). “Stochastic variational inference”. The Journal of Machine Learning Research, 14(1), 1303-1347.

Other References

Blei, D. M. “Variational Inference”. Lecture Notes of COS597C: Advanced Methods in Probabilistic Modeling, Princeton University, fall 2011, www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf.

Blei, D. M. “Exponential Families”. Lecture Notes of COS597C: Advanced Methods in Probabilistic Modeling, Princeton University, fall 2011, www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/exponential-families.pdf.

