Download - Variational Inference - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CMSC_726/17a.pdf · 2020-06-30 · Variational Inference — Inferring hidden variables — Unlike MCMC: Deterministic

Transcript
Variational Inference

Material adapted from David Blei
University of Maryland
INTRODUCTION

Material adapted from David Blei | UMD Variational Inference | 1 / 29

Variational Inference

• Inferring hidden variables
• Unlike MCMC:
    • Deterministic
    • Easy to gauge convergence
    • Requires dozens of iterations
• Doesn’t require conjugacy
• Slightly hairier math

Setup

• $\vec{x} = x_{1:n}$ observations
• $\vec{z} = z_{1:m}$ hidden variables
• $\alpha$ fixed parameters
• Want the posterior distribution

$$p(z \mid x, \alpha) = \frac{p(z, x \mid \alpha)}{\int_z p(z, x \mid \alpha)} \tag{1}$$

Motivation

• Can’t compute posterior for many interesting models

GMM (finite)

1. Draw $\mu_k \sim \mathcal{N}(0, \tau^2)$
2. For each observation $i = 1 \dots n$:
    2.1 Draw $z_i \sim \mathrm{Mult}(\pi)$
    2.2 Draw $x_i \sim \mathcal{N}(\mu_{z_i}, \sigma_0^2)$

• Posterior is intractable for large n, and we might want to add priors

$$p(\mu_{1:K}, z_{1:n} \mid x_{1:n}) = \frac{\prod_{k=1}^{K} p(\mu_k) \prod_{i=1}^{n} p(z_i)\, p(x_i \mid z_i, \mu_{1:K})}{\int_{\mu_{1:K}} \sum_{z_{1:n}} \prod_{k=1}^{K} p(\mu_k) \prod_{i=1}^{n} p(z_i)\, p(x_i \mid z_i, \mu_{1:K})} \tag{2}$$

In the denominator, the integral over $\mu_{1:K}$ must consider all means, and the sum over $z_{1:n}$ must consider all assignments.
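The generative process for the finite GMM can be sketched in a few lines of plain Python. This is an illustrative sketch rather than anything from the original slides: the function name `sample_gmm` and all parameter values are assumptions. It also computes K^n, the number of joint assignments that the sum over z_{1:n} in the denominator of (2) ranges over.

```python
import random

def sample_gmm(n, pi, tau, sigma0, seed=0):
    """Sample mu_k ~ N(0, tau^2), z_i ~ Mult(pi), x_i ~ N(mu_{z_i}, sigma0^2)."""
    rng = random.Random(seed)
    K = len(pi)
    mu = [rng.gauss(0.0, tau) for _ in range(K)]      # step 1: one mean per component
    z = rng.choices(range(K), weights=pi, k=n)        # step 2.1: component assignments
    x = [rng.gauss(mu[zi], sigma0) for zi in z]       # step 2.2: observations
    return mu, z, x

# Hypothetical settings chosen only for illustration.
mu, z, x = sample_gmm(n=50, pi=[0.5, 0.3, 0.2], tau=5.0, sigma0=1.0)

# Number of distinct assignment vectors z_{1:n}: K ** n terms in the sum.
num_assignments = len(mu) ** len(x)
print(num_assignments)
```

Even with only 50 observations and 3 components, enumerating every assignment is out of reach, which is the sense in which the posterior is intractable.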

Main Idea

• We create a variational distribution over the latent variables

$$q(z_{1:m} \mid \nu) \tag{3}$$

• Find the settings of $\nu$ so that q is close to the posterior
• If q == p, then this is vanilla EM

What does it mean for distributions to be close?

• We measure the closeness of distributions using Kullback-Leibler divergence

$$\mathrm{KL}(q \,\|\, p) \equiv \mathbb{E}_q\left[\log \frac{q(Z)}{p(Z \mid x)}\right] \tag{4}$$

• Characterizing KL divergence
    • If q and p are high, we’re happy
    • If q is high but p isn’t, we pay a price
    • If q is low, we don’t care
    • If KL = 0, then the distributions are equal

This behavior is often called “mode splitting”: we want a good solution, not every solution.
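The asymmetry described in the bullets can be checked numerically for discrete distributions. The sketch below is an illustration only: the helper `kl` and the three example distributions are assumptions, with `p` standing in for a bimodal posterior.

```python
import math

def kl(q, p):
    """KL(q || p) = sum_z q(z) * log(q(z) / p(z)); terms with q(z) = 0 contribute nothing."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

p = [0.49, 0.02, 0.49]            # stand-in posterior with two modes
q_one_mode = [0.96, 0.02, 0.02]   # q is high only where p is high
q_between = [0.02, 0.96, 0.02]    # q is high where p is low: we pay a price

kl_one_mode = kl(q_one_mode, p)
kl_between = kl(q_between, p)
kl_same = kl(p, p)                # identical distributions

print(kl_one_mode, kl_between, kl_same)
```

Covering a single mode is penalized far less than placing mass where p is small, so a q that captures one good solution scores better than one that spreads over low-probability regions.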

Jensen’s Inequality: Concave Functions and Expectations

[Figure: the log curve between $x_1$ and $x_2$, comparing $\log(t \cdot x_1 + (1 - t) \cdot x_2)$ on the curve with $t \log(x_1) + (1 - t) \log(x_2)$ on the chord]

When f is concave

$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$

If you haven’t seen this before, spend fifteen minutes to convince yourself that it’s true
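One quick way to build intuition is a numeric check with f = log. The sketch below is only an illustration with arbitrarily chosen values: it compares log(E[X]) with E[log X] for a sample of positive numbers, and also evaluates the two-point form from the figure.

```python
import math
import random

rng = random.Random(0)
xs = [rng.uniform(0.1, 10.0) for _ in range(10_000)]    # arbitrary positive samples

mean_x = sum(xs) / len(xs)
log_of_mean = math.log(mean_x)                           # f(E[X])
mean_of_log = sum(math.log(v) for v in xs) / len(xs)     # E[f(X)]
gap = log_of_mean - mean_of_log                          # non-negative for concave f

# Two-point version: the curve lies above the chord.
t, x1, x2 = 0.3, 1.0, 9.0
on_curve = math.log(t * x1 + (1 - t) * x2)
on_chord = t * math.log(x1) + (1 - t) * math.log(x2)

print(gap, on_curve - on_chord)
```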

Evidence Lower Bound (ELBO)

• Apply Jensen’s inequality on log probability of data

\begin{align*}
\log p(x) &= \log \left[ \int_z p(x, z) \right] \\
&= \log \left[ \int_z p(x, z) \frac{q(z)}{q(z)} \right] && \text{(add a term that is equal to one)} \\
&= \log \left[ \mathbb{E}_q \left[ \frac{p(x, z)}{q(z)} \right] \right] && \text{(take the $q(z)$ in the numerator to create an expectation)} \\
&\geq \mathbb{E}_q[\log p(x, z)] - \mathbb{E}_q[\log q(z)] && \text{(apply Jensen's inequality and use log difference)}
\end{align*}

• Fun side effect: Entropy
• Maximizing the ELBO gives as tight a bound as possible on the log probability
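The bound can be checked on a model small enough to sum over by hand. In the sketch below, the table `joint` holds assumed values of p(x, z) at one observed x for three values of z; the names and numbers are illustrative and not from the slides.

```python
import math

# p(x, z) at the observed x for z = 0, 1, 2 (assumed illustrative values).
joint = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint))               # log p(x)

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete q with strictly positive entries."""
    return sum(qz * (math.log(pxz) - math.log(qz)) for qz, pxz in zip(q, joint))

candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "skewed": [0.7, 0.2, 0.1],
    "posterior": [pxz / sum(joint) for pxz in joint],
}

# The gap log p(x) - ELBO(q) is never negative and vanishes at the exact posterior.
gaps = {name: log_evidence - elbo(q) for name, q in candidates.items()}
for name, gap in gaps.items():
    print(f"{name}: log p(x) - ELBO = {gap:.4f}")
```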

Relation to KL Divergence

• Conditional probability definition

$$p(z \mid x) = \frac{p(z, x)}{p(x)} \tag{5}$$

• Plug into KL divergence

\begin{align*}
\mathrm{KL}(q(z) \,\|\, p(z \mid x)) &= \mathbb{E}_q \left[ \log \frac{q(z)}{p(z \mid x)} \right] \\
&= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z \mid x)] && \text{(break quotient into difference)} \\
&= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)] + \log p(x) && \text{(apply definition of conditional probability)} \\
&= -\left( \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)] \right) + \log p(x) && \text{(reorganize terms)}
\end{align*}

• Negative of ELBO (plus constant); minimizing KL divergence is the same as maximizing ELBO
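The last line of the derivation says KL = -ELBO + log p(x). A short numeric check on a three-state toy model is sketched below; the table `joint` and the distribution `q` are assumed values chosen only for illustration.

```python
import math

joint = [0.10, 0.25, 0.05]                      # assumed p(z, x) at the observed x
evidence = sum(joint)                           # p(x)
posterior = [v / evidence for v in joint]       # p(z | x) from the definition in (5)
q = [0.2, 0.5, 0.3]                             # an arbitrary variational distribution

# ELBO = E_q[log p(z, x)] - E_q[log q(z)]
elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))
# KL(q || p(z | x)) computed directly from its definition
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

# The two sides of the identity should agree up to rounding.
residual = kl - (-elbo + math.log(evidence))
print(kl, elbo, residual)
```

Because log p(x) does not depend on q, any change to q that raises the ELBO lowers the KL divergence by the same amount.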

Mean field variational inference

• Assume that your variational distribution factorizes

$$q(z_1, \dots, z_m) = \prod_{j=1}^{m} q(z_j) \tag{6}$$

• You may want to group some hidden variables together
• This family does not contain the true posterior, because the hidden variables are dependent

General Blueprint

• Choose q
• Derive ELBO
• Coordinate ascent of each $q_i$
• Repeat until convergence
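The blueprint can be traced on a toy problem with two binary hidden variables. The sketch below rests on several assumptions: the table `joint` of p(z1, z2, x) values is made up, and the factor update q_j(z_j) ∝ exp(E over the other factor of log p(z, x)) is the standard mean-field coordinate update, which the slides do not state explicitly.

```python
import math

# Assumed joint p(z1, z2, x) at the observed x, indexed as joint[z1][z2].
joint = [[0.30, 0.05],
         [0.05, 0.20]]
log_evidence = math.log(sum(sum(row) for row in joint))

def normalize(v):
    s = sum(v)
    return [a / s for a in v]

def elbo(q1, q2):
    """E_q[log p(z, x)] - E_q[log q(z)] under the factorized q(z1, z2) = q1(z1) q2(z2)."""
    return sum(q1[a] * q2[b] * (math.log(joint[a][b]) - math.log(q1[a]) - math.log(q2[b]))
               for a in range(2) for b in range(2))

q1, q2 = [0.6, 0.4], [0.5, 0.5]     # choose q (a non-symmetric starting point)
history = [elbo(q1, q2)]            # track the ELBO as a diagnostic
for _ in range(20):                 # repeat (fixed count here instead of a convergence test)
    # Coordinate ascent: update each factor while holding the other fixed.
    q1 = normalize([math.exp(sum(q2[b] * math.log(joint[a][b]) for b in range(2)))
                    for a in range(2)])
    q2 = normalize([math.exp(sum(q1[a] * math.log(joint[a][b]) for a in range(2)))
                    for b in range(2)])
    history.append(elbo(q1, q2))

print(history[0], history[-1], log_evidence)
```

Each update can only raise the ELBO, and the final value stays below log p(x) because the two variables are dependent in `joint` while q treats them as independent.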

Example: Latent Dirichlet Allocation

[Figure: three topics (TOPIC 1, TOPIC 2, TOPIC 3), each shown as a list of words]

• computer, technology, system, service, site, phone, internet, machine
• play, film, movie, theater, production, star, director, stage
• sell, sale, store, product, business, advertising, market, consumer

[Figure: newspaper headlines grouped under the three topics]

• Forget the Bootleg, Just Download the Movie Legally
• Multiplex Heralded As Linchpin To Growth
• The Shape of Cinema, Transformed At the Click of a Mouse
• A Peaceful Crew Puts Muppets Where Its Mouth Is
• Stock Trades: A Better Deal For Investors Isn't Simple
• The three big Internet portals begin to distinguish among themselves as shopping malls
• Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

[Figure: a document excerpt shown next to the three topic word lists]

Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ...

LDA Generative Model

[Figure: plate diagram with nodes $\alpha$, $\theta_d$, $z_n$, $w_n$, $\beta_k$ and plates N, M, K]

• For each topic $k \in \{1, \dots, K\}$, a multinomial distribution $\beta_k$
• For each document $d \in \{1, \dots, M\}$, draw a multinomial distribution $\theta_d$ from a Dirichlet distribution with parameter $\alpha$
• For each word position $n \in \{1, \dots, N\}$, select a hidden topic $z_n$ from the multinomial distribution parameterized by $\theta$.
• Choose the observed word $w_n$ from the distribution $\beta_{z_n}$.

Statistical inference uncovers most unobserved variables given data.
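The generative story can be sketched as a sampler in plain Python. This is only an illustration: the function names, the corpus sizes, and the symmetric Dirichlet used to create the topics $\beta_k$ are assumptions (the slides treat the topics as given and do not say how they are drawn).

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def generate_corpus(M, N, K, V, alpha, eta, seed=0):
    rng = random.Random(seed)
    # Topics beta_k over a vocabulary of size V; Dirichlet(eta) is an assumption.
    beta = [sample_dirichlet([eta] * V, rng) for _ in range(K)]
    docs, topics = [], []
    for _ in range(M):
        theta = sample_dirichlet([alpha] * K, rng)                     # theta_d ~ Dirichlet(alpha)
        z = rng.choices(range(K), weights=theta, k=N)                  # z_n ~ Mult(theta_d)
        w = [rng.choices(range(V), weights=beta[zn])[0] for zn in z]   # w_n ~ Mult(beta_{z_n})
        docs.append(w)
        topics.append(z)
    return beta, docs, topics

beta, docs, topics = generate_corpus(M=5, N=20, K=3, V=10, alpha=0.5, eta=0.5)
print(docs[0])
```

Inference runs this process in reverse: only `docs` is observed, and the per-document `theta` and per-token `z` are what the variational distribution approximates.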

Deriving Variational Inference for LDA

Joint distribution:

$$p(\theta, z, w \mid \alpha, \beta) = \prod_d p(\theta_d \mid \alpha) \prod_n p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta, z_{d,n}) \tag{7}$$

• $p(\theta_d \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_k \theta_{d,k}^{\alpha_k - 1}$ (Dirichlet)
• $p(z_{d,n} \mid \theta_d) = \theta_{d, z_{d,n}}$ (Draw from Multinomial)
• $p(w_{d,n} \mid \beta, z_{d,n}) = \beta_{z_{d,n}, w_{d,n}}$ (Draw from Multinomial)

Variational distribution:

$$q(\theta, z) = q(\theta \mid \gamma)\, q(z \mid \phi) \tag{8}$$

ELBO:

$$L(\gamma, \phi; \alpha, \beta) = \mathbb{E}_q[\log p(\theta \mid \alpha)] + \mathbb{E}_q[\log p(z \mid \theta)] + \mathbb{E}_q[\log p(w \mid z, \beta)] - \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log q(z)] \tag{9}$$

What is the variational distribution?

$$q(\vec{\theta}, \vec{z}) = \prod_d q(\theta_d \mid \gamma_d) \prod_n q(z_{d,n} \mid \phi_{d,n}) \tag{10}$$

• Variational document distribution over topics $\gamma_d$
    • Vector of length K for each document
    • Non-negative
    • Doesn’t sum to 1.0
• Variational token distribution over topic assignments $\phi_{d,n}$
    • Vector of length K for every token
    • Non-negative, sums to 1.0

Expectation of log Dirichlet

• Most expectations are straightforward to compute
• Dirichlet is harder: for $\theta$ drawn from a Dirichlet with parameter $\alpha$, the expected log of a component is

$$\mathbb{E}_{\mathrm{dir}}[\log \theta_i \mid \alpha] = \Psi(\alpha_i) - \Psi\left(\sum_j \alpha_j\right) \tag{11}$$

where $\Psi$ is the digamma function.
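Equation (11) can be checked against a Monte Carlo estimate. The sketch below uses only the Python standard library, so the digamma function is a hand-written approximation (recurrence plus a truncated asymptotic series) rather than a library call. The helper names and the choice of `alpha` are assumptions made for illustration.

```python
import math
import random

def digamma(x):
    """Approximate digamma for x > 0: shift x up with psi(x) = psi(x + 1) - 1/x,
    then apply the asymptotic series, which is accurate once x >= 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def expected_log_theta(alpha):
    """Right-hand side of (11) for every component i."""
    total = digamma(sum(alpha))
    return [digamma(a) - total for a in alpha]

rng = random.Random(0)
alpha = [2.0, 1.0, 0.5]
num_samples = 100_000
sums = [0.0] * len(alpha)
for _ in range(num_samples):
    g = [rng.gammavariate(a, 1.0) for a in alpha]   # normalized gammas give a Dirichlet draw
    s = sum(g)
    for i, v in enumerate(g):
        sums[i] += math.log(v / s)

monte_carlo = [s / num_samples for s in sums]
exact = expected_log_theta(alpha)
print(monte_carlo)
print(exact)
```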

Expectation 1

\begin{align*}
\mathbb{E}_q[\log p(\theta \mid \alpha)] &= \mathbb{E}_q \left[ \log \left( \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1} \right) \right] \tag{12} \\
&= \mathbb{E}_q \left[ \log \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} + \sum_i \log \theta_i^{\alpha_i - 1} \right] && \text{(log of products becomes sum of logs)} \\
&= \log \Gamma\left(\sum_i \alpha_i\right) - \sum_i \log \Gamma(\alpha_i) + \mathbb{E}_q \left[ \sum_i (\alpha_i - 1) \log \theta_i \right] && \text{(log of exponent becomes product, expectation of constant is constant)} \\
&= \log \Gamma\left(\sum_i \alpha_i\right) - \sum_i \log \Gamma(\alpha_i) + \sum_i (\alpha_i - 1) \left( \Psi(\gamma_i) - \Psi\left(\sum_j \gamma_j\right) \right) && \text{(expectation of log Dirichlet)}
\end{align*}

Expectation 2

\begin{align*}
\mathbb{E}_q[\log p(z \mid \theta)] &= \mathbb{E}_q \left[ \log \prod_n \prod_i \theta_i^{\mathbb{1}[z_n == i]} \right] \tag{13} \\
&= \mathbb{E}_q \left[ \sum_n \sum_i \log \theta_i^{\mathbb{1}[z_n == i]} \right] && \text{(products to sums)} \tag{14} \\
&= \sum_n \sum_i \mathbb{E}_q \left[ \log \theta_i^{\mathbb{1}[z_n == i]} \right] && \text{(linearity of expectation)} \tag{15} \\
&= \sum_n \sum_i \phi_{ni}\, \mathbb{E}_q[\log \theta_i] && \text{(independence of variational distribution, exponents become products)} \tag{16} \\
&= \sum_n \sum_i \phi_{ni} \left( \Psi(\gamma_i) - \Psi\left(\sum_j \gamma_j\right) \right) && \text{(expectation of log Dirichlet)} \tag{17}
\end{align*}

Expectation 3

\begin{align*}
\mathbb{E}_q[\log p(w \mid z, \beta)] &= \mathbb{E}_q \left[ \log \beta_{z_{d,n}, w_{d,n}} \right] \tag{18} \\
&= \mathbb{E}_q \left[ \log \prod_v^V \prod_i^K \beta_{i,v}^{\mathbb{1}[v = w_{d,n},\, z_{d,n} = i]} \right] \tag{19} \\
&= \sum_v^V \sum_i^K \mathbb{E}_q \left[ \mathbb{1}[v = w_{d,n},\, z_{d,n} = i] \right] \log \beta_{i,v} \tag{20} \\
&= \sum_v^V \sum_i^K \phi_{n,i}\, w_{d,n}^v \log \beta_{i,v} \tag{21}
\end{align*}

Entropies

Entropy of Dirichlet

$$H_q[\gamma] = -\log \Gamma\left(\sum_j \gamma_j\right) + \sum_i \log \Gamma(\gamma_i) - \sum_i (\gamma_i - 1) \left( \Psi(\gamma_i) - \Psi\left(\sum_{j=1}^{k} \gamma_j\right) \right)$$

Entropy of Multinomial

$$H_q[\phi_{d,n}] = -\sum_i \phi_{d,n,i} \log \phi_{d,n,i} \tag{22}$$

Complete objective function

Note the entropy terms at the end (negative sign)


Deriving the algorithm

• Compute the partial derivative with respect to the variable of interest
• Set equal to zero
• Solve for variable

Update for φ

Derivative of ELBO:

$$\frac{\partial L}{\partial \phi_{ni}} = \Psi(\gamma_i) - \Psi\left(\sum_j \gamma_j\right) + \log \beta_{i,v} - \log \phi_{ni} - 1 + \lambda \tag{23}$$

Solution:

$$\phi_{ni} \propto \beta_{iv} \exp\left( \Psi(\gamma_i) - \Psi\left(\sum_j \gamma_j\right) \right) \tag{24}$$
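Equation (24) translates into a short function. The sketch below is an illustration under stated assumptions: `update_phi` and the example values are hypothetical, `beta[i][word]` plays the role of $\beta_{iv}$ for the observed word v, and digamma is a hand-written standard-library approximation rather than a library call.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then asymptotic series for x >= 6)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def update_phi(gamma, beta, word):
    """phi_{n,i} proportional to beta[i][word] * exp(digamma(gamma_i) - digamma(sum_j gamma_j)),
    normalized over topics i so that the result sums to 1."""
    total = digamma(sum(gamma))
    unnorm = [beta[i][word] * math.exp(digamma(g) - total) for i, g in enumerate(gamma)]
    s = sum(unnorm)
    return [u / s for u in unnorm]

gamma = [2.0, 1.0]                        # assumed variational Dirichlet parameters
beta = [[0.7, 0.2, 0.1],                  # assumed topic-word probabilities, K = 2, V = 3
        [0.1, 0.2, 0.7]]
phi = update_phi(gamma, beta, word=0)
print(phi)
```

The normalization step takes the place of solving for the Lagrange multiplier λ in (23), since λ only enforces that φ sums to one over topics.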

Page 61: Variational Inference - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CMSC_726/17a.pdf · 2020-06-30 · Variational Inference — Inferring hidden variables — Unlike MCMC: Deterministic

Update for γ

Derivative of ELBO:

∂L∂ γi

=Ψ′ (γi) (αi +φn,i −γi)

−Ψ′�

j

γj

j

αj +∑

n

φnj −γj

Material adapted from David Blei | UMD Variational Inference | 24 / 29

Page 62: Variational Inference - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CMSC_726/17a.pdf · 2020-06-30 · Variational Inference — Inferring hidden variables — Unlike MCMC: Deterministic

Update for γ

Derivative of ELBO:

∂L∂ γi

=Ψ′ (γi) (αi +φn,i −γi)

−Ψ′�

j

γj

j

αj +∑

n

φnj −γj

Material adapted from David Blei | UMD Variational Inference | 24 / 29

Page 63: Variational Inference - UMIACSusers.umiacs.umd.edu/~jbg/teaching/CMSC_726/17a.pdf · 2020-06-30 · Variational Inference — Inferring hidden variables — Unlike MCMC: Deterministic

Update for γ

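Derivative of ELBO:

$$\frac{\partial \mathcal{L}}{\partial \gamma_i} = \Psi'(\gamma_i)\Big(\alpha_i + \sum_n \phi_{ni} - \gamma_i\Big) - \Psi'\Big(\sum_j \gamma_j\Big) \sum_j \Big(\alpha_j + \sum_n \phi_{nj} - \gamma_j\Big)$$

Solution:

$$\gamma_i = \alpha_i + \sum_n \phi_{ni} \qquad (25)$$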
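Update (25) just adds the expected topic counts to the prior. A minimal sketch, assuming `phi[n][i]` is the responsibility of topic `i` for word `n` of one document (the function name and the numbers are illustrative):

```python
def update_gamma(alpha, phi):
    """Equation (25): gamma_i = alpha_i + sum_n phi[n][i]."""
    return [a + sum(row[i] for row in phi) for i, a in enumerate(alpha)]

alpha = [0.1, 0.1]
phi = [[0.8, 0.2],      # word 1
       [0.6, 0.4],      # word 2
       [0.5, 0.5]]      # word 3
gamma = update_gamma(alpha, phi)
print(gamma)
```

Substituting this $\gamma$ back into the derivative makes both parenthesized terms zero, which is why it solves the stationarity condition.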

Update for β

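Slightly more complicated (requires a Lagrange multiplier), but the solution is obvious:

$$\beta_{ij} \propto \sum_d \sum_n \phi_{dni} \, w_{dn}^{j} \qquad (26)$$

Here $w_{dn}^{j}$ is 1 if word $n$ of document $d$ is vocabulary item $j$, and 0 otherwise.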
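Update (26) accumulates expected word counts per topic and normalizes each topic's row. A minimal sketch, assuming `docs[d][n]` is a vocabulary index and `phis[d][n][i]` the matching responsibility (names and data are illustrative):

```python
def update_beta(docs, phis, num_topics, vocab_size):
    """Equation (26): beta[i][j] proportional to sum_d sum_n phi[d][n][i] * [w_dn == j]."""
    counts = [[0.0] * vocab_size for _ in range(num_topics)]
    for doc, phi in zip(docs, phis):
        for word, phi_n in zip(doc, phi):
            for i in range(num_topics):
                counts[i][word] += phi_n[i]     # the indicator picks out column `word`
    beta = []
    for row in counts:
        total = sum(row)                        # assumes every topic received some mass
        beta.append([c / total for c in row])
    return beta

docs = [[0, 1], [1, 2]]                          # two tiny documents, vocabulary of size 3
phis = [[[1.0, 0.0], [0.5, 0.5]],
        [[0.5, 0.5], [0.0, 1.0]]]
beta = update_beta(docs, phis, num_topics=2, vocab_size=3)
print(beta)
```

The row normalization is what the Lagrange multiplier enforces: each topic must be a distribution over the vocabulary.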

Overall Algorithm

1. Randomly initialize variational parameters (can't be uniform)

2. For each iteration:
2.1 For each document, update γ and φ
2.2 For the corpus, update β
2.3 Compute L for diagnostics

3. Return the expectation of the variational parameters as the solution for the latent variables
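The steps above can be sketched end to end on a toy corpus. This is a minimal illustration under stated assumptions, not a reference implementation: symmetry is broken by randomly initializing β (with γ started uniformly), the number of inner sweeps is fixed at 10, and all names and the tiny corpus are hypothetical.

```python
import math
import random

def digamma(x):
    """Psi(x) for x > 0 via recurrence and an asymptotic series (stdlib only)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def document_bound(doc, alpha, beta, gamma, phi):
    """Per-document bound L, with both entropy terms folded in."""
    psi_total = digamma(sum(gamma))
    e_log_theta = [digamma(g) - psi_total for g in gamma]
    bound = math.lgamma(sum(alpha)) - math.lgamma(sum(gamma))
    for a, g, e in zip(alpha, gamma, e_log_theta):
        bound += math.lgamma(g) - math.lgamma(a) + (a - g) * e
    for word, phi_n in zip(doc, phi):
        for i, p in enumerate(phi_n):
            if p > 0.0:
                bound += p * (e_log_theta[i] + math.log(beta[i][word]) - math.log(p))
    return bound

def fit_lda(docs, num_topics, vocab_size, alpha=0.5, iterations=15, seed=0):
    rng = random.Random(seed)                       # step 1: seeded, non-uniform start
    alphas = [alpha] * num_topics
    beta = []
    for _ in range(num_topics):
        row = [rng.random() + 0.1 for _ in range(vocab_size)]
        total = sum(row)
        beta.append([r / total for r in row])
    gammas = [[alpha + len(doc) / num_topics] * num_topics for doc in docs]
    history = []
    for _ in range(iterations):                     # step 2
        phis = []
        for d, doc in enumerate(docs):              # step 2.1: per-document phi and gamma
            gamma = gammas[d]
            for sweep in range(10):
                psi_total = digamma(sum(gamma))
                weights = [math.exp(digamma(g) - psi_total) for g in gamma]
                phi = []
                for word in doc:
                    unnorm = [beta[i][word] * weights[i] for i in range(num_topics)]
                    z = sum(unnorm)
                    phi.append([u / z for u in unnorm])
                gamma = [alphas[i] + sum(p[i] for p in phi) for i in range(num_topics)]
            gammas[d] = gamma
            phis.append(phi)
        counts = [[0.0] * vocab_size for _ in range(num_topics)]
        for doc, phi in zip(docs, phis):            # step 2.2: corpus-level beta
            for word, phi_n in zip(doc, phi):
                for i in range(num_topics):
                    counts[i][word] += phi_n[i]
        beta = [[c / sum(row) for c in row] for row in counts]
        history.append(sum(document_bound(doc, alphas, beta, g, p)   # step 2.3
                           for doc, g, p in zip(docs, gammas, phis)))
    return beta, gammas, history

docs = [[0, 0, 1, 1, 0], [1, 0, 0, 1], [2, 3, 3, 2], [3, 2, 2, 3, 3]]
beta, gammas, history = fit_lda(docs, num_topics=2, vocab_size=4)
theta = [[g / sum(gamma) for g in gamma] for gamma in gammas]   # step 3: E_q[theta_d]
print(history[0], history[-1])
```

Each update is an exact coordinate maximization, so the recorded bound should never decrease from one iteration to the next; that property is what makes it useful as a diagnostic.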

Relationship with Gibbs Sampling

• Gibbs sampling: sample each variable from its conditional distribution given all other variables

• Variational inference: each factor is set to the exponentiated expected log of the conditional

• Variational is easier to parallelize; Gibbs is faster per step

• Gibbs is typically easier to implement
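The contrast between the two kinds of step can be sketched for a single discrete variable. This is an illustrative toy, with hypothetical function names and a made-up conditional; a real model would compute the conditional from the other variables.

```python
import math
import random

def gibbs_step(cond_probs, rng):
    """Gibbs: draw one value for the variable from its conditional distribution."""
    return rng.choices(range(len(cond_probs)), weights=cond_probs, k=1)[0]

def mean_field_step(expected_log_cond):
    """Variational: set the factor proportional to exp(E[log conditional])."""
    shift = max(expected_log_cond)                  # subtract the max for numerical stability
    unnorm = [math.exp(v - shift) for v in expected_log_cond]
    z = sum(unnorm)
    return [u / z for u in unnorm]

cond = [0.7, 0.2, 0.1]                              # made-up conditional over 3 values
rng = random.Random(0)
samples = [gibbs_step(cond, rng) for _ in range(5)] # stochastic: a sequence of draws
q = mean_field_step([math.log(p) for p in cond])    # deterministic: a full distribution
print(samples, q)
```

With the other variables held fixed, the expected log is just the log, so the variational factor equals the conditional that Gibbs samples from. The two differ once the expectation is taken over uncertain neighbours.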

Implementation Tips

• Match derivation exactly at first

• Randomize initialization, but specify seed

• Use simple languages first . . . then match implementation

• Try to match variables with paper

• Write unit tests for each atomic update

• Monitor variational bound (with asserts)

• Write the state (checkpointing and debugging)

• Visualize variational parameters

• Cache / memoize gamma / digamma functions
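Three of these tips (seeded initialization, asserting on the bound, and memoizing digamma) can be sketched with the standard library alone. The helper names and tolerance are assumptions for illustration.

```python
import functools
import math
import random

@functools.lru_cache(maxsize=None)
def digamma(x):
    """Memoized Psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def check_bound(history, tol=1e-8):
    """Assert that the variational bound never decreases by more than tol."""
    for previous, current in zip(history, history[1:]):
        assert current >= previous - tol, f"bound decreased: {previous} -> {current}"

rng = random.Random(42)                       # fixed seed: random but reproducible
init = [rng.random() for _ in range(3)]

check_bound([-10.0, -8.5, -8.4999])           # passes silently
print(init, digamma(2.5), digamma.cache_info())
```

Memoization pays off mainly when the same arguments recur, for example integer-valued counts; for continuously varying γ the cache may rarely hit.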

Next class

• Example on a toy LDA problem

• Current research in variational inference
