Approximate Inference: Variational Inference
CMSC 678, UMBC
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Recap from last time…
Graphical Models
$p(x_1, x_2, x_3, \dots, x_N) = \prod_i p(x_i \mid \pi(x_i))$
Directed Models (Bayesian networks)
Undirected Models (Markov random fields)
$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$
Markov Blanket
The Markov blanket of a node x is its parents, children, and children's parents.
$p(x_i \mid x_{-i}) = \dfrac{p(x_1, \dots, x_N)}{\int p(x_1, \dots, x_N)\, dx_i}$

$= \dfrac{\prod_j p(x_j \mid \pi(x_j))}{\int \prod_j p(x_j \mid \pi(x_j))\, dx_i}$   (factorization of the graph)

$= \dfrac{\prod_{j:\, j = i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))}{\int \prod_{j:\, j = i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))\, dx_i}$   (factor out terms that do not depend on $x_i$)
The Markov blanket is the set of nodes needed to form the complete conditional for a variable $x_i$.
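The blanket is easy to read off a DAG programmatically. A minimal Python sketch (the toy graph and the helper name are ours, for illustration):

    # Toy Bayesian network given as parent lists: a -> x, and x, b -> c.
    parents = {"a": [], "b": [], "x": ["a"], "c": ["x", "b"]}

    def markov_blanket(node):
        children = [n for n, ps in parents.items() if node in ps]
        coparents = {p for ch in children for p in parents[ch] if p != node}
        return set(parents[node]) | set(children) | coparents

    print(markov_blanket("x"))   # {'a', 'c', 'b'}: parent, child, child's other parent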
Markov Random Fields with Factor Graph Notation
[Figure: a grid MRF for image denoising, drawn as a factor graph]
• x: original pixel/state; y: observed (noisy) pixel/state
• factor nodes are added according to maximal cliques (unary and binary factors)
• factor graphs are bipartite: variable nodes connect only to factor nodes
Two Problems for Undirected Models
Finding the normalizer
$Z = \sum_x \prod_m \psi_m(x_m)$
Computing the marginals
$p_n(v) = \sum_{x:\, x_n = v} \prod_m \psi_m(x_m)$

Q: Why are these difficult?
A: Both quantities range over exponentially many variable combinations.
Sum over all variable combinations, with the $x_n$ coordinate fixed:
$p_2(v) = \sum_{x_1} \sum_{x_3} \prod_m \psi_m(x = (x_1, v, x_3))$

Example: 3 variables, fix the 2nd dimension.
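To see the blow-up concretely, here is a brute-force computation of $Z$ and of a marginal for this 3-variable chain; the potentials are made-up toy values, and the loop visits every joint configuration:

    import itertools
    import numpy as np

    # Toy pairwise potentials over three binary variables: psi12(x1,x2), psi23(x2,x3).
    psi12 = np.array([[1.0, 0.5], [0.5, 2.0]])
    psi23 = np.array([[1.5, 1.0], [0.2, 1.0]])

    def unnorm(x1, x2, x3):                    # product of factors, unnormalized
        return psi12[x1, x2] * psi23[x2, x3]

    # Normalizer: sum over all 2^3 joint configurations (this is the hard part;
    # with N variables the loop has 2^N iterations).
    Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=3))

    def p2(v):                                 # marginal: fix the 2nd coordinate
        return sum(unnorm(x1, v, x3) for x1 in (0, 1) for x3 in (0, 1)) / Z

    print(Z, p2(0), p2(1))                     # p2(0) + p2(1) == 1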
Belief Propagation Algorithms
• sum-product (forward-backward in HMMs)
• max-product/max-sum (Viterbi)
Sum-Product

From variables to factors
$\mu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)$
From factors to variables
$\mu_{m \to n}(x_n) = \sum_{x_m \setminus x_n} \psi_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})$
• $x_m$: the set of variables that the $m$-th factor depends on
• $M(n)$: the set of factors in which variable $n$ participates
• the factor-to-variable message sums over configurations of the $m$-th factor's variables, with variable $n$ held fixed
• an empty product defaults to a value of 1
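These messages can be run directly on the 3-variable chain from the brute-force example above. A minimal sketch (same toy potentials; the leaf variables x1 and x3 send empty-product messages of 1):

    import numpy as np

    # Same toy chain as before: factors psi12(x1,x2) and psi23(x2,x3).
    psi12 = np.array([[1.0, 0.5], [0.5, 2.0]])
    psi23 = np.array([[1.5, 1.0], [0.2, 1.0]])

    # Leaf variables x1 and x3 touch no other factors, so their variable-to-factor
    # messages are empty products: the default value of 1.
    mu_x1_to_f12 = np.ones(2)
    mu_x3_to_f23 = np.ones(2)

    # Factor-to-variable messages into x2: sum out the factor's other variable.
    mu_f12_to_x2 = psi12.T @ mu_x1_to_f12   # sum_{x1} psi12(x1, x2) * mu(x1)
    mu_f23_to_x2 = psi23 @ mu_x3_to_f23     # sum_{x3} psi23(x2, x3) * mu(x3)

    # Belief at x2: product of incoming messages, then normalize.
    b2 = mu_f12_to_x2 * mu_f23_to_x2
    b2 /= b2.sum()
    print(b2)   # matches p_2 from the brute-force sketch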
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference
Hyperparameters $\alpha$; unknown parameters $\Theta$; data $\mathcal{D}$
Likelihood model: $p(\mathcal{D} \mid \Theta)$
Posterior: $p_\alpha(\Theta \mid \mathcal{D})$
we're going to be Bayesian (perform Bayesian inference)
Posterior Classification vs. Posterior Inference

"Frequentist" methods put a prior over labels (maybe), not weights: $p_{\alpha,w}(y \mid \mathcal{D})$
Bayesian methods include the weight parameters in $\Theta$: $p_\alpha(\Theta \mid \mathcal{D})$
(Some) Learning Techniques
• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Exponential Family Form

$p(x \mid \eta) = h(x)\, \exp\!\big(\eta^\top T(x) - A(\eta)\big)$

• $h(x)$: support function; formally necessary, in practice irrelevant
• $\eta$: distribution parameters; natural parameters, i.e., feature weights
• $T(x)$: feature function(s); the sufficient statistics
• $A(\eta)$: log-normalizer
Why? Capture Common Distributions
Discrete (Finite distributions)
Why? Capture Common Distributions
• Gaussian
https://kanbanize.com/blog/wp-content/uploads/2014/07/Standard_deviation_diagram.png
Why? Capture Common Distributions
Dirichlet (Distributions over (finite) distributions)
Why? Capture Common Distributions
Discrete (Finite distributions)
Dirichlet (Distributions over (finite) distributions)
Gaussian
Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …
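For concreteness, here is the Bernoulli written in this form; this is the standard derivation, not taken from the slides:

$p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = \exp\!\big(x \log\tfrac{\mu}{1-\mu} + \log(1-\mu)\big)$

so $h(x) = 1$, $T(x) = x$, $\eta = \log\frac{\mu}{1-\mu}$, and $A(\eta) = -\log(1-\mu) = \log(1 + e^{\eta})$.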
Why? "Easy" Gradients

$\nabla_\eta \log p(x \mid \eta) = T(x) - \nabla_\eta A(\eta)$: observed feature counts (w.r.t. the empirical distribution) minus expected feature counts (w.r.t. the current model parameters).
(We've already seen this with maxent models.)
Why? "Easy" Expectations

$\mathbb{E}_{p(x \mid \eta)}[T(x)] = \nabla_\eta A(\eta)$: the expectation of the sufficient statistics is the gradient of the log-normalizer.
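Both facts are easy to check numerically for the Bernoulli above, where $A(\eta) = \log(1 + e^\eta)$ and $\mathbb{E}[T(x)] = \mathbb{E}[x] = \sigma(\eta)$. A small sketch:

    import numpy as np

    def A(eta):                  # Bernoulli log-normalizer A(eta) = log(1 + e^eta)
        return np.log1p(np.exp(eta))

    eta = 0.7
    eps = 1e-6
    grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)  # numerical dA/deta
    mean_T = 1.0 / (1.0 + np.exp(-eta))                 # E[T(x)] = E[x] = sigmoid(eta)
    print(grad_A, mean_T)        # the two agree, up to finite-difference error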
Why? "Easy" Posterior Inference

• $p$ is the conjugate prior for $q$
• the posterior $p$ has the same form as the prior $p$
• all exponential family models have a conjugate prior (in theory)

Posterior         | Likelihood           | Prior
Dirichlet (Beta)  | Discrete (Bernoulli) | Dirichlet (Beta)
Normal            | Normal (fixed var.)  | Normal
Gamma             | Exponential          | Gamma
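The first row of the table is the classic Beta-Bernoulli case: the posterior is a Beta whose parameters are the prior's plus the observed counts. A minimal sketch with made-up data:

    # Beta-Bernoulli conjugacy (first table row): prior Beta(a, b) plus Bernoulli
    # observations gives posterior Beta(a + #heads, b + #tails): the same family.
    a, b = 2.0, 2.0                       # made-up prior pseudo-counts
    data = [1, 0, 1, 1, 0, 1]             # made-up coin flips
    heads = sum(data)
    tails = len(data) - heads
    a_post, b_post = a + heads, b + tails
    print(a_post, b_post)                 # Beta(6.0, 4.0)
    print(a_post / (a_post + b_post))     # posterior mean of the coin's bias: 0.6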
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference
Hyperparameters $\alpha$; unknown parameters $\Theta$; data $\mathcal{D}$
Likelihood model: $p(\mathcal{D} \mid \Theta)$
Posterior: $p_\alpha(\Theta \mid \mathcal{D})$
(Some) Learning Techniques
• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Variational Inference

• $p(\theta \mid x)$: difficult to compute
• $q(\theta)$: easy(ier) to compute, controlled by parameters $\lambda$
• minimize the "difference" between them by changing $\lambda$
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
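The loop is ordinary gradient-based optimization; only the objective $F$ is special. A generic Python sketch of steps 1-5, where $F$, its gradient, and the step-size schedule are all placeholder assumptions:

    import numpy as np

    def optimize(F, grad_F, lam0, steps=100):
        # Generic loop from the slide. F and grad_F are placeholders; for VI,
        # F would be built from KL[q(.; lam) || p(.)], and since KL is being
        # minimized you would negate g (the slide writes the ascent form).
        lam = np.asarray(lam0, dtype=float)
        for t in range(steps):
            y = F(lam)                # 1. get value y_t
            g = grad_F(lam)           # 2. get gradient g_t
            rho = 1.0 / (1.0 + t)     # 3. scaling factor rho_t (toy schedule)
            lam = lam + rho * g       # 4. lam_{t+1} = lam_t + rho_t * g_t
                                      # 5. t advances via the loop
        return lam

    # Toy usage: maximize F(lam) = -(lam - 3)^2, whose optimum is lam = 3.
    print(optimize(lambda l: -(l - 3)**2, lambda l: -2 * (l - 3), lam0=0.0))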
Variational Inference: The Function to Optimize

• $p(\theta \mid x)$: the posterior of the desired model, with the desired model's parameters
• $q(\theta; \lambda)$: any easy-to-compute distribution, with variational parameters $\lambda$ for $\theta$
• find the best distribution $q$ (calculus of variations)
KL-Divergence (an expectation)

$D_{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big) = \mathbb{E}_{q(\theta)}\Big[\log \dfrac{q(\theta)}{p(\theta \mid x)}\Big]$
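For two discrete distributions this expectation is a one-liner; a small sketch with toy values:

    import numpy as np

    q = np.array([0.5, 0.3, 0.2])    # toy variational distribution q(theta)
    p = np.array([0.4, 0.4, 0.2])    # toy target p(theta | x)
    kl = np.sum(q * np.log(q / p))   # D_KL(q || p) = E_q[log q - log p]
    print(kl)                        # always >= 0, and 0 only when q == p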
Variational Inference

Find the best distribution: minimize $D_{KL}\big(q(\theta; \lambda)\,\|\,p(\theta \mid x)\big)$ over the variational parameters $\lambda$ for $\theta$, holding the desired model's parameters fixed.
Exponential Family Recap: "Easy" Expectations: $\mathbb{E}_{p(x \mid \eta)}[T(x)] = \nabla_\eta A(\eta)$
Exponential Family Recap: "Easy" Posterior Inference: $p$ is the conjugate prior for $q$
Variational Inference
Find the best distribution
When p and q are the same exponential family form, the variational update q(ΞΈ) is (often) computable (in closed form)
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$; let $F(q(\cdot; \lambda_t)) = KL[q(\cdot; \lambda_t) \,\|\, p(\cdot)]$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
Variational Inference: Maximization or Minimization?
Evidence Lower Bound (ELBO)

$\log p(x) = \log \int p(x, \theta)\, d\theta$
$\qquad = \log \int p(x, \theta)\, \dfrac{q(\theta)}{q(\theta)}\, d\theta$
$\qquad = \log \mathbb{E}_q\Big[\dfrac{p(x, \theta)}{q(\theta)}\Big]$
$\qquad \ge \mathbb{E}_q[\log p(x, \theta)] - \mathbb{E}_q[\log q(\theta)] = \mathcal{L}(q)$   (by Jensen's inequality)

Since $\log p(x) = \mathcal{L}(q) + D_{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big)$ and $\log p(x)$ is fixed, maximizing the ELBO is the same as minimizing the KL divergence.
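In practice both expectations in $\mathcal{L}(q)$ can be estimated by sampling from $q$. A sketch for a made-up toy model (Gaussian prior and likelihood, Gaussian $q$); all numbers are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = 1.3                                    # one made-up observation

    def log_joint(theta):
        # log p(x, theta) for a toy model: theta ~ N(0,1), x | theta ~ N(theta, 1),
        # with the additive -0.5*log(2*pi) constants dropped.
        return -0.5 * theta**2 - 0.5 * (x - theta)**2

    mu_q, sd_q = 0.5, 0.8                      # variational parameters lambda
    theta = rng.normal(mu_q, sd_q, size=100_000)              # samples from q
    log_q = -0.5 * ((theta - mu_q) / sd_q)**2 - np.log(sd_q)  # log q, constants dropped

    elbo = np.mean(log_joint(theta) - log_q)   # E_q[log p(x,theta)] - E_q[log q(theta)]
    print(elbo)   # lower-bounds log p(x), up to the constants dropped above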
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Bag-of-Items Models
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …
p( ) Three: 1,people: 2,attack: 2,
β¦p( )=Unigram counts
Bag-of-Items Models

The same unigram counts, but now p carries two sets of parameters: global (corpus-level) parameters interact with local (document-level) parameters.
Latent Dirichlet Allocation (Blei et al., 2003)

• Per-document (unigram) word counts: entry $(i, j)$ is the count of word $j$ in document $i$
• The count matrix decomposes into per-document (latent) topic usage and per-topic word usage, with $K$ topics
• Word counts ~ Multinomial; per-document topic usage ~ Dirichlet; per-topic word usage ~ Dirichlet (regularize/place priors)
Variational Inference: LDirA

[Figure: per-document word counts decompose into topic usage and topic words]

p: True model
$\phi_k \sim \text{Dirichlet}(\beta)$;  $w^{(i,j)} \sim \text{Discrete}(\phi_{z^{(i,j)}})$
$\theta^{(i)} \sim \text{Dirichlet}(\alpha)$;  $z^{(i,j)} \sim \text{Discrete}(\theta^{(i)})$
Variational Inference: LDirA

q: Mean-field approximation (p: true model as above; every latent variable gets its own free variational parameter)
$\phi_k \sim \text{Dirichlet}(\lambda_k)$
$\theta^{(i)} \sim \text{Dirichlet}(\gamma^{(i)})$;  $z^{(i,j)} \sim \text{Discrete}(\nu^{(i,j)})$
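To make the two columns concrete, here is the generative side $p$ as code, plus the shapes of the mean-field parameters; the sizes and hyperparameter values are made up, and the names follow the family above:

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D, N = 3, 20, 5, 30          # topics, vocab size, docs, words per doc
    alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters

    # p: the true (generative) model
    phi = rng.dirichlet(beta * np.ones(V), size=K)     # phi_k ~ Dirichlet(beta)
    docs = []
    for i in range(D):
        theta_i = rng.dirichlet(alpha * np.ones(K))    # theta^(i) ~ Dirichlet(alpha)
        z = rng.choice(K, size=N, p=theta_i)           # z^(i,j) ~ Discrete(theta^(i))
        docs.append([rng.choice(V, p=phi[k]) for k in z])  # w^(i,j) ~ Discrete(phi_z)

    # q: the mean-field approximation, one free parameter per latent variable
    lam = np.ones((K, V))              # Dirichlet parameters for each phi_k
    gamma = np.ones((D, K))            # Dirichlet parameters for each theta^(i)
    nu = np.full((D, N, K), 1.0 / K)   # Discrete parameters for each z^(i,j)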
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$; let $F(q(\cdot; \lambda_t)) = KL[q(\cdot; \lambda_t) \,\|\, p(\cdot)]$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
Variational Inference: LDirA

$\mathbb{E}_{q(\theta^{(i)})}\big[\log p(\theta^{(i)} \mid \alpha)\big] = \mathbb{E}_{q(\theta^{(i)})}\big[(\alpha - 1)^\top \log \theta^{(i)} + c\big]$
($c$ is the Dirichlet log-normalizer term, constant in $\theta^{(i)}$)

exponential family form of the Dirichlet:
$p(\theta) = \dfrac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
params $= (\alpha_k - 1)_k$;  suff. stats. $= (\log \theta_k)_k$
The outer expectation is the expectation of those sufficient statistics under the q distribution, itself a Dirichlet: params $= (\gamma_k - 1)_k$;  suff. stats. $= (\log \theta_k)_k$.
Pulling the constants out of the expectation:

$\mathbb{E}_{q(\theta^{(i)})}\big[\log p(\theta^{(i)} \mid \alpha)\big] = (\alpha - 1)^\top\, \mathbb{E}_{q(\theta^{(i)})}\big[\log \theta^{(i)}\big] + c$

and the expectation of the sufficient statistics is the gradient of the log normalizer:
$\qquad = (\alpha - 1)^\top\, \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + c$
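For the Dirichlet this gradient has a closed form via the digamma function $\psi$: $\mathbb{E}_q[\log \theta_k] = \psi(\gamma_k) - \psi(\sum_{k'} \gamma_{k'})$, a standard identity. A quick numerical check with a toy $\gamma$:

    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)
    gamma = np.array([2.0, 5.0, 1.5])                 # toy variational parameters

    analytic = digamma(gamma) - digamma(gamma.sum())  # grad of Dirichlet log-normalizer
    samples = rng.dirichlet(gamma, size=200_000)      # theta ~ Dirichlet(gamma)
    monte_carlo = np.log(samples).mean(axis=0)        # E_q[log theta_k], by sampling
    print(analytic)
    print(monte_carlo)                                # agree to a few decimal places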
Variational Inference: LDirA

$\mathbb{E}_{q(\theta^{(i)})}\big[\log p(\theta^{(i)} \mid \alpha)\big] = (\alpha - 1)^\top \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + c$

$\mathcal{L}(\gamma^{(i)}) = (\alpha - 1)^\top \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + f(\gamma^{(i)})$, where $f(\gamma^{(i)})$ collects the remaining ELBO terms (there's more math to do!)
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$; let $F(q(\cdot; \lambda_t)) = KL[q(\cdot; \lambda_t) \,\|\, p(\cdot)]$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
Variational Inference: LDirA

$\mathcal{L}(\gamma^{(i)}) = (\alpha - 1)^\top \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + f(\gamma^{(i)})$

$\nabla_{\gamma^{(i)}} \mathcal{L}(\gamma^{(i)}) = (\alpha - 1)^\top \nabla^2_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + \nabla_{\gamma^{(i)}} f(\gamma^{(i)})$
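Carrying that math through yields closed-form coordinate-ascent updates rather than raw gradient steps; this is how Blei et al. (2003) actually fit LDA. A sketch of the per-document update under the mean-field family above (the function name and initialization are ours; parameter names follow the slides' $\lambda, \gamma, \nu$):

    import numpy as np
    from scipy.special import digamma

    def update_document(w, alpha, log_phi, iters=20):
        # One document's coordinate-ascent updates, in the style of Blei et al. (2003).
        # w: array of word ids; alpha: (K,) Dirichlet prior;
        # log_phi: (K, V) expected log topic-word probabilities (derived from lambda).
        N, K = len(w), len(alpha)
        gamma = alpha + float(N) / K               # initialize doc-topic parameters
        for _ in range(iters):
            # nu_{jk} proportional to exp(log phi_{k, w_j} + E_q[log theta_k])
            log_nu = log_phi[:, w].T + digamma(gamma) - digamma(gamma.sum())
            nu = np.exp(log_nu - log_nu.max(axis=1, keepdims=True))
            nu /= nu.sum(axis=1, keepdims=True)
            gamma = alpha + nu.sum(axis=0)         # gamma = alpha + sum_j nu^(j)
        return gamma, nu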