Approximate Inference: Variational Inference
CMSC 678, UMBC
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Recap from last time…
Graphical Models
$p(x_1, x_2, x_3, \dots, x_N) = \prod_i p(x_i \mid \pi(x_i))$
Directed Models (Bayesian networks)
Undirected Models (Markov random fields)
$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$
Markov Blanket
The Markov blanket of a node x is its parents, children, and children's parents.
$p(x_i \mid x_{-i}) = \dfrac{p(x_1, \dots, x_N)}{\int p(x_1, \dots, x_N)\, dx_i}$

$= \dfrac{\prod_j p(x_j \mid \pi(x_j))}{\int \prod_j p(x_j \mid \pi(x_j))\, dx_i}$   (factorization of the graph)

$= \dfrac{\prod_{j:\, j = i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))}{\int \prod_{j:\, j = i \text{ or } i \in \pi(x_j)} p(x_j \mid \pi(x_j))\, dx_i}$   (factor out terms that do not depend on $x_i$)
The Markov blanket is the set of nodes needed to form the complete conditional for a variable $x_i$.
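The blanket is easy to read off a DAG programmatically. A minimal Python sketch (the toy graph and the helper name are ours, for illustration):

    # Toy Bayesian network given as parent lists: a -> x, and x, b -> c.
    parents = {"a": [], "b": [], "x": ["a"], "c": ["x", "b"]}

    def markov_blanket(node):
        children = [n for n, ps in parents.items() if node in ps]
        coparents = {p for ch in children for p in parents[ch] if p != node}
        return set(parents[node]) | set(children) | coparents

    print(markov_blanket("x"))   # {'a', 'c', 'b'}: parent, child, child's other parent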
Markov Random Fields with Factor Graph Notation
[Figure: a grid MRF for image denoising, drawn as a factor graph]
• x: original pixel/state; y: observed (noisy) pixel/state
• factor nodes are added according to maximal cliques (unary and binary factors)
• factor graphs are bipartite: variable nodes connect only to factor nodes
Two Problems for Undirected Models
Finding the normalizer
$Z = \sum_x \prod_m \psi_m(x_m)$
Computing the marginals
$p_n(v) = \sum_{x:\, x_n = v} \prod_m \psi_m(x_m)$

Q: Why are these difficult?
A: Both quantities range over exponentially many variable combinations.
Sum over all variable combinations, with the $x_n$ coordinate fixed:
$p_2(v) = \sum_{x_1} \sum_{x_3} \prod_m \psi_m(x = (x_1, v, x_3))$

Example: 3 variables, fix the 2nd dimension.
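To see the blow-up concretely, here is a brute-force computation of $Z$ and of a marginal for this 3-variable chain; the potentials are made-up toy values, and the loop visits every joint configuration:

    import itertools
    import numpy as np

    # Toy pairwise potentials over three binary variables: psi12(x1,x2), psi23(x2,x3).
    psi12 = np.array([[1.0, 0.5], [0.5, 2.0]])
    psi23 = np.array([[1.5, 1.0], [0.2, 1.0]])

    def unnorm(x1, x2, x3):                    # product of factors, unnormalized
        return psi12[x1, x2] * psi23[x2, x3]

    # Normalizer: sum over all 2^3 joint configurations (this is the hard part;
    # with N variables the loop has 2^N iterations).
    Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=3))

    def p2(v):                                 # marginal: fix the 2nd coordinate
        return sum(unnorm(x1, v, x3) for x1 in (0, 1) for x3 in (0, 1)) / Z

    print(Z, p2(0), p2(1))                     # p2(0) + p2(1) == 1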
Belief Propagation Algorithms
• sum-product (forward-backward in HMMs)
• max-product/max-sum (Viterbi)
Sum-Product

From variables to factors
$\mu_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} \mu_{m' \to n}(x_n)$
From factors to variables
$\mu_{m \to n}(x_n) = \sum_{x_m \setminus x_n} \psi_m(x_m) \prod_{n' \in N(m) \setminus n} \mu_{n' \to m}(x_{n'})$
• $x_m$: the set of variables that the $m$-th factor depends on
• $M(n)$: the set of factors in which variable $n$ participates
• the factor-to-variable message sums over configurations of the $m$-th factor's variables, with variable $n$ held fixed
• an empty product defaults to a value of 1
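These messages can be run directly on the 3-variable chain from the brute-force example above. A minimal sketch (same toy potentials; the leaf variables x1 and x3 send empty-product messages of 1):

    import numpy as np

    # Same toy chain as before: factors psi12(x1,x2) and psi23(x2,x3).
    psi12 = np.array([[1.0, 0.5], [0.5, 2.0]])
    psi23 = np.array([[1.5, 1.0], [0.2, 1.0]])

    # Leaf variables x1 and x3 touch no other factors, so their variable-to-factor
    # messages are empty products: the default value of 1.
    mu_x1_to_f12 = np.ones(2)
    mu_x3_to_f23 = np.ones(2)

    # Factor-to-variable messages into x2: sum out the factor's other variable.
    mu_f12_to_x2 = psi12.T @ mu_x1_to_f12   # sum_{x1} psi12(x1, x2) * mu(x1)
    mu_f23_to_x2 = psi23 @ mu_x3_to_f23     # sum_{x3} psi23(x2, x3) * mu(x3)

    # Belief at x2: product of incoming messages, then normalize.
    b2 = mu_f12_to_x2 * mu_f23_to_x2
    b2 /= b2.sum()
    print(b2)   # matches p_2 from the brute-force sketch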
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference
Hyperparameters $\alpha$; unknown parameters $\Theta$; data $\mathcal{D}$
Likelihood model: $p(\mathcal{D} \mid \Theta)$
Posterior: $p_\alpha(\Theta \mid \mathcal{D})$
we're going to be Bayesian (perform Bayesian inference)
Posterior Classification vs. Posterior Inference

"Frequentist" methods put a prior over labels (maybe), not weights: $p_{\alpha,w}(y \mid \mathcal{D})$
Bayesian methods include the weight parameters in $\Theta$: $p_\alpha(\Theta \mid \mathcal{D})$
(Some) Learning Techniques
• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Exponential Family Form

$p(x \mid \eta) = h(x)\, \exp\!\big(\eta^\top T(x) - A(\eta)\big)$

• $h(x)$: support function; formally necessary, in practice irrelevant
• $\eta$: distribution parameters; natural parameters, i.e., feature weights
• $T(x)$: feature function(s); the sufficient statistics
• $A(\eta)$: log-normalizer
Why? Capture Common Distributions
Discrete (Finite distributions)
Why? Capture Common Distributions
• Gaussian
https://kanbanize.com/blog/wp-content/uploads/2014/07/Standard_deviation_diagram.png
Why? Capture Common Distributions
Dirichlet (Distributions over (finite) distributions)
Why? Capture Common Distributions
Discrete (Finite distributions)
Dirichlet (Distributions over (finite) distributions)
Gaussian
Gamma, Exponential, Poisson, Negative-Binomial, Laplace, log-Normal, …
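For concreteness, here is the Bernoulli written in this form; this is the standard derivation, not taken from the slides:

$p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = \exp\!\big(x \log\tfrac{\mu}{1-\mu} + \log(1-\mu)\big)$

so $h(x) = 1$, $T(x) = x$, $\eta = \log\frac{\mu}{1-\mu}$, and $A(\eta) = -\log(1-\mu) = \log(1 + e^{\eta})$.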
Why? "Easy" Gradients

$\nabla_\eta \log p(x \mid \eta) = T(x) - \nabla_\eta A(\eta)$: observed feature counts (w.r.t. the empirical distribution) minus expected feature counts (w.r.t. the current model parameters).
(We've already seen this with maxent models.)
Why? "Easy" Expectations

$\mathbb{E}_{p(x \mid \eta)}[T(x)] = \nabla_\eta A(\eta)$: the expectation of the sufficient statistics is the gradient of the log-normalizer.
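Both facts are easy to check numerically for the Bernoulli above, where $A(\eta) = \log(1 + e^\eta)$ and $\mathbb{E}[T(x)] = \mathbb{E}[x] = \sigma(\eta)$. A small sketch:

    import numpy as np

    def A(eta):                  # Bernoulli log-normalizer A(eta) = log(1 + e^eta)
        return np.log1p(np.exp(eta))

    eta = 0.7
    eps = 1e-6
    grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)  # numerical dA/deta
    mean_T = 1.0 / (1.0 + np.exp(-eta))                 # E[T(x)] = E[x] = sigmoid(eta)
    print(grad_A, mean_T)        # the two agree, up to finite-difference error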
Why? "Easy" Posterior Inference

• $p$ is the conjugate prior for $q$
• the posterior $p$ has the same form as the prior $p$
• all exponential family models have a conjugate prior (in theory)

Posterior         | Likelihood           | Prior
Dirichlet (Beta)  | Discrete (Bernoulli) | Dirichlet (Beta)
Normal            | Normal (fixed var.)  | Normal
Gamma             | Exponential          | Gamma
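The first row of the table is the classic Beta-Bernoulli case: the posterior is a Beta whose parameters are the prior's plus the observed counts. A minimal sketch with made-up data:

    # Beta-Bernoulli conjugacy (first table row): prior Beta(a, b) plus Bernoulli
    # observations gives posterior Beta(a + #heads, b + #tails): the same family.
    a, b = 2.0, 2.0                       # made-up prior pseudo-counts
    data = [1, 0, 1, 1, 0, 1]             # made-up coin flips
    heads = sum(data)
    tails = len(data) - heads
    a_post, b_post = a + heads, b + tails
    print(a_post, b_post)                 # Beta(6.0, 4.0)
    print(a_post / (a_post + b_post))     # posterior mean of the coin's bias: 0.6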
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Goal: Posterior Inference
Hyperparameters $\alpha$; unknown parameters $\Theta$; data $\mathcal{D}$
Likelihood model: $p(\mathcal{D} \mid \Theta)$
Posterior: $p_\alpha(\Theta \mid \mathcal{D})$
(Some) Learning Techniques
• MAP/MLE: point estimation, basic EM (what we've already covered)
• Variational Inference: functional optimization (today)
• Sampling/Monte Carlo (next class)
Variational Inference

• $p(\theta \mid x)$: difficult to compute
• $q(\theta)$: easy(ier) to compute, controlled by parameters $\lambda$
• minimize the "difference" between them by changing $\lambda$
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
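The loop is ordinary gradient-based optimization; only the objective $F$ is special. A generic Python sketch of steps 1-5, where $F$, its gradient, and the step-size schedule are all placeholder assumptions:

    import numpy as np

    def optimize(F, grad_F, lam0, steps=100):
        # Generic loop from the slide. F and grad_F are placeholders; for VI,
        # F would be built from KL[q(.; lam) || p(.)], and since KL is being
        # minimized you would negate g (the slide writes the ascent form).
        lam = np.asarray(lam0, dtype=float)
        for t in range(steps):
            y = F(lam)                # 1. get value y_t
            g = grad_F(lam)           # 2. get gradient g_t
            rho = 1.0 / (1.0 + t)     # 3. scaling factor rho_t (toy schedule)
            lam = lam + rho * g       # 4. lam_{t+1} = lam_t + rho_t * g_t
                                      # 5. t advances via the loop
        return lam

    # Toy usage: maximize F(lam) = -(lam - 3)^2, whose optimum is lam = 3.
    print(optimize(lambda l: -(l - 3)**2, lambda l: -2 * (l - 3), lam0=0.0))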
Variational Inference: The Function to Optimize

• $p(\theta \mid x)$: the posterior of the desired model, with the desired model's parameters
• $q(\theta; \lambda)$: any easy-to-compute distribution, with variational parameters $\lambda$ for $\theta$
• find the best distribution $q$ (calculus of variations)
KL-Divergence (an expectation)

$D_{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big) = \mathbb{E}_{q(\theta)}\Big[\log \dfrac{q(\theta)}{p(\theta \mid x)}\Big]$
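For two discrete distributions this expectation is a one-liner; a small sketch with toy values:

    import numpy as np

    q = np.array([0.5, 0.3, 0.2])    # toy variational distribution q(theta)
    p = np.array([0.4, 0.4, 0.2])    # toy target p(theta | x)
    kl = np.sum(q * np.log(q / p))   # D_KL(q || p) = E_q[log q - log p]
    print(kl)                        # always >= 0, and 0 only when q == p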
Variational Inference

Find the best distribution: minimize $D_{KL}\big(q(\theta; \lambda)\,\|\,p(\theta \mid x)\big)$ over the variational parameters $\lambda$ for $\theta$, holding the desired model's parameters fixed.
Exponential Family Recap: "Easy" Expectations: $\mathbb{E}_{p(x \mid \eta)}[T(x)] = \nabla_\eta A(\eta)$
Exponential Family Recap: "Easy" Posterior Inference: $p$ is the conjugate prior for $q$
Variational Inference
Find the best distribution
When p and q are the same exponential family form, the variational update q(ΞΈ) is (often) computable (in closed form)
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$; let $F(q(\cdot; \lambda_t)) = KL[q(\cdot; \lambda_t) \,\|\, p(\cdot)]$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
Variational Inference: Maximization or Minimization?
Evidence Lower Bound (ELBO)

$\log p(x) = \log \int p(x, \theta)\, d\theta$
$\qquad = \log \int p(x, \theta)\, \dfrac{q(\theta)}{q(\theta)}\, d\theta$
$\qquad = \log \mathbb{E}_q\Big[\dfrac{p(x, \theta)}{q(\theta)}\Big]$
$\qquad \ge \mathbb{E}_q[\log p(x, \theta)] - \mathbb{E}_q[\log q(\theta)] = \mathcal{L}(q)$   (by Jensen's inequality)

Since $\log p(x) = \mathcal{L}(q) + D_{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big)$ and $\log p(x)$ is fixed, maximizing the ELBO is the same as minimizing the KL divergence.
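In practice both expectations in $\mathcal{L}(q)$ can be estimated by sampling from $q$. A sketch for a made-up toy model (Gaussian prior and likelihood, Gaussian $q$); all numbers are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = 1.3                                    # one made-up observation

    def log_joint(theta):
        # log p(x, theta) for a toy model: theta ~ N(0,1), x | theta ~ N(theta, 1),
        # with the additive -0.5*log(2*pi) constants dropped.
        return -0.5 * theta**2 - 0.5 * (x - theta)**2

    mu_q, sd_q = 0.5, 0.8                      # variational parameters lambda
    theta = rng.normal(mu_q, sd_q, size=100_000)              # samples from q
    log_q = -0.5 * ((theta - mu_q) / sd_q)**2 - np.log(sd_q)  # log q, constants dropped

    elbo = np.mean(log_joint(theta) - log_q)   # E_q[log p(x,theta)] - E_q[log q(theta)]
    print(elbo)   # lower-bounds log p(x), up to the constants dropped above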
Outline
Recap of graphical models & belief propagation
Posterior inference (Bayesian perspective)
Math: exponential family distributions
Variational Inference: Basic Technique; Example: Topic Models
Bag-of-Items Models
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …
p( ) Three: 1,people: 2,attack: 2,
β¦p( )=Unigram counts
Bag-of-Items Models

The same unigram counts, but now p carries two sets of parameters: global (corpus-level) parameters interact with local (document-level) parameters.
Latent Dirichlet Allocation (Blei et al., 2003)

• Per-document (unigram) word counts: entry $(i, j)$ is the count of word $j$ in document $i$
• The count matrix decomposes into per-document (latent) topic usage and per-topic word usage, with $K$ topics
• Word counts ~ Multinomial; per-document topic usage ~ Dirichlet; per-topic word usage ~ Dirichlet (regularize/place priors)
Variational Inference: LDirA

[Figure: per-document word counts decompose into topic usage and topic words]

p: True model
$\phi_k \sim \text{Dirichlet}(\beta)$;  $w^{(i,j)} \sim \text{Discrete}(\phi_{z^{(i,j)}})$
$\theta^{(i)} \sim \text{Dirichlet}(\alpha)$;  $z^{(i,j)} \sim \text{Discrete}(\theta^{(i)})$
Variational Inference: LDirA

q: Mean-field approximation (p: true model as above; every latent variable gets its own free variational parameter)
$\phi_k \sim \text{Dirichlet}(\lambda_k)$
$\theta^{(i)} \sim \text{Dirichlet}(\gamma^{(i)})$;  $z^{(i,j)} \sim \text{Discrete}(\nu^{(i,j)})$
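To make the two columns concrete, here is the generative side $p$ as code, plus the shapes of the mean-field parameters; the sizes and hyperparameter values are made up, and the names follow the family above:

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D, N = 3, 20, 5, 30          # topics, vocab size, docs, words per doc
    alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters

    # p: the true (generative) model
    phi = rng.dirichlet(beta * np.ones(V), size=K)     # phi_k ~ Dirichlet(beta)
    docs = []
    for i in range(D):
        theta_i = rng.dirichlet(alpha * np.ones(K))    # theta^(i) ~ Dirichlet(alpha)
        z = rng.choice(K, size=N, p=theta_i)           # z^(i,j) ~ Discrete(theta^(i))
        docs.append([rng.choice(V, p=phi[k]) for k in z])  # w^(i,j) ~ Discrete(phi_z)

    # q: the mean-field approximation, one free parameter per latent variable
    lam = np.ones((K, V))              # Dirichlet parameters for each phi_k
    gamma = np.ones((D, K))            # Dirichlet parameters for each theta^(i)
    nu = np.full((D, N, K), 1.0 / K)   # Discrete parameters for each z^(i,j)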
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$; let $F(q(\cdot; \lambda_t)) = KL[q(\cdot; \lambda_t) \,\|\, p(\cdot)]$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
Variational Inference: LDirA

$\mathbb{E}_{q(\theta^{(i)})}\big[\log p(\theta^{(i)} \mid \alpha)\big] = \mathbb{E}_{q(\theta^{(i)})}\big[(\alpha - 1)^\top \log \theta^{(i)} + c\big]$
($c$ is the Dirichlet log-normalizer term, constant in $\theta^{(i)}$)

exponential family form of the Dirichlet:
$p(\theta) = \dfrac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
params $= (\alpha_k - 1)_k$;  suff. stats. $= (\log \theta_k)_k$
The outer expectation is the expectation of those sufficient statistics under the q distribution, itself a Dirichlet: params $= (\gamma_k - 1)_k$;  suff. stats. $= (\log \theta_k)_k$.
Pulling the constants out of the expectation:

$\mathbb{E}_{q(\theta^{(i)})}\big[\log p(\theta^{(i)} \mid \alpha)\big] = (\alpha - 1)^\top\, \mathbb{E}_{q(\theta^{(i)})}\big[\log \theta^{(i)}\big] + c$

and the expectation of the sufficient statistics is the gradient of the log normalizer:
$\qquad = (\alpha - 1)^\top\, \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + c$
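For the Dirichlet this gradient has a closed form via the digamma function $\psi$: $\mathbb{E}_q[\log \theta_k] = \psi(\gamma_k) - \psi(\sum_{k'} \gamma_{k'})$, a standard identity. A quick numerical check with a toy $\gamma$:

    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)
    gamma = np.array([2.0, 5.0, 1.5])                 # toy variational parameters

    analytic = digamma(gamma) - digamma(gamma.sum())  # grad of Dirichlet log-normalizer
    samples = rng.dirichlet(gamma, size=200_000)      # theta ~ Dirichlet(gamma)
    monte_carlo = np.log(samples).mean(axis=0)        # E_q[log theta_k], by sampling
    print(analytic)
    print(monte_carlo)                                # agree to a few decimal places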
Variational Inference: LDirA

$\mathbb{E}_{q(\theta^{(i)})}\big[\log p(\theta^{(i)} \mid \alpha)\big] = (\alpha - 1)^\top \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + c$

$\mathcal{L}(\gamma^{(i)}) = (\alpha - 1)^\top \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + f(\gamma^{(i)})$, where $f(\gamma^{(i)})$ collects the remaining ELBO terms (there's more math to do!)
Variational Inference: A Gradient-Based Optimization Technique

Set $t = 0$; pick a starting value $\lambda_t$; let $F(q(\cdot; \lambda_t)) = KL[q(\cdot; \lambda_t) \,\|\, p(\cdot)]$.
Until converged:
 1. Get value $y_t = F(q(\cdot; \lambda_t))$
 2. Get gradient $g_t = F'(q(\cdot; \lambda_t))$
 3. Get scaling factor $\rho_t$
 4. Set $\lambda_{t+1} = \lambda_t + \rho_t g_t$
 5. Set $t \leftarrow t + 1$
Variational Inference: LDirA

$\mathcal{L}(\gamma^{(i)}) = (\alpha - 1)^\top \nabla_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + f(\gamma^{(i)})$

$\nabla_{\gamma^{(i)}} \mathcal{L}(\gamma^{(i)}) = (\alpha - 1)^\top \nabla^2_{\gamma^{(i)}} A(\gamma^{(i)} - 1) + \nabla_{\gamma^{(i)}} f(\gamma^{(i)})$
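Carrying that math through yields closed-form coordinate-ascent updates rather than raw gradient steps; this is how Blei et al. (2003) actually fit LDA. A sketch of the per-document update under the mean-field family above (the function name and initialization are ours; parameter names follow the slides' $\lambda, \gamma, \nu$):

    import numpy as np
    from scipy.special import digamma

    def update_document(w, alpha, log_phi, iters=20):
        # One document's coordinate-ascent updates, in the style of Blei et al. (2003).
        # w: array of word ids; alpha: (K,) Dirichlet prior;
        # log_phi: (K, V) expected log topic-word probabilities (derived from lambda).
        N, K = len(w), len(alpha)
        gamma = alpha + float(N) / K               # initialize doc-topic parameters
        for _ in range(iters):
            # nu_{jk} proportional to exp(log phi_{k, w_j} + E_q[log theta_k])
            log_nu = log_phi[:, w].T + digamma(gamma) - digamma(gamma.sum())
            nu = np.exp(log_nu - log_nu.max(axis=1, keepdims=True))
            nu /= nu.sum(axis=1, keepdims=True)
            gamma = alpha + nu.sum(axis=0)         # gamma = alpha + sum_j nu^(j)
        return gamma, nu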