Another Walkthrough of Variational Bayes
Bevan Jones
Machine Learning Reading Group
Macquarie University
Variational Bayes?
• Bayes ← Bayes’ Theorem
• But the integral is intractable!
  – Sampling
    • Gibbs, Metropolis Hastings, Slice Sampling, Particle Filters…
  – Variational Bayes
    • Change the equations, replacing intractable integrals
    • This involves searching for a good approximation
• Variational ← Calculus of Variations
  – A way of searching through a space of functions for the “best” one
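As a sketch of the objects involved (notation assumed: observed data x, latent variable z):

    p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x \mid z)\, p(z)\, dz

It is the evidence integral p(x) that is typically intractable.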
2
Useful Concepts
• Probability/Information Theory
  – Bayes’ Theorem
  – Expectations
  – Jensen’s Inequality
  – KL Divergence
• Calculus
  – Functionals & Functional Derivatives
  – Lagrange Multipliers
• Logarithms
3
Outline
• The true likelihood
• Approximating the posterior
• The lower bound and a definition for “best”
• Finding the optimal approximation
  – Functionals & functional derivatives
  – Connection to KL divergence
• The Mean-field approximation
• An inference procedure
• Dirichlet-multinomial example
4
The (Log) Likelihood
• We have some observed data x
• We have a model relating the latent variable z to the data: p(x, z)
• To guess z, we need the posterior p(z | x)
• The problem is one of computing p(x)
• Or, just as good, the log likelihood log p(x)
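A sketch of the relevant quantities (notation assumed):

    p(z \mid x) = \frac{p(x, z)}{p(x)}, \qquad p(x) = \int p(x, z)\, dz, \qquad \log p(x) = \log \int p(x, z)\, dz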
5
Approximating p(z|x)
• The integral in the expression for p(x) may not be easily computed
• But we might be able to get by with an approximation for p(x, z)
• We’ll focus on approximating only part of it
6
Choosing q
• How to choose q?
• Ideally, we want the q that is closest to p
• Define a lower bound on p
– Make this a “function” of q
• Maximize the lower bound to make it as tight as possible
– Choose q accordingly
7
Bounding the Log Likelihood w/ Jensen’s Inequality
• Jensen’s Inequality: f(E[y]) ≥ E[f(y)], where f is concave
8–11
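A sketch of the bound these slides build up, applying Jensen’s inequality with the concave function f = log and an arbitrary distribution q(z):

    \log p(x) = \log \int p(x, z)\, dz = \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz \;\ge\; \int q(z) \log \frac{p(x, z)}{q(z)}\, dz \;=\; F[q]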
The Lower Bound
• We can’t calculate the log likelihood, but we can compute the lower bound
• Maximizing F tightens the lower bound on the likelihood
• What q maximizes F?
• If q were a variable, we could do this by taking derivatives and solving for q
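Written out, the bound is computable because it only involves expectations under q (same notation as above):

    F[q] = \int q(z) \log p(x, z)\, dz \;-\; \int q(z) \log q(z)\, dz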
12
Functionals: the “Variational” in VB
• Functional: a kind of “meta-function” that takes a function as input
• We can view F[q] as a functional of q
• Calculus of functionals parallels that of functions
• Then, we can
  – take the derivative of F[q] with respect to q,
– set it to 0, and
– solve for q
13
Derivatives
14
Functional Derivatives
• The change in the functional as we change its function argument
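For reference, a standard definition (notation assumed): the functional derivative \delta F / \delta q(z) satisfies

    \int \frac{\delta F}{\delta q(z)}\, \phi(z)\, dz \;=\; \lim_{\epsilon \to 0} \frac{F[q + \epsilon \phi] - F[q]}{\epsilon}

for arbitrary test functions \phi(z).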
15
Useful Derivatives
16–20
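Two standard functional derivatives that the following steps rely on (stated here as a sketch):

    \frac{\delta}{\delta q(z)} \int q(z')\, g(z')\, dz' = g(z), \qquad \frac{\delta}{\delta q(z)} \int q(z') \log q(z')\, dz' = \log q(z) + 1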
Calculating q
• Use Lagrange multipliers
• The constraint: q must be a proper distribution, \int q(z)\, dz = 1
21–24
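A hedged reconstruction of the calculation: add the constraint with a multiplier \lambda and set the functional derivative to zero,

    L[q, \lambda] = \int q(z) \log \frac{p(x, z)}{q(z)}\, dz + \lambda \left( \int q(z)\, dz - 1 \right)

    \frac{\delta L}{\delta q(z)} = \log p(x, z) - \log q(z) - 1 + \lambda = 0 \;\Rightarrow\; q(z) \propto p(x, z)

so after normalizing, q(z) = p(z \mid x).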
KL Divergence: An Alternative View
• Maximizing F is minimizing the KL divergence between q(z) and p(z|x)
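The identity behind this view:

    \log p(x) = F[q] + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)

Since \log p(x) does not depend on q, maximizing F[q] is the same as minimizing the KL divergence, and F[q] = \log p(x) exactly when q(z) = p(z \mid x).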
25
Optimal q
• The best q(z) is p(z|x)
26
Where are we?
• We’ve bounded the likelihood (Jensen’s Ineq.)
• Made this bound tight (Lagrange Multipliers)
• But the best approximation is no approximation at all!
• We need to constrain q so that it’s tractable
27
Optimal q in an Imperfect World
• We can’t compute q(z)=p(z|x) directly
• Instead, constrain the domain of F[q] to some set of more tractable functions
• This is usually done by making independence assumptions
– The mean field assumption: cut all dependencies
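Concretely, the mean field assumption factorizes q over the latent variables:

    q(z) = \prod_i q_i(z_i)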
28
Example 2: Mean Field Assumption
• We have some observed data x
• We have a model relating latent variables z and θ to the data: p(x, z, θ)
• To guess z and θ we need p(z, θ | x)
• But the integral is hard!
• Apply the mean field assumption
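Here the mean field assumption amounts to (notation assumed):

    q(z, \theta) = q_z(z)\, q_\theta(\theta)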
29
The New Lower Bound
• Apply the mean field assumption
30–33
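A sketch of the resulting bound under the factorization q(z, \theta) = q_z(z)\, q_\theta(\theta):

    F[q_z, q_\theta] = \int\!\!\int q_z(z)\, q_\theta(\theta) \log \frac{p(x, z, \theta)}{q_z(z)\, q_\theta(\theta)}\, dz\, d\theta = E_{q_z q_\theta}[\log p(x, z, \theta)] - E_{q_z}[\log q_z(z)] - E_{q_\theta}[\log q_\theta(\theta)]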
The Benefit of Independence
• The integrals get simpler
• In fact, some terms go away: an integral over a factor that does not appear inside the log just integrates to 1
34
Optimizing the Lower Bound
35
Optimal qθ(θ)
• Use Lagrange multipliers
• The constraint: \int q_\theta(\theta)\, d\theta = 1
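The standard mean-field result this yields, sketched with the notation above:

    q_\theta(\theta) \propto \exp\big( E_{q_z}[\log p(x, z, \theta)] \big)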
36
Optimal qz(z)
• Use Lagrange multipliers
• The constraint: q_z(z) must also sum (or integrate) to 1
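And symmetrically:

    q_z(z) \propto \exp\big( E_{q_\theta}[\log p(x, z, \theta)] \big)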
37
The Approximation q ≠ p
38
Estimating Parameters
• Now we have our approximation q
• We need to compute the expectations
• Use an EM-like procedure, alternating between the two (sketched below)
  – It was hard to do this for p(z,θ|x)
  – It’s (hopefully) easy for q(z,θ)
    • if we’ve defined p to make use of conjugacy
    • and if we’ve chosen the right constraint for q
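A sketch of the alternation, using the two update equations from slides 36–37:

    q_\theta^{(t+1)}(\theta) \propto \exp\big( E_{q_z^{(t)}}[\log p(x, z, \theta)] \big), \qquad q_z^{(t+1)}(z) \propto \exp\big( E_{q_\theta^{(t+1)}}[\log p(x, z, \theta)] \big)

Each update needs an expectation under the current estimate of the other factor, and each update can only increase F.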
39
Calculating F
40
Calculating F
• As a side effect of inference, we already have one of the terms we need
• It’s the log of the normalization constant for q(z)
• So, we really only need two more expectations
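One hedged way to see this, assuming the definition q_z(z) \propto \exp\big(E_{q_\theta}[\log p(x, z \mid \theta)]\big) with normalizer Z_z: substituting that q_z into F gives

    F = \log Z_z + E_{q_\theta}[\log p(\theta)] - E_{q_\theta}[\log q_\theta(\theta)]

i.e. the log normalization constant plus two remaining expectations under q_\theta.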
41
Uses for F
• We can often use F in cases where we would normally use the log likelihood
  – Measuring convergence
    • No guarantee to maximize likelihood, but we do have F
• Others
  – Model selection
    • Choose the model with the highest lower bound
  – Selecting the number of clusters
    • Pick the number that gives us the highest lower bound
  – Parameter optimization
    • Again, optimize the lower bound w.r.t. the parameters
42
Worked Example
Dirichlet-Multinomial Mixture Model
43
Dirichlet-Multinomial Mixture Model
[Plate diagram: hyperparameters α and β; mixing proportions φ; per-topic parameters π (K of them); latent topic z and observation x, repeated N times]
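A hedged reading of the generative model implied by the diagram (the pairing of hyperparameters with parameters is an assumption):

    \phi \sim \mathrm{Dir}(\beta), \qquad \pi_k \sim \mathrm{Dir}(\alpha) \;\; (k = 1, \dots, K), \qquad z_i \sim \mathrm{Mult}(\phi), \qquad x_i \sim \mathrm{Mult}(\pi_{z_i}) \;\; (i = 1, \dots, N)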
44
The Intractable Integral
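As a sketch, under the generative model above the evidence requires integrating out \phi and \pi and summing over z:

    p(x) = \int\!\!\int \sum_{z} p(\phi) \Big[ \prod_k p(\pi_k) \Big] \prod_{i=1}^{N} p(z_i \mid \phi)\, p(x_i \mid \pi_{z_i})\; d\phi\, d\pi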
45
The Mean Field Assumption
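For this model, a natural reading of the assumption is (notation assumed):

    q(z, \phi, \pi) = q_z(z)\, q_\phi(\phi)\, q_\pi(\pi)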
46
Optimizing F
• Apply Lagrange multipliers just like example 2
• In this case, we have simply replaced z, x, and θ with vectors
• The math is exactly the same
• But we need to find the expectations we skipped before
– Plug in the Dirichlet and multinomial distributions
47
Optimal q(z,θ)
48
Optimal qθ(θ)
• Borrowed from example 2
• See slides 36-38
• All we need to do is apply the particulars of the mixture model
49
Optimal qφ(φ): The Expectation
50
Dirichlet Distribution
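For reference, the Dirichlet density over the K-dimensional simplex:

    \mathrm{Dir}(\phi; \beta) = \frac{\Gamma\!\big(\sum_k \beta_k\big)}{\prod_k \Gamma(\beta_k)} \prod_{k=1}^{K} \phi_k^{\beta_k - 1}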
51
Optimal qφ(φ): The Numerator
52
Optimal qφ(φ): The Normalization
53
Optimal qφ(φ): Conjugacy Helps
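The upshot, sketched in the assumed notation: because the Dirichlet prior is conjugate to the multinomial, q_\phi is again a Dirichlet whose parameters add expected topic counts to the prior,

    q_\phi(\phi) = \mathrm{Dir}\big(\phi;\ \beta_1 + n_1, \dots, \beta_K + n_K\big), \qquad n_k = \sum_i q_z(z_i = k)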
54
Optimal qπ(π)
• q(π) is essentially the same as q(φ)
• The only difference is that there are multiple π’s
• So, q(π) should be a product of Dirichlets
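In the same sketch notation, with expected topic–word counts n_{kw} = \sum_i q_z(z_i = k)\,[x_i = w]:

    q_\pi(\pi) = \prod_{k=1}^{K} \mathrm{Dir}\big(\pi_k;\ \alpha_1 + n_{k1}, \dots, \alpha_W + n_{kW}\big)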
55
Optimal qπ(π): The Expectation
56
Optimal qπ(π): The Numerator
57
Optimal qπ(π): The Denominator
58
Optimal qπ(π): Putting Them Together
59
A Useful Standard Result
• The digamma function
• The expectation under a Dirichlet of the log of an individual component of a Dirichlet random variable
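Stated explicitly, with \psi the digamma function, \psi(x) = \frac{d}{dx} \log \Gamma(x): if \theta \sim \mathrm{Dir}(a_1, \dots, a_K), then

    E[\log \theta_k] = \psi(a_k) - \psi\!\Big( \sum_{j=1}^{K} a_j \Big)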
60
Optimal qz(z)
61
Optimal qz(z)
• Again, borrowed from example 2
• See slides 36-38
• Here, we plug in the model definition
62
Optimal qz(z): The Expectations
• First, let’s work with the simpler multinomial distribution
• Side effect: a kind of estimate for the multinomial parameter vector
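A sketch of the expectation in question and the resulting “kind of estimate”, writing \tilde\beta for the parameters of q_\phi from slide 54:

    E_{q_\phi}[\log \phi_k] = \psi(\tilde\beta_k) - \psi\!\Big( \sum_j \tilde\beta_j \Big), \qquad \tilde\phi_k \propto \exp\big( E_{q_\phi}[\log \phi_k] \big)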
63
Optimal qz(z): The Expectations
• Now, let’s work with the product of multinomials
• Side effect: a kind of set of multinomial parameter vectors
• This is essentially the same math required for HMMs and PCFGs
64–65
Optimal qz(z): Putting It Together
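Combining the two expectations gives the familiar form of the update (a sketch; \tilde\pi_{kw} \propto \exp(E_{q_\pi}[\log \pi_{kw}]) analogously to \tilde\phi_k):

    q_z(z_i = k) \propto \tilde\phi_k\, \tilde\pi_{k,\, x_i}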
66
Implications of Assumption
• We should get the same result with an even weaker assumption
67
Inference
• “E-Step”: Expected Counts
  – Topic counts
  – Topic-word pair counts
• “M-Step”: The Proportions (both steps are sketched in code below)
  – Topic j
  – Topic-word pair j-k
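A minimal runnable sketch of this loop, under the assumptions made above (φ ~ Dir(β) for the mixing proportions, π_k ~ Dir(α) for the topic–word distributions, each observation a vector of word counts). Function and variable names such as vb_mixture are illustrative, not taken from the slides; NumPy/SciPy supply the digamma and array arithmetic.

    import numpy as np
    from scipy.special import digamma

    def vb_mixture(X, K, alpha=1.0, beta=1.0, iters=50):
        """X: (N, W) array of word counts per observation; K: number of topics."""
        N, W = X.shape
        rng = np.random.default_rng(0)
        r = rng.dirichlet(np.ones(K), size=N)          # q(z_i = k): responsibilities
        for _ in range(iters):
            # "E-step": expected counts from the current q(z)
            beta_tilde = beta + r.sum(axis=0)          # topic counts       -> q(phi)
            alpha_tilde = alpha + r.T @ X              # topic-word counts  -> q(pi_k)
            # "M-step": the proportions, via the digamma identity (slide 60)
            Elog_phi = digamma(beta_tilde) - digamma(beta_tilde.sum())
            Elog_pi = digamma(alpha_tilde) - digamma(alpha_tilde.sum(axis=1, keepdims=True))
            # Update q(z_i = k) proportional to exp of the expected log joint
            log_r = Elog_phi + X @ Elog_pi.T
            log_r -= log_r.max(axis=1, keepdims=True)  # subtract max for numerical stability
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
        return r, beta_tilde, alpha_tilde

Each pass first forms expected counts from q(z), then refreshes the Dirichlet parameters and exp-digamma proportions, mirroring slides 54–66.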
68
Calculating F
69
Calculating F
• Also borrowed from example 2
• See slides 40-41
• But we adapt it for the mixture model
70
Calculating F: The Normalization Constant
71
• By-product of computing