Another Walkthrough of Variational Bayes

Bevan Jones

Machine Learning Reading Group

Macquarie University

Variational Bayes?

• Bayes ← Bayes’ Theorem

• But the integral is intractable!
  – Sampling
    • Gibbs, Metropolis-Hastings, Slice Sampling, Particle Filters…
  – Variational Bayes
    • Change the equations, replacing intractable integrals
    • This involves searching for a good approximation

• Variational ← Calculus of Variations
  – A way of searching through a space of functions for the “best” one

2

Useful Concepts

• Probability/Information Theory
  – Bayes’ Theorem
  – Expectations
  – Jensen’s Inequality
  – KL Divergence

• Calculus

– Functionals & Functional Derivatives
– Lagrange Multipliers

• Logarithms

3

Outline

• The true likelihood

• Approximating the posterior

• The lower bound and a definition for “best”

• Finding the optimal approximation
  – Functionals & functional derivatives
  – Connection to KL divergence

• The Mean-field approximation

• An inference procedure

• Dirichlet-multinomial example

4

The (Log) Likelihood

• We have some observed data x
• We have a model p(x, z) relating a latent variable z to the data
• To guess z we need the posterior p(z | x)
• The problem is one of computing the marginal likelihood p(x)
• Or, just as good, the log likelihood log p(x)

5
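The slide’s own equations are not reproduced in this transcript; in standard notation the quantities involved are:

    p(z \mid x) = \frac{p(x, z)}{p(x)}, \qquad
    p(x) = \int p(x, z)\, dz, \qquad
    \log p(x) = \log \int p(x, z)\, dz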

Approximating p(z|x)

• The integral in the expression for p(x) may not be easily computed

• But we might be able to get by with an approximation for p(x, z)

• We’ll focus on approximating only part of it

6

Choosing q

• How to choose q?

• Ideally, we want the q that is closest to p

• Define a lower bound on the (log) likelihood

– Make this a “function” of q

• Maximize the lower bound to make it as tight as possible

– Choose q accordingly

7

Bounding the Log Likelihood w/ Jensen’s Inequality

• Jensen’s Inequality, where f is concave

8
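For reference, the inequality in the form used here, followed by its application to the log likelihood (the slide’s equations do not survive in the transcript):

    f\big(\mathbb{E}[X]\big) \;\ge\; \mathbb{E}\big[f(X)\big] \qquad \text{for concave } f

    \log p(x) \;=\; \log \int q(z)\,\frac{p(x, z)}{q(z)}\,dz
    \;\ge\; \int q(z) \log \frac{p(x, z)}{q(z)}\,dz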

The Lower Bound

• We can’t calculate the log likelihood, but we can compute the lower bound
• Maximizing F tightens the lower bound on the likelihood
• What q maximizes F?
  • If q were a variable, we could do this by taking derivatives and solving for q

12
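Written out, the bound F from the previous slides is:

    F[q] \;=\; \int q(z)\log\frac{p(x,z)}{q(z)}\,dz
    \;=\; \mathbb{E}_{q}\big[\log p(x,z)\big] - \mathbb{E}_{q}\big[\log q(z)\big],
    \qquad \log p(x) \;\ge\; F[q]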

Functionals: the “Variational” in VB

• Functional: a kind of “meta-function” that takes a function as input

• We can view F[q] as a functional of q

• Calculus of functionals parallels that of functions

• Then, we can
  – take the derivative of F[q] with respect to q,
  – set it to 0, and
  – solve for q

13
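A minimal illustration of the parallel: for a functional that, like F[q], involves no derivatives of q inside the integral, the functional derivative is just the partial derivative of the integrand:

    F[q] = \int G\big(q(z), z\big)\,dz
    \qquad\Longrightarrow\qquad
    \frac{\delta F}{\delta q(z)} = \frac{\partial G\big(q(z), z\big)}{\partial q}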

Derivatives

14

• The change in functional as we change its function argument

Functional Derivatives

15

Useful Derivatives

16
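The equations on these slides do not survive in the transcript; the two functional derivatives actually needed in what follows are presumably the standard ones:

    \frac{\delta}{\delta q(z)} \int q(z)\,g(z)\,dz \;=\; g(z),
    \qquad
    \frac{\delta}{\delta q(z)} \int q(z)\log q(z)\,dz \;=\; \log q(z) + 1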

Calculating q

• Use Lagrange multipliers to enforce the constraint that q(z) normalizes to 1

21
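A sketch of the calculation, assuming normalization is the only constraint:

    \mathcal{L}[q] \;=\; F[q] + \lambda\Big(\int q(z)\,dz - 1\Big)

    \frac{\delta \mathcal{L}}{\delta q(z)} \;=\; \log p(x,z) - \log q(z) - 1 + \lambda \;=\; 0
    \qquad\Longrightarrow\qquad
    q(z) \;\propto\; p(x,z)

Normalizing gives q(z) = p(x, z) / p(x) = p(z | x), the result on slide 26.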

KL Divergence: An Alternative View

• Maximizing F is minimizing the KL divergence between q(z) and p(z | x)

25
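The identity behind this view:

    \log p(x) \;=\; F[q] \;+\; \mathrm{KL}\big(q(z)\,\big\|\,p(z\mid x)\big),
    \qquad
    \mathrm{KL}\big(q\,\|\,p\big) \;=\; \int q(z)\log\frac{q(z)}{p(z\mid x)}\,dz \;\ge\; 0

Since log p(x) does not depend on q, making F larger necessarily makes the KL divergence smaller, and F equals log p(x) exactly when q(z) = p(z | x).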

Optimal q

• The best q(z) is p(z|x)

26

Where are we?

• We’ve bounded the likelihood (Jensen’s Ineq.)

• Made this bound tight (Lagrange Multipliers)

• But the best approximation is no approximation at all!

• We need to constrain q so that it’s tractable

27

Optimal q in an Imperfect World

• We can’t compute q(z)=p(z|x) directly

• Instead, constrain the domain of F[q] to some set of more tractable functions

• This is usually done by making independence assumptions

– The mean field assumption: cut all dependencies

28
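In symbols, the mean-field restriction replaces the full joint q with a fully factorized one:

    q(z) \;=\; \prod_{i} q_i(z_i)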

Example 2: Mean Field Assumption

• We have some observed data x
• We have a model relating latent variables z and θ to the data
• To guess z and θ we need the posterior p(z, θ | x)
• But the integral is hard!
• Apply the mean field assumption

29
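The equations for this example are again missing from the transcript; the setup is presumably the two-block version of the above, with

    q(z, \theta) \;=\; q_z(z)\,q_\theta(\theta)
    \qquad\text{approximating}\qquad
    p(z, \theta \mid x) \;=\; \frac{p(x, z, \theta)}{p(x)},
    \quad
    p(x) \;=\; \iint p(x, z, \theta)\,dz\,d\theta

(with the integral over z read as a sum when z is discrete).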

The New Lower Bound

30

The New Lower Bound

Apply mean field assumption

33
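Under that factorization the bound becomes

    F[q_z, q_\theta]
    \;=\; \mathbb{E}_{q_z q_\theta}\big[\log p(x, z, \theta)\big]
    \;-\; \mathbb{E}_{q_z}\big[\log q_z(z)\big]
    \;-\; \mathbb{E}_{q_\theta}\big[\log q_\theta(\theta)\big]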

The Benefit of Independence

• The integrals get simpler
• In fact, some of them go away entirely

34

Optimizing the Lower Bound

35

Optimal qθ(θ)

• Use Lagrange multipliers to enforce the constraint that qθ(θ) normalizes to 1

36

Optimal qz(z)

• Use Lagrange multipliers to enforce the constraint that qz(z) normalizes to 1

37
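Carrying the Lagrange-multiplier argument through for each factor gives the standard mean-field updates, each normalized to a proper distribution:

    q_\theta(\theta) \;\propto\; \exp\Big(\mathbb{E}_{q_z}\big[\log p(x, z, \theta)\big]\Big),
    \qquad
    q_z(z) \;\propto\; \exp\Big(\mathbb{E}_{q_\theta}\big[\log p(x, z, \theta)\big]\Big)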

The Approximation q ≠ p

38

Estimating Parameters

• Now we have our approximation q

• We need to compute the expectations

• Use an EM-like procedure, alternating between the two
  – It was hard to do this for p(z, θ | x)
  – It’s (hopefully) easy for q(z, θ)
    • if we’ve defined p to make use of conjugacy
    • and if we’ve chosen the right constraint for q

39
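Spelled out, the alternation just cycles the two updates above:

    q_\theta^{(t+1)}(\theta) \;\propto\; \exp\Big(\mathbb{E}_{q_z^{(t)}}\big[\log p(x, z, \theta)\big]\Big),
    \qquad
    q_z^{(t+1)}(z) \;\propto\; \exp\Big(\mathbb{E}_{q_\theta^{(t+1)}}\big[\log p(x, z, \theta)\big]\Big)

Each update maximizes F with respect to one factor while the other is held fixed, so F never decreases and can be monitored for convergence.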

Calculating F

40

Calculating F

• As a side effect of inference, we already have one of the terms we need
• It’s the log of the normalization constant for q(z)
• So, we really only need two more expectations

41
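One way to read this slide, assuming the model factorizes as p(x, z, θ) = p(x, z | θ) p(θ) and writing Z_z for the normalization constant of q_z:

    q_z(z) \;=\; \frac{1}{Z_z}\exp\Big(\mathbb{E}_{q_\theta}\big[\log p(x, z \mid \theta)\big]\Big)
    \qquad\Longrightarrow\qquad
    F \;=\; \log Z_z \;+\; \mathbb{E}_{q_\theta}\big[\log p(\theta)\big]
    \;-\; \mathbb{E}_{q_\theta}\big[\log q_\theta(\theta)\big]

so only the last two expectations remain to be computed.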

Uses for F

• We can often use F in cases where we would normally use the log likelihood
  – Measuring convergence
    • No guarantee to maximize likelihood, but we do have F
• Others
  – Model selection
    • Choose the model with the highest lower bound
  – Selecting the number of clusters
    • Pick the number that gives us the highest lower bound
  – Parameter optimization
    • Again, optimize the lower bound w.r.t. the parameters

42

Worked Example

Dirichlet-Multinomial Mixture Model

43

Dirichlet-Multinomial Mixture Model

[Plate diagram, not reproduced in the transcript: hyperparameters α and β; mixture weights φ; cluster parameters π inside a plate of size K; cluster assignments z and observations x inside a plate of size N.]

44
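Read in the usual way, the plate diagram corresponds to the following generative story (the pairing of hyperparameters with variables, β with the mixture weights φ and α with the per-cluster multinomials π_k, is an assumption; the arrows are not recoverable from the transcript):

    \phi \sim \mathrm{Dir}(\beta), \qquad
    \pi_k \sim \mathrm{Dir}(\alpha) \quad (k = 1,\dots,K)

    z_n \mid \phi \sim \mathrm{Mult}(\phi), \qquad
    x_n \mid z_n, \pi \sim \mathrm{Mult}(\pi_{z_n}) \quad (n = 1,\dots,N)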

The Intractable Integral

45

The Mean Field Assumption

46

Optimizing F

• Apply Lagrange multipliers just like example 2

• In this case, we have simply replaced z, x, and θ with vectors

• The math is exactly the same

• But we need to find the expectations we skipped before

– Plug in the Dirichlet and multinomial distributions

47

Optimal q(z,θ)

48

Optimal qθ(θ)

• Borrowed from example 2
• See slides 36-38
• All we need to do is apply the particulars of the mixture model

49

Optimal qφ(φ): The Expectation

50

Dirichlet Distribution

51
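For reference, the Dirichlet density with parameter vector α = (α_1, …, α_K):

    \mathrm{Dir}(\pi \mid \alpha)
    \;=\; \frac{\Gamma\!\big(\sum_{k}\alpha_k\big)}{\prod_{k}\Gamma(\alpha_k)}
    \prod_{k}\pi_k^{\alpha_k - 1},
    \qquad \pi_k \ge 0, \quad \sum_k \pi_k = 1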

Optimal qφ(φ): The Numerator

52

Optimal qφ(φ): The Normalization

53

Optimal qφ(φ): Conjugacy Helps

54
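Putting slides 50–54 together: with a (here symmetric) Dirichlet prior Dir(β) on φ, conjugacy gives a Dirichlet again, with expected cluster counts added to the prior (a reconstruction under the hyperparameter pairing assumed above):

    q_\phi(\phi)
    \;\propto\; \exp\Big(\mathbb{E}_{q_z}\big[\log p(\phi) + \textstyle\sum_n \log p(z_n \mid \phi)\big]\Big)
    \;=\; \mathrm{Dir}\big(\phi \mid \beta + N_1, \dots, \beta + N_K\big),
    \qquad
    N_k = \sum_{n} q(z_n = k)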

Optimal qπ(π)

• q(π) is essentially the same as q(φ)

• The only difference is that there are multiple π’s

• So, q(π) should be a product of Dirichlets

55

Optimal qπ(π): The Expectation

56

Optimal qπ(π): The Numerator

57

Optimal qπ(π): The Denominator

58

Optimal qπ(π): Putting Them Together

59
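By the same argument each cluster’s word distribution gets its own Dirichlet, now with expected word counts; here x_{nv} denotes the count of word v in observation n and V the vocabulary size (notation assumed, not from the slides):

    q_\pi(\pi)
    \;=\; \prod_{k=1}^{K} \mathrm{Dir}\big(\pi_k \mid \alpha + N_{k1}, \dots, \alpha + N_{kV}\big),
    \qquad
    N_{kv} = \sum_{n} q(z_n = k)\, x_{nv}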

A Useful Standard Result

• The digamma function

• The expectation under a Dirichlet of the log of an individual component of a Dirichlet random variable

60
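The result in question, for a Dirichlet with parameter vector a and digamma function ψ:

    \mathbb{E}_{\mathrm{Dir}(\pi \mid a)}\big[\log \pi_k\big]
    \;=\; \psi(a_k) \;-\; \psi\!\Big(\sum_{j} a_j\Big)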

Optimal qz(z)

61

Optimal qz(z)

• Again, borrowed from example 2
• See slides 36-38
• Here, we plug in the model definition

62

Optimal qz(z): The Expectations

• First, let’s work with the simpler multinomial distribution
• Side effect: a kind of estimate for the multinomial parameter vector

63

Optimal qz(z): The Expectations

• Now, let’s work with the product of multinomials
• Side effect: a kind of set of multinomial parameter vectors
• This is essentially the same math required for HMMs and PCFGs

64

Optimal qz(z): The Expectations

65

Optimal qz(z): Putting It Together

66
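Combining the expectations above (same assumed notation as before), the unnormalized responsibility of cluster k for observation n is

    q(z_n = k)
    \;\propto\; \exp\Big(
        \mathbb{E}_{q_\phi}\big[\log \phi_k\big]
        + \sum_{v} x_{nv}\,\mathbb{E}_{q_\pi}\big[\log \pi_{kv}\big]
    \Big)

with each expectation evaluated by the digamma identity on slide 60.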

Implications of Assumption

• We should get the same result with an even weaker assumption

67

Inference

• “E-Step”: Expected Counts
  – Topic counts
  – Topic-word pair counts
• “M-Step”: The Proportions
  – Topic j
  – Topic-word pair j-k

68
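A minimal runnable sketch of this E/M-style loop under the assumptions used above (symmetric priors, X an N×V matrix of word counts); the function and variable names are illustrative, not from the slides:

    import numpy as np
    from scipy.special import digamma

    def vb_dirichlet_multinomial_mixture(X, K, alpha=1.0, beta=1.0, n_iter=100):
        """Mean-field VB for a Dirichlet-multinomial mixture (illustrative sketch only)."""
        N, V = X.shape
        rng = np.random.default_rng(0)
        resp = rng.dirichlet(np.ones(K), size=N)     # q(z_n = k), randomly initialized

        for _ in range(n_iter):
            # "E-step": expected counts under q(z)
            topic_counts = resp.sum(axis=0)          # N_k  = sum_n q(z_n = k)
            topic_word_counts = resp.T @ X           # N_kv = sum_n q(z_n = k) x_nv, shape (K, V)

            # "M-step": the proportions, i.e. expected log parameters of
            # q(phi) = Dir(beta + N_k) and q(pi_k) = Dir(alpha + N_kv),
            # via the digamma identity on slide 60
            weight_post = beta + topic_counts
            word_post = alpha + topic_word_counts
            e_log_phi = digamma(weight_post) - digamma(weight_post.sum())
            e_log_pi = digamma(word_post) - digamma(word_post.sum(axis=1, keepdims=True))

            # Recompute q(z_n = k) up to normalization (slide 66)
            log_resp = e_log_phi + X @ e_log_pi.T
            log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stability
            resp = np.exp(log_resp)
            resp /= resp.sum(axis=1, keepdims=True)

        return resp, weight_post, word_post

The sketch does not evaluate F; in practice one would also compute the lower bound each sweep (slides 69-71) and stop once it no longer improves.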

Calculating F

69

Calculating F

• Also borrowed from example 2
• See slides 40-41
• But we adapt it for the mixture model

70

Calculating F: The Normalization Constant

71

• By-product of computing