Transcript of: Dr. Nonparametric Bayes (people.eecs.berkeley.edu › ... › nonparametric › slides.pdf, 2009-11-20)

Page 1

Dr. Nonparametric Bayes
Or: How I Learned to Stop Worrying and Love the Dirichlet Process

Kurt Miller
CS 294: Practical Machine Learning
November 19, 2009

Page 2

Today we will discuss Nonparametric Bayesian methods.

“Nonparametric Bayesian methods”? What does that mean?


Page 4

Introduction

Nonparametric

Nonparametric: Does NOT mean there are no parameters.

Page 5

Introduction

Example: Classification

Build model

Predict using model

Data

Parametric Approach

Nonparametric Approach

[Figure: a toy 2-D dataset of labeled points, with the decision boundary produced by the parametric approach next to the boundary produced by the nonparametric approach]

Page 6

Introduction

Example: Regression

Build model

Predict using model

Data

Parametric Approach

Nonparametric Approach

[Figure: a 1-D dataset with a parametric regression fit next to a nonparametric regression fit]

Page 7

Introduction

Example: Clustering

Build model

Data

Parametric Approach

Nonparametric Approach

[Figure: a 2-D dataset clustered by the parametric approach next to the nonparametric approach]

Page 8

Introduction

So now we know what nonparametric means, but what does Bayesian mean?

Statistics: Bayesian Basics

The Bayesian approach treats statistical problems by maintaining probability distributions over possible parameter values. That is, we treat the parameters themselves as random variables having distributions:

1. We have some beliefs about our parameter values θ before we see any data. These beliefs are encoded in the prior distribution P(θ).

2. Treating the parameters θ as random variables, we can write the likelihood of the data X as a conditional probability: P(X|θ).

3. We would like to update our beliefs about θ based on the data by obtaining P(θ|X), the posterior distribution. Solution: by Bayes’ theorem,

P(θ|X) = P(X|θ) P(θ) / P(X)

where P(X) = ∫ P(X|θ) P(θ) dθ

(Slide from tutorial lecture)

Page 9

Introduction

Why Be Bayesian?

You can take a course on this question.

One answer:

Infinite Exchangeability: ∀n p(x1, . . . , xn) = p(xσ(1), . . . , xσ(n))

De Finetti’s Theorem (1955): If (x1, x2, . . .) are infinitely exchangeable, then ∀n

p(x1, . . . , xn) = ∫ [ ∏_{i=1}^{n} p(xi | θ) ] dP(θ)

for some random variable θ.

Page 12

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.

Suppose we observe: T, H, H, T. What do we think θ is?

The maximum likelihood estimate is θ = 1/2. Seems reasonable.

Now suppose we observe: H, H, H, H. What do we think θ is? The maximum likelihood estimate is θ = 1. Seem reasonable?

Not really. Why?

Page 17

Introduction

Simple Example

When we observe H, H, H, H, why does θ = 1 seem unreasonable?

Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.

Page 19

Introduction

Bayesian Approach to Estimating θ

Place a Beta(a, b) prior on θ. This prior has the form

p(θ) ∝ θ^(a−1) (1 − θ)^(b−1).

What does this distribution look like?

[Figure: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), and (9.0, 3.0)]

Page 21

Introduction

Bayesian Approach to Estimating θ

After observing X, a sequence with n heads and m tails, the posterior on θ is:

p(θ|X) ∝ p(X|θ) p(θ) ∝ θ^(a+n−1) (1 − θ)^(b+m−1) ∼ Beta(a + n, b + m).

If a = b = 1 and we observe 5 heads and 2 tails, Beta(6, 3) looks like

[Figure: the Beta(6, 3) posterior density]
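Since the update is closed-form, it is easy to check numerically. A minimal sketch (mine, not from the slides), assuming numpy and scipy are available, using the slide’s a = b = 1, 5-heads/2-tails example:

```python
import numpy as np
from scipy import stats

# Beta(a, b) prior; the slides use a = b = 1 (uniform).
a, b = 1.0, 1.0
heads, tails = 5, 2  # observed data from the slide

# Conjugate update: posterior is Beta(a + heads, b + tails) = Beta(6, 3).
posterior = stats.beta(a + heads, b + tails)

print(posterior.mean())          # posterior mean: 6/9 = 0.667
print(posterior.interval(0.95))  # central 95% credible interval
```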

Page 23

Nonparametric Bayesian Methods overview

Nonparametric Bayesian Methods

Now we know what nonparametric and Bayesian mean. What should we expect from nonparametric Bayesian methods?

• Complexity of our model should be allowed to grow as we get more data.

• Place a prior on an unbounded number of parameters.

Page 26

Nonparametric Bayesian Methods overview

Nonparametric Bayesian Methods overview

• Dirichlet Process / Chinese Restaurant Process: latent class models, often used in the clustering context

• Beta Process / Indian Buffet Process: latent feature models

• Gaussian Process (no culinary metaphor, oh well): regression

Today we focus on the Dirichlet Process!

Page 27

Nonparametric Bayesian Methods overview

Today’s topic: The Dirichlet Process

A nonparametric approach to clustering. It can be used in any probabilistic model for clustering.

Before diving into the details, we first introduce several key ideas.

Page 28

Nonparametric Bayesian Methods overview

Key ideas to be discussed today

• A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference

• A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference

• Extensions

Page 30

Preliminaries

A Bayesian Approach to Clustering

We must specify two things:

• The likelihood term (how data is affected by the parameters):

p(X|θ)

• The prior (the prior distribution on the parameters):

p(θ)

We will slowly develop what these are in the Bayesian clustering context.

Page 31

Preliminaries

Motivating example: Clustering

How many clusters?

Page 33

Preliminaries

Clustering – A Parametric Approach

Frequentist approach: Gaussian Mixture Models with K mixtures

Distribution over classes: π = (π1, . . . , πK)

Each cluster has a mean and covariance: φi = (µi,Σi)

Then

p(x | π, φ) = ∑_{k=1}^{K} πk p(x | φk)

Use Expectation Maximization (EM) to maximize the likelihood of the data with respect to (π, φ).
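As an aside, one rarely codes EM for this model by hand; a library such as scikit-learn fits it directly. A sketch (my example, assuming scikit-learn is installed; the lecture itself uses Matlab demos):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters in 2-D.
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2).fit(X)  # EM under the hood
print(gmm.weights_)   # estimated pi
print(gmm.means_)     # estimated mu_k
```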

Page 34

Preliminaries

Clustering – A Parametric Approach

Frequentist approach: Gaussian Mixture Models with K mixtures

Alternate definition:

G = ∑_{k=1}^{K} πk δφk

where δφk is an atom at φk.

Then

θi ∼ G

xi ∼ p(x|θi)

[Graphical model: π → G; G → θi → xi (plate over N)]

Page 35

Parametric Bayesian Clustering

Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures

Distribution over classes: π = (π1, . . . , πK)

π ∼ Dirichlet(α/K, . . . , α/K)

(We’ll review the Dirichlet distribution in a few slides.)

Each cluster has a mean and covariance: φk = (µk,Σk)

(µk,Σk) ∼ Normal-Inverse-Wishart(ν)

We still have

p(x | π, φ) = ∑_{k=1}^{K} πk p(x | φk)

Page 36

Parametric Bayesian Clustering

Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures

G is now a random measure.

φk ∼ G0

π ∼ Dirichlet(α/K, . . . , α/K)

G = ∑_{k=1}^{K} πk δφk

θi ∼ G

xi ∼ p(x|θi)

[Graphical model: α → π; G0 → φk; G → θi → xi (plate over N)]

Page 37

Parametric Bayesian Clustering

The Dirichlet Distribution

We had

π ∼ Dirichlet(α1, . . . , αK)

The Dirichlet density is defined as

p(π | α) = [ Γ(∑_{k=1}^{K} αk) / ∏_{k=1}^{K} Γ(αk) ] π1^(α1−1) π2^(α2−1) · · · πK^(αK−1)

where πK = 1 − ∑_{k=1}^{K−1} πk.

The expectations of π are

E(πi) = αi / ∑_{k=1}^{K} αk
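As a sanity check on the expectation formula, one can draw Dirichlet samples with numpy and compare empirical means against αi / ∑k αk. A sketch (mine, with an arbitrary α):

```python
import numpy as np

alpha = np.array([9.0, 3.0, 1.0])
rng = np.random.default_rng(0)

samples = rng.dirichlet(alpha, size=100_000)  # each row sums to 1

print(samples.mean(axis=0))   # empirical E(pi_i)
print(alpha / alpha.sum())    # theory: alpha_i / sum_k alpha_k
```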

Page 38

Parametric Bayesian Clustering

The Beta Distribution

A special case of the Dirichlet distribution is the Beta distribution, for K = 2:

p(π | α1, α2) = [ Γ(α1 + α2) / (Γ(α1) Γ(α2)) ] π^(α1−1) (1 − π)^(α2−1)

[Figure: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), and (9.0, 3.0)]

Page 39

Parametric Bayesian Clustering

The Dirichlet Distribution

In three dimensions:

p(π | α1, α2, α3) = [ Γ(α1 + α2 + α3) / (Γ(α1) Γ(α2) Γ(α3)) ] π1^(α1−1) π2^(α2−1) (1 − π1 − π2)^(α3−1)

[Figure: Dirichlet densities on the simplex for α = (2, 2, 2), α = (5, 5, 5), and α = (2, 2, 25)]

Page 40

Parametric Bayesian Clustering

Draws from the Dirichlet Distribution

[Figure: bar plots of three independent draws each from Dir(2, 2, 2), Dir(5, 5, 5), and Dir(2, 2, 5)]

Page 41

Parametric Bayesian Clustering

Key Property of the Dirichlet Distribution

The Aggregation Property: If

(π1, . . . , πi, πi+1, . . . , πK) ∼ Dir(α1, . . . , αi, αi+1, . . . , αK)

then

(π1, . . . , πi + πi+1, . . . , πK) ∼ Dir(α1, . . . , αi + αi+1, . . . , αK)

This is also valid for any aggregation:

(π1 + π2, ∑_{k=3}^{K} πk) ∼ Beta(α1 + α2, ∑_{k=3}^{K} αk)
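The aggregation property is also easy to verify by simulation. A sketch (mine, not from the slides), comparing the aggregated coordinates against the implied Beta distribution’s moments:

```python
import numpy as np
from scipy import stats

alpha = np.array([2.0, 3.0, 1.0, 4.0])
rng = np.random.default_rng(1)

pi = rng.dirichlet(alpha, size=200_000)
agg = pi[:, 0] + pi[:, 1]  # aggregate the first two coordinates

# Aggregation property: agg ~ Beta(alpha1 + alpha2, alpha3 + alpha4)
beta = stats.beta(alpha[0] + alpha[1], alpha[2] + alpha[3])
print(agg.mean(), beta.mean())  # means should agree
print(agg.var(), beta.var())    # and so should variances
```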

Page 43

Parametric Bayesian Clustering

Multinomial-Dirichlet Conjugacy

Let Z ∼ Multinomial(π) and π ∼ Dir(α).

Posterior:

p(π | z) ∝ p(z | π) p(π) = (π1^z1 · · · πK^zK)(π1^(α1−1) · · · πK^(αK−1)) = π1^(z1+α1−1) · · · πK^(zK+αK−1)

which is Dir(α + z).
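In code the conjugate update is literally adding the counts to the prior parameters; a minimal sketch (mine, with made-up counts z):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dir(alpha) prior
z = np.array([10, 3, 7])            # observed counts per category

alpha_post = alpha + z              # posterior is Dir(alpha + z)
print(alpha_post / alpha_post.sum())  # posterior mean of pi
```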

Page 44

Parametric Bayesian Clustering

Clustering – A Parametric Approach

Bayesian approach: Bayesian Gaussian Mixture Models with K mixtures

G is now a random measure.

φk ∼ G0

π ∼ Dirichlet(α/K, . . . , α/K)

G = ∑_{k=1}^{K} πk δφk

θi ∼ G

xi ∼ p(x|θi)

[Graphical model: α → π; G0 → φk; G → θi → xi (plate over N)]

Page 45

Parametric Bayesian Clustering

Bayesian Mixture Models

We no longer want just the maximum likelihood parameters; we want the full posterior:

p(π, φ|X) ∝ p(X|π, φ)p(π, φ)

Unfortunately, this is not analytically tractable.

Two main approaches to approximate inference:

• Markov Chain Monte Carlo (MCMC) methods

• Variational approximations

Page 46

Parametric Bayesian Clustering

Monte Carlo Methods

Suppose we wish to reason about p(θ|X), but we cannot compute this distribution exactly. If, instead, we can sample θ ∼ p(θ|X), what can we do?

[Figure: a posterior density p(θ|X) alongside a histogram of samples from p(θ|X)]

This is the idea behind Monte Carlo methods.
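Concretely, once we have samples, posterior expectations and probabilities become sample averages. A minimal sketch (mine), pretending the posterior is the Beta(6, 3) from earlier and that we can only access it through samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend p(theta | X) is Beta(6, 3) and we can only draw samples from it.
theta = rng.beta(6, 3, size=50_000)

# Monte Carlo estimates of posterior quantities:
print(theta.mean())          # E[theta | X], roughly 6/9
print((theta > 0.5).mean())  # P(theta > 0.5 | X)
```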

Page 47

Parametric Bayesian Clustering

Markov Chain Monte Carlo (MCMC)

We do not have access to an oracle that will give us samples θ ∼ p(θ|X). How do we get these samples?

Markov Chain Monte Carlo (MCMC) methods have been developed to solve this problem.

We focus on Gibbs sampling, a special case of the Metropolis-Hastings algorithm.

Page 48

Parametric Bayesian Clustering

Gibbs sampling: an MCMC technique

Assume θ consists of several parameters θ = (θ1, . . . , θm). In the finite mixture model, θ = (π, µ1, . . . , µK, Σ1, . . . , ΣK).

Then do:

• Initialize θ^(0) = (θ1^(0), . . . , θm^(0)) at time step 0.

• For t = 1, 2, . . ., draw θ^(t) given θ^(t−1) in such a way that eventually the θ^(t) are samples from p(θ|X).

Page 49

Parametric Bayesian Clustering

Gibbs sampling: an MCMC technique

In Gibbs sampling, we only need to be able to sample

θi^(t) ∼ p(θi | θ1^(t), . . . , θ_{i−1}^(t), θ_{i+1}^(t−1), . . . , θm^(t−1), X).

If we repeat this for any model we discuss today, theory tells us that eventually we get samples θ^(t) from p(θ|X).

Example: θ = (θ1, θ2) and p(θ) = N(µ, Σ).

[Figure: Gibbs samples from a 2-D Gaussian; left panel: first 50 samples, right panel: first 500 samples]
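For this bivariate Gaussian example, each Gibbs conditional is a univariate Gaussian with the standard conditional-mean and conditional-variance formulas, so the sampler is a few lines. A sketch (mine, with an arbitrary µ and Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

T = 500
theta = np.zeros((T, 2))  # start the chain at (0, 0)
for t in range(1, T):
    prev = theta[t - 1]
    # theta1 | theta2: univariate Gaussian conditional of a bivariate normal
    m1 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (prev[1] - mu[1])
    v1 = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
    theta[t, 0] = rng.normal(m1, np.sqrt(v1))
    # theta2 | the freshly drawn theta1
    m2 = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (theta[t, 0] - mu[0])
    v2 = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
    theta[t, 1] = rng.normal(m2, np.sqrt(v2))

print(theta[50:].mean(axis=0))  # should approach mu after burn-in
```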

Page 51

Parametric Bayesian Clustering

Bayesian Mixture Models - MCMC inference

Introduce “membership” indicators zi, where zi ∼ Multinomial(π) indicates which cluster the ith data point belongs to.

p(π,Z, φ|X) ∝ p(X|Z, φ)p(Z|π)p(π, φ)

[Graphical model: α → π → zi; G0 → φk (plate over K); zi and φ → xi (plate over N)]

Page 52

Parametric Bayesian Clustering

Gibbs sampling for the Bayesian Mixture Model

Randomly initialize Z, π, φ. Repeat until we have enough samples:

1. Sample each zi from

zi | Z−i, π, φ, X ∝ ∑_{k=1}^{K} πk p(xi | φk) 1[zi = k]

2. Sample π from

π | Z, φ, X ∼ Dir(n1 + α/K, . . . , nK + α/K)

where ni is the number of points assigned to cluster i.

3. Sample each φk from the NIW posterior based on Z and X.
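For concreteness, here is a sketch of this sampler under simplifications of my own: 1-D data, unit-variance clusters, and a N(0, 10²) prior on each mean in place of the full NIW, so step 3 becomes a conjugate normal draw:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy 1-D data from two clusters.
X = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
N, K, alpha = len(X), 2, 1.0
z = rng.integers(K, size=N)      # random initial assignments
mu = rng.normal(0, 3, size=K)    # random initial means

for sweep in range(100):
    # Step 2: sample pi | Z ~ Dir(n_1 + alpha/K, ..., n_K + alpha/K)
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(counts + alpha / K)
    # Step 1: sample each z_i proportional to pi_k * p(x_i | phi_k)
    for i in range(N):
        logp = np.log(pi) + stats.norm.logpdf(X[i], mu, 1.0)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    # Step 3: sample each mu_k from its conjugate normal posterior
    for k in range(K):
        xk = X[z == k]
        prec = 1.0 / 10**2 + len(xk)   # prior precision + data precision
        mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))

print(np.sort(mu))   # should end up near (-3, 3)
```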

Page 53

Parametric Bayesian Clustering

MCMC in Action

Inference in the Finite Mixture Model

As was done in the GMM, we will introduce labels zi, where zi = k if data point i belongs to the kth cluster. Gibbs sampling for the finite mixture then proceeds as follows:

1. Holding π^(t−1), µ^(t−1), Σ^(t−1) fixed, sample each zi^(t) from

zi^(t) ∼ (1/Z) ∑_{j=1}^{K} πj^(t−1) N(xi; µj^(t−1), Σj^(t−1)) 1[zi^(t) = j]

2. Now holding z^(t), µ^(t−1), Σ^(t−1) fixed, sample π from

π^(t) ∼ Dirichlet(n1 + α/K, n2 + α/K, . . . , nK + α/K)

where ni is the number of points assigned to the ith cluster, using the fact that the Dirichlet distribution is conjugate to the multinomial.

3. Finally, holding z^(t) and π^(t) fixed, we resample µ^(t) and Σ^(t) from their posterior distribution. Using the fact that the NIW is the conjugate prior to the normal distribution, this is a simple draw from the posterior NIW distribution.

4. Repeat until we have enough samples.

What does this look like in action? [Matlab demo: bad initialization point, iteration 25, iteration 65]

Improvements

• The Gibbs sampler presented here is one of the most basic samplers for this model. We can improve the sampler by integrating out some of the parameters. This is called a collapsed sampler. See the next slide.

• Other methods for doing inference introduce Metropolis-Hastings steps in the sampler or do variational inference. We will not present these here.

Collapsed Gibbs Sampler

Here, we will show how to marginalize out π. Similar techniques can be used for (µ, Σ).

We placed a Dir(α/K, . . . , α/K) prior on π. Suppose we integrate π out of all equations. Step 2 disappears and step 3 is left unchanged. In step 1, using Bayes’ rule, we are sampling zi^(t) from

p(zi^(t) | z−i^(t−1), X, µ^(t−1), Σ^(t−1), α) ∝ p(zi^(t) | z−i^(t−1), α) p(xi | zi^(t), µ^(t−1), Σ^(t−1)).

p(xi | zi, µ^(t−1), Σ^(t−1)) is the same likelihood term as before.

p(zi^(t) | z−i^(t−1), α) is the expectation of our posterior Dirichlet distribution. If cluster j has nj data points assigned to it (not including point i), then this is

p(zi^(t) = j | z−i^(t−1), α) = (nj + α/K) / (n − 1 + α)

Collapsed Gibbs Sampler

The collapsed Gibbs sampler for µ, Σ, z is now:

1. Holding µ^(t−1), Σ^(t−1) fixed, sample each zi^(t) from

zi^(t) ∼ (1/Z) ∑_{j=1}^{K} [(nj + α/K) / (n − 1 + α)] N(xi; µj^(t−1), Σj^(t−1)) 1[zi^(t) = j]

2. Holding z^(t) fixed, we resample µ^(t) and Σ^(t) from their posterior distribution. This is a simple draw from the posterior NIW distribution.

3. Repeat.

Note about the likelihood term

For easy visualization purposes, we used a normal mixture model here. However, the likelihood term can be chosen to fit your particular application. The same ideas presented here apply for any mixture model.

[Matlab demo]

Page 54

Parametric Bayesian Clustering

Collapsed Gibbs Sampler

Idea for an improvement: we can marginalize out some variables due to conjugacy, so we do not need to sample them. This is called a collapsed sampler. Here we marginalize out π.

Randomly initialize Z, φ. Repeat:

1. Sample each zi from

zi | Z−i, φ, X ∝ ∑_{k=1}^{K} (nk + α/K) p(xi | φk) 1[zi = k]

2. Sample each φk from the NIW posterior based on Z and X.
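Relative to the previous sampler, only the weights in step 1 change. A sketch of the collapsed assignment sweep (mine, under the same simplified 1-D, unit-variance, normal-prior model as the earlier sketch):

```python
import numpy as np
from scipy import stats

def collapsed_gibbs_step(X, z, mu, alpha, K, rng):
    """One sweep of collapsed assignments: pi has been integrated out."""
    for i in range(len(X)):
        counts = np.bincount(np.delete(z, i), minlength=K)  # n_k without point i
        # weights (n_k + alpha/K) * p(x_i | phi_k); normalization is implicit
        logp = np.log(counts + alpha / K) + stats.norm.logpdf(X[i], mu, 1.0)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    return z
```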

Page 55

Parametric Bayesian Clustering

Note about the likelihood term

For easy visualization, we used a Gaussian mixture model.

You should use the appropriate likelihood model for your application!

Page 56

Parametric Bayesian Clustering

Summary: Parametric Bayesian clustering

• First specify the likelihood - application specific.

• Next specify a prior on all parameters.

• Exact posterior inference is intractable. Can use a Gibbs sampler for approximate inference.

Page 57

5 minute break

Page 58

Parametric Bayesian Clustering

How to Choose K?

Generic model selection: cross-validation, AIC, BIC, MDL, etc.

Can place a parametric prior on K.

What if we just let K →∞ in our parametric model?

Page 60

Parametric Bayesian Clustering

Thought Experiment

Let K →∞.

φk ∼ G0

π ∼ Dirichlet(α/K, . . . , α/K)

G = ∑_{k=1}^{K} πk δφk

θi ∼ G

xi ∼ p(x|θi)

Page 61

Parametric Bayesian Clustering

Thought Experiment: Collapsed Gibbs Sampler

Randomly initialize Z, φ. Repeat:

1. Sample each zi from

zi | Z−i, φ, X ∝ ∑_{k=1}^{K} (nk + α/K) p(xi | φk) 1[zi = k] → ∑_{k=1}^{K} nk p(xi | φk) 1[zi = k]

Note that nk = 0 for empty clusters.

2. Sample each φk based on Z and X.

Page 62

Parametric Bayesian Clustering

Thought Experiment: Collapsed Gibbs Sampler

What about empty clusters? Lump all empty clusters together. Let K+ be the number of occupied clusters. Then the posterior probability of sitting at any empty cluster is:

zi | Z−i, φ, X ∝ (α/K) × (K − K+) f(xi | G0) → α f(xi | G0)

for f(xi | G0) = ∫ p(xi | φ) dG0(φ).

Page 63

Parametric Bayesian Clustering

Key ideas to be discussed today

• A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference

• A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference

• Extensions

Page 64

The Dirichlet Process Model

A Nonparametric Bayesian Approach to Clustering

We must again specify two things:

• The likelihood term (how data is affected by the parameters):

p(X|θ)

Identical to the parametric case.

• The prior (the prior distribution on the parameters):

p(θ)

The Dirichlet Process!

Exact posterior inference is still intractable. But we have already derived the Gibbs update equations!

Page 65

The Dirichlet Process Model

What is the Dirichlet Process?

Image from http://www.nature.com/nsmb/journal/v7/n6/fig tab/nsb0600 443 F1.html

Page 66

The Dirichlet Process Model

What is the Dirichlet Process?

(G(A1), . . . , G(An)) ∼ Dir(α0 G0(A1), . . . , α0 G0(An))

Page 67

The Dirichlet Process Model

The Dirichlet Process

A flexible, nonparametric prior over an infinite number of clusters/classesas well as the parameters for those classes.

Page 68

The Dirichlet Process Model

Parameters for the Dirichlet Process

• α - The concentration parameter.

• G0 - The base measure. A prior distribution for the cluster-specific parameters.

The Dirichlet Process (DP) is a distribution over distributions. We write

G ∼ DP (α, G0)

to indicate G is a distribution drawn from the DP.

It will become clearer in a bit what α and G0 are.

Page 69

The Dirichlet Process Model

The DP, CRP, and Stick-Breaking Process

[Graphical model: α and G0 → G; G → θi → xi (plate over N)]

G ∼ DP(α, G0)

The Stick-Breaking Process gives just the weights of G.

The CRP describes the partitions of θ when G is marginalized out.

Page 70

The Dirichlet Process Model

The Dirichlet Process

Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ R+.

The Dirichlet Process DP(α, G0) is the distribution on probability measures G such that for any finite partition (A1, . . . , Am) of Ω,

(G(A1), . . . , G(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am)).

[Figure: a partition of Ω into regions A1, . . . , Am]

(Ferguson, ’73)

Page 71

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

Suppose we sample

• G ∼ DP(α, G0)
• θ1 ∼ G

What is the posterior distribution of G given θ1?

G | θ1 ∼ DP( α + 1, (α/(α + 1)) G0 + (1/(α + 1)) δθ1 )

More generally,

G | θ1, . . . , θn ∼ DP( α + n, (α/(α + n)) G0 + (1/(α + n)) ∑_{i=1}^{n} δθi )

Page 73

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

With probability 1, a sample G ∼ DP(α, G0) is of the form

G = ∑_{k=1}^{∞} πk δφk

(Sethuraman, ’94)

Page 74

The Dirichlet Process Model

The Dirichlet Process and Clustering

Draw G ∼ DP(α, G0) to get

G = ∑_{k=1}^{∞} πk δφk

Use this in a mixture model:

[Graphical model: α and G0 → G; G → θi → xi (plate over N)]

Page 75

The Dirichlet Process Model

The Stick-Breaking Process

• Define an infinite sequence of Beta random variables:

βk ∼ Beta(1, α) k = 1, 2, . . .

• And then define an infinite sequence of mixing proportions as:

π1 = β1
πk = βk ∏_{l=1}^{k−1} (1 − βl), k = 2, 3, . . .

• This can be viewed as breaking off portions of a stick:

[Figure: a unit stick broken into pieces β1, β2(1 − β1), . . .]

• When π are drawn this way, we can write π ∼ GEM(α).
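A truncated draw of π ∼ GEM(α) is a few lines of numpy; the truncation level below is my own device for working with finitely many weights:

```python
import numpy as np

def gem_weights(alpha, truncation, rng):
    """Draw the first `truncation` stick-breaking weights of pi ~ GEM(alpha)."""
    beta = rng.beta(1.0, alpha, size=truncation)       # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])
    return beta * remaining                            # pi_k = beta_k * prod_{l<k}(1 - beta_l)

rng = np.random.default_rng(0)
pi = gem_weights(alpha=2.0, truncation=50, rng=rng)
print(pi[:5], pi.sum())  # weights decay; the sum approaches 1 as truncation grows
```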

Page 76

The Dirichlet Process Model

The Stick-Breaking Process

• We now have an explicit formula for each πk: πk = βk ∏_{l=1}^{k−1} (1 − βl)

• We can also easily see that ∑_{k=1}^{∞} πk = 1 (w.p. 1):

1 − ∑_{k=1}^{K} πk = 1 − β1 − β2(1 − β1) − β3(1 − β1)(1 − β2) − · · ·
    = (1 − β1)(1 − β2 − β3(1 − β2) − · · · )
    = ∏_{k=1}^{K} (1 − βk)
    → 0 (w.p. 1 as K → ∞)

• So now G = ∑_{k=1}^{∞} πk δφk has a clean definition as a random measure

Page 77

The Dirichlet Process Model

The Stick-Breaking Process

[Graphical model: the stick-breaking representation: weights πk (from α) and atoms φk (from G0) define G; θi ∼ G; xi ∼ p(x | θi), plate over N]

Page 78

The Dirichlet Process Model

The Chinese Restaurant Process (CRP)

• A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables

• The first customer sits at the first table

• The mth subsequent customer sits at a table drawn from the following distribution:

P(previously occupied table i | Fm−1) ∝ ni
P(the next unoccupied table | Fm−1) ∝ α

where ni is the number of customers currently at table i and where Fm−1 denotes the state of the restaurant after m − 1 customers have been seated
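The seating rule translates directly into code. A sketch (mine) that simulates the table assignments of n customers:

```python
import numpy as np

def crp(n, alpha, rng):
    """Simulate table assignments for n customers in a CRP(alpha)."""
    tables = []          # tables[i] = number of customers at table i
    assignments = []
    for m in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()   # P(table i) ∝ n_i, P(new table) ∝ alpha
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments, tables

rng = np.random.default_rng(0)
print(crp(20, alpha=1.0, rng=rng))
```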

Page 79

The Dirichlet Process Model

The CRP and Clustering

• Data points are customers; tables are clusters

• The CRP defines a prior distribution on the partitioning of the data and on the number of tables

• This prior can be completed with:

  • a likelihood, e.g., associate a parameterized probability distribution with each table

  • a prior for the parameters: the first customer to sit at table k chooses the parameter vector for that table (φk) from the prior

[Figure: tables with parameters φ1, φ2, φ3, φ4 and customers seated around them]

• So we now have a distribution, or can obtain one, for any quantity that we might care about in the clustering setting

Page 80

The Dirichlet Process Model

The CRP Prior, Gaussian Likelihood, Conjugate Prior

φk = (µk, Σk) ∼ N(a, b) ⊗ IW(α, β)
xi ∼ N(φk) for a data point i sitting at table k

Page 81

The Dirichlet Process Model

The CRP and the DP

OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP?

Important fact: The CRP is exchangeable.

Remember De Finetti’s Theorem: If (x1, x2, . . .) are infinitely exchangeable, then ∀n

p(x1, . . . , xn) = ∫ [ ∏_{i=1}^{n} p(xi | G) ] dP(G)

for some random variable G.

Page 83

The Dirichlet Process Model

The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

That means, when we integrate out G, we get the CRP.

p(θ1, . . . , θn) = ∫ [ ∏_{i=1}^{n} p(θi | G) ] dP(G)

[Graphical model: α and G0 → G; G → θi → xi (plate over N)]

Page 85

The Dirichlet Process Model

The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.

Page 86

The Dirichlet Process Model

The DP, CRP, and Stick-Breaking Process Summary

[Graphical model: α and G0 → G; G → θi → xi (plate over N)]

G ∼ DP(α, G0)

The Stick-Breaking Process gives just the weights of G.

The CRP describes the partitions of θ when G is marginalized out.

Page 87

Inference for the Dirichlet Process

Inference for the DP - Gibbs sampler

We introduce the indicators zi and use the CRP representation.

Randomly initialize Z, φ. Repeat:

1. Sample each zi from

zi|Z−i, φ, X ∝K∑

k=1

nkp(xi|φk)11zi=k + αf(xi|G0)11zi=K+1

2. Sample each φk based on Z and X only for occupied clusters.

This is the sampler we saw earlier, but now with some theoretical basis.
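As a concrete (and heavily simplified) sketch of this sampler, mine rather than the lecture’s: 1-D data, unit-variance Gaussian clusters, and base measure G0 = N(0, τ²), so both f(xi | G0) and the φk updates are available in closed form:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-4, 1, 40), rng.normal(4, 1, 40)])
alpha, tau2 = 1.0, 10.0            # concentration; base measure G0 = N(0, tau2)
z = np.zeros(len(X), dtype=int)    # start with everyone in one cluster
phi = [0.0]

for sweep in range(50):
    # 1. Reassign each point: weight n_k for occupied clusters, alpha for a new one.
    for i in range(len(X)):
        counts = np.bincount(np.delete(z, i), minlength=len(phi))
        with np.errstate(divide="ignore"):        # empty clusters get weight 0
            logp = np.log(np.append(counts, alpha))
        logp[:-1] += stats.norm.logpdf(X[i], phi, 1.0)
        logp[-1] += stats.norm.logpdf(X[i], 0.0, np.sqrt(tau2 + 1.0))  # f(x_i | G0)
        p = np.exp(logp - logp.max())
        k = rng.choice(len(p), p=p / p.sum())
        if k == len(phi):                         # new cluster: draw phi from its posterior
            prec = 1.0 / tau2 + 1.0
            phi.append(rng.normal(X[i] / prec, np.sqrt(1.0 / prec)))
        z[i] = k
    # Drop empty clusters and relabel.
    occupied = np.unique(z)
    phi = [phi[k] for k in occupied]
    z = np.searchsorted(occupied, z)
    # 2. Resample phi_k for occupied clusters only (conjugate normal posterior).
    for k in range(len(phi)):
        xk = X[z == k]
        prec = 1.0 / tau2 + len(xk)
        phi[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))

print(len(phi), np.round(np.sort(phi), 1))   # typically ~2 clusters near -4 and 4
```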

Page 88

Inference for the Dirichlet Process

MCMC in Action for the DP

What does this look like in action? [Matlab demo]

What about the other views of the Dirichlet Process?

Remember how in the finite case we had a π and (µ, Σ) associated with each cluster? When we let K → ∞, the expected value of each πi went to zero. However, as we have seen, the data points still cluster into groups. Why?

It turns out that even though in probability each πi is getting arbitrarily small, we will still sample some that are large. These πi can be drawn from a size-biased distribution known as the stick-breaking distribution.

[Figure: a unit stick broken into pieces β1, β2(1 − β1), . . .] (Slides courtesy of Michael Jordan)

Stick breaking continued

We now have an explicit formula for

πk = βk ∏_{l=1}^{k−1} (1 − βl)

Each of these corresponds to a cluster where (µk, Σk) are drawn from their prior. This infinite sample is a sample from the Dirichlet Process.

The Dirichlet Process

[Figure: a partition A1, . . . , A5 of Ω] This is the formal definition of the Dirichlet Process. It is not important that you understand it. I only include it for completeness.

Summary of the Dirichlet Process

• The Dirichlet Process is a nonparametric prior that we can use in clustering that allows us to simultaneously learn the number of clusters, which points are assigned to each cluster, and the cluster parameters.

• We presented a Gibbs sampler based on the Chinese Restaurant Process that samples clusters from this distribution.

[Matlab demo]

Page 89

Inference for the Dirichlet Process

Improvements to the MCMC algorithm

• Collapse out the φk if the model is conjugate.

• Split-merge algorithms.

Page 90

Inference for the Dirichlet Process

Summary: Nonparametric Bayesian clustering

• First specify the likelihood - application specific.

• Next specify a prior on all parameters - the Dirichlet Process!

• Exact posterior inference is intractable. Can use a Gibbs sampler for approximate inference. This is based on the CRP representation.

Page 91

Inference for the Dirichlet Process

Key ideas to be discussed today

• A parametric Bayesian approach to clustering
  • Defining the model
  • Markov Chain Monte Carlo (MCMC) inference

• A nonparametric approach to clustering
  • Defining the model - The Dirichlet Process!
  • MCMC inference

• Extensions

Page 92

Hierarchical Dirichlet Process

Hierarchical Bayesian Models

Original Bayesian idea: View parameters as random variables; place a prior on them.

“Problem”? Often the priors themselves need parameters.

Solution: Place a prior on these parameters!

Page 95

Hierarchical Dirichlet Process

Multiple Learning Problems

Example: xi ∼ N(θi, σ²) in m different groups.

[Graphical model: m groups; group i has parameter θi and observations xij, j = 1, . . . , Ni]

How to estimate θi for each group?

Page 96

Hierarchical Dirichlet Process

Multiple Learning Problems

Example: xi ∼ N(θi, σ²) in m different groups.

Treat the θi as random variables sampled from a common prior:

θi ∼ N(θ0, σ0²)

[Graphical model: θ0 → θ1, . . . , θm; group i has observations xij, j = 1, . . . , Ni]

Page 97

Hierarchical Dirichlet Process

Recall Plate Notation:

[Figure: plate notation: θ0 → θi → xij (plates over m groups and Ni observations) is equivalent to the expanded graph with θ1, . . . , θm drawn separately]

Page 98

Hierarchical Dirichlet Process

Let’s Be Bold!

Independent estimation ⇒ Hierarchical Bayesian

[Figure: the independent-estimation graphical model (a separate θi per group) next to the hierarchical model (a shared θ0 over the θi)]

What do we do if we have DPs for multiple related datasets?

[Figure: independent DPs, Gi ∼ DP(αi, Hi) for each dataset, next to a shared model with a common G0]

Page 101

Hierarchical Dirichlet Process

Attempt 1

[Graphical model: H → G0; α and G0 → Gi; Gi → θij → xij (plates over m groups and Ni observations)]

What kind of distribution do we use for G0? H?

Suppose θij are mean parameters for a Gaussian where

Gi ∼ DP(α, G0)

and G0 is a Gaussian with unknown mean?

G0 = N(θ0, σ0²)

This does NOT work! Why?

Page 103

Hierarchical Dirichlet Process

Attempt 1

[Graphical model: H → G0 → Gi → θij → xij]

The problem: If G0 is continuous, then with probability ONE, Gi and Gj will share ZERO atoms.

⇒ This means NO clustering!

[Figure: the atoms of Gi, Gj, and G0]

Page 104

Hierarchical Dirichlet Process

Attempt 2

[Graphical model: H → G0 → Gi → θij → xij]

So G0 must be discrete. What discrete prior can we use on G0?

How about a parametric prior?

Gee, if only we had a nonparametric prior on discrete measures...

Page 107

Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process

Solution:

[Figure: the HDP graphical model: γ and H generate G0; α and G0 generate each Gi; θij ∼ Gi generates xij; plates over the Ni observations and the m groups.]

G0 ∼ DP(γ, H)

Gi ∼ DP(α, G0)

θij | Gi ∼ Gi

xij | θij ∼ p(xij | θij)

(Teh, Jordan, Beal, Blei, 2004)
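To make these four lines concrete, here is a minimal truncated sampler (a sketch of my own: the truncation level K, Gaussian base H, and Gaussian likelihood are all assumptions; it uses the fact that, on G0's fixed atoms, DP(α, G0) reduces to a Dirichlet over the weights):

import numpy as np

rng = np.random.default_rng(6)
gamma, alpha, K, m, N = 1.0, 2.0, 50, 3, 20

# G0 ~ DP(gamma, H), truncated: stick-breaking weights pi over atoms phi_k ~ H
b = rng.beta(1.0, gamma, size=K)
pi = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
pi /= pi.sum()                                       # renormalize the truncation
phi = rng.normal(0.0, 5.0, size=K)                   # atoms phi_k ~ H = N(0, 25)

data = []
for i in range(m):
    pi_i = rng.dirichlet(alpha * pi)                 # Gi ~ DP(alpha, G0), same atoms
    z = rng.choice(K, size=N, p=pi_i)                # theta_ij = phi_{z_ij} ~ Gi
    data.append(rng.normal(phi[z], 1.0))             # x_ij ~ p(x | theta_ij) = N(theta_ij, 1)

print(np.round(data[0][:5], 2))

Because every Gi reuses G0's atoms, the same θ values (and hence the same clusters) can now appear in different groups.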


Hierarchical Dirichlet Process

G0 vs. Gi

Since

G0 ∼ DP(γ, H)

Gi ∼ DP(α, G0)

we know both are discrete measures over the same atoms:

G0 = ∑_{k=1}^∞ πk δφk

Gi = ∑_{k=1}^∞ πik δφk

What is the relationship between πk and πik?


Hierarchical Dirichlet Process

Relationship between πk and πik

Let (A1, . . . , Am) be a partition of Ω.

[Figure: the space Ω partitioned into cells A1, A2, A3, A4.]

By properties of the DP,

(Gi(A1), . . . , Gi(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am))

so, writing Kl for the set of atom indices k with φk ∈ Al,

(∑_{k∈K1} πik, . . . , ∑_{k∈Km} πik) ∼ Dir(α ∑_{k∈K1} πk, . . . , α ∑_{k∈Km} πk)
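As a quick numerical sanity check (a sketch of my own, not from the slides), take a finite top-level measure π, so that πi ∼ DP(α, π) is just a Dirichlet with parameters απk; aggregating its coordinates over a partition should then match the moments of Dir(α ∑_{k∈K1} πk, . . .):

import numpy as np

rng = np.random.default_rng(1)
alpha = 5.0
pi = np.array([0.4, 0.25, 0.15, 0.1, 0.1])        # top-level weights pi_k
cells = [[0, 2], [1, 4], [3]]                     # partition cells K_1, K_2, K_3

pi_i = rng.dirichlet(alpha * pi, size=200_000)    # draws of pi_i ~ DP(alpha, pi)
agg = np.stack([pi_i[:, c].sum(axis=1) for c in cells], axis=1)

g = np.array([pi[c].sum() for c in cells])        # aggregated masses sum_{k in K_l} pi_k
print(agg.mean(axis=0), g)                        # empirical vs. Dirichlet mean g
print(agg.var(axis=0), g * (1 - g) / (alpha + 1)) # empirical vs. Dirichlet variance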


Hierarchical Dirichlet Process

Stick-Breaking Construction for the HDP

Measure-level view:              Stick-breaking view:

G0 ∼ DP(γ, H)                    π ∼ GEM(γ)
Gi ∼ DP(α, G0)                   πi ∼ DP(α, π)
θij | Gi ∼ Gi                    φk ∼ H
xij | θij ∼ p(xij | θij)         zij ∼ πi
                                 xij ∼ p(xij | φ_{zij})

[Figure: the two equivalent graphical models side by side: the measure-level HDP (γ, H → G0; α → Gi; θij; xij) and its stick-breaking counterpart (γ → π; α, π → πi; zij ∼ πi; φk ∼ H; xij).]


Hierarchical Dirichlet Process

Stick-Breaking Construction for the HDP

Remember:

(∑_{k∈K1} πik, . . . , ∑_{k∈Km} πik) ∼ Dir(α ∑_{k∈K1} πk, . . . , α ∑_{k∈Km} πk)

Explicit relationship between π and πi:

βk ∼ Beta(1, γ)

πk = βk ∏_{j=1}^{k−1} (1 − βj)

βik ∼ Beta(απk, α(1 − ∑_{j=1}^{k} πj))

πik = βik ∏_{j=1}^{k−1} (1 − βij)
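In code, this construction reads almost line for line as follows (a minimal sketch assuming a finite truncation level K and numpy; the small clamp on the Beta parameter only guards against floating-point underflow in the truncated tail):

import numpy as np

rng = np.random.default_rng(2)

def gem(gamma, K):
    # beta_k ~ Beta(1, gamma);  pi_k = beta_k * prod_{j<k} (1 - beta_j)
    beta = rng.beta(1.0, gamma, size=K)
    return beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))

def child_weights(alpha, pi):
    # beta_ik ~ Beta(alpha * pi_k, alpha * (1 - sum_{j<=k} pi_j))
    tail = np.maximum(1.0 - np.cumsum(pi), 1e-12)
    beta_i = rng.beta(alpha * pi, alpha * tail)
    # pi_ik = beta_ik * prod_{j<k} (1 - beta_ij)
    return beta_i * np.concatenate(([1.0], np.cumprod(1.0 - beta_i)[:-1]))

pi = gem(gamma=2.0, K=100)
for alpha in (1.0, 5.0, 20.0, 100.0):
    pi_i = child_weights(alpha, pi)
    print(alpha, np.abs(pi_i - pi).sum())   # deviation from pi shrinks as alpha grows

The printout previews the next slide: each group's πi is a noisy copy of π, and the noise decreases as α grows.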


Hierarchical Dirichlet Process

The Effect of α

π ∼ GEM(γ), πi ∼ DP(α, π)

[Figure: bar plots of the top-level weights π for γ = 2, followed by three independent draws of πi for each of α = 1, 5, 20, 100. For small α the πi deviate wildly from π; as α grows, each πi concentrates ever more tightly around π.]


Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process

For the DP, we had:

• Mathematical definition: G ∼ DP(α, G0)
• Stick-breaking construction (gives just the weights)
• Chinese restaurant process: describes the partitions of the θi when G is marginalized out

For the HDP, we have:

• Mathematical definition
• Stick-breaking construction
• ?


Hierarchical Dirichlet Process

The Chinese Restaurant Franchise (CRF) - Step 1

First integrate out the Gi.

[Figure: the HDP graphical model before and after integrating out the Gi; afterwards the θij within each group depend directly on G0 (and on one another).]


Hierarchical Dirichlet Process

The Chinese Restaurant Franchise (CRF) - Step 1

What is the generative process when we integrate out Gi?

1. Draw the global measure G0 = ∑_{k=1}^∞ πk δφk.

2. Each group then acts like a separate CRP.

[Figure: the marginalized graphical model, plus the franchise picture: three restaurants, each seating its customers θj1, θj2, . . . at tables, where every table serves a dish φk drawn from the shared G0.]
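Here is a compact simulation of that seating process (a sketch of my own, not code from the slides; α is each restaurant's CRP concentration and γ the franchise-level one):

import numpy as np

rng = np.random.default_rng(3)
alpha, gamma = 1.0, 1.0
n_groups, n_customers = 3, 30

m_k = []          # number of tables serving dish k, across all restaurants
franchise = []    # per restaurant: list of [customers_at_table, dish_index]
for j in range(n_groups):
    tables = []
    for i in range(n_customers):
        # join an occupied table w.p. proportional to its size, or open a
        # new table w.p. proportional to alpha
        w = np.array([t[0] for t in tables] + [alpha], dtype=float)
        t_idx = rng.choice(len(w), p=w / w.sum())
        if t_idx == len(tables):
            # a new table picks its dish via a top-level CRP over dish counts
            d = np.array(m_k + [gamma], dtype=float)
            k = rng.choice(len(d), p=d / d.sum())
            if k == len(m_k):
                m_k.append(0)          # brand-new dish, phi_k ~ H
            m_k[k] += 1
            tables.append([0, k])
        tables[t_idx][0] += 1
    franchise.append(tables)

print("dishes in use:", len(m_k))
print("tables per restaurant:", [len(t) for t in franchise])

Because dishes are chosen from shared counts m_k, popular dishes recur across restaurants: exactly the cross-group clustering the HDP was built for.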


Hierarchical Dirichlet Process

The Chinese Restaurant Franchise (CRF)

First integrate out the Gi, then integrate out G0.

[Figure: the graphical model in three stages: the full HDP; the model with the Gi integrated out; and the model with both the Gi and G0 integrated out.]


Hierarchical Dirichlet Process

Chinese Restaurant Franchise (CRF)

[Figure: the Chinese restaurant franchise. Each restaurant (group) seats its customers θj1, θj2, . . . at tables, and every table serves a dish from the shared menu φ1, φ2, φ3, φ4; shown once with G0 explicit and once with G0 integrated out, where tables pick dishes through a top-level CRP.]


Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process

For the DP, we had:

• Mathematical definition: G ∼ DP(α, G0)
• Stick-breaking construction (gives just the weights)
• Chinese restaurant process: describes the partitions of the θi when G is marginalized out

For the HDP, we now have:

• Mathematical definition
• Stick-breaking construction
• Chinese restaurant franchise process


Hierarchical Dirichlet Process

Inference

Same classes of algorithms used for the DP:

• MCMC

  • CRF representation
  • Stick-breaking representation

• Variational

We will not go into these.


Hierarchical Dirichlet Process

Application of the HDP - Infinite Hidden Markov Model

Finite Hidden Markov Models (HMMs):

• m states s1, . . . , sm

• si has parameter φi with emission distribution y ∼ p(y | φi)

• an m × m transition matrix:

        s1    s2   · · ·   sm
  s1    π11   π12  · · ·   π1m
  s2    π21   π22  · · ·   π2m
  ...   ...   ...  . . .   ...
  sm    πm1   πm2  · · ·   πmm
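For concreteness, a minimal finite-HMM sampler (a sketch of my own with made-up parameters: m = 3 states and Gaussian emissions y ∼ N(φi, 1) are assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(4)
m = 3
trans = rng.dirichlet(np.ones(m), size=m)   # m x m transition matrix, rows pi_i
phi = rng.normal(0.0, 5.0, size=m)          # emission parameter phi_i per state

x, ys = 0, []
for t in range(10):
    x = rng.choice(m, p=trans[x])           # x_t ~ pi_{x_{t-1}}
    ys.append(rng.normal(phi[x], 1.0))      # y_t ~ p(y | phi_{x_t})
print(np.round(ys, 2))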

How do we let m → ∞?


Hierarchical Dirichlet Process

Application of the HDP - Infinite Hidden Markov Model

How do we let m → ∞?

Think a bit outside the traditional clustering context.

Let each state si correspond to a group.

π | γ ∼ GEM(γ)

πi | α, π ∼ DP(α, π)

φk | H ∼ H

xt | xt−1, (πi)∞i=1 ∼ π_{xt−1}

yt | xt, (φk)∞k=1 ∼ p(yt | φ_{xt})
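A minimal truncated sampler for this generative process (a sketch of my own: the truncation at K states and the Gaussian choices for H and p(y | φ) are assumptions):

import numpy as np

rng = np.random.default_rng(5)
gamma, alpha, K, T = 2.0, 5.0, 30, 200

beta = rng.beta(1.0, gamma, size=K)                    # pi ~ GEM(gamma), truncated
pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
pi /= pi.sum()                                         # renormalize the truncation

rows = rng.dirichlet(alpha * pi, size=K)               # row i: pi_i ~ DP(alpha, pi)
phi = rng.normal(0.0, 5.0, size=K)                     # phi_k ~ H = N(0, 25)

x, xs, ys = 0, [], []
for t in range(T):
    x = rng.choice(K, p=rows[x])                       # x_t ~ pi_{x_{t-1}}
    xs.append(x)
    ys.append(rng.normal(phi[x], 1.0))                 # y_t ~ p(y_t | phi_{x_t})
print("distinct states visited:", len(np.unique(xs))) # typically far fewer than K

The shared top-level π ties the rows of the transition matrix together, so the chain keeps revisiting a common, data-determined set of states rather than fixing m in advance.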


Questions?

Great set of references for the Machine Learning community: http://npbayes.wikidot.com/references

Includes both the “classics” and modern applications.
