Nonparametric Bayesian Statistics
Tamara Broderick
ITT Career Development Assistant Professor, Electrical Engineering & Computer Science
MIT
Nonparametric Bayes
• Bayesian statistics that is not parametric (wait!)
• Bayesian: P(parameters | data) ∝ P(data | parameters) P(parameters)
• Not parametric (i.e., not a finite number of parameters; an unbounded/growing/infinite number of parameters)
[Image credits: wikipedia.org; Ed Bowlby, NOAA]
[Escobar, West 1995; Ghosal et al. 1999; Arjas, Gasbarra 1994; Fox et al. 2014; Ewens 1972; Hartl, Clark 2003; Saria et al. 2010; Lloyd et al. 2012; Miller et al. 2009; Sudderth, Jordan 2009]
1
Nonparametric Bayes
• A theoretical motivation: De Finetti's Theorem
• A data sequence $X_1, X_2, \dots$ is infinitely exchangeable if the distribution of any $N$ data points doesn't change when they are permuted: for any permutation $\sigma$ of $\{1, \dots, N\}$,
$$p(X_1, \dots, X_N) = p(X_{\sigma(1)}, \dots, X_{\sigma(N)})$$
• De Finetti's Theorem (roughly): A sequence is infinitely exchangeable if and only if, for all $N$ and some distribution $P$,
$$p(X_1, \dots, X_N) = \int_{\theta} \prod_{n=1}^{N} p(X_n \mid \theta) \, P(d\theta)$$
• Motivates: parameters and likelihoods; priors; "nonparametric Bayesian" priors
[Hewitt, Savage 1955; Aldous 1983]
2
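The theorem has a direct generative reading: draw $\theta$ once from $P$, then draw the data i.i.d. given $\theta$. A minimal sketch in R, assuming an illustrative Beta prior and Bernoulli likelihood (these particular choices are not from the slides):

```r
# De Finetti's representation, generatively: theta ~ P, then X_n iid given theta.
set.seed(1)
theta <- rbeta(1, 2, 2)                   # theta ~ P; Beta(2, 2) is an illustrative choice
x <- rbinom(20, size = 1, prob = theta)   # X_1, ..., X_20 iid given theta
# Any permutation of x has the same joint probability: the sequence is exchangeable.
```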
Outline
• Dirichlet process
  • Background for intuition
  • Generative model
  • What does a growing/infinite number of parameters really mean (in nonparametric Bayes)?
  • Chinese restaurant process
  • Inference
• Venture further into the wild world of nonparametric Bayesian statistics
3
Generative model
P(parameters | data) ∝ P(data | parameters) P(parameters)
• Finite Gaussian mixture model ($K = 2$ clusters):
$$\rho_1 \sim \mathrm{Beta}(a_1, a_2), \quad \rho_2 = 1 - \rho_1$$
$$\mu_k \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \Sigma_0)$$
$$z_n \overset{\text{iid}}{\sim} \mathrm{Categorical}(\rho_1, \rho_2)$$
$$x_n \overset{\text{indep}}{\sim} \mathcal{N}(\mu_{z_n}, \Sigma)$$
• Don't know $\mu_1, \mu_2$
• Don't know $\rho_1, \rho_2$
• Inference goal: assignments of data points to clusters, cluster parameters
4
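To make the model concrete, here is a minimal sketch in R that simulates the $K = 2$ generative process above, in one dimension for simplicity; all hyperparameter values are illustrative:

```r
# Simulate the K = 2 finite Gaussian mixture generative model (1-D case).
set.seed(1)
N <- 200
a1 <- 1; a2 <- 1                 # Beta hyperparameters (illustrative)
mu0 <- 0; sigma0 <- 5            # prior on cluster means
sigma <- 1                       # within-cluster standard deviation
rho1 <- rbeta(1, a1, a2)
rho <- c(rho1, 1 - rho1)                          # (rho_1, rho_2)
mu <- rnorm(2, mean = mu0, sd = sigma0)           # mu_k iid N(mu0, sigma0^2)
z <- sample(1:2, N, replace = TRUE, prob = rho)   # z_n iid Categorical(rho)
x <- rnorm(N, mean = mu[z], sd = sigma)           # x_n indep N(mu_{z_n}, sigma^2)
hist(x, breaks = 30)             # two bumps when the means are well separated
```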
Beta distribution review
$$\mathrm{Beta}(\rho_1 \mid a_1, a_2) = \frac{\Gamma(a_1 + a_2)}{\Gamma(a_1)\Gamma(a_2)} \, \rho_1^{a_1 - 1} (1 - \rho_1)^{a_2 - 1}, \quad a_1, a_2 > 0, \quad \rho_1 \in (0, 1)$$
• Gamma function:
  • for integer $m$: $\Gamma(m) = (m - 1)!$
  • for $x > 0$: $\Gamma(x + 1) = x \, \Gamma(x)$
• What happens to the density as
  • $a = a_1 = a_2 \to 0$
  • $a = a_1 = a_2 \to \infty$
  • $a_1 > a_2$
  [plot: Beta densities; x-axis $\rho_1$, y-axis density] [demo]
• Beta is conjugate to Categorical: $\rho_1 \sim \mathrm{Beta}(a_1, a_2)$, $z \sim \mathrm{Cat}(\rho_1, \rho_2)$
$$p(\rho_1, z) \propto \rho_1^{\mathbb{1}\{z=1\}} (1 - \rho_1)^{\mathbb{1}\{z=2\}} \cdot \rho_1^{a_1 - 1} (1 - \rho_1)^{a_2 - 1}$$
$$p(\rho_1 \mid z) \propto \rho_1^{a_1 + \mathbb{1}\{z=1\} - 1} (1 - \rho_1)^{a_2 + \mathbb{1}\{z=2\} - 1} \propto \mathrm{Beta}(\rho_1 \mid a_1 + \mathbb{1}\{z=1\}, \, a_2 + \mathbb{1}\{z=2\})$$
5
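A minimal sketch of the bracketed [demo], using base R's dbeta to see the limiting behaviors; the particular parameter values are illustrative:

```r
# How the Beta(a1, a2) density changes shape (the [demo] above).
rho1 <- seq(0.001, 0.999, length.out = 500)
par(mfrow = c(2, 2))
plot(rho1, dbeta(rho1, 0.1, 0.1), type = "l", main = "a -> 0: mass near 0 and 1")
plot(rho1, dbeta(rho1, 50, 50), type = "l", main = "a -> infinity: concentrates at 1/2")
plot(rho1, dbeta(rho1, 5, 2), type = "l", main = "a1 > a2: skews toward 1")
plot(rho1, dbeta(rho1, 1, 1), type = "l", main = "a = 1: uniform")
```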
Generative model
P(parameters | data) ∝ P(data | parameters) P(parameters)
• Finite Gaussian mixture model ($K$ clusters):
$$\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$$
$$\mu_k \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \Sigma_0)$$
$$z_n \overset{\text{iid}}{\sim} \mathrm{Categorical}(\rho_{1:K})$$
$$x_n \overset{\text{indep}}{\sim} \mathcal{N}(\mu_{z_n}, \Sigma)$$
6
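A sketch extending the $K = 2$ simulation above to general $K$, using the standard trick of drawing a Dirichlet via normalized Gamma variables (hyperparameter values are illustrative):

```r
# Simulate the K-cluster finite Gaussian mixture generative model (1-D case).
set.seed(1)
N <- 500; K <- 10
a <- rep(1, K)                                   # Dirichlet hyperparameters (illustrative)
g <- rgamma(K, shape = a)
rho <- g / sum(g)                                # rho_{1:K} ~ Dirichlet(a_{1:K})
mu <- rnorm(K, mean = 0, sd = 5)                 # mu_k iid N(mu0, sigma0^2)
z <- sample(1:K, N, replace = TRUE, prob = rho)  # z_n iid Categorical(rho_{1:K})
x <- rnorm(N, mean = mu[z], sd = 1)              # x_n indep N(mu_{z_n}, sigma^2)
```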
Dirichlet distribution review
$$\mathrm{Dirichlet}(\rho_{1:K} \mid a_{1:K}) = \frac{\Gamma\!\left(\sum_{k=1}^K a_k\right)}{\prod_{k=1}^K \Gamma(a_k)} \prod_{k=1}^K \rho_k^{a_k - 1}, \quad a_k > 0, \quad \rho_k \in (0, 1), \quad \sum_k \rho_k = 1$$
• What happens to the density as
  • $a = a_k \to 0$
  • $a = a_k = 1$
  • $a = a_k \to \infty$
  [plot: Dirichlet densities over $(\rho_1, \rho_2)$ for $a = (0.5, 0.5, 0.5)$, $a = (5, 5, 5)$, $a = (40, 10, 10)$] [demo]
• Dirichlet is conjugate to Categorical: $\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$, $z \sim \mathrm{Cat}(\rho_{1:K})$
$$\rho_{1:K} \mid z \overset{d}{=} \mathrm{Dirichlet}(a'_{1:K}), \quad a'_k = a_k + \mathbb{1}\{z = k\}$$
6
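A minimal sketch of the conjugacy statement: draw $\rho_{1:K}$, observe one categorical draw $z$, and form the updated Dirichlet parameters (the values of a are illustrative):

```r
# Dirichlet-Categorical conjugacy: a'_k = a_k + 1{z = k}.
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }
a <- c(0.5, 0.5, 0.5)
rho <- rdirichlet1(a)                        # rho_{1:K} ~ Dirichlet(a)
z <- sample(seq_along(rho), 1, prob = rho)   # z ~ Cat(rho_{1:K})
a_post <- a + (seq_along(a) == z)            # posterior parameters a'_{1:K}
a_post
```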
What if K ≫ N?
• e.g., species sampling, topic modeling, groups on a social network, etc.
  [figure: component frequencies $\rho_1, \rho_2, \rho_3, \dots, \rho_{1000}$]
• Components: number of latent groups
• Clusters: number of components represented in the data
• Number of clusters for N data points is < K and random
• Number of clusters grows with N
[demo 1, demo 2]
7
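A sketch of the kind of experiment behind [demo 1, demo 2]: with $K = 1000$ components and small $a_k$ (illustrative values), the number of distinct clusters among the first N draws is random, below K, and grows with N:

```r
# Count distinct clusters among the first N draws when K >> N.
set.seed(2)
K <- 1000
a <- rep(0.01, K)                       # small a_k: a few components dominate
g <- rgamma(K, shape = a)
rho <- g / sum(g)                       # rho_{1:K} ~ Dirichlet(a_{1:K})
z <- sample(1:K, 500, replace = TRUE, prob = rho)
n_clusters <- sapply(1:500, function(N) length(unique(z[1:N])))
plot(1:500, n_clusters, type = "s", xlab = "N", ylab = "number of clusters")
```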
Choosing K = ∞
• Here, it is difficult to choose a finite K in advance (contrast with small K): we don't know K, it is difficult to infer, and data may be streaming
• How to generate K = ∞ strictly positive frequencies that sum to one?
• Observation: if $\rho_{1:K} \sim \mathrm{Dirichlet}(a_{1:K})$, then
$$\rho_1 \overset{d}{=} \mathrm{Beta}\!\left(a_1, \sum_{k=1}^K a_k - a_1\right), \quad \frac{(\rho_2, \dots, \rho_K)}{1 - \rho_1} \overset{d}{=} \mathrm{Dirichlet}(a_2, \dots, a_K)$$
• "Stick breaking" (illustrated for K = 4):
$$V_1 \sim \mathrm{Beta}(a_1, a_2 + a_3 + a_4), \quad \rho_1 = V_1$$
$$V_2 \sim \mathrm{Beta}(a_2, a_3 + a_4), \quad \rho_2 = (1 - V_1) V_2$$
$$V_3 \sim \mathrm{Beta}(a_3, a_4), \quad \rho_3 = (1 - V_1)(1 - V_2) V_3$$
$$\rho_4 = 1 - \sum_{k=1}^{3} \rho_k$$
8
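A minimal sketch of the stick-breaking identity for K = 4 (the vector a is illustrative); the output is one draw from Dirichlet($a_{1:4}$), built one break at a time:

```r
# Stick-breaking a Dirichlet(a_1, ..., a_4) draw, one beta variable per break.
a <- c(2, 1, 3, 0.5)
V1 <- rbeta(1, a[1], sum(a[2:4])); rho1 <- V1
V2 <- rbeta(1, a[2], sum(a[3:4])); rho2 <- (1 - V1) * V2
V3 <- rbeta(1, a[3], a[4]);        rho3 <- (1 - V1) * (1 - V2) * V3
rho4 <- 1 - (rho1 + rho2 + rho3)   # the last piece of the stick
c(rho1, rho2, rho3, rho4)
```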
Choosing K = ∞
• Here, it is difficult to choose a finite K in advance (contrast with small K): we don't know K, it is difficult to infer, and data may be streaming
• How to generate K = ∞ strictly positive frequencies that sum to one?
• Dirichlet process stick-breaking:
$$V_1 \sim \mathrm{Beta}(a_1, b_1), \quad \rho_1 = V_1$$
$$V_2 \sim \mathrm{Beta}(a_2, b_2), \quad \rho_2 = (1 - V_1) V_2$$
$$\vdots$$
$$V_k \sim \mathrm{Beta}(a_k, b_k), \quad \rho_k = \left[\prod_{j=1}^{k-1} (1 - V_j)\right] V_k$$
• Griffiths-Engen-McCloskey (GEM) distribution: $a_k = 1$, $b_k = \alpha > 0$
$$\rho = (\rho_1, \rho_2, \dots) \sim \mathrm{GEM}(\alpha)$$
[McCloskey 1965; Engen 1975; Patil and Taillie 1977; Ewens 1987; Sethuraman 1994; Ishwaran, James 2001]
9
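A minimal sketch of a GEM(α) simulator (this also previews the first exercise below); the truncation level K_max is an illustrative practical choice, since the true sequence is infinite:

```r
# Truncated GEM(alpha) stick-breaking: V_k ~ Beta(1, alpha), rho_k = V_k * prod_{j<k}(1 - V_j).
rgem <- function(alpha, K_max = 100) {
  V <- rbeta(K_max, 1, alpha)
  V * cumprod(c(1, 1 - V[-K_max]))
}
rho <- rgem(alpha = 3)
sum(rho)   # close to 1; the shortfall is the still-unbroken remainder of the stick
```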
Exercises
• Code your own GEM simulator to draw ρ
• Simulate drawing cluster indicators (z) from the distribution you generated in the first exercise
• Compare the growth in the number of clusters as N changes in the GEM case with the growth in the K = 1000 case
• How does the expected number of clusters in the GEM case change with N and with the GEM parameter α?
10
References for Part 1, page 1
11
DJ Aldous. Exchangeability and related topics. Springer, 1983.
E Arjas and D Gasbarra. Nonparametric Bayesian inference from right censored survival data, using the Gibbs sampler. Statistica Sinica, 1994.
E Bowlby. NOAA/Olympic Coast NMS; NOAA/OAR/Office of Ocean Exploration - NOAA Photo Library. Retrieved from: https://en.wikipedia.org/wiki/Opisthoteuthis_californiana#/media/File:Opisthoteuthis_californiana.jpg
S Engen. A note on the geometric series as a species frequency model. Biometrika, 1975.
W Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 1972.
W Ewens. Population genetics theory -- the past and the future. Mathematical and Statistical Developments of Evolutionary Theory, 1987.
EB Fox, personal website. Retrieved from: http://www.stat.washington.edu/~ebfox/research.html. Associated paper: EB Fox, MC Hughes, EB Sudderth, and MI Jordan. The Annals of Applied Statistics, 2014.
S Ghosal, JK Ghosh, and RV Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 1999.
DL Hartl and AG Clark. Principles of Population Genetics, Fourth Edition. 2003.
E Hewitt and LJ Savage. Symmetric measures on Cartesian products. Transactions of the American Mathematical Society, 1955.
H Ishwaran and LF James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 2001.
JR Lloyd, P Orbanz, Z Ghahramani, and DM Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. NIPS, 2012.
References for Part 1, page 2
12
JW McCloskey. A model for the distribution of individuals by species in an environment. Ph.D. thesis, Michigan State University, 1965.
K Miller, MI Jordan, and TL Griffiths. Nonparametric latent feature models for link prediction. NIPS, 2009.
GP Patil and C Taillie. Diversity as a concept and its implications for random communities. Bulletin of the International Statistical Institute, 1977.
S Saria, D Koller, and A Penn. Learning individual and population traits from clinical temporal data. NIPS, 2010.
J Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 1994.
EB Sudderth and MI Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. NIPS, 2009.