Nonparametric Bayesian Statistics
Tamara Broderick
ITT Career Development Assistant Professor, Electrical Engineering & Computer Science
MIT
Nonparametric Bayes
• Bayesian statistics that is not parametric (wait!)
• Bayesian: P(parameters | data) ∝ P(data | parameters) P(parameters)
• Not parametric (i.e., not a finite number of parameters; an unbounded/growing/infinite number of parameters)
[Application images and citations: wikipedia.org; Ed Bowlby, NOAA; Escobar, West 1995; Ghosal et al. 1999; Arjas, Gasbarra 1994; Fox et al. 2014; Ewens 1972; Hartl, Clark 2003; Saria et al. 2010; Lloyd et al. 2012; Miller et al. 2010; Sudderth, Jordan 2009]
Nonparametric Bayes
• A theoretical motivation: De Finetti’s Theorem
• A data sequence X_1, X_2, … is infinitely exchangeable if the distribution of any N data points doesn’t change when permuted: for any permutation σ,
  p(X_1, …, X_N) = p(X_{σ(1)}, …, X_{σ(N)})
• De Finetti’s Theorem (roughly): a sequence is infinitely exchangeable if and only if, for all N and some distribution P,
  p(X_1, …, X_N) = ∫_θ ∏_{n=1}^{N} p(X_n | θ) P(dθ)
• Motivates: parameters and likelihoods; priors; “nonparametric Bayesian” priors
[Hewitt, Savage 1955; Aldous 1983]
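As a concrete instance of the representation above, here is a small sketch we add for illustration (not from the slides): mixing i.i.d. Bernoulli(θ) draws over θ ∼ Beta(a, b) gives a joint that depends on the data only through the number of ones, so it is invariant to permutation, exactly as exchangeability requires. The helper name `joint` and the Beta(2, 3) mixing choice are ours.

```r
# Hedged illustration: exchangeable binary data as a Beta mixture of Bernoullis.
# Integrating theta^s (1 - theta)^(N - s) against Beta(theta | a, b) gives
# p(x_1, ..., x_N) = B(a + s, b + N - s) / B(a, b), where s = sum(x).
a <- 2; b <- 3                          # illustrative hyperparameters
joint <- function(x) {                  # hypothetical helper, not from the slides
  s <- sum(x); N <- length(x)
  beta(a + s, b + N - s) / beta(a, b)   # base-R beta() is the Beta function
}
joint(c(1, 0, 0, 1))                    # depends on x only through s = 2 ...
joint(c(0, 1, 1, 0))                    # ... so permuting x gives the same value
```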
Outline
• Dirichlet process
  • Background for intuition
  • Generative model
  • What does a growing/infinite number of parameters really mean (in nonparametric Bayes)?
  • Chinese restaurant process
  • Inference
• Venture further into the wild world of nonparametric Bayesian statistics
Generative model
P(parameters | data) ∝ P(data | parameters) P(parameters)
• Finite Gaussian mixture model (K = 2 clusters):
  ρ_1 ∼ Beta(a_1, a_2),  ρ_2 = 1 − ρ_1
  μ_k ∼ N(μ_0, Σ_0), i.i.d. over k
  z_n ∼ Categorical(ρ_1, ρ_2), i.i.d. over n
  x_n ∼ N(μ_{z_n}, Σ), independently over n
• Don’t know μ_1, μ_2
• Don’t know ρ_1, ρ_2
• Inference goal: assignments of data points to clusters, and the cluster parameters
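To make the generative story concrete, here is a minimal forward-simulation sketch of this K = 2 model. All hyperparameter values are illustrative choices of ours, not from the slides, and it assumes the recommended MASS package (shipped with R) for multivariate normal draws.

```r
# Minimal sketch: forward-simulate the K = 2 Gaussian mixture above.
set.seed(1)
N <- 200
a1 <- 1; a2 <- 1                        # Beta hyperparameters (illustrative)
mu0 <- c(0, 0); Sigma0 <- 9 * diag(2)   # prior on cluster means (illustrative)
Sigma <- 0.5 * diag(2)                  # observation covariance (illustrative)

rho1 <- rbeta(1, a1, a2)                # rho_1 ~ Beta(a1, a2); rho_2 = 1 - rho_1
mu <- lapply(1:2, function(k) MASS::mvrnorm(1, mu0, Sigma0))   # mu_k ~ N(mu0, Sigma0)
z <- sample(1:2, N, replace = TRUE, prob = c(rho1, 1 - rho1))  # z_n ~ Categorical
x <- t(sapply(z, function(zn) MASS::mvrnorm(1, mu[[zn]], Sigma)))  # x_n ~ N(mu_{z_n}, Sigma)

plot(x, col = z, pch = 19, xlab = "x[,1]", ylab = "x[,2]")     # one draw of the data
```

Inference then runs this story in reverse: given only x, recover plausible z and μ.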
Beta distribution review
  Beta(ρ_1 | a_1, a_2) = [Γ(a_1 + a_2) / (Γ(a_1) Γ(a_2))] ρ_1^{a_1 − 1} (1 − ρ_1)^{a_2 − 1},   a_1, a_2 > 0,  ρ_1 ∈ (0, 1)
• Gamma function:
  • integer m: Γ(m) = (m − 1)!
  • for x > 0: Γ(x + 1) = x Γ(x)
• What happens? [demo]
  • a = a_1 = a_2 → 0
  • a = a_1 = a_2 → ∞
  • a_1 > a_2
• Beta is conjugate to Categorical:
  ρ_1 ∼ Beta(a_1, a_2),  z ∼ Categorical(ρ_1, ρ_2)
  p(ρ_1, z) ∝ ρ_1^{1{z=1}} (1 − ρ_1)^{1{z=2}} · ρ_1^{a_1 − 1} (1 − ρ_1)^{a_2 − 1}
  p(ρ_1 | z) ∝ ρ_1^{a_1 + 1{z=1} − 1} (1 − ρ_1)^{a_2 + 1{z=2} − 1} ∝ Beta(ρ_1 | a_1 + 1{z=1}, a_2 + 1{z=2})
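The bracketed R demo is not included in this transcript; the sketch below is our illustrative stand-in (parameter values our own). It plots the three regimes asked about above and performs the one-observation conjugate update.

```r
# Sketch of the Beta regimes and the conjugate update (illustrative values).
rho <- seq(0.001, 0.999, length.out = 500)
op <- par(mfrow = c(1, 3))
for (a in c(0.1, 1, 10)) {   # a1 = a2 = a: a -> 0 piles mass near 0 and 1; a -> Inf peaks at 1/2
  plot(rho, dbeta(rho, a, a), type = "l", xlab = expression(rho[1]),
       ylab = "density", main = paste0("Beta(", a, ", ", a, ")"))
}
par(op)

# Conjugacy: one categorical observation bumps the matching Beta parameter by 1.
a1 <- 2; a2 <- 2
rho1 <- rbeta(1, a1, a2)
z <- sample(1:2, 1, prob = c(rho1, 1 - rho1))
c(a1 + (z == 1), a2 + (z == 2))  # parameters of the Beta posterior p(rho1 | z)
```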
Generative model
P(parameters | data) ∝ P(data | parameters) P(parameters)
• Finite Gaussian mixture model (K clusters):
  ρ_{1:K} ∼ Dirichlet(a_{1:K})
  μ_k ∼ N(μ_0, Σ_0), i.i.d. over k
  z_n ∼ Categorical(ρ_{1:K}), i.i.d. over n
  x_n ∼ N(μ_{z_n}, Σ), independently over n
Dirichlet distribution review
  Dirichlet(ρ_{1:K} | a_{1:K}) = [Γ(∑_{k=1}^K a_k) / ∏_{k=1}^K Γ(a_k)] ∏_{k=1}^K ρ_k^{a_k − 1},   a_k > 0,  ρ_k ∈ (0, 1),  ∑_k ρ_k = 1
• What happens? [demo]
  • a = a_k → 0
  • a = a_k = 1
  • a = a_k → ∞
  [Density plots on the simplex over (ρ_1, ρ_2): a = (0.5, 0.5, 0.5); a = (5, 5, 5); a = (40, 10, 10)]
• Dirichlet is conjugate to Categorical:
  ρ_{1:K} ∼ Dirichlet(a_{1:K}),  z ∼ Categorical(ρ_{1:K})
  ρ_{1:K} | z ∼ Dirichlet(a′_{1:K}),  where a′_k = a_k + 1{z = k}
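As a stand-in for the demo, here is a minimal sketch of drawing from a Dirichlet and applying the conjugate update. Base R has no Dirichlet sampler, so the hypothetical helper `rdirichlet1` (our name) uses the standard construction of normalizing independent Gamma variates.

```r
# Sketch: one Dirichlet draw via normalized Gammas, then the conjugate update.
rdirichlet1 <- function(a) {           # if g_k ~ Gamma(a_k, 1), then g / sum(g) ~ Dirichlet(a)
  g <- rgamma(length(a), shape = a, rate = 1)
  g / sum(g)
}

a <- c(0.5, 0.5, 0.5)                  # try also c(5, 5, 5) and c(40, 10, 10), as on the slide
rho <- rdirichlet1(a)
z <- sample(seq_along(rho), 1, prob = rho)
a_post <- a + (seq_along(a) == z)      # rho | z ~ Dirichlet(a_post): add 1 to the chosen component
```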
What if K ≫ N?
• E.g., species sampling, topic modeling, groups on a social network, etc.
  [Bar plot of component frequencies ρ_1, ρ_2, ρ_3, …, ρ_1000]
• Components: number of latent groups
• Clusters: number of components represented in the data
• Number of clusters for N data points is < K and random
• Number of clusters grows with N
• [demo 1, demo 2]
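The two demos are not included in this transcript; the sketch below is our illustrative version of the experiment (K = 1000 and a_k = 0.01 are our choices). It draws component frequencies once, then counts how many components are actually represented among N indicator draws: fewer than K, random, and growing with N.

```r
# Sketch of the K >> N experiment with a symmetric Dirichlet, K = 1000.
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a, rate = 1); g / sum(g) }

set.seed(2)
K <- 1000
rho <- rdirichlet1(rep(0.01, K))       # small a_k: a few components dominate (illustrative)
for (N in c(10, 100, 1000)) {
  z <- sample(1:K, N, replace = TRUE, prob = rho)
  cat("N =", N, "-> clusters used:", length(unique(z)), "of", K, "\n")
}
```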
Choosing K = ∞
• Here, it is difficult to choose a finite K in advance (contrast with small K): we don’t know K, it is difficult to infer, and data may be streaming
• How can we generate K = ∞ strictly positive frequencies that sum to one?
• Observation: if ρ_{1:K} ∼ Dirichlet(a_{1:K}), then
  ρ_1 ∼ Beta(a_1, ∑_{k=1}^K a_k − a_1)  and  (ρ_2, …, ρ_K) / (1 − ρ_1) ∼ Dirichlet(a_2, …, a_K)
• “Stick breaking” (shown for K = 4):
  V_1 ∼ Beta(a_1, a_2 + a_3 + a_4),  ρ_1 = V_1
  V_2 ∼ Beta(a_2, a_3 + a_4),  ρ_2 = (1 − V_1) V_2
  V_3 ∼ Beta(a_3, a_4),  ρ_3 = (1 − V_1)(1 − V_2) V_3
  ρ_4 = 1 − ∑_{k=1}^3 ρ_k
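Here is a minimal sketch of the finite stick-breaking construction above for K = 4 (the Dirichlet parameter values are our illustrative choice). Each step breaks off a Beta-distributed fraction of the stick that remains.

```r
# Sketch: finite stick breaking for K = 4, as on the slide.
# V_k ~ Beta(a_k, a_{k+1} + ... + a_K), rho_k = V_k * prod_{j<k} (1 - V_j).
a <- c(2, 3, 4, 5)                     # illustrative Dirichlet parameters
K <- length(a)
V <- sapply(1:(K - 1), function(k) rbeta(1, a[k], sum(a[(k + 1):K])))
rho <- numeric(K)
stick <- 1                             # length of stick remaining
for (k in 1:(K - 1)) {
  rho[k] <- stick * V[k]
  stick <- stick * (1 - V[k])
}
rho[K] <- stick                        # rho_K = 1 - sum of the earlier pieces
sum(rho)                               # exactly 1: rho is one Dirichlet(a) draw
```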
Choosing K = ∞
• Here, it is difficult to choose a finite K in advance (contrast with small K): we don’t know K, it is difficult to infer, and data may be streaming
• How can we generate K = ∞ strictly positive frequencies that sum to one?
• Dirichlet process stick-breaking:
  V_1 ∼ Beta(a_1, b_1),  ρ_1 = V_1
  V_2 ∼ Beta(a_2, b_2),  ρ_2 = (1 − V_1) V_2
  …
  V_k ∼ Beta(a_k, b_k),  ρ_k = [∏_{j=1}^{k−1} (1 − V_j)] V_k
• Griffiths-Engen-McCloskey (GEM) distribution: take a_k = 1, b_k = α > 0; then
  ρ = (ρ_1, ρ_2, …) ∼ GEM(α)
[McCloskey 1965; Engen 1975; Patil and Taillie 1977; Ewens 1987; Sethuraman 1994; Ishwaran, James 2001]
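A GEM draw has infinitely many components, so any simulator must truncate; the sketch below (our illustration, with a helper name `rgem` of our choosing) stops once the leftover stick mass is negligible, which leaves the returned frequencies summing to just under one.

```r
# Sketch: truncated GEM(alpha) draw via stick breaking with V_k ~ Beta(1, alpha).
rgem <- function(alpha, eps = 1e-10) {
  rho <- c(); stick <- 1               # stick = mass not yet broken off
  while (stick > eps) {
    V <- rbeta(1, 1, alpha)            # a_k = 1, b_k = alpha
    rho <- c(rho, stick * V)
    stick <- stick * (1 - V)
  }
  rho
}
rho <- rgem(alpha = 3)
head(rho); sum(rho)                    # frequencies decay stochastically; sum is near 1
```

Smaller α breaks off big pieces early (a few dominant clusters); larger α spreads mass over many small pieces.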
Exercises
• Code your own GEM simulator to draw ρ
• Simulate drawing cluster indicators (z) from the distribution you generated in the first exercise
• Compare the growth in the number of clusters as N changes in the GEM case with the growth in the K = 1000 case
• How does the expected number of clusters in the GEM case change with N and with the GEM parameter α? (A starter sketch for the simulation exercises follows.)
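One possible starting point for the simulation exercises (self-contained; α, ε, and the N grid are our illustrative choices, and the last question above is left for you):

```r
# Starter sketch: cluster growth under a GEM draw, to compare with the K = 1000 demo.
rgem <- function(alpha, eps = 1e-8) {  # truncated GEM(alpha), as sketched earlier
  rho <- c(); stick <- 1
  while (stick > eps) {
    V <- rbeta(1, 1, alpha)
    rho <- c(rho, stick * V)
    stick <- stick * (1 - V)
  }
  rho
}

rho <- rgem(alpha = 3)
for (N in c(10, 100, 1000, 10000)) {
  z <- sample(seq_along(rho), N, replace = TRUE, prob = rho)  # sample() renormalizes prob
  cat("N =", N, "-> clusters:", length(unique(z)), "\n")
}
# Repeat over several alpha values (and several seeds) to study the expected growth.
```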
References for Part 1
DJ Aldous. Exchangeability and related topics. Springer, 1983.
E Arjas and D Gasbarra. Nonparametric Bayesian inference from right censored survival data, using the Gibbs sampler. Statistica Sinica, 1994.
E Bowlby. NOAA/Olympic Coast NMS; NOAA/OAR/Office of Ocean Exploration - NOAA Photo Library. Retrieved from: https://en.wikipedia.org/wiki/Opisthoteuthis_californiana#/media/File:Opisthoteuthis_californiana.jpg
S Engen. A note on the geometric series as a species frequency model. Biometrika, 1975.
MD Escobar and M West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 1995.
W Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 1972.
W Ewens. Population genetics theory -- the past and the future. Mathematical and Statistical Developments of Evolutionary Theory, 1987.
EB Fox, MC Hughes, EB Sudderth, and MI Jordan. The Annals of Applied Statistics, 2014. Associated images retrieved from EB Fox's personal website: http://www.stat.washington.edu/~ebfox/research.html
S Ghosal, JK Ghosh, and RV Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 1999.
DL Hartl and AG Clark. Principles of Population Genetics, Fourth Edition. 2003.
E Hewitt and LJ Savage. Symmetric measures on Cartesian products. Transactions of the American Mathematical Society, 1955.
H Ishwaran and LF James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 2001.
JR Lloyd, P Orbanz, Z Ghahramani, and DM Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. NIPS, 2012.
JW McCloskey. A model for the distribution of individuals by species in an environment. Ph.D. thesis, Michigan State University, 1965.
K Miller, MI Jordan, and TL Griffiths. Nonparametric latent feature models for link prediction. NIPS, 2009.
GP Patil and C Taillie. Diversity as a concept and its implications for random communities. Bulletin of the International Statistical Institute, 1977.
S Saria, D Koller, and A Penn. Learning individual and population traits from clinical temporal data. NIPS, 2010.
J Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 1994.
EB Sudderth and MI Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. NIPS, 2009.