Fast Collapsed Gibbs Sampler for Dirichlet Process Gaussian Mixture Models using Rank 1 Cholesky Updates
Rajarshi Das
http://www.cs.cmu.edu/~rajarshd/
c-lab lecture
7th October 2014
(template adapted from https://www.sharelatex.com/templates/presentations/radboud-university-beamer-(version-1))
Rajarshi Das 1 / 49
Overview
1 Distributions of interest
2 Dirichlet Process
3 Finite, Infinite and Dirichlet Process Mixture Models
4 Dirichlet Process Gaussian Mixture Models
5 Gibbs sampler for DPGMM
6 Linear Algebra Crash Course
7 Fast Gibbs Sampler for DPGMM
8 Results
Gaussian Distribution
• A 2-parameter distribution (µ and σ)
(Images from https://en.wikipedia.org/wiki/Normal_distribution)
Multivariate Gaussian Distribution
• Multivariate generalization of Gaussian distribution
• µ is a vector and σ becomes the covariance matrix Σ
Normal Inverse Wishart
• A 4-parameter family (µ0, κ0, Σ0, ν0)

Σ | Σ0, ν0 ∼ W⁻¹(Σ | Σ0, ν0)
µ | µ0, κ0, Σ ∼ N(µ | µ0, (1/κ0)Σ)
(µ, Σ) ∼ N(µ | µ0, (1/κ0)Σ) · W⁻¹(Σ | Σ0, ν0)
Multivariate t
• Multivariate generalization of student’s t-distribution
• 3 parameters (µ, Σ, ν)
Dirichlet Distribution
• A Dirichlet distribution is a probability distribution over categorical distributions, parameterized by a vector α of fixed length K.
• X = ⟨x1, x2, ..., xK⟩, with xi ∈ [0, 1] and Σi xi = 1, is Dirichlet distributed with parameter α if

p(X | α) = ( Γ(Σk αk) / ∏ᴷk=1 Γ(αk) ) · ∏ᴷk=1 xk^(αk − 1)
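As a quick illustration of the density above, a draw from a Dirichlet can be generated by normalizing independent Gamma draws. A minimal Python sketch (the function name and defaults are mine, not from the talk):

```python
import random

def sample_dirichlet(alpha, rng=None):
    """Draw one sample from Dirichlet(alpha) by normalizing
    independent Gamma(alpha_k, 1) draws."""
    rng = rng or random.Random(0)
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

x = sample_dirichlet([1.0, 2.0, 3.0])
# x lies on the simplex: entries in [0, 1] summing to 1
```

Each sample is itself a categorical distribution over K outcomes, which is exactly the "distribution over distributions" view above.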
Dirichlet Process
• A Dirichlet Process (DP) is also a distribution over discrete distributions.
• This time with no fixed dimension. In other words, the 'K' is not fixed (it could be infinite!)
• A DP has 2 parameters:
1 base distribution G0
2 concentration parameter α
Definition
G ∼ DP(G0, α)

if for any partition (A1, ..., AK) of a measurable space X,

(G(A1), ..., G(AK)) ∼ Dirichlet(αG0(A1), ..., αG0(AK))

(Image from http://www.columbia.edu/~jwp2128/Teaching/E6892/papers/mlss2007.pdf)
Example (slide adapted from https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf)
• For example, suppose the base distribution G0 is a Gaussian.
• The sampled distribution G ∼ DP(G0, α) is then discrete, with atoms drawn from that Gaussian.
Example
Realizations from the Dirichlet Process look like a used dartboard.
(https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf)
Posterior Predictive distribution of DP
G ∼ DP(G0, α)
θ1, ..., θn ∼ G

θn+1 | θ1, ..., θn ∼ (1/(α + n)) · (αG0 + Σⁿi=1 δθi)
• Also known as the Blackwell-MacQueen urn scheme
• Has a rich-get-richer property
• Chinese Restaurant Process!
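The urn scheme above can be simulated directly: customer i joins an existing table with probability proportional to its occupancy, or starts a new table with probability proportional to α. A small Python sketch (names are mine):

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table assignments for n customers under a
    Chinese Restaurant Process with concentration alpha."""
    rng = random.Random(seed)
    counts = []       # number of customers at each table
    assignments = []  # table index chosen by each customer
    for i in range(n):
        # existing table k w.p. n_k / (i + alpha); new table w.p. alpha / (i + alpha)
        u = rng.random() * (i + alpha)
        acc = 0.0
        table = len(counts)  # default: open a new table
        for k, nk in enumerate(counts):
            acc += nk
            if u < acc:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments, counts
```

Running this repeatedly shows the rich-get-richer effect: a few large tables and a long tail of small ones.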
Chinese Restaurant Process
(Graphic from http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt)
Finite Mixture Models (FMM)
• In FMM, we assume the data is generated from K distributions.

∀k, θk | λ ∼ G0(λ)
π | α0 ∼ Dir(α0/K, α0/K, ..., α0/K)
zi | π ∼ Discrete(π)
xi | zi, {θk} ∼ F(θzi)
Infinite Mixture Models (IMM)
• In infinite mixture models, we don't know how many mixture components generated the data, so we do not fix a value of K and let the data decide!
Dirichlet Process Mixture Models (DPMM)
G ∼ DP(G0, α)
∀i , φi ∼ G
∀i , xi ∼ F (φi )
DPMM and IMM are the same!
• It turns out that IMM is the same as DPMM when K → ∞

P(zi = k | z−i) = (n−i,k + α/K) / (n − 1 + α)   if k is seen before
                = α / (n − 1 + α)               if k is new
DPGaussianMM
• In DPGMM, each table is endowed with a Gaussian distribution.

G ∼ DP(G0, α)
∀i, φi ∼ G
∀i, xi ∼ F(φi)

• F is a Gaussian distribution.
• If G0 is a conjugate distribution to F, then inference via Gibbs sampling becomes easy (Algorithms 1, 2 and 3 of Neal (2000)).
• If the covariance is fixed, then G0 is Normal.
• Else, if both the mean and covariance are unknown, then G0 is Normal Inverse Gamma (univariate) or Normal Inverse Wishart (multivariate).
Posterior Inference for DPGMM
• So I believe that the data has been generated by a Dirichlet Process Mixture Model. Now, from the observed data, we want to infer the parameters of the distributions which generated the data (a.k.a. Bayesian inference).
• What are the params of my posterior distribution?
• The mean and covariance associated with each table.
• We don't know a priori how many of them there are, btw!
• The table assignment of each data point (customer).
• p({z}, {µ, Σ} | Data, Hyperparams) = p({µ, Σ} | {z}, Data, Hyperparams) · p({z} | Data, Hyperparams)
• If the prior on the table params is conjugate to the likelihood, then we can integrate out {µ, Σ}: collapsing!
• This leaves p({z} | Data, Hyperparams).
• Exact computation of the posterior is intractable; however, we can use MCMC techniques to sample from the posterior distribution.
• Collapsed Gibbs sampler:

p(zi = k | z−i, X, α, λ) ∝ p(zi = k | z−i, α) · p(X | z, λ)

• p(zi = k | z−i, α) is well defined by the CRP.
• p(X | z, λ) is the data likelihood. It essentially reduces to computing p(xi | x−i, z, λ).
• After some math, p(xi | x−i, z, λ) reduces to the multivariate t density

t_{νN−1 − D + 1}( xi | µN−1, ΣN−1 (κN−1 + 1) / (κN−1 (νN−1 − D + 1)) )
Gibbs sampler for Inference
• Now that we have the analytical form of everything, the Gibbs sampling algorithm becomes:
• Randomly initialize customers to tables. Calculate the table params (µ and Σ).
• For each customer:
  • Remove it from its current table. Update the parameters of that table.
  • For each table 1...K:
    • Compute the prior probability for the customer to sit at that table.
    • Compute the posterior predictive likelihood of sitting at that table.
    • Multiply them to get p(zi = k | z−i, X, α, λ).
  • Normalize the vector of probabilities and sample a table number. Update the parameters of that table.
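The loop above can be sketched in Python. This is a deliberately simplified 1-D version with a known, fixed observation variance and a conjugate Normal prior on each table mean, not the NIW setup the talk uses; all names and default hyperparameters are mine:

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gibbs_sweep(xs, z, alpha, sigma2=1.0, mu0=0.0, tau2=10.0, rng=None):
    """One collapsed Gibbs sweep for a 1-D DP mixture of Gaussians with
    known variance sigma2 and a Normal(mu0, tau2) prior on each table mean."""
    rng = rng or random.Random(0)
    n = len(xs)
    for i in range(n):
        # remove customer i from its table
        others = [j for j in range(n) if j != i]
        tables = sorted(set(z[j] for j in others))
        probs, labels = [], []
        for k in tables:
            members = [xs[j] for j in others if z[j] == k]
            m = len(members)
            # posterior of the table mean given its members (conjugacy)
            prec = 1.0 / tau2 + m / sigma2
            mean = (mu0 / tau2 + sum(members) / sigma2) / prec
            # CRP prior (m) times posterior predictive Normal(mean, sigma2 + 1/prec)
            probs.append(m * normal_pdf(xs[i], mean, sigma2 + 1.0 / prec))
            labels.append(k)
        # new table: CRP prior (alpha) times prior predictive
        probs.append(alpha * normal_pdf(xs[i], mu0, sigma2 + tau2))
        labels.append(max(tables, default=-1) + 1)
        # sample a table in proportion to the probabilities
        u = rng.random() * sum(probs)
        acc = 0.0
        z[i] = labels[-1]
        for p, k in zip(probs, labels):
            acc += p
            if u < acc:
                z[i] = k
                break
    return z
```

Repeating the sweep until mixing gives samples of the table assignments {z}; the number of occupied tables rises and falls as the data dictate.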
Updating the table parameters
• Every table has 2 parameters, µ and Σ, which define a Gaussian distribution.
• For conjugacy, we have put a NIW(µ0, Σ0, κ0, ν0) prior on them.
• Since the Gaussian and NIW form a conjugate pair, the posterior distribution of µ and Σ is also a NIW distribution, with the following parameters:

µn = (κ0µ0 + n x̄) / (κ0 + n)
Σn = Σ0 + Σᴺi=1 (xi − x̄)(xi − x̄)ᵀ + (κ0 n / (κ0 + n)) (x̄ − µ0)(x̄ − µ0)ᵀ
κn = κ0 + n
νn = ν0 + n
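For the univariate case (D = 1, so Σ0 and Σn reduce to scalars), the four updates translate directly into code. A sketch (the function name is mine):

```python
def niw_posterior(mu0, kappa0, Sigma0, nu0, xs):
    """Posterior NIW parameters after observing the data xs
    (1-D case, where Sigma0 and Sigma_n are scalars)."""
    n = len(xs)
    xbar = sum(xs) / n
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    S = sum((x - xbar) ** 2 for x in xs)  # scatter around the sample mean
    Sigma_n = Sigma0 + S + (kappa0 * n / kappa_n) * (xbar - mu0) ** 2
    return mu_n, kappa_n, Sigma_n, nu_n
```

The multivariate version replaces the squared differences with outer products, as in the equations above.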
Runtime Analysis
• Let's look at the following snippet from the Gibbs sampler algorithm:
  • For each table 1...K:
    • Compute the prior probability for the customer to sit at that table.
    • Compute the posterior predictive likelihood of sitting at that table (this is a multivariate t-distribution).
• Computing the density under the t-distribution requires computing the determinant and inverse of a covariance matrix.
• Both of these operations are very expensive: O(D³) operations for a D×D matrix.
• Therefore the loop takes O(KD³) operations.
Runtime Analysis -II
• Randomly initialize customers to tables. Calculate the table params (µ and Σ).
  • For µ: O(D), for each table
  • For Σ: O(D²), for each table
  • Also, for each Σ: Σ⁻¹ and |Σ|: O(D³)
• For the number of iterations you want to run:
  • For each customer:
    • Remove it from its current table. Update the parameters of the table: O(D) for µ and O(D²) for Σ
    • For each table 1...K:
      • Compute the prior probability for the customer to sit at that table: O(1)
      • Compute the posterior predictive likelihood of sitting at that table: O(D²)
      • Multiply them to get p(zi = k | z−i, X, α, λ): O(1)
    • Normalize the vector of probabilities and sample a table number. Update the parameters of the table: O(K) + O(D) + O(D²) + O(D³)
Linear Algebra Crash Course
• Cholesky decomposition: the decomposition of a symmetric positive definite matrix into the product of a lower triangular matrix and its conjugate transpose:

A = LLᵀ

• Computing the Cholesky decomposition takes O(D³) time!
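The standard inner-product form of the factorization can be written in a few lines of Python; this sketch (names mine) makes the O(D³) cost visible as three nested loops:

```python
import math

def cholesky(A):
    """Cholesky factor L of a symmetric positive definite matrix A
    (given as a list of row lists), so that A = L L^T. Runs in O(D^3)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)  # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L
```

The `math.sqrt` call fails exactly when A is not positive definite, which is why the positive-definiteness caveat at the end of the talk matters.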
Cholesky updates
• Suppose A is a positive definite matrix with L as its Cholesky factor.
• Now suppose we obtain A′ from A by an update of the form

A′ = A + xxᵀ

• Then the Cholesky factor L′ of A′ can be obtained by an update operation on L (a rank 1 update).
• Similarly, if we have A = A′ − xxᵀ, then we can perform a rank 1 downdate to get L from L′.
Cholesky updates
• This algorithm is O(D²)!
(Image from http://en.wikipedia.org/wiki/Cholesky_decomposition)
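The rank 1 update, following the algorithm on the Wikipedia page cited above, can be sketched as follows (names mine); note it touches each column once, hence O(D²):

```python
import math

def chol_update(L, x):
    """Rank 1 update: given L with A = L L^T, overwrite L in place so that
    afterwards L is the Cholesky factor of A + x x^T. Runs in O(D^2)."""
    x = list(x)  # work on a copy; x is destroyed during the sweep
    n = len(x)
    for k in range(n):
        r = math.hypot(L[k][k], x[k])  # new diagonal entry
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L
```

The downdate is the same sweep with the signs of the rotation flipped (and it can fail if A′ − xxᵀ is not positive definite).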
Nice properties
• |A| can be computed from L by

log(|A|) = 2 · Σᴰi=1 log(L(i, i))

• Now let's try to compute bᵀA⁻¹b:

bᵀA⁻¹b = bᵀ(LLᵀ)⁻¹b = bᵀ(L⁻¹)ᵀ(L⁻¹)b = (L⁻¹b)ᵀ(L⁻¹b)

• Therefore: compute L⁻¹b and multiply its transpose with itself.
(L−1b)
• L⁻¹b is the solution x of

Lx = b

• Remember L is a lower triangular matrix, therefore the above equation can be solved very efficiently using forward substitution!
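Both tricks fit in a few lines (names mine): forward substitution solves Ly = b in O(D²), and the log-determinant comes straight off the diagonal of L:

```python
import math

def forward_substitute(L, b):
    """Solve L y = b for lower triangular L by forward substitution, O(D^2)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def log_det(L):
    """log|A| from the Cholesky factor L of A."""
    return 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))

def quad_form(L, b):
    """b^T A^{-1} b via y = L^{-1} b, then y^T y."""
    y = forward_substitute(L, b)
    return sum(v * v for v in y)
```

Together these give everything the t-density needs without ever forming Σ⁻¹ or |Σ| explicitly.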
Putting back in perspective
• Recall the posterior update equation for the covariance matrix:

Σn = Σ0 + Σᴺi=1 (xi − x̄)(xi − x̄)ᵀ + (κ0 n / (κ0 + n)) (x̄ − µ0)(x̄ − µ0)ᵀ
   = Σ0 + Σᴺi=1 xi xiᵀ − (κ0 + n) µn µnᵀ + κ0 µ0 µ0ᵀ

• Now we can write the above definition recursively:

Σn = Σn−1 + xn xnᵀ − (κ0 + n) µn µnᵀ + (κ0 + n − 1) µn−1 µn−1ᵀ
   = Σn−1 + ((κ0 + n) / (κ0 + n − 1)) (xn − µn)(xn − µn)ᵀ

• Looks similar to A′ = A + xxᵀ?
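The recursion can be sanity-checked numerically in the scalar (D = 1) case, where the outer products become squares; the helper names below are mine:

```python
def mu_n(mu0, kappa0, xs):
    """Posterior mean after observing xs (1-D case)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (kappa0 * mu0 + n * xbar) / (kappa0 + n)

def sigma_n(Sigma0, mu0, kappa0, xs):
    """Batch posterior scale after observing xs (1-D case, scalar Sigma)."""
    n = len(xs)
    xbar = sum(xs) / n
    S = sum((x - xbar) ** 2 for x in xs)
    return Sigma0 + S + kappa0 * n / (kappa0 + n) * (xbar - mu0) ** 2

# check: Sigma_n == Sigma_{n-1} + (k0+n)/(k0+n-1) * (x_n - mu_n)^2
mu0, kappa0, Sigma0 = 0.3, 2.0, 1.5
xs = [1.0, 2.5, -0.5]
lhs = sigma_n(Sigma0, mu0, kappa0, xs)
rhs = (sigma_n(Sigma0, mu0, kappa0, xs[:2])
       + (kappa0 + 3) / (kappa0 + 2) * (xs[2] - mu_n(mu0, kappa0, xs)) ** 2)
assert abs(lhs - rhs) < 1e-9
```

In D dimensions, the same identity lets the sampler maintain the Cholesky factor of Σn with a single rank 1 update per added (or removed) customer.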
Putting back in perspective
• Recall the pdf of the multivariate t-distribution.
• If we know the Cholesky decomposition of Σ, then we can compute (x − µ)ᵀΣ⁻¹(x − µ) from L using the same trick shown before for computing bᵀA⁻¹b.
Runtime Analysis -II
• Randomly initialize customers to tables. Calculate the table params (µ and Σ).
  • For µ: O(D), for each table
  • For Σ: no need to compute Σ itself (we maintain its Cholesky factor)
  • Also, for each Σ: Σ⁻¹-based quantities and |Σ|: O(D²)
• For the number of iterations you want to run:
  • For each customer:
    • Remove it from its current table. Update the parameters of the table: O(D) for µ and O(D²) for Σ (rank 1 downdate)
    • For each table 1...K:
      • Compute the prior probability for the customer to sit at that table: O(1)
      • Compute the posterior predictive likelihood of sitting at that table: O(D²)
      • Multiply them to get p(zi = k | z−i, X, α, λ): O(1)
    • Normalize the vector of probabilities and sample a table number. Update the parameters of the table: O(K) + O(D) + O(D²)
Speed Test 1
• Just 100 word vectors, each of 100 dimensions.
• Both samplers were run for 10, 20, 50, 100, 500, 1000 and 10000 iterations.
Caveats
• All of this is possible only if the prior covariance Σ0 is positive definite.
• In practice, we often use the identity matrix as Σ0, which is positive definite.
Code
• Java implementation available at https://github.com/rajarshd/DPGMM
• Plan to implement in C++ too.
Acknowledgements
• Thanks to my friend Manzil Zaheer for introducing me to the Cholesky decomposition and for proving that the covariance update was indeed a rank 1 update.