Fast Collapsed Gibbs Sampler for Dirichlet Process Gaussian Mixture Models using Rank 1 Cholesky Updates
Rajarshi Das
http://www.cs.cmu.edu/~rajarshd/
c-lab lecture
7th October 2014
(template adapted from https://www.sharelatex.com/templates/presentations/radboud-university-beamer-(version-1))
Rajarshi Das 1 / 49
Overview
1 Distributions of interest
2 Dirichlet Process
3 Finite, Infinite and Dirichlet Process Mixture Models
4 Dirichlet Process Gaussian Mixture Models
5 Gibbs sampler for DPGMM
6 Linear Algebra Crash Course
7 Fast Gibbs Sampler for DPGMM
8 Results
Gaussian Distribution
• A 2-parameter distribution (µ and σ)
(Images from https://en.wikipedia.org/wiki/Normal_distribution)
Multivariate Gaussian Distribution
• Multivariate generalization of Gaussian distribution
• µ is a vector and σ becomes the covariance matrix Σ
Normal Inverse Wishart
• A 4-parameter family (µ0, κ0, Σ0, ν0)

Σ | Σ0, ν0 ∼ W⁻¹(Σ | Σ0, ν0)
µ | µ0, κ0, Σ ∼ N(µ | µ0, (1/κ0)Σ)
(µ, Σ) ∼ N(µ | µ0, (1/κ0)Σ) · W⁻¹(Σ | Σ0, ν0)
Multivariate t
• Multivariate generalization of student’s t-distribution
• 3 parameters (µ, Σ, ν)
Dirichlet Distribution
• A Dirichlet distribution is a probability distribution over categorical distributions, parameterized by a vector α of fixed length K.
• X = ⟨x1, x2, ..., xK⟩, with xi ∈ [0, 1] and Σi xi = 1, is Dirichlet distributed with parameter α if

p(X | α) = ( Γ(Σk αk) / ∏ᴷk=1 Γ(αk) ) · ∏ᴷk=1 xk^(αk − 1)
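As a quick illustration of the density above, a draw from a Dirichlet can be generated by normalizing independent Gamma draws. A minimal Python sketch (the function name and defaults are mine, not from the talk):

```python
import random

def sample_dirichlet(alpha, rng=None):
    """Draw one sample from Dirichlet(alpha) by normalizing
    independent Gamma(alpha_k, 1) draws."""
    rng = rng or random.Random(0)
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

x = sample_dirichlet([1.0, 2.0, 3.0])
# x lies on the simplex: entries in [0, 1] summing to 1
```

Each sample is itself a categorical distribution over K outcomes, which is exactly the "distribution over distributions" view above.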
Dirichlet Process
• A Dirichlet Process (DP) is also a distribution over discrete distributions.
• This time with no fixed dimension. In other words, the 'K' is not fixed (it could be infinite!)
• A DP has 2 parameters:
1 base distribution G0
2 concentration parameter α
Definition
G ∼ DP(G0, α)

if for any partition (A1, ..., AK) of a measurable space X,

(G(A1), ..., G(AK)) ∼ Dirichlet(αG0(A1), ..., αG0(AK))

(Image from http://www.columbia.edu/~jwp2128/Teaching/E6892/papers/mlss2007.pdf)
Example (slide adapted from https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf)
• For example, suppose the base distribution G0 is a Gaussian.
• The sampled distribution G ∼ DP(G0, α) is then discrete, with atoms drawn from that Gaussian.
Example
Realizations from the Dirichlet Process look like a used dartboard.
(https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf)
Posterior Predictive distribution of DP
G ∼ DP(G0, α)
θ1, ..., θn ∼ G

θn+1 | θ1, ..., θn ∼ (1/(α + n)) · (αG0 + Σⁿi=1 δθi)
• Also known as the Blackwell-MacQueen urn scheme
• Has a rich-get-richer property
• Chinese Restaurant Process!
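The urn scheme above can be simulated directly: customer i joins an existing table with probability proportional to its occupancy, or starts a new table with probability proportional to α. A small Python sketch (names are mine):

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table assignments for n customers under a
    Chinese Restaurant Process with concentration alpha."""
    rng = random.Random(seed)
    counts = []       # number of customers at each table
    assignments = []  # table index chosen by each customer
    for i in range(n):
        # existing table k w.p. n_k / (i + alpha); new table w.p. alpha / (i + alpha)
        u = rng.random() * (i + alpha)
        acc = 0.0
        table = len(counts)  # default: open a new table
        for k, nk in enumerate(counts):
            acc += nk
            if u < acc:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments, counts
```

Running this repeatedly shows the rich-get-richer effect: a few large tables and a long tail of small ones.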
Chinese Restaurant Process
(Graphic from http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt)
Finite Mixture Models (FMM)
• In FMM, we assume the data is generated from K distributions.

∀k, θk | λ ∼ G0(λ)
π | α0 ∼ Dir(α0/K, α0/K, ..., α0/K)
zi | π ∼ Discrete(π)
xi | zi, {θk} ∼ F(θzi)
Infinite Mixture Models (IMM)
• In infinite mixture models, we don't know how many mixture components generated the data, so we do not fix a value of K and let the data decide!
Dirichlet Process Mixture Models (DPMM)
G ∼ DP(G0, α)
∀i , φi ∼ G
∀i , xi ∼ F (φi )
DPMM and IMM are the same!
• It turns out that IMM is the same as DPMM when K → ∞

P(zi = k | z−i) = (n−i,k + α/K) / (n − 1 + α)   if k is seen before
                = α / (n − 1 + α)               if k is new
DPGaussianMM
• In DPGMM, each table is endowed with a Gaussian distribution.

G ∼ DP(G0, α)
∀i, φi ∼ G
∀i, xi ∼ F(φi)

• F is a Gaussian distribution.
• If G0 is a conjugate distribution to F, then inference via Gibbs sampling becomes easy (Algorithms 1, 2 and 3 of Neal (2000)).
• If the covariance is fixed, then G0 is Normal.
• Else, if both the mean and covariance are unknown, then G0 is Normal Inverse Gamma (univariate) or Normal Inverse Wishart (multivariate).
Posterior Inference for DPGMM
• So I believe that the data has been generated by a Dirichlet Process Mixture Model. Now, from the observed data, we want to infer the parameters of the distributions which generated the data (a.k.a. Bayesian inference).
• What are the params of my posterior distribution?
• The mean and covariance associated with each table.
• We don't know a priori how many of them there are, btw!
• The table assignment of each data point (customer).
• p({z}, {µ, Σ} | Data, Hyperparams) = p({µ, Σ} | {z}, Data, Hyperparams) · p({z} | Data, Hyperparams)
• If the prior on the table params is conjugate to the likelihood, then we can integrate out {µ, Σ}: collapsing!
• This leaves p({z} | Data, Hyperparams).
• Exact computation of the posterior is intractable; however, we can use MCMC techniques to sample from the posterior distribution.
• Collapsed Gibbs sampler:

p(zi = k | z−i, X, α, λ) ∝ p(zi = k | z−i, α) · p(X | z, λ)

• p(zi = k | z−i, α) is well defined by the CRP.
• p(X | z, λ) is the data likelihood. It essentially reduces to computing p(xi | x−i, z, λ).
• After some math, p(xi | x−i, z, λ) reduces to the multivariate t density

t_{νN−1 − D + 1}( xi | µN−1, ΣN−1 (κN−1 + 1) / (κN−1 (νN−1 − D + 1)) )
Gibbs sampler for Inference
• Now that we have the analytical form of everything, the Gibbs sampling algorithm becomes:
• Randomly initialize customers to tables. Calculate the table params (µ and Σ).
• For each customer:
  • Remove it from its current table. Update the parameters of that table.
  • For each table 1...K:
    • Compute the prior probability for the customer to sit at that table.
    • Compute the posterior predictive likelihood of sitting at that table.
    • Multiply them to get p(zi = k | z−i, X, α, λ).
  • Normalize the vector of probabilities and sample a table number. Update the parameters of that table.
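The loop above can be sketched in Python. This is a deliberately simplified 1-D version with a known, fixed observation variance and a conjugate Normal prior on each table mean, not the NIW setup the talk uses; all names and default hyperparameters are mine:

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gibbs_sweep(xs, z, alpha, sigma2=1.0, mu0=0.0, tau2=10.0, rng=None):
    """One collapsed Gibbs sweep for a 1-D DP mixture of Gaussians with
    known variance sigma2 and a Normal(mu0, tau2) prior on each table mean."""
    rng = rng or random.Random(0)
    n = len(xs)
    for i in range(n):
        # remove customer i from its table
        others = [j for j in range(n) if j != i]
        tables = sorted(set(z[j] for j in others))
        probs, labels = [], []
        for k in tables:
            members = [xs[j] for j in others if z[j] == k]
            m = len(members)
            # posterior of the table mean given its members (conjugacy)
            prec = 1.0 / tau2 + m / sigma2
            mean = (mu0 / tau2 + sum(members) / sigma2) / prec
            # CRP prior (m) times posterior predictive Normal(mean, sigma2 + 1/prec)
            probs.append(m * normal_pdf(xs[i], mean, sigma2 + 1.0 / prec))
            labels.append(k)
        # new table: CRP prior (alpha) times prior predictive
        probs.append(alpha * normal_pdf(xs[i], mu0, sigma2 + tau2))
        labels.append(max(tables, default=-1) + 1)
        # sample a table in proportion to the probabilities
        u = rng.random() * sum(probs)
        acc = 0.0
        z[i] = labels[-1]
        for p, k in zip(probs, labels):
            acc += p
            if u < acc:
                z[i] = k
                break
    return z
```

Repeating the sweep until mixing gives samples of the table assignments {z}; the number of occupied tables rises and falls as the data dictate.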
Updating the table parameters
• Every table has 2 parameters, µ and Σ, which define a Gaussian distribution.
• For conjugacy, we have put a NIW(µ0, Σ0, κ0, ν0) prior on them.
• Since the Gaussian and NIW form a conjugate pair, the posterior distribution of µ and Σ is also a NIW distribution, with the following parameters:

µn = (κ0µ0 + n x̄) / (κ0 + n)
Σn = Σ0 + Σᴺi=1 (xi − x̄)(xi − x̄)ᵀ + (κ0 n / (κ0 + n)) (x̄ − µ0)(x̄ − µ0)ᵀ
κn = κ0 + n
νn = ν0 + n
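For the univariate case (D = 1, so Σ0 and Σn reduce to scalars), the four updates translate directly into code. A sketch (the function name is mine):

```python
def niw_posterior(mu0, kappa0, Sigma0, nu0, xs):
    """Posterior NIW parameters after observing the data xs
    (1-D case, where Sigma0 and Sigma_n are scalars)."""
    n = len(xs)
    xbar = sum(xs) / n
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    S = sum((x - xbar) ** 2 for x in xs)  # scatter around the sample mean
    Sigma_n = Sigma0 + S + (kappa0 * n / kappa_n) * (xbar - mu0) ** 2
    return mu_n, kappa_n, Sigma_n, nu_n
```

The multivariate version replaces the squared differences with outer products, as in the equations above.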
Runtime Analysis
• Let's look at the following snippet from the Gibbs sampler algorithm:
  • For each table 1...K:
    • Compute the prior probability for the customer to sit at that table.
    • Compute the posterior predictive likelihood of sitting at that table (this is a multivariate t-distribution).
• Computing the density under the t-distribution requires computing the determinant and inverse of a covariance matrix.
• Both of these operations are very expensive: O(D³) operations for a D×D matrix.
• Therefore the loop takes O(KD³) operations.
Runtime Analysis -II
• Randomly initialize customers to tables. Calculate the table params (µ and Σ).
  • For µ: O(D), for each table
  • For Σ: O(D²), for each table
  • Also, for each Σ: Σ⁻¹ and |Σ|: O(D³)
• For the number of iterations you want to run:
  • For each customer:
    • Remove it from its current table. Update the parameters of the table: O(D) for µ and O(D²) for Σ
    • For each table 1...K:
      • Compute the prior probability for the customer to sit at that table: O(1)
      • Compute the posterior predictive likelihood of sitting at that table: O(D²)
      • Multiply them to get p(zi = k | z−i, X, α, λ): O(1)
    • Normalize the vector of probabilities and sample a table number. Update the parameters of the table: O(K) + O(D) + O(D²) + O(D³)
Linear Algebra Crash Course
• Cholesky decomposition: the decomposition of a symmetric positive definite matrix into the product of a lower triangular matrix and its conjugate transpose:

A = LLᵀ

• Computing the Cholesky decomposition takes O(D³) time!
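The standard inner-product form of the factorization can be written in a few lines of Python; this sketch (names mine) makes the O(D³) cost visible as three nested loops:

```python
import math

def cholesky(A):
    """Cholesky factor L of a symmetric positive definite matrix A
    (given as a list of row lists), so that A = L L^T. Runs in O(D^3)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)  # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L
```

The `math.sqrt` call fails exactly when A is not positive definite, which is why the positive-definiteness caveat at the end of the talk matters.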
Cholesky updates
• Suppose A is a positive definite matrix with L as its Cholesky factor.
• Now suppose we obtain A′ from A by an update of the form

A′ = A + xxᵀ

• Then the Cholesky factor L′ of A′ can be obtained by an update operation on L (a rank 1 update).
• Similarly, if we have A = A′ − xxᵀ, then we can perform a rank 1 downdate to get L from L′.
Cholesky updates
• This algorithm is O(D²)!
(Image from http://en.wikipedia.org/wiki/Cholesky_decomposition)
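The rank 1 update, following the algorithm on the Wikipedia page cited above, can be sketched as follows (names mine); note it touches each column once, hence O(D²):

```python
import math

def chol_update(L, x):
    """Rank 1 update: given L with A = L L^T, overwrite L in place so that
    afterwards L is the Cholesky factor of A + x x^T. Runs in O(D^2)."""
    x = list(x)  # work on a copy; x is destroyed during the sweep
    n = len(x)
    for k in range(n):
        r = math.hypot(L[k][k], x[k])  # new diagonal entry
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L
```

The downdate is the same sweep with the signs of the rotation flipped (and it can fail if A′ − xxᵀ is not positive definite).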
Nice properties
• |A| can be computed from L by

log(|A|) = 2 · Σᴰi=1 log(L(i, i))

• Now let's try to compute bᵀA⁻¹b:

bᵀA⁻¹b = bᵀ(LLᵀ)⁻¹b = bᵀ(L⁻¹)ᵀ(L⁻¹)b = (L⁻¹b)ᵀ(L⁻¹b)

• Therefore: compute L⁻¹b and multiply its transpose with itself.
(L−1b)
• L⁻¹b is the solution x of

Lx = b

• Remember L is a lower triangular matrix, therefore the above equation can be solved very efficiently using forward substitution!
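Both tricks fit in a few lines (names mine): forward substitution solves Ly = b in O(D²), and the log-determinant comes straight off the diagonal of L:

```python
import math

def forward_substitute(L, b):
    """Solve L y = b for lower triangular L by forward substitution, O(D^2)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def log_det(L):
    """log|A| from the Cholesky factor L of A."""
    return 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))

def quad_form(L, b):
    """b^T A^{-1} b via y = L^{-1} b, then y^T y."""
    y = forward_substitute(L, b)
    return sum(v * v for v in y)
```

Together these give everything the t-density needs without ever forming Σ⁻¹ or |Σ| explicitly.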
Putting back in perspective
• Recall the posterior update equation for the covariance matrix:

Σn = Σ0 + Σᴺi=1 (xi − x̄)(xi − x̄)ᵀ + (κ0 n / (κ0 + n)) (x̄ − µ0)(x̄ − µ0)ᵀ
   = Σ0 + Σᴺi=1 xi xiᵀ − (κ0 + n) µn µnᵀ + κ0 µ0 µ0ᵀ

• Now we can write the above definition recursively:

Σn = Σn−1 + xn xnᵀ − (κ0 + n) µn µnᵀ + (κ0 + n − 1) µn−1 µn−1ᵀ
   = Σn−1 + ((κ0 + n) / (κ0 + n − 1)) (xn − µn)(xn − µn)ᵀ

• Looks similar to A′ = A + xxᵀ?
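The recursion can be sanity-checked numerically in the scalar (D = 1) case, where the outer products become squares; the helper names below are mine:

```python
def mu_n(mu0, kappa0, xs):
    """Posterior mean after observing xs (1-D case)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (kappa0 * mu0 + n * xbar) / (kappa0 + n)

def sigma_n(Sigma0, mu0, kappa0, xs):
    """Batch posterior scale after observing xs (1-D case, scalar Sigma)."""
    n = len(xs)
    xbar = sum(xs) / n
    S = sum((x - xbar) ** 2 for x in xs)
    return Sigma0 + S + kappa0 * n / (kappa0 + n) * (xbar - mu0) ** 2

# check: Sigma_n == Sigma_{n-1} + (k0+n)/(k0+n-1) * (x_n - mu_n)^2
mu0, kappa0, Sigma0 = 0.3, 2.0, 1.5
xs = [1.0, 2.5, -0.5]
lhs = sigma_n(Sigma0, mu0, kappa0, xs)
rhs = (sigma_n(Sigma0, mu0, kappa0, xs[:2])
       + (kappa0 + 3) / (kappa0 + 2) * (xs[2] - mu_n(mu0, kappa0, xs)) ** 2)
assert abs(lhs - rhs) < 1e-9
```

In D dimensions, the same identity lets the sampler maintain the Cholesky factor of Σn with a single rank 1 update per added (or removed) customer.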
Putting back in perspective
• Recall the pdf of the multivariate t-distribution.
• If we know the Cholesky decomposition of Σ, then we can compute (x − µ)ᵀΣ⁻¹(x − µ) from L using the same trick shown before for computing bᵀA⁻¹b.
Runtime Analysis -II
• Randomly initialize customers to tables. Calculate the table params (µ and Σ).
  • For µ: O(D), for each table
  • For Σ: no need to compute Σ itself (we maintain its Cholesky factor)
  • Also, for each Σ: Σ⁻¹-based quantities and |Σ|: O(D²)
• For the number of iterations you want to run:
  • For each customer:
    • Remove it from its current table. Update the parameters of the table: O(D) for µ and O(D²) for Σ (rank 1 downdate)
    • For each table 1...K:
      • Compute the prior probability for the customer to sit at that table: O(1)
      • Compute the posterior predictive likelihood of sitting at that table: O(D²)
      • Multiply them to get p(zi = k | z−i, X, α, λ): O(1)
    • Normalize the vector of probabilities and sample a table number. Update the parameters of the table: O(K) + O(D) + O(D²)
Speed Test 1
• Just 100 word vectors, each of 100 dimensions.
• Both samplers were run for 10, 20, 50, 100, 500, 1000 and 10000 iterations.
Caveats
• All of this is possible only if the prior covariance Σ0 is positive definite.
• In practice, we often use the identity matrix as Σ0, which is positive definite.
Code
• Java implementation available at https://github.com/rajarshd/DPGMM
• Plan to implement in C++ too.
Acknowledgements
• Thanks to my friend Manzil Zaheer for introducing me to the Cholesky decomposition and for proving that the covariance update was indeed a rank 1 update.