
Improvements to Variational Bayesian Inference

Yee Whye Teh, Max Welling,
Kenichi Kurihara, David Newman

March 26, 2008, Newton Institute, Cambridge

Outline

Introduction

Standard Inference Algorithms for LDA

Collapsed Variational Bayes for LDA

Hierarchical Dirichlet Processes

Collapsed Variational Bayes for HDP

Discussion


Bayesian Networks

[Figure: the classic “Asia” Bayesian network, with nodes Asia, Tuberculosis, Smoker, Lung cancer, Bronchitis, X-ray and Dyspnoea]

[Pearl 1988, Heckerman 1995]

Bayesian Networks
Assumptions

- Discrete networks.
- Parameter independence.
- Conjugate Dirichlet priors.

p(x|θ) = ∏_i p(x_i | x_pa(i)) = ∏_{i,j,k} θ_{ijk}^{δ(x_pa(i)=j) δ(x_i=k)}

p(θ|α) = ∏_{i,j} [Γ(Σ_k α_{ijk}) / ∏_k Γ(α_{ijk})] ∏_k θ_{ijk}^{α_{ijk}−1}

where i indexes the variables, j the values of the parents x_pa(i) of x_i, and k the values of x_i.
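As a concrete illustration of this parameterization (not part of the original slides), here is a minimal Python sketch for a hypothetical two-variable network A → B; the CPT values and all names are invented for the example.

```python
import numpy as np

# Hypothetical two-node discrete network A -> B.
# theta_A[k]    = p(A = k)           (A has no parents, so j is trivial)
# theta_B[j, k] = p(B = k | A = j)   (j = parent value, k = child value)
theta_A = np.array([0.6, 0.4])
theta_B = np.array([[0.9, 0.1],
                    [0.3, 0.7]])

def log_lik(data):
    """log p(x | theta) = sum over observations of sum_i log p(x_i | x_pa(i))."""
    return sum(np.log(theta_A[a]) + np.log(theta_B[a, b]) for a, b in data)

print(log_lik([(0, 0), (1, 1), (0, 1)]))
```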

Bayesian Networks
Example: Naïve Bayes

[Figure: plate diagram with a class variable generating feature variables, repeated over data items]

- Each class is described by a product distribution over features.

Bayesian Networks
Example: Document Clustering

[Figure: plate diagram over documents and words, with cluster-specific parameters θ]

- Each document belongs to a cluster.
- Words in each document are iid, drawn from a cluster-specific distribution over the vocabulary.

Bayesian Networks
Example: Latent Dirichlet Allocation

[Figure: plate diagram over documents and words, with topic-specific parameters θ]

- Each document is described by a distribution over topics.
- For each word: draw a topic, then draw the word itself from a topic-specific distribution over the vocabulary.
- Mixed membership model; admixture.

[Blei, Ng and Jordan 2003]

Bayesian Networks
Example: Biclustering

[Figure: a data matrix biclustered into row and column groups]

Bayesian Networks
Example: Stochastic Block Model

[Figure: a relational data matrix with row and column blocks and block-specific parameters θ]

[Airoldi et al 2007]

Bayesian Networks
Inference

- Observed variables x, unobserved z.
- Parameters θ.
- We wish to compute (marginals of) the posterior:

p(z, θ|x) = p(x, z|θ) p(θ) / p(x)

- Computational techniques:
  - Markov chain Monte Carlo,
  - variational approximations.

Variational Bayes

- Observed variables x, unobserved z, parameters θ.
- Posterior:

p(z, θ|x) = p(x, z|θ) p(θ) / p(x) = argmax_{q(z,θ)} E_q[log p(x, z, θ) − log q(z, θ)]

- Wish to optimize the variational free energy:

F(q(z, θ)) = E_q[log p(x, z, θ) − log q(z, θ)]

[Beal 2003]

Variational Bayes

- Variational free energy:

F(q(z, θ)) = E_q[log p(x, z, θ) − log q(z, θ)]

- Factorization approximation:

q(z, θ) = q(z) q(θ)

- Variational EM algorithm:

Variational E step: q(z) ∝ exp E_{q(θ)}[log p(x, z, θ)]
Variational M step: q(θ) ∝ exp E_{q(z)}[log p(x, z, θ)]

Variational Bayes

q(z) ∝ exp E_{q(θ)}[log p(x, z, θ)]
q(θ) ∝ exp E_{q(z)}[log p(x, z, θ)]

- If p(x, z|θ) is an exponential family with tractable conjugate prior p(θ), then:
  - q(z) takes the same form as p(z|x, θ);
  - q(θ) takes the same form as p(θ).
- Computational cost of variational Bayes is equivalent to that of EM.
- But: biased.
- This talk: improve the approximation without incurring extra computational expense.

Latent Dirichlet Allocation (LDA)

[Plate diagram: θd → zid → xid ← φk, over topics k=1…K, documents d=1…D, words i=1…nd]

- The d-th document's distribution over topics:

θd | α, π ∼ Dirichlet(απ)

- Corpus-wide topics (distributions over words):

φk | β, τ ∼ Dirichlet(βτ)

- The i-th word in the d-th document:
  - Draw topic zid | θd ∼ Discrete(θd).
  - Draw word xid | zid, φ ∼ Discrete(φ_zid).

[Blei, Ng and Jordan 2003, Griffiths and Steyvers 2004]
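The generative process above is easy to simulate; here is a minimal numpy sketch (sizes and hyperparameter values are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, W, n_d = 5, 3, 20, 50      # documents, topics, vocabulary, words/doc
alpha_pi = np.full(K, 0.1)       # Dirichlet parameter alpha * pi
beta_tau = np.full(W, 0.1)       # Dirichlet parameter beta * tau

phi = rng.dirichlet(beta_tau, size=K)    # phi_k: topic-word distributions
theta = rng.dirichlet(alpha_pi, size=D)  # theta_d: document-topic distributions

docs = []
for d in range(D):
    z = rng.choice(K, size=n_d, p=theta[d])             # topic of each word
    x = np.array([rng.choice(W, p=phi[k]) for k in z])  # word drawn from its topic
    docs.append((z, x))
```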

Latent Dirichlet Allocation (LDA)
Inferred Topics on KOS

n=48576: november, poll, house, account, electoral, senate, governor, republicans, polls, vote
n=48190: iraq, war, military, iraqi, american, troops, bush, soldiers, people, forces
n=47944: bush, administration, years, tax, year, bushs, time, million, health, america
n=45580: bush, kerry, poll, percent, general, voters, polls, president, vote, election
n=42552: bush, administration, house, white, president, intelligence, report, officials, commission, defense
n=42190: senate, house, race, elections, republican, state, democrats, seat, district, democratic
n=40050: people, political, party, republicans, conservative, issue, marriage, rights, gay, vote
n=40030: party, campaign, republican, democratic, democrats, election, republicans, state, million, states
n=39086: bush, kerry, president, news, general, media, campaign, john, time, cheney
n=34138: dean, kerry, edwards, primary, democratic, clark, iowa, poll, gephardt, lieberman

Latent Dirichlet Allocation (LDA)
Inferred Topics on NIPS

n=104314: network, weight, training, learning, error, set, unit, output, hidden, performance
n=81815: distribution, gaussian, data, mean, model, bayesian, probability, variables, prior, posterior
n=80199: training, classifier, classification, set, data, class, performance, pattern, test, error
n=73581: model, data, parameter, mixture, likelihood, estimation, hmm, probability, density, markov
n=72394: neuron, synaptic, model, firing, cell, spike, input, synapses, potential, network
n=67775: speech, recognition, word, system, network, character, training, speaker, neural, input
n=66250: equation, system, point, dynamic, function, parameter, learning, matrix, fixed, network
n=65897: unit, network, input, output, hidden, layer, neural, recurrent, weight, activation
n=65642: error, training, learning, generalization, prediction, weight, input, set, network, neural
n=64818: function, bound, theorem, approximation, case, result, number, loss, proof, error


Standard Gibbs Sampling for LDA

[Plate diagram: θd → zid → xid ← φk, over topics k=1…K, documents d=1…D, words i=1…nd]

- Conditional distributions for Gibbs sampling:

θd | z ∼ Dirichlet(απ + (nd1, …, ndK))
φk | x, z ∼ Dirichlet(βτ + (nk1, …, nkW))
p(zid = k | θd, φ, xid) ∝ θdk φk,xid

- ndkw = Σ_i I[zid = k] I[xid = w]; missing indices are summed out.
- Strong coupling between θ, φ and z.

Standard Variational Bayes for LDA

[Plate diagrams: the LDA model, and the factorized variational posterior with the links between θd, zid and φk removed]

- The basic factorization induces further simplifications:

q(z, θ, φ) = q(z) q(θ, φ) = ∏_{id} q(zid) ∏_d q(θd) ∏_k q(φk)

Standard Variational Bayes for LDA

[Plate diagram: θd → zid → xid ← φk, over topics k=1…K, documents d=1…D, words i=1…nd]

- Variational posteriors:

q(θd) ← Dirichlet(θd; απ + E_{q(z)}[nd:])
q(φk) ← Dirichlet(φk; βτ + E_{q(z)}[nk:])
q(zid = k) ∝ exp(E_{q(θ)}[log θdk] + E_{q(φ)}[log φk,xid])

- Structurally very similar to the Gibbs conditionals.
- Strong coupling between θ, φ and z.
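To make the update equations concrete, here is a minimal numpy sketch of one round of standard VB (not the authors' code); it assumes each document is an array of word ids, and uses the standard Dirichlet identity E[log θ_dk] = Ψ(a_dk) − Ψ(Σ_k a_dk).

```python
import numpy as np
from scipy.special import digamma

def vb_round(docs, q_z, alpha_pi, beta_tau):
    """One round of standard VB for LDA. docs[d] is an array of word ids,
    q_z[d][i, k] = q(z_id = k); alpha_pi, beta_tau are the prior vectors."""
    D, K = len(docs), len(alpha_pi)
    # Update q(theta_d), q(phi_k) from expected counts E[n_dk], E[n_kw].
    a_theta = np.tile(alpha_pi, (D, 1))
    a_phi = np.tile(beta_tau, (K, 1))
    for d, x in enumerate(docs):
        a_theta[d] += q_z[d].sum(axis=0)
        np.add.at(a_phi.T, x, q_z[d])
    # Update q(z_id) from E[log theta_dk] + E[log phi_k,x_id].
    El_theta = digamma(a_theta) - digamma(a_theta.sum(1, keepdims=True))
    El_phi = digamma(a_phi) - digamma(a_phi.sum(1, keepdims=True))
    for d, x in enumerate(docs):
        log_q = El_theta[d] + El_phi[:, x].T
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q_z[d] = q / q.sum(axis=1, keepdims=True)
    return q_z, a_theta, a_phi
```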

Collapsed Gibbs Sampling for LDA

[Plate diagram: θd → zid → xid ← φk, over topics k=1…K, documents d=1…D, words i=1…nd]

- Integrate out θ, φ (Rao-Blackwellize), and Gibbs sample z:

p(zid = k | z¬id, xid) ∝ (απk + n¬id_dk) (βτ_xid + n¬id_k,xid) / (β + n¬id_k)

- Each zid interacts with the other latent variables z¬id only via the counts n, so the effect of any single z_i′d′ on zid is weak.
- Faster convergence; a mean field approximation is expected to work well.
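A minimal sketch of one sweep following this conditional (again, not the authors' code); the count arrays ndk, nkw, nk and the document format are assumptions of the sketch.

```python
import numpy as np

def gibbs_sweep(docs, z, ndk, nkw, nk, alpha_pi, beta_tau, beta, rng):
    """One sweep of collapsed Gibbs sampling for LDA.
    ndk[d, k], nkw[k, w], nk[k]: current counts; z[d][i]: assignments."""
    for d, x in enumerate(docs):
        for i, w in enumerate(x):
            k = z[d][i]
            # Remove word (i, d) from the counts: the "¬id" statistics.
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # p(z_id = k | z¬id, x_id) ∝ (απ_k + n_dk)(βτ_w + n_kw)/(β + n_k)
            p = (alpha_pi + ndk[d]) * (beta_tau[w] + nkw[:, w]) / (beta + nk)
            k = rng.choice(len(p), p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z
```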

Variational Bayes vs Gibbs Sampling in LDA

- Variational Bayes
  + Easier to debug.
  + Easy to diagnose convergence.
  − Derivations more involved.
  − Approximate posterior potentially far from the true one.
  + Lower bound on the marginal probability of the data.
  + Easy to analyse the result of inference.

- Gibbs Sampling
  − Often hard to debug.
  − Hard to diagnose convergence (if ever).
  + Will converge to the true posterior if willing to wait.
  + Unconverged samples may still be “good enough” for prediction.
  − No good way to compute the marginal probability of the data.
  − Unclear how to combine multiple samples for analysis, due to non-identifiability.


Collapsed Variational Bayes for LDA

[Plate diagrams: the LDA model, and the collapsed model over the zid alone once θ and φ are integrated out]

- Integrate out θ, φ (Rao-Blackwellize), and factorize z:

q(z) = ∏_{id} q(zid)

[Teh, Newman and Welling 2007]

Collapsed Variational Bayes for LDA

[Plate diagram: θd → zid → xid ← φk, over topics k=1…K, documents d=1…D, words i=1…nd]

- Variational posterior updates:

q(zid = k) ∝ exp(E_q[log(απk + n¬id_dk) + log(βτ_xid + n¬id_k,xid) − log(β + n¬id_k)])

- Structurally similar to collapsed Gibbs sampling.
- Weak interactions among the zid’s.

Collapsed Variational Bayes for LDA

[Plate diagrams: the LDA model, and a variational posterior in which θ, φ remain conditioned on z]

- The dependence of θ, φ on z is treated exactly. Another approach:

q(z, θ, φ) = q(θ, φ|z) ∏_{id} q(zid)

Collapsed Variational Bayes for LDA
Equivalence of approaches

- Variational free energy:

F(q(z, θ, φ)) = F(q(z) q(θ, φ|z))
             = E_{q(z,θ,φ)}[log p(x, z, θ, φ) − log q(z, θ, φ)]
             = E_{q(z)}[E_{q(θ,φ|z)}[log p(x, z, θ, φ) − log q(θ, φ|z)] − log q(z)]

- The optimum of q(θ, φ|z) is p(θ, φ|x, z); plugging this in gives

max_{q(θ,φ|z)} F(q(z) q(θ, φ|z)) = E_{q(z)}[log p(x, z) − log q(z)]

- Both formulations are therefore equivalent.

Collapsed Variational Bayes for LDA
Efficient computations

- Collapsed variational updates:

q(zid = k) ∝ exp(E_q[log(απk + n¬id_dk) + log(βτ_xid + n¬id_k,xid) − log(β + n¬id_k)])

- Need to compute terms of the form E[log(a + n)], where n = Σ_l b_l and the b_l are independent Bernoulli variables, b_l ∼ Bernoulli(ρ_l).
- These can be computed with fast Fourier transforms, but a second-order Taylor approximation works very well:

E[log(a + n)] ≈ log(a + E[n]) − V[n] / (2(a + E[n])²)
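The approximation is a one-liner given the Bernoulli means; this sketch (with made-up ρ values) compares it against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = rng.uniform(0.1, 0.9, size=50)   # made-up Bernoulli probabilities
a = 0.5

# Second-order approximation of E[log(a + n)], n = sum_l b_l.
En, Vn = rho.sum(), (rho * (1 - rho)).sum()
approx = np.log(a + En) - Vn / (2 * (a + En) ** 2)

# Monte Carlo check.
n = (rng.random((100_000, rho.size)) < rho).sum(axis=1)
print(approx, np.log(a + n).mean())
```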

Collapsed Variational Bayes for LDA
Experimental Results

- Corpora:
  - KOS: D = 3430, W = 6909, N = 467714, K = 8.
  - NIPS: D = 1675, W = 12419, N = 2166029, K = 40.
- 10% of the words in each document withheld as a test set.
- α = β = 0.1.
- Repeated 50 times.
- Report both bounds on the marginal probabilities of the training set and predictive probabilities on the test set.

Collapsed Variational Bayes for LDA
Bounds on Marginal Probabilities on KOS and NIPS

[Figures: per-word log-probability bounds vs. iteration on KOS and NIPS, and histograms of the final bounds over repeated runs, for Collapsed VB and Standard VB]

Collapsed Variational Bayes for LDA
Predictive Probabilities on KOS and NIPS

[Figures: per-word predictive log probabilities vs. iteration on KOS and NIPS, and histograms over repeated runs, for Collapsed Gibbs, Collapsed VB and Standard VB]


Latent Dirichlet Allocation
Hyperpriors and Model Selection/Averaging

[Plate diagram: θd → zid → xid ← φk, over topics k=1…K, documents d=1…D, words i=1…nd]

- Sensitivity to hyperparameter values.
- Sensitivity to the number of topics K.
- Model selection/averaging is inefficient.
- Limitations of parametric models.

Latent Dirichlet Allocation
Hyperpriors and Model Selection/Averaging

[Plate diagrams: LDA, and LDA augmented with hyperpriors on γ, α, β, π, τ]

γ ∼ Gamma(aγ, bγ)   α ∼ Gamma(aα, bα)   β ∼ Gamma(aβ, bβ)
π ∼ Dirichlet(γ/K, …, γ/K)   τ ∼ Dirichlet(aτ/W, …, aτ/W)

Hierarchical Dirichlet Processes
Nonparametric Alternative to LDA

[Plate diagrams: LDA with topics k=1…K, and the HDP with topics k=1…∞]

γ ∼ Gamma(aγ, bγ)   α ∼ Gamma(aα, bα)   β ∼ Gamma(aβ, bβ)
π ∼ GEM(γ)   τ ∼ Dirichlet(aτ/W, …, aτ/W)

Hierarchical Dirichlet Process
Specification in terms of Random Measures

[Plate diagram: the HDP with topics k=1…∞, hyperpriors, documents d=1…D and words i=1…nd]

- Equivalent to:

G0 ∼ DP(γ, Dirichlet(βτ))   Gd ∼ DP(α, G0)
yid ∼ Gd   xid ∼ Discrete(yid)

[Teh et al 2006]


Collapsed Variational Bayes for HDP

[Plate diagram: the HDP with topics k=1…∞ and hyperpriors]

- Wish to deal with a full variational posterior.
- Need to consider a countably infinite number of topics.
  - Truncate the posterior.
- Parameter priors are no longer independent.
- Hyperpriors are not conjugate to the priors.
  - Collapsed variational Bayes with auxiliary variables.

[Teh, Kurihara and Welling 2008]

Collapsed Variational Bayes for HDP

1. Integrate out parameters θ, φ from the HDP.
2. Introduce auxiliary variables η, s, ξ, t.
3. Factorize the posterior.
4. Truncate the posterior.

[Figures: a sequence of plate diagrams showing the HDP model as θ and φ are integrated out, the auxiliary variables ηd, sd, ξk, tk are introduced, and the topics are truncated from k=1…∞ to k=1…K]

Collapsed Variational Bayes for HDP

1. Integrate out parameters θ and φ:

p(z | α, π) = ∏_{d=1}^D [Γ(α) / Γ(α + nd)] ∏_{k=1}^∞ [Γ(απk + ndk) / Γ(απk)]

p(x | z, β, τ) = ∏_{k=1}^∞ [Γ(β) / Γ(β + nk)] ∏_{w=1}^W [Γ(βτw + nkw) / Γ(βτw)]
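For a finite truncation, the first of these factors is straightforward to evaluate with log-gamma functions; a minimal sketch (the count matrix and prior vector are assumed inputs):

```python
import numpy as np
from scipy.special import gammaln

def log_p_z(ndk, alpha_pi):
    """log p(z | alpha, pi) for a finite truncation.
    ndk[d, k]: topic counts per document; alpha_pi[k] = alpha * pi_k."""
    alpha = alpha_pi.sum()
    nd = ndk.sum(axis=1)
    return ((gammaln(alpha) - gammaln(alpha + nd)).sum()
            + (gammaln(alpha_pi + ndk) - gammaln(alpha_pi)).sum())
```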

Collapsed Variational Bayes for HDP

2. Introduce auxiliary variables η, s, ξ and t:

Γ(α) / Γ(α + nd) = (1/Γ(nd)) ∫₀¹ ηd^{α−1} (1 − ηd)^{nd−1} dηd

Γ(απk + ndk) / Γ(απk) = Σ_{sdk=0}^{ndk} [ndk; sdk] (απk)^{sdk}

Γ(β) / Γ(β + nk) = (1/Γ(nk)) ∫₀¹ ξk^{β−1} (1 − ξk)^{nk−1} dξk

Γ(βτw + nkw) / Γ(βτw) = Σ_{tkw=0}^{nkw} [nkw; tkw] (βτw)^{tkw}

- The formulas for ηd, ξk are Beta identities.
- The formulas for sdk, tkw are generating functions for the numbers of tables in Chinese restaurant processes.
- [n; m] are unsigned Stirling numbers of the first kind.
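A small sketch of the [n; m] numbers via the usual recurrence [n+1, m] = n[n, m] + [n, m−1], with a numerical check of the rising-factorial identity above (the values of a and n are made up):

```python
import numpy as np
from scipy.special import gammaln

def stirling_first(n):
    """Unsigned Stirling numbers of the first kind [n, m] for m = 0..n."""
    s = np.zeros(n + 1)
    s[0] = 1.0                       # [0, 0] = 1
    for row in range(1, n + 1):
        s[1:row + 1] = (row - 1) * s[1:row + 1] + s[0:row]
        s[0] = 0.0
    return s

a, n = 0.7, 6
s = stirling_first(n)
lhs = np.exp(gammaln(a + n) - gammaln(a))        # Gamma(a+n)/Gamma(a)
rhs = sum(s[m] * a**m for m in range(n + 1))     # sum_m [n, m] a^m
print(lhs, rhs)
```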

Collapsed Variational Bayes for HDP

3. Assume the factorization:

q(α, β, γ, τ, π, η, s, ξ, t, z)
  = q(γ) q(α, β, τ, π) q(η, s, ξ, t | z) ∏_{id} q(zid)
  = q(γ) q(α) q(β) q(τ) q(π) ∏_d q(ηd|z) ∏_{dk} q(sdk|z) ∏_k q(ξk|z) ∏_{kw} q(tkw|z) ∏_{id} q(zid)

Collapsed Variational Bayes for HDP

4. Constrain all posterior mass to the first K topics; assume:

q(zid = k) = 0 for all i, d and k > K

Stick-breaking: πk = π̃k ∏_{l=1}^{k−1} (1 − π̃l),   π̃k | γ ∼ Beta(1, γ)
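A minimal sketch of this truncated stick-breaking construction (the values of γ and K are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, K = 1.0, 10
pi_tilde = rng.beta(1.0, gamma, size=K)   # stick-breaking fractions
# pi_k = pi_tilde_k * prod_{l<k} (1 - pi_tilde_l)
pi = pi_tilde * np.cumprod(np.concatenate(([1.0], 1.0 - pi_tilde[:-1])))
print(pi, pi.sum())   # first K weights of GEM(gamma); they sum to < 1
```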

Collapsed Variational Bayes for HDP

5. Improved second-order approximation.

- Approximate E[log ξk], E[log ηd], E[sdk] and E[tkw] for efficiency.
- The second-order Taylor expansion that works for log(n) fails for Ψ(n), which diverges faster at n = 0.
- Treat n = 0 exactly, and approximate n > 0.

[Figure: error ratio to the exact value (%) by topic for the plain second-order approximation and the second-order approximation with zero-treatment, together with E[nk] for each topic]
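A sketch of the zero-treatment trick for terms like E[sdk] (see the update equations in the appendix): for f(n) = Ψ(a + n) − Ψ(a), which vanishes at n = 0, the n = 0 term is handled exactly and a second-order expansion is taken around E+[n] = E[n | n > 0]. The probabilities q and the value of a below are made up.

```python
import numpy as np
from scipy.special import digamma, polygamma

def e_psi_zero_treated(q, a):
    """E[f(n)] for n = sum of Bernoulli(q_l), f(n) = psi(a+n) - psi(a).
    f(0) = 0, so only the n > 0 part is expanded (to second order)."""
    En, Vn = q.sum(), (q * (1 - q)).sum()
    Pp = 1.0 - np.prod(1.0 - q)               # P+[n > 0]
    Ep = En / Pp                              # E+[n] = E[n | n > 0]
    Vp = Vn / Pp - (1.0 - Pp) * Ep**2         # V+[n] = V[n | n > 0]
    return Pp * (digamma(a + Ep) - digamma(a)
                 + 0.5 * Vp * polygamma(2, a + Ep))

# Monte Carlo check with made-up Bernoulli probabilities.
rng = np.random.default_rng(0)
q, a = rng.uniform(0.05, 0.5, size=30), 0.7
n = (rng.random((200_000, q.size)) < q).sum(axis=1)
print(e_psi_zero_treated(q, a), (digamma(a + n) - digamma(a)).mean())
```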

Collapsed Variational Bayes for HDP
Experimental Results

- Corpora:
  - KOS: D = 3430, W = 6909, N = 467714.
  - Reuters: D = 8433, W = 4593, N = 566298.
  - NIPS: D = 1675, W = 12419, N = 2166029.
- 10% of the words in each document withheld as a test set.
- Repeated 10 times.
- Report both bounds on the marginal probabilities of the training set and predictive probabilities on the test set.

Collapsed Variational Bayes for HDP
Bounds on Marginal Probabilities

V: variational, CV: collapsed variational.

- The variational bound is significantly better than for VLDA or CVLDA.

Collapsed Variational Bayes for HDP
Predictive Probabilities

V: variational, CV: collapsed variational, G: collapsed Gibbs.

- Predictive probabilities are better than for VLDA and CVLDA.
- Better than GLDA.
- Worse than GHDP.
- Note: GHDP1 and GHDP100 give different results, indicating a local optima issue.

Collapsed Variational Bayes for HDP
Local Optima Issues?

V: variational, CV: collapsed variational, G: collapsed Gibbs.

- Initializing at the converged mode of GHDP100 gives better results (almost the same as GHDP100).
- Local optima problem: the Gibbs sampler is better at escaping bad local optima; if we can find a good local optimum for CVHDP, it can work very well.


Variational Inference vs MCMC

- Variational Inference
  − Applicable to a limited range of models.
  + Easier to debug.
  + Easy to diagnose convergence.
  − Derivations more involved.
  − Approximate posterior potentially far from the true one.
  + Lower bound on the marginal probability of the data.
  + Analysis of the posterior is easier.

- Markov Chain Monte Carlo (MCMC)
  + Applicable to a wide range of models.
  − Often hard to debug.
  − Hard to diagnose convergence (if ever).
  + Will converge to the true posterior if willing to wait.
  + Unconverged samples may still be “good enough”.
  − No good way to compute the marginal probability of the data.
  − Analysis of the posterior is harder due to non-identifiability.

Variational Inference in Nonparametric Models

- Inference in nonparametric models is currently dominated by MCMC methods.
- Only recently has variational inference been proposed, and only for DP mixtures at that.
- We explore variational inference for the HDP.
  - More choices of inference algorithms for nonparametric models.
  - Compare variational and MCMC methods in a specific circumstance.
  - The techniques developed are applicable to other models.
- Specific case: the HDP applied to topic modelling.
  - Here the HDP can be seen as a nonparametric generalization of latent Dirichlet allocation (LDA).
  - Lessons learned here can be applied to other settings.

Discussion

- Variational approximation taken to the extreme.
- Need to resolve the local optima issue.
- The techniques developed here are applicable to many models composed of discrete and Dirichlet variables.
- Infinite State Bayesian Networks (ISBNs) are a nonparametric generalization of Bayesian networks.
  - They use hierarchical Dirichlet processes as priors over sets of parameters.
- Future work: variational approximations for HDPs with more than two layers.

[Welling, Porteous and Bart 2008]

Variational Updates
Caution: Inspect Only with Magnifying Glass

Update counts from q_idk = q(zid = k):

E[n·k·] = Σ_{d,i} q_idk;   V[n·k·] = Σ_{d,i} q_idk (1 − q_idk);
P+[n·k·] = 1 − ∏_{d,i} (1 − q_idk);   E+[n·k·] = E[n·k·] / P+[n·k·];
V+[n·k·] = V[n·k·] / P+[n·k·] − (1 − P+[n·k·]) E+[n·k·]²;

E[ndk·] = Σ_i q_idk;   V[ndk·] = Σ_i q_idk (1 − q_idk);
P+[ndk·] = 1 − ∏_i (1 − q_idk);   E+[ndk·] = E[ndk·] / P+[ndk·];
V+[ndk·] = V[ndk·] / P+[ndk·] − (1 − P+[ndk·]) E+[ndk·]²;

E[n·kw] = Σ_{d,i: xid=w} q_idk;   V[n·kw] = Σ_{d,i: xid=w} q_idk (1 − q_idk);
P+[n·kw] = 1 − ∏_{d,i: xid=w} (1 − q_idk);   E+[n·kw] = E[n·kw] / P+[n·kw];
V+[n·kw] = V[n·kw] / P+[n·kw] − (1 − P+[n·kw]) E+[n·kw]²;

Update auxiliary variable posteriors:

E[log ηd] = Ψ(E[α]) − Ψ(E[α] + nd··);
E[log ξk] ≈ P+[n·k·] (Ψ(E[β]) − Ψ(E[β] + E+[n·k·]) − (1/2) V+[n·k·] Ψ′′(E[β] + E+[n·k·]));
E[sdk] ≈ G[α]G[πk] P+[ndk·] (Ψ(G[α]G[πk] + E+[ndk·]) − Ψ(G[α]G[πk]) + (1/2) V+[ndk·] Ψ′′(G[α]G[πk] + E+[ndk·]));
E[tkw] ≈ G[β]G[τw] P+[n·kw] (Ψ(G[β]G[τw] + E+[n·kw]) − Ψ(G[β]G[τw]) + (1/2) V+[n·kw] Ψ′′(G[β]G[τw] + E+[n·kw]));

Variational Updates
Caution: Inspect Only with Magnifying Glass

Update hyperparameters:

E[α] = (aα + Σ_{dk} E[sdk]) / (bα − Σ_d E[log ηd]);
G[α] = exp(Ψ(aα + Σ_{dk} E[sdk])) / (bα − Σ_d E[log ηd]);
E[β] = (aβ + Σ_{kw} E[tkw]) / (bβ − Σ_k E[log ξk]);
G[β] = exp(Ψ(aβ + Σ_{kw} E[tkw])) / (bβ − Σ_k E[log ξk]);
G[π̃k] = exp(Ψ(1 + Σ_d E[sdk])) / exp(Ψ(γ + 1 + Σ_d Σ_{l≥k} E[sdl]));
G[1 − π̃k] = exp(Ψ(γ + Σ_d Σ_{l>k} E[sdl])) / exp(Ψ(γ + 1 + Σ_d Σ_{l≥k} E[sdl]));
G[τw] = exp(Ψ(κ/W + Σ_k E[tkw])) / exp(Ψ(κ + Σ_{k,w} E[tkw]));
G[πk] = G[π̃k] ∏_{l=1}^{k−1} G[1 − π̃l];

Update q(z):

E[n¬id_dk·] = E[ndk·] − q_idk;   V[n¬id_dk·] = V[ndk·] − q_idk (1 − q_idk);
E[n¬id_·k,xid] = E[n·k,xid] − q_idk;   V[n¬id_·k,xid] = V[n·k,xid] − q_idk (1 − q_idk);
E[n¬id_·k·] = E[n·k·] − q_idk;   V[n¬id_·k·] = V[n·k·] − q_idk (1 − q_idk);

q(zid = k) ∝ [G[α]G[πk] + E[n¬id_dk·]] [G[β]G[τ_xid] + E[n¬id_·k,xid]] [E[β] + E[n¬id_·k·]]⁻¹
  × exp( − V[n¬id_dk·] / (2(G[α]G[πk] + E[n¬id_dk·])²)
         − V[n¬id_·k,xid] / (2(G[β]G[τ_xid] + E[n¬id_·k,xid])²)
         + V[n¬id_·k·] / (2(E[β] + E[n¬id_·k·])²) ).