BAYSM'14, Wien, Austria


talk at BAYSM'14 on mixtures and some results and considerations on the topic, hopefully not too soporific for the neXt generation!

Transcript of BAYSM'14, Wien, Austria

My life as a mixture

Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry

September 17, 2014
bayesianstatistics@gmail.com

Your next Valencia meeting:

I Objective Bayes section of ISBA major meeting:

I O-Bayes 2015 in Valencia, Spain, June 1-4(+1), 2015

I in memory of our friend Susie Bayarri

I objective Bayes, limited information, partly defined and approximate models, &tc

I all flavours of Bayesian analysis welcomed!

I “Spain in June, what else...?!”

Outline

Gibbs sampling

weakly informative priors

imperfect sampling

SAME algorithm

perfect sampling

Bayes factor

less informative prior

no Bayes factor

birthdate: May 1989, Ottawa Civic Hospital

Repartition of grey levels in an unprocessed chest radiograph

[X, 1994]

Mixture models

Structure of mixtures of distributions:

x ∼ fj with probability pj ,

for j = 1, 2, . . . , k, with overall density

p1f1(x) + · · ·+ pk fk(x) .

Usual case: parameterised components

∑_{i=1}^{k} pi f(x|θi) ,   with   ∑_{i=1}^{k} pi = 1

where the weights pi are distinguished from the other parameters
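As a concrete illustration of this two-level structure, here is a minimal Python sketch (mine, not from the talk) that simulates from a normal mixture by first drawing the component label and then the observation given that label; all names and values are illustrative.

```python
import numpy as np

def sample_mixture(n, p, mu, sigma, rng=None):
    """Draw n values from sum_j p_j N(mu_j, sigma_j^2): first the component
    label z_i with probabilities p, then x_i given z_i."""
    rng = np.random.default_rng() if rng is None else rng
    p, mu, sigma = map(np.asarray, (p, mu, sigma))
    z = rng.choice(len(p), size=n, p=p)   # latent component indicators
    x = rng.normal(mu[z], sigma[z])       # observations given their components
    return x, z

# e.g. a two-component mixture .3 N(0,1) + .7 N(2.5,1)
x, z = sample_mixture(500, [0.3, 0.7], [0.0, 2.5], [1.0, 1.0])
```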


Motivations

I Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering), or on the parameters (θi, pi), or on the number of groups

I Semiparametric perspective where mixtures are functional basis approximations of unknown distributions


License

Dataset derived from [my] license plate image. Grey levels concentrated on 256 values [later jittered]

[Marin & X, 2007]

Likelihood

For a sample of independent random variables (x1, · · · , xn), the likelihood is

∏_{i=1}^{n} { p1 f1(xi) + · · · + pk fk(xi) } .

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples. But the likelihood is still computable [pointwise] in O(kn) time.
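The pointwise O(kn) evaluation can be sketched as follows, assuming normal components for illustration (this snippet is mine, not part of the original slides):

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(x, p, mu, sigma):
    """Pointwise O(kn) evaluation of sum_i log( sum_j p_j f(x_i | theta_j) ),
    here with normal components f(.|theta_j) = N(mu_j, sigma_j^2)."""
    dens = np.asarray(p) * norm.pdf(np.asarray(x)[:, None], loc=mu, scale=sigma)  # (n, k)
    return float(np.log(dens.sum(axis=1)).sum())
```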


Normal mean benchmark

Normal mixture

pN(µ1, 1) + (1 − p)N(µ2, 1)

with only means unknown (2-D representation possible)

Identifiability

Parameters µ1 and µ2 identifiable: µ1 cannot be confused with µ2 when p is different from 0.5.

Presence of a spurious mode, understood by letting p go to 0.5


Bayesian inference on mixtures

For any prior π(θ,p), the posterior distribution of (θ,p) is available up to a multiplicative constant

π(θ,p|x) ∝ { ∏_{i=1}^{n} ∑_{j=1}^{k} pj f(xi|θj) } π(θ,p)

at a cost of order O(kn)

Difficulty

Despite this, derivation of posterior characteristics like posterior expectations is only possible in an exponential time of order O(k^n)!


Missing variable representation

Associate to each xi a missing/latent variable zi that indicates its component:

zi | p ∼ M_k(p1, . . . , pk)

and

xi | zi, θ ∼ f(·|θ_{zi}) .

Completed likelihood

ℓ(θ,p|x, z) = ∏_{i=1}^{n} p_{zi} f(xi|θ_{zi}) ,

and

π(θ,p|x, z) ∝ [ ∏_{i=1}^{n} p_{zi} f(xi|θ_{zi}) ] π(θ,p)

where z = (z1, . . . , zn)


Gibbs sampling for mixture models

Take advantage of the missing data structure:

Algorithm

I Initialization: choose p^{(0)} and θ^{(0)} arbitrarily

I Step t. For t = 1, . . .

1. Generate zi^{(t)} (i = 1, . . . , n) from (j = 1, . . . , k)

   P(zi^{(t)} = j | pj^{(t−1)}, θj^{(t−1)}, xi) ∝ pj^{(t−1)} f(xi|θj^{(t−1)})

2. Generate p^{(t)} from π(p|z^{(t)}),

3. Generate θ^{(t)} from π(θ|z^{(t)}, x).

[Brooks & Gelman, 1990; Diebolt & X, 1990, 1994; Escobar & West, 1991]

Normal mean example (cont’d)

Algorithm

I Initialization. Choose µ1^{(0)} and µ2^{(0)},

I Step t. For t = 1, . . .

1. Generate zi^{(t)} (i = 1, . . . , n) from

   P(zi^{(t)} = 1) = 1 − P(zi^{(t)} = 2) ∝ p exp( −(1/2)(xi − µ1^{(t−1)})² )

2. Compute nj^{(t)} = ∑_{i=1}^{n} I_{zi^{(t)}=j} and (s_j^x)^{(t)} = ∑_{i=1}^{n} I_{zi^{(t)}=j} xi

3. Generate µj^{(t)} (j = 1, 2) from N( (λδ + (s_j^x)^{(t)})/(λ + nj^{(t)}) , 1/(λ + nj^{(t)}) ).
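A minimal Python sketch of this two-mean Gibbs sampler, assuming a known weight p and conjugate N(δ, 1/λ) priors on both means; the default values of p, δ and λ below are illustrative choices of mine, not the talk's:

```python
import numpy as np

def gibbs_two_means(x, p=0.3, delta=0.0, lam=0.1, n_iter=10_000, rng=None):
    """Gibbs sampler for p N(mu1,1) + (1-p) N(mu2,1) with known weight p and
    independent conjugate N(delta, 1/lam) priors on mu1 and mu2."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    mu = rng.normal(size=2)                       # arbitrary initialisation
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        # 1. allocate each observation to one of the two components
        w1 = p * np.exp(-0.5 * (x - mu[0]) ** 2)
        w2 = (1 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
        z = (rng.uniform(size=n) > w1 / (w1 + w2)).astype(int)   # 0 or 1
        # 2. sufficient statistics n_j and s_j^x per component
        nj = np.array([(z == 0).sum(), (z == 1).sum()])
        sxj = np.array([x[z == 0].sum(), x[z == 1].sum()])
        # 3. draw both means from their conjugate normal full conditionals
        mu = rng.normal((lam * delta + sxj) / (lam + nj), 1.0 / np.sqrt(lam + nj))
        chain[t] = mu
    return chain
```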


Normal mean example (cont’d)

[Gibbs output for (µ1, µ2): (a) initialised at random, (b) initialised close to the lower mode]

[X & Casella, 2009]

License

Consider k = 3 components, a D3(1/2, 1/2, 1/2) prior on the weights, a N(x̄, σ̂²/3) prior on the means µi and a Ga(10, σ̂²) prior on the precisions σi^{−2}, where x̄ and σ̂² are the empirical mean and variance of the License data

[Empirical Bayes]

[Marin & X, 2007]
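A sketch of the corresponding Gibbs sampler under these empirical-Bayes priors; a shape/rate parameterisation of the Ga(10, σ̂²) prior is assumed, and the function is an illustrative reconstruction rather than the code behind the slides:

```python
import numpy as np

def gibbs_normal_mixture(x, k=3, n_iter=5_000, rng=None):
    """Gibbs sampler for a k-component normal mixture under the priors above:
    p ~ D_k(1/2,...,1/2), mu_j ~ N(xbar, s2/3), tau_j = sigma_j^(-2) ~ Ga(10, s2)
    (shape/rate assumed), with xbar and s2 the empirical mean and variance."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n, xbar, s2 = len(x), x.mean(), x.var()
    lam0 = 3.0 / s2                                   # prior precision of the means
    mu = rng.normal(xbar, np.sqrt(s2), size=k)
    tau = np.full(k, 1.0 / s2)
    p = np.full(k, 1.0 / k)
    keep = {"p": np.empty((n_iter, k)), "mu": np.empty((n_iter, k)), "tau": np.empty((n_iter, k))}
    for t in range(n_iter):
        # 1. allocations z_i from their discrete full conditionals
        logw = np.log(p) + 0.5 * np.log(tau) - 0.5 * tau * (x[:, None] - mu) ** 2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(k, p=wi) for wi in w])
        # 2. weights from the conjugate Dirichlet full conditional
        nj = np.bincount(z, minlength=k)
        p = rng.dirichlet(0.5 + nj)
        # 3. means and precisions, component by component
        for j in range(k):
            xj = x[z == j]
            prec = lam0 + nj[j] * tau[j]
            mu[j] = rng.normal((lam0 * xbar + tau[j] * xj.sum()) / prec, 1.0 / np.sqrt(prec))
            tau[j] = rng.gamma(10 + 0.5 * nj[j],
                               1.0 / (s2 + 0.5 * np.sum((xj - mu[j]) ** 2)))
        keep["p"][t], keep["mu"][t], keep["tau"][t] = p, mu, tau
    return keep
```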

Outline [current section: weakly informative priors]

weakly informative priors

I possible symmetric empirical Bayes priors

p ∼ D(γ, . . . , γ),  θi ∼ N(µ, ωσi²),  σi^{−2} ∼ Ga(ν, εν)

which can be replaced with hierarchical priors
[Diebolt & X, 1990; Richardson & Green, 1997]

I independent improper priors on the θj's prohibited, thus standard “flat” and Jeffreys priors impossible to use (except with the exclude-empty-component trick)

[Diebolt & X, 1990; Wasserman, 1999]

weakly informative priors

I Reparameterization to compact set for use of uniform priors

µi −→ e^{µi}/(1 + e^{µi}),   σi −→ σi/(1 + σi)

[Chopin, 2000]

I dependent weakly informative priors

p ∼ D(k, . . . , 1),  θi ∼ N(θ_{i−1}, ζσ²_{i−1}),  σi ∼ U([0, σ_{i−1}])

[Mengersen & X, 1996; X & Titterington, 1998]

I reference priors

p ∼ D(1, . . . , 1),  θi ∼ N(µ0, (σi² + τ0²)/2),  σi² ∼ C+(0, τ0²)

[Moreno & Liseo, 1999]

Re-ban on improper priors

Difficult to use improper priors in the setting of mixtures because independent improper priors,

π(θ) = ∏_{i=1}^{k} πi(θi) ,  with  ∫ πi(θi) dθi = ∞

end up, for all n's, with the property

∫ π(θ,p|x) dθ dp = ∞

Reason

There are (k − 1)^n terms among the k^n terms in the expansion that allocate no observation at all to the i-th component


Outline [current section: imperfect sampling]

Connected difficulties

1. Number of modes of the likelihood of order O(k!):
© Maximization and even [MCMC] exploration of the posterior surface harder

2. Under exchangeable priors on (θ,p) [prior invariant under permutation of the indices], all posterior marginals are identical:
© Posterior expectation of θ1 equal to posterior expectation of θ2


License

When the Gibbs output does not (re)produce exchangeability, the Gibbs sampler has failed to explore the whole parameter space: not enough energy to switch simultaneously enough component allocations at once

[Marin & X, 2007]

Label switching paradox

I We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler.

I If we observe it, then we do not know how to estimate the parameters.

I If we do not, then we are uncertain about the convergence!!!

[Celeux, Hurn & X, 2000]

[Fruhwirth-Schnatter, 2001, 2004]

[Holmes, Jasra & Stephens, 2005]


Constraints

Usual reply to lack of identifiability: impose constraints like

µ1 ≤ . . . ≤ µk

in the prior

Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints.

Computational “detail”

The constraint need not be imposed during the simulation but can instead be imposed after simulation, by reordering the MCMC output according to the constraints. [This avoids possible negative effects on convergence]


Relabeling towards the mode

Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP

(θ,p)^{(i*)}  with  i* = arg max_{i=1,...,M} π{ (θ,p)^{(i)} | x }

Pivotal Reordering

At iteration i ∈ {1, . . . , M},

1. Compute the optimal permutation

   τi = arg min_{τ∈Sk} d( τ((θ^{(i)}, p^{(i)})), (θ^{(i*)}, p^{(i*)}) )

   where d(·, ·) is a distance in the parameter space.

2. Set (θ^{(i)}, p^{(i)}) = τi((θ^{(i)}, p^{(i)})).

[Celeux, 1998; Stephens, 2000; Celeux, Hurn & X, 2000]
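A compact sketch of this pivotal reordering for generic MCMC output; the Euclidean distance and the array layout are my own choices:

```python
import itertools
import numpy as np

def pivotal_reorder(theta, p, log_post):
    """Pivotal reordering of mixture MCMC output: take the draw with the highest
    (approximate) posterior value as pivot and permute every draw towards it.
    theta, p: (M, k) arrays of component parameters and weights;
    log_post: length-M array of log posterior values of the draws."""
    M, k = theta.shape
    i_star = int(np.argmax(log_post))
    pivot = np.concatenate([theta[i_star], p[i_star]])
    perms = [list(s) for s in itertools.permutations(range(k))]
    for i in range(M):
        # permutation minimising the Euclidean distance to the pivot
        best = min(perms, key=lambda s: np.sum(
            (np.concatenate([theta[i, s], p[i, s]]) - pivot) ** 2))
        theta[i], p[i] = theta[i, best], p[i, best]
    return theta, p
```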


Loss functions for mixture estimation

Global loss function that considers the distance between predictives

L(ξ, ξ̂) = ∫_X fξ(x) log{ fξ(x)/fξ̂(x) } dx

eliminates the labelling effect

Similar solution for estimating clusters through the allocation variables

L(z, ẑ) = ∑_{i<j} ( I_{[zi=zj]}(1 − I_{[ẑi=ẑj]}) + I_{[ẑi=ẑj]}(1 − I_{[zi=zj]}) ) .

[Celeux, Hurn & X, 2000]
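The allocation-based loss counts the pairs clustered together under one allocation and apart under the other; a direct sketch (illustrative, not from the talk):

```python
import numpy as np

def pairwise_allocation_loss(z, z_hat):
    """Number of pairs (i < j) allocated to the same component in one of z,
    z_hat but to different components in the other (label-invariant loss)."""
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    same = z[:, None] == z[None, :]
    same_hat = z_hat[:, None] == z_hat[None, :]
    return int(np.triu(same ^ same_hat, k=1).sum())
```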

Outline [current section: SAME algorithm]

MAP estimation

For high-dimensional parameter spaces, difficulty with marginal MAP (MMAP) estimates because nuisance parameters must be integrated out:

θ1^{MMAP} = arg max_{θ1∈Θ1} p(θ1|y)

where

p(θ1|y) = ∫_{Θ2} p(θ1, θ2|y) dθ2

MAP estimation

SAME stands for State Augmentation for Marginal Estimation
[Doucet, Godsill & X, 2001]

Artificially augmented probability model whose marginal distribution is

pγ(θ1|y) ∝ p(θ1|y)^γ

via replications of the nuisance parameters:

I Replace θ2 with γ artificial replications, θ2(1), . . . , θ2(γ)

I Treat the θ2(j)'s as distinct random variables:

qγ(θ1, θ2(1), . . . , θ2(γ)|y) ∝ ∏_{k=1}^{γ} p(θ1, θ2(k)|y)

I Use the corresponding marginal for θ1:

qγ(θ1|y) = ∫ qγ(θ1, θ2(1), . . . , θ2(γ)|y) dθ2(1) · · · dθ2(γ)
         ∝ ∫ ∏_{k=1}^{γ} p(θ1, θ2(k)|y) dθ2(1) · · · dθ2(γ) = pγ(θ1|y)

I Build an MCMC algorithm in the augmented space, with invariant distribution qγ(θ1, θ2(1), . . . , θ2(γ)|y)

I Use the simulated subsequence {θ1^{(i)}; i ∈ N} as drawn from the marginal posterior pγ(θ1|y)
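A schematic SAME-type sketch for the earlier two-mean normal benchmark, with θ1 = (µ1, µ2) and the allocations z as replicated nuisance variables; the known weight p, the conjugate priors and the linear annealing schedule for γ are all assumptions of mine, not the paper's implementation:

```python
import numpy as np

def same_two_means(x, p=0.3, delta=0.0, lam=0.1, n_iter=2_000, gamma_max=50, rng=None):
    """SAME-type sampler for the two-mean normal mixture: theta1 = (mu1, mu2),
    nuisance theta2 = allocations z, replicated gamma(t) times so the chain
    targets p(mu | x)^gamma and concentrates around the MAP as gamma grows."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    mu = rng.normal(size=2)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        gamma = 1 + int(gamma_max * t / n_iter)       # linear annealing of gamma
        prec = np.full(2, gamma * lam)                # accumulated precisions
        num = np.full(2, gamma * lam * delta)         # accumulated precision x mean
        for _ in range(gamma):                        # gamma replications of z
            w1 = p * np.exp(-0.5 * (x - mu[0]) ** 2)
            w2 = (1 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
            z = (rng.uniform(size=n) > w1 / (w1 + w2)).astype(int)
            for j in (0, 1):
                prec[j] += (z == j).sum()
                num[j] += x[z == j].sum()
        # the product of the gamma Gaussian conditionals is again Gaussian
        mu = rng.normal(num / prec, 1.0 / np.sqrt(prec))
        chain[t] = mu
    return chain
```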

example: Galaxy dataset benchmark

82 observations of galaxy velocities from 3 (?) groups

Algorithm                  EM      MCEM    SAME
Mean log-posterior         65.47   60.73   66.22
Std dev of log-posterior   2.31    4.48    0.02

[Doucet & X, 2002]

Really the SAME?!

SAME algorithm re-invented in many guises:

I Gaetan & Yao, 2003, Biometrika

I Jacquier, Johannes & Polson, 2007, J. Econometrics

I Lele, Dennis & Lutscher, 2007, Ecology Letters [data cloning]

I ...

Outline [current section: perfect sampling]

Propp and Wilson’s perfect sampler

Difficulty devising MCMC stopping rules: when should one stop an MCMC algorithm?!

Principle: Coupling from the past

rather than start at t = 0 and wait till t = +∞, start at t = −∞ and wait till t = 0

[Propp & Wilson, 1996]

© Outcome at time t = 0 is stationary


CFTP Algorithm

Algorithm (Coupling from the past)

1. Start from the m possible values at time −t

2. Run the m chains till time 0 (coupling allowed)

3. Check if the chains are equal at time 0

4. If not, start further back: t ← 2 ∗ t, reusing the same random numbers at the times already simulated

I requires a finite state space

I probability of merging chains must be high enough

I hard to implement w/o a monotonicity in both state space and transition
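A generic coupling-from-the-past sketch for a small finite-state chain with transition matrix P, running all m starting values and reusing the same uniforms each time the start is pushed further back (illustrative code, not the mixture-specific sampler of the following slides):

```python
import numpy as np

def cftp_finite(P, rng=None):
    """Propp-Wilson coupling from the past for a finite chain: returns one draw
    exactly from the stationary distribution of the transition matrix P."""
    rng = np.random.default_rng() if rng is None else rng
    m = P.shape[0]
    cum = np.cumsum(P, axis=1)
    us = []                               # us[t-1] is the uniform driving time -t
    T = 1
    while True:
        while len(us) < T:                # extend the random stream further back
            us.append(rng.uniform())
        states = np.arange(m)             # the m chains started at time -T
        for t in range(T, 0, -1):         # move from time -t to time -t+1
            u = us[t - 1]                 # same uniform for every chain (coupling)
            states = np.array([min(np.searchsorted(cum[s], u), m - 1) for s in states])
        if np.all(states == states[0]):   # coalescence: the time-0 value is exact
            return int(states[0])
        T *= 2                            # otherwise restart further back in the past

# e.g. cftp_finite(np.array([[0.9, 0.1], [0.5, 0.5]]))
```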


Mixture models

Simplest possible mixture structure

pf0(x) + (1 − p)f1(x),

with uniform prior on p.

Algorithm (Data Augmentation Gibbs sampler)

At iteration t:

1. Generate n iid U(0, 1) rv's u1^{(t)}, . . . , un^{(t)}.

2. Derive the indicator variables zi^{(t)} as zi^{(t)} = 0 iff

   ui^{(t)} ≤ qi^{(t−1)} = p^{(t−1)} f0(xi) / [ p^{(t−1)} f0(xi) + (1 − p^{(t−1)}) f1(xi) ]

   and compute m^{(t)} = ∑_{i=1}^{n} zi^{(t)}.

3. Simulate p^{(t)} ∼ Be(n + 1 − m^{(t)}, 1 + m^{(t)}).
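A direct transcription of this data-augmentation sampler in Python, with the two fixed component densities f0 and f1 supplied by the user; the normal densities in the usage lines are only for illustration:

```python
import numpy as np
from scipy.stats import norm

def da_gibbs_weight(x, f0, f1, n_iter=10_000, rng=None):
    """Data augmentation Gibbs sampler for p f0(x) + (1-p) f1(x) with known
    component densities f0, f1 and a uniform prior on the weight p."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n, p = len(x), 0.5
    d0, d1 = f0(x), f1(x)                       # component densities at the data
    ps = np.empty(n_iter)
    for t in range(n_iter):
        q = p * d0 / (p * d0 + (1 - p) * d1)    # P(z_i = 0 | p, x_i)
        z = (rng.uniform(size=n) > q).astype(int)   # z_i = 0 iff u_i <= q_i
        m = z.sum()
        p = rng.beta(n + 1 - m, 1 + m)          # Be(n + 1 - m, 1 + m) update
        ps[t] = p
    return ps

# e.g. with f0 the N(0,1) and f1 the N(2,1) densities
data = np.r_[np.random.normal(0, 1, 70), np.random.normal(2, 1, 30)]
ps = da_gibbs_weight(data, lambda v: norm.pdf(v, 0, 1), lambda v: norm.pdf(v, 2, 1))
```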

Mixture models

Algorithm (CFTP Gibbs sampler)

At iteration −t:

1. Generate n iid uniform rv's u1^{(−t)}, . . . , un^{(−t)}.

2. Partition [0, 1) into intervals [q_{[j]}, q_{[j+1]}).

3. For each [q_{[j]}^{(−t)}, q_{[j+1]}^{(−t)}), generate p_j^{(−t)} ∼ Be(n − j + 1, j + 1).

4. For each j = 0, 1, . . . , n, set r_j^{(−t)} ← p_j^{(−t)}

5. For (ℓ = 1, ℓ < T, ℓ++), set r_j^{(−t+ℓ)} ← p_k^{(−t+ℓ)} with k such that r_j^{(−t+ℓ−1)} ∈ [q_{[k]}^{(−t+ℓ)}, q_{[k+1]}^{(−t+ℓ)}]

6. Stop if the r_j^{(0)}'s (0 ≤ j ≤ n) are all equal. Otherwise, t ← 2 ∗ t.

[Hobert et al., 1999]

Mixture models

Extension to the case k = 3:

Sample of n = 35 observations from

.23N(2.2, 1.44) + .62N(1.4, 0.49) + .15N(0.6, 0.64)

[Hobert et al., 1999]

Outline [current section: Bayes factor]

Bayesian model choice

Comparison of models Mi by Bayesian means:

probabilise the entire model/parameter space

I allocate probabilities pi to all models Mi

I define priors πi (θi ) for each parameter space Θi

I compute

π(Mi|x) = [ pi ∫_{Θi} fi(x|θi) πi(θi) dθi ] / [ ∑_j pj ∫_{Θj} fj(x|θj) πj(θj) dθj ]

Bayesian model choice

Comparison of models Mi by Bayesian means:

Relies on a central notion: the evidence

Zk = ∫_{Θk} πk(θk) Lk(θk) dθk ,

aka the marginal likelihood.

Chib’s representation

Direct application of Bayes' theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),

Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x)

Replace with an approximation to the posterior

Ẑk = m̂k(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x) .

[Chib, 1995]


Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell estimate

π̂k(θk*|x) = (1/T) ∑_{t=1}^{T} πk(θk*|x, zk^{(t)}) ,

where the zk^{(t)}'s are Gibbs sampled latent variables

[Diebolt & Robert, 1990; Chib, 1995]

Compensation for label switching

For mixture models, zk^{(t)} usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
Consequences on the numerical approximation, biased by an order k!
Recover the theoretical symmetry by using

π̂k(θk*|x) = (1/(T k!)) ∑_{σ∈Sk} ∑_{t=1}^{T} πk(σ(θk*)|x, zk^{(t)}) ,

for all σ's in Sk, the set of all permutations of {1, . . . , k}
[Berkhof, Mechelen, & Gelman, 2003]
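A sketch of this symmetrised Rao-Blackwell average; cond_post is a hypothetical user-supplied function returning πk(θ|x, z), and theta_star stores each parameter block as a length-k array so that relabelling reduces to permuting indices:

```python
import itertools
import numpy as np

def symmetrised_rao_blackwell(theta_star, z_draws, cond_post, k):
    """Average pi_k(theta* | x, z^(t)) over the Gibbs draws z^(t) and over all
    k! relabellings of theta*, as in the display above; cond_post(theta, z) is
    a hypothetical helper returning the full-conditional posterior density."""
    perms = list(itertools.permutations(range(k)))
    total = 0.0
    for z in z_draws:                      # Gibbs-sampled allocation vectors
        for sigma in perms:                # permute the component labels of theta*
            theta_perm = {name: np.asarray(val)[list(sigma)]
                          for name, val in theta_star.items()}
            total += cond_post(theta_perm, z)
    return total / (len(z_draws) * len(perms))
```

Chib's estimate of the evidence then uses log fk(x|θk*) + log πk(θk*) minus the log of this average.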


Galaxy dataset (k)

Using Chib's estimate, with θk* as MAP estimator,

log(Zk(x)) = −105.1396

for k = 3, while introducing permutations leads to

log(Zk(x)) = −103.3479

Note that −105.1396 + log(3!) = −103.3479

k        2        3        4        5        6        7        8
Zk(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44

Estimations of the marginal likelihoods by the symmetrised Chib's approximation (based on 10^5 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk).

[Lee et al., 2008]


More efficient sampling

Difficulty with the explosive number of terms in

π̂k(θk*|x) = (1/(T k!)) ∑_{σ∈Sk} ∑_{t=1}^{T} πk(σ(θk*)|x, zk^{(t)}) ,

when most terms are equal to zero...

Iterative bridge sampling:

E^{(t)}(k) = E^{(t−1)}(k) [ M1^{−1} ∑_{l=1}^{M1} π(θl|x) / (M1 q(θl) + M2 π(θl|x)) ] / [ M2^{−1} ∑_{m=1}^{M2} q(θm) / (M1 q(θm) + M2 π(θm|x)) ]

[Fruhwirth-Schnatter, 2004]
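For orientation, a plain Meng and Wong-style iterative bridge sampler between an unnormalised posterior and a normalised importance density q; this is a generic sketch rather than the exact recursion displayed above, and it omits numerical-underflow safeguards:

```python
import numpy as np

def bridge_evidence(log_post_u, log_q, draws_post, draws_q, n_iter=200):
    """Iterative bridge sampling estimate of log Z, with draws_post from the
    posterior and draws_q from q; log_post_u is the unnormalised log posterior."""
    M1, M2 = len(draws_post), len(draws_q)
    s1, s2 = M1 / (M1 + M2), M2 / (M1 + M2)
    lp1, lq1 = log_post_u(draws_post), log_q(draws_post)   # at posterior draws
    lp2, lq2 = log_post_u(draws_q), log_q(draws_q)         # at q draws
    log_Z = 0.0
    for _ in range(n_iter):
        # numerator averages over the q-draws, denominator over the posterior draws
        num = np.mean(np.exp(lp2 - log_Z) / (s1 * np.exp(lp2 - log_Z) + s2 * np.exp(lq2)))
        den = np.mean(np.exp(lq1) / (s1 * np.exp(lp1 - log_Z) + s2 * np.exp(lq1)))
        log_Z += np.log(num) - np.log(den)
    return log_Z
```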

More efficient sampling

where the importance function is

q(θ) = (1/J1) ∑_{j=1}^{J1} p(θ|z^{(j)}) ∏_{i=1}^{k} p(ξi | ξ^{(j)}_{<i}, ξ^{(j−1)}_{>i}, z^{(j)}, x)

More efficient sampling

or where

q(θ) = (1/k!) ∑_{σ∈S(k)} p(θ|σ(z^o)) ∏_{i=1}^{k} p(ξi | σ(ξ^o_{<i}), σ(ξ^o_{>i}), σ(z^o), x)

Further efficiency

After de-switching (un-switching?), representation of the importance function as

q(θ) = (1/(J k!)) ∑_{j=1}^{J} ∑_{σ∈Sk} π(θ|σ(ϕ^{(j)}), x) = (1/k!) ∑_{σ∈Sk} hσ(θ)

where hσ is associated with a particular mode of q

Assuming generations

(θ^{(1)}, . . . , θ^{(T)}) ∼ hσc(θ) ,

how many of the hσ(θ^{(t)}) are non-zero?

Sparsity for the sum

Contribution of each term relative to q(θ)

ησi(θ) = hσi(θ) / (k! q(θ)) = hσi(θ) / ∑_{σ∈Sk} hσ(θ)

and importance of permutation σi evaluated by

E_{hσc}[ησi(θ)] = (1/M) ∑_{l=1}^{M} ησi(θ^{(l)}) ,   θ^{(l)} ∼ hσc(θ)

Approximate set A(k) ⊆ S(k) consists of [σ1, · · · , σn] for the smallest n that satisfies the condition

φn = (1/M) ∑_{l=1}^{M} | q̂n(θ^{(l)}) − q(θ^{(l)}) | < τ

dual importance sampling with approximation

DIS2A

1. Randomly select {z^{(j)}, θ^{(j)}}_{j=1}^{J} from the Gibbs sample and un-switch; construct q(θ)

2. Choose hσc(θ) and generate particles {θ^{(t)}}_{t=1}^{T} ∼ hσc(θ)

3. Construction of the approximation q̂(θ) using a first M-sample

   3.1 Compute E_{hσc}[ησ1(θ)], · · · , E_{hσc}[ησk!(θ)]

   3.2 Reorder the σ's such that E_{hσc}[ησ1(θ)] > · · · > E_{hσc}[ησk!(θ)].

   3.3 Initially set n = 1 and compute the q̂n(θ^{(t)})'s and φn. If φ_{n=1} < τ, go to Step 4. Otherwise increase n to n + 1

4. Replace q(θ^{(1)}), . . . , q(θ^{(T)}) with q̂(θ^{(1)}), . . . , q̂(θ^{(T)}) to estimate E

[Lee & X, 2014]

illustrations

(a) Fishery data

k   k!   |A(k)|   ∆(A)
3   6    1.0000   0.1675
4   24   2.7333   0.1148

(b) Galaxy data

k   k!   |A(k)|     ∆(A)
3   6    1.000      0.1675
4   24   15.7000    0.6545
6   720  298.1200   0.4146

Table: Mean estimates of approximate set sizes, |A(k)|, and the reduction rate of the number of evaluated h-terms, ∆(A), for (a) fishery and (b) galaxy datasets

Outline [current section: less informative prior]

Jeffreys priors for mixtures [teaser]

True Jeffreys prior for mixtures of distributions defined as |Eθ[∇^T ∇ log f(X|θ)]|

I O(k) matrix

I unavailable in closed form except special cases

I unidimensional integrals approximated by Monte Carlo tools (see the sketch after this slide)

[Grazian [talk tomorrow] et al., 2014+]
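A sketch of the Monte Carlo approximation mentioned above: estimate the Fisher information by averaging score outer products over data simulated from the mixture, then take the root determinant as the unnormalised Jeffreys density; sample_x and score are hypothetical user-supplied callables:

```python
import numpy as np

def mc_jeffreys_density(theta, sample_x, score, n_sim=10_000, rng=None):
    """Monte Carlo approximation of |I(theta)|^(1/2), with
    I(theta) = E_theta[ s(X) s(X)^T ] and s(x) = grad_theta log f(x|theta)."""
    rng = np.random.default_rng() if rng is None else rng
    xs = sample_x(theta, n_sim, rng)                 # draws from the mixture at theta
    S = np.stack([score(x, theta) for x in xs])      # (n_sim, d) score vectors
    info = S.T @ S / n_sim                           # Fisher information estimate
    return np.sqrt(max(np.linalg.det(info), 0.0))    # unnormalised Jeffreys density
```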

Difficulties

I complexity grows in O(k²)

I significant computing requirement (reduced by delayed acceptance)

[Banterle et al., 2014]

I differ from component-wise Jeffreys
[Diebolt & X, 1990; Stoneking, 2014]

I when is the posterior proper?

I how to check properness via MCMC outputs?

Outline [current section: no Bayes factor]

Difficulties with Bayes factors

I delicate calibration towards supporting a given hypothesis or model

I long-lasting impact of prior modelling, despite overall consistency

I discontinuity in the use of improper priors in most settings

I binary outcome more suited for immediate decision than for model evaluation

I related impossibility to ascertain misfit or outliers

I missing assessment of uncertainty associated with the decision

I difficult computation of marginal likelihoods in most settings

Reformulation

I Representation of the test problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1

I Mixture model thus contains both models under comparison as extreme cases

I Inspired by consistency result of Rousseau and Mengersen (2011) on overfitting mixtures

I Use of the posterior distribution of the weight of a model instead of a single-digit posterior probability

[Kamari [see poster] et al., 2014+]

Construction of Bayes tests

Given two statistical models,

M1 : x ∼ f1(x|θ1), θ1 ∈ Θ1   and   M2 : x ∼ f2(x|θ2), θ2 ∈ Θ2 ,

embed both models within an encompassing mixture model

Mα : x ∼ α f1(x|θ1) + (1 − α) f2(x|θ2) ,  0 ≤ α ≤ 1 .

Both models as special cases of the mixture model, one for α = 1 and the other for α = 0

© Test as inference on α

Arguments

I substituting an estimate of the weight α for the posterior probability of model M1 produces an equally convergent indicator of which model is "true", while removing the need for often artificial prior probabilities on model indices

I interpretation at least as natural as for the posterior probability, while avoiding the zero-one loss setting

I highly problematic computation of marginal likelihoods bypassed by standard algorithms for mixture estimation

I straightforward extension to a collection of models allows one to consider all models at once

I posterior on α thoroughly evaluates the strength of support for a given model, compared with a single-digit Bayes factor

I mixture model acknowledges the possibility that both models [or none] could be acceptable

Arguments

I standard prior modelling can be reproduced here but improper priors now acceptable, when both models are reparameterised towards common-meaning parameters, e.g. location and scale

I using the same parameters on both components is essential: opposition between components is not an issue with different parameter values

I parameters of the components, θ1 and θ2, integrated out by Monte Carlo

I contrary to common testing settings, the data signal a lack of agreement with either model when the posterior on α is away from both 0 and 1

I in most settings, the approach is easily calibrated by parametric bootstrap, providing the posterior of α under each model and a prior predictive error

Toy examples (1)

Test of a Poisson P(λ) versus a geometric Geo(p) [as a number of failures, starting at zero]
Same parameter used in Poisson P(λ) and geometric Geo(p) with

p = 1/(1 + λ)

Improper noninformative prior π(λ) = 1/λ is valid
Posterior on λ conditional on the allocation vector ζ:

π(λ | x, ζ) ∝ exp(−n1(ζ)λ) λ^{∑_{i=1}^{n} xi − 1} (λ + 1)^{−(n2 + s2(ζ))}

and α ∼ Be(n1 + a0, n2 + a0)
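A sketch of the corresponding Gibbs sampler, handling the non-standard λ conditional by a random-walk Metropolis step on log λ; the step size, the initial values and the scipy shift for the geometric pmf are my own implementation choices:

```python
import numpy as np
from scipy.stats import geom, poisson

def log_cond_lambda(lam, n1, sum_x, n2, s2):
    # log pi(lambda | x, zeta), up to a constant, under the prior pi(lambda) = 1/lambda
    return -n1 * lam + (sum_x - 1) * np.log(lam) - (n2 + s2) * np.log(1 + lam)

def poisson_vs_geometric(x, a0=0.5, n_iter=10_000, rng=None):
    """Gibbs sampler for the mixture test alpha P(lambda) + (1-alpha) Geo(1/(1+lambda))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n, sum_x = len(x), x.sum()
    lam, alpha = max(x.mean(), 0.1), 0.5
    alphas = np.empty(n_iter)
    for t in range(n_iter):
        # 1. allocate each observation to Poisson (True) or geometric (False)
        prob = 1.0 / (1.0 + lam)
        w1 = alpha * poisson.pmf(x, lam)
        w2 = (1 - alpha) * geom.pmf(x + 1, prob)   # shift: scipy's geom counts trials
        z1 = rng.uniform(size=n) < w1 / (w1 + w2)
        n1, n2, s2 = z1.sum(), (~z1).sum(), x[~z1].sum()
        # 2. weight alpha from its Beta full conditional
        alpha = rng.beta(n1 + a0, n2 + a0)
        # 3. random-walk Metropolis on log(lambda) for the non-standard conditional
        prop = lam * np.exp(0.3 * rng.normal())
        log_acc = (log_cond_lambda(prop, n1, sum_x, n2, s2)
                   - log_cond_lambda(lam, n1, sum_x, n2, s2)
                   + np.log(prop) - np.log(lam))   # Jacobian of the log-scale move
        if np.log(rng.uniform()) < log_acc:
            lam = prop
        alphas[t] = alpha
    return alphas
```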

Toy examples (1)

[Posterior of the Poisson weight α when a0 = .1, .2, .3, .4, .5, 1, for a sample of 10^5 geometric G(0.5) observations; lBF= −509684]

Toy examples (2)

Normal N(µ, 1) model versus double-exponential L(µ, √2) model
[Scale √2 is intentionally chosen to make both distributions share the same variance]

Location parameter µ can be shared by both models with a single flat prior π(µ). Beta distributions B(a0, a0) are compared wrt their hyperparameter a0

Toy examples (2)

Posterior of the double-exponential weight α for L(0, √2) data, with 5, . . . , 10^3 observations and 10^5 Gibbs iterations

Toy examples (2)

Posterior of the Normal weight α for N(0, .72) data with 10^3 observations and 10^4 Gibbs iterations

Toy examples (2)

Posterior of the Normal weight α for N(0, 1) data with 10^3 observations and 10^4 Gibbs iterations

Toy examples (2)

Posterior of the normal weight α for double-exponential data, with 10^3 observations and 10^4 Gibbs iterations

Danke schön! Enjoy BAYSM 2014!