How to compute posterior model probabilities ... and why that doesn’t mean that we have solved the
problem of model choice
Peter Green
School of Mathematics, University of Bristol
15 March 2011, Groningen: All models are wrong...
Peter Green (Bristol) Computing posterior model probabilities Groningen, March 2011 1 / 69
Outline
1 Ideal Bayes
2 Across-model methodology
3 Examples
4 Alternatives to across-model sampling
5 Methodological relatives
6 Better Bayes factors
7 Real Bayes
Ideal Bayes
Hierarchical model
Suppose given a prior p(k) over models k in a countable set K, and for each k
a prior distribution p(θk | k), and a likelihood p(Y | k, θk) for the data Y.
For definiteness and simplicity, suppose that p(θk | k) is a density in nk dimensions, and that there are no other parameters, so that where there are parameters common to all models these are subsumed into each θk ∈ R^{nk}.
Additional parameters, perhaps in additional layers of a hierarchy, are easily dealt with. All probability distributions are proper.
Ideal Bayes
Hierarchical model as DAG
[Figure: DAG of the generic hierarchical model for model choice: the model indicator k (prior p(k)) points to the model-specific parameter θk (prior p(θk | k)), and k and θk together point to the data Y (likelihood p(Y | k, θk)).]
Ideal Bayes
The joint posterior
p(k, θk | Y) = p(k) p(θk | k) p(Y | k, θk) / ∑_{k′∈K} ∫ p(k′) p(θ′k′ | k′) p(Y | k′, θ′k′) dθ′k′
can always be factorised as
p(k , θk |Y ) = p(k |Y )p(θk |k ,Y )
– the product of posterior model probabilities and model-specific parameter posteriors.
– very often the basis for reporting the inference, and in some of the methods mentioned below is also the basis for computation.
Ideal Bayes
Bayes factors, Evidence
Perhaps you didn’t think you wanted to report p(k | Y)? Of course, with these posterior probabilities, we can report Bayes Factors
Bkl = p(Y | k) / p(Y | l) = [p(k | Y) / p(l | Y)] ÷ [p(k) / p(l)]
for pairwise comparison of models.
For some, the Evidence (= marginal likelihood) p(Y | k) has an intrinsic meaning and interpretation.
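As a quick numerical illustration of the identity above (not from the slides; the values are made up), the Bayes factor is the ratio of posterior odds to prior odds:

```python
def bayes_factor(post_k, post_l, prior_k, prior_l):
    """Bayes factor B_kl = (posterior odds of k vs l) / (prior odds of k vs l)."""
    return (post_k / post_l) / (prior_k / prior_l)

# With equal prior weights, a posterior split of 0.8 vs 0.2 gives B_kl close to 4:
print(bayes_factor(0.8, 0.2, 0.5, 0.5))
```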
Ideal Bayes
Bayesian model averaging, prediction
We can do model averaging
E(F | Y) = ∑_k ∫ F(k, θk) p(k, θk | Y) dθk
for any function F with the same interpretation in each model; the expectation can be estimated simply by averaging F along the entire run, essentially ignoring the model indicator k.
As an example, we can do prediction:
p(Y+ | Y) = ∑_k p(Y+ | k, Y) p(k | Y)
a posterior-weighted mixture of the within-model-k predictions
p(Y+ | k, Y) = ∫ p(Y+ | k, θk) p(θk | k, Y) dθk
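A minimal sketch of how such a model-averaged expectation is estimated in practice, with a fabricated chain standing in for real across-model MCMC output (the two models, their marginals and the 0.6/0.4 weighting are purely illustrative):

```python
import random

random.seed(0)

# Hypothetical across-model MCMC output: a list of (k, theta_k) states,
# fabricated here purely for illustration.
chain = []
for _ in range(10_000):
    if random.random() < 0.6:                      # "model 1": scalar theta
        chain.append((1, random.betavariate(2, 3)))
    else:                                          # "model 2": 2-vector theta
        chain.append((2, (random.betavariate(2, 7), random.betavariate(3, 6))))

# F must have the same interpretation in every model; here, the first coordinate.
def F(k, theta):
    return theta if k == 1 else theta[0]

# Model-averaged estimate of E(F | Y): average F along the entire run,
# ignoring the model indicator k.
estimate = sum(F(k, th) for k, th in chain) / len(chain)
print(round(estimate, 3))
```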
Ideal Bayes
Note the generality of this basic formulation: it embraces both genuine model-choice situations, where the variable k indexes the collection of discrete models under consideration, and also settings where there is really a single model, but one with a variable-dimension parameter, for example a functional representation such as a series whose number of terms is not fixed (in which case, k is unlikely to be of direct inferential interest).
Ideal Bayes
Splines
[Figure: scatter plot of the example data (x, y), with x ∈ [0, 1] and y ∈ [−1, 1].]
Ideal Bayes
Polynomials
[Figure: the same scatter plot of (x, y) data, x ∈ [0, 1], y ∈ [−1, 1].]
Ideal Bayes
Compatibility across models
Some would argue that responsible adoption of this Bayesian hierarchical model presupposes that, e.g., p(θk | k) should be compatible, in that inference about functions of parameters that are meaningful in several models should be approximately invariant to k.
Such compatibility could in principle be exploited in the construction of MCMC methods (how?).
But it is philosophically tenable that no such compatibility is present, and we can’t assume it.
Across-model methodology
Across- and within-model simulation
How to compute p(k , θk |Y )?
Two main approaches using MCMC:
across: one MCMC simulation with states of the form (k, θk) ≈ p(k, θk | Y)
within: separate simulations of θk ≈ p(θk | k, Y) for each k.
And beyond straight MCMC:
particle filter: SMC in place of MCMC
ABC: ‘likelihood-free’ methods where Y | k, θk can be simulated but p(Y | k, θk) cannot be evaluated
nested sampling
variational methods
Across-model methodology
Across-model simulation: Reversible jump MCMC
The across-model MCMC simulator follows the ideal Bayesian, and treats (k, θk) as a single unknown. The state space for an across-model simulation is {(k, θk)} = ⋃_{k∈K} ({k} × R^{nk}).
Mathematically, this is not a particularly awkward object, but it is at least a little non-standard when nk varies with k.
We use Metropolis–Hastings to build a suitable reversible chain.
On the face of it, this requires measure-theoretic notation, which may be unwelcome! The point of the ‘reversible jump’ framework is to render the measure theory invisible, by means of a construction using only ordinary densities. Even the fact that we are jumping dimensions becomes essentially invisible.
Across-model methodology
Modelling the code
Let’s avoid all abstraction!
Consider how the transition will be implemented.
Assume X ⊂ R^d, and that π has a density (also denoted π).
At the current state x, we generate, say, r random numbers u from a known joint density g, and then form the proposed new state as a deterministic function of the current state and the random numbers: x′ = h(x, u), say.
The reverse transition from x′ to x would be made with the aid of random numbers u′ ∼ g′, giving x = h′(x′, u′).
Across-model methodology
Modelling the code
[Figure: ‘Reversible jump’ schematic: the map h takes (x, u) to x′ = h(x, u), and h′ takes (x′, u′) back to x = h′(x′, u′).]
Across-model methodology
The equilibrium probability of jumping from A to B is then an integral with respect to (x, u):
∫_{(x, x′=h(x,u)) ∈ A×B} π(x) g(u) α(x, x′) dx du.
The equilibrium probability of jumping from B to A is an integral with respect to (x′, u′):
∫_{(x=h′(x′,u′), x′) ∈ A×B} π(x′) g′(u′) α(x′, x) dx′ du′.
If the transformation from (x, u) to (x′, u′) is a diffeomorphism (the transformation and its inverse are differentiable), then we can apply the standard change-of-variable formula to write this as an integral with respect to (x, u).
Across-model methodology
Detailed balance says the two integrals must be equal: it holds if
α(x, x′) = min{1, [π(x′) g′(u′) / (π(x) g(u))] |∂(x′, u′)/∂(x, u)|},
involving only ordinary joint densities.
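As a sketch, this acceptance probability can be evaluated directly from the four density values and the Jacobian (the numbers below are arbitrary, purely to show the arithmetic):

```python
def rj_accept(pi_x, pi_xp, g_u, gp_up, jac):
    """alpha(x, x') = min{1, [pi(x') g'(u') / (pi(x) g(u))] * |Jacobian|},
    built entirely from ordinary densities."""
    return min(1.0, (pi_xp * gp_up) / (pi_x * g_u) * abs(jac))

# A proposal into a region of higher target density is always accepted:
print(rj_accept(pi_x=0.2, pi_xp=0.5, g_u=1.0, gp_up=1.0, jac=1.0))
```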
Across-model methodology
What’s the point?
Perhaps a little indirect! But it’s a flexible framework for constructing quite complex moves using only elementary calculus.
The possibility that r < d covers the typical case that, given x ∈ X, only a lower-dimensional subset of X is reachable in one step.
(The Gibbs sampler is the best-known example of this, since in that case only some of the components of the state vector are changed at a time, although the formulation here is more general as it allows the subset not to be parallel to the coordinate axes.)
Across-model methodology
The trans-dimensional case
But the main benefit of this formalism is that
α(x, x′) = min{1, [π(x′) g′(u′) / (π(x) g(u))] |∂(x′, u′)/∂(x, u)|}
applies, without change, in a variable-dimension context.
(Use the same symbol π(x) for the target density whatever the dimension of x in different parts of X.)
Provided that the transformation from (x, u) to (x′, u′) remains a diffeomorphism, the individual dimensions of x and x′ can be different. The dimension-jumping is ‘invisible’.
Across-model methodology
Dimension matching
Suppose the dimensions of x, x′, u and u′ are d, d′, r and r′ respectively; then we have functions h : R^d × R^r → R^{d′} and h′ : R^{d′} × R^{r′} → R^d, used respectively in x′ = h(x, u) and x = h′(x′, u′).
For the transformation from (x, u) to (x′, u′) to be a diffeomorphism requires that d + r = d′ + r′, so-called ‘dimension matching’; if this equality failed, the mapping and its inverse could not both be differentiable.
Dimension matching is necessary but not sufficient.
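A trivial check of the dimension-matching condition (the example dimensions are illustrative):

```python
def dimension_matching(d, r, d_prime, r_prime):
    """Necessary (not sufficient) condition for (x, u) -> (x', u')
    to be a diffeomorphism: d + r = d' + r'."""
    return d + r == d_prime + r_prime

# Jump from R^1 (with one innovation u) to R^2 (with none):
print(dimension_matching(d=1, r=1, d_prime=2, r_prime=0))   # True
print(dimension_matching(d=1, r=0, d_prime=2, r_prime=0))   # False
```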
[Figure: dimension matching, with (x, u1, u2) mapping to (x′1, x′2, u′): d + r = d′ + r′.]
[Figure: dimension matching in the degenerate case r′ = 0, with (x, u) mapping to (x′1, x′2).]
Across-model methodology
Details of application to model-choice
We wish to use these reversible jump moves to sample the space X = ⋃_{k∈K} ({k} × R^{nk}) with invariant distribution π, which here is p(k, θk | Y).
Just as in ordinary MCMC, we typically need multiple types of moves to traverse the whole space X. Each move is a transition kernel reversible with respect to π, but only in combination do we obtain an ergodic chain.
The moves will be indexed by m: consider a particular move m that proposes to take x = (k, θk) to x′ = (k′, θ′k′), or vice versa, for a specific pair (k, k′).
Across-model methodology
The acceptance probability is derived as before, and yields
αm(x, x′) = min{1, [π(x′)/π(x)] [jm(x′)/jm(x)] [g′m(u′)/gm(u)] |∂(x′, u′)/∂(x, u)|}.
Here jm(x) is the probability of choosing move type m when at x; the variables x, x′, u, u′ are of dimensions dm, d′m, rm, r′m respectively, with dm + rm = d′m + r′m; we have x′ = hm(x, u) and x = h′m(x′, u′); and the Jacobian has a form correspondingly depending on m.
Examples
Toy example
..... of no statistical use at all!
Suppose x lies in R ∪ R²: π(x) is a mixture:
with probability p1, x ∼ Beta(2, 3);
with probability p2, (x1, x2, 1 − x1 − x2) ∼ Dirichlet(2, 3, 4).
I will use three moves:
(1) within R: x → U(x − ε, x + ε);
(2) within R²: (x1, x2) → (x2, x1);
(3) between R and R².
In R, choose (1) or (3) with probabilities 1 − r1, r1.
In R², choose (2) or (3) with probabilities 1 − r2, r2.
Thus j3(x) = r1 for all x ∈ R and j3(x′) = r2 for all x′ ∈ R².
Examples
Dimension-changing with move (3)
Proposal: To go from x ∈ R to (x1, x2) ∈ R², draw u from U(0, 1) [so g3(u) = 1 if 0 < u < 1] and propose (x1, x2) = (x, u). For the reverse move, no u′ is required [write g′3(u′) ≡ 1] and set x = x1. This certainly gives a bijection: (x, u) ↔ (x1, x2), with Jacobian = 1.
Acceptance decision:
α = min{1, [π(x′)/π(x)] [j3(x′)/j3(x)] [g′3(u′)/g3(u)] |∂(x′, u′)/∂(x, u)|}
  = min{1, [p2 fD(x, u) / (p1 fB(x))] [r2/r1] [1/1] |1|}
  = min{1, p2 r2 fD(x, u) / (p1 r1 fB(x))}
For the reverse move, α = min{1, (p1 r1 fB(x)) / (p2 r2 fD(x, u))}.
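Putting the whole toy example together, a sketch implementation in Python; the values of p1, p2, r1, r2 and ε are not specified on the slides, so the ones below are illustrative:

```python
import random
from math import gamma

random.seed(1)

# Illustrative settings (the slides leave p1, p2, r1, r2 and eps unspecified):
p1, p2 = 0.5, 0.5    # mixture weights of the R and R^2 components
r1, r2 = 0.3, 0.3    # probability of attempting the jump move (3)
eps = 0.1            # half-width of the random-walk move (1)

def f_B(x):          # Beta(2,3) density on (0,1)
    return gamma(5) / (gamma(2) * gamma(3)) * x * (1 - x) ** 2 if 0 < x < 1 else 0.0

def f_D(x1, x2):     # Dirichlet(2,3,4) density of (x1, x2)
    x3 = 1 - x1 - x2
    if min(x1, x2, x3) <= 0:
        return 0.0
    return gamma(9) / (gamma(2) * gamma(3) * gamma(4)) * x1 * x2 ** 2 * x3 ** 3

def sweep(state):
    if isinstance(state, float):                     # currently in R
        x = state
        if random.random() < r1:                     # move (3): R -> R^2
            u = random.random()                      # g3(u) = 1 on (0,1)
            alpha = min(1.0, p2 * r2 * f_D(x, u) / (p1 * r1 * f_B(x)))
            return (x, u) if random.random() < alpha else x
        xp = x + random.uniform(-eps, eps)           # move (1): random walk
        alpha = min(1.0, f_B(xp) / f_B(x))
        return xp if random.random() < alpha else x
    x1, x2 = state                                   # currently in R^2
    if random.random() < r2:                         # move (3): R^2 -> R
        alpha = min(1.0, p1 * r1 * f_B(x1) / (p2 * r2 * f_D(x1, x2)))
        return x1 if random.random() < alpha else (x1, x2)
    alpha = min(1.0, f_D(x2, x1) / f_D(x1, x2))      # move (2): swap coordinates
    return (x2, x1) if random.random() < alpha else (x1, x2)

state, in_R = 0.5, 0
for _ in range(20_000):
    state = sweep(state)
    in_R += isinstance(state, float)

# In equilibrium the chain should spend about p1 of its time in R:
print(in_R / 20_000)
```

With p1 = p2, the fraction of sweeps spent in R should settle near one half, which is a simple check that the trans-dimensional move is balanced correctly.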
Examples
Toy example
[Figure slides: scatter plots of the sampled states on [0, 1] × [−0.2, 1.0] after 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500 and 1000 sweeps, gradually filling out both components of the target.]
Examples
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
x1 ~ Beta(2,3)
data
mod
el
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
x2[1] ~ Beta(2,7)
data
mod
el
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
x2[2] ~ Beta(3,6)
data
mod
el
Peter Green (Bristol) Computing posterior model probabilities Groningen, March 2011 48 / 69
Examples
Cyclones in Bay of Bengal
Multiple change-point analysis
A Bayesian approach to change-point analysis for point processes (y1, y2, ..., yn) combines:
– Poisson-process likelihood: p(y | x) = exp{∑_{i=1}^n log x(yi) − ∫_0^L x(t) dt}
– prior model for step function x(t), 0 ≤ t ≤ L, representing intensity.
Prior model: represent the step function by (k, {sj}_{j=1}^k, {hj}_{j=0}^k), with x(t) = ∑_j hj I[sj, sj+1)(t):
– number of steps k: Poisson(λ),
– step heights hj : Gamma(α, β),
– step positions sj : p(s | k) ∝ ∏_j (sj+1 − sj),
all independent.
MCMC for step functions
x = (k, {sj}_{j=1}^k, {hj}_{j=0}^k, α, β)
I will use four moves:
(a) Metropolis change to a randomly chosen step height hj.
(b) Metropolis change to a randomly chosen step position sj.
(c) Jump move: birth/death of steps
– birth: choose new step position s* at random, split current step height h into two: (h1, h2)
– death: choose step at random to kill, combine current step heights (h1, h2) into one: h
(d) Update hyperparameters α, β.
[Figure: birth and death of steps: a step of height h over an interval of width w containing s* is split into heights (h1, h2) over widths (w1, w2), preserving the weighted geometric mean h1^{w1} h2^{w2} = h^{w1+w2}; the mapping is (h, w, s*, u) ↔ (h1, h2, w1, w2).]
Example: cyclones hitting the Bay of Bengal
141 cyclones over a period of 100 years (a cyclone is a storm with winds ≥ … km h⁻¹).
[Figure: the cyclone event times plotted along a time axis from 0 to 100 years.]
A point process – of cyclone events – in time: modelled here as an inhomogeneous Poisson process, with piecewise constant intensity.
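A sketch of the Poisson-process log-likelihood for a step-function intensity, as used in this example; the change points, heights and event times below are invented for illustration:

```python
from math import log

def loglik(events, s, h, L):
    """log p(y | x) = sum_i log x(y_i) - int_0^L x(t) dt, for a step-function
    intensity with change points s = [s_1 < ... < s_k] and heights
    h = [h_0, ..., h_k] on [0, L]."""
    knots = [0.0] + list(s) + [L]
    # Integral term: sum of height * width over the steps.
    integral = sum(h[j] * (knots[j + 1] - knots[j]) for j in range(len(h)))
    # Intensity at a given time t.
    def x_at(t):
        for j in range(len(h)):
            if knots[j] <= t < knots[j + 1]:
                return h[j]
        return h[-1]
    return sum(log(x_at(y)) for y in events) - integral

# One change point at t = 50: intensity 2.0 before, 0.8 after (made-up data).
events = [1.0, 10.0, 30.0, 49.0, 60.0, 90.0]
print(round(loglik(events, s=[50.0], h=[2.0, 0.8], L=100.0), 3))
```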
Examples
Cyclones in Bay of Bengal
Illustrating the model-jumping move: splitting and merging steps.
Examples
Cyclones in Bay of Bengal
Choices of hyperparameters:
– Prior on k: Poisson(λ), with λ = ….
– Prior on hj : Gamma(α, β), with
– α = …
– β = … (n/L)
Sample of step functions from the posterior:
[Figure: sampled step-function intensities against time 0–100, intensity range 0–3.]
Posterior for the number of change points k:
[Figure: posterior probabilities for k = 0, 1, ..., 12, on a 0.0–0.20 probability scale.]
Zero change points is ruled out; k = … or … more probable than under the prior.
Posterior density estimates for change-point positions:
[Figure: density against time 0–100, on a 0.0–0.15 scale.]
Model-averaged estimate: E(x(·) | y)
[Figure: posterior mean intensity against time 0–100, range 0.0–2.5, with the cyclone event times marked along the axis.]
(The expectation of a random step function is not a step function.)
A small sample from the posterior
Examples
Cyclones in Bay of Bengal
Posterior of the number of steps
Examples
Cyclones in Bay of Bengal
An example of inference conditional on k: posterior densities of step positions
Examples
Cyclones in Bay of Bengal
An example of Bayesian model averaging: posterior expectation of the intensity
Alternatives to across-model sampling
Alternatives to joint model-parameter sampling
When marginal likelihoods are tractable, it’s usually a good idea to compute them (thus integrating out θk), then conduct search/sampling only over the model indicator k.
This direct approach has been taken a little further by Godsill (2001), who considers cases of ‘partial analytic structure’, where some of the parameters in θk may be integrated out, and the others left unchanged in the move that updates the model, to give an across-model sampler with probably superior performance.
Marginal likelihoods via within-model sampling: Nial’s talk!
Alternatives to across-model sampling
Some issues in choosing a sampler
How many models are there?
Do we want results across k, within each k, or for one k of interest?
Do we need the Evidence (marginal likelihood) values p(Y|k) absolutely, or only relatively?
Jumping between models as an aid to mixing (c.f. simulated tempering: mixing may be better in the 'other' model)
Are samplers for individual models already written and tested?
Are standard strategies like split/merge likely to work?
Trade-off between remembering and forgetting θk when leaving model k
Alternatives to across-model sampling
The different ways that models can interconnect
completely unrelated
nested in some irregular way
mixture models (nesting and exchangeability)
variable selection (factorial structure)
graphical models (possibly with constraints such as decomposability)
Methodological relatives
Methodological connections and extensions
Jump–diffusion (Grenander & Miller, 1994; Phillips & Smith, 1996)
Point process representations and samplers (Geyer & Møller, 1994; Stephens, 2000)
Product-space samplers (Carlin & Chib, 1995; Green & O'Hagan, 1998; Dellaportas et al., 2002)
Delayed rejection (Green & Mira, 2001)
Connections between reversible jump and continuous-time birth-and-death samplers (Cappé, Robert & Rydén, 2001)
Composite model space framework (Godsill, 2001)
Efficient construction of proposals (Brooks, Giudici & Roberts, 2003)
Automatic RJ sampler (Hastie, 2005)
Population RJMCMC and interacting SMC (Jasra et al., 2007/8)
Better Bayes factors
Bartolucci, Scaccia and Mira
The obvious estimator of p(k|Y) is the empirical frequency

p̂(k|Y) = #{t : k(t) = k} / N,

the proportion of the MCMC run spent in model k, but Bartolucci, Scaccia and Mira (Biometrika, 2006) showed that we can do better using one of several 'Rao-Blackwellised' estimators, exploiting the fact that the sampler moves into each model with a probability that is known at the time.
We can also use this in BMA, etc.
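The empirical-frequency estimator is simple bookkeeping over the sampled model indicators; a minimal sketch, with a hypothetical chain of visited models:

```python
from collections import Counter

def empirical_model_probs(k_chain):
    """Empirical estimator p_hat(k|Y) = #{t : k(t) = k} / N,
    the proportion of the run spent in each model k."""
    n = len(k_chain)
    return {k: count / n for k, count in Counter(k_chain).items()}

# hypothetical sequence of model indicators k(1), ..., k(8) from an RJMCMC run
chain = [1, 2, 2, 1, 2, 3, 2, 2]
probs = empirical_model_probs(chain)  # e.g. model 2 visited 5/8 of the time
```

The Rao-Blackwellised alternatives replace these 0/1 visit indicators with the known move probabilities, reducing variance.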
Better Bayes factors
Bartolucci, Scaccia and Mira
The basic idea of the simplest version.
Let α(t)kl be the Metropolis–Hastings acceptance ratio for moving from model k to model l on sweep t, or 0 if we are not in k at time t − 1 or do not propose such a move. Define

B̂kl = ( ∑t=1..N α(t)lk / nl ) / ( ∑t=1..N α(t)kl / nk ),

where nk = #{t : k(t) = k} is the number of visits to model k. This derives from Meng–Wong type bridge-sampling ideas, somewhat analogous to Chib–Jeliazkov.
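A sketch of the bookkeeping this estimator needs, assuming the acceptance ratios α(t)lk and α(t)kl are recorded per sweep (with zeros where no such move was proposed); the function name and the toy numbers are hypothetical:

```python
def bridge_bayes_factor(alpha_lk, alpha_kl, n_l, n_k):
    """Bartolucci-Scaccia-Mira bridge-type estimator B_hat_kl.

    alpha_lk[t]: MH acceptance ratio for a proposed move l -> k on sweep t
                 (0 if the chain was not in l, or no such move was proposed);
    alpha_kl[t]: likewise for k -> l;
    n_l, n_k:    numbers of sweeps the chain spent in models l and k.
    """
    return (sum(alpha_lk) / n_l) / (sum(alpha_kl) / n_k)

# toy, hypothetical recorded acceptance ratios over N = 5 sweeps
alpha_lk = [0.0, 0.8, 0.0, 0.5, 0.0]
alpha_kl = [0.2, 0.0, 0.1, 0.0, 0.0]
B_kl = bridge_bayes_factor(alpha_lk, alpha_kl, n_l=2, n_k=3)
```

Because the acceptance ratios are known exactly at proposal time, every sweep contributes information, not just the accepted jumps.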
Better Bayes factors
Bartolucci, Scaccia and Mira
[Figure: estimates of p(k = 2 | Y) plotted against sweep number (0–4000); vertical scale 0.56–0.64.]
Real Bayes
In reality, it’s not that easy...
Prior model probabilities may be fictional
The ideal Bayesian (or her/his scientist colleague) has real prior probabilities reflecting scientific judgement/belief across the model space; not very common in practice!
Arbitrariness in prior model probabilities may not affect Bayes factors, but it sabotages Bayesian model averaging!
Real Bayes
In reality, it’s not that easy...
No chance of passing the test of a sensitivity analysis
In ordinary parametric problems we commonly find that inferences are rather insensitive to moderately large variations in prior assumptions, except when there are very few data (indeed, the opposite case, of high sensitivity, poses a challenge to the non-Bayesian – perhaps the data carry less information than hoped?). But it's clear that a test of sensitivity to model probabilities will always fail:

p⋆(k|Y) / p⋆(l|Y) = ( p(k|Y) / p(l|Y) ) × ( (p⋆(k)/p⋆(l)) ÷ (p(k)/p(l)) )
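A quick numerical check of this identity: replacing the prior model probabilities moves the posterior odds by exactly the ratio of prior odds, so any sensitivity test is guaranteed to show sensitivity. The Bayes factor value below is hypothetical.

```python
def posterior_odds(bayes_factor_kl, prior_k, prior_l):
    """Posterior odds p(k|Y)/p(l|Y) = B_kl x prior odds p(k)/p(l)."""
    return bayes_factor_kl * prior_k / prior_l

B_kl = 4.0                               # hypothetical Bayes factor for k vs l
odds_a = posterior_odds(B_kl, 0.5, 0.5)  # one choice of prior model probs
odds_b = posterior_odds(B_kl, 0.9, 0.1)  # another, equally arbitrary choice

# the posterior odds change by exactly the ratio of the prior odds:
ratio = odds_b / odds_a                  # (0.9/0.1) / (0.5/0.5) = 9
```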
Real Bayes
In reality, it’s not that easy...
Problems with improper parameter priors
In ordinary parametric problems it is commonly true that it is safe to use improper priors – when posterior distributions are well-defined as limits based on a sequence of approximating proper priors (and not usually sensitive to what that sequence is).
But improper parameter priors make Bayes factors indeterminate (since improper priors can only be defined up to arbitrary normalising constants, which persist into marginal likelihoods).
And proper but vague/diffuse priors fail to solve the problem, since the Bayes factors will depend on the arbitrary degree of vagueness used.
Real Bayes
In reality, it’s not that easy...
Problems with improper parameter priors, continued
In certain circumstances, ideas such as Intrinsic or Fractional Bayes factors, or Expected Posterior priors, can be applied, essentially based on tying together improper priors across different models. These ideas lose much of the appeal of ideal Bayes arguments, have arbitrary aspects, and are not widely accepted.
Real Bayes
Model uncertainty? Yes, but do we have to choose?
Model uncertainty is a fact of life.
When can we quantify it?
When can we eliminate it?
When can we accommodate it?

Why choose?
Prediction
Scientific understanding
Presentation
Policy
Defence
Real Bayes
An illusion of unity?
Is 'model' too much of a catch-all?
different scientific mechanisms
selection of predictors in regression
number of components in mixture
order of AR model
complexity of polynomial or spline
Real Bayes
All the other criteria
Bayesian hypothesis testing, and 'alternative-prior carpentry'
AIC, BIC, DIC, DIC+, MDL, Cp
Decision theory
Bayesian p-values
Posterior predictive checks
Acknowledgements
Thanks to all my collaborators on trans-dimensional MCMC problems: Carmen Fernández, Paolo Giudici, Miles Harkness, David Hastie, Matthew Hodgson, Antonietta Mira, Agostino Nobile, Marco Pievatolo, Sylvia Richardson, Luisa Scaccia and Claudia Tarantola.
References and preprints available from
http://www.stats.bris.ac.uk/~peter/Research.html
http://www.stats.bris.ac.uk/~peter/papers/
[email protected]

© University of Bristol, 2011