WSC 2011, advanced tutorial on simulation in Statistics


Description

Slides of the tutorial given at WSC 2011, Phoenix, Arizona, Dec. 12 at 3:30

Transcript of WSC 2011, advanced tutorial on simulation in Statistics

Page 1: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)


Christian P. Robert

Université Paris-Dauphine, IuF, & CREST, http://www.ceremade.dauphine.fr/~xian

WSC 2011, Phoenix, December 12, 2011

Page 2: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Outline

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Page 3: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Motivation and leading example

1 Motivation and leading example
   Latent variables
   Inferential methods

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Page 4: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Latent variables

Latent structures make life harder!

Even simple statistical models may lead to computational complications, as in latent variable models

f(x|θ) = ∫ f*(x, x*|θ) dx*

If (x, x*) observed, fine! If only x observed, trouble!

[mixtures, HMMs, state-space models, &tc]

Page 7: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Latent variables

Mixture models

Models of mixtures of distributions:

X ∼ fj with probability pj,

for j = 1, 2, . . . , k, with overall density

X ∼ p1f1(x) + · · ·+ pkfk(x) .

For a sample of independent random variables (X1, · · · , Xn), sample density

∏_{i=1}^n {p1 f1(xi) + · · · + pk fk(xi)} .

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.

Page 10: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Latent variables

Mixture likelihood

[Figure: likelihood surface over (µ1, µ2) for the 0.3 N(µ1, 1) + 0.7 N(µ2, 1) mixture]

Page 11: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Maximum likelihood methods


For an iid sample X1, . . . , Xn from a population with density f(x|θ1, . . . , θk), the likelihood function is

L(x|θ) = L(x1, . . . , xn|θ1, . . . , θk) = ∏_{i=1}^n f(xi|θ1, . . . , θk).

◦ Maximum likelihood has global justifications from asymptotics

◦ Computational difficulty depends on structure, e.g. latent variables

Page 14: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Maximum likelihood methods (2)

Example (Mixtures)

For a mixture of two normal distributions,

pN(µ, τ2) + (1 − p)N(θ,σ2) ,

likelihood proportional to

∏_{i=1}^n [ p τ⁻¹ ϕ((xi − µ)/τ) + (1 − p) σ⁻¹ ϕ((xi − θ)/σ) ]

can be expanded into 2^n terms.
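
As an aside (not part of the original slides), the mixture likelihood itself remains cheap to evaluate pointwise in R; it is only the expansion into 2^n allocation terms that is intractable. Data and parameter values below are arbitrary, for illustration only.

# direct evaluation of the two-component normal mixture log-likelihood
set.seed(1)
x <- c(rnorm(70, mean = 0), rnorm(30, mean = 2.5))   # artificial sample of size n = 100
mixloglik <- function(p, mu, tau, theta, sigma, x)
  sum(log(p * dnorm(x, mu, tau) + (1 - p) * dnorm(x, theta, sigma)))
mixloglik(p = 0.7, mu = 0, tau = 1, theta = 2.5, sigma = 1, x = x)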

Page 15: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Maximum likelihood methods (3)

Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function.

Example

In the special case

f(x|µ, σ) = (1 − ε) exp{(−1/2)x²} + (ε/σ) exp{(−1/2σ²)(x − µ)²}

with ε > 0 known, whatever n, the likelihood is unbounded:

lim_{σ→0} L(x1, . . . , xn|µ = x1, σ) = ∞

Page 17: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

The Bayesian Perspective

In the Bayesian paradigm, the information brought by the data x, realization of

X ∼ f(x|θ),

is combined with prior information specified by a prior distribution with density

π(θ)

Page 19: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Central tool...

Summary in a probability distribution, π(θ|x), called the posterior distribution

Derived from the joint distribution f(x|θ)π(θ), according to

π(θ|x) = f(x|θ) π(θ) / ∫ f(x|θ) π(θ) dθ ,

[Bayes Theorem]

where

Z(x) = ∫ f(x|θ) π(θ) dθ

is the marginal density of X, also called the (Bayesian) evidence

Page 22: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Central tool...central to Bayesian inference

Posterior defined up to a constant as

π(θ|x) ∝ f(x|θ)π(θ)

Operates conditional upon the observations

Integrates simultaneously prior information and information brought by x

Avoids averaging over the unobserved values of x

Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected

Provides a complete inferential scope and a unique motor of inference

Page 27: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Examples of Bayes computational problems

1 complex parameter space, as e.g. constrained parameter sets like those resulting from imposing stationarity constraints in time series;

2 complex sampling model with an intractable likelihood, ase.g. in some graphical models;

3 use of a huge dataset;

4 complex prior distribution (which may be the posterior distribution associated with an earlier sample);

5 involved inferential procedures, as for instance Bayes factors

Bπ01(x) = [ P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x) ] / [ π(θ ∈ Θ0) / π(θ ∈ Θ1) ] .

Page 32: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again

Observations from

x1, . . . , xn ∼ f(x|θ) = pϕ(x;µ1,σ1) + (1 − p)ϕ(x;µ2,σ2)

Prior

µi|σi ∼ N(ξi, σi²/ni) ,   σi² ∼ IG(νi/2, si²/2) ,   p ∼ Be(α, β)

Posterior

π(θ|x1, . . . , xn) ∝ ∏_{j=1}^n { p ϕ(xj; µ1, σ1) + (1 − p) ϕ(xj; µ2, σ2) } π(θ)

= ∑_{ℓ=0}^n ∑_{(kt)} ω(kt) π(θ|(kt))

[O(2^n)]

Page 35: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again [2]

For a given permutation (kt), conditional posterior distribution

π(θ|(kt)) = N( ξ1(kt), σ1²/(n1 + ℓ) ) × IG( (ν1 + ℓ)/2, s1(kt)/2 )
× N( ξ2(kt), σ2²/(n2 + n − ℓ) ) × IG( (ν2 + n − ℓ)/2, s2(kt)/2 )
× Be( α + ℓ, β + n − ℓ )

Page 36: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again [3]

where

x̄1(kt) = (1/ℓ) ∑_{t=1}^ℓ x_{kt} ,   ŝ1(kt) = ∑_{t=1}^ℓ (x_{kt} − x̄1(kt))² ,
x̄2(kt) = (1/(n − ℓ)) ∑_{t=ℓ+1}^n x_{kt} ,   ŝ2(kt) = ∑_{t=ℓ+1}^n (x_{kt} − x̄2(kt))²

and

ξ1(kt) = ( n1 ξ1 + ℓ x̄1(kt) ) / ( n1 + ℓ ) ,   ξ2(kt) = ( n2 ξ2 + (n − ℓ) x̄2(kt) ) / ( n2 + n − ℓ ) ,

s1(kt) = s1² + ŝ1(kt) + ( n1 ℓ / (n1 + ℓ) ) (ξ1 − x̄1(kt))² ,

s2(kt) = s2² + ŝ2(kt) + ( n2 (n − ℓ) / (n2 + n − ℓ) ) (ξ2 − x̄2(kt))² ,

posterior updates of the hyperparameters

Page 37: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again [4]

Bayes estimator of θ:

δπ(x1, . . . , xn) = ∑_{ℓ=0}^n ∑_{(kt)} ω(kt) Eπ[θ|x, (kt)]

Too costly: 2^n terms

Unfortunate, as the decomposition is meaningful for clustering purposes

Page 39: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

1 Motivation and leading example

2 Monte Carlo Integration
   Monte Carlo integration
   Importance Sampling
   Bayesian importance sampling

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Page 40: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Monte Carlo integration

Theme: Generic problem of evaluating the integral

I = Ef[h(X)] = ∫X h(x) f(x) dx

where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function

Page 41: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Monte Carlo integration (2)

Monte Carlo solution: first use a sample (X1, . . . , Xm) from the density f to approximate the integral I by the empirical average

h̄m = (1/m) ∑_{j=1}^m h(xj)

which converges, h̄m −→ Ef[h(X)],

by the Strong Law of Large Numbers

Page 43: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Monte Carlo precision

Estimate the variance with

vm = (1/(m − 1)) ∑_{j=1}^m [h(xj) − h̄m]² ,

and for m large,

( h̄m − Ef[h(X)] ) / √vm ∼ N(0, 1).

Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of Ef[h(X)].
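
A minimal R illustration of the empirical average and of such confidence bounds (not from the slides): the integrand h(x) = exp(−x²) under a standard normal f is an arbitrary choice made here only because the exact value is available for checking.

# plain Monte Carlo estimate of Ef[h(X)] with f = N(0,1) and h(x) = exp(-x^2)
set.seed(42)
m    <- 1e4
h    <- function(x) exp(-x^2)
xs   <- rnorm(m)                          # sample from f
hbar <- mean(h(xs))                       # empirical average
vhat <- var(h(xs)) / m                    # estimated variance of the empirical average
c(estimate = hbar,
  lower = hbar - 1.96 * sqrt(vhat),       # approximate 95% confidence bounds
  upper = hbar + 1.96 * sqrt(vhat),
  exact = 1 / sqrt(3))                    # closed-form value, for comparison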

Page 44: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Example (Cauchy prior/normal sample)

For estimating a normal mean, a robust prior is a Cauchy prior

X ∼ N (θ, 1), θ ∼ C(0, 1).

Under squared error loss, posterior mean

δπ(x) = ∫_{−∞}^{∞} [ θ/(1 + θ²) ] e^{−(x−θ)²/2} dθ / ∫_{−∞}^{∞} [ 1/(1 + θ²) ] e^{−(x−θ)²/2} dθ

Page 45: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Example (Cauchy prior/normal sample (2))

Form of δπ suggests simulating iid variables

θ1, · · · , θm ∼ N (x, 1)

and calculating

δπm(x) = ∑_{i=1}^m [ θi/(1 + θi²) ] / ∑_{i=1}^m [ 1/(1 + θi²) ] .

The Law of Large Numbers implies

δπm(x) −→ δπ(x) as m −→∞.
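
A minimal R version of this estimator (not part of the original code; the value x = 10 matches the figure on the next slide):

# Monte Carlo approximation of the posterior mean under a Cauchy prior
set.seed(3)
x     <- 10
m     <- 1e3
theta <- rnorm(m, mean = x, sd = 1)                        # theta_i ~ N(x, 1)
sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))        # approximates delta^pi(x)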

Page 46: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration


Range of estimators δπm for 100 runs and x = 10

Page 47: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Importance sampling

Paradox

Simulation from f (the true density) is not necessarily optimal

An alternative to direct sampling from f is importance sampling, based on the representation

Ef[h(X)] = ∫X [ h(x) f(x)/g(x) ] g(x) dx .

which allows us to use other distributions than f

Page 49: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Importance sampling algorithm

Evaluation of

Ef[h(X)] = ∫X h(x) f(x) dx

by

1 Generate a sample X1, . . . , Xm from a distribution g

2 Use the approximation

(1/m) ∑_{j=1}^m [ f(Xj)/g(Xj) ] h(Xj)
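
A hedged R illustration of these two steps on a classic toy problem (not from the slides): estimating the normal tail probability P(X > 4.5), with a translated exponential as instrumental distribution g; here h is the indicator of the tail event.

# importance sampling estimate of P(X > 4.5) for X ~ N(0,1)
set.seed(4)
m <- 1e4
y <- rexp(m) + 4.5                 # step 1: sample from g, an Exp(1) density shifted to 4.5
w <- dnorm(y) / dexp(y - 4.5)      # importance weights f(Y_j)/g(Y_j)
mean(w)                            # step 2: h(Y_j) = 1 for every draw, so the estimate is mean(w)
pnorm(4.5, lower.tail = FALSE)     # exact value, for comparison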

Page 50: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Implementation details

◦ Instrumental distribution g chosen from distributions easy to simulate

◦ The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f

◦ Dependent proposals can be used, as seen later

Page 51: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Finite vs. infinite variance

Although g can be any density, some choices are better than others:

◦ Finite variance only when

Ef[ h²(X) f(X)/g(X) ] = ∫X h²(x) f²(x)/g(x) dx < ∞ .

◦ Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) not appropriate.

◦ If sup f/g = ∞, the weights f(xj)/g(xj) vary widely, giving too much importance to a few values xj.

◦ If sup f/g = M < ∞, finite variance for L2 functions

Page 54: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Selfnormalised importance sampling

For ratio estimator

δnh = ∑_{i=1}^n ωi h(xi) / ∑_{i=1}^n ωi

with Xi ∼ g(y) and Wi such that

E[Wi|Xi = x] = κ f(x)/g(x)

Page 55: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Selfnormalised variance

then

var(δnh) ≈ (1/(n²κ²)) ( var(Snh) − 2 Eπ[h] cov(Snh, Sn1) + Eπ[h]² var(Sn1) ) .

for

Snh = ∑_{i=1}^n Wi h(Xi) ,   Sn1 = ∑_{i=1}^n Wi

Rough approximation

var δnh ≈ (1/n) varπ(h(X)) {1 + varg(W)}
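
A hedged R sketch of the self-normalised estimator and of the weight variability driving the rough approximation above; the target (a shifted t3 density, treated as known only up to a constant) and the normal instrumental density are illustrative choices, not taken from the slides.

# self-normalised importance sampling for E_pi[h(theta)]
set.seed(5)
n      <- 1e4
pitild <- function(th) dt(th - 1, df = 3)          # target, known only up to a constant kappa
g      <- function(th) dnorm(th, mean = 1, sd = 2) # instrumental density
theta  <- rnorm(n, mean = 1, sd = 2)               # theta_i ~ g
w      <- pitild(theta) / g(theta)                 # weights W_i, defined up to kappa
h      <- function(th) th^2
sum(w * h(theta)) / sum(w)                         # ratio estimator delta_h^n
sum(w)^2 / sum(w^2)                                # effective sample size, driven by var_g(W)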

Page 56: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Bayes factor approximation

When approximating the Bayes factor

B01 = ∫_{Θ0} f0(x|θ0) π0(θ0) dθ0 / ∫_{Θ1} f1(x|θ1) π1(θ1) dθ1

use of importance functions ϖ0 and ϖ1 and

B01 = [ n0⁻¹ ∑_{i=1}^{n0} f0(x|θ^i_0) π0(θ^i_0) / ϖ0(θ^i_0) ] / [ n1⁻¹ ∑_{i=1}^{n1} f1(x|θ^i_1) π1(θ^i_1) / ϖ1(θ^i_1) ]

Page 57: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women

Example (R benchmark)

“A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.”

200 Pima Indian women with observed variables

plasma glucose concentration in oral glucose tolerance test

diastolic blood pressure

diabetes pedigree function

presence/absence of diabetes

Page 58: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Probit modelling on Pima Indian women

Probability of diabetes as a function of the above variables

P(y = 1|x) = Φ(x1β1 + x2β2 + x3β3) ,

Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a g-prior modelling:

β ∼ N3( 0, n (X^T X)⁻¹ )

Page 60: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Importance sampling for the Pima Indian dataset

Use of the importance function inspired from the MLE estimate distribution

β ∼ N(β̂, Σ̂)

R importance sampling code
# probitlpost (log-posterior) and dmvlnorm (log multivariate normal density) are helper
# functions defined elsewhere in the tutorial material; rmvnorm comes from the mvtnorm
# package; X1 and X2 are the design matrices of the two competing probit models
model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
model2=summary(glm(y~-1+X2,family=binomial(link="probit")))
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
  sigma=2*model1$cov.unscaled)))/mean(exp(probitlpost(is2,y,X2)-
  dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))

Page 62: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women

Comparison of the variation of the Bayes factor approximations based on 100 replicas for 20,000 simulations from the prior and the above MLE importance sampler


Page 63: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Bridge sampling

Special case: if

π̃1(θ1|x) ∝ π1(θ1|x) ,   π̃2(θ2|x) ∝ π2(θ2|x)

live on the same space (Θ1 = Θ2), then

B12 ≈ (1/n) ∑_{i=1}^n π̃1(θi|x)/π̃2(θi|x) ,   θi ∼ π2(θ|x)

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
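
A hedged R sketch of this special-case estimator on a toy problem where the answer is known (two unnormalised densities whose normalising constants are 3 and 1, so that B12 = 3); all choices below are illustrative and not taken from the slides.

# bridge sampling, special case: B12 ~ (1/n) sum pi1.(theta_i)/pi2.(theta_i), theta_i ~ pi2
set.seed(6)
n     <- 1e4
pi1.  <- function(th) 3 * dnorm(th, mean = 0.5, sd = 1)   # unnormalised pi1, Z1 = 3
pi2.  <- function(th) dnorm(th, mean = 0, sd = 1)         # unnormalised pi2, Z2 = 1
theta <- rnorm(n)                                         # draws from pi2(.|x) = N(0,1)
mean(pi1.(theta) / pi2.(theta))                           # should be close to B12 = Z1/Z2 = 3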

Page 64: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

(Further) bridge sampling

General identity:

B12 = ∫ π̃1(θ|x) α(θ) π2(θ|x) dθ / ∫ π̃2(θ|x) α(θ) π1(θ|x) dθ   ∀ α(·)

≈ [ (1/n2) ∑_{i=1}^{n2} π̃1(θ2i|x) α(θ2i) ] / [ (1/n1) ∑_{i=1}^{n1} π̃2(θ1i|x) α(θ1i) ] ,   θji ∼ πj(θ|x)

Page 65: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Optimal bridge sampling

The optimal choice of auxiliary function is

α* = ( n1 + n2 ) / ( n1 π̃1(θ|x) + n2 π̃2(θ|x) )

leading to

B12 ≈ [ (1/n2) ∑_{i=1}^{n2} π̃1(θ2i|x) / ( n1 π̃1(θ2i|x) + n2 π̃2(θ2i|x) ) ] / [ (1/n1) ∑_{i=1}^{n1} π̃2(θ1i|x) / ( n1 π̃1(θ1i|x) + n2 π̃2(θ1i|x) ) ]

Page 66: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Illustration for the Pima Indian dataset

Use of the MLE induced conditional of β3 given (β1, β2) as a pseudo-posterior and mixture of both MLE approximations on β3 in bridge sampling estimate

R bridge sampling code
# hmprobit (returning posterior simulations for the probit model), meanw and probitlpost
# are helper functions defined elsewhere in the tutorial material; ginv is from the MASS
# package and dmvnorm from the mvtnorm package
cova=model2$cov.unscaled
expecta=model2$coeff[,1]
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
probit1p=cbind(probit1,pseudo)
bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
  sqrt(covw),log=T))/(dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
  cova[3,3])))/mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
  dnorm(pseudo,expecta[3],cova[3,3])))

Page 68: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler and the above importance sampler

Page 69: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

The original harmonic mean estimator

When θkt ∼ πk(θ|x),

(1/T) ∑_{t=1}^T 1/L(θkt|x)

is an unbiased estimator of 1/mk(x)

[Newton & Raftery, 1994]

Highly dangerous: Most often leads to an infinite variance!!!
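
A hedged R illustration of this danger on a toy conjugate model (normal likelihood, normal prior, chosen here only so that m(x) is available in closed form): even with exact posterior draws, the harmonic mean estimates spread widely around the true value, and in this configuration, where the prior (sd 10) is much flatter than the likelihood (sd 1), the estimator's variance is in fact infinite.

# harmonic mean estimator of m(x) in a normal-normal model with known marginal
# model: x | theta ~ N(theta, 1), theta ~ N(0, tau^2), one observation x = 2
set.seed(7)
x   <- 2
tau <- 10
m.exact <- dnorm(x, 0, sqrt(1 + tau^2))                 # closed-form marginal likelihood
hm.est  <- replicate(100, {
  theta <- rnorm(1e4, mean = x * tau^2 / (1 + tau^2),   # exact posterior draws
                 sd = sqrt(tau^2 / (1 + tau^2)))
  1 / mean(1 / dnorm(x, theta, 1))                      # harmonic mean estimate of m(x)
})
m.exact
summary(hm.est)                                         # note the spread over 100 replications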

Page 71: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

“The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that this estimator is consistent, ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.

The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.”

[Radford Neal’s blog, Aug. 23, 2008]

Page 73: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Approximating Zk from a posterior sample

Use of the [harmonic mean] identity

Eπk[ ϕ(θk) / ( πk(θk) Lk(θk) ) | x ] = ∫ [ ϕ(θk) / ( πk(θk) Lk(θk) ) ] [ πk(θk) Lk(θk) / Zk ] dθk = 1/Zk

no matter what the proposal ϕ(·) is.

[Gelfand & Dey, 1994; Bartolucci et al., 2006]

Direct exploitation of the MCMC output

Page 75: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Comparison with regular importance sampling

Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk) Lk(θk) for the approximation

Z1k = 1 / [ (1/T) ∑_{t=1}^T ϕ(θk^(t)) / ( πk(θk^(t)) Lk(θk^(t)) ) ]

to enjoy finite variance

Page 76: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Comparison with regular importance sampling (cont’d)

Compare Z1k with a standard importance sampling approximation

Z2k = (1/T) ∑_{t=1}^T πk(θk^(t)) Lk(θk^(t)) / ϕ(θk^(t))

where the θk^(t)'s are generated from the density ϕ(·) (with fatter tails, like t's)

Page 77: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

HPD indicator as ϕ

Use the convex hull of MCMC simulations corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:

ϕ(θ) = (10/T) ∑_{t∈HPD} I_{ d(θ, θ^(t)) ≤ ε }
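
A hedged R sketch of the identity above with a simpler choice of ϕ (a normal density matched to the posterior sample rather than the HPD indicator), on a normal–normal toy model where Zk is known in closed form; the HPD-indicator version would only change the definition of ϕ. None of the numerical choices below come from the slides.

# Gelfand-Dey estimator: Z_hat = 1 / mean( phi(theta) / (pi(theta) * L(theta)) )
# toy model: x | theta ~ N(theta, 1), theta ~ N(0, 10^2), one observation x = 2
set.seed(8)
x <- 2; tau <- 10
theta <- rnorm(1e4, x * tau^2 / (1 + tau^2), sqrt(tau^2 / (1 + tau^2)))  # posterior draws
phi   <- dnorm(theta, mean(theta), sd(theta))        # proposal density evaluated at the draws
piL   <- dnorm(theta, 0, tau) * dnorm(x, theta, 1)   # prior times likelihood at the draws
c(estimate = 1 / mean(phi / piL),
  exact    = dnorm(x, 0, sqrt(1 + tau^2)))           # true marginal likelihood, for comparison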

Page 78: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the above harmonic mean sampler and importance samplers


Page 79: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Chib’s representation

Direct application of Bayes' theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),

mk(x) = fk(x|θk) πk(θk) / πk(θk|x)

[Bayes Theorem]

Use of an approximation to the posterior

m̂k(x) = fk(x|θ*k) πk(θ*k) / π̂k(θ*k|x) .

Page 81: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell estimate

π̂k(θ*k|x) = (1/T) ∑_{t=1}^T πk(θ*k|x, zk^(t)) ,

where the zk^(t)'s are Gibbs sampled latent variables

Page 82: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Case of the probit model

For the completion by z,

π̂(θ|x) = (1/T) ∑_t π(θ|x, z^(t))

is a simple average of normal densities

R Chib's approximation code
# gibbsprobit (a Gibbs sampler for the probit model based on its latent-variable completion)
# is defined elsewhere in the tutorial material; dmvlnorm is a log multivariate normal density
gibbs1=gibbsprobit(Niter,y,X1)
gibbs2=gibbsprobit(Niter,y,X2)
bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
  sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
  mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
  sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))

Page 84: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the above Chib's and importance samplers

Page 85: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm
   Monte Carlo Methods based on Markov Chains
   The Metropolis–Hastings algorithm
   The random walk Metropolis-Hastings algorithm
   Adaptive MCMC

4 Approximate Bayesian computation

Page 86: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains

Epiphany! It is not necessary to use a sample from the distribution f to approximate the integral

I = ∫ h(x) f(x) dx ,

Principle: Obtain X1, . . . , Xn ∼ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f

[Metropolis et al., 1953]

Page 88: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains (2)

Idea

For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f

Ensures the convergence in distribution of (X(t)) to a random variable from f. For a “large enough” T0, X(T0) can be considered as distributed from f

Produces a dependent sample X(T0), X(T0+1), . . ., which is generated from f, sufficient for most approximation purposes.

Problem: How can one build a Markov chain with a given stationary distribution?

Page 91: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

The Metropolis–Hastings algorithm

Basics: the algorithm uses the objective (target) density

f

and a conditional density

q(y|x)

called the instrumental (or proposal) distribution

Page 92: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

The MH algorithm

Algorithm (Metropolis–Hastings)

Given x(t),

1. Generate Yt ∼ q(y|x(t)).

2. Take

X(t+1) = Yt with prob. ρ(x(t), Yt),
         x(t) with prob. 1 − ρ(x(t), Yt),

where

ρ(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)] , 1 }.
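
A minimal stand-alone R version of this algorithm (illustrative, not code from the tutorial), with a Beta(2.7, 6.3) target known up to its constant and an independent uniform proposal, for which q(x|y) = q(y|x) = 1:

# Metropolis-Hastings with target f (unnormalised) and independent U(0,1) proposal
set.seed(9)
f <- function(x) x^1.7 * (1 - x)^5.3        # unnormalised Beta(2.7, 6.3) density
niter <- 1e4
x <- numeric(niter)
x[1] <- runif(1)
for (t in 1:(niter - 1)) {
  y   <- runif(1)                           # 1. generate Y_t ~ q(y | x^(t))
  rho <- min(1, f(y) / f(x[t]))             # 2. acceptance probability (proposal terms cancel)
  x[t + 1] <- if (runif(1) < rho) y else x[t]
}
mean(x)                                     # compare with the Beta(2.7, 6.3) mean 2.7/9 = 0.3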

Page 93: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

Features

Independent of normalizing constants for both f and q(·|x) (i.e., those constants independent of x)

Never move to values with f(y) = 0

The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure

The sequence (yt)t is usually not a Markov chain

Page 94: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

Convergence properties

The M-H Markov chain is reversible, with invariant/stationary density f since it satisfies the detailed balance condition

f(y) K(y, x) = f(x) K(x, y)

If q(y|x) > 0 for every (x, y),

the chain is Harris recurrent and

lim_{T→∞} (1/T) ∑_{t=1}^T h(X(t)) = ∫ h(x) df(x)   a.e. f.

lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0

for every initial distribution µ

Page 96: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Random walk Metropolis–Hastings

Use of a local perturbation as proposal

Yt = X(t) + εt,

where εt ∼ g, independent of X(t). The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if we take g to be symmetric, g(x) = g(−x)

Page 97: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Algorithm (Random walk Metropolis)

Given x(t)

1 Generate Yt ∼ g(y− x(t))

2 Take

X(t+1) = Yt with prob. min{ 1, f(Yt)/f(x(t)) },
         x(t) otherwise.
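
A minimal random-walk Metropolis sketch (illustrative, with simulated data) targeting the posterior on (µ1, µ2) behind the figure on the next slide; taking the mixture weights and variances as known and the prior as flat are assumptions of this sketch only.

# random walk Metropolis for the two means of a .7 N(mu1,1) + .3 N(mu2,1) mixture
set.seed(10)
dat   <- c(rnorm(350, 0), rnorm(150, 2.5))               # simulated sample
lpost <- function(mu) sum(log(.7 * dnorm(dat, mu[1], 1) + .3 * dnorm(dat, mu[2], 1)))
niter <- 1e4
omega <- 0.1                                             # scale of the Gaussian perturbation
mu <- matrix(0, niter, 2)
mu[1, ] <- c(-1, 3)
for (t in 1:(niter - 1)) {
  prop   <- mu[t, ] + rnorm(2, sd = omega)               # Y_t = X^(t) + eps_t
  logrho <- lpost(prop) - lpost(mu[t, ])                 # log of f(Y_t)/f(x^(t))
  mu[t + 1, ] <- if (log(runif(1)) < logrho) prop else mu[t, ]
}
colMeans(mu[-(1:1000), ])                                # posterior means after burn-in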

Page 98: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

RW-MH on mixture posterior distribution


Random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1)

Page 99: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Acceptance rate

A high acceptance rate is not indication of efficiency since therandom walk may be moving “too slowly” on the target surface

Page 100: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Acceptance rate

A high acceptance rate is not indication of efficiency since therandom walk may be moving “too slowly” on the target surface

If x(t) and yt are “too close”, i.e. f(x(t)) ≃ f(yt), yt is accepted with probability

min( f(yt)/f(x(t)) , 1 ) ≃ 1 .

and acceptance rate high

Page 101: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Acceptance rate

A high acceptance rate is not indication of efficiency since therandom walk may be moving “too slowly” on the target surface

If the average acceptance rate is low, the proposed values f(yt) tend to be small wrt f(x(t)), i.e. the random walk [not the algorithm!] moves quickly on the target surface, often reaching its boundaries

Page 102: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%.

[Gelman, Gilks and Roberts, 1995]

Page 103: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Noisy AR(1)

Target distribution of x given x1, x2 and y is

exp −(1/(2τ²)) { (x − ϕx1)² + (x2 − ϕx)² + (τ²/σ²)(y − x²)² } .

For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.

Page 104: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Noisy AR(2)

Markov chain based on a random walk with scale ω = .1.

Page 105: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Noisy AR(3)

Markov chain based on a random walk with scale ω = .5.

Page 106: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

No free lunch!!

MCMC algorithm trained on-line usually invalid:

using the whole past of the “chain” implies that this is not a Markov chain any longer!

This means standard Markov chain (ergodic) theory does not apply

[Meyn & Tweedie, 1994]

Page 109: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Example (Poly t distribution)

T(3, θ, 1) sample (x1, . . . , xn) with flat prior π(θ) = 1. Fit a normal proposal from the empirical mean and empirical variance of the chain so far,

µt = (1/t) ∑_{i=1}^t θ(i)   and   σt² = (1/t) ∑_{i=1}^t (θ(i) − µt)² ,

Metropolis–Hastings algorithm with acceptance probability

∏_{j=2}^n [ ( ν + (xj − θ(t))² ) / ( ν + (xj − ξ)² ) ]^{−(ν+1)/2} × exp{−(µt − θ(t))²/2σt²} / exp{−(µt − ξ)²/2σt²} ,

where ξ ∼ N(µt, σt²).

Page 111: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Invalid scheme

the target distribution is no longer invariant

when the range of the initial values is too small, the θ(i)'s cannot converge to the target distribution and concentrate on too small a support.

long-range dependence on past values modifies the distribution of the sequence.

using past simulations to create a non-parametric approximation to the target distribution does not work either

Page 112: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

[Figure: chain traces and histograms for the adaptive scheme applied to a sample of 10 xj ∼ T3, with initial variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5.]

Page 113: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

[Figure: sample produced by 50,000 iterations of a nonparametric adaptive MCMC scheme and comparison of its distribution with the target distribution.]

Page 114: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Simply forget about it!

Warning: One should not constantly adapt the proposal on past performances

Either adaptation ceases after a period of burn-in... or the adaptive scheme must be theoretically assessed on its own right.

[Haario & Saksman, 1999; Andrieu & Robert, 2001]

Page 115: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]

Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Page 117: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]
Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Works for a random walk proposal that relies on the empirical variance of the sample, modulo a ridge-like stabilizing factor

[Haario, Saksman & Tamminen, 1999]

Page 118: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]
Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Tune the scale in each direction toward an optimal acceptance rate of 0.44.

[Roberts & Rosenthal, 2006]
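A sketch of what such a diminishing-adaptation scheme can look like for a componentwise random walk: each coordinate's log-scale is nudged toward a 0.44 acceptance rate with a step γ_t → 0 (the γ_t schedule and Gaussian increments are illustrative assumptions, not the exact Roberts & Rosenthal recipe):

```python
import numpy as np

def adaptive_scaling_mh(log_target, theta0, n_iter=50000, target_acc=0.44, seed=6):
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    d = theta.size
    log_scale = np.zeros(d)
    chain = np.empty((n_iter, d))
    for t in range(1, n_iter + 1):
        gamma_t = min(0.01, 1.0 / np.sqrt(t))                 # diminishing adaptation step
        for k in range(d):                                    # one coordinate at a time
            prop = theta.copy()
            prop[k] += np.exp(log_scale[k]) * rng.normal()
            acc = min(1.0, np.exp(log_target(prop) - log_target(theta)))
            if rng.uniform() < acc:
                theta = prop
            log_scale[k] += gamma_t * (acc - target_acc)      # push acceptance toward 0.44
        chain[t - 1] = theta
    return chain
```

Because γ_t → 0, consecutive kernels differ less and less, in line with the diminishing-adaptation condition above.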

Page 119: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]
Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Packages amcmc and grapham

Page 120: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Approximate Bayesian computation

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation
ABC basics
Alphabet soup
Calibration of ABC

Page 121: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Intractable likelihoods

There are cases when the likelihood function f(y|θ) is unavailable and when the completion step

f(y|θ) = ∫_Z f(y, z|θ) dz

is impossible or too costly because of the dimension of z
© MCMC cannot be implemented!

[Robert & Casella, 2004]

Page 122: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Stochastic volatility model: for t = 1, . . . , T,

y_t = exp(z_t) ε_t ,   z_t = a + b z_{t−1} + σ η_t ,

T very large makes it difficult to include z within the simulated parameters

[Figure: highest weight trajectories]
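A small simulation sketch of this model (parameter values purely illustrative), which makes concrete that a full MCMC treatment would have to carry the T latent z_t's along with (a, b, σ):

```python
import numpy as np

def simulate_sv(a, b, sigma, T, seed=1):
    """Simulate y_t = exp(z_t) * eps_t with z_t = a + b * z_{t-1} + sigma * eta_t."""
    rng = np.random.default_rng(seed)
    z = np.empty(T)
    z[0] = a / (1.0 - b)                 # start the log-volatility at its stationary mean (needs |b| < 1)
    for t in range(1, T):
        z[t] = a + b * z[t - 1] + sigma * rng.normal()
    y = np.exp(z) * rng.normal(size=T)   # observations given the latent volatility path
    return y, z

y, z = simulate_sv(a=-0.05, b=0.9, sigma=0.2, T=1000)
```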

Page 123: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Potts model: if y takes values on a grid Y of size k^n and

f(y|θ) ∝ exp{ θ ∑_{l∼i} I_{y_l = y_i} }

where l∼i denotes a neighbourhood relation, n moderately large prohibits the computation of the normalising constant
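For concreteness, a sketch of the sufficient statistic ∑_{l∼i} I(y_l = y_i) on a square grid with a 4-neighbour relation (an assumption; the slide does not fix the neighbourhood): evaluating the statistic is trivial, whereas the normalising constant would require summing exp(θ × statistic) over all k^n configurations.

```python
import numpy as np

def potts_statistic(y):
    # number of agreeing neighbour pairs on a 2-D grid (horizontal + vertical neighbours)
    return int(np.sum(y[1:, :] == y[:-1, :]) + np.sum(y[:, 1:] == y[:, :-1]))

rng = np.random.default_rng(2)
y = rng.integers(0, 3, size=(16, 16))                 # k = 3 colours on a 16 x 16 grid, illustrative only
unnormalised_log_density = 0.5 * potts_statistic(y)   # theta = 0.5, again illustrative
```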

Page 124: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Inference on CMB: in cosmology, study of the Cosmic Microwave Background via likelihoods immensely slow to compute (e.g. WMAP, Planck), because of numerically costly spectral transforms [Data is a Fortran program]

[Kilbinger et al., 2010, MNRAS]

Page 125: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Coalescence tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters]

[Cornuet et al., 2009, Bioinformatics]

Page 126: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

The ABC method

Bayesian setting: target is π(θ)f(x|θ)

When likelihood f(x|θ) not in closed form, likelihood-free rejection technique:

ABC algorithm

For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating

θ′ ∼ π(θ) , z ∼ f(z|θ′) ,

until the auxiliary variable z is equal to the observed value, z = y.

[Tavare et al., 1997]

Page 129: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Why does it work?!

The proof is trivial:

f(θ_i) ∝ ∑_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
       ∝ π(θ_i) f(y|θ_i) = π(θ_i|y) .

[Accept–Reject 101]

Page 130: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Earlier occurrence

‘Bayesian statistics and Monte Carlo methods are ideallysuited to the task of passing many models over onedataset’

[Don Rubin, Annals of Statistics, 1984]

Note that Rubin (1984) does not promote this algorithm for likelihood-free simulation but as frequentist intuition on posterior distributions: parameters from posteriors are more likely to be those that could have generated the data.

Page 131: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

A as approximative

When y is a continuous random variable, equality z = y is replaced with a tolerance condition,

ρ(y, z) ≤ ε

where ρ is a distance

Output distributed from

π(θ)Pθ{ρ(y, z) < ε} ∝ π(θ|ρ(y, z) < ε)

Page 133: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

ABC algorithm

Algorithm 1 Likelihood-free rejection sampler

for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θ_i = θ′
end for

where η(y) defines a (maybe insufficient) statistic
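A generic Python sketch of Algorithm 1 (the function names, and the toy normal-mean example below it, are illustrative rather than taken from the tutorial):

```python
import numpy as np

def abc_rejection(y_obs, prior_sample, simulate, summary, distance, eps, N, seed=3):
    """Keep theta' ~ pi(.) whenever the summary of z ~ f(.|theta') falls within eps of eta(y)."""
    rng = np.random.default_rng(seed)
    eta_obs = summary(y_obs)
    accepted = []
    while len(accepted) < N:
        theta = prior_sample(rng)                     # theta' ~ pi(.)
        z = simulate(theta, rng)                      # z ~ f(.|theta')
        if distance(summary(z), eta_obs) <= eps:      # rho{eta(z), eta(y)} <= eps
            accepted.append(theta)
    return np.array(accepted)

# Toy usage: normal mean with a N(0, 10^2) prior and the sample mean as summary statistic
y_obs = np.random.default_rng(0).normal(1.0, 1.0, size=50)
post = abc_rejection(y_obs,
                     prior_sample=lambda rng: rng.normal(0.0, 10.0),
                     simulate=lambda th, rng: rng.normal(th, 1.0, size=50),
                     summary=np.mean,
                     distance=lambda a, b: abs(a - b),
                     eps=0.05, N=500)
```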

Page 134: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Output

The likelihood-free algorithm samples from the marginal in z of:

π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,

where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) .

Page 136: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Pima Indian benchmark

Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black).

Page 137: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

MA example

Consider the MA(q) model

x_t = ε_t + ∑_{i=1}^{q} ϑ_i ε_{t−i}

Simple prior: uniform prior over the identifiability zone, e.g. the triangle for MA(2)

Page 138: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

MA example (2)

ABC algorithm thus made of

1 picking a new value (ϑ1, ϑ2) in the triangle
2 generating an iid sequence (ε_t)_{−q<t≤T}
3 producing a simulated series (x′_t)_{1≤t≤T}

Distance: basic distance between the series

ρ((x′_t)_{1≤t≤T}, (x_t)_{1≤t≤T}) = ∑_{t=1}^{T} (x_t − x′_t)²

or between summary statistics like the first q autocorrelations

τ_j = ∑_{t=j+1}^{T} x_t x_{t−j}
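A sketch of both options for an MA(2) model (helper names are illustrative): the raw distance between series and the distance between the autocorrelation-type summaries τ_j.

```python
import numpy as np

def simulate_ma2(theta1, theta2, T, rng):
    # x_t = eps_t + theta1 * eps_{t-1} + theta2 * eps_{t-2}
    eps = rng.normal(size=T + 2)
    return eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]

def raw_distance(x_sim, x_obs):
    # squared distance between the series themselves
    return float(np.sum((x_obs - x_sim) ** 2))

def acf_distance(x_sim, x_obs, q=2):
    # squared distance between the first q summaries tau_j = sum_{t>j} x_t x_{t-j}
    tau = lambda x, j: np.sum(x[j:] * x[:-j])
    return float(sum((tau(x_obs, j) - tau(x_sim, j)) ** 2 for j in range(1, q + 1)))
```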

Page 140: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Comparison of distance impact

Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model

Page 141: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Comparison of distance impact

[Figure: density estimates of θ1 (left) and θ2 (right)]

Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model

Page 143: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

ABC advances

Simulating from the prior is often poor in efficiency

Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...

[Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007]

...or view the problem as conditional density estimation and develop techniques to allow for a larger ε

[Beaumont et al., 2002]

...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]

Page 147: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-NP

Better usage of [prior] simulations by adjustment: instead of throwing away θ′ such that ρ(η(z), η(y)) > ε, replace θ's with locally regressed

θ* = θ − {η(z) − η(y)}^T β   [Csillery et al., TEE, 2010]

where β is obtained by [NP] weighted least squares regression on (η(z) − η(y)) with weights

K_δ{ρ(η(z), η(y))}

[Beaumont et al., 2002, Genetics]
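A minimal sketch of this local regression adjustment (the Epanechnikov-type kernel and the plain weighted least-squares solve are illustrative choices):

```python
import numpy as np

def abc_np_adjust(theta, eta_z, eta_y, delta):
    """theta: (N,) accepted draws; eta_z: (N, d) simulated summaries; eta_y: (d,) observed summary.
    Returns theta* = theta - (eta(z) - eta(y))^T beta, with beta from weighted least squares."""
    diff = eta_z - eta_y                                   # regressors eta(z) - eta(y)
    rho = np.sqrt(np.sum(diff ** 2, axis=1))
    w = np.clip(1.0 - (rho / delta) ** 2, 0.0, None)       # K_delta{rho(eta(z), eta(y))}, Epanechnikov-like
    X = np.column_stack([np.ones(len(theta)), diff])       # intercept + regressors
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta)   # weighted least squares fit
    return theta - diff @ beta[1:], w                      # adjusted draws and their weights
```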

Page 148: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-MCMC

Markov chain (θ^(t)) created via the transition function

θ^(t+1) = θ′ ∼ K_ω(θ′|θ^(t))  if x ∼ f(x|θ′) is such that x = y
                               and u ∼ U(0, 1) ≤ [π(θ′) K_ω(θ^(t)|θ′)] / [π(θ^(t)) K_ω(θ′|θ^(t))] ,
          θ^(t)                otherwise,

has the posterior π(θ|y) as stationary distribution
[Marjoram et al, 2003]

Page 150: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-MCMC (2)

Algorithm 2 Likelihood-free MCMC sampler

Use Algorithm 1 to get (θ^(0), z^(0))
for t = 1 to N do
  Generate θ′ from K_ω(·|θ^(t−1)),
  Generate z′ from the likelihood f(·|θ′),
  Generate u from U[0,1],
  if u ≤ [π(θ′) K_ω(θ^(t−1)|θ′)] / [π(θ^(t−1)) K_ω(θ′|θ^(t−1))] I_{A_{ε,y}}(z′) then
    set (θ^(t), z^(t)) = (θ′, z′)
  else
    (θ^(t), z^(t)) = (θ^(t−1), z^(t−1)),
  end if
end for
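A sketch of Algorithm 2 for a scalar parameter, assuming a symmetric Gaussian random-walk kernel K_ω so that the kernel ratio cancels (names and defaults are illustrative; theta0 would typically come from one run of the rejection sampler above):

```python
import numpy as np

def abc_mcmc(theta0, y_obs, log_prior, simulate, summary, distance, eps,
             proposal_sd=0.5, n_iter=10000, seed=4):
    rng = np.random.default_rng(seed)
    eta_y = summary(y_obs)
    theta = theta0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        theta_prop = theta + proposal_sd * rng.normal()      # theta' ~ K_omega(.|theta^(t-1))
        z_prop = simulate(theta_prop, rng)                   # z' ~ f(.|theta')
        log_alpha = log_prior(theta_prop) - log_prior(theta) # symmetric kernel: K_omega ratio cancels
        if distance(summary(z_prop), eta_y) <= eps and np.log(rng.uniform()) <= log_alpha:
            theta = theta_prop                               # accept only when z' lands in the ABC ball
        chain[t] = theta
    return chain
```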

Page 151: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

Why does it work?

Acceptance probability that does not involve the calculation of the likelihood and

π_ε(θ′, z′|y) / π_ε(θ^(t−1), z^(t−1)|y) × [K_ω(θ^(t−1)|θ′) f(z^(t−1)|θ^(t−1))] / [K_ω(θ′|θ^(t−1)) f(z′|θ′)]

= [π(θ′) f(z′|θ′) I_{A_{ε,y}}(z′)] / [π(θ^(t−1)) f(z^(t−1)|θ^(t−1)) I_{A_{ε,y}}(z^(t−1))] × [K_ω(θ^(t−1)|θ′) f(z^(t−1)|θ^(t−1))] / [K_ω(θ′|θ^(t−1)) f(z′|θ′)]

= [π(θ′) K_ω(θ^(t−1)|θ′)] / [π(θ^(t−1)) K_ω(θ′|θ^(t−1))] × I_{A_{ε,y}}(z′) .

Page 152: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABCµ

[Ratmann et al., 2009]

Use of a joint density

f(θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε)

where y is the data, and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and x when z ∼ f(z|θ)

Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation.

Page 154: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

A PMC version

Use of the same kernel idea as ABC-PRC but with IS correction
[Beaumont et al., 2009; Toni et al., 2009]

Generate a sample at iteration t by

π_t(θ^(t)) ∝ ∑_{j=1}^{N} ω_j^(t−1) K_t(θ^(t) | θ_j^(t−1))

modulo acceptance of the associated x_t, and use an importance weight associated with an accepted simulation θ_i^(t)

ω_i^(t) ∝ π(θ_i^(t)) / π_t(θ_i^(t)) .

© Still likelihood-free
[Beaumont et al., 2009]
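A sketch of one such ABC-PMC iteration for a scalar parameter, with a Gaussian kernel K_t of scale τ (the kernel choice and a fixed τ are illustrative assumptions; w_prev is assumed normalised):

```python
import numpy as np

def abc_pmc_step(theta_prev, w_prev, prior_logpdf, simulate, dist_to_obs, eps_t, tau, seed=5):
    rng = np.random.default_rng(seed)
    N = len(theta_prev)
    theta_new = np.empty(N)
    for i in range(N):
        while True:
            j = rng.choice(N, p=w_prev)                    # pick a particle from the previous population
            theta = theta_prev[j] + tau * rng.normal()     # move it with K_t
            if dist_to_obs(simulate(theta, rng)) <= eps_t: # keep it only if its simulation is accepted
                theta_new[i] = theta
                break
    # importance weight: prior over the mixture proposal sum_j w_j K_t(theta | theta_j)
    kernel = np.exp(-0.5 * ((theta_new[:, None] - theta_prev[None, :]) / tau) ** 2) / (tau * np.sqrt(2 * np.pi))
    w_new = np.array([np.exp(prior_logpdf(t)) for t in theta_new]) / (kernel @ w_prev)
    return theta_new, w_new / w_new.sum()
```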

Page 155: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

Sequential Monte Carlo

SMC is a simulation technique to approximate a sequence of related probability distributions π_n, with π_0 "easy" and π_T the target.
Iterated IS as in PMC: particles moved from time n−1 to time n via kernel K_n and use of a sequence of extended targets π_n

π_n(z_{0:n}) = π_n(z_n) ∏_{j=0}^{n−1} L_j(z_{j+1}, z_j)

where the L_j's are backward Markov kernels [check that π_n(z_n) is a marginal]

[Del Moral, Doucet & Jasra, 2006]

Page 156: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-SMC

True derivation of an SMC-ABC algorithm
Use of a kernel K_n associated with target π_{ε_n} and derivation of the backward kernel

L_{n−1}(z, z′) = π_{ε_n}(z′) K_n(z′, z) / π_{ε_n}(z)

Update of the weights

w_{in} ∝ w_{i(n−1)} × [ ∑_{m=1}^{M} I_{A_{ε_n}}(x^m_{in}) ] / [ ∑_{m=1}^{M} I_{A_{ε_{n−1}}}(x^m_{i(n−1)}) ]

when x^m_{in} ∼ K(x_{i(n−1)}, ·)

Page 157: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

Properties of ABC-SMC

The ABC-SMC method properly uses a backward kernel L(z, z′) to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. The update of the importance weights reduces to the ratio of the proportions of surviving particles.
Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior]

Adaptivity in the ABC-SMC algorithm is only found in the on-line construction of the thresholds ε_t, decreased slowly enough to keep a large number of accepted transitions

[Del Moral, Doucet & Jasra, 2009]

Page 159: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Calibration of ABC

Which summary statistics?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test.

Does not take into account the sequential nature of the tests

Depends on parameterisation

Order of inclusion matters.

Page 162: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Calibration of ABC

Point estimation vs....

In the case of the computation of E[h(θ)|y], Fearnhead and Prangle [12/14/2011] demonstrate that the optimal summary statistic is

η*(y) = E[h(θ)|y]

Unavailable but approximated by a prior ABC run and ABC-NP corrections
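A sketch of this construction in its simplest form: from a pilot ABC run, regress h(θ) on the raw summaries and use the fitted predictor as the new statistic η*(y) ≈ E[h(θ)|y] (the plain least-squares fit on the raw summaries is an illustrative choice):

```python
import numpy as np

def semi_auto_summary(theta_pilot, eta_pilot, h=lambda t: t):
    """theta_pilot: (N,) pilot draws; eta_pilot: (N, d) raw summaries of their simulated data."""
    X = np.column_stack([np.ones(len(theta_pilot)), eta_pilot])   # intercept + raw summaries
    coef, *_ = np.linalg.lstsq(X, h(theta_pilot), rcond=None)
    # the learned summary statistic eta*(.) to be used in a second, final ABC run
    return lambda eta: float(coef[0] + np.atleast_1d(eta) @ coef[1:])
```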

Page 164: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Calibration of ABC

...vs. model choice

In the case of the computation of a Bayes factor B12(y), the ABC approximation

∑_{t=1}^{T} I_{m(t)=1} / ∑_{t=1}^{T} I_{m(t)=2}

may fail to converge
[Robert et al., 2011]

Separation conditions on the summary statistics for convergence to occur

[Marin et al., 2011]
