WSC 2011, advanced tutorial on simulation in Statistics


Description

Slides of the tutorial given at WSC 2011, Phoenix, Arizona, Dec. 12 at 3:30

Transcript of WSC 2011, advanced tutorial on simulation in Statistics

Page 1: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)


Christian P. Robert

Université Paris-Dauphine, IuF, & CREST, http://www.ceremade.dauphine.fr/~xian

WSC 2011, Phoenix, December 12, 2011

Page 2: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Outline

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Page 3: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Motivation and leading example

1 Motivation and leading example
   Latent variables
   Inferential methods

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Page 4: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Latent variables

Latent structures make life harder!

Even simple statistical models may lead to computational complications, as in latent variable models

f(x|θ) = ∫ f*(x, x*|θ) dx*

If (x, x*) observed, fine! If only x observed, trouble!

[mixtures, HMMs, state-space models, &tc]

Page 7: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Latent variables

Mixture models

Models of mixtures of distributions:

X ∼ fj with probability pj,

for j = 1, 2, . . . , k, with overall density

X ∼ p1f1(x) + · · ·+ pkfk(x) .

For a sample of independent random variables (X1, · · · , Xn), sample density

∏_{i=1}^n {p1 f1(xi) + · · · + pk fk(xi)} .

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.

Page 10: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Latent variables

Mixture likelihood

[Figure: likelihood surface over (µ1, µ2) for the 0.3 N(µ1, 1) + 0.7 N(µ2, 1) mixture]

Page 11: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Maximum likelihood methods


For an iid sample X1, . . . , Xn from a population with density f(x|θ1, . . . , θk), the likelihood function is

L(x|θ) = L(x1, . . . , xn|θ1, . . . , θk) = ∏_{i=1}^n f(xi|θ1, . . . , θk).

◦ Maximum likelihood has global justifications from asymptotics

◦ Computational difficulty depends on structure, e.g. latent variables

Page 14: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Maximum likelihood methods (2)

Example (Mixtures)

For a mixture of two normal distributions,

pN(µ, τ2) + (1 − p)N(θ,σ2) ,

likelihood proportional to

∏_{i=1}^n [ p τ⁻¹ ϕ((xi − µ)/τ) + (1 − p) σ⁻¹ ϕ((xi − θ)/σ) ]

can be expanded into 2^n terms.
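
As an aside (not part of the original slides), the mixture likelihood itself remains cheap to evaluate pointwise in R; it is only the expansion into 2^n allocation terms that is intractable. Data and parameter values below are arbitrary, for illustration only.

# direct evaluation of the two-component normal mixture log-likelihood
set.seed(1)
x <- c(rnorm(70, mean = 0), rnorm(30, mean = 2.5))   # artificial sample of size n = 100
mixloglik <- function(p, mu, tau, theta, sigma, x)
  sum(log(p * dnorm(x, mu, tau) + (1 - p) * dnorm(x, theta, sigma)))
mixloglik(p = 0.7, mu = 0, tau = 1, theta = 2.5, sigma = 1, x = x)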

Page 15: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Maximum likelihood methods (3)

Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function.

Example

In the special case

f(x|µ, σ) = (1 − ε) exp{(−1/2)x²} + (ε/σ) exp{(−1/2σ²)(x − µ)²}

with ε > 0 known, whatever n, the likelihood is unbounded:

lim_{σ→0} L(x1, . . . , xn|µ = x1, σ) = ∞

Page 17: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

The Bayesian Perspective

In the Bayesian paradigm, the information brought by the data x, realization of

X ∼ f(x|θ),

is combined with prior information specified by a prior distribution with density

π(θ)

Page 19: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Central tool...

Summary in a probability distribution, π(θ|x), called the posterior distribution

Derived from the joint distribution f(x|θ)π(θ), according to

π(θ|x) = f(x|θ) π(θ) / ∫ f(x|θ) π(θ) dθ ,

[Bayes Theorem]

where

Z(x) = ∫ f(x|θ) π(θ) dθ

is the marginal density of X, also called the (Bayesian) evidence

Page 22: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Central tool...central to Bayesian inference

Posterior defined up to a constant as

π(θ|x) ∝ f(x|θ)π(θ)

Operates conditional upon the observations

Integrates simultaneously prior information and information brought by x

Avoids averaging over the unobserved values of x

Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected

Provides a complete inferential scope and a unique motor of inference

Page 27: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Examples of Bayes computational problems

1 complex parameter space, as e.g. constrained parameter sets like those resulting from imposing stationarity constraints in time series;

2 complex sampling model with an intractable likelihood, ase.g. in some graphical models;

3 use of a huge dataset;

4 complex prior distribution (which may be the posterior distribution associated with an earlier sample);

5 involved inferential procedures, as for instance Bayes factors

Bπ01(x) = [ P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x) ] / [ π(θ ∈ Θ0) / π(θ ∈ Θ1) ] .

Page 32: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again

Observations from

x1, . . . , xn ∼ f(x|θ) = pϕ(x;µ1,σ1) + (1 − p)ϕ(x;µ2,σ2)

Prior

µi|σi ∼ N(ξi, σi²/ni) ,   σi² ∼ IG(νi/2, si²/2) ,   p ∼ Be(α, β)

Posterior

π(θ|x1, . . . , xn) ∝ ∏_{j=1}^n { p ϕ(xj; µ1, σ1) + (1 − p) ϕ(xj; µ2, σ2) } π(θ)

= ∑_{ℓ=0}^n ∑_{(kt)} ω(kt) π(θ|(kt))

[O(2^n)]

Page 35: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again [2]

For a given permutation (kt), conditional posterior distribution

π(θ|(kt)) = N( ξ1(kt), σ1²/(n1 + ℓ) ) × IG( (ν1 + ℓ)/2, s1(kt)/2 )
× N( ξ2(kt), σ2²/(n2 + n − ℓ) ) × IG( (ν2 + n − ℓ)/2, s2(kt)/2 )
× Be( α + ℓ, β + n − ℓ )

Page 36: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again [3]

where

x̄1(kt) = (1/ℓ) ∑_{t=1}^ℓ x_{kt} ,   ŝ1(kt) = ∑_{t=1}^ℓ (x_{kt} − x̄1(kt))² ,
x̄2(kt) = (1/(n − ℓ)) ∑_{t=ℓ+1}^n x_{kt} ,   ŝ2(kt) = ∑_{t=ℓ+1}^n (x_{kt} − x̄2(kt))²

and

ξ1(kt) = ( n1 ξ1 + ℓ x̄1(kt) ) / ( n1 + ℓ ) ,   ξ2(kt) = ( n2 ξ2 + (n − ℓ) x̄2(kt) ) / ( n2 + n − ℓ ) ,

s1(kt) = s1² + ŝ1(kt) + ( n1 ℓ / (n1 + ℓ) ) (ξ1 − x̄1(kt))² ,

s2(kt) = s2² + ŝ2(kt) + ( n2 (n − ℓ) / (n2 + n − ℓ) ) (ξ2 − x̄2(kt))² ,

posterior updates of the hyperparameters

Page 37: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Motivation and leading example

Inferential methods

Mixtures again [4]

Bayes estimator of θ:

δπ(x1, . . . , xn) = ∑_{ℓ=0}^n ∑_{(kt)} ω(kt) Eπ[θ|x, (kt)]

Too costly: 2^n terms

Unfortunate, as the decomposition is meaningful for clustering purposes

Page 39: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

1 Motivation and leading example

2 Monte Carlo Integration
   Monte Carlo integration
   Importance Sampling
   Bayesian importance sampling

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Page 40: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Monte Carlo integration

Theme: Generic problem of evaluating the integral

I = Ef[h(X)] = ∫X h(x) f(x) dx

where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function

Page 41: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Monte Carlo integration (2)

Monte Carlo solution: first use a sample (X1, . . . , Xm) from the density f to approximate the integral I by the empirical average

h̄m = (1/m) ∑_{j=1}^m h(xj)

which converges, h̄m −→ Ef[h(X)],

by the Strong Law of Large Numbers

Page 43: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Monte Carlo precision

Estimate the variance with

vm = (1/(m − 1)) ∑_{j=1}^m [h(xj) − h̄m]² ,

and for m large,

( h̄m − Ef[h(X)] ) / √vm ∼ N(0, 1).

Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of Ef[h(X)].
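
A minimal R illustration of the empirical average and of such confidence bounds (not from the slides): the integrand h(x) = exp(−x²) under a standard normal f is an arbitrary choice made here only because the exact value is available for checking.

# plain Monte Carlo estimate of Ef[h(X)] with f = N(0,1) and h(x) = exp(-x^2)
set.seed(42)
m    <- 1e4
h    <- function(x) exp(-x^2)
xs   <- rnorm(m)                          # sample from f
hbar <- mean(h(xs))                       # empirical average
vhat <- var(h(xs)) / m                    # estimated variance of the empirical average
c(estimate = hbar,
  lower = hbar - 1.96 * sqrt(vhat),       # approximate 95% confidence bounds
  upper = hbar + 1.96 * sqrt(vhat),
  exact = 1 / sqrt(3))                    # closed-form value, for comparison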

Page 44: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Example (Cauchy prior/normal sample)

For estimating a normal mean, a robust prior is a Cauchy prior

X ∼ N (θ, 1), θ ∼ C(0, 1).

Under squared error loss, posterior mean

δπ(x) = ∫_{−∞}^{∞} [ θ/(1 + θ²) ] e^{−(x−θ)²/2} dθ / ∫_{−∞}^{∞} [ 1/(1 + θ²) ] e^{−(x−θ)²/2} dθ

Page 45: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration

Example (Cauchy prior/normal sample (2))

Form of δπ suggests simulating iid variables

θ1, · · · , θm ∼ N (x, 1)

and calculating

δπm(x) = ∑_{i=1}^m [ θi/(1 + θi²) ] / ∑_{i=1}^m [ 1/(1 + θi²) ] .

The Law of Large Numbers implies

δπm(x) −→ δπ(x) as m −→∞.
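
A minimal R version of this estimator (not part of the original code; the value x = 10 matches the figure on the next slide):

# Monte Carlo approximation of the posterior mean under a Cauchy prior
set.seed(3)
x     <- 10
m     <- 1e3
theta <- rnorm(m, mean = x, sd = 1)                        # theta_i ~ N(x, 1)
sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))        # approximates delta^pi(x)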

Page 46: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Monte Carlo integration


Range of estimators δπm for 100 runs and x = 10

Page 47: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Importance sampling

Paradox

Simulation from f (the true density) is not necessarily optimal

An alternative to direct sampling from f is importance sampling, based on the representation

Ef[h(X)] = ∫X [ h(x) f(x)/g(x) ] g(x) dx .

which allows us to use other distributions than f

Page 49: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Importance sampling algorithm

Evaluation of

Ef[h(X)] = ∫X h(x) f(x) dx

by

1 Generate a sample X1, . . . , Xm from a distribution g

2 Use the approximation

(1/m) ∑_{j=1}^m [ f(Xj)/g(Xj) ] h(Xj)
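
A hedged R illustration of these two steps on a classic toy problem (not from the slides): estimating the normal tail probability P(X > 4.5), with a translated exponential as instrumental distribution g; here h is the indicator of the tail event.

# importance sampling estimate of P(X > 4.5) for X ~ N(0,1)
set.seed(4)
m <- 1e4
y <- rexp(m) + 4.5                 # step 1: sample from g, an Exp(1) density shifted to 4.5
w <- dnorm(y) / dexp(y - 4.5)      # importance weights f(Y_j)/g(Y_j)
mean(w)                            # step 2: h(Y_j) = 1 for every draw, so the estimate is mean(w)
pnorm(4.5, lower.tail = FALSE)     # exact value, for comparison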

Page 50: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Implementation details

◦ Instrumental distribution g chosen from distributions easy to simulate

◦ The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f

◦ Dependent proposals can be used, as seen later

Page 51: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Finite vs. infinite variance

Although g can be any density, some choices are better than others:

◦ Finite variance only when

Ef[ h²(X) f(X)/g(X) ] = ∫X h²(x) f²(x)/g(x) dx < ∞ .

◦ Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) not appropriate.

◦ If sup f/g = ∞, the weights f(xj)/g(xj) vary widely, giving too much importance to a few values xj.

◦ If sup f/g = M < ∞, finite variance for L2 functions

Page 54: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Selfnormalised importance sampling

For ratio estimator

δnh = ∑_{i=1}^n ωi h(xi) / ∑_{i=1}^n ωi

with Xi ∼ g(y) and Wi such that

E[Wi|Xi = x] = κ f(x)/g(x)

Page 55: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Importance Sampling

Selfnormalised variance

then

var(δnh) ≈ (1/(n²κ²)) ( var(Snh) − 2 Eπ[h] cov(Snh, Sn1) + Eπ[h]² var(Sn1) ) .

for

Snh = ∑_{i=1}^n Wi h(Xi) ,   Sn1 = ∑_{i=1}^n Wi

Rough approximation

var δnh ≈ (1/n) varπ(h(X)) {1 + varg(W)}
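
A hedged R sketch of the self-normalised estimator and of the weight variability driving the rough approximation above; the target (a shifted t3 density, treated as known only up to a constant) and the normal instrumental density are illustrative choices, not taken from the slides.

# self-normalised importance sampling for E_pi[h(theta)]
set.seed(5)
n      <- 1e4
pitild <- function(th) dt(th - 1, df = 3)          # target, known only up to a constant kappa
g      <- function(th) dnorm(th, mean = 1, sd = 2) # instrumental density
theta  <- rnorm(n, mean = 1, sd = 2)               # theta_i ~ g
w      <- pitild(theta) / g(theta)                 # weights W_i, defined up to kappa
h      <- function(th) th^2
sum(w * h(theta)) / sum(w)                         # ratio estimator delta_h^n
sum(w)^2 / sum(w^2)                                # effective sample size, driven by var_g(W)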

Page 56: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Bayes factor approximation

When approximating the Bayes factor

B01 = ∫_{Θ0} f0(x|θ0) π0(θ0) dθ0 / ∫_{Θ1} f1(x|θ1) π1(θ1) dθ1

use of importance functions ϖ0 and ϖ1 and

B01 = [ n0⁻¹ ∑_{i=1}^{n0} f0(x|θ^i_0) π0(θ^i_0) / ϖ0(θ^i_0) ] / [ n1⁻¹ ∑_{i=1}^{n1} f1(x|θ^i_1) π1(θ^i_1) / ϖ1(θ^i_1) ]

Page 57: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women

Example (R benchmark)

“A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.”

200 Pima Indian women with observed variables

plasma glucose concentration in oral glucose tolerance test

diastolic blood pressure

diabetes pedigree function

presence/absence of diabetes

Page 58: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Probit modelling on Pima Indian women

Probability of diabetes as a function of the above variables

P(y = 1|x) = Φ(x1β1 + x2β2 + x3β3) ,

Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a g-prior modelling:

β ∼ N3( 0, n (X^T X)⁻¹ )

Page 60: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Importance sampling for the Pima Indian dataset

Use of the importance function inspired from the MLE estimate distribution

β ∼ N(β̂, Σ̂)

R importance sampling code
# probitlpost (log-posterior) and dmvlnorm (log multivariate normal density) are helper
# functions defined elsewhere in the tutorial material; rmvnorm comes from the mvtnorm
# package; X1 and X2 are the design matrices of the two competing probit models
model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
model2=summary(glm(y~-1+X2,family=binomial(link="probit")))
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
  sigma=2*model1$cov.unscaled)))/mean(exp(probitlpost(is2,y,X2)-
  dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))

Page 62: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women

Comparison of the variation of the Bayes factor approximations based on 100 replicas for 20,000 simulations from the prior and the above MLE importance sampler


Page 63: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Bridge sampling

Special case: if

π̃1(θ1|x) ∝ π1(θ1|x) ,   π̃2(θ2|x) ∝ π2(θ2|x)

live on the same space (Θ1 = Θ2), then

B12 ≈ (1/n) ∑_{i=1}^n π̃1(θi|x)/π̃2(θi|x) ,   θi ∼ π2(θ|x)

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
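
A hedged R sketch of this special-case estimator on a toy problem where the answer is known (two unnormalised densities whose normalising constants are 3 and 1, so that B12 = 3); all choices below are illustrative and not taken from the slides.

# bridge sampling, special case: B12 ~ (1/n) sum pi1.(theta_i)/pi2.(theta_i), theta_i ~ pi2
set.seed(6)
n     <- 1e4
pi1.  <- function(th) 3 * dnorm(th, mean = 0.5, sd = 1)   # unnormalised pi1, Z1 = 3
pi2.  <- function(th) dnorm(th, mean = 0, sd = 1)         # unnormalised pi2, Z2 = 1
theta <- rnorm(n)                                         # draws from pi2(.|x) = N(0,1)
mean(pi1.(theta) / pi2.(theta))                           # should be close to B12 = Z1/Z2 = 3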

Page 64: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

(Further) bridge sampling

General identity:

B12 = ∫ π̃1(θ|x) α(θ) π2(θ|x) dθ / ∫ π̃2(θ|x) α(θ) π1(θ|x) dθ   ∀ α(·)

≈ [ (1/n2) ∑_{i=1}^{n2} π̃1(θ2i|x) α(θ2i) ] / [ (1/n1) ∑_{i=1}^{n1} π̃2(θ1i|x) α(θ1i) ] ,   θji ∼ πj(θ|x)

Page 65: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Optimal bridge sampling

The optimal choice of auxiliary function is

α* = ( n1 + n2 ) / ( n1 π̃1(θ|x) + n2 π̃2(θ|x) )

leading to

B12 ≈ [ (1/n2) ∑_{i=1}^{n2} π̃1(θ2i|x) / ( n1 π̃1(θ2i|x) + n2 π̃2(θ2i|x) ) ] / [ (1/n1) ∑_{i=1}^{n1} π̃2(θ1i|x) / ( n1 π̃1(θ1i|x) + n2 π̃2(θ1i|x) ) ]

Page 66: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Illustration for the Pima Indian dataset

Use of the MLE induced conditional of β3 given (β1, β2) as a pseudo-posterior and mixture of both MLE approximations on β3 in bridge sampling estimate

R bridge sampling code
# hmprobit (returning posterior simulations for the probit model), meanw and probitlpost
# are helper functions defined elsewhere in the tutorial material; ginv is from the MASS
# package and dmvnorm from the mvtnorm package
cova=model2$cov.unscaled
expecta=model2$coeff[,1]
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
probit1p=cbind(probit1,pseudo)
bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
  sqrt(covw),log=T))/(dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
  cova[3,3])))/mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
  dnorm(pseudo,expecta[3],cova[3,3])))

Page 68: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler and the above importance sampler

Page 69: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

The original harmonic mean estimator

When θkt ∼ πk(θ|x),

(1/T) ∑_{t=1}^T 1/L(θkt|x)

is an unbiased estimator of 1/mk(x)

[Newton & Raftery, 1994]

Highly dangerous: Most often leads to an infinite variance!!!
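
A hedged R illustration of this danger on a toy conjugate model (normal likelihood, normal prior, chosen here only so that m(x) is available in closed form): even with exact posterior draws, the harmonic mean estimates spread widely around the true value, and in this configuration, where the prior (sd 10) is much flatter than the likelihood (sd 1), the estimator's variance is in fact infinite.

# harmonic mean estimator of m(x) in a normal-normal model with known marginal
# model: x | theta ~ N(theta, 1), theta ~ N(0, tau^2), one observation x = 2
set.seed(7)
x   <- 2
tau <- 10
m.exact <- dnorm(x, 0, sqrt(1 + tau^2))                 # closed-form marginal likelihood
hm.est  <- replicate(100, {
  theta <- rnorm(1e4, mean = x * tau^2 / (1 + tau^2),   # exact posterior draws
                 sd = sqrt(tau^2 / (1 + tau^2)))
  1 / mean(1 / dnorm(x, theta, 1))                      # harmonic mean estimate of m(x)
})
m.exact
summary(hm.est)                                         # note the spread over 100 replications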

Page 71: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

“The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that this estimator is consistent, ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.

The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.”

[Radford Neal’s blog, Aug. 23, 2008]

Page 73: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Approximating Zk from a posterior sample

Use of the [harmonic mean] identity

Eπk[ ϕ(θk) / ( πk(θk) Lk(θk) ) | x ] = ∫ [ ϕ(θk) / ( πk(θk) Lk(θk) ) ] [ πk(θk) Lk(θk) / Zk ] dθk = 1/Zk

no matter what the proposal ϕ(·) is.

[Gelfand & Dey, 1994; Bartolucci et al., 2006]

Direct exploitation of the MCMC output

Page 75: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Comparison with regular importance sampling

Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk) Lk(θk) for the approximation

Z1k = 1 / [ (1/T) ∑_{t=1}^T ϕ(θk^(t)) / ( πk(θk^(t)) Lk(θk^(t)) ) ]

to enjoy finite variance

Page 76: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Comparison with regular importance sampling (cont’d)

Compare Z1k with a standard importance sampling approximation

Z2k = (1/T) ∑_{t=1}^T πk(θk^(t)) Lk(θk^(t)) / ϕ(θk^(t))

where the θk^(t)'s are generated from the density ϕ(·) (with fatter tails, like t's)

Page 77: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

HPD indicator as ϕ

Use the convex hull of MCMC simulations corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:

ϕ(θ) = (10/T) ∑_{t∈HPD} I_{ d(θ, θ^(t)) ≤ ε }
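
A hedged R sketch of the identity above with a simpler choice of ϕ (a normal density matched to the posterior sample rather than the HPD indicator), on a normal–normal toy model where Zk is known in closed form; the HPD-indicator version would only change the definition of ϕ. None of the numerical choices below come from the slides.

# Gelfand-Dey estimator: Z_hat = 1 / mean( phi(theta) / (pi(theta) * L(theta)) )
# toy model: x | theta ~ N(theta, 1), theta ~ N(0, 10^2), one observation x = 2
set.seed(8)
x <- 2; tau <- 10
theta <- rnorm(1e4, x * tau^2 / (1 + tau^2), sqrt(tau^2 / (1 + tau^2)))  # posterior draws
phi   <- dnorm(theta, mean(theta), sd(theta))        # proposal density evaluated at the draws
piL   <- dnorm(theta, 0, tau) * dnorm(x, theta, 1)   # prior times likelihood at the draws
c(estimate = 1 / mean(phi / piL),
  exact    = dnorm(x, 0, sqrt(1 + tau^2)))           # true marginal likelihood, for comparison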

Page 78: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the above harmonic mean sampler and importance samplers


Page 79: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Chib’s representation

Direct application of Bayes' theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),

mk(x) = fk(x|θk) πk(θk) / πk(θk|x)

[Bayes Theorem]

Use of an approximation to the posterior

m̂k(x) = fk(x|θ*k) πk(θ*k) / π̂k(θ*k|x) .

Page 81: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell estimate

π̂k(θ*k|x) = (1/T) ∑_{t=1}^T πk(θ*k|x, zk^(t)) ,

where the zk^(t)'s are Gibbs sampled latent variables

Page 82: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Case of the probit model

For the completion by z,

π̂(θ|x) = (1/T) ∑_t π(θ|x, z^(t))

is a simple average of normal densities

R Chib's approximation code
# gibbsprobit (a Gibbs sampler for the probit model based on its latent-variable completion)
# is defined elsewhere in the tutorial material; dmvlnorm is a log multivariate normal density
gibbs1=gibbsprobit(Niter,y,X1)
gibbs2=gibbsprobit(Niter,y,X2)
bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
  sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
  mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
  sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))

Page 84: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Monte Carlo Integration

Bayesian importance sampling

Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the above Chib's and importance samplers

Page 85: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm
   Monte Carlo Methods based on Markov Chains
   The Metropolis–Hastings algorithm
   The random walk Metropolis-Hastings algorithm
   Adaptive MCMC

4 Approximate Bayesian computation

Page 86: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains

Epiphany! It is not necessary to use a sample from the distribution f to approximate the integral

I = ∫ h(x) f(x) dx ,

Principle: Obtain X1, . . . , Xn ∼ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f

[Metropolis et al., 1953]

Page 88: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains (2)

Idea

For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f

Ensures the convergence in distribution of (X(t)) to a random variable from f. For a “large enough” T0, X(T0) can be considered as distributed from f

Produces a dependent sample X(T0), X(T0+1), . . ., which is generated from f, sufficient for most approximation purposes.

Problem: How can one build a Markov chain with a given stationary distribution?

Page 91: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

The Metropolis–Hastings algorithm

Basics: the algorithm uses the objective (target) density

f

and a conditional density

q(y|x)

called the instrumental (or proposal) distribution

Page 92: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

The MH algorithm

Algorithm (Metropolis–Hastings)

Given x(t),

1. Generate Yt ∼ q(y|x(t)).

2. Take

X(t+1) = Yt with prob. ρ(x(t), Yt),
         x(t) with prob. 1 − ρ(x(t), Yt),

where

ρ(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)] , 1 }.
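
A minimal stand-alone R version of this algorithm (illustrative, not code from the tutorial), with a Beta(2.7, 6.3) target known up to its constant and an independent uniform proposal, for which q(x|y) = q(y|x) = 1:

# Metropolis-Hastings with target f (unnormalised) and independent U(0,1) proposal
set.seed(9)
f <- function(x) x^1.7 * (1 - x)^5.3        # unnormalised Beta(2.7, 6.3) density
niter <- 1e4
x <- numeric(niter)
x[1] <- runif(1)
for (t in 1:(niter - 1)) {
  y   <- runif(1)                           # 1. generate Y_t ~ q(y | x^(t))
  rho <- min(1, f(y) / f(x[t]))             # 2. acceptance probability (proposal terms cancel)
  x[t + 1] <- if (runif(1) < rho) y else x[t]
}
mean(x)                                     # compare with the Beta(2.7, 6.3) mean 2.7/9 = 0.3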

Page 93: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

Features

Independent of normalizing constants for both f and q(·|x) (i.e., those constants independent of x)

Never move to values with f(y) = 0

The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure

The sequence (yt)t is usually not a Markov chain

Page 94: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The Metropolis–Hastings algorithm

Convergence properties

The M-H Markov chain is reversible, with invariant/stationary density f since it satisfies the detailed balance condition

f(y) K(y, x) = f(x) K(x, y)

If q(y|x) > 0 for every (x, y),

the chain is Harris recurrent and

lim_{T→∞} (1/T) ∑_{t=1}^T h(X(t)) = ∫ h(x) df(x)   a.e. f.

lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0

for every initial distribution µ

Page 96: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Random walk Metropolis–Hastings

Use of a local perturbation as proposal

Yt = X(t) + εt,

where εt ∼ g, independent of X(t). The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if we take g to be symmetric, g(x) = g(−x)

Page 97: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Algorithm (Random walk Metropolis)

Given x(t)

1 Generate Yt ∼ g(y− x(t))

2 Take

X(t+1) = Yt with prob. min{ 1, f(Yt)/f(x(t)) },
         x(t) otherwise.
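
A minimal random-walk Metropolis sketch (illustrative, with simulated data) targeting the posterior on (µ1, µ2) behind the figure on the next slide; taking the mixture weights and variances as known and the prior as flat are assumptions of this sketch only.

# random walk Metropolis for the two means of a .7 N(mu1,1) + .3 N(mu2,1) mixture
set.seed(10)
dat   <- c(rnorm(350, 0), rnorm(150, 2.5))               # simulated sample
lpost <- function(mu) sum(log(.7 * dnorm(dat, mu[1], 1) + .3 * dnorm(dat, mu[2], 1)))
niter <- 1e4
omega <- 0.1                                             # scale of the Gaussian perturbation
mu <- matrix(0, niter, 2)
mu[1, ] <- c(-1, 3)
for (t in 1:(niter - 1)) {
  prop   <- mu[t, ] + rnorm(2, sd = omega)               # Y_t = X^(t) + eps_t
  logrho <- lpost(prop) - lpost(mu[t, ])                 # log of f(Y_t)/f(x^(t))
  mu[t + 1, ] <- if (log(runif(1)) < logrho) prop else mu[t, ]
}
colMeans(mu[-(1:1000), ])                                # posterior means after burn-in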

Page 98: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

RW-MH on mixture posterior distribution


Random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1)

Page 99: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Acceptance rate

A high acceptance rate is not indication of efficiency since therandom walk may be moving “too slowly” on the target surface

Page 100: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Acceptance rate

A high acceptance rate is not indication of efficiency since therandom walk may be moving “too slowly” on the target surface

If x(t) and yt are “too close”, i.e. f(x(t)) ≃ f(yt), yt is accepted with probability

min( f(yt)/f(x(t)) , 1 ) ≃ 1 .

and acceptance rate high

Page 101: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Acceptance rate

A high acceptance rate is not indication of efficiency since therandom walk may be moving “too slowly” on the target surface

If the average acceptance rate is low, the proposed values f(yt) tend to be small wrt f(x(t)), i.e. the random walk [not the algorithm!] moves quickly on the target surface, often reaching its boundaries

Page 102: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%.

[Gelman, Gilks and Roberts, 1995]

Page 103: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Noisy AR(1)

Target distribution of x given x1, x2 and y is

exp −(1/(2τ²)) { (x − ϕx1)² + (x2 − ϕx)² + (τ²/σ²)(y − x²)² } .

For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.

Page 104: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Noisy AR(2)

Markov chain based on a random walk with scale ω = .1.

Page 105: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

The random walk Metropolis-Hastings algorithm

Noisy AR(3)

Markov chain based on a random walk with scale ω = .5.

Page 106: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

No free lunch!!

MCMC algorithm trained on-line usually invalid:

using the whole past of the “chain” implies that this is not a Markov chain any longer!

This means standard Markov chain (ergodic) theory does not apply

[Meyn & Tweedie, 1994]

Page 109: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Example (Poly t distribution)

T(3, θ, 1) sample (x1, . . . , xn) with flat prior π(θ) = 1. Fit a normal proposal from the empirical mean and empirical variance of the chain so far,

µt = (1/t) ∑_{i=1}^t θ(i)   and   σt² = (1/t) ∑_{i=1}^t (θ(i) − µt)² ,

Metropolis–Hastings algorithm with acceptance probability

∏_{j=2}^n [ ( ν + (xj − θ(t))² ) / ( ν + (xj − ξ)² ) ]^{−(ν+1)/2} × exp{−(µt − θ(t))²/2σt²} / exp{−(µt − ξ)²/2σt²} ,

where ξ ∼ N(µt, σt²).

Page 111: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Invalid scheme

the target distribution is no longer invariant

when the range of the initial values is too small, the θ(i)'s cannot converge to the target distribution and concentrate on too small a support.

long-range dependence on past values modifies the distribution of the sequence.

using past simulations to create a non-parametric approximation to the target distribution does not work either

Page 112: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

[Figure: chain traces and histograms for the adaptive scheme applied to a sample of 10 xj ∼ T3, with initial variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5.]

Page 113: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

[Figure: sample produced by 50,000 iterations of a nonparametric adaptive MCMC scheme and comparison of its distribution with the target distribution.]

Page 114: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Simply forget about it!

Warning: One should not constantly adapt the proposal on past performances

Either adaptation ceases after a period of burn-in... or the adaptive scheme must be theoretically assessed on its own right.

[Haario & Saksman, 1999; Andrieu & Robert, 2001]

Page 115: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]

Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Page 117: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]
Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Works for a random walk proposal that relies on the empirical variance of the sample, modulo a ridge-like stabilizing factor

[Haario, Saksman & Tamminen, 1999]

Page 118: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]
Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Tune the scale in each direction toward an optimal acceptance rate of 0.44.

[Roberts & Rosenthal, 2006]
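A sketch of what such a diminishing-adaptation scheme can look like for a componentwise random walk: each coordinate's log-scale is nudged toward a 0.44 acceptance rate with a step γ_t → 0 (the γ_t schedule and Gaussian increments are illustrative assumptions, not the exact Roberts & Rosenthal recipe):

```python
import numpy as np

def adaptive_scaling_mh(log_target, theta0, n_iter=50000, target_acc=0.44, seed=6):
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    d = theta.size
    log_scale = np.zeros(d)
    chain = np.empty((n_iter, d))
    for t in range(1, n_iter + 1):
        gamma_t = min(0.01, 1.0 / np.sqrt(t))                 # diminishing adaptation step
        for k in range(d):                                    # one coordinate at a time
            prop = theta.copy()
            prop[k] += np.exp(log_scale[k]) * rng.normal()
            acc = min(1.0, np.exp(log_target(prop) - log_target(theta)))
            if rng.uniform() < acc:
                theta = prop
            log_scale[k] += gamma_t * (acc - target_acc)      # push acceptance toward 0.44
        chain[t - 1] = theta
    return chain
```

Because γ_t → 0, consecutive kernels differ less and less, in line with the diminishing-adaptation condition above.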

Page 119: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

The Metropolis-Hastings Algorithm

Adaptive MCMC

Diminishing adaptation

Adaptivity of the cyberparameter γ_t has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]
Sufficient conditions:

1 total variation distance between two consecutive kernels must uniformly decrease to zero

[diminishing adaptation]

lim_{t→∞} sup_x ‖K_{γ_t}(x, ·) − K_{γ_{t+1}}(x, ·)‖_TV = 0

2 time to stationarity remains bounded for any fixed γ_t [containment]

Packages amcmc and grapham

Page 120: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Approximate Bayesian computation

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation
ABC basics
Alphabet soup
Calibration of ABC

Page 121: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Intractable likelihoods

There are cases when the likelihood function f(y|θ) is unavailable and when the completion step

f(y|θ) = ∫_Z f(y, z|θ) dz

is impossible or too costly because of the dimension of z
© MCMC cannot be implemented!

[Robert & Casella, 2004]

Page 122: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Stochastic volatility model: for t = 1, . . . , T,

y_t = exp(z_t) ε_t ,   z_t = a + b z_{t−1} + σ η_t ,

T very large makes it difficult to include z within the simulated parameters

[Figure: highest weight trajectories]
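A small simulation sketch of this model (parameter values purely illustrative), which makes concrete that a full MCMC treatment would have to carry the T latent z_t's along with (a, b, σ):

```python
import numpy as np

def simulate_sv(a, b, sigma, T, seed=1):
    """Simulate y_t = exp(z_t) * eps_t with z_t = a + b * z_{t-1} + sigma * eta_t."""
    rng = np.random.default_rng(seed)
    z = np.empty(T)
    z[0] = a / (1.0 - b)                 # start the log-volatility at its stationary mean (needs |b| < 1)
    for t in range(1, T):
        z[t] = a + b * z[t - 1] + sigma * rng.normal()
    y = np.exp(z) * rng.normal(size=T)   # observations given the latent volatility path
    return y, z

y, z = simulate_sv(a=-0.05, b=0.9, sigma=0.2, T=1000)
```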

Page 123: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Potts model: if y takes values on a grid Y of size k^n and

f(y|θ) ∝ exp{ θ ∑_{l∼i} I_{y_l = y_i} }

where l∼i denotes a neighbourhood relation, n moderately large prohibits the computation of the normalising constant
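For concreteness, a sketch of the sufficient statistic ∑_{l∼i} I(y_l = y_i) on a square grid with a 4-neighbour relation (an assumption; the slide does not fix the neighbourhood): evaluating the statistic is trivial, whereas the normalising constant would require summing exp(θ × statistic) over all k^n configurations.

```python
import numpy as np

def potts_statistic(y):
    # number of agreeing neighbour pairs on a 2-D grid (horizontal + vertical neighbours)
    return int(np.sum(y[1:, :] == y[:-1, :]) + np.sum(y[:, 1:] == y[:, :-1]))

rng = np.random.default_rng(2)
y = rng.integers(0, 3, size=(16, 16))                 # k = 3 colours on a 16 x 16 grid, illustrative only
unnormalised_log_density = 0.5 * potts_statistic(y)   # theta = 0.5, again illustrative
```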

Page 124: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Inference on CMB: in cosmology, study of the Cosmic Microwave Background via likelihoods immensely slow to compute (e.g. WMAP, Planck), because of numerically costly spectral transforms [Data is a Fortran program]

[Kilbinger et al., 2010, MNRAS]

Page 125: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Illustrations

Example

Coalescence tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters]

[Cornuet et al., 2009, Bioinformatics]

Page 126: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

The ABC method

Bayesian setting: target is π(θ)f(x|θ)

When likelihood f(x|θ) not in closed form, likelihood-free rejection technique:

ABC algorithm

For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating

θ′ ∼ π(θ) , z ∼ f(z|θ′) ,

until the auxiliary variable z is equal to the observed value, z = y.

[Tavare et al., 1997]

Page 129: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Why does it work?!

The proof is trivial:

f(θ_i) ∝ ∑_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
       ∝ π(θ_i) f(y|θ_i) = π(θ_i|y) .

[Accept–Reject 101]

Page 130: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Earlier occurrence

‘Bayesian statistics and Monte Carlo methods are ideallysuited to the task of passing many models over onedataset’

[Don Rubin, Annals of Statistics, 1984]

Note that Rubin (1984) does not promote this algorithm for likelihood-free simulation but as frequentist intuition on posterior distributions: parameters from posteriors are more likely to be those that could have generated the data.

Page 131: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

A as approximative

When y is a continuous random variable, equality z = y is replaced with a tolerance condition,

ρ(y, z) ≤ ε

where ρ is a distance

Output distributed from

π(θ)Pθ{ρ(y, z) < ε} ∝ π(θ|ρ(y, z) < ε)

Page 133: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

ABC algorithm

Algorithm 1 Likelihood-free rejection sampler

for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θ_i = θ′
end for

where η(y) defines a (maybe insufficient) statistic
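A generic Python sketch of Algorithm 1 (the function names, and the toy normal-mean example below it, are illustrative rather than taken from the tutorial):

```python
import numpy as np

def abc_rejection(y_obs, prior_sample, simulate, summary, distance, eps, N, seed=3):
    """Keep theta' ~ pi(.) whenever the summary of z ~ f(.|theta') falls within eps of eta(y)."""
    rng = np.random.default_rng(seed)
    eta_obs = summary(y_obs)
    accepted = []
    while len(accepted) < N:
        theta = prior_sample(rng)                     # theta' ~ pi(.)
        z = simulate(theta, rng)                      # z ~ f(.|theta')
        if distance(summary(z), eta_obs) <= eps:      # rho{eta(z), eta(y)} <= eps
            accepted.append(theta)
    return np.array(accepted)

# Toy usage: normal mean with a N(0, 10^2) prior and the sample mean as summary statistic
y_obs = np.random.default_rng(0).normal(1.0, 1.0, size=50)
post = abc_rejection(y_obs,
                     prior_sample=lambda rng: rng.normal(0.0, 10.0),
                     simulate=lambda th, rng: rng.normal(th, 1.0, size=50),
                     summary=np.mean,
                     distance=lambda a, b: abs(a - b),
                     eps=0.05, N=500)
```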

Page 134: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Output

The likelihood-free algorithm samples from the marginal in z of:

π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,

where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) .

Page 136: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Pima Indian benchmark

Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black).

Page 137: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

MA example

Consider the MA(q) model

x_t = ε_t + ∑_{i=1}^{q} ϑ_i ε_{t−i}

Simple prior: uniform prior over the identifiability zone, e.g. the triangle for MA(2)

Page 138: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

MA example (2)

ABC algorithm thus made of

1 picking a new value (ϑ1, ϑ2) in the triangle
2 generating an iid sequence (ε_t)_{−q<t≤T}
3 producing a simulated series (x′_t)_{1≤t≤T}

Distance: basic distance between the series

ρ((x′_t)_{1≤t≤T}, (x_t)_{1≤t≤T}) = ∑_{t=1}^{T} (x_t − x′_t)²

or between summary statistics like the first q autocorrelations

τ_j = ∑_{t=j+1}^{T} x_t x_{t−j}
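A sketch of both options for an MA(2) model (helper names are illustrative): the raw distance between series and the distance between the autocorrelation-type summaries τ_j.

```python
import numpy as np

def simulate_ma2(theta1, theta2, T, rng):
    # x_t = eps_t + theta1 * eps_{t-1} + theta2 * eps_{t-2}
    eps = rng.normal(size=T + 2)
    return eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]

def raw_distance(x_sim, x_obs):
    # squared distance between the series themselves
    return float(np.sum((x_obs - x_sim) ** 2))

def acf_distance(x_sim, x_obs, q=2):
    # squared distance between the first q summaries tau_j = sum_{t>j} x_t x_{t-j}
    tau = lambda x, j: np.sum(x[j:] * x[:-j])
    return float(sum((tau(x_obs, j) - tau(x_sim, j)) ** 2 for j in range(1, q + 1)))
```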

Page 140: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Comparison of distance impact

Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model

Page 141: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

Comparison of distance impact

[Figure: density estimates of θ1 (left) and θ2 (right)]

Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model

Page 143: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

ABC basics

ABC advances

Simulating from the prior is often poor in efficiency

Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...

[Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007]

...or view the problem as conditional density estimation and develop techniques to allow for a larger ε

[Beaumont et al., 2002]

...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]

Page 147: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-NP

Better usage of [prior] simulations by adjustment: instead of throwing away θ′ such that ρ(η(z), η(y)) > ε, replace θ's with locally regressed

θ* = θ − {η(z) − η(y)}^T β   [Csillery et al., TEE, 2010]

where β is obtained by [NP] weighted least squares regression on (η(z) − η(y)) with weights

K_δ{ρ(η(z), η(y))}

[Beaumont et al., 2002, Genetics]
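A minimal sketch of this local regression adjustment (the Epanechnikov-type kernel and the plain weighted least-squares solve are illustrative choices):

```python
import numpy as np

def abc_np_adjust(theta, eta_z, eta_y, delta):
    """theta: (N,) accepted draws; eta_z: (N, d) simulated summaries; eta_y: (d,) observed summary.
    Returns theta* = theta - (eta(z) - eta(y))^T beta, with beta from weighted least squares."""
    diff = eta_z - eta_y                                   # regressors eta(z) - eta(y)
    rho = np.sqrt(np.sum(diff ** 2, axis=1))
    w = np.clip(1.0 - (rho / delta) ** 2, 0.0, None)       # K_delta{rho(eta(z), eta(y))}, Epanechnikov-like
    X = np.column_stack([np.ones(len(theta)), diff])       # intercept + regressors
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta)   # weighted least squares fit
    return theta - diff @ beta[1:], w                      # adjusted draws and their weights
```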

Page 148: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-MCMC

Markov chain (θ^(t)) created via the transition function

θ^(t+1) = θ′ ∼ K_ω(θ′|θ^(t))  if x ∼ f(x|θ′) is such that x = y
                               and u ∼ U(0, 1) ≤ [π(θ′) K_ω(θ^(t)|θ′)] / [π(θ^(t)) K_ω(θ′|θ^(t))] ,
          θ^(t)                otherwise,

has the posterior π(θ|y) as stationary distribution
[Marjoram et al, 2003]

Page 150: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-MCMC (2)

Algorithm 2 Likelihood-free MCMC sampler

Use Algorithm 1 to get (θ^(0), z^(0))
for t = 1 to N do
  Generate θ′ from K_ω(·|θ^(t−1)),
  Generate z′ from the likelihood f(·|θ′),
  Generate u from U[0,1],
  if u ≤ [π(θ′) K_ω(θ^(t−1)|θ′)] / [π(θ^(t−1)) K_ω(θ′|θ^(t−1))] I_{A_{ε,y}}(z′) then
    set (θ^(t), z^(t)) = (θ′, z′)
  else
    (θ^(t), z^(t)) = (θ^(t−1), z^(t−1)),
  end if
end for
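A sketch of Algorithm 2 for a scalar parameter, assuming a symmetric Gaussian random-walk kernel K_ω so that the kernel ratio cancels (names and defaults are illustrative; theta0 would typically come from one run of the rejection sampler above):

```python
import numpy as np

def abc_mcmc(theta0, y_obs, log_prior, simulate, summary, distance, eps,
             proposal_sd=0.5, n_iter=10000, seed=4):
    rng = np.random.default_rng(seed)
    eta_y = summary(y_obs)
    theta = theta0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        theta_prop = theta + proposal_sd * rng.normal()      # theta' ~ K_omega(.|theta^(t-1))
        z_prop = simulate(theta_prop, rng)                   # z' ~ f(.|theta')
        log_alpha = log_prior(theta_prop) - log_prior(theta) # symmetric kernel: K_omega ratio cancels
        if distance(summary(z_prop), eta_y) <= eps and np.log(rng.uniform()) <= log_alpha:
            theta = theta_prop                               # accept only when z' lands in the ABC ball
        chain[t] = theta
    return chain
```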

Page 151: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

Why does it work?

Acceptance probability that does not involve the calculation of the likelihood and

π_ε(θ′, z′|y) / π_ε(θ^(t−1), z^(t−1)|y) × [K_ω(θ^(t−1)|θ′) f(z^(t−1)|θ^(t−1))] / [K_ω(θ′|θ^(t−1)) f(z′|θ′)]

= [π(θ′) f(z′|θ′) I_{A_{ε,y}}(z′)] / [π(θ^(t−1)) f(z^(t−1)|θ^(t−1)) I_{A_{ε,y}}(z^(t−1))] × [K_ω(θ^(t−1)|θ′) f(z^(t−1)|θ^(t−1))] / [K_ω(θ′|θ^(t−1)) f(z′|θ′)]

= [π(θ′) K_ω(θ^(t−1)|θ′)] / [π(θ^(t−1)) K_ω(θ′|θ^(t−1))] × I_{A_{ε,y}}(z′) .

Page 152: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABCµ

[Ratmann et al., 2009]

Use of a joint density

f(θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε)

where y is the data, and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and x when z ∼ f(z|θ)

Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation.

Page 154: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

A PMC version

Use of the same kernel idea as ABC-PRC but with IS correction
[Beaumont et al., 2009; Toni et al., 2009]

Generate a sample at iteration t by

π_t(θ^(t)) ∝ ∑_{j=1}^{N} ω_j^(t−1) K_t(θ^(t) | θ_j^(t−1))

modulo acceptance of the associated x_t, and use an importance weight associated with an accepted simulation θ_i^(t)

ω_i^(t) ∝ π(θ_i^(t)) / π_t(θ_i^(t)) .

© Still likelihood-free
[Beaumont et al., 2009]
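A sketch of one such ABC-PMC iteration for a scalar parameter, with a Gaussian kernel K_t of scale τ (the kernel choice and a fixed τ are illustrative assumptions; w_prev is assumed normalised):

```python
import numpy as np

def abc_pmc_step(theta_prev, w_prev, prior_logpdf, simulate, dist_to_obs, eps_t, tau, seed=5):
    rng = np.random.default_rng(seed)
    N = len(theta_prev)
    theta_new = np.empty(N)
    for i in range(N):
        while True:
            j = rng.choice(N, p=w_prev)                    # pick a particle from the previous population
            theta = theta_prev[j] + tau * rng.normal()     # move it with K_t
            if dist_to_obs(simulate(theta, rng)) <= eps_t: # keep it only if its simulation is accepted
                theta_new[i] = theta
                break
    # importance weight: prior over the mixture proposal sum_j w_j K_t(theta | theta_j)
    kernel = np.exp(-0.5 * ((theta_new[:, None] - theta_prev[None, :]) / tau) ** 2) / (tau * np.sqrt(2 * np.pi))
    w_new = np.array([np.exp(prior_logpdf(t)) for t in theta_new]) / (kernel @ w_prev)
    return theta_new, w_new / w_new.sum()
```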

Page 155: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

Sequential Monte Carlo

SMC is a simulation technique to approximate a sequence of related probability distributions π_n, with π_0 "easy" and π_T the target.
Iterated IS as in PMC: particles moved from time n−1 to time n via kernel K_n and use of a sequence of extended targets π_n

π_n(z_{0:n}) = π_n(z_n) ∏_{j=0}^{n−1} L_j(z_{j+1}, z_j)

where the L_j's are backward Markov kernels [check that π_n(z_n) is a marginal]

[Del Moral, Doucet & Jasra, 2006]

Page 156: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

ABC-SMC

True derivation of an SMC-ABC algorithm
Use of a kernel K_n associated with target π_{ε_n} and derivation of the backward kernel

L_{n−1}(z, z′) = π_{ε_n}(z′) K_n(z′, z) / π_{ε_n}(z)

Update of the weights

w_{in} ∝ w_{i(n−1)} × [ ∑_{m=1}^{M} I_{A_{ε_n}}(x^m_{in}) ] / [ ∑_{m=1}^{M} I_{A_{ε_{n−1}}}(x^m_{i(n−1)}) ]

when x^m_{in} ∼ K(x_{i(n−1)}, ·)

Page 157: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Alphabet soup

Properties of ABC-SMC

The ABC-SMC method properly uses a backward kernel L(z, z′) to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. The update of the importance weights reduces to the ratio of the proportions of surviving particles.
Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior]

Adaptivity in the ABC-SMC algorithm is only found in the on-line construction of the thresholds ε_t, decreased slowly enough to keep a large number of accepted transitions

[Del Moral, Doucet & Jasra, 2009]

Page 159: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Calibration of ABC

Which summary statistics?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test.

Does not take into account the sequential nature of the tests

Depends on parameterisation

Order of inclusion matters.

Page 162: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Calibration of ABC

Point estimation vs....

In the case of the computation of E[h(θ)|y], Fearnhead and Prangle [12/14/2011] demonstrate that the optimal summary statistic is

η*(y) = E[h(θ)|y]

Unavailable but approximated by a prior ABC run and ABC-NP corrections
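A sketch of this construction in its simplest form: from a pilot ABC run, regress h(θ) on the raw summaries and use the fitted predictor as the new statistic η*(y) ≈ E[h(θ)|y] (the plain least-squares fit on the raw summaries is an illustrative choice):

```python
import numpy as np

def semi_auto_summary(theta_pilot, eta_pilot, h=lambda t: t):
    """theta_pilot: (N,) pilot draws; eta_pilot: (N, d) raw summaries of their simulated data."""
    X = np.column_stack([np.ones(len(theta_pilot)), eta_pilot])   # intercept + raw summaries
    coef, *_ = np.linalg.lstsq(X, h(theta_pilot), rcond=None)
    # the learned summary statistic eta*(.) to be used in a second, final ABC run
    return lambda eta: float(coef[0] + np.atleast_1d(eta) @ coef[1:])
```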

Page 164: WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Approximate Bayesian computation

Calibration of ABC

...vs. model choice

In the case of the computation of a Bayes factor B12(y), the ABC approximation

∑_{t=1}^{T} I_{m(t)=1} / ∑_{t=1}^{T} I_{m(t)=2}

may fail to converge
[Robert et al., 2011]

Separation conditions on the summary statistics for convergence to occur

[Marin et al., 2011]
