WSC 2011, advanced tutorial on simulation in Statistics
Simulation methods in Statistics (on recent advances)
Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
http://www.ceremade.dauphine.fr/~xian
WSC 2011, Phoenix, December 12, 2011
Outline
1 Motivation and leading example
2 Monte Carlo Integration
3 The Metropolis-Hastings Algorithm
4 Approximate Bayesian computation
Motivation and leading example

Outline
1 Motivation and leading example
  Latent variables
  Inferential methods
2 Monte Carlo Integration
3 The Metropolis-Hastings Algorithm
4 Approximate Bayesian computation

Latent variables
Latent structures make life harder!
Even simple statistical models may lead to computational complications, as in latent variable models

    f(x|θ) = ∫ f⋆(x, x⋆|θ) dx⋆

If (x, x⋆) is observed, fine! If only x is observed, trouble!
[mixtures, HMMs, state-space models, etc.]
Mixture models
Models of mixtures of distributions:

    X ∼ fj with probability pj,

for j = 1, 2, . . . , k, with overall density

    X ∼ p1 f1(x) + · · · + pk fk(x) .

For a sample of independent random variables (X1, . . . , Xn), the sample density is

    ∏_{i=1}^n { p1 f1(xi) + · · · + pk fk(xi) } .

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
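Note that the k^n cost only concerns the combinatorial expansion: the likelihood itself is computed in O(nk) operations by evaluating the k-term sum once per observation. A minimal sketch (in Python rather than the R used later in the slides; the two-component normal mixture below is an arbitrary illustration):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_loglik(xs, weights, mus, sigmas):
    # O(n * k): evaluate the k-term sum once per observation,
    # never expanding the product into k^n elementary terms
    ll = 0.0
    for x in xs:
        dens = sum(p * normal_pdf(x, m, s)
                   for p, m, s in zip(weights, mus, sigmas))
        ll += math.log(dens)
    return ll

random.seed(1)
# simulate n = 500 points from 0.3 N(0,1) + 0.7 N(3,1)
xs = [random.gauss(0, 1) if random.random() < 0.3 else random.gauss(3, 1)
      for _ in range(500)]
ll = mixture_loglik(xs, [0.3, 0.7], [0.0, 3.0], [1.0, 1.0])
```

The combinatorial explosion only strikes when the posterior is decomposed over the k^n allocations, as in the "Mixtures again" slides below.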
Mixture likelihood
[Figure: log-likelihood surface in (µ1, µ2) for the 0.3 N(µ1, 1) + 0.7 N(µ2, 1) likelihood]
Inferential methods
Maximum likelihood methods
For an iid sample X1, . . . , Xn from a population with density f(x|θ1, . . . , θk), the likelihood function is

    L(x|θ) = L(x1, . . . , xn|θ1, . . . , θk) = ∏_{i=1}^n f(xi|θ1, . . . , θk).

◦ Maximum likelihood has global justifications from asymptotics
◦ Computational difficulty depends on structure, e.g. latent variables
Maximum likelihood methods (2)
Example (Mixtures)
For a mixture of two normal distributions,

    p N(µ, τ²) + (1 − p) N(θ, σ²) ,

the likelihood, proportional to

    ∏_{i=1}^n [ p τ⁻¹ ϕ((xi − µ)/τ) + (1 − p) σ⁻¹ ϕ((xi − θ)/σ) ] ,

can be expanded into 2^n terms.
Maximum likelihood methods (3)
Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function.

Example
In the special case

    f(x|µ, σ) = (1 − ε) exp{−x²/2} + (ε/σ) exp{−(x − µ)²/(2σ²)}

with ε > 0 known, whatever n, the likelihood is unbounded:

    lim_{σ→0} L(x1, . . . , xn|µ = x1, σ) = ∞
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x, a realization of

    X ∼ f(x|θ),

is combined with prior information specified by a prior distribution with density π(θ).
Central tool...
Summary in a probability distribution, π(θ|x), called the posterior distribution, derived from the joint distribution f(x|θ)π(θ) according to

    π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ ,

[Bayes' Theorem]

where

    Z(x) = ∫ f(x|θ)π(θ) dθ

is the marginal density of X, also called the (Bayesian) evidence.
Central tool...central to Bayesian inference
Posterior defined up to a constant as

    π(θ|x) ∝ f(x|θ)π(θ)

  Operates conditional upon the observations
  Integrates simultaneously the prior information and the information brought by x
  Avoids averaging over the unobserved values of x
  Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
  Provides a complete inferential scope and a unique motor of inference
Examples of Bayes computational problems
1 complex parameter space, e.g. constrained parameter sets like those resulting from imposing stationarity constraints in time series;
2 complex sampling model with an intractable likelihood, e.g. in some graphical models;
3 use of a huge dataset;
4 complex prior distribution (which may be the posterior distribution associated with an earlier sample);
5 involved inferential procedure, as for instance Bayes factors

    B01^π(x) = [ P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x) ] / [ π(θ ∈ Θ0) / π(θ ∈ Θ1) ] .
Mixtures again
Observations from

    x1, . . . , xn ∼ f(x|θ) = p ϕ(x; µ1, σ1) + (1 − p) ϕ(x; µ2, σ2)

Prior

    µi|σi ∼ N(ξi, σi²/ni),   σi² ∼ IG(νi/2, si²/2),   p ∼ Be(α, β)

Posterior

    π(θ|x1, . . . , xn) ∝ ∏_{j=1}^n { p ϕ(xj; µ1, σ1) + (1 − p) ϕ(xj; µ2, σ2) } π(θ)
                        = ∑_{ℓ=0}^n ∑_{(kt)} ω(kt) π(θ|(kt))

[O(2^n)]
Mixtures again [2]
For a given permutation (kt), the conditional posterior distribution is

    π(θ|(kt)) = N(ξ1(kt), σ1²/(n1 + ℓ)) × IG((ν1 + ℓ)/2, s1(kt)/2)
              × N(ξ2(kt), σ2²/(n2 + n − ℓ)) × IG((ν2 + n − ℓ)/2, s2(kt)/2)
              × Be(α + ℓ, β + n − ℓ)
Mixtures again [3]
where

    x̄1(kt) = (1/ℓ) ∑_{t=1}^ℓ x_{kt} ,          ŝ1(kt) = ∑_{t=1}^ℓ (x_{kt} − x̄1(kt))² ,
    x̄2(kt) = (1/(n−ℓ)) ∑_{t=ℓ+1}^n x_{kt} ,    ŝ2(kt) = ∑_{t=ℓ+1}^n (x_{kt} − x̄2(kt))²

and

    ξ1(kt) = (n1 ξ1 + ℓ x̄1(kt)) / (n1 + ℓ) ,   ξ2(kt) = (n2 ξ2 + (n − ℓ) x̄2(kt)) / (n2 + n − ℓ) ,

    s1(kt) = s1² + ŝ1(kt) + [n1 ℓ / (n1 + ℓ)] (ξ1 − x̄1(kt))² ,
    s2(kt) = s2² + ŝ2(kt) + [n2 (n − ℓ) / (n2 + n − ℓ)] (ξ2 − x̄2(kt))² ,

the posterior updates of the hyperparameters.
Mixtures again [4]
Bayes estimator of θ:

    δ^π(x1, . . . , xn) = ∑_{ℓ=0}^n ∑_{(kt)} ω(kt) E^π[θ|x, (kt)]

© Too costly: 2^n terms
Unfortunate, as the decomposition is meaningful for clustering purposes.
Monte Carlo Integration

Outline
1 Motivation and leading example
2 Monte Carlo Integration
  Monte Carlo integration
  Importance Sampling
  Bayesian importance sampling
3 The Metropolis-Hastings Algorithm
4 Approximate Bayesian computation
Monte Carlo integration
Theme: the generic problem of evaluating the integral

    I = E_f[h(X)] = ∫_X h(x) f(x) dx

where X is uni- or multidimensional, f is a closed-form, partly closed-form, or implicit density, and h is a function.
Monte Carlo integration (2)
Monte Carlo solution: use a sample (X1, . . . , Xm) from the density f to approximate the integral I by the empirical average

    h̄m = (1/m) ∑_{j=1}^m h(xj) ,

which converges,

    h̄m −→ E_f[h(X)] ,

by the Strong Law of Large Numbers.
Monte Carlo precision
Estimate the variance with

    vm = (1/(m−1)) ∑_{j=1}^m [h(xj) − h̄m]² ,

and for m large,

    √m (h̄m − E_f[h(X)]) / √vm ∼ N(0, 1).

Note: this can lead to the construction of a convergence test and of confidence bounds on the approximation of E_f[h(X)].
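The estimate-plus-precision recipe above fits in a few lines; a Python sketch (the target E_f[h(X)] = E[X²] = 1 for X ∼ N(0, 1) is an arbitrary illustrative choice):

```python
import math
import random
import statistics

random.seed(2)
m = 100_000
h = [random.gauss(0, 1) ** 2 for _ in range(m)]   # h(X) = X², X ∼ N(0, 1), E[h] = 1

h_bar = sum(h) / m                     # Monte Carlo estimate of E_f[h(X)]
v_m = statistics.variance(h)           # (1/(m-1)) Σ (h(x_j) - h_bar)²
std_err = math.sqrt(v_m / m)           # CLT scale for h_bar
ci = (h_bar - 1.96 * std_err, h_bar + 1.96 * std_err)   # asymptotic 95% bounds
```

The interval shrinks at the usual O(m^{-1/2}) Monte Carlo rate, regardless of the dimension of X.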
Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior:

    X ∼ N(θ, 1),   θ ∼ C(0, 1).

Under squared error loss, the posterior mean is

    δ^π(x) = ∫_{−∞}^{∞} [θ/(1 + θ²)] e^{−(x−θ)²/2} dθ  /  ∫_{−∞}^{∞} [1/(1 + θ²)] e^{−(x−θ)²/2} dθ
Example (Cauchy prior/normal sample (2))
The form of δ^π suggests simulating iid variables

    θ1, · · · , θm ∼ N(x, 1)

and calculating

    δ^π_m(x) = ∑_{i=1}^m [θi/(1 + θi²)]  /  ∑_{i=1}^m [1/(1 + θi²)] .

The Law of Large Numbers implies

    δ^π_m(x) −→ δ^π(x) as m −→ ∞.
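A Python sketch of this ratio estimator for x = 10, the value used in the figure below (the sample size m = 200,000 is an arbitrary choice):

```python
import math
import random

def cauchy_posterior_mean(x, m=200_000, seed=3):
    # theta_i ~ N(x, 1); weight each draw by the Cauchy prior density 1/(1 + theta²)
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(m):
        th = rng.gauss(x, 1.0)
        w = 1.0 / (1.0 + th * th)
        num += th * w
        den += w
    return num / den    # ratio estimator, converging to the posterior mean

est = cauchy_posterior_mean(10.0)   # shrinks x = 10 slightly towards 0
```

Because the weights 1/(1 + θ²) are bounded, this particular ratio estimator is well behaved; the importance sampling section below discusses when such weights misbehave.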
[Figure: range of the estimators δ^π_m across 100 runs, for x = 10, against the number of iterations]
Importance Sampling
Importance sampling
Paradox: simulation from f (the true density) is not necessarily optimal!

An alternative to direct sampling from f is importance sampling, based on the alternative representation

    E_f[h(X)] = ∫_X [ h(x) f(x)/g(x) ] g(x) dx ,

which allows us to use distributions other than f.
Importance sampling algorithm
Evaluation of

    E_f[h(X)] = ∫_X h(x) f(x) dx

by
1 Generate a sample X1, . . . , Xm from a distribution g
2 Use the approximation

    (1/m) ∑_{j=1}^m [f(Xj)/g(Xj)] h(Xj)
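As a sketch of the two steps (the tail probability P(X > 4) for X ∼ N(0, 1), with a shifted exponential proposal, is a standard textbook illustration, not an example from the slides):

```python
import math
import random

random.seed(4)

def phi(x):
    # standard normal density, the target f
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# target: E_f[h(X)] = P(X > 4), far too rare an event for naive sampling from f
m = 100_000
total = 0.0
for _ in range(m):
    x = 4.0 + random.expovariate(1.0)   # proposal g: 4 + Exp(1), supported on (4, ∞)
    g_x = math.exp(-(x - 4.0))          # proposal density at x
    total += phi(x) / g_x               # importance weight; h(x) = 1{x > 4} ≡ 1 here
estimate = total / m
```

A naive average of indicator functions over draws from f would see about three successes per 100,000 draws; the reweighted proposal makes every draw informative.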
Implementation details
◦ Instrumental distribution g chosen from distributions that are easy to simulate
◦ The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f
◦ Dependent proposals can be used, as seen later
Finite vs. infinite variance
Although g can be any density, some choices are better than others:
◦ Finite variance only when

    E_f[ h²(X) f(X)/g(X) ] = ∫_X h²(x) f²(x)/g(x) dx < ∞ .

◦ Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) are not appropriate.
◦ If sup f/g = ∞, the weights f(xj)/g(xj) vary widely, giving too much importance to a few values xj.
◦ If sup f/g = M < ∞, finite variance for L² functions.
Self-normalised importance sampling
For the ratio estimator

    δ_h^n = ∑_{i=1}^n ωi h(xi)  /  ∑_{i=1}^n ωi

with Xi ∼ g(y) and Wi such that

    E[Wi|Xi = x] = κ f(x)/g(x)
Self-normalised variance
then

    var(δ_h^n) ≈ (1/(n²κ²)) ( var(S_h^n) − 2 E^π[h] cov(S_h^n, S_1^n) + E^π[h]² var(S_1^n) )

for

    S_h^n = ∑_{i=1}^n Wi h(Xi) ,    S_1^n = ∑_{i=1}^n Wi .

Rough approximation:

    var(δ_h^n) ≈ (1/n) var_π(h(X)) { 1 + var_g(W) }
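The inflation factor {1 + var_g(W)} is what the usual effective sample size diagnostic measures: (∑ wi)² / ∑ wi² equals n for flat weights and collapses towards 1 when a few weights dominate. A tiny sketch (the weight vectors are made-up illustrations):

```python
def effective_sample_size(weights):
    # Kish effective sample size: (Σ w)² / Σ w² = n / (1 + variance of the
    # weights normalised to mean one) — the inflation factor of the slide
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

ess_flat = effective_sample_size([1.0] * 100)            # = 100: no efficiency loss
ess_skew = effective_sample_size([1000.0] + [1.0] * 99)  # one weight dominates
```

Monitoring this quantity along a run is a cheap way to detect the infinite-variance pathologies of the previous slide before they silently bias the estimate.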
Bayesian importance sampling
Bayes factor approximation
When approximating the Bayes factor

    B01 = ∫_{Θ0} f0(x|θ0)π0(θ0) dθ0  /  ∫_{Θ1} f1(x|θ1)π1(θ1) dθ1

use of importance functions ϖ0 and ϖ1 and

    B̂01 = [ n0⁻¹ ∑_{i=1}^{n0} f0(x|θ0^i)π0(θ0^i)/ϖ0(θ0^i) ] / [ n1⁻¹ ∑_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i) ]
Diabetes in Pima Indian women
Example (R benchmark)
“A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.”

200 Pima Indian women with observed variables:
plasma glucose concentration in oral glucose tolerance test
diastolic blood pressure
diabetes pedigree function
presence/absence of diabetes
Probit modelling on Pima Indian women
Probability of diabetes as a function of the above variables:

    P(y = 1|x) = Φ(x1β1 + x2β2 + x3β3) .

Test of H0 : β3 = 0 for the 200 observations of Pima.tr, based on a g-prior modelling:

    β ∼ N3(0, n (XᵀX)⁻¹)
Importance sampling for the Pima Indian dataset
Use of an importance function inspired from the MLE estimate distribution,

    β ∼ N(β̂, Σ̂)

R importance sampling code:

  model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
  is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
  is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
  bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
    sigma=2*model1$cov.unscaled)))/mean(exp(probitlpost(is2,y,X2)-
    dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
Diabetes in Pima Indian women
Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the prior and from the above MLE importance sampler.

[Figure: boxplots of the Bayes factor approximations, basic Monte Carlo vs. importance sampling]
Bridge sampling
Special case: if

    π1(θ|x) ∝ π̃1(θ|x)   and   π2(θ|x) ∝ π̃2(θ|x)

live on the same space (Θ1 = Θ2), then

    B12 ≈ (1/n) ∑_{i=1}^n π̃1(θi|x)/π̃2(θi|x) ,   θi ∼ π2(θ|x)

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
(Further) bridge sampling
General identity:

    B12 = ∫ π̃2(θ|x) α(θ) π1(θ|x) dθ  /  ∫ π̃1(θ|x) α(θ) π2(θ|x) dθ    ∀ α(·)

        ≈ [ (1/n1) ∑_{i=1}^{n1} π̃2(θ1i|x) α(θ1i) ] / [ (1/n2) ∑_{i=1}^{n2} π̃1(θ2i|x) α(θ2i) ] ,   θji ∼ πj(θ|x)
Optimal bridge sampling
The optimal choice of auxiliary function is

    α⋆ = (n1 + n2) / (n1 π1(θ|x) + n2 π2(θ|x))

leading to

    B12 ≈ [ (1/n1) ∑_{i=1}^{n1} π̃2(θ1i|x) / (n1 π1(θ1i|x) + n2 π2(θ1i|x)) ]
        / [ (1/n2) ∑_{i=1}^{n2} π̃1(θ2i|x) / (n1 π1(θ2i|x) + n2 π2(θ2i|x)) ]
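Since α⋆ involves the unknown normalised densities, the estimate is obtained in practice by a fixed-point iteration (Meng & Wong, 1996). A Python sketch on two toy unnormalised Gaussian kernels whose constants are known, so the answer can be checked:

```python
import math
import random

random.seed(6)

# unnormalised densities with known constants, to check the estimate:
# p1(θ) = exp(-θ²/2)       -> Z1 = √(2π)   (N(0, 1) kernel)
# p2(θ) = exp(-(θ-1)²/4)   -> Z2 = √(4π)   (N(1, 2) kernel)
def p1(t): return math.exp(-0.5 * t * t)
def p2(t): return math.exp(-0.25 * (t - 1.0) ** 2)

n1 = n2 = 20_000
th1 = [random.gauss(0.0, 1.0) for _ in range(n1)]             # draws from π1
th2 = [random.gauss(1.0, math.sqrt(2.0)) for _ in range(n2)]  # draws from π2

l1 = [p1(t) / p2(t) for t in th1]    # unnormalised density ratios at π1 draws
l2 = [p1(t) / p2(t) for t in th2]    # ... and at π2 draws
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

r = 1.0                              # running estimate of B12 = Z1/Z2
for _ in range(50):                  # fixed-point iteration on the optimal bridge
    num = sum(l / (s1 * l + s2 * r) for l in l2) / n2
    den = sum(1.0 / (s1 * l + s2 * r) for l in l1) / n1
    r = num / den

true_b12 = math.sqrt(2 * math.pi / (4 * math.pi))   # = 1/√2
```

The iteration converges in a handful of steps whenever the two posteriors overlap; it is exactly this overlap requirement that bridge sampling exploits and naive importance sampling lacks.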
Illustration for the Pima Indian dataset
Use of the MLE-induced conditional of β3 given (β1, β2) as a pseudo-posterior, and a mixture of both MLE approximations on β3 in the bridge sampling estimate.

R bridge sampling code:

  cova=model2$cov.unscaled
  expecta=model2$coeff[,1]
  covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
  probit1=hmprobit(Niter,y,X1)
  probit2=hmprobit(Niter,y,X2)
  pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
  probit1p=cbind(probit1,pseudo)
  bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
    sqrt(covw),log=T))/(dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
    cova[3,3])))/mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
    dnorm(pseudo,expecta[3],cova[3,3])))
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler, and the above importance sampler.
The original harmonic mean estimator
When θkt ∼ πk(θ|x),

    (1/T) ∑_{t=1}^T 1 / L(θkt|x)

is an unbiased estimator of 1/mk(x).
[Newton & Raftery, 1994]

Highly dangerous: most often leads to an infinite variance!!!
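The danger is easy to reproduce: in a conjugate normal toy model where mk(x) is known exactly, 1/L(θ) has infinite variance under the posterior, and replicated harmonic mean estimates disagree with one another (a sketch; the model, sample sizes, and replicate count are arbitrary illustrative choices):

```python
import math
import random

random.seed(7)

def npdf(z, mu, var):
    return math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# toy model: x | θ ~ N(θ, 1), θ ~ N(0, 1), so the marginal m(x) is the N(0, 2)
# density and the posterior is θ | x ~ N(x/2, 1/2): everything is explicit
x = 2.0
true_m = npdf(x, 0.0, 2.0)

def harmonic_mean_estimate(T=5_000):
    # average 1/L(θ_t) over exact posterior draws, then invert;
    # here 1/L has infinite variance under the posterior
    acc = 0.0
    for _ in range(T):
        th = random.gauss(x / 2.0, math.sqrt(0.5))
        acc += 1.0 / npdf(x, th, 1.0)
    return T / acc

estimates = [harmonic_mean_estimate() for _ in range(20)]
spread = max(estimates) / min(estimates)   # replicate-to-replicate instability
```

The estimator is consistent, yet the replicate spread refuses to settle as T grows, because rare posterior draws in the likelihood's far tail dominate the average of 1/L.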
“The Worst Monte Carlo Method Ever”
“The good news is that the Law of Large Numbers guarantees that this estimator is consistent, ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.
The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.”
[Radford Neal's blog, Aug. 23, 2008]
Approximating Zk from a posterior sample
Use of the [harmonic mean] identity

    E^{πk}[ ϕ(θk) / (πk(θk) Lk(θk)) | x ] = ∫ [ ϕ(θk) / (πk(θk) Lk(θk)) ] [ πk(θk) Lk(θk) / Zk ] dθk = 1/Zk ,

no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]

Direct exploitation of the MCMC output
Comparison with regular importance sampling
Harmonic mean: Constraint opposed to usual importance samplingconstraints: ϕ(θ) must have lighter (rather than fatter) tails thanπk(θk)Lk(θk) for the approximation
Z1k = 1
/1
T
T∑t=1
ϕ(θ(t)k )
πk(θ(t)k )Lk(θ
(t)k )
to enjoy finite variance
Comparison with regular importance sampling (cont’d)

Compare $Z_{1k}$ with a standard importance sampling approximation

$$Z_{2k}=\frac{1}{T}\sum_{t=1}^{T}\frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}$$

where the $\theta_k^{(t)}$’s are generated from the density ϕ(·) (with fatter tails, like t’s)
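The opposite tail conditions on ϕ can be checked numerically on a toy conjugate model whose evidence is available in closed form. The following Python sketch (illustrative, not from the slides: the model x ∼ N(θ, 1) with prior θ ∼ N(0, 1) and both choices of ϕ are assumptions) computes the harmonic mean estimator $Z_{1k}$ with a light-tailed ϕ and the importance sampling estimator $Z_{2k}$ with a fat-tailed ϕ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.0                                      # one observation, model x ~ N(theta, 1), prior theta ~ N(0, 1)
true_Z = norm.pdf(x, 0.0, np.sqrt(2.0))      # marginal likelihood in closed form: N(x; 0, 2)

post_mean, post_sd = x / 2.0, np.sqrt(0.5)   # conjugate posterior N(1/2, 1/2)

def prior_times_lik(theta):
    return norm.pdf(theta, 0.0, 1.0) * norm.pdf(x, theta, 1.0)

T = 100_000

# harmonic-mean identity: phi with LIGHTER tails than the posterior
theta_post = rng.normal(post_mean, post_sd, T)          # sample from the posterior
phi_light = norm.pdf(theta_post, post_mean, post_sd / 2)
Z1 = 1.0 / np.mean(phi_light / prior_times_lik(theta_post))

# standard importance sampling: phi with FATTER tails than the posterior
theta_is = rng.normal(post_mean, 2 * post_sd, T)        # sample from phi
phi_fat = norm.pdf(theta_is, post_mean, 2 * post_sd)
Z2 = np.mean(prior_times_lik(theta_is) / phi_fat)

print(true_Z, Z1, Z2)
```

Both estimates agree with the closed-form marginal here; swapping the tail conditions (e.g. a fat-tailed ϕ inside the harmonic mean identity) would produce an infinite-variance estimator.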
HPD indicator as ϕ

Use the convex hull of the MCMC simulations corresponding to the 10% HPD region (easily derived!) and an indicator as ϕ:

$$\varphi(\theta)=\frac{10}{T}\sum_{t\in\mathrm{HPD}}\mathbb{I}_{d(\theta,\theta^{(t)})\le\varepsilon}$$
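In one dimension this idea can be sketched with an interval in place of the convex hull (an illustrative simplification, not the slides’ construction): the 10% HPD region of the posterior sample is approximated by the interval spanned by the 10% highest-density draws, and ϕ is the uniform density on that interval, which has the required lighter tails:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = 1.0                                         # same toy model: x ~ N(theta,1), theta ~ N(0,1)
true_Z = norm.pdf(x, 0.0, np.sqrt(2.0))
post = rng.normal(x / 2.0, np.sqrt(0.5), 50_000)  # posterior sample

def prior_times_lik(theta):
    return norm.pdf(theta, 0.0, 1.0) * norm.pdf(x, theta, 1.0)

# keep the 10% sample points of highest (unnormalised) posterior density
dens = prior_times_lik(post)
hpd = post[dens >= np.quantile(dens, 0.9)]
lo, hi = hpd.min(), hpd.max()                   # in 1-d the HPD region is an interval

# phi = uniform density over the HPD interval: bounded support, hence lighter tails
phi = ((post >= lo) & (post <= hi)) / (hi - lo)
Z_hat = 1.0 / np.mean(phi / dens)
print(true_Z, Z_hat)
```

Because ϕ vanishes outside the HPD interval, the ratio ϕ/(πL) is bounded and the harmonic mean estimator has finite variance.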
Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 replicates of 20,000 simulations, for the harmonic mean sampler above and for importance samplers

[Figure: boxplots of the Bayes factor approximations, harmonic mean vs. importance sampling, with values ranging over roughly 3.102 to 3.116]
Chib’s representation

Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),

$$m_k(x)=\frac{f_k(x|\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k|x)}$$

[Bayes Theorem]

Use of an approximation to the posterior

$$\widehat{m}_k(x)=\frac{f_k(x|\theta_k^*)\,\pi_k(\theta_k^*)}{\widehat{\pi}_k(\theta_k^*|x)}\,.$$
Case of latent variables

For missing variables z, as in mixture models, natural Rao–Blackwell estimate

$$\widehat{\pi}_k(\theta_k^*|x)=\frac{1}{T}\sum_{t=1}^{T}\pi_k(\theta_k^*|x,z_k^{(t)})\,,$$

where the $z_k^{(t)}$’s are Gibbs sampled latent variables
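A minimal Python sketch of this Rao–Blackwell estimate plugged into Chib’s identity, on a one-coefficient probit-type model with latent Gaussian variables (the data, the N(0,1) prior, and the evaluation point θ* are illustrative assumptions; the truth is obtained by numerical integration):

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(2)
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # binary data, P(y=1|theta) = Phi(theta)
n = len(y)

def lik(theta):
    p = norm.cdf(theta)
    return np.prod(np.where(y == 1, p, 1 - p))

# ground truth by quadrature: m(y) = int lik(theta) N(theta; 0, 1) dtheta
grid = np.linspace(-8, 8, 4001)
true_m = np.trapz([lik(t) * norm.pdf(t) for t in grid], grid)

# Gibbs sampler on (theta, z): z_i | theta truncated normal, theta | z conjugate normal
T, v = 10_000, 1.0 / (n + 1)       # v = posterior variance of theta given z
theta, rb = 0.0, []
theta_star = 0.5                   # fixed evaluation point for Chib's identity
for t in range(T):
    a = np.where(y == 1, -theta, -np.inf)      # bounds on z - theta: z > 0 iff y = 1
    b = np.where(y == 1, np.inf, -theta)
    z = theta + truncnorm.rvs(a, b, random_state=rng)
    theta = rng.normal(v * z.sum(), np.sqrt(v))
    rb.append(norm.pdf(theta_star, v * z.sum(), np.sqrt(v)))  # pi(theta*|x, z^(t))

post_at_star = np.mean(rb)                     # Rao-Blackwell posterior density estimate
chib_m = lik(theta_star) * norm.pdf(theta_star) / post_at_star
print(true_m, chib_m)
```

The averaged full conditionals give a low-variance estimate of the posterior density at θ*, which is all that Chib’s identity requires.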
Case of the probit model

For the completion by z,

$$\widehat{\pi}(\theta|x)=\frac{1}{T}\sum_{t}\pi(\theta|x,z^{(t)})$$

is a simple average of normal densities

R bridge sampling code:

    gibbs1=gibbsprobit(Niter,y,X1)
    gibbs2=gibbsprobit(Niter,y,X2)
    bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
              sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
          mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
              sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))
Diabetes in Pima Indian women (cont’d)

Comparison of the variation of the Bayes factor approximations based on 100 replicates of 20,000 simulations, for Chib’s approximation above and for importance samplers
The Metropolis-Hastings Algorithm

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm
  Monte Carlo Methods based on Markov Chains
  The Metropolis–Hastings algorithm
  The random walk Metropolis-Hastings algorithm
  Adaptive MCMC

4 Approximate Bayesian computation
Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains

Epiphany! It is not necessary to use a sample from the distribution f to approximate the integral

$$I=\int h(x)\,f(x)\,\mathrm{d}x\,,$$

Principle: obtain $X_1,\ldots,X_n\sim f$ (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f

[Metropolis et al., 1953]
Running Monte Carlo via Markov Chains (2)

Idea

For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution f

- This ensures the convergence in distribution of $(X^{(t)})$ to a random variable from f. For a “large enough” $T_0$, $X^{(T_0)}$ can be considered as distributed from f
- It produces a dependent sample $X^{(T_0)},X^{(T_0+1)},\ldots$, which is generated from f, sufficient for most approximation purposes.

Problem: How can one build a Markov chain with a given stationary distribution?
The Metropolis–Hastings algorithm

Basics

The algorithm uses the objective (target) density f and a conditional density q(y|x), called the instrumental (or proposal) distribution
The MH algorithm

Algorithm (Metropolis–Hastings)

Given $x^{(t)}$,

1. Generate $Y_t \sim q(y|x^{(t)})$.

2. Take

$$X^{(t+1)}=\begin{cases}Y_t & \text{with prob. } \rho(x^{(t)},Y_t),\\ x^{(t)} & \text{with prob. } 1-\rho(x^{(t)},Y_t),\end{cases}$$

where

$$\rho(x,y)=\min\left\{\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,,\,1\right\}.$$
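The two steps above translate directly into code. A generic Python sketch (the N(2, 1) target and the N(0, 3²) independence proposal are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolis_hastings(f, q_sample, q_density, x0, n_iter):
    """Generic M-H: f is the target density (up to a constant),
    q_sample(x) draws Y ~ q(.|x), q_density(y, x) evaluates q(y|x)."""
    chain, x = np.empty(n_iter), x0
    for t in range(n_iter):
        y = q_sample(x)
        # acceptance probability rho(x, y) = min{ f(y) q(x|y) / f(x) q(y|x), 1 }
        rho = min(1.0, f(y) * q_density(x, y) / (f(x) * q_density(y, x)))
        if rng.uniform() < rho:
            x = y
        chain[t] = x
    return chain

# illustrative target N(2, 1) with an independence proposal q(y|x) = N(y; 0, 3^2);
# normalising constants are omitted since they cancel in rho
f = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)
q_s = lambda x: rng.normal(0.0, 3.0)
q_d = lambda y, x: np.exp(-0.5 * (y / 3.0) ** 2) / 3.0
chain = metropolis_hastings(f, q_s, q_d, 0.0, 50_000)
print(chain[10_000:].mean(), chain[10_000:].std())
```

After burn-in the chain averages recover the target mean and standard deviation, illustrating the ergodic theorem stated below.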
Features

- Independent of normalizing constants for both f and q(·|x) (i.e., of those constants independent of x)
- Never moves to values with f(y) = 0
- The chain $(x^{(t)})_t$ may take the same value several times in a row, even though f is a density w.r.t. Lebesgue measure
- The sequence $(y_t)_t$ is usually not a Markov chain
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
The M-H Markov chain is reversible, with invariant/stationarydensity f since it satisfies the detailed balance condition
f(y)K(y, x) = f(x)K(x,y)
Ifq(y|x) > 0 for every (x,y),
the chain is Harris recurrent and
limT→∞ 1
T
T∑t=1
h(X(t)) =
∫h(x)df(x) a.e. f.
limn→∞
∥∥∥∥∫ Kn(x, ·)µ(dx) − f∥∥∥∥TV
= 0
for every initial distribution µ
Simulation methods in Statistics (on recent advances)
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
The M-H Markov chain is reversible, with invariant/stationarydensity f since it satisfies the detailed balance condition
f(y)K(y, x) = f(x)K(x,y)
Ifq(y|x) > 0 for every (x,y),
the chain is Harris recurrent and
limT→∞ 1
T
T∑t=1
h(X(t)) =
∫h(x)df(x) a.e. f.
limn→∞
∥∥∥∥∫ Kn(x, ·)µ(dx) − f∥∥∥∥TV
= 0
for every initial distribution µ
The random walk Metropolis-Hastings algorithm

Random walk Metropolis–Hastings

Use of a local perturbation as proposal

$$Y_t=X^{(t)}+\varepsilon_t\,,$$

where $\varepsilon_t\sim g$, independent of $X^{(t)}$. The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if we take g to be symmetric: g(x) = g(−x)
Algorithm (Random walk Metropolis)

Given $x^{(t)}$,

1. Generate $Y_t \sim g(y-x^{(t)})$

2. Take

$$X^{(t+1)}=\begin{cases}Y_t & \text{with prob. } \min\left\{1,\dfrac{f(Y_t)}{f(x^{(t)})}\right\},\\ x^{(t)} & \text{otherwise.}\end{cases}$$
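A Python sketch of the random walk version, on an illustrative bimodal target (an equally weighted mixture of N(−1, 1) and N(1, 1), chosen as an assumption so that a well-scaled walk visits both modes):

```python
import numpy as np

rng = np.random.default_rng(4)

# bimodal target 0.5 N(-1, 1) + 0.5 N(1, 1), up to a normalising constant
def f(x):
    return np.exp(-0.5 * (x + 1) ** 2) + np.exp(-0.5 * (x - 1) ** 2)

def rw_metropolis(f, x0, scale, n_iter):
    chain, x, accepted = np.empty(n_iter), x0, 0
    for t in range(n_iter):
        y = x + rng.normal(0.0, scale)          # symmetric random-walk proposal
        if rng.uniform() < min(1.0, f(y) / f(x)):   # q-ratio cancels by symmetry
            x, accepted = y, accepted + 1
        chain[t] = x
    return chain, accepted / n_iter

chain, acc = rw_metropolis(f, 0.0, 2.0, 50_000)
print(acc, chain.mean(), chain.std())
```

Rerunning with a much smaller scale raises the acceptance rate but slows the exploration of the two modes, which is the point made on the next slides.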
RW-MH on mixture posterior distribution

[Figure: random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1), plotted over the (µ1, µ2) surface]
Acceptance rate

A high acceptance rate is not an indication of efficiency, since the random walk may be moving “too slowly” on the target surface

If $x^{(t)}$ and $y_t$ are “too close”, i.e. $f(x^{(t)})\simeq f(y_t)$, then $y_t$ is accepted with probability

$$\min\left(\frac{f(y_t)}{f(x^{(t)})}\,,\,1\right)\simeq 1\,,$$

and the acceptance rate is high

If the average acceptance rate is low, the proposed values $f(y_t)$ tend to be small w.r.t. $f(x^{(t)})$, i.e. the random walk [not the algorithm!] moves quickly on the target surface, often reaching its boundaries
Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, aim at an average acceptance rate of 25%.

[Gelman, Gilks and Roberts, 1995]
Noisy AR(1)

The target distribution of x, given x1, x2 and y, is proportional to

$$\exp\left[-\frac{1}{2\tau^2}\left\{(x-\varphi x_1)^2+(x_2-\varphi x)^2+\frac{\tau^2}{\sigma^2}(y-x^2)^2\right\}\right].$$

For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
Noisy AR(2)

Markov chain based on a random walk with scale ω = .1.

Noisy AR(3)

Markov chain based on a random walk with scale ω = .5.
Adaptive MCMC

No free lunch!!

An MCMC algorithm trained on-line is usually invalid: using the whole past of the “chain” implies that it is not a Markov chain any longer!

This means standard Markov chain (ergodic) theory does not apply

[Meyn & Tweedie, 1994]
Example (Poly t distribution)

Take a T(3, θ, 1) sample $(x_1,\ldots,x_n)$ with flat prior π(θ) = 1. Fit a normal proposal from the empirical mean and empirical variance of the chain so far,

$$\mu_t=\frac{1}{t}\sum_{i=1}^{t}\theta^{(i)}\quad\text{and}\quad\sigma_t^2=\frac{1}{t}\sum_{i=1}^{t}(\theta^{(i)}-\mu_t)^2\,,$$

i.e. a Metropolis–Hastings algorithm with acceptance probability

$$\prod_{j=2}^{n}\left[\frac{\nu+(x_j-\theta^{(t)})^2}{\nu+(x_j-\xi)^2}\right]^{-(\nu+1)/2}\frac{\exp\{-(\mu_t-\theta^{(t)})^2/2\sigma_t^2\}}{\exp\{-(\mu_t-\xi)^2/2\sigma_t^2\}}\,,$$

where $\xi\sim\mathcal{N}(\mu_t,\sigma_t^2)$.
Invalid scheme

- the invariant distribution is not invariant any longer
- when the range of the initial values is too small, the $\theta^{(i)}$’s cannot converge to the target distribution and concentrate on too small a support
- long-range dependence on past values modifies the distribution of the sequence
- using past simulations to create a non-parametric approximation to the target distribution does not work either
[Figure] Adaptive scheme for a sample of 10 $x_j\sim T_3$ and initial variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5: trace plots of the chains and corresponding histograms.
[Figure] Sample produced by 50,000 iterations of a nonparametric adaptive MCMC scheme and comparison of its distribution with the target distribution.
Simply forget about it!

Warning: one should not constantly adapt the proposal to past performances

Either adaptation ceases after a burn-in period, or the adaptive scheme must be theoretically assessed in its own right.

[Haario & Saksman, 1999; Andrieu & Robert, 2001]
Diminishing adaptation

Adaptivity of the cyberparameter γt has to be gradually tuned down to recover ergodicity

[Roberts & Rosenthal, 2007]

Sufficient conditions:

1 the total variation distance between two consecutive kernels must uniformly decrease to zero [diminishing adaptation]

$$\lim_{t\to\infty}\sup_x\|K_{\gamma_t}(x,\cdot)-K_{\gamma_{t+1}}(x,\cdot)\|_{\mathrm{TV}}=0$$

2 the time to stationarity remains bounded for any fixed γt [containment]

Works for a random walk proposal that relies on the empirical variance of the sample, modulo a ridge-like stabilizing factor

[Haario, Saksman & Tamminen, 1999]

Tune the scale in each direction toward an optimal acceptance rate of 0.44.

[Roberts & Rosenthal, 2006]

Packages amcmc and grapham
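A Python sketch of a diminishing-adaptation random walk (illustrative assumptions, not the slides’ algorithm: a Robbins–Monro step of size $(t+1)^{-0.7}$ driving the log-scale toward the 0.44 acceptance rate quoted above, on an N(0, 5²) target). Because the adaptation step decays to zero, consecutive kernels get arbitrarily close, in line with condition 1:

```python
import numpy as np

rng = np.random.default_rng(5)
log_f = lambda x: -0.5 * (x / 5.0) ** 2   # target N(0, 5^2), log-density up to a constant

n_iter, x, log_scale = 100_000, 0.0, 0.0
chain = np.empty(n_iter)
for t in range(n_iter):
    y = x + rng.normal(0.0, np.exp(log_scale))
    acc = min(1.0, np.exp(log_f(y) - log_f(x)))
    if rng.uniform() < acc:
        x = y
    # diminishing adaptation: Robbins-Monro step toward a 0.44 acceptance rate
    log_scale += (acc - 0.44) / (t + 1) ** 0.7
    chain[t] = x

print(np.exp(log_scale), chain[20_000:].std())
```

The adapted scale stabilizes near the value achieving 44% acceptance (about 2.4 times the target’s standard deviation in one dimension) and the chain recovers the target spread.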
Approximate Bayesian computation

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation
  ABC basics
  Alphabet soup
  Calibration of ABC
ABC basics

Intractable likelihoods

There are cases when the likelihood function f(y|θ) is unavailable and when the completion step

$$f(y|\theta)=\int_{\mathcal{Z}}f(y,z|\theta)\,\mathrm{d}z$$

is impossible or too costly because of the dimension of z

⇒ MCMC cannot be implemented!

[Robert & Casella, 2004]
Illustrations

Example

Stochastic volatility model: for t = 1, . . . , T,

$$y_t=\exp(z_t)\,\epsilon_t\,,\qquad z_t=a+b\,z_{t-1}+\sigma\,\eta_t\,,$$

T very large makes it difficult to include z within the simulated parameters

[Figure: highest weight trajectories]
Example

Potts model: if y takes values on a grid Y of size $k^n$ and

$$f(y|\theta)\propto\exp\left\{\theta\sum_{l\sim i}\mathbb{I}_{y_l=y_i}\right\}$$

where l ∼ i denotes a neighbourhood relation, n moderately large prohibits the computation of the normalising constant
Approximate Bayesian computation
ABC basics
Illustrations
Example
Inference on CMB: in cosmology, study of the Cosmic MicrowaveBackground via likelihoods immensely slow to computate (e.gWMAP, Plank), because of numerically costly spectral transforms[Data is a Fortran program]
[Kilbinger et al., 2010, MNRAS]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
Illustrations
Example
Coalescence tree: in populationgenetics, reconstitution of a commonancestor from a sample of genes viaa phylogenetic tree that is close toimpossible to integrate out[100 processor days with 4parameters]
[Cornuet et al., 2009, Bioinformatics]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
The ABC method
Bayesian setting: target is π(θ)f(x|θ)
When likelihood f(x|θ) not in closed form, likelihood-free rejectiontechnique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointlysimulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavare et al., 1997]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
The ABC method
Bayesian setting: target is π(θ)f(x|θ)When likelihood f(x|θ) not in closed form, likelihood-free rejectiontechnique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointlysimulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavare et al., 1997]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
The ABC method
Bayesian setting: target is π(θ)f(x|θ)When likelihood f(x|θ) not in closed form, likelihood-free rejectiontechnique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointlysimulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavare et al., 1997]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
Why does it work?!
The proof is trivial:
f(θi) ∝∑z∈D
π(θi)f(z|θi)Iy(z)
∝ π(θi)f(y|θi)= π(θi|y) .
[Accept–Reject 101]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
Earlier occurrence
‘Bayesian statistics and Monte Carlo methods are ideallysuited to the task of passing many models over onedataset’
[Don Rubin, Annals of Statistics, 1984]
Note Rubin (1984) does not promote this algorithm forlikelihood-free simulation but frequentist intuition on posteriordistributions: parameters from posteriors are more likely to bethose that could have generated the data.
A as approximative

When y is a continuous random variable, the equality z = y is replaced with a tolerance condition,

$$\rho(y,z)\le\varepsilon$$

where ρ is a distance

Output distributed from

$$\pi(\theta)\,P_\theta\{\rho(y,z)<\varepsilon\}\propto\pi(\theta\,|\,\rho(y,z)<\varepsilon)$$
ABC algorithm

Algorithm 1 Likelihood-free rejection sampler

    for i = 1 to N do
      repeat
        generate θ′ from the prior distribution π(·)
        generate z from the likelihood f(·|θ′)
      until ρ{η(z), η(y)} ≤ ε
      set θi = θ′
    end for

where η(y) defines a (possibly insufficient) statistic
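A Python sketch of Algorithm 1 on a toy model (illustrative assumptions: the data are n = 10 N(θ, 1) observations summarized by their sample mean, so the summary η(z) can be simulated directly as N(θ, 1/√n); the prior is uniform on (−10, 10)):

```python
import numpy as np

rng = np.random.default_rng(6)
n, ybar_obs, eps = 10, 1.0, 0.05   # observed summary: mean of n N(theta, 1) observations

theta = rng.uniform(-10, 10, 1_000_000)           # simulate from the prior
zbar = rng.normal(theta, 1.0 / np.sqrt(n))        # simulated summary eta(z) = sample mean
accepted = theta[np.abs(zbar - ybar_obs) <= eps]  # likelihood-free rejection step
print(len(accepted), accepted.mean(), accepted.std())
```

For this conjugate model the accepted θ’s approximate the exact posterior N(ȳ, 1/n), with a small inflation of order ε² coming from the tolerance.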
Output

The likelihood-free algorithm samples from the marginal in z of:

$$\pi_\varepsilon(\theta,z|y)=\frac{\pi(\theta)\,f(z|\theta)\,\mathbb{I}_{A_{\varepsilon,y}}(z)}{\int_{A_{\varepsilon,y}\times\Theta}\pi(\theta)\,f(z|\theta)\,\mathrm{d}z\,\mathrm{d}\theta}\,,$$

where $A_{\varepsilon,y}=\{z\in\mathcal{D}\,|\,\rho(\eta(z),\eta(y))<\varepsilon\}$.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

$$\pi_\varepsilon(\theta|y)=\int\pi_\varepsilon(\theta,z|y)\,\mathrm{d}z\approx\pi(\theta|y)\,.$$
Approximate Bayesian computation
ABC basics
Pima Indian benchmark
−0.005 0.010 0.020 0.030
020
4060
8010
0
Dens
ity
−0.05 −0.03 −0.01
020
4060
80
Dens
ity
−1.0 0.0 1.0 2.0
0.00.2
0.40.6
0.81.0
Dens
ityFigure: Comparison between density estimates of the marginals on β1
(left), β2 (center) and β3 (right) from ABC rejection samples (red) andMCMC samples (black)
.
MA example

Consider the MA(q) model

$$x_t=\epsilon_t+\sum_{i=1}^{q}\vartheta_i\,\epsilon_{t-i}$$

Simple prior: uniform prior over the identifiability zone, e.g. a triangle for MA(2)
MA example (2)

ABC algorithm thus made of

1 picking a new value (ϑ1, ϑ2) in the triangle

2 generating an iid sequence $(\epsilon_t)_{-q<t\le T}$

3 producing a simulated series $(x'_t)_{1\le t\le T}$

Distance: basic distance between the series

$$\rho\big((x'_t)_{1\le t\le T},(x_t)_{1\le t\le T}\big)=\sum_{t=1}^{T}(x_t-x'_t)^2$$

or between summary statistics like the first q autocorrelations

$$\tau_j=\sum_{t=j+1}^{T}x_t x_{t-j}$$
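The three steps and the summary-statistics distance can be sketched in Python (the true values ϑ = (0.6, 0.2), the series length T = 500, and the use of the 1% distance quantile as tolerance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
T, q = 500, 2

def ma2(theta1, theta2, n):
    # x_t = eps_t + theta1 eps_{t-1} + theta2 eps_{t-2}
    eps = rng.normal(size=n + q)
    return eps[q:] + theta1 * eps[1:-1] + theta2 * eps[:-2]

def summaries(x):
    # first autocovariances tau_j = sum_t x_t x_{t-j}, j = 0, 1, 2
    return np.array([x @ x, x[1:] @ x[:-1], x[2:] @ x[:-2]])

x_obs = ma2(0.6, 0.2, T)              # pseudo-observed series
s_obs = summaries(x_obs)

# prior: uniform over the MA(2) identifiability triangle
N = 20_000
th1 = rng.uniform(-2, 2, N)
th2 = rng.uniform(-1, 1, N)
ok = (th1 + th2 > -1) & (th1 - th2 < 1)
th1, th2 = th1[ok], th2[ok]

dist = np.array([np.sum((summaries(ma2(a, b, T)) - s_obs) ** 2)
                 for a, b in zip(th1, th2)])
keep = dist <= np.quantile(dist, 0.01)   # tolerance set as the 1% distance quantile
print(th1[keep].mean(), th2[keep].mean())
```

The accepted pairs concentrate around the values used to generate the pseudo-observed series, more or less tightly depending on the tolerance, as the next slides illustrate.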
Comparison of distance impact

[Figure: posterior densities of θ1 and θ2]

Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
Approximate Bayesian computation
ABC basics
ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the densityof x’s within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
ABC advances
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x’s within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
ABC advances
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x’s within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
ABC basics
ABC advances
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x’s within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimationand by developing techniques to allow for larger ε
[Beaumont et al., 2002]
.....or even by including ε in the inferential framework [ABCµ][Ratmann et al., 2009]
Simulation methods in Statistics (on recent advances)
Approximate Bayesian computation
Alphabet soup
ABC-NP
Better usage of [prior] simulations byadjustement: instead of throwing awayθ ′ such that ρ(η(z),η(y)) > ε, replaceθs with locally regressed
θ∗ = θ− {η(z) − η(y)}Tβ[Csillery et al., TEE, 2010]
where β is obtained by [NP] weighted least square regression on(η(z) − η(y)) with weights
Kδ {ρ(η(z),η(y))}
[Beaumont et al., 2002, Genetics]
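A Python sketch of this local linear adjustment on a conjugate toy model where the exact posterior is known (illustrative assumptions: the model x̄ | θ ∼ N(θ, 1/n) with prior θ ∼ N(0, 1), an Epanechnikov kernel for Kδ, and a generous tolerance set at the 5% distance quantile):

```python
import numpy as np

rng = np.random.default_rng(8)
n, xbar_obs = 5, 1.0          # model: xbar | theta ~ N(theta, 1/n), prior theta ~ N(0, 1)

theta = rng.normal(0.0, 1.0, 200_000)            # prior draws
zbar = rng.normal(theta, 1.0 / np.sqrt(n))       # simulated summaries eta(z)
d = np.abs(zbar - xbar_obs)
eps = np.quantile(d, 0.05)                       # generous tolerance: keep 5% closest
keep = d <= eps
t_k, z_k = theta[keep], zbar[keep]

w = 1.0 - (d[keep] / eps) ** 2                   # Epanechnikov kernel weights K_delta
# weighted least squares of theta on (eta(z) - eta(y))
X = np.column_stack([np.ones(t_k.size), z_k - xbar_obs])
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * t_k))
t_adj = t_k - beta[1] * (z_k - xbar_obs)         # local-linear adjustment

true_mean, true_sd = n * xbar_obs / (n + 1), np.sqrt(1.0 / (n + 1))
print(true_mean, t_adj.mean(), true_sd, t_adj.std())
```

In this Gaussian case the regression function is exactly linear, so the adjusted sample matches the exact posterior even with a tolerance far too large for plain rejection ABC.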
ABC-MCMC

Markov chain $(\theta^{(t)})$ created via the transition function

$$\theta^{(t+1)}=\begin{cases}\theta'\sim K_\omega(\theta'|\theta^{(t)}) & \text{if } x\sim f(x|\theta') \text{ is such that } x=y\\[1mm] & \text{and } u\sim\mathcal{U}(0,1)\le\dfrac{\pi(\theta')\,K_\omega(\theta^{(t)}|\theta')}{\pi(\theta^{(t)})\,K_\omega(\theta'|\theta^{(t)})}\,,\\[3mm] \theta^{(t)} & \text{otherwise,}\end{cases}$$

has the posterior π(θ|y) as stationary distribution

[Marjoram et al., 2003]
ABC-MCMC (2)

Algorithm 2 Likelihood-free MCMC sampler

    Use Algorithm 1 to get (θ(0), z(0))
    for t = 1 to N do
      Generate θ′ from Kω(·|θ(t−1))
      Generate z′ from the likelihood f(·|θ′)
      Generate u from U[0,1]
      if u ≤ [π(θ′)Kω(θ(t−1)|θ′) / π(θ(t−1))Kω(θ′|θ(t−1))] I_{Aε,y}(z′) then
        set (θ(t), z(t)) = (θ′, z′)
      else
        (θ(t), z(t)) = (θ(t−1), z(t−1))
      end if
    end for
Why does it work?

The acceptance probability does not involve the calculation of the likelihood, since

$$\frac{\pi_\varepsilon(\theta',z'|y)}{\pi_\varepsilon(\theta^{(t-1)},z^{(t-1)}|y)}\times\frac{K_\omega(\theta^{(t-1)}|\theta')\,f(z^{(t-1)}|\theta^{(t-1)})}{K_\omega(\theta'|\theta^{(t-1)})\,f(z'|\theta')}$$

$$=\frac{\pi(\theta')\,f(z'|\theta')\,\mathbb{I}_{A_{\varepsilon,y}}(z')}{\pi(\theta^{(t-1)})\,f(z^{(t-1)}|\theta^{(t-1)})\,\mathbb{I}_{A_{\varepsilon,y}}(z^{(t-1)})}\times\frac{K_\omega(\theta^{(t-1)}|\theta')\,f(z^{(t-1)}|\theta^{(t-1)})}{K_\omega(\theta'|\theta^{(t-1)})\,f(z'|\theta')}$$

$$=\frac{\pi(\theta')\,K_\omega(\theta^{(t-1)}|\theta')}{\pi(\theta^{(t-1)})\,K_\omega(\theta'|\theta^{(t-1)})}\,\mathbb{I}_{A_{\varepsilon,y}}(z')\,.$$
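A Python sketch of Algorithm 2 with a symmetric random-walk kernel Kω, so the kernel ratio cancels from the acceptance probability (the toy Gaussian model x̄ | θ ∼ N(θ, 1/n) with prior θ ∼ N(0, 1) and the tolerance value are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
n, xbar_obs, eps = 5, 1.0, 0.1     # same toy model: xbar | theta ~ N(theta, 1/n)

log_prior = lambda t: -0.5 * t * t            # theta ~ N(0, 1), up to a constant

n_iter, theta = 100_000, 1.0
chain = np.empty(n_iter)
for t in range(n_iter):
    prop = theta + rng.normal(0.0, 0.5)           # symmetric kernel K_omega
    zbar = rng.normal(prop, 1.0 / np.sqrt(n))     # likelihood-free: simulate the summary
    # accept iff the simulated summary falls in the tolerance region AND
    # the prior ratio passes (the K_omega ratio is 1 by symmetry)
    if (abs(zbar - xbar_obs) <= eps and
            np.log(rng.uniform()) <= log_prior(prop) - log_prior(theta)):
        theta = prop
    chain[t] = theta

print(chain.mean(), chain.std())
```

The stationary distribution is the ABC target π(θ | ρ(z, y) ≤ ε), which for this small ε is close to the exact posterior N(5/6, 1/6).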
ABCµ

[Ratmann et al., 2009]

Use of a joint density

$$f(\theta,\varepsilon|y)\propto\xi(\varepsilon|y,\theta)\times\pi_\theta(\theta)\times\pi_\varepsilon(\varepsilon)$$

where y is the data and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ)

Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation.
Approximate Bayesian computation
Alphabet soup
A PMC version
Use of the same kernel idea as ABC-PRC but with IS correction
[Beaumont et al., 2009; Toni et al., 2009]

Generate a sample at iteration t by

\[
\hat\pi_t(\theta^{(t)}) \propto \sum_{j=1}^N \omega_j^{(t-1)}\, K_t(\theta^{(t)}\mid\theta_j^{(t-1)})
\]

modulo acceptance of the associated x_t, and use an importance weight associated with an accepted simulation θ_i^(t):

\[
\omega_i^{(t)} \propto \pi(\theta_i^{(t)}) \big/ \hat\pi_t(\theta_i^{(t)})\,.
\]

© Still likelihood-free
[Beaumont et al., 2009]
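The weighting scheme can be sketched in a few lines of Python on an assumed toy model (y ∼ N(θ, 1), prior θ ∼ N(0, 10)), with the Beaumont et al. choice of kernel variance, twice the weighted empirical variance; the tolerance schedule and population size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: y ~ N(theta, 1), prior theta ~ N(0, 10)
y, M = 1.5, 500
eps_seq = [2.0, 1.0, 0.5]   # decreasing tolerances (illustrative)
prior_sd = np.sqrt(10.0)

def prior_pdf(th):
    return np.exp(-th**2 / 20.0) / (prior_sd * np.sqrt(2 * np.pi))

# Iteration 0: plain ABC rejection from the prior
theta = np.empty(M)
for i in range(M):
    while True:
        th = rng.normal(0.0, prior_sd)
        if abs(rng.normal(th, 1.0) - y) <= eps_seq[0]:
            theta[i] = th
            break
w = np.full(M, 1.0 / M)

for eps in eps_seq[1:]:
    tau = np.sqrt(2 * np.cov(theta, aweights=w))     # K_t scale (Beaumont et al. choice)
    new_theta = np.empty(M)
    for i in range(M):
        while True:
            j = rng.choice(M, p=w)                   # pick a particle by its weight
            th = rng.normal(theta[j], tau)           # move it with K_t
            if abs(rng.normal(th, 1.0) - y) <= eps:  # likelihood-free acceptance
                new_theta[i] = th
                break
    # importance weight: prior over the mixture proposal pi_t
    kern = np.exp(-(new_theta[:, None] - theta[None, :])**2 / (2 * tau**2))
    w_new = prior_pdf(new_theta) / (kern @ w)
    theta, w = new_theta, w_new / w_new.sum()

print(np.sum(w * theta))  # weighted posterior-mean estimate
```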
Sequential Monte Carlo
SMC is a simulation technique to approximate a sequence of related probability distributions πn, with π0 "easy" and πT the target. Iterated IS as PMC: particles are moved from time n−1 to time n via a kernel Kn, with a sequence of extended targets π̃n

\[
\tilde\pi_n(z_{0:n}) = \pi_n(z_n) \prod_{j=0}^{n-1} L_j(z_{j+1}, z_j)
\]

where the Lj's are backward Markov kernels [check that πn(zn) is a marginal]
[Del Moral, Doucet & Jasra, 2006]
ABC-SMC
True derivation of an SMC-ABC algorithm
Use of a kernel Kn associated with target π_{εn} and derivation of the backward kernel

\[
L_{n-1}(z, z') = \frac{\pi_{\varepsilon_n}(z')\, K_n(z', z)}{\pi_{\varepsilon_n}(z)}
\]

Update of the weights

\[
w_{in} \propto w_{i(n-1)}\,
\frac{\sum_{m=1}^M \mathbb{I}_{A_{\varepsilon_n}}(x_{in}^m)}
     {\sum_{m=1}^M \mathbb{I}_{A_{\varepsilon_{n-1}}}(x_{i(n-1)}^m)}
\]

when x_{in}^m ∼ K(x_{i(n-1)}, ·)
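To show just the weight-update mechanics, here is a rough Python sketch on an assumed Gaussian toy model with M pseudo-datasets per particle. The forward move is a plain random walk here, whereas the algorithm assumes an invariant MCMC kernel, so this only illustrates the surviving-particles ratio, not the full adaptive sampler.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy model: y ~ N(theta, 1), prior theta ~ N(0, 10); M pseudo-data per particle
y, Npart, M = 1.5, 400, 10
eps_prev, eps_new = 1.0, 0.5   # successive tolerances eps_{n-1} > eps_n
tau = 0.3                      # forward-move scale (illustrative, not an invariant kernel)

theta = rng.normal(0.0, np.sqrt(10.0), Npart)
# number of pseudo-datasets surviving the previous tolerance, per particle
surv_prev = np.array([(np.abs(rng.normal(th, 1.0, M) - y) <= eps_prev).sum()
                      for th in theta])
w = surv_prev / surv_prev.sum()   # first weighting from uniform prior draws

# move every particle, then reweight by the ratio of surviving pseudo-data
theta_new = theta + tau * rng.normal(size=Npart)
surv_new = np.array([(np.abs(rng.normal(th, 1.0, M) - y) <= eps_new).sum()
                     for th in theta_new])
w_new = np.where(surv_prev > 0, w * surv_new / np.maximum(surv_prev, 1), 0.0)
w_new /= w_new.sum()

print(np.sum(w_new * theta_new))  # weighted posterior-mean estimate
```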
Properties of ABC-SMC
The ABC-SMC method properly uses a backward kernel L(z, z′) to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. The update of the importance weights reduces to the ratio of the proportions of surviving particles.
Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior].
Adaptivity in the ABC-SMC algorithm is only found in the on-line construction of the thresholds εt, decreased slowly enough to keep a large number of accepted transitions.
[Del Moral, Doucet & Jasra, 2009]
Calibration of ABC
Which summary statistics?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field].
Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test.

Does not take into account the sequential nature of the tests
Depends on the parameterisation
The order of inclusion matters.
Point estimation vs....
In the case of the computation of E[h(θ)|y], Fearnhead and Prangle [12/14/2011] demonstrate that the optimal summary statistic is

\[
\eta^\star(y) = \mathbb{E}[h(\theta)\mid y]
\]

Unavailable, but approximated by a prior ABC run and ABC-NP corrections.
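One hedged reading of the pilot-run idea in Python: simulate (θ, z) pairs from the prior predictive, regress h(θ) = θ on the simulated data to approximate E[θ|z], and use the fitted value as the summary statistic in a plain ABC rejection step. The Gaussian model and all tuning constants are assumptions for illustration, and a simple linear regression stands in for the ABC-NP corrections.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy model: y_i ~ N(theta, 1), i = 1..n, prior theta ~ N(0, 10)
n = 20
y_obs = rng.normal(1.5, 1.0, n)

# Pilot run: (theta, z) pairs from the prior predictive
Npilot = 2000
theta_p = rng.normal(0.0, np.sqrt(10.0), Npilot)
z_p = rng.normal(theta_p[:, None], 1.0, (Npilot, n))

# Regress h(theta) = theta on the raw data: the fitted value approximates
# E[theta | z], i.e. the (estimated) optimal summary eta*(z)
X = np.column_stack([np.ones(Npilot), z_p])
beta, *_ = np.linalg.lstsq(X, theta_p, rcond=None)
eta = lambda z: beta[0] + z @ beta[1:]

# ABC rejection using the estimated summary
eps = 0.2
theta_abc = theta_p[np.abs(eta(z_p) - eta(y_obs)) <= eps]
print(theta_abc.mean())  # approximates E[theta | y_obs]
```

In this Gaussian toy, E[θ|z] is exactly linear in z, so the regression recovers the optimal summary up to Monte Carlo error; in realistic models the regression is only an approximation.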
...vs. model choice
In the case of the computation of a Bayes factor B12(y), the ABC approximation

\[
\widehat{B}_{12}(y) = \sum_{t=1}^T \mathbb{I}_{m(t)=1} \Big/ \sum_{t=1}^T \mathbb{I}_{m(t)=2}
\]

may fail to converge
[Robert et al., 2011]
Separation conditions on the summary statistics for convergence tooccur
[Marin et al., 2011]
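A small Python experiment (an assumption-laden toy, not from the slides) makes the point concrete: with two Gaussian models compared through a summary statistic, the acceptance-ratio estimator converges to the Bayes factor based on η(y), which need not match B12(y) computed from the full data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy comparison: model 1: y_i ~ N(0, 1); model 2: y_i ~ N(0, 2);
# summary eta = sample mean
n = 50
y_obs = rng.normal(0.0, 1.0, n)   # data generated under model 1
s_obs = y_obs.mean()

T, eps = 20000, 0.05
acc = {1: 0, 2: 0}
for _ in range(T):
    m = 1 if rng.uniform() < 0.5 else 2   # m(t) drawn from a uniform model prior
    sd = 1.0 if m == 1 else np.sqrt(2.0)
    z = rng.normal(0.0, sd, n)
    if abs(z.mean() - s_obs) <= eps:      # acceptance based on the summary only
        acc[m] += 1

B12_hat = acc[1] / acc[2]
print(B12_hat)  # estimates the Bayes factor for the summary, not for the full sample
```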