Accelerated approximate Bayesian computation with applications to protein folding data
Accelerating inference for complex stochastic models using Approximate Bayesian Computation, with an application to protein folding

Umberto Picchini, Centre for Mathematical Sciences, Lund University
Joint work with Julie Forman (Dept. of Biostatistics, Copenhagen University)
Dept. of Mathematics, Uppsala, 4 Sept. 2014
Umberto Picchini ([email protected])
Outline
I’ll use protein folding data as a motivating example; most of the talk will be about statistical inference issues.
I will introduce our data and the protein folding problem.
I will introduce a model based on stochastic differential equations.
I will mention some theoretical and computational issues related to (Bayesian) inference.
I will introduce approximate Bayesian computation (ABC) to alleviate methodological and computational problems.
A motivating example
Proteins are synthesized in the cell on ribosomes as linear, unstructured polymers...
...which then self-assemble into specific and functional three-dimensional structures.
This self-assembly process is called protein folding. It is the last and crucial step in the transformation of genetic information, encoded in DNA, into functional protein molecules.
Protein folding is also associated with a wide range of human diseases. In many neurodegenerative diseases, such as Alzheimer’s disease, proteins misfold into toxic protein structures.
Protein folding has been named “the Holy Grail of biochemistry and biophysics” (!).
Modeling the time dynamics is difficult (large number of atoms in a 3D space); atom coordinates are usually projected onto a single dimension called the reaction coordinate, see the figure below.

Figure: Data time-course projected on a single coordinate: 25,000 measurements of the L-reaction coordinate of the small Trp-zipper protein at sampling freq. ∆−1 = 1/nsec.

Here the L-reaction coordinate was used, i.e. the total distance to a folded reference. Notice the random switching between folded/unfolded states.
Forman and Sørensen [2013]¹ proposed to consider sums of diffusions:

Zt (observable process) = Yt (latent state) + Ut (autocorrelated error term)

They considered diffusion processes to model both Yt and Ut. They found that i.i.d. errors were not really giving satisfactory results, so let’s introduce some autocorrelation:

dUt = −κ Ut dt + √(2κγ²) dWt,  U0 = 0

a zero-mean Ornstein-Uhlenbeck process with stationary variance γ² and autocorrelation ρU(t) = e^(−κt). Here dWt ∼ N(0, dt).

¹ Forman and Sørensen, A transformation approach to modelling multi-modal diffusions. J. Statistical Planning and Inference, 2014.
Regarding the “signal” part Yt: the data clearly show a bimodal marginal structure:

[Figure: data time course and marginal histogram, showing two modes.]

So we want a stochastic process that is able to switch between the two “modes” (i.e. the folded/unfolded states). One possible option:

an OU process Xt with zero mean and unit variance

dXt = −θ Xt dt + √(2θ) dBt,  X0 = x0

plug each Xt into the cdf of its stationary N(0,1) distribution ⇒ take Φ(Xt), where Φ is the N(0,1) cdf.
Now build a Gaussian mixture and take the percentile corresponding to an area Φ(Xt).

To summarize: simulate Xt ⇒ compute Φ(Xt) ⇒ find the percentile Yt from a 2-component Gaussian mixture with cdf

F(y) = α·Φ((y − µ1)/σ1) + (1 − α)·Φ((y − µ2)/σ2)
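The steps above can be sketched in code. Below is a minimal Python sketch (not the authors’ implementation; all helper names are ours): an Euler-Maruyama path of the OU process, the mixture cdf F, and a numerical inversion of F, since mixture percentiles are unknown in closed form.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def simulate_ou(theta, n, dt, x0=0.0, rng=None):
    """Euler-Maruyama path of dX_t = -theta*X_t dt + sqrt(2*theta) dB_t."""
    rng = rng or np.random.default_rng()
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        x[i] = x[i - 1] - theta * x[i - 1] * dt + np.sqrt(2.0 * theta * dt) * rng.normal()
    return x

def mixture_cdf(y, alpha, mu1, sigma1, mu2, sigma2):
    """cdf F(y) of the 2-component Gaussian mixture."""
    return alpha * norm.cdf((y - mu1) / sigma1) + (1.0 - alpha) * norm.cdf((y - mu2) / sigma2)

def tau(x, alpha, mu1, sigma1, mu2, sigma2):
    """tau(x) = F^{-1}(Phi(x)); F^{-1} has no closed form, so invert numerically."""
    probs = np.atleast_1d(norm.cdf(x))                     # Phi(X_t) in (0, 1)
    lo = min(mu1, mu2) - 10.0 * max(sigma1, sigma2)        # bracket for the root search
    hi = max(mu1, mu2) + 10.0 * max(sigma1, sigma2)
    return np.array([
        brentq(lambda y, p=p: mixture_cdf(y, alpha, mu1, sigma1, mu2, sigma2) - p, lo, hi)
        for p in probs
    ])
```

For instance, tau(0, ...) returns the median of the mixture, since Φ(0) = 0.5.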
So in conclusion we have:

Zt (data) = Yt (latent state) + Ut (autocorrelated error term)

Zt = Yt + Ut
Yt := τ(Xt)
dXt = −θ Xt dt + √(2θ) dBt
dUt = −κ Ut dt + √(2κγ²) dWt
τ(x) = (F⁻¹ ∘ Φ)(x),  F(y) = α·Φ((y − µ1)/σ1) + (1 − α)·Φ((y − µ2)/σ2)

We are interested in conducting (Bayesian) inference for

η = (θ, κ, γ, α, µ1, µ2, σ1, σ2)
Difficulties with exact Bayesian inference
A non-exhaustive list of the difficulties with exact Bayesian inference via MCMC and SMC in our application:

our dataset is “large” (25,000 observations)... which is not “terribly” large in an absolute sense, but it is when dealing with diffusion processes...
...in fact, even when a proposed parameter value is in the bulk of the posterior distribution, generated trajectories might still be too distant from the data (⇒ high rejection rate!)
a high rejection rate implies poor exploration of the posterior surface, poor inferential results and increasing computational time.
some of these issues can be mitigated using bridging techniques (Beskos et al. ’13): not trivial in our case (the transformation τ(x) is unknown in closed form).
Before going into approximate methods... you should trust (!) that we have put lots of effort into trying to avoid approximations! Still...

Besides theoretical difficulties, currently existing methods do not scale well for large data:

e.g. we attempted to use particle MCMC (Andrieu et al. ’10), and even with a few particles (only 10!) it would require weeks to obtain results...
it is expensive to simulate from our model, as percentiles of mixture models are unknown in closed form.
we had to use some approximate strategy, and we considered ABC (approximate Bayesian computation).
Notation
Data: z = (z0, z1, ..., zn)
unknown parameters: η = (θ, κ,γ,α,µ1,µ2,σ1,σ2)
Likelihood function: p(z|η)
Prior density: π(η) (our a priori knowledge of η)
Posterior density:
π(η|z) ∝ π(η) p(z|η)
Ideally we would like to use/sample from the posterior. We assume this is either theoretically difficult or computationally expensive (it is in our case!).
Approximate Bayesian computation (ABC)
ABC gives a way to approximate a posterior distribution π(η|z)
key to the success of ABC is the ability to bypass the explicit calculation of the likelihood p(z|η)... only forward simulation from the model is required!
ABC is in fact a likelihood-free method that works by simulating pseudo-data zsim from the model:
zsim ∼ p(z|η)
it has had incredible success in genetic studies since the mid 90s (Tavaré et al. ’97, Pritchard et al. ’99), and lots of hype in recent years: see Christian Robert’s excellent blog.
Basic rejection sampler (NO approximations here!)

for r = 1 to R do
  repeat
    Generate parameter η′ from its prior distribution π(η)
    Generate zsim from the likelihood p(z|η′) (!! no need to know p(·) analytically !!)
  until zsim = z (simulated data = actual data)
  set ηr = η′
end for

The algorithm produces R samples from the exact posterior π(η|z).

:( It won’t work for continuous data or large amounts of data because Pr(z = zsim) ≈ 0 ⇒ substitute z = zsim with z ≈ zsim
...⇒ substitute z = zsim with z ≈ zsim

Introduce some distance ‖z − zsim‖ to measure the proximity of zsim to the data z [Pritchard et al. 1999].
Introduce a tolerance value δ > 0.

An ABC rejection sampler:

for r = 1 to R do
  repeat
    Generate parameter η′ from its prior distribution π(η)
    Generate zsim from the likelihood p(z|η′)
  until ‖z − zsim‖ < δ [or alternatively ‖S(z) − S(zsim)‖ < δ]
  set ηr = η′
end for

for some “summary statistics” S(·).
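As an illustration (not our protein model), the ABC rejection sampler above can be sketched on a toy problem: a Normal with unknown mean, a uniform prior, and the sample mean as summary statistic. All names and numbers here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observed" data: Normal with unknown mean (true value 2.0), unit variance
theta_true = 2.0
z = rng.normal(theta_true, 1.0, size=200)

def summary(x):
    return x.mean()                  # illustrative summary statistic S(.)

R, delta = 200, 0.1                  # accepted draws wanted, tolerance
samples = []
while len(samples) < R:
    theta = rng.uniform(-10, 10)                   # 1. draw from the prior
    z_sim = rng.normal(theta, 1.0, size=len(z))    # 2. forward simulation only
    if abs(summary(z_sim) - summary(z)) < delta:   # 3. accept if S(zsim) close to S(z)
        samples.append(theta)

samples = np.array(samples)          # approximate draws from the ABC posterior
```

The accepted draws concentrate around the true mean; shrinking delta trades acceptance rate for accuracy, exactly the tension discussed in the following slides.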
The previous algorithm samples from the approximate posteriors π(η | ‖z − zsim‖ < δ) or π(η | ‖S(z) − S(zsim)‖ < δ).

It is useful to consider statistics S(·) when dealing with large datasets, to increase the probability of acceptance.

The key result of ABC: when S(·) is “sufficient” for η and δ ≈ 0, sampling from the posterior is (almost) exact!

When S(·) is sufficient for the parameter ⇒ π(η | ‖S(zsim) − S(z)‖ < δ) ≡ π(η | ‖zsim − z‖ < δ); when δ = 0 and S is sufficient we accept only parameter draws for which zsim ≡ z ⇒ π(η|z), the exact posterior.

This is all good and nice, but such conditions rarely hold.
A central problem is how to choose the statistics S(·): outside the exponential family we typically cannot derive sufficient statistics. [A key work obtaining statistics “semi-automatically” is Fearnhead-Prangle ’12 (discussion paper in JRSS-B, very much recommended).]

We substitute sufficiency with the loose concept of an informative (enough) statistic, and then choose a small (enough) threshold δ.

We now go back to our model and (large) data. We propose some tricks to accelerate the inference.
We will use an ABC within MCMC approach (ABC-MCMC).
ABC-MCMC
Example: weigh the discrepancy between observed data and simulated trajectories using a uniform 0-1 kernel Kδ(·), e.g.

Kδ(S(zsim), S(z)) = 1 if ‖S(zsim) − S(z)‖ < δ, and 0 otherwise

There is complete freedom to choose a different criterion...
Use such a measure in place of the typical conditional density of the data given the latent states. See next slide.
Zt = Yt + Ut with Yt := τ(Xt)
dXt = −θ Xt dt + √(2θ) dBt
dUt = −κ Ut dt + √(2κγ²) dWt

ABC-MCMC acceptance ratio: given the current value of the parameter η ≡ ηold, generate a Markov chain via Metropolis-Hastings:

Algorithm 1: a generic iteration of ABC-MCMC (fixed threshold δ)
At the r-th iteration:
1. generate η′ ∼ u(η′|ηold), e.g. using a Gaussian random walk
2. generate zsim|η′ from the model (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept η′ if ω < min(1, [π(η′) K(S(zsim), S(z)) u(ηold|η′)] / [π(ηold) K(S(zold), S(z)) u(η′|ηold)]); then set ηr = η′, else ηr = ηold

Samples are from π(η, zsim | ‖S(zsim) − S(z)‖ < δ).
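The iteration above can likewise be sketched on a toy problem (again NOT the protein SDE model: a Normal mean with flat prior, symmetric random-walk proposal, sample mean as S(·); all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: Normal with unknown mean theta; S(.) = sample mean
theta_true = 2.0
z = rng.normal(theta_true, 1.0, size=200)
S_obs = z.mean()

def prior_pdf(theta):                          # flat prior on (-10, 10)
    return 0.05 if -10.0 < theta < 10.0 else 0.0

delta, R = 0.2, 5000
theta_old = S_obs                              # start admissibly: S(z_old) ≈ S(z)
chain = np.empty(R)
for r in range(R):
    theta_new = theta_old + 0.3 * rng.normal()          # 1. symmetric RW: u-ratio = 1
    z_sim = rng.normal(theta_new, 1.0, size=len(z))     # 2. forward simulation only
    S_new = z_sim.mean()
    kernel = 1.0 if abs(S_new - S_obs) < delta else 0.0  # uniform 0-1 kernel K
    # denominator kernel K(S(z_old), S(z)) = 1 along the chain (admissible start)
    ratio = prior_pdf(theta_new) * kernel / prior_pdf(theta_old)
    if rng.uniform() < min(1.0, ratio):                  # 3.-4. accept/reject
        theta_old = theta_new
    chain[r] = theta_old
```

The chain explores the ABC posterior around the true mean; the 0-1 kernel makes the acceptance ratio a simple prior ratio gated by the δ-check.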
Algorithm 2: a generic iteration of ABC-MCMC (random threshold δ)
At the r-th iteration:
1. generate η′ ∼ u(η′|ηold), δ′ ∼ v(δ′|δold)
2. generate zsim|η′ from the model (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept (η′, δ′) if ω < min(1, [π(η′) Kδ′(S(zsim), S(z)) u(ηold|η′) v(δold|δ′)] / [π(ηold) Kδold(S(zold), S(z)) u(η′|ηold) v(δ′|δold)]); then set (ηr, δr) = (η′, δ′), else (ηr, δr) = (ηold, δold)

Samples are from π(η, δ, zsim | ‖S(zsim) − S(z)‖ < δ).
By using a (not too!) small threshold δ we might obtain a decent acceptance rate for the approximated posterior...
however, ABC does not save us from having to produce computationally costly “long” trajectories zsim (n ≈ 25,000) at each step of ABC-MCMC.
However, in ABC the relevant info about our simulations is encoded into S(·)... do we really have to simulate zsim having the same length as our data z??
...simulate “short” zsim that are still qualitatively representative! (see next slide...)
Top row: full dataset of 25,000 observations. Bottom row: every 30th observation is reported.

[Figure: full data (time course and marginal histogram) above the 30x-subsampled data (time course and marginal histogram).]

The dataset is 30 times smaller but the qualitative features are still there!
Strategy for large datasets (Picchini-Forman ’14)

We have a dataset z of about 25,000 observations. Prior to starting ABC-MCMC, construct S(z) to contain:
1. the 15th-30th-...-90th percentiles of the marginal distribution of the full data → to identify the Gaussian mixture parameters µ1, µ2, etc.
2. values of the autocorrelation function of the full data z at lags (60, 300, 600, ..., 2100) → to identify the dynamics-related parameters θ, γ, κ.
During ABC-MCMC we simulate shorter trajectories zsim of size 25000/30 ≈ 800. We take as summary statistics S(zsim) the 15th-30th-...-90th percentiles of the simulated data and autocorrelations at lags (2, 10, 20, ..., 70) (recall zsim is 30x shorter than z). We then compare S(zsim) with S(z) within ABC-MCMC. This is fast! S(·) for the large data can be computed prior to the ABC-MCMC start.
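A sketch of how such summaries might be assembled (illustrative Python, not the authors’ code; the percentile grid 15, 30, 45, 60, 75, 90 is our reading of the slide’s “15th-30th...-90th” shorthand, and the slide elides the intermediate autocorrelation lags):

```python
import numpy as np

def acf(x, lags):
    """Sample autocorrelation of x at the given positive lags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in lags])

def summaries(x, acf_lags):
    """S(.): marginal percentiles plus autocorrelations, as in the strategy above.
    Percentile grid 15, 30, ..., 90 is an assumption about the slide's shorthand."""
    pct = np.percentile(x, np.arange(15, 91, 15))
    return np.concatenate([pct, acf(x, acf_lags)])

# S(z) is computed once on the full data before ABC-MCMC starts, e.g. with the
# lags quoted on the slide (intermediate lags are elided there):
#   S_obs = summaries(z, acf_lags=[60, 300, 600, 2100])
# while during ABC-MCMC the 30x-shorter z_sim would use correspondingly shorter lags:
#   S_sim = summaries(z_sim, acf_lags=[2, 10, 20, 70])
```

Precomputing S(z) once means each MCMC iteration only pays for summaries of the short simulated trajectory.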
So the first strategy to accelerate computations was simulating a smaller set of artificial data.
Our second strategy is to perform so-called “early rejection” of proposed parameters.
ABC-MCMC acceptance ratio:

accept η′ if ω < min(1, [π(η′) K(|S(zsim) − S(z)|) u(ηold|η′)] / [π(ηold) K(|S(zold) − S(z)|) u(η′|ηold)])

However, remember K(·) is a 0/1 kernel → start the algorithm at an admissible ηstart such that at the first iteration S(zold) ≈ S(z) →

accept η′ if ω < min(1, [π(η′)/π(ηold)] · K(|S(zsim) − S(z)|) · [u(ηold|η′)/u(η′|ηold)])

Notice η′ will surely be rejected if

ω > [π(η′)/π(ηold)] · [u(ηold|η′)/u(η′|ηold)]

REGARDLESS of the value of K(·) ∈ {0, 1} → do NOT simulate trajectories if the above is satisfied!
Algorithm 3: Early-Rejection ABC-MCMC (Picchini ’13)
1. At the (r + 1)-th ABC-MCMC iteration:
2. generate η′ ∼ u(η|ηr) from its proposal distribution;
3. generate ω ∼ U(0, 1);
if ω > [π(η′) u(ηr|η′)] / [π(ηr) u(η′|ηr)] (= “ratio”) then
  (ηr+1, S(zsim,r+1)) := (ηr, S(zsim,r))   (proposal early-rejected)
else
  generate xsim ∼ π(x|η′) conditionally on the η′ from step 2; determine ysim = τ(xsim), generate zsim ∼ π(z|ysim, η′) and calculate S(zsim);
  if K(|S(zsim) − S(z)|) = 0 then
    (ηr+1, S(zsim,r+1)) := (ηr, S(zsim,r))   (proposal rejected)
  else if ω ≤ ratio then
    (ηr+1, S(zsim,r+1)) := (η′, S(zsim))   (proposal accepted)
  else
    (ηr+1, S(zsim,r+1)) := (ηr, S(zsim,r))   (proposal rejected)
  end if
end if
4. increment r to r + 1. If r > R stop, else go to step 2.
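The reordering can be sketched on the same kind of toy problem (not the protein model; the N(0, 3²) prior is our illustrative choice, made so that the prior ratio occasionally triggers early rejection): ω is drawn first and checked against the prior/proposal ratio, and the costly forward simulation happens only if that check passes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: Normal with unknown mean; S(.) = sample mean
theta_true = 2.0
z = rng.normal(theta_true, 1.0, size=200)
S_obs = z.mean()

def prior_pdf(theta):
    return np.exp(-theta**2 / 18.0)        # illustrative N(0, 3^2) prior (unnormalized)

delta, R = 0.2, 5000
theta, n_sims = S_obs, 0                   # start at an admissible point
chain = np.empty(R)
for r in range(R):
    theta_new = theta + 0.3 * rng.normal()         # symmetric proposal: u-ratio = 1
    omega = rng.uniform()                          # draw omega BEFORE simulating
    ratio = prior_pdf(theta_new) / prior_pdf(theta)
    if omega > ratio:
        chain[r] = theta                           # early rejection: no simulation paid
        continue
    n_sims += 1                                    # forward simulation only happens here
    z_sim = rng.normal(theta_new, 1.0, size=len(z))
    if abs(z_sim.mean() - S_obs) < delta:          # 0-1 kernel; omega <= ratio holds already
        theta = theta_new
    chain[r] = theta

saved = 1.0 - n_sims / R                           # fraction of simulations avoided
```

The chain is identical in distribution to the plain ABC-MCMC chain; only the order of the checks changes, so every early-rejected iteration skips the simulation cost entirely.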
Notice “early rejection” works only with 0-1 kernels.
There is no reason not to use it: early rejection per se is not an approximation, it’s just a trick.
It saved us between 40-50% of computing time (Picchini ’13).
Results after 2 million ABC-MCMC iterations. Acceptance rate of 1% and 6 hrs of computation with MATLAB on a common desktop PC.

Table: Protein folding data experiment: posterior means from the ABC-MCMC output and 95% posterior intervals.

log θ: −6.454 [−6.898, −5.909]
log κ: −0.651 [−1.424, 0.246]
log γ: 0.071 [−0.313, 0.378]
log µ1: 3.24 [3.22, 3.26]
log µ2: 3.43 [3.39, 3.45]
log σ1: −0.959 [−2.45, 0.38]
log σ2: −0.424 [−2.26, 0.76]
log α: −0.663 [−1.035, −0.383]
Figure: Data (top), process Yt (middle), process Zt (bottom); Zt = Yt + Ut evaluated at η = posterior mean.
A simulation study (Picchini-Forman ’14)
Here we want to compare ABC against (computationally intensive) exact Bayesian inference (via particle MCMC, pMCMC).
In order to do so we consider a very small dataset of 360 simulated observations.
We use a parallel strategy for pMCMC devised in Drovandi ’14 (4 chains run in parallel, using 100 particles for each chain).

C. Drovandi (2014). Pseudo-marginal algorithms with multiple CPUs. Queensland University of Technology, http://eprints.qut.edu.au/61505/
U.P. and J. Forman (2014). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
Comparison ABC-MCMC (*) vs exact Bayes (pMCMC). For each parameter: true value, then pMCMC posterior mean [95% interval], then ABC-MCMC posterior mean [95% interval] (starred).

θ: true 0.0027; pMCMC 0.0023 [0.0013, 0.0041]; ABC 0.0024* [0.0013, 0.0039]
κ: true 0.538; pMCMC 0.444 [0.349, 0.558]; ABC 0.553* [0.386, 0.843]
γ: true 1.063; pMCMC 1.040 [0.943, 1.158]; ABC 0.982* [0.701, 1.209]
µ1: true 25.52; pMCMC 25.68 [25.08, 26.61]; ABC 25.72* [25.10, 26.71]
µ2: true 30.92; pMCMC 32.12 [29.15, 35.42]; ABC 32.17* [29.46, 34.96]
σ1: true 0.540; pMCMC 0.421 [0.203, 0.844]; ABC 0.523* [0.248, 0.972]
σ2: true 0.624; pMCMC 0.502 [0.232, 1.086]; ABC 0.511* [0.249, 1.041]
α: true 0.537; pMCMC 0.510 [0.345, 0.755]; ABC 0.508* [0.346, 0.721]
Figure: Marginal posteriors of (a) log θ, (b) log κ, (c) log γ. Exact Bayesian (solid); ABC-MCMC (dashed); true value (vertical lines); uniform priors.
Figure: Marginal posteriors of (a) log µ1, (b) log µ2, (c) log σ1, (d) log σ2, (e) log α. Exact Bayesian (solid); ABC-MCMC (dashed); true value (vertical lines); uniform priors.
Conclusions
As long as we manage to “compress” information into summary statistics, ABC is a useful inferential tool for complex models and large datasets.
1,000 ABC-MCMC iterations performed in 6 sec, versus about 20 min with exact Bayesian sampling (pMCMC).
...the problem is that ABC requires lots of tuning (choosing S(·), δ, K(·)...)
A MATLAB implementation is available at http://sourceforge.net/projects/abc-sde/ with a 50+ page manual.
References
U.P. (2014). Inference for SDE models via Approximate Bayesian Computation. J. Comp. Graph. Stat.
U.P. and J. Forman (2013). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
U.P. (2013). abc-sde: a MATLAB toolbox for approximate Bayesian computation (ABC) in stochastic differential equation models. http://sourceforge.net/projects/abc-sde/
Appendix
Proof that the basic ABC algorithm works
The proof is straightforward. We know that a draw (η′, zsim) produced by the algorithm is such that (i) η′ ∼ π(η), and (ii) zsim = z, where zsim ∼ π(zsim | η′).

Thus, calling f(η′) the (unknown) density of such an η′, because of (i) and (ii)

f(η′) ∝ Σ_zsim π(η′) π(zsim|η′) I_z(zsim) = Σ_{zsim=z} π(η′, zsim) ∝ π(η′|z).

Therefore η′ ∼ π(η|z).
A theoretical motivation to consider ABC
An important (known) result: a fundamental consequence is that if S(·) is a sufficient statistic for θ then lim_{δ→0} πδ(θ | y) = π(θ | y), the exact (marginal) posterior!!!

uh?!

Otherwise (in general) the algorithm draws from the approximation π(θ | ρ(S(x), S(y)) < δ).

Also, by introducing the class of quadratic losses

L(θ0, θ; A) = (θ0 − θ)ᵀ A (θ0 − θ):

Another relevant result: if S(y) = E(θ | y), then the minimal expected quadratic loss E(L(θ0, θ; A) | y) is achieved via θ = E_ABC(θ | S(y)) as δ → 0.
The straightforward motivation is the following: consider the (ABC) posterior πδ(θ | y); then

πδ(θ | y) = ∫ πδ(θ, x | y) dx ∝ π(θ) ∫ (1/δ) K(|S(x) − S(y)| / δ) π(x | θ) dx → π(θ) π(S(x) = S(y) | θ)  as δ → 0.

Therefore, if S(·) is a sufficient statistic for θ, then

lim_{δ→0} πδ(θ | y) = π(θ | y),

the exact (marginal) posterior!!!
Acceptance probability in Metropolis-Hastings

Suppose at a given iteration of Metropolis-Hastings we are in the (augmented) state (θ#, x#) and wonder whether or not to move to a new state (θ′, x′). The move is generated via a proposal distribution q((θ#, x#) → (θ′, x′)), e.g. q((θ#, x#) → (θ′, x′)) = u(θ′|θ#) v(x′ | θ′); the move (θ#, x#) → (θ′, x′) is accepted with probability

α = min(1, [π(θ′) π(x′|θ′) π(y|x′, θ′) q((θ′, x′) → (θ#, x#))] / [π(θ#) π(x#|θ#) π(y|x#, θ#) q((θ#, x#) → (θ′, x′))])
  = min(1, [π(θ′) π(x′|θ′) π(y|x′, θ′) u(θ#|θ′) v(x# | θ#)] / [π(θ#) π(x#|θ#) π(y|x#, θ#) u(θ′|θ#) v(x′ | θ′)])

Now choose v(x | θ) ≡ π(x | θ): the latent-state densities cancel, leaving

α = min(1, [π(θ′) π(y|x′, θ′) u(θ#|θ′)] / [π(θ#) π(y|x#, θ#) u(θ′|θ#)])

This is likelihood-free! And we only need to know how to generate x′ (not a problem...)
Generation of δ’s
[Figure: trace of the δ chain over the MCMC iterations.]

Here we generate a chain for log δ using a (truncated) Gaussian random walk with support (−∞, log δmax]. We let log δmax decrease during the simulation.
HOWTO: post-hoc selection of δ (the “precision” parameter) [Bortot et al. 2007]

During ABC-MCMC we let δ vary (according to a random walk): at the r-th iteration δr = δr−1 + ∆, with ∆ ∼ N(0, ν²). After the end of the MCMC we have a sequence {θr, δr}r=0,1,2,... and for each parameter {θj,r}r=0,1,2,... we produce a plot of the parameter chain vs δ:

[Figure: a parameter chain plotted against the bandwidth δ.]
Post-hoc selection of the bandwidth δ, cont’d...
Therefore in practice:
we filter out of the analyses those draws {θr}r=0,1,2,... corresponding to “large” δ, for statistical precision; we retain only those {θr}r=0,1,2,... corresponding to a low δ. In the example we retain {θr ; δr < 1.5}.
PRO: this is useful as it allows an ex-post selection of δ, i.e. we do not need to know in advance a suitable value for δ.
CON: by filtering out some of the draws, a disadvantage of the approach is the need to run very long MCMC simulations in order to have enough “material” on which to base our posterior inference.
PRO: also notice that by letting δ vary we are almost considering a global optimization method (similar to simulated tempering).
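In code, the post-hoc filtering step is just a mask over the stored chains (a minimal sketch; the array names are hypothetical):

```python
import numpy as np

def filter_by_bandwidth(theta_chain, delta_chain, delta_cut):
    """Retain only the parameter draws whose associated bandwidth is below delta_cut."""
    theta_chain = np.asarray(theta_chain)
    delta_chain = np.asarray(delta_chain)
    keep = delta_chain < delta_cut          # boolean mask over the MCMC output
    return theta_chain[keep]

# e.g., with the cutoff used in the example above:
#   retained = filter_by_bandwidth(theta_draws, delta_draws, 1.5)
```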