Intro to Approximate Bayesian Computation (ABC)


Transcript of Intro to Approximate Bayesian Computation (ABC)

Page 1: Intro to Approximate Bayesian Computation (ABC)

An intro to ABC – approximate Bayesian computation

PhD course FMS020F–NAMS002 “Statistical inference for partially observed stochastic processes”, Lund University

http://goo.gl/sX8vU9

Umberto Picchini
Centre for Mathematical Sciences, Lund University
www.maths.lth.se/matstat/staff/umberto/


Page 2: Intro to Approximate Bayesian Computation (ABC)

In this lecture we consider the case where it is not possible to pursue exact inference for model parameters θ, nor is it possible to approximate the likelihood function of θ within a given computational budget and available time.

The above is not a rare circumstance.

Since the advent of affordable computers and the introduction of advanced statistical methods, researchers have become increasingly ambitious and try to formulate and fit very complex models.

Example: MCMC (Markov chain Monte Carlo) has provided a universal machinery for Bayesian inference since its rediscovery in the statistical community in the early 1990s.

Thanks to MCMC (and related methods), scientists’ ambitions have been pushed further and further.


Page 3: Intro to Approximate Bayesian Computation (ABC)

However, for complex models (and/or large datasets) MCMC is often impractical: calculating the likelihood, or an approximation thereof, might be impossible.

For example, in spatial statistics INLA (integrated nested Laplace approximation) is a welcome alternative to the more expensive MCMC.

Also, MCMC is not online: when new observations arrive we have to re-compute the whole likelihood for the total set of observations, i.e. we can’t make use of the likelihood computed at previous observations.


Page 4: Intro to Approximate Bayesian Computation (ABC)

Particle marginal methods (particle MCMC) are a fantastic possibility for exact Bayesian inference for state-space models. But what can we do for non-state-space models?

And what can we do when the dimension of the mathematical system is large and the implementation of particle filters with millions of particles is infeasible?


Page 5: Intro to Approximate Bayesian Computation (ABC)

There is an increasing interest in statistical methods for models that are easy to simulate from, but for which it is impossible to calculate transition densities or likelihoods.

General set-up: we have a complex stochastic process {Xt} with unknown parameters θ. For any θ we can simulate from this process.

We have observations y = f({X0:T}).

We want to estimate θ but we cannot calculate p(y|θ), as this involves integrating over the realisations of {X0:T}.

Notice we are not specifying the probabilistic properties of X0:T nor of Y. We are certainly not restricting ourselves to state-space models.


Page 6: Intro to Approximate Bayesian Computation (ABC)

The likelihood-free idea

Likelihood-free inference motivating idea:

Easy to simulate from model conditional on parameters.

So run simulations for many parameters.

See for which parameter values the simulated data sets match the observed data best.


Page 7: Intro to Approximate Bayesian Computation (ABC)

Different likelihood-free methods

Likelihood-free methods date back to at least Diggle and Gratton (1984) and Rubin (1984, p. 1160).

More recent examples:

Indirect Inference (Gourieroux, Monfort and Renault 1993);

Approximate Bayesian Computation (ABC) (a review is Marin et al. 2011);

the bootstrap filter of Gordon, Salmond and Smith (1993);

the synthetic likelihood method of Wood (2010).


Page 8: Intro to Approximate Bayesian Computation (ABC)

Are approximations worth anything?

Why should we care about approximate methods?

Well, we know the most obvious answer: it’s because this is what we do when exact methods are impractical. No big news...

But I am more interested in the following phenomenon, which I have noticed by direct experience:

Many scientists seem to get intellectual fulfilment from using exact methods, leading to exact inference.

What we might not see is when they fail to communicate that they (consciously or unconsciously) pushed themselves to formulate simpler models, so that exact inference could be achieved.


Page 9: Intro to Approximate Bayesian Computation (ABC)

So the pattern I often notice is:

1 You have a complex scenario, noisy data, unobserved variables, etc.

2 You formulate a pretty realistic model... which you can’t fit to data (i.e. exact inference is not possible).

3 You simplify the model (a lot) so it is now tractable with exact methods.

4 You are happy.

However, you might have simplified the model a bit too much for it to be realistic/useful/sound.


Page 10: Intro to Approximate Bayesian Computation (ABC)

John Tukey – 1962

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

If a complex model is the one I want to use to answer the right question, then I prefer to obtain an approximate answer using approximate inference, rather than to fool myself with a simpler model using exact inference.


Page 11: Intro to Approximate Bayesian Computation (ABC)

Gelman and Rubin, 1996

“[...] as emphasized in Rubin (1984), one of the great scientific advantages of simulation analysis of Bayesian methods is the freedom it gives the researcher to formulate appropriate models rather than be overly interested in analytically neat but scientifically inappropriate models.”

Approximate Bayesian Computation and synthetic likelihoods are two approximate methods for inference, with ABC vastly more popular and with older origins.

We will discuss ABC only.


Page 12: Intro to Approximate Bayesian Computation (ABC)

Features of ABC

we only need a generative model, i.e. the model we assume to have generated the available data y;

we only need to be able to simulate from such a model;

in other words, we do not need to assume anything regarding the probabilistic features of the model components;

particle marginal methods also assume the ability to simulate from the model, but additionally assume a specific model structure, usually a state-space model (SSM);

also, particle marginal methods for SSMs require at least knowledge of p(yt|xt; θ) (to compute importance weights). What do we do without such a requirement?


Page 13: Intro to Approximate Bayesian Computation (ABC)

For the moment we can denote data with y instead of, say, y1:T, as what we are going to introduce is not specific to dynamical models.


Page 14: Intro to Approximate Bayesian Computation (ABC)

Bayesian setting: target is π(θ|y) ∝ p(y|θ)π(θ)

What to do when (1) the likelihood p(y|θ) is unknown in closed form and/or (2) it is expensive to approximate?

Notice that if we are able to simulate observations y∗ by running the generative model, then we have

y∗ ∼ p(y|θ)

That is, y∗ is produced by the statistical model that generated the observed data y.

(i) Therefore if Y is the space where y takes values, then y∗ ∈ Y.

(ii) y and y∗ have the same dimension.


Page 15: Intro to Approximate Bayesian Computation (ABC)

Loosely speaking...

Example: if we have an SSM and, given a parameter value θ and xt−1, simulate xt, then plug xt into the observation equation and simulate y∗t, then I have that y∗t ∼ p(yt|θ).

This is because if I have two random variables x and y with joint distribution (conditional on θ) p(y, x|θ), then p(y, x|θ) = p(y|x; θ)p(x|θ).

I first simulate x∗ from p(x|θ), then conditionally on x∗ I simulate y∗ from p(y|x∗; θ).

What I obtain is a draw (x∗, y∗) from p(y, x|θ), hence y∗ alone must be a draw from the marginal p(y|θ).


Page 16: Intro to Approximate Bayesian Computation (ABC)

Likelihood free rejection sampling

1 simulate from the prior θ∗ ∼ π(θ)

2 plug θ∗ into your model and simulate a y∗ [this is the same as writing y∗ ∼ p(y|θ∗)]

3 if y∗ = y store θ∗. Go to step 1 and repeat.

The above is a likelihood-free algorithm: it does not require knowledge of the expression of p(y|θ).

Each accepted θ∗ is such that θ∗ ∼ π(θ|y) exactly.

We justify the result in the next slide.
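A minimal sketch of this exact rejection sampler, assuming discrete, low-dimensional data; `sample_prior` and `simulate_model` are hypothetical user-supplied functions, not part of the slides.

```python
import numpy as np

def exact_rejection(y_obs, sample_prior, simulate_model, n_accept,
                    rng=np.random.default_rng(0)):
    """Likelihood-free rejection: keep theta* only when the simulated data equal y exactly."""
    accepted = []
    while len(accepted) < n_accept:
        theta_star = sample_prior(rng)             # step 1: theta* ~ pi(theta)
        y_star = simulate_model(theta_star, rng)   # step 2: y* ~ p(y | theta*)
        if np.array_equal(y_star, y_obs):          # step 3: accept iff y* = y
            accepted.append(theta_star)
    return np.array(accepted)
```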


Page 17: Intro to Approximate Bayesian Computation (ABC)

Justification

The previous algorithm is exact. Let’s see why.

Denote with f(θ∗, y∗) the joint distribution of the accepted (θ∗, y∗). We have that

f(θ∗, y∗) = p(y∗|θ∗)π(θ∗)Iy(y∗)

with Iy(y∗) = 1 iff y∗ = y and zero otherwise. Marginalizing y∗ we have

f(θ∗) = ∫_Y p(y∗|θ∗)π(θ∗)Iy(y∗) dy∗ = p(y|θ∗)π(θ∗) ∝ π(θ∗|y)

hence all accepted θ∗ are drawn from the exact posterior.


Page 18: Intro to Approximate Bayesian Computation (ABC)

Curse of dimensionality

Algorithmically, the rejection algorithm could be coded as a while loop that repeats until the equality condition is satisfied.

For y taking discrete values in a “small” set of states this is manageable.

For y a long sequence of observations from a discrete random variable with many states this is very challenging.

For y a continuous variable the equality happens with probability zero.


Page 19: Intro to Approximate Bayesian Computation (ABC)

ABC rejection sampling (Tavaré et al.¹)

Attack the curse of dimensionality by introducing an approximation. Take an arbitrary distance ‖ · ‖ and a threshold ε > 0.

1 simulate from the prior θ∗ ∼ π(θ)

2 simulate a y∗ ∼ p(y|θ∗)

3 if ‖y∗ − y‖ < ε store θ∗. Go to step 1 and repeat.

Each accepted θ∗ is such that θ∗ ∼ πε(θ|y).

πε(θ|y) ∝ ∫_Y p(y∗|θ∗)π(θ∗) I_Aε,y(y∗) dy∗,   with Aε,y = {y∗ ∈ Y : ‖y∗ − y‖ < ε}.

¹ Tavaré et al. 1997, Genetics 145(2).


Page 20: Intro to Approximate Bayesian Computation (ABC)

[Slide: excerpt from Sunnåker et al. 2013, PLOS Computational Biology 9(1):e1002803, p. 3, including Figure 1, “Parameter estimation by Approximate Bayesian Computation: a conceptual overview” (doi:10.1371/journal.pcbi.1002803.g001), and the article’s discussion of the hidden Markov model example and of model comparison with ABC.]


Page 21: Intro to Approximate Bayesian Computation (ABC)

It is self-evident that imposing ε = 0 forces y∗ = y, thus implying that draws will be, again, from the true posterior.

However, in practice imposing ε = 0 might require an unbearable computational time to obtain a single acceptance. So we have to set ε > 0, and draws are from the approximate posterior πε(θ|y).

Important ABC result

Convergence “in distribution”:

when ε → 0, πε(θ|y) → π(θ|y)

when ε → ∞, πε(θ|y) → π(θ)

Essentially, for too large an ε we learn nothing.


Page 22: Intro to Approximate Bayesian Computation (ABC)

Toy model

Let’s try something really trivial. We show how ABC rejection can easily become inefficient.

n = 5 i.i.d. observations yi ∼ Weibull(2, 5)

we want to estimate the parameters of the Weibull, so θ = (a, b) = (2, 5) are the true values

take ‖y − y∗‖ = Σ_{i=1}^n (yi − y∗i)² (you can try a different distance, this is not really crucial)

let’s use different values of ε

run 50,000 iterations of the algorithm.


Page 23: Intro to Approximate Bayesian Computation (ABC)

We assume wide priors for the “shape” parameter, a ∼ U(0.01, 6), and for the “scale”, b ∼ U(0.01, 10).
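A minimal sketch of this experiment in Python (the slides’ plots come from R; this is not the original code); the priors, the distance and the iteration count are as above, everything else is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
y_obs = 5.0 * rng.weibull(2.0, size=n)            # "observed" data, true (a, b) = (2, 5)

def abc_rejection_weibull(y_obs, eps, n_iter=50_000):
    accepted = []
    for _ in range(n_iter):
        a = rng.uniform(0.01, 6.0)                # prior on the shape a
        b = rng.uniform(0.01, 10.0)               # prior on the scale b
        y_star = b * rng.weibull(a, size=len(y_obs))
        if np.sum((y_obs - y_star) ** 2) < eps:   # squared-Euclidean distance, as in the slide
            accepted.append((a, b))
    return np.array(accepted)

draws = abc_rejection_weibull(y_obs, eps=20.0)    # then try eps = 7 and eps = 3
print(draws.shape[0], "acceptances out of 50000")
```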

Try ε = 20

[Figure: kernel density estimates of the ABC posteriors for the shape and scale parameters at ε = 20, based on N = 45654 accepted draws.]

We are evidently sampling from the prior. We must reduce ε. In fact, notice that about 46,000 out of 50,000 draws were accepted.


Page 24: Intro to Approximate Bayesian Computation (ABC)

Try ε = 7

[Figure: kernel density estimates of the ABC posteriors for shape and scale at ε = 7, based on N = 19146 accepted draws.]

Here about 19,000 draws were accepted (38%).


Page 25: Intro to Approximate Bayesian Computation (ABC)

Try ε = 3

[Figure: kernel density estimates of the ABC posteriors for shape and scale at ε = 3, based on N = 586 accepted draws.]

Here about 1% of the produced simulations have been accepted. Recall the true values are (a, b) = (2, 5).

Of course n = 5 is a very small sample size, so inference is of limited quality, but you get the idea of the method.


Page 26: Intro to Approximate Bayesian Computation (ABC)

An idea for self-study

Compare the ABC (marginal) posteriors with exact posteriors from some experiment using conjugate priors.

For example see http://www.johndcook.com/CompendiumOfConjugatePriors.pdf


Page 27: Intro to Approximate Bayesian Computation (ABC)

Curse of dimensionality

It becomes immediately evident that results will soon degrade for a larger sample size n:

even for a moderately long dataset y, how likely is it that we produce a y∗ such that Σ_{i=1}^n (yi − y∗i)² < ε for a small ε? Very unlikely;

inevitably, we’ll be forced to enlarge ε, thus degrading the quality of the inference.


Page 28: Intro to Approximate Bayesian Computation (ABC)

Here we take n = 200. To compare with our “best” previous result, we use ε = 31 (to obtain again a 1% acceptance rate on 50,000 iterations).

[Figure: kernel density estimates of the ABC posteriors for shape and scale with n = 200 and ε = 31, based on N = 474 accepted draws.]

Notice the shape is completely off (the true value is 2).

The approach is just not going to be of any practical use with continuous data.


Page 29: Intro to Approximate Bayesian Computation (ABC)

ABC rejection with summaries (Pritchard et al.²)

Same as before, but comparing S(y) with S(y∗) for “appropriate” summary statistics S(·).

1 simulate from the prior θ∗ ∼ π(θ)

2 simulate a y∗ ∼ p(y|θ∗), compute S(y∗)

3 if ‖S(y∗) − S(y)‖ < ε store θ∗. Go to step 1 and repeat.

Samples are from πε(θ|S(y)), with

πε(θ|S(y)) ∝ ∫_Y p(y∗|θ∗)π(θ∗) I_Aε,y(y∗) dy∗,   Aε,y = {y∗ ∈ Y : ‖S(y∗) − S(y)‖ < ε}.

² Pritchard et al. 1999, Molecular Biology and Evolution, 16:1791–1798.

Page 30: Intro to Approximate Bayesian Computation (ABC)

Using summary statistics clearly introduces a further level of approximation, except when S(·) is sufficient for θ (carries the same information about θ as the whole of y).

When S(·) is a set of sufficient statistics for θ,

πε(θ|S(y)) = πε(θ|y)

But then again, when the model is not in the exponential family we basically have no hope of constructing sufficient statistics.

A central topic in ABC is the construction of “informative” statistics, as a replacement for the (unattainable) sufficient ones. Important paper: Fearnhead and Prangle 2012 (discussed later).

If we have “good summaries” we can bypass the curse-of-dimensionality problem.


Page 31: Intro to Approximate Bayesian Computation (ABC)

Weibull example, reprise

Take n = 200. Set S(y) = (sample mean of y, sample SD of y) and similarly for y∗. Use ε = 0.35.
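Only the comparison changes with respect to the earlier sketch: distances are now computed between summaries rather than between raw datasets. A minimal sketch under the same assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(2)

def summaries(y):
    """S(y) = (sample mean, sample standard deviation)."""
    return np.array([y.mean(), y.std(ddof=1)])

def abc_rejection_weibull_summaries(y_obs, eps=0.35, n_iter=50_000):
    s_obs = summaries(y_obs)
    accepted = []
    for _ in range(n_iter):
        a, b = rng.uniform(0.01, 6.0), rng.uniform(0.01, 10.0)  # same priors as before
        y_star = b * rng.weibull(a, size=len(y_obs))
        if np.linalg.norm(summaries(y_star) - s_obs) < eps:     # compare S(y*) with S(y)
            accepted.append((a, b))
    return np.array(accepted)

y_obs = 5.0 * rng.weibull(2.0, size=200)                        # n = 200 observations
print(abc_rejection_weibull_summaries(y_obs).shape[0], "acceptances")
```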

[Figure: kernel density estimates of the ABC posteriors for shape and scale with n = 200 and ε = 0.35 on the summaries, based on N = 453 accepted draws.]

This time we have captured both shape and scale (with 1% acceptance).

Also, enlarging n would not cause problems → robust comparisons thanks to S(·).

Page 32: Intro to Approximate Bayesian Computation (ABC)

From now on we silently assume working with S(y∗) and S(y); if we wish not to summarize anything we can always set S(y) := y.

A main issue in ABC research is that when we use an arbitrary S(·) we can’t quantify “how far off” we are from the ideal sufficient statistic.

Important work on constructing “informative” statistics:

Fearnhead and Prangle 2012, JRSS-B 74(3).

review by Blum et al. 2013, Statistical Science 28(2).

Michael Blum will give a free workshop in Lund on 10 March. Sign up here!


Page 33: Intro to Approximate Bayesian Computation (ABC)

Beyond ABC rejection

ABC rejection is the simplest example of ABC algorithm.

It generates independent draws and can be coded into an embarrassingly parallel algorithm. However, it can be massively inefficient.

Parameters are proposed from the prior π(θ). The prior does not exploit the information from already accepted parameters.

Unless π(θ) is somehow similar to πε(θ|y), many proposals will be rejected for a moderately small ε.

This is especially true for a high-dimensional θ.

A natural approach is to consider ABC within an MCMC algorithm.

In an MCMC with random-walk proposals, the proposed parameter explores a neighbourhood of the last accepted parameter.


Page 34: Intro to Approximate Bayesian Computation (ABC)

ABC-MCMC

Consider the approximate augmented posterior:

πε(θ, y∗|y) ∝ Jε(y∗, y) p(y∗|θ)π(θ),   where p(y∗|θ)π(θ) ∝ π(θ|y∗).

Jε(y∗, y) is a function which is a positive constant when y = y∗ (or S(y) = S(y∗)) and takes large positive values when y∗ ≈ y (or S(y) ≈ S(y∗)).

π(θ|y∗) is the (intractable) posterior corresponding to the artificial observations y∗.

When ε = 0 we have Jε(y∗, y) constant and πε(θ, y∗|y) = π(θ|y).

Without loss of generality, let’s assume that Jε(y∗, y) ∝ Iy(y∗), the indicator function.


Page 35: Intro to Approximate Bayesian Computation (ABC)

ABC-MCMC (Marjoram et al.³)

We wish to simulate from the posterior πε(θ, y∗|y): hence we construct proposals for both θ and y∗.

The present state is θ# (with corresponding y#). Propose θ∗ ∼ q(θ∗|θ#).

Simulate y∗ from the model given θ∗, hence the proposal is the model itself, y∗ ∼ p(y∗|θ∗).

The acceptance probability is thus:

α = min{ 1, [Iy(y∗) p(y∗|θ∗) π(θ∗)] / [1 · p(y#|θ#) π(θ#)] × [q(θ#|θ∗) p(y#|θ#)] / [q(θ∗|θ#) p(y∗|θ∗)] }

The “1” in the denominator is there because we must start the algorithm at some admissible (accepted) y#, hence the denominator always has Iy(y#) = 1. Notice that the intractable likelihoods p(y∗|θ∗) and p(y#|θ#) cancel out.

³ Marjoram et al. 2003, PNAS 100(26).

Page 36: Intro to Approximate Bayesian Computation (ABC)

By considering the simplification in the previous acceptance probability we obtain the ABC-MCMC algorithm:

1 The last accepted parameter is θ# (with corresponding y#). Propose θ∗ ∼ q(θ∗|θ#).

2 Generate y∗ conditionally on θ∗ and compute Iy(y∗).

3 If Iy(y∗) = 1 go to step 4, else stay at θ# and return to step 1.

4 Calculate

α = min{ 1, [π(θ∗) / π(θ#)] × [q(θ#|θ∗) / q(θ∗|θ#)] }

Generate u ∼ U(0, 1). If u < α set θ# := θ∗, otherwise stay at θ#. Return to step 1.

During the algorithm there is no need to retain the generated y∗, hence the set of accepted θ forms a Markov chain with stationary distribution πε(θ|y).
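A minimal sketch of this ABC-MCMC in Python, using a Gaussian random-walk proposal (so the q-ratio in step 4 cancels) and the indicator kernel on summaries; `simulate_model`, `summaries` and `log_prior` are hypothetical user-supplied functions.

```python
import numpy as np

def abc_mcmc(theta0, y_obs, log_prior, simulate_model, summaries, eps,
             n_iter=50_000, step=0.1, rng=np.random.default_rng(0)):
    """ABC-MCMC with kernel I(|S(y*) - S(y)| < eps) and a symmetric random-walk proposal."""
    s_obs = np.atleast_1d(summaries(y_obs))
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        theta_prop = theta + step * rng.standard_normal(theta.size)  # q(theta*|theta#)
        y_star = simulate_model(theta_prop, rng)                     # y* ~ p(y|theta*)
        if np.linalg.norm(np.atleast_1d(summaries(y_star)) - s_obs) < eps:
            # symmetric proposal: the acceptance ratio reduces to the prior ratio
            if np.log(rng.uniform()) < log_prior(theta_prop) - log_prior(theta):
                theta = theta_prop
        chain[i] = theta                                             # repeat the current value on rejection
    return chain
```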


Page 37: Intro to Approximate Bayesian Computation (ABC)

The previous ABC-MCMC algorithm is also known as “likelihood-free MCMC”.

Notice that likelihoods do not appear in the algorithm.

Likelihoods are substituted by the sampling of artificial observations from the data-generating model.

The Handbook of MCMC (CRC Press) has a very good chapter on likelihood-free Markov chain Monte Carlo.


Page 38: Intro to Approximate Bayesian Computation (ABC)

Blackboard: proof that the algorithm targets the correct distribution


Page 39: Intro to Approximate Bayesian Computation (ABC)

A (trivial) generalization of ABC-MCMC

Marjoram et al. used Jε(y∗, y) ≡ Iy(y∗). This implies that we consider as equally OK those y∗ such that |y∗ − y| < ε (or such that |S(y∗) − S(y)| < ε).

However, we might also reward y∗ in different ways depending on their distance to y.

Examples:

Gaussian kernel: Jε(y∗, y) ∝ exp(−Σ_{i=1}^n (yi − y∗i)² / (2ε²)), or...

for a vector-valued S(·): Jε(y∗, y) ∝ exp(−(S(y) − S(y∗))′ W⁻¹ (S(y) − S(y∗)) / (2ε²))

And of course the ε’s in the two formulations above are different.
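The two kernels translate directly into code; a small sketch, where the weighting matrix W (discussed later in the lecture) defaults to the identity:

```python
import numpy as np

def J_gauss_raw(y_star, y_obs, eps):
    """Gaussian kernel on raw data: exp(-sum_i (y_i - y*_i)^2 / (2 eps^2))."""
    return np.exp(-np.sum((y_obs - y_star) ** 2) / (2.0 * eps ** 2))

def J_gauss_summaries(s_star, s_obs, eps, W=None):
    """Gaussian kernel on a vector of summaries, with weighting matrix W."""
    d = np.atleast_1d(s_obs) - np.atleast_1d(s_star)
    W = np.eye(d.size) if W is None else W
    return np.exp(-d @ np.linalg.solve(W, d) / (2.0 * eps ** 2))
```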


Page 40: Intro to Approximate Bayesian Computation (ABC)

Then the acceptance probability trivially generalizes to⁴

α = min{ 1, [Jε(y∗, y) π(θ∗)] / [Jε(y#, y) π(θ#)] × [q(θ#|θ∗)] / [q(θ∗|θ#)] }.

This is still a likelihood-free approach.

⁴ Sisson and Fan (2010), chapter in Handbook of Markov Chain Monte Carlo.

Page 41: Intro to Approximate Bayesian Computation (ABC)

Choice of the threshold ε

We would like to use a “small” ε > 0; however, it turns out that if you start at a bad value of θ, a small ε will cause many rejections.

Start with a fairly large ε, allowing the chain to move in the parameter space.

After some iterations reduce ε, so the chain will explore a (narrower) and more precise approximation to π(θ|y).

Keep reducing ε (slowly). Use the set of θ’s accepted at the smallest ε to report inference results.

It’s not obvious how to determine the sequence ε1 > ε2 > ... > εk > 0. If the sequence decreases too fast there will be many rejections (the chain suddenly gets trapped in some tail).

It’s a problem similar to tuning the “temperature” in optimization via simulated annealing.


Page 42: Intro to Approximate Bayesian Computation (ABC)

Choice of the threshold ε

A possibility:

Say that you have completed a number of iterations via ABC-MCMC or via rejection sampling using ε1, and say that you stored the distances dε1 = ‖S(y) − S(y∗)‖ obtained using ε1.

Take the x-th percentile of such distances and set a new threshold ε2 := x-th percentile of dε1.

This way ε2 < ε1. So now you can use ε2 to conduct more simulations, then similarly obtain ε3 := x-th percentile of dε2, etc.

Depending on x, the decrease from one ε to the next ε′ will be more or less fast. Setting, say, x = 20 will cause a sharp decrease, while x = 90 will let the threshold decrease more slowly.

A slow decrease of ε is safer but implies longer simulations before reaching acceptable results.

Alternatively, just set the sequence of ε’s by trial and error.
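A minimal sketch of this percentile rule, assuming you kept the vector of accepted distances from the previous stage:

```python
import numpy as np

def next_epsilon(distances, x=50):
    """New threshold = x-th percentile of the distances stored at the previous epsilon.
    A small x (e.g. 20) shrinks epsilon quickly; a large x (e.g. 90) shrinks it slowly."""
    return float(np.percentile(np.asarray(distances), x))
```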

Page 43: Intro to Approximate Bayesian Computation (ABC)

When do we stop decreasing ε?

Several studies have shown that, when using ABC-MCMC, obtaining a chain with about a 1% acceptance rate (at the smallest ε) is a good compromise between accuracy and computational needs. This is also my experience.

However, recall that a “small” ε implies many rejections → you’ll have to run a longer simulation to obtain enough acceptances to enable inference.

ABC, unlike exact MCMC, does require a small acceptance rate. This is in its nature, as we are not happy to use a large ε.

A high acceptance rate indicates that your ε is way too large and you are probably sampling from the prior π(θ) (!)


Page 44: Intro to Approximate Bayesian Computation (ABC)

Example from Sunnåker et al. 2013

[Large chunks from the cited article constitute the ABC entry in Wikipedia.]

[Slide: excerpt from Sunnåker et al. 2013, PLOS Computational Biology 9(1):e1002803, p. 4, including Figure 2, “A dynamic bistable hidden Markov model” (doi:10.1371/journal.pcbi.1002803.g002), and Table 1, “Example of ABC rejection algorithm”, reproduced below. The observed summary statistic is ωE = 6.]

i   θi     Simulated dataset (step 2)    Summary statistic ωS,i (step 3)   Distance ρ(ωS,i, ωE) (step 4)   Outcome (step 4)
1   0.08   AABAAAABAABAAABAAAAA          8                                 2                               accepted
2   0.68   AABBABABAAABBABABBAB          13                                7                               rejected
3   0.87   BBBABBABBBBABABBBBBA          9                                 3                               rejected
4   0.43   AABAAAAABBABBBBBBBBA          6                                 0                               accepted
5   0.53   ABBBBBAABBABBABAABBB          9                                 3                               rejected

We have a hidden system state, moving between states {A, B} with probability θ and staying in the current state with probability 1 − θ.

Actual observations are affected by measurement error: the probability of misreading the system state is 1 − γ, for both A and B.


Page 45: Intro to Approximate Bayesian Computation (ABC)

Example of application: the behavior of the Sonic Hedgehog (Shh) transcription factor in Drosophila melanogaster can be modeled in this way.

Not surprisingly, the example is a hidden Markov model:

p(xt|xt−1) = θ when xt ≠ xt−1, and 1 − θ otherwise.

p(yt|xt) = γ when yt = xt, and 1 − γ otherwise.

In other words, a typical simulation pattern looks like:

A,B,B,B,A,B,A,A,A,B (states x1:T)

A,A,B,B,B,A,A,A,A,A (observations y1:T)

Misrecorded states (positions 2, 5, 6 and 10, where yt ≠ xt) were flagged in red on the original slide.


Page 46: Intro to Approximate Bayesian Computation (ABC)

The example could certainly be solved via exact methods but, just for the sake of illustration, assume we are only able to simulate random sequences from our model.

Here is how we simulate a sequence of length T:

1 given θ and the current state, draw a Bin(1, θ) variable: if it equals 1, switch state to obtain x∗t, otherwise keep x∗t equal to the previous state;

2 conditionally on x∗t, yt is Bernoulli: generate u ∼ U(0, 1); if u < γ set y∗t := x∗t, otherwise take the other value;

3 set t := t + 1, go to 1 and repeat until we have collected y1, ..., yT.

So we are totally set to generate sequences of A’s and B’s given parameter values.
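A minimal sketch of this simulator, encoding the two states A/B as 0/1 and starting from a random initial state (arbitrary choices not specified on the slide):

```python
import numpy as np

def simulate_hmm(theta, gamma, T, rng=np.random.default_rng(0)):
    """Bistable HMM: switch state with probability theta, observe it correctly with probability gamma."""
    x = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    x[0] = rng.integers(2)                                   # initial state, A = 0 / B = 1
    for t in range(T):
        if t > 0:
            switch = rng.random() < theta                    # Bin(1, theta) switching indicator
            x[t] = 1 - x[t - 1] if switch else x[t - 1]
        y[t] = x[t] if rng.random() < gamma else 1 - x[t]    # misread with probability 1 - gamma
    return x, y

x, y = simulate_hmm(theta=0.25, gamma=0.9, T=150)            # the setting used on the next slide
```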


Page 47: Intro to Approximate Bayesian Computation (ABC)

We generate a sequence of size T = 150 with θ = 0.25 and γ = 0.9.

The states are discrete and there are only two (A and B), hence with datasets of moderate size we could do without summary statistics. But not for large T.

Take S(·) = number of switches between observed states. Example: if y = (A, B, B, A, A, B) we switched 3 times, so S(y) = 3.

We only need to set a metric and then we are done.

Example (you can choose a different metric): Jε(y∗, y) = Iy(y∗) with

Iy(y∗) = 1 if |S(y∗) − S(y)| < ε, and 0 otherwise.

Plug this setup into an ABC-MCMC and we are essentially using Marjoram et al.’s original algorithm.
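The summary and the indicator above translate directly; a short sketch (the comparison uses ≤ so that ε = 0, used later on, demands an exact match of the summaries):

```python
import numpy as np

def n_switches(seq):
    """S(y) = number of switches between consecutive observed states."""
    seq = np.asarray(seq)
    return int(np.sum(seq[1:] != seq[:-1]))

def J_indicator(y_star, y_obs, eps):
    """I_y(y*) = 1 iff the summaries differ by at most eps."""
    return 1.0 if abs(n_switches(y_star) - n_switches(y_obs)) <= eps else 0.0
```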


Page 48: Intro to Approximate Bayesian Computation (ABC)

Priors: θ ∼ U(0, 1) and γ ∼ Beta(20, 3). Starting values for the ABC-MCMC: θ = γ = 0.5.

[Figure: trace plots of the ABC-MCMC chains for θ (top) and γ (bottom) over 80,000 iterations.]

We used ε = 6 for the first 5,000 iterations, then ε = 2 for a further 25,000 iterations, and ε = 0 for the remaining 50,000 iterations.

With ε = 6 the acceptance rate was 20%, with ε = 2 it was 9%, and with ε = 0 it was 2%.


Page 49: Intro to Approximate Bayesian Computation (ABC)

Results at ε = 0

Dealing with a discrete state-space model allows us the luxury of obtaining results at ε = 0 (impossible with continuous states).

Below: ABC posteriors (blue), true parameters (vertical red lines) and the Beta prior (black). For θ we used a uniform prior on [0, 1].

[Figure: ABC posterior densities for θ (left) and γ (right) at ε = 0, with the true parameter values and the Beta prior for γ overlaid.]

Remember: when using non-sufficient statistics, results will be biased even with ε = 0.


Page 50: Intro to Approximate Bayesian Computation (ABC)

A price to be paid when using ABC with a small ε is that, because of the many rejections, autocorrelations are very high.

[Figure: autocorrelation functions of the chains for θ (top) and γ (bottom), for ε = 0, ε = 2 and ε = 6.]

This implies the need for longer simulations.


Page 51: Intro to Approximate Bayesian Computation (ABC)

An apology

Paradoxically, all the (trivial) examples I have shown do not require ABC.

I considered simple examples because it’s easier to illustrate the method, but you will receive a homework with (really) intractable likelihoods :-b


Page 52: Intro to Approximate Bayesian Computation (ABC)

Weighting summary statistics

Consider a vector of summaries S(·) ∈ R^d; not much literature discusses how to assign weights to the components of S(·).

For example consider

Jε(y∗, y) ∝ exp(−‖S(y) − S(y∗)‖ / (2ε²))

with ‖S(y) − S(y∗)‖ = (S(y) − S(y∗))′ · W⁻¹ · (S(y) − S(y∗)).

Prangle⁵ notes that if S(y) = (S1(y), ..., Sd(y)) and we give all the Sj the same weight (hence W is the identity matrix), then the distance ‖ · ‖ is dominated by the most variable summary Sj.

Only the component of θ “explained” by such an Sj will be nicely estimated.

⁵ D. Prangle (2015), arXiv:1507.00874.

Page 53: Intro to Approximate Bayesian Computation (ABC)

It is useful to have a diagonal W, say W = diag(σ1², ..., σd²).

The σj could be determined from a pilot study. Say that we are using ABC-MCMC: after some appropriate burn-in, say that we have stored R realizations of S(y∗), corresponding to R parameters θ∗, into an R × d matrix.

For each column j, extract the unique values from (S_j^(1)(y∗), ..., S_j^(R)(y∗))′ and then compute their madj (median absolute deviation).

Set σj² := madj².

(The median absolute deviation is a robust measure of dispersion.)

Rerun ABC-MCMC with the updated W; an adjustment to ε will probably be required.
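A minimal sketch of this pilot-run recipe, assuming `S_pilot` is the R × d matrix of simulated summaries stored after burn-in:

```python
import numpy as np

def mad_weight_matrix(S_pilot):
    """W = diag(mad_1^2, ..., mad_d^2): one MAD per summary, computed on its unique pilot values."""
    mads = []
    for j in range(S_pilot.shape[1]):
        col = np.unique(S_pilot[:, j])                        # unique values of the j-th summary
        mads.append(np.median(np.abs(col - np.median(col))))  # median absolute deviation
    return np.diag(np.square(mads))
```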


Page 54: Intro to Approximate Bayesian Computation (ABC)

ABC for dynamical models

It is trickier to select intuitive summaries (i.e. without the Fearnhead-Prangle approach) for dynamical models.

However, we can bypass the need for S(·) if we use an ABC version of sequential Monte Carlo.

A very good review of methods for dynamical models is given in Jasra 2015.


Page 55: Intro to Approximate Bayesian Computation (ABC)

ABC-SMC

A simple ABC-SMC algorithm is in Jasra et al. 2010, presented in the next slide (with some minor modifications).

For the sake of brevity, just consider a bootstrap filter approach with N particles.

Recall that in ABC we assume that if the observation yt ∈ Y, then also y_t^{*i} ∈ Y.

As usual, we assume t ∈ {1, 2, ..., T}.


Page 56: Intro to Approximate Bayesian Computation (ABC)

Step 0. Set t = 1. For i = 1, ..., N sample x_1^i ∼ π(x0) and y_1^{*i} ∼ p(y1|x_1^i), compute the weights w_1^i = J_{1,ε}(y1, y_1^{*i}) and normalize them, w_1^i := w_1^i / Σ_{i=1}^N w_1^i.

Step 1. Resample N particles {x_t^i, w_t^i}. Set w_t^i = 1/N. Set t := t + 1 and if t = T + 1, stop.

Step 2. For i = 1, ..., N sample x_t^i ∼ p(xt|x_{t−1}^i) and y_t^{*i} ∼ p(yt|x_t^i). Compute

w_t^i := J_{t,ε}(yt, y_t^{*i}),

normalize the weights, w_t^i := w_t^i / Σ_{i=1}^N w_t^i, and go to step 1.


Page 57: Intro to Approximate Bayesian Computation (ABC)

The previous algorithm is not as general as the one actually given in Jasra et al. 2010.

I assumed that resampling is performed at every t (not strictly necessary). If resampling is not performed at every t, in step 2 we have

w_t^i := w_{t−1}^i J_{t,ε}(yt, y_t^{*i}).

Specifically, Jasra et al. use J_{t,ε}(yt, y_t^{*i}) ≡ I(‖y_t^{*i} − yt‖ < ε), but that’s not essential for the method to work.

What is important to realize is that in SMC methods the comparison is “local”, that is, we compare particles at time t vs. the observation at time t. So we can avoid summaries and use the data directly.

That is, instead of comparing a length-T vector y∗ with a length-T vector y, we perform T separate comparisons ‖y_t^{*i} − yt‖. This is very feasible and clearly does not require an S(·).


Page 58: Intro to Approximate Bayesian Computation (ABC)

So you can form an approximation to the likelihood, as we explained in the particle marginal methods lecture, and then plug it into a standard MCMC (not ABC-MCMC) algorithm for parameter estimation.

This is a topic for a final project.


Page 59: Intro to Approximate Bayesian Computation (ABC)

Construction of S(·)

We have somewhat postponed an important issue in ABC practice: the choice/construction of S(·).

This is the most serious open problem in ABC and one that often determines the success or failure of the simulation.

We are ready to give up sufficiency (essentially available only for exponential-family models) in exchange for an “informative” statistic.

Statistics are somehow easier to identify for static models. For dynamical models their identification is rather arbitrary, but see Martin et al.⁶ for state-space models.

⁶ Martin et al. 2014, arXiv:1409.8363.

Page 60: Intro to Approximate Bayesian Computation (ABC)

Semi-automatic summary statistics

To date the most important study on the construction of summaries in ABC is Fearnhead and Prangle 2012⁷, a discussion paper in JRSS-B.

Recall a well-known result: consider the class of quadratic losses

L(θ0, θ̂; A) = (θ0 − θ̂)′ A (θ0 − θ̂)

with θ0 the true value of the parameter, θ̂ an estimator of θ, and A a positive-definite matrix.

If we set S(y) = E(θ|y), then the minimal expected quadratic loss E(L(θ0, θ̂; A)|y) is achieved by θ̂ = E_ABC(θ|S(y)) as ε → 0.

That is to say, as ε → 0 we minimize the expected posterior loss by using the ABC posterior expectation (if S(y) = E(θ|y)). However, E(θ|y) is unknown.

⁷ Fearnhead and Prangle (2012), JRSS-B 74(3).

Page 61: Intro to Approximate Bayesian Computation (ABC)

So Fearnhead & Prangle propose a regression-based approach to determine S(·) (run prior to the ABC-MCMC start):

for the j-th parameter in θ fit, separately, the linear regression models

Sj(y) = E(θj|y) = β0^(j) + β^(j) η(y),   j = 1, 2, ..., dim(θ)

[e.g. Sj(y) = β0^(j) + β1^(j) y1 + · · · + βn^(j) yn, or you can let η(·) contain powers of y, say η(y) = (y, y², y³, ...)];

repeat the fitting separately for each θj;

hopefully Sj(y) = β0^(j) + β^(j) η(y) will be “informative” for θj.

Clearly, in the end we have as many summaries as the number of unknown parameters, dim(θ).


Page 62: Intro to Approximate Bayesian Computation (ABC)

An example (run before ABC-MCMC):

1. Let p = dim(θ). Simulate from the prior, θ∗ ∼ π(θ) (not very efficient...).
2. Using θ∗, generate y∗ from your model.
Repeat (1)-(2) many times to get the following matrices (one row per simulation):

θ1^(1)  θ2^(1)  · · ·  θp^(1)          y1^(*1)  y2^(*1)  · · ·  yn^(*1)
θ1^(2)  θ2^(2)  · · ·  θp^(2)   and    y1^(*2)  y2^(*2)  · · ·  yn^(*2)
  ...                                    ...

and for each column j of the left matrix do a multivariate linear regression (or lasso, or ...)

θj^(1)        1  y1^(*1)  y2^(*1)  · · ·  yn^(*1)
θj^(2)   =    1  y1^(*2)  y2^(*2)  · · ·  yn^(*2)    × βj   (j = 1, ..., p),
  ...           ...

and obtain a statistic for θj, Sj(·) = β0^(j) + β^(j) η(·).
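A minimal sketch of the regression step using ordinary least squares (lasso etc. would be drop-in replacements); `theta_mat` (M × p) and `eta_mat` (M × n, the chosen features η(y∗)) are assumed to come from steps 1-2 above:

```python
import numpy as np

def fit_semiautomatic_summaries(theta_mat, eta_mat):
    """One least-squares fit per parameter, theta_j ~ beta0 + beta' eta(y*).
    Returns a p x (n+1) array whose j-th row is (beta0^(j), beta^(j))."""
    X = np.column_stack([np.ones(len(eta_mat)), eta_mat])   # design matrix [1, eta(y*)]
    B, *_ = np.linalg.lstsq(X, theta_mat, rcond=None)       # (n+1) x p coefficient matrix
    return B.T

def summary_stat(coefs_j, eta_y):
    """S_j(y) = beta0^(j) + beta^(j) . eta(y), applied identically to observed and simulated data."""
    return coefs_j[0] + coefs_j[1:] @ np.atleast_1d(eta_y)
```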


Page 63: Intro to Approximate Bayesian Computation (ABC)

Use the same coefficients when calculating summaries for simulated data and for the actual data, i.e.

Sj(y) = β0^(j) + β^(j) η(y)

Sj(y∗) = β0^(j) + β^(j) η(y∗)

In Picchini 2013 I used this approach to select summaries for state-space models defined by stochastic differential equations.


Page 65: Intro to Approximate Bayesian Computation (ABC)

Reviews

Fairly extensive but accessible reviews:

1 Sisson and Fan 2010

2 (with applications in ecology) Beaumont 2010

3 Marin et al. 2010

Simpler introductions:

1 Sunnåker et al. 2013

2 (with applications in ecology) Hartig et al. 2013

Review specific for dynamical models:

1 Jasra 2015


Page 66: Intro to Approximate Bayesian Computation (ABC)

Non-reviews, specific for dynamical models

1 SMC for parameter estimation and model comparison: Toni et al. 2009

2 Markov models: White et al. 2015

3 SMC: Sisson et al. 2007

4 SMC: Dean et al. 2014

5 SMC: Jasra et al. 2010

6 MCMC: Picchini 2013


Page 67: Intro to Approximate Bayesian Computation (ABC)

More specialized resources

selection of summary statistics: Fearnhead and Prangle 2012.

review on summary statistics selection: Blum et al. 2013

expectation-propagation ABC: Barthelme and Chopin 2012

Gaussian Processes ABC: Meeds and Welling 2014

ABC model choice: Pudlo et al 2015
