Accelerated approximate Bayesian computation with applications to protein folding data
Accelerating inference for complex stochastic models using Approximate Bayesian Computation, with an application to protein folding

Umberto Picchini, Centre for Mathematical Sciences, Lund University
Joint work with Julie Forman (Dept. of Biostatistics, Copenhagen University)
Dept. of Mathematics, Uppsala, 4 Sept. 2014
Umberto Picchini ([email protected])
Outline
I’ll use protein folding data as a motivating example; most of the talk will be about statistical inference issues.
I will introduce our data and the protein folding problem.
I will introduce a model based on stochastic differential equations.
I will mention some theoretical and computational issues related to (Bayesian) inference.
I will introduce approximate Bayesian computation (ABC) to alleviate methodological and computational problems.
A motivating example
Proteins are synthesized in the cell on ribosomes as linear, unstructured polymers...
...which then self-assemble into specific and functional three-dimensional structures.
This self-assembly process is called protein folding. It is the last and crucial step in the transformation of genetic information, encoded in DNA, into functional protein molecules.
Protein folding is also associated with a wide range of human diseases. In many neurodegenerative diseases, such as Alzheimer’s disease, proteins misfold into toxic protein structures.
Protein folding has been named “the Holy Grail of biochemistry and biophysics” (!).
Modeling the time dynamics is difficult (large number of atoms in a 3D space); atom coordinates are usually projected onto a single dimension called the reaction coordinate, see the figure below.

Figure: Data time-course projected on a single coordinate: 25,000 measurements of the L-reaction coordinate of the small Trp-zipper protein at sampling freq. ∆−1 = 1/nsec.

Here the L-reaction coordinate was used, i.e. the total distance to a folded reference. Notice the random switching between folded/unfolded states.
Forman and Sørensen [2013]¹ proposed to consider sums of diffusions:

Zt (observable process) = Yt (latent state) + Ut (autocorrelated error term)

They considered diffusion processes to model both Yt and Ut. They found that i.i.d. errors were not really giving satisfactory results, so let’s introduce some autocorrelation:

dUt = −κ Ut dt + √(2κγ²) dWt,  U0 = 0

a zero-mean Ornstein-Uhlenbeck process with stationary variance γ² and autocorrelation ρU(t) = e^(−κt). Here dWt ∼ N(0, dt).

¹ Forman and Sørensen, A transformation approach to modelling multi-modal diffusions. J. Statistical Planning and Inference, 2014.
Regarding the “signal” part Yt: the data clearly show a bimodal marginal structure:

[Figure: data time course and marginal histogram, showing two modes.]

So we want a stochastic process that is able to switch between the two “modes” (i.e. the folded/unfolded states). One possible option:

an OU process Xt with zero mean and unit variance

dXt = −θ Xt dt + √(2θ) dBt,  X0 = x0

plug each Xt into the cdf of its stationary N(0,1) distribution ⇒ take Φ(Xt), where Φ is the N(0,1) cdf.
Now build a Gaussian mixture and take the percentile corresponding to an area Φ(Xt).

To summarize: simulate Xt ⇒ compute Φ(Xt) ⇒ find the percentile Yt from a 2-component Gaussian mixture with cdf

F(y) = α·Φ((y − µ1)/σ1) + (1 − α)·Φ((y − µ2)/σ2)
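The steps above can be sketched in code. Below is a minimal Python sketch (not the authors’ implementation; all helper names are ours): an Euler-Maruyama path of the OU process, the mixture cdf F, and a numerical inversion of F, since mixture percentiles are unknown in closed form.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def simulate_ou(theta, n, dt, x0=0.0, rng=None):
    """Euler-Maruyama path of dX_t = -theta*X_t dt + sqrt(2*theta) dB_t."""
    rng = rng or np.random.default_rng()
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        x[i] = x[i - 1] - theta * x[i - 1] * dt + np.sqrt(2.0 * theta * dt) * rng.normal()
    return x

def mixture_cdf(y, alpha, mu1, sigma1, mu2, sigma2):
    """cdf F(y) of the 2-component Gaussian mixture."""
    return alpha * norm.cdf((y - mu1) / sigma1) + (1.0 - alpha) * norm.cdf((y - mu2) / sigma2)

def tau(x, alpha, mu1, sigma1, mu2, sigma2):
    """tau(x) = F^{-1}(Phi(x)); F^{-1} has no closed form, so invert numerically."""
    probs = np.atleast_1d(norm.cdf(x))                     # Phi(X_t) in (0, 1)
    lo = min(mu1, mu2) - 10.0 * max(sigma1, sigma2)        # bracket for the root search
    hi = max(mu1, mu2) + 10.0 * max(sigma1, sigma2)
    return np.array([
        brentq(lambda y, p=p: mixture_cdf(y, alpha, mu1, sigma1, mu2, sigma2) - p, lo, hi)
        for p in probs
    ])
```

For instance, tau(0, ...) returns the median of the mixture, since Φ(0) = 0.5.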
So in conclusion we have:

Zt (data) = Yt (latent state) + Ut (autocorrelated error term)

Zt = Yt + Ut
Yt := τ(Xt)
dXt = −θ Xt dt + √(2θ) dBt
dUt = −κ Ut dt + √(2κγ²) dWt
τ(x) = (F⁻¹ ∘ Φ)(x),  F(y) = α·Φ((y − µ1)/σ1) + (1 − α)·Φ((y − µ2)/σ2)

We are interested in conducting (Bayesian) inference for

η = (θ, κ, γ, α, µ1, µ2, σ1, σ2)
Difficulties with exact Bayesian inference
A non-exhaustive list of the difficulties with exact Bayesian inference via MCMC and SMC in our application:

our dataset is “large” (25,000 observations)... which is not “terribly” large in an absolute sense, but it is when dealing with diffusion processes...
...in fact, even when a proposed parameter value is in the bulk of the posterior distribution, generated trajectories might still be too distant from the data (⇒ high rejection rate!)
a high rejection rate implies poor exploration of the posterior surface, poor inferential results and increasing computational time.
some of these issues can be mitigated using bridging techniques (Beskos et al. ’13): not trivial in our case (the transformation τ(x) is unknown in closed form).
Before going into approximate methods... you should trust (!) that we have put lots of effort into trying to avoid approximations! Still...

Besides theoretical difficulties, currently existing methods do not scale well for large data:

e.g. we attempted to use particle MCMC (Andrieu et al. ’10), and even with a few particles (only 10!) it would require weeks to obtain results...
it is expensive to simulate from our model, as percentiles of mixture models are unknown in closed form.
we had to use some approximate strategy, and we considered ABC (approximate Bayesian computation).
Notation
Data: z = (z0, z1, ..., zn)
unknown parameters: η = (θ, κ,γ,α,µ1,µ2,σ1,σ2)
Likelihood function: p(z|η)
Prior density: π(η) (our a priori knowledge of η)
Posterior density:
π(η|z) ∝ π(η) p(z|η)
Ideally we would like to use/sample from the posterior. We assume this is either theoretically difficult or computationally expensive (it is in our case!).
Approximate Bayesian computation (ABC)
ABC gives a way to approximate a posterior distribution π(η|z)
key to the success of ABC is the ability to bypass the explicit calculation of the likelihood p(z|η)... only forward simulation from the model is required!
ABC is in fact a likelihood-free method that works by simulating pseudo-data zsim from the model:
zsim ∼ p(z|η)
it has had incredible success in genetic studies since the mid 90s (Tavaré et al. ’97, Pritchard et al. ’99), and lots of hype in recent years: see Christian Robert’s excellent blog.
Basic rejection sampler (NO approximations here!)

for r = 1 to R do
  repeat
    Generate parameter η′ from its prior distribution π(η)
    Generate zsim from the likelihood p(z|η′) (!! no need to know p(·) analytically !!)
  until zsim = z (simulated data = actual data)
  set ηr = η′
end for

The algorithm produces R samples from the exact posterior π(η|z).

:( It won’t work for continuous data or large amounts of data because Pr(z = zsim) ≈ 0 ⇒ substitute z = zsim with z ≈ zsim
...⇒ substitute z = zsim with z ≈ zsim

Introduce some distance ‖z − zsim‖ to measure the proximity of zsim to the data z [Pritchard et al. 1999].
Introduce a tolerance value δ > 0.

An ABC rejection sampler:

for r = 1 to R do
  repeat
    Generate parameter η′ from its prior distribution π(η)
    Generate zsim from the likelihood p(z|η′)
  until ‖z − zsim‖ < δ [or alternatively ‖S(z) − S(zsim)‖ < δ]
  set ηr = η′
end for

for some “summary statistics” S(·).
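As an illustration (not our protein model), the ABC rejection sampler above can be sketched on a toy problem: a Normal with unknown mean, a uniform prior, and the sample mean as summary statistic. All names and numbers here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observed" data: Normal with unknown mean (true value 2.0), unit variance
theta_true = 2.0
z = rng.normal(theta_true, 1.0, size=200)

def summary(x):
    return x.mean()                  # illustrative summary statistic S(.)

R, delta = 200, 0.1                  # accepted draws wanted, tolerance
samples = []
while len(samples) < R:
    theta = rng.uniform(-10, 10)                   # 1. draw from the prior
    z_sim = rng.normal(theta, 1.0, size=len(z))    # 2. forward simulation only
    if abs(summary(z_sim) - summary(z)) < delta:   # 3. accept if S(zsim) close to S(z)
        samples.append(theta)

samples = np.array(samples)          # approximate draws from the ABC posterior
```

The accepted draws concentrate around the true mean; shrinking delta trades acceptance rate for accuracy, exactly the tension discussed in the following slides.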
The previous algorithm samples from the approximate posteriors π(η | ‖z − zsim‖ < δ) or π(η | ‖S(z) − S(zsim)‖ < δ).

It is useful to consider statistics S(·) when dealing with large datasets, to increase the probability of acceptance.

The key result of ABC: when S(·) is “sufficient” for η and δ ≈ 0, sampling from the posterior is (almost) exact!

When S(·) is sufficient for the parameter ⇒ π(η | ‖S(zsim) − S(z)‖ < δ) ≡ π(η | ‖zsim − z‖ < δ); when δ = 0 and S is sufficient we accept only parameter draws for which zsim ≡ z ⇒ π(η|z), the exact posterior.

This is all good and nice, but such conditions rarely hold.
A central problem is how to choose the statistics S(·): outside the exponential family we typically cannot derive sufficient statistics. [A key work obtaining statistics “semi-automatically” is Fearnhead-Prangle ’12 (discussion paper in JRSS-B, very much recommended).]

We substitute sufficiency with the loose concept of an informative (enough) statistic, and then choose a small (enough) threshold δ.

We now go back to our model and (large) data. We propose some tricks to accelerate the inference.
We will use an ABC within MCMC approach (ABC-MCMC).
ABC-MCMC
Example: weigh the discrepancy between observed data and simulated trajectories using a uniform 0-1 kernel Kδ(·), e.g.

Kδ(S(zsim), S(z)) = 1 if ‖S(zsim) − S(z)‖ < δ, and 0 otherwise

There is complete freedom to choose a different criterion...
Use such a measure in place of the typical conditional density of the data given the latent states. See next slide.
Zt = Yt + Ut with Yt := τ(Xt)
dXt = −θ Xt dt + √(2θ) dBt
dUt = −κ Ut dt + √(2κγ²) dWt

ABC-MCMC acceptance ratio: given the current value of the parameter η ≡ ηold, generate a Markov chain via Metropolis-Hastings:

Algorithm 1: a generic iteration of ABC-MCMC (fixed threshold δ)
At the r-th iteration:
1. generate η′ ∼ u(η′|ηold), e.g. using a Gaussian random walk
2. generate zsim|η′ from the model (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept η′ if ω < min(1, [π(η′) K(S(zsim), S(z)) u(ηold|η′)] / [π(ηold) K(S(zold), S(z)) u(η′|ηold)]); then set ηr = η′, else ηr = ηold

Samples are from π(η, zsim | ‖S(zsim) − S(z)‖ < δ).
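The iteration above can likewise be sketched on a toy problem (again NOT the protein SDE model: a Normal mean with flat prior, symmetric random-walk proposal, sample mean as S(·); all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: Normal with unknown mean theta; S(.) = sample mean
theta_true = 2.0
z = rng.normal(theta_true, 1.0, size=200)
S_obs = z.mean()

def prior_pdf(theta):                          # flat prior on (-10, 10)
    return 0.05 if -10.0 < theta < 10.0 else 0.0

delta, R = 0.2, 5000
theta_old = S_obs                              # start admissibly: S(z_old) ≈ S(z)
chain = np.empty(R)
for r in range(R):
    theta_new = theta_old + 0.3 * rng.normal()          # 1. symmetric RW: u-ratio = 1
    z_sim = rng.normal(theta_new, 1.0, size=len(z))     # 2. forward simulation only
    S_new = z_sim.mean()
    kernel = 1.0 if abs(S_new - S_obs) < delta else 0.0  # uniform 0-1 kernel K
    # denominator kernel K(S(z_old), S(z)) = 1 along the chain (admissible start)
    ratio = prior_pdf(theta_new) * kernel / prior_pdf(theta_old)
    if rng.uniform() < min(1.0, ratio):                  # 3.-4. accept/reject
        theta_old = theta_new
    chain[r] = theta_old
```

The chain explores the ABC posterior around the true mean; the 0-1 kernel makes the acceptance ratio a simple prior ratio gated by the δ-check.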
Algorithm 2: a generic iteration of ABC-MCMC (random threshold δ)
At the r-th iteration:
1. generate η′ ∼ u(η′|ηold), δ′ ∼ v(δ′|δold)
2. generate zsim|η′ from the model (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept (η′, δ′) if ω < min(1, [π(η′) Kδ′(S(zsim), S(z)) u(ηold|η′) v(δold|δ′)] / [π(ηold) Kδold(S(zold), S(z)) u(η′|ηold) v(δ′|δold)]); then set (ηr, δr) = (η′, δ′), else (ηr, δr) = (ηold, δold)

Samples are from π(η, δ, zsim | ‖S(zsim) − S(z)‖ < δ).
By using a (not too!) small threshold δ we might obtain a decent acceptance rate for the approximated posterior...
however, ABC does not save us from having to produce computationally costly “long” trajectories zsim (n ≈ 25,000) at each step of ABC-MCMC.
However, in ABC the relevant info about our simulations is encoded into S(·)... do we really have to simulate zsim having the same length as our data z??
...simulate “short” zsim that are still qualitatively representative! (see next slide...)
Top row: full dataset of 25,000 observations. Bottom row: every 30th observation is reported.

[Figure: full data (time course and marginal histogram) above the 30x-subsampled data (time course and marginal histogram).]

The dataset is 30 times smaller but the qualitative features are still there!
Strategy for large datasets (Picchini-Forman ’14)

We have a dataset z of about 25,000 observations. Prior to starting ABC-MCMC, construct S(z) to contain:
1. the 15th-30th-...-90th percentiles of the marginal distribution of the full data → to identify the Gaussian mixture parameters µ1, µ2, etc.
2. values of the autocorrelation function of the full data z at lags (60, 300, 600, ..., 2100) → to identify the dynamics-related parameters θ, γ, κ.
During ABC-MCMC we simulate shorter trajectories zsim of size 25000/30 ≈ 800. We take as summary statistics S(zsim) the 15th-30th-...-90th percentiles of the simulated data and autocorrelations at lags (2, 10, 20, ..., 70) (recall zsim is 30x shorter than z). We then compare S(zsim) with S(z) within ABC-MCMC. This is fast! S(·) for the large data can be computed prior to the ABC-MCMC start.
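A sketch of how such summaries might be assembled (illustrative Python, not the authors’ code; the percentile grid 15, 30, 45, 60, 75, 90 is our reading of the slide’s “15th-30th...-90th” shorthand, and the slide elides the intermediate autocorrelation lags):

```python
import numpy as np

def acf(x, lags):
    """Sample autocorrelation of x at the given positive lags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in lags])

def summaries(x, acf_lags):
    """S(.): marginal percentiles plus autocorrelations, as in the strategy above.
    Percentile grid 15, 30, ..., 90 is an assumption about the slide's shorthand."""
    pct = np.percentile(x, np.arange(15, 91, 15))
    return np.concatenate([pct, acf(x, acf_lags)])

# S(z) is computed once on the full data before ABC-MCMC starts, e.g. with the
# lags quoted on the slide (intermediate lags are elided there):
#   S_obs = summaries(z, acf_lags=[60, 300, 600, 2100])
# while during ABC-MCMC the 30x-shorter z_sim would use correspondingly shorter lags:
#   S_sim = summaries(z_sim, acf_lags=[2, 10, 20, 70])
```

Precomputing S(z) once means each MCMC iteration only pays for summaries of the short simulated trajectory.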
So the first strategy to accelerate computations was simulating a smaller set of artificial data.
Our second strategy is to perform so-called “early rejection” of proposed parameters.
ABC-MCMC acceptance ratio:

accept η′ if ω < min(1, [π(η′) K(|S(zsim) − S(z)|) u(ηold|η′)] / [π(ηold) K(|S(zold) − S(z)|) u(η′|ηold)])

However, remember K(·) is a 0/1 kernel → start the algorithm at an admissible ηstart such that at the first iteration S(zold) ≈ S(z) →

accept η′ if ω < min(1, [π(η′)/π(ηold)] · K(|S(zsim) − S(z)|) · [u(ηold|η′)/u(η′|ηold)])

Notice η′ will surely be rejected if

ω > [π(η′)/π(ηold)] · [u(ηold|η′)/u(η′|ηold)]

REGARDLESS of the value of K(·) ∈ {0, 1} → do NOT simulate trajectories if the above is satisfied!
Algorithm 3: Early-Rejection ABC-MCMC (Picchini ’13)
1. At the (r + 1)-th ABC-MCMC iteration:
2. generate η′ ∼ u(η|ηr) from its proposal distribution;
3. generate ω ∼ U(0, 1);
if ω > [π(η′) u(ηr|η′)] / [π(ηr) u(η′|ηr)] (= “ratio”) then
  (ηr+1, S(zsim,r+1)) := (ηr, S(zsim,r))   (proposal early-rejected)
else
  generate xsim ∼ π(x|η′) conditionally on the η′ from step 2; determine ysim = τ(xsim), generate zsim ∼ π(z|ysim, η′) and calculate S(zsim);
  if K(|S(zsim) − S(z)|) = 0 then
    (ηr+1, S(zsim,r+1)) := (ηr, S(zsim,r))   (proposal rejected)
  else if ω ≤ ratio then
    (ηr+1, S(zsim,r+1)) := (η′, S(zsim))   (proposal accepted)
  else
    (ηr+1, S(zsim,r+1)) := (ηr, S(zsim,r))   (proposal rejected)
  end if
end if
4. increment r to r + 1. If r > R stop, else go to step 2.
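The reordering can be sketched on the same kind of toy problem (not the protein model; the N(0, 3²) prior is our illustrative choice, made so that the prior ratio occasionally triggers early rejection): ω is drawn first and checked against the prior/proposal ratio, and the costly forward simulation happens only if that check passes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: Normal with unknown mean; S(.) = sample mean
theta_true = 2.0
z = rng.normal(theta_true, 1.0, size=200)
S_obs = z.mean()

def prior_pdf(theta):
    return np.exp(-theta**2 / 18.0)        # illustrative N(0, 3^2) prior (unnormalized)

delta, R = 0.2, 5000
theta, n_sims = S_obs, 0                   # start at an admissible point
chain = np.empty(R)
for r in range(R):
    theta_new = theta + 0.3 * rng.normal()         # symmetric proposal: u-ratio = 1
    omega = rng.uniform()                          # draw omega BEFORE simulating
    ratio = prior_pdf(theta_new) / prior_pdf(theta)
    if omega > ratio:
        chain[r] = theta                           # early rejection: no simulation paid
        continue
    n_sims += 1                                    # forward simulation only happens here
    z_sim = rng.normal(theta_new, 1.0, size=len(z))
    if abs(z_sim.mean() - S_obs) < delta:          # 0-1 kernel; omega <= ratio holds already
        theta = theta_new
    chain[r] = theta

saved = 1.0 - n_sims / R                           # fraction of simulations avoided
```

The chain is identical in distribution to the plain ABC-MCMC chain; only the order of the checks changes, so every early-rejected iteration skips the simulation cost entirely.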
Notice “early rejection” works only with 0-1 kernels.
There is no reason not to use it: early rejection per se is not an approximation, it’s just a trick.
It saved us between 40-50% of computing time (Picchini ’13).
Results after 2 million ABC-MCMC iterations. Acceptance rate of 1% and 6 hrs of computation with MATLAB on a common desktop PC.

Table: Protein folding data experiment: posterior means from the ABC-MCMC output and 95% posterior intervals.

log θ: −6.454 [−6.898, −5.909]
log κ: −0.651 [−1.424, 0.246]
log γ: 0.071 [−0.313, 0.378]
log µ1: 3.24 [3.22, 3.26]
log µ2: 3.43 [3.39, 3.45]
log σ1: −0.959 [−2.45, 0.38]
log σ2: −0.424 [−2.26, 0.76]
log α: −0.663 [−1.035, −0.383]
Figure: Data (top), process Yt (middle), process Zt (bottom); Zt = Yt + Ut evaluated at η = posterior mean.
A simulation study (Picchini-Forman ’14)
Here we want to compare ABC against (computationally intensive) exact Bayesian inference (via particle MCMC, pMCMC).
In order to do so we consider a very small dataset of 360 simulated observations.
We use a parallel strategy for pMCMC devised in Drovandi ’14 (4 chains run in parallel, using 100 particles for each chain).

C. Drovandi (2014). Pseudo-marginal algorithms with multiple CPUs. Queensland University of Technology, http://eprints.qut.edu.au/61505/
U.P. and J. Forman (2014). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
Comparison ABC-MCMC (*) vs exact Bayes (pMCMC). For each parameter: true value, then pMCMC posterior mean [95% interval], then ABC-MCMC posterior mean [95% interval] (starred).

θ: true 0.0027; pMCMC 0.0023 [0.0013, 0.0041]; ABC 0.0024* [0.0013, 0.0039]
κ: true 0.538; pMCMC 0.444 [0.349, 0.558]; ABC 0.553* [0.386, 0.843]
γ: true 1.063; pMCMC 1.040 [0.943, 1.158]; ABC 0.982* [0.701, 1.209]
µ1: true 25.52; pMCMC 25.68 [25.08, 26.61]; ABC 25.72* [25.10, 26.71]
µ2: true 30.92; pMCMC 32.12 [29.15, 35.42]; ABC 32.17* [29.46, 34.96]
σ1: true 0.540; pMCMC 0.421 [0.203, 0.844]; ABC 0.523* [0.248, 0.972]
σ2: true 0.624; pMCMC 0.502 [0.232, 1.086]; ABC 0.511* [0.249, 1.041]
α: true 0.537; pMCMC 0.510 [0.345, 0.755]; ABC 0.508* [0.346, 0.721]
Figure: Marginal posteriors of (a) log θ, (b) log κ, (c) log γ. Exact Bayesian (solid); ABC-MCMC (dashed); true value (vertical lines); uniform priors.
Figure: Marginal posteriors of (a) log µ1, (b) log µ2, (c) log σ1, (d) log σ2, (e) log α. Exact Bayesian (solid); ABC-MCMC (dashed); true value (vertical lines); uniform priors.
Conclusions
As long as we manage to “compress” information into summary statistics, ABC is a useful inferential tool for complex models and large datasets.
1,000 ABC-MCMC iterations performed in 6 sec, versus about 20 min with exact Bayesian sampling (pMCMC).
...the problem is that ABC requires lots of tuning (choosing S(·), δ, K(·)...)
A MATLAB implementation is available at http://sourceforge.net/projects/abc-sde/ with a 50+ page manual.
References
U.P. (2014). Inference for SDE models via Approximate Bayesian Computation. J. Comp. Graph. Stat.
U.P. and J. Forman (2013). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
U.P. (2013). abc-sde: a MATLAB toolbox for approximate Bayesian computation (ABC) in stochastic differential equation models. http://sourceforge.net/projects/abc-sde/
Appendix
Proof that the basic ABC algorithm works
The proof is straightforward. We know that a draw (η′, zsim) produced by the algorithm is such that (i) η′ ∼ π(η), and (ii) zsim = z, where zsim ∼ π(zsim | η′).

Thus, calling f(η′) the (unknown) density of such an η′, because of (i) and (ii)

f(η′) ∝ Σ_zsim π(η′) π(zsim|η′) I_z(zsim) = Σ_{zsim=z} π(η′, zsim) ∝ π(η′|z).

Therefore η′ ∼ π(η|z).
A theoretical motivation to consider ABC
An important (known) result: a fundamental consequence is that if S(·) is a sufficient statistic for θ then lim_{δ→0} πδ(θ | y) = π(θ | y), the exact (marginal) posterior!!!

uh?!

Otherwise (in general) the algorithm draws from the approximation π(θ | ρ(S(x), S(y)) < δ).

Also, by introducing the class of quadratic losses

L(θ0, θ; A) = (θ0 − θ)ᵀ A (θ0 − θ):

Another relevant result: if S(y) = E(θ | y), then the minimal expected quadratic loss E(L(θ0, θ; A) | y) is achieved via θ = E_ABC(θ | S(y)) as δ → 0.
The straightforward motivation is the following: consider the (ABC) posterior πδ(θ | y); then

πδ(θ | y) = ∫ πδ(θ, x | y) dx ∝ π(θ) ∫ (1/δ) K(|S(x) − S(y)| / δ) π(x | θ) dx → π(θ) π(S(x) = S(y) | θ)  as δ → 0.

Therefore, if S(·) is a sufficient statistic for θ, then

lim_{δ→0} πδ(θ | y) = π(θ | y),

the exact (marginal) posterior!!!
Acceptance probability in Metropolis-Hastings

Suppose at a given iteration of Metropolis-Hastings we are in the (augmented) state (θ#, x#) and wonder whether or not to move to a new state (θ′, x′). The move is generated via a proposal distribution q((θ#, x#) → (θ′, x′)), e.g. q((θ#, x#) → (θ′, x′)) = u(θ′|θ#) v(x′ | θ′); the move (θ#, x#) → (θ′, x′) is accepted with probability

α = min(1, [π(θ′) π(x′|θ′) π(y|x′, θ′) q((θ′, x′) → (θ#, x#))] / [π(θ#) π(x#|θ#) π(y|x#, θ#) q((θ#, x#) → (θ′, x′))])
  = min(1, [π(θ′) π(x′|θ′) π(y|x′, θ′) u(θ#|θ′) v(x# | θ#)] / [π(θ#) π(x#|θ#) π(y|x#, θ#) u(θ′|θ#) v(x′ | θ′)])

Now choose v(x | θ) ≡ π(x | θ): the latent-state densities cancel, leaving

α = min(1, [π(θ′) π(y|x′, θ′) u(θ#|θ′)] / [π(θ#) π(y|x#, θ#) u(θ′|θ#)])

This is likelihood-free! And we only need to know how to generate x′ (not a problem...)
Generation of δ’s
[Figure: trace of the δ chain over the MCMC iterations.]

Here we generate a chain for log δ using a (truncated) Gaussian random walk with support (−∞, log δmax]. We let log δmax decrease during the simulation.
HOWTO: post-hoc selection of δ (the “precision” parameter) [Bortot et al. 2007]

During ABC-MCMC we let δ vary (according to a random walk): at the r-th iteration δr = δr−1 + ∆, with ∆ ∼ N(0, ν²). After the end of the MCMC we have a sequence {θr, δr}r=0,1,2,... and for each parameter {θj,r}r=0,1,2,... we produce a plot of the parameter chain vs δ:

[Figure: a parameter chain plotted against the bandwidth δ.]
Post-hoc selection of the bandwidth δ, cont’d...
Therefore in practice:
we filter out of the analyses those draws {θr}r=0,1,2,... corresponding to “large” δ, for statistical precision; we retain only those {θr}r=0,1,2,... corresponding to a low δ. In the example we retain {θr ; δr < 1.5}.
PRO: this is useful as it allows an ex-post selection of δ, i.e. we do not need to know in advance a suitable value for δ.
CON: by filtering out some of the draws, a disadvantage of the approach is the need to run very long MCMC simulations in order to have enough “material” on which to base our posterior inference.
PRO: also notice that by letting δ vary we are almost considering a global optimization method (similar to simulated tempering).
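In code, the post-hoc filtering step is just a mask over the stored chains (a minimal sketch; the array names are hypothetical):

```python
import numpy as np

def filter_by_bandwidth(theta_chain, delta_chain, delta_cut):
    """Retain only the parameter draws whose associated bandwidth is below delta_cut."""
    theta_chain = np.asarray(theta_chain)
    delta_chain = np.asarray(delta_chain)
    keep = delta_chain < delta_cut          # boolean mask over the MCMC output
    return theta_chain[keep]

# e.g., with the cutoff used in the example above:
#   retained = filter_by_bandwidth(theta_draws, delta_draws, 1.5)
```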