
Parameter Estimation for Stochastic Reaction Processes using Sequential Monte Carlo methods

Thesis submitted for the academic degree of

Master of Science

in the degree program Statistics

at Humboldt-Universität zu Berlin

submitted by

Philipp Batz, Matriculation no. 505151


Contents

1 Introduction
  1.1 The model framework
  1.2 The latent model process
  1.3 Parameter estimation
    1.3.1 Maximum likelihood parameter estimation
    1.3.2 Bayesian parameter estimation
  1.4 Outline of the Thesis

2 Sequential Monte Carlo methods
  2.1 Importance sampling
  2.2 Sequential Monte Carlo filtering
  2.3 The particle filtering algorithm

3 Biochemical Reaction Systems
  3.1 Molecular kinetics
  3.2 Mass action kinetics
  3.3 Continuous Time Markov process
  3.4 Gillespie algorithm
  3.5 Inference on complete path
  3.6 Proposal distribution approximation
    3.6.1 Diffusion approximation
    3.6.2 Euler discretization
    3.6.3 Markov jump process proposal approximation
  3.7 Examples of Biochemical reaction processes
    3.7.1 Example 1: Lotka-Volterra process
    3.7.2 Example 2: Gene autoregulatory network
    3.7.3 Example 3: Prokaryotic autoregulation process
    3.7.4 Example 4: Stochastic reaction-diffusion process

4 Parameter estimation in biochemical reaction models
  4.1 Parameter estimation via EM algorithm
    4.1.1 MCEM-PF: Lotka-Volterra process
    4.1.2 MCEM-PF: Gene autoregulatory network process
  4.2 Bayesian parameter estimation
    4.2.1 Bayesian-PF: Lotka-Volterra process
    4.2.2 Bayesian-PF: Gene autoregulatory network process
    4.2.3 Bayesian-PF: Prokaryotic autoregulation process
    4.2.4 Bayesian-PF: Stochastic reaction-diffusion process
  4.3 Comparison of estimation results between MCEM-PF and Bayesian-PF approach
    4.3.1 Poor man's data augmentation algorithm 1
    4.3.2 Poor man's data augmentation algorithm 2

5 Empirical Results
  5.1 Results: Lotka-Volterra process
  5.2 Results: Gene autoregulatory network process
  5.3 Results: Prokaryotic autoregulation model
  5.4 Results: Stochastic reaction-diffusion model

6 Discussion

A Posterior Moments based on Laplace's method

B Brownian Motion

C Stochastic Differential Equations
  C.1 Fokker-Planck equation

1 Introduction

State space models are Markov models of a sequential nature and have been found useful in a variety of different areas, such as engineering [1], finance [42] and the natural sciences ([15], Section IV). In many cases, the underlying Markov process cannot be observed directly, but allows only for possibly nonlinear, noisy and incomplete measurements to be taken. In most cases, the hidden process will be modeled by a continuous-time Markov process with continuous sample paths, and the observations are assumed to be available only at discrete time points. Examples of continuous-time partially observed Markov processes are diffusions or stochastic differential equation (SDE) models in the case of a continuous state space and Markov jump processes (MJP) in a discrete state space setting.

A crucial problem for these models is how to efficiently estimate the parameters governing the dynamics. While the likelihood of the complete model of latent process and observations can often be written down easily, statistical inference on the model such as parameter estimation requires marginalisation over the possible realisations of the infinite-dimensional latent process. Consequently, in practical applications one usually has to resort to some kind of approximation. Markov Chain Monte Carlo (MCMC) methods use a time discretization strategy [21, 29, 39, 43]; variational methods restrict the class of approximating distributions in such a way that a tractable (approximate) solution can be derived [4, 5, 38]. For diffusion models, an approach has been introduced in the literature which is both exact and computationally efficient [6]. However, since this method involves a transformation to constant volatility, it cannot be applied to nonlinear multivariate processes such as the ones discussed in this thesis [49]. Alternatively, one can approach the problem with the help of Sequential Monte Carlo (SMC) methods, which approximate the continuous trajectories by a finite number of point masses [15, 9, 11, 18, 30, 3]. Conceptually, the basic idea of SMC methods consists of extending the importance sampling method to a sequential model environment. Choosing an appropriate importance distribution for sampling is therefore of crucial concern in order to achieve a good performance in practice.

In this thesis we are concerned with a model driven by stochastic reaction processes, which belong to the class of Markov jump processes [46]. Specifically, we will examine stochastic reaction processes in a biochemical context, which are called biochemical reaction processes. To this end we will first develop a modeling approach using SMC methods, which allows us to approximate filtering distributions and likelihoods for biochemical reaction processes. This in particular involves a derivation of a suitable importance distribution, which extends an idea proposed in [23] to a noisy state space model environment. In a next step we will use the SMC algorithm for the purpose of inferring the model parameters. We will discuss a maximum likelihood as well as a Bayesian estimation approach, which we subsequently apply to different instances of biochemical reaction processes.

The remainder of the chapter is structured as follows. We look at the basic structure of the model, the type of latent process underlying the model and discuss two different basic approaches on how to estimate the model parameters.

1.1 The model framework

We consider the following nonlinear model framework, which is governed by two different random processes:

• State process

  X_{t_{n+1}} | X_{t_n} = x_{t_n} ∼ p(· | x_{t_n}, θ).    (1.1)

• Observation process

  Y_{t_n} | X_{t_n} = x_{t_n} ∼ p(· | x_{t_n}, θ),    (1.2)

where θ = (θ_1, . . . , θ_L) denotes the parameter vector of the system. The state process, which is assumed to be hidden from the observer, models the dynamics of some random process over a sequence of (N + 1) time points t_{t_0:t_N} = (t_0, . . . , t_N). The process assumes an initial state at time t_0 distributed according to p(x_{t_0} | θ) and proceeds over the subsequent time points via the transition density p(x_{t_{n+1}} | x_{t_n}, θ). The only available information regarding the state of the process is encoded in the observation sequence y_{t_0:t_N} = (y_{t_0}, . . . , y_{t_N}) measured at the (N + 1) time points t_{t_0:t_N}. The observation y_{t_n} can be understood as a noisy measurement of the true state of the process x_{t_n}, where the randomness is governed by the density p(y_{t_n} | x_{t_n}, θ). Note that we assume the state process to be Markovian, since the transition density for x_{t_{n+1}} given the state history x_{t_0:t_n} = (x_{t_0}, . . . , x_{t_n}) only depends on the most recent state x_{t_n}, i.e. p(x_{t_{n+1}} | x_{t_0:t_n}, θ) = p(x_{t_{n+1}} | x_{t_n}, θ). Furthermore, the specific form of the observation process implies the assumption that the conditional density of an observation y_{t_n} given the states x_{t_0:t_n} and previous observations y_{t_0:t_{n-1}} only depends on the state of the hidden process at time t_n, i.e. p(y_{t_n} | x_{t_0:t_n}, y_{t_0:t_{n-1}}, θ) = p(y_{t_n} | x_{t_n}, θ).

In order to derive the joint distribution of states and observations we can make use of the model assumptions, which results in the following factorized expression:

p(x_{t_0:t_N}, y_{t_0:t_N} | θ) = p(x_{t_0} | θ) p(y_{t_0} | x_{t_0}, θ) ∏_{n=1}^{N} p(x_{t_n} | x_{t_{n-1}}, θ) p(y_{t_n} | x_{t_n}, θ)    (1.3)

Note that the factorized nature simplifies simulation from the joint distribution, since it allows us to break down this task into a sequential series of sampling from the respective transition and observation densities. Hence, generating a sample from p(x_{t_0:t_N}, y_{t_0:t_N} | θ) can be achieved as follows:


State space sampling algorithm

Sample initial state and observation X_{t_0} ∼ p(· | θ), Y_{t_0} ∼ p(· | X_{t_0}, θ)
for n = 1, . . . , N do
    Sample X_{t_n} ∼ p(· | X_{t_{n-1}}, θ)
    Sample Y_{t_n} ∼ p(· | X_{t_n}, θ)
end for

Consequently, the vector (X_{t_0}, . . . , X_{t_N}, Y_{t_0}, . . . , Y_{t_N}) is a random draw from the joint distribution p(x_{t_0:t_N}, y_{t_0:t_N} | θ).
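For concreteness, the following minimal Python sketch forward-simulates the state space sampling algorithm for a hypothetical one-dimensional linear-Gaussian model. The model choice and parameter names are assumptions made for this illustration only and stand in for the general transition and observation densities above.

import numpy as np

def sample_state_space(N, theta, rng=None):
    # Forward simulation of a simple illustrative state space model:
    #   X_{t_{n+1}} = X_{t_n} + q * eps,   Y_{t_n} = X_{t_n} + r * eta
    rng = np.random.default_rng() if rng is None else rng
    q, r = theta                         # transition and observation noise scales
    x = np.empty(N + 1)
    y = np.empty(N + 1)
    x[0] = rng.normal(0.0, q)            # initial state X_{t_0} ~ p(. | theta)
    y[0] = rng.normal(x[0], r)           # first observation Y_{t_0} ~ p(. | x_{t_0}, theta)
    for n in range(1, N + 1):
        x[n] = rng.normal(x[n - 1], q)   # X_{t_n} ~ p(. | x_{t_{n-1}}, theta)
        y[n] = rng.normal(x[n], r)       # Y_{t_n} ~ p(. | x_{t_n}, theta)
    return x, y

x, y = sample_state_space(N=100, theta=(0.5, 1.0))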

For the purposes of statistical inference on dynamical systems we need to compute the posterior densities p(x_{t_0:t_T} | y_{t_0:t_T}, θ) and so-called filtering distributions p(x_{t_T} | y_{t_0:t_T}, θ) for 1 ≤ T ≤ N. Despite being conceptually simple, posterior and filtering distributions can be computed analytically only for a small number of specific cases [9]. In order to derive an approximation to the desired densities for the remaining analytically intractable cases, a number of different strategies have been proposed in the literature. A class of methods tackling this problem, which recently has attracted wide attention and which will be the route taken in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling idea proposed in the state space sampling algorithm as a building block in order to generate a sufficiently large number of samples from the posterior distributions, which can subsequently be used for Monte Carlo approximations. The application of SMC methods requires that we can sample from the transition and observation densities (as can already be seen from the state space sampling algorithm) and also evaluate these densities at least up to a normalizing constant.

1.2 The latent model process

In the case of continuous-time Markov processes, we assume a d-dimensional hidden process X_t defined on an interval [0, T], with X_t denoting the state of the process at time t. Then with observations occurring at time points 0 = t_0 < ∆ = t_1 < 2∆ = t_2 < . . . < t_N = N∆ = T, p_∆(x_{t_n+∆} | x_{t_n}, θ) is understood as the transition density of the process over an interval (t_n, t_n+∆] between two consecutive observations, starting in state x_{t_n}. In this thesis we will consider so-called Markov jump processes (MJP) [46]. A d-dimensional MJP is a continuous-time Markov process taking only non-negative discrete values in ℕ^d, and is characterized by the process rates f(x′ | x). We can deduce the transition probabilities by considering a time interval ∆τ that is sufficiently small, so that the probability of two or more reactions occurring is negligible compared to just one reaction happening. Since the probability for a transition from state x to state x′ is defined as f(x′ | x)∆τ, we have as probability for moving from state X_t = x to X_{t+∆τ} = x′ in time ∆τ:

p(x′ | x) = δ_x(x′) + f(x′ | x) ∆τ + o(∆τ),    (1.4)

with δ_x(x′) denoting the Dirac measure centered at x. By using the normalization property ∑_{x′} p(x′ | x) = 1, we get f(x | x) = −∑_{x′≠x} f(x′ | x).

Now, the (marginal) probability p(x, t+∆τ) can be derived by summing the probabilities over all possible ways of attaining state x at time t+∆τ with either zero or one reaction occurring in the interval [t, t+∆τ):

p(x, t+∆τ) = ∑_{x′≠x} f(x | x′) p(x′, t) ∆τ + (1 + f(x | x) ∆τ) p(x, t) + o(∆τ).    (1.5)

After rearranging and taking the limit ∆τ → 0 we arrive at the master equation, which is an equivalent form of the Chapman-Kolmogorov equation for Markov processes:

∂_t p(x, t) = ∑_{x′≠x} ( f(x | x′) p(x′, t) − f(x′ | x) p(x, t) ).    (1.6)

Although exact, the master equation is in many cases mathematically intractable due to the size of the state space.
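To make this concrete, consider, as a small illustration that is not one of the thesis' example processes, a single species produced at a constant rate c (a pure immigration process ∅ → X with X_0 = 0). The only allowed jump is x → x + 1, so f(x′ | x) = c δ_{x+1}(x′) and the master equation (1.6) reduces to

∂_t p(x, t) = c p(x − 1, t) − c p(x, t),    x ∈ ℕ,

which is solved by the Poisson distribution p(x, t) = (ct)^x e^{−ct} / x!. Already for two coupled species with state-dependent rates no comparably simple closed form is available, which is the intractability referred to above.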

1.3 Parameter estimation

Since the parameter vector θ = (θ_1, . . . , θ_L) governing the state space system is usually unknown, estimating the unknown parameters conditioned on the observations y = (y_{t_0}, . . . , y_{t_N}) is of crucial concern. We will specifically focus on two different approaches.

1.3.1 Maximum likelihood parameter estimation

Taking the frequentist perspective, our objective is to find the maximum likelihood estimator, i.e. the maximizer of the (log) marginal likelihood of the observations:

θ̂ = arg max_{θ∈Θ} log p(y_{t_0:t_N} | θ),    (1.7)

where Θ denotes the parameter space. A popular approach to the parameter estimation problem for the type of models used in this thesis is the expectation-maximization (EM) algorithm [13]. The EM algorithm is a numerically well-behaved iterative method which increases the likelihood at each iteration. Specifically, given an estimate θ_k of the (true) parameters θ at iteration k, the E-step computes the expected value of the complete log-likelihood log p(x_{t_0:t_N}, y_{t_0:t_N} | θ) with respect to the posterior of the latent process X given the observations and the parameter estimate θ_k:

Q(θ, θ_k) = ∫ log p(x_{t_0:t_N}, y_{t_0:t_N} | θ) p(x_{t_0:t_N} | y_{t_0:t_N}, θ_k) dx_{t_0:t_N}.    (1.8)


In the M-step the expression evaluated in the E-step gets maximized, resulting in an updated parameter estimate θ_{k+1} at iteration k+1:

θ_{k+1} = arg max_{θ∈Θ} Q(θ, θ_k).    (1.9)

It can be shown that each updated parameter estimate θ_{k+1} obeys the non-decreasing property p(y_{t_0:t_N} | θ_{k+1}) ≥ p(y_{t_0:t_N} | θ_k) with respect to the marginal likelihood. Consequently, by iterating the algorithm long enough, convergence to a (local) optimum is assured. Further details on the EM algorithm can be found in [7].

If the posterior distribution of the latent state sequence given the observation sequence is analytically intractable, we can approximate this quantity by generating a sample X^{(1)}_{t_0:t_N}, . . . , X^{(M)}_{t_0:t_N} ∼ p(· | y_{t_0:t_N}, θ_k) with M ≫ 1. We can then use the sample to approximate the integration via the Monte Carlo method as follows:

Q(θ, θ_k) ≈ (1/M) ∑_{i=1}^{M} log p(X^{(i)}_{t_0:t_N}, y_{t_0:t_N} | θ).    (1.10)

Accordingly, this modification is referred to as the Monte-Carlo-EM (MCEM) algorithm in the literature [48]. Note that since we are using a Monte Carlo approximation in the expectation step of the algorithm, the non-decreasing property with respect to the likelihood of subsequent iterations is no longer assured, i.e. it does not necessarily hold that p(y_{t_0:t_N} | θ_{k+1}) ≥ p(y_{t_0:t_N} | θ_k), with p(y_{t_0:t_N} | θ) being defined by equation (2.26). We can however track convergence either by using a sliding window approach or simply by plotting the marginal likelihoods over subsequent iterations and terminating the algorithm after the values p(y_{t_0:t_N} | θ_k) stabilize, that is, stop exhibiting an upward drift and oscillate around some value instead. Say we terminate the algorithm after completing iteration K. The vector θ_k chosen as point estimate of the true parameters is then the one with the largest associated marginal likelihood, arg max_{k∈{1,...,K}} p(y_{t_0:t_N} | θ_k), which does not necessarily coincide with the parameter vector of the last iteration, θ_K.
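A minimal Python sketch of this MCEM loop is given below. The callback names and the selection rule are illustrative assumptions; the particle-filter-based posterior sampler and likelihood estimator they stand for are developed in the following chapters.

import numpy as np

def mcem(y, theta0, sample_posterior_paths, complete_loglik, marginal_loglik,
         maximize_Q, K=50, M=200):
    # Generic Monte-Carlo EM loop (illustrative sketch, not the thesis' exact code).
    # sample_posterior_paths(y, theta, M) -> M latent paths  X^(i) ~ p(. | y, theta)
    # complete_loglik(x, y, theta)        -> log p(x, y | theta)
    # marginal_loglik(y, theta)           -> estimate of log p(y | theta)
    # maximize_Q(Q)                       -> argmax_theta of the Monte Carlo Q-function
    thetas, logliks = [theta0], [marginal_loglik(y, theta0)]
    for k in range(K):
        paths = sample_posterior_paths(y, thetas[-1], M)              # E-step (Monte Carlo)
        Q = lambda theta: np.mean([complete_loglik(x, y, theta) for x in paths])
        theta_next = maximize_Q(Q)                                    # M-step
        thetas.append(theta_next)
        logliks.append(marginal_loglik(y, theta_next))                # track convergence
    best = int(np.argmax(logliks))        # keep the iterate with the largest marginal likelihood
    return thetas[best], thetas, logliks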

1.3.2 Bayesian parameter estimation

For our second method we approach the parameter estimation problem from a Bayesian perspective. In this framework the system parameter vector is treated as a random variable, which follows a certain probability distribution. Specifically, our information about θ before considering the data is described by the so-called prior distribution p(θ). By combining the prior with the likelihood, we arrive via Bayes' rule at the so-called posterior distribution, which encodes our information regarding the parameter vector conditioned on the available observations:

p(θ | y_{t_0:t_N}) = p(θ) p(y_{t_0:t_N} | θ) / ∫ p(θ) p(y_{t_0:t_N} | θ) dθ    (1.11)
                   ∝ p(θ) p(y_{t_0:t_N} | θ)    (1.12)


Instead of considering the marginal likelihood p(y_{t_0:t_N} | θ) we can equivalently write the posterior with respect to the complete likelihood p(x_{t_0:t_N}, y_{t_0:t_N} | θ):

p(θ | y_{t_0:t_N}) ∝ p(θ) ∫ p(x_{t_0:t_N}, y_{t_0:t_N} | θ) dx_{t_0:t_N}.    (1.13)

The marginal posterior of θ_l, l = 1, . . . , L, can then be found by integrating out the remaining components:

p(θ_l | y_{t_0:t_N}) = ∫_{θ_1} ··· ∫_{θ_{l−1}} ∫_{θ_{l+1}} ··· ∫_{θ_L} p(θ | y_{t_0:t_N}) dθ_1 ··· dθ_{l−1} dθ_{l+1} ··· dθ_L.    (1.14)

Having obtained the posterior distribution, we can further use this expression in order to do statistical inference or extract information from the posterior by evaluating for example the moments of the distribution:

E[θ^k | y_{t_0:t_N}] = ∫_Θ θ^k p(θ | y_{t_0:t_N}) dθ.    (1.15)

In addition to capturing our a priori modeling information, a good prior distribution should be easy to interpret and computationally convenient. One particular class of distributions possessing these properties are so-called conjugate distributions. A parametric distribution p(θ | λ) ∈ F from a family of distributions F is called conjugate to the distribution p(x | θ), if for all observations x and parameters θ the posterior distribution also belongs to the distribution family F:

p(θ | x, λ) ∝ p(θ | λ) p(x | θ) ∝ p(θ | λ′),    (1.16)

where λ′ denotes the updated parameter, which uniquely determines the posterior distribution.
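As a standard textbook illustration (not specific to this thesis), consider n independent exponential waiting times x_1, . . . , x_n with rate θ and a Gamma prior θ ∼ Ga(a, b) with density p(θ | a, b) ∝ θ^{a−1} e^{−bθ}. Then

p(θ | x_1, . . . , x_n, a, b) ∝ θ^{a−1} e^{−bθ} · θ^{n} e^{−θ ∑_{i=1}^{n} x_i} ∝ θ^{a+n−1} e^{−(b + ∑_{i=1}^{n} x_i) θ},

which is again a Gamma density, Ga(a + n, b + ∑_{i=1}^{n} x_i); the Gamma family is therefore conjugate to the exponential likelihood, and the updated parameter λ′ = (a + n, b + ∑_i x_i) fully determines the posterior. The same Gamma shape will reappear in the complete-data likelihood of the reaction rate constants in equation (3.17).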

1.4 Outline of the Thesis

Chapter 2 consists of an introduction to Sequential Monte Carlo (SMC) methods, which allow for approximating posterior distributions of state space models for a wide range of model classes, including the Markov jump process model already outlined. Specifically, we discuss the particle filtering method, which will serve as the basic building block of our subsequent algorithms. In Chapter 3 we apply the SMC framework to stochastic reaction processes, which belong to the class of Markov jump processes. We introduce the specifics of these processes and develop an appropriate proposal distribution, which serves as the analytically tractable sampling process in the particle filtering algorithm. Finally we introduce the different specific stochastic reaction processes which we will analyze in our empirical study. In Chapter 4 we use the methods discussed in Chapters 2 and 3 in order to develop different algorithms for estimating the model parameters in stochastic reaction processes. Specifically, we approach this problem from a frequentist (EM) and a Bayesian perspective and discuss the respective algorithms. We give the specifics of these algorithms for the example processes under consideration. Chapter 5 provides an empirical study of the parameter estimation algorithms for the example processes. We discuss the drawbacks and benefits of either method and evaluate the quality of the results (in relation to competing methods given in the literature).


2 Sequential Monte Carlo methods

In this chapter we focus on a class of methods for sampling from a posterior distribution that is known only up to a normalizing constant. We start by introducing the basic importance sampling scheme underlying these algorithms, show how to incorporate this idea into a sequential sampling environment and finally discuss the particle filtering algorithm, a member of the class of SMC algorithms, which serves as our method of choice in this thesis.

2.1 Importance sampling

Assume that we are concerned with computing moments with respect to a complex distribution p(x). Specifically, consider an expectation of the form

h̄ := E[h(x)] = ∫ h(x) p(x) dx,    (2.1)

where h(·) denotes some useful function for estimation, for example the function h(x) = x pertaining to the mean of the distribution. One way to evaluate the expression is by generating M samples {X^{(i)}}_{i=1,...,M} from the target distribution p(x) and approximating (2.1) with the empirical or Monte Carlo estimate

E[h(x)] ≈ (1/M) ∑_{i=1}^{M} h(X^{(i)}).    (2.2)

If p(x) is too complicated to be sampled from directly, we can often utilize some simpler probability distribution q(x) from which we can generate samples. We require that the support of p(x) is contained in the support of our so-called proposal distribution or importance distribution q(x). Now, if we draw {X^{(i)}}_{i=1,...,M} from the importance distribution q(x), we have to correct for the bias induced by sampling from the wrong distribution. It turns out that the correction term, which ensures the unbiasedness of the estimator, is of the form w = p/q. Specifically, this ratio of distributions is evaluated at each of the sampled points individually, yielding M so-called (unnormalized) importance weights W^{(i)} = p(X^{(i)})/q(X^{(i)}). The correction term therefore assigns more weight to regions where q(x) < p(x) and vice versa. The expectation can thus be evaluated by sampling from the proposal distribution and subsequently using the following weighted Monte Carlo estimate:

E[h(x)] = ∫ h(x) q(x) [p(x)/q(x)] dx    (2.3)
        ≈ ∑_{i=1}^{M} W^{(i)} h(X^{(i)}) / ∑_{i=1}^{M} W^{(i)} = ∑_{i=1}^{M} W̄^{(i)} h(X^{(i)}),    (2.4)

where W̄^{(i)} := W^{(i)} / ∑_{i=1}^{M} W^{(i)} denotes the normalized importance weight.


Note that using the target or the proposal distribution only in unnormalized form, p*(x) ∝ p(x) or q*(x) ∝ q(x), simply amounts to a rescaling of the weights, W^{(i)}_* ∝ W^{(i)}. Since the weights are normalized to sum to unity in the estimator (2.4), knowledge of the actual normalizing factors is not required.
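As a small self-contained illustration (the target and proposal are arbitrary choices made for this sketch, not taken from the thesis), the following Python snippet estimates a moment of a standard normal target using a heavier-tailed Student-t proposal and self-normalized importance weights:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M = 100_000

# Target p(x): standard normal; proposal q(x): Student-t with 3 degrees of freedom.
# Both densities are only needed up to a normalizing constant.
x = stats.t.rvs(df=3, size=M, random_state=rng)      # X^(i) ~ q
w = stats.norm.pdf(x) / stats.t.pdf(x, df=3)         # unnormalized weights W^(i) = p/q
w_bar = w / w.sum()                                   # normalized weights

h = lambda x: x**2                                    # e.g. the second moment
estimate = np.sum(w_bar * h(x))                       # weighted Monte Carlo estimate (2.4)
print(estimate)   # close to 1 for the standard normal target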

We will now extend the importance sampling idea to the sequential setting of the state space model.

2.2 Sequential Monte Carlo filtering

We assume two stochastic processes X_{[0,T]} and {Y_{t_n}}_{0≤n≤N}, which depend on a parameter vector θ ∈ Θ. The process X_{[0,T]} is a continuous-time latent Markov process on the interval [0, T] with initial density p(x | θ) and likelihood p(x_{(t,t+∆]} | x_t, θ) for a continuous trajectory x_{t:t+∆} in a time interval (t, t+∆], starting at state X_t = x_t:

X_0 ∼ p(· | θ)    (2.5)
X_{t:t+∆} | X_t = x_t ∼ p_∆(· | x_t, θ).    (2.6)

The process {Y_{t_n}}_{0≤n≤N} is an observation process, which allows for indirect measurements of the hidden process X_t at discrete time points t_n with t_0 = 0 < t_1 = ∆ < t_2 = 2∆ < . . . < t_N = N∆ = T. The observations are assumed to be conditionally independent given X_{[0,T]} with associated densities p(y | x, θ):

Y_{t_n} | X_{t_n} = x_{t_n} ∼ p(· | x_{t_n}, θ).    (2.7)

Specifically, we will assume the measurement error to be normally distributed, centered around the endpoint of the trajectory X_{t_n} = x_{t_n}, with fixed known (d × d)-dimensional covariance matrix σI, where I denotes the identity matrix:

p(y_{t_n} | x_{t_n}, σ) = N(y_{t_n} | x_{t_n}, σI).    (2.8)

For notational convenience we define t_{i}:t_{j} = (t_i, t_{i+1}, . . . , t_j) for i ≤ j. Having known parameters θ at our disposal, we can infer the posterior distribution of the hidden process X_{t_0:t_n} = x_{t_0:t_n} given the observations Y_{t_0:t_n} = y_{t_0:t_n} via Bayes' rule:

p(x_{t_0:t_n} | y_{t_0:t_n}, θ) = p(x_{t_0:t_n}, y_{t_0:t_n} | θ) / p(y_{t_0:t_n} | θ).    (2.9)

The joint distribution of hidden states and observations is then given by

p(x_{t_0:t_n}, y_{t_0:t_n} | θ) = p(x_{t_0} | θ) ∏_{k=1}^{n} p(x_{t_{k−1}:t_k} | x_{t_{k−1}}, θ) ∏_{k=0}^{n} p(y_{t_k} | x_{t_k}, θ),    (2.10)

where the factorization of the hidden process x_{t_0:t_n} follows directly from the Markov property (compare equation (1.3)).


The marginal distribution of the observation sequence, which is also known as the marginal likelihood, can be written as

p(y_{t_0:t_n} | θ) = ∫ p(x_{t_0:t_n}, y_{t_0:t_n} | θ) dx_{t_0:t_n}.    (2.11)

The factorization property of p(x_{t_0:t_n}, y_{t_0:t_n} | θ) allows us to rewrite the posterior density recursively as

p(x_{t_0:t_n} | y_{t_0:t_n}, θ) = p(x_{t_0:t_{n−1}} | y_{t_0:t_{n−1}}, θ) · p(x_{t_{n−1}:t_n} | x_{t_{n−1}}, θ) p(y_{t_n} | x_{t_n}, θ) / p(y_{t_n} | y_{t_0:t_{n−1}}, θ)    (2.12)

with

p(y_{t_0:t_n} | θ) = p(y_{t_0:t_{n−1}} | θ) p(y_{t_n} | y_{t_0:t_{n−1}}, θ),    (2.13)

where

p(y_{t_n} | y_{t_0:t_{n−1}}, θ) = ∫ p(x_{t_{n−1}:t_n} | x_{t_{n−1}}, θ) p(y_{t_n} | x_{t_n}, θ) p(x_{t_{n−1}} | y_{t_0:t_{n−1}}, θ) dx_{t_{n−1}:t_n}.    (2.14)

Clearly, the marginal likelihood of the observation sequence can also be evaluated recursively as

p(y_{t_0:t_n} | θ) = p(y_{t_0} | θ) ∏_{k=1}^{n} p(y_{t_k} | y_{t_0:t_{k−1}}, θ).    (2.15)

In cases where direct sampling from the distribution p(x_{t_0:t_n} | y_{t_0:t_n}, θ) is not possible, we can instead build on the basic importance sampling idea outlined above and extend it to our sequential environment. Specifically, we generate a large number M ≫ 1 of samples {X^{(i)}_{t_0:t_n}}_{i=1,...,M}, called particles, from a suitable proposal distribution q(x_{t_0:t_n} | y_{t_0:t_n}, θ) and compute the associated unnormalized importance weights as

W^{(i)}_n = p(X^{(i)}_{t_0:t_n} | y_{t_0:t_n}, θ) / q(X^{(i)}_{t_0:t_n} | y_{t_0:t_n}, θ),    i = 1, . . . , M.    (2.16)

By normalizing the weights to W̄^{(i)}_n := W^{(i)}_n / ∑_{i=1}^{M} W^{(i)}_n, we are able to approximate the expectation of any function h properly defined on the path space via the importance sampling technique:

h̄ = ∫ h(x_{t_0:t_n}) p(x_{t_0:t_n} | y_{t_0:t_n}, θ) dx_{t_0:t_n} ≈ ∑_{i=1}^{M} W̄^{(i)}_n h(X^{(i)}_{t_0:t_n}) =: ĥ    (2.17)

Specifically, we can approximate the posterior distribution p(x_{t_0:t_n} | y_{t_0:t_n}, θ) itself as follows:

p̂(x_{t_0:t_n} | y_{t_0:t_n}) = ∑_{i=1}^{M} W̄^{(i)}_n δ_{X^{(i)}_{t_0:t_n}}(x_{t_0:t_n})    (2.18)


Now, in order to facilitate an efficient computation, we factorize the proposal distribution analogously to the filtering distribution in equation (2.12):

q(x_{t_0:t_n} | y_{t_0:t_n}, θ) = q(x_{t_0:t_{n−1}} | y_{t_0:t_{n−1}}, θ) q(x_{t_{n−1}:t_n} | x_{t_0:t_{n−1}}, y_{t_0:t_n}, θ)    (2.19)

Rewriting the proposal distribution in this way conveniently induces a factorization in the unnormalized importance weights as well:

W^{(i)}_n = W^{(i)}_{n−1} · p(X^{(i)}_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | X^{(i)}_{t_n}, θ) / [ q(X^{(i)}_{t_{n−1}:t_n} | X^{(i)}_{t_0:t_{n−1}}, y_{t_0:t_n}, θ) p(y_{t_n} | y_{t_0:t_{n−1}}, θ) ]
        ∝ W^{(i)}_{n−1} · p(X^{(i)}_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | X^{(i)}_{t_n}, θ) / q(X^{(i)}_{t_{n−1}:t_n} | X^{(i)}_{t_0:t_{n−1}}, y_{t_0:t_n}, θ),    (2.20)

where the last line follows since the conditional likelihood does not depend on the hidden process and can therefore be absorbed into the normalization constant.

A crucial issue concerns the specific form of the proposal distribution, as one needs to make a sensible choice in order to attain a good performance in practice. The best solution would obviously be the (unavailable) posterior distribution p(x_{t_0:t_n} | y_{t_0:t_n}, θ) itself, so we would ideally like to be close to this case. We know, however, that as the discrepancy between the true distribution p and the proposal distribution q increases with n, the unconditional variance of the importance weights (i.e. with the observation sequence y_{t_0:t_n} interpreted as random variables) tends to increase [34], contributing to the weight degeneracy problem discussed below. A natural strategy is therefore to choose the proposal distribution for sampling a particle X^{(i)}_{t_{n−1}:t_n} on the subinterval (t_{n−1}, t_n] which minimizes the variance conditional on the specific particle X^{(i)}_{t_0:t_{n−1}} and the observation sequence y_{t_0:t_n}. This leads us to the following proposition (see [19]).

Proposition. The proposal distribution which minimizes the variance of the importance weight W^{(i)}_n, conditional on the specific particle X^{(i)}_{t_0:t_{n−1}} and the observation sequence y_{t_0:t_n}, is given by p(x_{t_{n−1}:t_n} | y_{t_n}, X^{(i)}_{t_{n−1}}, θ).

Proof. Calculating Var[W^{(i)}_n] = E[(W^{(i)}_n)²] − E[W^{(i)}_n]², where the variance is understood with respect to q(x_{t_{n−1}:t_n} | X^{(i)}_{t_0:t_{n−1}}, y_{t_0:t_n}, θ), yields (see the expression (2.20) for the unnormalized weights W^{(i)}_n):

Var[W^{(i)}_n] ∝ (W^{(i)}_{n−1})² [ ∫ ( p(x_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | x_{t_n}, θ) )² / q(x_{t_{n−1}:t_n} | X^{(i)}_{t_0:t_{n−1}}, y_{t_0:t_n}, θ) dx_{t_{n−1}:t_n} − ( p(y_{t_n} | X^{(i)}_{t_{n−1}}, θ) )² ].

Let us focus on the integral on the right hand side. After applying the substitution q(x_{t_{n−1}:t_n} | X^{(i)}_{t_0:t_{n−1}}, y_{t_0:t_n}, θ) = p(x_{t_{n−1}:t_n} | y_{t_n}, X^{(i)}_{t_{n−1}}, θ) for the proposal distribution, which according to Bayes' rule can be written as

p(x_{t_{n−1}:t_n} | y_{t_n}, X^{(i)}_{t_{n−1}}, θ) = p(x_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | x_{t_n}, θ) / ∫ p(x_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | x_{t_n}, θ) dx_{t_{n−1}:t_n},

the integral simplifies to ( p(y_{t_n} | X^{(i)}_{t_{n−1}}, θ) )². Consequently, the variance for this choice of proposal distribution is zero.

In the literature, this particular proposal distribution is referred to as the optimal kernel [19, 10]. Unfortunately, the distribution can be evaluated analytically only for some special cases (perhaps most notably for the class of Gaussian state space models with nonlinear transition equations) [10], and therefore has to be approximated most of the time. Since the analytic intractability of the optimal kernel also holds true for the class of models discussed in this thesis, we denote the approximation to the optimal kernel (which will be discussed at length in Chapter 3) as q(x_{t_{n−1}:t_n} | y_{t_n}, X^{(i)}_{t_{n−1}}, θ).

As we have seen, integrating the optimal kernel approximation into the importance sampling framework allows us to approximate any desired expectation h̄ with respect to the posterior distribution p(x_{t_0:t_n} | y_{t_0:t_n}, θ). The recursive form of equation (2.20) thereby enables an efficient computation, since the importance weights can be calculated forward in time sequentially, without modifying the sampled paths prior to the current time t_n. This is the so-called sequential importance sampling (SIS) algorithm, which we can summarize as follows:

SIS algorithm

for i = 1, . . . , M do
    Sample X^{(i)}_{t_0} ∼ q(· | y_{t_0}, θ)
    Compute weight:
        W^{(i)}_0 ∝ p(X^{(i)}_{t_0} | θ) p(y_{t_0} | X^{(i)}_{t_0}, θ) / q(X^{(i)}_{t_0} | y_{t_0}, θ)
end for
Normalize weights:
    W̄^{(i)}_0 = W^{(i)}_0 / ∑_{j=1}^{M} W^{(j)}_0,    i = 1, . . . , M
for n = 1, . . . , N do
    for i = 1, . . . , M do
        Propagate particle X^{(i)}_{t_{n−1}:t_n} ∼ q(· | y_{t_n}, X^{(i)}_{t_{n−1}}, θ) and set
            X^{(i)}_{t_0:t_n} ← (X^{(i)}_{t_0:t_{n−1}}, X^{(i)}_{t_{n−1}:t_n})
        Compute weight:
            W^{(i)}_n ∝ W^{(i)}_{n−1} p(X^{(i)}_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | X^{(i)}_{t_n}, θ) / q(X^{(i)}_{t_{n−1}:t_n} | X^{(i)}_{t_{n−1}}, y_{t_n}, θ)
    end for
    Normalize weights:
        W̄^{(i)}_n = W^{(i)}_n / ∑_{j=1}^{M} W^{(j)}_n,    i = 1, . . . , M
    Approximate posterior:
        p̂(x_{t_0:t_n} | y_{t_0:t_n}) = ∑_{i=1}^{M} W̄^{(i)}_n δ_{X^{(i)}_{t_0:t_n}}(x_{t_0:t_n})
end for

2.3 The particle filtering algorithm

Despite attaining good results for short data sequences, the SIS algorithm suffers from severe drawbacks for longer data sets. This behaviour is largely due to the weight degeneracy problem, where after a few time steps the probability mass is heavily concentrated on a small subset of the weights, thereby rendering most particles insignificant [15]. Considering that we are sampling with a fixed sample size from a high dimensional space, namely the entire path history of the state space up to time t_n, which increases over time, the degeneracy problem is hardly surprising. One idea to (partially) overcome this problem, which possesses good practical and theoretical benefits, consists of adding a resampling step. The resampling method comprises drawing M samples with replacement from the set of particles {X^{(i)}_{t_0:t_n}}, using the normalized weights as sampling probabilities. Hence, the idea can be intuitively stated as copying particles with a high associated weight and discarding particles with a low associated weight. In order to obtain a resampled set of M particles, which we denote by {X̃^{(i)}_{t_0:t_n}}, one can resort to a number of different approaches introduced in the literature [18]. The most basic sampling scheme is known as


Multinomial resampling. The method simply consists of sampling M times from the available set {X^{(i)}_{t_0:t_n}}, where each member X^{(i)}_{t_0:t_n} is chosen with probability W̄^{(i)}_n. This sampling scheme corresponds to one draw (N^{(1)}_n, . . . , N^{(M)}_n) from a multinomial distribution with parameters M and weight vector W̄^{(1:M)}_n = (W̄^{(1)}_n, . . . , W̄^{(M)}_n), where N^{(i)}_n denotes the number of offspring generated from particle X^{(i)}_{t_0:t_n}, and associating equal weight 1/M with each offspring. The posterior approximation with respect to the original set of particles therefore yields:

p̃(x_{t_0:t_n} | y_{t_0:t_n}) = ∑_{i=1}^{M} (N^{(i)}_n / M) δ_{X^{(i)}_{t_0:t_n}}(x_{t_0:t_n})    (2.21)

Since E[N^{(i)}_n | W̄^{(1:M)}_n] = M W̄^{(i)}_n, it follows that p̃(x_{t_0:t_n} | y_{t_0:t_n}) is an unbiased estimate of the importance sampling approximation p̂(x_{t_0:t_n} | y_{t_0:t_n}) based on the particles prior to resampling.

Different improved sampling methods have been proposed in the literature, which reduce the conditional variance Var[N^{(i)}_n | W̄^{(1:M)}_n] compared to multinomial resampling, while preserving the unbiasedness property [18]. Out of these improved sampling schemes, the easiest to implement and therefore most commonly used algorithm is the so-called

Systematic resampling. In this method, we start by sampling from a uniform distribution U_1 ∼ U[0, 1/M] and set U_i = (i−1)/M + U_1 for i = 2, . . . , M. We then define N^{(i)}_n = |{U_j : ∑_{k=1}^{i−1} W̄^{(k)}_n ≤ U_j ≤ ∑_{k=1}^{i} W̄^{(k)}_n}| as the number of offspring, where we use the convention ∑_{k=1}^{0} := 0. In analogy to the multinomial resampling, it can be shown that this approach is also unbiased [20].
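A compact Python sketch of the systematic resampling step (an illustrative implementation, assuming the weights are already normalized to sum to one):

import numpy as np

def systematic_resample(weights, rng=None):
    # Systematic resampling: returns M ancestor indices.
    # weights: normalized importance weights summing to one.
    rng = np.random.default_rng() if rng is None else rng
    M = len(weights)
    u = rng.uniform(0.0, 1.0 / M) + np.arange(M) / M   # U_1 ~ U[0, 1/M], U_i = (i-1)/M + U_1
    cum = np.cumsum(weights)
    cum[-1] = 1.0                                       # guard against floating point round-off
    return np.searchsorted(cum, u)                      # index whose weight interval contains U_i

# Usage: resampled_particles = particles[systematic_resample(w_bar)]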

After resampling has been carried out, the posterior density with respect to the resampled set of particles {X̃^{(i)}_{t_0:t_n}} can now be written as

p̃(x_{t_0:t_n} | y_{t_0:t_n}) = ∑_{i=1}^{M} (1/M) δ_{X̃^{(i)}_{t_0:t_n}}(x_{t_0:t_n})    (2.22)

Although introducing additional Monte Carlo variance, the resampling step can be proven to be beneficial, as it helps to focus the computational effort on regions of high probability mass, thereby rendering approximations to the filtering distributions much more stable. For a more in-depth discussion, including general convergence results, the reader is referred to [18].

Before summarizing the algorithm, we focus on the approximation to the marginal likelihood p(y_{t_0:t_n} | θ) of the observation sequence, since this distribution plays a crucial role in the parameter estimation algorithms discussed below.


We first consider an approximation to the conditional likelihood of the new observation y_{t_n} given all previous observations y_{t_0:t_{n−1}}. By defining

α_n(x_{t_{n−1}:t_n} | y_{t_n}, θ) := p(x_{t_{n−1}:t_n} | x_{t_{n−1}}, θ) p(y_{t_n} | x_{t_n}, θ) / q(x_{t_{n−1}:t_n} | x_{t_{n−1}}, y_{t_n}, θ),    (2.23)

we can rewrite equation (2.14) as:

p(y_{t_n} | y_{t_0:t_{n−1}}, θ) = ∫ α_n(x_{t_{n−1}:t_n} | y_{t_n}, θ) q(x_{t_{n−1}:t_n} | x_{t_{n−1}}, y_{t_n}, θ) p(x_{t_{n−1}} | y_{t_0:t_{n−1}}) dx_{t_{n−1}:t_n}.    (2.24)

Now we substitute the posterior approximation (2.22) for the last term in the above equation, which subsequently leads to the following pointwise conditional likelihood approximation:

p̂(y_{t_n} | y_{t_0:t_{n−1}}, θ) = (1/M) ∑_{i=1}^{M} α_n(X^{(i)}_{t_{n−1}:t_n} | y_{t_n}, θ).    (2.25)

Hence, the marginal likelihood p(y_{t_0:t_n} | θ) can be approximated by invoking equation (2.15):

p̂(y_{t_0:t_n} | θ) = p̂(y_{t_0} | θ) ∏_{k=1}^{n} p̂(y_{t_k} | y_{t_0:t_{k−1}}, θ).    (2.26)

Applying the sampling-resampling construction as outlined above sequentially to all time intervals of our observation sequence leads us to the so-called particle filtering (PF) algorithm, which extends the SIS algorithm by including a resampling step:

PF algorithm

for i = 1, . . . , M do
    Sample X^{(i)}_{t_0} ∼ q(· | y_{t_0}, θ)
    Compute weight:
        W^{(i)}_0 ∝ p(X^{(i)}_{t_0} | θ) p(y_{t_0} | X^{(i)}_{t_0}, θ) / q(X^{(i)}_{t_0} | y_{t_0}, θ)
end for
Normalize weights:
    W̄^{(i)}_0 = W^{(i)}_0 / ∑_{j=1}^{M} W^{(j)}_0,    i = 1, . . . , M
Approximate likelihood:
    p̂(y_{t_0} | θ) = (1/M) ∑_{i=1}^{M} α_0(X^{(i)}_{t_0} | y_{t_0}, θ)
Resample {W̄^{(i)}_0, X^{(i)}_{t_0}} to obtain M equally weighted particles {1/M, X̃^{(i)}_{t_0}}
for n = 1, . . . , N do
    for i = 1, . . . , M do
        Propagate particle X^{(i)}_{t_{n−1}:t_n} ∼ q(· | y_{t_n}, X̃^{(i)}_{t_{n−1}}, θ) and set
            X^{(i)}_{t_0:t_n} ← (X̃^{(i)}_{t_0:t_{n−1}}, X^{(i)}_{t_{n−1}:t_n})
        Compute weight:
            W^{(i)}_n ∝ W^{(i)}_{n−1} p(X^{(i)}_{t_{n−1}:t_n} | X̃^{(i)}_{t_{n−1}}, θ) p(y_{t_n} | X^{(i)}_{t_n}, θ) / q(X^{(i)}_{t_{n−1}:t_n} | X̃^{(i)}_{t_{n−1}}, y_{t_n}, θ)
    end for
    Normalize weights:
        W̄^{(i)}_n = W^{(i)}_n / ∑_{j=1}^{M} W^{(j)}_n,    i = 1, . . . , M
    Resample {W̄^{(i)}_n, X^{(i)}_{t_0:t_n}} to obtain M equally weighted particles {1/M, X̃^{(i)}_{t_0:t_n}}
    Approximate filtering density:
        p̂(x_{t_0:t_n} | y_{t_0:t_n}) = (1/M) ∑_{i=1}^{M} δ_{X̃^{(i)}_{t_0:t_n}}(x_{t_0:t_n})
    Approximate marginal likelihood:
        p̂(y_{t_0:t_n} | θ) = p̂(y_{t_0:t_{n−1}} | θ) p̂(y_{t_n} | y_{t_0:t_{n−1}}, θ),
    where:
        p̂(y_{t_n} | y_{t_0:t_{n−1}}, θ) = (1/M) ∑_{i=1}^{M} α_n(X^{(i)}_{t_{n−1}:t_n} | y_{t_n}, θ)
end for
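For concreteness, a stripped-down Python sketch of this particle filter loop is given below. The callback interface is an assumption made for this illustration (it is not the thesis' implementation), the initial proposal is taken to be the prior, particles are represented only by the endpoints of the increments X_{t_{n-1}:t_n}, and the systematic_resample helper sketched earlier in this chapter is reused.

import numpy as np
from scipy.special import logsumexp

def particle_filter(y, M, sample_x0, sample_prop, log_trans, log_obs, log_prop, rng=None):
    # sample_x0(M)                  : M draws from the prior p(x_{t_0} | theta)
    # sample_prop(x_prev, y_n)      : one draw from the proposal q(. | y_n, x_prev, theta)
    # log_trans(x_new, x_prev)      : log p(x_new | x_prev, theta)
    # log_obs(y_n, x_new)           : log p(y_n | x_new, theta)
    # log_prop(x_new, x_prev, y_n)  : log q(x_new | y_n, x_prev, theta)
    rng = np.random.default_rng() if rng is None else rng
    X = sample_x0(M)                                   # particles at t_0 (proposal = prior)
    logw = np.array([log_obs(y[0], x) for x in X])     # W_0 proportional to p(y_0 | x)
    loglik = logsumexp(logw) - np.log(M)               # log p_hat(y_{t_0} | theta)
    clouds = [X]
    for n in range(1, len(y)):
        w_bar = np.exp(logw - logsumexp(logw))         # normalized weights
        idx = systematic_resample(w_bar, rng)          # resampling step
        X_new = np.array([sample_prop(X[i], y[n]) for i in idx])
        logw = np.array([log_trans(X_new[j], X[idx[j]]) + log_obs(y[n], X_new[j])
                         - log_prop(X_new[j], X[idx[j]], y[n]) for j in range(M)])
        loglik += logsumexp(logw) - np.log(M)          # adds log p_hat(y_{t_n} | y_{t_0:t_{n-1}}, theta)
        X = X_new
        clouds.append(X)
    return clouds, loglik                              # per-step particle clouds and log-likelihood estimate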

We finally note that much effort has been made to improve the performance of particle filtering algorithms, in particular with respect to proposing efficient sampling/resampling strategies and variance reduction (a neat overview can be found in [10]). Perhaps the most prominent variant of the standard particle filter is the auxiliary particle filtering algorithm [39], which essentially boils down to interchanging the order of the sampling and resampling steps in the standard algorithm. Since we found no significant difference in the empirical investigation of our sample processes when substituting the auxiliary particle filter for the standard algorithm, we will skip further discussion.


3 Biochemical Reaction Systems

In this thesis we are concerned with so-called biochemical reaction models [26], which can be succinctly summarized as follows: if a fixed volume V contains a mixture of d chemical species, which can react according to r specified chemical reaction channels corresponding to different molecule collisions, then, given the numbers of molecules of each species present at some initial time, what will these molecular population levels be at any later time? Traditionally, the standard method for modeling biochemical reaction systems is to formulate the problem in terms of ordinary differential equations. This approach presupposes that the dynamics of the system over time is both continuous and deterministic. In reality, however, the dynamics does not follow a continuous process, since the molecule population of the system can change only by discrete amounts as a result of molecule collisions. Furthermore, the system cannot be assumed to be deterministic either, since the individual chemical reactions are impossible to predict with any certainty. An additional problem arises in models with low molecular copy numbers, since in such cases stochastic effects can play a potentially significant role, leading to inaccuracies in a deterministic framework [36]. For a stochastic modeling of biochemical reaction systems, one can either make use of discrete Markov processes [38, 40, 41, 31], of diffusions or stochastic differential equation (SDE) models, which approximate the variables as continuous with a white noise term modeling the stochasticity [28, 29, 30], or of a hybrid approach, which treats some chemical species as discrete and others as approximately continuous. Although an exact modeling of biochemical reaction models has also been proposed [8], it turns out to be computationally very intensive for realistically sized problems. A nice introductory overview of biochemical reaction systems can be found in [50].

In the remainder of this chapter, we first introduce the specifics of a stochastic approach to biochemical reaction models and subsequently develop the process approximation, which integrates an SDE diffusion approximation into a Markov jump process framework.

3.1 Molecular kinetics

We start by considering a so-called bimolecular reaction of the form

X_1 + X_2 → X_3 + X_4    (3.1)

This reaction denotes a conversion of one molecule of species X_1 and one molecule of species X_2 into one molecule of species X_3 and one molecule of species X_4. The reaction is initiated by a collision of X_1 and X_2, where we assume that the molecules are moving around randomly, driven by Brownian motion. The species X_1 and X_2 on the left hand side of the equation are called reactant species, whereas the species X_3 and X_4 are called product species.

Generally, consider a system of d species X = (X_1, . . . , X_d), which interact through v biochemical reactions in the following way:

R_1 : p_{11}X_1 + . . . + p_{1d}X_d → q_{11}X_1 + . . . + q_{1d}X_d
R_2 : p_{21}X_1 + . . . + p_{2d}X_d → q_{21}X_1 + . . . + q_{2d}X_d
  ⋮
R_v : p_{v1}X_1 + . . . + p_{vd}X_d → q_{v1}X_1 + . . . + q_{vd}X_d,    (3.2)

where p_{ij} and q_{ij} denote the so-called stoichiometries of the jth reactant and the jth product in the ith reaction, respectively. Using matrix notation we can write the system of equations as

P X → Q X,    (3.3)

where P = (p_{ij}) and Q = (q_{ij}) are (v × d)-dimensional matrices. Now, if the ith reaction occurs, the number of molecules x^{(j)} of the jth species decreases by p_{ij} and increases by q_{ij}, yielding an overall change of a_{ij} = q_{ij} − p_{ij}. The reaction effects of the network can then be summarized by the (v × d)-dimensional so-called net effect matrix A = Q − P = (a_{ij}). Consequently, the state of the system x = (x^{(1)}, . . . , x^{(d)}), defined as the number of molecules of each species currently present in the system, changes after an occurrence of reaction i to

x′ = x + A_i,    (3.4)

where A_i denotes the ith row of the net effect matrix A. The reaction system defining the possible jumps between the states therefore represents a special instance of a Markov jump process. In order to fully specify the system we have to define the hazard rates associated with each reaction. The hazard rate of reaction i is given by the function h_i(x, c), which depends on the current state of the system and some vector of rate parameters c = (c_{i1}, . . . , c_{il}) with l ≥ 1. We can interpret the hazard function in the following way: given that the system is in state x at time t, the probability of the reaction R_i occurring in the time interval (t, t + dt] is given by h_i(x, c) dt + o(dt). This interpretation readily implies that, provided no other reactions are taking place, the waiting time to a reaction event R_i is exponentially distributed with the parameter given by the hazard function, Exp(h_i(x, c)).

3.2 Mass action kinetics

In the case of elementary chemical reactions, one usually assumes the framework of mass action kinetics. Under only weak assumptions, namely that the system is well mixed in a container of some fixed volume and in thermal equilibrium, it can be shown [26] that the reaction hazard of molecules colliding is in this case governed by a constant (scalar) rate constant c = c_i. Since mass action kinetic reactions play a prominent role in our empirical study, we now provide some further detail. We stress, however, that the algorithms discussed in this thesis are not constrained to the framework of mass action kinetics, but work with arbitrary rate laws, as we will in fact demonstrate with our second example process, the gene autoregulatory network given below.


Zeroth-order reactions. Consider a reaction of the form

R_i : ∅ → X_j

The zeroth-order reaction is used to model a constant rate of production (or influx from another compartment) of species X_j. Since in this case the reaction hazard equals c_i, we have

h_i(x, c_i) = c_i

First-order reactions. A first-order (unimolecular) reaction is defined as

R_i : X_j → ?

where a particular molecule of species X_j acts as the reactant, the hazard rate of this event being c_i. The question mark in the above relation indicates that the specifics of the right hand side are not important. If there are currently x^{(j)} molecules in the system, the combined hazard of one of the molecules reacting is simply the sum of the individual hazards of each specific molecule reacting, since these events are mutually exclusive. Hence

h_i(x, c_i) = c_i x^{(j)}

Second-order reactions. A second-order (bimolecular) reaction is of the form

R_i : X_j + X_k → ?

where a particular pair of molecules of species X_j and X_k is involved in a reaction. Since there are x^{(j)} and x^{(k)} molecules of species X_j and X_k, respectively, in existence, a simple combinatorial argument shows that there are x^{(j)} x^{(k)} different pairs of molecules, resulting in a combined hazard

h_i(x, c_i) = c_i x^{(j)} x^{(k)}

In the special case where j = k,

R_i : 2X_j → ?

there are only x^{(j)}(x^{(j)} − 1)/2 possible pairs of molecules of type X_j, and we therefore arrive at the hazard rate

h_i(x, c_i) = c_i x^{(j)}(x^{(j)} − 1)/2.

Higher-order reactions. Although one could easily extend the combinatorial reasoning to higher-order reactions, reactions of order greater than two are generally represented as the combined effect of lower-order reactions. The examples employed in this thesis are at most of order two.


From the above description it is easily seen that hazard rates following the laws of mass action kinetics can generally be written in the form

h_j(x, c_j) = c_j g_j(x),    (3.5)

where g_j(x) denotes the number of possible reactions of type R_j given the state of the system x.
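As an illustration, the following Python sketch evaluates mass action hazards for a small hypothetical two-species network; the stoichiometry, rate constants and reaction choices are made up for this example and do not correspond to the thesis' example processes.

import numpy as np

# Hypothetical example network:
#   R1: 0        -> X1          zeroth order,   h1 = c1
#   R2: X1 + X2  -> 2 X2        second order,   h2 = c2 * x1 * x2
#   R3: X2       -> 0           first order,    h3 = c3 * x2
A = np.array([[ 1,  0],     # net effect of R1 on (x1, x2)
              [-1,  1],     # net effect of R2
              [ 0, -1]])    # net effect of R3

def hazards(x, c):
    # Mass action hazards h_j(x, c_j) = c_j * g_j(x) for the example network.
    x1, x2 = x
    return np.array([c[0],             # g_1(x) = 1
                     c[1] * x1 * x2,   # g_2(x) = x1 * x2
                     c[2] * x2])       # g_3(x) = x2

print(hazards(np.array([10, 5]), c=np.array([0.5, 0.01, 0.1])))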

3.3 Continuous Time Markov process

It can be easily seen that a reaction model belongs to the class of Markov jump processes. Specifically, a reaction model with associated net effect matrix A and hazard rate vector h(x, c) corresponds to an MJP with process rates

f(x′ | x) = ∑_{j=1}^{v} δ_{A_j}(x′ − x) h_j(x, c_j)    (3.6)

Hence, we can equivalently express the master equation in terms of A and h(x, c). In analogy to the derivation of the classic master equation (1.6), the probability of attaining state x = (x^{(1)}, . . . , x^{(d)}) at time t + ∆t with either zero or one reaction occurring in the interval [t, t + ∆t) can be expressed as:

p(x; t+∆t) = ∑_{j=1}^{v} h_j(x − A_j, c_j) p(x − A_j; t) ∆t + ( 1 − ∑_{j=1}^{v} h_j(x, c_j) ∆t ) p(x; t) + o(∆t),    (3.7)

where A_j denotes the jth row of the net effect matrix A. By rearranging and taking the limit ∆t → 0, we arrive at the master equation:

∂_t p(x; t) = ∑_{j=1}^{v} { h_j(x − A_j, c_j) p(x − A_j; t) − h_j(x, c_j) p(x; t) }.    (3.8)

Further details can be found in [46]. For the reaction models considered in this thesis, the master equation cannot be solved analytically. By taking a slightly different perspective, it is, however, possible to analyze the system with the help of so-called discrete event simulation algorithms. This approach, which was introduced in [25] and named after its author, allows us to simulate process paths over a specified time interval.

3.4 Gillespie algorithm

Assume a reaction system consisting of r reactions with associated hazard rates h_j(x, c) for j = 1, . . . , r, implying that the hazard of some reaction occurring is h_0(x, c) = ∑_{j=1}^{r} h_j(x, c). Now suppose that we would like to sample a trajectory from a stochastic reaction process on the interval [0, T]. We define t_0 = 0 and t_1 = T. The key idea in understanding the simulation algorithm is that instead of directly focusing on the probability p(x_{t_0:t_1} | x_0, c) as implied by the master equation, we consider the joint probability of the time to the next event and the type of event, given some state x_t at time t. Specifically we write

p(τ, ν_i | x_t, c) dτ,    (3.9)

which defines the probability that, the system being in state x_t, the next reaction will occur in the infinitesimal time interval [t + τ, t + τ + ∆τ) and will be a reaction of type R_{ν_i}, where 0 ≤ τ < T − t and ν_i ∈ {1, . . . , r}. To derive an explicit expression for the above probability, we start by dividing the interval [t, t + τ + ∆τ) into k + 1 subintervals, with [t, t + τ) being divided into k equidistant intervals of length ε = τ/k plus one final interval of length ∆τ. Hence, the probability p(τ, ν_i | x_t, c) ∆τ can be interpreted as the event of no reaction occurring in the first k subintervals and one event of type ν_i occurring in the final subinterval. Therefore we get

p(τ, ν_i | x_t, c) ∆τ = (1 − h_0(x_t, c) ε + o(ε))^k (h_{ν_i}(x_t, c) ∆τ + o(∆τ))    (3.10)

Dividing through by ∆τ and taking the limit ∆τ → 0 yields

p(τ, ν_i | x_t, c) = (1 − h_0(x_t, c) ε + o(ε))^k h_{ν_i}(x_t, c)
                   = ( 1 − [h_0(x_t, c) τ + τ o(ε)/ε] / k )^k h_{ν_i}(x_t, c),    (3.11)

where we use the fact that kε = τ in the last line. After taking the limit k → ∞ and substituting the result

lim_{k→∞} ( 1 − [h_0(x_t, c) τ + τ o(ε)/ε] / k )^k = exp{−h_0(x_t, c) τ}    (3.12)

for the first term on the right hand side, we finally arrive at

p(τ, ν_i | x_t, c) = h_0(x_t, c) exp{−h_0(x_t, c) τ} · h_{ν_i}(x_t, c) / h_0(x_t, c).    (3.13)

By closely inspecting this equation, we see that the random variables of the time to the next event and the type of event are actually independent, with the former being exponentially distributed with parameter h_0(x_t, c) and the latter following a discrete distribution with probabilities h_{ν_i}(x_t, c)/h_0(x_t, c). Since sampling from these distributions is straightforward, we can easily simulate a trajectory by repeatedly evaluating equation (3.13) over the interval [0, T]. We summarize the sampling algorithm, which consists of the following steps:


Gillespie algorithm

Initialize the system at time t = 0 with parameters c = (c_1, . . . , c_r) and initial population X^{(1)}_0, . . . , X^{(d)}_0
while t < T
    for j = 1, . . . , r do
        Compute h_j(X_t, c) based on the current state X_t
    end for
    Compute the total hazard h_0(X_t, c) = ∑_{j=1}^{r} h_j(X_t, c)
    Sample the time to the next event t′ ∼ Exp(h_0(X_t, c)) and set t := min(t + t′, T)
    if t ≠ T
        Simulate the reaction index j from the discrete distribution with probabilities
            ( h_1(X_t, c)/h_0(X_t, c), . . . , h_r(X_t, c)/h_0(X_t, c) )
        Update X_t := X_t + A_j, with A_j denoting the jth row of the net effect matrix A
    end if
    Output t and X_t
end while

Since there are no approximations involved in this algorithm, the resulting trajectories are exact realizations of the underlying stochastic reaction process. Furthermore, the Gillespie algorithm framework conveniently allows us to develop an analytical expression for the likelihood of a given path.
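A compact Python sketch of the Gillespie algorithm for an arbitrary network specified through a hazard function and a net effect matrix is given below; it reuses the hypothetical hazards/A example from Section 3.2 and its function interface is an illustrative assumption.

import numpy as np

def gillespie(x0, c, hazards, A, T, rng=None):
    # Simulate one exact trajectory of the reaction process on [0, T].
    # Returns the jump times and the states immediately after each jump.
    rng = np.random.default_rng() if rng is None else rng
    t, x = 0.0, np.array(x0, dtype=int)
    times, states = [t], [x.copy()]
    while True:
        h = hazards(x, c)                     # hazard of each reaction in the current state
        h0 = h.sum()                          # total hazard
        if h0 <= 0.0:                         # no reaction can fire any more
            break
        t += rng.exponential(1.0 / h0)        # waiting time ~ Exp(h0)
        if t >= T:
            break
        j = rng.choice(len(h), p=h / h0)      # reaction type with probabilities h_j / h0
        x = x + A[j]                          # apply the net effect of reaction j
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# times, states = gillespie(x0=[10, 5], c=[0.5, 0.01, 0.1], hazards=hazards, A=A, T=10.0)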

3.5 Inference on complete path

Assume that we have the complete sample path X_t on some time interval [0, T] at our disposal, implying that we know the time τ_i and type ν_i ∈ {1, . . . , v} of each transition in the time interval. If we assume a total number of r = ∑_{j=1}^{v} r_j transitions in the interval, with r_j denoting the number of transitions of the jth type, we can order the transition pairs (τ_i, ν_i), i = 1, . . . , r, with respect to time: 0 := τ_0 < τ_1 < . . . < τ_r < τ_{r+1} := T. For later reference we write the r_j time points corresponding to occurrences of the jth reaction as T_j := {τ^{(j)}_1, . . . , τ^{(j)}_{r_j}}.

In order to derive the likelihood of the process on the path x_{0:T}, we revisit the construction of a sample path generated via the Gillespie algorithm. Given an event at time τ_{i−1}, the arrival time of the next event τ_i (provided τ_i < T) is exponentially distributed with density h_0(x_{τ_{i−1}}, c) exp{−h_0(x_{τ_{i−1}}, c)[τ_i − τ_{i−1}]}, whereas the type of the next event follows a discrete distribution with probabilities h_{ν_i}(x_{τ_{i−1}}, c_{ν_i})/h_0(x_{τ_{i−1}}, c). Since the waiting time and the type of event together constitute the ith event, its likelihood is given by their joint density:

exp{−h_0(x_{τ_{i−1}}, c)[τ_i − τ_{i−1}]} h_{ν_i}(x_{τ_{i−1}}, c_{ν_i}).    (3.14)


Finally we have to account for the information that in the time interval between the last event and the end point of the observation window, (τ_r, T], no event occurs, the associated density being

exp{−h_0(x_{τ_r}, c)[T − τ_r]}.

The full likelihood L(c; x_{0:T}) over the interval [0, T] can now be written as a product over all events, including the final waiting time until the interval expires:

L(c; x_{0:T}) = { ∏_{i=1}^{r} h_{ν_i}(x_{τ_{i−1}}, c_{ν_i}) exp{−h_0(x_{τ_{i−1}}, c)[τ_i − τ_{i−1}]} } × exp{−h_0(x_{τ_r}, c)[T − τ_r]}
             = { ∏_{i=1}^{r} h_{ν_i}(x_{τ_{i−1}}, c_{ν_i}) } exp{−∑_{i=1}^{r+1} h_0(x_{τ_{i−1}}, c)[τ_i − τ_{i−1}]}
             = { ∏_{i=1}^{r} h_{ν_i}(x_{τ_{i−1}}, c_{ν_i}) } exp{−∑_{i=0}^{r} h_0(x_{τ_i}, c)[τ_{i+1} − τ_i]}
             = { ∏_{i=1}^{r} h_{ν_i}(x_{τ_{i−1}}, c_{ν_i}) } exp{−∫_0^T h_0(x_t, c) dt},    (3.15)

where the integral expression in the last line follows directly from the piecewise-constant nature of the hazard function. By rewriting equation (3.15) as

$$L(c; x_{0:T}) = \prod_{l=1}^{v} \left[\left\{\prod_{k=1}^{r_l} h_l\big(x_{\tau_k^{(l)}}, c_l\big)\right\} \exp\Big\{-\int_0^T h_l(x_t, c_l)\, dt\Big\}\right], \qquad (3.16)$$

we can easily deduce that the likelihood factorizes over the parameter vectors $c_l$ of the associated hazard functions $h_l(x_t, c_l)$, implying that inference can be carried out separately for each parameter vector.

In the case of hazard rates which follow the laws of simple mass action kinetics, we can further simplify the likelihood by recalling that the hazard function of the associated reaction $j \in \{1, \ldots, v\}$ is of the form $h_j(x, c_j) = c_j g_j(x)$. Substituting this expression into equation (3.16), the likelihood with respect to the (scalar) parameter $c_j$ can be written as

$$L(c_j; x_{0:T}) \propto \left\{\prod_{k=1}^{r_j} h_j\big(x_{\tau_k^{(j)}}, c_j\big)\right\} \exp\Big\{-\int_0^T h_j(x_t, c_j)\, dt\Big\} \propto c_j^{r_j} \exp\Big\{-c_j \int_0^T g_j(x_t)\, dt\Big\}, \qquad (3.17)$$

where we define $L(c_j; x_{0:T}) := c_j^{r_j} \exp\big\{-c_j \int_0^T g_j(x_t)\, dt\big\}$ as the so-called component likelihood. By examining equation (3.17) we can easily deduce that,


given a hazard function governed by mass action kinetics, the statistic

$$s_j(x_{0:T}) = \Big(r_j,\; \int_0^T g_j(x_t)\, dt\Big) \qquad (3.18)$$

is sufficient for the parameter $c_j$. Furthermore, the sufficient statistic is additive with respect to time, i.e. $s_j(x_{0:T}) = s_j(x_{0:T_1}) + s_j(x_{T_1:T})$ for $0 < T_1 < T$. To show this, we assume that in the interval $(0, T_1]$ there are $r_{T_1} = \sum_{j=1}^{v} r_{jT_1}$ transitions occurring, whereas the remaining $r_T = \sum_{j=1}^{v} (r_j - r_{jT_1})$ transitions happen in the latter interval $(T_1, T]$. Then by the Markov property of the process $X_{0:T}$ we get

$$p(x_{0:T} \mid x_0, c_j) = p(x_{0:T_1} \mid x_0, c_j)\, p(x_{T_1:T} \mid x_{T_1}, c_j) \propto c_j^{(r_{jT_1} + (r_j - r_{jT_1}))} \exp\Big\{-c_j\Big(\int_0^{T_1} g_j(X_t)\, dt + \int_{T_1}^{T} g_j(X_t)\, dt\Big)\Big\}. \qquad (3.19)$$
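Because the sampled path is piecewise constant between events, the integral in (3.18) reduces to a finite sum. The following small helper is a sketch of how the sufficient statistic could be accumulated from an event-time representation of a path, e.g. as produced by the Gillespie sketch above; the argument layout is our own convention.

```python
import numpy as np

def sufficient_statistic(times, states, nu, g_j, j, T):
    """Compute s_j(x_{0:T}) = (r_j, integral of g_j(x_t) dt) for reaction j.

    times  : event times tau_0 = 0 < tau_1 < ... < tau_r
    states : states right after each event (same length as times)
    nu     : types of the events 1, ..., r (length r)
    g_j    : function x -> g_j(x) for the mass-action hazard h_j(x, c_j) = c_j g_j(x)
    T      : right end point of the observation interval
    """
    r_j = sum(1 for v in nu if v == j)
    grid = list(times) + [T]
    # piecewise-constant path: the integral is a sum of g_j(state) * holding time
    integral = sum(g_j(states[i]) * (grid[i + 1] - grid[i]) for i in range(len(times)))
    return r_j, integral
```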

3.6 Proposal distribution approximation

In order to utilize the particle filtering framework for stochastic reaction processes, we need to provide an expression for the optimal proposal distribution $p\big(x_{t_{n-1}:t_n} \mid y_{t_n}, X^{(i)}_{t_{n-1}}, \theta\big)$ on a time interval $(t_{n-1}, t_n]$. Since this distribution is analytically intractable, we have to specify an appropriate approximation $q\big(x_{t_{n-1}:t_n} \mid y_{t_n}, X^{(i)}_{t_{n-1}}, \theta\big)$. To this end, we start by deriving an approximation to the master equation (3.8) using a stochastic differential equation and subsequently show how to integrate this approximation into a Markov jump process approximation which preserves the discrete nature of the underlying stochastic reaction process.

3.6.1 Diffusion approximation

By assuming small jumps in the above Markov process and a small change of the function $p(x; t)$ with respect to $x$, we can invoke a second order Taylor approximation around the first term of the right hand side of (3.8), which leads to the Fokker-Planck equation [46] for a $d$-dimensional process $x_t$:

$$\frac{\partial}{\partial t} p(x; t) = -\sum_{i=1}^{d} \frac{\partial}{\partial x^{(i)}}\{f_i(x)\, p(x; t)\} + \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} \frac{\partial^2}{\partial x^{(i)}\partial x^{(j)}}\{D_{ij}(x)\, p(x; t)\} \qquad (3.20)$$

with infinitesimal conditional mean $f_i(x)$, for $i = 1, \ldots, d$:

$$f_i(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} E\Big[\big\{x^{(i)}_{t+\Delta t} - x^{(i)}_t\big\} \,\Big|\, x_t = x\Big] \qquad (3.21)$$


and infinitesimal conditional second moments $D_{ij}(x)$, for $i, j = 1, \ldots, d$:

$$D_{ij}(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \operatorname{Cov}\Big[\big\{x^{(i)}_{t+\Delta t} - x^{(i)}_t\big\}, \big\{x^{(j)}_{t+\Delta t} - x^{(j)}_t\big\} \,\Big|\, x_t = x\Big] \qquad (3.22)$$

Now it can be shown that under some mild conditions the solution of the Fokker-Planck equation satisfies the following Ito stochastic differential equation (SDE)

$$dx_t = f(x_t)\, dt + D^{1/2}(x_t)\, dW_t, \qquad (3.23)$$

where $f(x_t)$ is the column vector with entries $f_i(x_t)$, $D^{1/2}(x_t)$ is a $d \times d$ matrix satisfying $D^{1/2}\big(D^{1/2}\big)' = [D_{ij}(x_t)] = D(x_t)$ and $dW_t = (dW_1, \ldots, dW_d)'$ denotes the increment of a standard Brownian motion. The vector $f(x_t)$ is also called the drift vector and the matrix $D(x_t) = D^{1/2}\big(D^{1/2}\big)'$ the diffusion matrix. An outline of the proof for the one-dimensional case is provided in Appendix C.1; for the general case see for example [12]. Since the SDE has no analytic solution, one has to resort to a suitable approximation scheme.

3.6.2 Euler discretization

One way to approximate the above SDE is by discretizing the time axis into sufficiently small intervals $\Delta t$ and assuming that the moment functions are constant between subsequent time points. This results in the so-called Euler approximation [33], which is given by

$$\Delta x_t = f(x_t)\, \Delta t + D^{1/2}(x_t)\, \Delta Z_t \qquad (3.24)$$

with $\Delta Z_t \sim N(0, \Delta t\, I)$ being an iid vector of Gaussian distributed random variables. Equivalently we can write the approximation as a transition probability:

$$p_{\Delta t}(x_{t+\Delta t} \mid x_t, \theta) = N\big(x_t + \Delta t\, f(x_t),\; \Delta t\, D(x_t)\big) \qquad (3.25)$$

Assuming that we have a constant reaction hazard, the number of reactions of a given type in a sufficiently small time interval $(t, t + \Delta t]$ is approximately Poisson-distributed. Specifically, for a given state $x_t = \big(x^{(1)}_t, \ldots, x^{(d)}_t\big)'$ and corresponding hazards $R_1 = h_1(x, c_1), \ldots, R_\nu = h_\nu(x, c_\nu)$, the infinitesimal change in the number of molecules $x^{(j)}$ is given by

$$x^{(j)}_{t+\Delta t} - x^{(j)}_t = a_{1j} N_1 + a_{2j} N_2 + \ldots + a_{\nu j} N_\nu. \qquad (3.26)$$

We know that the number of type $i$ reactions, $N_i \sim \mathcal{P}(h_i(x, c_i)\, \Delta t)$, is Poisson distributed with mean and variance $h_i(x, c_i)\, \Delta t$. After considering the increments $x^{(j)}_{t+\Delta t} - x^{(j)}_t$ and applying the Poisson approximation in the drift vector (3.21) and diffusion matrix (3.22), we can derive the following expressions:

$$f(x_t, \theta) = A' h(x_t, \theta) \qquad (3.27)$$
$$D(x_t, \theta) = A' \operatorname{diag}\{h(x_t, \theta)\}\, A. \qquad (3.28)$$


Consequently we end up with the SDE

$$dx_t = A' h(x_t, \theta)\, dt + \big(A' \operatorname{diag}\{h(x_t, \theta)\}\, A\big)^{1/2}\, dW_t, \qquad (3.29)$$

where $\theta := c = (c_1, \ldots, c_\nu)$ contains the parameters, $A$ is the $\nu \times d$ net effect matrix and $h(x_t, \theta)$ is the column vector of the hazards $h_i(x_t, c_i)$. Applying the Euler approximation to the respective transition probability leads to the following Gaussian distribution

$$p_{T-t}(x_T \mid x_t, \theta) \approx N\big(x_T \mid x_t + (T - t) f(x_t),\; (T - t) D(x_t)\big). \qquad (3.30)$$

Instead of directly resorting to the Fokker-Planck approximation, we integrate this diffusion approach into a jump process approximation, which preserves the discrete-valued nature of the underlying biochemical reaction process. Ideally we would like to find a discrete-valued and time-continuous approximation which is at once easy to sample from and reasonably close to the dynamics of the true underlying stochastic reaction process. The following section extends an idea proposed in [23].
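The quantities in (3.27)-(3.30) are simple matrix expressions, so a short sketch may help to fix ideas. The helper names and the signature of `hazards` are our own assumptions, consistent with the Gillespie sketch above.

```python
import numpy as np

def drift_diffusion(x, theta, hazards, A):
    """Drift f(x) = A' h(x, theta) and diffusion D(x) = A' diag{h(x, theta)} A,
    cf. equations (3.27) and (3.28); hazards(x, theta) returns the hazard vector."""
    h = hazards(x, theta)
    f = A.T @ h
    D = A.T @ np.diag(h) @ A
    return f, D

def euler_transition(x, theta, hazards, A, dt):
    """Mean and covariance of the Euler (Gaussian) approximation to the transition
    density over a time step of length dt, cf. equation (3.30)."""
    f, D = drift_diffusion(x, theta, hazards, A)
    return x + dt * f, dt * D
```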

3.6.3 Markov jump process proposal approximation

Let $X_{T_1:T_2}$ be a $d$-dimensional biochemical reaction system with associated reactions $R_1, \ldots, R_\nu$, defined on some interval $(T_1, T_2]$, which could for example be the subinterval $(T_1 = t_n, T_2 = t_{n+1}]$. The system parameters are given by $c = (c_1, \ldots, c_\nu)$ and the $\nu \times d$ dimensional net effect matrix by $A$. We assume that the parameters $c$ and the state of the process at time $T_1 \geq 0$, $X_{T_1} = x_{T_1}$, are fixed and known, and we are given a noisy observation $y_{T_2} \sim N(\cdot \mid x_{T_2}, \sigma I)$ of the state $X_{T_2}$ at the final time point of the interval. For the following discussion we will rewrite the transition rates $h_i(x, c_i)$ in terms of the state $x$ immediately before and the state $x'$ immediately after the transition. We get

$$\lambda(x, x') = \sum_{i=1}^{\nu} \delta_{x'-x,\, A'_i}\, h_i(x, c_i) \qquad (3.31)$$

for $x' \neq x$. Our aim is to sample a trajectory of the process which takes the observation $y_{T_2}$ into account. We start by noting that instead of considering the likelihood of the trajectory $p(x_{T_1:T_2} \mid x_{T_1}, c)$, we can equivalently consider the infinitesimal conditional distributions at each time point $t \in (T_1, T_2]$,

$$\lim_{\Delta t \to 0} \frac{\Pr(X_{t+\Delta t} = x' \mid X_t = x, c)}{\Delta t} = \lambda(x, x'), \quad T_1 \leq t \leq T_2, \qquad (3.32)$$

which are by definition the transition rates of the process. In order to construct the optimal proposal, we further condition the distribution (3.32) on the observation $y_{T_2}$. This results in a time-inhomogeneous process with transition


rates

$$\lambda_t(x, x') = \lim_{\Delta t \to 0} \frac{\Pr(X_{t+\Delta t} = x' \mid X_t = x, y_{T_2}, c)}{\Delta t} = \lim_{\Delta t \to 0} \frac{p_{\Delta t}(x' \mid x, c)}{\Delta t}\, \frac{p_{T_2-t-\Delta t}(y_{T_2} \mid x', x, c)}{p_{T_2-t}(y_{T_2} \mid x, c)} = \lambda(x, x') \lim_{\Delta t \to 0} \frac{p_{T_2-t-\Delta t}(y_{T_2} \mid x', x, c)}{p_{T_2-t}(y_{T_2} \mid x, c)}, \quad T_1 \leq t \leq T_2, \qquad (3.33)$$

where $p_\Delta(x' \mid x, c)$ denotes the transition density of moving from state $x$ to state $x'$ in time $\Delta$. Because $X_{T_1:T_2}$ is a Markov process, the probability $p_{T_2-t-\Delta t}(y_{T_2} \mid x', x, c)$ does not depend on the state $x$ of the process immediately before the transition. Specifically:

$$p_{T_2-t-\Delta t}(y_{T_2} \mid x', x, c) = \int p(y_{T_2} \mid x_{T_2})\, p_{T_2-t-\Delta t}(x_{T_2} \mid x', x, c)\, dx_{T_2} = \int p(y_{T_2} \mid x_{T_2})\, p_{T_2-t-\Delta t}(x_{T_2} \mid x', c)\, dx_{T_2} = p_{T_2-t-\Delta t}(y_{T_2} \mid x', c) \qquad (3.34)$$

Therefore the expression for the transition rates simplifies to

$$\lambda_t(x, x') = \lambda(x, x') \lim_{\Delta t \to 0} \frac{p_{T_2-t-\Delta t}(y_{T_2} \mid x', c)}{p_{T_2-t}(y_{T_2} \mid x, c)} = \lambda(x, x')\, \frac{p_{T_2-t}(y_{T_2} \mid x', c)}{p_{T_2-t}(y_{T_2} \mid x, c)}. \qquad (3.35)$$

Thus in order to compute the optimal proposal we have to evaluate the transition density

$$p_{T_2-t}(y_{T_2} \mid X_t, c) = \int p(y_{T_2} \mid x_{T_2}, c)\, p_{T_2-t}(x_{T_2} \mid X_t, c)\, dx_{T_2}. \qquad (3.36)$$

Due to the analytic intractability of $p_{T_2-t}(x_{T_2} \mid X_t, c)$ we need to resort to an approximation. Our approach is to use a Fokker-Planck approximation and approximate the resulting diffusion in turn by the Euler method, as outlined above. Applying this machinery to our reaction system results in the following drift vector $f(X)$ and diffusion matrix $D(X)$:

$$f(X) = A' h(X, c) \qquad (3.37)$$
$$D(X) = A' \operatorname{diag}\{h(X, c)\}\, A. \qquad (3.38)$$

Hence, the Euler approximation to the transition probability leads to the following Gaussian distribution

$$p_{T_2-t}(x_{T_2} \mid X_t, c) \approx N\big(x_{T_2} \mid X_t + (T_2 - t) f(X_t),\; (T_2 - t) D(X_t)\big). \qquad (3.39)$$


By combining this approximation with the normally distributed measurement error, equation (2.8), and marginalizing over the latent end state $X_{T_2}$, we finally arrive at

$$p_{T_2-t}(y_{T_2} \mid X_t, c) \approx \int N(y_{T_2} \mid x_{T_2}, \sigma I)\, N\big(x_{T_2} \mid X_t + (T_2 - t) f(X_t),\; (T_2 - t) D(X_t)\big)\, dx_{T_2} = N\big(y_{T_2} \mid X_t + (T_2 - t) f(X_t),\; (T_2 - t) D(X_t) + \sigma I\big), \qquad (3.40)$$

where the last line follows from standard properties of the multivariate normal distribution, see for example [7].

Now the time-inhomogeneity of the resulting process with transition rates $\lambda_t(x, x')$ renders direct sampling intractable. In order to overcome this problem, we have to resort to another approximation. Specifically, assume that we are at an event time $t$ and want to sample the time to the next event $t'$. In this case we first calculate the approximation to $p_{T_2-t}(y_{T_2} \mid X, c)$ for the state $X = X_t$ itself as well as for all states which are feasible after applying one of the $\nu$ reactions to the current state $X_t$, i.e. $X = X_t + A'_j$ with $j = 1, \ldots, \nu$. We then evaluate the resulting approximations to $\lambda_t(X_t, X)$ for all $\nu$ possible transitions. Finally, we simulate the time to the next event $t'$ as well as the type of the event at $t'$ with the transition rates fixed. The resulting sampling scheme then closely resembles the Gillespie algorithm, allowing for a straightforward simulation and easy evaluation of the resulting densities. The sampling on the time interval $(T_1, T_2]$ proceeds as follows:

MJP proposal sampling algorithm

Input: parameters $c = (c_1, \ldots, c_\nu)$, start state $x_{T_1}$, observation $y_{T_2}$
Initialize: time $t = T_1$, current state $X_c = x_{T_1}$
while $(t < T_2)$
    Calculate the approximation to $p_{T_2-t}(y_{T_2} \mid X_c, c)$, eqn. (3.40)
    for $i = 1, \ldots, \nu$ do
        Determine the state $x_i$ reached after applying the $i$th reaction to the current state $X_c$, see rhs. of eqn. (3.4)
        Calculate the approximation to $p_{T_2-t}(y_{T_2} \mid x_i, c)$, eqn. (3.40)
        Calculate the approximation to the transition rate $\lambda_t(X_c, x_i)$, eqn. (3.35)
    end for
    Sample the time to the next event $t'$ from an exponential distribution:
        $t' \sim \mathrm{Exp}\big(\lambda_t(X_c)\big)$, with $\lambda_t(X_c) = \sum_{i=1}^{\nu} \lambda_t(X_c, x_i)$
    Increment time: $t := t + t'$
    if $(t < T_2)$
        Sample the transition at time $t$ from a discrete distribution:
        $x' \sim \mathrm{Disc}\big(\lambda_t(X_c, x_1)/\lambda_t(X_c), \ldots, \lambda_t(X_c, x_\nu)/\lambda_t(X_c)\big)$


        Update state: $X_c := x'$
    end if
end while
Output: sample path $X_{T_1:T_2} = x_{T_1:T_2}$
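The following Python sketch mirrors the algorithm above, under the assumptions already used in the earlier sketches (a `hazards` function, a net effect matrix `A`, and an observation noise covariance $\sigma I$). It also accumulates the log-density of the sampled path under the proposal, which will be needed for the importance weight below; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mjp_proposal_sample(x_T1, y_T2, c, hazards, A, T1, T2, sigma,
                        rng=np.random.default_rng()):
    """Sample one path from the approximate conditioned (MJP) proposal on (T1, T2]."""

    def log_p_y_given_x(x, dt):
        # Euler/Gaussian approximation to p_{T2-t}(y_T2 | x, c), equation (3.40)
        h = hazards(x, c)
        f = A.T @ h
        D = A.T @ np.diag(h) @ A
        cov = dt * D + sigma * np.eye(len(x))
        return multivariate_normal.logpdf(y_T2, mean=x + dt * f, cov=cov)

    t, x = T1, np.asarray(x_T1, dtype=float)
    times, states, types, log_q = [t], [x.copy()], [], 0.0
    while t < T2:
        dt_rem = T2 - t
        log_denom = log_p_y_given_x(x, dt_rem)
        h = hazards(x, c)
        # approximate conditioned rates lambda_t(x, x_i), equation (3.35)
        rates = np.array([h[i] * np.exp(log_p_y_given_x(x + A[i], dt_rem) - log_denom)
                          for i in range(len(h))])
        total = rates.sum()
        if total <= 0.0:
            break                                  # no further events possible
        wait = rng.exponential(1.0 / total)
        if t + wait < T2:
            log_q += np.log(total) - total * wait   # waiting-time density
            i = rng.choice(len(rates), p=rates / total)
            log_q += np.log(rates[i] / total)       # event-type probability
            t, x = t + wait, x + A[i]
            times.append(t); states.append(x.copy()); types.append(i)
        else:
            log_q += -total * (T2 - t)              # no event before T2
            t = T2
    return times, states, types, log_q
```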

By defining the Markov jump process approximation as our proposal density in the sequential Monte Carlo framework, we can utilize the algorithm outlined above in order to propose new particles on the time interval $(T_1, T_2]$. Recall that the application of the SMC algorithm requires the evaluation of the weights with respect to the sampled path, which can be interpreted in the Radon-Nikodym sense as a ratio of measures [18]:

$$w_n(x_{T_1:T_2}) = \frac{p(x_{T_1:T_2} \mid x_{T_1}, c)}{q(x_{T_1:T_2} \mid y_{T_2}, x_{T_1}, c)}\; p(y_{T_2} \mid x_{T_2}, c). \qquad (3.41)$$

As we have already seen, the likelihood $p(x_{T_1:T_2} \mid x_{T_1}, c)$ is given by equation (3.15) and the measurement error $p(y_{T_2} \mid x_{T_2}, c)$ by equation (2.8). In order to determine the likelihood of the sampled path under the proposal distribution $q(x_{T_1:T_2} \mid y_{T_2}, x_{T_1}, c)$, we note that the proposal sampling algorithm differs from the Gillespie algorithm insofar as the true transition rates $\lambda(x, x')$ are replaced by the approximation $\lambda_t(x, x')$ to the inhomogeneous process at the transition times $t$. Hence, if we assume a total number of $r = \sum_{j=1}^{\nu} r_j$ transitions in the interval $(T_1, T_2]$, with $r_j$ denoting the number of transitions of the $j$th type, and an ordering of the transition times $T_1 := \tau_0 < \tau_1 < \ldots < \tau_r < \tau_{r+1} := T_2$, we get (compare to equation (3.15)):

$$q(x_{T_1:T_2} \mid y_{T_2}, x_{T_1}, c) = \left\{\prod_{i=1}^{r} \lambda_{\tau_i}\big(x_{\tau_i-}, x_{\tau_i}\big)\right\} \exp\Big\{-\int_{T_1}^{T_2} \lambda_t(x_t)\, dt\Big\}, \qquad (3.42)$$

where $x_{\tau_i-}$ denotes the state of the path immediately before the transition at time $\tau_i$.

Integrating the discrete diffusion approximation into the sequential Monte Carlo framework now allows us to compute an approximation to the filtering density in biochemical reaction systems. In accordance with the notation used in the chapter on sequential Monte Carlo methods, let $X_{t_0:t_N}$ be a $d$-dimensional reaction system with associated reactions $R_1, \ldots, R_\nu$ and parameter vector $\theta := c = (c_1, \ldots, c_\nu)$. We are given $N + 1$ observations $y = (y_{t_0}, y_{t_1}, \ldots, y_{t_N})$ at equidistant time points $0 = t_0 < t_1 < \ldots < t_N = T$. Assume that we have $M$ equally weighted particles $\big\{\frac{1}{M}, X_{[0,t_{n-1}]}\big\}$ on the time interval $[0, t_{n-1}]$ at our disposal. We simulate for each particle a trajectory $X^{(i)}_{t_{n-1}:t_n}$ on the subinterval $(t_{n-1}, t_n]$,

$$X^{(i)}_{t_{n-1}:t_n} \sim q\big(\cdot \mid y_{t_n}, X^{(i)}_{t_{n-1}}, c\big),$$

by applying the above MJP proposal sampling algorithm with start value $X^{(i)}_{t_{n-1}}$, observation $y_{t_n}$ and parameter vector $c$. For each particle, now extended to


the interval $[0, t_n]$, we use equation (3.41) to evaluate the associated weight $W^{(i)}_n \propto w_n\big(X^{(i)}_{t_{n-1}:t_n}\big)$. After normalizing the weights we apply the resampling step, which results in a set of $M$ equally weighted particles $\big\{\frac{1}{M}, X_{[0,t_n]}\big\}$ on the interval $[0, t_n]$.

Now, assume that we are trying to generate a sample path $X_{(t_{n-1}, t_n]}$ on the subinterval $(t_{n-1}, t_n]$ by using our MJP proposal approximation algorithm. Then the quality of the Euler approximation as utilized inside the algorithm differs according to the distance between the respective current sampling time $t$ and the time point of the next observation $t_n$. Specifically, at the beginning of the subinterval, $t_{n-1}$, we are approximating a diffusion of length $(t_n - t_{n-1})$ by $p_{t_n - t_{n-1}}(y_{t_n} \mid X_c = x_{t_{n-1}}, c)$. As we are reducing the distance to $t_n$ with each jump, the remaining diffusion shortens steadily, which in turn implies a smaller variance of the respective Euler approximation $p_{t_n - t}(y_{t_n} \mid X_c = x_t, c)$. Hence, the rate approximations calculated as ratios of Euler approximations can be rather crude at the beginning of the subinterval, depending on the distance between subsequent observations. Despite this shortcoming, however, we found that for the parameter estimation task the MJP proposal approximation seems to work surprisingly well for the models discussed in this thesis. Specifically, we observed that the quality of the estimation results did not improve significantly by including additional observations (thereby reducing the distance $t_{i+1} - t_i$ between observations), once a certain (reasonably small) number of observations is given.
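Putting the pieces together, the weight (3.41) of a propagated particle combines the exact path likelihood (3.15), the Gaussian measurement density and the proposal log-density returned by the MJP sampler sketch above. The following sketch assumes the sampler also recorded the reaction type of each jump; all names remain our own placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_weight(times, states, types, c, hazards, y_obs, sigma, t_end, log_q):
    """Log of the unnormalized particle weight w_n, equation (3.41)."""
    log_p = 0.0
    grid = list(times) + [t_end]
    for i in range(len(times)):
        h = hazards(states[i], c)
        log_p += -h.sum() * (grid[i + 1] - grid[i])   # survival term of (3.15)
        if i + 1 < len(times):                        # hazard of the observed jump
            log_p += np.log(h[types[i]])
    log_obs = multivariate_normal.logpdf(y_obs, mean=states[-1],
                                         cov=sigma * np.eye(len(y_obs)))
    return log_p + log_obs - log_q
```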

3.7 Examples of Biochemical reaction processes

We now introduce different instances of biochemical reaction processes which will serve as the basis of our empirical investigations.

3.7.1 Example 1: Lotka-Volterra process

One particularly simple example of the class of stochastic reaction processes is the Lotka-Volterra process, independently proposed by Lotka [35] and Volterra [47], which describes the dynamics of a population of two interacting species, the predators and the preys. The numbers of preys and predators at time $t$ are modeled by $X^{(1)}_t$ and $X^{(2)}_t$ respectively, where both variables can take on values in the nonnegative integers. We denote the state of the model at time $t$ as $X_t = \big(X^{(1)}_t, X^{(2)}_t\big)$. There are four possible reactions with accompanying transition rates, corresponding to reproduction and extinction of a prey or predator. We use the specification given in [38]:

$R_1: X_1 \to 2X_1, \qquad h_1(x, c_1) = c_1 g_1(x), \quad g_1(x) = x^{(1)}$
$R_2: X_1 \to \emptyset, \qquad h_2(x, c_2) = c_2 g_2(x), \quad g_2(x) = x^{(1)} x^{(2)}$
$R_3: X_2 \to 2X_2, \qquad h_3(x, c_3) = c_3 g_3(x), \quad g_3(x) = x^{(1)} x^{(2)}$
$R_4: X_2 \to \emptyset, \qquad h_4(x, c_4) = c_4 g_4(x), \quad g_4(x) = x^{(2)}$


Figure 1: Realisation of a Lotka-Volterra process on the interval [0, 2000] with parameter vector c = (c1, c2, c3, c4) = (5e-4, 1e-4, 1e-4, 5e-4) and initial population of 19 preys and 7 predators. The process trajectory was generated via the Gillespie algorithm. The blue dots represent the 20 observations at equidistant time points, which were simulated by corrupting the process state at each observation time with a Gaussian noise term of the form ε ∼ N(· | 0, 1).

Since we assume the laws of mass action kinetics to hold for all reactions in the system, it follows that the hazard functions are of the form $h_i(x, c_i) = c_i g_i(x)$ and consequently the parameter vector $c_i = c_i$ associated with each reaction consists of one positive constant $c_i$, respectively. Therefore the parameter vector for the system is given by $c = (c_1, c_2, c_3, c_4)$. The system of reactions implies the following net effects matrix:

$$A = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \qquad (3.43)$$

By applying the net reaction matrix to equation (3.27) one can easily verify the drift vector and diffusion matrix of the Fokker-Planck approximation to be

$$f(X) = \begin{pmatrix} c_1 X^{(1)} - c_2 X^{(1)} X^{(2)} \\ c_3 X^{(1)} X^{(2)} - c_4 X^{(2)} \end{pmatrix} \qquad (3.44)$$

$$D(X) = \begin{pmatrix} c_1 X^{(1)} + c_2 X^{(1)} X^{(2)} & 0 \\ 0 & c_3 X^{(1)} X^{(2)} + c_4 X^{(2)} \end{pmatrix}. \qquad (3.45)$$
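For concreteness, the Lotka-Volterra specification above translates into the following small model definition, usable with the `gillespie()` and `drift_diffusion()` sketches introduced earlier; the variable names are our own.

```python
import numpy as np

# Lotka-Volterra model (Example 1): net effect matrix and hazard vector
A_lv = np.array([[ 1,  0],
                 [-1,  0],
                 [ 0,  1],
                 [ 0, -1]], dtype=float)

def hazards_lv(x, c):
    """h_i(x, c_i) = c_i g_i(x) for the four Lotka-Volterra reactions."""
    prey, pred = x
    return np.array([c[0] * prey,
                     c[1] * prey * pred,
                     c[2] * prey * pred,
                     c[3] * pred])

# e.g. a trajectory like the one in Figure 1 (assuming the gillespie() sketch):
# times, states = gillespie([19, 7], [5e-4, 1e-4, 1e-4, 5e-4], hazards_lv, A_lv, T=2000)
```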

3.7.2 Example 2: Gene autoregulatory network

As a second example of a stochastic reaction process model we will consider a simple genetic autoregulatory network, which occurs in biological cells as part of the transcriptional regulatory network. The model, which is taken from [38], features two species, mRNA and proteins, where the numbers of mRNA and proteins at time $t$ are denoted by $X^{(1)}_t$ and $X^{(2)}_t$ respectively. The state of the


model at time $t$ is therefore denoted by $X_t = \big(X^{(1)}_t, X^{(2)}_t\big)$. Both variables can take on values in the nonnegative integers. The four possible reactions with corresponding transition rates associated with this model are given by:

$R_1: \emptyset \to X_1, \qquad h_1(x, c_1) = c_1\big(1 - 0.99\, H(x^{(2)}, c_5)\big)$
$R_2: X_1 \to \emptyset, \qquad h_2(x, c_2) = c_2 g_2(x), \quad g_2(x) = x^{(1)}$
$R_3: \emptyset \to X_2, \qquad h_3(x, c_3) = c_3 g_3(x), \quad g_3(x) = x^{(1)}$
$R_4: X_2 \to \emptyset, \qquad h_4(x, c_4) = c_4 g_4(x), \quad g_4(x) = x^{(2)}$

with model parameters $c_1 = (c_1, c_5)$ and $c_j = c_j$ for $j = 2, 3, 4$, and $H$ denoting the Heaviside function:

$$H(a, b) = \begin{cases} 0 & : a < b \\ 1 & : a \geq b \end{cases} \qquad (3.46)$$

Figure 2: Realisation of a Gene autoregulatory network process on the time interval [0, 100000] with parameters c = (2e-03, 6e-05, 5e-04, 7e-05, 20) and initial population consisting of X(1) = 12 mRNAs and X(2) = 17 proteins, generated via the Gillespie algorithm. The realisation features 20 observations, denoted by the blue dots, on equidistant time points over the whole interval, where each observation equals the state of the process corrupted by a Gaussian noise term ε ∼ N(0, 1).

Conceptually, the model can be understood as follows: both mRNA $X^{(1)}_t$ and protein $X^{(2)}_t$ decay exponentially. Proteins are generated through translation of mRNA with a rate proportional to the number of mRNA present, whereas the mRNA production depends on protein concentration levels through a cutoff function: as soon as protein numbers increase beyond a threshold parameter $c_5$, mRNA production is cut back dramatically by a factor of 100. Note that reactions 2, 3 and 4 follow the laws of mass action kinetics, whereas the first reaction exhibits a more general form by coupling the parameters and the data in a nonlinear way through the appearance of the Heaviside function. Specifically, we assume the parameters $(c_1, c_2, c_3, c_4)$ to be positive real constants and $c_5$ a positive integer with support $S = \{s_1, \ldots, s_S\}$.


An application of the Euler discretization to the genetic autoregulatory network process yields the following net reaction matrix, which equals the one in the previous example:

$$A = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \qquad (3.47)$$

Again using the net reaction matrix in equation (3.29), the drift vector and diffusion matrix are given by:

$$f(X) = \begin{pmatrix} c_1\big(1 - 0.99\, H(X^{(2)}, c_5)\big) - c_2 X^{(1)} \\ c_3 X^{(1)} - c_4 X^{(2)} \end{pmatrix} \qquad (3.48)$$

$$D(X) = \begin{pmatrix} c_1\big(1 - 0.99\, H(X^{(2)}, c_5)\big) + c_2 X^{(1)} & 0 \\ 0 & c_3 X^{(1)} + c_4 X^{(2)} \end{pmatrix}. \qquad (3.49)$$
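In the same spirit as for the Lotka-Volterra example, a minimal hazard definition for this model might look as follows (the net effect matrix equals `A_lv` above; names are illustrative).

```python
import numpy as np

def hazards_auto(x, c):
    """Gene autoregulatory network (Example 2) with c = (c1, c2, c3, c4, c5);
    the first hazard uses the Heaviside cutoff H(x^{(2)}, c5)."""
    mrna, protein = x
    c1, c2, c3, c4, c5 = c
    heaviside = 1.0 if protein >= c5 else 0.0
    return np.array([c1 * (1.0 - 0.99 * heaviside),
                     c2 * mrna,
                     c3 * mrna,
                     c4 * protein])
```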

3.7.3 Example 3: Prokaryotic autoregulation process

The third process is a simple model of prokaryotic autoregulation, as originally specified in [28]. In this particular model, so-called dimers of a protein (a molecule $P_2$ consisting of two protein subunits) coded for by a gene repress the gene's own transcription into RNA by binding to a regulatory region upstream of the gene. The transcription of genes into mRNA is facilitated by an enzyme, RNA-polymerase. The process starts with the binding of this enzyme near the beginning of a gene to a site called a promoter. After the initial binding, RNA-polymerase travels away from the promoter along the gene, synthesising mRNA as it moves. The transcription is repressed by protein dimers, which bind to sites on the DNA known as operators. The repression and transcription mechanisms can be formulated via the following chemical reactions, together with their respective transition rates:

$R_1: \mathrm{DNA} + \mathrm{P}_2 \to \mathrm{DNA \cdot P_2}, \qquad h_1(x, c_1) = c_1\, \mathrm{DNA} \times \mathrm{P}_2$
$R_2: \mathrm{DNA \cdot P_2} \to \mathrm{DNA} + \mathrm{P}_2, \qquad h_2(x, c_2) = c_2\, \mathrm{DNA \cdot P_2}$
$R_3: \mathrm{DNA} \to \mathrm{DNA} + \mathrm{RNA}, \qquad h_3(x, c_3) = c_3\, \mathrm{DNA}$

The binding of a ribosome to the mRNA, the translation of the mRNA and the folding of the resulting polypeptide chain into a functional protein P can be represented by the single reaction and transition rate

$R_4: \mathrm{RNA} \to \mathrm{RNA} + \mathrm{P}, \qquad h_4(x, c_4) = c_4\, \mathrm{RNA}.$

The reversible dimerisation of this protein is encoded by the two reactions and transition rates

$R_5: 2\mathrm{P} \to \mathrm{P}_2, \qquad h_5(x, c_5) = c_5\, \mathrm{P}(\mathrm{P} - 1)/2$
$R_6: \mathrm{P}_2 \to 2\mathrm{P}, \qquad h_6(x, c_6) = c_6\, \mathrm{P}_2$

Finally, the model is completed by mRNA and protein degradation:

$R_7: \mathrm{RNA} \to \emptyset, \qquad h_7(x, c_7) = c_7\, \mathrm{RNA}$
$R_8: \mathrm{P} \to \emptyset, \qquad h_8(x, c_8) = c_8\, \mathrm{P}$


A detailed discussion of this particular model can be found in [50].

Figure 3: Realisation of a Prokaryotic autoregulation process on the time interval [0, 50] with parameters c = (0.1, 0.7, 0.35, 0.2, 0.1, 0.9, 0.3, 0.1) and initial population X0 = (RNA, P, P2, DNA) = (1, 1, 4, 2). The number of gene copies is fixed to k = 10. Each of the 20 equidistant observations was generated by corrupting the state of the process by a Gaussian noise ε ∼ N(· | 0, 2).

The state of the system is therefore comprised of five variables, denoted by the vector $X = (\mathrm{RNA}, \mathrm{P}, \mathrm{P}_2, \mathrm{DNA \cdot P_2}, \mathrm{DNA})$, with 8 accompanying possible transitions. As we assume the laws of mass action kinetics to hold for all reactions in the system, it follows that the hazard functions are of the form $h_i(x, c_i) = c_i g_i(x)$ and consequently the parameter vector $c_i = c_i$ associated with each reaction consists of one positive constant $c_i$, respectively. Therefore the parameter vector for the system is given by $c = (c_1, \ldots, c_8)$. Since the model contains a conservation law,

$$\mathrm{DNA \cdot P_2} + \mathrm{DNA} = k, \qquad (3.50)$$

with $k$ denoting the number of copies of this gene in the genome, the net effects matrix is rank deficient. We remedy this by cancelling $\mathrm{DNA \cdot P_2}$ from the model and replacing any occurrences of $\mathrm{DNA \cdot P_2}$ in the rate laws by $k - \mathrm{DNA}$, which specifically results in an updated $h_2(x, c_2) = c_2(k - \mathrm{DNA})$. The reduced model


with state $X = (\mathrm{RNA}, \mathrm{P}, \mathrm{P}_2, \mathrm{DNA})$ implies the following net effects matrix:

$$A = \begin{pmatrix} 0 & 0 & -1 & -1 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & 2 & -1 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix} \qquad (3.51)$$

The drift vector and diffusion matrix of the Fokker-Planck approximation are consequently given by

$$f(X) = A' h(X, c) \qquad (3.52)$$
$$D(X) = A' \operatorname{diag}\{h(X, c)\}\, A \qquad (3.53)$$

with rate vector $h(X, c) = (h_1(X, c_1), \ldots, h_8(X, c_8))$.

3.7.4 Example 4: Stochastic reaction-diffusion process

As a fourth and last process we will consider a simple example of the class of stochastic reaction-diffusion processes, taken from [14], which are used in systems biology to model the dynamics of spatial-temporal interplay. Specifically, the process contains a single molecular species, the Bicoid protein, whose concentration gradient determines the developmental pathways in the early Drosophila embryo, reacting and diffusing through the embryo in a one-dimensional space [51]. In order to model the particle diffusion with a constant rate, we compartmentalize the space into 8 identical spatially homogeneous bins, which can only communicate with neighbouring bins. Since Bicoid protein is produced in the anterior region of the embryo by translating mRNA deposited by the mother, we further assume that new Bicoid proteins can be generated exclusively in the first bin. Finally, protein decay can happen anywhere in the embryo at a constant rate, implying that proteins decay in any bin at a constant rate. The state space at time $t$ is consequently given by the vector $X_t = \big(X^{(1)}_t, \ldots, X^{(8)}_t\big)$ with $X^{(i)}$ denoting the number of proteins in the $i$th bin. The model allows for a total of 23 possible reactions with accompanying transition rates, which are defined as


follows:

$R_1: \emptyset \to X_1, \qquad h_1(x, k_2) = k_2$
$R_2: X_1 \to \emptyset, \qquad h_2(x, k_1) = k_1 x^{(1)}$
$\;\vdots$
$R_9: X_8 \to \emptyset, \qquad h_9(x, k_1) = k_1 x^{(8)}$
$R_{10}: X_1 \to X_2, \qquad h_{10}(x, d) = d\, x^{(1)}$
$R_{11}: X_2 \to X_1, \qquad h_{11}(x, d) = d\, x^{(2)}$
$R_{12}: X_2 \to X_3, \qquad h_{12}(x, d) = d\, x^{(2)}$
$\;\vdots$
$R_{21}: X_7 \to X_6, \qquad h_{21}(x, d) = d\, x^{(7)}$
$R_{22}: X_7 \to X_8, \qquad h_{22}(x, d) = d\, x^{(7)}$
$R_{23}: X_8 \to X_7, \qquad h_{23}(x, d) = d\, x^{(8)}$

The first reaction corresponds to the creation of a protein with constant rate $k_2$, reactions two to nine represent protein decay in each bin, occurring proportionally to the number of proteins currently present in the respective bin, and the last 14 reactions model all possible diffusions between neighbouring bins with a rate proportional to the protein number in the bin from which the diffusion originates. Alternatively, the model can also be specified in the continuum limit with the number of bins going to infinity, which turns out to be analytically tractable [40].

Again, we assume the laws of mass action kinetics to hold for all reactions in the system. The hazard functions are consequently of the form $h_i(x, c_j) = c_j g_i(x)$, which implies that each reaction depends on one positive constant $c_j \in c$. Therefore the parameter vector for the system is given by $c = (k_2, k_1, d)$. Having the model reactions at our disposal, we can easily deduce the $(23 \times 8)$-dimensional net effects matrix $A$ and subsequently compute the drift vector and diffusion matrix of the Fokker-Planck approximation in the usual fashion:

$$f(X) = A' h(X, c) \qquad (3.54)$$
$$D(X) = A' \operatorname{diag}\{h(X, c)\}\, A \qquad (3.55)$$

with rate vector $h(X, c) = (h_1(X, \cdot), \ldots, h_{23}(X, \cdot))$.
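Since the 23 reactions follow a regular pattern, the net effect matrix and the hazard vector can be constructed programmatically. The following sketch assumes the reaction ordering given in the list above; all variable names are our own.

```python
import numpy as np

n_bins = 8
rows = [[1] + [0] * (n_bins - 1)]                         # R1: creation in bin 1
for i in range(n_bins):                                    # R2-R9: decay in each bin
    row = [0] * n_bins; row[i] = -1; rows.append(row)
for i in range(n_bins - 1):                                # R10-R23: diffusion both ways
    right = [0] * n_bins; right[i] = -1; right[i + 1] = 1  # bin i+1 -> bin i+2
    left = [0] * n_bins; left[i + 1] = -1; left[i] = 1     # bin i+2 -> bin i+1
    rows.append(right); rows.append(left)
A_rd = np.array(rows, dtype=float)                         # shape (23, 8)

def hazards_rd(x, c):
    """Reaction-diffusion hazards for c = (k2, k1, d), in the same ordering as A_rd."""
    k2, k1, d = c
    decay = [k1 * x[i] for i in range(n_bins)]
    diffusion = []
    for i in range(n_bins - 1):
        diffusion += [d * x[i], d * x[i + 1]]
    return np.array([k2] + decay + diffusion)
```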

Finally, the following figure presents a sample trajectory from a stochastic reaction-diffusion process.


Figure 4: Realisation of a Stochastic reaction-diffusion process on the time interval [0, 1500] with parameters c = (k1, k2, d) = (0.0001, 0.4, 0.01) and initial population of zero proteins in each bin. Bin 1 denotes the leftmost compartment, Bin 8 the rightmost compartment. The noise term of the 15 equidistant observations is of the form ε ∼ N(· | 0, 0.16).


4 Parameter estimation in biochemical reaction models

Assume that we are given $N + 1$ noisy observations $y_{t_0:t_N} = (y_{t_0}, y_{t_1}, \ldots, y_{t_N})$ of

)of

a biochemical reaction process at equidistant time points on some time interval[0, T ]. Our goal is to infer the unknown model parameter vector c = (c1, . . . , cν)based an the information entailed in the observation vector. Specifically we willapproach this problem from a frequentist as well as a bayesian standpoint, usingthe sequential Monte Carlo framework developed in the last chapters, and sub-sequently apply the inference algorithms to our example processes. For the firsttwo processes we will derive a bayesian as well as an EM-type parameter esti-mation algorithm, whereas for the third and fourth algorithms we will restrictourselves to the bayesian parameter inference case. Although the frequentistalgorithm can be derived and implemented straightforwardly for the two latterexamples, the extremely high computational burden renders a thorough empir-ical investigation impracticable.In the literature the parameter inference problem for biochemical reaction mod-els is predominantly approached within a bayesian framework [38, 40, 3, 28, 29,8]. Helpful details of parameter estimation techniques in a sequential settingMonte Carlo from a maximum likelihood as well as bayesian standpoint aregiven in [32, 16, 17, 2, 24].

4.1 Parameter estimation via EM algorithm

In order to derive a point estimator for the parameter vector $c$, we try to find the maximum likelihood estimator of the marginal likelihood of the observations by using the EM algorithm:

$$\hat{c} = \arg\max_{c \in \mathcal{C}} \log p(y_{t_0:t_N} \mid c), \qquad (4.1)$$

where $\mathcal{C}$ denotes the parameter space. Recall that in the EM algorithm framework, given an estimate $c_k$ of the parameters $c$ at iteration $k$, the estimate at iteration $k + 1$ is

$$c_{k+1} = \arg\max_{c \in \mathcal{C}} Q(c, c_k), \qquad (4.2)$$

where

$$Q(c, c_k) = \int \log p(x_{t_0:t_N}, y_{t_0:t_N} \mid c)\, p(x_{t_0:t_N} \mid y_{t_0:t_N}, c_k)\, dx_{t_0:t_N}. \qquad (4.3)$$

As the filtering distribution of the latent state sequence given the observation sequence is analytically intractable, we approximate this quantity by applying the SMC algorithm with a large number of particles $M \gg 1$. Consequently we end up with a Monte Carlo EM (MCEM) algorithm, where the E-step is approximated by the following expression:

$$Q(c, c_k) = \int \log p(x_{t_0:t_N}, y_{t_0:t_N} \mid c)\, p(x_{t_0:t_N} \mid y_{t_0:t_N}, c_k)\, dx_{t_0:t_N} \approx \int \log p(x_{t_0:t_N}, y_{t_0:t_N} \mid c)\, \frac{1}{M}\sum_{i=1}^{M} \delta_{X^{(i)}_{t_0:t_N}}(x_{t_0:t_N})\, dx_{t_0:t_N} = \sum_{i=1}^{M} \frac{1}{M} \log p\big(X^{(i)}_{t_0:t_N}, y_{t_0:t_N} \mid c\big), \qquad (4.4)$$

where $X^{(i)}_{t_0:t_N}$ denotes the $i$th sampled path on the interval $[0, T]$. Neither the terms involving the observations, $p(y_{t_n} \mid x_{t_n}, \sigma)$, nor, by assumption, the initial density $p(x_0)$ depend on the parameters $c$; hence we can instead consider the simplified expression (see equation (2.10))

$$Q(c, c_k) \propto \sum_{i=1}^{M} \frac{1}{M} \log p\big(X^{(i)}_{t_0:t_N} \mid X^{(i)}_0, c\big). \qquad (4.5)$$

Since we are utilizing the particle filtering algorithm to generate particles, we will refer to our method as the MCEM-PF algorithm.

We will now discuss the application of the MCEM-PF algorithm to the first two stochastic reaction processes, Examples 1 and 2, discussed above.

4.1.1 MCEM-PF: Lotka-Volterra process

Assume that we use our current parameter estimate $c_k$ to generate a set of $M$ particles by employing the particle filtering algorithm. Since all reactions in the Lotka-Volterra model follow the mass-action kinetics laws, the path likelihood $p\big(X^{(i)}_{t_0:t_N} \mid X^{(i)}_0, c\big) = \prod_{j=1}^{4} p\big(X^{(i)}_{t_0:t_N} \mid X^{(i)}_0, c_j\big)$ for the $i$th particle factorizes according to equation (3.17) into expressions for each individual hazard rate parameter, each of which can be summarized by a sufficient statistic $s_j\big(X^{(i)}_{t_0:t_N}\big) = \big(r^{(i)}_j, \int_0^T g_j(X^{(i)}_t)\, dt\big)$. We have

$$Q(c, c_k) \propto \sum_{i=1}^{M} \frac{1}{M} \log p\big(X^{(i)}_{t_0:t_N} \mid X^{(i)}_0, c\big) = \sum_{i=1}^{M} \frac{1}{M} \sum_{j=1}^{4} \Big\{ r^{(i)}_j \log c_j - c_j \int_0^T g_j\big(X^{(i)}_t\big)\, dt \Big\}. \qquad (4.6)$$

Now the maximization $c_{k+1} = \arg\max_{c \in \mathcal{C}} Q(c, c_k)$ can be carried out separately for each parameter $c_{k+1,j}$ by differentiation:

$$\partial_{c_j} Q(c_j, c_k) = \sum_{i=1}^{M} \frac{1}{M} \Big\{ \frac{r^{(i)}_j}{c_j} - \int_0^T g_j\big(X^{(i)}_t\big)\, dt \Big\} \equiv 0, \quad j = 1, 2, 3, 4. \qquad (4.7)$$

Thus the new parameter vector $c_{k+1} = (c_{k+1,1}, c_{k+1,2}, c_{k+1,3}, c_{k+1,4})$ has entries given by

$$c_{k+1,j} = \frac{\sum_{i=1}^{M} r^{(i)}_j}{\sum_{i=1}^{M} \int_0^T g_j\big(X^{(i)}_t\big)\, dt}, \quad j = 1, 2, 3, 4. \qquad (4.8)$$
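The M-step (4.8) is a simple ratio of accumulated sufficient statistics. A minimal sketch, assuming the per-particle statistics are stored as lists of $(r_j, \int g_j\,dt)$ pairs (e.g. produced by the `sufficient_statistic()` helper above), could look as follows.

```python
import numpy as np

def mcem_update_mass_action(particle_stats):
    """M-step update (4.8) for mass-action rate constants.

    particle_stats : list over particles; each entry is a list over reactions of
                     (r_j, integral_j) pairs. Returns the updated rate vector c_{k+1}.
    """
    n_reactions = len(particle_stats[0])
    counts, integrals = np.zeros(n_reactions), np.zeros(n_reactions)
    for stats in particle_stats:
        for j, (r_j, integral_j) in enumerate(stats):
            counts[j] += r_j
            integrals[j] += integral_j
    return counts / integrals
```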

4.1.2 MCEM-PF: Gene autoregulatory network process

Again, we assume that we have a set of $M$ particles generated by the PF algorithm with input parameters $c_k$ at our disposal. Remember that in this case the first reaction hazard is of a more general form than the simple mass-action kinetics laws encountered previously. Therefore the likelihood of the parameter vector $c_1 = (c_1, c_5)$ associated with the first hazard cannot readily be summarized by a sufficient statistic. Using the fact that the likelihood factorizes between parameter vectors of different hazard functions, together with the mass-action kinetics form of the remaining three hazards, we obtain

$$Q(c, c_k) \propto \sum_{i=1}^{M} \frac{1}{M} \log p\big(X^{(i)}_{t_0:t_N} \mid X^{(i)}_0, c\big) = \sum_{i=1}^{M} \frac{1}{M} \Bigg[ \sum_{k=1}^{r_1} \log h_1\big(X^{(i)}_{\tau_k^{(1)}}, c_1, c_5\big) - \int_0^T h_1\big(X^{(i)}_t, c_1, c_5\big)\, dt + \sum_{j=2}^{4} \Big\{ r^{(i)}_j \log c_j - c_j \int_0^T g_j\big(X^{(i)}_t\big)\, dt \Big\} \Bigg] \qquad (4.9)$$

Now the maximization for the parameters $c_2$, $c_3$ and $c_4$ can be easily computed by differentiation using the appropriate sufficient statistics:

$$\partial_{c_j} Q(c_j, c_k) = \sum_{i=1}^{M} \frac{1}{M} \Big\{ \frac{r^{(i)}_j}{c_j} - \int_0^T g_j\big(X^{(i)}_t\big)\, dt \Big\} \equiv 0, \quad j = 2, 3, 4. \qquad (4.10)$$

Hence, the new parameters $c_{(k+1)2}$, $c_{(k+1)3}$ and $c_{(k+1)4}$ are given by

$$c_{(k+1)j} = \frac{\sum_{i=1}^{M} r^{(i)}_j}{\sum_{i=1}^{M} \int_0^T g_j\big(X^{(i)}_t\big)\, dt}, \quad j = 2, 3, 4. \qquad (4.11)$$

Regarding the maximization of the remaining parameters $c_1 = (c_1, c_5)$, we note that by assumption $c_5$ is defined on some discrete space $S = \{s_1, \ldots, s_S\}$ with cardinality $|S| = S$. Therefore, given a fixed value of $c_5 = s$ with $s \in S$, we have

$$Q(c_1, c_k) \propto \sum_{i=1}^{M} \frac{1}{M} \Bigg[ \sum_{k=1}^{r_1} \log h_1\big(X^{(i)}_{\tau_k^{(1)}}, c_1, s\big) - \int_0^T h_1\big(X^{(i)}_t, c_1, s\big)\, dt \Bigg] \qquad (4.12)$$
$$\propto \sum_{i=1}^{M} \frac{1}{M} \Big\{ r^{(i)}_1 \log c_1 - c_1 \int_0^T g_1\big(X^{(i)}_t, c_5 = s\big)\, dt \Big\}, \qquad (4.13)$$

where we define, for an observed sample path $X_{t_0:t_N}$,

$$g_1(X_t, c_5 = s) := 1 - 0.99\, H\big(X^{(2)}_t, s\big),$$

so that the integral $\int_0^T g_1\big(X^{(i)}_t, c_5 = s\big)\, dt$ in (4.13) is well defined. Mimicking the derivation of $c_{(k+1)2}$, $c_{(k+1)3}$, $c_{(k+1)4}$ in equation (4.7), this expression can now readily be maximized with respect to $c_1$, yielding

$$c_{(k+1)1,s} = \frac{\sum_{i=1}^{M} r^{(i)}_1}{\sum_{i=1}^{M} \int_0^T g_1\big(X^{(i)}_t, c_5 = s\big)\, dt}, \qquad (4.14)$$

where the subscript $s$ in $c_{(k+1)1,s}$ denotes the dependency on $c_5 = s$. Our strategy for jointly optimizing the parameters $c_1$ and $c_5$ now simply consists of subsequently fixing $c_5 = s$ to each possible value $s \in S$ and calculating the associated maximum of $c_1$ conditioned on $c_5 = s$. We end up with $S$ pairs

$$\Theta_{c_1 c_5} := \big\{\big(c_{(k+1)1,s_1}, c_{(k+1)5} = s_1\big), \ldots, \big(c_{(k+1)1,s_S}, c_{(k+1)5} = s_S\big)\big\},$$

where we choose as new parameters the configuration which maximizes the objective function $Q(c_1, c_k)$:

$$\big(c_{(k+1)1}, c_{(k+1)5}\big) = \operatorname*{argmax}_{(c_1, c_5) \in \Theta_{c_1 c_5}} Q\big((c_1, c_5), c_k\big), \qquad (4.15)$$

with

$$Q\big((c_1, c_5), c_k\big) \propto \sum_{i=1}^{M} \frac{1}{M} \Bigg[ \sum_{k=1}^{r_1} \log h_1\big(X^{(i)}_{\tau_k^{(1)}}, c_1, c_5\big) - \int_0^T h_1\big(X^{(i)}_t, c_1, c_5\big)\, dt \Bigg] \propto \sum_{i=1}^{M} \frac{1}{M} \Bigg[ \sum_{k=1}^{r_1} \log\Big\{ c_1 \Big(1 - 0.99\, H\big(X^{(i),(2)}_{\tau_k^{(1)}}, c_5\big)\Big)\Big\} - c_1 \int_0^T g_1\big(X^{(i)}_t, c_5\big)\, dt \Bigg] \qquad (4.16)$$

Despite being straightforward, this approach unfortunately does not work well in practice. The problem is that the estimator of the parameter $c_5$ rarely changes during one run of the algorithm, sometimes even occupying the same value from initialization until termination. Consequently the algorithm often converges to some local optimum, which turns out to be far away from the true value of the parameter vector $c$. One rather inefficient but easy way to remedy this problem is to fix the parameter $c_5 = s$ to some value $s$ at the outset and run the EM algorithm on the reduced parameter set $c = (c_1, c_2, c_3, c_4)$. Note that for a fixed $c_5 = s$ the first hazard function also follows the mass-action kinetics form:

$$h_1(x, c_1) = c_1 g_1(x) \quad \text{with} \quad g_1(x) = 1 - 0.99\, H\big(x^{(2)}, c_5 = s\big) \qquad (4.17)$$

Accordingly the new parameter vector can be obtained in the usual fashion by differentiation, using only sufficient statistics:

$$c_{k+1,j} = \frac{\sum_{i=1}^{M} r^{(i)}_j}{\sum_{i=1}^{M} \int_0^T g_j\big(X^{(i)}_t\big)\, dt}, \quad j = 1, 2, 3, 4. \qquad (4.18)$$

An application of this second approach to parameter estimation involves running the EM algorithm with every possible value $c_5 = s$ fixed and finally choosing the parameter configuration with the largest overall marginal likelihood.
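The outer loop of this fixed-$c_5$ strategy is easy to sketch. The two callbacks below stand in for the reduced MCEM-PF run and for an (approximate) marginal likelihood evaluation; they are placeholders for the machinery described above, not part of the thesis notation.

```python
def em_grid_over_c5(candidate_s, run_reduced_em, marginal_loglik):
    """Run the reduced EM for every threshold candidate s in S and keep the
    configuration with the largest approximate marginal log-likelihood.

    candidate_s     : iterable of candidate values S = {s_1, ..., s_S}
    run_reduced_em  : function s -> estimated rates (c1, c2, c3, c4) with c5 fixed to s
    marginal_loglik : function (rates, s) -> approximate marginal log-likelihood
    """
    best = None
    for s in candidate_s:
        rates = run_reduced_em(s)
        score = marginal_loglik(rates, s)
        if best is None or score > best[0]:
            best = (score, rates, s)
    return best[1], best[2]   # rate estimates and the chosen threshold c5
```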

4.2 Bayesian parameter estimation

In a Bayesian modeling framework our objective is to derive the posterior distribution of the parameter vector $c = (c_1, \ldots, c_\nu)$ given the observations:

$$p(c \mid y_{t_0:t_N}) = \int p(x_{t_0:t_N}, c \mid y_{t_0:t_N})\, dx_{t_0:t_N}, \qquad (4.19)$$

where

$$p(x_{t_0:t_N}, c \mid y_{t_0:t_N}) \propto p(y_{t_0:t_N} \mid x_{t_0:t_N}, \sigma)\, p(x_{t_0:t_N} \mid c)\, p(c). \qquad (4.20)$$

We take the rate parameter vectors $c_i$ to be independent a priori; therefore the prior distribution factorizes to $p(c) = \prod_{i=1}^{\nu} p(c_i)$. In the case of a hazard rate which follows the laws of mass-action kinetics, we assume the associated (scalar) parameter $c_i$ to be Gamma distributed, as this represents the conjugate distribution with respect to the likelihood. An application of Bayes' rule leads to the posterior distribution over the parameters:

$$p(c \mid x_{t_0:t_N}, y_{t_0:t_N}) \propto p(c)\, p(x_{t_0:t_N}, y_{t_0:t_N} \mid c) \propto p(c)\, p(x_{t_0:t_N} \mid c), \qquad (4.21)$$

where the last step follows by remembering that, given the latent path, the observations $y_{t_0:t_N}$ do not depend on the parameters $c$. Since the joint filtering distribution of the latent path and parameters, $p(x_{t_0:t_N}, c \mid y_{t_0:t_N})$, is known up to a normalizing constant, we can find an approximation by employing the particle filter framework. The first, straightforward solution would be to sample at the initial time $t = 0$ a number of $M \gg 1$ particles from the joint proposal distribution $\big(X^{(i)}_0, c^{(i)}_0\big) \sim q(x_0, c \mid y_{t_0}) := p(c)\, q(x_0 \mid y_{t_0}, c)$. Since we are now dealing with the joint distribution of sample path and parameters, we have to adapt the corresponding weight $w_0$, equation (2.16), in the particle filter accordingly. By considering the form of the joint distribution in equation (4.20), the new weights are easily recognized as

$$w_0(x_0, c) = \frac{p(c)\, p(x_0 \mid c)\, p(y_{t_0} \mid x_0, c)}{q(x_0, c \mid y_{t_0})}. \qquad (4.22)$$

Therefore the first resampling step is computed with unnormalized weights $W^{(i)}_0 \propto w_0\big(X^{(i)}_0, c^{(i)}_0\big)$. Since each particle $i$ is now associated with some parameter value $c^{(i)}$, we can propagate the particles through time by employing the PF algorithm from the chapter on sequential Monte Carlo methods and arrive at the following filtering approximation:

$$p(x_{t_0:t_n}, c \mid y_{t_0:t_n}) \approx \frac{1}{M}\sum_{i=1}^{M} \delta_{\big(X^{(i)}_{t_0:t_n},\, c^{(i)}_n\big)}(x_{t_0:t_n}, c) \qquad (4.23)$$

and subsequently at the approximation to the posterior distribution:

$$p(c \mid y_{t_0:t_n}) \approx \frac{1}{M}\sum_{i=1}^{M} \delta_{c^{(i)}_n}(c). \qquad (4.24)$$

Unfortunately, as there are no new parameter sample values $c^{(i)}$ introduced after the initialization, the approximation (4.24) relies on a very limited number of different parameter values due to the repeated resampling steps in the SMC algorithm. In order to rectify the particle impoverishment, we use an idea introduced in [24] and include an MCMC step with invariant density $p(x_{t_0:t_n}, c \mid y_{t_0:t_n})$ to add diversity to the population at each time step $t_n$, $n = 1, \ldots, N$:

$$\big(X^{(i)}_{t_0:t_n}, c^{(i)}_n\big) \sim K_n\big(\cdot, \cdot \mid X^{(i)}_{t_0:t_n}, c^{(i)}_n\big), \qquad (4.25)$$

where by construction $K_n$ satisfies

$$p(x'_{t_0:t_n}, c' \mid y_{t_0:t_n}) = \int p(x_{t_0:t_n}, c \mid y_{t_0:t_n})\, K_n(x'_{t_0:t_n}, c' \mid x_{t_0:t_n}, c)\, d(x_{t_0:t_n}, c). \qquad (4.26)$$

For such a kernel, if $(X_{t_0:t_n}, C) \sim p(\cdot, \cdot \mid y_{t_0:t_n})$ and $(X'_{t_0:t_n}, C' \mid X_{t_0:t_n}, C) \sim K_n(\cdot, \cdot \mid X_{t_0:t_n}, C)$, then $(X'_{t_0:t_n}, C')$ is still marginally distributed according to $p(\cdot, \cdot \mid y_{t_0:t_n})$. We note that, in contrast to standard MCMC algorithms, the kernel is not required to be ergodic. In fact, ergodicity does usually not hold in real-world applications, as this would imply sampling particles with an increasing number of state variables at each time step. Instead one keeps the path up to time $t_{n-L}$ fixed for some $L \geq 0$ and samples only $c^{(i)}_n$ and possibly the state trajectory on the recent past $X^{(i)}_{t_{n-L}:t_n}$. Following [2], we leave the complete state path $X^{(i)}_{t_0:t_n}$ in the algorithm unaltered (corresponding to $L = 0$) and restrict ourselves to the sampling of new parameter values. Specifically, the kernel at time step $t_n$ is given by

$$K_n(x'_{t_0:t_n}, c' \mid x_{t_0:t_n}, c) = \delta_{x_{t_0:t_n}}(x'_{t_0:t_n})\, p(c' \mid x_{t_0:t_n}, y_{t_0:t_n}), \qquad (4.27)$$

where $p(c' \mid x_{t_0:t_n}, y_{t_0:t_n})$ is distributed according to equation (4.21) with $T = t_n$. One can easily verify the invariance property (4.26) for this particular kernel. We can summarize the algorithm as follows:

Bayesian-PF algorithm

for $i = 1, \ldots, M$ do
    Sample $\big(X^{(i)}_0, c^{(i)}_0\big) \sim q(\cdot, \cdot \mid y_{t_0})$
    Compute the weights $W^{(i)}_0 \propto w_0\big(X^{(i)}_0, c^{(i)}_0\big)$
end for
Normalize: $W^{(i)}_0 = w_0\big(X^{(i)}_0, c^{(i)}_0\big) / \sum_{j=1}^{M} w_0\big(X^{(j)}_0, c^{(j)}_0\big)$
Resample $\big\{W^{(i)}_0, \big(X^{(i)}_0, c^{(i)}_0\big)\big\}$ to obtain $M$ equally weighted particles $\big\{\frac{1}{M}, \big(X^{(i)}_0, c^{(i)}_0\big)\big\}$
for $i = 1, \ldots, M$ do
    Resample $c^{(i)}_0 \sim p\big(\cdot \mid X^{(i)}_0, y_{t_0}\big)$
end for
for $n = 1, \ldots, N$ do
    for $i = 1, \ldots, M$ do
        Propagate particles $X^{(i)}_{t_{n-1}:t_n} \sim q\big(\cdot \mid y_{t_n}, X^{(i)}_{t_{n-1}}, c^{(i)}_{n-1}\big)$ and set $X^{(i)}_{t_0:t_n} \leftarrow \big(X^{(i)}_{t_0:t_{n-1}}, X^{(i)}_{t_{n-1}:t_n}\big)$
        Compute weights $W^{(i)}_n \propto w_n\big(X^{(i)}_{t_{n-1}:t_n}\big)$
    end for
    Normalize weights: $W^{(i)}_n = w_n\big(X^{(i)}_{t_{n-1}:t_n}\big) / \sum_{j=1}^{M} w_n\big(X^{(j)}_{t_{n-1}:t_n}\big)$, $i = 1, \ldots, M$
    Resample $\big\{W^{(i)}_n, \big(X^{(i)}_{t_0:t_n}, c^{(i)}_n\big)\big\}$ to obtain $M$ equally weighted particles $\big\{\frac{1}{M}, \big(X^{(i)}_{t_0:t_n}, c^{(i)}_n\big)\big\}$
    for $i = 1, \ldots, M$ do
        Resample $c^{(i)}_n \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}, y_{t_0:t_n}\big)$
    end for
    Approximate the filtering density:
        $p(x_{t_0:t_n}, c \mid y_{t_0:t_n}) \approx \frac{1}{M}\sum_{i=1}^{M} \delta_{\big(X^{(i)}_{t_0:t_n},\, c^{(i)}_n\big)}(x_{t_0:t_n}, c)$
end for

Consequently the desired approximation to the posterior distribution is

$$p(c \mid y_{t_0:t_N}) \approx \frac{1}{M}\sum_{i=1}^{M} \delta_{c^{(i)}_N}(c). \qquad (4.28)$$

Although the Bayesian particle filter algorithm displays a slightly higher computational burden than the standard particle filtering algorithm used in the MCEM algorithm, due to the repeated parameter resampling steps, the Bayesian approach requires only one sweep through the data, whereas in the MCEM case the particle filtering algorithm is iterated until convergence. In practice this implies that the Bayesian algorithm typically outperforms the MCEM algorithm by a factor of 10-15.

We proceed by discussing the Bayesian-PF implementation for all four example processes introduced above.

4.2.1 Bayesian-PF: Lotka-Volterra process

Since all 4 hazard rates of the model follow the laws of mass-action kinetics, we place independent Gamma distributions $c_j \sim \mathcal{G}(a_j, b_j)$, $j = 1, 2, 3, 4$, with hyperparameters $a_{\mathrm{prior}} = (a_1, \ldots, a_4)$ and $b_{\mathrm{prior}} = (b_1, \ldots, b_4)$ over the parameter vector $c = (c_1, c_2, c_3, c_4)$:

$$p(c) = \prod_{j=1}^{4} \mathcal{G}(c_j \mid a_j, b_j) \qquad (4.29)$$

As usual, for an observed path $X_{t_0:t_n}$ generated from a Lotka-Volterra process, we denote by $r = \sum_{j=1}^{4} r_j$ the total number of transitions in the interval $[0, t_n]$, with $r_j$ representing the number of transitions of the $j$th type. Applying equation (4.21) yields the closed-form posterior distribution

$$p(c \mid x_{t_0:t_n}, y_{t_0:t_n}) \propto p(c)\, p(x_{t_0:t_n} \mid c) \propto \prod_{j=1}^{4} c_j^{a_j - 1} e^{-c_j b_j} \prod_{j=1}^{4} c_j^{r_j} \exp\Big\{-c_j \int_0^{t_n} g_j(x_t)\, dt\Big\} = \prod_{j=1}^{4} c_j^{a_j + r_j - 1} \exp\Big\{-c_j \Big(b_j + \int_0^{t_n} g_j(x_t)\, dt\Big)\Big\} \propto \prod_{j=1}^{4} \mathcal{G}\Big(c_j \,\Big|\, a_j + r_j,\; b_j + \int_0^{t_n} g_j(x_t)\, dt\Big). \qquad (4.30)$$

The posterior $p(c \mid x_{t_0:t_n}, y_{t_0:t_n}) = \prod_{j=1}^{4} \mathcal{G}\big(c_j \mid a_j + r_j, b_j + \int_0^{t_n} g_j(x_t)\, dt\big)$ remains independent with respect to the different rate parameters, thus the resampling of the individual rates of the $i$th particle, $c^{(i)}_n \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}, y_{t_0:t_n}\big)$, can be carried out straightforwardly by sampling each rate parameter $c^{(i)}_{nj}$, $j = 1, 2, 3, 4$, individually from its respective Gamma distribution:

$$c^{(i)}_{nj} \sim \mathcal{G}\big(\cdot \mid s_j\big(X^{(i)}_{t_0:t_n}\big)\big), \qquad (4.31)$$

where $s_j\big(X^{(i)}_{t_0:t_n}\big) = \big(r^{(i)}_j, \int_0^{t_n} g_j(X^{(i)}_t)\, dt\big)$ denotes the sufficient statistic with respect to the $i$th particle (compare equation (3.18)). Note that by using fixed-dimensional sufficient statistics the memory requirement does not increase over time.
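The parameter-refresh step (4.31) is just a draw from conjugate Gamma posteriors built from the accumulated sufficient statistics. A minimal sketch, using the same statistics layout as in the earlier helpers and with all names illustrative, is given below.

```python
import numpy as np

def resample_rates(suff_stats, a_prior, b_prior, rng=np.random.default_rng()):
    """Draw new mass-action rate parameters from the Gamma posteriors (4.30)/(4.31).

    suff_stats : list of (r_j, integral_j) pairs, one per reaction, accumulated
                 along the particle's sampled path up to the current time
    a_prior, b_prior : Gamma hyperparameters of the independent priors
    """
    return np.array([
        rng.gamma(shape=a + r, scale=1.0 / (b + integral))
        for (r, integral), a, b in zip(suff_stats, a_prior, b_prior)
    ])
```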

4.2.2 Bayesian-PF: Gene autoregulatory network process

We start by specifying the prior distribution over the parameters corresponding to the 4 hazard rates of the model, where the first hazard rate depends on the parameter vector $c_1 = (c_1, c_5)$ and the remaining hazard rates depend on the scalar parameters $c_j = c_j$, $j = 2, 3, 4$, respectively. In order to clarify the following discussion, we quickly review some notation. Given an observed sample path $X_{t_0:t_n}$ generated from a gene autoregulatory network process, we denote the total number of events in the interval $[0, t_n]$ as $r = \sum_{j=1}^{4} r_j$ with corresponding reaction times $\{\tau_1, \ldots, \tau_r\}$ and the additional definitions $\tau_0 = 0$ and $\tau_{r+1} = t_n$. Furthermore, we denote the $r_1$ reaction times of events corresponding to the first reaction as $\{\tau^{(1)}_1, \ldots, \tau^{(1)}_{r_1}\}$.

As the Heaviside function in the first hazard rate, given by $h_1(x, c_1, c_5) = c_1\big(1 - 0.99\, H(x^{(2)}, c_5)\big)$, couples parameters and data in a nonlinear way, there is no conjugate distribution $p(c_1, c_5)$ with respect to the likelihood readily available. In order to design a suitable prior distribution, we assign to the threshold parameter $c_5$ of the Heaviside function a discrete uniform distribution over some reasonable range of values $S = \{s_1, \ldots, s_S\}$ with cardinality $|S| = S$, i.e. $c_5 \sim \mathcal{U}(S)$. Consequently, for $c_5 \in S$:

$$p(c_5) = \frac{1}{S} \qquad (4.32)$$

For the prior of the parameter $c_1$ we assume a Gamma distribution with hyperparameters $a_1$ and $b_1$, $p(c_1) = \mathcal{G}(c_1 \mid a_1, b_1)$, independent of $c_5$. By combining the prior distributions with the likelihood of the data, we arrive at the joint posterior distribution:

$$p(c_1, c_5 \mid x_{t_0:t_n}) \propto p(c_1)\, p(c_5)\, p(x_{t_0:t_n} \mid c_1, c_5) \qquad (4.33)$$
$$\propto c_1^{a_1 - 1} e^{-c_1 b_1}\, \frac{1}{S} \prod_{k=1}^{r_1} h_1\big(x_{\tau^{(1)}_k}, c_1, c_5\big) \exp\Big\{-\int_0^{t_n} h_1(x_t, c_1, c_5)\, dt\Big\}$$
$$\propto c_1^{a_1 + r_1 - 1} \prod_{k=1}^{r_1} \Big(1 - 0.99\, H\big(x^{(2)}_{\tau^{(1)}_k}, c_5\big)\Big) \exp\Big\{-c_1 \Big(b_1 + \int_0^{t_n} \big(1 - 0.99\, H(x^{(2)}_t, c_5)\big)\, dt\Big)\Big\} \qquad (4.34)$$

Now, in order to find the marginal posterior distribution of $c_5$,

$$p(c_5 \mid x_{t_0:t_n}) = \int p(c_1, c_5 \mid x_{t_0:t_n})\, dc_1, \qquad (4.35)$$

we note that the functional form of the components containing the parameter $c_1$ in equation (4.34) equals the kernel of a Gamma distribution with hyperparameters $a_1 + r_1$ and $b_1 + \int_0^{t_n} \big(1 - 0.99\, H(x^{(2)}_t, c_5)\big)\, dt$, implying for the conditional posterior of $c_1$, given $c_5$:

$$p(c_1 \mid x_{t_0:t_n}, c_5) = \mathcal{G}\Big(c_1 \,\Big|\, a_1 + r_1,\; b_1 + \int_0^{t_n} \big(1 - 0.99\, H(x^{(2)}_t, c_5)\big)\, dt\Big). \qquad (4.36)$$

By using the fact that the normalization constant of a Gamma distribution $\mathcal{G}(\alpha, \beta)$ is $\beta^\alpha/\Gamma(\alpha)$ and recalling

$$p(c_1, c_5 \mid x_{t_0:t_n}) = p(c_1 \mid x_{t_0:t_n}, c_5)\, p(c_5 \mid x_{t_0:t_n}), \qquad (4.37)$$

we get as marginal posterior of $c_5$:

$$p(c_5 \mid x_{t_0:t_n}) = \frac{1}{Z}\; \frac{\prod_{k=1}^{r_1} \Big(1 - 0.99\, H\big(x^{(2)}_{\tau^{(1)}_k}, c_5\big)\Big)}{\Big(b_1 + \int_0^{t_n} \big(1 - 0.99\, H(x^{(2)}_t, c_5)\big)\, dt\Big)^{a_1 + r_1}} \qquad (4.38)$$

with normalization constant

$$Z = \sum_{c_5 \in S} \frac{\prod_{k=1}^{r_1} \Big(1 - 0.99\, H\big(x^{(2)}_{\tau^{(1)}_k}, c_5\big)\Big)}{\Big(b_1 + \int_0^{t_n} \big(1 - 0.99\, H(x^{(2)}_t, c_5)\big)\, dt\Big)^{a_1 + r_1}}. \qquad (4.39)$$

Simulating from the joint posterior distribution

$$p(c_1, c_5 \mid x_{t_0:t_n}) = p(c_5 \mid x_{t_0:t_n})\, p(c_1 \mid x_{t_0:t_n}, c_5) \qquad (4.40)$$

can now be accomplished by first sampling from the marginal posterior distribution $c_5 \sim p(\cdot \mid x_{t_0:t_n})$ and subsequently using the drawn value $c_5$ to sample from the conditional posterior distribution $c_1 \sim p(\cdot \mid x_{t_0:t_n}, c_5)$ (see equation (4.36)).

Since the remaining hazard rate functions of the model, $j = 2, 3, 4$, follow the laws of mass-action kinetics, we use the fact that in this particular case an (independent) Gamma prior distribution $p(c_2, c_3, c_4) = \prod_{j=2}^{4} \mathcal{G}(c_j \mid a_j, b_j)$ acts as the conjugate distribution for the likelihood $p(x_{t_0:t_n} \mid c_j)$ in our autoregulatory network model, resulting in independent Gamma posterior distributions (in analogy to the posterior in the Lotka-Volterra process, compare equation (4.30)):

$$p(c_2, c_3, c_4 \mid x_{t_0:t_n}) = \prod_{j=2}^{4} \mathcal{G}\Big(c_j \,\Big|\, a_j + r_j,\; b_j + \int_0^{t_n} g_j(x_t)\, dt\Big).$$

Hence, we arrive at the following complete posterior distribution, given the observed sample path $x_{t_0:t_n}$:

$$p(c_1, c_2, c_3, c_4, c_5 \mid x_{t_0:t_n}) = p(c_1, c_5 \mid x_{t_0:t_n}) \prod_{j=2}^{4} p(c_j \mid x_{t_0:t_n}). \qquad (4.41)$$

Resampling the individual rates $c^{(i)}_n \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}, y_{t_0:t_n}\big)$ of the $i$th particle consists of sampling the rate parameters $c^{(i)}_{nj}$, $j = 2, 3, 4$, from a Gamma distribution, respectively:

$$c^{(i)}_{nj} \sim \mathcal{G}\big(\cdot \mid s_j\big(X^{(i)}_{t_0:t_n}\big)\big), \qquad (4.42)$$

where $s_j\big(X^{(i)}_{t_0:t_n}\big) = \big(r^{(i)}_j, \int_0^{t_n} g_j(X^{(i)}_t)\, dt\big)$ denotes the sufficient statistic with respect to the $i$th particle (compare equation (3.18)). The remaining rate parameters are sampled via $c^{(i)}_{n5} \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}\big)$ according to equation (4.38) and subsequently $c^{(i)}_{n1} \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}, c^{(i)}_{n5}\big)$ according to (4.36).

4.2.3 Bayesian-PF: Prokaryotic autoregulation process

All 8 hazard rates of the model follow the laws of mass-action kinetics, therefore we can derive the posterior distribution analogously to the Lotka-Volterra process case by placing independent Gamma distributions $c_j \sim \mathcal{G}(a_j, b_j)$, $j = 1, \ldots, 8$, with hyperparameters $a_{\mathrm{prior}} = (a_1, \ldots, a_8)$ and $b_{\mathrm{prior}} = (b_1, \ldots, b_8)$ over the parameter vector $c = (c_1, \ldots, c_8)$:

$$p(c) = \prod_{j=1}^{8} \mathcal{G}(c_j \mid a_j, b_j) \qquad (4.43)$$

For an observed path $X_{t_0:t_n}$ generated from a Prokaryotic autoregulation process, we denote by $r = \sum_{j=1}^{8} r_j$ the total number of transitions in the interval $[0, t_n]$, with $r_j$ representing the number of transitions of the $j$th type. Consequently, we end up with the closed-form posterior distribution

$$p(c \mid x_{t_0:t_n}, y_{t_0:t_n}) = \prod_{j=1}^{8} \mathcal{G}\Big(c_j \,\Big|\, a_j + r_j,\; b_j + \int_0^{t_n} g_j(x_t)\, dt\Big). \qquad (4.44)$$

The posterior remains independent with respect to the different rate parameters, thus the resampling of the individual rates of the $i$th particle, $c^{(i)}_n \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}, y_{t_0:t_n}\big)$, can be carried out straightforwardly by sampling each rate parameter $c^{(i)}_{nj}$, $j = 1, \ldots, 8$, individually from its respective Gamma distribution:

$$c^{(i)}_{nj} \sim \mathcal{G}\big(\cdot \mid s_j\big(X^{(i)}_{t_0:t_n}\big)\big), \qquad (4.45)$$

where $s_j\big(X^{(i)}_{t_0:t_n}\big) = \big(r^{(i)}_j, \int_0^{t_n} g_j(X^{(i)}_t)\, dt\big)$ denotes the sufficient statistic with respect to the $i$th particle (compare equation (3.18)).

4.2.4 Bayesian-PF: Stochastic reaction-diffusion process

Although all hazard rates of the model again follow the laws of mass-action kinetics, the situation differs slightly from the previous example in that we have 23 hazard rates but only 3 parameters $k_1$, $k_2$ and $d$. We first define $\theta = (k_1, k_2, d)$ as the parameter vector and place independent Gamma distributions $z \sim \mathcal{G}(a_z, b_z)$, $z \in \{k_1, k_2, d\}$, with hyperparameters $a_{\mathrm{prior}} = \{a_z\}$ and $b_{\mathrm{prior}} = \{b_z\}$ over $\theta$. For an observed path $X_{t_0:t_n}$ generated from a stochastic reaction-diffusion process we denote by $r = \sum_{j=1}^{23} r_j$ the total number of transitions in the interval $[0, t_n]$, with $r_j$ representing the number of transitions of the $j$th type, in the usual fashion. By revisiting the model specification, we can see that the parameter $k_2$ depends only on the first reaction, $k_1$ depends on reactions two to nine and $d$ depends on reactions ten to twenty-three. We define the following statistics for $t \in [0, t_n]$:

$$r_{k_1} = \sum_{j=2}^{9} r_j, \qquad g_{k_1}(x_t) = \sum_{j=2}^{9} g_j(x_t)$$
$$r_{k_2} = r_1, \qquad g_{k_2}(x_t) = g_1(x_t)$$
$$r_d = \sum_{j=10}^{23} r_j, \qquad g_d(x_t) = \sum_{j=10}^{23} g_j(x_t)$$

Applying equation (4.21) now yields the familiar closed-form posterior distribution

$$p(\theta \mid x_{t_0:t_n}, y_{t_0:t_n}) = \prod_{z \in \{k_1, k_2, d\}} \mathcal{G}\Big(z \,\Big|\, a_z + r_z,\; b_z + \int_0^{t_n} g_z(x_t)\, dt\Big).$$

Again, the posterior remains independent with respect to the different rate parameters, thus the resampling of the individual rates of the $i$th particle, $\theta^{(i)}_n \sim p\big(\cdot \mid X^{(i)}_{t_0:t_n}, y_{t_0:t_n}\big)$, can be carried out straightforwardly by sampling each rate parameter $z^{(i)}_n$, $z \in \{k_1, k_2, d\}$, individually from its respective Gamma distribution:

$$z^{(i)}_n \sim \mathcal{G}\big(\cdot \mid s_z\big(X^{(i)}_{t_0:t_n}\big)\big), \qquad (4.46)$$

where $s_z\big(X^{(i)}_{t_0:t_n}\big) = \big(r^{(i)}_z, \int_0^{t_n} g_z(X^{(i)}_t)\, dt\big)$ denotes the sufficient statistic with respect to the $i$th particle (compare equation (3.18)).

4.3 Comparison of estimation results between MCEM-PFand Bayesian-PF approach

In order to facilitate a reasonable comparison between the point estimator $\hat{c}$ computed by the MCEM-PF algorithm and the posterior distribution estimate generated by the Bayesian approach, we will now discuss a class of methods originally proposed in [48], allowing us to construct a posterior distribution $p(c \mid y_{t_0:t_N})$ of the parameters based on the results of the MCEM-PF estimation $\hat{c}$. After introducing the algorithms we will discuss their application to the two example processes for which we developed an MCEM-PF estimation algorithm. More details concerning the algorithms can be found in [44].

4.3.1 Poor man’s data augmentation algorithm 1

For our first algorithm we start with the following decomposition:

p(c | yt0:tN

)=

∫p(c | xt0:tN ,yt0:tN

)p(xt0:tN | yt0:tN

)dxt0:tN (4.47)

= E[p(c | xt0:tN ,yt0:tN

)], (4.48)

where the expectation is understood with respect to p(xt0:tN | yt0:tN

). We

now make use of Approximation 1 given in (A.8), which states that we canapproximate p

(xt0:tN | yt0:tN

)by again considering the posterior of the process

path given the observations, but additionally fixing the parameter vector c tothe mode of p

(c | yt0:tN

):

p(xt0:tN | yt0:tN

)= p

(xt0:tN | c,yt0:tN

) [1 + o

(1

M

)], (4.49)

with M denoting the sample size. Now remember that c is exactly the pointestimate computed by the MCEM-PF Algorithm, which we assume to be at ourdisposal. Then substituting (4.49) into equation (4.47) and employing a MonteCarlo step to the expectation (4.48) will yield an approximation to the desireddistribution. Hence, we can summarize the algorithm, which in the literature isreferred to as Poor man’s data augmentation algorithm 1 (PMDA1):

PMDA1 algorithm

Input: parameter estimator c

Sample M samples X(1)t0:tN , . . . ,X

(M)t0:tN with

X(i)t0:tN ∼ p

(xt0:tN | c,yt0:tN

), i = 1, . . . ,M

using the PF algorithmApproximate the posterior distribution:

p(c | yt0:tN

)≈ 1

M

M∑i=1

p(c | X(i)

t0:tN ,yt0:tN

).

There exists a close connection between this algorithm and the Bayesian-PFalgorithm in that the approximation of the posterior distribution in the PMDA1

52

Page 54: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

algorithm equals exactly the parameter resampling step at the end of the lasttime interval in the Bayesian-PF algorithm, as induced by the MCMC kernelgiven in equation (4.27). Thus, one can view the PMDA1 algorithm as a sim-plyfied Bayesian-PF algorithm with the prior distribution being the product ofDirac measures centered around the point estimates of the respective parametersas computed by the MCEM-PF Algorithm, but leaving out the MCMC resam-pling step. Since we have already derived the specific posterior distributionsfor our two example processes in the discussion regarding Bayesian parameterestimation, we simply restate the results:

PMDA1: Lotka-Volterra process The posterior distribution for our firstexample is given by a product of independent Gamma distributions (compareequation (4.30)):

p(c | xt0:tN ,yt0:tN

)=

4∏n=1

G

(cj | aj + rj , bj +

∫ T

0

gj (xt)

)(4.50)

PMDA1: Gene autoregulatory network process The posterior distribu-tion in the autoregulatory network instance is (compare equation (4.41)):

p (c | xt0:tN ) = p (c1, c5 | xt0:tN )

4∏j=2

p (cj | xt0:tN ) . (4.51)

with the product term of the right hand side being independent Gamma distri-butions

p (c2, c3, c4 | xt0:tN ) =

4∏n=2

G

(cj | aj + rj , bj +

∫ T

0

gj (xt)

).

and the joint distribution p (c1, c5 | xt0:tN ) given by equation (4.34).Note that the PMDA1 algorithm is an approximation, since the particles aresampled from p

(xt0:tN | c,yt0:tN

)rather than from p

(xt0:tN | yt0:tN

).

4.3.2 Poor man’s data augmentation algorithm 2

For a slightly different approach we start by assuming for a moment that weare able to straightforwardly evaluate p

(xt0:tN | yt0:tN

). In this case we can ex-

actly sample from the true distribution by making use of importance samplingtechnique with the approximate distribution p

(xt0:tN | c,yt0:tN

)serving as im-

portance distribution from which we can easily generate sample paths. The fullalgorithm looks as follows:

53

Page 55: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

PMDA-Exact algorithm

Input: parameter estimator c

Sample M samples X(1)t0:tN , . . . ,X

(M)t0:tN with

X(i)t0:tN ∼ p

(xt0:tN | c,yt0:tN

), i = 1, . . . ,M

using the PF algorithmCalculate importance weights wi, i = 1, . . . ,M, as

wi =p(X

(i)t0:tN | yt0:tN

)p(X

(i)t0:tN | c,yt0:tN

)Compute the posterior distribution:

p(c | yt0:tN

)=

M∑i=1

wi p(c | X(i)

t0:tN ,yt0:tN

)wi

.

Now, if the distribution p(x

(i)t0:tN | yt0:tN

)is difficult to evaluate, but a second-

order approximation is available instead, we can subsitute this approximation

to p(x

(i)t0:tN | yt0:tN

)for p

(x

(i)t0:tN | yt0:tN

)in the calculation of the importance

weights wi inside the above PMDA-Exact algorithm. This substitution allowsus to still utilize the importance sampling idea, albeit now no longer samplingexactly from p

(c | yt0:tN

).

To derive the algorithm, we start with the decomposition

p(xt0:tN | yt0:tN

)=

∫C

p(xt0:tN | c,yt0:tN

)p(c | yt0:tN

)dc

=

∫C

p(xt0:tN | c,yt0:tN

)×[

p(c | xt0:tN ,yt0:tN

)p(xt0:tN | yt0:tN

)p(xt0:tN | c,yt0:tN

) ]dc.

(4.52)

The last line allows us to interpret the distribution in terms of an expectationp(xt0:tN | yt0:tN

)= E

[p(xt0:tN | c,yt0:tN

)], where the expectation is under-

stood with respect to the bracketed expression on the right hand side of (4.52).By invoking Approximation 2 given in (A.9) we can now approximate this ex-pectation with an accuracy of order o

(1/M2

), where M denotes the sample

size. Specifically, we arrive at

E [g (c)] =

(det (Σ∗)

det (Σ)

)1/2exp {−Mh∗ (c∗)}exp {−Mh(c)}

[1 + o

(1

M2

)], (4.53)

54

Page 56: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

where in accordance with the notation used in (A.9) we have

g (c) = p(xt0:tN | c,yt0:tN

)(4.54)

−Mh (c) = log p(c | yt0:tN

)(4.55)

= log p(c | xt0:tN ,yt0:tN

)+ log p

(xt0:tN | yt0:tN

)− log p

(xt0:tN | c,yt0:tN

)(4.56)

−Mh∗ (c) = −Mh (c) + log g (c)

= log p(c | xt0:tN ,yt0:tN

)+ log p

(xt0:tN | yt0:tN

)(4.57)

Σ∗ =

[∂2h∗

∂c2

∣∣∣∣c∗

]−1

, Σ =

[∂2h

∂c2

∣∣∣∣c

]−1

(4.58)

with c = argmaxc∈C (−Mh (c)) and c∗ = arg maxc∈C (−Mh∗ (c)).From equation (4.57) we can immediately see that the maximizer c∗ is themaximimizer of the posterior distribution p

(c | xt0:tN ,yt0:tN

)given in equa-

tion (4.21). We further note that as det (Σ) does not depend on the processpath xt0:tN (see equation (4.58) in conjunction with (4.55)), it can thereforebe absorved into the normalization constant. Hence, we obtain as second-orderapproximation:

p(xt0:tN | yt0:tN

)∝ (det (Σ∗))

1/2 p(c∗ | xt0:tN ,yt0:tN

)p(xt0:tN | c,yt0:tN

)p(c | xt0:tN ,yt0:tN

) .

(4.59)In order to arrive at the approximation to the posterior distribution p

(c | yt0:tN

)we mimic the PMDA-Exact algorithm by employing an importance sampling

step, using as importance density p(X

(i)t0:tN | c,yt0:tN

)for i = 1, . . . ,M with

the parameters values fixed to the estimator c and sample paths X(i)t0:tN ∼

p(· | c,yt0:tN

)generated by the MCEM-PF Algorithm. The importance weights

are consequently given as

wi =(

det(

Σ∗(i)))1/2 p

(c∗(i) | X(i)

t0:tN ,yt0:tN

)p(X

(i)t0:tN | c,yt0:tN

)p(c | X(i)

t0:tN ,yt0:tN

)p(X

(i)t0:tN | c,yt0:tN

)=(

det(

Σ∗(i)))1/2 p

(c∗(i) | X(i)

t0:tN ,yt0:tN

)p(c | X(i)

t0:tN ,yt0:tN

) , (4.60)

where c∗(i) denotes the maximizer of p(· | X(i)

t0:tN ,yt0:tN

)and Σ∗(i) is calculated

with respect to the sample path X(i)t0:tN .

The complete algorithm, which is known as the Poor man’s data augmentationalgorithm 2 (PMDA2), proceeds as follows:

55

Page 57: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

PMDA 2 algorithm

Input: parameter estimator c = (c1, c2, c3, c4)

Sample M samples X(1)t0:tN , . . . ,X

(M)t0:tN with

X(i)t0:tN ∼ p

(xt0:tN | c,yt0:tN

), i = 1, . . . ,M

using the PF algorithmfor i = 1, . . . ,M do

Compute weight

wi =(

det(

Σ∗(i)))1/2 p

(c∗ | X(i)

t0:tN ,yt0:tN

)p(c | X(i)

t0:tN ,yt0:tN

)end forApproximate the posterior distribution:

p(c | yt0:tN

)≈

∑Mi=1 wi p

(c | yt0:tN ,X

(i)t0:tN

)∑Mi=1 wi

.

PMDA2: Lotka-Volterra process Since we know that the posterior dis-tribution of the parameters equals a product of Gamma distributions (compareequation (4.30)), it follows

−Mh∗ (c) ∝ log p(c | xt0:tN ,yt0:tN

)=

4∑j=1

(aj + rj) log cj − cj

(bj +

∫ T

0

gj (xt) dt

)(4.61)

and a differentiation with respect to the parameters and subsequently settingto zero leads to the maximizer

c∗ = argmaxc∈C

h∗ (c) =

(a1 + r1

b1 +∫ T

0g1 (xt) dt

, . . . ,a4 + r4

b4 +∫ T

0g4 (xt) dt

). (4.62)

In order to find the specific form of det (Σ∗), we first compute the second deriva-tive of h∗ (c) with respect to a specific parameter cj :

∂2h∗ (c)

∂c2j=aj + rjMc2j

, j = 1, 2, 3, 4. (4.63)

56

Page 58: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

Since the cross-derivations vanish due to the independence of the parameters,i.e.∂2h∗ (c) /∂cicj = 0 for i 6= j, we obtain by inserting (4.62) into equation (4.63):

Σ∗ =

[∂2h∗

∂c2

∣∣∣∣c∗

]−1

= diag

(bj +

∫ T0gj (xt) dt

)2

M (aj + rj)

−1

j=1,2,3,4

= diag

M (aj + rj)(bj +

∫ T0gj (xt) dt

)2

j=1,2,3,4

, (4.64)

where diag (sj)j=1,2,3,4 denotes the diagonal matrix with jjth entry sj .Thus it follows from the diagonal structure of the matrix Σ∗:

det (Σ∗) =

4∏j=1

M (aj + rj)(bj +

∫ T0gj (xt) dt

)2 . (4.65)

Finally, we have to evaluate the posterior distribution p(· | X(i)

t0:tN ,yt0:tN

)at

the point estimate c and at the maximizer c∗, with the posterior density beingthe (by now familiar) product of Gamma distributions given in equation (4.30).

PMDA2: Gene autoregulatory network process Recall that the algo-rithm involves the evaluation of the inverse of the hessian matrices Σ and Σ∗

with respect to the parameter vector c. As we cannot differentiate the likeli-hood p

(Xt0:tN ,yt0:tN | c

)with respect to c5 due to the Heaviside function, the

PMDA2 algorithm is not applicable to this process.

The magnitude of the error induced by the approximation in the PMDA al-gorithms cannot be determined in practice [44]. In our case we can howevercompare the results of PMDA1 in the autoregulatory network example andPMDA1 and PMDA2 in the Lotka-Volterra example with the results computedby the Bayesian-PF algorithm, which was our motivation to introduce the classof PMDA algorithms in the first place. If the results of the different methodscoincide for a particular process, we have however little reason to doubt theirvalidity.

57

Page 59: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

5 Empirical Results

We will now apply the parameter inference algorithms discussed above to ourexample processes. To this end, we generated data sets in all four instancesby respectively simulating a trajectory from the true process with the help ofthe Gillespie algorithm, dividing up the time interval into (N − 1) subintervalsof equal length, and corrupting the resulting N interval points (including thestart and end time point) by a normally distributed noise. The N data pointsthen served as the available observations in the parameter inference algorithms.As prior over the model parameters we choose a Gamma distribution, whichturns out to be the conjugate distribution to the respective likelihood (with theonly exception being the parameter pair c1 and c5 in the second example, sincethe hazard of c5 does not follow the mass action kinetics laws). Specifically, wechoose for a parameter the distribution ci ∼ G (· | 1, 1/(4ctrue

i )), with i = 1, . . . , pfor a model with p parameters (compare [31]), where ctrue

i denotes the trueparameter value. This distribution turns out to be relatively vague around thetrue value of the parameter, without being too uninformative.

Figure 5: Example of a prior distribution c1 ∼ G(· | 1, 1/(4ctrue1 )

)for the parameter

ctrue1 = 2e − 03 taken from the second process. While both plots displaythe same distribution, the right picture shows the prior in the vicinity ofthe true parameter value (marked by the blue lines, respectively), whichturns out to be reasonably flat.

In order to assess the quality of our estimation outcomes, we report for theexamples discussed below inference results of comparable approaches from theliterature. In each case we use the same specification as the respective referencemodel, but as we generated our data sets synthetically, we did not have theexact same observation sequence, on which the literature results were based, atour disposal. We therefore decided to analyse two different data sets with thesame specification for each of our example processes.

5.1 Results: Lotka-Volterra process

In order to assess the parameter inference algorithms on the Lotka-Volterra pro-cess, we simulated two different processes with the specification given in [40],namely with parameter vector c = (c1, c2, c3, c4) = (5e− 4, 1e− 4, 1e− 4, 5e− 4)on the interval [0, 2000], where the initial population contains 19 preys and7 predators. We generated 20 observations at equidistant time points and

58

Page 60: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

corrupted each observation by a normally distributed noise-term of the formε ∼ N (· | 0, 1).In the case of the MCEM-PF algorithm we start with a random initializationof the parameter values drawn from the uniform distributions on the interval[1/3× ci, 3× ci] for i = 1, . . . , 4. The algorithm proceeds by generating 3000particles via the PF algorithm and subsequent parameter maximization at eachiteration, until convergence (up to a certain oscillation) is reached. We thenpick out the parameter estimator c of the iteration with the highest likelihoodp (Y | c). In a next step we use the resulting point estimate as start parame-ter in the two Poor Man’s Data Augmentation (PMDA) algorithms in order toobtain approximations to the posterior density p (c | Y), respectively. As priordistributions for the PMDA1 and PMDA2 algorithms as well the Bayesian-PF we use c1 ∼ G (· | 1, 400) , c2 ∼ G (· | 1, 2000) , c3 ∼ G (· | 1, 2000) , c4 ∼G (· | 1, 400). The next figure illustrates the inference results for a maximumlikelihood (PMDA2) as well as the bayesian approach on the first example pro-cess.

Figure 6: Parameter estimation results for the first process: The upper four graphsrepresent the posterior distributions p (ci | yt0:tN ) , i = 1, . . . , 4, as gener-ated by the PMDA2 algorithm, the lower four graphs the distributionsresulting from an Bayesian-PF application. Red and blue bars denote thetrue model parameter values, respectively.

The complete empirical results for the Lotka-Volterra process are summarizedin the following table.

59

Page 61: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

c1 c2 c3 c4True rates

5e-04 1e-04 1e-04 5e-04EM-PF Inference: Parameter point estimation, first process

4.74e-04 1.27e-04 1.28e-04 8.16e-04EM-PF Inference: Parameter point estimation, second process

5.78e-04 8.40e-05 1.28e-04 7.41e-04PMDA1: Parameter estimation, first process

Mean 4.76e-04 1.32e-04 1.27e-04 8.12e-04S.D. 2.05e-04 2.87e-05 3.13e-05 1.39e-04

PMDA1: Parameter estimation, second processMean 5.75e-04 8.42e-05 1.27e-04 7.44e-04S.D. 2.10e-04 2.87e-05 3.06e-05 1.45e-04

PMDA2: Parameter estimation, first processMean 4.78e-04 1.33e-04 1.27e-04 8.13e-04S.D. 2.07e-04 2.90e-05 3.07e-05 1.38e-04

PMDA2: Parameter estimation, second processMean 5.77e-04 8.41e-05 1.28e-04 7.44e-04S.D. 2.09e-04 2.83e-05 2.98e-05 1.49e-04

Bayesian-PF: Parameter estimation, first processMean 3.58e-04 1.26e-04 1.20e-04 7.03e-04S.D. 2.46e-04 2.35e-05 2.22e-04 2.81e-05

Bayesian-PF: Parameter estimation, second processMean 4.58e-04 9.13e-05 9.98e-05 7.03e-04S.D. 1.80e-04 2.58e-05 2.33e-05 1.26e-04

Variational Inference: Parameter estimation from [40]Mean 6.5e-04 1e-04 0.9e-04 8.1e-04S.D. 5.4e-04 0.6e-04 0.4e-04 6.4e-04

Figure 7: Comparison of results for two synthetic Lotka-Volterra processes with iden-tical specifications: MCEM-PF inference are point estimates, PMDA1 andPMDA2 report the mean and standard deviation (S.D.) of the posteriordistribution, which follows from an application of the respective PMDA al-gorithms to the two different MCEM-PF point estimates. Further given arethe moments of the posterior distributions resulting from the Bayesian-PFalgorithm, as well as the estimation results from a variational inference.

Since the MCEM-PF algorithm does only converge to a local optimum, we ranthe algorithm with ten different random parameter initialization for each ofthe two processes and picked out for each process the point estimator with thehighest likelihood p (c | Y) over all ten instances. For convergence the differentinstantiations of the MCEM-PF algorithm took around 10-15 iterations to sta-bilize (up to a certain oscillation) to a local optimum. In each case, we took theoptimum out of a run of 20 iterations.For the case of the bayesian estimation approach, we increased the particle num-

60

Page 62: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

ber from 3000 to 5000 in order to assure that the parameter space gets exploredsufficiently in the repeated resampling steps. In the empirical study a particlesize above 5000 did not impact the results in a significant way.While both the PMDA algorithm as well as the Bayesian-PF approach esti-mate the parameters reasonably well, we found (somewhat surprisingly) thatthe Bayesian-PF algorithm, despite its additional parameter resampling steps,does not come with a higher posterior variance as the PMDA algorithms in ourexamples. The difference between the estimations for different processes is rela-tively small. Interestingly the bayesian ansatz seemed to work more consistently,meaning that for a given fixed Lotka Volterra process the bayesian approachyields very similar outputs at different instantiations, while the MCEM-PF algo-rithm sometimes gets stuck in some local optimum which can vary significantlyfrom the true parameter values, and subsequently resulting in a bad posteriorapproximation of the PMDA algorithms. Therefore running several instanceswith different starting values in the case of the EM algorithm seems to be anecessity.Since PMDA1 and PMDA2 results give virtually the same result and accord wellwith the bayesian results, the approximations should be reliable. Finally we notethat our results are roughly on par with the variational inference outcomes, asreported in [40].

5.2 Results: Gene autoregulatory network process

For our second example, we again start by generating two different processeswith specification gleaned from [40]. In particular, we simulated gene au-toregulatory network processes on the time interval [0, 100000] with parametersc = (2e− 03, 6e− 05, 5e− 04, 7e− 05, 20) and initial population consisting ofX(1) = 12 mRNAs and X(2) = 17 proteins. We simulated 20 observations onequidistant time points over the whole interval, where each observation equalsthe state of the process corrupted by a Gaussian noise term ε ∼ N (0, 1).We start with the EM-SMC algorithm. As already mentioned above, includingthe treshold parameter c5 of the Heaviside function in the estimation and em-ploying the strategy outlined in equation (4.15) does not work in practice, sincein this case the estimator of c5 rarely does change its value (if at all) during onerun of the algorithm. This implies that if we start with an initial value of c5far away from the true value c5 = 20, the (whole 5-dimensional) estimator com-puted by the MCEM-PF algorithm will be far away from the true vector mostof the time, thereby rendering the estimation unreliable. In order to circumventthis problem, we fix c5 at the outset and run the MCEM-PF algorithm on thereduced parameter vector c = (c1, c2, c3, c4). Since c5 is unknown, we have toindividually compute the algorithm for each c5 ∈ S, where we assume specifi-cally for our example the set of integers S = (10, 11, . . . , 29, 30). Although beingstraightforward to implement, this approach is very time-consuming in practice.

For each run we initialized the parameters c = (c1, c2, c3, c4) by drawing froma uniform distribution with support [1/3× ci, 3× ci], where ci, i = 1, 2, 3, 4 de-

61

Page 63: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

notes the true value. The algorithm proceeds by generating 3000 particles viathe particle filtering algorithm and subsequent parameter maximization at eachiteration, until convergence (up to a certain oscillation) is reached. In analogyto the previous example, we then pick out the parameter estimator c of the it-eration with the highest marginal likelihood p(Y | c) over ten different runs foreach of the two processes, respectively. The MCEM-PF algorithm took around8-10 iterations to stabilize (up to a certain oscillation) to a local optimum. Ineach case, we took the optimum out of a total run of 15 iterations.

Figure 8: Plot of the marginal log-likelihood ln p(Y | c) for each fixed c5 ∈ S. Forthe first process (left picture) the maximum is attained at c5 = 21, for thesecond process the maximum is attained at c5 = 22. Both figures representthe run with the highest marginal likelihood over all ten runs with randominitializations, respectively. The red bar denotes the true value.

As prior distributions for the PMDA1 (recall that the PMDA2 algorithm isnot applicable to this process) and the Bayesian-PF algorithm, we choose theusual vague Gamma distributions for the first four parameters, specifically c1 ∼G (1, 125), c2 ∼ G (1, 4166.67), c3 ∼ G (1, 500) and c4 ∼ G (1, 3571.43). For thetreshold parameter c5 ∼ U (S) we assume a discrete uniform distribution on theset S, assigning equal probability to each s ∈ S. The Bayesian-PF algorithmis run with 5000 particles in order to allow for a sufficient exploration of theparameter space.

62

Page 64: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

Figure 9: Parameter estimation results for the first process and the first four pa-rameters: The upper four graphs represent the posterior distributionsp (ci | yt0:tN ) , i = 1, . . . , 4, as generated by the PMDA1 algorithm, thelower four graphs the distributions resulting from an Bayesian-PF applica-tion. Red and blue bars denote the true model parameter values, respec-tively.

c1 c2 c3 c4 c5True rates

2e-03 6e-05 5e-04 7e-05 20EM-PF Inference: Parameter point estimation, first process

1.09e-03 5.18e-05 5.05e-04 7.56e-04 22EM-PF Inference: Parameter point estimation, second process

1.59e-03 1.05e-04 5.02e-04 7.14e-05 21PMDA1: Parameter estimation, first process

Mean 1.14e-03 5.21e-04 5.06e-04 7.58e-05 22.03S.D. 3.12e-04 1.30e-05 5.56e-05 4.90e-06 0.36

PMDA1: Parameter estimation, second processMean 1.61e-03 1.05e-04 5.08e-04 7.12e-05 21.00S.D. 4.92e-04 1.22e-05 3.71e-05 8.81e-06 0.17

Bayesian-PF: Parameter estimation, first processMean 1.73e-03 5.46e-05 5.27e-04 8.05e-05 19.23S.D. 2.39e-03 1.45e-05 1.04e-04 2.04e-05 1.03

Bayesian-PF: Parameter estimation, second processMean 2.14e-03 9.94e-05 7.75e-04 7.02e-05 19.36S.D. 1.28e-03 1.59e-05 1.01e-04 1.70e-05 3.41

Variational Inference: Parameter estimation from [40]Mean 1.6e-03 6.5e-05 4.9e-04 7.7e-05 20.4S.D. 0.5e-03 1.0e-05 0.7e-04 1.2e-05 1.6

Figure 10: Results for two autoregulatory network processes with identical speci-fications: MCEM-PF yields point estimates, PMDA1 and Bayesian-PFreport the moments of the posterior distribution. As reference, we in-cluded the variational inference outcome as reported in [40].

63

Page 65: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

By examining the outputs for the two processes as given in the table, oneimmediately notices that the probability mass of the posterior distributionp (c5 | yt0:tN ) in the PMDA1 inference case is heavily concentrated on the valuec5 given by the point estimate in both instances (c5 = 21 in the first exam-ple, c5 = 22 in the second example). Figure 11 gives a visualization of thesedistributions.

Figure 11: First row: Posterior distributions p (c5 | yt0:tN ) resulting from thePMDA1 algorithm for the first (left figure) and second (right figure) pro-cess. In both cases the distribution is heavily concentrated on the pointestimate value of c5, specifically c5 = 21 for the first and c5 = 22 on thesecond process, respectively. Second row: Posterior distribution of c5 re-sulting from the Bayesian-PF algorithm for the first and second process.The probability mass of c5 shows much more variability, especially forthe second process. The blue bars denote the true value.

As we will now see, there exists a close connection between this concentrationeffect and the already mentioned fact, that in the MCEM-PF case the full op-timization (including c5) doesn’t perform well in practice.Recall that the PMDA1 algorithm proceeds by first generating M particles viathe PF algorithm, using parameters c, and subsequently approximating the

posterior distribution as p(c | yt0:tN

)≈ 1

M

∑Mi=1 p

(c | X(i)

[0,T ],yt0:tN

). In the

MCEM-PF algorithm, during the (k + 1)th iteration we generate M particlesvia the PF filtering algorithm with parameters ck and subsequently optimize

the E-step expression of the form Q (c, ck) = 1M

∑Mi=1 log p

(X

(i)[0,T ],yt0:tN | c

).

Further recall that since we are employing a vague prior, the marginal poste-

rior p(c5 | X(i)

[0,T ],yt0:tN

)is proportional to the likelihood p

(X

(i)[0,T ] | c5,yt0:tN

).

Then it becomes evident that the heavily concentration of the probability massat the current parameter value of c5 (which we denote as ck,5) is also respon-sible for the fact that in the MCEM-PF algorithm the new parameter pair(c(k+1),1, c(k+1),5

)maximizing the E-step is mostly the one with the same c5-

value c(k+1),5 = ck,5 as the current value ck,5 (compare equation (4.15)). Weactually found that the concentration effect is at play independently of the valueof c5, albeit in a slightly less dramatic way as we move further away from the

64

Page 66: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

true value c5 = 20.Comparing the inference results of the MCEM-PF and the Bayesian-PF algo-rithm, we see that the posterior distributions accord reasonably well in the firstexample (at least with the exception of c5), whereas in the second example theestimation for the mean c1 and the variance of c5 differ significantly. In thebayesian case the posterior distribution of c5 has its probability mass spreadover the whole support, starkly constrasting the c5 posterior in the MCEM-PFcase (see figure 11). Since c1 together with c5 constitute the parameter pairpertaining to the first hazard rate, it is not surprising that the posterior distri-butions of c1 differ as well, with the distribution in the bayesian inference beingsignificantly right-skewed.The difference in the inference results especially in the second example could bedue to the repeated parameter resampling step in the Bayesian SMC algorithm,which helps to better explore the parameter space especially with respect toc5. The exploration of the parameter space of c5 probably contributes to thenotably higher overall variance in the bayesian estimation case, which also turnsout to be higher than the variation in the results as reported in [40]. The overallmeans, however, seem to be roughly comparable.

5.3 Results: Prokaryotic autoregulation model

The basis of the parameter inference task for our third example again consistsof two synthetic data sets, which were generated on the interval [0, 50] withparameter vector c = (0.1, 0.7, 0.35, 0.2, 0.1, 0.9, 0.3, 0.1) and initial populationX0 = (RNA,P,P2,DNA) = (1, 1, 4, 2). The number of gene copies is fixed tok = 10. We simulated 50 observations at equidistant time points, where eachobservation is obtained by corrupting the state of the process by a Gaussiannoise ε ∼ N (· | 0, 2). We took the model together with the parametrization(except for the initial state, which was not reported) from [28]. Since thismodel turned out to be computationally quite demanding, we restrict ourselvesto the application of the Bayesian-PF algorithm for the sake of this analysis.Since all eight parameters in this model follow the laws of mass action kinetics,we apply our usual vague Gamma distribution ci ∼ G (1, 1/(4ctrue

i )) , i = 1, . . . , 8,where ctrue

i denotes the true parameter value. In the application of the Bayesian-PF algorithm to the two synthetic example processes we increased to number ofparticles to 8000 (in contrast to the 5000 particles in the previous two examples)to account for the larger parameter space in this model. The next figure pro-vides a graphical representation of the parameter inference results on the firstprocess, which is followed by a table summarizing the posterior moments forboth processes. As reference, we included the estimation outcomes from [29],which utilizes a SDE diffusion approach within a Markov-Chain Monte Carloframework.

65

Page 67: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

Figure 12: Bayesian-PF approximations of the marginal posterior distributionsbased on the first prokaryotic autoregulatory process. Each blue bardenotes the respective true value.

66

Page 68: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

c1 c2 c1/c2 c3 c4 c5 c6 c5/c6 c7 c8True rates

0.1 0.7 0.143 0.35 0.2 0.1 0.9 0.111 0.3 0.1Bayesian-PF: Parameter estimation, first process

Mean 0.060 0.399 0.159 0.465 0.172 0.021 0.303 0.071 0.337 0.115S.D. 0.039 0.285 0.035 0.084 0.099 0.006 0.094 0.035 0.065 0.093

Bayesian-PF: Parameter estimation, second processMean 0.083 0.535 0.158 0.462 0.267 0.022 0.376 0.059 0.609 0.151S.D. 0.020 0.132 0.028 0.129 0.062 0.007 0.116 0.028 0.162 0.042

Bayesian-MCMC: Parameter estimation from [29]Mean 0.078 0.612 0.128 0.363 0.236 0.070 0.680 0.104 0.299 0.138S.D. 0.022 0.174 0.019 0.095 0.052 0.024 0.231 0.014 0.076 0.030

Figure 13: Estimated marginal posterior mean and standard deviations based on theBayesian-PF algorithm for the different hazard rates of the prokaryoticautoregulation model, using two different synthetic processes.

Comparing the outputs, one can see that the inference results from the Bayesian-PF algorithm are inferior to the Bayesian-MCMC estimates reported in [29].This is most evident in the estimations of c1,c2 and especially c5 and c6. Whilethe ratio c1/c2 (corresponding to the propensity of reaction R1) can be rea-sonably recovered, the estimation of c5/c6 (corresponding to the propensity ofreaction R5) is quite imprecise. Although the reference MCMC method esti-mates the values of c5 and c6 too low, the infered ratio c5/c6 turns out to befairly accurate. If the inference quality of the Bayesian-PF algorithm could beimproved by, say, significantly increasing the particle number (our method em-ploys 8000 particles, while the MCMC scheme of [29] was run for a total numberof one million iterations) or if on the other hand the approximation error inher-ent in the proposal process prevents better estimation results is unclear at thispoint, and could only be evaluated by further empirical investigations.

5.4 Results: Stochastic reaction-diffusion model

For the empirical analysis of our final example, we again generated two processeson the interval [0, 1500] with parameter vector c = (k2, k1, d) = (0.4, 0.0001, 0.01)and initial population of zero proteins in the system. We simulated 15 observa-tions at equidistant time points, where each observation is obtained by corrupt-ing the state of the process by a Gaussian noise ε ∼ N (· | 0, 0.16). The modeltogether with the above specification was taken from [14].As priors over the model parameters we employ the usual vague Gamma dis-tributions, which in this case are k1 ∼ G (1, 2500), k2 ∼ G (1, 0.625) and c3 ∼G (1, 2.5). As in the previous example, we restrict our parameter inference anal-ysis due to the high computational burden to the bayesian inference approach,where we employ the Bayesian-PF algorithm with a total number of 5000 par-ticles.

67

Page 69: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

Figure 14: Posterior distributions for the three model parameters k1, k2 and d, re-sulting from an application of the Bayesian-PF algorithm. The blue barsdenote the true values.

The posterior moments for both processes are reported in the following table:

k1 k2 dTrue rates

1e-04 0.4 0.01Bayesian-PF: Parameter estimation, first process

Mean 1.95e-04 0.4005 0.0073S.D. 2.19e-05 0.0166 0.0001Bayesian-PF: Parameter estimation, second processMean 1.68e-04 0.3924 0.0074S.D. 1.58e-02 0.0158 0.0001

Figure 15: Moments of the estimated marginal posterior distributions to the twosimulated stochastic reaction-diffusion processes, based on the Bayesian-PF approximations.

Although there are no moments of the approximated posterior distributionsavailable, ([14], Figure 2) provides a graphical summary of their estimationresults based on a variational inference scheme. By examining the Bayesian-PFresults given above, one can see that the estimation results of the two processesare quite similar, with the production rate k2 being estimated accurately whilethe value of k1 and d turn out to be rather imprecise. The fact that the effectsof decay and diffusion, which both result in a protein leaving a certain bin, arehard to distinguish in this model together with the diffusion rate d = 1e − 02being two orders of magnitudes bigger than k1 = 1e − 04 renders the decayparameter k1 unidentifiable, since this effect is negligible compared to the effectof d (compare discussion in [14]). One also notices that the posterior estimationof k2 displays significantly more variability than the estimations of either k1 ord, which is owed to the fact that proteins are only generated in the first binand thereby harder to estimate. Judging from this preliminary analysis, the

68

Page 70: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

results of the Bayesian-PF estimation seem to accord reasonably well with thevariational inference results, although a more thorough investigation would berequired in order to corroborate this assessment.

6 Discussion

The purpose of this thesis was to approach the parameter inference task forstochastic reaction models within the framework of Sequential Monte Carlomethods. Specifically, we discussed the particle filtering algorithm, which al-lowed us to approximate posterior distributions of our target process via MonteCarlo sampling and further derived a suitable importance process as a efficientand computationally tractable way to simulate trajectories from the desired bio-chemical reaction models. We then integrated the particle filtering algorithminto a maximum likelihood as well as a bayesian parameter inference schemeand empirically investigated these approaches on the basis of four different bio-chemical reaction processes gleaned from the literature. While the MCEM-PFalgorithm displayed serious weaknesses by being computationally very demand-ing as well as highly sensitive to the parameter initialization, the Bayesian-PFalgorithm delivered some promising results, being significantly faster as well asmore consistent in the estimation results compared to the maximum likelihoodapproach. In order to thoroughly evaluate the quality of the Bayesian-PF algo-rithm, especially with respect to state-of-the-art inference approaches discussedin the literature, more empirical investigations would be needed.

69

Page 71: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

A Posterior Moments based on Laplace’s method

Laplace’s method. Suppose h (c) is a smooth, bounded unimodal function witha maximum c, then the integral

I =

∫f (c) exp {−nh (c)} dc (A.1)

can be approximated by

I = f (c)

√2π

nσ exp {−nh (c)} (A.2)

with

σ =

[∂2h

∂c2

∣∣∣∣c

]−1/2

(A.3)

and n denoting the sample size.

Proof. Expanding I to the second order around the peak c yields

I ≈∫f (c) exp

{−n

[h (c) + (c− c)h′ (c) +

(c− c)2

2h′′ (c)

]}dc (A.4)

By remembering h′ (c) = 0, we get

I ≈∫f (c) exp

{−n

[h (c) +

(c− c)2

2h′′ (c)

]}dc

= f (c) exp {−nh (c)}∫

exp

{−n (c− c)

2

2σ2

}dc

= f (c) exp {−nh (c)}√

2πσ2

n, (A.5)

where the last line follows from recognizing that the expression under the inte-gral in the next-to-last line equals the (unnormalized) kernel of a normal distri-bution N

(c, σ2/n

). Furthermore it can be shown [45], that the approximation

error is of the order

I = I

[1 + o

(1

n

)](A.6)

In order to calculate moments of posterior distributions such as the expectationof g (c) with respect to a (possibly unnormalized) distribution exp {−nh (c)},we need to evaluate the expression

E [g (c)] =

∫g (c) exp {−nh (c)} dc∫

exp {−nh (c)} dc(A.7)

We will now introduce two approximations to equation (A.7), each of whichmakes use of Laplace’s method in a slightly different way.

70

Page 72: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

Approximation 1.

E [g (c)] = g (c)

[1 + o

(1

n

)](A.8)

Proof. We verifiy this approximation by first applying Laplace’s method to thenumerator of equation (A.7) with f = g, which yields

g (c) exp {−nh (c)}√

2πσ2

n

Next we apply Laplace’s method to the denominator of (A.7) with f = 1, toobtain

exp {−nh (c)}√

2πσ2

n

The final derivation showing that the resulting approximation g (c) has an errorof order o (1/n) can be found in [45].

Approximation 2.

E [g (c)] =

(det (Σ∗)

det (Σ)

)1/2exp {−nh∗ (c∗)}exp {−nh (c)}

[1 + o

(1

n2

)]. (A.9)

Proof. To show this, we apply Laplace’s method to the numerator of (A.7) withf = 1, g > 0 and −nh∗ (c) = −nh (c) + log (g (c)), where c∗ denotes the modeof −h∗ (c) and

Σ∗ =

[∂2h∗

∂c2

∣∣∣∣c∗

]−1

.

Furthermore we apply Laplace’s method to the denominator with f = 1 and

Σ =

[∂2h

∂c2

∣∣∣∣c

]−1

.

Then it can be shown that the resulting ratio has an error of order o(1/n2

), a

proof of which is given by [45]. A simple proof for the special case of real valuedfunctions can be found in [27].

For more details on approximations based on Laplace’s method, the reader isreferred to [44] and [27].

B Brownian Motion

A stochastic process {Wt} , t ∈ [0,∞) is a Brownian motion, if the processdepends continously on t, Wt ∈ (−∞,∞) and the following assumptions hold:

1. W0 = 1 with probability 1

2. For 0 ≤ t1 < t2 < t3 <∞, Wt3 −Wt2 and Wt2 −Wt1 are independent

71

Page 73: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

3. For 0 ≤ t1 < t2 <∞, Wt2 −Wt1 ∼ N (0, t2 − t1)

Consequently, the transition density from x to y in a time period t is given by

p (x,y; t) =1√2πt

exp

{− (x− y)

2

2t

}(B.1)

For details the reader is referred to [37].

C Stochastic Differential Equations

Consider an univariate stochastic process Xt defined on [0,∞). Then Xt satisfiesthe Ito Stochastic Differential Equation (SDE)

dXt = f (Xt) dt+D12 (Xt) dWt, (C.1)

if for any t ≥ 0 the following integral equation holds:

Xt =

∫ t

0

f (Xs) ds+

∫ t

0

D12 (Xs) dWs. (C.2)

The function f (·) inside the Riemann integral on the right hand side is called

the drift coefficient and the function D12 (·) inside the Ito integral is called the

diffusion coefficient. In order for solutions to the SDE to exist and to be unique,the following Lipschitz and linear growth conditions need to hold:

|f (Xt)− f (Yt)|+∣∣∣D 1

2 (Xt)−D12 (Yt)

∣∣∣ ≤ C1 |Xt − Yt| (C.3)

|f (Xt)|2 +∣∣∣D 1

2 (Xt)∣∣∣2 ≤ C2

2

(1 + |Xt|2

)(C.4)

with C1, C2 > 0 denoting constants, Xt, Yt ∈ R and t ∈ [0, T ]. It can be shownthat the above conditions are sufficient.

The framework can be straightforwardly generalized to multivariate diffusionprocesses. For a d-dimensional stochastic process Xt = (X1t, . . . , Xdt) definedon [0,∞), the SDE is given by (compare to equation (C.1):

dXt = f (Xt) dt+D12 (Xt) dWt, (C.5)

where the drift vector has entries fi (x), for i = 1, . . . , d:

fi (x) = lim∆t→0

1

∆tE[{x

(i)t+∆t − x

(i)t

}| xt = x

](C.6)

and the diffusion matrix has entries Dij (X), for i, j = 1, . . . , d:

Dij (x) = lim∆t→0

1

∆tCov

[{x

(i)t+∆t − x

(i)t

},{x

(j)t+∆t − x

(j)t

}| xt = x

]. (C.7)

Furthermore, the vector dWt = (W1t, . . . ,Wdt)′

denotes the increment of a d-dimensional Brownian motion.Detailed discussions on stochastic differential equations can be found in [37].

72

Page 74: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

C.1 Fokker-Planck equation

It can be shown that the dynamics of a solution Xt to the SDE is described bythe so-called Fokker-Planck equation. We outline the proof for the case of anone-dimensional diffusion as given in equation (C.1).

By picking any Borel set B ∈ B, we have:

P (Xt ∈ B) = E [I (Xt ∈ B)] =

∫B

p (x; t) dx. (C.8)

In order to facilitate differentiation, we approximate the indicator function I (·)by some smooth function like for example the tanh function. Now we take thederivative to (C.8), which according to Ito’s rule [37] yields:

dP (Xt ∈ B) = dE [I (Xt ∈ B)]

= E[I ′ (Xt ∈ B) dXt] +1

2E[I ′′ (Xt ∈ B) (dXt)

2]

= E [I ′ (Xt ∈ B) f (Xt)] dt+1

2E [I ′′ (Xt ∈ B)D (Xt)] dt, (C.9)

where the last line follows, since the expectation with respect to the Ito integralis zero.Consequently, we have

d

dtP (Xt ∈ B) =

∫I ′ (x ∈ B) f (xt) p (x; t) dx+

1

2

∫I ′′ (x ∈ B)D (xt) p (x; t) dx.

(C.10)Applying integration-by-parts to the expression on the right hand side, we getfor the first term∫

I ′ (x ∈ B) f (xt) p (x; t) dx =

∫ {∂

∂x[I (x ∈ B) f (xt) p (x; t)]

−I (x ∈ B)∂

∂x[f (xt) p (x; t)]

}dx

= −∫B

∂x[f (xt) p (x; t)] dx, (C.11)

since I (x ∈ B) f (xt) p (x; t) vanishes at the boundaries. In a similar fashion weintegrate the second term two times by parts, which results in∫

I ′′ (x ∈ B)D (xt) p (x; t) =

∫B

∂2

∂x2[D (xt) p (x; t)] dx. (C.12)

Substituting the two expressions into (C.10), we obtain the Fokker-Planck equa-tion

d

dtP (Xt ∈ B) =

∫B

∂tp (x; t) dx

=

∫B

{− ∂

∂x[f (xt) p (x; t)] +

∂2

∂x2[D (xt) p (x; t)]

}dx, (C.13)

73

Page 75: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

since this expression holds for any Borel set B ∈ B.The Fokker-Planck equation can be generalized to multivariate diffusion proc-ceses. Specifically, for a d dimensional diffusion Xt = (X1t, . . . , Xdt), the mul-tidimensional Fokker-Planck equation takes the following form:

∂tp (x; t) = −

k∑i=1

∂x(i){fi (x) p (x; t)}+

1

2

k∑i=1

k∑j=1

∂2

∂x(i)∂x(j){Dij (x) p (x; t)} .

(C.14)For an in-depth discussion of the Fokker-Planck equation, see [12] and [46].

74

Page 76: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

References

[1] B.D.Anderson and J.B.Moore (1979). Optimal Filtering. Prentice-Hall,New Jersey.

[2] C.Andrieu, N.De Freitas and A.Doucet (1999). Sequential MCMC forBayesian model selection. Proceedings of the IEEE Workshop on HigherOrder Statistics.

[3] C.Andrieu, A.Doucet and R.Holenstein (2010). Particle chain Monte Carlo.Journal of the Royal Statistical Society, Series B, 72, 269-342.

[4] C.Archambeau, D.Cornford, M.Opper and J.Shawe-Taylor (2007). Gaus-sian Process Approximations of Stochastic Differential Equations. Journalof Machine Learning Research: Workshop and Conference Proceedings,1:1-16.

[5] C.Archambeau and M.Opper (2010). Approximate Inference for continous-time Markov processes, Inference and Learning in Dynamic Models (eds.D.Barber et al), Cambridge University Press, Cambridge.

[6] A.Beskos, O.Papaspiliopoulos, G.O.Roberts and P.Fearnhead (2006). Ex-act and computationally efficient likelihood-based estimation for discretelyobserved diffusion processes. Journal of the Royal Statistical Society, SeriesB, 68, 1–29.

[7] C.M.Bishop (2006). Pattern recognition and Machine learning. Springer-Verlag, New York.

[8] R.J.Boys, D.J.Wilkinson and T.B.L.Kirkwood (2004). Bayesian inferencefor a discretely observed stochastic-kinetic model. Statistics and Comput-ing, 18(2), 125-135.

[9] O.Cappe, S.J.Godsill and E.Moulines (2007). An overview of existing meth-ods and recent advances in sequential Monte Carlo. IEEE Proceedings,95(5):899-924.

[10] Z.Chen (2003). Bayesian Filtering: From Kalman filters to particle filters,and beyond. Technical Report, Adaptive Syst. Lab, McMaster University,Hamilton, ON, Canada..

[11] N.Chopin and E.Varini (2007). Particle filtering for continous-time hiddenMarkov models. ESAIM: Proceedings, Vol.19.

[12] W.T.Coffey, Yu.P.Kalmykov and J.T.Waldron (2004). The Langevin Equa-tion: With Applications to Stochastic Problems in Physics, Chemistry andElectrical Engineering (World Scientific Series in Contemporary ChemicalPhysics Vol. 14), World Scientific, Singapore.

75

Page 77: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

[13] A.P.Dempster, N.M.Laird and D.B.Rubin (1977). Maximum Likelihoodfrom Incomplete Data via the EM Algorithm. Journal of the Royal Sta-tistical Society, Series B, 39 (1), 1–38.

[14] M.A.Dewar, V.Kadirkamanathan, M.Opper and G.Sanguinetti (2010) Pa-rameter estimation and inference for stochastic reaction-diffusion systems:application to morphogenesis in D. melanogaster. BMC Systems Biology ,4. Art no.21.

[15] A.Doucet, N.de Freitas and N.J.Gordon (eds.) (2001). Sequential MonteCarlo Methods in Practice., Springer-Verlag, New York.

[16] A.Doucet, C.Andrieu, S.S.Singh and V.B.Tadic (2004). Particle Methodsfor Change Detection, System Identification, and Control. Proceeding ofthe IEEE, 423-438.

[17] G.Poyadjis, A.Doucet and S.S.Singh (2005). Maximum likelihood parame-ter estimation using particle methods. Proceedings of the Joint StatisticalMeeting.

[18] A.Doucet and A.M.Johansen (2009). Particle filtering and smoothing: Fif-teen years later. Handbook of Nonlinear Filtering (eds. D.Crisan andB.Rozovsky), Cambridge University Press, Cambridge.

[19] A.Doucet, S.J.Godsill and C.Andrieu (2000). On sequential Monte Carlosampling methods for Bayesian filtering. Statistical Computing, 10:197-208.

[20] R.Douc, O.Cappe and E.Moulines (2005). Comparison of resamplingschemes for particle filtering. 4th International Symposium on Image andSignal Processing and Analysis (ISPA).

[21] G.B.Durham and R.A.Gallant (2002). Numerical techniques for maximumlikelihood estimation of continuous-time diffusion processes. Journal ofBusiness and Economic Statistics 20, 279–316.

[22] O.Elerian, S.Chib and N.Shephard (2001). Likelihood inference for dis-cretely observed nonlinear diffusions. Econometrica, 69(4), 959–993.

[23] P.Fearnhead (2008). Computational Methods for Complex Stochastic Sys-tems: A Review of Some Alternatives to MCMC. Statistics and Computing,18 (2), 151-171.

[24] W.R.Gilks and C.Berzuini (2001). Following a moving target—Monte Carloinference for dynamic Bayesian models. Journal Royal Statistical Society,Series B, 63, 127–146.

[25] D.T.Gillespie (1977). Exact stochastic simulation of coupled chemical re-actions. Journal of Physical Chemistry, 81, 2340–2361.

[26] D.T.Gillespie (1992). A rigorous derivation of the chemical master equation,Physica A 188, 404–425.

76

Page 78: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

[27] J.K.Ghosh, M.Delampady and T.Samanta (2009). An Introduction toBayesian Analysis: Theory and Methods. Springer-Verlag, New York.

[28] A.Golightly and D.T.Wilkinson (2005). Bayesian inference for stochastickinetic models using a diffusion approximation. Biometrics 61(3), 781–788.

[29] A.Golightly and D.T.Wilkinson (2010). Markov chain Monte Carlo algo-rithms for SDE parameter estimation. Learning and Inference for Compu-tational Systems Biology (eds. N.D.Lawrence et al), MIT Press, Boston.

[30] A.Golightly (2006). Bayesian Inference for Nonlinear Multivariate DiffusionProcesses. PhD-Thesis, Newcastle University.

[31] R.Holenstein (2009). Particle Markov chain Monte Carlo, PhD-Thesis, Uni-versity of British Columbia, Vancouver.

[32] N.Kantas, A.Doucet, S.S.Singh and J.M.Maciejowski (2009). An overviewof sequential Monte Carlo methods for parameter estimation in generalstate-space models. SYSID 2009, Proceedings IFAC Sytem Identification.

[33] P.E.Kloeden and E.Platen (1992). Numerical Solution of Stochastic Differ-ential Equations, Springer-Verlag, New York.

[34] A.Kong, J.S.Liu and W.H.Wong (1994). Sequential imputations andBayesian missing data problems. Journal of the Americal Statistical As-sociation, 89, 278-288.

[35] A.J.Lotka (1925), Elements of Physical Biology, Williams and Wilkens,Baltimore.

[36] H.H.McAdams and A.Arkin (1999). It’s a noisy business! Genetic regula-tion at the nanomolar scale. Trends in Genetics 15, 65–69.

[37] B.Øksendal (2003). Stochastic Differential Equations: An Introductionwith Applications, Springer-Verlag, New York.

[38] M.Opper and G.Sanguinetti (2007). Variational inference for stochasticreaction processes. Advances in Neural Information Processing Systems(NIPS).

[39] M.K.Pitt and N.Shephard (1999). Filtering via simulation: auxiliary par-ticle filter. Journal of the American Statistical Association, 94, 590-599.

[40] A.Ruttor, G.Sanguinetti and M.Opper (2010). Approximate inference forstochastic reaction systems. Learning and Inference for Computational Sys-tems Biology (eds. N.D.Lawrence et al), MIT Press, Boston.

[41] A.Ruttor and M.Opper (2010). Approximate parameter inference in astochastic reaction-diffusion model. JMLR Workshop and Conference Pro-ceedings Volume 9: AISTATS 2010, 9:669-676.

77

Page 79: Parameter Estimation for Stochastic Reaction Processes ... · in this thesis, are so-called Sequential Monte Carlo (SMC) methods [15, 18, 9]. These methods basically use the sampling

[42] N.Shephard and M.K.Pitt (1997). Likelihood analysis of non-Gaussian mea-surement time series. Biometrika, 84:653–667.

[43] H.Sørensen (2004). Parametric inference for diffusion processes observed atdiscrete points in time. International Statistical Review 72(3), 337–354.

[44] M.A.Tanner (1996). Tools for Statistical Inference: Methods for the Explo-ration of Posterior Distributions and Likelihood Functions. Springer-Verlag,New York.

[45] L.Tierney and J.B.Kadane (1986). Accurate Approximations for PosteriorMoments and Marginal Densities, Journal of the Americal Statistical As-sociation, 81, 82-86.

[46] N.G.van Kampen (1992). Stochastic Processes in Physics and Chemistry,North-Holland.

[47] V.Volterra (1926). Fluctuations in the abundance of a species consideredmathematically. Nature 118, 558–60.

[48] G.C.G.Wei and M.A.Tanner (1990). A Monte Carlo implememtation of theEM algorithm and the poor man’s data augmentation algorithms. Journalof the Americal Statistical Association, 85, 699-704.

[49] D.J.Wilkinson (2003). Discussion to ‘Non centred parameterisations forhierarchical models and data augmentation’ by Papaspiliopoulos, Robertsand Skold. Bayesian Statistics 7, Oxford Science Publications, 323–324.

[50] D.J.Wilkinson (2006). Stochastic modelling for systems biology. Chapmanand Hall/CRC Press, Boca Raton.

[51] Y.F.Wu, E.Myasnikova and J.Reinitz (2007). Master equation simulationanalysis of immunostained Bicoid morphogen gradient. BMC Systems Bi-ology, 1(52).

78