
Single Molecule Data Analysis: An Introduction

Meysam Tavakoli 1, J. Nicholas Taylor 2, Chun-Biu Li 2, Tamiki Komatsuzaki 2, Steve Pressé* 1,3,4

October 20, 2018

1 Physics Department, Indiana University-Purdue University Indianapolis, Indianapolis, IN, 46202, United States of America
2 Research Institute for Electronic Science, Hokkaido University, Kita 20 Nishi 10, Kita-Ku, Sapporo 001-0020, Japan
3 Department of Chemistry and Chemical Biology, Indiana University-Purdue University Indianapolis, Indianapolis, IN, 46202, United States of America
4 Department of Cell and Integrative Physiology, Indiana University School of Medicine, Indianapolis, IN 46202, United States of America

∗ Corresponding author

To appear in Advances of Chemical Physics

Contents

1 Brief Introduction to Data Analysis
2 Frequentist and Bayesian Parametric Approaches: A Brief Review
2.1 Frequentist inference
2.1.1 Maximum likelihood estimation: Applications to hidden and aggregated Markov models
2.2 Bayesian inference
2.2.1 Priors
2.2.2 Uninformative priors
2.2.3 Informative priors
3 Information Theory as a Data Analysis Tool
3.1 Information theory: Introduction to key quantities
3.2 Information theory in model inference
3.3 Maximum Entropy and Bayesian inference
3.4 Applications of MaxEnt: Deconvolution methods
3.4.1 MaxEnt deconvolution: An application to FCS



4 Model Selection
4.1 Brief overview of model selection
4.1.1 Information theoretic and Bayesian model selection
4.2 Information theoretic model selection: The AIC
4.3 Bayesian model selection
4.3.1 Parameter marginalization: Illustration on outliers
4.3.2 The BIC obtained as a sum over models
4.3.3 Illustration of the BIC: Change-point algorithms for a Gaussian process
4.3.4 Shortcomings of the AIC and BIC
4.3.5 Note on finite size effects
4.3.6 Select AIC and BIC applications
4.3.7 Maximum Evidence: Illustration on HMMs
5 An Introduction to Bayesian Nonparametrics
5.1 The Dirichlet process
5.2 The Dirichlet process mixture model
5.3 Dirichlet processes: An application to infinite hidden Markov model
6 Information Theory: State Identification and Clustering
6.1 Rate-Distortion Theory: Brief outline
6.1.1 RDT and data clustering
6.1.2 RDT clustering: Formalism
6.1.3 RDT analysis on sample data
6.1.4 Application of RDT to single molecule time-series
6.2 Variations of RDT: Information-based clustering
6.3 Variations of RDT: The information bottleneck method
6.3.1 Information bottleneck method: Application to dynamical state-space networks
7 Final Thoughts on Data Analysis and Information Theory
7.1 Information theory in experimental design
7.2 Predictive information and model complexity
7.3 The Shore & Johnson axioms
7.4 Repercussions to rejecting MaxEnt
8 Concluding Remarks and the Danger of Over-interpretation
9 Acknowledgements


1 Brief Introduction to Data Analysis

The traditional route to model-building in chemistry and physics – a strategy by which implausible hypotheses are eliminated to arrive at a quantitative framework – has been successfully applied to the realm of biological physics [1]. For instance, polymer models have predicted how DNA's extension depends on externally applied force [2] while thermodynamic models – with deep origin-of-life implications – explain how lipid vesicles trapping long RNA molecules grow at the expense of neighboring vesicles trapping shorter RNA segments in buffer [3].

Another modeling route is the atomistic – molecular dynamics (MD) – approach in which one investigates complex systems by monitoring the evolution of their many degrees of freedom. Novel algorithms along with machine architectures for high-speed MD simulations of biological macromolecules have now even allowed small proteins to be folded into their native state [4].

Both routes have important pros but they also have cons. For instance, potentials in MD are constructed to reproduce behaviors in regimes for which they are parametrized and cannot rigorously treat chemical reactions at the heart of biology.

Furthermore, it is not always clear how simple physics-based models – while intuitive – should be adapted to treat complex biological data [5–7]. For example, diffusion in complex environments – such as telomeres inside mammalian cell nuclei [8]; bacterial chromosomal loci [9] and mRNA inside the bacterial cytoplasm [10]; and viruses inside infected cells [11] – has often been termed "anomalous". This is because "normal" diffusion models often fail in living systems where there is molecular crowding [12–16], biomolecular interactions and binding [5,17–21] and active transport [22–26].

Drastic revolutions in the natural sciences have often been triggered by new observations and new observations, in turn, have presented important modeling challenges. These challenges now include the heterogeneity of data collected at room temperature or in living systems [5,27,28] and the noise that rattles nanoscale systems [29–32].

Necessity is the mother of invention and these new challenges have motivated statistical, data-driven, approaches to model-building in biophysics that explicitly deal with various sources of uncertainty [33–41].

Statistical data-driven analysis methods are the focus of this review. Statistical approaches have been invaluable in generating detailed mechanistic insight into realms previously inaccessible at every step along biology's central dogma [42,43]. They have also unveiled basic molecular mechanisms from noisy single molecule data that have given rise to detailed energy landscape [44–50] and kinetic scheme [31,51–55] models. Furthermore, one's choice of analysis methods can deeply alter the interpretation of experiment [56,57].

We focus this review on parametric as well as more recent information theoretic and non-parametric statistical approaches to biophysical data analysis with an emphasis on single molecule applications. We review simpler parametric approaches starting from an assumed model with unknown parameters. We later expand our discussion to include information theoretic and non-parametric approaches that have broadened our perspective beyond a strict "parametric" requirement that the model be fully specified from the onset [5,6,36,58,59].

These more general methods have, under some assumptions, relaxed important requirements to: know the fluorophore photophysics a priori to count single molecules from superresolution imaging [34]; prespecify the number of states in the analysis of single molecule fluorescence resonance energy transfer (smFRET) time traces [36]; or the number of diffusion components contributing to fluorescence correlation spectroscopy curves [5,60,61]; or the number of diffusion coefficients sampled by cytoplasmic proteins from single protein tracking trajectories [35]. In fact, these efforts bring us closer towards a non-parametric treatment of the data.

As so many model selection [62,63] and statistical methods developed to tackle problems in physics, and later biophysics, have been motivated by Shannon's information [64,65], we take an information theoretic approach whenever helpful. Beyond model selection, topics we will discuss also include parameter estimation, image deconvolution, outliers, change-point detection, clustering and state identification.

In this review, we do not focus on how methods are implemented algorithmically. Rather, we cite the appropriate literature as needed. Neither do we discuss the experimental methods from which the data are drawn. Since our focus is on an introduction to data analysis for single molecules, there are also many topics in data analysis that we do not discuss at all or in much detail (p-values, type I and II errors, point estimates, hypothesis testing, likelihood ratio tests, credible intervals, bootstrapping, Kalman filtering, single particle tracking, localization, feature modeling, aspects of density estimation, etc.).

Our review is intended for anyone, from student to established researcher, who wants to understand what can be accomplished with statistical approaches to modeling and where the field of data analysis in biophysics is headed. For someone just getting started, we place a special emphasis on the logic, strengths and shortcomings of different data analysis frameworks with a focus on very recent approaches.

2 Frequentist and Bayesian Parametric Approaches: A Brief Review

2.1 Frequentist inference

Conceptually, the simplest data-driven approach is parametric and frequentist. By "parametric", we mean a model M is pre-specified and its parameters, θ = {θ_1, θ_2, · · · , θ_K}, are unknown and to be determined from the data, D = {D_1, D_2, · · · , D_N}. By "frequentist", we mean that model parameters are determined exclusively from frequencies of repeated experiments by contrast to being informed by prior information, which we will turn to shortly in our discussion of Bayesian methods.

Model parameters can, in principle, be determined by binning and subsequently fitting histograms, such as histograms of photon arrivals. Fitting histograms is avoided in data analysis in both frequentist and Bayesian methods.

In particular, to avoid selecting an arbitrary histogram bin size and having to collect enough data to build a reliable histogram, model parameters are determined by maximizing the probability, p(D|M), of observing the sequence of outcomes under the assumptions of a model whose parameter values, θ, have yet to be determined. This probability, p(D|M) = p(D|θ), is termed the likelihood which is a central object in frequentist inference.

For multiple independent data points, the likelihood is the product over each independent observation

p(D|θ) = ∏_i p(D_i|θ). (1)

As an example of maximum likelihood estimation, suppose our goal is to estimate a molecular motor's turnover rate r from a single measurement of the number of stepping events, n, in some time interval ∆T. We begin by pre-specifying a model: the probability of observing n events is Poisson distributed. Under these assumptions, our likelihood is

p(D = n|θ = r) = [(r∆T)^n / n!] e^{−r∆T}. (2)

Maximizing this likelihood with respect to r yields the estimator r̂ = n/∆T. That is, it returns the most likely turnover rate under the assumptions of the model.
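To make this concrete, here is a minimal numerical check, assuming invented values for n and ∆T; the numerical maximizer of the Poisson likelihood, Eq. (2), should agree with the analytic estimator n/∆T.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical single measurement: n stepping events in an interval dT (seconds).
n, dT = 12, 3.0

# Negative log of the Poisson likelihood, Eq. (2); the constant log(n!) is dropped.
def neg_log_likelihood(r):
    return -(n * np.log(r * dT) - r * dT)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(result.x, n / dT)  # both should be (close to) 4.0
```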

We can also write likelihoods to explicitly account for correlations in time in a time series (a sequence of data points ordered in time) even for continuous time. For instance, the likelihood – for a series of events occurring at times D = t = {t_1, t_2, · · · , t_N} in continuous time with possible time correlations – is

p(D = t|θ) = p(t_N|t_{N−1}, · · · , t_1, θ) · · · p(t_2|t_1, θ) p(t_1|θ) = ∏_{i=2}^{N} [p(t_i|{t_j}_{j<i}, θ)] p(t_1|θ). (3)


Returning to our molecular motor example, we can also investigate how sharply peaked our likelihood is around r = r̂ to give us an estimate for the variability around r̂. A lower bound on the variance – with var(r̂) ≡ E(r̂²) − (E(r̂))² where "≡" denotes a definition, not an equality, and E denotes an expectation – evaluated at r̂ is given by the inverse of the expectation of the likelihood's curvature

var(r̂) ≥ 1 / [−E(∂²_r log p(n|r))]. (4)

Intuitively, this shows that the variance is inversely proportional to the likelihood's sharpness at its maximum. This inequality is called the Cramér-Rao bound and the denominator of the right hand side is called the Fisher information [66]. A formal proof of this bound follows from the Cauchy-Schwarz inequality [66]. However, informally, the equality of the above bound can be understood as follows. Consider our likelihood as a product of independent observations

p(D|r) = ∏_i p(D_i|r) ≡ e^{N log f(D|r)} (5)

where we have defined f(D|r) through the expression above and f(D|r) is a function that scales like N⁰. We then expand the likelihood around its maximum r = r*

p(D|r) = e^{N log f(D|r)} = e^{N log f(D|r*) + N [(r−r*)²/2!] ∂²_r log f(D|r*) + R} (6)

where R is the remainder. For large enough N, a quadratic expansion of the likelihood, Eq. (6), is a sufficiently good approximation to the exact p(D|r). By this same reasoning, the var(r̂) that we compute using the approximate p(D|r) is a good approximation to the exact var(r̂). Only when the quadratic expansion is exact – and R is zero – do we recover the lower Cramér-Rao bound upon computing the variance. In fact, only in this limit do r* and r̂ coincide.
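As a sanity check on Eq. (4), the sketch below simulates repeated experiments under the Poisson model (with invented values for r and ∆T) and compares the empirical variance of r̂ = n/∆T to the Cramér-Rao bound; for this likelihood the Fisher information works out to ∆T/r, and the bound is saturated.

```python
import numpy as np

rng = np.random.default_rng(0)
r_true, dT = 4.0, 3.0        # hypothetical rate (1/s) and observation interval (s)
n_reps = 100_000             # number of repeated experiments

counts = rng.poisson(r_true * dT, size=n_reps)
r_hat = counts / dT          # maximum likelihood estimate from each experiment

# For the Poisson likelihood, -E(d^2/dr^2 log p(n|r)) = E(n)/r^2 = dT/r,
# so the Cramér-Rao bound of Eq. (4) is r/dT.
cr_bound = r_true / dT
print(r_hat.var(), cr_bound)  # the empirical variance matches the bound
```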

2.1.1 Maximum likelihood estimation: Applications to hidden and aggregated Markov models

While our molecular motor example is simple and conceptual, this frequentist approach, that we now discuss in more detail, has been used to learn kinetic rates of protein folding [67] or protein conformational changes [68].

As a more realistic example, we imagine a noisy two state trajectory with high and low signal. Fig. (1a) is a cartoon of an experimental single molecule force spectroscopy setup that generates the types of time traces – transitions of an RNA hairpin between zipped, unzipped and an intermediate state – shown in Fig. (1b) [57]. The noise level around the signal is high enough that the peaks of the intensity histograms (shown in grey at the extreme right of Fig. (1b)) overlap. Thus, even in the idealized case where there is no drift in the time trace over time, it is not possible to draw straight lines through the time trace of Fig. (1b) in order to establish in what state the system finds itself at any point in time. In fact, looking for crossings of horizontal lines as an indication of a state change would grossly overcount the number of transitions between states.

Markov models:
Time series analysis is an important topic in single molecule inference as is treating noise explicitly on a pathwise basis in single particle tracking data [69,70] or in two-level time traces [31]. Here we will not deal directly with single particle tracking problems [71] nor the extensive literature on Kalman filtering in time series analysis [72].

Rather, we immediately focus on Hidden Markov models, HMMs, commonly used in time series analysis that deal directly with the types of time traces shown in Fig. (1b). Before we tackle noisy time traces, we discuss idealized traces with no noise (such as thermal noise that rattles molecules, effective noise from unresolved motion on fast timescales as well as measurement noise). Our goal here is to extract transition rates between states observable in our noiseless time trace without histogramming data.

Markov models start by assuming that the system can occupy a total of K states and that transitions between these states are fully described by a transition matrix, A, whose matrix elements, a_ij, coincide with the transition probability from state s_i to state s_j, a_ij = p(s_j|s_i).


Figure 1: Single molecule experiments often generate time traces. The goal is to infer models of single molecule behavior from these time traces. a) A cartoon of a single molecule force spectroscopy setup probing transitions between zipped and unzipped states of an RNA hairpin [57]. Change-point algorithms, that we later discuss, were used in b) to determine when the signal suddenly changes (red line). The signal indicates the changes in the conformation of the RNA hairpin obscured by noise. Clustering algorithms, also discussed later, were then used to regroup the "denoised" intensity levels (red line) into distinct states (blue line).

Given idealized (noiseless) time traces for now, our goal is to find the model parameters, θ, which consist of all transition matrix elements as well as all initial state probabilities. That is, p(s_j|s_i) for all pairs of states i and j and p(s_i) for all i.

To obtain these parameters, we define a likelihood, for each trajectory, of having observed a definite state sequence D = {s_1, s_2, ..., s_N} – where {s_1, s_2, ..., s_N} here are numbers that serve as labels for states – in discrete time

L(θ|D) ≡ p(s_1, s_2, ..., s_N|θ) = ∏_{i=2}^{N} [p(s_i|s_{i−1})] p(s_1). (7)

We have written the transition probability from time t to t + δt as p(s_{t+δt}|s_t) where s_t is the state the system finds itself in at time t. We subsequently maximize these likelihoods with respect to all unknown parameters, θ. We add that we need more than one trajectory to estimate the initial state probabilities.

Finally, while the likelihood is the probability of the data given the model, the likelihood is typically maximized with respect to its parameters and its parameters are treated as its variables. For this reason we write L(θ|D) not L(D|θ).
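For noiseless trajectories, maximizing Eq. (7) has a closed form: each transition probability is the number of observed i → j transitions normalized by the total number of departures from state i. A minimal sketch (the function name and toy trajectories are ours):

```python
import numpy as np

def markov_mle(trajectories, K):
    """Maximize Eq. (7): MLE of the transition matrix and initial-state
    probabilities from noiseless state sequences labeled 0..K-1."""
    counts = np.zeros((K, K))
    initial = np.zeros(K)
    for traj in trajectories:
        initial[traj[0]] += 1
        for s_prev, s_next in zip(traj[:-1], traj[1:]):
            counts[s_prev, s_next] += 1
    A = counts / counts.sum(axis=1, keepdims=True)  # row-normalize the counts
    return A, initial / initial.sum()

# Two toy trajectories over K = 2 states; more trajectories pin down p(s_1).
A, p0 = markov_mle([[0, 0, 1, 1, 0], [1, 0, 0, 1, 1]], K=2)
print(A, p0)
```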

Hidden Markov models:
While Eq. (7) is used for theoretical illustrations [38,73,74], it must be augmented to treat noise for real data analysis applications. These resulting models, HMMs [31,75,76], have been broadly used in single molecule analysis including smFRET studies [47,77–81] and force spectroscopy [68].

In HMMs, the state of the signal in time, termed the state of the "latent" or hidden variable, is provided indirectly through a sequence of observations D = y = {y_1, y_2, ..., y_N}. Often this relation is captured by the probability of making the observation y_i given that the system is in state s_i, p(y_i|s_i). We can use a distribution over observations of the form p(y_i|s_i) under the assumption that the noise is uncorrelated in time and that the observable only depends on the state of the underlying system at that time. A Gaussian form for p(y_i|s_i) – following from the central limit theorem or as an approximation to the Poisson distribution – is common [31,77].

The HMM model parameters, θ, now include, as before, all transition matrix elements as well as all initial state probabilities. In addition, θ also includes the parameters used to describe p(y_i|s_i). For example, for a Gaussian distribution over observations, where µ_k and σ_k designate the mean and variance of the signal for the system in state k at time point i, we have

p(y_i|s_i = k) ∝ e^{−(y_i−µ_k)² / (2σ_k²)} (8)

where the additional parameters include the means and variances for each state. For this reason, to be clear, we may have chosen to make our probability over observations depend explicitly on θ, p(y_i|s_i, θ).

In discrete time, where i denotes the time index, the likelihood used for HMMs is

L(θ|D) = p(y|θ) = ∑_s p(y, s|θ) = ∑_s ∏_{i=2}^{N} [p(y_i|s_i) p(s_i|s_{i−1})] p(y_1|s_1) p(s_1) (9)

where s = {s_1, · · · , s_N}. Unlike in Eq. (7), in Eq. (9), the s_i are bolded as they are variables not numbers. We sum over these state variables since we are not, in general, interested in knowing the full distribution p(y, s|θ). Rather we are only interested in obtaining the marginal likelihood p(y|θ) describing the probability of the observation given the model irrespective of what state the system occupied at each point in time.

Put differently, while the system only occupies a definite state at any given time point, we are unaware of what state the system is in. And, for this reason, we must sum over all possibilities.

An alternative way to represent the HMM is to say

s_1 ∼ p(s_1)
s_i|s_{i−1} ∼ p(s_i|s_{i−1})
y_i|s_i, θ ∼ p(y_i|s_i, θ). (10)

That is, s_1 – a number, i.e. a realization of s_1 – is sampled from p(s_1). Then for any i > 1, s_i|s_{i−1} – a realization of s_i conditioned on s_{i−1} – is sampled from the conditional p(s_i|s_{i−1}) while its observation y_i|s_i, θ is sampled from p(y_i|s_i, θ).

The goal is now to maximize the likelihood, Eq. (9), over each parameter θ. There is a broad literature describing multiple strategies available to numerically evaluate and maximize the likelihood functions generated from HMMs (as well as AMMs described in the next section) [82] including the Viterbi algorithm [77,83], and, most often used, forward-backward algorithms and expectation maximization [75,84].
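To illustrate how the sum over hidden state sequences in Eq. (9) is evaluated in practice, here is a minimal forward-algorithm sketch with Gaussian emissions, Eq. (8); the two-state parameters below are invented for illustration.

```python
import numpy as np

def hmm_log_likelihood(y, A, p0, mu, sigma):
    """Log of Eq. (9) computed recursively (forward algorithm) rather than by
    brute-force enumeration of all K^N hidden state sequences."""
    N = len(y)
    # Gaussian emission probabilities p(y_i | s_i = k) for all i and k, Eq. (8).
    emit = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    alpha = p0 * emit[0]                   # forward variable at the first time point
    log_like = 0.0
    for i in range(1, N):
        c = alpha.sum()                    # rescale to avoid numerical underflow
        log_like += np.log(c)
        alpha = (alpha / c) @ A * emit[i]  # propagate, then absorb the observation
    return log_like + np.log(alpha.sum())

# Invented two-state example with overlapping Gaussian emissions.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
y = np.array([0.1, 0.2, 1.1, 0.9, 0.0])
print(hmm_log_likelihood(y, A, p0=np.array([0.5, 0.5]),
                         mu=np.array([0.0, 1.0]), sigma=np.array([0.3, 0.3])))
```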

Aggregated Markov models:
Aggregated Markov models (AMMs) [85] can be thought of as a special case of HMMs in which many states of the latent variable have identical output.

AMMs were popularized in biophysics in the analysis of single ion-channel patch clamp experiments [85–88] since, often, two or more distinct inter-converting molecular states of an ion channel may not be experimentally distinguishable. For example, both states may carry current.

Microscopic states that cannot be distinguished experimentally form an "aggregate of states". In its simplest formulation, AMMs describe transitions between two aggregates (such as the open and closed aggregates of states). Each aggregate is composed of multiple, possibly interconverting, microscopic states that cannot be directly observed. Instead, each aggregate of states belongs to an "observability class". For instance, one can say that a particular microscopic state belongs to the "open observability class" for an ion channel or the "dark observability class" for a fluorophore.

AMMs are relevant beyond ion channels. In smFRET, a low FRET state (the "low fluorescence observability class") could arise from photophysical properties of the fluorophores or an internal state of the labeled protein [78]. In fact, most recently, AMMs have been used to address the single molecule counting problem using superresolution imaging data [34].


For simplicity, consider a rate matrix, Q, containing only two observability classes, 1 and 2,

Q = [ Q_11  Q_12
      Q_21  Q_22 ]. (11)

The submatrices Q_ij are populated by matrix elements, indexed kℓ say, describing the transition rates from state k in observability class i to state ℓ in observability class j.

The logic from this point forward is identical to the logic of the previous section on HMMs. We must write down a likelihood and subsequently maximize this likelihood with respect to the model parameters. Ignoring noise, the likelihood of observing the sequence of observability classes D = {a_1, a_2, ..., a_N} in continuous time is [89]

L(θ|D) = 1^T · ∏_{j=1}^{N−1} G_{a_j a_{j+1}}(t_j) · π_{a_1} (12)

where the ith element of the column vector, π_{a_1}, denotes the initial probability of being in state i from the a_1 observability class and where

G_{ab}(t_j) = Q_{ab} e^{Q_{aa} t_j}. (13)

In other words, the kℓth element of G_{ab}(t_j) is the probability that you enter the kth state of observability class a, dwell there for time t_j and subsequently transition to the ℓth state of observability class b. The row vector, 1^T, in Eq. (12) is used as a mathematical device to sum over all final microscopic states of the observability class, a_N, observed at the last time point. We do so because we only know in which final observability class we are at the Nth measurement, not which microscopic state of the system we are in.

The parameters, θ, here include transitions between all microscopic states across all observability classes as well as initial probabilities for each state within each observability class. Since the number of parameters exceeds the number of observability classes in AMMs, AMMs often yield underdetermined problems [90].

The AMM treatment above can be generalized to include noise or treated in discrete time [91,92]. Both AMMs and HMMs can also be generalized to include the possibility of missed transitions [34,93]. Missed transitions arise in real applications when a system in some state (in HMMs) or observability class (in AMMs), say k, undergoes rapid transitions – for example rapid as compared to t_d, the camera's data acquisition time – to another state, say ℓ. Then the real transition probability in state k must account for all possible missed transitions to ℓ and recoveries back to k that could have occurred within t_d. We account for these missed transitions by resumming over all possible events that could have occurred within the interval t_d. The technical details are described in Refs. [34,93].
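Eq. (12) can be evaluated numerically with matrix exponentials. Below is a minimal sketch for noiseless continuous-time data (the function, the block bookkeeping and the toy rate matrix are ours; we adopt a row-vector convention so probabilities propagate as v ← v e^{Q_aa t_j} Q_ab, with a running rescaling standing in for the final 1^T sum):

```python
import numpy as np
from scipy.linalg import expm

def amm_log_likelihood(classes, dwell_times, Q, blocks, pi0):
    """Log of Eq. (12): classes is the observed sequence of observability
    classes, dwell_times the time spent before each transition, and
    blocks[a] lists the microscopic states belonging to class a."""
    v = np.asarray(pi0, dtype=float)  # initial probabilities within classes[0]
    log_like = 0.0
    for j in range(len(classes) - 1):
        a, b = classes[j], classes[j + 1]
        Qaa = Q[np.ix_(blocks[a], blocks[a])]
        Qab = Q[np.ix_(blocks[a], blocks[b])]
        v = v @ expm(Qaa * dwell_times[j]) @ Qab  # dwell in class a, then jump to b
        c = v.sum()
        log_like += np.log(c)
        v /= c
    return log_like

# Toy rate matrix: 3 microscopic states, class 0 = {0, 1}, class 1 = {2}.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.5, -1.5, 1.0],
              [1.0, 2.0, -3.0]])
blocks = {0: [0, 1], 1: [2]}
print(amm_log_likelihood([0, 1, 0], [0.4, 0.2], Q, blocks, pi0=[0.6, 0.4]))
```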

2.2 Bayesian inference

Frequentist inference yields model parameter estimates – like r̂ we saw earlier which are called "point estimates" – and error bounds. Just as we've treated the data in the previous section as random variables – that is, realizations of an experiment – and model parameters as fixed quantities to be determined, Bayesian analysis treats both the data as well as model parameters as random variables [94]. For the same amount of data that is used in frequentist inference, Bayesian methods return parameter distributions whose usefulness is contingent on the choice of likelihood and prior, which we describe shortly.

Bayesian methods are now widely used across biophysical data analysis [35,36,95–99]. For instance, they have been used to infer models describing how mRNA-protein complexes transition between active transport and Brownian motion [96].

Of central importance in Bayesian analysis is the posterior, p(M|D): the conditional probability over models M given observations, D, i.e. the probability of the model after observations have been made. Since there may be many (choices of) models, we have bolded the model variable, M. By contrast, the probability over M before observations are made, p(M), is called a prior.

We construct the posterior from Bayes' rule (or theorem) using the likelihood and the prior as inputs. That is, we set

p(D, M) = p(M, D)
p(M|D) p(D) = p(D|M) p(M)
p(M|D) = p(D|M) p(M) / p(D) (14)

where p(D) is obtained by normalization from

p(D) = ∫ dM p(D, M) = ∫ dM p(D|M) p(M) (15)

and p(D, M) is called the joint probability of the model and the data. Typically, in parametric Bayesian inference, when there is a single model, the integration in Eq. (15) is meant as an integration over the model's parameters. However there are cases where many parametric models M are considered and the integration is interpreted as a sum over models (if the models are discrete) and a subsequent integration over their associated parameters (if the parameters are continuous).

Furthermore, we can marginalize (integrate over) posteriors to describe the posterior probability of a particular model, say M_ℓ, from the broader set of models M irrespective of its associated parameter values (θ_ℓ)

p(M_ℓ|D) ∝ ∫ dθ_ℓ p(D|M_ℓ, θ_ℓ) p(θ_ℓ|M_ℓ) p(M_ℓ). (16)

Here we make it a point to distinguish a model from its parameters, while earlier M re-grouped both models and their parameters. Furthermore, to be clear, we note that the following notations are equivalent

∫ dθ_ℓ ↔ ∫ d^K θ ↔ ∫ ∏_{k=1}^{K} dθ_k (17)

where K designates the total number of parameters, θ. The quantity p(M_ℓ|D) can then be used to compare different models head-to-head. For instance, in single particle tracking, we may be interested in computing the posterior probability that a particle's mean square displacement arises from one of many models of transport (Brownian motion versus directed motion) irrespective of any value assigned to parameters such as the diffusion coefficient [97].
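To make Eq. (16) concrete, the sketch below computes the evidence p(D|M) for a toy head-to-head comparison by integrating a Poisson likelihood over a flat prior on its rate; the data, prior range and rival fixed-rate model are all invented for illustration.

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical count data to be explained by two competing models.
D = np.array([4, 6, 5, 7, 3])

# Model 1: Poisson with an unknown rate and a flat prior on [0.01, 20].
r = np.linspace(0.01, 20, 2000)
dr = r[1] - r[0]
prior = np.full_like(r, 1.0 / (r[-1] - r[0]))
likelihood = np.array([poisson.pmf(D, mu).prod() for mu in r])
evidence_M1 = np.sum(likelihood * prior) * dr  # marginalize the rate, Eq. (16)

# Model 2: Poisson with the rate fixed at 1 (no parameter to marginalize).
evidence_M2 = poisson.pmf(D, 1.0).prod()

print(evidence_M1, evidence_M2)  # head-to-head comparison of p(D|M)
```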

2.2.1 Priors

As the number of observations, N, grows, the likelihood determines the shape of the posterior and the choice of likelihood becomes critical as we will illustrate shortly in Fig. (2). By the central limit theorem, for sufficiently independent observations, the likelihood function's breadth will narrow with respect to its mean as N^{−1/2}. Provided abundant data, more attention should be focused on selecting an appropriate likelihood function than selecting a prior.

However, if provided with insufficient data, our choice of prior may deeply influence the posterior. This is perhaps best illustrated with the extreme example of the canonical distribution in classical statistical physics where posterior distributions – over Avogadro's number of particle positions and velocities – are constructed from just one data point (total average energy with vanishingly small error) [38,100,101]. That is, the error bar is below the resolution limit of the experiment on a macroscopic system.

While the situation is not quite as extreme in biophysics, data may still be quite limited. For instance, single particle (protein) tracks may be short because protein labels photobleach or particles move in and out of focus [35] or the kinetics into and out of intermediate states may be difficult to quantify in single molecule force spectroscopy for rarely visited conformational states [57].

A good choice of prior is therefore also important. There are two types of priors: informative and uninformative [94,102].

2.2.2 Uninformative priors

The simplest uninformative prior – inspired from Laplace's principle of insufficient reason when the set of hypotheses are complete and mutually exclusive, such as with dice rolls – is the flat, uniform, distribution. Under the assumption that p(M) is constant, or flat, over some range, the posterior and likelihood are directly related

p(M|D) ∝ p(D|M)p(M) ∝ p(D|M). (18)

That is, their dependence on M is identical. However, a flat prior over a model parameter, say θ, is not quite as uninformative as it may appear [94] as a coordinate transformation to the alternate variable, say e^θ, reveals that we suddenly know more about the random variable e^θ than we did about θ since its distribution is no longer flat. Conversely, if e^θ is uniform on the interval [0, 1], then θ becomes more concentrated at the upper boundary, 1.

The problem stems from the fact that, under coordinate transformation, if the variable θ's range is from [0, 1], then e^θ's range is from [1, e]. To resolve this problem, we can use the Jeffreys prior [103–105] which is invariant under reparametrization of a continuous variable.

The Jeffreys prior, as well as other uninformative priors, are widely used tools across the biophysical literature [106–109]. As we will discuss shortly – as well as in detail in the last section – the Shannon entropy itself can be thought of as an uninformative prior (technically the logarithm of a prior) over probability distributions [38,110,111]. This prior is used in the analysis of data originating from a number of techniques including fluorescence correlation spectroscopy (FCS) [5,60,61], electron spin resonance (ESR) [112], fluorescence resonance energy transfer (FRET) and bulk fluorescence [113–115].

2.2.3 Informative priors

One choice of informative prior is suggested by Bayes' theorem that is used to update priors to posteriors. Briefly, we see that when additional independent data are incorporated into a posterior, the new posterior p(M|D_2, D_1) is obtained from the old posterior, p(M|D_1), and the likelihood as follows

p(M|D2, D1) ∝ p(D2|M)p(M|D1). (19)

In this way, the old posterior, p(M|D_1), plays the role of the prior for the new posterior, p(M|D_2, D_1). If we ask – on the basis of mathematical simplicity – that all future posteriors adopt the same mathematical form, then our choice of prior is settled: this prior – called a conjugate prior – must yield a posterior of the same mathematical form as the prior when multiplied by its corresponding likelihood. The likelihood, in turn, is dictated by the choice of experiment.

Priors, say p(M|γ), may depend on additional parameters, γ, called hyperparameters, distinct from the model parameters θ. These hyperparameters, in turn, can also be distributed, p(γ|η), thereby establishing a parameter hierarchy. For instance, an observable (say the FRET intensity) can depend on the state of a protein which depends on transition rates to that state (a model parameter) which, in turn, depends on prior parameters determining how transition rates are assumed to be a priori distributed (hyperparameter). We will see examples of such hierarchies in the context of later discussions on infinite Hidden Markov Models.

As a final note, before turning to an example of conjugacy, to avoid committing to specific arbitrary values for hyperparameters we may assume they are distributed according to a (hyper)prior and integrate over the hyperparameters in order to obtain p(M|D) from p(M|D, γ, η, ...).



Figure 2: The posterior probability sharpens as more data are accumulated. Here we sampled data according to a Poisson distribution with λ = 5 (designated by the dotted line). Our samples were D = {2, 8, 5, 3, 5, 2, 5, 10, 6, 4}. We plotted the prior (Eq. (20) with α = 2, β = 1/7) and the resulting posterior after collecting N = 1, then N = 5 and N = 10 points.

Now, we illustrate the concept of conjugacy by returning to our earlier molecular motor example. The prior conjugate to the Poisson distribution with parameter λ – Eq. (2) where λ is r∆T – is the Gamma distribution

Gamma(α, β) = p(λ = r∆T|α, β) = [β^α / Γ(α)] λ^{α−1} e^{−βλ} (20)

which contains two hyperparameters, α and β. After a single observation – of n_1 events in time ∆T – the posterior is

p(λ|n_1, α, β) = Gamma(n_1 + α, 1 + β) (21)

while, after N independent measurements, with D = {n_1, · · · , n_N}, we have

p(λ|D, α, β) = Gamma(∑_{i=1}^{N} n_i + α, N + β). (22)

Fig. (2) illustrates how the posterior is dominated by the likelihood provided sufficient data and how an arbitrary choice for the hyperparameters becomes less important for large enough N.
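The sharpening in Fig. (2) is simple to reproduce with the conjugate update of Eq. (22); a minimal sketch using the data and hyperparameters quoted in the caption:

```python
import numpy as np
from scipy.stats import gamma

# Data and hyperparameters from the caption of Fig. (2).
D = [2, 8, 5, 3, 5, 2, 5, 10, 6, 4]
alpha, beta = 2, 1 / 7

lam = np.linspace(0.01, 20, 2000)
for N in [0, 1, 5, 10]:
    a_post = alpha + sum(D[:N])  # the shape accumulates the counts, Eq. (22)
    b_post = beta + N            # the rate accumulates the number of measurements
    posterior = gamma.pdf(lam, a=a_post, scale=1 / b_post)
    print(N, lam[posterior.argmax()])  # the posterior mode sharpens toward 5
```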

Single molecule photobleaching provides yet another illustrative example [116]. Here we consider the probability that a molecule has an inactive fluorophore (one that never turns on) which, in itself, is a problem towards achieving quantitative superresolution imaging [117]. We define θ as the probability that a fluorophore is active (and detected). We, correspondingly, let 1 − θ be the probability that the fluorophore never turns on. The probability that y of n total molecules in a complex turns on is then binomially distributed

p(y|θ) = [n! / ((n−y)! y!)] θ^y (1−θ)^{n−y}. (23)

Over multiple measurements (multiple complexes each having n total molecules), y, we obtain the following likelihood

p(y|θ) ∝ ∏_i [n! / ((n−y_i)! y_i!)] θ^{y_i} (1−θ)^{n−y_i}. (24)

One choice for p(θ) is the Beta distribution, a conjugate prior to the binomial,

p(θ) = [(a+b−1)! / ((a−1)!(b−1)!)] θ^{a−1} (1−θ)^{b−1}. (25)


By construction (i.e. by conjugacy), our posterior now takes the form of the Beta distribution

p(θ|y) ∝ θ^{∑_i y_i + a − 1} (1−θ)^{∑_i (n−y_i) + b − 1}. (26)

Given these data, the estimated mean, θ̂, obtained from the posterior is now:

θ̂ = (∑_i y_i + a) / (∑_i n + a + b) = (∑_i y_i) / (∑_i n + a + b) + a / (∑_i n + a + b) (27)

which is, perhaps unsurprisingly, a weighted sum over the prior expectation and the actual data.
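A minimal numerical sketch of this Beta-Binomial update (the activation probability, number of molecules per complex, number of complexes and hyperparameters below are all invented):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, n = 0.7, 4                    # hypothetical activation probability; molecules per complex
y = rng.binomial(n, theta_true, size=50)  # active counts per complex, Eq. (23)

a, b = 2, 2                               # Beta prior hyperparameters, Eq. (25)
theta_hat = (y.sum() + a) / (len(y) * n + a + b)  # posterior mean, Eq. (27)
print(theta_hat)                          # blends the prior mean a/(a+b) with the data
```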

Conjugate priors do have obvious mathematical appeal and yield analytically tractable forms for posteriors but they are more restrictive. Numerical methods to sample posteriors – including Gibbs sampling and related Markov chain Monte Carlo methods [118,119] – continue to be used [116] and developed [120] for biophysical problems and have somewhat reduced the historical analytical advantage of conjugate priors. However, the tractability of conjugate priors has turned out to be a major advantage for more complex inference problems – such as those involving Dirichlet processes – that we will discuss later.

3 Information Theory as a Data Analysis Tool

3.1 Information theory: Introduction to key quantities

In 1948 Shannon [64] formulated a quantitative measure of uncertainty of a distribution, p(x), later called the Shannon entropy

H(x) = −∑_{i=1}^{K} p(x_i) log p(x_i) (28)

where x_i, the observable, takes on K discrete numerical values {x_1, x_2, · · · , x_K} such as the intensity levels observed from a single molecule time trace. Often, −H(x) is called the Shannon information.

The Shannon entropy is an exact, non-perturbative, formula whose mathematical form, Eq. (28), has also been argued using the large sample limit of the multinomial distribution and Poisson distribution (as we will show later) [38]. However, Shannon made no such approximations and derived H(x) from a simple set of axioms that a reasonable measure of uncertainty must satisfy [64].

The Shannon entropy behaves as we expect an uncertainty to behave. That is, informally, when all probabilities are uniform, p(x_i) = 1/K for any x_i, then H(x) is at its maximum, H(x) = log K. In fact, as K increases, so does the uncertainty, again as we would expect. Conversely, when all probabilities, save one, are zero, then H(x) is at its minimum, H(x) = 0. On a more technical note, Shannon also stipulated that the uncertainty of a probability distribution must satisfy the "composition property" which quantifies how uncertainties should add if outcomes, indexed i, are arbitrarily regrouped [38,64].
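These limiting behaviors are quick to verify numerically; a small sketch (the helper function is ours):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy, Eq. (28), in nats; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

K = 4
print(shannon_entropy(np.full(K, 1 / K)), np.log(K))  # uniform case: the maximum, log K
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))          # a certain outcome: zero
```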

Later formalizations due to Shore and Johnson (SJ) [121] have independently arrived at precisely the same form as H(x) (or any function monotonic with H(x)). SJ's work is closer in spirit to Bayesian methods [110,111,122–125]. We refer the reader to the last section of this review for a simplified version of SJ's derivation.

While we have so far dealt with distributions depending only on a single variable, the Shannon entropy can also deal with joint probability distributions as follows

H(x, y) = −∑_{i=1}^{K_x} ∑_{j=1}^{K_y} p(x_i, y_j) log p(x_i, y_j). (29)

If both observables are statistically independent – that is, if p(x_i, y_j) = p(x_i)p(y_j) – then H(x, y) is the sum of the Shannon entropies of each observable, i.e. H(x, y) = −∑_{i,j} p(x_i)p(y_j) log p(x_i)p(y_j) = −∑_i p(x_i) log p(x_i) − ∑_j p(y_j) log p(y_j) = H(x) + H(y). This property is called "additivity".

Figure 3: Venn diagram depicting different information quantities and their relationship. The value of each entropy is represented by the enclosed area of different regions. H(x) and H(y) are both complete circles.

On the other hand, if the two observables are statistically dependent – that is, if p(x_i, y_j) ≠ p(x_i)p(y_j) – then we can decompose the Shannon entropy as follows

H(x, y) = −∑_{i,j} p(x_i, y_j) log (p(x_i|y_j) p(y_j)) = H(y) + H(x|y) (30)

where p(x_i|y_j) = p(x_i, y_j)/p(y_j) is the conditional probability and H(x|y) ≡ −∑_{i,j} p(x_i, y_j) log p(x_i|y_j). H(x|y) is called the conditional entropy that measures the uncertainty in knowing the outcome of x if the value of y is known. In fact, this interpretation follows from Eq. (30): the total uncertainty in predicting the outcomes of both x and y, H(x, y), follows from the uncertainty to predict y, given by H(y), and the uncertainty to predict x after y is known, H(x|y).

Fig. (3) illustrates the relationship between these entropies and, in particular, provides a conceptual picture for the relation H(x, y) = H(y) + H(x|y) = H(x) + H(y|x).

The Venn diagram provides us with another important quantity, the mutual information I(x, y), which corresponds to the intersecting area between H(x) and H(y). From Fig. (3), we can read out the following form of I(x, y) and its relation to other entropies

I(x, y) = H(x) + H(y) − H(x, y)
= H(y) − H(y|x) = H(x) − H(x|y)
= ∑_{i,j} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]. (31)

The second line of Eq. (31) provides an intuitive meaning for I(x, y) as the amount of uncertainty reduction that knowledge of either observable provides about the other. In other words, it is interpreted as the information shared by the two observables x and y. From the last line of Eq. (31), we see that I(x, y) = 0 if and only if x and y are statistically independent, p(x_i, y_j) = p(x_i)p(y_j), for all x_i and y_j.

Now that we have defined the mutual information, we introduce the Kullback-Leibler (KL) divergence (or relative entropy) – a generalization of the mutual information – defined as [126,127]

DKL[p(x)‖p(y)] = ∑_i p(xi) log [p(xi)/p(yi)]. (32)

In Eq. (32), the probability distributions are distributions over single variables for simplicity only. The KL divergence vanishes if and only if p(x) = p(y) but otherwise DKL[p(x)‖p(y)] ≥ 0. We will see this quantity appear in our model selection section, interpreted as a measure of dissimilarity in information content between p(x) and p(y). It is also interpreted as a pseudo-distance between p(x) and p(y) [128,129], though it is not generally symmetric with respect to its arguments, DKL[p(x)‖p(y)] ≠ DKL[p(y)‖p(x)].
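For concreteness, the sketch below evaluates Eq. (32) for two arbitrary distributions and exhibits both the non-negativity and the asymmetry:

```python
import numpy as np

def dkl(p, q):
    """D_KL[p || q] in nats; assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(dkl(p, q), dkl(q, p))   # both nonnegative but unequal: not a true distance
print(dkl(p, p))              # vanishes iff the two distributions coincide
```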

Finally, we note that the Shannon entropy defined as a measure of uncertainty is different from the entropy discussed in thermodynamics and statistical mechanics. Confounding these concepts has led to important misconceptions [130] and we discuss important differences between the Shannon and thermodynamic entropy in the last section of this review.

3.2 Information theory in model inference

Feynman often remarked that Young's double-slit experiment and its conceptual implications captured the essence of all of quantum mechanics [131]. The following thought experiment captures key aspects of information theory [122,132].

An information theorist once noted that a third of kangaroos have blue eyes (B) and a quarter are left-handed (L). Thus three quarters are right-handed (R) and two-thirds are not blue-eyed (N).

The information theorist was then asked to compute the joint probability that kangaroos simultaneously be blue-eyed and right-handed (pBR). But this is an underdetermined problem. In fact, we have four unknown probabilities (pBR, pBL, pNR, pNL) but only three constraints (i.e. constraints on blue eyes and left-handedness, but also a constraint on the normalization of probabilities). Thus, from these limited constraints and some algebra, we find that pBR can take on any value from 1/12 to 1/3. Even for this simple example, there are an infinite number of acceptable models (i.e. probabilities) lying within this range.

To resolve this apparent ambiguity, the information theorist recommended that zero correlations between eye color and handedness be assumed a priori since none are otherwise provided by the data. The only model that now satisfies this condition, and falls within the previous range, is pBR = pB pR = (1/3) × (3/4) = 1/4.

Interestingly, this solution could equally well have been obtained by minimizing the Shannon information introduced in the previous section – or equivalently maximizing the Shannon entropy (H = −∑i pi log pi) – under the three constraints of the problem imposed using Lagrange multipliers, where the index i labels each discrete outcome. This short illustration highlights many important ideas relevant to biophysical data analysis that we will use later and touches upon this critical point: information theory provides a principled recipe for incorporating data – even absent data – in model inference.
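The kangaroo problem is small enough to check by brute force. Parametrizing the admissible models by t = pBR – so that pBL = 1/3 − t, pNR = 3/4 − t and pNL = t − 1/12 – a scan over the allowed range confirms that the Shannon entropy peaks at t = 1/4:

```python
import numpy as np

t = np.linspace(1/12, 1/3, 100001)[1:-1]        # interior of the admissible range
p = np.array([t, 1/3 - t, 3/4 - t, t - 1/12])   # (p_BR, p_BL, p_NR, p_NL) for each t
H = -np.sum(p * np.log(p), axis=0)              # Shannon entropy of each candidate model
print(t[np.argmax(H)])                          # -> 0.25, i.e. p_BR = p_B * p_R
```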

For our kangaroo example, information theory provides a recipe by which all data – both present and absent – contributed to the model-building process. In fact, by saying nothing about the structure of correlations between eye color and handedness – what we are calling “absent data” – we say something: all but one value for pBR would have imposed correlations between the model variables.

In our kangaroo example, unmeasured correlations were set to zero as a prior assumption to obtain pBR = 1/4. This assumption on absent data is built into the process of model inference by maximizing the Shannon entropy (a process called MaxEnt). The details of this reasoning follow from the SJ axioms discussed in the last section of the review. But informally, for now, we say that MaxEnt only inserts correlations that are contained in the data (through the constraints) and assumes no others.

While the kangaroo example is conceptual, here is an example relevant to biophysics: master equations – the evolution equations describing the dynamics of state occupation probabilities – also rigorously follow from maximizing the Shannon entropy over single molecule trajectory probabilities by assuming structure of absent data. While the mathematics are detailed elsewhere [73] and reviewed in a broader context in Ref. [38], the key idea is simple. When we write a master equation with rates (transition probabilities per unit time) from state to state, we presuppose that the rates themselves – that is, conditional probabilities of hopping from one state to another – are time-independent. In order for MaxEnt to arrive at time-independent rates, it must therefore have information on future, as of yet unobserved, transitions which should exhibit no time dependence.

Once these basic constraints on the absent (future) data are incorporated in MaxEnt, master equations follow as a natural consequence [73]. The master equations still only follow if the data provided on transition probabilities have no spatial dependence and thus the system is, at least locally, well-stirred. Otherwise, we may need a more detailed model with, for example, spatially dependent forces [6].
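As a small illustration of the end product (the master equation itself, not its MaxEnt derivation, which is detailed in Ref. [73]), the sketch below propagates a hypothetical two-state master equation with time-independent rates; the rate values are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical two-state system; k12 and k21 are arbitrary, time-independent
# transition probabilities per unit time (the "rates" of the master equation).
k12, k21 = 2.0, 0.5
Q = np.array([[-k12,  k21],
              [ k12, -k21]])        # generator matrix; each column sums to zero

p0 = np.array([1.0, 0.0])           # all probability initially in state 1
for t in (0.0, 0.5, 2.0, 10.0):
    print(t, expm(Q * t) @ p0)      # relaxes to (k21, k12)/(k12 + k21) = (0.2, 0.8)
```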


As we will discuss, the structure imposed on absent data [5] is related to the concept of priors in Bayesian analysis [133] and to model complexity penalties [62,63], which we will review in later sections.

3.3 Maximum Entropy and Bayesian inference

Maximum entropy (MaxEnt) is a recipe to infer probability distributions from the data. The probability distributions inferred coincide with the maximum of an objective function.

Historically, in its simplest realization, Jaynes [38,100,101] used Shannon's entropy to infer the most probable distribution of equilibrium classical degrees of freedom (positions and momenta). He did so by asking which distribution, {pi}, maximized H given constraints (imposed using Lagrange multipliers) on normalization, ∑i pi = 1, and the average energy. Mathematically, Jaynes maximized the following objective function with respect to the {pi} and the Lagrange multipliers

−∑i pi log pi − ∑j λj (∑i aij pi − āj) (33)

where the λj are the Lagrange multipliers for the jth constraint, and āj is the measured average of the quantity aj which, for outcome i, takes on the value aij. For example, the average dice roll would be constrained as (∑_{i=1}^{6} i pi − 2.7), assuming the average roll happens to be 2.7. Furthermore, if the jth constraint is normalization, then it would be imposed by setting aij = āj = 1 for that constraint.
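For the dice example, maximizing Eq. (33) yields the familiar exponential form pi ∝ e^{−λi}, with the multiplier λ fixed by the average-roll constraint. A minimal numerical solution (root-finding on the single Lagrange multiplier) might look as follows:

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_roll(lam):
    w = np.exp(-lam * faces)        # MaxEnt form p_i ~ exp(-lam * i)
    return np.sum(faces * w) / np.sum(w)

lam = brentq(lambda l: mean_roll(l) - 2.7, -5.0, 5.0)   # enforce <i> = 2.7
p = np.exp(-lam * faces)
p /= p.sum()
print(p)              # biased toward low faces, as a mean of 2.7 demands
print(p @ faces)      # -> 2.7, the imposed constraint
```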

In fact, going back to our kangaroo example, we saw that acceptable values for pBR satisfying all three constraints ranged from 1/12 to 1/3. We could have assumed that all values in this range were equally acceptable. However, by enforcing no correlations where none were warranted by the data, we arrived at 1/4 from MaxEnt.

To quantify just how good or bad 1/4 is, we need a posterior distribution over models, {pi}, for the given data. In other words, we need to reconcile MaxEnt and Bayesian inference by relating the entropy to a Bayesian prior.

To do so, we first note that maximizing the constrained Shannon entropy, Eq. (33), is analogous to maximizing a posterior over the {pi}: H plays the role of the logarithm of a prior over the {pi} while the constraints play the role of the logarithm of the likelihood. The same restrictions that apply to selecting a likelihood in frequentist and Bayesian analysis hold for selecting its logarithm (i.e. the constraints in MaxEnt). However, whereas Bayesian inference generates distributions over parameters, {pi}, MaxEnt returns point estimates (the maxima, {p*i}, of the constrained Shannon entropy).

As we will see, the constraints that we imposed in Eq. (33) are highly unusual and equivalent to delta-function likelihoods infinitely sharply peaked at their mean value.

To quantitatively relate MaxEnt to Bayesian inference, we take a frequentist route [111,122] and consider frequencies of the outcomes of an experiment by counting the number of events collected in the ith bin, ni, assuming such independent events occur with probability µi

P(n|µ) = ∏i [µi^{ni} e^{−µi} / ni!] ∼ e^{∑i (ni−µi)} e^{−∑i ni log(ni/µi)} (34)

where, in the last step, we have invoked Stirling's approximation, valid when all ni are large. We now define N as the total number of events, ∑i ni, and define probabilities pi ≡ ni/N and qi ≡ µi/N [38]. Then

P(p|q) ∼ e^{N ∑i (pi−qi)} e^{−N ∑i pi log(pi/qi)}. (35)

By imposing normalization on both pi and qi, we have

P(p|q) ≡ P(p | q, ∑i pi = ∑i qi = 1) = e^{−N ∑i pi log(pi/qi)} δ_{∑i pi, 1} δ_{∑i qi, 1} / Z = e^{NH} δ_{∑i pi, 1} δ_{∑i qi, 1} / Z (36)


where Z = Z(q) is a normalization factor. That is, it is an integral of the numerator of Eq. (36) over each pi from 0 to 1. In addition, H = −∑i pi log(pi/qi) and δx,y is the Kronecker delta (i.e. it is zero unless x = y, in which case it is one).

For our simple kangaroo example, we can now compute the posterior probability over the model, p, given constraints from the data, D, by multiplying the prior, Eq. (36), with hard (Kronecker delta) constraints as our likelihood. This yields

P(p|D,q) = δ_{p1+p2, 1/3} δ_{p2+p4, 1/4} × P(p|q) / Z (37)

and Z is again a normalization and we have conveniently re-indexed the probabilities with numbers rather than letters. Here P(p|D,q) is a posterior, P(p|q) is a prior (which, we have found, depends on the Shannon entropy) and q are hyperparameters. Intuitively, we can understand q as being the values to which p defaults when we maximize the entropy in the absence of constraints. That is, maximizing −∑i pi log(pi/qi) returns pi ∝ qi. In fact, P(p|D,q) describes the probability over all allowed models. Furthermore, given identical constraints from the data – i.e. the same likelihood – and given that the normalization, Z, does not depend on p, the ratio of posterior probabilities then only depends on the Shannon entropy of both models

P(p|D,q) / P(p′|D,q) = e^{N(H(p|q) − H(p′|q))}. (38)

The factor of N quantifies the strength of our prior assumptions in much the same way that the hyperparameters α and β of Eq. (20) set properties of the prior. In other words, informally, N tells us how many “data points” our prior knowledge is worth.

Now we can evaluate, for the kangaroo example, a ratio of posteriors for the optimal MaxEnt model (p1 = 1/4, p2 = 1/12, p3 = 1/2, p4 = 1/6) and a variant

P(p1 = 1/4, p2 = 1/12, p3 = 1/2, p4 = 1/6|D) / P(p1 = 1/4 − ε, p2 = 1/12 + ε, p3 = 1/2 + ε, p4 = 1/6 − ε|D) = e^{N(0.19)} (39)

where we have assumed equal (uniform) qi's and ε = 1/8.
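The exponent appearing in Eq. (39) can be verified directly; with uniform qi the relative entropies differ from −∑i pi log pi only by a constant that cancels in the difference:

```python
import numpy as np

def H(p, q):
    return -np.sum(p * np.log(p / q))   # entropy relative to the default model q

q = np.full(4, 0.25)                              # uniform hyperparameters
p_opt = np.array([1/4, 1/12, 1/2, 1/6])           # MaxEnt solution
eps = 1/8
p_var = p_opt + eps * np.array([-1, 1, 1, -1])    # the variant of Eq. (39)
print(H(p_opt, q) - H(p_var, q))                  # -> ~0.186, i.e. the ~0.19 of Eq. (39)
```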

While the maximum of the posterior, Eq. (37), is independent of N for the kangaroo example – because of the artificiality of delta-function constraints – the shape of the posterior, and thus the credible interval (the Bayesian analogue of the frequentist confidence interval), certainly depend on N [110,124].

The MaxEnt recipe is thus equivalent to maximizing a posterior over a probability distribution. The prior in the MaxEnt prescription is set to the entropy for fundamental reasons described in the last section of the review, which also details the repercussions of rejecting the principle of MaxEnt.

MaxEnt does not assume a parametric form for the probability distribution. Also, the number of model parameters – that is, of individual pi's – can be very large for probability distributions discretized on a very fine grid.

3.4 Applications of MaxEnt: Deconvolution methods

Often, to determine how many exponential components contribute to a decay process, a decay signal is first fit to a single exponential and the resulting goodness-of-fit is quantified. If the fit is deemed unsatisfactory, then an additional decay component is introduced and new parameters (two exponential decay constants and the relative weights of each exponential in this mixture model) are determined. As described, this fitting procedure cannot be terminated in a principled way. That is, an increasingly large number of exponentials will always improve the fit [125].

MaxEnt deconvolution methods are specifically tailored to tackle this routine problem of data analysis. To give a concrete example, imagine a decay signal, s(t), which is related to the distribution of decay rates, p(r), through the following relation


s(t) = ∫_0^∞ dr e^{−rt} p(r). (40)

MaxEnt deconvolution solves the inverse problem of determining p(r) from s(t). That is, MaxEnt readily infers probability distributions, such as the unknown weights p(r) appearing in mixture models [110]. These weights could include, for example, probabilities for exponentials, as in Eq. (40), or probabilities that cytoplasmic proteins sample different diffusion coefficients [5,61,134]. The fitting procedure is ultimately terminated because MaxEnt insists on simple (minimum information or maximum entropy) models consistent with observations.

More generally, for discrete data, Di, we can write the discrete analog of Eq. (40)

Di = ∑j Gij pj + εi (41)

where Gij is the ijth matrix element of a general transformation matrix, G, and pj is the model. Contrary to the noiseless Eq. (40), here we have added noise, εi, to Eq. (41).

All experimental details are captured in the matrix G. Here are examples of this matrix:

GFluor · p = ∫_0^∞ dr e^{−rt} p(r) (42)

GFRAP · p = −∫_0^∞ dD e^{−(x−x0)²/(2Dt)} p(D) (43)

GFCS · p = −∫_0^∞ dτD (1/n) (1 + τ/τD)^{−3/2} p(τD) (44)

where the first can be used to determine decay rate distributions [113–115]; the second is relevant to fluorescence recovery after photobleaching (FRAP) with an undetermined distribution over diffusion coefficients, D (assuming isotropic diffusion in one dimension); the third is relevant to fluorescence correlation spectroscopy (FCS) with an undetermined distribution over diffusion times, τD, through a confocal volume [5,61], assuming a symmetric Gaussian confocal volume with n diffusing particles.

If a Gaussian noise model is justified – where εi is sampled from a Gaussian distribution with zero mean and standard deviation σi, i.e. εi ∼ N(0, σi) – then one could propose to find the pj by minimizing the following negative log-likelihood (equivalent to maximizing the likelihood for a product of Gaussians) under the assumption of independent Gaussian observations

χ² ≡ ∑i [(Di − ∑j Gij pj)/σi]². (45)

This ill-fated optimization of Eq. (45), described in the first paragraph of this section, overfits the model. Additionally, depending on our choice of discretization for the index j of Eq. (41), we may choose to have many more weights, pj, than we have data points, Di, and, in this circumstance, a unique minimum of the χ² may not even exist. For this reason, we use the entropy prior and write down our posterior

P(p | D, q, ∑i pi = ∑i qi = 1) ∝ P(D|p) × P(p | q, ∑i pi = ∑i qi = 1) ∝ e^{−χ²/2} × e^{−N ∑i pi log(pi/qi)} δ_{∑i pi, 1} δ_{∑i qi, 1} (46)

where the proportionality above indicates that we have not explicitly accounted for the normalization, P(D|q). If we are only interested in a point estimate for our model – i.e. the one that maximizes the posterior given by Eq. (46) – then the objective function we need to maximize is [38]


−N ∑i pi log(pi/qi) − χ²/2 + λ0 (∑i pi − 1) + λ1 (∑i qi − 1) (47)

where we have used Lagrange multipliers (λ0, λ1) to replace the delta-function constraints. The variation of the MaxEnt objective function, Eq. (47), is now understood to be over each pi as well as λ0 and λ1. Furthermore, if we are only interested in the maximum of Eq. (47), we are free to multiply Eq. (47) through by constants or add constants as well. In doing so, we obtain a more familiar MaxEnt form

−∑i pi log(pi/qi) − φ(χ² − N)/2 + λ0 (∑i pi − 1) + λ1 (∑i qi − 1) (48)

where N – the number of independent observations – and φ are constants. N, or its inverse φ, is a hyperparameter that we must in principle set a priori.

One way to determine φ is to treat φ as a Lagrange multiplier enforcing the constraint that χ² be equal to its frequentist expectation

χ² ∼ N. (49)

That is, summed over a large and typical number of data points – where, typically, (Di − ∑j Gij pj)² ∼ σi² – we have χ² ∼ N.

Skilling and Gull [111] have argued that this frequentist line of reasoning to determine φ, and thus the posterior, undermines the meticulous effort that has been put into deriving Shannon's entropy from SJ's self-consistent reasoning arguments (that we discuss in the last section). Instead, they proposed [111] a method based on empirical Bayes [94], motivating the choice of hyperparameter from the data.

If we take Eq. (49) for now, we find a recipe for arriving at the optimal model, p*,

p* = argmax_{p, φ, λ0, λ1} [−∑i pi log(pi/qi) − φ(χ² − N)/2 + λ0 (∑i pi − 1) + λ1 (∑i qi − 1)]. (50)

Often, we select a uniform distribution (flat q) though different choices are discussed in the literature [125].
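To make this recipe concrete, here is a minimal sketch of MaxEnt deconvolution for the exponential kernel of Eq. (40). The grids, noise level and entropy weight are arbitrary choices, and, for simplicity, the normalization constraint is handled by the optimizer directly rather than through explicit Lagrange multipliers as in Eq. (50):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

t = np.linspace(0.05, 5.0, 100)                   # measurement times
r = np.linspace(0.1, 10.0, 40)                    # rate grid for p(r)
G = np.exp(-np.outer(t, r))                       # kernel of Eq. (40)

p_true = np.zeros(r.size); p_true[[5, 25]] = 0.5  # two hidden decay components
sigma = 0.01
D = G @ p_true + sigma * rng.standard_normal(t.size)

q = np.full(r.size, 1.0 / r.size)                 # flat default model
w = 1.0                                           # entropy (prior) weight, set by hand here

def neg_objective(p):
    chi2 = np.sum(((D - G @ p) / sigma) ** 2)
    rel_ent = np.sum(p * np.log(np.maximum(p, 1e-12) / q))
    return 0.5 * chi2 + w * rel_ent               # negative of Eq. (47), constraints aside

res = minimize(neg_objective, q, method="SLSQP",
               bounds=[(0.0, 1.0)] * r.size,
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
print(res.x.round(3))                             # entropy-broadened peaks near the true rates
```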

Finally, for FCS, Eq. (44) is just a starting point that, for simplicity, ignores confocal volume asymmetry and triplet corrections. More sophisticated FCS deconvolution methods can account for these [5] and also explicitly account for correlated noise and the ballistic motion of actively transported particles [60]. And – in part because FCS is so versatile [135,136] and can even be used in vivo in a minimally invasive manner [19,61,137–146] – FCS data has been analyzed using multiple deconvolution methods that have provided models for the dynamics of the human islet amyloid polypeptide (hIAPP) on the plasma membrane [147] as well as the dynamics of signaling proteins in zebrafish embryos [134].

3.4.1 MaxEnt deconvolution: An application to FCS

Here we briefly present an application where MaxEnt was used to infer the behavior of transcription factors in vivo [5] and to learn about crowding, binding effects and photophysical artifacts contributing to FCS data.

Briefly, in FCS, labeled proteins are monitored as they traverse an illuminated confocal volume [148]. The diffusion time, τD, across this volume of width w is obtained from the fluorescence intensity time correlation function, G(τ), according to [146,148]

G(τ) = (1/n) (1 + τ/τD)^{−1} (1 + (1/Q²)(τ/τD))^{−1/2} (51)


Figure 4: FCS may be used to model the dynamics of labeled particles at many cellular locations (regions of interest (ROIs)), both in the cytosol and in the nucleus. a) Merged image of a cerulean-CTA fluorescent protein (FP) used to image the cytosol and mCherry red FP used to tag BZip protein domains. In Ref. [5], we analyzed FCS data on tagged BZips diffusing in the nucleus and the cytosol. We analyzed diffusion in ROIs far from heterochromatin by avoiding red FP congregation areas (bright red spots). MaxEnt analysis revealed details of the fluorophore photophysics, crowding and binding effects that could otherwise be fit using anomalous models. b) A cartoon of the cell nucleus illustrating various microenvironments in which BZip (red dots) diffuses (A: free region; B: crowded region; C: non-specific DNA binding region; D: high affinity binding region).

where n is the average number of particles in the confocal volume and Q characterizes the confocal volume's asymmetry. The diffusion constant, D, is related to τD by τD = w²/4D [for simplicity, Eq. (51) ignores triplet corrections [149]].

In complex environments, G(τ) often cannot be fit with a single diffusion component (Eq. (51)). That is, τ is no longer simply proportional to a mean square displacement in the confocal volume, 〈δr²〉. Instead, G(τ)'s are constructed for anomalous diffusion models – where 〈δr²〉 ∝ τ^α and α is different from one – as follows [19,140–143]

G(τ) = (1/n) (1 + (τ/τD)^α)^{−1} (1 + (1/Q²)(τ/τD)^α)^{−1/2} (52)

where we have introduced an effective diffusion time, τD, and asymmetry parameter, Q.

Circumstances under which anomalous diffusion models – where 〈δr²〉 ∝ τ^α strictly holds over many decades in time – apply are exceptional, not generic. For example, fractional Brownian motion (FBM) – which can give rise to anomalous diffusion [150] – may arise when proteins diffuse through closely packed fractal-like heterochromatin structures [151], though it is unclear to what degree the structure of heterochromatin actually is fractal. As another example, continuous time random walks (CTRW), in turn, yield anomalous diffusion by imposing power law particle waiting time or jump size distributions to describe a single particle's trajectory [152–155], though these power laws have only rarely been observed experimentally [152,153] and, what is more, FCS does not collect data on single particle trajectories.

A method of analysis should deal with the data as it is provided. That is, for FCS, a model should preferentially be inferred directly from G(τ) rather than by conjecturing behaviors for single particle trajectories – which are not observed – that may give rise to a G(τ).

MaxEnt starts with the data at hand and provides an alternative to fitting data using anomalous diffusion models [5]. Rather than imposing a parametric form (Eq. (52)) on the data, MaxEnt has been used to harness the entire G(τ) to extract information on crowding effects, photophysical label artifacts, cluster formation as well as affinity site binding in vivo, a topic that has been of recent interest [156–164]; see Fig. (4).

To extract information on basic processes that could be contributing to the G(τ), we start with the observation that the in vivo confocal volume is composed of many “microenvironments”; see Fig. (4b). In each microenvironment, the diffusion coefficient differs and diffusion may thus be described using a multi-component (mixture) normal diffusion model [61,137–140]


Figure 5: Protein binding sites of different affinities yield a G(τ) that is well fit by an anomalous diffusion model. A theoretical G(τ) (containing 150 points) was created from an anomalous diffusion model, Eq. (52) with α = 0.9, to which we added 5% white noise (a, blue dots, logarithmic in time). Using MaxEnt, we infer a p(τD) from this G(τ) (b) and, as a sanity check, use it to reconstruct a G(τ) (a, solid curve). In the main body, we discuss how protein binding sites of different affinities could give rise to such a p(τD). Part of p(τD) is then excised, yielding a new p(τD) (d, pink curve). Conceptually, this is equivalent to mutating a binding site which eliminates some τD's. We created a G(τ) from this theoretical distribution with 8% white noise (c, blue dots, logarithmic in time). We then extracted a p(τD) from this (d, blue curve) and reconstructed a G(τ) from this p(τD) distribution as a check (c, solid curve). Time is in arbitrary units. See text and Ref. [5] for more details.

G(τ) = (1/n) ∑_{τD} p(τD) (1 + τ/τD)^{−1} (1 + (1/Q²)(τ/τD))^{−1/2}. (53)

MaxEnt is then used to infer the full diffusion coefficient distribution, p(D), or, equivalently, p(τD), directly from the data. In particular, Figs. (5a)-(5b) illustrate how mixture models and anomalous diffusion models may both fit the data equally well. As a benchmark, a synthetic G(τ) (blue dots, Fig. (5a)) is generated with an anomalous diffusion model [Eq. (52) with α = 0.9 and 5% added white noise]. Fig. (5b) shows the resulting p(τD) extracted from the noisy synthetic data. From this p(τD), we re-create a G(τ) (solid line, Fig. (5a)) and verify that it closely matches the original G(τ) (blue dots, Fig. (5a)).
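For reference, below is a sketch of the two competing descriptions: the anomalous form of Eq. (52) used to generate the benchmark and the normal-diffusion mixture of Eq. (53); all parameter values are arbitrary:

```python
import numpy as np

def G_anomalous(tau, n=1.0, tau_D=1.0, alpha=0.9, Q=5.0):
    """Anomalous-diffusion FCS curve, Eq. (52)."""
    x = (tau / tau_D) ** alpha
    return (1.0 / n) * (1 + x) ** -1 * (1 + x / Q**2) ** -0.5

def G_mixture(tau, p, tau_Ds, n=1.0, Q=5.0):
    """Normal-diffusion mixture, Eq. (53): weights p over a grid of diffusion times."""
    x = tau[:, None] / tau_Ds[None, :]
    comps = (1 + x) ** -1 * (1 + x / Q**2) ** -0.5
    return (comps @ p) / n

tau = np.logspace(-2, 2, 150)
rng = np.random.default_rng(1)
target = G_anomalous(tau) * (1 + 0.05 * rng.standard_normal(tau.size))
# 'target' mimics the noisy synthetic curve of Fig. (5a). Using the component
# matrix built inside G_mixture as the kernel G, the MaxEnt deconvolution
# sketch of Sec. 3.4 can then infer a p(tau_D), as in Fig. (5b).
```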

An important advantage of the mixture model is that it provides a p(τD) that may be microscopically interpretable. For instance, suppose a binding site is removed, either by mutating/removing a particular DNA binding site or a cooperative binding partner. Based on the model [discussed in detail in Ref. [5]], we expect the resulting p(τD) – or, equivalently, p(D) – to show a gap at some τD. The hypothetical p(τD), expected after removal of a binding site, is shown with an exaggerated excision (pink curve, Fig. (5d)). A corresponding noisy G(τ) is generated from this p(τD) (blue dots, Fig. (5c)). Now we ask: had we been presented with such a G(τ), would we have been able to tell that a site had been mutated? The inferred p(τD) (blue curve, Fig. (5d)) shows a clear excision, directly indicating that a mutated site would have been detectable.

To illustrate that MaxEnt also works on real data, we re-analyzed in vitro data on the diffusion of the small dye Alexa568 (Fig. (6a)) as well as previously published FCS data [19] on the diffusion of the BZip domain of a transcription factor (TF) [CCAAT/enhancer-binding C/EBPα] tagged with red fluorescent proteins (FPs) [either mCherry or mRuby2]. We analyzed data on BZip's diffusion both in solution and in a living mouse cell's cytoplasm and nucleus (away from heterochromatin) that appeared to show anomalous diffusion. The results are summarized in Figs. (6b)-(6g).


Figure 6: Probability distributions of diffusion coefficients can be inferred from FCS curves. a) p(D) for freely diffusing Alexa568 shows no “superdiffusive plateau” (defined in the text) of the kind that arises from dye flickering. Rather, it shows its main peak at 360 µm²/s, very near the reported value of 363 µm²/s [165]. We attribute the smaller peak centered at ∼ 5 µm²/s to dye aggregation [5]. b) + c) We analyzed p(D)'s obtained from FCS data acquired on mCherry and mRuby2 diffusing freely in solution, d) + e) mCherry- or mRuby2-tagged BZip protein domains in the cytosol and f) + g) in the nucleus far from heterochromatin [165]. Black curves are averages of the red curves [total number of data sets: b: 3, c: 9, d: 5, e: 16, f: 7, and g: 21]. The additional blue curve in (g) shows the analysis of the best data set (i.e. the most monotonic G(τ)). See text and Ref. [5] for more details.



Briefly, in solution (Figs. (6b)-(6c)) we identify the effects of protein flickering on the p(D). Flickering is a fast, reversible photoswitching arising from FP chromophore core instabilities [166]. Fast flickering [faster than the tagged protein's τD] registers as fast-moving (high diffusion) components. Since many particles flicker, p(D) shows substantial density at high D values. As expected, mRuby2 and mCherry flickering appears in p(D) as a “superdiffusive plateau” at the highest values of D in Figs. (6b)-(6c). The plateau's lower bound coincides with the diffusion coefficient expected in the absence of flickering [5]. As a control, Alexa568 – which is well-behaved in FCS studies [165,167] – shows no plateau; see Fig. (6a).

In the cytosol, we found label-dependent molecular crowding effects on protein diffusion. Beyond the superdiffusive plateau, the cytosolic p(D) shows peaks for mCherry-BZip and mRuby2-BZip at ∼ 20−40 µm²/s; see Figs. (6d)-(6e). This peak's location is consistent with results from FCS and FRAP experiments [165,168,169] and is attributed to crowding, since our labeled proteins are thought to have few cytosolic interactions [165,170]; see Ref. [5] for details.

Finally, in the nucleus, while our data sets are well fit by anomalous diffusion models [19], our method instead finds evidence of BZip's DNA site-binding. For BZip, we expect: 1) high affinity binding to specific DNA elements as well as lower-affinity non-specific DNA binding [5,19,171]; 2) interactions with other chromatin binding proteins [19]; and 3) association with proteins [172,173] such as BZip's interaction with HP1α (which binds to histones) [19,173]. The p(D) of Figs. (6f)-(6g) shows the less prominent but expected crowding peak (∼ 10 µm²/s) and photobleaching plateau as well as features arising from interactions. For instance, for mCherry-BZip we find slow diffusion coefficients with peaks centered at about 0.2 µm²/s, 0.8 µm²/s and 5 µm²/s, identifying possible nuclear interactions. The blue curve in Fig. (6g) displays “the best data set” for mRuby2-BZip nuclear diffusion [i.e. the most monotonic G(τ) (which is what G(τ) ought to be in the absence of noise) as measured by the Spearman rank coefficient]. This p(D) shows three clear peaks corresponding to diffusion coefficients of ∼ 1000 µm²/s (flickering), ∼ 80 µm²/s (crowding) and ∼ 2 µm²/s (binding interaction with K ≈ 500 nM⁻¹ assuming [S] = 0.1 mM).

In summary, this section highlights the important mechanistic details that can be drawn from MaxEnt deconvolu-tion techniques.

While MaxEnt is focused on inferring probability distributions of an a priori unspecified form, often we do have specific parametric forms for models in mind when we analyze data. Selecting between different “nested models” – models obtained as a special case of a more complex model by either eliminating or setting conditions on the complex model's parameters – is the focus of the next section.

4 Model Selection

4.1 Brief overview of model selection

A handful of highly complex models may fit any given data set very well. By contrast, a combinatorially larger number of simpler models – with fewer and more flexible parameters – provide a looser fit to the data. While highly complex models may provide excellent fits to a single data set, they are, correspondingly, over-committed to that particular data set.

The goal of successful model selection criteria is to pick models: 1) whose complexity is penalized, in a principled fashion, to avoid overfitting; and 2) that convincingly fit the data provided (the training set).

Model selection criteria are widely used in biophysical data analysis, from image deconvolution [132,174–177] to single molecule step detection [37,59,178,179], and continue to be developed by statisticians [180].

Here we summarize both information theoretic [181–184] as well as Bayesian [35,36,147,185–188] model selection criteria.


4.1.1 Information theoretic and Bayesian model selection

In information theory, h(x|θ) = −log p(x|θ) is interpreted as the information contained in the likelihood for data points x given parameters θ [189]. Minimizing this information over θ is equivalent to maximizing the likelihood for parametric models. For problems where the number of parameters (K) is unknown, preference is always given to more complex models. To avoid this problem, a cost function, L, associated to each additional variable is introduced [190], −log p(x|θ) + L(θ). Put differently, in the language of Shannon's coding theory that we will discuss later, if −log p(x|θ) measures the message length, then the goal of model selection is to find a model of minimal description length (MDL) [191–193] or, informally, of maximum compression [64].

Information theoretic model selection criteria – such as the Akaike Information Criterion (AIC) [194–197] – start with the assumption that the data may be very complex but that an approximate, candidate, model may minimize the difference in information content between the true (hypothetical) model and the candidate model. As we will see in detail later, models that overfit the data are avoided by parametrizing the candidate model on a training data set and comparing the information between an estimate of the true model and candidate models on a different (validation) test set. In this way, the AIC is about prediction of a model for additional data points provided beyond those used in the training step.

Since the data may be very complex, as the number of data points provided grows, the complexity (number of parameters) of the model selected by the AIC grows concomitantly. Complex models are not always a disadvantage. For instance, they may be essential if we try to approximate an arbitrary non-linear function using a high-order polynomial candidate model, or for models altogether too complex to represent using simple parametric forms [196].

Bayesian model selection criteria – such as the Bayesian (or Schwarz) information criterion (BIC) [62] – instead select the model that maximizes a marginal posterior [55,198,199]. In the marginalization step, we integrate over all irrelevant or unknown model parameters. This marginalization step is, as we will see, critical in avoiding overfitting. This is because, by marginalizing over variables, our final marginal posterior is a sum over models, including models that fit the data poorly.

Unlike the AIC, the BIC assumes that there exists a true model and it searches for this model [197,200]. Since this model's complexity is fixed – it does not depend on the number of data points, N – the BIC avoids growing the dimensionality of the model with N by penalizing the number of parameters of the model according to a function of N, namely log N. This penalizing function is derived; it is not imposed by hand. By contrast to the AIC, the BIC “postdicts” the model since, in using the BIC, we assume that we already have access to all observed data [201].

As we will see later, for slightly non-linear models, the BIC may outperform the AIC, which overfits small features, while for highly non-linear models – where small features are important – the AIC may outperform the BIC [196]. The performance of the AIC and BIC is illustrated for a simple example in Fig. (7).

4.2 Information theoretic model selection: The AIC

In this section, we sketch a derivation of the AIC [181,189,202,203]. While this section is theoretical and can be skipped upon a first reading, it does highlight: i) the method's limitations and applicability [204]; ii) that the penalty term follows from a principled derivation (and is not arbitrarily tunable); iii) how it conceptually differs from the BIC.

Briefly, finding the real or true model that generated the data is not achievable, and our goal is instead to seek a good candidate model. The AIC [63,205], as one of the important selection criteria, is based on estimation of the KL divergence [127] between the (unknown and unknowable) true distribution that generated the data, f(x), and a candidate distribution p(x|θ0) parametrized by θ0

DKL[f‖p] = ∫dx f(x) log [f(x)/p(x|θ0)] = ∫dx f(x) log f(x) − ∫dx f(x) log p(x|θ0) (54)


Figure 7: The AIC and BIC are often both applied to step-finding. a) We generated 1000 data points with a background noise level, σb = 20. On top of the background, we added 6 dwells (5 change points) with noise around the signal having a standard deviation of σs = 5 (see inset). At this high noise level, and for this particular application, the BIC outperforms the AIC and the minimum of the BIC is at the theoretical value of 5 (dotted line). All noise is Gaussian and de-correlated. b) For our choice of parameters, the AIC (green) finds a model that overfits the true model (black) while the BIC (red) does not. However, as we increase the number of steps (while keeping the total number of data points fixed), the AIC does eventually outperform the BIC. This is to be expected: the AIC assumes the model could be unbounded in complexity and therefore does not penalize additional steps as much. The BIC, by contrast, assumes that there exists a true model of finite complexity. We acknowledge K. Tsekouras for generating this figure.

where θ0 is the best possible estimate for θ (obtainable in the limit of infinite data). The KL divergence is always positive or – in the event, only realizable with synthetic data, that the true model and candidate model coincide – zero. The proof of this is well known and follows from Jensen's inequality [206].

To achieve our goal – and select a model, p, that minimizes DKL[f‖p] – we must first make some approximations, as both f and θ0 are unknown.

Given a training data set, x, we may replace the hypothetical θ0 with its estimate θ(x). However, using the same data set to evaluate both θ(x) and p(x|θ(x)) biases our KL toward more complex models. To see this, we note that the numerical values of the likelihood satisfy p(x|θ(x)) > p(x|θ(y)) since the numerical value of the likelihood is clearly worse (smaller) for a data set (y) different from the one used to parametrize the model. Thus, the KL is always smaller for p(x|θ(x)) than for p(x|θ(y)) and becomes increasingly smaller as we add more and more parameters. That is, as we grow the dimensionality of θ(x).

To avoid this bias, we (conceptually) estimate θ0 instead on a training set, y, different from the validation set, x, and estimate the KL, DKL[f‖p], as follows [180]

DKL[f‖p] = const − T, T = Ey Ex[log p(x|θ(y))] (55)

where the constant, const, only depends on the hypothetical true model, see Eq. (54), and is, therefore, independent of our choice of candidate model. Furthermore, the expectation with respect to a distribution over some variable z is understood as

Ez[g(z)] = (1/N) ∑_{i=1}^{N} g(zi) (56)

where the samples, zi, are drawn from that distribution. Furthermore, for clarity, if f were known then

T → ∫dy f(y) ∫dx f(x) log p(x|θ(y)). (57)

Our goal is now to estimate and, subsequently, maximize T with respect to candidate models, in order to minimize DKL.


To evaluate T, we must compute a double expectation value. In the large data set limit, where θ(y) is presumably near θ0, we expand θ(y) around θ0

T ∼ Ey Ex[log p(x|θ0) + (1/2)(θ(y) − θ0) · (∇θ∇θ log p(x|θ(y)))_{θ(y)=θ0} · (θ(y) − θ0)]
  = Ey Ex[log p(x|θ0)] − (1/2) Ey[(θ(y) − θ0) · I(θ0) · (θ(y) − θ0)]
  = Ex[log p(x|θ0)] − (1/2) Ey[(θ(y) − θ0) · I(θ0) · (θ(y) − θ0)] (58)

where the expectation (not the value itself) of the first order term (which we have not written) vanishes and I(θ0) is the Fisher information matrix. We will eventually want to evaluate the expectation of the second order term by further simplification (keeping in mind that any errors incurred will, by construction, be of ever higher order).

However, for now, our focus is on the leading order term of Eq. (58), Ex[log p(x|θ0)], which we expand around θ(x). The resulting T of Eq. (58) becomes

T ∼ Ex[log p(x|θ(x))] − (1/2) Ex[(θ0 − θ(x)) · I(θ(x)) · (θ0 − θ(x))] − (1/2) Ey[(θ(y) − θ0) · I(θ0) · (θ(y) − θ0)]. (59)

By construction, the first order term of Eq. (59) vanished. Furthermore, to leading order, both quadratic terms are identical such that

T ∼ Ex[log p(x|θ(x))] − Ey[(θ(y) − θ0) · I(θ0) · (θ(y) − θ0)]. (60)

Evaluated near its maximum, the expectation of (θ(y) − θ0)(θ(y) − θ0) is, again to leading order, the inverse of the Fisher information matrix [180]. For a K × K information matrix, we therefore have

T ∼ Ex[log p(x|θ(x))] − K = Ex[log p(x|θ(x)) − K]. (61)

The model that maximizes T is then equivalent to the model minimizing the AIC [180]

AIC ≡ −2 log p(x|θ(x)) + 2K. (62)

In other words, as we increase the number of parameters in our model, the numerical value of the likelihood becomes larger (and the AIC decreases) but our complexity cost (2 for each parameter, for a total of 2K) also rises (and the AIC increases). Thus, our goal is to find a model with a K that minimizes the AIC where, ideally, K is different from 0 or ∞.

The “penalty term” of Eq. (62), +2K, does not depend on N. This is very different from the BIC, as we will now see, which selects an absolute model without comparison to any reference true model and whose penalty explicitly depends on N.
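As a minimal illustration of Eq. (62), consider polynomial order selection on synthetic data (the generating cubic and noise level are arbitrary); for a Gaussian likelihood with maximum likelihood variance σ̂², the maximized log-likelihood is −(N/2)(log 2πσ̂² + 1):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.2 * rng.standard_normal(x.size)  # cubic + noise
N = x.size

for deg in range(8):
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    sigma2 = np.mean(resid**2)                      # ML estimate of the noise variance
    loglike = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    K = deg + 2                                     # deg+1 coefficients plus sigma
    print(deg, round(-2 * loglike + 2 * K, 1))      # AIC, Eq. (62); smallest wins
```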

4.3 Bayesian model selection

4.3.1 Parameter marginalization: Illustration on outliers

Parameter marginalization is essential to understanding the BIC. We therefore take a brief detour to discuss this topic in the context of outliers [132].

Suppose, for simplicity, that we are provided a signal with a fixed standard deviation as shown in the inset of Fig. (7a). The likelihood of observing a sequence of N independent Gaussian data points, D = x, drawn from a distribution with unknown mean, µ, and variance, σ², is

p(x|µ, σ) = ∏_{i=1}^{N} (1/(√(2π)σ)) e^{−(xi−µ)²/(2σ²)}. (63)

The posterior is then

p(µ, σ|x) = p(x|µ, σ) p(µ, σ) / p(x) (64)

where p(µ, σ) is the prior distribution over µ and σ and the normalization is p(x) = ∫dµ dσ p(x|µ, σ) p(µ, σ).

If we know – or can reliably estimate – the standard deviation, then we may fix σ to that value, say σ0, and subsequently maximize the posterior to obtain an optimal µ. The form of this posterior, p(µ|x, σ0) ≡ p(µ|x), is quite sensitive to outliers because each data point is treated with the same known uncertainty [132]. For example, the distribution over µ – blue curve in Fig. (8) – is heavily influenced by the two apparent outliers near 5-6.

If we have outliers, it may be more reasonable to assume that we are merely cognizant of a lower bound on σ [132]. In this case, a marginal posterior over µ is obtained by integrating σ starting from the lower bound σ0

p(µ|x) = (1/p(x)) ∫_{σ0}^{∞} dσ p(x|µ, σ) p(µ, σ). (65)

As N grows, the likelihood eventually dominates over the prior and determines the shape of the posterior. This posterior over µ – the orange curve of Fig. (8) – is obtained using our previous likelihood (Eq. (63)) with p(µ, σ) = p(µ)p(σ), with a flat prior on µ and, as a matter of later convenience, p(σ) = σ0/σ². As we no longer commit to a fixed σ, the orange curve is much broader than the blue curve. However, it is still susceptible to the outliers near 5-6 since all points are treated as having the same uncertainty, though only a range for that uncertainty is now specified.

Relaxing the constraint that all points must have the same uncertainty, the marginal posterior distribution over µ becomes

p(µ|x) = (1/p(x)) ∏i ∫_{σ0}^{∞} dσi p(xi|µ, σi) p(µ, σi). (66)

This marginal posterior – shown by the green line of Fig. (8) – is far less committal than the previous posteriors we considered. That is, this posterior can assume a multimodal form with the highest maximum centered in the region with the largest number of data points. Unlike our two previous marginal posteriors, the location of this posterior's highest maximum is not as deeply influenced by the outliers. While the highest posterior maximum provides an estimate for µ largely insensitive to outliers, the additional maxima may help identify a possible second candidate µ. This might be helpful in, say, single molecule force spectroscopy – where the noise properties depend on the state of the system – and apparent occasional outliers may suggest the presence of an additional force state.

This integration over parameters whose value we do not know – σ in the example above – is key. Through integration, we allow (sum over) a broad range of fits to the data. This naturally reduces the complexity of our model because it contributes parameter values to our marginal posterior that yield both good and bad fits to the data.
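The posteriors of Fig. (8) can be reproduced numerically. Below is a sketch of the least committal case, Eq. (66), with a flat prior on µ and p(σi) = σ0/σi²; the µ grid is an arbitrary choice:

```python
import numpy as np
from scipy.integrate import quad

D = np.array([1.0, 1.8, 2.4, 5.5, 5.8])   # data of Fig. (8)
s0 = 0.25                                  # lower bound on each standard deviation
mu = np.linspace(0.0, 7.0, 351)

def factor(x, m):
    """One factor of Eq. (66): integral over sigma_i of N(x|m,sigma_i) * s0/sigma_i^2."""
    f = lambda s: (np.exp(-(x - m) ** 2 / (2 * s**2))
                   / (np.sqrt(2 * np.pi) * s)) * s0 / s**2
    return quad(f, s0, np.inf)[0]

post = np.array([np.prod([factor(x, m) for x in D]) for m in mu])
post /= np.trapz(post, mu)                 # normalize; green curve of Fig. (8)
print(mu[np.argmax(post)])                 # highest mode should sit in the 1-2.4 cluster
```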

4.3.2 The BIC obtained as a sum over models

The BIC seeks a model that maximizes the posterior marginalized over irrelevant or unknown model parameters, θ. To compute this posterior, we define a likelihood, p(D = x|θ), describing N independent – or, at worst, weakly correlated – identical observations where p(D|θ) ≡ e^{N log f(D|θ)}. To be clear, if the data are completely independent, then f is understood as the likelihood per observation.


Figure 8: Noise models can be adapted to treat outliers. We are given a sequence of data points, D = {1, 1.8, 2.4, 5.5, 5.8} ± 0.25. We want to find the posterior over µ. Blue: we assume the standard deviation is fixed at 0.25 and use a Gaussian likelihood with a single variance for all points. Orange: we assume that the standard deviation's lower bound is 0.25, see Eq. (65), but that we still have a single variance for all points. Green: we still assume the standard deviation's lower bound is 0.25 but all points are assumed to have independent standard deviations, see Eq. (66).


We write down a marginal posterior

p(K|D) ∝ ∫d^K θ e^{N log f(D|θ)} (67)

where K is the total number of parameters.

Since we are interested in a general model selection criterion that does not care about the particularities of the application, we consider the large N limit and thus ignore the prior altogether.

To approximate the integral in Eq. (67), we invoke Laplace's method as before and expand log f(D|θ) around its maximum, θ*, to second order. In other words, we write

p(K|D) ∼ e^{N log f(D|θ*)} ∫d^K θ e^{−(N/2)(θ−θ*)·∇θ∇θ log f(D|θ*)·(θ−θ*)} = e^{N log f(D|θ*)} (2π/N)^{K/2} / √(det ∇θ∇θ log f(D|θ*)). (68)

The BIC then follows directly from Eq. (68)

BIC ≡ −2 log p(K|D) = −2 log p(D|θ*) + K log N + O(N^0). (69)

By contrast to the AIC given by Eq. (62), the BIC has a penalty that scales as log N. In a later section, we will relate this penalty to the predictive information provided by the model (Eq. (111)). Finally, we add that while the prior over θ is not treated explicitly, this treatment is still distinctly Bayesian. This is because the model parameters, over which we marginalize, are treated as continuous variables rather than as fixed numbers.
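Revisiting the toy polynomial example of Sec. 4.2, only the penalty changes relative to the AIC sketch given there:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.2 * rng.standard_normal(x.size)
N = x.size

for deg in range(8):
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    sigma2 = np.mean(resid**2)
    loglike = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    K = deg + 2
    print(deg, round(-2 * loglike + K * np.log(N), 1))  # BIC, Eq. (69); stiffer than 2K
```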

We now turn to an illustration of model selection drawn from change-point analysis.


Figure 9: The BIC finds correct steps when the noise statistics are well characterized. a) Our control. We generated synthetic steps (black line) and added noise (white, decorrelated) with the same standard deviation for each data point. We used a greedy algorithm [59] to identify and compare models according to Eq. (73) and identify the correct step locations (red line) from the noisy time trace (blue). b) Here we use a different, incorrect, likelihood that does not adequately represent the process that we used to generate the synthetic data. That is, we correctly assumed that the noise was white and decorrelated but also, incorrectly, assumed that we knew and fixed σ (and therefore did not integrate over σ in Eq. (71)). We underestimated σ by 12%. Naturally, we overfit (red) the true signal (black). Green shows the step-finding algorithm re-run using the correct noise magnitude. c) Here we use the BIC from Eq. (73) whose likelihood assumes no noise correlation. However, we generated a signal (black) to which we added correlated noise [by first assigning white noise, ε̃t, to each data point and then computing a new correlated noise, εt, at time t from εt = 0.7ε̃t + 0.1ε̃t−1 + 0.1ε̃t−2 + 0.1ε̃t−3]. As expected, the model that the BIC now selects (red) interprets as signal some of the correlated noise in the synthetic data. We acknowledge K. Tsekouras for generating this figure.

4.3.3 Illustration of the BIC: Change-point algorithms for a Gaussian process

Change-point algorithms locate points in the data where the statistics of the process generating the data change. There is a broad literature, including reviews [207–209], on change-point algorithms relying on the AIC [210–213], the BIC [178, 211, 212, 214–217], generalizations of the AIC [201, 204] and BIC [37, 218], wavelet transforms [219–221] and related techniques [222–226].

Here we illustrate how model selection – and the BIC in particular – is applied to a change-point detection problem for a Gaussian process like the one shown in Fig. (7b), with a fixed but unknown standard deviation – which is the same for all data points – and with a discretely changing mean.

We begin by writing down the likelihood

p(\mathbf{x}|K,\sigma,\boldsymbol{\mu},\mathbf{j}) = \prod_{i=0}^{K-1} \prod_{\ell=j_i}^{j_{i+1}-1} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_\ell-\mu_i)^2}{2\sigma^2}} \qquad (70)

where K denotes the number of change-points – points where the mean of the signal changes – occurring at locations j = {j_0, · · · , j_K}. To be precise, since the standard deviation is also a parameter to be determined, we have K + 1 total parameters here.

The model maximizing this likelihood places a change-point at every step (x_ℓ = μ_i for every ℓ). That is, p(x|K, σ, μ, j) peaks when K = N and, expectedly, overfits the data.

To avoid overfitting and – since we are still largely ignorant of the correct values for σ and μ – we integrate over all allowed values for σ and μ. This yields the following marginal posterior

p(K,\mathbf{j}|\mathbf{x}) \propto \int d\sigma\, d^K\!\mu\; p(\mathbf{x}|K,\sigma,\boldsymbol{\mu},\mathbf{j}) = \frac{\left(\sqrt{2\pi}\right)^{-(N-K)}}{n_0^{1/2}\cdots n_K^{1/2}} \cdot \frac{1}{2} \left(\frac{S}{2}\right)^{-\frac{N-K-1}{2}} \left(\frac{N-K-3}{2}\right)! \qquad (71)

where S \equiv n_0\hat{\sigma}_0^2 + \cdots + n_K\hat{\sigma}_K^2 and

\hat{\sigma}_i^2 \equiv \frac{1}{n_i} \sum_{\ell=j_i}^{j_{i+1}-1} x_\ell^2 - \frac{1}{n_i^2} \left( \sum_{\ell=j_i}^{j_{i+1}-1} x_\ell \right)^2 \qquad (72)

where n_i counts the number of points contained in the i-th step.

Eq. (71) reveals that p(K, j|x) may no longer be peaked at K = N. This is expected since, conceptually, by summing over σ, grossly underfitting models (models with large σ) now contribute to our marginal posterior.

Taking the further simplifying assumptions that: 1) n_i \sim N/K; 2) all n_i are large; and 3) \hat{\sigma}_0^2 \sim \hat{\sigma}_1^2 \sim \cdots \sim \hat{\sigma}_K^2 are all equal to a common expected value, \hat{\sigma}^2, we recover a form for a Gaussian process BIC seen in the literature [59]

\mathrm{BIC} = -2 \log p(K,\mathbf{j}|\mathbf{x}) = N \log \hat{\sigma}^2 + K \log N + O(N^0) + \mathrm{const} \qquad (73)

where the constants, const, capture all terms independent of model parameters (though they may depend on N).

Fig. (9) shows the detection of change-points in synthetic data and illustrates just how sensitive the BIC is to the correct choice of likelihood. To address this sensitivity, BICs have, for example, been tailored to detect change-points with time-correlated noise, as would be expected from methods such as single molecule force spectroscopy [178].
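To make the use of Eq. (73) concrete, here is a minimal Python sketch of a greedy change-point search scored by this BIC. It is illustrative only: the function names are ours, \hat{\sigma}^2 is estimated from the pooled residuals of the current piecewise-constant fit, and the greedy scheme only approximates the search strategy of Ref. [59].

```python
import numpy as np

def bic(x, cps):
    """BIC of Eq. (73) for piecewise-constant means; `cps` holds the left
    edge of each segment, beginning with 0."""
    N, K = len(x), len(cps) - 1              # K interior change-points
    edges = list(cps) + [N]
    # pooled residual variance plays the role of sigma^2 in Eq. (73)
    resid = np.concatenate([x[a:b] - x[a:b].mean()
                            for a, b in zip(edges[:-1], edges[1:])])
    return N * np.log(resid.var()) + K * np.log(N)

def greedy_changepoints(x, max_K=20):
    """Greedily add the change-point that lowers the BIC the most; stop
    once no single addition improves it."""
    cps, best = [0], bic(x, [0])
    for _ in range(max_K):
        trial = [(bic(x, sorted(cps + [j])), j)
                 for j in range(1, len(x)) if j not in cps]
        score, j_star = min(trial)
        if score >= best:                    # no improvement: stop
            break
        best, cps = score, sorted(cps + [j_star])
    return cps[1:], best                     # interior change-points, BIC

# example: noisy steps like the control of Fig. (9a)
x = np.concatenate([np.full(100, m) for m in (20.0, 28.0, 24.0, 32.0)])
x += np.random.default_rng(0).normal(0.0, 1.0, x.size)
print(greedy_changepoints(x)[0])             # close to [100, 200, 300]
```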

4.3.4 Shortcomings of the AIC and BIC

We cannot compare data sets of different lengths: The objective functions for both the AIC and BIC depend on N. For this reason, we cannot directly compare numerical values for AICs and BICs obtained for data sets of different lengths [227]. This problem often arises when comparing data sets of originally the same length but with a different number of outliers removed.

Correctly characterizing the likelihood is critical: We illustrate, in Figs. (9b)-(9c), how a mischaracterization of the likelihood – and, ultimately, of the process that generates the noise – can yield incorrect models.

The curvature of the likelihood function may vanish: The curvature of the likelihood arises in both the AIC and BIC. In the AIC, it appears through the Fisher information (see the second term in Eq. (58)) while in the BIC it arises from Laplace's method [204, 228, 229]. For singular problems, those where the likelihood's curvature vanishes, the AIC and BIC diverge. In concrete terms, this signifies that the model selection criterion becomes broadly insensitive to the model's dimensionality. A vanishing curvature occurs: i) when we have unidentifiable parameters, that is, at locations, x = x*, where p(x*|θ_1, θ_2) = p(x*|θ_1); in this case, at x*, ∂_{θ_1}∂_{θ_2} log p(x|θ_1, θ_2)|_{x=x*} = 0; ii) at change-point locations (changes in model parameters) in the data [230]. For instance, consider a force spectroscopy experiment monitoring the stepping motion of a molecular motor. After a single step, the signal appears to jump from a mean of μ to μ′. But, in practice, transitions may not be so discrete. In the extreme case, if hypothetically we collected data with infinite time resolution, we may see the signal (i.e. a hypothetical noiseless time trace) pass through a region of zero curvature as it continuously transitions from a region of negative to positive curvature. At this singular point, the AIC and BIC fail.

Despite vanishing curvatures at change-points, the AIC and BIC are commonly used in change-point analysis, as we have seen for our illustrative example. In practice, change-point methods ignore the point of zero curvature. That is, they treat the data as piecewise continuous, as we had in our example.

Alternatively, to avoid vanishing leading order (quadratic) corrections, we may select a model by evaluating (often numerically) the full posterior rather than approximating it (as a BIC). Or, we can use a generalized frequentist information criterion (FIC) [201, 231] by starting with a biased estimator for T

T \sim \log p(\mathbf{x}|\hat{\boldsymbol{\theta}}(\mathbf{x})) \qquad (74)

and quantifying its bias, \mathbb{E}_{\mathbf{x}}\mathbb{E}_{\mathbf{y}}\left[\log p(\mathbf{x}|\hat{\boldsymbol{\theta}}(\mathbf{y})) - \log p(\mathbf{x}|\hat{\boldsymbol{\theta}}(\mathbf{x}))\right].

4.3.5 Note on finite size effects

Both the AIC and BIC are asymptotic statements valid in the large N limit. In change-point methods, we have found that they may be reliable for modest N, even below 50. In the case of the BIC, for instance – where Laplace's approximation is invoked – this is not surprising since the terms ignored are exponentially, not polynomially, subdominant in N. For smaller N, higher order corrections to the AIC – an example of which is called the AICc – depending only on K and N, can be computed [189]. By contrast, higher order corrections to the BIC, to order N^0, explicitly capture features of the prior, and these corrections have previously been applied to the detection of change-points from FRET data [37].

4.3.6 Select AIC and BIC applications

Model selection criteria, whether frequentist (AIC) or Bayesian (BIC), deal directly with likelihoods of observing the data. Treating the data directly using likelihoods avoids unnecessary data processing such as histogramming or data reduction into moments, cumulants, correlation functions or other heuristic point statistics that are otherwise common in the physics literature when dealing with macroscopic systems.

It is especially useful to treat the data directly in single molecule data analysis where the data, on the one hand, are plentiful (because of high acquisition rates) but, on the other hand, show too few discrete events to build a reliable histogram to which a model could be fit [34, 212].

The BIC has been widely used in biophysics [31, 41, 59, 232, 233] to determine: the number of intensity states in single molecule emission time traces [214, 234]; the number of steps in photobleaching data [215]; and the stepping dynamics of molecular machines [216]. The AIC, in turn, has been used to determine the number of diffusion coefficients sampled from labeled lipophilic dyes on a surface from single molecule trajectories [213]; and the number of ribosomal binding states of tRNA from single molecule FRET trajectories [210]. In addition, both the AIC and BIC have provided complementary insight on a number of problems including: photon arrival time kinetic parameters (such as rates of emission for different fluorophore states for an unknown number of fluorophore states and transition kinetics between them) [212]. And, when compared head-to-head in identifying trapping potentials for membrane proteins relevant to single particle tracking [211], the AIC and BIC perform differently across potentials. This is because, by contrast to simple Brownian motion, confining radial potentials, such as V ∝ r^4, introduce nonlinear dynamics that are better approximated by complex models that are less heavily penalized by the AIC than by the BIC.

While the AIC and BIC have been useful, they are often used complementarily, specifically because they are treated as model selection heuristics. As a result, their conceptual difference – and why their penalties differ – is rarely addressed [204] and it is therefore common, though ultimately incorrect, to treat the complexity penalty (2K for the AIC or K log N for the BIC) as an adjustable form.

We end this section on model selection with a note on additional methods used in single molecule analysis that have been directly inspired by the BIC [37, 41, 80, 178, 235]. For instance, Shuang et al. [37] have proposed an MDL heuristic – a method called STaSI (Step Transition and State Identification) – not only to find steps but subsequently to identify states as well in time traces; see Fig. (10). The idea here is not only to penalize the number of change-points detected but also to discretize the number of intensity levels (states) sampled in an smFRET trajectory. STaSI has subsequently been used to explore equilibrium transitions among N-methyl-D-aspartate receptor conformations by monitoring the distance across the glycine-bound ligand binding domain cleft [218].

Figure 10: Identifying states can be accomplished while detecting steps. STaSI is applied to synthetic smFRET data. STaSI works by first iteratively identifying change-points in the data (successive steps shown by arrows in panel (a)). The mean of the data from change-point to change-point defines an intensity (FRET) state. An MDL heuristic is subsequently used to eliminate (or regroup) intensity levels (b). The MDL is plotted as a function of the number of states (c). The final analysis – with change-points and states identified – is shown in (d). For more details see Ref. [37].

4.3.7 Maximum Evidence: Illustration on HMMs

While we have previously discussed HMMs, we have not addressed the important model selection challenge they pose [35, 36, 236]. That is, it appears paradoxical to assume that, while we have no a priori knowledge of the HMM's model parameters, we have perfect knowledge of the underlying state-space. To address this challenge, it is possible to use a combination of maximum likelihood on the HMM and information criteria to select the number of states [31].

Fig. (11) illustrates a different strategy to infer the total number of states (K) from a time trace. Fig. (11) compares the number of states inferred from a synthetic FRET time trace using maximum likelihood and maximum evidence [36]. The maximum likelihood, p(y|θ*, K) – where θ* designates the parameter estimates that maximize the likelihood – increases monotonically with the number of states. By contrast, the maximum evidence – p(y|K), defined as the likelihood marginalized over all unknown parameter values – peaks at the theoretically expected number of states.

While the concept of maximum evidence is not limited to HMMs, here we use maximum evidence to illustrate model selection on HMMs.

Like the BIC, maximum evidence penalizes complexity by summing over all unknown parameter values, not all of which fit the data very well. So, to construct the evidence for our HMM example, we consider the joint likelihood for a sequence of observations, y, and states populated at each time interval, s,

p(\mathbf{y},\mathbf{s}|\boldsymbol{\theta},K) = \prod_{i=2}^{N} \left[ p(y_i|s_i,\boldsymbol{\theta},K)\, p(s_i|s_{i-1},K) \right] p(y_1|s_1,\boldsymbol{\theta},K)\, p(s_1|K) \qquad (75)



Figure 11: Maximum evidence can be used in model selection. a) For this synthetic time trace, maximum likelihood (ML) will overfit the data. This is clear from b), where it is shown that the log likelihood or probability of the model – evaluated at θ = θ* – increases monotonically as we increase the number of states, K. By contrast, maximum evidence (ME) – obtained by marginalizing the likelihood over θ – identifies the theoretically expected number of states, K = 3. Sample time traces are shown in (a) and the log probability is plotted in (b). See details in text and Ref. [36].


Figure 12: The number of diffusive states detected using maximum evidence can establish changes in interactions of Hfq upon treatment of E. coli cells with rifampicin. a) vbSPT analysis of tracking data for the RNA helper protein Hfq. Three distinct diffusive states are detected and sample trajectories are shown color-coded according to the state to which they belong. The kinetic scheme shows the diffusion coefficient in each state as well as the transition rates between diffusive states. b) When cells are treated with a transcription inhibitor (rif), vbSPT finds that the slowest diffusive state vanishes, suggesting that the slowest diffusive state of Hfq was related to an interaction of Hfq with RNA. Throughout the figure, ∆t corresponds to a 300 Hz acquisition rate. The scale bar indicates 0.5 µm. See details in Ref. [35].


where θ denotes a vector of parameters: the K-dimensional initial probability vector of states, π ≡ p(s_1|K); the K-dimensional observation parameters, such as means, μ, and standard deviations, σ, for each state assuming Gaussian p(y_i|s_i, θ, K); and the K × K matrix, A, of transition matrix elements a_{ij} = p(s_j|s_i, K). Contrary to Eq. (9), we have made all parameter dependencies of the probabilities contained in Eq. (75) explicit.
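As a point of reference for the marginalization below, the inner likelihood p(y|θ, K) = Σ_s p(y, s|θ, K) of Eq. (75) can be evaluated efficiently with the standard forward algorithm. The following is a minimal sketch (names and interface are ours) for Gaussian emissions:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood_hmm(y, pi, A, mu, sigma):
    """log p(y|theta,K): the joint of Eq. (75) summed over all state
    sequences s via the forward algorithm, in log space for stability."""
    y = np.asarray(y, dtype=float)
    log_emis = norm.logpdf(y[:, None], loc=mu, scale=sigma)  # shape (N, K)
    log_alpha = np.log(pi) + log_emis[0]                     # t = 1
    for t in range(1, len(y)):
        m = log_alpha.max()
        # alpha_t(k) = [sum_j alpha_{t-1}(j) A_{jk}] * p(y_t|s_t = k)
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A) + log_emis[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```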

The evidence then follows from Eq. (75)

p(\mathbf{y}|K) = \sum_{\mathbf{s}} \int d\boldsymbol{\theta}\; p(\mathbf{y},\mathbf{s}|\boldsymbol{\theta},K)\, p(\boldsymbol{\theta}|K) \qquad (76)

where the prior is p(θ|K) = p(π|K) p(A|K) p(μ, σ|K); an example of how this prior is selected is given in Ref. [36].

A numerical variational Bayesian (vb) procedure called vbFRET [36] was implemented to evaluate the maximum evidence, with sample results shown in Fig. (11). The method has gained traction because it learns the number of states by comparing the probability of observation given different values of K, p(y|K) [185, 237–239]. The interested reader should refer to a general discussion of variational approximations in Ref. [84].

Using a method similar to vbFRET, maximum evidence applied to an HMM was used to extract the number of diffusion coefficients – "diffusive states" – sampled by intracellularly diffusing proteins, as well as the transition rates describing the hopping kinetics between the diffusive states [35]. Using single particle tracking (SPT) data, the method was implemented using a variational Bayesian procedure called vbSPT and applied to infer the diffusive dynamics of an RNA helper protein, Hfq, mediating the interaction between small regulatory RNAs and their mRNA targets [35]. Fig. (12) details the analysis of the Hfq tracking done on a control E. coli cell and one treated with a transcription inhibitor [rifampicin (rif)]. Fig. (12b) shows the disappearance of the slow diffusion component for treated cells, suggesting that this sluggish component was associated with Hfq-RNA interactions in the untreated cell.

Finally, the methods above can be generalized – using a method called variational Bayes HMM for time-stamped FRET (VB-HMM-TS-FRET) – to treat time-stamped photon arrivals (as opposed to assuming data binned in intervals ∆t) as well as time-dependent rates [238].

5 An Introduction to Bayesian Nonparametrics

We have already seen how flexible (nested) models – models that can be refined by the addition or removal of parameters – were critical to change-point analysis and to the enumeration of states in HMMs. Likewise, we have also seen how free-form probability distributions, {p_1, · · · , p_K}, could be inferred from MaxEnt even if K, the number of parameters, largely exceeded the number of measured data points. Inferring a large number of parameters, i.e. a distribution on a fine grid, is useful even if many probabilities inferred from MaxEnt have small numerical values. For example, these probabilities may predict the relative weight of sampling a biologically relevant albeit unusual protein fold – as compared to its native conformation – based on free energy estimates alone, even if such conformations are highly unlikely.

These previous treatments went beyond parametric modeling, where models have a given mathematical structure with a fixed number of parameters, such as Gaussians with means and variances.

While the maximum evidence methods we presented earlier provided model probabilities (marginal likelihoods p(y|K)) for a fixed number of states or parameters (K), in this section we investigate the possibility of averaging over all acceptable K to find posteriors p(θ|y).

These posteriors – p(θ|y), obtained by averaging over all possible starting models – are the purview of Bayesian nonparametrics, a reasonably new (1973) approach to statistical modeling [240]. Bayesian nonparametrics are poised to play an important role in the analysis of single molecule data since so few model features – such as the number of states of a single molecule in any time trace – are known a priori.


Contrary to their name, nonparametric models are not parameter-free [240]. Rather, they have an a priori infinite number of parameters that are subsequently winnowed down – or, more precisely as we will see, selectively sampled – by the available data [240–243]. This large initial model-space attempts to capture all reasonable starting hypotheses [244] and avoids potentially computationally costly model selection and model averaging [245]. In other words, nonparametric models let the model complexity adapt to the information provided by the raw data and can efficiently and rigorously promote sparse models through the priors considered.

5.1 The Dirichlet process

An important object in Bayesian nonparametrics is the prior process, and the most widely used process is the Dirichlet process (DP) prior [240]. Much, though not all, of Bayesian nonparametrics relies on generalizations of the DP and its representations; see the first graph of Ref. [246]. These representations include the infinite limit of Gibbs sampling for finite mixture models, the Chinese restaurant process and the stick-breaking construction [243, 247]. We will later discuss the stick-breaking construction.

Samples from a DP are themselves distributions, much like the samples from the exponential of the entropy that we saw earlier are distributions as well. Density estimation [248] and clustering [249] are natural applications of the DP.

To introduce the DP, we start with a parametric example and first consider a probability of outcomes indexed by k, π = {π_1, π_2, · · · , π_K} – with Σ_k π_k = 1 and π_k ≥ 0 for all k – distributed according to a Dirichlet distribution, π ∼ Dirichlet(α_1, · · · , α_K). That is, with a distribution over π given by

p(\boldsymbol{\pi}|\boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k-1}. \qquad (77)

The Dirichlet distribution is conjugate to the multinomial distribution. Thus, a sequence of observations, z = {z_1, z_2, · · · , z_N}, is distributed according to a multinomial having K unique bins with populations n = {n_1, n_2, · · · , n_K}

p(\mathbf{z}|\boldsymbol{\pi}) = \frac{\Gamma\!\left(\sum_k n_k + 1\right)}{\prod_k \Gamma(n_k+1)} \prod_{k=1}^{K} \pi_k^{n_k}. \qquad (78)

The resulting posterior obtained from the prior, Eq. (77), and likelihood, Eq. (78), is

p(\boldsymbol{\pi}|\mathbf{z},\boldsymbol{\alpha}) = \frac{p(\mathbf{z}|\boldsymbol{\pi})\,p(\boldsymbol{\pi}|\boldsymbol{\alpha})}{\int d\boldsymbol{\pi}\; p(\mathbf{z}|\boldsymbol{\pi})\,p(\boldsymbol{\pi}|\boldsymbol{\alpha})} = \frac{\Gamma\!\left(\sum_k n_k + \sum_k \alpha_k\right)}{\prod_k \Gamma(n_k+\alpha_k)} \prod_{k=1}^{K} \pi_k^{n_k+\alpha_k-1}. \qquad (79)

Now, imagine a sequence of N − 1 observations, {z_1, z_2, · · · , z_{N−1}}. Using the posterior above, we can calculate the probability of adding an observation to a pre-existing cluster, j, with probability π_j, given the occupations {n_1, · · · , n_{K−1}} of all K − 1 pre-existing clusters.

For simplicity, we assume all α_k identical and equal to α/K. Then

p(z_N = j|\{z_1,\cdots,z_{N-1}\},\alpha) = \int d\boldsymbol{\pi}\; p(z_N = j|\boldsymbol{\pi})\, p(\boldsymbol{\pi}|\{z_1,\cdots,z_{N-1}\},\alpha) = \int d\boldsymbol{\pi}\; \pi_j\, p(\boldsymbol{\pi}|\{z_1,\cdots,z_{N-1}\},\alpha). \qquad (80)

We can evaluate both: i) the probability that our observation populates an existing cluster with n_j members; or ii) the probability that our observation populates a new cluster. In the non-parametric limit – where we allow an a priori infinite number of clusters (K → ∞) – these probabilities are [246]

\frac{n_j}{\alpha+N-1} \quad \text{vs.} \quad \frac{\alpha}{\alpha+N-1} \qquad (81)


respectively. Thus α – predictably called a concentration parameter – measures the preference for creating a new cluster. The DP therefore tends to populate clusters according to the number of their current members.
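The predictive probabilities of Eq. (81) can be simulated directly by assigning observations one at a time – the Chinese restaurant process representation mentioned above. A sketch (the function name is ours):

```python
import numpy as np

def crp_sample(N, alpha, seed=0):
    """Sequentially assign N observations to clusters using Eq. (81):
    join cluster j with prob n_j/(alpha + n - 1), or open a new cluster
    with prob alpha/(alpha + n - 1)."""
    rng = np.random.default_rng(seed)
    z, counts = [0], [1]            # first observation seeds cluster 0
    for n in range(2, N + 1):
        probs = np.array(counts + [alpha]) / (alpha + n - 1)
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):
            counts.append(1)        # new cluster
        else:
            counts[j] += 1          # existing cluster grows ("rich get richer")
        z.append(j)
    return np.array(z), counts

labels, sizes = crp_sample(1000, alpha=2.0)
print(len(sizes), sizes[:5])        # occupied clusters; early ones tend to be largest
```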

The DP is the infinite dimensional (K → ∞) generalization of the Dirichlet distribution and describes the distribution over π or, equivalently, the distribution over G defined as

G \equiv \boldsymbol{\pi}\cdot\boldsymbol{1} = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k} \qquad (82)

where the delta, δ_{θ_k}, denotes a point mass at the parameter value θ_k [247, 250].

The θ_k themselves are iid (independent and identically distributed) samples from a base distribution H (e.g. a Gaussian). In other words, the base distribution parametrizes the density from which the θ_k are sampled, i.e.

G \sim DP(\alpha, H)
\theta_k \sim H. \qquad (83)

Thus as α → ∞, we have G → H. The idea is to use H as the hypothetical parametric model we would have started from and to use α to relax this assumption [243].

The stick-breaking construction, which we mentioned earlier, is a representation of the DP that can be implemented. If we follow Ref. [246]

v_k \sim \mathrm{Beta}(1,\alpha)
\pi_k = v_k \prod_{j=1}^{k-1} (1-v_j)
\theta_k \sim H
G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k} \qquad (84)

then, without proof, we find that G ∼ DP (α,H).

The analogy to stick-breaking follows from the steps given in Eq. (84). We begin with a stick of unit length and break the stick at a location, v_1, sampled from a Beta distribution, v_1 ∼ Beta(1, α). We assign π_1 = v_1. The remainder of the stick has length (1 − v_1). The atom θ_1 associated with π_1 is sampled from H. We then reiterate to determine π_2.

The π_k sampled according to the stick-breaking construction are decreasing on average, but not monotonically so. In practice, the procedure is terminated when the remaining stick is below a predesignated threshold. In the statistics literature, it is said that π is sampled according to π ∼ GEM(α), where GEM stands for Griffiths-Engen-McCloskey [251].
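Eq. (84) translates almost line-for-line into a truncated sampler for G ∼ DP(α, H). In the sketch below (names ours), H is passed in as a sampling function and the loop stops once the remaining stick drops below a threshold, as described above:

```python
import numpy as np

def stick_breaking(alpha, base_sampler, tol=1e-6, seed=1):
    """Truncated draw of (weights pi_k, atoms theta_k) following Eq. (84)."""
    rng = np.random.default_rng(seed)
    pis, thetas, remaining = [], [], 1.0
    while remaining > tol:
        v = rng.beta(1.0, alpha)          # v_k ~ Beta(1, alpha)
        pis.append(v * remaining)         # pi_k = v_k * prod_{j<k}(1 - v_j)
        thetas.append(base_sampler(rng))  # theta_k ~ H
        remaining *= 1.0 - v              # unbroken remainder of the stick
    return np.array(pis), np.array(thetas)

# e.g. a standard-Gaussian base distribution H
weights, atoms = stick_breaking(alpha=2.0, base_sampler=lambda r: r.normal())
```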

5.2 The Dirichlet process mixture model

We have used the DP, thus far, to generate discrete sample distributions, G. In order to treat continuous random variables, like y, we generalize our treatment and introduce the Dirichlet process mixture model (DPMM) [252, 253], where a continuous parametric distribution, F, is convolved with G, given by Eq. (82),

y \sim \int G(\theta)\, F(y|\theta)\, d\theta = \sum_{k=1}^{\infty} \int \pi_k \delta_{\theta_k} F(y|\theta)\, d\theta = \sum_k \pi_k F(y|\theta_k). \qquad (85)


Figure 13: DPMMs can be used in deconvolution. a) A density generated from N = 500 data points from the mixture of four exponential components. b) After fewer than 200 MCMC iterations, the DPMM has converged to four mixture components. c) The marginal distribution of the parameter for each mixture component is shown, with the red line indicating the theoretical value used to generate the synthetic data (0.001, 0.01, 0.1, 10). See Ref. [133] and main body for more details.

In other words [254]

y|\theta_k \sim F(y|\theta_k)
\theta_k \sim H
G \sim DP(\alpha, H).

The full posterior we want to determine is now

p(\boldsymbol{\theta}|\mathbf{y}) \propto \int d\boldsymbol{\pi}\; p(\mathbf{y}|\boldsymbol{\theta},\boldsymbol{\pi},F)\, p(\boldsymbol{\theta}|H)\, p(\boldsymbol{\pi}|\alpha). \qquad (86)

where, to be explicit, π ∼ GEM(α). In practice, in order to sample from this posterior, we must first sample the distribution G; then, given this G, we construct the mixture model involving F. Then, to sample y, we must determine from which mixture component, k, the random variable y was selected and subsequently sample y from the designated F(y|θ_k). We must then repeat over multiple G's.
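This generative recipe can be sketched in a few lines by reusing the stick_breaking sampler above; F is supplied as a function drawing y given θ_k (again, all names and numerical choices are ours, and the truncated weights are renormalized):

```python
import numpy as np

def sample_dpmm(n, alpha, base_sampler, F_sampler, seed=2):
    """Draw n observations from the DPMM of Eq. (85): sample a truncated
    G ~ DP(alpha, H), pick component k with probability pi_k, then draw
    y ~ F(y|theta_k)."""
    rng = np.random.default_rng(seed)
    pis, thetas = stick_breaking(alpha, base_sampler, seed=seed)
    pis /= pis.sum()                          # renormalize the truncation
    ks = rng.choice(len(pis), size=n, p=pis)  # mixture component per draw
    return np.array([F_sampler(thetas[k], rng) for k in ks])

# e.g. exponential components, F(y|theta) = theta * exp(-theta * y),
# with rates theta_k drawn from a Gamma base distribution H
y = sample_dpmm(500, alpha=1.0,
                base_sampler=lambda r: r.gamma(2.0, 1.0),
                F_sampler=lambda theta, r: r.exponential(1.0 / theta))
```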

Many specific Markov chain Monte Carlo (MCMC) methods, such as the simple approach above – called Gibbs sampling – are discussed for the DPMM in Ref. [254]. A more general discussion on sampling from posteriors using MCMC can be found in Ref. [84].

Fig. (13) illustrates the ability of DPMMs to infer rates from a density generated by sampling 500 data points, y, from y ∼ Σ_i π_i e^{−yθ_i} with four exponential components [133]. The DPMM then tries to determine how many components there were and correctly converges to four mixture components after fewer than 200 MCMC iterations, with values for the rates closely matching those used to generate the synthetic data; see Fig. (13) for more details.

In comparison, the MaxEnt deconvolution methods applied to exponential processes discussed earlier would have generated a very large number of components (since MaxEnt generates smooth distributions), typically tightly centered at the correct values for the rates at low levels of noise in the data. The DPMM, on the other hand, converges to four discrete components with broader distributions over the rates instead.



Figure 14: iHMM Graphical Model [257].

5.3 Dirichlet processes: An application to the infinite hidden Markov model

As we mentioned earlier, one important challenge with HMMs is their reliance on a predefined number of states. To overcome this challenge, we may use the DP from which to sample the HMM transition matrix, p(s_t|s_{t−1}) [247, 255, 256]. That is, the prior probability of starting from s_{t−1} and transitioning to any of an infinite number of states is sampled from a DP. This is the idea behind the infinite hidden Markov model (iHMM), which we now discuss.

The most naive formulation of the iHMM – one that samples from a DP the prior probability of the final state ℓ = s_t from a given initial state m = s_{t−1}, p(ℓ|m) – does not sufficiently couple states. In other words, the state ℓ is preferentially revisited under the DP if transitions from m to ℓ have already occurred. But, because at every time step m is itself a new state, under such a DP prior the same states are never revisited.

To address this problem, the hierarchical DP (HDP) is used [247]. Briefly, under an HDP, we have

G \sim DP(\alpha, H)
H \sim \mathrm{GEM}(\gamma)

where γ is a hyperparameter that plays the role of a concentration parameter on the prior of the base distribution of the DP. The HDP enforces that the probability of s_t starting from m = s_{t−1} is sampled from a DP whose base distribution is common amongst all transition probabilities. We summarize iHMMs as follows [236]

H \sim \mathrm{GEM}(\gamma)
\pi_k \sim DP(\alpha, H)
\phi_k \sim P
s_t|s_{t-1} \sim \mathrm{Multinomial}(\pi_{s_{t-1}})
y_t|s_t \sim F(y|\phi_{s_t})

where π_k are the transition probabilities out of state k, F(y|φ_{s_t}) describes the probability of observing y_t under the condition that we are in state s_t, and P is a prior distribution over observation parameters. A graphical model illustrating the parameter inter-dependencies is shown in Fig. (14). While the parameters of an iHMM can be inferred using Gibbs sampling [84], recent methods have been developed [71, 257], for example, to increase the computational efficiency of implementing the iHMM by limiting the number of states sampled at each time point [257].
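To make the hierarchy tangible, here is a truncated generative sketch of the scheme above (the truncation level, Gaussian emission model, and all names are assumptions of ours; a finite Dirichlet with concentrations αβ stands in for each DP(α, β) transition row):

```python
import numpy as np

def sample_ihmm(T, alpha, gamma, K_trunc=20, seed=4):
    """Generate a state path s and trace y from a truncated iHMM."""
    rng = np.random.default_rng(seed)
    # shared base weights beta ~ GEM(gamma), truncated at K_trunc atoms
    v = rng.beta(1.0, gamma, size=K_trunc)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()
    # each transition row pi_k ~ DP(alpha, beta), here a finite Dirichlet
    pi = rng.dirichlet(alpha * beta, size=K_trunc)
    phi = rng.normal(0.0, 5.0, size=K_trunc)        # emission means, phi_k ~ P
    s = np.zeros(T, dtype=int)
    s[0] = rng.choice(K_trunc, p=beta)
    for t in range(1, T):
        s[t] = rng.choice(K_trunc, p=pi[s[t - 1]])  # Multinomial(pi_{s_{t-1}})
    y = rng.normal(phi[s], 1.0)                     # y_t ~ F(y|phi_{s_t})
    return s, y
```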



Figure 15: iHMMs can learn the number of states from a time series. iHMMs not only parametrize transition probabilities, as normal HMMs do; they also learn the number of states in the time series [133]. Here they have been used to find the number of states for a) ion (BK) channels in patch clamp experiments [with downward current deflections indicating channel openings]; b) conformational states of an agonist-binding domain of the NMDA receptor.

Most recently, the potential of iHMMs in biophysics has been illustrated by applying them to parameter determination and model selection in single and multichannel electrophysiology (Fig. (15a)) and smFRET (Fig. (15b)), as well as single molecule photobleaching [133]. The colors in both time traces shown in Fig. (15) denote the different states that the system visits over the course of the time trace. While most transitions can be detected by eye in the first time trace, the second time trace demonstrates the potential of iHMMs to go beyond what is possible by visual inspection. There remain some clear challenges. For instance, it is conceivable that the iHMM's willingness to introduce new states may over-interpret drift, say, in a time trace as the population of new states.

6 Information Theory: State Identification and Clustering

Single-molecule time-series measurements are often modeled using kinetic, or state-space, networks. As we have already seen, HMMs presuppose a kinetic model and – once all parameters are determined – assign a probability of being in a particular state along a time series. Other methods, such as DPMMs and iHMMs, are more flexible and do not start with fixed kinetic models a priori. Information theory, by contrast, provides an alternative to non-parametric methods for identifying probabilities of states (thought of as data "clusters") populated along time series that does not rely on prior processes.

Here we focus on information theoretic clustering methods [258–260] that follow directly from Shannon's rate-distortion theory (RDT) [64].

6.1 Rate-Distortion Theory: Brief outline

Shannon [64] conceived of rate-distortion theory (RDT) to quantify the amount of information that should be sent across noisy communication channels in order to convey a message within a set error margin. Although RDT was developed for both continuous as well as discrete transmissions, here – in the interest of single-molecule data analysis and for the sake of brevity – we will restrict our discussion to discrete signals.

Shannon considered a transmission (a message) consisting of discrete signals from a source to a recipient. For example, if the message were intended to convey words in the English language, then each discrete signal would transmit a letter of the Latin alphabet. However, the letters comprising the words being transmitted cannot be sent directly; the communication channel requires that the transmitted information be encoded. That is, the set of letters being transmitted needs to be transformed into a set of codes that represent the letters being sent. For example, the letter A could be encoded by a set of binary symbols such as 0000. After transmission, the encoded signals are collected by the recipient and are then translated back into their representation in the original set of symbols. In this example, the average length of the binary sequences that encode the letters being sent corresponds roughly to the "rate" (of information), and the potential for misinterpreting the encoded sequence by the recipient corresponds to the "distortion."


Put differently, the rate quantifies the amount of information about the intended message that is being transmitted across the channel. For example, consider encoding the letter A in two ways: as a single binary character, 0, or as a binary sequence, 00. Because the single binary character is of shorter length than the two-character sequence, it contains less information about the intended transmission (the letter A) than the longer sequence. Increasing the length of the encoded representation of the intended message thus increases the amount of information being transmitted.

The distortion, on the other hand, quantifies the potential for misinterpreting a transmission. For example, the encoded message may be transmitted as A = 00, but noise on the channel distorts the transmission, resulting in the signal being interpreted as 01, which may coincide with another letter, say B.

As a rudimentary example, consider a one-word message, "kangaroo", being transmitted from a source to a recipient as a set of binary symbol groups, with each group representing a single letter of the alphabet. The intended message, "kangaroo", is first transformed at the source from the letters of the alphabet to a sequence of binary symbol groups, which are then transmitted to the recipient and decoded (translated) back into letters from the message.

Because the message will most likely be understood even if one or two letters are misinterpreted in the decoding process, some small level of distortion may be acceptable, e.g. "kongaroo" versus "kangaroo". On the other hand, if several letters of the message received are misinterpreted, then the correct translation of the intended message is unlikely. This latter situation is undesirable, and may be remedied by increasing the lengths of the binary sequences that encode the letters – i.e. by increasing the rate of information – thereby increasing the probability of correctly decoding the intended message and decreasing the level of distortion.

Then, given a level of acceptable distortion, such as one in eight letters ("kongaroo" vs. "kangaroo"), we may determine how short the binary symbol groups can be while still accurately conveying the intended message. Finding this optimal code length is the subject of RDT.

More formally, RDT poses the following question: what is the minimum rate of information required to convey the intended message at the desired level of distortion? RDT's main result is that a lower bound on the rate of information is provided by the mutual information between the set of possible transmissions at the source and the set of observations at the recipient. In our example above, this quantity is the mutual information between the letters of the alphabet and the set of binary sequences that encode each letter. The minimum rate of information is then obtained by minimizing this mutual information given an acceptable level of distortion.

6.1.1 RDT and data clustering

We now discuss the relationship between RDT and data clustering. Data clustering is the grouping of the elements of a data set into subsets, i.e. clusters, containing elements that have similar properties. This is often accomplished through the minimization of an average statistical distance between the elements assigned to particular clusters and their centers by, for example, a method called k-means clustering [84]. Since there are typically fewer clusters than there are elements in the data set, clustering is a form of compression. In other words, data clustering seeks to compress the data with respect to some statistical distance.

In RDT, minimization of the rate of information is also a form of compression. In effect, by minimizing the rate we are minimizing the length of the encoded message to be transmitted, thereby compressing the message. This compression is not performed directly, however, but with respect to a desired (or, as we will discuss, observed) level of distortion. For an infinite compression, there could be substantial distortion. Likewise, if every element belongs to its own cluster, the distortion vanishes.

By contrast to "hard" partitioning algorithms – where probabilities of elements belonging to clusters are restricted to 0 or 1 – soft clustering algorithms allow elements to exist across all clusters with some probability of membership assigned to each cluster. These more general algorithms are particularly useful for cases where clusters may overlap. This is especially advantageous in the context of high-noise single-molecule measurements and allows estimation of errors associated with any parameter extracted from the experimental data. For instance, the smFRET example that we will explore later will have two important sources of error: empirical error (photon counting and fluctuations in the irradiance intensity) and sampling error (the number of data points is finite).

Since, as we will see in the next section, RDT clustering directly returns conditional probabilities that each data point belongs to each cluster [46, 260], soft partitioning is directly built into the RDT framework.

6.1.2 RDT clustering: Formalism

RDT returns conditional probabilities, p(C_k|s_i), of cluster k, selected from a set of n clusters C = {C_1, · · · , C_n}, given observation i selected from the set of N observations s = {s_1, · · · , s_N}.

As discussed above, the rate is the average amount of information needed to specify an observation s_i within the set of clusters C, computed as the mutual information between the set of clusters C and the observations s as follows

I(C,\mathbf{s}) = \sum_{k=1}^{n} \sum_{i=1}^{N} p(C_k|s_i)\, p(s_i) \log \frac{p(C_k|s_i)}{p(C_k)}. \qquad (87)

By minimizing the rate, we maximize the compression. However, if the minimization of the rate is not bounded, then we will over-compress and any information that the clusters contain on the observations will vanish (i.e. I(C, s) → 0). Our minimization of the rate thus needs to be informed, or constrained, by another quantity. This quantity is the mean distortion among the observations within the set of clusters C, ⟨D(C, s)⟩, defined as the average of the pairwise distortions between all pairs of observations in s [46, 258]

\langle D(C,\mathbf{s}) \rangle = \sum_{k=1}^{n} p(C_k) \sum_{i,j=1}^{N} p(s_i|C_k)\, p(s_j|C_k)\, d_{ij}. \qquad (88)

The pairwise distortion between two observations, d_{ij}, is a measure of the dissimilarity between them, and its choice is problem-specific. For example, the dissimilarity between two probability (mass or density) functions can be measured as the area between their respective cumulative distribution functions, which corresponds to a metric known as the Kantorovich distance [261].
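For two pmfs defined on a common, evenly spaced grid, this area between cdfs reduces to a one-line computation (a sketch; the function name is ours):

```python
import numpy as np

def kantorovich(p, q, bin_width=1.0):
    """Kantorovich distance between pmfs p and q on the same grid:
    the area between their cumulative distribution functions."""
    return bin_width * np.abs(np.cumsum(p) - np.cumsum(q)).sum()
```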

To obtain the minimum rate of information – constrained by the mean distortion – we define an objective function to be minimized

I(C,\mathbf{s}) + \beta \langle D(C,\mathbf{s}) \rangle \qquad (89)

where β is a Lagrange multiplier that controls which term is favored in the minimization. A small value of β favors minimization of the rate over the distortion and, thus, high compression. Conversely, large values of β will cause the minimization to favor the distortion, returning a less compressed clustering result in which the clusters contain a relatively large amount of information about the set of observations.

The formal solution to the minimization of Eq. (89) with respect to the conditional probabilities p(C_k|s_i) is a Boltzmann-like distribution [260]

p(C_k|s_i) = \frac{p(C_k)}{Z(s_i,\beta)} \exp\!\left[-\beta \sum_{j=1}^{N} p(s_j|C_k)\, d_{ij}\right]. \qquad (90)

The conditionals p(s_i|C_k) of Eq. (90) are obtained, in turn, using Bayes' formula; p(C_k) is the marginal probability of the cluster C_k

p(C_k) = \sum_{i=1}^{N} p(s_i)\, p(C_k|s_i) \qquad (91)


Figure 16: A soft clustering algorithm based on RDT is used to determine states from smFRET trajectories. a) A crystal structure of the AMPA ABD. The green and red spheres represent the donor and acceptor fluorophores, respectively. b) Detection of photons emitted in an smFRET experiment. c) An experimental smFRET trajectory obtained by binning the data in (b). d) Probability mass functions (pmfs) of the blue and red segments highlighted in (c). e) Cumulative distribution functions (cdfs) of the highlighted segments in (c). The shaded area represents the Kantorovich distance. f) Visual representation of the clusters in (c) based on multidimensional scaling. g) Transition disconnectivity graph (TRDG) resulting from the trajectory in (c). See details in text and Ref. [46].


and Z(si, β) of Eq. (90) is the normalization [260]

Z(s_i, \beta) = \sum_{k=1}^{n} p(C_k) \exp\left[-\beta \sum_{j=1}^{N} p(s_j|C_k)\, d_{ij}\right].   (92)

As can be seen from Eqns. (90)-(91), the conditional probabilities p(Ck|si) and the marginal probabilities p(Ck) must be self-consistent. This self-consistency is exploited to obtain numerical solutions to the variational problem through an iterative procedure (Blahut-Arimoto algorithm) [262, 263]. The procedure begins by randomly initializing each of the conditionals p(Ck|si), followed by normalization over C, and continues with iterative calculations of the marginals and conditionals, via Eqns. (90)-(91), until the objective function, Eq. (89), has converged. In practice, this algorithm may converge to a local, rather than the global, minimum, requiring that it be re-initialized multiple times with random seeds.
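A minimal numpy sketch of this iteration follows (function and variable names are ours; we assume a uniform p(si) and take the N × N pairwise distortion matrix d as input). It alternates the updates of Eqs. (90)-(92) and monitors the objective of Eq. (89):

    import numpy as np

    def rdt_cluster(d, n_clusters, beta, n_iter=500, tol=1e-10, seed=0):
        """Soft RDT clustering via a Blahut-Arimoto-style iteration of
        Eqs. (90)-(92); d is the N x N pairwise distortion matrix."""
        rng = np.random.default_rng(seed)
        N = d.shape[0]
        p_s = np.full(N, 1.0 / N)                  # p(s_i), uniform here
        p_c_given_s = rng.random((N, n_clusters))  # random init of p(C_k|s_i)
        p_c_given_s /= p_c_given_s.sum(axis=1, keepdims=True)
        obj_old = np.inf
        for _ in range(n_iter):
            p_c = p_s @ p_c_given_s                # Eq. (91): marginals p(C_k)
            # Bayes' formula: p(s_j|C_k) = p(C_k|s_j) p(s_j) / p(C_k)
            p_s_given_c = p_c_given_s * p_s[:, None] / p_c[None, :]
            # effective distortion of s_i in C_k: sum_j p(s_j|C_k) d_ij
            d_eff = d @ p_s_given_c
            # Eq. (90): Boltzmann-like update; Eq. (92) is the normalization
            log_p = np.log(p_c[None, :]) - beta * d_eff
            log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
            p_c_given_s = np.exp(log_p)
            p_c_given_s /= p_c_given_s.sum(axis=1, keepdims=True)
            # objective of Eq. (89): rate I(C, s) plus beta times <D(C, s)>
            p_c = p_s @ p_c_given_s
            rate = np.sum(p_s[:, None] * p_c_given_s *
                          np.log(p_c_given_s / p_c[None, :] + 1e-300))
            p_s_given_c = p_c_given_s * p_s[:, None] / p_c[None, :]
            mean_d = p_c @ np.einsum('ik,jk,ij->k', p_s_given_c, p_s_given_c, d)
            obj = rate + beta * mean_d
            if abs(obj_old - obj) < tol:
                break
            obj_old = obj
        return p_c_given_s, obj

In practice one re-runs rdt_cluster from several random seeds and retains the solution with the smallest converged objective, since, as noted above, the iteration may stall in a local minimum.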

6.1.3 RDT analysis on sample data

Both β and the number of clusters are inputs to RDT. We can use the distortion and rate returned by RDT as model selection tools to choose these inputs.

To do so, we may first obtain an appropriate estimate of the amount of distortion arising from errors in the data. The errors must be treated carefully on a case-by-case basis; this is detailed for the case of smFRET data in Ref. [46].

Once this estimate is obtained, it is used as a benchmark against which to compare the distortion, 〈D(C, s)〉, arising from other models (i.e. solutions obtained with different numbers of clusters and/or values of β). Any model having 〈D(C, s)〉 less than this benchmark is retained for model complexity comparison.

Model complexity is then assessed by comparing the values of the mutual information I(C, s) arising from each model. Specifically, the model satisfying the distortion criterion having the smallest value of I(C, s) is the least complex model retained for further analysis. Details are provided in Ref. [46].
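In code, and with hypothetical bookkeeping names, this two-step rule reduces to a filter on the distortion followed by a minimization over the rate:

    # Schematic only: each candidate solution (one per choice of
    # n_clusters and beta) carries its rate I(C, s) and mean distortion.
    def select_model(candidates, distortion_benchmark):
        """Keep models whose distortion beats the error benchmark,
        then return the least complex one (smallest rate)."""
        admissible = [c for c in candidates
                      if c['distortion'] < distortion_benchmark]
        return min(admissible, key=lambda c: c['rate'])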

Fig. (16) sketches key steps in using RDT to identify states (i.e. clusters) of an smFRET trajectory for a protein domain (AMPA ABD) that we will discuss shortly. Briefly, we segment the trajectory into “elements”. Our goal is to cluster these elements, shown as short stretches of data in Fig. (16c). Fig. (16d) shows the probability mass functions (pmfs) of these elements, while Fig. (16e) illustrates the Kantorovich distance that we use in our distortion. We subsequently use multidimensional scaling [264] to map clusters into two dimensions (Fig. (16f)). This gives us visual insight into cluster overlap as well as the breadth of the conditional distributions p(Ck|si) detailed in Ref. [46].

Fig. (16g) captures another useful representation of the conformational space beyond clusters: the free energy landscape. A free energy landscape depicts the conformational motions of proteins (or their domains), such as the AMPA ABD that we will discuss shortly, as diffusion on a multidimensional free energy surface, with conformational states represented as energy basins on the landscape and the transition times characterized by the heights of the energy barriers among the set of basins [265, 266]. Because conformational motion in smFRET experiments is projected onto a 1-dimensional coordinate, we approximate the free energy landscape with a transition disconnectivity graph (TRDG) [265, 266]. A simple, 3-state TRDG is shown in Fig. (16g). The nodes represent relative free energies of the conformational states while the horizontal lines represent free energies at the barriers for transitions among the conformational states. Briefly, the TRDG is constructed by first identifying the slowest transition (i.e. the highest energy barrier) between two disjoint sets in the network [46]. Each subsequently faster transition between disjoint sets is then identified, resulting in the branching structure of the TRDG detailed in Refs. [46, 265, 266].
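For intuition only – the full construction detailed in Refs. [46, 265, 266] operates on the transition network itself – the branching heights of a TRDG can be sketched with a single-linkage hierarchy: two sets of states merge at the lowest barrier on the best path connecting them, a minimax criterion. A minimal sketch, assuming a hypothetical symmetric matrix of barrier free energies:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Hypothetical 3-state example: symmetric barrier free energies
    # (kcal/mol) between pairs of states; diagonal entries are zero.
    barriers = np.array([[0.0, 1.2, 2.5],
                         [1.2, 0.0, 1.8],
                         [2.5, 1.8, 0.0]])

    # Single linkage merges sets of states at the lowest barrier connecting
    # them, i.e. the minimax path barrier that sets the TRDG branch point.
    merge_tree = linkage(squareform(barriers), method='single')
    print(merge_tree)  # each row: (set A, set B, merge barrier, set size)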


6.1.4 Application of RDT to single molecule time-series

We apply RDT clustering to extract kinetic models from smFRET time traces [46, 267, 268]. In the example we now discuss, smFRET monitors the conformational dynamics of binding domains of a single AMPA receptor, an agonist-mediated ion channel prevalent in the central nervous system. In the context of RDT, each smFRET trajectory is viewed as a noisy message received from the source, i.e. the underlying conformational network. RDT is used to decode the message sent by the source and to classify intervals along the time trace into underlying clusters (conformational states).

AMPA receptors are among the most abundant ion-channel proteins in the central nervous system [267]. They are composed of extracellular N-termini and agonist binding domains (ABDs) [involved in ion channel activation [267]], transmembrane domains, and intracellular C-terminal domains. They are agonist-mediated ion channels: interaction of the ABDs with neurotransmitters – such as glutamate – induces conformational motion in the protein which, in turn, triggers the activation of ion transmission through the cellular membrane.

X-ray crystal structures [269] show that the ABD has two lobes that form a cleft containing the agonist-binding site [46]. X-ray studies also suggest that the degree of cleft closure controls the activation of the ion channel [269], with a closed cleft corresponding to an activated channel, but exceptions to this conjecture exist [270]. Molecular dynamics simulations [271, 272] further suggest that the ABD is capable of conformational motion even when bound to the full agonist glutamate, and that conformational fluctuations are increased in the absence of a bound agonist. What is more, smFRET studies of the apo and various agonist-bound forms of the AMPA ABD support this theoretical result [267] and, together, suggest that the activation mechanism is more complex. In order to gain deeper insight into this allosteric mechanism, we used RDT clustering along with time series segmentation and energy landscape theory to analyze the data [46].

Here we discuss the results of RDT clustering applied to smFRET trajectories of the AMPA ABD while bound to three different agonists: a full agonist, a partial agonist, and an antagonist [46]. The properties extracted from each of these systems, including population distributions and TRDGs, are shown in Fig. (17) [46]. Parameters estimated from the clustering results, including mean efficiencies, occupation probabilities, escape times and free energies of the basins, are shown in Table (1) [46].

As shown by the transition networks, TRDGs, and state distributions in Figs. (17a)-(17c), the model selection process results in the assignment of 4, 5, and 6 states for the ABDs bound to the full agonist, partial agonist, and antagonist, respectively.

As shown in Fig. (17) and Table (1) [46], the most dominant state when the ABD is glutamate-bound has a 74% occupation probability and a mean efficiency of 0.85, which corresponds to a relatively short interdye distance of ∼ 38 Å in comparison to the apo form of the ABD [267]. Other states have smaller occupation probabilities (< 10%) which, along with the relatively slow escape times from the states, suggest that, although conformational dynamics are observed, the glutamate-bound ABD remains relatively stable and closed.
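As an aside, the quoted efficiency-to-distance conversion follows the Förster relation E = 1/(1 + (r/R0)^6). Assuming, for illustration, a Förster radius R0 ≈ 51 Å typical of common donor-acceptor pairs (the value appropriate to Ref. [46] may differ), E = 0.85 indeed yields r ≈ 38 Å:

    R0 = 51.0                                # assumed Forster radius (Angstrom)
    E = 0.85                                 # mean FRET efficiency of the state
    r = R0 * (1.0 / E - 1.0) ** (1.0 / 6.0)  # invert E = 1/(1 + (r/R0)^6)
    print(round(r, 1))                       # ~38.2 Angstrom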

By contrast, the most populated state in the partial agonist-bound ABD system shown in Fig. (17b) has a smaller mean efficiency of 0.74 and a smaller occupation probability of 52%, indicating a longer interdye distance and a less stable conformation. Furthermore, the TRDG indicates a smaller transition barrier out of many states in the network which, along with the increase in the number of significantly populated states, suggests that the ABD is more active when bound to the partial agonist.

Lastly, the results of the antagonist-bound ABD returned six states displaying a broader interdye range and a larger relative occupation at lower efficiencies. The most populous state (41%), however, has a high mean efficiency of 0.88, suggesting that the channel should be activated while occupying this closed cleft conformation as defined in Ref. [268]. Inspection of the TRDG shows relatively smaller (∼ 1 kcal/mol) transition barriers, and the escape times for all states are relatively fast (200-500 ms), suggesting a conformationally active ABD relative to the full and partial agonist-bound systems. It is this fast and frequent conformational motion that is the source of the ion channel's lack of activation.

In summary, RDT clustering applied to these three systems provides broader insight into the activation mechanism of the AMPA receptor [46].



Figure 17: RDT clustering reveals differences in conformational dynamics for the AMPA ABD. State distributions and TRDGs are given for the full agonist-bound ABD (a); the partial agonist-bound ABD (b); and the antagonist-bound ABD (c). 〈E〉 denotes the mean efficiency. See main body and Ref. [46] for details.


〈E〉     p(Sk) (%)   Fi (kcal/mol)   escape time (ms)

Full agonist:
0.97    10          1.21            308 (220, 425)
0.85    74          0               674 (589, 753)
0.75     8          1.31            185 (145, 220)
0.64     8          1.29            310 (240, 384)

Partial agonist:
0.93    16          0.68            807 (690, 928)
0.84    17          0.68            288 (260, 300)
0.74    52          0               664 (618, 698)
0.66    12          0.88            290 (260, 308)
0.47     3          1.75            502 (388, 634)

Antagonist:
0.88    41          0               490 (458, 519)
0.76    27          0.26            223 (207, 236)
0.62    16          0.56            207 (193, 217)
0.51    11          0.78            220 (203, 236)
0.32     3          1.47            250 (217, 283)
0.11     2          1.71            512 (420, 619)

Table 1: RDT clustering returns state properties for the AMPA ABDs. These properties include mean efficiencies (〈E〉), occupation probabilities (p(Sk)), free energies (Fi), and escape times with 95% confidence intervals, for the full agonist (glutamate), the partial agonist (nitrowillardiine), and the antagonist (UBP282). See main body and Ref. [46] for details.

The antagonist-bound ABD exhibits fast conformational fluctuations and a relatively unstable structure that explores a broad range of interdye distances, while the most stable conformation of the partial agonist-bound ABD displays a relatively large interdye distance, indicating a weaker and/or sterically distorted structure. By comparison, the full agonist-bound ABD displays a relatively stable and static structure with a small interdye distance, suggesting a strong and stable interaction of the full agonist with the ABD. It is the ability of the full agonist to hold the cleft of the ABD closed in a stable manner that causes the full activation of, i.e. the maximum ionic current through, the ion channel.

6.2 Variations of RDT: Information-based clustering

Similar in spirit to RDT is a general method known as information-based clustering that has been used on gene expression data [258]. Information-based clustering uses a similarity measure – rather than a distortion measure – to quantify how alike elements are [258].

However, unlike in RDT, the quantity that plays the role of RDT's distortion in information-based clustering – a quantity termed “multi-information” – is a multidimensional mutual information.

An advantage of information-based clustering is that multidimensional relationships among the data to be clustered are naturally incorporated into the clustering algorithm. The objective function to be maximized is

〈S(C, s)〉 − TI(C, s) (93)

where C is again the set of clusters, s the set of observations, and T is a “trade-off” parameter.

In Eq. (93), 〈S(C, s)〉 is the average multidimensional similarity among the set of observations within the set of clusters [258], and I(C, s) is the mutual information between the set of clusters and the set of observations.


6.3 Variations of RDT: The information bottleneck method

There exists another variation of RDT, proposed by Tishby et al. [259] and termed the information bottleneck (IB) method, which focuses on how well the compressed description of the data, i.e. the clusters or states, can predict the outcome of another observation, say u. It is similar to information-based clustering; however, its focus is on predicting the outcome of another variable, and the “distortion” term is given by the mutual information between the clusters and u.

In other words, the information contained in s is squeezed (compressed) through the “bottleneck” of clusters C, which is then used to explain u. The advantage with IB is that there is no need for a problem-specific distortion.

Just like in RDT, minimizing the mutual information between s and C generates broad overlapping clusters. However, maximizing the mutual information between u and C tends to create sharper clusters.

Mathematically, the objective function in IB to be minimized is

I(C, s)− βI(C,u) (94)

where, as before, I(C, s) = \sum_{k,i} p(C_k, s_i) \log[p(C_k|s_i)/p(C_k)] [likewise for I(C, u)] and β is a trade-off parameter. The minimization of Eq. (94) yields [259]

p(C_k|s_i) = \frac{p(C_k)}{Z(s_i, \beta)} \exp\left[-\beta \sum_{j=1}^{M} p(u_j|s_i) \log \frac{p(u_j|s_i)}{p(u_j|C_k)}\right] \equiv \frac{p(C_k)}{Z(s_i, \beta)} \exp\left[-\beta D_{KL}[p(u|s_i)\,\|\,p(u|C_k)]\right]   (95)

where DKL[p(u|si)‖p(u|Ck)] is the KL divergence, and

Z(s_i, \beta) = \sum_k p(C_k) \exp\left[-\beta D_{KL}[p(u|s_i)\,\|\,p(u|C_k)]\right]   (96)

is the normalization. When comparing Eq. (95) with the formal solution of the conventional RDT, Eq. (90), we see that the KL divergence, DKL[p(u|si)‖p(u|Ck)], serves as an effective distortion function in the IB framework. This means that by minimizing the distortion DKL[p(u|si)‖p(u|Ck)], one obtains a compression of s (through C) that preserves as much as possible the information provided by the relevant observable u.

For illustrative purposes, we consider two special limiting values for the trade-off parameter: β → 0 and β → ∞. As β → 0, we have Z(si, β → 0) = Σk p(Ck) = 1. Eq. (95) then implies p(Ck|si) = p(Ck). Using Bayes' rule, one then obtains p(si|Ck) = p(si). This means that the probability to find si in a cluster is the same for all clusters, implying that all clusters overlap or, effectively, we have just one cluster.

The opposite extreme, β → ∞, is the limit of hard clustering where, to leading order, the normalization becomes Z(si, β) → p(Ck∗) exp[−β DKL[p(u|si)‖p(u|Ck∗)]], where Ck∗ is the cluster whose conditional p(u|Ck∗) lies closest, in the KL sense, to p(u|si). Substituting this approximate normalization back into Eq. (95), we arrive at p(Ck|si) = 1 if p(uj|si) = p(uj|Ck), and p(Ck|si) = 0 otherwise. Since the conditional probability is either one or zero, the limit β → ∞ coincides precisely with the hard clustering case.
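A minimal numpy sketch of the resulting self-consistent IB iteration follows (function and variable names are ours; we assume strictly positive conditionals p(u|si) so that all logarithms below are finite):

    import numpy as np

    def ib_cluster(p_u_given_s, p_s, n_clusters, beta, n_iter=500, seed=0):
        """Self-consistent IB iteration of Eqs. (95)-(96);
        p_u_given_s[i, m] = p(u_m|s_i)."""
        rng = np.random.default_rng(seed)
        N = p_u_given_s.shape[0]
        p_c_given_s = rng.random((N, n_clusters))  # random init of p(C_k|s_i)
        p_c_given_s /= p_c_given_s.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            p_c = p_s @ p_c_given_s                          # marginals p(C_k)
            p_s_given_c = p_c_given_s * p_s[:, None] / p_c   # Bayes' formula
            p_u_given_c = p_s_given_c.T @ p_u_given_s        # p(u_m|C_k)
            # KL divergence D_KL[p(u|s_i) || p(u|C_k)] for every pair (i, k)
            kl = np.sum(p_u_given_s[:, None, :] *
                        np.log(p_u_given_s[:, None, :] /
                               p_u_given_c[None, :, :]), axis=2)
            p_c_given_s = p_c[None, :] * np.exp(-beta * kl)  # Eq. (95)
            p_c_given_s /= p_c_given_s.sum(axis=1, keepdims=True)  # Eq. (96)
        return p_c_given_s

Sweeping beta in this sketch reproduces the two limits just discussed: small beta drives all rows of p_c_given_s toward the marginal p(C_k), while large beta drives each row toward a single cluster.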

6.3.1 Information bottleneck method: Application to dynamical state-space networks

The IB framework has been applied directly to time series data to construct dynamical state-space network models [51, 52]. Here we briefly describe important features of the IB-based state-space network construction detailed elsewhere [39, 51].

Intuitively, state-space networks derived from the IB construction must preserve maximum information relevant to predicting the future outcome of the time series. More precisely, they must be minimally complex but, simultaneously, most predictive [39, 259, 260].



Figure 18: The IB method can be used to construct dynamical models. a) The IB method starts from the data to be clustered s (top left); clustering then compresses the information contained in s by minimizing the rate I(C, s) (from top left to top right). Instead of introducing an a priori distortion measure, the IB compression maximizes I(C, u), quantifying how well another observable, u, is predicted (from top right to bottom). The maximum achievable “relevance”, predicting u from s, is given by I(s, u). b) To construct a predictive dynamical model from time series data, we may define past sequences (top left) as the data to be clustered s and future sequences (bottom) as the relevant observables u.


As an illustration, we consider a time series x sampled at discrete times. We have both past, ←x = {←x1, · · · , ←xN}, and future, →x = {→x1, · · · , →xM}, sequences of different total length. We have bolded the elements of ←x and →x because each element can itself be a vector of some length, say L and L′ respectively.

To construct a minimal state-space network with maximal predictability from the IB framework, we identify the data to be clustered s as the past sequences ←x and the relevant observable u as the future sequences →x (see Fig. (18a)-(18b)). The clusters C obtained in Fig. (18b) represent constructed network states. In principle, the lengths of the past and future sequences, L and L′, should be chosen to be long as compared to all dynamical correlations in the time series. However, this may cause practical sampling problems if L and L′ are too large. These problems are addressed using the multiscale wavelet-based CM (computational mechanics, discussed below) described in detail in Refs. [39, 51, 52].

Multiscale state-space networks developed from IB methods may capture dynamical correlations of conformational fluctuations covering a wide range of timescales (from millisecond to second) [51, 52]. Moreover, the topographical features of the networks constructed, including the number of connections among the states and the heterogeneities in the transition probabilities among the states, depend on the timescale of observation; namely, the longer the timescale, the simpler and more random the underlying state-space network becomes. These insights provide us with a network topography perspective from which to understand dynamical transitions from anomalous to normal diffusion [51, 52].

We end this brief section by mentioning that clustering past sequences to form state-space networks, in which the relevant variables are future sequences, was proposed separately by Crutchfield et al. [273, 274] and termed computational mechanics (CM). The states and resulting state-space network are called causal states and the epsilon machine, respectively. We refer the interested reader to an excellent review of CM [275].


7 Final Thoughts on Data Analysis and Information Theory

We have previously seen how information theory can be used in deconvolution, model selection and clustering. In this purely theoretical section, we discuss efforts to use information theory in experimental design and end with some considerations on the broader applicability of information theory.

7.1 Information theory in experimental design

Just as information theory can be used in model selection after an experiment has been performed, it may also be used to suggest an experimental design, labeled ξ.

The goal, in this so far theoretical endeavor, is to find a design that optimizes information gain [276–278]. For instance, a choice of design may involve tuning data collection times, bin sizes, choices of variables under observation and sample sizes [279, 280]. Fig. (19) illustrates – for a concrete example we will discuss shortly – how the number of observations can be treated as a design variable and how the information gained grows as we tune this variable (repeat trials).

To quantify the information gained, we first consider the expected utility, U(ξ) – depending on the experimental design ξ – defined as the mutual information between the data, y, and model, θ [276]

U(\xi) \equiv I(y, \theta|\xi) = \int d\theta\, dy\, p(y, \theta|\xi) \log\left(\frac{p(y, \theta|\xi)}{p(y|\xi)\, p(\theta|\xi)}\right)   (97)

which we must now maximize with respect to our choice of experiment, ξ.

More concretely, the choice of experiment dictates the mathematical form for p(y|ξ) and p(y, θ|ξ). Thus maximizing with respect to ξ may imply comparing different mathematical forms dictated by the experimental design for p(y|ξ) and p(y, θ|ξ).

Our utility function, U(ξ), is simply the difference in Shannon information before and after the data y was used to inform the model

I(y,θ|ξ) = I(θ|y, ξ)− I(θ|ξ) (98)

where

I(\theta|y, \xi) \equiv \int d\theta\, dy\, p(y, \theta|\xi) \log p(\theta|y, \xi),
I(\theta|\xi) \equiv \int d\theta\, dy\, p(y, \theta|\xi) \log p(\theta|\xi).   (99)

As an illustration of this formalism, we can now quantify whether future experiments – repeated trials yielding data y′ – appreciably change the expected information gain. We begin by iterating Eq. (98) and write

I(y′,θ|y, ξ) = I(θ|y′,y, ξ)− I(θ|y, ξ). (100)

As a concrete example, suppose we monitor photon arrival times in continuous time. We write the likelihood, i.e. the probability of observing a sequence of photon arrivals with arrival times t = {t1, t2, · · · , tN},

p(y = t|\theta = r, \xi) = r e^{-r t_1} \times r e^{-r t_2} \times \cdots \times r e^{-r t_N}   (101)

where r is the rate of arrival and ξ is implicitly specified by our choice of experiment (and thus by the form of our likelihood). For the sake of concreteness, we take a simple exponential prior distribution over r, p(r|ξ) = φe^{−rφ}, where φ is a hyperparameter. Now, our joint distribution, p(t, r|ξ) = p(t|r, ξ) p(r|ξ), as well as our marginal distribution over data, p(t|ξ) = ∫dr p(t, r|ξ), are fully specified. We tune the design here by selecting N.


Figure 19: Diminishing returns: most data collected from additional experiments does not result in information gain. The expected information gained, Eq. (100), grows sub-linearly with the number of photon arrival measurements, N.

The expected information gained can now be explicitly calculated from Eq. (100). The integrals are over all allowed r and arrival times. All but the last time, tN, are considered as y; the last time, tN, is y′.

The expected information gained, I(y′, θ|y, ξ), increases monotonically but sub-linearly with the number of trials and, for our specific example, independently of φ. See Fig. (19). The monotonicity is expected because we have averaged over all possible outcomes (i.e. photon arrival times). The sub-linearity, however, quantifies that future experiments result in diminishing returns [281]. Put differently, insofar as the expected information gained allows us to build a predictive model, most of the information gathered has little predictive value.
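A Monte Carlo sketch of this calculation follows (helper names are ours). It uses the identity that the conditional mutual information of Eq. (100) equals the expected KL divergence between successive posteriors; for the exponential likelihood of Eq. (101) and the exponential prior, the posterior over r after N arrivals is Gamma(N + 1, φ + Σ_{i≤N} ti), so the KL divergences are available in closed form:

    import numpy as np
    from scipy.special import gammaln, digamma

    def kl_gamma(a1, b1, a0, b0):
        """KL divergence between Gamma(a1, rate=b1) and Gamma(a0, rate=b0)."""
        return ((a1 - a0) * digamma(a1) - gammaln(a1) + gammaln(a0)
                + a0 * (np.log(b1) - np.log(b0)) + a1 * (b0 - b1) / b1)

    def cumulative_info_gain(N_max, phi=1.0, n_mc=20000, seed=0):
        """Monte Carlo estimate of the expected information gained,
        Eq. (100), accumulated over the first N_max photon arrivals."""
        rng = np.random.default_rng(seed)
        gains = np.zeros(N_max)
        for _ in range(n_mc):
            r = rng.exponential(1.0 / phi)            # rate drawn from prior
            t = rng.exponential(1.0 / r, size=N_max)  # arrival times, Eq. (101)
            csum = np.concatenate(([0.0], np.cumsum(t)))
            for N in range(1, N_max + 1):
                # expected gain of the N-th arrival = mean KL between
                # successive Gamma posteriors (an identity for Eq. (100))
                gains[N - 1] += kl_gamma(N + 1, phi + csum[N],
                                         N, phi + csum[N - 1])
        return np.cumsum(gains / n_mc)                # sub-linear growth in N

    print(cumulative_info_gain(5))

Plotting the returned array against N reproduces the qualitative behavior of Fig. (19).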

7.2 Predictive information and model complexity

The result from Fig. (19) in the previous section suggested that most repeated experiments yield little information gain.

Here we want to quantify this concept by calculating the predictive value of previously collected data on future data. We consider the mutual information linking past, i.e. previously collected, data, yp, and future data, yf [276, 281]

I_{pred}(y_f, y_p|\xi) \equiv \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log\left(\frac{p(y_f, y_p|\xi)}{p(y_f|\xi)\, p(y_p|\xi)}\right).   (102)

We wish to simplify Eq. (102) and derive Ipred(yf, yp|ξ)'s explicit dependence on the total number of data points collected in the past, Np. We call Nf the total number of data points collected in the future.

To simplify Eq. (102), we first re-write it as

I_{pred}(y_f, y_p|\xi) = I_{joint} - I_{future} - I_{past}
= \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log p(y_f, y_p|\xi) - \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log p(y_f|\xi) - \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log p(y_p|\xi)   (103)


and express p(yf, yp|ξ) as

p(y_f, y_p|\xi) = \int d^K\theta\, p(y_f, y_p, \theta|\xi)   (104)

where K enumerates model parameters. We can similarly re-write both p(yf|ξ) and p(yp|ξ). Next, we simplify p(yf, yp|ξ) by assuming that individual data points, both past and future, are sufficiently independent. As the total number of data points, Nf + Np, grows, we invoke Laplace's method – using the notation for the rescaled logarithm of the likelihood f introduced earlier, Eq. (5) – to simplify p(yf, yp|ξ). This yields

p(y_f, y_p|\xi) = \int d^K\theta\, p(y_f, y_p, \theta|\xi) \equiv \int d^K\theta\, e^{(N_f+N_p) \log f(y_f, y_p, \theta|\xi)}
\sim \frac{1}{\sqrt{\det\left((N_f+N_p)\left[-\left(\log f(y_f, y_p, \theta^*|\xi)\right)''\right]\right)}}\, e^{(N_f+N_p) \log f(y_f, y_p, \theta^*|\xi)}
\propto \frac{1}{(N_f+N_p)^{K/2}}\, e^{(N_f+N_p) \log f(y_f, y_p, \theta^*|\xi)}   (105)

where θ∗ is the value of θ maximizing the integrand. Similarly, p(yf |ξ) and p(yp|ξ) can be re-written as follows

p(y_f|\xi) \propto \frac{1}{N_f^{K/2}}\, e^{N_f \log f(y_f, \theta_f^*|\xi)}   (106)

p(y_p|\xi) \propto \frac{1}{N_p^{K/2}}\, e^{N_p \log f(y_p, \theta_p^*|\xi)}   (107)

where θ∗f and θ∗p are those parameter values that maximize their respective integrands. But, for sufficiently large Nf and Np, log f(yf, θ∗f|ξ) and log f(yp, θ∗p|ξ) are both well-approximated by log f(yf, yp, θ∗|ξ). This is true to the extent that yf and yp are typical. Thus [281]

p(y_f|\xi) \propto \frac{1}{N_f^{K/2}}\, e^{N_f \log f(y_f, y_p, \theta^*|\xi)}   (108)

p(y_p|\xi) \propto \frac{1}{N_p^{K/2}}\, e^{N_p \log f(y_f, y_p, \theta^*|\xi)}.   (109)

Inserting these simplified probabilities, Eqs. (105), (108) and (109), into Eq. (103) yields an extensive part that scales with the number of data points (first two lines of Eq. (110)) and, to next order, a portion scaling with the logarithm of the number of data points (third line) [281]

I_{pred}(y_f, y_p|\xi) \sim (N_f+N_p) \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log f(y_f, y_p, \theta^*|\xi)
- N_p \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log f(y_f, y_p, \theta^*|\xi) - N_f \int dy_f\, dy_p\, p(y_f, y_p|\xi) \log f(y_f, y_p, \theta^*|\xi)
- \frac{K}{2}\log(N_f+N_p) + \frac{K}{2}\log N_f + \frac{K}{2}\log N_p + O(N_f^0) + O(N_p^0).   (110)

The extensive portion of Ipred, Eq. (110), cancels to leading order. This directly implies that the vast majority of data collected provides no predictive information. If we then ask what the past data collected tells us about the entirety of future observations (Nf → ∞), upon simplifying Eq. (110), we find

I_{pred} = \frac{K}{2} \log N_p.   (111)

In other words, the predictive information grows logarithmically with the data collected and linearly with K.


Asymptotically, the predictive information is directly related to features of the model (in this case, the number of parameters) drawn from the data. In addition, Eq. (111) provides an interpretation of the penalty term of the BIC [62] as twice the predictive information.
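For a sense of scale, consider a hypothetical model with K = 2 parameters inferred from Np = 10^4 data points: Eq. (111) gives Ipred = (2/2) ln(10^4) ≈ 9.2 nats, while the corresponding BIC penalty, K ln Np ≈ 18.4, is exactly twice this predictive information.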

7.3 The Shore & Johnson axioms

While we’ve discussed the Shannon entropy in the context of its information theoretic interpretation, the SJ axiomsprovide a complementary way to understand the central role of information theory in model inference.

The key mathematical steps in deriving the Shannon entropy from the SJ axioms concretely highlight what assumptions are implicit when using H = −Σ_i p_i log p_i that go beyond the illustration of the kangaroo example with eye-color and handedness. Conversely, they clarify which assumptions must be violated in rejecting H = −Σ_i p_i log p_i in inferring models for {pi} [130, 282].

Briefly, SJ wanted to devise a prescription to infer a probability distribution, {pi}. Thus they constructed an objective function which, when maximized, would guarantee that inferences drawn from their model – the probability distribution, {p∗i}, which maximizes their objective function – would satisfy basic self-consistency conditions that we now define.

SJ suggested that the maximum of their objective function must be: 1) unique; 2) coordinate transformation invariant; 3) subset-independent (i.e. if data are provided on subsets of a system independently, then the relative probabilities on two subsets of outcomes within a system should be independent of other subsets); 4) system-independent (i.e. if data are provided for systems independently, the joint probability for two independent systems should be the product of their marginal probabilities).

The starting point is a function H (to be determined by the axioms) constrained by data (using Lagrange multipliers). SJ considered general equality and inequality constraints for the data.

For concreteness, here we consider a single constraint on the average, ā, of a quantity a. Then SJ's starting point is the following objective function

H(\{p_i, q_i\}) - \lambda \left(\sum_i a_i p_i - \bar{a}\right).   (112)

To find the specific form for H, we first invoke SJ's axiom on subset independence. Subset independence states that, unless the data are coupled, the maximum of Eq. (112) with respect to each pi can only depend on this index, namely i. This is only guaranteed if H is a sum over i's, i.e. outcomes. That is,

H = \sum_i f(p_i, q_i).   (113)

To further specify the function f, we must apply SJ's second axiom of coordinate invariance. To do this, we use a continuum representation for the probabilities and write Eq. (113) as

H = \int D[x]\, f(p(x), q(x))   (114)

where D[x] denotes the integration measure. Our goal is to show that the maximum of H − λ(∫D[x] p(x) a(x) − ā) with respect to p(x) – where we have used an average constraint only as a matter of simplicity – is equivalent to the maximum of the coordinate transformed H′ − λ′(∫D′[y] p′(y) a′(y) − ā) with respect to p′(y), where the primes denote a coordinate transform from x → y. That is

\frac{\delta}{\delta p(x)}\left(H - \lambda\left(\int D[x]\, p(x) a(x) - \bar{a}\right)\right) = \frac{\delta}{\delta p'(y)}\left(H' - \lambda'\left(\int D'[y]\, p'(y) a'(y) - \bar{a}\right)\right)   (115)


where the δ denotes a functional derivative. To simplify Eq. (115), we note that, under coordinate transformation

D′[y] = D[x]J (116)

where J is the corresponding Jacobian. It then follows that one acceptable relationship between the transformed and untransformed probabilities and observables is [121]: p′ = J⁻¹p, q′ = J⁻¹q (from normalization of the coordinate transformed distributions); and a′ = a (from the conservation of ∫D[x] p(x) a(x) under coordinate transformation). Selecting these relations, we find H′ = ∫D[x] J f(J⁻¹p(x), J⁻¹q(x)).

Thus Eq. (115) simplifies to

-\lambda a(x) + g(p(x), q(x)) = -\lambda' a(x) + g(J^{-1}p(x), J^{-1}q(x))   (117)

where g(p, q) = δf(p, q)/δp. Since a(x) and the Jacobian, J, are arbitrary functions of x, Eq. (117) can only be true if g(p, q) = g(p/q) and λ = λ′. By integrating g, it then follows that f(p, q) = p h(p/q), up to an arbitrary constant in q, where h is some function of p/q. The fourth axiom on system independence ultimately fixes the functional form for h.

To see this, we consider independent constraints on two systems described by coordinates x1 and x2 as follows

\int D[x]\, a_k(x_k)\, p(x_1, x_2) = \bar{a}_k \quad (k = 1, 2).   (118)

We then define H = ∫D[x] p(x) h(r) where r(x) ≡ p(x)/q(x) and x ≡ {x1, x2}. Variation of H with respect to p under the constraints given in Eq. (118) yields

\frac{\delta}{\delta p(x)}\left(H - \lambda_1 \int D[x]\, p(x_1, x_2) a_1(x_1) - \lambda_2 \int D[x]\, p(x_1, x_2) a_2(x_2)\right)
= h(r(x)) + r(x)\, h'(r(x)) - \lambda_1 a_1(x_1) - \lambda_2 a_2(x_2)
= h(r_1(x_1) r_2(x_2)) + r_1(x_1) r_2(x_2)\, h'(r_1(x_1) r_2(x_2)) - \lambda_1 a_1(x_1) - \lambda_2 a_2(x_2) = 0   (119)

where, from system independence, we've set r(x) to r1(x1) r2(x2) and h′ = δh/δr. To obtain a simple differential equation in terms of h(r), we take derivatives of the last line of Eq. (119) with respect to both x1 and x2, which yields

r_1'(x_1)\, r_2'(x_2) \left(r_1^2 r_2^2\, h'''(r_1 r_2) + 4 r_1 r_2\, h''(r_1 r_2) + 2 h'(r_1 r_2)\right) = 0   (120)

which further simplifies to

r^2 h'''(r) + 4 r\, h''(r) + 2 h'(r) = 0   (121)

from which we find h(r) = −K log(r) + B + C/r with constants K, B and C. From H = ∫D[x] p(x) h(r), we find that H assumes the following form

H = -K \int D[x]\, p(x) \log(p(x)/q(x))   (122)

up to a positive multiplicative factor K, and additive constants (independent of our model parameters p(x)) that we are free to set to zero. In discrete form, this becomes [121]

H = -K \sum_i p_i \log(p_i/q_i).   (123)

Thus any function with a maximum identical to that of H can be used in making self-consistent inferences – that is, inferences that satisfy the SJ axioms – about probability distributions. In the absence of any constraint, maximizing H with respect to each pi returns the corresponding hyperparameter qi up to a normalization constant. The hyperparameters are therefore understood to be a probability distribution.
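The ODE solution above is easy to verify symbolically; a quick sympy sketch (symbol names are ours) confirms that h(r) = −K log(r) + B + C/r indeed satisfies Eq. (121):

    import sympy as sp

    # Check that h(r) = -K*log(r) + B + C/r solves Eq. (121):
    # r^2 h''' + 4 r h'' + 2 h' = 0 for arbitrary constants K, B, C.
    r, K, B, C = sp.symbols('r K B C', positive=True)
    h = -K * sp.log(r) + B + C / r
    lhs = (r**2 * sp.diff(h, r, 3) + 4 * r * sp.diff(h, r, 2)
           + 2 * sp.diff(h, r))
    print(sp.simplify(lhs))  # prints 0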


Finally, a note on axiom 1 is in order: while our derivation was for a special type of constraint, our arguments above hold and, in particular, maximizing the constrained H returns a unique set of {pi} if the constraints do not change the overall convexity of the objective function.

The mathematics above help clarify the following important points about the principle of MaxEnt and, more broadly, the philosophy of data analysis.

1) While historically MaxEnt has been closely associated with thermodynamics and statistical mechanics in physics, nothing in H's derivation limits the applicability of MaxEnt to equilibrium phenomena. In fact, MaxEnt is a general inference scheme valid for any probability distributions, whether they be probability distributions over trajectories or equilibrium states [38]; MaxEnt's application to dynamical systems is reviewed in Ref. [38]. To reiterate, MaxEnt is no more tied to equilibrium than are Bayes' theorem or even the concept of probability itself.

2) While the H derived from SJ's axioms is additive, in that H({pij = uivj}) = H({ui}) + H({vj}), H can be used to infer probability distributions for either additive or non-additive systems [130]. The SJ axioms only enforce that if no couplings are imposed by the data, then no couplings should be imposed by hand. Put differently, the function H that SJ derived is not only valid for independent systems. It only says that if the data do not couple two outcomes i and j, then the probabilities inferred must also be independent and satisfy the normal addition and multiplication rules of probability. In this way, spurious correlations unwarranted by the data are not introduced into the model inferred. Conversely, if the data couple two outcomes, then this H constrained by coupled data generates coupled outcomes.

This fundamentally explains why it is incorrect to use other, unconventional entropies in model inference which specifically enforce couplings by hand [130]. What is more, parametrizing ad hoc couplings (the q-parameter, say, in the Tsallis entropy [283]) in a prior (equivalently, the entropy) from data is tautological since the data are then used to inform the prior and, simultaneously, the likelihood [282].

3) As a corollary to 2), H is not a thermodynamic entropy [38]. The thermodynamic entropy, S, is a number, not a function: S is H evaluated at its maximum, {p∗i}, under equilibrium (thermodynamic) constraints. While H is additive, the thermodynamic entropy S, derived from H, may not be. The non-additivity of S originates from the non-additivity of the constraints.

4) Historically, constraints used on H to infer probabilistic models were limited to means and variances [100, 101] and, in order to infer more complex models, exotic ad hoc constraints were developed, leading to problems detailed in Refs. [284–287]. But, as we have explained earlier, H is the logarithm of a prior and constraints on H are the logarithm of a likelihood. Selecting constraints should be no more arbitrary than selecting a likelihood. Classical thermodynamics, as it arises from MaxEnt, is therefore atypical. It is a very special (extreme) example where data (such as average energy or, equivalently, temperature) are provided with vanishingly small error bars and the likelihoods are delta-functions.

7.4 Repercussions to rejecting MaxEnt

Since the time when Jaynes provided a justification for the exponential distribution in statistical mechanics from Shannon's information theory [100, 101], other entropies have been invoked to justify more complex models in the physical and social sciences, such as power laws [283, 288–296]. The most widely used of these entropies is the Tsallis entropy [283].

It has been argued that the Tsallis entropy generalizes statistical mechanics because it is not additive [293, 295, 297]. That is, H({pij = uivj}) = H({ui}) + H({vj}) + εH({ui})H({vj}), where ε measures the deviation from additivity (though the choice of ε has been criticized because it is selected in an ad hoc manner by fitting data [282, 298, 299]).

Non-additive entropies do not follow from the SJ axioms. In fact, the Tsallis entropy explicitly violates the fourth axiom [130]. Thus, we can ascertain that the resulting H no longer generates self-consistent inferences about probability distributions.


To see this, we start from the discrete analog of Eq. (114) dictated by the third axiom

H(p) = \sum_k f(p_k)   (124)

and, for simplicity only, we assume a uniform prior qj which we exclude from the calculation. Now we consider bringing together two systems, indexed i and j, with probability pij.

The fourth axiom – system independence – says that bringing together two systems having marginal probabilities u = {ui} and v = {vj} gives new probabilities pij that are factorizable as pij = uivj unless the data couple the systems.

That is, under independent constraints on {ui} and {vj}, we have

H(p) - \lambda_a \left(\sum_{i,j} p_{ij} a_i - \bar{a}\right) - \lambda_b \left(\sum_{i,j} p_{ij} b_j - \bar{b}\right).   (125)

Taking a derivative with respect to pij = uivj then yields

f'(p_{ij}) - \lambda_a a_i - \lambda_b b_j = 0.   (126)

Subsequently taking two more derivatives of Eq. (126) (with respect to ui and vj) yields

f''(p_{ij}) + p_{ij}\, f'''(p_{ij}) = 0.   (127)

Defining f″(pα) ≡ g(pα), where α ≡ (i, j), yields from Eq. (127) g(pα) = −1/pα, from which we obtain f(pα) = −pα log pα + pα and, ultimately, H = −Σα pα log pα + C, where C is a constant.

By contrast, the Tsallis entropy is defined as

H \equiv \frac{K}{1-q} \left(\sum_k p_k^q - 1\right).   (128)

This entropy satisfies SJ's third axiom (by virtue of still being a sum over outcomes) but not the fourth axiom. That is, even if data do not couple systems indexed i and j, it is no longer true that pij = uivj. Rather, the Tsallis entropy assumes a coupling pij = p(ui, vj) even if no such coupling has yet been imposed by the data. What is more, the constant q is fitted to the data, from which it (problematically) follows that both prior and likelihood are informed by the data [282].

To find precisely what the form of this coupling is, we repeat the steps that led us from Eqns. (125)-(127), except treating pij as a general function p(ui, vj) and using the f(pij) dictated by Eq. (128). This yields

(2-q)^{-1}\, p_{ij}\, \frac{\partial^2 p_{ij}}{\partial u_i\, \partial v_j} = \frac{\partial p_{ij}}{\partial u_i} \frac{\partial p_{ij}}{\partial v_j}.   (129)

As a sanity check, in the limit that q approaches 1 (i.e. when the Tsallis entropy approaches the usual Shannon information) we immediately recover pij = uivj.

The solution to Eq. (129) is [130]

p_{ij} = \left(u_i^{q-1} + v_j^{q-1} - 1\right)^{1/(q-1)}.   (130)
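Eq. (130) can likewise be verified symbolically; the following sympy sketch (symbol names are ours) checks that it satisfies Eq. (129):

    import sympy as sp

    # Check that p = (u^(q-1) + v^(q-1) - 1)^(1/(q-1)) satisfies Eq. (129):
    # (2 - q)^(-1) p d^2p/(du dv) = (dp/du)(dp/dv).
    u, v, q = sp.symbols('u v q', positive=True)
    p = (u**(q - 1) + v**(q - 1) - 1) ** (1 / (q - 1))
    lhs = p * sp.diff(p, u, v) / (2 - q)
    rhs = sp.diff(p, u) * sp.diff(p, v)
    print(sp.simplify(lhs - rhs))  # expected output: 0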

This exercise can be repeated for other entropies (such as the Burg entropy [300]), and the explicit form for the correlations such entropies impose can be calculated [130].

Eq. (130) captures the profound consequence of invoking the Tsallis entropy, or other entropies not consistent with the SJ axioms, in probabilistic model inference. By virtue of violating SJ's fourth axiom, the Tsallis entropy imposes couplings between events where none are yet warranted by the data. By contrast, couplings can be introduced in models from the normal Shannon entropy by either systematically selecting a prior distribution, {qk}, with couplings or letting the data impose those couplings.

8 Concluding Remarks and the Danger of Over-interpretation

Throughout this review we have discussed multiple modeling strategies. We began by investigating how model parameters may be inferred from data. We then explored more sophisticated formalisms that have allowed us to infer not only model parameters, but models themselves, starting from broader model classes. We discussed how information theory is useful across data analysis and how it connects to Bayesian methods.

While the formalisms we have presented are powerful and perform well on the synthetic (test) data sets used to benchmark them, it is difficult to determine how well they perform on real data.

For instance, it may be difficult to quantify whether Bayesian-inspired parameter averaging – which is critical in model selection – ultimately rejects models on the basis of parameter values that may be unphysical (such as infinite standard deviations) and should not have been considered in the first place.

What is more, we often do not know the exact likelihood in any analysis either. Thus, the more we ask of a model inference scheme, the more sensitive we become to over-interpretation because noise properties may not be well captured by our approximate likelihood. We gave as a concrete example that inference methods that do not start with a fixed number of states in the analysis of time series data may interpret drift as the occupation of new states over the course of the time trace.

Likewise, inferences made under one choice of noise model are biased by this choice. So, in practice, non-parametric mixture models are still limited by the parametric choice for their distribution over observations, which then determines the states that will be populated.

Despite these apparent shortcomings, the analysis methods presented here outperform methods from the recent past and broaden our thinking. Furthermore, the mechanistic insights provided by statistical modeling may ultimately help inspire new theoretical frameworks to describe biological phenomena.

The mathematical frameworks we've described here are helpful and worth investigating in their own right. They suggest what model features should be extractable from data and, in this sense, may even help inspire new types of experiments.

9 Acknowledgements

We acknowledge Carlos Bustamante, Christopher Calderon, Kingshuk Ghosh, Irina Gopich, Christy Landes, Antony Lee, Martin Lindén, Philip Nelson, Ilya Nemenman, Konstantinos Tsekouras and Paul Wiggins for suggesting topics to be discussed or carefully reading this manuscript in whole or in part. A special thanks to Konstantinos Tsekouras for generating some figures. In addition, SP kindly acknowledges the support of the NSF (MCB) and the DoD (ARO Mechanical Sciences Division). TK acknowledges support from Grant-in-Aid for Scientific Research (B), JSPS, Grant-in-Aid for Scientific Research on Innovative Areas “Spying minority in biological phenomena (No.3306)”, MEXT and HFSP.


References

[1] T. Komatsuzaki (Ed.), M. Kawakami (Ed.), S. Takahashi (Ed.), H. Yang (Ed.), and R.J. Silbey (Ed.). Advances in Chemical Physics. Vol. 146. Single Molecule Biophysics: Experiments and Theories. John Wiley & Sons, 2011.

[2] C. Bouchiat, M.D. Wang, J.F. Allemand, T. Strick, S.M. Block, and V. Croquette. Estimating the persistence length of a worm-like chain molecule from force-extension measurements. Biophys. J., 76(1):409–413, 1999.

[3] I.A. Chen, R.W. Roberts, and J.W. Szostak. The emergence of competition between model protocells. Science, 305(5689):1474–1476, 2004.

[4] K. Lindorff-Larsen, S. Piana, R.O. Dror, and D.E. Shaw. How fast-folding proteins fold. Science, 334(6055):517–520, 2011.

[5] K. Tsekouras, A. Siegel, R. Day, and S. Pressé. Inferring diffusion dynamics from FCS in heterogeneous nuclear environments. Biophys. J., 109(1):7–17, 2015.

[6] S. Pressé. A data-driven alternative to the fractional Fokker-Planck equation. J. Stat. Mech: Th. and Expmt., 2015(7):P07009, 2015.

[7] J. Gunawardena. Models in biology: accurate descriptions of our pathetic thinking. BMC Biol., 12(1):29–40, 2014.

[8] I. Bronstein, Y. Israel, E. Kepten, S. Mai, Y. Shav-Tal, E. Barkai, and Y. Garini. Transient anomalous diffusion of telomeres in the nucleus of mammalian cells. Phys. Rev. Lett., 103(1):018102, 2009.

[9] S. Weber, A. Spakowitz, and J. Theriot. Bacterial chromosomal loci move subdiffusively through a viscoelastic cytoplasm. Phys. Rev. Lett., 104(23):238102, 2010.

[10] I. Golding and E. Cox. Physical nature of bacterial cytoplasm. Phys. Rev. Lett., 96(9):098102, 2006.

[11] G. Seisenberger, M. Ried, T. Endress, H. Buening, M. Hallek, and C. Braeuchle. Real-time single-molecule imaging of the infection pathway of an adeno-associated virus. Science, 294(5548):1929–1932, 2001.

[12] D. Banks and C. Fradin. Anomalous diffusion of proteins due to molecular crowding. Biophys. J., 89(5):2960–2971, 2005.

[13] T. Feder, I. Brust-Mascher, J. Slattery, B. Baird, and W. Webb. Constrained diffusion or immobile fraction on cell surfaces: a new interpretation. Biophys. J., 70(6):2767–2773, 1996.

[14] M.C. Konopka, I.A. Shkel, S. Cayley, M.T. Record, and J.C. Weisshaar. Crowding and confinement effects on protein diffusion in vivo. J. Bacteriol., 188(17):6115–6123, 2006.

[15] S.R. McGuffee and A.H. Elcock. Diffusion, crowding & protein stability in a dynamic molecular model of the bacterial cytoplasm. PLoS Comput. Biol., 6(3):e1000694, 2010.

[16] I.V. Tolic-Nørrelykke, E.L. Munteanu, G. Thon, L. Oddershede, and K. Berg-Sørensen. Anomalous diffusion in living yeast cells. Phys. Rev. Lett., 93(7):078102, 2004.

[17] M. Wachsmuth, W. Waldeck, and J. Langowski. Anomalous diffusion of fluorescent probes inside living cell nuclei investigated by spatially-resolved fluorescence correlation spectroscopy. J. Mol. Bio., 298(4):677–689, 2000.

[18] M. Saxton. A biological interpretation of transient anomalous subdiffusion. I. Qualitative model. Biophys. J., 92(4):1178–1191, 2007.


[19] A.P. Siegel, N.M. Hays, and R.N. Day. Unraveling transcription factor interactions with heterochromatin protein using fluorescence lifetime imaging microscopy and fluorescence correlation spectroscopy. J. Biomed. Opt., 18(2):25002, 2013.

[20] B.R. Parry, I.V. Surovtsev, M.T. Cabeen, C.S. O'Hern, E.R. Dufresne, and C. Jacobs-Wagner. The bacterial cytoplasm has glass-like properties and is fluidized by metabolic activity. Cell, 156(1):183–194, 2014.

[21] P. Schwille, J. Korlach, and W.W. Webb. Fluorescence correlation spectroscopy with single-molecule sensitivity on cell and model membranes. Cytometry, 36(3):176–182, 1999.

[22] A. Caspi, R. Granek, and M. Elbaum. Enhanced diffusion in active intracellular transport. Phys. Rev. Lett., 85(26):5655–5658, 2000.

[23] L. Bruno, V. Levi, M. Brunstein, and M.A. Despósito. Transition to superdiffusive behavior in intracellular actin-based transport mediated by molecular motors. Phys. Rev. E, 80(1):011912, 2009.

[24] P. Bressloff and J. Newby. Stochastic models of intracellular transport. Rev. Mod. Phys., 85(1):135–196, 2013.

[25] B. Regner, D. Vucinic, C. Domnisoru, T. Bartol, M. Hetzer, D. Tartakovsky, and T. Sejnowski. Anomalous diffusion of single particles in cytoplasm. Biophys. J., 104(8):1652–1660, 2013.

[26] J. Wu and K. Berland. Propagators and time-dependent diffusion coefficients for anomalous diffusion. Biophys. J., 95(4):2049–2052, 2008.

[27] M. Kaern, T.C. Elston, W.J. Blake, and J.J. Collins. Stochasticity in gene expression: From theories to phenotypes. Nat. Rev. Genet., 6(6):451–464, 2005.

[28] W.E. Moerner and D.P. Fromm. Methods of single-molecule fluorescence spectroscopy and microscopy. Rev. Sci. Instrum., 74(8):3597–3619, 2003.

[29] B. Schueler and W.A. Eaton. Protein folding studied by single-molecule FRET. Curr. Op. in Struct. Bio., 18(1):16–26, 2008.

[30] Y. Taniguchi, P.J. Choi, G.W. Li, H. Chen, M. Babu, J. Hearn, A. Emili, and X.S. Xie. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science, 329(5991):533–538, 2010.

[31] S.A. McKinney, C. Joo, and T. Ha. Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J., 91(5):1941–1951, 2006.

[32] B. Munsky, B. Trinh, and M. Khammash. Listening to the noise: Random fluctuations reveal gene network parameters. Mol. Syst. Biol., 5(1):318–325, 2009.

[33] S.F. Gull and G.J. Daniell. Image reconstruction from incomplete and noisy data. Nature, 272:686–690, 1978.

[34] G. Rollins, J.Y. Sin, C. Bustamante, and S. Pressé. A stochastic approach to the molecular counting problem in super-resolution microscopy. Proc. Natl. Acad. Sc., 112(2):E110–E118, 2015.

[35] F. Persson, M. Lindén, C. Unoson, and J. Elf. Extracting intracellular diffusive states and transition rates from single-molecule tracking data. Nat. Meth., 10(3):265–269, 2013.

[36] J.E. Bronson, J. Fei, J.M. Hofman, R.L. Gonzalez Jr., and C.H. Wiggins. Learning rates and states from biophysical time series: A Bayesian approach to model selection and single-molecule FRET data. Biophys. J., 97(12):3196–3205, 2009.

[37] B. Shuang, D. Cooper, J.N. Taylor, L. Kisley, J. Chen, W. Wang, C.B. Li, T. Komatsuzaki, and C.F. Landes. Fast step transition and state identification (STaSI) for discrete single-molecule data analysis. J. Phys. Chem. Lett., 5(18):3157–3161, 2014.


[38] S. Pressé, K. Ghosh, J. Lee, and K.A. Dill. Principles of maximum entropy and maximum caliber in statistical physics. Rev. Mod. Phys., 85(3):1115–1141, 2013.

[39] Y. Sako and M. Ueda. Cell signaling reactions: Single-molecule kinetic analysis. Springer, 2011.

[40] Y. Sako and M. Ueda. Theory and evaluation of single-molecule signals. Springer, 2008.

[41] M. Greenfeld, D.S. Pavlichin, H. Mabuchi, and D. Herschlag. Single molecule analysis research tool (SMART): An integrated approach for analyzing single molecule data. PLoS ONE, 7(2):e30024, 2012.

[42] C. Bustamante, W. Cheng, and Y. Mejia. Revisiting the central dogma one molecule at a time. Cell, 144(4):480–497, 2011.

[43] G.W. Li and X.S. Xie. Central dogma at the single-molecule level in living cells. Nature, 475(7356):308–315, 2011.

[44] A. Baba and T. Komatsuzaki. Construction of effective free energy landscape from single-molecule time series. Proc. Natl. Acad. Sci., 104(49):19297–19302, 2007.

[45] A. Baba and T. Komatsuzaki. Extracting the underlying effective free energy landscape from single-molecule time series: local equilibrium states and their network. Phys. Chem. Chem. Phys., 13(4):1395–1406, 2011.

[46] J.N. Taylor, C.B. Li, D.R. Cooper, C.F. Landes, and T. Komatsuzaki. Error-based extraction of states and energy landscapes from experimental single-molecule time-series. Sci. Rep., 5:9174, 2015.

[47] M. Pirchi, G. Ziv, I. Riven, S. Sedghani Cohen, N. Zohar, Y. Barak, and G. Haran. Single-molecule fluorescence spectroscopy maps the folding landscape of a large protein. Nat. Comm., 2:493, 2011.

[48] M.T. Woodside and S.M. Block. Reconstructing folding energy landscapes by single-molecule force spectroscopy. Ann. Rev. Biophys., 43:19–39, 2014.

[49] N.C. Harris, Y. Song, and C.H. Kiang. Experimental free energy surface reconstruction from single-molecule force spectroscopy using Jarzynski's equality. Phys. Rev. Lett., 99(6):068101, 2007.

[50] K. Kamagata, T. Kawaguchi, Y. Iwahashi, A. Baba, K. Fujimoto, T. Komatsuzaki, Y. Sambongi, Y. Goto, and S. Takahashi. Long-term observation of fluorescence of free single molecules to explore protein-folding energy landscapes. J. Am. Chem. Soc., 134(28):11525–11532, 2012.

[51] C.B. Li, H. Yang, and T. Komatsuzaki. Multiscale complex network of protein conformational fluctuations in single-molecule time series. Proc. Natl. Acad. Sci., 105(2):536–541, 2008.

[52] C.B. Li, H. Yang, and T. Komatsuzaki. New quantification of local transition heterogeneity of multiscale complex networks constructed from single-molecule time series. J. Phys. Chem. B, 113(44):14732–14741, 2009.

[53] C.B. Li and T. Komatsuzaki. Aggregated Markov model using time series of single molecule dwell times with minimum excessive information. Phys. Rev. Lett., 111(5):058301, 2013.

[54] T. Sultana, H. Takagi, M. Morimatsu, H. Teramoto, C.B. Li, Y. Sako, and T. Komatsuzaki. Non-Markovian properties and multiscale hidden Markovian network buried in single molecule time series. J. Chem. Phys., 139(24):245101, 2013.

[55] M. Andrec, R.M. Levy, and D.S. Talaga. Direct determination of kinetic rates from single-molecule photon arrival trajectories using hidden Markov models. J. Phys. Chem. A, 107(38):7454–7464, 2003.

[56] T.G. Terentyeva, H. Engelkamp, A.E. Rowan, T. Komatsuzaki, J. Hofkens, C.B. Li, and K. Blank. Dynamic disorder in single-enzyme experiments: facts and artifacts. ACS nano, 6(1):346–354, 2011.


[57] S. Pressé, J. Peterson, J. Lee, P. Elms, J.L. MacCallum, S. Marqusee, C.J. Bustamante, and K.A. Dill.Single molecule conformational memory extraction: P5ab RNA hairpin. J. Phys. Chem. B, 118(24):6597–6603, 2014.

[58] D.V. Glidden. Robust inference for event probabilities with non-Markov event data. Biometrics, 58(2):361–368, 2002.

[59] B. Kalafut and K. Visscher. An objective, model-independent method for detection of non-uniform steps innoisy signals. Comp. Phys. Comm., 179(10):716–723, 2008.

[60] J. He, S.M. Guo, and M. Bathe. Bayesian approach to the analysis of fluorescence correlation spectroscopydata I: Theory. Anal. Chem., 84(9):3871–3879, 2012.

[61] P. Sengupta, K. Garai, J. Balaji, N. Periasamy, and S. Maiti. Measuring size distribution in highly heteroge-neous systems with fluorescence correlation spectroscopy. Biophys. J., 84(3):1977–1984, 2003.

[62] G. Schwartz. Estimating the dimension of a model. Ann. Stat., 6(2):461–464, 1978.

[63] H. Akaike. A new look at the statistical model identification. IEEE Trans. Autom. Control., 19(6):716–723,1974.

[64] C.E. Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27(1):379, 1948.

[65] E.T. Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003.

[66] G. Casella and R.L. Berger. Statistical inference. Duxbury Press, 2002.

[67] H.S. Chung, K. McHale, J.M. Louis, and W.A. Eaton. Single molecule fluorescence experiments determine protein folding transition path times. Science, 335(6071):981–984, 2012.

[68] P.J. Elms, J.D. Chodera, C. Bustamante, and S. Marqusee. The molten globule state is unusually deformable under mechanical force. Proc. Natl. Acad. Sci., 109(10):3796–3801, 2012.

[69] A.J. Berglund. Statistics of camera-based single-particle tracking. Phys. Rev. E, 82(1):011917, 2010.

[70] L. Zhang, P.A. Mykland, and Y. Aït-Sahalia. A tale of two time scales. JASA, 100(472):1394–1411, 2005.

[71] C.P. Calderon and K. Bloom. Inferring latent states and refining force estimates via hierarchical Dirichlet process modeling in single particle tracking experiments. PloS ONE, 10(9):e0137633, 2015.

[72] J.D. Hamilton. Time series analysis. Princeton University Press, 1994.

[73] J. Lee and S. Pressé. A derivation of the master equation from path entropy maximization. J. Chem. Phys., 137(7):074103, 2012.

[74] H. Ge, S. Pressé, K. Ghosh, and K.A. Dill. Markov processes follow from the principle of maximum caliber. J. Chem. Phys., 136(6):064108, 2012.

[75] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, 1989.

[76] A. McCallum, D. Freitag, and F.C.N. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML, 17:591–598, 2000.

[77] Y. Liu, J. Park, K.A. Dahmen, Y.R. Chemla, and T. Ha. A comparative study of multivariate and univariate hidden Markov modelings in time-binned single-molecule FRET data analysis. J. Phys. Chem. B, 114(16):5386–5403, 2010.

[78] H.S. Chung and I.V. Gopich. Fast single-molecule FRET spectroscopy: Theory and experiment. Phys. Chem. Chem. Phys., 16(35):18644–18657, 2014.

[79] B.G. Keller, A. Kobitski, A. Jaschke, G.U. Nienhaus, and F. Noé. Complex RNA folding kinetics revealed by single-molecule FRET and hidden Markov models. J. Am. Chem. Soc., 136(12):4534–4543, 2014.

[80] M. Blanco and N.G. Walter. Analysis of complex single-molecule FRET time trajectories. Methods Enzymol., 472:153–178, 2010.

[81] T.H. Lee. Extracting kinetics information from single-molecule fluorescence resonance energy transfer data using hidden Markov models. J. Phys. Chem. B, 113(33):11535–11542, 2009.

[82] D. Kelly, M. Dillingham, A. Hudson, and K. Wiesner. A new method for inferring hidden Markov models from noisy time sequences. PloS ONE, 7(1):e29703, 2012.

[83] G.D. Forney. The Viterbi algorithm. Proc. IEEE, 61(3):268–278, 1973.

[84] C. Bishop. Pattern recognition and machine learning. Springer, 2006.

[85] F. Qin, A. Auerbach, and F. Sachs. Maximum likelihood estimation of aggregated Markov processes. Proc. R. Soc. B-Biol. Sci., 264(1380):375–383, 1997.

[86] D. Colquhoun and A. Hawkes. On the stochastic properties of bursts of single ion channel openings and of clusters of bursts. Philos. Trans. R. Soc. Lond. Ser. B-Biol. Sci., 300(1098):1–59, 1982.

[87] R. Horn and K. Lange. Estimating kinetic constants from single channel data. Biophys. J., 43(2):207–223, 1983.

[88] F. Qin, A. Auerbach, and F. Sachs. Estimating single-channel kinetic parameters from idealized patch-clamp data containing missed events. Biophys. J., 70(1):264–280, 1996.

[89] D.R. Fredkin and J.A. Rice. On aggregated Markov processes. J. App. Prob., 23(1):208–214, 1986.

[90] P. Kienker. Equivalence of aggregated Markov models of ion-channel gating. Proc. R. Soc. B-Biol. Sci., 236(1284):269–309, 1989.

[91] F.G. Ball and J.A. Rice. Stochastic models for ion channels: Introduction and bibliography. Math. Biosci., 112(2):189–206, 1992.

[92] F. Qin, A. Auerbach, and F. Sachs. A direct optimization approach to hidden Markov modeling for single channel kinetics. Biophys. J., 79(4):1915–1927, 2000.

[93] B. Roux and R. Sauvé. A general solution to the time interval omission problem applied to single channel analysis. Biophys. J., 48(1):149–158, 1985.

[94] P.M. Lee. Bayesian statistics: An introduction. John Wiley & Sons, 2012.

[95] S.C. Kou, X.S. Xie, and J.S. Liu. Bayesian analysis of single-molecule experimental data. Appl. Statist., 54(3):469–506, 2005.

[96] N. Monnier, Z. Barry, H.Y. Park, K.C. Su, Z. Katz, B.P. English, A. Dey, K. Pan, I.M. Cheeseman, R.H. Singer, and M. Bathe. Inferring transient particle transport dynamics in live cells. Nat. Meth., 12(9):838–840, 2015.

[97] N. Monnier, S.-M. Guo, M. Mori, J. He, P. Lénárt, and M. Bathe. Bayesian approach to MSD-based analysis of particle motion in live cells. Biophys. J., 103(3):616–626, 2012.

[98] S. Tuerkcan, A. Alexandrou, and J.B. Masson. A Bayesian inference scheme to extract diffusivity and potential fields from confined single-molecule trajectories. Biophys. J., 102(10):2288–2298, 2012.

[99] J.B. Witkoskie and J. Cao. Single molecule kinetics. II. Numerical Bayesian approach. J. Chem. Phys., 121(13):6373–6379, 2004.

[100] E.T. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106(4):620–630, 1957.

[101] E.T. Jaynes. Information theory and statistical mechanics. II. Phys. Rev., 108(2):171–190, 1957.

[102] A. Gelman. Prior distribution. Encyclopedia of Environmetrics, 3:1634–1637, 2002.

[103] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proc. R. Soc. A, 186(1007):453–461, 1946.

[104] H. Jeffreys. The theory of probability. Oxford University Press, 1939.

[105] D.R. Eno. Noninformative prior Bayesian analysis for statistical calibration problems. PhD thesis, Virginia Polytechnic Institute and State University, 1999.

[106] C.K. Fisher, O. Ullman, and C.M. Stultz. Comparative studies of disordered proteins with similar sequences: Application to Aβ40 and Aβ42. Biophys. J., 104(7):1546–1555, 2013.

[107] D.L. Ensign and V.S. Pande. Bayesian detection of intensity changes in single molecule and molecular dynamics trajectories. J. Phys. Chem. B, 114(1):280–292, 2009.

[108] D.L. Ensign, V. Pande, H. Andersen, and S.G. Boxer. Bayesian statistics and single-molecule trajectories. Stanford University Press, 2010.

[109] D.L. Ensign and V.S. Pande. Bayesian single-exponential kinetics in single-molecule experiments and simulations. J. Phys. Chem. B, 113(36):12410–12423, 2009.

[110] J. Skilling and R.K. Bryan. Maximum entropy image reconstruction: General algorithm. Month. Not. Roy. Astr. Soc., 211(1):111–124, 1984.

[111] J. Skilling and S.F. Gull. Bayesian maximum entropy image reconstruction. Lecture Notes-Monograph Series, 20:341–367, 1991.

[112] Y.W. Chiang, P.P. Borbat, and J.H. Freed. Maximum entropy: A complement to Tikhonov regularization for determination of pair distance distributions by pulsed ESR. J. Magn. Res., 177(2):184–196, 2005.

[113] A. J. W. G. Visser, S.P. Laptenok, N.V. Visser, A. van Hoek, D.J.S. Birch, J.C. Brochon, and J.W. Borst. Time-resolved FRET fluorescence spectroscopy of visible fluorescent protein pairs. Eur. Biophys. J., 39(2):241–253, 2010.

[114] P.J. Steinbach, R. Ionescu, and C.R. Matthews. Analysis of kinetics using a hybrid maximum-entropy/nonlinear-least-squares method: Application to protein folding. Biophys. J., 82(4):2244–2255, 2002.

[115] P.J. Steinbach, K. Chu, H. Frauenfelder, J.B. Johnson, D.C. Lamb, G.U. Nienhaus, T.B. Sauke, and R.D. Young. Determination of rate distributions from kinetic experiments. Biophys. J., 61(1):235–245, 1992.

[116] K.E. Hines. A primer on Bayesian inference for biophysical systems. Biophys. J., 108(9):2103–2113, 2015.

[117] N. Durisic, L. Laparra-Cuervo, A. Sandoval-Álvarez, J.S. Borbely, and M. Lakadamyali. Single-molecule evaluation of fluorescent protein photoactivation efficiency using an in vivo nanotemplate. Nat. Meth., 11(2):156–162, 2014.

[118] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[119] A.F.M. Smith and G.O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Stat. Soc. B, 55(1):3–23, 1993.

[120] M. Beckers, F. Drechsler, T. Eilert, J. Nagy, and J. Michaelis. Quantitative structural information from single-molecule FRET. Faraday Discuss., 184:117–129, 2015.

[121] J.E. Shore and R.W. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory, 26(1):26–37, 1980.

[122] S.F. Gull and J. Skilling. Maximum entropy image reconstruction. IEE Proc. F, 131(6):646–659, 1984.

[123] J. Skilling. The axioms of maximum entropy. Maximum-Entropy and Bayesian Methods in Science and Engineering, 1:173–187, 1988.

[124] R. Bryan. Maximum entropy analysis of oversampled data problems. Eur. Biophys. J., 18(3):165–174, 1990.

[125] A.K. Livesey and J.C. Brochon. Analyzing the distribution of decay constants in pulse-fluorimetry using the maximum entropy method. Biophys. J., 52(5):693–706, 1987.

[126] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 1991.

[127] S. Kullback and R.A. Leibler. On information and sufficiency. Ann. Math. Stat., 22(1):79–86, 1951.

[128] S.I. Amari. Information geometry of the EM and em algorithms for neural networks. Neural Netw., 8(9):1379–1408, 1995.

[129] S.I. Amari. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory, 47(5):1701–1711, 2001.

[130] S. Pressé, K. Ghosh, J. Lee, and K.A. Dill. Nonadditive entropies yield probability distributions with biases not warranted by the data. Phys. Rev. Lett., 111(18):180604, 2013.

[131] R. Feynman and R.B. Leighton. The Feynman lectures on physics. Addison-Wesley, 1964.

[132] D.S. Sivia. Data analysis: A Bayesian tutorial. Oxford University Press, 1996.

[133] K.E. Hines, J.R. Bankston, and R.W. Aldrich. Analyzing single-molecule time series via nonparametric Bayesian inference. Biophys. J., 108(3):540–556, 2015.

[134] G. Sun, S.-M. Guo, C. Teh, V. Korzh, M. Bathe, and T. Wohland. Bayesian model selection applied to the analysis of fluorescence correlation spectroscopy data of fluorescent proteins in vitro and in vivo. Anal. Chem., 87(8):4326–4333, 2015.

[135] K. Bacia, S. Kim, and P. Schwille. Fluorescence cross-correlation spectroscopy in living cells. Nat. Meth., 3(2):83–89, 2006.

[136] T. Torres and M. Levitus. Measuring conformational dynamics: A new FCS-FRET approach. J. Phys. Chem. B, 111(25):7392–7400, 2007.

[137] P. Kapusta, M. Wahl, A. Benda, M. Hof, and J. Enderlein. Fluorescence lifetime correlation spectroscopy. J. Fluoresc., 17(1):43–48, 2007.

[138] A. Michelman-Ribeiro, D. Mazza, T. Rosales, T.J. Stasevich, H. Boukari, V. Rishi, C. Vinson, J.R. Knutson, and J.G. McNally. Direct measurement of association and dissociation rates of DNA binding in live cells by fluorescence correlation spectroscopy. Biophys. J., 97(1):337–346, 2009.

[139] N. Kahya and P. Schwille. Fluorescence correlation studies of lipid domains in model membranes. Mol. Membr. Biol., 23(1):29–39, 2006.

[140] S.A. Kim, K.G. Heinze, and P. Schwille. Fluorescence correlation spectroscopy in living cells. Nat. Meth., 4(11):963–973, 2007.

[141] J. Szymanski and M. Weiss. Elucidating the origin of anomalous diffusion in crowded fluids. Phys. Rev. Lett., 103(3):038102, 2009.

[142] F. Hoefling and T. Franosch. Anomalous transport in the crowded world of biological cells. Rep. Prog. Phys., 76(4):046602, 2013.

[143] M. Weiss, H. Hashimoto, and T. Nilsson. Anomalous protein diffusion in living cells as seen by fluorescence correlation spectroscopy. Biophys. J., 84(6):4043–4052, 2003.

[144] P. Schwille, U. Haupts, S. Maiti, and W.W. Webb. Molecular dynamics in living cells observed by fluorescence correlation spectroscopy with one- and two-photon excitation. Biophys. J., 77(4):2251–2265, 1999.

[145] N. Malchus and M. Weiss. Elucidating anomalous protein diffusion in living cells with fluorescence correlation spectroscopy – facts and pitfalls. J. Fluoresc., 20(1):19–26, 2010.

[146] O. Krichevsky and G. Bonnet. Fluorescence correlation spectroscopy: The technique and its applications. Rep. Prog. Phys., 65(2):251–297, 2002.

[147] S.M. Guo, N. Bag, A. Mishra, T. Wohland, and M. Bathe. Bayesian total internal reflection fluorescence correlation spectroscopy reveals hIAPP-induced plasma membrane domain organization in live cells. Biophys. J., 106(1):190–200, 2014.

[148] E. Elson and D. Magde. Fluorescence correlation spectroscopy. I. Conceptual basis and theory. Biopolymers, 13(1):1–27, 1974.

[149] J. Widengren, U. Mets, and R. Rigler. Fluorescence correlation spectroscopy of triplet states in solution: A theoretical and experimental study. J. Phys. Chem., 99(36):13368–13379, 1995.

[150] K. Burnecki, E. Kepten, J. Janczura, I. Bronshtein, Y. Garini, and A. Weron. Universal algorithm for identification of fractional Brownian motion. A case of telomere subdiffusion. Biophys. J., 103(9):1839–1847, 2012.

[151] A. Bancaud, S. Huet, N. Daigle, J. Mozziconacci, J. Beaudouin, and J. Ellenberg. Molecular crowding affects diffusion and binding of nuclear proteins in heterochromatin and reveals the fractal organization of chromatin. EMBO J., 28(24):3785–3798, 2009.

[152] I.Y. Wong, M.L. Gardel, D.R. Reichman, E.R. Weeks, M.T. Valentine, A.R. Bausch, and D.A. Weitz. Anomalous diffusion probes microstructure dynamics of entangled F-actin networks. Phys. Rev. Lett., 92(17):178101, 2004.

[153] J.H. Jeon, V. Tejedor, S. Burov, E. Barkai, C. Selhuber-Unkel, K. Berg-Sørensen, L. Oddershede, and R. Metzler. In vivo anomalous diffusion and weak ergodicity breaking of lipid granules. Phys. Rev. Lett., 106(4):048103, 2011.

[154] M.J. Saxton and K. Jacobson. Single-particle tracking: Applications to membrane dynamics. Ann. Rev. Biophys. Biomol. Struct., 26(1):373–399, 1997.

[155] A. Ott, J.P. Bouchaud, D. Langevin, and W. Urbach. Anomalous diffusion in “living polymers”: A genuine Lévy flight? Phys. Rev. Lett., 65(17):2201–2204, 1990.

[156] H. Jankevics, M. Prummer, P. Izewska, H. Pick, K. Leufgen, and H. Vogel. Diffusion-time distribution analysis reveals characteristic ligand-dependent interaction patterns of nuclear receptors in living cells. Biochemistry, 44(35):11676–11683, 2005.

[157] M. Woringer, X. Darzacq, and I. Izeddin. Geometry of the nucleus: A perspective on gene expression regulation. Curr. Opin. Chem. Biol., 20:112–119, 2014.

[158] M. Serag, M. Abadi, and S. Habuchi. Single-molecule diffusion and conformational dynamics by spatial integration of temporal fluctuations. Nat. Comm., 5:5123, 2014.

[159] I. Izeddin, V. Récamier, L. Bosanac, I. Cissé, L. Boudarene, C. Dugast-Darzacq, F. Proux, O. Bénichou, R. Voituriez, O. Bensaude, M. Dahan, and X. Darzacq. Single-molecule tracking in live cells reveals distinct target-search strategies of transcription factors in the nucleus. Elife, 3:e02230, 2014.

[160] J. Gorman, F. Wang, S. Redding, A. Plys, T. Fazio, S. Wind, E. Alani, and E. Greene. Single-molecule imaging reveals target-search mechanisms during DNA mismatch repair. Proc. Natl. Acad. Sci., 109(45):E3074–E3083, 2012.

[161] J. Gebhardt, D. Suter, R. Roy, Z. Zhao, A. Chapman, S. Basu, T. Maniatis, and X. Xie. Single-molecule imaging of transcription factor binding to DNA in live mammalian cells. Nat. Meth., 10(5):421–426, 2013.

[162] G. Stormo and Y. Zhao. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet., 11(11):751–760, 2010.

[163] J. Elf, G.-W. Li, and X.S. Xie. Probing transcription factor dynamics at the single-molecule level in a living cell. Science, 316(5828):1191–1194, 2007.

[164] E. Marklund, A. Mahmutovic, O. Berg, P. Hammar, D. van der Spoel, D. Fange, and J. Elf. Transcription-factor binding and sliding on DNA studied using micro- and macroscopic models. Proc. Natl. Acad. Sci., 110(49):19796–19801, 2013.

[165] A. Siegel, M. Baird, M. Davidson, and R. Day. Strengths and weaknesses of recently engineered red fluorescent proteins evaluated in live cells using fluorescence correlation spectroscopy. Int. J. Mol. Sci., 14(10):20340–20358, 2013.

[166] M. Drobizhev, T. Hughes, Y. Stepanenko, P. Wnuk, K. O’Donnell, J. Scott, P. Callis, A. Mikhaylov, L. Dokken, and A. Rebane. Primary role of the chromophore bond length alternation in reversible photoconversion of red fluorescence proteins. Sci. Rep., 2(688):1–6, 2012.

[167] B. Slaughter, J. Schwartz, and R. Li. Mapping dynamic protein interactions in MAP kinase signaling using live-cell fluorescence fluctuation spectroscopy and imaging. Proc. Natl. Acad. Sci., 104(51):20320–20325, 2007.

[168] Z. Wang, J. Shah, Z. Chen, C.-H. Sun, and M. Berns. Fluorescence correlation spectroscopy investigation of a GFP mutant-enhanced cyan fluorescent protein and its tubulin fusion in living cells with two-photon excitation. J. Biomed. Opt., 9(2):395–403, 2004.

[169] Z. Petrášek and P. Schwille. Precise measurement of diffusion coefficients using scanning fluorescence correlation spectroscopy. Biophys. J., 94(8):1437–1448, 2008.

[170] Z.X. Wu, M. Zhao, L. Xia, Y. Yu, S.M. Shen, S.F. Han, H. Li, T.D. Wang, G.Q. Chen, and L.S. Wang. PKCδ enhances C/EBPα degradation via inducing its phosphorylation and cytoplasmic translocation. Biochem. Biophys. Res. Commun., 433(2):220–225, 2013.

[171] D. Ramji and P. Foka. CCAAT/enhancer binding proteins: Structure, function and regulation. Biochem. J., 365(3):561–575, 2002.

[172] P. Hemmerich, L. Schmiedeberg, and S. Diekmann. Dynamic as well as stable protein interactions contribute to genome function and maintenance. Chromosome Res., 19(1):131–151, 2011.

[173] L. Schmiedeberg, K. Weisshart, S. Diekmann, G. Meyer zu Hoerste, and P. Hemmerich. High- and low-mobility populations of HP1 in heterochromatin of mammalian cells. Mol. Biol. Cell, 15(6):2819–2833, 2004.

[174] Y.M. Wang and R.H. Austin. Single-molecule imaging of LacI diffusing along nonspecific DNA. In M.C. Williams and J.L. Maher (Eds.), Biophysics of DNA-protein interactions from single molecules to biological systems. Springer, 2010.

[175] J.J. de Rooi, C. Ruckebusch, and P.H.C. Eilers. Sparse deconvolution in one and two dimensions: Applications in endocrinology and single-molecule fluorescence imaging. Anal. Chem., 86(13):6291–6298, 2014.

[176] Y. Brody, N. Neufeld, N. Bieberstein, S.Z. Causse, E.M. Böhnlein, K.M. Neugebauer, X. Darzacq, and Y. Shav-Tal. The in vivo kinetics of RNA polymerase II elongation during co-transcriptional splicing. PLoS Biol., 9(1):e1000573, 2011.

[177] E.G. Anderson and A.A. Hoskins. Single molecule approaches for studying spliceosome assembly and catalysis. Spliceosomal Pre-mRNA Splicing: Methods and Protocols, 1126:217–241, 2014.

[178] S.G. Arunajadai and W. Cheng. Step detection in single-molecule real time trajectories embedded in correlated noise. PLoS ONE, 8(3):e59279, 2013.

[179] T. Aggarwal, D. Materassi, R. Davison, T. Hays, and M. Salapaka. Detection of steps in single molecule data. Cell. Mol. Bioeng., 5(1):14–31, 2012.

[180] G. Claeskens and N.L. Hjort. Model selection and model averaging. Cambridge University Press, 2008.

[181] H. Bozdogan. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370, 1987.

[182] K. Yamaoka, T. Nakagawa, and T. Uno. Application of Akaike’s information criterion (AIC) in the evaluation of linear pharmacokinetic equations. J. Pharmacokinet. Biopharm., 6(2):165–175, 1978.

[183] D. Posada and T.R. Buckley. Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst. Biol., 53(5):793–808, 2004.

[184] H. Bozdogan. Akaike’s information criterion and recent developments in information complexity. J. Math. Psychol., 44(1):62–91, 2000.

[185] K. Okamoto and Y. Sako. Variational Bayes analysis of a photon-based hidden Markov model for single-molecule FRET trajectories. Biophys. J., 103(6):1315–1324, 2012.

[186] S.C. Kou, B.J. Cherayil, W. Min, B.P. English, and X.S. Xie. Single-molecule Michaelis-Menten equations. J. Phys. Chem. B, 109(41):19068–19081, 2005.

[187] P.R. Barber, S.M. Ameer-Beg, S. Pathmananthan, M. Rowley, and A.C.C. Coolen. A Bayesian method for single molecule, fluorescence burst analysis. Biomed. Opt. Express, 1(4):1148–1158, 2010.

[188] N. Zarrabi, S. Ernst, B. Verhalen, S. Wilkens, and M. Börsch. Analyzing conformational dynamics of single P-glycoprotein transporters by Förster resonance energy transfer using hidden Markov models. Methods, 66(2):168–179, 2014.

[189] K.P. Burnham and D.R. Anderson. Model selection and multimodel inference: A practical information-theoretic approach. Springer, 2003.

[190] M.H. Hansen and B. Yu. Model selection and the principle of minimum description length. J. Am. Stat. Assoc., 96(454):746–774, 2001.

[191] J.J. Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory, 42(1):40–47, 1996.

[192] V. Balasubramanian. MDL, Bayesian inference, and the geometry of the space of probability distributions. In P. Grünwald, I.J. Myung, and M. Pitt (Eds.), Advances in minimum description length: Theory and applications. MIT Press, 2005.

[193] I.J. Myung, V. Balasubramanian, and M.A. Pitt. Counting probability distributions: Differential geometry and model selection. Proc. Natl. Acad. Sci., 97(21):11170–11175, 2000.

[194] R. Shibata. Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika, 63(1):117–126, 1976.

[195] R. Nishii. Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Stat., 12(2):758–765, 1984.

[196] S.I. Vrieze. Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol. Meth., 17(2):228–243, 2012.

[197] J. Kuha. AIC and BIC comparisons of assumptions and performance. Socio. Meth. Res., 33(2):188–229, 2004.

[198] S. Chen and P. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. Proc. DARPA Broadcast News Transcription and Understanding Workshop, 8:127–132, 1998.

[199] J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.

[200] A.C. Atkinson. Likelihood ratios, posterior odds and information criteria. J. Econometrics, 16(1):15–20, 1981.

[201] C.H. LaMont and P.A. Wiggins. The frequentist information criterion (FIC): The unification of information-based and frequentist inference. arXiv preprint arXiv:1506.05855, 2015.

[202] G.C. Chow. A comparison of the information and posterior probability criteria for model selection. J. Econometrics, 16(1):21–33, 1981.

[203] R. Shibata. Statistical aspects of model selection. Springer, 1989.

[204] P.A. Wiggins. An information-based approach to change-point analysis with applications to biophysics and cell biology. Biophys. J., 109(2):346–354, 2015.

[205] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (Eds.), International symposium on information theory. Springer, 1998.

[206] R.V. Hogg and A.T. Craig. Introduction to mathematical statistics. Macmillan, 1978.

[207] J. Chen and A.K. Gupta. On change point detection and estimation. Commun. Stat. Simulat. C., 30(3):665–697, 2001.

[208] P.R. Krishnaiah and B.Q. Miao. Review about estimation of change points. In Handbook of Statistics, 7:375–402, 1988.

[209] M. Basseville and I.V. Nikiforov. Detection of abrupt changes: Theory and application. Prentice Hall, 1993.

[210] J.B. Munro, A. Vaiana, K.Y. Sanbonmatsu, and S.C. Blanchard. A new view of protein synthesis: Mapping the free energy landscape of the ribosome using single-molecule FRET. Biopolymers, 89(7):565–577, 2008.

[211] S. Türkcan and J.B. Masson. Bayesian decision tree for the classification of the mode of motion in single-molecule trajectories. PloS ONE, 8(12):e82799, 2013.

[212] M. Hajdziona and A. Molski. Maximum likelihood-based analysis of single-molecule photon arrival trajectories. J. Chem. Phys., 134(5):054112, 2011.

[213] L.C.C. Elliott, M. Barhoum, J.M. Harris, and P.W. Bohn. Trajectory analysis of single molecules exhibiting non-Brownian motion. Phys. Chem. Chem. Phys., 13(10):4326–4334, 2011.

[214] L.P. Watkins and H. Yang. Detection of intensity change points in time-resolved single-molecule measurements. J. Phys. Chem. B, 109(1):617–628, 2005.

[215] Y. Chen, N.C. Deffenbaugh, C.T. Anderson, and W.O. Hancock. Molecular counting by photobleaching in protein complexes with many subunits: Best practices and application to the cellulose synthesis complex. Mol. Biol. Cell, 25(22):3630–3642, 2014.

[216] M.A. Little, B.C. Steel, F. Bai, Y. Sowa, T. Bilyard, D.M. Mueller, R.M. Berry, and N.S. Jones. Steps and bumps: Precision extraction of discrete states of molecular machines. Biophys. J., 101(2):477–485, 2011.

[217] J. Chen and Y.P. Wang. A statistical change point model approach for the detection of DNA copy number variations in array CGH data. IEEE/ACM Trans. Comput. Biol. Bioinf., 6(4):529–541, 2009.

[218] D.R. Cooper, D.M. Dolino, H. Jaurich, B. Shuang, S. Ramaswamy, C.E. Nurik, J. Chen, V. Jayaraman, and C.F. Landes. Conformational transitions in the glycine-bound GluN1 NMDA receptor LBD via single-molecule FRET. Biophys. J., 109(1):66–75, 2015.

[219] H. Yang. Change-point localization and wavelet spectral analysis of single-molecule time series. In Single-Molecule Biophysics: Experiment and Theory, Vol. 146. John Wiley & Sons, 2011.

[220] J.N. Taylor, D.E. Makarov, and C.F. Landes. Denoising single-molecule FRET trajectories with wavelets and Bayesian inference. Biophys. J., 98(1):164–173, 2010.

[221] Y. Wang. Jump and sharp cusp detection by wavelets. Biometrika, 82(2):385–397, 1995.

[222] M.A. Little and N.S. Jones. Generalized methods and solvers for noise removal from piecewise constant signals. I. Background theory. Proc. R. Soc. A-Math. Phys. Eng. Sci., 467(2135):3088–3114, 2011.

[223] M.A. Little and N.S. Jones. Generalized methods and solvers for noise removal from piecewise constant signals. II. New methods. Proc. R. Soc. A-Math. Phys. Eng. Sci., 467(2135):3115–3140, 2011.

[224] M.A. Little and N.S. Jones. Signal processing for molecular and cellular biological physics: An emerging field. Phil. Trans. R. Soc. A, 371(1984):20110546, 2013.

[225] R. Killick, P. Fearnhead, and I.A. Eckley. Optimal detection of changepoints with a linear computational cost. JASA, 107(500):1590–1598, 2012.

[226] K. Frick, A. Munk, and H. Sieling. Multiscale change point inference. J. R. Stat. Soc. Ser. B Stat. Methodol., 76(3):495–580, 2014.

[227] S. Hu. Akaike information criterion. Cent. Res. Sci. Comput., 2007.

[228] J. Chen, A.K. Gupta, and J. Pan. Information criterion and change point problem for regular models. Sankhya: The Indian Journal of Statistics, 68(2):252–282, 2006.

[229] A. Gelman, J. Hwang, and A. Vehtari. Understanding predictive information criteria for Bayesian models. Stat. Comput., 24(6):997–1016, 2014.

[230] C.H. LaMont and P.A. Wiggins. The development of an information criterion for change-point analysis. Neural Comput., 28(3):594–612, 2016.

[231] C.H. LaMont and P.A. Wiggins. Information-based inference for sloppy and singular models. arXiv preprint arXiv:1506.05855, 2015.

[232] S. Watanabe. A widely applicable Bayesian information criterion. J. Mach. Learn. Res., 14(1):867–897, 2013.

[233] S. Balakrishnan, H. Kamisetty, J.G. Carbonell, S.I. Lee, and C.J. Langmead. Learning generative models for protein fold families. Proteins: Struct., Funct., Bioinf., 79(4):1061–1078, 2011.

[234] R.H. Goldsmith and W.E. Moerner. Watching conformational- and photodynamics of single fluorescent proteins in solution. Nat. Chem., 2(3):179–186, 2010.

[235] R. Krishnan, M.R. Blanco, M.L. Kahlscheuer, J. Abelson, C. Guthrie, and N.G. Walter. Biased Brownian ratcheting leads to pre-mRNA remodeling and capture prior to first-step splicing. Nat. Struct. Mol. Biol., 20(12):1450–1457, 2013.

[236] M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. The infinite hidden Markov model. In Advances in neural information processing systems. MIT Press, 2001.

[237] S. Preus, S.L. Noer, L.L. Hildebrandt, D. Gudnason, and V. Birkedal. iSMS: Single-molecule FRET microscopy software. Nat. Meth., 12(7):593–594, 2015.

[238] J.W. van de Meent, J.E. Bronson, C.H. Wiggins, and R.L. Gonzalez. Empirical Bayes methods enable advanced population-level analyses of single-molecule FRET experiments. Biophys. J., 106(6):1327–1337, 2014.

[239] S. Johnson, J.W. van de Meent, R. Phillips, C.H. Wiggins, and M. Lindén. Multiple LacI-mediated loops revealed by Bayesian statistics and tethered particle motion. Nucl. Acids Res., gku563, 2014.

[240] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Stat., 1(2):209–230, 1973.

[241] P. Orbanz and Y.W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. Springer, 2011.

[242] Z. Ghahramani. Bayesian non-parametrics and the probabilistic approach to modelling. Phil. Trans. R. Soc. A, 371(1984):20110553, 2013.

[243] Y.W. Teh. Dirichlet process. In Encyclopedia of Machine Learning. Springer, 2011.

[244] S.J. Gershman and D.M. Blei. A tutorial on Bayesian nonparametric models. J. Math. Psychol., 56(1):1–12, 2012.

[245] P. Yau, R. Kohn, and S. Wood. Bayesian variable selection and model averaging in high-dimensional multinomial nonparametric regression. J. Comp. Graph. Stat., 12(1):23–54, 2003.

[246] E.G. Phadia. Prior processes and their applications. Springer, 2013.

[247] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, 2006.

[248] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. J. Comp. Graph. Stat., 9(2):249–265, 2000.

[249] S. Kim, M.G. Tadesse, and M. Vannucci. Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93(4):877–893, 2006.

[250] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650, 1994.

[251] J. Pitman. Poisson–Dirichlet and GEM invariant distributions for split-and-merge transformations of an interval partition. Comb. Probab. Comput., 11(5):501–514, 2002.

[252] A.Y. Lo. On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Stat., 12(1):351–357, 1984.

[253] C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat., 2(6):1152–1174, 1974.

[254] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. J. Comp. Graph. Stat., 9(2):249–265, 2000.

[255] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. An HDP-HMM for systems with state persistence. In Proc. ICML. ACM, 2008.

[256] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Trans. Signal Process., 59(4):1569–1585, 2011.

[257] J. Van Gael, Y. Saatci, Y.W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. Proc. Int. Conf. Machine Learning, 2008.

[258] N. Slonim, G.S. Atwal, G. Tkacik, and W. Bialek. Information-based clustering. Proc. Natl. Acad. Sci., 102(51):18297–18302, 2005.

[259] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. Proc. 37th Annu. Allerton Conf. Commun. Control Comput., 1999.

[260] S. Still and W. Bialek. How many clusters? An information-theoretic perspective. Neural Comp., 16(12):2483–2506, 2004.

[261] L. Kantorovitch. On the translocation of masses. Manag. Sci., 5(1):1–4, 1958.

[262] R.E. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory, 18(4):460–473, 1972.

[263] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory, 18(1):14–20, 1972.

[264] I. Borg and P.J.F. Groenen. Modern multidimensional scaling: Theory and applications. Springer, 2010.

[265] S.V. Krivov and M.J. Karplus. Free energy disconnectivity graphs: Application to peptide models. J. Chem. Phys., 117(23):10894–10903, 2002.

[266] S.V. Krivov and M.J. Karplus. Hidden complexity of free energy surfaces for peptide (protein) folding. Proc. Natl. Acad. Sci., 101(41):14766–14770, 2004.

[267] C.F. Landes, A. Rambhadran, J.N. Taylor, F. Salatan, and V. Jayaraman. Structural landscape of isolated agonist-binding domains from single AMPA receptors. Nat. Chem. Biol., 7(3):168–173, 2011.

[268] S. Ramaswamy, D.R. Cooper, N.K. Poddar, D.M. MacLean, A. Rambhadran, J.N. Taylor, H. Uhm, C.F. Landes, and V. Jayaraman. Role of conformational dynamics in α-amino-3-hydroxy-5-methylisoxazole-4-propionic acid (AMPA) receptor partial agonism. J. Biol. Chem., 287(52):43557–43564, 2012.

[269] N. Armstrong and E. Gouaux. Mechanisms for activation and antagonism of an AMPA-sensitive glutamate receptor: Crystal structures of the GluR2 ligand binding core. Neuron, 28(1):165–181, 2000.

[270] K. Poon, A.H. Ahmed, L.M. Nowak, and R.E. Oswald. Mechanisms of modal activation of GluA3 receptors. Mol. Pharmacol., 80(1):49–59, 2011.

[271] A. Lau and B. Roux. The free energy landscapes governing conformational changes in a glutamate receptor ligand-binding domain. Structure, 15(10):1203–1214, 2007.

[272] A.Y. Lau and B. Roux. The hidden energetics of ligand binding and activation in a glutamate receptor. Nat. Struct. Mol. Biol., 18(3):283–287, 2011.

[273] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63(2):105–108, 1989.

[274] J. P. Crutchfield. Between order and chaos. Nat. Phys., 8(1):17–24, 2012.

[275] C.R. Shalizi and J.P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104(3):817–879, 2001.

[276] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Stat. Sci., 10(3):273–304, 1995.

[277] P. Sebastiani and H.P. Wynn. Bayesian experimental design and Shannon information. Proc. Sec. Bay. Stat. Sci., 44:176–181, 1997.

[278] P. Sebastiani and H.P. Wynn. Maximum entropy sampling and optimal Bayesian experimental design. J. Roy. Stat. Soc. B, 62(1):145–157, 2000.

[279] D.S. Talaga. Information theoretical approach to single-molecule experimental design and interpretation. J. Phys. Chem. A, 110(31):9743–9757, 2006.

[280] D.S. Talaga. Information-theoretical analysis of time-correlated single-photon counting measurements of single molecules. J. Phys. Chem. A, 113(17):5251–5263, 2009.

[281] W. Bialek, I. Nemenman, and N. Tishby. Predictability, complexity, and learning. Neural Comp., 13(11):2409–2463, 2001.

[282] S. Pressé. Nonadditive entropy maximization is inconsistent with Bayesian updating. Phys. Rev. E, 90(5):052149, 2014.

[283] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52(1–2):479–487, 1988.

[284] S. Abe. Generalized molecular chaos hypothesis and the H theorem: Problem of constraints and amendment of nonextensive statistical mechanics. Phys. Rev. E, 79(4):041116, 2009.

[285] S. Abe. Instability of q-averages in nonextensive statistical mechanics. Euro. Phys. Lett., 84(6):60006, 2008.

[286] R. Hanel, S. Thurner, and C. Tsallis. On the robustness of q-expectation values and Rényi entropy. Euro. Phys. Lett., 85(2):20005, 2009.

[287] S. Abe. Anomalous behavior of q-averages in nonextensive statistical mechanics. J. Stat. Mech.: Theor. Expmt., 2009(7):P07027, 2009.

[288] A. Rényi. On measures of entropy and information. Proc. Sym. Math. Stat. Prob., 4:547–561, 1961.

[289] R. Hanel and S. Thurner. A comprehensive classification of complex statistical systems and an axiomatic derivation of their entropy and distribution functions. Euro. Phys. Lett., 93:20006, 2011.

[290] P. Jizba and T. Arimitsu. Towards information theory for q-nonextensive statistics with q-deformed distributions. Physica A: Stat. Mech. App., 365(2):76–84, 2006.

[291] M. Hotta and I. Joichi. Composability and generalized entropy. Phys. Lett. A, 262(4):302–309, 1999.

[292] C. Tsallis. Non-extensive thermostatistics: Brief review and comments. Physica A, 221(1):277–290, 1995.

[293] E.M.F. Curado and C. Tsallis. Generalized statistical mechanics: Connection with thermodynamics. J. Phys. A: Math. Gen., 24(2):L69–L72, 1991.

[294] R.J.V. dos Santos. Generalization of Shannon’s theorem for Tsallis entropy. J. Math. Phys., 38(8):4104–4107, 1997.

[295] S. Abe. Axioms and uniqueness theorem for Tsallis entropy. Phys. Lett. A, 271(1):74–79, 2000.

[296] C. Tsallis, M. Gell-Mann, and Y. Sato. Asymptotically scale-invariant occupancy of phase space makes the entropy Sq extensive. Proc. Natl. Acad. Sci., 102(43):15377–15382, 2005.

[297] S. Abe. General pseudoadditivity of composable entropy prescribed by the existence of equilibrium. Phys. Rev. E, 63(3):061105, 2001.

[298] G. Wilk and Z. Wlodarczyk. Interpretation of the nonextensivity parameter q in some applications of Tsallis statistics and Lévy distributions. Phys. Rev. Lett., 84(13):2770–2773, 2000.

[299] C. Beck. Dynamical foundations of nonextensive statistical mechanics. Phys. Rev. Lett., 87(18):180601, 2001.

[300] A.N. Gorban and I.V. Karlin. Family of additive entropy functions out of thermodynamic limit. Phys. Rev. E, 67(1):016104, 2003.
