Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and...

14
Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol 1 and Nick Goldman 2 1 Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria 2 European Bioinformatics Institute, EMBL Outstation-Hinxton, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Received 14 September 2010; received in revised form 28 May 2011; accepted 3 June 2011 Available online 21 June 2011 Edited by M. Sternberg Keywords: protein evolution; amino acid substitution models; codon models; aggregated Markov models; rate heterogeneity Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models. © 2011 Elsevier Ltd. Introduction In 1968, Dayhoff et al. introduced a first model of protein sequence evolution, resulting in the devel- opment of the widely used amino acid replacement matrices known as the PAM matrices. 1 Since Day- hoff's PAM matrices, there have been increasingly good descriptions of the average patterns and processes of evolution of large collections of protein sequences, as well as more and more specialized matrices considering functional and structural prop- erties of proteins. 25 Such models are widely used in comparative sequence analyses. 6,7 Mathematically speaking, all these models are time-homogenous Markov models defined by the assumption that each amino acid evolves indepen- dently of time and of its past history. The instanta- neous rate matrix, which represents the patterns of the substitution process and specifies the model completely, is the same at any time in a time- homogenous model. 6 In fact, if the rate matrix can be written as the product of a scalar function of time and a constant matrix, that is, Q(t)= r(t)Q, then r(t) can be interpreted as an overall rate of evolution varying over time and Q can be interpreted as a constant pattern of amino acid replacements. In this case, the overall evolutionary rate and time are confounded, 8,9 and over any time period [t 0 ,t 1 ], the process so defined cannot be distinguished from that defined by the time-homogenous Qt ðÞ = P rQ, where P r is the mean rate in that period, equal to R t 1 t 0 rt ðÞdt = t 1 t 0 ð Þ, if it is only observed at t 0 and t 1 . 10 In this paper, we refer to Markov processes defined *Corresponding author. E-mail address: [email protected]. Abbreviations used: AMP, aggregated Markov process; MRCA, most recent common ancestor; HMM, dden Markov model. doi:10.1016/j.jmb.2011.06.005 J. Mol. Biol. (2011) 411, 910923 Contents lists available at www.sciencedirect.com Journal of Molecular Biology journal homepage: http://ees.elsevier.com.jmb 0022-2836 © 2011 Elsevier Ltd. Open access under CC BY license. Open access under CC BY license.

Transcript of Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and...

Page 1: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

doi:10.1016/j.jmb.2011.06.005 J. Mol. Biol. (2011) 411, 910–923

Contents lists available at www.sciencedirect.com

Journal of Molecular Biologyj ourna l homepage: ht tp : / /ees .e lsev ie r.com. jmb

Markovian and Non-Markovian Protein SequenceEvolution: Aggregated Markov Process Models

Carolin Kosiol1⁎ and Nick Goldman2

1Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria2European Bioinformatics Institute, EMBL Outstation-Hinxton, Wellcome Trust Genome Campus, Hinxton,Cambridge CB10 1SD, UK

Received 14 September 2010;received in revised form28 May 2011;accepted 3 June 2011Available online21 June 2011

Edited by M. Sternberg

Keywords:protein evolution;amino acid substitutionmodels;codon models;aggregated Markov models;rate heterogeneity

*Corresponding author. E-mail [email protected] used: AMP, aggreg

MRCA, most recent common ancestMarkov model.

0022-2836 © 2011 Elsevier Ltd.Open acc

Over the years, there have been claims that evolution proceeds according tosystematically different processes over different timescales and that proteinevolution behaves in a non-Markovian manner. On the other hand, Markovmodels are fundamental to many applications in evolutionary studies.Apparent non-Markovian or time-dependent behavior has been attributedto influence of the genetic code at short timescales and dominance ofphysicochemical properties of the amino acids at long timescales. However,any long time period is simply the accumulation of many short timeperiods, and it remains unclear why evolution should appear to actsystematically differently across the range of timescales studied. We showthat the observed time-dependent behavior can be explained qualitativelyby modeling protein sequence evolution as an aggregated Markov process(AMP): a time-homogeneous Markovian substitution model observed onlyat the level of the amino acids encoded by the protein-coding DNAsequence. The study of AMPs sheds new light on the relationship betweenamino acid-level and codon-level models of sequence evolution, and ourresults suggest that protein evolution should be modeled at the codon levelrather than using amino acid substitution models.

© 2011 Elsevier Ltd. Open access under CC BY license.

Introduction

In 1968, Dayhoff et al. introduced a first model ofprotein sequence evolution, resulting in the devel-opment of the widely used amino acid replacementmatrices known as the PAM matrices.1 Since Day-hoff's PAM matrices, there have been increasinglygood descriptions of the average patterns andprocesses of evolution of large collections of proteinsequences, as well as more and more specializedmatrices considering functional and structural prop-erties of proteins.2–5 Such models are widely used incomparative sequence analyses.6,7

ress:

ated Markov process;or; HMM, dden

ess under CC BY license.

Mathematically speaking, all these models aretime-homogenous Markov models defined by theassumption that each amino acid evolves indepen-dently of time and of its past history. The instanta-neous rate matrix, which represents the patterns ofthe substitution process and specifies the modelcompletely, is the same at any time in a time-homogenous model.6 In fact, if the rate matrix can bewritten as the product of a scalar function of timeand a constant matrix, that is, Q(t)= r(t)Q, then r(t)can be interpreted as an overall rate of evolutionvarying over time and Q can be interpreted as aconstant pattern of amino acid replacements. In thiscase, the overall evolutionary rate and time areconfounded,8,9 and over any time period [t0,t1], theprocess so defined cannot be distinguished fromthat defined by the time-homogenous Q tð Þ = Pr Q,where Pr is the mean rate in that period, equal toR t1t0r tð Þdt= t1 − t0ð Þ, if it is only observed at t0 and t1.

10

In this paper, we refer to Markov processes defined

Page 2: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

911Aggregated Markov Models

by instantaneous rate matrices that can be written inthe form r(t)Q as time-homogeneous. While notprecisely accurate, it is a convenient shorthand for aclass ofMarkov processes that cannot be distinguishedfrom time-homogeneous ones using the available data.For sequence evolution on a phylogenetic tree,

imagine that, after a speciation (or gene duplication)event, a pair of sequences evolves from theircommon ancestor according to a time-homogeneousMarkov model. After some time, we may measurethe differences and the divergence level between thetwo sequences, and because the model is timehomogeneous, the sequences will continue to evolveaccording to the same process, leading to moredifferences and higher divergence levels. This modelimplies that the patterns of substitutions takingplace are the same at low and high sequencedivergences. Even if the overall rate of evolutionvaries between lineages (i.e., the instantaneous ratematrix varies by a constant multiplicative factor)or over time within lineages, a properly imple-mented inference procedure is able to infer theconstant patterns of evolutionary changes.11 How-ever, while most work in phylogenetic modelinghas concentrated on devising improved Markovmodels, some criticisms have been directed at themodels' time-homogeneous and Markov naturesthemselves.Henikoff and Henikoff derived a series of BLO-

SUM matrices, which are probability matrices butare not based on a Markov model.12 They countedall the amino acid replacements between conservedsubblocks of aligned protein sequences from manydifferent protein families in the BLOCKS database.The subblocks were made by single-linkage cluster-ing about a percentage identity threshold, anddifferent matrices were obtained by varying thisthreshold. The matrices of the BLOSUM series areidentified by a number after the matrix (e.g.,BLOSUM62), which refers to the percentage identityof the subblocks of multiple aligned amino acidsused to construct the matrix. Although thesepercentages indicate different divergence levelsbetween the aligned proteins that give rise to eachmatrix, there is no assumption of common patternsof amino acid change over evolutionary time: theBLOSUMmatrices are not based on an evolutionarymodel, and it is not possible to generate theBLOSUM matrix series simply by interpolating orextrapolating.BLOSUMmatrices often perform better than PAM

matrices for the purpose of amino acid sequencealignment or database searching.12 This may bebecause protein sequences behave in a non-time-homogeneous or non-Markov manner, a hypothesisthat could have serious consequences for the fieldsof maximum likelihood and Bayesian phylogenetics,which are based on time-homogeneous Markovmodels.

Mitchison and Durbin tried to find one global andconstant instantaneous rate matrix Q that couldgenerate, as an exponential family (see below), aseries of protein replacement matrices they hadestimated empirically (also from BLOCKS).13 Thiswould have been a time-homogeneous Markovprocess explanation of their observations, but theycould not find aQ that gave a good fit. Furthermore,Benner et al. (hereafter referred to as BCG) inferredprotein replacement matrices from sets of sequencesseparated by different divergence levels and foundqualitative differences in the substitution patterns.14

They concluded that the evolutionary processchanged as a function of sequence divergence, thatthe assumption that high divergence can be mod-eled by extrapolating the patterns of low sequencedivergence does not hold and that amino acidsequence evolution is non-Markovian.

Thought experiments that expose the fallacythat evolution is different depending on whenit is observed

Figure 1a illustrates the assumption made bytime-homogeneous Markov models. All patterns ofchange are constant over all evolutionary time,represented by one shade of red along thebranches of the phylogenies that feature in anysequence comparison (colored branches are sam-pled; uncolored branches are not). The “eyes” onthe right and the associated horizontal linesindicate what would be observed at two differenttime points (denoted t0 and t1). For this simplemodel, the process observed back in time to theancestor is the same from any time point (e.g.,black eye at t0 cf. gray eye at t1) and regardless ofwhich sequences are compared (comparisons A–F).BCG's finding that amino acid evolutionary pat-terns appear different depending on the diver-gence level of proteins compared implies that sucha simple time-homogeneous Markov assumptioncannot be correct.BCG formulated perhaps the most detailed

criticism of standard Markov models. Althoughthey observed different patterns of replacement fordifferent divergence levels, their explicit rejection ofMarkovian evolution14 is unfounded, as they didnot explore the possibility of time-dependent Mar-kov processes or the extension of the state space torecover the Markov property.15 Nevertheless, as somuch current phylogenetic theory relies on time-homogeneous Markov processes, it is important tosee if such models can be reconciled with BCG's andMitchison and Durbin's observations.Invoking different evolutionary models for every

different sequence is of little interest: it is only thegeneral applicability of a particular model thatmakes it widely useful. Figure1b represents onesimple possible explanation of replacement patterns

Page 3: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

(a) (b)

(c) (d)

time

/ evo

lutio

nary

dis

tanc

etim

e / e

volu

tiona

ry d

ista

nce

time

/ evo

lutio

nary

dis

tanc

etim

e / e

volu

tiona

ry d

ista

nce

Fig. 1. Observing evolutionary processes at different time points. Throughout the figure, different colors representdifferent substitution processes along a branch, and we consider observing different sequence pairs (colored lineages) attime t0 (black eye) and at later time t1 (gray eye). (a) Simple time-homogeneous Markov model. (b) Replacement dynamicsdependent on time since MRCA, color coded according to spectrum to the left of panel. (c) Approach of Benner et al.:replacement dynamics dependent on speciation and duplication events and subsequent elapsed time. The “callout”(green box) with asterisks indicates lineage splits that are not observed when pairwise comparisons B and E are used. (d)Modeling the “average” process using a time-homogenous approach: process changes without trend are indicated byrandom change in color patterns. Note that, for all the processes illustrated in (a)–(d), we assume that corrections for theoccurrence of multiple substitutions have been made so that the “snapshot” we take at different times should not lead todifferent observations caused by multiple substitutions.

912 Aggregated Markov Models

being different for different divergence levels. Thismodel is time-homogeneous Markov but has adifferent pattern of replacements according to thetime of their most recent common ancestor (MRCA).This is illustrated using shades from blue to violetfor high divergence levels (Fig. 1b; e.g., comparisonsA and E), bright red for medium divergence levels (Band F) and pale-red shades for low divergence levels(C and D), as observed at the level of the black eye(t0). The time axis is in effect linked with a color scaledistinguishing different evolutionary processes fordifferent divergence levels before the present. Thisinterpretation is, however, problematic for tworeasons. First, notice that the comparison labeled Aobserved at t0 represents a level of divergence that isdifferent from that of comparison B observed at thesame time, as indicated by their different colors, butimagine time elapsing until t1, the level of the grayeye. Comparison B is now equivalent to the earlierobservation A, yet the pattern of changes observedin A (relative to the black eye) and B (gray eye) donot match. This thought experiment regardingobservation points that differ in time shows that

this model cannot give rise to observed patterns ofchanges characterized by the time since sequences'evolutionary divergence. The same argument can bemade by contrasting, for example, comparison Bmade at t0 and comparison D made at t1.Second, still referring to Fig. 1b, we highlight the

point that the evolutionary histories relating a pairof sequences do not correspond unambiguously toone divergence at a unique position in evolutionarytime. Imagine that, in Fig. 1b, the two trees representthe evolution of the same set of sequences, but withdifferent pairwise comparisons highlighted. Differ-ent choices of comparisons (e.g., A and B in the left-hand tree and E in the right-hand tree) includesequences with common history yet differentevolutionary pattern because the pairwise diver-gence is greater. This is inconsistent, further illus-trating that sequences' evolutionary histories are notuniquely associated with one specific divergencelevel. The same point is made by consideringcomparisons F, C and D. BCG's observations cannotbe consistent with evolutionary dynamics that areconstant over time (Fig. 1a and b).

Page 4: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

913Aggregated Markov Models

A more sophisticated interpretation of BCG'sconclusions can remove some of the inconsistencieshighlighted in Fig. 1b by having the evolutionaryprocess change over time. If all protein sequencesevolved in a concerted fashion, each undergoingidentical substitution dynamics at the same pointin actual (clock) time and with those dynamicsvarying over time, then inferred patterns of changecould be consistent with BCG's observations. Sucha model could be Markov (though clearly not time-homogeneous), but this level of synchronization ofevolutionary dynamics is, however, entirely unreal-istic. A more plausible argument would be thatpatterns of change could alter at points in the treewhere lineages split (e.g., duplication can creategene copies free from the same functional con-straints as their ancestors).16 BCG's conclusion wasthat such a scenario, with protein evolution transi-tioning from domination by the genetic code soonafter divergences to domination by amino acids'physicochemical properties at greater distances,could account for their observations. This hypothe-sis is represented in Fig. 1c as Markov evolutiondependent on time since the last lineage split. Ifevolutionary patterns depend only on the time sincethe MRCA of the sequences compared (Fig. 1c, A–F),then evolutionary dynamics would seem differentdepending on the compared sequences' divergencelevels. The time of observation (e.g., t0 or t1) does notalter this finding or lead to any inconsistency.However, only a small proportion of BCG's

pairwise comparisons will have gene duplication(as opposed to speciation) events as their MRCA.Only those few that do are likely to have alteredevolutionary patterns,17 and even these will gener-ally have been subject to these altered evolutionarypatterns only for a short time near to the duplicationevent.18 Further, an explanation such as this alsoassumes duplications to occur only at the MRCA ofeach observed pairwise comparison. However, thereis no guarantee that there have been no subsequentlineage splitting events after the MRCA of anobserved pair of sequences. Indeed, this too is anunlikely scenario: often the case will be as illustratedby comparisons B–E in Fig. 1c, which containmultiple lineage splits that are not observed (for Band E, highlighted by asterisks in the “callout”region of the figure) in addition to the one at theMRCA that is. Consequently, it is not tenable toinvoke an explanation of the observation of time-dependent protein evolution based on actual dupli-cation/speciation events, since there is no distinc-tion in BCG's data between sequence pairs that aretrue sister groups and those separated by interme-diate (unanalyzed or simply unobserved) descen-dants of the same common ancestor.In contrast, Fig. 1d illustrates our understanding

of the complexity of evolutionary dynamicsassessed over a large collection of related and

unrelated proteins. There are various differentprocesses (colors) in different lineages; there areboth gradual changes at different rates and abruptchanges, and these may or may not coincide withlineage splits. The position of changes in evolution-ary process is largely random with respect to theposition in the actual underlying or observed treesand is not coordinated from one lineage to another.In this case, the end result of observations taken atany time and of sequences of any level of divergencewill be the mixture of many processes. While eachmay be Markov and time-homogeneous, the overalleffect may be highly time-inhomogeneous on a per-lineage basis. However, estimated over large as-semblages of protein examples, the average inferredevolutionary dynamics may remain the same, withno biases induced by what sequences are observed,which pairs are compared or when our experimenttakes place.This series of thought experiments indicates that,

contrary to BCG's suggestion, in the most realisticcase, the “average” process should be time homo-geneous and should be the same if estimated fromenough sequences, irrespective of their divergencelevels and irrespective of whether it is estimated in1994, in 2011 or in one million or one hundredmillion years time. Long periods of evolution areno more than the accumulation of many shortperiods of evolution, unaware of when they will beobserved.

Alternative explanations of experimental results

The time-homogeneous approach is logicallyconsistent where BCG's explanation is not, but thesimple time-homogeneous models consideredabove are unable to explain the experimental resultsof Benner et al.14 and Mitchison and Durbin13 andthe success of the BLOSUMmatrices.12 Explanationsinvolving processes that are nonhomogeneous intime, relying (logically inconsistently) on specific(implausible) duplication and speciation events orinvoking complex switches of substitution dynamicshave also failed to explain BCG's observations.Accepting those observations but not necessarilytheir authors' conclusions, we investigate otherfactors that could have caused the differences ininferred evolutionary dynamics depending on ob-served divergence levels.Markov processes are very successful at modeling

the average behavior of collections of chance events.Trying to retain a time-homogeneous Markovframework while performing this investigation, inthis paper, we use aggregated Markov processes(AMPs)19 to model protein evolution as Markovianat the DNA (codon) level but observed (via thegenetic code) only at the amino acid level. Allprevious studies of non-Markovian behavior wereon the amino acid level only and often with

Page 5: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

914 Aggregated Markov Models

inference techniques that are less advanced thanthose now available.20 Evolution, however, occursat the DNA level. Furthermore, codon-level modelshave been tested and improved21,22 so that we canhope that we have adequate codon models to basethis study on. We show that many time-inhomoge-neous findings for protein evolution can beexplained by time-homogeneous Markov modelsof the evolution of codon sequences that areobserved at the level of amino acids.

Theory

Time-homogeneous Markov models forsequence evolution

The time-homogeneous Markov model assertsthat one protein sequence is derived from anotherby a series of independent mutations, each chang-ing one character in the ancestral sequence toanother character in its descendant during evolu-tion. We consider only models that assumeindependence of evolution at different sites. Acontinuous-time Markov process is defined by itsN×N instantaneous rate matrix Q=(Qij)i,j=1,…N,where N is the number of character states. Twotypes of character alphabets will be considered forprotein evolution here: amino acids (N=20) andcodons (N=61, if stop codons are discarded). Fori≠ j, the matrix entry Qij represents the instanta-neous rate of change from state i to state j,independently at each site. Our assumption oftime-homogeneity means that Qij are constant intime (or that any time dependence is through ascalar rate factor is described in Introduction).Changes at each site occur as a Poisson process

with these given rates while waiting in a particularstate. The total rate at which any change from thatstate occurs is the sum of all the rates of changesfrom that state, and this determines the waiting timein a given state before moving to another. The Qiientry of the matrix is set to be minus the sum of allother entries in that row, representing (−1 times) therate at which changes leave state i:

Qii = −XNj p i

Qij

Molecular sequence data consist of observedcharacter states at some given time, and the quantitymost commonly needed for calculations is theprobability of observing a given character afterevolutionary time t≥0 has elapsed. We denote byPij(t) the probability of a site being in state j after time t,given that the process started in state i at that site attime 0.We canwrite the probabilities Pij(t) as anN×N

matrix that we denote P(t) and that is determined viathe relationship23

P tð Þ = etQ ð1Þ

where the exponential of a matrix is defined by thefollowing power series, with I being the appropriateidentity matrix:

etQ = I + tQ +tQð Þ22!

+tQð Þ33!

+ : : : ð2Þ

In practice, this power series is calculated numer-ically using standard linear algebra techniques.24

The most popular method in molecular phylogenyuses eigen-decomposition.6 A series P(t1), P(t2), …that can be derived in this way from the sameinstantaneous rate matrix Q is referred to as anexponential family of matrices.If a Markov process is left evolving for a long time,

the probability of finding it in a given stateconverges to a value independent of the startingstate; this distribution is known as the equilibriumdistribution π=(π1,…,πN). The equilibrium distribu-tion π can be found by solving πP(t)=π for any tN0or equivalently23 πQ=0.Time (t) and rate (Qij) are confounded, and

without extrinsic information, only their productcan be inferred.8,9 Consequently, we can normalizethe instantaneous rate matrix with any factor.Typically in phylogenetic applications, Q is normal-ized so that the mean rate of replacement atequilibrium (∑i∑j≠ iπiQij) is 1, which means thattimes (evolutionary distances) are measured in unitsof expected substitutions per site.

Amino acid models

For the amino acid substitution models in thispaper, we assume that amino acid sites in analignment evolve independently according to thesame reversible Markov process defined by a 20×20instantaneous rate matrix. Dayhoff et al. introducedthe first amino acid model in the form of asubstitution probability matrix. However, an instan-taneous rate matrix Q can easily be calculated fromsuch a probability matrix.11 In this study, we use theDayhoff model provided in the phylogenetic soft-ware package PAML;25 other common models givequalitatively similar results. Following commonpractice, we often refer to the “PAM distance”. Thiseffectively corresponds to 100× the expected numberof amino acid replacements per amino acid site.11

Codon models

Markov models of codon substitution were firstproposed by Goldman and Yang21 and Muse andGaut.26 In this paper, we mainly refer to the model

Page 6: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

915Aggregated Markov Models

M0 from Yang et al.22 For this model, the elements ofQ are defined as:

Qij;i p j =

0 if i or j is a stop codon orthe change iYj requires N 1 nucleotidesubstitution

kj if iYj is a synonymous transversionkjn if iYj is a synonymous transitionkjN if iYj is a nonsynonymous transversionkjnN if iYj is a nonsynonymous transition

8>>>>>>>><>>>>>>>>:

ð3Þ

where the parameter κ is the transition/transversionrate ratio, ω is the nonsynonymous/synonymousrate ratio and πj is the equilibrium frequency foreach codon j. Because of the interpretation of theparameter ω as a bias toward (ωN1) or away from(ωb1) nonsynonymous changes, this model and itsvariants are widely used in the detection of naturalselection.22,27,28 For amino acid models and codonmodels, variation of rates among sites in proteinshas been modeled. Often, a discretized Gammadistribution of rates is considered,29 and we use thisand similar approaches below.As we often discuss amino acid-level and codon-

level processes together, we help distinguish thesecontexts by using subscripts i, j,… for codons andx, y,… for amino acids.

Aggregated Markov processes

Protein sequence evolution of amino acids, as wellas of codons, has been modeled using Markovprocesses. However, the evidence found againsttime-homogeneous Markov models has all derivedfrom amino acid-level analyses of protein sequences,whereas protein evolution occurs at the level ofcoding DNA. Here, we consider a model that isMarkov on the underlying codon level, butwhichweinterpret as its corresponding amino acids.Suppose, for example, that between times tk and

tk+1, the following substitution has occurred in acoding region:

We assume that such substitutions are generatedby a continuous-time time-homogeneous Markovprocess {X(t), t≥0} on the codon level with state

space C={AAA,AAC,…,TTG,TTT} and equilibriumdistribution π and probability matrices P(t)=eQt asabove. Further, we suppose that the codons are notdirectly observable but that a deterministic functionof the underlying Markov process [i.e., Y(t)= f(X(t)),where f maps the state space C to the aggregate setA={A,R,N,…,V}], can be observed. Clearly, weconsider observing the amino acids encoded by thecodons, with f defined by the universal genetic code.The observable process of amino acids {Y(t), t≥0} isthen called an AMP.19 The dependence structure forthe site highlighted in the example above isrepresented by the following graph:

Given only the amino acid-level observations Y(t),it is impossible to tell whether the substitution ofleucine (L) with proline (P) was caused by asubstitution from CTT to CCT, from CTC to CCC,fromCTA to CCA or fromCTG to CCG. [We assumeonly single-nucleotide changes; a more generalmodel might permit double- and triple-nucleotidechanges instantaneously (e.g., CTT→CCA), whichresults in a larger set of codon substitutionscompatible with each amino acid replacement.30]Consequently, the probability of a change to proline(P) does depend not only on the present amino acidleucine (L) but also on the hidden state X(tk). Thestochastic process Y(t) describing the amino acidevolution is therefore non-Markovian.More formally, AMPs are a subclass of hidden

Markov models (HMMs); HMMs also allow theobserved Y(t) to be probabilistically determinedgiven X(t). The theory of HMMs says that thestochastic process X(t) on the state space is Markovbut that the observable processY(t) is non-Markov.31

It is therefore clear that a Markov process of codonevolution will lead to non-Markovian observationsof amino acid sequence evolution. Below, we askwhether this can explain the time-dependent obser-vations that other authors have recorded.Yang et al. have described another way of deriving

an amino acid substitution model from a codonmodel.32 However, they constrain the amino acidmodel to be Markovian, fixing the rates of aminoacid changes equal to the total rates of all corre-sponding codon changes. Being Markovian byconstruction, such models cannot explain observa-tions of non-Markovian amino acid substitution.

Log-odds matrices and BCG's experiments

Rather than compare inferred instantaneous ratematrices Q, BCG illustrated their findings bydiscussing elements of the log-odds matrices (L)

Page 7: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

916 Aggregated Markov Models

often used as scoring matrices in database searchand alignment programs. Positive scores in a log-odds matrix designate a pair of residues that replaceeach other more often than expected by chance;negative scores designate pairs that replace eachother less than would be expected by chance. Log-odds matrices are related to probability matrices P(t)(see below), and similar to them, they depend on anevolutionary distance or time t. To make meaningfulcomparisons of log-odds matrices derived fromsequences with different divergence levels, oneneeds to carry out some normalization, and BCGachieved this by computing matrices standardizedto t=2.5 (250 PAM). Below, we adopt BCG'sprocedures to estimate probability matrices, calcu-late log-odds matrices and normalize them.BCG split 1.7 million pairwise aligned amino acid

sequences from the MIPS database33 into 10 setsbased on bands of divergence levels (4.7–6.4, 6.4–8.7,8.7–11.8, 11.8–16, 16–22, 22–29, 29–40, 40–54, 54–74and 74–100 PAMs). Denoting the average PAMdistance in each set by tk for k=1,…,10, for each k,they compiled a matrix of counts T(tk), where Txy(tk)is the number of substitutions from amino acid x toamino acid y observed in a given set of sequences,and a diagonal matrix N(tk), with Nxx(tk) the totalobserved number of amino acids of type x. Since,from a pairwise alignment, it is not possible to

0

2

–2

4

–4

–6

–2

–4

–6

0 10 20 30 40 50 60 70 80 90 100

WYWF

CMCV

PAM distance

0 10 20 30 40 50 60 70 80 90 100

Cro

ss te

rms

in lo

g od

ds m

atri

x

0

2

4

WFWYCMCV

Amino acid time t* (PAM distance)

Cro

ss te

rms

in lo

g od

ds m

atri

x

(a) (

(c) (

Fig. 2. Graphs of some cross terms of L(250) log-odds matricthe cross term (off-diagonal element) [L(250)]WF computed fromb) Graphs redrawn from Benner et al.,14 colored for clarity. (c a

decide whether a substitution is from x to y or from yto x, half of each substitution is counted in onedirection and half is counted in the other. For each ofthe PAM bands, BCG then estimated amino acidsubstitution matrices using the formula

P tkð Þ = T tkð Þ × N tkð Þ½ �−1

These matrices are each extrapolated to a diver-gence of 1 PAM (0.01 expected substitutions per site):

P 1 PAMð Þ = P t = 0:01ð Þ = P tkð Þ½ �1= tk

= T tkð Þ� N tkð Þ½ �−1h i1= tk ð4Þ

and converted to a 250-PAM (t=2.5) log-oddsmatrix:

Lxy 250ð Þ = 10log10Pxy 2:50ð Þ

fy

= 10log10P 0:01ð Þð Þ250

h ixy

fyfor all x p y ð5Þ

where fy=Nyy(tk)/∑zNzz (tk) is the frequency of aminoacid y in each data set.BCG illustrated their results by plotting values of

Lxy(250) for various amino acids x and y. Their

0 10 20 30 40 50 60 70 80 90 100

PAM distance

0 10 20 30 40 50 60 70 80 90 100

0

1

–1

2

–2

3

–3

CYWR

CRCW

Cro

ss te

rms

in lo

g od

ds m

atri

x

0

1

–1

2

3

4

5

6

7

Amino acid time t* (PAM distance)

WRCYCR

CW

Cro

ss te

rms

in lo

g od

ds m

atri

x

b)

d)

es. For example, the line labeledWF represents the value ofdata at divergence levels between 0 and 100 PAMs. (a andnd d) Results of simulations under an AMP model.

Page 8: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

(a)

(b)

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0

Prop

ortio

n of

sin

gle

base

cha

nges

50 60 70 8040302010

PAM

0 50 60 70 8040302010

PAM

ExperimentalSimpleMarkovProcess

AAMixtureProcess

MixtureAMPSimpleAMP

Experimental

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

Prop

ortio

n of

sin

gle

base

d ch

ange

s

Fig. 3. Comparison to Mitchison and Durbin's results.(a) Experimental data, simple time-homogeneous Mar-kovian and mixture of time-homogeneous Markov pro-cesses on the amino acid level. (b) Experimental data andaggregated processes of a codon-level time-homogeneousMarkov process both with and without rate heterogeneity.

917Aggregated Markov Models

results are reproduced in Fig. 2a and b. These clearlyindicate that certain elements of the log-oddsmatrices vary as a function of the PAM distance ofthe sequences from which they were estimated,behavior that is inconsistent with time-homoge-neousMarkovian evolution. Looking for a biologicalexplanation for these findings, BCG proposed aninterpretation that the genetic code influencesprotein evolution strongly at early stages of diver-gence, while physicochemical properties are domi-nant at later stages. For example, they inferred fromthe values of LCW(250) (Fig. 2b) that substitutionsfrom cysteine (C) to tryptophan (W) are frequent atsmall PAM distances because only a single basechange is necessary (TGC or TGT to TGG), whereasat larger PAM distances, these substitutions areinfrequent because the side chain of tryptophan (W)is large and hydrophobic while the side chain ofcysteine (C) is small and can form disulfide bondsinaccessible to tryptophan (W). Similar argumentswere made for other amino acid substitutionsillustrated in Fig. 2a and b. However, as explainedin our thought experiments above, it does not makesense to base such explanations on divergence levelsor speciation events.

Exponential families and Mitchison and Durbin'sexperiments

Mitchison and Durbin estimated amino acidsubstitution probability matrices from experimentaldata using maximum likelihood methods,13 infer-ring 10 matrices from multiple alignments takenfrom the same BLOCKS database used to derive theBLOSUM matrices.12 They then tried to identify anexponential family that would generate the series ofmatrices (i.e., explain their observations as a time-homogeneous Markov process) but were unable todo so. However, they provided interesting analysisand diagnostics giving insight into the reasons whytheir approach failed. In their diagnostics, theyconsidered the proportion of amino acid changesthat may be explained by a single-nucleotide changeand the way this proportion changes over time. Tocompute this, they summed substitution probabilitymatrix entries over all amino acid substitutions thatcan be achieved via a single-nucleotide change andtook the ratio of this to the probability of any change:

PxYyð ÞaD1

Pxy tð ÞPxP

y p x Pxy tð Þ ð6Þ

where Δ1 is the set of amino acid changes requiringonly a single-nucleotide change. This value is plottedin Fig. 3a (“Experimental”) and shows an initialrapid decline followed by a slower decline for moredistant protein comparisons. This is suggestive of achange in evolutionary dynamics that is not consis-

tent with the near-linear decrease observed for time-homogeneous Markovian amino acid sequenceevolution (Fig. 3a, “SimpleMarkovProcess”; seealso below). As with BCG, naïve interpretation ofthese results again suggests that different processesare observed at different timescales. Since ourthought experiments indicate that sequence evolu-tion cannot be different depending simply on whenwe make our observations, this finding is anotherthat we hope to explain via AMPs.

Simulation methods

Simulation of evolving sequences as a way oftesting hypotheses and evaluating the idealizedbehavior of evolutionary models is well esta-blished.34 We use simulated codon data to investi-gate if aggregation (AMPs) can lead to observationssimilar to those of BCG and Mitchison and Durbin.Working at the codon level, we calculatedP t4k� �

= eQt4k (see above) using values of tk⁎ coveringa range similar to that used by BCG. The frequencyof observing codon i in one sequence and j inanother is then given by πiPij(tk⁎). Letting Cx and Cy

Page 9: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

918 Aggregated Markov Models

represent the sets of codons that code for aminoacids x and y, respectively, we simulate aggregated(amino acid level) data by noting that the frequencyof observing amino acid x in one sequence andamino acid y in the other sequence is then

XiaCx

XjaCy

kiPij tTk� � ð7Þ

To create AMP substitution matrices from thissimulation data, we apply the methods of BCG tothese idealized data and set

Txy tTk� �

=XiaCx

XjaCy

kiPij tTk� �

and Nxx tTk� �

=Xy

Txy tTk� �

The matrices T(tk⁎) and N(tk⁎) are subjected to thesame analyses performed by BCG and Mitchisonand Durbin in order to see if AMPs can give anexplanation of those authors' observations.Furthermore, we do not use the prespecified

codon times tk⁎ to normalize the matrices to timet=0.01 [Eq. (4)] but, instead, base this normalizationon inferred amino acid times. BCG had to rely onPAM distances tk estimated from observed aminoacid sequences, and it is important that we mimicthis because the amino acid time estimates tk may besystematically and nonlinearly biased relative to thecodon times tk⁎ (see also below). PAML25 canperform this estimation based on frequencies suchas those computed from Eq. (7). We used thismethod to estimate tk, the divergence levels of theAMP amino acid data, for use in normalizingprobability matrices according to Eq. (4).

Results

Understanding the behavior of AMPs:the Chapman–Kolmogorov equation

The Chapman–Kolmogorov equation gives themethod of combining probabilities from substitu-tion patterns observed at intermediate time stepsinto longer-term probabilities when the underlyingprocess is Markovian.35 Adherence to the Chap-man–Kolmogorov equation is a necessary andsufficient condition for a process to be Markov andguarantees the ability to interpolate and extrapolateprobabilities from observations at different timepoints.For example, suppose we are interested in

P(Y(t1)=M|Y(t0)=C), the probability of a changefrom amino acid cysteine (C) at time t0 tomethionine (M) at time t1. (We concentrate oncysteine–methionine changes here for simplicity,but the same results hold generally for any amino acid

pair.) However, imagine an experiment designed insuch away thatwe have three observation times (t0, t1and an intermediate time τ) and that the data allow usto determine the amino acid substitution probabilitiesfor the period from t0 to t1 and also for the periodsfrom t0 to τ and from τ to t1.If the AMP were Markovian, we could relate the

probabilities of the observations in the different timeperiods by applying the Chapman–Kolmogorovequation:23

P Y t1ð Þ = M jY t0ð Þ=Cð Þ=Xx

P Y t1ð Þ = M jY Hð Þ = xð Þ

× P Y Hð Þ = x jY t0ð Þ = Cð Þ for any Ha t0; t1½ �ð8Þ

where the summation is over all 20 amino acids thatmight be observed at intermediate time τ. Con-versely, if the Chapman–Kolmogorov equation doesnot hold, we know that the examined process isbehaving in a non-Markovian manner.35

In the context of our earlier thought experiments,this corresponds to estimating probability matricesat intermediate times and using these to calculatethe substitution probabilities at a later time.However, we now show that assuming Markovianbehavior for observations (e.g., amino acidchanges) generated by an AMP can lead tosubstantial error in the estimation of substitutionprobabilities. Interestingly, the time of observationdoes matter for the AMP, whereas it is irrelevantfor a simple Markovian process. This begins toestablish that AMPs may be able to explain someearlier authors' observations of non-Markovianprotein evolution.For the AMP representing the case that we only

observe amino acids, we calculate the right-handside of the Chapman–Kolmogorov equation byusing probabilities derived from Eq. (7) above.On the underlying codon level, the process is

Markov by construction (see above). However, toconfirm adherence to the Chapman–Kolmogorovequation and to compare codon results with aminoacid results, for our codon model, we apply thefollowing calculations. For Y(t0), we use the equi-librium distribution for the codons of cysteine (C)defined by

�0½ �i =kiP

jaCCkj

if iaCC

0 otherwise

8<:

where CC is the set of codons coding for amino acidC. We consider a codon initially in a state describedby this distribution and evolving (according to theMarkov codon model) over time. For example, attime τ, the state distribution is υ0P(τ− t0) and the

Page 10: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

difference = 33.9%

AggregatedCodon

0.002

0.0022

0.0024

0.0026

0.0028

0.003

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

Intermediate time τ

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

Intermediate time τ

P(Y

(t =

5)

= M

| Y

(t =

0)

= C

)difference = 2.0%

0.121

0.122

0.123

0.124

0.125

0.126

AggregatedCodon

P(Y

(t =

5)

= R

| Y

(t =

0)

= C

)

(a)

(b)

Fig. 4. Dependence of transition probabilities on thetime of intermediate observation. Transition probabilitiesfor the purely Markov codon model (labeled Codon) andthe AMP (Aggregated) are calculated using the right-handside of the Chapman–Kolmogorov Eq. (8). (a) Probabilityof a change from cysteine (C) to methionine (M) via anintermediate step at time τ. Cysteine and methionine aredistant in the genetic code (three-nucleotide changes). (b)Probability of a change from cysteine (C) to arginine (R)via an intermediate step at time τ. Cysteine and arginineare close in the genetic code (one-nucleotide change).

919Aggregated Markov Models

probability to be in amino acid state x at time τ isgiven by:

P Y Hð Þx jY t0ð Þ = Cð Þ = PiaCx

�0P H − t0ð Þ½ �i=

PiaCx

PjaC �0½ �jP H − t0ð Þji

=P

iaCx

PjaC �0½ �jP Y Hð Þ = i jY t0ð Þ = jð Þ

=P

iaCx

PjaCC

�0½ �jP Y Hð Þ = i jY t0ð Þ = jð Þ=

PjaCC

�0½ �jP

iaCxP Y Hð Þ = i jY t0ð Þ = jð Þ

ð9Þ

Using appropriate versions of Eq. (9), we havecalculated the right-hand side of the Chapman–Kolmogorov Eq. (8) for t0=0, for t1=5 and fordifferent intermediate times τ∈ [0.0,5.0] for thepurely Markov codon model and compare it to thesimulated AMP results (only amino acids ob-served). We use a simple M0 model [Eq. (3)] withω=0.2 and κ=2.5 and codon frequencies asspecified in Supplementary Material. The resultsfor the change from cysteine to methionine(C→M), as described above, and also for thechange from cysteine to arginine (C→R) are shownin Fig. 4. Unlike the results from the Markoviancodon process, the values derived from the AMPare not constant. In other words, the probabilitiesof amino acid substitution depend on the interme-diate time τ when the amino acid sequences areobserved. Considering similar plots for otheramino acid substitutions (not shown), we notethat this effect is particularly strong if the aminoacids are distant in the genetic code (e.g., they aretwo- or three-nucleotide changes apart).This confirms, therefore, that although the AMP is

a time-homogeneous Markov process on the codonlevel, it is non-Markovian (and time dependent)when observations are aggregated to the level ofamino acids. Perhaps the AMP can provide alogically consistent model that can explain BCG'sclaim that the time at which the evolutionaryprocess is observed is relevant for the estimation ofthe substitution process.

Comparison to BCG's results

The extrapolated matrices L(250) obtained fromBCG's 10 data sets of different divergence levelsshould be the same if the underlying process ofamino acid sequence evolution were time homoge-neous. To check this, we used Dayhoff's amino acidmodel to simulate perfectly time-homogeneousMarkov data in the form of pairwise alignments atdifferent divergent levels tk and applied BCG'sinference procedure, as described above, to calculatethe L(250) log-odds matrices. For these data, weconfirmed that the elements of the log-odds matrixare not dependent on the divergence levels tk (seeSupplementary Material Fig. S1a and b) and, thus,that BCG's observations (Fig. 2a and b) are not

consistent with time-homogeneous Markovianamino acid substitution.Proteins often show variation in the rates of

substitution at different protein sites. This can affectestimates of divergence levels36 and, thus, raises thequestion of whether rate heterogeneity at the aminoacid level could have caused the time dependencyeffects BCG observed. Therefore, we also simulateddata from a mixture process acting on amino acids, amethodology not available at the time of BCG'sanalysis, using a discretized Gamma distribution ofrates of evolution over amino acid sites that isdetermined by the parameter α.29 To simulate abroad range of typical protein data, we used 182values of α, representing an empirical distribution ofvalues for typical globular proteins.37 However,although very slight changes in L(250) values couldbe observed, the effect was not as pronounced as thatobserved by Benner et al. (see Supplementary

Page 11: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

920 Aggregated Markov Models

Material Fig. S1c and d). We conclude that althoughcombined data from a mixture of time-homogeneousMarkovian amino acid models with different valuesof α can generate observations with slight timedependence, a realistic distribution ofα values cannotexplain BCG's observations. In summary, we wereunable to recreate observations of Benner et al. usingtime-homogeneous amino acid Markov models.We next investigated whether simulations under

AMP models could lead to observations similar tothose of BCG. Initially, we used a simple M0 model(above), choosing realistic values of ω=0.2 (moder-ate purifying selection), κ=2.5 and codon frequen-cies as specified in SupplementaryMaterial. The log-odds matrices showed some dependency on thePAM distance, but the magnitude of the effect on thelog-odds elements did not reflect the strong varia-tion of the log-odds of BCG's experimental data.Further trials with different parameter values led toqualitatively similar results (not shown).However, introducing a more realistic model for

among-site rate variation by using a Gammamodel29 clearly compared better with Benner'splots from experimental data (not shown). Finally,instead of determining rate variation by a Gammadistribution, we illustrate the effect using ratecategories specified “by hand”. Excellent resultswere achieved by aggregation of simulated datafrom a codon model with among-site rate variationdefined by 12 relative rate categories:

r1 =0:000001 r2 =0:00001 r3 =0:0001 r4 =0:001r5 = 0:01 r6 = 0:1 r7 = 0:15 r8 = 0:2r9 = 0:3 r10 = 0:5 r11 = 2:0 r12 = 8:738889

ð10ÞAll categories were given the same probability

(1/12), maintaining a mean rate of 1. While aGamma distribution was not extreme enough, wenote that our choice of distribution of evolutionaryrates is still realistic. In particular, we needed veryslowly evolving sites to explain the time depen-dence effect more than those given by a Gammadistribution; however, models allowing for asubstantial proportion of invariant sites are widelyused in phylogenetics.38,39

Results for this AMP model are shown in Fig. 2cand d and should be compared with Benner's resultson experimental data in Fig. 2a and b. Similar to thegraphs of the experimental data, the graphs of thesimulated data show significant curvature. For thespecific elements of the log-odds matrix that BCGplotted, the order and the trends of the graphs agreefor the experimental and simulated AMP data. Thisshows that relatively complex but realistic time-homogeneous codon models can generate behaviorsimilar to what BCG observed.However, some ranges of the experimental graphs

are different (Fig. 2). We speculate that this may

reflect the fact that the M0 codon model is not fullyrealistic, for example, treating all synonymouschanges and all nonsynonymous changes equallyand assigning the same level of selective pressure(ω) to all protein sites. AMPs based on morecomplex parametric codon models22 or on empiricalcodon models22 might give a picture of the ranges ofL(250) matrix values more in accord with BCG'sempirical results. Also, at high PAM distances, thesimulated graphs converge to zero as predicted bytheory. In contrast, BCG's experimental graphsactually often cross the zero line, which we attributeto problems such as difficulty of aligning divergentsequences, noise, small sample sizes, and so on.A final possibility not yet addressed is that BCG's

experimental graphs could have the shapes they donot because of any systematic effect of molecularevolutionary processes but simply as a result ofinferential noise. Our results from simulating datasets of approximately the sizes of those used by BCGindicate that the resulting levels of variability are notnearly sufficient to explain the difference in shapesbetween the curves shown in Fig. 2a and b (BCG)and c and d (our results) (see SupplementaryMaterial Fig. S2 for details).

Comparison to Mitchison and Durbin's results

We repeated Mitchison and Durbin's analysisusing four simulations. We simulated data fromDayhoff's amino acid model as a simple time-homogeneous Markovian amino acid model andusing a mixture of time-homogeneous Markovianamino acid models as described above, incorporat-ing 182 rate heterogeneity (α) values.37 We alsosimulated data from an AMP based on the codonmodel described above, both without rate heteroge-neity and with rate heterogeneity as given by the 12rate categories in Eq. (10).Figure 3 compares the results from these four

models to the experimental data of Mitchison andDurbin. Figure 3a confirms that a simple time-homogeneous Markov model on the amino acidlevel does not fit their observations. Although themixture of time-homogeneous amino acid modelsgives results somewhat closer to experimental data,it still predicts the proportion of single base changesto decrease fairly linearly. Thus, it appears that time-homogeneous Markovian amino acid processmodels alone cannot explain the observations ofMitchison and Durbin.Figure 3b shows similar poor agreement between

Mitchison and Durbin's results and the results fromour AMP with no rate heterogeneity. However,there is much better agreement for our AMPincorporating rate heterogeneity. Although thiscombination of time-homogeneous codon model,rate heterogeneity and aggregation does not reflectprecisely the behavior of the experimental results for

Page 12: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

921Aggregated Markov Models

small times (PAM distances), it does capture muchbetter the rapid and nonlinear decline of theproportion of single base changes. Again, apparentdifferences in protein evolution on different time-scales can in fact be explained by an AMP.

Discussion

Since the work of Dayhoff et al., there have beenincreasingly good empirical models of the averagepatterns andprocesses of evolution of large collectionsof amino acid sequence, as well as more and morespecialized matrices considering functional and struc-tural properties of proteins. However, while mostwork in phylogenetic modeling is aimed at devisingimproved time-homogeneous Markov models, somecriticisms have been directed at the time-homogeneityassumption and the models' Markov nature itself.Studies on experimental protein sequence data (e.g.,by Benner et al.14 and Mitchison and Durbin13) haveobserved different substitution patterns at differentlevels of sequence divergence. These observationsindicate that amino acid sequence evolution behavesin a time-dependent manner.While Benner et al. did not support their criticisms

of the Markov nature of amino acid sequenceevolution by consideration of time-dependent Mar-kov processes, the claims that time-homogeneitywas violated required further investigation.14 In aseries of thought experiments, we have shown thatpast explanations (i.e., that the process of evolutionis different for different divergence times) areirrational because the time of observation and thechoice of sequences compared cannot have anyinfluence on actual amino acid substitutions. How-ever, time-homogeneous Markov models are fun-damental to many applications in evolutionarystudies, and we need to find some explanation forthe observations.We emphasize that our criticism of Benner et al.'

interpretation of their results should not be taken tomean that we are arguing against the importance ofthe genetic code or of amino acids' physiochemicalproperties in evolutionary models. However, weargue that these influence the average substitutionpatterns observed over collections of proteins at allevolutionary distances in the same way. Indeed,studies of genome variation data have suggested aninfluence of physicochemical properties at thepopulation level40–42 and, thus, within a far-shorterperiod than the distances discussed in this paper.We have also shown that the time-dependent

behavior described in the literature can be explainedby modeling protein-coding DNA sequence evolutionas an AMP that combines a time-homogeneousMarkov model of codon evolution with rate heteroge-neity among different codon sites of the protein andthat evaluateswhatwewould infer ifweobservedonly

encoded amino acid sequences at different divergencelevels. This leads to a model that is non-Markovianon the observed amino acids, and we have focusedon the consequences of non-Markovian behaviorusing the Chapman–Kolmogorov equation35 andcomparisons to studies on experimental data byBenner et al.14 andMitchison andDurbin.13 Althoughprevious results13,14 cannot be explained by a puretime-homogeneous Markov model or a realisticmixture of such models on the amino acid level, theaggregated Markov model captures the qualitativebehavior of empirical studies and leads to betteragreement between models and empirical data.Although it does not incorporate any of the physico-chemical properties considered by Benner et al. to beresponsible for their results, the AMP in fact is able tocapture quite accurately the form of the resultsinterpreted by Benner et al. andMitchison andDurbinas evidence of time-dependent evolution. We there-fore conclude that the paradox that arose from pastobservations of time-dependent behavior can beresolved.AMPs based on M0 (above) capture the high

proportion of single base changes at very lowdivergence levels observed by Mitchison andDurbin because they assume that individual codonreplacements involve only single bases [Eq. (3)].13

Studies that have investigated instantaneous occur-rence of multiple base replacements suggest thatthese do arise in low numbers.43,30 AMPs based onmodels incorporating these events could lead toimproved fit with Mitchison and Durbin's observa-tions. We found that a high level of rate variationacross sites was also needed to give a good fit toempirical results. An effect of this rate variation is toconcentrate codon changes into a small number ofhighly variable sequence sites, leading to morechanges per altered site and thus a higher propor-tion of altered sites requiring multiple base changesto explain observed amino acid differences. Thiscontributes to the much steeper fall in the proportionof single base changes as divergence increases forthe AMP with rate variation, as shown in Fig. 3b,and Fig. 2 illustrates the same effects at the level ofindividual amino acid replacements. An additionaleffect that contributes to the improved fit of our rate-heterogeneous AMP is systematic underestimationof divergence caused by model misspecification.Here, we estimate divergences from data generatedby a rate-heterogeneous codon process, with anamino acid model assuming rate homogeneity. Theconcentration of changes into a small number ofsites leads to more multiple hits and thus moreamino acid replacements that are not observable andunderestimation of divergence levels (see Supple-mentary Material Fig. S3; the nonlinearity of therelationship between tk⁎ and the inferred PAMdistance also explains the gradient changes in therate-heterogeneous AMP plot in Fig. 3b).

Page 13: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

922 Aggregated Markov Models

The considerable level of rate variation across sitesneeded to generate behavior similar to that observedby Benner et al. and Mitchison and Durbin couldalso be caused in part by variation in selectivepressures such as those modeled by parameters ofthe M7 or M8 codon models for selection.22 Suchstudy of the causes of rate heterogeneity is beyondthe “proof-of-principle” approach used in thispaper. Furthermore, our comparisons were limited,since the original data (and detailed results) of theabove studies are not available anymore. However,the results of our simulation study using AMPsalready strongly suggest that protein evolution willbe most accurately modeled with codon rather thanamino acid substitution models.This recommendation is in accord with recent

work on the use of codon-level models for molecularphylogenetics. Ren et al. study the utility of codonmodels for phylogenetic reconstruction and molec-ular dating.44 They report that codon models havegood performance in both recent and deep di-vergences. Although their computational burdenmakes codon models currently infeasible for treesearching, Ren et al. recommend them for comparingpredetermined candidate trees. In contrast, model-ing protein sequence evolution on the amino acidlevel may introduce systematic error. The nature ofprotein-coding sequence evolution is such that time-homogeneous Markov modeling on the codon levelseems reasonable, but this leads to time-dependentand non-Markov behavior on the amino acid level. Itis increasingly feasible to use codon models whereamino acid models have been used in the past, andour results overturn a long-standing claim thatproteins evolve in a time-dependent manner andgive further reasons why codon models may bepreferable.Supplementary materials related to this article can

be found online at doi:10.1016/j.jmb.2011.06.005

Acknowledgements

C.K. and N.G. would like to thank Bret Larget andAllen Rodrigo for discussions on AMPs and ananonymous referee for extensive and perceptivecomments that led to numerous improvements tothe manuscript. This research was supported by theWellcome Trust through grants GR069321MA andGR078968MA.

References

1. Dayhoff, M. O. & Eck, R. V. (1968). A model ofevolutionary change in proteins. In Atlas of Protein

Sequence and Structure (Dayhoff,M.O.&Eck, R.V., eds),pp. 33–41, National Biomedical Research Foundation,Washington, DC.

2. Adachi, J. & Hasegawa, M. (1996). Model of aminoacid substitution in proteins encoded by mitochon-drial DNA. J. Mol. Evol. 42, 459–468.

3. Goldman, N., Thorne, J. L. & Jones, D. T. (1998).Assessing the impact of secondary structure andsolvent accessibility on protein evolution. Genetics,149, 445–458.

4. Liò, P. & Goldman, N. (1999). Using protein structuralinformation in evolutionary inference: transmem-brane proteins. Mol. Biol. Evol. 16, 1696–1710.

5. Le, S. Q. & Gascuel, O. (2010). Accounting for solventaccessibility and secondary structure in proteinphylogenetics is clearly beneficial. Syst. Biol. 59,277–287.

6. Liò, P. & Goldman, N. (1998). Models of molecularevolution and phylogeny. Genome Res. 8, 1233–1244.

7. Thorne, J. L. & Goldman, N. (2007). Probabilisticmodels for the study of protein evolution. InHandbook of Statistical Genetics (Balding, D., Bishop,M. & Cannings, C., eds), pp. 439–459, Wiley,Chichester, UK.

8. Felsenstein, J. (1973). Maximum likelihood and min-imum-steps methods for estimating evolutionary treesfrom data on discrete characters. Syst. Zool. 22,240–249.

9. Yang, Z. (2006). Computational Molecular Evolution.Oxford University Press, Oxford, UK.

10. Yang, Z. & Goldman, N. (1994). Evaluation andextension of Markov process models for the evolutionof DNA (in Chinese, with Abstract in English). ActaGenet. Sin. 21, 17–23.

11. Kosiol, C. & Goldman, N. (2005). Different versions ofthe Dayhoff rate matrix. Mol. Biol. Evol. 22, 193–199.

12. Henikoff, S. & Henikoff, J. G. (1992). Amino acidsubstitution matrices from protein blocks. Proc. NatlAcad. Sci. USA, 89, 10915–10919.

13. Mitchison, G. & Durbin, R. (1995). Tree-basedmaximal likelihood substitution matrices and hiddenMarkov models. J. Mol. Evol. 41, 1139–1151.

14. Benner, S. A., Cohen, M. A. & Gonnet, G. H. (1994).Amino acid substitution during functionally con-strained divergent evolution of protein sequences.Protein Eng. 7, 1323–1332.

15. Bartlett, M. S. (1978). Introduction to Stochastic Processes,3rd edit. CambridgeUniversity Press, Cambridge, UK.

16. Ohno, S. (1970). Evolution by Gene Duplication. Springer-Verlag, Berlin, Germany.

17. Seoighe, C., Johnston, C. R. & Shields, D. C. (2003).Significantly different patterns of amino acid replace-ment after gene duplication as compared to afterspeciation. Mol. Biol. Evol. 20, 484–490.

18. Kondrashov, F. A., Rogozin, I. B., Wolf, Y. I. &Koonin, E. V. (2002). Selection in the evolution of geneduplications.Genome Biol. 3; research0008.1–0008.9.

19. Larget, B. (1998). A canonical representation foraggregated Markov processes. J. Appl. Probab. 32,313–324.

20. Klosterman, P. S., Uzilov, A. V., Bendaña, Y. R.,Bradley, R. K., Chao, S., Kosiol, C. et al. (2006). XRate:a fast prototyping, training and annotation tool forphylo-grammars. BMC Bioinformatics, 7, 428.

Page 14: Markovian and Non-Markovian Protein Sequence Evolution: Aggregated ... · Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models Carolin Kosiol1⁎

923Aggregated Markov Models

21. Goldman, N. & Yang, Z. (1994). A codon-based modelof nucleotide substitution for protein-coding DNAsequences. Mol. Biol. Evol. 11, 725–736.

22. Yang, Z., Nielsen, R., Goldman, N. & Pedersen, A. M.(2000). Codon-substitution models for heterogeneousselection pressure at amino acid sites. Genetics, 155,431–449.

23. Norris, J. R. (1997). Markov Chains. CambridgeUniversity Press, Cambridge, UK.

24. Moler, C. & Van Loan, C. (2003). Nineteen dubiousways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. 45, 3–49.

25. Yang, Z. (2007). PAML 4: phylogenetic analysis bymaximum likelihood. Mol. Biol. Evol. 24, 1586–1591.

26. Muse, S. V. & Gaut, B. S. (1994). A likelihoodapproach for comparing synonymous and nonsy-nonymous nucleotide substitution rates, with appli-cation to the chloroplast genome. Mol. Biol. Evol. 11,715–724.

27. Yang, Z. & Nielsen, R. (2002). Codon-substitutionmodels for detecting molecular adaptation at individ-ual sites along specific lineages. Mol. Biol. Evol. 19,908–917.

28. Massingham, T. & Goldman, N. (2005). Detectingamino acid sites under positive selection and purify-ing selection. Genetics, 169, 1753–1762.

29. Yang, Z. (1994). Maximum likelihood phylogeneticestimation from DNA sequences with variable ratesover sites: approximate methods. J. Mol. Evol. 39,306–314.

30. Kosiol, C., Holmes, I. & Goldman, N. (2007). Anempirical codon model for protein sequence evolu-tion. Mol. Biol. Evol. 24, 1464–1479.

31. Kuensch, H. R. (2001). State space and hidden Markovmodels. In Complex Stochastic Systems (Barndorff-Nielsen, O. E., Cox, D. R. & Klüppelberg, C., eds),pp. 109–173, CRC Press, New York, NY.

32. Yang, Z., Nielsen, R. & Hasegawa, M. (1998). Modelsof amino acid substitution and applications tomitochondrial protein evolution. Mol. Biol. Evol. 15,1600–1611.

33. Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992).Exhaustive matching of the entire protein sequencedatabase. Science, 256, 1443–1445.

34. Fletcher, W. & Yang, Z. (2009). INDELible: a flexiblesimulator of biological sequence evolution. Mol. Biol.Evol. 26, 1879–1888.

35. Gillespie, D. T. (1992). Markov Processes: An Introduc-tion for Physical Scientists. London Academic Press,Boston, UK.

36. Yang, Z., Goldman, N. & Friday, A. (1994). Compar-ison of models for nucleotide substitution used inmaximum-likelihood phylogenetic estimation. Mol.Biol. Evol. 11, 316–324.

37. Goldman, N. & Whelan, S. (2002). A novel use ofequilibrium frequencies in models of sequence evolu-tion. Mol. Biol. Evol. 19, 1821–1831.

38. Gu, X., Fu, Y. X. & Li, W. H. (1995). Maximumlikelihood estimation of the heterogeneity of substi-tution rate among nucleotide sites. Mol. Biol. Evol. 12,546–557.

39. Le, S. Q. & Gascuel, O. (2008). An improved generalamino acid replacement matrix. Mol. Biol. Evol. 25,1307–1320.

40. Sunyaev, S., Ramensky, V. & Bork, P. (2000). Towardsa structural basis of human non-synonymous singlenucleotide polymorphisms. Trends Genet. 16, 198–200.

41. Ng, P. C. & Henikoff, S. (2003). SIFT: predicting aminoacid changes that affect protein function. Nucleic AcidsRes. 31, 3812–3814.

42. Williamson, S. H., Hernandez, R., Fledel-Alon, A.,Zhu, L., Nielsen, R. & Bustamante, C. D. (2005).Simultaneous inference of selection and populationgrowth from patterns of variation in the humangenome. Proc. Natl Acad. Sci. USA, 102, 7882–7887.

43. Whelan, S. & Goldman, N. (2004). Estimating thefrequency of events that cause multiple-nucleotidechanges. Genetics, 167, 2027–2043.

44. Ren, F., Tanaka, H. & Yang, Z. (2005). An empiricalexamination of the utility of codon substitutionmodels in phylogenetic reconstruction. Syst. Biol. 54,808–818.