Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract...

30
Bayesian Phylogenetic Inference – Estimating Diversification Rates from Reconstructed Phylogenies Sebastian Höhna

Transcript of Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract...

Page 1: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Bayesian Phylogenetic Inference – Estimating Diversification Rates fromReconstructed Phylogenies

Sebastian Höhna

Page 2: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference
Page 3: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Bayesian Phylogenetic InferenceEstimating Diversification Rates from Reconstructed Phylogenies

Sebastian Höhna

Page 4: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

c©Sebastian Höhna, Stockholm 2013

ISBN 978-91-7447-771-9

Printed in Sweden by US-AB, Stockholm 2013

Distributor: Department of Mathematics, Stockholm University

Page 5: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Abstract

Phylogenetics is the study of the evolutionary relationship between species.Inference of phylogeny relies heavily on statistical models that have been ex-tended and refined tremendously over the past years into very complex hierar-chical models. Paper I introduces probabilistic graphical models to statisticalphylogenetics and elaborates rigorously the potential advantageous a unifiedgraphical model representation could have for the community, e.g. , more un-derstandable models and more reproducible studies from other publications.

Once the phylogeny is reconstructed it is possible to infer the rates of di-versification (speciation and extinction). In this thesis I extend the birth-deathprocess so that it can be applied to incompletely sampled phylogenies, thatis, phylogenies of only a subsample of the presently living species from onegroup. Previous work only considered the case when every species had thesame probability to be included and here I give two alternative sampling con-cepts: diversified taxon sampling and cluster sampling. Paper II introducesthese sampling concepts under a constant rate birth-death process and givesthe probability density for reconstructed phylogenies. These models are ex-tended in Paper IV to time-dependent diversification rates, again, under differ-ent sampling concepts and applied to empirical phylogenies. Paper III focuseson fast and unbiased simulations of reconstructed phylogenies. The efficiencyis achieved by deriving the analytical distribution and density function of thespeciation times in the reconstructed phylogeny.

Page 6: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference
Page 7: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

List of Papers

The following papers, referred to in the text by their Roman numerals, areincluded in this thesis.

PAPER I: Höhna S., Heath T. A., Boussau B., Ronquist F. and Huelsen-beck J. P. (2013) Probabilistic Graphical Model Representationin Phylogenetics, (manuscript).Contribution: All authors developed the methodology and jointlywrote the manuscript.

PAPER II: Höhna S., Stadler T., Ronquist F. and Britton T. (2011) In-ferring speciation and extinction rates under different samplingschemes. Molecular Biology and Evolution 28(9):2577–2589.DOI: 10.1093/molbev/msr095Contribution: Implementation of the methods and analyzing thedata. All authors designed the study, developed the methodol-ogy and wrote the manuscript.

PAPER III: Höhna S. (2013) Fast simulation of reconstructed phylogeniesunder global time-dependent birth-death processes. Bioinfor-matics 29(11):1367–1374.DOI: 10.1093/bioinformatics/btt153

PAPER IV: Höhna S. (2013) Likelihood Inference of non-constant Diver-sification Rates with Incomplete Taxon Sampling. PLoS One(accepted)

Reprints were made with permission from the publishers.

Page 8: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Related articles that are not included in the thesis but have been publishedduring the course of the PhD studies:

PAPER A: Höhna S. and Drummond A. J. (2012) Guided Tree TopologyProposals for Bayesian Phylogenetic Inference. Systematic Biol-ogy 61(1):1–11.DOI: 10.1093/sysbio/syr074

PAPER B: Alström P., Höhna S., Gelang M., Ericson P. and Olsson U. (2011)Non-monophyly and intricate morphological evolution within theavian family Cettiidae revealed by multilocus analysis of a taxo-nomically densely sampled dataset. BMC Evolutionary Biology11(1):352.DOI: 10.1186/1471-2148-11-352

PAPER C: Ronquist F., Teslenko M., van der Mark P., Ayres D. L., DarlingA., Höhna S., Larget B., Liu L., Suchard M. A. and HuelsenbeckJ. P. (2012) MrBayes 3.2: Efficient Bayesian MCMC Inferenceand Model Choice Across a Large Model Space. Systematic Bi-ology 61(3):539–542.DOI: 10.1093/sysbio/sys029

Page 9: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Acknowledgements

During the time of my PhD studies I have obtained a lot of help and supportfrom many people that made working on the thesis easier and much more en-joyable. This is only a short list of people I have interacted with more closelywhile working on the thesis.

First, I would like to thank Ana. You have motivated me the most, broughtthings always back into perspective and made me re-think a lot of stuff. Butmost importantly you succeeded to balance my life at least a bit.

Second, my fellow students at the department because it was always goodto have others to talk to who are in the same situation. Especially Fredrikand Fabio, sharing the office or home with me for a long time, have probablysuffered the most under my sometimes random thoughts which I never heldback. And for explaining me the Swedish culture with weird movies.

All the people in Berkeley and the Huelsenbeck lab (John, Tracy, Bastien,Michael, Chris, Brian, Nick and Jeremy) who have made my stays a lot of fun.I really enjoyed the atmosphere and the different perspective you had on theresearch topics.

My supervisors, Tom and Fredrik, for giving me this opportunity and theiradvise and support. This also holds for the entire department, because evenif I did not always fit in perfectly or had different opinions, I have learned somuch.

I would also like to thank my parents, my whole family and friends whoI have neglected in recent years far too much. Your direct contribution to thisthesis may be only minor but without the time before, that made me who I am,I probably would not have done and finished this work.

Thanks to all of you!

Page 10: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference
Page 11: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Contents

Abstract v

List of Papers vii

Acknowledgements ix

1 Introduction 131.1 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.2 Phylogenetics . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2.1 Bayesian inference . . . . . . . . . . . . . . . . . . . 181.2.2 Summary of Paper I . . . . . . . . . . . . . . . . . . 19

1.3 Diversification . . . . . . . . . . . . . . . . . . . . . . . . . . 211.3.1 Estimating Diversification Rates . . . . . . . . . . . . 211.3.2 Summary of Paper II . . . . . . . . . . . . . . . . . . 231.3.3 Summary of Paper III . . . . . . . . . . . . . . . . . . 241.3.4 Summary of Paper IV . . . . . . . . . . . . . . . . . 26

References xxvii

Page 12: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference
Page 13: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

1. Introduction

1.1 Evolution

As many more individuals of each species are born than can pos-sibly survive; and as, consequently, there is a frequently recurringstruggle for existence, it follows that any being, if it vary howeverslightly in any manner profitable to itself, under the complex andsometimes varying conditions of life, will have a better chance ofsurviving, and thus be naturally selected. From the strong princi-ple of inheritance, any selected variety will tend to propagate itsnew and modified form.

— Charles Darwin, On the Origin of Species (1859)

Charles Darwin laid out the fundamental principles of the theory of evo-lution in his masterpiece “On the Origin of Species” (Darwin, 1859). Theseprinciples are used to study the history of life. His ideas were simple andgeneral at the same time: adaptation by means of natural selection and a com-mon descent of all life forms. The theory of evolution still holds true todayand specific aspects are of great interest in current research. This thesis willonly touch a small aspect of the studies related to evolution: phylogenetics anddiversification rate analyses.

The general principles of the theory of evolution can be summarized asfollows. Any species consists of several populations and within each popula-tion there are many individuals with different characteristics, so called traits orphenotypes. New populations originate, for example, either by a subdivisionof one ancestral population (e.g. due to changing geography) or are founded bysome individuals that “migrate” to a new habitat (e.g. a dispersal event). In thisway new populations are formed that possibly evolve into a new independentspecies.

Some individuals may have a higher chance of survival and/or have ahigher rate of reproduction because of an advantageous phenotype. A givenphenotype is inherited from the parents to the offspring. For example, GregorMendel showed in his seminal work that the offspring of peas with one coloralways have the same color and that offspring with two differently colored par-ents produce mixed colors themselves. This observation gave rise to the study

13

Page 14: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

of genetics. The key observation is that phenotypes are inherited from parentto offspring and evolutionary relationship can be established by a long line ofdescent. Commonly, if two populations have evolved distinctive phenotypesone considers them as belonging to different species.

As an example consider Darwin’s finches on the Galapagos islands. Asmall number of individuals arrived by pure chance from the mainland ofSouth-America to one of the Galapagos islands. At this time, these individ-uals still belonged to the same species as the mainland finches. However, dueto the different environment – e.g. different food source and different predators– a different natural pressure was selecting for different phenotypes on thesefinches. Hence, after a while this new population of finches on the Galapagosislands diverged in phenotype until the population constituted in a new species.Differences developed in beaks, tails, body form and plumage.

Isolated populations can also evolve by pure chance into a separate speciesby genetic and phenotypic drift. Regardless of the cause, the important factis that new species arise all the time and existing species go extinct. All newspecies have in common that they descended from an ancestral species andtherefore every two species are related to one another because at one timethey shared a common ancestor. The ancestral relationship between a group ofspecies is represented by a phylogeny and the study thereof is named phyloge-netics.

This thesis is mainly concerned with statistical methods describing the pro-cess of evolution and should be considered in a larger framework targeting theinference of evolutionary parameters, such as the phylogeny and the rates ofdiversification over time.

1.2 Phylogenetics

Evolution and genetics are two research areas that consider many problemsin statistical inference or more specifically in stochastic processes. Stochasticprocesses are of great importance in these research areas because they gen-erally lead to probabilistic description of the observed data and to likelihoodmethods – the likelihood function is the probability of the data as a function ofthe parameters – which have in turn shown their superiority over other methodsin parameter inference (Felsenstein, 1973; Huelsenbeck and Rannala, 1997).

I start with an explanatory example. Let us assume that we have some ob-served characters (e.g. DNA characters) from some species – say three species.For example, consider that these character represent a gene or a part of a gene.Now based on this observation we want to infer the evolutionary relationshipand history between these three species, called phylogenetic inference. I willnow briefly explain the probabilistic model for the evolution of a gene se-

14

Page 15: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

T TAC

T T

T TA AGGA G

GA G

A AGGA GT TAAGGA G

x x

x

T

xsubstitution

deletion

insertion*

* * *

C

CC

CC C CC

1 1 2 3 4 5 6 7 1 2 3 7 84 5 6 78

T TA AGGA G

T TA GA G

TCC CCC

1 2 3 4 5 6 7 8

- --

- - -

9

T TA AGGA G

T TA GA G

TCC CCC

- --

- - --

-

-

True

Alignment

Alternative

Alignment

Alignment

Position

a) b)

Figure 1.1: A model of molecular sequence evolution. The left side shows theevolution of a DNA sequence along a rooted phylogenetic tree with three tips(species). The sequence evolves by insertion, deletion and substitution events.The right side shows the observation from the process of molecular evolution:the molecular sequences of present day species. The homology (ancestral rela-tionship) of the characters is established and characters that are homologous arewritten in the same column of the multiple sequence alignment. Two alternativehomology reconstruction are given: the true alignment and the most parsimo-nious alignment (least number of substitutions).

quence as shown in Figure 1.1.a. The process starts with some characters atthe ancestral species that build the root sequence. Commonly it is assumedthat these characters are independent and identically distributed with knownor estimated frequencies. Then, along the branches of the phylogenetic tree,three events may happen: 1) a substitution event from one state into another(e.g. from an ’A’ to a ’T’ at position 1 on the right branch), 2) an insertionevent of a random number of characters, and 3) a deletion event of one or sev-eral characters (Thorne et al., 1991, 1992). The causes for these events aremanifold and a more realistic representation should have to take into account,amongst others, the dependency between characters and the distribution of thesize of insertions and deletions.

A proper statistical analysis would jointly infer insertions, deletions andsubstitutions (Redelings and Suchard, 2005). Unfortunately a joint inferenceis only applicable to very small datasets because of the computational burdens(Bouchard-Côté and Jordan, 2013; Lunter et al., 2005; Redelings and Suchard,2009). Thus, the problem of phylogenetic inference is often simplified by a se-quential inference approach. First, the multiple sequence alignment is inferred,that is, the homology structure between observed characters is established (seeFigure 1.1.b). The best alignment according to some optimality criterion, e.g.maximum likelihood or the parsimony principle, is then used for further anal-yses (Wong et al., 2008).

After the alignment has been obtained the phylogeny is estimated. Given

15

Page 16: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

the known alignment there are no more insertion and deletion events that haveto be modeled. Hence, the stochastic process of molecular sequence evolu-tion is reduced to a substitution process. This substitution process is in turnmodeled by a continuous time Markov model which is parameterized by aninstantaneous rate matrix Q. In the case of DNA substitutions Q is defined as

Q =

−µA µGA µCA µTA

µAG −µG µCG µT G

µAC µGC −µC µTC

µAT µGT µCT −µT

where µi j represents that rate of substitution from i to j. Furthermore, let thetransition probability matrix be defined by

P(t) =

pAA(t) pGA(t) pCA(t) pTA(t)pAG(t) pGG(t) pCG(t) pT G(t)pAC(t) pGC(t) pCC(t) pTC(t)pAT (t) pGT (t) pCT (t) pT T (t)

= eQt =∞

∑j=0

tQ j

j!.

Most applications in phylogenetics use a simplified substitution rate matrix,for example the most simple matrix with equal exchangeability rates and equi-librium frequencies (Jukes and Cantor, 1969):

QJC69 =

∗ 1 1 11 ∗ 1 11 1 ∗ 11 1 1 ∗

which has the advantage that the transition probability matrix can be computedanalytically:

PJC69 =

14 +

34 e−tµ 1

4 − 14 e−tµ 1

4 − 14 e−tµ 1

4 − 14 e−tµ

14 − 1

4 e−tµ 14 +

34 e−tµ 1

4 − 14 e−tµ 1

4 − 14 e−tµ

14 − 1

4 e−tµ 14 − 1

4 e−tµ 14 +

34 e−tµ 1

4 − 14 e−tµ

14 − 1

4 e−tµ 14 − 1

4 e−tµ 14 − 1

4 e−tµ 14 +

34 e−tµ

.

Note that I omitted for simplicity and following standard notation the diagonalelements of the rate matrix Q and represented the elements by an ’*’. The diag-onal elements are defined as µii =−∑i6= j µ ji. Other widely used rate matrices

16

Page 17: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

are Kimura’s 2-parameter model (Kimura, 1980):

QK80 =

∗ κ 1 1κ ∗ 1 11 1 ∗ κ

1 1 κ ∗

,

Felsenstein (1981) model with unequal equilibrium frequencies π:

QF81 =

∗ πC πG πT

πA ∗ πG πT

πA πC ∗ πT

πA πC πG ∗

,

and the model of Hasegawa et al. (1985) with unequal equilibrium frequenciesand a different transition-transversion rate κ:

QHKY =

∗ κπC πG πT

κπA ∗ πC πT

πA πC ∗ κπT

πA πC κπG ∗

.

For all these models analytical solutions of the transition probability matrixare available. Additionally, researcher often apply the general time reversible(GTR) (Tavaré, 1986) that consists of six exchangeability rates x and four equi-librium frequencies π:

QGT R =

−(x1 + x2 + x3)

π1x1π2

π1x2π3

π1x3π4

x1 −(π1x1π2

+ x4 + x5)π2x4π3

π2x5π4

x2 x4 −(π1x2π3

+ π2x4π3

+ x6)π3x6π4

x3 x5 x6 −(π1x3π4

+ π2x5π4

+ π3x6π4

)

.

However, numerical methods are necessary to compute the transition probabil-ity matrix under the GTR model. Similar models are applied for the evolutionof amino acid (20 states) or codon (61 states) sequence but include larger pa-rameter spaces because of the larger number of possible states.

Then, the probability of the “observed” multiple sequence alignment iscomputed by multiplying the probability of each substitution history for eachbranch of the tree and each site of the sequence. For example in Figure 1.1.afor the first position in the sequence one has to compute the probability ofstarting with an ’A’ and observing an ’A’ at the internal node, then for eachdescendant species observing an ’A’ and observing a ’C’ at the third specieswhich gives

PAA(b1)×PAA(b2)×PAA(b3)×PAC(b4) .

17

Page 18: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

This multiplication is repeated for each site in the sequence. Furthermore, theinformation of the ancestral sequences is commonly not available and thus onesums the probabilities of having observed either character. More details areprovided in Paper I. Paper B is an example of an application of phylogeneticinference where the species relationship within a family of bird species is esti-mated from gene-sequences of four different genes (Alström et al., 2011).

This brief introduction into phylogenetic methods aimed to make the pointthat the statistical models used in phylogenetic inference can be very complexand many competing models exist. In Paper I we argue that this complexityrequires a unified modeling framework describing different models and thus tosimplify future progress in model development. However, before I summarizePaper I I will give a brief introduction to Bayesian inference in phylogenetics.

1.2.1 Bayesian inference

In phylogenetics one often finds oneself in the situation of possessing a datasetand wishing to learn about the process and course of evolution. Naturally, oneis not able to repeat the “experiment” and thus to obtain a different realiza-tion of the data under the same circumstances. Therefore, the researcher isrestricted to the role of a passive observer and the task of the statistician issolving an inverse problem – namely inferring the parameters of the processfrom the observations. Although a major goal in the phylogenetic inference isto infer a single point estimate, such as the most likely phylogeny, computingthe uncertainty in the estimate is of great interest too. Computing the uncer-tainty of a phylogenetic tree is not trivial and therefore several methods havebeen developed, such as bootstrapping (Felsenstein, 1985), but in recent yearsBayesian inference has overtaken other methods because it provides a natu-ral interpretation of uncertainty (Alfaro and Holder, 2006; Holder and Lewis,2003; Huelsenbeck et al., 2002, 2001). Bayes theorem states

P(Parameter|Data) =P(Data|Parameter)×P(Parameter)

P(Data)

which for a phylogeny means

P(Treeτ |Alignment) =P(Alignment|Treeτ)×P(Treeτ)

P(Alignment)

=P(Alignment|Treeτ)×P(Treeτ)

∑∀

P(Alignment|Treex)×P(Treex)

=

∫θ

P(Alignment|Treeτ ,θ)×P(Treeτ ,θ)dθ

∑∀x

(∫θ

P(Alignment|Treex,θ)×P(Treex,θ)dθ

)18

Page 19: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

where θ represents the vector of additional parameters including the substi-tution model parameters (exchangeability rates and equilibrium frequencies).Unfortunately this probability cannot be computed analytically – except forvery, very small datasets and toy models – because, amongst others, the num-ber of phylogenies grows super-exponentially with the number of species in thestudy (Felsenstein, 1978). Instead, stochastic approximation methods are used;most prominently Markov chain Monte Carlo (MCMC) methods (Li et al.,2000; Mau and Newton, 1997; Rannala and Yang, 1996).

The most common implementation of an MCMC algorithm is the Metropolis-Hastings algorithm (Hastings, 1970; Metropolis et al., 1953). The fundamentalidea is to

1. Propose new values x′ according to a proposal density q(x′|x(t)).

2. Compute the acceptance probability

α = min

(1,

p(x′)p(x(t))

× q(x(t)|x′)q(x′|x(t))

)

3. With probability α accept the proposal and set x(t+1) = x′; otherwisereject the proposal and set x(t+1) = x(t).

Here x represents the complete vector of parameter states, for example thephylogeny, the substitution model parameters and the diversification rates.

A main challenge for MCMC algorithms in Bayesian phylogenetic infer-ence is sampling the space of phylogenetic trees efficiently (Höhna and Drum-mond, 2012; Lakner et al., 2008). Because the number of possible phylogeniesis extremely large and there is no inherent ordering of phylogenies, proposingnew phylogenies is difficult and clever algorithms are required. I have con-tributed to this challenge in Paper A (Höhna and Drummond, 2012) whichhowever does not constitute the main part of this thesis.

The main part of this thesis – perhaps except Paper I – is not restricted toBayesian inference. Nevertheless, the reader should keep in mind that the over-all goal is to develop a joint statistical inference framework for all parametersof the evolutionary process and Bayesian inference is the preferred method.

1.2.2 Summary of Paper I

The number and complexity of phylogenetic models has increased rapidly inrecent years which is accompanied by an enormous increase in available se-quence data due to faster and cheaper sequencing technologies. A new chal-lenge for the phylogenetic community consists of describing all these models:

19

Page 20: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

young researchers need to learn and understand the plethora of models and em-piricists, software developers and theoreticians need to decipher the describedmodels of the their peers.

S1i S2i S3i

S4i

S5i

i ∈ N

Q

π

e

b

l3

l2

l1

l4

λ

e) Plate

d) Clamped node(observed)

c) Deterministic node

b) Stochastic node

a) Constant node

Figure 1.2: Graphical model representation of a phylogenetic model using adirected acyclic graph (DAG) including the notation. The phylogenetic tree isclearly visible in the DAG and shows how the observed characters depend on theinternal sequences. The entire tree sits on a plate because the variables are repli-cated for each site in the sequence, hence each site is independent and identicallydistributed. The substitution model is represented in the left part by the DAGnodes π for the equilibrium frequencies, ε for the exchangeability rates and Qfor the substitution rate matrix. The prior for the branch length is shown on theright as for independent variables from an exponential distribution.

In Paper I we introduce (probabilistic) graphical models to phylogenetics.A probabilistic graphical models consists of a directed acyclic graph (DAG)where each node represents a variable in the model. We adopt the notation toexplicitly model constant variables (fixed parameters), deterministic variables(variables transformations e.g. the Q matrix given the equilibrium frequencies)and stochastic variables. An example of a phylogenetic tree as a graphicalmodel is given in Figure 1.2.

We elaborate on the graphical model representation from three differentperspectives (levels): The beginners who want to become familiar with graphi-cal models and some basic phylogenetic models; The software developers whowant to develop and extend existing software for phylogenetic inference byadding new (sub-) models; The theoreticians who want to develop new mod-els, methods or algorithm for phylogenetic inference. We focus on showingthe strength of the graphical model representation, for example the flexibility,

20

Page 21: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

the understandability and the explicitness (see also Figure 1.2).The paper, though closely resembling a review or a point-of-view of cur-

rent models and methods, is intended as a first paper in a series of papers thatwill push forward graphical models in phylogenetics. These papers are writtenaround our new Bayesian phylogenetic inference software called RevBayes(www.RevBayes.net) which is being developed based on the idea of graphicalmodels.

1.3 Diversification

In the previous section I introduced phylogenetic inference and described mod-els to infer a phylogeny. In an optimal setting one would infer the phylogenyand all other parameters, such as the rates of diversification, jointly. Instead,it is common to infer diversification rates and patterns assuming a known phy-logeny and to disregard any uncertainty. The main reason behind this separa-tion is that different scientific communities with little overlap interact here –the first interested in modeling the process of speciation and extinction and thesecond in inferring phylogenies – and hence no model, method and software isavailable to perform joint inference. Moreover, a single gene-tree is often usedas a proxy for the species tree, ignoring the potential differences between themdue to population-level processes such as incomplete lineage sorting (Rannalaand Yang, 2003).

In Figure 1.3 I have illustrated the process of speciation. Figure 1.3.ashows the process with instantaneous separation of populations and Figure 1.3.bshows the process with protracted speciation, i.e. speciation is not instanta-neous (Etienne and Rosindell, 2012). Individuals evolve within each (sub-)population and by chance an individual could be more closely related to anindividual from another population which leads to incomplete lineage sorting(Maddison, 1997). It is generally assumed that every gene has its own realiza-tion from a multispecies coalescent model due to, e.g., recombination events.Using a single gene-tree as a proxy for the species tree overestimates the ageof divergence between two species and may or may not show a wrong treetopology.

These issues are presented here so that the reader is aware of some issueswith the standard assumptions of diversification analyses. The data are a recon-structed phylogenetic tree; a quantity that is not truly observed but estimated.

1.3.1 Estimating Diversification Rates

The specific birth-death process used in diversification rate studies is a specialcase of a Galton-Watson process where the offspring distribution equals to two.

21

Page 22: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

A1.1 A1.2 A2.1 A2.2 B1.1 B1.2 B2.1 B2.2 B3.1 B3.2

Speciation model Protracted speciation model

Species tree Gene tree

Incomplete Lineage Sorting

Subspeciation

Speciation initiation

Speciation completion

A.1 A.2 B.1 B.2 C.1 C.2 D.1 D.2 E.1 E.2

Figure 1.3: Models of speciation. The left sketch shows a realization of a birth-death process with instantaneous speciation events. Within this species tree thereis a gene tree highlighting the difference between the two. The right sketch showsthe same tree but if speciation was protracted and thus the same number of pop-ulations constitute to fewer species.

More specifically, every species gives birth to exactly one other species withrate λ and dies with rate µ . The probability of survival of at least one speciesand the distribution of the number of species has long been known (Kendall,1948). However, the challenge in phylogenetics is that only the reconstructedtree is observed, that is, all extinct lineages are pruned away. Fortunately evenunder this process, termed the reconstructed process, it is possible to inferspeciation and extinction rates (Nee et al., 1994; Rannala and Yang, 1996;Thompson, 1975). An illustration of the process is given in Figure 1.4.

The underlying assumption of the process of diversification, as used here,is that every species evolves independently of its environment and all speciesfollow the same process (homogeneity). These assumptions are obvious sim-plifications of reality because species are influenced by their environment, forexample due to the carrying capacity (Etienne et al., 2012), and species speci-ate under different rates, for example because of speciation rates being depen-dent on phenotypes (FitzJohn et al., 2009). Nevertheless, these assumptionsare used for mathematical convenience and to increase the power in estimatingparameters from the limited available data.

The majority of the papers in this thesis focus on the reconstructed process.Paper II and Paper IV tackle the problem of estimating speciation and extinc-

22

Page 23: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

t0

t1

t2

t3

T

Speciation

Extinction

Tim

e

b)a)

Figure 1.4: A simulated birth-death tree starting with a single species. a) Thecomplete tree containing both extant and extinct species. b) The reconstructedtree containing only extant species.

tion rates when only an incomplete sample of the extant species is available(Höhna, 2013b; Höhna et al., 2011) and Paper III targets efficient simulationsunder the reconstructed process (Höhna, 2013a).

1.3.2 Summary of Paper II

Only in very few studies are all species of a group – e.g. a family – included.Reasons commonly include time and financial constraints. In these situationthe specific species included in the study are carefully selected rather thanbeing picked randomly.

a) Uniform Sampling b) Diversified Sampling c) Cluster Sampling

0 0 0

Figure 1.5: Three sampled trees of size n = 4 from an extant tree of size m = 8(so ρ̂ = 0.5). a) A uniformly-sampled tree. b) A diversified-sampled tree. c) Acluster-sampled tree.

Previously the only model that accommodated incomplete taxon samplingwas uniform taxon sampling (frequently also called random taxon sampling al-

23

Page 24: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

though the term random may be misleading in this context). Under the uniformtaxon sampling concept every species has the same probability to be includedin the sample. In the paper we discuss two alternative methods of incompletetaxon sampling: diversified sampling and cluster sampling. Diversified sam-pling represents a method where all the recent species are removed (speciesthat resulted from a speciation event after some fixed time). Sampling speciesunder the diversified sampling concept maximizes the phylogenetic diversityobtained in the sample which gives the method its name. Cluster sampling isthe opposite of diversified sampling; only very recent species are included in acluster.

In the paper we derive the likelihood under each of the three samplingschemes conditioned on the number of sampled species, the time of the originof the process and both. We use the fact that each speciation event in thereconstructed tree is independent and identically distributed. The distributionand density functions had been published previously for constant speciationand extinction rates (Yang and Rannala, 1997). We apply the methods on 18previously published bird phylogenies and conclude, based on a Bayes factortest, that 11 data sets were likely to have resulted from diversified sampling.Our main result is that parameter estimates are likely to be seriously biased,particularly estimates of extinction rates, unless the sampling bias is modeledcorrectly.

1.3.3 Summary of Paper III

The ability of performing fast simulations is important in order to explore thebehavior of a process, validate inference methods and for model testing bymeans of posterior predictive simulations. Curiously, simulation methods forthe reconstructed process have been neglected previously. The intuitive ap-proach is to simulate birth and death events, thus simulating a complete phy-logeny including extinct lineages, and then to prune away all extinct lineages.Unfortunately, this approach has two major drawbacks: 1) An excessive num-ber of unnecessary speciation events may need to be simulated if the relativeextinction rate is large (µ/λ ≈ 1) and 2) It is not known when the processshould be stopped if the condition requires n species alive (see Fig. 1.6).

The problem is approached by realizing, again, that the density and distri-bution function for the speciation events in the reconstructed tree have someknown distribution. In the paper I extend the previous results to any time-dependent rate function. Then, using the distribution function I developed asimulation algorithm that is not only accurate (solving the problem of sim-ulating for a fixed number of species) but also outperforms other currentlyavailable algorithms by several orders of magnitude.

24

Page 25: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

0 2 4 6 8 10

020

40

60

80

100

120

Time

N(T

)

Figure 1.6: Realization of a birth-death process with λ = 2.0 and µ = 1.8. Hereit is clearly visible that the process reached n = 50 species multiple times andstopping when the process first hit n = 50 species leads to a wrong distributionof the speciation times.

25

Page 26: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

1.3.4 Summary of Paper IV

Paper IV is an extension of Paper II by allowing the diversification rates tochange over time. Rates of diversification are unlikely to be constant. Forexample, changing environments may lead first to a higher extinction rate or amass-extinction event which then opens ecological niches for new species.

In Paper IV I use time-dependent diversification rates. For example, diversity-dependent diversification rates can be approximated by time-dependent diver-sification rates with decaying rates (lower speciation rate because there is lessopportunity for new species) and the signal should be recoverable from thedata. I use the same approach as in Paper II to derive the likelihood func-tion under uniform taxon sampling and diversified taxon sampling but usethe distribution and density function derived in Paper III to include time-dependent diversification rates. Eventually I applied six different models oftime-dependent diversification rates, including constant diversification ratesand exponentially decaying diversification rates, to three higher order phy-logenies. These higher order phylogenies are constructed by sampling a fewrepresentative species for each family and thus resemble the diversified taxonsampling model. The results of the analysis favor a decaying speciation ratemodel for two of the three datasets and the diversified taxon sampling in one ofthe three datasets. In conclusion, the actual method how higher order phyloge-nies are constructed violates our mathematical definition of diversified taxonsampling and a more relaxed model, where not all missing speciation eventsmust have been after a specified point in time, is needed.

26

Page 27: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

References

Alfaro, M. E. and Holder, M. T. (2006). The posterior and the prior inBayesian phylogenetics. Annual Review of Ecology, Evolution, and Sys-tematics, 37(1):19–42.

Alström, P., Höhna, S., Gelang, M., Ericson, P., and Olsson, U. (2011). Non-monophyly and intricate morphological evolution within the avian familycettiidae revealed by multilocus analysis of a taxonomically densely sam-pled dataset. BMC Evolutionary Biology, 11(1):352.

Bouchard-Côté, A. and Jordan, M. I. (2013). Evolutionary inference via thepoisson indel process. Proceedings of the National Academy of Sciences,110(4):1160–1166.

Darwin, C. (1859). On the Origin of Species by Means of Natural Selection,or the Preservation of Favoured Races in the Struggle for Life. Murray.

Etienne, R., Haegeman, B., Stadler, T., Aze, T., Pearson, P., Purvis, A., andPhillimore, A. (2012). Diversity-dependence brings molecular phylogeniescloser to agreement with the fossil record. Proceedings of the Royal SocietyB: Biological Sciences, 279(1732):1300–1309.

Etienne, R. and Rosindell, J. (2012). Prolonging the past counteracts the pullof the present: protracted speciation can explain observed slowdowns indiversification. Systematic Biology, 61(2):204–213.

Felsenstein, J. (1973). Maximum-likelihood estimation of evolutionary treesfrom continuous characters. Am J Hum Genet, 25(5):471–92.

Felsenstein, J. (1978). The number of evolutionary trees. Systematic Zoology,27(1):27–33.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximumlikelihood approach. Journal of Molecular Evolution, 17(6):368–376.

Felsenstein, J. (1985). Confidence limits on phylogenies: an approach usingthe bootstrap. Evolution, 39(4):783–791.

Page 28: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

FitzJohn, R., Maddison, W., and Otto, S. (2009). Estimating trait-dependentspeciation and extinction rates from incompletely resolved phylogenies.Systematic biology, 58(6):595–611.

Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating of the human-apesplitting by a molecular clock of mitochondrial DNA. Journal of MolecularEvolution, 22(2):160–174.

Hastings, W. K. (1970). Monte carlo sampling methods using markov chainsand their applications. Biometrika, 57(1):97–109.

Höhna, S. (2013a). Fast simulation of reconstructed phylogenies under globaltime-dependent birth–death processes. Bioinformatics, 29(11):1367–1374.

Höhna, S. (2013b). Likelihood inference of non-constant diversification rateswith incomplete taxon sampling. PLoS One.

Höhna, S. and Drummond, A. J. (2012). Guided Tree Topology Proposals forBayesian Phylogenetic Inference. Systematic Biology, 61(1):1–11.

Höhna, S., Stadler, T., Ronquist, F., and Britton, T. (2011). Inferring speciationand extinction rates under different species sampling schemes. MolecularBiology and Evolution, 28(9):2577–2589.

Holder, M. and Lewis, P. (2003). Phylogeny estimation: traditional andBayesian approaches. Nature Reviews Genetics, 4(4):275.

Huelsenbeck, J., Larget, B., Miller, R., and Ronquist, F. (2002). PotentialApplications and Pitfalls of Bayesian Inference of Phylogeny. SystematicBiology, 51(5):673–688.

Huelsenbeck, J., Ronquist, F., Nielsen, R., and Bollback, J. (2001). BayesianInference of Phylogeny and Its Impact on Evolutionary Biology. Science,294(5550):2310 – 2314.

Huelsenbeck, J. P. and Rannala, B. (1997). Phylogenetic methods come of age:testing hypotheses in an evolutionary context. Science, 276(5310):227–232.

Jukes, T. and Cantor, C. (1969). Evolution of protein molecules. MammalianProtein Metabolism, 3:21–132.

Kendall, D. G. (1948). On the generalized "birth-and-death" process. TheAnnals of Mathematical Statistics, 19(1):1–15.

Kimura, M. (1980). A simple method for estimating evolutionary rates of basesubstitutions through comparative studies of nucleotide sequences. Journalof Molecular Evolution, 16(2):111–120.

Page 29: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Lakner, C., van der Mark, P., Huelsenbeck, J. P., Larget, B., and Ronquist, F.(2008). Efficiency of markov chain monte carlo tree proposals in bayesianphylogenetics. Systematic Biology, 57(1):86–103.

Li, S., Pearl, D. K., and Doss, H. (2000). Phylogenetic tree construction usingmarkov chain monte carlo. Journal of the American Statistical Association,95(450):493–508.

Lunter, G., Drummond, A. J., Miklós, I., and Hein, J. (2005). Statistical align-ment: Recent progress, new applications, and challenges. In Statisticalmethods in molecular evolution, pages 375–405. Springer.

Maddison, W. P. (1997). Gene trees in species trees. Systematic biology,46(3):523–536.

Mau, B. and Newton, M. A. (1997). Phylogenetic inference for binary dataon dendograms using markov chain monte carlo. Journal of Computationaland Graphical Statistics, 6(1):122–131.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E.(1953). Equation of State Calculations by Fast Computing Machines. Jour-nal of Chemical Physics, 21:1087–1092.

Nee, S., May, R. M., and Harvey, P. H. (1994). The reconstructed evolutionaryprocess. Philosophical Transactions: Biological Sciences, 344(1309):305–311.

Rannala, B. and Yang, Z. (1996). Probability distribution of molecular evolu-tionary trees: A new method of phylogenetic inference. Journal of Molecu-lar Evolution, 43(3):304–311.

Rannala, B. and Yang, Z. (2003). Bayes estimation of species divergence timesand ancestral population sizes using dna sequences from multiple loci. Ge-netics, 164(4):1645–1656.

Redelings, B. and Suchard, M. (2005). Joint Bayesian Estimation of Align-ment and Phylogeny. Systematic Biology, 54:401–418.

Redelings, B. D. and Suchard, M. A. (2009). Robust inferences from am-biguous alignments. Sequence alignment: methods, models, concepts, andstrategies, pages 209–271.

Tavaré, S. (1986). Some probabilistic and statistical problems in the analy-sis of DNA sequences. Some Mathematical Questions in BiologyDNASequence Analysis, 17:57–86.

Page 30: Sebastian Höhna - DiVA portalsu.diva-portal.org/smash/get/diva2:659516/FULLTEXT01.pdfAbstract Phylogenetics is the study of the evolutionary relationship between species. Inference

Thompson, E. (1975). Human evolutionary trees. Cambridge University PressCambridge.

Thorne, J. L., Kishino, H., and Felsenstein, J. (1991). An evolutionary modelfor maximum likelihood alignment of dna sequences. Journal of MolecularEvolution, 33(2):114–124.

Thorne, J. L., Kishino, H., and Felsenstein, J. (1992). Inching toward reality:an improved likelihood model of sequence evolution. Journal of MolecularEvolution, 34(1):3–16.

Wong, K. M., Suchard, M. A., and Huelsenbeck, J. P. (2008). Alignmentuncertainty and genomic analysis. Science, 319(5862):473–476.

Yang, Z. and Rannala, B. (1997). Bayesian phylogenetic inference using DNAsequences: a Markov Chain Monte Carlo Method. Molecular Biology andEvolution, 14(7):717–724.