Species tree inference - CBAcba.anu.edu.au/files/SNAPP_Bouckaert.pdf · The output of a Bayesian...

Species tree inference

Remco Bouckaert

University of Auckland, Max Planck Institute

Canberra, 2018

Some slides are based on material provided by David Bryant, Alexei Drummond, Joseph Heled, Paul Lewis


Gene trees and species trees


Bi-allelic markers (SNPs and AFLPs)

SNAPP = SNP and AFLP Package for Phylogenetic Analysis
= multi species coalescent without those pesky gene trees.

Assumptions:

independent sites

only coalescent and mutation (no selection, migration, gene flow, ...)

one gene tree per site

integrates out gene trees

tree in units of substitutions

(Bryant et al., MBE 2012)

The coalescent

What is the probability that two individuals have the same parent?


Theoretical population genetics

Most of theoretical population genetics is based on the idealised Wright-Fisher model of population which assumes

Constant population size N

Discrete generations

Complete mixing

For the purposes of this presentation the population will be assumed to be haploid.
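Under these assumptions each haploid individual picks its parent uniformly at random from the N individuals of the previous generation, so two individuals share a parent with probability 1/N. The sketch below checks this by simulation; the function name and the parameter values are illustrative and not taken from the slides.

```python
import random


def same_parent_frequency(pop_size, trials, seed=1):
    """Estimate how often two individuals share a parent in a haploid
    Wright-Fisher population of constant size pop_size."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each individual chooses its parent independently and uniformly.
        parent_a = rng.randrange(pop_size)
        parent_b = rng.randrange(pop_size)
        if parent_a == parent_b:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    N = 50  # illustrative population size
    print("simulated:", same_parent_frequency(N, 200_000))
    print("expected 1/N:", 1 / N)
```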


The coalescent

Data: a small genetic sample from a large background population.

The coalescent

is a model of the ancestral relationships of a sample of individuals taken from a larger population.

describes a probability distribution on ancestral genealogies (trees) given a population history, N(t).

Therefore the coalescent can convert information from ancestral genealogies into information about population history and vice versa.

is a model of ancestral genealogies, not sequences, and in its simplest form assumes neutral evolution.

can be thought of as the process in a SNAPP tree


Constant population size: N(t) = N0
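For a constant haploid population of size N, the standard coalescent approximation (sample much smaller than N) says that while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/(2N) per generation. The expected time to the most recent common ancestor of n samples is then 2N(1 - 1/n) generations. A minimal simulation sketch, with illustrative names and values:

```python
import random


def simulate_tmrca(n_samples, pop_size, rng):
    """Draw one time to the most recent common ancestor, in generations."""
    t = 0.0
    k = n_samples
    while k > 1:
        pairs = k * (k - 1) / 2
        # Any of the `pairs` lineage pairs coalesces at rate 1/N.
        t += rng.expovariate(pairs / pop_size)
        k -= 1
    return t


def expected_tmrca(n_samples, pop_size):
    """Closed-form expectation: sum over k of 2N / (k (k - 1))."""
    return 2 * pop_size * (1 - 1 / n_samples)


if __name__ == "__main__":
    rng = random.Random(7)
    draws = [simulate_tmrca(10, 1000, rng) for _ in range(20_000)]
    print("simulated mean:", sum(draws) / len(draws))
    print("expected:", expected_tmrca(10, 1000))
```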


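Bayesian models: SNAPP likelihood, prior, posterior

Posterior:

$$p(\theta \mid D, M) = \frac{\overbrace{p(\theta \mid M)}^{\text{prior}} \; \overbrace{p(D \mid M, \theta)}^{\text{likelihood}}}{\underbrace{P(D \mid M)}_{\text{marginal likelihood}}}$$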

θ: parameters

T tree

u rate of red → green

v rate of green → red

c_1, ..., c_k coalescent rates, one for each of the k branches

λ birth rate

priors:

Yule prior for tree, with λ birthrate

priors on λ, u (with (u + v)/2 = uv) and c_1, ..., c_k


Bayesian Phylogenetics

The output of a Bayesian evolutionary analysis is a probability distribution on trees and parameter values.

For phylogenetics the tree topology is the object of interest. The substitution parameters and tree prior parameters are a nuisance that we average over using MCMC and then ignore.

For population genetics the tree and substitution parameters are a nuisance that we average over and then ignore, focusing instead on the population parameters.

Often a more specific hypothesis is of interest (like ‘Did this adaptive radiation predate the Miocene?’) and then the result of the analysis should be the testing of this hypothesis, averaged over all trees and parameter values, weighted by their probability given the data.
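With MCMC output, averaging a hypothesis over all trees and parameter values amounts to counting the fraction of posterior samples in which the hypothesis holds. The sketch below illustrates this for the Miocene question. The crown ages are made-up numbers standing in for values logged during a run, and the function name is hypothetical.

```python
MIOCENE_START_MA = 23.03  # start of the Miocene, in million years ago


def posterior_probability(samples, predicate):
    """Fraction of posterior samples for which the hypothesis holds."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one posterior sample")
    return sum(1 for s in samples if predicate(s)) / len(samples)


# Made-up crown ages (Ma), one per posterior sample (after burn-in).
crown_ages = [21.4, 25.1, 24.0, 22.8, 26.3, 23.5, 27.9, 24.6]

# "Predates the Miocene" means the radiation is older than its start.
p_predates = posterior_probability(crown_ages, lambda age: age > MIOCENE_START_MA)
print(p_predates)  # 6 of the 8 made-up samples are older: 0.75
```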


Tree space as a hilly landscape

The space of all possible trees can be visualized as a hilly landscape. Nearby points in this landscape represent similar trees, and the height of the landscape is the probability of the tree at that point.

This space can be sampled in a Bayesian analysis with MCMC

The peak can be identified by a search algorithm in the context of maximum likelihood

Markov chain Monte Carlo (MCMC) robot [courtesy of Paul O Lewis]


Pure Random Walk [courtesy of Paul O Lewis]

Proposal scheme:

random direction

gamma-distributed step length (mean 45 pixels, s.d. 40 pixels)

reflection at edges

Target distribution:

equal mixture of 3 bivariate normal hills

inner contours: 50%

outer contours: 95%

In this case the robot is accepting every step and 5000 steps are shown


Burn In

Robot is now following the rules and thus quickly finds one of the three hills.

Note that first few steps are not at all representative of the distribution.

100 steps taken from starting point


Target Distribution Approximation [courtesy of Paul O Lewis]

How good is the MCMC approximation?

51.2% of points are inside inner contours (cf. 50% actual)

93.6% of points are inside outer contours (cf. 95% actual)

Approximation gets better the longer the chain is allowed to run.

5000 steps taken
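The robot experiment can be approximated with a few lines of Metropolis sampling. The sketch below is not Paul Lewis's original program. The hill centres and standard deviation are invented, reflection at the edges is left out (the robot walks on an unbounded plane), and "inside a contour" is approximated as lying within the matching circle around at least one hill, which is reasonable only because the assumed hills are well separated. The proposal uses the step-length distribution from the slide: gamma with mean 45 and s.d. 40.

```python
import math
import random

HILLS = [(150.0, 150.0), (350.0, 150.0), (250.0, 320.0)]  # assumed centres (pixels)
HILL_SD = 30.0  # assumed standard deviation of each circular hill


def log_target(x, y):
    """Log density, up to a constant, of an equal mixture of the three hills."""
    terms = [-((x - cx) ** 2 + (y - cy) ** 2) / (2 * HILL_SD ** 2) for cx, cy in HILLS]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def propose(x, y, rng, mean=45.0, sd=40.0):
    """Random direction, gamma-distributed step length."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    step = rng.gammavariate(shape, scale)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return x + step * math.cos(angle), y + step * math.sin(angle)


def run_robot(n_steps, start=(0.0, 0.0), seed=1):
    """Metropolis walk; returns the position after every step."""
    rng = random.Random(seed)
    x, y = start
    lp = log_target(x, y)
    path = []
    for _ in range(n_steps):
        nx, ny = propose(x, y, rng)
        nlp = log_target(nx, ny)
        # Uphill moves are always taken; downhill moves are taken with
        # probability equal to the ratio of target densities.
        if nlp >= lp or rng.random() < math.exp(nlp - lp):
            x, y, lp = nx, ny, nlp
        path.append((x, y))
    return path


def coverage(path, level):
    """Fraction of points within the `level` circle of at least one hill."""
    radius_sq = -2.0 * HILL_SD ** 2 * math.log(1.0 - level)
    inside = sum(
        1 for x, y in path
        if any((x - cx) ** 2 + (y - cy) ** 2 <= radius_sq for cx, cy in HILLS)
    )
    return inside / len(path)


if __name__ == "__main__":
    path = run_robot(20_000)
    kept = path[500:]  # discard burn-in
    print("inside 50% contours:", coverage(kept, 0.50))
    print("inside 95% contours:", coverage(kept, 0.95))
```

Running longer chains should bring both fractions closer to their nominal 50% and 95%, which is the point made on the slide.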


Setting up mutation rates

Do not use the defaults!

Estimate during MCMC

Estimate from data, µ_u = 1/(2π_0), µ_v = 1/(2π_1), and keep fixed

Include non-polymorphic sites if you can!

If sampling any, hyper priors need to be specified
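The "estimate from data" option can be sketched as follows, assuming that π_0 and π_1 are the observed proportions of the two alleles over the whole alignment and that the data are coded as 0/1 per lineage with `None` for missing values. The coding and the function name are assumptions made for illustration. Rates computed this way automatically satisfy (u + v)/2 = uv, the constraint given with the priors on u.

```python
def mutation_rates(sites):
    """Return (u, v) with u = 1/(2*pi0) and v = 1/(2*pi1).

    `sites` is a list of rows, each a list of 0, 1 or None (missing).
    """
    counts = [0, 0]
    for row in sites:
        for allele in row:
            if allele is not None:
                counts[allele] += 1
    if counts[0] == 0 or counts[1] == 0:
        raise ValueError("both alleles must be observed to estimate the rates")
    total = counts[0] + counts[1]
    pi0 = counts[0] / total
    pi1 = counts[1] / total
    return 1 / (2 * pi0), 1 / (2 * pi1)


if __name__ == "__main__":
    # Toy alignment: 6 observations of allele 0, 2 of allele 1, 1 missing.
    toy = [[0, 0, 0, 1], [0, 0, 1, 0, None]]
    u, v = mutation_rates(toy)
    print(u, v)
    print((u + v) / 2, u * v)  # the two sides of the constraint
```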


Priors

Yule (pure birth) prior on species tree topology, birth rate λ

Gamma, Inverse Gamma, Uniform or CIR prior on pop sizes

If sampling any, hyper priors need to be specified


Ascertainment correction

SNP data is selected under various conditions

Non constant sites only

At least N different sites only, e.g., no constant and no singletons

Panels – different ascertainment within species.

Others...

This has considerable impact on pop sizes/tree height estimates

Active field of research/work in progress...


Missing data

SNAPP handles a site with missing data as if the lineage does not exist

Best not used without non-polymorphic sites

SNAPP needs at least one observation per species. If not, sites will be deleted.


Performance tuning

= tree likelihood calculation optimisation

Use threads – need to find optimal number for your data

MCMCMC – from BEASTLabs package

Start with small nr of lineages, increase till running out of patience

Subsample lineages, not sites

(No BEAGLE support, none expected)


Species tree topology simulations

‘Easy tree’

3 trees in credibility set for ≤ 400 loci

1 tree in credibility set for ≥ 500 loci.

True tree always in credibility set

‘Hard tree’

2-3 trees in credibility set for ≤ 10000 loci

1 tree in credibility set for ≥ 100000 loci.

True tree always in credibility set

Species tree topology is recovered well


Finding the lineage sorting sweet spot

No knowledge about topology
‘Infinite’ population size estimates
Quite a bit of knowledge about ancestral population

Some knowledge about topology (like a standard phylogenetic analysis)
Poor population size estimates
Some knowledge about present-day population sizes


Some correlations to be aware of I

[Figure: two scatter plots, log(theta) against log(tree height) and log(# constant sites) against log(tree height)]

Some correlations to be aware of II

[Figure: two scatter plots, log(# constant sites) against log(theta) and log(# constant sites) against log(coalescent rate)]


Powers and limitations of SNAPP

SNAPP recovers

Topology

Coalescent times

Population sizes per branch

in order of decreasing accuracy

Not enough lineages ⇒ pop size samples from prior (happens often near root)

θ and coalescent time estimates are often more accurate at the bottom than the top of the tree (higher uncertainty when going back in time)


Detecting Anomalies

Consider two species

What happens when they are grouped as one species? Pop size increases

Where to look for cryptic species? Large pop size branches

Where to look for mislabeled lineages? Large pop size branches


Species delimitation – BFD*

Variant of Bayesian model selection

Suppose there are two species assignments A and B

Estimate log marginal likelihood (ML_A) using stepping stone analysis of taxon sets A

Estimate log marginal likelihood (ML_B) of taxon sets B

The difference ∆ = 2(ML_A − ML_B) is twice the log Bayes factor.

∆ < 0: support for B
0 < ∆ < 2: barely worth mentioning
2 < ∆ < 6: substantial support for A
6 < ∆ < 10: strong support for A
10 < ∆: decisive

(Leache et al 2014 SysBio)
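The comparison above is easy to script once the two log marginal likelihoods are available. In the sketch below the function names are hypothetical and the example values are made up. The slide leaves the exact boundary values (0, 2, 6, 10) unassigned, so placing each boundary in the higher category is an assumption.

```python
def bfd_delta(log_ml_a, log_ml_b):
    """Delta = 2 (ML_A - ML_B), with both arguments on the log scale."""
    return 2.0 * (log_ml_a - log_ml_b)


def interpret(delta):
    """Map Delta onto the categories listed on the slide.

    Boundary values are assigned to the higher category (an assumption).
    """
    if delta < 0:
        return "support for B"
    if delta < 2:
        return "barely worth mentioning"
    if delta < 6:
        return "substantial support for A"
    if delta < 10:
        return "strong support for A"
    return "decisive"


if __name__ == "__main__":
    # Made-up log marginal likelihoods for species assignments A and B.
    delta = bfd_delta(-1234.5, -1240.0)
    print(delta, interpret(delta))
```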


Stepping stone/path sampling analysis

Requires model-selection package in BEAST

Setting up and running an analysis:

Start BEAST AppStore, select PathSampler

or hack XML (see wiki or tutorial)

How many steps?

start with small number of steps, say 8

increase nr of steps till ML estimate does not decrease any more

How long?

no hard and fast rule

ESS > 200 for each step is overkill

multiple (say 4) runs giving same estimate
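To see why the number of steps matters, the stepping-stone estimator can be tried on a toy model whose marginal likelihood is known exactly. The sketch below is not the BEAST model-selection package and says nothing about its internals. It assumes data y_i ~ Normal(µ, 1) with prior µ ~ Normal(0, τ²), so every power posterior is itself normal and can be sampled directly, where a real analysis runs one MCMC chain per step. The schedule β_k = (k/K)^(1/0.3) is an assumed, commonly used choice, and all names are illustrative.

```python
import math
import random


def log_likelihood(mu, data):
    """Log likelihood of the data under y_i ~ Normal(mu, 1)."""
    return -0.5 * len(data) * math.log(2 * math.pi) - 0.5 * sum((y - mu) ** 2 for y in data)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def stepping_stone_log_ml(data, tau, n_steps, n_draws, seed=1):
    """Stepping-stone estimate of the log marginal likelihood."""
    rng = random.Random(seed)
    n, total = len(data), sum(data)
    betas = [(k / n_steps) ** (1 / 0.3) for k in range(n_steps + 1)]  # assumed schedule
    log_ml = 0.0
    for b_prev, b_next in zip(betas, betas[1:]):
        # Power posterior at b_prev: prior times likelihood**b_prev, a normal here.
        precision = 1 / tau ** 2 + b_prev * n
        mean = b_prev * total / precision
        sd = math.sqrt(1 / precision)
        draws = [rng.gauss(mean, sd) for _ in range(n_draws)]
        # Each step contributes log E[ likelihood ** (b_next - b_prev) ].
        log_ml += log_mean_exp([(b_next - b_prev) * log_likelihood(mu, data) for mu in draws])
    return log_ml


def exact_log_ml(data, tau):
    """Closed-form log marginal likelihood of the toy model."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau ** 2)
            - 0.5 * (ss - tau ** 2 * s ** 2 / (1 + n * tau ** 2)))


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(1.0, 1.0) for _ in range(20)]
    for steps in (2, 8, 32):
        print(steps, stepping_stone_log_ml(data, tau=2.0, n_steps=steps, n_draws=2000))
    print("exact", exact_log_ml(data, tau=2.0))
```

Repeating the estimate with different seeds plays the role of the multiple runs recommended above: agreement between runs is the practical check when no exact value is available.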


Future developments

Snapper – SNAPP but faster

Nested sampling for species delimitation

• 2 PhD candidates and Post-doc position available (see http://beast2.org) to work on multi species coalescent methods


http://beast2.org
