Species tree inference

Remco Bouckaert [email protected]

University of Auckland, Max Planck Institute

Canberra, 2018

Some slides are based on material provided by David Bryant, Alexei Drummond, Joseph Heled, Paul Lewis


Gene trees and species trees


Bi-allelic markers (SNPs and AFLPs)

SNAPP = SNP and AFLP Package for Phylogenetic Analysis

= multi species coalescent without those pesky gene trees.

Assumptions:

independent sites, only coalescent and mutation (no selection, migration, gene flow, ...)

one gene tree per site

integrates out gene trees

tree in units of substitutions

(Bryant et al., MBE 2012)


The coalescent

What is the probability that two individuals have the same parent?


Theoretical population genetics

Most of theoretical population genetics is based on the idealised Wright-Fisher model of a population, which assumes

Constant population size N

Discrete generations

Complete mixing

For the purposes of this presentation the population will be assumed to be haploid.
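The question asked on the coalescent slide (what is the probability that two individuals have the same parent?) has a simple answer under these assumptions: each individual picks its parent uniformly from the N individuals of the previous generation, so the probability is 1/N, and the number of generations back to a common ancestor is geometric with mean N. The following is a minimal simulation sketch, assuming a haploid Wright-Fisher population; the function names and the value N = 50 are illustrative choices, not part of the slides.

```python
import random

def share_parent_frequency(n_pop, n_trials, seed=1):
    """Fraction of trials in which two individuals pick the same parent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Each individual picks its parent uniformly from the previous generation.
        if rng.randrange(n_pop) == rng.randrange(n_pop):
            hits += 1
    return hits / n_trials

def generations_to_common_ancestor(n_pop, rng):
    """Generations back until two lineages first pick the same parent."""
    t = 1
    while rng.randrange(n_pop) != rng.randrange(n_pop):
        t += 1
    return t

N = 50  # illustrative population size
freq = share_parent_frequency(N, 200_000)
rng = random.Random(2)
n_reps = 10_000
mean_t = sum(generations_to_common_ancestor(N, rng) for _ in range(n_reps)) / n_reps

print(f"P(same parent): simulated {freq:.4f}, expected {1 / N:.4f}")
print(f"mean generations to common ancestor: simulated {mean_t:.1f}, expected {N}")
```

The simulated values should agree with 1/N and N up to Monte Carlo noise.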


The coalescent

Data: a small genetic sample from a large background population.

The coalescent

is a model of the ancestral relationships of a sample of individuals taken from a larger population.

describes a probability distribution on ancestral genealogies (trees) given a population history, N(t).

Therefore the coalescent can convert information from ancestral genealogies into information about population history and vice versa.

is a model of ancestral genealogies, not sequences, and in its simplest form assumes neutral evolution.

can be thought of as the process in a SNAPP tree


Constant population size: N(t) = N0
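With a constant population size the coalescent is easy to simulate: while k lineages remain, the waiting time to the next coalescence is approximately exponential with rate k(k-1)/2 divided by N (time in generations, haploid population, N large). The sketch below simulates the time to the most recent common ancestor and compares it with the closed form 2N(1 - 1/n); the function names and the values n = 5, N = 1000 are illustrative assumptions.

```python
import random

def simulate_tmrca(n_samples, n_pop, rng):
    """Time (in generations) to the most recent common ancestor of n_samples lineages."""
    t = 0.0
    k = n_samples
    while k > 1:
        pairs = k * (k - 1) / 2
        # With k lineages, the waiting time to the next coalescence is
        # exponential with rate (k choose 2) / N.
        t += rng.expovariate(pairs / n_pop)
        k -= 1
    return t

def expected_tmrca(n_samples, n_pop):
    """Sum of the expected waiting times N / (k choose 2) for k = 2..n."""
    return 2 * n_pop * (1 - 1 / n_samples)

rng = random.Random(7)
n, N0 = 5, 1000  # illustrative sample size and population size
reps = 20_000
mean_tmrca = sum(simulate_tmrca(n, N0, rng) for _ in range(reps)) / reps
print(f"simulated mean TMRCA {mean_tmrca:.0f}, expected {expected_tmrca(n, N0):.0f}")
```

Most of the expected tree height comes from the last two lineages, which wait N generations on average; this is one reason why estimates near the root carry the most uncertainty.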


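Bayesian models: SNAPP likelihood, prior, posterior

Posterior:

$$
p(\theta \mid D, M) = \frac{\overbrace{p(\theta \mid M)}^{\text{prior}}\;\overbrace{p(D \mid M, \theta)}^{\text{likelihood}}}{\underbrace{p(D \mid M)}_{\text{marginal likelihood}}}
$$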

θ: parameters

T tree

u rate of red → green

v rate of green → red

c_1, ..., c_k coalescent rates, one for each of the k branches

λ birth rate

priors:

Yule prior for tree, with λ birth rate

priors on λ, u (subject to (u + v)/2 = uv) and c_1, ..., c_k


Bayesian Phylogenetics

The output of a Bayesian evolutionary analysis is a probability distribution on trees and parameter values.

For phylogenetics the tree topology is the object of interest. The substitution parameters and tree prior parameters are a nuisance that we average over using MCMC and then ignore.

For population genetics the tree and substitution parameters are a nuisance that we average over and then ignore, focusing instead on the population parameters.

Often a more specific hypothesis is of interest (like ‘Did this adaptive radiation predate the Miocene?’) and then the result of the analysis should be the testing of this hypothesis, averaged over all trees and parameter values, weighted by their probability given the data.


Tree space as a hilly landscape

The space of all possible trees can be visualized as a hilly landscape. Nearby points in this landscape represent similar trees, and the height of the landscape is the probability of the tree at that point.

This space can be sampled in a Bayesian analysis with MCMC

The peak can be identified by a search algorithm in the context of maximum likelihood


Markov chain Monte Carlo (MCMC) robot

[courtesy of Paul O. Lewis]


Pure Random Walk

[courtesy of Paul O. Lewis]

Proposal scheme:

random direction

gamma-distributed step length (mean 45 pixels, s.d. 40 pixels)

reflection at edges

Target distribution:

equal mixture of 3 bivariate normal hills

inner contours: 50%

outer contours: 95%

In this case the robot is accepting every step and 5000 steps are shown


Burn In

Robot is now following the rules and thus quickly finds one of the three hills.

Note that the first few steps are not at all representative of the distribution.

100 steps taken from starting point


Target Distribution Approximation

[courtesy of Paul O. Lewis]

How good is the MCMC approximation?

51.2% of points are inside inner contours (cf. 50% actual)

93.6% of points are inside outer contours (cf. 95% actual)

Approximation gets better the longer the chain is allowed to run.

5000 steps taken
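The robot walk can be imitated with a few lines of code. The sketch below is a random-walk Metropolis sampler on an equal mixture of three bivariate normal hills. It is not the robot from the slides: the hill positions, the standard deviation, the starting point and the Gaussian step proposal (used here instead of the gamma-distributed step length with reflection) are all assumptions made for illustration.

```python
import math
import random

# Hypothetical hill centres and common standard deviation (not taken from the slides).
HILLS = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
SD = 1.0

def target_density(x, y):
    """Unnormalised density: equal mixture of three circular bivariate normals."""
    return sum(
        math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * SD ** 2))
        for cx, cy in HILLS
    )

def metropolis(n_steps, step_sd=1.5, start=(8.0, 8.0), seed=42):
    """Random-walk Metropolis: propose a symmetric step, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    x, y = start
    samples, accepted = [], 0
    for _ in range(n_steps):
        nx, ny = x + rng.gauss(0, step_sd), y + rng.gauss(0, step_sd)
        ratio = target_density(nx, ny) / target_density(x, y)
        if rng.random() < ratio:  # uphill moves (ratio >= 1) are always accepted
            x, y = nx, ny
            accepted += 1
        samples.append((x, y))  # a rejected proposal repeats the current point
    return samples, accepted / n_steps

samples, acceptance = metropolis(20_000)
kept = samples[100:]  # discard the first 100 steps as burn-in

# Share of retained samples closest to each hill; roughly a third each is expected.
shares = [0, 0, 0]
for x, y in kept:
    nearest = min(range(3), key=lambda i: (x - HILLS[i][0]) ** 2 + (y - HILLS[i][1]) ** 2)
    shares[nearest] += 1
shares = [s / len(kept) for s in shares]

print(f"acceptance rate: {acceptance:.2f}")
print("share of samples nearest each hill:", [round(s, 2) for s in shares])
```

Dropping the accept/reject test (accepting every proposal) turns this back into the pure random walk, which ignores the hills entirely.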


Setting up mutation rates

Do not use the defaults!

Estimate during MCMC

Estimate from data, µ_u = 1/(2π_0), µ_v = 1/(2π_1) (π_0 and π_1 being the frequencies of the two alleles in the data), and keep fixed

Include non-polymorphic sites if you can!

If sampling any, hyper priors need to be specified
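The "estimate from data" option can be sketched as follows. The code counts the two alleles, converts the frequencies into rates with u = 1/(2π_0) and v = 1/(2π_1), and checks that the result satisfies the constraint (u + v)/2 = uv. The haploid '0'/'1'/'?' string coding and the function names are assumptions for illustration; this is not SNAPP's own input format or code.

```python
import math

def allele_frequencies(sequences):
    """Frequencies of alleles '0' and '1', ignoring any other symbol (e.g. '?')."""
    zeros = sum(seq.count("0") for seq in sequences)
    ones = sum(seq.count("1") for seq in sequences)
    total = zeros + ones
    return zeros / total, ones / total

def mutation_rates(pi0, pi1):
    """Rates u and v fixed from the allele frequencies."""
    return 1.0 / (2.0 * pi0), 1.0 / (2.0 * pi1)

# Toy data: four lineages, four sites, one missing observation.
sequences = ["0001", "0011", "00?0", "0101"]
pi0, pi1 = allele_frequencies(sequences)
u, v = mutation_rates(pi0, pi1)

print(f"pi0 = {pi0:.3f}, pi1 = {pi1:.3f}")
print(f"u = {u:.3f}, v = {v:.3f}")
print("constraint (u + v)/2 = u*v holds:", math.isclose((u + v) / 2, u * v))
```

Because π_0 + π_1 = 1, any pair of rates computed this way satisfies the constraint automatically.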


Priors

Yule (pure birth) prior on species tree topology, birth rate λ

Gamma, Inverse Gamma, Uniform or CIR prior on pop sizes

If sampling any, hyper priors need to be specified


Ascertainment correction

SNP data is selected under various conditions

Non constant sites only

At least N different sites only, e.g., no constant and no singletons

Panels – different ascertainment within species.

Others...

This has considerable impact on pop sizes/tree height estimates

Active field of research/work in progress...
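One generic way to account for the "non constant sites only" condition is to condition each site likelihood on the site being variable, that is, to divide by one minus the probability of the two constant patterns. The sketch below shows only this textbook conditioning step with made-up numbers; it is an assumption-laden illustration, not a description of how SNAPP implements ascertainment correction.

```python
import math

def corrected_log_likelihood(site_log_liks, log_p_all_0, log_p_all_1):
    """Log-likelihood of the sites, conditioned on every site being variable."""
    p_variable = 1.0 - math.exp(log_p_all_0) - math.exp(log_p_all_1)
    # Each site likelihood is divided by P(site is not constant).
    return sum(site_log_liks) - len(site_log_liks) * math.log(p_variable)

# Hypothetical values: two variable sites and the probabilities of the
# all-0 and all-1 patterns under the current parameters.
site_log_liks = [math.log(0.1), math.log(0.2)]
corrected = corrected_log_likelihood(site_log_liks, math.log(0.3), math.log(0.2))
print(f"uncorrected {sum(site_log_liks):.3f}, corrected {corrected:.3f}")
```

The correction raises the likelihood of each retained site, and the size of the shift depends on how probable constant sites are under the current tree height and population sizes, which is why the choice matters for those estimates.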


Missing data

SNAPP handles a site with missing data as if the lineage does not exist

Best not used without non-polymorphic sites

SNAPP needs at least one observation per species. If not, sites will be deleted.
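To see in advance which sites would survive the "at least one observation per species" rule, a check along the following lines can help. The data layout (a dict of lineage name to a string of '0', '1' and '?') and the function name are hypothetical, chosen only for this sketch.

```python
def sites_with_all_species_observed(alignment, species_of, missing="?"):
    """Return indices of sites where every species has at least one observed lineage."""
    n_sites = len(next(iter(alignment.values())))
    species = set(species_of.values())
    keep = []
    for i in range(n_sites):
        # Species that have at least one non-missing observation at site i.
        observed = {species_of[name] for name, seq in alignment.items() if seq[i] != missing}
        if observed == species:
            keep.append(i)
    return keep

# Toy data: two species (A, B) with two lineages each, four sites.
alignment = {
    "a1": "01?1",
    "a2": "0??1",
    "b1": "1?00",
    "b2": "1?0?",
}
species_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

kept = sites_with_all_species_observed(alignment, species_of)
print("sites kept:", kept)
```

In this toy alignment the second site has no observation for species B and the third has none for species A, so only the first and last sites are kept.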


Performance tuning

= tree likelihood calculation optimisation

Use threads – need to find optimal number for your data

MCMCMC – from BEASTLabs package

Start with a small number of lineages, increase until you run out of patience

Subsample lineages, not sites

(No BEAGLE support, none expected)


Species tree topology simulations

‘Easy tree’

3 trees in credibility set for ≤ 400 loci

1 tree in credibility set for ≥ 500 loci.

True tree always in credibility set

‘Hard tree’

2-3 trees in credibility set for ≤ 10000 loci

1 tree in credibility set for ≥ 100000 loci.

True tree always in credibility set

Species tree topology is recovered well
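A credibility (credible) set of topologies can be read off the posterior sample by sorting topologies by frequency and adding them until the chosen probability mass is covered. The sketch below assumes a 95% level and that each sampled tree has already been reduced to a canonical topology string, so that equal topologies compare equal; both are assumptions, as are the toy counts.

```python
from collections import Counter

def credible_set(topologies, level=0.95):
    """Smallest set of most frequent topologies whose total frequency reaches `level`."""
    counts = Counter(topologies)
    total = len(topologies)
    chosen, covered = [], 0
    for topology, n in counts.most_common():
        chosen.append(topology)
        covered += n
        if covered >= level * total:
            break
    return chosen

# Toy posterior sample of 100 trees over three species plus an outgroup.
posterior_sample = (
    ["((A,B),C)"] * 70
    + ["((A,C),B)"] * 20
    + ["((B,C),A)"] * 6
    + ["(A,B,C)"] * 4
)
cred = credible_set(posterior_sample)
print(f"{len(cred)} topologies in the 95% credible set:", cred)
```

With more loci the posterior concentrates, so the set shrinks towards a single topology, which is the pattern reported for both simulated trees.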


Finding the lineage sorting sweet spot

No knowledge about topology

‘Infinite’ population size estimates

Quite a bit of knowledge about ancestral population

Some knowledge about topology (like a standard phylogenetic analysis)

Poor population size estimates

Some knowledge about present-day population sizes


Some correlations to be aware of I

[Figure: two scatter plots. Left panel: log(theta) against log(tree height). Right panel: log(# constant sites) against log(tree height).]


Some correlations to be aware of II

[Figure: two scatter plots. Left panel: log(# constant sites) against log(theta). Right panel: log(# constant sites) against log(coalescent rate).]


Powers and limitations of SNAPP

SNAPP recovers

Topology

Coalescent times

Population sizes per branch

in order of decreasing accuracy

Not enough lineages ⇒ pop size samples from prior (happens often near root)

θ and coalescent time estimates are often more accurate at the bottom than the top of the tree (higher uncertainty when going back in time)


Detecting Anomalies

Consider two species

What happens when they are grouped as one species? Pop size increases

Where to look for cryptic species? Large pop size branches

Where to look for mislabeled lineages? Large pop size branches


Bayesian model selection: marginal likelihood

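Posterior:

$$
p(\theta \mid D, M) = \frac{\overbrace{p(\theta \mid M)}^{\text{prior}}\;\overbrace{p(D \mid M, \theta)}^{\text{likelihood}}}{\underbrace{p(D \mid M)}_{\text{marginal likelihood}}}
$$

Marginal likelihood:

$$
p(D \mid M) = \int_{\theta \in \Theta} p(\theta \mid M)\, p(D \mid M, \theta)\, d\theta
$$

(integrate/marginalise out θ)

Bayes factor:

$$
\frac{p(D \mid M_1)}{p(D \mid M_2)}
$$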


Species delimitation – BFD*

Variant of Bayesian model selection

Suppose there are two species assignments A and B

Estimate log marginal likelihood (ML_A) using stepping stone analysis of taxon sets A

Estimate log marginal likelihood (ML_B) of taxon sets B

The difference ∆ = 2(ML_A − ML_B) is twice the log Bayes factor.

∆ < 0: support for B

0 < ∆ < 2: barely worth mentioning

2 < ∆ < 6: substantial support for A

6 < ∆ < 10: strong support for A

10 < ∆: decisive

(Leaché et al. 2014, Syst. Biol.)
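The comparison reduces to simple arithmetic once the two log marginal likelihoods are available. The sketch below computes ∆ and looks up the verbal category from the scale above. The log marginal likelihood values are made up, the function names are illustrative, and assigning exact boundary values (∆ equal to 0, 2, 6 or 10) to the upper category is an assumption, since the scale uses strict inequalities.

```python
def bfd_delta(log_ml_a, log_ml_b):
    """Twice the difference in log marginal likelihood between assignments A and B."""
    return 2.0 * (log_ml_a - log_ml_b)

def interpret(delta):
    """Verbal category for delta, following the scale on the slide."""
    if delta < 0:
        return "support for B"
    if delta < 2:
        return "barely worth mentioning"
    if delta < 6:
        return "substantial support for A"
    if delta < 10:
        return "strong support for A"
    return "decisive support for A"

# Hypothetical stepping-stone estimates for two species assignments.
log_ml_a = -1234.5
log_ml_b = -1240.0
delta = bfd_delta(log_ml_a, log_ml_b)
print(f"delta = {delta:.1f}: {interpret(delta)}")
```

Working on the log scale throughout avoids exponentiating marginal likelihoods, which are far too small to represent directly.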


Stepping stone/path sampling analysis

Requires model-selection package in BEAST

Setting up and running an analysis:

Start BEAST AppStore, select PathSampler

or hack XML (see wiki or tutorial)

How many steps?

start with small number of steps, say 8

increase number of steps till ML estimate does not decrease any more

How long?

no hard and fast rule

ESS > 200 for each step is overkill

multiple (say 4) runs giving same estimate
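The idea behind the stepping stone estimate can be tried on a toy model where the answer is known. The sketch below uses one observation y ~ Normal(θ, 1) with prior θ ~ Normal(0, 1), for which the power posterior at each step can be sampled exactly and the true log marginal likelihood is available in closed form. Everything here is an assumption for illustration: the toy model, the evenly spaced steps and the function names. It is not the PathSampler implementation, where each step is an MCMC run.

```python
import math
import random

def log_likelihood(theta, y):
    """Log density of one observation y ~ Normal(theta, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2

def sample_power_posterior(beta, y, n, rng):
    """Exact draws from prior(theta) * likelihood(theta)^beta with prior Normal(0, 1)."""
    precision = 1.0 + beta
    mean = beta * y / precision
    sd = math.sqrt(1.0 / precision)
    return [rng.gauss(mean, sd) for _ in range(n)]

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def stepping_stone(y, n_steps=8, n_samples=5000, seed=3):
    """Sum of log ratios between consecutive power posteriors, from beta = 0 to beta = 1."""
    rng = random.Random(seed)
    betas = [k / n_steps for k in range(n_steps + 1)]
    log_ml = 0.0
    for b0, b1 in zip(betas[:-1], betas[1:]):
        thetas = sample_power_posterior(b0, y, n_samples, rng)
        # Ratio of normalising constants: E_b0[ likelihood^(b1 - b0) ].
        log_ml += log_mean_exp([(b1 - b0) * log_likelihood(t, y) for t in thetas])
    return log_ml

def exact_log_ml(y):
    """Marginally y ~ Normal(0, 2) under this toy model."""
    return -0.5 * math.log(2 * math.pi * 2.0) - y ** 2 / 4.0

y_obs = 1.0
estimate = stepping_stone(y_obs)
print(f"stepping stone estimate {estimate:.3f}, exact {exact_log_ml(y_obs):.3f}")
```

Rerunning with a different `seed` or `n_steps` mimics the advice above: compare several runs and increase the number of steps until the estimate stops changing.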


Future developments

Snapper – SNAPP but faster

Nested sampling for species delimitation

• 2 PhD candidates and Post-doc position available (see http://beast2.org) to work on multi species coalescent methods
