Species tree inference
Remco Bouckaert
University of Auckland, Max Planck Institute
Canberra, 2018
Some slides are based on material provided by David Bryant, Alexei Drummond, Joseph Heled, Paul Lewis
Species tree inference Canberra, 2018 1 / 30
Gene trees and species trees
Bi-allelic markers (SNPs and AFLPs)
SNAPP = SNP and AFLP Package for Phylogenetic Analysis
= multispecies coalescent without those pesky gene trees.
Assumptions: independent sites, only coalescent and mutation (no selection, migration, gene flow, ...)
one gene tree per site
integrates out gene trees
tree in units of substitutions
(Bryant et al., MBE 2012)
The coalescent
What is the probability that two individuals have the same parent?
Theoretical population genetics
Most of theoretical population genetics is based on the idealised Wright-Fisher model of a population, which assumes
Constant population size N
Discrete generations
Complete mixing
For the purposes of this presentation the population will be assumed to be haploid.
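Under these assumptions (haploid, constant size N, discrete generations, complete mixing) each individual picks its parent uniformly at random from the previous generation, so two individuals share a parent with probability 1/N, and the number of generations back to their common ancestor is geometric with mean N. The following is a minimal simulation sketch of that statement; the function names and the population size are illustrative choices, not part of the original slides.

```python
import random


def same_parent_fraction(pop_size, n_trials, rng):
    """Fraction of trials in which two individuals pick the same parent."""
    hits = 0
    for _ in range(n_trials):
        # Each individual chooses a parent uniformly from the previous generation.
        if rng.randrange(pop_size) == rng.randrange(pop_size):
            hits += 1
    return hits / n_trials


def generations_to_coalescence(pop_size, rng):
    """Generations back until two lineages first share a parent."""
    generations = 1
    while rng.randrange(pop_size) != rng.randrange(pop_size):
        generations += 1
    return generations


if __name__ == "__main__":
    rng = random.Random(42)
    n = 50  # illustrative population size
    print("simulated P(same parent):", same_parent_fraction(n, 200_000, rng))
    print("theoretical 1/N:         ", 1 / n)
    times = [generations_to_coalescence(n, rng) for _ in range(20_000)]
    print("mean generations to coalescence:", sum(times) / len(times))
```

The simulated fraction should sit close to 1/N and the mean waiting time close to N generations.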
The coalescent
Data: a small genetic sample from a large background population.
The coalescent
is a model of the ancestral relationships of a sample of individuals taken from a larger population.
describes a probability distribution on ancestral genealogies (trees) given a population history, N(t).
Therefore the coalescent can convert information from ancestral genealogies into information about population history and vice versa.
is a model of ancestral genealogies, not sequences, and its simplest form assumes neutral evolution.
can be thought of as the process in a SNAPP tree
Constant population size: N(t) = N0
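For a constant haploid population of size N0, the standard continuous-time approximation says that while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/(2 N0) per generation, which gives an expected time to the most recent common ancestor of 2 N0 (1 - 1/n) for n sampled lineages. A small sketch, with hypothetical function names and example values chosen only for illustration:

```python
import random


def simulate_tmrca(n_lineages, pop_size, rng):
    """Draw one time to the most recent common ancestor, in generations."""
    total = 0.0
    for k in range(n_lineages, 1, -1):
        # k lineages form k(k-1)/2 pairs, each coalescing at rate 1/N.
        rate = k * (k - 1) / (2.0 * pop_size)
        total += rng.expovariate(rate)
    return total


def expected_tmrca(n_lineages, pop_size):
    """Closed-form expectation: 2N(1 - 1/n)."""
    return 2.0 * pop_size * (1.0 - 1.0 / n_lineages)


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [simulate_tmrca(10, 1000, rng) for _ in range(20_000)]
    print("simulated mean TMRCA:", sum(draws) / len(draws))
    print("expected TMRCA:      ", expected_tmrca(10, 1000))
```

Most of the tree height comes from the last two lineages, which is one reason estimates near the root are the least certain.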
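Bayesian models: SNAPP likelihood, prior, posterior
Posterior:
\[
p(\theta \mid D, M) = \frac{\overbrace{p(\theta \mid M)}^{\text{prior}} \; \overbrace{p(D \mid M, \theta)}^{\text{likelihood}}}{\underbrace{p(D \mid M)}_{\text{marginal likelihood}}}
\]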
θ: parameters
T tree
u rate of red → green
v rate of green → red
c1..k coalescent rates, one for each of the k branches
λ birth rate
priors:
Yule prior for tree, with λ birthrate
priors on λ, u (with (u+v)/2 = uv) and c1..k
Bayesian Phylogenetics
The output of a Bayesian evolutionary analysis is a probability distribution on trees and parameter values.
For phylogenetics the tree topology is the object of interest. The substitution parameters and tree prior parameters are a nuisance that we average over using MCMC and then ignore.
For population genetics the tree and substitution parameters are a nuisance that we average over and then ignore, focusing instead on the population parameters.
Often a more specific hypothesis is of interest (like ‘Did this adaptive radiation predate the Miocene?’) and then the result of the analysis should be the testing of this hypothesis, averaged over all trees and parameter values, weighted by their probability given the data.
Tree space as a hilly landscape
The space of all possible trees can be visualized as a hilly landscape. Nearby points in this landscape represent similar trees, and the height of the landscape is the probability of the tree at that point.
This space can be sampled in a Bayesian analysis with MCMC
The peak can be identified by a search algorithm in the context of maximum likelihood
Markov chain Monte Carlo (MCMC) robot [courtesy of Paul O Lewis]
Pure Random Walk [courtesy of Paul O Lewis]
Proposal scheme:
random direction
gamma-distributed step length (mean 45 pixels, s.d. 40 pixels)
reflection at edges
Target distribution:
equal mixture of 3 bivariate normal hills
inner contours: 50%
outer contours: 95%
In this case the robot is accepting every step and 5000 steps are shown
Burn In
Robot is now following the rules and thus quickly finds one of the three hills.
Note that the first few steps are not at all representative of the distribution.
100 steps taken from starting point
Target Distribution Approximation [courtesy of Paul O Lewis]
How good is the MCMC approximation?
51.2% of points are inside inner contours (cf. 50% actual)
93.6% of points are inside outer contours (cf. 95% actual)
Approximation gets better the longer the chain is allowed to run.
5000 steps taken
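The robot experiment can be approximated with a few lines of code. The sketch below is not Paul Lewis's program: it is a plain Metropolis sampler that uses the proposal described earlier (random direction, gamma-distributed step length with mean 45 and s.d. 40) on an equal mixture of three bivariate normal hills. The hill positions, the hill width SIGMA and the starting point are assumptions made for illustration, and reflection at the edges is left out.

```python
import math
import random

# Assumed hill centres and width (in "pixels"); the slides do not give these.
HILLS = [(100.0, 100.0), (300.0, 120.0), (200.0, 300.0)]
SIGMA = 30.0


def target_density(x, y):
    """Unnormalised equal mixture of three isotropic bivariate normals."""
    return sum(
        math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * SIGMA ** 2))
        for cx, cy in HILLS
    )


def propose(x, y, rng):
    """Random direction, gamma step length with mean 45 and s.d. 40."""
    mean, sd = 45.0, 40.0
    shape = (mean / sd) ** 2   # shape = mean^2 / variance
    scale = sd ** 2 / mean     # scale = variance / mean
    step = rng.gammavariate(shape, scale)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return x + step * math.cos(angle), y + step * math.sin(angle)


def metropolis(n_steps, start, rng):
    """Metropolis sampler; the proposal is symmetric, so no Hastings term."""
    x, y = start
    samples = []
    for _ in range(n_steps):
        nx, ny = propose(x, y, rng)
        ratio = target_density(nx, ny) / target_density(x, y)
        if rng.random() < min(1.0, ratio):
            x, y = nx, ny  # accept; otherwise stay where we are
        samples.append((x, y))
    return samples


def fraction_inside_contour(samples, prob):
    """Share of samples within the `prob` contour of their nearest hill."""
    radius = SIGMA * math.sqrt(-2.0 * math.log(1.0 - prob))
    inside = sum(
        1 for x, y in samples
        if min(math.hypot(x - cx, y - cy) for cx, cy in HILLS) <= radius
    )
    return inside / len(samples)


if __name__ == "__main__":
    rng = random.Random(1)
    chain = metropolis(20_000, (0.0, 0.0), rng)[1_000:]  # drop burn-in
    print("inside 50% contours:", fraction_inside_contour(chain, 0.50))
    print("inside 95% contours:", fraction_inside_contour(chain, 0.95))
```

Discarding the first 1000 steps plays the role of burn-in; as on the slide, the two fractions should approach 50% and 95% as the chain gets longer.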
Setting up mutation rates
Do not use the defaults!
Estimate during MCMC
Estimate from data, µu = 1/(2π0), µv = 1/(2π1), and keep fixed
Include non-polymorphic sites if you can!
If sampling any, hyper priors need to be specified
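As a worked example of the "estimate from data" option, the sketch below reads the rule as u = 1/(2 π0) and v = 1/(2 π1), where π0 and π1 are the observed frequencies of the two alleles. The function name and the allele counts are hypothetical, and treating allele 0 as "red" and allele 1 as "green" is an assumption. With this choice the stationary frequencies v/(u+v) and u/(u+v) equal π0 and π1, and the constraint (u+v)/2 = uv holds.

```python
def mutation_rates_from_counts(n_allele0, n_allele1):
    """Return (u, v) from observed counts of allele 0 and allele 1."""
    if n_allele0 <= 0 or n_allele1 <= 0:
        raise ValueError("both alleles must be observed at least once")
    total = n_allele0 + n_allele1
    pi0 = n_allele0 / total
    pi1 = n_allele1 / total
    u = 1.0 / (2.0 * pi0)  # rate 0 -> 1
    v = 1.0 / (2.0 * pi1)  # rate 1 -> 0
    return u, v


if __name__ == "__main__":
    u, v = mutation_rates_from_counts(750, 250)  # made-up counts
    print("u =", u, "v =", v)
    print("stationary frequency of allele 0:", v / (u + v))
    print("(u + v) / 2 =", (u + v) / 2, " u * v =", u * v)
```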
Priors
Yule (pure birth) prior on species tree topology, birth rate λ
Gamma, Inverse Gamma, Uniform or CIR prior on pop sizes
If sampling any, hyper priors need to be specified
Ascertainment correction
SNP data is selected under various conditions
Non constant sites only
At least N different sites only, e.g., no constant and no singletons
Panels – different ascertainment within species.
Others...
This has considerable impact on pop sizes/tree height estimates
Active field of research/work in progress...
Missing data
SNAPP handles a site with missing data as if the lineage does not exist
Best not used without non-polymorphic sites
SNAPP needs at least one observation per species. If not, sites will be deleted.
Performance tuning
= tree likelihood calculation optimisation
Use threads – need to find optimal number for your data
MCMCMC – from BEASTLabs package
Start with a small number of lineages, increase until you run out of patience
Subsample lineages, not sites
(No BEAGLE support, none expected)
Species tree topology simulations
‘Easy tree’
3 trees in credibility set for ≤ 400 loci
1 tree in credibility set for ≥ 500 loci.
True tree always in credibility set
‘Hard tree’
2-3 trees in credibility set for ≤ 10000 loci
1 tree in credibility set for ≥ 100000 loci.
True tree always in credibility set
Species tree topology is recovered well
Finding the lineage sorting sweet spot
No knowledge about topology
‘Infinite’ population size estimates
Quite a bit of knowledge about ancestral population

Some knowledge about topology (like a standard phylogenetic analysis)
Poor population size estimates
Some knowledge about present-day population sizes
Some correlations to be aware of 1
[Figure: two scatter plots with log(tree height) on the x-axis; the y-axes are log(theta) and log(#constant sites)]
Some correlations to be aware of 2
[Figure: two scatter plots with log(#constant sites) on the y-axis; the x-axes are log(theta) and log(coalescent rate)]
Powers and limitations of SNAPP
SNAPP recovers
Topology
Coalescent times
Population sizes per branch
in order of decreasing accuracy
Not enough lineages ⇒ pop size samples from prior (happens often near root)
θ and coalescent time estimates are often more accurate at the bottom than the top of the tree (higher uncertainty when going back in time)
Detecting Anomalies
Consider two species
What happens when they are grouped as one species? Pop size increases
Where to look for cryptic species? Large pop size branches
Where to look for mislabeled lineages? Large pop size branches
Bayesian model selection: marginal likelihood
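Posterior:
\[
p(\theta \mid D, M) = \frac{\overbrace{p(\theta \mid M)}^{\text{prior}} \; \overbrace{p(D \mid M, \theta)}^{\text{likelihood}}}{\underbrace{p(D \mid M)}_{\text{marginal likelihood}}}
\]
Marginal likelihood (integrate/marginalise out θ):
\[
p(D \mid M) = \int_{\theta \in \Theta} p(\theta \mid M)\, p(D \mid M, \theta)\, d\theta
\]
Bayes factor:
\[
\frac{p(D \mid M_1)}{p(D \mid M_2)}
\]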
Species delimitation – BFD*
Variant of Bayesian model selection
Suppose there are two species assignments A and B
Estimate log marginal likelihood (MLA) using stepping stone analysis of taxon sets A
Estimate log marginal likelihood (MLB) of taxon sets B
The difference ∆ = 2(MLA − MLB) is twice the log Bayes factor.
∆ < 0: support for B
0 < ∆ < 2: barely worth mentioning
2 < ∆ < 6: substantial support for A
6 < ∆ < 10: strong support for A
10 < ∆: decisive
(Leaché et al. 2014, SysBio)
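The decision rule above is simple arithmetic on two log marginal likelihoods. A small sketch, with a hypothetical function name and made-up input values; how the exact boundary values (∆ = 0, 2, 6, 10) are assigned is an assumption, since the slide only gives open intervals.

```python
def bfd_support(log_ml_a, log_ml_b):
    """Return (delta, verdict) for species assignments A and B.

    delta = 2 * (log ML of A - log ML of B), following the scale above.
    """
    delta = 2.0 * (log_ml_a - log_ml_b)
    if delta < 0:
        verdict = "support for B"
    elif delta < 2:
        verdict = "barely worth mentioning"
    elif delta < 6:
        verdict = "substantial support for A"
    elif delta < 10:
        verdict = "strong support for A"
    else:
        verdict = "decisive"
    return delta, verdict


if __name__ == "__main__":
    # Made-up log marginal likelihoods, for illustration only.
    print(bfd_support(-1000.0, -1004.0))
    print(bfd_support(-1010.0, -1000.0))
```

Because the inputs are log marginal likelihoods, a difference of only a few log units already moves the verdict between categories, so the marginal likelihood estimates need to be stable across runs.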
Stepping stone/path sampling analysis
Requires model-selection package in BEAST
Setting up and running an analysis:
Start BEAST AppStore, select PathSampler
or hack XML (see wiki or tutorial)
How many steps?
start with small number of steps, say 8
increase number of steps until the ML estimate does not decrease any more
How long?
no hard and fast rule
ESS > 200 for each step is overkill
multiple (say 4) runs giving same estimate
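To show what a stepping stone estimate computes, here is a toy sketch. It is not the BEAST PathSampler: it uses a conjugate normal model (one observation y with unit variance, standard normal prior on θ) for which each power posterior can be sampled exactly and the marginal likelihood is known in closed form. The evenly spaced β values, the function names and the sample sizes are all assumptions made for illustration.

```python
import math
import random


def log_likelihood(theta, y):
    """log N(y | theta, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - theta) ** 2


def exact_log_ml(y):
    """Closed form for this toy model: y ~ N(0, variance 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - y ** 2 / 4.0


def stepping_stone_log_ml(y, n_steps, n_samples, rng):
    """Sum over steps of log E_{beta_{k-1}}[L^(beta_k - beta_{k-1})]."""
    betas = [k / n_steps for k in range(n_steps + 1)]
    log_ml = 0.0
    for b0, b1 in zip(betas, betas[1:]):
        # Power posterior at b0 is normal: precision 1 + b0, mean b0*y/(1 + b0).
        precision = 1.0 + b0
        mean = b0 * y / precision
        sd = 1.0 / math.sqrt(precision)
        terms = [
            (b1 - b0) * log_likelihood(rng.gauss(mean, sd), y)
            for _ in range(n_samples)
        ]
        top = max(terms)  # log-sum-exp for numerical stability
        log_ml += top + math.log(
            sum(math.exp(t - top) for t in terms) / n_samples
        )
    return log_ml


if __name__ == "__main__":
    rng = random.Random(7)
    print("stepping stone estimate:", stepping_stone_log_ml(1.5, 8, 5000, rng))
    print("exact log ML:           ", exact_log_ml(1.5))
```

In a real analysis each step is its own MCMC run rather than an exact draw, which is why the number of steps and the chain length per step both affect the estimate.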
Future developments
Snapper – SNAPP but faster
Nested sampling for species delimitation
2 PhD candidates and Post-doc position available (see http://beast2.org) to work on multispecies coalescent methods