Species and Gene Trees: History, Inference, and Visualization - Joseph Heled
australian-bioinformatics-network -
Introduction The Coalescent Bayesian Inference MCMC Visualisation Tree Shape Taxa Position
Pre Darwin phylogenetic trees
The Origin sole figure
The Cytochrome C Gene Tree (Fitch, 1967)
• Processes of speciation
• Evolution of traits
• Biogeography
• Epidemiology
• Co-Evolution (host/parasite)
• Domestication
Selecting a “Duck”
The Molecular Clock (early ’60s)
Models of Sequence Evolution
JC69 model (Jukes and Cantor, 1969)
The Kingman Coalescent (1982)
Wright-Fisher Population (1931)
• The individuals were randomly sampled from a population of size N.
• The parent of any individual is chosen uniformly at random from all potential parents.
The Coalescent
The larger the population, the longer (on average) you have to travel back in time to reach the common ancestor.
The Coalescent for multiple individuals
The waiting time for the first common ancestor of two individuals out of m (going backwards in time) is exponential with a rate of $\binom{m}{2}/N_e$.
$N_e$ is the Wright-Fisher effective population size.
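The waiting-time rule can be sketched in a few lines of Python. This is an illustrative simulation, not code from the talk: the function names are made up here, and time is assumed to be in whatever units make the rate for k lineages equal to $\binom{k}{2}/N_e$. Summing the expected waiting times from m lineages down to one gives an expected time to the most recent common ancestor of $2N_e(1 - 1/m)$, which the simulated mean should approach.

```python
import random
from math import comb

def coalescent_intervals(m, ne, rng):
    """Waiting times between successive coalescent events, starting from m lineages."""
    times = []
    for k in range(m, 1, -1):
        rate = comb(k, 2) / ne          # rate while k lineages remain
        times.append(rng.expovariate(rate))
    return times

def expected_tmrca(m, ne):
    """Sum of the mean waiting times, equal to 2 * ne * (1 - 1/m)."""
    return sum(ne / comb(k, 2) for k in range(2, m + 1))

rng = random.Random(1)                   # fixed seed so the run is repeatable
m, ne, replicates = 5, 1000.0, 20000
mean_tmrca = sum(sum(coalescent_intervals(m, ne, rng))
                 for _ in range(replicates)) / replicates
```

Doubling `ne` doubles every expected waiting time, which is the "larger population, longer wait" statement in numerical form.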
From Models to Inference
Bayes’ Theorem (a Reminder)
P(A ∧ B) = P(A)P(B |A)
Bayes’ Theorem (2)
P(B)P(A|B) = P(A ∧ B) = P(A)P(B |A)
P(A|B) = P(B |A)P(A) / P(B)
Bayesian Inference
Models (so far)
Substitution model: A stochastic process for the evolution (change) of genetic data (sequences) over time.
Clock model: How substitution rates change over time.
Coalescent model: A stochastic process for the ancestral relationship between a group of homologous sequences from several individuals.
Models (Math Notation)
Coalescent model: f (T |Ne)
Substitution model: f (G |T )
Where G is the gene (sequence data) and T is the ancestral relationships (tree).
The Biological Species Concept
The conventional definition of “a species” amongst evolutionary biologists is “a group of organisms whose members interbreed among themselves, but are separated from other groups by genetically-based barriers to gene flow.”
Jerry Coyne “Why Evolution is True” blog.
The Species “tree”
The Gene(s) tree
Species Tree Ancestral Reconstruction
Multiple Individuals from each Species
Multispecies coalescent – Kingman Coalescent per Species Tree Branch
Multiple Independent Loci
The Multispecies Posterior
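$$
\begin{aligned}
P(S \mid D) &= \int_g P(S, g \mid D) \\
&\propto \int_g P(D \mid S, g)\, P(S, g) \\
&= \int_g P(D \mid g)\, P(S, g) \\
&= \int_g P(D \mid g)\, P(g \mid S)\, P(S)
\end{aligned}
$$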
A Complex Posterior
$$
P(S \mid D) = \int_g f(S, g \mid D) = \int_g \frac{f(D \mid S, g)\, f(S, g)}{f(D)}
$$
Problem 1: f(D)
The prior probability of obtaining data D.
$$
P(S \mid D) = \int_g \frac{f(D \mid S, g)\, f(S, g)}{f(D)}
$$
We don’t know the value of f(D):
$$
f(D) = \int_{g,S} f(D \mid S, g)\, f(S, g)
$$
However, it is a constant.
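Why a constant f(D) is harmless can be made explicit with a short derivation. This is a sketch in the slide's notation; the second state (S', g') is introduced here purely for illustration. When two states of the joint posterior are compared, f(D) appears in both numerator and denominator and cancels:

```latex
\frac{f(S', g' \mid D)}{f(S, g \mid D)}
  = \frac{f(D \mid S', g')\, f(S', g') \,/\, f(D)}
         {f(D \mid S, g)\, f(S, g) \,/\, f(D)}
  = \frac{f(D \mid S', g')\, f(S', g')}
         {f(D \mid S, g)\, f(S, g)}
```

Any method that only needs ratios of posterior densities therefore never has to evaluate f(D).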
Problem 2: The Whole Damn Thing
The posterior is a distribution defined by a complex multidimensional integral.
Enter MCMC
Markov Chain Monte Carlo (MCMC) is a class of methods for stochastically sampling from probability distributions based on constructing a Markov chain.
Very short history of MCMC
1953: Metropolis algorithm published in Journal of Chemical Physics (Metropolis et al.)
1970: Hastings algorithms in Biometrika (Hastings)
1974: Gibbs sampler and Hammersley-Clifford theorem paper by Besag
1980s: Image analysis and spatial statistics enjoyed MCMC algorithms, not popular with others due to the lack of computing power
1995: Reversible jump algorithm in Biometrika (Green)
groundtruth.info/AstroStat/slog/2008/mcmc-historyo
MCMC in a nutshell
MCMC in a nutshell (2)
MCMC in a nutshell (3)
MCMC in a nutshell (4)
MCMC in a nutshell (5)
If, from either A or B, we propose to go to A or B with equal probability, then
[Figure: two states, A with height 2 and B with height 1]
Flow from A to B is 2/3 · 1/4 = 1/6, and from B to B is 1/3 · 1/2 = 1/6, 1/3 in total.
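The flow arithmetic can be checked exactly with a small sketch. It assumes the figure's heights 2 and 1 are the unnormalised target values f(A) and f(B), and that a proposed move from x to y is accepted with probability min(1, f(y)/f(x)), the usual Metropolis rule; the function name is illustrative only.

```python
from fractions import Fraction as F

# Assumed unnormalised target read off the figure: f(A) = 2, f(B) = 1.
f = {"A": F(2), "B": F(1)}
total = sum(f.values())
pi = {s: f[s] / total for s in f}        # normalised target: A = 2/3, B = 1/3

def metropolis_kernel(f):
    """Transition probabilities when every state is proposed with equal probability."""
    states = list(f)
    q = F(1, len(states))                # symmetric proposal: 1/2 each here
    kernel = {x: {y: F(0) for y in states} for x in states}
    for x in states:
        for y in states:
            if y != x:
                kernel[x][y] = q * min(F(1), f[y] / f[x])
        # Whatever is not moved elsewhere stays put (proposed self-move or rejection).
        kernel[x][x] = 1 - sum(kernel[x][y] for y in states if y != x)
    return kernel

K = metropolis_kernel(f)
flow_into_B = {x: pi[x] * K[x]["B"] for x in f}
```

`flow_into_B` holds the two terms from the slide, and their sum equals `pi["B"]`: one step of the chain leaves the target distribution unchanged.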
MCMC in a nutshell (6)
Hastings Ratio (x to y) = p(y → x) f(y) / (p(x → y) f(x))
So far we had p(y → x) = p(x → y), that is, the probability of going from x to y was equal to that of going back.
MCMC in a nutshell (7)
If at A we always propose to go to B, but from B we go to A or B with equal probability, that is,
Table: p(x → y)
      A     B
A     0     1
B    1/2   1/2
Then
HR(A → B) = (1/2 · 1) / (1 · 2) = 1/4,
and
HR(B → A) = 4.
MCMC in a nutshell (8)
[Figure: two states, A with height 2 and B with height 1]
Flow from B to A is 1/3 · 1/2 = 1/6 and from A to A is 2/3 · 3/4 = 1/2, 2/3 in total.
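The asymmetric case can be verified the same way. The sketch below uses the proposal table from the slide and assumes f(A) = 2, f(B) = 1 as in the figure; moves are accepted with probability min(1, Hastings ratio). Function and variable names are illustrative, not from the talk.

```python
from fractions import Fraction as F

f = {"A": F(2), "B": F(1)}                       # assumed unnormalised target
p = {"A": {"A": F(0), "B": F(1)},                # proposal table p(x -> y)
     "B": {"A": F(1, 2), "B": F(1, 2)}}

def hastings_ratio(x, y):
    """p(y -> x) f(y) / (p(x -> y) f(x)); only meaningful when p(x -> y) > 0."""
    return (p[y][x] * f[y]) / (p[x][y] * f[x])

def mh_kernel():
    """Metropolis-Hastings transition probabilities for the table above."""
    states = list(f)
    kernel = {x: {y: F(0) for y in states} for x in states}
    for x in states:
        for y in states:
            if y != x and p[x][y] > 0:
                kernel[x][y] = p[x][y] * min(F(1), hastings_ratio(x, y))
        # Remaining mass: proposed self-moves plus rejected proposals.
        kernel[x][x] = 1 - sum(kernel[x][y] for y in states if y != x)
    return kernel

K = mh_kernel()
pi = {s: f[s] / sum(f.values()) for s in f}      # A = 2/3, B = 1/3
flow_into_A = {x: pi[x] * K[x]["A"] for x in f}
```

Although A always proposes B, three quarters of those proposals are rejected, so the chain still spends two thirds of its time at A.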
Tree(s) Visualisation
Traditional Tree Visualization
[Figure: summary tree with tips sum, oax, Cyan, FLSJ, pot, arib, aria, ult, couc, woll, coast, con, int, ins; internal nodes labelled with support values between 0.38 and 1; scale bar 0.0090]
Traditional Tree Visualization (2)
Species Tree with Population Sizes
Species Tree with Gene Trees
Species Tree (Densitree)
The “Star Tree”
Species Tree (Densitree)
Taxa Order Matters (When Drawing Multiple Trees)
[Figure: three trees labelled 73%, 17% and 10%, drawn with tip orders 0 1 2 3 4 5 6, 0 1 5 6 2 3 4 and 2 3 4 5 6 0 1 respectively]
Some Orders are Better than Others
[Figure: the same three trees (73%, 17%, 10%) with tip orders 0 1 2 3 4 5 6, 0 1 5 6 2 3 4 and 2 3 4 5 6 0 1, plus a further tip order 2 3 4 0 1 5 6]
Disadvantages:
• Population size changes in one branch have a visual effect on other branches.
• Fails when trying to show the whole posterior à la DensiTree.
• No obvious way to extend for trees with constant population size per branch.
The Imperial AT-AT Tree
Provide some space between branches.
A double Act
Target species tree (blue) and *BEAST posterior summary (orange).
Species Tree with Constant Population Sizes
To extend to constant branches, we need a rule to place the bottom of the branch on top of the descendant branches. We use the proportion rule.
Position, Position, Position
A species tree specifies heights and widths. The challenge is to pick good X-axis positions.
The star tree builds the tree from the root towards the tips. Building from the tips towards the root is simpler when drawing species trees. When building from the tips, the descendants' X-positions determine the parent position.
Position, Position, Position
However, there are many ways to place nodes. Here are four of them:
Descendants Mean: Halfway between direct descendants.
Tips Mean: Average of all tips in the sub-tree.
Middle: Halfway between rightmost tip of left sub-tree and leftmost tip of right sub-tree.
Balanced by Population: At point minimizing the difference between branch bottom and top centers.
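The first three rules are simple enough to sketch directly. The code below is an illustration under assumptions that are not in the talk: trees are nested tuples of tip names, tips sit at x = 0, 1, 2, ... from left to right, and only the root position is reported. "Balanced by Population" is left out because it needs branch widths, which this toy representation does not carry.

```python
from itertools import count

def root_x(tree, method):
    """X position of the root of a nested-tuple binary tree.

    Tips are plain strings, placed left to right at x = 0, 1, 2, ...
    """
    next_tip = count()

    def visit(node):
        if not isinstance(node, tuple):          # a tip: take the next free slot
            x = float(next(next_tip))
            return x, [x]
        (lx, ltips), (rx, rtips) = visit(node[0]), visit(node[1])
        tips = ltips + rtips
        if method == "descendants_mean":         # halfway between the two children
            x = (lx + rx) / 2
        elif method == "tips_mean":              # average of all tips below the node
            x = sum(tips) / len(tips)
        elif method == "middle":                 # between the facing tips of the sub-trees
            x = (max(ltips) + min(rtips)) / 2
        else:
            raise ValueError(f"unknown method: {method}")
        return x, tips

    return visit(tree)[0]

caterpillar = ((("a", "b"), "c"), "d")           # fully unbalanced example
balanced = (("a", "b"), ("c", "d"))
methods = ("descendants_mean", "tips_mean", "middle")
caterpillar_x = {m: root_x(caterpillar, m) for m in methods}
balanced_x = {m: root_x(balanced, m) for m in methods}
```

On the balanced example all three rules put the root at the same place, while on the caterpillar they disagree, so the choice of rule only shows up once the tree is lopsided.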
Node Positioning
The methods are similar for balanced trees. The difference is in the handling of unbalanced trees.
[Figure panels: D-Mean, T-Mean, Middle, Balanced]
Node Positioning
[Figure panels: D-Mean, Balanced]
Species Trees Posterior
[Figure: posterior of species trees with tips aria, arib, woll, couc, pot, ult, coast, sum, ins, int, FLSJ, con, oax, Cyan; axis ticks 0 to 80 and 0 to 15]
Gene Trees Within Species Trees: Preliminary
Next we would like to draw the gene tree within the species tree.
Hurdle 1: Obtain a suitable gene tree.
The gene tree has to be compatible with the species tree. This is not a problem when drawing a specific MCMC state, but is a problem when using summary trees.
Tips Positioning
Hurdle 2: Branches with non-constant width complicate positioning of tips.
[Figure: four species-tree panels with X ranges −1 to 6, −2 to 10, −2 to 10 and −5 to 20; height axis 0.0 to 3.0 in each]
Tips Positioning (automatic)
The placing ensures that extrema points are at least ε apart (horizontally).
[Figure: species tree with automatically positioned tips; X axis −2 to 10, height axis 0.0 to 3.0]
Tips Positioning (automatic)
Even with Python, it is far from trivial to implement. Remember, we want a tight fit, and placing should work for all modes of internal positioning.
Basically, we build the tree from the bottom up by joining clades. The X position of extrema points for the clade is a linear function of the spacing between the sub-trees (where the spacing inside the two sub-trees is fixed). So each extremum point sets a lower limit on the spacing, and the largest is taken as the final separation.
The best way to pick ε for a tree still needs to be worked out.
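The "largest lower limit" step can be sketched as follows. This is a deliberately simplified illustration, not the talk's implementation: it assumes each sub-tree's facing boundary is already summarised as a mapping from height to x (in the sub-tree's own coordinates) and that extrema are compared at shared heights only. The function name and data layout are hypothetical.

```python
def clade_separation(left_right_edge, right_left_edge, eps):
    """Smallest rightward shift of the right sub-tree that keeps a gap of eps.

    left_right_edge:  height -> x of the left sub-tree's right boundary
    right_left_edge:  height -> x of the right sub-tree's left boundary
    Only heights present in both mappings are compared (an assumption).
    """
    shared = left_right_edge.keys() & right_left_edge.keys()
    if not shared:
        raise ValueError("the two sub-trees share no heights")
    # Each extremum demands shift >= left_x - right_x + eps; the largest demand wins.
    return max(left_right_edge[h] - right_left_edge[h] + eps for h in shared)

left = {0.0: 2.0, 1.0: 3.5, 2.0: 2.5}     # left clade bulges out at height 1
right = {0.0: 0.0, 1.0: -1.0, 2.0: 0.5}   # right clade bulges towards it there
shift = clade_separation(left, right, eps=0.25)
gaps = {h: right[h] + shift - left[h] for h in left}
```

After shifting, the gap equals ε at the tightest height and is larger everywhere else, which is the tight fit the slide asks for.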
Drawing the Gene Tree
Hurdle 3: A suitable policy for drawing the gene tree.
We reuse the ideas of the Star Tree. Given the position of an internal node, the left/right branches are drawn as straight lines towards the “middle” of the left/right sub-trees. But we still need to handle the species transitions.
From the bottom up, we (linearly) map the lineages leaving the branch to the top of the branch. The top of the clade is then put in the middle of the mapped taxa.
Visual Clutter
[Figure: “Contained Tree”: gene tree with tips g0 to g15 inside a species tree on species A to E; time axis 0.0 to 900.0; annotation “branches 300 generations”]
Gene Tree Inside Species Tree: As-Is
[Figure: gene tree drawn inside the species tree; tip order ID2 ID3 ID4 ID0 ID1 ID5 ID6; X axis 0 to 10, height axis 0.0 to 3.5]
Reducing Visual Clutter (1)
[Figure: example trees with tips a, b and x]
The size (number of tips) of the sub-tree (b,x) is 2, but the span (number of tips between leftmost and rightmost) is 3.
Reducing Visual Clutter (2)
For every sequential arrangement of the gene tree taxa we can get a rough measure of the amount of crossings,
$$
\sum_{n \in \text{Internal Nodes}} \bigl(\operatorname{span}(n) - \operatorname{size}(n)\bigr)
$$
size(n) is the number of taxa in the sub-tree. span(n) is the number of taxa in the group bounded by the leftmost and rightmost tips of the sub-tree. The difference is the excess taxa, the number of potential lineages that may need to cross out of the clade.
Note that valid arrangements depend on the orientation of the species tree, so optimization should be over both species ordering and gene tips arrangements compatible with that order. Since the number of arrangements is typically large, we resort to multiple tries of hill climbing. Number of tries might be fixed or bounded by time.
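A small sketch of the measure and of a restart-based hill climb is given below. It is illustrative only: trees are nested tuples of taxon names, the move is a random swap of two tips, and the requirement that arrangements stay compatible with the species tree is ignored here for brevity. All names are hypothetical.

```python
import random

def excess(tree, order):
    """Sum over internal nodes of span(n) - size(n) for a given tip order."""
    pos = {taxon: i for i, taxon in enumerate(order)}
    total = 0

    def visit(node):
        nonlocal total
        if not isinstance(node, tuple):              # a tip
            return [pos[node]]
        tips = [p for child in node for p in visit(child)]
        span = max(tips) - min(tips) + 1             # taxa between leftmost and rightmost tip
        total += span - len(tips)                    # excess taxa inside the clade's span
        return tips

    visit(tree)
    return total

def hill_climb(tree, order, tries, steps, rng):
    """Several random restarts; each keeps swaps that do not increase the measure."""
    best, best_score = list(order), excess(tree, order)
    for _ in range(tries):
        current = list(order)
        rng.shuffle(current)
        score = excess(tree, current)
        for _ in range(steps):
            i, j = rng.sample(range(len(current)), 2)
            current[i], current[j] = current[j], current[i]
            new_score = excess(tree, current)
            if new_score <= score:
                score = new_score
            else:                                    # worse: undo the swap
                current[i], current[j] = current[j], current[i]
        if score < best_score:
            best, best_score = list(current), score
    return best, best_score

tree = ("a", ("b", "x"))
best, best_score = hill_climb(tree, ["b", "a", "x"], tries=5, steps=50,
                              rng=random.Random(0))
```

With the order b, a, x the sub-tree (b,x) has size 2 and span 3, so the measure is 1; any order that keeps b and x adjacent brings it to 0. A real implementation would restrict moves to arrangements allowed by the species tree and could bound `tries` by elapsed time instead of a fixed count.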
Unresolved Conflicts
Optimized tree on the right.