Phylogenetics: What is a tree & how many are there? Principles of phylogenetic reconstruction.
Phylogenetics
What is a tree & how many are there?
Principles of phylogenetic reconstruction.
Special Issues
Rooting a tree
The Molecular Clock
Almost Clocks.
Trees – graphical & biological.
A graph is a set of vertices (nodes) {v1,..,vk} and a set of edges {e1=(vi1,vj1),..,en=(vin,vjn)}. Edges can be directed, in which case (vi,vj) is viewed as different from (vj,vi) (opposite direction), or undirected.
Nodes can be labelled or unlabelled. In phylogenies the leaves are labelled and the rest unlabelled.
The degree of a node is the number of edges it is part of. A leaf has degree 1.
A graph is connected if any two nodes have a path connecting them.
A tree is a connected graph without any cycles, i.e. there is exactly one path between any two nodes.
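[Figure: a tree on the nodes v1, v2, v3 and v4. The undirected edge between v1 and v2 is written (v1, v2); a directed edge between v2 and v4 is written (v2, v4) or (v4, v2), depending on its direction.]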
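The definitions above can be turned into a small check. This is a minimal sketch, not part of the original slides: the function names `is_tree` and `degree` and the example edge list are assumptions chosen for illustration. It uses the fact that a connected graph with k nodes and k-1 edges has no cycles.

```python
from collections import deque


def is_tree(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is a tree."""
    nodes = list(nodes)
    # A tree with k nodes has exactly k-1 edges.
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency = {v: [] for v in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Breadth-first search from an arbitrary node to test connectedness.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(nodes)


def degree(node, edges):
    """Number of edges the node is part of."""
    return sum(node in edge for edge in edges)


# Hypothetical example: v2 is an internal node, v1, v3 and v4 are leaves.
example_edges = [("v1", "v2"), ("v2", "v3"), ("v2", "v4")]
print(is_tree(["v1", "v2", "v3", "v4"], example_edges))
print(degree("v2", example_edges))
```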
Trees & phylogenies.
A tree with k nodes has k-1 edges (easy to show by induction).
A root is a special node with degree 2 that is interpreted as the point furthest back in time. The leaves are interpreted as being contemporary.
A root introduces a time direction in a tree.
A rooted tree is said to be bifurcating if every node other than the leaves and the root has degree 3, corresponding to 1 ancestor and 2 children. An unrooted tree with this property is said to have valency 3.
Edges can be labelled with a positive real number interpreted as time duration or amount of evolution.
If the length of the path from the root to any leaf is the same, the tree obeys a molecular clock.
Tree Topology: Discrete structure – phylogeny without branch lengths.
[Figure: a rooted tree with the root, the internal nodes and the leaves labelled.]
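Enumerating Trees: Unrooted & valency 3
[Figure: the single unrooted tree on leaves 1-3, the 3 trees obtained by attaching leaf 4 to one of its edges, and the trees obtained by attaching leaf 5 to one of the 5 edges of a 4-leaf tree.]
Number of unrooted trees with n leaves:
T_n = \prod_{j=3}^{n-1} (2j-3) = \frac{(2n-5)!}{(n-3)! \, 2^{n-3}}
n:    4   5    6     7     8       9          10         15          20
T_n:  3   15   105   945   10395   1.4*10^5   2.0*10^6   7.9*10^12   2.2*10^20
Recursion: T_n = (2n-5) T_{n-1}. Initialisation: T_1 = T_2 = T_3 = 1.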
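The recursion and the closed form can be checked against each other numerically. A minimal sketch; the function names are assumptions, and the closed form used here is the double-factorial identity (2n-5)!/((n-3)! 2^(n-3)), valid for n >= 3.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted valency-3 trees on n labelled leaves, by the recursion."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # T_1 = T_2 = T_3 = 1
    for m in range(4, n + 1):
        count *= 2 * m - 5  # T_m = (2m - 5) * T_{m-1}
    return count


def unrooted_trees_closed_form(n):
    """Same count from the closed form; valid for n >= 3."""
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(n, unrooted_trees(n))
```

The factor (2n-5) is the number of edges of a tree with n-1 leaves, each of which is a possible attachment point for the new leaf.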
Local operations on trees.
Nearest Neighbor Interchange
Subtree cut and regrafting (subtree root kept)
Subtree cut and regrafting (subtree root possibly new)
[Figure: a tree with the four subtrees A, B, C and D, shown before and after a rearrangement.]
Central Principles of Phylogeny Reconstruction
Parsimony
Distance
Likelihood
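[Figure: the same four sequences s1-s4 (TTCAGT, TCCAGT, GCCAAT, GCCAAT) analysed under each principle. Parsimony: the substitutions are counted along the branches, total weight 4. Distance: a tree with branch lengths fitted to the pairwise distances. Likelihood: a tree with parameter estimates, L = 3.1*10^-7.]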
Distance Concepts on Trees I
A: Metric, d( , ):
i: d(a,b) = 0 <=> a = b
ii: d(a,b) = d(b,a)
iii: d(a,b) <= d(a,c) + d(c,b)
[Figure: three points a, b and c illustrating the triangle inequality.]
Tree Metric (the distance function originates from a tree):
d(x,y) + d(z,w) = d(x,z) + d(y,w) > d(x,w) + d(y,z), where x,y,z,w is a permutation of a,b,c,d.
(> implies that no branch has length 0)
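The tree-metric condition can be tested directly on a distance matrix. A minimal sketch under stated assumptions: the helper names and the example distances are hypothetical, and the check uses the non-strict form of the condition (the two largest of the three pair sums are equal), which also admits branches of length 0.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(a, b): distance} into a symmetric dict-of-dicts lookup."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


def satisfies_four_point(d, leaves, tol=1e-9):
    """True if every quartet has its two largest pair sums equal."""
    for x, y, z, w in combinations(leaves, 4):
        sums = sorted([d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical distances read off the tree ((a:1, b:2):1, (c:3, d:4)).
tree_like = symmetric({("a", "b"): 3, ("a", "c"): 5, ("a", "d"): 6,
                       ("b", "c"): 6, ("b", "d"): 7, ("c", "d"): 7})
print(satisfies_four_point(tree_like, "abcd"))
```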
Distance Concepts on Trees II
[Figure: an unrooted tree on s1-s4, and the three leaves s1, s2 and s3 joined at an internal node i.]
Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2
Ultrametric (the distance function originates from a tree):
d(x,y) = d(x,z) > d(y,z), where x,y,z is a permutation of a,b,c.
(> implies that no branch has length 0)
Distance Concepts on Trees III
[Figure: a rooted tree on s1, s2 and s3 in which i is the most recent common ancestor of s1 and s2.]
Reconstruction Principle: d(s1,i) = d(s1,s2)/2
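The ultrametric condition and the reconstruction principle can be sketched in a few lines. The function names and the example distances are assumptions made for illustration; the check uses the non-strict three-point condition (the two largest distances in every triple are equal).

```python
from itertools import combinations


def is_ultrametric(d, leaves, tol=1e-9):
    """True if in every triple of leaves the two largest distances are equal."""
    for a, b, c in combinations(leaves, 3):
        dists = sorted([d[frozenset((a, b))], d[frozenset((a, c))], d[frozenset((b, c))]])
        if abs(dists[2] - dists[1]) > tol:
            return False
    return True


def ancestor_distance(d, s1, s2):
    """Distance from s1 (or s2) to their common ancestor i under a clock."""
    return d[frozenset((s1, s2))] / 2


# Hypothetical clock-like distances: s1 and s2 are closest, s3 is equally far from both.
d = {frozenset(("s1", "s2")): 2.0,
     frozenset(("s1", "s3")): 6.0,
     frozenset(("s2", "s3")): 6.0}
print(is_ultrametric(d, ["s1", "s2", "s3"]))
print(ancestor_distance(d, "s1", "s2"))
```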
Unweighted Pair-Group Method with Arithmetic Mean (UPGMA; Sokal and Michener, 1958)
Input: Matrix D of pairwise distances between sequences.
1: Find the smallest distance, d_{i,j}.
2: i and j are now siblings, each at distance d_{i,j}/2 from their MRCA (i,j).
3: Form a new distance matrix of dimension (n-1)*(n-1) in which i and j have been replaced by (i,j). All distances to (i,j) are d_{k,(i,j)} = (d_{k,i} + d_{k,j})/2.
4: This is done n-1 times, after which the tree has been reconstructed.
Output: An ultrametric.
Comment: If UPGMA is given an ultrametric, it will reconstruct the same ultrametric.
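The steps above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the nested-tuple tree representation and the example distances are assumptions. One deliberate difference from step 3 as written: the sketch averages distances weighted by cluster size, which is the usual definition of UPGMA and reduces to the plain average (d_{k,i} + d_{k,j})/2 when i and j are single leaves.

```python
from itertools import combinations


def upgma(labels, d):
    """Cluster labels by UPGMA.

    d maps pairs (a, b) of leaf labels to distances.
    Returns the root as nested tuples and a dict of node heights.
    """
    size = {label: 1 for label in labels}        # cluster -> number of leaves
    height = {label: 0.0 for label in labels}    # cluster -> distance to its leaves
    dist = {frozenset(pair): value for pair, value in d.items()}
    while len(size) > 1:
        # Step 1: find the closest pair of clusters.
        i, j = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        merged = (i, j)
        # Step 2: the MRCA sits at half the distance between i and j.
        height[merged] = dist[frozenset((i, j))] / 2
        # Step 3: distances from every other cluster to the merged cluster.
        for k in size:
            if k not in (i, j):
                dist[frozenset((merged, k))] = (
                    size[i] * dist[frozenset((i, k))]
                    + size[j] * dist[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[merged] = size.pop(i) + size.pop(j)
    (root,) = size
    return root, height


# Hypothetical ultrametric input.
d = {("s1", "s2"): 2.0, ("s3", "s4"): 4.0,
     ("s1", "s3"): 10.0, ("s1", "s4"): 10.0,
     ("s2", "s3"): 10.0, ("s2", "s4"): 10.0}
root, height = upgma(["s1", "s2", "s3", "s4"], d)
print(root, height[root])
```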
Assignment to internal nodes: The simple way.
[Figure: a tree whose leaves carry the nucleotides C, A, C, C, A, C, T, G and whose internal nodes are marked "?".]
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.
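Cost of a history - minimizing over internal states
[Figure: a node in state G with a left and a right subtree; each node carries one cost for each of the nucleotides A, C, G, T. One term of the left minimum is d(C,G) + w_C(left subtree).]
w_G(subtree) = \min_{N \in Nucleotides} \{ d(G,N) + w_N(left subtree) \} + \min_{N \in Nucleotides} \{ d(G,N) + w_N(right subtree) \}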
Cost of a history – leaves (initialisation).
[Figure: two leaves, G and A, each with empty subtrees of cost 0.]
Initialisation at the leaves: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.
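Fitch-Hartigan-Sankoff Algorithm
Costs: Transition 2, Transversion 5.

             (A,C,G,T)
             (9,7,7,7)
              /     \
             /       \
      (A,C,G,T)       \
     (10,2,10,2)       \
       /     \          \
 (A,C,G,T) (A,C,G,T) (A,C,G,T)
  * 0 * *   * * * 0   * * 0 *

Each entry is the cost of the cheapest tree hanging from this node given the corresponding nucleotide at this node; for example, the second entry is the cost given there is a “C” at this node. At the leaves, * marks infinite cost.
[Figure: the four nucleotides A, C, G and T, with transitions (A<->G, C<->T) and transversions marked.]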
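The recursion can be written compactly and checked against the example above (leaves C, T and G, transition cost 2, transversion cost 5). A minimal sketch: the function names and the nested-tuple tree encoding are assumptions made for illustration.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def substitution_cost(x, y, transition=2, transversion=5):
    """Symmetric distance d(x, y) between two nucleotides."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def sankoff(tree):
    """Return {nucleotide: cost of the cheapest subtree given that nucleotide}.

    A tree is either a leaf nucleotide (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == tree else float("inf")) for n in NUCLEOTIDES}
    left, right = (sankoff(child) for child in tree)
    return {
        g: min(substitution_cost(g, n) + left[n] for n in NUCLEOTIDES)
        + min(substitution_cost(g, n) + right[n] for n in NUCLEOTIDES)
        for g in NUCLEOTIDES
    }


inner = sankoff(("C", "T"))
root = sankoff((("C", "T"), "G"))
print([inner[n] for n in NUCLEOTIDES])
print([root[n] for n in NUCLEOTIDES])
```

The minimum entry at the root is the parsimony weight of the tree; tracing back which nucleotides achieved each minimum gives an optimal assignment to the internal nodes.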
5S RNA Alignment & Phylogeny (Hein, 1990)
10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
 9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
 3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
 4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
 8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
 6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-
[Figure: the phylogeny of the 18 aligned 5S RNA sequences, with groups labelled Fungi, Animals, Mitochondria, Plants and Prokaryotes.]
Costs: transitions 2, transversions 5.
Total weight 843.
The Felsenstein Zone (Felsenstein-Cavender, 1979)
[Figure: the true tree on s1-s4 next to the reconstructed tree, in which the sequences are grouped differently.]
Patterns (16, only 8 shown):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1
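Bootstrapping (Felsenstein, 1985)
[Figure: an alignment of four sequences with 11 columns (each row shown as ATCTGTAGTCT) is resampled column by column with replacement. The counts 1 0 2 3 0 1 0 1 2 0 1 give the number of times each original column appears in the resampled alignment, from which a tree on the leaves 1-4 is reconstructed.]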
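Column resampling is short enough to sketch directly. The function name and the choice of example are assumptions: the four 6-column sequences are the ones used earlier in these slides, and an explicit `random.Random` instance is passed in so that a run can be repeated.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    Returns the resampled alignment and, for each original column,
    the number of times it was drawn.
    """
    length = len(alignment[0])
    drawn = [rng.randrange(length) for _ in range(length)]
    counts = [drawn.count(column) for column in range(length)]
    replicate = ["".join(sequence[column] for column in drawn) for sequence in alignment]
    return replicate, counts


alignment = ["TTCAGT", "TCCAGT", "GCCAAT", "GCCAAT"]
replicate, counts = bootstrap_alignment(alignment, random.Random(1985))
print(counts)
print(replicate)
```

A tree is then reconstructed from each replicate, and the fraction of replicates containing a given grouping is reported as its bootstrap support.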
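Probability of leaf observations - summing over internal states
[Figure: a node in state G with a left and a right subtree; each node carries one probability for each of the nucleotides A, C, G, T. One term of the left sum is P(G -> C) * P_C(left subtree).]
P_G(subtree) = \left[ \sum_{N \in Nucleotides} P(G \to N) \, P_N(left subtree) \right] \cdot \left[ \sum_{N \in Nucleotides} P(G \to N) \, P_N(right subtree) \right]
Initialisation: P_G(leaf) = 1 if the leaf is G, and 0 otherwise.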
[Figure: maximum-likelihood trees for the five sequences s1-s5, estimated with a clock and without a clock, with branch lengths and their standard deviations.]
Likelihood with clock: 7.9*10^-14 (parameter estimates 0.31 ± 0.1 and 0.18 ± 0.1).
Likelihood without clock: 6.2*10^-12 (parameter estimates 0.34 ± 0.1 and 0.16 ± 0.1).
2[ln(6.2*10^-12) - ln(7.9*10^-14)] is approximately χ²-distributed with (n-2) degrees of freedom.
Output from Likelihood Method
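The clock test can be worked through with the two likelihoods reported above. A minimal sketch: it uses the usual likelihood-ratio statistic (twice the log-likelihood difference), takes n = 5 because the example has five sequences, and hard-codes the 5% critical value of the χ² distribution with 3 degrees of freedom (7.815) instead of calling a statistics library. The function name is an assumption.

```python
from math import log


def clock_lrt_statistic(lik_clock, lik_no_clock):
    """Likelihood-ratio statistic for clock (null) against no clock (alternative)."""
    return 2.0 * (log(lik_no_clock) - log(lik_clock))


statistic = clock_lrt_statistic(7.9e-14, 6.2e-12)
degrees_of_freedom = 5 - 2           # n - 2 with n = 5 sequences
CRITICAL_5_PERCENT_DF3 = 7.815       # chi-square, 3 degrees of freedom, 5% level
reject_clock = statistic > CRITICAL_5_PERCENT_DF3
print(round(statistic, 2), degrees_of_freedom, reject_clock)
```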
First noted by Zuckerkandl & Pauling (1964) as an empirical fact.
How can one detect it?
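[Figure: two cases. Known ancestor time: s1 and s2 descend from an ancestor a at known time T. Unknown ancestor time: s1 and s2 descend from an unknown ancestor (?), with a third sequence s3 branching off earlier.]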
The Molecular Clock
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea, prokaryotes and eukaryotes be rooted?
[Figure: unrooted trees joining A, P and E for LDH and for MDH, and a combined tree in which the LDH subtree (P, A, E) and the MDH subtree (P, A, E) root each other.]
Rooting the 3 kingdoms
Purpose: 1) To give a time direction in the phylogeny & the most ancient point. 2) To be able to define concepts such as a monophyletic group.
Methods: 1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data set.
2) Midpoint: Find midpoint of longest path in tree.
3) Assume Molecular Clock.
Rootings
(Illustration of Langley-Fitch)
[Figure: an unrooted tree on s1, s2 and s3 with branch lengths l1, l2 and l3, and the corresponding rooted clock tree, where {l1 = l2 < l3}.]
Given the root: (2k-3)-(k-1) = (k-2) degrees of freedom are lost in imposing a clock.
Assumptions
1. Ancestral sequences are observable.
2. The number of events on a branch is Poisson distributed with a mean proportional to the branch length, with the same proportionality constant for all branches.
3. The observed difference between the sequences at two neighbouring nodes is the actual number of events.
[Figure: two sets of sequences on the same tree: sequences 1 (s1, s2, s3) with branch lengths l1, l2, l3 and sequences 2 (s1', s2', s3') with branch lengths c*l1, c*l2, c*l3.]
Number of parameters (k sequences, s species): s(2k-3), s(k-1), (2k-3)+s, s+(k-1).
The generation/year-time clock
I Smoothing a non-clock tree onto a clock tree (Sanderson).
II Rate of Evolution of the Rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by being multiplied by a gamma-distributed random variable.
Almost Clocks: M.J. Sanderson (1997) “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy” Mol.Biol.Evol. 14(12).1218-31; J.L. Thorne et al. (1998) “Estimating the Rate of Evolution of the Rate of Evolution” Mol.Biol.Evol. 15(12).1647-57; J.P. Huelsenbeck et al. (2000) “A Compound Poisson Process for Relaxing the Molecular Clock” Genetics 154.1879-92.
Non-contemporaneous leaves: A. Rambaut (2000) “Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies” Bioinformatics 16(4).395-399.
In the presence of recombination and gene conversion, the relationship among sequences might not be describable by a phylogeny.
Recombination and the Molecular Clock I
Common practice: I Finding “the phylogeny” anyway. II Testing for the molecular clock.
What are the consequences of this practice? I Simulate data with a model including recombination. II Reconstruct the phylogeny. III Test for a clock.
Recombination and the Molecular Clock II
Schierup & Hein (2000): Recombination and the Molecular Clock. Mol.Biol.Evol. 17(10).1578-79 + Schierup & Hein (2000): Consequences of Recombination on Traditional Phylogenetic Analysis. Genetics 156.879-91.
History of Phylogenetic Methods
1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973 Sankoff treats alignment and phylogenies as one general problem – phylogenetic alignment.
1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.
1981 Felsenstein: Maximum Likelihood model & program DNAML (in the PHYLIP program package).
1981 The parsimony tree problem is shown to be NP-complete.
1985 Felsenstein introduces bootstrapping as a confidence interval on phylogenies.
1986 Bandelt and Dress introduce split decomposition as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombination in phylogenies.
1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
2001- Major rise in the interest in phylogenetic statistical alignment.
Books:
Molecular Systematics (1996) (eds. Hillis, Moritz and Mable)
New Uses for Phylogenies (1996) (eds. P. Harvey)
W. Maddison and D. Maddison: MacClade
Semple & Steel (2003): Phylogenetics, OUP
Journals:
Molecular Biology and Evolution
J. Molecular Evolution
Molecular Phylogenetics
Systematic Biology
J. of Classification
www-pages:
PAUP – probably the best package for phylogenetic analysis available. David Swofford. http://www.lms.si.edu/PAUP/about.html
MacClade – W. & D. Maddison. http://phylogeny.arizona.edu/macclade/macclade.html
PHYLIP – J. Felsenstein. http://depts.washington.edu/genetics/faculty/felsenstein.html
PAML – Z. Yang. http://abacus.gene.ucl.ac.uk/
Phylogeny: literature, www and packages.
Global Fit Methods
1: Error function: E = \sum_{i<j} w_{i,j} (d_{i,j} - p_{i,j})^a
2: Minimisation has two parts: topology & branch lengths. Try all topologies and solve the branch-length problem for each.
3: A_{(i,j),k} is an (n(n-1)/2) x (2n-3) matrix with 1 if k is an edge on the path from i to j, 0 otherwise.
4: The path length between i & j, p_{i,j}, in the given topology is given by p_{i,j} = \sum_k A_{(i,j),k} s_k.
5: If w_{i,j} = 1 and a = 2, this can be solved by linear algebra: minimise \sum_{i<j} (d_{i,j} - \sum_k A_{(i,j),k} s_k)^2.
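The path matrix and the error function can be made concrete for one small topology. A minimal sketch: the quartet tree ((1,2),(3,4)), its edge numbering, the branch lengths and all function names are assumptions made for illustration, and only the evaluation of the error for given branch lengths is shown, not the minimisation.

```python
from itertools import combinations


def path_matrix(leaves, paths, n_edges):
    """Row (i, j), column k: 1 if edge k lies on the path from i to j, else 0."""
    return {
        pair: [1 if k in paths[pair] else 0 for k in range(n_edges)]
        for pair in combinations(leaves, 2)
    }


def path_lengths(A, s):
    """p_{i,j} = sum over k of A_{(i,j),k} * s_k."""
    return {pair: sum(a * s_k for a, s_k in zip(row, s)) for pair, row in A.items()}


def fit_error(d, p, a=2, w=None):
    """Sum over pairs of w_{i,j} * (d_{i,j} - p_{i,j})^a, with w = 1 by default."""
    return sum((w[pair] if w else 1) * (d[pair] - p[pair]) ** a for pair in d)


# Quartet ((1,2),(3,4)): edges 0-3 lead to leaves 1-4, edge 4 is the internal edge.
leaves = [1, 2, 3, 4]
paths = {(1, 2): {0, 1}, (1, 3): {0, 4, 2}, (1, 4): {0, 4, 3},
         (2, 3): {1, 4, 2}, (2, 4): {1, 4, 3}, (3, 4): {2, 3}}
A = path_matrix(leaves, paths, n_edges=5)   # 6 rows, 5 columns
p = path_lengths(A, s=[1, 2, 3, 4, 1])
print(p)
```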
Neighbor Joining (Saitou and Nei, 1987)
Input: Distance matrix D.
1: For each leaf i, the total distance to the others is calculated: r_i = d_{i,1} + d_{i,2} + ... + d_{i,n}.
2: The rate-corrected distance matrix, M, is constructed: m_{i,j} = d_{i,j} - (r_i + r_j)/(n-2). Only the minimal m_{i,j} is necessary.
3: Make an ancestral node, u, of the pair i & j giving the minimal m_{i,j}. The new branch lengths are defined by s_{i,u} = d_{i,j}/2 + (r_i - r_j)/[2*(n-2)] and s_{j,u} = d_{i,j} - s_{i,u}.
4: The distances from u to the others are set to d_{k,u} = (d_{i,k} + d_{j,k} - d_{i,j})/2.
Do this n-2 times.
Alternative characterisation of the method: Start with the best least-squares fit of a tree with k internal nodes (k<n), and add the internal branch that gives the largest improvement in the least-squares fit (now k+1 nodes). This is continued until the whole tree has been built (k-1 internal nodes have been added).
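One round of the procedure can be sketched as follows. This is an illustrative version, not the authors' code: the function name and the example matrix are assumptions, and r_i is taken to be the sum of the distances from i to all other leaves, which is what the (n-2) denominators in steps 2 and 3 require.

```python
def neighbor_joining_step(labels, d):
    """Pick the pair to join and return it with its branch lengths and new distances."""
    n = len(labels)
    # Step 1: total distance from each leaf to all others.
    r = {i: sum(d[i][k] for k in labels if k != i) for i in labels}
    # Step 2: find the pair with minimal rate-corrected distance m_{i,j}.
    best = None
    for a in range(n):
        for b in range(a + 1, n):
            i, j = labels[a], labels[b]
            m = d[i][j] - (r[i] + r[j]) / (n - 2)
            if best is None or m < best[0]:
                best = (m, i, j)
    _, i, j = best
    # Step 3: branch lengths from i and j to their new ancestral node u.
    s_iu = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_ju = d[i][j] - s_iu
    # Step 4: distances from u to the remaining leaves.
    d_u = {k: (d[i][k] + d[j][k] - d[i][j]) / 2 for k in labels if k not in (i, j)}
    return (i, j), (s_iu, s_ju), d_u


# Hypothetical distances read off the tree ((A:1, B:2):1, (C:3, D:4)).
labels = ["A", "B", "C", "D"]
d = {
    "A": {"B": 3, "C": 5, "D": 6},
    "B": {"A": 3, "C": 6, "D": 7},
    "C": {"A": 5, "B": 6, "D": 7},
    "D": {"A": 6, "B": 7, "C": 7},
}
pair, branches, d_u = neighbor_joining_step(labels, d)
print(pair, branches, d_u)
```

On this tree-like input the recovered branch lengths to A and B match the ones the distances were generated from. Repeating the step on the reduced matrix (with u replacing i and j) builds the rest of the tree.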
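Ø = a low upper bound on the weight of the tree, possibly the weight of a well-guessed tree.
W(n) = the weight of the tree at node n.
R(n) = a high lower bound on the increase in weight from adding the rest of the sequences.
Condition for bounding: W(n) + R(n) >= Ø.
[Figure: a search-tree example with the values 97, 7 and 102.]
How is R(n) computed?
[Figure: example nucleotides A T C G A C G G T C G G, with one position marked *.]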
Branch and Bound Algorithm
I. Bootstrapping columns in the alignment.
Example: Human, Chimp, Gorilla & Orangutan with root. Positions 1 to 12 and the final position (586) are shown.
H     T C T G A C G T T T G A ... C
C     T C T G A C G G T T G A ... C
G     T C T G A C G G T T G A ... C
O     T C A G A C G G T C G A ... C
root  T C A G A C G T A A G A ... C
15 possible trees, only 3 of relevance:
Tree             I. Bootstrap probability   II. Difference in likelihood
(((H,C),G),O)    0.80                       0.0
(((H,G),C),O)    0.09                       -16.63 (s.d. 14.22)
(((C,G),H),O)    0.11                       -15.12 (s.d. 13.95)
Tree topology comparison.