Cao et al. (2000) Gene Phylogeny of Mammals a good example where molecular sequences have led to a...
-
Upload
katelyn-mowles -
Category
Documents
-
view
219 -
download
3
Transcript of Cao et al. (2000) Gene Phylogeny of Mammals a good example where molecular sequences have led to a...
Cao et al. (2000) Gene
Phylogeny of Mammals
a good example where molecular sequences have led to a big improvement of our understanding of evolution.
rRNA structure is conserved in evolution.
The sequence changes slowly. Therefore it can be used to tell us about the earliest branches in the tree of life.
69 Mammals with complete motochondrial genomes.
Used two models simulatneously
Total of 3571 sites
= 1637 single sites
+ 967 pairs
Hudelot et al. 2003
Afrotheria / Laurasiatheria
Phylogenetic Methods
• Distances
• Clustering Methods
• Likelihood Methods
• Parsimony
Recommended books:
P. G. Higgs and T. K. Attwood – Bioinformatics and Molecular Evolution
R. Page and A. Holmes – Molecular evolution: a phylogenetic approach
W. H. Li – Molecular evolution
Part of sequence alignment of Mitochondrial Small Sub-Unit rRNA
Full gene is length ~950
11 Primate species with mouse as outgroup
Mouse : Lemur : Tarsier : SakiMonkey : Marmoset : Baboon : Gibbon : Orangutan : Gorilla : PygmyChimp : Chimp : Human :
* 20 * 40 * 60 * CUCACCAUCUCUUGCUAAUUCAGCCUAUAUACCGCCAUCUUCAGCAAACCCUAAAAAGG-UAUUAAAGUAAGCAAAAGACUCACCACUUCUUGCUAAUUCAACUUAUAUACCGCCAUCCCCAGCAAACCCUAUUAAGGCCC-CAAAGUAAGCAAAAACCUUACCACCUCUUGCUAAUUCAGUCUAUAUACCGCCAUCUUCAGCAAACCCUAAUAAAGGUUUUAAAGUAAGCACAAGUCUUACCACCUCUUGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCUA-UAAUGACAGUAAAGUAAGCACAAGUCUCACCACGUCUAGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCCU-UAAUGAUUGUAAAGUAAGCAGAAGUCCCACCCUCUCUUGCU----UAGUCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACGAAGUGAGCGCAAAUCUCACCAUCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACAAAGGCUAUAAAGUAAGCACAAACCUCACCACCCCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCCACGAAGUAAGCGCAAACCUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACGAAGGCCACAAAGUAAGCACAAGUCUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGUCUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGUCUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACAAAGUAAGCGCAAGUCucACC cuCUuGCu cAgccUaUAUACCGCCAUCuuCAGCAAACcCu A G aAAGUaAGC AA
: 78 : 78 : 79 : 76 : 76 : 75 : 75 : 75 : 75 : 75 : 75 : 75
From alignment construct pairwise distances.
Species 1: AAGTCTTAGCGCGATSpecies 2: ACGTCGTATCGCGAT
* * * D = 3/15 = 0.2
D = fraction of differences between sequences
BUT - D is not an additive distance,
D does not increase linearly with time.
G
CA
2 substitutions happened - only 1 is visible
G
AA
2 substitutions happened - nothing visible
Models of Sequence Evolution
Pij(t) = probability of being in state j at time t
given that ancestor was in state i at time 0.
States label bases A,C,G & Ti
t
jkjk
ikij rP
dt
dP
rij is the rate of substitution from state i to state j
Jukes - Cantor Model
All substitution rates =
All base frequencies are 1/4
)4exp(4
3
4
1)( ttPii )4exp(
4
1
4
1)( ttPij
))8exp(1(4
3)2(1 ttPD ii
t = 2t
td 6Mean number of substitutions per site:
)3
41ln(
4
3Dd
d increases linearly with time
d = D
D = 3/4
D
d
The HKY model has a more general substitution rate matrix
*
*
*
*
CGA
TGA
TCA
TCG
T
C
G
A
TCGA
to
from
The frequencies of the four bases are
is the transition-transversion rate parameter
* means minus the sum of elements on the row
.,,, TCGA
Baboon Gibbon Orang Gorilla PygmyCh. Chimp Human
Baboon 0.00000 0.18463 0.19997 0.18485 0.17872 0.18213 0.17651
Gibbon 0.18463 0.00000 0.13232 0.11614 0.11901 0.11368 0.11478
Orang 0.19997 0.13232 0.00000 0.09647 0.09767 0.09974 0.09615
Gorilla 0.18485 0.11614 0.09647 0.00000 0.04124 0.04669 0.04111
PygmyChimp 0.17872 0.11901 0.09767 0.04124 0.00000 0.01703 0.03226
Chimp 0.18213 0.11368 0.09974 0.04669 0.01703 0.00000 0.03545
Human 0.17651 0.11478 0.09615 0.04111 0.03226 0.03545 0.00000
Part of the Jukes-Cantor Distance Matrix for the Primates example
Use as input to clustering methods
Mouse-Primates ~ 0.3
Distance Matrix Methods
Follow a clustering procedure on the distance matrix:
1. Join closest 2 clusters
2. Recalculate distances between all the clusters
3. Repeat 1 and 2 until all species are connected in a single cluster.
Initially each species is a cluster on its own. The clusters get bigger during the process, until they are all connected to a single tree.
Neighbour Joining method is commonly used clustering method
Neighbour-Joining method
Take two neighbouring nodes i and j and replace by a single new node n.
din + dnk = dik ; djn + dnk = djk ; din + djn = dij ;
therefore dnk = (dik + djk - dij)/2 ; applies for every k
define
k
iki dN
r2
1
kjkj d
Nr
2
1
k
i
j
n n
Let din = (dij + ri - rj)/2 ; djn = (dij + rj - ri)/2 .
Rule: choose i and j for which Dij is smallest, where Dij = dij - ri - rj .
but ...
NJ method produces an Unrooted, Additive Tree
Additive means distance between species = distance summed along internal branches
0.1
Baboon
Mouse
LemurTarsier
SakiMonkey
Marmoset
Gibbon
OrangutanGorilla PygmyChimp
Chimp
Human
0.1
Mouse
Lemur
Tarsier47
SakiMonkey
Marmoset100
Baboon
Gibbon
Orang
Gorilla
Human
PygmyChimp
Chimp100
84
100
99
100
100
100
The tree has been rooted using the Mouse as outgroup
The Maximum Likelihood Criterion
Calculate the likelihood of observing the data on a given the tree.
Choose tree for which the likelihood is the highest.
x
A G
t2t1
)()()( 21 tPtPxL xGxA
x
y z
t2t1
y z
xzxy tPzLtPyLxL )()()()()( 21
Can calculate total likelihood for the site recursively.
Likelihood is a function of tree topology, branch lengths, and parameters in the substitution rate matrix. All of these can be optimized.
Tree log L difference S.E. Significantly worse
1 -5783.03 0.39 7.33 no NJ tree
2 -5782.64 0.00 <-------------best tree (Human,(Gorilla,(Chimp,PygmyCh)))
3 -5787.99 5.35 5.74 no ((Human,Gorilla),(Chimp,PygmyCh))
4 -5783.80 1.16 9.68 no (Lemur,(Tarsier,Other Primates))
5 -5784.76 2.12 9.75 no (Tarsier,(Lemur,Other Primates))
????????????????????????????????????????????????????????????????????
6 -5812.96 30.32 15.22 yes (Gorilla,(Chimp,(Human,PygmyCh)))
7 -5805.21 22.56 13.08 no Orang and Gorilla form clade
8 -5804.98 22.34 13.15 no Gorilla branches earlier than Orang
9 -5812.97 30.33 14.17 yes Gibbon and Baboon form a clade
Using ML to rank the alternative trees for the primates example
0.1
MouseLemur
Tarsier47
SakiMonkeyMarmoset
100
BaboonGibbon
OrangGorilla
HumanPygmyChimpChimp
10084
10099
100100
100
The NJ Tree has:
(Gorilla,(Human,(Chimp,PygmyCh)))
((Lemur,Tarsier),Other Primates))
The Parsimony Criterion
Try to explain the data in the simplest possible way with the fewest arbitrary assumptions
Used initially with morphological characters.
Suppose C and D possess a character (1) that is absent in A and B (0)
A(0) B(0) C(1) D(1)
+
1 2
A(0) C(1) B(0) D(1)
+
*
3
A(0) C(1) B(0) D(1)+ +
In 1, the character evolves only once (+)
In 2, the character evolves once (+) and is lost once (*)
In 3, the character evolves twice independently
The first is the simplest explanation, therefore tree 1 is to be preferred by the parsimony criterion.
Parsimony with molecular data
A(T)
B(T)C(T)
D(G)
E(G)
*
A(T)
B(T)
C(T)
D(G)E(G)
* *
1. Requires one mutation 2. Requires two mutations
By parsimony, 1 is to be preferred to 2.
A(T)
B(T)C(T)
D(T)
E(G)
*A(T)
B(T)
C(T)
D(T)E(G)
*
This site is non-informative. Whatever the arrangement of species, only one mutation is required. To be informative, a site must have at least two bases present at least twice.
The best tree is the one that minimizes the overall number of mutations at all sites.
Searching Tree Space
Require a way of generating trees to be tested by Maximum Likelihood, or Parsimony.
Nearest neighbour interchange
Subtree pruning and regrafting
No. of distinct tree topologies
N Unrooted (UN) Rooted (RN)
3 1 3
4 3 15
5 15 105
6 105 945
7 945 10395
UN = (2N-5)UN-1 RN = (2N-3)RN-1
Conclusion: there are huge numbers of trees, even for relatively small numbers of species. Therefore you cannot look at them all.
Group III - Supraprimates
Dating from phylogeniesThe molecular clock
must be approx 50 m.y.
must be approx 150 m.y.
known to be 100 m.y.
Can lead to controversies because the clock does not always go at a constant rate
Mammalian orders (primates, rodents, carnivores, bats ....)Molecular dates tend to be earlier (100 m.y.) than those coming from the fossil record (65 m.y.)
Animal phyla - Molecular phylogenies are resolving the relationships between phyla that were not understood from morphology.Still uncertainty about dates - Molecular dates suggest about 1 b.y. Earliest fossil evidence is 560 m.y.
Old morphological phylogeny New molecular phylogeny
Puts LUCA just prior to 4, and origin of eukaryotes at 2.7.
Origin of cyanobacteria is at 2.5, whereas they are claimed to be present in the fossil record at 3.5.
Molecular dating of key events in early evolution