Molecular Evolution: Plan for week
Monday 3.11: Basics of Molecular Evolution
Lecture 1: 9-10.30 Molecular Basis and Models I (JH)
Computer: 11-12.30 PAUP: Distance/Parsimony/Compatibility (JH/IH)
Lecture 2: 13.30-15 Molecular Basis and Models II (JH)
Lecture 3: 15.30-17 The Origin of Life (JH/Miklos)
Tuesday 4.11: Tree of Life
Lecture 1: 9-10.30 Molecular Evolution of Eukaryote Pathogens (Day/Barry)
Lecture 2: 11-12.30 Molecular Evolution of Prokaryote Pathogens (Maiden)
Computer: 13.30-15 Analysis of Viral Data (Taylor)
Lecture 3: 15.30-17 Molecular Evolution of Virus (E. Holmes)
Wednesday 5.11: Stochastic Models of Evolution & Phylogenies
Computer: 9-10.30 PAUP/MrBayes: Likelihood (JH/IH)
Lecture 1: 11-12.30 The Evolution of Protein Structures (Deane)
Computer: 13.30-15 PAML: Testing Evolutionary Models (JH/Lyngsoe)
Lecture 2: 15.30-17 Molecular Evolution & Function/Structure/Selection (Meyer)
Thursday 6.11: More Phylogenies
Computer: 9-10.30 Molecular Evolution on the web (JH/Lyngsoe)
Lecture 2: 11-12.30 Beyond Phylogenies: Networks & Recombination (Song/JH)
Computer: 13.30-15 Beyond Phylogenies (Song)
Lecture 3: 15.30-17 Molecular Evolution and the Genomes (JH/Lunter)
Friday 7.11: Results, Advanced Topics and article discussion
Computer: 9-10.30 Statistical Alignment (JH/IM)
Lecture: 11-12.30 Article Discussion/Presentation by students
The Last Lunch
Two Discussion Articles
1. Timing the ancestor of the HIV-1 pandemic strains. Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Science. 2000 Jun 9;288(5472):1789-96.
2. Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis M, Patterson N, Endrizzi M & Lander E. Nature. 2003 May 15;423:241-
The Data & its growth
1976/79 The first viral genomes – MS2/φX174
1995 The first prokaryotic genome – H. influenzae
1996 The first unicellular eukaryotic genome – Yeast
1997 The first multicellular eukaryotic genome – C. elegans
2001 The human genome
2002 The mouse genome
Known as of 1.5.03:
>1000 viral genomes
96 prokaryotic genomes
16 archaebacterial genomes
A series of multicellular genomes is coming.
A general increase in data involving higher structures and dynamics of biological systems.
The Nucleotides
http://www.accessexcellence.org/AB/GG/
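Pyrimidines (C, T) and Purines (A, G)
Transitions: changes within a class (purine to purine, pyrimidine to pyrimidine)
Transversions: changes between the two classes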
The Amino Acids/Codons/Genes
http://www.accessexcellence.org/AB/GG/
{nucleotides}^3 → {amino acids, stop}
Major Application Areas of Molecular Evolution
Phylogenies and Classification
Rates of Evolution & The Molecular Clock
Dating
Functional Constraint – Negative Selection.
Positive/Diversifying Selection
Structure
RNA Structure
Gene Finding
Homing in on Important Genes
Homology Searches
Disease Gene Mapping
The Tree (?) of Life
[Figure: Origin of Life, LUCA, then Prokaryotes, Archaea and Eukaryotes (Plants, Fungi, Animals); Viruses??]
Tree of Life. Science vol. 300, June 2003
The Origin of Life
When did life originate?
Is the present structure a necessity or is it random accident?
How frequent is life in the Universe?
“+”:
Self replication easy
Self assembly easy
Many extrasolar planets
“-”:
Hard to make proper polymerisation
No convincing scenario.
No testability
Increased Origin Research:
In preparation of future NASA expeditions.
The rise of nano biology.
The ability to simulate larger molecular systems
Central Principles of Phylogeny Reconstruction
Parsimony
Distance
Likelihood
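[Figure: the same four sequences (TTCAGT, TCCAGT, GCCAAT, GCCAAT; leaves s1-s4) analysed by each principle. Parsimony: substitutions counted on the branches, total weight 4. Distance: a matrix of pairwise distances. Likelihood: branch lengths as parameter estimates, L = 3.1*10^-7.]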
From Distance to Phylogenies
What is the relationship of a, b, c, d & e?
     a    b    c    d    e
a    -   22   10   22   22
b    6    -   22   16   14
c    7    3    -   22   22
d   13    9    8    -   16
e    6    8    9   15    -
Upper triangle: molecular clock. Lower triangle: no molecular clock.
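The two halves of the distance matrix can be told apart mechanically. The sketch below is an illustration, not part of the original slides: it assumes the upper triangle holds the clock distances and the lower triangle the no-clock distances, and checks the three-point (ultrametric) and four-point (additive) conditions. The function names are hypothetical.

```python
from itertools import combinations

LABELS = "abcde"

# Upper triangle of the slide's matrix (assumed to be the "Molecular clock" distances).
CLOCK = {"ab": 22, "ac": 10, "ad": 22, "ae": 22, "bc": 22,
         "bd": 16, "be": 14, "cd": 22, "ce": 22, "de": 16}
# Lower triangle (assumed to be the "No molecular clock" distances).
NO_CLOCK = {"ab": 6, "ac": 7, "ad": 13, "ae": 6, "bc": 3,
            "bd": 9, "be": 8, "cd": 8, "ce": 9, "de": 15}


def dist(table, x, y):
    """Look up a symmetric distance by its sorted pair of labels."""
    return table["".join(sorted(x + y))]


def is_ultrametric(table):
    """Three-point condition: in every triple the two largest distances tie."""
    for x, y, z in combinations(LABELS, 3):
        d = sorted([dist(table, x, y), dist(table, x, z), dist(table, y, z)])
        if d[1] != d[2]:
            return False
    return True


def is_additive(table):
    """Four-point condition: in every quartet the two largest pair sums tie."""
    for w, x, y, z in combinations(LABELS, 4):
        sums = sorted([dist(table, w, x) + dist(table, y, z),
                       dist(table, w, y) + dist(table, x, z),
                       dist(table, w, z) + dist(table, x, y)])
        if sums[1] != sums[2]:
            return False
    return True


print(is_ultrametric(CLOCK), is_additive(NO_CLOCK), is_ultrametric(NO_CLOCK))
```

An ultrametric matrix fits a rooted tree with all leaves at the same height, while an additive matrix only guarantees an unrooted tree with branch lengths.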
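Enumerating Trees: Unrooted & valency 3
[Figure: the single unrooted tree on leaves 1-3, the trees obtained by attaching leaf 4 to each of its branches, and the trees obtained by attaching leaf 5 to each of the five branches of a four-leaf tree]
Number of unrooted trees with n leaves:
$$T_n = \prod_{j=3}^{n-1} (2j-3) = \frac{(2n-5)!}{(n-3)!\,2^{n-3}}$$
n:     4   5    6     7     8       9          10         15          20
T_n:   3   15   105   945   10395   1.4*10^5   2.0*10^6   7.9*10^12   2.2*10^20
Recursion: T_n = (2n-5) T_{n-1}. Initialisation: T_1 = T_2 = T_3 = 1.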
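The recursion T_n = (2n-5) T_{n-1} can be checked against the closed form (2n-5)!/((n-3)! 2^(n-3)). A minimal sketch, with hypothetical function names:

```python
from math import factorial


def unrooted_trees_recursive(n):
    """Count unrooted binary trees on n >= 3 leaves via T_n = (2n-5) * T_(n-1)."""
    count = 1  # T_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5  # leaf k can be attached to any of the 2k-5 branches
    return count


def unrooted_trees_closed(n):
    """Closed form (2n-5)! / ((n-3)! * 2^(n-3)) for n >= 3."""
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(n, unrooted_trees_recursive(n))
```

The growth is faster than exponential, which is why exhaustive search over tree space is only feasible for a handful of leaves.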
Heuristic Searches in Tree Space
Nearest Neighbour Interchange
Subtree regrafting
Subtree rerooting and regrafting
[Figure: example rearrangements of the subtrees T1-T4 and of the leaves s1-s6 under these moves]
Assignment to internal nodes: The simple way.
[Figure: tree with leaves C, A, C, C, A, C, T, G and unknown internal nodes marked ?]
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.
5S RNA Alignment & Phylogeny
Hein, 1990
10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
 9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
 3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
 4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
 8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
 6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-
[Figure: phylogeny of the 18 sequences]
Transitions 2, transversions 5
Total weight 843.
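Cost of a history - minimizing over internal states
[Figure: a node in state G with cost vector (A, C, G, T) above a left and a right subtree, each with its own (A, C, G, T) cost vector; one candidate term for the left subtree is d(C,G) + w_C(left subtree)]
$$w_G(\text{subtree}) = \min_{N \in \text{Nucleotides}} \{ d(G,N) + w_N(\text{left subtree}) \} + \min_{N \in \text{Nucleotides}} \{ d(G,N) + w_N(\text{right subtree}) \}$$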
Cost of a history – leaves (initialisation)
[Figure: leaves G and A, each above an empty subtree of cost 0]
Initialisation at the leaves: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.
Fitch-Hartigan-Sankoff Algorithm
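Each entry of a cost vector is the cost of the cheapest tree hanging from a node, given the nucleotide at that node (for example a “C”).
Leaves: A, C, T, G. Costs: transition 2, transversion 5.
Cost vectors in the order (A,C,G,T), * = infinity:
Leaf C: (*, 0, *, *)
Leaf T: (*, *, *, 0)
Leaf G: (*, *, 0, *)
Node above C and T: (10, 2, 10, 2)
Node above that node and G: (9, 7, 7, 7)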
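The cost vectors (10,2,10,2) and (9,7,7,7) can be reproduced with a few lines. This is a minimal sketch of the minimisation step only, assuming the tree joins C and T first and then G; the function names are hypothetical and no traceback of the optimal internal states is done.

```python
INF = float("inf")
NUCS = "ACGT"
PURINES = {"A", "G"}


def d(x, y):
    """Substitution cost: 0 for identity, 2 for a transition, 5 for a transversion."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return 2 if same_class else 5


def leaf(nuc):
    """Initialisation: cost 0 for the observed nucleotide, infinity otherwise."""
    return [0 if n == nuc else INF for n in NUCS]


def join(left, right):
    """Cost vector of a node from the cost vectors of its two subtrees."""
    out = []
    for parent in NUCS:
        best_left = min(d(parent, c) + left[i] for i, c in enumerate(NUCS))
        best_right = min(d(parent, c) + right[i] for i, c in enumerate(NUCS))
        out.append(best_left + best_right)
    return out


ct = join(leaf("C"), leaf("T"))
ctg = join(ct, leaf("G"))
print(ct, ctg)
```

Each node costs a constant amount of work per nucleotide pair, so the whole tree is handled in time linear in the number of leaves rather than by enumerating all 4^(k-2) assignments.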
The Felsenstein Zone
Felsenstein-Cavender (1979)
Patterns (16, only 8 shown; one row per sequence, one column per pattern):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1
[Figure: True Tree and Reconstructed Tree on s1-s4, which pair the leaves differently]
Bootstrapping
Felsenstein (1985)
[Figure: the columns of the original alignment of four sequences (each drawn as ATCTGTAGTCT) are resampled with replacement; the counts 1 0 2 3 0 1 0 1 2 0 1 give how often each column is used in one resampled data set. A tree on leaves 1-4 is built from each of the resampled data sets 1, 2, ..., 500.]
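The resampling step of the bootstrap can be sketched as follows. This is an illustration under stated assumptions: the toy alignment reuses the four six-letter sequences from the phylogeny-principles slide, the function name is hypothetical, and the tree-building step applied to each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    # counts[i] = how many times original column i appears in the replicate
    counts = [picks.count(i) for i in range(length)]
    resampled = ["".join(seq[i] for i in picks) for seq in alignment]
    return resampled, counts


alignment = ["TTCAGT", "TCCAGT", "GCCAAT", "GCCAAT"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(500)]
print(replicates[0])
```

Support for a grouping is then the fraction of replicate trees in which that grouping appears.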
The Molecular Clock
First noted by Zuckerkandl & Pauling (1964) as an empirical fact.
How can one detect it?
Known Ancestor, a, at Time t
[Figure: s1 and s2 descending from the known ancestor a]
Unknown Ancestors
[Figure: s1, s2 and s3 with unknown ancestors ??]
Rootings
Purpose
1) To give time direction in the phylogeny & most ancient point.
2) To be able to define concepts such as a monophyletic group.
Methods
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume Molecular Clock.
Rooting the 3 kingdoms
3 billion years ago: no reliable clock - no outgroup
Given 2 sets of homologous proteins, i.e. MDH & LDH, can the archaea, prokaryotes and eukaryotes be rooted?
[Figure: tree of E, P and A with the position of the root unknown; an LDH copy and an MDH copy of the E, P, A tree joined in one LDH/MDH tree]
Non-contemporaneous leaves
(A. Rambaut (2000): Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16.4.395-399)
[Figure: trees against a time axis (1980, 1990, 2000). Contemporary sample: no time structure. Serial sample: with time structure.]
RNA viruses like HIV evolve fast enough that you can’t ignore the time structure.
From Drummond
[Figure: phylogeny of HIV-1 sequences from patients Pt.1-Pt.9 (Patient #6 from Wolinsky et al.) with the sequences HIV1U36148, HIV1U36015, HIV1U35980, HIV1U36073, HIV1U35926 and HIVU95460; scale bar 10%]
Shankarappa et al (1999)
[Figure: Viral Divergence (2% to 10%) against Years Post Seroconversion (0 to 10)]
From Drummond
HIV-1 (env) evolution in nine infected individuals
[Figure: tree with Lineage A and Lineage B; ‘ladder-like’ appearance]
Ne = [4000, 6300]
Mu = [0.8% – 1%] per site per year
• 210 sequences collected over a period of 9.5 years
• 660 nucleotides from env: C2-V5 region
• Only the first 285 (no alignment ambiguities) were used in this analysis
• Effective population size and mutation rate were co-estimated using Bayesian MCMC.
From Drummond
A tree sampled from the posterior distribution of Shankarappa Patient
Models of Amino Acid, Nucleotide & Codon Evolution
Amino Acids, Nucleotides & Codons
Continuous Time Markov Processes
Specific Models
Special Issues
Context Dependence
Rate Variation
The Purpose of Stochastic Models.
1. Molecular Evolution is Stochastic.
2. To estimate evolutionary parameters, not observable directly:
i. Real number of events in evolutionary history.
ii. Rates of different kinds of events in evolutionary history.
iii. Strength of selection against amino acid changing nucleotide substitutions.
iv. Estimate importance of different biological factors.
3. Survive a goodness of fit test.
4. Serve these purposes as simply as possible.
Central Problems: History cannot be observed, only end products.
[Figure: an example history through the sequences ACGTC, ACGCC, AGGCC, AGGCT, AGGTT, with the further sequences AGGGC and AGTGC]
Comment: Even if History could be observed, the underlying process couldn’t.
Principle of Inference: Likelihood
Likelihood function L() – the probability of data as function of parameters: L(θ,D)
LogLikelihood function – l(): ln(L(θ,D))
If the data is a series of independent experiments, L() becomes a product of the likelihoods of each experiment and l() becomes the sum of the log-likelihoods of each experiment.
In likelihood analysis the parameter is not viewed as a random variable.
Consistency: $\hat{\theta}(D) \to \theta_{\text{true}}$ as data increases.
Likelihood and logLikelihood of Coin Tossing
$$L(x, n; p) = \frac{n!}{(n-x)!\,x!}\, p^x (1-p)^{n-x}$$
$$l(x, n; p) = \ln\!\left(\frac{n!}{(n-x)!\,x!}\right) + x \ln(p) + (n-x) \ln(1-p)$$
From Edwards (1991) Likelihood
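The coin-tossing likelihood is easy to evaluate directly. The sketch below is illustrative only: the data (7 heads in 10 tosses) and the grid search are assumptions made for the example, and the function names are hypothetical.

```python
from math import comb, exp, log


def likelihood(x, n, p):
    """Binomial likelihood of x heads in n tosses with head probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def log_likelihood(x, n, p):
    """Log of the binomial likelihood, written as a sum of three terms."""
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)


x, n = 7, 10  # hypothetical data: 7 heads in 10 tosses
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=lambda p: log_likelihood(x, n, p))
print(best_p, likelihood(x, n, best_p))
```

The maximum sits at the observed frequency x/n, which is the maximum likelihood estimate of p.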
Principle of Inference: Bayesian Analysis
In Bayesian analysis the parameters are viewed as stochastic variables that have a prior distribution before observing data. The data depend on the parameters, and after observing the data the parameters have a posterior distribution.
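Simplifying Assumptions I
Data: s1=TCGGTA, s2=TGGTT
Biological setup: a common ancestor a (unknown) with descendants TCGGTA and TGGTT.
Probability of Data:
$$P = \sum_a P(a)\, P(a \to \text{TCGGTA})\, P(a \to \text{TGGTT})$$
1) Only substitutions. The alignment
s1 TCGGTA
s2 TGGT-T
is replaced by
s1 TCGGA
s2 TGGTT
so that
$$P = \sum_a P(a)\, P(a \to \text{TCGGA})\, P(a \to \text{TGGTT})$$
2) Processes in different positions of the molecule are independent, so the probability for the whole alignment will be the product of the probabilities of the individual patterns.
$$P = \prod_{i=1}^{5} \sum_{a_i} P_i(a_i)\, P_i(a_i \to s1_i)\, P_i(a_i \to s2_i)$$
[Figure: ancestral positions a1-a5 above the aligned columns T/T, C/G, G/G, G/T, A/T]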
Simplifying Assumptions II
3) The evolutionary process is the same in all positions.
$$P = \prod_{i=1}^{5} \sum_{a_i} P(a_i)\, P(a_i \to s1_i)\, P(a_i \to s2_i)$$
4) Time reversibility: Virtually all models of sequence evolution are time reversible, i.e. π_i P_{i,j}(t) = π_j P_{j,i}(t), where π_i is the stationary probability of state i and P_{i,j}(t) is the probability that state i has changed into state j after time t. This implies that
$$\sum_a P(a)\, P_{a,N_1}(l_1)\, P_{a,N_2}(l_2) = P(N_1)\, P_{N_1,N_2}(l_1 + l_2)$$
[Figure: ancestor a with branches of length l1 to N1 and l2 to N2, equivalent to a single branch of length l1+l2 between N1 and N2]
$$P = \prod_{i=1}^{5} P(s1_i)\, P(s1_i \to s2_i)$$
Simplifying assumptions III
5) The nucleotide at any position evolves following a continuous time Markov Chain.
6) The rate matrix, Q, for the continuous time Markov Chain is the same at all times (and often all positions). However, it is possible to let the rate of events, r_i, vary from site to site; the term for passed time, t, is then replaced by r_i*t.
P_{i,j}(t): continuous time Markov chain on the state space {A,C,G,T}.
Q - rate matrix:
FROM \ TO   A                    C                    G                    T
A           -(qA,C+qA,G+qA,T)    qA,C                 qA,G                 qA,T
C           qC,A                 -(qC,A+qC,G+qC,T)    qC,G                 qC,T
G           qG,A                 qG,C                 -(qG,A+qG,C+qG,T)    qG,T
T           qT,A                 qT,C                 qT,G                 -(qT,A+qT,C+qT,G)
[Figure: a position passing through the states C, C, A over the time intervals t1 and t2]
$$q_{ij} = \lim_{\varepsilon \to 0} \frac{P_{i,j}(\varepsilon)}{\varepsilon} \ (i \ne j), \qquad q_{ii} = \lim_{\varepsilon \to 0} \frac{P_{i,i}(\varepsilon) - 1}{\varepsilon}$$
i. P(0) = I.
ii. P(ε) is close to I + εQ for small ε.
iii. P'(0) = Q.
iv. lim P(t) as t → ∞ has the equilibrium frequencies of the 4 nucleotides in each row.
v. Waiting time in state j, T_j: P(T_j > t) = e^{q_jj t}, where q_jj is the (negative) diagonal entry.
vi. QE = 0, where E_ij = 1 (all i,j).
vii. PE = E.
viii. If AB = BA, then e^{A+B} = e^A e^B.
Q and P(t)
What is the probability of going from i (C?) to j (G?) in time t with rate matrix Q?
$$P(t) = \exp(tQ) = \sum_{i=0}^{\infty} \frac{(tQ)^i}{i!} = I + tQ + \frac{(tQ)^2}{2!} + \frac{(tQ)^3}{3!} + \dots$$
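The series P(t) = exp(tQ) can be summed numerically for a small example. The sketch below is a naive truncated Taylor series in plain Python, adequate only when the entries of tQ are small; the Jukes-Cantor matrix with rate alpha = 1 and the function names are assumptions made for the illustration.

```python
from math import exp


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=40):
    """Approximate P(t) = exp(tQ) by the first `terms` terms of its power series."""
    n = len(q)
    tq = [[t * x for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term (tQ)^i / i!, starting at I
    for i in range(1, terms):
        term = [[x / i for x in row] for row in mat_mul(term, tq)]
        result = [[result[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return result


alpha = 1.0  # hypothetical substitution rate
JC = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(JC, 0.1)
print(P[0][0], 0.25 * (1 + 3 * exp(-0.4)))
```

For long branches or large rates a library routine with scaling and squaring, or the eigen-decomposition of Q, is the safer choice.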
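Jukes-Cantor 69: Total Symmetry
Rate-matrix, R:
FROM \ TO   A     C     G     T
A           -3α   α     α     α
C           α     -3α   α     α
G           α     α     -3α   α
T           α     α     α     -3α
Transition prob. after time t, a = α*t:
P(equal) = ¼(1 + 3e^{-4a}) ~ 1 - 3a
P(diff.) = ¾(1 - e^{-4a}) ~ 3a, i.e. ¼(1 - e^{-4a}) for each specific different nucleotide
Stationary Distribution: (1,1,1,1)/4.
For s1 = TCGGA and s2 = TGGTT:
$$P = \prod_{i=1}^{5} P(s1_i)\, P(s1_i \to s2_i) = \left(\tfrac14\right)^5 P(T \to T)\, P(C \to G)\, P(G \to G)\, P(G \to T)\, P(A \to T) = \left(\tfrac14\right)^5 \left(\tfrac14\right)^5 (1 + 3e^{-4a})^2 (1 - e^{-4a})^3$$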
Geometric/Exponential Distributions
The Geometric Distribution on {0,1,..}, Geo(p): P{Z=j} = p^j(1-p), P{Z≥j} = p^j, E(Z) = p/(1-p).
The Exponential Distribution on R+, Exp(λ): density f(t) = λe^{-λt}, P(X>t) = e^{-λt}.
Properties: X ~ Exp(λ), Y ~ Exp(μ), independent
i. P(X>t2 | X>t1) = P(X>t2-t1) (t2 > t1): Markov (memoryless) process
ii. E(X) = 1/λ.
iii. P(Z>t) ≈ P(X>t) for small a, with X ~ Exp(a) and p = e^{-a}.
iv. P(X < Y) = λ/(λ + μ).
v. min(X,Y) ~ Exp(λ + μ).
[Figure: example distribution with mean 2.5]
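Comparison of Pairs of Nucleotides/Sequences
All Evolutionary Paths (from C to G):
$$P_{\alpha,t}(C,G) = \tfrac14 \left(1 - e^{-4\alpha t}\right)$$
Shortest Path (from C to G)
Sample Paths according to their probability (from C to G)
[Figure: the same comparison for the sequence pair CTACGT / GTATAT, and "All Evolutionary Paths" for ATTGTGTATATAT….CAG / ATTGCGTATCTAT….CCG across Higher Cells, Chimp, Mouse, Fish and E.coli]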
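From Q to P for Jukes-Cantor
$$Q = \alpha A, \qquad A = \begin{pmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{pmatrix}, \qquad A^i = (-4)^{i-1} A \ \ (i \ge 1)$$
$$P(t) = e^{tQ} = I + \sum_{i=1}^{\infty} \frac{(\alpha t)^i}{i!} A^i = I - \tfrac14 \sum_{i=1}^{\infty} \frac{(-4\alpha t)^i}{i!} A = I + \tfrac14 \left(1 - e^{-4\alpha t}\right) A$$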
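Kimura 2-parameter model
Q:
FROM \ TO   A   C   G   T
A           -   β   α   β
C           β   -   β   α
G           α   β   -   β
T           β   α   β   -
a = α*t, b = β*t
P(t), seen from the start nucleotide:
same nucleotide: .25(1 + e^{-4b} + 2e^{-2(a+b)})
transition: .25(1 + e^{-4b} - 2e^{-2(a+b)})
each of the two transversions: .25(1 - e^{-4b})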
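As a sanity check on the Kimura two-parameter probabilities, the sketch below evaluates the standard closed forms (identity, transition, each transversion) for a = αt and b = βt. The function name and the example values are hypothetical.

```python
from math import exp


def k2p_probabilities(a, b):
    """Return (same, transition, each_transversion) for a = alpha*t, b = beta*t."""
    e4b = exp(-4.0 * b)
    e2ab = exp(-2.0 * (a + b))
    same = 0.25 * (1.0 + e4b + 2.0 * e2ab)
    transition = 0.25 * (1.0 + e4b - 2.0 * e2ab)
    transversion = 0.25 * (1.0 - e4b)  # probability of one specific transversion
    return same, transition, transversion


same, ts, tv = k2p_probabilities(0.30, 0.10)  # hypothetical values of a and b
print(same, ts, tv, same + ts + 2 * tv)
```

With a = b the three expressions collapse to the Jukes-Cantor probabilities ¼(1 + 3e^{-4a}) and ¼(1 - e^{-4a}).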
Felsenstein81 & Hasegawa, Kishino & Yano 85
Unequal base composition (Felsenstein, 1981):
Q_{i,j} = C*π_j for i unequal j
Transition/transversion & composition bias (Hasegawa, Kishino & Yano, 1985):
Q_{i,j} = (transition/transversion bias factor)*C*π_j if i->j is a transition
Q_{i,j} = C*π_j if i->j is a transversion
Dayhoff's empirical approach (1970)
Take a set of closely related proteins, count all differences and make a symmetric difference matrix, since time direction cannot be observed.
If q_ij = q_ji, then the equilibrium frequencies, π_i, are all the same.
The transformation qij --> iqij/j, then equilibrium frequencies will be i.
Measuring Selection
[Figure: a history of the sequences ACGTCA, ACGCCA, ACGCCG, AGGCCG, ACTCTG, GCTCTG and GCACTG, each labelled with the amino acids it encodes]
Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.
The selection criteria could in principle be anything, but the selection against amino acid changes is without comparison the most important.
The Genetic Code
i. 3 classes of sites:
4
2-2
1-1-1-1
[Figure: codon table with examples of a 4 site (3rd position), a 1-1-1-1 site (3rd position) and TA (2nd position)]
Problems:
i. Not all fit into those categories.
ii. Change in one site can change the status of another.
Possible events in the genetic code (remade from Li, 1997)
Substitutions          Number   Percent
Total in all codons    549      100
Synonymous             134      25
Nonsynonymous          415      75
  Missense             392      71
  Nonsense             23       4
Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides) = 549.
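The counts in the table can be recomputed by enumerating every single-nucleotide change in the 61 sense codons of the standard genetic code. A minimal sketch; the table layout (bases in the order T, C, A, G) and the function name are choices made for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG; * = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def count_substitutions():
    """Classify all one-step substitutions that start from a sense codon."""
    counts = {"total": 0, "synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, amino in CODE.items():
        if amino == "*":
            continue  # only the 61 sense codons are starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                counts["total"] += 1
                if CODE[mutant] == amino:
                    counts["synonymous"] += 1
                elif CODE[mutant] == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


print(count_substitutions())
```

Most synonymous changes fall at third codon positions, which is what the classification of sites into 4, 2-2 and 1-1-1-1 classes captures.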
Synonymous (silent) & Non-synonymous (replacement) substitutions
Ser Thr Glu Met Cys Leu Met Gly Thr
TCA ACT GAG ATG TGT TTA ATG GGG ACG
***   *  *    *  *  *         *  **
GGG ACA GGG ATA TAT CTA ATG GGT AGC
Ser Thr Gly Ile Tyr Leu Met Gly Ser
Ks: Number of Silent Events in Common History
Ka: Number of Replacement Events in Common History
Ns: Silent positions
Na: Replacement positions
Rates per pos: (Ks/Ns)/2T
Example: Ks = 100, Ns = 300, T = 10^8 years
Silent rate: (100/300)/(2*10^8) = 1.67*10^-9 /year/pos.
[Figure: the two paths between ACG (Thr) and AGC (Ser), via AGG (Arg) or via ACC (Thr), with replacement steps marked *]
Miyata: use most silent path for calculations.
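Kimura’s 2 parameter model & Li’s Model.
Selection on the 3 kinds of sites, rates (transition, transversion):
1-1-1-1 (f*α, f*β)
2-2 (α, f*β)
4 (α, β)
Probabilities, seen from the start nucleotide, with a = αt and b = βt:
same nucleotide: .25(1 + e^{-4b} + 2e^{-2(a+b)})
transition: .25(1 + e^{-4b} - 2e^{-2(a+b)})
each of the two transversions: .25(1 - e^{-4b})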
Sites     Total   Conserved     Transitions   Transversions
1-1-1-1   274     246 (.8978)   12 (.0438)    16 (.0584)
2-2       77      51 (.6623)    21 (.2727)    5 (.0649)
4         78      47 (.6026)    16 (.2051)    15 (.1923)
Z(t,t) = .50[1+exp(-2t) - 2exp(-t(+)] transition Y(t,t) = .25[1-exp(-2t )] (transversion)X(t,t) = .25[1+exp(-2t) + 2exp(-t()] identity
L(observations,a,b,f) = C(429,274,77,78) * {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} * {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} * {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}
where a = αt and b = βt.
Estimated Parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663
Sites     Transitions    Transversions
1-1-1-1   a*f = 0.0500   2*b*f = 0.0622
2-2       a = 0.3004     2*b*f = 0.0622
4         a = 0.3004     2*b = 0.3741
Expected number of replacement substitutions: 35.49; synonymous: 75.93
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
Ks = .6644, Ka = .1127
alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile
HIV2 Analysis
Hasegawa, Kishino & Yano Substitution Model Parameters:
α*t     β*t     A       C       G       T
0.350   0.105   0.361   0.181   0.236   0.222
0.015   0.005   0.004   0.003   0.003
Selection Factors
GAG 0.385 (s.d. 0.030)
POL 0.220 (s.d. 0.017)
VIF 0.407 (s.d. 0.035)
VPR 0.494 (s.d. 0.044)
TAT 1.229 (s.d. 0.104)
REV 0.596 (s.d. 0.052)
VPU 0.902 (s.d. 0.079)
ENV 0.889 (s.d. 0.051)
NEF 0.928 (s.d. 0.073)
Estimated Distance per Site: 0.194
Examples of rates (remade from Li, 1997)
Organism         Gene            Syno/year    Non-Syno/year
RNA Virus
Influenza A      Hemagglutinin   13.1*10^-3   3.6*10^-3
Hepatitis C      E               6.9*10^-3    0.3*10^-3
HIV 1            gag             2.8*10^-3    1.7*10^-3
DNA virus
Hepatitis B      P               4.6*10^-5    1.5*10^-5
Herpes Simplex   Genome          3.5*10^-8
Nuclear Genes
Mammals          c-mos           5.2*10^-9    0.9*10^-9
Mammals          α-globin        3.9*10^-9    0.6*10^-9
Mammals          histone 3       6.2*10^-9    0.0
Codon based Models
Goldman, Yang + Muse, Gaut
i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3660) off-diagonal entries.
i. Bias in nucleotide usage.
ii. Bias in codon usage.
iii. Bias in amino acid usage.
iv. Synonymous/non-synonymous distinction.
v. Amino acid distance.
vi. Transition/transversion bias.
For codon i and codon j differing by one nucleotide:
q_{i,j} = α π_j exp(-d_{i,j}/V) if they differ by a transition
q_{i,j} = β π_j exp(-d_{i,j}/V) if they differ by a transversion
d_{i,j} is a physico-chemical difference between amino acid i and amino acid j. V is a factor that reflects the variability of the gene involved.
Rate variation between sites: iid each site
i) The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. Γ(α,β) has density x^{β-1} e^{-x/α} / (Γ(β) α^β), where α is called the scale parameter and β the form parameter.
Let L(p_i, θ, t) be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process and f(r_i) the continuous distribution of rate(s). Then
$$L_i = \int L(p_i, \theta, r_i t)\, f(r_i)\, dr_i$$
Rate variation between sites: Hidden Markov Chains
1) Different positions in the molecule evolve at different rates, for instance fast (rF) or slow (rS).
2) Neighbouring positions tend to evolve at the same rate.
[Figure: observed columns O1-O10, each with a hidden rate state F or S]
What is the probability of the data?
What is the most probable ”hidden” configuration?
What is the probability of a specific ”hidden” state?
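The first question, the probability of the data, is answered by the forward algorithm. The sketch below is an assumption-laden illustration: the two hidden rate classes F and S, the probability 0.9 of keeping the same class at the next position, and the per-column likelihoods are all hypothetical numbers, not values from the course. A brute-force sum over all hidden paths is included as a check.

```python
from itertools import product

STATES = ("F", "S")


def forward(emissions, stay=0.9, start=0.5):
    """Probability of the data, summing over hidden rate paths position by position."""
    f = {s: start * emissions[0][s] for s in STATES}
    for em in emissions[1:]:
        f = {s: em[s] * sum(f[r] * (stay if r == s else 1.0 - stay) for r in STATES)
             for s in STATES}
    return sum(f.values())


def brute_force(emissions, stay=0.9, start=0.5):
    """Same probability by explicit enumeration of all 2^L hidden paths."""
    total = 0.0
    for path in product(STATES, repeat=len(emissions)):
        p = start * emissions[0][path[0]]
        for prev, cur, em in zip(path, path[1:], emissions[1:]):
            p *= (stay if prev == cur else 1.0 - stay) * em[cur]
        total += p
    return total


# Hypothetical P(column | rate class) for five alignment columns.
EMISSIONS = [{"F": 0.02, "S": 0.20}, {"F": 0.03, "S": 0.25}, {"F": 0.10, "S": 0.01},
             {"F": 0.08, "S": 0.02}, {"F": 0.02, "S": 0.22}]
print(forward(EMISSIONS), brute_force(EMISSIONS))
```

In a real analysis each emission term would be the phylogenetic likelihood of the column computed with the branch lengths scaled by rF or rS.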
Statistical Test of Models (Goldman, 1990)
Data: 3 sequences of length L
ACGTTGCAA ...
AGCTTTTGA ...
TCGTTTCGA ...
A. Likelihood (free multinomial model, 63 free parameters):
L1 = p_AAA^#AAA * p_AAC^#AAC * ... * p_TTT^#TTT, where p_N1N2N3 = #(N1N2N3)/L
B. Jukes-Cantor and unknown branch lengths:
L2 = p_AAA(l1',l2',l3')^#AAA * ... * p_TTT(l1',l2',l3')^#TTT
[Figure: three-leaf tree with branch lengths l1, l2, l3 leading to the three sequences]
Test statistics: I. Σ (expected-observed)^2/expected or II. -2 lnQ = 2(lnL1 - lnL2)
JC69 Jukes-Cantor: 3 parameters => χ2 with 60 d. of freedom
Problems:
i. Too few observations per pattern.
ii. Many competing hypotheses.
Parametric bootstrap:
i. Maximum likelihood to estimate the parameters.
ii. Simulate with estimated model.
iii. Make simulated distribution of -2 lnQ.
iv. Where is real -2 lnQ in this distribution?
Episodic Evolution
Poisson Process:
i. Ti's independent, exponentially distributed with same parameter (λ).
ii. Variance and Mean both λ.
Empirical Observations:
i. Variance/Mean > 1 (clumpy process) for non-synonymous events.
Possible explanations:
i. Selective avalanches.
ii. Gene conversions from pseudogenes.
Assignment to internal nodes: The simple way.
[Figure: tree with leaves C, A, C, C, A, C, T, G and unknown internal nodes marked ?]
If branch lengths and evolutionary process are known, what is the probability of nucleotides at the leaves?
Cctacggccatacca a ccctgaaagcaccccatcccgt
Cttacgaccatatca c cgttgaatgcacgccatcccgt
Cctacggccatagca c ccctgaaagcaccccatcccgt
Cccacggccatagga c ctctgaaagcactgcatcccgt
Tccacggccatagga a ctctgaaagcaccgcatcccgt
Ttccacggccatagg c actgtgaaagcaccgcatcccg
Tggtgcggtcatacc g agcgctaatgcaccggatccca
Ggtgcggtcatacca t gcgttaatgcaccggatcccat
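Probability of leaf observations - summing over internal states
[Figure: a node in state G with probability vector (A, C, G, T) above a left and a right subtree, each with its own (A, C, G, T) vector; one term of the left sum is P(G→C) * P_C(left subtree)]
$$P_G(\text{subtree}) = \Big\{ \sum_{N \in \text{Nucleotides}} P(G \to N)\, P_N(\text{left subtree}) \Big\} \times \Big\{ \sum_{N \in \text{Nucleotides}} P(G \to N)\, P_N(\text{right subtree}) \Big\}$$
Initialisation: $P_G(\text{leaf}) = 1$ if the leaf is G, and 0 otherwise.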
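The summation over internal states can be sketched in the same way as the parsimony recursion, with sums and products in place of minima and sums. The example below assumes the Jukes-Cantor model with branch lengths measured as a = αt, a hypothetical three-leaf tree, and hypothetical function names.

```python
from itertools import product
from math import exp

NUCS = "ACGT"


def jc(x, y, a):
    """Jukes-Cantor probability of x changing into y over a branch of length a."""
    e = exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * e) if x == y else 0.25 * (1.0 - e)


def leaf(nuc):
    """Initialisation: probability 1 for the observed nucleotide, 0 otherwise."""
    return {x: 1.0 if x == nuc else 0.0 for x in NUCS}


def join(left, left_len, right, right_len):
    """Conditional probabilities at a node from those of its two subtrees."""
    out = {}
    for g in NUCS:
        left_sum = sum(jc(g, n, left_len) * left[n] for n in NUCS)
        right_sum = sum(jc(g, n, right_len) * right[n] for n in NUCS)
        out[g] = left_sum * right_sum
    return out


def likelihood(root):
    """Weight the root vector by the stationary distribution (1,1,1,1)/4."""
    return sum(0.25 * root[g] for g in NUCS)


def three_leaf(x, y, z):
    """Probability of leaves x, y, z on the tree ((x,y),z) with made-up branch lengths."""
    inner = join(leaf(x), 0.1, leaf(y), 0.2)
    return likelihood(join(inner, 0.05, leaf(z), 0.3))


print(three_leaf("C", "T", "G"))
```

For two leaves the recursion reproduces the reversibility shortcut: summing over the ancestor gives the stationary probability of one leaf times the transition probability over the combined branch length.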
Output from Likelihood Method.
[Figure: two trees on s1-s5. Molecular Clock: n-1 heights estimated, drawn against a time axis from "Now" back through the "Duplication Times". No Molecular Clock: 2n-3 lengths estimated, branch lengths drawn as "Amount of Evolution". Estimates shown on the trees: 23 -/+5.2, 12 -/+2.2, 11.1 -/+1.8, 5.9 -/+1.2, 6.9 -/+1.3, 11.4 -/+1.9, 3.9 -/+0.8, 10.9 -/+2.1, 9.9 -/+1.2, 11.6 -/+2.1, 4.1 -/+0.7]
Likelihood: 6.2*10^-12 (parameter estimates 0.34, 0.16)
Likelihood: 7.9*10^-14 (parameter estimates 0.31, 0.18)
2[ln(6.2*10^-12) - ln(7.9*10^-14)] is χ2-distributed with (n-2) degrees of freedom.
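The clock test can be carried through numerically with the two reported likelihoods. This sketch rests on assumptions not stated on the slide: that the larger likelihood (6.2*10^-12) belongs to the model without a clock, that n = 5 sequences give n-2 = 3 degrees of freedom, and that the statistic is twice the log-likelihood difference. The tail probability uses the closed form for a χ2 variable with 3 degrees of freedom; function names are hypothetical.

```python
from math import erfc, exp, log, pi, sqrt


def chi2_sf_df3(x):
    """Upper tail probability of a chi-square variable with 3 degrees of freedom."""
    return erfc(sqrt(x / 2.0)) + sqrt(2.0 * x / pi) * exp(-x / 2.0)


def lrt_statistic(lik_general, lik_restricted):
    """Twice the log-likelihood difference between the general and restricted model."""
    return 2.0 * (log(lik_general) - log(lik_restricted))


stat = lrt_statistic(6.2e-12, 7.9e-14)  # assumed: no clock vs clock
p_value = chi2_sf_df3(stat)
print(stat, p_value)
```

Under these assumptions the clock would be rejected at the 5% level, since the statistic exceeds the critical value of about 7.8 for 3 degrees of freedom.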
The generation/year-time clock
Langley-Fitch, 1973
Absolute Time Clock: after some rooting technique, the tree on s1, s2, s3 with branch lengths l1, l2, l3 satisfies {l1 = l2 < l3}.
Generation Time Clock:
[Figure: Elephant and Mouse over 100 Myr, with the labels Absolute Time Clock, Generation Time, variable and constant]
Can the generation time clock be tested?
Assume a data set: 3 species, 2 sequences each.
[Figure: trees on s1, s2, s3 for the two sequences. Any Tree: branch lengths l1, l2, l3. Generation Time Clock: the second tree has branch lengths c*l1, c*l2, c*l3. Absolute Time Clock: l1 = l2, l3.]
Degrees of freedom:
k=3: any tree 3, clock 2
k: any tree 2k-3, clock k-1
k=3, t=2: dg = 4
k, t: dg = (2k-3)+(t-1)
– globin, cytochrome c, fibrinopeptide A & generation time clock
Langley-Fitch, 1973
Fibrinopeptide A phylogeny:
[Figure: tree with leaves Human, Gorilla, Donkey, Gibbon, Monkey, Rabbit, Cow, Rat, Pig, Horse, Goat, Llama, Sheep, Dog]
Relative rates
-globin 0.342
– globin 0.452
cytochrome c 0.069
fibrinopeptide A 0.137
Almost Clocks
(MJ Sanderson (1997) “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy” Mol.Biol.Evol. 14.12.1218-31; J.L. Thorne et al. (1998) “Estimating the Rate of Evolution of the Rate of Evolution.” Mol.Biol.Evol. 15(12).1647-57; JP Huelsenbeck et al. (2000) “A compound Poisson Process for Relaxing the Molecular Clock” Genetics 154.1879-92.)
I Smoothing a non-clock tree onto a clock tree (Sanderson).
II Rate of Evolution of the rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplying with a random variable (gamma distributed).
Comment: Makes perfect sense. Testing no clock versus perfect clock is choosing between two unrealistic extremes.
Summary
Phylogeny
Principles of Phylogenies
Rates of Molecular Evolution and the Molecular Clock
Rooting Phylogenies
The Generation Time Clock
Almost Clocks
Non-Contemporaneous Leaves (Viruses & Ancient DNA)
The Purpose of Stochastic Models
The assumptions of Stochastic Models
The Central Models
Measuring Selection
Variation among sites
Testing Models.
History of Phylogenetic Methods & Stochastic Models
1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1969 Jukes and Cantor propose a simple model for amino acid evolution.
1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973 Sankoff treats alignment and phylogenies as one general problem – phylogenetic alignment.
1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.
1979 Kimura introduces transition/transversion bias in a nucleotide model in response to the publication of mitochondrial sequences.
1981 Felsenstein: Maximum Likelihood Model & program DNAML (in the program package PHYLIP). Simple nucleotide model with equilibrium bias.
1981 Parsimony tree problem is shown to be NP-complete.
1985 Felsenstein introduces bootstrapping as confidence interval on phylogenies.
1985 Hasegawa, Kishino and Yano combine transition/transversion bias with unequal equilibrium frequencies.
1986 Bandelt and Dress introduce split decomposition as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.
1991 Gillespie’s book proposes “lumpy” evolution.
1994 Goldman & Yang + Muse & Gaut introduce codon based models.
1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
2000 Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping reading frames.
2001- Major rise in the interest in phylogenetic statistical alignment
2001- Comparative genomics underlines the functional importance of molecular evolution.
References: Books & Journals
Joseph Felsenstein “Inferring Phylogenies” 660 pages Sinauer 2003 Excellent – focus on methods and conceptual issues.
Masatoshi Nei, Sudhir Kumar “Molecular Evolution and Phylogenetics” 336 pages Oxford University Press Inc, USA 2000
R.D.M. Page, E. Holmes “Molecular Evolution: A Phylogenetic Approach” 352 pages 1998 Blackwell Science (UK)
Dan Graur, Li Wen-Hsiung “Fundamentals of Molecular Evolution” Sinauer Associates Incorporated 439 pages 1999
Margulis, L and K.V. Schwartz (1998) “Five Kingdoms” 500 pages Freeman A grand illustrated tour of the tree of life
Semple, C and M. Steel “Phylogenetics” 2002 230 pages Oxford University Press Very mathematical
Journals
Journal of Molecular Evolution : http://www.nslij-genetics.org/j/jme.html
Molecular Biology and Evolution : http://mbe.oupjournals.org/
Molecular Phylogenetics and Evolution : http://www.elsevier.com/locate/issn/1055-7903
Systematic Biology : http://systbiol.org/
J. of Classification : http://www.pitt.edu/~csna/joc.html
References: www-pages
Tree of Life on the WWW
http://tolweb.org/tree/phylogeny.html
http://www.treebase.org/treebase/
Software
http://evolution.genetics.washington.edu/phylip.html
http://paup.csit.fsu.edu/
http://morphbank.ebc.uu.se/mrbayes/
http://evolve.zoo.ox.ac.uk/beast/
http://abacus.gene.ucl.ac.uk/software/paml.html
Data & Genome Centres
http://www.ncbi.nih.gov/Entrez/
http://www.sanger.ac.uk
Next
Classification of Viruses *
Overhead with considerations model> data.
Example : HMM variation in rates, gamma rates.
Example: Almost clock
Example: Episodic clock
Example: Bootstrapping. *