Molecular Evolution: Plan for week

Monday 3.11: Basics of Molecular Evolution
Lecture 1: 9-10.30 Molecular Basis and Models I (JH)
Computer: 11-12.30 PAUP: Distance/Parsimony/Compatibility (JH/IH)
Lecture 2: 13.30-15 Molecular Basis and Models II (JH)
Lecture 3: 15.30-17 The Origin of Life (JH/Miklos)

Tuesday 4.11: Tree of Life
Lecture 1: 9-10.30 Molecular Evolution of Eukaryote Pathogens (Day/Barry)
Lecture 2: 11-12.30 Molecular Evolution of Prokaryote Pathogens (Maiden)
Computer: 13.30-15 Analysis of Viral Data (Taylor)
Lecture 3: 15.30-17 Molecular Evolution of Virus (E. Holmes)

Wednesday 5.11: Stochastic Models of Evolution & Phylogenies
Computer: 9-10.30 PAUP/MrBayes: Likelihood (JH/IH)
Lecture 1: 11-12.30 The Evolution of Protein Structures (Deane)
Computer: 13.30-15 PAML: Testing Evolutionary Models (JH/Lyngsoe)
Lecture 2: 15.30-17 Molecular Evolution & Function/Structure/Selection (Meyer)

Thursday 6.11: More Phylogenies
Computer: 9-10.30 Molecular Evolution on the web (JH/Lyngsoe)
Lecture 2: 11-12.30 Beyond Phylogenies: Networks & Recombination (Song/JH)
Computer: 13.30-15 Beyond Phylogenies (Song)
Lecture 3: 15.30-17 Molecular Evolution and the Genomes (JH/Lunter)

Friday 7.11: Results, Advanced Topics and Article Discussion
Computer: 9-10.30 Statistical Alignment (JH/IM)
Lecture: 11-12.30 Article Discussion/Presentation by students
The Last Lunch

Two Discussion Articles

1. Timing the ancestor of the HIV-1 pandemic strains.

Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T.

Science. 2000 Jun 9;288(5472):1789-96.

2. Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis, M., N. Patterson, M. Endrizzi & E. Lander. Nature, May 15 2003, vol. 423, 241-

The Data & its growth

1976/79 The first viral genome – MS2/ΦX174

1995 The first prokaryotic genome – H. influenzae

1996 The first unicellular eukaryotic genome - Yeast

1997 The first multicellular eukaryotic genome – C.elegans

2001 The human genome

2002 The Mouse Genome

1.5.03: Known

>1000 viral genomes

96 prokaryotic genomes

16 archaebacterial genomes

A series of multicellular genomes is coming.

A general increase in data involving higher structures and dynamics of biological systems

The Nucleotides

http://www.accessexcellence.org/AB/GG/

Pyrimidines (C, T) and Purines (A, G)

Transitions: substitutions within a class. Transversions: substitutions between the two classes.

The Amino Acids/Codons/Genes

http://www.accessexcellence.org/AB/GG/

{nucleotides}^3 → {amino acids, stop}

Major Application Areas of Molecular Evolution

Phylogenies and Classification

Rates of Evolution & The Molecular Clock

Dating

Functional Constraint – Negative Selection.

Positive/Diversifying Selection

Structure

RNA Structure

Gene Finding

Homing in on Important Genes

Homology Searches

Disease Gene Mapping

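The Tree (?) of Life

[Figure: tree starting from the Origin of Life and LUCA, with the branches Prokaryotes, Archaea and Eukaryotes (Plants, Fungi, Animals). The placement of Viruses is marked "??". Source: Tree of Life, Science vol. 300, June 2003.]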

The Origin of Life

When did life originate?

Is the present structure a necessity or is it random accident?

How frequent is life in the Universe?

“+”:
Self replication easy
Self assembly easy
Many extrasolar planets

“-”:
Hard to make proper polymerisation
No convincing scenario.
No testability

Increased Origin Research:

In preparation of future NASA expeditions.

The rise of nano biology.

The ability to simulate larger molecular systems

Central Principles of Phylogeny Reconstruction

Parsimony

Distance

Likelihood

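[Figure: the three principles illustrated on the same four-leaf tree, with leaves s1-s4 carrying the sequences TTCAGT, TCCAGT, GCCAAT and GCCAAT. Parsimony: substitutions counted on the branches, Total Weight: 4. Distance: branch lengths fitted to the pairwise distances. Likelihood: L=3.1*10^-7, with parameter estimates on the branches.]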

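From Distance to Phylogenies

What is the relationship of a, b, c, d & e?

     a    b    c    d    e
a    -   22   10   22   22
b    6    -   22   16   14
c    7    3    -   22   22
d   13    9    8    -   16
e    6    8    9   15    -

Upper triangle: distances that fit a molecular clock. Lower triangle: distances with no molecular clock.

[Figure: the two trees built from these distances, labelled "Molecular clock" and "No molecular clock".]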
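The difference between the two sets of distances in the table can be checked mechanically. The sketch below assumes the upper triangle is the clock-like set and the lower triangle the non-clock set. It tests the three-point (ultrametric) condition and the four-point (additive, tree-like) condition. The function names are illustrative, not taken from any package.

```python
from itertools import combinations

TAXA = "abcde"

# Upper triangle of the table (assumed here to be the clock-like distances).
CLOCK = {("a", "b"): 22, ("a", "c"): 10, ("a", "d"): 22, ("a", "e"): 22,
         ("b", "c"): 22, ("b", "d"): 16, ("b", "e"): 14,
         ("c", "d"): 22, ("c", "e"): 22, ("d", "e"): 16}

# Lower triangle of the table (assumed here to be the non-clock distances).
NO_CLOCK = {("a", "b"): 6, ("a", "c"): 7, ("b", "c"): 3,
            ("a", "d"): 13, ("b", "d"): 9, ("c", "d"): 8,
            ("a", "e"): 6, ("b", "e"): 8, ("c", "e"): 9, ("d", "e"): 15}


def dist(d, x, y):
    """Look up a symmetric distance stored under either key order."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def is_ultrametric(d, taxa):
    """Three-point condition: in every triple the two largest distances are equal."""
    for x, y, z in combinations(taxa, 3):
        s = sorted([dist(d, x, y), dist(d, x, z), dist(d, y, z)])
        if s[1] != s[2]:
            return False
    return True


def is_additive(d, taxa):
    """Four-point condition: in every quartet the two largest pair sums are equal."""
    for w, x, y, z in combinations(taxa, 4):
        s = sorted([dist(d, w, x) + dist(d, y, z),
                    dist(d, w, y) + dist(d, x, z),
                    dist(d, w, z) + dist(d, x, y)])
        if s[1] != s[2]:
            return False
    return True


print(is_ultrametric(CLOCK, TAXA), is_additive(CLOCK, TAXA))
print(is_ultrametric(NO_CLOCK, TAXA), is_additive(NO_CLOCK, TAXA))
```

Both sets fit a tree exactly, but only the first also fits a clock, which is why the two triangles lead to differently shaped trees.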

Enumerating Trees: Unrooted & valency 3

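[Figure: unrooted trees with 3, 4 and 5 leaves. The single tree on leaves 1-3 has three branches where leaf 4 can be attached, giving 3 trees. Each of these has five branches where leaf 5 can be attached, giving 15 trees.]

Number of unrooted, valency-3 trees with n labelled leaves:

$T_n = \prod_{j=3}^{n-1} (2j-3) = \frac{(2n-5)!}{(n-3)!\,2^{n-3}}$

n:    4   5    6     7     8       9          10         15          20
T_n:  3   15   105   945   10395   1.4*10^5   2.0*10^6   7.9*10^12   2.2*10^20

Recursion: T_n = (2n-5) T_{n-1}. Initialisation: T_1 = T_2 = T_3 = 1.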
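The recursion is easy to evaluate directly. The following is a minimal sketch (function names are illustrative) that computes the number of unrooted valency-3 trees both from the recursion and from the factorial form (2n-5)!/((n-3)! 2^(n-3)), so the two can be compared.

```python
from math import factorial


def unrooted_trees(n):
    """Count unrooted valency-3 trees on n labelled leaves via T_n = (2n-5) T_{n-1}."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # T_1 = T_2 = T_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5  # leaf k can be attached to any of the 2k-5 branches
    return count


def unrooted_trees_closed(n):
    """Same count from the closed form, valid for n >= 3."""
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(n, unrooted_trees(n))
```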

Heuristic Searches in Tree Space

Nearest Neighbour Interchange
Subtree regrafting
Subtree rerooting and regrafting

[Figure: each move illustrated on a tree with subtrees T1-T4 and, for the regrafting moves, leaves s1-s6.]

Assignment to internal nodes: The simple way.

[Figure: a tree with the nucleotides C, A, C, C, A, C, T, G at the leaves and "?" at each internal node.]

What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?

If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.

5S RNA Alignment & Phylogeny (Hein, 1990)

10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
 9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
 3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
 4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
 8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
 6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-

[Figure: phylogeny of the 18 sequences, with numbered nodes.]

Transitions 2, transversions 5

Total weight 843.

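Cost of a history - minimizing over internal states

[Figure: a node with state G above a left and a right subtree, each with a cost vector indexed by A, C, G, T. One candidate term for the left subtree is d(C,G) + w_C(left subtree).]

$w_G(\text{subtree}) = \min_{N \in \text{Nucleotides}} \{ d(G,N) + w_N(\text{left subtree}) \} + \min_{N \in \text{Nucleotides}} \{ d(G,N) + w_N(\text{right subtree}) \}$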

Cost of a history – leaves (initialisation)

[Figure: leaves G and A, each above an empty subtree of cost 0, with cost vectors indexed by A, C, G, T.]

Initialisation at the leaves: Cost(N) = 0 if N is at the leaf, otherwise infinity.

Fitch-Hartigan-Sankoff Algorithm

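Each entry of a node's vector is the cost of the cheapest tree hanging from this node, given that there is a particular nucleotide (e.g. a “C”) at this node.

[Figure: tree with leaves A, C, T, G. Transitions cost 2, transversions cost 5. Cost vectors in the order (A,C,G,T), with * for infinity:
leaf C: (*, 0, *, *)
leaf T: (*, *, *, 0)
leaf G: (*, *, 0, *)
node above C and T: (10, 2, 10, 2)
node above that node and G: (9, 7, 7, 7)]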
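The recursion over cost vectors can be sketched in a few lines. The code below assumes the cost scheme from the slide (transition 2, transversion 5) and that the two internal vectors belong to the node joining leaves C and T and then to the node joining that subtree with leaf G. Function names are illustrative.

```python
INF = float("inf")
NUCS = "ACGT"
PURINES = {"A", "G"}


def cost(x, y, transition=2, transversion=5):
    """Symmetric substitution cost d(x, y): 0, transition cost or transversion cost."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def leaf(nucleotide):
    """Initialisation: cost 0 for the observed nucleotide, infinity otherwise."""
    return [0 if n == nucleotide else INF for n in NUCS]


def join(left, right):
    """Cost vector of a node, minimising over the states of both children."""
    vector = []
    for g in NUCS:
        best_left = min(cost(g, n) + left[i] for i, n in enumerate(NUCS))
        best_right = min(cost(g, n) + right[i] for i, n in enumerate(NUCS))
        vector.append(best_left + best_right)
    return vector


ct = join(leaf("C"), leaf("T"))
print(ct)                      # cost vector in the order (A, C, G, T)
print(join(ct, leaf("G")))
```

Each call to join costs a constant amount of work, so the whole tree is processed in time linear in the number of leaves, instead of enumerating all 4^(k-2) assignments.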

The Felsenstein Zone

Felsenstein-Cavender (1979)

[Figure: the True Tree and the Reconstructed Tree on the leaves s1-s4.]

Patterns (16, only 8 shown):

0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1

Bootstrapping

Felsenstein (1985)

[Figure: the original alignment of sequences 1-4 (each drawn as ATCTGTAGTCT) is resampled by columns with replacement. The row 10230101201 gives the number of times each of the 11 columns is drawn in one replicate. This is repeated for replicates 1, 2, ..., 500, and a tree on leaves 1-4 is built from each resampled data set.]
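One bootstrap replicate amounts to drawing alignment columns with replacement. The sketch below shows only this resampling step, not the tree building, and uses the placeholder alignment from the slide. The function name and the use of Python's random module are choices made for this illustration.

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    Returns the resampled alignment and, for each original column,
    the number of times it was drawn.
    """
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    counts = Counter(picks)
    resampled = ["".join(seq[i] for i in picks) for seq in alignment]
    weights = [counts.get(i, 0) for i in range(length)]
    return resampled, weights


alignment = ["ATCTGTAGTCT"] * 4   # placeholder sequences, as drawn on the slide
rng = random.Random(1)            # fixed seed only to make the run repeatable
replicate, weights = bootstrap_columns(alignment, rng)
print(weights, sum(weights))
```

The weights always sum to the alignment length, just as the row 10230101201 on the slide sums to 11.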

The Molecular Clock

First noted by Zuckerkandl & Pauling (1964) as an empirical fact.

How can one detect it?

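Known Ancestor, a, at Time t

[Figure: sequences s1 and s2 descending from the known ancestor a.]

Unknown Ancestors

[Figure: sequences s1, s2 and s3 with unknown ancestors, marked "??".]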

Rootings

Purpose:
1) To give time direction in the phylogeny & the most ancient point.
2) To be able to define concepts such as a monophyletic group.

Methods:
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a Molecular Clock.

Rooting the 3 kingdoms

3 billion years ago: no reliable clock - no outgroup.

Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea, prokarya and eukarya be rooted?

[Figure: an unrooted tree of E, P and A with the root position marked "Root??", followed by a joint LDH/MDH tree in which an LDH subtree on (E, P, A) is joined to an MDH subtree on (E, P, A).]

Non-contemporaneous leaves

(A. Rambaut (2000): Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16(4):395-399)

[Figure: two trees drawn against a time axis. Contemporary sample: no time structure. Serial sample (1980, 1990, 2000): with time structure.]

RNA viruses like HIV evolve fast enough that you can’t ignore the time structure.

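From Drummond

[Figure: phylogeny of HIV-1 sequences from Shankarappa et al (1999), with patient labels Pt.1, Pt.2, Pt.3, Pt.5, Pt.6, Pt.7, Pt.8 and Pt.9, Patient #6 from Wolinsky et al., and the sequences HIV1U36148, HIV1U36015, HIV1U35980, HIV1U36073, HIV1U35926 and HIVU95460. Scale bar: 10%.]

[Figure: Viral Divergence (0-10%) against Years Post Seroconversion (0-10).]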

From Drummond

HIV-1 (env) evolution in nine infected individuals

[Figure: tree with Lineage A and Lineage B, showing a ‘ladder-like’ appearance.]

Ne = [4000, 6300]
Mu = [0.8% – 1%] per site per year

• 210 sequences collected over a period of 9.5 years
• 660 nucleotides from env: C2-V5 region
• Only the first 285 (no alignment ambiguities) were used in this analysis
• Effective population size and mutation rate were co-estimated using Bayesian MCMC.

From Drummond

A tree sampled from the posterior distribution of Shankarappa Patient

Models of Amino Acid, Nucleotide & Codon Evolution

Amino Acids, Nucleotides & Codons

Continuous Time Markov Processes

Specific Models

Special Issues

Context Dependence

Rate Variation

The Purpose of Stochastic Models.

1. Molecular Evolution is Stochastic.

2. To estimate evolutionary parameters, not observable directly:

i. Real number of events in evolutionary history.

ii. Rates of different kinds of events in evolutionary history.

iii. Strength of selection against amino acid changing nucleotide substitutions.

iv. Estimate importance of different biological factors.

3. Survive a goodness of fit test.

4. Serve these purposes as simply as possible.

Central Problem: History cannot be observed, only end products.

[Figure: a history of single substitutions linking the sequences ACGTC, ACGCC, AGGCC, AGGCT, AGGTT, AGGGC and AGTGC. Only the end products are observed.]

Comment: Even if history could be observed, the underlying process couldn’t.

Principle of Inference: Likelihood

Likelihood function L() – the probability of data as a function of parameters: L(θ,D)

LogLikelihood function – l(): ln(L(θ,D))

If the data is a series of independent experiments, L() will become a product of the likelihoods of each experiment, and l() will become the sum of the logLikelihoods of each experiment.

In Likelihood analysis the parameter is not viewed as a random variable.

Consistency: $\hat{\theta}(D) \to \theta_{true}$ as data increases.

Likelihood and logLikelihood of Coin Tossing

$L(x, n; p) = \frac{n!}{(n-x)!\,x!}\, p^x (1-p)^{n-x}$

$l(x, n; p) = \ln\!\left(\frac{n!}{(n-x)!\,x!}\right) + x \ln(p) + (n-x) \ln(1-p)$

From Edwards (1991) Likelihood
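The coin-tossing likelihood can be evaluated numerically to see where it peaks. The sketch below uses made-up data (x = 3 heads in n = 10 tosses) and a simple grid search. Both are assumptions for illustration, and the maximum should land at the familiar estimate x/n.

```python
from math import comb, log


def likelihood(p, n, x):
    """Binomial likelihood of x successes in n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def log_likelihood(p, n, x):
    """Logarithm of the binomial likelihood, written as a sum of terms."""
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)


n, x = 10, 3                                   # illustrative data
grid = [k / 100 for k in range(1, 100)]        # candidate values of p
best = max(grid, key=lambda p: log_likelihood(p, n, x))
print(best, likelihood(best, n, x))
```

Working with the logLikelihood gives the same maximiser but turns products over independent experiments into sums, which is numerically safer for long sequences.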

Principle of Inference: Bayesian Analysis

In Bayesian analysis the parameters are viewed as stochastic variables that have a prior distribution before the data are observed. The data depend on the parameters, and after observing the data the parameters have a posterior distribution.

Simplifying Assumptions I

Data: s1=TCGGTA, s2=TGGTT

Biological setup: a - unknown ancestor.

Probability of Data:

$P = \sum_a P(a)\, P(a \to \text{TCGGTA})\, P(a \to \text{TGGTT})$

1) Only substitutions.

s1 TCGGTA      s1 TCGGA
s2 TGGT-T      s2 TGGTT

$P = \sum_a P(a)\, P(a \to \text{TCGGA})\, P(a \to \text{TGGTT})$

2) Processes in different positions of the molecule are independent, so the probability for the whole alignment will be the product of the probabilities of the individual patterns.

$P = \prod_{i=1}^{5} \sum_{a_i} P_i(a_i)\, P_i(a_i \to s1_i)\, P_i(a_i \to s2_i)$

[Figure: ancestral positions a1-a5 above the aligned columns of s1 (T C G G A) and s2 (T G G T T).]

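Simplifying Assumptions II

3) The evolutionary process is the same in all positions.

$P = \prod_{i=1}^{5} \sum_{a_i} P(a_i)\, P(a_i \to s1_i)\, P(a_i \to s2_i)$

4) Time reversibility: Virtually all models of sequence evolution are time reversible, i.e. $\pi_i P_{i,j}(t) = \pi_j P_{j,i}(t)$, where $\pi_i$ is the stationary probability of state i and $P_{i,j}(t)$ the probability that state i has changed into state j after time t. This implies that

$\sum_a P(a)\, P_{a,N1}(l_1)\, P_{a,N2}(l_2) = P(N1)\, P_{N1,N2}(l_1 + l_2)$

[Figure: the tree with ancestor a and branches of length l1 to N1 and l2 to N2 is equivalent to a single branch of length l1+l2 from N1 to N2.]

$P = \prod_{i=1}^{5} P(s1_i)\, P(s1_i \to s2_i)$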

Simplifying assumptions III

5) The nucleotide at any position evolves following a continuous time Markov chain.

6) The rate matrix, Q, for the continuous time Markov chain is the same at all times (and often at all positions). However, it is possible to let the rate of events, r_i, vary from site to site. The term for passed time, t, is then replaced by r_i*t.

P_{i,j}(t): continuous time Markov chain on the state space {A,C,G,T}.

[Figure: a time line with intervals t1 and t2 and the states C, C, A.]

Q - rate matrix (rows: FROM, columns: TO):

     A                    C                    G                    T
A    -(qA,C+qA,G+qA,T)    qA,C                 qA,G                 qA,T
C    qC,A                 -(qC,A+qC,G+qC,T)    qC,G                 qC,T
G    qG,A                 qG,C                 -(qG,A+qG,C+qG,T)    qG,T
T    qT,A                 qT,C                 qT,G                 -(qT,A+qT,C+qT,G)

$\lim_{\varepsilon \to 0} \frac{P_{i,j}(\varepsilon)}{\varepsilon} = q_{i,j} \ (i \ne j), \qquad \lim_{\varepsilon \to 0} \frac{P_{i,i}(\varepsilon) - 1}{\varepsilon} = q_{i,i}$

i. P(0) = I.
ii. P(ε) is close to I+εQ for small ε.
iii. P'(0) = Q.
iv. lim P(t) (t → ∞) has the equilibrium frequencies of the 4 nucleotides in each row.
v. Waiting time in state j, T_j: P(T_j > t) = e^{q_jj t}, where q_jj is negative.
vi. QE = 0, where E_ij = 1 (all i,j).
vii. PE = E.
viii. If AB = BA, then e^{A+B} = e^A e^B.

Q and P(t)

$P(t) = \exp(tQ) = \sum_{i=0}^{\infty} \frac{(tQ)^i}{i!} = I + tQ + \frac{(tQ)^2}{2!} + \frac{(tQ)^3}{3!} + \dots$

What is the probability of going from i (C?) to j (G?) in time t with rate matrix Q?
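For a small rate matrix the series for exp(tQ) can simply be summed. The sketch below does this in plain Python for a Jukes-Cantor style matrix (all off-diagonal rates equal to an assumed value alpha) and compares the result with the standard closed form ¼(1 + 3e^{-4αt}) and ¼(1 − e^{-4αt}). The truncation at 60 terms is an arbitrary choice that is ample for small t, and the helper names are illustrative.

```python
from math import exp


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=60):
    """Approximate exp(tQ) by the truncated series sum of (tQ)^i / i!."""
    n = len(q)
    tq = [[t * x for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term, starts as I
    for i in range(1, terms):
        term = [[x / i for x in row] for row in mat_mul(term, tq)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def jukes_cantor_q(alpha):
    """Rate matrix with rate alpha for every substitution, -3*alpha on the diagonal."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


alpha, t = 1.0, 0.3                            # illustrative values
p = mat_exp(jukes_cantor_q(alpha), t)
print(p[1][2], 0.25 * (1 - exp(-4 * alpha * t)))   # e.g. C -> G
print(p[0][0], 0.25 * (1 + 3 * exp(-4 * alpha * t)))
```

For larger matrices or long times a dedicated matrix exponential routine would be preferable to a raw series.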

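Jukes-Cantor 69: Total Symmetry

Rate matrix, R (rows: FROM, columns: TO):

     A      C      G      T
A   -3α     α      α      α
C    α     -3α     α      α
G    α      α     -3α     α
T    α      α      α     -3α

Transition probabilities after time t, with a = α*t:

P(equal) = ¼(1 + 3e^{-4a}) ~ 1 - 3a
P(diff.) = ¼(1 - e^{-4a}) ~ a for each of the three other nucleotides, so the total probability of a difference is ~ 3a.

Stationary Distribution: (1,1,1,1)/4.

For s1 = TCGGA and s2 = TGGTT:

$P = \prod_{i=1}^{5} P(s1_i)\, P(s1_i \to s2_i) = (\tfrac14)^5\, P(T \to T)\, P(C \to G)\, P(G \to G)\, P(G \to T)\, P(A \to T) = (\tfrac14)^5 (\tfrac14)^5 (1 + 3e^{-4a})^2 (1 - e^{-4a})^3$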
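The likelihood of the two aligned sequences TCGGA and TGGTT under Jukes-Cantor can be evaluated as a function of a = αt. The sketch below assumes the standard Jukes-Cantor probabilities ¼(1 + 3e^{-4a}) for an unchanged position and ¼(1 − e^{-4a}) for each specific change. It also uses the well-known closed-form maximiser a = -¼ ln(1 - 4p/3), where p is the fraction of differing positions. Function names are illustrative.

```python
from math import exp, log


def jc_prob(x, y, a):
    """Jukes-Cantor probability that nucleotide x has become y, with a = alpha * t."""
    e = exp(-4 * a)
    return 0.25 * (1 + 3 * e) if x == y else 0.25 * (1 - e)


def pair_likelihood(s1, s2, a):
    """Product over positions of P(s1_i) * P(s1_i -> s2_i), with P(s1_i) = 1/4."""
    lik = 1.0
    for x, y in zip(s1, s2):
        lik *= 0.25 * jc_prob(x, y, a)
    return lik


def jc_mle(s1, s2):
    """Maximum likelihood estimate of a from the fraction of differing positions."""
    p = sum(x != y for x, y in zip(s1, s2)) / len(s1)
    return -0.25 * log(1 - 4 * p / 3)          # requires p < 3/4


s1, s2 = "TCGGA", "TGGTT"
a_hat = jc_mle(s1, s2)
print(a_hat, pair_likelihood(s1, s2, a_hat))
```

With 2 identical and 3 differing positions the estimate works out to ln(5)/4. Five positions are of course far too few for a reliable estimate.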

Geometric/Exponential Distributions

The Geometric Distribution on {0,1,..}: Z ~ Geo(p): P{Z=j} = p^j(1-p), P{Z≥j} = p^j, E(Z) = p/(1-p).

The Exponential Distribution on R+: X ~ Exp(λ). Density: f(t) = λe^{-λt}, P(X>t) = e^{-λt}.

Properties: X ~ Exp(λ), Y ~ Exp(μ), independent.

i. P(X>t2 | X>t1) = P(X>t2-t1) (t2 > t1): Markov (memoryless) process.
ii. E(X) = 1/λ.
iii. P(Z≥t) = (≈) P(X>t) when p = e^{-a} and X ~ Exp(a), for small a.
iv. P(X < Y) = λ/(λ + μ).
v. min(X,Y) ~ Exp(λ + μ).

[Figure: example distribution with mean 2.5.]

Comparison of Pairs of Nucleotides/Sequences

[Figure: three ways of relating the nucleotides C and G. All evolutionary paths: $P_t(C,G) = \tfrac14 (1 - e^{-4\alpha t})$. Shortest path. Sample paths according to their probability.]

[Figure: the same for a pair of sequences, CTACGT and GTATAT. All evolutionary paths.]

Higher Cells   Chimp   Mouse   Fish   E.coli

ATTGTGTATATAT….CAG
ATTGCGTATCTAT….CCG

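From Q to P for Jukes-Cantor

Write $Q = \alpha A$ with

$A = \begin{pmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{pmatrix}$

Since $A^2 = -4A$, we have $A^i = (-4)^{i-1} A$ for $i \ge 1$, and

$P(t) = \sum_{i=0}^{\infty} \frac{(\alpha t)^i A^i}{i!} = I + A \sum_{i=1}^{\infty} \frac{(\alpha t)^i (-4)^{i-1}}{i!} = I - \tfrac14 A \sum_{i=1}^{\infty} \frac{(-4 \alpha t)^i}{i!} = I + \tfrac14 (1 - e^{-4 \alpha t})\, A$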

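Kimura 2-parameter model

Q (rows: FROM, columns: TO), with transition rate α and transversion rate β:

     A          C          G          T
A   -(α+2β)     β          α          β
C    β         -(α+2β)     β          α
G    α          β         -(α+2β)     β
T    β          α          β         -(α+2β)

a = α*t, b = β*t

P(t), from the start nucleotide:
identity: .25(1 + e^{-4b} + 2e^{-2(a+b)})
transition: .25(1 + e^{-4b} - 2e^{-2(a+b)})
each of the two transversions: .25(1 - e^{-4b})

Felsenstein 81 & Hasegawa, Kishino & Yano 85

Unequal base composition (Felsenstein, 1981):

Q_{i,j} = C*π_j, i unequal j

Transition/transversion & composition bias (Hasegawa, Kishino & Yano, 1985):

Q_{i,j} = κ*C*π_j if i->j is a transition
Q_{i,j} = C*π_j if i->j is a transversion

where κ is the ratio of the transition rate to the transversion rate.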

Dayhoff's empirical approach (1970)

Take a set of closely related proteins, count all differences and make a symmetric difference matrix, since time direction cannot be observed.

If q_ij = q_ji, then the equilibrium frequencies, π_i, are all the same.

After the transformation q_ij --> π_i q_ij/π_j, the equilibrium frequencies will be π_i.

Measuring Selection

[Figure: a history of the two-codon sequences ACGTCA, ACGCCA, ACGCCG, AGGCCG, ACTCTG, GCTCTG and GCACTG, each labelled with the amino acids it encodes. Some substitutions change the amino acid and some do not.]

Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.

The selection criteria could in principle be anything, but selection against amino acid changes is by far the most important.

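The Genetic Code

[Figure: the genetic code table with marked examples: i. a 4 site (3rd position) and a 1-1-1-1 site (3rd position), ii. TA (2nd position).]

3 classes of sites:
4
2-2
1-1-1-1

Problems:
i. Not all sites fit into those categories.
ii. A change in one site can change the status of another.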

Possible events in the genetic code (remade from Li, 1997)

Substitutions          Number   Percent
Total in all codons    549      100
Synonymous             134      25
Nonsynonymous          415      75
  Missense             392      71
  Nonsense             23       4

Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides) = 549.
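The counts in the table can be reproduced by enumerating every single-nucleotide change in every sense codon. The sketch below assumes the standard genetic code, written in the usual compact TCAG ordering. The variable and function names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order (first, second, third position); * marks stop.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def classify_substitutions():
    """Classify all single-nucleotide substitutions starting from the 61 sense codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, amino in CODE.items():
        if amino == "*":
            continue                            # stop codons are not starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_amino = CODE[mutant]
                if new_amino == "*":
                    counts["nonsense"] += 1
                elif new_amino == amino:
                    counts["synonymous"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_substitutions()
print(counts, sum(counts.values()))
```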

Ser Thr Glu Met Cys Leu Met Gly Thr
TCA ACT GAG ATG TGT TTA ATG GGG ACG
***   *  *    *  *  *         *  **
GGG ACA GGG ATA TAT CTA ATG GGT AGC
Ser Thr Gly Ile Tyr Leu Met Gly Ser

Ks: Number of silent events in common history
Ka: Number of replacement events in common history
Ns: Silent positions
Na: Replacement positions

Rates per position: (Ks/Ns)/2T
Example: Ks = 100, Ns = 300, T = 10^8 years
Silent rate: (100/300)/(2*10^8) = 1.67*10^-9 /year/pos.

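Synonymous (silent) & Non-synonymous (replacement) substitutions

[Figure: the two single-step paths between the codons ACG (Thr) and AGC (Ser): via AGG (Arg) or via ACC (Thr). Substitutions are marked with *.]

Miyata: use the most silent path for calculations.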

Kimura’s 2 parameter model & Li’s Model.

Probabilities, from the start nucleotide (a = αt, b = βt):
identity: .25(1 + e^{-4b} + 2e^{-2(a+b)})
transition: .25(1 + e^{-4b} - 2e^{-2(a+b)})
each of the two transversions: .25(1 - e^{-4b})

Selection on the 3 kinds of sites, as rates (transition, transversion):
1-1-1-1   (f*α, f*β)
2-2       (α, f*β)
4         (α, β)

Sites     Total   Conserved     Transitions   Transversions
1-1-1-1   274     246 (.8978)   12 (.0438)    16 (.0584)
2-2       77      51 (.6623)    21 (.2727)    5 (.0649)
4         78      47 (.6026)    16 (.2051)    15 (.1923)

Z(αt,βt) = .50[1+exp(-2βt) - 2exp(-t(α+β))] (transition)
Y(αt,βt) = .25[1-exp(-2βt)] (transversion)
X(αt,βt) = .25[1+exp(-2βt) + 2exp(-t(α+β))] (identity)

L(observations,a,b,f) = C(429,274,77,78) * {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} * {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} * {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}

where a = αt and b = βt.

Estimated parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663

          Transitions     Transversions
1-1-1-1   a*f = 0.0500    2*b*f = 0.0622
2-2       a = 0.3004      2*b*f = 0.0622
4         a = 0.3004      2*b = 0.3741

Expected number of replacement substitutions: 35.49, synonymous: 75.93
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
Ks = .6644, Ka = .1127

alpha-globin from rabbit and mouse.

Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile

HIV2 Analysis

Hasegawa, Kishino & Yano Substitution Model Parameters:

α*t     β*t     A       C       G       T
0.350   0.105   0.361   0.181   0.236   0.222
0.015   0.005   0.004   0.003   0.003

Selection Factors

GAG 0.385 (s.d. 0.030)
POL 0.220 (s.d. 0.017)
VIF 0.407 (s.d. 0.035)
VPR 0.494 (s.d. 0.044)
TAT 1.229 (s.d. 0.104)
REV 0.596 (s.d. 0.052)
VPU 0.902 (s.d. 0.079)
ENV 0.889 (s.d. 0.051)
NEF 0.928 (s.d. 0.073)

Estimated Distance per Site: 0.194

Examples of rates (remade from Li, 1997)

Organism         Gene            Syno/year     Non-Syno/year
RNA Virus
Influenza A      Hemagglutinin   13.1*10^-3    3.6*10^-3
Hepatitis C      E               6.9*10^-3     0.3*10^-3
HIV 1            gag             2.8*10^-3     1.7*10^-3
DNA virus
Hepatitis B      P               4.6*10^-5     1.5*10^-5
Herpes Simplex   Genome          3.5*10^-8
Nuclear Genes
Mammals          c-mos           5.2*10^-9     0.9*10^-9
Mammals          α-globin        3.9*10^-9     0.6*10^-9
Mammals          histone 3       6.2*10^-9     0.0

Codon based Models

Goldman, Yang + Muse, Gaut

i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3660) off-diagonal entries.

Factors to account for:
i. Bias in nucleotide usage.
ii. Bias in codon usage.
iii. Bias in amino acid usage.
iv. Synonymous/non-synonymous distinction.
v. Amino acid distance.
vi. Transition/transversion bias.

For codon i and codon j differing by one nucleotide:

q_{i,j} = α π_j exp(-d_{i,j}/V) if they differ by a transition
q_{i,j} = β π_j exp(-d_{i,j}/V) if they differ by a transversion

d_{i,j} is a physico-chemical difference between the amino acids encoded by codon i and codon j. V is a factor that reflects the variability of the gene involved.

Rate variation between sites: iid each site

i) The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. Γ(α,β) has density proportional to x^{β-1}*e^{-x/α}, where α is called the scale parameter and β the form parameter.

Let L(p_i,θ,t) be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process and f(r_i) the continuous distribution of rate(s). Then

$L_i = \int L(p_i, \theta, r_i t)\, f(r_i)\, dr_i$

Rate variation between sites: Hidden Markov Chains

1) Different positions in the molecule evolve at different rates, for instance fast, rF, or slow, rS.

2) Neighbouring positions tend to evolve at the same rate.

[Figure: observed alignment columns O1 O2 ... O10, each with a hidden rate state F (fast) or S (slow).]

What is the probability of the data?

What is the most probable “hidden” configuration?

What is the probability of a specific “hidden” state?

Statistical Test of Models (Goldman, 1990)

Data: 3 sequences of length L

ACGTTGCAA ...
AGCTTTTGA ...
TCGTTTCGA ...

A. Likelihood (free multinomial model, 63 free parameters):

L1 = p_AAA^{#AAA} * p_AAC^{#AAC} * ... * p_TTT^{#TTT}, where p_{N1N2N3} = #(N1N2N3)/L

B. Jukes-Cantor and unknown branch lengths:

L2 = p_AAA(l1',l2',l3')^{#AAA} * ... * p_TTT(l1',l2',l3')^{#TTT}

[Figure: three-leaf tree with branch lengths l1, l2, l3 leading to the three sequences.]

Test statistics: I. Σ (expected-observed)^2/expected, or II. -2 ln Q = 2(ln L1 - ln L2). JC69 Jukes-Cantor: 3 parameters => χ² with 60 degrees of freedom.

Problems:
i. Too few observations per pattern.
ii. Many competing hypotheses.

Parametric bootstrap:
i. Maximum likelihood to estimate the parameters.
ii. Simulate with estimated model.
iii. Make simulated distribution of -2 ln Q.
iv. Where is real -2 ln Q in this distribution?

Episodic Evolution

Empirical observations:
i. Variance/Mean > 1 (clumpy process) for non-synonymous events.

Possible explanations:
i. Selective avalanches.
ii. Gene conversions from pseudogenes.

Poisson Process:
i. The waiting times Ti are independent and exponentially distributed with the same parameter (λ).
ii. Variance and mean of the number of events are both λ (per unit time), so Variance/Mean = 1.

Assignment to internal nodes: The simple way.

[Figure: a tree with the nucleotides C, A, C, C, A, C, T, G at the leaves and "?" at each internal node.]

If branch lengths and the evolutionary process are known, what is the probability of the nucleotides at the leaves?

Cctacggccatacca a ccctgaaagcaccccatcccgt
Cttacgaccatatca c cgttgaatgcacgccatcccgt
Cctacggccatagca c ccctgaaagcaccccatcccgt
Cccacggccatagga c ctctgaaagcactgcatcccgt
Tccacggccatagga a ctctgaaagcaccgcatcccgt
Ttccacggccatagg c actgtgaaagcaccgcatcccg
Tggtgcggtcatacc g agcgctaatgcaccggatccca
Ggtgcggtcatacca t gcgttaatgcaccggatcccat

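Probability of leaf observations - summing over internal states

[Figure: a node with state G above a left and a right subtree, each with a probability vector indexed by A, C, G, T. One term of the left sum is P(G→C) * P_C(left subtree).]

$P_G(\text{subtree}) = \Big\{ \sum_{N \in \text{Nucleotides}} P(G \to N)\, P_N(\text{left subtree}) \Big\} \times \Big\{ \sum_{N \in \text{Nucleotides}} P(G \to N)\, P_N(\text{right subtree}) \Big\}$

Initialisation: $P_G(\text{leaf}) = \delta_{G,\text{leaf}}$, i.e. 1 if the nucleotide at the leaf is G and 0 otherwise.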
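The summation over internal states has the same shape as the parsimony recursion, with sums and products in place of minima and sums. The sketch below assumes Jukes-Cantor transition probabilities and a uniform root distribution (both assumptions made for illustration; any substitution model could be plugged in). The function names are illustrative.

```python
from math import exp

NUCS = "ACGT"


def jc_prob(x, y, a):
    """Jukes-Cantor probability of x -> y along a branch of length a = alpha * t."""
    e = exp(-4 * a)
    return 0.25 * (1 + 3 * e) if x == y else 0.25 * (1 - e)


def leaf(nucleotide):
    """Initialisation: probability 1 for the observed nucleotide, 0 otherwise."""
    return [1.0 if n == nucleotide else 0.0 for n in NUCS]


def join(left, a_left, right, a_right):
    """Probability of the leaves below a node, for each possible state at the node."""
    vector = []
    for g in NUCS:
        sum_left = sum(jc_prob(g, n, a_left) * left[i] for i, n in enumerate(NUCS))
        sum_right = sum(jc_prob(g, n, a_right) * right[i] for i, n in enumerate(NUCS))
        vector.append(sum_left * sum_right)
    return vector


def root_likelihood(vector):
    """Average the root vector over the stationary distribution (1,1,1,1)/4."""
    return sum(0.25 * v for v in vector)


# Two leaves, C and G, at distances 0.1 and 0.2 from their ancestor.
lik = root_likelihood(join(leaf("C"), 0.1, leaf("G"), 0.2))
print(lik, 0.25 * jc_prob("C", "G", 0.3))
```

The two printed numbers agree, which illustrates the time-reversibility argument: summing over the unknown ancestor gives the same result as a single branch of length l1+l2 between the two leaves.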

Output from Likelihood Method

[Figure: two trees on s1-s5 with estimated values and their standard errors (-/+). Molecular Clock: vertical axis from Now back through the Duplication Times, n-1 heights estimated. No Molecular Clock: branches drawn according to Amount of Evolution, 2n-3 lengths estimated.]

Estimates shown in the figure: 23 -/+5.2, 12 -/+2.2, 11.1 -/+1.8, 5.9 -/+1.2, 6.9 -/+1.3, 11.4 -/+1.9, 3.9 -/+0.8, 10.9 -/+2.1, 9.9 -/+1.2, 11.6 -/+2.1, 4.1 -/+0.7

Likelihood: 6.2*10^-12, parameter estimates 0.34 and 0.16
Likelihood: 7.9*10^-14, parameter estimates 0.31 and 0.18

2[ln(6.2*10^-12) - ln(7.9*10^-14)] is χ²-distributed with (n-2) degrees of freedom.

The generation/year-time clock (Langley-Fitch, 1973)

[Figure: trees on s1, s2, s3 with branch lengths l1, l2, l3. Example: Elephant and Mouse, separated by 100 Myr, with the labels "variable" and "constant" for the Absolute Time Clock and the Generation Time Clock.]

Absolute Time Clock: l1 = l2 ({l1 = l2 < l3}), after some rooting technique.

Generation Time Clock: any tree.

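Can the generation time clock be tested?

Assume a data set: 3 species, 2 sequences each.

[Figure: the tree on s1, s2, s3 for the first sequence has branch lengths l1, l2, l3. Under the generation time clock the tree for the second sequence has branch lengths c*l1, c*l2, c*l3. Under the absolute time clock, l1 = l2.]

Degrees of freedom (dg):
k=3: no clock dg: 3, clock dg: 2
k species: no clock dg: 2k-3, clock dg: k-1
k=3, t=2 (generation time clock): dg = 4
k species, t sequences: dg = (2k-3)+(t-1)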

α-globin, β-globin, cytochrome c, fibrinopeptide A & generation time clock

Langley-Fitch, 1973

Fibrinopeptide A phylogeny:

[Figure: phylogeny with the leaves Human, Gorilla, Donkey, Gibbon, Monkey, Rabbit, Cow, Rat, Pig, Horse, Goat, Llama, Sheep, Dog.]

Relative rates
α-globin 0.342
β-globin 0.452
cytochrome c 0.069
fibrinopeptide A 0.137

Almost Clocks

(M.J. Sanderson (1997) “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy” Mol.Biol.Evol. 14(12):1218-31; J.L. Thorne et al. (1998) “Estimating the Rate of Evolution of the Rate of Evolution.” Mol.Biol.Evol. 15(12):1647-57; J.P. Huelsenbeck et al. (2000) “A compound Poisson Process for Relaxing the Molecular Clock” Genetics 154:1879-92.)

I Smoothing a non-clock tree onto a clock tree (Sanderson).

II Rate of Evolution of the rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.

III Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a random variable (gamma distributed).

Comment: Makes perfect sense. Testing no clock versus a perfect clock is choosing between two unrealistic extremes.

Summary

Phylogeny

Principles of Phylogenies

Rates of Molecular Evolution and the Molecular Clock

Rooting Phylogenies

The Generation Time Clock

Almost Clocks

Non-Contemporaneous Leaves (Viruses & Ancient DNA)

The Purpose of Stochastic Models

The assumptions of Stochastic Models

The Central Models

Measuring Selection

Variation among sites

Testing Models.

History of Phylogenetic Methods & Stochastic Models

1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.

1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.

1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.

1967 First large molecular phylogenies by Fitch and Margoliash.

1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.

1969 Jukes and Cantor propose a simple model for nucleotide evolution.

1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.

1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.

1973 Sankoff treats alignment and phylogenies as one general problem – phylogenetic alignment.

1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.

1979 Kimura introduces transition/transversion bias in a nucleotide model in response to the publication of mitochondrial sequences.

1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the program package PHYLIP). Simple nucleotide model with equilibrium bias.

1981 Parsimony tree problem is shown to be NP-Complete.

1985 Felsenstein introduces bootstrapping as confidence interval on phylogenies.

1985 Hasegawa, Kishino and Yano combine transition/transversion bias with unequal equilibrium frequencies.

1986 Bandelt and Dress introduce split decomposition as a generalization of trees.

1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombination in phylogenies.

1991 Gillespie’s book proposes “lumpy” evolution.

1994 Goldman & Yang + Muse & Gaut introduce codon based models.

1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.

2000 Rambaut (and others) makes methods that can find trees with non-contemporaneous leaves.

2000 Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping reading frames.

2001- Major rise in the interest in phylogenetic statistical alignment.

2001- Comparative genomics underlines the functional importance of molecular evolution.

References: Books & Journals

Joseph Felsenstein “Inferring Phylogenies” 660 pages, Sinauer 2003. Excellent – focus on methods and conceptual issues.

Masatoshi Nei, Sudhir Kumar “Molecular Evolution and Phylogenetics” 336 pages, Oxford University Press Inc, USA 2000.

R.D.M. Page, E. Holmes “Molecular Evolution: A Phylogenetic Approach” 352 pages, Blackwell Science (UK) 1998.

Dan Graur, Li Wen-Hsiung “Fundamentals of Molecular Evolution” 439 pages, Sinauer Associates Incorporated 1999.

Margulis, L and K.V. Schwartz “Five Kingdoms” 500 pages, Freeman 1998. A grand illustrated tour of the tree of life.

Semple, C and M. Steel “Phylogenetics” 230 pages, Oxford University Press 2002. Very mathematical.

Journals

Journal of Molecular Evolution : http://www.nslij-genetics.org/j/jme.html

Molecular Biology and Evolution : http://mbe.oupjournals.org/

Molecular Phylogenetics and Evolution : http://www.elsevier.com/locate/issn/1055-7903

Systematic Biology : http://systbiol.org/

J. of Classification : http://www.pitt.edu/~csna/joc.html

http://www.amazon.co.uk/exec/obidos/ASIN/0195135857/qid=1066051397/sr=1-2/ref=sr_1_0_2/202-0414044-2100601



References: www-pages

Tree of Life on the WWW

http://tolweb.org/tree/phylogeny.html

http://www.treebase.org/treebase/

Software

http://evolution.genetics.washington.edu/phylip.html

http://paup.csit.fsu.edu/

http://morphbank.ebc.uu.se/mrbayes/

http://evolve.zoo.ox.ac.uk/beast/

http://abacus.gene.ucl.ac.uk/software/paml.html

Data & Genome Centres

http://www.ncbi.nih.gov/Entrez/

http://www.sanger.ac.uk

Next

Classification of Viruses *

Overhead with considerations model> data.

Example : HMM variation in rates, gamma rates.

Example: Almost clock

Example: Episodic clock

Example: Bootstrapping. *
