Lecture 3: Markov models of sequence evolution
![Page 1: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/1.jpg)
Lecture 3: Markov models of sequence evolution
Alexei Drummond
![Page 2: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/2.jpg)
CS369 2007
Friday quiz: How many bacterial cells are there in an average adult human?
A) $10^{12}$ (1 trillion)
B) $10^{13}$ (10 trillion)
C) $10^{14}$ (100 trillion)
D) $10^{15}$ (1000 trillion)
Hint: There are about $10^{14}$ human cells in the average adult human.
![Page 3: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/3.jpg)
Modeling genetic change
• Given two or more aligned nucleotide or amino acid sequences, usually the first goal is to calculate some measure of sequence similarity (or conversely distance)
• The simplest way to estimate genetic distances is the p-distance (number of differences between two sequences divided by the sequence length)
– The p-distance is the Hamming distance normalized by the length of the sequence. Therefore it is the proportion of positions at which the sequences differ.
– The p-distance can also be considered the probability that the two sequences differ at a random position (site).
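The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming two already-aligned sequences of equal length with no gaps; the function names `hamming_distance` and `p_distance` are chosen here for illustration and do not come from the lecture.

```python
def hamming_distance(seq1, seq2):
    """Count the positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


def p_distance(seq1, seq2):
    """Hamming distance normalized by the sequence length."""
    if not seq1:
        raise ValueError("sequences must not be empty")
    return hamming_distance(seq1, seq2) / len(seq1)


# The two example sequences used in the lecture differ at 4 of 10 sites.
print(p_distance("AATCTGTGTA", "ATCCTGGGTT"))  # 0.4
```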
![Page 4: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/4.jpg)
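Modeling genetic change
Ancestor AACCTGTGCA
Seq1 AATCTGTGTA (2 differences from the ancestor)
Seq2 ATCCTGGGTT (4 differences from the ancestor)
Seq1 and Seq2 differ from each other at 4 sites (positions 2, 3, 7 and 10).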
![Page 5: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/5.jpg)
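P-distance
The p-distance is the proportion of nucleotides that differ between two sequences.
Seq1 AATCTGTGTA
Seq2 ATCCTGGGTT
The sequences differ at positions 2, 3, 7 and 10.
p-distance = 4/10 = 0.4
The p-distance usually underestimates the true distance: the genetic (or evolutionary) distance d.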
![Page 6: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/6.jpg)
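Multiple, parallel, and back-substitutions
[Figure: the ancestral sequence AACCTGTGCA evolves along two lineages into AACCAGTGAA and ACCCGGTGAA. Intermediate substitutions occur at some sites, so not every substitution is visible when the final sequences are compared.]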
![Page 7: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/7.jpg)
Relationship between p (observed) distance and d (genetic) distance
[Figure: p-distance (p), from 0 to 1, plotted against genetic distance (d), from 0 to 3.]
![Page 8: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/8.jpg)
Transition probabilities
• Definition: Let $P_{xy}(t)$ be the probability that a nucleotide x evolves to a nucleotide y in time t. If x = y then this evolutionary pathway could involve 0, 2, 3 or more substitutions. If x ≠ y then the pathway could involve 1, 2, 3 or more substitutions.
• P(t) is then a square transition probability matrix of size 4 by 4.
![Page 9: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/9.jpg)
Modeling nucleotide substitutions as a time-homogeneous, time-continuous, stationary Markov process (1)
• Markov property: at any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied that site prior to i.
[Figure: a site that was previously occupied by base i (i = A, C, G, T) and is now G either remains G or changes to A over time t. The probabilities $P_{GG}(t)$ and $P_{GA}(t)$ are independent of i.]
![Page 10: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/10.jpg)
Modeling nt substitutions as a time-homogeneous, time-continuous, stationary Markov process (2)
• Homogeneity
– Substitution rates do not change over time
• Stationarity
– The relative frequencies of A, C, G, and T ($\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$) are at equilibrium, i.e. remain constant.
![Page 11: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/11.jpg)
Models of DNA Substitution (ordered from simplest to most complex)
1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)
2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2 parameter)
3. Unequal base frequencies and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)
4. Unequal base frequencies and all substitution types occur at different rates (General Reversible Model)
![Page 12: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/12.jpg)
The Q-matrix (instantaneous rate matrix)
$$
Q = \frac{1}{\lambda}
\begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$
Rows and columns are ordered A, C, G, T.
• $\pi_i$: frequency of nt i
• a, b, c, etc.: relative rate parameters
• $1/\lambda$: scale factor so that total output per unit time = 1.0
• Non-diagonal entries: rate flow from nucleotide i to nucleotide j
• Diagonal entries: the negative of the total rate flow that leaves nucleotide i (the rate at which nt i disappears per site per sequence)
![Page 13: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/13.jpg)
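The Q-matrix
$$
Q = \frac{\mu}{\lambda}
\begin{pmatrix}
* & a\pi_C & b\pi_G & c\pi_T \\
g\pi_A & * & d\pi_G & e\pi_T \\
h\pi_A & j\pi_C & * & f\pi_T \\
i\pi_A & k\pi_C & l\pi_G & *
\end{pmatrix}
$$
Rows and columns are ordered A, C, G, T. Each diagonal entry (*) is set so that its row sums to zero:
$$
Q_{ii} = -\sum_{j \neq i} Q_{ij}
$$
$$
\text{total rate} = -\sum_{i} \pi_i Q_{ii}
$$
The minus sign makes the total rate positive, since every $Q_{ii}$ is negative.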
![Page 14: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/14.jpg)
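General Time Reversible (GTR) Models
Substitutions from nucleotide i to nucleotide j have the same relative rate parameter as substitutions from nucleotide j to nucleotide i, so the twelve rate parameters of the general Q-matrix reduce to six (a to f).
In general: f = 1 and a, b, c, d, e are estimated from the data via maximum likelihood
$$
Q = \frac{\mu}{\lambda}
\begin{pmatrix}
* & a\pi_C & b\pi_G & c\pi_T \\
a\pi_A & * & d\pi_G & e\pi_T \\
b\pi_A & d\pi_C & * & f\pi_T \\
c\pi_A & e\pi_C & f\pi_G & *
\end{pmatrix}
\qquad
\Pi =
\begin{pmatrix}
\pi_A \\ \pi_C \\ \pi_G \\ \pi_T
\end{pmatrix}
$$
Rows and columns are ordered A, C, G, T.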
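The construction of a GTR rate matrix can be sketched as follows. This is an illustrative sketch rather than the lecture's own code: the function name `gtr_rate_matrix`, its argument layout, and the example rate values are assumptions made here. It fills the off-diagonal entries with (relative rate) × (target frequency), sets each diagonal entry so that the row sums to zero, and rescales so that the total rate equals `mu`.

```python
NUCLEOTIDES = "ACGT"  # row and column order used throughout


def gtr_rate_matrix(freqs, rates, mu=1.0):
    """Build a 4x4 GTR rate matrix as a list of rows.

    freqs: equilibrium frequencies (pi_A, pi_C, pi_G, pi_T)
    rates: relative rate parameters (a, b, c, d, e, f)
    mu:    total substitution rate after scaling
    """
    a, b, c, d, e, f = rates
    # Symmetric relative-rate table: the i->j and j->i parameters are equal.
    exch = [
        [0, a, b, c],
        [a, 0, d, e],
        [b, d, 0, f],
        [c, e, f, 0],
    ]
    q = [[exch[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    # lam is the unscaled total rate, -sum_i pi_i * Q_ii.
    lam = -sum(freqs[i] * q[i][i] for i in range(4))
    return [[mu * x / lam for x in row] for row in q]


# Equal frequencies and equal rates give the Jukes-Cantor matrix.
jc = gtr_rate_matrix((0.25, 0.25, 0.25, 0.25), (1, 1, 1, 1, 1, 1))
for name, row in zip(NUCLEOTIDES, jc):
    print(name, [round(x, 4) for x in row])
```

With unequal frequencies the matrix is no longer symmetric, but it still satisfies $\pi_i Q_{ij} = \pi_j Q_{ji}$, which is the time-reversibility condition.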
![Page 15: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/15.jpg)
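Time-reversibility
[Figure: two sequences x and y diverging from a common ancestor z are shown as equivalent to x evolving directly into y, with branch lengths labelled in terms of t.]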
![Page 16: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/16.jpg)
Q-matrix for the Jukes and Cantor (JC) model
$$
\pi_A = \pi_C = \pi_G = \pi_T = 1/4
$$
$$
a = b = c = d = e = f = 1
$$
$$
\lambda = 3/4
$$
$$
Q = \frac{4}{3}\mu
\begin{pmatrix}
* & 1/4 & 1/4 & 1/4 \\
1/4 & * & 1/4 & 1/4 \\
1/4 & 1/4 & * & 1/4 \\
1/4 & 1/4 & 1/4 & *
\end{pmatrix}
\qquad
\Pi =
\begin{pmatrix}
0.25 \\ 0.25 \\ 0.25 \\ 0.25
\end{pmatrix}
$$
![Page 17: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/17.jpg)
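Q-matrix for the Jukes and Cantor (JC) model
Equivalently, with the factor 1/4 absorbed into the scale:
$$
Q = \frac{\mu}{3}
\begin{pmatrix}
* & 1 & 1 & 1 \\
1 & * & 1 & 1 \\
1 & 1 & * & 1 \\
1 & 1 & 1 & *
\end{pmatrix}
\qquad
\Pi =
\begin{pmatrix}
0.25 \\ 0.25 \\ 0.25 \\ 0.25
\end{pmatrix}
$$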
![Page 18: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/18.jpg)
Evolutionary meaning of the Q-matrix for the JC model
$$
Q =
\begin{pmatrix}
-\mu & \mu/3 & \mu/3 & \mu/3 \\
\mu/3 & -\mu & \mu/3 & \mu/3 \\
\mu/3 & \mu/3 & -\mu & \mu/3 \\
\mu/3 & \mu/3 & \mu/3 & -\mu
\end{pmatrix}
$$
$$
-\sum_{i} \pi_i Q_{ii} = \mu
$$
• $\mu$ = rate per unit time of nucleotide i (i = A, C, G, T) replacement during evolution: nt substitutions per sequence per site per unit time
• $\mu t$ = nt substitutions per site between two sequences that are separated by time t = d
![Page 19: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/19.jpg)
Estimating transition probabilities
• As soon as the Q matrix, and thus the evolutionary model, is specified, it is possible to calculate the probabilities of change from any base to any other during the evolutionary time t, P(t), by computing the matrix exponential
$$
P(t) = \exp(Qt)
$$
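The matrix exponential can be sketched in pure Python to show what P(t) = exp(Qt) involves. This is an illustrative sketch, not the lecture's method: it uses a truncated Taylor series with scaling and squaring, and the helper names (`mat_mul`, `mat_exp`, `transition_probabilities`) and the default term counts are assumptions made here. In practice a library routine such as `scipy.linalg.expm` would normally be used instead.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=20, squarings=10):
    """Approximate exp(m) by a Taylor series with scaling and squaring."""
    n = len(m)
    scale = 2 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(m) = exp(m / 2^s) raised to the power 2^s.
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(q, t):
    """Return P(t) = exp(Q t) for a rate matrix Q."""
    return mat_exp([[x * t for x in row] for row in q])


# Jukes-Cantor rate matrix with mu = 1: -1 on the diagonal, 1/3 elsewhere.
mu = 1.0
jc_q = [[-mu if i == j else mu / 3.0 for j in range(4)] for i in range(4)]
p_half = transition_probabilities(jc_q, 0.5)

# Each row of P(t) is a probability distribution, so it should sum to 1.
print([round(sum(row), 6) for row in p_half])
# Compare a diagonal entry with the JC closed form 1/4 + 3/4 exp(-4/3 mu t).
print(p_half[0][0], 0.25 + 0.75 * math.exp(-4.0 / 3.0 * mu * 0.5))
```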
![Page 20: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/20.jpg)
Jukes and Cantor (JC) model solution
By computing $P(t) = \exp(Qt)$ with Q according to the JC model:
$$
P_{i=j}(t) = \frac{1}{4} + \frac{3}{4}\exp\left(-\frac{4}{3}\mu t\right)
$$
$$
P_{i \neq j}(t) = \frac{3}{4} - \frac{3}{4}\exp\left(-\frac{4}{3}\mu t\right)
$$
• $P_{i=j}(t)$ = probability of nt i ending up as the same character after time t
• $P_{i \neq j}(t)$ = probability of nt i ending up as a different character after time t (summed over the three other nucleotides)
![Page 21: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/21.jpg)
Estimating the genetic distances (1)
• The total probability of two sequences sharing the same nucleotide at a position is $P_{i=j}(t)$, and therefore the probability of the two sequences being different is $p = 1 - P_{i=j}(t) = P_{i \neq j}(t)$:
$$
p = \frac{3}{4}\left(1 - \exp\left(-\frac{4}{3}\mu t\right)\right)
$$
• An estimator of p is the observed proportion of different sites between two sequences (p-distance).
![Page 22: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/22.jpg)
Estimating the genetic distances (2)
Solving for $\mu t$ we get $\mu t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$. Substituting $\mu t$ with d we finally obtain the Jukes-Cantor correction formula for the genetic distance d between two sequences:
$$
d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)
$$
It can also be demonstrated that the variance V(d) will be given by
$$
V(d) = \frac{9p(1-p)}{(3-4p)^2\, n}
$$
where n is the number of sites compared.
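The correction formula and its variance can be sketched directly. This is a minimal illustration with function names chosen here (`jc_distance`, `jc_variance`). The variance is written in its usual large-sample form, divided by the number of sites n; treat that factor as an assumption if the surrounding formula is written without it.

```python
import math


def jc_distance(p):
    """Jukes-Cantor corrected distance d from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the JC correction is only defined for 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_variance(p, n):
    """Approximate variance of d for an alignment of n sites (assumed form)."""
    return 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n)


d = jc_distance(0.4)
print(round(d, 4))  # 0.5716, the value used in the lecture's worked example
print(jc_variance(0.4, 10))
```

The correction is undefined once p reaches 3/4, the level of difference expected between two unrelated sequences under this model.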
![Page 23: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/23.jpg)
Calculating JC distance
Seq1 AATCTGTGTA
Seq2 ATCCTGGGTT
The sequences differ at 4 of 10 sites (positions 2, 3, 7 and 10).
p-distance = 0.4
d (JC model) = -3/4 ln [1 - 4/3 (0.4)] = 0.5716
![Page 24: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/24.jpg)
Calculating JC distance
Ancestor AACCTGTGCA
Seq1 AATCTGTGTA (2 differences from the ancestor)
Seq2 ATCCTGGGTT (4 differences from the ancestor)
p-distance (Seq1 vs Seq2) = 0.4
d (JC model) = -3/4 ln [1 - 4/3 (0.4)] = 0.5716
![Page 25: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/25.jpg)
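Q-matrix for the F81 model
$$
\pi_A \neq \pi_C \neq \pi_G \neq \pi_T
$$
$$
a = b = c = d = e = f = 1
$$
$$
\lambda = 1 - (\pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2)
$$
$$
Q = \frac{\mu}{\lambda}
\begin{pmatrix}
* & \pi_C & \pi_G & \pi_T \\
\pi_A & * & \pi_G & \pi_T \\
\pi_A & \pi_C & * & \pi_T \\
\pi_A & \pi_C & \pi_G & *
\end{pmatrix}
$$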
![Page 26: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/26.jpg)
F81 model correction formula
$$
d = -\lambda \ln(1 - p/\lambda)
$$
• p = observed distance
• When $\pi_A = \pi_T = \pi_C = \pi_G = 0.25$, $\lambda = 3/4$, and the formula becomes equivalent to the one obtained for the JC model
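The F81 correction can be sketched in the same way as the JC one. The function name `f81_distance` is an assumption made here; the scale factor is computed as one minus the sum of squared base frequencies, as defined for the F81 Q-matrix. The unequal frequencies in the second call are the average envelope composition quoted later in the lecture, used purely as an illustration.

```python
import math


def f81_distance(p, freqs):
    """F81 corrected distance from an observed p-distance.

    freqs: base frequencies (pi_A, pi_C, pi_G, pi_T), assumed to sum to 1.
    """
    lam = 1.0 - sum(f * f for f in freqs)
    if not 0.0 <= p < lam:
        raise ValueError("the F81 correction requires 0 <= p < lambda")
    return -lam * math.log(1.0 - p / lam)


# With equal base frequencies lambda = 3/4 and the result matches JC.
print(round(f81_distance(0.4, (0.25, 0.25, 0.25, 0.25)), 4))  # 0.5716

# Unequal frequencies give a smaller lambda and a larger correction.
print(f81_distance(0.4, (0.345, 0.174, 0.234, 0.247)))
```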
![Page 27: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/27.jpg)
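Q-matrix for the Kimura-2p (K80) model
$$
\pi_A = \pi_C = \pi_G = \pi_T = 1/4
$$
$$
a = c = d = f = 1 \quad \text{(transversions)}
$$
$$
b = e = \kappa \quad \text{(transitions)}
$$
$$
\lambda = \kappa + 2
$$
$$
Q = \frac{\mu}{\kappa + 2}
\begin{pmatrix}
* & 1 & \kappa & 1 \\
1 & * & 1 & \kappa \\
\kappa & 1 & * & 1 \\
1 & \kappa & 1 & *
\end{pmatrix}
$$
Rows and columns are ordered A, C, G, T.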
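The transition/transversion distinction that K80 builds on can be illustrated by classifying the observed differences between two aligned sequences. This is a sketch under stated assumptions: the function name `count_ti_tv` is chosen here, the sequences are assumed to contain only A, C, G and T, and a raw count like this is not the model-based estimate of κ.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ti_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences.

    A transition swaps two purines (A<->G) or two pyrimidines (C<->T);
    every other difference is a transversion.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# The lecture's example pair: T/C at position 3 is the only transition.
print(count_ti_tv("AATCTGTGTA", "ATCCTGGGTT"))  # (1, 3)
```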
![Page 28: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/28.jpg)
Nucleotide frequencies in HIV/SIV are at equilibrium: pol gene
Average sequence composition (HIV-O/HIV-M full pol): A = 39.0%, C = 16.6%, G = 22.8%, T = 21.6%
Average Ti/Tv = 2.6

| Sequence | 5% chi-square test | p-value |
|---|---|---|
| SE8538a | passed | 97.80% |
| 97TZ02a | passed | 94.59% |
| BOLO122b | passed | 99.94% |
| CAM1b | passed | 96.73% |
| NY5CGb | passed | 97.64% |
| 98IN022c | passed | 99.44% |
| 94IN112c | passed | 98.68% |
| 93IN101c | passed | 99.61% |
| VI850f | passed | 97.09% |
| X138g | passed | 86.61% |
| SE6165g | passed | 95.73% |
| VI991h | passed | 98.23% |
| SE9173j | passed | 96.17% |
| SE92809j | passed | 96.50% |
| MP535k | passed | 69.92% |
| 92UG001d | passed | 86.20% |
| HIVO | passed | 77.48% |
![Page 29: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/29.jpg)
Nucleotide frequencies in HIV/SIV are at equilibrium: env gene
Average sequence composition (SIV/HIV full envelope): A = 34.5%, C = 17.4%, G = 23.4%, T = 24.7%
Average Ti/Tv = 1.5

| Sequence | 5% chi-square test | p-value |
|---|---|---|
| MVP5180 | passed | 14.60% |
| SIVcpzUS | passed | 48.09% |
| SIVcpzGAB | passed | 51.77% |
| 92UG037a | passed | 84.58% |
| 92UG975g | passed | 99.73% |
| 92RU131g | passed | 97.45% |
| 93IN905c | passed | 77.15% |
| 92BRO25c | passed | 59.51% |
| 92UG021d | passed | 94.89% |
| 92UG024d | passed | 92.60% |
| BSSG3b | passed | 97.86% |
| SFMHS20b | passed | 92.40% |
| 91TH652b | passed | 92.86% |
| MBC18R01b | passed | 99.59% |
![Page 30: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/30.jpg)
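Q-matrix for the F84 model (very similar to the HKY85 model)
$$
\pi_A \neq \pi_C \neq \pi_G \neq \pi_T
$$
$$
a = c = d = f = 1 \quad \text{(transversions)}
$$
$$
b = 1 + \kappa/(\pi_A + \pi_G), \qquad e = 1 + \kappa/(\pi_C + \pi_T) \quad \text{(transitions)}
$$
$$
Q = \frac{\mu}{\lambda}
\begin{pmatrix}
* & \pi_C & [1 + \kappa/(\pi_A + \pi_G)]\pi_G & \pi_T \\
\pi_A & * & \pi_G & [1 + \kappa/(\pi_C + \pi_T)]\pi_T \\
[1 + \kappa/(\pi_A + \pi_G)]\pi_A & \pi_C & * & \pi_T \\
\pi_A & [1 + \kappa/(\pi_C + \pi_T)]\pi_C & \pi_G & *
\end{pmatrix}
$$
Rows and columns are ordered A, C, G, T.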
![Page 31: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/31.jpg)
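Nucleotide substitution patterns in HIV/SIV
Average frequency of changes between states, SIV/HIV-1 envelope (rows: from; columns: to)

| From \ To | A | C | G | T |
|---|---|---|---|---|
| A | - | 346.2 | 697.4 | 290.3 |
| C | 241.9 | - | 123 | 320.8 |
| G | 515.4 | 126.6 | - | 117.1 |
| T | 215.6 | 371 | 144.6 | - |

Transitions are the A↔G and C↔T changes; all other changes are transversions.
Average Ti/Tv = 1.5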
![Page 32: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/32.jpg)
More complex models…
• More complex models, like Tamura-Nei (TN93) or the general time reversible (GTR) model, usually require numerical algorithms in order to calculate d.
• Several software packages exist that can estimate genetic distances between nucleotide sequences according to different evolutionary models
– MEGA3
– PAUP*
– PHYLIP
– TREE-PUZZLE
– DAMBE
– Geneious 2.5.4
![Page 33: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/33.jpg)
Estimating HIV genetic distances: env gene
HIV-1B vs HIV-O/SIVcpz/HIV-1C, full envelope

| Compared with HIV-1B | p-distance | JC69 | K80 | Tajima-Nei |
|---|---|---|---|---|
| HIV-O | 0.391 (.008) | 0.552 (.018) | 0.560 (.019) | 0.572 (.019) |
| SIVcpz | 0.266 (.009) | 0.337 (.009) | 0.340 (.010) | 0.427 (.013) |
| HIV-1C | 0.163 (.008) | 0.184 (.008) | 0.187 (.008) | 0.189 (.008) |
![Page 34: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/34.jpg)
Estimating HIV genetic distances: pol gene
HIV-1B vs HIV-O/HIV-1C, full pol

| Compared with HIV-1B | p-distance | JC69 | K80 | Tajima-Nei |
|---|---|---|---|---|
| HIV-O | 0.257 (.007) | 0.315 (.010) | 0.318 (.011) | 0.324 (.011) |
| HIV-1C | 0.103 (.005) | 0.111 (.005) | 0.113 (.006) | 0.114 (.006) |
![Page 35: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/35.jpg)
When divergence is low p and d are linearly related
[Figure: p-distance (p), from 0 to 1, plotted against genetic distance (d), from 0 to 3.]
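The shape of this relationship can be reproduced numerically under the Jukes-Cantor model, where the expected p-distance is p = 3/4 (1 - exp(-4d/3)). The sketch below is illustrative only; the function name `expected_p` and the chosen values of d are assumptions made here.

```python
import math


def expected_p(d):
    """Expected p-distance under Jukes-Cantor for genetic distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# For small d the ratio p/d stays close to 1 (the linear regime);
# for large d, p saturates towards 3/4 while d keeps growing.
for d in (0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 3.0):
    p = expected_p(d)
    print(f"d = {d:4.2f}   p = {p:.4f}   p/d = {p / d:.3f}")
```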
![Page 36: Lecture 3: Markov models of sequence evolution](https://reader036.fdocuments.in/reader036/viewer/2022062803/56814767550346895db4a4fa/html5/thumbnails/36.jpg)
Conclusions
• The genetic distance between two sequences can be estimated using a Markov model of DNA substitution.
• Different models will estimate different genetic distances.
• We have focused on DNA models, but it is possible to consider models for proteins and models that take into account codons and the genetic code.
• Markov model approaches to estimating genetic distance do not deal with indels, and presuppose an alignment.
• These models assume that all positions in a DNA sequence mutate at the same rate. We will talk about how to relax this assumption in later lectures.