Lecture 10 – Models of DNA Sequence Evolution
Correct for multiple substitutions in calculating pairwise genetic distances.
Derive transformation probabilities for likelihood-based methods.
$$\mathrm{Prob}(R_r \mid t) = \pi_m \times P_{m,k}(v_{3,1}) \times P_{k,A}(v_{1,w}) \times P_{k,G}(v_{1,x}) \times P_{m,l}(v_{3,2}) \times P_{l,C}(v_{2,y}) \times P_{l,C}(v_{2,z})$$
It is the $P_{i,j}$ terms that we need a substitution model to calculate.
The models typically used are Markov processes.
A Poisson process is a stochastic process that can be used to model events in time.
The time between events is exponentially distributed, with rate $\lambda$.
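The link between exponential waiting times and event counts can be illustrated with a small simulation. This is only a sketch: the function name `simulate_event_times`, the rate of 2.0, the time span, and the random seed are all assumptions chosen for illustration, not values from the lecture.

```python
import random

def simulate_event_times(rate, t_max, rng):
    """Return the event times of a Poisson process with the given rate on [0, t_max]."""
    times = []
    t = rng.expovariate(rate)  # exponential waiting time to the first event
    while t <= t_max:
        times.append(t)
        t += rng.expovariate(rate)  # exponential gap to the next event
    return times

rng = random.Random(1)            # fixed seed (hypothetical) so the run is repeatable
rate, t_max = 2.0, 5000.0         # illustrative values only
events = simulate_event_times(rate, t_max, rng)

# The number of events per unit time should be close to the rate,
# and the mean gap between events should be close to 1 / rate.
events_per_unit_time = len(events) / t_max
gaps = [later - earlier for earlier, later in zip(events, events[1:])]
mean_gap = sum(gaps) / len(gaps)
```

With these settings, `events_per_unit_time` should land near the rate and `mean_gap` near its reciprocal, which is the sense in which the rate parameter controls how often substitutions occur.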
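Jukes-Cantor Model
The probability of a site remaining constant is: $p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4at}$
The probability of a site changing from nucleotide $i$ to a particular different nucleotide $j$ is: $p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4at}$
$a$ is the rate at which any nucleotide changes to any one other nucleotide per unit time.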
Given that the state at the site is $i$ at $t_0$, we start by calculating the probability of state $i$ at that site at $t_1$.
$p_i(0) = 1$
$p_i(1) = 1 - 3a$
Now, what is the probability of this site having state $i$ at $t_2$?
There are two ways for the site to have state $i$ at $t_2$:
1 – It still hasn’t changed since time $t_0$.
2 – It has changed to something else and back again.
Therefore, $p_i(2) = (1 - 3a)\,p_i(1) + a\,[1 - p_i(1)]$, where
$(1 - 3a)\,p_i(1)$ is the probability of no change at the site during the step from $t_1$ to $t_2$, $(1 - 3a)$, times the probability of the site having state $i$ at time $t_1$, $p_i(1)$,
and
$a\,[1 - p_i(1)]$ is the probability of a change to $i$, $a$, times the probability that the site is not in state $i$ at time $t_1$, $1 - p_i(1)$.
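We have a recurrence equation.
$p_i(t+1) = (1 - 3a)\,p_i(t) + a\,[1 - p_i(t)] = p_i(t) - 3a\,p_i(t) + a - a\,p_i(t)$
We can calculate the change in $p_i(t)$ across time, $\Delta t$.
$p_i(t+1) - p_i(t) = -3a\,p_i(t) + a - a\,p_i(t)$
so, in the continuous-time limit,
$$\frac{dp_i(t)}{dt} = a - 4a\,p_i(t)$$
and solving this differential equation gives
$$p_i(t) = \frac{1}{4} + \left(p_i(0) - \frac{1}{4}\right) e^{-4at}$$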
We have a probability that a site has a particular nucleotide after time t, given in terms of its initial state.
If $i = j$, $p_i(0) = 1$.
Therefore, $p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4at}$
If $i \neq j$, $p_i(0) = 0$, and $p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4at}$
$a$ is an instantaneous rate, so we have modeled branch length (rate × time) explicitly in our expectations.
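As a rough numerical check of the derivation, the discrete recurrence can be iterated and compared with the closed-form solution. The function names and the values of `a` and `steps` below are assumptions for illustration; the agreement is only approximate because the recurrence uses unit time steps while the closed form is the continuous-time limit.

```python
import math

def jc_recurrence(a, steps, p0=1.0):
    """Iterate p(t+1) = (1 - 3a) p(t) + a [1 - p(t)] for the given number of steps."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * a) * p + a * (1 - p)
    return p

def jc_closed_form(a, t, p0=1.0):
    """Closed-form solution p(t) = 1/4 + (p(0) - 1/4) e^{-4at}."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * a * t)

a = 0.001      # hypothetical small per-step rate
steps = 500    # hypothetical number of time steps

p_same_recurrence = jc_recurrence(a, steps)        # start in state i (p0 = 1)
p_same_exact = jc_closed_form(a, steps)            # p_ii(t)
p_diff_exact = jc_closed_form(a, steps, p0=0.0)    # p_ij(t), i != j
```

The smaller `a` is, the closer the recurrence comes to the exponential form. The closed-form values also satisfy $p_{ii}(t) + 3\,p_{ij}(t) = 1$, as they must for four states.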
The JC model makes several assumptions.
1) All substitutions are equally likely; we have a single substitution type.
2) Base frequencies are assumed to be equal; each of the four nucleotides occurs at 25% of sites.
3) Each site has the same probability of experiencing a substitution as any other; we have an equal-rates model.
4) The process is constant through time.
5) Sites are independent of each other.
6) Substitution is a Markov process.
Q-matrix:
$$Q = \begin{pmatrix} -3a & a & a & a \\ a & -3a & a & a \\ a & a & -3a & a \\ a & a & a & -3a \end{pmatrix}$$
Substitution types and base frequencies.
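For the general case:
$$Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}$$
where $\mu$ is the average instantaneous substitution rate; $a, b, c, \ldots, l$ are relative rate parameters (one of them is set to 1); and the $\pi_i$ are the frequencies of the base that is being substituted to. Rows and columns are in the order A, C, G, T.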
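Note that this matrix is not symmetric in its relative rates (for example, a need not equal g), and therefore the full model is non-reversible.
Setting a = g, b = h, c = i, d = j, e = k, and f = l gives a time-reversible model.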
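General Time-Reversible Model
$$Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
\mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}$$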
There are six relative transformation rates (one of which is set to 1).
There are four base frequencies that must sum to 1.
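Note that this is not a symmetric matrix, but its off-diagonal elements are those of the product $R\,\Pi$, where $R$ is a symmetric matrix of relative rates and $\Pi$ is a diagonal matrix of base frequencies.
$$R = \begin{pmatrix}
-\mu(a+b+c) & \mu a & \mu b & \mu c \\
\mu a & -\mu(a+d+e) & \mu d & \mu e \\
\mu b & \mu d & -\mu(b+d+f) & \mu f \\
\mu c & \mu e & \mu f & -\mu(c+e+f)
\end{pmatrix}$$
$$\Pi = \begin{pmatrix}
\pi_A & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_G & 0 \\
0 & 0 & 0 & \pi_T
\end{pmatrix}$$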
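The construction of a GTR rate matrix from relative rates and base frequencies can be sketched as follows. The function name `gtr_q`, the dictionary layout, and every numeric value are assumptions for illustration. Off-diagonal entries are built as rate times target-base frequency, and each diagonal entry is then set so that its row sums to zero, matching the GTR matrix shown earlier.

```python
def gtr_q(rates, freqs, mu=1.0):
    """Build a GTR Q-matrix in the order A, C, G, T.

    rates: dict with keys 'a'..'f' (A-C, A-G, A-T, C-G, C-T, G-T relative rates).
    freqs: [pi_A, pi_C, pi_G, pi_T], assumed to sum to 1.
    """
    a, b, c, d, e, f = (rates[key] for key in "abcdef")
    # Symmetric table of relative rates (diagonal left at zero for now).
    r = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    # Off-diagonal rate i -> j is mu * r_ij * pi_j.
    q = [[mu * r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    # Each diagonal entry is minus the sum of the other entries in its row.
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

# Hypothetical parameter values; f is the rate fixed at 1.
example_rates = {"a": 1.0, "b": 4.0, "c": 0.5, "d": 0.7, "e": 5.0, "f": 1.0}
example_freqs = [0.1, 0.2, 0.3, 0.4]
q_gtr = gtr_q(example_rates, example_freqs)
```

Although `q_gtr` is not symmetric, it satisfies $\pi_i Q_{ij} = \pi_j Q_{ji}$ for every pair of bases, which is the time-reversibility condition.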
Visual GTR
Common Simplifications
Transition type substitutions occur at a higher rate than transversion substitutions.
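The K2P (Kimura two-parameter) model was the first to address this.
So we set b = e = $\kappa$ (for transitions), and a = c = d = f = 1 (for transversions).
For K2P:
$$Q = \begin{pmatrix}
-\mu(\kappa + 2)/4 & \mu/4 & \mu\kappa/4 & \mu/4 \\
\mu/4 & -\mu(\kappa + 2)/4 & \mu/4 & \mu\kappa/4 \\
\mu\kappa/4 & \mu/4 & -\mu(\kappa + 2)/4 & \mu/4 \\
\mu/4 & \mu\kappa/4 & \mu/4 & -\mu(\kappa + 2)/4
\end{pmatrix}$$
All $\pi_i = \frac{1}{4}$
where $\alpha = \mu\kappa/4$ is the transition rate and $\beta = \mu/4$ is the transversion rate. Thus, $\kappa = \alpha/\beta$.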
Hasegawa-Kishino-Yano (HKY) Model
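For HKY:
$$Q = \begin{pmatrix}
-\mu(\kappa\pi_G + \pi_Y) & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & -\mu(\kappa\pi_T + \pi_R) & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & -\mu(\kappa\pi_A + \pi_Y) & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & -\mu(\kappa\pi_C + \pi_R)
\end{pmatrix}$$
where $\alpha = \mu\kappa$, $\beta = \mu$, $\pi_R = \pi_A + \pi_G$, and $\pi_Y = \pi_C + \pi_T$.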
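A short sketch can show that HKY reduces to K2P when the base frequencies are equal. The function name `hky_q` and the parameter values are assumptions for illustration; bases are indexed in the order A, C, G, T.

```python
def hky_q(kappa, freqs, mu=1.0):
    """Build an HKY Q-matrix in the order A, C, G, T."""
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}  # A<->G and C<->T
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                # Transitions are scaled by kappa, transversions by 1.
                scale = kappa if (i, j) in transitions else 1.0
                q[i][j] = mu * scale * freqs[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q

kappa = 2.0  # hypothetical transition/transversion rate ratio

# Unequal base frequencies (hypothetical values): a general HKY matrix.
q_hky = hky_q(kappa, [0.1, 0.2, 0.3, 0.4])

# Equal base frequencies: the entries match the K2P matrix,
# mu*kappa/4 for transitions, mu/4 for transversions, -mu*(kappa + 2)/4 on the diagonal.
q_k2p = hky_q(kappa, [0.25, 0.25, 0.25, 0.25])
```

With `kappa = 1.0` and equal frequencies the same function gives a Jukes-Cantor matrix with $a = \mu/4$, which is one way to see these models as nested special cases.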
There are lots of other models that restrict the Q-matrix.
Some common models
There are 203 special cases of the GTR, 406 if we allow for equal base frequencies.
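The figure of 203 can be reproduced by counting, assuming each special case corresponds to one way of grouping the six relative rates into classes that share a value (the helper name `count_set_partitions` is hypothetical). The number of such groupings of six items is a Bell number, computed here with the Bell triangle.

```python
def count_set_partitions(n):
    """Number of ways to split n labelled items into non-empty groups (Bell number)."""
    # Bell triangle: each new row starts with the last entry of the previous row,
    # and each further entry adds the entry above-left to the entry on its left.
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

rate_groupings = count_set_partitions(6)      # groupings of the six relative rates
with_frequency_choice = 2 * rate_groupings    # each with equal or unequal base frequencies
```

Here `rate_groupings` is 203 and `with_frequency_choice` is 406. The single-group case corresponds to one substitution type, and the six-group case to the full GTR.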
Calculating Transformation Probabilities.
So the Q & R matrices we’ve been discussing define the instantaneous rates of substitutions from one nucleotide to another.
Convert the rates to probabilities by matrix exponentiation:
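$$P(t) = e^{Qt}$$
For Jukes-Cantor, this gives the probabilities derived above:
$$p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4at}, \qquad p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4at} \quad (i \neq j)$$
For K2P, with transition rate $\alpha$ and transversion rate $\beta$:
$$p_{ii}(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta)t}$$
$$p_{ij}(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t} \quad \text{(transition)}$$
$$p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\beta t} \quad \text{(transversion)}$$
Again, it is these $P_{ij}$ that are used in the likelihood function.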
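To illustrate the matrix exponentiation step, the sketch below approximates $e^{Qt}$ in plain Python by scaling and squaring with a truncated Taylor series, then compares the result for a Jukes-Cantor Q-matrix with the closed-form probabilities. The helper names, the number of squarings and series terms, and the values of `a` and `t` are all assumptions. In practice a tested numerical library routine would normally be preferred.

```python
import math

def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=20):
    """Approximate e^M: Taylor series on M / 2^squarings, then repeated squaring."""
    n = len(m)
    scale = 2 ** squarings
    small = [[value / scale for value in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = mat_mul(term, small)
        term = [[value / k for value in row] for row in term]       # small^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_probs(q, t):
    """P(t) = e^{Qt} for a rate matrix q and elapsed time t."""
    return mat_exp([[value * t for value in row] for row in q])

a, t = 0.1, 2.0  # hypothetical rate and time
q_jc = [[-3 * a if i == j else a for j in range(4)] for i in range(4)]
p = transition_probs(q_jc, t)

p_same = 0.25 + 0.75 * math.exp(-4 * a * t)   # closed-form p_ii(t)
p_diff = 0.25 - 0.25 * math.exp(-4 * a * t)   # closed-form p_ij(t), i != j
```

The diagonal entries of `p` should agree with `p_same` and the off-diagonal entries with `p_diff` to within floating-point error, and each row of `p` sums to 1. The same `transition_probs` call works for any of the Q-matrices above, including those without a simple closed form.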