Interpretation of exponentiation + eigenvalue decomposition
The terms in the series expansion of P(t), $P(t) = e^{Qt} = \sum_{k \ge 0} (Qt)^k / k!$, do not directly have an interpretation. The first, I, is the trivial transition function, and the remaining terms have negative entries and row sums of 0. If the CTMC has identical exit rates for all states (q), then:
$P(t) = \sum_{k \ge 0} e^{-qt} \frac{(qt)^k}{k!} (Q')^k$
where Q' = I + Q/q holds the single-step transition probabilities, so P(t) is a Poisson-weighted mixture of k-step transition matrices. Without identical exit rates, I don't know a simple interpretation.
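The equal-exit-rate interpretation can be checked numerically. The sketch below assumes the jump matrix is I + Q/q and uses a Jukes-Cantor-like Q (all off-diagonal rates equal to 1) as a hypothetical example; the function name and the truncation at 60 terms are illustrative choices, not part of the slides.

```python
import math
import numpy as np

def transition_matrix_equal_exit(Q, t, n_terms=60):
    """P(t) as a Poisson mixture of powers of the jump matrix I + Q/q."""
    q = -Q[0, 0]                                   # common exit rate
    assert np.allclose(np.diag(Q), -q), "exit rates must be identical"
    jump = np.eye(len(Q)) + Q / q                  # single-step transition probabilities
    P = np.zeros_like(Q, dtype=float)
    power = np.eye(len(Q))                         # (jump matrix)^0
    for k in range(n_terms):
        weight = math.exp(-q * t) * (q * t) ** k / math.factorial(k)
        P += weight * power                        # Poisson(k; q t) * (jump matrix)^k
        power = power @ jump
    return P

# Hypothetical example: every off-diagonal rate is 1, so q = 3 for all states.
Q_jc = np.ones((4, 4)) - 4 * np.eye(4)
P_half = transition_matrix_equal_exit(Q_jc, 0.5)
```

For this Q the result can be compared with the closed form 1/4 + 3/4 e^{-4t} on the diagonal.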
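If the chain is reversible, i.e. Q is weighted symmetric, $Q_{i,j} = \pi_j Q_{j,i} / \pi_i$, then Q is diagonalizable with real eigenvalues: $Q = U D U^{-1}$, so that $P(t) = U e^{Dt} U^{-1}$.
If Q has distinct eigenvalues $\lambda_1, \dots, \lambda_n$, then it will also have simple expressions for $P_{i,j}(t)$.
A little rearrangement gives:
$P_{i,j}(t) = \sum_k U_{i,k} (U^{-1})_{k,j} e^{\lambda_k t}$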
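A minimal sketch of the eigenvalue route for a reversible Q, assuming the stationary distribution pi is known. It symmetrizes Q with the square roots of pi and then uses `numpy.linalg.eigh`; the function name and the F81-type example matrix are illustrative assumptions.

```python
import numpy as np

def transition_matrix_reversible(Q, pi, t):
    """P(t) = exp(Qt) for a reversible Q via a symmetric eigen-decomposition."""
    s = np.sqrt(pi)
    S = (s[:, None] * Q) / s[None, :]      # S = Pi^{1/2} Q Pi^{-1/2}, symmetric under detailed balance
    assert np.allclose(S, S.T), "Q is not reversible with respect to pi"
    lam, V = np.linalg.eigh(S)             # real eigenvalues, orthonormal eigenvectors
    U = V / s[:, None]                     # Q = U diag(lam) U^{-1}
    U_inv = V.T * s[None, :]
    return (U * np.exp(lam * t)) @ U_inv   # sum_k U[i,k] exp(lam_k t) U_inv[k,j]

# Hypothetical example: Q[i, j] = pi[j] for i != j, so rows sum to 0.
pi = np.array([0.1, 0.2, 0.3, 0.4])
Q_f81 = np.tile(pi, (4, 1)) - np.eye(4)
P = transition_matrix_reversible(Q_f81, pi, 0.7)
```

For this example matrix the answer is known in closed form, P_{i,j}(t) = π_j (1 - e^{-t}) + δ_{i,j} e^{-t}, which makes it a convenient check.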
[Figure: Continuous Time Markov Chain (jump times t_i, t_{i+1} on the interval 0 to t) = Discrete Time Markov Chain (steps i, i+1 on 0 to n) + Poisson Process.]
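Kimura 2-parameter model - K80
Q (rows: FROM, columns: TO):
        A    C    G    T
   A    -    β    α    β
   C    β    -    β    α
   G    α    β    -    β
   T    β    α    β    -
a = α*t, b = β*t
P(t), given the start nucleotide:
$.25(1 + e^{-4b} + 2e^{-2(a+b)})$ identity
$.25(1 + e^{-4b} - 2e^{-2(a+b)})$ transition
$.25(1 - e^{-4b})$ each transversion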
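The standard closed-form K80 probabilities, which the garbled formulas above are assumed to represent, can be written as a small function. Here a = α*t and b = β*t, with β taken as the rate to each of the two transversion targets; the function name is an illustrative choice.

```python
import math

def k80_probabilities(a, b):
    """Return (identity, transition, transversion) for a = alpha*t, b = beta*t."""
    e_tv = math.exp(-4.0 * b)
    e_ts = math.exp(-2.0 * (a + b))
    identity = 0.25 * (1.0 + e_tv + 2.0 * e_ts)
    transition = 0.25 * (1.0 + e_tv - 2.0 * e_ts)
    transversion = 0.25 * (1.0 - e_tv)      # for each of the two transversion targets
    return identity, transition, transversion

ident, ts, tv = k80_probabilities(0.3, 0.1)
```

The four outcomes (identity, one transition, two transversions) sum to 1, and setting a = b recovers the Jukes-Cantor case.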
Felsenstein81 & Hasegawa, Kishino & Yano 85
Unequal base composition: (Felsenstein, 1981 F81)
Qi,j = C*πj, i unequal j
Rates to frequent nucleotides are high - π = (πA, πC, πG, πT)
Tv/Tr = (πT πC + πA πG)/[(πT+πC)(πA+πG)]
Tv/Tr & composition bias (Hasegawa, Kishino & Yano, 1985 HKY85)
Qi,j = (α/β)*C*πj if i->j is a transition
Qi,j = C*πj if i->j is a transversion
Tv/Tr = (α/β)(πT πC + πA πG)/[(πT+πC)(πA+πG)]
[Figure: the four nucleotides A, G, T and C with substitution arrows.]
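A small sketch of an HKY85-type rate matrix, with F81 as the special case where the transition multiplier is 1. The parameter `kappa` stands for the factor in front of C*πj for transitions and is an assumed name, as are the function names and the example frequencies. The second function computes the ratio the slide writes as Tv/Tr, here taken to be equilibrium flux through transitions divided by flux through transversions.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85_rate_matrix(pi, kappa, C=1.0):
    """Return Q as a dict of dicts; pi maps each nucleotide to its frequency."""
    Q = {}
    for i in pi:
        row = {}
        for j in pi:
            if i != j:
                factor = kappa if (i, j) in TRANSITIONS else 1.0
                row[j] = factor * C * pi[j]   # rate is proportional to the target frequency
        row[i] = -sum(row.values())           # rows of a rate matrix sum to 0
        Q[i] = row
    return Q

def flux_ratio(Q, pi):
    """Equilibrium transition flux divided by transversion flux."""
    ts = sum(pi[i] * Q[i][j] for i in pi for j in pi if (i, j) in TRANSITIONS)
    tv = sum(pi[i] * Q[i][j] for i in pi for j in pi
             if i != j and (i, j) not in TRANSITIONS)
    return ts / tv

# Hypothetical frequencies, chosen only for illustration.
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
Q = hky85_rate_matrix(pi, kappa=2.0)
ratio = flux_ratio(Q, pi)
```

Because every rate has the form (factor)*C*πj with a symmetric factor, detailed balance πi Qi,j = πj Qj,i holds automatically, so the model is time reversible.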
Group 3, Symmetric 6, Reversible 9 and General 12 models
Kimura 3 parameter 1980, Evans and Speed 1993:
• Can be interpreted as a random walk on Z2*Z2
Symmetric:
• Often only differences can be observed, leading to a symmetric matrix
• Symmetric matrices have uniform equilibrium distributions
Time reversible:
• Can be obtained from differences and equilibrium distributions
• Alternative condition for time reversibility: no net flows, $\pi_i Q_{i,j} = \pi_j Q_{j,i}$
General:
• Non-reversible models allow rooting with only 2 sequences
[Figure: the four nucleotides C, A, G and T with substitution arrows.]
From Nucleotide to Sequence
• Context-dependent models
Genome:
Dinucleotides
..ACGGA..
• Di-nucleotide events
ACGGAGT
ACGTCGT
• Rate Variation
ATTGCGTCCAATATTGCGTCCAAT
ATGGCGTCC T ATATTGCGTGCAAT
ATTGCGTCC A ATATTGCGTCCGAT
Each nucleotide evolves independently
Di-nucleotide events
Averof et al. (2000) “Evidence for High Frequency of Simultaneous Double-Nucleotide Substitutions” Science 287.1283-
Smith et al. (2003) “A Low rate of Simultaneous Double-Nucleotide Mutations in Primates” Mol. Biol. Evol. 20.1.47-53
[Figure: the difference between ACGGAGT and ACGTCGT explained either by single nucleotide events (singlet, singlet) or by a double event (doublet).]
Assuming JC69 + doublet mutations.
00: 10^-8 doublet mutation rate, ~10% of singlet rate
03: much less, for a larger and more reliable data set
Context-dependent models
From singlet models to doublet models:
• Independence
• Independence with CG avoidance
• Strand symmetry
• Only single events
• Single events with simple double events
Contagious Dependence:
• Pedersen and Jensen, 2001
• Siepel and Haussler, 2003
[Figure: a site of unknown state (A, G, C or T?) together with its neighbouring nucleotides.]
Rate variation between sites: iid each site
The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. Γ(α,β) has density $\alpha^{\beta} x^{\beta-1} e^{-\alpha x} / \Gamma(\beta)$, where α is called the scale parameter and β the form parameter.
Let $L(p_i, \theta, t)$ be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process and $f(r_i)$ the continuous distribution of rate(s). Then
$L_i = \int L(p_i, \theta, r_i t) f(r_i) \, dr_i$
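The rate integral can be approximated numerically. The sketch below is an illustration under stated assumptions: the pattern is "two sequences agree at a site" under JC69 at distance d, the rate distribution is a Γ with mean 1 and form parameter 2, and a plain midpoint rule is used. All function names and parameter values are hypothetical.

```python
import math

def gamma_density(r, shape):
    """Gamma density with mean 1 (scale parameter equal to the form parameter)."""
    return shape ** shape * r ** (shape - 1) * math.exp(-shape * r) / math.gamma(shape)

def jc69_identity_prob(d):
    """Probability that two sequences agree at a site, JC69, distance d."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

def site_likelihood(pattern_likelihood, shape, r_max=40.0, steps=20000):
    """Midpoint-rule approximation of the integral of L(r) f(r) over rates r."""
    h = r_max / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        total += pattern_likelihood(r) * gamma_density(r, shape) * h
    return total

d = 0.3
L_same = site_likelihood(lambda r: jc69_identity_prob(r * d), shape=2.0)
```

For this particular pattern the integral also has a closed form, 1/4 + 3/4 (1 + 4d/(3k))^{-k} with k the form parameter, which gives a check on the quadrature. In practice a discretized Γ with a few rate categories is a common alternative to numerical integration.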
Measuring Selection
ThrSer
ACGTCA
Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.
ThrProPro
ACGCCA
-
ArgSer
AGGCCG
-
The selection criteria could in principle be anything, but selection against amino acid changes is by far the most important.
ThrSer
ACGCCG
ThrSer
ACTCTG
AlaSer
GCTCTG
AlaSer
GCACTG
The Genetic Code
3 classes of sites:
• 4
• 2-2
• 1-1-1-1
Problems:
i. Not all sites fit into those categories.
ii. A change in one site can change the status of another.
[Figure: the genetic code table with example sites marked: 4 (3rd), 1-1-1-1 (3rd), TA (2nd).]
Possible events in the genetic code, remade from Li, 1997
Substitutions Number Percent
Total in all codons 549 100
Synonymous 134 25
Nonsynonymous 415 75
Missense 392 71
Nonsense 23 4
Possible number of substitutions: 61 (codons)*3 (positions)*3 (alternative nucleotides).
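The counts in the table can be reproduced by enumerating every single-nucleotide change from each of the 61 sense codons. The sketch assumes the standard genetic code, written in the usual TCAG order; the names `CODE` and `count_substitutions` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_substitutions():
    """Classify all 61*3*3 single-nucleotide substitutions from sense codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODE.items():
        if aa == "*":
            continue                              # start from sense codons only
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODE[mutant]
                if new_aa == "*":
                    counts["nonsense"] += 1
                elif new_aa == aa:
                    counts["synonymous"] += 1
                else:
                    counts["missense"] += 1
    return counts

counts = count_substitutions()
```

Missense plus nonsense gives the nonsynonymous total of the table.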
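Kimura’s 2 parameter model & Li’s Model.
Selection on the 3 kinds of sites: (α,β) -> (?,?)
1-1-1-1 (f*α, f*β)
2-2 (α, f*β)
4 (α, β)
Rates: [Figure: transition and transversion rates out of the start nucleotide.]
Probabilities:
$.25(1 + e^{-4b} + 2e^{-2(a+b)})$ identity
$.25(1 + e^{-4b} - 2e^{-2(a+b)})$ transition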
Sites     Total   Conserved     Transitions   Transversions
1-1-1-1   274     246 (.8978)   12 (.0438)    16 (.0584)
2-2       77      51 (.6623)    21 (.2727)    5 (.0649)
4         78      47 (.6026)    16 (.2051)    15 (.1923)
alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile
Z(αt,βt) = .50[1+exp(-2βt) - 2exp(-t(α+β))] transition
Y(αt,βt) = .25[1-exp(-2βt)] transversion
X(αt,βt) = .25[1+exp(-2βt) + 2exp(-t(α+β))] identity
L(observations,a,b,f) = C(429,274,77,78) * {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} * {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} * {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}
where a = αt and b = βt.
Estimated Parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663
          Transitions     Transversions
1-1-1-1   a*f = 0.0500    2*b*f = 0.0622
2-2       a = 0.3004      2*b*f = 0.0622
4         a = 0.3004      2*b = 0.3741
Expected number of: replacement substitutions 35.49, synonymous 75.93
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
Ks = .6644, Ka = .1127
Probabilities of different paths
Starting in A, ending in B after time t
Rate of going from i to j: qi,j
[Figure: paths from A to B through the intermediate states S1, S2, S3, ..., Sk.]
Key questions (conditional/unconditional):
• Number of events
• Kinds of events
• Time spent at different states
• Time to get from A to B
• Very liked and disliked sets
• Probability of only visiting {S}
O’Brien, Minin, and Marc A. Suchard. Learning to Count: Robust Estimates for Labeled Distances between Molecular Sequences. Mol. Biol. Evol. 26(4):801–814. 2009
Vladimir N Minin and Marc A Suchard. Fast, accurate and simulation-free stochastic mapping. Phil. Trans. R. Soc. B (2008) 363, 3995
Vladimir N. Minin, Marc A. Suchard. Counting labeled transitions in continuous-time Markov models of evolution. J. Math. Biol. (2008) 56:391–412
Generalize to a phylogeny
[Figure: a tree with leaves A, B and C, branch lengths t1, t2 and t3, and intermediate states S1, S2, S3, ..., Sk.]
• Distribution of ancestor state X: P(X | A,B,C) = P(A,B,C,X)/P(A,B,C)
• Which edges/nodes carry most/least probability? Ranked lists of edges and nodes.
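The distribution of the ancestral state can be sketched for the simplest case. The example assumes a star tree with a single ancestor X above the three leaves A, B and C, and Jukes-Cantor substitution with branch lengths in expected substitutions per site; these modelling choices and the function names are assumptions made for illustration.

```python
import math

def jc69(i, j, t):
    """Jukes-Cantor transition probability from i to j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def ancestor_posterior(leaves, branch_lengths, states="ACGT"):
    """P(X | leaves) = P(leaves, X) / P(leaves) on a star tree with root X."""
    joint = {}
    for x in states:
        p = 0.25                                   # uniform equilibrium distribution
        for leaf, t in zip(leaves, branch_lengths):
            p *= jc69(x, leaf, t)
        joint[x] = p                               # P(A, B, C, X = x)
    total = sum(joint.values())                    # P(A, B, C)
    return {x: p / total for x, p in joint.items()}

post = ancestor_posterior("AAG", (0.1, 0.1, 0.1))
```

With leaves A, A and G on equal branches, the posterior ranks A first, G second, and C and T equal last.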
Summary of Substitution Models
• Extensions to the basic model
• Rate heterogeneity
• Context Dependent Models
• Codons
• From nucleotide to sequence
• Independence of nucleotides
• Assumptions behind substitution models
• Continuous time Markov Chain
• Only substitutions
• Independence and identity of positions
• From P to Q & from Q to P
• Independence of lineages
• The simplest model: Jukes-Cantor
• Ancestral Analysis – conditioning on start and finish
From Continuous to Discrete Time
[Figure: time axis with the points 0, t1, t2 and T.]
Koskinen, J. (2004) Bayesian Inference for Longitudinal Social Networks. Research Report, number 2004:4, Stockholm University, Department of Statistics.
Koskinen, J. and Snijders, T. (2007) Bayesian inference for dynamic social network data, Journal of Statistical Planning and Inference, 137, 3930--3938.
R. Sharan, T. Ideker, Modeling cellular machinery through biological network comparison, Nature Biotechnology, 24, 427 (2006).
Snijders, T. (2001) “Statistical evaluation of social networks dynamics” in Sociological Methodology, by Michael Sobel.
Snijders, T. et al. (2008) “Maximum Likelihood Evaluation for Social Network Dynamics” In press.
I. Miklos, G.A. Lunter and I. Holmes (2004) A "long indel" model for evolutionary sequence alignment. Mol. Biol. Evol. 21(3):529-540. Appendix A
• Sum over i state assignments gives the probability of paths of length i.
• Integrating over all waiting times (t1,..,ti) and state assignments of length i gives the probability of a specific trajectory.
• Sum over all path lengths gives the probability of N turning into N’.
• The above expression can be shown to be of the form [equation], and O(N^2) recursions exist to calculate the coefficients.
Correlated Mutations
Motivation: Models often assume independence between sites, or between sites and phenotype. That assumption might not be warranted, so methods for detecting correlation are of great use.
• Ideal situation – complete history known.
[Figure: a history between two sequences with every substitution marked.]
• Real situation – end points known, little power.
[Figure: only the two end sequences are observed.]
• Many end points – a phylogeny might give power.
[Figure: several observed sequences related by a tree.]
• Explicit modelling:
[Figure: a single binary state (-, +) at each of two sites combined into the pair states --, -+, +-, ++, with a rate matrix over the pair states.]
• Ancestral Analysis:
• Multiple Testing
All pairs – n(n-1)/2
Site – phenotypic character – n
Transition Path Sampling Algorithm/MCMC
[Figure: two paths between P1 and P2, with step probabilities p1, p2, p3, p4, p5 and p6.]
Path 1 - probability: p1 p2 p3
Path 2 - probability: p1 p4 p5 p6
Local modification of Path 1 into Path 2:
Set of paths: Ω
Likelihood - L(ω)
Probability of going from ω to ω' - q(ω, ω')
Acceptance ratio
Discrete Space: $\min\left(1, \frac{L(\omega') q(\omega', \omega)}{L(\omega) q(\omega, \omega')}\right)$
Continuous Space – reversible jump MCMC (Green, 1995)
The acceptance ratio will have to be weighted by the Jacobian – J.
Typically much slower, as the continuous case includes stochastic integration
Simulating trajectories that end in B at time t
Hobolth and Stone (2009) Efficient simulation from finite-state, continuous-time Markov chains with incomplete observations
[Figure: time axis from 0 to t; paths leave A for the intermediate states S1, S2, S3, ..., Sk with rates q01, q02, q03, ..., q0k and enter B with rates q1B, q2B, q3B, ..., qkB.]
Challenge for large state space: E(steps) large and PA,B(t) small.
Algorithm (forward rejection sampling):
• Sample paths unconditionally
• Keep paths ending in B at time t
• Normalize their probability by dividing by PA,B(t)
Can be modified to be more efficient if PA,A(t) has high probability
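Forward rejection sampling can be sketched in a few lines. The code below assumes Q is given as a list of rows and uses a plain Gillespie simulation for the unconditional paths; the function names, the example rate matrix and the `max_tries` cap are illustrative assumptions.

```python
import random

def simulate_path(Q, start, t_end, rng):
    """Unconditional simulation: list of (jump time, new state) on [0, t_end)."""
    path, state, time = [(0.0, start)], start, 0.0
    while True:
        exit_rate = -Q[state][state]
        if exit_rate <= 0:
            return path                            # absorbing state
        time += rng.expovariate(exit_rate)         # exponential waiting time
        if time >= t_end:
            return path
        targets = [j for j in range(len(Q)) if j != state]
        weights = [Q[state][j] for j in targets]
        state = rng.choices(targets, weights=weights)[0]
        path.append((time, state))

def sample_bridge(Q, start, end, t_end, rng, max_tries=100000):
    """Keep simulating until a path ends in `end`; return it with the try count."""
    for tries in range(1, max_tries + 1):
        path = simulate_path(Q, start, t_end, rng)
        if path[-1][1] == end:
            return path, tries
    raise RuntimeError("no accepted path; P_AB(t) may be too small")

# Hypothetical example: four states, all off-diagonal rates equal to 1.
Q = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
path, tries = sample_bridge(Q, 0, 2, 0.5, random.Random(2024))
```

The expected number of tries is 1/PA,B(t), which is why the method struggles when PA,B(t) is small.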
Uniformization, step by step:
• Create Uniformized Process: μ = max_i (-qii); Q’: qii := -μ
• Interpret increased exit rates as self-jumps: R := I + Q/μ
• Sample jump points according to the Poisson Process (rate μ)
• Sample discrete jump transitions according to the conditional jump process (conditional jump probabilities, as opposed to unconditional jump probabilities)
[Figure: the Poisson Process on the discrete axis 0, ..., i, i+1, ..., n; the jump points (red stars) are tagged as real jumps or self jumps.]
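The uniformization steps can be sketched as an end-point conditioned sampler. This is a minimal illustration assuming R = I + Q/μ with μ = max_i(-q_ii); the number of jump points is drawn from its conditional distribution given both end points, the jump times are sorted uniforms, and the states follow the conditional jump probabilities. The function name, the truncation `n_max` and the example Q are hypothetical.

```python
import math
import numpy as np

def sample_uniformized_bridge(Q, a, b, t, rng, n_max=200):
    """Return [(time, state)] jump points from a to b; state == previous state is a self-jump."""
    mu = max(-np.diag(Q))
    R = np.eye(len(Q)) + Q / mu                        # jump matrix with self-jumps
    powers = [np.eye(len(Q))]
    for _ in range(n_max):
        powers.append(powers[-1] @ R)                  # powers[n] = R^n
    # Weight of n jump points: Poisson(n; mu t) * (R^n)[a, b]
    w = np.array([math.exp(-mu * t + n * math.log(mu * t) - math.lgamma(n + 1))
                  * powers[n][a, b] for n in range(n_max + 1)])
    n = rng.choice(n_max + 1, p=w / w.sum())
    times = np.sort(rng.uniform(0.0, t, size=n))       # Poisson points given their number
    events, x = [], a
    for k in range(n):
        remaining = n - k - 1
        probs = R[x] * powers[remaining][:, b]         # conditional jump probabilities
        x = int(rng.choice(len(Q), p=probs / probs.sum()))
        events.append((float(times[k]), x))
    return events

# Hypothetical example: no direct 0 -> 2 rate, so at least two real jumps are needed.
Q = np.array([[-1.0, 1.0, 0.0], [0.5, -1.5, 1.0], [0.0, 2.0, -2.0]])
events = sample_uniformized_bridge(Q, 0, 2, 1.0, np.random.default_rng(7))
```

Unlike forward rejection sampling, every draw ends in b, at the price of computing powers of R up to the truncation point.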
Statistical Test of Models (Goldman, 1990)
Data: 3 sequences of length L
ACGTTGCAA ...
AGCTTTTGA ...
TCGTTTCGA ...
A. Likelihood (free multinomial model, 63 free parameters)
$L_1 = p_{AAA}^{\#AAA} \cdot p_{AAC}^{\#AAC} \cdots p_{TTT}^{\#TTT}$, where $p_{N_1N_2N_3} = \#(N_1N_2N_3)/L$
B. Jukes-Cantor and unknown branch lengths
$L_2 = p_{AAA}(l_1', l_2', l_3')^{\#AAA} \cdots p_{TTT}(l_1', l_2', l_3')^{\#TTT}$
[Figure: a three-leaf tree with branch lengths l1, l2 and l3 leading to TCGTTTCGA ..., ACGTTGCAA ... and AGCTTTTGA ...]
Test statistics: I. (expected-observed)^2/expected or II. -2 lnQ = 2(lnL1 - lnL2)
JC69 Jukes-Cantor: 3 parameters => χ2 with 60 d. of freedom
Problems: i. Too few observations per pattern. ii. Many competing hypotheses.
Parametric bootstrap:
i. Maximum likelihood to estimate the parameters.
ii. Simulate with the estimated model.
iii. Make the simulated distribution of -2 lnQ.
iv. Where is the real -2 lnQ in this distribution?
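Two pieces of the test are easy to sketch: the log-likelihood of the free multinomial model, and the position of the observed statistic in a simulated distribution. The sketch uses only the first nine columns shown in the data; fitting the Jukes-Cantor branch lengths and simulating from the fitted model are left out. The function names are illustrative.

```python
import math
from collections import Counter

def multinomial_log_likelihood(sequences):
    """ln L1 = sum over site patterns of count * ln(count / L)."""
    patterns = Counter(zip(*sequences))            # one pattern per alignment column
    length = sum(patterns.values())
    return sum(n * math.log(n / length) for n in patterns.values())

def bootstrap_p_value(observed, simulated):
    """Fraction of simulated -2 ln Q values at least as large as the observed one."""
    return sum(s >= observed for s in simulated) / len(simulated)

seqs = ["ACGTTGCAA", "AGCTTTTGA", "TCGTTTCGA"]
lnL1 = multinomial_log_likelihood(seqs)
degrees_of_freedom = (4 ** 3 - 1) - 3              # 63 free parameters minus 3 branch lengths
```

In these nine columns the pattern TTT occurs twice and seven other patterns once, so lnL1 = 2 ln(2/9) + 7 ln(1/9), which also illustrates the "too few observations per pattern" problem.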
Extension to Overlapping Regions
Hein & Stoevlbaek, 95
Rates (transitions, transversions) by site class in the 1st reading frame (columns, factor f1) and the 2nd reading frame (rows, factor f2):
                1-1-1-1 sites     2-2             4
1-1-1-1 sites   (f1f2a, f1f2b)    (f2a, f1f2b)    (f2a, f2b)
2-2             (f1a, f1f2b)      (a, f1f2b)      (a, f2b)
4               (f1a, f1b)        (a, f1b)        (a, b)
Ziheng Yang has an alternative model to this, where sites are lumped into the same category if they have the same configuration of positions and reading frames.
Example: Gag & Pol from HIV
[Figure: the overlapping gag and pol reading frames.]
Number of sites by class in the two reading frames (Gag, Pol):
                1-1-1-1 sites   2-2   4
1-1-1-1 sites   64              31    34
2-2             40              7     0
4               27              2     0
MLE: a=.084 b=.024 a+2b=.133 fgag=.403 fpol=.229
HIV1 Analysis
Hasegawa, Kishino & Yano Substitution Model Parameters:
α*t     β*t     A       C       G       T
0.350   0.105   0.361   0.181   0.236   0.222
0.015   0.005   0.004   0.003   0.003
Selection Factors
GAG 0.385 (s.d. 0.030)
POL 0.220 (s.d. 0.017)
VIF 0.407 (s.d. 0.035)
VPR 0.494 (s.d. 0.044)
TAT 1.229 (s.d. 0.104)
REV 0.596 (s.d. 0.052)
VPU 0.902 (s.d. 0.079)
ENV 0.889 (s.d. 0.051)
NEF 0.928 (s.d. 0.073)
Estimated Distance per Site: 0.194
Open Problem II: Example Neural Networks (NN)
Motivation: To combine methods that define patterns from a series of independent instances with patterns inferred from instances related by a phylogenetic tree.
[Figure: two panels, "Independent instances:" and "Related by a phylogeny:".]
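Basic Equations I
Chapman-Kolmogorov: $P_{i,j}(t_1 + t_2) = \sum_k P_{i,k}(t_1) P_{k,j}(t_2)$
Matrix version: $P(t) = P(t_1) P(t_2)$, $t = t_1 + t_2$
[Figure: a path from i to j passing through state k after time t1, with remaining time t2.]
Forward Equation: $P'(t) = P(t) Q$
[Figure: a path from i to k over time t1, followed by a short step h from k to j.]
Backward Equation: $P'(t) = Q P(t)$
[Figure: a short first step from i to k, followed by time t2 from k to j.]
Initial Condition: $P(0) = I$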
Fast/Slowly Evolving States
Felsenstein & Churchill, 1996
[Figure: an alignment of sequences 1 to k over positions 1 to n; each position is in a hidden state, slow - rs or fast - rf.]
HMM:
• $\pi_r$ - equilibrium distribution of hidden states (rates) at first position
• $p_{i,j}$ - transition probabilities between hidden states
• $L^{(j,r)}$ - likelihood for j’th column given rate r.
• $L_{(j,r)}$ - likelihood for first j columns given j’th column has rate r.
Likelihood Recursions: $L_{(j,r)} = L^{(j,r)} \sum_{s} L_{(j-1,s)} \, p_{s,r}$
Likelihood Initialisations: $L_{(1,r)} = \pi_r L^{(1,r)}$
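The recursion over columns is the forward algorithm of a hidden Markov model. A minimal sketch, assuming the per-column likelihoods given each rate have already been computed (for example by the usual tree likelihood recursion); the function and argument names are illustrative.

```python
def hmm_likelihood(column_likelihoods, initial, transition):
    """Total likelihood, summing over all assignments of hidden rates to columns.

    column_likelihoods[j][r] is the likelihood of column j given hidden rate state r.
    """
    n_states = len(initial)
    # Initialisation: first column, weighted by the distribution of hidden states.
    forward = [initial[r] * column_likelihoods[0][r] for r in range(n_states)]
    # Recursion: extend the likelihood of the first j-1 columns by column j.
    for column in column_likelihoods[1:]:
        forward = [
            sum(forward[s] * transition[s][r] for s in range(n_states)) * column[r]
            for r in range(n_states)
        ]
    return sum(forward)

# Hypothetical numbers: two hidden states (slow, fast) and three columns.
cols = [[0.2, 0.05], [0.1, 0.3], [0.4, 0.1]]
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
total = hmm_likelihood(cols, init, trans)
```

The cost is linear in the number of columns, whereas summing over all rate assignments directly is exponential. For long alignments the forward values underflow, so a real implementation would rescale or work with logarithms.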
Basic Equations
Hobolth, A. and Jensen, J.L. (2005). Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statistical Applications in Genetics and Molecular Biology, 4, 18
Expected time spent in j, T(j), in going from a to b:
[Figure: time axis starting at 0 with an intermediate time s; the path starts in a, is in an intermediate state at time s, and ends in b.]
Expected number of transitions from i to j, N(i,j), in going from a to b:
[Figure: the same time axis; the path jumps from i to j at time s with rate qi,j.]
Higher moments and combinations of N( ) and T( ) can be calculated using the same reasoning.
[Figure: time axis starting at 0 for a path from a to b.]
Evaluation of Eab(N1(),..,Nr(),T1(),..,Tm()) would involve the evaluation of an at most n+m dimensional integral
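Both expectations have a standard integral representation: E[T(j)] is the integral over s of P_{a,j}(s) P_{j,b}(t-s), and E[N(i,j)] is q_{i,j} times the integral of P_{a,i}(s) P_{j,b}(t-s), each divided by P_{a,b}(t). The sketch below evaluates these with a midpoint rule and assumes Q is diagonalizable; the function names, step count and example Q are illustrative choices.

```python
import numpy as np

def transition_matrix(Q, t):
    """P(t) = exp(Qt) through an eigen-decomposition (assumes Q is diagonalizable)."""
    lam, V = np.linalg.eig(Q)
    return np.real((V * np.exp(lam * t)) @ np.linalg.inv(V))

def expected_time_and_jumps(Q, a, b, t, steps=400):
    """Conditional expectations of T(j) and N(i, j) given start a and end b."""
    n = len(Q)
    integral = np.zeros((n, n))        # integral[i, j] ~ int_0^t P_ai(s) P_jb(t - s) ds
    for k in range(steps):
        s = (k + 0.5) * t / steps      # midpoint rule
        integral += np.outer(transition_matrix(Q, s)[a],
                             transition_matrix(Q, t - s)[:, b])
    integral *= t / steps
    p_ab = transition_matrix(Q, t)[a, b]
    time_in = np.diag(integral) / p_ab             # E[T(j)], j = 0..n-1
    jumps = Q * integral / p_ab                    # E[N(i, j)] for i != j
    np.fill_diagonal(jumps, 0.0)
    return time_in, jumps

# Hypothetical three-state example with no direct 0 -> 2 rate.
Q = np.array([[-1.0, 1.0, 0.0], [0.5, -1.5, 1.0], [0.0, 2.0, -2.0]])
time_in, jumps = expected_time_and_jumps(Q, 0, 2, 1.0)
```

The expected times sum to t, which follows from the Chapman-Kolmogorov equation and serves as a quick sanity check.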
Codon based Models
Goldman, Yang + Muse, Gaut
i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3660) off-diagonal entries.
Effects to account for:
i. Bias in nucleotide usage. ii. Bias in codon usage. iii. Bias in amino acid usage. iv. Synonymous/non-synonymous distinction. v. Amino acid distance. vi. Transition/transversion bias.
For codon i and codon j differing by one nucleotide:
$q_{i,j} = \alpha \pi_j \exp(-d_{i,j}/V)$ if they differ by a transition
$q_{i,j} = \beta \pi_j \exp(-d_{i,j}/V)$ if they differ by a transversion
- $d_{i,j}$ is a physico-chemical difference between amino acid i and amino acid j. V is a factor that reflects the variability of the gene involved.
Dayhoff's empirical approach (1970)
Take a set of closely related proteins, count all differences and make a symmetric difference matrix, since the direction of time cannot be observed.
History of Phylogenetic Methods & Stochastic Models
1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1969 Jukes and Cantor propose a simple model for amino acid evolution.
1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973 Sankoff treats alignment and phylogenies as one general problem – phylogenetic alignment.
1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.
1979 Kimura introduces transition/transversion bias in the nucleotide model in response to the publication of mitochondrial sequences.
1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the program package PHYLIP). Simple nucleotide model with equilibrium bias.
1981 Parsimony tree problem is shown to be NP-Complete.
1985 Felsenstein introduces bootstrapping as confidence interval on phylogenies.
1985 Hasegawa, Kishino and Yano combine transition/transversion bias with unequal equilibrium frequencies.
1986 Bandelt and Dress introduce split decomposition as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.
1991 Gillespie’s book proposes “lumpy” evolution.
1994 Goldman & Yang + Muse & Gaut introduce codon based models.
1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
2000 Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping reading frames.
2001- Major rise in the interest in phylogenetic statistical alignment.
2001- Comparative genomics underlines the functional importance of molecular evolution.