Interpretation of exponentiation + eigenvalue decomposition

The terms in the series expansion of P(t) = exp(Qt) do not individually have a direct interpretation. The first, I, is the trivial transition function, and the remaining terms have negative entries and zero row sums. If the CTMC has the same exit rate q for all states, then:

$$P(t) = \sum_{n \ge 0} e^{-qt}\,\frac{(qt)^n}{n!}\,(Q')^n$$

where Q' = I + Q/q is the matrix of single-step transition probabilities. Without identical exit rates, I don't know a simple interpretation.

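If Q is reversible, i.e. weighted symmetric with Qi,j = πj Qj,i/πi, then it is diagonalizable: with Π = diag(π), the matrix Π^(1/2) Q Π^(-1/2) is symmetric and can be written U D U^T.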

If Q has distinct eigenvalues, then it will also have simple expressions for Pi,j(t)
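A minimal numerical sketch of the eigenvalue route, assuming NumPy and a hypothetical F81-style reversible rate matrix (the frequencies `pi` are made up for illustration). The symmetrised matrix Π^(1/2) Q Π^(-1/2) is diagonalised with `numpy.linalg.eigh`, so each Pi,j(t) becomes a weighted sum of terms exp(λk t).

```python
import numpy as np

# Hypothetical equilibrium frequencies (A, C, G, T); not taken from the slides.
pi = np.array([0.1, 0.2, 0.3, 0.4])

# F81-style reversible rate matrix: Q[i, j] = pi[j] for i != j, rows sum to zero.
Q = np.tile(pi, (4, 1))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))

# Reversibility makes S = Pi^(1/2) Q Pi^(-1/2) symmetric, so eigh applies.
root = np.sqrt(pi)
S = root[:, None] * Q / root[None, :]
eigenvalues, U = np.linalg.eigh(S)


def transition_matrix(t):
    """P(t) = Pi^(-1/2) U exp(D t) U^T Pi^(1/2)."""
    middle = U @ np.diag(np.exp(eigenvalues * t)) @ U.T
    return middle * root[None, :] / root[:, None]


P = transition_matrix(0.5)
print(np.round(P, 4))
print(P.sum(axis=1))  # each row of a transition matrix sums to one
```

For this particular Q the eigenvalues are 0 and -1 (three times), so P(t) reduces to exp(-t) I + (1 - exp(-t)) 1π, which gives an easy check on the decomposition.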

Then a little rearrangement gives:

[Figure: Continuous Time Markov Chain on (0, t), with jump times ti, ti+1, = Discrete Time Markov Chain, with jumps 0, ..., i, i+1, ..., n, + Poisson Process]

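Kimura 2-parameter model - K80

Q (rows: FROM, columns: TO):

      A   C   G   T
  A   -   β   α   β
  C   β   -   β   α
  G   α   β   -   β
  T   β   α   β   -

a = α*t, b = β*t

P(t):

$$P(\text{no change}) = 0.25\,\bigl(1 + e^{-4b} + 2e^{-2(a+b)}\bigr)$$

$$P(\text{transition}) = 0.25\,\bigl(1 + e^{-4b} - 2e^{-2(a+b)}\bigr)$$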
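As a check on the K80 closed forms, the following sketch compares them with a numerically exponentiated rate matrix. It assumes NumPy, the state order A, C, G, T, and hypothetical values for α, β and t. The helper `matrix_exponential` is an illustrative truncated Taylor series with scaling and squaring, not a library routine.

```python
import numpy as np


def k80_rate_matrix(alpha, beta):
    """K80 rate matrix, state order A, C, G, T (A<->G and C<->T are transitions)."""
    Q = np.full((4, 4), float(beta))
    Q[0, 2] = Q[2, 0] = Q[1, 3] = Q[3, 1] = alpha
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q


def k80_probabilities(a, b):
    """Closed forms with a = alpha*t and b = beta*t."""
    same = 0.25 * (1 + np.exp(-4 * b) + 2 * np.exp(-2 * (a + b)))
    transition = 0.25 * (1 + np.exp(-4 * b) - 2 * np.exp(-2 * (a + b)))
    transversion = 0.25 * (1 - np.exp(-4 * b))  # to one specific transversion target
    return same, transition, transversion


def matrix_exponential(M, terms=20, squarings=10):
    """exp(M) by a truncated Taylor series with scaling and squaring."""
    X = M / 2.0 ** squarings
    result = np.eye(len(M))
    term = np.eye(len(M))
    for n in range(1, terms + 1):
        term = term @ X / n
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


alpha, beta, t = 0.3, 0.1, 1.0  # hypothetical values
P = matrix_exponential(k80_rate_matrix(alpha, beta) * t)
print(P[0, 0], P[0, 2], P[0, 1])        # A->A, A->G (transition), A->C (transversion)
print(k80_probabilities(alpha * t, beta * t))
```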

Felsenstein 81 & Hasegawa, Kishino & Yano 85

Unequal base composition (Felsenstein, 1981; F81):

Qi,j = C*πj for i ≠ j

Tv/Tr & composition bias (Hasegawa, Kishino & Yano, 1985; HKY85):

Qi,j = ( )*C*πj if i->j is a transition
Qi,j = C*πj if i->j is a transversion

Rates to frequent nucleotides are high, π = (πA, πC, πG, πT).
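A small sketch of how such a rate matrix can be built and checked for reversibility, assuming NumPy. The transition multiplier is called `kappa` here, which is a hypothetical name since the symbol is not legible in the slide. The frequencies are the HIV1 estimates quoted later in these slides, and `kappa = 3.0` is made up.

```python
import numpy as np

NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(pi, kappa, C=1.0):
    """Q[i, j] = kappa*C*pi[j] for transitions, C*pi[j] for transversions."""
    Q = np.zeros((4, 4))
    for i, x in enumerate(NUCLEOTIDES):
        for j, y in enumerate(NUCLEOTIDES):
            if i != j:
                factor = kappa if (x, y) in TRANSITIONS else 1.0
                Q[i, j] = factor * C * pi[j]
        Q[i, i] = -Q[i].sum()  # diagonal makes the row sum to zero
    return Q


pi = np.array([0.361, 0.181, 0.236, 0.222])  # A, C, G, T
Q = hky85_rate_matrix(pi, kappa=3.0)         # kappa = 1 gives F81

flow = pi[:, None] * Q                       # flow[i, j] = pi_i * Q_ij
print(np.allclose(flow, flow.T))             # detailed balance: no net flows
print(np.allclose(pi @ Q, 0.0))              # pi is the equilibrium distribution
```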

F81: Tv/Tr = (πT πC + πA πG)/[(πT + πC)(πA + πG)]

HKY85: Tv/Tr = ( ) (πT πC + πA πG)/[(πT + πC)(πA + πG)]

[Figure: the four nucleotides A, G, T, C]

Group 3, Symmetric 6, Reversible 9 and General 12 models

Time reversible:

Symmetric:

General:

[Figure: rate diagram on the four nucleotides C, A, G, T]

Kimura 3 parameter 1980, Evans and Speed 1993:

Can be interpreted as random walk on Z2*Z2

• Often only differences can be observed, leading to a symmetric matrix

• Symmetric matrices have uniform equilibrium distributions

Can be obtained from differences and equilibrium distributions

Non-reversible models allow rooting with only 2 sequences

Alternative condition for time reversibility: no net flows, i.e. πi Qi,j = πj Qj,i for all i, j.

From Nucleotide to Sequence

• Context-dependent models

Genome:

Dinucleotides

..ACGGA..

• Di-nucleotide events

ACGGAGT

ACGTCGT

• Rate Variation

ATTGCGTCCAAT    ATTGCGTCCAAT
ATGGCGTCC T AT  ATTGCGTGCAAT
ATTGCGTCC A AT  ATTGCGTCCGAT

Each nucleotide evolves independently.

Di-nucleotide events

Averof et al. (2000) “Evidence for High Frequency of Simultaneous Double-Nucleotide Substitutions” Science 287.1283-.
Smith et al. (2003) “A Low rate of Simultaneous Double-Nucleotide Mutations in Primates” Mol. Biol. Evol. 20.1.47-53

[Figure: the pair ACGGAGT / ACGTCGT explained either by single nucleotide events (singlet + singlet) or by a double event (doublet)]

Assuming JC69 + doublet mutations.

Averof et al. (2000): doublet mutation rate 10^-8, ~10% of the singlet rate.
Smith et al. (2003): much less, for a larger, more reliable data set.

Context-dependent models

From singlet models to doublet models:

Independence

Independence with CG avoidance

Strand symmetry

Only single events

Single events with simple double events

Contagious Dependence:

Pedersen and Jensen, 2001

Siepel and Haussler, 2003


Rate variation between sites: iid for each site

The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. Γ(α, β) has density $\alpha^{\beta} x^{\beta-1} e^{-\alpha x}/\Gamma(\beta)$, where α is the (inverse) scale parameter and β the form (shape) parameter.

Let L(pi, θ, t) be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process and f(ri) the continuous distribution of rate(s). Then

$$L = \prod_i \int L(p_i, \theta, r_i t)\, f(r_i)\, dr_i$$
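The integral over rates can be illustrated with the simplest possible "pattern", assuming NumPy: the probability that two JC69 sequences differ at a site at distance d, averaged over a gamma distribution of rates with mean one (shape and rate parameter equal). The function names and the values of `d` and `shape` are hypothetical. For this case the integral has a closed form, which the trapezoid-rule quadrature should reproduce.

```python
import math
import numpy as np


def jc69_difference_probability(d):
    """P(two sequences differ at a site) under JC69 at distance d."""
    return 0.75 * (1.0 - np.exp(-4.0 * d / 3.0))


def gamma_density(r, shape):
    """Gamma density with mean one (rate parameter equal to the shape)."""
    return shape ** shape * r ** (shape - 1) * np.exp(-shape * r) / math.gamma(shape)


def averaged_difference_probability(d, shape, upper=60.0, points=600001):
    """Trapezoid-rule approximation of the integral of L(r) f(r) dr."""
    r = np.linspace(0.0, upper, points)
    y = jc69_difference_probability(d * r) * gamma_density(r, shape)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(r)) / 2.0)


d, shape = 0.5, 2.0  # hypothetical distance and shape parameter
numeric = averaged_difference_probability(d, shape)
closed_form = 0.75 * (1.0 - (1.0 + 4.0 * d / (3.0 * shape)) ** (-shape))
print(numeric, closed_form)
```

In practice the continuous distribution is often replaced by a few discrete rate categories; the quadrature above is only meant to make the integral concrete.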

Measuring Selection

Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.

The selection criteria could in principle be anything, but the selection against amino acid changes is without comparison the most important.

[Figure: a history of two-codon sequences with their amino acids: ACGTCA (Thr Ser), ACGCCA (Thr Pro), AGGCCG (Arg Ser), ACGCCG (Thr Ser), ACTCTG (Thr Ser), GCTCTG (Ala Ser), GCACTG (Ala Ser); two of the events are marked "-"]

The Genetic Code

3 classes of sites:

i. 4
ii. 2-2
iii. 1-1-1-1

Problems:

i. Not all fit into those categories.
ii. Change in one site can change the status of another.

[Figure: the genetic code table, with examples marked: 4 (3rd), 1-1-1-1 (3rd), TA (2nd)]

Possible events in the genetic code (remade from Li, 1997)

Substitutions          Number   Percent
Total in all codons    549      100
Synonymous             134      25
Nonsynonymous          415      75
  Missense             392      71
  Nonsense             23       4

Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides).
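The counts in this table can be reproduced by enumerating all single-nucleotide changes of the 61 sense codons. The sketch below assumes the standard genetic code, written as the usual 64-letter string in TCAG order; the names `CODE` and `count_substitutions` are only illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (TCAG order).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_substitutions():
    """Classify every single-nucleotide change out of a sense codon."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, amino_acid in CODE.items():
        if amino_acid == "*":
            continue  # stop codons are not starting points
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                if CODE[mutant] == "*":
                    counts["nonsense"] += 1
                elif CODE[mutant] == amino_acid:
                    counts["synonymous"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = count_substitutions()
print(counts, sum(counts.values()))
```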

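Kimura's 2 parameter model & Li's Model.

Selection on the 3 kinds of sites, rates (transition, transversion) relative to (α, β):

1-1-1-1 (f*α, f*β)
2-2 (α, f*β)
4 (α, β)

Probabilities:

$$P(\text{no change}) = 0.25\,\bigl(1 + e^{-4b} + 2e^{-2(a+b)}\bigr)$$

$$P(\text{transition}) = 0.25\,\bigl(1 + e^{-4b} - 2e^{-2(a+b)}\bigr)$$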

Sites     Total   Conserved      Transitions   Transversions
1-1-1-1   274     246 (.8978)    12 (.0438)    16 (.0584)
2-2       77      51 (.6623)     21 (.2727)    5 (.0649)
4         78      47 (.6026)     16 (.2051)    15 (.1923)

alpha-globin from rabbit and mouse.

Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
  *   *  *    *  *  *         * **
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile

X(αt, βt) = .25[1 + exp(-4βt) + 2exp(-2t(α+β))]   identity
Y(αt, βt) = .25[1 + exp(-4βt) - 2exp(-2t(α+β))]   transition
Z(αt, βt) = .50[1 - exp(-4βt)]   transversion

L(observations, a, b, f) = C(429; 274, 77, 78) * {X(a*f, b*f)^246 * Y(a*f, b*f)^12 * Z(a*f, b*f)^16} * {X(a, b*f)^51 * Y(a, b*f)^21 * Z(a, b*f)^5} * {X(a, b)^47 * Y(a, b)^16 * Z(a, b)^15}

where a = αt and b = βt.

Estimated Parameters: a = 0.3003 b = 0.1871 2*b = 0.3742 (a + 2*b) = 0.6745 f = 0.1663

          Transitions     Transversions
1-1-1-1   a*f = 0.0500    2*b*f = 0.0622
2-2       a = 0.3004      2*b*f = 0.0622
4         a = 0.3004      2*b = 0.3741

Expected number of replacement substitutions: 35.49; synonymous: 75.93
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
Ks = .6644, Ka = .1127
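A sketch of how the likelihood for the three site classes could be evaluated, using only the standard library. It assumes the standard K80 probabilities (identity X, transition Y, both transversions together Z, with a = αt and b = βt) and omits the multinomial constant C. The function names are hypothetical. The counts and estimates are those given in the slides.

```python
import math

# Observed counts (conserved, transitions, transversions) per site class.
COUNTS = {"1-1-1-1": (246, 12, 16), "2-2": (51, 21, 5), "4": (47, 16, 15)}


def k80_xyz(a, b):
    """Probabilities of identity, transition and (any) transversion."""
    x = 0.25 * (1 + math.exp(-4 * b) + 2 * math.exp(-2 * (a + b)))
    y = 0.25 * (1 + math.exp(-4 * b) - 2 * math.exp(-2 * (a + b)))
    z = 0.50 * (1 - math.exp(-4 * b))
    return x, y, z


def site_class_parameters(a, b, f):
    """Selection factor f scales the rates that change the amino acid."""
    return {"1-1-1-1": (a * f, b * f), "2-2": (a, b * f), "4": (a, b)}


def log_likelihood(a, b, f):
    """Log-likelihood without the multinomial constant."""
    total = 0.0
    for site_class, (ai, bi) in site_class_parameters(a, b, f).items():
        for count, probability in zip(COUNTS[site_class], k80_xyz(ai, bi)):
            total += count * math.log(probability)
    return total


a_hat, b_hat, f_hat = 0.3003, 0.1871, 0.1663  # estimates quoted in the slides
print(log_likelihood(a_hat, b_hat, f_hat))
for site_class, (ai, bi) in site_class_parameters(a_hat, b_hat, f_hat).items():
    print(site_class, [round(p, 4) for p in k80_xyz(ai, bi)])
```

Maximising `log_likelihood` over (a, b, f) with a generic optimiser would be the natural next step; it is left out to keep the sketch short.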

Probabilities of different paths

• Number of events • Kinds of events

Starting in A ending in B after time t

Rate of going from i to j: qi,j

[Figure: paths from A to B through intermediate states S1, S2, S3, ..., Sk]

Key questions (conditional/unconditional):

• Time spent at different states

O’Brien, Minin, and Marc A. Suchard. Learning to Count: Robust Estimates for Labeled Distances between Molecular Sequences. Mol. Biol. Evol. 26(4):801–814. 2009
Vladimir N Minin and Marc A Suchard. Fast, accurate and simulation-free stochastic mapping. Phil. Trans. R. Soc. B 363, 3995, 2008
Vladimir N. Minin and Marc A. Suchard. Counting labeled transitions in continuous-time Markov models of evolution. J. Math. Biol. (2008) 56:391–412

Generalize to a phylogeny

• Time to get from A to B

• Very liked and dis-liked sets

• Probability of only visiting {S}

[Figure: a phylogeny with leaves A, B, C, branch lengths t1, t2, t3 and possible ancestral states S1, S2, S3, ..., Sk]

• Distribution of ancestor state X: P(X | A,B,C) = P(A,B,C,X)/P(A,B,C)

• Which edges/nodes carry most/least probability? Ranked lists of edges/nodes.
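The distribution of the ancestral state can be sketched for the simplest case, assuming NumPy: a star tree with three leaves, JC69 on each branch and a uniform distribution at the ancestor. The function names and the leaf states and branch lengths in the example are hypothetical.

```python
import numpy as np

STATES = "ACGT"


def jc69_transition_matrix(t):
    """JC69 transition probabilities for branch length t."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    different = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), different) + (same - different) * np.eye(4)


def ancestor_posterior(leaves, branch_lengths):
    """P(X | leaves) = P(leaves, X) / P(leaves) for a star tree."""
    joint = np.full(4, 0.25)  # P(X): uniform JC69 equilibrium
    for leaf, t in zip(leaves, branch_lengths):
        joint = joint * jc69_transition_matrix(t)[:, STATES.index(leaf)]
    return joint / joint.sum()  # dividing by P(A, B, C)


posterior = ancestor_posterior("AAC", (0.1, 0.1, 0.1))
print(dict(zip(STATES, np.round(posterior, 4))))
```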

Summary of Substitution Models

• Extensions to the basic model

• Rate heterogeneity

• Context Dependent Models

• Codons

• From nucleotide to sequence

• Independence of nucleotides

• Assumptions behind substitution models

• Continuous time Markov Chain

• Only substitutions

• Independence and identity of positions

• From P to Q & from Q to P

• Independence of lineages

• The simplest model: Jukes-Cantor

• Ancestral Analysis – conditioning on start and finish

0 t1 t2 T

From Continuous to Discrete Time

Koskinen, J. (2004) Bayesian Inference for Longitudinal Social Networks. Research Report, number 2004:4, Stockholm University, Department of Statistics.
Koskinen, J. and Snijders, T. (2007) Bayesian inference for dynamic social network data, Journal of Statistical Planning and Inference, 137, 3930--3938.
R. Sharan, T. Ideker, Modeling cellular machinery through biological network comparison, Nature Biotechnology, 24, 427 (2006).
Snijders, T. (2001) “Statistical evaluation of social networks dynamics” in Sociological Methodology, by Michael Sobel.
Snijders, T. et al. (2008) “Maximum Likelihood Evaluation for Social Network Dynamics” In press.
I. Miklos, G.A. Lunter and I. Holmes (2004) A "long indel" model for evolutionary sequence alignment. Mol. Biol. Evol. 21(3):529-540.

Appendix A

• Sum over i state assignments gives probability of paths of length i.

• Integrating over all waiting times (t1,..,ti) and state assignments of length i gives the probability of a specific trajectory.

• Sum over all path lengths gives probability of N turning into N’.

• The above expression can be shown to be of the form

And O(N^2) recursions exist to calculate the coefficients.

Correlated Mutations

Motivation: Models often assume independence between sites, or between sites and phenotype. That assumption might not be warranted, so methods for detecting correlation are of great use.

Ideal situation – complete history known.

Real situation – end points known, little power.

Many end points – a phylogeny might give power.

• Explicit modelling

• Ancestral Analysis

[Figure: a single binary state (-, +) and the joint states of a pair (--, -+, +-, ++)]

• Multiple Testing: all pairs – n(n-1)/2; site – phenotypic character – n

Transition Path Sampling Algorithm/MCMC

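[Figure: two paths between P1 and P2 with step probabilities p1, ..., p6]

Path 1 - probability: p1 p2 p3

Path 2 - probability: p1 p4 p5 p6

Local modification of Path 1 into Path 2.

Set of paths, with x and x' two paths:

Likelihood - L(x)

Probability of going from x to x' - q(x, x')

Acceptance ratio: min{1, [L(x') q(x', x)] / [L(x) q(x, x')]}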

Discrete Space:

Continuous Space – reversible jump MCMC (Green, 1995)

The acceptance ratio will have to be weighted by a Jacobian, J.

Typically much slower as continuous case includes stochastic integration

Simulating trajectories that end in B at time t

Hobolth and Stone (2009) Efficient simulation from finite-state, continuous-time Markov chains with incomplete observations.

[Figure: trajectories on (0, t) leaving A for the states S1, S2, S3, ..., Sk with rates q01, q02, q03, ..., q0k, and entering B from S1, S2, S3, ..., Sk with rates q1B, q2B, q3B, ..., qkB]

Challenge for large state space, E(steps) large and Pa,b(t) small:

Algorithm (forward rejection sampling)

Sample paths unconditionally

Keep paths ending in B at time t

Normalize their probability by dividing by PA,B(t)

Can be modified to be more efficient if Paa(t) has high probability
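The forward rejection sampler described above can be sketched as follows, assuming NumPy and a JC69 chain with total exit rate one as a toy example. The function names are hypothetical. The fraction of accepted paths estimates PA,B(t), which is exactly why the method becomes expensive when that probability is small.

```python
import numpy as np


def simulate_path(Q, start, t, rng):
    """Simulate an unconditional CTMC path on [0, t]; returns jump times and states."""
    times, states = [0.0], [start]
    now, state = 0.0, start
    while True:
        rate = -Q[state, state]
        now += rng.exponential(1.0 / rate)   # waiting time in the current state
        if now >= t:
            return times, states
        probabilities = Q[state].copy()
        probabilities[state] = 0.0
        probabilities /= rate                # jump chain probabilities
        state = int(rng.choice(len(Q), p=probabilities))
        times.append(now)
        states.append(state)


def rejection_sample(Q, a, b, t, n, rng):
    """Keep only the paths from a that are in state b at time t."""
    kept = []
    for _ in range(n):
        path = simulate_path(Q, a, t, rng)
        if path[1][-1] == b:
            kept.append(path)
    return kept


Q = np.full((4, 4), 1.0 / 3.0)               # JC69 with exit rate 1
np.fill_diagonal(Q, -1.0)
rng = np.random.default_rng(0)
paths = rejection_sample(Q, a=0, b=1, t=1.0, n=2000, rng=rng)
print(len(paths) / 2000, 0.25 * (1 - np.exp(-4.0 / 3.0)))
```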

Sample discrete jump transitions according to the conditional jump process

- Real jumps
- Self jumps

The Poisson Process – tag the red stars !!

Sample jump points according to the Poisson Process

Create Uniformized Process:

μ := maxi (-qii)

Q’: qii := -μ

Interpret increased exit rates as self-jumps

R := I + Q’/μ

Conditional jump probabilities / Unconditional jump probabilities

[Figure: the jumps 0, ..., i, i+1, ..., n of the uniformized process]
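A numerical sketch of uniformization, assuming NumPy and the usual convention R = I + Q/μ with Q the original rate matrix and μ = maxi(-qii). That is how the construction above is read here, and it is an assumption. With it, P(t) is a Poisson(μt) mixture of powers of R. The rate matrix is hypothetical, and `matrix_exponential` is an illustrative helper used only for comparison.

```python
import math
import numpy as np


def uniformized_transition_matrix(Q, t, tolerance=1e-12):
    """P(t) = sum_n Poisson(n; mu*t) R^n with R = I + Q/mu."""
    mu = max(-np.diag(Q))
    R = np.eye(len(Q)) + Q / mu              # self-jumps absorb the slack in exit rates
    weight = math.exp(-mu * t)               # Poisson probability of n = 0 jumps
    power = np.eye(len(Q))
    P = weight * power
    n, accumulated = 0, weight
    while 1.0 - accumulated > tolerance and n < 10000:
        n += 1
        weight *= mu * t / n
        power = power @ R
        P = P + weight * power
        accumulated += weight
    return P


def matrix_exponential(M, terms=20, squarings=10):
    """exp(M) by a truncated Taylor series with scaling and squaring."""
    X = M / 2.0 ** squarings
    result = np.eye(len(M))
    term = np.eye(len(M))
    for k in range(1, terms + 1):
        term = term @ X / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


# Hypothetical rate matrix with unequal exit rates.
Q = np.array([[-1.0, 0.2, 0.6, 0.2],
              [0.3, -1.4, 0.3, 0.8],
              [0.9, 0.1, -1.1, 0.1],
              [0.2, 0.5, 0.2, -0.9]])
print(np.round(uniformized_transition_matrix(Q, 0.8), 4))
print(np.round(matrix_exponential(Q * 0.8), 4))
```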

Statistical Test of Models (Goldman, 1990)

Data: 3 sequences of length L

ACGTTGCAA ...
AGCTTTTGA ...
TCGTTTCGA ...

A. Likelihood (free multinomial model, 63 free parameters)

L1 = pAAA^#AAA * pAAC^#AAC * ... * pTTT^#TTT, where pN1N2N3 = #(N1N2N3)/L

B. Jukes-Cantor and unknown branch lengths

L2 = pAAA(l1',l2',l3')^#AAA * ... * pTTT(l1',l2',l3')^#TTT

[Figure: three-leaf tree with branch lengths l1, l2, l3 leading to the three sequences]
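The free multinomial likelihood L1 only needs the counts of the site patterns. A short sketch using the standard library and the nine columns shown in the slide follows. The function names are hypothetical, and the Jukes-Cantor log-likelihood lnL2 would have to come from a separate maximisation over the three branch lengths, which is not shown.

```python
import math
from collections import Counter

sequences = ["ACGTTGCAA", "AGCTTTTGA", "TCGTTTCGA"]


def free_multinomial_log_likelihood(seqs):
    """ln L1 = sum over patterns of #pattern * ln(#pattern / L)."""
    patterns = Counter(zip(*seqs))           # one pattern per alignment column
    length = len(seqs[0])
    return sum(n * math.log(n / length) for n in patterns.values())


def likelihood_ratio_statistic(log_l1, log_l2):
    """-2 ln Q = 2 (ln L1 - ln L2)."""
    return 2.0 * (log_l1 - log_l2)


log_l1 = free_multinomial_log_likelihood(sequences)
print(log_l1)
```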

Test statistics: I. (expected-observed)^2/expected, or II. -2 lnQ = 2(lnL1 - lnL2). JC69 (Jukes-Cantor): 3 parameters => χ² with 60 degrees of freedom.

Parametric bootstrap:
i. Maximum likelihood to estimate the parameters.
ii. Simulate with estimated model.
iii. Make simulated distribution of -2 lnQ.
iv. Where is the real -2 lnQ in this distribution?

Problems:
i. Too few observations per pattern.
ii. Many competing hypotheses.

Extension to Overlapping Regions

Hein & Stoevlbaek, 95

Rates (transition, transversion) by site class in the 1st (columns) and 2nd (rows) reading frame:

            1-1-1-1          2-2              4
1-1-1-1     (f1f2a, f1f2b)   (f2a, f1f2b)     (f2a, f2b)
2-2         (f1a, f1f2b)     (f2a, f1f2b)     (a, f2b)
4           (f1a, f1b)       (a, f1b)         (a, b)

Ziheng Yang has an alternative model to this, where sites are lumped into the same category if they have the same configuration of positions and reading frames.

Example: Gag & Pol from HIV

Number of sites by class (1-1-1-1, 2-2, 4) in the gag and pol reading frames:

            1-1-1-1   2-2   4
1-1-1-1     64        31    34
2-2         40        7     0
4           27        2     0

MLE: a = .084, b = .024, a+2b = .133, fgag = .403, fpol = .229

HIV1 Analysis

Hasegawa, Kishino & Yano Substitution Model Parameters:

α*t     β*t     A       C       G       T
0.350   0.105   0.361   0.181   0.236   0.222
0.015   0.005   0.004   0.003   0.003

Selection Factors

GAG 0.385 (s.d. 0.030)
POL 0.220 (s.d. 0.017)
VIF 0.407 (s.d. 0.035)
VPR 0.494 (s.d. 0.044)
TAT 1.229 (s.d. 0.104)
REV 0.596 (s.d. 0.052)
VPU 0.902 (s.d. 0.079)
ENV 0.889 (s.d. 0.051)
NEF 0.928 (s.d. 0.073)

Estimated Distance per Site: 0.194

Open Problem II: Example Neural Networks (NN)

Independent instances / Related by a phylogeny

Motivation: To combine methods that define patterns from a series of independent instances with patterns inferred from instances related by a phylogenetic tree.

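Basic Equations I

Chapman-Kolmogorov (from i to j via k, t = t1 + t2):

$$P_{i,j}(t_1 + t_2) = \sum_k P_{i,k}(t_1)\,P_{k,j}(t_2)$$

Matrix version:

$$P(t_1 + t_2) = P(t_1)\,P(t_2)$$

Initial Condition:

$$P(0) = I$$

Backward Equation:

$$P'(t) = Q\,P(t)$$

Forward Equation:

$$P'(t) = P(t)\,Q$$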


Fast/Slowly Evolving States

Felsenstein & Churchill, 1996

[Figure: alignment of k sequences (rows 1, ..., k) over n positions (columns 1, ..., n); each position is either slow (rate rs) or fast (rate rf)]

HMM:

• πr - equilibrium distribution of hidden states (rates) at first position

• pi,j - transition probabilities between hidden states

• L(j,r) - likelihood for j’th column given rate r.

• L(j,r) - likelihood for first j columns given j’th column has rate r.

Likelihood Recursions:

Likelihood Initialisations:
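The recursion can be sketched as a standard HMM forward pass, assuming NumPy: the forward value for column j and rate r is the column likelihood times the sum, over the previous rate s, of the previous forward value times ps,r. All numbers below (column likelihoods, transition probabilities between the slow and fast state) are hypothetical, and the brute-force sum over all rate paths is included only to check the recursion.

```python
import itertools
import numpy as np


def forward_likelihood(column_likelihoods, start, transitions):
    """Likelihood of all columns, summing over hidden rate paths."""
    forward = start * column_likelihoods[0]            # initialisation
    for likelihood in column_likelihoods[1:]:
        forward = likelihood * (forward @ transitions)  # recursion
    return float(forward.sum())


def brute_force_likelihood(column_likelihoods, start, transitions):
    """Same quantity by explicit enumeration of every rate path."""
    n, k = column_likelihoods.shape
    total = 0.0
    for path in itertools.product(range(k), repeat=n):
        p = start[path[0]] * column_likelihoods[0, path[0]]
        for j in range(1, n):
            p *= transitions[path[j - 1], path[j]] * column_likelihoods[j, path[j]]
        total += p
    return total


transitions = np.array([[0.9, 0.1],      # slow -> slow, slow -> fast
                        [0.2, 0.8]])     # fast -> slow, fast -> fast
start = np.array([2.0 / 3.0, 1.0 / 3.0])  # equilibrium of the matrix above
column_likelihoods = np.array([[0.020, 0.010],   # column 1 given slow, fast
                               [0.001, 0.004],
                               [0.030, 0.020],
                               [0.0005, 0.003]])
print(forward_likelihood(column_likelihoods, start, transitions))
```

The forward pass costs on the order of n*k^2 operations for n columns and k rate classes, against k^n terms for the enumeration.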

Basic Equations

Hobolth, A. and Jensen, J.L. (2005). Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statistical applications in Genetics and Molecular Biology, 4, 18

Expected time spent in j, T(j), in going from a to b:

[Figure: a path from a at time 0 to b at time n, in state i at time s]

Expected number of transitions from i to j, N(i,j), in going from a to b:

[Figure: a path from a at time 0 to b at time n, with a jump from i to j (rate qi,j) at time s]

Higher moments and combinations of N( ) and T( ) can be calculated using the same reasoning.

Evaluation of Eab(N1(),..,Nr(),T1(),..,Tm()) would involve the evaluation of an at most n+m dimensional integral.
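The two expectations can be sketched numerically, assuming NumPy and the standard integral representations, which are not legible in the slides and are therefore an assumption here: E[T(j)] is the integral over s in (0, t) of Pa,j(s) Pj,b(t-s), divided by Pa,b(t), and E[N(i,j)] is qi,j times the integral of Pa,i(s) Pj,b(t-s), divided by Pa,b(t). The integrals are approximated by the trapezoid rule, the chain is a JC69 toy example, and the function names are hypothetical.

```python
import numpy as np


def matrix_exponential(M, terms=20, squarings=10):
    """exp(M) by a truncated Taylor series with scaling and squaring."""
    X = M / 2.0 ** squarings
    result = np.eye(len(M))
    term = np.eye(len(M))
    for k in range(1, terms + 1):
        term = term @ X / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def conditional_expectations(Q, a, b, t, steps=400):
    """Expected time per state and expected jump counts, given a at 0 and b at t."""
    h = t / steps
    step = matrix_exponential(Q * h)
    powers = [np.eye(len(Q))]                 # powers[k] = P(k*h)
    for _ in range(steps):
        powers.append(powers[-1] @ step)
    weights = np.full(steps + 1, h)           # trapezoid weights
    weights[0] = weights[-1] = h / 2.0
    # integral[i, j] approximates the integral of P_ai(s) * P_jb(t - s) ds
    integral = sum(w * np.outer(powers[k][a, :], powers[steps - k][:, b])
                   for k, w in enumerate(weights))
    p_ab = powers[-1][a, b]
    time_in_state = np.diag(integral) / p_ab
    jumps = Q * integral / p_ab
    np.fill_diagonal(jumps, 0.0)
    return time_in_state, jumps


Q = np.full((4, 4), 1.0 / 3.0)                # JC69 with exit rate 1
np.fill_diagonal(Q, -1.0)
time_in_state, jumps = conditional_expectations(Q, a=0, b=1, t=1.0)
print(np.round(time_in_state, 4), time_in_state.sum())
print(np.round(jumps, 4), jumps.sum())
```

By Chapman-Kolmogorov the expected times must add up to t, which gives a convenient sanity check.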

Codon based Models

Goldman, Yang + Muse, Gaut

i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3660) off-diagonal entries.

i. Bias in nucleotide usage.
ii. Bias in codon usage.
iii. Bias in amino acid usage.
iv. Synonymous/non-synonymous distinction.
v. Amino acid distance.
vi. Transition/transversion bias.

For codon i and codon j differing by one nucleotide:

qi,j = α pj exp(-di,j/V) if they differ by a transition
qi,j = β pj exp(-di,j/V) if they differ by a transversion

di,j is a physico-chemical difference between amino acid i and amino acid j. V is a factor that reflects the variability of the gene involved.

Dayhoff's empirical approach (1970)

Take a set of closely related proteins, count all differences and make symmetric difference matrix, since time direction cannot be observed.

History of Phylogenetic Methods & Stochastic Models

1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.

1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.

1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.

1967 First large molecular phylogenies by Fitch and Margoliash.

1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.

1969 Jukes-Cantor propose a simple model for amino acid evolution.

1970 Neyman analyzes a three sequence stochastic model with Jukes-Cantor substitution.

1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.

1973 Sankoff treats alignment and phylogenies as one general problem – phylogenetic alignment.

1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.

1979 Kimura introduces transition/transversion bias in the nucleotide model in response to the publication of mitochondrial sequences.

1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the program package PHYLIP). Simple nucleotide model with equilibrium bias.

1981 Parsimony tree problem is shown to be NP-Complete.

1985 Felsenstein introduces bootstrapping as confidence interval on phylogenies.

1985 Hasegawa, Kishino and Yano combine transition/transversion bias with unequal equilibrium frequencies.

1986 Bandelt and Dress introduce split decomposition as a generalization of trees.

1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.

1991 Gillespie’s book proposes “lumpy” evolution.

1994 Goldman & Yang + Muse & Gaut introduce codon based models.

1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.

2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.

2000 Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping reading frames.

2001- Major rise in the interest in phylogenetic statistical alignment.

2001- Comparative genomics underlines the functional importance of molecular evolution.
