An EM Algorithm for Inferring the Evolution of Eukaryotic Gene Structure
Liran Carmel, Igor B. Rogozin, Yuri I. Wolf and Eugene V. Koonin
NCBI, NLM, National Institutes of Health
Outline
Background and Related Work
Data Components
The Model
The Algorithm
Results – Homogeneous Evolution
Results – Heterogeneous Evolution
Summary
What are Exons and Introns
[Figure: a gene with exon1, an intron bounded by GU and AG, and exon2; splicing removes the intron, leaving an mRNA of exon1 joined to exon2]
Related work
Gilbert 2005 [hybrid; branch-specific]
Koonin 2003 [Dollo parsimony]
Csuros 2005 [ML; branch-specific]
Kenmochi 2005 [ML; branch-specific]
Stolzfus 2004 [Bayes; gene-specific]
[Figure: axis with ticks at 2, 3, 4 and a scale running from "gain" to "loss"; labels placed along it: Stolzfus; Koonin, Kenmochi, Csuros; Gilbert; Koonin, Kenmochi, Csuros]
Phylogenetic tree
[Figure: phylogenetic tree rooted at Eukaryota.
Internal nodes: AME, Unikonts, Metazoa, Coelomata, Deuterostomia, Diptera, Fungi, Ascomycota, ScAfNc, Magnoliophyta, Chordata, Vertebrata, Apicomplexa, Pezizomycotina, Amniota, Mammals.
Leaves (extant species): Dicdi, Caeel, Strpu, Cioin, Danre, Galga, Homsa, Drome, Anoga, Cryne, Schpo, Sacce, Aspfu, Neucr, Arath, Orysa, Thepa, Plafa, Roden]
Multiple alignment
HS …ATGTCGATCGTGCTCGTCGTACTCTCGTAC…
DM …ATGTGGATCGTGCTCGTCGTACTCTCGTAC…
CE …ATGTGGATTGTGCTCGTCGTACTCTCGTAC…
AT …ATGTTGATGGTGCTCGTCGTACTCTCGTAC…
SC …ATGTTGATTGTGCTCGTCGTACTCTCGTAC…
SP …ATGTTGATT---CTCGTCGTACTCTCGTAC…
Presence/absence maps (proteasome component C3)
     41 118 222 230 251 309 377 453 465 539 597 602 713
SC    0   0   0   0   0   0   0   0   0   0   0   0   0
SP    0   1   1   0   0   0   1   0   0   0   0   0   0
CE    1   1   0   0   0   0   1   0   0   0   0   0   0
DM    1   1   0   0   0   0   1   0   0   1   0   0   0
HS    1   1   0   0   1   0   1   0   1   1   1   0   0
AT    1   1   0   1   0   1   1   1   0   1   0   1   1
Strong phyletic signal
Missing data
HS …ATGTCGATCGTGCTCGTCGTACTCTCGTAC…
DM …ATGTGGATCGTGCTCGTCGTACTCTCGTAC…
CE …ATGTGGATTGTGCTCGTCGTACTCTCGTAC…
AT …ATGTTGATGGTGCTCGTCGTACTCTCGTAC…
SC …ATGTTGATTGTGCTCGTCGTACTCTCGTAC…
SP …ATGTTGATT---CTCGTCGTACTCTCGTAC…
[Figure annotation: "?" marks missing data in the alignment]
Missing data (proteasome component C3)
     41 118 222 230 251 309 377 401 453 465 539 597 602 713
SC    0   0   0   0   0   0   0   0   0   0   0   0   0   0
SP    0   1   1   0   0   0   1   ?   0   0   0   0   0   0
CE    1   1   0   0   0   0   1   0   0   0   0   0   0   0
DM    1   1   0   0   0   0   1   0   0   0   1   0   0   0
HS    1   1   0   0   1   0   1   ?   0   1   1   1   0   0
AT    1   1   0   1   0   1   1   1   1   0   1   0   1   1
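A minimal sketch of how such a presence/absence map with missing entries might be held in code. The values are copied from the table above; storing "?" as `None` and the helper names `site_column` and `missing_species` are illustrative assumptions, not part of the original slides.

```python
# Intron positions and presence/absence rows copied from the slide's table.
# None stands for a missing entry ("?"); this encoding is an assumption.
positions = [41, 118, 222, 230, 251, 309, 377, 401, 453, 465, 539, 597, 602, 713]
pattern = {
    "SC": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "SP": [0, 1, 1, 0, 0, 0, 1, None, 0, 0, 0, 0, 0, 0],
    "CE": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "DM": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "HS": [1, 1, 0, 0, 1, 0, 1, None, 0, 1, 1, 1, 0, 0],
    "AT": [1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1],
}


def site_column(pos):
    """Return the species -> state mapping for one intron position."""
    idx = positions.index(pos)
    return {species: row[idx] for species, row in pattern.items()}


def missing_species(pos):
    """List the species whose state at this position is unknown."""
    return [sp for sp, state in site_column(pos).items() if state is None]
```

Each column of the map is one observed pattern at the leaves of the tree; a `None` entry simply leaves that leaf unconstrained.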
Bayesian Network
[Figure: the same phylogenetic tree (root Eukaryota, 19 extant species as leaves), presented as a Bayesian network]
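Probability structure
root prior probability: $\Pr(q_0 = 0) = \pi_0$
transition probability: $A_{ij}(g,t)$ for gene $g$ and branch $t$ of length $\Delta_t$

$$
A(g,t) =
\begin{pmatrix}
1 - \xi_t\left(1 - e^{-\eta_g \Delta_t}\right) & \xi_t\left(1 - e^{-\eta_g \Delta_t}\right) \\
1 - (1 - \phi_t)\, e^{-\theta_g \Delta_t} & (1 - \phi_t)\, e^{-\theta_g \Delta_t}
\end{pmatrix}
$$

rows: parent in state 0, parent in state 1
columns: descendant in state 0, descendant in state 1
$\xi_t$: branch-specific gain
$\eta_g$: gene-specific gain
$\phi_t$: branch-specific loss
$\theta_g$: gene-specific loss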
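A minimal sketch of the transition probabilities for one gene and one branch. It assumes the exponential form in which a branch-specific gain coefficient multiplies 1 - exp(-gain rate × length) and a branch-specific loss coefficient reduces the retention probability exp(-loss rate × length). The function and argument names (`transition_matrix`, `xi_t`, `phi_t`, `eta_g`, `theta_g`, `delta_t`) are labels chosen here for illustration.

```python
import math


def transition_matrix(xi_t, phi_t, eta_g, theta_g, delta_t):
    """Return A with A[i][j] = Pr(descendant in state j | parent in state i).

    xi_t, phi_t    : branch-specific gain and loss coefficients (assumed names)
    eta_g, theta_g : gene-specific gain and loss rates (assumed names)
    delta_t        : branch length
    """
    gain = xi_t * (1.0 - math.exp(-eta_g * delta_t))      # 0 -> 1
    keep = (1.0 - phi_t) * math.exp(-theta_g * delta_t)   # 1 -> 1
    return [[1.0 - gain, gain],
            [1.0 - keep, keep]]
```

Under this form, a branch of zero length still loses an intron with probability equal to the branch-specific loss coefficient, and the gain probability saturates at the branch-specific gain coefficient on very long branches.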
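Rate variation across sites
gain variation: $\eta_g \to r\,\eta_g$, with $r \sim \nu\,\delta(r) + (1-\nu)\,\Gamma(r;\lambda_G)$
loss variation: $\theta_g \to r\,\theta_g$, with $r \sim \Gamma(r;\lambda_L)$
$\nu$: fraction of invariant sites
$\lambda_G$: shape parameter (gain)
$\lambda_L$: shape parameter (loss)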
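A small sketch of drawing per-site rate multipliers under this kind of scheme: gain multipliers come from a mixture of invariant sites (multiplier 0) and a gamma distribution, loss multipliers from a gamma distribution alone. Scaling each gamma to mean 1 (scale = 1/shape) is an assumption made here, as are the function names.

```python
import random


def sample_gain_multiplier(nu, shape_gain, rng):
    """With probability nu the site is invariant (multiplier 0);
    otherwise draw from a gamma distribution with mean 1 (assumed scaling)."""
    if rng.random() < nu:
        return 0.0
    return rng.gammavariate(shape_gain, 1.0 / shape_gain)


def sample_loss_multiplier(shape_loss, rng):
    """Draw a loss-rate multiplier from a gamma distribution with mean 1."""
    return rng.gammavariate(shape_loss, 1.0 / shape_loss)
```

For example, `sample_gain_multiplier(0.3, 2.0, random.Random(0))` returns either 0.0 or a positive multiplier for one site.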
Parameter Summary
Global parameters
– $\pi_0$: probability for intron absence in the root
– $\nu$: fraction of invariant sites
– $\lambda_G, \lambda_L$: shape parameters of the gamma distribution
Gene-specific parameters
– $\eta_g$: gain rate
– $\theta_g$: loss rate
Branch-specific parameters
– $\xi_t$: gain coefficient
– $\phi_t$: loss coefficient
Homogeneous vs. Heterogeneous Evolution
The number of parameters in the model:
$$4 + 2(2S - 2) + 2G$$
$S$: number of extant species
$G$: number of genes
Homogeneous Evolution: setting G = 1
Heterogeneous Evolution: fixing global parameters and branch-specific parameters
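As a worked check of the parameter count, the sketch below reads it as 4 global parameters, 2 parameters for each branch, and 2 for each gene. Counting 2S - 2 branches assumes a rooted binary tree with S leaves; the function name is hypothetical.

```python
def n_parameters(n_species, n_genes):
    """Total parameter count: global + branch-specific + gene-specific."""
    n_global = 4                       # root prior, invariant fraction, two shapes
    n_branches = 2 * n_species - 2     # rooted binary tree (assumption)
    return n_global + 2 * n_branches + 2 * n_genes


# Homogeneous setting (G = 1) with the 19 extant species shown in the tree.
homogeneous_count = n_parameters(19, 1)
```

With 19 species and G = 1 this gives 4 + 2·36 + 2 = 78 parameters; in the heterogeneous setting only the 2G gene-specific parameters remain free once the global and branch-specific ones are fixed.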
Likelihood maximization via Expectation Maximization
E-Step
– inward-outward recursions on the tree
– member in the junction-tree algorithms family
– missing data are naturally embedded
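Inward (gamma) recursion
$$\gamma(q) = \Pr\left(L(q) \mid q^P\right)$$
Here $L(q)$ denotes the leaves descending from node $q$, and $q^P$ is the parent of $q$.
[Figure: tree with node q highlighted and unobserved nodes marked "?"]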
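Inward (gamma) recursion - Initialization
$$\gamma(q) = \Pr\left(L(q) \mid q^P\right)$$
For a leaf $q_t$ at the end of branch $t$:
$$
\gamma(q_t) =
\begin{cases}
\begin{pmatrix} 1 - \xi_t\left(1 - e^{-\eta_g \Delta_t}\right) \\ 1 - (1 - \phi_t)\, e^{-\theta_g \Delta_t} \end{pmatrix} & q_t = 0 \\[2ex]
\begin{pmatrix} \xi_t\left(1 - e^{-\eta_g \Delta_t}\right) \\ (1 - \phi_t)\, e^{-\theta_g \Delta_t} \end{pmatrix} & q_t = 1 \\[2ex]
\begin{pmatrix} 1 \\ 1 \end{pmatrix} & q_t \text{ missing}
\end{cases}
$$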
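Inward (gamma) recursion - Recursion
$$\gamma(q) = \Pr\left(L(q) \mid q^P\right)$$
$$\gamma_i(q_t) = \sum_{j=0}^{1} A_{ij}(g,t)\, \gamma_j\!\left(q_t^L\right) \gamma_j\!\left(q_t^R\right)$$
Here $q_t^L$ and $q_t^R$ are the left and right children of $q_t$.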
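The inward pass can be sketched in a few lines, assuming a binary tree and a 2x2 transition matrix `A` per branch with `A[i][j] = Pr(descendant = j | parent = i)`. Leaves contribute the column of `A` matching their observed state, or a vector of ones when the state is missing; internal nodes combine their children. All function names are illustrative, and the root step assumes the prior is (pi0, 1 - pi0) over states (0, 1).

```python
def leaf_gamma(A, state):
    """gamma_i for a leaf: Pr(observed state | parent = i).
    A missing observation (None) puts no constraint on the parent."""
    if state is None:
        return [1.0, 1.0]
    return [A[0][state], A[1][state]]


def internal_gamma(A, gamma_left, gamma_right):
    """gamma_i(q) = sum_j A[i][j] * gamma_j(left child) * gamma_j(right child)."""
    return [sum(A[i][j] * gamma_left[j] * gamma_right[j] for j in (0, 1))
            for i in (0, 1)]


def root_likelihood(pi0, gamma_left, gamma_right):
    """Likelihood of all leaves: sum over root states of prior times both subtrees."""
    prior = [pi0, 1.0 - pi0]
    return sum(prior[j] * gamma_left[j] * gamma_right[j] for j in (0, 1))
```

Usage on a three-leaf tree ((a, b), c): compute `leaf_gamma` for a, b and c with their own branch matrices, pass the results for a and b to `internal_gamma` with the matrix of the branch above their common ancestor, then combine that result with the one for c in `root_likelihood`.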
Outward (alpha) recursion
$$\alpha\left(q, q^P\right) = \Pr\left(q, q^P \mid L(q_0)\right)$$
Likelihood maximization via EM
E-Step
– inward-outward recursions on the tree
– member in the junction-tree algorithms family
– missing data are naturally embedded
M-Step
– low-tolerance variable-by-variable maximization
– Newton-Raphson
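A minimal sketch of a one-variable Newton-Raphson update with a loose stopping tolerance, of the kind that could be applied to each parameter in turn during an M-step. The slides do not give the objective being maximized, so the function below takes the first and second derivatives as arguments; the name `newton_maximize`, the default tolerance, and the clamping bounds are all assumptions.

```python
def newton_maximize(grad, hess, x0, tol=1e-3, max_iter=50,
                    lo=1e-9, hi=1.0 - 1e-9):
    """Maximize a one-variable function on (lo, hi) by Newton-Raphson.

    grad, hess : callables returning the first and second derivative
    tol        : loose stopping tolerance on the step size
    """
    x = x0
    for _ in range(max_iter):
        h = hess(x)
        if h >= 0:                 # not locally concave: stop without updating
            break
        x_new = x - grad(x) / h    # Newton step toward a zero of the gradient
        x_new = min(max(x_new, lo), hi)   # keep the parameter inside its bounds
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

As a toy example, maximizing `3*log(x) + 7*log(1 - x)` with `grad = lambda x: 3/x - 7/(1 - x)` and `hess = lambda x: -3/x**2 - 7/(1 - x)**2` converges to the closed-form maximum at 0.3. A loose tolerance is reasonable inside EM because the outer iterations refine every parameter again.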
Intron density in ancient eukaryotes
[Figure: intron density estimates shown alongside those of Gilbert 2005, Koonin 2003, Csuros 2005, Kenmochi 2005, and Stolzfus 2004; axis ticks at 2, 3, 4]
Evolutionary Landscape
[Figure: the phylogenetic tree (root Eukaryota, 19 extant species) annotated with the categories loser, gainer, stable, and dynamic]
Modes of Evolution
[Figure: scatter plot. x-axis: total loss rate [1/BYA], ticks 0 to 6. y-axis: total gain rate [1/BYA], ticks 0 to 0.025.
Labeled points: Deuterostomia, Metazoa, Unikonts, Ascomycota, Sacce, Diptera, ScAfNc, Fungi, Roden, Caeel, Anoga, Schpo, Chordata, Pezizomycotina, Cioin.
Region annotations (second build of the slide): loser, gainer, stable, dynamic]
234 genes
295 genes
187 genes
Gene Characteristics
New features of genes:
– Intron gain rate
– Intron loss rate
Old features of genes:
– Expression level
– Evolutionary rate
– Lethality
– Connectivity in protein-protein interactions
– Connectivity in genetic interactions
Combined Features
[Figure: scatter plot. x-axis: Status, ticks -0.5 to 0.5. y-axis: Adaptability, ticks -0.3 to 0.7. Labeled points: NP, GI, PGL, ER, EL, PPI, KE]
Combined Features
[Figure: scatter plot. x-axis: Adaptability, ticks -0.3 to 0.7. y-axis: Reactivity, ticks -0.8 to 0.6. Labeled points: NP, GI, PGL, ER, EL, PPI, KE]
Important genes gain introns
            Status   Adaptability   Reactivity
Gain rate   0.33     0.06           0.37
Loss rate   0.05     -0.03          0.07
Conclusions
Disparate landscape – both gain and loss play a role in intron evolution
The common ancestor of the crown group had an intron content comparable to that of fungi, apicomplexans and dipterans
Three modes of evolution – more than one mechanism?
Important genes tend to gain introns