Post on 15-Jan-2016

An EM Algorithm for Inferring the Evolution of Eukaryotic Gene Structure

Liran Carmel, Igor B. Rogozin, Yuri I. Wolf and Eugene V. Koonin

NCBI, NLM, National Institutes of Health

Outline

Background and Related Work
Data Components
The Model
The Algorithm
Results – Homogeneous Evolution
Results – Heterogeneous Evolution
Summary

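What are Exons and Introns

[Figure: a gene drawn as exon1, intron, exon2, with the intron delimited by GU at its start and AG at its end. Splicing removes the intron, and the resulting mRNA joins exon1 directly to exon2.]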

Related work

Gilbert 2005 [hybrid; branch-specific]
Koonin 2003 [Dollo parsimony]
Csuros 2005 [ML; branch-specific]
Kenmochi 2005 [ML; branch-specific]
Stoltzfus 2004 [Bayes; gene-specific]

[Figure: the studies placed between a "gain" end and a "loss" end, in the order Stoltzfus; Koonin, Kenmochi, Csuros; Gilbert; Koonin, Kenmochi, Csuros. Axis ticks: 2, 3, 4.]

Phylogenetic tree

[Figure: phylogenetic tree. Node labels: Eukaryota, AME, Unikonts, Metazoa, Coelomata, Deuterostomia, Chordata, Vertebrata, Amniota, Mammals, Diptera, Fungi, Ascomycota, Pezizomycotina, ScAfNc, Magnoliophyta, Apicomplexa. Species codes (19): Dicdi, Caeel, Strpu, Cioin, Danre, Galga, Homsa, Roden, Drome, Anoga, Cryne, Schpo, Sacce, Aspfu, Neucr, Arath, Orysa, Thepa, Plafa.]

Multiple alignment

HS …ATGTCGATCGTGCTCGTCGTACTCTCGTAC…
DM …ATGTGGATCGTGCTCGTCGTACTCTCGTAC…
CE …ATGTGGATTGTGCTCGTCGTACTCTCGTAC…
AT …ATGTTGATGGTGCTCGTCGTACTCTCGTAC…
SC …ATGTTGATTGTGCTCGTCGTACTCTCGTAC…
SP …ATGTTGATT---CTCGTCGTACTCTCGTAC…

Presence/absence maps (proteasome component C3)

    41 118 222 230 251 309 377 453 465 539 597 602 713
SC   0   0   0   0   0   0   0   0   0   0   0   0   0
SP   0   1   1   0   0   0   1   0   0   0   0   0   0
CE   1   1   0   0   0   0   1   0   0   0   0   0   0
DM   1   1   0   0   0   0   1   0   0   1   0   0   0
HS   1   1   0   0   1   0   1   0   1   1   1   0   0
AT   1   1   0   1   0   1   1   1   0   1   0   1   1

Strong phyletic signal

Missing data

HS …ATGTCGATCGTGCTCGTCGTACTCTCGTAC…
DM …ATGTGGATCGTGCTCGTCGTACTCTCGTAC…
CE …ATGTGGATTGTGCTCGTCGTACTCTCGTAC…
AT …ATGTTGATGGTGCTCGTCGTACTCTCGTAC…
SC …ATGTTGATTGTGCTCGTCGTACTCTCGTAC…
SP …ATGTTGATT---CTCGTCGTACTCTCGTAC… ?

Missing data (proteasome component C3)

    41 118 222 230 251 309 377 401 453 465 539 597 602 713
SC   0   0   0   0   0   0   0   0   0   0   0   0   0   0
SP   0   1   1   0   0   0   1   ?   0   0   0   0   0   0
CE   1   1   0   0   0   0   1   0   0   0   0   0   0   0
DM   1   1   0   0   0   0   1   0   0   0   1   0   0   0
HS   1   1   0   0   1   0   1   ?   0   1   1   1   0   0
AT   1   1   0   1   0   1   1   1   1   0   1   0   1   1
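A minimal sketch of how such a presence/absence map could be held in memory, with "?" kept distinct from both 0 and 1. The names `POSITIONS`, `RAW_MAP`, `parse_map` and `site_pattern` are hypothetical and chosen only for illustration; the numbers are the ones shown on the slide.

```python
from typing import Dict, List, Optional

# Presence/absence map copied from the slide; '?' marks a missing observation.
POSITIONS = [41, 118, 222, 230, 251, 309, 377, 401, 453, 465, 539, 597, 602, 713]
RAW_MAP = """
SC 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SP 0 1 1 0 0 0 1 ? 0 0 0 0 0 0
CE 1 1 0 0 0 0 1 0 0 0 0 0 0 0
DM 1 1 0 0 0 0 1 0 0 0 1 0 0 0
HS 1 1 0 0 1 0 1 ? 0 1 1 1 0 0
AT 1 1 0 1 0 1 1 1 1 0 1 0 1 1
"""


def parse_map(text: str) -> Dict[str, List[Optional[int]]]:
    """Return {species: [0, 1 or None per position]}; None encodes '?'."""
    table: Dict[str, List[Optional[int]]] = {}
    for line in text.strip().splitlines():
        species, *cells = line.split()
        table[species] = [None if cell == "?" else int(cell) for cell in cells]
    return table


def site_pattern(table: Dict[str, List[Optional[int]]], column: int) -> Dict[str, Optional[int]]:
    """Collect the states observed across species at one intron position."""
    return {species: states[column] for species, states in table.items()}


table = parse_map(RAW_MAP)
# Position 401 is the column with missing entries for SP and HS.
pattern_401 = site_pattern(table, POSITIONS.index(401))
```

Keeping missing entries as `None` rather than coercing them to 0 lets a later likelihood computation treat them as unobserved instead of as an observed absence.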

Bayesian Network

[Figure: the phylogenetic tree, from the Eukaryota root down to the species codes (Dicdi, Caeel, Strpu, Cioin, Danre, Galga, Homsa, Roden, Drome, Anoga, Cryne, Schpo, Sacce, Aspfu, Neucr, Arath, Orysa, Thepa, Plafa), redrawn as a Bayesian network.]

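Probability structure

root prior probability: $\pi_0$ (intron absent at the root)

transition probability for gene $g$ and branch $t$ of length $\Delta_t$ (rows: parent in state 0, parent in state 1; columns: descendant in state 0, descendant in state 1):

$$
\begin{pmatrix}
1 - \xi_t \left(1 - e^{-\eta_g \Delta_t}\right) & \xi_t \left(1 - e^{-\eta_g \Delta_t}\right) \\
1 - (1 - \phi_t)\, e^{-\theta_g \Delta_t} & (1 - \phi_t)\, e^{-\theta_g \Delta_t}
\end{pmatrix}
$$

$\xi_t$: branch-specific gain
$\eta_g$: gene-specific gain
$\phi_t$: branch-specific loss
$\theta_g$: gene-specific loss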
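The transition probabilities on this slide can be sketched as a small function. This is an illustration under stated assumptions, not the authors' code: the symbol names are assumed (`eta_g` and `theta_g` for the gene-specific gain and loss rates, `xi_t` and `phi_t` for the branch-specific gain and loss coefficients, `delta_t` for the branch length), and the function name `transition_matrix` is hypothetical.

```python
import math
from typing import List


def transition_matrix(eta_g: float, theta_g: float,
                      xi_t: float, phi_t: float,
                      delta_t: float) -> List[List[float]]:
    """Return the 2x2 matrix A with A[i][j] = Pr(descendant = j | parent = i).

    State 0 means the intron is absent, state 1 means it is present.
    """
    # Probability of gaining an intron along the branch (parent absent).
    gain = xi_t * (1.0 - math.exp(-eta_g * delta_t))
    # Probability of retaining an existing intron along the branch.
    keep = (1.0 - phi_t) * math.exp(-theta_g * delta_t)
    return [[1.0 - gain, gain],
            [1.0 - keep, keep]]


# Illustrative values only; none of these numbers come from the slides.
example = transition_matrix(eta_g=0.5, theta_g=1.2, xi_t=0.4, phi_t=0.1, delta_t=2.0)
```

Each row sums to one by construction, and a branch of zero length with `phi_t = 0` reduces to the identity matrix.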

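Rate variation across sites

gain variation: $\eta_g \to r\,\eta_g$, with $r \sim \nu\,\delta(r) + (1 - \nu)\,\Gamma(r; \lambda_G)$

loss variation: $\theta_g \to r\,\theta_g$, with $r \sim \Gamma(r; \lambda_L)$

$\nu$: fraction of invariant sites
$\lambda_G$: shape parameter (gain)
$\lambda_L$: shape parameter (loss)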
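To make the two rate distributions concrete, here is a small sampling sketch. Two things are assumptions rather than statements from the slides: the gamma distribution is taken to have mean 1 (shape `lam`, scale `1 / lam`), and invariant sites are taken to have a gain multiplier of exactly 0. The function names are hypothetical.

```python
import random


def sample_gain_multiplier(nu: float, shape_gain: float, rng: random.Random) -> float:
    """Draw r for the gain rate: 0 with probability nu, else gamma distributed.

    Assumption: the gamma component has mean 1 (shape k, scale 1/k).
    """
    if rng.random() < nu:
        return 0.0  # invariant site: no intron gain at this position
    return rng.gammavariate(shape_gain, 1.0 / shape_gain)


def sample_loss_multiplier(shape_loss: float, rng: random.Random) -> float:
    """Draw r for the loss rate from a mean-1 gamma distribution (assumed)."""
    return rng.gammavariate(shape_loss, 1.0 / shape_loss)


rng = random.Random(0)
gain_draws = [sample_gain_multiplier(nu=0.3, shape_gain=2.0, rng=rng) for _ in range(20000)]
loss_draws = [sample_loss_multiplier(shape_loss=2.0, rng=rng) for _ in range(20000)]
```

Under these assumptions the share of zero draws approaches `nu`, and the mean gain multiplier approaches `1 - nu`.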

Parameter Summary

Global parameters
– probability for intron absence in the root: $\pi_0$
– fraction of invariant sites: $\nu$
– shape parameters of the gamma distribution: $\lambda_G, \lambda_L$

Gene-specific parameters
– gain rate: $\eta_g$
– loss rate: $\theta_g$

Branch-specific parameters
– gain coefficient: $\xi_t$
– loss coefficient: $\phi_t$

Homogeneous vs. Heterogeneous Evolution

The number of parameters in the model: $4 + 2(2S - 2) + 2G$

$S$: number of extant species
$G$: number of genes

Homogeneous Evolution: setting G = 1

Heterogeneous Evolution: fixing global parameters and branch-specific parameters
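As a worked instance of the parameter count, the sketch below evaluates 4 + 2(2S - 2) + 2G, reading the terms as 4 global parameters, 2 coefficients for each of the 2S - 2 branches of a rooted binary tree, and 2 rates per gene. The function name is hypothetical, and the value S = 19 is an assumption taken from the 19 species codes that appear on the tree slide.

```python
def parameter_count(num_species: int, num_genes: int) -> int:
    """Evaluate 4 + 2(2S - 2) + 2G for S extant species and G genes."""
    num_branches = 2 * num_species - 2   # branches of a rooted binary tree with S leaves
    global_params = 4                    # root prior, invariant fraction, two shape parameters
    return global_params + 2 * num_branches + 2 * num_genes


# Homogeneous evolution (G = 1), assuming S = 19 species.
homogeneous_total = parameter_count(num_species=19, num_genes=1)
```

With these assumed values the homogeneous model has 78 parameters; every additional gene adds two.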

Likelihood maximization via Expectation Maximization

E-Step
– inward-outward recursions on the tree
– member of the junction-tree algorithms family
– missing data are naturally embedded

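Inward (gamma) recursion

$$\gamma(q) = \Pr\big(L(q) \mid q^P\big)$$

Here $L(q)$ is the set of leaves descended from node $q$, and $q^P$ is the parent of $q$.

[Figure: a tree with node $q$ highlighted and the remaining nodes marked "?".]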

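Inward (gamma) recursion - Initialization

$$\gamma(q) = \Pr\big(L(q) \mid q^P\big)$$

For a leaf $q_t$ at the end of branch $t$ (length $\Delta_t$, gain and loss coefficients $\xi_t$, $\phi_t$) in gene $g$ (gain and loss rates $\eta_g$, $\theta_g$):

$$
\gamma(q_t) =
\begin{cases}
\begin{pmatrix} 1 - \xi_t \left(1 - e^{-\eta_g \Delta_t}\right) \\ 1 - (1 - \phi_t)\, e^{-\theta_g \Delta_t} \end{pmatrix} & q_t = 0 \\[2ex]
\begin{pmatrix} \xi_t \left(1 - e^{-\eta_g \Delta_t}\right) \\ (1 - \phi_t)\, e^{-\theta_g \Delta_t} \end{pmatrix} & q_t = 1 \\[2ex]
\begin{pmatrix} 1 \\ 1 \end{pmatrix} & q_t \text{ missing}
\end{cases}
$$

The first entry of each vector corresponds to the parent in state 0, the second to the parent in state 1.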

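Inward (gamma) recursion - Recursion

$$\gamma(q) = \Pr\big(L(q) \mid q^P\big)$$

$$\gamma_i(q_t) = \sum_{j=0}^{1} A^g_{ij}(q_t)\, \gamma_j(q_t^L)\, \gamma_j(q_t^R)$$

Here $A^g_{ij}(q_t)$ is the probability, for gene $g$, that $q_t$ is in state $j$ given that its parent is in state $i$, and $q_t^L$, $q_t^R$ are the left and right children of $q_t$.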
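The inward recursion can be sketched in a few lines of recursive code. This is a hedged illustration, not the authors' implementation: the class `Node` and the functions `transition`, `gamma` and `site_likelihood` are hypothetical names, the transition probabilities follow the form xi(1 - exp(-eta * length)) for gain and (1 - phi) exp(-theta * length) for retention, and a missing leaf is represented by `state=None`.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    xi: float = 0.0                  # gain coefficient of the branch above this node
    phi: float = 0.0                 # loss coefficient of the branch above this node
    length: float = 0.0              # length of the branch above this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    state: Optional[int] = None      # leaf observation: 0, 1, or None if missing


def transition(node: Node, eta: float, theta: float) -> List[List[float]]:
    """A[i][j] = Pr(node = j | parent = i) for one gene with rates eta, theta."""
    gain = node.xi * (1.0 - math.exp(-eta * node.length))
    keep = (1.0 - node.phi) * math.exp(-theta * node.length)
    return [[1.0 - gain, gain], [1.0 - keep, keep]]


def gamma(node: Node, eta: float, theta: float) -> Tuple[float, float]:
    """Return (gamma_0, gamma_1), where gamma_i = Pr(leaves below node | parent = i)."""
    A = transition(node, eta, theta)
    if node.left is None or node.right is None:      # leaf: initialization
        if node.state is None:                       # missing data contribute 1
            return (1.0, 1.0)
        return (A[0][node.state], A[1][node.state])
    g_left = gamma(node.left, eta, theta)            # internal node: recursion
    g_right = gamma(node.right, eta, theta)
    g0, g1 = (sum(A[i][j] * g_left[j] * g_right[j] for j in (0, 1)) for i in (0, 1))
    return (g0, g1)


def site_likelihood(root: Node, pi0: float, eta: float, theta: float) -> float:
    """Combine the root prior (pi0 = Pr(absent)) with the children's gammas."""
    g_left = gamma(root.left, eta, theta)
    g_right = gamma(root.right, eta, theta)
    prior = (pi0, 1.0 - pi0)
    return sum(prior[j] * g_left[j] * g_right[j] for j in (0, 1))


# Illustrative two-leaf tree; all numbers are made up for the example.
leaf_a = Node(xi=0.5, phi=0.2, length=1.0, state=1)
leaf_b = Node(xi=0.3, phi=0.1, length=2.0, state=None)   # missing observation
example_likelihood = site_likelihood(Node(left=leaf_a, right=leaf_b),
                                     pi0=0.7, eta=0.8, theta=1.5)
```

A missing leaf returns the vector (1, 1), which is the same as summing over both of its possible states, so no separate imputation step is needed.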

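Outward (alpha) recursion

$$\alpha(q, q^P) = \Pr\big(q, q^P \mid L(q_0) \setminus L(q)\big)$$

Here $q_0$ is the root, so $L(q_0) \setminus L(q)$ is the set of observed leaves that are not descended from $q$.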
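The sketch below shows one standard way to combine an inward pass with an outward pass so that the E-step obtains, for every branch, the posterior probability of each (parent state, child state) pair. It is a generic upward-downward scheme on a binary tree and is not claimed to match the authors' exact definition of alpha. The function name `pair_posteriors` and the dictionary-based tree encoding are hypothetical; transition matrices are passed in directly as `A[child][parent_state][child_state]`.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def pair_posteriors(children: Dict[int, Optional[Tuple[int, int]]],
                    A: Dict[int, Matrix],
                    obs: Dict[int, Optional[int]],
                    prior: List[float],
                    root: int = 0) -> Tuple[float, Dict[int, Matrix]]:
    """Return (likelihood, {child: posterior[parent_state][child_state]})."""
    up: Dict[int, List[float]] = {}    # up[q][j]  = Pr(leaves below q | q = j)
    msg: Dict[int, List[float]] = {}   # msg[q][i] = Pr(leaves below q | parent of q = i)

    def inward(q: int) -> None:
        if children[q] is None:                       # leaf; None means missing
            up[q] = [1.0, 1.0] if obs[q] is None else [float(obs[q] == k) for k in (0, 1)]
        else:
            left, right = children[q]
            inward(left)
            inward(right)
            up[q] = [msg[left][j] * msg[right][j] for j in (0, 1)]
        if q != root:
            msg[q] = [sum(A[q][i][k] * up[q][k] for k in (0, 1)) for i in (0, 1)]

    inward(root)
    likelihood = sum(prior[j] * up[root][j] for j in (0, 1))

    down: Dict[int, List[float]] = {root: list(prior)}  # Pr(q = j, leaves outside q's subtree)
    post: Dict[int, Matrix] = {}

    def outward(q: int) -> None:
        if children[q] is None:
            return
        left, right = children[q]
        for child, sibling in ((left, right), (right, left)):
            # Evidence reaching the branch from above and from the sibling subtree.
            outside = [down[q][j] * msg[sibling][j] for j in (0, 1)]
            post[child] = [[outside[j] * A[child][j][k] * up[child][k] / likelihood
                            for k in (0, 1)] for j in (0, 1)]
            down[child] = [sum(outside[j] * A[child][j][k] for j in (0, 1)) for k in (0, 1)]
            outward(child)

    outward(root)
    return likelihood, post


# Illustrative tree: root 0 has children 1 and 4; node 1 has leaf children 2 and 3.
children = {0: (1, 4), 1: (2, 3), 2: None, 3: None, 4: None}
A = {1: [[0.9, 0.1], [0.3, 0.7]], 2: [[0.8, 0.2], [0.4, 0.6]],
     3: [[0.95, 0.05], [0.5, 0.5]], 4: [[0.7, 0.3], [0.2, 0.8]]}
obs = {2: 1, 3: None, 4: 0}          # leaf 3 is missing
prior = [0.6, 0.4]
likelihood, post = pair_posteriors(children, A, obs, prior)
```

The pairwise posteriors in `post` are the expected sufficient statistics an M-step would need for each branch, which is why both passes are required.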

Likelihood maximization via EM

E-Step
– inward-outward recursions on the tree
– member of the junction-tree algorithms family
– missing data are naturally embedded

M-Step
– low-tolerance variable-by-variable maximization
– Newton-Raphson
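A variable-by-variable Newton-Raphson maximization with a loose tolerance can be sketched as follows. Everything here is illustrative: the objective is a toy concave function standing in for the expected log-likelihood, derivatives are approximated by finite differences (a real implementation would more likely use analytic derivatives), and the names `newton_raphson_1d` and `coordinate_maximize` are hypothetical.

```python
from typing import Callable, Dict


def newton_raphson_1d(f: Callable[[float], float], x0: float,
                      tol: float = 1e-2, max_iter: int = 20, h: float = 1e-4) -> float:
    """Maximize f along one variable with finite-difference Newton-Raphson steps."""
    x = x0
    for _ in range(max_iter):
        d1 = (f(x + h) - f(x - h)) / (2.0 * h)             # first derivative
        d2 = (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)  # second derivative
        if d2 >= 0.0:          # not locally concave: a Newton step is unreliable here
            break
        step = d1 / d2
        x -= step
        if abs(step) < tol:    # loose tolerance: stop early, the next EM round refines
            break
    return x


def coordinate_maximize(objective: Callable[[Dict[str, float]], float],
                        params: Dict[str, float],
                        sweeps: int = 5, tol: float = 1e-2) -> Dict[str, float]:
    """Update one parameter at a time, holding all the others fixed."""
    current = dict(params)
    for _ in range(sweeps):
        for name in list(current):
            def along(value: float, name: str = name) -> float:
                trial = dict(current)
                trial[name] = value
                return objective(trial)
            current[name] = newton_raphson_1d(along, current[name], tol=tol)
    return current


def toy_objective(p: Dict[str, float]) -> float:
    """Concave stand-in for an expected log-likelihood (not from the slides)."""
    return -(p["a"] - 1.0) ** 2 - 2.0 * (p["b"] + 2.0) ** 2 - p["a"] * p["b"]


fitted = coordinate_maximize(toy_objective, {"a": 0.0, "b": 0.0})
```

For this toy objective the exact maximum is at a = 16/7 and b = -18/7, and five sweeps land close to it. A loose tolerance is reasonable inside EM because each M-step only needs to improve the objective, not solve it exactly.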

Intron density in ancient eukaryotes

[Figure: comparison with Gilbert 2005, Koonin 2003, Csuros 2005, Kenmochi 2005 and Stoltzfus 2004; axis ticks 2, 3, 4.]

Evolutionary Landscape

[Figure: the phylogenetic tree, from the Eukaryota root down to the species codes Dicdi through Plafa, with a legend of four categories: loser, gainer, stable, dynamic.]

Modes of Evolution

[Figure: scatter plot of total gain rate [1/BYA] (vertical axis, 0 to 0.025) against total loss rate [1/BYA] (horizontal axis, 0 to 6). Labeled points: Deuterostomia, Metazoa, Unikonts, Ascomycota, Sacce, Diptera, ScAfNc, Fungi, Roden, Caeel, Anoga, Schpo, Chordata, Pezizomycotina, Cioin.]

Modes of Evolution

[Figure: scatter plot of total gain rate [1/BYA] (vertical axis, 0 to 0.025) against total loss rate [1/BYA] (horizontal axis, 0 to 6), with the points Deuterostomia, Metazoa, Unikonts, Ascomycota, Sacce, Diptera, ScAfNc, Fungi, Roden, Caeel, Anoga, Schpo, Chordata, Pezizomycotina and Cioin, annotated with four categories: loser, gainer, stable, dynamic.]

234 genes

295 genes

187 genes

Gene Characteristics

New features of genes:
– Intron gain rate
– Intron loss rate

Old features of genes:
– Expression level
– Evolutionary rate
– Lethality
– Connectivity in protein-protein interactions
– Connectivity in genetic interactions

Combined Features

[Figure: scatter plot of Adaptability (vertical axis, -0.3 to 0.7) against Status (horizontal axis, -0.5 to 0.5). Labeled points: NP, GI, PGL, ER, EL, PPI, KE.]

Combined Features

[Figure: scatter plot of Reactivity (vertical axis, -0.8 to 0.6) against Adaptability (horizontal axis, -0.3 to 0.7). Labeled points: NP, GI, PGL, ER, EL, PPI, KE.]

Important genes gain introns

            Status   Adaptability   Reactivity
Gain rate     0.33           0.06         0.37
Loss rate     0.05          -0.03         0.07

Conclusions

Disparate landscape – both gain and loss play a role in intron evolution

The common ancestor of the crown group had an intron content comparable to that of fungi, apicomplexans and dipterans

Three modes of evolution – more than one mechanism?

Important genes tend to gain introns