The use of the concepts of evolutionary biology in genome (biological) annotation.
Pierre Pontarotti
EA 3781 Evolution Biologique
http://www.up.univ-mrs.fr/evol/
• Some concepts in evolutionary biology
• Use of the concepts for gene structural and functional annotation
• Computerization
• Other concepts
[Figure: Metazoan phylogeny (Adoutte et al. 2000). BILATERIA comprise the PROTOSTOMES and the DEUTEROSTOMES. PROTOSTOMES: ECDYSOZOANS (arthropods, gastrotrichs, nematodes, onychophorans, tardigrades, kinorhynchs, priapulids) and LOPHOTROCHOZOANS (molluscs, rotifers, annelids, gnathostomulids, sipunculans, nemerteans, pogonophorans, platyhelminthes, entoprocts, bryozoans, brachiopods, phoronids). DEUTEROSTOMES: vertebrates, cephalochordates, urochordates, hemichordates, echinoderms. Outside BILATERIA: ctenophorans, cnidarians, poriferans. Urbilateria (??) is placed at the base of BILATERIA.]
URBILATERIA: the hypothetical metazoan ancestor (Geoffroy Saint-Hilaire, 19th century)
The URBILATERIA genome evolved by the fixation of:
• Nucleotide substitution
• Gene loss
• Gene duplication: single gene duplication, genome region duplication, whole genome duplication
• Chromosomal rearrangement
Large scale gene duplication in vertebrate lineage
[Figure: tree of the sampled lineages. Protostomata: insects (Drosophila), nematodes (C. elegans). Deuterostomata: Echinodermata, Urochordata (Ciona), Cephalochordata (amphioxus) and the vertebrates: Myxini (hagfish), Cephalaspidomorphi (lamprey), Chondrichthyes (shark), Actinopterygii (zebrafish), Lissamphibia, Amniota (human). Node labels: 751, >751, 564, 528, 450, <833-993, 833-993, 360. Other labels: T1, T2, Pikaia, 20 000 genes.]
From alleles to orthologs (point mutations)
• Population POP 1 carries alleles A, B, C and D.
• POP 1 splits into 2 autonomous populations, POP 1A and POP 1B, each starting with alleles A, B, C and D.
  - POP 1A: fixation of allele A and accumulation of new mutations, giving A1 and A2.
  - POP 1B: fixation of allele B and accumulation of new mutations, giving B1 and B2.
• POP 1A splits into 2 autonomous populations, POP 1A1 and POP 1A2; POP 1B splits into POP 1B1 and POP 1B2.
  - POP 1A1: fixation of allele A1 and accumulation of new mutations, giving A11 and A12.
  - POP 1A2: fixation of allele A2 and accumulation of new mutations, giving A21 and A22.
  - POP 1B1: fixation of allele B1 and accumulation of new mutations, giving B11 and B12.
  - POP 1B2: fixation of allele B2 and accumulation of new mutations, giving B21 and B22.
• Within each population the two variants (A.1.1 and A.1.2, A.2.1 and A.2.2, B.1.1 and B.1.2, B.2.1 and B.2.2) are alleles; the genes compared between populations are orthologs.
Orthologs and paralogs
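[Figure: an ancestral gene A duplicates into A1/2 and A3, then A1/2 duplicates into A1 and A2, so that URBILATERIA carries A1, A2 and A3. After speciation, the DROSOPHILA multigenic family contains A1, A2 and A3, and the HUMAN multigenic family contains A1, A2, A3' and A3''. Legend: duplication events give paralogs (slide label "A1, A2, B"); speciation events are marked separately.]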
Orthology/ Paralogy
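Orthologs: two genes in different species that derive from a single gene in the common ancestor and were separated by a speciation event.
Paralogs: two genes that result from a duplication event within a genome.
[Figure: gene tree of family A. A first duplication of A gives A1/2 and A3, and a second gives A1 and A2. Speciation then separates A1 HUMAN from A1 DROSO, A2 HUMAN from A2 DROSO, and A3 DROSO from the human A3 lineage, which duplicated again into A3' HUMAN and A3'' HUMAN. A3' HUMAN and A3'' HUMAN are co-orthologs of A3 DROSO.]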
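The two definitions above can be sketched in code: a pair of genes is orthologous when the node where their lineages last meet is a speciation, and paralogous when it is a duplication. The tree encoding, the function names and the event labels below are illustrative assumptions, not the output of any particular tool; the tree itself mirrors the A1/A2/A3 example of the slide.

```python
# Hypothetical encoding of the slide's gene tree.
# A leaf is (gene, species); an internal node is (event, left, right).
TREE = ("duplication",
        ("duplication",
         ("speciation", ("A1", "HUMAN"), ("A1", "DROSO")),
         ("speciation", ("A2", "HUMAN"), ("A2", "DROSO"))),
        ("speciation",
         ("duplication", ("A3'", "HUMAN"), ("A3''", "HUMAN")),
         ("A3", "DROSO")))


def is_leaf(node):
    # Leaves have two fields, internal nodes have three.
    return len(node) == 2


def leaves(node):
    """Return all leaves below a node."""
    if is_leaf(node):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def relationships(node, out=None):
    """Label every pair of leaves by the event at their last common ancestor."""
    if out is None:
        out = {}
    if is_leaf(node):
        return out
    event, left, right = node
    kind = "orthologs" if event == "speciation" else "paralogs"
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = kind
    relationships(left, out)
    relationships(right, out)
    return out


def orthologs_in(gene, species, rel):
    """All orthologs of `gene` found in `species` (several = co-orthologs)."""
    found = set()
    for pair, kind in rel.items():
        if kind == "orthologs" and gene in pair:
            for other in pair:
                if other != gene and other[1] == species:
                    found.add(other)
    return found
```

With this encoding, `orthologs_in(("A3", "DROSO"), "HUMAN", relationships(TREE))` returns both human A3 copies, which is the co-ortholog case of the figure.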
From Gene History
To Gene Function
Orthologs under purifying selection
[Figure: gene A in URBILATERIA; after speciation, the DROSOPHILA and HUMAN orthologs both evolve under purifying selection and both keep the ancestral function.]
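Ortholog functional switch
[Figure: gene A in URBILATERIA; after speciation, the DROSOPHILA ortholog stays under purifying selection and keeps the ancestral function, while the HUMAN ortholog (A2) evolves under positive or relaxed selection and may acquire a new function (?).]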
Co-ortholog subfunctionalization
[Figure: gene A in URBILATERIA; after speciation, the DROSOPHILA ortholog stays under purifying selection and keeps the ancestral function. In the HUMAN lineage a duplication gives A' and A'', each of which carries a sub-function.]
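Co-ortholog neofunctionalization
[Figure: gene A in URBILATERIA; after speciation, the DROSOPHILA ortholog stays under purifying selection and keeps the ancestral function. In the HUMAN lineage a duplication gives two copies: A stays under purifying selection and keeps the ancestral function, while A2 evolves under positive or relaxed selection and acquires a new function.]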
• Orthology/paralogy information is important for functional inference (but not applicable to species with a high level of horizontal transfer)
Many scientists use the best BLAST hit to look for orthologous relationships… BUT! (A warning that other speakers will also discuss.)
• Many co-orthologs can be present.
• There are problems with genomes that are not fully sequenced, or when gene loss has occurred.
AND, even with phylogenetic analysis:
• Bias must be corrected.
• A phylogenetic tree is a hypothesis.
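The co-ortholog problem can be made concrete with a small sketch of the reciprocal-best-hit rule. The similarity scores and gene names below are invented for illustration (they are not BLAST output), and the functions are hypothetical helpers; the point is only that a one-to-one rule keeps a single partner even when two human genes are co-orthologs of one fly gene.

```python
# Hypothetical similarity scores (higher = more similar), keyed (fly, human).
SCORES = {
    ("A3_DROSO", "A3p_HUMAN"): 310,
    ("A3_DROSO", "A3pp_HUMAN"): 295,
    ("A3_DROSO", "A1_HUMAN"): 120,
    ("A1_DROSO", "A1_HUMAN"): 400,
    ("A1_DROSO", "A3p_HUMAN"): 110,
    ("A1_DROSO", "A3pp_HUMAN"): 105,
}


def best_hit(query, scores, query_side):
    """Highest-scoring partner of `query`; query_side is 0 (fly) or 1 (human)."""
    hits = {pair[1 - query_side]: score
            for pair, score in scores.items()
            if pair[query_side] == query}
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(scores):
    """Pairs (fly, human) that are each other's best hit."""
    pairs = set()
    for fly in {pair[0] for pair in scores}:
        human = best_hit(fly, scores, 0)
        if human is not None and best_hit(human, scores, 1) == fly:
            pairs.add((fly, human))
    return pairs


# A3pp_HUMAN never appears in the result, although in this toy setting it
# is meant to be a co-ortholog of A3_DROSO just like A3p_HUMAN.
RBH = reciprocal_best_hits(SCORES)
```

The same rule also fails silently when the true ortholog is missing from an incomplete genome or has been lost: the best hit then simply falls on a paralog.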
• An evolutionary shift (due to positive or relaxed selection) could be linked to a functional shift. See the talks by N. Galtier and A. Levasseur.
• Detection of positive selection and functional shift
• Detection of relaxation of evolutionary constraint and functional shift
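One common way to express these selection regimes is the ratio ω = dN/dS of non-synonymous to synonymous substitution rates: ω well below 1 suggests purifying selection, ω near 1 is compatible with neutral evolution or relaxed constraint, and ω above 1 suggests positive selection. The sketch below is only a rule-of-thumb classifier under that assumption; the function name and the `tolerance` threshold are arbitrary choices made here, and real analyses rely on statistical tests rather than on a fixed cut-off.

```python
def selection_regime(dn, ds, tolerance=0.1):
    """Classify a branch from its dN and dS estimates (illustrative thresholds)."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    omega = dn / ds
    if omega > 1 + tolerance:
        return omega, "positive selection"
    if omega < 1 - tolerance:
        return omega, "purifying selection"
    return omega, "neutral or relaxed constraint"
```

A branch where the label changes relative to the rest of the tree is a candidate evolutionary shift, and therefore a candidate functional shift.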
Constitutive proteasome β-subunit replacement after interferon-γ stimulation
Paralogue = duplicated gene
Paralogue replacement (constitutive proteasome → immunoproteasome):
• PSMB5 → PSMB8 (LMP 7)
• PSMB6 → PSMB9 (LMP 2)
• PSMB7 → PSMB10 (LMP Z)
Immunoproteasome subunits:
• New function (specialization): degradation into proteins or peptides of a specific size, used by the MHC system
• Only found in vertebrates
Constitutive proteasome subunits:
• Ancestral function: protein degradation
• Present in all metazoans, therefore present in Urbilateria (the metazoan ancestor)
Large scale gene duplication in vertebrate lineage
[Figure: the same tree of Protostomata and Deuterostomata lineages (Drosophila, C. elegans, Echinodermata, Ciona, amphioxus, hagfish, lamprey, shark, zebrafish, Lissamphibia, human), with the same node labels (751, >751, 564, 528, 450, <833-993, 833-993, 360), annotated with the lineages that carry the proteasome and those that also carry the immunoproteasome.]
PROTEASOME
[Figure: phylogenetic tree of the proteasome subunits PSMB7 and PSMB10. Vertebrate PSMB7 sequences (Mus, Ratt, Bos, Homo, Gall, Xeno, Zebra, Fugu) and vertebrate PSMB10 sequences (Zebra, Fugu, Bos, Mus, Homo) form two groups separated by a node marked "Duplication"; the undivided PSMB7/10 sequences are Bran, Ci-zeta (Ciona), Bombyx, Prosbeta2 and CG18341 (Drosophila). Scale bar: 0.1.]
The study of gene and genome HISTORY helps to find evidence for gene FUNCTION.
Concepts in evolutionary biology
• Use of the concepts for structural and functional annotation:
  - Structural annotation (deciphering of gene structure).
  - Functional annotation (especially the use of phylogeny to decipher protein function).
Functional annotation
Biochemical and biological process:
• Experimental approach: RNA interference, tandem affinity purification and mass spectrometry
• In silico
• Functional annotation based on phylogeny, starting from experimentally annotated genes…
INTERLUDE
• FUNCTION????
• A complex concept;
Function prediction:
• Using orthology information (done)
• Using the evolutionary shift information (in progress)
• By integrative phylogenomics (Engelhardt et al., PLoS Computational Biology 2005) (in progress)
Homologs with experimentally known function: how information can be found.
Gene Ontology
MedLine
SwissProt
Textual Information Analysis
G.O. Standard
GenBank
• Biological process – biological process to which the gene or gene product contributes. Cell growth and maintenance; pyrimidine metabolism; …
• Molecular function – biochemical activity, including specific binding to ligands or structures, of a gene product. Enzyme, transporter; Toll receptor ligand, …
• Cellular component – place in the cell where a gene product is active. Cytoplasm, ribosome, …
• Plus other classifications to develop, in particular an evolution-based ontology
Gene Ontology Classification
Only a small fraction of genes correspond to known, well-characterized proteins.
If the function is unknown, phylogenetic analysis is used for functional prediction:
• using orthology information
• using the evolutionary shift information
• by integrative phylogenomics
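The first option, prediction from orthology information, can be sketched as a transfer of experimentally supported annotations from orthologs to the unannotated gene, grouped by the three Gene Ontology categories. The record layout, the gene names and the `experimental` flag below are assumptions made for illustration; they do not reproduce the Gene Ontology file format or the FIGENIX implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    aspect: str         # "biological_process", "molecular_function" or "cellular_component"
    term: str
    experimental: bool  # True when the annotation comes from an experiment


# Hypothetical annotations attached to already characterized genes.
KNOWN = {
    "geneX_MOUSE": [
        Annotation("molecular_function", "transporter", True),
        Annotation("cellular_component", "cytoplasm", False),
    ],
    "geneX_DROSO": [
        Annotation("biological_process", "pyrimidine metabolism", True),
        Annotation("molecular_function", "transporter", True),
    ],
}


def predict_function(orthologs, known):
    """Collect experimentally supported terms from the listed orthologs."""
    predicted = {}
    for gene in orthologs:
        for annotation in known.get(gene, []):
            if annotation.experimental:
                predicted.setdefault(annotation.aspect, set()).add(annotation.term)
    return predicted
```

Restricting the transfer to experimental annotations is a design choice of this sketch: it avoids propagating predictions that were themselves inferred by similarity.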
Tumor necrosis factor family phylogenetic tree: ortholog identification
[Figure: phylogenetic tree of tumor necrosis factor superfamily (TNFSF) sequences from several vertebrates (sequence prefixes Hsa, Mmu, Rno, Gga, Xla, Dre, Pol, Bbo) together with Drosophila EIGER (DmeTNF). Groups shown: TNFSF1, TNFSF2, TNFSF5, TNFSF6, TNFSF7, TNFSF8, TNFSF9, TNFSF10, TNFSF11, TNFSF13, TNFSF14 and TNFSF15. Three labels, DF1, DF2 and DF3, are marked on the tree. Scale bar: 0.2.]
Trends in Immunology (July 2003)
[Figure: human TNF superfamily ligands (TNFSF1 to TNFSF15, TNFSF18, TNFSF13B, TNFSF12?, EDA-A1, EDA-A2) drawn next to their receptors (TNFRSF1A, TNFRSF1B, TNFRSF3 to TNFRSF21, TACI, BR3, EDAR, XEDAR, RELT), with two annotation columns, Molecular Function and Biological Process. Processes listed include T cell homeostasis (death or survival), T cell activation and costimulation, CTL function, peripheral tolerance, negative selection and autoimmunity, B cell homeostasis and function, tumoricidal activity, bone homeostasis, mammary gland development, and tooth, hair, skin and sweat gland formation; several entries carry a question mark. Disease labels: atherosclerotic plaque formation; ALPS - LPR/GLD lymphoproliferative syndrome.]
Human TNF family phylogenetic tree: search for the closest paralog
Only a small fraction of genes correspond to known, well-characterized proteins.
If the function is unknown, phylogenetic analysis is used for gene function prediction:
• using orthology information
• using the evolutionary shift information (see the Levasseur talk)
• by integrative phylogenomics
Evolutionary biology concepts for genome annotation
Further reading
Concepts:
• Levasseur A, Danchin E, Orlando L, Bailly X, Pontarotti P. Conceptual bases for quantifying the role of the environment on genomes evolution: the participation of positive selection and neutral evolution. Biological Reviews, in press.
• Danchin E.G.J, et al. The Major Histocompatibility Complex origin. Immunological Reviews. 2004;198(1):216-232.
Concepts for applied evolution:
• Danchin E.G.J, Levasseur A, Lopez-Rascol V, Gouret P, Pontarotti P. The use of evolutionary biology concepts for genome annotation. J. Exp. Zoology Part B: Mol. and Dev. Evol. 2006.
Computerization of concepts and knowledge
• Phylogeny
• Detection of orthologous and paralogous genes
• Detection of evolutionary shifts (in progress)
• Function prediction
FIGENIX is a multi-user software platform dedicated to structural and functional annotation tasks:
- Gene prediction for large DNA sequences
- Construction of robust phylogenetic trees
- Automatic detection of orthologs and paralogs
- Automatic retrieval of functional data on genes from Web databases
- Filtering and construction of protein databases (EST contig assembly)
- Chained processes (e.g. gene prediction followed by a phylogenetic study for each gene)
STEPS OF THE phylogeny PIPELINE (1)
• Input: protein sequence encoded by a putative gene
• BLAST + filtering against Ensembl, NR…
• CLUSTAL W + purification + bias correction, giving a multiple alignment
• Domain search in PFAM with HmmPFAM
• Creation of « FIGENIX » domains (correctDomains)
• Do repeats exist? One branch keeps the complete alignment; the other keeps the monophyletic repeats and aligns the merged repeats
• Composition test with TREE-PUZZLE to eliminate sequences that are too divergent
• Construction of the Tree of Life, used as the reference tree
STEPS OF THE phylogeny PIPELINE (2)
• Enumeration of domains
• Detection of « paralogy groups » + elimination of sites that evolve too fast (« Gu test »)
• Elimination of sequences with >30% gaps
• Elimination of the most incongruent domains, detected by HomPart in PAUP
• Saturation test
• Tree construction by NJ, parsimony and maximum likelihood
• Comparison of topologies with Templeton-Hasegawa tests
• Are the topologies congruent? Depending on the answer, the NJ tree or the consensus tree is retained
• Detection of orthologs against the reference tree, then search for functions
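One of the filtering steps listed above, the elimination of sequences with more than 30% gaps, is simple enough to sketch. The function names and the dictionary representation of an alignment are assumptions made here; only the 30% threshold comes from the slide, and this is not the FIGENIX code.

```python
def gap_fraction(aligned_sequence):
    """Fraction of alignment columns that are gaps ('-') for one sequence."""
    if not aligned_sequence:
        return 1.0
    return aligned_sequence.count("-") / len(aligned_sequence)


def drop_gappy_sequences(alignment, max_gap_fraction=0.30):
    """Keep only sequences whose gap fraction does not exceed the threshold.

    `alignment` maps a sequence name to its aligned (gapped) string.
    """
    return {name: sequence
            for name, sequence in alignment.items()
            if gap_fraction(sequence) <= max_gap_fraction}


# Toy alignment of 10 columns (invented sequences).
ALIGNMENT = {
    "seq_full": "ACDEFGHIKL",     # 0% gaps, kept
    "seq_border": "ACD---HIKL",   # exactly 30% gaps, kept
    "seq_gappy": "A----GH-KL",    # 50% gaps, removed
}
FILTERED = drop_gappy_sequences(ALIGNMENT)
```

Removing such sequences before tree building limits the number of columns that carry almost no information for some taxa.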
Architecture of FIGENIX
- Intranet/Extranet platform
- 3-tier architecture (web interface / business-logic servers / database)
Components shown in the diagram: RDBMS, Expert System, Genomic Data Annotation Engine, Web Server, Persistence Layer, Repository, Archiver, services (Load Balancing, Security, ...), and the agents MGIAgent, GOAgent, ESTAgent and Functional Collector Agent, linked by request and data-exchange flows.
Further reading about the computerization of concepts
• Gouret et al. FIGENIX: intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics. 2005 Aug 5;6:198.
• Balandraud et al. A rigorous method for multigenic families' functional annotation: the peptidyl arginine deiminase (PADs) proteins family example. BMC Genomics 2005, 6:153.
Further reading on FIGENIX utilization
• Danchin et al. Eleven ancestral gene families lost in mammals and vertebrates while otherwise universally conserved in animals. BMC Evolutionary Biology 2006, 6:5.
• Paillisson et al. Bromodomain testis-specific protein is expressed in mouse oocyte and evolves faster than its ubiquitously expressed paralogs BRD2, -3 and -4. Genomics. 2006.
• Levasseur et al. Tracking the evolutionary and functional shifts connection: the lipase-esterase example. BMC Evolutionary Biology 2006.
Structural annotation
Genome nucleotide-level annotation:
• Mapping
• Finding genomic landmarks
• Gene finding and protein prediction
• Non-coding RNAs and regulatory regions
• Identifying repetitive elements
• Mapping segmental duplications
• Mapping variations (SNP, microsatellites, …)
Available tools
Ab initio:
• Genscan
• Fgenesh
• Genie
• Etc.
Based on statistical signals within the DNA: coding propensity (hexamer signals) and splice site signals.
Strengths: easy and quick to run; only need DNA as input.
Weakness: high false positive rate.
Similarity based:
• Genewise
• Sim4
• Est2genome
• Figenix
Alignment programs that know about gene structure; very accurate with strong sequence similarities.
Strengths: accurate.
Weakness: need strong similarities; slow to run.
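The "coding propensity (hexamer signals)" used by ab initio programs can be illustrated with a minimal log-odds score: hexamers that are more frequent in known coding sequence than in non-coding sequence get a positive weight. This is a simplified sketch under assumptions made here (frame-independent counts, a pseudocount of 1, toy training strings); it is not the model used by Genscan, Fgenesh or Genie.

```python
import math


def hexamer_counts(sequence):
    """Count every overlapping 6-mer in a sequence."""
    counts = {}
    for start in range(len(sequence) - 5):
        hexamer = sequence[start:start + 6]
        counts[hexamer] = counts.get(hexamer, 0) + 1
    return counts


def log_odds_table(coding, noncoding, pseudocount=1.0):
    """Log ratio of hexamer frequencies in coding versus non-coding training data."""
    coding_counts = hexamer_counts(coding)
    noncoding_counts = hexamer_counts(noncoding)
    hexamers = set(coding_counts) | set(noncoding_counts)
    coding_total = sum(coding_counts.values()) + pseudocount * len(hexamers)
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * len(hexamers)
    table = {}
    for hexamer in hexamers:
        p_coding = (coding_counts.get(hexamer, 0) + pseudocount) / coding_total
        p_noncoding = (noncoding_counts.get(hexamer, 0) + pseudocount) / noncoding_total
        table[hexamer] = math.log(p_coding / p_noncoding)
    return table


def coding_score(sequence, table):
    """Mean log-odds over the hexamers of a sequence (unseen hexamers score 0)."""
    hexamers = [sequence[i:i + 6] for i in range(len(sequence) - 5)]
    if not hexamers:
        return 0.0
    return sum(table.get(h, 0.0) for h in hexamers) / len(hexamers)
```

A positive score marks a window as coding-like. Because the signal is purely statistical, such a score needs no homolog but is prone to false positives, which matches the strengths and weakness listed above.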
State of the art
• Structural annotation in the « FIGENIX SOFTWARE PLATFORM » combines a statistical approach with a homology approach (similarities with known proteins). Automating the process resulted in an expert system based on biological inference rules that use gene history and ab initio programs. It is not yet completely based on evolutionary biology.
Annotating method
[Figure: a DNA segment contains region 1, whose best hit is protein A (HSPs A1, A2, A3), and region 2, whose best hit is protein B (HSPs B1, B2). A track of candidate signals (letters D, A, M, S and +) is drawn along the segment. Legend: gene = nucleotide sequence; transcription gives mRNA = nucleotide sequence; translation gives protein = amino acid sequence.]
Validation of structural annotation
The performance of the platform was validated on a standard dataset (HMR195); see Guigó et al., 2000; Rogic et al., 2001.
Correct protein prediction:
• Figenix: 87% (0.87)
• HMMGene: 38% (0.38)
• Genscan: 31% (0.31)
Accuracy versus exon type, and over-prediction

PROGRAM    Initial (55)   Internal (186)   Terminal (55)   Over-prediction
Genscan    0.55           0.80             0.65            0.22
Figenix    0.91           0.92             0.95            0.05
HMMGene    0.75           0.81             0.78            0.15

The mouse and rat sequences from the HMR195 dataset were annotated using the human division of SwissProt.
• The next step for structural annotation is to take into account the evolutionary history of genes
• Concepts, modelling, computerization, bio-analysis
Structural annotation (deciphering of gene structure).
Functional annotation (especially the use of phylogeny to decipher proteins function).
Next
• Phylogenomics (genome Evolution)
• Phylopostgenomics
• - phylotranscriptomics
• - phylointeractomics
• ………..
Knowledge/concepts
Observation: conserved synteny regions exist between species.
Explanation/concept: these regions derive from an ancestral region that evolved independently in each lineage after speciation, but not enough to lose all trace of conservation. From this knowledge and this prediction follows a line of reasoning indicating that analysing conserved syntenies and reconstructing ancestral regions are worthwhile, both from an applied point of view (assistance with positional cloning) and from a conceptual point of view (understanding genome evolution).
Formalization of the biological question
How can conserved syntenies be detected? This is also where conceptualization comes fully into play.
If conserved syntenies really derive from an ancestral region, the genes in these regions must show:
1/ orthology relationships;
2/ a clustering of orthologous genes that is improbable under the hypothesis of chance (the clustering must be significant).
Programs are therefore needed that can detect orthology relationships and find significant clusters.
Genome reconstruction (translocation, fusion, inversion…, weighting of these events).
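The significance criterion in point 2/ can be illustrated with a hypergeometric tail probability: given a genome of N genes of which K have an ortholog in the reference region, how likely is it to see at least k of them in a window of n genes by chance? This is a generic textbook test used here as a stand-in; it is an assumption of this sketch, not the statistical test implemented in CASSIOPE, and it ignores complications such as the size of in-paralog families.

```python
from math import comb


def cluster_p_value(genome_genes, orthologs_in_genome, window_genes, orthologs_in_window):
    """P(X >= orthologs_in_window) under a hypergeometric model of random gene order."""
    n_total, k_total = genome_genes, orthologs_in_genome
    upper = min(k_total, window_genes)
    favourable = sum(
        comb(k_total, i) * comb(n_total - k_total, window_genes - i)
        for i in range(orthologs_in_window, upper + 1)
    )
    return favourable / comb(n_total, window_genes)


# Hypothetical numbers: 20 000 genes, 50 of them orthologous to genes of the
# reference region, and 5 of those found together in a window of 30 genes.
P_VALUE = cluster_p_value(20000, 50, 30, 5)
```

A very small value means the clustering is unlikely to be due to chance alone, which is the property required before calling a region a conserved synteny.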
Mathematical modelling
Modelling is needed where software tools do not exist or where their biological formalism is not correct. This is the case for the statistical tests of clustering (in particular with respect to the size of in-paralog families).
Modelling of genome reconstruction.
Computational formalization
1) Algorithms: statistical tests, modelling of ancestral genome reconstruction
2) Integration with the other software tools in the computing system (CASSIOPE)
• Bio-analysis
  - Automatic search for conserved syntenies
  - Reconstruction and evolution of genomic regions
• New knowledge and new concepts
  - Direct application: help with positional cloning
  - Concepts/knowledge: detection of functional clustering
C.A.S.S.I.O.P.E
• C.A.S.S.I.O.P.E: Clever Agent System for Synteny Inheritance and Other Phenomena in Evolution
• Finds conserved regions between genomes
• Toward ancestral genome reconstruction
For more info see Virginie Lopez Rascol
Collaborators
MEG project* (Genome Evolution Modelling)
Nathalie Balandraud, Etienne Danchin, Philippe Gouret, Vérane Vitiello
• Math/bio: Julien Berestycki*, Simona Grusea*, Stéphanie Léocard*, Valda Limic*, Laure Rigal*, Etienne Pardoux*
• Info/bio: Olivier Chabrol*, Virginie Lopez*, Cedric Notredame*
• Concepts and bio-analysis: Roxane Barthelemy*, Jean-Paul Casanova*, Elodie Darbo*, Anthony Levasseur*, Eric Faure*, Pierre Pontarotti*
http://www.up.univ-mrs.fr/evol/
Open Discussion
Phylo postgenomic