Construction of Genome Trees from Conservation Profiles of Proteins
Fredj Tekaia, Edouard Yeramian
Institut Pasteur
Outline
• Species tree construction and difficulties;
• Post genome era species tree construction;
• Conservation profiles;
• Genome tree construction based on conservation profiles;
• Conclusions;
• References.
Species tree - Tree Of Life
• 16S/18S rRNA tree (Woese 1990): Woese and others have used rRNA comparisons to construct a “Tree Of Life” showing the evolutionary relationships of a wide variety of organisms.
The « Tree Of Life » has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. One species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node.
The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS. 87:4576-4579. (1990)
The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95:9720-23. (1998)
The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284:2124-8. (1999)
The ring of life, incorporating lateral gene transfer but preserving the prokaryote eukaryote divide. Rivera & Lake JA. Nature 431: 152-5. (2004)
Martin & Embley. Nature 431:134-7. (2004)
The 1.2-Megabase Genome Sequence of Mimivirus. Raoult et al. Science 306:1344-1350. (2004)
Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science 306:1144-1145. (2004)
Prospects for Building the Tree of Life from Large Sequence Databases. Driskell et al. Science 306:1172-1174. (2004)
Pennisi, E. (1998). Genome data shake tree of life.
Science 280:672-4.
New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago. This suggests constructing species trees from whole gene content.
Genome phylogeny based on gene content (1999)
Snel, Bork, Huynen. Nature Genetics 21, 108-110.
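(Figure: gene-content tree; E, A and B label the eukaryal, archaeal and bacterial groups.)
Tekaia, Lazcano & Dujon (1999)
Genome Research 9: 550-7.
(Figure: genome tree; E, A and B label the eukaryal, archaeal and bacterial groups.)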
Complete genomes (http://www.genomesonline.org/, 14-11-2006)
• 2208 projects
• 460 published (387 + 29 + 44)
• 1054 prokaryotes
• 631 eukaryotes
Genomes, 2nd edition (2002). T.A. Brown
Gene tree - Species tree
(Figure: a species tree and a gene tree for species A, B and C along a time axis; the gene tree is shaped by duplication and speciation events.)
Problems with species tree construction
• The main difficulties in species tree construction include extensive incongruence between alternative phylogenies generated from single-gene data sets:
- genes do not evolve at the same rate nor in the same way;
- the evolutionary history inferred from one gene may differ from what another gene appears to show.
Alternative solutions: integrative methods
• “supertree”: the supertree approach estimates phylogenies for subsets of genes with good overlap, then combines these subtree estimates into a supertree (Bininda-Emonds et al. 2002).
• Depends on the ability to distinguish between orthologs and paralogs;
• Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced.
• “phylogenomic tree” (based on concatenation of a gene sample common to the considered species);
(Figure: genes concatenated for species S1, ..., Sn.)
• genes don't evolve at the same rate nor in the same way;
• a limited number of genes are shared among all species;
The tree of one percent (2006)
Dagan and Martin. Genome Biology, 7:118.
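The concatenation step, and the reason so few genes qualify for it, can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline of any cited work: the function name `concatenate_shared_genes`, the dictionary layout and the toy sequences are hypothetical, and each gene is assumed to be already aligned to the same length in every species that carries it.

```python
def concatenate_shared_genes(alignments):
    """Concatenate, per species, the aligned genes present in every species.

    `alignments` maps species name -> {gene name: aligned sequence}.
    (Hypothetical layout, chosen for illustration only.)
    """
    gene_sets = [set(genes) for genes in alignments.values()]
    # Only genes found in all species can enter the concatenated alignment.
    shared = sorted(set.intersection(*gene_sets))
    concatenated = {
        species: "".join(genes[g] for g in shared)
        for species, genes in alignments.items()
    }
    return shared, concatenated


# Toy data: three species, three gene families, only one shared by all.
toy = {
    "S1": {"geneA": "MKV-", "geneB": "AAGT", "geneC": "TT"},
    "S2": {"geneA": "MRV-", "geneB": "AAGA"},
    "S3": {"geneA": "MKVL", "geneC": "TA"},
}
shared, concatenated = concatenate_shared_genes(toy)
print(shared)
print(concatenated)
```

Even in this toy case two of the three gene families drop out, which is the effect the bullets above describe.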
More generally, these methods suffer from difficulties related to phylogenetic tree construction:
• global sequence alignment (quality, gaps, ...);
• different evolutionary histories of genes;
• substitution saturation; ...
and, more seriously,
• gene sampling difficulties.
Gene tree - Species tree: The gene sampling problem
(Figure, adapted from Linder, Moret, Nakhleh, Warnow: true species tree for A, B and C, with red and blue gene copies.)
• Red is lost in C;
• Blue is lost in A and B;
• gene tree ≠ species tree.
Gene tree - Species tree: The gene sampling problem
• All red orthologs have been lost in the 3 species.
• Luckily, sampling gives the blue orthologs: the true species tree is reconstructed.
Gene tree - Species tree: The gene sampling problem
• All versions of the gene are in the 3 species.
• Gene trees are the same as the species tree.
• A genome tree is another alternative for constructing a species tree.
• The concept of genome tree is based on overall gene content similarity (it considers more than single-gene information).
Methodology
• Data matrix T with entries kij > 0 (indexed by i and j; supplementary elements marked "sup");
• Correspondence Analysis: factors F1, ..., Fp;
- orthogonal system;
- use of Euclidean distance;
• Classification.
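The distance that the methodology relies on can be illustrated without the full factor decomposition. In correspondence analysis, the Euclidean distance between two rows in the complete factor space equals the chi-square distance between their row profiles, so the sketch below computes that chi-square distance directly. It is not the authors' implementation: `chi2_row_distances` and the toy table are hypothetical, and every row and column of the table is assumed to have a positive sum.

```python
def chi2_row_distances(table):
    """Chi-square distances between the rows of a positive contingency table.

    Each row is divided by its sum (row profile); squared differences between
    profiles are weighted by the inverse of the column masses.
    """
    n_rows, n_cols = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    profiles = [[value / sum(row) for value in row] for row in table]

    dist = [[0.0] * n_rows for _ in range(n_rows)]
    for a in range(n_rows):
        for b in range(a + 1, n_rows):
            d2 = sum(
                (pa - pb) ** 2 / mass
                for pa, pb, mass in zip(profiles[a], profiles[b], col_mass)
            )
            dist[a][b] = dist[b][a] = d2 ** 0.5
    return dist


# Toy table: rows 0 and 1 are proportional, so they share the same profile.
toy_table = [[1, 2, 3], [2, 4, 6], [3, 1, 1]]
distances = chi2_row_distances(toy_table)
for row in distances:
    print([round(d, 3) for d in row])
```

Rows with proportional counts end up at distance zero, which is why the method compares the shape of each row rather than its size. A classification step (for instance hierarchical clustering) can then be run on these distances.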
Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999): 27 eukaryal, 19 archaeal and 33 bacterial species;
• Proteome-against-proteome comparisons (Proteome 1, ..., Proteome n) with blastp (PAM250, SEG filter);
• 99 species (B: 33; A: 19; E: 27);
• total of 541880 proteins;
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-MCL);
• Families of orthologs (Partition-MCL);
• Distribution of orthologous families according to the three domains of life;
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles;
Note on: Homologs - Paralogs - Orthologs
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
(Figure: along a time axis, an ancestral gene A duplicates into A and B; after speciation, sequence analysis finds A1 and B1 in Species-1 (S1) and A2 and B2 in Species-2 (S2).)
• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes.
Evolutionary processes acting on the ancestor species genome include:
• Phylogeny*;
• Expansion*: duplication, genesis;
• Exchange*: HGT;
• Deletion*: loss;
• selection*.
Expansion, Exchange and Deletion are noise. They should be eliminated or at least reduced.
Conservation profiles
To overcome some of these limitations, we consider genome tree construction from “Protein Conservation Profiles”, in an attempt to reduce noisy evolutionary processes.
• A “conservation profile” is an n-component binary vector describing a protein conservation pattern across n species. Components are 0 or 1, according to the absence or presence of homologs. Example:
p = 0111111000111111111000110110111101001111101111
• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of n species (multidimensional feature);
• Conservation profiles are signatures of evolutionary relationships;
• 99 species (B: 33; A: 19; E: 27); 541880 proteins
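As a minimal sketch of the definition above, a profile can be built from the set of species in which a protein has at least one homolog. The function name `conservation_profile`, the species labels and the input format are assumptions made for illustration; how homologs are called (for example, which blastp threshold is used) is not specified here and is left outside the sketch.

```python
def conservation_profile(species_with_homolog, species_order):
    """Return the binary conservation profile of one protein.

    `species_with_homolog` lists the species in which a homolog was found;
    `species_order` fixes the meaning of each position in the profile.
    """
    present = set(species_with_homolog)
    return "".join("1" if s in present else "0" for s in species_order)


# Hypothetical example with n = 5 species.
species_order = ["S1", "S2", "S3", "S4", "S5"]
profile = conservation_profile({"S1", "S2", "S5"}, species_order)
print(profile)
```

Keeping one fixed species order for every protein is what makes profiles comparable position by position.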
Main interesting properties of conservation profiles:
Protein conservation profiles
Table: 541880 proteins x 99 species (columns S1 ... Sn, grouped as E | A | B)
G1,1   100000000000000000000000000000000000000000000000
G2,1   111111111111111111111111111111111111111111111111
G3,1   111111110011111111111111011101110101111111101111
...
Gn1,1  100001110001000000000000000000000000000000000000
G1,2   010000000000000000010100000000000111000011100011
G2,2   010000000000000000010100000000000111000011100011
...
Gn2,2  111111110011111111111111011101110101111111101111
...
G1,n   011110100000000000000000001000000000000000000001
G2,n   111111110011111111100011011101110101111111101111
G3,n   111111110011111111100011011101110101111111101111
...
Gnp,n  100110000000000000000000000000000000000000000001
• Different conservation profiles represent different evolutionary histories
Distinct conservation profiles
541880  original total proteins (99 species)
442460  non-specific proteins, i.e. conservation profiles (82% of the total)
184130  distinct conservation profiles (42% of the non-specific proteins)
111111110011111111111111011101110101111111101111
100110000000000000000000000000000000000000000001
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
010000000000000000010100000000000111000011100011
................................................
• This set is indicative of the various observed evolutionary histories.
• The effect of the duplication process is reduced (one representative from each set of identical conservation profiles).
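Keeping one representative per set of identical profiles amounts to a simple deduplication. The sketch below is an illustration only: `distinct_profiles` is a hypothetical helper, and the short 3-species profiles are toy values, not data from the study.

```python
def distinct_profiles(profiles):
    """Keep one representative of each identical conservation profile.

    dict.fromkeys preserves first-seen order, so the result is deterministic.
    """
    return list(dict.fromkeys(profiles))


# Toy profiles over 3 species; duplicated proteins share the same profile.
all_profiles = ["111", "101", "111", "100", "101"]
unique = distinct_profiles(all_profiles)
print(len(all_profiles), "profiles,", len(unique), "distinct")
print(unique)
```

Because paralogs produced by duplication tend to share a profile, collapsing them keeps large families from dominating the downstream species comparison.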
(Figure: histogram. x-axis: conservation weights (sum of "1": presence), from c01 to c99; y-axis: fractions (*10000) of distinct conservation profiles.)
Presence in the 184130 distinct conservation profiles: Mean=32.2; SD=23.3; min=1; Max=99.
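The conservation weight of a profile is its number of "1" components, so summary statistics like those above follow directly from the profile strings. The sketch uses toy profiles and a hypothetical helper name, `conservation_weights`; whether the reported SD is a population or a sample standard deviation is not stated, so the choice of `pstdev` here is an assumption.

```python
from statistics import mean, pstdev


def conservation_weights(profiles):
    """Number of species (count of '1') in which each profile is present."""
    return [p.count("1") for p in profiles]


# Toy distinct profiles over 3 species.
weights = conservation_weights(["111", "101", "100"])
summary = {
    "mean": mean(weights),
    "sd": pstdev(weights),  # population SD: an assumption, see lead-in
    "min": min(weights),
    "max": max(weights),
}
print(weights)
print(summary)
```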
Genome tree construction: data matrices
• Input: the 184130 distinct conservation profiles (d.c.prof), which capture the various evolutionary histories.
• Jaccard similarity scores between species i and j:
sij = N11/(N11+N01+N10);
N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between columns i and j of the profile table.
• Similarity matrix: T = { Tij = sij ; i = 1,...,n; j = 1,...,n }, where n is the number of species.
Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7):e75
(Figure: profiles tree)
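The Jaccard matrix defined above can be sketched directly from the profile strings, where species i corresponds to position i of every profile. The tree-building step is not detailed on this slide, so the average-linkage clustering on distances 1 - sij shown here is a stand-in chosen for illustration, not the authors' method. All function names and the toy profiles are hypothetical.

```python
def jaccard(col_i, col_j):
    """sij = N11 / (N11 + N01 + N10) for two species columns of 0/1 characters."""
    n11 = sum(a == "1" and b == "1" for a, b in zip(col_i, col_j))
    differ = sum(a != b for a, b in zip(col_i, col_j))  # N10 + N01
    return n11 / (n11 + differ) if n11 + differ else 0.0


def similarity_matrix(profiles):
    """T = {sij}: Jaccard similarity between every pair of species."""
    n = len(profiles[0])
    columns = ["".join(p[k] for p in profiles) for k in range(n)]
    return [[jaccard(columns[i], columns[j]) for j in range(n)] for i in range(n)]


def average_linkage(dist, labels):
    """Agglomerative clustering (average linkage); returns a nested tuple."""
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                ma, mb = clusters[ids[x]][1], clusters[ids[y]][1]
                d = sum(dist[p][q] for p in ma for q in mb) / (len(ma) * len(mb))
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        _, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy distinct profiles over 4 species (positions = A, B, C, D).
toy_profiles = ["1100", "1100", "1110", "0011", "0011", "0111"]
T = similarity_matrix(toy_profiles)
D = [[1.0 - s for s in row] for row in T]
tree = average_linkage(D, ["A", "B", "C", "D"])
print(tree)
```

Positions where both species show 0 do not enter sij, so a shared absence is never counted as similarity.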
Conclusions: Methodology
• Species classification is not an easy task!
• Methods that take into account whole-genome information are still needed;
• Species tree construction should take into account the whole information included in the genomes;
• Correspondence analysis may help reveal evolutionary trends embedded in the multidimensional relationships obtained from large-scale genome comparisons.
Conclusions...
• Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species;
• Thus they should correspond to the most accurate type of markers for species classification;
• In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals;
• The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering;
• The profiles tree corresponds to the classification of the evolutionary scenarios.
Acknowledgments:
The support of:
• The Institut Pasteur (Strategic Horizontal Programme on Anopheles gambiae);
• The Ministère de la Recherche Scientifique (France): ACI-IMPBIO-2004–98-GENEPHYS program;
• Bernard Dujon (Institut Pasteur).
References:
• Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution 49:591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 9:550-7.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297:51-60.
• Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7):e75.
• Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics 7:307.
• Tekaia, F. and Latgé, J.P. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr Opin Microbiol. 8:385-92. Review.
• Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html
References (continued):
• Bininda-Emonds, O.R.P. (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395:745-757.
• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33:265-289.
• Crandall, K.A. and Buhay, J.E. (2004). Genomic Databases and the Tree of Life. Science 306:1144-1145.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology 7:118.
• Delsuc, F., Brinkmann, H. and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361-75. Review.
• Doolittle (1999). Science 284:2124-8.
• Driskell et al. (2004). Prospects for Building the Tree of Life from Large Sequence Databases. Science 306:1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431:134-7.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.
• Raoult et al. (2004). The 1.2-Megabase Genome Sequence of Mimivirus. Science 306:1344-1350.
• Rivera & Lake (2004). Nature 431:152-5.
• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content. Nature Genetics 21:108-110.
• Snel, B., Huynen, M.A. and Dutilh, B.E. (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59:191-209. Review.
• Woese et al. (1990). PNAS 87:4576-4579.