Page 1:

Construction of Genome Trees from Conservation Profiles of Proteins

Fredj Tekaia Edouard Yeramian

Institut Pasteur

tekaia@pasteur.fr

Page 2:

Outline

• Species tree construction and difficulties;

• Post-genome era species tree construction;

• Conservation profiles;

• Genome tree construction based on conservation profiles;

• Conclusions;

• References.

Page 3:

Species tree - Tree Of Life

• 16S/18S rRNA tree (Woese 1990): Woese and others have used rRNA comparisons to construct a “Tree Of Life” showing the evolutionary relationships of a wide variety of organisms.

The « Tree Of Life » has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. One species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node.

Page 4:

The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87:4576-4579 (1990).

The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95:9720-23 (1998).

The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284:2124-8 (1999).

The ring of life, incorporating lateral gene transfer but preserving the prokaryote-eukaryote divide. Rivera & Lake. Nature 431:152-5 (2004).

Martin & Embley. Nature 431:134-7 (2004).

Page 5:

The 1.2-Megabase Genome Sequence of Mimivirus. Raoult et al. Science 306:1344-1350 (2004).

Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science 306:1144-1145 (2004).

Prospects for Building the Tree of Life from Large Sequence Databases. Driskell et al. Science 306:1172-1174 (2004).

Page 6:

Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.

New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago.

This suggests constructing species trees from the whole gene content of genomes.

Page 7:

Genome phylogeny based on gene content (1999)

Snel, Bork, Huynen. Nature Genetics 21, 108-110.

[Figure: gene content tree, with groups labelled E (eukaryotes), A (archaea) and B (bacteria)]

Page 8:

Tekaia, Lazcano & Dujon (1999)

Genome Research 9: 550-7.

[Figure: genome tree, with groups labelled E (eukaryotes), A (archaea) and B (bacteria)]

Page 9:

Complete genomes (14-11-2006)

2208 projects:

• 460 published (387 + 29 + 44);

• 1054 prokaryotes;

• 631 eukaryotes.

http://www.genomesonline.org/

Page 10:

Gene tree - Species tree

[Figure: a species tree and a gene tree for species A, B and C, drawn along a time axis, with nodes labelled Duplication and Speciation]

Source: T.A. Brown. Genomes, 2nd edition (2002).

Page 11:

Problems with species tree construction

• The main difficulties in species tree construction include extensive incongruence between alternative phylogenies generated from single-gene data sets:

- genes do not evolve at the same rate or in the same way;

- the evolutionary history inferred from one gene may differ from what another gene appears to show.

Page 12:

Alternative solutions: integrative methods

• “Supertree” (Bininda-Emonds et al. 2002): the supertree approach estimates phylogenies for subsets of genes with good overlap, then combines these subtree estimates into a supertree.

• It depends on the ability to distinguish between orthologs and paralogs;

• Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced.

Page 13:

• “Phylogenomic tree” (based on concatenation of a sample of genes common to the species considered);

[Diagram: concatenated genes for species S1 ... Sn]

• genes do not evolve at the same rate or in the same way;

• only a limited number of genes are shared among all species.

The tree of one percent (2006)

Dagan and Martin. Genome Biology, 7:118.

Page 14:

More generally, these methods suffer from difficulties inherent to phylogenetic tree construction:

• global sequence alignment (quality, gaps, ...);

• different evolutionary histories of genes;

• substitution saturation; ...

and, more seriously, from gene sampling difficulties.

Page 15:

Gene tree - Species tree: The gene sampling problem

[Figure: the true species tree for species A, B and C, and the gene tree obtained from the sampled genes. Adapted from: Linder, Moret, Nakhleh, Warnow.]

• Red is lost in C;

• Blue is lost in A and B;

• gene tree ≠ species tree.

Page 16:

Gene tree - Species tree: The gene sampling problem

[Figure: trees for species A, B and C]

All red orthologs have been lost in the 3 species.

Luckily, sampling gives the blue orthologs: the true species tree is reconstructed.

Page 17:

Gene tree - Species tree: The gene sampling problem

[Figure: gene trees for species A, B and C]

All versions of the gene are present in the 3 species.

Gene trees are the same as the species tree.

Page 18:

A genome tree is another way to construct a species tree.

• The concept of a genome tree is based on overall gene content similarity (it considers more than single-gene information).

Page 19:

Methodology

[Diagram: a matrix T with entries kij > 0 (rows 1 ... i ... p, columns 1 ... j ... n, plus elements marked "sup") is submitted to Correspondence Analysis, which gives factor axes F1 ... Fp; the points in this factor space are then submitted to Classification.]

• orthogonal system;

• use of Euclidean distance;
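A general property of correspondence analysis, not specific to this talk, is that the Euclidean distance between two rows measured over all factor axes equals the chi-square distance between their profiles in the original table. The sketch below computes only that chi-square distance, not the factor axes; the table values and the function name `chi2_distance` are illustrative assumptions.

```python
from math import sqrt


def chi2_distance(table, a, b):
    """Chi-square distance between the profiles of rows a and b.

    `table` is a list of equal-length rows of non-negative numbers in which
    every row and every column has a positive sum.
    """
    n_cols = len(table[0])
    total = sum(sum(row) for row in table)
    # Column masses: share of the grand total carried by each column.
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    row_a_sum, row_b_sum = sum(table[a]), sum(table[b])
    return sqrt(sum(
        (table[a][j] / row_a_sum - table[b][j] / row_b_sum) ** 2 / col_mass[j]
        for j in range(n_cols)
    ))


# Hypothetical 3 x 3 table; rows 0 and 1 are proportional to each other.
table = [
    [2, 1, 1],
    [4, 2, 2],
    [1, 3, 0],
]
d01 = chi2_distance(table, 0, 1)  # proportional rows have identical profiles
d02 = chi2_distance(table, 0, 2)
```

Because the distance compares row profiles (rows divided by their sums), two rows that differ only in overall abundance end up at the same point.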

Page 20:

Systematic Analysis of Completely Sequenced Organisms

• In silico species-specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999)

(27 eucaryal, 19 archaeal and 33 bacterial species: 541880 proteins)

[Diagram: each proteome is compared with Proteome 1 ... Proteome n using blastp, PAM250, SEG filter]

• 99 species (B: 33; A: 19; E: 27)

• total of 541880 proteins

Page 21:

Systematic Analysis of Completely Sequenced Organisms

• In silico species specific comparisons (27 eucaryal, 19 archaeal and 33 bacterial species: 541880 proteins)

• Degree of ancestral duplication and of ancestral conservation between pairs of species;

• Families of paralogs (Partition-MCL);

• Families of orthologs (Partition-MCL);

• Distribution of orthologous families according to the three domains of life;

• Determination of the protein dictionary (orthologs);

• Determination of protein conservation profiles;

Page 22:

Note on: Homologs - Paralogs - Orthologs

[Figure: along a time axis, a gene A in the ancestor is duplicated into A and B; after speciation, Species-1 (S1) carries A1 and B1 and Species-2 (S2) carries A2 and B2. Sequence analysis compares the genes of the two species.]

Homologs: A1, B1, A2, B2

Paralogs: A1 vs B1 and A2 vs B2

Orthologs: A1 vs A2 and B1 vs B2

Page 23:

• Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes.

[Diagram: from the ancestor to the species genome. Evolutionary processes include Phylogeny*, Expansion*, Exchange*, Deletion* and selection*, with the events duplication, genesis, HGT and loss.]

Expansion, Exchange and Deletion are noise. They should be eliminated or at least reduced.

Page 24:

To overcome some of these limitations, we consider genome tree construction from “Protein Conservation Profiles” and attempt to reduce noisy evolutionary processes.

Page 25:

Conservation profiles

• 99 species (B: 33; A: 19; E: 27); 541880 proteins

• A “conservation profile” is an n-component binary vector describing a protein conservation pattern across n species. Components are 0 and 1, following absence or presence of homologs.

Example, for a protein p: 0111111000111111111000110110111101001111101111

Main interesting properties of conservation profiles:

• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of n species (multidimensional feature);

• Conservation profiles are signatures of evolutionary relationships.
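As a minimal sketch of the definition above, a conservation profile can be built from the set of species in which a protein has a homolog. The species labels, the function name `conservation_profile` and the idea that homolog detection has already been done upstream (for example from similarity search results) are assumptions made for illustration.

```python
def conservation_profile(species_with_homolog, species_order):
    """Return a profile string with one '0'/'1' character per species.

    A component is '1' when a homolog of the protein is present in that
    species and '0' when it is absent.
    """
    present = set(species_with_homolog)
    return "".join("1" if s in present else "0" for s in species_order)


# Hypothetical example with n = 5 species.
species_order = ["S1", "S2", "S3", "S4", "S5"]
hits = {"S1", "S2", "S5"}  # species in which a homolog was detected
profile = conservation_profile(hits, species_order)
print(profile)
```

Keeping the species order fixed for every protein is what makes profiles comparable component by component.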

Page 26:

Protein conservation profiles

Table: 541880 proteins x 99 species (columns S1 ... Sn, grouped as E, A, B)

G1,1    100000000000000000000000000000000000000000000000
G2,1    111111111111111111111111111111111111111111111111
G3,1    111111110011111111111111011101110101111111101111
...
Gn1,1   100001110001000000000000000000000000000000000000
G1,2    010000000000000000010100000000000111000011100011
G2,2    010000000000000000010100000000000111000011100011
...
Gn2,2   111111110011111111111111011101110101111111101111
...
G1,n    011110100000000000000000001000000000000000000001
G2,n    111111110011111111100011011101110101111111101111
G3,n    111111110011111111100011011101110101111111101111
...
Gnp,n   100110000000000000000000000000000000000000000001

• Different conservation profiles represent different evolutionary histories

Page 27:

Distinct conservation profiles

• 541880: original total proteins (99 species)

• 442460: non-specific proteins, i.e. conservation profiles (82%)

• 184130: distinct conservation profiles (42%)

111111110011111111111111011101110101111111101111

100110000000000000000000000000000000000000000001

100000000000000000000000000000000000000000000000

111111111111111111111111111111111111111111111111

010000000000000000010100000000000111000011100011

................................................

• This set is indicative of the various observed evolutionary histories.

• The effect of the duplication process is reduced (one representative from each set of identical conservation profiles).
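Keeping one representative from each set of identical profiles amounts to de-duplication. The sketch below shows one simple way to do it on toy 4-species profiles; the function name `distinct_profiles` and the data are illustrative assumptions, not the authors' pipeline.

```python
def distinct_profiles(profiles):
    """Keep one representative per set of identical profiles.

    First-seen order is preserved because dict keys keep insertion order.
    """
    return list(dict.fromkeys(profiles))


# Hypothetical profiles; repeated profiles collapse to a single entry.
profiles = [
    "1111",
    "1111",
    "1001",
    "0100",
    "1001",
]
distinct = distinct_profiles(profiles)
print(len(profiles), "profiles ->", len(distinct), "distinct profiles")
```

After this step a family of duplicated genes that all share one profile contributes a single row, which is how the duplication effect is reduced.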

Page 28:

[Histogram: fractions (*10000) of distinct conservation profiles (vertical axis, 0 to 250) against conservation weight, i.e. the sum of "1" (presence) components (horizontal axis, c01 to c99)]

Presence in the 184130 distinct conservation profiles: Mean=32.2; SD=23.3; min=1; Max=99.
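The conservation weight of a profile is the number of "1" components, so summary statistics like those quoted above can be computed directly from the distinct profiles. The sketch below uses toy data; whether the slide's SD is a population or a sample standard deviation is not stated, and the population form (`pstdev`) is used here as an assumption.

```python
from statistics import mean, pstdev


def conservation_weight(profile):
    """Number of species in which a homolog is present (count of '1')."""
    return profile.count("1")


# Hypothetical distinct profiles over 4 species.
distinct = ["1111", "1001", "0100"]
weights = [conservation_weight(p) for p in distinct]
summary = {
    "mean": mean(weights),
    "sd": pstdev(weights),  # assumption: population standard deviation
    "min": min(weights),
    "max": max(weights),
}
print(summary)
```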

Page 29:

Genome tree construction: data matrices

• 184130 distinct conservation profiles (rows: various evolutionary histories; columns: species, such as i and j):

111111110011111111111111011101110101111111101111

100110000000000000000000000000000000000000000001

100000000000000000000000000000000000000000000000

111111111111111111111111111111111111111111111111

010000000000000000010100000000000111000011100011

................................................

• Jaccard similarity scores between species i and j:

sij = N11/(N11+N01+N10);

where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between columns i and j.

• Data matrix:

T = { Tij = sij ; i=1,n; j=1,n }, for n species.
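The Jaccard score defined above can be sketched as follows: for each pair of species (columns i and j), count the profiles showing (1,1), (0,1) and (1,0), then fill the matrix T. The toy profiles and the function names `jaccard` and `similarity_matrix` are illustrative assumptions; returning 0.0 when both columns are empty is a convention chosen here, not something the slides specify.

```python
def jaccard(profiles, i, j):
    """Jaccard similarity s_ij = N11 / (N11 + N01 + N10) for columns i and j."""
    n11 = n01 = n10 = 0
    for p in profiles:
        a, b = p[i], p[j]
        if a == "1" and b == "1":
            n11 += 1
        elif a == "0" and b == "1":
            n01 += 1
        elif a == "1" and b == "0":
            n10 += 1
    denom = n11 + n01 + n10
    # Assumption: similarity is 0.0 when neither species has any presence.
    return n11 / denom if denom else 0.0


def similarity_matrix(profiles):
    """Square matrix T with T[i][j] = s_ij over all pairs of species."""
    n = len(profiles[0])
    return [[jaccard(profiles, i, j) for j in range(n)] for i in range(n)]


# Hypothetical distinct profiles over 4 species (one column per species).
profiles = ["1111", "1001", "0100"]
T = similarity_matrix(profiles)
for row in T:
    print(["%.2f" % s for s in row])
```

Joint absences (0,0) are ignored by the score, so two species are not made similar merely because both lack a protein.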

Page 30:

[Figure: profiles tree]

Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7):e75

Page 31:

Conclusions: Methodology

• Species classification is not an easy task!

• Methods that take into account whole-genome information are still needed;

• Correspondence analysis might help reveal evolutionary trends embedded in the multidimensional relationships obtained from large scale genome comparisons;

• Species tree construction should take into account the whole information included in the genomes.

Page 32:

Conclusions...

• Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species;

• Thus they should correspond to the most accurate type of markers for species classification;

• In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals;

• The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering;

• The profiles tree corresponds to the classification of the evolutionary scenarios.

Page 33:

Acknowledgments:

The support of:

• The Institut Pasteur (Strategic Horizontal Programme on Anopheles gambiae)

• The Ministère de la Recherche Scientifique (France): ACI-IMPBIO-2004–98-GENEPHYS program.

• Bernard Dujon (Institut Pasteur).

Page 34:

References:

• Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution 49:591-600.

• Tekaia, F., Lazcano, A. and Dujon, B. (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 9:550-557.

• Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297:51-60.

• Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7):e75.

• Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics 7:307.

• Tekaia, F. and Latgé, J.P. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr Opin Microbiol. 8:385-92. Review.

• Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html

Page 35:

References:

• Bininda-Emonds, O.R.P. (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395:745-757.

• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33:265-289.

• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology 7:118.

• Delsuc, F., Brinkmann, H. and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361-75. Review.

• Doolittle (1999). Science 284:2124-8.

• Driskell et al. (2004). Science 306:1172-1174.

• http://www.genomesonline.org/gold.cgi (list of genome projects)

• Crandall, K.A. and Buhay, J.E. (2004). Science 306:1144-1145.

• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt

• Martin & Embley (2004). Nature 431:134-7.

• MCL: a cluster algorithm for graphs: http://micans.org/mcl/

• Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.

• Rivera & Lake (2004). Nature 431:152-5.

• Raoult et al. (2004). Science 306:1344-1350.

• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content. Nature Genetics 21:108-110.

• Snel, B., Huynen, M.A. and Dutilh, B.E. (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59:191-209. Review.

• Woese et al. (1990). PNAS 87:4576-4579.