Global Classification of (Plant) Proteins across Multiple Species


Page 1: Global Classification of (Plant) Proteins across Multiple Species

Global Classification of (Plant) Proteins across Multiple Species

Kerr Wall, Jim Leebens-Mack, Naomi Altman, Victor Albert, Dawn Field, Hong Ma, Claude dePamphilis

Page 2: Global Classification of (Plant) Proteins across Multiple Species

Global Classification of Proteins

• The protein classification problem

• A method for global classification

• “Bootstrap” support for global classification

• Structure within clusters

• Structure between clusters

• Results from complete proteome classification: Arabidopsis, Oryza and Populus

Page 3: Global Classification of (Plant) Proteins across Multiple Species

The protein classification problem

• Genomic sequence can be translated into protein sequence but …

• The function of most proteins is unknown.

• Protein classification is used to:

– infer protein folding structure

– infer protein function

– infer evolutionary relationships

Page 4: Global Classification of (Plant) Proteins across Multiple Species

Similarity of Protein Sequence

FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---
FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---
FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP
FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---
FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ
FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP

• Each row represents a different protein.

• Each letter represents an amino acid.

• Each “–” represents a gap: a position that is missing in this sequence but occupied in another protein in this set.

• In closely related proteins, the distance between proteins is the number of mismatches.

• For distantly related proteins, the sequences are given a score, often a measure of how likely a random sequence is to match as well (e.g. the BLAST E-value, the expected number of chance matches scoring at least as well).
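The mismatch count for closely related proteins can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `mismatch_distance` is hypothetical, and counting a gap against a residue as one mismatch is an assumption the slide does not state. The two sequences are the third and sixth rows of the alignment on this slide.

```python
def mismatch_distance(a: str, b: str) -> int:
    """Count alignment columns in which two aligned sequences differ.

    Assumption: a gap ("-") opposite a residue counts as one mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Third and sixth rows of the alignment shown on this slide.
row3 = "FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP"
row6 = "FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP"

print(mismatch_distance(row3, row6))  # 5
```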

Page 5: Global Classification of (Plant) Proteins across Multiple Species

Inferring Evolutionary Relationships

Main methods:

• Statistical phylogeny based on sequence alignment and evolutionary models

– requires a high degree of sequence similarity

– good alignments use slow algorithms and often a lot of manual intervention

• Manual curation

– requires a large amount of manual intervention

– can incorporate sequence, folding structure and function

These methods work well for hundreds of genes.

Page 6: Global Classification of (Plant) Proteins across Multiple Species

Global Classification of Proteins

Very high throughput:

Arabidopsis 26,207

Rice 57,915

Poplar 45,555

Total 129,677

Our goal: The joint classification of all known plant proteins using a “scaffold” derived from the 3 completely sequenced species

Page 7: Global Classification of (Plant) Proteins across Multiple Species

A method for global classification

• Clustering based on a similarity (or distance) matrix is commonly used.

• Our similarity matrix is 129,677 x 129,677, so we need:

– A quick method for computing distance (BLAST E-values are often used; we use -log(E-value) as the similarity measure)

– A quick method for clustering (sparse matrix computations are often used)
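As a rough sketch of how such a similarity matrix might be stored sparsely, the snippet below converts BLAST hits to -log10(E-value) and keeps only pairs that pass a cutoff. The function name `similarity_from_hits`, the `(query, subject, evalue)` tuple layout and the `floor` used to avoid taking the log of a reported E-value of 0 are all assumptions made for illustration; the 1e-5 cutoff is the one quoted for the BLAST search in this talk.

```python
import math


def similarity_from_hits(hits, cutoff=1e-5, floor=1e-200):
    """Build a sparse similarity map {(query, subject): -log10(E-value)}.

    hits   : iterable of (query, subject, evalue) tuples (assumed layout)
    cutoff : hits with a larger E-value are dropped (kept out of the map)
    floor  : hypothetical lower bound so that an E-value of 0 stays finite
    """
    sim = {}
    for query, subject, evalue in hits:
        if evalue > cutoff:
            continue  # weak hit: leave this matrix entry at zero
        score = -math.log10(max(evalue, floor))
        key = (query, subject)
        # Keep the best score if a pair is reported more than once.
        sim[key] = max(score, sim.get(key, 0.0))
    return sim


example_hits = [
    ("p1", "p2", 1e-30),
    ("p1", "p3", 0.5),    # above the cutoff, dropped
    ("p2", "p3", 1e-8),
    ("p1", "p2", 1e-12),  # duplicate pair, weaker than the first hit
]
sparse_similarity = similarity_from_hits(example_hits)
```

Only pairs that pass the cutoff occupy memory, which is what makes a 129,677 x 129,677 matrix tractable.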

Page 8: Global Classification of (Plant) Proteins across Multiple Species

TribeMCL Clustering Algorithm

Predicted protein sequences from the fully sequenced genomes of Arabidopsis thaliana Columbia (26,207) and Oryza sativa japonica (57,915) were downloaded from TIGR. Populus trichocarpa (45,555) was downloaded from JGI.

All sequences were compared against each other using BLASTp 2.4 with an E-value cutoff of 1 x 10^-5.

The TribeMCL package was used to predict putative protein families at low, medium, and high stringencies (I = 1.2, 3, 5).

The results are stored at http://www.floralgenome.org/cgi-bin/tribedb/tribe.cgi

Page 9: Global Classification of (Plant) Proteins across Multiple Species
Page 10: Global Classification of (Plant) Proteins across Multiple Species
Page 11: Global Classification of (Plant) Proteins across Multiple Species
Page 12: Global Classification of (Plant) Proteins across Multiple Species

TribeMCL Method

Enright, Van Dongen and Ouzounis (2002)

• Similarity is measured by -log10(BLAST E-value)

• Clustering is done by the MCL method

Page 13: Global Classification of (Plant) Proteins across Multiple Species

MCL Algorithm (van Dongen, 2000)

Suppose S is the similarity matrix.

1. Normalize the rows of S to sum to 1.

2. Raise each entry to the power r > 1 (r is the “stringency”) and renormalize to obtain S(r).

3. Take a “Markov step”: replace S with S(r)’S(r).

4. Iterate to convergence.

It is very fast because low similarities are truncated to zero and sparse matrix methods can then be used.
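The iteration can be sketched on a small dense matrix. This is a toy illustration under stated assumptions, not the MCL package itself: it uses plain Python lists instead of sparse matrices, keeps the row-normalized convention of the slide, takes the Markov step as an ordinary matrix square before raising entries to the power r, and reads clusters off as connected components of the converged matrix. The names `mcl`, `components`, `prune` and `max_iter` are hypothetical.

```python
def normalize_rows(m):
    """Scale every row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def markov_step(m):
    """One random-walk step: the matrix product m * m."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, r=2.0, max_iter=100, tol=1e-9, prune=1e-6):
    """Toy MCL: alternate a Markov step with entrywise powering by r > 1."""
    if r <= 1:
        raise ValueError("the stringency r must be greater than 1")
    m = normalize_rows(similarity)
    for _ in range(max_iter):
        stepped = markov_step(m)
        powered = normalize_rows([[x ** r for x in row] for row in stepped])
        # Truncate very small entries to zero, then renormalize.
        nxt = normalize_rows([[x if x >= prune else 0.0 for x in row]
                              for row in powered])
        delta = max(abs(a - b) for ra, rb in zip(nxt, m) for a, b in zip(ra, rb))
        m = nxt
        if delta < tol:
            break
    return m


def components(m):
    """Group nodes linked by any nonzero entry of m (union-find)."""
    parent = list(range(len(m)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, row in enumerate(m):
        for j, x in enumerate(row):
            if x > 0.0:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(m)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two tight groups (0, 1, 2) and (3, 4) joined by one weak link.
S = [
    [10, 10, 10, 0, 0],
    [10, 10, 10, 0, 0],
    [10, 10, 10, 1, 0],
    [0, 0, 1, 10, 10],
    [0, 0, 0, 10, 10],
]
toy_clusters = components(mcl(S, r=2.0))
print(toy_clusters)  # [[0, 1, 2], [3, 4]]
```

The weak link between nodes 2 and 3 shrinks quickly under repeated powering and is truncated to zero after a few iterations, which is what splits the two groups.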

Page 14: Global Classification of (Plant) Proteins across Multiple Species

A Heuristic for MCL

We take a random walk on the graph described by the similarity matrix

BUT

After each step we weaken the links between distant nodes and strengthen the links between nearby nodes

Graphic from van Dongen, 2000

Page 15: Global Classification of (Plant) Proteins across Multiple Species

Cluster pattern at convergence as a function of r

[Figure: similarity matrix (labelled values 16, 40, 60) and cluster patterns at r = 2.0, 2.6, 2.8 and 2.9]

Small groups break apart first.

The pattern is quite robust to changes in the similarity of the green region.

Page 16: Global Classification of (Plant) Proteins across Multiple Species

Cluster pattern at convergence as a function of r

[Figure: similarity matrix (labelled values 16, 40, 50, 60) and cluster patterns at r = 2.0, 2.6, 2.8 and 3.1]

At r=3.6 all units separate.

The additional similarity indicated by pink has a profound effect.

Page 17: Global Classification of (Plant) Proteins across Multiple Species

Cluster pattern at convergence as a function of r

[Figure: similarity matrix (labelled values 30, 40, 60) and cluster patterns at r = 2.0, 2.6, 2.7 and 2.8]

More strongly connecting the “background” disrupts the pattern until r=2.7, after which we quickly cycle through the pattern (2.9 turns the center group into singletons and 3.0 turns everything into singletons).

Page 18: Global Classification of (Plant) Proteins across Multiple Species

Cluster pattern at convergence as a function of r

[Figure: similarity matrix (labelled values 16, 30, 60) and cluster patterns at r = 2.0, 2.1 and 2.3]

Weakening the within-cluster similarity accelerates the breakdown into singletons.

Page 19: Global Classification of (Plant) Proteins across Multiple Species

Cluster pattern at convergence as a function of r

[Figure: similarity matrix (labelled values 25, 30, 60) and cluster patterns at r = 2.0 and 2.3]

Strengthening the “background” while weakening the within-cluster similarity makes it difficult to pick out the clusters.

Page 20: Global Classification of (Plant) Proteins across Multiple Species

Some Summary Statistics for the Clusters

Protein Set                    Number of Proteins    Number of Clusters at r=3    Percent of Singletons
Arabidopsis                    26,207                11,467 (44%)                 69%
Arabidopsis + Rice             84,122                28,175 (33%)                 68%
Arabidopsis + Rice + Poplar    129,677               35,873 (28%)                 67%
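The parenthesized percentages are consistent with the number of clusters divided by the number of proteins. That reading is an inference from the figures, not something the slide states, and the short check below (with the hypothetical helper `cluster_share`) simply redoes the arithmetic.

```python
# (protein set, number of proteins, number of clusters at r=3), from the table.
rows = [
    ("Arabidopsis", 26207, 11467),
    ("Arabidopsis + Rice", 84122, 28175),
    ("Arabidopsis + Rice + Poplar", 129677, 35873),
]


def cluster_share(n_proteins: int, n_clusters: int) -> float:
    """Clusters as a percentage of proteins (assumed meaning of the figure)."""
    return 100.0 * n_clusters / n_proteins


shares = [round(cluster_share(p, c)) for _, p, c in rows]
print(shares)  # [44, 33, 28]
```

The ratio falls as species are added, so on this reading proteins from the added proteomes mostly join existing clusters rather than founding new ones.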

Page 21: Global Classification of (Plant) Proteins across Multiple Species
Page 22: Global Classification of (Plant) Proteins across Multiple Species

% Singletons

Cluster    ATH    Rice    Poplar
ATH        30%    -       -
+Rice      17%    25%     -
+Poplar    12%    24%     15%

Page 23: Global Classification of (Plant) Proteins across Multiple Species
Page 24: Global Classification of (Plant) Proteins across Multiple Species
Page 25: Global Classification of (Plant) Proteins across Multiple Species

Comparing Tribes to Phylogenetic Trees from Sequence Alignment

Tribes for large gene families show some, but not complete, correspondence to inferred phylogenetic relationships. Tribes with MADS genes formed at low, medium and high stringencies are mapped onto a recently published Arabidopsis MADS gene phylogeny (Martinez-Castilla & Alvarez-Buylla 2003).

Page 26: Global Classification of (Plant) Proteins across Multiple Species

Comparisons with curated gene families

• Added tribe information to TAIR’s gene families

– www.floralgenome.org/cgi-bin/tair/tair.cgi

– E.g. Cytochrome P450

Page 27: Global Classification of (Plant) Proteins across Multiple Species
Page 28: Global Classification of (Plant) Proteins across Multiple Species

“Bootstrap” Support for Clusters

To determine the stability of the clusters, we need some type of perturbation of the system. We use the “0.632 jackknife” instead of the bootstrap (as we want a set of unique proteins).

We clustered 100 samples, each a random selection of 63.2% of the proteins.

For each tribe, we count “1” each time all of its genes that were selected for the sample are clustered together.
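A minimal sketch of this resampling scheme follows. It is not the authors' implementation: `jackknife_support`, `cluster_fn` and `by_prefix` are hypothetical names, `cluster_fn` stands in for a full BLAST plus TribeMCL run on the sampled proteins, and a sample that happens to contain none of a tribe's proteins is assumed to contribute 0 to that tribe's count.

```python
import random


def jackknife_support(items, tribes, cluster_fn,
                      n_samples=100, fraction=0.632, seed=0):
    """Fraction of subsamples in which each tribe's sampled members stay together.

    items      : list of all protein identifiers
    tribes     : list of lists, the clusters found on the full data
    cluster_fn : callable taking a list of identifiers and returning clusters
                 (hypothetical stand-in for reclustering the subsample)
    """
    rng = random.Random(seed)
    k = round(fraction * len(items))
    counts = [0] * len(tribes)
    for _ in range(n_samples):
        sample = set(rng.sample(items, k))  # drawn without replacement
        label = {}
        for cluster_id, members in enumerate(cluster_fn(sorted(sample))):
            for member in members:
                label[member] = cluster_id
        for t, tribe in enumerate(tribes):
            kept = [p for p in tribe if p in sample]
            # Count 1 when every sampled member landed in one cluster.
            if kept and len({label[p] for p in kept}) == 1:
                counts[t] += 1
    return [c / n_samples for c in counts]


def by_prefix(sample):
    """Toy clustering used only for this demo: group names by first letter."""
    groups = {}
    for name in sample:
        groups.setdefault(name[0], []).append(name)
    return list(groups.values())


proteins = ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5"]
tribes = [proteins[:5], proteins[5:], ["a1", "b1"]]
support = jackknife_support(proteins, tribes, by_prefix)
```

In the demo the first two tribes match what the toy clustering recovers, while the third mixes proteins from both groups and therefore loses support whenever both of its members are sampled.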

Page 29: Global Classification of (Plant) Proteins across Multiple Species
Page 30: Global Classification of (Plant) Proteins across Multiple Species
Page 31: Global Classification of (Plant) Proteins across Multiple Species

From Tribes to Phylogenetics

• Within each tribe of 3 or more proteins we can do hierarchical clustering using the similarity matrix (Harlow, Gogarten, Ragan, 2004), or form a careful alignment and build a phylogenetic tree.

• We can also form SuperTribes, by clustering the tribes. Because we still have a large set of objects to cluster, we continue to use MCL.

• Within a SuperTribe, we can do hierarchical clustering.

• The SuperTribe for the MADS family shown earlier includes all the MADS sequences

Page 32: Global Classification of (Plant) Proteins across Multiple Species

Single Linkage TribeMCL

• Define the distance between tribes as the smallest pairwise E-value.

• Use TribeMCL on the resulting similarity matrix.

• Use hierarchical clustering within supertribes.

[Diagram: Single Linkage TribeMCL, followed by hierarchical clustering or phylogenetic trees]
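The single-linkage step between tribes can be sketched as below. The names `tribe_similarity`, `hits`, `tribe_of` and `floor` are hypothetical, and converting the smallest E-value to -log10 so that it can feed a similarity-based clustering is an assumption carried over from the earlier slides rather than something stated here.

```python
import math


def tribe_similarity(hits, tribe_of, floor=1e-200):
    """Single linkage between tribes: keep the smallest pairwise E-value.

    hits     : {(protein_a, protein_b): evalue} for protein pairs with a hit
    tribe_of : {protein: tribe_id}
    floor    : hypothetical lower bound so that an E-value of 0 stays finite
    Returns {(tribe_i, tribe_j): -log10(smallest E-value)} with tribe_i < tribe_j.
    """
    best = {}
    for (p, q), evalue in hits.items():
        a, b = tribe_of[p], tribe_of[q]
        if a == b:
            continue  # within-tribe hits do not link tribes
        key = (min(a, b), max(a, b))
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return {key: -math.log10(max(e, floor)) for key, e in best.items()}


hits = {
    ("p1", "q1"): 1e-10,
    ("p2", "q1"): 1e-30,  # best link between tribes 0 and 1
    ("p1", "p2"): 1e-50,  # same tribe, ignored
    ("q1", "r1"): 1e-6,
}
tribe_of = {"p1": 0, "p2": 0, "q1": 1, "r1": 2}
links = tribe_similarity(hits, tribe_of)
```

The resulting tribe-by-tribe map has the same shape as the protein-level similarity matrix, so the same clustering procedure can be reused on it.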

Page 33: Global Classification of (Plant) Proteins across Multiple Species

Floral Genome Project and Plant Protein Classification

Page 34: Global Classification of (Plant) Proteins across Multiple Species

Use of the Global Classification

• Project goal is to understand the evolution of flowers.

• Data have been collected to various degrees of intensity on 15 non-model species across the phylogeny of flowering plants and merged with data from other projects.

• PlantTribes will be used to assist in placing these proteins into families to infer evolutionary relationships.

Page 35: Global Classification of (Plant) Proteins across Multiple Species

And many thanks to:

• Kerr Wall – FGP Bioinformatics (PSU)

• Claude dePamphilis – FGP PI (PSU)

• Jim Leebens-Mack – FGP Project Director (PSU)

• Hong Ma – FGP co-PI (PSU)

• Victor Albert – collaborator (U. Oslo)

• Dawn Field – collaborator (Oxford U.)

And FGP collaborators at PSU, UFL and Cornell.

And especially

NSF – Plant Genome Research Program