
Data Mining of Enzymes

Uri Weingart*, Yair Lavi* and David Horn

School of Physics and Astronomy, Tel Aviv University, Tel Aviv 69978, Israel

Predicting the function of a protein from its sequence is a long-standing challenge, typically addressed using either sequence similarity or sequence motifs. We employ the novel motif method of Kunik et al., which introduces Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise a methodology for searching SPs on arbitrary proteins, determining from its sequence whether a protein is an enzyme and, if so, what its EC classification is. We show that the predictive power of SPs, both for true positives (enzymes) and true negatives (non-enzymes), depends on the coverage length L of all SP matches (the number of amino-acids matched on the protein sequence). In our analysis, L≥7 leads to highly accurate results.
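The coverage length L described above can be illustrated with a minimal sketch. It assumes that SP matches are exact substring matches and that overlapping matches count each residue only once; the function names, the toy sequence and the tiny peptide list are hypothetical and are not taken from the DME implementation.

```python
def coverage_length(sequence, peptides):
    """Count residues of `sequence` covered by at least one exact peptide match."""
    covered = [False] * len(sequence)
    for pep in peptides:
        if not pep:
            continue  # ignore empty peptides
        start = sequence.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            # step by one position so overlapping occurrences are also found
            start = sequence.find(pep, start + 1)
    return sum(covered)


def passes_threshold(sequence, peptides, threshold=7):
    """Apply the L >= 7 criterion quoted in the text (threshold is a parameter)."""
    return coverage_length(sequence, peptides) >= threshold


# Toy example: a made-up sequence containing two peptides quoted on the poster.
toy_sequence = "MKAGQVVHLLSTFHARFDE"
toy_peptides = ["GQVVH", "FHARF"]
L = coverage_length(toy_sequence, toy_peptides)
```

Counting covered positions rather than summing peptide lengths keeps overlapping SPs from inflating L.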

Testing DME on E. coli

Training: all (singly) annotated Swiss-Prot enzymes dated before 1.7.06 (89,854 sequences), producing 87,017 SPs.

Test: enzymes annotated after 1.7.06.

Using the EC labels of SPs to determine the 3rd-level EC assignment of an enzyme, we find the recall on the training set to be 85%. On the test set we obtain precision 99% and recall 73%.
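Comparing predictions at the 3rd level of the EC hierarchy amounts to truncating four-field EC numbers to their first three fields. The helper below is a small sketch of that step only; its name is hypothetical, and it does not reproduce how DME combines the labels of several SPs.

```python
def ec_prefix(ec_number, level=3):
    """Return the first `level` fields of a dotted EC number, e.g. '3.6.3.44' -> '3.6.3'."""
    fields = ec_number.split(".")
    if len(fields) < level:
        raise ValueError(f"{ec_number!r} has fewer than {level} EC levels")
    return ".".join(fields[:level])


def same_at_level(ec_a, ec_b, level=3):
    """True when two EC numbers agree on their first `level` fields."""
    return ec_prefix(ec_a, level) == ec_prefix(ec_b, level)
```

For instance, 4.2.3.5 and 4.2.3.4 agree at level 3, while 3.6.3.44 and 2.7.1.130 differ already at level 1.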

Analysis of an E. coli test set, containing enzymes annotated after 1.7.06 and proteins that are not annotated as enzymes:

Labels compare the predictor (DME) with the oracle (Swiss-Prot): P/P denotes the same EC prediction by both, P/DP different predictions, P/NP a DME prediction with no EC assignment by Swiss-Prot, etc. Correspondingly we define:

Precision = A/(A+B)
Recall = A/(A+B+D)
Putative Novelty = C/(A+B+C)
Specificity = E/(E+B+C)
Accuracy = (A+E)/(all)

Here precision = 98.9% and accuracy = 94.3%.
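The measures defined above can be computed directly from the five counts. The sketch below assumes that "all" is the sum A+B+C+D+E; the poster does not spell out which label pair each letter stands for, so the function takes the counts as given, and the numbers in the usage line are invented purely for illustration.

```python
def dme_metrics(A, B, C, D, E):
    """Evaluate the poster's five measures from counts A..E.

    Assumption: 'all' in the accuracy formula means A + B + C + D + E.
    """
    total = A + B + C + D + E
    return {
        "precision": A / (A + B),
        "recall": A / (A + B + D),
        "putative_novelty": C / (A + B + C),
        "specificity": E / (E + B + C),
        "accuracy": (A + E) / total,
    }


# Hypothetical counts, not the E. coli results reported on the poster.
example = dme_metrics(A=8, B=1, C=1, D=1, E=9)
```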

Examples of sequences from two EC numbers sharing the same 3rd-level EC hierarchy and specific peptides

Putative novelties in E. coli

Enzymatic Spectra

Metagenomic analyses

1. Sargasso Sea data (Venter et al., Science 304 (2004)): 1,001,986 protein records; average length 194 amino-acids, sd=109. We obtain EC assignments for 177,376 proteins.

Examples of proteins with two EC assignments based on L4>14

2. Human gut metagenomics (Gill et al., Science 312 (2006)): the two proteomes of subjects 7 and 8 consist of 20,523 and 25,980 proteins, respectively. We predict 2,616 enzymes for subject 7 and 2,949 enzymes for subject 8.

P07649 - 5.4.99.12
[1] an active site D at sequence location 60
[2] a binding site Y at location 118
[3] a binding site L at 245
[4] The active site is common to two SPs (CAGRT(D)AGVH)
[5] GQVVH at locations 67-71 (SP3)
[6] FHARF at 107-111, a tentative RNA-binding peptide
[7] ENDFTS at 157-163
[8] HMVRNI at 201-207 in active pocket

Occurrence of SPs on the 3D structure

A user-friendly tool that displays occurrences of SPs on any protein sequence that is presented as a query, together with the EC assignments due to these SPs, is available at http://adios.tau.ac.il/DME

Alignment is ordered according to SPs (in red). Spaces are inserted to highlight annotations of active and binding sites.

References:

Vered Kunik, Yasmine Meroz, Zach Solan, Ben Sandbank, Uri Weingart, Eytan Ruppin, David Horn: PLoS Comput Biol 3(8): e167, 2007

Yasmine Meroz and David Horn: Proteins: Structure, Function, and Bioinformatics 72 (2), 606-612, 2008

The last example (P10089) has the conserved Cys replaced by Tyr and is presumed to be nonfunctional.

EC assignments of the enzymes A8A6K3, B1IWZ8 and A1AHS0 were upgraded in later Swiss-Prot editions to level 3, in agreement with our predictions.

Predicted relative frequencies of enzymatic annotations are normalized. Leading categories are 6.1.1 (Aminoacyl-tRNA Synthetases), 3.6.3 (Hydrolases catalyzing transmembrane movement of substances involving ATPases) and 2.7.7 (Nucleotidyl transferases). Differences between the distributions lead to an L1-distance of 0.20 between subjects 7 and 8, and a distance of 0.42 between each of the two gut metagenomes and the Sargasso Sea metagenome.
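A minimal sketch of comparing two enzymatic spectra: each list of predicted EC labels is turned into normalized relative frequencies, and the L1-distance is taken as the sum of absolute differences over all categories. Whether the poster's distance includes an extra normalization factor (such as 1/2) is not stated, so the plain sum used here is an assumption; the function names and toy label lists are hypothetical.

```python
from collections import Counter


def normalized_spectrum(ec_labels):
    """Map each EC category to its relative frequency among the given labels."""
    counts = Counter(ec_labels)
    total = sum(counts.values())
    return {ec: n / total for ec, n in counts.items()}


def l1_distance(p, q):
    """Sum of absolute frequency differences over the union of categories."""
    categories = set(p) | set(q)
    return sum(abs(p.get(ec, 0.0) - q.get(ec, 0.0)) for ec in categories)


# Toy label lists, not the metagenomic data from the poster.
sample_a = normalized_spectrum(["6.1.1", "6.1.1", "3.6.3", "2.7.7"])
sample_b = normalized_spectrum(["6.1.1", "3.6.3", "3.6.3", "3.6.3"])
distance = l1_distance(sample_a, sample_b)
```

With this unscaled definition the distance ranges from 0 for identical spectra to 2 for spectra with no category in common.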

Conclusions

Occurrence of SPs with coverage length of L≥7 amino-acids is a good indicator for the enzymatic classification of queried proteins.

DME is a useful tool for metagenomic studies: it can be applied to fragmentary data without prior knowledge regarding species identity.

DME search indicates EC assignment and points to biologically important loci.

Sargasso Id      DME Prediction1   DME Prediction2
1087009309955    1.3.1.26          1.6.5.3
1087008920257    2.1.1.31          4.6.1.12
1084001244192    3.6.3.44          2.7.1.130
1087008920589    3.6.3.44          2.7.1.130
1087009217043    3.6.3.44          2.7.1.130
1087012113699    3.6.3.44          2.7.1.130
1087012119815    3.6.3.44          2.7.1.130
1084000934990    3.6.3.44          2.7.1.130
1084001248318    4.2.3.5           4.2.3.4

____________

LN refers to the coverage length (number of matched amino-acids) for SPs at level N of the EC hierarchy.

*Supported in part by fellowships granted by the Edmond J. Safra Bioinformatics program at TAU.