A survey of orphan enzyme activities

15
BioMed Central Page 1 of 15 (page number not for citation purposes) BMC Bioinformatics Open Access Research article A survey of orphan enzyme activities Yannick Pouliot 1,2 and Peter D Karp* 1 Address: 1 Bioinformatics Research Group, Artificial Intelligence Center, SRI International, 333 Ravenswood Ave, Menlo Park, California, 94025- 3493, USA and 2 Lane Medical Library and Knowledge Management Center, Information Resources and Technology, Stanford University Medical Center, 300 Pasteur Drive. Stanford, CA 94305-5123, USA Email: Yannick Pouliot - [email protected]; Peter D Karp* - [email protected] * Corresponding author Abstract Background: Using computational database searches, we have demonstrated previously that no gene sequences could be found for at least 36% of enzyme activities that have been assigned an Enzyme Commission number. Here we present a follow-up literature-based survey involving a statistically significant sample of such "orphan" activities. The survey was intended to determine whether sequences for these enzyme activities are truly unknown, or whether these sequences are absent from the public sequence databases but can be found in the literature. Results: We demonstrate that for ~80% of sampled orphans, the absence of sequence data is bona fide. Our analyses further substantiate the notion that many of these enzyme activities play biologically important roles. Conclusion: This survey points toward significant scientific cost of having such a large fraction of characterized enzyme activities disconnected from sequence data. It also suggests that a larger effort, beginning with a comprehensive survey of all putative orphan activities, would resolve nearly 300 artifactual orphans and reconnect a wealth of enzyme research with modern genomics. For these reasons, we propose that a systematic effort to identify the cognate genes of orphan enzymes be undertaken. Background After a decade of comprehensive genomic sequencing, more than 500 genomes have been sequenced to comple- tion, mostly prokaryotes. The prodigious rate of new sequence annotation is highlighted by the fact that there were just over 300 genomes available when this study was carried out in late 2004. However, the fraction of genes for which no function can be predicted remains high (30%– 50%). In response, proposals have been put forth for the bioinformatics analysis of bacterial genomes to identify genes with high likelihood of scoring true in confirmatory laboratory assays of their respective function [1,2]. This would increase the field's pool of experimentally charac- terized proteins, with concomitant increases in the accu- racy and coverage of genome annotation. We believe the return on investment of this approach would be particu- larly high when addressing the problem of orphan activi- ties, that is, enzymatic activities for which no sequence information is available [3,4]. Decades of detailed enzymology have created a wealth of knowledge about enzymes and their activities. However, crucial aspects of these enzymes are absent from bioinfor- matics databases with surprising frequency. For example, Published: 10 July 2007 BMC Bioinformatics 2007, 8:244 doi:10.1186/1471-2105-8-244 Received: 22 March 2007 Accepted: 10 July 2007 This article is available from: http://www.biomedcentral.com/1471-2105/8/244 © 2007 Pouliot and Karp; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Transcript of A survey of orphan enzyme activities

Page 1: A survey of orphan enzyme activities

BioMed CentralBMC Bioinformatics

ss

Open AcceResearch articleA survey of orphan enzyme activitiesYannick Pouliot1,2 and Peter D Karp*1

Address: 1Bioinformatics Research Group, Artificial Intelligence Center, SRI International, 333 Ravenswood Ave, Menlo Park, California, 94025-3493, USA and 2Lane Medical Library and Knowledge Management Center, Information Resources and Technology, Stanford University Medical Center, 300 Pasteur Drive. Stanford, CA 94305-5123, USA

Email: Yannick Pouliot - [email protected]; Peter D Karp* - [email protected]

* Corresponding author

AbstractBackground: Using computational database searches, we have demonstrated previously that nogene sequences could be found for at least 36% of enzyme activities that have been assigned anEnzyme Commission number. Here we present a follow-up literature-based survey involving astatistically significant sample of such "orphan" activities. The survey was intended to determinewhether sequences for these enzyme activities are truly unknown, or whether these sequences areabsent from the public sequence databases but can be found in the literature.

Results: We demonstrate that for ~80% of sampled orphans, the absence of sequence data is bonafide. Our analyses further substantiate the notion that many of these enzyme activities playbiologically important roles.

Conclusion: This survey points toward significant scientific cost of having such a large fraction ofcharacterized enzyme activities disconnected from sequence data. It also suggests that a largereffort, beginning with a comprehensive survey of all putative orphan activities, would resolve nearly300 artifactual orphans and reconnect a wealth of enzyme research with modern genomics. Forthese reasons, we propose that a systematic effort to identify the cognate genes of orphan enzymesbe undertaken.

BackgroundAfter a decade of comprehensive genomic sequencing,more than 500 genomes have been sequenced to comple-tion, mostly prokaryotes. The prodigious rate of newsequence annotation is highlighted by the fact that therewere just over 300 genomes available when this study wascarried out in late 2004. However, the fraction of genes forwhich no function can be predicted remains high (30%–50%). In response, proposals have been put forth for thebioinformatics analysis of bacterial genomes to identifygenes with high likelihood of scoring true in confirmatorylaboratory assays of their respective function [1,2]. This

would increase the field's pool of experimentally charac-terized proteins, with concomitant increases in the accu-racy and coverage of genome annotation. We believe thereturn on investment of this approach would be particu-larly high when addressing the problem of orphan activi-ties, that is, enzymatic activities for which no sequenceinformation is available [3,4].

Decades of detailed enzymology have created a wealth ofknowledge about enzymes and their activities. However,crucial aspects of these enzymes are absent from bioinfor-matics databases with surprising frequency. For example,

Published: 10 July 2007

BMC Bioinformatics 2007, 8:244 doi:10.1186/1471-2105-8-244

Received: 22 March 2007Accepted: 10 July 2007

This article is available from: http://www.biomedcentral.com/1471-2105/8/244

© 2007 Pouliot and Karp; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 1 of 15(page number not for citation purposes)

Page 2: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

recent computational analyses of sequence databasesdemonstrate that at least 36% of enzyme activities thathave been assigned an Enzyme Commission (EC) number[5] appear to be devoid of a gene or protein sequence [3].Since then, similar analyses have been published, withsimilar results [4,6,7]. The existence of such a large frac-tion of orphan activities is surprising, given that many ofthese enzymes have been described decades ago and areoften involved in basic cellular functions. Several exam-ples exist of the recent identification of genes involved inimportant enzymatic functions (reviewed in [1,2]).Indeed, in our study 44 orphans were found to be presentin one or more primary metabolic pathways in a variety ofspecies (described below). Details of many of the orphanenzymes uncovered during this survey point to multipleand significant consequences for the lack of sequenceinformation in areas such as genome annotation, compu-tational pathway prediction, and metabolic engineering.For these reasons, the orphan problem and related issueswere highlighted in a recent report of the American Soci-ety for Microbiology [8]. In view of the biological richnessassociated with orphan enzymatic activities (Figure 1,Table 1), we have taken the first steps in creating the foun-dations of an Enzyme Genomics Initiative [3].

Here we describe a literature-based survey of presumedorphans intended to further validate and characterizethese activities (Figure 2). The confidence of the results ofthis survey was designed to be within a 5% error marginrelative to the universe of orphan activities, based on arandomly selected subset of orphan activities from theNomenclature Committee of the IUBMB (NC-IUBMB).We have also assessed the practicability of identifying thegenes associated with these orphans. As a consequence,the survey captures data from the literature that shouldfacilitate the identification of cognate genes for theorphan activities evaluated. Here, we define the cognategene for an activity as a gene that has been shown to codefor an enzyme that carries out that activity.

The survey confirmed that ~80% of the sampled orphansdo not have sequence information associated with them.Consequently, this lack represents a true information def-icit. Weaknesses in database integration and a lack ofinformation capture from the literature to databasesappear to be largely responsible for most of the artifactualorphans making up the other 20%. Given the importanceof these enzymatic activities, we propose that the publicsequence databases assign high priority to correcting data-base entries for artifactual orphans. We further proposethat a systematic effort be undertaken to sequence thegenes of validated orphans, as this survey demonstratesthat primary literature data and database analyses com-bined with current proteomics and genomic technologies

should be adequate to enable the rapid identification ofmany of these genes.

ResultsMost orphan enzymatic activities are bona fide (Table 2).Our survey found that more than 80% of orphans are notdue to artifacts such as missing database annotations (pri-marily failure to capture information from the literature),or lack of database cross-referencing, such as the availabil-ity of a sequence in one database not being reflected in asecond database. Specifically, a total of 187 orphans outof 228 surveyed activities were validated in at least one of287 species (species are listed in Table 3 and Table 4, thelist of validated orphans is in Table 5). A majority oforphans (54.36%) occurred in Eukaryotes, followed byEubacteria (39.37%) (Table 6). Within the Eubacteria,genus Pseudomonas was significantly overrepresented(35%) (Table 7). While a systematic determination of thespecies spectrum of orphan activities was not performedhere, we did notice several cases of an orphan activityreported in more than one species, as well as one case ofan orphan activity occurring in species from differentdomains.

Because the eventual isolation of the cognate genes ofthese activities is greatly facilitated by comprehensivegenome sequencing, we determined for what fraction ofall validated orphans a full genome sequence is available(Table 8). 43% of Eubacterial species in which orphansoccurred were found to have such sequences, availableeither presently or due shortly. This figure rises to 83%when including the genomes of related species, on theassumption that they might be sufficiently closely relatedto permit the identification of the cognate gene. For exam-ple, at the time of this study the completed genomesequence of Pseudomonas fluorescens was not available,but those of three other Pseudomonas species were.

Oxidoreductases (EC1) and transferases (EC2) were themost frequently represented classes of enzymatic activityfor validated orphans (Figure 3). On a per capita basis,oxidoreductases and transferases were overrepresented by~20%, whereas hydrolases and ligases were underrepre-sented by 35% and 64%, respectively.

The original publication date for all orphans was broadlydistributed around a mean of 1977 (Figure 4), comparedto a mean of 1975 for validated orphans.

Causes of artifactsA comprehensive list of artifactual orphans and theinferred nature of the artifact is available [9]. Althoughthis study was not designed to determine conclusively thecauses of artifactuality, incompleteness in database entriesappears to be the predominant cause of the artifacts iden-

Page 2 of 15(page number not for citation purposes)

Page 3: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

tified here. For example, the DNA sequence associatedwith reaction 3.5.1.79 is available in the EMBL database,however, the UniProt entry for this enzyme does not listany protein sequence (Table 9). Other representative arti-factual orphans are listed in Table 9, along with a descrip-tion of the cause of the artifact. In a small fraction of casesa clear determination of the species in which the activitywas characterized could not be made.

Extent of salvageabilityValidated orphans were analyzed to determine whethersufficient information is available from their publishedcharacterization that, when combined with other factors,could enable the rapid identification of at least one cog-nate gene. Overall, we determined that 57 validatedorphans (25% of total) might be salvageable (Figure 5A;Table 10), distributed approximately equally across

Example of a metabolic pathway involving a validated orphanFigure 1Example of a metabolic pathway involving a validated orphan.

Page 3 of 15(page number not for citation purposes)

Page 4: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

Table 1: Biological significance of selected validated orphans. The extent and significance of published research associated with a selection of validated orphans is detailed

EC No. Activity Year firstpublished

No. PubMedPublications InvolvingOrphan

Significance

1.1.1.43 Phosphogluconate 2-dehydrogenase

1961 2417 Positive reports of evaluation as a drug target against Trypanosome; trypanocidal activity has been reported; involved in 2-dehydro-D-gluconate degradation pathway

2.3.1.23 1-acylglycerophosphocholine O-acyltransferase

1967 256 Activity is present in lower eukaryotes, plants, and multiple mammalian tissues

5.1.3.17 Heparosan-N-sulfate-glucuronate 5-epimerase

1979 16 Involved in the biosynthesis of heparan sulfate, which binds proteins to modulate signaling events in embryogenesis. Mouse gene knock-out results in late lethal phenotype.Correction added in proof: Thanks to a comment by Dr. K. Robison and research by Dr. A. Shearer, we have found that 5.1.3.17 is an artifactual orphan rather than a validated orphan. Genes for this enzyme have been identified in cow and mouse (J Biol Chem 272:28158 1997; J Biol Chem 276:20069 2001).

2.3.1.105 Alkylglycerophosphate 2-O-acetyltransferase

1986 9 Involved in platelet activating factor biosynthesis; possible involvement in ischemia

3.1.3.59 Alkylacetylglycerophosphatase 1986 9 Involved in platelet activation factor biosynthesis2.7.1.106 Glucose-1,6-bisphosphate

synthase1975 9 Present in several mammalian tissues. Involved in glucose

metabolism1.2.1.23 2-oxoaldehyde dehydrogenase

(NAD+)1967 9 Involved in the development of diabetic complications

1.14.11.6 Thymine dioxygenase 1972 9 Present in both lower and higher eukaryotes1.1.1.16 Galactitol 2-dehydrogenase 1956 5 Insulin dysregulation2.3.1.14 Glutamine N-

phenylacetyltransferase1957 4 Investigated as a predictor of carotid endarterectomy in

middle-aged individuals1.2.1.25 2-oxoisovalerate

dehydrogenase (acylating)1969 4 Present in prokaryotes and eukaryotes. In the latter,

participates in primary metabolism pathway for valine degradation

E.C. 2.3.1.23 is listed in italics because it was cloned and sequenced in 2006, after the completion of this study

eukaryotes and bacteria. Far more bacterial orphans werejudged to have "excellent" or "good" salvageability ascompared with eukaryotic orphans: 70% (7+12 out of 27)vs. 48% (5+11 out of 33), respectively (Figure 5B). Thisdiscrepancy is primarily due to factors such as the muchgreater difficulty for purifying an activity from highereukaryotes, the difficulty of obtaining enough startingprotein from lower eukaryotes such as multicellular fungi,and the absence of a comprehensive genome sequencefrom species such as Bos Taurus and Sus scrofa.

Overall, more than half of the salvageable orphans ranked"good" or "excellent", with oxidoreductases (EC1) andhydrolases (EC2) being overrepresented in that set. Allother enzymatic classes were significantly underrepre-sented (Figure 6).

DiscussionThis survey demonstrates that ~80% of orphan enzymaticactivities are bona fide; therefore, we conclude that of the1,356 putative orphans extant at the time of this study,more than 1,000 are highly likely to constitute true infor-

mation deficits since their lack of sequence information isnot the result of a database error.

The absence of DNA or protein sequences encoding suchwell-characterized enzymatic activities is particularly con-sequential because these activities were often identifieddecades ago, and many have been the focus of significantresearch activity (Table 1). Without the cognate sequencesfor these activities, the quality of annotation of allsequenced genomes in terms of both coverage (fraction ofgenes that can be recognized) and accuracy (fraction ofpredicted gene functions that are correct) is diminished.Many of these activities may go for years without beingsequenced – for example, 1-acylglycerophosphocholineO-acyltransferase (Table 1) was finally purified andsequenced nearly forty years after it was first characterized[10]. Perhaps more troubling is the unknown pool of"false positive" annotations. Phosphogluconate 2-dehy-drogenase (Table 1), an orphan at the time of this analy-sis, has since been assigned to a sequence in the humangenome with no experimental evidence linking it to thator any homologous sequence, but apparently instead on

Page 4 of 15(page number not for citation purposes)

Page 5: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

the basis of the gene in question already being assigned asimilar activity. This kind of "hidden orphan" would havebeen missed by most orphan analyses, and can beexpected to propagate a potentially incorrect assignmentto other genomes in the future. Computational metabolicpathway prediction [11] and metabolic engineering alsodepend on sequence information and are thus similarlycompromised.

Conversely, ~20% of orphans surveyed were observed tobe artifacts, such that ~270 orphans out of 1,356 putativeorphans examined should be resolvable entirely via liter-ature research and database cleanup. As a result of thisprocess as it was carried out on our sampling of orphans,we have reported 11 artifactual orphan activities to public

sequence repositories for correction (see Table 8 for exam-ples).

In addition to validating orphans, the survey was useful incapturing information from the literature to assess theirsalvageability: more than half of validated orphans werefound to be salvageable (Figure 5). Examples of salvagea-ble orphan activities with the traits that make them sal-vageable are listed in Table 11.

As abundantly noted elsewhere, such database cleansingis essential to maximize the existing research investmentand prevent the propagation of mistakes [12-14] (seeTable 12 for examples of artifacts that have beenresolved). This necessity has not eluded the field of enzy-

Literature survey processFigure 2Literature survey process.

Page 5 of 15(page number not for citation purposes)

Page 6: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

Page 6 of 15(page number not for citation purposes)

Table 2: Summary of survey results

Number Proportion

Total number of putative orphans 1,356Number required to achieve 95% significance 180 13.3%Number orphans evaluated 228 16.8%

Out of 228 orphans:Number of artifactual orphans 41 18.0%Number of valid orphans 187 82.0%Max. number of salvageable orphans (all rankings) 57 25.0%

Out of 57 salvageable orphans: Number Proportion

Excellent 9 15.8%Good 23 40.4%Marginal 9 15.8%Poor 16 28.1%Bacterial salvageable orphans 26 45.6%Eukaryotic salvageable orphans 31 54.4%

The survey was designed to achieve a maximum sampling error of 5%, 19 times out of 20. This corresponds to a minimum sample size of ~80 orphans. A total of 228 orphans were in fact surveyed. In a number of cases more than one instance of an orphan activity was evaluated because the activity was reported in more than one species. Consequently, 286 instances were evaluated.

Table 3: Species distribution of Eubacterial validated orphans

Species No. of Orphans Species No. of Orphans

Acinetobacter NCIB 9871 1 Pasteurella tuberculosis 2Actinoplanes missouriensis 1 Pedobacter heparinus 1Aerom onas sp. 1 Propionibacterium pentosaceum 1Alcaligenes eutrophus 1 Proteus mirabilis 1Alcaligenes faecalis 1 Pseudomonas (species undefined) 2Arthrobacter GJM -1 1 Pseudomonas fluorescens 3Arthrobacter oxydans 1 Pseudomonas graveolens 1Arthrobacter sp. 2 Pseudomonas MS 1Azotobacter vinelandii 1 Pseudomonas MSU-1 1Bacillus subtilis 1 Pseudomonas P-2 2Cellulom onas sp. 1 Pseudomonas putida 7Clostridium cylindrosporum 1 Pseudomonas putida P2 1Clostridium kluyveri 2 Pseudomonas saccharophilia 2Clostridium pasteurianum 1 Pseudomonas sp. 4Clostridium SB4 1 Pseudomonas sp. P-501 1Clostridium sporogenes 2 Pseudomonas syringae GG 1Corynebacterium cyclohexanicum 1 Pseudomonas testosteroni 1Escherichia coli 8 Rhodococcus 1Flavobacterium 1 Rhodopseudomonas sphaeroides 1Flavobacterium sp. 1 Salmonella typhimurium 1Klebsiella aerogenes 1 Streptococcus faecalis 1Micrococcus denitrificans 1 Streptococcus mutans 1Microorganism 2 Streptomyces virginiae 1Mycobacterium tuberculosis 1 Thiobacillus thioparus 1Nocardia (species undefined) 1 Unknown 2

The total number of orphans is greater than the number of activities because a given activity may be present in more than one species. The exact species of some orphans can be unclear or unstated, in which case these are classified under a generic term ("species undefined", "unknown", etc). The total number of orphans is greater than the number of activities because a given activity may be present in more than one species. The exact species of some orphans can be unclear or unstated, in which case these are classified under a generic term ("species undefined", "unknown", etc).

Page 7: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

mology [3,4,15,16], and the present survey demonstratesthe usefulness of correlating biological databases andmining the literature to enhance the value of existingresearch and facilitate the identification of the remainingorphan-associated genes. Until recently, there were nogeneral repositories of orphan activity data, althoughsome species-specific databases and pages were main-tained, such as EchoBase [17] and a web page listing uni-dentified E. coli enzymes maintained by the EcoCycproject [18]. Consequently, we updated the MetaCyc [19]database to identify reactions that have been analyzed bythis survey, and annotated them and associated databaseobjects with results such as the validity of their orphan sta-tus, links to their cognate protein in the case of artifacts,and the properties of the protein copurifying with theactivity in the case of validated orphans. Recently,Lespinet and Labedan created ORENZA [20], a database

dedicated to maintaining an up-to-date listing of allenzyme activities for which no sequences are available inmajor sequence databases [6]. We are contributing ourupdated orphan information to ORENZA as well. Thesedata, captured in MetaCyc and ORENZA, should facilitatethe work of enzymologists interested in identifying thecognate genes of orphan activities. For instance, the workof Melnick et al. [21] is an excellent example of the com-bined application of modern laboratory and bioinformat-ics techniques that would benefit from the data describedhere.

Several proposals have been made recently aimed at pro-ducing a complete catalog of biochemical activities, bio-logical functions, and their cognate genes [2,3]. Many ofthese proposals recommend that such a project begin withprokaryotes because of the general ease of gene cloning

Table 4: Species distribution of Eukaryotic validated orphans

Species No. of Orphans Organism No. of Orphans

Acrocylindrium sp. 1 Nectria haem atococca/Fusarium solani f.sp. Phaseoli 1Arachis hypogaea 2 Neurospora (subspecies undefined) 1ASparagus officinalis 1 Neurospora crassa 1Aspergillus niger 2 Ochromonas malhamensis 1Avena coleoptiles 1 Ovis aries 2Bauhenia monandra 1 Pea sativum var. Alaska 1Bostaurus 3 Penicillium atrovenetum 1Capra hircus 1 Penicillium chrysogenum 1Catharanthus roseus 1 Penicillium patulum 1Cavia porcellus 3 Phaseolus aureus 3Chlorella 1 Phaseolus radiatus 1Chrysosplenium americanum 1 Pisum sativum (variety unspecified) 1Cichorium endivia 1 Pycnoporus coccineus 1Citrus (subspecies undefined) 1 Raphanus sativus 1Corydalis cava 1 Rat (subspecies undefined) 18Cucurbita maxima 1 Rat Sprague-Dawley 5Daucus carota 1 Rhodotorula glutinis 2Entamoeba histolytica 1 Saccharomyces cerevisiae 5Euglena gracilis 1 Saccharum officinarum 1Flaveria spp. 1 Secale cereale 1Fundulus heteroclitus 1 Sesamum indicum 1Gallus gallus 1 Several 1Homo sapiens 5 Sorghum bicolor 2Hordeum (species undefined) 1 Spinacia 1Hordeum vulgare subsp. Vulgare 2 Spinacia oleracea 1Lasallia pustulata 1 Sus scrofa 7Lilium longiflorum 1 Tecoma stans 1Lupinus albus 1 Thea sinensis 1Lycopersicon esculentum 2 Trypanosoma brucei brucei 1Macaca mulatta 1 Tulipa cv. Apeldoorn 1Mentha piperita 1 Unknown 1Mesocricetus auratus 1 Yeast (species undefined) 4Mold 1 Zea mays 2Mouse (species undefined) 3

The total number of orphans is greater than the number of activities because a given activity may be present in more than one species. The exact species of some orphans can be unclear or unstated, in which case these are classified under a generic term ("mold", "mouse", "unknown", etc). The total number of orphans is greater than the number of activities because a given activity may be present in more than one species. The exact species of some orphans can be unclear or unstated, in which case these are classified under a generic term ("mold", "mouse", "unknown", etc).

Page 7 of 15(page number not for citation purposes)

Page 8: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

Table 5: Validated orphan activities

EC No. Ranking EC No. Ranking EC No. Ranking

1.1.1.13 difficult 1.14.99.24 difficult 3.1.3.47 good1.1.1.16 difficult 1.21.3.2 difficult 3.1.3.59 difficult1.1.1.43 difficult 1.97.1.3 difficult 3.1.3.72 difficult1.1.1.54 difficult 2.1.1.112 difficult 3.1.4.43 difficult1.1.1.84 excellent 2.1.1.137 artifact 3.1.6.17 good1.1.1.92 marginal 2.1.1.141 artifact 3.1.8.2 artifact1.1.1.101 difficult 2.1.1.143 artifact 3.2.1.56 difficult1.1.1.144 difficult 2.1.1.147 difficult 3.2.1.77 difficult1.1.1.146 artifact 2.1.1.84 difficult 3.2.1.100 good1.1.1.163 artifact 2.1.1.99 difficult 3.2.1.112 excellent1.1.1.172 good 2.1.2.4 difficult 3.2.1.115 difficult1.1.1.196 good 2.3.1.14 difficult 3.2.1.128 excellent1.1.1.208 poor 2.3.1.23 difficult 3.2.1.136 difficult1.1.1.226 excellent 2.3.1.24 artifact 3.2.1.137 poor1.1.1.245 artifact 2.3.1.33 difficult 3.2.2.10 difficult1.1.1.258 artifact 2.3.1.49 difficult 3.4.11.16 good1.1.1.265 excellent 2.3.1.68 marginal 3.4.13.7 difficult1.1.2.5 artifact 2.3.1.96 difficult 3.4.17.16 good1.1.3.23 difficult 2.3.1.98 good 3.4.21.103 artifact1.17.1.1 difficult 2.3.1.102 artifact 3.4.22.44 artifact1.17.99.2 artifact 2.3.1.103 poor 3.4.22.46 artifact1.2.1.18 artifact 2.3.1.105 poor 3.4.23.28 difficult1.2.1.20 difficult 2.3.1.114 poor 3.4.23.30 artifact1.2.1.23 poor 2.3.1.133 marginal 3.4.24.54 artifact1.2.1.25 difficult 2.3.1.140 difficult 3.5.1.30 good1.2.1.32 artifact 2.3.1.161 artifact 3.5.1.33 artifact1.2.1.33 difficult 2.3.2.3 artifact 3.5.1.39 poor1.2.1.52 difficult 2.3.2.7 difficult 3.5.1.58 excellent1.2.1.54 good 2.3.3.3 difficult 3.5.1.62 artifact1.2.1.63 marginal 2.4.1.23 poor 3.5.1.67 difficult1.2.1.64 artifact 2.4.1.29 difficult 3.5.1.71 poor1.2.3.6 difficult 2.4.1.41 valid 3.5.1.79 artifact1.2.3.7 poor 2.4.1.43 difficult 3.5.2.13 poor1.2.3.8 artifact 2.4.1.57 artifact 3.5.2.16 artifact1.3.1.4 difficult 2.4.1.66 artifact 3.5.3.2 good1.3.1.5 difficult 2.4.1.73 artifact 3.5.5.2 difficult1.3.1.6 artifact 2.4.1.97 difficult 3.6.1.18 excellent1.3.1.11 difficult 2.4.1.110 poor 3.6.1.2 artifact1.3.1.37 difficult 2.4.1.125 difficult 3.6.1.52 artifact1.3.7.1 difficult 2.4.1.126 valid 3.6.3.17 artifact1.3.99.15 artifact 2.4.1.153 poor 3.6.3.24 artifact1.3.99.21 artifact 2.4.1.167 difficult 3.6.3.28 artifact1.4.1.11 good 2.4.1.176 difficult 3.6.4.4 artifact1.4.1.17 good 2.4.1.180 excellent 4.1.1.24 difficult1.4.99.4 marginal 2.4.1.184 difficult 4.1.1.52 difficult1.4.99.5 artifact 2.4.1.215 artifact 4.1.1.56 difficult1.5.1.21 good 2.4.2.35 difficult 4.1.1.75 difficult1.5.99.11 artifact 2.5.1.4 difficult 4.1.2.23 difficult1.6.5.7 artifact 2.5.1.42 difficult 4.1.2.28 difficult1.7.3.1 difficult 2.6.1.22 difficult 4.1.2.35 difficult1.7.3.5 valid 2.6.1.27 poor 4.1.3.35 difficult1.8.1.5 artifact 2.6.1.32 difficult 4.2.1.5 difficult1.10.1.1 difficult 2.6.1.33 poor 4.2.1.43 good1.10.3.4 difficult 2.6.1.75 good 4.2.1.62 good1.12.98.2 artifact 2.7.1.43 difficult 4.2.1.77 difficult1.13.11.14 difficult 2.7.1.54 difficult 4.2.1.81 difficult1.13.11.24 artifact 2.7.1.64 difficult 4.2.1.93 difficult1.13.11.25 artifact 2.7.1.77 difficult 4.2.1.97 marginal1.13.11.35 difficult 2.7.1.106 difficult 4.2.1.101 artifact

Page 8 of 15(page number not for citation purposes)

Page 9: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

from these species [1,2]. Indeed, our data support thisnotion, as we find substantially more orphans with a sal-vageability ranking of "good" and "excellent" in prokary-otes as compared to eukaryotes. The availability of acomprehensive review of the problem achieved by thissurvey, combined with broad genomic sequencing andpowerful computational tools, leads us to conclude thatthe field is in an excellent position to rectify the informa-tion gap associated with the orphan activity phenome-non.

ConclusionMore than one third of enzyme activities with assigned ECnumbers are orphan activities, having no associated geneor protein sequence. We carried out a literature-based sur-vey of a representative sample of presumed orphansintended to further validate and characterize these orphanactivities. We have also assessed the practicability of iden-tifying the genes associated with these orphans. In doingso, we captured data from the literature that should assistin future identification of cognate genes for the orphanactivities we examined.

This survey confirmed that about 80% of sampled orphanactivities have no sequence information associated withthem, either in databases or in the literature. Weaknessesin database integration and failure to capture informationfrom the literature account for most of the remaining20%.

This survey points toward the significant scientific cost ofhaving such a large fraction of characterized enzyme activ-ities disconnected from sequence data. It also suggeststhat a larger effort, beginning with a comprehensive sur-vey of all putative orphan activities, would resolve nearly300 artifactual orphans and reconnect a wealth of enzymeresearch with modern genomics. For these reasons, wepropose that a systematic effort to identify the cognategenes of orphan enzymes be undertaken.

MethodsLiterature survey processThis survey was performed from June through August2004 and relied on enzyme activities described by the NC-IUBMB. This enzyme classification and nomenclature sys-tem is hierarchical in nature and is based upon the reac-tion catalyzed. It assigns specific numerical identifiers, an

Table 6: Domain distribution of validated orphans

Domain No. Species Proportion

Eukaryota 156 54.36%Eubacteria 113 39.37%Unknown 15 5.23%Viruses 2 0.70%Archaea 1 0.35%

Orphans with "Unknown" listed for their domain tend to be microbes that were insufficiently characterized to place them in either the Eubacteria or Archaea domains.

1.13.12.9 good 2.7.1.131 poor 4.2.2.14 artifact1.14.11.10 difficult 2.7.1.134 difficult 4.2.3.19 artifact1.14.11.6 poor 2.7.1.142 difficult 4.2.99.19 artifact1.14.13.10 marginal 2.7.4.20 difficult 4.3.1.10 difficult1.14.13.23 good 2.7.7.44 difficult 4.3.1.20 difficult1.14.13.24 artifact 2.7.7.51 difficult 4.5.1.4 difficult1.14.13.42 difficult 2.7.8.10 difficult 5.1.1.6 difficult1.14.13.51 difficult 2.7.8.22 difficult 5.1.1.9 marginal1.14.13.58 excellent 2.8.1.3 excellent 5.1.3.17 difficult1.14.13.60 difficult 2.8.2.28 difficult 5.2.1.10 difficult1.14.13.72 difficult 3.1.1.36 difficult 5.2.1.11 difficult1.14.13.73 artifact 3.1.1.39 difficult 5.4.3.5 artifact1.14.15.2 difficult 3.1.1.40 poor 5.5.1.11 good1.14.16.5 good 3.1.1.78 artifact 5.5.1.12 artifact1.14.99.18 artifact 3.1.2.11 difficult 5.5.1.3 difficult1.14.99.22 artifact 3.1.3.14 difficult 6.3.1.6 difficult

3.1.3.38 poor 6.3.4.8 difficult

All 228 orphans reviewed in this study are listed. The salvageability of an orphan is ranked "difficult" when factors such as unclear species of origin, lack of molecular descriptors, or lack of comprehensive genome sequence hinder cloning of the cognate gene. Note that such rankings do not take into account the availability of molecular descriptors which enable the identification of a candidate gene in one species, and, through orthology, the identification of a candidate gene in a second species for which these descriptors are not available.

Table 5: Validated orphan activities (Continued)

Page 9 of 15(page number not for citation purposes)

Page 10: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

EC number, to each distinct enzymatic activity. The firstdigit represents the class of reaction catalyzed (e.g., oxi-doreductases are EC1; transferases are EC2). The seconddigit of the EC number refers to the subclass, which gen-erally contains information about the type of compoundor group involved (e.g., an enzyme acting on the CH-OHgroup of donors, or acting on the aldehyde or oxo groupof donors). The third digit defines the sub-subclass, whichspecifies the nature of the reaction. The fourth digit is aserial number that is used to identify the individualenzyme within a sub-subclass (see [22] for a descriptionof the classification system).

It is important to bear in mind that distinct proteins cata-lyzing the same reaction are assigned the same ECnumber. Since the EC system is based upon the reactioncatalyzed, when applied to a protein it describes a bio-chemical function of this protein. That function can alsobe shared by several proteins (isozymes) that can becoded by genes in the same or different species.

Presumed orphan EC numbers were identified using theBioWarehouse database system [23]. BioWarehouse [24]is an integrated database that enables cross-database que-ries using the structured query language (SQL). SRI's Bio-Warehouse instance was queried for enzymatic activitieswith no matching sequences in any major proteinsequence databases, including TrEMBL, PIR, SWISS-PROT, CMR, ENZYME, and BioCyc (the selection of thesedatabases is described in [3]). This query returned an ini-tial list of 1,356 EC numbers that had not been retired ormerged at the time of the survey.

This list was randomized and the primary literature asso-ciated with a sample of these putative orphans was proc-essed successively according to that random order. The

size of the sample necessary to ensure representationalaccuracy as compared to the total pool of EC numbers wascalculated using Equation 1. Approximately 180 orphansare required to achieve better than 95% confidence, giventhe total number of EC numbers. Since a sample of 228orphans was ultimately surveyed, the 95% level of signif-icance was exceeded.

Equation 1: sample size estimationSE is the standard error associated with the survey, andis derived by dividing the sampling error by 1.96, suchthat for a sampling error of 5% (95% confidence inter-val), the standard error is 0.0255102. p is the probabil-ity that the EC number is a true positive, that is, thereis truly no sequence information for that EC number;this value is 0.85 based on data from a preliminarysurvey. N is the universe of orphan activities. Solvingfor n provides the sample size.

SE2 = [(p(1-p)/n)] [(N-n)/N]

A comprehensive manual analysis of the literature associ-ated with this sample of 228 orphans drawn from the ran-domized list was performed as outlined in Figure 2.Various databases (Table 13) were consulted to extract thedata elements listed in Table 12. For each selected putativeorphan in the sample, the text search engine ExPASy Pro-teomics Server [25] was used to search TrEMBL, ENZYME,and IUBMB database records to confirm the absence ofsequence data. For each orphan, all protein names, authornames, reaction names, substrate names, and productnames listed in the IUBMB record for that orphan wereused as query arguments.

The primary literature associated with each orphan's entryin the IUBMB database [5,26] describing the isolation and

Table 7: Top four most represented Eubacteria

Genus No. instances of orphans Fraction of all Eubacteria

Pseudomonas 27 35.06%Escherichia 8 10.39%Clostridium 7 9.09%Arthrobacter 4 5.19%

Table 8: Availability of completely sequenced genomes for Eubacterial validated orphans

Complete genome sequence Ongoing genome sequencingCount Proportion Count Proportion

Same species 23 31.9% 9 18.4%Same genus, related species 12 16.7% 16 32.6%

The number of available comprehensive genome sequences for validated Eubacterial orphans was tallied. Cases where the genome sequence of a species does not exist but where the sequence of a related species from the same genus is available are also listed, as are ongoing comprehensive genomic sequencing projects for genomes not currently available.

Page 10 of 15(page number not for citation purposes)

Page 11: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

characterization of the activity was reviewed for the pres-ence of sequence information. In particular, we were alertfor the presence of molecular descriptors that might beuseful in cloning the associated genes in the papers(described below), particularly Mr, pI, and details of thepurification scheme (Table 14). Systematic searches ofPubMed were also performed to ascertain whether publi-cations other than those cited by IUBMB might contain

relevant sequence and molecular descriptor data. A totalof 331 publications (1.45 papers per orphan) were exam-ined for additional molecular descriptors that might beuseful in cloning, as described above. Data obtained fromthese publications were assembled into a database.

All artifactual orphans (orphans for which sequence infor-mation was found during the literature review process)were reported promptly to the Swiss Institute of Bioinfor-matics, the European Bioinformatics Institute, theORENZA database, and the Nomenclature Committee ofthe IUBMB for the relevant database entries to be updated.

Data sources and database analysesThe initial searches for presumed orphan activities wereperformed using BioWarehouse version 3.0 (SRI Interna-tional) running under Oracle 10G (Oracle Corporation,Redwood Shores, California). BioWarehouse is a bioin-formatics data warehousing environment developedunder the Bio-SPICE program [23].

Identification and ranking of salvageable orphansSalvageable orphans are orphan activities for which it islikely that at least one cognate gene can be identified andconfirmed in a practical manner. The extent of this sal-

Publication year of original publications describing orphan activitiesFigure 4Publication year of original publications describing orphan activities. The publication date associated with the original source articles of all instances of orphans surveyed here is plotted (286 instances of orphans, corresponding to 228 activities), based upon the IUBMB record. In a number of cases more than one instance of an orphan activity was evaluated because the activity was reported in more than one species.

Distribution of enzymatic activities in validated orphansFigure 3Distribution of enzymatic activities in validated orphans. The percentage of validated orphan activities belonging to each EC class is shown.

Page 11 of 15(page number not for citation purposes)

Page 12: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

vageability was determined by ranking validated orphansaccording to the likelihood and practicality that at leastone cognate gene can be identified, and that the geneproduct can be isolated and demonstrated to catalyze theenzymatic activity in a practical manner.

Orphans were ranked based on data in the original litera-ture, combined with the availability of the completegenome sequences for the species in which an orphan wasfirst elucidated. The principal ranking factors are (1) clearidentification of the species involved and its ease of

Table 9: Example artifactual orphans

EC No. Enzyme Name Original Species Year Swiss-Prot/TrEMBLAcc. No.

Cause of artifact Significance of error; importance of orphan activity

3.4.21.103 Physarolisin(a proteinase)

Physarum flavicomum 1982 Q8MZS4 IUBMB entry lists a 2003 paper describing a gene coding for a protein with this activity [28]. Sequence is in Swiss-Prot but ENZYME does not reference this sequence.

Lack of database cross-referencing presumably involving the long interval between the initial characterization of the activity and the cloning of the gene.

3.5.2.16 Maleimide hydrolase Blastobacter sp. A17p-4 1997 Q93T25 ENZYME and IUBMB entries are not referencing a Swiss-Prot entry from a 2002 paper describing the cloning of gene coding for this [29].

Lack of database cross-referencing is not restricted to older orphans.

3.5.1.79 Phthalyl amidase Xanthobacter agilis 1995 N/A The sequence, listed in a patent associated with a 1996 paper by [30] in Journal of Molecular Catalysis B: Enzymatic are available from Entrez, but not from TrEMBL. The paper itself is not available from PubMed.

Note: though the protein sequence is not available from the UniProt database, the DNA sequence is present in the EMBL database.

3.1.8.2 Diisopropyl-fluoro-phosphatase

Alteromonas sp. 1954 Q44238 ENZYME and IUBMB entries are not referencing a Swiss-Prot entry associated with a 1996 paper describing the cloning of a gene coding for an enzyme with this activity [31].

This enzymatic activity detoxifies nerve gas. The gene is part of a widespread gene family with otherwise unknown function, with members in Homo sapiens.

Table 10: Example artifactual orphans that are salvageable

EC No. Ranking EC No. Ranking

1.1.1.226 excellent 1.4.1.11 Good1.1.1.265 excellent 1.4.1.17 Good1.14.13.58 excellent 1.5.1.21 Good2.4.1.180 excellent 2.3.1.98 Good2.8.1.3 excellent 2.6.1.58 Good3.2.1.112 excellent 2.6.1.75 Good3.2.1.128 excellent 3.1.3.47 Good3.5.1.58 excellent 3.1.6.17 Good3.6.1.18 excellent 3.2.1.100 Good1.1.1.172 good 3.4.17.16 Good1.1.1.196 good 3.5.1.30 Good1.13.12.9 good 3.5.3.2 Good1.14.13.23 good 4.2.1.43 Good1.14.16.5 good 4.2.1.62 Good1.2.1.54 good 5.5.1.11 Good

Validated orphans with a salvageability ranking of "good" or better are listed.

Page 12 of 15(page number not for citation purposes)

Page 13: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

growth; (2) the availability of molecular descriptors, mostimportantly the molecular mass (Mr), but also the isoelec-tric point (pI); (3) the types of purification and analyticaltechniques used in the original literature; and (4) evi-dence that the protein can be purified with reasonableeffort using current techniques, based on factors such asspecific activity, purification yield, number of stepsinvolved, and availability of substrate and of alternatepurification procedures. "Excellent" and "Good" ratingsindicate an activity associated with a sequenced organism,and whose purifications and assays are likely to bestraightforward to replicate. "Difficult" activities are thosewith tricky purifications or complex assays, but asequenced target organism or sequenced related organ-ism. "Marginal" activities are those for which sequencingis in progress in the target organism, or a related organism."Poor" activities are those for which no genome sequenceis available, or sequencing is in progress in a related

Table 11: Selected salvageable orphans

EC No. Ranking Pathways Activity Species Full Genome Sequence?

Ongoing Genomic Sequencing?

Mr (kDa) pI(pH units)

3.5.1.30 Good None 5-amino-penta-namidase

Pseudomonas putida P2, Pseudomonas fluorescens

Yes(P. putida)*

Several Pseudomonas species

67 N/A

5.5.1.11 Good None Dichloro-muconate cyclo- isomerase

Alcaligenes eutrophus JMP 134 (Ralstonia eutropha JMP134)

N/A Yes 40 ± 10 N/A

4.2.1.97 Marginal None Phaseollidin hydratase

Fusarium solani f.sp. Phaseoli

No Different species (Fusarium sporotrichioides)

monomer 1: 47 monomer 2: 49

2.3.1.103 Poor None Sinapoylglucose–sinapoylglucose O-sinapoyltransferase

Raphanus sativus N/A N/A 55 N/A

A selection of orphans with different salvageability rankings are listed. Pathway names are those used in the MetaCyc database. *: The genomes of several strains of P. fluorescens are in the final stages of assembly and are essentially fully sequenced. N/A: not available

Salvageability ranking of validated orphansFigure 5Salvageability ranking of validated orphans. The suita-bility of validated orphans for eventual cloning of at least one cognate gene was evaluated according to the ranking system described in the text. Out of 228 orphans, 57 were judged to be salvageable. A: Overall salvageability ranking (percentage out of 57); B: Domain distribution of salvageable orphans (number of orphans). Note that the total is greater than 57 because some orphans have different evaluations in the dif-ferent species in which they have been reported. One orphan is also shared between Eubacteria and Eukaryotes.

Distribution of enzymatic activities for salvageable orphans ranked "good" and "excellent"Figure 6Distribution of enzymatic activities for salvageable orphans ranked "good" and "excellent".

Page 13 of 15(page number not for citation purposes)

Page 14: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

organism, and assay or purification conditions are likelyto be hard to replicate.

Data availabilityInformation about validated orphan activities has beenentered into the MetaCyc database [27]. Other data gener-ated by our survey can be found at [9].

AbbreviationsEnzyme Commission (EC), Nomenclature Committee ofthe International Union of Biochemistry and MolecularBiology (NC-IUBMB), Structured Query Language (SQL)

Authors' contributionsPK conceived the study. PK and YP jointly devised themethodology. YP performed the literature research anddrafted the manuscript. PK revised the manuscript.

AcknowledgementsThis work was funded by grant MCB-0438571 from the U.S. National Sci-ence Foundation. The BioWarehouse is funded by contract F30602-01-C-0153 from the Defense Advanced Research Projects Agency. This material is based upon work supported by DARPA and the Air Force Research Lab-oratory under Contract No. F30602-01-C-0153. We gratefully acknowl-edge Dr. Tadhg P. Begley, Department of Chemistry and Chemical Biology, Cornell University, for help in analyzing biochemical purification protocols; Dr. Ron Caspi, Bioinformatics Research Group, SRI International, for sup-

Table 14:

Name of enzyme activityIs lack of sequence confirmed?Bibliographical data (publication dates, authors, institutions)Name of species

Can the species associated with the original publications be unambiguously identified?Is a comprehensive genome sequence available for those species?

Are comprehensive genome sequences available from closely-related species?Is there ongoing genomic sequencing for those species or from closely-related species?

Are molecular data such as Mr and pI available?Does the purification and characterization procedure suggest that purifying this enzyme should be reasonably straightforward?

Table 12: Example of artifactual orphans resolved by this survey

EC No. Enzyme Name Species TrEMBL/Swiss-Prot Accession No.

1.1.1.163 Cyclopentanol dehydrogenase Comamonas sp. Q8GAV91.13.11.24 Quercetin 2,3-dioxygenase Bacillus subtilis P421063.6.3.24 Nickel-transporting ATPase Escherichia coli P335932.1.1.143 24-methylenesterol C-methyltransferase Arabidopsis thaliana Q94JS42.1.1.143 24-methylenesterol C-methyltransferase Arabidopsis thaliana Q39227

All Swiss-Prot entries listed here have been updated with the corresponding EC number.

Table 13: Main data sources used by the orphan survey

Database name Content Source Accessed via...

TrEMBL [32] Comprehensive protein and DNA sequence data

Swiss Institute of Bioinformatics Web

Comprehensive Microbial Repository (CMR [33])

Extensive genomic data for microbial species

The Institute for Genomic Research BioWarehouse

BioCyc databases Collection of pathway/genome databases primarily concerned with microbial species

Bioinformatics Research Group, SRI International

BioWarehouse

IUBMB Enzyme Nomenclature [34] Description of enzymes that have been assigned an EC number by the Enzyme Commission

Nomenclature Committee of the International Union of Biochemistry and Molecular Biology

Web and BioWarehouse

ENZYME [35] Repository of information relative to the nomenclature of enzymes

Swiss Institute of Bioinformatics Web and BioWarehouse

NCBI Taxonomy [36] Taxonomy database National Center for Biotechnology Information

Web and BioWarehouse

PubMed Literature database National Library of Medicine Web

Page 14 of 15(page number not for citation purposes)

Page 15: A survey of orphan enzyme activities

BMC Bioinformatics 2007, 8:244 http://www.biomedcentral.com/1471-2105/8/244

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

port with the MetaCyc database and analysis of purification protocols; and Dr. Alexander Shearer, Bioinformatics Research Group, SRI International, for assistance with manuscript revision and resubmission.

References1. Roberts RJ: Identifying protein function–a call for community

action. PLoS Biol 2004, 2(3):E42.2. Galperin MY, Koonin EV: 'Conserved hypothetical' proteins:

prioritization of targets for experimental study. Nucl Acids Res2004, 32(18):5452-5463.

3. Karp P: Call for an enzyme genomics initiative. Genome Biology2004, 5(8):401.

4. Lespinet O, Labedan B: Orphan enzymes? Science 2005,307(5706):42a.

5. Barrett A: Nomenclature Committee of the InternationalUnion of Biochemistry and Molecular Biology (NC-IUBMB).Enzyme Nomenclature. Recommendations 1992. Supple-ment 4: corrections and additions (1997). Eur J Biochem 1997,250(1):1-6.

6. Lespinet O, Labedan B: ORENZA: a web resource for studyingORphan ENZyme activities. BMC Bioinformatics 2006, 7:436.

7. Lespinet O, Labedan B: Orphan enzymes could be an unex-plored reservoir of new drug targets. Drug Discov Today 2006,11(7–8):300-305.

8. Roberts RJ, Karp PD, Kasif S, Linn S, MR B: An ExperimentalApproach to Genome Annotation. American Society for Micro-biology; 2005.

9. Enzyme Genomics Site [http://bioinformatics.ai.sri.com/enzyme-genomics]

10. Chen X, Hyatt BA, Mucenski ML, Mason RJ, Shannon JM: Identifica-tion and characterization of a lysophosphatidylcholine acyl-transferase in alveolar type II cells. Proc Natl Acad Sci USA 2006,103(31):11724-11729.

11. Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD:Computational prediction of human metabolic pathwaysfrom the complete human genome. Genome Biol 2005, 6(1):R2.

12. Bork P, Koonin E: Predicting functions from proteinsequences–where are the bottlenecks? Nature Genetics 1998,18(4):313-318.

13. Karp PD: What we do not know about sequence analysis andsequence databases. Bioinformatics 1998, 14(9):753-754.

14. Pennisi E: Keeping Genome Databases Clean and Up to Date.Science 1999, 286(5439):447-450.

15. Naumoff D, Xu Y, Glansdorff N, Labedan B: Retrieving sequencesof enzymes experimentally characterized but erroneouslyannotated : the case of the putrescine carbamoyltransferase.BMC Genomics 2004, 5(1):52.

16. Lespinet O, Labedan B: Puzzling over orphan enzymes. Cell MolLife Sci 2006, 63(5):517-523.

17. Misra RV, Horler RSP, Reindl W, Goryanin II, Thomas GH: Echo-BASE: an integrated post-genomic database for Escherichiacoli. Nucl Acids Res 2005, 33(suppl 1):D329-333.

18. EcoCyc orphan page [http://ecocyc.org/enzymes.shtml]19. Caspi R, Foerster H, Fulcher CA, Hopkinson R, Ingraham J, Kaipa P,

Krummenacker M, Paley S, Pick J, Rhee SY, Tissier C, Zhang P, KarpPD: MetaCyc: a multiorganism database of metabolic path-ways and enzymes. Nucleic Acids Res 2006:D511-516.

20. ORENZA [http://www.orenza.u-psud.fr]21. Melnick J, Lis E, Park J-H, Kinsland C, Mori H, Baba T, Perkins J, Schyns

G, Vassieva O, Osterman A, Begley TP: Identification of the twomissing bacterial genes involved in thiamine salvage: thia-mine pyrophosphokinase and thiamine kinase. J Bacteriol 2004,186(11):3660-3662.

22. Tipton K, Boyce S: History of the enzyme nomenclature sys-tem. Bioinformatics 2000, 16:34-40.

23. Lee TJ, Pouliot Y, Wagner V, Gupta P, Stringer-Calvert DW, Tenen-baum JD, Karp PD: BioWarehouse: a bioinformatics databasewarehouse toolkit. BMC Bioinformatics 2006, 7(1):170.

24. BioWarehouse [http://bioinformatics.ai.sri.com/biowarehouse/]25. ExPASY Proteomics Server [http://www.expasy.org/cgi-bin/

sprot-search-ful]26. IUBMB database [http://www.chem.qmul.ac.uk/iubmb/enzyme/]27. MetaCyc [http://www.metacyc.org]28. Nishii W, Ueki T, Miyashita R, Kojima M, Kim Y, Sasaki N, K M-M,

Takahashi K: Structural and enzymatic characterization of

physarolisin (formerly physaropepsin) proves that it is aunique serine-carboxyl proteinase. Biochem Biophys Res Commun2003, 301(4):1023-1029.

29. Wang Y, Zhang Y, Ding J, Li uY, Wang J, Yu Z: Cloming, sequenceanalysis of imidase gene from Alcaligenes eatrophus and itsexpression in E coli. Wei Sheng Wu Xue Bao 2002, 42(2):153-162.

30. Briggs B, Kreuzman A, Whitesitt C, Yeh W-K, Zmijewski M: Discov-ery, purification, and properties of o-phthalyl amidase fromXanthobacter agilis. Journal of Molecular Catalysis B: Enzymatic1996, 2(1):53-69.

31. Cheng T, Harvey S, Chen G: Cloning and expression of a geneencoding a bacterial enzyme for decontamination of organ-ophosphorus nerve agents and nucleotide sequence of theenzyme. Appl Environ Microbiol 1996, 62(5):1636-1641.

32. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S,Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA,O'Donovan C, Redaschi N, Yeh LS: The Universal ProteinResource (UniProt). Nucl Acids Res 2005, 33(suppl 1):D154-159.

33. Peterson JD, Umayam LA, Dickinson T, Hickey EK, White O: TheComprehensive Microbial Resource. Nucl Acids Res 2001,29(1):123-125.

34. Barrett A: Nomenclature Committee of the InternationalUnion of Biochemistry and Molecular Biology (NC-IUBMB).Enzyme Nomenclature. Recommendations 1992. Supple-ment 5: corrections and additions (1999). Eur J Biochem 1999,250(1):1-6.

35. Bairoch A: The ENZYME database in 2000. Nucl Acids Res 2000,28(1):304-305.

36. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, ChurchDM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL,Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU,Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K,Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L,Yaschenko E: Database resources of the National Center forBiotechnology Information. Nucl Acids Res 2005, 33(suppl1):D39-45.

Page 15 of 15(page number not for citation purposes)