A Summary of Genomic Databases: Overview and Discussion
Erika De Francesco2, Giuliana Di Santo1, Luigi Palopoli1, Simona E. Rombo1
1 Exeura, Via Pedro Alvares Cabrai, Edificio Manhattan, 87036 C.da Lecco, Rende (CS), [email protected]
2 Universita della Calabria, Via Pietro Bucci, 87036 Rende, [email protected], [email protected], [email protected]
Summary. In the last few years, both the amount of electronically stored biological data and the number of biological data repositories have grown significantly (today, more than eight hundred repositories can be counted). In spite of the enormous amount of available resources, a user may be disoriented when searching for specific data. Thus, an accurate analysis of biological data and repositories turns out to be useful to obtain a systematic view of biological database structures, tools and contents and, eventually, to facilitate the access and recovery of such data. In this chapter, we propose an analysis of genomic databases, which are databases of fundamental importance for research in bioinformatics. In particular, we provide a small catalog of 74 selected genomic databases, analyzed and classified by considering both their biological contents and their technical features (e.g., how they may be queried, the database schemas they are based on, the different data formats, etc.). We think that such a work may be a useful guide and reference for everyone needing to access and retrieve information from genomic databases.
1 The Biological Databases Scenario
In a field evolving as quickly as biology, it seems natural to continuously have new data to store and analyze. Thus, the amount of data stored in biological databases is growing very quickly [12], and the number of relevant biological data sources can nowadays be estimated at about 968 units [34]. The availability of so many data repositories is indisputably an important resource but, on the other hand, it raises new questions about how to effectively access and retrieve the available data. Indeed, the distributed nature of biological knowledge, and the heterogeneity of biological data and sources, can often cause trouble to a user with specific demands. In fact, not only are such data sources numerous, but they also use different kinds of representations and access methods, and offer various features and formats.
A.S. Sidhu et al. (Eds.): Biomedical Data and Applications, SCI 224, pp. 37–54.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
In this scenario, a documented analysis of the material available on-line can help users fulfill their information needs. Such an analysis should consider not only biological contents, but also the diverse types of data formats and representations exploited, the different ways in which such databases may be queried (e.g., by graphical interaction or by text forms) and the various database schemas they are based on. With this aim in mind, we started from the about 968 biological databases listed in [34], where a classification reflecting some specific biological views of data is provided, and we focused on a specific biological database category, that of genomic databases, analyzing them w.r.t. database technologies and tools (an aspect that has not been considered in previous works, such as [31, 34]).
As a result of this work, in this chapter we discuss the 74 genomic databases we have analyzed, specifying the main features they have in common and the differences among them from both the biological and the database technology points of view. In particular, we used five different and orthogonal criteria of analysis, namely:
• typologies of recoverable data (e.g., genomic segments or clone/contig regions);
• database schema types (e.g., Genolist schema or Chado schema);
• query types (e.g., simple queries or batch queries);
• query methods (e.g., graphical interaction based or query language based methods);
• export formats (e.g., flat files or XML formats).
Section 2 is devoted to the illustration of such an analysis, and includes a subsection for each criterion under which the selected genomic databases have been analyzed in detail. To facilitate the full understanding of the discussion, the mentioned classification criteria are also illustrated graphically. In this way, given one of the analyzed databases, it is possible to easily obtain, in a synthetic way, all the information captured by the described criteria. An on-line version of such an analysis is available at http://siloe.deis.unical.it/biodbSurvey/, where a numbered list of all the databases we analyzed is provided (as also reported here in the Appendix) and the classifications corresponding to the five criteria illustrated above are graphically reported. In particular, for each criterion, each sub-entry is associated with a list of databases; clicking on one of the databases shows a table where all the main features of that database are summarized (see Section 2.6 for more details).
The two following subsections summarize some basic biological notions and illustrate the main biological database classification commonly adopted in the literature.
1.1 Some (very) basic biology
The genome comprises all the information encoded in the Deoxyribonucleic Acid (DNA), the macromolecule containing the genetic heritage of each living being. DNA is a sequence of symbols over an alphabet of four characters, A, T, C and G, representing the four nucleotides Adenine, Thymine, Cytosine and Guanine. In genomic sequences, three kinds of subsequences can be distinguished: i) genic subsequences, coding for protein expression; ii) regulatory subsequences, placed upstream or downstream of the gene whose expression they influence; iii) subsequences apparently not related to any function. Each gene is associated with a functional result that, usually, is a protein. Protein synthesis can be summarized in two main steps: transcription and translation. During transcription, the DNA genic sequence is copied into a Messenger Ribonucleic Acid (mRNA), which delivers the needed information to the synthesis apparatus. During translation, the mRNA is translated into a protein, that is, a sequence of symbols over an alphabet of 20 characters, each denoting an amino acid and each corresponding to a nucleotide triplet. Therefore, changing even a single nucleotide in a genic subsequence may cause a change in the corresponding protein sequence. Proteins and, more generally, macromolecules are characterized by different levels of information, related to the elementary units constituting them and to their spatial disposition. Such levels are known as structures. In particular, the biological functions of macromolecules also depend on their three-dimensional shapes, usually named tertiary structures. Moreover, beyond the information encoded in the genomic sequence, additional knowledge about the biological functions of genes and molecules can be attained by experimental techniques of computational biology. Such techniques are devoted, for example, to building detailed pathway maps, that is, directed graphs representing metabolic reactions of cellular signals [14].
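The two steps of protein synthesis, and the effect of a single nucleotide change, can be illustrated with a minimal Python sketch. It assumes the standard genetic code and, for simplicity, reads codons directly from the coding DNA strand; the function names and the two toy sequences are illustrative assumptions, not material taken from any of the surveyed databases.

```python
# Standard genetic code, listed in the conventional T, C, A, G codon ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA corresponding to a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna: str) -> str:
    """Translate a coding-strand DNA sequence, triplet by triplet, until a stop codon."""
    dna = coding_dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequences (hypothetical): they differ in a single nucleotide (A -> T).
original = "ATGGAGTAA"
mutated = "ATGGTGTAA"
print(transcribe(original))
print(translate(original), translate(mutated))
```

In this toy case the single change turns the codon GAG (glutamic acid, E) into GTG (valine, V), so the two translated sequences differ in one amino acid.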
1.2 Biological database classification
As pointed out in [31], all biological databases may be classified, w.r.t. their biological contents, into the following categories (see also Figure 1):
• Macromolecular Databases: contain information related to the three main classes of macromolecules, that is, DNA, RNA and proteins.
  – DNA Databases: describe DNA sequences.
  – RNA Databases: describe RNA sequences.
  – Protein Databases: store information on proteins from three different points of view:
    · Protein Sequence Databases: describe amino acid sequences.
    · Protein Structure Databases: store information on protein structures, that is, protein shapes, charges and chemical features, characterizing their biological functions.
    · Protein Motif and Domain Databases: store sequences and structures of particular protein portions (motifs and domains) to which specific functional meanings can be associated, grouped by biological functions, cellular localizations, etc.
• Genomic Databases: contain data related to the genomic sequencing of different organisms, and gene annotations.
  – Human Genome Databases: include information on human gene sequencing.
  – Model Organism Databases (MOD): store data coming from the sequencing projects of model organisms (such as, e.g., MATDB); they are also intended to support the Human Genome Project (HGP).
  – Other Organism Databases: store information derived from sequencing projects not related to HGP.
  – Organelle Databases: store genomic data of cellular organelles, such as mitochondria, having their own genome, distinct from the nuclear genome.
  – Virus Databases: store virus genomes.
• Experiment Databases: contain results of lab experiments; these can be grouped in three main classes:
  – Microarray Databases: store data deriving from microarray experiments, useful to assess the variability of the gene expression.
  – 2D PAGE Databases: contain data on proteins identified by two-dimensional polyacrylamide gel electrophoresis (2-D PAGE), an experimental technique used to physically separate and distinguish proteins.
  – Other Experiment Databases: store results of specific experimental techniques, such as the Quantitative PCR Primer Database [23].
• Immunological Databases: store information associated with immunologic responses.
• Gene Databases: contain sequence data of coding DNA subsequences, that is, the genes.
• Transcript Databases: store the set of transcribed genomic sequences of several organisms (that is, those genomic sequences characterizing the cellular properties of organs and tissues).
• Pathway Databases: contain information on metabolic reaction networks of cellular signals.
• Polymorphism and Mutation Databases: report data on polymorphic variations and mutations.
• Meta-Databases: store descriptions of and links to (selected groups of) biological databases.
• Literature Databases: these are archives storing papers published in premier biological journals.
• Other Databases: store data related to specific application contexts, such as DrugBank [10].
It is worth pointing out that the above classes overlap in some cases. As an example, EcoCyc [40] can be regarded both as a genomic and as a pathway database. Note that this is intrinsic to the structure of the biological context. In fact, biological data sources contain data which can be classified under different perspectives, due to the various origins and goals of the associated repositories.
2 Genomic databases analysis
This section focuses on genomic databases, which represent one of the most important biological database categories. Unlike gene databases, which contain only coding DNA sequences, genomic databases also contain non-coding intergenic sequences.
Fig. 1. Biological database classification.
In the following, we discuss a selected subset of genomic repositories (74 units) and illustrate in detail their differences and common features according to five criteria of analysis, that is, the typologies of recoverable data, the database schema types, the query types, the query methods and, finally, the export formats.
For each classification criterion, the corresponding classification and the list of associated databases are also illustrated graphically. In particular, in the graphs that follow, classes are represented by rectangles, whereas circles indicate sets of databases. Databases are indicated by numbers, according to the lists found in the Appendix and at http://siloe.deis.unical.it/biodbSurvey/.
2.1 Recoverable Data
Fig. 2. Recoverable data typologies.
Genomic databases contain a large set of data types. Some archives report only the sequence, the function and the organism corresponding to a given portion of genome, whereas others also contain detailed information useful for biological or clinical analysis. As shown in Figure 2, data that are recoverable from genomic databases can be distinguished in six classes:
• Genomic segments: include all the nucleotide subsequences which are meaningful from a biological point of view, such as genes, clone/contig sequences, polymorphisms, control regions, motifs and structural features of chromosomes. In detail:
  – Genes: are DNA subsequences originating functional products, such as proteins and regulatory elements. All genomic databases contain information about genes.
  – Clone/contig regions: are sequences of in vitro copies of DNA regions, used to clone the DNA sequence of a given organism. Examples of databases storing information about clone/contig regions are ArkDB [1], SGD [32] and ZFIN [47].
  – Polymorphisms: are frequent changes in nucleotide sequences, usually corresponding to a phenotypic change (e.g., the variety of eye colors in the population). As an example, BeetleBase [3] contains information about polymorphisms.
  – Control regions: are DNA sequences regulating gene expression. A few databases store information about control regions. Among them there are Ensembl [13], FlyBase [42], SGD [32] and WormBase [28].
  – Motifs (or patterns): are specific DNA segments having important functions and repeatedly occurring in the genome of a given organism. Examples of databases containing information about motifs are AureoList [2], Leproma [20] and VEGA [38].
  – Structural features: can be further distinguished in telomeres, centromeres and repeats.
    · Telomeres: are the terminal regions of chromosomes. Genolevures [33], PlasmoDB [30] and SGD [32] store information about telomeres.
    · Centromeres: represent the junction between the short arms and the long arms of the chromosome. BGI-RISe [51], PlasmoDB [30], RAD [24] and SGD [32] are databases containing information about centromeres.
    · Repeats: are repetitions of short nucleotide segments, repeatedly occurring in some of the regions of the chromosome. Among others, CADRE [7], PlasmoDB [30] and ToxoDB [37] store information about repeats.
• Maps: result from projects that produced the sequencing and mapping of the DNA of diverse organisms, such as the Human Genome Project [50]. In particular, genetic maps give information on the order in which genes occur in the genome, providing only an estimate of the distances between genes, whereas physical maps give more precise information on the physical distances among genes. GOBASE [45] and BeetleBase [3] are examples of databases storing maps.
• Variations and mutations: a nucleotide change is a mutation if it occurs with low frequency in the population (as opposed to polymorphisms). Mutations may cause alterations in the protein tertiary structures, inducing pathologic variations of the associated biological functions. This information is stored in databases such as, for example, FlyBase [42] and PlasmoDB [30].
• Pathways: describe interactions of sets of genes, or of proteins, or of metabolic reactions involved in the same biological function. Pathways are stored in Bovilist [5] and Leproma [20], for example.
• Expression data: are experimental data about the different levels of expression of genes. Such expression levels are related to the quantity of genic product of the genes. Information about expression data is stored in, e.g., CandidaDB [6], PlasmoDB [30], SGD [32] and ToxoDB [37].
• Bibliographic references: are repositories of (sometimes, links to) relevant biological literature. Most genomic databases contain bibliographic references.
2.2 Database Schemas
Most genomic databases are relational, even if relevant examples exist that are based on the object-oriented or the object-relational model (see, e.g., [28, 49]). We distinguished four different types of database schema designed “specifically” to manage biological data from other unspecific database schemas, which are mostly generic relational schemas whose structure is in any case independent of the biological nature of the data.
Thus, w.r.t. database schemas, we grouped genomic databases in five classes, as also illustrated in Figure 3; in more detail, we referred to:
Fig. 3. Database schemas.
• The Genomics Unified Schema (GUS) [16]: this is a relational schema suitable for a large set of biological information, including genomic data, genic expression data and protein data. Databases based on this schema are CryptoDB [35], PlasmoDB [30] and TcruziDB [27].
• The Genolist Schema [43]: this is a relational database schema developed to manage bacteria genomes. In the Genolist Schema, genomic data are partitioned in subsequences, each associated with specific features. With it, partial and efficient data loading is possible (query and update operations), since the update of sequences and associated features only requires accessing and analyzing the chromosomic fragments of interest. Databases based on the Genolist Schema are, for example, AureoList [2], LegioList [19] and PyloriGene [22].
• The Chado Schema [8]: this is a relational schema having an extensible modular structure, and is used in some model organism databases such as BeetleBase [3] and FlyBase [42].
• The Pathway Tools Schema [39]: this is an object schema, used in Pathway/Genome databases. It is based on an ontology defining a large set of classes (there are about 1350 of them), attributes and relations to model biological data, such as metabolic pathways, enzymatic functions, genes, promoters and genic regulation mechanisms. Databases based on this schema are BioCyc [4], EcoCyc [11] and HumanCyc [18].
• Unspecific Schema: refers to databases which do not correspond to any of the specific groups illustrated above; in particular, most of them are based on a relational schema, without any special adaptation to manage biological data.
2.3 Query Types
Fig. 4. Query types.
In Figure 4, the query types supported by the analyzed databases are illustrated. Simple queries recover data satisfying some standard search parameters such as, e.g., gene names, functional categories and others. Batch queries consist of groups of simple queries that are processed simultaneously. The answer to such a query results from combining the answers obtained for the constituent simple queries. Analysis queries are more complex and, somehow, more typical of the biological domain. They consist in retrieving data based on similarities (similarity queries) and patterns (pattern search queries). The former take as input a DNA or a protein (sub)sequence, and return those sequences found in the database that are the most similar to the input sequence. The latter take as input a pattern p and a DNA sequence s, and return those subsequences of s which turn out to be most strongly related to the input pattern p. As an example, consider the pattern TATA, that is, a pattern denoting a regulatory sequence upstream of the genic sequence of every organism: filling the pattern search query form with it on a database such as CandidaDB [6], and allowing at most one mismatch, returns 311 genic subsequences extracted from the DNA of Candida (a fungus), with the TATA box highlighted. Both sequence similarity and pattern search queries exploit heuristic search algorithms such as BLAST [29].
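To make the notion of a pattern search query with mismatches more concrete, the following Python sketch scans a sequence s for all windows that differ from a pattern p in at most a given number of positions. It is a naive exhaustive scan written for illustration only: it is not BLAST, and it is not the algorithm used by CandidaDB or any other surveyed database. The function names and the toy sequence are assumptions.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pattern_search(pattern: str, sequence: str, max_mismatches: int = 1):
    """Return (start, subsequence, mismatch count) for every window of the
    sequence that matches the pattern with at most max_mismatches differences."""
    width = len(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        distance = mismatches(pattern, window)
        if distance <= max_mismatches:
            hits.append((start, window, distance))
    return hits


# Toy DNA sequence (hypothetical), searched for TATA with at most one mismatch.
for start, window, distance in pattern_search("TATA", "GGTATACCTAAAGG", 1):
    print(start, window, distance)
```

On this toy input the scan reports the exact occurrence TATA at position 2 and the one-mismatch occurrence TAAA at position 8 (0-based positions). A batch query could be sketched in the same spirit by running the function over a list of patterns and collecting the answers.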
2.4 Query Methods
A further analysis criterion we considered is related to the methods used to query the available databases, as illustrated in Figure 5. We distinguished four main classes of query methods: text based methods, graphical interaction based methods, sequence based methods and query language based methods.
Fig. 5. Query methods.
The most common methods are the text based ones, further grouped in two categories: free text and forms. In both cases, the query is formulated by specifying some keywords. With the free text methods, the user can specify sets of words, also combining them by logical operators. Such keywords are searched for in fields and tables of the database that are not determined a priori. With forms, searching starts by specifying the values to look for in specific attributes of the database. Multi-attribute searches can be obtained by combining the sub-queries by logical operators. As an example, ArkDB [1] supports both free text and form queries.
Graphical interaction based methods are also quite common in genomic database access systems. In fact, a large set of genomic information can be visualized by physical and genetic maps, whereby queries can be formulated by interacting with graphical objects representing the annotated genomic data. Graphical support tools have indeed been developed such as, e.g., the Generic Genome Browser (GBrowse) [48], which allows the user to navigate interactively through a genome sequence, to select specific regions and to recover the corresponding annotations and information. Databases such as CADRE [7] and HCVDB [17], for example, allow the user to query their repository by graphical interaction.
Sequence based queries rely on the computation of similarity among genomic sequences and patterns. In this case, the input is a nucleotide or amino acid sequence, and data mining algorithms, such as the one in [29], are exploited to process the query, so that alignments and similar information are retrieved and returned.
Query language based methods exploit DBMS-native query languages (e.g., SQL-based ones), allowing a “low level” interaction with the databases. At the moment, few databases (WormBase [28] and GrainGenes [15]) support them.
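As a rough illustration of what such a “low level” interaction looks like, the sketch below runs an SQL query against a small in-memory SQLite database from Python. The `gene` table, its columns and its rows are entirely hypothetical: they do not reproduce the schema of WormBase, GrainGenes or any other surveyed database.

```python
import sqlite3

# Hypothetical, minimal relational table holding gene positions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (name TEXT, chromosome TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [
        ("geneA", "I", 100, 900),
        ("geneB", "I", 1500, 2400),
        ("geneC", "II", 300, 1200),
    ],
)


def genes_in_region(connection, chromosome, region_start, region_end):
    """Names of the genes lying entirely inside a chromosomic region."""
    rows = connection.execute(
        "SELECT name FROM gene "
        "WHERE chromosome = ? AND start_pos >= ? AND end_pos <= ? "
        "ORDER BY start_pos",
        (chromosome, region_start, region_end),
    )
    return [name for (name,) in rows]


print(genes_in_region(conn, "I", 0, 1000))
print(genes_in_region(conn, "I", 0, 3000))
```

Compared with forms, a query language lets the user express arbitrary selections and joins, at the price of having to know the underlying schema.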
2.5 Result Formats
In genomic databases, several formats are adopted to represent query results. Web interfaces usually provide answers encoded in HTML, but other formats are often available as well. In the following discussion, we shall refer to the formats adopted for representing query results and not to those used for ftp-downloadable data.
Fig. 6. Result formats.
Flat files are semi-structured text files where each information class is reported on one or more consecutive lines, identified by a code used to characterize the annotated attributes. The most common flat file formats are the GenBank Flat File (GBFF) [41] and the European Molecular Biology Laboratory (EMBL) Data Library Format [44]. These formats represent basically the same information contents, albeit with some syntactic differences. Databases supporting the GenBank Flat File are, for example, CADRE [7] and Ensembl [13], whereas databases supporting the EMBL format are CADRE [7], Ensembl [13] and ToxoDB [37].
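The idea of lines identified by a code can be sketched with a small parser for an EMBL-style record, where each line starts with a two-letter code (such as ID, AC, DE or SQ), sequence lines start with blanks, and `//` closes the record. The record below is invented for illustration and is heavily simplified; real entries carry many more line types, and the parser is an assumption-laden sketch rather than a complete reader for the format.

```python
from collections import defaultdict

# Invented, simplified EMBL-style record (not a real database entry).
RECORD = """\
ID   EXAMPLE1; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
AC   X00000;
DE   Hypothetical example entry
DE   used only for illustration.
SQ   Sequence 12 BP;
     atggagtaat ga        12
//
"""


def parse_flat_record(text):
    """Group the lines of one record by their two-letter code and
    collect the sequence that follows the SQ line."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        code, content = line[:2], line[5:].strip()
        if in_sequence and code.strip() == "":
            # Sequence line: keep the letters, drop blanks and position counters.
            sequence_parts.append("".join(ch for ch in content if ch.isalpha()))
        else:
            fields[code].append(content)
            in_sequence = code == "SQ"
    record = {code: " ".join(values) for code, values in fields.items()}
    record["sequence"] = "".join(sequence_parts).upper()
    return record


entry = parse_flat_record(RECORD)
print(entry["DE"])
print(entry["sequence"])
```

Consecutive lines with the same code (here the two DE lines) are joined into one value, which mirrors the "one or more consecutive lines" layout described above.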
The comma/tab separated and the attribute-value separated files are similar to flat files, but feature a mildly stronger form of structure. The former, supported for example by BioCyc [4], TAIR [46] and HumanCyc [18], represent data in a tabular format, separating attributes and records by special characters (e.g., commas and tabulations). The latter (supported for example by BioCyc [4], EcoCyc [11] and HumanCyc [18]) organize data as pairs <attribute, value>.
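Both layouts are easy to read programmatically, as the following Python sketch suggests. The column names, the sample values and the `attribute: value` line syntax are assumptions chosen for illustration; the surveyed databases may use different separators and attribute names.

```python
import csv
import io

# Hypothetical sample data in the two layouts.
TAB_DATA = "gene\tchromosome\tproduct\ngeneA\tI\tkinase\ngeneB\tII\ttransporter\n"
ATTR_DATA = "gene: geneA\nchromosome: I\nproduct: kinase\n"


def parse_tabular(text, delimiter="\t"):
    """Read a comma/tab separated file: one record per line, one column per attribute."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def parse_attribute_value(text, separator=":"):
    """Read one record written as a list of <attribute, value> pairs, one per line."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        attribute, _, value = line.partition(separator)
        record[attribute.strip()] = value.strip()
    return record


rows = parse_tabular(TAB_DATA)
print(rows[1]["product"])
print(parse_attribute_value(ATTR_DATA))
```

The tabular layout fixes the set of attributes once, in the header, whereas the attribute-value layout repeats attribute names in every record and therefore tolerates records with different attributes.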
In other cases, special formats have been explicitly conceived for biological data. An example is the FASTA format [44], commonly used to represent sequence data (it is exploited in all genomic databases, except for ColiBase [9], PhotoList [21] and StreptoPneumoList [25]). Each entry of a FASTA document consists of three components: a comment line, which is optional and reports brief information about the sequence and the GenBank entry code; the sequence, represented as a string over the alphabet {A, C, G, T} of the nucleotide symbols; and a character denoting the end of the sequence.
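A minimal reader for this format can be sketched in a few lines of Python. The sketch assumes the common convention that the comment line starts with `>`, and that an optional `*` may close the sequence; both the sample entries and the function name are illustrative assumptions.

```python
def parse_fasta(text):
    """Return a dict mapping each comment line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Drop an optional trailing '*' used as end-of-sequence marker.
            records[header].append(line.rstrip("*"))
    return {name: "".join(parts).upper() for name, parts in records.items()}


# Hypothetical two-entry FASTA document.
SAMPLE = ">seq1 hypothetical entry\nATGG\nAGTA\n>seq2\nacgt*\n"
for name, sequence in parse_fasta(SAMPLE).items():
    print(name, sequence)
```

A sequence may span several lines; the reader simply concatenates every line found between two comment lines.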
More recently, in order to facilitate the spreading of information in heterogeneous contexts, the XML format has sometimes been supported. The main XML-based standards in the biological context are the Systems Biology Markup Language (SBML) [36] and the Biological Pathway Exchange (BioPAX) [26]. SBML is used to export and exchange query results such as pathways of cellular signals, gene regulation and so on. The main data types here are derived from XML-Schema. Similarly, BioPAX has been designed to integrate different pathway resources via a common data format. BioPAX specifically consists of two parts: the BioPAX ontology, which provides an abstract representation of concepts related to biological pathways, and the BioPAX format, defining the syntax of data representation according to the definitions and relations represented in the ontology. BioCyc [4], EcoCyc [11] and HumanCyc [18] support both SBML and BioPAX formats.
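The benefit of an XML export is that generic tools can navigate it. The Python sketch below extracts reactions from a small SBML-like fragment with the standard library. The fragment borrows SBML element names (`listOfSpecies`, `reaction`, `speciesReference`) but omits the namespace and several attributes a real SBML document would carry, so it should be read as a simplified assumption rather than a valid SBML file; the pathway content is a toy example.

```python
import xml.etree.ElementTree as ET

# Simplified, SBML-like fragment (not a complete or valid SBML document).
SBML_LIKE = """
<sbml>
  <model id="toy_pathway">
    <listOfSpecies>
      <species id="glucose"/>
      <species id="g6p"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase">
        <listOfReactants>
          <speciesReference species="glucose"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="g6p"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def reactions(xml_text):
    """Return (reaction id, reactant ids, product ids) for each reaction element."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for reaction in root.iter("reaction"):
        reactants = [
            ref.get("species")
            for ref in reaction.findall("./listOfReactants/speciesReference")
        ]
        products = [
            ref.get("species")
            for ref in reaction.findall("./listOfProducts/speciesReference")
        ]
        result.append((reaction.get("id"), reactants, products))
    return result


print(reactions(SBML_LIKE))
```

Each tuple can be read as a directed edge of a pathway graph, from the reactants to the products of a reaction.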
Figure 6 summarizes the described data formats.
2.6 The on-line synthetic description
We remark that the main goal of our investigation has been that of screening a significant subset of the available on-line biological resources, in order to identify some common directions of analysis, at least for a sub-category of them (i.e., the genomic databases). Thus, as one of the main results of such an analysis, we provide a synthetic description of each of the 74 analyzed databases according to the considered criteria. All such synthetic descriptions are freely available at http://siloe.deis.unical.it/biodbSurvey/.
Fig. 7. An example of the database synthetic descriptions available at http://siloe.deis.unical.it/biodbSurvey/.
In Figure 7, an example of the synthetic description we adopted is illustrated. In particular, a form is exploited where both the acronym and the full name of the database under analysis are shown. We consider the CryptoDB [35] (Cryptosporidium Genome Resources) database, that is, the genome database of the AIDS-related apicomplexan parasite, as reported in the brief description below the full name in the form. The web link to the considered database is also provided, along with its database features. In particular, CryptoDB is a relational database based on the Genomics Unified Schema (illustrated above, in Section 2.2). All features corresponding to the other analysis criteria are reported. In detail, we highlight that CryptoDB contains information on genes, clone/contig sequences, polymorphisms and motifs, and that it stores physical maps, pathways and bibliographic references as well. It supports simple queries and also analysis queries based on similarity search, and allows graphical interaction and sequence based query methods. Additional information is contained in the form about the search criteria supported by each database. In particular, CryptoDB can be queried by using both keywords and map positions as search parameters (map positions require the specification of the positions of chromosomic regions). The result formats supported by CryptoDB are HTML (as for all the other genomic databases), FASTA and tab-separated files. It allows the user to download data stored in its repository, but not to submit new data, as specified in the last two fields (“submit” and “download”) of the form.
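The content of such a form maps naturally onto a small record type. The Python sketch below encodes the CryptoDB description given above; the class name, the field names and the helper function are assumptions introduced here for illustration and do not reflect how the on-line survey is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseDescription:
    """Synthetic description of one genomic database, one field per criterion."""
    acronym: str
    full_name: str
    schema: str
    recoverable_data: set = field(default_factory=set)
    query_types: set = field(default_factory=set)
    query_methods: set = field(default_factory=set)
    search_criteria: set = field(default_factory=set)
    result_formats: set = field(default_factory=set)
    download: bool = False
    submit: bool = False


# Values taken from the description of CryptoDB in the text.
CRYPTODB = DatabaseDescription(
    acronym="CryptoDB",
    full_name="Cryptosporidium Genome Resources",
    schema="Genomics Unified Schema",
    recoverable_data={
        "genes", "clone/contig sequences", "polymorphisms", "motifs",
        "physical maps", "pathways", "bibliographic references",
    },
    query_types={"simple", "similarity search"},
    query_methods={"graphical interaction", "sequence based"},
    search_criteria={"keywords", "map positions"},
    result_formats={"HTML", "FASTA", "tab-separated"},
    download=True,
    submit=False,
)


def supporting_format(descriptions, result_format):
    """Acronyms of the described databases that export a given result format."""
    return [d.acronym for d in descriptions if result_format in d.result_formats]


print(supporting_format([CRYPTODB], "FASTA"))
```

With one such record per database, each of the five classifications discussed in Section 2 reduces to a simple filter over the corresponding field.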
3 Concluding Remarks
In this chapter we presented the main results of an analysis we have conducted on 74 genomic databases. Such an analysis has been carried out mainly from a database technology point of view, by choosing five specific criteria under which all the considered databases have been investigated. Thus, the considered databases have been classified according to such criteria and their synthetic descriptions are given, with the aim of providing a preliminary instrument to access and manage more easily the enormous amount of data coming from the biological repositories.
References
1. Arkdb. http://www.thearkdb.org/arkdb//index.jsp.
2. Aureolist. http://genolist.pasteur.fr/AureoList.
3. Beetlebase website. http://www.bioinformatics.ksu.edu/BeetleBase.
4. Biocyc website. http://biocyc.org.
5. Bovilist world-wide web server. http://genolist.pasteur.fr/BoviList.
6. Candidadb: Database dedicated to the analysis of the genome of the human fungal pathogen, candida albicans. http://genolist.pasteur.fr/CandidaDB.
7. Central aspergilus data repository. www.cadre.man.ac.uk.
8. Chado: The gmod database schema. http://www.gmod.org/chado.
9. Colibase. http://colibase.bham.ac.uk.
10. Drugbank database. http://redpoll.pharmacy.ualberta.ca/drugbank.
11. Ecocyc website. http://ecocyc.org.
12. Embl nucl. seq. db statistics. http://www3.ebi.ac.uk/Services/DBStats.
13. Ensembl. www.ensembl.org.
14. From sequence to function. an introduction to the kegg project. http://www.genome.jp/kegg/docs/intro.html.
15. Graingenes: a database for triticeae and avena. http://wheat.pw.usda.gov/GG2/index.shtml.
16. Gus: The gen. unified schema docum. http://www.gusdb.org/documentation.php.
17. Hcv polyprotein. http://hepatitis.ibcp.fr/.
18. Humancyc website. http://humancyc.org.
19. Legiolist. http://genolist.pasteur.fr/LegioList.
20. Leproma. http://genolist.pasteur.fr/Leproma.
21. Photolist. http://genolist.pasteur.fr/PhotoList.
22. Pylorigene. http://genolist.pasteur.fr/PyloriGene.
23. Quant. pcr primer db (qppd). http://web.ncifcrf.gov/rtp/gel/primerdb.
24. Rice annotation database. http://golgi.gs.dna.affrc.go.jp/SY-1102/rad.
25. Streptopneumolist. http://genolist.pasteur.fr/StreptoPneumoList.
26. System biology markup language (sbml). http://www.smbl.org.
27. Tcruzidb: An integrated trypanosoma cruzi genome resource. http://tcruzidb.org/.
28. Wormbase: Db on biology and genome of c.elegans. http://www.wormbase.org/.
29. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. of Mol. Biology, 215(3):403–410, 1990.
30. A. Bahl et al. Plasmodb: the plasmodium genome resource. a database integrating experimental and computational data. Nucl. Ac. Res., 31(1):212–215, 2003.
31. A.D. Baxevanis. The molecular biology database collection: an update compilation of biological database resource. Nucl. Ac. Res., 29(1):1–10, 2001.
32. J.M. Cherry, C. Ball, S. Weng, G. Juvik, R. Schmidt, C. Adler, B. Dunn, S. Dwight, L. Riles, R.K. Mortimer, and D. Botstein.
33. F. Iragne E. Beyne M. Nikolski D. Sherman, P. Durrens and J.-L. Souciet.Genolevures complete genomes provide data and tools for comparative genomicsof hemiascomycetous yeasts. Nucl. Ac. Res., 34(Database issue):D432–D435,2006.
34. M. Y. Galperin. The molecular biology database collection: 2007 update. Nucl. Ac. Res., 35(Database issue):D3–D4, 2007.
35. M. Heiges, H. Wang, E. Robinson, C. Aurrecoechea, X. Gao, N. Kaluskar, P. Rhodes, S. Wang, C. Z. He, Y. Su, J. Miller, E. Kraemer, and J. C. Kissinger. Cryptodb: a cryptosporidium bioinformatics resource update. Nucl. Ac. Res., 34(Database issue):D419–D422, 2006.
36. M. Hucka, A. Finney, and H.M. Sauro. The system biology markup language (sbml): a medium for representation and exchange of biochemical network model. Bioinformatics, 19(4):524–531, 2003.
37. J. C. Kissinger, B. Gajria, L. Li, I. T. Paulsen, and D. S. Roos. Toxodb: accessing the toxoplasma gondii genome. Nucl. Ac. Res., 31(1):234–236, 2003.
38. J. L. Ashurst, C.-K. Chen, J. G. R. Gilbert, K. Jekosch, S. Keenan, P. Meidl, S. M. Searle, J. Stalker, R. Storey, S. Trevanion, L. Wilming, and T. Hubbard. The vertebrate genome annotation (vega) database. Nucl. Ac. Res., 33(Database issue):D459–D465, 2005.
39. P. D. Karp. An ontology for biological function based on molecular interactions. Bioinformatics, 16:269–285, 2000.
40. I.M. Keseler, J. Collado-Vides, S. Gama-Castro, J. Ingraham, S. Paley, I.T. Paulsen, M. Peralta-Gil, and P.D. Karp. Ecocyc: a comprehensive database resource for escherichia coli. Nucl. Ac. Res., 33(Database issue):D334–D337, 2005.
41. Z. Lacroix and T. Critchlow. Bioinformatics: Managing Scientific Data. 2005.
42. M.A. Crosby, J.L. Goodman, V.B. Strelets, P. Zhang, W.M. Gelbart, and the FlyBase Consortium. Flybase: genomes by the dozen. Nucl. Ac. Res., 35:D486–D491, 2007.
43. I. Moszer, L. M. Jones, S. Moreira, C. Fabry, and A. Danchin. Subtilist: The ref. database for the bacillus subtilis genome. Nucl. Ac. Res., 30(1):62–65, 2002.
44. D. W. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, second edition, 2004.
45. Emmet A. O’Brien, Yue Zhang, LiuSong Yang, Eric Wang, Veronique Marie, B. Franz Lang, and Gertraud Burger. Gobase - a database of organelle and bacterial genome information. Nucl. Ac. Res., 34:D697–D699, 2006.
46. S.Y. Rhee, W. Beavis, T.Z. Berardini, G. Chen, D. Dixon, A. Doyle, M. Garcia-Hernandez, E. Huala, G. Lander, M. Montoya, N. Miller, L.A. Mueller, S. Mundodi, L. Reiser, J. Tacklind, D.C. Weems, Y. Wu, I. Xu, D. Yoo, J. Yoon, and P. Zhang. The arabidopsis information resource (tair): a model organism database providing a centralized, curated gateway to arabidopsis biology, research materials and community. Nucl. Ac. Res., 31(1):224–228, 2003.
47. J. Sprague, L. Bayraktaroglu, D. Clements, T. Conlin, D. Fashena, K. Frazer, M. Haendel, D. Howe, P. Mani, S. Ramachandran, K. Schaper, E. Segerdell, P. Song, B. Sprunger, S. Taylor, C. Van Slyke, and M. Westerfield. The zebrafish information network: the zebrafish model organism database. Nucl. Ac. Res., 34:D581–D585, 2006.
48. L. D. Stein, C. Mungall, S. Shu, M. Caudy, M. Mangone, A. Day, E. Nickerson, J.E. Stajich, T.W. Harris, A. Arva, and S. Lewis. The generic genome browser: A building block for a model organism system database. Gen. Res., 12(10):1599–1610, 2002.
49. S. Twigger, J. Lu, M. Shimoyama, D. Chen, et al. Rat genome database (rgd): mapping disease onto the genome. Nucl. Ac. Res., 30(1):125–128, 2002.
50. J.C. Venter et al. The sequence of the human genome. Science, 291(5507):1304–1351, 2001.
51. W. Zhao, J. Wang, X. He, X. Huang, Y. Jiao, M. Dai, S. Wei, J. Fu, Y. Chen, X. Ren, Y. Zhang, P. Ni, J. Zhang, S. Li, J. Wang, G. K. Wong, H. Zhao, J. Yu, H. Yang, and J. Wang. Bgi-ris: an integrated information resource and comparative analysis workbench for rice genomics. Nucl. Ac. Res., 32:D377–D382, 2004.
Appendix
1: ArkDB, http://www.thearkdb.org
2: AureoList, http://genolist.pasteur.fr/AureoList
3: BeetleBase, http://www.bioinformatics.ksu.edu/BeetleBase
4: BGI-RISe, http://rise.genomics.org.cn
5: BioCyc, http://biocyc.org
6: Bovilist, http://genolist.pasteur.fr/BoviList
7: Brassica ASTRA, http://hornbill.cspp.latrobe.edu.au/cgi-binpub/index.pl
8: CADRE, http://www.cadre.man.ac.uk
9: CampyDB, http://campy.bham.ac.uk
10: CandidaDB, http://genolist.pasteur.fr/CandidaDB
11: CGD, http://www.candidagenome.org
12: ClostriDB, http://clostri.bham.ac.uk
13: coliBase, http://colibase.bham.ac.uk
14: Colibri, http://genolist.pasteur.fr
15: CryptoDB, http://cryptodb.org
16: CyanoBase, http://www.kazusa.or.jp/cyano
17: CyanoList, http://genolist.pasteur.fr/CyanoList
18: DictyBase, http://dictybase.org
19: EcoCyc, http://ecocyc.org
20: EMGlib, http://pbil.univ-lyon1.fr/emglib/emglib.htm
21: Ensembl, http://www.ensembl.org
22: FlyBase, http://flybase.bio.indiana.edu
23: Full-Malaria, http://fullmal.ims.u-tokyo.ac.jp
24: GabiPD, http://gabi.rzpd.de
25: GDB, http://www.gdb.org
26: Genolevures, http://cbi.labri.fr/Genolevures
27: GIB, http://gib.genes.nig.ac.jp
28: GOBASE, http://megasun.bch.umontreal.ca/gobase
29: GrainGenes, http://wheat.pw.usda.gov/GG2/index.shtml
30: Gramene, http://www.gramene.org
31: HCVDB, http://hepatitis.ibcp.fr
32: HOWDY, http://www-alis.tokyo.jst.go.jp/HOWDY
33: HumanCyc, http://humancyc.org
34: INE, http://rgp.dna.affrc.go.jp/giot/INE.html
35: LegioList, http://genolist.pasteur.fr/LegioList
36: Leproma, http://genolist.pasteur.fr/Leproma
37: LeptoList, http://bioinfo.hku.hk/LeptoList
38: LIS, http://www.comparative-legumes.org
39: ListiList, http://genolist.pasteur.fr/ListiList
40: MaizeGDB, http://www.maizegdb.org
41: MATDB, http://mips.gsf.de/proj/thal/db
42: MBGD, http://mbgd.genome.ad.jp
43: MGI, http://www.informatics.jax.org
44: MitoDat, http://www-lecb.ncifcrf.gov/mitoDat/#searching
45: MjCyc, http://maine.ebi.ac.uk:1555/MJNRDB2/server.htm
46: Microscope, http://www.genoscope.cns.fr/agc/mage/wwwpkgdb/MageHome/index.php?webpage=microscope
47: MolliGen, http://cbi.labri.fr/outils/molligen
48: MtDB, http://www.medicago.org/MtDB
49: MypuList, http://genolist.pasteur.fr/MypuList
50: NCBI Genomes Databases, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
51: NCBI Viral Genomes, http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html
52: OGRe, http://drake.physics.mcmaster.ca/ogre
53: Oryzabase, http://www.shigen.nig.ac.jp/rice/oryzabase
54: PEC, http://shigen.lab.nig.ac.jp/ecoli/pec
55: PhotoList, http://genolist.pasteur.fr/PhotoList
56: PlantGDB, http://www.plantgdb.org
57: PlasmoDB, http://plasmodb.org
58: PyloriGene, http://genolist.pasteur.fr/PyloriGene
59: RAD, http://golgi.gs.dna.affrc.go.jp/SY-1102/rad
60: RGD, http://rgd.mcw.edu
61: SagaList, http://genolist.pasteur.fr/SagaList
62: ScoCyc, http://scocyc.jic.bbsrc.ac.uk:1555
63: SGD, http://www.yeastgenome.org
64: StreptoPneumoList, http://genolist.pasteur.fr/StreptoPneumoList
65: SubtiList, http://genolist.pasteur.fr/SubtiList
66: TAIR, http://www.arabidopsis.org
67: TcruziDB, http://tcruzidb.org
68: TIGR-CMR, http://www.tigr.org/CMR
69: ToxoDB, http://toxodb.org
70: TubercuList, http://genolist.pasteur.fr/TubercuList
71: UCSC, http://genome.ucsc.edu
72: VEGA, http://vega.sanger.ac.uk
73: WormBase, http://www.wormbase.org
74: ZFIN, http://zfin.org