Database resources of the National Center for Biotechnology

David L. Wheeler*, Deanna M. Church, Scott Federhen, Alex E. Lash, Thomas L. Madden, Joan U. Pontius, Gregory D. Schuler, Lynn M. Schriml, Edwin Sequeira, Tatiana A. Tatusova and Lukas Wagner

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA

Received September 17, 2002; Accepted October 2, 2002

*To whom correspondence should be addressed. Tel: +1 301 496 2475/+1 301 435 5950; Fax: +1 301 480 9241; Email: wheeler@ncbi.nlm.nih.gov

Nucleic Acids Research, 2003, Vol. 31, No. 1, 28–33. © 2003 Oxford University Press. DOI: 10.1093/nar/gkg033

ABSTRACT

In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI’s Web site. NCBI resources include Entrez, PubMed, PubMed Central (PMC), LocusLink, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR (e-PCR), Open Reading Frame (ORF) Finder, Reference Sequence (RefSeq), UniGene, HomoloGene, ProtEST, Database of Single Nucleotide Polymorphisms (dbSNP), Human/Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes and related tools, the Map Viewer, Model Maker (MM), Evidence Viewer (EV), Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), and the Conserved Domain Architecture Retrieval Tool (CDART). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov.

INTRODUCTION

The National Center for Biotechnology Information (NCBI) at the National Institutes of Health was created in 1988 to develop information systems for molecular biology. In addition to maintaining the GenBank(R) (1) nucleic acid sequence database, to which data is submitted by the scientific community, NCBI provides data retrieval systems and computational resources for the analysis of GenBank data and a variety of other biological data. For the purposes of this overview, the NCBI suite of database resources is grouped into the six categories given below. All resources discussed are available from the NCBI home page at: http://www.ncbi.nlm.nih.gov. In most cases, the data underlying these resources is available for bulk download at ‘ftp.ncbi.nih.gov’.

DATABASE RETRIEVAL TOOLS

Entrez

Entrez (2) is an integrated database retrieval system for DNA and protein sequences derived from several sources (1,3–6), the NCBI taxonomy, genome maps, population sets, gene expression data, protein structures from the Molecular Modeling Database (MMDB) (7), 3D and alignment-based protein domains, and the biomedical literature via PubMed, Online Mendelian Inheritance in Man (OMIM), and online Books. PubMed consists primarily of 12 million references and abstracts in MEDLINE(R), with links to the full text of more than 3000 journals available on the Web. The Books database now contains 12 online scientific textbooks.

Entrez enables text searching of databases, ranging from those for sequences and the scientific literature to those for structure and gene expression, using simple Boolean queries, and provides extensive links to related information. Some links are simple cross-references, for example, from a sequence to the abstract of the paper in which it was reported, from a protein sequence to its corresponding DNA sequence or 3D-structure, or to alignments with other sequences. Other links are based on computed similarities among the sequences or MEDLINE abstracts. These pre-computed ‘neighbors’ allow rapid access for browsing groups of related records. A service called LinkOut expands the range of external links from individual database records to related outside services, including organism-specific genome databases.

The records retrieved by an Entrez search can be displayed in a wide variety of formats and downloaded singly or in large batches. Formatting options vary for records of different types.

For example, display formats for GenBank records include the GenBank Flatfile, FASTA, XML, ASN.1 and others. Graphical display formats are offered for some types of records, including genomic records.
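
The article describes Entrez only as a Web interface. As a hedged illustration of a Boolean query combined with a choice of display format, the sketch below builds request URLs for NCBI's E-utilities, a programmatic interface that this article does not discuss. The base address, the parameter names, the query term and the placeholder accession are all assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: base address of NCBI's E-utilities (not named in the article).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build a search URL; `term` may combine fields with AND, OR, NOT."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype="fasta"):
    """Build a retrieval URL for one or more record IDs in a chosen format."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical Boolean query: human BRCA1 records that are not ESTs.
    query = "BRCA1[Gene Name] AND Homo sapiens[Organism] NOT EST[Keyword]"
    print(esearch_url("nucleotide", query))
    # "ACC0001" is a placeholder, not a real accession number.
    print(efetch_url("nucleotide", ["ACC0001"], rettype="gb"))
```

Keeping URL construction separate from any network call makes the query logic easy to inspect before a search is submitted.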

PubMed Central (PMC)

PMC is a digital archive of peer-reviewed journals in the life sciences. Over 90 journals, including Nucleic Acids Research, now deposit the full text of their articles in PMC, which is available at http://www.pubmedcentral.gov. Participation in PMC requires a commitment to free access to full text, perhaps with some delay after publication. Some journals provide free access to their full text directly in PMC, while others require a link to the journal’s own site, where full text is generally available free within 6 months to a year of publication. All PMC free articles are identified in PubMed search results.

Taxonomy

The NCBI taxonomy database indexes over 119 000 organisms that are represented in the databases with at least one nucleotide or protein sequence. The Taxonomy Browser can be used to view the taxonomic position or retrieve sequence and structural data for a particular organism or group. Searches of the NCBI taxonomy may be made on the basis of whole, partial or phonetically spelled organism names, and links to organisms commonly used in biological research are provided. The Entrez Taxonomy system adds the ability to display custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy.

NCBI is developing the non-bibliographic applications of LinkOut and expanding that project into the taxonomy and sequence domains of Entrez. Many outside resources currently maintain LinkOut links from Entrez entries, including model organism and taxonomic databases. NCBI has developed several tools to help LinkOut providers, including a simple flatfile format for specifying links and a Name/ID status page for tracking the current use of names and IDs in the taxonomy database.

LocusLink

LocusLink, developed at NCBI in conjunction with several international collaborators, offers a single query interface to curated sequences and descriptive information about genes and includes links to NCBI’s Map Viewer, Evidence Viewer (EV), Model Maker (MM), BLAST Link (BLink), protein domains from NCBI’s Conserved Domain Database, and many other gene-related resources. LocusLink is discussed in a separate article in this issue (6).

THE BLAST FAMILY OF SEQUENCE-SIMILARITY SEARCH PROGRAMS

The Basic Local Alignment Search Tool (BLAST) programs (8,9) perform sequence-similarity searches against a variety of sequence databases, beginning with either a query sequence or a GenBank accession number. BLAST returns a set of gapped alignments between the query and similar database sequences, with links to the full database records and to other relevant databases such as UniGene or LocusLink. The sequences of any or all of the database hits appearing in a BLAST alignment may be selected for bulk download. A BLAST variant, BLAST2Sequences (10), compares two DNA or protein sequences using any of the standard BLAST programs and produces a dot-plot representation of the alignments it reports.

Each alignment returned by a BLAST search receives a score and a measure of statistical significance, called the Expectation Value (E-value), for judging its quality. Either an E-value threshold or a range can be specified to limit the alignments returned. BLAST takes into account the amino acid composition of the query sequence in its estimation of statistical significance. This composition-based statistical treatment, used in conventional protein BLAST searches as well as PSI-BLAST (9) searches, tends to reduce the number of false-positive database hits (11).
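
The article does not state how an E-value is computed. A common textbook form is the Karlin-Altschul relation E = K·m·n·exp(−λS), where S is the raw alignment score and m and n are the query and database lengths. The sketch below applies that relation and an E-value range filter; the λ and K values are illustrative assumptions that depend on the scoring system, and the composition-based adjustment described above is deliberately not modeled.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, K):
    """E = K * m * n * exp(-lambda * S): expected number of chance
    alignments with score >= raw_score in a search space of m * n."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so E = m * n * 2**-S'."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def within_range(e_value, low=0.0, high=10.0):
    """Keep an alignment only if its E-value lies in [low, high]."""
    return low <= e_value <= high


if __name__ == "__main__":
    lam, K = 0.267, 0.041  # assumed, illustrative parameters
    for score in (40, 80, 120):
        e = expect_value(score, 250, 1_000_000, lam, K)
        print(score, round(bit_score(score, lam, K), 1), e, within_range(e, high=1e-3))
```

Because E falls exponentially with the score, a modest increase in score moves an alignment from background noise to clear significance.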

The default BLAST output format is the ‘pairwise’ alignment; however, several ‘query-anchored’ multiple sequence alignment formats are available. An alignment option called the ‘Hit Table’ provides a compact, tabular, easily parsable summary of the BLAST results including, for each database hit, the positions of alignment starts and stops, scores and Expectation Values. These outputs may be returned in HTML, XML, text or as ASN.1. In addition, BLAST can generate a taxonomically organized output that shows the distribution of BLAST hits by organism in three formats.
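
To show why a tabular summary is easy to parse, the following sketch reads a tab-delimited hit table and filters it by E-value. The 12-column layout (query, subject, percent identity, alignment length, mismatches, gaps, query start and end, subject start and end, E-value, bit score) is the one commonly produced by BLAST tabular output and is an assumption here, since the article does not list the columns. The two data rows are invented for illustration.

```python
import csv
import io
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bits: float


def parse_hit_table(text):
    """Parse tab-delimited hit lines, skipping blank and '#' comment lines."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hits.append(Hit(row[0], row[1], float(row[2]),
                        int(row[6]), int(row[7]), int(row[8]), int(row[9]),
                        float(row[10]), float(row[11])))
    return hits


# Invented example rows; identifiers and numbers are not real search results.
SAMPLE = (
    "# Fields: query, subject, % identity, length, mismatches, gaps, "
    "q.start, q.end, s.start, s.end, e-value, bit score\n"
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-105\t380\n"
    "query1\tsubjB\t71.20\t180\t50\t2\t10\t189\t5\t182\t0.002\t42.1\n"
)

if __name__ == "__main__":
    for hit in parse_hit_table(SAMPLE):
        if hit.evalue <= 1e-5:
            print(hit.subject, hit.q_start, hit.q_end, hit.evalue)
```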

A particularly powerful feature of the web BLAST interface allows searches to be restricted to a database subset using standard Entrez search strings; the same restrictions may be used to screen the output of an initially unrestricted search. These features provide the means to effectively construct a custom database for searching, or to process the output of a search to include only sequences of interest. Web BLAST uses a standard URL-API that allows complete search specifications, including BLAST parameters such as Entrez restrictions and the search query, to be contained in a URL posted to the web page.
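
As a rough sketch of a complete search specification carried in a URL, the code below assembles a submission URL with an Entrez restriction and an E-value threshold. The article does not give the address or parameter names; the `Blast.cgi` address and the `CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY` and `EXPECT` names follow NCBI's present-day URL API as we understand it and should be treated as assumptions. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumption: present-day address of the web BLAST service.
BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blast_put_url(query, program="blastn", database="nt",
                  entrez_query=None, expect=None):
    """Encode a search request, with optional Entrez restriction, as a URL."""
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": query}
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query  # restrict to a database subset
    if expect is not None:
        params["EXPECT"] = expect              # E-value threshold
    return f"{BLAST_CGI}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query sequence restricted to mouse records.
    print(blast_put_url("ACGTACGTACGTACGTACGT",
                        entrez_query="Mus musculus[Organism]", expect=0.001))
```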

A recent addition to the BLAST family, called MegaBLAST (12), facilitates batch nucleotide queries, which can be pasted into a web page or uploaded from a file. MegaBLAST is designed to search for nearly exact matches and is up to 10 times faster than standard BLAST for such searches. MegaBLAST is provided to search entire eukaryotic genomes, but it is also used to search a rapidly growing database called the Trace Archive, which contains over 125 million sequencing traces. The Trace Archive includes whole genome shotgun (WGS), shotgun, EST, clone end and finishing reads from over 30 organisms, such as Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Zea mays and Caenorhabditis elegans. To facilitate rapid cross-species nucleotide queries of the Trace Archive, NCBI offers a version of MegaBLAST called Discontinuous MegaBLAST that uses a non-contiguous word match (13) as the nucleus for its alignments. Searches using Discontinuous MegaBLAST are far more rapid than cross-species translated searches such as blastx, but maintain a competitive degree of sensitivity when comparing coding regions.
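
The idea of a non-contiguous word match can be illustrated with a spaced seed: a template marks which positions of a word must agree and which are ignored. The sketch below is a toy version, not NCBI's implementation. The template `110110110110`, which skips every third position on the assumption that codon third positions are the least conserved, is a hypothetical choice.

```python
def seed_hits(query, subject, template="110110110110"):
    """Return (query_pos, subject_pos) pairs where the two sequences agree
    at every position marked '1' in the template; '0' positions are ignored."""
    keep = [i for i, c in enumerate(template) if c == "1"]
    span = len(template)
    # Index every subject window by its masked word.
    index = {}
    for j in range(len(subject) - span + 1):
        key = "".join(subject[j + i] for i in keep)
        index.setdefault(key, []).append(j)
    # Look up each masked query word in the index.
    hits = []
    for q in range(len(query) - span + 1):
        key = "".join(query[q + i] for i in keep)
        hits.extend((q, j) for j in index.get(key, []))
    return hits


if __name__ == "__main__":
    # Two toy coding sequences that differ only at third codon positions.
    a = "ATGGCCAAGCTG"
    b = "ATGGCTAAACTC"
    print(seed_hits(a, b))            # spaced seed
    print(seed_hits(a, b, "1" * 12))  # contiguous 12-mer for comparison
```

In this toy case the spaced seed finds the match that a contiguous word of the same span misses, which is the property that makes such seeds useful for cross-species comparison of coding regions.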

The NCBI-generated assembly of the human genome, as well as other submitted genomic assemblies such as those of the mouse and zebrafish, may be searched using specialized genome BLAST pages. These pages search a set of genome-specific databases and generate, where possible, genomic views of the BLAST hits using the Map Viewer.

BLink

BLink displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink allows for the display of subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a Cluster of Orthologous Groups (COG) (14), or by relation to a 3D structure or conserved protein domain. BLink links are displayed for protein records in Entrez as well as within LocusLink reports.

RESOURCES FOR GENE-LEVEL SEQUENCES

UniGene

UniGene (15) is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. UniGene clusters ESTs from 10 animals and 7 plants, bringing the total number of organisms represented to 17. UniGene starts with entries in the appropriate organism division of GenBank, combines these with ESTs of that organism, and creates clusters of sequences that share virtually identical 3′ untranslated regions (3′ UTRs). Each UniGene cluster contains sequences that represent a unique gene and is linked to related information, such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene and its map location. In the human UniGene database, over 3.6 million human ESTs in GenBank have been reduced 35-fold in number to over 104 000 sequence clusters. The UniGene collection has been used as a source of unique sequence for the fabrication of microarrays for the large-scale study of gene expression (16). UniGene databases are updated weekly with new EST sequences and bimonthly with newly characterized sequences. UniGene clusters may be searched in several ways: by gene name, chromosomal location, cDNA library, accession number and ordinary text words. Cluster sequences may also be downloaded by FTP.

ProtEST

ProtEST, a tool analogous to BLink, presents pre-computed BLAST alignments between protein sequences from model organisms and the 6-frame translations of UniGene nucleotide sequences. Protein sequences that are derived from conceptual translations or model transcripts are excluded. The eight model organisms included are H. sapiens, M. musculus, Rattus norvegicus, Drosophila melanogaster, C. elegans, Saccharomyces cerevisiae, Arabidopsis thaliana and Escherichia coli. ProtEST links are displayed in UniGene reports with model organism protein similarities. For each nucleotide sequence match, the ProtEST report shows the UniGene cluster ID, the GenBank accession number and the percent identity between the protein and nucleotide translation in the aligned region. A link is also provided to the sequence trace in the NCBI Trace Archive, if available. ProtEST reports are updated in tandem with UniGene protein similarities.

HomoloGene

HomoloGene is a database of both curated and calculated gene orthologs and homologs for 14 organisms, including H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, Hordeum vulgare, Oryza sativa and Z. mays. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information (ZFIN) database at the University of Oregon, and from published reports. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet ortholog-based, COG (14)-like clusters, which may include up to 14 members, in which the triplet orthologs in two organisms are both orthologous to the same gene in a third organism. For the three organisms human, mouse and rat, there are currently over 7000 of these self-consistent triplets. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink LocusIDs, gene symbols, gene names and nucleotide accession numbers, as well as those terms found in UniGene cluster titles. The current datasets for the calculated orthologs and homologs and the Mutually Orthologous Pairs are also available via FTP.
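
One way to read ‘self-consistent triplet’ is that all three pairwise ortholog calls among human, mouse and rat agree on the same three genes. The sketch below finds such triplets from three sets of pairs under that reading, which is our interpretation rather than a description of HomoloGene's actual procedure. The gene identifiers are placeholders.

```python
def consistent_triplets(human_mouse, mouse_rat, human_rat):
    """Return sorted (human, mouse, rat) triplets in which every pair of
    genes is also listed as an ortholog pair in the corresponding set."""
    rats_of_mouse = {}
    for mouse, rat in mouse_rat:
        rats_of_mouse.setdefault(mouse, set()).add(rat)
    human_rat = set(human_rat)
    return sorted(
        (human, mouse, rat)
        for human, mouse in human_mouse
        for rat in rats_of_mouse.get(mouse, ())
        if (human, rat) in human_rat
    )


if __name__ == "__main__":
    # Placeholder identifiers, not real gene symbols.
    hm = {("hA", "mA"), ("hB", "mB")}
    mr = {("mA", "rA"), ("mB", "rB")}
    hr = {("hA", "rA")}  # the hB-rB pair is missing, so that triplet fails
    print(consistent_triplets(hm, mr, hr))
```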

Reference Sequence (RefSeq)

The RefSeq database (6) provides curated reference sequences for mRNAs, genomic sequences, computationally-derived sequences and proteins for human and over 1700 other organisms.

Open Reading Frame (ORF) Finder

ORF Finder performs a six-frame translation of a nucleotide sequence and returns a graphic that indicates the location of each ORF found. Restrictions on the size of the ORFs returned may be set. The protein translations of the ORFs detected can be submitted directly for BLAST similarity searching or searching against the COGs database (see below).
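
A six-frame translation with a minimum-size restriction can be sketched in a few lines. The code below is a simplified illustration, not ORF Finder itself: it uses the standard genetic code, treats an ORF as the stretch from the first ATG in a frame segment to the next stop codon, and reports the frame and protein rather than a graphic. The function names and the `min_codons` parameter are our own.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to their translations."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def find_orfs(seq, min_codons=3):
    """Return (frame, protein) for each ATG-to-stop ORF of at least min_codons."""
    orfs = []
    for frame, protein in six_frames(seq).items():
        for segment in protein.split("*")[:-1]:  # drop the unterminated tail
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_codons:
                orfs.append((frame, segment[start:]))
    return orfs


if __name__ == "__main__":
    print(six_frames("ATGAAATTTTAA"))
    print(find_orfs("ATGAAATTTTAA"))
```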

Electronic PCR (e-PCR)

e-PCR is a tool for locating Sequence Tagged Sites (STSs) within a nucleotide sequence by searching against a non-redundant database of over 133 000 human and 84 000 non-human STSs called UniSTS.
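
Locating an STS amounts to finding a primer pair in the right orientation and at a plausible distance. The following is a minimal sketch under strong simplifying assumptions: primers must match exactly, only the forward strand of the template is scanned, and the primers and template are invented. The article does not describe e-PCR's matching rules, so none of this should be read as its actual algorithm.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sts(template, forward, reverse, min_size, max_size):
    """Return (start, product_size) wherever `forward` and the reverse
    complement of `reverse` flank a product of an allowed size."""
    template = template.upper()
    forward = forward.upper()
    tail = revcomp(reverse.upper())
    hits = []
    start = template.find(forward)
    while start != -1:
        end = template.find(tail, start + len(forward))
        while end != -1:
            size = end + len(tail) - start
            if min_size <= size <= max_size:
                hits.append((start, size))
            end = template.find(tail, end + 1)
        start = template.find(forward, start + 1)
    return hits


if __name__ == "__main__":
    # Invented template: forward primer, an 8-base spacer, reverse primer site.
    template = "GGG" + "ACGTAC" + "TTTTTTTT" + "GATTCA" + "CCC"
    print(find_sts(template, "ACGTAC", "TGAATC", min_size=15, max_size=25))
```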

A database of Single Nucleotide Polymorphisms (dbSNP)

The dbSNP (17) is a repository for single base nucleotide substitutions and short deletion and insertion polymorphisms. The dbSNP database contains almost 3 million human SNPs, as well as about half a million from organisms including M. musculus, Anopheles gambiae, D. rerio and A. thaliana. The Web interface allows flexible searches by gene name and by cross-reference to other databases such as OMIM or the structure databases. Searches for SNPs lying between two markers and batch downloads are also supported. SNP reports link to structures from the MMDB, allowing the 3-D visualization, using NCBI’s interactive macromolecular viewer Cn3D (18), of amino acid changes implied by SNPs in coding regions.

RESOURCES FOR GENOME-SCALE ANALYSIS

Entrez Genomes

Entrez Genomes (19) provides access to genomic data contributed by the scientific community for over 1000 species whose sequencing and mapping is complete or in progress. Entrez Genomes now includes more than 86 complete microbial genomes and 302 RefSeq records for eukaryotic organelles. Many higher eukaryotic genomes are also included within Entrez Genomes, such as those of H. sapiens, M. musculus, D. melanogaster, Anopheles gambiae, C. elegans and A. thaliana.

In Entrez Genomes, complete genomes can be accessed hierarchically starting from either an alphabetical listing or a phylogenetic tree for each of six principal taxonomic groups. One can follow the hierarchy to a graphical overview for the genome of a single organism, on to the level of a single chromosome and, finally, down to the level of a single gene. At each level are one or more views, pre-computed summaries and links to analyses appropriate at that level. For instance, at the level of a genome or a chromosome, a coding regions view displays the location of each coding region, length of the product, GenBank identification number for the protein sequence and name of the protein product. An RNA genes view lists the location and gene names for ribosomal and transfer RNA genes. At the level of a single gene, links are provided to pre-computed sequence neighbors for the gene product. Any protein gene product that is a member of a COG (19) is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level.

For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3-D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to the Cn3D macromolecular viewer (18), which allows the interactive display of 3-D structures and sequence alignments. The TaxPlot tool graphically compares similarities in the proteomes of two organisms to that of a third, reference organism and is available for both prokaryotic and eukaryotic genomes. Resources for the genomes of higher eukaryotes are discussed below.

COGs

The COGs database (14) presents a compilation of orthologous groups of proteins from completely sequenced organisms representing 44 species and 30 phylogenetically distant clades. The COGs are now also linked to the proteins of two higher eukaryotes, C. elegans and D. melanogaster.

Retroviral genotyping tools

The genotyping of retrovirus sequences is important in the characterization of viral genetic diversity, in the tracking of epidemics and in vaccine development. NCBI offers a Web-based genotyping tool that employs a blastn comparison between a retroviral sequence to be subtyped and a default panel of reference sequences or a panel provided by the user. An HIV-1-specific subtyping tool uses a set of reference sequences taken from the principal HIV-1 variants.

Eukaryotic Genomic Resources

Entrez Genomes links to Genome Resources webpages devoted to the sequencing of a number of eukaryotic organisms, including H. sapiens, M. musculus and D. melanogaster. A page called Plant Genomes Central serves as a collection point for resources related to plant genome projects. Many genome projects have progressed to the point at which it is useful to have an interactive genome viewing tool with which to correlate the data present in various genomic maps. NCBI has developed the Map Viewer for this purpose.

Map Viewer

The NCBI Map Viewer displays genome assemblies using sets of synchronized chromosomal maps. Map Viewer displays are available for the genomes of four vertebrates, including H. sapiens, M. musculus and D. rerio; three invertebrates, including D. melanogaster and C. elegans; seven plants, including A. thaliana and O. sativa; and two fungi, S. cerevisiae and Schizosaccharomyces pombe. The genomic maps displayed by the Map Viewer vary according to the data available for the subject organism. The maps can be selected from a set of cytogenetic maps, such as chromosomal ideograms; sequence-based maps, such as those showing contigs, genes and SNPs; and physical maps, such as the G3 and GB4 human radiation-hybrid maps. Maps showing ab initio gene models, EST alignments with links to UniGene clusters, and mRNA alignments used to construct gene models are also available for some organisms. The rightmost map in a Map Viewer display, called the master map, generates an extended set of map-specific links to related resources. In the case of the Genes map, two of these links are to the EV and MM, described below. In addition to its graphical display, the Map Viewer offers a tabular view of the data that is convenient for export to other programs for further analysis.

Queries against an entire genome or particular chromosomes can be made in the Map Viewer using gene names or symbols, marker names, SNP identifiers, accession numbers and other identifiers. The human version of the Map Viewer is tightly integrated with other NCBI databases such as LocusLink and dbSNP. Segments of a genomic assembly may be downloaded using the Map Viewer’s ‘Download/View Sequence’ link for some genomes, such as H. sapiens, M. musculus and A. thaliana. Supported download formats are GenBank and FASTA.

Model Maker (MM)

MM allows the construction of transcript models using novel combinations of putative exons derived from ab initio predictions or from the alignment of GenBank transcripts, including ESTs and NCBI RefSeqs, to the NCBI human genome assembly. The MM interface consists of a graphical overview of transcript alignments to a genomic contig, with each unique block of alignment collected and numbered as a putative exon. Transcript models are constructed by selecting from this collection. As the transcript is created, the implied protein translation is given in each reading frame, with any internal stop codons indicated. Previously observed exon splice patterns are indicated as guides to model building. Completed models may be saved locally or analyzed with ORF Finder.

Evidence Viewer (EV)

The EV displays the alignments to a genomic contig of RefSeq transcripts, GenBank mRNAs, known or potential transcripts, and ESTs supporting a gene model. The EV produces a graphical summary of the alignments that indicates the coordinate range of the gene model on the genomic contig and the areas of alignment to the transcripts on separate tracks. EST alignment density along the contig is indicated on another track. A mismatch and an insertion/deletion track are also shown to highlight areas of disagreement between transcript sequences and the genomic sequence. Following the graphical summary are exon-by-exon alignments of all of the transcript sequences against the genomic contig, including flanking genomic sequence for each exon to show the presence or absence of splice sites. Any proteins annotated on the transcript sequences are also shown, and mismatches between transcripts and the genomic contig, or between proteins annotated on the aligned transcripts, are highlighted.

The Human–Mouse Homology Maps

The Human–Mouse Homology Maps are tables of genetic loci in homologous segments of DNA from human and mouse. The maps are computed by integrating orthologs curated by the Mouse Genome Database with putative orthologs identified by homology. The maps are linked to GeneMap’99, OMIM, LocusLink, dbSTS, BLAST2Sequences and the Mouse Genome Database at The Jackson Laboratory. Other mouse genome resources can be found on the Mouse Genome Resources page.

The Cancer Chromosome Aberration Project (CCAP)

The CCAP service is an initiative of the National Cancer Institute (NCI) and NCBI. The data includes a compilation by F. Mitelman, F. Mertens and B. Johansson of recurrent neoplasia-associated chromosomal aberrations from the Cancer Chromosome Aberration Bank at the University of Lund, Sweden (20). The Spectral Karyotyping database (SKY), created jointly by NCI and NCBI, enables investigators to share their own SKY and Comparative Genomic Hybridization (CGH) data on chromosomal aberrations (http://www.ncbi.nlm.nih.gov/sky/skyweb.cgi).

RESOURCES FOR THE ANALYSIS OF PATTERNS OF GENE EXPRESSION AND PHENOTYPES

SAGEmap

Serial Analysis of Gene Expression (SAGE) is a technique for taking a snapshot of the messenger RNA population of a cell to obtain a quantitative measure of gene expression. NCBI’s SAGEmap (21) service implements many functions useful in the analysis of SAGE data, such as a two-way mapping between SAGE tags and UniGene clusters. SAGEmap can also construct a user-configurable table of data comparing one group of SAGE libraries with another. Groups may be chosen for inclusion in the table on the basis of several expression criteria. SAGEmap is updated weekly, immediately following the update of UniGene, and the data is reflected in the human genome Map Viewer as the SAGE track.
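
The two SAGEmap functions mentioned above, a two-way tag-to-cluster mapping and a table comparing two groups of libraries, can be pictured with simple dictionaries. The sketch below is an illustration of the data structures only. The tags, cluster labels and counts are invented, and pooling raw counts per group is an assumed simplification, since the article does not say how SAGEmap builds its tables.

```python
from collections import Counter, defaultdict


def two_way_map(pairs):
    """Build tag -> clusters and cluster -> tags lookups from (tag, cluster) pairs."""
    tag_to_clusters = defaultdict(set)
    cluster_to_tags = defaultdict(set)
    for tag, cluster in pairs:
        tag_to_clusters[tag].add(cluster)
        cluster_to_tags[cluster].add(tag)
    return tag_to_clusters, cluster_to_tags


def pool(libraries):
    """Sum tag counts over a group of libraries (each a tag -> count mapping)."""
    total = Counter()
    for library in libraries:
        total.update(library)
    return total


def compare_groups(group_a, group_b):
    """Return tag -> (pooled count in A, pooled count in B) for every tag seen."""
    a, b = pool(group_a), pool(group_b)
    return {tag: (a[tag], b[tag]) for tag in sorted(set(a) | set(b))}


if __name__ == "__main__":
    # Invented tags and cluster labels.
    pairs = [("AAAAAAAAAA", "cluster_1"), ("CCCCCCCCCC", "cluster_1"),
             ("AAAAAAAAAA", "cluster_2")]
    tags, clusters = two_way_map(pairs)
    print(dict(tags), dict(clusters))
    print(compare_groups([{"AAAAAAAAAA": 3}, {"AAAAAAAAAA": 2}],
                         [{"CCCCCCCCCC": 4}]))
```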

Gene Expression Omnibus (GEO)

The GEO (22) is a data repository and retrieval system for gene expression data derived from any organism or artificial source. Gene expression data derived from spotted microarray, high-density oligonucleotide array, hybridization filter and SAGE experiments are available for download and accepted for deposit. At the time of writing, the repository contains high-throughput gene expression data on over 2300 samples.

OMIM

NCBI provides the online version of the OMIM catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at The Johns Hopkins University (23). The database contains information on disease phenotypes and genes, including extensive descriptions, gene names, inheritance patterns, map locations and gene polymorphisms. OMIM currently contains 13 864 entries, including data on 10 290 established gene loci and 1019 phenotypic descriptions, and is now searchable using the powerful Entrez interface.

THE MOLECULAR MODELING DATABASE (MMDB), THE CONSERVED DOMAIN DATABASE SEARCH AND CDART

The NCBI MMDB, built by processing entries from the Protein Data Bank (5), is described in (7). The structures in the MMDB are linked to sequences in Entrez and to the Conserved Domain Database (CDD). The CDD contains PSI-BLAST-derived Position Specific Score Matrices representing domains taken principally from two public protein domain collections, the Simple Modular Architecture Research Tool (SMART) (24) and Pfam (25), but also draws from domains defined by NCBI researchers. NCBI’s Conserved Domain Search (CD-Search) service can be used to search a protein sequence for conserved domains in the CDD. Wherever possible, CDD hits are linked to structures which, coupled with a multiple sequence alignment of representatives of the domain hit, can be viewed with NCBI’s 3-D molecular structure viewer, Cn3D (18). The Conserved Domain Architecture Retrieval Tool (CDART) allows searches of protein databases on the basis of a conserved domain and returns the domain architectures of database proteins containing the query domain. Alignment-based protein domain information from the CDD and 3-D domains from the MMDB are searchable via the Entrez interface.

FOR FURTHER INFORMATION

Most of the resources described here include documentation, other explanatory material and references to collaborators and data sources on the respective web sites. Several tutorials are also offered under the Education link from NCBI’s home page. A site map provides a comprehensive table of NCBI resources, and the ‘About NCBI’ feature provides bioinformatics primers and other supplementary information. A user support staff is available to answer questions at info@ncbi.nlm.nih.gov.

REFERENCES

1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.

2. Schuler,G.D., Epstein,J.A., Ohkawa,H. and Kans,J.A. (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol., 266, 141–162.

3. Barker,W.C., Garavelli,J.S., Huang,H., McGarvey,P.B., Orcutt,B.C., Srinivasarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J., Pfeiffer,F., Mewes,H.W., Tsugita,A. and Wu,C. (2000) The Protein Information Resource (PIR). Nucleic Acids Res., 28, 41–44.

4. Kriventseva,E.V., Fleischmann,W., Zdobnov,E.M. and Apweiler,R. (2001) CluSTr: a database of Clusters of SWISS-PROT and TrEMBL proteins. Nucleic Acids Res., 29, 33–36.

5. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

6. Pruitt,K., Tatusova,T. and Maglott,D. (2003) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 31, 34–37.

7. Marchler-Bauer,A., Anderson,J., Fedorova,N., DeWeese-Scott,C., Geer,L.Y., Hurwitz,D., Jackson,J.J., Jacobs,A., Lanczycki,C., Liebert,C., Madej,T., Marchler,G.H., Mazumder,R., Nikolskaya,A., Panchenko,A.R., Shoemaker,B.A., Song,J., Sridhar,R.B., Thiessen,P.A., Vasudevan,S., Wang,Y., Yamashita,R., Yin,J. and Bryant,S.H. (2003) MMDB: Entrez’s 3D-structure database. Nucleic Acids Res., 31, 474–477.

8. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

10. Tatusova,T.A. and Madden,T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett., 174, 247–250.

11. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.

12. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

13. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

14. Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.

15. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.

16. Ermolaeva,O., Rastogi,M., Pruitt,K.D., Schuler,G.D., Bittner,M.L., Chen,Y., Simon,R., Meltzer,P., Trent,J.M. and Boguski,M.S. (1998) Data management and analysis for gene expression arrays. Nature Genet., 20, 19–23.

17. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Pham,L., Smigielski,E. and Sirotkin,K. (2001) dbSNP: The NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

18. Wang,Y., Geer,L.Y., Chappey,C., Kans,J.A. and Bryant,S.H. (2000) Cn3D: sequence and structure views for Entrez. Trends Biochem. Sci., 25, 300–302.

19. Tatusova,T., Karsch-Mizrachi,I. and Ostell,J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.

20. Mitelman,F., Mertens,F. and Johansson,B. (1997) A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nature Genet., 15, 417–474.

21. Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.

22. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.

23. McKusick,V.A. (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. The Johns Hopkins University Press, Baltimore, MD.

24. Letunic,I., Goodstadt,L., Dickens,N.J., Doerks,T., Schultz,J., Mott,R., Ciccarelli,F., Copley,R.R., Ponting,C.P. and Bork,P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.

25. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

Nucleic Acids Research, 2003, Vol. 31, No. 1


For example, display formats for GenBank records include the GenBank Flatfile, FASTA, XML, ASN.1 and others. Graphical display formats are offered for some types of records, including genomic records.

PubMed Central (PMC)

PMC is a digital archive of peer-reviewed journals in the life sciences. Over 90 journals, including Nucleic Acids Research, now deposit the full text of their articles in PMC, which is available at http://www.pubmedcentral.gov. Participation in PMC requires a commitment to free access to full text, perhaps with some delay after publication. Some journals provide free access to their full text directly in PMC, while others require a link to the journal’s own site, where full text is generally available free within 6 months to a year of publication. All PMC free articles are identified in PubMed search results.

Taxonomy

The NCBI taxonomy database indexes over 119 000 organisms that are represented in the databases with at least one nucleotide or protein sequence. The Taxonomy Browser can be used to view the taxonomic position or retrieve sequence and structural data for a particular organism or group. Searches of the NCBI taxonomy may be made on the basis of whole, partial or phonetically spelled organism names, and links to organisms commonly used in biological research are provided. The Entrez Taxonomy system adds the ability to display custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy.

NCBI is developing the non-bibliographic applications of LinkOut and expanding that project into the taxonomy and sequence domains of Entrez. Many outside resources currently maintain LinkOut links from Entrez entries, including model organism and taxonomic databases. NCBI has developed several tools to help LinkOut providers, including a simple flatfile format for specifying links and a NameID status page for tracking the current use of names and IDs in the taxonomy database.

LocusLink

LocusLink, developed at NCBI in conjunction with several international collaborators, offers a single query interface to curated sequences and descriptive information about genes, and includes links to NCBI’s Map Viewer, Evidence Viewer (EV), Model Maker (MM), BLAST Link (BLink), protein domains from NCBI’s Conserved Domain Database and many other gene-related resources. LocusLink is discussed in a separate article in this issue (6).

THE BLAST FAMILY OF SEQUENCE-SIMILARITY SEARCH PROGRAMS

The Basic Local Alignment Search Tool (BLAST) programs (8,9) perform sequence-similarity searches against a variety of sequence databases, beginning with either a query sequence or a GenBank accession number. BLAST returns a set of gapped alignments between the query and similar database sequences, with links to the full database records and to other relevant databases such as UniGene or LocusLink. The sequences of any or all of the database hits appearing in a BLAST alignment may be selected for bulk download. A BLAST variant, BLAST2Sequences (10), compares two DNA or protein sequences using any of the standard BLAST programs and produces a dot-plot representation of the alignments it reports.

Each alignment returned by a BLAST search receives a score and a measure of statistical significance, called the Expectation Value (E-value), for judging its quality. Either an E-value threshold or a range can be specified to limit the alignments returned. BLAST takes into account the amino acid composition of the query sequence in its estimation of statistical significance. This composition-based statistical treatment, used in conventional protein BLAST searches as well as PSI-BLAST (9) searches, tends to reduce the number of false-positive database hits (11).
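
Limiting results by an E-value threshold or range can be sketched as a simple post-filter. The `Hit` record, its field names and the sample values below are hypothetical and do not reflect BLAST's own data structures.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    subject_id: str  # hypothetical identifier of the database sequence
    score: float     # alignment score
    evalue: float    # Expectation Value; smaller means more significant


def filter_by_evalue(hits, max_evalue, min_evalue: Optional[float] = None):
    """Keep hits with E-value <= max_evalue and, if given, >= min_evalue."""
    lower = 0.0 if min_evalue is None else min_evalue
    return [h for h in hits if lower <= h.evalue <= max_evalue]


# Made-up hits, for illustration only.
hits = [
    Hit("seqA", 210.0, 1e-50),
    Hit("seqB", 95.0, 2e-8),
    Hit("seqC", 31.0, 4.2),
]
significant = filter_by_evalue(hits, max_evalue=1e-5)                  # threshold
borderline = filter_by_evalue(hits, max_evalue=10.0, min_evalue=1e-10)  # range
```

A range, as opposed to a single threshold, is useful when the strongest matches are already known and only the weaker, borderline alignments are of interest.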

The default BLAST output format is the ‘pairwise’ alignment; however, several ‘query-anchored’ multiple sequence alignment formats are available. An alignment option called the ‘Hit Table’ provides a compact, tabular, easily parsable summary of the BLAST results, including, for each database hit, the positions of alignment starts and stops, scores and Expectation Values. These outputs may be returned in HTML, XML, text or as ASN.1. In addition, BLAST can generate a taxonomically organized output that shows the distribution of BLAST hits by organism in three formats.
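
To show why a tabular summary is easy to parse, here is a minimal sketch. The column order, the tab delimiter and the `#` comment convention are assumptions made for this example; the actual Hit Table layout should be taken from the BLAST documentation.

```python
# Assumed layout: one hit per line, tab-separated, '#' lines are comments.
SAMPLE = """# query\tsubject\tq_start\tq_end\ts_start\ts_end\tevalue\tscore
q1\tsubjA\t1\t120\t35\t154\t3e-40\t160
q1\tsubjB\t10\t90\t200\t280\t0.002\t42
"""


def parse_hit_table(text):
    """Parse the assumed tabular layout into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        query, subject, q_start, q_end, s_start, s_end, evalue, score = line.split("\t")
        rows.append({
            "query": query,
            "subject": subject,
            "q_start": int(q_start),
            "q_end": int(q_end),
            "s_start": int(s_start),
            "s_end": int(s_end),
            "evalue": float(evalue),
            "score": float(score),
        })
    return rows


rows = parse_hit_table(SAMPLE)
```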

A particularly powerful feature of the web BLAST interface allows searches to be restricted to a database subset using standard Entrez search strings; the same restrictions may be used to screen the output of an initially unrestricted search. These features provide the means to effectively construct a custom database for searching, or to process the output of a search to include only sequences of interest. Web BLAST uses a standard URL-API that allows complete search specifications, including BLAST parameters such as Entrez restrictions and the search query, to be contained in a URL posted to the web page.
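
A complete search specification carried in a URL might be assembled as in the sketch below. The endpoint and the parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`) are assumptions for illustration and should be checked against NCBI's URL-API documentation; the query value is a placeholder, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"  # assumed endpoint


def build_search_url(program, database, query, entrez_query=None):
    """Encode a search specification, including an optional Entrez restriction."""
    params = {
        "CMD": "Put",          # assumed name of the "submit a search" command
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        # Restrict the search to the database subset matching an Entrez string.
        params["ENTREZ_QUERY"] = entrez_query
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("blastp", "nr", "PLACEHOLDER_ACCESSION",
                       entrez_query="Mus musculus[Organism]")
decoded = parse_qs(urlsplit(url).query)
```

Because the Entrez restriction travels in the same URL as the query and the program parameters, the whole search can be reproduced from a single string.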

A recent addition to the BLAST family, called MegaBLAST (12), facilitates batch nucleotide queries, which can be pasted into a web page or uploaded from a file. MegaBLAST is designed to search for nearly exact matches and is up to 10 times faster than standard BLAST for such searches. MegaBLAST is provided to search entire eukaryotic genomes, but it is also used to search a rapidly growing database called the Trace Archive, which contains over 125 million sequencing traces. The Trace Archive includes whole genome shotgun (WGS), shotgun, EST, clone end and finishing reads from over 30 organisms, such as Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Zea mays and Caenorhabditis elegans. To facilitate rapid cross-species nucleotide queries of the Trace Archive, NCBI offers a version of MegaBLAST, called Discontinuous MegaBLAST, that uses a non-contiguous word match (13) as the nucleus for its alignments. Searches using Discontinuous MegaBLAST are far more rapid than cross-species translated searches such as blastx, but maintain a competitive degree of sensitivity when comparing coding regions.
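
The idea of a non-contiguous word match can be illustrated with a small spaced-seed sketch. The template `110110110` (ignore every third base, roughly the codon wobble position) is chosen here for illustration only and is not claimed to be the template Discontinuous MegaBLAST uses; the sequences are made up.

```python
def spaced_seed_hits(query, subject, template="110110110"):
    """Return (query_pos, subject_pos) pairs where the two sequences agree
    at every position marked '1' in the template; '0' positions are ignored."""
    care = [i for i, c in enumerate(template) if c == "1"]
    span = len(template)

    # Index every window of the subject by its bases at the '1' positions.
    index = {}
    for s_pos in range(len(subject) - span + 1):
        key = "".join(subject[s_pos + i] for i in care)
        index.setdefault(key, []).append(s_pos)

    # Look up every window of the query in that index.
    hits = []
    for q_pos in range(len(query) - span + 1):
        key = "".join(query[q_pos + i] for i in care)
        for s_pos in index.get(key, []):
            hits.append((q_pos, s_pos))
    return hits


# Two made-up coding fragments that differ only at third codon positions.
query = "ATGGCTAAA"
subject = "ATGGCCAAG"
discontiguous = spaced_seed_hits(query, subject, "110110110")
contiguous = spaced_seed_hits(query, subject, "111111111")
```

In this example the contiguous 9-base word finds nothing, while the spaced template still seeds a match, which is the intuition behind retaining sensitivity across species in coding regions.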

The NCBI-generated assembly of the human genome, as well as other submitted genomic assemblies such as those of the mouse and zebrafish, may be searched using specialized genome BLAST pages. These pages search a set of genome-specific databases and generate, where possible, genomic views of the BLAST hits using the Map Viewer.

BLink

BLink displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink allows for the display of subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a Cluster of Orthologous Groups (COG) (14) or by relation to a 3D structure or conserved protein domain. BLink links are displayed for protein records in Entrez as well as within LocusLink reports.

RESOURCES FOR GENE-LEVEL SEQUENCES

UniGene

UniGene (15) is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. UniGene clusters ESTs from 10 animals and 7 plants, bringing the total number of organisms represented to 17. UniGene starts with entries in the appropriate organism division of GenBank, combines these with ESTs of that organism and creates clusters of sequences that share virtually identical 3′ untranslated regions (3′ UTRs). Each UniGene cluster contains sequences that represent a unique gene and is linked to related information such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene and its map location. In the human UniGene database, over 3.6 million human ESTs in GenBank have been reduced 35-fold in number to over 104 000 sequence clusters. The UniGene collection has been used as a source of unique sequence for the fabrication of microarrays for the large-scale study of gene expression (16). UniGene databases are updated weekly with new EST sequences and bimonthly with newly characterized sequences. UniGene clusters may be searched in several ways: by gene name, chromosomal location, cDNA library, accession number and ordinary text words. Cluster sequences may also be downloaded by FTP.
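
Partitioning sequences into non-redundant clusters can be sketched as finding connected components. The sketch assumes that pairs of sequences judged to share a 3′ UTR have already been identified; how UniGene makes that judgment is not described here, and the identifiers are hypothetical.

```python
def cluster(ids, links):
    """Group sequence IDs into clusters given pairwise links (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical sequences and links, for illustration only.
ids = ["mRNA1", "EST1", "EST2", "EST3", "mRNA2"]
links = [("mRNA1", "EST1"), ("EST1", "EST2"), ("mRNA2", "EST3")]
clusters = cluster(ids, links)
```

Linking is transitive in this sketch: EST2 joins the cluster of mRNA1 through EST1 even though the two were never compared directly.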

ProtEST

ProtEST, a tool analogous to BLink, presents pre-computed BLAST alignments between protein sequences from model organisms and the 6-frame translations of UniGene nucleotide sequences. Protein sequences that are derived from conceptual translations or model transcripts are excluded. The eight model organisms included are H. sapiens, M. musculus, Rattus norvegicus, Drosophila melanogaster, C. elegans, Saccharomyces cerevisiae, Arabidopsis thaliana and Escherichia coli. ProtEST links are displayed in UniGene reports with model organism protein similarities. For each nucleotide sequence match, the ProtEST report shows the UniGene cluster ID, the GenBank accession number and the percent identity between the protein and nucleotide translation in the aligned region. A link is also provided to the sequence trace in the NCBI Trace Archive, if available. ProtEST reports are updated in tandem with UniGene protein similarities.

HomoloGene

HomoloGene is a database of both curated and calculated gene orthologs and homologs for 14 organisms, including H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, Hordeum vulgare, Oryza sativa and Z. mays. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information (ZFIN) database at the University of Oregon, and from published reports. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet, ortholog-based, COG (14)-like clusters, which may include up to 14 members, in which the triplet orthologs in two organisms are both orthologous to the same gene in a third organism. For the three organisms human, mouse and rat, there are currently over 7000 of these self-consistent triplets. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink LocusIDs, gene symbols, gene names and nucleotide accession numbers, as well as those terms found in UniGene cluster titles. The current datasets for the calculated orthologs and homologs and the Mutually Orthologous Pairs are also available via FTP.
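
One reading of a self-consistent triplet is that all three pairwise ortholog relations hold among the three genes. The sketch below follows that reading, which is an interpretation of the description above; the gene labels and pair lists are placeholders.

```python
def consistent_triplets(human_mouse, human_rat, mouse_rat):
    """Return (human, mouse, rat) triplets in which every pair is an ortholog pair."""
    rat_by_human = {}
    for h, r in human_rat:
        rat_by_human.setdefault(h, set()).add(r)
    mouse_rat_pairs = set(mouse_rat)

    triplets = []
    for h, m in human_mouse:
        for r in rat_by_human.get(h, ()):
            # The triplet closes only if mouse and rat are also paired.
            if (m, r) in mouse_rat_pairs:
                triplets.append((h, m, r))
    return sorted(triplets)


# Placeholder ortholog pairs, for illustration only.
human_mouse = [("h1", "m1"), ("h2", "m2")]
human_rat = [("h1", "r1"), ("h2", "r2")]
mouse_rat = [("m1", "r1")]  # (m2, r2) is missing, so h2 forms no triplet
triplets = consistent_triplets(human_mouse, human_rat, mouse_rat)
```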

Reference Sequence (RefSeq)

The RefSeq database (6) provides curated reference sequences for mRNAs, genomic sequences, computationally-derived sequences and proteins for human and over 1700 other organisms.

Open Reading Frame (ORF) Finder

ORF Finder performs a six-frame translation of a nucleotide sequence and returns a graphic that indicates the location of each ORF found. Restrictions on the size of the ORFs returned may be set. The protein translations of the ORFs detected can be submitted directly for BLAST similarity searching or searching against the COGs database (see below).
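
A six-frame ORF scan with a size restriction can be sketched as follows. The sketch assumes that an ORF runs from an ATG to the next in-frame stop codon (TAA, TAG or TGA); the criteria actually applied by ORF Finder are not stated in this article, and the function names and test sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six frames and return (strand, frame, start, end) tuples.

    Coordinates are 0-based on the scanned strand; end is exclusive and
    includes the stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens an ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # resume scanning after the stop
    return orfs


orfs = find_orfs("CCATGAAATTTTAGCC")
```

Raising `min_codons` corresponds to the size restriction mentioned above: short ORFs that arise by chance are dropped from the result.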

Electronic PCR (e-PCR)

e-PCR is a tool for locating Sequence Tagged Sites (STSs) within a nucleotide sequence by searching against a non-redundant database of over 133 000 human and 84 000 non-human STSs, called UniSTS.
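
Locating an STS in a sequence amounts to finding a primer pair in the correct orientation at roughly the expected spacing. The sketch below uses exact string matching only, which is a simplification and not a description of how e-PCR itself tolerates mismatches; the primers, sizes and names are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_sts(seq, forward, reverse, expected_size, tolerance=50):
    """Return (start, end) spans where the forward primer and the reverse
    complement of the reverse primer bracket a product of about expected_size."""
    seq = seq.upper()
    forward = forward.upper()
    rc_reverse = reverse_complement(reverse.upper())
    hits = []
    start = seq.find(forward)
    while start != -1:
        end = seq.find(rc_reverse, start + len(forward))
        while end != -1:
            size = end + len(rc_reverse) - start
            if abs(size - expected_size) <= tolerance:
                hits.append((start, end + len(rc_reverse)))
            end = seq.find(rc_reverse, end + 1)
        start = seq.find(forward, start + 1)
    return hits


# Made-up template: forward primer, 10-base spacer, reverse primer site.
template = "GG" + "ACGTAC" + "T" * 10 + "CCTTGA" + "AA"
sts_hits = find_sts(template, forward="ACGTAC", reverse="TCAAGG",
                    expected_size=22, tolerance=2)
```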

A database of Single Nucleotide Polymorphisms (dbSNP)

The dbSNP (17) is a repository for single base nucleotide substitutions and short deletion and insertion polymorphisms. The dbSNP database contains almost 3 million human SNPs, as well as about half a million from other organisms, including M. musculus, Anopheles gambiae, D. rerio and A. thaliana. The Web interface allows flexible searches by gene name and by cross-reference to other databases such as OMIM or the structure databases. Searches for SNPs lying between two markers and batch downloads are also supported. SNP reports link to structures from the MMDB, allowing the 3-D visualization, using NCBI’s interactive macromolecular viewer Cn3D (18), of amino acid changes implied by SNPs in coding regions.
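
A search for SNPs lying between two markers reduces to a range query over map positions. The sketch assumes the SNPs and both markers share one coordinate system and that the SNP list is sorted by position; the identifiers and positions are invented.

```python
from bisect import bisect_left, bisect_right


def snps_between(snps, marker_a_pos, marker_b_pos):
    """Return IDs of SNPs located between two marker positions (inclusive).

    snps: list of (position, snp_id) tuples sorted by position.
    """
    low, high = sorted((marker_a_pos, marker_b_pos))  # markers in either order
    positions = [pos for pos, _ in snps]
    first = bisect_left(positions, low)
    last = bisect_right(positions, high)
    return [snp_id for _, snp_id in snps[first:last]]


# Invented positions on a single chromosome, for illustration only.
snps = [(100, "snp1"), (250, "snp2"), (400, "snp3"), (900, "snp4")]
inside = snps_between(snps, 500, 200)
```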

RESOURCES FOR GENOME-SCALE ANALYSIS

Entrez Genomes

Entrez Genomes (19) provides access to genomic data contributed by the scientific community for over 1000 species whose sequencing and mapping is complete or in progress. Entrez Genomes now includes more than 86 complete microbial genomes and 302 RefSeqs for eukaryotic organelles. Many higher eukaryotic genomes are also included within Entrez Genomes, such as those of H. sapiens, M. musculus, D. melanogaster, Anopheles gambiae, C. elegans and A. thaliana.

In Entrez Genomes, complete genomes can be accessed hierarchically starting from either an alphabetical listing or a phylogenetic tree for each of six principal taxonomic groups. One can follow the hierarchy to a graphical overview for the genome of a single organism, on to the level of a single chromosome and, finally, down to the level of a single gene. At each level are one or more views, pre-computed summaries and links to analyses appropriate at that level. For instance, at the level of a genome or a chromosome, a coding regions view displays the location of each coding region, length of the product, GenBank identification number for the protein sequence and name of the protein product. An RNA genes view lists the location and gene names for ribosomal and transfer RNA genes. At the level of a single gene, links are provided to pre-computed sequence neighbors for the gene product. Any protein gene product that is a member of a COG (14) is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level.

For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3-D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to the Cn3D macromolecular viewer (18), which allows the interactive display of 3-D structures and sequence alignments. The TaxPlot tool graphically compares similarities in the proteomes of two organisms to that of a third, reference organism and is available for both prokaryotic and eukaryotic genomes. Resources for the genomes of higher eukaryotes are discussed below.

COGs

The COGs database (14) presents a compilation of orthologous groups of proteins from completely sequenced organisms representing 44 species and 30 phylogenetically distant clades. The COGs are now also linked to the proteins of two higher eukaryotes, C. elegans and D. melanogaster.

Retroviral genotyping tools

The genotyping of retrovirus sequences is important in the characterization of viral genetic diversity, in the tracking of epidemics and in vaccine development. NCBI offers a Web-based genotyping tool that employs a blastn comparison between a retroviral sequence to be subtyped and a default panel of reference sequences, or a panel provided by the user. An HIV-1-specific subtyping tool uses a set of reference sequences taken from the principal HIV-1 variants.

Eukaryotic Genomic Resources

Entrez Genomes links to Genome Resources web pages devoted to the sequencing of a number of eukaryotic organisms, including H. sapiens, M. musculus and D. melanogaster. A page called Plant Genomes Central serves as a collection point for resources related to plant genome projects. Many genome projects have progressed to the point at which it is useful to have an interactive genome viewing tool with which to correlate the data present in various genomic maps. NCBI has developed the Map Viewer for this purpose.

Map Viewer

The NCBI Map Viewer displays genome assemblies using sets of synchronized chromosomal maps. Map Viewer displays are available for the genomes of four vertebrates, including H. sapiens, M. musculus and D. rerio; three invertebrates, including D. melanogaster and C. elegans; seven plants, including A. thaliana and O. sativa; and two fungi, S. cerevisiae and Schizosaccharomyces pombe. The genomic maps displayed by the Map Viewer vary according to the data available for the subject organism. The maps can be selected from a set of cytogenetic maps, such as chromosomal ideograms; sequence-based maps, such as those showing contigs, genes and SNPs; and physical maps, such as the G3 and GB4 human radiation-hybrid maps. Maps showing ab initio gene models, EST alignments with links to UniGene clusters, and mRNA alignments used to construct gene models are also available for some organisms. The rightmost map in a Map Viewer display, called the master map, generates an extended set of map-specific links to related resources. In the case of the Genes map, two of these links are to the EV and MM, described below. In addition to its graphical display, the Map Viewer offers a tabular view of the data that is convenient for export to other programs for further analysis.

Queries against an entire genome or particular chromosomes can be made in the Map Viewer using gene names or symbols, marker names, SNP identifiers, accession numbers and other identifiers. The human version of the Map Viewer is tightly integrated with other NCBI databases such as LocusLink and dbSNP. Segments of a genomic assembly may be downloaded using the Map Viewer’s ‘Download/View Sequence’ link for some genomes, such as H. sapiens, M. musculus and A. thaliana. Supported download formats are GenBank and FASTA.

Model Maker (MM)

MM allows the construction of transcript models using novel combinations of putative exons derived from ab initio predictions or from the alignment of GenBank transcripts, including ESTs and NCBI RefSeqs, to the NCBI human genome assembly. The MM interface consists of a graphical overview of transcript alignments to a genomic contig, with each unique block of alignment collected and numbered as a putative exon. Transcript models are constructed by selecting from this collection. As the transcript is created, the implied protein translation is given in each reading frame, with any internal stop codons indicated. Previously observed exon splice patterns are indicated as guides to model building. Completed models may be saved locally or analyzed with ORF Finder.

Evidence Viewer (EV)

The EV displays the alignments to a genomic contig of RefSeq transcripts, GenBank mRNAs, known or potential transcripts and ESTs supporting a gene model. The EV produces a graphical summary of the alignments that indicates the coordinate range of the gene model on the genomic contig and the areas of alignment to the transcripts on separate tracks. EST alignment density along the contig is indicated on another track. A mismatch and an insertion/deletion track are also shown to highlight areas of disagreement between transcript sequences and the genomic sequence. Following the graphical summary are exon-by-exon alignments of all of the transcript sequences against the genomic contig, including flanking genomic sequence for each exon to show the presence or absence of splice sites. Any proteins annotated on the transcript sequences are also shown, and mismatches between transcripts and the genomic contig, or between proteins annotated on the aligned transcripts, are highlighted.

The Human–Mouse Homology Maps

The Human–Mouse Homology Maps are tables of genetic loci in homologous segments of DNA from human and mouse. The maps are computed by integrating orthologs curated by the Mouse Genome Database with putative orthologs identified by homology. The maps are linked to GeneMap’99, OMIM, LocusLink, dbSTS, BLAST2Sequences and the Mouse Genome Database at The Jackson Laboratory. Other mouse genome resources can be found on the Mouse Genome Resources page.

The Cancer Chromosome Aberration Project (CCAP)

The CCAP service is an initiative of the National Cancer Institute (NCI) and NCBI. The data includes a compilation by F. Mitelman, F. Mertens and B. Johansson of recurrent neoplasia-associated chromosomal aberrations from the Cancer Chromosome Aberration Bank at the University of Lund, Sweden (20). The Spectral Karyotyping database, SKY, created jointly by NCI and NCBI, enables investigators to share their own SKY and Comparative Genomic Hybridization (CGH) data on chromosomal aberrations (http://www.ncbi.nlm.nih.gov/sky/skyweb.cgi).

RESOURCES FOR THE ANALYSIS OF PATTERNS OF GENE EXPRESSION AND PHENOTYPES

SAGEmap

Serial Analysis of Gene Expression (SAGE) is a technique for taking a snapshot of the messenger RNA population of a cell to obtain a quantitative measure of gene expression. NCBI’s SAGEmap (21) service implements many functions useful in the analysis of SAGE data, such as a two-way mapping between SAGE tags and UniGene clusters. SAGEmap can also construct a user-configurable table of data comparing one group of SAGE libraries with another. Groups may be chosen for inclusion in the table on the basis of several expression criteria. SAGEmap is updated weekly, immediately following the update of UniGene, and the data is reflected in the human genome Map Viewer as the SAGE track.
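
A two-way mapping between tags and clusters can be held as a pair of dictionaries built from the same list of associations. The tag strings and cluster labels below are placeholders, and the structure is an illustration and not SAGEmap's internal representation.

```python
from collections import defaultdict


def build_two_way_map(pairs):
    """Build tag -> clusters and cluster -> tags lookups from (tag, cluster) pairs."""
    tag_to_clusters = defaultdict(set)
    cluster_to_tags = defaultdict(set)
    for tag, cluster_id in pairs:
        tag_to_clusters[tag].add(cluster_id)
        cluster_to_tags[cluster_id].add(tag)
    return tag_to_clusters, cluster_to_tags


# Placeholder associations, for illustration only.
pairs = [
    ("AAAAACCCCC", "cluster_A"),
    ("AAAAACCCCC", "cluster_B"),  # one tag may map to several clusters
    ("GGGGGTTTTT", "cluster_A"),
]
tag_to_clusters, cluster_to_tags = build_two_way_map(pairs)
```

Keeping both directions makes the ambiguity explicit: a tag that maps to more than one cluster, or a cluster reached by several tags, is visible from either side.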

Gene Expression Omnibus (GEO)

The GEO (22) is a data repository and retrieval system for geneexpression data derived from any organism or artificial sourceGene expression data derived from spotted microarray high-density oligonucleotide array hybridization filter and SAGEdata are available for download and accepted for deposit Atthe time of writing the repository contains high-throughputgene expression data on over 2300 samples

OMIM

NCBI provides the online version of the OMIM catalog ofhuman genes and genetic disorders authored and edited byVictor A McKusick at The Johns Hopkins University (23)The database contains information on disease phenotypes andgenes including extensive descriptions gene names inheri-tance patterns map locations and gene polymorphisms OMIMcurrently contains 13 864 entries including data on 10 290established gene loci and 1019 phenotypic descriptions and isnow searchable using the powerful Entrez interface

THE MOLECULAR MODELING DATABASE(MMDB) THE CONSERVED DOMAIN DATABASESEARCH AND CDART

The NCBI MMDB built by processing entries from the ProteinData Bank (5) is described in (7) The structures in the MMDBare linked to sequences in Entrez and to the Conserved DomainDatabase (CDD) The CDD contains PSI-BLAST-derivedPosition Specific Score Matrices representing domains takenprincipally from two public protein domain collections theSimple Modular Architecture Research Tool (SMART) (24)and Pfam (25) but also draws from domains defined by NCBIresearchers NCBIrsquos Conserved Domain Search (CD-Search)service can be used to search a protein sequence for conserveddomains in the CDD Wherever possible CDD hits are linkedto structures which coupled with a multiple sequence align-ment of representatives of the domain hit can be viewed withNCBIrsquos 3-D molecular structure viewer Cn3D (18) TheConserved Domain Architecture Retrieval Tool (CDART)allows searches of protein databases on the basis of a conserveddomain and returns the domain architectures of databaseproteins conatining the query domain Alignment-based proteindomain information from the CDD and 3-D domains from theMMDB are searchable via the Entrez interface

FOR FURTHER INFORMATION

Most of the resources described here include documentationother explanatory material and references to collaborators and

32 Nucleic Acids Research 2003 Vol 31 No 1

at Northeastern U

niversity Libraries on N

ovember 26 2014

httpnaroxfordjournalsorgD

ownloaded from

data sources on the respective web sites Several tutorials arealso offered under the Education link from NCBIrsquos home pageA site map provides a comprehensive table of NCBI resourcesand the about NCBI feature provides bioinformatics primersand other supplementary information A user support staff isavailable to answer questions at infoncbinlmnihgov

REFERENCES

1 BensonDA Karsch-MizrachiI LipmanDJ OstellJ RappBA andWheelerDL (2002) GenBank Nucleic Acids Res 30 17ndash20

2 SchulerGD EpsteinJA OhkawaH and KansJA (1996) Entrezmolecular biology database and retrieval system Methods Enzymol 266141ndash162

3 BarkerWC GaravelliJS HuongH McGarveyPB OrcuttBCSrinivarsaraoGY XiaoC YehLS LedleyRS JandaJ PfeifferFMewesHW TsugitaA and WuK (2000) The Protein InformationResource (PIR) Nucleic Acids Res 28 41ndash44

4 KriventsevaEV FleischmannW ZdobnovEM and ApweilerR (2001)CluSTr a database of Clusters of SWISS-PROT and TrEMBL proteinsNucleic Acids Res 29 33ndash36

5 BermanHM WestbrookJ FengZ GillilandG BhatTN WeissigHShindyalovIN and BournePE (2000) The Protein Data Bank NucleicAcids Res 28 235ndash242

6 PruittK TatusovT and MaglottD (2003) RefSeq and LocusLink NCBIgene-centered resources Nucleic Acids Res 31 34ndash37

7 Marchler-BauerA AndersonJ FedorovaN DeWeese-ScottCGeerLY HurwitzD JacksonJJ JacobsA LanczyckiC LiebertCMadejT MarchlerGH MazumderR NikolskayaA PanchenkoARShoemakerBA SongJ SridharRB ThiessenPA VasudevanSWangY YamashitaR YinJ and BryantSH (2003) MMDB Entrezrsquos3D-structure database Nucleic Acids Res 31 474ndash477

8 AltschulSE GishW MillerW MyersEW and LipmanDJ (1990)Basic local alignment search tool J Mol Biol 215 403ndash410

9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

10. Tatusova,T.A. and Madden,T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett., 174, 247–250.

11. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.

12. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

13. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

14. Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.

15. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.

16. Ermolaeva,O., Rastogi,M., Pruitt,K.D., Schuler,G.D., Bittner,M.L., Chen,Y., Simon,R., Meltzer,P., Trent,J.M. and Boguski,M.S. (1998) Data management and analysis for gene expression arrays. Nature Genet., 20, 19–23.

17. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Pham,L., Smigielski,E. and Sirotkin,K. (2001) dbSNP: The NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

18. Wang,Y., Geer,L.Y., Chappey,C., Kans,J.A. and Bryant,S.H. (2000) Cn3D: sequence and structure views for Entrez. Trends Biochem. Sci., 25, 300–302.

19. Tatusova,T., Karsch-Mizrachi,I. and Ostell,J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.

20. Mitelman,F., Mertens,F. and Johansson,B. (1997) A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nature Genet., 15, 417–474.

21. Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 7, 1051–1060.

22. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.

23. McKusick,V.A. (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. The Johns Hopkins University Press, Baltimore, MD.

24. Letunic,I., Goodstadt,L., Dickens,N.J., Doerks,T., Schultz,J., Mott,R., Ciccarelli,F., Copley,R.R., Ponting,C.P. and Bork,P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.

25. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

Nucleic Acids Research, 2003, Vol. 31, No. 1

Page 3: Database resources of the National Center for Biotechnology

pages. These pages search a set of genome-specific databases and generate, where possible, genomic views of the BLAST hits using the Map Viewer.

BLink

BLink displays pre-computed protein BLAST alignments for each protein sequence in the Entrez databases. BLink allows for the display of subsets of these alignments by taxonomic criteria, by database of origin, by relation to a complete genome, by membership in a Cluster of Orthologous Groups (COG) (14), or by relation to a 3D structure or conserved protein domain. BLink links are displayed for protein records in Entrez, as well as within LocusLink reports.

RESOURCES FOR GENE-LEVEL SEQUENCES

UniGene

UniGene (15) is a system for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters. UniGene clusters ESTs from 10 animals and 7 plants, bringing the total number of organisms represented to 17. UniGene starts with entries in the appropriate organism division of GenBank, combines these with ESTs of that organism, and creates clusters of sequences that share virtually identical 3′ untranslated regions (3′ UTRs). Each UniGene cluster contains sequences that represent a unique gene and is linked to related information such as the tissue types in which the gene is expressed, model organism protein similarities, the LocusLink report for the gene and its map location. In the human UniGene database, over 3.6 million human ESTs in GenBank have been reduced 35-fold in number to over 104 000 sequence clusters. The UniGene collection has been used as a source of unique sequence for the fabrication of microarrays for the large-scale study of gene expression (16). UniGene databases are updated weekly with new EST sequences, and bimonthly with newly characterized sequences. UniGene clusters may be searched in several ways: by gene name, chromosomal location, cDNA library, accession number and ordinary text words. Cluster sequences may also be downloaded by FTP.
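
The grouping step described above can be pictured as single-linkage clustering: any two sequences judged to share a 3′ UTR end up in the same cluster. The following Python sketch is illustrative only and is not UniGene's actual procedure; the function name `cluster_sequences`, the sequence labels and the idea of supplying precomputed matching pairs are all assumptions made for the example.

```python
def cluster_sequences(ids, matching_pairs):
    """Group sequence IDs into clusters with a union-find structure.

    ids            -- list of sequence identifiers (hypothetical labels)
    matching_pairs -- pairs (a, b) assumed to share a near-identical 3' UTR
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matching_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ids = ["EST1", "EST2", "EST3", "EST4", "mRNA1"]
    pairs = [("EST1", "mRNA1"), ("EST2", "mRNA1"), ("EST3", "EST4")]
    print(cluster_sequences(ids, pairs))
```

Because membership is transitive, two ESTs that never match each other directly still fall into one cluster when both match the same mRNA.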

ProtEST

ProtEST, a tool analogous to BLink, presents pre-computed BLAST alignments between protein sequences from model organisms and the 6-frame translations of UniGene nucleotide sequences. Protein sequences that are derived from conceptual translations or model transcripts are excluded. The eight model organisms included are H. sapiens, M. musculus, Rattus norvegicus, Drosophila melanogaster, C. elegans, Saccharomyces cerevisiae, Arabidopsis thaliana and Escherichia coli. ProtEST links are displayed in UniGene reports with model organism protein similarities. For each nucleotide sequence match, the ProtEST report shows the UniGene cluster ID, the GenBank accession number and the percent identity between the protein and nucleotide translation in the aligned region. A link is also provided to the sequence trace in the NCBI Trace Archive, if available. ProtEST reports are updated in tandem with UniGene protein similarities.

HomoloGene

HomoloGene is a database of both curated and calculated gene orthologs and homologs for 14 organisms, including H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, Hordeum vulgare, Oryza sativa and Z. mays. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information (ZFIN) database at the University of Oregon and from published reports. Computed orthologs and homologs, which are considered putative, are identified from BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. HomoloGene also contains a set of triplet ortholog-based COG (14)-like clusters, which may include up to 14 members, in which the triplet orthologs in two organisms are both orthologous to the same gene in a third organism. For the three organisms human, mouse and rat, there are currently over 7000 of these self-consistent triplets. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink LocusIDs, gene symbols, gene names and nucleotide accession numbers, as well as those terms found in UniGene cluster titles. The current datasets for the calculated orthologs and homologs and the Mutually Orthologous Pairs are also available via FTP.
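
The notion of a self-consistent triplet can be made concrete with a small sketch: given a list of pairwise ortholog calls, keep only those sets of three genes, one per organism, in which every pair is linked. This is one plausible reading of the description above rather than HomoloGene's implementation; the function name and the gene labels are hypothetical.

```python
from itertools import combinations


def consistent_triplets(ortholog_pairs):
    """Return gene triplets in which all three pairwise ortholog calls exist.

    ortholog_pairs -- iterable of ((organism, gene), (organism, gene)) tuples
    """
    pairs = {frozenset(p) for p in ortholog_pairs}
    genes = sorted({gene for pair in pairs for gene in pair})
    triplets = []
    for a, b, c in combinations(genes, 3):
        three_organisms = len({a[0], b[0], c[0]}) == 3
        all_linked = all(
            frozenset(link) in pairs for link in ((a, b), (a, c), (b, c))
        )
        if three_organisms and all_linked:
            triplets.append((a, b, c))
    return triplets


if __name__ == "__main__":
    calls = [
        (("human", "geneA"), ("mouse", "geneA")),
        (("human", "geneA"), ("rat", "geneA")),
        (("mouse", "geneA"), ("rat", "geneA")),
        (("human", "geneB"), ("mouse", "geneB")),  # no rat partner: not a triplet
    ]
    print(consistent_triplets(calls))
```

The exhaustive scan over all gene combinations is adequate for a toy example; a real dataset would index the pairs by gene first.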

Reference Sequence (RefSeq)

The RefSeq database (6) provides curated reference sequences for mRNAs, genomic sequences, computationally-derived sequences and proteins for human and over 1700 other organisms.

Open Reading Frame (ORF) Finder

ORF Finder performs a six-frame translation of a nucleotide sequence and returns a graphic that indicates the location of each ORF found. Restrictions on the size of the ORFs returned may be set. The protein translations of the ORFs detected can be submitted directly for BLAST similarity searching or searching against the COGs (see below) database.
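
A six-frame ORF search of this kind can be sketched in a few lines of Python. The example below is a simplified illustration, not the ORF Finder code: it assumes the standard genetic code, treats an ORF as the stretch from an ATG to the next in-frame stop codon, and uses a hypothetical `min_codons` parameter to stand in for the size restriction.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=2):
    """Return (frame, start, end, protein) for ATG..stop ORFs in six frames.

    frame is +1..+3 for the given strand and -1..-3 for its reverse
    complement; start and end are 0-based offsets on the strand being read.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for pos in range(offset, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and CODON_TABLE.get(codon) == "*":
                    protein = "".join(
                        CODON_TABLE.get(s[i:i + 3], "X")
                        for i in range(start, pos, 3)
                    )
                    if len(protein) >= min_codons:
                        orfs.append((strand * (offset + 1), start, pos + 3, protein))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

Raising `min_codons` filters out short ORFs, which on real sequences are mostly chance occurrences.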

Electronic PCR (e-PCR)

e-PCR is a tool for locating Sequence Tagged Sites (STSs) within a nucleotide sequence by searching against a non-redundant database of over 133 000 human and 84 000 non-human STSs called UniSTS.
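
As a rough illustration of what locating an STS involves, the sketch below assumes that each STS is defined by a pair of PCR primers and an expected product size, and reports the places where both primers occur in the right orientation and at about the right distance. The text above does not describe the e-PCR algorithm itself, so the exact-match search, the function name `find_sts` and the `margin` parameter are all assumptions.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(seq, forward, reverse, expected_size, margin=50):
    """Return (start, end) spans flanked by both primers.

    forward, reverse -- primer sequences, each written 5' to 3'
    expected_size    -- expected PCR product length in bases
    margin           -- allowed deviation from expected_size (assumed value)
    """
    seq = seq.upper()
    forward = forward.upper()
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we look for its reverse complement.
    rc_reverse = reverse_complement(reverse.upper())
    hits = []
    start = seq.find(forward)
    while start != -1:
        site = seq.find(rc_reverse, start + len(forward))
        while site != -1:
            end = site + len(rc_reverse)
            if abs((end - start) - expected_size) <= margin:
                hits.append((start, end))
            site = seq.find(rc_reverse, site + 1)
        start = seq.find(forward, start + 1)
    return hits


if __name__ == "__main__":
    template = "TT" + "ACGTAC" + "A" * 10 + "GGATCA" + "CC"
    print(find_sts(template, "ACGTAC", "TGATCC", expected_size=22, margin=0))
```

This sketch only handles exact primer matches on one strand; mismatch tolerance and indexing of a large STS collection are deliberately left out.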

A database of Single Nucleotide Polymorphisms (dbSNP)

The dbSNP (17) is a repository for single base nucleotide substitutions and short deletion and insertion polymorphisms. The dbSNP database contains almost 3 million human SNPs, as well as about half a million from organisms including M. musculus, Anopheles gambiae, D. rerio and A. thaliana. The Web interface allows flexible searches by gene name and by cross-reference to other databases such as OMIM or the structure databases. Searches for SNPs lying between two markers and batch downloads are also supported. SNP reports link to structures from the MMDB, allowing the 3-D visualization, using NCBI’s interactive macromolecular viewer Cn3D (18), of amino acid changes implied by SNPs in coding regions.

RESOURCES FOR GENOME-SCALE ANALYSIS

Entrez Genomes

Entrez Genomes (19) provides access to genomic data contributed by the scientific community for over 1000 species whose sequencing and mapping is complete or in progress. Entrez Genomes now includes more than 86 complete microbial genomes and 302 RefSeqs for eukaryotic organelles. Many higher eukaryotic genomes are also included within Entrez Genomes, such as those of H. sapiens, M. musculus, D. melanogaster, Anopheles gambiae, C. elegans and A. thaliana.

In Entrez Genomes, complete genomes can be accessed hierarchically, starting from either an alphabetical listing or a phylogenetic tree for each of six principal taxonomic groups. One can follow the hierarchy to a graphical overview for the genome of a single organism, on to the level of a single chromosome and, finally, down to the level of a single gene. At each level are one or more views, pre-computed summaries and links to analyses appropriate at that level. For instance, at the level of a genome or a chromosome, a coding regions view displays the location of each coding region, the length of the product, the GenBank identification number for the protein sequence and the name of the protein product. An RNA genes view lists the location and gene names for ribosomal and transfer RNA genes. At the level of a single gene, links are provided to pre-computed sequence neighbors for the gene product. Any protein gene product that is a member of a COG (19) is linked to the COGs database. A summary of COG functional groups is also presented in tabular and graphical formats at the genome level.

For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3-D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are presented graphically and linked to the Cn3D macromolecular viewer (18), which allows the interactive display of 3-D structures and sequence alignments. The TaxPlot tool graphically compares similarities in the proteomes of two organisms to that of a third reference organism and is available for both prokaryotic and eukaryotic genomes. Resources for the genomes of higher eukaryotes are discussed below.

COGs

The COGs database (14) presents a compilation of orthologous groups of proteins from completely sequenced organisms representing 44 species and 30 phylogenetically distant clades. The COGs are now also linked to the proteins of two higher eukaryotes, C. elegans and D. melanogaster.

Retroviral genotyping tools

The genotyping of retrovirus sequences is important in the characterization of viral genetic diversity, in the tracking of epidemics and in vaccine development. NCBI offers a Web-based genotyping tool that employs a blastn comparison between a retroviral sequence to be subtyped and a default panel of reference sequences or a panel provided by the user. An HIV-1-specific subtyping tool uses a set of reference sequences taken from the principal HIV-1 variants.

Eukaryotic Genomic Resources

Entrez Genomes links to Genome Resources webpages devoted to the sequencing of a number of eukaryotic organisms, including H. sapiens, M. musculus and D. melanogaster. A page called Plant Genomes Central serves as a collection point for resources related to plant genome projects. Many genome projects have progressed to the point at which it is useful to have an interactive genome viewing tool with which to correlate the data present in a variety of genomic maps. NCBI has developed the Map Viewer for this purpose.

Map Viewer

The NCBI Map Viewer displays genome assemblies using sets of synchronized chromosomal maps. Map Viewer displays are available for the genomes of four vertebrates, including H. sapiens, M. musculus and D. rerio; three invertebrates, including D. melanogaster and C. elegans; seven plants, including A. thaliana and O. sativa; and two fungi, S. cerevisiae and Schizosaccharomyces pombe. The genomic maps displayed by the Map Viewer vary according to the data available for the subject organism. The maps can be selected from a set of cytogenetic maps, such as chromosomal ideograms; sequence-based maps, such as those showing contigs, genes and SNPs; and physical maps, such as the G3 and GB4 human radiation-hybrid maps. Maps showing ab initio gene models, EST alignments with links to UniGene clusters, and mRNA alignments used to construct gene models are also available for some organisms. The rightmost map in a Map Viewer display, called the master map, generates an extended set of map-specific links to related resources. In the case of the Genes map, two of these links are to the EV and MM, described below. In addition to its graphical display, the Map Viewer offers a tabular view of the data that is convenient for export to other programs for further analysis.

Queries against an entire genome or particular chromosomes can be made in the Map Viewer using gene names or symbols, marker names, SNP identifiers, accession numbers and other identifiers. The human version of the Map Viewer is tightly integrated with other NCBI databases such as LocusLink and dbSNP. Segments of a genomic assembly may be downloaded using the Map Viewer’s ‘Download/View Sequence’ link for some genomes, such as H. sapiens, M. musculus and A. thaliana. Supported download formats are GenBank and FASTA.

Model Maker (MM)

MM allows the construction of transcript models using novel combinations of putative exons derived from ab initio predictions or from the alignment of GenBank transcripts, including ESTs and NCBI RefSeqs, to the NCBI human genome assembly. The MM interface consists of a graphical overview of transcript alignments to a genomic contig, with each unique block of alignment collected and numbered as a putative exon. Transcript models are constructed by selecting from this collection. As the transcript is created, the implied protein translation is given in each reading frame, with any internal stop codons indicated. Previously observed exon splice patterns are indicated as guides to model building. Completed models may be saved locally or analyzed with ORF Finder.
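
The model-building step described for MM, selecting numbered exons and inspecting the implied translation in each reading frame, can be sketched as follows. This is a toy illustration under the standard genetic code, not the MM implementation; the function names and the example exon sequences are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def build_model(exons, selection):
    """Concatenate the selected putative exons, given by number, in order."""
    return "".join(exons[number] for number in selection)


def translate_frames(transcript):
    """Translate the three forward reading frames; '*' marks a stop codon."""
    frames = {}
    for offset in range(3):
        codons = (
            transcript[i:i + 3] for i in range(offset, len(transcript) - 2, 3)
        )
        frames[offset + 1] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


def internal_stops(protein):
    """Positions of stop codons other than one at the very end."""
    return [i for i, residue in enumerate(protein[:-1]) if residue == "*"]


if __name__ == "__main__":
    exons = {1: "ATGGCC", 2: "TGAAAA", 3: "AAATAA"}  # made-up exon blocks
    for selection in ([1, 3], [1, 2, 3]):
        protein = translate_frames(build_model(exons, selection))[1]
        print(selection, protein, internal_stops(protein))
```

A selection whose frame 1 translation reports internal stops is unlikely to represent a complete coding transcript in that frame, which is the kind of hint the translation display gives during model building.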

Evidence Viewer (EV)

The EV displays the alignments to a genomic contig of RefSeq transcripts, GenBank mRNAs, known or potential transcripts, and ESTs supporting a gene model. The EV produces a graphical summary of the alignments that indicates the coordinate range of the gene model on the genomic contig and the areas of alignment to the transcripts on separate tracks. EST alignment density along the contig is indicated on another track. A mismatch and an insertion/deletion track are also shown to highlight areas of disagreement between transcript sequences and the genomic sequence. Following the graphical summary are exon-by-exon alignments of all of the transcript sequences against the genomic contig, including flanking genomic sequence for each exon, to show the presence or absence of splice sites. Any proteins annotated on the transcript sequences are also shown, and mismatches between transcripts and the genomic contig, or between proteins annotated on the aligned transcripts, are highlighted.

The Human–Mouse Homology Maps

The Human–Mouse Homology Maps are tables of genetic loci in homologous segments of DNA from human and the mouse. The map is computed by integrating orthologs curated by the Mouse Genome Database with putative orthologs identified by homology. The maps are linked to GeneMap’99, OMIM, LocusLink, dbSTS, BLAST2Sequences and the Mouse Genome Database at The Jackson Laboratory. Other mouse genome resources can be found on the Mouse Genome Resources page.

The Cancer Chromosome Aberration Project (CCAP)

The CCAP service is an initiative of the National Cancer Institute (NCI) and NCBI. The data includes a compilation by F. Mitelman, F. Mertens and B. Johansson of recurrent neoplasia-associated chromosomal aberrations from the Cancer Chromosome Aberration Bank at the University of Lund, Sweden (20). The Spectral Karyotyping database (SKY), created jointly by NCI and NCBI, enables investigators to share their own SKY and Comparative Genomic Hybridization (CGH) data on chromosomal aberrations (http://www.ncbi.nlm.nih.gov/sky/skyweb.cgi).

RESOURCES FOR THE ANALYSIS OF PATTERNS OF GENE EXPRESSION AND PHENOTYPES

SAGEmap

Serial Analysis of Gene Expression (SAGE) is a technique for taking a snapshot of the messenger RNA population of a cell to obtain a quantitative measure of gene expression. NCBI’s SAGEmap (21) service implements many functions useful in the analysis of SAGE data, such as a two-way mapping between SAGE tag and UniGene. SAGEmap can also construct a user-configurable table of data comparing one group of SAGE libraries with another. Groups may be chosen for inclusion in the table on the basis of several expression criteria. SAGEmap is updated weekly, immediately following the update of UniGene, and the data is reflected in the human genome Map Viewer as the SAGE track.
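
The two operations mentioned above, a two-way tag and cluster lookup and a table comparing two groups of libraries, can be illustrated with plain dictionaries. The sketch below is not SAGEmap's code: the tag sequences and cluster identifiers are invented, and pooling counts and scaling them to tags per million is an assumed normalization chosen for the example.

```python
from collections import defaultdict


def build_two_way_map(tag_cluster_pairs):
    """Index (tag, cluster) assignments in both directions."""
    tag_to_clusters = defaultdict(set)
    cluster_to_tags = defaultdict(set)
    for tag, cluster in tag_cluster_pairs:
        tag_to_clusters[tag].add(cluster)
        cluster_to_tags[cluster].add(tag)
    return dict(tag_to_clusters), dict(cluster_to_tags)


def pooled_tags_per_million(libraries):
    """Pool raw tag counts over a group of libraries, scaled per million tags."""
    totals = defaultdict(int)
    for library in libraries:
        for tag, count in library.items():
            totals[tag] += count
    grand_total = sum(totals.values())
    return {tag: 1e6 * count / grand_total for tag, count in totals.items()}


def compare_groups(group_a, group_b):
    """Return rows (tag, per-million in A, per-million in B), sorted by tag."""
    freq_a = pooled_tags_per_million(group_a)
    freq_b = pooled_tags_per_million(group_b)
    return [
        (tag, freq_a.get(tag, 0.0), freq_b.get(tag, 0.0))
        for tag in sorted(set(freq_a) | set(freq_b))
    ]


if __name__ == "__main__":
    pairs = [("AAAAAAAAAA", "Hs.1"), ("CCCCCCCCCC", "Hs.1"), ("AAAAAAAAAA", "Hs.2")]
    tag_to_clusters, cluster_to_tags = build_two_way_map(pairs)
    print(tag_to_clusters["AAAAAAAAAA"], cluster_to_tags["Hs.1"])
    print(compare_groups([{"AAAAAAAAAA": 1, "CCCCCCCCCC": 3}], [{"AAAAAAAAAA": 2}]))
```

Sets are used on both sides of the mapping because a tag may match more than one cluster and a cluster may yield more than one tag.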

Gene Expression Omnibus (GEO)

The GEO (22) is a data repository and retrieval system for gene expression data derived from any organism or artificial source. Gene expression data derived from spotted microarray, high-density oligonucleotide array, hybridization filter and SAGE data are available for download and accepted for deposit. At the time of writing, the repository contains high-throughput gene expression data on over 2300 samples.

OMIM

NCBI provides the online version of the OMIM catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at The Johns Hopkins University (23). The database contains information on disease phenotypes and genes, including extensive descriptions, gene names, inheritance patterns, map locations and gene polymorphisms. OMIM currently contains 13 864 entries, including data on 10 290 established gene loci and 1019 phenotypic descriptions, and is now searchable using the powerful Entrez interface.

THE MOLECULAR MODELING DATABASE (MMDB), THE CONSERVED DOMAIN DATABASE SEARCH, AND CDART

The NCBI MMDB, built by processing entries from the Protein Data Bank (5), is described in (7). The structures in the MMDB are linked to sequences in Entrez and to the Conserved Domain Database (CDD). The CDD contains PSI-BLAST-derived Position Specific Score Matrices representing domains taken principally from two public protein domain collections, the Simple Modular Architecture Research Tool (SMART) (24) and Pfam (25), but also draws from domains defined by NCBI researchers. NCBI’s Conserved Domain Search (CD-Search) service can be used to search a protein sequence for conserved domains in the CDD. Wherever possible, CDD hits are linked to structures which, coupled with a multiple sequence alignment of representatives of the domain hit, can be viewed with NCBI’s 3-D molecular structure viewer, Cn3D (18). The Conserved Domain Architecture Retrieval Tool (CDART) allows searches of protein databases on the basis of a conserved domain and returns the domain architectures of database proteins containing the query domain. Alignment-based protein domain information from the CDD and 3-D domains from the MMDB are searchable via the Entrez interface.
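
To give a feel for how a Position Specific Score Matrix is used to detect a domain, the sketch below slides a small matrix along a protein sequence and reports the best-scoring ungapped window. It is a deliberately simplified illustration and not the CD-Search method: the three-column matrix, its scores and the default penalty of -1 for unlisted residues are invented for the example.

```python
def best_pssm_hit(sequence, pssm, default_score=-1):
    """Slide a position-specific score matrix along a protein sequence.

    pssm -- list of dicts, one per domain position, mapping residue -> score
    Returns (best_score, start) for the highest-scoring ungapped window,
    or None if the sequence is shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            column.get(residue, default_score)
            for column, residue in zip(pssm, window)
        )
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    # Toy three-position matrix favouring the motif G-K-S (hypothetical scores).
    toy_pssm = [{"G": 5}, {"K": 4}, {"S": 3}]
    print(best_pssm_hit("AAGKSAA", toy_pssm))
```

Unlike a fixed substitution matrix, each column carries its own scores, so a residue can be rewarded at one position of the domain and penalized at another.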

FOR FURTHER INFORMATION

Most of the resources described here include documentation, other explanatory material and references to collaborators and data sources on the respective web sites. Several tutorials are also offered under the Education link from NCBI’s home page. A site map provides a comprehensive table of NCBI resources, and the About NCBI feature provides bioinformatics primers and other supplementary information. A user support staff is available to answer questions at info@ncbi.nlm.nih.gov.

REFERENCES

1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.

2. Schuler,G.D., Epstein,J.A., Ohkawa,H. and Kans,J.A. (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol., 266, 141–162.

3. Barker,W.C., Garavelli,J.S., Huong,H., McGarvey,P.B., Orcutt,B.C., Srinivarsarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J., Pfeiffer,F., Mewes,H.W., Tsugita,A. and Wu,K. (2000) The Protein Information Resource (PIR). Nucleic Acids Res., 28, 41–44.

4. Kriventseva,E.V., Fleischmann,W., Zdobnov,E.M. and Apweiler,R. (2001) CluSTr: a database of Clusters of SWISS-PROT and TrEMBL proteins. Nucleic Acids Res., 29, 33–36.

5. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

6. Pruitt,K., Tatusov,T. and Maglott,D. (2003) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 31, 34–37.

7. Marchler-Bauer,A., Anderson,J., Fedorova,N., DeWeese-Scott,C., Geer,L.Y., Hurwitz,D., Jackson,J.J., Jacobs,A., Lanczycki,C., Liebert,C., Madej,T., Marchler,G.H., Mazumder,R., Nikolskaya,A., Panchenko,A.R., Shoemaker,B.A., Song,J., Sridhar,R.B., Thiessen,P.A., Vasudevan,S., Wang,Y., Yamashita,R., Yin,J. and Bryant,S.H. (2003) MMDB: Entrez’s 3D-structure database. Nucleic Acids Res., 31, 474–477.

8. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

10. Tatusova,T.A. and Madden,T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett., 174, 247–250.

11. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.

12. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

13. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

14. Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.

15. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.

16. Ermolaeva,O., Rastogi,M., Pruitt,K.D., Schuler,G.D., Bittner,M.L., Chen,Y., Simon,R., Meltzer,P., Trent,J.M. and Boguski,M.S. (1998) Data management and analysis for gene expression arrays. Nature Genet., 20, 19–23.

17. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Pham,L., Smigielski,E. and Sirotkin,K. (2001) dbSNP: The NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

18. Wang,Y., Geer,L.Y., Chappey,C., Kans,J.A. and Bryant,S.H. (2000) Cn3D: sequence and structure views for Entrez. Trends Biochem. Sci., 25, 300–302.

19. Tatusova,T., Karsch-Mizrachi,I. and Ostell,J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.

20. Mitelman,F., Mertens,F. and Johansson,B. (1997) A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nature Genet., 15, 417–474.

21. Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 7, 1051–1060.

22. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.

23. McKusick,V.A. (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. The Johns Hopkins University Press, Baltimore, MD.

24. Letunic,I., Goodstadt,L., Dickens,N.J., Doerks,T., Schultz,J., Mott,R., Ciccarelli,F., Copley,R.R., Ponting,C.P. and Bork,P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.

25. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

Page 4: Database resources of the National Center for Biotechnology

Cn3D (18) of amino acid changes implied by SNPs in codingregions

RESOURCES FOR GENOME-SCALE ANALYSIS

Entrez Genomes

Entrez Genomes (19) provides access to genomic datacontributed by the scientific community for over 1000 specieswhose sequencing and mapping is complete or in progressEntrez Genomes now includes more than 86 completemicrobial genomes and 302 RefSeq for eukaryotic organellesMany higher eukaryotic genomes are also included withinEntrez Genomes such as those of H sapiens M musculusD melanogaster Anopheles gambiae C elegans and Athaliana

In Entrez Genomes complete genomes can be accessedhierarchically starting from either an alphabetical listing or aphylogenetic tree for each of six principle taxonomic groupsOne can follow the hierarchy to a graphical overview for thegenome of a single organism on to the level of a singlechromosome and finally down to the level of a single gene Ateach level are one or more views pre-computed summariesand links to analyses appropriate at that level For instance atthe level of a genome or a chromosome a coding regions viewdisplays the location of each coding region length of theproduct GenBank identification number for the proteinsequence and name of the protein product A RNA genesview lists the location and gene names for ribosomal andtransfer RNA genes At the level of a single gene links areprovided to pre-computed sequence neighbors for the geneproduct Any protein gene product that is a member of aCOG (19) is linked to the COGs database A summary of COGfunctional groups is also presented in tabular and graphicalformats at the genome level

For complete microbial genomes pre-computed BLASTneighbors for protein sequences including their taxonomicdistribution and links to 3-D structures are given in TaxTablesand PDBTables respectively Pairwise sequence alignmentsare presented graphically and linked to the Cn3D macro-molecular viewer (18) which allows the interactive display of3-D structures and sequence alignments The TaxPlot toolgraphically compares similarities in the proteomes of twoorganisms to that of a third reference organism and is availablefor both prokaryotic and eukaryotic genomes Resources forthe genomes of higher eukaryotes are discussed below

COGs

The COGs database (14) presents a compilation of ortholo-gous groups of proteins from completely sequenced organismsrepresenting 44 species and 30 phylogenetically distant cladesThe COGs are now also linked to the proteins of two highereukaryotes C elegans and D melanogaster

Retroviral genotyping tools

The genotyping of retrovirus sequences is important in thecharacterization of viral genetic diversity in the tracking ofepidemics and in vaccine development NCBI offers a Web-based genotyping tool that employs a blastn comparison

between a retroviral sequence to be subtyped and a defaultpanel of reference sequences or a panel provided by the userAn HIV-1-specific subtyping tool uses a set of referencesequences taken from the principle HIV-1 variants

Eukaryotic Genomic Resources

Entrez Genomes links to Genome Resources webpages devotedto the sequencing of a number of eukaryotic organismsincluding H sapiens M musculus and D melanogaster Apage called Plant Genomes Central serves as a collection pointfor resources related to plant genome projects Many genomeprojects have progressed to the point at which it is useful to havean interactive genome viewing tool with which to correlate thedata present in various of genomic maps NCBI has developedthe Map Viewer for this purpose

Map Viewer

The NCBI Map Viewer displays genome assemblies using setsof synchronized chromosomal maps Map Viewer displays areavailable for the genomes of four vertebrates includingH sapiens and M musculus and D rerio three invertebratesincluding D melanogaster and C elegans seven plantsincluding A thaliana and O satvia and two fungi S cerevisiaeand Schizosaccharomyces pombe The genomic maps dis-played by the Map Viewer vary according to the data availablefor the subject organism The maps can be selected from a setof cytogenetic maps such as chromosomal ideogramssequence-based maps such as those showing contigs genesand SNPs and physical maps such as the G3 and GB4 humanradiation-hybrid maps Maps showing ab initio gene modelsEST alignments with links to UniGene clusters and mRNAalignments used to construct gene models are also available forsome organisms The rightmost map in a Map Viewer displaycalled the master map generates an extended set of map-specific links to related resources In the case of the Genesmap two of these links are to the EV and MM described belowIn addition to its graphical display the Map Viewer offers atabular view of the data that is convenient for export to otherprograms for further analysis

Queries against an entire genome or particular chromosomescan be made in the Map Viewer using gene names or symbolsmarker names SNP identifiers accession numbers and otheridentifiers The human version of the Map Viewer is tightlyintegrated with other NCBI databases such as LocusLink anddbSNP Segments of a genomic assembly may be downloadedusing the Map Viewerrsquos lsquoDownloadView Sequencersquo link forsome genomes such as H sapiens M musculus and A thalianaSupported download formats are GenBank and FASTA

Model Maker (MM)

MM allows the construction of transcript models using novelcombinations of putative exons derived from ab initiopredictions or from the alignment of GenBank transcriptsincluding ESTs and NCBI RefSeqs to the NCBI humangenome assembly The MM interface consists of a graphicaloverview of transcript alignments to a genomic contig witheach unique block of alignment collected and numbered as aputative exon Transcript models are constructed by selecting

Nucleic Acids Research 2003 Vol 31 No 1 31

at Northeastern U

niversity Libraries on N

ovember 26 2014

httpnaroxfordjournalsorgD

ownloaded from

from this collection As the transcript is created the impliedprotein translation is given in each reading frame with anyinternal stop codons indicated Previously observed exonsplice patterns are indicated as guides to model buildingCompleted models may be saved locally or analyzed withOrfFinder

Evidence Viewer (EV)

The EV displays the alignments to a genomic contig of RefSeqtranscripts GenBank mRNAs known or potential transcriptsand ESTs supporting a gene model The EV produces agraphical summary of the alignments that indicates thecoordinate range of the gene model on the genomic contigand the areas of alignment to the transcripts on separate tracksEST alignment density along the contig is indicated on anothertrack A mismatch and an insertiondeletion track are alsoshown to highlight areas of disagreement between transcriptsequences and the genomic sequence Following the graphicalsummary are exon-by-exon alignments of all of the transcriptsequences against the genomic contig including flankinggenomic sequence for each exon to show the presence orabsence of splice sites Any proteins annotated on thetranscript sequences are also shown and mismatches betweentranscripts and the genomic contig or between proteinsannotated on the aligned transcripts are highlighted

The HumanndashMouse Homology Maps

The HumanndashMouse Homology Maps are tables of genetic lociin homologous segments of DNA from human and the mouseThe map is computed by integrating orthologs curated by theMouse Genome Database with putative orthologs identified byhomology The maps are linked to GeneMaprsquo99 OMIMLocusLink dbSTS BLAST2Sequences and the MouseGenome Database at The Jackson Laboratory Other mousegenome resources can be found on the Mouse GenomeResources page

The Cancer Chromosome Aberration Project (CCAP)

The CCAP service is an initiative of the National CancerInstitute (NCI) and NCBI The data includes a compilation byF Mitelman F Mertens and B Johansson of recurrentneoplasia-associated chromosomal aberrations from theCancer Chromosome Aberration Bank at the University ofLund Sweden (20) The Spectral Karyotyping database SKYcreated jointly by NCI and NCBI enables investigators to sharetheir own SKYand Comparative Genomic Hybidization (CGH)data on chromsomal aberrations (httpwwwncbinlmnihgovskyskywebcgi)

RESOURCES FOR THE ANALYSIS OF PATTERNSOF GENE EXPRESSION AND PHENOTYPES

SAGEmap

Serial Analysis of Gene Expression (SAGE) is a technique fortaking a snapshot of the messenger RNA population of a cellto obtain a quantitative measure of gene expression NCBIrsquosSAGEmap (21) service implements many functions useful in

the analysis of SAGE data such as a two-way mapping betweenSAGE tag and UniGene SAGEmap can also construct a user-configurable table of data comparing one group of SAGElibraries with another Groups may be chosen for inclusion inthe table on the basis of several expression criteria SAGEmapis updated weekly immediately following the update ofUniGene and the data is reflected in the human genome MapViewer as the SAGE track

Gene Expression Omnibus (GEO)

The GEO (22) is a data repository and retrieval system for geneexpression data derived from any organism or artificial sourceGene expression data derived from spotted microarray high-density oligonucleotide array hybridization filter and SAGEdata are available for download and accepted for deposit Atthe time of writing the repository contains high-throughputgene expression data on over 2300 samples

OMIM

NCBI provides the online version of the OMIM catalog ofhuman genes and genetic disorders authored and edited byVictor A McKusick at The Johns Hopkins University (23)The database contains information on disease phenotypes andgenes including extensive descriptions gene names inheri-tance patterns map locations and gene polymorphisms OMIMcurrently contains 13 864 entries including data on 10 290established gene loci and 1019 phenotypic descriptions and isnow searchable using the powerful Entrez interface

THE MOLECULAR MODELING DATABASE (MMDB), THE CONSERVED DOMAIN DATABASE SEARCH AND CDART

The NCBI MMDB, built by processing entries from the Protein Data Bank (5), is described in (7). The structures in the MMDB are linked to sequences in Entrez and to the Conserved Domain Database (CDD). The CDD contains PSI-BLAST-derived Position Specific Score Matrices representing domains taken principally from two public protein domain collections, the Simple Modular Architecture Research Tool (SMART) (24) and Pfam (25), but also draws from domains defined by NCBI researchers. NCBI’s Conserved Domain Search (CD-Search) service can be used to search a protein sequence for conserved domains in the CDD. Wherever possible, CDD hits are linked to structures which, coupled with a multiple sequence alignment of representatives of the domain hit, can be viewed with NCBI’s 3-D molecular structure viewer, Cn3D (18). The Conserved Domain Architecture Retrieval Tool (CDART) allows searches of protein databases on the basis of a conserved domain and returns the domain architectures of database proteins containing the query domain. Alignment-based protein domain information from the CDD and 3-D domains from the MMDB are searchable via the Entrez interface.
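Two ideas in the paragraph above lend themselves to a small Python sketch: scoring a protein sequence against a position-specific score matrix, and retrieving the domain architectures of proteins that contain a query domain. This is a deliberately simplified, ungapped illustration and not the algorithm used by CD-Search or CDART. The function names, the matrix representation (one dictionary of residue scores per position) and the default score for unlisted residues are all assumptions.

```python
def best_pssm_hit(sequence, pssm, default=-1):
    """Slide an ungapped score matrix along a sequence.

    pssm is a list with one dict per domain position, mapping residue -> score;
    residues not listed at a position receive `default` (an assumed penalty).
    Returns (start, score) of the best-scoring window, or None if the
    sequence is shorter than the matrix.
    """
    width = len(pssm)
    if width == 0:
        return None
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


def architectures_containing(query_domain, annotated_proteins):
    """Return the ordered domain architecture of every protein containing the query.

    annotated_proteins maps a protein identifier to its domains listed in
    N-to-C-terminal order (hypothetical precomputed annotations).
    """
    hits = {}
    for protein, domains in annotated_proteins.items():
        if query_domain in domains:
            hits[protein] = tuple(domains)
    return hits
```

In this toy form, a sequence is first matched to a domain by its best window score, and that domain name then serves as the query for the architecture lookup, mirroring the CD-Search to CDART progression described in the text.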

FOR FURTHER INFORMATION

Most of the resources described here include documentation, other explanatory material and references to collaborators and data sources on the respective web sites. Several tutorials are also offered under the Education link from NCBI’s home page. A site map provides a comprehensive table of NCBI resources, and the ‘About NCBI’ feature provides bioinformatics primers and other supplementary information. A user support staff is available to answer questions at info@ncbi.nlm.nih.gov.

REFERENCES

1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.

2. Schuler,G.D., Epstein,J.A., Ohkawa,H. and Kans,J.A. (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol., 266, 141–162.

3. Barker,W.C., Garavelli,J.S., Huong,H., McGarvey,P.B., Orcutt,B.C., Srinivarsarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J., Pfeiffer,F., Mewes,H.W., Tsugita,A. and Wu,K. (2000) The Protein Information Resource (PIR). Nucleic Acids Res., 28, 41–44.

4. Kriventseva,E.V., Fleischmann,W., Zdobnov,E.M. and Apweiler,R. (2001) CluSTr: a database of Clusters of SWISS-PROT and TrEMBL proteins. Nucleic Acids Res., 29, 33–36.

5. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

6. Pruitt,K., Tatusov,T. and Maglott,D. (2003) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 31, 34–37.

7. Marchler-Bauer,A., Anderson,J., Fedorova,N., DeWeese-Scott,C., Geer,L.Y., Hurwitz,D., Jackson,J.J., Jacobs,A., Lanczycki,C., Liebert,C., Madej,T., Marchler,G.H., Mazumder,R., Nikolskaya,A., Panchenko,A.R., Shoemaker,B.A., Song,J., Sridhar,R.B., Thiessen,P.A., Vasudevan,S., Wang,Y., Yamashita,R., Yin,J. and Bryant,S.H. (2003) MMDB: Entrez’s 3D-structure database. Nucleic Acids Res., 31, 474–477.

8. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

9. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

10. Tatusova,T.A. and Madden,T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett., 174, 247–250.

11. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.

12. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

13. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

14. Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.

15. Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.

16. Ermolaeva,O., Rastogi,M., Pruitt,K.D., Schuler,G.D., Bittner,M.L., Chen,Y., Simon,R., Meltzer,P., Trent,J.M. and Boguski,M.S. (1998) Data management and analysis for gene expression arrays. Nature Genet., 20, 19–23.

17. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Pham,L., Smigielski,E. and Sirotkin,K. (2001) dbSNP: The NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

18. Wang,Y., Geer,L.Y., Chappey,C., Kans,J.A. and Bryant,S.H. (2000) Cn3D: sequence and structure views for Entrez. Trends Biochem. Sci., 25, 300–302.

19. Tatusova,T., Karsch-Mizrachi,I. and Ostell,J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.

20. Mitelman,F., Mertens,F. and Johansson,B. (1997) A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nature Genet., 15, 417–474.

21. Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.

22. Edgar,R., Domrachev,M. and Lash,A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.

23. McKusick,V.A. (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. The Johns Hopkins University Press, Baltimore, MD.

24. Letunic,I., Goodstadt,L., Dickens,N.J., Doerks,T., Schultz,J., Mott,R., Ciccarelli,F., Copley,R.R., Ponting,C.P. and Bork,P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.

25. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

Nucleic Acids Research, 2003, Vol. 31, No. 1



