Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005...

Overview of Biological Databases

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Sept. 6, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Most slides are taken from NCBI field guide at the web site http://www.ncbi.nlm.nih.gov/

The Central Dogma & Biological Data

Original DNA sequences (genomes)

Expressed DNA sequences (= mRNA sequences = cDNA sequences)

Expressed Sequence Tags (ESTs)

Protein sequences: inferred or from direct sequencing

Protein structures: from experiments or from models (homologues)

Literature information

Entrez Integrates Most of Them!

Entrez

Nucleotide

PubMed

Protein

Taxonomy

Structure

Domains

3D Domains

Journals

PMC

OMIM

Books

PopSet

SNP

UniGene UniSTS

Genome

Gene

GEO

GEO Datasets

MeSH

Cancer Chromosomes

Homologene

Outline

• NCBI & Entrez

• Major Biological Databases

• Using Entrez

Some background about Entrez…

The National Center for Biotechnology Information

Created in 1988 as a part of the National Library of Medicine at NIH

– Establish public databases

– Research in computational biology

– Develop software tools for sequence analysis

– Disseminate biomedical information

Bethesda, MD

Web Access: http://www.ncbi.nlm.nih.gov

Number of Users and Hits Per Day

[Figure: number of users per day, 1997 to 2003; y-axis "Number of Users", 0 to 450,000; annotation: Christmas & New Year’s Days]

Currently averaging 10,000,000 to 50,000,000 hits per day!

Major Biological Databases

Entrez: Database Integration

[Figure: Entrez links between databases]

• 3-D Structure: neighbors are related structures (VAST)

• Nucleotide sequences: neighbors are related sequences (BLAST)

• Protein sequences: neighbors are related sequences (BLAST); also BLink and Domains

• Taxonomy: phylogeny

• PubMed abstracts: neighbors are related articles (word weight)

• Hard links connect these to Genome, Gene, OMIM, Cancer Chromosome, CDD, 3D domain, PubChem, Books, PMC, SNP, Genome Project, HomoloGene, UniGene, and GEO

Types of Databases

• Primary Databases

– Original submissions by experimentalists

– Content controlled by the submitter

• Examples: GenBank, SNP, GEO

• Derivative Databases

– Built from primary data

– Content controlled by third party (NCBI)

• Examples: RefSeq, RefSNP, GEO Datasets, UniGene, TPA, NCBI Protein, Structure, Conserved Domain

Primary vs. Derivative Sequence Databases

[Figure: sequencing centers and labs submit sequences to GenBank; derivative databases are built from GenBank records: UniGene (by algorithms), RefSeq (by curators), and genome assembly]

• GenBank: updated ONLY by submitters

• Derivative databases: updated continually by NCBI

Entrez Nucleotides

[Figure: share of Entrez Nucleotide records by source (GenBank, RefSeq, TPA, PDB), on a 0% to 100% scale]

Primary

• GenBank / EMBL / DDBJ 57,172,944

Derivative

• RefSeq 1,278,742

• Third Party Annotation 4,653

• PDB 5,973

Total 58,462,312

Entrez Protein: Derivative Databases

[Figure: share of Entrez Protein records by source (GenPept, RefSeq, TPA, SwissProt, PIR, PRF, PDB), on a 0% to 100% scale]

GenPept 3,515,141

RefSeq 1,802,523

Third Party Annotation 4,217

Swiss Prot 189,324

PIR 222,232

PRF 12,079

PDB 68,621

Total 5,814,137

BLAST nr total 2,726,372

Database 1: GenBank NCBI’s Primary Sequence Database

What is GenBank?

• Nucleotide only sequence database

• Archival in nature

– Historical

– Reflective of submitter point of view (subjective)

– Redundant

• GenBank Data

– Direct submissions (traditional records)

– Batch submissions (EST, GSS, STS)

– ftp accounts (genome data)

• Three collaborating databases

– GenBank

– DNA Database of Japan (DDBJ)

– European Molecular Biology Laboratory (EMBL) Database

International Sequence Database Collaboration

• GenBank (NCBI, NIH): submissions and updates; retrieval through Entrez

• EMBL (EBI): submissions and updates; retrieval through SRS

• DDBJ (NIG, CIB): submissions and updates; retrieval through getentry

GenBank Divisions

“Organismal” divisions

PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated

• Organized by taxonomy (sort of)

• Direct submissions (Sequin/BankIt)

• Accurate (~1 error per 10,000 bp)

• Well characterized

“Functional” divisions

EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual

• Organized by sequence type

• Batch submissions (ftp/email)

• Inaccurate

• Poorly characterized

GenBank Functional (Bulk) Divisions

[Figure: GenBank and its bulk divisions: EST, STS, GSS, HTG]

• Expressed Sequence Tag

– 1st pass single read cDNA

• Genome Survey Sequence

– 1st pass single read gDNA

• High Throughput Genomic

– incomplete sequences of genomic clones

• Sequence Tagged Site

– PCR-based mapping reagents

Whole Genome Shotgun

EST Division: Expressed Sequence Tags

[Figure: how ESTs are produced]

• Nucleus: 30,000 genes

• RNA gene products

• Make cDNA library: 80-100,000 unique cDNA clones in library

• Isolate unique clones

• Sequence once from each end (5’ and 3’)

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTATTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
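The two EST reads above are single-pass reads, so some positions are called as N (base not determined). A minimal Python sketch of reading such FASTA records and measuring the fraction of N calls might look like the following. The function names and the shortened example reads are hypothetical and are not part of any NCBI tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ambiguous_fraction(sequence):
    """Fraction of positions called as N (base could not be determined)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Hypothetical, shortened reads (illustration only, not real data).
example = """>clone_X 3' read
NNTCAAGTTT
TATGATTTAT
>clone_X 5' read
GACAGCATTC
"""

reads = parse_fasta(example)
for name, seq in reads:
    print(name, len(seq), ambiguous_fraction(seq))
```

The same two functions can be applied to the full IMAGE:275615 reads by pasting them into a string in place of the shortened example.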

ESTs in Entrez

Total 28 million records
Human 6.0 million
Mouse 4.3 million
Rat 0.7 million
Zebrafish 0.6 million
Wheat 0.6 million
Barley 0.3 million
Maize 0.4 million

GSS, WGS, HTG

[Figure: shotgun sequencing pipeline]

• Whole BAC insert (or genome)

• shred

• isolate clones

• sequence: GSS division or trace archive

• assembly: draft sequence (HTG division); whole genome shotgun assemblies (traditional division)

HTG Example: Honeybee Draft Sequences

• Unfinished sequences of BACs

• Gaps and unordered pieces

• Finished sequences (Phase 3) move to traditional GenBank division

LOCUS       AC141845              147720 bp    DNA     linear   HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT
            SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.

[Figure: GenBank growth, ’83 to ’06; series: sequence records (millions) and total base pairs (billions)]

Release 148: 45.2 million records, 49.4 billion nucleotides

Average doubling time ≈ 14 months
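A doubling time of about 14 months corresponds to growth by a factor of 2^(t/14) after t months. The Python sketch below works that arithmetic through, assuming growth stays exponential at a constant rate, which is a simplification and not a forecast. The function names are made up for this example.

```python
def growth_factor(months, doubling_time_months=14.0):
    """Multiplicative growth after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


def projected_size(current, months, doubling_time_months=14.0):
    """Extrapolate a database size forward under the same assumption."""
    return current * growth_factor(months, doubling_time_months)


# Release 148 held 49.4 billion nucleotides (figure from the slide).
annual = growth_factor(12)
after_28_months = projected_size(49.4, 28)
print(f"growth per year: x{annual:.2f}")
print(f"after two doublings: {after_28_months:.1f} billion nucleotides")
```

Two doubling periods (28 months) quadruple the starting size, which is a quick sanity check on the formula.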

File Formats of the Sequence Databases

Each sequence is represented by a text record called a flat file.

• GenBank/GenPept (useful for scientists)

• FASTA (the simplest format)

• ASN.1 & XML (useful for programmers)

A Traditional GenBank Record: The Flatfile Format

The record has three parts: the Header (LOCUS through COMMENT), the Feature Table (FEATURES), and the Sequence (ORIGIN through //).

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

The Header

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.


Header: Locus Line

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004

• Locus name: AY182241

• Length: 1931 bp

• Molecule type: mRNA

• Division: PLN

• Modification date: 04-MAY-2004
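The fields of the LOCUS line can be pulled apart programmatically. The Python sketch below splits on whitespace, which works for the line shown here but is an assumption: real flat files are column-oriented, and some records may omit or add fields. The `Locus` type and `parse_locus_line` function are hypothetical names for this example.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule_type: str
    topology: str
    division: str
    date: str


def parse_locus_line(line):
    """Split a LOCUS line of the form shown on the slide into its fields."""
    tokens = line.split()
    # Expected tokens: LOCUS, name, length, "bp", molecule, topology, division, date
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS line layout: " + line)
    return Locus(
        name=tokens[1],
        length_bp=int(tokens[2]),
        molecule_type=tokens[4],
        topology=tokens[5],
        division=tokens[6],
        date=tokens[7],
    )


locus = parse_locus_line(
    "LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004"
)
print(locus)
```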


Header: Database Identifiers

ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057

• Accession: stable, reportable, universal

• Version: tracks changes in sequence

• GI number: NCBI internal use
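The VERSION line carries three identifiers at once: the accession, the version suffix after the dot, and the GI number. A small Python sketch of separating them, assuming the line layout shown on this slide, could be as follows. The function name is hypothetical.

```python
def parse_version_line(line):
    """Return (accession, version, gi) from a VERSION line.

    gi is None when the line carries no GI:<number> token.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line: " + line)
    accession, dot, version = tokens[1].partition(".")
    if not dot or not version.isdigit():
        raise ValueError("missing version suffix: " + tokens[1])
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return accession, int(version), gi


accession, version, gi = parse_version_line("VERSION     AY182241.2  GI:32265057")
print(accession, version, gi)
```

Here the accession stays fixed while the version suffix increases each time the sequence itself changes.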


Header: Organism

SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.

NCBI-controlled taxonomy

The Feature Table

FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"

• Coding sequence (CDS 54..1784): runs from the start codon (atg) to the stop codon (tag)

• Implied protein: the /translation qualifier

• GenPept identifiers: /protein_id and /db_xref="GI:..."
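The CDS location 54..1784 fixes the size of the implied protein. A hedged Python sketch of that arithmetic follows. It handles only simple `start..end` locations (no joins, complements, or partial markers) and assumes the CDS interval includes the stop codon. The function names are hypothetical.

```python
def parse_simple_location(location):
    """Parse a 'start..end' feature location (1-based, inclusive)."""
    start_text, sep, end_text = location.partition("..")
    if not sep:
        raise ValueError("only simple start..end locations are handled")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError("invalid interval: " + location)
    return start, end


def cds_summary(location):
    """Length of a CDS in nucleotides, codons, and amino acid residues."""
    start, end = parse_simple_location(location)
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    codons = length // 3
    # Assumption: the last codon is the stop codon and is not translated.
    return {"length_nt": length, "codons": codons, "residues": codons - 1}


def extract(sequence, location):
    """Slice a sequence using 1-based inclusive coordinates."""
    start, end = parse_simple_location(location)
    return sequence[start - 1:end]


# First ORIGIN line of AY182241, with the spacing removed.
first_line = (
    "ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat"
).replace(" ", "")

summary = cds_summary("54..1784")
start_codon = extract(first_line, "54..56")
print(summary, start_codon)
```

Position 54 of the sequence is where the `atg` start codon begins, which matches the CDS start in the feature table.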

The Sequence: 99.99% Accurate

ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga

     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
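An accuracy of 99.99% is the same figure as roughly 1 error per 10,000 bp. The Python sketch below turns that rate into an expected error count for a record of a given length. It assumes errors are independent and uniformly likely at every position, which is a simplification made only for this example.

```python
def expected_errors(length_bp, error_rate=1e-4):
    """Expected number of erroneous bases in a record of the given length."""
    return length_bp * error_rate


def prob_error_free(length_bp, error_rate=1e-4):
    """Probability that every base is correct, assuming independent errors."""
    return (1.0 - error_rate) ** length_bp


# The AY182241 record is 1931 bp long.
print(f"expected errors: {expected_errors(1931):.4f}")
print(f"probability of no errors: {prob_error_free(1931):.3f}")
```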

GenPept: FASTA format

>gi|32265058|gb|AAO22848.2| (E,E)-alpha-farnesene synthase [Malus x domestica]
MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWKNDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLFEKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLENHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHSLELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWWANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGSEEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLTKVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMADFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIKGMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHILSLLFQPLVN

>gi|32265070|gb|AAP75563.1| putative doublecortin domain-containing protein
MAKTGAEDHREALSQSSLSLLTEAMEVLQQSSPEGTLDGNTVNPIYKYILNDLPREFMSSQAKAVIKTTDDYLQSQFGPNRLVHSAAVSEGSGLQDCSTHQTASDHSHDEISDLDSYKSNSKNNSCSISASKRNRPVSAPVGQLRVAEFSSLKFQSARNWQKLSQRHKLQPRVIKVTAYKNGSRTVFARVTAPTITLLLEECTEKLNLNMAARRVFLADGKEALEPEDIPHEADVYVSTGEPFLNPFKKIKDHLLLIKKVTWTMNGLMLPTDIKRRKTKPVLSIRMKKLTERTSVRILFFKNGMGQDGHEITVGKETMKKVLDTCTIRMNLNLPARYFYDLYGRKIEDISKGKH
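The FASTA definition lines above pack several identifiers into a pipe-delimited prefix (`gi|number|gb|accession|`) followed by a free-text description. A Python sketch of unpacking that style of header is given below. It assumes the tag/value pairing seen on this slide, and the function name is hypothetical.

```python
def parse_ncbi_defline(defline):
    """Split a 'gi|...|gb|...| description' header into ids and description."""
    text = defline.lstrip(">")
    id_part, _, description = text.partition(" ")
    fields = id_part.split("|")
    ids = {}
    # Fields alternate: database tag, then identifier.
    for i in range(0, len(fields) - 1, 2):
        if fields[i]:
            ids[fields[i]] = fields[i + 1]
    return ids, description.strip()


ids, description = parse_ncbi_defline(
    ">gi|32265058|gb|AAO22848.2| (E,E)-alpha-farnesene synthase [Malus x domestica]"
)
print(ids)
print(description)
```

The `gb` value here, AAO22848.2, is the same protein_id that appears in the CDS feature of the nucleotide record.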

Abstract Syntax Notation: ASN.1

Seq-entry ::= set {
  class nuc-prot ,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds." ,
    source {
      org {
        taxname "Malus x domestica" ,
        common "cultivated apple" ,
        db {
          {
            db "taxon" ,
            tag id 3750 } } ,
        orgname {
          name binomial {
            genus "Malus" ,
            species "x domestica" } ,
          mod {
            {
              subtype cultivar ,
              subname "'Law Rome'" } ,
            {
              subtype old-name ,
              subname "Malus domestica" ,
              attrib "(10)cultivar='Law Rome'" } } ,
          lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus" ,
          gcode 1 ,,

FASTA Nucleotide

FASTA Protein

GenPept

GenBank

ASN.1

Database 2: RefSeq NCBI’s Derivative Sequence Database

What is RefSeq?

• Curated transcripts and proteins (NM_, NP_)

– reviewed

– human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more

• Model transcripts and proteins (XM_, XP_)

• Assembled Genomic Regions (contigs) (NT_, NW_)

– human genome

– mouse genome

– rat genome

• Chromosome records (NC_)

– human genome

– microbial

– organelle

ftp://ftp.ncbi.nih.gov/refseq/release/

srcdb_refseq[Properties]

RefSeq Benefits

• non-redundancy

• explicitly linked nucleotide and protein sequences

• updates to reflect current sequence data and biology

• data validation

• format consistency

• distinct accession series

• stewardship by NCBI staff and collaborators

RefSeq Curation Processes

• Curated genomic DNA (NC, NT, NW)

• Curated model mRNA (XM) (XR)

• Curated mRNA (NM) (NR)

• Model protein (XP)

• Protein (NP)

Scanning....

RefSeq Accession Numbers

mRNAs and Proteins

NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA

Gene Records

NG_123456 Reference Genomic Sequence

Chromosome

NC_123455 Microbial replicons, organelle, viral genomes, human chromosomes

Assemblies

NT_123456 Contig
NW_123456 WGS Supercontig
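Because RefSeq uses a distinct accession series, the two-letter prefix before the underscore is enough to tell what kind of record an accession names. A minimal Python sketch of that lookup, covering only the prefixes listed on this slide, might be as follows. The table and function names are made up for the example.

```python
# Prefix table transcribed from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Chromosome (microbial replicons, organelle, viral genomes, human chromosomes)",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if not recognized."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # GenBank accessions such as AY182241 have no underscore
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000121", "XP_123456", "AY182241"):
    print(acc, "->", classify_refseq(acc))
```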

From GenBank to RefSeq

NM_000121: Sequence Revision History

Database 3: UniGene NCBI’s Derivative EST Database

UniGene

• Records are clusters of mRNAs and ESTs that ideally represent single genes

• Records are created automatically by a modified BLAST algorithm

• UniGene provides a means to identify an EST or unannotated mRNA

Clustering Expressed Sequences

Gene-oriented clusters of expressed sequences

• Automatic clustering using MegaBlast

• Each cluster represents a unique gene

• Informed by genome hits

• Information on tissue types and map locations

• Useful for gene discovery and selection of mapping reagents
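The idea of grouping expressed sequences into gene-oriented clusters can be illustrated with a toy example. The Python sketch below is not UniGene's actual procedure: it assumes that pairwise similarity hits (for instance from MegaBlast) have already been computed and simply merges any two sequences that share a hit, using a union-find structure. All sequence names are hypothetical.

```python
def cluster_by_hits(sequence_ids, hits):
    """Group sequences into clusters: any two linked by a hit share a cluster."""
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical mRNAs and ESTs with hypothetical pairwise hits.
ids = ["mRNA1", "est5a", "est3a", "est5b", "mRNA2"]
hits = [("mRNA1", "est5a"), ("est3a", "mRNA1"), ("mRNA2", "est5b")]
clusters = cluster_by_hits(ids, hits)
print(clusters)
```

Merging on any single hit is the simplest possible rule. A real pipeline would also need to guard against chimeric clones and repeats joining unrelated genes.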

UniGene

unique gene

A Cluster of ESTs

query

5’ EST hits

3’ EST hits

UniGene Collections

Example UniGene Cluster

Histogram of cluster sizes for UniGene Hs Build 177

(Now at Build #186)

UniGene Cluster Hs.95351

SELECTED PROTEIN SIMILARITIES

UniGene Cluster Hs.95351

GENE EXPRESSION

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs

Download sequences

web page

ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Database 4: MMDB

NCBI’s derivative protein structure database

Indexing into MMDB

Structure

id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,

Add secondary structure

inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,

Add chemical bonds

• Import only experimentally determined structures

• Convert to ASN.1

• Verify sequences

• Create “backbone” model (Cα, P only)

• Create single-conformer model

MMDB: Molecular Modeling Database

Structure Summary

Cn3D viewer

Conserved Domains

3D Domain Neighbors

Structure Neighbors

Cn3D 4.1: C-Src

Cn3D 4.1: Structural Alignment

Casein kinase S. pombe

Src Kinase H. sapiens

Conserved ATP binding site

Cn3D: Simple Homology Modeling

human

swordtail

NCBI CD: Tyrosine Kinase

Using Cn3D to model domains

Submitting a PDB File to VAST

• Choose the file format

• Remove all lines except ATOM

This is the best way to convert PDB files to MMDB format for viewing with Cn3D!

Database 5: GEO

NCBI’s Gene Expression Omnibus

• GPL: Platform descriptions (submitted by manufacturer*)

• GSM: Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)

• GSE: Grouping of slide/chip data, “a single experiment” (submitted by experimentalists)

• GDS: Grouping of experiments (curated by NCBI)

• Entrez access: Entrez GEO and Entrez GEO Datasets

GEO SaMple:

experimental

conditions

GEO SEries:

set of related

samples

What’s a DataSet?

• Platform (GPL): array definition

• Sample (GSM): hyb. measurements

• Series (GSE): related Samples

Platforms, Samples, and Series are supplied by the submitter.

DataSet (GDS), assembled by GEO staff:

• A collection of experimentally related samples processed using the same platform.

• Samples within DataSets are organized into subgroups based on experimental variables.

• DataSets form the basis of GEO’s query, analysis and data display tools.
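The two defining properties of a DataSet (one platform for all samples, and subgroups by experimental variable) can be sketched as a small data model. The Python classes below are an illustration only and do not reflect GEO's actual schema. The accession strings and the variable name "agent" are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    accession: str          # a GSM-style identifier
    platform: str           # a GPL-style identifier
    variables: tuple = ()   # pairs such as (("agent", "control"),)


@dataclass
class DataSet:
    accession: str          # a GDS-style identifier
    platform: str
    samples: list = field(default_factory=list)

    def add(self, sample):
        # Every sample in a DataSet must come from the same platform.
        if sample.platform != self.platform:
            raise ValueError(
                f"{sample.accession} uses {sample.platform}, expected {self.platform}"
            )
        self.samples.append(sample)

    def subsets(self, variable):
        """Group sample accessions by their value for one experimental variable."""
        groups = defaultdict(list)
        for sample in self.samples:
            value = dict(sample.variables).get(variable)
            groups[value].append(sample.accession)
        return dict(groups)


gds = DataSet("GDS_example", "GPL_example")
gds.add(Sample("GSM_a", "GPL_example", (("agent", "control"),)))
gds.add(Sample("GSM_b", "GPL_example", (("agent", "treated"),)))
gds.add(Sample("GSM_c", "GPL_example", (("agent", "treated"),)))
print(gds.subsets("agent"))
```

Rejecting a sample from a different platform at insertion time keeps the "same platform" rule from being violated later, when values across samples are compared.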

Gene Expression Omnibus

Dataset browser

http://www.ncbi.nlm.nih.gov/geo/gds/gds_browse.cgi

GEO Dataset Browser

GEO Dataset Report

GEO Profiles… of 12625

Database 6: CDD

NCBI’s Derivative Conserved Domain Database

Entrez CDD

Conserved Domain Database

• Multiple sequence alignments

• Position-specific scoring matrices (PSSM)

• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)

CDD

>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE

CDD

CD

Pfam

COG

Click on a colored bar to align your sequence to the CD

Conserved Domain Database: cd00371.1, HMA

CDART: Conserved Domain Architecture Retrieval Tool

Database 7: NCBI Genome Map

Viewing Complex Genomes

• Map Viewer Home Page

– Shows all supported organisms

– Provides links to genomic BLAST

• Genome Overview Page

– Provides links to individual chromosomes

– Shows hits on a genome graphically

• Chromosome Viewing Page

– Allows interactive views of annotation details

– Provides numerous maps unique to each genome

NCBI Map Viewer

The Map Viewer

Genome BLAST

Map Viewer: Human MLH1

Customizable

NCBI Assembly

EST Hits

Gene Annotations

Models

Transcripts

Maps and Options

Mapped Variations

MLH1 Synteny: Mammalian Genomes

Many Other NCBI Databases…

Other Specialized Databases

• Gene Symbol Database (HUGO Gene Nomenclature): http://www.gene.ucl.ac.uk/nomenclature/

• KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway: http://www.genome.ad.jp/kegg/

• EPD (Eukaryote Promoter Database): http://www.epd.isb-sib.ch/

• Transcription Factor Database (TRANSFAC): http://transfac.gbf.de/TRANSFAC/

• Many organism-specific databases, e.g., FlyBase (http://flybase.bio.indiana.edu/) and BeeBase (http://racerx00.tamu.edu/bee_resources.html)

• …

Access Databases through Entrez

Accessing the Data in Entrez

• Web Tools

– Batch Entrez

• Upload a file of GI or accession numbers to retrieve sequences

– Batch Citation Matcher

• Send citation information to Entrez and retrieve PubMed IDs for linking, citation display or other applications

– Advanced Entrez Searching

• Advanced searching techniques for Web Entrez

– My NCBI

• Includes automatic e-mailing of search updates and filters for search results

• Requires a username and password to access stored searches

• Programming Tools

– E-Utilities

• Run Entrez queries and download data from your own scripts over the Web

– Linking to Entrez

• Link to specific Entrez pages from your own web pages or applications

– Entrez Client/Server

• C language library for embedding Entrez calls into your programs
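As an illustration of the E-Utilities item above, the Python sketch below only builds request URLs for a search (ESearch) and a record download (EFetch); it does not send them. The base URL and the `nuccore` database name reflect the service as it is currently documented by NCBI, which may differ from what was available at the time of this lecture, and the helper function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the present-day E-utilities endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """URL for an Entrez search returning up to `retmax` record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_BASE + "esearch.fcgi?" + query


def efetch_url(db, ids, rettype, retmode="text"):
    """URL for downloading the records named by `ids` in a given format."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return EUTILS_BASE + "efetch.fcgi?" + query


# Search the nucleotide database for RefSeq records, then fetch one flat file.
search = esearch_url("nuccore", "srcdb_refseq[Properties]", retmax=5)
fetch = efetch_url("nuccore", ["AY182241"], rettype="gb")
print(search)
print(fetch)
```

A script would pass these URLs to an HTTP client such as `urllib.request.urlopen`. NCBI asks automated clients to limit their request rate, so any real use should follow the current usage guidelines.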

Entrez: Web Access

Interface: Global Entrez

Default search: Against all databases in Entrez

Target database: Adjustable using the pull-down menu

/**************************************************************************
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
***************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Toolbox Sources

ftp> open ftp.ncbi.nih.gov
..
ftp> cd toolbox
ftp> cd ncbi_tools

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

NCBI Toolbox

Challenges in Bioinformatics


How can we help biologists manage and exploit all this rapidly growing, heterogeneous, and inaccurate information both efficiently and effectively?
