Introduction to databases, part 2

Shifra Ben-Dor, Irit Orr

And now, for the molecules and databases...

• DNA
• RNA
• Protein

DNA sequences

• Genes are encoded in genomic sequences.
• Genes are transcribed into mRNAs (including coding, intronic, 5’ and 3’ untranslated regions).
• mRNAs are spliced (introns removed) and translated into proteins.
• mRNAs are copied to cDNAs.
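As a rough illustration of the flow in the bullets above (genomic sequence, splicing, translation), the following Python sketch splices a toy gene and translates it. The toy sequence, the exon coordinates and the function names are invented for this example; only the standard genetic code table is real.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def splice(genomic, exons):
    """Join the exon intervals (0-based, end-exclusive) of a genomic sequence."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(sequence):
    """Translate in frame from position 0 until the first stop codon."""
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: exon 1, one intron (GT...AG), exon 2.
toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
toy_exons = [(0, 6), (18, 24)]

mrna = splice(toy_gene, toy_exons)
print(mrna)                 # ATGGCCAAATGA
print(translate(mrna))      # MAK
print(translate(toy_gene))  # MAVSFQK (intron translated by mistake)
```

The last line shows why the distinction between genomic DNA and spliced mRNA matters: translating the unspliced sequence gives a different, wrong protein.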

[Figure: gene structure shown at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG and Stop, PolyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop and PolyA site. mRNA: Cap, 5’ UTR, CDS (ATG to Stop across exons 1-4), 3’ UTR, PolyA.]

Modified from Zhang MQ Nat Rev Genet. 2002 Sep;3(9):698-709.

International DNA databases

GenBank at NCBI http://www.ncbi.nlm.nih.gov/

EMBL at EBI http://www.ebi.ac.uk/embl/

DDBJ in Japan http://www.ddbj.nig.ac.jp/

Data sources for DNA databases

• Direct scientist submission
• Genome sequencing labs and groups
• Scientific literature
• Patent applications

• EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.

International DNA databases

All of these databases:
• Are updated with a full release every 2-3 months.
• Have weekly (or daily) updates between releases.
• Are divided into sublibraries for easier searching.

DNA database divisions

• PRI - primate (human, monkey)
• ROD - rodent (mouse, rat)
• MAM - other mammalian (bovine, cat)
• VRT - other vertebrate (chicken)
• INV - invertebrate
• PLN - plant, fungal, and algal
• BCT - bacteria
• VRL - viruses
• PHG - bacteriophage
• SYN - synthetic (plasmids, vectors)
• UNA - unannotated sequences
• PAT - patent sequences

• EST - Expressed Sequence Tags
• STS - Sequence Tagged Sites
• GSS - Genome Survey Sequences
• HTG - High Throughput Genomic Sequences
• HTC - High Throughput cDNA Sequences

Genomic databases

• Specialized resources that are:
– Species specific
– Sequencing technique specific

• Display whole chromosomes (not a specific sequence).

Sources of mRNAs

• Experimental
– Clone new gene
– Clone gene from database
– 2 hybrid system

• Database
– “Typical” cDNA
– Full length cDNA
– EST

[Figure: an mRNA (5’ mG cap to AAAA tail) shown alongside a full length cDNA and a typical cDNA, with the primers (including TTTT primers at the AAAA tail) used to copy them.]

Sources of mRNAs (with accession number prefixes)

• Individual labs: various
• RefSeq: NM
• Kazusa (KIAA): D, AB

Full length sequencing projects:
• Riken, NEDO (FLJ), HRI, DKFZ, Genoscope...: AK, CR
• MGC: BC

RefSeq from NCBI (Reference Sequence database)

✵ Definition
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

RefSeq from NCBI

• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with reviewed records indicated

Curated RefSeq records: Status

• Reviewed records represent a compilation of our current knowledge of a gene and its transcripts.

• The RefSeq COMMENT block indicates the Status of the record and the GenBank sequence data that was used to provide the record.

• In addition, the COMMENT may identify a collaboration which supplied the defining sequence information for the genome, gene, or protein.

The level of curation may differ between different collaborating groups.

RefSeq

✵ Status Codes: RefSeq records are provided with a status code which provides an indication of the level of review a RefSeq record has undergone.

• Reviewed
• Provisional
• Predicted
• Genome Annotation
• Validated
• Model
• Inferred
• WGS

STATUS: Definition

PREDICTED: The RefSeq record is predicted and has not been subject to individual review. The transcript may represent an ab initio prediction or may be partially supported by other transcript data; in both cases, the protein is predicted.

PROVISIONAL: The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.

REVIEWED: The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes reviewing available sequence data and frequently also includes a review of the literature and other sources of information.

STATUS: Definition

GENOME ANNOTATION: This identifies RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing and are not subject to individual review or revision between builds.

INFERRED: The RefSeq record is inferred by genome sequence analysis. There is no same-organism experimental support for the full extent of the sequence; there may be some level of support by homology.

MODEL: The RefSeq record is predicted by genome sequence analysis. The record may represent an ab initio prediction, or may have some level of transcript or homology support.

STATUS: Definition

VALIDATED: The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.

WGS: The RefSeq record represents a collection of whole genome shotgun (WGS) sequences. This status code is applied to genomic records.

Accession Format    Molecule Type

NC_123456    Complete Genome, Complete Chromosome, Complete Sequence
NG_123456    Genomic Region
NM_123456    mRNA
NP_123456    Protein
NT_123456    Genomic Contig (from BACs)
NW_123456    Genomic Contig (from WGS)
XM_123456    mRNA (taken from genomic seq)
XR_123456    RNA (taken from genomic seq)
XP_123456    Protein (taken from genomic seq)
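Because the two-letter prefix carries the molecule type, a RefSeq accession can be classified with a simple lookup. The sketch below is only an illustration of the table above: the function name and the regular expression are assumptions of this example, and the optional ".N" version suffix it tolerates is not shown on the slide.

```python
import re

# Prefix-to-type mapping copied from the accession format table.
REFSEQ_PREFIXES = {
    "NC": "Complete genome, chromosome or sequence",
    "NG": "Genomic region",
    "NM": "mRNA",
    "NP": "Protein",
    "NT": "Genomic contig (from BACs)",
    "NW": "Genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic seq)",
    "XR": "RNA (taken from genomic seq)",
    "XP": "Protein (taken from genomic seq)",
}

# Two capital letters, an underscore, digits, and an optional version (assumed).
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


print(classify_refseq("NM_123456"))    # mRNA
print(classify_refseq("XP_123456.2"))  # Protein (taken from genomic seq)
print(classify_refseq("BC012345"))     # None (no underscore, so not RefSeq-style)
```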

What is the difference between RefSeq and GenBank?

GenBank is:
• An archival database that includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects.
• Accession numbers are assigned to these submitted sequences.
• Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage.
• As an archival database, GenBank is very redundant for some loci.
• Sequence records are owned by the original submitter and cannot be altered by a third party.

What is the difference between RefSeq and GenBank?

RefSeq is:
• Sequences are derived from GenBank and provide non-redundant curated data.
• Records represent current knowledge.
• RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.
• Some records include additional sequence information that was never submitted to an archival database but is available in the literature.
• Some sequence records are provided through collaboration, and thus may not be available in any one GenBank record.
• RefSeq sequences are not submitted primary sequences.

Kazusa

Large cDNA inserts (> 4 kb).

Determined the complete base sequences of approximately 2000 previously undiscovered cDNAs from KG-1 cells and brain tissue, with an average length of 5 kb.

Database: HUGE (Human Unidentified Gene-Encoded protein database) http://www.kazusa.or.jp/huge

NEDO

Full Length mRNA Sequencing

NEDO

~160,000 clones were isolated from more than 20 full-length enriched human cDNA libraries made by the "Oligo-capping" method. Their 5' end sequences were determined.

About 10,000 putatively full-length cDNAs were selected using these sequence data, and the entire sequence of the selected clones was determined. The NEDO project aims to determine the sequence of an additional 20,000 full-length cDNA clones.

RIKEN Mouse Genome Encyclopedia

A project to sequence full length mouse cDNAs.

Over 21,000 genes sequenced from oligo capped libraries from about 200 tissues and cell types.

Set standards for annotation with FANTOM (Functional Annotation Of Mouse).

MGC - Mammalian Gene Collection

The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human and mouse gene. MGC has produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue. 5' EST reads are generated from each library. Several algorithms are applied to select putative full ORF clones.

Sources of mRNAs

• Experimental
– Clone new gene
– Clone gene from database
– 2 hybrid system

• Database
– “Typical” cDNA
– Full length cDNA
– EST

RNA

RNA, cDNA, and ESTs

[Figure: an mRNA made of exon 1, exon 2 and exon 3, the cDNA clone copied from it, and ESTs derived from the cDNA clone.]

GenBank ESTs (Expressed Sequence Tags):
~ 7,900,000 human ESTs
~ 4,700,000 mouse ESTs

Adapted with permission from Adam Sartiel

Uses of ESTs

- prediction of coding regions
- detection of alternative splicing
- clustering to form “genes”

Problems with clustering:
- incomplete coverage breaks genes up
- gene families
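The "incomplete coverage breaks genes up" problem can be shown with a deliberately simplified sketch. Real EST clustering compares the sequences themselves; here it is assumed, for illustration only, that each EST has already been placed on a common coordinate system, and the read names and positions are invented.

```python
def cluster_by_overlap(ests, min_overlap=1):
    """Group (name, start, end) reads whose intervals overlap.

    Coordinates are 0-based and end-exclusive. A read joins the current
    cluster only if it overlaps it by at least `min_overlap` bases.
    """
    clusters = []
    for name, start, end in sorted(ests, key=lambda est: est[1]):
        if clusters and start <= clusters[-1]["end"] - min_overlap:
            clusters[-1]["members"].append(name)
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            clusters.append({"start": start, "end": end, "members": [name]})
    return clusters


# Hypothetical reads from ONE 2000-base transcript; nothing covers 700-1200.
reads = [
    ("est1", 0, 400),
    ("est2", 300, 700),
    ("est3", 1200, 1600),
    ("est4", 1500, 2000),
]

for cluster in cluster_by_overlap(reads):
    print(cluster["start"], cluster["end"], cluster["members"])
# 0 700 ['est1', 'est2']
# 1200 2000 ['est3', 'est4']
```

Although all four reads come from a single transcript, the uncovered stretch splits them into two clusters, which would then be counted as two "genes".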

Problems with ESTs

- low copy number genes
- rare tissues
- mistakes
- enrichment of 3’ ends of genes
- incomplete coverage of genes

• With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and information is added when available. Entrez Gene can be considered as the successor to LocusLink, with the major differences being in greater scope (more of the genomes represented by NCBI Reference Sequences or RefSeqs) and in being integrated for indexing and query in NCBI's Entrez system.

Entrez Gene at NCBI, the successor to LocusLink

Entrez Gene - A database for gene-specific information. It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.

The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers.

Entrez Gene at NCBI

The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available.

Entrez Gene data is used by other NCBI resources such as: BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS and NCBI's genome annotation pipeline.

An Entrez Gene card will contain the following information:

• Full Report (default) includes diagrams and text to represent some of what is known about a gene.
• Links to other resources within NCBI or external to NCBI.
• The Title
• The Navigation menu
• The Summary
• Genome regions, transcripts and products
• Genome Context
• Bibliography
• Interactions
• Alleles
• General Gene Information

An Entrez Gene card will contain the following information (continued):

• General Protein Information
• RefSeq
• Related Sequences
• Additional links

Data reliability in databases

• The huge amount of data collected in databases presents a lot of problems:
– Data accuracy
– Sequence redundancy
– Inconsistent nomenclature
– Inaccurate annotation
– Sequence contamination (vectors, bacterial)

Data reliability in databases

• The database staff notify the authors that an error (or contamination) was detected in their sequence entry.
• However, it takes time to correct the data.
• Meanwhile the error is propagated, because many of the proteins in the protein database are translated from the DNA sequence database.

Data reliability in databases

• A lot of the sequences in the database are quite “old”. They have not been updated since they were submitted, even though technology and data have advanced considerably.

Taxonomy Databases

• An international effort is under way across all sequence databases to create a unified taxonomic tag for the sequences submitted.

Problem: each sequence depositor gives “his” name for the species.

Solution: Unified taxonomy ID
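A minimal sketch of what a unified taxonomy ID buys you: different depositor spellings collapse onto one number. The synonym table and the function name are assumptions of this example; the numeric IDs used (9606 for human, 10090 for mouse) are the NCBI Taxonomy IDs for those species.

```python
# Hypothetical synonym table; a real one would come from a taxonomy database.
TAXONOMY_IDS = {
    "homo sapiens": 9606,
    "h. sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "house mouse": 10090,
    "mouse": 10090,
}


def taxonomy_id(organism_name):
    """Map a free-text organism name to a taxonomy ID, or None if unknown."""
    normalized = " ".join(organism_name.lower().split())
    return TAXONOMY_IDS.get(normalized)


print(taxonomy_id("Human"))            # 9606
print(taxonomy_id("  Homo  sapiens"))  # 9606
print(taxonomy_id("Mouse"))            # 10090
print(taxonomy_id("yeti"))             # None
```

Records tagged with the ID can then be grouped by species regardless of the name the depositor typed.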

HUGO Gene Nomenclature Committee

• This committee is responsible for the approval of a unique symbol for each gene.

• It also designs a longer and more descriptive name.

• The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.

• However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl).

Gene symbols

Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers.

Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.

Ideally symbols should be no longer than six characters in length.

Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
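The two mechanical rules above (upper-case letters optionally combined with Arabic numbers, and ideally at most six characters) can be checked with a few lines of code. This is only a sketch: the function name is invented, and requiring the symbol to start with a letter is an assumption of this example rather than something stated on the slide.

```python
import re

# Upper-case Latin letters and digits; a leading letter is assumed here.
SYMBOL_RE = re.compile(r"[A-Z][A-Z0-9]*")


def check_symbol(symbol, max_length=6):
    """Return a list of guideline problems; an empty list means none found."""
    problems = []
    if not SYMBOL_RE.fullmatch(symbol):
        problems.append("not upper-case letters and Arabic numbers")
    if len(symbol) > max_length:
        problems.append("longer than %d characters" % max_length)
    return problems


print(check_symbol("ABCA1"))     # []
print(check_symbol("nrg1"))      # ['not upper-case letters and Arabic numbers']
print(check_symbol("ABCDEFG1"))  # ['longer than 6 characters']
```

The length rule is a recommendation ("ideally"), so a tool like this would more sensibly warn than reject.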

Gene Symbols

Symbol: ABCA1
Full name: ATP-binding cassette, sub-family A (ABC1), member 1
Cytogenetic Location: 9q31
MIM Number: 600046
PubMed ID: 8088782

Protein databases

• There are many different protein databases containing different types of information:
– Primary amino acid sequence
– Secondary structure
– 3D structure
– Protein family domains
– Consensus active sites

Sources of Protein

• Proteins that have been worked on experimentally

• mRNA whose product has been worked on experimentally (no actual protein sequencing done)

• Translated DNA (mRNA) sequences

Protein Primary Sequence Databases

• Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.

• Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

• The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, classification, and cross-reference.

• The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.

• The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

Swiss-Prot Database (primary database)

• Swiss-Prot annotation includes:
– Description of protein function
– Protein domain structure
– Post-translational modifications
– Protein variants

• Sequence entries are composed of different line-types, each with their own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.

Swiss-Prot Database

Swiss-Prot differs from other protein databases by the following criteria:

• Annotation
• Minimal Redundancy
• Integration with other databases

Swiss-Prot Database

Annotation

In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.

The core data consists of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).

The annotation consists of the description of:

• Function(s) of the protein
• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, etc.
• Secondary structure

The annotation consists of the description of (continued):

• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(s) of/in the protein
• Sequence conflicts, variants, etc.

Swiss-Prot Database

To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.

Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.

Swiss-Prot Database

Minimal Redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. Swiss-Prot tries as much as possible to merge all these data so as to minimize the redundancy of the database.

If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

Swiss-Prot Database

Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.

Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.

TrEMBL database

• TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.

• TrEMBL contains entries not yet integrated in Swiss-Prot.

NR database (primary databases from NCBI!)

• The NR Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to PIR, Swiss-Prot, PRF, PDB (sequences from solved structures).

Data reliability in Protein databases

• About 30% of the proteins in the databases have erroneous sequences due to:
– Missing exons in the DNA translation.
– Introns mistakenly translated.

• Another common problem is the assigning of functions to “new” proteins, based on sequence similarity.

Data reliability in Protein databases

• For example:
– Protein A is similar to protein B.
– Protein B annotation is based on Protein A annotation (which has an error).
– Annotation of Protein A is corrected by the group working on it. This correction does not appear in, or carry over to, Protein B annotation.
– When Protein C and D are also based on the erroneous annotation on B, the problem...
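The A, B, C, D example can be written down as a small graph problem: each protein records where its annotation was copied from, and correcting the original leaves every downstream copy stale. The data structure and function name below are assumptions made for this sketch, not how any particular database tracks provenance.

```python
from collections import deque


def stale_after_correction(copied_from, corrected):
    """List proteins whose annotation was copied, directly or indirectly,
    from `corrected` and therefore still carries the old error."""
    children = {}
    for protein, source in copied_from.items():
        children.setdefault(source, []).append(protein)

    stale, seen, queue = [], {corrected}, deque([corrected])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # guard against circular copying
                seen.add(child)
                stale.append(child)
                queue.append(child)
    return sorted(stale)


# B was annotated from A; C and D were annotated from B.
copied_from = {"B": "A", "C": "B", "D": "B"}

print(stale_after_correction(copied_from, "A"))  # ['B', 'C', 'D']
print(stale_after_correction(copied_from, "B"))  # ['C', 'D']
```

Fixing A alone changes one record, while three others keep the error until each is revisited.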

Text searching pitfalls

• It finds exactly what you type (try pseudogene vs. psuedogene)
• Older records may have different annotation, from gene names on...
• human vs homo sapiens
• Gene symbols vs full gene name (for example neuregulin vs nrg1)
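These pitfalls are easy to reproduce with a literal substring search. The four record descriptions below are invented for illustration (including the deliberate misspelling), and the search function is a stand-in for a site's text search, not any real database's implementation.

```python
# Hypothetical record descriptions, keyed by made-up record names.
RECORDS = {
    "rec1": "Homo sapiens neuregulin 1 mRNA",
    "rec2": "human NRG1 gene, partial cds",
    "rec3": "Mus musculus psuedogene fragment",  # misspelled by the depositor
    "rec4": "Mus musculus pseudogene, complete sequence",
}


def text_search(term):
    """Return the records whose description contains `term` (case-insensitive)."""
    needle = term.lower()
    return sorted(name for name, text in RECORDS.items() if needle in text.lower())


print(text_search("pseudogene"))    # ['rec4']
print(text_search("psuedogene"))    # ['rec3']
print(text_search("human"))         # ['rec2']
print(text_search("homo sapiens"))  # ['rec1']
print(text_search("neuregulin"))    # ['rec1']
print(text_search("nrg1"))          # ['rec2']
```

Each pair of queries refers to the same thing biologically, yet each query returns a different record and misses the other.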

• Most sites use boolean operators (AND, OR, BUT NOT).

• You can add a field-specific tag, but each site has a different way of adding it to a search; for example, NCBI uses square brackets [].
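As a hedged sketch of the NCBI style, the snippet below combines two terms with AND, tags each with a field in square brackets, and builds (without sending) an E-utilities ESearch URL. The helper names are invented for this example; the `[Title]` and `[Organism]` tags and the `nuccore` database name are given as typical Entrez values and should be checked against NCBI's current documentation before use.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez-style field tag, quoting multi-word terms."""
    if " " in term:
        term = '"%s"' % term
    return "%s[%s]" % (term, field)


def build_esearch_url(query, db="nuccore"):
    """Build an ESearch URL for the query; no network request is made."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


query = " AND ".join([
    tagged("neuregulin", "Title"),
    tagged("Homo sapiens", "Organism"),
])

print(query)  # neuregulin[Title] AND "Homo sapiens"[Organism]
print(build_esearch_url(query))
```

Restricting "Homo sapiens" to the organism field avoids matching records that merely mention the species somewhere in free text.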

Remember:

Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!