Introduction to Databases part 2 - Biological computing

Introduction to Databases part 2. Shifra Ben-Dor, Irit Orr


Page 1:

Introduction to Databases part 2

Shifra Ben-Dor, Irit Orr

Page 2:

And now, for the molecules and databases...

•  DNA

•  RNA

•  Protein

Page 3:

DNA sequences

•  Genes are encoded in genomic sequences.

•  Genes are transcribed into pre-mRNAs (including coding, intronic, 5’ and 3’ untranslated regions).

•  Pre-mRNAs are spliced (introns removed) to give mRNAs, which are translated into proteins.

•  mRNAs are copied to cDNAs.
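
The flow on this slide (pre-mRNA, splicing, translation) can be sketched in a few lines of Python. This is only a toy illustration: the sequences and exon coordinates are invented, the sequence is written with DNA letters as it would appear in a cDNA record, and the helper names `splice` and `translate` are made up for this example.

```python
# Toy illustration: pre-mRNA -> spliced mRNA -> protein.
# All sequences and coordinates below are invented for illustration.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive), dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the coding sequence
            break
        protein.append(amino_acid)
    return "".join(protein)


exon1, intron, exon2 = "GGATGGC", "GTAAGTTTTCAG", "CAAATAACC"
pre_mrna = exon1 + intron + exon2
exons = [(0, 7), (19, 28)]          # positions of exon1 and exon2 in pre_mrna
mrna = splice(pre_mrna, exons)      # 5' UTR + CDS + 3' UTR
protein = translate(mrna)
print(mrna, protein)
```

The bases before the ATG and after the stop codon stay in the spliced mRNA; they correspond to the 5’ and 3’ untranslated regions.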

Page 4:

[Figure: gene structure at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG and Stop, PolyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop and PolyA site. mRNA: Cap, 5’ UTR, CDS (ATG to Stop, exons 1-4), 3’ UTR, PolyA.]

Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.

Page 5:

International DNA databases

•  GenBank at NCBI  http://www.ncbi.nlm.nih.gov/

•  EMBL at EBI  http://www.ebi.ac.uk/embl/

•  DDBJ in Japan  http://www.ddbj.nig.ac.jp/

Page 6:

Data sources for DNA databases

•  Direct scientist submission

•  Genome sequencing labs and groups

•  Scientific literature

•  Patent applications

•  EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.

Page 7:

International DNA databases

All of these databases:

•  Have official releases every 2-3 months.

•  Have weekly (or daily) updates.

•  Are divided into sublibraries for easier searching.

Page 8:

DNA database divisions

•  PRI - primate (human, monkey)
•  ROD - rodent (mouse, rat)
•  MAM - other mammalian (bovine, cat)
•  VRT - other vertebrate (chicken)
•  INV - invertebrate
•  PLN - plant, fungal, and algal
•  BCT - bacteria
•  VRL - viruses
•  PHG - bacteriophage
•  SYN - synthetic (plasmids, vectors)
•  UNA - unannotated sequences
•  PAT - patent sequences
•  EST - Expressed Sequence Tags
•  STS - Sequence Tagged Sites
•  GSS - Genome Survey Sequences
•  HTG - High Throughput Genomic Sequences
•  HTC - High Throughput cDNA Sequences
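
As a small sketch of how these division codes might be used in practice, the snippet below looks up the division named on a GenBank-style LOCUS line. The helper name `division_of` is hypothetical, and the approach (scanning the whitespace-separated tokens after the locus name) is a simplification that assumes the division appears as its own token.

```python
# Division codes from the slide above.
DIVISIONS = {
    "PRI": "primate", "ROD": "rodent", "MAM": "other mammalian",
    "VRT": "other vertebrate", "INV": "invertebrate",
    "PLN": "plant, fungal, and algal", "BCT": "bacteria", "VRL": "viruses",
    "PHG": "bacteriophage", "SYN": "synthetic", "UNA": "unannotated",
    "PAT": "patent", "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Sites", "GSS": "Genome Survey Sequences",
    "HTG": "High Throughput Genomic Sequences",
    "HTC": "High Throughput cDNA Sequences",
}


def division_of(locus_line):
    """Return (code, description) for the division on a LOCUS line, or None."""
    tokens = locus_line.split()
    # Skip the "LOCUS" keyword and the locus name, which could look like a code.
    for token in tokens[2:]:
        if token in DIVISIONS:
            return token, DIVISIONS[token]
    return None


example = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
print(division_of(example))
```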

Page 9:

Short Read and Trace Archives

The output of large scale sequencing projects and next-generation sequencing is stored in separate databases. NCBI is phasing out the SRA, but the data will be available in GEO, the database for microarray results.

Page 10:

Genomic databases

•  Specialized resources that are:
  – Species specific
  – Sequencing technique specific

•  Display whole chromosomes (not a specific sequence).

Page 11:

Sources of mRNAs

•  Experimental
  – Clone new gene
  – Clone gene from database
  – 2 hybrid system, RNA-Seq...

•  Database
  – “Typical” cDNA
  – Full length cDNA
  – EST

Page 12:

[Figure: an mRNA (5’ mG cap, AAAA tail) shown alongside a full length cDNA and a typical cDNA, with primer, TTTT and AAAA labels.]

Page 13:

REFSEQ at NCBI (Reference Sequence database)

✵  Definition

  The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Page 14:

REFSEQ from NCBI

•  non-redundancy

•  explicitly linked nucleotide and protein sequences

•  updates to reflect current knowledge of sequence data and biology

•  data validation and format consistency

•  distinct accession series

•  ongoing curation by NCBI staff and collaborators, with reviewed records indicated

Page 15:

RefSeq record Status

•  The RefSeq COMMENT block indicates the Status of the record and the GenBank sequence data that was used to provide the record.

•  In addition, the COMMENT may identify a collaboration which supplied the defining sequence information for the genome, gene, or protein.

The level of curation may differ between different collaborating groups.

Page 16:

RefSeq

•  Reviewed*
•  Provisional
•  Predicted
•  Genome Annotation
•  Validated*
•  Model
•  Inferred
•  WGS

✵ Status Codes: RefSeq records are provided with a status code which provides an indication of the level of review a RefSeq record has undergone.

*Curated

Page 17:

STATUS: Definition

REVIEWED: The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes reviewing available sequence data and frequently also includes a review of the literature and other sources of information.

VALIDATED: The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.

PROVISIONAL: The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.

Page 18:

STATUS: Definition

PREDICTED: The RefSeq record is predicted and has not been subject to individual review. The transcript may represent an ab initio prediction or may be partially supported by other transcript data; in both cases, the protein is predicted.

INFERRED: The RefSeq record is inferred by genome sequence analysis. There is no same-organism experimental support for the full extent of the sequence; there may be some level of support by homology.

MODEL: The RefSeq record is predicted by genome sequence analysis. The record may represent an ab initio prediction, or may have some level of transcript or homology support.

Page 19:

STATUS: Definition

GENOME ANNOTATION: This identifies RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing and are not subject to individual review or revision between builds.

WGS: The RefSeq record represents a collection of whole genome shotgun (WGS) sequences. This status code is applied to genomic records.
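
Since the status is carried in the COMMENT block, a script can read it from there. The sketch below assumes the COMMENT text begins with the status code followed by the word "REFSEQ" (for example "REVIEWED REFSEQ: ..."); that layout, the function names and the sample comment text are assumptions for illustration and should be checked against real records.

```python
# Status codes listed on the slides above; the two starred (curated) ones are
# collected separately.
STATUS_CODES = (
    "REVIEWED", "VALIDATED", "PROVISIONAL", "PREDICTED",
    "INFERRED", "MODEL", "GENOME ANNOTATION", "WGS",
)
CURATED = {"REVIEWED", "VALIDATED"}


def refseq_status(comment):
    """Return the status code a COMMENT text starts with, or None.

    Assumes the comment begins with '<STATUS> REFSEQ'.
    """
    text = " ".join(comment.split()).upper()  # collapse line breaks and spacing
    for code in STATUS_CODES:
        if text.startswith(code + " REFSEQ"):
            return code
    return None


def is_curated(comment):
    """True if the record's status is one of the curated levels."""
    return refseq_status(comment) in CURATED


sample = "REVIEWED REFSEQ: This record has been curated by NCBI staff."
print(refseq_status(sample), is_curated(sample))
```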

Page 20:

Accession Format   Molecule Type

NC_123456          Complete Genome, Complete Chromosome, Complete Sequence
NG_123456          Genomic Region
NM_123456          mRNA
NR_123456          non-coding RNA
NP_123456          Protein
NT_123456          Genomic Contig (from BACs)
NW_123456          Genomic Contig (from WGS)
XM_123456          mRNA (taken from genomic seq)
XR_123456          RNA (taken from genomic seq)
XP_123456          Protein (taken from genomic seq)
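
Because the accession prefix encodes the molecule type, a lookup table is enough to classify a RefSeq accession. This is a minimal sketch built only from the table above; `classify_accession` is a hypothetical helper, and stripping a trailing version suffix such as ".1" is an added assumption.

```python
# Prefix-to-molecule-type mapping taken from the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic seq)",
    "XR": "RNA (taken from genomic seq)",
    "XP": "protein (taken from genomic seq)",
}


def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    accession = accession.strip().split(".")[0]   # drop a version suffix if present
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None                               # no underscore: not RefSeq-style
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_123456", "XP_123456.1", "AK123456"):
    print(acc, "->", classify_accession(acc))
```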

Page 21:

What is the difference between RefSeq and GenBank?

GenBank:

•  Is an archival database that includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects.

•  Accession numbers are assigned to these submitted sequences.

•  Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage.

•  As an archival database, GenBank is very redundant for some loci.

•  Sequence records are owned by the original submitter and cannot be altered by a third party.

Page 22:

What is the difference between RefSeq and GenBank?

RefSeq:

•  Sequences are derived from GenBank and provide non-redundant curated data.

•  Records represent current knowledge.

•  RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.

•  Some records include additional sequence information that was never submitted to an archival database but is available in the literature.

•  Some sequence records are provided through collaboration, and thus may not be available in any one GenBank record.

•  RefSeq sequences are not submitted primary sequences.

Page 23:

Various High Throughput Collections: Nedo, DKFZ, HRI, Genoscope

•  Full-length cDNA libraries from various tissues were subtracted and normalized to reduce redundancy.

•  Clones were end-sequenced to further reduce redundancy.

•  Whole inserts were sequenced to get mRNA sequences.

•  [KIAA, done by Kazusa, was a project for long cDNAs (over 4 kb), which may not be full-length]

Page 24:

MGC - Mammalian Gene Collection

The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human, mouse, rat and cow gene. Zebrafish and Xenopus have their own projects (ZGC and XGC).

MGC produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue.

5' EST reads were generated from each library. Several algorithms were applied to select putative full ORF clones. Targeted cloning or synthesis was used to finish.

Page 25:

Sources of mRNAs                      Accession Numbers

•  Individual Labs                    various

•  RefSeq                             XX_123456

Full Length Sequencing projects:

•  Riken, Nedo (FLJ), HRI             AK, CR
   DKFZ, Genoscope, [KIAA]...         [AB, D]

•  MGC                                BC, CT

Page 26:

Sources of mRNAs

•  Experimental
  – Clone new gene
  – Clone gene from database
  – 2 hybrid system

•  Database
  – “Typical” cDNA
  – Full length cDNA
  – EST

Page 27:

[Figure: RNA, cDNA, and ESTs. An mRNA with exon 1, exon 2 and exon 3, the corresponding cDNA clone, and two ESTs.]

Adapted with permission from Adam Sartiel

Page 28:

Uses of ESTs

- prediction of coding regions
- detection of alternative splicing
- clustering to form “genes”

Problems with clustering:
- incomplete coverage breaks genes up
- gene families
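
The first clustering problem can be shown with a toy example. Real EST clustering compares sequences; the sketch below substitutes known start and end positions on a single transcript so that the effect of a coverage gap is easy to see. The EST names, coordinates and the `cluster_ests` helper are all invented for illustration.

```python
def cluster_ests(ests):
    """Group ESTs whose (start, end) intervals overlap on one transcript."""
    clusters = []
    for name, start, end in sorted(ests, key=lambda est: est[1]):
        if clusters and start < clusters[-1]["end"]:
            # Overlaps the current cluster: extend it.
            clusters[-1]["members"].append(name)
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            # No overlap with anything seen so far: start a new cluster.
            clusters.append({"members": [name], "end": end})
    return [cluster["members"] for cluster in clusters]


# Four ESTs from ONE gene, with no reads covering positions 700-1500.
ests = [
    ("est_a", 0, 400),
    ("est_b", 300, 700),
    ("est_c", 1500, 1900),
    ("est_d", 1800, 2200),
]
clusters = cluster_ests(ests)
print(clusters)
```

Although all four ESTs come from the same transcript, the uncovered stretch splits them into two clusters, which would be counted as two “genes”.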

Page 29:

Problems with ESTs

- low copy number genes

- rare tissues

- mistakes

- enrichment of 3’ ends of genes

- incomplete coverage of genes

Page 30:

With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information.

Page 31:

Entrez Gene at NCBI

Entrez Gene - A database for gene-specific information.

It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.

The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers.

Page 32:

Entrez Gene at NCBI

The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available.

Entrez Gene data is used by other NCBI resources such as: BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS and NCBI's genome annotation pipeline.

Page 33:

Data reliability in databases

The huge amount of data collected in databases presents a lot of problems:

  – Data accuracy
  – Sequence redundancy
  – Inconsistent nomenclature
  – Inaccurate annotation
  – Sequence contamination (vectors, bacterial)

Page 34:

Data reliability in databases

•  The database staff notify the authors that an error (or contamination) was detected in their sequence entry.

•  However, it takes time to correct the data.

•  Meanwhile the error is propagated, because a lot of the proteins in the protein database are translated from the DNA sequence database.

Page 35:

Data reliability in databases

•  A lot of the sequences in the database are quite “old”. They have not been updated since they were submitted, even though technology and data have advanced considerably.

Page 36:

Gene symbols

Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers.

Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.

Ideally symbols should be no longer than six characters in length.

Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
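
The first and third guidelines are mechanical enough to check in code. The sketch below is one possible reading of them: it assumes a symbol must start with a letter (the slide does not say so explicitly), and the function name `check_symbol` and its messages are made up for this example.

```python
import re

# Upper-case Latin letters, optionally mixed with Arabic numbers.
# Requiring a leading letter is an assumption, not stated on the slide.
SYMBOL_PATTERN = re.compile(r"[A-Z][A-Z0-9]*")


def check_symbol(symbol, max_length=6):
    """Return a list of guideline problems; an empty list means none found."""
    problems = []
    if not SYMBOL_PATTERN.fullmatch(symbol):
        problems.append("use upper-case Latin letters and Arabic numbers only")
    if len(symbol) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems


for candidate in ("ABCA1", "nrg1", "VERYLONGSYMBOL1"):
    print(candidate, check_symbol(candidate))
```

The length limit is described as an ideal rather than a rule, so it is reported as a problem here only to flag symbols worth a second look.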

Page 37:

HUGO Gene Nomenclature Committee

•  This committee is responsible for the approval of a unique symbol for each gene.

•  It also assigns a longer and more descriptive name.

•  The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.

•  However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database. (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl)

Page 38:

Gene Symbols

PubMed ID: 8088782
MIM Number: 600046
Cytogenetic Location: 9q31
Full name: ATP-binding cassette, sub-family A (ABC1), member 1
Symbol: ABCA1

Page 39:

Taxonomy Databases

•  An international effort is under way across all sequence databases to create a unified taxonomic tag for the sequences submitted.

•  Problem: each sequence depositor gives “his” name for the species

•  Solution: Unified taxonomy ID
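
The idea behind a unified taxonomy ID can be sketched as a lookup that maps whatever name a depositor used onto one numeric identifier. The IDs 9606 (Homo sapiens) and 10090 (Mus musculus) are NCBI Taxonomy identifiers; the list of name variants and the helper `unified_tax_id` are illustrative assumptions, not an actual database table.

```python
# Depositor-supplied names (lower-cased) mapped to one taxonomy ID per species.
TAXONOMY_IDS = {
    "homo sapiens": 9606,
    "human": 9606,
    "h. sapiens": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
}


def unified_tax_id(depositor_name):
    """Normalize case and spacing, then look up the unified taxonomy ID."""
    key = " ".join(depositor_name.lower().split())
    return TAXONOMY_IDS.get(key)


for name in ("Human", "  Homo   sapiens ", "H. sapiens", "mouse", "yeti"):
    print(repr(name), "->", unified_tax_id(name))
```

Searching by the ID rather than by the free-text name returns the same set of records no matter which variant the depositor typed.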

Page 40:

Protein databases

Page 41:

Protein databases

•  There are many different protein databases containing different types of information:
  – Primary amino acid sequence
  – Secondary structure
  – 3D structure
  – Protein family domains
  – Consensus active sites

Page 42:

Sources of Protein

•  Proteins that have been worked on experimentally

•  mRNA whose product has been worked on experimentally (no actual protein sequencing done)

•  Translated DNA (mRNA) sequences

Page 43:

Protein Primary Sequence Databases

•  Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.

•  Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

Page 44:

UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

•  The UniProt Knowledgebase (UniProt) is the central access point for extensive curated protein information, including function, classification, and cross-reference.

•  The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.

•  The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

Page 45:

Swiss-Prot Database (primary database)

•  Swiss-Prot annotation includes:
  – Description of protein function
  – Protein domain structure
  – Post-translational modifications
  – Protein variants

•  Sequence entries are composed of different line-types, each with their own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.
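
The line-type structure makes entries easy to read with a short script. The sketch below assumes the common flat-file layout in which each line starts with a two-letter code followed by three spaces, sequence lines are indented continuation lines, and `//` ends the entry. The entry text, its identifier and accession are invented; `parse_line_types` is a hypothetical helper.

```python
def parse_line_types(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = {}
    current = None
    for line in entry_text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            break
        if not line.strip():
            continue
        # Indented lines carry no code; they continue the previous line type.
        code = line[:2].strip() or current
        current = code
        fields.setdefault(code, []).append(line[5:].strip())
    return fields


# Invented entry, for illustration only.
entry = """ID   EXAMPLE_HUMAN   Reviewed;   10 AA.
AC   X00000;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//"""

fields = parse_line_types(entry)
print(fields["OS"], fields["SQ"])
```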

Page 46:

Swiss-Prot Database

Swiss-Prot differs from other protein databases by the following criteria:

•  Annotation

•  Minimal Redundancy

•  Integration with other databases

Page 47:

Swiss-Prot Database

Annotation

In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.

The core data consists of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).

Page 48:

The annotation consists of the description of:

•  Function(s) of the protein

•  Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.

•  Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, etc.

•  Secondary structure

Page 49:

The annotation consists of the description of:

•  Quaternary structure. For example homodimer, heterotrimer, etc.

•  Similarities to other proteins

•  Disease(s) associated with deficiencies of or in the protein

•  Sequence conflicts, variants, etc.

Page 50:

Swiss-Prot Database

To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.

Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.

Page 51:

Swiss-Prot Database

Minimal Redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. Swiss-Prot tries as much as possible to merge all these data so as to minimize the redundancy of the database.

If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

Page 52:

Swiss-Prot Database

Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.

Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.

Page 53:

TrEMBL database

•  TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.

•  TrEMBL contains entries not yet integrated in Swiss-Prot.

Page 54:

•  Combines information not in other databases, like microarray data, population variation studies, proteomics

•  Powerful querying options

•  Only for human proteins

Page 55:

NR database (primary databases from NCBI)

•  The NR Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to PIR, SWISS-PROT, PRF, PDB (sequences from solved structures).
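
One way to picture how a combined database of this kind stays non-redundant is to collapse identical sequences from different sources into a single entry that keeps every source identifier. This is a sketch under that assumption, not a description of NCBI's actual build process; the records, identifiers and the `build_non_redundant` helper are invented.

```python
from collections import defaultdict


def build_non_redundant(records):
    """Merge records with identical sequences, keeping every source identifier."""
    merged = defaultdict(list)
    for source, identifier, sequence in records:
        merged[sequence.upper()].append(f"{source}:{identifier}")
    return dict(merged)


# Invented records: the first two carry the same protein sequence.
records = [
    ("GenBank translation", "id1", "MKTAYIAKQR"),
    ("SWISS-PROT", "id2", "MKTAYIAKQR"),
    ("PDB", "id3", "MKTAYIAKQRQ"),
]
nr = build_non_redundant(records)
for sequence, sources in nr.items():
    print(sequence, sources)
```

Only exact duplicates are merged here; sequences that differ by a single residue, like the third record, remain separate entries.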

Page 56:

Data reliability in Protein databases

•  About 30% of the proteins in the databases have erroneous sequences due to:
  – Missing exons in the DNA translation.
  – Introns mistakenly translated.

•  Another common problem is the assigning of functions to “new” proteins, based on sequence similarity.

Page 57:

Data reliability in Protein databases

•  For example:
  – Protein A is similar to protein B.
  – Protein B annotation is based on Protein A annotation (which has an error).
  – Annotation of Protein A is corrected by the group working on it. This correction does not appear in, and is not reflected by, the Protein B annotation.
  – When Protein C and D are also based on the erroneous annotation on B, the problem…...
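
The A, B, C, D example is a chain of copied annotations, so the records touched by one error can be found by following the "derived from" links. In this sketch the mapping `derived_from` and the function `affected_by` are hypothetical; real databases do not necessarily record where an annotation was copied from, which is part of the problem.

```python
def affected_by(record, derived_from):
    """Return every record whose annotation was copied, directly or
    indirectly, from `record`."""
    affected = []
    frontier = [record]
    while frontier:
        current = frontier.pop()
        for child, parent in derived_from.items():
            if parent == current and child not in affected:
                affected.append(child)
                frontier.append(child)   # follow copies of copies
    return sorted(affected)


# child -> the record its annotation was based on
derived_from = {"B": "A", "C": "B", "D": "B"}
print(affected_by("A", derived_from))
```

Correcting A alone fixes one record out of four; B, C and D keep the old annotation until each is revisited.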

Page 58:

Text searching pitfalls

•  It finds exactly what you type (try pseudogene vs. psuedogene)

•  Older records may have different annotation, from gene names on…

•  human vs homo sapiens

•  Gene symbols vs full gene name (for example neuregulin vs nrg1)
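
These pitfalls are easy to reproduce with a plain substring search over a few invented record descriptions. The records and the `text_search` function below are made up for illustration; actual database search engines index and match text in their own ways.

```python
# Invented record descriptions; rec3 contains a misspelling on purpose.
records = {
    "rec1": "Homo sapiens neuregulin 1 (NRG1), mRNA",
    "rec2": "human neuregulin 1 mRNA",
    "rec3": "Homo sapiens NRG1 psuedogene",
}


def text_search(term, records):
    """Case-insensitive exact substring search over record descriptions."""
    term = term.lower()
    return sorted(rid for rid, text in records.items() if term in text.lower())


for term in ("pseudogene", "psuedogene", "human", "homo sapiens",
             "neuregulin", "nrg1"):
    print(f"{term!r:15} -> {text_search(term, records)}")
```

No single term retrieves all three records, even though they all concern the same gene.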

Page 59:

•  Most sites use boolean operators (AND, OR, BUT NOT)

•  You can add a field-specific tag, but each site has a different way of adding it to a search; for example, NCBI uses square brackets [ ]
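
Combining boolean operators with NCBI-style square-bracket tags can be sketched as simple string building. The helpers `tagged` and `combine` are hypothetical, and the field names used (Gene Name, Title, Organism) should be checked against the field list of the database being searched.

```python
def tagged(term, field):
    """Attach a square-bracket field tag, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def combine(operator, *clauses):
    """Join clauses with a boolean operator, wrapped in parentheses."""
    return "(" + f" {operator} ".join(clauses) + ")"


# Symbol OR full name, restricted to one organism.
query = combine(
    "AND",
    combine("OR", tagged("NRG1", "Gene Name"), tagged("neuregulin", "Title")),
    tagged("Homo sapiens", "Organism"),
)
print(query)
```

Searching for the symbol and the full name together, and restricting by organism with a tag rather than a free-text word, works around two of the pitfalls listed on the previous slide.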

Page 60:

Remember:

Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!