Introduction to Databases part 2 - Biological computing
![Page 1: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/1.jpg)
Introduction to Databases part 2
Shifra Ben-Dor, Irit Orr
![Page 2: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/2.jpg)
And now, for the molecules and databases...
• DNA
• RNA
• Protein
![Page 3: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/3.jpg)
DNA sequences
• Genes are encoded in genomic sequences.
• Genes are transcribed into pre-mRNAs (including coding, intronic, and 5' and 3' untranslated regions).
• Pre-mRNAs are spliced (introns removed) to give mRNAs, which are translated into proteins.
• mRNAs are copied to cDNAs.
![Page 4: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/4.jpg)
[Figure: structure of a gene at three levels.
Genomic DNA: promoter, TSS, exons 1-4 (with ATG and Stop), polyA site, TTS.
Pre-mRNA: exons 1-4, ATG, Stop, polyA site.
mRNA: Cap, 5' UTR, CDS (ATG to Stop, exons 1-4), 3' UTR, PolyA.]
Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.
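The path from genomic DNA through pre-mRNA to mRNA can be sketched in a few lines of Python. This is only a toy illustration: the sequence, the exon coordinates, and the function names `splice` and `find_cds` are invented for the example, and the sequence is written in the DNA alphabet, as it would appear in a cDNA record.

```python
# Toy illustration: the sequence and exon coordinates below are invented.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(pre_mrna, exons):
    """Join the exons (0-based, end-exclusive coordinates), dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def find_cds(mrna):
    """Return (start, end) of the frame from the first ATG to the first
    in-frame stop codon, or None if there is no such frame."""
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


# exon 1 + intron + exon 2 (intron in lower case for readability only)
pre_mrna = "GGATGGCC" + "gtaagtttcag" + "AAATAACC"
exons = [(0, 8), (19, 27)]

mrna = splice(pre_mrna, exons)
cds = find_cds(mrna)
five_utr = mrna[:cds[0]]        # everything before the ATG
coding = mrna[cds[0]:cds[1]]    # ATG through the stop codon
three_utr = mrna[cds[1]:]       # everything after the stop codon
```

Real transcripts need more care (alternative start codons, alternative splicing, the cap and polyA tail), so this should be read as a picture of the coordinates involved rather than as a gene-finding method.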
![Page 5: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/5.jpg)
International DNA databases
GenBank at NCBI http://www.ncbi.nlm.nih.gov/
EMBL at EBI http://www.ebi.ac.uk/embl/
DDBJ in Japan http://www.ddbj.nig.ac.jp/
![Page 6: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/6.jpg)
Data sources for DNA databases
• Direct scientist submission
• Genome sequencing labs and groups
• Scientific literature
• Patent applications
• EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.
![Page 7: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/7.jpg)
International DNA databases
All of these databases have:
• Official releases every 2-3 months.
• Weekly (or daily) updates.
• Sublibraries (divisions) for easier searching.
![Page 8: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/8.jpg)
DNA database divisions
• PRI - primate (human, monkey)
• ROD - rodent (mouse, rat)
• MAM - other mammalian (bovine, cat)
• VRT - other vertebrate (chicken)
• INV - invertebrate
• PLN - plant, fungal, and algal
• BCT - bacteria
• VRL - viruses
• PHG - bacteriophage
• SYN - synthetic (plasmids, vectors)
• UNA - unannotated sequences
• PAT - patent sequences
• EST - Expressed Sequence Tags
• STS - Sequence Tagged Sites
• GSS - Genome Survey Sequences
• HTG - High Throughput Genomic Sequences
• HTC - High Throughput cDNA Sequences
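As a small illustration of how the division codes are used in practice, the sketch below reads the division from the LOCUS line of a GenBank-style flat-file record. It assumes the division code is the second-to-last whitespace-separated field of that line, followed by the date; the example line and its accession are made up, and `division_of` is a hypothetical helper, not part of any library.

```python
# Division codes as listed above.
DIVISIONS = {
    "PRI": "primate (human, monkey)",
    "ROD": "rodent (mouse, rat)",
    "MAM": "other mammalian (bovine, cat)",
    "VRT": "other vertebrate (chicken)",
    "INV": "invertebrate",
    "PLN": "plant, fungal, and algal",
    "BCT": "bacteria",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic (plasmids, vectors)",
    "UNA": "unannotated sequences",
    "PAT": "patent sequences",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Sites",
    "GSS": "Genome Survey Sequences",
    "HTG": "High Throughput Genomic Sequences",
    "HTC": "High Throughput cDNA Sequences",
}


def division_of(locus_line):
    """Return the division code from a LOCUS line, or None if not found.

    Assumption: the line ends with '<DIVISION> <DATE>'.
    """
    fields = locus_line.split()
    if len(fields) < 3 or fields[0] != "LOCUS":
        return None
    code = fields[-2]
    return code if code in DIVISIONS else None


# Invented example line, for illustration only.
line = "LOCUS       AB000001     1234 bp    mRNA    linear   PRI 01-JAN-2000"
code = division_of(line)
description = DIVISIONS[code]
```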
![Page 9: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/9.jpg)
Short Read and Trace Archives
The output of large-scale sequencing projects and next-generation sequencing is stored in separate databases. NCBI is phasing out the SRA, but the data will be available in GEO, the database for microarray results.
![Page 10: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/10.jpg)
Genomic databases
• Specialized resources that are:
– Species specific
– Sequencing technique specific
• Display whole chromosomes (not a specific sequence).
![Page 11: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/11.jpg)
Sources of mRNAs
• Experimental
– Clone new gene
– Clone gene from database
– 2 hybrid system, RNA-Seq...
• Database
– "Typical" cDNA
– Full length cDNA
– EST
![Page 12: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/12.jpg)
[Figure: an mRNA (5' mG cap to AAAA tail) compared with a full length cDNA and a typical cDNA, showing the TTTT and other primers used to make each.]
![Page 13: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/13.jpg)
RefSeq at NCBI (Reference Sequence database)
✵ Definition
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
![Page 14: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/14.jpg)
RefSeq from NCBI
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with reviewed records indicated
![Page 15: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/15.jpg)
RefSeq record Status
• The RefSeq COMMENT block indicates the Status of the record and the GenBank sequence data that was used to provide the record.
• In addition, the COMMENT may identify a collaboration which supplied the defining sequence information for the genome, gene, or protein.
The level of curation may differ between different collaborating groups.
![Page 16: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/16.jpg)
RefSeq
✵ Status Codes: RefSeq records are provided with a status code, which indicates the level of review a RefSeq record has undergone.
• Reviewed*
• Provisional
• Predicted
• Genome Annotation
• Validated*
• Model
• Inferred
• WGS
* Curated
![Page 17: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/17.jpg)
STATUS Definition
REVIEWED: The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes reviewing available sequence data and frequently also includes a review of the literature and other sources of information.
VALIDATED: The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.
PROVISIONAL: The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
![Page 18: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/18.jpg)
STATUS Definition
PREDICTED: The RefSeq record is predicted and has not been subject to individual review. The transcript may represent an ab initio prediction or may be partially supported by other transcript data; in both cases, the protein is predicted.
INFERRED: The RefSeq record is inferred by genome sequence analysis. There is no same-organism experimental support for the full extent of the sequence; there may be some level of support by homology.
MODEL: The RefSeq record is predicted by genome sequence analysis. The record may represent an ab initio prediction, or may have some level of transcript or homology support.
![Page 19: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/19.jpg)
STATUS Definition
GENOME ANNOTATION: This identifies RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing and are not subject to individual review or revision between builds.
WGS: The RefSeq record represents a collection of whole genome shotgun (WGS) sequences. This status code is applied to genomic records.
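Since the status is carried in the COMMENT block, a script can pick it out and decide whether a record has been curated. The sketch below is a minimal illustration under one assumption that should be checked against real records: that the comment text begins with the status code followed by the word "REFSEQ". The helper name `refseq_status` and the example comment are hypothetical.

```python
# Status codes from the definitions above; REVIEWED and VALIDATED are the
# two marked as curated.
STATUS_CODES = (
    "REVIEWED", "VALIDATED", "PROVISIONAL", "PREDICTED",
    "INFERRED", "MODEL", "GENOME ANNOTATION", "WGS",
)
CURATED = {"REVIEWED", "VALIDATED"}


def refseq_status(comment):
    """Return the status code that opens the comment text, or None.

    Assumption: the comment starts with '<STATUS> REFSEQ'.
    """
    text = " ".join(comment.split()).upper()
    for code in STATUS_CODES:
        if text.startswith(code + " REFSEQ"):
            return code
    return None


# Invented comment text, for illustration only.
comment = "REVIEWED REFSEQ: This record has been curated by NCBI staff."
status = refseq_status(comment)
is_curated = status in CURATED
```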
![Page 20: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/20.jpg)
Accession Format    Molecule Type
NC_123456           Complete genome, complete chromosome, complete sequence
NG_123456           Genomic region
NM_123456           mRNA
NR_123456           Non-coding RNA
NP_123456           Protein
NT_123456           Genomic contig (from BACs)
NW_123456           Genomic contig (from WGS)
XM_123456           mRNA (taken from genomic sequence)
XR_123456           RNA (taken from genomic sequence)
XP_123456           Protein (taken from genomic sequence)
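Because the two-letter prefix encodes the molecule type, an accession can be classified without looking the record up. The following is a minimal sketch built only from the table above; the function name `molecule_type` is made up, and the optional ".version" suffix the pattern accepts is an assumption about how accessions are commonly written.

```python
import re

# Prefix-to-type mapping taken from the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic sequence)",
    "XR": "RNA (taken from genomic sequence)",
    "XP": "protein (taken from genomic sequence)",
}

# Two upper-case letters, an underscore, digits, and an optional ".version".
_ACCESSION = re.compile(r"([A-Z]{2})_(\d+)(?:\.(\d+))?")


def molecule_type(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = _ACCESSION.fullmatch(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))
```

For example, `molecule_type("NM_123456")` gives `"mRNA"`, while an accession without the underscore form, such as `"AK123456"`, gives `None` because it is not in the RefSeq series.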
![Page 21: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/21.jpg)
What is the difference between RefSeq and GenBank?
GenBank is:
• An archival database that includes publicly available DNA sequences submitted by individual laboratories and large-scale sequencing projects.
• Accession numbers are assigned to these submitted sequences.
• Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage.
• As an archival database, GenBank is very redundant for some loci.
• Sequence records are owned by the original submitter and cannot be altered by a third party.
![Page 22: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/22.jpg)
What is the difference between RefSeq and GenBank?
RefSeq is:
• Sequences are derived from GenBank and provide non-redundant curated data.
• Records represent current knowledge.
• RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.
• Some records include additional sequence information that was never submitted to an archival database but is available in the literature.
• Some sequence records are provided through collaboration, and thus may not be available in any one GenBank record.
• RefSeq sequences are not submitted primary sequences.
![Page 23: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/23.jpg)
Various High Throughput Collections
Nedo, DKFZ, HRI, Genoscope
• Full-length cDNA libraries from various tissues were subtracted and normalized to reduce redundancy
• Clones were end-sequenced to further reduce redundancy
• Whole inserts were sequenced to get mRNA sequences
• [KIAA, done by Kazusa, was a project for long cDNAs (over 4 kb), which may not be full-length]
![Page 24: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/24.jpg)
MGC - Mammalian Gene Collection
The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human, mouse, rat and cow gene. Zebrafish and Xenopus have their own projects (ZGC and XGC).
MGC produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue.
5' EST reads were generated from each library. Several algorithms are applied to select putative full ORF clones. Targeted cloning or synthesis was used to finish.
![Page 25: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/25.jpg)
Sources of mRNAs
Source                               Accession Numbers
• Individual labs                    various
• RefSeq                             XX_123456
Full length sequencing projects:
• Riken, Nedo (FLJ), HRI             AK, CR
  DKFZ, Genoscope, [KIAA]...         [AB, D]
• MGC                                BC, CT
![Page 26: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/26.jpg)
Sources of mRNAs
• Experimental
– Clone new gene
– Clone gene from database
– 2 hybrid system
• Database
– "Typical" cDNA
– Full length cDNA
– EST
![Page 27: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/27.jpg)
RNA, cDNA, and ESTs
[Figure: an mRNA with exons 1-3, the cDNA clone made from it, and the ESTs read from the ends of the cDNA clone.]
Adapted with permission from Adam Sartiel
![Page 28: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/28.jpg)
Uses of ESTs
- prediction of coding regions
- detection of alternative splicing
- clustering to form "genes"
Problems with clustering:
- incomplete coverage breaks genes up
- gene families
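The first clustering problem can be made concrete with a toy example. The sketch below assumes, purely for illustration, that each EST's position on the underlying transcript is already known and groups ESTs whose intervals overlap; real EST clustering works from sequence similarity, not from known coordinates. The EST names and positions are invented.

```python
def cluster_ests(ests):
    """Group ESTs whose (start, end) intervals overlap.

    ests: iterable of (name, start, end), coordinates end-exclusive.
    Returns a list of clusters, each a list of EST names.
    """
    clusters = []
    for name, start, end in sorted(ests, key=lambda e: e[1]):
        if clusters and start < clusters[-1]["end"]:
            # Overlaps the current cluster: extend it.
            clusters[-1]["members"].append(name)
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
        else:
            # No overlap: start a new cluster.
            clusters.append({"members": [name], "end": end})
    return [c["members"] for c in clusters]


# Four invented ESTs from ONE gene; nothing covers positions 500-900.
ests = [
    ("est1", 0, 300),
    ("est2", 200, 500),
    ("est3", 900, 1200),
    ("est4", 1100, 1400),
]
clusters = cluster_ests(ests)
```

With no EST spanning the gap, the single gene comes out as two clusters, which is how incomplete coverage breaks a gene up into several apparent "genes".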
![Page 29: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/29.jpg)
Problems with ESTs
- low copy number genes
- rare tissues
- mistakes
- enrichment of 3' ends of genes
- incomplete coverage of genes
![Page 30: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/30.jpg)
With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information.
![Page 31: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/31.jpg)
Entrez Gene at NCBI
Entrez Gene - A database for gene-specific information.
It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.
The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers.
![Page 32: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/32.jpg)
Entrez Gene at NCBI
The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available.
Entrez Gene data is used by other NCBI resources such as: BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS and NCBI's genome annotation pipeline.
![Page 33: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/33.jpg)
Data reliability in databases
The huge amount of data collected in databases presents a lot of problems:
– Data accuracy
– Sequence redundancy
– Inconsistent nomenclature
– Inaccurate annotation
– Sequence contamination (vectors, bacterial)
![Page 34: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/34.jpg)
Data reliability in databases
• The database staff notify the authors that an error (or contamination) was detected in their sequence entry.
• However, it takes time to correct the data.
• Meanwhile the error is propagated, because a lot of the proteins in the protein database are translated from the DNA sequence database.
![Page 35: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/35.jpg)
Data reliability in databases
• A lot of the sequences in the database are quite "old". They have not been updated since they were submitted, even though technology and data have moved on considerably.
![Page 36: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/36.jpg)
Gene symbols
Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers.
Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.
Ideally symbols should be no longer than six characters in length.
Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
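The two mechanical guidelines above (character set and length) are easy to check in code. The sketch below is a hypothetical helper, `check_symbol`, that reports which of them a candidate symbol breaks; it encodes only what is stated on this slide and is not an implementation of any official validation procedure.

```python
import re

# Only upper-case Latin letters and Arabic numerals, with at least one letter.
_SYMBOL = re.compile(r"(?=[A-Z0-9]*[A-Z])[A-Z0-9]+")


def check_symbol(symbol, max_len=6):
    """Return a list of guideline problems; an empty list means none found."""
    problems = []
    if _SYMBOL.fullmatch(symbol) is None:
        problems.append("not upper-case Latin letters and Arabic numerals")
    if len(symbol) > max_len:
        problems.append("longer than %d characters" % max_len)
    return problems
```

For instance, `check_symbol("ABCA1")` returns an empty list, while `check_symbol("nrg1")` flags the lower-case letters. The length limit is stated as an ideal, not a rule, so a length warning is better treated as advice than as an error.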
![Page 37: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/37.jpg)
HUGO Gene Nomenclature Committee
• This committee is responsible for the approval of a unique symbol for each gene.
• It also designs a longer and more descriptive name.
• The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.
• However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database. (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl)
![Page 38: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/38.jpg)
Gene Symbols
Symbol: ABCA1
Full name: ATP-binding cassette, sub-family A (ABC1), member 1
Cytogenetic Location: 9q31
MIM Number: 600046
PubMed ID: 8088782
![Page 39: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/39.jpg)
Taxonomy Databases
• An international effort is under way for all sequence databases to create a unified taxonomic tag for the sequences submitted.
Problem: each sequence depositor gives "his" name for the species
Solution: Unified taxonomy ID
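The idea of a unified taxonomy ID can be illustrated with a small lookup that maps whatever name a depositor typed to one numeric identifier. The table below is a hand-made toy: 9606 and 10090 are the NCBI Taxonomy IDs for Homo sapiens and Mus musculus, but the list of synonyms and the function name `taxonomy_id` are invented for the example.

```python
# Toy synonym table; a real taxonomy database holds far more names.
TAXONOMY_IDS = {
    "homo sapiens": 9606,
    "human": 9606,
    "man": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "house mouse": 10090,
}


def taxonomy_id(depositor_name):
    """Map a free-text organism name to a taxonomy ID, or None if unknown."""
    key = " ".join(depositor_name.lower().split())  # normalise case and spacing
    return TAXONOMY_IDS.get(key)


same_species = taxonomy_id("Human") == taxonomy_id("Homo  sapiens")
```

Once every record carries the numeric ID, a search for one species no longer depends on which of the names the depositor happened to use.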
![Page 40: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/40.jpg)
Protein databases
![Page 41: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/41.jpg)
Protein databases
• There are many different protein databases containing different types of information:
– Primary amino acid sequence
– Secondary structure
– 3D structure
– Protein family domains
– Consensus active sites
![Page 42: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/42.jpg)
Sources of Protein
• Proteins that have been worked on experimentally
• mRNA whose product has been worked on experimentally (no actual protein sequencing done)
• Translated DNA (mRNA) sequences
![Page 43: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/43.jpg)
Protein Primary Sequence Databases
• Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.
• Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.
![Page 44: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/44.jpg)
UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
• The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-reference.
• The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.
• The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
![Page 45: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/45.jpg)
Swiss-Prot Database (primary database)
• Swiss-Prot annotation includes:
– Description of protein function
– Protein domain structure
– Post-translational modifications
– Protein variants
• Sequence entries are composed of different line types, each with its own format. For standardization purposes, the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.
![Page 46: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/46.jpg)
Swiss-Prot Database
Swiss-Prot differs from other protein databases by the following criteria:
• Annotation
• Minimal Redundancy
• Integration with other databases
![Page 47: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/47.jpg)
Swiss-Prot Database
Annotation
In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.
The core data consists of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).
![Page 48: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/48.jpg)
The annotation consists of the description of:
• Function(s) of the protein
• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, etc.
• Secondary structure
![Page 49: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/49.jpg)
The annotation consists of the description of:
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(s) of/in the protein
• Sequence conflicts, variants, etc.
![Page 50: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/50.jpg)
Swiss-Prot Database
To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.
Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.
![Page 51: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/51.jpg)
Swiss-Prot Database
Minimal Redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. Swiss-Prot tries as much as possible to merge all these data so as to minimize the redundancy of the database.
If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
![Page 52: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/52.jpg)
Swiss-Prot Database
Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.
Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.
![Page 53: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/53.jpg)
TrEMBL database
• TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.
• TrEMBL contains entries not yet integrated in Swiss-Prot.
![Page 54: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/54.jpg)
• Combines information not in other databases, like microarray data, population variation studies, proteomics
• Powerful querying options
• Only for human proteins
![Page 55: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/55.jpg)
NR database (primary databases from NCBI)
• The NR Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ, as well as protein sequences submitted to PIR, Swiss-Prot, PRF and PDB (sequences from solved structures).
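One way to picture how a non-redundant set is built from several source databases is to merge records that carry exactly the same sequence while keeping every source identifier. The sketch below shows that idea only; the identifiers and sequences are invented, the function name `build_nonredundant` is hypothetical, and it is not a description of NCBI's actual procedure.

```python
def build_nonredundant(records):
    """Merge records with identical sequences.

    records: iterable of (source_id, sequence).
    Returns {sequence: [source_id, ...]} in first-seen order.
    """
    merged = {}
    for source_id, sequence in records:
        merged.setdefault(sequence.upper(), []).append(source_id)
    return merged


# Invented records standing in for entries from different source databases.
records = [
    ("genbank_translation_1", "MKTAYIAK"),
    ("swissprot_1", "MKTAYIAK"),   # same protein, second source
    ("pdb_1", "MKTAYIAR"),         # differs by one residue: kept separately
    ("pir_1", "mktayiak"),         # same sequence, different letter case
]
nr = build_nonredundant(records)
```

Four input records collapse into two entries here. Note that sequences differing by even one residue stay separate, so "non-redundant" in this sense means "no identical duplicates", not "one entry per protein".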
![Page 56: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/56.jpg)
Data reliability in Protein databases
• About 30% of the proteins in the databases have erroneous sequences due to:
– Missing exons in the DNA translation.
– Introns mistakenly translated.
• Another common problem is the assignment of functions to "new" proteins based on sequence similarity.
![Page 57: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/57.jpg)
Data reliability in Protein databases
• For example:
– Protein A is similar to protein B.
– Protein B's annotation is based on Protein A's annotation (which has an error).
– The annotation of Protein A is corrected by the group working on it. This correction does not appear in, and is not reflected by, Protein B's annotation.
– When Proteins C and D are also based on the erroneous annotation of B, the problem...
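The chain described in this example can be simulated in a few lines. The sketch below is a toy model with invented annotations ("kinase", "phosphatase") and hypothetical helper names; it copies annotations by similarity, corrects the original, and then lists the entries that still carry the old error.

```python
annotations = {"A": "kinase"}   # Protein A's original (erroneous) annotation
copied_from = {}                # which entry each annotation was copied from


def annotate_by_similarity(new, source):
    """Copy the source's current annotation to a new, similar protein."""
    annotations[new] = annotations[source]
    copied_from[new] = source


def root_source(protein):
    """Follow the copy chain back to the originally annotated protein."""
    while protein in copied_from:
        protein = copied_from[protein]
    return protein


annotate_by_similarity("B", "A")   # B is annotated from A
annotate_by_similarity("C", "B")   # C and D are annotated from B
annotate_by_similarity("D", "B")

annotations["A"] = "phosphatase"   # the group working on A fixes its entry

# Entries whose annotation no longer agrees with the protein it came from.
stale = [p for p in annotations
         if annotations[p] != annotations[root_source(p)]]
```

After the correction, only A is right; B, C and D all still say "kinase". Nothing in the copied entries records that they depend on A, which is why the fix does not reach them.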
![Page 58: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/58.jpg)
Text searching pitfalls
• It finds exactly what you type (try pseudogene vs. psuedogene)
• Older records may have different annotation, from gene names on...
• human vs homo sapiens
• Gene symbols vs full gene name (for example neuregulin vs nrg1)
![Page 59: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/59.jpg)
• Most sites use boolean operators (AND, OR, BUT NOT)
• Can do (or add) a field specific tag - but each site has a different way of adding it to a search - for example, NCBI uses square brackets [ ]
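A query that combines Boolean operators with NCBI-style square-bracket field tags can be assembled as a plain string. The sketch below is one hedged way to do that: `build_query` is a made-up helper, and the field names used in the example ("Gene Name", "Organism") are assumptions that should be checked against the search help of the database being queried.

```python
def build_query(terms, operator="AND"):
    """Join search terms with a Boolean operator.

    terms: list of (text, field) pairs; field may be None for an untagged
    term. Multi-word text is quoted so it is searched as a phrase.
    """
    parts = []
    for text, field in terms:
        if " " in text:
            text = '"%s"' % text
        parts.append("%s[%s]" % (text, field) if field else text)
    return (" %s " % operator).join(parts)


# Restrict a gene symbol search to one organism.
query = build_query([("nrg1", "Gene Name"), ("homo sapiens", "Organism")])

# Cover a common misspelling with OR, since text search matches literally.
spelling = build_query([("pseudogene", None), ("psuedogene", None)], "OR")
```

Tagging the fields narrows the search to the gene-name and organism fields instead of matching the words anywhere in a record, and the OR query is a cheap guard against the exact-match pitfall mentioned on the previous slide.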
![Page 60: Introducon to Databases part 2 - Biological computing · TSS TTS ATG Stop PolyA site Promoter 1 2 3 4 ATG Stop PolyA site 1 2 3 4 Genomic DNA Pre-mRNA mRNA Modified ...](https://reader031.fdocuments.in/reader031/viewer/2022013002/5ca5ecf588c993b8788d20c0/html5/thumbnails/60.jpg)
Remember:
Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!