A Rough Guide to Biological Databases Alastair Kerr, Ph.D. Bioinformatician Wellcome Trust Centre...
-
Upload
charles-carson -
Category
Documents
-
view
223 -
download
1
Transcript of A Rough Guide to Biological Databases Alastair Kerr, Ph.D. Bioinformatician Wellcome Trust Centre...
A Rough Guide to A Rough Guide to Biological Databases Biological Databases
Alastair Kerr, Ph.D.Alastair Kerr, Ph.D.Bioinformatician Bioinformatician
Wellcome Trust Centre for Cell BiologyWellcome Trust Centre for Cell Biology
SLIDESSLIDES
http://bifx1.bio.ed.ac.ukhttp://bifx1.bio.ed.ac.uk
Follow links to Recent Follow links to Recent PresentationsPresentations
GoalsGoals
Understand differences between different Understand differences between different data sourcesdata sources
Know where to go if you have an sequence Know where to go if you have an sequence id (e.g. RefSeq, SwissProt Ensembl)id (e.g. RefSeq, SwissProt Ensembl)
Understand the best data source to use for Understand the best data source to use for your task (e.g. looking for orthologs or your task (e.g. looking for orthologs or splice variants, proteomics experiment)splice variants, proteomics experiment)
Understand how identifiers workUnderstand how identifiers work Understand how to get the best annotationUnderstand how to get the best annotation
Overview ofOverview ofSequence DatabasesSequence Databases
Brief history Brief history Varieties of data sources (databases Varieties of data sources (databases
and datasets)and datasets)– Utility/drawbacks of each Utility/drawbacks of each
Use of identifiersUse of identifiers DNA /Protein Annotation DNA /Protein Annotation
– Distributed annotation system [DAS]Distributed annotation system [DAS] DAS clients - outline and demoDAS clients - outline and demo
Dawn of the Age of Dawn of the Age of SequencingSequencing
Mid 50’s : First protein sequence -by Mid 50’s : First protein sequence -by Fred SangerFred Sanger
Late 70’s : clone based sequencing Late 70’s : clone based sequencing arrived with Sanger dideoxy method also arrived with Sanger dideoxy method also the first bioinformatics tools (STADEN, the first bioinformatics tools (STADEN, alignments)alignments)
All sequences were published in papers, All sequences were published in papers, a central warehouse was clearly needed a central warehouse was clearly needed to keep them allto keep them all
Sequence WarehousesSequence Warehouses
NCBIGenBank EMBL
• Protein and DNA database
•GenBank NR [Non-redundant]
• Historically • DNA: EMBL• Protein: translated
EMBL (trEMBL)• Now called EBI
• UniProt
National Centre forBiotechnology Information
European MolecularBiology Laboratory
Sources of DNA ErrorSources of DNA Error
Vector contaminationVector contamination– Now mainly eliminated in the sequencing Now mainly eliminated in the sequencing
pipeline but still possible with rarer vectorspipeline but still possible with rarer vectors Sequencing artefactsSequencing artefacts
– Recombination, mutations, contaminationRecombination, mutations, contamination– insertions /deletions (More prone in Next-insertions /deletions (More prone in Next-
gen sequencing?) gen sequencing?) – Most removed from ‘finished’ sequence by Most removed from ‘finished’ sequence by
comparing multiple reads, but still prevalent comparing multiple reads, but still prevalent in ESTs and GSSin ESTs and GSS
Protein Sequence ErrorProtein Sequence Error
Most proteins are based on a Most proteins are based on a modelmodel of the of the gene gene
This gene model is often deduced: This gene model is often deduced: combination of:combination of:– EST dataEST data– ORF finding programs ORF finding programs – Splice site finding programsSplice site finding programs
Protein interpretations may changeProtein interpretations may change– Transcription/translation start sites Transcription/translation start sites – Splice Variants Splice Variants – DNA errorsDNA errors
Varieties of data sources Varieties of data sources
Sequence Warehouses Sequence Warehouses – ““everything under one roof”everything under one roof”
Genome Databases Genome Databases – Containing single genome dataset(s)Containing single genome dataset(s)
Single pass reads Single pass reads – [EST] Expressed sequence set (cDNA)[EST] Expressed sequence set (cDNA)– [GSS] Genome survey sequence[GSS] Genome survey sequence
Curated setsCurated sets
ProsPros– Retrieve a specific sequenceRetrieve a specific sequence
e.g. an identifier from a papere.g. an identifier from a paper
– Comprehensive sequence searchComprehensive sequence search ConsCons
– >100 BILLION bases from 165,000 >100 BILLION bases from 165,000 organismsorganisms
– RedundancyRedundancy Variants, both real and artefacts (may Variants, both real and artefacts (may
complicate mass-spec searches)complicate mass-spec searches)
– Still may contain junkStill may contain junk
DNA Sequence WarehousesDNA Sequence Warehouses
Single Genome Single Genome databases/datasetsdatabases/datasets
ProsPros– Truly non-redundantTruly non-redundant– If complete, you know the gene copy number If complete, you know the gene copy number
and gene familiesand gene families– Can search genomic DNA for know but un-Can search genomic DNA for know but un-
annotated genesannotated genes– Can ‘browse’ the genomeCan ‘browse’ the genome– Usually very good computer annotationUsually very good computer annotation
ConsCons– Not always assembled correctly Not always assembled correctly – May be incomplete (despite saying otherwise)May be incomplete (despite saying otherwise)– Often no human intervention with annotationOften no human intervention with annotation
ensEMBLensEMBL
From EBI / Sanger From EBI / Sanger Many vertebrate and model organism Many vertebrate and model organism
genomesgenomes– All cross-annotated (easy to find All cross-annotated (easy to find
orthologs)orthologs) Access via web with built in DAS client Access via web with built in DAS client
(see later)(see later)– BIOMART accessBIOMART access– Also / SQL / Perl-APIAlso / SQL / Perl-API
Low Quality SequencesLow Quality Sequences
Expressed sequence tag (EST) are reads coding Expressed sequence tag (EST) are reads coding DNA of variable quality but usually very DNA of variable quality but usually very numerous numerous – Ideal for determining which part of a genome is Ideal for determining which part of a genome is
transcribed…transcribed… but not necessarily coding! [ncRNAs]but not necessarily coding! [ncRNAs]
– Recent improvements in visualisation, ORF Recent improvements in visualisation, ORF identification and clusteringidentification and clustering
Genomic survey sequence [GSS] are single pass Genomic survey sequence [GSS] are single pass readsreads– Genomes in early stage of sequencingGenomes in early stage of sequencing– Environmental sampling (meta-genomics) Environmental sampling (meta-genomics)
Using EST databasesUsing EST databases
Gene model verification (e.g. checking Gene model verification (e.g. checking a splice variant)a splice variant)– Search EST databases with genomic Search EST databases with genomic
sequence or cDNAsequence or cDNA
Curated (Protein) Data SetsCurated (Protein) Data Sets
Several efforts to create high quality Several efforts to create high quality databasesdatabases
SwissProt (1986) the first gold standard in SwissProt (1986) the first gold standard in protein functional annotation protein functional annotation – Originally every entry entered by Amos Originally every entry entered by Amos
Bairoch!Bairoch!– Now integrated into UniProt (EBI / EMBL)Now integrated into UniProt (EBI / EMBL)
RefSeqRefSeq– NCBIs human-involved annotation effortNCBIs human-involved annotation effort
IMPORTANT: These are site specific. And IMPORTANT: These are site specific. And NOT shared between NCBI / EBI NOT shared between NCBI / EBI
IPI IPI International Protein IndexInternational Protein Index
From EBIFrom EBI Contains proteomes of higher Contains proteomes of higher
eukaryotic eukaryotic Effectively maintains a database of Effectively maintains a database of
cross references between the primary cross references between the primary data sources data sources
Minimally redundant yet maximally Minimally redundant yet maximally complete sets of proteins for featured complete sets of proteins for featured species (one sequence per transcript) species (one sequence per transcript)
Primary sequence crossover Primary sequence crossover between databasesbetween databases
Primary sequence is exchanged Primary sequence is exchanged between databasesbetween databases– But what is primary sequence?But what is primary sequence?– Sometimes only ‘finished’ sequence is Sometimes only ‘finished’ sequence is
sharedshared NOT SHAREDNOT SHARED
– Gene Models (and therefore proteins)Gene Models (and therefore proteins)– Some EST/GSSSome EST/GSS– From some genome projects not yet From some genome projects not yet
publishedpublished
Consortia identifiersConsortia identifiers
Most key species have a consortia / Most key species have a consortia / group / community that provides the group / community that provides the key identifiers in the fieldkey identifiers in the field
Humans Humans – Was HUGO (HUman Genome Was HUGO (HUman Genome
Organisation)Organisation)– now the HGNC (Human Genome now the HGNC (Human Genome
Nomenclature Committee)Nomenclature Committee) Some of limited use as have Some of limited use as have
incomplete coverageincomplete coverage
Database IdentifiersDatabase Identifiers
Every dataset has their own system of Every dataset has their own system of identifying gene/proteinidentifying gene/protein
Example: Human ADH4Example: Human ADH4– EnsemblEnsembl
ENSG00000198099 ENST00000423445 ENSG00000198099 ENST00000423445 ENSP00000397939 ENSP00000397939
– SwissProtSwissProt ADH4_HUMAN P08319 ADH4_HUMAN P08319
– RefSeqRefSeq NM_000670.3NM_000670.3 NP_000661.2 NP_000661.2
– GenBank GenBank gi|71565152|ref|NP_000661.2| gi|71565152|ref|NP_000661.2|
Keeping Track of ChangesKeeping Track of Changes
Gene models can changeGene models can change– Will the id you used yesterday still get Will the id you used yesterday still get
the same sequence today?the same sequence today?– Or: How to you get the latest version of Or: How to you get the latest version of
a sequence?a sequence?
Keeping Track of ChangesKeeping Track of Changes GenbankGenbank
– Gi number changes each time, often removed Gi number changes each time, often removed when it gets supersededwhen it gets superseded
SwissProtSwissProt– Accession changes each time (P08319) but the Accession changes each time (P08319) but the
ID remains constant (ADH4_HUMAN)ID remains constant (ADH4_HUMAN) RefSeq and EnsemblRefSeq and Ensembl
– Revision based ids Revision based ids NM_000670.3 ENSG00000198099.1NM_000670.3 ENSG00000198099.1
– XXX.number XXX.number XXX always retrieve latestXXX always retrieve latest XXX.number retrieves the versionXXX.number retrieves the version
Converting between Converting between identifiersidentifiers
Most databases understand an Most databases understand an identifier from another database – identifier from another database – some are better than otherssome are better than others– Best to use UniprotBest to use Uniprot
NEVER rely on common names NEVER rely on common names
Standard Keywords and Standard Keywords and annotation annotation
Problem: Many ways to name a geneProblem: Many ways to name a gene– Reductase = oxidase = dehydrogenaseReductase = oxidase = dehydrogenase
Gene Ontology Consortium [GO]Gene Ontology Consortium [GO]– GO terms standardise naming GO terms standardise naming – Note that errors may still occur in the Note that errors may still occur in the
assignment of terms assignment of terms – Found in RefSeq, SwissProt and most Found in RefSeq, SwissProt and most
genome databasesgenome databases GO browsers e.g. AmiGOGO browsers e.g. AmiGO
Annotation IssuesAnnotation Issues
Most annotation by inference from Most annotation by inference from homology/orthologyhomology/orthology
Domain specific function rather than Domain specific function rather than protein specific functionprotein specific function
Programs may give different answers Programs may give different answers – not possible for one source to store – not possible for one source to store all possibilitiesall possibilities
Distributed AnnotationDistributed Annotation
Information on a protein or gene can Information on a protein or gene can be stored on multiple (often be stored on multiple (often specialised) specialised) serversservers– Distributed Annotation System [DAS]Distributed Annotation System [DAS]
Data can be accessed using Data can be accessed using clientclient software that checks these sitessoftware that checks these sites– Built into the ensEMBL websiteBuilt into the ensEMBL website
DAS ClientDAS ClientDAS ClientDAS ClientDAS ClientDAS Servers
DAS Client
Reference Data
DISTRIBUTED ANNOTATION SYSTEM
Das ServersDas Servers
Free one in ensEMBLFree one in ensEMBL I have my own in-house I have my own in-house
– Currently storing CpG islands Currently storing CpG islands – Speak to me if you need to use itSpeak to me if you need to use it
Found on DAS registryFound on DAS registry– many DAS clients can look up this many DAS clients can look up this
registryregistry
Some DAS ClientsSome DAS Clients
Dedicated Web siteDedicated Web site– Dasty2 [Protein DAS]Dasty2 [Protein DAS]
Genome browsersGenome browsers– ensEMBL web site [Protein and Gene DAS]ensEMBL web site [Protein and Gene DAS]
Protein structure viewersProtein structure viewers– Spice [3D Protein DAS]Spice [3D Protein DAS]
Alignment programs Alignment programs – Jalview [Protein and Gene DAS]Jalview [Protein and Gene DAS]
GotchasGotchas
Coordinate systems MUST matchCoordinate systems MUST match– Will change with each genome assemblyWill change with each genome assembly
Use a single source only Use a single source only – e.g. UniProt for proteinse.g. UniProt for proteins
More key databases1More key databases1
Conserved Domains (See later siminar)Conserved Domains (See later siminar)– Interpro: hosts and names from multiply Interpro: hosts and names from multiply
projectsprojects– Each domain often functionally annotatedEach domain often functionally annotated– More in following talkMore in following talk
Mendelian Inheritance in Man (at NCBI)Mendelian Inheritance in Man (at NCBI)– Phenotype centric, literature curatedPhenotype centric, literature curated
Population differencesPopulation differences– dbSNP: polymorphisms (also in Ensembl)dbSNP: polymorphisms (also in Ensembl)
– PharmGkb: drug efficacyPharmGkb: drug efficacy
More Key databases2 More Key databases2
MicroarrayMicroarray– EBI : ArrayExpressEBI : ArrayExpress– NCBI: GEO NCBI: GEO
Pathway DatabasesPathway Databases– Reactome Reactome – KEGGKEGG– Cytoscape applicationCytoscape application
ReferencesReferences NCBI NCBI www.ncbi.nlm.nih.govwww.ncbi.nlm.nih.gov EBI EBI www.ebi.ac.ukwww.ebi.ac.uk UniProt UniProt http://beta.uniprot.orghttp://beta.uniprot.org Sanger Centre Sanger Centre www.sanger.ac.ukwww.sanger.ac.uk ensEMBL ensEMBL www.ensembl.orgwww.ensembl.org Firefox and BioBar Firefox and BioBar www.mozilla.orgwww.mozilla.org SPICE (Das Client) SPICE (Das Client) http://www.efamily.org.uk/software/dasclients/spice/http://www.efamily.org.uk/software/dasclients/spice/ Gene Ontology Consortium Gene Ontology Consortium www.geneontology.orgwww.geneontology.org AmiGO AmiGO www.godatabase.orgwww.godatabase.org GMOD GMOD www.gmod.orgwww.gmod.org BioMoby BioMoby www.biomoby.orgwww.biomoby.org Dasty2 Dasty2 http://www.ebi.ac.uk/dasty/http://www.ebi.ac.uk/dasty/ Interpro: Interpro: http://http://www.ebi.ac.uk/interprowww.ebi.ac.uk/interpro// Reactome: http//www.Reactome: http//www.reactomereactome.org/.org/ KEGG: http://www.genome.jp/ KEGG: http://www.genome.jp/ keggkegg/ / keggkegg2.html2.html Cytoscape http://www.Cytoscape http://www.cytoscapecytoscape.org .org PharmGkb: http://www. PharmGkb.org/PharmGkb: http://www. PharmGkb.org/