Biological Database Systems
Denis Shestakov, University of Turku/Tampere
BioDB-1, ShestakovPage 2
Course Information
• Course structure:
  – Lectures: approx. 12 (plus today’s intro and a review lecture at the end of the course)
  – Project work: details will be given next time
  – Exam: easy to pass if project is done
  – URL:
Course Information
• Dates:
  – Period 2: 27.11, 4.12, 11.12
  – Period 3: 10 meetings on Mondays/Wednesdays
• Contact info:
  – Email:
  – ICT, B6019: at 15-18 on Tuesdays
Course Information: Literature
• Slides
• References at the end of slides
• Books:
  – Bioinformatics: Managing Scientific Data by Lacroix & Critchlow, Morgan Kaufmann, 2003, ISBN-10: 155860829X
  – Database System Concepts, 5th edition by Silberschatz, Korth & Sudarshan, McGraw-Hill, 2005, ISBN-10: 0072958863
• Articles:
  – Biological database design and implementation by Birney & Clamp (the Ensembl project), Briefings in Bioinformatics, 5(1):31-38, 2004
Biological Database Systems
1.1. Course Content
1.2. Course Objectives
1.3. Database and DBMS
1.4. Biological Databases
Course content: main topics
1. Database concepts, database design process
2. Relational data model
3. Introduction to SQL
4. XML and XML-based databases
5. Data structures for biological data: storage and querying
6. Model organism databases
Course content: main topics
7. LIMS, BioPostgres
8. Analysis workflows, web services
9. Integration of biological data
10. Integration of biological data, example of integration system
11. Research issues in scientific databases
12.* Project discussion, exam preparation
Course focus
• Database issues:
  – Biology-specific
  – Representation of biological data
  – Design of biological databases
• NOT about:
  – Usage of existing databases
  – Accessing/retrieving data from bio-databases
Course goal
Give basic knowledge of biological* database design
* - for molecular biology
Do you need to know that?
• Work in a “wet” laboratory:
  – One bioinformatician and many biologists
  – Likely to be the IT guru for others
  – Expect to answer IT-related questions
• Work in a bioinformatics lab:
  – Many bioinformaticians
  – Group may maintain several databases
  – Basics are helpful
• Create/maintain biological databases:
  – Start learning!
  – Ask for more information
Database?
From Merriam-Webster dictionary: (http://www.merriam-webster.com/dictionary/database)
Database?
• A collection of data:
  – structured
  – searchable (i.e., indexable)
  – updated
  – cross-referenced
• Objective:
  – Transform “meaningless” raw data into useful information which can be accessed and analysed in the best way
• Database Management System (DBMS):
  – software designed for the purpose of managing databases (access, insert, delete, update, etc.)
DBMS
• A set of tools that:
  – Store
  – Extract
  – Modify
[Figure: diagram with the labels Database; Store, Extract, Modify; USERS]
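The three DBMS operations above (store, extract, modify) can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample values (gene, xref, geneA, ExampleDB, EX0001) are made up for illustration and do not come from the slides; the index and the cross-reference table are only there to mirror the "searchable" and "cross-referenced" properties of a database.

```python
import sqlite3


def demo_dbms():
    """Store, extract and modify a few rows in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Structured: data lives in tables with a fixed set of columns.
    cur.execute(
        "CREATE TABLE gene ("
        "gene_id INTEGER PRIMARY KEY, symbol TEXT NOT NULL, organism TEXT NOT NULL)"
    )
    # Cross-referenced: each row points to an entry in some external database.
    cur.execute(
        "CREATE TABLE xref ("
        "gene_id INTEGER REFERENCES gene(gene_id), external_db TEXT, accession TEXT)"
    )
    # Searchable: an index speeds up lookups by gene symbol.
    cur.execute("CREATE INDEX idx_gene_symbol ON gene(symbol)")

    # Store (hypothetical sample data).
    cur.executemany(
        "INSERT INTO gene (symbol, organism) VALUES (?, ?)",
        [("geneA", "organism X"), ("geneB", "organism Y")],
    )
    cur.execute("INSERT INTO xref VALUES (1, 'ExampleDB', 'EX0001')")

    # Extract: join a gene with its cross-reference.
    rows = cur.execute(
        "SELECT g.symbol, x.accession FROM gene g "
        "JOIN xref x ON x.gene_id = g.gene_id WHERE g.symbol = ?",
        ("geneA",),
    ).fetchall()

    # Modify: update an existing row and read it back.
    cur.execute(
        "UPDATE gene SET organism = ? WHERE symbol = ?", ("organism Z", "geneB")
    )
    updated = cur.execute(
        "SELECT organism FROM gene WHERE symbol = ?", ("geneB",)
    ).fetchone()[0]

    conn.commit()
    conn.close()
    return rows, updated


if __name__ == "__main__":
    print(demo_dbms())
```

Here SQLite plays the role of the DBMS: the script never touches the storage format directly and only issues store, extract and modify requests.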
Biological Databases?
Explosive growth in biological data
• E.g., tremendous increase in nucleotide sequences (first increase in data due to the development of the polymerase chain reaction (PCR) technique in 1983)
• 1980: 80 genes fully sequenced
• …
Biological Databases?
• EMBL Database Growth:
  – Total nucleotides (Nov 07: 188,490,792,445)
  – Number of entries (Nov 07: 106,144,026)
Biological Databases?
• Data (genomic sequences, 3D structures, 2D gel analysis, microarrays, ...) are directly submitted to databases
• Essential tools for biological research, as much as reading the relevant literature
Biological Databases: History
• 1965
  – Margaret Dayhoff et al. publish “Atlas of Protein Sequence and Structure”
• 1982
  – EMBL initiates DNA sequence databases, followed within a year by GenBank and in 1984 by the DNA Data Bank of Japan
• 1988
  – EMBL/GenBank/DDBJ agree on common format for data elements
Biological Databases: some statistics
• More than 1000 different databases
  – 968 databases reported in The Molecular Biology Database Collection: 2007 update by Galperin, Nucleic Acids Research, 2007, Vol. 35, Database issue, D3-D4
  – Metabase: database of biological databases, http://biodatabase.org/index.php/Main_Page
• Database sizes: <100kB to >100GB (EMBL >500GB)
  – DNA: >100GB
  – Protein: 1GB
  – 3D structure: 5GB
• Update frequency: daily to annually
• Freely accessible (as a rule)
Some databases in the field of molecular biology
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.
Find more at http://biodatabase.org
Categories of Biological Databases
1. Nucleotide sequences
2. Genomics
3. Mutation/polymorphism
4. Protein sequences
5. Protein domain/family
6. Proteomics (2D gel, MS)
Categories of Biological Databases
7. Microarray
8. Organism-specific
9. 3D structure
10. Metabolism
11. Bibliography
12. Others
Biological Databases: special features
• Autonomous: many independent maintainers
• Heterogeneous data formats: e.g., various data formats for the same data elements
• Dynamic: frequent and continuous changes in data content (and, more importantly, in data schema)
• Broad domain knowledge
• Workflow-oriented: databases + rich set of analysis tools
• Information integration is essential: aggregate data from several databases
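To make the points on heterogeneous formats and integration more concrete, here is a minimal sketch of how the same data element (a sequence entry) might arrive in two different formats and be mapped to one common record before aggregation. Both input formats, the SequenceRecord type and the function names are assumptions invented for this illustration; they do not describe the actual export format of any real database or the integration system discussed later in the course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Common (hypothetical) representation used after integration."""
    accession: str
    organism: str
    sequence: str


def from_tabular(line):
    """Parse a hypothetical tab-delimited row: accession, organism, sequence."""
    accession, organism, sequence = line.rstrip("\n").split("\t")
    return SequenceRecord(accession, organism, sequence.upper())


def from_fasta_like(text):
    """Parse a hypothetical FASTA-like entry with header '>accession|organism'."""
    header, *seq_lines = text.strip().splitlines()
    accession, organism = header[1:].split("|", 1)
    return SequenceRecord(
        accession.strip(), organism.strip(), "".join(seq_lines).upper()
    )


def integrate(*sources):
    """Aggregate records from several sources, keyed by accession.

    The first source that provides an accession wins; a real system would
    need an explicit policy for conflicting entries.
    """
    merged = {}
    for records in sources:
        for record in records:
            merged.setdefault(record.accession, record)
    return merged


if __name__ == "__main__":
    source_a = [from_tabular("X1\tOrganism A\tacgt\n")]
    source_b = [
        from_fasta_like(">X2|Organism B\nAC\nGT\n"),
        from_fasta_like(">X1|Organism A\nACGT\n"),
    ]
    for accession, record in sorted(integrate(source_a, source_b).items()):
        print(accession, record)
```

The per-source parsers isolate format differences, so a change in one source's format or schema only affects its own parser, which matters when the sources are autonomous and change frequently.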
Biological Databases: integration
Figure is taken from Bioinformatics: Managing Scientific Data by Lacroix & Critchlow, p.20