Databases · Protein Structure and Bioinformatics Group · 7 Oct 2016

Databases

Protein Structure and Bioinformatics Group

Purpose of the lecture

● provide an overview of available databases
● what are they for?
● the contents of the most important databases
● how to query these databases
● make you aware of drawbacks and pitfalls

Overview

● intro on databases
● database models
● overview of biological databases
● details of often used databases and/or providers
● some remarks on data quality

Why databases?

● Exponential growth of:
– sequences
– structures
– literature
● Need for efficient storage and management tools
● Need for standardization

Solution: databases

● coherent, consistent, designed for special purpose
● data model: clearly defined data structure
● database management system: easy access and management

What is a database

● any organized collection of data
– card filing system
– telephone book
● now: a collection of information organized in such a way that a computer program can quickly select desired pieces of data
● you need: Database Management System (DBMS)

Database models: logical structure of a database

● flat file
● relational model (most used)
● other:
– object-oriented, XML, hierarchical, network
● Database Management Systems (DBMS) include: MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base and FileMaker Pro

Flat file

● written in plain text, in a defined standard format
● often tab-delimited or comma-separated text files
● each line is a record
● fields are separated by delimiters: tabs, commas
● searching is sequential only
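A minimal sketch of what sequential searching in a flat file looks like in practice. The file content, the column names and the second record are made up for illustration; only the ATM accession is taken from these slides. Every query has to read the records one by one from the top.

```python
import csv
import io

# Hypothetical flat file: one record per line, fields separated by tabs.
FLAT_FILE = (
    "accession\tgene\torganism\n"
    "NM_000051.3\tATM\tHomo sapiens\n"
    "EXAMPLE_0002\tGENE2\tMus musculus\n"
)


def find_records(handle, field, value):
    """Scan the file line by line and return every record where field == value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[field] == value]


hits = find_records(io.StringIO(FLAT_FILE), "gene", "ATM")
print(hits[0]["accession"])
```

Because there is no index, the cost of each search grows with the size of the file, which is one reason larger collections are usually loaded into a DBMS.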

DNA and protein sequences in FASTA format

>gi|71902539|ref|NM_000051.3| Homo sapiens ataxia telangiectasia mutated (ATM), mRNA
CCGGAGCCCGAGCCGAAGGGCGAGCCGCAAACGCTAAGTCGCTGGCCATTGGTGGACATGGCGCAGGCGCGTTTGCTCCGACGGGCCGAATGTTTTGGGGCAGTGTTTTGAGCGCGGAGACCGCGTGATACTGGATGCGCATGGGCATACCGTGCTCTGCGGCTGCTTGGCGTTGCTTCTTCCTCCAGAAGTGGGCGCTGGGCAGTCACGCAGGGTTTGAACCGGAAGCGGGAGTAGGTAGCTGCGTGGCTAACGGAGAAAAGAAGCCGTGGCCGCGGGAGGAGGCGAGAGGAGTCGGGATCTGCGCTGCAGCCACCGCCGCGGTTGATACTACTTTGACCTTCCGAGTGCAGTGACAGTGATGTGTGTTCTGAAATTGTGAACCATGAGTCTAGTACTTAATGATCTGCTTATCTGCTGCCGTCAACTAGAACATGATAGAGCTACAGAACGAAAGAAAGAAGTTGAGAAATTTAAGCGCCTGATTCGAGATCCTGAAACAATTAAACATCTAGATCGGCATTCAGATTCCAAACAAGGAAAATATTTGAATTGGGATG

>gi|71902540|ref|NP_000042.3| serine-protein kinase ATM [Homo sapiens]
MSLVLNDLLICCRQLEHDRATERKKEVEKFKRLIRDPETIKHLDRHSDSKQGKYLNWDAVFRFLQKYIQKETECLRIAKPNVSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQELLNYIMDTVKDSSNGAIYGADCSNILLKDILSVRKYWCEISQQQWLELFSVYFRLYLKPSQDVHRVLVARIIHAVTKGCCSQTDGLNSKFLDFFSKAIQCARQEKSSSGLNHILAALTIFLKTLAVNFRIRVCELGDEILPTLLYIWTQHRLNDSLKEVIIELFQLQIYIHHPKGAKTQEKGAYESTKWRSILYNLYDLLVNEISHIGSRGKYSSGFRNIAVKENLIELMADICHQVFNEDTRSLEISQSYTTTQRESSDYSVPCKRKKIELGWEVIKDHLQKSQNDFDLVPWLQIATQLISKYPASLPNCELSPLLMILSQLLPQQRHGERTPYVLRCLTEVALCQDKRSNLESSQKSDLLKLWNKIWCI
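A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text into (header, sequence) pairs. The function name `parse_fasta` is hypothetical, the example sequences are shortened, and the accession lookup assumes the older "gi|...|ref|...|" header layout shown on this slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened example input; the sequences are truncated for illustration.
example = """>gi|71902539|ref|NM_000051.3| Homo sapiens ataxia telangiectasia mutated (ATM), mRNA
CCGGAGCCCGAGCCGAAGGGCGAGCCGCAAACGC
TAAGTCGCTGGCCATTGG
>gi|71902540|ref|NP_000042.3| serine-protein kinase ATM [Homo sapiens]
MSLVLNDLLICCRQLEHDRATERKK
"""

for header, sequence in parse_fasta(example):
    accession = header.split("|")[3]  # fourth pipe-separated field in this layout
    print(accession, len(sequence))
```

For real work an established parser such as the one in Biopython is usually a safer choice than a hand-written one.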

Relational database

● database is composed of tables
● each table has records (rows)
● each record has fields (columns)
● relational:
– tables hold logically related sets of data
– each record has a unique identifier: primary key
– relations between tables through keys

Relational database

● PK = primary key, unique identifier

● FK = foreign key, connects to primary key in Customer table
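The diagram from this slide is not part of the transcript, so the following is only a sketch of the idea using Python's built-in `sqlite3` module. The `Customer` table name comes from the slide; the `Orders` table, all column names and the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Customer (
    customer_id INTEGER PRIMARY KEY,   -- PK: unique identifier
    name        TEXT NOT NULL
);
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,   -- PK of this table
    customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),  -- FK
    item        TEXT NOT NULL
);
""")

conn.execute("INSERT INTO Customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO Orders VALUES (10, 1, 'pipette tips')")

# A foreign key that points at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (11, 99, 'gloves')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Follow the key from Orders back to Customer with a join.
rows = conn.execute(
    "SELECT c.name, o.item FROM Orders AS o "
    "JOIN Customer AS c ON c.customer_id = o.customer_id"
).fetchall()
print(rejected, rows)
```

The customer's name is stored once, in `Customer`; each order only carries the key, which is how the relational model avoids duplicating information.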

Aspects of relational databases

● tables hold logically related sets of data
● order of rows irrelevant (random access!)
● rows are unique: no duplication of information
● searching is specifying what you want:
– which field(s) from which table(s) under which condition(s)
– SQL (Structured Query Language)
● searching speed can be increased by using indexes
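As a sketch of "which fields, from which table, under which conditions", the example below runs one SQL query through Python's built-in `sqlite3` module and adds an index on the searched column. The `protein` table, its columns and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY,"
    " gene TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("ACC0001", "GENE_A", "Homo sapiens", 350),
        ("ACC0002", "GENE_B", "Homo sapiens", 659),
        ("ACC0003", "GENE_A", "Mus musculus", 348),
    ],
)

# An index on the searched column lets the DBMS avoid scanning every row.
conn.execute("CREATE INDEX idx_protein_gene ON protein (gene)")

# Which fields, from which table, under which conditions.
query = "SELECT accession, length FROM protein WHERE gene = ? AND organism = ?"
params = ("GENE_A", "Homo sapiens")
rows = conn.execute(query, params).fetchall()
print(rows)

# Ask SQLite how it plans to run the query; the last column describes each step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
print(plan)
```

The query states what is wanted, not how to find it; whether the index is actually used is decided by the DBMS, and the exact wording of the plan differs between SQLite versions.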

Querying a database with SQL

How to access databases

● Web-based Graphical User Interfaces (GUI)
– you do not see the underlying database structure
– output defined by host/provider
● File Transfer Protocol (FTP)
– mostly flat files
● Application Programming Interface (API)
– you approach the database programmatically through web services (SOAP/REST)
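The slides do not name a specific web service, so as one possible illustration of REST-style access the sketch below builds a request for the NCBI E-utilities `efetch` endpoint. The helper names `build_efetch_url` and `fetch_fasta` are hypothetical, and the choice of database and return type is an assumption for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Compose the query string that identifies the record and output format."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download one record as FASTA text (needs network access)."""
    with urlopen(build_efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


url = build_efetch_url("NM_000051.3")
print(url)
# fetch_fasta("NM_000051.3") would perform the actual request.
```

Unlike the web GUI, the output format here is chosen by the caller through the request parameters. Providers publish usage guidelines for programmatic access, which should be checked before sending many requests.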

Biological database providers/host

● EBI European Bioinformatics Institute

● SIB Swiss Institute of Bioinformatics

● NCBI National Center for Biotechnology Information

● DDBJ DNA Databank of Japan

Classification of biological databases

Primary: hold experimentally derived data

● experimental data repositories
● sequence databases
● structure databases

Classification of biological databases

Secondary: derived information from primary databases

● sequence related
● genome related
● structure related
● expression data (RNA, protein)
● pathway information

Experimental data repositories

● Gene Expression Omnibus (GEO)
● ArrayExpress
● European Nucleotide Archive (ENA)

Primary sequence databases

DNA/nucleotide sequences

Ensembl (EBI/Wellcome Trust Sanger Inst.)

GenBank (NCBI)

DNA Data Bank of Japan (DDBJ)

European Nucleotide Archive (EMBL-EBI)

Primary sequence databases

protein sequences

UniProtKB UniProt Knowledgebase
– UniProtKB/Swiss-Prot
– UniProtKB/TrEMBL

NCBI Protein

Primary structure databases

Protein Data Bank (PDB)

Nucleic Acid Database

Cambridge Structural Database

Secondary databases

● sequence related

– ProSite

– Pfam

– Enzyme

– REBase (restriction enzymes)

Secondary databases

● genome related

Online Mendelian Inheritance in Man

TRANSFAC (transcription factors)

Secondary databases

● structure related
– DSSP: Database of Secondary Structure Assignments
– HSSP: Homology-derived Secondary Structure of Proteins
– Dali: comparing protein structures in 3D

Secondary databases

● expression data
– Expression Atlas
– Human Protein Atlas
● pathway related
– KEGG: Kyoto Encyclopedia of Genes and Genomes

Databases on Human Genes and Diseases

● General human genetics databases

e.g. HGMD

● General polymorphism databases

e.g. NCBI SNP (dbSNP)

● Cancer gene and variant databases

e.g. COSMIC, Cancer Genome Atlas

Databases on Human Genes and Diseases

● Gene-, system- or disease-specific databases
– Locus-Specific DataBases, see e.g. HGVS http://www.hgvs.org
– Disease-specific, e.g. IDbases: locus-specific databases for immunodeficiency-causing variations http://structure.bmc.lu.se/idbase/
– System-specific, e.g. GWAS Catalog: genome-wide association studies

Databases on Human Genes and Diseases

● Online Mendelian Inheritance in Man

Locus-Specific Databases (LSDBs) list at www.hgvs.org/locus-specific-mutation-databases

IDbases at structure.bmc.lu.se/idbase

BTKbase at LOVD.nl

Nucleic Acids Research

● The NAR online Molecular Biology Database Collection is published in the Database issue each year
● 2016: 1685 listings
● URL: http://www.oxfordjournals.org/nar/database/c/

Wikipedia

URL: http://en.wikipedia.org/wiki/List_of_biological_databases

PubMed

● The access point to medicine-related publications
● PubMed comprises more than 26 million citations for biomedical literature

URL: http://www.ncbi.nlm.nih.gov/pubmed

Some examples

● NCBI
● UniProtKB/Swiss-Prot
● PDB
● Ensembl

NCBI

https://www.ncbi.nlm.nih.gov/

NCBI Genetics & Medicine

NCBI Handbook

NCBI search

NCBI Gene: download settings

NCBI Gene: display settings

NCBI Gene: Genomic regions etc.

NCBI Gene: Reference sequences

information about the fields in GenBank records can be found at:

● NCBI handbook● https://www.ncbi.nlm.nih.gov/genbank/samplerecord/

NCBI dbSNP: short genetic variations

UniProt

www.uniprot.org

UniProtKB/Swiss-Prot

Protein Data Bank in Europe (PDBe)

Protein Data Bank (in Japan)

Protein Data Bank

Ensembl

www.ensembl.org

Ensembl variants

KEGG

integrating genomic and chemical information with systems information

KEGG Pathways

Some remarks about data quality

● how up to date is the database?
● is the database hand-curated by experts?
● when using data from a database, try to check both points
● be aware that there can always be errors somewhere

Example of checking data

● variant descriptions can be checked with the Mutalyzer Name Checker tool: https://mutalyzer.nl
● Name Checker takes a complete sequence variant description (e.g. NM_000061.2:c.214A>G)
● the description is checked for conformance to the HGVS nomenclature rules
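To give a feel for what a syntax check involves, here is a deliberately simplified sketch. It is not Mutalyzer and does not use its code or API: it only recognises the shape of a simple coding-DNA substitution such as the example above, and the pattern and function name are assumptions made for illustration. A real checker handles many more variant types and also compares the description against the reference sequence.

```python
import re

# Simplified pattern: versioned accession, ":c.", a position, then REF>ALT.
SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+)"   # versioned accession, e.g. NM_000061.2
    r":c\.(?P<position>\d+)"               # coding DNA position
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"    # substitution, e.g. A>G
)


def looks_like_simple_substitution(description):
    """True if the text has the shape of a simple c. substitution."""
    match = SUBSTITUTION.match(description)
    # A substitution must change the base, so ref and alt may not be equal.
    return match is not None and match["ref"] != match["alt"]


for text in ("NM_000061.2:c.214A>G", "NM_000061:c.214A>G", "NM_000061.2:c.214A>A"):
    print(text, looks_like_simple_substitution(text))
```

The second example fails because the reference has no version number, the third because the base does not change. Passing this check says nothing about whether position 214 really carries an A in the reference, which is the kind of error a dedicated tool is needed for.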

Mutalyzer Name Checker

Mutalyzer Name Check result (part)

Thanks

● Protein Structure and Bioinformatics Group
● BMC B13
● [email protected]
● http://structure.bmc.lu.se