Bioinformatics Resources, PDB - rostlab.org
BioinfRes SS 15
Bioinformatics Resources - PDB -
Lecture & Exercises Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12
Organization - Exam Date
● Exam takes place on Friday, July 31st
● Room: MW 0250 (Mechanical Engineering Building)
● Time scheduled: 8.30-10.30 (might be later)
● Duration: approx. 90 min
Advertisement
● Bachelor thesis: Carry your Genes (CyG)
● In collaboration with Certgate GmbH and Iteratec GmbH
● Topics: personalized medicine, mobile apps, encryption
● Student assistant (Hiwi) position included
● see https://www.rostlab.org/teaching/theses
PDB – History
● 1968: Brookhaven RAster Display (BRAD)
● 1969: Edgar Meyer came up with a file format for atomic coordinates
● 1971: remote access with the SEARCH program written by Meyer -> PDB functional
● 1998: transfer to RCSB (Research Collaboratory for Structural Bioinformatics)
● 2003: formation of wwPDB (PDBe, RCSB, PDBj; BMRB joined in 2006)
References
● F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer Jr., M.D. Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, M. Tasumi (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535-542.
● H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000) The Protein Data Bank. Nucleic Acids Research 28: 235-242.
● H.M. Berman, K. Henrick, H. Nakamura (2003) Announcing the worldwide Protein Data Bank. Nature Structural Biology 10 (12): 980.
● http://www.rcsb.org/pdb/home/home.do
Current Composition*
Experimental Method    Proteins  Nucleic Acids  Protein/Nucleic Acid complexes  Other    Total
X-ray diffraction        90,662          1,622                           4,510      4   96,798
NMR                       9,597          1,118                             225      8   10,948
Electron microscopy         566             29                             184      0      779
Hybrid                       70              3                               2      1       76
Other                       165              4                               6     13      188
Total                   101,060          2,776                           4,927     26  108,789
*May 18th, 2015
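The totals in the composition table can be re-derived with a few lines of Python. This is only an illustrative sketch: the dictionary layout and variable names are assumptions, and the counts are copied from the table (status of May 18th, 2015).

```python
# Entry counts per experimental method, copied from the composition table.
# Column order: proteins, nucleic acids, protein/nucleic acid complexes, other.
counts = {
    "X-ray diffraction": (90662, 1622, 4510, 4),
    "NMR": (9597, 1118, 225, 8),
    "Electron microscopy": (566, 29, 184, 0),
    "Hybrid": (70, 3, 2, 1),
    "Other": (165, 4, 6, 13),
}

# Row totals: all entries determined with one method.
row_totals = {method: sum(row) for method, row in counts.items()}

# Column totals: all entries of one molecule type.
column_totals = [sum(column) for column in zip(*counts.values())]

grand_total = sum(row_totals.values())

# Share of each method in percent of all entries.
shares = {method: 100.0 * n / grand_total for method, n in row_totals.items()}

for method, share in shares.items():
    print(f"{method:20s} {row_totals[method]:7d} {share:5.1f} %")
```

The computed sums match the "Total" row and column of the table; X-ray diffraction accounts for roughly 89 % of all entries.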
Growth of PDB – All Entries
[Figure: number of PDB entries per year, 1972-2014; series: Yearly, Total; y-axis 0 to 120,000]
Entries According to Method
[Figure: number of entries by method; series: Total, X-Ray, NMR, EM; y-axis 0 to 120,000]
Growth of X-Ray Structures
[Figure: number of X-ray structures per year, 1972-2014; series: Yearly, Total; y-axis 0 to 100,000]
Growth of NMR Structures
[Figure: number of NMR structures per year, 1972-2014; series: Yearly, Total; y-axis 0 to 12,000]
Growth of EM Structures
[Figure: number of EM structures per year, 1972-2014; series: Yearly, Total; y-axis 0 to 800]
Unique CATH Folds (Topologies)
[Figure: number of unique CATH folds per year, 1972-2014; series: Yearly, Total; y-axis 0 to 1,600]
Unique CATH Superfamilies
[Figure: number of unique CATH superfamilies per year, 1972-2014; series: Yearly, Total; y-axis 0 to 3,000]
Atomic Coordinate Entry Format
● aka PDB format
● current version 3.30
● comprises 190 pages
● ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_A4.pdf
Record Format
● allowed characters: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 1234567890 `-=[]\;',./~!@#$%^&*()_+{}|:"<>?
● , : ; are delimiters; where they occur as ordinary characters they need to be escaped by \
● a file consists of multiple lines
● each line is 80 characters wide including EOL
● lines are self-identifying: the first six columns contain the record name, followed by a blank
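Because every line identifies itself, the record type can be read from a fixed slice. A minimal sketch, assuming each line is available as a Python string; the function name and the example lines are made up for illustration.

```python
def record_name(line: str) -> str:
    """Return the record name stored in columns 1-6 of a PDB line."""
    # The format counts columns from 1, Python slices count from 0.
    return line[0:6].strip()


# Hand-written example lines (hypothetical content), padded to 80 columns.
lines = [
    "HEADER    HYDROLASE".ljust(80),
    "ATOM      1  N   MET A   1".ljust(80),
    "END".ljust(80),
]

for line in lines:
    print(record_name(line))
```

Short record names such as END are padded with blanks, which is why the slice is stripped.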
Single Line Records, One Time/One Line
● CRYST1: Unit cell parameters, space group, and Z.
● END: Last record in the file.
● HEADER: First line of the entry, contains PDB ID code, classification, and date of deposition.
● NUMMDL: Number of models.
● .....
One Time/Multiple Lines (incomplete)
● AUTHOR: List of contributors.
● KEYWDS: List of keywords describing the macromolecule.
● SOURCE: Biological source of macromolecules in the entry.
● TITLE: Description of the experiment represented in the entry.
● subsequent lines have a continuation number
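Continued records can be stitched back together by sorting on the continuation number. The sketch below is an assumption-laden illustration: the column positions (continuation number ending in column 10, text in columns 11-80) are taken from the format description rather than from these slides, the helper name is hypothetical, and joining the pieces with a single blank is a simplification.

```python
def join_continued(lines, record):
    """Concatenate the text of one record type that spans several lines."""
    parts = []
    for line in lines:
        if line[0:6].strip() != record:
            continue
        # Columns 8-10 cover the continuation field of TITLE as well as SOURCE;
        # the field is blank on the first line of a record.
        continuation = line[7:10].strip()
        text = line[10:80].strip()  # columns 11-80
        parts.append((int(continuation) if continuation else 1, text))
    # Sort by continuation number so the pieces end up in file order.
    return " ".join(text for _, text in sorted(parts))


# Hypothetical TITLE record split over two lines, assembled field by field.
lines = [
    "HEADER".ljust(10) + "HYDROLASE",
    "TITLE".ljust(10) + "CRYSTAL STRUCTURE OF AN EXAMPLE PROTEIN IN COMPLEX WITH A",
    "TITLE".ljust(8) + " 2" + " HYPOTHETICAL LIGAND",
]

print(join_continued(lines, "TITLE"))
```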
Multiple Times/One Line (incomplete)
● ATOM: Atomic coordinate records for standard groups.
● CONECT: Connectivity records.
● DBREF: Reference to the entry in the sequence database(s).
● HELIX: Identification of helical substructures.
● SHEET: Identification of sheet substructures.
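ATOM is the record most often parsed in practice. The following sketch reads a few of its fields; the column positions are an assumption taken from the format description (they are not listed on these slides), only a subset of the fields is extracted, and the example line is hand-written.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Extract selected fixed-width fields from one ATOM line."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Hypothetical ATOM line; real entries contain one such line per atom.
example = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom(example)
print(atom)
```

Splitting on whitespace instead of slicing by column is tempting but unreliable, since neighbouring fields may touch when values fill their full width.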
Multiple Times/Multiple Lines (incomplete)
● FORMUL: Chemical formula of non-standard groups.
● HETNAM: Compound name of the heterogens.
● SEQRES: Primary sequence of backbone residues.
● SITE: Identification of groups comprising important entity sites.
● subsequent lines have a continuation number
Record Order
● Records have to appear in a defined order
● There are mandatory and optional records
● Some records are mandatory only under certain conditions
● Mandatory records without content are “NULL”
● examples for mandatory records:
  - HEADER
  - TITLE
  - COMPND
  - .....
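Both rules (mandatory records present, records in the defined order) can be checked mechanically. This is a rough sketch under stated assumptions: `ORDER` lists only the record types of the title section, `MANDATORY` holds only the three examples named above, and the function name is hypothetical. A real validator would need the complete lists from the format description.

```python
# Partial order list (title section only); an assumption for illustration.
ORDER = [
    "HEADER", "OBSLTE", "TITLE", "SPLIT", "CAVEAT", "COMPND", "SOURCE",
    "KEYWDS", "EXPDTA", "NUMMDL", "MDLTYP", "AUTHOR", "REVDAT", "SPRSDE", "JRNL",
]
# Only the mandatory records named as examples on the slide.
MANDATORY = {"HEADER", "TITLE", "COMPND"}


def check_records(names):
    """Return a list of problems found in a sequence of record names."""
    problems = [
        f"missing mandatory record {name}"
        for name in sorted(MANDATORY - set(names))
    ]
    rank = {name: i for i, name in enumerate(ORDER)}
    last = -1
    for name in names:
        if name not in rank:
            continue  # record type not covered by this partial list
        if rank[name] < last:
            problems.append(f"{name} appears out of order")
        last = max(last, rank[name])
    return problems


print(check_records(["HEADER", "TITLE", "COMPND", "SOURCE"]))
print(check_records(["TITLE", "HEADER", "COMPND"]))
```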
Records Belong to Sections
Section              Record Type
Title                HEADER, OBSLTE, TITLE, SPLIT, CAVEAT, COMPND, SOURCE, KEYWDS, EXPDTA, NUMMDL, MDLTYP, AUTHOR, REVDAT, SPRSDE, JRNL
Remark               REMARKs 0-999
Primary structure    DBREF, SEQADV, SEQRES, MODRES
Secondary structure  HELIX, SHEET
Coordinate           MODEL, ATOM, ANISOU, TER, HETATM, ENDMDL
....                 ....
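The section table translates directly into a lookup structure. A small sketch, assuming only the sections shown on the slide (the table is truncated there, so record types from other sections come back as "unknown"); the names `SECTIONS` and `section_of` are illustrative.

```python
# Sections and record types as listed on the slide (incomplete).
SECTIONS = {
    "Title": ["HEADER", "OBSLTE", "TITLE", "SPLIT", "CAVEAT", "COMPND",
              "SOURCE", "KEYWDS", "EXPDTA", "NUMMDL", "MDLTYP", "AUTHOR",
              "REVDAT", "SPRSDE", "JRNL"],
    "Remark": ["REMARK"],
    "Primary structure": ["DBREF", "SEQADV", "SEQRES", "MODRES"],
    "Secondary structure": ["HELIX", "SHEET"],
    "Coordinate": ["MODEL", "ATOM", "ANISOU", "TER", "HETATM", "ENDMDL"],
}

# Invert the table: record type -> section.
SECTION_OF = {
    record: section
    for section, records in SECTIONS.items()
    for record in records
}


def section_of(line: str) -> str:
    """Look up the section of a PDB line via its record name (columns 1-6)."""
    return SECTION_OF.get(line[0:6].strip(), "unknown")


print(section_of("REMARK   2 RESOLUTION."))
print(section_of("ATOM      1  N   MET A   1"))
```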
Records Even Have Formats
● A record consists of fields with specified data types
● Data could be: A-Z, a-z, atom name, a nine-character string representing a date, a number, ...
● Complex data: token (string followed by ‘:’), a comma-separated list of strings, a fixed-format string literal
● ....
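Three of these data types are easy to illustrate. The sketch below makes several assumptions that are not on the slide: the date layout DD-MMM-YY, an arbitrary pivot for two-digit years, and hypothetical function names. Escaped delimiters (`\,`) are not handled.

```python
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def parse_pdb_date(text: str, pivot: int = 70) -> date:
    """Parse a nine-character date such as '18-MAY-15' (assumed DD-MMM-YY)."""
    day, month, year = text.split("-")
    yy = int(year)
    # Two-digit years are ambiguous; this sketch maps 70-99 to 19xx
    # and 00-69 to 20xx. The pivot is an assumption, not part of the format.
    century = 1900 if yy >= pivot else 2000
    return date(century + yy, MONTHS[month.upper()], int(day))


def parse_list(text: str) -> list:
    """Split a comma-separated list of strings and trim surrounding blanks."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_token(text: str) -> tuple:
    """Split 'TOKEN: value;' at the first colon."""
    token, _, value = text.partition(":")
    return token.strip(), value.strip().rstrip(";")


print(parse_pdb_date("18-MAY-15"))
print(parse_list("HYDROLASE, ZINC, METALLOPROTEASE"))
print(parse_token("MOL_ID: 1;"))
```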
Example Header
COLUMNS   DATA TYPE     FIELD             DEFINITION
------------------------------------------------------------------------------------
 1 -  6   Record name   "HEADER"
11 - 50   String(40)    classification    Classifies the molecule(s). *
51 - 59   Date          depDate           Deposition date. This is the date the coordinates were received at the PDB.
63 - 66   IDcode        idCode            This identifier is unique within the PDB.
* taken from a class list from the current wwPDB Annotation Documentation Appendices (http://www.wwpdb.org/docs.html)
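The HEADER column table maps one-to-one onto string slices. A minimal sketch, assuming the line is a plain Python string; the function name, the example classification, date and ID code are all made up.

```python
def parse_header(line: str) -> dict:
    """Split a HEADER line into the fields of the column table."""
    if line[0:6] != "HEADER":
        raise ValueError("not a HEADER record")
    return {
        "classification": line[10:50].strip(),  # columns 11-50
        "depDate": line[50:59].strip(),         # columns 51-59
        "idCode": line[62:66].strip(),          # columns 63-66
    }


# Hypothetical HEADER line, assembled field by field so the columns line up.
example = "HEADER".ljust(10) + "HYDROLASE".ljust(40) + "01-JAN-00" + "   " + "1ABC"

print(parse_header(example))
```

Column numbers in the table start at 1 and are inclusive, so columns 11-50 become the slice `[10:50]`.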
Classification of Structures: CATH/SCOP
● both emerged in the mid-1990s
● both are quite similar
● aim: organize the protein structures available in PDB, based on single domains
● hierarchical system (roughly):
  - secondary structure content
  - fold
  - superfamilies
  - families
SCOP: a Structural Classification of Proteins
● Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol., 247, 536-540
● Hubbard, T. J. P., Murzin, A., Brenner, S. E. and Chothia, C. (1997), Nucl. Acids Res. 25(1), 236-239 (easier to obtain)
● fully manually curated, driven by expert analysis
● associated with the ASTRAL compendium
● latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)
CATH - Faces
taken from http://www.ebi.ac.uk/about/people/janet-thornton
taken from http://www.tgac.ac.uk/scientific-advisory-board/
CATH
● semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
● four main levels:
  - C: protein class, mainly secondary structure composition of each domain
  - A: architecture, summarizes shapes based on orientation of secondary structure elements
  - T: topology, sequential connectivity is considered
  - H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed
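The four levels are commonly written as a dotted identifier with one number per level. The sketch below assumes that notation (it does not appear on the slide) and uses hypothetical function names; the identifiers serve only as examples of the notation.

```python
LEVELS = ["class", "architecture", "topology", "homologous superfamily"]


def cath_levels(cath_id: str) -> dict:
    """Split a dotted CATH identifier into its C, A, T and H prefixes."""
    numbers = cath_id.split(".")
    if len(numbers) != 4 or not all(n.isdigit() for n in numbers):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    # Each level is identified by the prefix up to and including its number.
    return {
        name: ".".join(numbers[:depth])
        for depth, name in enumerate(LEVELS, start=1)
    }


def deepest_shared_level(a: str, b: str):
    """Return the name of the deepest level two domains share, or None."""
    levels_a, levels_b = cath_levels(a), cath_levels(b)
    shared = None
    for name in LEVELS:
        if levels_a[name] != levels_b[name]:
            break
        shared = name
    return shared


print(cath_levels("3.40.50.720"))
print(deepest_shared_level("3.40.50.720", "3.40.50.300"))
```

Two domains that share the first three numbers have the same topology (fold) but belong to different homologous superfamilies.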