
BioinfRes SS 15

Bioinformatics Resources - PDB -

Lecture & Exercises
Prof. B. Rost, Dr. L. Richter, J. Reeb

Institut für Informatik I12

Orga - Exam Date

● Exam takes place on Friday, July 31st
● Room: MW 0250 (Mechanical Engineering Building)
● Time scheduled: 8.30-10.30 (might be later)
● Duration: approx. 90 min

Advertisement

● Bachelor thesis: Carry your Genes (CyG)
● In collaboration with Certgate GmbH and Iteratec GmbH
● Topics: personalized medicine, mobile apps, encryption
● Hiwi (student assistant) opportunity included
● see https://www.rostlab.org/teaching/theses

PDB – History

● 1968: Brookhaven RAster Display (BRAD)
● 1969: Edgar Meyer came up with a file format for atomic coordinates
● 1971: remote access with the SEARCH program written by Meyer -> PDB functional
● 1998: transfer to RCSB (Research Collaboratory for Structural Bioinformatics)
● 2003: formation of wwPDB (PDBe, RCSB, PDBj; BMRB joined in 2006)

References

● F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer Jr., M.D. Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, M. Tasumi (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535-542.
● H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000) The Protein Data Bank. Nucleic Acids Research 28: 235-242.
● H.M. Berman, K. Henrick, H. Nakamura (2003) Announcing the worldwide Protein Data Bank. Nature Structural Biology 10 (12): 980.
● http://www.rcsb.org/pdb/home/home.do

Current Composition*

Experimental Method   Proteins   Nucleic Acids   Protein/Nucleic Acid complexes   Other     Total
X-ray diffraction       90,662           1,622                            4,510       4    96,798
NMR                      9,597           1,118                              225       8    10,948
Electron microscopy        566              29                              184       0       779
Hybrid                      70               3                                2       1        76
Other                      165               4                                6      13       188
Total                  101,060           2,776                            4,927      26   108,789

*as of May 18th, 2015

Growth of PDB – All Entries

[Figure: number of entries per year, 1972-2014; y-axis from 0 to 120,000; series: Yearly, Total]

Entries According to Method

[Figure: number of entries by experimental method; y-axis from 0 to 120,000; series: Total, X-Ray, NMR, EM]

Growth of X-Ray Structures

[Figure: number of X-ray structures per year, 1972-2014; y-axis from 0 to 100,000; series: Yearly, Total]

Growth of NMR Structures

[Figure: number of NMR structures per year, 1972-2014; y-axis from 0 to 12,000; series: Yearly, Total]

Growth of EM Structures

[Figure: number of EM structures per year, 1972-2014; y-axis from 0 to 800; series: Yearly, Total]

Unique CATH Folds (Topologies)

[Figure: number of unique CATH folds per year, 1972-2014; y-axis from 0 to 1,600; series: Yearly, Total]

Unique CATH Superfamilies

[Figure: number of unique CATH superfamilies per year, 1972-2014; y-axis from 0 to 3,000; series: Yearly, Total]

Atomic Coordinate Entry Format

● aka PDB format
● current version: 3.30
● the format description comprises 190 pages
● ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_A4.pdf

Record Format

● allowed characters: abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 1234567890 `-=[]\;',./~!@#$%^&*()_+{}|:"<>?
● , : ; are delimiters; where they are not meant as delimiters they need to be escaped with \
● a file consists of multiple lines
● each line is 80 characters wide, including EOL
● lines are self-identifying: the first six columns contain the record name, followed by a blank
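The self-identifying layout described above can be sketched in Python: the record name is simply read from the first six columns of each line. The helper names `record_name` and `count_records` and the shortened sample lines are hypothetical illustrations, not part of the format description.

```python
from collections import Counter


def record_name(line: str) -> str:
    """Return the record name stored in columns 1-6 of a PDB line."""
    return line[:6].strip()


def count_records(lines):
    """Count how often each record name occurs, skipping empty lines."""
    return Counter(record_name(line) for line in lines if line.strip())


# Hypothetical, shortened sample lines (real lines are padded to 80 columns).
sample = [
    "HEADER    HYDROLASE",
    "TITLE     HYPOTHETICAL EXAMPLE ENTRY",
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "END",
]
counts = count_records(sample)
print(counts["ATOM"])
```

Slicing by fixed columns rather than splitting on whitespace matters here, because fields in a fixed-width format may touch each other or be empty.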

Single Line Records, One Time/One Line

● CRYST1: Unit cell parameters, space group, and Z.
● END: Last record in the file.
● HEADER: First line of the entry, contains PDB ID code, classification, and date of deposition.
● NUMMDL: Number of models.
● ...

One Time/Multiple Line (incompl.)

● AUTHOR: List of contributors.
● KEYWDS: List of keywords describing the macromolecule.
● SOURCE: Biological source of macromolecules in the entry.
● TITLE: Description of the experiment represented in the entry.
● subsequent lines have a continuation number
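As a minimal sketch of how continuation lines might be reassembled, the following Python snippet joins the lines of a TITLE record. It assumes the layout of the TITLE record in the format description (continuation number in columns 9-10, text in columns 11-80); the helpers `make_line` and `join_continued` and the sample text are hypothetical.

```python
def make_line(record: str, continuation: str, text: str) -> str:
    """Build a line: record name in columns 1-6, continuation in 9-10, text from 11."""
    return f"{record:<6}  {continuation:>2}{text}"


def join_continued(lines, record="TITLE"):
    """Concatenate the text of all lines of one record, ordered by continuation number."""
    parts = []
    for line in lines:
        if line[:6].strip() != record:
            continue
        cont = line[8:10].strip()            # columns 9-10, blank on the first line
        number = int(cont) if cont else 1
        parts.append((number, line[10:80].strip()))  # columns 11-80
    return " ".join(text for _, text in sorted(parts))


# Hypothetical sample record spanning two lines.
lines = [
    make_line("TITLE", "", "STRUCTURE OF A HYPOTHETICAL"),
    make_line("TITLE", "2", " EXAMPLE PROTEIN"),
    make_line("KEYWDS", "", "EXAMPLE"),
]
title = join_continued(lines)
print(title)
```

Other multi-line records place their continuation field in different columns, so the slices above should be checked against the format description before reuse.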

Multiple Times/One Line (incompl.)

● ATOM: Atomic coordinate records for standard groups.
● CONECT: Connectivity records.
● DBREF: Reference to the entry in the sequence database(s).
● HELIX: Identification of helical substructures.
● SHEET: Identification of sheet substructures.

Multiple Times/Multiple Lines (incompl.)

● FORMUL: Chemical formula of non-standard groups.
● HETNAM: Compound name of the heterogens.
● SEQRES: Primary sequence of backbone residues.
● SITE: Identification of groups comprising important entity sites.
● subsequent lines have a continuation number

Record Order

● Records have to appear in a defined order
● There are mandatory and optional records
● Some records are mandatory only under certain conditions
● Mandatory records without content are “NULL”
● examples of mandatory records:
  - HEADER
  - TITLE
  - COMPND
  - ...
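A rough Python sketch of the two checks implied above, presence and order, restricted to the three mandatory records named on the slide. The function names and the sample lines are hypothetical, and the full list of mandatory records in the format description is longer.

```python
# Only the mandatory records named on the slide; the real list is longer.
MANDATORY_EXAMPLES = ("HEADER", "TITLE", "COMPND")


def missing_records(lines, required=MANDATORY_EXAMPLES):
    """Return the required record names that do not occur in the lines."""
    present = {line[:6].strip() for line in lines}
    return [name for name in required if name not in present]


def first_positions_in_order(lines, order=MANDATORY_EXAMPLES):
    """Check that the first occurrences of the given records follow the given order."""
    names = [line[:6].strip() for line in lines]
    positions = [names.index(name) for name in order if name in names]
    return positions == sorted(positions)


# Hypothetical, shortened sample entry without a COMPND record.
entry = ["HEADER    HYDROLASE", "TITLE     HYPOTHETICAL EXAMPLE ENTRY"]
print(missing_records(entry))
print(first_positions_in_order(entry))
```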

Records Belong to Sections

Section               Record Type
Title                 HEADER, OBSLTE, TITLE, SPLIT, CAVEAT, COMPND, SOURCE, KEYWDS, EXPDTA, NUMMDL, MDLTYP, AUTHOR, REVDAT, SPRSDE, JRNL
Remark                REMARKs 0-999
Primary structure     DBREF, SEQADV, SEQRES, MODRES
Secondary structure   HELIX, SHEET
Coordinate            MODEL, ATOM, ANISOU, TER, HETATM, ENDMDL
...                   ...
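The section table can be turned into a simple lookup, sketched here in Python. Only the sections listed above are included, so records from other sections map to `None`; the names `SECTIONS` and `section_of` are hypothetical.

```python
# Sections and record types as listed in the table above (incomplete by design).
SECTIONS = {
    "Title": ["HEADER", "OBSLTE", "TITLE", "SPLIT", "CAVEAT", "COMPND", "SOURCE",
              "KEYWDS", "EXPDTA", "NUMMDL", "MDLTYP", "AUTHOR", "REVDAT",
              "SPRSDE", "JRNL"],
    "Primary structure": ["DBREF", "SEQADV", "SEQRES", "MODRES"],
    "Secondary structure": ["HELIX", "SHEET"],
    "Coordinate": ["MODEL", "ATOM", "ANISOU", "TER", "HETATM", "ENDMDL"],
}

# Invert the table: record name -> section name.
RECORD_TO_SECTION = {
    record: section for section, records in SECTIONS.items() for record in records
}


def section_of(line: str):
    """Return the section a line belongs to, or None if it is not in the table."""
    name = line[:6].strip()
    if name == "REMARK":
        return "Remark"
    return RECORD_TO_SECTION.get(name)


print(section_of("ATOM      1  N   MET A   1"))
```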

Records Even Have Formats

● A record consists of fields with specified data types
● Data could be: A-Z, a-z, an atom name, a nine-character string representing a date, a number, ...
● Complex data: a token (string followed by ‘:’), a comma-separated list of strings, a fixed-format string literal
● ...

Example Header

COLUMNS   DATA TYPE     FIELD            DEFINITION
------------------------------------------------------------------------------------
 1 -  6   Record name   "HEADER"
11 - 50   String(40)    classification   Classifies the molecule(s). *
51 - 59   Date          depDate          Deposition date. This is the date the
                                         coordinates were received at the PDB.
63 - 66   IDcode        idCode           This identifier is unique within the PDB.

* taken from a class list from the current wwPDB Annotation Documentation Appendices (http://www.wwpdb.org/docs.html)
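The column table for HEADER translates directly into string slices. The following Python sketch uses the column ranges given above (1-based, inclusive), converted to 0-based slices; the `Header` type, the `parse_header` function and the example values (including the ID `1ABC`) are hypothetical.

```python
from typing import NamedTuple


class Header(NamedTuple):
    classification: str
    dep_date: str
    id_code: str


def parse_header(line: str) -> Header:
    """Split a HEADER line into its fields using the fixed column ranges."""
    if line[:6].strip() != "HEADER":
        raise ValueError("not a HEADER record")
    return Header(
        classification=line[10:50].strip(),  # columns 11-50
        dep_date=line[50:59].strip(),        # columns 51-59, kept as text
        id_code=line[62:66].strip(),         # columns 63-66
    )


# Hypothetical example line, built so that every field sits in its columns.
example = f"{'HEADER':<10}{'HYDROLASE':<40}{'01-JAN-15':<9}{'':3}{'1ABC'}"
header = parse_header(example)
print(header)
```

The deposition date is left as a string here, because turning a two-digit year into a full date needs a century rule that the table does not state.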

Classification of Structures: CATH/SCOP

● came up in the middle of the 1990s
● both are quite similar
● aim: organize the protein structures available in PDB, based on single domains
● hierarchical system (roughly):
  - secondary structure content
  - fold
  - superfamilies
  - families

SCOP: a Structural Classification of Proteins

● Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol. 247, 536-540
● Hubbard, T. J. P., Murzin, A., Brenner, S. E. and Chothia, C. (1997) Nucl. Acids Res. 25(1), 236-239 (easier to obtain)
● fully manually curated, driven by expert analysis
● associated with the ASTRAL compendium
● latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)

CATH - Faces

[Photo] taken from http://www.ebi.ac.uk/about/people/janet-thornton
[Photo] taken from http://www.tgac.ac.uk/scientific-advisory-board/

CATH

● semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
● four main levels:
  - C: protein class, mainly secondary structure composition of each domain
  - A: architecture, summarizes shapes based on orientation of secondary structure elements
  - T: topology, sequential connectivity is considered
  - H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed
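The four levels can be illustrated with a small Python sketch. It assumes, beyond what the slide states, that a CATH superfamily is written as a dotted numeric code with one number per level (for example 3.40.50.300); the `CathNode` type and `parse_cath` function are hypothetical.

```python
from typing import NamedTuple


class CathNode(NamedTuple):
    c: int  # class
    a: int  # architecture
    t: int  # topology
    h: int  # homologous superfamily

    def level_id(self, level: str) -> str:
        """Return the dotted code truncated at the given level (C, A, T or H)."""
        level = level.upper()
        if level not in ("C", "A", "T", "H"):
            raise ValueError("level must be one of C, A, T, H")
        depth = "CATH".index(level) + 1
        return ".".join(str(number) for number in self[:depth])


def parse_cath(node_id: str) -> CathNode:
    """Parse a dotted four-level code such as '3.40.50.300'."""
    parts = node_id.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    return CathNode(*(int(part) for part in parts))


node = parse_cath("3.40.50.300")
print(node.level_id("T"))
```

Two domains share a topology when their codes agree up to the T level, which is what `level_id("T")` makes easy to compare.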