
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, Frederick, Maryland, August 25-26, 2011

The PDBbind Database: A Comprehensive Collection of the Binding Data and Structures of the Complexes in the Protein Data Bank

Renxiao Wang


Outline

What is the PDBbind database, and why was it developed?

How is the PDBbind database compiled?

What information is provided by the PDBbind database?

Possible applications of the PDBbind database


What is the PDBbind Database?

[Diagram: Protein Data Bank → biomolecular complexes → complexes with binding data → PDBbind web site]

(1) Complexes formed between small-molecule ligands and biomacromolecules, and (2) those between biomacromolecules.

Structural information and binding data

http://www.pdbbind-cn.org/


Why Create the PDBbind Database?

Both structural and energetic information are indispensable for an in-depth understanding of the recognition between small molecules and biological macromolecules.

Such information is especially important for the development and calibration of computational methods for estimating protein-ligand binding affinity.


Why Create the PDBbind Database?

• Three-dimensional structures of biomolecular complexes are available from the Protein Data Bank:

• More than 74,000 structures had been deposited in the PDB by Aug 1st, 2011. Nearly half of them are complexes of all types.

• However, binding affinity data for these complexes, where available, were scattered throughout the literature and thus difficult to access.

• Before PDBbind, no other database had attempted to collect such binding data in a systematic manner.

The PDBbind database aims to provide a comprehensive collection of binding data for all types of biomolecular complexes in the PDB.


Why Create the PDBbind Database?

The old approach: Assemble the data sets reported by other researchers.

For example, the X-Score scoring function was developed using a set of 230 protein-ligand complexes with known binding data. This data set was compiled by assembling several smaller data sets reported previously; it was the largest collection of its type at the time.

Wang R. et al., J. Comput.-Aided Mol. Des. 2002, 16, 11-26.

Disadvantages of this approach

It is difficult to verify those binding data, since original references are often not given: some data are IC50 values; some data are not binding affinity data; there are even typographical errors!

Regular updates are not possible.


History of the PDBbind Database

Apr, 2001: Preliminary trial & launch of the project (University of Michigan)

May, 2004: PDBbind v.2004 was publicly released at http://www.pdbbind.org/ (University of Michigan)

v.2005 and v.2006 were released.

Nov, 2007: The PDBbind-CN server was launched at http://www.pdbbind-cn.org/ (Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences)

v.2008, v.2009, and v.2010 were released.

Aug, 2011: The current version (v.2011), providing binding data for ~8,000 complexes in PDB

(1) Wang, R. et al. J. Med. Chem. 2004, 47, 2977-2980.
(2) Wang, R. et al. J. Med. Chem. 2005, 48, 4111-4119.
(3) Cheng, T. et al. J. Chem. Inf. Model. 2009, 49, 1079-1093.


How is the PDBbind Database Compiled?

[Flow diagram]

The entire PDB

I. Classification of complexes → Biomolecular complexes

II. Collection of binding data from original references → Complexes with binding data

III. Data processing & web design → Integrate into the PDBbind web site


Step I. Classification of Biomolecular Complexes

The entire classification process is automated by a set of computer programs.

[Flowchart: a given PDB entry is routed through a series of yes/no questions (Contain a protein? Contain a nucleic acid? Contain a small molecule? Contain two proteins?) into one of seven classes: protein-protein complex, protein-nucleic acid complex, protein-ligand complex, nucleic acid-ligand complex, apo-protein, apo-nucleic acid, or misc. oligomer.]
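As a rough illustration of how such a yes/no classification might be coded, here is a minimal Python sketch. The `EntryContents` record and its fields are hypothetical, and the order in which the questions are asked is an assumption, since the flowchart layout is not preserved in this transcript. The sketch also returns a single label, although a complex may in practice belong to more than one category.

```python
from dataclasses import dataclass


@dataclass
class EntryContents:
    """Hypothetical summary of what one PDB entry contains."""
    n_proteins: int = 0
    has_nucleic_acid: bool = False
    has_small_molecule: bool = False


def classify_entry(entry: EntryContents) -> str:
    """Route an entry through yes/no questions to one of seven classes.

    The branch order is an assumption made for illustration only.
    """
    if entry.n_proteins >= 1:
        if entry.has_nucleic_acid:
            return "protein-nucleic acid complex"
        if entry.has_small_molecule:
            return "protein-ligand complex"
        if entry.n_proteins >= 2:
            return "protein-protein complex"
        return "apo-protein"
    if entry.has_nucleic_acid:
        if entry.has_small_molecule:
            return "nucleic acid-ligand complex"
        return "apo-nucleic acid"
    return "misc. oligomer"


print(classify_entry(EntryContents(n_proteins=1, has_small_molecule=True)))
```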


Classification of the Entire Protein Data Bank

[Pie chart. Categories: apo-proteins; protein-ligand complexes; special protein-ligand complexes (cofactor-containing); protein-nucleic acid complexes; protein-protein complexes; nucleic acid-ligand complexes; apo-nucleic acids. Slice values, listed in extraction order and not matched to categories: 1570 (2.25%), 620 (0.89%), 4580 (6.58%), 2912 (4.18%), 7067 (10.15%), 22147 (31.8%), 30744 (44.15%).]

* Based on the PDB contents released by Jan 1st 2011, 70,224 entries in total


Step II. Collection of Binding Data from Literature

Binding affinity data for a given complex may be reported or cited in the “primary citation” of the PDB file (success rate 30%).


Collection of Binding Data from Literature

Accepted binding affinity data include dissociation constants (Kd), inhibition constants (Ki), and concentrations at 50% inhibition (IC50).

A computer program was developed to process PDF files and filter out papers containing no binding data.

Each remaining paper is then examined independently by two people. Consensus must be reached before the binding data are recorded.


Collection of Binding Data from Literature

• Over 17,800 references have been processed so far.

• Each primary reference is saved as a PDF file, in which the binding data are clearly marked.

• Mistakes are still possible during manual data curation. Nevertheless, >98% of the binding data in PDBbind are correct.

[Figure: the primary reference for PDB entry 1BXO]


Outcomes of Binding Data Collection

PDBbind v.2011 includes binding data for 7,991 complexes in PDB.

[Diagram: counts of complexes drawn among three groups (Proteins, Small Molecules, Nucleic Acids): 6,070; 1,427; 428; 66. These sum to 7,991.]


Updates of the PDBbind Database

Version   Entries in PDB   Valid complexes   Complexes with binding data   Refined set   Core set
                           (PDBbind)         (PDBbind)                     (PDBbind)     (PDBbind)
2004      28,991           6,847             2,276                         1,091         231
2005      34,338           9,775             2,756                         1,296         288
2006      34,338           9,775             2,632                         1,122         234
2007      40,876           11,822            3,124                         1,300         195
2008      48,092           18,211            4,300                         1,401         210
2009      55,069           23,284            5,678                         1,741         219
2010      62,387           26,434            6,772                         2,061         231
2011      70,224           30,259            7,991                         2,476         243

It is critical to update PDBbind regularly to keep up with the constant growth of PDB. PDBbind is now updated annually, and it grows by 20-30% each year.
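The year-over-year change can be recomputed directly from the "complexes with binding data" column of the table above. The short Python sketch below does only that arithmetic; the dictionary simply restates the table values, and the function name is illustrative.

```python
# Complexes with binding data per PDBbind release, copied from the table above.
complexes_with_binding_data = {
    2004: 2276, 2005: 2756, 2006: 2632, 2007: 3124,
    2008: 4300, 2009: 5678, 2010: 6772, 2011: 7991,
}


def yearly_growth(counts):
    """Return {release: percent change relative to the previous release}."""
    years = sorted(counts)
    return {
        year: 100.0 * (counts[year] - counts[prev]) / counts[prev]
        for prev, year in zip(years, years[1:])
    }


growth = yearly_growth(complexes_with_binding_data)
for year, pct in growth.items():
    print(f"v.{year}: {pct:+.1f}%")
```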


Step III. Build the PDBbind-CN Web Site

[Diagram of PDBbind-CN, http://www.pdbbind-cn.org/]

Contents: structures and binding data for biomolecular complexes in PDB

User functions: browse information; search information; download information; deposit binding data

External resources shown: RCSB PDB, PDBsum, PubMed, PubChem


On-line Information @ PDBbind-CN

The basic information of each complex is summarized on a single page.


Multiple display modes are provided by ChemAxon and JMol Java applets on the web interface of PDBbind-CN.


On-line Search @ PDBbind-CN

Various types of queries can be used to search the binding data. Results are returned in well-organized forms, which can be exported in either PDF or Excel format.


On-line Search @ PDBbind-CN

Substructure/similarity search among the small-molecule ligands in all protein-ligand complexes in PDB (>12,000 entities), not limited to those with known binding data.


On-line Search @ PDBbind-CN

Similarity search among all protein and nucleic acid sequences in PDB, not limited to those with known binding data.


What can be downloaded from PDBbind-CN?

Tables of binding data for all categories of complexes.

“Clean” structural files of most of the protein-ligand complexes with known binding data (6,023 in v.2011), which can be readily utilized by most molecular modeling software.

• A complete “biological unit” of each complex is split into a protein molecule and a ligand molecule.

• The protein molecule is saved in the PDB format and the ligand molecule is saved in the SYBYL Mol2 format after necessary processing.

The “refined set” and the “core set” of selected protein-ligand complexes, providing a high-quality benchmark for docking/scoring studies.


Selection of the Refined Set

The refined set consists of protein-ligand complexes meeting higher standards:

• Structure quality: crystal structures with resolution < 2.5 Å and R-factor < 0.250; both the protein and the ligand structures must be complete.

• Binding data quality: binding data must be given as Kd or Ki, with 2.0 < -log Kd < 12.0 (i.e., Kd between 10 mM and 1 pM); the value cannot be an estimate; the protein and the ligand used in the binding assay must exactly match those observed in the crystal structure.

• Nature of the complex: binding must be non-covalent; the complex must be binary; ligand MW < 1000; the ligand must not contain B, Be, Si, or metal atoms.

In v.2011, a total of 2,476 protein-ligand complexes were selected for the refined set, accounting for 41% of all protein-ligand complexes with known binding data.
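The refined-set criteria listed on this slide can be read as a filter over per-complex records. The Python sketch below is one possible encoding under stated assumptions: the `Complex` record and all of its field names are hypothetical, not PDBbind's actual schema, and judgments such as "structure complete" or "assay matches structure" are assumed to have been made beforehand and stored as flags.

```python
import math
from dataclasses import dataclass

EXCLUDED_ELEMENTS = {"B", "Be", "Si"}  # metals are flagged separately below


@dataclass
class Complex:
    """Hypothetical record; field names are illustrative only."""
    resolution: float              # in Å
    r_factor: float
    structure_complete: bool       # protein and ligand both complete
    affinity_type: str             # "Kd", "Ki", or "IC50"
    affinity_molar: float          # in mol/L
    is_exact_value: bool           # False for estimated values
    assay_matches_structure: bool
    is_covalent: bool
    is_binary: bool
    ligand_mw: float
    ligand_elements: frozenset
    ligand_has_metal: bool


def neg_log_affinity(molar: float) -> float:
    """-log10 of an affinity constant given in mol/L (10 mM -> 2, 1 pM -> 12)."""
    return -math.log10(molar)


def in_refined_set(c: Complex) -> bool:
    structure_ok = (c.resolution < 2.5 and c.r_factor < 0.250
                    and c.structure_complete)
    data_ok = (c.affinity_type in {"Kd", "Ki"}
               and 2.0 < neg_log_affinity(c.affinity_molar) < 12.0
               and c.is_exact_value
               and c.assay_matches_structure)
    nature_ok = (not c.is_covalent
                 and c.is_binary
                 and c.ligand_mw < 1000
                 and not (c.ligand_elements & EXCLUDED_ELEMENTS)
                 and not c.ligand_has_metal)
    return structure_ok and data_ok and nature_ok
```

Keeping the three groups of criteria as separate boolean expressions mirrors the three bullets, which makes it easy to report which group a rejected complex failed.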


Selection of the Core Set

In v.2011, the core set consists of 243 protein-ligand complexes in 81 families. The core set will be kept under 300 complexes (100 families) in the future.

[Diagram: The refined set (2,476) → Clustering → Selection → The core set (243)]


Selection of the Core Set

The core set is selected to provide a representative, non-redundant sampling of the refined set, so that it serves better as a benchmark for validating docking/scoring approaches.

Methods

1. Clustering: Group the protein-ligand complexes in the refined set into families by protein sequence similarity (cutoff = 90%).

2. Selection of clusters: Only families with at least 5 members are considered. The highest binding affinity in each valid family must be at least 100-fold higher than the lowest binding affinity.

3. Selection of representatives: For each remaining family, select the complex with the highest binding affinity (the “topper”), the one with the lowest binding affinity (the “lower”), and the one closest to the mean value (the “middler”) as the representatives of this family.
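Steps 2 and 3 of the core-set procedure can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' program: family labels are taken as given (the 90% sequence-similarity clustering of step 1 is not implemented), affinities are supplied as -log Kd or -log Ki values, the "mean value" is computed on that logarithmic scale, and the "middler" is chosen from members other than the topper and the lower so that the three representatives are distinct. All names are hypothetical.

```python
import math
from collections import defaultdict


def select_core_set(complexes, min_members=5, min_fold=100.0):
    """Pick (topper, middler, lower) PDB IDs for each qualifying family.

    complexes: iterable of (pdb_id, family_id, neg_log_affinity) tuples.
    Returns {family_id: (topper_id, middler_id, lower_id)}.
    """
    families = defaultdict(list)
    for pdb_id, family_id, pk in complexes:
        families[family_id].append((pk, pdb_id))

    core = {}
    for family_id, members in families.items():
        if len(members) < min_members:
            continue
        members.sort()
        lower, topper = members[0], members[-1]
        # A 100-fold difference in affinity is 2 units on the -log scale.
        if topper[0] - lower[0] < math.log10(min_fold):
            continue
        mean_pk = sum(pk for pk, _ in members) / len(members)
        middler = min(members[1:-1], key=lambda m: abs(m[0] - mean_pk))
        core[family_id] = (topper[1], middler[1], lower[1])
    return core


example = [
    ("a1", "famA", 4.9), ("a2", "famA", 5.9), ("a3", "famA", 6.7),
    ("a4", "famA", 7.2), ("a5", "famA", 8.7),
    ("b1", "famB", 6.0), ("b2", "famB", 9.0),
]
print(select_core_set(example))
```

Selecting three complexes per family is consistent with the v.2011 figures quoted earlier in the talk (81 families, 243 complexes).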


Possible Applications of the PDBbind Database

According to our literature survey, 30-40 applications of the PDBbind database are published each year.

Provide high-quality data sets for theoretical and computational studies on molecular recognition

– Binding data available for protein-ligand, protein-protein, and protein-nucleic acid complexes

– Specially compiled “refined set” and “core set”

Provide useful clues to medicinal chemists and other researchers for the discovery of bioactive small-molecule compounds or potential targets


A known target

• What ligands bind to it?

• What do high-affinity ligands look like? What do low-affinity ligands look like?

• Critical chemical moieties (pharmacophore)

• Might these chemical moieties interact with other proteins (new targets or side effects)?


To Build an Integrated Platform for Data Mining

[Diagram labels: Protein Data Bank; 3D structures; binding affinity data; protein-ligand & nucleic acid-ligand complexes; complex families; pharmacophore models; chemical databases; useful hits; pharmaceutical implications; data compilation; data mining tools; docking; binding site analysis; scoring]


Answer to FAQ

Why does PDBbind not provide experimental details in addition to the binding data?

• Such information is not always given in the reference.

• Retrieving such information takes considerable extra effort, and the information is difficult to format.

• Users can always check the original reference if they really need such information.


Answer to FAQ

What is the difference between PDBbind and Binding MOAD?

Binding MOAD also collects binding data for protein-ligand complexes and is likewise based on systematic mining of the Protein Data Bank.

Thus, the contents of Binding MOAD overlap with part of PDBbind, and the two databases are similar in some technical aspects.

Binding MOAD (Mother Of All Databases) was developed independently by Prof. Heather Carlson’s group at the University of Michigan and was released to the public in 2005.

Proteins: Structure, Function, and Bioinformatics, Volume 60, Issue 3, pages 333–340, 15 August 2005.


Summary: Significance of the PDBbind Database

– More binding data: The latest version provides binding data for ~8,000 complexes
  • Systematic mining of the entire PDB
  • Covering all major categories of biomolecular complexes, not only selected protein-ligand complexes

– Better in quality
  • Reasonable classification of biomolecular complexes
  • Binding affinity data carefully collected from original references

– Regularly updated since the first public release in 2004. Binding data increase by 20-30% each year.

– Widely popular: User-friendly web interface; over 1,600 registered users from some 40 countries across the world.


Acknowledgments

Thanks to the following contributors in my group: Liu, Z.; Li, J.; Li, Y.; Li, X.; Lin, F.

Special thanks to Prof. Shaomeng Wang and his group at the University of Michigan!

The Natural Science Foundation of China (NSFC), the Ministry of Sciences and Technologies of China (MOST), and the Chinese Academy of Sciences (CAS).


Why Create the PDBbind Database?

Recognition and interactions between various types of molecules are essential at the molecular level for various biological processes.

Protein-small molecule binding

Protein-nucleic acid binding

Protein-protein binding


Difficulty in Complex Classification

As a matter of fact, most PDB entries contain multiple heterogen molecules in addition to the primary molecule (protein or nucleic acid).

Is this a meaningful protein-ligand complex?


Difficulty in Complex Classification

What are classified as “valid” small-molecule ligand molecules:

• “Regular” organic molecules
• Oligo-peptides containing < 10 amino acid residues
• Oligo-nucleic acids containing < 4 nucleotides

What are classified as “special” ligand molecules:

• Cofactors/coenzymes: CoA, NAD, FAD, Heme & their derivatives

What are classified as “junk” molecules:

• Inorganic species
• Organic solvents and buffer components
• Saccharide molecules with high occurrences
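One way to picture the three ligand classes on this slide is as a small rule-based lookup. The Python sketch below is only an illustration: the residue-code sets are tiny, hand-picked examples rather than PDBbind's actual lists, the `Heterogen` record is hypothetical, and saccharides and cofactor derivatives are not handled.

```python
from dataclasses import dataclass

# Illustrative, non-exhaustive sets of PDB residue codes (assumptions).
COFACTOR_CODES = {"COA", "NAD", "FAD", "HEM"}
JUNK_CODES = {"SO4", "PO4", "GOL", "EDO", "DMS"}  # ions, solvents, buffer components


@dataclass
class Heterogen:
    """Hypothetical description of one non-polymer or short-polymer molecule."""
    code: str
    kind: str = "organic"  # "organic", "peptide", or "nucleic"
    length: int = 1        # residues or nucleotides, for oligomers


def classify_heterogen(h: Heterogen) -> str:
    if h.code in COFACTOR_CODES:
        return "special"
    if h.code in JUNK_CODES:
        return "junk"
    if h.kind == "peptide":
        # Oligo-peptides count as small-molecule ligands below 10 residues.
        return "valid" if h.length < 10 else "not a small-molecule ligand"
    if h.kind == "nucleic":
        # Oligo-nucleic acids count as ligands below 4 nucleotides.
        return "valid" if h.length < 4 else "not a small-molecule ligand"
    return "valid"


print(classify_heterogen(Heterogen("NAD")))
print(classify_heterogen(Heterogen("PEP", kind="peptide", length=6)))
```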


Difficulty in Complex Classification

Is this a protein-protein complex or a protein-ligand complex?

A complex may be classified into more than one category.

[Figure labels: Protein A; Protein B; small-molecule ligand]


The PDBbind database now has >1,600 registered users all over the world.

www.pdbbind.org (Univ. Michigan)

www.pdbbind-cn.org (Shanghai Inst. Org. Chem.)


Process the Complex Structures

1. Split a complete “biological unit” of each complex into a protein molecule and a ligand molecule.

2. Save the protein molecule in the PDB format.
   – Remove redundant structural units
   – Add hydrogen atoms
   – Keep the water and metal ions with the protein

3. Save the ligand molecule in the Mol2 format.
   – Correct atom types and bond types
   – Add hydrogen atoms and partial charges
   – Handle tautomers correctly

These processed structural files can be readily utilized by most molecular modeling software.
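The splitting step alone can be sketched with plain fixed-column parsing of PDB records. The Python example below is a simplified illustration, not the PDBbind pipeline: it assumes the ligand's residue name is already known, uses a small illustrative set of metal residue names, silently drops all other heterogens (an assumption), and does none of the hydrogen addition, atom typing, charge assignment, or Mol2 writing described on this slide.

```python
WATER = {"HOH", "WAT"}
METALS = {"ZN", "MG", "CA", "MN", "FE", "NA", "K"}  # illustrative subset


def split_complex(pdb_text, ligand_resname):
    """Split PDB-format text into (protein_lines, ligand_lines).

    ATOM records go to the protein. HETATM records go to the ligand if
    their residue name matches, stay with the protein if they are water
    or a listed metal ion, and are otherwise dropped.
    """
    protein, ligand = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()          # columns 1-6: record name
        if record == "ATOM":
            protein.append(line)
        elif record == "HETATM":
            resname = line[17:20].strip()  # columns 18-20: residue name
            if resname == ligand_resname:
                ligand.append(line)
            elif resname in WATER or resname in METALS:
                protein.append(line)
    return protein, ligand
```

A caller would pass the text of one biological unit together with the residue name of the ligand of interest, then write the two lists to separate files.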


My Scoring Function Tripod

• The PDBbind-CN Database (a database of three-dimensional structures and binding affinities of protein-ligand complexes)

• Scoring Function Development

• Scoring Function Assessment

(1) J. Comput.-Aided Mol. Des. 2002, 16, 11-26. (2) J. Med. Chem. 2003, 46, 2287-2303. (3) J. Med. Chem. 2004, 47, 2977-2980. (4) J. Chem. Inf. Comput. Sci. 2004, 44, 2114-2125. (5) J. Med. Chem. 2005, 48, 4111-4119. (6) Proteins, 2006, 64, 1058-1068. (7) J. Chem. Theory Comput. 2008, 4, 1959–1973. (8) J. Chem. Inf. Model. 2009, 49, 1079-1093. (9) J. Chem. Inf. Model. 2009, 49, 1033-1048. (10) Mol. Informatics, 2010, 29,87-96. (11) J. Comp. Chem. 2010, 31, 2109-2125. (12) BMC Bioinformatics, 2010, 11, 193-208. (13) J. Chem. Theory Comput. 2010, 6, 1852-1870. (14) J. Chem. Inf. Model., 2010, 50 , 682–1692.


Some CDK-2 inhibitors recorded in PDBbind

1E1V: Ki = 12000 nM

1E1X: Ki = 1300 nM

1PXP: Ki = 220 nM

1PXO: Ki = 2.0 nM

1PXN: Ki = 70 nM
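Since the refined-set criteria earlier in the talk are stated on the -log scale, it may help to see these Ki values converted. The Python snippet below is a small worked conversion; the values are those listed on this slide, and the function name is illustrative.

```python
import math

# Ki values in nM for the CDK-2 complexes listed on this slide.
cdk2_ki_nm = {"1E1V": 12000, "1E1X": 1300, "1PXP": 220, "1PXO": 2.0, "1PXN": 70}


def pki_from_nm(ki_nm):
    """Convert a Ki in nanomolar to -log10(Ki in mol/L)."""
    return -math.log10(ki_nm * 1e-9)


# Print from weakest to strongest binder.
for pdb_id, ki in sorted(cdk2_ki_nm.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pdb_id}: Ki = {ki:g} nM, -log Ki = {pki_from_nm(ki):.2f}")
```

The weakest and strongest binders in this list differ by a factor of 6,000 in Ki (12000 nM versus 2.0 nM).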