A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGY

1
A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGY Majid Masso ([email protected]) Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia 20110, USA I. Abstract Accurate prediction of enzyme-inhibitor binding energy has the capacity to speed drug design and chemical genomics efforts by helping to narrow the focus of experiments. Here a non-redundant set of three hundred high-resolution crystallographic enzyme- inhibitor structures was compiled for analysis, complexes with known binding energies G) based on the availability of experimentally determined inhibition constants (k i ). Additionally, a separate set of over 1400 diverse high- resolution macromolecular crystal structures was collected for the purpose of creating an all-atom knowledge-based statistical potential, via application of the Delaunay tessellation computational geometry technique. Next, two hundred of the enzyme-inhibitor complexes were randomly selected to develop a model for predicting binding energy, first by tessellating structures of the complexes as well as the enzymes without their bound inhibitors, then by using the statistical potential to calculate a topological score for each structure tessellation. We derived as a predictor of binding energy an empirical linear function of the difference between topological scores for a complex and its isolated enzyme. A correlation coefficient (r) of 0.79 was obtained for the experimental and calculated ΔG values, with a standard error of 2.34 kcal/mol. Lastly, the model was evaluated with the held-out set of one hundred complexes, for which structure tessellations were performed in order to calculate topological score differences, and binding energy predictions were generated from the derived linear function. Calculated binding energies for the test data also compared well with their experimental counterparts, displaying a correlation coefficient of r = 0.77 with a standard error of 2.50 kcal/mol. II. Protein Data Bank ( http://www.rcsb.org/pdb ) PDB – repository of solved (x-ray, nmr, ...) structures Each structure file contains atomic 3D coordinate data Atom X Y Z : : : : III. Macromolecular Modeling Native structure is conformation having lowest energy Physics-based energy calculations using quantum mechanics are computationally impractical Same for molecular mechanics-based potential energy functions (i.e., force fields): E(total) = E(bond) + E(angle) + E(dihedral) + E(electrostatic) + E(van der Waals) Alternative (our approach): knowledge- based potentials of mean force (i.e., generated from known protein structures) IV. Knowledge-Based Potentials of Mean Force • Assumptions: At equilibrium, native state has global free energy min Microscopic states (i.e., features) follow Boltzmann dist • Examples: Well-documented in the literature : distance-dependent pairwise interactions at the atomic or amino acid level This study : inclusion of higher-order contributions by developing all-atom four-body statistical potentials Motivation (our prior work): Four-body protein potential at the amino acid level V. Motivational Example: Pairwise Amino Acid Potential A 20-letter protein alphabet yields 210 residue pairs Obtain large, diverse PDB dataset of single protein chains For each residue pair (i, j), calculate the relative frequency f ij with which they appear within a given distance (e.g., 12 angstroms) of each other in all the protein structures Calculate a rate p ij expected by chance alone from a background or reference distribution (more later…) Apply inverted Bolzmann principle: s ij = log(f ij / p ij ) quantifies interaction propensity and is proportional to the energy of interaction (by a factor of ‘–RT’) VI. All-Atom Four-Body Statistical Potential Obtain diverse PDB dataset of 1417 single chain and multimeric proteins, many complexed to ligands (see XV. References) Six-letter atomic alphabet: C, N, O, S, M (metals), X (other) Apply Delaunay tessellation to the atomic point coordinates of each PDB file – objectively identifies all nearest- neighbor quadruplets of atoms in the structure (8 angstrom cutoff) VII. All-Atom Four-Body Statistical Potential A six-letter atomic alphabet yields 126 distinct quadruplets For each quad (i, j, k, l), calculate observed rate of occurrence f ijkl among all tetrahedra from the 1417 structure tessellations Compute rate p ijkl expected by chance from a multinomial reference distribution: a n = proportion of atoms from all structures that are of type n t n = number of occurrences of atom type n in the quad Apply inverted Bolzmann principle: s ijkl = log(f ijkl / p ijkl ) quantifies the interaction propensity and is proportional to the energy of atomic quadruplet interaction . 4 and 1 where , ! ! 4 6 1 6 1 6 1 6 1 n n n n n t n n n ijkl t a a t p n VIII. Summary Data for the 1417 Structure Files and their Delaunay Tessellations Atom Types C ount Proportion C 3,612,988 0.633193 N 969,253 0.169866 O 1,088,410 0.190749 S 28,502 0.004995 (allmetals)M 2,529 0.000443 (allothernon-metals)X 4,299 0.000754 Totalatom count: 5,705,981 Totaltetrahedron count: 34,504,737 IX. All-Atom Four-Body Statistical Potential Q uad C ount s ijkl Q uad C ount s ijkl Q uad C ount s ijkl Q uad C ount s ijkl CCCC 4015872 -0.14024 CMOX 68 0.308765 MMNX 0 -- NNOS 5637 -0.30523 CCCM 1592 -0.98922 CMSS 2047 2.848813 MMOO 306 2.31553 NNOX 302 -0.75477 CCCN 4025206 -0.16987 CMSX 13 1.172117 MMOS 104 3.127729 NNSS 311 0.319427 CCCO 6202159 -0.03247 CMXX 6 1.958862 MMOX 3 2.409325 NNSX 6 -0.87471 CCCS 293157 0.224008 CNNN 102035 -0.62305 MMSS 254 5.398477 NNXX 5 0.168652 CCCX 2796 -0.97505 CNNO 1995038 0.140679 M M SX 2 3.815151 NOOO 171147 0.021937 CCMM 132 0.908235 CNNS 15892 -0.37618 MMXX 0 -- NOOS 10697 -0.07737 CCMN 3318 -0.57598 CNNX 578 -0.99392 MNNN 1030 0.53596 NOOX 3102 0.206513 CCMO 5325 -0.42089 CNOO 2734639 0.227273 MNNO 1128 0.047955 NOSS 922 0.440012 CCMS 2293 0.795108 CNOS 95438 0.050981 MNNS 561 1.326526 NOSX 12 -0.92506 CCMX 15 -0.5677 CNOX 2168 -0.77117 MNNX 5 0.098041 NOXX 61 0.903627 CCNN 1797552 -0.12464 CNSS 4264 0.584024 MNOO 3744 0.518626 NSSS 33 1.052833 CCNO 8233136 0.184864 CNSX 37 -0.95711 MNOS 314 0.723107 NSSX 0 -- CCNS 124653 -0.05308 CNXX 61 0.382553 MNOX 29 0.510083 NSXX 0 -- CCNX 2007 -1.02473 COOO 524994 -0.06271 MNSS 793 3.008398 NXXX 3 2.475964 CCOO 3366568 0.047161 COOS 34429 -0.14114 MNSX 5 1.328573 OOOO 34212 -0.12555 CCOS 198630 0.098905 COOX 23801 0.520038 MNXX 9 2.706383 OOOS 4240 -0.0525 CCOX 4626 -0.71243 COSS 4380 0.545326 MOOO 5430 1.106856 OOOX 9553 1.121777 CCSS 15288 0.868158 COSX 58 -0.81224 MOOS 156 0.669977 OOSS 300 0.203077 CCSX 144 -0.63735 COXX 65 0.359781 MOOX 168 1.523669 OOSX 36 -0.19726 CCXX 143 0.482159 CSSS 285 1.417735 MOSS 210 2.380989 OOXX 128 1.476181 CMMM 23 3.480397 CSSX 5 0.006247 MOSX 4 1.181307 OSSS 38 1.063748 CMMN 144 1.216422 CSXX 4 0.730845 MOXX 55 3.442148 OSSX 3 0.305472 CMMO 256 1.415945 CXXX 9 2.381656 M SSS 62 3.910199 OSXX 0 -- CMMS 662 3.41048 MMMM 83 7.794725 M SSX 2 2.763224 OXXX 2 2.249518 CMMX 1 1.41113 MMMN 37 4.258301 M SXX 0 -- SSSS 6 2.446092 CMNN 2474 -0.13203 MMMO 29 4.102142 M XXX 16 5.786451 SSSX 0 -- CMNO 6267 -0.07975 MMMS 379 6.8003 NNNN 3878 -0.8697 SSXX 0 -- CMNS 2588 1.118068 MMMX 0 -- NNNO 46665 -0.44173 SXXX 0 -- CMNX 26 -0.05842 MMNN 83 1.849597 NNNS 460 -0.86605 XXXX 0 -- CMOO 8481 0.302308 MMNO 102 1.587734 NNNX 34 -1.17582 CMOS 1010 0.659069 MMNS 363 3.720958 NNOO 340620 0.195102 X. Topological Score (TS) Delaunay tessellation of any macromolecular structure yields an aggregate of tetrahedral simplices Each simplex can be scored using the all-atom four-body potential based on the quad present at the four vertices Topological score (or ‘total potential’) of the structure: the sum of all constituent simplices in the tessellation s ijkl TS = Σs ijkl XI. Topological Score Difference (ΔTS) XII. Application of ΔTS: Predicting Enzyme–Inhibitor Binding Energy MOAD – repository of exp. inhibition constants (k i ) for protein–ligand complexes whose structures are in PDB Collected k i values for 300 complexes reflecting diverse protein structures Obtained exp. binding energy from k i via ΔG exp = – RTln(k i ) Calculated ΔTS for complexes XIII. Predicting Enzyme– Inhibitor Binding Energy Randomly selected 200 complexes to train a model Correlation coefficient r = 0.79 between ΔTS and ΔG exp Empirical linear transform of ΔTS to reflect energy values: ΔG calc = (1 / 0.0003) × ΔTS – 6.24 Linear => same r = 0.79 value between ΔG calc and ΔG exp Also, standard error of SE = 2.34 kcal/mol and fitted regression line of y = 0.98x – 0.41 (y = ΔG calc and x = ΔG exp ) XIV. Predicting Enzyme– Inhibitor Binding Energy For the test set of 100 remaining complexes: r = 0.77 between ΔG calc and ΔG exp SE = 2.50 kcal/mol Fitted regression line is y = 1.07x + 0.46 All training/test data available online as a text file (see XV. References) XV. References and Acknowledgments PDB dataset: http://proteins. gmu . edu / automute /tessellatable1417.txt Train/test dataset: http://proteins. gmu . edu / automute /MOAD300ki.txt PDB (structure DB): http://www.rcsb.org/pdb MOAD (ligand binding DB): http://bindingmoad.org/ Qhull (Delaunay tessellation): http://www.qhull.org/ UCSF Chimera (ribbon/ball-stick structure visualization): http://www.cgl.ucsf.edu/chimera/ Matlab (tessellation visualization): http://www.mathworks.com/products/matla b/

description

A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGY Majid Masso ([email protected]) Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia 20110, USA. - PowerPoint PPT Presentation

Transcript of A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGY

Page 1: A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGY

A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGYMajid Masso ([email protected])

Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia 20110, USA

I. AbstractAccurate prediction of enzyme-inhibitor binding energy has the capacity to speed drug design and chemical genomics efforts by helping to narrow the focus of experiments. Here a non-redundant set of three hundred high-resolution crystallographic enzyme-inhibitor structures was compiled for analysis, complexes with known binding energies (ΔG) based on the availability of experimentally determined inhibition constants (ki). Additionally, a separate set of over 1400 diverse high-resolution macromolecular crystal structures was collected for the purpose of creating an all-atom knowledge-based statistical potential, via application of the Delaunay tessellation computational geometry technique. Next, two hundred of the enzyme-inhibitor complexes were randomly selected to develop a model for predicting binding energy, first by tessellating structures of the complexes as well as the enzymes without their bound inhibitors, then by using the statistical potential to calculate a topological score for each structure tessellation. We derived as a predictor of binding energy an empirical linear function of the difference between topological scores for a complex and its isolated enzyme. A correlation coefficient (r) of 0.79 was obtained for the experimental and calculated ΔG values, with a standard error of 2.34 kcal/mol. Lastly, the model was evaluated with the held-out set of one hundred complexes, for which structure tessellations were performed in order to calculate topological score differences, and binding energy predictions were generated from the derived linear function. Calculated binding energies for the test data also compared well with their experimental counterparts, displaying a correlation coefficient of r = 0.77 with a standard error of 2.50 kcal/mol.

II. Protein Data Bank (http://www.rcsb.org/pdb)

• PDB – repository of solved (x-ray, nmr, ...) structures

• Each structure file contains atomic 3D coordinate data

Atom X Y Z

::

::

III. Macromolecular Modeling

• Native structure is conformation having lowest energy

• Physics-based energy calculations using quantum mechanics are computationally impractical

• Same for molecular mechanics-based potential energy functions (i.e., force fields): E(total) = E(bond) + E(angle) + E(dihedral) + E(electrostatic) + E(van der Waals)

• Alternative (our approach): knowledge-based potentials of mean force (i.e., generated from known protein structures)

IV. Knowledge-Based Potentials of Mean Force

• Assumptions:– At equilibrium, native state has global free energy min– Microscopic states (i.e., features) follow Boltzmann dist

• Examples:– Well-documented in the literature: distance-dependent

pairwise interactions at the atomic or amino acid level– This study: inclusion of higher-order contributions by

developing all-atom four-body statistical potentials• Motivation (our prior work):

– Four-body protein potential at the amino acid level

V. Motivational Example:Pairwise Amino Acid Potential

• A 20-letter protein alphabet yields 210 residue pairs

• Obtain large, diverse PDB dataset of single protein chains

• For each residue pair (i, j), calculate the relative frequency fij with which they appear within a given distance (e.g., 12 angstroms) of each other in all the protein structures

• Calculate a rate pij expected by chance alone from a background or reference distribution (more later…)

• Apply inverted Bolzmann principle: sij = log(fij / pij) quantifies interaction propensity and is proportional to the energy of interaction (by a factor of ‘–RT’)

VI. All-Atom Four-Body Statistical Potential

• Obtain diverse PDB dataset of 1417 single chain and multimeric proteins, many complexed to ligands (see XV. References)

• Six-letter atomic alphabet: C, N, O, S, M (metals), X (other)• Apply Delaunay tessellation to the atomic point coordinates of

each PDB file – objectively identifies all nearest-neighbor quadruplets of atoms in the structure (8 angstrom cutoff)

VII. All-Atom Four-Body Statistical Potential

• A six-letter atomic alphabet yields 126 distinct quadruplets• For each quad (i, j, k, l), calculate observed rate of

occurrence fijkl among all tetrahedra from the 1417 structure tessellations

• Compute rate pijkl expected by chance from a multinomial reference distribution:

• an = proportion of atoms from all structures that are of type n• tn = number of occurrences of atom type n in the quad • Apply inverted Bolzmann principle: sijkl = log(fijkl / pijkl)

quantifies the interaction propensity and is proportional to the energy of atomic quadruplet interaction

.4 and 1 where,

!

!4 6

1

6

1

6

16

1

nn

nn

n

tn

nn

ijkl taat

p n

VIII. Summary Data for the 1417 Structure Files and their Delaunay Tessellations

Atom Types Count Proportion

C 3,612,988 0.633193 N 969,253 0.169866 O 1,088,410 0.190749 S 28,502 0.004995

(all metals) M 2,529 0.000443 (all other non-metals) X 4,299 0.000754

Total atom count: 5,705,981

Total tetrahedron count: 34,504,737

IX. All-Atom Four-Body Statistical PotentialQuad Count sijkl Quad Count sijkl Quad Count sijkl Quad Count sijkl CCCC 4015872 -0.14024 CMOX 68 0.308765 MMNX 0 -- NNOS 5637 -0.30523 CCCM 1592 -0.98922 CMSS 2047 2.848813 MMOO 306 2.31553 NNOX 302 -0.75477 CCCN 4025206 -0.16987 CMSX 13 1.172117 MMOS 104 3.127729 NNSS 311 0.319427 CCCO 6202159 -0.03247 CMXX 6 1.958862 MMOX 3 2.409325 NNSX 6 -0.87471 CCCS 293157 0.224008 CNNN 102035 -0.62305 MMSS 254 5.398477 NNXX 5 0.168652 CCCX 2796 -0.97505 CNNO 1995038 0.140679 MMSX 2 3.815151 NOOO 171147 0.021937 CCMM 132 0.908235 CNNS 15892 -0.37618 MMXX 0 -- NOOS 10697 -0.07737 CCMN 3318 -0.57598 CNNX 578 -0.99392 MNNN 1030 0.53596 NOOX 3102 0.206513 CCMO 5325 -0.42089 CNOO 2734639 0.227273 MNNO 1128 0.047955 NOSS 922 0.440012 CCMS 2293 0.795108 CNOS 95438 0.050981 MNNS 561 1.326526 NOSX 12 -0.92506 CCMX 15 -0.5677 CNOX 2168 -0.77117 MNNX 5 0.098041 NOXX 61 0.903627 CCNN 1797552 -0.12464 CNSS 4264 0.584024 MNOO 3744 0.518626 NSSS 33 1.052833 CCNO 8233136 0.184864 CNSX 37 -0.95711 MNOS 314 0.723107 NSSX 0 -- CCNS 124653 -0.05308 CNXX 61 0.382553 MNOX 29 0.510083 NSXX 0 -- CCNX 2007 -1.02473 COOO 524994 -0.06271 MNSS 793 3.008398 NXXX 3 2.475964 CCOO 3366568 0.047161 COOS 34429 -0.14114 MNSX 5 1.328573 OOOO 34212 -0.12555 CCOS 198630 0.098905 COOX 23801 0.520038 MNXX 9 2.706383 OOOS 4240 -0.0525 CCOX 4626 -0.71243 COSS 4380 0.545326 MOOO 5430 1.106856 OOOX 9553 1.121777 CCSS 15288 0.868158 COSX 58 -0.81224 MOOS 156 0.669977 OOSS 300 0.203077 CCSX 144 -0.63735 COXX 65 0.359781 MOOX 168 1.523669 OOSX 36 -0.19726 CCXX 143 0.482159 CSSS 285 1.417735 MOSS 210 2.380989 OOXX 128 1.476181 CMMM 23 3.480397 CSSX 5 0.006247 MOSX 4 1.181307 OSSS 38 1.063748 CMMN 144 1.216422 CSXX 4 0.730845 MOXX 55 3.442148 OSSX 3 0.305472 CMMO 256 1.415945 CXXX 9 2.381656 MSSS 62 3.910199 OSXX 0 -- CMMS 662 3.41048 MMMM 83 7.794725 MSSX 2 2.763224 OXXX 2 2.249518 CMMX 1 1.41113 MMMN 37 4.258301 MSXX 0 -- SSSS 6 2.446092 CMNN 2474 -0.13203 MMMO 29 4.102142 MXXX 16 5.786451 SSSX 0 -- CMNO 6267 -0.07975 MMMS 379 6.8003 NNNN 3878 -0.8697 SSXX 0 -- CMNS 2588 1.118068 MMMX 0 -- NNNO 46665 -0.44173 SXXX 0 -- CMNX 26 -0.05842 MMNN 83 1.849597 NNNS 460 -0.86605 XXXX 0 -- CMOO 8481 0.302308 MMNO 102 1.587734 NNNX 34 -1.17582 CMOS 1010 0.659069 MMNS 363 3.720958 NNOO 340620 0.195102

X. Topological Score (TS)

• Delaunay tessellation of any macromolecular structure yields an aggregate of tetrahedral simplices

• Each simplex can be scored using the all-atom four-body potential based on the quad present at the four vertices

• Topological score (or ‘total potential’) of the structure: the sum of all constituent simplices in the tessellation

sijkl TS = Σsijkl

XI. Topological Score Difference (ΔTS) XII. Application of ΔTS: Predicting Enzyme–Inhibitor Binding Energy

• MOAD – repository of exp. inhibition constants (ki) for protein–ligand complexes whose structures are in PDB

• Collected ki values for 300 complexes reflecting diverse protein structures

• Obtained exp. binding energy from ki via ΔGexp = –RTln(ki)

• Calculated ΔTS for complexes

XIII. Predicting Enzyme–Inhibitor Binding Energy

• Randomly selected 200 complexes to train a model

• Correlation coefficient r = 0.79 between ΔTS and ΔGexp

• Empirical linear transform of ΔTS to reflect energy values:

ΔGcalc = (1 / 0.0003) × ΔTS – 6.24

• Linear => same r = 0.79 value between ΔGcalc and ΔGexp

• Also, standard error of SE = 2.34 kcal/mol and fitted regression line of y = 0.98x – 0.41 (y = ΔGcalc and x = ΔGexp)

XIV. Predicting Enzyme–Inhibitor Binding Energy

• For the test set of 100 remaining complexes:

• r = 0.77 between ΔGcalc and ΔGexp

• SE = 2.50 kcal/mol

• Fitted regression line is y = 1.07x + 0.46

• All training/test data available online as a text file (see XV. References)

XV. References and Acknowledgments

• PDB dataset: http://proteins.gmu.edu/automute/tessellatable1417.txt

• Train/test dataset: http://proteins.gmu.edu/automute/MOAD300ki.txt

• PDB (structure DB): http://www.rcsb.org/pdb

• MOAD (ligand binding DB): http://bindingmoad.org/

• Qhull (Delaunay tessellation): http://www.qhull.org/

• UCSF Chimera (ribbon/ball-stick structure visualization): http://www.cgl.ucsf.edu/chimera/

• Matlab (tessellation visualization): http://www.mathworks.com/products/matlab/