
Identification of Protein–Protein Interaction Sites from Docking Energy Landscapes

Juan Fernández-Recio1, Maxim Totrov2 and Ruben Abagyan1*

1Department of Molecular Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA

2Molsoft, LLC, 3366 Torrey Pines Court, La Jolla, CA 92037, USA

Protein recognition is one of the most challenging and intriguing problems in structural biology. Despite all the available structural, sequence and biophysical information about protein–protein complexes, the physico-chemical patterns, if any, that make a protein surface likely to be involved in protein–protein interactions, remain elusive. Here, we apply protein docking simulations and analysis of the interaction energy landscapes to identify protein–protein interaction sites. The new protocol for global docking based on multi-start global energy optimization of an all-atom model of the ligand, with detailed receptor potentials and atomic solvation parameters optimized in a training set of 24 complexes, explores the conformational space around the whole receptor without restrictions. The ensembles of the rigid-body docking solutions generated by the simulations were subsequently used to project the docking energy landscapes onto the protein surfaces. We found that highly populated low-energy regions consistently corresponded to actual binding sites. The procedure was validated on a test set of 21 known protein–protein complexes not used in the training set. As much as 81% of the predicted high-propensity patch residues were located correctly in the native interfaces. This approach can guide the design of mutations on the surfaces of proteins, provide geometrical details of a possible interaction, and help to annotate protein surfaces in structural proteomics.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: protein–protein interactions; binding site identification; pseudo-Brownian Monte Carlo; docking energy landscapes; hot spots

*Corresponding author

Introduction

Protein–protein interactions (PPI) play a key role in many biological processes such as signal transduction, gene expression control, enzyme inhibition, antibody–antigen recognition or even the assembly of multi-domain proteins. In the cellular context, these PPIs very often lead to the formation of stable specific protein–protein complexes that are essential to perform their biological functions. Identification of protein–protein interactions can be performed through a combination of experimental and theoretical techniques, and several protein interaction databases are being compiled.1 However, the exact location and geometry of the interface poses a separate problem, which cannot always be addressed with X-ray crystallography or NMR studies. As a result, the demand for computer modeling is growing, and a number of algorithms that generate models of the macromolecular complex based on the structures of individual subunits (docking methods) have been described.2–7 However, these docking procedures can only generate the near-native complex structure among a number of alternative docked conformations (false-positives), and some information about the interacting surfaces (biological data, sequence and structure homology) is frequently required to single out the correct solution. The knowledge of the approximate location of the binding interface

0022-2836/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.

Present address: J. Fernández-Recio, Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK.

E-mail address of the corresponding author: [email protected]

Abbreviations used: PPI, protein–protein interaction; PDB, Protein Data Bank; ICM, internal coordinate mechanics; RMSD, root-mean-square deviation; ASA, accessible surface area; ASP, atomic solvation parameter; ABS, averaged buried surface; RBS, relative buried surface; IP, interface propensity; NIP, normalized interface propensity; EPO, erythropoietin; EPOR, erythropoietin receptor; MC, Monte Carlo; FNR, ferredoxin-NADP+ reductase; Fd, ferredoxin.

doi:10.1016/j.jmb.2003.10.069 J. Mol. Biol. (2004) 335, 843–865


on the protein surface can be of direct practical use as well as help to improve the accuracy of the docking models. Chemical and physical complementarity between the interacting surfaces is essential, and the existence of "hot spots", i.e. a few residues that confer most of the binding energy, has been reported.8–12 In this context, the question arises of whether there are specific chemical and physical characteristics on the surfaces of proteins that can be used to identify the PPI sites. Different analyses of known protein–protein interfaces attempted to find specific characteristics that differentiate PPI sites from the rest of the protein surface.13–19 However, in hetero-complexes the PPI site composition does not differ significantly from the rest of the solvent-accessible surface.20–22 In addition, the results of continuum electrostatic calculations suggest that protein–protein interfaces are naturally designed to exploit electrostatic and hydrophobic forces in very different ways.23 Thus, it appears difficult to find chemical or structural patterns on the surface of proteins that unequivocally define PPI sites, and very few computational methods have been developed so far for their prediction. A method based on residue interface propensities and physico-chemical analysis of surface patches achieved a success rate of 54% on a dataset of 59 complexes, which included homo-dimers, hetero-complexes, and antibody–antigen complexes (a prediction was considered successful when at least one of the three predicted patches covered >50% of the real interface).24 More recently, a neural network based on sequence profiles of neighboring residues and solvent-exposure values was trained on a dataset of 615 protein–protein complex structures to predict interacting regions.25 The method was tested on a set of 35 unbound proteins, where 70% of the predicted interface residues were located correctly in the interfaces. Unfortunately, results obtained from neural networks are difficult to interpret in physical or phenomenological terms. The conformational changes upon protein–protein association are often limited to local movements, which suggests that protein–protein association is driven by rigid-body fit.21,26–28 Thus, the idea of analyzing the rigid-body docking energy landscape in search of protein recognition areas is highly attractive. The possibility of low-resolution recognition in protein–protein association has been studied recently using geometry-based rigid-body docking simulations.29

In the present work, we use rigid-body docking simulations, with a grid-based energy description of the system, to analyze the existence of preferred interaction sites on a protein surface. The soft interaction energy and the pseudo-Brownian conformational sampling procedure used here were tested previously and optimized on a benchmark of known protein–protein complexes for which the uncomplexed structures were determined.4 The energy function was shown to be sufficiently accurate and tolerant to the induced fit to select the near-native docked solution in most cases, and the search procedure was able to explore thoroughly the rotational and translational degrees of freedom of the ligand in the vicinity of the binding site of the receptor (so-called local docking). Here, we address the more challenging task of exploring the conformational space around the whole receptor, and we show that the energy profile for the ensemble of found docked poses can be used to determine accurately interaction sites on protein surfaces.

Results

Global rigid-body docking

Rigid-body docking simulations were applied to a large dataset of protein–protein complexes (Table 1) using the 3-D coordinates of the unbound subunits. The method used here permits a complete sampling of the whole receptor surface, without any restriction, and is based on a previously described multiple-start pseudo-Brownian Monte-Carlo minimization procedure,4 which was extended to sample entire surfaces for both proteins and to adapt automatically the distribution of starting points to the shape and size of the proteins as described in Materials and Methods. The modified procedure is capable of locating nearly all distinct (farther than 4 Å of interface RMSD apart) local minima of the grid energy function. The number of conformations obtained from the simulations depended on the size and shape of the interacting molecules. For instance, a total of 5958 different conformations were obtained for the complex chymotrypsin/APPI (PDB 1ca0), whereas a total of 12,064 conformations were obtained for the considerably larger complex Fab D44.1/lysozyme (PDB 1mlc).

The RMSD/energy distributions of all docked poses for some of the complexes are shown in Figure 1(a). Despite the rather noisy distribution of the individual energy values, the behavior of the averaged energy for the solutions within a sliding RMSD range suggests an overall dependence of the energy on the proximity to the real structure, which for some complexes points towards the native solution ("funnel" shape).
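The averaged energy within a sliding RMSD range can be illustrated with a short sketch. This is not the authors' code: the function name `sliding_mean_energy`, the `step` parameter and the toy poses are assumptions made for illustration; only the 5 Å window width is taken from the Figure 1 legend.

```python
def sliding_mean_energy(poses, window=5.0, step=1.0):
    """Average the energy of docking poses inside a sliding RMSD window.

    poses  : list of (rmsd, energy) pairs, RMSD in angstroms
    window : width of the RMSD range (5 A in the Figure 1 legend)
    step   : shift between consecutive windows (assumed, not from the paper)
    Returns a list of (window_center, mean_energy), skipping empty windows.
    """
    if not poses:
        return []
    max_rmsd = max(rmsd for rmsd, _ in poses)
    profile = []
    start = 0.0
    while start <= max_rmsd:
        inside = [e for rmsd, e in poses if start <= rmsd < start + window]
        if inside:
            profile.append((start + window / 2.0, sum(inside) / len(inside)))
        start += step
    return profile


# Hypothetical poses: low energies close to the native structure.
toy_poses = [(1.0, -40.0), (2.0, -38.0), (12.0, -20.0), (14.0, -24.0), (30.0, -10.0)]
profile = sliding_mean_energy(toy_poses, window=5.0, step=5.0)
```

A funnel-shaped landscape would show up as a mean energy that rises steadily with the window center.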

Inclusion of a detailed solvation term and optimization of the energy estimate on the training set

To accelerate calculations, only surface charge scaling (SChEM)30 and a crude hydrophobic contact grid potential were used during the docking simulations to account for the solvation effects. To include the desolvation penalty of burying solvent-exposed polar groups, as well as to improve the precision of the hydrophobic energy calculation, the energy value for every docked pose was recalculated after the simulations using an

844 Mapping Protein Interfaces by Docking


atomic solvent-accessible surface area (ASA)-based solvation model (see Materials and Methods). We had found that the atomic solvation parameters (ASP) derived from vapor/water transfer energies improved the local docking results.4 Here, we extend and optimize this solvation model for the "global docking" by using ASP derived from octanol/water transfer energies (see Materials and Methods), which better describe the burial of solvent-exposed residues in the interface.31

The following energy function was used to re-evaluate the set of conformations generated during the docking simulations:

$$E = \alpha\,(E_{Hvw} + E_{Cvw}) + E_{el} + \beta\,E_{hb} + \gamma\,E_{polar} + \delta\,E_{ar} + \varepsilon\,E_{al} \qquad (1)$$

The coefficients (α, β, γ, δ, ε) assigned to each energy term were optimized to obtain the best possible ranking for the near-native solutions in all complexes of the training set listed in Table 1 (the fitting procedure is detailed in Materials and Methods). The same training set was previously

Table 1. Benchmark of protein–protein complexes used in this work

Receptor                    Ligand                        Complex    Unbound receptor  Unbound ligand

Parameterization set
Chymotrypsin                APPI                          1ca0       5cha              1aap
Chymotrypsin                BPTI                          1cbw       5cha              1bpi
Chymotrypsin                Eglin C                       1acb       5cha              1egl
Chymotrypsin                Ovomucoid                     1cho       5cha              1omu
Chymotrypsinogen            HPTI                          1cgi       1chg              1hpt
Kallikrein A                BPTI                          2kai       2pka              1bpi
Subtilisin BPN              CI-2                          2sni       2st1              2ci2
Subtilisin BPN              SSI                           2sic       2st1              3ssi
Subtilisin Carlsberg        Eglin C                       1cse       1sbc              1egl
Thermitase                  Eglin C                       2tec       1thm              1egl
Trypsin (bovine)            APPI                          1taw       5ptp              1aap
Trypsin (bovine)            BPTI                          2ptc       5ptp              1bpi
Trypsin (rat)               BPTI                          3tgi       1ane              1bpi
Trypsin (rat)               APPI                          1brc       1bra              1aap
Acetylcholinesterase        Fasciculin II                 1fss       2ace              1fsc
α-amylase                   Tendamistat                   1bvn       1pif              2ait
Barnase                     Barstar                       1bgs       1a2p              1a19
Ribonuclease Sa             Barstar                       1ay7       1rge              1a19
TEM-1 β-lactamase           BLIP                          TEM1(a)    RETM(a)           BLIP(a)
UDG                         UGI                           1ugh       1akz              2ugi
Cytochrome c peroxidase     Cytochrome c                  2pcb       1ccp              1hrc
Cytochrome f                Plastocyanin                  2pcf       1ctm              1ag6
Fab D44.1                   Lysozyme                      1mlc       1mlb              1lza
Fv D1.3                     Lysozyme                      1vfb       1vfa              1lza

Test set
Ferredoxin-NADP+ reductase  Ferredoxin                    1ewy       1que              1fxa
Erythropoietin receptor     Erythropoietin                1eer       1ern              1buy
HPr kinase                  HPr                           1kkl       1jb1              1sph
VP6 (rotavirus)             Fab                           VP6FAB(b)  1qhd              VP6FAB(c)
Hemagglutinin               Fab HC63                      1ken       2viu              1ken(c)
α-amylase                   VHH Amd10                     1kxv       1pif              1kxv(c)
α-amylase                   VHH Amb7                      1kxt       1pif              1kxt(c)
α-amylase                   VHH Amd9                      1kxq       1pif              1kxq(c)
TCR-β                       speA                          1l0x       1bec              1b1z
Trypsin                     Soybean trypsin inhibitor     1avw       2ptn              1ba7
Ribonuclease inhibitor      Ribonuclease A                1dfj       2bnh              7rsa
Trypsinogen                 PSTI                          1tgs       2ptn              1hpt
Fab 5G9                     Tissue factor                 1ahw       1fgn              1boy
Hyhel-63 Fab                Lysozyme                      1dqj       1dqq              3lzt
IgG1 E8 Fab                 Cytochrome c                  1wej       1qbl              1hrc
HIV-1 NEF                   FYN TK SH3 domain             1avz       1avv              1shf
RAS activating domain       RAS                           1wq1       1wer              5p21
Methylamine dehydrogenase   Amicyanin                     2mta       2bbk              1aan
Thrombin mutant             Pancreatic trypsin inhibitor  1bth       2hnt              6pti
CDK2                        Cyclin                        1fin       1hcl              1vin
CDK2                        KAP                           1fq1       1b39              1fpz

The selection of protein–protein complexes is described elsewhere.4

(a) No coordinates deposited in PDB: the structures were kindly provided by the authors.57–59

(b) No coordinates deposited in PDB at the present time; the structure of the VP6/Fab complex was made available by the authors (M. C. Vaney & F. Rey, unpublished results) to constitute target 2 in the recent CAPRI competition (http://capri.ebi.ac.uk).

(c) We used for the docking simulations the coordinates of the bound ligand in a random orientation, as provided by the CAPRI organization (http://capri.ebi.ac.uk).


Figure 1. (a) Distribution of the rigid-body docking poses for different complexes after the simulations, as re-evaluated by equation (3) in Materials and Methods. (b) The same docking poses re-evaluated with the optimized energy


used to benchmark the initial docking procedure.4

The non-solvation energy terms are described elsewhere:4 $E_{Hvw}$ and $E_{Cvw}$ are the van der Waals potentials pre-calculated for a hydrogen and a heavy-atom probe, respectively, truncated at +1.0 kcal/mol (1 cal = 4.184 J); $E_{el}$ is the electrostatic potential with a distance-dependent dielectric constant ε = 4r;32 and $E_{hb}$ is the hydrogen-bonding potential. The additional solvation term was split into three components: $E_{polar}$ includes contributions only from polar atoms; $E_{ar}$ and $E_{al}$ include contributions from the aromatic and aliphatic atoms, respectively. Several parallel minimization trajectories of the "amoeba" simplex optimizer with different initial values converged to the following weighting factor values: α = 0.40; β = 2.43; γ = 2.30; δ = 6.26 and ε = 1.27. The ratio between the van der Waals and electrostatics weighting factors (ratio vdw/el = 0.40) remained similar to the ratio between the values used during simulations (ratio vdw/el = 0.46). The low weighting of the van der Waals term can be explained by (i) its higher intrinsic noise due to the high sensitivity of this short-distance term to small deviations in the structure; and/or (ii) a possible double-counting due to the fact that the solvation term (with parameters derived from octanol/water transfer energies) already takes into account the attractive component of the intermolecular van der Waals (this would be the case assuming well-packed interfaces; however, the van der Waals term is still needed to account for clashes in the rigid-body docking interfaces). The weight of the hydrogen bonding potential with respect to the electrostatics (ratio hb/el = 2.43) approximately doubled (weights used in simulations had the ratio hb/el = 1.17). This is as expected, since the desolvation of hydrogen bond donors/acceptors is now explicitly present and compensates, in part, for the (bigger) energy gain from the hydrogen bond formation. Interestingly, the factor for the aromatic solvation component is considerably larger than the rest of the solvation terms (see more in Discussion).
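As a rough illustration of how equation (1) combines the individual terms, the sketch below rescoring a single pose uses the converged weights quoted above. The class `PoseTerms`, the function `rescored_energy` and the numeric term values are hypothetical; in the paper the terms come from grid potentials and an ASA-based solvation model that are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class PoseTerms:
    """Per-pose energy terms of equation (1), in kcal/mol (toy container)."""
    e_hvw: float    # van der Waals, hydrogen probe
    e_cvw: float    # van der Waals, heavy-atom probe
    e_el: float     # electrostatics
    e_hb: float     # hydrogen bonding
    e_polar: float  # solvation, polar atoms
    e_ar: float     # solvation, aromatic atoms
    e_al: float     # solvation, aliphatic atoms


# Weights reported in the text for the optimized energy function.
WEIGHTS = {"alpha": 0.40, "beta": 2.43, "gamma": 2.30, "delta": 6.26, "epsilon": 1.27}


def rescored_energy(t, w=WEIGHTS):
    """Weighted sum of equation (1); the electrostatic term has unit weight."""
    return (w["alpha"] * (t.e_hvw + t.e_cvw)
            + t.e_el
            + w["beta"] * t.e_hb
            + w["gamma"] * t.e_polar
            + w["delta"] * t.e_ar
            + w["epsilon"] * t.e_al)


# Hypothetical term values for one docking pose.
pose = PoseTerms(e_hvw=-10.0, e_cvw=-20.0, e_el=-5.0, e_hb=-2.0,
                 e_polar=1.0, e_ar=-1.0, e_al=-3.0)
energy = rescored_energy(pose)
```

Because rescoring is a plain weighted sum, the stored terms of every pose can be reused each time the optimizer proposes a new set of weights, without repeating the docking run.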

Figure 1 shows the energy/RMSD distributions of all docking poses for some of the complexes of the training set (1ca0, 2pcf and 1mlc), as re-evaluated with both the original energy function (equation (3) in Materials and Methods) and the optimized one (equation (1) with the parameters described above). The energy profiles become much more funnel-shaped (i.e. conformations closer to the real structure have significantly lower relative energy values) when the solutions are re-evaluated with the optimized energy function and new solvation terms.

In order to demonstrate the general applicability of the optimized parameters, we validated them on a test set (Table 1) formed by complexes whose structures (and those of their unbound subunits) were available to us only after we compiled the training set for benchmarking the docking method. The test set included: the recently solved complexes FNR/ferredoxin and erythropoietin (EPO)/receptor (EPOR), the seven targets of the CAPRI experiment†, and the unbound–unbound test cases of the benchmark compiled by Chen et al.,33 none of which was used in our training set. As an example, the effect of the new energy function on the energy/RMSD profile for the docking of EPO and EPOR is shown in Figure 1. As can be seen, the relative energy values of the conformations in the vicinity of the native complex structure also improved with the new energy function. The energy/RMSD profiles of the rest of the complexes in the test set (Table 1) also became more funnel-like (not shown), which indicates that the optimized parameters are applicable to complexes outside the training set.

Distribution of docking poses in the training set and in the test set

The energy/RMSD profile provides only a 1-D representation of the 6-D rigid-body docking energy landscape of the complex and requires prior knowledge of the native structure. Visualization of the spatial distribution of the docked poses around the receptor can result in a more objective and detailed picture. In Figure 2 we plot the positions of the center of mass of the ligand in all of the rigid-body docking poses obtained for a representation of complexes of the training set: a protease–inhibitor (chymotrypsin/APPI; PDB 1ca0); an electron-transfer complex (cytochrome f/plastocyanin; PDB 2pcf); and an antibody–antigen (Fab/lysozyme; PDB 1mlc). In the same Figure is shown a representation of complexes of the test set: a protease-inhibitor (ribonuclease inhibitor/ribonuclease A; PDB 1dfj); an electron-transfer complex (FNR/ferredoxin; PDB 1ewy); and an antibody–antigen (IgG1 E8 Fab/cytochrome c; PDB 1wej). The dots have been colored from red to blue, according to the energy values of the

function, as defined by equation (1). RMSD is calculated for the ligand interface Cα atoms (defined as those with at least one atom at <4 Å from any receptor atom) with respect to the native complex structure. The average energy value for the docking poses within a sliding 5 Å RMSD range is represented as a continuous line. The complexes chymotrypsin/APPI (PDB 1ca0); cytochrome f/plastocyanin (PDB 2pcf); and Fab/lysozyme (PDB 1mlc) were included in the initial training set used to derive the optimized parameters. An example of a complex not included in the initial training set, EPO/EPObp (PDB 1eer), is shown.

† http://capri.ebi.ac.uk


Figure 2. Representation of the docking poses obtained for complexes of the training set: (a) chymotrypsin/APPI; (b) cytochrome f/plastocyanin; and (c) Fab/lysozyme; and of the test set: (d) ribonuclease inhibitor/ribonuclease A; (e) FNR/ferredoxin; and (f) IgG1 E8 Fab/cytochrome c. The native complex structures (PDB codes: 1ca0; 2pcf; 1mlc; 1dfj; 1ewy; 1wej, respectively) are represented (receptor in gray; ligand in green). The dots represent the center of mass of every ligand pose after the rigid-body simulations. They are colored according to the energy value of each solution from red (lowest energy) to blue (highest energy).


corresponding pose (from lower to higher energy values, respectively). An accumulation of low-energy solutions around the known position of the ligand in the complex can be observed, both in complexes in the training set and in the test set.

The analysis of these distributions can be further facilitated using a spherical coordinates projection of the positions of the ligand's center of mass (see Materials and Methods). These projected docking landscapes (shown in Figure 3(a) for selected complexes) indicate clearly the accumulation of low-energy solutions around the known binding site. Zones with high density of low-energy solutions are shown in red, while regions with high-energy solutions or no solutions at all are shown in blue. Averaging neighboring regions removes some of the noise and shows more clearly the populated areas of both low and high-energy regions (red and blue zones, respectively), as can be seen in Figure 3(b). In this analysis, the receptor surface is considered spherical (thus overlooking all minor geometrical details). In some cases, this might be an oversimplification, but it provides a simple visualization of the distribution of docked poses on the surface of the receptor, which can be useful to compare docking landscapes of different complexes.
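The Figure 3 legend describes the projection as a 2-D sinusoidal equal-area projection of the ligand center of mass in spherical coordinates. A minimal sketch of that mapping follows, assuming the receptor center is the origin of the sphere and that the z axis defines the poles; the function name and these axis conventions are illustrative choices, not details taken from the paper.

```python
import math


def sinusoidal_projection(point, center=(0.0, 0.0, 0.0)):
    """Map a 3-D position to sinusoidal equal-area map coordinates.

    point, center : (x, y, z) of the ligand center of mass and of the
                    receptor center (assumed origin of the sphere).
    Returns (map_x, map_y) = (longitude * cos(latitude), latitude), in radians.
    The radial distance is discarded, i.e. the receptor is treated as a sphere.
    """
    dx, dy, dz = (p - c for p, c in zip(point, center))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        raise ValueError("point coincides with the center")
    # Clamp to guard against rounding slightly outside [-1, 1].
    latitude = math.asin(max(-1.0, min(1.0, dz / r)))   # -pi/2 .. pi/2
    longitude = math.atan2(dy, dx)                      # -pi .. pi
    return longitude * math.cos(latitude), latitude


on_equator = sinusoidal_projection((0.0, 1.0, 0.0))
at_pole = sinusoidal_projection((0.0, 0.0, 2.0))
```

Since the projection preserves area, a density of dots on the map is proportional to the density of poses per unit of (spherical) receptor surface, which is what makes the red and blue zones comparable across the map.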

Figure 3. (a) Representation of the rigid-body docking poses for the complexes chymotrypsin/APPI (PDB 1ca0); cytochrome f/plastocyanin (PDB 2pcf); and Fab/lysozyme (PDB 1mlc). The position of the center of mass of every ligand docking pose is represented in spherical coordinates as a 2-D sinusoidal equal-area projection, and colored (from red to blue) according to the energy values (low energies in red; high energies in blue). (b) The same docking energy landscapes, averaged as described in Materials and Methods, and colored according to the same criteria.


Mapping residue interface propensities onto the protein surface

We next proceeded to investigate how the observed trends in docking pose distributions and energy landscapes could be translated into the propensities of the individual receptor surface residues to be buried in the putative complexed structures. For each receptor surface residue, the relative surface buried upon ligand binding was calculated, and this value was averaged for all docking poses according to a Boltzmann distribution at the elevated simulation temperature, as defined by equation (4). This energy-based weighting system increases the influence of the low-energy solutions in the final average relative buried surface values (for more details, see Materials and Methods). In Figure 4(a) the per-residue segments of the molecular surface of chymotrypsin are colored from red to blue according to the energy-averaged buried surface values calculated using the complete set of docking solutions (N = 5958) for the chymotrypsin/APPI complex. This representation reflects the zones of the receptor surface where the lowest-energy solutions accumulate (in red). High-energy solutions have little influence on the final values, so if we compute only the relative buried surface values per residue for the 100 lowest-energy solutions (instead of the total number of 5958), the results are very similar, as can be seen in Figure 4(b). Moreover, if we calculate the same values using the 100 lowest-energy solutions without any energy weighting, as defined by equation (5), the results are practically identical, as can be seen in Figure 4(c). This averaged buried surface (ABS) value is easier to compute and reflects the number of low-energy solutions in which a particular residue is involved in the interface: a value of 0 for a particular residue would mean that it is not found in the interface in any of the 100 lowest-energy solutions, whereas a value of 1 would mean that it is buried completely upon binding in all 100 lowest-energy docking solutions. Using these values and <ABS> (estimator for the expected ABS value from a random distribution of the docking interfaces), we can calculate the normalized interface propensity (NIP) per residue (see Materials and Methods). A normalized propensity of 0 will indicate that the residue is found

Figure 4. Mapping of the averaged relative buried surface per residue on the receptor molecular surface for the docking of chymotrypsin (PDB 5cha) and APPI (PDB 1aap). The Connolly surface of the receptor is represented, and residues are colored according to their ABS values (residues in red/blue have highest/lowest values, respectively). The position corresponding to the experimental complex structure is marked in green. (a) Plot obtained from all 5958 docking poses, after weighting according to a Boltzmann distribution (equation (4) in Materials and Methods). (b) Plot obtained from the 100 lowest-energy docking poses with the same energy weighting (equation (4) with N = 100). (c) Plot obtained from the 100 lowest-energy docking poses with no energy weighting, as defined by equation (5).


in the docking interfaces as frequently as expected, while a value of 1 (maximum expected value) would indicate that it is found in absolutely all the docking interfaces. Negative values would indicate that the residue is found in the docking interfaces less frequently than expected. Figure 5 shows the interface propensity maps calculated for complexes from both the training and test sets. The accumulation of low-energy solutions in the vicinity of the known binding sites indicates the existence of preferred binding areas for protein–protein docking. We will now compare these interface propensity maps with the experimental interfaces and explore their potential to predict protein–protein interaction sites.
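The per-residue averaging described in this section can be sketched as follows. Equations (4) and (5) and the NIP definition are given in Materials and Methods and are not reproduced in this section, so the forms below are assumptions: a standard Boltzmann weight exp(-E/RT) for the weighted average, a plain mean over the lowest-energy poses for the ABS, and a NIP of (ABS - <ABS>)/(1 - <ABS>), chosen only because it matches the stated behavior (0 at the expected value, 1 at the maximum, negative below expectation). The function names and the `rt` parameter are hypothetical.

```python
import math


def boltzmann_abs(rbs, energies, rt):
    """Energy-weighted mean of one residue's relative buried surface (RBS).

    rbs      : RBS of the residue in each docking pose (0..1)
    energies : energy of each pose, same order as rbs
    rt       : RT at the simulation temperature (value not given here)
    """
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    return sum(w * b for w, b in zip(weights, rbs)) / sum(weights)


def unweighted_abs(rbs, energies, n_lowest=100):
    """Plain mean of the RBS over the n_lowest lowest-energy poses."""
    ranked = sorted(zip(energies, rbs))[:n_lowest]
    return sum(b for _, b in ranked) / len(ranked)


def nip(abs_value, expected_abs):
    """Assumed normalization: 0 at <ABS>, 1 when ABS = 1, negative below <ABS>."""
    return (abs_value - expected_abs) / (1.0 - expected_abs)


# Toy residue: buried in the two lowest-energy poses, exposed in the third.
toy_abs = unweighted_abs([1.0, 0.0, 0.5], [-3.0, -1.0, -2.0], n_lowest=2)
```

With a large energy gap between poses, the Boltzmann-weighted value is dominated by the lowest-energy solutions, which is consistent with the observation that restricting the average to the 100 best poses changes the map very little.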

Interface propensity maps from the docking poses: prediction of binding sites

The observed distribution of the docking interface propensities suggests that they can be used to predict which residues are likely to be involved in the interface. We initially used a loose NIP cutoff of 0.0 to define the theoretical binding residues (i.e. those found in the docking interfaces more frequently than expected). In order to compare this theoretical binding site with the experimental one, we can compute the coverage, i.e. percentage of the experimental binding site that is predicted correctly:

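$$\mathrm{COVERAGE} = 100 \times \frac{N_r^{\mathrm{EXP}} \cap N_r^{\mathrm{PRED}}}{N_r^{\mathrm{EXP}}} \qquad (2)$$

where $N_r^{\mathrm{EXP}}$ and $N_r^{\mathrm{PRED}}$ are the numbers of residues in the experimental and predicted binding sites, respectively, and the numerator counts the residues common to both sites.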
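As a minimal sketch of equation (2), assuming each binding site is represented as a set of residue identifiers, the coverage and the relative size reported in Table 3 could be computed as follows. The function names and the toy residue numbers are hypothetical and are not taken from the original work.

```python
def coverage(experimental, predicted):
    """Percentage of the experimental binding site recovered by the prediction."""
    if not experimental:
        raise ValueError("experimental binding site is empty")
    return 100.0 * len(experimental & predicted) / len(experimental)


def relative_size(experimental, predicted):
    """Size of the predicted site relative to the experimental one."""
    if not experimental:
        raise ValueError("experimental binding site is empty")
    return len(predicted) / len(experimental)


# Toy example with made-up residue numbers (illustrative only).
exp_site = {10, 11, 12, 13, 14}
pred_site = {12, 13, 14, 15, 16, 17, 18}
print(coverage(exp_site, pred_site))       # 60.0
print(relative_size(exp_site, pred_site))  # 1.4
```

In this toy case three of the five experimental residues are recovered, while the predicted site is 1.4 times larger than the experimental one.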

The statistical significance of the predicted coverage values was evaluated by a χ² test. Contingency tables were built for each prediction, for which a P significance (probability that the prediction is random) was computed. As an example, Table 2 shows the contingency table for the predicted receptor and ligand interfaces of the complex chymotrypsin/APPI (PDB 1ca0) compared to the experimental ones.
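The test can be illustrated with a short sketch. It assumes a Pearson χ² statistic without continuity correction on a 2×2 table, with expected counts derived from the row and column totals; this choice reproduces the χ² values quoted in Table 2, but the function name and the implementation are illustrative rather than the authors' own code.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p), where p is the upper-tail probability for 1 degree of freedom.
    """
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Observed residue counts for chymotrypsin/APPI as listed in Table 2.
receptor = chi2_2x2(21, 21, 5, 158)
ligand = chi2_2x2(8, 14, 6, 25)
print(round(receptor[0], 1), receptor[1] < 1e-4)  # 66.4 True
print(round(ligand[0], 1), round(ligand[1], 2))   # 1.9 0.17
```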

Table 3 lists the percentage of the experimental interface covered by the predicted one in the complexes of the training and test sets (for both receptor and ligand molecules), together with the corresponding confidence values. As can be seen, the predicted interfaces differ from a random distribution (confidence >95%) in the majority of the molecules both in the training set (69%) and in the test set (81%). If we consider only the significant predictions, the average percentage of the experimental interface covered by the predicted one in the complexes of the training set is 86% (89% for the receptor interfaces; 81% for the ligand interfaces). The results for the test set are very similar: the average coverage is 80% (83% for the receptor interfaces; 78% for the ligand interfaces).

High-propensity binding patches in the training set

The predicted binding sites listed in Table 3 are formed by residues found in the docking interfaces more frequently than expected. This yields, in general, theoretical interaction sites larger than the experimental ones. Although we found a significant correlation between the predicted and native interfaces in most of the cases, some concern may arise about the practical application of these large predicted binding sites. Thus, it may be beneficial to reduce their size by selecting residues with stronger interface propensities, that is, by increasing the NIP cutoff value used to define the theoretical binding sites. As can be seen in Figure 6(a), when the NIP cutoff value was increased, the accuracy (i.e. percentage of correctly predicted residues) of the theoretical binding sites in the training set improved considerably, although the number of molecules with no detectable high-propensity residues also increased. As can be seen in Table 4, at NIP cutoff of 0.4 the average accuracy of the high-propensity binding patches (found in 65% of the molecules) is about 85%. Higher cutoff values could be imposed to try to find residues even more likely to be in the native interface; e.g. at NIP cutoff of 0.7 the high-propensity binding patches, found in 19% of the molecules, are all correct (results not shown).
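The trade-off between patch accuracy and patch detection can be sketched as below, assuming the NIP values are available as a mapping from residue identifier to NIP. The data are made up for illustration and the helper names are hypothetical.

```python
def high_propensity_patch(nip, cutoff):
    """Residues whose NIP value exceeds the cutoff."""
    return {res for res, value in nip.items() if value > cutoff}


def patch_accuracy(patch, experimental):
    """Percentage of patch residues found in the experimental interface.

    Returns None when no patch is detected at the chosen cutoff.
    """
    if not patch:
        return None
    return 100.0 * len(patch & experimental) / len(patch)


# Made-up NIP values for six surface residues (illustrative only).
nip = {"A": 0.75, "B": 0.5, "C": 0.25, "D": 0.1, "E": -0.2, "F": -0.4}
experimental = {"A", "B", "D"}

for cutoff in (0.0, 0.4, 0.7, 0.9):
    patch = high_propensity_patch(nip, cutoff)
    print(cutoff, sorted(patch), patch_accuracy(patch, experimental))
```

In this toy case the accuracy rises from 75% at a cutoff of 0.0 to 100% at 0.4 and 0.7, and no patch is detected at 0.9, which mirrors the qualitative behavior described for Figure 6.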

High-propensity binding patches in the test set

In order to test the general applicability of the method, we have analyzed the existence of high-propensity binding patches in an independent test set formed by complexes not used in the training set (Table 1; for more details about the test set selection, see Materials and Methods).

As can be seen in Figure 6(b), when the NIP cutoff value was increased, the accuracy (i.e. percentage of correctly predicted residues) of the theoretical binding sites in the test set improved considerably in a fashion similar to that in the training set (expectedly, it increased the number of molecules with no detectable high-propensity residues). As can be seen in Table 4, at NIP cutoff of 0.4 the average accuracy of the high-propensity binding patches (found in 62% of the molecules) is 73%. If antigen molecules are removed from the analysis (see Discussion), the average accuracy of the high-propensity binding patches (NIP cutoff = 0.4) increases to 81% (Figure 6(c)). The results obtained in the test set are thus comparable to the values obtained in the parameterization set.

Table 2. χ² test for the comparison between predicted and experimental interfaces in the interaction chymotrypsin/APPI

| Molecule | Predicted interface | In experimental interface | Not in experimental interface | χ² | P |
|---|---|---|---|---|---|
| Receptor | In predicted interface | 21 (5.3) | 21 (36.7) | 66.4 | <10⁻⁴ |
| Receptor | Not in predicted interface | 5 (20.7) | 158 (125.6) | | |
| Ligand | In predicted interface | 8 (5.8) | 14 (16.2) | 1.9 | 0.17 |
| Ligand | Not in predicted interface | 6 (8.2) | 25 (22.8) | | |

The χ² test is based on the number of residues inside/outside the predicted and experimental interfaces (expected occurrences for a hypothetically random distribution in parentheses).

Figure 5. Maps of the residue interface propensities for some receptor molecules. The 100 lowest-energy docking solutions were used to generate the maps. The color-coding is the same as that in Figure 4. Complexes of the training set: (a) chymotrypsin (PDB 5cha)/APPI (PDB 1aap); (b) cytochrome f (PDB 1ctm)/plastocyanin (PDB 1ag6); (c) subtilisin (PDB 2st1)/SSI (PDB 3ssi). Complexes of the test set: (d) ribonuclease inhibitor/ribonuclease A (PDB 1dfj); (e) FNR/ferredoxin (PDB 1ewy); and (f) IgG1 E8 Fab/cytochrome c (PDB 1wej).

Table 3. Comparison between the predicted and the experimental interfaces

| Complex^a | Receptor coverage^b (%) | Receptor size^c | Receptor P^d | Ligand coverage^b (%) | Ligand size^c | Ligand P^d |
|---|---|---|---|---|---|---|
| Parameterization set | | | | | | |
| 1ca0 | 81 | 1.6 | <10⁻⁴ | 57 | 1.6 | 0.17 |
| 1cbw | 92 | 1.4 | <10⁻⁴ | 64 | 1.9 | 0.16 |
| 1acb | 77 | 1.4 | <10⁻⁴ | 41 | 1.9 | 0.44 |
| 1cho | 89 | 1.4 | <10⁻⁴ | 77 | 1.6 | <0.05 |
| 1cgi | 92 | 1.3 | <10⁻⁴ | 70 | 1.2 | <0.05 |
| 2kai | 54 | 1.3 | <10⁻⁴ | 79 | 2.2 | 0.06 |
| 2sni | 93 | 2.6 | <10⁻⁴ | 53 | 1.4 | 0.09 |
| 2sic | 83 | 2.8 | <10⁻⁴ | 46 | 2.5 | 0.27 |
| 1cse | 96 | 2.9 | <10⁻⁴ | 41 | 1.8 | 0.53 |
| 2tec | 82 | 2.4 | <10⁻⁴ | 45 | 1.5 | 0.90 |
| 1taw | 83 | 1.6 | <10⁻⁴ | 62 | 1.8 | 0.13 |
| 2ptc | 89 | 1.7 | <10⁻⁴ | 36 | 2.0 | 0.16 |
| 3tgi | 29 | 1.8 | 0.78 | 64 | 2.1 | 0.44 |
| 1brc | 80 | 1.8 | <10⁻⁴ | 62 | 1.7 | 0.09 |
| 1fss | 100 | 4.2 | <10⁻⁴ | 83 | 1.2 | <10⁻⁴ |
| 1bvn | 90 | 2.0 | <10⁻⁴ | 86 | 1.4 | <10⁻⁴ |
| 1bgs | 95 | 1.3 | <10⁻⁴ | 94 | 1.3 | <10⁻⁴ |
| 1ay7 | 100 | 1.4 | <10⁻⁴ | 94 | 1.8 | <10⁻⁴ |
| TEM1^e | 59 | 2.3 | 0.07 | 79 | 1.2 | <10⁻⁴ |
| 1ugh | 97 | 2.2 | <10⁻⁴ | 75 | 0.9 | <10⁻⁴ |
| 2pcb | 92 | 3.8 | <10⁻⁴ | 73 | 2.5 | <0.05 |
| 2pcf | 92 | 1.6 | <10⁻⁴ | 73 | 1.1 | <10⁻⁴ |
| 1mlc | 100 | 3.2 | <10⁻⁴ | 29 | 3.2 | 0.11 |
| 1vfb | 100 | 1.9 | <10⁻⁴ | 83 | 3.1 | <0.05 |
| Test set | | | | | | |
| 1ewy | 100 | 1.9 | <10⁻⁴ | 68 | 1.2 | <10⁻⁴ |
| 1eer | 91 | 2.6 | <10⁻⁴ | 75 | 1.4 | <10⁻⁴ |
| 1kkl | 11 | 3.1 | 0.37 | 81 | 1.2 | <10⁻⁴ |
| VP6FAB^f | 91 | 4.5 | <10⁻⁴ | 100 | 4.9 | <10⁻⁴ |
| 1ken | 97 | 3.2 | <10⁻⁴ | 100 | 3.4 | <10⁻⁴ |
| 1kxv | 10 | 2.7 | 0.19 | 83 | 1.9 | <10⁻⁴ |
| 1kxt | 55 | 3.9 | <0.05 | 96 | 1.8 | <10⁻⁴ |
| 1kxq | 100 | 1.9 | <10⁻⁴ | 96 | 1.7 | <10⁻⁴ |
| 1l0x | 0 | 2.8 | <0.05 | 100 | 6.2 | <10⁻⁴ |
| 1avw | 100 | 1.7 | <10⁻⁴ | 94 | 2.6 | <10⁻⁴ |
| 1dfj | 89 | 1.8 | <10⁻⁴ | 80 | 1.2 | <10⁻⁴ |
| 1tgs | 93 | 1.5 | <10⁻⁴ | 82 | 1.2 | <10⁻⁴ |
| 1ahw | 89 | 5.7 | <10⁻⁴ | 0 | 1.9 | <10⁻⁴ |
| 1dqj | 100 | 5.0 | <10⁻⁴ | 39 | 2.7 | 0.68 |
| 1wej | 100 | 4.1 | <10⁻⁴ | 44 | 2.4 | 0.77 |
| 1avz | 42 | 2.6 | 0.49 | 92 | 1.7 | <10⁻⁴ |
| 1wq1 | 33 | 2.9 | 0.35 | 53 | 1.3 | <0.05 |
| 2mta | 93 | 3.1 | <10⁻⁴ | 100 | 1.7 | <10⁻⁴ |
| 1bth | 61 | 2.2 | <10⁻⁴ | 39 | 1.2 | 0.84 |
| 1fin | 68 | 2.4 | <10⁻⁴ | 100 | 1.5 | <10⁻⁴ |
| 1fq1 | 32 | 3.9 | 0.84 | 0 | 2.1 | <10⁻⁴ |

Predicted interfaces were defined as those surface residues with NIP > 0.0 (equation (7) in Materials and Methods).
a The PDB code of the complex is shown here for clarity, but the unbound subunits were used during docking simulations (see Table 1).
b Percentage of residues in the experimental interface predicted correctly.
c Relative size of the predicted interface with respect to the experimental one (i.e. total number of residues in the predicted interface divided by the total number of residues in the experimental interface).
d P-significance values obtained from a χ² test (see Materials and Methods).
e Complex TEM-1 β-lactamase/BLIP; see Table 1 for coordinates.
f Complex VP6/Fab; see Table 1 for coordinates.

Docking non-native ligands

So far, we have generated interface propensity maps starting from the unbound receptor and ligand subunit coordinates of known protein–protein complexes. For test purposes, we have explored the possibility of docking a non-native ligand. To that end, we performed docking simulations using the 3-D coordinates of unbound receptors (chymotrypsin, PDB 5cha; cytochrome f, PDB 1ctm; UDG, PDB 1akz) and a non-native ligand such as lysozyme (PDB 1lza) instead of their native ligands (APPI, plastocyanin and UGI, respectively). The results are quite interesting: the lysozyme docking poses accumulated around the experimental location of the native ligand in some of the receptor molecules. For instance, the predicted receptor interface derived from the chymotrypsin/lysozyme docking simulations (at NIP cutoff = 0.0) covered 73% (P < 10⁻⁴) of the receptor experimental interface (as compared to the value of 81% obtained by using the native APPI ligand; Table 3). Furthermore, a high-propensity binding patch was found on the surface of chymotrypsin (14 residues; 79% accuracy). However, the predicted interface derived from the cytochrome f/lysozyme docking (at NIP cutoff = 0.0) did not show a significant correlation with the experimental interface, and the four-residue high-propensity binding patch found had a low level of accuracy (25%). Lastly, the predicted interface derived from the UDG/lysozyme docking simulations (at NIP cutoff > 0.0) covered 52% (P < 0.05) of the receptor experimental interface (as compared to the value of 97% obtained by using the native UGI ligand; Table 3), but no high-propensity binding patch was found.

Figure 6. Comparison between the theoretical and the experimental binding sites in (a) the training set; (b) the test set; and (c) the same test set after excluding antigens. The average accuracy (percentage of predicted residues located correctly in the native interface; red line) increases proportionately to the NIP cutoff value that is used to define the theoretical binding sites. The blue line shows the percentage of cases in which a predicted binding patch (formed by at least one residue with NIP > cutoff) is detected.

Discussion

Global rigid-body docking: exploring the whole protein surface

In this work we have explored the docking energy landscapes of known protein–protein complexes using ensembles of docked structures generated by a rigid-body pseudo-Brownian global energy optimization procedure. Previously4 we described a two-step docking procedure that allowed sampling of approximately half of the surface of the receptor around the known binding site. We have extended the search to the whole receptor surface. No spatial or biological restrictions were used during simulations, which allowed a complete sampling of the docking landscape around each subunit. A side-chain refinement step was used in our previous work in order to re-evaluate the 400 lowest-energy interfaces in search of the near-native solution. However, the goal here is the analysis of all docking poses for the description of the complete docking landscape. A side-chain refinement of all the thousands of interfaces generated by the rigid-body protocol would have been computationally too expensive, so we have preferred to focus the analysis on the rigid-body docking landscape.

Optimized energy function for protein–protein association

It was observed repeatedly that force-field energy as well as various solvation and electrostatic terms failed to discriminate consistently the near-native solution from false positives in docking simulations. A number of factors contribute to this phenomenon: (i) each of the terms used in the energy functions can be estimated only with considerable uncertainty (e.g. the values of partial charges and internal dielectric constant, depths of the minima for van der Waals interactions, and per-atomic solvation parameters vary up to twofold and sometimes more between various estimates/parameterizations); (ii) van der Waals interactions and, to a lesser extent, electrostatics and other terms, are highly sensitive to the inaccuracies of the atomic coordinates, the situation being exacerbated by the lack of relaxation in the structures produced by the rigid-body docking simulations; and (iii) certain phenomena such as aromatic/aromatic interaction might be largely unaccounted for in the existing force-field parameters.

Table 4. High-propensity binding patches

| Protein^a | Number of residues in patch | Patch accuracy^b (%) |
|---|---|---|
| Parameterization set | | |
| 1ca0 (receptor) | 12 | 83 |
| 1ca0 (ligand) | 2 | 100 |
| 1cbw (receptor) | 14 | 93 |
| 1acb (receptor) | 14 | 86 |
| 1cho (receptor) | 11 | 91 |
| 1cho (ligand) | 2 | 100 |
| 1cgi (receptor) | 15 | 100 |
| 1cgi (ligand) | 1 | 100 |
| 2kai (receptor) | 7 | 57 |
| 2sni (ligand) | 9 | 44 |
| 2sic (ligand) | 2 | 0 |
| 1taw (receptor) | 9 | 100 |
| 1taw (ligand) | 2 | 100 |
| 2ptc (receptor) | 11 | 91 |
| 3tgi (receptor) | 3 | 0 |
| 1brc (receptor) | 3 | 100 |
| 1brc (ligand) | 1 | 100 |
| 1fss (ligand) | 8 | 100 |
| 1bvn (receptor) | 3 | 100 |
| 1bvn (ligand) | 5 | 100 |
| 1bgs (receptor) | 6 | 100 |
| 1bgs (ligand) | 3 | 100 |
| 1ay7 (receptor) | 8 | 100 |
| TEM1 (ligand)^c | 20 | 95 |
| 1ugh (ligand) | 10 | 100 |
| 2pcb (receptor) | 9 | 56 |
| 2pcb (ligand) | 7 | 29 |
| 2pcf (receptor) | 10 | 100 |
| 2pcf (ligand) | 3 | 100 |
| Test set | | |
| 1ewy (receptor) | 11 | 100 |
| 1ewy (ligand) | 7 | 71 |
| 1eer (ligand) | 6 | 83 |
| 1kkl (ligand) | 4 | 100 |
| VP6FAB (ligand)^d | 4 | 100 |
| 1ken (ligand) | 11 | 100 |
| 1kxv (receptor) | 11 | 0 |
| 1kxv (ligand) | 8 | 75 |
| 1kxt (receptor) | 9 | 0 |
| 1kxt (ligand) | 9 | 89 |
| 1kxa (receptor) | 13 | 100 |
| 1l0x (receptor) | 13 | 0 |
| 1avw (receptor) | 7 | 100 |
| 1avw (ligand) | 7 | 86 |
| 1dfj (receptor) | 10 | 90 |
| 1dfj (ligand) | 3 | 100 |
| 1tgs (receptor) | 6 | 100 |
| 1tgs (ligand) | 4 | 100 |
| 1ahw (ligand) | 2 | 0 |
| 1dqj (receptor) | 1 | 100 |
| 1wej (receptor) | 9 | 89 |
| 1avz (receptor) | 2 | 0 |
| 1avz (ligand) | 1 | 100 |
| 2mta (ligand) | 6 | 100 |
| 1bth (ligand) | 1 | 0 |
| 1fin (ligand) | 9 | 100 |

All cases where a high-propensity binding patch was found (as defined by those surface residues with NIP > 0.4; equation (7) in Materials and Methods).
a The PDB code of the complex is shown here for clarity, but the prediction results apply to the unbound subunits listed in Table 1 (receptor or ligand as specified).
b Percentage of predicted residues located correctly in the experimental interface.
c Complex TEM-1 β-lactamase/BLIP; see Table 1 for coordinates.
d Complex VP6/Fab; see Table 1 for coordinates.

These problems may be alleviated by the appropriate weighting of the terms. Possible benefits of weighting are twofold: (i) it can correct for sub-optimal choice of uncertain parameters such as internal dielectric constant, and (ii) it can down-weight the terms that carry larger noise. While the latter may seem non-physical, it can be shown easily that the optimal weight of a term that includes a noise component declines as $1/(1 + \sigma_n^2/\sigma_s^2)$ with the increase of the amplitude of noise ($\sigma_n^2$ is the mean-square deviation of the noise component and $\sigma_s^2$ is that of the signal).
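One way to see this, as a sketch under an assumption made here for illustration and not stated in the text, is to model the computed value of a term as $x = s + n$, where the signal $s$ and the noise $n$ are zero-mean and uncorrelated, and to choose the weight $w$ that minimizes the mean-square error between $wx$ and $s$:

```latex
\begin{align*}
\mathrm{E}\big[(s - w x)^2\big] &= (1 - w)^2 \sigma_s^2 + w^2 \sigma_n^2, \\
\frac{\partial}{\partial w}\,\mathrm{E}\big[(s - w x)^2\big] &= -2(1 - w)\,\sigma_s^2 + 2 w\,\sigma_n^2 = 0, \\
w_{\mathrm{opt}} &= \frac{\sigma_s^2}{\sigma_s^2 + \sigma_n^2} = \frac{1}{1 + \sigma_n^2/\sigma_s^2}.
\end{align*}
```

Under these assumptions the weight approaches 1 for a noise-free term and falls toward 0 as the noise variance grows relative to the signal variance.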

We thus introduced weighting coefficients for van der Waals, electrostatics and hydrogen bonding potentials. Furthermore, we have split the atomic ASA-based solvation term into three components: polar desolvation, aliphatic hydrophobicity and aromatic hydrophobicity. The coefficients for the different energy terms were optimized on a training set using as objective function the ranking of the near-native conformations for a set of 24 protein–protein complexes of known structure, docked from the unbound subunits. After the energy re-evaluation using the optimized parameters, the near-native solutions ranked below 100 in 70% of the cases of the training set (as compared to 33% with the old energy function). In order to exclude the possibility of over-fitting of the parameters on the training set, we explored the effects of the parameter optimization on the ranking of more remote solutions, as well as on the ranking of the near-native solution in a test set formed by complexes not included in the training set. We observed that although the new energy function was optimized to rank as best as possible only a single near-native solution per complex, the energy landscapes in the vicinity of the native structure adopted a more defined funnel-like shape (Figure 1). This suggests that the optimization procedure was not distorted by particular features/inaccuracies of the specific near-native solutions used. On the contrary, the new energy function seems to result in better behavior of the averaged energy for the ensemble of solutions near the native binding sites. Furthermore, we found that the new energy function also improved the docking landscapes of other complexes not included in the initial training set (Figure 1).

It is noteworthy that the optimal weighting factor for the aromatic solvation is considerably larger than those obtained for the rest of the solvation terms. It might indicate that the parameters for water/octanol transfer of aromatic atoms are not optimal to estimate the energy gain of burying aromatic atoms in the (often) non-aliphatic environment of the protein–protein interfaces. As an alternative explanation, the larger weighting value for the solvation of aromatic atoms could be compensating for some specific aromatic interactions underestimated or not considered at all in the rest of the energy function, such as the van der Waals attraction. Indeed, Lomize and co-workers reported recently that the solvation parameter for aromatic carbon should be increased several-fold when no explicit van der Waals interaction is included.34

Validation of the binding-site prediction on a test set of 21 complexes

We used 21 complexes not included in the initial parameterization set as an external validation set. Overall, the performance of the method on the test set was comparable to its performance on the training set.

The first example in the test set is a recently solved electron transfer complex formed by the proteins FNR and Fd. As can be seen in Figure 5(e), the predicted binding sites correlated extraordinarily well with the experimental interfaces, and we found highly accurate high-propensity binding patches in both FNR and Fd molecules (100% and 71% accuracy, respectively; see Table 4). These high-propensity binding areas detected in the surface of both molecules, as can be seen in Figure 7(a) and (b), may be explained by the existence of favorable forces that orient the molecules during the association process to allow an efficient electron transfer between the redox centers, as mutational studies suggest.35 Another interesting example is the interaction between EPO and EPOR. The predicted receptor interaction site (NIP cutoff = 0.0) covered 91% (P < 10⁻⁴) of the experimental one (see Table 3). Interestingly, we found a six-residue high-propensity binding patch in the ligand, located correctly (83% accuracy) in one of the two experimental ligand interaction sites: indeed the one reported to have significantly higher affinity (Figure 7(c)).36

The following examples in the test set came from the CAPRI experiment. The first of the CAPRI targets was the HPr kinase/HPr complex, in which the ligand interface was predicted perfectly (Tables 3 and 4). However, no good prediction was found for the receptor interface, probably because of the large conformational movement of the C-terminal helix of the receptor HPr kinase between the unbound and the complexed states. The following five CAPRI targets were antigen–antibody complexes, in which the structure of the bound antibody was provided in a random orientation. Interestingly, the antibody-binding interfaces were predicted excellently (Tables 3 and 4): high-propensity patches were found in four out of a total of five cases, with an average level of accuracy of 91%. For example, Figure 7(d) shows the interface prediction for the hemagglutinin/Fab complex, in which the 11-residue high-propensity patch found in the Fab ligand was located 100% correctly on the experimental interface. However, the predictions were, in general, poorer for the antigen molecules (see the next sub-section). In the last CAPRI target, the TCR-β/speA complex, a high-propensity patch was found in the receptor, although in a location different from the experimental binding site. Interestingly, as can be seen in Figure 7(e), this predicted binding site was actually located in the previously described interface between TCR-β and TCR-α (PDB 1tcr). In the absence of TCR-α during simulations, it seems that low-energy docking poses accumulated in the area that TCR-β uses for the binding to TCR-α, a “protein-phylic” patch that the method thus recognized.

Figure 7. High-propensity binding sites found in different molecules of the test set: (a) FNR (bound ferredoxin is shown in green); (b) ferredoxin (bound FNR in green); (c) EPO (bound EPObp in green); (d) Fab (bound hemagglutinin in green); (e) TCR-β (bound speA in green; in yellow is the expected position of the TCR-α subunit as in PDB 1tcr); and (f) cyclin (bound CDK2 is shown in green). High-propensity binding residues (NIP > 0.4) are shown in red.

The remaining complexes in the test set were taken from the benchmark independently built by Chen et al.33 After removing those cases already included in our training set, we had a set of 12 complexes divided into four categories (enzyme-inhibitor, antibody–antigen, others, and difficult cases) according to the definition in the original benchmark. It is interesting to compare the results obtained for the different categories. In the three complexes classified as enzyme-inhibitor (PDB codes 1avw, 1dfj, 1tgs), the results were near-perfect: high-propensity binding patches were found in all of them (both in receptor and ligand molecules) with 96% of the predicted residues located correctly at the interfaces. It is interesting that even in a complex such as the ribonuclease A/inhibitor (PDB 1dfj), whose topology is completely different from that of the standard protease-inhibitor, the predictive results were excellent, as can be seen in Figure 5(d). As for the three antibody–antigen complexes (PDB codes 1ahw, 1dqj, 1wej), the results were similar to those obtained in the rest of the antibody–antigen cases: good prediction of binding sites in the antibody surfaces, but no correct predicted sites in the antigen surfaces (see the next sub-section). In the following three complexes, classified in the original benchmark as others (PDB codes 1avz, 1wq1, 2mta), we found high-propensity binding patches in three of the six molecules involved, but one of them (HIV-1 NEF) yielded incorrect predictions (0% accuracy). This is perhaps not an adequate case for a prediction test, as the structure of the unbound receptor (PDB 1avv) is lacking three residues that are essential for binding: thus, it is not surprising that the correct binding patch is not found on this incomplete protein structure. In the last three complexes, described as difficult cases (PDB codes: 1bth, 1fin, 1fq1), the results were slightly worse, as expected: high-propensity binding patches were found in only two of the six molecules involved, and one of them (pancreatic trypsin inhibitor) was not located correctly in the interface. All of these difficult cases either have major backbone movement between the unbound and bound conformations, or are lacking an important number of residues in the structure of the unbound subunits. These challenging conditions push the rigid-body docking approach to the limit. Nevertheless, it is remarkable that, in spite of these difficulties, we can find a high-propensity binding patch in cyclin located correctly in the interface with CDK2, as can be seen in Figure 7(f).

Poor prediction of antibody-binding sites on antigen surfaces

The relatively poor performance of the NIP as a predictor of the protein binding patch for the antigens is perhaps not surprising, since the antibody–antigen system represents a rather special case of protein–protein interaction, where only one partner (the antibody) has actively evolved for optimal binding, while the other partner (the antigen) passively provides a relatively arbitrary interface. For instance, among the CAPRI targets there were three different α-amylase/VHH complexes (PDB codes 1kxv, 1kxt, and 1kxa), where the VHH antibody bound to different areas of the α-amylase surface in each complex. Interestingly, docking simulations found an identical high-propensity α-amylase binding site in all these three cases. This predicted site correlates very well (89% accuracy) with the experimental interface of the α-amylase/VHH Amd9 complex, indeed the one reported to have the strongest affinity amongst the three complexes.37 This brings up the interesting question of whether these antibody molecules preferably bind zones of the antigen surface with higher “protein-phylicity”. A systematic analysis of high-propensity patches in antigen surfaces, beyond the scope of this work, could help to understand immunogenicity properties, which would be extremely useful for antibody design.

Docking a non-native ligand to predict binding interfaces of a receptor

The surprising results obtained when using a non-natural ligand during the docking simulations indicate that, at least for some complexes, the preferred binding areas derived from the interface propensity maps may not depend on the ligand molecule used, and might suggest the existence of non-specific binding areas in the protein surfaces. This would be expected for proteases, which are functionally required to interact with a variety of polypeptides. It would be highly interesting to analyze if this is a general property of protein–protein interfaces or if, on the contrary, this is a phenomenon associated only with certain protein–protein binding mechanisms. A complete analysis of the effect of using different probes on the docking landscape is beyond the scope of this work, and will be studied in a forthcoming publication.

Conclusions

The method described here can be applied immediately to suggest “hot spots” in the subunits of the complexes for experimental testing by mutagenesis. The identification of protein–protein interaction sites on a protein surface can be used to restrict the conformational search during the docking simulations. This would allow the selection of a sub-set of rigid-body docking solutions for further refinement when no experimental information about the binding site is available. A further development would be the introduction of genomics information (conservation of residues in the interfaces), which could lead to more accurate predictions. This methodology could be applied to large datasets of individual structures generated by the structural genomics efforts or by homology modeling methods. Prediction of putative protein interfaces can provide additional evidence for candidate protein pairs detected by other techniques, such as the Rosetta Stone method,38 which could be an important step in building the map of interactions of a given organism at the atomic level.

In conclusion, we have described a new efficient protocol for sampling the complete protein surfaces in protein–protein docking. The ensembles of rigid-body docking solutions for a training set of 24 protein–protein complex structures were used to derive an optimized energy function for protein–protein association. We found that low-energy docking solutions accumulated in the vicinity of the experimental protein–protein interfaces, and devised a graphic way of visualizing the putative binding hot spots by averaging the interfaces of the 100 lowest-energy docking poses. The interface propensity maps generated from the docking poses showed excellent correlation with the experimental protein-binding sites in the training set, and these results were further confirmed on an independent test set of 21 protein–protein complexes. The work presented here indicates the existence of energetically favorable binding areas on the surfaces of proteins, and provides a useful tool to identify protein-binding sites on individual structures.

Materials and Methods

Rigid-body docking simulations: energy function

The rigid-body docking procedure used here is a variation of a previously described method, benchmarked on a set of experimental protein–protein complex structures.4 Global energy optimization was performed by rigid-body sampling of the orientations of the ligand molecule with respect to the receptor molecule (whose position is fixed). The interaction energy was represented by five types of grid potentials,39,40 pre-calculated on a grid as described elsewhere:4,41 the van der Waals potentials for a hydrogen atom probe ($E_{Hvw}$) and for a heavy-atom probe ($E_{Cvw}$); an electrostatic potential where the partial atomic charges of the receptor were corrected by the induced surface solvent charge density to account for the solvent screening effect on the intermolecular pairwise electrostatic interactions ($E_{el}^{solv}$);30 the hydrogen-bonding potential ($E_{hb}$); and a simple hydrophobic potential roughly proportional to the number of hydrophobic atoms of the ligand in contact with the hydrophobic surface of the receptor ($E_{hp}$). Energy balance between the different terms was optimized previously on a training set of known protein–protein complexes.4 The solvation energy ($E_{solv}$) was calculated using an atomic solvent-accessible surface (ASA)-based model42,43 with per-atomic parameters derived44 from experimental vapor–water transfer energies for side-chain analogues45 and additional experimental data for charged solutes.46 This solvation term was added to the total energy to re-evaluate the docking solutions as described:4

$$E = E_{Hvw} + E_{Cvw} + 2.16\,E_{el}^{solv} + 2.53\,E_{hb} + 4.35\,E_{hp} + 0.20\,E_{solv} \qquad (3)$$

The grid potentials generated from the receptor were calculated in a box covering all the receptor atoms plus a margin of 10 Å (Figure 8). An extended box that depended on the size of the ligand was defined. In order to maintain the ligand molecule in the vicinity of the receptor during simulations, an energy penalty was applied outside this extended box. If a ligand atom is inside the map box, its energy is given by the grid potentials. For all ligand atoms inside the extended box, but outside the grid potential box, the energy is zero. For each ligand atom outside the extended box, a penalty of +1 kcal/mol is added to the energy.
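The weighted sum of equation (3) and the three-zone rule for ligand atoms can be sketched as follows. This is a minimal illustration rather than the ICM implementation: the dictionary keys, the axis-aligned box representation and the `grid_energy` callback are assumptions introduced here.

```python
# Weights as given in equation (3); the dictionary keys are illustrative names.
WEIGHTS = {"Hvw": 1.0, "Cvw": 1.0, "el_solv": 2.16, "hb": 2.53, "hp": 4.35, "solv": 0.20}


def total_energy(terms):
    """Weighted sum of the six energy terms (kcal/mol) following equation (3)."""
    return sum(WEIGHTS[name] * terms[name] for name in WEIGHTS)


def inside(point, box):
    """True if a 3-D point lies in an axis-aligned box ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    lower, upper = box
    return all(lo <= p <= hi for p, lo, hi in zip(point, lower, upper))


def atom_energy(point, map_box, extended_box, grid_energy):
    """Per-atom energy following the three-zone rule described in the text."""
    if inside(point, map_box):
        return grid_energy(point)  # inside the map box: grid potentials apply
    if inside(point, extended_box):
        return 0.0                 # between the two boxes: no contribution
    return 1.0                     # outside the extended box: +1 kcal/mol penalty


# Hypothetical boxes (in Å) and a constant stand-in for the grid lookup.
map_box = ((0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
extended_box = ((-5.0, -5.0, -5.0), (15.0, 15.0, 15.0))
constant_grid = lambda point: -0.5

print(atom_energy((5.0, 5.0, 5.0), map_box, extended_box, constant_grid))   # -0.5
print(atom_energy((12.0, 5.0, 5.0), map_box, extended_box, constant_grid))  # 0.0
print(atom_energy((20.0, 5.0, 5.0), map_box, extended_box, constant_grid))  # 1.0
```

A flat penalty per stray atom keeps the ligand near the receptor without requiring grid potentials to be stored for the whole extended box.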

Rigid-body docking simulations: conformational sampling procedure

In order to improve conformational sampling, multiple initial ligand positions were generated around the receptor, by defining a series of points ($i_{Rec} = 1, 2, \ldots, N^{sf}_{rec}$) with an average distance of 15 Å between them, as can be seen in Figure 9(a). These points were generated systematically on the receptor solvent-accessible surface (expanded by 3 Å to overcome minor structural details) by dividing the surface into triangles with average side 15 Å. This distribution of points around the molecule is insensitive to the atomic details of the molecular surface, but reflects the overall shape of the molecule (including sizeable concavities and protuberances), and is more realistic than considering the molecule as a sphere or polyhedron as in our previous work. A similar set of points ($i_{Lig} = 1, 2, \ldots, N^{sf}_{lig}$) was generated around the ligand molecule. Imaginary axes perpendicular to the molecule surface were defined from each receptor and ligand point. For each receptor surface point ($i_{Rec}$), various ligand orientations “aiming” at this point were generated by superimposing every ligand surface point ($i_{Lig}$) onto the receptor point and aligning their respective axes, as can be seen in Figure 9(b). Finally, for each of these orientations, six 60° rotations ($i_{Rot} = 1, \ldots, 6$) were defined around the axis perpendicular to both receptor and ligand surfaces. This operation was repeated for all receptor surface points in order to generate the total number of ligand starting conformations ($6N^{sf}_{lig}N^{sf}_{rec}$, which ranged from 1092 in the ribonuclease Sa/barstar complex to 4488 in the Fab F44.1/lysozyme complex).
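As a bookkeeping illustration of how the starting set is indexed, the following Python sketch enumerates the {iRec, iLig, iRot} triplets and the rotation angles. The function names are hypothetical, and the geometric steps (surface triangulation, superposition and axis alignment) are deliberately left out.

```python
from itertools import product
from typing import List, Tuple


def starting_positions(n_rec: int, n_lig: int, n_rot: int = 6) -> List[Tuple[int, int, int]]:
    """Enumerate every (iRec, iLig, iRot) index triplet, 1-based as in the text."""
    return list(product(range(1, n_rec + 1),
                        range(1, n_lig + 1),
                        range(1, n_rot + 1)))


def rotation_angles(n_rot: int = 6) -> List[float]:
    """Rotation angles (degrees) about the common surface normal: 60° steps for n_rot = 6."""
    return [k * 360.0 / n_rot for k in range(n_rot)]


# Chymotrypsin/APPI example from Figure 9: 23 receptor points and 12 ligand points.
positions = starting_positions(n_rec=23, n_lig=12)
```

For the chymotrypsin/APPI case this enumeration reproduces the 6 × 12 × 23 = 1656 starting positions quoted in the legend to Figure 9.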

For each ligand starting position {$i_{Rec}$, $i_{Lig}$, $i_{Rot}$}, sampling of rotational and translational space of the rigid-body ligand around the receptor molecule was performed by a pseudo-Brownian Monte-Carlo minimization procedure⁴⁷ implemented in the MolSoft ICM 2.8 program.⁴⁸ The minimization procedure consisted of a series of Monte-Carlo random moves of the position of the ligand, each one followed by up to 200 steps of local conjugate-gradient minimization. The simulation terminated after 20,000 energy evaluations. New conformations generated after each random move and local minimization were selected or dropped according to the Metropolis criterion⁴⁹ with a temperature of 5000 K. The final selection of conformations is a variation of a previously described procedure.⁴ A simple parallelization protocol was used to accelerate calculations: all simulations originating from a given receptor surface point ($i_{Rec}$) were run consecutively as a single computational job (two to seven hours on a 667 MHz Alpha processor; four to ten hours on a 700 MHz Pentium III workstation running Linux). Thus, a total number of jobs equal to the number of surface points around the receptor ($N^{sf}_{rec}$) were run in parallel (ranging from 14 to 44 per complex). For each job, all the conformations accumulated after the different MC runs (starting from $6N^{sf}_{lig}$ ligand orientations/rotations) were merged in a single conformational stack,⁵⁰ from which the geometrically similar conformations (RMSD of the ligand rotation and translation positional variables < 2 Å) were removed (see the scheme in Figure 10). Later, the 400 lowest-energy solutions were selected and compressed further so that only the lowest-energy conformations with RMSD for the ligand interface Cα atoms greater than 4 Å were retained. Finally, for each complex, all conformational stacks obtained from the different jobs were merged into a single file, from which the geometrically similar conformations (RMSD of the ligand rotation and translation positional variables < 2 Å) were removed as described above.
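Two elements of this protocol lend themselves to a compact illustration: the Metropolis acceptance test at 5000 K and the removal of geometrically similar conformations from a stack. The Python sketch below is a generic version of both, not the ICM code. It assumes energies in kcal/mol, and it assumes that the lowest-energy member of each group of similar poses is the one kept; the `distance` callback is a hypothetical placeholder for the RMSD measures named in the text.

```python
import math
import random
from typing import Callable, List, Tuple, TypeVar

Pose = TypeVar("Pose")

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def metropolis_accept(e_old: float, e_new: float,
                      temperature: float = 5000.0,
                      rng: Callable[[], float] = random.random) -> bool:
    """Standard Metropolis test: always accept downhill moves, accept uphill
    moves with probability exp(-(E_new - E_old) / RT)."""
    if e_new <= e_old:
        return True
    return rng() < math.exp(-(e_new - e_old) / (R_KCAL * temperature))


def compress_stack(poses: List[Tuple[float, Pose]],
                   distance: Callable[[Pose, Pose], float],
                   cutoff: float) -> List[Tuple[float, Pose]]:
    """Greedy redundancy filter (assumed behaviour): visit poses from lowest to
    highest energy and keep a pose only if it is at least `cutoff` away from
    every pose already kept."""
    kept: List[Tuple[float, Pose]] = []
    for energy, pose in sorted(poses, key=lambda item: item[0]):
        if all(distance(pose, other) >= cutoff for _, other in kept):
            kept.append((energy, pose))
    return kept
```

Under these assumptions, the second compression step would correspond to `compress_stack(stack[:400], interface_ca_rmsd, 4.0)`, where `interface_ca_rmsd` is a hypothetical user-supplied function.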

Solvation term for protein–protein association

The solvation effect upon complex formation was calculated using an atomic solvent-accessible surface (ASA)-based model (i.e. as a sum of per-atomic contributions proportional to the ASA)⁴²,⁴³ with parameters derived from linear fitting to octanol/water transfer energies for N-acetyl amino acid amide derivatives.⁵¹ These per-atomic parameters (atomic solvation parameters, ASPs), listed in Table 5, were calculated for ten classes of atoms in the same fashion as that described for vapor–water transfer parameters.⁴⁴ The solvation energy was calculated separately for the polar, aromatic and aliphatic atoms, and separate weighting factors were assigned to these three components. The ASPs derived from octanol/water transfer energies seem to describe better the burial of solvent-exposed residues in the interface.³¹ Since this solvation model includes the attractive component of the intermolecular van der Waals interactions, the explicit van der Waals calculations are required here only to avoid large intermolecular clashes in the rigid-body interfaces. Because of that, it was not essential to use the correct geometry of the interacting side-chains during van der Waals calculations, and thus we were more confident in the use of the rigid-body approach.

Figure 8. The grid box where the receptor potentials are pre-calculated (small cube) is shown around chymotrypsin, with the van der Waals potential for a heavy-atom probe represented inside. A larger box (defined by extending the grid box by the ligand size: 40.1 Å) where the ligand is allowed to roam freely is shown (large cube). The ligand molecule is shown in an arbitrary position. The energy of the ligand atoms inside the grid box is given by the grid potentials. The energy of the ligand atoms inside the extended box, but outside the grid box, is zero. The penalty energy for each ligand atom outside the extended box is +1.0 kcal/mol.

Table 5. Atomic solvation parameters

| σ (cal/(mol Å²)) | Radius (Å) | Atom type |
|---|---|---|
| 15.1 | 1.95 | C aliphatic |
| 17.7 | 1.8 | C aromatic |
| −17.0 | 1.7 | N uncharged |
| −54.8 | 1.7 | N⁺, Nζ in Lys⁺ |
| −27.3 | 1.7 | Nη1, Nη2 in Arg⁺ |
| −18.5 | 1.6 | O hydroxyl |
| −13.6 | 1.4 | O carbonyl |
| −29.9 | 1.4 | O⁻ in Glu, Asp |
| 11.2 | 2.0 | S in SH |
| 2.2 | 1.85 | S in Met or S–S |
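A minimal Python sketch of an ASA-based solvation term of this kind is given below. Several points are assumptions rather than statements from the text: the parameter values are a reading of Table 5 in which the polar atom types carry negative signs; the assignment of atom types to the polar, aromatic and aliphatic components (in particular for sulfur) is a guess; the default component weights of 1.0 are placeholders, not the optimized weights; and the input is taken to be the change in ASA on binding (bound minus unbound, so burial is negative).

```python
from typing import Dict, Optional

# Atomic solvation parameters in cal/(mol Å^2), as read from Table 5
# (signs of the polar types are an interpretation of the source).
ASP_CAL: Dict[str, float] = {
    "C aliphatic": 15.1,
    "C aromatic": 17.7,
    "N uncharged": -17.0,
    "N+ (Lys NZ)": -54.8,
    "N (Arg NH1/NH2)": -27.3,
    "O hydroxyl": -18.5,
    "O carbonyl": -13.6,
    "O- (Glu/Asp)": -29.9,
    "S in SH": 11.2,
    "S in Met or S-S": 2.2,
}

# ASSUMPTION: grouping of atom types into the three weighted components.
COMPONENT: Dict[str, str] = {
    "C aliphatic": "aliphatic",
    "C aromatic": "aromatic",
    "N uncharged": "polar",
    "N+ (Lys NZ)": "polar",
    "N (Arg NH1/NH2)": "polar",
    "O hydroxyl": "polar",
    "O carbonyl": "polar",
    "O- (Glu/Asp)": "polar",
    "S in SH": "aliphatic",
    "S in Met or S-S": "aliphatic",
}


def solvation_components(delta_asa: Dict[str, float]) -> Dict[str, float]:
    """Per-component solvation energy (kcal/mol) from the summed ASA change
    (Å^2, bound minus unbound) of each atom type."""
    comps = {"polar": 0.0, "aromatic": 0.0, "aliphatic": 0.0}
    for atom_type, d_asa in delta_asa.items():
        comps[COMPONENT[atom_type]] += ASP_CAL[atom_type] * d_asa / 1000.0
    return comps


def solvation_energy(delta_asa: Dict[str, float],
                     weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of the three components; weights default to 1.0 (placeholder)."""
    if weights is None:
        weights = {"polar": 1.0, "aromatic": 1.0, "aliphatic": 1.0}
    comps = solvation_components(delta_asa)
    return sum(weights[name] * value for name, value in comps.items())
```

With this sign convention, burying 100 Å² of aliphatic carbon contributes −1.51 kcal/mol, whereas burying polar surface is penalized.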

Optimization of the energy function

Figure 9. (a) Distribution of the ($i_{Rec} = 1, 2, \ldots, N^{sf}_{rec}$) points around the chymotrypsin receptor (red spheres) that is used to define the starting positions for the docking of the APPI ligand. These points ($N^{sf}_{rec} = 23$) were defined on the receptor solvent-accessible surface expanded by 3 Å (represented by the small dots), with a minimum distance of 15 Å between them. (b) A series of points ($i_{Lig} = 1, 2, \ldots, N^{sf}_{lig}$) around the APPI ligand was defined in the same fashion ($N^{sf}_{lig} = 12$). Each $i_{Rec}$ point around the chymotrypsin is superimposed onto each $i_{Lig}$ point around the APPI in order to define $N^{sf}_{rec}N^{sf}_{lig}$ (276) different positions of the APPI ligand. The ligand is also re-oriented to align the axis perpendicular to the solvent-accessible surfaces of ligand and receptor at the match point. Finally, six ligand orientations are generated for every position by rotating the ligand molecule in 60° increments around that axis. A total of $6N^{sf}_{rec}N^{sf}_{lig}$ (1656) starting positions for the APPI ligand around the chymotrypsin receptor are thus generated (one starting ligand position is shown).

In order to optimize the parameters for protein–protein binding, a minimization process was applied, as described below. The conformations obtained from the rigid-body docking simulations were re-evaluated using an energy function consisting of: (i) the van der Waals, hydrogen bonding and electrostatic potentials calculated from the grid maps; and (ii) the solvation energy calculated separately for the polar, aromatic and aliphatic atoms, according to the original per-atomic parameters (Table 5). Because the explicit solvation term was included, neither the SChEM³⁰ correction for solvent electrostatic screening nor the hydrophobic grid potential (both used during the simulations) were included in the final energy function, defined by equation (1). The weights of the different energy terms (the weight for electrostatics was kept as 1 for a reference) were optimized through the amoeba simplex minimization procedure,⁵² using a training set of 24 protein–protein complexes (Table 1). For each iterative step, the new weights defined a new energy function that was used to re-evaluate the docking solutions and to calculate a new ranking value for the near-native solution (i.e. the one with the lowest RMSD) for each complex. The logarithm of the sum of the ranking values of the 24 near-native solutions (one per complex) was the objective function to optimize (the logarithm function was used here to avoid an excessive influence of the complexes in which the near-native solution was ranked “poorly”). Five independent minimizations were performed, starting from different sets of random weighting values to ensure convergence.
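The objective function used for the weight optimization can be sketched in Python as follows. The data layout (each pose as a tuple of energy terms plus its RMSD to the native complex) and the function names are assumptions for illustration, and the tie-handling in the rank is a choice made here rather than something stated in the text.

```python
import math
from typing import List, Sequence, Tuple

# One docking pose: (energy terms in a fixed order, RMSD to the native complex).
Pose = Tuple[Sequence[float], float]


def weighted_energy(terms: Sequence[float], weights: Sequence[float]) -> float:
    """Linear combination of the energy terms with the trial weights."""
    return sum(w * t for w, t in zip(weights, terms))


def near_native_rank(poses: List[Pose], weights: Sequence[float]) -> int:
    """Rank (1 = best) of the lowest-RMSD pose when all poses are sorted by the
    re-weighted energy. Poses with strictly lower energy count against it."""
    native = min(range(len(poses)), key=lambda k: poses[k][1])
    e_native = weighted_energy(poses[native][0], weights)
    better = sum(1 for terms, _ in poses
                 if weighted_energy(terms, weights) < e_native)
    return 1 + better


def objective(complexes: List[List[Pose]], weights: Sequence[float]) -> float:
    """Logarithm of the summed near-native ranks over the training complexes."""
    return math.log(sum(near_native_rank(poses, weights) for poses in complexes))
```

A function of this form could be passed to any derivative-free minimizer, for instance a Nelder–Mead simplex routine, with the electrostatic weight held fixed at 1 and several random starting points, in the spirit of the five independent minimizations described above.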

Analysis of the distribution of the docking poses: representation of the energy landscape in spherical coordinates

For each docking solution, the center of mass of the ligand was represented in spherical coordinates with respect to the center of mass of the receptor. A θ angle was calculated with respect to the z axis (which was chosen arbitrarily), and a φ angle was calculated from the projection of the center of mass of the given solution onto the xy plane, with respect to the x axis (defined by the center of mass of the ligand in the known complex structure). As can be seen in Figure 3(a), we can represent the spherical coordinates of the different docking poses in a 2-D sinusoidal projection (x = φ sin θ, y = 90 − θ), in which the position of the ligand in the native complex structure (coordinates: {φ = 0°, θ = 90°}) is in the center of the plot. This projection is equal-area, the scale is true along all parallels and central meridian, and we can evaluate the density of docking poses at all zones of the plot (because they correspond with the true values on the spherical surface).
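The coordinate transformation can be written out in a few lines of Python. The sketch assumes that the input vector is the ligand center of mass minus the receptor center of mass, already expressed in a frame whose x axis points at the native ligand position; angles are in degrees, with φ in (−180°, 180°] and θ in [0°, 180°].

```python
import math
from typing import Tuple


def spherical_angles(v: Tuple[float, float, float]) -> Tuple[float, float]:
    """Return (phi, theta) in degrees for a receptor-to-ligand vector v.
    theta is measured from the z axis, phi from the x axis in the xy plane."""
    x, y, z = v
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("ligand and receptor centers of mass coincide")
    theta = math.degrees(math.acos(z / r))
    phi = math.degrees(math.atan2(y, x))
    return phi, theta


def sinusoidal_projection(phi: float, theta: float) -> Tuple[float, float]:
    """Equal-area sinusoidal projection: x = phi * sin(theta), y = 90 - theta."""
    return phi * math.sin(math.radians(theta)), 90.0 - theta
```

A pose located at the native position (φ = 0°, θ = 90°) maps to the origin of the plot, as stated in the text.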

Figure 10. Scheme of the global rigid-body docking procedure. A computational job is launched for each $i_{Rec}$ starting point around the receptor. The resulting conformations are merged and filtered to obtain the final conformational set (for more details, see Materials and Methods).

We generated a continuous energy map from the previously described point values as follows. The projection of the distribution of docking poses was divided into cells generated by a 2-D grid (60 × 30). All 2-D unit cells represented portions of the spherical space with equal area. The energy values of the solutions within a particular grid cell were averaged and assigned to this cell. If a cell had no solutions inside, the highest energy value of all solutions was assigned to that cell. After assigning an energy value to all unit cells, the energy profile was smoothed by averaging the value for each cell over the surrounding 3 × 3 cell region, as can be seen in Figure 3(b).
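The binning and smoothing steps can be sketched in Python as below. The grid extents (x from −180 to 180, y from −90 to 90, giving 6° × 6° cells) and the treatment of cells on the border of the grid, which are averaged over the neighbours that exist rather than wrapped around, are assumptions; the text does not specify either.

```python
from typing import List, Tuple

Grid = List[List[float]]


def energy_map(points: List[Tuple[float, float, float]],
               nx: int = 60, ny: int = 30) -> Grid:
    """Average the energies of projected poses (x, y, energy) in each cell of an
    nx-by-ny grid; empty cells receive the highest energy of all poses."""
    sums = [[0.0] * nx for _ in range(ny)]
    counts = [[0] * nx for _ in range(ny)]
    for x, y, energy in points:
        col = min(int((x + 180.0) / 360.0 * nx), nx - 1)
        row = min(int((y + 90.0) / 180.0 * ny), ny - 1)
        sums[row][col] += energy
        counts[row][col] += 1
    e_max = max(energy for _, _, energy in points)
    return [[sums[r][c] / counts[r][c] if counts[r][c] else e_max
             for c in range(nx)] for r in range(ny)]


def smooth(grid: Grid) -> Grid:
    """Replace each cell by the mean of its 3 x 3 neighbourhood (clipped at edges)."""
    ny, nx = len(grid), len(grid[0])
    out = [[0.0] * nx for _ in range(ny)]
    for r in range(ny):
        for c in range(nx):
            values = [grid[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, ny))
                      for cc in range(max(c - 1, 0), min(c + 2, nx))]
            out[r][c] = sum(values) / len(values)
    return out
```

The smoothed map would then be obtained as `smooth(energy_map(points))`, with `points` holding the projected coordinates and energy of every docking pose.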

Analysis of the distribution of the docking poses: predicted interface propensities

The distribution of docked poses obtained after global docking was projected onto the molecular surface of the receptor as follows. For every solvent-exposed residue of the unbound receptor, the relative surface buried upon binding was calculated for all docking poses and averaged with Boltzmann weighting:

Energy-averaged buried surface (EABS$_i$):

$$
\mathrm{EABS}_i = \frac{1}{N}\sum_{k=1}^{N}\left(\frac{\mathrm{ASA}_i^{\mathrm{Unb}} - \mathrm{ASA}_{ik}^{\mathrm{Bnd}}}{\mathrm{ASA}_i^{\mathrm{Unb}}}\; e^{-(E_k - E_0)/RT}\right) \tag{4}
$$

where $\mathrm{ASA}_i^{\mathrm{Unb}}$ is the solvent-accessible surface area for the receptor residue $i$ before ligand binding; $\mathrm{ASA}_{ik}^{\mathrm{Bnd}}$ is the solvent-accessible surface area for the same residue after ligand binding according to the pose $k$; $E_k$ is the energy value of that pose; $E_0$ is the lowest energy value; $T$ is the temperature of the simulation (typically 5000 K); $R$ is the gas constant (1.987 cal/(mol K)); and $N$ is the total number of docking poses.
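Equation (4) translates directly into a short Python function. The sketch assumes that pose energies are given in kcal/mol (so the gas constant is expressed in kcal/(mol K)) and that the residue has a non-zero unbound ASA; the function name and argument layout are illustrative.

```python
import math
from typing import Sequence

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def energy_averaged_buried_surface(asa_unbound: float,
                                   asa_bound: Sequence[float],
                                   energies: Sequence[float],
                                   temperature: float = 5000.0) -> float:
    """EABS for one residue: Boltzmann-weighted mean over all poses of the
    fraction of its surface buried in each pose (equation (4))."""
    e0 = min(energies)                 # lowest energy among the poses
    rt = R_KCAL * temperature
    total = 0.0
    for asa_b, e_k in zip(asa_bound, energies):
        buried_fraction = (asa_unbound - asa_b) / asa_unbound
        total += buried_fraction * math.exp(-(e_k - e0) / rt)
    return total / len(energies)
```

At 5000 K, RT is close to 10 kcal/mol, so poses within a few kcal/mol of the best one still carry substantial weight.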

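A simplified version of the method was applied as follows. The averaged buried surface was calculated from the 100 lowest-energy solutions, with no energy weighting:

Averaged buried surface (ABS$_i$):

$$
\mathrm{ABS}_i = \frac{1}{100}\sum_{k=1}^{100}\left(\frac{\mathrm{ASA}_i^{\mathrm{Unb}} - \mathrm{ASA}_{ik}^{\mathrm{Bnd}}}{\mathrm{ASA}_i^{\mathrm{Unb}}}\right) \tag{5}
$$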

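From this value, we calculated the residue interface propensity:

Interface propensity (IP$_i$):

$$
\mathrm{IP}_i = \frac{\mathrm{ABS}_i}{\langle \mathrm{ABS} \rangle} = \frac{\mathrm{ABS}_i}{\dfrac{1}{N_{sar}}\displaystyle\sum_{j=1}^{N_{sar}} \mathrm{ABS}_j} \tag{6}
$$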

where $N_{sar}$ is the total number of solvent-accessible receptor residues. An interface propensity IP > 1 will indicate that the residue is found in the docking interfaces more frequently than expected. Larger IP values would indicate a stronger preference for the interface, but their maximum expected values will be different for each complex. Therefore, in order to compare IP values for residues in different complexes, we have to normalize them between the average value (⟨IP⟩ = 1) and the maximum expected value (IP$_{MAX}$):

Normalized interface propensity (NIP$_i$):

$$
\mathrm{NIP}_i = \frac{\mathrm{IP}_i - \langle \mathrm{IP} \rangle}{\mathrm{IP}_{MAX} - \langle \mathrm{IP} \rangle} = \frac{\dfrac{\mathrm{ABS}_i}{\langle \mathrm{ABS} \rangle} - 1}{\dfrac{\mathrm{ABS}_{MAX}}{\langle \mathrm{ABS} \rangle} - 1} = \frac{\mathrm{ABS}_i - \langle \mathrm{ABS} \rangle}{\mathrm{ABS}_{MAX} - \langle \mathrm{ABS} \rangle} \tag{7}
$$

where ABS$_{MAX}$ is the maximum expected ABS value (ABS$_{MAX}$ = 1). The normalized interface propensity (NIP) value for a specific residue will indicate its preference for the interface.
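Equations (5) to (7) chain together naturally, as the following Python sketch shows. The input layout is an assumption: `rbs_by_pose[k][i]` is taken to hold the relative buried surface of residue i in pose k, with poses already sorted from lowest to highest energy. If fewer than 100 poses are supplied, the sketch averages over those available, which departs from the fixed divisor of equation (5).

```python
from typing import List, Sequence


def averaged_buried_surface(rbs_by_pose: Sequence[Sequence[float]],
                            n_best: int = 100) -> List[float]:
    """ABS_i: unweighted mean of the relative buried surface of each residue
    over the n_best lowest-energy poses (equation (5))."""
    best = rbs_by_pose[:n_best]
    n_res = len(best[0])
    return [sum(pose[i] for pose in best) / len(best) for i in range(n_res)]


def interface_propensity(abs_values: Sequence[float]) -> List[float]:
    """IP_i = ABS_i / <ABS>, averaging over all solvent-accessible residues
    (equation (6))."""
    mean_abs = sum(abs_values) / len(abs_values)
    return [a / mean_abs for a in abs_values]


def normalized_interface_propensity(abs_values: Sequence[float],
                                    abs_max: float = 1.0) -> List[float]:
    """NIP_i = (ABS_i - <ABS>) / (ABS_MAX - <ABS>) with ABS_MAX = 1
    (equation (7))."""
    mean_abs = sum(abs_values) / len(abs_values)
    return [(a - mean_abs) / (abs_max - mean_abs) for a in abs_values]
```

By construction, a residue with ABS equal to the average has NIP = 0, a residue fully buried in every selected pose has NIP = 1, and residues buried less often than average take negative values.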

Definition of experimental interface

In order to compare predicted and experimental interfaces, the latter were defined as the residues in the native complex structure with relative surface buried upon binding RBS > 0.1:

Relative buried surface (RBS$_i$):

$$
\mathrm{RBS}_i = \frac{\mathrm{ASA}_i^{\mathrm{Unb}} - \mathrm{ASA}_i^{\mathrm{Bnd}}}{\mathrm{ASA}_i^{\mathrm{Unb}}} \tag{8}
$$
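As a small illustration of this definition, the Python sketch below selects interface residues from per-residue ASA values. The dictionary layout and residue labels are hypothetical, and residues with zero unbound ASA are skipped on the assumption that only solvent-exposed residues are considered.

```python
from typing import Dict, List


def relative_buried_surface(asa_unbound: float, asa_bound: float) -> float:
    """RBS for one residue (equation (8))."""
    return (asa_unbound - asa_bound) / asa_unbound


def experimental_interface(asa_unbound: Dict[str, float],
                           asa_bound: Dict[str, float],
                           threshold: float = 0.1) -> List[str]:
    """Residues whose relative buried surface in the native complex exceeds
    the threshold (0.1 in the text)."""
    interface = []
    for residue, asa_u in asa_unbound.items():
        if asa_u <= 0.0:
            continue  # fully buried in the unbound form: not a surface residue
        if relative_buried_surface(asa_u, asa_bound[residue]) > threshold:
            interface.append(residue)
    return interface
```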

Statistical significance of the predictions

In order to determine if the global occurrence of the predicted interfaces differs significantly from a random distribution, we evaluated the predicted coverage values by a χ² test. For each complex, the distribution of the predicted interface residues was compared to the distribution of the residues in the real interface. A contingency table can be built for the comparison of these two distributions, by computing the number of residues in the predicted interface that are in the real interface, the number of residues in the predicted interface that are not in the real interface, and the number of residues not predicted to be in the interface that are actually in the real interface (as in Table 2). The P significance obtained from applying the χ² test (one degree of freedom) to this contingency table indicates the probability that the distribution of the predicted interfaces with respect to the experimental ones differs from a random distribution. This is used here to evaluate if the predicted coverage values are statistically significant, or if, on the contrary, they could have been obtained by chance alone.
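A χ² test on a 2 × 2 contingency table with one degree of freedom can be computed with the Python standard library alone, as sketched below. Two assumptions are made: the fourth cell of the table (residues neither predicted nor in the real interface) is taken to be available from the total number of surface residues, and no continuity correction is applied, since the text does not mention one.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table
           in interface   not in interface
    predicted       a              b
    not predicted   c              d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denom


def p_value_1dof(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with one degree of
    freedom: P(X > chi2) = erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2.0))
```

For example, a statistic of about 3.84 corresponds to P = 0.05, the conventional significance threshold.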

Selection of protein–protein complexes in the test set

The method presented here has been validated on a test set formed by recently solved complexes not included in the training of the energy function.

The first test case was the electron transfer complex formed by the proteins FNR and Fd, which play an essential role in photosynthesis.⁵³ The X-ray structure of the complex between the oxidized FNR and Fd from Anabaena PCC7119 has been reported (PDB 1ewy),⁵⁴ and suggests a highly probable conformational state of the electron transfer between these two proteins. The structures of the unbound FNR (PDB 1que) and Fd (PDB 1fxa) from Anabaena PCC7119 were used to generate the interface propensity maps as described in previous sections. The second test case was the interaction between EPO and EPOR. The structure of EPO is known, both in its unbound form (PDB 1buy) and in complex with EPObp, the extracellular ligand binding domain of EPOR (PDB 1eer).⁵⁵ The structure of the unbound dimerized EPObp has been released recently (PDB 1ern). Since each EPObp monomer binds to different areas of EPO, and no contact between EPObp monomers is present in the complex structure, only one of the EPObp monomers was used during simulations.

A further seven complexes were taken from the recent CAPRI competition (Table 1). The coordinates of the unbound subunits were provided by the organizers, but the structures of the protein–protein complexes were not released until all participants had submitted their models. Therefore, the rigid-body docking simulations used here to obtain the interface propensity maps were run in the complete absence of information about the final solution⁵⁶ and constitute a fully blind validation of the method.

Twelve more complexes were taken from the unbound–unbound test cases in the benchmark compiled by Chen et al.³³ The benchmark is formed by 31 unbound–unbound test cases, but we selected only complexes that were significantly different from those we used in our training set. The final selection happened to be formed by 12 complexes, which included three different cases of each one of the four categories in which the benchmark is divided: enzyme-inhibitor, antibody–antigen, others, and difficult cases (Table 1). Thus, this selection constitutes a genuine validating test that effectively covers a broad variety of protein–protein complexes.

Acknowledgements

This work was supported by the Department of Energy (DE-FG03-00ER63042).

References

1. Xenarios, I. & Eisenberg, D. (2001). Protein interaction databases. Curr. Opin. Biotechnol. 12, 334–339.

2. Camacho, C. J. & Vajda, S. (2002). Protein-protein association kinetics and protein docking. Curr. Opin. Struct. Biol. 12, 36–40.

3. Elcock, A. H., Sept, D. & McCammon, J. A. (2001). Computer simulation of protein–protein interactions. J. Phys. Chem. ser. B, 105, 1504–1518.

4. Fernandez-Recio, J., Totrov, M. & Abagyan, R. (2002). Soft protein–protein docking in internal coordinates. Protein Sci. 11, 280–291.

5. Smith, G. R. & Sternberg, M. J. (2002). Prediction of protein–protein interactions by docking methods. Curr. Opin. Struct. Biol. 12, 28–35.

6. Sternberg, M. J., Gabb, H. A. & Jackson, R. M. (1998). Predictive docking of protein–protein and protein–DNA complexes. Curr. Opin. Struct. Biol. 8, 250–256.

7. Wodak, S. J. & Janin, J. (1978). Computer analysis of protein–protein interaction. J. Mol. Biol. 124, 323–342.

8. Bogan, A. A. & Thorn, K. S. (1998). Anatomy of hot spots in protein interfaces. J. Mol. Biol. 280, 1–9.

9. Clackson, T., Ultsch, M. H., Wells, J. A. & de Vos, A. M. (1998). Structural and functional analysis of the 1:1 growth hormone:receptor complex reveals the molecular basis for receptor affinity. J. Mol. Biol. 277, 1111–1128.

10. Clackson, T. & Wells, J. A. (1995). A hot spot of binding energy in a hormone–receptor interface. Science, 267, 383–386.

11. Hu, Z., Ma, B., Wolfson, H. & Nussinov, R. (2000). Conservation of polar residues as hot spots at protein interfaces. Proteins: Struct. Funct. Genet. 39, 331–342.

12. Novotny, J., Bruccoleri, R. E. & Saul, F. A. (1989). On the attribution of binding energy in antigen–antibody complexes McPC 603, D1.3, and HyHEL-5. Biochemistry, 28, 4735–4749.

13. Argos, P. (1988). An investigation of protein subunit and domain interfaces. Protein Eng. 2, 101–113.

14. Glaser, F., Steinberg, D. M., Vakser, I. A. & Ben-Tal, N. (2001). Residue frequencies and pairing preferences at protein–protein interfaces. Proteins: Struct. Funct. Genet. 43, 89–102.

15. Janin, J., Miller, S. & Chothia, C. (1988). Surface, subunit interfaces and interior of oligomeric proteins. J. Mol. Biol. 204, 155–164.

16. Jones, S. & Thornton, J. M. (1995). Protein–protein interactions: a review of protein dimer structures. Prog. Biophys. Mol. Biol. 63, 31–65.

17. Jones, S. & Thornton, J. M. (1996). Principles of protein–protein interactions. Proc. Natl Acad. Sci. USA, 93, 13–20.

18. Jones, S. & Thornton, J. M. (1997). Analysis of protein–protein interaction sites using surface patches. J. Mol. Biol. 272, 121–132.

19. Young, L., Jernigan, R. L. & Covell, D. G. (1994). A role for surface hydrophobicity in protein–protein recognition. Protein Sci. 3, 717–729.

20. Chothia, C. & Janin, J. (1975). Principles of protein–protein recognition. Nature, 256, 705–708.

21. Conte, L. L., Chothia, C. & Janin, J. (1999). The atomic structure of protein–protein recognition sites. J. Mol. Biol. 285, 2177–2198.

22. Janin, J. & Chothia, C. (1990). The structure of protein–protein recognition sites. J. Biol. Chem. 265, 16027–16030.

23. Sheinerman, F. B. & Honig, B. (2002). On the role of electrostatic interactions in the design of protein–protein interfaces. J. Mol. Biol. 318, 161–177.

24. Jones, S. & Thornton, J. M. (1997). Prediction of protein–protein interaction sites using patch analysis. J. Mol. Biol. 272, 133–143.

25. Zhou, H. X. & Shan, Y. (2001). Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins: Struct. Funct. Genet. 44, 336–343.

26. Gabdoulline, R. R. & Wade, R. C. (2001). Protein–protein association: investigation of factors influencing association rates by Brownian dynamics simulations. J. Mol. Biol. 306, 1139–1155.

27. Janin, J. (1997). The kinetics of protein–protein recognition. Proteins: Struct. Funct. Genet. 28, 153–161.

28. Norel, R., Petrey, D., Wolfson, H. J. & Nussinov, R. (1999). Examination of shape complementarity in docking of unbound proteins. Proteins: Struct. Funct. Genet. 36, 307–317.

29. Vakser, I. A., Matar, O. G. & Lam, C. F. (1999). A systematic study of low-resolution recognition in protein–protein complexes. Proc. Natl Acad. Sci. USA, 96, 8477–8482.

30. Fernandez-Recio, J., Totrov, M. & Abagyan, R. (2002). Screened charge electrostatic model in protein–protein docking simulations. Pac. Symp. Biocomput. 7, 552–565.

31. Vajda, S., Weng, Z., Rosenfeld, R. & DeLisi, C. (1994). Effect of conformational flexibility and solvation on receptor–ligand binding free energies. Biochemistry, 33, 13977–13988.

32. Pickersgill, R. W. (1988). A rapid method of calculating charge–charge interaction energies in proteins. Protein Eng. 2, 247–248.

33. Chen, R., Mintseris, J., Janin, J. & Weng, Z. (2003). A protein–protein docking benchmark. Proteins: Struct. Funct. Genet. 52, 88–91.

34. Lomize, A. L., Riebarkh, M. Y. & Pogozheva, I. D. (2002). Interatomic potentials and solvation parameters from protein engineering data for buried residues. Protein Sci. 11, 1984–2000.

35. Hurley, J. K., Hazzard, J. T., Martinez-Julvez, M., Medina, M., Gomez-Moreno, C. & Tollin, G. (1999). Electrostatic forces involved in orienting Anabaena ferredoxin during binding to Anabaena ferredoxin:NADP+ reductase: site-specific mutagenesis, transient kinetic measurements, and electrostatic surface potentials. Protein Sci. 8, 1614–1622.

36. Philo, J. S., Aoki, K. H., Arakawa, T., Narhi, L. O. & Wen, J. (1996). Dimerization of the extracellular domain of the erythropoietin (EPO) receptor by EPO: one high-affinity and one low-affinity interaction. Biochemistry, 35, 1681–1691.

37. Lauwereys, M., Arbabi Ghahroudi, M., Desmyter, A., Kinne, J., Holzer, W., De Genst, E. et al. (1998). Potent enzyme inhibitors derived from dromedary heavy-chain antibodies. EMBO J. 17, 3512–3520.

38. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. & Eisenberg, D. (1999). Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.

39. Goodford, P. J. (1985). A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J. Med. Chem. 28, 849–857.

40. Totrov, M. & Abagyan, R. (1997). Flexible protein–ligand docking by global energy optimization in internal coordinates. Proteins: Struct. Funct. Genet. Suppl, 1, 215–220.

41. Totrov, M. & Abagyan, R. (2001). Protein–ligand docking as an energy optimization problem. In Drug-Receptor Thermodynamics: Introduction and Applications (Raffa, R. B., ed.), 1st edit., pp. 603–624, Wiley, New York.

42. Eisenberg, D. & McLachlan, A. D. (1986). Solvation energy in protein folding and binding. Nature, 319, 199–203.

43. Wesson, L. & Eisenberg, D. (1992). Atomic solvation parameters applied to molecular dynamics of proteins in solution. Protein Sci. 1, 227–235.

44. Abagyan, R. (1997). Protein structure prediction by global energy optimization. Computer Simulations of Biomolecular Systems (van Gunsteren, W. F., et al., eds), vol. 3, pp. 363–394, Kluwer Academic Publisher, Dordrecht.

45. Wolfenden, R., Andersson, L., Cullis, P. M. & Southgate, C. C. (1981). Affinities of amino acid side chains for solvent water. Biochemistry, 20, 849–855.

46. Pearson, R. G. (1986). Ionization potentials and electron affinities in aqueous solution. J. Am. Chem. Soc. 108, 6109–6114.

47. Abagyan, R., Totrov, M. & Kuznetsov, D. (1994). ICM: a new method for structure modeling and design: applications to docking and structure prediction from the distorted native conformation. J. Comput. Chem. 15, 488–506.

48. MolSoft (2000). ICM 2.8 Program Manual, MolSoft LLC, San Diego.

49. Metropolis, N. A., Rosenbluth, A. W., Rosenbluth, N. M., Teller, A. H. & Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.

50. Abagyan, R. & Argos, P. (1992). Optimal protocol and trajectory visualization for conformational searches of peptides and proteins. J. Mol. Biol. 225, 519–532.

51. Fauchere, J. L. & Pliska, V. (1983). Hydrophobic parameters-pi of amino-acid side-chains from the partitioning of N-acetyl-amino-acid amides. Eur. J. Med. Chem. 18, 369–375.

52. Nelder, J. A. & Mead, R. (1965). A simplex method for function minimization. Comput. J. 7, 308–313.

53. Batie, C. J. & Kamin, H. (1984). Electron transfer by ferredoxin:NADP+ reductase. Rapid-reaction evidence for participation of a ternary complex. J. Biol. Chem. 259, 11976–11985.

54. Morales, R., Charon, M. H., Kachalova, G., Serre, L., Medina, M., Gomez-Moreno, C. & Frey, M. (2000). A redox-dependent interaction between two electron-transfer partners involved in photosynthesis. EMBO Rep. 1, 271–276.

55. Syed, R. S., Reid, S. W., Li, C., Cheetham, J. C., Aoki, K. H., Liu, B. et al. (1998). Efficiency of signalling through cytokine receptors depends critically on receptor orientation. Nature, 395, 511–516.

56. Fernandez-Recio, J., Totrov, M. & Abagyan, R. (2003). ICM-DISCO docking by global energy optimization with fully flexible side-chains. Proteins: Struct. Funct. Genet. 52, 113–117.

57. Strynadka, N. C., Adachi, H., Jensen, S. E., Johns, K., Sielecki, A., Betzel, C. et al. (1992). Molecular structure of the acyl-enzyme intermediate in beta-lactam hydrolysis at 1.7 Å resolution. Nature, 359, 700–705.

58. Strynadka, N. C., Jensen, S. E., Alzari, P. M. & James, M. N. (1996). A potent new mode of beta-lactamase inhibition revealed by the 1.7 Å X-ray crystallographic structure of the TEM-1-BLIP complex. Nature Struct. Biol. 3, 290–297.

59. Strynadka, N. C., Jensen, S. E., Johns, K., Blanchard, H., Page, M., Matagne, A. et al. (1994). Structural and kinetic characterization of a beta-lactamase-inhibitor protein. Nature, 368, 657–660.

Edited by J. Thornton

(Received 16 December 2002; received in revised form 21 October 2003; accepted 29 October 2003)
