
Geometric and Chemical Patterns of Interaction in Protein–Ligand Complexes and Their Application in Docking

Ernesto Moreno* and Kalet León

Center of Molecular Immunology, Havana, Cuba

ABSTRACT We present a new method for representing the binding site of a protein receptor that allows the use of the DOCK approach to screen large ensembles of receptor conformations for ligand binding. The site points are constructed from templates of what we called “attached points” (ATPTS). Each template (one for each type of amino acid) is composed of a set of representative points that are attached to side-chain and backbone atoms through internal coordinates, carry chemical information about their parent atoms, and are intended to cover positions that might be occupied by ligand atoms when complexed to the protein. This method is completely automatic and proved to be extremely fast. With the aim of obtaining an experimental basis for this approach, the Protein Data Bank was searched for proteins in complex with small molecules, to study the geometry of the interactions between the different types of protein residues and the different types of ligand atoms. As a result, well-defined patterns of interaction were obtained for most amino acids. These patterns were then used for constructing a set of templates of attached points, which constitute the core of the ATPTS approach. The quality of the ATPTS representation was demonstrated by using this method, in combination with the DOCK matching and orientation algorithms, to generate correct ligand orientations for >1000 protein–ligand complexes. Proteins 2002;47:1–13. © 2002 Wiley-Liss, Inc.

Key words: protein–ligand interactions; docking; binding site representation; negative image

INTRODUCTION

The docking of two molecules [a ligand and a receptor (usually a protein)] requires some type of characterization of the receptor-binding site. Several different approaches that make use of geometrical and chemical features of the molecular surface at the binding region have been developed for this purpose [1–3]. The program DOCK [1,4], in particular, uses a negative image of the protein-binding site based on spheres that fill cavities in that site and carry a limited amount of chemical information in the form of colors. Recent articles describe new methods for generating other types of site points, instead of the original spheres, to be used for DOCKing [5,6]. In the DOCK approach, the receptor, whose structure usually is taken from X-ray crystallography, is treated as rigid in docking runs to screen large databases of small molecules for binding. Because there is only one receptor conformation, time may be spent in carefully characterizing the binding site. Thus, the generation of the sphere representation goes through several steps [1], some of which require or benefit from direct user intervention.

The demand on CPU time and user intervention, however, may become critical in other cases. Thus, in a recent work (manuscript) dealing with the flexibility of the protein receptor in the docking of a single ligand, we used the DOCK approach in a different or “inverted” way, that is, applying it to test a large number of receptor conformations for binding of one molecule. The key element that made possible the serial processing of thousands of protein conformations was the introduction of a new method, both fast and automatic, for representing the binding site of the protein receptor in place of the traditional DOCK spheres.

The new method consists of constructing the site points from templates of points linked to protein residues. Each template (one for each type of amino acid) is composed of a set of points intended to be located in positions that are suitable for interactions with ligand atoms. These points are attached to protein atoms through internal coordinates, and because of this they were termed “attached points” (ATPTS).

To give sound experimental support to this approach, we analyzed the available crystal data of protein–small molecule complexes to extract information about the geometry of the interactions between different types of protein residues and different types of ligand atoms. The purpose was to determine which relative positions ligand atoms may occupy with respect to amino acid side chains and backbone atoms, and whether there are patterns of interaction that could be represented by networks of attached points.

Different studies of intra- and intermolecular patterns of interactions at the atomic level have been reported, with different purposes. For example, contact distances between amino acid atoms in proteins have been investigated to derive knowledge-based mean force potentials for use in fold recognition [7], and intermolecular contacts have been explored to develop empirical free energy functions for macromolecular docking [8]. In another study, a large number of intermolecular interactions present in the Cambridge Structural Database [9] and in the Protein Data Bank (PDB) [10] were compiled in a database called IsoStar [11]. This database includes collections of interatomic contacts for the main functional groups that comprise the different protein amino acids. We decided, however, to use the PDB as a direct source of experimental data, to be able to tailor and focus the study on the practical purpose that motivated it.

*Correspondence to: Ernesto Moreno, Center of Molecular Immunology, P.O. Box 16040, Havana 11600, Cuba. E-mail: emoreno@ict.cim.sld.cu

Received 6 March 2001; Accepted 21 August 2001

Published online 00 Month 2001

PROTEINS: Structure, Function, and Genetics 47:1–13 (2002)
DOI 10.1002/prot.10026

© 2002 WILEY-LISS, INC.

Here we present the basis of the ATPTS method, report the findings from an extensive search for protein–ligand atomic contacts in hundreds of PDB complexes, and describe the translation of the patterns of interactions obtained into templates of attached points. Finally, we validate the application of the attached points approach in docking by using the obtained templates to correctly orient >1000 ligands in their binding sites.

MATERIALS AND METHODS

Construction of a Binding Site Representation From Templates of Attached Points

A new, totally automated method to represent protein-binding sites was created to allow fast generation of site points for a given receptor conformation. For each type of amino acid, a template of points was constructed in which each point was attached to a parent protein atom and to two other conveniently situated atoms using internal coordinates, as illustrated in Figure 1. These attached points carry chemical information (in the form of colors) about their parent protein atoms and, as with the DOCK spheres, are intended to cover positions that might be occupied by atoms of a ligand complexed to the protein. An example of what entries in a template definition look like is provided in Table I.
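The placement of a single attached point from its internal coordinates can be sketched as follows. This is a minimal illustration, not the authors' Fortran code: the function names, the torsion sign convention (the common IUPAC-style one), and the arginine atom coordinates are all assumptions made for the example. Only the entry values (R = 2.8, angle 162, torsion −82, reference atoms NE, CZ, NH1) are taken from Table I.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)

def place_point(atom3, atom2, atom1, r, theta_deg, phi_deg):
    """Cartesian position of a point at distance r from atom1, with angle
    point-atom1-atom2 = theta and torsion point-atom1-atom2-atom3 = phi.
    Atoms 3, 2, 1 must not be collinear."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    axis = unit(sub(atom1, atom2))                  # direction atom2 -> atom1
    normal = unit(cross(sub(atom2, atom3), axis))   # normal to the plane of atoms 3, 2, 1
    in_plane = cross(normal, axis)                  # completes the local orthonormal frame
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return tuple(atom1[i] + local[0] * axis[i] + local[1] * in_plane[i]
                 + local[2] * normal[i] for i in range(3))

def angle_deg(p, a, b):
    """Angle p-a-b in degrees."""
    c = dot(unit(sub(p, a)), unit(sub(b, a)))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def torsion_deg(a, b, c, d):
    """Torsion a-b-c-d in degrees."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(dot(cross(n1, n2), unit(b2)), dot(n1, n2)))

# Illustrative (made-up) coordinates for part of a planar guanidino group.
ne, cz, nh1 = (0.0, 0.0, 0.0), (1.33, 0.0, 0.0), (1.995, 1.152, 0.0)

# Table I, point 6: reference atoms 3, 2, 1 = NE, CZ, NH1; R = 2.8, 162, -82.
point = place_point(ne, cz, nh1, 2.8, 162.0, -82.0)
```

Recomputing the distance, angle, and torsion from the resulting coordinates is a convenient check that the conversion and the chosen sign convention are self-consistent.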

The data needed for generating an ATPTS representation are the templates and a list of the amino acids located in the binding site region (their PDB residue identifiers).

The process of constructing the site points for a given receptor coordinate set is itself very simple, involving only a few steps. First, following the ATPTS templates, the internal coordinates of all the attached points associated with the protein residues in the input list are translated to Cartesian coordinates. Afterward, points bumping into protein atoms are removed, leaving only a layer of characteristic points near the binding site surface. Values around 2.0 and 2.8 Å are used as the minimal allowed (bumping) distances from the attached points to polar and nonpolar atoms, respectively (see below). Finally, geometric constraints are applied to eliminate redundant or unnecessary points. For example, points that are closer to each other than a given close-distance cutoff are merged into a single point (by averaging their coordinates), but this is done only for points having the same color.
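The bump-removal and merging steps can be sketched as below. The text does not specify the merging procedure or the value of the close-distance cutoff, so the greedy single-pass clustering and the 1.0 Å cutoff used here are assumptions, as are the data layout and function names. The 2.0 and 2.8 Å bump distances are the values quoted above.

```python
import math

def remove_bumping(points, protein_atoms, polar_cut=2.0, nonpolar_cut=2.8):
    """Drop attached points that lie too close to any protein atom."""
    kept = []
    for point in points:
        clashes = any(
            math.dist(point["xyz"], atom["xyz"])
            < (polar_cut if atom["polar"] else nonpolar_cut)
            for atom in protein_atoms
        )
        if not clashes:
            kept.append(point)
    return kept

def centroid(coords):
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def merge_close(points, cutoff):
    """Greedy merge (an assumed procedure): a point joins the first cluster of
    the same color whose current centroid is closer than the cutoff."""
    clusters = []
    for point in points:
        for cluster in clusters:
            if (cluster["color"] == point["color"]
                    and math.dist(centroid(cluster["members"]), point["xyz"]) < cutoff):
                cluster["members"].append(point["xyz"])
                break
        else:
            clusters.append({"color": point["color"], "members": [point["xyz"]]})
    return [{"xyz": centroid(c["members"]), "color": c["color"]} for c in clusters]

# Toy data: one polar and one nonpolar protein atom, five candidate points.
protein = [{"xyz": (0.0, 0.0, 0.0), "polar": True},
           {"xyz": (10.0, 0.0, 0.0), "polar": False}]
candidates = [
    {"xyz": (1.5, 0.0, 0.0), "color": 1},   # 1.5 A from the polar atom: removed
    {"xyz": (2.5, 0.0, 0.0), "color": 1},   # kept
    {"xyz": (7.5, 0.0, 0.0), "color": 6},   # 2.5 A from the nonpolar atom: removed
    {"xyz": (3.0, 0.0, 0.0), "color": 1},   # kept, merges with the point at 2.5
    {"xyz": (2.75, 0.0, 0.2), "color": 3},  # kept, different color so never merged
]
kept = remove_bumping(candidates, protein)
site_points = merge_close(kept, cutoff=1.0)  # hypothetical cutoff value
```

A greedy pass depends on the order of the points; a production implementation might prefer a proper clustering step, but the color restriction would apply in the same way.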

Selection and Filtering of the Experimental Data

The Protein Data Bank (release of January 1999) was searched for proteins in complex with small molecules (contained in the heteroatom records of PDB files) by using the “SearchFields” browser implemented in the PDB web site [12]. We searched for entries containing structures of proteins (but not nucleic acids) determined by X-ray diffraction and having the string “complex” in the title record (“TITLE”). A total of 1014 entries were retrieved. Some of these structures contained complexes of proteins with other proteins or peptides and no heteroatoms, except water oxygens in some cases. These entries were eliminated automatically in a filtering process, as described in the following.

The heteroatom records (lines starting with the keyword “HETATM”) in the retrieved PDB files were processed, first, to eliminate water oxygens. Then, the remaining heteroatoms were grouped by their connectivity into fragments or molecules (two atoms were considered to be bonded if they lay within 2 Å of each other). Ligand molecules having fewer than 10 atoms were removed, as well as those fragments that were covalently bonded to the protein. Afterward, we eliminated unnecessary redundancies, such as repeated complexes at different resolutions and repeated copies of the same ligand within an entry. In most cases, these copies are found for proteins having several identical binding sites, for instance, the cholera toxin B pentamer [13] (entry 3chb), or for entries containing copies of the same complex in the asymmetric unit of the crystal. In either case, the protein–ligand contact patterns displayed by the different ligand copies are almost identical. On the other hand, there were several small groups of entries containing the same protein in complex with different ligands, or mutants of the same protein in complex with the same or different ligands. We did not eliminate any of these complexes because each of them might contribute different protein–ligand contacts to this study. For instance, different versions of the same protein very often contain mutations in the binding site region, which most likely result in distinct new contacts with the ligand.
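The grouping of heteroatoms into molecules by connectivity can be sketched with a small union-find structure. The function names and the toy coordinates are assumptions made for illustration, and the strict "less than 2 Å" comparison is one reading of the bonding criterion. Water removal and the check for covalent attachment to the protein are omitted.

```python
import math

def group_by_connectivity(coords, bond_cutoff=2.0):
    """Return lists of atom indices, one list per connected fragment.
    Two atoms are treated as bonded if they are closer than bond_cutoff."""
    parent = list(range(len(coords)))

    def find(i):
        # Walk up to the root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < bond_cutoff:
                parent[find(i)] = find(j)

    fragments = {}
    for i in range(len(coords)):
        fragments.setdefault(find(i), []).append(i)
    return list(fragments.values())

def select_ligands(coords, min_atoms=10):
    """Keep only fragments with at least min_atoms atoms."""
    return [f for f in group_by_connectivity(coords) if len(f) >= min_atoms]

# Toy example: a 10-atom chain with 1.5 A spacing and a distant 3-atom fragment.
chain = [(1.5 * i, 0.0, 0.0) for i in range(10)]
small = [(50.0 + 1.2 * i, 0.0, 0.0) for i in range(3)]
hetero_coords = chain + small

fragments = group_by_connectivity(hetero_coords)
ligands = select_ligands(hetero_coords)
```

The pairwise loop is quadratic in the number of heteroatoms, which is acceptable for the small HETATM sets of a single PDB entry.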

Fig. 1. The position of an attached point is defined with respect to three amino acid atoms using internal coordinates: the distance R from the point to its parent atom (atom 1), the angle θ formed by the attached point and atoms 1 and 2, and the torsion angle φ formed by the attached point and atoms 1, 2, and 3. The parent atom defines the chemical property or color carried by the attached point.


When this filtering process was concluded, a total of 440 PDB entries remained, and 516 ligand molecules were selected from these structures. The number of ligands was greater than the number of entries because some proteins bound two or more different ligands. Figure 2(A) shows the size distribution of the ligand molecules in this data set, as an illustration of its heterogeneity.

A second set of protein–ligand complexes was selected for further testing of the templates of attached points obtained in the present work. These complexes were extracted from the subset of PDB structures released during the year 2000 by using the same criteria and filtering procedures described above. As a result, 364 PDB entries, including 503 ligands, were selected. The size distribution of the ligands from this set is shown in Figure 2(B).

Because of the large number of protein–ligand complexes that were included in this study, all the calculations described in this work were automated to process the selected PDB entries serially by using a set of our own programs. The program Visual Molecular Dynamics (VMD) [14] was used for graphics display of a limited number of these structures. Calculations were performed on a 400-MHz Pentium II computer running the Linux operating system. Programs were written in Fortran 77 and were compiled by using the GNU Fortran compiler.

RESULTS AND DISCUSSION

Searching for Protein–Ligand Contacts and Collecting Them on Model Amino Acid Structures

The selected protein–ligand complexes were searched for atom-heteroatom contacts. In this study, a protein atom and a heteroatom were considered as being in contact (or interacting) if they lay within 4.3 Å of each other, irrespective of the atom types. This number was chosen after several initial computations showed that the average distance for nonpolar contacts was about 3.8 Å. A tolerance distance of 0.5 Å was then added to this value. This cutoff retains the most strongly interacting pairs of atoms.

The purpose of these calculations was to determine the relative position occupied by every interacting heteroatom with respect to the amino acid residue it interacts with and then collect all the atomic contacts for each of the 20 protein residue types.

TABLE I. Examples of Entries in the Template of Attached Points Designed for the Side Chain of Arginine

            Reference atoms       Internal coordinates
Point(a)    3      2      1       R (Å)   θ (°)   φ (°)    Color(b)
6           NE     CZ     NH1     2.8     162     −82      4
14          NE     CZ     NH2     2.8     162     82       4
17          CZ     CD     NE      2.8     125     −160     3
18          CZ     CD     NE      2.8     125     160      3
20          CZ     CD     NE      3.8     90      90       7
22          NE     CZ     NH1     3.8     90      90       7

(a) The template for the arginine side chain has 24 attached points in total.
(b) Each template point was colored according to the type of its parent atom (atom 1). Color numbers follow the code: 1) acceptor, 2) negative, 3) donor, 4) positive, 5) hydroxyl, 6) aliphatic, and 7) aromatic.

Fig. 2. Histograms showing the size distribution of the ligand molecules selected for this study. A: Histogram for the set of 516 complexes extracted from a 1999 PDB release. B: Histogram for a set of 503 complexes released during the year 2000. The size is given by the number of heteroatoms (no hydrogens).


In transporting the interactions to fixed amino acid structures we had to consider the conformational variability displayed by some of the amino acid side chains, as well as the flexibility of the backbone. To deal with this problem, a set of internal coordinate reference systems was created for every atom in each of the 20 protein amino acids, so that the position of any interacting heteroatom could be referred to a set of three connected amino acid atoms whose relative disposition does not change upon rotation of torsion angles, as illustrated in Figure 3(A). Thus, the way in which an interaction is collected depends on which reference system, and in particular which parent atom, is chosen for a given interacting heteroatom, as depicted in Figure 3(B). In this example, a heteroatom interacts both with the hydroxyl oxygen (Oγ) and the backbone oxygen (O) of threonine. The choice of one or the other protein atom as the center of the reference system will determine the location of the collected interaction.

The solution given to the problem of the reference system was simple: the protein atom that participates in the interaction is, by definition, the one that is closer to the heteroatom. Once a reference system has been chosen, the representation of the contact on the model amino acid is unambiguous. Moreover, if an interaction point on the model amino acid structure is transposed back onto the protein residue it was taken from in the source crystal complex, it occupies exactly the same position as its source heteroatom, no matter what conformational differences might exist between the crystal and the model amino acid structures.
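The transposition and its exact reversibility can be illustrated with a short round trip. This sketch assumes a particular torsion sign convention and uses made-up coordinates for both the crystal and the model reference atoms; the function names are hypothetical and angles are kept in radians.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def to_internal(p, atom1, atom2, atom3):
    """(R, theta, phi) of point p relative to the reference atoms."""
    v = sub(p, atom1)
    r = math.sqrt(dot(v, v))
    c = dot(unit(v), unit(sub(atom2, atom1)))
    theta = math.acos(max(-1.0, min(1.0, c)))
    b1, b2 = sub(atom2, atom3), sub(atom1, atom2)   # bonds 3->2 and 2->1
    n1, n2 = cross(b1, b2), cross(b2, v)
    phi = math.atan2(dot(cross(n1, n2), unit(b2)), dot(n1, n2))
    return r, theta, phi

def to_cartesian(atom1, atom2, atom3, r, theta, phi):
    """Inverse of to_internal for the same reference atoms."""
    axis = unit(sub(atom1, atom2))
    normal = unit(cross(sub(atom2, atom3), axis))
    in_plane = cross(normal, axis)
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return tuple(atom1[i] + local[0]*axis[i] + local[1]*in_plane[i]
                 + local[2]*normal[i] for i in range(3))

# Made-up reference trios (atom1, atom2, atom3) in the crystal and in the model.
crystal = ((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0))
model = ((5.0, 5.0, 5.0), (5.0, 6.5, 5.0), (3.6, 7.0, 5.0))
heteroatom = (-1.0, 2.0, 1.5)

# Crystal -> model, then model -> crystal.
on_model = to_cartesian(*model, *to_internal(heteroatom, *crystal))
back = to_cartesian(*crystal, *to_internal(on_model, *model))
```

Because only the three reference atoms enter the calculation, the rest of the residue may adopt any conformation without affecting the round trip.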

For all residue types (except glycine), the interactions with the carbonyl oxygen and the nitrogen atoms of the backbone were collected separately, on a single residue model that also included the carbonyl carbon participating in a peptide bond with the backbone nitrogen, that is, an atom that belongs to the preceding residue in the protein sequence. This was necessary because the internal coordinate system associated with the nitrogen must include its two bonded atoms, so that the orientation of the NH group (with which most interactions occur) is unambiguously defined.

After being reproduced on the model amino acids, the heteroatom contacts from all the selected ligands were collected in files (one for each amino acid type and an additional one for backbone atoms), so that they could be inspected by using a molecular graphics program. In addition, a record containing information about the source heteroatom and the source crystal complex was stored in a separate file for each heteroatom-atom interaction. This way, an atom label obtained by clicking on an interaction point on the computer screen could be easily referenced to its source experimental data.

Fig. 3. Transposing heteroatom-protein residue interactions onto model amino acid structures. A: A heteroatom interacting with the arginine Nε is transposed to the model structure by using a reference system centered at Nε and including its two adjacent atoms. The geometry of this trio remains fixed upon rotation around any of the bonds. B: A heteroatom is in contact with two threonine atoms: the hydroxyl oxygen (Oγ) and the backbone oxygen (O). The position of the collected contact will depend on the choice of the primary interaction, if the conformations of the source and the model amino acids are different.

In the initial calculations we performed, all the heteroatoms within the defined cutoff distance were represented on the model amino acids. This produced crowded clouds of points where no defined patterns could be seen. An analysis of several crystal complexes identified as the source of some of the “unexpected” points revealed the main reasons why such confused clouds of points were obtained. The most commonly observed cases are illustrated in Figure 4.

Figure 4(A) represents a situation in which a hydroxyl oxygen from the ligand and its implicit hydrogen are participating in a hydrogen bond while both the oxygen and its bonded carbon are within the contact distance, as defined here, from the carboxyl oxygen of aspartic acid. In this case, it is evident that the interaction takes place only through the oxygen and that the carbon atom should not be taken into account. A more difficult case is presented in Figure 4(B), where a ligand oxygen is involved in a hydrogen bond to the carboxyl oxygen of aspartic acid while contacting the aromatic ring of phenylalanine. The latter contact, although it probably makes a contribution to the binding energy, most likely is “secondary,” or incidental. In other words, the geometry of the strong hydrogen bond between the ligand oxygen and the aspartic acid (the “primary” interaction) is the one that is relevant in this case. We conclude that, at least for the practical purpose of this study, it makes more sense to take into account only primary interactions, because collecting secondary contacts, such as the one in this example, would introduce undesirable noise in the interaction patterns for some amino acids.

It would be possible, in principle, to create a set of rules to deal with different types of contacts between heteroatoms and protein amino acids; however, the large variety of functional groups present in the hetero molecules and the many different environments that ligand atoms encounter in the protein binding sites would make this task very difficult, and the set of rules would be extremely large.

To cope with the ample diversity of protein–ligand contacts found in the PDB, we created a small set of general, simple, and easy-to-program “interaction rules”:

1. If a single heteroatom is in contact with two or more protein residues, only the interaction with the closest amino acid is collected.

2. The interaction of a heteroatom with a protein residue is assumed to be effected through the closest atom of the amino acid.

3. If two or more heteroatoms make their closest contact with the same protein atom, only the closest heteroatom is counted as an interaction.

Rule 1 defines the primary interaction of a heteroatom by using a distance criterion and discards any secondary contacts, if present. Rule 2 also applies a distance criterion to determine which atom in a protein amino acid is taken as the center of the reference system that will be used to transport the interaction onto the model amino acid. Rule 3 resolves the problem exemplified in Figure 4(A). Although these rules are approximations, they fulfill well the practical objective of considering the strongest interactions to yield well-defined patterns of interaction.
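The three rules can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation: it assumes that the "closest amino acid" of Rule 1 is the residue that owns the closest protein atom, so Rules 1 and 2 reduce to a single nearest-atom search, and it treats the 4.3 Å cutoff as inclusive. The data layout, function name, and coordinates are made up.

```python
import math

def collect_interactions(protein_atoms, hetero_coords, cutoff=4.3):
    """Map each counted heteroatom index to the index of its protein atom."""
    best = {}  # protein atom index -> (distance, heteroatom index)
    for h, h_xyz in enumerate(hetero_coords):
        # Rules 1 and 2: the single closest protein atom defines the interaction.
        distance, p = min((math.dist(h_xyz, atom["xyz"]), i)
                          for i, atom in enumerate(protein_atoms))
        if distance > cutoff:
            continue  # not in contact with the protein at all
        # Rule 3: a protein atom keeps only its closest heteroatom.
        if p not in best or distance < best[p][0]:
            best[p] = (distance, h)
    return {h: p for p, (_, h) in best.items()}

# Toy geometry combining the two situations of Figure 4.
protein = [
    {"res": "ASP", "name": "OD1", "xyz": (0.0, 0.0, 0.0)},
    {"res": "PHE", "name": "CZ", "xyz": (2.7, 3.8, 0.0)},
]
hetero = [
    (2.7, 0.0, 0.0),    # ligand oxygen: 2.7 A from OD1 and 3.8 A from CZ
    (3.9, -0.5, 0.0),   # its bonded carbon: about 3.9 A from OD1
    (20.0, 0.0, 0.0),   # distant atom: no contact
]
interactions = collect_interactions(protein, hetero)
```

In this example only the ligand oxygen is counted, and it is assigned to the aspartate oxygen: the secondary contact with the phenylalanine ring is discarded by the nearest-atom search, and the bonded carbon is discarded by Rule 3.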

Participation of Different Protein Amino Acids in the Interactions With Ligand Atoms

Figure 5(A) shows the extent of participation of the 20 protein amino acids in interactions with heteroatoms. On the left side of this graph, arginine stands out as one of the amino acids most involved in contacts with ligand molecules (second after glycine), although its frequency of occurrence in the studied complexes is relatively low [Fig. 5(B)]. At the other end of the horizontal axis, tryptophan also stands out because of its large number of contacts, despite being one of the less abundant amino acids in the selected protein–ligand complexes. As can be observed, aromatic residues in general make important contributions to the binding of ligand molecules, despite their relatively low frequency of occurrence, partly because of their larger sizes.

Fig. 4. A: Fragment of a ligand molecule displaying an oxygen atom (in dark) involved in a hydrogen bond to an aspartic acid from the protein receptor. Both the ligand oxygen and its bonded carbon are within the defined contact distance (4.3 Å) from the aspartic acid. B: Fragment of another ligand molecule containing a hydroxyl oxygen, whose implicit proton is participating in a hydrogen bond to the carboxyl oxygen of aspartic acid and, at the same time, is making contact with the aromatic ring of a phenylalanine residue.

Glycine occupies the first position in its number of interactions with ligand atoms and is also one of the most frequent amino acids. Because glycine has no side chain, all its interactions involve backbone atoms. As can be observed in Figure 5(A), the relative participation of the backbone in the interactions decreases, as a rule, with increasing amino acid size, which can be clearly seen for the group of residues having aliphatic side chains and for the serine-threonine pair. Some residues (e.g., cysteine, methionine, and proline) have a much lower participation in ligand contacts.

Brief Description of the Interaction Patterns Obtained for the Different Amino Acids

Before describing the contacts found for the 20 amino acids, it is necessary to point out that the analysis we could make from the collected data is limited in scope for several reasons. The main one is that we collected the heteroatoms that interact with the protein without extracting any information about their chemical context in the ligand molecule they come from. Such an analysis would be extremely difficult to make in an automated way. Another reason is the absence of hydrogen atoms in PDB files, which makes it difficult in some cases to determine whether there is a hydrogen bond. However, for the practical purposes behind this study, that is, the construction of templates of attached points to be used for docking, the information about the geometric distribution of the different types of interacting atoms around the amino acid residues, which is summarized here omitting numerical details, should suffice. The patterns obtained for the amino acid side chains are described first, followed by the collection of contacts with backbone atoms for all amino acid residues (except glycine, which was examined separately).

Arginine

The contacting heteroatoms collected for this residue are shown in Figure 6(a). Two small but crowded clouds of oxygen atoms appear over the two charged nitrogens, whereas two other, less dense clouds composed of oxygens and a few nitrogens surround the lateral edges of the guanidino group. A number of carbon atoms and also some nitrogens form a layer on each side of the plane formed by the guanidino group.

Lysine

The most prominent feature in the collection of contacts gathered for this residue is a ring of oxygen atoms situated over the charged nitrogen, perpendicular to the C-N axis. This ring becomes more crowded on the side opposite to the Cδ carbon. Other scattered contacts, mainly with carbon atoms, can be seen around the rest of the side chain.

Fig. 5. A: The total numbers of atom-heteroatom interactions computed in this study for each type of protein amino acid are represented by columns in dark gray. Columns in light gray represent the number of interactions with backbone atoms for each amino acid type, which are fewer in number because generally the side chains obscure the backbone. B: Frequency of occurrence of each amino acid type in the proteins contained in the PDB entries selected for this study.

Aspartic and glutamic acids

Both amino acids display two well-defined clusters of oxygens and nitrogens located over the carboxyl group, as shown in Figure 6(c) for aspartate. The two clouds of contacts are elongated in the direction perpendicular to the plane of the carboxylate. Some carbon atoms can be seen within these two clusters, but most of them are bonded, in their source structures, to oxygen atoms present in the neighboring cluster. Other contacts are spread around the side chain for both amino acids.

Fig. 6. Collected atomic contacts (left panels) and ATPTS templates (right panels) for the side chains of three different protein amino acids: Arg, Asp, and Tyr. Carbon atoms in these protein residues are colored yellow, whereas the nitrogen and oxygen atoms are blue and red, respectively. Collected contacts are represented as small balls whose colors correspond to their source heteroatoms: red, oxygen; blue, nitrogen; green, carbon; and yellow, sulfur. Attached points in the templates shown on the right panels resemble the observed interaction patterns. An arc of positive points (shown in light blue) was constructed over the two charged NH2 groups of arginine, and a couple of donor points (in magenta) were associated with the Nε. In addition, three points (in green) on each side of the guanidino group were colored as aromatic because this region displays a binding pattern similar to that found on aromatic rings. The carboxyl group of aspartic acid was covered by six points (shown in pink) bearing the color negative. For tyrosine, four hydroxyl points (in red) were placed over the hydroxyl oxygen and two layers of attached points with the color aromatic (shown in green) were constructed on both sides of the phenyl ring.

Asparagine and glutamine

Three groups of contacts can be distinguished for these two similar side chains, one for the oxygen and two for the nitrogen. Each of the two clusters that belong to the nitrogen is distributed around an axis formed by the nitrogen and one of the hydrogen atoms bound to it (not present in the X-ray structures). The cluster that belongs to the oxygen and one of the clusters associated with the nitrogen appear on top of the amino group, following a geometric pattern similar to that observed for Asp and Glu. There are more interactions with the nitrogen than with the oxygen (roughly twice as many).

Serine and threonine

A ring of oxygen atoms and a few nitrogens covers the hydroxyl group in Ser. Such a ring is also outlined for Thr, but it is not as well defined as for Ser. A broad spherical sector over the CH3 group in Thr, which intersects the oxygen ring over the hydroxyl group, is populated mostly by carbon atoms.

Glycine

As shown in Figure 5, Gly is the amino acid that accounts for the greatest number of interactions with ligand atoms. Clouds of atoms, most of which are oxygens, almost surround the whole residue. A compact cluster of oxygens and a few nitrogens is located in the region where the backbone NH points, whereas heterogeneous clouds of atoms are spread on the surface over the plane formed by the backbone C, Cα, and N atoms.

Proline, cysteine, and methionine

Only a few contacts were collected for these three residues. For Pro, they were distributed around the Cβ, Cγ, and Cδ carbons, whereas for Cys and Met they were scattered around the terminal atoms of the side chain.

Alanine, valine, leucine, and isoleucine

The contact patterns for these four amino acids share a common feature: the aliphatic side chain is surrounded by a (chemically) heterogeneous cloud of atoms and, although some areas are more populated than others, there are no well-defined clusters of atoms.

Phenylalanine and tyrosine

The aromatic rings of both side chains are totally enveloped by heteroatom contacts, mainly by carbon atoms, but also by a number of oxygens and some nitrogens, as shown in Figure 6(e) for tyrosine. Most of the contacts are stacked over the ring plane, whereas the regions over the lateral edges are less populated (except for the terminal edges of Phe). The hydroxyl group of Tyr displays two clusters of oxygens located on the same plane defined by the aromatic ring [see Fig. 6(e)].

Histidine

The distribution of contacts around this residue shows a cluster of oxygens over each of the two ring nitrogens, but they are not equally populated: interactions with N are more frequent than with N�. The rest of the contacts, mostly carbon atoms, are located mainly over the ring plane.

Tryptophan

The most prominent feature observed for Trp is the stacking of heteroatoms (mainly carbons) on both sides of the indole plane. There is also a cluster of oxygens in the proper geometry to accept a hydrogen bond from the NH group, and only a few contacts with other edges of the indole group.

Backbone atoms

A dense cluster of oxygens and nitrogens was collected over the backbone nitrogen. This cluster is elongated in the direction perpendicular to the plane formed by the Cα, N, and C atoms, where C is the carbonyl carbon of the preceding residue in the sequence. In contrast, the cloud of contacts with the backbone oxygen is more spread out and highly heterogeneous, although a fuzzy ring of oxygens and nitrogens centered on the C-O axis can be discerned against a bulk of carbon atoms.

Constructing the Templates of Attached Points From the Interaction Patterns

The obtained interaction patterns were taken as a basis for constructing a set of templates of attached points, one for each amino acid. As pointed out in the Introduction, the aim of a binding site representation that is going to be used by a docking program such as DOCK is to construct points in positions that, when matched by ligand atoms, can produce correct orientations of the ligand molecule. The more points in proper positions, the more likely it is that the program will produce correct docking solutions. On the other hand, the total number of points should be limited to save computation time. Therefore, the templates should ensure a reasonable number of matches without having too many points. Besides, they must be of general use, that is, they should perform well for any protein–ligand complex.

To guide the construction of the templates, the interacting heteroatoms were grouped in clusters, separately for each atom type. First, pairwise distances were calculated and stored in a bidimensional array. Afterward, the heteroatom having the greatest number of close neighbors was selected as the center of the first cluster, using a cutoff distance of 1.5 Å to consider two atoms as close. This distance corresponds to the typical value of the “distance tolerance” parameter used by the DOCK matching algorithm (see below). The selected atom and its neighbors were excluded from the distance matrix, and then the procedure was repeated for the remaining heteroatoms, until there were no atoms left.
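The clustering procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `greedy_clusters` is hypothetical, ties between equally populated candidates are broken by the lowest index, and the 1.5 Å cutoff is treated as inclusive; none of these details is specified in the text.

```python
import math

def greedy_clusters(coords, cutoff=1.5):
    """Greedy clustering of heteroatom positions (one atom type at a time).

    Returns a list of (center_index, member_indices) tuples.
    Tie-breaking and the inclusive cutoff are assumptions of this sketch.
    """
    # Pairwise distances stored in a two-dimensional array.
    dist = [[math.dist(a, b) for b in coords] for a in coords]
    remaining = set(range(len(coords)))
    clusters = []
    while remaining:
        # Number of close neighbours among the atoms not yet assigned.
        def n_close(i):
            return sum(1 for j in remaining if j != i and dist[i][j] <= cutoff)
        # The atom with the most close neighbours becomes the cluster center.
        center = max(sorted(remaining), key=n_close)
        members = [j for j in sorted(remaining) if dist[center][j] <= cutoff]
        clusters.append((center, members))
        # Exclude the center and its neighbours, then repeat.
        remaining -= set(members)
    return clusters

# Four collinear atoms: three close together and one isolated.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(greedy_clusters(atoms))
```

In this toy input the middle atom of the first three has two close neighbours and is chosen as the first center; the isolated atom forms a cluster of its own.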

The coordinates of the central atom of each cluster were stored in a PDB format file, so that they could be easily displayed on the screen. The final structures of the templates were determined from visual inspection of the interaction patterns and the cluster centers by using the VMD molecular graphics program. The attached points were distributed mostly over the areas where a significant density of interacting heteroatoms was observed. The separation between points varied depending on the density of the interaction patterns, the minimum value being about 1.2 Å and the maximum about 2.5 Å. Figure 6 shows the templates constructed for arginine, aspartic acid, and tyrosine.

Not every heteroatom contact was covered by the network of attached points: isolated contacts and, in some cases, low-density areas were not represented. This implies that for some protein–ligand complexes, one or a few interacting ligand atoms might not have the possibility of producing correct matches in a docking simulation. However, as will be shown below, the performance of the constructed templates was more than satisfactory.

Each attached point in a template has a chemical label or color, which is used for chemical matching, as implemented in DOCK 3.5 to speed up the docking calculations.15 The patterns obtained gave a picture of what kind of atom may interact with a given amino acid at a given position, although the exact nature of the interaction cannot always be determined. The assignment of chemical labels (colors) to attached points in the collection of templates was made on the basis of the parent atom each point is attached to. Seven different colors were applied: hydrophobic, donor, positive, acceptor, negative, hydroxyl, and aromatic; the latter was used for points covering the rings of histidine, phenylalanine, and tyrosine [see Fig. 6(f)] and the indole group of tryptophan. The chemical information about the interacting heteroatoms obtained in this study, which is limited to the atom type, becomes relevant when defining the ligand atom-site point matching rules that will operate in a docking run that makes use of the color filter.

Evaluating the Performance of the ATPTS Templates

The quality of a method that creates a negative image of the binding site using points depends on its capability of generating these points in proper places, that is, close to positions occupied by ligand atoms when complexed to the protein. We applied this criterion to make a first evaluation of the templates, by comparing the positions of constructed ATPTS points with the positions of atoms of the ligand, placed in its crystal orientation.

ATPTS binding site representations were generated for the 516 protein–ligand complexes included in this study in a totally automated way, by using the obtained templates and following the algorithm described in Materials and Methods. The calculation time per structure was, on average, �0.1 s. The selection of the protein residues that form the binding site was made on the basis of their proximity to the ligand in the crystal complex: every protein residue within a distance of 5 Å from at least one of the heteroatoms was included. Bump distances of 2.2 and 2.8 Å were used for polar and nonpolar protein atoms, respectively. A merging distance of 1 Å was used to fuse points too close together. It is worth noting that if a merging distance ≥ 1.2 Å were applied, then some of the points attached to the same protein residue might be merged into each other.
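The residue selection and bump filtering just described can be illustrated with the following Python sketch. It is written under assumptions that the text does not confirm: a residue is taken to be "within 5 Å" if any of its atoms lies within that distance of any ligand heteroatom, and a site point is taken to be rejected when it lies closer to a protein atom than the bump distance for that atom's polarity. The function names and data layouts are hypothetical, and whether the parent atom of a point is exempt from the bump test is not addressed here.

```python
import math

def select_site_residues(residues, ligand_atoms, cutoff=5.0):
    """Return names of residues with at least one atom within `cutoff` (in Å)
    of at least one ligand heteroatom.

    residues: dict mapping a residue name to a list of (x, y, z) tuples.
    """
    return [name for name, atoms in residues.items()
            if any(math.dist(a, l) <= cutoff
                   for a in atoms for l in ligand_atoms)]

def passes_bump_check(point, protein_atoms, polar_bump=2.2, nonpolar_bump=2.8):
    """Assumed use of the bump distances: a point fails if it is closer to
    any protein atom than the bump distance chosen by that atom's polarity.

    protein_atoms: list of ((x, y, z), is_polar) tuples.
    """
    for xyz, is_polar in protein_atoms:
        limit = polar_bump if is_polar else nonpolar_bump
        if math.dist(point, xyz) < limit:
            return False
    return True

# Toy example with made-up coordinates.
residues = {"ASP25": [(0.0, 0.0, 0.0)], "GLY90": [(20.0, 0.0, 0.0)]}
ligand = [(3.0, 0.0, 0.0)]
print(select_site_residues(residues, ligand))
print(passes_bump_check((0.0, 0.0, 2.5), [((0.0, 0.0, 0.0), True)]))
print(passes_bump_check((0.0, 0.0, 2.5), [((0.0, 0.0, 0.0), False)]))
```

The same point at 2.5 Å from a protein atom is kept when that atom is polar (bump distance 2.2 Å) and rejected when it is nonpolar (bump distance 2.8 Å).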

As quantitative measures of the correspondence between the ensembles of points and ligand atoms, we defined two magnitudes: the number of matched heteroatoms (Nmh) and the number of matching heteroatom-point pairs (Nmp). Here, a heteroatom was considered to be matched by an attached point if the distance between them was less than a certain “matching cutoff,” and a matching heteroatom-point pair is defined as a point and a heteroatom being within this matching cutoff distance from each other. In principle, the same heteroatom could be matched by more than one point and, therefore, could form more than one matching pair.

The next step was then to calculate Nmh and Nmp for each protein–ligand complex. Because the points produced by the ATPTS method are at the protein surface, only those ligand atoms that are close to the protein can be expected to match. For this reason, both Nmh and Nmp are evaluated against the number of heteroatoms contacting the protein (Nc), as delimited by a cutoff distance of 4.3 Å. Figure 7 shows the results of these calculations, in the form of histograms, for different values of the matching cutoff distance: 1.0, 1.5, and 2.0 Å.
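As an illustration of these definitions, the sketch below computes Nmh, Nmp, and Nc for one complex. The function name `match_statistics` and the plain-tuple data layout are hypothetical, and the sketch assumes that a ligand heteroatom "contacts" the protein when any protein atom lies within the 4.3 Å cutoff.

```python
import math

def match_statistics(ligand_atoms, site_points, protein_atoms,
                     match_cutoff=1.5, contact_cutoff=4.3):
    """Return (Nmh, Nmp, Nc) for one protein-ligand complex.

    Nmh: ligand heteroatoms with at least one site point closer than match_cutoff.
    Nmp: heteroatom-point pairs closer than match_cutoff.
    Nc:  ligand heteroatoms within contact_cutoff of any protein atom (assumed).
    """
    n_mh = 0
    n_mp = 0
    for atom in ligand_atoms:
        pairs = sum(1 for p in site_points if math.dist(atom, p) < match_cutoff)
        n_mp += pairs          # one atom may form several matching pairs
        if pairs > 0:
            n_mh += 1          # but it is counted once as a matched atom
    n_c = sum(1 for atom in ligand_atoms
              if any(math.dist(atom, q) <= contact_cutoff for q in protein_atoms))
    return n_mh, n_mp, n_c

# Toy complex with made-up coordinates: the third ligand atom is far from everything.
ligand = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
points = [(0.5, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 0.0, 1.4)]
protein = [(0.0, 0.0, 3.0), (5.0, 3.0, 0.0)]
n_mh, n_mp, n_c = match_statistics(ligand, points, protein)
print(n_mh, n_mp, n_c, 100.0 * n_mh / n_c, n_mp / n_c)
```

For this toy input both contacting atoms are matched (Nmh/Nc of 100%), and the first atom is matched by two points, so Nmp/Nc is 1.5.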

For a matching cutoff distance of 1.5 Å, the ATPTS representation covered more than 70% of the contacting heteroatoms for most of the complexes [Fig. 7(B)], and the average number of matching pairs per contacting atom was greater than one also for most complexes [Fig. 7(E)]. It should be noted that the “matching cutoff” distance defined here is closely related to the “distance tolerance” parameter used within DOCK 3.5, which limits the differences between sphere-sphere and atom-atom distances in a set of matching sphere-ligand atom pairs.1 Although the longest-distance filter used by DOCK is more restrictive than the single atom-point pair evaluation performed here (if using the same value for both the distance tolerance and the matching cutoff parameters), the statistics presented in Figure 7 suggest that the historical default value of 1.5 Å used for the distance tolerance parameter in DOCK calculations,16 should work for most protein–ligand complexes. This was confirmed by actual docking runs, as described below.
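The role of the distance tolerance can be made concrete with a small sketch. It checks whether a candidate set of point-atom pairs is internally consistent, in the sense that every point-point distance differs from the corresponding atom-atom distance by no more than the tolerance. This is an illustration of the criterion as described in the text, not the DOCK implementation; the function name `within_distance_tolerance` is hypothetical.

```python
import math
from itertools import combinations

def within_distance_tolerance(pairs, tolerance=1.5):
    """Check a candidate match set against the distance tolerance.

    pairs: list of (site_point_xyz, ligand_atom_xyz) tuples. Only internal
    distances are compared, so the ligand coordinates may be given in any
    frame (the ligand need not be oriented in the site yet).
    """
    for (p1, a1), (p2, a2) in combinations(pairs, 2):
        if abs(math.dist(p1, p2) - math.dist(a1, a2)) > tolerance:
            return False
    return True

# Two points 3 Å apart matched to two atoms 4 Å apart: difference of 1 Å, accepted.
ok = [((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)), ((3.0, 0.0, 0.0), (10.0, 4.0, 0.0))]
# Two points 1 Å apart matched to two atoms 5 Å apart: difference of 4 Å, rejected.
bad = [((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)), ((1.0, 0.0, 0.0), (5.0, 0.0, 0.0))]
print(within_distance_tolerance(ok), within_distance_tolerance(bad))
```

Because every pair of matches in the set must satisfy the criterion simultaneously, this filter is stricter than testing each atom-point pair on its own, which is the point made in the paragraph above.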

Note that both the number of matched heteroatoms and the number of matching heteroatom-point pairs would be lower if the “color” filter15 implemented in DOCK 3.5 were used. However, we did not apply this filter in the calculations because its implementation would require automated coloring of ligand atoms and establishing a set of “universal” (valid for all complexes) matching rules, and both problems are beyond the scope of this work.


Automated Docking of >1000 Ligands Using the ATPTS Templates

The final and most important test we performed for the ATPTS templates was directly focused on the purpose they were designed for: docking. An automated, serial docking of >1000 ligands to their corresponding binding sites was performed with our own program that we call “DOCKPDB.” This program takes as input data the collections of ligand structures and ATPTS representations of the corresponding binding sites and executes the DOCK (version 3.5) matching and orientation routines for each of the complexes.

DOCKPDB does not perform any contact or energy scoring, because we were interested only in evaluating the capability of the ATPTS representation to produce correct ligand orientations when used in combination with the DOCK matching algorithm. A standard distance tolerance of 1.5 Å was used in the calculations for all complexes. As output, the program stores the best values of the root-mean-square deviation (RMSD) between a generated ligand orientation and its crystal geometry. Figure 8 shows this testing scheme.
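The RMSD bookkeeping described above can be sketched as follows. This is not the DOCKPDB code, which is not shown in the text; the function names are hypothetical, and the sketch assumes that each generated orientation lists the same atoms in the same order as the crystal structure, with no superposition and no correction for ligand symmetry.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def best_rmsd(orientations, crystal):
    """Lowest RMSD over all generated orientations of one ligand."""
    return min(rmsd(o, crystal) for o in orientations)

def count_good(orientations, crystal, threshold=2.0):
    """Number of orientations with RMSD below the threshold (in Å)."""
    return sum(1 for o in orientations if rmsd(o, crystal) < threshold)

# Toy two-atom ligand with two generated orientations (made-up coordinates).
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
orientations = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],   # shifted by 1 Å
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],   # shifted by 3 Å
]
print(best_rmsd(orientations, crystal), count_good(orientations, crystal))
```

The first orientation is a rigid 1 Å shift, so the best RMSD is 1 Å and exactly one orientation falls below the 2 Å threshold.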

We first ran docking simulations for the 516 protein–ligand complexes that were used to derive the ATPTS templates. The results from these calculations, which took only a few hours, are summarized in Figure 9. Except for a few complexes, the RMSDs for the best orientations given by the program were <2 Å, and most of them, <1 Å [Fig. 9(A)]. Furthermore, for most of the complexes, the program produced a large number (hundreds or thousands) of matches, leading to correct orientations of the ligand (RMSD < 2 Å), as shown in Figure 9(B).

An inspection of the few structures that could not be docked with a low RMSD revealed that they corresponded to relatively large ligands having only a few contacts with the protein. For such complexes, the program has little chance of reproducing the overall orientation of the ligand in the crystal structure, because the matching and orientation procedures are performed over a reduced group of heteroatoms located on a border of the molecule.

To further evaluate the general applicability of the templates of attached points, we performed docking simulations, using the DOCKPDB program, for a second set of experimental structures extracted from the PDB, as described in Materials and Methods. This set, composed of structures released during the year 2000, contained 503 protein–ligand complexes, thus being almost as large as the data set used for the attached points construction. The results from these docking runs were very similar to the results obtained earlier, as shown in Figure 9(C and D). As before, only a few structures could not be docked correctly because of the small contact area between the ligand and the protein.

Fig. 7. A–C: Histograms showing the distribution of the studied protein–ligand complexes by the value of the Nmh/Nc ratio (see text) expressed as percent, for different values of the matching cutoff distance: (A) 1.0 Å, (B) 1.5 Å, and (C) 2.0 Å. D–F: Histograms displaying the distribution of the complexes by the value of the Nmp/Nc ratio (see text) for the same values of the matching cutoff distance as above: (D) 1.0 Å, (E) 1.5 Å, and (F) 2.0 Å. Note that the scales of the y axes are different, to display more clearly the data for each value of the matching cutoff distance.

Attached Points Versus Spheres

The development of the ATPTS approach for representing the binding site was driven by the necessity of having a fast and automated method that could be used repeatedly for a large number of receptor conformations. The standard DOCK representation was developed for the case of a single receptor geometry, where consuming a few minutes of CPU time is not a problem, nor is the need for user intervention. In this approach, the program SPHGEN fills the binding site with a set of overlapping spheres, which are generated by using pairs of protein surface points.1 Then the user selects a subset of these spheres (actually their centers) as representative of the site and assigns chemical properties (colors) to them.

A more recent report5 describes a new method (SURFSPH) to generate spheres that can be used by DOCK. These spheres are constructed by using three receptor atoms instead of two surface points, and the process of generating and coloring the spheres is completely automated. The SURFSPH algorithm is more complex than the ATPTS algorithm presented here and probably demands more computer time, mainly because of the large number of three-atom combinations that are tried to generate the initial set of spheres.

But beyond the question of user-dependent versus automated (slower or faster) algorithm, it is important to note that the three approaches (SPHGEN, SURFSPH, and ATPTS) follow different principles. The spheres generated by SPHGEN are located within a given range of distance from the receptor surface (usually 1.5–5.0 Å) and may occupy any relative position with respect to the different chemical groups at the surface of the protein. The function of these spheres is to fill cavities in the binding site that may be occupied by ligand atoms. SURFSPH, on the other hand, generates spheres that follow the molecular surface more closely, thus focusing on those ligand atoms that are most likely to be directly involved in the interactions with the protein. The construction of the SURFSPH spheres is based on distance and energy criteria and, in general, does not take into account the directionality that characterizes several types of interactions (although in one reported study, vectors were used to construct spheres linked to backbone nitrogens or carbonyl oxygens5).

Points generated by ATPTS, just as for SURFSPH spheres, are intended to occupy places at the molecular surface that are suitable for direct ligand atom-protein atom interactions. The positions of the attached points, however, are determined by vectors, which are defined in templates constructed for each type of amino acid side chain. Therefore, the ATPTS method considers the particular geometries in which different functional groups of proteins interact with ligand atoms, as derived from a comprehensive study of available experimental data. Another important feature of this method is that each attached point carries a color determined by its parent atom, that is, only by one atom instead of three, as in the SURFSPH method. The rules that are used to merge points that are close to each other also differ in the two approaches. In SURFSPH, points are merged even if they have different colors (chemical properties), and a color choice has to be made for the resulting point; in ATPTS, the ambiguity in color selection that may appear when merging points was solved in a simple way: points having different colors (i.e., having different binding preferences for different types of atoms) are never merged, no matter how close they are to each other. This means that the same site in space may allow different types of interactions with ligand atoms if the surrounding protein atoms have different properties.
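The color-preserving merging rule can be illustrated with a short sketch. The text states only that points closer than the merging distance are fused and that points of different colors are never merged; how the fused position is computed is not stated, so the running centroid used below is an assumption, as are the function name `merge_points` and the greedy, order-dependent strategy.

```python
import math

def merge_points(points, merge_dist=1.0):
    """Fuse same-colored points lying closer than merge_dist (in Å).

    points: list of ((x, y, z), color) tuples.
    Points with different colors are never merged, however close they are.
    The fused position is the centroid of the merged points (an assumption).
    """
    merged = []  # each entry: [sum_x, sum_y, sum_z, count, color]
    for xyz, color in points:
        for entry in merged:
            center = tuple(s / entry[3] for s in entry[:3])
            if entry[4] == color and math.dist(center, xyz) < merge_dist:
                for k in range(3):
                    entry[k] += xyz[k]
                entry[3] += 1
                break
        else:
            # No same-colored point nearby: keep this one as a new site point.
            merged.append([xyz[0], xyz[1], xyz[2], 1, color])
    return [(tuple(s / e[3] for s in e[:3]), e[4]) for e in merged]

# Two donor points 0.6 Å apart and an acceptor point lying between them.
site = [((0.0, 0.0, 0.0), "donor"),
        ((0.6, 0.0, 0.0), "donor"),
        ((0.3, 0.0, 0.0), "acceptor")]
for xyz, color in merge_points(site):
    print(color, xyz)
```

In this toy input the two donor points are fused into one, whereas the acceptor point survives even though it coincides with the fused donor position, so the same location offers two kinds of interaction.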

Fig. 8. Outline of the docking calculations performed for the 516 PDB complexes. The coordinates of the selected ligand molecules were extracted from the PDB files and stored in a single file. The ATPTS templates, together with a list of binding site residues obtained previously by an automated procedure (see text), were used to automatically generate the ATPTS representations of the protein binding sites, which were also collected in a single file. Both files (ligand coordinates and ATPTS representations) served as input to the program DOCKPDB, which gave as output the best RMSD values between docked and crystal orientations obtained for each complex, as well as every RMSD value below 2 Å.


CONCLUSIONS

The calculations we have performed for a broad collection of protein–ligand complexes extracted from the Protein Data Bank provided an experimental validation of the attached points (ATPTS) approach for representing a protein binding site, which was developed for the screening of large ensembles of receptor conformations in docking simulations. We found that for most of the amino acids, interactions with ligand atoms are mostly distributed over certain regions on a surface surrounding the residue, forming well-defined patterns of interaction. These patterns were then used for constructing a set of templates of attached points, which constitute the core of the method.

The quality of the ATPTS representation was demonstrated by using this method to generate correct ligand orientations for >1000 protein–ligand complexes. This serial docking also showed the simplicity of the method in that it only needs the set of ATPTS templates and a list of binding site residues to automatically construct a negative image of a given protein receptor in a small fraction of a second.

ACKNOWLEDGMENTS

We thank Dr. Robert Jernigan for his thorough revision of the manuscript and valuable comments on this work.

REFERENCES

1. Kuntz ID, Blaney JM, Oatley SJ, Langridge R, Ferrin TE. A geometric approach to macromolecule-ligand interactions. J Mol Biol 1982;161:269–288.

2. Lin SL, Nussinov R, Fischer D, Wolfson HJ. Molecular surface representations by sparse critical points. Proteins 1994;18:94–101.

3. Laskowski RA, Thornton JM, Humblet C, Singh J. X-SITE: use of empirically derived atomic packing preferences to identify favourable interaction regions in the binding sites of proteins. J Mol Biol 1996;259:175–201.

4. Kuntz ID. Structure-based strategies for drug design and discovery. Science 1992;257:1078–1082.

Fig. 9. A and C: The best RMSD values obtained in the docking calculations (one for each protein–ligand complex) are plotted in the form of histograms, for the first (516 complexes) and second (503 complexes) experimental data sets, respectively. For values > 2 Å, columns are too small to be seen, so their heights are explicitly shown in parentheses. B and D: Histograms showing the distribution of the complexes by the number of good matches (RMSD < 2 Å) obtained in the docking calculations for the first and second data sets, respectively.


5. Oshiro CM, Kuntz ID. Characterization of receptors with a new negative image: use in molecular docking and lead optimization. Proteins 1998;30:321–336.

6. Hendrix DK, Kuntz ID. Surface solid angle-based site points for molecular docking. Pac Symp Biocomput 1998;3:317–326.

7. Melo F, Feytmans E. Novel knowledge-based mean force potential at atomic level. J Mol Biol 1997;267:207–222.

8. Weng Z, Vajda S, Delisi C. Prediction of protein complexes using empirical free energy functions. Protein Sci 1996;5:614–626.

9. Allen FH, Kennard O. 3D search and research using the Cambridge Structural Database. Chemical Design Automation News 1993;8:31–37.

10. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28:235–242.

11. Bruno IJ, Cole JC, Lommerse JP, Rowland RS, Taylor R, Verdonk ML. IsoStar: a library of information about nonbonded interactions. J Comput Aided Mol Des 1997;11:525–537.

12. The Research Collaboratory for Structural Bioinformatics PDB. WWW address: http://www.rcsb.org/pdb/.

13. Merritt EA, Sarfaty S, van den Akker F, L’hoir C, Martial JA, Hol WGJ. Crystal structure of cholera toxin B-pentamer bound to receptor GM1 pentasaccharide. Protein Sci 1994;3:166–175.

14. Humphrey W, Dalke A, Schulten K. VMD—visual molecular dynamics. J Mol Graphics 1996;14:33–38.

15. Shoichet BK, Kuntz ID. Matching chemistry and shape in molecular docking. Protein Eng 1993;6:723–732.

16. Ewing TJA, Kuntz ID. Critical evaluation of search algorithms for automated molecular docking and database screening. J Comput Chem 1997;18:1175–1189.
