Automated interpretation of electron density maps of proteins: Derivation of atomic co-ordinates for...


Page 1: Automated interpretation of electron density maps of proteins: Derivation of atomic co-ordinates for the main chain

J. Mol. Biol. (1976) 100, 427-458

Automated Interpretation of Electron Density Maps of Proteins: Derivation of Atomic Co-ordinates for the Main Chain

JONATHAN GREER

Department of Biological Sciences Columbia University, New York, N.Y. 10027, U.S.A.

(Received 28 August 1975, and in revised form ld October 1975)

Earlier work on the automated interpretation of protein electron density maps (Greer, 1974) has been expanded to produce atomic co-ordinates for the main chain of the molecule. Procedures have been designed that select the main-chain segments, determine the directionality of the chain, and fit α-carbon and β-carbon co-ordinates at the appropriate positions. The orientation of the peptide plane is derived for a number of residues by pre-building secondary structure regions. Strategy for the refinement of these provisional co-ordinates is discussed.

The method has been applied to the 2.0 Å map of ribonuclease S (Wyckoff et al., 1970). Two polypeptide chains are found which correspond to the S-peptide and S-protein. α- and β-carbon positions are derived, and one α-helix and much of the β-sheet region are identified. The close fit of the automatically-calculated to the hand-measured co-ordinates suggests that the methods described here may prove useful to the protein crystallographer in interpreting his electron density map.

1. Introduction

The ultimate objective of X-ray diffraction analysis of protein crystals is to produce atomic co-ordinates that can be used to expand our understanding of biological function. The quality and accuracy of these atomic co-ordinates depend heavily upon the ability to properly interpret the electron density maps of the protein structure, especially since the resolution of the maps is usually less than atomic. Detailed knowledge of the primary structure, protein chemistry, and protein stereogeometry are all necessary to extract the maximum information about the three-dimensional structure from the density maps. Consequently, the protein crystallographer must spend long months carefully interpreting the map using physical models, contour maps on plastic sheets, and a Richards box, i.e. an optical comparator, to superimpose them visually (Richards, 1968). Early hopes that this process might be simplified and improved by using a cathode ray tube graphics display system to produce an electronic simulation of the Richards box have not as yet been realized as the problems involved in accomplishing this task have proved more intractable than originally imagined.

For these reasons, we began to explore an alternative: the automation of the interpretation process using pattern recognition techniques and known protein chemistry and stereogeometry (Greer, 1974, 1975). The electron density maps were skeletonized (Hilditch, 1969) to achieve a massive reduction in the number of points in the electron density maps with a minimum loss of information. Side chains were removed leaving primarily main chain and inter-main chain bridges. Crystallographic and non-crystallographic symmetry was used to isolate a single protein molecule from its neighbors in the density map. As a result of these manipulations, the intermediate objective of producing an overall picture of the molecule was achieved. Secondary structure regions could be seen quite clearly in the skeleton in most cases. In addition, by taking all skeletal branch points and by interpolating between branch points every 4 Å along stretches of the skeleton, a group of isolated points could be amassed, many of which corresponded to the α-carbon positions quite closely. However, at that stage, main-chain lines could not be distinguished nor could α-carbon positions be selected in order along the sequence. The programs were unable to identify secondary structure regions even though they were evident by visual inspection.

In this paper, the focus of attention has turned to the longer range goal: the production of atomic co-ordinates of the molecule and, in particular, the methods used to derive rudimentary atomic co-ordinates for the main chain atoms of a protein.

The first step involves the identification of the skeletal regions which correspond to the backbone of the molecule. α-carbon co-ordinates can then be fitted to this backbone and the directionality of the chain determined. Wherever side chains are found in the skeleton, co-ordinates for the β-carbon are deduced. More sophisticated model building is performed on secondary structure regions in order to orient the peptide plane for many of the residues. Strategy for the refinement of these rudimentary atomic co-ordinates is discussed.

As in previous work, the interpretation process described here has been designed not to use knowledge of the primary structure since so many electron density maps of proteins are being produced for which the sequence information is as yet unavailable. It is hoped that descriptions of the side chains and sequence data will be incorporated into the interpretation scheme in the near future and provisional co-ordinates for the side chains can then be derived by the programs. Some aspects of this subject are considered in the Conclusion.

It is important to emphasize at the outset that the term "automated interpretation" is to some extent a misnomer. It is quite clear that situations will always occur in the interpretation of any protein density map that will continue to require the crystallographer to use all the protein chemistry, stereogeometry, sequence data, and ingenuity available to completely interpret the map. Nevertheless, the automated system should provide reliable co-ordinates for most of the molecule and some guidance in areas of awkwardness and difficulty.

2. Method of Attack

(a) Preparation, skeletonization and tracing of the map

The initial step in the analysis is the transformation of the electron density map to a Cartesian co-ordinate system with constant grid spacing in the major axial directions. Experiments with different map definitions indicate that the optimum grid interval is in most cases 1.0 Å for maps with resolution between 2.0 and 3.5 Å. As has been noted previously (Greer, 1975), certain complex regions of the map occasionally require a finer map definition. Nevertheless, a 1.0 Å grid appears most suitable in the general case.

The only other important requirement for the preparation of the map is the stipulation that no part of the molecule being examined may lie within two grid units of the edge of the map space. It is no longer necessary for the center of gravity of the molecule to lie close to the center of the Cartesian space. However, the approximate center of the molecule is needed at a later stage of the analysis (see section (c) below).

The skeletonization and tracing procedures remain essentially identical to those described in detail in the first paper of this series (Greer, 1974). The electron density map is reduced to a series of "thin lines" that preserve the continuity of the structure, the skeleton (Fig. 1(a) and (b)). The side chain branches are then removed, as well as chains that touch the boundary of map space and have no branches, leaving primarily main chain and inter-main chain bridges (Fig. 1(c)). The skeleton is then described as was illustrated in Table 1 of Greer (1974).

The most critical parameter in the production of the skeleton, and hence in the total interpretation procedure, is the minimum density level or density threshold (ρmin). If this parameter is set too high, then the skeleton will be too disconnected and important parts of the molecule missed. If, on the other hand, the threshold is set too low, then the result is an overconnected structure where it is very hard to differentiate between main chain and inter-main chain bridges. The optimum value has been determined by trial and error, using the description produced by the trace program to estimate the degree of connectivity. The appearance of the skeleton at the best value of ρmin is a good indicator of the overall quality of the original electron density map. The lower the quality of the map, the more difficult it is to balance the connectivity of the skeleton. While this procedure is somewhat subjective, it has worked quite satisfactorily in the cases examined so far. In these maps, it has required no more than two or three attempts with different ρmin levels to find the optimum value. This value, in general, lies somewhat above that used for contoured sections in a Richards box. This is because one normally ignores the lowest contours when analyzing the map and uses them only when the density is very weak. It is true, of course, that for a large protein the whole electron density map need not be scanned when trying different ρmin values. Instead, one need only work with a section of the map that contains part of the molecule and only run the whole map once the best ρmin has been determined.

The contour interval used by the skeleton program is quite arbitrary. In practice, a value is chosen so that the span between ρmin and the highest density point is around eight contour levels.

It is useful to calculate the skeleton at ρmin values one or two contour levels lower than the best value. The purpose of this is to compile a list of the side chains found at these lower ρmin values and combine them for use later in the interpretation (see section (e) below). This is necessary because it is usually true that while the "best" ρmin may connect the main chain optimally, the density between the main chain and the side chains is often more tenuous and a lower ρmin is required to connect them in the skeleton.
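The two conventions just described (a contour interval that gives about eight levels between ρmin and the highest density, and extra skeletons computed one or two contour levels below the best ρmin) can be sketched as follows. This is only an illustration under stated assumptions, not the original program: the function names and the sample density values are hypothetical.

```python
def contour_interval(rho_min, rho_max, n_levels=8):
    """Interval chosen so that rho_min..rho_max spans about n_levels contours."""
    if rho_max <= rho_min:
        raise ValueError("rho_max must exceed rho_min")
    return (rho_max - rho_min) / n_levels


def side_chain_thresholds(rho_min, interval, levels_below=(1, 2)):
    """Thresholds one and two contour levels below the best rho_min."""
    return [rho_min - k * interval for k in levels_below]


# Hypothetical values in arbitrary density units.
interval = contour_interval(rho_min=0.40, rho_max=2.00)
lower = side_chain_thresholds(0.40, interval)
```

Side chains found in the skeletons computed at the thresholds in `lower` would then be pooled with those found at the best ρmin.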

(b) Tree structure of the skeleton

In any analysis of this type, the method of describing the data is exceedingly important for conceptualization of the data as well as for efficiency of handling. In considering a method of description we must take into account the two important properties of the skeleton: the positions of the grid points on the skeleton and their connectivity. The previous programs have stored the skeleton either as a set of grid co-ordinates, or as a series of pairs of points where the members of each pair are neighbors. However, neither of these methods allows the full and efficient expression of the properties of the skeleton.

It is useful at this stage in the analysis to redefine the skeleton as a tree structure where the nodes are either skeleton branch points or tips. A segment is defined as a section of the skeleton that lies between two nodes; thus, within one segment there is no ambiguity as to how to follow the chain. A full description of the skeleton should then consist of a list of the segments and a Table of how the segments are connected at the nodes. In standard tree-type structures, one need only record the connections at one end of each segment. However, because the skeleton always has loops, a connectivity Table for both ends of each segment must be stored. Table 1 illustrates the segment list and connectivity Tables for the two-dimensional model skeleton (Fig. 1(c)) created for illustrative purposes. In this format, it is easy to wend one's way through the skeleton. In addition, as we shall see in sections (c) and (d), below, there are other very useful properties to this description.

FIG. 1. Simulated 2-dimensional map of a tridecapeptide. The molecule and map correspond to those described in Fig. 1 of Greer (1974). In this Fig. neighboring molecules have been added in such a way as to generate the symmetry of planar space group p2. The scale bar, when shown, is 5 Å long. (a) A digital representation of the electron density map. The grid interval is 1 Å. The symmetry operators are 2-fold axes at (12,5) and (12,25). Notice that the symmetry-related density is not always identical. This is due to interpolation error. (b) Skeleton map of the simulated 2-dimensional map. Note that the symmetry-related features are not identical in the skeleton. This is due not only to interpolation errors but also to idiosyncrasies in the point selection procedure of the skeletonization process (see Method of Attack, section (e) and especially Greer (1974) for a more complete discussion). (c) The skeleton map, with side chains removed, is shown in the form of a line drawing. The circles mark the positions along the main chain of the removed side chains. The 1 and 2 mark the center-of-gravity points to be used in (d) and (e). (d) Skeleton of a single molecule isolated using one center-of-gravity point which is labeled with a 1 in the Fig. Note that the 2 short segments around (6,24) have been removed erroneously (see text). (e) Skeleton of the molecule isolated using 2 center-of-gravity points (see 2). In this case the whole of the molecule is preserved (see text). (f) The simplified skeleton of the molecule. The loop around (21,7) has been removed and the main chain extends as one single segment of length 47.87 Å, from (24,19) all the way to (6,24).

[TABLE 1. Segment list and connectivity Tables for the two-dimensional model skeleton of Fig. 1(c). The table body is not recoverable from this transcript.]

(c) Isolation of the molecule, a reappraisal

The next step is the isolation of the molecule from its neighbors in the crystal. This has been discussed in some detail in the first paper of this series (Greer, 1974). The method described was to apply crystallographic and local transformations to each point in the skeleton in order to generate all the symmetrically related points within the map space. Of these, only one can lie in "the molecule" and the rest should be removed. The difficulty in carrying out this procedure is in finding an adequate and reliable set of criteria for selecting the correct points to retain. The previous paper discussed several possible criteria.

Experimentation with a number of other protein electron density maps has shown that the criteria described previously do not always dissect out the molecule accurately and are much too time-consuming. Therefore, a new approach was designed to solve this important problem of isolating a single molecule using the segments introduced in the last section rather than single points. The tree-like description provides the basis for a new set of criteria for selecting a single molecule out of the skeleton.

For purposes of this procedure, it is assumed that any segment of the skeleton is either entirely within "the molecule" or entirely outside it and in fact part of one of the neighbors. Strictly speaking, this is not always true. However, in the very small number of cases where part of the segment is really in the molecule and part out, typically at intermolecular contacts in the crystal, the cautious nature of the criteria will usually cause this segment to be retained. Thus an extra piece of density will be included in the skeleton rather than a possibly important part of the molecule deleted.

The symmetry operators are applied to a segment as a whole, rather than to individual points. For each segment, the symmetrically related points are calculated and checked to see if they lie within the space of the Cartesian map and on the skeleton. If they do, the segments in which these symmetrically related points occur are recorded. In this way, one or more segments are found to be related to the initial segment. Since, by assumption, a whole segment is either in or out of the molecule, all the points of symmetrically related segments do not have to be generated from the initial segment to prove it is related. If a significant percentage of the points (usually taken to be 25% or greater, but this value can be varied) in a segment are found to be symmetrically related to points in the initial segment, then the whole segment is presumed to be related to the initial segment. This provision resolves the problem that symmetrical points do not always appear to be identical in the skeleton (see Fig. 1(b)). The following criteria can then be used to select between these two sets of symmetrically related segments.

(1) If any of the points that are generated by application of a symmetry operator to a segment falls out of bounds of the Cartesian map, then all the segments that are related to the initial segment by this symmetry operator must be in the neighboring molecule and are removed. (This is a direct consequence of the requirement that the Cartesian space encompass the whole molecule.)

(2) An appropriate value for the center of gravity of the molecule is read into the program. This number is usually known from packing considerations or from a low resolution map. The distance from the center of gravity of all the points in the initial segment that have symmetry mates, $d_i$, is computed. The distance from the center of gravity of the symmetrically related points, $d_s$, is also calculated. Great care must be taken in determining how to compare these two distances. Operationally, it is better to err on the side of caution and leave too many segments in the skeleton than to delete a part of the molecule. Therefore, a reasonably conservative test was chosen, namely, that the set of segments further from the center of gravity is removed only if the two distances differ by more than 20% of their average, i.e. if

$$\frac{2\,|d_i - d_s|}{d_i + d_s} > 0.20.$$

If the two distances computed do not differ by more than 20% of the average, neither set of segments is deleted.

(3) Having scanned through all the segments searching for symmetry-related segments, those segments which have no symmetry-related partners within the Cartesian map space are labeled as being definitively in "the molecule". Similarly, those segments which, as a result of the previous tests, have had all their symmetry-related partners removed are also labeled as being in the molecule.

(4) At this stage, the connectivity properties of the tree-like description bear full fruit. In cases where the distance criterion has failed to discriminate between the two possibilities, the neighbors of each set of segments are examined. If all the neighbors of only one of the two sets of segments have been marked for deletion, then that segment is deleted as well. Alternatively, if all the neighbors of one set of segments are either removed or indeterminate but at least one neighbor of the other set of segments is definitively in the molecule (see (3) above), then the latter are retained and the former deleted.

(5) Isolated segments, i.e. segments which are completely disconnected from the rest of the skeleton, are removed unless one of the two endpoints lies within a distance d of the rest of the retained skeleton. This d is usually taken to be 4 Å and permits the deletion of those isolated segments that are so far removed from the rest of the molecule as to be irrelevant.

(6) A similar criterion to (5) is applied to single points in the skeleton.

(7) In those cases where the above rules have failed to select one of the pair of symmetric segments, both are retained in the skeleton until additional information becomes available later in the analysis to distinguish between them. As will be seen in Application to Ribonuclease S, section (b), this involves only a very small minority of the segments.
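The segment-level symmetry test and the distance criterion (2) can be illustrated for the two-dimensional p2 case with a short sketch. It is written under stated assumptions and is not the original program: the function names are hypothetical, the distance of a set of points from the center of gravity is taken here to be the mean point distance (the text does not say how the points are combined), and the sample points are invented.

```python
import math


def twofold_image(point, axis):
    """Image of a 2-D grid point under a 2-fold axis located at `axis`."""
    (x, y), (ax, ay) = point, axis
    return (2 * ax - x, 2 * ay - y)


def is_related(initial, candidate, axis, min_fraction=0.25):
    """True if at least min_fraction of the candidate's points are
    symmetry mates of points in the initial segment."""
    images = {twofold_image(p, axis) for p in initial}
    hits = sum(1 for p in candidate if p in images)
    return hits / len(candidate) >= min_fraction


def mean_distance(points, center):
    """Assumed definition: mean distance of the points from the center."""
    return sum(math.dist(p, center) for p in points) / len(points)


def farther_set_to_remove(d_i, d_s, tolerance=0.20):
    """Criterion (2): return 'initial', 'mate', or None if undecided."""
    if 2 * abs(d_i - d_s) / (d_i + d_s) <= tolerance:
        return None
    return "initial" if d_i > d_s else "mate"


axis = (12, 5)                               # one of the 2-fold axes of Fig. 1
initial = [(10, 8), (11, 8), (12, 8)]        # invented segment points
candidate = [(14, 2), (13, 2), (12, 2), (11, 2)]
related = is_related(initial, candidate, axis)
```

When `farther_set_to_remove` returns `None`, the decision falls through to the neighbor-based rule (4) and, failing that, to rule (7).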

The above rules have been applied to the simulated two-dimensional electron density map. The center of gravity was taken as (16,16). The space group was p2 with 2-fold axes at (12,5) and (12,25). The resulting skeleton is illustrated in Figure 1(d). As can be seen from the Figure, segments 7 and 8 of Table 1, which are part of the central molecule, have been removed because they lie significantly further from the center of gravity than the symmetrically related segments 9 and 10 of the neighboring molecule. This problem can be solved using the following considerations.

Because of the very odd shapes of protein molecules, it is sometimes worthwhile to consider not only a single point for the center of gravity, but a number of points or even a locus of points. Thus, a distinctly multilobal structure, such as antibody (Davies et al., 1975), might have one center of gravity point for each lobe. In such cases, the program relates each point it is examining to the nearest center of gravity point when distances from the center of gravity are computed.

This procedure was tried on the two-dimensional map. Center of gravity points were chosen at (10,17) and (23,14), one for each leg of the molecule. The resulting skeleton is shown in Figure 1(e). In this case, segments 7 and 8 have been retained because they do not lie significantly further from the closer center of gravity point than segments 9 and 10.
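A minimal sketch of the multiple center-of-gravity idea follows, using the center points quoted above and a point near (6,24), where the text places the two short segments in question. The function name is hypothetical and the choice of a single test point is only for illustration.

```python
import math


def distance_to_nearest_center(point, centers):
    """Distance from a point to the closest of several center-of-gravity points."""
    return min(math.dist(point, c) for c in centers)


point = (6, 24)
single = distance_to_nearest_center(point, [(16, 16)])            # one center
double = distance_to_nearest_center(point, [(10, 17), (23, 14)])  # one per leg
```

With two centers the point lies considerably closer to its reference point than with the single center at (16,16), which is the effect that allows the distance test to retain such segments.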

The procedure described above is both a logical and practical improvement over the previous method (section 2(d) of Greer, 1974). The use of segments rather than groups of neighboring points confers several advantages which produce a better skeleton with great savings in computation times. Considering a segment as a single entity which is either in or out of the molecule avoids the necessity of costly searches through the skeleton for missing symmetrically related points. Furthermore, by means of a connectivity Table, neighboring segments can be used to decide whether a segment should be accepted or rejected. Lastly, a ligase program (section 2(e) of Greer, 1974) is no longer needed to correct errors caused by the removal of a few too many points due to the arbitrary nature of the groups of points considered by the old method.
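One way to picture the segment list with a connectivity Table at both ends, and the neighbor-based rule (4) that it makes possible, is the sketch below. It is an assumed representation, not the data layout of the original programs: the class, field and function names are hypothetical, each "set of segments" is reduced to a single segment for brevity, and the segment numbers are invented rather than taken from Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    seg_id: int
    points: list                                          # ordered grid points, node to node
    start_neighbors: list = field(default_factory=list)   # ids meeting at the first node
    end_neighbors: list = field(default_factory=list)     # ids meeting at the last node

    def neighbors(self):
        return set(self.start_neighbors) | set(self.end_neighbors)


def resolve_pair(a, b, status):
    """Rule (4) for two symmetry-related segments.
    status maps seg_id -> 'in', 'out' or 'unknown'.
    Returns the id of the segment to delete, or None if still undecided."""
    def all_out(s):
        n = s.neighbors()
        return bool(n) and all(status.get(i, "unknown") == "out" for i in n)

    def any_in(s):
        return any(status.get(i, "unknown") == "in" for i in s.neighbors())

    if all_out(a) != all_out(b):      # only one has all neighbors deleted
        return a.seg_id if all_out(a) else b.seg_id
    if any_in(a) != any_in(b):        # only one touches a definite segment
        return b.seg_id if any_in(a) else a.seg_id
    return None


a = Segment(1, [(5, 5), (6, 5)], start_neighbors=[3])
b = Segment(2, [(19, 5), (18, 5)], start_neighbors=[4])
to_delete = resolve_pair(a, b, {3: "in", 4: "unknown"})
```

Storing neighbors for both ends is what lets a traversal continue past either node of a segment, which a one-ended tree record would not allow once loops are present.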


(d) Simplification of the skeleton

The linear nature of the polypeptide chain requires that if any segment is connected to another segment via two (or more) segments (see segments 4 and 5 that lie between 2 and 6 in Fig. 1 and Table 1) then only one of the two paths can be main chain and the other must be a bridge of some kind. It is useful to detect loops of this type at this stage and remove the segment most likely to be the bridge. When this is done, the remaining segments (2, 5, 6 in Fig. 1(f) and Table 1) can be joined into one long segment which simplifies the interpretation of the skeleton considerably (see section (e) below).

Such a pair of segments is easily detected using the connectivity Tables of the tree-like description (see segments 4 and 5 in Table 1). The following rules are used to delete one of them from the skeleton.

(1) When only one of the pair of segments has side-chain branches, then it is retained and the other is rejected. If both segments have side chains, both are retained.

(2) When neither segment has side chains, the ratio of their lengths is examined. If one segment is four times longer than the other, and greater than 8 Å, then the longer segment is retained and the shorter removed. Otherwise the shorter segment is preserved and the longer deleted. The ratio of four and the 8 Å length are derived empirically and appear to differentiate satisfactorily between significant long segments and artefactual short ones.
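The two rules for a simple loop can be written as a small decision function. This is a sketch under stated assumptions, not the original code: the function name and argument names are hypothetical, and "four times longer" is read here as "at least four times as long".

```python
def segment_to_delete(len_a, len_b, a_has_side_chains, b_has_side_chains,
                      ratio=4.0, min_long=8.0):
    """For two segments joining the same pair of nodes, return 'a' or 'b'
    (the one to delete) or None (keep both). Lengths are in Å."""
    # Rule (1): side chains mark main chain.
    if a_has_side_chains and b_has_side_chains:
        return None
    if a_has_side_chains:
        return "b"
    if b_has_side_chains:
        return "a"
    # Rule (2): compare lengths when neither has side chains.
    longer, shorter = ("a", "b") if len_a >= len_b else ("b", "a")
    long_len, short_len = max(len_a, len_b), min(len_a, len_b)
    if long_len >= ratio * short_len and long_len > min_long:
        return shorter      # long path is significant, short one is artefactual
    return longer           # otherwise the shorter path is preserved
```

For example, a 12 Å segment paired with a 2.5 Å segment keeps the long one, whereas a 6 Å segment paired with a 1 Å segment keeps the short one because the longer path does not exceed 8 Å.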

More complex loops and cage-like structures occur in the skeleton as well. Although a variety of methods have been used to detect and remove such loops, there is no way to ensure that genuine structural features are not distorted by such a simplification. The strategy adopted at the present time, therefore, is to leave such complex loops in the skeleton and rely upon using a finer map grid definition (Greer, 1975) to properly represent these regions of the skeleton, if necessary.

(e) Selecting the main chain

The skeleton that remains as a result of all the previous operations represents primarily main chain and inter-main chain bridges of a single molecule with occasional side chains not removed in the previous stages of the analysis. The next step is to identify the main chain. The linear nature of the polypeptide chain is of prime importance in this regard. Thus when following the main chain, at each bifurcation of the skeleton, only one branch can be main chain and the other must be either inter-main chain bridge or side chain. In order to determine which segments of the skeleton are main chain, the unique properties of main chain must be considered.

(1) Any segment of the skeleton which is 10 Å or longer can automatically be classified as main chain. This is true since the inter-main chain bridges found in proteins are either hydrogen bonds or disulfide bridges; these are approximately 4 Å and 9 Å, respectively, from main chain to main chain while the longest side chain, arginine, is 7.3 Å long. An exception to this rule occurs if two long side chains interact to form a bridge. This does occur in protein maps (see Application to Ribonuclease S, section (c)), but only rarely.

(2) Any segment which has side chains must be main chain.

Analysis of the main chain proceeds as follows. The longest segment of the skeleton (that exceeds 10 Å) is selected and each end is pursued in turn, deciding which branch to follow at each fork in the skeleton. One branch is then labeled as main chain and the other is flagged as being either side chain or bridge. When the program cannot determine which branch to follow, e.g. if both segments are shorter than 10 Å and neither has side chains, then the program skips to the next longest segment and examines it in the same way until all the long segments have been studied. In order to increase the number of short segments that can be labeled as main chain, side chains are included from skeletons calculated with a lower ρmin as well as from the skeleton being processed (see section (a) above).

In this way, a series of lines is constructed, each consisting of several connected segments, that represents long stretches of the main chain of the protein. Since short segments that are indeterminate early in the pass through the segments may be identified as bridges later, the analysis is iterated until no more segments can be joined together.

The success of this process depends critically upon the choice of minimum density level discussed in section (a) above. If ρmin is too low, the skeleton becomes overconnected and many more decisions have to be made as to which segments are main chain and which are bridges. Furthermore, the average length of the segments decreases when it is overconnected and so there are fewer long segments that can be unequivocally defined as main chain. When ρmin is too high, the skeleton becomes too disjointed, and it is difficult to piece together the polypeptide chain. The higher the quality of the map, the more successful the procedure for identifying main chain becomes.

The above discussion would be sufficient for an ideal electron density map. In practice the identification of the main chain is complicated by a number of problems that commonly occur in actual protein electron density maps.

It is often the case in a protein structure that some portion of the main chain is partially disordered in the crystal and thus the corresponding electron density is either very weak or non-existent. This will cause a discontinuity in the skeletal representation of the molecule. Conversely, spurious density may cause a bridge or side chain to occur in the skeleton where density should not really be. When these two effects occur in particularly sensitive places, a variety of paradoxical situations can arise. For example, the main chain may appear to cross itself, or both segments at a bifurcation may appear to have side chains when only one can possibly be main chain, or two long (>10 Å) segments may be encountered at the fork, again when only one can possibly be main chain. In each such case, the procedure of fusing segments is halted, the paradox is noted with as much relevant information as possible, and then the analysis is continued with the next longest segment in the skeleton. Such paradoxes require the intervention of the protein crystallographer for their solution. It simply does not pay to attempt to automate all the sophisticated evidence that must be brought to bear in order to resolve these paradoxes. Considerable effort is being expended in determining what types of information are most useful to the crystallographer in order to realize what the problem actually is in each circumstance. Several illustrations of such dilemmas will be presented in Application to Ribonuclease S, section (c).

After all possible segments have been compiled into main-chain lines, a Table is created of all the possible ways that the ends of these main-chain lines might be connected, either through other segments in the skeleton or across gaps in the skeleton. This information, together with the known number of expected polypeptide chains, can be used to try to connect the main-chain lines in order to form the complete skeletal representation of the polypeptide chains. Developments of this type are being introduced manually at this stage in the analysis although eventually they might be automated. Either before selecting the main chain or before building rough atomic co-ordinates, segments can be added or deleted by the user. For example, after the main-chain skeletal lines are specified, and based upon the Table of how their ends might be joined, it may be obvious to the crystallographer that two ends should be joined because of their close proximity. If examination of a skeleton calculated at a lower ρmin confirms that the two ends are indeed connected at this lower density level, the segment in this lower ρmin map can be added to the list of the skeleton being interpreted and the building program can be instructed to connect the two main-chain lines using this added segment.

(f) Building a model, generating atomic co-ordinates

Having obtained skeletal line representations of the main chain, the next task is to build co-ordinates into this skeleton. Rough co-ordinates are generated which can be refined using model-building programs (Diamond, 1966; Hermans & McQueen, 1974; Hermans & Ferro, 1971), real-space refinement (Diamond, 1971, 1974), energy minimization (Levitt & Lifson, 1969; Levitt, 1974; Hermans & McQueen, 1974), and other established techniques (Jensen, 1974; Watenpaugh et al., 1971, 1973).

The first step is to record as α-carbon co-ordinates the intersection points of the side-chain branches (removed by the trace program, section (a) above) with the main chain. Since a large percentage of the side chains do not appear in the skeleton, either because they are glycines or, more likely, because of weak density for the connection to the side chain, another method is necessary to derive the α-carbons for the remaining residues. Such a method has previously been described (section 2(f) of Greer, 1974) and involves interpolation along the main-chain skeletal lines between the side-chain branch points using the "4 Å rule". Since the Cα-Cα distance is always close to 4 Å, in fact about 3.8 Å, α-carbons can be interpolated into the main-chain skeleton at regular intervals of approximately 4 Å. In this way, a list of all the α-carbon co-ordinates is prepared for each main-chain skeletal line. This procedure is not foolproof. Residues are occasionally missed or an additional residue inserted incorrectly. This is precisely what one might expect, and in fact occurs commonly when an electron density map is interpreted by a crystallographer in the absence of primary sequence information (Adams et al., 1973).
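The interpolation step can be illustrated with a short Python sketch. It assumes that the stretch of skeleton between two side-chain branch points is available as a list of (x, y, z) points, that the number of residues is taken as the path length divided by 3.8 Å and rounded, and that the interpolated positions are spaced evenly along the path; these choices and all names are assumptions made for illustration, not a description of the original program.

```python
import math

CA_SPACING = 3.8  # Å, the typical Cα-Cα distance quoted in the text


def path_length(path):
    """Total length of a polyline given as a list of (x, y, z) points."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def point_at(path, s):
    """Point at arc length s measured from the start of the polyline."""
    for p, q in zip(path, path[1:]):
        step = math.dist(p, q)
        if step > 0 and s <= step:
            t = s / step
            return tuple(a + t * (b - a) for a, b in zip(p, q))
        s -= step
    return path[-1]


def interpolate_alpha_carbons(path):
    """Interior Cα positions between two branch points (the path ends).

    The path is divided into n equal intervals, n being the path length
    divided by CA_SPACING and rounded; the n - 1 interior division
    points are returned.
    """
    total = path_length(path)
    n = max(1, round(total / CA_SPACING))
    return [point_at(path, total * k / n) for k in range(1, n)]
```

For a straight 11.4 Å stretch this yields two interior positions, 3.8 Å and 7.6 Å from the first branch point. Because the count is obtained by rounding, a stretch whose length falls near a half-multiple of 3.8 Å can gain or lose a residue, which is one way the errors mentioned above can arise.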

The next critical piece of information necessary to construct a model of the protein is the directionality of each of the main-chain lines. Directionality is most often determined by fitting a known primary sequence for the protein to the density for the side chains. Since the amino acid sequence is presumed unknown at this stage of the analysis, as is often the case, this method will not be used here. Another way to determine the directionality is to use the known handedness of the amino acid residue at the α-carbon. (This method requires knowledge of the correct enantiomorph of the structure. This can be determined either from anomalous scattering data or from examination of the handedness of the α-helices in the map.) If one forms a plane with the main-chain atoms N, Cα and C, whether Cβ appears above or below this plane determines which atom is N and which is C, since all residues in a protein are L. Attempts to apply this method to the skeleton have so far proved a total failure. The geometry around the α-carbon is not preserved sufficiently well in the skeleton, even in α-helical regions, to permit this method to work. Indeed, visual examination of the protein electron density maps shows that it is usually difficult to determine the handedness of the side chains in the original map. Consequently, some other procedure is necessary to establish the direction of the main-chain lines.

The electron density map was originally placed on a 1.0 Å grid in part to minimize the number of main-chain carbonyl oxygens which would appear as "side-chain" branches in the skeleton. Nevertheless, occasionally two side-chain branches are observed less than 2 Å apart, implying, by the 4 Å rule, that one of the branches must be a carbonyl oxygen. Unfortunately, one cannot tell which of the two is the carbonyl oxygen and which is the α-carbon by their lengths, since the carbonyl oxygen branch is often quite long, perhaps due to hydrogen-bonding to a water molecule or salt ion. Indeed, it is most probably this hydrogen-bonding which accounts for the fact that the skeletonizing procedure has preserved a side-chain-like branch at this carbonyl. Nevertheless, such a situation can be used to determine the chain directionality. Consider the two sequences of branches shown below, where + indicates a branch point on the chain and - indicates roughly a 0.5-Å length of the skeletal main-chain line.

      2.5   1.5    4.0
    +-----+---+--------+

       4.0    1.5   2.5
    +--------+---+-----+

The middle two branch points in each case must be an α-carbon and a carbonyl carbon, while the outer two branch points are clearly α-carbons. Since the C-N-Cα distance is around 2.5 Å, the C-Cα distance is around 1.5 Å, and the Cα-Cα length is about 4 Å, it is clear that the first case above represents Cα(i + 1)-(N)-C=O-Cα(i)-Cα(i - 1) and the second is Cα(i - 1)-Cα(i)-C=O-(N)-Cα(i + 1). If the distances are preserved sufficiently accurately in the skeleton, the above calculation will determine the chain directionality. When this distance test is applied to the main chain at each carbonyl oxygen in the skeleton, even if one of the immediately adjacent Cα's is not present, the deduced directionality is usually correct, especially when summed over several occurrences of a carbonyl oxygen in a single main-chain line. Of course, this method is only applicable when carbonyl oxygen branches are found in the main-chain line.
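The distance test and the summation over several carbonyl oxygens can be sketched as a simple vote. The sketch below assumes that, for each closely spaced pair of branch points, the skeletal distances to the neighbouring branch point on either side are both available; the case in which an adjacent Cα is missing is not handled. The function names and the string labels are illustrative assumptions.

```python
def vote(d_before, d_after):
    """Direction vote from one carbonyl occurrence.

    d_before and d_after are the distances (Å) along the line, in the
    order of traversal, from the neighbouring branch points to the
    close pair. A long-then-short pattern (about 4.0, 1.5, 2.5) means
    the traversal runs from residue i - 1 to i + 1, i.e. N to C (+1);
    short-then-long (about 2.5, 1.5, 4.0) means C to N (-1).
    """
    if d_before > d_after:
        return 1
    if d_before < d_after:
        return -1
    return 0


def chain_direction(observations):
    """Sum the votes over all carbonyl occurrences in one line."""
    total = sum(vote(before, after) for before, after in observations)
    if total > 0:
        return "N->C"
    if total < 0:
        return "C->N"
    return "undetermined"
```

With three occurrences measured as (4.1, 2.3), (3.7, 2.9) and (2.6, 3.9), two votes favour N to C and one favours C to N, so the line is assigned N to C. This shows how summing over occurrences can absorb an individual mismeasurement.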

The crystallographer may have some information about the directionality of the main-chain lines from outside sources, e.g. from the heavy-atom binding sites or rudimentary sequence data. Therefore, the program provides the option of specifying the directionality of a line and such specification supersedes any determination of direction made by the program.

The co-ordinate information available at this stage can now be summarized. Rough α-carbon co-ordinates have been determined for each residue. Where a side-chain branch has been detected in the skeleton, β-carbon co-ordinates are derived by moving 1.54 Å out along the side-chain branch. For purposes of model building, such a residue is designated alanine, while an α-carbon position which has no side chain is called glycine. In addition, in those rare instances where a main-chain carbonyl carbon and oxygen have been detected, the carbonyl carbon co-ordinates are set to the intersection of the branch with the main chain and the oxygen co-ordinates are taken as the point on the branch 1.24 Å from the main chain.
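A minimal sketch of the β-carbon placement and the provisional residue designation follows. It approximates "moving out along the branch" by a straight step from the branch point toward the next skeleton point on the branch, which is an assumption made to keep the example short; the dictionary layout and the function names are likewise hypothetical.

```python
import math

CA_CB_BOND = 1.54  # Å, step taken along the side-chain branch


def point_along(origin, toward, distance):
    """Point at the given distance from origin in the direction of toward."""
    d = math.dist(origin, toward)
    if d == 0:
        raise ValueError("branch direction is undefined")
    return tuple(o + distance * (t - o) / d for o, t in zip(origin, toward))


def build_residue(ca, branch_next=None):
    """Provisional residue: alanine if a branch exists, otherwise glycine.

    ca is the branch point on the main chain; branch_next is the next
    skeleton point along the side-chain branch, or None if no branch
    was detected at this position.
    """
    if branch_next is None:
        return {"name": "GLY", "CA": ca}
    return {"name": "ALA", "CA": ca,
            "CB": point_along(ca, branch_next, CA_CB_BOND)}
```

The same `point_along` helper, called with a distance of 1.24 Å, would give the carbonyl oxygen position from a carbonyl branch point under the same straight-line assumption.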


(g) More sophisticated model building

Supplying crude α-carbon co-ordinates for many of the residues and α- and β-carbons for the rest is an inadequate description of the main chain. Information must be derived about the peptide bond plane for each residue from the map or the skeleton. Unfortunately, this is not easily done in general. However, there exists a large number of cases where rudimentary model building can be performed, namely in regions of secondary structure. If secondary structure regions, such as α-helix, parallel β-sheet and antiparallel β-sheet, can be detected, then an "ideal" helix or sheet can be fitted onto the rough α-carbon co-ordinates, producing provisional co-ordinates for all the main-chain atoms.

This procedure is applied in the case of α-helix as follows. All the α-carbons of the main chain of the skeleton are examined to find those which obey the rule that the distance between

Cα(i) and Cα(i + 3) ~ 6.0 Å and Cα(i) and Cα(i + 4) ~ 6.0 Å.

If a stretch of two or more residues obeys these rules, then an α-helix of six or more residues is assumed. The α-carbon co-ordinates of an ideal α-helix (generated with the φ value and ψ = -47.52° taken from Diamond's model building program (Diamond, 1966)) are then rotated onto the α-carbon co-ordinates of this stretch of the main chain using a least-squares fit procedure. The co-ordinates of this rotated ideal helix are used only if the sum of the squares of the deviations (d_i) between the two sets of α-carbon co-ordinates divided by the square of the number of residues (N) is less than 0.75; i.e.

\[ \sum_i d_i^2 / N^2 \le 0.75. \]

The number of residues is used as the square rather than the single power in this deviation test in order to tolerate a larger deviation in longer helices. It turns out, though, that this is not a critical factor. The real importance of this deviation test is to differentiate helices of length around six residues from turns in the chain.
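The effect of dividing by N² rather than N can be seen in a small numerical sketch. The function below assumes that the ideal helix has already been superposed on the skeleton α-carbons (the least-squares rotation itself is not shown); the function name and argument layout are illustrative.

```python
import math


def passes_deviation_test(ideal, observed, threshold=0.75):
    """Apply the test sum(d_i^2) / N^2 <= threshold.

    ideal and observed are equal-length lists of (x, y, z) Cα
    co-ordinates, with ideal already rotated onto observed.
    """
    if not ideal or len(ideal) != len(observed):
        raise ValueError("co-ordinate lists must be non-empty and equal in length")
    n = len(ideal)
    sum_sq = sum(math.dist(p, q) ** 2 for p, q in zip(ideal, observed))
    return sum_sq / n ** 2 <= threshold
```

If every α-carbon deviates by 2 Å, the statistic is 4N/N² = 4/N. A six-residue stretch gives 0.67 and passes, whereas a five-residue stretch gives 0.80 and fails, so the same per-residue misfit is tolerated only in the longer helix.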

Experience has shown that once a helical region passes the above deviation test, it is best to omit the last residue and in practice construct a helix of length n - 1. This is because the first non-helical residue after a helix will almost always fulfill the above helix selection rule and be included as part of the helix even though it does not belong.

A similar process is applied to building β-sheet. The rules for detecting β-sheet are that the distances between

Parallel: Cα(i + k) and Cα(j + k) ~ 6.0 Å

for k = 0, 1, 2, ..., n and |i - j| > 3

Antiparallel: Cα(i + k) and Cα(j - k) ~ 6.0 Å

for k = 0, 1, 2, ..., n and |i - j| > 3.

When n is three or greater, a standard β-sheet is built into this stretch of the skeleton. The standard antiparallel β-sheet consists of two strands, each 13 residues long, derived originally from the co-ordinates for residues 80 to 86 and 97 to 103 of ribonuclease S (Wyckoff et al., 1970; Richards & Wyckoff, 1973). No standard parallel β-sheet has been adopted as yet. This will be implemented in the near future. Before rotating the standard sheet onto the region of the skeleton, the position of the β-carbons of the skeleton relative to the plane of the β-sheet is determined in order to align the skeleton β-sheet to the standard one. The α-carbon co-ordinates in both strands of the standard β-sheet are fitted to the α-carbons of the skeleton by least-squares fit. Only if the sum of the squares of the deviations between the α-carbons (d_i) divided by the square of the number of residues in the two strands (N) is less than or equal to 1.0, that is

\[ \sum_i d_i^2 / N^2 \le 1.0, \]

is the model building for this region completed. Once again N² is used rather than N to favor building of longer β-sheets. This test is particularly important in this case because, although all the β-sheets observed so far in protein structures have a right-handed twist, the twist angle varies considerably from sheet to sheet and from protein to protein. Nevertheless, no effort has been made here to vary the twist angle of the standard sheet to match the skeleton. Rather, the least-squares procedure is used to distribute any misfit of the standard sheet because of a difference in the twist angle. It is presumed that the co-ordinates will be corrected by the refinement programs. If the twist differs too greatly, the model building of this region will be aborted by the deviation test.

Building secondary structure into the skeleton allows a large portion of the molecule to be determined with a fair degree of accuracy. However, a significant part of the molecule remains with only Cα and sometimes Cβ co-ordinates specified. Methods are being developed to derive information about the orientation of the peptide plane in these cases (Greer & Levitt, unpublished results).

3. Application to Ribonuclease S

In order to properly present the results on the interpretation of the structure of ribonuclease S, parts of the skeleton will be described as corresponding to specific residues solely for purposes of geographic location in the molecule. Figure 2 shows a drawing of ribonuclease S using the measured co-ordinates of Richards & Wyckoff (1973). By referring the residue labels described in the text to this Figure, the reader should be able to locate the designated area in the drawings of the respective skeletons. These labels are used for descriptive purposes only; they have not been used in any way during the interpretation process. Later in this section, the residue numbers predicted by the program (shown as primed labels) will be compared to the actual residue numbers (shown as unprimed labels) of the ribonuclease S structure (Richards & Wyckoff, 1973).

(a) Skeletonization and tracing

The preparation, skeletonization, removal of side chains and preliminary tracing of the skeleton for the 2.0 Å electron density map of ribonuclease S (Wyckoff et al., 1970) have been described in detail (Greer, 1974). The skeleton was calculated for a Cartesian map on a 1.0-Å grid. The ρmin used was 200 and the contour interval 50, on a scale where the highest density point was 841 and the lowest -449. Out of an initial 67,760 points in the map, 1654 remained after skeletonization. Side chains and chains touching the boundary of the map and having no branches were deleted in the next step, leaving a total of 1308 points. Figure 4 of Greer (1974) showed the resulting skeleton in a standard view of ribonuclease S. Because of the presence of neighboring chains, the plot was so complex that little or no structural information was deducible.


FIGS 2, 4, 5, 7, 8, 10 and 11. Stereo presentations of the molecule. A stereoviewer is necessary in order to see these Figures in 3 dimensions. All of these drawings show the complete ribonuclease S molecule in the B-view of Richards & Wyckoff (1973). In the skeletal drawings, the single density points have been suppressed. The scale bar corresponds to 5 Å.

FIG. 2. Drawing of ribonuclease S produced by connecting the α-carbon positions with straight lines. The α-carbon co-ordinates are those of Richards & Wyckoff (1973). The disulfide bridges have been included as well. This Figure is included to serve as a geographic key to the residue numbers of ribonuclease S for the skeletal drawings in Figs 4, 5, 7 and 8.

(b) Tree-formation, isolation of the molecule and simplification

The skeleton was then transformed from a list of points into a tree-like structure. The 1308 points were organized into 274 segments, the longest of which has 54 points and is 73.0 Å in length. Of these 274 segments, 95 are totally isolated segments; that is, each segment is completely disconnected from the rest of the skeleton. A histogram of the lengths of these segments is shown in Figure 3(a).

The next step was to remove the neighboring molecules. Using the six symmetry operators of the space group P3₁21 of ribonuclease S, the symmetry-related segments were generated. The criteria described above in Method of Attack, section (c) were used to determine which of the segments to retain and which to delete. The numbers of segments removed or retained are summarized in Table 2. A corresponding histogram of the lengths of the segments retained is shown in Figure 3(b). The skeleton produced by this step is shown in Figure 4. Of the original 274 segments of 1308 unique points, 139 segments remain, consisting of 628 points. The status of only 8 out of the 274 segments was indeterminate; that is, none of the criteria could be used to determine if they or their symmetry mates should be deleted. This is quite a small portion of the skeleton of the molecule. In fact, none of these segments is in the molecule. Virtually all the small isolated segments that are retained could have been deleted. None represents density of part of the molecule. One segment violates the basic premise of this method, that a segment must be all in, or all out, of the molecule. The segment jutting out at residue 10 (see Fig. 4) consists partly of the side chain of Arg10 and partly of the side chain of Asp53 of the neighboring molecule with which it forms an intermolecular salt-bridge in the crystal. This segment has been retained in the molecule and, as will become apparent, makes the selection of the main chain of the S-peptide of ribonuclease S very difficult. Once again, the skeleton of the isolated molecule shows a striking improvement over the total skeleton for this region of the electron density map.

FIG. 3. Histograms of the lengths (Å) of the segments in the total skeleton of ribonuclease S (a) and of the segment lengths in the skeleton of the isolated molecule (b). Isolated and connected segments are plotted with different symbols.

TABLE 2

Disposition of segments during molecular isolation for the ribonuclease S skeleton

Criteria for retention or removal        Disposition   Number of segments   %
Isolated chain, further than 4 Å         Removed       26                   9.5
All neighbors are absent                 Removed       6                    2.2
Further from center of gravity           Removed       88                   32.1
Touches edge and out of bounds           Removed       15                   5.5
Indeterminate status                     Retained      8                    2.9
All symmetry-related segments removed    Retained      93                   33.9
Unique segment                           Retained      35                   12.8
Neighbor is definitely in molecule       Retained      3                    1.1
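As a consistency check, the counts in Table 2 can be summed and the percentages recomputed. The short Python sketch below does only this arithmetic; the variable names are arbitrary and the figures are those quoted in the Table and text.

```python
# (criterion, disposition, number of segments) as listed in Table 2
TABLE_2 = [
    ("Isolated chain, further than 4 A", "Removed", 26),
    ("All neighbors are absent", "Removed", 6),
    ("Further from center of gravity", "Removed", 88),
    ("Touches edge and out of bounds", "Removed", 15),
    ("Indeterminate status", "Retained", 8),
    ("All symmetry-related segments removed", "Retained", 93),
    ("Unique segment", "Retained", 35),
    ("Neighbor is definitely in molecule", "Retained", 3),
]

total = sum(n for _, _, n in TABLE_2)
removed = sum(n for _, d, n in TABLE_2 if d == "Removed")
retained = sum(n for _, d, n in TABLE_2 if d == "Retained")
# Percentage of the total skeleton, to one decimal place
percentages = [round(100.0 * n / total, 1) for _, _, n in TABLE_2]
```

The totals come to 274 segments, of which 135 are removed and 139 retained, in agreement with the 139 segments quoted in the text, and the recomputed percentages match the final column of the Table.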

The simplifying program was then used to remove simple loops that appear in the skeleton. Nine simple loops were encountered, eight of which were removed. In the one odd case, around residues 55 to 56 (see Fig. 4), both possible segments in the loop contained side chains and so neither could be removed. The segments involved are summarized in Table 3. The skeleton at the end of this stage contains 120 segments and is shown in Figure 5.

The deletion of two of these loops, around residues 65 and 117, causes certain difficulties later in the analysis and illustrates some of the major problems that are encountered in the interpretation of the electron density maps. Because the electron density for residues 66 to 69 is very weak in the map (Wyckoff et al., 1970), the skeleton for these residues is absent (see Fig. 4). Consequently, a side result of removing a loop around residue 65 is to join the disulfide bridge, 65 to 72, to the skeleton of the main chain at residue 65 to form one segment. The situation at residue 117 is more complicated. In this case, two awkward situations occur together. First, the density for the main chain between residues 117 and 118 is weak and the skeleton is therefore missing (Fig. 4). Second, because the density in the map for the disulfide bridge between 58 and 110 touches residue 117, the skeleton has connected residues 58 to 117 to 110 via this bridge. A loop is thereby created between residues 110 and 117; one arm consists of the main chain of residues 111, 112, ..., 116 and the other arm is part of the S-S bridge. The latter is much shorter, 3.46 Å, than the actual main-chain arm, 23.66 Å, and so it is removed from the skeleton (Table 3), leaving 110 and 117 connected by main chain while the disulfide bridge now connects 58 to 117. As a result of the removal of the loop and the absence of skeletal main chain connecting 117 to 118, the disulfide bridge is joined to the main chain at residue 117. As will be shown later (section (c) below), the program that searches for the main chain will have to detect these erroneous attachments of the disulfide bridges to main-chain skeleton and correct them.

FIG. 4. (See legend to Fig. 2.) Skeletal drawing of the isolated molecule of ribonuclease S. This Fig. can be compared with Fig. 5 of Greer (1974). The plus marks indicate the positions of the removed side-chain branches.

TABLE 3

Summary of simple loops treated by the program and which segments are removed

Residues of   Segment 1   Segment 2   Segment                          Fused segment§
molecule      (Å)         (Å)         removed   Reason                 Residues               Length (Å)
14-15         6.29        1.41        1         Length < 8.0 Å
16-17         3.15        3.15        1         Side chain             10 to 17               34.14
28-30         9.80        1.41        2         Side chain
31-32         5.20        1.73        2         Side chain             26 to 40               58.11
Side chain    2.41        4.56        2         Length < 8.0 Å         Side chain of Tyr 76
43-44         1.73        4.56        1         Side chain             42 to 46               17.51
55-56         2.73        4.89        None      Side chains in both
64-65         5.56        1.41        1         Length < 8.0 Å         58 to 65 to 72†        37.43
110-117       3.46        23.66       1         Side chains            108 to 117 to 58‡      36.09

† Includes 58 to 65 and the S-S bridge between 65 and 72 (see text).
‡ Segment 108 to 117 has the S-S bridge between 117 and 58 attached (see text).
§ This fused segment corresponds to the new segment formed from the 2 segments on each side of the loop together with the retained segment of the loop.

FIG. 5. (See legend to Fig. 2.) Skeleton of ribonuclease S with simple loops removed. The removed segments are shown by dashed lines in this Fig. Compare with Table 3.

(c) Selection of the main chain of ribonuclease S

The condensation of segments into main-chain lines is shown schematically in Figure 6 and summarized in Table 4. It is useful to describe the progress of the program in some detail so that the nature of the various paradoxes encountered will become clear. The procedure starts with the longest segment, 58.1 Å, which corresponds to residues 26 to 40 (compare Fig. 5 with Fig. 2). This segment is then extended from residue 40 in a series of steps (Fig. 6) through the segment which corresponds to residue 52. At this point a cage-like structure is encountered in the skeleton and the program is unable to select which segment to follow from amongst the possibilities. The procedure now attempts to extend this main-chain line from the other end of the starting segment, i.e. at 26, and finds two short segments which represent residues 25 to 26 and the disulfide bridge between 26 and 84. Not knowing, at this stage, which of the two to follow, the program terminates consideration of this main-chain line for this pass through the segments.


FIG. 6. Schematic representation of the main-chain selection procedure for ribonuclease S. Segments which are longer than 10 Å or which contain side-chain branches are drawn in thick lines. Thin lines are segments which are short and have no side-chain branches. The thin broken lines are segments which have been identified as bridge or side chain. The segment lengths are recorded in the middle of each segment and the appropriate residue label is placed at the end of each segment. The large numbers are the main-chain line numbers of Table 4. The arrow heads indicate the direction of the selection process; each main-chain line always starts with the largest segment. The reasons for line termination are coded by sh, meaning two short segments encountered, and sc, meaning side chains in both segments. The thick broken lines correspond to the disulfide bridges that were removed from the respective segments in order to resolve the paradoxes. (See Fig. 7 for a drawing of the result.) The manually added connections between the main-chain lines are shown by curly brackets. (See the result in Fig. 8.) See text for a detailed description.

In a similar way, the procedure begins with a 53.8-Å segment of residues 95 to 108 and extends this first segment to the segment for residues 108 to 117, which includes the density for the disulfide bridge to 58 (see section (b) above). At this point a paradox is discovered. The two segments which occur at this bifurcation have side chains. In one case the program is attempting to follow a segment which represents residues 58 to 56 and in the other it is trying to follow 58 to 65. This paradox can be resolved by separating the S-S bridge from the end of the segment for 117 to 108. At the other end of this main-chain line, segments are combined until residue 84 is reached, where the pattern of small segments forming a cyclobutane-like ring (see Fig. 5) prevents the program from continuing (Fig. 6).


TABLE 4

Main-chain lines found for ribonuclease S

Line   Length (Å)   Region of molecule   No. of segments   Largest segment (Å)
1      127.33       84 to 117            4                 53.85†
2      107.72       26 to 52             6                 58.11
3      60.92        69 to 84             4                 17.95
4      37.97        56 to 65             2                 37.43‡
5      34.14        10 to 17             1                 34.14
6      25.46        118 to 124           3                 13.76
7      23.56        3 to 10              1                 23.56
8      11.71        Side chain           1                 11.71

† This segment had the 58 to 117 S-S bridge included. It is removed before the line is completed (see text).
‡ This segment had the 65 to 72 S-S bridge included. It is removed before the line is completed (see text).

The program then starts a new main-chain line with a 37.4 Å segment from 58 to 65 plus the disulfide bridge to 72 (see section (b) above). To the 58 end is added a segment for 58 to 56. At 56 two segments are encountered, both of which have side-chain branches. These two correspond to the looped segments which could not be eliminated in section (b) above. This paradox cannot be resolved by the program. At the opposite end of this line, at 72, another paradox occurs. Two segments which correspond to the main chain between 72 and 69 (12.54 Å) and between 72 and 76 (16.54 Å) are both longer than 10 Å and both have side chains. Once again, this paradox can be resolved by separating the disulfide bridge from the main-chain segment for 58 to 65. Propagation of this main-chain line in this direction therefore ceases at residue 65.

The segment from 76 to 80, 17.9 Å, forms the nucleus for another main-chain line. This is extended to 69 on one end and to 84 at the other end, where it ceases due to the ring of small segments as described above. Residues 118 to 124 are combined from three segments.

Another paradox occurs when the main chain of the S-peptide, residues 1 to 20, is examined. This paradox arises because of the 11.7 Å segment which corresponds to the Arg side chain of residue 10 and the Asp side chain of residue 53 of a neighboring molecule (see section (b) above). As a result, three long segments meet at the α-carbon position of residue 10: a 23.6 Å segment which corresponds to 3 to 10, a 34.1 Å segment for 10 to 17, and the 11.7 Å side-chain segment (see Fig. 5 and Table 4). Which two of these three should be connected will be discussed shortly.

Having completed the first pass through all the segments, the main-chain selection procedure is repeated, but no additional segments can be added. Thus, at the end of the selection we are left with eight main-chain lines corresponding to residues 26 to 52, 84 to 117, 56 to 65, 69 to 84, 118 to 124, 3 to 10, 10 to 17, and the side chain of 10 (Fig. 6). The discontinuities between these lines are caused by weak density at 66 to 69 and 117 to 118 and by complex skeletal patterns at 52 to 56 and 84, while an overlong side chain confuses the issue at residue 10 in the S-peptide. Figure 7 shows the skeleton for these eight main-chain lines and associated inter-main chain bridges and side chains.


FIG. 7. (See legend to Fig. 2.) Skeletal lines selected as main chain for ribonuclease S (solid lines). The segments that have been identified as inter-main chain bridges or side chains during the selection process are shown as dotted lines.

A Table is now produced showing how these eight main-chain lines might be connected (Table 5). The lines are numbered as in Table 4. Such connections are being performed manually at this time, but might conceivably be automated at some stage. The first part of the Table tries to connect the ends through existing skeletal segments, if possible. This is intended to suggest how main-chain lines might be connected through areas of the skeleton where the main chain could not be chosen unambiguously. When several connections exist between two or more line ends, the shortest connection is usually selected. A number of lines can be merged in this way: lines 1 and 3 should be joined with a 1.4 Å link (residue 84). Lines 2 and 4 can be joined either with a 9.3 Å or an 11.5 Å link (residues 52 to 56). This ambiguity arises because of the simple loop that could not be removed when both segments contained side chains (see section (b) above). In this case, the longer of the two links, 11.5 Å, was selected (Fig. 6), because it contained more side chains than the alternative. In practice, either could have been chosen with little effect on the final predicted co-ordinates.

The second part of the Table lists the distances between ends of main-chain lines that are less than 10 Å. This allows the joining of segments across gaps in the skeleton. One end of line 3 is 5.1 Å from an end of line 4 (65 to 69) and one end of line 1 is 6.7 Å from an end of line 6 (117 to 118). All other distances between ends of lines (excluding the three lines which meet at residue 10) are greater than 10 Å, as can be seen in Table 5. It seems likely therefore that these lines should also be connected.
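The end-to-end search in the second part of Table 5 amounts to a pairwise distance filter over the line ends. The sketch below is only an illustration and not the original program: the function name is hypothetical, and the co-ordinates are invented so that, on an assumed 1 Å grid, the two distances quoted above (5.10 Å and 6.71 Å) are reproduced.

```python
import math
from itertools import combinations

def close_line_ends(ends, cutoff=10.0):
    """Return (label_a, label_b, distance) for every pair of line ends
    closer than `cutoff` (in Å), sorted by increasing distance.

    `ends` maps an end label such as "3b" to an (x, y, z) position in Å.
    The two ends of one line (same line number) are never paired.
    """
    pairs = []
    for (a, pa), (b, pb) in combinations(sorted(ends.items()), 2):
        if a[:-1] == b[:-1]:          # both ends belong to the same line
            continue
        d = math.dist(pa, pb)
        if d < cutoff:
            pairs.append((a, b, d))
    return sorted(pairs, key=lambda t: t[2])

# Hypothetical co-ordinates, NOT taken from the ribonuclease S map.
ends = {
    "3b": (0.0, 0.0, 0.0),
    "4b": (1.0, 3.0, 4.0),    # sqrt(26) Å from 3b
    "1b": (40.0, 0.0, 0.0),
    "6a": (42.0, 4.0, 5.0),   # sqrt(45) Å from 1b
}

for a, b, d in close_line_ends(ends):
    print(f"{a} - {b}: {d:.2f} Å")
```

Sorting by distance puts the most plausible gap closures first, which matches the preference for short connections expressed in the text; the decision to join two ends would still rest with the crystallographer.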

The third part of Table 5 shows all distances of less than 6 Å between the ends of main-chain lines and other segment tips in the skeleton. This section of the Table indicates that the two connected segments, 51, 1.41 Å long, and 50, 5.88 Å long, would bridge lines 3 and 4 with two small gaps of 4.69 Å and 4.24 Å, respectively. Close examination of this area in Figure 5 (around residues 67 to 68) shows that these segments are in perfect position to connect the two lines. A skeleton of ribonuclease S, calculated at a lower Pmin = 100, instead of 200, was examined in order to find segments that might span the gaps between lines 3, 4, and connecting segments


TABLE 5

Possible connections between main-chain lines via skeletal segments or across gaps

Line ends      No.      Length      End pt          Equivalent residues
From   To      lines    (Å)         distance (Å)    From    To

Lines connected by segments
1a     3a      1         1.41421    1.41421         84      84
1a     3a      2         4.87831    1.41421         84      84
1a     2a      2         8.70674    6.40312         84      26
2a     3a      2         7.29253    5.91608         26      84
2b     4a      2        11.48868    6.08276         52      56
2b     4a      2         9.34241    6.08276         52      56

End-to-end distance less than 10.00 Å
1b     6a      none     none        6.70820         117     118
3b     4b      none     none        5.09902         69      65
5a     7a      none     none        0.0             10      10
5a     8a      none     none        0.0             10      10
7a     8a      none     none        0.0             10      10

End-to-end distance less than 6.00 Å †
3b     51      none     none        4.69042         69      68
3b     49      none     none        5.38516         69      side chain of 68
3b     49      none     none        5.91608         69      67-68
3b     50      none     none        5.91608         69      67-68
3b     51      none     none        5.91608         69      67-68
4b     49      none     none        5.38517         65      side chain of 68
4b     49      none     none        5.38517         65      67-68
4b     50      none     none        6.38517         65      67-68
4b     50      none     none        4.24264         65      67
4b     51      none     none        5.38517         65      67-68
4b     51      none     none        4.47214         65      68

† The entries in this section of the Table refer only to main-chain lines 3 and 4. The corresponding entries for the other lines have been omitted for brevity as they are not relevant to the interpretation of the molecule.

50 and 51, and between lines 1 and 6. A segment was found to close each of the gaps between lines 3, 4 and their neighboring segments, with lengths 6.67 Å and 5.56 Å, respectively, giving a total link connecting lines 3 and 4 of 19.51 Å (see Fig. 6). The only segment in the Pmin = 100 skeleton that could be found that would connect lines 1 and 6 was over 12 Å long and connected residue 117 of line 1 and residue 118 of line 6 via the skeleton which corresponds to residues 109 to 110. This is due to the problem with the S-S bridge between 58, 110, and 117 discussed above. In order to avoid this obviously false connection via 110, an artificial segment, 7.3 Å long, was constructed by hand to connect 118 to 117. When these are added to the segment list, a long main-chain line is produced which spans the molecule from residue 26 to 124 and is 399.11 Å long.

Unfortunately, Table 5 cannot help us decide how to connect the segments that represent the main chain of the S-peptide, 3 to 17. However, the following argument might be used to resolve this dilemma. If two of the three segments are main chain, the third must be side chain or inter-main chain bridge. But no side chain or bridge is longer than 9.0 Å (see Method of Attack, section (e)). The third segment may be


composed of a side chain forming a bridge to another side chain. Since all three segments end in tips and are not connected to the rest of the skeleton of the molecule, it is possible that one of these segments may bridge this molecule to one of its neighbors in the crystal. When the three segments are examined in the molecular isolation program, it becomes apparent that the 11.7 Å segment is composed of a side chain of residue 10 and a side chain of a neighboring molecule (Asp53), as has been pointed out above (see section (b) above). Hence the two remaining lines, 3 to 10 and 10 to 17, should be connected, giving a chain 57.70 Å long, and this paradox is tentatively resolved.
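The length argument used at residue 10 can be stated as a simple test. The following is a minimal sketch under stated assumptions, not part of the original program system: the function and constant names are hypothetical, and only the 9.0 Å limit and the three segment lengths come from the text.

```python
MAX_SIDE_CHAIN = 9.0  # Å; longest side chain or bridge allowed (Method of Attack, section (e))

def junction_is_paradox(lengths, limit=MAX_SIDE_CHAIN):
    """Three segments meet at one α-carbon. Two of them should be main
    chain and the third a side chain or inter-main chain bridge.
    Return True when every segment exceeds `limit`, i.e. when none of
    the three can be a single side chain or bridge."""
    if len(lengths) != 3:
        raise ValueError("expected exactly three segment lengths")
    return min(lengths) > limit

# Segment lengths (Å) meeting at residue 10 of the S-peptide.
print(junction_is_paradox([23.6, 34.1, 11.7]))
```

A junction flagged in this way is a candidate for inspection in the molecular isolation program, since an over-long third segment may be a contact with a neighboring molecule.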

As a result of the above manipulations, two long main-chain lines have been produced (Fig. 8). Applying the 4 Å rule roughly, it is clear that the shorter line corresponds to the S-peptide while the longer main-chain line is the S-protein. The lines can now be processed by the building program.

FIG. 8. (See legend to Fig. 2.) Total interpreted main chain for ribonuclease S. This Figure includes the main-chain lines of Fig. 7 together with the manually added connections at residues 52 to 56, 65 to 69, 84 and 117 to 118. The line for the side chain of Arg10-Asp53 has been removed and the main chain for residues 3 to 17 connected.

(d) Building atomic co-ordinates for ribonuclease S

Having produced two main-chain lines corresponding to the S-protein and the S-peptide of ribonuclease S, the next hurdle that must be overcome is determining the directionality of these main-chain lines. The program searches for isolated occurrences of "side-chain branches" less than 2 Å apart. Three such cases appear in the main-chain line corresponding to the S-protein, and based upon the distances to neighboring Cα positions, all predict the N to CO direction consistently and correctly. In addition, in four cases in this main-chain line, a pair of consecutive branch separations is less than 2 Å. In this situation, the middle branch is designated a CO, but no attempt can be made to use these situations to augment the determination of the directionality.

Unfortunately, no isolated short inter-branch lengths are found in the line which represents the S-peptide. Consequently, the program would be powerless at the present state of sophistication to continue model-building this main-chain line.


However, we can take advantage of known chemical information to assign a directionality to this skeletal line. Since ribonuclease S is generated from a single polypeptide chain, ribonuclease A, by digestion with subtilisin (Richards & Vithayathil, 1959), the end of the S-peptide should be able to connect with the N-terminus of the S-protein determined above. The distances from the two ends of the S-peptide skeletal line to the S-protein N-terminus are 27.5 Å and 18.6 Å. Both the considerably shorter distance and the topology of the ends in Figure 8 indicate which end must be the C-terminus of the S-peptide. The assignment of the directionality of the S-peptide is therefore conferred manually at the beginning of the building process.

All the intersections of main chain with side-chain branches are designated as α-carbons. Then, the 4 Å rule is applied to the stretches of main chain between branches. Using an empirically derived Cα-Cα skeletal length of 3.886 Å, 48 α-carbons were built into the skeletal line of the S-protein, in addition to the 57 which have side-chain branches. Of the 17 α-carbons found in the S-peptide, 10 had side chains while 7 were calculated by interpolation. The statistics for these lines are summarized in Table 6.
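The interpolation step can be pictured as dividing each branch-free stretch of main chain into residue-sized steps. The sketch below is an illustration only: the function name is hypothetical, and rounding to the nearest whole number of steps with even spacing is an assumption, since the exact rule is given in Method of Attack rather than here. Only the 3.886 Å skeletal length is taken from the text.

```python
CA_CA_SKELETAL = 3.886  # Å; empirical Cα-Cα skeletal length quoted in the text

def interpolated_positions(stretch_length, spacing=CA_CA_SKELETAL):
    """Distances (Å) along a stretch between two branch α-carbons at
    which extra α-carbons would be placed.

    Assumption: the stretch is split into the nearest whole number of
    residue steps and the extra α-carbons are spaced evenly along it.
    """
    steps = max(1, round(stretch_length / spacing))
    step = stretch_length / steps
    return [i * step for i in range(1, steps)]

# A stretch of about one residue needs no extra α-carbon;
# a stretch of about three residues receives two.
print(interpolated_positions(4.0))
print(interpolated_positions(11.7))
```

Distances are measured along the skeletal line, so converting them to co-ordinates would require walking the skeleton points of the stretch, which is not shown here.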

TABLE 6

Summary of statistics for the main-chain skeletal lines

Main chain   Length   Residues   Number with   Mean Cα-Cα     Actual      No. of residues       No. of residues
             (Å)      found      side chains   distance (Å)   residues    in actual structure   Added   Missed

S-peptide     57.70    17        10            3.5181         3 to 17     15                    2       0
S-protein    399.11   105        57            3.8376         26 to 124   99                    8       2
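The columns of Table 6 can be cross-checked against each other and against the counts quoted in the text. The sketch below is an illustrative check only: the dictionary keys and function names are hypothetical, and the relation "found = actual + added - missed" is an observation about the tabulated numbers, not a formula stated by the author.

```python
# Counts copied from Table 6; the key names are illustrative.
TABLE_6 = {
    "S-peptide": {"found": 17, "with_side_chains": 10,
                  "actual": 15, "added": 2, "missed": 0},
    "S-protein": {"found": 105, "with_side_chains": 57,
                  "actual": 99, "added": 8, "missed": 2},
}

def interpolated(row):
    """α-carbons placed by the 4 Å rule rather than at a side-chain branch."""
    return row["found"] - row["with_side_chains"]

def is_consistent(row):
    """Residues found should equal actual residues plus added minus missed."""
    return row["found"] == row["actual"] + row["added"] - row["missed"]

for name, row in TABLE_6.items():
    print(name, interpolated(row), is_consistent(row))
```

The interpolated counts obtained this way (7 for the S-peptide and 48 for the S-protein) agree with the numbers given in the text.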

The program then searches for recognizable secondary structure elements in the skeleton. Table 7 lists the residues involved in each case together with the statistics of the fit of the standard secondary structure to the α-carbons. Of the three α-helices in ribonuclease S (see Fig. 2), only one can be fitted within the required statistical limits described in Method of Attack, section (g), namely the helix around residues 6 to 13 (0.527). The second helix, residues 26 to 34, is tested by the program, but fits the standard α-helix very poorly, giving an error of 2.06 for 11 α-carbons, well over the 0.75 limit. The third helix, which appears in the skeleton as a cage structure, is missed entirely by the program because of the way the main-chain line has been selected from amongst the many small segments. We will return to a discussion of these helices later.

Ribonuclease S has two three-stranded antiparallel β-sheets. Parts of each of these β-sheets are found and built by the program (Table 7). These are shown schematically in Figure 9. Comparison of this Figure with Figure 3 of Richards & Wyckoff (1973) or Figure 6 of Wyckoff et al. (1970) shows that the program has selected the two related β-strands correctly in most cases, and that residues in the opposing strands are usually in correct register, but that the program has failed to detect all the β regions in the molecule.


FIG. 9. Hydrogen-bonding pattern for the main chain of ribonuclease S as predicted by the program. Solid lines indicate N...O distances of less than 3.5 Å, broken lines indicate N...O distances of less than 5.0 Å. The residue numbers are those predicted by the program (which appear as primed labels in the text) and not the actual ribonuclease S sequence numbers. This Figure should be compared with Fig. 3 of Richards & Wyckoff (1973), after which it is patterned.

The results of the model-building are summarized in Figures 10 and 11, which show a plot of the α-carbons predicted by the program superimposed upon a similar plot constructed using the α-carbon co-ordinates of Richards & Wyckoff (1973). In Figure 11, the α-carbons have been replaced by all the atoms predicted by the program in the regions that were prebuilt with standard secondary structure (see Table 7).

The predicted α-carbon co-ordinates were compared directly with the α-carbon

FIG. 10. (See legend to Fig. 2.) α-Carbon line drawing produced using the co-ordinates predicted by the program (solid lines) superimposed upon the α-carbon drawing using the co-ordinates of Richards & Wyckoff (1973) (broken lines). The residue labels refer to the actual protein sequence and appear to the right and just above the corresponding α-carbon co-ordinates of Richards & Wyckoff. Every fifth residue is labeled. There is a detailed discussion of the drawings in the text.

FIG. 11. The predicted α-carbon co-ordinates are plotted as in Fig. 10, except that all the atoms have been included for those residues in which the secondary structure was prebuilt (see Table 7). This drawing is also superimposed upon the α-carbon co-ordinates of Richards & Wyckoff (1973) and is labeled as in Fig. 10.

co-ordinates of Richards & Wyckoff (1973). First, the origin shift, applied to make all the grid points positive in the skeleton, was subtracted from the predicted α-carbon co-ordinates (see section 3 of Greer, 1974). No other rotation or translation was applied. This transformed the predicted co-ordinates to the same co-ordinate system as those of Richards & Wyckoff (1973). Then α-carbon plots of the two, such as shown in Figure 10, were examined and the ten extra calculated α-carbons (11', 15', 24', 27', 28', 56', 61', 114', 116' and 122' of Fig. 9) were dropped from the list, as well as the two missing residues (52 and 53) and the uninterpreted main-chain ends, residues 1 to 2, 18 to 20, and 21 to 25. This left 112 α-carbons for which corresponding co-ordinates existed in both lists. The original positions predicted from the skeleton as well as the final co-ordinates after secondary structure building were compared,


giving a mean absolute deviation of 1.472 and 1.613 Å, respectively, for the 112 corresponding α-carbon positions. It is interesting that the building of secondary structure actually makes the fit of the predicted co-ordinates to the measured ones worse. This is probably because the model-building procedure is not based on the electron density map and may move the atomic positions away from the density. The other strange fact is that the center of gravity of the predicted α-carbons is 1.695 Å away from the center of gravity of the measured co-ordinates. This shift is almost entirely in the X direction and may be due, in part, to the direction-dependent nature of the skeletonization process.
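The two comparison statistics used above, the mean absolute deviation and the shift of the center of gravity, can be sketched as follows. This is a hedged illustration rather than the original code: the function names are hypothetical, the deviation is assumed to be the mean Euclidean distance between corresponding α-carbons, and the co-ordinates in the example are invented.

```python
import math

def mean_absolute_deviation(predicted, measured):
    """Mean Euclidean distance (Å) between corresponding α-carbons,
    with no rotation or translation applied beforehand."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("co-ordinate lists must be non-empty and of equal length")
    return sum(math.dist(p, m) for p, m in zip(predicted, measured)) / len(predicted)

def centroid(points):
    """Unweighted center of gravity of a list of (x, y, z) positions."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def centroid_shift(predicted, measured):
    """Vector from the measured to the predicted center of gravity (Å)."""
    cp, cm = centroid(predicted), centroid(measured)
    return tuple(a - b for a, b in zip(cp, cm))

# Invented example: every predicted atom is displaced by 1.5 Å along X.
measured = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(x + 1.5, y, z) for x, y, z in measured]
print(mean_absolute_deviation(predicted, measured))
print(centroid_shift(predicted, measured))
```

Reporting the shift as a vector rather than a single length makes a systematic displacement along one axis, such as the X-direction shift noted above, immediately visible.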

(e) Description of the structure

In order to properly describe the structure predicted by the automated system, and the differences from the structure proposed by Richards, Wyckoff and co-workers, a full atlas would be necessary. Since the derived co-ordinates have not been refined using the model-building and the real-space refinement programs (Diamond, 1966, 1971, 1974) or energy minimization (Levitt & Lifson, 1969; Levitt, 1974), it would be pointless to go into great detail at this stage. Nevertheless, it is useful to analyze, by a tour through the molecule, what degree of success has been achieved and what problems remain to be solved so that proper refinement strategy can be mapped. The reader is requested to refer to Figures 9, 10 and 11 during this discussion of the structure, where the points to be made are illustrated. Figure 12 shows the absolute deviation between the two co-ordinate sets, residue by residue. This provides a

FIG. 12. A plot of the absolute deviation between the α-carbon co-ordinates predicted by the program and those of Richards & Wyckoff (1973) for the residues in the molecule. The horizontal line at |d| = 1.613 Å shows the mean absolute deviation for the whole molecule. The broken line crosses over those residues missed by the program (52 and 53). Arrows mark the places where the program has added a residue. The horizontal bars at the top indicate the α-helical and β-sheet regions of the molecule; the top bars show the model of Richards & Wyckoff (1973) and the bottom bars the predicted secondary structure. Abscissa: residue number in the sequence.

graphic summary of the degree of fit to be described below. In all cases, except Figure 9, primed labels designate the residue numbers determined by the automated program system and unprimed labels are the numbers derived from the sequence and the known structure.


The first residue predicted by the program corresponds to residue 3 of the sequence. In fact, skeleton exists for the main chain between 2 and 3 (see Fig. 5); however, the main-chain selecting routine cannot decide how to choose between that segment and the segment which represents the side chain of Thr3. The fact that residue 1 is missing is not surprising since the termini of polypeptide chains are often disordered in the crystal. The program follows the 3 to 13 α-helix quite closely by building α-helix into residues 6 to 13. Residues 13 to 17 are not predicted very well, as a residue has been added between 12 and 13 (11') and between 15 and 16 (15'). The density for residues 18 to 20 is very weak in the map (Richards & Wyckoff, 1973) and so no co-ordinates can be predicted for this region of the molecule.

The first three residues of the S-protein, 21 to 23, are once again very weak in the map and so no skeleton or co-ordinates have been produced for these residues. The main chain for 24 to 26 exists (Fig. 5); however, it could not be distinguished as main chain from the side chain of Tyr25. The helix from 26 to 34 is incorrectly interpreted by the skeleton. It appears to be left-handed! This is due to weak density in the map for the main chain between residues 27 and 28 and strong density for a hydrogen bond from 27 to 31, which the skeleton follows instead. A technique has been proposed for calculating the skeleton at a 0.5 Å grid interval (Greer, 1975), which alleviates part of this problem by removing the link for the strong hydrogen bond and following the actual main-chain density more closely, but retains the discontinuity between 27 and 28. This new technique is presently being incorporated into the automated system.

The section of the molecule between 34 and 50 is reasonably well interpreted. Deviations from the measured co-ordinates would probably be reduced during real-space refinement.

The α-helix from residues 50 to 58 is not detected by the program. However, the original skeleton was quite complex in this region (see Fig. 5). The program is hard put to select amongst the many short segments that make up the cage-like pattern for this helix. Consequently, the final chosen main-chain skeleton does not conform to the expected topology of an α-helix (Figs 5 and 7). Figure 10 shows that one turn of the helix has been missed entirely, thereby deleting residues 52 and 53 from the predicted co-ordinates. The new technique of calculating the skeleton at a 0.5 Å grid interval improves this region dramatically, as has been discussed previously (Greer, 1975).

Residues 58 to 65 are interpreted fairly faithfully. Two extra residues have been inserted erroneously, one between 62 and 63 (56') and one between 66 and 67 (61'). The chain continues to follow the expected atomic positions quite nicely through residue 108, with somewhat larger deviations around residues 88 to 89, 98 and 102. Much of this region was preconstructed as antiparallel β-sheet (Table 7). Only one β-sheet has been built out of phase with the structure of Richards & Wyckoff (1973). Residues 72 to 74 and 101 to 103 are predicted by the program to be a β-sheet, whereas Richards & Wyckoff found 73 to 75 to be the complementary strand to 101 to 103. Preliminary evidence suggests that there is some tendency for the hydrogen-bonding pattern to switch to 73 to 75 when the co-ordinates are refined. The longest β-sheet sequence, 78 to 83 and 100 to 105 (see Figs 9 and 11), overlaps with the residues of ribonuclease S that were chosen for the standard β-sheet (see Method of Attack, section (g)). This choice of standard β-sheet does not affect in any way the initial selection by the program of residues 78 to 83 and 100 to 105 from the skeleton as a


possible β-sheet; only the rules listed in Method of Attack, section (g) are used for this. The overlap of these residues could, in principle, affect the quality of the fit of the skeletal co-ordinates to the standard sheet. Nevertheless, Table 7 shows that the deviation test for this skeletal β-sheet is well within the range of the other fitted β-sheets. Thus the skeleton is sufficiently distorted to make this region as unlike the standard sheet as any other β-sheet in the molecule. Experiments on other electron density maps also bear this out (Greer, unpublished results). Energy minimization and real-space refinement should improve the fit of much of this part of the molecule, especially in the β-sheet regions.

The loop in the structure from residue 109 through 117 presents some interesting problems for the interpretation process. The co-ordinates in this loop do not correspond very well with the co-ordinates of Richards & Wyckoff (1973) (see Figs 10 and 12). The ends of the loop, 109 to 111 and 116 to 118, come too close together in the skeleton and therefore in the predicted co-ordinate list. This caused some of the paradoxes discussed in sections (b) and (c) above, where the disulfide bridge from 58 to 110 was initially found to join 58 to 117 to 110 and then was interpreted as a bridge between 58 and 117 (51' to 112'). Energy minimization should remove the close Van der Waals' contacts and real-space refinement should improve the interpretation of the residues in this loop considerably. It will probably require comparison with the sequence, or at least chemical knowledge of the position of the disulfide bridges in the primary structure, to correct the connection of 58 to 117 rather than to 110. This error might become apparent during real-space refinement.

The last six residues are interpreted fairly well, but with the addition of two residues around 118 to 120 (114' and 116') and another residue after 124 (122') which is fitted partly to the density for the side chain of Val124. The last residues appear quite clearly in the map and hence in the skeleton and derived model, since they are involved in a β-sheet interaction and thus are well ordered in the crystal.

Fitting the peptide planes to the density map is always a much more difficult task than finding the α-carbon positions. Looking at the secondary structure regions in detail (Fig. 11), there are areas where the main-chain dihedral angles are reasonable and places where they are clearly wrong. In the absence of refined co-ordinates, it is not worthwhile analyzing these angles in greater depth.

4. Conclusion

The goal of automated interpretation of protein electron density maps now seems perceptibly closer. The combination of a skeletal formalism for the electron density map together with a procedure for isolating a single molecule from its neighbors provides an excellent method for depicting the overall topology of the molecule and regular secondary-structure elements as well. Moving beyond this first step, much of the main-chain can be distinguished and rudimentary co-ordinates designated. More sophisticated model building improves the co-ordinates and ultimately refinement procedures will provide the high quality data necessary to properly relate the detailed three-dimensional structural information to biological function.

Experiments must be performed to determine the techniques that are most suitable for the refinement of the provisional co-ordinates. A strategy for the use of the


presently available methods and the development of objective criteria to measure the improvement of the co-ordinates are presently being explored intensively (Greer & Levitt, unpublished results).

The overall view of the molecule and the rudimentary atomic co-ordinates for the main-chain of large sections of the molecule can greatly simplify the fitting procedure on an electronic simulation of a Richards box (Richards, 1968) using a graphics display terminal. The need to examine large regions of the electron density map in order to find suitable places to start model building is obviated as well as much of the effort required to fit each residue into its appropriate place using the topology of the graphic representation of the electron density map.

Despite the limited success described above, much remains to be done. The program system works well for a good electron density map in regions of strong density. It is just beginning to learn how to cope with more difficult situations of weak density and poorer quality electron density maps. As presently envisioned, and as demonstrated in Application to Ribonuclease S, above, the protein crystallographer will almost certainly have to intervene to unravel the intricacies of the structure in paradoxical situations, because the number of possible complications is quite large and the effort to program the sophistication necessary to solve them considerable. The major effort has been and will continue to be in two areas: (1) the analysis of why the system breaks down and how it compares with the difficulties that the human interpreter encounters, and (2) how programs can provide the necessary information to the crystallographer to locate the source of the troubles and solve them. The extent to which a sophisticated interactive graphics display is indispensable to solve these paradoxes is one of the questions that must be answered as more experience is gathered in using the automated system.

Considerable effort must be devoted to the determination of the dihedral angles and the peptide plane orientation. A method must be found which permits the correct assignment of the carbonyl oxygen position in a significant number of cases. In addition, it would be useful to have more Cβ positions. At present, the program detects approximately 55% of the side chains (67 out of 122 in ribonuclease S). Finding more side-chain branches would allow an increase in the accuracy of the specification of the Cα and Cβ co-ordinates and also provide some help in orienting the plane of the neighboring peptide bonds.

The side-chain densities have been ignored up to this point, as has been the sequence, since in so many instances the protein crystallographer produces his map before the primary structure has been determined. Nevertheless, there is much that a program system can do, in the absence of sequence information, to describe and categorize the density that appears for each side chain. Size, shape, peak height and environment are examples of information which can be obtained readily. Such a description could then be compared to the sequence, if known, to correct for added or deleted residues, to help determine the directionality of the chain, if necessary, and to begin model building of the side chains themselves.

The most important goal for the immediate future is the application of the automated interpretation system to the electron density maps of other proteins in order to gain experience from the problems encountered in a number of different maps and improve the empirical parameters derived in the procedure. Such efforts are presently underway and suggest that this procedure may eventually find widespread application in the field.


I thank Dr Michael Levitt for helpful advice and discussion. The molecular drawings were produced using programs written by Messrs Reidar Bornholdt, Christos Tountas and Richard Abba of the Columbia Biology Department Computer Facility. This work was supported by a grant from the National Institutes of Health, HL16601, and by National Institutes of Health facility grant RR00442 and the Columbia University Computer Center.

REFERENCES

Adams, M. J., Ford, G. C., Liljas, A. & Rossmann, M. G. (1973). Biochem. Biophys. Res. Commun. 53, 46-51.
Davies, D. R., Padlan, E. A. & Segal, D. M. (1975). Annu. Rev. Biochem. 44, 639-667.
Diamond, R. (1966). Acta Crystallogr. 21, 253-266.
Diamond, R. (1971). Acta Crystallogr. ser. A, 27, 436-452.
Diamond, R. (1974). J. Mol. Biol. 82, 371-391.
Greer, J. (1974). J. Mol. Biol. 82, 279-301.
Greer, J. (1975). J. Mol. Biol. 98, 649-654.
Hermans, J. & Ferro, D. (1971). Biopolymers, 10, 1121-1138.
Hermans, J. & McQueen, J. E., Jr. (1974). Acta Crystallogr. ser. A, 30, 730-739.
Hilditch, C. J. (1969). Machine Intelligence, 4, 403-420.
Jensen, L. H. (1974). Annu. Rev. Biophys. Bioeng. 3, 81-93.
Levitt, M. (1974). J. Mol. Biol. 82, 393-420.
Levitt, M. & Lifson, S. (1969). J. Mol. Biol. 46, 269-279.
Richards, F. M. (1968). J. Mol. Biol. 37, 225-230.
Richards, F. M. & Vithayathil, P. J. (1959). J. Biol. Chem. 234, 1459-1465.
Richards, F. M. & Wyckoff, H. W. (1973). In Atlas of Molecular Structures (Phillips, D. C. & Richards, F. M., eds), vol. 1, Oxford University Press.
Watenpaugh, K. D., Sieker, L. C., Herriott, J. R. & Jensen, L. H. (1971). Cold Spring Harbor Symp. Quant. Biol. 36, 359-367.
Watenpaugh, K. D., Sieker, L. C., Herriott, J. R. & Jensen, L. H. (1973). Acta Crystallogr. ser. B, 29, 943-956.
Wyckoff, H. W., Tsernoglou, D., Hanson, A. W., Knox, J. R., Lee, B. & Richards, F. M. (1970). J. Biol. Chem. 245, 305-328.