Evaluation of polymer sequence fragment data using graph theory

BULLETIN OF MATHEMATICAL BIOPHYSICS

VOLUME 31, 1969

EVALUATION OF POLYMER SEQUENCE FRAGMENT DATA USING GRAPH THEORY

GEORGE HUTCHINSON, Laboratory of Applied Studies, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Maryland

Much of recent work to determine primary structures of nucleic acids and proteins employs the "fragmentation" or "overlap" stratagem. Typically, a preparation of a given polymer with unknown sequence is purified and then subjected to an enzyme known to cut the polymer at certain specific sites. The quantities and sequences of the resulting fragments are determined. For RNA primary sequences, pancreatic ribonuclease and T1 ribonuclease are ordinarily used as fragmenting enzymes. A technique is described for evaluating such fragment data. It has the following properties: It is easily determined whether or not the fragment data is inconsistent. It is always possible to determine the first and last nucleotides of the unknown sequence from the data of two limit digests. Consistent data from two limit digests can always be fitted into a convenient conceptual framework developed within the theory of graphs. In most cases, partial digest information can be used to modify the framework constructed from two limit digests, as such information is obtained. An efficient analysis of all fragment data in this conceptual framework can always be made. One can detect inconsistencies and can generate the entire list of polymer sequences consistent with the fragment data.

1. Introduction. Much of recent work to determine primary structures of nucleic acids and proteins employs the "fragmentation" or "overlap" stratagem. Typically, a preparation of a given polymer with unknown sequence is purified and then subjected to an enzyme known to cut the polymer at certain specific sites. The quantities and sequences of the resulting fragments are determined. If the process is stopped before all of the cuts have been completed, the result is called a "partial digest." The alternative, in which all cuts specific to the enzyme are made, is called a "limit digest" (or "complete digest"). A common procedure is to prepare a limit digest for each available fragmenting enzyme, and then prepare a sufficient number of partial digests to uniquely determine the polymer sequence by overlapping the fragment sequences. The mathematical analysis of the fragment data is not always easy. Mathematical theories and computer programs have been developed to attack this problem (Dayhoff, 1964; Mosimann et al., 1966; Shapiro, 1967; Mosimann and Vinton, 1968).

For determining RNA primary sequences, pancreatic and T1 ribonucleases are ordinarily used as fragmenting enzymes (Holley et al., 1965; Shapiro et al., 1965; Madison et al., 1966; RajBhandary et al., 1967; Goodman et al., 1968). The technique described in the following is suited to the evaluation of such fragment data, although not restricted to that case alone. It has the following properties:

(1) It is easily determined whether the fragment data from two limit digests for the two enzymes is consistent. In other words, if an error in the analysis produces fragment data that could not properly result from any polymer sequence, this fact will be known.

(2) It is always possible to determine the first and last nucleotides of the unknown sequence with consistent data from two limit digests. This may be used to confirm direct evidence obtained concerning the beginning and end of the unknown sequence.

(3) Consistent data from two limit digests can always be fitted into a convenient conceptual framework developed within the theory of graphs. In most cases, partial digest information can be used to modify the framework constructed from the two limit digests as such information is obtained.

(4) An efficient analysis of all fragment data fitted into this conceptual framework can always be made. One can determine, for example, whether there is a unique polymer sequence consistent with this data, or whether the data is inconsistent. It is even possible to generate the entire list of possible polymer sequences, that is, all sequences that could produce the data incorporated into the framework. This computation would probably not be feasible for the two limit digests, which would usually be consistent with a very large number of possibilities. However, it would often be convenient to work directly with a list of possible polymer sequences as soon as sufficient data had been found to reduce the list to a manageable size.

In the following, a description of how to use the method is given. The mathematical theory and proofs may be found elsewhere (unpublished version: (Hutchinson, 1968)).


The author is indebted to Dr. J. E. Mosimann, Dr. C. R. Merril, Mr. J. E. Vinton and Mr. M. B. Shapiro for conversations concerning the polymer sequencing problem. The correspondence between Euler circuits of certain graphs and polymer sequences, which is of central importance to the methods described here, was first observed by Mr. Vinton.

2. Organization of the Fragment Data. To motivate the general approach, an example is constructed and analyzed. Consider the sequence* :

pG-G-G-C-G-U-G-A-C-U-C-G-U-C-C-A-C-C-A_OH    (1)

This artificial example was constructed from the primary structure for alanine transfer RNA (Holley et al., 1965) by taking the first six and last thirteen nucleotides. The method can accommodate data including minor nucleotides, although none appear in this example.

Pancreatic ribonuclease would cut a polymer with sequence (1) at the 3'-terminus (right side) of each C- or U- residue. T1 ribonuclease would cut it at the 3'-terminus of each G-residue. Omitting the phosphates, limit digests produced by these two enzymes would be:

Pancreatic ribonuclease acting on sequence (1):

Cuts produced: GGGC/GU/GAC/U/C/GU/C/C/AC/C/A

Data obtained: 4C, U, A, AC, 2GU, GAC, GGGC.    (2)

T1 ribonuclease acting on sequence (1):

Cuts produced: G/G/G/CG/UG/ACUCG/UCCACCA

Data obtained: 3G, CG, UG, ACUCG, UCCACCA.    (3)

*Abbreviations used: p and - represent phosphate residues; A-, adenosine 3'-phosphate; C-, cytidine 3'-phosphate; G-, guanosine 3'-phosphate; and U-, uridine 3'-phosphate.


When fragments having the same sequence are produced at several different places in the polymer sequence, the number of times is indicated by an integer preceding the fragment. For example, fragment GU is produced at positions 5-6 and 12-13 of (1) under digestion by pancreatic ribonuclease, so a 2 precedes GU in the data of (2). These integers are determined experimentally by measuring the quantities of each fragment in the limit digest. The integer associated with a fragment sequence is called the "weight" of the sequence, and equals 1 if not specified.
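The limit digests (2) and (3) can be reproduced mechanically from sequence (1). The following is a minimal sketch, not part of the original method description; the function name `limit_digest` and the use of a `Counter` to hold the weights are illustrative assumptions.

```python
from collections import Counter

def limit_digest(sequence, cut_bases):
    """Cut `sequence` immediately to the right of every base in `cut_bases`
    and return a Counter mapping fragment sequence -> weight."""
    fragments, current = [], ""
    for base in sequence:
        current += base
        if base in cut_bases:
            fragments.append(current)
            current = ""
    if current:  # trailing fragment that does not end at a cut site
        fragments.append(current)
    return Counter(fragments)

SEQ = "GGGCGUGACUCGUCCACCA"            # sequence (1), phosphates omitted
pancreatic = limit_digest(SEQ, "CU")   # cuts after C and U, data (2)
t1 = limit_digest(SEQ, "G")            # cuts after G, data (3)
print(dict(pancreatic))
print(dict(t1))
```

The counts returned for each fragment play the role of the experimentally measured weights.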

At this point, the general assumptions for analysis of two limit digests are described. A finite list of terms called "bases" is given. (In the example, the bases are the nucleotides A, C, G and U.) Two limit fragmentation processes, numbered I and II, are presupposed. The bases are of three types, numbered I, II and III. In the first limit fragmentation process, the given sequence is cut immediately to the right of each type I base. (In the example, C and U are type I bases.) In the second process, the given sequence is cut immediately to the right of each type II base. (In the example, G is the only type II base.) Bases not of types I or II are of type III. (In the example, A is the only type III base.) It is specifically assumed that:

(A1) The sequence and weight of each fragment is obtained for both limit digests.

(A2) No base is both of type I and type II.

Now assume that the sequence (1) is not known, but that the data in (2) and (3) have been obtained. To analyze the data, we first apply three consistency tests.

Under the hypothesis that limit digests were obtained, fragment sequences in (2) cannot contain a C or U term except at the end (rightmost position). Similarly, fragment sequences in (3) cannot contain a G term except at the end. The first test is to verify by inspection that these conditions are satisfied. In general, (CT1) in limit digest I, no fragment sequence has a base of type I except at the end. In limit digest II, no fragment sequence has a base of type II except at the end.

The only fragment of (2) that does not have a C or U at the end is the fragment A. Similarly, UCCACCA is the only fragment of (3) not having a G at the end. Obviously, such fragment sequences must correspond to the end of the polymer. Therefore, consistent data from a limit digest can produce at most one such fragment, which must have weight one.

A fragment of limit digest I not ending in a base of type I is called "abnormal," as is a fragment of limit digest II not ending in a base of type II. Other fragments are called "normal."
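CT1 and the classification into normal and abnormal fragments can be checked by a short routine. The sketch below assumes each digest is held as a dictionary from fragment sequence to weight; the function names are hypothetical.

```python
def check_ct1(fragments, cut_bases):
    """CT1: no fragment may contain a cut base except in its last position."""
    return all(base not in cut_bases
               for frag in fragments
               for base in frag[:-1])

def abnormal_fragments(fragments, cut_bases):
    """Fragments whose last base is not a cut base of this digest."""
    return [frag for frag in fragments if frag[-1] not in cut_bases]

# Data (2) and (3); C and U are type I bases, G is the type II base.
digest_I = {"C": 4, "U": 1, "A": 1, "AC": 1, "GU": 2, "GAC": 1, "GGGC": 1}
digest_II = {"G": 3, "CG": 1, "UG": 1, "ACUCG": 1, "UCCACCA": 1}

print(check_ct1(digest_I, "CU"), check_ct1(digest_II, "G"))
print(abnormal_fragments(digest_I, "CU"), abnormal_fragments(digest_II, "G"))
```

For the example data, both digests pass CT1, and the abnormal fragments found are A and UCCACCA.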


For the second consistency test, one examines the abnormal fragments of the two limit digests. Before giving the test, it is necessary to describe a simple procedure for regrouping the fragment sequences. To mark the divisions, periods are inserted into the fragment sequence following each type I or type II base. In (3) for example, ACUCG is regrouped to AC.U.C.G (the period after the final term is omitted). The regrouped sequence is regarded as a sequence of four terms, namely AC, U, C and G. These terms are called "extended bases." Ordinarily, an extended base is a single base of type I or II (e.g., C, U or G) or a sequence of type III terms followed by a base of type I or II (e.g., AC, AAU or AAAG). The final extended base of a regrouped abnormal fragment may be a sequence of one or more type III bases (e.g., A or AA).

Note that the abnormal fragment A of limit digest I equals the last extended base of the regrouped sequence U.C.C.AC.C.A of the abnormal fragment of limit digest II. We say here that "A equals the last extended base of UCCACCA."
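The regrouping procedure amounts to closing an extended base after every type I or type II base. A minimal sketch follows; the function name `regroup` and its default arguments (the base types of the RNA example) are assumptions made for illustration.

```python
def regroup(fragment, type_I="CU", type_II="G"):
    """Split a fragment into extended bases: each term ends at a type I or
    type II base; a trailing run of type III bases forms the final term."""
    terms, current = [], ""
    for base in fragment:
        current += base
        if base in type_I or base in type_II:
            terms.append(current)
            current = ""
    if current:  # only possible for an abnormal fragment
        terms.append(current)
    return terms

print(".".join(regroup("ACUCG")))
print(".".join(regroup("UCCACCA")))
```

With this routine, the comparison used in case 3 of Table 1 becomes a check that the limit digest I abnormal fragment equals `regroup("UCCACCA")[-1]`.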

The second consistency test is then (CT2) All abnormal fragment sequences in both limit digests have weight one, and fall into exactly one of the five patterns given in Table 1.

As previously noted, A is the last extended base of UCCACCA. Since UCCACCA is an abnormal fragment ending in a base of type III and containing bases of type I, the data of (2) and (3) satisfies case 3 of Table 1, and so passes CT2.

For the third consistency test, we first regroup all the fragment sequences for both limit digests, in the way previously described. In Table 2, the weights and regrouped fragment sequences are given for the data of (2) and (3).

A "one term fragment" is a fragment with one term in its regrouped fragment sequence, that is, its sequence is an extended base. In Table 2, one term fragments are in the top row. Fragments that are not one term fragments are called "arc fragments."

If we omit the first and last terms from a regrouped arc fragment sequence, the remaining extended bases are called "interior" terms. For example, U and C are the interior terms of AC.U.C.G in Table 2. The next step is to count all of the interior terms of all of the fragment sequences. Of course, a fragment with only one or two extended bases has no interior terms, so the top two rows of Table 2 do not affect the count. So, we obtain:

G.G.G.C yields 2G
AC.U.C.G yields U, C
U.C.C.AC.C.A yields 3C, AC

Total yield: 4C, U, AC, 2G


Observe that yields of a given extended base are combined by adding the weights. If a fragment had weight w, the corresponding yield would be multiplied by w. For example, 2G.AAG.G.G.U would yield 2AAG, 4G.

Now match the total yield of interior terms (4C, U, AC, 2G), against the one term fragments of Table 2 (4C, U, A, AC, 3G). We see that all interior terms can be matched, and exactly two terms, G and A, remain unmatched in the latter list. For consistent data from two limit digests, this matching process will always leave exactly two unmatched terms, and they will be the first and last extended bases of the unknown polymer sequence. If (1) is regrouped as in (4), then G and A are the first and last extended bases.

G.G.G.C.G.U.G.AC.U.C.G.U.C.C.AC.C.A    (4)

TABLE 1

Abnormal Fragment Patterns for Two Limit Digests

Case 1.
  Abnormal fragments in limit digest I: None.
  Abnormal fragments in limit digest II: One abnormal fragment ending in a type I base.
  Corresponding property of polymer sequence: Unknown polymer sequence ends in a type I base.

Case 2.
  Abnormal fragments in limit digest I: One abnormal fragment ending in a type II base.
  Abnormal fragments in limit digest II: None.
  Corresponding property of polymer sequence: Unknown polymer sequence ends in a type II base.

Case 3.
  Abnormal fragments in limit digest I: One abnormal fragment equal to the last extended base of the limit digest II abnormal fragment.
  Abnormal fragments in limit digest II: One abnormal fragment ending in a type III base and containing one or more type I bases.
  Corresponding property of polymer sequence: Unknown polymer sequence ends in a type III base, and the last base not of type III is of type I.

Case 4.
  Abnormal fragments in limit digest I: One abnormal fragment ending in a type III base and containing one or more type II bases.
  Abnormal fragments in limit digest II: One abnormal fragment equal to the last extended base of the limit digest I abnormal fragment.
  Corresponding property of polymer sequence: Unknown polymer sequence ends in a type III base, and the last base not of type III is of type II.

Case 5.
  Degenerate case: Every base of the polymer sequence is type III, so that neither enzyme cuts the polymer. This case is disregarded, since there is no sequencing problem.


Now A is known to be the last extended base of the unknown sequence because abnormal fragments A and UCCACCA were found in applying CT2. So, we can deduce that G is the first extended base of the unknown sequence. Formally: (CT3) Assume the data satisfies CT2. So, there exist one or two abnormal fragments, and the last extended base of the polymer sequence is known. Construct the list of interior extended bases of fragments in the two limit digests, as previously described. Match this against the list of one term fragments from both limit digests. Then:

TABLE 2

Regrouped Fragment Sequences for Data of (2) and (3)

Number of Extended Bases    Regrouped Fragment        Regrouped Fragment
in Regrouped Fragment       Sequences from (2)        Sequences from (3)
Sequence                    and their Weights         and their Weights

1                           4C, U, A, AC              3G
2                           2G.U, G.AC                C.G, U.G
3                           None                      None
4                           G.G.G.C                   AC.U.C.G
5                           None                      None
6                           None                      U.C.C.AC.C.A

(1) Every term of the first list must be matched.

(2) Exactly two terms of the second list must be unmatched (possibly one extended base with weight 2).

(3) One of the unmatched terms must be the last extended base of the polymer sequence, as determined from CT2. The other unmatched term is the first term of the unknown polymer sequence.
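The matching step of CT3 is a multiset subtraction: interior terms are removed from the list of one term fragments, and what remains should be exactly two terms. The following sketch assumes the RNA base types of the example; `regroup` and `ct3_unmatched` are hypothetical names.

```python
from collections import Counter

def regroup(fragment, cut_bases="CUG"):
    """Split a fragment into extended bases (terms end at C, U or G)."""
    terms, current = [], ""
    for base in fragment:
        current += base
        if base in cut_bases:
            terms.append(current)
            current = ""
    return terms + [current] if current else terms

def ct3_unmatched(digest_I, digest_II):
    """Return the one term fragments left unmatched by interior terms."""
    one_term, interior = Counter(), Counter()
    for digest in (digest_I, digest_II):
        for frag, weight in digest.items():
            terms = regroup(frag)
            if len(terms) == 1:
                one_term[terms[0]] += weight
            else:
                for term in terms[1:-1]:      # interior terms only
                    interior[term] += weight  # yield is multiplied by weight
    if interior - one_term:                   # an interior term is unmatched
        raise ValueError("CT3 fails: unmatched interior term")
    return one_term - interior

digest_I = {"C": 4, "U": 1, "A": 1, "AC": 1, "GU": 2, "GAC": 1, "GGGC": 1}
digest_II = {"G": 3, "CG": 1, "UG": 1, "ACUCG": 1, "UCCACCA": 1}
unmatched = ct3_unmatched(digest_I, digest_II)
print(dict(unmatched))
```

For consistent data the returned counts sum to two. Since A is already known from CT2 to be the last extended base, the other unmatched term, G, is the first.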

At this point, an explanation of this matching process is in order. An extended base can be classified as type I, II or III according to whether its sequence ends with a type I, II or III base, respectively. For example, C, U and AC are type I extended bases, G and AAG are type II extended bases, and A and AA are type III extended bases. Note that a type III extended base can occur only at the end of a regrouped sequence.

A term of a regrouped sequence is a "boundary" if it is type I or II and the previous term is of a different type. The first term, which has no previous term, is not a boundary. For example, consider (4) analyzed in Table 3. Term 5 is a boundary because it is type II and term 4 is type I. Term 15 is not a boundary because both terms 14 and 15 are type I. Term 17 is not a boundary because it is not type I or II.

TABLE 3

Analysis of Sequence (4)

Term Number    Term    Type of Extended Base    Is the Term a Boundary?

1              G       II                       No
2              G       II                       No
3              G       II                       No
4              C       I                        Yes
5              G       II                       Yes
6              U       I                        Yes
7              G       II                       Yes
8              AC      I                        Yes
9              U       I                        No
10             C       I                        No
11             G       II                       Yes
12             U       I                        Yes
13             C       I                        No
14             C       I                        No
15             AC      I                        No
16             C       I                        No
17             A       III                      No
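The boundary column of Table 3 follows directly from the definition. As a minimal sketch (the helper names `ext_type` and `boundaries` are assumptions, and the base types are those of the RNA example):

```python
def ext_type(term, type_I="CU", type_II="G"):
    """Type of an extended base, decided by its last base."""
    last = term[-1]
    if last in type_I:
        return "I"
    if last in type_II:
        return "II"
    return "III"

def boundaries(terms):
    """Return the 1-based term numbers that are boundaries: terms of type I
    or II whose previous term has a different type."""
    result = []
    for i in range(1, len(terms)):            # the first term is never a boundary
        current, previous = ext_type(terms[i]), ext_type(terms[i - 1])
        if current in ("I", "II") and current != previous:
            result.append(i + 1)
    return result

terms = "G.G.G.C.G.U.G.AC.U.C.G.U.C.C.AC.C.A".split(".")   # sequence (4)
print(boundaries(terms))
```

Applied to sequence (4), this marks the same terms as the "Yes" entries of Table 3.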

In Figure 1, the data from (2) and (3) is aligned according to sequence (4), with one term fragments shaded.

Observe in Figure 1 that the one term fragments do not occur on boundaries, that is, terms 4, 5, 6, 7, 8, 11 and 12. On the other interior terms, numbered 2, 3, 9, 10, 13, 14, 15 and 16, there is a one term fragment in one digest, and an interior term of an arc fragment in the other digest. Only two other one term fragments occur, one each at the first and last terms. The matching process of CT3 is verified by generalizing this observation.

Figure 1. Data Aligned in the Sequence (4) (rows: term numbers 1 to 17, I-fragments, II-fragments)

Now observe the pattern of the arc fragments. There is an arc fragment from the first term to the first boundary (1 to 4). Then there is an arc fragment from the first boundary to the second (4 to 5) in the other limit digest. This pattern continues; arc fragments go from one boundary to the next, alternating between the limit digests. The final arc fragment goes from the last boundary to the end.

This pattern motivates the following constructions. It is convenient to create a special "end-around" fragment. Take the longest abnormal fragment, and add two terms to the end of the regrouped sequence. The first added term is a special character (ε), used to indicate the end of the polymer.

TABLE 4

Array of Arc Fragments

(Row: first extended base. Column: last extended base.)

        G                          C           U        AC

G                                  G.G.G.C     2G.U     G.AC

C       C.G

U       U.G
        U.C.C.AC.C.A.ε.G

AC      AC.U.C.G

The second added term is the first extended base of the polymer, as determined by CT3. For the data of (2) and (3), U.C.C.AC.C.A is the longest abnormal fragment and G was the first extended base of the polymer. So, the end-around fragment is U.C.C.AC.C.A.ε.G. In general, the abnormal fragment of limit digest II is taken for cases 1 and 3 of Table 1, and that of limit digest I for cases 2 and 4.

Now arrange all of the normal arc fragments, plus the end-around fragment, into a square array indexed by the extended bases. A fragment is put into row X and column Y of the array if X and Y are its first and last extended bases, respectively. In Table 4, this operation is performed for the data of (2) and (3).
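The array of arc fragments can be assembled by a short routine. The sketch below is one possible representation, not the author's: each entry is keyed by (row, column, regrouped sequence) and holds the weight. The function names, the argument `longest_abnormal`, and the default base types are assumptions.

```python
from collections import Counter

def regroup(fragment, cut_bases):
    """Split a fragment into extended bases ending at type I or II bases."""
    terms, current = [], ""
    for base in fragment:
        current += base
        if base in cut_bases:
            terms.append(current)
            current = ""
    return terms + [current] if current else terms

def arc_array(digest_I, digest_II, longest_abnormal, first_base,
              type_I="CU", type_II="G", end_mark="ε"):
    """Return a Counter mapping (first term, last term, regrouped sequence)
    to weight, for the normal arc fragments plus the end-around fragment."""
    arcs = Counter()
    for digest, cut in ((digest_I, type_I), (digest_II, type_II)):
        for frag, weight in digest.items():
            terms = regroup(frag, type_I + type_II)
            if frag[-1] in cut and len(terms) >= 2:     # normal arc fragment
                arcs[(terms[0], terms[-1], ".".join(terms))] += weight
    terms = regroup(longest_abnormal, type_I + type_II) + [end_mark, first_base]
    arcs[(terms[0], terms[-1], ".".join(terms))] += 1   # end-around fragment
    return arcs

digest_I = {"C": 4, "U": 1, "A": 1, "AC": 1, "GU": 2, "GAC": 1, "GGGC": 1}
digest_II = {"G": 3, "CG": 1, "UG": 1, "ACUCG": 1, "UCCACCA": 1}
arcs = arc_array(digest_I, digest_II, "UCCACCA", "G")
for (row, col, label), weight in sorted(arcs.items()):
    print(row, col, weight, label)
```

The seven entries produced for the example data are those of Table 4.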


Associated with this array is a diagram. (In mathematical usage, a "net" with a "value" or "weight" associated with each line. See (Harary et al., 1965).) For each extended base, a point is put into the diagram. For each fragment in row X and column Y of Table 4, an arrow is drawn from the point X to the point Y in the diagram. These points and arrows are displayed in Figure 2, which is called the "two digests w-net." The fragment sequences and weights are attached to each arrow; the arrows are called "arcs."

Figure 2. Two Digests w-Net

Now reconsider Figure 1. If we trace the arc fragments of both digests from left to right, we obtain the sequence G.G.G.C, C.G, G.U, U.G, G.AC, AC.U.C.G, G.U and U.C.C.AC.C.A. The corresponding sequence of arcs in Figure 2 has the following properties:

(P1) The head of each arc in the sequence and the tail of the next arc are at the same extended base.

(P2) The head of the last arc and the tail of the first arc are at the same extended base. (The end-around arc was constructed to close the loop.)

(P3) The number of times each arc appears in the sequence equals its weight. (Observe that G.U with weight 2 appears twice, and all other terms appear once.)

A sequence of arcs satisfying properties P1, P2 and P3 above is called an "Euler circuit." This is an obvious generalization of the concept of Euler circuit that appears in the theory of graphs. (See "line-complete trajectories" in Harary et al., 1965.) A "special Euler circuit" of the two digests w-net is an Euler circuit which has the end-around arc as the last term of the sequence. There are exactly six special Euler circuits of the w-net of Figure 2; they are given in Table 5. By overlapping the arc sequences in the obvious fashion, it is possible to reconstruct sequences of bases from the special Euler circuits. These reconstructions are also given in Table 5.

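TABLE 5

Special Euler Circuits and Reconstructed Sequences

E1: $G \xrightarrow{G.G.G.C} C \xrightarrow{C.G} G \xrightarrow{G.U} U \xrightarrow{U.G} G \xrightarrow{G.AC} AC \xrightarrow{AC.U.C.G} G \xrightarrow{G.U} U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G$
S1: GGGCGUGACUCGUCCACCA

E2: $G \xrightarrow{G.G.G.C} C \xrightarrow{C.G} G \xrightarrow{G.AC} AC \xrightarrow{AC.U.C.G} G \xrightarrow{G.U} U \xrightarrow{U.G} G \xrightarrow{G.U} U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G$
S2: GGGCGACUCGUGUCCACCA

E3: $G \xrightarrow{G.AC} AC \xrightarrow{AC.U.C.G} G \xrightarrow{G.U} U \xrightarrow{U.G} G \xrightarrow{G.G.G.C} C \xrightarrow{C.G} G \xrightarrow{G.U} U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G$
S3: GACUCGUGGGCGUCCACCA

E4: $G \xrightarrow{G.AC} AC \xrightarrow{AC.U.C.G} G \xrightarrow{G.G.G.C} C \xrightarrow{C.G} G \xrightarrow{G.U} U \xrightarrow{U.G} G \xrightarrow{G.U} U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G$
S4: GACUCGGGCGUGUCCACCA

E5: $G \xrightarrow{G.U} U \xrightarrow{U.G} G \xrightarrow{G.G.G.C} C \xrightarrow{C.G} G \xrightarrow{G.AC} AC \xrightarrow{AC.U.C.G} G \xrightarrow{G.U} U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G$
S5: GUGGGCGACUCGUCCACCA

E6: $G \xrightarrow{G.U} U \xrightarrow{U.G} G \xrightarrow{G.AC} AC \xrightarrow{AC.U.C.G} G \xrightarrow{G.G.G.C} C \xrightarrow{C.G} G \xrightarrow{G.U} U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G$
S6: GUGACUCGGGCGUCCACCA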

The sequence S1 of Table 5 is the same as the sequence (1). If the sequences S2 through S6 are subjected to limit digestion, each one gives precisely the data of (2) and (3). In fact, S1 through S6 is a complete list of all possible sequences for the unknown polymer, consistent with the data of (2) and (3). In general:

Theorem A. Suppose that data from two limit digests satisfies CT1, CT2 and CT3. Then:

(1) The end-around fragment and two digests w-net can always be constructed, as described previously.

(2) The data is consistent if and only if the two digests w-net has a special Euler circuit.


(3) Every possible polymer sequence consistent with the data corresponds to a unique special Euler circuit for the two digests w-net. The correspondence is as shown in Figure 1, with the longest abnormal fragment modified into the end-around fragment. The inverse operation, recovering the polymer sequence from the special Euler circuit, is as shown in Table 5.
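The correspondence of Theorem A can be illustrated on the example by listing special Euler circuits with a plain backtracking search and overlapping their arcs. This is a brute-force sketch for small w-nets, not the generation technique the paper refers to; the arc table is transcribed from Table 4, and the function names are assumptions.

```python
from collections import Counter

ARCS = {  # (tail, head, regrouped fragment): weight, from Table 4
    ("G", "C", "G.G.G.C"): 1, ("G", "U", "G.U"): 2, ("G", "AC", "G.AC"): 1,
    ("C", "G", "C.G"): 1, ("U", "G", "U.G"): 1, ("AC", "G", "AC.U.C.G"): 1,
    ("U", "G", "U.C.C.AC.C.A.ε.G"): 1,
}
END_AROUND = ("U", "G", "U.C.C.AC.C.A.ε.G")

def special_euler_circuits(arcs, end_around):
    """Yield each arc sequence that chains head to tail, uses every arc as
    often as its weight, and has `end_around` as its last arc."""
    remaining = Counter(arcs)
    remaining[end_around] -= 1            # reserve the end-around arc for last
    total = sum(remaining.values())

    def extend(point, path):
        if len(path) == total:
            if point == end_around[0]:
                yield path + [end_around]
            return
        for arc in list(remaining):
            if arc[0] == point and remaining[arc] > 0:
                remaining[arc] -= 1
                yield from extend(arc[1], path + [arc])
                remaining[arc] += 1       # backtrack

    yield from extend(end_around[1], [])  # start where the end-around arc points

def reconstruct(circuit):
    """Overlap consecutive arcs and drop the end marker and what follows."""
    terms = [circuit[0][2].split(".")[0]]
    for _tail, _head, label in circuit:
        terms += label.split(".")[1:]
    return "".join(terms[:terms.index("ε")])

sequences = sorted(reconstruct(c) for c in special_euler_circuits(ARCS, END_AROUND))
for s in sequences:
    print(s)
```

Because an arc of weight 2 is stored once with a count, the two copies of G.U are not distinguished and no circuit is listed twice.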

In the final section, general methods for analyzing w-nets are given. Using Theorem A, data from two complete digests can be completely analyzed by these methods.

Other investigators have obtained results and analyzed many examples using a somewhat different graph theory approach (Mosimann and Vinton, 1968, private communication). There is strong evidence that a graph can be constructed from the data of three or more limit digests. The Euler circuits of this graph would also be in one-one correspondence with the possible sequences for the polymer, consistent with the data. Furthermore, the assumption A2 needed for the construction above may not be needed for this more general construction.

In addition to the two limit digests, data may be obtained from partial digests or other chemical techniques. In many cases, such restrictions have a form convenient for w-net analysis.

As an example, suppose that a fragment with sequence CGUG is obtained under partial digestion by T1 ribonuclease, along with the limit digest data of (2) and (3). The regrouped polymer sequence must contain terms C.G.U.G in some four consecutive positions. Now these four extended bases are type I, type II, type I and type II, respectively. So, the three terms at the right are boundaries in the regrouped polymer sequence (see Table 3). Now the term C is not the first term of the polymer sequence, by CT3. Under the hypothesis of partial digestion by T1 ribonuclease, then, a G term (type II) must appear to the left of the fragment. Therefore, all four terms of C.G.U.G are boundaries. Now arc fragments run from each boundary to the next (see Fig. 1).

So, the sequence of arcs $C \xrightarrow{C.G} G \xrightarrow{G.U} U \xrightarrow{U.G} G$ must appear in some three consecutive positions of the special Euler circuit corresponding to the polymer sequence. In Figure 3, the two digests w-net has been modified by introducing two new points G' and U', and reconnecting some of the arcs.

Now G' and U' each have one arc entering and one arc leaving, both of weight one. So, any special Euler circuit of the w-net in Figure 3 must have the required sequence,

$C \xrightarrow{C.G} G' \xrightarrow{G.U} U' \xrightarrow{U.G} G$,    (5)

appearing within it. In fact, the special Euler circuits of the w-net in Figure 3 correspond to E1 and E4 in Table 5. Observe that S1 and S4 are the only polymer sequences in the table which can yield a CGUG fragment under partial digestion by T1 ribonuclease.

In general, a "reduction" of the two digests w-net corresponds to a sequence of arcs,

$X_1 \xrightarrow{a_1} X_2 \xrightarrow{a_2} X_3 \to \cdots \to X_{n-1} \xrightarrow{a_{n-1}} X_n$,    (6)

known to appear consecutively within the special Euler circuit of the polymer sequence. New nodes $X_2', X_3', \ldots, X_{n-1}'$ are introduced corresponding to extended bases $X_2, X_3, \ldots, X_{n-1}$, and arcs $a_1, a_2, \ldots, a_{n-1}$ start and end as indicated:

$X_1 \xrightarrow{a_1} X_2' \xrightarrow{a_2} X_3' \to \cdots \to X_{n-1}' \xrightarrow{a_{n-1}} X_n.$

Figure 3. Two Digests w-Net, Reduction 1

The remaining arcs start and end as in the two digests w-net. If arc a from X to Y has weight n in the two digests w-net and appears m times in the sequence $a_1, a_2, \ldots, a_{n-1}$, then it has weight n - m in the reduced w-net. (For example, G.U with weight two in Figure 2 has weight one from G to U in Figure 3, since it appears once in the sequence (5).) The special Euler circuits of the reduction above correspond exactly to those special Euler circuits of the two digests w-net in which the sequence (6) appears in consecutive positions.
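A reduction can be sketched as an operation on a table of arc weights. The code below is an illustration under stated assumptions: arcs are keyed by (tail, head, fragment), new nodes are named by appending a prime to the extended base, and the interior extended bases of the chain are assumed to be distinct so that the primed names do not collide. The name `reduce_wnet` is hypothetical.

```python
from collections import Counter

ARCS = {  # two digests w-net of Figure 2: (tail, head, fragment): weight
    ("G", "C", "G.G.G.C"): 1, ("G", "U", "G.U"): 2, ("G", "AC", "G.AC"): 1,
    ("C", "G", "C.G"): 1, ("U", "G", "U.G"): 1, ("AC", "G", "AC.U.C.G"): 1,
    ("U", "G", "U.C.C.AC.C.A.ε.G"): 1,
}

def reduce_wnet(arcs, chain):
    """Apply a reduction. `chain` lists the arcs known to be consecutive as
    (tail, fragment, head) triples; interior points get primed copies."""
    reduced = Counter(arcs)
    last = len(chain) - 1
    for i, (tail, label, head) in enumerate(chain):
        if reduced[(tail, head, label)] < 1:
            raise ValueError(f"arc {label} is not available in the w-net")
        reduced[(tail, head, label)] -= 1            # weight n becomes n - m
        new_tail = tail if i == 0 else tail + "'"
        new_head = head if i == last else head + "'"
        reduced[(new_tail, new_head, label)] += 1
    return +reduced                                   # drop arcs of weight zero

# Sequence (5): C.G, G.U, U.G known to be consecutive.
chain = [("C", "C.G", "G"), ("G", "G.U", "U"), ("U", "U.G", "G")]
reduced = reduce_wnet(ARCS, chain)
for arc, weight in sorted(reduced.items()):
    print(arc, weight)
```

For sequence (5), the arc G.U keeps weight one from G to U, and the new points G' and U' each have one arc entering and one leaving, as described for Figure 3.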

It is also possible to detect inconsistent data by the reduction method. For example, suppose that the fragment CGAC was obtained under partial digestion by pancreatic ribonuclease, in addition to the two limit digests (2) and (3). So, the three terms C.G.AC appear in the extended base sequence. They are type I, type II and type I, respectively. So, the G and AC terms are boundaries. Under the hypothesis of partial digestion by pancreatic ribonuclease, the term to the left of the C term is type I, and so the C term is not a boundary. From Table 2, the only arc fragment of three or more terms ending in C and G terms is AC.U.C.G. So, the reduction sequence is

$AC \xrightarrow{AC.U.C.G} G' \xrightarrow{G.AC} AC.$

In Figure 4, the corresponding reduced w-net is given.

Figure 4. Two Digests w-Net, Reduction 2

Now, the w-net of Figure 4 is disconnected, so it has no special Euler circuits. But this means that no polymer sequence is consistent with the data. So, the data is in error.

As a final example of reduction, suppose that the data of (2) and (3), plus the T1 partial digest CGUG, have been obtained. Also, assume that detection of a 5'-phosphate on the GGGC fragment shows that GGGC appears at the left end of the polymer sequence. In this case, the reduction sequence is

$U \xrightarrow{U.C.C.AC.C.A.\varepsilon.G} G'' \xrightarrow{G.G.G.C} C.$

The convention here is that the first term of a circuit "follows" the last term. Applying this reduction to Figure 3, the w-net of Figure 5 is obtained.

The only special Euler circuit of the w-net of Figure 5 corresponds to E1 in Table 5. So, the data of this example is sufficient to uniquely determine the polymer sequence as S1, that is, as the sequence (1).


Note that the w-net of Figure 5 is reduced twice. In general, each partial digest or other data could be analyzed as it is obtained. If the data can be represented by a suitable reduction, the reduced w-net can be constructed. Unfortunately, not every fragment sequence can be represented as a reduction. Also, difficulties may arise in reducing a w-net that has already been reduced several times. Using actual data, one could hope to use the w-net technique until relatively few possibilities remained, and then work directly from a list of possible polymer sequences. In the next section, an efficient technique for generating all special Euler circuits of a w-net is given. Using this technique, one can abandon the w-net analysis whenever desirable.

Figure 5. Two Digests w-Net, Reduction 3

3. The Euler Circuits of a w-Net. In the previous section, the problem of evaluating polymer fragment data was reduced to the problem of finding the special Euler circuits of a certain w-net. In the following, methods of analyzing any w-net are given. In particular, it is possible to:

(1) Determine whether a w-net has any special Euler circuits by simple tests.

(2) Obtain conservative estimates of the number of special Euler circuits for a w-net.

(3) Generate the list of all special Euler circuits of a w-net or as large a part of the complete list as desired.

It is convenient to introduce several concepts at this point. They are mild generalizations of concepts used in the theory of directed graphs. The approach taken here is informal, although a precise formulation is possible using set theory.


A "net with weighted arcs" or "w-net" is a diagram consisting of points, arrows going from a point to another point or from a point to itself, and a positive integer associated with each arrow. The arrows are called "arcs," and the integer associated with an arc is called its "weight." A w-net need not be connected (see Fig. 4), and may have any number of arcs between the same pair of points.

Given a w-net G, a (directed) "path" is a finite sequence of one or more arcs of G such that (1) the point at the head of each arc is the point at the tail of the next arc in the sequence, and (2) no arc appears in the sequence more times than its weight. A path "starts" and "ends" at the points at the tail of the first arc and head of the last arc, respectively.

In Figure 6, the letters x, y, z, and a, b, c, d, e, f are labels for the points and arcs, respectively. The integer next to each arc is its weight. The sequences (a, d) and (a, b, a, b) are not paths. In the former, a goes into y and d comes from x, which are not the same. In the latter, b occurs twice in the sequence, exceeding its weight. The sequences (d, c, b, a), (b) and (c, c, b, a) are paths from x to y, from y to z, and from y to y, respectively.

Figure 6. A w-Net

A path that starts and ends at the same point is called a "circuit." An "Euler circuit" is a circuit such that, for each arc, the number of times the arc occurs in the path sequence equals the arc's weight. There are no Euler circuits for the w-net of Figure 6.

The "indegree" (respectively, "outdegree") of a point x is the sum of the weights of all arcs with head (respectively, tail) at x. The notations id(x) and od(x) represent the indegree and outdegree of x, respectively. A w-net G is an "isograph" if id(x) = od(x) for every point x of G. Since a, c, d and e enter point y in Figure 6, for example, id(y) = 2 + 2 + 3 + 1 = 8. Since b, c and f leave y, od(y) = 1 + 2 + 4 = 7, and therefore the w-net is not an isograph.
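The isograph condition is a direct sum over arc weights. As a minimal sketch, using the two digests w-net of Figure 2 as input (the arc table is transcribed from Table 4; the function names are assumptions):

```python
from collections import Counter

ARCS = {  # (tail, head, fragment): weight
    ("G", "C", "G.G.G.C"): 1, ("G", "U", "G.U"): 2, ("G", "AC", "G.AC"): 1,
    ("C", "G", "C.G"): 1, ("U", "G", "U.G"): 1, ("AC", "G", "AC.U.C.G"): 1,
    ("U", "G", "U.C.C.AC.C.A.ε.G"): 1,
}

def degrees(arcs):
    """Return (indegree, outdegree) Counters: sums of arc weights per point."""
    indeg, outdeg = Counter(), Counter()
    for (tail, head, _label), weight in arcs.items():
        outdeg[tail] += weight
        indeg[head] += weight
    return indeg, outdeg

def is_isograph(arcs):
    """True if id(x) = od(x) for every point x that touches an arc."""
    indeg, outdeg = degrees(arcs)
    return indeg == outdeg

indeg, outdeg = degrees(ARCS)
print(dict(indeg), dict(outdeg), is_isograph(ARCS))
```

Without the end-around arc the same net would fail the test, since U would have one more arc entering than leaving; this is the loop that the end-around fragment was constructed to close.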


A w-net is "disconnected" if it can be separated into two or more parts with no arcs joining points in different parts (see Fig. 4). Otherwise it is (weakly) "connected." An "isolated point" is a point not at the head or tail of any arc.

The following theorem is closely related to Euler's solution of the "Bridges of Königsberg" problem in 1736. (See also Harary et al., 1965; Theorem 12.6.)

(1) Mark all points of G as "unlabelled" and "unscanned."

(2) Choose any point of G and mark it "labelled."

(3) Are there any points of G that are "labelled" and "unscanned"? If yes, go to (4). If no, go to (5).

(4) Choose any "labelled" and "unscanned" point x of G. Mark x as "scanned." Mark as "labelled" any point y such that there is an arc from x to y or from y to x. Return to (3).

(5) Is every point of G "labelled"? If yes, output: G is connected. If no, output: G is disconnected. Stop.

Figure 7. Connectivity Algorithm Flowchart

Theorem B. Let G be a w-net with no isolated points. Then G has an Euler circuit if and only if G is a connected isograph.

By computing the indegree and outdegree of each point, it is immediately clear whether a w-net is an isograph. It is usually obvious to the eye whether a w-net is connected or disconnected. If it is desired to use a computer to evaluate the data, a simple algorithm is available to determine connectivity. The flowchart is given in Figure 7. It is a simplified and adapted version of the Ford-Fulkerson algorithm for computing flow in networks (Ford and Fulkerson, 1962, pp. 17-18).
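The labelling and scanning procedure of Figure 7 might be coded roughly as follows. This is a sketch under stated assumptions: the function name `is_connected` and the representation of a w-net as a set of points plus a list of (tail, head) pairs are choices made here, weights are ignored because they do not affect connectivity, and the set of points is assumed to be non-empty.

```python
def is_connected(points, arcs):
    """Weak connectivity test following the labelled/scanned scheme.

    points: non-empty set of point names; arcs: list of (tail, head) pairs.
    """
    labelled = set()
    scanned = set()
    labelled.add(next(iter(points)))      # choose any point and label it
    while labelled - scanned:             # any labelled, unscanned point?
        x = next(iter(labelled - scanned))
        scanned.add(x)
        for tail, head in arcs:
            # label every point joined to x by an arc in either direction
            if tail == x:
                labelled.add(head)
            elif head == x:
                labelled.add(tail)
    return labelled == set(points)        # connected iff every point is labelled

# Arcs a, b, c, d of Figure 6 (endpoints inferred from the text)
print(is_connected({"x", "y", "z"},
                   [("z", "y"), ("y", "z"), ("y", "y"), ("x", "y")]))  # True
# A hypothetical net with two separate parts
print(is_connected({"x", "y", "z", "w"}, [("x", "y"), ("z", "w")]))    # False
```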

In the following, a w-net with an Euler circuit will be shown to have a special Euler circuit (that is, one ending with a designated end-around arc). From Theorems A and B, then, we can conclude that the data from two limit digests is consistent if and only if it satisfies CT1, CT2 and CT3, and the two digests w-net is a connected isograph.

An operation of "composition" of paths in a w-net is needed. The composite of two paths is obtained by juxtaposing the two path sequences. For example, the composite of paths (d, c) and (b, a, c) in Figure 6, denoted (d, c) * (b, a, c), equals (d, c, b, a, c). The result is a path if (1) the end point of the first path equals the start point of the second, and (2) the number of times any arc appears in the resulting sequence does not exceed its weight. In particular, suppose C is a circuit which is separated into an initial part C1 and final part C2, so that C = C1 * C2. Then C2 * C1 is also a circuit, and is called a "rotation" of C. For example, (c, b, a) = (c) * (b, a) = (c, b) * (a) in Figure 6, so (b, a, c) and (a, c, b) are rotations of (c, b, a).
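If arc sequences are stored as tuples, composition is plain concatenation and the rotations of a circuit are its cyclic shifts. The helper names below are hypothetical, and the sketch does not test whether a composite is a path; conditions (1) and (2) would have to be checked separately.

```python
def compose(p, q):
    """Composite p * q: juxtapose the two arc sequences."""
    return tuple(p) + tuple(q)

def rotations(circuit):
    """All rotations C2 * C1 obtained from proper splits C = C1 * C2."""
    c = tuple(circuit)
    return [c[i:] + c[:i] for i in range(1, len(c))]

print(compose(("d", "c"), ("b", "a", "c")))  # ('d', 'c', 'b', 'a', 'c')
print(rotations(("c", "b", "a")))            # [('b', 'a', 'c'), ('a', 'c', 'b')]
```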

Figure 8 is the same as Figure 2 except for the labels. Using the w-net G of Figure 8 as an example, the method of generating all special Euler circuits of a w-net will be described.

Figure 8. A Connected Isograph G

The method requires two techniques. Using the first technique, a single special Euler circuit is constructed. The second technique uses known special Euler circuits to generate more of them. The second process is iterated until no new special Euler circuits are produced, at which point all of the special Euler circuits have been found.

For the first technique, three results given below are used:

(R1) Let P be a path and E be an Euler circuit of a connected isograph G. Then the arc sequence of P is not longer than that of E, and has equal length if and only if P is also an Euler circuit.

(R2) If G is an isograph and P is a path of G such that P * (b) is not a path for any arc b, then P is a circuit.

(R3) If G is a connected isograph and P is a circuit, then either P is an Euler circuit, or there exists an arc b such that P * (b) is a path, or there exists a rotation Q of P and an arc c such that Q * (c) is a path.

Using results R2 and R3, we can find longer and longer paths of a connected isograph G. Finally, a path of maximum length is obtained, and it is an Euler circuit by R1. A suitable rotation of this Euler circuit yields a special Euler circuit.

In Figure 8, start with any arc and extend the path sequence arbitrarily until no terms can be added to the path sequence. For example, start with (b) and extend to (b, a). There is no arc h such that (b, a, h) is a path. So, rotate the circuit to (a, b) and continue, for example to C = (a, b, c, d, c, e, f, g). In this example the process is now complete, since C is an Euler circuit. Suppose that e is the end-around arc, so that special Euler circuits are those with e as last term of the sequence. By a suitable rotation, arc e can be put at the end:

C = (a, b, c, d, c, e) * (f, g).

D = (f, g, a, b, c, d, c, e).
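The extend-and-rotate construction can be sketched in code. The arc table `ARCS` below is an assumption reconstructed from the point sequence of D in the text (arc c has weight 2, all others weight 1), and the function names are hypothetical. The sketch always picks the alphabetically first usable arc, which is only one of many admissible choices.

```python
# Figure 8 as reconstructed from the text: label -> (tail, head, weight)
ARCS = {
    "a": ("y", "x", 1), "b": ("x", "y", 1), "c": ("y", "w", 2),
    "d": ("w", "y", 1), "e": ("w", "y", 1), "f": ("y", "z", 1),
    "g": ("z", "y", 1),
}

def next_arc(path, arcs):
    """Return an arc b such that path * (b) is a path, or None."""
    end = arcs[path[-1]][1]
    for label in sorted(arcs):
        tail, _, weight = arcs[label]
        if tail == end and path.count(label) < weight:
            return label
    return None

def euler_circuit(arcs, start):
    """Grow a path from arc `start`; rotate when stuck (results R2, R3)."""
    total = sum(w for _, _, w in arcs.values())
    path = [start]
    while len(path) < total:
        b = next_arc(path, arcs)
        if b is not None:
            path.append(b)
            continue
        # Stuck: the path is a circuit (R2), so look for an extendable rotation (R3)
        for i in range(1, len(path)):
            rot = path[i:] + path[:i]
            if next_arc(rot, arcs) is not None:
                path = rot
                break
        else:
            raise ValueError("w-net is not a connected isograph")
    return path

def make_special(circuit, end_around):
    """Rotate an Euler circuit so that the end-around arc is the last term."""
    i = circuit.index(end_around)
    return circuit[i + 1:] + circuit[:i + 1]

C = euler_circuit(ARCS, "b")
print(C)                     # ['a', 'b', 'c', 'd', 'c', 'e', 'f', 'g']
print(make_special(C, "e"))  # ['f', 'g', 'a', 'b', 'c', 'd', 'c', 'e']
```

With these choices the sketch happens to retrace the worked example: it starts at (b), reaches (b, a), rotates to (a, b), and finishes with C.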

The special Euler circuit D corresponds to E2 in Table 5. Now consider the sequence of points for D:

y -f-> z -g-> y -a-> x -b-> y -c-> w -d-> y -c-> w -e-> y.

Observe that (a, b, c) and (c) are both paths from y to w, and form non-overlapping segments of D. If we interchange these two segments, leaving the others fixed, the result is again a special Euler circuit:

D:  (f, g, a, b, c, d, c, e)

D': (f, g, c, d, a, b, c, e).


Another such possibility is (f, g, a, b) and (c, d), both being paths from y to y, and forming adjacent non-overlapping segments of D. The interchange gives:

D:  (f, g, a, b, c, d, c, e)

D": (c, d, f, g, a, b, c, e).

D' and D" are called "segment interchanges" of D. In general, if there are paths Di, 1 ≤ i ≤ 5, such that D = D1 * D2 * D3 * D4 * D5, and D2 and D4 start at the same point and end at the same point, then D1 * D4 * D3 * D2 * D5 is a segment interchange of D. Either D1 or D3 or both may be omitted. That is, if D = D2 * D4 * D5 as above, then D4 * D2 * D5 is a segment interchange.

The basic theorem for the generation of all special Euler circuits follows:

Theorem C. Suppose G is a connected isograph with end-around arc e, and L is a non-empty list of special Euler circuits of G. If every segment interchange of a member of list L is itself a member of list L, then list L contains every special Euler circuit of G.

From Theorem C, it is clear that all special Euler circuits can be generated from a single one by successive segment interchanges. At each stage, add to list L all segment interchanges of current list members. The complete list is obtained when segment interchanges produce no new list items.

To find all the segment interchanges of a special Euler circuit C, use the following method:

(1) The "working sequence" for C is the point sequence with the last point omitted.

For example, for D above, the point sequence is (y, z, y, x, y, w, y, w, y) and the working sequence is (y, z, y, x, y, w, y, w).

(2) Find all points repeated three or more times in the working sequence. If a point occurs in positions i, j and k, where i < j < k, then interchange the arc segment from i to j - 1 with the arc segment from j to k - 1.

For example, no point except y occurs three or more times in the working sequence. Point y occurs in positions 1, 3, 5 and 7 of the working sequence, determining four segment interchanges:

i, j, k    Known Euler Circuit D       Segment Interchange
1, 3, 5    (f, g, a, b, c, d, c, e)    (a, b, f, g, c, d, c, e)
1, 3, 7    (f, g, a, b, c, d, c, e)    (a, b, c, d, f, g, c, e)
1, 5, 7    (f, g, a, b, c, d, c, e)    (c, d, f, g, a, b, c, e)
3, 5, 7    (f, g, a, b, c, d, c, e)    (f, g, c, d, a, b, c, e)


(3) Find patterns of form (... u ... v ... u ... v ...) in the working sequence. More precisely, if point u appears in positions i and k and point v appears in positions j and m, where i < j < k < m, then interchange the arc segment from i to j - 1 with the segment from k to m - 1.

For example, in the working sequence (y, z, y, x, y, w, y, w), the only points occurring two or more times are y and w. The pattern (... w ... y ... w ... y ...) never occurs, but the pattern (... y ... w ... y ... w ...) determines three segment interchanges:

i, j, k, m    Given Euler Circuit D       Segment Interchange
1, 6, 7, 8    (f, g, a, b, c, d, c, e)    (c, d, f, g, a, b, c, e)
3, 6, 7, 8    (f, g, a, b, c, d, c, e)    (f, g, c, d, a, b, c, e)
5, 6, 7, 8    (f, g, a, b, c, d, c, e)    (f, g, a, b, c, d, c, e)

The list of D and its segment interchanges therefore contains five special Euler circuits:

(f, g, a, b, c, d, c, e)
(a, b, f, g, c, d, c, e)
(a, b, c, d, f, g, c, e)
(c, d, f, g, a, b, c, e)
(f, g, c, d, a, b, c, e).
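Steps (1) to (3) can be written as nested loops over positions in the working sequence. The sketch below is illustrative only: the arc table `ARCS` is reconstructed from the point sequence of D, the function names are hypothetical, positions are 0-based rather than 1-based, and u and v in step (3) are assumed to be distinct points.

```python
# Figure 8 as reconstructed from the text: label -> (tail, head, weight)
ARCS = {
    "a": ("y", "x", 1), "b": ("x", "y", 1), "c": ("y", "w", 2),
    "d": ("w", "y", 1), "e": ("w", "y", 1), "f": ("y", "z", 1),
    "g": ("z", "y", 1),
}

def working_sequence(circuit, arcs):
    """Step (1): the point sequence with its last point omitted."""
    return [arcs[label][0] for label in circuit]

def segment_interchanges(circuit, arcs=ARCS):
    c = tuple(circuit)
    work = working_sequence(c, arcs)
    n = len(work)
    found = set()
    # Step (2): one point at positions i < j < k
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if work[i] == work[j] == work[k]:
                    found.add(c[:i] + c[j:k] + c[i:j] + c[k:])
    # Step (3): pattern u .. v .. u .. v at positions i < j < k < m
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                for m in range(k + 1, n):
                    if (work[i] == work[k] and work[j] == work[m]
                            and work[i] != work[j]):
                        found.add(c[:i] + c[k:m] + c[j:k] + c[i:j] + c[m:])
    return found

D = tuple("fgabcdce")
result = segment_interchanges(D) | {D}
print(len(result))  # 5
```

Storing the results in a set removes the duplicates that steps (2) and (3) produce, including the copy of D itself that arises when two equal segments (c) are interchanged.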

Computing all the segment interchanges of the above list produces one new term,

(c, d, a, b, f, g, c, e).

The list of six special Euler circuits produces no new ones under segment interchange. By Theorem C, then, it constitutes the complete list of special Euler circuits of G. The above list corresponds to the entries in Table 5 arranged in the order E2, E4, E3, E5, E1 and E6.
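The iteration licensed by Theorem C amounts to a closure computation. In the sketch below, interchanges are enumerated directly from the general definition D = D1 * D2 * D3 * D4 * D5 rather than from the working-sequence steps, which is a simplification chosen here. The arc table `ARCS` is reconstructed from the point sequence of D, and the function names are hypothetical.

```python
# Figure 8 as reconstructed from the text: label -> (tail, head, weight)
ARCS = {
    "a": ("y", "x", 1), "b": ("x", "y", 1), "c": ("y", "w", 2),
    "d": ("w", "y", 1), "e": ("w", "y", 1), "f": ("y", "z", 1),
    "g": ("z", "y", 1),
}

def interchanges(c, arcs):
    """All D1*D4*D3*D2*D5 where D2 and D4 share start and end points."""
    pts = [arcs[c[0]][0]] + [arcs[label][1] for label in c]
    n = len(c)
    out = set()
    for p in range(n):                      # D2 = c[p:q]
        for q in range(p + 1, n):
            for r in range(q, n):           # D3 = c[q:r], possibly empty
                for s in range(r + 1, n):   # D4 = c[r:s]; s < n keeps e in D5
                    if pts[p] == pts[r] and pts[q] == pts[s]:
                        out.add(c[:p] + c[r:s] + c[q:r] + c[p:q] + c[s:])
    return out

def all_special_euler_circuits(seed, arcs=ARCS):
    """Add segment interchanges until the list stops growing."""
    known = {tuple(seed)}
    frontier = [tuple(seed)]
    while frontier:
        current = frontier.pop()
        for nxt in interchanges(current, arcs):
            if nxt not in known:
                known.add(nxt)
                frontier.append(nxt)
    return known

circuits = all_special_euler_circuits("fgabcdce")
print(len(circuits))  # 6
```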

Finally, Theorem C allows conservative estimates of the number of special Euler circuits to be easily obtained:

Criterion 1. Suppose G is a connected isograph with r arcs between a given point x and a given point y, and none of these arcs is the end-around arc. If the weights of these arcs are n1, n2, ..., nr, respectively, and n = n1 + n2 + ... + nr, then there are at least n!/(n1! n2! ... nr!) special Euler circuits for G.

Criterion 2. Suppose G is a connected isograph with r different arcs entering some point x. (Alternatively, there may be r different arcs leaving some point x.) Then G has at least (r - 1)! special Euler circuits.


For example, there are four arcs entering point y in Figure 8, so there are at least (4 - 1)! = 6 special Euler circuits by Criterion 2. In this case, the estimate was exact.
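Both lower bounds are simple factorial expressions and can be evaluated directly. The function names below are hypothetical, and the weights passed to `criterion_1` in the usage lines are made-up values for illustration, not data from Figure 8.

```python
from math import factorial

def criterion_1(weights):
    """Lower bound n! / (n1! n2! ... nr!) for arc weights n1, ..., nr."""
    bound = factorial(sum(weights))
    for w in weights:
        bound //= factorial(w)   # each partial quotient is an exact integer
    return bound

def criterion_2(r):
    """Lower bound (r - 1)! for r different arcs entering (or leaving) a point."""
    return factorial(r - 1)

print(criterion_2(4))       # 6, the Figure 8 example
print(criterion_1([2, 1]))  # 3, for hypothetical weights 2 and 1
```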

A complete set of methods for the analysis of two limit digests has now been provided. The reduction method allows partial digest information to be analyzed also in many cases.

LITERATURE

Dayhoff, M. O. 1964. "Computer Aids to Protein Sequence Determination." J. Theor. Biol., 8, 97-112.

Ford, L. R. Jr. and D. R. Fulkerson. 1962. "Flows in Networks." Princeton, New Jersey: Princeton University Press.

Goodman, H. M., J. Abelson, A. Landy, S. Brenner and J. D. Smith. 1968. "Amber Suppression: A Nucleotide Change in the Anticodon of a Tyrosine Transfer RNA." Nature, 217, 1019-1024 (Mar. 16).

Harary, F., R. Z. Norman and D. Cartwright. 1965. "Structural Models: An Introduction to the Theory of Directed Graphs." New York: John Wiley & Sons, Inc.

Holley, R. W., J. Apgar, G. A. Everett, J. T. Madison, M. Marquisee, S. H. Merrill, J. R. Penswick and A. Zamir. 1965. "Structure of a Ribonucleic Acid." Science, 147, 1462-1465 (Mar. 19).

Hutchinson, G. 1968. "Evaluation of Polymer Sequence Data from Two Complete Digests." Internal Report, National Institutes of Health.

Madison, J. T., G. A. Everett, and H. Kung. 1966. "Nucleotide Sequence of a Yeast Tyrosine Transfer RNA." Science, 153, 531-534 (July 29).

Mosimann, J. E., M. B. Shapiro, C. R. Merril, D. F. Bradley and J. E. Vinton. 1966. "Reconstruction of Protein and Nucleic Acid Sequences: IV. The Algebra of Free Monoids and the Fragmentation Strategem." Bull. Math. Biophysics, 28, 235-260.

--- and J. E. Vinton. 1968. "Necessary and Sufficient Conditions for a Sequence to be Solvable with Complete Digest Fragments." Manuscript, National Institutes of Health.

RajBhandary, U. L., S. H. Chang, A. Stuart, R. D. Faulkner, R. M. Hoskinson and H. G. Khorana. 1967. "Studies on Polynucleotides, LXVIII. The Primary Structure of Yeast Phenylalanine Transfer RNA." Proc. N.A.S., 57, 751-758.

Shapiro, M. B., C. R. Merril, D. F. Bradley and J. E. Mosimann. 1965. "Reconstruction of Protein and Nucleic Acid Sequences: Alanine Transfer Ribonucleic Acid." Science, 150, 918-921 (Nov. 12).

---. 1967. "An Algorithm for Reconstructing Protein and RNA Sequences." J. Assoc. Comp. Mach., 14, 720-731.

RECEIVED 1-6-69
