Methods for comparing sequence data such as restriction endonuclease maps or nucleotide sequences of...

8
Journal of Virological Methods, 9 ( 1984) 3 17-324 Elsevier JVM 00349 317 METHODS FOR COMPARING SEQUENCE DATA SUCH AS RESTRICTION ENDONUCLEASE MAPS OR NUCLEOTIDE SEQUENCES OF VIRAL NUCLEIC ACID MOLECULES ADRIAN GIBBS and FRANK FENNER Research School of Biological Sciences and John Curiin School of Medical Research, Australian National University, Canberra, A.C.T. 2601, Australia (Accepted 30 July 1984) Maps of restriction endonuclease sites in the DNA genomes of orthopoxviruses were compared by computer classification methods. Data from such maps were recorded as qualitative binary information, and gave closely similar dendrograms when different dissimilarity measures and sorting strategies were used for the computations. A method for calculating dendrograms from small data sets, by ‘hand’, is described in detail to assist those who do not have access to suitable computer programs. The same calculation strategy gave useful classifications of partial nucieotide and amino acid sequences of orthomyxovirus haemaggIuti- nins. restriction endonuclease maps classification poxviruses orthomyxoviruses INTRODUCTION Restriction endonucleases cut DNA molecules at particular nucleotide sequences that are usually palindromic and four or six nucleotides in length. By various means, the positions of those sequences (restriction sites) within a DNA molecule may be mapped. Statistical methods have been devised to use such maps to estimate the evolutionary distance between two related DNAs or to estimate the sequence varia- tion within populations of organisms; these methods have been reviewed by Kaplan (1983). Maps of restriction sites are useful for comparing viral DNAs, as was first shown for the genomes of the orthopoxviruses by Wittek et al. (1977). There is a large centra1 region in their genomes that is common to all species and may be used to align their maps unequivocally (Mackett and Archard, 1979). Other parts of the maps are more variable, and it is difficult to obtain an overall view of the relatedness of several species or isolates merely by inspecting their maps. We have therefore computed classifica- tions of the maps to simplify the comparisons, and report these methods here. 0166-0934/84/$03.00 0 1984 Elsevier Science Publishers B.V.

Transcript of Methods for comparing sequence data such as restriction endonuclease maps or nucleotide sequences of...

Journal of Virological Methods, 9 ( 1984) 3 17-324

Elsevier

JVM 00349

317

METHODS FOR COMPARING SEQUENCE DATA SUCH AS RESTRICTION

ENDONUCLEASE MAPS OR NUCLEOTIDE SEQUENCES OF VIRAL

NUCLEIC ACID MOLECULES

ADRIAN GIBBS and FRANK FENNER

Research School of Biological Sciences and John Curiin School of Medical Research, Australian National

University, Canberra, A.C.T. 2601, Australia

(Accepted 30 July 1984)

Maps of restriction endonuclease sites in the DNA genomes of orthopoxviruses were compared by

computer classification methods. Data from such maps were recorded as qualitative binary information,

and gave closely similar dendrograms when different dissimilarity measures and sorting strategies were used

for the computations. A method for calculating dendrograms from small data sets, by ‘hand’, is described in

detail to assist those who do not have access to suitable computer programs. The same calculation strategy

gave useful classifications of partial nucieotide and amino acid sequences of orthomyxovirus haemaggIuti-

nins.

restriction endonuclease maps classification poxviruses orthomyxoviruses

INTRODUCTION

Restriction endonucleases cut DNA molecules at particular nucleotide sequences

that are usually palindromic and four or six nucleotides in length. By various means,

the positions of those sequences (restriction sites) within a DNA molecule may be

mapped. Statistical methods have been devised to use such maps to estimate the

evolutionary distance between two related DNAs or to estimate the sequence varia-

tion within populations of organisms; these methods have been reviewed by Kaplan

(1983).

Maps of restriction sites are useful for comparing viral DNAs, as was first shown for

the genomes of the orthopoxviruses by Wittek et al. (1977). There is a large centra1

region in their genomes that is common to all species and may be used to align their

maps unequivocally (Mackett and Archard, 1979). Other parts of the maps are more

variable, and it is difficult to obtain an overall view of the relatedness of several species

or isolates merely by inspecting their maps. We have therefore computed classifica-

tions of the maps to simplify the comparisons, and report these methods here.

0166-0934/84/$03.00 0 1984 Elsevier Science Publishers B.V.

318

RESULTS

Computer classifications

The maps of the genomes of various orthopoxviruses were obtained from pub-

lished data. Figure 1, for example, shows the aligned maps of Hind-III restriction sites

of the DNAs of two isolates each of vaccinia, variola and monkeypox viruses. The

state of each site in each genome map was recorded as binary qualitative data (Table

1); ‘present’, ‘absent’ or ‘inapplicable’ (i.e. that part of the genome not present). These

data were compared using programs in the TAXON library of the CSIRONET

facilities of the Commonwealth Scientific and Industrial Research Organization;

programs were based on those of Lance and Williams (1966, 1967a,b, 1977) and Burr

(1968, 1970).

Each classification involved two stages. First, each sequence was compared, in turn,

Fig. 1. Maps of the Hind-111 restriction fragments obtained from the genomes of six orthopoxviruses. The

viruses are, top to bottom, monkeypox (Zaire), monkeypox (West Africa), vaccinia (Lister), vaccinia

(Venezuela), variola major and variola minor (redrawn from data supplied by Joe J. Esposito).

TABLE 1

The condensed qualitative binary data deduced from Fig. 1, Each site is recorded as present (1) or absent (0)

or missing (*) because that part of the genome is absent

Monkeypox Zaire 10100001101000011

Monkeypox West Africa ‘0010101101000011

Vaccinia Lister 00000010100101011

Vaccinia Venezuela 00000010100101011

Variola major ******00011010011

Variola minor ******00011010111

111111001000010000

111111001000001011

111111001100100***

111111001000000***

11111111001000010*

11111111001100010*

319

with every other sequence and their similarity assessed using either the ‘squared Euclidean distance’ metric, which accentuates differences, or the ‘Canberra’ or ‘Cower’ metrics (Lance and Williams, 1967a), which do not. The similarities were recorded in a matrix, one matrix for each metric.

In the second stage of the classification, the matrices of similarities were reduced to dendrograms by the flexible agglomerative hierarchical classification method of Lance and Williams (1966, 1967b). For each classification thesortingstrategy wasused which had been shown to be most useful for that metric; the ‘minimum increment of sum of squares’ strategy (Burr, 1968, 1970) for ‘Euclidean’, and, for the others, ‘flexible strategy’, with parameters beta -0.25, alpha 0.125 and gamma 0 (Lance and Williams, 1967b).

The resulting dendrograms differed only in scale. Figure 2a shows that obtained

(4 Monkeypox: wildtype Monkeypox: h15 Monkeypox: h7 Monkeypox: h2 Monkeypox: h8 Variola major Whitepox: Netherlands

r , WhFtepox: Mastomys ,

Fig. 2. (a) Dendrogram computed from the data in Table 1 using the TAXON library of the CSIRONET

facilities; squared Euclidean distance metric and Burr’s incremental sum of squares sorting strategy (Lance

and Williams, 1967a; Burr, 1968). The full scale represents a dissimilarity measure of 2.0 for Figs. 2a, c and

d, and a 50% difference for Fig. 2b. (b) Dendrogram calculated, by hand, from the data in Table 1 using the

‘percentage dissimilarity’ as a measure of relatedness and centroid sorting strategy (details shown in Table

2). (c) Dendrogram computed in the same way as that in Fig. 2a but with composite maps ofHind-III, X/IO-I

and Sma-I restriction sites of two isolates each of monkeypox virus and variola virus, and three of vaccinia

virus (data from Mackett and Archard, 1979). (d) Dendrogram computed from maps of 51 restriction sites

(Hind-III, Xho-I and Sma-I) in the genomes of wild-type monkeypox virus and four ‘white pock’ mutants

derived from it fh2, h7, h8, and h15), also variola virus, ‘whitepox’ virus from Cynomolgm monkey kidney

cells, and ‘whitepox’ virus from Musromys natdensis. Computation parameters as in Fig. 2a (data from

Dumbell and Archard, 1980).

320

using the squared Euclidean distance metric. It confirms the close relationship of each

pair of isolates, and suggests that vaccinia virus is more closely related to monkeypox

virus than to variola virus,

However, similar classifications of more detailed maps (66 restriction sites of

Hind-III, Xho-I and %a-I) of isolates of these viruses indicated (Fig. 2c) that

vaccinia may be slightly more closely related to variola than to monkeypox.

There may be several reasons for the difference in these results. For example, the

classification based on three enzymes (66 sites in all) might be expected to be more

accurate than that based on Hind-III alone (36 sites in all) because a greater propor-

tion of each genome was sampled. However, most Hind-III sites are in the central

conserved part of the genome (Mackett and Archard, 1979), and hence the Hind-III

map may be a better guide to the relatedness of the viruses than the mixed enzyme

map. Whatever the reason for the difference between these classifications, it is

important to realise that these methods are based on mutation data and therefore

assess close relationships more reliably than distant ones.

An example of the practical use of these methods is shown in Fig. 26, which is a

~lassi~~ation of the restriction maps of five monkeypox isolates, two ‘whitepox’

isolates and one variola isolate (Dumbell and Archard, 1980). Four of the monkeypox

isolates (h2, h7, h8, and h 15) were ‘white pock’ mutants isolated from the wild-type

monkeypox, and it can be seen that they group unequivocally with monkeypox virus

rather than with variola virus and the ‘whitepoxes’. The significance of this result is

that the ‘whitepoxes’ were thought to have been isolates from animals (‘Netherlands’

from ~ynorn~~gus monkey kidney cells: Gispen and Brand-Saathof, 1972; l ~us?o~ys’

from the kidney of a rat of that name shot in Zaire: Marennikova et al., 1976).

Subsequently, Marennikova et al. (1979) claimed that ‘whitepox’ viruses arose as

mutants of monkeypox virus, but Dumbell and Kapsenburg (1982) showed that the

Netherlands ‘whitepox’ isolate was a laboratory contaminant. The clustering of the

four ‘white pock’ mutants of monkeypox virus with the wild-type strain from which

they were isolated, and of both ‘whitepoxes’ with variola virus (Fig. 2d), indicates that

“whitepoxes’ are probably not ‘white pock’ mutants of monkeypox virus.

A non-computer method Not all people have access to computers, even fewer to the TAXON library of

CSIRONET, or similar programs in GENSTAT. However, it is quite simple to

calculate certain types of dendrograms from small sets of data like those in Table I

without the help of a computer.

The calculation is done in two stages. First, the dissimilarity of the maps is

calculated in some convenient way, for example, the matrix in Table 2 shows the

number of sites not shared by each pair of sequences expressed as a percentage of totai

number of sites in those sequences. Then the branching points of the dendrogram are

calculated in the following way. The matrix is inspected to find which pair of

sequences are most similar; sequences 5 and 6 in Table 2. This pair and its associated

321

TABLE 2

PercentagedissimilarityoftherestrictionmapsinFig. landTable 1,togetherwith thecalculationsforobtain-

ing from them a dendrogram by centroid sorting

Map Percentage difference between restriction maps

Map

2 3 4 5 6 ‘7’ ‘8’ ‘9’ ‘10’

1 20 34.5 41.4 30.3 24.2 27.3 38.0 27.3 44.0

2 - 37.9 44.8 29.6 25.0 27.3 41.4 -

3 - - 6.9 48.2 40.7 44.6 -

4 - - 55.6 48.2 51.9 -

5 - - 6.1 -

6 - -

‘7’ _ - 48.3 27.3 -

‘8’ _ _ 39.7 44.0

sequence + sequence = sequence at dissimilarity of

5 6 ‘7’ 6.1%

3 4 ‘8’ 6.9%

1 2 ‘9’ 20.0%

‘7’ ‘9 ‘10 27.3%

‘8’ ‘10 ‘11’ 44.0%

dissimilarity (6.1%) is recorded. Then the data associated with sequences 5 and 6 in the

matrix are replaced with that for a ‘new sequence’ (‘7’), which thus represents the

bifurcation point for 5 and 6. The way in which dissimilarity of the new sequence (‘7’)

to the remaining sequences (l-4) is calculated represents the ‘sorting strategy’ em-

ployed for constructing the dendrogram. The simplest is ‘centroid sorting’, and merely

involves calculating the average dissimilarity. Thus, the dissimilarity of ‘7’ and ‘1’ is

27.3%, which is the mean of the dissimilarity of ‘5’ and ‘l’, 30.3%, and ‘6’ and ‘l’,

24.2%. Similarly, the dissimilarity of ‘7’ and ‘2’ is 27.3%, of ‘7’ and ‘3’ is 44.6% and of

‘7’ and ‘4’ is 5 1.9%. The process of identifying and fusing the most similar sequences is

repeated until only one ‘sequence’ remains. Finally, the recorded fusions and the

associated dissimilarities are used to draw a dendrogram. The dendrogram in Fig. 2b

was derived in this way from Table 2, and can be seen to be closely similar to that in

Fig. 2a.

DISCUSSION

Any table of dissimilarities may be converted into a dendrogram in the way

described above. One advantage of calculating a dendrogram using centroid sorting

322

AlNks/33 (N1NI) AIBHl35 IHlNl) Al PR/ 8134 (HINI) h/&1/42 (HlNl) b/USSR/90/77 (HlN1) A/Fort Warren/ 1150 (WlNl) AILoyanglb/57 (WINl) AiducklALbertaf 35J76 (HINl) A/suine/Iowaf 15130 (HlNi) A/New Jerseyill; JHINI) AfTokyoi 3167 (HZNZ) h/Netherlands/68 (HZN2) A/RI/ 5-157 (H2N2) A/Berkeley/ 68 (H2N2) A/dUCk/GDR/72 (H2N2) AJduckJAlbertal U/77 (N2N3) A/blue winged teai/604/78 (H2N3) A~~U~k~o~e~r~~~?7 {H2Nl) Afp~ntaillA1berrai293/17 (W2N9) AIduckiNew York!6750178 fW2N2) A/shearwater/Australia/75 (H5N3) Afshearwater/Australia/72 (WbNS) h/duck/ALbertaJ 1298179 (MNS) A/duck/Alberta/l297/79 (HbN9) A/tern/Australia/75 (HllN9) A~d~cki~~aine~~~b~ (KIlN9) Afd~~kf~~~l~~d~~6 fH1 IN6) Aldufkmew -fork/f2178 CHilN6) AiduckiMemphisj546176 (tiilN9) A/gullftiss/26/80 (B13N6) A/~ull/W704/77 (H13~6) A/turkey/Ontario/6118/68 (H8N4) Alduck/Alberraia27/78 IHBN4) AiducklAlbertai283i77 (uaw+) A~t~~ke~~~~~co~s~~~ I!66 (H9N2) A~d~~k~A~be~t~~6#~76 tHlZN5)

Fig. 3. (a) Dendrogram computed from the percentage dissimilarities of the 300-350 Y-terminal nucleotide

sequences (Air, 1981) of the haemagglutinin genes of 44 group-A influenzaviruses using centroid sorting

strategy. (b) Dendrogram computed in the same way as that in Fig. 3a from the amino acid sequences

deduced from ihe nucleotide sequences used for Fig. 3a.

rather than any other strategy is that its scale is the same as the matrix from which it is

derived, so that each branch point gives the mean dissimilarity of all individuals

connected through that point.

This method is also useful for summarizing the relationships of aligned nucleotide

or amino acid sequences; the matrix of percentage differences may be reduced to a

dendrogram in precisely the same way, by ‘hand’ or by computer. Figure 3a, for

example, is a dendrogram illustrating the relationships of the sequences of the 300-350

?‘-terminal nucleotides of the haemagglutinin genes of 44 group-A influenzaviruses,

A/NuS/33 (HlNl) A/PR/0/34 (HlNl) A/BH/ 35 (HlNl) A/Be1142 (HlNl) A/Loyang/4/57 (HlNl) A/USSR/90/77 (HlNl) A/Fort Warren/ l/50 (HlNl) A/duck/Alberta/35/76 (HlNl) A/swine/Iowall5/30 (HlNl) A/New Jersey/ll/76 (HlNl) Al RI/ 5-l 57 (H2N2) A/Tokyo/3/67 (H2N2) A/Berkeley/68 (H2N2) A/Netherlands/68 (H2N2) A/duck/GDR/lZ (H2N2) A/duck/New York/6750/78 (H2N2) A/blue winged tea1/604/70 (H2N3) A/duck/Ontario/77 (H2Nl) A/duck/Alberta/77/77 (H2N3) Alpintail/Alberta/293/77 (H2N9) A/shearwater/Australia/75 (H5N3) A/sheawater/Australia/72 (HbN5) A/duck/Alberta/l298/79 (H6N5) A/duck/Alberta/l297/79 (HbN9) A/tern/Australia/75 (HllN9) A/duck/Ukraine/l/60 (HllN9) Alduck/England/56 (Hl lN6) A/duck/New York/l2/78 (Hl lN6) A/duck/Memphisl546/76 (HllN9) A/gull/Mass/26/00 (H13N6) A/gull/Md/704/77 (H13N6) A/turkeylOntario/6118/68 (H8N4) A/duck/Alberta/827/78 (H8N4) A/ducklAlberta/283/77 (H8N4) A/turkeylWisconsin/l/66 (H9N2) A/duck/Alberta/60/76 (H12N5) AlturkeylOregonlll (HlN3) A/equine/Prague1 l/56 (HlN7) AlducklManitobal53 (HlON7) Alquail/Italy/ll7/65 (HlON8) A/duck/Albertal2B/76 (~4~6) A/duck/Australia/702/78 (H3N8) A/Mem~his/ll71 (H3N2) A/duck/Ukraine/ l/63 (H3N8)

I I I I

323

(b)

and Fig. 3b shows the relationships of the N-terminal amino acid sequences deduced

from them. These classifications agree, except in minor details, and the indicated

groupings correlate precisely with the serological groupings of the haemagglutinins,

which are indicated by the ‘H’ part of the subtype numbering given in brackets after

each name.

REFERENCES

Air, G.M., 1981, Proc. Natl. Acad. Sci. U.S.A. 78, 7639.

Burr, E.J., 1968, Aust. Comput. J. 1, 97.

Burr, E.J., 1970, Aust. Comput. J. 2, 98.

Dumbell, K.R. and L.C. Archard, 1980, Nature (London) 286, 29.

Dumbell, K.R. and J.G. Kapsenberg, 1982, Bull. W.H.O. 60, 381.

324

Gispen, R. and B. Brand-Saathof, 1972, Bull. W.H.O. 46, 585.

Kaplan, N., 1983, in: Statistics: Textbooks and Monographs, Vol. 47. Statistical Analysis of DNA Sequence

Data, ed. B.S. Weir (Dekker, New York and Basel), p. 75.

Lance, G.N. and W.T. Williams, 1966, Nature (London) 212, 218.

Lance, G.N. and W.T. Williams, 1967a, Aust. Comput. J. 1, 15.

Lance, G.N. and W.T. Williams, 1967b, Comput. J. 9, 373.

Lance, G.N. and W.T. Williams, 1977, in: Mathematical Methods for Digital Computers, Vol. 3. Statistical

Methods for Digital Computers (Wiley, New York), p. 269.

Mackett, M. and L.C. Archard, 1979, J. Gen. Virol. 45, 683.

Marennikova, S.S., E.M. Shelukina and L.S. Shenkman, 1976, Acta Viral. 20, 80.

Marennikova, S.S., E.M. Shelukina, N.N. Maltseva and G.R. Matsevich, 1979, Intervirology 11, 333.

Wittek, R., A. Menna, D. Schtimperli, S. Stoffel, H.K. Miiller and R. Wyler, 1977, J. Virol. 23, 669.