Improved tools for biological sequencecomparison · 2444 Thepublicationcostsofthis article...

Proc. Natl. Acad. Sci. USAVol. 85, pp. 2444-2448, April 1988Biochemistry

Improved tools for biological sequence comparison(amino acid/nucleic acid/data base searches/local similarity)

WILLIAM R. PEARSON* AND DAVID J. LIPMANt*Department of Biochemistry, University of Virginia, Charlottesville, VA 22908; and tMathematical Research Branch, National Institute of Diabetes andDigestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892

Communicated by Gerald M. Rubin, December 2, 1987 (received for review September 17, 1987)

ABSTRACT We have developed three computer pro-grams for comparisons of protein and DNA sequences. Theycan be used to search sequence data bases, evaluate similarityscores, and identify periodic structures based on local se-quence similarity. The FASTA program is a more sensitivederivative of the FASTP program, which can be used to searchprotein or DNA sequence data bases and can compare aprotein sequence to a DNA sequence data base by translatingthe DNA data base as it is searched. FASTA includes anadditional step in the calculation of the initial pairwise simi-larity score that allows multiple regions of similarity to bejoined to increase the score of related sequences. The RDF2program can be used to evaluate the significance of similarityscores using a shuffling method that preserves local sequencecomposition. The LFASTA program can display all the re-gions of local similarity between two sequences with scoresgreater than a threshold, using the same scoring parametersand a similar alignment algorithm; these local similarities canbe displayed as a "graphic matrix" plot or as individualalignments. In addition, these programs have been generalizedto allow comparison of DNA or protein sequences based on avariety of alternative scoring matrices.

We have been developing tools for the analysis of proteinand DNA sequence similarity that achieve a balance ofsensitivity and selectivity on the one hand and speed andmemory requirements on the other. Three years ago, wedescribed the FASTP program for searching amino acidsequence data bases (1), which uses a rapid technique forfinding identities shared between two sequences and exploitsthe biological constraints on molecular evolution. FASTPhas decreased the time required to search the NationalBiomedical Research Foundation (NBRF) protein sequencedata base by more than two orders of magnitude and hasbeen used by many investigators to find biologically signifi-cant similarities to newly sequenced proteins. There is atrade-off between sensitivity and selectivity in biologicalsequence comparison: methods that can detect more dis-tantly related sequences (increased sensitivity) frequentlyincrease the similarity scores of unrelated sequences (de-creased selectivity). In this paper we describe a new versionof FASTP, FASTA, which uses an improved algorithm thatincreases sensitivity with a small loss of selectivity and anegligible decrease in speed. We have also developed arelated program, LFASTA, for local similarity analyses ofDNA or amino acid sequences. These programs run oncommonly available microcomputers as well as on largermachines.

METHODSThe search algorithm we have developed proceeds throughfour steps in determining a score for pair-wise similarity.

FASTP and FASTA achieve much of their speed and selec-tivity in the first step, by using a lookup table to locate allidentities or groups of identities between two DNA or aminoacid sequences during the first step of the comparison (2).The ktup parameter determines how many consecutive iden-tities are required in a match. For example, if ktup = 4 for aDNA sequence comparison, only those identities that occurin a run of four consecutive matches are examined. In thefirst step, the 10 best diagonal regions are found using asimple formula based on the number of ktup matches and thedistance between the matches without considering shorterruns of identities, conservative replacements, insertions, ordeletions (1, 3).

In the second step of the comparison, we rescore these 10regions using a scoring matrix that allows conservativereplacements and runs of identities shorter than ktup tocontribute to the similarity score. For protein sequences,this score is usually calculated using the PAM250 matrix (4),although scoring matrices based on the minimum number ofbase changes required for a replacement or on an alternativemeasure of similarity can also be used with FASTA. Foreach of these best diagonal regions, a subregion with maxi-mal score is identified. We will refer to this region as the"initial region"; the best initial regions from Fig. lA areshown in Fig. 1B.The FASTP program uses the single best scoring initial

region to characterize pair-wise similarity; the initial scoresare used to rank the library sequences. FASTA goes onestep further during a library search; it checks to see whetherseveral initial regions may be joined together. Given thelocations of the initial regions, their respective scores, and a"joining" penalty (analogous to a gap penalty), FASTAcalculates an optimal alignment of initial regions as a com-bination of compatible regions with maximal score. FASTAuses the resulting score to rank the library sequences. Welimit the degradation of selectivity by including in theoptimization step only those initial regions whose scores areabove a threshold. This process can be seen by comparingFig. 1B with Fig. 1C. Fig. 1B shows the 10 highest scoringinitial regions after rescoring with the PAM250 matrix; thebest initial region reported by FASTP is marked with anasterisk. Fig. 1C shows an optimal subset of initial regionsthat can be joined to form a single alignment.

In the fourth step of the comparison, the highest scoringlibrary sequences are aligned using a modification of theoptimization method described by Needleman and Wunsch(5) and Smith and Waterman (6). This final comparisonconsiders all possible alignments of the query and librarysequence that fall within a band centered around the highestscoring initial region (Fig. 1D). With the FASTP program,optimization frequently improved the similarity scores ofrelated sequences by factors of 2 or 3. Because FASTAcalculates an initial similarity score based on an optimizationof initial regions during the library search, the initial score is

Abbreviation: NBRF, National Biomedical Research Foundation.

2444

The publication costs of this article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertisement"in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Dow

nloa

ded

by g

uest

on

Apr

il 8,

202

0

Proc. Natl. Acad. Sci. USA 85 (1988) 2445

A 50 100

50

100

50

100

N

''\X\\\'

* \\' \\

\\ * '~\' \

C50 100

B

FIG. 1. Identification of sequence similarities by FASTA. Thefour steps used by the FASTA program to calculate the initial andoptimal similarity scores between two sequences are shown. (A)Identify regions of identity. (B) Scan the regions using a scoringmatrix and save the best initial regions. Initial regions with scoresless than the joining threshold (27) are dashed. The asterisk denotesthe highest scoring region reported by FASTP. (C) Optimally joininitial regions with scores greater than a threshold. The solid linesdenote regions that are joined to make up the optimized initial score.(D) Recalculate an optimized alignment centered around the highestscoring initial region. The dotted lines denote the bounds of theoptimized alignment. The result of this alignment is reported as theoptimized score.

much closer to the optimized score for many sequences. Infact, unlike FASTP, the FASTA method may yield initialscores that are higher than the corresponding optimizedscores.

Local Similarity Analyses. Molecular biologists are ofteninterested in the detection of similar subsequences withinlonger sequences. In contrast to FASTP and FASTA, whichreport only the one highest scoring alignment between twosequences, local sequence comparison tools can identifymultiple alignments between smaller portions of two se-quences. Local similarity searches can clearly show theresults of gene duplications (see Fig. 2) or repeated struc-tural features (see Fig. 3) and are frequently displayed usinga "graphic matrix" plot (7), which allows one to detectregions of local similarity by eye. Optimal algorithms forsensitive local sequence comparison (6, 8, 9) can havetremendous computational requirements in time and mem-ory, which make them impractical on microcomputers and,when comparing longer sequences, on larger machines as

well.The program for detecting local similarities, LFASTA,

uses the same first two steps for finding initial regions thatFASTA uses. However, instead of saving 10 initial regions,LFASTA saves all diagonal regions with similarity scoresgreater than a threshold. LFASTA and FASTA also differ inthe construction of optimized alignments. Instead of focus-ing on a single region, LFASTA computes a local alignmentfor each initial region. Thus LFASTA considers all of theinitial regions shown in Fig. 1B, instead of just the diagonalshown in Fig. 1D. Furthermore, LFASTA considers not

50 100 only the band around each initial region but also potentialsequence alignments for some distance before and after theinitial region. Starting at the end of the initial region, anoptimization (6) proceeds in the reverse direction until allpossible alignment scores have gone to zero. The location ofthe maximal local similarity score in the reverse direction isthen used to start a second optimization that proceeds in theforward direction. An optimal path starting from the forwardmaximum is then displayed (5). The local homologies can bedisplayed as sequence alignments (see Fig. 2B) or on atwo-dimensional graphic matrix style plot (see Figs. 2A and3).

Statistical Significance. The rapid sequence comparisonalgorithms we have developed also provide additional toolsfor evaluating the statistical significance of an alignment.There are approximately 5000 protein sequences, with 1.1million amino acid residues, in the NBRF protein sequencelibrary, and any computer program that searches the libraryby calculating a similarity score for each sequence in thelibrary will find a highest scoring sequence, regardless ofwhether the alignment between the query and library se-quence is biologically meaningful or not. Accompanying theprevious version of FASTP was a program for the evaluationof statistical significance, RDF, which compares one se-quence with randomly permuted versions of the potentiallyrelated sequence.We have written a new version of RDF (RDF2) that has

several improvements. (i) RDF2 calculates three scores foreach shuffled sequence: one from the best single initial region(as found by FASTP), a second from the joined initial regions(used by FASTA), and a third from the optimized diagonal.(it) RDF2 can be used to evaluate amino acid or DNAsequences and allows the user to specify the scoring matrix tobe employed. Thus sequences found using the PAM250scoring matrix can be evaluated using the identity or geneticcode matrix. (iii) The user may specify either a global or localshuffle routine.

Locally biased amino acid or nucleotide composition isperhaps the most common reason for high similarity scoresof dubious biological significance (10). High scoring align-ments between query and library sequences may be due topatches of hydrophobic or charged amino acid residues or toA + T- or G + C-rich regions in DNA. A simple Monte Carloshuffle analysis that constructs random sequences by takingeach residue in one sequence and placing it randomly alongthe length of the new sequence will break up these patches ofbiased composition. As a result, the scores of the shuffledsequences may be much lower than those of the unshuffledsequence, and the sequences will appear to be related.Alternatively, shuffled sequences can be constructed bypermuting small blocks of 10 or 20 residues so that, while theorder of the sequence is destroyed, the local composition isnot. By shuffling the residues within short blocks along thesequence, patches of G + C- or A+ T-rich regions in DNA,for example, are undisturbed. Evaluating significance with alocal shuffle is more stringent than the global approach, andthere may be some circumstances in which both should beused in conjunction. Whereas two proteins that share acommon evolutionary ancestor may have clearly significantsimilarity scores using either shuffling strategy, proteinsrelated because of secondary structure or hydropathic pro-file may have similarity scores whose significance decreasesdramatically when the results of global and local shufflingare compared.

Implementation. The FASTA/LFASTA package of se-quence analysis tools is written in the C programming lan-guage and has been implemented under the Unix, VAX/VMS, and IBM PC DOS operating systems. Versions of theprogram that run on the IBM PC are limited to query se-

50

100.

\l

6F

1

I

I

Biochemistry: Pearson and Lipman

16

Dow

nloa

ded

by g

uest

on

Apr

il 8,

202

0

2446 Biochemistry: Pearson and Lipman

Table 1. FASTA and FASTP initial scores of the T-cell receptor(RWMSAV) versus the NBRF data base

Initial score

NBRF code Sequence FASTA FASTP

RWHUAV T-cell receptor a chain 155 98KlHURE Ig K chain V-I region 127 111KVMS50 Ig K chain V region 149 62KVMSM6 Ig K chain precursor V regions 141 64KVRB29 Ig K chain V region 126 54L3HUSH Ig A chain V-III region 90 47KVMS41 Ig K chain precursor V region 87 87RWMSBV T-cell receptorp-chain precursor 94 94RWHUVY T-cell receptor ,-chain precursor 91 59RWHUGV T-cell receptor y-chain precursor 87 61RWHUT4 T-cell surface glycoprotein T4 86 63RWMSVB T-cell receptor -chain precursor 71 41HVMS44 Ig heavy-chain V region 67 36GlHUDW Ig heavy-chain V-II region 62 35

The average FASTP score = 26.1 + 6.8 (mean + SD). Theaverage FASTA score = 26.2 ± 7.2 (mean ± SD). The mean andSD were computed excluding scores >54. V, Variable.

quences of 2000 residues; library sequences can be anylength. Copies of the program are available from the authors.Although FASTA and LFASTA were designed for protein

and DNA sequence comparison, they use a general methodthat can be applied to any alphabet with arbitrary match/mismatch scoring values. All the scoring parameters, includ-ing match/mismatch values, values for the first residue in agap and subsequent residues in the gap, and other parame-ters that control the number of sequences to be saved andthe histogram intervals, can be specified without changingthe program.

EXAMPLESComparison of FASTA with FASTP. To demonstrate the

superiority of the FASTA method for computing the initialscore, we compared the protein sequence of a T-cell receptora chain (NBRF code RWMSAV) with all sequences in theNBRF protein data baser and computed initial scores withboth the present and previous methods. The T-cell receptor isa member of the immunoglobulin superfamily; in Release 12.0of the data base, this superfamily has 203 members. FASTPplaced 160 immunoglobulin superfamily sequences in the 200top-scoring sequences; 57 related sequences received initialscores less than four standard deviations above the meanscore. FASTA placed 180 superfamily members in the 200top-scoring sequences; only 20 related sequences scoredbelow four standard deviations above the mean. Table 1 con-tains specific examples from this data base search. Althoughthere is often little difference in the two methods, this ex-ample shows that in a number of cases the new method ob-tains significantly higher scores between related sequences.

Nucleic Acid Data Base Search. FASTA can also be used tosearch DNA sequence data bases, either by comparing aDNA query sequence to the DNA library or by comparing anamino acid query sequence to the DNA library by translatingeach library DNA sequence in all six possible readingframes. We compared the 660-nucleotide rat transforminggrowth factor type a mRNA (GenBank locus RATTGFA)with all the mammalian sequences in Release 48 of Gen-Bank§. We set ktup = 4 (see Methods), and the search wascompleted in under 15 min on an IBM PCAT microcom-

Table 2. DNA data base search of rat transforming growth factor(RATTGFA) versus mammalian sequences

GenBank Scorelocus Sequence Initial Optimized

HUMTFGAM Human TGF mRNA 1336 1618HUMTGFA2 Human TGF gene (exon 2) 354 366HUMTGFA1 Human TGF gene (5' end) 224 381MUSRGEB3 Mouse 18S-5.8S-28S rRNA 140 107

geneMUSRGE52 Mouse 18S-5.8S-28S rRNA 140 107

geneMUSMHDD MHC class I H-2D 122 78HUMMETIF1 Metallothionein (MT)IF gene 116 92MUSRGLP 45S rRNA (5' end) 115 83HUMPS2 pS2 mRNA 105 106MUSClAI1 a-1 type I procollagen 86 89The 10 sequences having the highest initial scores are given. TGF,

transforming growth factor; MHC, major histocompatibility com-plex.

puter. The 10 top-scoring library sequences are shown inTable 2. Although it can be seen that the 3 top-scoringsequences are clearly related to RATTGFA, there are otherhigh-scoring sequences that are probably not related, and themouse epidermal growth factor, found in the translated database search (Table 3), is not found among the top-scoringsequences.To further examine the similarity detected between RAT-

TGFA and MUSRGEB3, a mouse rRNA gene cluster, weused the RDF2 program for Monte Carlo analysis of statis-tical significance (the window for local shuffling was set to 10bases). Of the 50 shuffled comparisons (data not shown), 1obtained an initial score greater than 140 (the observed initialscore), and 9 shuffled sequences obtained optimized scoresgreater than 107 (the observed optimized score). Therefore,the similarity between RATTGFA and MUSRGEB3 is un-likely to be significant.

Translated Nucleic Acid Data Base Search. When searchingfor sequences that encode proteins, amino acid sequencecomparisons are substantially more sensitive than DNA se-quence comparisons because one can use scoring matriceslike the PAM250 matrix that discriminate between conserva-tive and nonconservative substitutions. A variant of FASTA,TFASTA, can be used to compare a protein sequence to aDNA sequence library; it translates the DNA sequences intoeach of six possible reading frames "on-the-fly." TFASTAtranslates the DNA sequences from beginning to end; itincludes both intron and exon sequences in the translatedprotein sequence; termination codons are translated intounknown (X) amino acids. Table 3 shows the results of atranslating search of the mammalian sequences in the Gen-Bank DNA data base using the RATTGFA protein sequenceas the query and ktup = 1. In the translated search, the mouseepidermal growth factor now obtains an initial score higherthan any unrelated sequences; however, HUMTGFA1, whichwas found in the DNA data base search but only contains 13translated codons, is no longer among the top scoring se-quences.

Local Similarities. Fig. 2 displays the output of a localsimilarity analysis (ktup = 4) ofCHPHBA1M, a chimpanzeeal-globin mRNA, and RABHBAPT, a rabbit a-globin gene,including the complete coding sequence and a flankingpseudo-01-globin gene. LFASTA can either display a graphicmatrix style plot of the local homologies (Fig. 2A) or thealignments themselves (Fig. 2B). The right-most three align-ments (Fig. 2A) match the corresponding regions of themRNA to exon subsequences from the pseudogene. We notethat the FASTA initial score for the comparison of CHPH-

tProtein Identification Resource (1987) Protein Sequence Database(Natl. Biomed. Res. Found., Washington, DC), Release 12.§EMBL/GenBank Genetic Sequence Database (1987) (Intelligenet-ics, Mountain View, CA), Tape Release 48.

Proc. Natl. Acad. Sci. USA 85 (1988)

Dow

nloa

ded

by g

uest

on

Apr

il 8,

202

0

479hDISTRIBUTION OF EXTRACORPUSCULAR HEMOGLOBIN IN PLASMA

concentration and light absorption was linear (Fig-ure 5), indicating that the color intensity of thebenzidine oxidatively transformed was directlyproportional to the content of hemoglobin on

paper.

In order to compare the relationship betweenhemoglobin concentration and light absorption forfree and protein-bound hemoglobin, the followingstudy was made. Hemoglobin was added toplasma in vitro in increments of from 2 to 5 mg.

per cent over a concentration range of 80 to 240mg. per cent. Following incubation of plasma at

370 C. in order to promote protein-binding, elec-trophoresis, staining and photometric analysis were

carried out in the usual manner. Protein-bindingwas maximal in this experiment at a hemoglobinconcentration of 112 mg. per cent. At concentra-tions greater than 112 mg. per cent, free hemo-globin appeared in increasing concentrations. As-suming that the concentration of PBH remainedat 112 mg. per cent at all concentrations in excess

of this maximal binding capacity, the concentrationof free hemoglobin at these higher ( 112 to 240 mg.

per cent) concentrations was calculated as the dif-ference between the known total hemoglobin con-

centration and the concentration of PBH (112 mg.

per cent). Knowing, therefore, the actual con-

centrations of free and protein-bound hemoglobin,the light absorption of each fraction was measuredand compared at each concentration studied. Fig-ure 6 illustrates that at all concentrations studiedthe relationship between hemoglobin concentrationand light absorption of the oxidized benzidinebands was linear and was the same for free and

1,.2.

1.05

Q9-~~~~~~~~~~~~~~~~~~~

Q/

Q20.1

.1 .2 .3 4 .5S.6 .7 .8 .91.01 . 1.2FREE Hb-AREA (Cmn)BOUND Hb -AREA (cm2)

FIG. 6. RELATIONSHIP BETWEEN CONCENTRATION ANDLIGHT ABSORPTION (BENZIDINE STAIN) OF FREE HEMO-GLOBIN RELATIVE TO PROTEIN-BOUND HEMOGLOBINThe concentration of free hemoglobin relative to the

concentration of protein-bound hemoglobin is plottedagainst the respective ratios of the light absorption ofthese constituents (when stained with benzidine) at eachratio of concentrations studied. The straight line rela-tionship shown indicates that the relationship betweenconcentration and light absorption is the same for freeand protein-bound hemoglobin.

protein-bound hemoglobin. Calculations of theconcentrations of free and protein-bound hemo-globin as determined by the photometric tech-nique were, therefore, in close agreement withthe actual concentrations of these components(Table I).A similar study of methemalbumin revealed

that the relationship between concentration and

TABLE I

In vitro study of the recovery of free and protein-bound hemoglobin (Hb) as determined by photometric analysis *

Total Free Hb Free (calculated) Bound Hb Bound (calculated)Hb Actual Calculated Free (actual) Actual Calculated Bound (actual)

110 0 0 1.00 110 110 1.00112 0 0 1.00 112 112 1.00115 3 3 1.00 112 112 1.00120 8 14 1.75 112 106 0.95125 13 12 0.93 112 113 1.01130 18 12 0.67 112 118 1.05140 28 30 1.07 112 110 0.98160 48 44 0.92 112 116 1.04180 68 69 1.01 112 111 0.99200 88 90 1.02 112 110 0.98240 128 128 1.00 112 112 1.00

* Hemoglobin values are expressed as concentration in mg. per cent.

Dow

nloa

ded

by g

uest

on

Apr

il 8,

202

0

2448 Biochemistry: Pearson and Lipman

FIG. 3. Repeated structure in themyosin heavy chain. LFASTA was usedto compare the Caenorhabditis elegansmyosin heavy chain protein sequence(NBRF code MWKW) with itself usingthe PAM250 scoring matrix. The solid,dashed, and dotted lines denote decreas-ing similarity scores. The solid lines hadinitial region scores greater than 80 andoptimized local scores greater than 150;the longer dashed lines had initial regionand optimized local scores greater than65 and 120, respectively, and the shorterdashed lines had initial region and opti-mized local scores greater than 50 and100, respectively. Homologous regionswith lower scores are plotted with dots.

scores, must be carefully evaluated (1, 11). The Monte Carloanalysis of statistical significance provided by a programsuch as RDF2 can often be critical in evaluating a borderlinesimilarity. Previously we suggested ranges of z values [(ob-served score - mean of shuffled scores)/standard deviationof shuffled scores] corresponding to approximate signifi-cance levels. However the z values determined in a MonteCarlo analysis become less useful as the distribution ofshuffled scores diverges from a normal distribution, as isfound with FASTA. Therefore, we now focus on the highestscores of the shuffled sequences. For example, if in 50shuffled comparisons, several random scores are as high orhigher than the observed score, then the observed similarityis not a particularly unlikely event. One can have moreconfidence if in 200 shuffled comparisons, no random scoreapproaches the observed score. In general, our experiencehas led us to be conservative in evaluating an observedsimilarity in an unlikely biological context.These programs provide a group of sequence analysis

tools that use a consistent measure for scoring similarity andconstructing alignments. FASTA, RDF2, and LFASTA alluse the same scoring matrices and similar alignment algo-rithms, so that potentially related library sequences discov-

ered after the search of a sequence data base can beevaluated further from a variety of perspectives. In addition,LFASTA can also show alternative alignments betweensequences with periodic structures or duplications.

1. Lipman, D. J. & Pearson, W. R. (1985) Science 227, 1435-1441.

2. Dumas, J. P. & Ninio, J. (1982) Nucleic Acids Res. 10,197-206.

3. Wilbur, W. J. & Lipman, D. J. (1983) Proc. Natl. Acad. Sci.USA 80, 726-730.

4. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlasof Protein Sequence and Structure, ed. Dayhoff, M. (Natl.Biomed. Res. Found., Silver Spring, MD), Vol. 5, Suppl. 3,pp. 345-352.

5. Needleman, S. & Wunsch, C. (1970) J. Mol. Biol. 48, 444-453.6. Smith, T. & Waterman, M. S. (1981) J. Mol. Biol. 147,

195-197.7. Maizel, J. & Lenk, R. (1981) Proc. Natl. Acad. Sci. USA 78,

7665-7669.8. Goad, W. & Kanehisa, M. (1982) Nucleic Acids Res. 10,

247-263.9. Sellers, P. H. (1979) Proc. Natl. Acad. Sci. USA 76, 3041.

10. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman,M. S. (1984) Nucleic Acids Res. 12, 215-226.

11. Doolittle, R. (1981) Science 214, 149-159.

Proc. Natl. Acad. Sci. USA 85 (1988)

Dow

nloa

ded

by g

uest

on

Apr

il 8,

202

0

Improved tools for biological sequencecomparison · 2444 Thepublicationcostsofthis article...

Documents

Transcript of Improved tools for biological sequencecomparison · 2444 Thepublicationcostsofthis article...