
Effective BLASTing – TIPS & TECHNIQUES

Hypothesis - 26

TIPS & TECHNIQUES

An Introduction to Effective BLASTing

Paul C. Boutros

Department of Medical Biophysics, University of Toronto

Correspondence: [email protected]

Abstract

Sequence-alignments in general and BLAST searches in particular have become a ubiquitous part of molecular biology. Despite its popularity, the vast array of BLAST tools and parameter choices can overwhelm the user. Yet accepting the default parameters can greatly reduce search sensitivity and accuracy. This review focuses on the major parameters for BLASTN and BLASTP searches, and discusses both their default values and how they can be tweaked to enhance query results.

Introduction

Computational biology can be defined as the use of quantitative, mathematical models to study biological questions (1). This covers a broad spectrum of research questions, ranging from phylogenetic studies (2) to the discovery of new genes (3) and splice-variants (4), and from the prediction of transcription-factor binding-sites (5) to the integration of large transcriptomic and genomic datasets (6).

With such a diverse set of questions, it might be surprising that there is a common “battery” of computational and mathematical techniques used in their solutions. In a way, this is akin to standard molecular techniques like PCR or Western Blotting, which are broadly applied to answer many distinct research questions from a molecular perspective.

This “toolbox” of computational techniques includes pattern-recognition techniques like clustering (7), sequence-modeling procedures like Hidden Markov Models (8), and a wide range of statistical and mathematical procedures (9-11). Perhaps the most ubiquitous computational technique, however, is sequence-alignment. The most common sequence-alignment program, NCBI BLAST, is used tens of thousands of times each day (12).

This review aims to introduce the reader to the many parameters available for tuning and effectively using NCBI BLAST. Following a brief overview of BLAST, the two canonical forms of BLAST are introduced. The parameters for each form are detailed, and recommendations on parameter selection are given.

The Problem of Sequence Alignments and the BLAST Solution

Sequence-alignments are a core element of the computational biology tool-box and are extensively used to study the primary structure of proteins and nucleic acids. Fundamentally, a sequence-alignment is a way of comparing sequences to one another. Thus, sequence-alignments can find use whenever sequences are being studied. Typical uses of sequence-alignments include the identification of cDNA clones in a library (13), the discovery of splice-variants in large sequence-databases (14), functional characterization of uncharacterized genes (15), and evolutionary studies of specific proteins or genes (16).

Regardless of the application, sequence-alignments are a way of determining “how similar” sequences are to one another. Regions that are similar can be overlapped, or “aligned”. The graphical display of this alignment gives the technique its name. Sequences that are very similar will show “strong” alignments, meaning that few mismatches or gaps exist. Algorithms exist to align pairs of sequences (pair-wise sequence alignment) as well as to align larger numbers of sequences (multiple sequence alignment). This review focuses exclusively on pair-wise sequence alignments.



Because of its similarity to classical computer science problems, pair-wise sequence-alignment has been extensively studied. An optimal algorithm exists to align two sequences to one another based on a computational technique called dynamic programming. Unfortunately these optimal alignments, often called Smith/Waterman alignments, are extremely slow. Even on very fast computers, comprehensive database searches using optimal Smith/Waterman alignments can run prohibitively slowly (17).

This is where BLAST comes in. The basic local alignment search tool is based on a statistical approximation used to speed up Smith/Waterman alignments. By assuming that the best local alignment will contain a small, exact match (Figure 1), the execution time of large database searches can be dramatically reduced (17).

[Figure 1 (image not reproduced). Panels: 1. Find best matches in a database; 2. Look for short exact matches; 3. Align sequences with short exact matches. The probe sequence GGCAT is compared with database sequences including GGCAT, GGCTT, GGGTG and TGAAT; the alignments shown score 5, 4, 2 and 2 respectively.]

Figure 1: Overview of the BLAST algorithm. The most common use of BLAST is to find the best matches to a probe sequence in a large database (step 1). For example, if a novel cDNA clone is isolated from a library, it can be identified by using BLASTN against a transcriptomic database like dbEST. The first step in a BLAST search is to identify those sequences that have short, exact matches with the probe sequence (step 2). In this case, the longest exact match is underlined for each sequence in the database. Only those sequences with exact matches of at least 3 base pairs are carried on for full alignment (step 3). In this case two weak alignments (score = 2) would not be detected by the BLAST algorithm.
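The seed-and-extend idea in Figure 1 can be sketched in a few lines of Python. This is a toy illustration and not the NCBI implementation: the function names are invented here, the word length of 3 is assumed from the figure, and the "extension" simply counts identical bases along the seeded diagonal, whereas BLAST itself scores with a matrix and stops extending once the score falls off.

```python
def seed_positions(query, subject, k):
    """Return (query_pos, subject_pos) pairs where an exact k-base word is shared."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return [(i, j)
            for j in range(len(subject) - k + 1)
            for i in index.get(subject[j:j + k], [])]


def diagonal_identities(query, subject, i, j):
    """Count identical bases along the ungapped diagonal passing through (i, j)."""
    shift = min(i, j)  # slide back to where the diagonal starts
    return sum(a == b for a, b in zip(query[i - shift:], subject[j - shift:]))


def seed_and_extend(query, database, k):
    """Score only those database sequences that share an exact word with the query."""
    hits = {}
    for subject in database:
        seeds = seed_positions(query, subject, k)
        if seeds:  # sequences without a seed are never aligned
            hits[subject] = max(diagonal_identities(query, subject, i, j)
                                for i, j in seeds)
    return hits


# Sequences taken from Figure 1; the word length of 3 is an assumption.
print(seed_and_extend("GGCAT", ["GGCAT", "GGCTT", "GGGTG", "TGAAT"], k=3))
# -> {'GGCAT': 5, 'GGCTT': 4}
```

With these inputs GGGTG and TGAAT share no 3-base word with the probe, so they are skipped without ever being aligned, which is the source of both the speed gain and the occasional missed hit.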

Further, based on a statistical advance by Karlin and Altschul (18), the original BLAST software was able to provide estimates of statistical significance. In other words, it was able to say “how likely was this alignment to happen by chance”. This is given by the “E-value” for a BLAST alignment. The E-value estimates how many times a similarity this strong would occur by chance alone in a search of this database.
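The E-value can be made concrete with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the alignment score, m the query length and n the total database length. The sketch below is illustrative only: the function name is invented, the values of K and lambda depend on the scoring system and are placeholders chosen for this example, and the length corrections that BLAST applies are omitted.

```python
import math


def expected_hits(score, query_len, db_len, lam=1.37, k=0.71):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values (assumptions); real values are
    reported by BLAST for the scoring system actually in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Higher scores are exponentially less likely to arise by chance,
# while a larger database raises E in direct proportion.
for score in (12, 20, 30):
    e = expected_hits(score, query_len=100, db_len=1_000_000_000)
    print(f"score {score}: E = {e:.3g}")
```

Because E grows linearly with database size, the same alignment becomes less significant when it is found in a larger database, which is one reason restricting the database or organism sharpens results.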

Different Flavours of BLAST

The original BLAST program (17) was limited to comparing protein or DNA sequences against larger databases. Over time, however, many specialized versions of BLAST have been developed, including PSI-BLAST (12) and Blast-2-Sequences (19). At the time of writing, the main BLAST webpage at NCBI provided no fewer than 25 different “flavours” of the BLAST algorithm.

Despite this variety, two major versions of BLAST remain the most widely used. First, BLASTN is used to compare a nucleotide sequence against a database of nucleotide sequences. Second, BLASTP is used to compare a protein sequence against a database of protein sequences. The following two sections review the common uses and the major parameters of each of those programs.

BLASTN: Comparing Nucleotide Sequences

Program Overview

Nucleotide alignments are occasionally used in phylogenetic studies, but the three main uses of pair-wise nucleotide alignments are sequence identification, primer design, and genomic mapping.

Sequence identification is often the end result of functional screens or expression array experiments that identify one or more sequences associated with a given phenotype. A BLASTN search is then used to identify the gene or transcript corresponding to the experimentally identified sequence (13).

In designing primers for PCR studies it is critical that the primers have minimal cross-reactivity and only anneal to a single template. BLASTN searches are used with the candidate primer sequences to identify potential cross-hybridization problems (20, 21).

Genomic mapping is necessary both for characterizing the results of some types of high-throughput polymorphism studies (22) and for interpreting the results of ChIP-Chip experiments (23).

    GCAGCGG----AGCGGGTTGA  Seq1
    || ||||....||||||||||
    GCTGCGCCGTGAGCGGGTTGA  Seq2
    ++-++++----++++++++++  Score

(Figure labels: Mismatch, Gap)

Figure 2: Overview of the BLAST scoring system. The core of a sequence-alignment algorithm is the scoring system, which answers the question “what is a good match?”. Each pair of aligned residues is given a score: matches receive positive scores and mismatches (usually) receive negative ones. The scores can be calculated in a number of ways, both probabilistically and heuristically, and are typically stored in a “scoring matrix”. Gaps between the two sequences receive negative scores, typically with a large penalty for the presence of a gap and a smaller penalty for each additional residue in the gap. The total score for the alignment is obtained by adding all the positive (match) and negative (mismatch and gap) scores. The overall score can then be compared to a distribution to determine statistical significance.

a) Filtered Alignment

    Query: 941  attcaatacaaacaatctcttaaattgggttcatgatgcagtctcctctttnnnnnnnnn 1000
                |||||||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct: 1139 attcaatacaaacaatctcttaaattgggttcatgatgcagtctcctctttaaaacaaaa 1198

    Query: 1001 nnnnnnnnnnnnnnnntatacttgaacaaaagggtcagaggacctgtatttaagcaaata 1060
                                ||||||||||||||||||||||||||||||||||||||||||||
    Sbjct: 1199 caaaacaaaacaaaactatacttgaacaaaagggtcagaggacctgtatttaagcaaata 1258

b) Unfiltered Alignment

    Query: 941  attcaatacaaacaatctcttaaattgggttcatgatgcagtctcctctttaaaacaaaa 1000
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct: 1139 attcaatacaaacaatctcttaaattgggttcatgatgcagtctcctctttaaaacaaaa 1198

    Query: 1001 caaaacaaaacaaaactatacttgaacaaaagggtcagaggacctgtatttaagcaaata 1060
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct: 1199 caaaacaaaacaaaactatacttgaacaaaagggtcagaggacctgtatttaagcaaata 1258

Figure 3: The Perils of Filtering. A portion of the BLASTN alignment between two isoforms of the Mxi1 gene (RefSeq mRNA accessions: NM_130439 and NM_005962). The alignment was repeated twice, once with filtering (a) and once without (b). A repetitive AC-rich region of 25 nucleotides (underlined) has been filtered out (a), but nevertheless provides an exact match with the alternate isoform (b). In cases like these, filtering can obscure true alignments and can be removed.

For all these applications, standard BLASTN is the best tool to use. The other major nucleotide BLAST tool offered by NCBI is megablast, which is mainly used in assembly and searching of genomic trace sequences.

Major Parameters

Database: There are three major nucleotide databases that can be searched: dbEST, RefSeq, and Genomic databases. Expressed Sequence Tags (ESTs) are generated by single-pass sequencing of mRNAs. While this single-pass sequencing may introduce errors, EST databases are extremely large and thus provide a way of looking at the transcriptome of many species whose genome has not yet been sequenced: at writing, there were 416 species with at least 1000 EST sequences available. The RefSeq databases are also derived from mRNA sequences, but are manually curated by NCBI workers to ensure that they accurately represent a single gene. Genomic databases present fragments of genomes from whole chromosomes, as well as trace-files and partial contigs from the genome-assembly process.
Default Value: non-redundant database (nr)
Rationale: nr contains portions of all three major nucleotide databases, making it useful for almost any BLASTN search.
Tuning Suggestion: Pick the database most suitable for the search in question. For transcriptomic searches in well-characterized genes use a RefSeq database. For transcriptomic searches in poorly-characterized genes or species use an EST database. These specialized searches will both find some matches not present in the nr database and avoid spurious matches found by mixing genomic and mRNA sequences.

Organism: The option exists to limit any BLAST search to a specific species or family.
Default Value: Search all organisms
Rationale: Searching all organisms maximizes the number of hits found.
Tuning Suggestion: It is almost always appropriate to specify a single organism or group of organisms. This reduces the number of low-significance or uninformative hits returned. Further, it can dramatically reduce BLAST execution time. Species and families are identified by their Latin names, such as Rodentia (rodents), Mammalia (mammals), Homo sapiens (humans), Mus musculus (mouse) and Rattus norvegicus (rat).

Expect: The expect parameter is like a p-value threshold: it sets the least significant hit that BLAST will return. The number indicates the number of times this hit could have occurred by chance in searching the database. Longer matches will inherently be less likely to occur by chance, and thus have lower expectation.
Default Value: 10
Rationale: An expectation of 10 is capable of detecting most long matches as well as some short, inexact matches.
Tuning Suggestion: Many BLAST searches return hundreds of hits. It can be helpful to reduce the expect to 0.001 to remove lower-quality hits. This can somewhat speed up BLAST execution. When searching with short sequences such as PCR primers it is occasionally necessary to increase the expectation to find inexact hits. For example, identifying an exact 12 bp alignment can require an expect as high as 25.

Word Size: The first step of the BLAST algorithm is to find short exact matches between the search sequence and each sequence in the database. Complete alignments are then performed only on sequences from the database that contain a short exact match (Figure 1). This greatly speeds up BLAST execution, but will occasionally miss some hits, especially with shorter query sequences. The length of the exact match required is called the “word size”. A word-size of 1 is essentially identical to a comprehensive (but slow) Smith-Waterman alignment.
Default Value: 11
Rationale: A word-size of 11 allows searches to execute reasonably rapidly, but will clearly miss some relevant hits (17, 24).
Tuning Suggestion: The word-size should always be reduced to the lowest possible value (7 for nucleotide BLASTs) to maximize sensitivity. This selection is particularly critical for short probe sequences.
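The effect of word size on short probes can be illustrated with a small, hypothetical example. The primer and target below are made up for this sketch, and `shares_word` is an invented helper that only tests whether a seed exists; it does not perform the alignment itself. The target carries a single mismatch in the middle of the primer site, so the longest exact match is 10 bases.

```python
def shares_word(query, subject, word_size):
    """True if query and subject share at least one exact word of the given size."""
    words = {query[i:i + word_size] for i in range(len(query) - word_size + 1)}
    return any(subject[j:j + word_size] in words
               for j in range(len(subject) - word_size + 1))


# Hypothetical 20-base primer and a target whose binding site differs at one
# central position (C -> T), flanked by unrelated sequence.
primer = "ACGTTGCAAGCTTGACCTGA"
target = "TTTT" + primer[:10] + "T" + primer[11:] + "GGGG"

for word_size in (11, 7):
    found = shares_word(primer, target, word_size)
    print(f"word size {word_size}: seed found = {found}")
```

With a word size of 11 no seed exists, so this 19-of-20 match would never be aligned; with a word size of 7 the site is seeded and can be extended.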



BLASTP: Comparing Protein Sequences

Program Overview

The uses of protein alignments are quite different from those of nucleotide alignments. Alignments are rarely used to identify proteins; a major exception is in large-scale mass-spectrometry experiments. Protein alignments are much more frequently used in phylogenetic studies than nucleotide alignments (25). In addition, functional analyses are commonly used to characterize protein sequences.

Functional analyses involve searching a protein sequence for either conserved domains (26) or for homology to proteins of known function. For example, if the function of a human protein is unknown, but it has strong homology to a murine protein of known function, then a hypothesis about the function can be made. This general approach has been used extensively in recent years to allow cross-species prediction of protein-complexes and protein-protein interactions (27).

NCBI offers five distinct types of BLAST-based protein-protein sequence alignment tools. Both PHI-BLAST and PSI-BLAST involve “profiles” characterizing a family of proteins and are used in some phylogenetic studies (28). Both rpsblast and cdart are specialized to identify conserved domains and functional motifs. Despite these options, for most protein-alignments, BLASTP remains the most appropriate choice.

Major Parameters

Database: There are two major classes of databases available for protein searches. The RefSeq database is, again, a highly curated database of protein sequences. The PDB database contains all protein sequences whose 3D structure has been solved and is available.
Default: As with nucleotide searches, the default is the highly inclusive non-redundant (nr) database.
Rationale: To encompass all known proteins
Tuning Suggestion: If searching only for well characterized proteins consider restricting the search to the RefSeq database. For most other applications nr is appropriate for protein searches.

Do CD Search: The BLASTP program gives the option of simultaneously searching the sequence for conserved domains (CDs) or functional motifs.
Default: Yes
Rationale: To provide as much data as possible
Tuning Suggestion: Leave the CD search enabled. The search only adds a marginal performance penalty, and indeed the CD search usually returns its results well before the BLASTP search, thus giving the user something to start interpreting immediately.

Species: As with BLASTN, the option exists to specify which species should be considered. Only sequences from the specified species will be aligned with the probe sequence.
Default: All species
Rationale: Maximize sensitivity
Tuning Suggestion: As with BLASTN (see above) choosing a specific species can greatly improve execution speed and remove spurious hits, leaving a much more easily interpreted result.

Expect: As with BLASTN searches (see above) the expectation value serves as a threshold. Any hits less significant than this expectation value will not be returned by the program.
Default Value: 10
Rationale: This number is something of a compromise between long query sequences (for which it returns many poor matches) and short query sequences (for which it may remove some informative matches).
Tuning Suggestion: As with BLASTN searches it is often absolutely necessary to increase the Expect to identify short matches. For longer matches, it can be helpful to reduce the expect to return a more manageable number of hits, but this is not critical.

Word Size: As with BLASTN searches, the word-size reflects the initial filtering size in a BLASTP search (see Figure 1).
Default: 3
Rationale: A compromise between execution time and search sensitivity
Tuning Suggestion: As with nucleotide alignments it is always beneficial to reduce the word-size to the smallest value possible (2 for BLASTP searches). The increase in execution time is usually compensated by specifying a species, and the additional true hits returned can be of great biological importance.

Matrix & Gap Costs: These are the core elements of the “scoring system” in a protein alignment. Recall that the goal of a local sequence-alignment algorithm is to compare two sequences and identify similar regions. One key to solving this problem is defining “what makes two sequences similar”. This definition of sequence-similarity is embedded in the “scoring system”, and has two major parts: substitution-scores and gap penalties (Figure 2). When two residues are aligned together a score is assigned based on how similar they are believed to be. Exact matches and very conservative mismatches are given positive scores, while other mismatches receive negative scores. The magnitude of the score is a reflection of how conservative or radical a change might be, and the full set of scores is stored in a table called a “scoring matrix” or a “substitution matrix” (29).

In some cases a residue in one sequence has no matching residue on the other sequence (Figure 2). This is called a gap; gaps receive negative scores to penalize this lack of similarity between the two sequences. The penalties assigned to gaps are typically “affine”. This means that a gap is penalized twice: once for existing and once based on its length. The penalty for gap existence is usually larger than the gap extension penalty, reflecting the idea that the insertion or deletion that leads to a gap could easily involve multiple residues (30).
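The affine scheme described above can be sketched as a function that scores an existing gapped alignment. This is a simplified illustration with assumed values: the function name and the toy sequences are invented, a flat match/mismatch score stands in for a substitution matrix, and the penalty values are placeholders and not a statement of any program's defaults.

```python
def score_alignment(seq1, seq2, match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Score two aligned strings ('-' marks a gap) with affine gap penalties.

    A gap of length L costs gap_open + L * gap_extend. For simplicity,
    adjacent gap columns are treated as one gap even if they switch strands.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            if not in_gap:
                score -= gap_open   # charged once per gap
            score -= gap_extend     # charged for every gapped column
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Toy alignment: 8 matches, 1 mismatch and one gap of length 2.
print(score_alignment("GATTACA--GC", "GACTACATTGC"))  # -> -4

# One gap of length 2 is penalized less than two separate gaps of length 1.
print(score_alignment("AC--GT", "ACTTGT"), score_alignment("A-C-GT", "ATCTGT"))
# -> -5 -10
```

The second comparison shows why affine penalties favour a single longer insertion or deletion over several scattered ones.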

The selection of a scoring system is critical in any sequence-alignment. For example, careful selection of the substitution parameters has proved invaluable when working with membrane proteins that have unusual amino-acid compositions (31).

While DNA-based scoring systems are fairly simple (24, 32), protein-based systems can be highly complex. Substitution matrices can be based on an estimated evolutionary distance, and a matrix optimized for identifying very similar proteins may not work well for identifying weak similarities. Common matrices include the PAM and BLOSUM series. Empirically some matrices, e.g. BLOSUM62, appear to be better for weaker alignments, while the PAM series (PAM30 and PAM70) are thought to be superior for shorter query sequences. Similarly, the penalties assigned to the opening and extension of a gap can be adjusted. Smaller penalties can be used to detect weaker similarities that may have diverged through the insertion or deletion of significant regions.
Default: BLASTP defaults to using the BLOSUM62 matrix with a large penalty for opening a gap and a small one for extending it.
Rationale: Most BLASTP alignment searches involve longer sequences, and BLOSUM62 is optimal for these searches.
Tuning Suggestions: For many cases the BLOSUM62 matrix is appropriate. For very short sequences, the PAM30 matrix may be more sensitive. Similarly, gap penalties should normally be increased for longer or more closely related sequences, as this reflects the reduced likelihood of an insertion.
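The parameters reviewed for BLASTN and BLASTP also have counterparts in the stand-alone NCBI BLAST+ command-line programs, which this review does not cover. The sketch below only assembles argument lists and runs nothing; `build_blast_command` and the file names are hypothetical, the parameter values are examples, and the allowed ranges and database names (for instance `nt` for the nucleotide collection) should be checked against the installed BLAST+ version.

```python
import shlex


def build_blast_command(program, query, db, evalue=10.0, word_size=None,
                        organism=None, extra=()):
    """Return an argument list for a BLAST+ program (nothing is executed)."""
    cmd = [program, "-query", query, "-db", db, "-evalue", str(evalue)]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    if organism is not None:
        # -entrez_query applies to searches sent to NCBI with -remote
        cmd += ["-remote", "-entrez_query", f"{organism}[Organism]"]
    cmd += list(extra)
    return cmd


# Short primer search: small word size, raised expect, filtering off.
blastn_cmd = build_blast_command(
    "blastn", "primer.fa", "nt", evalue=25, word_size=7,
    organism="Homo sapiens", extra=["-task", "blastn", "-dust", "no"])

# Short peptide search: PAM30 with matching gap costs, word size 2.
blastp_cmd = build_blast_command(
    "blastp", "peptide.fa", "nr", evalue=10, word_size=2,
    organism="Mus musculus",
    extra=["-matrix", "PAM30", "-gapopen", "9", "-gapextend", "1", "-seg", "no"])

print(shlex.join(blastn_cmd))
print(shlex.join(blastp_cmd))
```

Keeping the settings in one function call also makes it easy to record exactly which parameters were used for a given search.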

An Aside: Filtering

One other option available for all BLAST programs is the use of a filter. This filter prevents low complexity regions from driving the overall alignment. For example, in proteins, acidic- or proline-rich regions would be removed from consideration, while for DNA, poly-A regions and highly repetitive sequences are masked out. Unfortunately this filtering can occasionally remove interesting regions (Figure 3). It is very difficult to identify those rare cases where filtering is harmful, but if a BLAST query returns absolutely no hits it is possible that filtering is the culprit.
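The masking behaviour seen in Figure 3 can be imitated with a toy low-complexity filter. This is not the algorithm BLAST uses (its filters are the DUST and SEG programs); it is a simplified stand-in based on Shannon entropy in a sliding window, and the function names, window length and threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the base composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=8, min_entropy=1.0):
    """Replace every base in a low-entropy window with 'n'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "n"
    return "".join(masked)


# Made-up sequence: a poly-A run flanked by more complex sequence.
sequence = "gatcgtac" + "a" * 12 + "ctgatcga"
print(mask_low_complexity(sequence))
```

As in Figure 3, the masked bases can no longer contribute to an alignment, and a few bases on either side of the repeat are lost along with it.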

Summary

The original BLAST algorithm became popular because of its speed advantages, free availability on NCBI servers, and improved statistical estimations. Just as important, the algorithm has been extensively studied and improved over the years. Statistical estimation has been improved (24, 30, 33) and new types of searches have been introduced (12, 19, 28, 34). This continued development has extended the scope of BLAST, and is continuing with enhanced integration between BLAST results and genomic annotation and resources. BLAST can be expected to remain a critical tool for solving computational biologists' problems well into the future.



The parameters discussed here are generally applicable beyond just BLASTP and BLASTN and extend to most flavours of BLAST. By careful tuning, BLAST query results can be greatly improved, and this critical tool can be used more effectively.

Other Sources of Information

Sequence-alignment is a critical part of computational biology. This review focused only on pair-wise alignment. There are several good reviews of multiple alignments, including (25). A recent text by Durbin et al. gives an excellent theoretical and mathematical introduction to the field of sequence-alignments beyond BLAST searches (35).

Acknowledgments

The author thanks the two anonymous reviewers for helpful suggestions.

Top 3 Tips for Effective BLASTing

Minimize word-size
Always use the smallest word-size possible, as larger values may miss real, biologically relevant hits.

Specify a Species and Database
The default options search most sequences for every species. By specifying these options, the number of uninformative hits drops dramatically, as does execution time. This is particularly important for nucleotide searches, where the default nr database includes both genomic and transcriptomic sequences.

Record & Repeat
Trying to repeat a BLAST search can be a frustrating experience. The continual addition of sequences to public databases can result in new hits arising and make older results harder to find. It is helpful to record relevant accession numbers, the BLAST parameters you used, and the date on which you performed your search. Repeating a query in the future may identify novel hits that were not previously available in sequence databases.

References:

1. D. Noble, Nat Rev Mol Cell Biol 3, 459 (2002).
2. S. L. Baldauf, Trends Genet 19, 345 (2003).
3. A. Siepel, D. Haussler, J Comput Biol 11, 413 (2004).
4. G. Yeo et al., Genome Biol 5, R74 (2004).
5. W. W. Wasserman, A. Sandelin, Nat Rev Genet 5, 276 (2004).
6. C. H. Kim et al., Proteomics 3, 2454 (2003).
7. R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification (Wiley, New York, 2nd ed., 2001).
8. S. R. Eddy, Curr Opin Struct Biol 6, 361 (1996).
9. C. Workman et al., Genome Biol 3, research0048 (2002).
10. R. Jansen et al., Science 302, 449 (2003).
11. G. Didier et al., Bioinformatics 18, 490 (2002).
12. S. F. Altschul et al., Nucleic Acids Res 25, 3389 (1997).
13. R. G. Halgren et al., Nucleic Acids Res 29, 582 (2001).
14. T. P. Larsson et al., FEBS Lett 579, 690 (2005).
15. S. Khan et al., Bioinformatics 19, 2484 (2003).
16. J. W. Thornton, E. Need, D. Crews, Science 301, 1714 (2003).
17. S. F. Altschul et al., J Mol Biol 215, 403 (1990).
18. S. Karlin, S. F. Altschul, Proc Natl Acad Sci U S A 87, 2264 (1990).
19. T. A. Tatusova, T. L. Madden, FEMS Microbiol Lett 174, 247 (1999).
20. P. C. Boutros, A. B. Okey, Bioinformatics 20, 2399 (2004).
21. M. Lexa, J. Horak, B. Brzobohaty, Bioinformatics 17, 192 (2001).
22. R. Sachidanandam et al., Nature 409, 928 (2001).
23. L. E. Heisler et al., Nucleic Acids Res 33, 2952 (2005).
24. D. J. States, W. Gish, S. F. Altschul, METHODS: A Companion to Methods in Enzymology 3, 66 (1991).
25. A. Phillips, D. Janies, W. Wheeler, Mol Phylogenet Evol 16, 317 (2000).
26. A. Marchler-Bauer et al., Nucleic Acids Res 31, 383 (2003).
27. K. R. Brown, I. Jurisica, Bioinformatics 21, 2076 (2005).
28. D. T. Jones, M. B. Swindells, Trends Biochem Sci 27, 161 (2002).
29. M. R. Gribskov, J. Devereux, Sequence analysis primer, UWBC biotechnical resource series (Stockton Press; Macmillan Publishers, New York; Basingstoke, Hants, England, 1991).
30. S. F. Altschul, W. Gish, Methods Enzymol 266, 460 (1996).
31. T. Muller, S. Rahmann, M. Rehmsmeier, Bioinformatics 17 Suppl 1, S182 (2001).
32. F. Chiaromonte, V. B. Yap, W. Miller, Pac Symp Biocomput, 115 (2002).
33. S. F. Altschul, R. Bundschuh, R. Olsen, T. Hwa, Nucleic Acids Res 29, 351 (2001).
34. A. A. Schaffer et al., Nucleic Acids Res 29, 2994 (2001).
35. R. Durbin, Biological sequence analysis: probabilistic models of proteins and nucleic acids (Cambridge University Press, Cambridge, UK; New York, 1998).

Photo by Nick Shah
