PHAST: A Fast Phage Search Tool

6
PHAST: A Fast Phage Search Tool You Zhou 1 , Yongjie Liang 2 , Karlene H. Lynch 1 , Jonathan J. Dennis 1 and David S. Wishart 1,2,3, * 1 Department of Biological Sciences, 2 Department of Computing Science, University of Alberta and 3 National Research Council, National Institute for Nanotechnology (NINT), Edmonton, AB, Canada T6G 2E8 Received February 28, 2011; Revised April 21, 2011; Accepted May 26, 2011 ABSTRACT PHAge Search Tool (PHAST) is a web server de- signed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘corner- stone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and Genbank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish be- tween intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage compo- nents in both circular and linear genomic views. PHAST is available at (http://phast.wishartlab.com). INTRODUCTION Bacteriophage, the viruses that infect bacteria, can typic- ally be divided into two groups, lytic and temperate. Lytic phage infect propagate within and then lyse their host bacterial cells as part of their life cycle, while temperate phages may exist benignly within the DNA of their bac- terial host. Temperate phages can physically integrate into one of the native replicons (plasmid or chromosome) of their preferred bacterial host, although a few phages can exist as independent plasmids (1). Integrated phages are termed prophages. Prophages tend to be inserted at specific integration sites within the host genome, but their location can vary depending on the phage species. Prophages are essentially dormant phages that are only replicated through bacterial DNA replication and cell division. Bacteria containing a prophage are called lysogens because their prophage is in the lysogenic cycle, in which the viral (esp. the lytic) genes are not expressed. Upon damage to the host cell DNA or other physiological cues, the prophage may be induced to excise itself from the bacterial genome. After induction, the phage’s lytic genes are turned on, infectious virions are assembled within the host cell and the cell is lysed (killed) releasing infectious phage particles that can go on to infect more cells. Bacterial genomes can contain a significant proportion (>20%) of functional and non-functional bacteriophage genes (1). Consequently, prophage sequences can account for a significant fraction of the variation within bacterial species or clades. The presence of prophage sequences may also allow some bacteria to acquire antibiotic resistance, to exist in new environmental niches, to improve adhesion or to become pathogenic (1). Because bacterial genome fragments can also be carried by phage particles, the lytic process is thought to be an important vehicle for horizontal gene transfer. Furthermore, because of their high specificity and high potency, phages are also being investigated as potential candidates for novel antibiotics (2) or even cancer therapies (3). As the most abundant biological entity on Earth (1), phages are also thought to play a crucial role in cycling nutrients and boosting photosynthesis in the world’s oceans (4). However, not all prophages or prophage-like entities are functional. Indeed, many prophages are dormant due to mutational decay or the loss of critical genes over thousands of host generations. Defective or cryptic prophages are abundant in many bacterial genomes and they can carry a number of genes that may be beneficial to the host. These can include genes encoding proteins with homologous recombination functions or bacteriocins that may be used to inhibit the growth of other bacteria in competition for nutrients. There are generally two methods to identify prophages: (i) experimental and (ii) computational. The experimental approach involves inducing the host bacteria to release phage particles by exposing them to UV light or other DNA-damaging conditions. This approach can certainly prove the existence of viable phages, but will not reveal defective prophages (1). In addition, not all viable phages *To whom correspondence should be addressed. Tel: +780 492 0383; Fax:+780 492 5305; Email: [email protected] Published online 14 June 2011 Nucleic Acids Research, 2011, Vol. 39, Web Server issue W347–W352 doi:10.1093/nar/gkr485 ß The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246 by guest on 13 February 2018

Transcript of PHAST: A Fast Phage Search Tool

Page 1: PHAST: A Fast Phage Search Tool

PHAST: A Fast Phage Search ToolYou Zhou1, Yongjie Liang2, Karlene H. Lynch1, Jonathan J. Dennis1 and

David S. Wishart1,2,3,*

1Department of Biological Sciences, 2Department of Computing Science, University of Alberta and 3NationalResearch Council, National Institute for Nanotechnology (NINT), Edmonton, AB, Canada T6G 2E8

Received February 28, 2011; Revised April 21, 2011; Accepted May 26, 2011

ABSTRACT

PHAge Search Tool (PHAST) is a web server de-signed to rapidly and accurately identify, annotateand graphically display prophage sequences withinbacterial genomes or plasmids. It accepts either rawDNA sequence data or partially annotated GenBankformatted data and rapidly performs a number ofdatabase comparisons as well as phage ‘corner-stone’ feature identification steps to locate, annotateand display prophage sequences and prophagefeatures. Relative to other prophage identificationtools, PHAST is up to 40 times faster and up to15% more sensitive. It is also able to process andannotate both raw DNA sequence data and Genbankfiles, provide richly annotated tables on prophagefeatures and prophage ‘quality’ and distinguish be-tween intact and incomplete prophage. PHAST alsogenerates downloadable, high quality, interactivegraphics that display all identified prophage compo-nents in both circular and linear genomic views.PHAST is available at (http://phast.wishartlab.com).

INTRODUCTION

Bacteriophage, the viruses that infect bacteria, can typic-ally be divided into two groups, lytic and temperate. Lyticphage infect propagate within and then lyse their hostbacterial cells as part of their life cycle, while temperatephages may exist benignly within the DNA of their bac-terial host. Temperate phages can physically integrate intoone of the native replicons (plasmid or chromosome) oftheir preferred bacterial host, although a few phages canexist as independent plasmids (1). Integrated phages aretermed prophages. Prophages tend to be inserted at specificintegration sites within the host genome, but their locationcan vary depending on the phage species. Prophages areessentially dormant phages that are only replicatedthrough bacterial DNA replication and cell division.Bacteria containing a prophage are called lysogens because

their prophage is in the lysogenic cycle, in which the viral(esp. the lytic) genes are not expressed. Upon damage tothe host cell DNA or other physiological cues, theprophage may be induced to excise itself from the bacterialgenome. After induction, the phage’s lytic genes areturned on, infectious virions are assembled within thehost cell and the cell is lysed (killed) releasing infectiousphage particles that can go on to infect more cells.Bacterial genomes can contain a significant proportion

(>20%) of functional and non-functional bacteriophagegenes (1). Consequently, prophage sequences can accountfor a significant fraction of the variation within bacterialspecies or clades. The presence of prophage sequences mayalso allow some bacteria to acquire antibiotic resistance,to exist in new environmental niches, to improve adhesionor to become pathogenic (1). Because bacterial genomefragments can also be carried by phage particles, thelytic process is thought to be an important vehicle forhorizontal gene transfer. Furthermore, because of theirhigh specificity and high potency, phages are also beinginvestigated as potential candidates for novel antibiotics(2) or even cancer therapies (3). As the most abundantbiological entity on Earth (1), phages are also thoughtto play a crucial role in cycling nutrients and boostingphotosynthesis in the world’s oceans (4). However, notall prophages or prophage-like entities are functional.Indeed, many prophages are dormant due to mutationaldecay or the loss of critical genes over thousands of hostgenerations. Defective or cryptic prophages are abundantin many bacterial genomes and they can carry a number ofgenes that may be beneficial to the host. These can includegenes encoding proteins with homologous recombinationfunctions or bacteriocins that may be used to inhibit thegrowth of other bacteria in competition for nutrients.There are generally two methods to identify prophages:

(i) experimental and (ii) computational. The experimentalapproach involves inducing the host bacteria to releasephage particles by exposing them to UV light or otherDNA-damaging conditions. This approach can certainlyprove the existence of viable phages, but will not revealdefective prophages (1). In addition, not all viable phages

*To whom correspondence should be addressed. Tel: +780 492 0383; Fax: +780 492 5305; Email: [email protected]

Published online 14 June 2011 Nucleic Acids Research, 2011, Vol. 39, Web Server issue W347–W352doi:10.1093/nar/gkr485

� The Author(s) 2011. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246by gueston 13 February 2018

Page 2: PHAST: A Fast Phage Search Tool

can be induced under the same conditions and the requiredconditions often are not known a priori. Given the easewith which bacterial genomes can now be sequenced, thecomputational identification of prophages from genomicsequence data has become the most preferred route. Earlysequence-based efforts often depended upon manualinspection of disrupted genes and attachment sites (5)or the analysis of atypical nucleotide content (6,7).However, prophage regions do not always exhibit atypicalnucleotide content (8). Likewise, phages neither alwaysintegrate into the same coding regions nor do they exclu-sively use tRNAs as the target site for integration.Consequently, this makes scanning for atypical gene con-tent or simple searches for disrupted genes or tRNAs un-reliable for finding prophage regions. More recentmethods relying on a much more holistic or integratedapproach have appeared. These combine sequence com-parisons to known phage or prophage genes, comparisonsto known bacterial genes, tRNA and dinucleotide analysisand hidden Markov scanning for attachment site recogni-tion. These combined methods are now available in anumber of excellent programs and web servers such asPhage_Finder (9), Prophinder (10) and Prophage Finder(11). These tools have helped revolutionize findingprophages in bacterial genomes.Except for Prophage Finder, these phage finding pro-

grams and web servers still require that the input genomesequence must be well annotated with all open readingframes (ORFs) and/or tRNA sites pre-identified. This an-notation process is not only time consuming, it is alsohighly dependent on the choice of the annotationprograms or methods. Furthermore, the choice andaccuracy of the genome annotation method can signifi-cantly affect the accuracy of the phage predictions (videinfra). In addition, even with a fully annotated bacterialgenome, all other phage-finding methods require 30min to2 h to complete their analyses. Now that it is possible tosequence an entire bacterial genome in less than a day,prophage identification needs to be faster, more accurateand much less dependent on the availability of fullyannotated bacterial genomes. To address these issues, wehave developed a web-based application named PHAST (afast PHAge Search Tool), to support rapid and accurateprophage identification using either raw or annotated bac-terial genome sequence data. The main features of PHASTinclude:

. Prophage region identification support for both rawnucleotide sequence input (using GLIMMER gene pre-diction and local genome annotation tools) as well asannotated GenBank file input;

. Support for detailed prophage annotation includingposition, length, boundaries, number of genes, attach-ment sites, tRNAs, identified phage-like genes andattachment sites (att);

. A customized phage and prophage database that isautomatically updated on a biweekly basis;

. Support for the prediction of the completeness orpotential viability of identified prophages (intact,questionable or incomplete);

. Extremely fast processing times (about 3min for atypical bacterial genome);

. Graphical output that supports both circular andlinear genomic views as well as interactive browsingand labeling of dynamically generated figures;

. Fully downloadable text and graphics; and

. Support for scriptable operations through anapplication programming interface (API).

PHAST’s prophage finding performance is generallysuperior to other applications. When given an annotatedgenome, it achieves 85.4% sensitivity and 94.2% positivepredictive value (PPV) when evaluated against the collec-tion of prophages referenced by Prophinder. When given araw sequence file, PHAST achieves 79.4% sensitivity and86.5% PPV using the same evaluation set. This is about10% more accurate than existing phage finding tools.PHAST is freely available at (http://phast.wishartlab.com).

MATERIALS AND METHODS

PHAST is an integrated search and annotation tool thatcombines genome-scale ORF prediction and translation(via GLIMMER), protein identification (via BLASTmatching and annotation by homology), phage sequenceidentification (via BLAST matching to a phage-specificsequence database), tRNA identification, attachment siterecognition and gene clustering density measurementsusing density-based spatial clustering of applicationswith noise (DBSCAN) (17) and sequence annotation textmining. In addition to these basic operations, PHAST alsoevaluates the completeness of the putative prophage, tabu-lates data on the phage or phage-like features and rendersthe data into several colorful graphs and charts. Detailsabout the databases, algorithms and implementation aregiven below.

Creation of custom prophage and bacterial sequencedatabases

PHAST’s prophage sequence database consists of a customcollection of phage and prophage protein sequences fromtwo sources. One is the National Center for BiotechnologyInformation (NCBI) phage database that includes 46 407proteins from 598 phage genomes. The other source isfrom the prophage database (12), which consists of159 prophage regions and 9061 proteins not found inthe NCBI phage database. Since many of the prophageproteins in the prophage database are actually bacterialproteins and some have only been identified computation-ally, we only selected those prophage proteins that havebeen associated with a clear phage function. This setincludes a total of 379 phage protease, integrase and struc-tural proteins. This PHAST phage library is used toidentify putative phage proteins in the query genome viaBLASTP (13) searches.

In addition to a custom, self-updating phage sequencelibrary, PHAST also maintains a bacterial sequence libraryconsisting of 1300 non-redundant bacterial genomes/proteomes from all major eubacterial and archaebacterial

W348 Nucleic Acids Research, 2011, Vol. 39, Web Server issue

Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246by gueston 13 February 2018

Page 3: PHAST: A Fast Phage Search Tool

phyla. This bacterial sequence library contains more thanfour million annotated or partially annotated protein se-quences. Relative to the full GenBank protein sequencelibrary (100+ million sequences), this bacterial-specificlibrary is 25� smaller. This means that PHAST’sgenome annotation step (see below) can be accomplished25� faster.

Genome annotation and comparison

PHAST accepts both raw DNA sequence and GenBankannotated genomes. If given a raw genomic sequence(FASTA format), PHAST identifies all ORFs usingGLIMMER 3.02 (14). This ORF identification step takesabout 45 s for an average bacterial genome of 5.0Mb. Thetranslated ORFs are then rapidly annotated via BLASTusing PHAST’s non-redundant bacterial protein library(�2–3min/genome). Because tRNA and tmRNA sitesprovide valuable information for identifying the attach-ment sites, they are calculated using the programstRNAscan-SE (15) and ARAGORN (16). If an input(GenBank formatted) file is provided with completeprotein and tRNA information, these steps are skipped.Phage or phage-like proteins are then identified by per-forming a BLAST search against PHAST’s local phage/prophage sequence database along with specific keywordssearches to facilitate further refinement and identification.Matched phage or phage-like sequences with BLASTe-values less than 10�4 are saved as hits and their positionstracked for subsequent evaluation for local phage densityby DBSCAN (17).

Identification of prophage regions and prediction oftheir completeness

Prophages can be considered as clusters of phage-likegenes within a bacterial genome. The primary challenge(after phage-like genes have been identified) is to deter-mine if these genes are sufficiently well clustered or prox-imal to each other to be considered prophage candidates.Although there are a few reported clustering methods foridentifying phage gene clusters (9–11), we found the gen-eral DBSCAN algorithm performs just as well, likely be-cause the identification of clusters of prophage genes is nota particularly difficult task. DBSCAN takes two param-eters: the cluster size n and a distance e. The parametern defines the minimal number of phage-like genes requiredto form a prophage cluster and e is the maximal spatialdistance between two neighbor genes within the samecluster. In our case, the spatial distance between twogenes is just the number of nucleotides between them. Inother words, n can be considered as the minimal prophagesize and e is the protein density within the prophage region.Empirically, we set n to be 6, since prophages generallyhave more than five proteins. The value of e was set to3000 based on assessments from a small number of identi-fied prophages in ProphageDB (12). We found that usinga moderately different e-value will generally not changethe prediction sensitivity. If PHAST’s input file is an anno-tated GenBank file, an additional text scan is performed toidentify prophages that may not have been found by clus-tering. This secondary (moving window) scan looks for

specific phage-related keywords in the GenBank proteinname field of the input file, such as ‘protease’, ‘integrase’and ‘tail fiber’. If 6 or more proteins associated with thesekeywords are found within a window of 60 proteins, theregion is considered as a putative prophage region even ifan insufficient number of phage-like genes were found byDBSCAN within this region. Finally, if the identifiedprophage contains an integrase, potential phage attach-ment sites (one for each integrase in tandem prophages)are then identified by scanning the region for short nucleo-tide repeats (12–80 bases) (18).After all prophage regions have been detected, a com-

pleteness score is assigned to each identified prophage.Three potential scenarios are considered: (i) the regiononly contains genes/proteins of a known phage;(ii) >50% of the genes/proteins in the region are relatedto a known phage and (iii) <50% of the genes/proteins inthe region are related to a known phage. In scenario (i),the region automatically has a completeness score of 150(the maximum). In scenario (ii) and (iii), the region’s com-pleteness score is calculated as the sum of the scores cor-responding to the region’s size and number of genes. If it isfound that the region is related to a known phage, bothscores are calculated using the size and number of matchedgenes of the related phage, otherwise they are calculatedusing the average size (30 kb) and average number of genes(40) of typical phages. The total score in scenario (iii) alsocounts the number of identified ‘cornerstone’ genes aswell as the density of phage-like genes in the region.‘Cornerstone genes’ are genes encoding proteins involvedin phage structure, DNA regulation, insertion and lysis(1). Table 1 shows the details of PHAST’s completenessscore calculation. A prophage region is considered to beincomplete if its completeness score is less than 60, ques-tionable if the score is between 60 and 90, and intact if thescore is above 90.

Program and web server characteristics

PHAST’s search, annotation and DBSCAN clusteringsoftware were written using a combination of C and Java.PHAST’s web interface was implemented using a standardCGI framework. PHAST’s interactive Google-Map stylegraphics were built using Adobe’s Flash Builder. PHASTalso supports remote scripting using a URL API (this isdescribed under PHAST’s ‘Instructions’ link) and it main-tains a large, hyperlinked database of pre-computed bac-terial genomes for rapid prophage identification amongknown/well-studied genomes (see PHAST’s ‘Databases’link). A screenshot montage of PHAST’s output is givenin Figure 1. The web application is platform independentand has been tested successfully on Internet Explorer 8.0,Mozilla Firefox 3.0 and Safari 4.0. However, in order toview the Flash output the user must have Adobe FlashPlayer installed. This is freely available at http://www.adobe.com/products/flashplayer. For the most up todate instructions of how to use the server please read theonline help page at http://phast.wishartlab.com/how_to_use.html.

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W349

Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246by gueston 13 February 2018

Page 4: PHAST: A Fast Phage Search Tool

Performance evaluation

In order to compare PHAST’s performance with otherprograms, we used a collection of hand-annotated pro-phages from 54 bacterial genomes (1,10) as our ‘goldstandard’ reference control. PHAST was evaluated usingboth GenBank annotated sequences (i.e. bacterial genomeswith manually or semi-automatically annotated genomes)as well as raw DNA sequence files. The performance wasmeasured using both sensitivity [TP/(TP+FN)] andpositive predictive value or PPV [TP/(TP+FP)]. Usingthis 54 genome data set PHAST achieved, a sensitivity(Sn) of 85.4% and a PPV of 94.2% when evaluatedusing GenBank annotated files. When using raw DNAsequence and its own ORF finding and genome

annotation tools, PHAST achieved a sensitivity of79.4% and a PPV of 86.5% for the same 54 genomes.PHAST’s performance using the same annotatedGenBank data was superior to Prophinder (Sn 77.5%,PPV 93.6%), Prophage Finder (Sn 92.1%, PPV 52.1%)and Phage_Finder (Sn 68.5%, PPV 94.3%). PHAST’s per-formance using raw DNA sequence data does not quitematch that of the pre-annotated data, but its combinedsensitivity/positive predictive value is still comparable toProphinder and superior to both Prophage Finder andPhage_Finder. Detailed comparisons for both theGenBank and raw sequence inputs for all 54 genomescan be found in PHAST’s documentation page (http://phast.wishartlab.com/documentation.html). PHAST’simproved performance does not necessarily indicate

Figure 1. A screenshot montage of some of PHAST’s different graphical and tabular views including its linear and circular genome renderings aswell PHAST’s corresponding prophage annotation.

Table 1. PHAST’s phage completeness score calculation

Scenario A Scenario B Scenario C

Number of nucleotides (# bases) – (# bases in the region/# bases in the related phage)� 100

+10 if # bases >30 kb

Number of genes (# genes) – (# genes in the region/# genes in the related phage)� 100

+10 if # genes >40

Cornerstone genesa – – +10 for each cornerstone genePhage-like genes – – +10 if occupies 70% or more of the regionFinal scores 150 Sum of above Sum of above

The completeness score is calculated for three different scenarios: (A) the region contains all genes of a known phage. (B) >50% of the genes in theregion are related to a known phage. (C) <50% of the genes in the region are related to a known phage.aCornerstone genes are identified key phage structural genes (using keywords such as ‘capsid’, ‘head’, ‘plate’, ‘tail’, ‘coat’, ‘portal’ and ‘holin’) andphage DNA regulation genes (such as ‘integrase’, ‘transposase’ and ‘terminase’) and phage function genes (such as ‘lysin’ and ‘bacteriocin’).

W350 Nucleic Acids Research, 2011, Vol. 39, Web Server issue

Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246by gueston 13 February 2018

Page 5: PHAST: A Fast Phage Search Tool

Prophinder, Prophage Finder or Phage_Finder’s phagefinding algorithms are inferior to PHAST’s algorithm.Rather, some of the performance gain appears to be dueto PHAST’s implementation of a newer, larger phagesequence library and perhaps a better exploitation ofkeyword annotations.

A further challenge with evaluating any kind of pro-phage identification software is that there is no ‘absolute’or ‘gold’ standard. Careful manual annotation by phageexperts is certainly a high standard, but it is more thanlikely that some prophages in the 54 evaluation genomeswere not identified, having decayed or mutated too muchfor them to appear in the Casjens reference list (1). Inother words, some of the false positive predictions mayin fact be true positives. Indeed, through manual inspec-tion of PHAST’s results we found a number of ‘dense’positive BLAST hits to phage proteins in several genomes,but these were not labeled as prophages in the Casjensreference list. Instead of ‘false positives’, we believe thatthey should be considered as prophage-related regionsthat have not been previously reported in the literature.

In addition to evaluating PHAST’s prophage identifica-tion performance, we also evaluated its speed. Given thatPHAST accepts two kinds of file input (raw FASTA DNAsequence and GenBank formatted files), we assessed itsperformance for both kinds of input files. When givenraw genomic sequence, PHAST must run GLIMMER aswell as several gene/protein identification programs. Usingthe raw Escherichia coli O157:H7 genome sequence only(GenBank accession NC_002655), PHAST completed itsprophage identification in just over 4min. When tested onthe same input file, Prophage Finder returned results after20min. However, it is important to note that ProphageFinder does not annotate bacterial genes, its output is

very ‘crude’ and its combined Sn/PPV score is significantlyworse than PHAST’s (Table 2). Using the GenBankannotated E. coli O157:H7 file, PHAST completed itsprophage identification in 140 s. Using the same annotatedNC_002655 file for the Prophinder (10) web server took33min, while using a local copy of Phage_Finder (9)running on a 2.1GHz Pentium PC with 12Gb RAM,the same file took 93min. These data suggest thatPHAST is between 5 and 40 times faster than existingprophage finding programs. A more complete featureand performance comparison between PHAST and otherexisting prophage finding tools is given in Table 2.

Limitations

PHAST is not without some limitations. First, like allother database-driven annotation systems, PHAST obvi-ously performs poorly at identifying novel phages, whosegenes/proteins are not closely related to any record in thePHAST database. In this regard, the appearance of largenumbers of proximal proteins with unknown functioncould be a good indication of a novel phage. Second,the DBSCAN algorithm used by PHAST assumes aneven density of phage-like hits in every prophage genomicsequence, which is not generally true in practice.Consequently, a highly uneven distribution of phage-likegenes could potentially fool the DBSCAN algorithm.Finally, PHAST will occasionally ‘split’ larger prophagesinto a number of smaller prophages due to a paucity ofBLAST hits.

CONCLUSIONS

PHAST represents a new generation of fast prophage iden-tification and phage annotation tools that produces

Table 2. Features and performance of PHAST relative to other prophage identification tools

PHAST PROPHINDER PROPHAGE FINDER PHAGE_FINDER

Execution time (E. coli O157:H7 5.5Mb) 140 s 1980±90 s N/A 5547 sb

Execution time (E. coli O157:H7 sequence alone) 240 s N/A 1800±90 s N/AAccepts raw DNA file Yes No Yes NoAccepts Genbank file Yes Yes No YesSensitivity (%) 85.4 77.5 N/A 68.5b

Sensitivity (%) (sequence alone) 79.4 N/A 92.1a N/APPV (%) 94.2 93.6 N/A 94.3b

PPV (sequence alone) (%) 86.5 N/A 52.1a N/ADownloadable images Yes No No NoPhage completeness labeling Yes No No NoCircular genome view Yes No No NoZoomable graphics Yes No No NoHighlights key phage proteins Yes No No NoScriptable operation Yes No No YesAttachment site prediction Yes No No YesOutput type Tables+graphics Tables+graphics Text only Text onlyOutput readability Good Good Poor Poor

The performance of all web servers and programs was evaluated using a reference set of 54 manually annotated genomes (1,10). Annotations for all54 genome inputs can also be found at Prophinder’s web page. Phage_Finder was tested locally (strict mode) using HMMER 2.3 for its HMMsearch. Prophage Finder’s (11) web server was tested using its default parameters. Detailed results can be found on the PHAST website.aProphage Finder failed to return results for all input files. The numbers reported here are for just 46 of the 54 genomes.bPhage_Finder can be run under four different modes, this table reports the results using the strict mode with HMMER 2.3. Using other parametersone obtains: Sn=88.4%, PPV=23.7% (HMMER 2.3, non-strict mode), Sn=90.3%, PPV=23.9% (HMMER 3.0, non-strict mode) andSn=0.0%, PPV=0.0% (HMMER 3.0, strict mode). When HMMER 3.0 is used Phage_Finder is significantly faster (898 s for E. coliO157:H7), but significantly less accruate.

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W351

Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246by gueston 13 February 2018

Page 6: PHAST: A Fast Phage Search Tool

accurate results in minutes using only raw or lightly anno-tated genome sequence data. PHAST also produces exten-sive text summaries, downloadable figures, circular andlinear genome views as well as colorful, zoomable, user-interactive graphics. As phage and prophage databasescontinue to expand [with >100 million different phagegenomes still to be sequenced (19)], we believe thatPHAST’s integrated comparative approach to phagefinding will only lead to continued improvements in itssensitivity and specificity.

ACKNOWLEDGEMENTS

The authors wish to thank the referees for their helpfulsuggestions. Y.Z. wrote the phage clustering, the phageidentification scripts and the FLASH viewer. Y.L.prepared the phage and bacterial sequence databases,built the web interface and implemented various externalprograms (BLAST, GLIMMER, CGVIEW). K.H.L.co-designed the methods and tested the server. D.S.W.and J.J.D. conceived the ideas, server requirements andgeneral methodology. All authors contributed to thewriting of the manuscript and all have seen andapproved of its content.

FUNDING

The authors wish to thank the Canadian Institutes ofHealth Research (CIHR) and Genome Alberta (adivision of Genome Canada) for financial support.Funding for open access charge: Canadian Institutes ofHealth Research.

Conflict of interest statement. None declared.

REFERENCES

1. Casjens,S. (2003) Prophages and bacterial genomics: what havewe learned so far? Mol. Microbiol., 49, 277–300.

2. Coates,A.R. and Hu,Y. (2007) Novel approaches to developingnew antibiotics for bacterial infections. Br. J. Pharmacol., 152,1147–1154.

3. Bar,H., Yacoby,I. and Benhar,I. (2008) Killing cancer cells bytargeted drug-carrying phage nanomedicines. BMC Biotechnol., 8,37.

4. Sullivan,M.B., Waterbury,J.B. and Chisholm,S.W. (2003)Cyanophages infecting the oceanic cyanobacteriumProchlorococcus. Nature, 424, 1047–1051.

5. Fouts,D.E. (2004) Bacteriophage bioinformatics. In Fraser,C.M.,Read,T.D. and Nelson,K.E. (eds), Microbial Genomes. HumanaPress Inc., Totowa, NJ, pp. 71–91.

6. Nicolas,P., Bize,L., Muri,F., Hoebeke,M., Rodolphe,F.,Ehrlich,S.D., Prum,B. and Bessieres,P. (2002) Mining Bacillussubtilis chromosome heterogeneities using hidden Markov models.Nucleic Acids Res., 30, 1418–1426.

7. Srividhya,K.V., Alaguraj,V., Poornima,G., Kumar,D., Singh,G.P.,Raghavenderan,L., Katta,A.V., Mehta,P. and Krishnaswamy,S.(2007) Identification of prophages in bacterial genomes bydinucleotide relative abundance difference. PLoS One., 2, e1193.

8. Nelson,K.E., Weinel,C., Paulsen,I.T., Dodson,R.J., Hilbert,H.,Martins dos Santos,V.A., Fouts,D.E., Gill,S.R., Pop,M.,Holmes,M. et al. (2002) Complete genome sequence andcomparative analysis of the metabolically versatile Pseudomonasputida KT2440. Environ. Microbiol., 4, 799–808.

9. Fouts,D.E. (2006) Phage_Finder: automated identification andclassification of prophage regions in complete bacterial genomesequences. Nucleic Acids Res., 34, 5839–5851.

10. Lima-Mendez,G., Van Helden,J., Toussaint,A. and Leplae,R.(2008) Prophinder: a computational tool for prophage predictionin prokaryotic genomes. Bioinformatics, 24, 863–865.

11. Bose,M. and Barber,R.D. (2006) Prophage Finder: a prophageloci prediction tool for prokaryotic genome sequences.In Silico Biol., 6, 223–227.

12. Srividhya,K.V., Rao,G.V., Raghavenderan,L., Mehta,P.,Prilusky,J., Sankarnarayanan,M., Sussman,J.L. andKrishnasswamy,S. (2006) Database and comparative identificationof prophages. Lec. Notes Control Informat. Sci., 344, 863–868.

13. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,Miller,W. and Lipman,D.J. (1997) Gapped BLAST andPSI-BLAST: a new generation of protein database searchprograms. Nucleic Acids Res., 25, 3389–3402.

14. Salzberg,S.L., Delcher,A.L., Kasif,S. and White,O. (1998)Microbial gene identification using interpolated Markov models.Nucleic Acids Res., 26, 544–548.

15. Lowe,T.M. and Eddy,S.R. (1997) tRNAscan-SE: a program forimproved detection of transfer RNA genes in genomic sequence.Nucleic Acids Res., 25, 955–964.

16. Laslett,D. and Canback,B. (2004) ARAGORN, a program todetect tRNA genes and tmRNA genes in nucleotide sequences.Nucleic Acids Res., 32, 11–16.

17. Ester,M., Kriegel,H.P., Sander,J. and Xu,X. (1996) Adensity-based algorithm for discovering clusters in large spatialdatabases with noise. KDD-1996 Proceedings. AAAI Press, MenloPark, CA, pp. 226–231.

18. Williams,K.P. (2001) Integration sites for genetic elements inprokaryotic tRNA and tmRNA genes: sublocation prefeence ofintegrase subfamilies. Nucleic Acids Res., 30, 866–875.

19. Rohwer,F. (2003) Global phage diversity. Cell, 113, 141.

W352 Nucleic Acids Research, 2011, Vol. 39, Web Server issue

Downloaded from https://academic.oup.com/nar/article-abstract/39/suppl_2/W347/2507246by gueston 13 February 2018