antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai...

8
General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Users may download and print one copy of any publication from the public portal for the purpose of private study or research. You may not further distribute the material or use it for any profit-making activity or commercial gain You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Downloaded from orbit.dtu.dk on: Mar 30, 2020 antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline Blin, Kai; Shaw, Simon; Steinke, Katharina; Villebro, Rasmus; Ziemert, Nadine; Lee, Sang Yup; Medema, Marnix H; Weber, Tilmann Published in: Nucleic acids research Link to article, DOI: 10.1093/nar/gkz310 Publication date: 2019 Document Version Publisher's PDF, also known as Version of record Link back to DTU Orbit Citation (APA): Blin, K., Shaw, S., Steinke, K., Villebro, R., Ziemert, N., Lee, S. Y., ... Weber, T. (2019). antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic acids research, 87(W2), W81-W87. https://doi.org/10.1093/nar/gkz310

Transcript of antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai...

Page 1: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors andor other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights

Users may download and print one copy of any publication from the public portal for the purpose of private study or research

You may not further distribute the material or use it for any profit-making activity or commercial gain

You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details and we will remove access to the work immediately and investigate your claim

Downloaded from orbitdtudk on Mar 30 2020

antiSMASH 50 updates to the secondary metabolite genome mining pipeline

Blin Kai Shaw Simon Steinke Katharina Villebro Rasmus Ziemert Nadine Lee Sang Yup MedemaMarnix H Weber TilmannPublished inNucleic acids research

Link to article DOI101093nargkz310

Publication date2019

Document VersionPublishers PDF also known as Version of record

Link back to DTU Orbit

Citation (APA)Blin K Shaw S Steinke K Villebro R Ziemert N Lee S Y Weber T (2019) antiSMASH 50 updatesto the secondary metabolite genome mining pipeline Nucleic acids research 87(W2) W81-W87httpsdoiorg101093nargkz310

Published online 29 April 2019 Nucleic Acids Research 2019 Vol 47 Web Server issue W81ndashW87doi 101093nargkz310

antiSMASH 50 updates to the secondary metabolitegenome mining pipelineKai Blin1 Simon Shaw1 Katharina Steinke2 Rasmus Villebro1 Nadine Ziemert2 SangYup Lee13 Marnix H Medema 4 and Tilmann Weber 1

1The Novo Nordisk Foundation Center for Biosustainability Technical University of Denmark Kemitorvet bygning 2202800 Kgs Lyngby Denmark 2German Centre for Infection Research (DZIF) Interfaculty Institute of Microbiology andInfection Medicine Auf der Morgenstelle 28 University of Tubingen 72076 Tubingen DE Germany 3Department ofChemical and Biomolecular Engineering (BK21 Plus Program) and BioInformatics Research Center Korea AdvancedInstitute of Science and Technology 291 Daehak-ro Yuseong-gu Daejeon 34141 South Korea and 4BioinformaticsGroup Wageningen University Droevendaalsesteeg 1 6708PB Wageningen the Netherlands

Received February 07 2019 Revised April 02 2019 Editorial Decision April 16 2019 Accepted April 17 2019

ABSTRACT

Secondary metabolites produced by bacteria andfungi are an important source of antimicro-bials and other bioactive compounds In recentyears genome mining has seen broad applica-tions in identifying and characterizing new com-pounds as well as in metabolic engineeringSince 2011 the lsquoantibiotics and secondary metabo-lite analysis shellndashndashantiSMASHrsquo (httpsantismashsecondarymetabolitesorg) has assisted researchersin this both as a web server and a standalone toolIt has established itself as the most widely usedtool for identifying and analysing biosynthetic geneclusters (BGCs) in bacterial and fungal genome se-quences Here we present an entirely redesignedand extended version 5 of antiSMASH antiSMASH5 adds detection rules for clusters encoding thebiosynthesis of acyl-amino acids -lactones fungalRiPPs RaS-RiPPs polybrominated diphenyl ethersC-nucleosides PPY-like ketones and lipolanthinesFor type II polyketide synthase-encoding gene clus-ters antiSMASH 5 now offers more detailed predic-tions The HTML output visualization has been re-designed to improve the navigation and visual repre-sentation of annotations We have again improvedthe runtime of analysis steps making it possibleto deliver comprehensive annotations for bacterialgenomes within a few minutes A new output file inthe standard JavaScript object notation (JSON) for-mat is aimed at downstream tools that process anti-SMASH results programmatically

INTRODUCTION

Bacterial and fungal natural products constitute a keysource of scaffolds for the development of antimicrobialsand other drugs (1) and mediate ecological interactions be-tween organisms in various ways (2)

Mining genomic data for the presence of biosyntheticpathways that enable organisms to produce such moleculeswhich are also referred to as secondary or specializedmetabolites have become an essential approach that com-plements activity- and chemistry-guided isolation and iden-tification approaches (3) Several computational tools suchas CLUSEAN (4) or PRISM (5) have been developed tosupport scientists with this task The lsquoantibiotics and sec-ondary metabolites analysis shellrsquo antiSMASH is a pio-neer amongst these tools Initially released in 2011 (6) ithas since been further extended and improved (7ndash12) andis currently used by thousands of academic and industrialscientists worldwide to identify so called secondary metabo-lite lsquobiosynthetic gene clustersrsquo (BGCs) in their genomes ofinterest In 2017 a database component was added to theantiSMASH framework which provides instant access tothousands of pre-computed antiSMASH genome miningresults of publicly available genomes (1314) Furthermoreseveral independent tools such as the mass-spectrometryguided peptide mining tool Pep2Path (15) the lsquoAntibioticResistance Target Seekerrsquo ARTS (16) the sgRNA designtool CRISPy-web (17) a reverse-tailoring tool to match fin-ished NRPSPKS structures to antiSMASH-predicted corestructures (18) and the BGC clustering and classificationplatform BiG-SCAPE (19) were developed that directly in-teract with and interpret results generated by antiSMASHand provide information that is outside the scope of a core-antiSMASH analysis

Here we present version 5 of antiSMASH which con-tains many improvements In addition to many features

To whom correspondence should be addressed Tel +45 24896132 Email tiwebiosustaindtudkCorrespondence may also be addressed to Marnix H Medema Tel +31 317484706 Email marnixmedemawurnl

Ccopy The Author(s) 2019 Published by Oxford University Press on behalf of Nucleic Acids ResearchThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W82 Nucleic Acids Research 2019 Vol 47 Web Server issue

visible to the end users such as extended and improvedBGC detection and analysis capabilities and a mod-ernized and improved User Interface (see below) anti-SMASH version 5 was completely rewritten in Pythonversion 3 and the code was restructured to increaseperformance reliability and ease of maintenance Thishas led to a significant speed increase of the pipelineA complete list of antiSMASH 5 features is includedin the antiSMASH documentation httpsdocsantismashsecondarymetabolitesorgantiSMASH5features

NEW FEATURES AND UPDATES

New gene cluster classes and refinement of cluster detectionrules

The most widely used and recommended mode to detectBGCs in genomic data is via manually curated and val-idated gene cluster rules These are based on identifyingco-occurring conserved core enzymes in the genome usingHMM-profiles that were derived from Pfam (20) SMART(21) BAGEL (22) or Yadav et al (23) or that were cre-ated specifically for antiSMASH While antiSMASH ver-sion 4 supported the rule-based detection of 44 differ-ent biosynthetic types antiSMASH 5 now includes rulesfor 52 different BGC types In version 5 new rules wereadded to detect BGCs encoding the biosynthesis of N-acylamino acids (24) -lactones (25) polybrominated diphenylethers (26) C-nucleosides (27) pseudopyronines (28) fun-gal RiPPs (29ndash31) and RaS-RiPPs (3233) Furthermore anew lsquonrps-likersquo rule was defined for NRPS-fragments ieatypical NRPSs that donrsquot have the typical C-A-T modulearchitecture The previous lsquootherksrsquo rule was split into tworules to individually assign heterocyst glycolipid synthase-like clusters and other atypical PKSs In addition somerules were improved based on user case reports The rulesdescribing lanthipeptides and trans-AT type I PKS were re-fined to reduce the number of false positive hybrid calls onother cluster types For trans-AT- type I PKS and type IIPKS we increased the size of the cluster cutoffs to capturepreviously missed tailoring enzymes in published clustersThe rule for linear azoleazoline-containing peptides wasmade more generic to better cover the range of describedclusters

The rule describing microcin clusters was removed as mi-crocins are a class of RiPPs defined via their productionin Enterobacteriaceae and are already captured by one ofour other specific RiPP cluster rules depending on their re-spective biosynthesis pathway (eg microcin J25-like RiPPswere previously covered by the old microcin cluster rules butchemically are lasso peptides while microcin B17 is a linearazol(in)e-containing peptide)

Improved type II PKS prediction

Bacterial type II PKS BGCs code for the biosynthesis ofaromatic polyketides such as the antibiotic tetracycline orthe anti-tumour drug doxorubicin From the beginning an-tiSMASH has had rules that were able to detect type IIPKS BGCs by checking for the presence of the KSα andKSβCLF component of the minimal PKS However no

detailed prediction methods had been added since anti-SMASHrsquos first version In antiSMASH 5 we introduce anew PKS II analysis module (12) which uses a collection ofmanually curated HMMs to predict potential starter unitsthe number of elongation cycles (and thus a rough esti-mation of the putative molecular weight of the core com-pound) cyclization patterns and some conserved type IIPKS specific tailoring reactions This module is automat-ically triggered whenever a type II PKS BGC is detected

Annotation of resistance genes via Resfams

The Resfams database (34) is a curated database of proteinfamilies with confirmed antibiotic resistance function an-tiSMASH 5 uses the profile Hidden Markov Models (pH-MMs) from Resfams to annotate potential resistance genesfound in predicted gene regions Potential resistance gene-hits are displayed in the lsquogene detailsrsquo panel along with otherfunctional annotations

GO-term annotations

The Gene Ontology (GO) is a controlled vocabulary for de-scribing biological processes molecular functions and cel-lular components in a consistent way to enable comparisonof these between different species Amongst its wide rangeof uses the GO has been used to predict gene clusters ineukaryotes and bacteria (35) and in conjunction with anti-SMASH to refine cluster boundaries in antiSMASH out-put for Aspergillus species (36)

To facilitate these and other GO-based analyses anti-SMASH 5 includes an option to automatically annotateGO terms on Pfam domains This functionality makes useof the fact that GO terms may be linked not only to spe-cific gene products but also to other means of classifica-tion in so-called lsquomappingsrsquo (httpgeneontologyorgpagedownload-mappings) As antiSMASH can automaticallyannotate Pfam domains the GO annotation functionalitymakes use of the Pfam to GO mapping supplied by theGene Ontology Consortiumrsquos website (37) If the ID of apredicted Pfam domain in an antiSMASH record is presentin the Pfam to GO mapping the respective GO terms areassigned and presented in the lsquogene detailsrsquo panel

Link to the antiSMASH database

antiSMASH provides options to search for similar geneclusters in public datasets As already implemented in previ-ous versions of the software the KnownClusterBlast func-tionality searches each identified region against the man-ually curated MIBiG (38) repository The KnownClus-terBlast and ClusterBlast search functions use an algo-rithm first described in antiSMASH 1 (6) which also isin use in a generalized version in MultiGeneBlast (39)In the previous versions of antiSMASH the ClusterBlastdatabase was generated by scripts that used the antiSMASHBGC detection logic on sequences downloaded from theNCBI GenbankRefSeq databases As version 2 of theantiSMASH database now also contains BGCs of draftgenomes (14) starting with antiSMASH 5 the ClusterBlastdatabases will be directly generated from the new anti-SMASH database and complemented with individual BGC

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W83

records that were submitted to NCBI outside of whole-genome submissions This provides several advantages Theabundance of entries for selected generaspecies in the pub-lic databases (and thus also in the previous ClusterBlastdatabase) is strongly skewed towards clinically or industri-ally relevant organisms There are for example more than15 000 assemblies for Escherichia coli deposited at NCBIFor the antiSMASH database a sequence-based dereplica-tion workflow was established (14) that reduced the num-ber of redundant entries with very high sequence similarityThus the updated ClusterBlast database contains fewer en-tries than the previous release despite the increase in pub-licly available sequence data This decrease has resulted inreduced computation times while simultaneously providingmore relevant hits Furthermore as the entries of the Clus-terBlast database are directly related to the BGCs in the an-tiSMASH database a link to the respective BGC is now in-cluded for all ClusterBlast hits promptly directing the userto the detailed report of the similar gene clusters

New lsquoregionrsquo concept

In previous versions antiSMASH referred to all co-locatedhybrid and independent BGCs with the single label lsquoclusterrsquoIn many cases this led to confusing structure predictionswhen distinct BGCs are encoded side-by-side For examplemany Streptomyces plasmids exist for which all BGCs lie soclose to each other that all were joined into a single largelsquoclusterrsquo In order to better distinguish the different biolog-ical options that lead to BGCs antiSMASH 5 introducessome new terminology

The definitions now used in antiSMASH 5 areCore The minimum area containing one or more genes

that code for enzymes for a single BGC type that are de-tected by the manually curated detection rules These genesdo not have to be contiguous but can be within a certaincutoff distance as defined by the detection rule for the BGCtype in question

Neighbourhood Distance up- and downstream of thecluster core that is used to find tailoring genesenzymesthe neighbourhood distances for the individual biosynthetictypes were empirically determined and defined in the detec-tion rules

Protocluster Contains core + neighbourhoods at bothsides of the core each protocluster always will have one sin-gle product type (for example NRPS) Protoclusters mayoverlap partially or completely with other protoclusters Inthe result webpage protoclusters are displayed as boxesabove the gene arrows The cores are shown as solid colourboxes the neighbourhoods are the half-transparent areasaround the cores

Candidate cluster Contains one or more protoclustersthe candidate clusters are defined as described below Thesedefinitions better allow modelling of hybrid clusters suchas PKSNRPS hybrids which combine two or more differ-ent biosynthetic classes (as identified in the detection rules)or cases where one class is used to biosynthesize a precur-sor for a second class An example of the latter is found inglycopeptide biosynthesis where one of the amino acids issynthesized by a type III PKS which is then incorporatedinto the product by a NRPS Candidate clusters may overlap

partially or completely with other candidate clusters In theresult webpage candidate clusters are shown as boxes abovethe protoclusters

Region Contains one or more candidate clusters The re-gions in antiSMASH 5 correspond to the entities calledlsquoclustersrsquo in antiSMASH 1 ndash 4 and now constitute what isdisplayed on a page of the results webpage Sometimes a re-gion will contain multiple mutually exclusive candidate clus-ters in such cases comparative genomic analysis andor ex-perimental work is required to assess which of these candi-date clusters constitute actual BGCs Regions will not over-lap with each other At least one of the contained candidateclusters will cover the full length of the region

There are four kinds of candidate clusters chemical hy-brids interleaved neighbouring and single

Chemical hybrid candidate clusters contain at least twoprotoclusters that share at least one gene that codes for en-zymes of two or more separate BGC types (eg a single genecoding for type I PKS and NRPS modules) (Figure 1A) Anexample of this type are hybrid PKSNRPSs Please notethat this type of candidate cluster can also include protoclus-ters within that shared range that do not share a coding se-quence provided that they are completely contained withinthe candidate cluster

Interleaved candidate clusters contain protoclusters thatdo not share cluster-type-defining coding sequences buttheir core locations overlap (Figure 1B)

Neighbouring candidate clusters contain protoclusterswhich transitively overlap in their neighbourhoods (Figure1C)

Single candidate clusters (Figure 1D) exist for consistencyof access they contain only a single protocluster Note thatindividual protoclusters can be contained by more than onecandidate cluster (typically a neighbouring candidate clusterand one of single interleaved or chemical hybrid)

Each candidate cluster assignment is transitive for ex-ample if a protocluster would form a chemical hybrid witheach of two neighbouring protoclusters but these neigh-bours would not form a chemical hybrid on their own allthree together will still form a chemical hybrid candidatecluster

Improved user interface

A central aim of antiSMASH is to provide very detailedand specific information via an easy to use and under-stand user interface (UI) The UI remained principally un-changed from the initial release of antiSMASH in 2011 de-spite the increased functionality added with each new ver-sion In this version we have modernized the UI using up-dated web technologies that allow a better structuring ofthe result-content of the antiSMASH results pages For re-designing the UI it was important that the reliable andwell-established look-and-feel was conserved while also re-taining the ability to download the whole web-based resultsfolder and to display it locally in a variety of web-browsers

We and others (such as (40)) have realized that anti-SMASH results using the heuristic ClusterFinder algorithm(41) were more often than not wrongly interpreted At thesame time ClusterFinder contributed significantly to thecomputational workload For these reasons we decided to

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W84 Nucleic Acids Research 2019 Vol 47 Web Server issue

Figure 1 Candidate cluster types 1234 Greyyellow gene involved in protocluster AB (A) Chemical Hybrids Since cluster type A and cluster typeB share a CDS that defines those protoclusters they are classified as lsquochemical hybridrsquo (B) Interleaved Since none of the protoclusters share any definingCDS with any other protocluster it is not annotated as a chemical hybrid even though the biosynthetic product may or may not be The two protoclustersform an interleaved candidate clusters since the core of A overlaps with the core of B (C) Neighbouring Neighbouring candidate clusters are defined if theneighbourhoods of two protoclusters but not their cores overlap (D) Singles If protoclusters donrsquot have any overlaprelation with other protoclusters theterm single candidate cluster is assigned

remove this feature from the pubic antiSMASH web serverIt is of course still included in the download version of an-tiSMASH and can be enabled via the command line

In the Regions overview section (Figure 2) a graph-ical overview showing the location of the identified re-gions on the chromosomeplasmidscaffoldscontig is dis-played In the detailed view regions that are located oncontig-borders are now clearly labelled This often indi-cates that parts of the BGC are missing or that several sec-tions of a BGC are located on different contigs and aretherefore reported individually (for a more detailed discus-sion on this phenomenon please see (42)) For the firsttime antiSMASH 5 now offers interactive browsing of the

BGCs including selection of lsquofunctionalrsquo units ie core en-zymes transporters etc zooming to individual genes orregionscandidate clustersprotoclusters Details of the se-lection are now provided in side panels instead of pop-upwindows using a hierarchical view of the analysis sum-maries (which can be expanded by clicking lsquo+rsquo) to provideadditional details For the display of the PKSNRPS do-main organization the user now can choose whether tolimit the shown domains to the currently selected genes orjust display the results of the selected gene(s)enzyme(s)Furthermore the information is now organized in lsquotabsrsquothat do not require scrolling down along an often very longresults page

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W85

Figure 2 Screenshot of the antiSMASH 5 user interface (example NCBI-acc Y16952 balhimycin BGC) The new region overview now allowspanningzooming The candidate cluster and protocluster boxes are explained in the lsquonew region conceptrsquo section above Information about the currentlyselected gene are displayed at the right lsquoGene detailsrsquo panel For PKS or NRPS regions the detailed domain annotation is displayed by pressing the tabsusers can select the domain overview (shown) or the ClusterBlast KnownClusterBlast or SubClusterBlast results At the right the structure predictionand details of specificity predictions are displayed upon selecting the plus sign

CODE REFACTORING AND SPEED-UP

Large parts of the pre-antiSMASH 5 code base were stillderived from antiSMASH version 1 which was released in2011 In order to maintain future compatibility the anti-SMASH code base had to be migrated from python 27which will reach end-of-life in 2020 to the current versions35ndash37 As this transition required significant modificationto the antiSMASH code we decided to take this as a chanceto completely rewrite the software with a special considera-tion on runtime code stability and code maintainability Aunit test and integration test framework was implementedthat covers most parts of the antiSMASH 5 code allowing amuch easier debugging andndashndashmost importantlyndashndashextensionof the code while at the same time ensuring that new fea-tures do not negatively impact the results of existing mod-ules For some of the externally contributed modules (Sand-puma trans-AT PKS comparisons terpene PrediCAT) ourcontributors are currently preparing updated and compli-ant versions which will be added to antiSMASH 5 in mi-nor releases once they are finished and tested Like the ear-lier antiSMASH versions antiSMASH 5 provides the anal-ysis results in an interactive webpage and richly annotatedGenBank-format files for the whole genome and individualclusters As a new feature in version 5 all data are also avail-able as a computer readable JSON container which allowsthird party tools to easily process antiSMASH annotationsThis JSON output has superseded some other output typessuch as BioSynML and XLS

In addition to the advantages mentioned above the coderefactoring and cleanup has also led to a significant speed

increase of the new version by a factor of 4-11times (dependingon genome and selected options) instead of waiting timesof several hours antiSMASH results are now usually deliv-ered within 30ndash40 min after the start of the job for a typicalsubmission at the public web server

CONCLUSIONS AND FUTURE PERSPECTIVES

With the help of software like antiSMASH genome miningfor specialized metabolites has established itself as a com-plementary approach for the identification of novel metabo-lites which is routinely used within the natural products re-search community and increasingly applied in related fieldssuch as metagenomics environmental biology or metabolicengineering With the improvements to the antiSMASHuser interface and performance we keep pace with these de-velopments Furthermore the complete refactoring of theantiSMASH 5 code base will allow us to increasingly useantiSMASH as a tool that provides analysis data on whichother software can perform additional analyses

DATA AVAILABILITY

antiSMASH is available from httpsantismashsecondarymetabolitesorg (bacterial version) orhttpsfungismashsecondarymetabolitesorg (fun-gal version) The antiSMASH documentationincluding a PDF user guide is available fromhttpsdocsantismashsecondarymetabolitesorg Thesewebsites are free and open to all users and there is no loginrequirement The antiSMASH source code is available from

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 2: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

Published online 29 April 2019 Nucleic Acids Research 2019 Vol 47 Web Server issue W81ndashW87doi 101093nargkz310

antiSMASH 50 updates to the secondary metabolitegenome mining pipelineKai Blin1 Simon Shaw1 Katharina Steinke2 Rasmus Villebro1 Nadine Ziemert2 SangYup Lee13 Marnix H Medema 4 and Tilmann Weber 1

1The Novo Nordisk Foundation Center for Biosustainability Technical University of Denmark Kemitorvet bygning 2202800 Kgs Lyngby Denmark 2German Centre for Infection Research (DZIF) Interfaculty Institute of Microbiology andInfection Medicine Auf der Morgenstelle 28 University of Tubingen 72076 Tubingen DE Germany 3Department ofChemical and Biomolecular Engineering (BK21 Plus Program) and BioInformatics Research Center Korea AdvancedInstitute of Science and Technology 291 Daehak-ro Yuseong-gu Daejeon 34141 South Korea and 4BioinformaticsGroup Wageningen University Droevendaalsesteeg 1 6708PB Wageningen the Netherlands

Received February 07 2019 Revised April 02 2019 Editorial Decision April 16 2019 Accepted April 17 2019

ABSTRACT

Secondary metabolites produced by bacteria andfungi are an important source of antimicro-bials and other bioactive compounds In recentyears genome mining has seen broad applica-tions in identifying and characterizing new com-pounds as well as in metabolic engineeringSince 2011 the lsquoantibiotics and secondary metabo-lite analysis shellndashndashantiSMASHrsquo (httpsantismashsecondarymetabolitesorg) has assisted researchersin this both as a web server and a standalone toolIt has established itself as the most widely usedtool for identifying and analysing biosynthetic geneclusters (BGCs) in bacterial and fungal genome se-quences Here we present an entirely redesignedand extended version 5 of antiSMASH antiSMASH5 adds detection rules for clusters encoding thebiosynthesis of acyl-amino acids -lactones fungalRiPPs RaS-RiPPs polybrominated diphenyl ethersC-nucleosides PPY-like ketones and lipolanthinesFor type II polyketide synthase-encoding gene clus-ters antiSMASH 5 now offers more detailed predic-tions The HTML output visualization has been re-designed to improve the navigation and visual repre-sentation of annotations We have again improvedthe runtime of analysis steps making it possibleto deliver comprehensive annotations for bacterialgenomes within a few minutes A new output file inthe standard JavaScript object notation (JSON) for-mat is aimed at downstream tools that process anti-SMASH results programmatically

INTRODUCTION

Bacterial and fungal natural products constitute a keysource of scaffolds for the development of antimicrobialsand other drugs (1) and mediate ecological interactions be-tween organisms in various ways (2)

Mining genomic data for the presence of biosyntheticpathways that enable organisms to produce such moleculeswhich are also referred to as secondary or specializedmetabolites have become an essential approach that com-plements activity- and chemistry-guided isolation and iden-tification approaches (3) Several computational tools suchas CLUSEAN (4) or PRISM (5) have been developed tosupport scientists with this task The lsquoantibiotics and sec-ondary metabolites analysis shellrsquo antiSMASH is a pio-neer amongst these tools Initially released in 2011 (6) ithas since been further extended and improved (7ndash12) andis currently used by thousands of academic and industrialscientists worldwide to identify so called secondary metabo-lite lsquobiosynthetic gene clustersrsquo (BGCs) in their genomes ofinterest In 2017 a database component was added to theantiSMASH framework which provides instant access tothousands of pre-computed antiSMASH genome miningresults of publicly available genomes (1314) Furthermoreseveral independent tools such as the mass-spectrometryguided peptide mining tool Pep2Path (15) the lsquoAntibioticResistance Target Seekerrsquo ARTS (16) the sgRNA designtool CRISPy-web (17) a reverse-tailoring tool to match fin-ished NRPSPKS structures to antiSMASH-predicted corestructures (18) and the BGC clustering and classificationplatform BiG-SCAPE (19) were developed that directly in-teract with and interpret results generated by antiSMASHand provide information that is outside the scope of a core-antiSMASH analysis

Here we present version 5 of antiSMASH which con-tains many improvements In addition to many features

To whom correspondence should be addressed Tel +45 24896132 Email tiwebiosustaindtudkCorrespondence may also be addressed to Marnix H Medema Tel +31 317484706 Email marnixmedemawurnl

Ccopy The Author(s) 2019 Published by Oxford University Press on behalf of Nucleic Acids ResearchThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W82 Nucleic Acids Research 2019 Vol 47 Web Server issue

visible to the end users such as extended and improvedBGC detection and analysis capabilities and a mod-ernized and improved User Interface (see below) anti-SMASH version 5 was completely rewritten in Pythonversion 3 and the code was restructured to increaseperformance reliability and ease of maintenance Thishas led to a significant speed increase of the pipelineA complete list of antiSMASH 5 features is includedin the antiSMASH documentation httpsdocsantismashsecondarymetabolitesorgantiSMASH5features

NEW FEATURES AND UPDATES

New gene cluster classes and refinement of cluster detectionrules

The most widely used and recommended mode to detectBGCs in genomic data is via manually curated and val-idated gene cluster rules These are based on identifyingco-occurring conserved core enzymes in the genome usingHMM-profiles that were derived from Pfam (20) SMART(21) BAGEL (22) or Yadav et al (23) or that were cre-ated specifically for antiSMASH While antiSMASH ver-sion 4 supported the rule-based detection of 44 differ-ent biosynthetic types antiSMASH 5 now includes rulesfor 52 different BGC types In version 5 new rules wereadded to detect BGCs encoding the biosynthesis of N-acylamino acids (24) -lactones (25) polybrominated diphenylethers (26) C-nucleosides (27) pseudopyronines (28) fun-gal RiPPs (29ndash31) and RaS-RiPPs (3233) Furthermore anew lsquonrps-likersquo rule was defined for NRPS-fragments ieatypical NRPSs that donrsquot have the typical C-A-T modulearchitecture The previous lsquootherksrsquo rule was split into tworules to individually assign heterocyst glycolipid synthase-like clusters and other atypical PKSs In addition somerules were improved based on user case reports The rulesdescribing lanthipeptides and trans-AT type I PKS were re-fined to reduce the number of false positive hybrid calls onother cluster types For trans-AT- type I PKS and type IIPKS we increased the size of the cluster cutoffs to capturepreviously missed tailoring enzymes in published clustersThe rule for linear azoleazoline-containing peptides wasmade more generic to better cover the range of describedclusters

The rule describing microcin clusters was removed as mi-crocins are a class of RiPPs defined via their productionin Enterobacteriaceae and are already captured by one ofour other specific RiPP cluster rules depending on their re-spective biosynthesis pathway (eg microcin J25-like RiPPswere previously covered by the old microcin cluster rules butchemically are lasso peptides while microcin B17 is a linearazol(in)e-containing peptide)

Improved type II PKS prediction

Bacterial type II PKS BGCs code for the biosynthesis ofaromatic polyketides such as the antibiotic tetracycline orthe anti-tumour drug doxorubicin From the beginning an-tiSMASH has had rules that were able to detect type IIPKS BGCs by checking for the presence of the KSα andKSβCLF component of the minimal PKS However no

detailed prediction methods had been added since anti-SMASHrsquos first version In antiSMASH 5 we introduce anew PKS II analysis module (12) which uses a collection ofmanually curated HMMs to predict potential starter unitsthe number of elongation cycles (and thus a rough esti-mation of the putative molecular weight of the core com-pound) cyclization patterns and some conserved type IIPKS specific tailoring reactions This module is automat-ically triggered whenever a type II PKS BGC is detected

Annotation of resistance genes via Resfams

The Resfams database (34) is a curated database of proteinfamilies with confirmed antibiotic resistance function an-tiSMASH 5 uses the profile Hidden Markov Models (pH-MMs) from Resfams to annotate potential resistance genesfound in predicted gene regions Potential resistance gene-hits are displayed in the lsquogene detailsrsquo panel along with otherfunctional annotations

GO-term annotations

The Gene Ontology (GO) is a controlled vocabulary for de-scribing biological processes molecular functions and cel-lular components in a consistent way to enable comparisonof these between different species Amongst its wide rangeof uses the GO has been used to predict gene clusters ineukaryotes and bacteria (35) and in conjunction with anti-SMASH to refine cluster boundaries in antiSMASH out-put for Aspergillus species (36)

To facilitate these and other GO-based analyses anti-SMASH 5 includes an option to automatically annotateGO terms on Pfam domains This functionality makes useof the fact that GO terms may be linked not only to spe-cific gene products but also to other means of classifica-tion in so-called lsquomappingsrsquo (httpgeneontologyorgpagedownload-mappings) As antiSMASH can automaticallyannotate Pfam domains the GO annotation functionalitymakes use of the Pfam to GO mapping supplied by theGene Ontology Consortiumrsquos website (37) If the ID of apredicted Pfam domain in an antiSMASH record is presentin the Pfam to GO mapping the respective GO terms areassigned and presented in the lsquogene detailsrsquo panel

Link to the antiSMASH database

antiSMASH provides options to search for similar geneclusters in public datasets As already implemented in previ-ous versions of the software the KnownClusterBlast func-tionality searches each identified region against the man-ually curated MIBiG (38) repository The KnownClus-terBlast and ClusterBlast search functions use an algo-rithm first described in antiSMASH 1 (6) which also isin use in a generalized version in MultiGeneBlast (39)In the previous versions of antiSMASH the ClusterBlastdatabase was generated by scripts that used the antiSMASHBGC detection logic on sequences downloaded from theNCBI GenbankRefSeq databases As version 2 of theantiSMASH database now also contains BGCs of draftgenomes (14) starting with antiSMASH 5 the ClusterBlastdatabases will be directly generated from the new anti-SMASH database and complemented with individual BGC

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W83

records that were submitted to NCBI outside of whole-genome submissions This provides several advantages Theabundance of entries for selected generaspecies in the pub-lic databases (and thus also in the previous ClusterBlastdatabase) is strongly skewed towards clinically or industri-ally relevant organisms There are for example more than15 000 assemblies for Escherichia coli deposited at NCBIFor the antiSMASH database a sequence-based dereplica-tion workflow was established (14) that reduced the num-ber of redundant entries with very high sequence similarityThus the updated ClusterBlast database contains fewer en-tries than the previous release despite the increase in pub-licly available sequence data This decrease has resulted inreduced computation times while simultaneously providingmore relevant hits Furthermore as the entries of the Clus-terBlast database are directly related to the BGCs in the an-tiSMASH database a link to the respective BGC is now in-cluded for all ClusterBlast hits promptly directing the userto the detailed report of the similar gene clusters

New lsquoregionrsquo concept

In previous versions antiSMASH referred to all co-locatedhybrid and independent BGCs with the single label lsquoclusterrsquoIn many cases this led to confusing structure predictionswhen distinct BGCs are encoded side-by-side For examplemany Streptomyces plasmids exist for which all BGCs lie soclose to each other that all were joined into a single largelsquoclusterrsquo In order to better distinguish the different biolog-ical options that lead to BGCs antiSMASH 5 introducessome new terminology

The definitions now used in antiSMASH 5 areCore The minimum area containing one or more genes

that code for enzymes for a single BGC type that are de-tected by the manually curated detection rules These genesdo not have to be contiguous but can be within a certaincutoff distance as defined by the detection rule for the BGCtype in question

Neighbourhood Distance up- and downstream of thecluster core that is used to find tailoring genesenzymesthe neighbourhood distances for the individual biosynthetictypes were empirically determined and defined in the detec-tion rules

Protocluster Contains core + neighbourhoods at bothsides of the core each protocluster always will have one sin-gle product type (for example NRPS) Protoclusters mayoverlap partially or completely with other protoclusters Inthe result webpage protoclusters are displayed as boxesabove the gene arrows The cores are shown as solid colourboxes the neighbourhoods are the half-transparent areasaround the cores

Candidate cluster Contains one or more protoclustersthe candidate clusters are defined as described below Thesedefinitions better allow modelling of hybrid clusters suchas PKSNRPS hybrids which combine two or more differ-ent biosynthetic classes (as identified in the detection rules)or cases where one class is used to biosynthesize a precur-sor for a second class An example of the latter is found inglycopeptide biosynthesis where one of the amino acids issynthesized by a type III PKS which is then incorporatedinto the product by a NRPS Candidate clusters may overlap

partially or completely with other candidate clusters In theresult webpage candidate clusters are shown as boxes abovethe protoclusters

Region Contains one or more candidate clusters The re-gions in antiSMASH 5 correspond to the entities calledlsquoclustersrsquo in antiSMASH 1 ndash 4 and now constitute what isdisplayed on a page of the results webpage Sometimes a re-gion will contain multiple mutually exclusive candidate clus-ters in such cases comparative genomic analysis andor ex-perimental work is required to assess which of these candi-date clusters constitute actual BGCs Regions will not over-lap with each other At least one of the contained candidateclusters will cover the full length of the region

There are four kinds of candidate clusters chemical hy-brids interleaved neighbouring and single

Chemical hybrid candidate clusters contain at least twoprotoclusters that share at least one gene that codes for en-zymes of two or more separate BGC types (eg a single genecoding for type I PKS and NRPS modules) (Figure 1A) Anexample of this type are hybrid PKSNRPSs Please notethat this type of candidate cluster can also include protoclus-ters within that shared range that do not share a coding se-quence provided that they are completely contained withinthe candidate cluster

Interleaved candidate clusters contain protoclusters thatdo not share cluster-type-defining coding sequences buttheir core locations overlap (Figure 1B)

Neighbouring candidate clusters contain protoclusterswhich transitively overlap in their neighbourhoods (Figure1C)

Single candidate clusters (Figure 1D) exist for consistencyof access they contain only a single protocluster Note thatindividual protoclusters can be contained by more than onecandidate cluster (typically a neighbouring candidate clusterand one of single interleaved or chemical hybrid)

Each candidate cluster assignment is transitive for ex-ample if a protocluster would form a chemical hybrid witheach of two neighbouring protoclusters but these neigh-bours would not form a chemical hybrid on their own allthree together will still form a chemical hybrid candidatecluster

Improved user interface

A central aim of antiSMASH is to provide very detailedand specific information via an easy to use and under-stand user interface (UI) The UI remained principally un-changed from the initial release of antiSMASH in 2011 de-spite the increased functionality added with each new ver-sion In this version we have modernized the UI using up-dated web technologies that allow a better structuring ofthe result-content of the antiSMASH results pages For re-designing the UI it was important that the reliable andwell-established look-and-feel was conserved while also re-taining the ability to download the whole web-based resultsfolder and to display it locally in a variety of web-browsers

We and others (such as (40)) have realized that anti-SMASH results using the heuristic ClusterFinder algorithm(41) were more often than not wrongly interpreted At thesame time ClusterFinder contributed significantly to thecomputational workload For these reasons we decided to

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W84 Nucleic Acids Research 2019 Vol 47 Web Server issue

Figure 1 Candidate cluster types 1234 Greyyellow gene involved in protocluster AB (A) Chemical Hybrids Since cluster type A and cluster typeB share a CDS that defines those protoclusters they are classified as lsquochemical hybridrsquo (B) Interleaved Since none of the protoclusters share any definingCDS with any other protocluster it is not annotated as a chemical hybrid even though the biosynthetic product may or may not be The two protoclustersform an interleaved candidate clusters since the core of A overlaps with the core of B (C) Neighbouring Neighbouring candidate clusters are defined if theneighbourhoods of two protoclusters but not their cores overlap (D) Singles If protoclusters donrsquot have any overlaprelation with other protoclusters theterm single candidate cluster is assigned

remove this feature from the pubic antiSMASH web serverIt is of course still included in the download version of an-tiSMASH and can be enabled via the command line

In the Regions overview section (Figure 2) a graph-ical overview showing the location of the identified re-gions on the chromosomeplasmidscaffoldscontig is dis-played In the detailed view regions that are located oncontig-borders are now clearly labelled This often indi-cates that parts of the BGC are missing or that several sec-tions of a BGC are located on different contigs and aretherefore reported individually (for a more detailed discus-sion on this phenomenon please see (42)) For the firsttime antiSMASH 5 now offers interactive browsing of the

BGCs including selection of lsquofunctionalrsquo units ie core en-zymes transporters etc zooming to individual genes orregionscandidate clustersprotoclusters Details of the se-lection are now provided in side panels instead of pop-upwindows using a hierarchical view of the analysis sum-maries (which can be expanded by clicking lsquo+rsquo) to provideadditional details For the display of the PKSNRPS do-main organization the user now can choose whether tolimit the shown domains to the currently selected genes orjust display the results of the selected gene(s)enzyme(s)Furthermore the information is now organized in lsquotabsrsquothat do not require scrolling down along an often very longresults page

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W85

Figure 2 Screenshot of the antiSMASH 5 user interface (example NCBI-acc Y16952 balhimycin BGC) The new region overview now allowspanningzooming The candidate cluster and protocluster boxes are explained in the lsquonew region conceptrsquo section above Information about the currentlyselected gene are displayed at the right lsquoGene detailsrsquo panel For PKS or NRPS regions the detailed domain annotation is displayed by pressing the tabsusers can select the domain overview (shown) or the ClusterBlast KnownClusterBlast or SubClusterBlast results At the right the structure predictionand details of specificity predictions are displayed upon selecting the plus sign

CODE REFACTORING AND SPEED-UP

Large parts of the pre-antiSMASH 5 code base were stillderived from antiSMASH version 1 which was released in2011 In order to maintain future compatibility the anti-SMASH code base had to be migrated from python 27which will reach end-of-life in 2020 to the current versions35ndash37 As this transition required significant modificationto the antiSMASH code we decided to take this as a chanceto completely rewrite the software with a special considera-tion on runtime code stability and code maintainability Aunit test and integration test framework was implementedthat covers most parts of the antiSMASH 5 code allowing amuch easier debugging andndashndashmost importantlyndashndashextensionof the code while at the same time ensuring that new fea-tures do not negatively impact the results of existing mod-ules For some of the externally contributed modules (Sand-puma trans-AT PKS comparisons terpene PrediCAT) ourcontributors are currently preparing updated and compli-ant versions which will be added to antiSMASH 5 in mi-nor releases once they are finished and tested Like the ear-lier antiSMASH versions antiSMASH 5 provides the anal-ysis results in an interactive webpage and richly annotatedGenBank-format files for the whole genome and individualclusters As a new feature in version 5 all data are also avail-able as a computer readable JSON container which allowsthird party tools to easily process antiSMASH annotationsThis JSON output has superseded some other output typessuch as BioSynML and XLS

In addition to the advantages mentioned above the coderefactoring and cleanup has also led to a significant speed

increase of the new version by a factor of 4-11times (dependingon genome and selected options) instead of waiting timesof several hours antiSMASH results are now usually deliv-ered within 30ndash40 min after the start of the job for a typicalsubmission at the public web server

CONCLUSIONS AND FUTURE PERSPECTIVES

With the help of software like antiSMASH genome miningfor specialized metabolites has established itself as a com-plementary approach for the identification of novel metabo-lites which is routinely used within the natural products re-search community and increasingly applied in related fieldssuch as metagenomics environmental biology or metabolicengineering With the improvements to the antiSMASHuser interface and performance we keep pace with these de-velopments Furthermore the complete refactoring of theantiSMASH 5 code base will allow us to increasingly useantiSMASH as a tool that provides analysis data on whichother software can perform additional analyses

DATA AVAILABILITY

antiSMASH is available from httpsantismashsecondarymetabolitesorg (bacterial version) orhttpsfungismashsecondarymetabolitesorg (fun-gal version) The antiSMASH documentationincluding a PDF user guide is available fromhttpsdocsantismashsecondarymetabolitesorg Thesewebsites are free and open to all users and there is no loginrequirement The antiSMASH source code is available from

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 3: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

W82 Nucleic Acids Research 2019 Vol 47 Web Server issue

visible to the end users such as extended and improvedBGC detection and analysis capabilities and a mod-ernized and improved User Interface (see below) anti-SMASH version 5 was completely rewritten in Pythonversion 3 and the code was restructured to increaseperformance reliability and ease of maintenance Thishas led to a significant speed increase of the pipelineA complete list of antiSMASH 5 features is includedin the antiSMASH documentation httpsdocsantismashsecondarymetabolitesorgantiSMASH5features

NEW FEATURES AND UPDATES

New gene cluster classes and refinement of cluster detectionrules

The most widely used and recommended mode to detectBGCs in genomic data is via manually curated and val-idated gene cluster rules These are based on identifyingco-occurring conserved core enzymes in the genome usingHMM-profiles that were derived from Pfam (20) SMART(21) BAGEL (22) or Yadav et al (23) or that were cre-ated specifically for antiSMASH While antiSMASH ver-sion 4 supported the rule-based detection of 44 differ-ent biosynthetic types antiSMASH 5 now includes rulesfor 52 different BGC types In version 5 new rules wereadded to detect BGCs encoding the biosynthesis of N-acylamino acids (24) -lactones (25) polybrominated diphenylethers (26) C-nucleosides (27) pseudopyronines (28) fun-gal RiPPs (29ndash31) and RaS-RiPPs (3233) Furthermore anew lsquonrps-likersquo rule was defined for NRPS-fragments ieatypical NRPSs that donrsquot have the typical C-A-T modulearchitecture The previous lsquootherksrsquo rule was split into tworules to individually assign heterocyst glycolipid synthase-like clusters and other atypical PKSs In addition somerules were improved based on user case reports The rulesdescribing lanthipeptides and trans-AT type I PKS were re-fined to reduce the number of false positive hybrid calls onother cluster types For trans-AT- type I PKS and type IIPKS we increased the size of the cluster cutoffs to capturepreviously missed tailoring enzymes in published clustersThe rule for linear azoleazoline-containing peptides wasmade more generic to better cover the range of describedclusters

The rule describing microcin clusters was removed as mi-crocins are a class of RiPPs defined via their productionin Enterobacteriaceae and are already captured by one ofour other specific RiPP cluster rules depending on their re-spective biosynthesis pathway (eg microcin J25-like RiPPswere previously covered by the old microcin cluster rules butchemically are lasso peptides while microcin B17 is a linearazol(in)e-containing peptide)

Improved type II PKS prediction

Bacterial type II PKS BGCs code for the biosynthesis ofaromatic polyketides such as the antibiotic tetracycline orthe anti-tumour drug doxorubicin From the beginning an-tiSMASH has had rules that were able to detect type IIPKS BGCs by checking for the presence of the KSα andKSβCLF component of the minimal PKS However no

detailed prediction methods had been added since anti-SMASHrsquos first version In antiSMASH 5 we introduce anew PKS II analysis module (12) which uses a collection ofmanually curated HMMs to predict potential starter unitsthe number of elongation cycles (and thus a rough esti-mation of the putative molecular weight of the core com-pound) cyclization patterns and some conserved type IIPKS specific tailoring reactions This module is automat-ically triggered whenever a type II PKS BGC is detected

Annotation of resistance genes via Resfams

The Resfams database (34) is a curated database of proteinfamilies with confirmed antibiotic resistance function an-tiSMASH 5 uses the profile Hidden Markov Models (pH-MMs) from Resfams to annotate potential resistance genesfound in predicted gene regions Potential resistance gene-hits are displayed in the lsquogene detailsrsquo panel along with otherfunctional annotations

GO-term annotations

The Gene Ontology (GO) is a controlled vocabulary for de-scribing biological processes molecular functions and cel-lular components in a consistent way to enable comparisonof these between different species Amongst its wide rangeof uses the GO has been used to predict gene clusters ineukaryotes and bacteria (35) and in conjunction with anti-SMASH to refine cluster boundaries in antiSMASH out-put for Aspergillus species (36)

To facilitate these and other GO-based analyses anti-SMASH 5 includes an option to automatically annotateGO terms on Pfam domains This functionality makes useof the fact that GO terms may be linked not only to spe-cific gene products but also to other means of classifica-tion in so-called lsquomappingsrsquo (httpgeneontologyorgpagedownload-mappings) As antiSMASH can automaticallyannotate Pfam domains the GO annotation functionalitymakes use of the Pfam to GO mapping supplied by theGene Ontology Consortiumrsquos website (37) If the ID of apredicted Pfam domain in an antiSMASH record is presentin the Pfam to GO mapping the respective GO terms areassigned and presented in the lsquogene detailsrsquo panel

Link to the antiSMASH database

antiSMASH provides options to search for similar geneclusters in public datasets As already implemented in previ-ous versions of the software the KnownClusterBlast func-tionality searches each identified region against the man-ually curated MIBiG (38) repository The KnownClus-terBlast and ClusterBlast search functions use an algo-rithm first described in antiSMASH 1 (6) which also isin use in a generalized version in MultiGeneBlast (39)In the previous versions of antiSMASH the ClusterBlastdatabase was generated by scripts that used the antiSMASHBGC detection logic on sequences downloaded from theNCBI GenbankRefSeq databases As version 2 of theantiSMASH database now also contains BGCs of draftgenomes (14) starting with antiSMASH 5 the ClusterBlastdatabases will be directly generated from the new anti-SMASH database and complemented with individual BGC

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W83

records that were submitted to NCBI outside of whole-genome submissions This provides several advantages Theabundance of entries for selected generaspecies in the pub-lic databases (and thus also in the previous ClusterBlastdatabase) is strongly skewed towards clinically or industri-ally relevant organisms There are for example more than15 000 assemblies for Escherichia coli deposited at NCBIFor the antiSMASH database a sequence-based dereplica-tion workflow was established (14) that reduced the num-ber of redundant entries with very high sequence similarityThus the updated ClusterBlast database contains fewer en-tries than the previous release despite the increase in pub-licly available sequence data This decrease has resulted inreduced computation times while simultaneously providingmore relevant hits Furthermore as the entries of the Clus-terBlast database are directly related to the BGCs in the an-tiSMASH database a link to the respective BGC is now in-cluded for all ClusterBlast hits promptly directing the userto the detailed report of the similar gene clusters

New lsquoregionrsquo concept

In previous versions antiSMASH referred to all co-locatedhybrid and independent BGCs with the single label lsquoclusterrsquoIn many cases this led to confusing structure predictionswhen distinct BGCs are encoded side-by-side For examplemany Streptomyces plasmids exist for which all BGCs lie soclose to each other that all were joined into a single largelsquoclusterrsquo In order to better distinguish the different biolog-ical options that lead to BGCs antiSMASH 5 introducessome new terminology

The definitions now used in antiSMASH 5 areCore The minimum area containing one or more genes

that code for enzymes for a single BGC type that are de-tected by the manually curated detection rules These genesdo not have to be contiguous but can be within a certaincutoff distance as defined by the detection rule for the BGCtype in question

Neighbourhood Distance up- and downstream of thecluster core that is used to find tailoring genesenzymesthe neighbourhood distances for the individual biosynthetictypes were empirically determined and defined in the detec-tion rules

Protocluster Contains core + neighbourhoods at bothsides of the core each protocluster always will have one sin-gle product type (for example NRPS) Protoclusters mayoverlap partially or completely with other protoclusters Inthe result webpage protoclusters are displayed as boxesabove the gene arrows The cores are shown as solid colourboxes the neighbourhoods are the half-transparent areasaround the cores

Candidate cluster Contains one or more protoclustersthe candidate clusters are defined as described below Thesedefinitions better allow modelling of hybrid clusters suchas PKSNRPS hybrids which combine two or more differ-ent biosynthetic classes (as identified in the detection rules)or cases where one class is used to biosynthesize a precur-sor for a second class An example of the latter is found inglycopeptide biosynthesis where one of the amino acids issynthesized by a type III PKS which is then incorporatedinto the product by a NRPS Candidate clusters may overlap

partially or completely with other candidate clusters In theresult webpage candidate clusters are shown as boxes abovethe protoclusters

Region Contains one or more candidate clusters The re-gions in antiSMASH 5 correspond to the entities calledlsquoclustersrsquo in antiSMASH 1 ndash 4 and now constitute what isdisplayed on a page of the results webpage Sometimes a re-gion will contain multiple mutually exclusive candidate clus-ters in such cases comparative genomic analysis andor ex-perimental work is required to assess which of these candi-date clusters constitute actual BGCs Regions will not over-lap with each other At least one of the contained candidateclusters will cover the full length of the region

There are four kinds of candidate clusters chemical hy-brids interleaved neighbouring and single

Chemical hybrid candidate clusters contain at least twoprotoclusters that share at least one gene that codes for en-zymes of two or more separate BGC types (eg a single genecoding for type I PKS and NRPS modules) (Figure 1A) Anexample of this type are hybrid PKSNRPSs Please notethat this type of candidate cluster can also include protoclus-ters within that shared range that do not share a coding se-quence provided that they are completely contained withinthe candidate cluster

Interleaved candidate clusters contain protoclusters thatdo not share cluster-type-defining coding sequences buttheir core locations overlap (Figure 1B)

Neighbouring candidate clusters contain protoclusterswhich transitively overlap in their neighbourhoods (Figure1C)

Single candidate clusters (Figure 1D) exist for consistencyof access they contain only a single protocluster Note thatindividual protoclusters can be contained by more than onecandidate cluster (typically a neighbouring candidate clusterand one of single interleaved or chemical hybrid)

Each candidate cluster assignment is transitive for ex-ample if a protocluster would form a chemical hybrid witheach of two neighbouring protoclusters but these neigh-bours would not form a chemical hybrid on their own allthree together will still form a chemical hybrid candidatecluster

Improved user interface

A central aim of antiSMASH is to provide very detailedand specific information via an easy to use and under-stand user interface (UI) The UI remained principally un-changed from the initial release of antiSMASH in 2011 de-spite the increased functionality added with each new ver-sion In this version we have modernized the UI using up-dated web technologies that allow a better structuring ofthe result-content of the antiSMASH results pages For re-designing the UI it was important that the reliable andwell-established look-and-feel was conserved while also re-taining the ability to download the whole web-based resultsfolder and to display it locally in a variety of web-browsers

We and others (such as (40)) have realized that anti-SMASH results using the heuristic ClusterFinder algorithm(41) were more often than not wrongly interpreted At thesame time ClusterFinder contributed significantly to thecomputational workload For these reasons we decided to

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W84 Nucleic Acids Research 2019 Vol 47 Web Server issue

Figure 1 Candidate cluster types 1234 Greyyellow gene involved in protocluster AB (A) Chemical Hybrids Since cluster type A and cluster typeB share a CDS that defines those protoclusters they are classified as lsquochemical hybridrsquo (B) Interleaved Since none of the protoclusters share any definingCDS with any other protocluster it is not annotated as a chemical hybrid even though the biosynthetic product may or may not be The two protoclustersform an interleaved candidate clusters since the core of A overlaps with the core of B (C) Neighbouring Neighbouring candidate clusters are defined if theneighbourhoods of two protoclusters but not their cores overlap (D) Singles If protoclusters donrsquot have any overlaprelation with other protoclusters theterm single candidate cluster is assigned

remove this feature from the pubic antiSMASH web serverIt is of course still included in the download version of an-tiSMASH and can be enabled via the command line

In the Regions overview section (Figure 2) a graph-ical overview showing the location of the identified re-gions on the chromosomeplasmidscaffoldscontig is dis-played In the detailed view regions that are located oncontig-borders are now clearly labelled This often indi-cates that parts of the BGC are missing or that several sec-tions of a BGC are located on different contigs and aretherefore reported individually (for a more detailed discus-sion on this phenomenon please see (42)) For the firsttime antiSMASH 5 now offers interactive browsing of the

BGCs including selection of lsquofunctionalrsquo units ie core en-zymes transporters etc zooming to individual genes orregionscandidate clustersprotoclusters Details of the se-lection are now provided in side panels instead of pop-upwindows using a hierarchical view of the analysis sum-maries (which can be expanded by clicking lsquo+rsquo) to provideadditional details For the display of the PKSNRPS do-main organization the user now can choose whether tolimit the shown domains to the currently selected genes orjust display the results of the selected gene(s)enzyme(s)Furthermore the information is now organized in lsquotabsrsquothat do not require scrolling down along an often very longresults page

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W85

Figure 2 Screenshot of the antiSMASH 5 user interface (example NCBI-acc Y16952 balhimycin BGC) The new region overview now allowspanningzooming The candidate cluster and protocluster boxes are explained in the lsquonew region conceptrsquo section above Information about the currentlyselected gene are displayed at the right lsquoGene detailsrsquo panel For PKS or NRPS regions the detailed domain annotation is displayed by pressing the tabsusers can select the domain overview (shown) or the ClusterBlast KnownClusterBlast or SubClusterBlast results At the right the structure predictionand details of specificity predictions are displayed upon selecting the plus sign

CODE REFACTORING AND SPEED-UP

Large parts of the pre-antiSMASH 5 code base were stillderived from antiSMASH version 1 which was released in2011 In order to maintain future compatibility the anti-SMASH code base had to be migrated from python 27which will reach end-of-life in 2020 to the current versions35ndash37 As this transition required significant modificationto the antiSMASH code we decided to take this as a chanceto completely rewrite the software with a special considera-tion on runtime code stability and code maintainability Aunit test and integration test framework was implementedthat covers most parts of the antiSMASH 5 code allowing amuch easier debugging andndashndashmost importantlyndashndashextensionof the code while at the same time ensuring that new fea-tures do not negatively impact the results of existing mod-ules For some of the externally contributed modules (Sand-puma trans-AT PKS comparisons terpene PrediCAT) ourcontributors are currently preparing updated and compli-ant versions which will be added to antiSMASH 5 in mi-nor releases once they are finished and tested Like the ear-lier antiSMASH versions antiSMASH 5 provides the anal-ysis results in an interactive webpage and richly annotatedGenBank-format files for the whole genome and individualclusters As a new feature in version 5 all data are also avail-able as a computer readable JSON container which allowsthird party tools to easily process antiSMASH annotationsThis JSON output has superseded some other output typessuch as BioSynML and XLS

In addition to the advantages mentioned above the coderefactoring and cleanup has also led to a significant speed

increase of the new version by a factor of 4-11times (dependingon genome and selected options) instead of waiting timesof several hours antiSMASH results are now usually deliv-ered within 30ndash40 min after the start of the job for a typicalsubmission at the public web server

CONCLUSIONS AND FUTURE PERSPECTIVES

With the help of software like antiSMASH genome miningfor specialized metabolites has established itself as a com-plementary approach for the identification of novel metabo-lites which is routinely used within the natural products re-search community and increasingly applied in related fieldssuch as metagenomics environmental biology or metabolicengineering With the improvements to the antiSMASHuser interface and performance we keep pace with these de-velopments Furthermore the complete refactoring of theantiSMASH 5 code base will allow us to increasingly useantiSMASH as a tool that provides analysis data on whichother software can perform additional analyses

DATA AVAILABILITY

antiSMASH is available from httpsantismashsecondarymetabolitesorg (bacterial version) orhttpsfungismashsecondarymetabolitesorg (fun-gal version) The antiSMASH documentationincluding a PDF user guide is available fromhttpsdocsantismashsecondarymetabolitesorg Thesewebsites are free and open to all users and there is no loginrequirement The antiSMASH source code is available from

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 4: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

Nucleic Acids Research 2019 Vol 47 Web Server issue W83

records that were submitted to NCBI outside of whole-genome submissions This provides several advantages Theabundance of entries for selected generaspecies in the pub-lic databases (and thus also in the previous ClusterBlastdatabase) is strongly skewed towards clinically or industri-ally relevant organisms There are for example more than15 000 assemblies for Escherichia coli deposited at NCBIFor the antiSMASH database a sequence-based dereplica-tion workflow was established (14) that reduced the num-ber of redundant entries with very high sequence similarityThus the updated ClusterBlast database contains fewer en-tries than the previous release despite the increase in pub-licly available sequence data This decrease has resulted inreduced computation times while simultaneously providingmore relevant hits Furthermore as the entries of the Clus-terBlast database are directly related to the BGCs in the an-tiSMASH database a link to the respective BGC is now in-cluded for all ClusterBlast hits promptly directing the userto the detailed report of the similar gene clusters

New lsquoregionrsquo concept

In previous versions antiSMASH referred to all co-locatedhybrid and independent BGCs with the single label lsquoclusterrsquoIn many cases this led to confusing structure predictionswhen distinct BGCs are encoded side-by-side For examplemany Streptomyces plasmids exist for which all BGCs lie soclose to each other that all were joined into a single largelsquoclusterrsquo In order to better distinguish the different biolog-ical options that lead to BGCs antiSMASH 5 introducessome new terminology

The definitions now used in antiSMASH 5 areCore The minimum area containing one or more genes

that code for enzymes for a single BGC type that are de-tected by the manually curated detection rules These genesdo not have to be contiguous but can be within a certaincutoff distance as defined by the detection rule for the BGCtype in question

Neighbourhood Distance up- and downstream of thecluster core that is used to find tailoring genesenzymesthe neighbourhood distances for the individual biosynthetictypes were empirically determined and defined in the detec-tion rules

Protocluster Contains core + neighbourhoods at bothsides of the core each protocluster always will have one sin-gle product type (for example NRPS) Protoclusters mayoverlap partially or completely with other protoclusters Inthe result webpage protoclusters are displayed as boxesabove the gene arrows The cores are shown as solid colourboxes the neighbourhoods are the half-transparent areasaround the cores

Candidate cluster Contains one or more protoclustersthe candidate clusters are defined as described below Thesedefinitions better allow modelling of hybrid clusters suchas PKSNRPS hybrids which combine two or more differ-ent biosynthetic classes (as identified in the detection rules)or cases where one class is used to biosynthesize a precur-sor for a second class An example of the latter is found inglycopeptide biosynthesis where one of the amino acids issynthesized by a type III PKS which is then incorporatedinto the product by a NRPS Candidate clusters may overlap

partially or completely with other candidate clusters In theresult webpage candidate clusters are shown as boxes abovethe protoclusters

Region Contains one or more candidate clusters The re-gions in antiSMASH 5 correspond to the entities calledlsquoclustersrsquo in antiSMASH 1 ndash 4 and now constitute what isdisplayed on a page of the results webpage Sometimes a re-gion will contain multiple mutually exclusive candidate clus-ters in such cases comparative genomic analysis andor ex-perimental work is required to assess which of these candi-date clusters constitute actual BGCs Regions will not over-lap with each other At least one of the contained candidateclusters will cover the full length of the region

There are four kinds of candidate clusters chemical hy-brids interleaved neighbouring and single

Chemical hybrid candidate clusters contain at least twoprotoclusters that share at least one gene that codes for en-zymes of two or more separate BGC types (eg a single genecoding for type I PKS and NRPS modules) (Figure 1A) Anexample of this type are hybrid PKSNRPSs Please notethat this type of candidate cluster can also include protoclus-ters within that shared range that do not share a coding se-quence provided that they are completely contained withinthe candidate cluster

Interleaved candidate clusters contain protoclusters thatdo not share cluster-type-defining coding sequences buttheir core locations overlap (Figure 1B)

Neighbouring candidate clusters contain protoclusterswhich transitively overlap in their neighbourhoods (Figure1C)

Single candidate clusters (Figure 1D) exist for consistencyof access they contain only a single protocluster Note thatindividual protoclusters can be contained by more than onecandidate cluster (typically a neighbouring candidate clusterand one of single interleaved or chemical hybrid)

Each candidate cluster assignment is transitive for ex-ample if a protocluster would form a chemical hybrid witheach of two neighbouring protoclusters but these neigh-bours would not form a chemical hybrid on their own allthree together will still form a chemical hybrid candidatecluster

Improved user interface

A central aim of antiSMASH is to provide very detailedand specific information via an easy to use and under-stand user interface (UI) The UI remained principally un-changed from the initial release of antiSMASH in 2011 de-spite the increased functionality added with each new ver-sion In this version we have modernized the UI using up-dated web technologies that allow a better structuring ofthe result-content of the antiSMASH results pages For re-designing the UI it was important that the reliable andwell-established look-and-feel was conserved while also re-taining the ability to download the whole web-based resultsfolder and to display it locally in a variety of web-browsers

We and others (such as (40)) have realized that anti-SMASH results using the heuristic ClusterFinder algorithm(41) were more often than not wrongly interpreted At thesame time ClusterFinder contributed significantly to thecomputational workload For these reasons we decided to

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W84 Nucleic Acids Research 2019 Vol 47 Web Server issue

Figure 1 Candidate cluster types 1234 Greyyellow gene involved in protocluster AB (A) Chemical Hybrids Since cluster type A and cluster typeB share a CDS that defines those protoclusters they are classified as lsquochemical hybridrsquo (B) Interleaved Since none of the protoclusters share any definingCDS with any other protocluster it is not annotated as a chemical hybrid even though the biosynthetic product may or may not be The two protoclustersform an interleaved candidate clusters since the core of A overlaps with the core of B (C) Neighbouring Neighbouring candidate clusters are defined if theneighbourhoods of two protoclusters but not their cores overlap (D) Singles If protoclusters donrsquot have any overlaprelation with other protoclusters theterm single candidate cluster is assigned

remove this feature from the pubic antiSMASH web serverIt is of course still included in the download version of an-tiSMASH and can be enabled via the command line

In the Regions overview section (Figure 2) a graph-ical overview showing the location of the identified re-gions on the chromosomeplasmidscaffoldscontig is dis-played In the detailed view regions that are located oncontig-borders are now clearly labelled This often indi-cates that parts of the BGC are missing or that several sec-tions of a BGC are located on different contigs and aretherefore reported individually (for a more detailed discus-sion on this phenomenon please see (42)) For the firsttime antiSMASH 5 now offers interactive browsing of the

BGCs including selection of lsquofunctionalrsquo units ie core en-zymes transporters etc zooming to individual genes orregionscandidate clustersprotoclusters Details of the se-lection are now provided in side panels instead of pop-upwindows using a hierarchical view of the analysis sum-maries (which can be expanded by clicking lsquo+rsquo) to provideadditional details For the display of the PKSNRPS do-main organization the user now can choose whether tolimit the shown domains to the currently selected genes orjust display the results of the selected gene(s)enzyme(s)Furthermore the information is now organized in lsquotabsrsquothat do not require scrolling down along an often very longresults page

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W85

Figure 2 Screenshot of the antiSMASH 5 user interface (example NCBI-acc Y16952 balhimycin BGC) The new region overview now allowspanningzooming The candidate cluster and protocluster boxes are explained in the lsquonew region conceptrsquo section above Information about the currentlyselected gene are displayed at the right lsquoGene detailsrsquo panel For PKS or NRPS regions the detailed domain annotation is displayed by pressing the tabsusers can select the domain overview (shown) or the ClusterBlast KnownClusterBlast or SubClusterBlast results At the right the structure predictionand details of specificity predictions are displayed upon selecting the plus sign

CODE REFACTORING AND SPEED-UP

Large parts of the pre-antiSMASH 5 code base were stillderived from antiSMASH version 1 which was released in2011 In order to maintain future compatibility the anti-SMASH code base had to be migrated from python 27which will reach end-of-life in 2020 to the current versions35ndash37 As this transition required significant modificationto the antiSMASH code we decided to take this as a chanceto completely rewrite the software with a special considera-tion on runtime code stability and code maintainability Aunit test and integration test framework was implementedthat covers most parts of the antiSMASH 5 code allowing amuch easier debugging andndashndashmost importantlyndashndashextensionof the code while at the same time ensuring that new fea-tures do not negatively impact the results of existing mod-ules For some of the externally contributed modules (Sand-puma trans-AT PKS comparisons terpene PrediCAT) ourcontributors are currently preparing updated and compli-ant versions which will be added to antiSMASH 5 in mi-nor releases once they are finished and tested Like the ear-lier antiSMASH versions antiSMASH 5 provides the anal-ysis results in an interactive webpage and richly annotatedGenBank-format files for the whole genome and individualclusters As a new feature in version 5 all data are also avail-able as a computer readable JSON container which allowsthird party tools to easily process antiSMASH annotationsThis JSON output has superseded some other output typessuch as BioSynML and XLS

In addition to the advantages mentioned above the coderefactoring and cleanup has also led to a significant speed

increase of the new version by a factor of 4-11times (dependingon genome and selected options) instead of waiting timesof several hours antiSMASH results are now usually deliv-ered within 30ndash40 min after the start of the job for a typicalsubmission at the public web server

CONCLUSIONS AND FUTURE PERSPECTIVES

With the help of software like antiSMASH genome miningfor specialized metabolites has established itself as a com-plementary approach for the identification of novel metabo-lites which is routinely used within the natural products re-search community and increasingly applied in related fieldssuch as metagenomics environmental biology or metabolicengineering With the improvements to the antiSMASHuser interface and performance we keep pace with these de-velopments Furthermore the complete refactoring of theantiSMASH 5 code base will allow us to increasingly useantiSMASH as a tool that provides analysis data on whichother software can perform additional analyses

DATA AVAILABILITY

antiSMASH is available from httpsantismashsecondarymetabolitesorg (bacterial version) orhttpsfungismashsecondarymetabolitesorg (fun-gal version) The antiSMASH documentationincluding a PDF user guide is available fromhttpsdocsantismashsecondarymetabolitesorg Thesewebsites are free and open to all users and there is no loginrequirement The antiSMASH source code is available from

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 5: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

W84 Nucleic Acids Research 2019 Vol 47 Web Server issue

Figure 1 Candidate cluster types 1234 Greyyellow gene involved in protocluster AB (A) Chemical Hybrids Since cluster type A and cluster typeB share a CDS that defines those protoclusters they are classified as lsquochemical hybridrsquo (B) Interleaved Since none of the protoclusters share any definingCDS with any other protocluster it is not annotated as a chemical hybrid even though the biosynthetic product may or may not be The two protoclustersform an interleaved candidate clusters since the core of A overlaps with the core of B (C) Neighbouring Neighbouring candidate clusters are defined if theneighbourhoods of two protoclusters but not their cores overlap (D) Singles If protoclusters donrsquot have any overlaprelation with other protoclusters theterm single candidate cluster is assigned

remove this feature from the pubic antiSMASH web serverIt is of course still included in the download version of an-tiSMASH and can be enabled via the command line

In the Regions overview section (Figure 2) a graph-ical overview showing the location of the identified re-gions on the chromosomeplasmidscaffoldscontig is dis-played In the detailed view regions that are located oncontig-borders are now clearly labelled This often indi-cates that parts of the BGC are missing or that several sec-tions of a BGC are located on different contigs and aretherefore reported individually (for a more detailed discus-sion on this phenomenon please see (42)) For the firsttime antiSMASH 5 now offers interactive browsing of the

BGCs including selection of lsquofunctionalrsquo units ie core en-zymes transporters etc zooming to individual genes orregionscandidate clustersprotoclusters Details of the se-lection are now provided in side panels instead of pop-upwindows using a hierarchical view of the analysis sum-maries (which can be expanded by clicking lsquo+rsquo) to provideadditional details For the display of the PKSNRPS do-main organization the user now can choose whether tolimit the shown domains to the currently selected genes orjust display the results of the selected gene(s)enzyme(s)Furthermore the information is now organized in lsquotabsrsquothat do not require scrolling down along an often very longresults page

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W85

Figure 2 Screenshot of the antiSMASH 5 user interface (example NCBI-acc Y16952 balhimycin BGC) The new region overview now allowspanningzooming The candidate cluster and protocluster boxes are explained in the lsquonew region conceptrsquo section above Information about the currentlyselected gene are displayed at the right lsquoGene detailsrsquo panel For PKS or NRPS regions the detailed domain annotation is displayed by pressing the tabsusers can select the domain overview (shown) or the ClusterBlast KnownClusterBlast or SubClusterBlast results At the right the structure predictionand details of specificity predictions are displayed upon selecting the plus sign

CODE REFACTORING AND SPEED-UP

Large parts of the pre-antiSMASH 5 code base were stillderived from antiSMASH version 1 which was released in2011 In order to maintain future compatibility the anti-SMASH code base had to be migrated from python 27which will reach end-of-life in 2020 to the current versions35ndash37 As this transition required significant modificationto the antiSMASH code we decided to take this as a chanceto completely rewrite the software with a special considera-tion on runtime code stability and code maintainability Aunit test and integration test framework was implementedthat covers most parts of the antiSMASH 5 code allowing amuch easier debugging andndashndashmost importantlyndashndashextensionof the code while at the same time ensuring that new fea-tures do not negatively impact the results of existing mod-ules For some of the externally contributed modules (Sand-puma trans-AT PKS comparisons terpene PrediCAT) ourcontributors are currently preparing updated and compli-ant versions which will be added to antiSMASH 5 in mi-nor releases once they are finished and tested Like the ear-lier antiSMASH versions antiSMASH 5 provides the anal-ysis results in an interactive webpage and richly annotatedGenBank-format files for the whole genome and individualclusters As a new feature in version 5 all data are also avail-able as a computer readable JSON container which allowsthird party tools to easily process antiSMASH annotationsThis JSON output has superseded some other output typessuch as BioSynML and XLS

In addition to the advantages mentioned above the coderefactoring and cleanup has also led to a significant speed

increase of the new version by a factor of 4-11times (dependingon genome and selected options) instead of waiting timesof several hours antiSMASH results are now usually deliv-ered within 30ndash40 min after the start of the job for a typicalsubmission at the public web server

CONCLUSIONS AND FUTURE PERSPECTIVES

With the help of software like antiSMASH genome miningfor specialized metabolites has established itself as a com-plementary approach for the identification of novel metabo-lites which is routinely used within the natural products re-search community and increasingly applied in related fieldssuch as metagenomics environmental biology or metabolicengineering With the improvements to the antiSMASHuser interface and performance we keep pace with these de-velopments Furthermore the complete refactoring of theantiSMASH 5 code base will allow us to increasingly useantiSMASH as a tool that provides analysis data on whichother software can perform additional analyses

DATA AVAILABILITY

antiSMASH is available from httpsantismashsecondarymetabolitesorg (bacterial version) orhttpsfungismashsecondarymetabolitesorg (fun-gal version) The antiSMASH documentationincluding a PDF user guide is available fromhttpsdocsantismashsecondarymetabolitesorg Thesewebsites are free and open to all users and there is no loginrequirement The antiSMASH source code is available from

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 6: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

Nucleic Acids Research 2019 Vol 47 Web Server issue W85

Figure 2 Screenshot of the antiSMASH 5 user interface (example NCBI-acc Y16952 balhimycin BGC) The new region overview now allowspanningzooming The candidate cluster and protocluster boxes are explained in the lsquonew region conceptrsquo section above Information about the currentlyselected gene are displayed at the right lsquoGene detailsrsquo panel For PKS or NRPS regions the detailed domain annotation is displayed by pressing the tabsusers can select the domain overview (shown) or the ClusterBlast KnownClusterBlast or SubClusterBlast results At the right the structure predictionand details of specificity predictions are displayed upon selecting the plus sign

CODE REFACTORING AND SPEED-UP

Large parts of the pre-antiSMASH 5 code base were stillderived from antiSMASH version 1 which was released in2011 In order to maintain future compatibility the anti-SMASH code base had to be migrated from python 27which will reach end-of-life in 2020 to the current versions35ndash37 As this transition required significant modificationto the antiSMASH code we decided to take this as a chanceto completely rewrite the software with a special considera-tion on runtime code stability and code maintainability Aunit test and integration test framework was implementedthat covers most parts of the antiSMASH 5 code allowing amuch easier debugging andndashndashmost importantlyndashndashextensionof the code while at the same time ensuring that new fea-tures do not negatively impact the results of existing mod-ules For some of the externally contributed modules (Sand-puma trans-AT PKS comparisons terpene PrediCAT) ourcontributors are currently preparing updated and compli-ant versions which will be added to antiSMASH 5 in mi-nor releases once they are finished and tested Like the ear-lier antiSMASH versions antiSMASH 5 provides the anal-ysis results in an interactive webpage and richly annotatedGenBank-format files for the whole genome and individualclusters As a new feature in version 5 all data are also avail-able as a computer readable JSON container which allowsthird party tools to easily process antiSMASH annotationsThis JSON output has superseded some other output typessuch as BioSynML and XLS

In addition to the advantages mentioned above the coderefactoring and cleanup has also led to a significant speed

increase of the new version by a factor of 4-11times (dependingon genome and selected options) instead of waiting timesof several hours antiSMASH results are now usually deliv-ered within 30ndash40 min after the start of the job for a typicalsubmission at the public web server

CONCLUSIONS AND FUTURE PERSPECTIVES

With the help of software like antiSMASH genome miningfor specialized metabolites has established itself as a com-plementary approach for the identification of novel metabo-lites which is routinely used within the natural products re-search community and increasingly applied in related fieldssuch as metagenomics environmental biology or metabolicengineering With the improvements to the antiSMASHuser interface and performance we keep pace with these de-velopments Furthermore the complete refactoring of theantiSMASH 5 code base will allow us to increasingly useantiSMASH as a tool that provides analysis data on whichother software can perform additional analyses

DATA AVAILABILITY

antiSMASH is available from httpsantismashsecondarymetabolitesorg (bacterial version) orhttpsfungismashsecondarymetabolitesorg (fun-gal version) The antiSMASH documentationincluding a PDF user guide is available fromhttpsdocsantismashsecondarymetabolitesorg Thesewebsites are free and open to all users and there is no loginrequirement The antiSMASH source code is available from

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 7: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

W86 Nucleic Acids Research 2019 Vol 47 Web Server issue

httpsgithubcomantismashantismash antiSMASH isalso available via Docker

ACKNOWLEDGEMENTS

We thank Justin JJ van der Hooft for critical comments onthe manuscript and providing documentation and EmiliaPalazzotto and Tetiana Gren for helpful discussions anduser testing of the new features

FUNDING

Novo Nordisk Foundation [NNF10CC1016517 toSYLTW NNF16OC0021746 to TW] Center forMicrobial Secondary Metabolites (CeMiSt) Danish Na-tional Research Foundation [DNRF137 to TW] Reinholdand Maria Teufel Foundation (to KS) Funding for openaccess charge The Novo Nordisk FoundationConflict of interest statement None declared

REFERENCES1 NewmanDJ and CraggGM (2016) Natural products as sources of

new drugs from 1981 to 2014 J Nat Prod 79 629ndash6612 van der MeijA WorsleySF HutchingsMI and van WezelGP

(2017) Chemical ecology of antibiotic production by actinomycetesFEMS Microbiol Rev 41 392ndash416

3 ZiemertN AlanjaryM and WeberT (2016) The evolution ofgenome mining in microbes - a review Nat Prod Rep 33 988ndash1005

4 WeberT RauschC LopezP HoofI GaykovaV HusonDHand WohllebenW (2009) CLUSEAN a computer-based frameworkfor the automated analysis of bacterial secondary metabolitebiosynthetic gene clusters J Biotechnol 140 13ndash17

5 SkinniderMA MerwinNJ JohnstonCW and MagarveyNA(2017) PRISM 3 expanded prediction of natural product chemicalstructures from microbial genomes Nucleic Acids Res 45W49ndashW54

6 MedemaMH BlinK CimermancicP de JagerV ZakrzewskiPFischbachMA WeberT TakanoE and BreitlingR (2011)antiSMASH rapid identification annotation and analysis ofsecondary metabolite biosynthesis gene clusters in bacterial andfungal genome sequences Nucleic Acids Res 39 W339ndashW346

7 BlinK MedemaMH KazempourD FischbachMABreitlingR TakanoE and WeberT (2013) antiSMASH 20ndashndashaversatile platform for genome mining of secondary metaboliteproducers Nucleic Acids Res 41 W204ndashW212

8 WeberT BlinK DuddelaS KrugD KimHU BruccoleriRLeeSY FischbachMA MullerR WohllebenW et al (2015)antiSMASH 30ndashndasha comprehensive resource for the genome miningof biosynthetic gene clusters Nucleic Acids Res 43 W237ndashW243

9 BlinK WolfT ChevretteMG LuX SchwalenCJKautsarSA Suarez DuranHG de Los SantosELC KimHUNaveM et al (2017) antiSMASH 40-improvements in chemistryprediction and gene cluster boundary identification Nucleic AcidsRes 45 W36ndashW41

10 KautsarSA Suarez DuranHG BlinK OsbournA andMedemaMH (2017) plantiSMASH automated identificationannotation and expression analysis of plant biosynthetic geneclusters Nucleic Acids Res 45 W55ndashW63

11 BlinK KazempourD WohllebenW and WeberT (2014)Improved lanthipeptide detection and prediction for antiSMASHPLoS One 9 e89420

12 VillebroR ShawS BlinK and WeberT (2019) Sequence-basedclassification of type II polyketide synthase biosynthetic gene clustersfor antiSMASH J Ind Microbiol Biotechnol 46 469ndash475

13 BlinK MedemaMH KottmannR LeeSY and WeberT (2017)The antiSMASH database a comprehensive database of microbialsecondary metabolite biosynthetic gene clusters Nucleic Acids Res45 D555ndashD559

14 BlinK Pascal AndreuV de Los SantosELC Del CarratoreFLeeSY MedemaMH and WeberT (2019) The antiSMASHdatabase version 2 a comprehensive resource on secondarymetabolite biosynthetic gene clusters Nucleic Acids Res 47D625ndashD630

15 MedemaMH PaalvastY NguyenDD MelnikADorresteinPC TakanoE and BreitlingR (2014) Pep2Pathautomated mass spectrometry-guided genome mining of peptidicnatural products PLoS Comput Biol 10 e1003822

16 AlanjaryM KronmillerB AdamekM BlinK WeberTHusonD PhilmusB and ZiemertN (2017) The AntibioticResistant Target Seeker (ARTS) an exploration engine for antibioticcluster prioritization and novel drug target discovery Nucleic AcidsRes 45 W42ndashW48

17 BlinK PedersenLE WeberT and LeeSY (2016) CRISPy-webAn online resource to design sgRNAs for CRISPR applicationsSynth Syst Biotechnol 1 118ndash121

18 ShirleyWA KelleyBP PotierY KoschwanezJH BruccoleriRand TarselliM (2018) Unzipping natural products improved naturalproduct structure predictions by ensemble modeling and fingerprintmatching ChemRxiv doi httpdoi1026434chemrxiv6863864 26July 2018 preprint not peer reviewed

19 Navarro-MunozJ Selem-MojicaN MullowneyM KautsarSTryonJ ParkinsonE De Los SantosE YeongMCruz-MoralesP AbubuckerS et al (2018) A computationalframework for systematic exploration of biosynthetic diversity fromlarge-scale genomic data bioRxiv doi httpdoi101101445270 17October 2018 preprint not peer reviewed

20 FinnRD CoggillP EberhardtRY EddySR MistryJMitchellAL PotterSC PuntaM QureshiMSangrador-VegasA et al (2016) The Pfam protein families databasetowards a more sustainable future Nucleic Acids Res 44D279ndashD285

21 LetunicI and BorkP (2018) 20 years of the SMART protein domainannotation resource Nucleic Acids Res 46 D493ndashD496

22 de JongA van HeelAJ KokJ and KuipersOP (2010) BAGEL2mining for bacteriocins in genomic data Nucleic Acids Res 38W647ndashW651

23 YadavG GokhaleRS and MohantyD (2009) Towards predictionof metabolic products of polyketide synthases an in silico analysisPLoS Comput Biol 5 e1000351

24 CraigJW CherryMA and BradySF (2011) Long-chain N-acylamino acid synthases are linked to the putativePEP-CTERMexosortase protein-sorting system in Gram-negativebacteria J Bacteriol 193 5707ndash5715

25 RobinsonSL ChristensonJK and WackettLP (2018)Biosynthesis and chemical diversity of -lactone natural productsNat Prod Rep 36 458ndash475

26 AgarwalV BlantonJM PodellS TatonA SchornMABuschJ LinZ SchmidtEW JensenPR PaulVJ et al (2017)Metagenomic discovery of polybrominated diphenyl etherbiosynthesis by marine sponges Nat Chem Biol 13 537ndash543

27 SosioM GaspariE IorioM PessinaS MedemaMHBernasconiA SimoneM MaffioliSI EbrightRH andDonadioS (2018) Analysis of the Pseudouridimycin biosyntheticpathway provides Insights into the formation of C-nucleosideantibiotics Cell Chem Biol 25 540ndash549

28 BauerJS GhequireMGK NettM JostenM SahlH-G DeMotR and GrossH (2015) Biosynthetic origin of the antibioticpseudopyronines A and B in Pseudomonas putida BW11M1Chembiochem 16 2491ndash2497

29 LuoH Hallen-AdamsHE Scott-CraigJS and WaltonJD (2012)Ribosomal biosynthesis of -amanitin in Galerina marginata FungalGenet Biol 49 123ndash129

30 NaganoN UmemuraM IzumikawaM KawanoJ IshiiTKikuchiM TomiiK KumagaiT YoshimiA MachidaM et al(2016) Class of cyclic ribosomal peptide synthetic genes infilamentous fungi Fungal Genet Biol 86 58ndash70

31 DingW LiuW-Q JiaY LiY van der DonkWA and ZhangQ(2016) Biosynthetic investigation of phomopsins reveals a widespreadpathway for ribosomal natural products in Ascomycetes Proc NatlAcad Sci USA 113 3521ndash3526

32 BushinLB ClarkKA PelczerI and SeyedsayamdostMR (2018)Charting an unexplored streptococcal biosynthetic landscape reveals

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019

Page 8: antiSMASH 5.0: updates to the secondary metabolite genome ...genome mining pipeline Kai Blin1,SimonShaw1, Katharina Steinke2, Rasmus Villebro1, Nadine Ziemert2, Sang Yup Lee1,3, Marnix

Nucleic Acids Research 2019 Vol 47 Web Server issue W87

a unique peptide cyclization motif J Am Chem Soc 14017674ndash17684

33 CarusoA BushinLB ClarkKA MartinieRJ andSeyedsayamdostMR (2019) A radical approach to enzymatic-Thioether bond formation J Am Chem Soc 141 990ndash997

34 GibsonMK ForsbergKJ and DantasG (2015) Improvedannotation of antibiotic resistance determinants reveals microbialresistomes cluster by ecology ISME J 9 207ndash216

35 YiG SzeSH and ThonMR (2007) Identifying clusters offunctionally related genes in genomes Bioinformatics 23 1053ndash1060

36 InglisDO BinkleyJ SkrzypekMS ArnaudMBCerqueiraGC ShahP WymoreF WortmanJR and SherlockG(2013) Comprehensive annotation of secondary metabolitebiosynthetic genes and gene clusters of Aspergillus nidulans Afumigatus A niger and A oryzae BMC Microbiol 13 91

37 The Gene Ontology Consortium (2016) Expansion of the GeneOntology knowledgebase and resources Nucleic Acids Res 45D331ndashD338

38 MedemaMH KottmannR YilmazP CummingsMBigginsJB BlinK de BruijnI ChooiYH ClaesenJ

CoatesRC et al (2015) Minimum information about a biosyntheticgene cluster Nat Chem Biol 11 625ndash631

39 MedemaMH TakanoE and BreitlingR (2013) Detectingsequence homology at the gene cluster level with MultiGeneBlastMol Biol Evol 30 1218ndash1223

40 BaltzRH (2018) Natural product drug discovery in the genomic erarealities conjectures misconceptions and opportunities J IndMicrobiol Biotechnol 46 281ndash299

41 CimermancicP MedemaMH ClaesenJ KuritaK WielandBrownLC MavrommatisK PatiA GodfreyPA KoehrsenMClardyJ et al (2014) Insights into secondary metabolism from aglobal analysis of prokaryotic biosynthetic gene clusters Cell 158412ndash421

42 BlinK KimHU MedemaMH and WeberT (2017) Recentdevelopment of antiSMASH and other computational approaches tomine secondary metabolite biosynthetic gene clusters BriefBioinform doi101093bibbbx146

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract47W

1W815481154 by D

TU Library - Technical Inform

ation Center of D

enmark user on 18 July 2019