DOOR: a prokaryotic operon database for genome analyses...

8
DOOR: a prokaryotic operon database for genome analyses and functional inference Huansheng Cao*, Qin Ma*, Xin Chen and Ying Xu Corresponding author: Ying Xu, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA. Tel. 1-706-542-9779. E-mail: xyn@uga.edu *These authors contributed equally to this work. Abstract The rapid accumulation of fully sequenced prokaryotic genomes provides unprecedented information for biological studies of bacterial and archaeal organisms in a systematic manner. Operons are the basic functional units for conducting such studies. Here, we review an operon database DOOR (the Database of prOkaryotic OpeRons) that we have previously developed and continue to update. Currently, the database contains 6 975 454 computationally predicted operons in 2072 complete genomes. In addition, the database also contains the following information: (i) transcriptional units for 24 genomes derived using publicly available transcriptomic data; (ii) orthologous gene mapping across genomes; (iii) 6408 cis- regulatory motifs for transcriptional factors of some operons for 203 genomes; (iv) 3 456 718 Rho-independent terminators for 2072 genomes; as well as (v) a suite of tools in support of applications of the predicted operons. In this review, we will ex- plain how such data are computationally derived and demonstrate how they can be used to derive a wide range of higher- level information needed for systems biology studies to tackle complex and fundamental biology questions. Key words: DOOR; operon database; transcription unit; genomic organization; gene regulation Introduction The rapid advancement of genome-scale sequencing tech- niques in the past decade has made it possible to have a micro- bial genome sequenced in a highly efficient and affordable manner. It has now become routine to have such genomes sequenced by individual labs. To date, 6917 complete genomes and 40 257 draft genomes have been sequenced and organized into the NCBI Genome database. In addition, 5415 complete pro- karyotic genomes and 43 348 draft genomes are stored at the JGI Integrated Microbial Genome and Microbiome Samples (JGI IMG/M) database (https://img.jgi.doe.gov/cgi-bin/m/main.cgi) [1]. Over 175 million genes have been computationally predicted in the sequenced prokaryotic genomes, with 171 million (97.7%) being protein-encoding genes and 4 million (2.3%) RNA genes. These genomic data have enabled studies of a wide range of complex and previously unsolvable questions in a systematic manner, such as inferring and modeling the dynamic behaviors of regulatory networks in prokaryotic organisms. These studies become possible because of the unique property of prokaryotes, functionally closely related genes tend to be organized into op- erons, i.e. clusters of genes arranged in tandem on the same strand of a genome, which share a common promoter and ter- minator [2]. It is noteworthy to point out that the concept of operon has evolved since the discovery and characterization of the lac op- eron in Escherichia coli by French scientists Jacob and Monod in 1960 [2]. Based on the original definition, an extra technical cri- terion has been added to operon: operons encoded in a genome do not overlap with each other [3, 4]. This simplified definition Huansheng Cao is a Postdoctoral Researcher in the Department of Biochemistry and Molecular Biology and Institute of Bioinformatics at the University of Georgia. His research interests include microbial systems biology. Qin Ma is an Assistant Professor in the Department of Agronomy, Horticulture and Plant Science at South Dakota State University. He is interested in bio- informatic algorithms/tools development in transcriptional regulation. Xin Chen is an Assistant Professor in the Center for Applied Mathematics at Tianjin University, China; his research interests include microbial and plant genome analysis. Ying Xu is the Regents-GRA Eminent Scholar Chair Professor in the Department of Biochemistry and Molecular Biology and Institute of Bioinformatics of the University of Georgia. His interests include microbial and cancer systems biology. Submitted: 2 May 2017; Received (in revised form): 13 June 2017 V C The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] 1 Briefings in Bioinformatics, 2017, 1–8 doi: 10.1093/bib/bbx088 Software Review

Transcript of DOOR: a prokaryotic operon database for genome analyses...

Page 1: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

DOOR: a prokaryotic operon database for genome

analyses and functional inferenceHuansheng Cao*, Qin Ma*, Xin Chen and Ying XuCorresponding author: Ying Xu, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602,USA. Tel. 1-706-542-9779. E-mail: [email protected]*These authors contributed equally to this work.

Abstract

The rapid accumulation of fully sequenced prokaryotic genomes provides unprecedented information for biological studiesof bacterial and archaeal organisms in a systematic manner. Operons are the basic functional units for conducting suchstudies. Here, we review an operon database DOOR (the Database of prOkaryotic OpeRons) that we have previouslydeveloped and continue to update. Currently, the database contains 6 975 454 computationally predicted operons in 2072complete genomes. In addition, the database also contains the following information: (i) transcriptional units for 24genomes derived using publicly available transcriptomic data; (ii) orthologous gene mapping across genomes; (iii) 6408 cis-regulatory motifs for transcriptional factors of some operons for 203 genomes; (iv) 3 456 718 Rho-independent terminatorsfor 2072 genomes; as well as (v) a suite of tools in support of applications of the predicted operons. In this review, we will ex-plain how such data are computationally derived and demonstrate how they can be used to derive a wide range of higher-level information needed for systems biology studies to tackle complex and fundamental biology questions.

Key words: DOOR; operon database; transcription unit; genomic organization; gene regulation

Introduction

The rapid advancement of genome-scale sequencing tech-niques in the past decade has made it possible to have a micro-bial genome sequenced in a highly efficient and affordablemanner. It has now become routine to have such genomessequenced by individual labs. To date, 6917 complete genomesand 40 257 draft genomes have been sequenced and organizedinto the NCBI Genome database. In addition, 5415 complete pro-karyotic genomes and 43 348 draft genomes are stored at the JGIIntegrated Microbial Genome and Microbiome Samples (JGIIMG/M) database (https://img.jgi.doe.gov/cgi-bin/m/main.cgi)[1]. Over 175 million genes have been computationally predictedin the sequenced prokaryotic genomes, with 171 million (97.7%)being protein-encoding genes and 4 million (2.3%) RNA genes.

These genomic data have enabled studies of a wide range ofcomplex and previously unsolvable questions in a systematicmanner, such as inferring and modeling the dynamic behaviorsof regulatory networks in prokaryotic organisms. These studiesbecome possible because of the unique property of prokaryotes,functionally closely related genes tend to be organized into op-erons, i.e. clusters of genes arranged in tandem on the samestrand of a genome, which share a common promoter and ter-minator [2].

It is noteworthy to point out that the concept of operon hasevolved since the discovery and characterization of the lac op-eron in Escherichia coli by French scientists Jacob and Monod in1960 [2]. Based on the original definition, an extra technical cri-terion has been added to operon: operons encoded in a genomedo not overlap with each other [3, 4]. This simplified definition

Huansheng Cao is a Postdoctoral Researcher in the Department of Biochemistry and Molecular Biology and Institute of Bioinformatics at the University ofGeorgia. His research interests include microbial systems biology.Qin Ma is an Assistant Professor in the Department of Agronomy, Horticulture and Plant Science at South Dakota State University. He is interested in bio-informatic algorithms/tools development in transcriptional regulation.Xin Chen is an Assistant Professor in the Center for Applied Mathematics at Tianjin University, China; his research interests include microbial and plantgenome analysis.Ying Xu is the Regents-GRA Eminent Scholar Chair Professor in the Department of Biochemistry and Molecular Biology and Institute of Bioinformatics ofthe University of Georgia. His interests include microbial and cancer systems biology.Submitted: 2 May 2017; Received (in revised form): 13 June 2017

VC The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

1

Briefings in Bioinformatics, 2017, 1–8

doi: 10.1093/bib/bbx088Software Review

Page 2: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

has enabled computational scientists to predict operons basedon genomic sequence(s) alone [5]. Therefore, all the completeprokaryotic genomes have their operons predicted. However,the downside is: these computationally predicted operons onlyrepresent a small subset of all the operons encoded in a gen-ome. With the transcriptomic data becoming increasingly avail-able for prokaryotes, researchers have realized that an operonmay have different variations in terms of its component genesexpressed under different conditions, termed transcriptionalunits (TUs) [6]. Clearly, full elucidation of all the TUs of an op-eron is considerably more challenging than sequence-basedprediction of operons, as they are condition-dependent.Currently, there are no published algorithms for TU predictionbased on genomic sequences alone, as such capability needs tobe able to accurately identify promoter–terminator pairs thatare used under varying conditions. TUs are now being identifiedthrough analyses of transcriptomic data collected under a fewconditions only for a small number of genomes [7]. The currentversion of DOOR (the Database of prOkaryotic OpeRons) consistsof both types of operons: operons predicted based on sequencesand limited TUs identified using transcriptomic data [8].

Because of serving as the basic functional and TUs as well asthe backbones of TUs, operons are important elements for thestudy of gene regulation and functional adjustment to differentenvironments. For example, some important pathways tend tobe organized into a few transcriptionally co-regulated operons[9, 10]. Through these tightly co-expressed operons, the regula-tory motifs can be identified, which in turn will reveal globalregulatory circuits. Besides regulation of operons on a genomescale, comparative studies of these functional units can revealpathway conservation and operon evolution among prokary-otes [11, 12], and provide more accurate prediction of ortholo-gous genes across different prokaryotic genomes [11], comparedwith gene-centric methods for ortholog mapping. Furthermore,the positions of the operons in a genome and their functionalassociation provide a unique and practical angle to demonstratethe global organizing principle in these organisms [13].

In this article, we review the status of the DOOR database interms of its basic content, as well as a suite of tools that enablethe user to fully use the predicted operons and identified TUsfor functional studies of a prokaryotic organism. Some majorapplications through some of the third-party work that hasused DOOR are summarized. On this basis of DOOR and DOOR-supporting tools, we will introduce operon regulation, evolutionand organization on local and global scale in prokaryotes. Theentire contents covered are illustrated in Figure 1.

The DOOR databaseOperon databases and DOOR

Owing to this fundamental importance, several operon data-bases have been developed and deployed for public service,including OperonDB [14], ProOpDB [15], ODB [16], rrnDB [17] andDOOR [8, 18].

The first DOOR database was developed in 2006. Its operonprediction algorithm looks for a maximal sequence of adjacentgenes on the same DNA strand without disruption on the op-posite strand such that each pair of the adjacent genes has con-served neighboring relationship in some other genomes; theirintergenic distances are within some threshold; and are func-tionally related [8, 18]. This simple working definition of operonhas made the prediction both reliable and stable. Specifically,the core of the algorithm is a classification method. The

program classifies each pair of adjacent genes into two classes:in or not in the same operon, using five features: the intergenicdistance, conservation level of the two genes in the same neigh-borhood across other genomes, functional relatedness meas-ured using phylogenetic distances between the two genes, theratio between the lengths of the two genes and frequencies ofcertain predefined DNA motifs in their intergenic region [5].Among these features, the intergenic distance is the highestdiscerning feature in predicting if a pair of adjacent genes is inthe same operon. The training of the classifier was done on ex-perimentally validated operons in a few organisms, including E.coli and Bacillus subtilis. The operon prediction pipeline is givenin Figure 2.

Two separate classifiers are trained and used depending onwhether the target genome has a substantial number of experi-mentally identified operons. For those with such identified op-erons, a nonlinear decision tree-based classifier is used to makethe prediction. When tested on B. subtilis and E. coli genomes,the prediction program achieved accuracy levels at 90.2 and93.7%, respectively [5]. For genomes without such operons, a lin-ear logistic classifier is found to be most reliable, with the min-imum false-positive rates [5]. When tested on E. coli and B.subtilis without applying their operon information, this pre-dictor has prediction accuracies at 84.6 and 83.3%, respectively[7]. The prediction program of DOOR was ranked as the best op-eron predictor by an independent review in 2013 [19] in terms ofprediction accuracy [18].

The first version of DOOR had operons for 675 complete pro-karyotic genomes, totaling 450 986 operons, predicted using theabove algorithms. The DOOR database supports the followingtools to enable a user to use its operons for functional studies:(i) a search tool allowing a user to find operons by gene namesand operon size in desired genomes; (ii) a prediction capabilityfor cis-regulatory binding sites using a combination of two pre-dictors MEME [20] and CUBIC [21], which enables a user tosearch for potentially co-regulated operons in a genome;(iii) general statistics about operons encoded in each genomealong with the relevant literature; and (iv) an operonWiki to fa-cilitate interactions between a user and the developer [18].

DOOR2: more genomes and transcriptomic data

With the rapid increase in the complete prokaryotic genomesand the growing popularity of DOOR, we developed an updatedversion of the database, DOOR2, in 2013 [8]. Two major changesare made in this version. First, the number of genomes isincreased to 2072, three times of the original size, which con-sists of 1939 complete bacterial genomes and 133 completearchaeal genomes, covering 2205 chromosomes and 1645 plas-mids [8]. For these genomes, 1 323 902 multigene operons is pre-dicted, averaging 583 such operons per chromosome and 24operons per plasmid, along with 2 578 949 single-gene operons.In addition, the database has 6408 verified cis-regulatory bind-ing motifs for transcription factors spanning 203 genomes, and3 456 718 Rho-independent terminators for 2072 genomes.Furthermore, the database has 6 975 454 pairs of conserved op-erons across different genomes. Functionally, housekeepingbiological processes such as cell division, translation, electrontransport chain for aerobic/anaerobic respiration, ATP produc-tion and thiamine biosynthesis have the highest percentage ofconserved operons, as exemplified in Figure 3.

Given the increasing availability of RNA-seq data for com-plete genomes, DOOR2 also includes such data from the publicdomain, which can be used to elucidate TUs encoded in a

2 | Cao et al.

Page 3: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

genome. Currently, DOOR2 contains predicted TUs for 24 gen-omes, which we expect to increase significantly in the next re-lease of the database.

Like the previous version, DOOR2 provides the followingtools in support of basic and advanced applications of its operoninformation: (i) a predictor for TUs based on given operons andRNA-seq data of a specified genome, in support of inference ofthe dynamic variants of the operons under specified conditions;(ii) a computer program for mapping orthologous genes acrossgenomes, in support of analyses of operon evolution and biolo-gical pathway mapping across genomes; (iii) a search capability,developed using MySQL, to enable a user to retrieve informationstored in the database, including operons, TUs, cis-regulatorymotifs and terminators for individual operons and conservedoperons across multiple genomes using queries containing

terms such as species name, operon ID and/or gene name; (iv)operon predictors as described in the previous section, whichallows a user to predict operons in a genome that is not cur-rently in the DOOR2 database; and (v) a genome browser is alsodeveloped to support visualization of selected operons alongwith TU structures under multiple conditions, if such data areavailable [8].

For operon prediction in a specified genome, DOOR2 requiresthree input files: a file containing gene locations in the targetgenome (*.gff or *.gtf file), protein sequences (*.faa file) and thenucleotide sequence (*.fna file) of the entire genome, where thethree files are used to derive the genomic features needed totrain our classification model (Figure 2). For output, DOOR2automatically sends an e-mail to the user once the computingjob is done, which contains a link to the results page, containingvarious information as shown in Figure 4.

DOOR2 has been widely applied by researchers in a varietyof areas. Overall, three types of technical applications havebeen frequently used by users of the database and tools: operondata retrieval, operon prediction for newly sequenced genomesand utility tools for new information discovery (Figure 5). Theseapplications also span a wide range of research areas. For ex-ample, a specific application area is in biomedical research. Inthis regard, DOOR is usually used for prediction or retrieval foroperons involved in virulence [22] or regulation [23]. Besidesthese individual third-party applications, we introduce the gen-eral applications of our DOOR-based tools in further derivingbiology knowledge from available prokaryotic genomes in thefollowing sections.

From genomic structures to functionalinference

With the rapid accumulation of large quantities of gene expres-sion data generated for various prokaryotes, researchers canstudy the functionalities of the predicted operons in a system-atic manner. Here, we highlight a few examples to demonstratehow functional analyses can be done through integrated

Figure 1. The DOOR operon database and information derivable from the database. (A) A schematic for consecutive operons in two genomes (blue and magenta), which

are further grouped into uber-operons. (B) Predicted cis-regulatory motifs for operons. (C) TUs for DOOR operons, derived based on available transcriptomic data.

(D) Co-expressed genes and operons predicted through bi-clustering analyses. (E) New knowledge regarding an organizing principle that governs the global arrange-

ment of operons in a genome.

Figure 2. The computational pipeline for operon prediction in DOOR.

DOOR | 3

Page 4: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

analyses of operons and gene expression data, including TUprediction, cis-regulatory motif prediction and analysis andco-expression analyses of operons (Figure 1B–E).

TU prediction based on gene expression data

We have developed a program, SeqTU, for predicting TUs basedon provided RNA-seq data (Figure 1B and C) [6]. SeqTU predictsthe genomic boundaries of TUs based on two features measuringthe RNA-seq expression patterns across a target genome:expression-level continuity and variance. Intuitively, two consecu-tive genes in the same TU should not (i) have substantial gaps inthe expression profile across the intergenic region between thetwo genes, based on the definition of a TU; and (ii) have large vari-ations in their gene expression levels across the two coding re-gions. SeqTU predicts a pair of consecutive genes on the samegenomic strand to be in the same TU if both conditions are met[6]. A total of 2590 TUs have been predicted based on availableRNA-seq using SeqTU [6, 7] and stored in DOOR2. In total of 44% ofthese TUs have multiple genes. Figure 4B shows one example ofmultiple TUs of the same E. coli operon.

SeqTU has a training program that allows users to train anorganism-specific TU predictor based on user-provided gene ex-pression data. The software is now part of the DOOR2package [8].

Prediction and application of cis-regulatory motifs

We developed a computational algorithm BOBRO for the predic-tion of cis-regulatory binding sites in prokaryotic genomes [24]and implemented it as a Web server DMINDA [25, 26]. This Web-based tool has been integrated into the DOOR2 package in sup-port of cis-regulatory motif prediction for identified operons andTUs (Figure 1B). Such a capability has proved to be essential forelucidation of transcription regulatory networks encoded in aprokaryotic genome. The cis-motif prediction program runs asfollows: it first assesses the possibility of each position in a pro-moter region to be the start of a cis-regulatory motif, and it thendistinguishes the actual conserved motifs from the accidentalones based on the concept of motif closure [24]. These two stepsmake BOBRO substantially sensitive and selective in conservedmotif identification against a noisy background.

Figure 3. The main biological functions of conserved operons in E. coli. The scatterplot shows the cluster representatives based on the GO terms’ functional similarities,

where each bubble represents a GO Biological Function and overlapping bubbles are for overlapping GO functions (e.g. the electron transport chain for aerobic/anaer-

obic respiration). The size of a bubble represents the level of a term in the GO’s functional hierarchy, specifically the higher a term in the hierarchy, the larger the bub-

ble, and the color of a bubble indicates the P-value.

4 | Cao et al.

Page 5: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

A key application of such a prediction capability is to identifyoperons or TUs that share conserved cis-regulatory binding sitesin their promoter regions, for prediction of transcriptionally co-regulated and possibly functionally related operons. These op-erons or TUs constitute a regulon regulated by a transcriptionfactor that binds to such conserved motifs, which may be

involved in functionally associated pathways in response tospecific conditions. To further support such applications, wehave integrated a new feature: identification of co-occurring cis-regulatory motifs in the same promoter region, into the motifprediction package. This feature enables elucidation of co-regulated operons/TUs by multiple transcription factors [27, 28].

Figure 4. Displays of computational results by DOOR. (A) A screenshot of a display window. (B) A display of TUs, with the red bars representing genes, the first row of

the blue bars representing multigene operons and the following rows of blue bars being TUs under different conditions. (C) A display of validated or predicted tran-

scription factor binding sites (the left bottom) and Rho-independent terminators (on the right). (D) conserved operons.

Figure 5. The usage of DOOR and DOOR2, measured by the number of citations, and the types of applications of the DOOR information by third-party publications.

DOOR | 5

Page 6: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

Co-expression analyses of operons

A complementary approach to inference of transcriptionally co-regulated operons and TUs to the above motif-based method isthrough identification of co-expressions of such operons.Knowing that transcriptional co-regulation of operons/TUs aregenerally condition-dependent, we only aim to find operons/TUs that are co-expressed under some but not necessarily allconditions. This requires a class of more powerful clusteringtechniques than the traditional clustering methods, namely, bi-clustering algorithms, also referred to as two-dimensional clus-tering [29, 30]. These techniques should be able to detect groupsof operons/TUs, whose correlation (co-expression) exists under(to be identified) subsets of all conditions.

We have previously developed a bi-clustering algorithmQUBIC using a graph-theoretical approach [31–33] (Figure 1D).Compared with similar bi-clustering programs, QUBIC solvesthe general bi-clustering problems more efficiently, and is cap-able to solving problems with tens of thousands of genes andup to thousands of conditions in a few minutes of the CPU timeon a desktop computer. This program will be integrated into thenext version of the DOOR database (under development). Figure6 shows an example of two large bi-clusters (Figure 6A) andtheir levels of co-expressions (Figure 6B).

Operon evolution and global genomicorganization

Here, we show four examples to illustrate how operon data canbe used to derive higher-level and more complex informationencoded in prokaryotic genomes.

Uber-operons: the footprint of operon evolution

By studying conservation among groups of operons that arefunctionally related, we found that some unions of operonshave conserved gene list across genomes, with each such unionof operons called a uber-operon [34]. We have previously de-veloped an algorithm for uber-operon prediction through iden-tifying two groups of functionally related operons in a targetgenome versus a set of reference genomes, whose gene sets areconserved between the target and the reference genomes [12].Using this algorithm, we have predicted 158 uber-operons con-taining 1830 genes (�40% of all genes) in E. coli K12 against 90

reference genomes [12]. Interestingly, the level of functional re-latedness among genes in the same uber-operons is higher thanthat among genes in the same regulons but lower than thatamong genes in the same operons in E. coli K12. The functionalrelatedness is measured based on closeness of KEGG pathwaysand path distance of gene ontology (GO) terms of the genes [35].These different levels of functional relatedness between genescopes suggest that genes in the same uber-operons are func-tionally related. As the component operons in the same uber-operons may not necessarily be transcriptionally co-regulated,uber-operons provide a novel way to derive functionally associ-ated operons, such as pathways, independent of regulons [12],each of which represents a collection of transcriptionally co-regulated operons.

Orthologous gene identification based on operons

Accurate identification of orthologous genes across genomes isthe most basic step in comparative genomics but remains to besolved. We developed a computer tool GOST [11] for orthologousgene mapping across prokaryotic genomes. Compared with allthe earlier methods that solely rely on gene sequence similarity,GOST identifies orthologous genes among homologous geneswith an additional criterion: orthologous genes between twogenomes should have homologous working partners in their re-spective genomes. Two genes in a genome are considered asworking partners if they belong to a common operon or uber-operon [34, 35]. Figure 7 shows a few examples of accuratelyidentified orthologous genes across two genomes [12].

Pathway mapping based on conservation of operonsacross genomes

Accurate mapping of known biological pathways from one pro-karyotic organism to another is an important issue when anno-tating newly sequenced prokaryotic genomes. Existing methodsbased on gene mapping through sequence homology oftenleads to incorrect results because sequence similarities alone donot contain sufficient information for correct identification oforthologs. We have developed an algorithm, P-MAP, for path-way mapping across genomes of different prokaryotic species,which integrates both sequence similarity and genomic syntenyinformation defined by operons [36, 37]. Specifically, the algo-rithm identifies orthologous genes across genomes among

Figure 6. (A) Co-expressed genes/operons in two bi-clusters identified in E. coli under a stationary growth condition. (B) Co-expression networks among bi-clusters.

Green nodes represent genes in bi-cluster #1 and red nodes for genes in bi-cluster #2. The larger a node, the higher its degree of presence in the co-expression network,

and the thicker an edge, the higher the co-expression level.

6 | Cao et al.

Page 7: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

homologous genes, which also share homologous working part-ners. Our analysis on a number of known homologous path-ways shows that using genomic synteny information asconstraints can greatly improve the pathway-mapping accuracyover methods that use sequence similarity informationalone [36].

Elucidation of the global organizing principle of operonsin prokaryotic genomes

The availability of genome-scale operons has also made it pos-sible to study whether the global arrangement of operons in aprokaryotic genome is arbitrarily determined or follows someorganizing rules, beyond functional studies of prokaryotic gen-omes. We have previously conducted a series of studies on po-tential determinants in the global arrangement of operons in agenome [9, 13, 38]. One observation is that operons for pathwayswith higher frequencies of transcriptional activation tend to bemore closely clustered together in the host genome [13]. Furtheranalyses revealed that such pathways with higher frequenciesof transcriptional activation tend to have their operonsarranged in a fewer supercoils, the basic folding units in a pro-karyotic DNA, which are typically 10–15 kbps long [10]. Our pos-tulation is that prokaryotes have evolved to minimize the totalenergy needed for folding and unfolding supercoils to enabletranscription of operons that need to be activated throughoutthe life cycle of the organism. Our computational simulationanalyses through randomly reshuffling a genome strongly con-firmed our hypothesis [9, 13, 38]. This represents the first deter-minant for the global arrangements of operons in a prokaryoticgenome.

Summary and outlook

Operons and TUs are the basic functional units in prokaryotes.Their accurate identification serves as the basis of a wide rangeof biological studies of prokaryotes. We have developed andmaintained an operon database DOOR in the past decade as ourservice to the research community of bacterial and archaeal or-ganisms. Here, we reviewed the information stored in the DOORdatabase along with a suite of tools associated with the data-base in support of operon-based analyses and information dis-covery. In addition, we have showed a few examples of howoperons can be used to not only derive higher-level functionalinformation encoded in a genome but also enable computa-tional studies of fundamental principles in terms of the globalorganization of operons in a genome. A major challenge to us,the developer of the DOOR database, is to keep up with the rateof prokaryotic genome sequencing. We are currently workingtoward the next version of the DOOR database to include �7000complete prokaryotic genomes, along with deployment of add-itional tools needed to mine such large collections of operonsand TUs.

Key Points

• DOOR is an operon database consisting of 1 323 902multigene operons and 2 578 949 single-gene operonsin 2072 complete genomes. To facilitate using andmining the data in DOOR, a suite of utility tools hasbeen developed and integrated into DOOR.

• With the increasing availability of genome-scale tran-scriptomic data for prokaryotes, increasingly moreTUs have been integrated into DOOR, to facilitatesystems-level studies of the biology of prokaryotes.

• The availability of the genome-scale operons and TUshas enabled a new class of functional analyses of pro-karyotes as well as fundamental research into thegeneral principles of prokaryotic genomes.

Funding

This work was supported by a grant from the U.S. Departmentof Energy (grant number #DE-PS02-06ER64304). The BioEnergyScience Center (BESC) is supported by the Office of Biologicaland Environmental Research in the DOE Office of Science. Thiswork was also supported by National Science Foundation/EPSCoR Award No. IIA-1355423, the State of South DakotaResearch Innovation Center, and the Agriculture ExperimentStation of South Dakota State University.

References1. Chen IMA, Markowitz VM, Chu K, et al. IMG/M: integrated gen-

ome and metagenome comparative data analysis system.Nucleic Acids Res 2016;45:D507–16.

2. Jacob F, Perrin D, S�anchez C, et al. The operon: a group ofgenes with expression coordinated by an operator. C R AcadSci Paris 1960;250:1727–9.

3. Salgado H, Moreno-Hagelsieb G, Smith TF, et al. Operons inEscherichia coli: genomic analyses and predictions. Proc NatlAcad Sci USA 2000;97:6652–7.

4. Craven M, Page D, Shavlik JW, et al. A probabilistic learningapproach to whole-genome operon prediction. In: Proceedingsof International Conference on Intelligent Systems for Molecular

Figure 7. Two examples of orthologous gene prediction. (A) The two red block

arrows represent two orthologous genes, which are identified based on the in-

formation that they share homologous working partners, represented by the

yellow block arrows, for which the popular program RBH failed to identify.

(B) Two red block arrows represent two orthologous genes, identified based on

the information that they share homologous working partners (the yellow block

arrows).

DOOR | 7

Page 8: DOOR: a prokaryotic operon database for genome analyses ...pdfs.semanticscholar.org/b8a0/1c65404effaaa98e413a... · DOOR: a prokaryotic operon database for genome analyses and functional

Biology, 2000, p. 116–27. USA, AAAI Press, San Diego,California.

5. Dam P, Olman V, Harris K, et al. Operon prediction using bothgenome-specific and general genomic information. NucleicAcids Res 2007;35:288–98.

6. Chou W-C, Ma Q, Yang S, et al. Analysis of strand-specificRNA-seq data using machine learning reveals the structuresof transcription units in Clostridium thermocellum. Nucleic AcidsRes 2015;43:e67.

7. Chen X, Chou W-C, Ma Q, et al. SeqTU: a web server for iden-tification of bacterial transcription units. Sci Rep 2017;7:43925.

8. Mao X, Ma Q, Zhou C, et al. DOOR 2.0: presenting operons andtheir functions through dynamic and integrated views.Nucleic Acids Res 2014;42:D654–9.

9. Ma Q, Xu Y. Global genomic arrangement of bacterial genes isclosely tied with the total transcriptional efficiency. GenomicsProteomics Bioinformatics 2013;11:66–71.

10.Ma Q, Yin Y, Schell MA, et al. Computational analyses of tran-scriptomic data reveal the dynamic organization of theEscherichia coli chromosome under different conditions.Nucleic Acids Res 2013;41:5594–603.

11.Li G, Ma Q, Mao X, et al. Integration of sequence-similarityand functional association information can overcome intrin-sic problems in orthology mapping across bacterial genomes.Nucleic Acids Res 2011;39:e150.

12.Che D, Li G, Mao F, et al. Detecting uber-operons in prokaryoticgenomes. Nucleic Acids Res 2006;34:2418–27.

13.Yin Y, Zhang H, Olman V, et al. Genomic arrangement of bac-terial operons is constrained by biological pathways encodedin the genome. Proc Natl Acad Sci USA 2010;107:6310–15.

14.Pertea M, Ayanbule K, Smedinghoff M, et al. OperonDB: a com-prehensive database of predicted operons in microbial gen-omes. Nucleic Acids Res 2009;37:D479–82.

15.Taboada B, Ciria R, Martinez-Guerrero CE, et al. ProOpDB: pro-karyotic operon database. Nucleic Acids Res 2012;40:D627–31.

16.Okuda S, Katayama T, Kawashima S, et al. ODB: a database ofoperons accumulating known operons across multiple gen-omes. Nucleic Acids Res 2006;34:D358–62.

17.Stoddard SF, Smith BJ, Hein R, et al. rrnDB: improved tools forinterpreting rRNA gene abundance in bacteria and archaeaand a new foundation for future development. Nucleic AcidsRes 2015;43:D593–8.

18.Mao F, Dam P, Chou J, et al. DOOR: a database for prokaryoticoperons. Nucleic Acids Res 2009;37:D459–63.

19.Brouwer RWW, Kuipers OP, van Hijum SAFT. The relativevalue of operon predictions. Brief Bioinform 2008;9:367–75.

20.Bailey TL, Williams N, Misleh C, et al. MEME: discovering andanalyzing DNA and protein sequence motifs. Nucleic Acids Res2006;34:W369–W373.

21.Olman V, Xu D, Xu Y. CUBIC: identification of regulatory bind-ing sites through data clustering. J Bioinform Comput Biol 2003;1:21–40.

22.Arthur JC, Gharaibeh RZ, Muhlbauer M, et al. Microbial gen-omic analysis reveals the essential role of inflammation inbacteria-induced colorectal cancer. Nat Commun 2014;5:4724.

23.Sorek R, Cossart P. Prokaryotic transcriptomics: a new viewon regulation, physiology and pathogenicity. Nat Rev Genet2010;11:9–16.

24.Li G, Liu B, Ma Q, et al. A new framework for identifying cis-regulatory motifs in prokaryotes. Nucleic Acids Res 2011;39:e42.

25.Ma Q, Zhang H, Mao X, et al. DMINDA: an integrated web ser-ver for DNA motif identification and analyses. Nucleic AcidsRes 2014;42:W12–W19.

26.Yang J, Chen X, McDermaid A, et al. DMINDA 2.0: integratedand systematic views of regulatory DNA motif identificationand analyses. Bioinformatics 2017, in pres.

27.Ma Q, Liu B, Zhou C, et al. An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale. Bioinformatics 2013;29:2261–2268.

28.Liu B, Zhang H, Zhou C, et al. An integrative and applicablephylogenetic footprinting framework for cis-regulatorymotifs identification in prokaryotic genomes. BMC Genomics2016;17:578.

29.Preli�c A, Bleuler S, Zimmermann P, et al. A systematic com-parison and evaluation of biclustering methods for gene ex-pression data. Bioinformatics 2006;22:1122–1129.

30.Tanay A, Sharan R, Shamir R. Discovering statistically signifi-cant biclusters in gene expression data. Bioinformatics 2002;18:S136–S144.

31.Li G, Ma Q, Tang H, et al. QUBIC: a qualitative biclustering al-gorithm for analyses of gene expression data. Nucleic AcidsRes 2009;37:e101.

32.Zhou F, Ma Q, Li G, et al. QServer: a biclustering server for pre-diction and assessment of co-expressed gene clusters. PLoSOne 2012;7:e32660.

33.Zhang Y, Xie J, Yang J, et al. QUBIC: a bioconductor package forqualitative biclustering analysis of gene co-expression data.Bioinformatics 2017;33:450–2.

34.Lathe WC, III, Snel B, Bork P. Gene context conservation of ahigher order than operons. Trends Biochem Sci 2000;25:474–479.

35.Wu H, Su Z, Mao F, et al. Prediction of functional modulesbased on comparative genome analysis and Gene Ontologyapplication. Nucleic Acids Res 2005;33:2822–2837.

36.Mao F, Su Z, Olman V, et al. Mapping of orthologous genes inthe context of biological pathways: an application of integerprogramming. Proc Natl Acad Sci USA 2006;103:129–134.

37.Liu B, Zhou C, Li G, et al. Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analyses.Sci Rep 2016;6:23030.

38.Ma Q, Chen X, Liu C, et al. Understanding the commonalitiesand differences in genomic organizations across closelyrelated bacteria from an energy perspective. Sci China Life Sci2014;57:1121–1130.

8 | Cao et al.