iTerm-PseKNC: a sequence-based tool for predicting bacterial...
Transcript of iTerm-PseKNC: a sequence-based tool for predicting bacterial...
Sequence analysis
iTerm-PseKNC: a sequence-based tool for
predicting bacterial transcriptional terminators
Chao-Qin Feng1, Zhao-Yue Zhang1, Xiao-Juan Zhu1, Yan Lin2,
Wei Chen1,3,*, Hua Tang4,* and Hao Lin1,*
1Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for
Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China, 2Key
Laboratory for Animal Disease Resistance Nutrition of the Ministry of Education, Animal Nutrition Institute,
Sichuan Agricultural University, Chengdu 610054, China, 3Center for Genomics and Computational Biology, School
of Life Sciences, North China University of Science and Technology, Tangshan 063000, China and 4Department of
Pathophysiology, Southwest Medical University, Luzhou 646000, China
*To whom correspondence should be addressed.
Associate Editor: John Hancock
Received on August 12, 2018; revised on September 13, 2018; editorial decision on September 19, 2018; accepted on September 20, 2018
Abstract
Motivation: Transcription termination is an important regulatory step of gene expression. If there
is no terminator in gene, transcription could not stop, which will result in abnormal gene expres-
sion. Detecting such terminators can determine the operon structure in bacterial organisms and
improve genome annotation. Thus, accurate identification of transcriptional terminators is essen-
tial and extremely important in the research of transcription regulations.
Results: In this study, we developed a new predictor called ‘iTerm-PseKNC’ based on support vec-
tor machine to identify transcription terminators. The binomial distribution approach was used to
pick out the optimal feature subset derived from pseudo k-tuple nucleotide composition (PseKNC).
The 5-fold cross-validation test results showed that our proposed method achieved an accuracy of
95%. To further evaluate the generalization ability of ‘iTerm-PseKNC’, the model was examined on
independent datasets which are experimentally confirmed Rho-independent terminators in
Escherichia coli and Bacillus subtilis genomes. As a result, all the terminators in E. coli and 87.5%
of the terminators in B. subtilis were correctly identified, suggesting that the proposed model could
become a powerful tool for bacterial terminator recognition.
Availability and implementation: For the convenience of most of wet-experimental researchers,
the web-server for ‘iTerm-PseKNC’ was established at http://lin-group.cn/server/iTerm-PseKNC/, by
which users can easily obtain their desired result without the need to go through the detailed math-
ematical equations involved.
Contact: [email protected] or [email protected] or [email protected]
1 Introduction
In bacterial, gene expression is a complex and coordinate process
involving gene transcription start and termination that were regu-
lated by promoter and terminator, respectively. When the RNA
polymerase binds to a promoter, the transcription from DNA to
RNA starts and continues until the RNA polymerase reaches a
transcription terminator. All the RNA polymerases must terminate,
dissociate and release the product RNA at the terminator. Thus,
terminator is an extremely important DNA regulatory element.
Generally, terminator locates at the 3’ end of genes or within cistrons.
There are two categories of terminators: Rho-dependent terminator
(also called factors-dependent terminator) and Rho-independent
VC The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 1469
Bioinformatics, 35(9), 2019, 1469–1477
doi: 10.1093/bioinformatics/bty827
Advance Access Publication Date: 21 October 2018
Original Paper
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
terminator (also called factor-independent or intrinsic terminator) as
shown in Figure 1. The former needs additional factors to present its
function. The latter displays its function only based on the DNA tem-
plate and the RNA polymerase.
Detecting transcription terminators can improve genome annota-
tion and shed light on the organization of bacterial genomes.
Although the biochemical experimental approaches can identify two
types of terminators clearly and accurately, the wet-experimental
techniques are really time-consuming and expensive. Thus, compu-
tational methods provide researchers with a great choice to accur-
ately recognize terminators.
Early attempt using computational methods for terminators pre-
diction included dinucleotide distribution matrices (Brendel et al.,
1986; Brendel and Trifonov, 1984) and artificial neural networks
(Nair et al., 1994). They only used the nucleotide composition infor-
mation of the Rho-independent terminator sequence to describe ter-
minator samples. Subsequently, some prediction methods using
hairpin structure and the downstream T-rich region as features were
proposed. Daubenton-Carafa et al. (1990) built a score system-
based model by combining free energy of RNA hairpin formation,
the number and position information of thymidine residues in the
T-rich region with binary linear prediction. de Hoon et al. (2005)
also developed a score system based on the information of T-rich re-
gion. Ermolaeva et al. (2000) proposed a method called TransTerm
which used stepwise threshold filtering instead of binary linear pre-
diction. Yada et al. (1999) converted these two types of features into
seven features for multiple linear prediction. But the above methods
did not consider the stability of T-rich region. In contrast, Lesnik
et al. (2001) and Unniraman et al. (2002) predicted Rho-
independent terminators according to the hairpin structure and the
free energy of the T-rich region.
Although above-mentioned methods have achieved some prom-
ising results for terminator prediction, some aspects in this area
should be improved due to the following reasons. Firstly, the bench-
mark datasets constructed by these methods were too small to reflect
the statistical profile of terminators. Secondly, the predictive per-
formance of existing methods is still far from satisfaction. Thirdly,
these published computational works mainly focused on the predic-
tion of Rho-independent terminators. Although Rho-dependent ter-
minators also have been intensively studied in the binding between
Rho-binding site and Rho protein (Guerin et al., 1998; Zhu and von
Hippel, 1998) and the observation of consensus sequence in Rho ter-
minations (Carlomagno and Nappo, 2001; Chen and Richardson,
1987; Ciampi et al., 1989; Das, 1992; Friedman and Olson, 1983;
Gong and Yanofsky, 2002; Graham and Richardson, 1998;
Heinrich et al., 1995; Vogel and Jensen, 1995; Zalatan et al., 1993),
computational method for identifying Rho-dependent terminators is
less well-studied. Fourthly, no web-server was provided for these
methods, and hence their usage is quite limited, particularly for the
wet-experimental scholars.
This study was devoted to enhancing the prediction power and
quality in identifying the bacterial terminators from the aforemen-
tioned four aspects. We firstly constructed an objective benchmark
dataset including 280 experimentally confirmed terminator sequen-
ces and 560 non-terminator sequences in Escherichia coli.
Subsequently, these samples were formulated by integrating the
oligonucleotide composition information and their physicochemi-
cal correlation into pseudo k-tuple nucleotide composition
(PseKNC). Moreover, the support vector machine (SVM) was uti-
lized to discriminate terminators from non-terminators. According
to the proposed method, we established a predictor named ‘iTerm-
PseKNC’.
2 Materials and methods
2.1 DataConstructing an objective benchmark dataset could provide us a
chance not only to obtain the intrinsic and essential properties of ter-
minator, but also to correctly evaluate the performance of proposed
method.
We extracted the E. coli terminator sequences from the
RegulonDB (Gama-Castro et al., 2016). After deleting overlapping
and repeat terminator sequences, we finally obtained 18 rho-
dependent and 262 rho-independent terminator sequences which
were regarded as positive dataset.
Previous studies (Lesnik et al., 2001; Zhu and von Hippel,
1998) showed that the lengths of rho-independent and rho-
dependent terminators are usually within 100 bp and 70–80 bp, re-
spectively. Our statistical results showed that the average lengths of
terminators are �50 bp. Thus, to include sequence as more as pos-
sible, the length of negative samples was set to 80 bp, which were
extracted from �100 to �20 bp upstream and 20–100 bp down-
stream of 280 terminators in E. coli K-12 genome, respectively.
Accordingly, 280 terminator and 560 non-terminator sequences
were obtained. We further used CD-HIT (Fu et al., 2012) to inves-
tigate the sequence identity of all sequences and found that 836
sequences have sequence identity of <75%, suggesting that using
such data will not cause model bias.
2.2 Feature descriptionThe second key step for a powerful predictor is to develop a valid
feature set to represent samples. The following will describe the fea-
tures used in this study.
Suppose a DNA sequence D consists of L nucleic acid
residues; i.e.
Fig. 1. A schematic of bacterial transcriptional termination. (A) Rho-depend-
ent transcription termination; (B) Rho-independent transcription termination
1470 C.-Q.Feng et al.
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
D ¼ R1R2 . . . Ri . . . RL (1)
where Ri represents the i-th nucleic acid residue at position i in the
DNA sequence.
Some sequence alignment-based methods such as BLAST and
FASTA could directly perform sequence similarity search by using
sequences. However, these methods can do nothing when facing
low-similar sequences. Fortunately, machine learning methods
could provide help to solve this problem (Cao et al., 2017). Thus,
transferring a DNA sequence to a vector without losing sample in-
formation has become one of the most challenging problems in
bioinformatics.
Due to the fact that k-tuple (also called k-mer) nucleotide com-
position could reflect the oligonucleotide composition and has
been widely used in DNA element identification (Lin and Li, 2011),
we utilized this feature to formulate each sample with 4k compo-
nents; i.e.
D ¼ ½f k�tuple1 f k�tuple
2 . . . f k�tuplei . . . f k�tuple
4k � (2)
where fi is the normalized occurrence frequency of the i-th k-tuple
nucleotide in the DNA sequence. As we can see from Equation (2),
only short-range correlation information was considered in the for-
mulation. The long-range interaction between oligonucleotides was
fully ignored.
Encouraged by the successes of pseudo amino acid composition
method in computational proteomics, PseKNC (Chen et al., 2014,
2015) was proposed to convert a DNA/RNA sequence to a same di-
mension vector which could not only reflect the short-range se-
quence-order information by using k-tuple nucleotide composition,
but also consider long-range sequence-order information by calcu-
lating the correlation of physicochemical properties of oligonucleoti-
des in DNA/RNA sequences. Thus, PseKNC has been successfully
applied in the identification of various of DNA/RNA regulatory ele-
ments, such as DNA recombination spots (Chen et al., 2013; Yang
et al., 2018b) nucleosome positioning sequence (Chen et al., 2016;
Guo et al., 2014), promoter (Lin et al., 2014, 2017), translation ini-
tiation site (Chen et al., 2014), origin of replication (Li et al., 2015a,
b; Zhang et al., 2016), RNA modification sites (Chen et al., 2018;
Feng et al., 2018; Yang et al., 2018a). Although these works have
demonstrated that PseKNC is a good feature to formulate DNA/
RNA samples, all them used Type-I PseKNC. The different physico-
chemical properties were mixed together in Type I PseKNC. Because
different properties maybe play different roles, the Type-II PseKNC
(also called series correlation PseKNC) could consider these differ-
ences and improve the description on samples.
Thus, in this study, according to the thought of series correlation
PseKNC (Type-II PseKNC) (Chen et al., 2014), we made a minor
modification on the Type-II PseKNC and then used a vector contain-
ing ð4k þ k � KÞ components to describe a terminator/non-terminator
sequence as below:
DIIPseKNC ¼ ½d1 . . . d4k d4kþ1 . . . d4kþk d4kþkþ1 . . . d4kþkK�
T
(3)
where
du ¼f k�tupleu ;1 � u � 4k
xsu�4k ;4k þ 1 � u � 4k þ kK
((4)
where f k�tupleu has the same meaning as defined in Equation (2). The
su�4k reflects the long-range sequence-order effect of a DNA se-
quence with L nucleotides and can be calculated by:
s1 ¼1
L� k� 1
XL�k�1
i¼1
J1i;iþ1
s2 ¼1
L� k� 1
XL�k�1
i¼1
J2i;iþ1
. . . . . .
sK ¼1
L� k� 1
XL�k�1
i¼1
JKi;iþ1
. . . . . .
skK�1 ¼1
L� k� k
XL�k�k
i¼1
JK�1i;iþk
skK ¼1
L� k� k
XL�k�k
i¼1
JKi; iþk
:
8>>>>>>>>>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>>>>>>>>>:
(5)
In Equations (3)–(5), k reflects the tiers or correlation rank along
a DNA sequence and should be set to an integer number less than
L�k. x is a weight factor. K is the number of physicochemical prop-
erties. The Jni;iþm describes the correlation of the n�th physicochemi-
cal property between the i-th oligonucleotide (RiRiþ1 . . . Riþn�1) and
(iþm)-th oligonucleotide (RiþmRiþmþ1 . . . Riþmþn�1) and can be cal-
culated by
Jni;iþm ¼ Hn RiRiþ1 . . . Riþn�1ð Þ �Hn RiþmRiþmþ1 . . . Riþmþn�1ð Þ
n ¼ 1; 2; . . . K; m ¼ 1; 2; . . . ; k; i ¼ 1; 2; . . . ; L� k� k
8><>:
(6)
where n is the length of oligonucleotides. HnðRiRiþ1 . . . Riþn�1Þ and
HnðRiþmRiþmþ1 . . . Riþmþn�1Þ are the values of the n� th physico-
chemical property for oligonucleotide RiRiþ1 . . . Riþn�1 and
RiþmRiþmþ1 . . . Riþmþn�1 in a DNA sequence. In this study, six phys-
icochemical properties including slide, shift, rise, roll, tilt and twist
for dinucleotides (namely n ¼ 2) were used, and the original physi-
cochemical property values can be obtained from Brukner et al.
(1995). Note that before substituting these values into Equation (6),
they should be standardized by the following equation:
Hn RiRiþ1ð Þ ¼H0
n RiRiþ1ð Þ � H0nðRiRiþ1Þ
D ESD H0
nðRiRiþ1ÞD E (7)
where the symbol �h i means taking the average of quantity therein
over the 16 different combinations of A, T, C, G for RiRiþ1, and SD
means the corresponding standard deviation.
2.3 Feature selectionAs we can see from the Equation (3), with the increase of k, we can
get more information about sequence order, but the dimension of
the vector DIIPseKNC will also rapidly increase and lead to the so-
called ‘high-dimension disaster’. Moreover, a model with excessive
features increases the risk of overfitting which will reduce the
cluster-tolerant or deviation tolerance (Chou, 1999) and may result
in a poor performance of model.
To save computational resources, reduce run-time and improve
predictive accuracy, it is necessary to optimize feature set by picking
out relevant features that contribute greatly for the prediction. At
present, some effective feature selection techniques which can reduce
irrelevant features and noisy information have been proposed, such
as minimum redundancy maximum relevance (Peng et al., 2005),
analysis of variance (Chen et al., 2016; Tang et al., 2016, 2018;
Yang et al., 2016), principal component analysis (Wang and Yang,
Predicting bacterial transcriptional terminators 1471
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
2009), max-relevance-max-distance (Zou et al., 2016a, b) and bino-
mial distribution (BD) (Lai et al., 2017; Su et al., 2018).
Generally, oligonucleotide frequencies do not obey the Gaussian
distribution for a large k with relatively small sample data. Thus, we
proposed using BD-based technique to rank the important of short-
range sequence-order features. According to the theory of BD-based
technique, we can judge whether a certain k-tuple nucleotide occurs
in terminators or non-terminators is an essentially random event or
not.
First, we should calculate the prior probability qi by Equation (8)
qi ¼ mi=M (8)
where mi is the total number of k-tuple nucleotide in the i-th cat-
egory of samples (i ¼ 1, 2 corresponds to terminator samples and
non-terminator samples). M denotes the total number of all k-tuple
nucleotides in both positive and negative datasets.
Second, the probability of the j-th k-tuple nucleotide in the i-th
category of datasets can be obtained by the following equation:
P nijð Þ ¼XNj
m¼nij
Nj!
m!ðNj �mÞ qmi ð1� qiÞNj�m (9)
where Nj and nij denote the total numbers of the j-th k-tuple nucleo-
tides, respectively, occurring in the benchmark dataset and in i-th
category dataset.
If the j-th k-tuple nucleotide occurring in the i-th category data-
set has biological significance, namely its appearance is not a ran-
dom event, the value of PðnijÞ will be very small. In the current
study, there are two sample datasets, namely positive sample and
negative sample, meaning that we should calculate two probabilities
Pðn1jÞ and Pðn2jÞ for the two datasets. Then, we may define the oc-
currence probability of the j-th k-tuple nucleotide in benchmark
dataset as follows
Pj ¼ minðPðn1jÞ; Pðn2jÞÞ (10)
where Pðn1jÞ and Pðn2jÞ are the probabilities of j-th k-tuple nucleo-
tide in terminator samples and non-terminator samples, respectively.
All the k-tuple nucleotides were ranked by their probability values,
that is to say, if the value of Pj is smaller, this k-tuple nucleotide is
more effective for classification.
2.4 Support vector machine (SVM)SVM is a supervised machine learning algorithm based on structural
risk minimization (Burges, 1998) framework. It has been widely
used in many bioinformatics fields (Cao et al., 2014; Chen et al.,
2017; Feng et al., 2013a, b, c; Li et al., 2016; Song et al., 2006,
2010, 2012, 2018a, b; Tang et al., 2017). The basic principle of
SVM is to map the original feature vectors to a higher Hilbert space
by using various kernel functions and then searches a separating
hyperplane in Hilbert space. A soft package LIBSVM used to
implement SVM has been developed by Lin’s lab, which can be
freely downloaded from http://www.csie.ntu.edu.tw/�cjlin/libsvm/
(Chang and Lin, 2011).
In our study, the input of SVM is the vectors obtained from
terminator and non-terminator sequences according to Equation
(3). The default radial basis function was utilized as the kernel
function to construct SVM model. The optimization of the regu-
larization parameter C and the kernel width parameter c were
performed by using Grid search with the following searching
space:
2�5 � C � 215; with step D ¼ 2
2�15 � c � 2�5; with step D ¼ 2�1
8<: (11)
2.5 Performance evaluation2.5.1 N-fold cross-validation
Cross-validation is a model validation technique for assessing the
performance of a statistical analysis that will generalize to an inde-
pendent dataset. The goal of cross-validation is to define a dataset to
‘test’ the model in the training phase. As summarized in a review
(Chou and Zhang, 1995), n-fold cross-validation, jackknife test and
independent dataset test are often used.
In our study, we used the 5-fold cross-validation (n ¼ 5) to esti-
mate the generalization of model. In 5-fold cross-validation, the
benchmark dataset is equally divided into five subsets, and one of
the five subsets is used as the test set or validation set and the other
four subsets are put together to form a training set. Then, this pro-
cession repeats five times.
2.5.2 Performance evaluation parameters
We used four metrics which are specificity (Sp), sensitivity (Sn), over-
all accuracy (Acc) and Matthew’s correlation coefficient (MCC)
(Feng et al., 2013a, b, c; Li et al., 2018a, b; Song et al., 2018a, b) to
evaluate the predictive performance of our model.
Sn ¼ 1�Nþ�Nþ
; 0 � Sn � 1
Sp ¼ 1�N�þN�
; 0 � Sp � 1
Acc ¼ 1�Nþ� þN�þNþ þN�
; 0 � Acc � 1
MCC ¼1� Nþ�
NþþN�þ
N�
� �ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1þN�þ �Nþ�Nþ
� �1þNþ� �N�þ
N�
� �s ; 0 � MCC � 1
8>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>:
(12)
where Nþ and N� are the total numbers of terminators and non-
terminators in benchmark dataset, respectively, where Nþ� denotes
the number of the terminators recognized as non-terminators, N�þdenotes the number of the incorrectly predicted non-terminators. Sn
and Sp, respectively, represent the ability of model to correctly
identify terminators and non-terminators. Acc measures the overall
accuracy of our model for distinguishing terminators from non-
terminators. MCC reflects the performance of our model on imbal-
ance benchmark dataset, here the number of positive:negative is 1:2.
We also drew the receiver operating characteristic (ROC) curves
and calculated the area under the receiver operating characteristic
curve (AUC) to measure the performance of models across the entire
range of SVM decision values (Yang et al., 2018b).
3 Results
3.1 Parameter optimizationFrom Equation (3), the performance of the SVM predictor depends
on three parameters, k, k and x, where k and k reflect the short-
range or local sequence-order effect and the long-range or global se-
quence-order effect, respectively. One may notice that the greater
the k and k are, the more sequence-order information the model
1472 C.-Q.Feng et al.
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
contains. x is used to weight the short-range and long-range, and its
value is usually within the range from 0 to 1. Therefore, we searched
for the optimal values of the three parameters as the following
regions.
2 � k � 6; with step D ¼ 11 � k � 20; with step D ¼ 10:1 � x � 1; with step D ¼ 0:1
8<: (13)
According to Equation (13), we considered 1000 independent
parameter combinations, and performed a 5-fold cross-validated
test for each of these parameter combinations. The accuracies were
showed in Figure 2.
From the Figure 2, we noticed that parameter k plays a key role
in terminator identification. No matter what values of k and x are
used, with the increase of k from 2 to 5, the Acc improves from �85
to �95%. However, when k is set to 6, the prediction performance
decreases to �92%. This observation obeys the rule of short-range
dominance of base correlations (Luo et al., 1998). Once the short-
range correlation parameter k is determined, in most of cases, the
Acc improves with the increase of k. However, when k ¼ 5, the
maximum Acc appears at k ¼ 2; 3; 4; 5. This phenomenon can be
explained by the inhomogeneity of long-range correlation property
(Luo et al., 1998). To include long-range correlation information as
more as possible in prediction model, the parameter k was set to 5.
In all 1000 models, the weight parameter x plays a very slight role
in balancing the information between short-range and long-range
correlation because the SVM is a powerful algorithm which can deal
with feature dimension issue. Thus, we used x ¼ 1 for the conveni-
ence. And then, the final model with the Acc of 95.37% in this sec-
tion was established when k ¼ 5, k ¼ 5 and x ¼ 1. It means that the
combination of the short-range information, namely 5-tuple nucleo-
tide (or called pentamer) composition, with the long-range se-
quence-order information, namely 30 (5 � 6) physicochemical
correlation information, could produce the best performance (Acc
of 95.37%) for discriminating terminators from non-terminators.
3.2 Feature optimizationAccording to Equation (3), the dimension for the PseKNC vector
with k ¼ 5 and k ¼5 is 45 þ 5� 6 ¼ 1054, which may lead to fea-
ture redundancy. To find out the optimal pentamer, we used the bi-
nomial distribution technique to calculate the probabilities of the
1024 short-range sequence-order features, and then the increment
feature selection (IFS) (Li et al., 2016, 2015a, b) was used to pick
out the best feature subset which could produce the accuracy.
Firstly, we ranked the 1024 pentamers in descending order accord-
ing to their probabilities [Equation (10)]. Secondly, we selected the
first pentamer with the smallest probability in the ranked feature set
as the first feature subset. Subsequently, the second feature subset
was constructed by adding the pentamer with the second smallest
probability into the first feature subset, and the third feature subset
was formed by adding the pentamer with the third smallest prob-
ability into the second feature subset and so on. Finally, 1024 fea-
ture subsets were constructed. To find out the best feature subset
that can produce the maximum accuracy, we used SVM with 5-fold
cross-validation to evaluate the performances of the 1024 feature
subsets. The IFS curve was drawn in Figure 3. It is clear that the
maximum overall accuracy of 95.48% was achieved when 808 opti-
mal pentamers were used.
Finally, we added the 30 long-range sequence-order features to
the 808 optimal pentamers and achieved a much higher Acc of
95.71% with the Sn, Sp and MCC of 86.07%, 99.46% and 0.8881,
respectively. We also drew the correlated ROC curve in Figure 4. It
shows that the AUC reaches the value of 0.9520, suggesting that the
Fig. 2. The accuracies of 1000 parameter combinations by using 5-fold cross-
validation on benchmark dataset
Fig. 3. A plot to show the IFS curve. The IFS curve reaches to peak of 95.48%
when abscissa is 808
Fig. 4. The ROC curves for terminator predictors with 838 features
Predicting bacterial transcriptional terminators 1473
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
proposed model has an excellent prediction capability on terminator
recognition.
To show the prediction capability of final model under the con-
ditions of Sp ¼ 99.90, 99.00, 95.00%, we recorded other evaluation
indexes in Table 1. From the table, we noticed that the final model
could produce high sensitivities for the three thresholds, suggesting
that the model is accurate and robust.
3.3 Feature analysisAccording to statistical theory for small probability event, the occur-
rence of a feature with smaller probability (or high confidence level)
is not a stochastic event. Thus, the feature is important for the con-
struction of prediction model. To intuitively show the contribution
of each pentamer, a heat map was plotted in Figure 5 which is a
graphical representation of a matrix, where the elements represent
the pentamers and are encoded by using different colors according
to their confidence levels. The column and the row of the heat map
represent the first dimers and the last trimers of pentamers,
respectively.
In feature selection process, 808 optimal pentamers were selected
to produce the best result. Only 216 redundant information were
excluded. From Figure 5, one may also notice that many pentamers
have confidence level from yellow to red elements, demonstrating
that the feature selection is correct. However, Figure 5 cannot reflect
the preference of optimal pentamers in terminators or non-
terminators. Thus, we did a further analysis on 1024 pentamers, and
selected 40 pentamers with highest confidence level (>98.20%) as
the most important features for classification. Of the 40 pentamers,
20 often appear in terminators (Fig. 6A), and the remained 20 prefer
to non-terminators (Fig. 6B). We noticed that the 20 pentamers
occurring in terminators always contain short repeated trimers, such
as AAA, CCC and TTT. However, the other 20 pentamers does not
display similar mode. To further investigate the importance of the
40 pentamers, we retrained the proposed model using these pentam-
ers, and achieved the Acc of 87.50% in 5-fold cross-validation. Such
high accuracy demonstrates the preference of pentamers in
terminators.
To further investigate the sequence mode in terminators,
we used Multiple Em for Motif Elicitation (MEME) (Bailey and
Elkan, 1994) to discover the consensus motifs in terminators.
Results showed that the consensus sequence is A[AT]AAAAAA[AG]
[CG]C[CG][GC][CG][CT-A][CGAT][AT][ACTG][GTA][GCA][GC].
Excitingly, the element obtained by MEME is identical to the
optimized pentamer from feature selection technique.
3.4 Web-server introductionBased on our proposed model, a user-friendly and publicly access-
ible web-server for ‘iTerm-PseKNC’ was established available at
http://lin-group.cn/server/iTerm-PseKNC/, by which users can up-
load DNA sequences to search where the terminator is. The user
guide is introduced as follow:
Step 1. Click on the web address http://lin-group.cn/server/
iTerm-PseKNC/ and one may see the homepage of ‘iTerm-PseKNC’
on his computer screen as shown in Figure 7. Users may read the
brief introduction about the predictor.
Step2. Click on the WEBSERVER button to see the prediction
interface. Either type or copy/paste the query DNA sequences with
length of >80 bp into the input box at the center of the page. User
could also upload a DNA sequence file below the input box. The in-
put sequence must be in the FASTA format. Users could check the
example sequences with FASTA format by clicking on the Example
button on the right of the input box.
Step 3. Click on the Submit button to see the predicted result. If
you use the four sequence samples in the Example window as an in-
put, after clicking the Submit button, you will see the following out-
comes shown on the screen of your computer. (i) Two samples were
Table 1. The performance of the model for different thresholds
Threshold Sp (%) Sn (%) Acc (%) MCC
High 99.90 87.10 95.60 0.9019
Medium 99.00 90.00 95.95 0.9087
Low 95.00 93.20 94.40 0.8764
Fig. 5. A heat map to show the confidence level of the 1024 pentamers. The
color scale is ranged from blue (represents low confidence level or large
probability) through green and yellow to red (represents high confidence
level or small probability)
A
B
Fig. 6. A histogram to show the different heptamers between the terminators
and non-terminators. (A) 20 pentamers often appear in terminator sequences.
(B) 20 pentamers often appear in non-terminator sequences
1474 C.-Q.Feng et al.
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
predicted as terminator with the probabilities of containing a ter-
minator are 99.45% and 99.18%. (ii) Two samples were predicted
as non-terminator with the probabilities of containing a terminator
are 3.15% and 0.63%. All these results are fully consistent with the
experimental observations. It only takes about few seconds for the
above computation before the predicted results appear on your com-
puter screen.
Step 4. Click on the DOWNLOAD button to download the
benchmark datasets used to train and test the ‘iTerm-PseKNC’
predictor.
Step 5. Click on the CITATION button to find the relevant
papers that document the detailed development and algorithm of
‘iTerm-PseKNC’.
Step 6. Click on the ABOUT US button to learn about us. You
may also contact us using Contact button.
3.5 Evaluating on other datasetsTo further evaluate the generalization ability of ‘iTerm-PseKNC’, it
is necessary to examine the predictor on independent data. Thus, we
built two independent sets. One contains 147 experimentally con-
firmed rho-independent terminators in E. coli (Lesnik et al., 2001).
The other includes 425 experimentally identified terminators in
Bacillus subtilis (de Hoon et al., 2005). We executed the predictor
‘iTerm-PseKNC’ on the two independent datasets. Results show
that all terminators in E. coli and 87.5% terminators in B. subtilis
were correctly identified, suggesting that our proposed method is
powerful and can be a useful tool for bacterial terminator
annotation.
Based on the 147 E. coli terminators, Lesnik et al. (2001) devel-
oped an RNAMotif algorithm and achieved the accuracy of 100%
on this data. However, the benchmark dataset used to train ‘iTerm-
PseKNC’ model does not contain Lesnik’s data, meaning our
obtained results are stricter and more objective. de Hoon et al.
(2005) analyzed the 425 B. subtilis terminator sequences and cor-
rectly identified 94% of B. subtilis terminator by using a decision
rule. One may notice that the accuracy of our model on B. subtilis
terminators is not higher than that of de Hoon’s model. The reason
is that de Hoon’s model was trained on B. subtilis terminators, but
our model on E. coli terminators. Thus, our predictor could provide
more reliable and flexible cross-species prediction. What is more,
none of these works provides web-server, which may prevent wet-
experimental scholars from using these published methods.
3.6 Comparison with other traditional machine learning
methodsVarious of machine learning methods have been widely applied in
bioinformatics fields (Cao and Cheng, 2016; Cao et al., 2016; Li
et al., 2018a, b; Loh et al., 2018; Naseem et al., 2017; Patel et al.,
2017; Peng et al., 2018; Stephenson et al., 2018; Yuan et al., 2017).
To compare the prediction performance of different methods on ter-
minator identification, we repeated above procession using four
kinds of methods: decision tree, multi-layer perceptron, logistic re-
gression and Naıve Bayes. The scikit-learn package in python was
used to implement the four algorithms. All results were listed in
Table 2. As shown in Table 1, it is clear that the four metrics
[defined in Equation (12)] achieved by SVM are significantly higher
than those of the other four methods, indicating that it is correct
and suitable to use SVM to build computational method for termin-
ator identification.
4 Conclusion
In this study, we developed a predictor called ‘iTerm-PseKNC’ to
identify terminators. The Type-II PseKNC was first applied in bio-
informatics field, which consider not only the short-range oligo-
nucleotide composition, but also the long-range correlation of
physicochemical properties of oligonucleotide. Different from trad-
itional Type-I PseKNC, the Type-II PseKNC could separate different
long-range correlation for different physicochemical properties,
which could provide more objective description on samples. The
rates achieved by the predictor with 808 pentamer features and 30
long-range sequence-order features were 86.07%, 99.46%, 95.71%
and 0.88% in Sn, Sp, Acc and MCC, respectively, indicating that
DNA sequence contains important information for terminator identi-
fication. Many kinds of motifs were found by using feature analysis
technique. The examinations on independent datasets demonstrated
that ‘iTerm-PseKNC’ is a reliable tool to identify terminators in bac-
terial genomes. In the future, we will collect more terminators and
separate rho-independent from rho-dependent terminators to retrain
the model for further improving the prediction performance.
Funding
This work was supported by the National Nature Scientific Foundation
of China (61772119, 31771471, 61702430); the Fundamental Research
Funds for the Central Universities of China (Nos. ZYGX2015Z006,
ZYGX2016J125, ZYGX2016J118); Natural Science Foundation for
Distinguished Young Scholar of Hebei Province (No. C2017209244);
the Program for the Top Young Innovative Talents of Higher Learning
Institutions of Hebei Province (No. BJ2014028) and the Science Strength
Promotion Programme of UESTC.
Conflict of Interest: none declared.
References
Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maxi-
mization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst.
Mol. Biol., 2, 28–36.
Fig. 7. The top page of the ‘iTerm-PseKNC’ web-server at http://lin-group.cn/
server/iTerm-PseKNC/
Table 2. A comparison of different methods for identifying
terminator
Method Sn (%) Sp (%) Acc (%) MCC
Decision tree 69.64 90.00 83.21 0.7562
Multi-layer perceptron 63.57 94.46 84.16 0.7078
Logistic regression 46.07 93.92 77.97 0.3839
Naıve Bayes 70.71 86.25 81.07 0.7521
Our method 86.07 99.46 95.71 0.8881
Predicting bacterial transcriptional terminators 1475
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
Brendel,V. et al. (1986) Terminators of transcription with RNA polymerase
from Escherichia coli: what they look like and how to find them. J. Biomol.
Struct. Dyn., 3, 705–723.
Brendel,V. and Trifonov,E.N. (1984) A computer algorithm for testing poten-
tial prokaryotic terminators. Nucleic Acids Res., 12, 4411–4427.
Brukner,I. et al. (1995) Sequence-dependent bending propensity of DNA as
revealed by DNase I: parameters for trinucleotides. EMBO J., 14,
1812–1818.
Burges,C.J.C. (1998) A tutorial on support vector machines for pattern recog-
nition. Data Min. Knowl. Discov., 2, 121–167.
Cao,R. and Cheng,J. (2016) Integrated protein function prediction by mining
function associations, sequences, and protein-protein and gene-gene inter-
action networks. Methods, 93, 84–91.
Cao,R. et al. (2017) ProLanGO: protein Function Prediction Using Neural
Machine Translation Based on a Recurrent Neural Network. Molecules, 22,
E1732.
Cao,R. et al. (2014) SMOQ: a tool for predicting the absolute residue-specific
quality of a single protein model with support vector machines. BMC
Bioinformatics, 15, 120.
Cao,R.Z. et al. (2016) DeepQA: improving the estimation of single protein
model quality with deep belief networks. BMC Bioinformatics, 17, 495.
Carlomagno,M.S. and Nappo,A. (2001) The antiterminator NusB enhances
termination at a sub-optimal Rho site. J. Mol. Biol., 309, 19–28.
Chang,C.-C. and Lin,C.-J. (2011) LIBSVM: a library for support vector
machines. ACM Trans. Intell. Syst. Technol., 2, 1–27.
Chen,C.Y. and Richardson,J.P. (1987) Sequence elements essential for
rho-dependent transcription termination at lambda tR1. J. Biol. Chem.,
262, 11292–11299.
Chen,W. et al. (2016) Using deformation energy to analyze nucleosome posi-
tioning in genomes. Genomics, 107, 69–75.
Chen,W. et al. (2018) iRNA-3typeA: identifying Three Types of Modification
at RNA’s Adenosine Sites. Mol. Ther. Nucleic Acids, 11, 468–474.
Chen,W. et al. (2014) iTIS-PseTNC: a sequence-based predictor for identifying
translation initiation site in human genes using pseudo trinucleotide com-
position. Anal. Biochem., 462, 76–83.
Chen,W. et al. (2013) iRSpot-PseDNC: identify recombination spots with
pseudo dinucleotide composition. Nucleic Acids Res., 41, e68.
Chen,W. et al. (2014) PseKNC: a flexible web server for generating pseudo
K-tuple nucleotide composition. Anal. Biochem., 456, 53–60.
Chen,W. et al. (2017) iDNA4mC: identifying DNA N4-methylcytosine sites
based on nucleotide chemical properties. Bioinformatics, 33, 3518–3523.
Chen,W. et al. (2015) PseKNC-General: a cross-platform package for generat-
ing various modes of pseudo nucleotide compositions. Bioinformatics, 31,
119–120.
Chen,X.X. et al. (2016) Identification of Bacterial Cell Wall Lyases via Pseudo
Amino Acid Composition. BioMed Res. Int., 2016, 1654623.
Chou,K.-C. and Zhang,C.-T. (1995) Prediction of protein structural classes.
Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,K.C. (1999) A key driving force in determination of protein structural
classes. Biochem. Biophys. Res. Commun., 264, 216–224.
Ciampi,M.S. et al. (1989) Features of the rho-dependent transcription termin-
ation polar element within the hisG cistron of Salmonella typhimurium. J.
Bacteriol., 171, 4472–4478.
Das,A. (1992) How the phage lambda N gene product suppresses transcription
termination: communication of RNA polymerase with regulatory proteins
mediated by signals in nascent RNA. J. Bacteriol., 174, 6711–6716.
Daubenton-Carafa et al. (1990) Prediction of rho-independent Escherichia
coli transcription terminators. J. Mol. Biol., 216, 835–858.
de Hoon,M.J. et al. (2005) Prediction of transcriptional terminators in
Bacillus subtilis and related species. PLoS Comput. Biol., 1, e25.
Ermolaeva,M.D. et al. (2000) Prediction of transcription terminators in bac-
terial genomes. J. Mol. Biol., 301, 27–33.
Feng,P. et al. (2018) iDNA6mA-PseKNC: identifying DNA
N(6)-methyladenosine sites by incorporating nucleotide physicochemical
properties into PseKNC. Genomics, doi: 10.1016/j.ygeno.2018.01.005.
Feng,P.M. et al. (2013a) iHSP-PseRAAAC: identifying the heat shock protein
families using pseudo reduced amino acid alphabet composition. Anal.
Biochem., 442, 118–125.
Feng,P.M. et al. (2013b) Naive Bayes classifier with feature selection to
identify phage virion proteins. Comput. Math. Methods Med., 2013,
530696.
Feng,P.M. et al. (2013c) Identification of antioxidants from sequence
information using naive Bayes. Comput. Math. Methods Med., 2013,
567529.
Friedman,D.I. and Olson,E.R. (1983) Evidence that a nucleotide sequence,
“boxA,” is involved in the action of the NusA protein. Cell, 34, 143–149.
Fu,L. et al. (2012) CD-HIT: accelerated for clustering the next-generation
sequencing data. Bioinformatics, 28, 3150–3152.
Gama-Castro,S. et al. (2016) RegulonDB version 9.0: high-level integration of
gene regulation, coexpression, motif clustering and beyond. Nucleic Acids
Res., 44, D133–D143.
Gong,F. and Yanofsky,C. (2002) Analysis of tryptophanase operon expression
in vitro: accumulation of TnaC-peptidyl-tRNA in a release factor 2-depleted
S-30 extract prevents Rho factor action, simulating induction. J. Biol.
Chem., 277, 17095–17100.
Graham,J.E. and Richardson,J.P. (1998) rut Sites in the nascent transcript me-
diate Rho-dependent transcription termination in vivo. J. Biol. Chem., 273,
20764–20769.
Guerin,M. et al. (1998) A simple polypyrimidine repeat acts as an artificial
Rho-dependent terminator in vivo and in vitro. Nucleic Acids Res., 26,
4895–4900.
Guo,S.H. et al. (2014) iNuc-PseKNC: a sequence-based predictor for predict-
ing nucleosome positioning in genomes with pseudo k-tuple nucleotide com-
position. Bioinformatics, 30, 1522–1529.
Heinrich,T. et al. (1995) Point mutations in the leader boxA of a
plasmid-encoded Escherichia coli rrnB operon cause defective antitermina-
tion in vivo. J. Bacteriol., 177, 3793–3800.
Lai,H.Y. et al. (2017) Sequence-based predictive modeling to identify cancer-
lectins. Oncotarget, 8, 28169–28175.
Lesnik,E.A. et al. (2001) Prediction of rho-independent transcriptional termi-
nators in Escherichia coli. Nucleic Acids Res., 29, 3583–3594.
Li,B.Q. et al. (2018a) Prediction of Protein-Peptide Interactions with a Nearest
Neighbor Algorithm. Curr. Bioinform., 13, 14–24.
Li,D.P. et al. (2016) Protein Folds Prediction with Hierarchical Structured
SVM. Curr. Proteomics, 13, 79–85.
Li,F. et al. (2018b) Quokka: a comprehensive tool for rapid and accurate pre-
diction of kinase family-specific phosphorylation sites in the human prote-
ome. Bioinformatics, 34, 4223–4231.
Li,F. et al. (2016) GlycoMine(struct): a new bioinformatics tool for highly ac-
curate mapping of the human N-linked and O-linked glycoproteomes by
incorporating structural features. Sci. Rep., 6, 34595.
Li,F. et al. (2015a) GlycoMine: a machine learning-based approach for pre-
dicting N-, C- and O-linked glycosylation in the human proteome.
Bioinformatics, 31, 1411–1419.
Li,W.C. et al. (2015b) iORI-PseKNC: a predictor for identifying origin of rep-
lication with pseudo k-tuple nucleotide composition. Chemometr. Intell.
Lab., 141, 100–106.
Lin,H. et al. (2014) iPro54-PseKNC: a sequence-based predictor for identify-
ing sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide com-
position. Nucleic Acids Res., 42, 12961–12972.
Lin,H. and Li,Q.Z. (2011) Eukaryotic and prokaryotic promoter prediction
using hybrid approach. Theory Biosci., 130, 91–100.
Lin,H. et al. (2017) Identifying sigma70 promoters with novel pseudo nucleo-
tide composition. IEEE/ACM Trans. Comput. Biol. Bioinform, doi:
10.1109/TCBB.2017.2666141.
Loh,S.K. et al. (2018) A Review of Computational Approaches to Predict
Gene Functions. Curr. Bioinform., 13, 373–386.
Luo,L.F. et al. (1998) Statistical correlation of nucleotides in a DNA sequence.
Phys. Rev. E., 58, 861–871.
Nair,T.M. et al. (1994) Application of artificial neural networks for prokary-
otic transcription terminator prediction. FEBS Lett., 346, 273–277.
Naseem,I. et al. (2017) ECMSRC: a Sparse Learning Approach for
the Prediction of Extracellular Matrix Proteins. Curr. Bioinform., 12,
361–368.
Patel,S. et al. (2017) DeepInteract: deep Neural Network Based
Protein-Protein Interaction Prediction Tool. Curr. Bioinform., 12, 551–557.
1476 C.-Q.Feng et al.
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019
Peng,H. et al. (2005) Feature selection based on mutual information: criteria
of max-dependency, max-relevance, and min-redundancy. IEEE Trans.
Pattern Anal. Mach. Intell., 27, 1226–1238.
Peng,L. et al. (2018) The Advances and Challenges of Deep Learning
Application in Biological Big Data Processing. Curr. Bioinform., 13,
352–359.
Song,J. et al. (2006) Prediction of cis/trans isomerization in proteins using
PSI-BLAST profiles and secondary structure information. BMC
Bioinformatics, 7, 124.
Song,J. et al. (2018a) PROSPERous: high-throughput prediction of substrate
cleavage sites for 90 proteases with improved accuracy. Bioinformatics, 34,
684–687.
Song,J. et al. (2012) PROSPER: an integrated feature-based tool for predicting
protease substrate cleavage sites. PLoS One, 7, e50300.
Song,J. et al. (2010) Cascleave: towards more accurate prediction of caspase
substrate cleavage sites. Bioinformatics, 26, 752–760.
Song,J. et al. (2018b) iProt-Sub: a comprehensive package for accurately map-
ping and predicting protease-specific substrates and cleavage sites. Brief.
Bioinform., doi: 10.1093/bib/bby028.
Stephenson,N. et al. (2018) Survey of Machine Learning Techniques in Drug
Discovery. Curr. Drug Metabol., doi: 10.2174/1389200219666180
820112457.
Su,Z.D. et al. (2018) iLoc-lncRNA: predict the subcellular location of
lncRNAs by incorporating octamer composition into general PseKNC.
Bioinformatics, 34, 4196–4204.
Tang,H. et al. (2017) A two-step discriminated method to identify thermophil-
ic proteins. Int. J. Biomath., 10, 1750050.
Tang,H. et al. (2016) Identification of immunoglobulins using Chou’s pseudo
amino acid composition with feature selection technique. Mol. Biosyst., 12,
1269–1275.
Tang,H. et al. (2018) HBPred: a tool to identify growth hormone-binding pro-
teins. Int. J. Biol. Sci., 14, 957–964.
Unniraman,S. et al. (2002) Conserved economics of transcription termination
in eubacteria. Nucleic Acids Res., 30, 675–684.
Vogel,U. and Jensen,K.F. (1995) Effects of the antiterminator BoxA on tran-
scription elongation kinetics and ppGpp inhibition of transcription elong-
ation in Escherichia coli. J. Biol. Chem., 270, 18335–18340.
Wang,T. and Yang,J. (2009) Using the nonlinear dimensionality reduction
method for the prediction of subcellular localization of Gram-negative bac-
terial proteins. Mol. Divers., 13, 475–481.
Yada,T. et al. (1999) Modeling and predicting transcriptional units of Escherichia
coli genes using hidden Markov models. Bioinformatics, 15, 987–993.
Yang,H. et al. (2018a) iRNA-2OM: a sequence-based predictor for identifying
2’-O-methylation sites in Homo sapiens. J. Comput. Biol., doi:
10.1089/cmb.2018.0004.
Yang,H. et al. (2018b) iRSpot-Pse6NC: identifying recombination spots in
Saccharomyces cerevisiae by incorporating hexamer composition into gen-
eral PseKNC. Int. J. Biol. Sci., 14, 883–891.
Yang,H. et al. (2016) Identification of Secretory Proteins in Mycobacterium
tuberculosis Using Pseudo Amino Acid Composition. Biomed. Res. Int.,
2016, 5413903.
Yuan,L.Z. et al. (2017) Using Quadratic Discriminant Analysis to Predict
Protein Secondary Structure Based on Chemical Shifts. Curr. Bioinform.,
12, 52–56.
Zalatan,F. et al. (1993) Deletion analysis of the Escherichia coli
rho-dependent transcription terminator trp t’. J. Biol. Chem., 268,
17051–17056.
Zhang,C.J. et al. (2016) iOri-Human: identify human origin of replication by
incorporating dinucleotide physicochemical properties into pseudo nucleo-
tide composition. Oncotarget, 7, 69783–69793.
Zhu,A.Q. and von Hippel,P.H. (1998) Rho-dependent termination within the
trp t’ terminator. I. Effects of rho loading and template sequence.
Biochemistry, 37, 11202–11214.
Zou,Q. et al. (2016a) Pretata: predicting TATA binding proteins with novel
features and dimensionality reduction strategy. BMC Syst. Biol., 10, 114.
Zou,Q. et al. (2016b) A novel features ranking metric with application to scal-
able visual and bioinformatics data classification. Neurocomputing, 173,
346–354.
Predicting bacterial transcriptional terminators 1477
Dow
nloaded from https://academ
ic.oup.com/bioinform
atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C
hina user on 04 May 2019