iTerm-PseKNC: a sequence-based tool for predicting bacterial...

Sequence analysis

iTerm-PseKNC: a sequence-based tool for

predicting bacterial transcriptional terminators

Chao-Qin Feng1, Zhao-Yue Zhang1, Xiao-Juan Zhu1, Yan Lin2,

Wei Chen1,3,*, Hua Tang4,* and Hao Lin1,*

1Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for

Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China, 2Key

Laboratory for Animal Disease Resistance Nutrition of the Ministry of Education, Animal Nutrition Institute,

Sichuan Agricultural University, Chengdu 610054, China, 3Center for Genomics and Computational Biology, School

of Life Sciences, North China University of Science and Technology, Tangshan 063000, China and 4Department of

Pathophysiology, Southwest Medical University, Luzhou 646000, China

*To whom correspondence should be addressed.

Associate Editor: John Hancock

Received on August 12, 2018; revised on September 13, 2018; editorial decision on September 19, 2018; accepted on September 20, 2018

Abstract

Motivation: Transcription termination is an important regulatory step of gene expression. If there

is no terminator in gene, transcription could not stop, which will result in abnormal gene expres-

sion. Detecting such terminators can determine the operon structure in bacterial organisms and

improve genome annotation. Thus, accurate identification of transcriptional terminators is essen-

tial and extremely important in the research of transcription regulations.

Results: In this study, we developed a new predictor called ‘iTerm-PseKNC’ based on support vec-

tor machine to identify transcription terminators. The binomial distribution approach was used to

pick out the optimal feature subset derived from pseudo k-tuple nucleotide composition (PseKNC).

The 5-fold cross-validation test results showed that our proposed method achieved an accuracy of

95%. To further evaluate the generalization ability of ‘iTerm-PseKNC’, the model was examined on

independent datasets which are experimentally confirmed Rho-independent terminators in

Escherichia coli and Bacillus subtilis genomes. As a result, all the terminators in E. coli and 87.5%

of the terminators in B. subtilis were correctly identified, suggesting that the proposed model could

become a powerful tool for bacterial terminator recognition.

Availability and implementation: For the convenience of most of wet-experimental researchers,

the web-server for ‘iTerm-PseKNC’ was established at http://lin-group.cn/server/iTerm-PseKNC/, by

which users can easily obtain their desired result without the need to go through the detailed math-

ematical equations involved.

Contact: [email protected] or [email protected] or [email protected]

1 Introduction

In bacterial, gene expression is a complex and coordinate process

involving gene transcription start and termination that were regu-

lated by promoter and terminator, respectively. When the RNA

polymerase binds to a promoter, the transcription from DNA to

RNA starts and continues until the RNA polymerase reaches a

transcription terminator. All the RNA polymerases must terminate,

dissociate and release the product RNA at the terminator. Thus,

terminator is an extremely important DNA regulatory element.

Generally, terminator locates at the 3’ end of genes or within cistrons.

There are two categories of terminators: Rho-dependent terminator

(also called factors-dependent terminator) and Rho-independent

VC The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 1469

Bioinformatics, 35(9), 2019, 1469–1477

doi: 10.1093/bioinformatics/bty827

Advance Access Publication Date: 21 October 2018

Original Paper

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/article-abstract/35/9/1469/5104941 by University of Electronic Science and Technology of C

hina user on 04 May 2019

http://lin-group.cn/server/iTerm-PseKNC/

https://academic.oup.com/

terminator (also called factor-independent or intrinsic terminator) as

shown in Figure 1. The former needs additional factors to present its

function. The latter displays its function only based on the DNA tem-

plate and the RNA polymerase.

Detecting transcription terminators can improve genome annota-

tion and shed light on the organization of bacterial genomes.

Although the biochemical experimental approaches can identify two

types of terminators clearly and accurately, the wet-experimental

techniques are really time-consuming and expensive. Thus, compu-

tational methods provide researchers with a great choice to accur-

ately recognize terminators.

Early attempt using computational methods for terminators pre-

diction included dinucleotide distribution matrices (Brendel et al.,

1986; Brendel and Trifonov, 1984) and artificial neural networks

(Nair et al., 1994). They only used the nucleotide composition infor-

mation of the Rho-independent terminator sequence to describe ter-

minator samples. Subsequently, some prediction methods using

hairpin structure and the downstream T-rich region as features were

proposed. Daubenton-Carafa et al. (1990) built a score system-

based model by combining free energy of RNA hairpin formation,

the number and position information of thymidine residues in the

T-rich region with binary linear prediction. de Hoon et al. (2005)

also developed a score system based on the information of T-rich re-

gion. Ermolaeva et al. (2000) proposed a method called TransTerm

which used stepwise threshold filtering instead of binary linear pre-

diction. Yada et al. (1999) converted these two types of features into

seven features for multiple linear prediction. But the above methods

did not consider the stability of T-rich region. In contrast, Lesnik

et al. (2001) and Unniraman et al. (2002) predicted Rho-

independent terminators according to the hairpin structure and the

free energy of the T-rich region.

Although above-mentioned methods have achieved some prom-

ising results for terminator prediction, some aspects in this area

should be improved due to the following reasons. Firstly, the bench-

mark datasets constructed by these methods were too small to reflect

the statistical profile of terminators. Secondly, the predictive per-

formance of existing methods is still far from satisfaction. Thirdly,

these published computational works mainly focused on the predic-

tion of Rho-independent terminators. Although Rho-dependent ter-

minators also have been intensively studied in the binding between

Rho-binding site and Rho protein (Guerin et al., 1998; Zhu and von

Hippel, 1998) and the observation of consensus sequence in Rho ter-

minations (Carlomagno and Nappo, 2001; Chen and Richardson,

1987; Ciampi et al., 1989; Das, 1992; Friedman and Olson, 1983;

Gong and Yanofsky, 2002; Graham and Richardson, 1998;

Heinrich et al., 1995; Vogel and Jensen, 1995; Zalatan et al., 1993),

computational method for identifying Rho-dependent terminators is

less well-studied. Fourthly, no web-server was provided for these

methods, and hence their usage is quite limited, particularly for the

wet-experimental scholars.

This study was devoted to enhancing the prediction power and

quality in identifying the bacterial terminators from the aforemen-

tioned four aspects. We firstly constructed an objective benchmark

dataset including 280 experimentally confirmed terminator sequen-

ces and 560 non-terminator sequences in Escherichia coli.

Subsequently, these samples were formulated by integrating the

oligonucleotide composition information and their physicochemi-

cal correlation into pseudo k-tuple nucleotide composition

(PseKNC). Moreover, the support vector machine (SVM) was uti-

lized to discriminate terminators from non-terminators. According

to the proposed method, we established a predictor named ‘iTerm-

PseKNC’.

2 Materials and methods

2.1 DataConstructing an objective benchmark dataset could provide us a

chance not only to obtain the intrinsic and essential properties of ter-

minator, but also to correctly evaluate the performance of proposed

method.

We extracted the E. coli terminator sequences from the

RegulonDB (Gama-Castro et al., 2016). After deleting overlapping

and repeat terminator sequences, we finally obtained 18 rho-

dependent and 262 rho-independent terminator sequences which

were regarded as positive dataset.

Previous studies (Lesnik et al., 2001; Zhu and von Hippel,

1998) showed that the lengths of rho-independent and rho-

dependent terminators are usually within 100 bp and 70–80 bp, re-

spectively. Our statistical results showed that the average lengths of

terminators are �50 bp. Thus, to include sequence as more as pos-

sible, the length of negative samples was set to 80 bp, which were

extracted from �100 to �20 bp upstream and 20–100 bp down-

stream of 280 terminators in E. coli K-12 genome, respectively.

Accordingly, 280 terminator and 560 non-terminator sequences

were obtained. We further used CD-HIT (Fu et al., 2012) to inves-

tigate the sequence identity of all sequences and found that 836

sequences have sequence identity of <75%, suggesting that using

such data will not cause model bias.

2.2 Feature descriptionThe second key step for a powerful predictor is to develop a valid

feature set to represent samples. The following will describe the fea-

tures used in this study.

Suppose a DNA sequence D consists of L nucleic acid

residues; i.e.

Fig. 1. A schematic of bacterial transcriptional termination. (A) Rho-depend-

ent transcription termination; (B) Rho-independent transcription termination

1470 C.-Q.Feng et al.

Dow





D ¼ R1R2 . . . Ri . . . RL (1)

where Ri represents the i-th nucleic acid residue at position i in the

DNA sequence.

Some sequence alignment-based methods such as BLAST and

FASTA could directly perform sequence similarity search by using

sequences. However, these methods can do nothing when facing

low-similar sequences. Fortunately, machine learning methods

could provide help to solve this problem (Cao et al., 2017). Thus,

transferring a DNA sequence to a vector without losing sample in-

formation has become one of the most challenging problems in

bioinformatics.

Due to the fact that k-tuple (also called k-mer) nucleotide com-

position could reflect the oligonucleotide composition and has

been widely used in DNA element identification (Lin and Li, 2011),

we utilized this feature to formulate each sample with 4k compo-

nents; i.e.

D ¼ ½f k�tuple1 f k�tuple

2 . . . f k�tuplei . . . f k�tuple

4k � (2)

where fi is the normalized occurrence frequency of the i-th k-tuple

nucleotide in the DNA sequence. As we can see from Equation (2),

only short-range correlation information was considered in the for-

mulation. The long-range interaction between oligonucleotides was

fully ignored.

Encouraged by the successes of pseudo amino acid composition

method in computational proteomics, PseKNC (Chen et al., 2014,

2015) was proposed to convert a DNA/RNA sequence to a same di-

mension vector which could not only reflect the short-range se-

quence-order information by using k-tuple nucleotide composition,

but also consider long-range sequence-order information by calcu-

lating the correlation of physicochemical properties of oligonucleoti-

des in DNA/RNA sequences. Thus, PseKNC has been successfully

applied in the identification of various of DNA/RNA regulatory ele-

ments, such as DNA recombination spots (Chen et al., 2013; Yang

et al., 2018b) nucleosome positioning sequence (Chen et al., 2016;

Guo et al., 2014), promoter (Lin et al., 2014, 2017), translation ini-

tiation site (Chen et al., 2014), origin of replication (Li et al., 2015a,

b; Zhang et al., 2016), RNA modification sites (Chen et al., 2018;

Feng et al., 2018; Yang et al., 2018a). Although these works have

demonstrated that PseKNC is a good feature to formulate DNA/

RNA samples, all them used Type-I PseKNC. The different physico-

chemical properties were mixed together in Type I PseKNC. Because

different properties maybe play different roles, the Type-II PseKNC

(also called series correlation PseKNC) could consider these differ-

ences and improve the description on samples.

Thus, in this study, according to the thought of series correlation

PseKNC (Type-II PseKNC) (Chen et al., 2014), we made a minor

modification on the Type-II PseKNC and then used a vector contain-

ing ð4k þ k � KÞ components to describe a terminator/non-terminator

sequence as below:

DIIPseKNC ¼ ½d1 . . . d4k d4kþ1 . . . d4kþk d4kþkþ1 . . . d4kþkK�

T

(3)

where

du ¼f k�tupleu ;1 � u � 4k

xsu�4k ;4k þ 1 � u � 4k þ kK

((4)

where f k�tupleu has the same meaning as defined in Equation (2). The

su�4k reflects the long-range sequence-order effect of a DNA se-

quence with L nucleotides and can be calculated by:

s1 ¼1

L� k� 1

XL�k�1

i¼1

J1i;iþ1

s2 ¼1

L� k� 1

XL�k�1

i¼1

J2i;iþ1

. . . . . .

sK ¼1

L� k� 1

XL�k�1

i¼1

JKi;iþ1

. . . . . .

skK�1 ¼1

L� k� k

XL�k�k

i¼1

JK�1i;iþk

skK ¼1

L� k� k

XL�k�k

i¼1

JKi; iþk

:

8>>>>>>>>>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>>>>>>>>>:

(5)

In Equations (3)–(5), k reflects the tiers or correlation rank along

a DNA sequence and should be set to an integer number less than

L�k. x is a weight factor. K is the number of physicochemical prop-

erties. The Jni;iþm describes the correlation of the n�th physicochemi-

cal property between the i-th oligonucleotide (RiRiþ1 . . . Riþn�1) and

(iþm)-th oligonucleotide (RiþmRiþmþ1 . . . Riþmþn�1) and can be cal-

culated by

Jni;iþm ¼ Hn RiRiþ1 . . . Riþn�1ð Þ �Hn RiþmRiþmþ1 . . . Riþmþn�1ð Þ

n ¼ 1; 2; . . . K; m ¼ 1; 2; . . . ; k; i ¼ 1; 2; . . . ; L� k� k

8><>:

(6)

where n is the length of oligonucleotides. HnðRiRiþ1 . . . Riþn�1Þ and

HnðRiþmRiþmþ1 . . . Riþmþn�1Þ are the values of the n� th physico-

chemical property for oligonucleotide RiRiþ1 . . . Riþn�1 and

RiþmRiþmþ1 . . . Riþmþn�1 in a DNA sequence. In this study, six phys-

icochemical properties including slide, shift, rise, roll, tilt and twist

for dinucleotides (namely n ¼ 2) were used, and the original physi-

cochemical property values can be obtained from Brukner et al.

(1995). Note that before substituting these values into Equation (6),

they should be standardized by the following equation:

Hn RiRiþ1ð Þ ¼H0

n RiRiþ1ð Þ � H0nðRiRiþ1Þ

D ESD H0

nðRiRiþ1ÞD E (7)

where the symbol �h i means taking the average of quantity therein

over the 16 different combinations of A, T, C, G for RiRiþ1, and SD

means the corresponding standard deviation.

2.3 Feature selectionAs we can see from the Equation (3), with the increase of k, we can

get more information about sequence order, but the dimension of

the vector DIIPseKNC will also rapidly increase and lead to the so-

called ‘high-dimension disaster’. Moreover, a model with excessive

features increases the risk of overfitting which will reduce the

cluster-tolerant or deviation tolerance (Chou, 1999) and may result

in a poor performance of model.

To save computational resources, reduce run-time and improve

predictive accuracy, it is necessary to optimize feature set by picking

out relevant features that contribute greatly for the prediction. At

present, some effective feature selection techniques which can reduce

irrelevant features and noisy information have been proposed, such

as minimum redundancy maximum relevance (Peng et al., 2005),

analysis of variance (Chen et al., 2016; Tang et al., 2016, 2018;

Yang et al., 2016), principal component analysis (Wang and Yang,

Predicting bacterial transcriptional terminators 1471

Dow





2009), max-relevance-max-distance (Zou et al., 2016a, b) and bino-

mial distribution (BD) (Lai et al., 2017; Su et al., 2018).

Generally, oligonucleotide frequencies do not obey the Gaussian

distribution for a large k with relatively small sample data. Thus, we

proposed using BD-based technique to rank the important of short-

range sequence-order features. According to the theory of BD-based

technique, we can judge whether a certain k-tuple nucleotide occurs

in terminators or non-terminators is an essentially random event or

not.

First, we should calculate the prior probability qi by Equation (8)

qi ¼ mi=M (8)

where mi is the total number of k-tuple nucleotide in the i-th cat-

egory of samples (i ¼ 1, 2 corresponds to terminator samples and

non-terminator samples). M denotes the total number of all k-tuple

nucleotides in both positive and negative datasets.

Second, the probability of the j-th k-tuple nucleotide in the i-th

category of datasets can be obtained by the following equation:

P nijð Þ ¼XNj

m¼nij

Nj!

m!ðNj �mÞ qmi ð1� qiÞNj�m (9)

where Nj and nij denote the total numbers of the j-th k-tuple nucleo-

tides, respectively, occurring in the benchmark dataset and in i-th

category dataset.

If the j-th k-tuple nucleotide occurring in the i-th category data-

set has biological significance, namely its appearance is not a ran-

dom event, the value of PðnijÞ will be very small. In the current

study, there are two sample datasets, namely positive sample and

negative sample, meaning that we should calculate two probabilities

Pðn1jÞ and Pðn2jÞ for the two datasets. Then, we may define the oc-

currence probability of the j-th k-tuple nucleotide in benchmark

dataset as follows

Pj ¼ minðPðn1jÞ; Pðn2jÞÞ (10)

where Pðn1jÞ and Pðn2jÞ are the probabilities of j-th k-tuple nucleo-

tide in terminator samples and non-terminator samples, respectively.

All the k-tuple nucleotides were ranked by their probability values,

that is to say, if the value of Pj is smaller, this k-tuple nucleotide is

more effective for classification.

2.4 Support vector machine (SVM)SVM is a supervised machine learning algorithm based on structural

risk minimization (Burges, 1998) framework. It has been widely

used in many bioinformatics fields (Cao et al., 2014; Chen et al.,

2017; Feng et al., 2013a, b, c; Li et al., 2016; Song et al., 2006,

2010, 2012, 2018a, b; Tang et al., 2017). The basic principle of

SVM is to map the original feature vectors to a higher Hilbert space

by using various kernel functions and then searches a separating

hyperplane in Hilbert space. A soft package LIBSVM used to

implement SVM has been developed by Lin’s lab, which can be

freely downloaded from http://www.csie.ntu.edu.tw/�cjlin/libsvm/

(Chang and Lin, 2011).

In our study, the input of SVM is the vectors obtained from

terminator and non-terminator sequences according to Equation

(3). The default radial basis function was utilized as the kernel

function to construct SVM model. The optimization of the regu-

larization parameter C and the kernel width parameter c were

performed by using Grid search with the following searching

space:

2�5 � C � 215; with step D ¼ 2

2�15 � c � 2�5; with step D ¼ 2�1

8<: (11)

2.5 Performance evaluation2.5.1 N-fold cross-validation

Cross-validation is a model validation technique for assessing the

performance of a statistical analysis that will generalize to an inde-

pendent dataset. The goal of cross-validation is to define a dataset to

‘test’ the model in the training phase. As summarized in a review

(Chou and Zhang, 1995), n-fold cross-validation, jackknife test and

independent dataset test are often used.

In our study, we used the 5-fold cross-validation (n ¼ 5) to esti-

mate the generalization of model. In 5-fold cross-validation, the

benchmark dataset is equally divided into five subsets, and one of

the five subsets is used as the test set or validation set and the other

four subsets are put together to form a training set. Then, this pro-

cession repeats five times.

2.5.2 Performance evaluation parameters

We used four metrics which are specificity (Sp), sensitivity (Sn), over-

all accuracy (Acc) and Matthew’s correlation coefficient (MCC)

(Feng et al., 2013a, b, c; Li et al., 2018a, b; Song et al., 2018a, b) to

evaluate the predictive performance of our model.

Sn ¼ 1�Nþ�Nþ

; 0 � Sn � 1

Sp ¼ 1�N�þN�

; 0 � Sp � 1

Acc ¼ 1�Nþ� þN�þNþ þN�

; 0 � Acc � 1

MCC ¼1� Nþ�

NþþN�þ

N�

� �ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1þN�þ �Nþ�Nþ

� �1þNþ� �N�þ

N�

� �s ; 0 � MCC � 1

8>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>:

(12)

where Nþ and N� are the total numbers of terminators and non-

terminators in benchmark dataset, respectively, where Nþ� denotes

the number of the terminators recognized as non-terminators, N�þdenotes the number of the incorrectly predicted non-terminators. Sn

and Sp, respectively, represent the ability of model to correctly

identify terminators and non-terminators. Acc measures the overall

accuracy of our model for distinguishing terminators from non-

terminators. MCC reflects the performance of our model on imbal-

ance benchmark dataset, here the number of positive:negative is 1:2.

We also drew the receiver operating characteristic (ROC) curves

and calculated the area under the receiver operating characteristic

curve (AUC) to measure the performance of models across the entire

range of SVM decision values (Yang et al., 2018b).

3 Results

3.1 Parameter optimizationFrom Equation (3), the performance of the SVM predictor depends

on three parameters, k, k and x, where k and k reflect the short-

range or local sequence-order effect and the long-range or global se-

quence-order effect, respectively. One may notice that the greater

the k and k are, the more sequence-order information the model


Dow





http://www.csie.ntu.edu.tw/&hx0026;sim;cjlin/libsvm/

http://www.csie.ntu.edu.tw/&hx0026;sim;cjlin/libsvm/

contains. x is used to weight the short-range and long-range, and its

value is usually within the range from 0 to 1. Therefore, we searched

for the optimal values of the three parameters as the following

regions.

2 � k � 6; with step D ¼ 11 � k � 20; with step D ¼ 10:1 � x � 1; with step D ¼ 0:1

8<: (13)

According to Equation (13), we considered 1000 independent

parameter combinations, and performed a 5-fold cross-validated

test for each of these parameter combinations. The accuracies were

showed in Figure 2.

From the Figure 2, we noticed that parameter k plays a key role

in terminator identification. No matter what values of k and x are

used, with the increase of k from 2 to 5, the Acc improves from �85

to �95%. However, when k is set to 6, the prediction performance

decreases to �92%. This observation obeys the rule of short-range

dominance of base correlations (Luo et al., 1998). Once the short-

range correlation parameter k is determined, in most of cases, the

Acc improves with the increase of k. However, when k ¼ 5, the

maximum Acc appears at k ¼ 2; 3; 4; 5. This phenomenon can be

explained by the inhomogeneity of long-range correlation property

(Luo et al., 1998). To include long-range correlation information as

more as possible in prediction model, the parameter k was set to 5.

In all 1000 models, the weight parameter x plays a very slight role

in balancing the information between short-range and long-range

correlation because the SVM is a powerful algorithm which can deal

with feature dimension issue. Thus, we used x ¼ 1 for the conveni-

ence. And then, the final model with the Acc of 95.37% in this sec-

tion was established when k ¼ 5, k ¼ 5 and x ¼ 1. It means that the

combination of the short-range information, namely 5-tuple nucleo-

tide (or called pentamer) composition, with the long-range se-

quence-order information, namely 30 (5 � 6) physicochemical

correlation information, could produce the best performance (Acc

of 95.37%) for discriminating terminators from non-terminators.

3.2 Feature optimizationAccording to Equation (3), the dimension for the PseKNC vector

with k ¼ 5 and k ¼5 is 45 þ 5� 6 ¼ 1054, which may lead to fea-

ture redundancy. To find out the optimal pentamer, we used the bi-

nomial distribution technique to calculate the probabilities of the

1024 short-range sequence-order features, and then the increment

feature selection (IFS) (Li et al., 2016, 2015a, b) was used to pick

out the best feature subset which could produce the accuracy.

Firstly, we ranked the 1024 pentamers in descending order accord-

ing to their probabilities [Equation (10)]. Secondly, we selected the

first pentamer with the smallest probability in the ranked feature set

as the first feature subset. Subsequently, the second feature subset

was constructed by adding the pentamer with the second smallest

probability into the first feature subset, and the third feature subset

was formed by adding the pentamer with the third smallest prob-

ability into the second feature subset and so on. Finally, 1024 fea-

ture subsets were constructed. To find out the best feature subset

that can produce the maximum accuracy, we used SVM with 5-fold

cross-validation to evaluate the performances of the 1024 feature

subsets. The IFS curve was drawn in Figure 3. It is clear that the

maximum overall accuracy of 95.48% was achieved when 808 opti-

mal pentamers were used.

Finally, we added the 30 long-range sequence-order features to

the 808 optimal pentamers and achieved a much higher Acc of

95.71% with the Sn, Sp and MCC of 86.07%, 99.46% and 0.8881,

respectively. We also drew the correlated ROC curve in Figure 4. It

shows that the AUC reaches the value of 0.9520, suggesting that the

Fig. 2. The accuracies of 1000 parameter combinations by using 5-fold cross-

validation on benchmark dataset

Fig. 3. A plot to show the IFS curve. The IFS curve reaches to peak of 95.48%

when abscissa is 808

Fig. 4. The ROC curves for terminator predictors with 838 features


Dow





proposed model has an excellent prediction capability on terminator

recognition.

To show the prediction capability of final model under the con-

ditions of Sp ¼ 99.90, 99.00, 95.00%, we recorded other evaluation

indexes in Table 1. From the table, we noticed that the final model

could produce high sensitivities for the three thresholds, suggesting

that the model is accurate and robust.

3.3 Feature analysisAccording to statistical theory for small probability event, the occur-

rence of a feature with smaller probability (or high confidence level)

is not a stochastic event. Thus, the feature is important for the con-

struction of prediction model. To intuitively show the contribution

of each pentamer, a heat map was plotted in Figure 5 which is a

graphical representation of a matrix, where the elements represent

the pentamers and are encoded by using different colors according

to their confidence levels. The column and the row of the heat map

represent the first dimers and the last trimers of pentamers,

respectively.

In feature selection process, 808 optimal pentamers were selected

to produce the best result. Only 216 redundant information were

excluded. From Figure 5, one may also notice that many pentamers

have confidence level from yellow to red elements, demonstrating

that the feature selection is correct. However, Figure 5 cannot reflect

the preference of optimal pentamers in terminators or non-

terminators. Thus, we did a further analysis on 1024 pentamers, and

selected 40 pentamers with highest confidence level (>98.20%) as

the most important features for classification. Of the 40 pentamers,

20 often appear in terminators (Fig. 6A), and the remained 20 prefer

to non-terminators (Fig. 6B). We noticed that the 20 pentamers

occurring in terminators always contain short repeated trimers, such

as AAA, CCC and TTT. However, the other 20 pentamers does not

display similar mode. To further investigate the importance of the

40 pentamers, we retrained the proposed model using these pentam-

ers, and achieved the Acc of 87.50% in 5-fold cross-validation. Such

high accuracy demonstrates the preference of pentamers in

terminators.

To further investigate the sequence mode in terminators,

we used Multiple Em for Motif Elicitation (MEME) (Bailey and

Elkan, 1994) to discover the consensus motifs in terminators.

Results showed that the consensus sequence is A[AT]AAAAAA[AG]

[CG]C[CG][GC][CG][CT-A][CGAT][AT][ACTG][GTA][GCA][GC].

Excitingly, the element obtained by MEME is identical to the

optimized pentamer from feature selection technique.

3.4 Web-server introductionBased on our proposed model, a user-friendly and publicly access-

ible web-server for ‘iTerm-PseKNC’ was established available at

http://lin-group.cn/server/iTerm-PseKNC/, by which users can up-

load DNA sequences to search where the terminator is. The user

guide is introduced as follow:

Step 1. Click on the web address http://lin-group.cn/server/

iTerm-PseKNC/ and one may see the homepage of ‘iTerm-PseKNC’

on his computer screen as shown in Figure 7. Users may read the

brief introduction about the predictor.

Step2. Click on the WEBSERVER button to see the prediction

interface. Either type or copy/paste the query DNA sequences with

length of >80 bp into the input box at the center of the page. User

could also upload a DNA sequence file below the input box. The in-

put sequence must be in the FASTA format. Users could check the

example sequences with FASTA format by clicking on the Example

button on the right of the input box.

Step 3. Click on the Submit button to see the predicted result. If

you use the four sequence samples in the Example window as an in-

put, after clicking the Submit button, you will see the following out-

comes shown on the screen of your computer. (i) Two samples were

Table 1. The performance of the model for different thresholds

Threshold Sp (%) Sn (%) Acc (%) MCC

High 99.90 87.10 95.60 0.9019

Medium 99.00 90.00 95.95 0.9087

Low 95.00 93.20 94.40 0.8764

Fig. 5. A heat map to show the confidence level of the 1024 pentamers. The

color scale is ranged from blue (represents low confidence level or large

probability) through green and yellow to red (represents high confidence

level or small probability)

A

B

Fig. 6. A histogram to show the different heptamers between the terminators

and non-terminators. (A) 20 pentamers often appear in terminator sequences.

(B) 20 pentamers often appear in non-terminator sequences


Dow








predicted as terminator with the probabilities of containing a ter-

minator are 99.45% and 99.18%. (ii) Two samples were predicted

as non-terminator with the probabilities of containing a terminator

are 3.15% and 0.63%. All these results are fully consistent with the

experimental observations. It only takes about few seconds for the

above computation before the predicted results appear on your com-

puter screen.

Step 4. Click on the DOWNLOAD button to download the

benchmark datasets used to train and test the ‘iTerm-PseKNC’

predictor.

Step 5. Click on the CITATION button to find the relevant

papers that document the detailed development and algorithm of

‘iTerm-PseKNC’.

Step 6. Click on the ABOUT US button to learn about us. You

may also contact us using Contact button.

3.5 Evaluating on other datasetsTo further evaluate the generalization ability of ‘iTerm-PseKNC’, it

is necessary to examine the predictor on independent data. Thus, we

built two independent sets. One contains 147 experimentally con-

firmed rho-independent terminators in E. coli (Lesnik et al., 2001).

The other includes 425 experimentally identified terminators in

Bacillus subtilis (de Hoon et al., 2005). We executed the predictor

‘iTerm-PseKNC’ on the two independent datasets. Results show

that all terminators in E. coli and 87.5% terminators in B. subtilis

were correctly identified, suggesting that our proposed method is

powerful and can be a useful tool for bacterial terminator

annotation.

Based on the 147 E. coli terminators, Lesnik et al. (2001) devel-

oped an RNAMotif algorithm and achieved the accuracy of 100%

on this data. However, the benchmark dataset used to train ‘iTerm-

PseKNC’ model does not contain Lesnik’s data, meaning our

obtained results are stricter and more objective. de Hoon et al.

(2005) analyzed the 425 B. subtilis terminator sequences and cor-

rectly identified 94% of B. subtilis terminator by using a decision

rule. One may notice that the accuracy of our model on B. subtilis

terminators is not higher than that of de Hoon’s model. The reason

is that de Hoon’s model was trained on B. subtilis terminators, but

our model on E. coli terminators. Thus, our predictor could provide

more reliable and flexible cross-species prediction. What is more,

none of these works provides web-server, which may prevent wet-

experimental scholars from using these published methods.

3.6 Comparison with other traditional machine learning

methodsVarious of machine learning methods have been widely applied in

bioinformatics fields (Cao and Cheng, 2016; Cao et al., 2016; Li

et al., 2018a, b; Loh et al., 2018; Naseem et al., 2017; Patel et al.,

2017; Peng et al., 2018; Stephenson et al., 2018; Yuan et al., 2017).

To compare the prediction performance of different methods on ter-

minator identification, we repeated above procession using four

kinds of methods: decision tree, multi-layer perceptron, logistic re-

gression and Naıve Bayes. The scikit-learn package in python was

used to implement the four algorithms. All results were listed in

Table 2. As shown in Table 1, it is clear that the four metrics

[defined in Equation (12)] achieved by SVM are significantly higher

than those of the other four methods, indicating that it is correct

and suitable to use SVM to build computational method for termin-

ator identification.

4 Conclusion

In this study, we developed a predictor called ‘iTerm-PseKNC’ to

identify terminators. The Type-II PseKNC was first applied in bio-

informatics field, which consider not only the short-range oligo-

nucleotide composition, but also the long-range correlation of

physicochemical properties of oligonucleotide. Different from trad-

itional Type-I PseKNC, the Type-II PseKNC could separate different

long-range correlation for different physicochemical properties,

which could provide more objective description on samples. The

rates achieved by the predictor with 808 pentamer features and 30

long-range sequence-order features were 86.07%, 99.46%, 95.71%

and 0.88% in Sn, Sp, Acc and MCC, respectively, indicating that

DNA sequence contains important information for terminator identi-

fication. Many kinds of motifs were found by using feature analysis

technique. The examinations on independent datasets demonstrated

that ‘iTerm-PseKNC’ is a reliable tool to identify terminators in bac-

terial genomes. In the future, we will collect more terminators and

separate rho-independent from rho-dependent terminators to retrain

the model for further improving the prediction performance.

Funding

This work was supported by the National Nature Scientific Foundation

of China (61772119, 31771471, 61702430); the Fundamental Research

Funds for the Central Universities of China (Nos. ZYGX2015Z006,

ZYGX2016J125, ZYGX2016J118); Natural Science Foundation for

Distinguished Young Scholar of Hebei Province (No. C2017209244);

the Program for the Top Young Innovative Talents of Higher Learning

Institutions of Hebei Province (No. BJ2014028) and the Science Strength

Promotion Programme of UESTC.

Conflict of Interest: none declared.

References

Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maxi-

mization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst.

Mol. Biol., 2, 28–36.

Fig. 7. The top page of the ‘iTerm-PseKNC’ web-server at http://lin-group.cn/

server/iTerm-PseKNC/

Table 2. A comparison of different methods for identifying

terminator

Method Sn (%) Sp (%) Acc (%) MCC

Decision tree 69.64 90.00 83.21 0.7562

Multi-layer perceptron 63.57 94.46 84.16 0.7078

Logistic regression 46.07 93.92 77.97 0.3839

Naıve Bayes 70.71 86.25 81.07 0.7521

Our method 86.07 99.46 95.71 0.8881


Dow







Brendel,V. et al. (1986) Terminators of transcription with RNA polymerase

from Escherichia coli: what they look like and how to find them. J. Biomol.

Struct. Dyn., 3, 705–723.

Brendel,V. and Trifonov,E.N. (1984) A computer algorithm for testing poten-

tial prokaryotic terminators. Nucleic Acids Res., 12, 4411–4427.

Brukner,I. et al. (1995) Sequence-dependent bending propensity of DNA as

revealed by DNase I: parameters for trinucleotides. EMBO J., 14,

1812–1818.

Burges,C.J.C. (1998) A tutorial on support vector machines for pattern recog-

nition. Data Min. Knowl. Discov., 2, 121–167.

Cao,R. and Cheng,J. (2016) Integrated protein function prediction by mining

function associations, sequences, and protein-protein and gene-gene inter-

action networks. Methods, 93, 84–91.

Cao,R. et al. (2017) ProLanGO: protein Function Prediction Using Neural

Machine Translation Based on a Recurrent Neural Network. Molecules, 22,

E1732.

Cao,R. et al. (2014) SMOQ: a tool for predicting the absolute residue-specific

quality of a single protein model with support vector machines. BMC

Bioinformatics, 15, 120.

Cao,R.Z. et al. (2016) DeepQA: improving the estimation of single protein

model quality with deep belief networks. BMC Bioinformatics, 17, 495.

Carlomagno,M.S. and Nappo,A. (2001) The antiterminator NusB enhances

termination at a sub-optimal Rho site. J. Mol. Biol., 309, 19–28.

Chang,C.-C. and Lin,C.-J. (2011) LIBSVM: a library for support vector

machines. ACM Trans. Intell. Syst. Technol., 2, 1–27.

Chen,C.Y. and Richardson,J.P. (1987) Sequence elements essential for

rho-dependent transcription termination at lambda tR1. J. Biol. Chem.,

262, 11292–11299.

Chen,W. et al. (2016) Using deformation energy to analyze nucleosome posi-

tioning in genomes. Genomics, 107, 69–75.

Chen,W. et al. (2018) iRNA-3typeA: identifying Three Types of Modification

at RNA’s Adenosine Sites. Mol. Ther. Nucleic Acids, 11, 468–474.

Chen,W. et al. (2014) iTIS-PseTNC: a sequence-based predictor for identifying

translation initiation site in human genes using pseudo trinucleotide com-

position. Anal. Biochem., 462, 76–83.

Chen,W. et al. (2013) iRSpot-PseDNC: identify recombination spots with

pseudo dinucleotide composition. Nucleic Acids Res., 41, e68.

Chen,W. et al. (2014) PseKNC: a flexible web server for generating pseudo

K-tuple nucleotide composition. Anal. Biochem., 456, 53–60.

Chen,W. et al. (2017) iDNA4mC: identifying DNA N4-methylcytosine sites

based on nucleotide chemical properties. Bioinformatics, 33, 3518–3523.

Chen,W. et al. (2015) PseKNC-General: a cross-platform package for generat-

ing various modes of pseudo nucleotide compositions. Bioinformatics, 31,

119–120.

Chen,X.X. et al. (2016) Identification of Bacterial Cell Wall Lyases via Pseudo

Amino Acid Composition. BioMed Res. Int., 2016, 1654623.

Chou,K.-C. and Zhang,C.-T. (1995) Prediction of protein structural classes.

Crit. Rev. Biochem. Mol. Biol., 30, 275–349.

Chou,K.C. (1999) A key driving force in determination of protein structural

classes. Biochem. Biophys. Res. Commun., 264, 216–224.

Ciampi,M.S. et al. (1989) Features of the rho-dependent transcription termin-

ation polar element within the hisG cistron of Salmonella typhimurium. J.

Bacteriol., 171, 4472–4478.

Das,A. (1992) How the phage lambda N gene product suppresses transcription

termination: communication of RNA polymerase with regulatory proteins

mediated by signals in nascent RNA. J. Bacteriol., 174, 6711–6716.

Daubenton-Carafa et al. (1990) Prediction of rho-independent Escherichia

coli transcription terminators. J. Mol. Biol., 216, 835–858.

de Hoon,M.J. et al. (2005) Prediction of transcriptional terminators in

Bacillus subtilis and related species. PLoS Comput. Biol., 1, e25.

Ermolaeva,M.D. et al. (2000) Prediction of transcription terminators in bac-

terial genomes. J. Mol. Biol., 301, 27–33.

Feng,P. et al. (2018) iDNA6mA-PseKNC: identifying DNA

N(6)-methyladenosine sites by incorporating nucleotide physicochemical

properties into PseKNC. Genomics, doi: 10.1016/j.ygeno.2018.01.005.

Feng,P.M. et al. (2013a) iHSP-PseRAAAC: identifying the heat shock protein

families using pseudo reduced amino acid alphabet composition. Anal.

Biochem., 442, 118–125.

Feng,P.M. et al. (2013b) Naive Bayes classifier with feature selection to

identify phage virion proteins. Comput. Math. Methods Med., 2013,

530696.

Feng,P.M. et al. (2013c) Identification of antioxidants from sequence

information using naive Bayes. Comput. Math. Methods Med., 2013,

567529.

Friedman,D.I. and Olson,E.R. (1983) Evidence that a nucleotide sequence,

“boxA,” is involved in the action of the NusA protein. Cell, 34, 143–149.

Fu,L. et al. (2012) CD-HIT: accelerated for clustering the next-generation

sequencing data. Bioinformatics, 28, 3150–3152.

Gama-Castro,S. et al. (2016) RegulonDB version 9.0: high-level integration of

gene regulation, coexpression, motif clustering and beyond. Nucleic Acids

Res., 44, D133–D143.

Gong,F. and Yanofsky,C. (2002) Analysis of tryptophanase operon expression

in vitro: accumulation of TnaC-peptidyl-tRNA in a release factor 2-depleted

S-30 extract prevents Rho factor action, simulating induction. J. Biol.

Chem., 277, 17095–17100.

Graham,J.E. and Richardson,J.P. (1998) rut Sites in the nascent transcript me-

diate Rho-dependent transcription termination in vivo. J. Biol. Chem., 273,

20764–20769.

Guerin,M. et al. (1998) A simple polypyrimidine repeat acts as an artificial

Rho-dependent terminator in vivo and in vitro. Nucleic Acids Res., 26,

4895–4900.

Guo,S.H. et al. (2014) iNuc-PseKNC: a sequence-based predictor for predict-

ing nucleosome positioning in genomes with pseudo k-tuple nucleotide com-

position. Bioinformatics, 30, 1522–1529.

Heinrich,T. et al. (1995) Point mutations in the leader boxA of a

plasmid-encoded Escherichia coli rrnB operon cause defective antitermina-

tion in vivo. J. Bacteriol., 177, 3793–3800.

Lai,H.Y. et al. (2017) Sequence-based predictive modeling to identify cancer-

lectins. Oncotarget, 8, 28169–28175.

Lesnik,E.A. et al. (2001) Prediction of rho-independent transcriptional termi-

nators in Escherichia coli. Nucleic Acids Res., 29, 3583–3594.

Li,B.Q. et al. (2018a) Prediction of Protein-Peptide Interactions with a Nearest

Neighbor Algorithm. Curr. Bioinform., 13, 14–24.

Li,D.P. et al. (2016) Protein Folds Prediction with Hierarchical Structured

SVM. Curr. Proteomics, 13, 79–85.

Li,F. et al. (2018b) Quokka: a comprehensive tool for rapid and accurate pre-

diction of kinase family-specific phosphorylation sites in the human prote-

ome. Bioinformatics, 34, 4223–4231.

Li,F. et al. (2016) GlycoMine(struct): a new bioinformatics tool for highly ac-

curate mapping of the human N-linked and O-linked glycoproteomes by

incorporating structural features. Sci. Rep., 6, 34595.

Li,F. et al. (2015a) GlycoMine: a machine learning-based approach for pre-

dicting N-, C- and O-linked glycosylation in the human proteome.

Bioinformatics, 31, 1411–1419.

Li,W.C. et al. (2015b) iORI-PseKNC: a predictor for identifying origin of rep-

lication with pseudo k-tuple nucleotide composition. Chemometr. Intell.

Lab., 141, 100–106.

Lin,H. et al. (2014) iPro54-PseKNC: a sequence-based predictor for identify-

ing sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide com-

position. Nucleic Acids Res., 42, 12961–12972.

Lin,H. and Li,Q.Z. (2011) Eukaryotic and prokaryotic promoter prediction

using hybrid approach. Theory Biosci., 130, 91–100.

Lin,H. et al. (2017) Identifying sigma70 promoters with novel pseudo nucleo-

tide composition. IEEE/ACM Trans. Comput. Biol. Bioinform, doi:

10.1109/TCBB.2017.2666141.

Loh,S.K. et al. (2018) A Review of Computational Approaches to Predict

Gene Functions. Curr. Bioinform., 13, 373–386.

Luo,L.F. et al. (1998) Statistical correlation of nucleotides in a DNA sequence.

Phys. Rev. E., 58, 861–871.

Nair,T.M. et al. (1994) Application of artificial neural networks for prokary-

otic transcription terminator prediction. FEBS Lett., 346, 273–277.

Naseem,I. et al. (2017) ECMSRC: a Sparse Learning Approach for

the Prediction of Extracellular Matrix Proteins. Curr. Bioinform., 12,

361–368.

Patel,S. et al. (2017) DeepInteract: deep Neural Network Based

Protein-Protein Interaction Prediction Tool. Curr. Bioinform., 12, 551–557.


Dow





Peng,H. et al. (2005) Feature selection based on mutual information: criteria

of max-dependency, max-relevance, and min-redundancy. IEEE Trans.

Pattern Anal. Mach. Intell., 27, 1226–1238.

Peng,L. et al. (2018) The Advances and Challenges of Deep Learning

Application in Biological Big Data Processing. Curr. Bioinform., 13,

352–359.

Song,J. et al. (2006) Prediction of cis/trans isomerization in proteins using

PSI-BLAST profiles and secondary structure information. BMC

Bioinformatics, 7, 124.

Song,J. et al. (2018a) PROSPERous: high-throughput prediction of substrate

cleavage sites for 90 proteases with improved accuracy. Bioinformatics, 34,

684–687.

Song,J. et al. (2012) PROSPER: an integrated feature-based tool for predicting

protease substrate cleavage sites. PLoS One, 7, e50300.

Song,J. et al. (2010) Cascleave: towards more accurate prediction of caspase

substrate cleavage sites. Bioinformatics, 26, 752–760.

Song,J. et al. (2018b) iProt-Sub: a comprehensive package for accurately map-

ping and predicting protease-specific substrates and cleavage sites. Brief.

Bioinform., doi: 10.1093/bib/bby028.

Stephenson,N. et al. (2018) Survey of Machine Learning Techniques in Drug

Discovery. Curr. Drug Metabol., doi: 10.2174/1389200219666180

820112457.

Su,Z.D. et al. (2018) iLoc-lncRNA: predict the subcellular location of

lncRNAs by incorporating octamer composition into general PseKNC.

Bioinformatics, 34, 4196–4204.

Tang,H. et al. (2017) A two-step discriminated method to identify thermophil-

ic proteins. Int. J. Biomath., 10, 1750050.

Tang,H. et al. (2016) Identification of immunoglobulins using Chou’s pseudo

amino acid composition with feature selection technique. Mol. Biosyst., 12,

1269–1275.

Tang,H. et al. (2018) HBPred: a tool to identify growth hormone-binding pro-

teins. Int. J. Biol. Sci., 14, 957–964.

Unniraman,S. et al. (2002) Conserved economics of transcription termination

in eubacteria. Nucleic Acids Res., 30, 675–684.

Vogel,U. and Jensen,K.F. (1995) Effects of the antiterminator BoxA on tran-

scription elongation kinetics and ppGpp inhibition of transcription elong-

ation in Escherichia coli. J. Biol. Chem., 270, 18335–18340.

Wang,T. and Yang,J. (2009) Using the nonlinear dimensionality reduction

method for the prediction of subcellular localization of Gram-negative bac-

terial proteins. Mol. Divers., 13, 475–481.

Yada,T. et al. (1999) Modeling and predicting transcriptional units of Escherichia

coli genes using hidden Markov models. Bioinformatics, 15, 987–993.

Yang,H. et al. (2018a) iRNA-2OM: a sequence-based predictor for identifying

2’-O-methylation sites in Homo sapiens. J. Comput. Biol., doi:

10.1089/cmb.2018.0004.

Yang,H. et al. (2018b) iRSpot-Pse6NC: identifying recombination spots in

Saccharomyces cerevisiae by incorporating hexamer composition into gen-

eral PseKNC. Int. J. Biol. Sci., 14, 883–891.

Yang,H. et al. (2016) Identification of Secretory Proteins in Mycobacterium

tuberculosis Using Pseudo Amino Acid Composition. Biomed. Res. Int.,

2016, 5413903.

Yuan,L.Z. et al. (2017) Using Quadratic Discriminant Analysis to Predict

Protein Secondary Structure Based on Chemical Shifts. Curr. Bioinform.,

12, 52–56.

Zalatan,F. et al. (1993) Deletion analysis of the Escherichia coli

rho-dependent transcription terminator trp t’. J. Biol. Chem., 268,

17051–17056.

Zhang,C.J. et al. (2016) iOri-Human: identify human origin of replication by

incorporating dinucleotide physicochemical properties into pseudo nucleo-

tide composition. Oncotarget, 7, 69783–69793.

Zhu,A.Q. and von Hippel,P.H. (1998) Rho-dependent termination within the

trp t’ terminator. I. Effects of rho loading and template sequence.

Biochemistry, 37, 11202–11214.

Zou,Q. et al. (2016a) Pretata: predicting TATA binding proteins with novel

features and dimensionality reduction strategy. BMC Syst. Biol., 10, 114.

Zou,Q. et al. (2016b) A novel features ranking metric with application to scal-

able visual and bioinformatics data classification. Neurocomputing, 173,

346–354.


Dow





iTerm-PseKNC: a sequence-based tool for predicting bacterial...

Documents

Transcript of iTerm-PseKNC: a sequence-based tool for predicting bacterial...