Review Article Long Noncoding RNA Identification ...

15
Review Article Long Noncoding RNA Identification: Comparing Machine Learning Based Tools for Long Noncoding Transcripts Discrimination Siyu Han, 1 Yanchun Liang, 1,2 Ying Li, 1 and Wei Du 1 1 College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China 2 Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai 519041, China Correspondence should be addressed to Ying Li; [email protected] and Wei Du; [email protected] Received 12 August 2016; Revised 5 October 2016; Accepted 13 October 2016 Academic Editor: Xing Chen Copyright © 2016 Siyu Han et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Long noncoding RNA (lncRNA) is a kind of noncoding RNA with length more than 200 nucleotides, which aroused interest of people in recent years. Lots of studies have confirmed that human genome contains many thousands of lncRNAs which exert great influence over some critical regulators of cellular process. With the advent of high-throughput sequencing technologies, a great quantity of sequences is waiting for exploitation. us, many programs are developed to distinguish differences between coding and long noncoding transcripts. Different programs are generally designed to be utilised under different circumstances and it is sensible and practical to select an appropriate method according to a certain situation. In this review, several popular methods and their advantages, disadvantages, and application scopes are summarised to assist people in employing a suitable method and obtaining a more reliable result. 1. Introduction Long noncoding RNAs (lncRNAs), one of the most poorly understood but also the most common RNA species, are those noncoding transcripts with length more than 200 nucleotides. Initially, people classified noncoding RNA (ncRNA) genes as “junk gene” or transcriptional “noise” [1]. Nonetheless, researchers found that about 70% of the genome is transcribed in various contexts and cell types [2, 3], about 80% of the genome has biochemical functions [4], and many DNAs code for RNAs as the end products instead of proteins [5]. LncRNAs are involved in a wide range of cellular mechanisms such as the regulation of genome activity [6], histone modifications [7, 8], and DNA methylation [9]. In addition, lots of studies have demonstrated that lncRNAs have a significant role in diverse biological processes; thus lncRNAs are especially important to the studies of human biology and diseases [10]. For example, in prostate cancer of human, lncRNA SChLAP1 and chromatin remodelling complex SWI/SNF have opposing roles. SchLAP1 has an interaction with the SNF5 subunit of SWI/SNF and inhibits binding of SWI/SNF to chromatin, which leads to genome- wide derepression of gene activity [11]. Moreover, aber- rant expression of lncRNAs in cancer can be regarded as biomarkers and therapeutic targets because of its extremely specific expression [6]. e LncRNADisease database now integrates more than 1000 lncRNA-disease entries and 475 lncRNA interaction entries, which suggested that lncRNAs are associated with diseases closely [12]. Since lncRNAs so closely interact with diseases, many lncRNA-disease association detection tools are invented. Assuming that lncRNAs with similar functions tend to associate with similar diseases, a semisupervised method, Laplacian Regularized Least Squares for LncRNA-Disease Association (LRLSLDA) [13], was developed; this tool dis- plays a satisfying result and needs no negative samples. Nonetheless, this method is facing the problems of parameter selection and classifier combination. e principal idea of Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 8496165, 14 pages http://dx.doi.org/10.1155/2016/8496165

Transcript of Review Article Long Noncoding RNA Identification ...

Page 1: Review Article Long Noncoding RNA Identification ...

Review ArticleLong Noncoding RNA Identification ComparingMachine Learning Based Tools for Long NoncodingTranscripts Discrimination

Siyu Han1 Yanchun Liang12 Ying Li1 and Wei Du1

1College of Computer Science and Technology Key Laboratory of Symbol Computation and Knowledge Engineering ofMinistry of Education Jilin University Changchun 130012 China2Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of EducationZhuhai College of Jilin University Zhuhai 519041 China

Correspondence should be addressed to Ying Li liyingjlueducn and Wei Du weidujlueducn

Received 12 August 2016 Revised 5 October 2016 Accepted 13 October 2016

Academic Editor Xing Chen

Copyright copy 2016 Siyu Han et alThis is an open access article distributed under the Creative CommonsAttribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Long noncoding RNA (lncRNA) is a kind of noncoding RNA with length more than 200 nucleotides which aroused interest ofpeople in recent years Lots of studies have confirmed that human genome contains many thousands of lncRNAs which exert greatinfluence over some critical regulators of cellular process With the advent of high-throughput sequencing technologies a greatquantity of sequences is waiting for exploitation Thus many programs are developed to distinguish differences between codingand long noncoding transcripts Different programs are generally designed to be utilised under different circumstances and it issensible and practical to select an appropriate method according to a certain situation In this review several popular methodsand their advantages disadvantages and application scopes are summarised to assist people in employing a suitable method andobtaining a more reliable result

1 Introduction

Long noncoding RNAs (lncRNAs) one of the most poorlyunderstood but also the most common RNA speciesare those noncoding transcripts with length more than200 nucleotides Initially people classified noncoding RNA(ncRNA) genes as ldquojunk generdquo or transcriptional ldquonoiserdquo[1] Nonetheless researchers found that about 70 of thegenome is transcribed in various contexts and cell types [2 3]about 80 of the genome has biochemical functions [4] andmany DNAs code for RNAs as the end products instead ofproteins [5] LncRNAs are involved in a wide range of cellularmechanisms such as the regulation of genome activity [6]histone modifications [7 8] and DNA methylation [9] Inaddition lots of studies have demonstrated that lncRNAshave a significant role in diverse biological processes thuslncRNAs are especially important to the studies of humanbiology and diseases [10] For example in prostate cancerof human lncRNA SChLAP1 and chromatin remodelling

complex SWISNF have opposing roles SchLAP1 has aninteraction with the SNF5 subunit of SWISNF and inhibitsbinding of SWISNF to chromatin which leads to genome-wide derepression of gene activity [11] Moreover aber-rant expression of lncRNAs in cancer can be regarded asbiomarkers and therapeutic targets because of its extremelyspecific expression [6] The LncRNADisease database nowintegrates more than 1000 lncRNA-disease entries and 475lncRNA interaction entries which suggested that lncRNAsare associated with diseases closely [12]

Since lncRNAs so closely interact with diseases manylncRNA-disease association detection tools are inventedAssuming that lncRNAs with similar functions tend toassociate with similar diseases a semisupervised methodLaplacian Regularized Least Squares for LncRNA-DiseaseAssociation (LRLSLDA) [13] was developed this tool dis-plays a satisfying result and needs no negative samplesNonetheless this method is facing the problems of parameterselection and classifier combination The principal idea of

Hindawi Publishing CorporationBioMed Research InternationalVolume 2016 Article ID 8496165 14 pageshttpdxdoiorg10115520168496165

2 BioMed Research International

LRLSLDA as mentioned above is to measure the functionalsimilarity of lncRNAs which means that the performance ofsimilarity calculation model largely determines the perfor-mance of associationmodelThe similarity calculationmodelof LRLSLDA is LFSCM (LncRNA Functional SimilarityCalculation based on the information of MiRNA) which isbased on lncRNA-miRNA interactions and miRNA-diseaseassociations In 2015 novel lncRNA functional similaritycalculation models (LNCSIM) [14] were provided by Chenet al By integrating LRLSLDA and LNCSIM the perfor-mance was enhanced Recently a new lncRNA functionalsimilarity calculation model FMLNCSIM (Fuzzy Measure-Based LncRNA Functional Similarity Calculation Model)[15] has been developed this new model has a web interface(http21921960245) for usersrsquo convenience Consideringthat nowadays the experimentally confirmed data of miRNA-disease associations are much easier to obtain than the onesof lncRNA-disease Chen [16] utilised the miRNA-diseaseassociation and miRNA-lncRNA interaction to identifiedlncRNA-disease associationThis method (HGLDA) circum-vents the utility of LncRNADisease database but still presentsthe desired results Currently many other tools such asRWRlncD [17] and RWRHLD [18] were designed aiming atpredicting lncRNA-disease association and obtaining morereliable results Unfortunately they have their own limitations[16] As the titles of these methods implied RWRlncDand RWRHLD mainly predict the association by utilisingRandom Walk with Restart (RWR) RWRlncD can onlybe applied to the case that lncRNAs have known relateddiseases and RWRHLD cannot deal with the circumstancethat lncRNAs have unknown lncRNA-miRNA interactionsAnother method Improved Random Walk with Restart forLncRNA-Disease Association (IRWRLDA) [19] is also basedon RWR but IRWRLDA can predict the associations evenwhen diseases show no known related lncRNAs

Research [20] has illustrated the lncRNA-disease asso-ciation extensively and comprehensively Basically there arethree approaches to performing lncRNA-disease associationprediction [20] to train a model based on machine learningalgorithm to construct a heterogeneous network or tointegrate lncRNA-miRNA interactions and miRNA-diseaseassociations Currently researches have acknowledged thatit is imperative to analyse the role of lncRNAs in manydiseases especially cancer but the first step and funda-mental work is how to discriminate lncRNAs from genesWith the rapid development of next-generation sequencingtechnologies thousands and thousands of transcriptomeshave been discovered which furnished us with more andmore useful information on ncRNAs Meanwhile manyncRNAs identification approaches have been developed tofacilitate the researches and analyses Each kind of ncRNAhas its own prediction tools such as tRNAscan-SE (1997)[21] and tRNA-Predict (2015) [22] for transfer RNA (tRNA)identification mirnaDetect (2014) [23] and imDC (2015) [24]for microRNA (miRNA) prediction and RNAmmer (2007)[25] for ribosomal RNA (rRNA) discrimination Both tRNA-Predict and mirnaDetect are constructed with the features ofsecondary structure and codon-bias The method imDC isan algorithm of ensemble learning to deal with imbalanced

data and is applied to miRNA classification The researcharea of ncRNA is fast growing However it is still a challengeto distinguish lncRNAs from protein-coding genes in thatlncRNAs share many features similar to mRNAs Moreoverthe incomplete transcripts or genes poorly annotated orcontaining sequencing errors also thwart the discriminationand functional inference During the last ten years manyefforts on lncRNA identification have been made and manyapproaches have been developed to make a more accuratediscrimination Several studies [26 27] have summarised andreviewed the approaches of ncRNAs identification and anal-ysis but a few report the discussion of lncRNAs predictionmethodsWang et al [26] discussed several ncRNA detectionmethods based on homology information and commonfeatures Different approaches aiming at detecting differentkinds of ncRNAs are presented and an overview of someuseful tools was given yet no analysis on application scopeswas provided Hence the summary of these methods is moretheoretical than practical Veneziano et al [27] summarisedsome computational approaches of ncRNA analysis basedon deep sequencing technology Some lncRNA predictiontools were discussed briefly butmany other helpful tools wereexcluded

In this paper we mainly focus on the tools for lncRNAidentification The aim of this paper is to summarise thepopular algorithms of lncRNA identification and to assistresearchers in determining which method is more appro-priate for their purpose Here comprehensive analyses anddiscussions of these tools were provided Then we comparedseveral popular machine learning based methods includingCoding Potential Calculator (CPC) [28] Coding PotentialAssessment Tool (CPAT) [29] Coding-Non-Coding Index(CNCI) [30] predictor of long noncoding RNAs and mes-senger RNAs based on an improved k-mer scheme (PLEK)[31] Long noncodingRNA IDentification (LncRNA-ID) [32]and lncRScan-SVM [33] In addition lncRNA-MFDL [34]and LncRNApred [35] two artificial neural network- (ANN-)involved tools are also introduced in this paper However theprovided access link of lncRNA-MFDL has been forbiddenLncRNApred often throws errors while handling massive-scale data which can be processed by CPC andCPAT success-fully Thus we only briefly introduce the algorithms of theclassification model but omit the discussions of applicationscope We expect that this review can be a practical manualwhen readers conduct lncRNA identification researches

CPC (2007) is used to assess the protein-coding potentialof transcripts with high accuracy and speed [28] Howeverwith the emergence of new programs speed is scarcelyconsidered as ameritThe features of CPC can be divided intotwo categoriesThe first one is based on the extent and qualityof the Open Reading Frame (ORF) and the other category isderived from BLASTX research The authors employed theLIBSVM package to train support vector machine (SVM)model with the standard radial basis function kernel [36]

CPAT (2013) is another protein-coding potential assess-ment tool based on the model of logistic regression Theselected features include the quality of the ORF FickettScore and hexamer score Fickett Score is used to evaluateeach basersquos unequal content frequency and asymmetrical

BioMed Research International 3

Table 1 Overview of the methods concerning lncRNA identification

Published year Testing datasets Training species Model Query file format Web interfaceCPC 2007 ncRNAlowast Eukaryotic SVM FASTA YesCPAT 2013 lncRNAlowast Human mouse fly zebrafish LR BED FASTA YesCNCI 2013 lncRNA Human plant SVM FASTA GTF NoPLEK 2014 lncRNA Human maize SVM FASTA NolncRNA-MFDL 2015 lncRNA Human DL 119880119899119896119899119900119908119899lowastlowast 119880119899119896119899119900119908119899lowastlowastLncRNA-ID 2015 lncRNA Human mouse RF BED FSATA NolncRScan-SVM 2015 lncRNA Human mouse SVM GTF NoLncRNApred 2016 lncRNA Human RF FASTA Web onlyTesting datasets denote that one specific method is developed to discriminate ncRNAs or lncRNAs from protein-coding transcripts The classification modelof CPC CNCI PLEK and lncRScan-SVM is support vector machine (SVM) CPAT employs logistic regression (LR) LncRNA-ID and LncRNApred utiliserandom forests (RF) and lncRNA-MFDL uses deep stacking networks (DSNs) of deep learning (DL) algorithmlowastNote that the most popular tool CPC is trained and tested on datasets of ncRNAs and protein-coding transcripts The training datasets of CPAT are alsoncRNAs and protein-coding transcripts though test on lncRNAs for CPAT is conducted and achieved a higher accuracylowastlowastThe access link of lncRNA-MFDL has expired thus we cannot verify the information that the original paper failed to mention

distribution in the positions of codons in one sequenceHexamer score is mainly based on the usage bias of adjacentamino acids in proteins

CNCI (2013) is a classifier to differentiate protein-codingand noncoding transcripts by profiling the intrinsic compo-sition of the sequence According to the unequal distributionof adjoining nucleotide triplets (ANT) in two kinds ofsequences a 64 lowast 64 ANT Score Matrix is constructed toevaluate the sequence and the sliding window is used as asupplement to achieve a more robust result [30] ANT bearssome similarities to the hexamer score of CPAT but muchmore comprehensive and intricate analysis was conductedto facilitate the incomplete transcripts classification Theclassification model of CPAT is SVM with a standard radialbasis function kernel

PLEK (2014) uses k-mer scheme and sliding window toanalyse the transcripts For multiple species PLEK does nothave too many advantages over CNCI on testing data ofnormal sequence Nevertheless compared with PLEK theresults of CNCI will deteriorate when the sequence containssome insert or deletion (indel) errors These errors are verycommon in todayrsquos sequencing platforms The classificationmodel of PLEK is SVM with a radial basis function kernel

LncRNA-ID (2015) has 11 features which can be cate-gorized according to ORF ribosome interaction and theconservation of protein The first category is similar to theORF features in CPC and CPAT The foundation of thesecond feature category is the interactions between mRNAsand ribosomes during protein translation since some studiesdisplayed that lncRNAs can be associated with ribosomes[37 38] but do not show the release of ribosomes [39] Theprofile hidden Markov model-based alignment is used toassess the conservation of protein The classification modelof LncRNA-ID is improved using random forest which assistsLncRNA-ID effectively in handling imbalanced training data

Some tools are initially designed to predict ncRNAs butcan also be applied to lncRNAs prediction such as Phyloge-netic Codon Substitution Frequencies (PhyloCSF 2011) [40]and RNAcon (2014) [41] Based on nucleotide substitutionsand formal statistical comparison of phylogenetic codon

models [40] PhyloCSF utilises multiple sequence alignmentsto find conserved protein-coding regions As an alignment-basedmethod PhyloCSF entails high-quality alignments andsuffers from low efficiency RNAconmainly predicts ncRNAsutilising k-mer scheme Based on graph properties [41 42]RNAcon can also perform ncRNAs classification and classifydifferent ncRNA classes

Some methods are especially developed for long inter-genic noncoding RNAs (lincRNAs one subgroup of lncR-NAs) classification such as iSeeRNA (2012 web server and Linux binary package available at http13718913371iSeeRNindexhtml) [43] and LincRNA Classifier based on selectedfeatures (linc-SF 2013) [44] iSeeRNA built a SVM modelwith three feature groups ORF adjoining nucleotides fre-quencies (GC CT TAG TGT ACG and TCG) and con-servation score obtained from Phast [45] The classifier oflinc-SF evaluates the sequences with the criteria of sequencelength GC content minimum free energy (MFE) and k-merscheme

2 Details of the Methods

In this part we will discuss the machine learning modelsand the selected features of each method more specificallyFirstly for usersrsquo convenience some brief information of eachmethod is displayed in Table 1 and the details of using aresummarisedThen the details of eachmethod are provided inthe following Table 2 is a summary about the features selectedby each method

21 Details of Using CPC can be downloaded fromhttpcpccbipkueducndownload CPC has a user-friendly webinterface at httpcpccbipkueducnprogramsrun cpcjspDocuments and User Guide are provided at the web-site To run CPC on a local PC a comprehensive pro-tein reference database is required and users can down-load it from ftpftpncbinlmnihgovblastdb or ftpftpuniprotorgpubdatabasesuniprotunirefuniref90 About20 gigabytes (GB) of free space is also needed for storing theprotein reference database

4 BioMed Research International

Table 2 Summary of the features of each method selected

ORF Codon Sequence structure Ribosomeinteraction Alignment Protein

conservation

CPC Quality coverageintegrity No No No BLASTX

Number and119864-value of hitsDistribution of

hits

CPAT Lengthcoverage

HexamerFrequency

Content of thebases

Position of thebases

No No No

CNCI No ANT matrixCodon-bias MLCDS No No No

PLEK No No Improved k-merscheme No No No

lncRNA-MFDL

Lengthcoverage No

k-mer schemeSecondarystructureMLCDS

No No No

LncRNA-ID Lengthcoverage No Kozak motif

Ribosome releasesignal

Changes of bindingenergy

Profile HMMbased alignment

Score ofHMMER

Length of theprofile

Length of alignedregion

lncRScan-SVM No Distribution of

stop codon

Score oftxCdsPredictlength oftranscripts

length and count ofexon

No Phylo-HMMbased alignment

AveragePhastCons scores

LncRNApred Lengthcoverage No

Length of thesequence

signal to noiseratio

k-mer schemeG + C content

No No No

All features are categorized into six groups according to the similarity or basic principles Thus some items in the table might not be exactly in one-to-onecorrespondence with the feature names given in the corresponding published references

CPAT is also available both for download and as a web-server Users can obtain the latest resource code from httpssourceforgenetprojectsrna-cpatfilessource=navbar Pre-releases tutorial files and examples are also supplied onthe pages CPAT requires Python 27x numpy cython andR when running offline The web server is available athttplilabresearchbcmeducpatindexphp

CNCI can be downloaded at httpsgithubcomwww-bioinfo-orgCNCI Version 2 is updated on Feb 28 2014Setup and running steps are attached on the websitesLibsvm-30 has been enclosed in the package Other addi-tional files can be downloaded at httpwwwbioinfoorgnp

PLEK was implemented by C and Python The sourcecode can be freely downloaded from httpssourceforgenetprojectsplekfiles Several videos to assist user in utilisingPLEK correctly are also provided Python 27x is required

Scripts of LncRNA-ID can be obtained at httpsgithubcomzhangy72LncRNA-ID

LncRScan-SVM provided scripts gene annotation filesanddatasetsThe scripts can be downloaded at httpssource-forgenetprojectslncrscansvmsource20=20directoryA Readme file is also attached on this site

All the stand-alone versions of these tools requireLinuxUNIX operating system

The link of lncRNA-MFDL provided is httpscompge-nomicsutsaedulncRNA MDFL LncRNApred only has theweb interface and is available at httpmm20132014wicpnet57203LncRNApredhomejsp However the link of lncRNA-MFDL expired when we did this research And LncRNApredonly provides a web server which cannot handle too manysequences at one time

22 CPC inDetail CPC [28] extracted six features to evaluatethe coding potential of transcripts Log-odds score coverageand integrity of ORF are used to assess the ORFs of onesequence ORFs are predicted by framefinder A high-quality

BioMed Research International 5

ORF tends to have a high log-odds score and a larger ORFcoverage The integrity of ORF means ORFs in protein-coding transcripts are disposed of to begin with a startcodon and end with a stop codon The other three featuresare number of hits hit score and frame score which arederived from the output of BLASTX search A protein-codingtranscript prefersmore hits in alignment with lower119864-valuesThen the hit score is defined as follows [28]

119878119894 = mean119895

minuslog10 119864119894119895 119894 isin 0 1 2

Hit Score = mean119894isin012

119878119894 = sum2119894=0 1198781198943 (1)

where 119864119894119895 is the 119864-Value of the 119895th hits in the 119894th ORF Anoncoding transcript may also contain some hits but thesehits are inclined to scatter in three frames rather than belocated in one The frame score to calculate the distributionof hits among three ORFs is defined in the following

Frame Score = variance119894isin012

119878119894 = sum2119894=0 (119878119894 minus 119878)22 (2)

Thus a protein-coding transcript will achieve a higher hitscore and frame score because of the lower119864-value and biaseddistribution of the hits

The training data of CPC [46] are eukaryotic ncRNAsfrom the RNAdb [47] and NONCODE [48 49] databasesCPC is designed to assess transcriptsrsquo protein-coding poten-tial whichmeans it will have high accuracy of discriminatingprotein-coding transcripts Moreover CPC also has the errortolerance capacity which owesmuch to framefinderrsquos accurateprediction Framefinder performed well even though inputtranscripts may have some point mutations indel errorsand truncations CPC is slightly inferior in distinguishingnoncoding transcripts in respect of the fact that lncRNAsmaycontain putative ORFs and transcript length is also familiarto protein-coding transcripts The slow speed is anotherimperfection of CPC

23 CPAT in Detail CPAT [29] is an alignment-free pro-gram CPAT uses a logistic regression model and can betrained on own data of users Apart from the features ofmaximum length and coverage of ORF akin to CPC FickettScore is another criterion Fickett Score can be regarded asa dependent classifier it is mainly based on calculating theposition of each base favoured and the content of each basein the sequence [50] The basersquos position parameter of CPATis defined as follows

1198601 = Number of As in positions 0 3 6 1198602 = Number of As in positions 1 4 7 1198603 = Number of As in positions 2 5 8

119860-position = max (1198601 1198602 1198603)min (1198601 1198602 1198603) + 1

119860-content = Occurence Number of 119860Total Number of all bases

(3)

where 119860 in the formula means the base 119860 and the otherthree bases are measured in a similar way The parameterof position calculates each basersquos favoured position andthe parameter of content is the percentage of each basein the sequence Then according to distributions of eightparametersrsquo values [50] it is easy to obtain the probability thatthe sequence will be a protein-coding transcript Next eachprobability is multiplied by a weight to make a more accurateresult The weight is the percentage of the times that theestimate of each parameter alone is correct Finally accordingto the above descriptions Fickett Score can be determined asfollows

Fickett Score = 8sum119894=1

119901119894119908119894 (4)

According to Fickett [50] Fickett Score alone can correctlydiscriminate about 94 of the coding segments and 97 ofthe noncoding segments with 18 of ldquoNo Opinionrdquo

The last feature of CPAT is hexamer score which is themost discriminating feature Hexamer means the adjacentamino acids in proteinsThe features of the in-frame hexamerfrequency of coding and noncoding transcripts are calculatedand hexamer score is defined in the following

Hexamer Score = 1119898119898sum119894=1

log( 119865 (119867119894)1198651015840 (119867119894)) (5)

There are 64 lowast 64 kinds of hexamers and 119894 denotes each hex-amer 119865(119867119894) (119894 = 0 1 2 4095) means the in-frame hex-amer frequency of protein-coding transcripts while 1198651015840(119867119894)means noncoding transcripts For a transcript containing119898 hexamers a positive hexamer score indicates a protein-coding transcript

A high-quality training dataset is constructed contain-ing 10000 protein-coding transcripts selected from RefSeqdatabase with the annotations of the Consensus CodingSequence project and 10000 noncoding transcripts randomlycollected from GENCODE database CPAT is prebuilt hex-amer tables and logit models for human mouse fly andzebrafish Meanwhile CPAT uses pure linguistic features tofacilitate discrimination of the poorly annotated transcriptsCPAT has an efficient offline program and also provides auser-friendly web interface

24 CNCI in Detail CNCI [30] is mainly based on sequenceintrinsic composition it evaluates the transcripts by calcu-lating the usage frequency of adjoining nucleotide triplets(ANT) Firstly two ANT matrices are constructed based onthe usage frequency of ANT in noncoding sequences andcoding region of the sequences (CDS) For 4096 ANT the

6 BioMed Research International

formulas to calculate each ANT usage frequency are definedas follows

119883119894119873 = 119899sum119895=1

119878119895 (119883119894)

119879 = 119898sum119894=1

119883119894119873 = 119898sum119894=1

119899sum119895=1

119878119895 (119883119894) 119898 = 64 times 64 119899 = 1 119873

119883119894119865 = 119883119894119873119879

(6)

where 119883 means one kind of ANT 119878119895(119883119894) is the occurrencenumber of119883119894 in one sequence 119878119895Thus119883119894119873 denotes the totaloccurrence number of one kind of ANT in the dataset while119879indicates the total occurrence number of all kinds of ANT inthe dataset Accordingly 119883119894119865 is the usage frequency of ANTThen the ANT Score Matrix is utilised which is the log2-ratio of the two above-mentioned ANT matrices to score asequence and make a discrimination

ANT Score Matrix = log2CDS Matrix

Non-coding Matrix (7)

The distinguishing results of ANT Score Matrix are fairlywell but the matrix is constructed by computing the ANTusage frequency of coding region and noncoding regionconsequently the untranslated region (UTR) of the entiresequence will interfere with the performance of discrimina-tionThe sliding window is employed with one ANT (3 nt) ineach scan step to identify the CDS of a sequence by scanningsix reading frames of each sequence The different sizes(30 60 90 300 nt) of the sliding windows are examinedand the size of 150 nt for this classification model is found toobtain the most robust result For a sequence consisting of119896 ANT there will be 119896 minus 1 segments in this sequence Basedon the ANT Score Matrix each segment will get an 119878-Scoreand each reading frame can obtain an array comprised of the119878-Scores The formula of 119878-Score is defined as follows

119878-Score = 119899sum119894=1

119867119901 (119883119894) (8)

where 119883 means ANT 119867119901 is the ANT Score Matrix and119899 is the total number of the ANT in one segment or thewhole sequence Hence a correct reading frame of codingtranscript tends to have a higher whole sequence 119878-Scoreand in this array of reading frame the region composed ofconsecutive high 119878-Scores is the CDS For long noncodingtranscripts the Maximum Interval Sum [51] program is usedto identify the most-like CDS (MLCDS) which is the regionthat gained the largest sum of consecutive 119878-Scores in eachreading frame Among those six MLCDS the length and119878-Score of the MLCDS with the highest value are selectedas the features of CNCI Furthermore the features of the

LENGTH-Percentage SCORE-Distance and codon-bias arealso selected to improve accuracy

LENGTH-Percentage = 1198721sum119899119894=0 (119884119894)

SCORE-Distance = sum119899119895=0 (119878 minus 119864119895)5

(9)

where 1198721 is the length of the MLCDS with the highest 119878-Score 119884119894 is the length of each MLCDS 119878 is the highest 119878-Score among six MLCDS and 119864119895 is the 119878-Score of other fiveMLCDS Codon-bias (3-mer frequencies) is a parameter toevaluate the usage bias of different codons in protein-codingor long noncoding transcripts The log2-ratio of occurrencefrequency of each codon (stop codons are excluded) inprotein-coding genes and lncRNAs is calculated and mostcodons have distinct usage bias in two kinds of sequences

The training datasets of CNCI contain protein-codingtranscripts selected from RefSeq database and long noncod-ing transcripts selected from GENCODE [52] The CNCIis applied to other species with the aim of examining thescope of applicationThe results of vertebrates (except birds)especially mammals can be accepted since the programwas trained on human gene set CNCI can be used todiscriminate incomplete transcripts especially those high-throughput sequencing data of poorly explored species

25 PLEK in Detail PLEK [31] is an alignment-free toolbased on 119896-mer frequencies of the sequences For a givensequence the sliding windows with size of 119896 scan 1 nt asa step forward 119896 ranges from 1 to 5 which is a trade-off between accuracy and computational time Thus for asequence consisting of 119860 119862 119866 and 119879 the 41 + 42 + 43 +44 + 45 = 1364 patterns can be obtained Then the followingformulas can be used

119891119894 = 119888119894119904119896119908119896 119896 = 1 2 3 4 5 119894 = 1 2 1364119904119896 = 119897 minus 119896 + 1119908119896 = 1

45minus119896 119896 = 1 2 3 4 5

(10)

where 119894 is the number of the patterns 119888119894 denotes the numberof the segments in slidingwindowsmatchingwith patterns 119904119896denotes the total of the segments when sliding window slidesalong the sequencewith the size of 119896Therefore119891119894 is the usagefrequency multiplied by a factor119908119896 which is used to facilitatethe discrimination

A balanced training dataset is conducted with all 22389long noncoding transcripts collected from the GENCODEdataset [52ndash54] and 22389 protein-coding transcripts ran-domly selected from the human RefSeq dataset [55 56]Though the training model of PLEK is human PLEK canstill be applied to other vertebrates PLEK is particularlydesigned for the transcripts acquired from current sequenc-ing platforms which consist of some indel errors commonly

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Review Article Long Noncoding RNA Identification ...

2 BioMed Research International

LRLSLDA as mentioned above is to measure the functionalsimilarity of lncRNAs which means that the performance ofsimilarity calculation model largely determines the perfor-mance of associationmodelThe similarity calculationmodelof LRLSLDA is LFSCM (LncRNA Functional SimilarityCalculation based on the information of MiRNA) which isbased on lncRNA-miRNA interactions and miRNA-diseaseassociations In 2015 novel lncRNA functional similaritycalculation models (LNCSIM) [14] were provided by Chenet al By integrating LRLSLDA and LNCSIM the perfor-mance was enhanced Recently a new lncRNA functionalsimilarity calculation model FMLNCSIM (Fuzzy Measure-Based LncRNA Functional Similarity Calculation Model)[15] has been developed this new model has a web interface(http21921960245) for usersrsquo convenience Consideringthat nowadays the experimentally confirmed data of miRNA-disease associations are much easier to obtain than the onesof lncRNA-disease Chen [16] utilised the miRNA-diseaseassociation and miRNA-lncRNA interaction to identifiedlncRNA-disease associationThis method (HGLDA) circum-vents the utility of LncRNADisease database but still presentsthe desired results Currently many other tools such asRWRlncD [17] and RWRHLD [18] were designed aiming atpredicting lncRNA-disease association and obtaining morereliable results Unfortunately they have their own limitations[16] As the titles of these methods implied RWRlncDand RWRHLD mainly predict the association by utilisingRandom Walk with Restart (RWR) RWRlncD can onlybe applied to the case that lncRNAs have known relateddiseases and RWRHLD cannot deal with the circumstancethat lncRNAs have unknown lncRNA-miRNA interactionsAnother method Improved Random Walk with Restart forLncRNA-Disease Association (IRWRLDA) [19] is also basedon RWR but IRWRLDA can predict the associations evenwhen diseases show no known related lncRNAs

Research [20] has illustrated the lncRNA-disease asso-ciation extensively and comprehensively Basically there arethree approaches to performing lncRNA-disease associationprediction [20] to train a model based on machine learningalgorithm to construct a heterogeneous network or tointegrate lncRNA-miRNA interactions and miRNA-diseaseassociations Currently researches have acknowledged thatit is imperative to analyse the role of lncRNAs in manydiseases especially cancer but the first step and funda-mental work is how to discriminate lncRNAs from genesWith the rapid development of next-generation sequencingtechnologies thousands and thousands of transcriptomeshave been discovered which furnished us with more andmore useful information on ncRNAs Meanwhile manyncRNAs identification approaches have been developed tofacilitate the researches and analyses Each kind of ncRNAhas its own prediction tools such as tRNAscan-SE (1997)[21] and tRNA-Predict (2015) [22] for transfer RNA (tRNA)identification mirnaDetect (2014) [23] and imDC (2015) [24]for microRNA (miRNA) prediction and RNAmmer (2007)[25] for ribosomal RNA (rRNA) discrimination Both tRNA-Predict and mirnaDetect are constructed with the features ofsecondary structure and codon-bias The method imDC isan algorithm of ensemble learning to deal with imbalanced

data and is applied to miRNA classification The researcharea of ncRNA is fast growing However it is still a challengeto distinguish lncRNAs from protein-coding genes in thatlncRNAs share many features similar to mRNAs Moreoverthe incomplete transcripts or genes poorly annotated orcontaining sequencing errors also thwart the discriminationand functional inference During the last ten years manyefforts on lncRNA identification have been made and manyapproaches have been developed to make a more accuratediscrimination Several studies [26 27] have summarised andreviewed the approaches of ncRNAs identification and anal-ysis but a few report the discussion of lncRNAs predictionmethodsWang et al [26] discussed several ncRNA detectionmethods based on homology information and commonfeatures Different approaches aiming at detecting differentkinds of ncRNAs are presented and an overview of someuseful tools was given yet no analysis on application scopeswas provided Hence the summary of these methods is moretheoretical than practical Veneziano et al [27] summarisedsome computational approaches of ncRNA analysis basedon deep sequencing technology Some lncRNA predictiontools were discussed briefly butmany other helpful tools wereexcluded

In this paper we mainly focus on the tools for lncRNAidentification The aim of this paper is to summarise thepopular algorithms of lncRNA identification and to assistresearchers in determining which method is more appro-priate for their purpose Here comprehensive analyses anddiscussions of these tools were provided Then we comparedseveral popular machine learning based methods includingCoding Potential Calculator (CPC) [28] Coding PotentialAssessment Tool (CPAT) [29] Coding-Non-Coding Index(CNCI) [30] predictor of long noncoding RNAs and mes-senger RNAs based on an improved k-mer scheme (PLEK)[31] Long noncodingRNA IDentification (LncRNA-ID) [32]and lncRScan-SVM [33] In addition lncRNA-MFDL [34]and LncRNApred [35] two artificial neural network- (ANN-)involved tools are also introduced in this paper However theprovided access link of lncRNA-MFDL has been forbiddenLncRNApred often throws errors while handling massive-scale data which can be processed by CPC andCPAT success-fully Thus we only briefly introduce the algorithms of theclassification model but omit the discussions of applicationscope We expect that this review can be a practical manualwhen readers conduct lncRNA identification researches

CPC (2007) is used to assess the protein-coding potentialof transcripts with high accuracy and speed [28] Howeverwith the emergence of new programs speed is scarcelyconsidered as ameritThe features of CPC can be divided intotwo categoriesThe first one is based on the extent and qualityof the Open Reading Frame (ORF) and the other category isderived from BLASTX research The authors employed theLIBSVM package to train support vector machine (SVM)model with the standard radial basis function kernel [36]

CPAT (2013) is another protein-coding potential assess-ment tool based on the model of logistic regression Theselected features include the quality of the ORF FickettScore and hexamer score Fickett Score is used to evaluateeach basersquos unequal content frequency and asymmetrical

BioMed Research International 3

Table 1 Overview of the methods concerning lncRNA identification

Published year Testing datasets Training species Model Query file format Web interfaceCPC 2007 ncRNAlowast Eukaryotic SVM FASTA YesCPAT 2013 lncRNAlowast Human mouse fly zebrafish LR BED FASTA YesCNCI 2013 lncRNA Human plant SVM FASTA GTF NoPLEK 2014 lncRNA Human maize SVM FASTA NolncRNA-MFDL 2015 lncRNA Human DL 119880119899119896119899119900119908119899lowastlowast 119880119899119896119899119900119908119899lowastlowastLncRNA-ID 2015 lncRNA Human mouse RF BED FSATA NolncRScan-SVM 2015 lncRNA Human mouse SVM GTF NoLncRNApred 2016 lncRNA Human RF FASTA Web onlyTesting datasets denote that one specific method is developed to discriminate ncRNAs or lncRNAs from protein-coding transcripts The classification modelof CPC CNCI PLEK and lncRScan-SVM is support vector machine (SVM) CPAT employs logistic regression (LR) LncRNA-ID and LncRNApred utiliserandom forests (RF) and lncRNA-MFDL uses deep stacking networks (DSNs) of deep learning (DL) algorithmlowastNote that the most popular tool CPC is trained and tested on datasets of ncRNAs and protein-coding transcripts The training datasets of CPAT are alsoncRNAs and protein-coding transcripts though test on lncRNAs for CPAT is conducted and achieved a higher accuracylowastlowastThe access link of lncRNA-MFDL has expired thus we cannot verify the information that the original paper failed to mention

distribution in the positions of codons in one sequenceHexamer score is mainly based on the usage bias of adjacentamino acids in proteins

CNCI (2013) is a classifier to differentiate protein-codingand noncoding transcripts by profiling the intrinsic compo-sition of the sequence According to the unequal distributionof adjoining nucleotide triplets (ANT) in two kinds ofsequences a 64 lowast 64 ANT Score Matrix is constructed toevaluate the sequence and the sliding window is used as asupplement to achieve a more robust result [30] ANT bearssome similarities to the hexamer score of CPAT but muchmore comprehensive and intricate analysis was conductedto facilitate the incomplete transcripts classification Theclassification model of CPAT is SVM with a standard radialbasis function kernel

PLEK (2014) uses k-mer scheme and sliding window toanalyse the transcripts For multiple species PLEK does nothave too many advantages over CNCI on testing data ofnormal sequence Nevertheless compared with PLEK theresults of CNCI will deteriorate when the sequence containssome insert or deletion (indel) errors These errors are verycommon in todayrsquos sequencing platforms The classificationmodel of PLEK is SVM with a radial basis function kernel

LncRNA-ID (2015) has 11 features which can be cate-gorized according to ORF ribosome interaction and theconservation of protein The first category is similar to theORF features in CPC and CPAT The foundation of thesecond feature category is the interactions between mRNAsand ribosomes during protein translation since some studiesdisplayed that lncRNAs can be associated with ribosomes[37 38] but do not show the release of ribosomes [39] Theprofile hidden Markov model-based alignment is used toassess the conservation of protein The classification modelof LncRNA-ID is improved using random forest which assistsLncRNA-ID effectively in handling imbalanced training data

Some tools are initially designed to predict ncRNAs butcan also be applied to lncRNAs prediction such as Phyloge-netic Codon Substitution Frequencies (PhyloCSF 2011) [40]and RNAcon (2014) [41] Based on nucleotide substitutionsand formal statistical comparison of phylogenetic codon

models [40] PhyloCSF utilises multiple sequence alignmentsto find conserved protein-coding regions As an alignment-basedmethod PhyloCSF entails high-quality alignments andsuffers from low efficiency RNAconmainly predicts ncRNAsutilising k-mer scheme Based on graph properties [41 42]RNAcon can also perform ncRNAs classification and classifydifferent ncRNA classes

Some methods are especially developed for long inter-genic noncoding RNAs (lincRNAs one subgroup of lncR-NAs) classification such as iSeeRNA (2012 web server and Linux binary package available at http13718913371iSeeRNindexhtml) [43] and LincRNA Classifier based on selectedfeatures (linc-SF 2013) [44] iSeeRNA built a SVM modelwith three feature groups ORF adjoining nucleotides fre-quencies (GC CT TAG TGT ACG and TCG) and con-servation score obtained from Phast [45] The classifier oflinc-SF evaluates the sequences with the criteria of sequencelength GC content minimum free energy (MFE) and k-merscheme

2 Details of the Methods

In this part we will discuss the machine learning modelsand the selected features of each method more specificallyFirstly for usersrsquo convenience some brief information of eachmethod is displayed in Table 1 and the details of using aresummarisedThen the details of eachmethod are provided inthe following Table 2 is a summary about the features selectedby each method

21 Details of Using CPC can be downloaded fromhttpcpccbipkueducndownload CPC has a user-friendly webinterface at httpcpccbipkueducnprogramsrun cpcjspDocuments and User Guide are provided at the web-site To run CPC on a local PC a comprehensive pro-tein reference database is required and users can down-load it from ftpftpncbinlmnihgovblastdb or ftpftpuniprotorgpubdatabasesuniprotunirefuniref90 About20 gigabytes (GB) of free space is also needed for storing theprotein reference database

4 BioMed Research International

Table 2 Summary of the features of each method selected

ORF Codon Sequence structure Ribosomeinteraction Alignment Protein

conservation

CPC Quality coverageintegrity No No No BLASTX

Number and119864-value of hitsDistribution of

hits

CPAT Lengthcoverage

HexamerFrequency

Content of thebases

Position of thebases

No No No

CNCI No ANT matrixCodon-bias MLCDS No No No

PLEK No No Improved k-merscheme No No No

lncRNA-MFDL

Lengthcoverage No

k-mer schemeSecondarystructureMLCDS

No No No

LncRNA-ID Lengthcoverage No Kozak motif

Ribosome releasesignal

Changes of bindingenergy

Profile HMMbased alignment

Score ofHMMER

Length of theprofile

Length of alignedregion

lncRScan-SVM No Distribution of

stop codon

Score oftxCdsPredictlength oftranscripts

length and count ofexon

No Phylo-HMMbased alignment

AveragePhastCons scores

LncRNApred Lengthcoverage No

Length of thesequence

signal to noiseratio

k-mer schemeG + C content

No No No

All features are categorized into six groups according to the similarity or basic principles Thus some items in the table might not be exactly in one-to-onecorrespondence with the feature names given in the corresponding published references

CPAT is also available both for download and as a web-server Users can obtain the latest resource code from httpssourceforgenetprojectsrna-cpatfilessource=navbar Pre-releases tutorial files and examples are also supplied onthe pages CPAT requires Python 27x numpy cython andR when running offline The web server is available athttplilabresearchbcmeducpatindexphp

CNCI can be downloaded at httpsgithubcomwww-bioinfo-orgCNCI Version 2 is updated on Feb 28 2014Setup and running steps are attached on the websitesLibsvm-30 has been enclosed in the package Other addi-tional files can be downloaded at httpwwwbioinfoorgnp

PLEK was implemented by C and Python The sourcecode can be freely downloaded from httpssourceforgenetprojectsplekfiles Several videos to assist user in utilisingPLEK correctly are also provided Python 27x is required

Scripts of LncRNA-ID can be obtained at httpsgithubcomzhangy72LncRNA-ID

LncRScan-SVM provided scripts gene annotation filesanddatasetsThe scripts can be downloaded at httpssource-forgenetprojectslncrscansvmsource20=20directoryA Readme file is also attached on this site

All the stand-alone versions of these tools requireLinuxUNIX operating system

The link of lncRNA-MFDL provided is httpscompge-nomicsutsaedulncRNA MDFL LncRNApred only has theweb interface and is available at httpmm20132014wicpnet57203LncRNApredhomejsp However the link of lncRNA-MFDL expired when we did this research And LncRNApredonly provides a web server which cannot handle too manysequences at one time

22 CPC inDetail CPC [28] extracted six features to evaluatethe coding potential of transcripts Log-odds score coverageand integrity of ORF are used to assess the ORFs of onesequence ORFs are predicted by framefinder A high-quality

BioMed Research International 5

ORF tends to have a high log-odds score and a larger ORFcoverage The integrity of ORF means ORFs in protein-coding transcripts are disposed of to begin with a startcodon and end with a stop codon The other three featuresare number of hits hit score and frame score which arederived from the output of BLASTX search A protein-codingtranscript prefersmore hits in alignment with lower119864-valuesThen the hit score is defined as follows [28]

119878119894 = mean119895

minuslog10 119864119894119895 119894 isin 0 1 2

Hit Score = mean119894isin012

119878119894 = sum2119894=0 1198781198943 (1)

where 119864119894119895 is the 119864-Value of the 119895th hits in the 119894th ORF Anoncoding transcript may also contain some hits but thesehits are inclined to scatter in three frames rather than belocated in one The frame score to calculate the distributionof hits among three ORFs is defined in the following

Frame Score = variance119894isin012

119878119894 = sum2119894=0 (119878119894 minus 119878)22 (2)

Thus a protein-coding transcript will achieve a higher hitscore and frame score because of the lower119864-value and biaseddistribution of the hits

The training data of CPC [46] are eukaryotic ncRNAsfrom the RNAdb [47] and NONCODE [48 49] databasesCPC is designed to assess transcriptsrsquo protein-coding poten-tial whichmeans it will have high accuracy of discriminatingprotein-coding transcripts Moreover CPC also has the errortolerance capacity which owesmuch to framefinderrsquos accurateprediction Framefinder performed well even though inputtranscripts may have some point mutations indel errorsand truncations CPC is slightly inferior in distinguishingnoncoding transcripts in respect of the fact that lncRNAsmaycontain putative ORFs and transcript length is also familiarto protein-coding transcripts The slow speed is anotherimperfection of CPC

23 CPAT in Detail CPAT [29] is an alignment-free pro-gram CPAT uses a logistic regression model and can betrained on own data of users Apart from the features ofmaximum length and coverage of ORF akin to CPC FickettScore is another criterion Fickett Score can be regarded asa dependent classifier it is mainly based on calculating theposition of each base favoured and the content of each basein the sequence [50] The basersquos position parameter of CPATis defined as follows

1198601 = Number of As in positions 0 3 6 1198602 = Number of As in positions 1 4 7 1198603 = Number of As in positions 2 5 8

119860-position = max (1198601 1198602 1198603)min (1198601 1198602 1198603) + 1

119860-content = Occurence Number of 119860Total Number of all bases

(3)

where 119860 in the formula means the base 119860 and the otherthree bases are measured in a similar way The parameterof position calculates each basersquos favoured position andthe parameter of content is the percentage of each basein the sequence Then according to distributions of eightparametersrsquo values [50] it is easy to obtain the probability thatthe sequence will be a protein-coding transcript Next eachprobability is multiplied by a weight to make a more accurateresult The weight is the percentage of the times that theestimate of each parameter alone is correct Finally accordingto the above descriptions Fickett Score can be determined asfollows

Fickett Score = 8sum119894=1

119901119894119908119894 (4)

According to Fickett [50] Fickett Score alone can correctlydiscriminate about 94 of the coding segments and 97 ofthe noncoding segments with 18 of ldquoNo Opinionrdquo

The last feature of CPAT is hexamer score which is themost discriminating feature Hexamer means the adjacentamino acids in proteinsThe features of the in-frame hexamerfrequency of coding and noncoding transcripts are calculatedand hexamer score is defined in the following

Hexamer Score = 1119898119898sum119894=1

log( 119865 (119867119894)1198651015840 (119867119894)) (5)

There are 64 lowast 64 kinds of hexamers and 119894 denotes each hex-amer 119865(119867119894) (119894 = 0 1 2 4095) means the in-frame hex-amer frequency of protein-coding transcripts while 1198651015840(119867119894)means noncoding transcripts For a transcript containing119898 hexamers a positive hexamer score indicates a protein-coding transcript

A high-quality training dataset is constructed contain-ing 10000 protein-coding transcripts selected from RefSeqdatabase with the annotations of the Consensus CodingSequence project and 10000 noncoding transcripts randomlycollected from GENCODE database CPAT is prebuilt hex-amer tables and logit models for human mouse fly andzebrafish Meanwhile CPAT uses pure linguistic features tofacilitate discrimination of the poorly annotated transcriptsCPAT has an efficient offline program and also provides auser-friendly web interface

24 CNCI in Detail CNCI [30] is mainly based on sequenceintrinsic composition it evaluates the transcripts by calcu-lating the usage frequency of adjoining nucleotide triplets(ANT) Firstly two ANT matrices are constructed based onthe usage frequency of ANT in noncoding sequences andcoding region of the sequences (CDS) For 4096 ANT the

6 BioMed Research International

formulas to calculate each ANT usage frequency are definedas follows

119883119894119873 = 119899sum119895=1

119878119895 (119883119894)

119879 = 119898sum119894=1

119883119894119873 = 119898sum119894=1

119899sum119895=1

119878119895 (119883119894) 119898 = 64 times 64 119899 = 1 119873

119883119894119865 = 119883119894119873119879

(6)

where 119883 means one kind of ANT 119878119895(119883119894) is the occurrencenumber of119883119894 in one sequence 119878119895Thus119883119894119873 denotes the totaloccurrence number of one kind of ANT in the dataset while119879indicates the total occurrence number of all kinds of ANT inthe dataset Accordingly 119883119894119865 is the usage frequency of ANTThen the ANT Score Matrix is utilised which is the log2-ratio of the two above-mentioned ANT matrices to score asequence and make a discrimination

ANT Score Matrix = log2CDS Matrix

Non-coding Matrix (7)

The distinguishing results of ANT Score Matrix are fairlywell but the matrix is constructed by computing the ANTusage frequency of coding region and noncoding regionconsequently the untranslated region (UTR) of the entiresequence will interfere with the performance of discrimina-tionThe sliding window is employed with one ANT (3 nt) ineach scan step to identify the CDS of a sequence by scanningsix reading frames of each sequence The different sizes(30 60 90 300 nt) of the sliding windows are examinedand the size of 150 nt for this classification model is found toobtain the most robust result For a sequence consisting of119896 ANT there will be 119896 minus 1 segments in this sequence Basedon the ANT Score Matrix each segment will get an 119878-Scoreand each reading frame can obtain an array comprised of the119878-Scores The formula of 119878-Score is defined as follows

119878-Score = 119899sum119894=1

119867119901 (119883119894) (8)

where 119883 means ANT 119867119901 is the ANT Score Matrix and119899 is the total number of the ANT in one segment or thewhole sequence Hence a correct reading frame of codingtranscript tends to have a higher whole sequence 119878-Scoreand in this array of reading frame the region composed ofconsecutive high 119878-Scores is the CDS For long noncodingtranscripts the Maximum Interval Sum [51] program is usedto identify the most-like CDS (MLCDS) which is the regionthat gained the largest sum of consecutive 119878-Scores in eachreading frame Among those six MLCDS the length and119878-Score of the MLCDS with the highest value are selectedas the features of CNCI Furthermore the features of the

LENGTH-Percentage SCORE-Distance and codon-bias arealso selected to improve accuracy

LENGTH-Percentage = 1198721sum119899119894=0 (119884119894)

SCORE-Distance = sum119899119895=0 (119878 minus 119864119895)5

(9)

where 1198721 is the length of the MLCDS with the highest 119878-Score 119884119894 is the length of each MLCDS 119878 is the highest 119878-Score among six MLCDS and 119864119895 is the 119878-Score of other fiveMLCDS Codon-bias (3-mer frequencies) is a parameter toevaluate the usage bias of different codons in protein-codingor long noncoding transcripts The log2-ratio of occurrencefrequency of each codon (stop codons are excluded) inprotein-coding genes and lncRNAs is calculated and mostcodons have distinct usage bias in two kinds of sequences

The training datasets of CNCI contain protein-codingtranscripts selected from RefSeq database and long noncod-ing transcripts selected from GENCODE [52] The CNCIis applied to other species with the aim of examining thescope of applicationThe results of vertebrates (except birds)especially mammals can be accepted since the programwas trained on human gene set CNCI can be used todiscriminate incomplete transcripts especially those high-throughput sequencing data of poorly explored species

25 PLEK in Detail PLEK [31] is an alignment-free toolbased on 119896-mer frequencies of the sequences For a givensequence the sliding windows with size of 119896 scan 1 nt asa step forward 119896 ranges from 1 to 5 which is a trade-off between accuracy and computational time Thus for asequence consisting of 119860 119862 119866 and 119879 the 41 + 42 + 43 +44 + 45 = 1364 patterns can be obtained Then the followingformulas can be used

119891119894 = 119888119894119904119896119908119896 119896 = 1 2 3 4 5 119894 = 1 2 1364119904119896 = 119897 minus 119896 + 1119908119896 = 1

45minus119896 119896 = 1 2 3 4 5

(10)

where 119894 is the number of the patterns 119888119894 denotes the numberof the segments in slidingwindowsmatchingwith patterns 119904119896denotes the total of the segments when sliding window slidesalong the sequencewith the size of 119896Therefore119891119894 is the usagefrequency multiplied by a factor119908119896 which is used to facilitatethe discrimination

A balanced training dataset is conducted with all 22389long noncoding transcripts collected from the GENCODEdataset [52ndash54] and 22389 protein-coding transcripts ran-domly selected from the human RefSeq dataset [55 56]Though the training model of PLEK is human PLEK canstill be applied to other vertebrates PLEK is particularlydesigned for the transcripts acquired from current sequenc-ing platforms which consist of some indel errors commonly

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Review Article Long Noncoding RNA Identification ...

BioMed Research International 3

Table 1 Overview of the methods concerning lncRNA identification

Published year Testing datasets Training species Model Query file format Web interfaceCPC 2007 ncRNAlowast Eukaryotic SVM FASTA YesCPAT 2013 lncRNAlowast Human mouse fly zebrafish LR BED FASTA YesCNCI 2013 lncRNA Human plant SVM FASTA GTF NoPLEK 2014 lncRNA Human maize SVM FASTA NolncRNA-MFDL 2015 lncRNA Human DL 119880119899119896119899119900119908119899lowastlowast 119880119899119896119899119900119908119899lowastlowastLncRNA-ID 2015 lncRNA Human mouse RF BED FSATA NolncRScan-SVM 2015 lncRNA Human mouse SVM GTF NoLncRNApred 2016 lncRNA Human RF FASTA Web onlyTesting datasets denote that one specific method is developed to discriminate ncRNAs or lncRNAs from protein-coding transcripts The classification modelof CPC CNCI PLEK and lncRScan-SVM is support vector machine (SVM) CPAT employs logistic regression (LR) LncRNA-ID and LncRNApred utiliserandom forests (RF) and lncRNA-MFDL uses deep stacking networks (DSNs) of deep learning (DL) algorithmlowastNote that the most popular tool CPC is trained and tested on datasets of ncRNAs and protein-coding transcripts The training datasets of CPAT are alsoncRNAs and protein-coding transcripts though test on lncRNAs for CPAT is conducted and achieved a higher accuracylowastlowastThe access link of lncRNA-MFDL has expired thus we cannot verify the information that the original paper failed to mention

distribution in the positions of codons in one sequenceHexamer score is mainly based on the usage bias of adjacentamino acids in proteins

CNCI (2013) is a classifier to differentiate protein-codingand noncoding transcripts by profiling the intrinsic compo-sition of the sequence According to the unequal distributionof adjoining nucleotide triplets (ANT) in two kinds ofsequences a 64 lowast 64 ANT Score Matrix is constructed toevaluate the sequence and the sliding window is used as asupplement to achieve a more robust result [30] ANT bearssome similarities to the hexamer score of CPAT but muchmore comprehensive and intricate analysis was conductedto facilitate the incomplete transcripts classification Theclassification model of CPAT is SVM with a standard radialbasis function kernel

PLEK (2014) uses k-mer scheme and sliding window toanalyse the transcripts For multiple species PLEK does nothave too many advantages over CNCI on testing data ofnormal sequence Nevertheless compared with PLEK theresults of CNCI will deteriorate when the sequence containssome insert or deletion (indel) errors These errors are verycommon in todayrsquos sequencing platforms The classificationmodel of PLEK is SVM with a radial basis function kernel

LncRNA-ID (2015) has 11 features which can be cate-gorized according to ORF ribosome interaction and theconservation of protein The first category is similar to theORF features in CPC and CPAT The foundation of thesecond feature category is the interactions between mRNAsand ribosomes during protein translation since some studiesdisplayed that lncRNAs can be associated with ribosomes[37 38] but do not show the release of ribosomes [39] Theprofile hidden Markov model-based alignment is used toassess the conservation of protein The classification modelof LncRNA-ID is improved using random forest which assistsLncRNA-ID effectively in handling imbalanced training data

Some tools are initially designed to predict ncRNAs butcan also be applied to lncRNAs prediction such as Phyloge-netic Codon Substitution Frequencies (PhyloCSF 2011) [40]and RNAcon (2014) [41] Based on nucleotide substitutionsand formal statistical comparison of phylogenetic codon

models [40] PhyloCSF utilises multiple sequence alignmentsto find conserved protein-coding regions As an alignment-basedmethod PhyloCSF entails high-quality alignments andsuffers from low efficiency RNAconmainly predicts ncRNAsutilising k-mer scheme Based on graph properties [41 42]RNAcon can also perform ncRNAs classification and classifydifferent ncRNA classes

Some methods are especially developed for long inter-genic noncoding RNAs (lincRNAs one subgroup of lncR-NAs) classification such as iSeeRNA (2012 web server and Linux binary package available at http13718913371iSeeRNindexhtml) [43] and LincRNA Classifier based on selectedfeatures (linc-SF 2013) [44] iSeeRNA built a SVM modelwith three feature groups ORF adjoining nucleotides fre-quencies (GC CT TAG TGT ACG and TCG) and con-servation score obtained from Phast [45] The classifier oflinc-SF evaluates the sequences with the criteria of sequencelength GC content minimum free energy (MFE) and k-merscheme

2 Details of the Methods

In this part we will discuss the machine learning modelsand the selected features of each method more specificallyFirstly for usersrsquo convenience some brief information of eachmethod is displayed in Table 1 and the details of using aresummarisedThen the details of eachmethod are provided inthe following Table 2 is a summary about the features selectedby each method

21 Details of Using CPC can be downloaded fromhttpcpccbipkueducndownload CPC has a user-friendly webinterface at httpcpccbipkueducnprogramsrun cpcjspDocuments and User Guide are provided at the web-site To run CPC on a local PC a comprehensive pro-tein reference database is required and users can down-load it from ftpftpncbinlmnihgovblastdb or ftpftpuniprotorgpubdatabasesuniprotunirefuniref90 About20 gigabytes (GB) of free space is also needed for storing theprotein reference database

4 BioMed Research International

Table 2 Summary of the features of each method selected

ORF Codon Sequence structure Ribosomeinteraction Alignment Protein

conservation

CPC Quality coverageintegrity No No No BLASTX

Number and119864-value of hitsDistribution of

hits

CPAT Lengthcoverage

HexamerFrequency

Content of thebases

Position of thebases

No No No

CNCI No ANT matrixCodon-bias MLCDS No No No

PLEK No No Improved k-merscheme No No No

lncRNA-MFDL

Lengthcoverage No

k-mer schemeSecondarystructureMLCDS

No No No

LncRNA-ID Lengthcoverage No Kozak motif

Ribosome releasesignal

Changes of bindingenergy

Profile HMMbased alignment

Score ofHMMER

Length of theprofile

Length of alignedregion

lncRScan-SVM No Distribution of

stop codon

Score oftxCdsPredictlength oftranscripts

length and count ofexon

No Phylo-HMMbased alignment

AveragePhastCons scores

LncRNApred Lengthcoverage No

Length of thesequence

signal to noiseratio

k-mer schemeG + C content

No No No

All features are categorized into six groups according to the similarity or basic principles Thus some items in the table might not be exactly in one-to-onecorrespondence with the feature names given in the corresponding published references

CPAT is also available both for download and as a web-server Users can obtain the latest resource code from httpssourceforgenetprojectsrna-cpatfilessource=navbar Pre-releases tutorial files and examples are also supplied onthe pages CPAT requires Python 27x numpy cython andR when running offline The web server is available athttplilabresearchbcmeducpatindexphp

CNCI can be downloaded at httpsgithubcomwww-bioinfo-orgCNCI Version 2 is updated on Feb 28 2014Setup and running steps are attached on the websitesLibsvm-30 has been enclosed in the package Other addi-tional files can be downloaded at httpwwwbioinfoorgnp

PLEK was implemented by C and Python The sourcecode can be freely downloaded from httpssourceforgenetprojectsplekfiles Several videos to assist user in utilisingPLEK correctly are also provided Python 27x is required

Scripts of LncRNA-ID can be obtained at httpsgithubcomzhangy72LncRNA-ID

LncRScan-SVM provided scripts gene annotation filesanddatasetsThe scripts can be downloaded at httpssource-forgenetprojectslncrscansvmsource20=20directoryA Readme file is also attached on this site

All the stand-alone versions of these tools requireLinuxUNIX operating system

The link of lncRNA-MFDL provided is httpscompge-nomicsutsaedulncRNA MDFL LncRNApred only has theweb interface and is available at httpmm20132014wicpnet57203LncRNApredhomejsp However the link of lncRNA-MFDL expired when we did this research And LncRNApredonly provides a web server which cannot handle too manysequences at one time

22 CPC inDetail CPC [28] extracted six features to evaluatethe coding potential of transcripts Log-odds score coverageand integrity of ORF are used to assess the ORFs of onesequence ORFs are predicted by framefinder A high-quality

BioMed Research International 5

ORF tends to have a high log-odds score and a larger ORFcoverage The integrity of ORF means ORFs in protein-coding transcripts are disposed of to begin with a startcodon and end with a stop codon The other three featuresare number of hits hit score and frame score which arederived from the output of BLASTX search A protein-codingtranscript prefersmore hits in alignment with lower119864-valuesThen the hit score is defined as follows [28]

119878119894 = mean119895

minuslog10 119864119894119895 119894 isin 0 1 2

Hit Score = mean119894isin012

119878119894 = sum2119894=0 1198781198943 (1)

where 119864119894119895 is the 119864-Value of the 119895th hits in the 119894th ORF Anoncoding transcript may also contain some hits but thesehits are inclined to scatter in three frames rather than belocated in one The frame score to calculate the distributionof hits among three ORFs is defined in the following

Frame Score = variance119894isin012

119878119894 = sum2119894=0 (119878119894 minus 119878)22 (2)

Thus a protein-coding transcript will achieve a higher hitscore and frame score because of the lower119864-value and biaseddistribution of the hits

The training data of CPC [46] are eukaryotic ncRNAsfrom the RNAdb [47] and NONCODE [48 49] databasesCPC is designed to assess transcriptsrsquo protein-coding poten-tial whichmeans it will have high accuracy of discriminatingprotein-coding transcripts Moreover CPC also has the errortolerance capacity which owesmuch to framefinderrsquos accurateprediction Framefinder performed well even though inputtranscripts may have some point mutations indel errorsand truncations CPC is slightly inferior in distinguishingnoncoding transcripts in respect of the fact that lncRNAsmaycontain putative ORFs and transcript length is also familiarto protein-coding transcripts The slow speed is anotherimperfection of CPC

23 CPAT in Detail CPAT [29] is an alignment-free pro-gram CPAT uses a logistic regression model and can betrained on own data of users Apart from the features ofmaximum length and coverage of ORF akin to CPC FickettScore is another criterion Fickett Score can be regarded asa dependent classifier it is mainly based on calculating theposition of each base favoured and the content of each basein the sequence [50] The basersquos position parameter of CPATis defined as follows

1198601 = Number of As in positions 0 3 6 1198602 = Number of As in positions 1 4 7 1198603 = Number of As in positions 2 5 8

119860-position = max (1198601 1198602 1198603)min (1198601 1198602 1198603) + 1

119860-content = Occurence Number of 119860Total Number of all bases

(3)

where 119860 in the formula means the base 119860 and the otherthree bases are measured in a similar way The parameterof position calculates each basersquos favoured position andthe parameter of content is the percentage of each basein the sequence Then according to distributions of eightparametersrsquo values [50] it is easy to obtain the probability thatthe sequence will be a protein-coding transcript Next eachprobability is multiplied by a weight to make a more accurateresult The weight is the percentage of the times that theestimate of each parameter alone is correct Finally accordingto the above descriptions Fickett Score can be determined asfollows

Fickett Score = 8sum119894=1

119901119894119908119894 (4)

According to Fickett [50] Fickett Score alone can correctlydiscriminate about 94 of the coding segments and 97 ofthe noncoding segments with 18 of ldquoNo Opinionrdquo

The last feature of CPAT is hexamer score which is themost discriminating feature Hexamer means the adjacentamino acids in proteinsThe features of the in-frame hexamerfrequency of coding and noncoding transcripts are calculatedand hexamer score is defined in the following

Hexamer Score = 1119898119898sum119894=1

log( 119865 (119867119894)1198651015840 (119867119894)) (5)

There are 64 lowast 64 kinds of hexamers and 119894 denotes each hex-amer 119865(119867119894) (119894 = 0 1 2 4095) means the in-frame hex-amer frequency of protein-coding transcripts while 1198651015840(119867119894)means noncoding transcripts For a transcript containing119898 hexamers a positive hexamer score indicates a protein-coding transcript

A high-quality training dataset is constructed contain-ing 10000 protein-coding transcripts selected from RefSeqdatabase with the annotations of the Consensus CodingSequence project and 10000 noncoding transcripts randomlycollected from GENCODE database CPAT is prebuilt hex-amer tables and logit models for human mouse fly andzebrafish Meanwhile CPAT uses pure linguistic features tofacilitate discrimination of the poorly annotated transcriptsCPAT has an efficient offline program and also provides auser-friendly web interface

24 CNCI in Detail CNCI [30] is mainly based on sequenceintrinsic composition it evaluates the transcripts by calcu-lating the usage frequency of adjoining nucleotide triplets(ANT) Firstly two ANT matrices are constructed based onthe usage frequency of ANT in noncoding sequences andcoding region of the sequences (CDS) For 4096 ANT the

6 BioMed Research International

formulas to calculate each ANT usage frequency are definedas follows

119883119894119873 = 119899sum119895=1

119878119895 (119883119894)

119879 = 119898sum119894=1

119883119894119873 = 119898sum119894=1

119899sum119895=1

119878119895 (119883119894) 119898 = 64 times 64 119899 = 1 119873

119883119894119865 = 119883119894119873119879

(6)

where 119883 means one kind of ANT 119878119895(119883119894) is the occurrencenumber of119883119894 in one sequence 119878119895Thus119883119894119873 denotes the totaloccurrence number of one kind of ANT in the dataset while119879indicates the total occurrence number of all kinds of ANT inthe dataset Accordingly 119883119894119865 is the usage frequency of ANTThen the ANT Score Matrix is utilised which is the log2-ratio of the two above-mentioned ANT matrices to score asequence and make a discrimination

ANT Score Matrix = log2CDS Matrix

Non-coding Matrix (7)

The distinguishing results of ANT Score Matrix are fairlywell but the matrix is constructed by computing the ANTusage frequency of coding region and noncoding regionconsequently the untranslated region (UTR) of the entiresequence will interfere with the performance of discrimina-tionThe sliding window is employed with one ANT (3 nt) ineach scan step to identify the CDS of a sequence by scanningsix reading frames of each sequence The different sizes(30 60 90 300 nt) of the sliding windows are examinedand the size of 150 nt for this classification model is found toobtain the most robust result For a sequence consisting of119896 ANT there will be 119896 minus 1 segments in this sequence Basedon the ANT Score Matrix each segment will get an 119878-Scoreand each reading frame can obtain an array comprised of the119878-Scores The formula of 119878-Score is defined as follows

119878-Score = 119899sum119894=1

119867119901 (119883119894) (8)

where 119883 means ANT 119867119901 is the ANT Score Matrix and119899 is the total number of the ANT in one segment or thewhole sequence Hence a correct reading frame of codingtranscript tends to have a higher whole sequence 119878-Scoreand in this array of reading frame the region composed ofconsecutive high 119878-Scores is the CDS For long noncodingtranscripts the Maximum Interval Sum [51] program is usedto identify the most-like CDS (MLCDS) which is the regionthat gained the largest sum of consecutive 119878-Scores in eachreading frame Among those six MLCDS the length and119878-Score of the MLCDS with the highest value are selectedas the features of CNCI Furthermore the features of the

LENGTH-Percentage SCORE-Distance and codon-bias arealso selected to improve accuracy

LENGTH-Percentage = 1198721sum119899119894=0 (119884119894)

SCORE-Distance = sum119899119895=0 (119878 minus 119864119895)5

(9)

where 1198721 is the length of the MLCDS with the highest 119878-Score 119884119894 is the length of each MLCDS 119878 is the highest 119878-Score among six MLCDS and 119864119895 is the 119878-Score of other fiveMLCDS Codon-bias (3-mer frequencies) is a parameter toevaluate the usage bias of different codons in protein-codingor long noncoding transcripts The log2-ratio of occurrencefrequency of each codon (stop codons are excluded) inprotein-coding genes and lncRNAs is calculated and mostcodons have distinct usage bias in two kinds of sequences

The training datasets of CNCI contain protein-codingtranscripts selected from RefSeq database and long noncod-ing transcripts selected from GENCODE [52] The CNCIis applied to other species with the aim of examining thescope of applicationThe results of vertebrates (except birds)especially mammals can be accepted since the programwas trained on human gene set CNCI can be used todiscriminate incomplete transcripts especially those high-throughput sequencing data of poorly explored species

25 PLEK in Detail PLEK [31] is an alignment-free toolbased on 119896-mer frequencies of the sequences For a givensequence the sliding windows with size of 119896 scan 1 nt asa step forward 119896 ranges from 1 to 5 which is a trade-off between accuracy and computational time Thus for asequence consisting of 119860 119862 119866 and 119879 the 41 + 42 + 43 +44 + 45 = 1364 patterns can be obtained Then the followingformulas can be used

119891119894 = 119888119894119904119896119908119896 119896 = 1 2 3 4 5 119894 = 1 2 1364119904119896 = 119897 minus 119896 + 1119908119896 = 1

45minus119896 119896 = 1 2 3 4 5

(10)

where 119894 is the number of the patterns 119888119894 denotes the numberof the segments in slidingwindowsmatchingwith patterns 119904119896denotes the total of the segments when sliding window slidesalong the sequencewith the size of 119896Therefore119891119894 is the usagefrequency multiplied by a factor119908119896 which is used to facilitatethe discrimination

A balanced training dataset is conducted with all 22389long noncoding transcripts collected from the GENCODEdataset [52ndash54] and 22389 protein-coding transcripts ran-domly selected from the human RefSeq dataset [55 56]Though the training model of PLEK is human PLEK canstill be applied to other vertebrates PLEK is particularlydesigned for the transcripts acquired from current sequenc-ing platforms which consist of some indel errors commonly

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Review Article Long Noncoding RNA Identification ...

4 BioMed Research International

Table 2 Summary of the features of each method selected

ORF Codon Sequence structure Ribosomeinteraction Alignment Protein

conservation

CPC Quality coverageintegrity No No No BLASTX

Number and119864-value of hitsDistribution of

hits

CPAT Lengthcoverage

HexamerFrequency

Content of thebases

Position of thebases

No No No

CNCI No ANT matrixCodon-bias MLCDS No No No

PLEK No No Improved k-merscheme No No No

lncRNA-MFDL

Lengthcoverage No

k-mer schemeSecondarystructureMLCDS

No No No

LncRNA-ID Lengthcoverage No Kozak motif

Ribosome releasesignal

Changes of bindingenergy

Profile HMMbased alignment

Score ofHMMER

Length of theprofile

Length of alignedregion

lncRScan-SVM No Distribution of

stop codon

Score oftxCdsPredictlength oftranscripts

length and count ofexon

No Phylo-HMMbased alignment

AveragePhastCons scores

LncRNApred Lengthcoverage No

Length of thesequence

signal to noiseratio

k-mer schemeG + C content

No No No

All features are categorized into six groups according to the similarity or basic principles Thus some items in the table might not be exactly in one-to-onecorrespondence with the feature names given in the corresponding published references

CPAT is also available both for download and as a web-server Users can obtain the latest resource code from httpssourceforgenetprojectsrna-cpatfilessource=navbar Pre-releases tutorial files and examples are also supplied onthe pages CPAT requires Python 27x numpy cython andR when running offline The web server is available athttplilabresearchbcmeducpatindexphp

CNCI can be downloaded at httpsgithubcomwww-bioinfo-orgCNCI Version 2 is updated on Feb 28 2014Setup and running steps are attached on the websitesLibsvm-30 has been enclosed in the package Other addi-tional files can be downloaded at httpwwwbioinfoorgnp

PLEK was implemented by C and Python The sourcecode can be freely downloaded from httpssourceforgenetprojectsplekfiles Several videos to assist user in utilisingPLEK correctly are also provided Python 27x is required

Scripts of LncRNA-ID can be obtained at httpsgithubcomzhangy72LncRNA-ID

LncRScan-SVM provided scripts gene annotation filesanddatasetsThe scripts can be downloaded at httpssource-forgenetprojectslncrscansvmsource20=20directoryA Readme file is also attached on this site

All the stand-alone versions of these tools requireLinuxUNIX operating system

The link of lncRNA-MFDL provided is httpscompge-nomicsutsaedulncRNA MDFL LncRNApred only has theweb interface and is available at httpmm20132014wicpnet57203LncRNApredhomejsp However the link of lncRNA-MFDL expired when we did this research And LncRNApredonly provides a web server which cannot handle too manysequences at one time

22 CPC inDetail CPC [28] extracted six features to evaluatethe coding potential of transcripts Log-odds score coverageand integrity of ORF are used to assess the ORFs of onesequence ORFs are predicted by framefinder A high-quality

BioMed Research International 5

ORF tends to have a high log-odds score and a larger ORFcoverage The integrity of ORF means ORFs in protein-coding transcripts are disposed of to begin with a startcodon and end with a stop codon The other three featuresare number of hits hit score and frame score which arederived from the output of BLASTX search A protein-codingtranscript prefersmore hits in alignment with lower119864-valuesThen the hit score is defined as follows [28]

119878119894 = mean119895

minuslog10 119864119894119895 119894 isin 0 1 2

Hit Score = mean119894isin012

119878119894 = sum2119894=0 1198781198943 (1)

where 119864119894119895 is the 119864-Value of the 119895th hits in the 119894th ORF Anoncoding transcript may also contain some hits but thesehits are inclined to scatter in three frames rather than belocated in one The frame score to calculate the distributionof hits among three ORFs is defined in the following

Frame Score = variance119894isin012

119878119894 = sum2119894=0 (119878119894 minus 119878)22 (2)

Thus a protein-coding transcript will achieve a higher hitscore and frame score because of the lower119864-value and biaseddistribution of the hits

The training data of CPC [46] are eukaryotic ncRNAsfrom the RNAdb [47] and NONCODE [48 49] databasesCPC is designed to assess transcriptsrsquo protein-coding poten-tial whichmeans it will have high accuracy of discriminatingprotein-coding transcripts Moreover CPC also has the errortolerance capacity which owesmuch to framefinderrsquos accurateprediction Framefinder performed well even though inputtranscripts may have some point mutations indel errorsand truncations CPC is slightly inferior in distinguishingnoncoding transcripts in respect of the fact that lncRNAsmaycontain putative ORFs and transcript length is also familiarto protein-coding transcripts The slow speed is anotherimperfection of CPC

23 CPAT in Detail CPAT [29] is an alignment-free pro-gram CPAT uses a logistic regression model and can betrained on own data of users Apart from the features ofmaximum length and coverage of ORF akin to CPC FickettScore is another criterion Fickett Score can be regarded asa dependent classifier it is mainly based on calculating theposition of each base favoured and the content of each basein the sequence [50] The basersquos position parameter of CPATis defined as follows

1198601 = Number of As in positions 0 3 6 1198602 = Number of As in positions 1 4 7 1198603 = Number of As in positions 2 5 8

119860-position = max (1198601 1198602 1198603)min (1198601 1198602 1198603) + 1

119860-content = Occurence Number of 119860Total Number of all bases

(3)

where 119860 in the formula means the base 119860 and the otherthree bases are measured in a similar way The parameterof position calculates each basersquos favoured position andthe parameter of content is the percentage of each basein the sequence Then according to distributions of eightparametersrsquo values [50] it is easy to obtain the probability thatthe sequence will be a protein-coding transcript Next eachprobability is multiplied by a weight to make a more accurateresult The weight is the percentage of the times that theestimate of each parameter alone is correct Finally accordingto the above descriptions Fickett Score can be determined asfollows

Fickett Score = 8sum119894=1

119901119894119908119894 (4)

According to Fickett [50] Fickett Score alone can correctlydiscriminate about 94 of the coding segments and 97 ofthe noncoding segments with 18 of ldquoNo Opinionrdquo

The last feature of CPAT is hexamer score which is themost discriminating feature Hexamer means the adjacentamino acids in proteinsThe features of the in-frame hexamerfrequency of coding and noncoding transcripts are calculatedand hexamer score is defined in the following

Hexamer Score = 1119898119898sum119894=1

log( 119865 (119867119894)1198651015840 (119867119894)) (5)

There are 64 lowast 64 kinds of hexamers and 119894 denotes each hex-amer 119865(119867119894) (119894 = 0 1 2 4095) means the in-frame hex-amer frequency of protein-coding transcripts while 1198651015840(119867119894)means noncoding transcripts For a transcript containing119898 hexamers a positive hexamer score indicates a protein-coding transcript

A high-quality training dataset is constructed contain-ing 10000 protein-coding transcripts selected from RefSeqdatabase with the annotations of the Consensus CodingSequence project and 10000 noncoding transcripts randomlycollected from GENCODE database CPAT is prebuilt hex-amer tables and logit models for human mouse fly andzebrafish Meanwhile CPAT uses pure linguistic features tofacilitate discrimination of the poorly annotated transcriptsCPAT has an efficient offline program and also provides auser-friendly web interface

24 CNCI in Detail CNCI [30] is mainly based on sequenceintrinsic composition it evaluates the transcripts by calcu-lating the usage frequency of adjoining nucleotide triplets(ANT) Firstly two ANT matrices are constructed based onthe usage frequency of ANT in noncoding sequences andcoding region of the sequences (CDS) For 4096 ANT the

6 BioMed Research International

formulas to calculate each ANT usage frequency are definedas follows

119883119894119873 = 119899sum119895=1

119878119895 (119883119894)

119879 = 119898sum119894=1

119883119894119873 = 119898sum119894=1

119899sum119895=1

119878119895 (119883119894) 119898 = 64 times 64 119899 = 1 119873

119883119894119865 = 119883119894119873119879

(6)

where 119883 means one kind of ANT 119878119895(119883119894) is the occurrencenumber of119883119894 in one sequence 119878119895Thus119883119894119873 denotes the totaloccurrence number of one kind of ANT in the dataset while119879indicates the total occurrence number of all kinds of ANT inthe dataset Accordingly 119883119894119865 is the usage frequency of ANTThen the ANT Score Matrix is utilised which is the log2-ratio of the two above-mentioned ANT matrices to score asequence and make a discrimination

ANT Score Matrix = log2CDS Matrix

Non-coding Matrix (7)

The distinguishing results of ANT Score Matrix are fairlywell but the matrix is constructed by computing the ANTusage frequency of coding region and noncoding regionconsequently the untranslated region (UTR) of the entiresequence will interfere with the performance of discrimina-tionThe sliding window is employed with one ANT (3 nt) ineach scan step to identify the CDS of a sequence by scanningsix reading frames of each sequence The different sizes(30 60 90 300 nt) of the sliding windows are examinedand the size of 150 nt for this classification model is found toobtain the most robust result For a sequence consisting of119896 ANT there will be 119896 minus 1 segments in this sequence Basedon the ANT Score Matrix each segment will get an 119878-Scoreand each reading frame can obtain an array comprised of the119878-Scores The formula of 119878-Score is defined as follows

119878-Score = 119899sum119894=1

119867119901 (119883119894) (8)

where 119883 means ANT 119867119901 is the ANT Score Matrix and119899 is the total number of the ANT in one segment or thewhole sequence Hence a correct reading frame of codingtranscript tends to have a higher whole sequence 119878-Scoreand in this array of reading frame the region composed ofconsecutive high 119878-Scores is the CDS For long noncodingtranscripts the Maximum Interval Sum [51] program is usedto identify the most-like CDS (MLCDS) which is the regionthat gained the largest sum of consecutive 119878-Scores in eachreading frame Among those six MLCDS the length and119878-Score of the MLCDS with the highest value are selectedas the features of CNCI Furthermore the features of the

LENGTH-Percentage SCORE-Distance and codon-bias arealso selected to improve accuracy

LENGTH-Percentage = 1198721sum119899119894=0 (119884119894)

SCORE-Distance = sum119899119895=0 (119878 minus 119864119895)5

(9)

where 1198721 is the length of the MLCDS with the highest 119878-Score 119884119894 is the length of each MLCDS 119878 is the highest 119878-Score among six MLCDS and 119864119895 is the 119878-Score of other fiveMLCDS Codon-bias (3-mer frequencies) is a parameter toevaluate the usage bias of different codons in protein-codingor long noncoding transcripts The log2-ratio of occurrencefrequency of each codon (stop codons are excluded) inprotein-coding genes and lncRNAs is calculated and mostcodons have distinct usage bias in two kinds of sequences

The training datasets of CNCI contain protein-codingtranscripts selected from RefSeq database and long noncod-ing transcripts selected from GENCODE [52] The CNCIis applied to other species with the aim of examining thescope of applicationThe results of vertebrates (except birds)especially mammals can be accepted since the programwas trained on human gene set CNCI can be used todiscriminate incomplete transcripts especially those high-throughput sequencing data of poorly explored species

25 PLEK in Detail PLEK [31] is an alignment-free toolbased on 119896-mer frequencies of the sequences For a givensequence the sliding windows with size of 119896 scan 1 nt asa step forward 119896 ranges from 1 to 5 which is a trade-off between accuracy and computational time Thus for asequence consisting of 119860 119862 119866 and 119879 the 41 + 42 + 43 +44 + 45 = 1364 patterns can be obtained Then the followingformulas can be used

119891119894 = 119888119894119904119896119908119896 119896 = 1 2 3 4 5 119894 = 1 2 1364119904119896 = 119897 minus 119896 + 1119908119896 = 1

45minus119896 119896 = 1 2 3 4 5

(10)

where 119894 is the number of the patterns 119888119894 denotes the numberof the segments in slidingwindowsmatchingwith patterns 119904119896denotes the total of the segments when sliding window slidesalong the sequencewith the size of 119896Therefore119891119894 is the usagefrequency multiplied by a factor119908119896 which is used to facilitatethe discrimination

A balanced training dataset is conducted with all 22389long noncoding transcripts collected from the GENCODEdataset [52ndash54] and 22389 protein-coding transcripts ran-domly selected from the human RefSeq dataset [55 56]Though the training model of PLEK is human PLEK canstill be applied to other vertebrates PLEK is particularlydesigned for the transcripts acquired from current sequenc-ing platforms which consist of some indel errors commonly

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Review Article Long Noncoding RNA Identification ...

BioMed Research International 5

ORF tends to have a high log-odds score and a larger ORFcoverage The integrity of ORF means ORFs in protein-coding transcripts are disposed of to begin with a startcodon and end with a stop codon The other three featuresare number of hits hit score and frame score which arederived from the output of BLASTX search A protein-codingtranscript prefersmore hits in alignment with lower119864-valuesThen the hit score is defined as follows [28]

119878119894 = mean119895

minuslog10 119864119894119895 119894 isin 0 1 2

Hit Score = mean119894isin012

119878119894 = sum2119894=0 1198781198943 (1)

where 119864119894119895 is the 119864-Value of the 119895th hits in the 119894th ORF Anoncoding transcript may also contain some hits but thesehits are inclined to scatter in three frames rather than belocated in one The frame score to calculate the distributionof hits among three ORFs is defined in the following

Frame Score = variance119894isin012

119878119894 = sum2119894=0 (119878119894 minus 119878)22 (2)

Thus a protein-coding transcript will achieve a higher hitscore and frame score because of the lower119864-value and biaseddistribution of the hits

The training data of CPC [46] are eukaryotic ncRNAsfrom the RNAdb [47] and NONCODE [48 49] databasesCPC is designed to assess transcriptsrsquo protein-coding poten-tial whichmeans it will have high accuracy of discriminatingprotein-coding transcripts Moreover CPC also has the errortolerance capacity which owesmuch to framefinderrsquos accurateprediction Framefinder performed well even though inputtranscripts may have some point mutations indel errorsand truncations CPC is slightly inferior in distinguishingnoncoding transcripts in respect of the fact that lncRNAsmaycontain putative ORFs and transcript length is also familiarto protein-coding transcripts The slow speed is anotherimperfection of CPC

23 CPAT in Detail CPAT [29] is an alignment-free pro-gram CPAT uses a logistic regression model and can betrained on own data of users Apart from the features ofmaximum length and coverage of ORF akin to CPC FickettScore is another criterion Fickett Score can be regarded asa dependent classifier it is mainly based on calculating theposition of each base favoured and the content of each basein the sequence [50] The basersquos position parameter of CPATis defined as follows

1198601 = Number of As in positions 0 3 6 1198602 = Number of As in positions 1 4 7 1198603 = Number of As in positions 2 5 8

119860-position = max (1198601 1198602 1198603)min (1198601 1198602 1198603) + 1

119860-content = Occurence Number of 119860Total Number of all bases

(3)

where 119860 in the formula means the base 119860 and the otherthree bases are measured in a similar way The parameterof position calculates each basersquos favoured position andthe parameter of content is the percentage of each basein the sequence Then according to distributions of eightparametersrsquo values [50] it is easy to obtain the probability thatthe sequence will be a protein-coding transcript Next eachprobability is multiplied by a weight to make a more accurateresult The weight is the percentage of the times that theestimate of each parameter alone is correct Finally accordingto the above descriptions Fickett Score can be determined asfollows

Fickett Score = 8sum119894=1

119901119894119908119894 (4)

According to Fickett [50] Fickett Score alone can correctlydiscriminate about 94 of the coding segments and 97 ofthe noncoding segments with 18 of ldquoNo Opinionrdquo

The last feature of CPAT is hexamer score which is themost discriminating feature Hexamer means the adjacentamino acids in proteinsThe features of the in-frame hexamerfrequency of coding and noncoding transcripts are calculatedand hexamer score is defined in the following

Hexamer Score = 1119898119898sum119894=1

log( 119865 (119867119894)1198651015840 (119867119894)) (5)

There are 64 lowast 64 kinds of hexamers and 119894 denotes each hex-amer 119865(119867119894) (119894 = 0 1 2 4095) means the in-frame hex-amer frequency of protein-coding transcripts while 1198651015840(119867119894)means noncoding transcripts For a transcript containing119898 hexamers a positive hexamer score indicates a protein-coding transcript

A high-quality training dataset is constructed contain-ing 10000 protein-coding transcripts selected from RefSeqdatabase with the annotations of the Consensus CodingSequence project and 10000 noncoding transcripts randomlycollected from GENCODE database CPAT is prebuilt hex-amer tables and logit models for human mouse fly andzebrafish Meanwhile CPAT uses pure linguistic features tofacilitate discrimination of the poorly annotated transcriptsCPAT has an efficient offline program and also provides auser-friendly web interface

24 CNCI in Detail CNCI [30] is mainly based on sequenceintrinsic composition it evaluates the transcripts by calcu-lating the usage frequency of adjoining nucleotide triplets(ANT) Firstly two ANT matrices are constructed based onthe usage frequency of ANT in noncoding sequences andcoding region of the sequences (CDS) For 4096 ANT the

6 BioMed Research International

formulas to calculate each ANT usage frequency are definedas follows

119883119894119873 = 119899sum119895=1

119878119895 (119883119894)

119879 = 119898sum119894=1

119883119894119873 = 119898sum119894=1

119899sum119895=1

119878119895 (119883119894) 119898 = 64 times 64 119899 = 1 119873

119883119894119865 = 119883119894119873119879

(6)

where 119883 means one kind of ANT 119878119895(119883119894) is the occurrencenumber of119883119894 in one sequence 119878119895Thus119883119894119873 denotes the totaloccurrence number of one kind of ANT in the dataset while119879indicates the total occurrence number of all kinds of ANT inthe dataset Accordingly 119883119894119865 is the usage frequency of ANTThen the ANT Score Matrix is utilised which is the log2-ratio of the two above-mentioned ANT matrices to score asequence and make a discrimination

ANT Score Matrix = log2CDS Matrix

Non-coding Matrix (7)

The distinguishing results of ANT Score Matrix are fairlywell but the matrix is constructed by computing the ANTusage frequency of coding region and noncoding regionconsequently the untranslated region (UTR) of the entiresequence will interfere with the performance of discrimina-tionThe sliding window is employed with one ANT (3 nt) ineach scan step to identify the CDS of a sequence by scanningsix reading frames of each sequence The different sizes(30 60 90 300 nt) of the sliding windows are examinedand the size of 150 nt for this classification model is found toobtain the most robust result For a sequence consisting of119896 ANT there will be 119896 minus 1 segments in this sequence Basedon the ANT Score Matrix each segment will get an 119878-Scoreand each reading frame can obtain an array comprised of the119878-Scores The formula of 119878-Score is defined as follows

119878-Score = 119899sum119894=1

119867119901 (119883119894) (8)

where 119883 means ANT 119867119901 is the ANT Score Matrix and119899 is the total number of the ANT in one segment or thewhole sequence Hence a correct reading frame of codingtranscript tends to have a higher whole sequence 119878-Scoreand in this array of reading frame the region composed ofconsecutive high 119878-Scores is the CDS For long noncodingtranscripts the Maximum Interval Sum [51] program is usedto identify the most-like CDS (MLCDS) which is the regionthat gained the largest sum of consecutive 119878-Scores in eachreading frame Among those six MLCDS the length and119878-Score of the MLCDS with the highest value are selectedas the features of CNCI Furthermore the features of the

LENGTH-Percentage SCORE-Distance and codon-bias arealso selected to improve accuracy

LENGTH-Percentage = 1198721sum119899119894=0 (119884119894)

SCORE-Distance = sum119899119895=0 (119878 minus 119864119895)5

(9)

where 1198721 is the length of the MLCDS with the highest 119878-Score 119884119894 is the length of each MLCDS 119878 is the highest 119878-Score among six MLCDS and 119864119895 is the 119878-Score of other fiveMLCDS Codon-bias (3-mer frequencies) is a parameter toevaluate the usage bias of different codons in protein-codingor long noncoding transcripts The log2-ratio of occurrencefrequency of each codon (stop codons are excluded) inprotein-coding genes and lncRNAs is calculated and mostcodons have distinct usage bias in two kinds of sequences

The training datasets of CNCI contain protein-codingtranscripts selected from RefSeq database and long noncod-ing transcripts selected from GENCODE [52] The CNCIis applied to other species with the aim of examining thescope of applicationThe results of vertebrates (except birds)especially mammals can be accepted since the programwas trained on human gene set CNCI can be used todiscriminate incomplete transcripts especially those high-throughput sequencing data of poorly explored species

25 PLEK in Detail PLEK [31] is an alignment-free toolbased on 119896-mer frequencies of the sequences For a givensequence the sliding windows with size of 119896 scan 1 nt asa step forward 119896 ranges from 1 to 5 which is a trade-off between accuracy and computational time Thus for asequence consisting of 119860 119862 119866 and 119879 the 41 + 42 + 43 +44 + 45 = 1364 patterns can be obtained Then the followingformulas can be used

119891119894 = 119888119894119904119896119908119896 119896 = 1 2 3 4 5 119894 = 1 2 1364119904119896 = 119897 minus 119896 + 1119908119896 = 1

45minus119896 119896 = 1 2 3 4 5

(10)

where 119894 is the number of the patterns 119888119894 denotes the numberof the segments in slidingwindowsmatchingwith patterns 119904119896denotes the total of the segments when sliding window slidesalong the sequencewith the size of 119896Therefore119891119894 is the usagefrequency multiplied by a factor119908119896 which is used to facilitatethe discrimination

A balanced training dataset is conducted with all 22389long noncoding transcripts collected from the GENCODEdataset [52ndash54] and 22389 protein-coding transcripts ran-domly selected from the human RefSeq dataset [55 56]Though the training model of PLEK is human PLEK canstill be applied to other vertebrates PLEK is particularlydesigned for the transcripts acquired from current sequenc-ing platforms which consist of some indel errors commonly

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Review Article Long Noncoding RNA Identification ...

6 BioMed Research International

formulas to calculate each ANT usage frequency are definedas follows

119883119894119873 = 119899sum119895=1

119878119895 (119883119894)

119879 = 119898sum119894=1

119883119894119873 = 119898sum119894=1

119899sum119895=1

119878119895 (119883119894) 119898 = 64 times 64 119899 = 1 119873

119883119894119865 = 119883119894119873119879

(6)

where 119883 means one kind of ANT 119878119895(119883119894) is the occurrencenumber of119883119894 in one sequence 119878119895Thus119883119894119873 denotes the totaloccurrence number of one kind of ANT in the dataset while119879indicates the total occurrence number of all kinds of ANT inthe dataset Accordingly 119883119894119865 is the usage frequency of ANTThen the ANT Score Matrix is utilised which is the log2-ratio of the two above-mentioned ANT matrices to score asequence and make a discrimination

ANT Score Matrix = log2CDS Matrix

Non-coding Matrix (7)

The distinguishing results of ANT Score Matrix are fairlywell but the matrix is constructed by computing the ANTusage frequency of coding region and noncoding regionconsequently the untranslated region (UTR) of the entiresequence will interfere with the performance of discrimina-tionThe sliding window is employed with one ANT (3 nt) ineach scan step to identify the CDS of a sequence by scanningsix reading frames of each sequence The different sizes(30 60 90 300 nt) of the sliding windows are examinedand the size of 150 nt for this classification model is found toobtain the most robust result For a sequence consisting of119896 ANT there will be 119896 minus 1 segments in this sequence Basedon the ANT Score Matrix each segment will get an 119878-Scoreand each reading frame can obtain an array comprised of the119878-Scores The formula of 119878-Score is defined as follows

119878-Score = 119899sum119894=1

119867119901 (119883119894) (8)

where 119883 means ANT 119867119901 is the ANT Score Matrix and119899 is the total number of the ANT in one segment or thewhole sequence Hence a correct reading frame of codingtranscript tends to have a higher whole sequence 119878-Scoreand in this array of reading frame the region composed ofconsecutive high 119878-Scores is the CDS For long noncodingtranscripts the Maximum Interval Sum [51] program is usedto identify the most-like CDS (MLCDS) which is the regionthat gained the largest sum of consecutive 119878-Scores in eachreading frame Among those six MLCDS the length and119878-Score of the MLCDS with the highest value are selectedas the features of CNCI Furthermore the features of the

LENGTH-Percentage SCORE-Distance and codon-bias arealso selected to improve accuracy

LENGTH-Percentage = 1198721sum119899119894=0 (119884119894)

SCORE-Distance = sum119899119895=0 (119878 minus 119864119895)5

(9)

where 1198721 is the length of the MLCDS with the highest 119878-Score 119884119894 is the length of each MLCDS 119878 is the highest 119878-Score among six MLCDS and 119864119895 is the 119878-Score of other fiveMLCDS Codon-bias (3-mer frequencies) is a parameter toevaluate the usage bias of different codons in protein-codingor long noncoding transcripts The log2-ratio of occurrencefrequency of each codon (stop codons are excluded) inprotein-coding genes and lncRNAs is calculated and mostcodons have distinct usage bias in two kinds of sequences

The training datasets of CNCI contain protein-codingtranscripts selected from RefSeq database and long noncod-ing transcripts selected from GENCODE [52] The CNCIis applied to other species with the aim of examining thescope of applicationThe results of vertebrates (except birds)especially mammals can be accepted since the programwas trained on human gene set CNCI can be used todiscriminate incomplete transcripts especially those high-throughput sequencing data of poorly explored species

25 PLEK in Detail PLEK [31] is an alignment-free toolbased on 119896-mer frequencies of the sequences For a givensequence the sliding windows with size of 119896 scan 1 nt asa step forward 119896 ranges from 1 to 5 which is a trade-off between accuracy and computational time Thus for asequence consisting of 119860 119862 119866 and 119879 the 41 + 42 + 43 +44 + 45 = 1364 patterns can be obtained Then the followingformulas can be used

119891119894 = 119888119894119904119896119908119896 119896 = 1 2 3 4 5 119894 = 1 2 1364119904119896 = 119897 minus 119896 + 1119908119896 = 1

45minus119896 119896 = 1 2 3 4 5

(10)

where 119894 is the number of the patterns 119888119894 denotes the numberof the segments in slidingwindowsmatchingwith patterns 119904119896denotes the total of the segments when sliding window slidesalong the sequencewith the size of 119896Therefore119891119894 is the usagefrequency multiplied by a factor119908119896 which is used to facilitatethe discrimination

A balanced training dataset is conducted with all 22389long noncoding transcripts collected from the GENCODEdataset [52ndash54] and 22389 protein-coding transcripts ran-domly selected from the human RefSeq dataset [55 56]Though the training model of PLEK is human PLEK canstill be applied to other vertebrates PLEK is particularlydesigned for the transcripts acquired from current sequenc-ing platforms which consist of some indel errors commonly

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Review Article Long Noncoding RNA Identification ...

BioMed Research International 7

For these transcripts the performance of PLEK is betterthan CPC and CNCI PLEK can be trained with usersrsquo owndatasets but it may take a long time to be accomplished

26 LncRNA-MFDL in Detail LncRNA-MFDL [34] is basedon feature fusion and deep learning algorithm LncRNA-MFDL has four kinds of features which are integrated tobuild a classification model based on deep stacking networks(DSNs one kind of deep learning algorithm) [57 58] Fourfeature groups of lncRNA-MFDL include 119896-mer secondarystructure ORF obtained by utilising txCdsPredict program(httpgenomeucscedu) [59] and MLCDS features whichare inspired by CNCI [30]

The 119896-mer scheme employed in lncRNA-MFDL is unlikethe one in PLEK Here the 119896 only ranges from 1 to 3but the frequencies are calculated on the regions of thewhole sequence and ORF at the same time Considering thatthe secondary structure is more conserved and stable thanprimary structure a representative criterion the minimumfree energy (MFE) is used to assess the secondary structureof the transcripts Utilising RNAfold program of ViennaRNAPackage [60] the MFE the ratio of MFE to sequence lengthand the number of paired bases and unpaired bases can beeasily obtained

27 LncRNA-ID in Detail LncRNA-ID [32] has three cate-gories of features as mentioned earlier Except for the lengthand coverage of ORF the features based on translationmechanism and protein conservation are extracted

Many studies [61ndash63] have demonstrated that severalnucleotide sites in Kozak motif play a prominent roleduring the initiation of protein translation An efficienttranslation indicates that the highly conserved nucleotidesappear at the positions minus3 +4 and minus2 minus1 of Kozak motifGCCRCCAUGG (R represents purine and the position of Ain start codon AUG is +1) Thus these conserved sites aremore likely to exist in protein-coding transcripts Moreoverwhen the translation starts the binding energy will changealong with the interaction between the 31015840 end of rRNAs andmRNA transcripts The Ribosome Coverage to calculate thechanges of the binding energy is defined as follows

Ribosome Coverage = 119871sum119894=1

119873119894 | 120575119894 lt 0 (11)

where 120575119894 is the free energy at position 119894 and 119873119894 is thenumber of base pairs starting at position 119894 in a sequencewith the length of 119871 Next the three levels of ribosome occu-pancy by computing Ribosome Coverage on three regionsrespectively are obtained the whole transcript ORF and31015840UTR Accordingly a true protein-coding transcript tendsto attain higher Ribosome Coverage on the whole transcriptand the ORF region When the translation terminates theribosomes will be released from protein-coding transcriptsTherefore it is likely to capture a considerable drop ofribosome occupancy when ribosomes reach stop codonsThe

Ribosome Release Score to capture this change of ribosomeoccupancy is defined

Ribosome Release Score

= Ribosome coverage of ORFlength (ORF)Ribosome coverage of 31015840UTRlength (31015840UTR)

(12)

and a protein-coding transcript inclines to exhibit a higherRibosome Release Score For protein translation categorythe selected features including nucleotides at two positionsof Kozak motif Ribosome Coverage on three regions andRibosome Release Score are selected

The protein conservation of the sequences is evaluatedaccording to profile hidden Markov model-based align-ment scores HMMER [64] is a software suite for sequencehomology detection using probabilistic methods LncRNA-ID employed HMMER with the 119864-value cutoff of 01 toalign the transcripts against all available protein families Aprotein-coding transcript is expected to get a higher scorelonger aligned region and a reasonable length of the profilein the alignment

In human genome although the amount of lncRNA isat least four times more than protein-coding genes [65]the majority class in training data is protein-coding tran-script on account of poorly annotated lncRNA Hence theclassification model of this method is balanced randomforest [66 67] which is derived from random forest butcould utilise the sufficient protein-coding data and avoidinaccurate results caused by the imbalanced training data atthe same time The human prebuilt model of LncRNA-IDcontains 15308 protein-coding transcripts and 4586 lncRNAsfrom GENCODE [52] For mouse the training datasets arecomprised of 22033 protein-coding transcripts and 2457lncRNAs randomly selected from GENCODE These twodatasets were also used to draw receiver operation charac-teristic (ROC) curves in the next section (Figure 2) Userscan train LncRNA-ID with their own dataset and apply it tovarious species

28 LncRScan-SVM in Detail LncRScan-SVM [33] classifiesthe sequencesmainly by evaluating the qualities of nucleotidesequences codon sequence and transcripts structure Thecounts and average length of exon in one sequence arecalculated The protein-coding transcripts are disposed ofto include more exons thus having a longer exon lengththan lncRNA Another feature is the score of txCdsPredictThis third-part program from UCSC genome browser [68]can determine if a transcript is protein-coding Conservationscore is obtained by calculating the average of PhastConsscores [45] from Phast (httpcompgencshleduphast)Transcript length and standard deviation of stop codoncounts between three ORFs are the last two features

The reliable datasets are constructed from GENCODE[54] composed of 81814 protein-coding transcripts and23898 long noncoding transcripts of human And formouse47394 protein-coding transcripts and 6053 long noncodingtranscripts from GENCODE [52] are also contained withinthe dataset After being trained on human and mouse

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Review Article Long Noncoding RNA Identification ...

8 BioMed Research International

datasets lncRScan-SVM obtains a good performance onlncRNA prediction

29 LncRNApred in Detail Before constructing the classi-fier self-organizing feature map (SOM) clustering [69] isemployed to select representative samples as the trainingdataset which enhanced the performance of LncRNApredAs to the features the length and coverage of the longest ORFone of the classical and typical features are selected as thecriteria In addition G + C content 119896-mer (119896 is from 1 to 3just like lncRNA-MFDL) and length of the sequence are alsothe features of LncRNApred The novel idea of LncRNApredis SNR which transforms one sequence into four binarynumeric sequences

119906119887 = 1 119878 [119899] = 1198870 119878 [119899] = 119887

119899 = 0 1 2 119873 minus 1 119887 isin 119860 119879 119862 119866 (13)

where 119887 means four kinds of bases 119873 is the length of onesequence and 119878[119899] denotes a sequence of length 119873 Thusthere will be four binary sequences 119906119887 | 119887 isin (119860 119879 119862 119866)Then applying Discrete Fourier Transform (DFT) to thesefour binary numeric sequences the power spectrum 119875[119896]can be obtained

119880119887 [119896] =119873minus1sum119899=0

119906119887 [119899] 119890minus119894(2120587119899119896119873) 119896 = 0 1 119873 minus 1119875 [119896] = 1003816100381610038161003816119880119860 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119879 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119866 [119896]10038161003816100381610038162 + 1003816100381610038161003816119880119862 [119896]10038161003816100381610038162

(14)

The studies of Fickett [50 70] have presented that positionsand compositions of four bases are different in lncRNAsand protein-coding RNA and because of this the powerspectrum of one protein-coding transcript will have a peakat1198733 position Hence the SNR is defined as follows

119864 = sum119873minus1119896=0 119875 [119896]119873 SNR = 119875 [1198733]

119864 (15)

Now there are 89 features the length and coverage ofthe longest ORF the length of the sequence SNR G + Ccontent and 4 + 16 + 64 features of 119896-mer Noticing that notall the features have high discriminative power the featureselection ismade and 25 high-quality features are determinedfrom the original 84 features of k-mer Finally 30 features areselected to build a random forest model The performance ofrandom forest is largely determined by training setThereforethe clustering method is used to find out the most adequatesequences to form a high standard training setThe clusteringmethod SOM [69] achieved the best result and was chosen toselect characteristic sequences

An overall procedure of these eight tools is displayed inFigure 1

3 Performance of These Methods

To quantify the classification performance under one unifiedstandard we first characterise lncRNAs as the positive classand protein-coding transcripts as the negative class then theperformance of these tools can be evaluated with severalstandard criteria defined as follows

Sensitivity = TPTP + FN

Specificity = TN

TN + FP

Accuracy = TP + TNTP + FP + TN + FN

False Positive Rate = FP

FP + TN

(16)

As one of the most popular methods CPC is especiallydesigned for assessing protein-coding potential and per-formed fairly well for discriminating protein-coding tran-script It enjoys the best results when screening the codingtranscripts For 10000 protein-coding genes and 10000 lncR-NAs selected from UCSC genome browser (GRCh37hg19)CPC picked up about 9762 coding transcripts while CPATdistinguished 8528 of them CPAT also outperforms CPCwith 8994 accuracy [33] Table 3 shows the performance ofthese tools on the same testing dataset CPC picks up 9997of human protein-coding genes collected from GENCODEin comparison with the latest program LncRNA-ID whoseperformance is 9528 However the performance of CPCappears to somewhat decline when focusing on the capabilityof discriminating noncoding transcripts especially long non-coding transcripts CPC only picked up 6648 of humanrsquoslong noncoding transcripts while the results of CPAT PLEKand LncRNA-ID are 8695 9952 and 9628

CPC and CPAT are the programs to assess the codingpotential but CNCI is especially used to classify protein-coding and long noncoding transcripts With the sequencesbecoming longer and longer CNCI was more superior toCPC According to Sun et al [30] when the length oftranscript is longer than 2000 nt the accuracy of CPC isonly around 04 while the CNCI still has an outstandingperformance The training dataset of CNCI is human butthis method still achieved more than 90 accuracy in othervertebrates apart from the birds [30] PLEK is tested ontwo datasets sequenced by PacBio and 454 platforms (referto Table 3) Among the tools being compared CPC stillpicked up about 9990 coding genes though this figureis not that useful because it can only distinguish 1900and 4720 lncRNAs CNCI displayed better performanceon both datasets but PLEK even achieved a more satisfyingresult

LncRNA-ID is another method to identify the longnoncoding transcripts Compared with other programsLncRNA-ID strikes a good balance between sensitivity andfalse positive rate According to Table 3 it is noticeable thatlncRNA is better than PLEK but slightly inferior to CPCand CPAT on the testing data of coding genes and the

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Review Article Long Noncoding RNA Identification ...

BioMed Research International 9

(i) ORF(ii) BLASTX

CPC

(i) k-mer

PLEK

(i) ANT matrix(ii) Codon-bias

(iii) MLCDS

CNCI

(i) Exon (ii) Stop codon

(iii) txCdsPredict(iv) PhastCons

LncRScan-SVM

Support vector machine

Featureextraction

Modelselection

Modelvalidation

Input data

10-fold cross-validationArea under the curve (AUC) of receiver operation characteristic (ROC)

LncRNA-ID(i) ORF

(ii) Ribosome interaction

(iii) HMMER

(i) ORF(ii) k-mer

(iii) Signal to noise ratio

LncRNApred

Random forests

CPAT

(i) ORF(ii) Fickett score

(iii) Hexamer

Logistic regression

LncRNA-MFDL(i) ORF

(ii) k-mer(iii) MLCDS(iv) Secondary

structure

Deeplearning

Figure 1 An overall procedure of eight tools The features of each tool are sorted into several groups and only the categories of the featuresare listed in the figure

02 04 06 08 10001 minus specificity

00

02

04

06

08

10

Sens

itivi

ty

00

02

04

06

08

10

Sens

itivi

ty

02 04 06 08 10001 minus specificity

Mouse (from GENCODE)Human (from GENCODE)

PLEK (AUC = 0991 optimal = 00210973)CNCI (AUC = 0937 optimal = 0084 0970)CPAT (AUC = 0990 optimal = 0037 0961)CPC (AUC = 0977 optimal = 0038 0889) CPC (AUC = 0950 optimal = 0068 0839)

CPAT (AUC = 0966 optimal = 0073 0926)CNCI (AUC = 0903 optimal = 0137 0959)PLEK (AUC = 0870 optimal = 0291 0885)

Figure 2 The ROC curves of CPC CPAT CNCI and PLEK We assessed the models using the same datasets as LncRNA-ID (selected fromGENCODE) used Both CPC and CPAT were evaluated with the latest versions

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Review Article Long Noncoding RNA Identification ...

10 BioMed Research International

Table 3 Overview of each toolrsquos performance on different testing datasets

Testing dataset CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMHuman MCF-7 (PacBio)1

Specificity 9990 9180 9470Sensitivity 1900 7870 9580Accuracy 9700 9130 9470Human HelaS3 (454)2

Specificity 9990 9390 9550Sensitivity 4720 8110 9250Accuracy 9900 9370 9540Human (from GENCODE)3

Specificity 9997 9955 8918 9528Sensitivity 6648 8695 9952 9628Accuracy 8322 9325 9432 9578Mouse (from GENCODE)4

Specificity 9875 9895 7094 9210Sensitivity 7655 3880 8811 9445Accuracy 8765 6888 7949 9328Human (from GRCh37hg19)5

Specificity 9762 8528 8920Sensitivity 6723 9460 9388Accuracy 8243 8994 9194Mouse (from GRCm38mm10)5

Specificity 9837 8817 8914Sensitivity 7546 9534 9529Accuracy 8691 9176 9221The results of the tools being tested on the same datasets are listed above Bold numbers denote the highest value of the metrics1MCF-7 is available at httpwwwpacbcomblogdata-release-human-mcf-7-transcriptome 2dataset of HelaS3 is available at httpswwwncbinlmnihgovsraSRX214365 34datasets are available at httpswwwdropboxcomsh7yvmqknartttm6kAAAQHvLZPjgjf4dtmHM7GNCqaH1 gencodedl=0 andhttpswwwdropboxcomsh7yvmqknartttm6kAACzaG-QJggvbXW6LA32oo7baM1 gencodedl=0 5dataset of human and mouse is available athttpjournalsplosorgplosonearticleid=101371journalpone0139654

performance on lncRNAs is just the opposite LncRNA-IDcan be trained with usersrsquo own data it can obtain a satisfyingresult even when the data is unbalancedWith the proportionof lncRNA decreasing CPAT shows a sharp reduction from7951 to 5446on the capability of lncRNAdiscriminationLncRNA-ID by contrast fell less than 1 [32]

ROC curves of CPC CPAT PLEK and LncRNA-IDtested on human and mouse datasets were provided in [32]Since CPC and CPAT are updated as the accumulation ofgene database it is useful to assess their performance withlatest version and take CNCI into account Here we utilisethe test set of LncRNA-ID [32] (both the datasets of humanandmouse are selected fromGENCODE) to reevaluate CPCCPAT CNCI and PLEK (Figure 2) According to [32] thearea under curve (AUC) of LncRNA-ID on human datasetis 09829 (optimal = 00545 09720) while on mouse it is09505 (optimal = 00800 09445) [32] In our assessmentthe performance of PLEK is identical with [32] while theperformance of CPC and CPAT as we anticipated displayedsome differences The ROC curves were drawn in R with thepackage of pROC [71]

LncRScan-SVM is compared with CPC and CPAT onhuman and mouse datasets from UCSC (version hg19 of

human and mm10 of mouse) CPC as an excellent codingpotential assessment tool still achieves 9837 of specificityonmouse testing dataset CPAT on the contrary achieved thehighest values of sensitivity both on the datasets of humanand mouse LncRScan-SVM surpasses CPAT with 8920and 8914 of specificity on human and mouse datasets Forsensitivity lncRScan-SVM obtained 9388 and 9529 onthe same testing datasets which are only 072 and around01 lower than CPATrsquos results respectively but much higherthan CPCrsquos 6723 and 7546 In addition lncRScan-SVMalso has the best results of accuracy and AUC [33] on thesedatasets

For the same testing datasets the running time of CPATis the shortest and CPC shows the longest time to finishthe process because of its alignment process When beingtested on a dataset containing 4000 protein-coding and 4000long noncoding transcripts CPAT takes 3536 s and LncRNA-ID takes 6535 s to accomplish the discrimination whilePLEK and CPC need 2147m and 8651 h respectively [32]PLEK is 8 times and 244 times faster than CNCI and CPCrespectively on the same testing data [31] and lncRScan-SVM also needs about 10 times as much as CPAT to finishcomputation [33]

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Review Article Long Noncoding RNA Identification ...

BioMed Research International 11

Table 4 Priority of employing different methods on different situations

CPC CPAT CNCI PLEK LncRNA-ID lncRScan-SVMCoding potential assessment Human lncRNAs Mouse lncRNAs Other Species1 Testing data with sequencing errors2 Lack of annotation Massive-scale data3 Trained by users4 Web interface This table only presents the preferences under different situations which means a method with a tick can achieve a better performance under a certaincircumstance1Only CPAT LncRNA-ID and lncRScan-SVM provide the model for mouse When analysing other species CPAT has the model for fly and zebrafish CNCIand PLEK can predict the sequences of vertebrata and plant CPAT PLEK and LncRNA-ID can build a newmodel based on usersrsquo datasets 2Users can chooseCNCI for incomplete sequences and CPC or PLEK for the transcripts with indel errors 3CPAT is the most efficient method Though lncRScan-SVM needsmore time than CPAT and LncRNA-ID it is also acceptable 4LncRNA-ID can handle the imbalanced training data Training PLEK with usersrsquo own datasetsmay be a time-consuming task

4 Application Scopes of the Methods

All these methods have own particular scopes to exert theirtalents which means an appropriate program can help usobtain a satisfying result The priority of utilising thesetools under some particular circumstances is summarised inTable 4

CPC is based on sequence alignment which facilitatesprotein-coding transcripts selection but impairs the perfor-mance of noncoding transcripts in that long noncoding tran-scripts share more similarities with coding transcripts suchas putative ORF which could mislead CPC Also becauseof alignment process utilising CPC to analyse massive-scaledata is a time-consuming process

CPAT is also used to evaluate the coding potential thoughthe performance on long noncoding transcripts is acceptableCPAT has a compromise between coding and noncodingtranscripts that is not bad Since the model of CPAT islogistic regression and the input file is FASTA format CPATis markedly superior in computational time which meansCPAT is more suitable for being applied to data on a largescale Furthermore linguistic features make CPAT be able toanalyse the sequences without annotation and allowing usersto train the model with their own dataset extends CPATrsquosscope of application Users can apply CPAT to other speciesinstead of being confined to human or mouse only

CNCI is designed to distinguish between coding and longnoncoding transcripts without the annotations of sequencesBecause lots of lncRNAs are poorly annotated this qualityprovided amore accurate discrimination for these sequencesCNCI is trained on human dataset but can also be applied toothermammals such asmouse and orangutan CNCI displaysacceptable results on vertebrates (except fish) but for plantsand invertebrates the result is not very satisfying CNCI isvaluable when the sequences lack annotations or users do nothave training set of other species CNCI also shows a goodperformance when the transcripts are incomplete

PLEK employs a higher fault tolerance algorithm andperforms better when the sequences have indel errors It is a

proper tool to analyse the de novo assembled transcriptomedatasets such as the sequences obtained from Roche (454)and Pacific Biosciences (PacBio) sequencing platforms Inaddition to human and mouse PLEK can also be used toother vertebrates and displays comparable results with theones of CNCI PLEKrsquos model can be trained by users but ittakes a long time to be completed

LncRNA-ID has many merits and delivers better all-round performance on human andmouse datasets Althoughthe time LncRNA-ID spent on classifying is nearly twice ofCPAT LncRNA-ID is still more efficient than other methodswhich makes it a reasonable choice when data are on amassive-scale The model of LncRNA-ID can be trained byusers but the most excellent attribute is the competence ofhandling the unbalanced training data For studying thosenot well-explored species LncRNA-ID takes priority whenusers have training datasets

LncRScan-SVM achieves a good trade-off betweenthe discrimination of coding and long noncoding genesLncRScan-SVM is slower than CPAT and LncRNA-ID but itis still acceptable For analysing human and mouse datasetslncRScan-SVM can be considered as a proper approach

5 Discussion

According to the features selected by each tool it is apparentthat different tools have their own advantages and disadvan-tages CPC is developed to assess coding potential of thetranscripts moreover CPC is trained on datasets of protein-coding and noncoding RNA which means it achieves excel-lent performance when analysing ncRNAs CPC providesa stand-alone version and a web server but both of thetwo programs need vast amounts of time to process thesequences As alignment-based tools the performance ofCPC varied when using different protein reference databaseCPAT can present satisfying results efficiently partly becauseCPAT builds the logistic model which is faster than SVMThe web server of CPAT can display the result in an instant

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 12: Review Article Long Noncoding RNA Identification ...

12 BioMed Research International

which facilitates small scale prediction tasks A minor dis-advantage of CPAT is that the cutoff of CPAT varies fromspecies to species and users have to determine the optimumcutoff value when they are training a new model CNCI isdesigned to predict the transcripts assembled from whole-transcriptome sequencing data Thus CNCI offers a highaccuracy on incomplete transcripts CNCI did not provideresult of elaborate comparison between CNCI and CPATbut CPAT has no regard for the problem of incompletetranscripts Meanwhile UTRs of the transcripts may alsointerfere with the performance of CPATThe features of ANTof CNCI closely resemble the hexamer of CPAT but thedistinguishing process of CNCI is more complicated andaccurate than CPAT However the sliding window of CNCIslides 3 nt in each step and consequently some deletionor frameshift errors may lead to a false shift and presentusers with a disappointing performance In such cases PLEKhas made a considerable improvement and exhibits moreflexibility when handling the indel sequencing errors Indelerrors are very common in the sequences obtained by todayrsquossequencing platform which means PLEK performs well forde novo assembled transcriptomes With the indel error rateincreasing the accuracy of CNCI is decreasing while PLEKhas no distinct fluctuation Nonetheless since the nucleotidescompositions differ slightly among different species theperformance of PLEK on multiple species is not better thanor approximately equivalent to CNCI whose performanceis more stable on different species Both LncRNA-ID andlncRScan-SVM achieve a balance between protein-codingand lncRNAs But the capacity of lncRScan-SVMwill be lim-ited when analysing the sequences with a lack of annotationAnother point that needs to be brought up is that lncRScan-SVM and CNCI support lowast GTF as input file format

It is apparent that nucleotides composition (such as 119896-merand G + C content) and ORF are two classic and widely usedfeature groups These features have strong discriminativepower because protein-coding genes will finally be tran-scribed and translated to produce a specific amino acid chainwhich requires some specified nucleotides composition andhigh-quality ORFs As to the models of these tools SVM(CPC CNCI and PLEK) logistic regression (CPAT) andrandom forest (LncRNA-ID) are more practical for lncRNAidentification though ANN or deep learning is a more popu-lar machine learning algorithm now Along with the protein-coding genes prediction the annotations of lncRNA genehave been performed as well A new tool named AnnoLnc(2015 available at httpannolnccbipkueducnindexjsp)has just been developed to annotate new discovered lncRNAsbut related article has not yet been officially published Userscan access its web server for more information

LncRNAs are receiving increasing attention and lncRNAidentification has always been a challenge for researchesof life science For so many different types of sequencesvarious excellent tools should be developed to tackle differentproblems under various circumstances in the future In thisreview we summarised several tools for lncRNAs identifica-tion and concluded respective scopes Due to their differentscopes of application using a method apposite to particularsituation will be of essence to achieve convincing results

We hope this review can help researchers employ a moreappropriate method in certain situations

Additional Points

Key Points (i) Different tools have different scopes Usersshould select a proper tool according to the type of sequences(ii) From the perspective of sequence types CPC andCPAT are mainly used to assess coding potential CNCIand PLEK can be applied to the sequences obtained fromhigh-throughput sequencing platforms or the poorly anno-tated LncRNA-ID and lncRScan-SVM are more accurate onhuman and mouse datasets (iii) From the perspective ofother functions CPC and CPAT have web interfaces Theclassification models of CPAT LncRNA-ID and PLEK canbe trained on usersrsquo own datasets CPAT LncRNA-ID andlncRScan-SVM can be utilised when the data to be analysedare on a massive-scale

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported by the National Natural ScienceFoundation of China (61272207 61472158 and 61402194)and the Science-Technology Development Project from JilinProvince (20130522118JH 20130101070JC) and China Post-doctoral Science Foundation (2014T70291)

References

[1] A F Palazzo and E S Lee ldquoNon-coding RNA what isfunctional andwhat is junkrdquo Frontiers in Genetics vol 5 article2 pp 1ndash11 2015

[2] Y Okazaki M Furuno T Kasukawa et al ldquoAnalysis of themouse transcriptome based on functional annotation of 60770full-length cDNAsrdquo Nature vol 420 pp 563ndash573 2002

[3] S Djebali C A Davis AMerkel et al ldquoLandscape of transcrip-tion in human cellsrdquo Nature vol 489 pp 101ndash108 2012

[4] I Dunham A Kundaje S F Aldred et al ldquoAn integratedencyclopedia of DNA elements in the human genomerdquo Naturevol 489 pp 57ndash74 2012

[5] E Pennisi ldquoENCODE project writes eulogy for junk DNArdquoScience vol 337 no 6099 pp 1159ndash1161 2012

[6] S U Schmitz P Grote and B G Herrmann ldquoMechanisms oflong noncoding RNA function in development and diseaserdquoCellular and Molecular Life Sciences vol 73 no 13 pp 2491ndash2509 2016

[7] K Plath J Fang S K Mlynarczyk-Evans et al ldquoRole of histoneH3 lysine 27 methylation in X inactivationrdquo Science vol 300no 5616 pp 131ndash135 2003

[8] S T da Rocha V Boeva M Escamilla-Del-Arenal et al ldquoJarid2is implicated in the initial xist-induced targeting of PRC2 to theinactive X chromosomerdquoMolecular Cell vol 53 no 2 pp 301ndash316 2014

[9] V OrsquoLeary S V Ovsepian L G Carrascosa et al ldquoPARTICLEa triplex-forming long ncRNA regulates locus-specific methy-lation in response to low-dose irradiationrdquo Cell Reports vol 11no 3 pp 474ndash485 2015

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 13: Review Article Long Noncoding RNA Identification ...

BioMed Research International 13

[10] A C Marques and C P Ponting ldquoCatalogues of mammalianlong noncoding RNAs modest conservation and incomplete-nessrdquo Genome Biology vol 10 no 11 article R124 2009

[11] J R Prensner M K Iyer A Sahu et al ldquoThe long noncodingRNA SChLAP1 promotes aggressive prostate cancer and antag-onizes the SWISNF complexrdquo Nature Genetics vol 45 no 11pp 1392ndash1398 2013

[12] G Chen Z Wang D Wang et al ldquoLncRNADisease a databasefor long-non-coding RNA-associated diseasesrdquo Nucleic AcidsResearch vol 41 no 1 pp D983ndashD986 2013

[13] X Chen and G-Y Yan ldquoNovel human lncRNA-disease associ-ation inference based on lncRNA expression profilesrdquo Bioinfor-matics vol 29 no 20 pp 2617ndash2624 2013

[14] X Chen C Clarence Yan C Luo W Ji Y Zhang and Q DaildquoConstructing lncRNA functional similarity network based onlncRNA-disease associations and disease semantic similarityrdquoScientific Reports vol 5 Article ID 11338 2015

[15] X Chen Y-A Huang X-S Wang Z You and K C ChanldquoFMLNCSIM fuzzy measure-based lncRNA functional simi-larity calculation modelrdquo Oncotarget vol 7 no 29 pp 45948ndash45958 2016

[16] X Chen ldquoPredicting lncRNA-disease associations and con-structing lncRNA functional similarity network based on theinformation of miRNArdquo Scientific Reports vol 5 Article ID13186 2015

[17] J Sun H Shi Z Wang et al ldquoInferring novel lncRNA-diseaseassociations based on a random walk model of a lncRNAfunctional similarity networkrdquo Molecular BioSystems vol 10no 8 pp 2074ndash2081 2014

[18] M Zhou X Wang J Li et al ldquoPrioritizing candidate disease-related long non-codingRNAs bywalking on the heterogeneouslncRNA and disease networkrdquoMolecular BioSystems vol 11 no3 pp 760ndash769 2015

[19] X Chen Z You G Yan and D Gong ldquoIRWRLDA improvedrandom walk with restart for lncRNA-disease association pre-dictionrdquo Oncotarget vol 7 no 36 pp 57919ndash57931 2016

[20] X Chen C C Yan X Zhang and Z You ldquoLong non-codingRNAs and complex diseases from experimental results tocomputational modelsrdquo Briefings in Bioinformatics 2016

[21] T M Lowe and S R Eddy ldquotRNAscan-SE a programfor improved detection of transfer RNA genes in genomicsequencerdquo Nucleic Acids Research vol 25 no 5 pp 955ndash9641997

[22] Q Zou J Guo Y Ju M Wu X Zeng and Z Hong ldquoImprov-ing tRNAscan-SE annotation results via ensemble classifiersrdquoMolecular Informatics vol 34 no 11-12 pp 761ndash770 2015

[23] L Wei M Liao Y Gao R Ji Z He and Q Zou ldquoImprovedand promising identificationof human microRNAs by incorpo-ratinga high-quality negative setrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 11 no 1 pp 192ndash201 2014

[24] C Y Wang L L Hu M Z Guo X Y Liu and Q Zou ldquoimDCan ensemble learningmethod for imbalanced classificationwithmiRNAdatardquoGenetics andMolecular Research vol 14 no 1 pp123ndash133 2015

[25] K Lagesen P Hallin E A Roslashdland H-H Staeligrfeldt T Rognesand DW Ussery ldquoRNAmmer consistent and rapid annotationof ribosomal RNA genesrdquo Nucleic Acids Research vol 35 no 9pp 3100ndash3108 2007

[26] C Wang L Wei M Guo and Q Zou ldquoComputationalapproaches in detecting non- coding RNArdquo Current Genomicsvol 14 no 6 pp 371ndash377 2013

[27] D Veneziano G Nigita and A Ferro ldquoComputationalapproaches for the analysis of ncRNA through deep sequencingtechniquesrdquo Frontiers in Bioengineering and Biotechnology vol3 article 77 2015

[28] L Kong Y Zhang Z-Q Ye et al ldquoCPC assess the protein-coding potential of transcripts using sequence features andsupport vector machinerdquo Nucleic Acids Research vol 35 no 2pp W345ndashW349 2007

[29] L Wang H J Park S Dasari et al ldquoCoding-potential assess-ment tool using an alignment-free logistic regression modelrdquoNucleic Acids Research vol 41 no 6 article e74 2013

[30] L Sun H Luo D Bu et al ldquoUtilizing sequence intrinsiccomposition to classify protein-coding and long non-codingtranscriptsrdquo Nucleic Acids Research vol 41 no 17 article e1662013

[31] A Li J Zhang and Z Zhou ldquoPLEK a tool for predicting longnon-coding RNAs and messenger RNAs based on an improvedk-mer schemerdquo BMC Bioinformatics vol 15 no 1 article 3112014

[32] R Achawanantakun J Chen Y Sun and Y Zhang ldquoLncRNA-ID long non-coding RNA IDentification using balanced ran-dom forestsrdquo Bioinformatics vol 31 no 24 pp 3897ndash3905 2015

[33] L Sun H Liu L Zhang and J Meng ldquolncRScan-SVM a toolfor predicting long non-coding RNAs using support vectormachinerdquo PLoS ONE vol 10 no 10 Article ID e0139654 2015

[34] X-N Fan and S-W Zhang ldquoLncRNA-MFDL identification ofhuman long non-coding RNAs by fusing multiple features andusing deep learningrdquo Molecular BioSystems vol 11 no 3 pp892ndash897 2015

[35] C Pian G Zhang Z Chen et al ldquoLncRNApred classificationof long non-coding RNAs and protein-coding transcripts by theensemble algorithm with a new hybrid featurerdquo PLoS ONE vol11 no 5 Article ID e0154567 2016

[36] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[37] M C Frith A R Forrest E Nourbakhsh et al ldquoThe abundanceof short proteins in the mammalian proteomerdquo PLoS Geneticsvol 2 no 4 pp 515ndash528 2006

[38] D M Anderson K M Anderson C-L Chang et al ldquoAmicropeptide encoded by a putative long noncoding RNAregulatesmuscle performancerdquoCell vol 160 no 4 pp 595ndash6062015

[39] M Guttman P Russell N T Ingolia J S Weissman and ES Lander ldquoRibosome profiling provides evidence that largenoncoding RNAs do not encode proteinsrdquo Cell vol 154 no 1pp 240ndash251 2013

[40] M F Lin I Jungreis and M Kellis ldquoPhyloCSF a comparativegenomics method to distinguish protein coding and non-coding regionsrdquoBioinformatics vol 27 no 13 Article ID btr209pp i275ndashi282 2011

[41] B Panwar A Arora and G P S Raghava ldquoPrediction andclassification of ncRNAs using structural informationrdquo BMCGenomics vol 15 no 1 article 127 2014

[42] L Childs Z Nikoloski P May and D Walther ldquoIdentificationand classification of ncRNAmolecules using graph propertiesrdquoNucleic Acids Research vol 37 no 9 article e66 2009

[43] K Sun X Chen P Jiang X Song H Wang and HSun ldquoiSeeRNA identification of long intergenic non-codingRNA transcripts from transcriptome sequencing datardquo BMCGenomics vol 14 article S7 2013

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 14: Review Article Long Noncoding RNA Identification ...

14 BioMed Research International

[44] Y Wang Y Li Q Wang et al ldquoComputational identificationof human long intergenic non-coding RNAs using a GA-SVMalgorithmrdquo Gene vol 533 no 1 pp 94ndash99 2014

[45] A Siepel G Bejerano J S Pedersen et al ldquoEvolutionarilyconserved elements in vertebrate insect worm and yeastgenomesrdquo Genome Research vol 15 no 8 pp 1034ndash1050 2005

[46] J Liu J Gough and B Rost ldquoDistinguishing protein-codingfrom non-coding RNAs through support vector machinesrdquoPLoS Genetics vol 2 no 4 pp 529ndash536 2006

[47] K C Pang S Stephen P G Engstrom et al ldquoRNAdb-acomprehensive mammalian noncoding RNA databaserdquoNucleicAcids Research vol 33 pp D125ndashD130 2005

[48] C Liu B Bai G Skogerboslash et al ldquoNONCODE an inte-grated knowledge database of non-coding RNAsrdquoNucleic AcidsResearch vol 33 pp D112ndashD115 2005

[49] Y Zhao H Li S Fang et al ldquoNONCODE 2016 an informativeand valuable data source of long non-coding RNAsrdquo NucleicAcids Research vol 44 no 1 pp D203ndashD208 2016

[50] J W Fickett ldquoRecognition of protein coding regions in DNAsequencesrdquoNucleic Acids Research vol 10 no 17 pp 5303ndash53181982

[51] S Mukherjee and Y Zhang ldquoMM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programmingrdquoNucleic Acids Research vol 37no 11 2009

[52] T Derrien R Johnson G Bussotti et al ldquoThe GENCODE v7catalog of human long noncoding RNAs analysis of their genestructure evolution and expressionrdquoGenome Research vol 22no 9 pp 1775ndash1789 2012

[53] J Harrow F Denoeud A Frankish et al ldquoGENCODE produc-ing a reference annotation for ENCODErdquo Genome Biology vol7 supplement 1 pp S41ndashS49 2006

[54] J Harrow A Frankish J M Gonzalez et al ldquoGENCODE thereference human genome annotation for the ENCODE projectrdquoGenome Research vol 22 no 9 pp 1760ndash1774 2012

[55] K D Pruitt T Tatusova G R Brown andD RMaglott ldquoNCBIReference Sequences (RefSeq) current status new features andgenome annotation policyrdquo Nucleic Acids Research vol 40 no1 pp D130ndashD135 2012

[56] K D Pruitt T Tatusova and D R Maglott ldquoNCBI refer-ence sequences (RefSeq) a curated non-redundant sequencedatabase of genomes transcripts and proteinsrdquo Nucleic AcidsResearch vol 35 no 1 pp D61ndashD65 2007

[57] L Deng D Yu and J Platt ldquoScalable stacking and learningfor building deep architecturesrdquo in Proceedings of the 2012IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP rsquo12) pp 2133ndash2136 Kyoto Japan March2012

[58] B Hutchinson L Deng and D Yu ldquoTensor deep stackingnetworksrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 35 no 8 pp 1944ndash1957 2013

[59] W J Kent C W Sugnet T S Furey et al ldquoThe human genomebrowser at UCSCrdquo Genome Research vol 12 no 6 pp 996ndash1006 2002

[60] R Lorenz S H Bernhart C Honer zu Siederdissen et alldquoViennaRNA Package 20rdquo Algorithms for Molecular Biologyvol 6 no 1 article 26 2011

[61] M Kozak ldquoRecognition of AUG and alternative initiatorcodons is augmented by G in position +4 but is not generallyaffected by the nucleotides in positions +5 and +6rdquoThe EMBOJournal vol 16 no 9 pp 2482ndash2492 1997

[62] M Kozak ldquoInitiation of translation in prokaryotes and eukary-otesrdquo Gene vol 234 no 2 pp 187ndash208 1999

[63] H Xu P Wang Y Fu et al ldquoLength of the ORF position of thefirstAUGand theKozakmotif are important factors in potentialdual-coding transcriptsrdquo Cell Research vol 20 no 4 pp 445ndash457 2010

[64] R D Finn J Clements and S R Eddy ldquoHMMER webserver interactive sequence similarity searchingrdquoNucleic AcidsResearch vol 39 no 2 pp W29ndashW37 2011

[65] P Kapranov J Cheng S Dike et al ldquoRNA maps reveal newRNA classes and a possible function for pervasive transcrip-tionrdquo Science vol 316 no 5830 pp 1484ndash1488 2007

[66] C Chen A Liaw and L BreimanUsing RandomForest to LearnImbalancedData University of California Berkeley Calif USA2004

[67] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[68] J A Sogn ldquoStructure of the peptide antibiotic polypeptinrdquoJournal of Medicinal Chemistry vol 19 no 10 pp 1228ndash12311976

[69] T Kohonen ldquoThe self-organizingmaprdquo Proceedings of the IEEEvol 78 no 9 pp 1464ndash1480 1990

[70] J W Fickett and C-S Tung ldquoAssessment of protein codingmeasuresrdquo Nucleic Acids Research vol 20 no 24 pp 6441ndash6450 1992

[71] X Robin N Turck A Hainard et al ldquopROC an open-sourcepackage for R and S+ to analyze and compare ROC curvesrdquoBMC Bioinformatics vol 12 article 77 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 15: Review Article Long Noncoding RNA Identification ...

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology