
METHOD Open Access

Reconstruction of full-length circular RNAs enables isoform-level quantification

Yi Zheng1,2†, Peifeng Ji1†, Shuai Chen1,2, Lingling Hou1 and Fangqing Zhao1,2,3*

Abstract

Currently, circRNA studies are shifting from the identification of circular transcripts to understanding their biological functions. However, such endeavors have been limited by the difficulty of determining full-length circRNA sequences at large scale and by the inability to quantify circRNAs accurately at the isoform level. Here, we propose a new feature, reverse overlap (RO), for circRNA detection, which outperforms back-splice junction (BSJ)-based methods in identifying low-abundance circRNAs. By combining RO and BSJ features, we present a novel approach for effective reconstruction of full-length circRNAs and isoform-level quantification from the transcriptome. We systematically compared BSJ-level and isoform-level differential expression analyses using human liver tumor and normal tissues and highlight the necessity of deepening circRNA studies to isoform-level resolution. The CIRI-full software can be accessed at https://sourceforge.net/projects/ciri.

Keywords: Alternative splicing, Circular RNA (circRNA), Transcript reconstruction, Isoform quantification

* Correspondence: [email protected]
† Yi Zheng and Peifeng Ji contributed equally to this work.
1 Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Zheng et al. Genome Medicine (2019) 11:2 https://doi.org/10.1186/s13073-019-0614-1

Background

Circular RNA (circRNA) is a type of RNA molecule in which both ends are covalently linked. Advances in deep sequencing and identification algorithms have led to the discovery of a huge number of circRNAs in species from fly to human [1–6]. Most recently, new subclasses of circRNAs, including non-exonic circRNAs [7, 8] and exon-intron circRNAs [9], have been explored. Subsequent studies unveiled the ubiquity of alternative splicing (AS) events within circRNAs and revealed a profound difference in the expression of circRNAs and mRNAs [10]. Building on these identified circRNAs, most recent studies have shifted toward revealing the biological functions of circRNAs. As a heterogeneous class, circRNAs may participate in various aspects of biological processes. In addition to their well-studied function as microRNA sponges [3, 4], studies have illustrated that these circular transcripts may be involved in gene regulation [11], development [12], innate immune response [13], and diseases [14–21]. Recent efforts have shown that N⁶-methyladenosine (m⁶A) promotes the efficient initiation of protein translation from circRNAs [22]. Subsequently, Zhou et al. demonstrated the prevalence of m⁶A in circRNAs [23]. Collectively, these intriguing findings illustrate the complexity of circRNA functions and show that our understanding of how circRNAs participate in biological processes is still rudimentary.

Evolutionary analyses of gene sequences and expression patterns have provided essential insights into gene function. Increasing attention has also been paid to evolutionary analysis of circRNAs among different species. Rybak-Wolf et al. systematically compiled a catalog of neuronal human and mouse circRNAs and found that these circRNAs were preferentially enriched in mammalian brain and that the same circRNAs were often expressed in both species and were well conserved in sequence [24]. As the first systematic analysis of circRNAs in mammalian brains, this work represents an important step toward further elaboration of circRNA functions. However, owing to the lack of full-length circRNAs, sequence conservation comparison is restricted to flanking introns and coding DNA sequences (CDSs). Considering the prevalence of circRNA isoforms generated by combinations of internal components within back-splice junctions (BSJs), current cross-species conservation analyses based on partial sequences may lead to biased estimation of circRNA conservation with respect to expression pattern and sequence composition. Moreover, using BSJs to represent a collection of different circRNA isoforms hampers our understanding of specific circRNA functions and makes it difficult to achieve further evolutionary insights into circRNAs among species. A fully automated method that can identify full-length circRNAs at large scale from RNA-seq data has yet to be developed.

A number of efforts have recently been made to explore the internal landscape of circRNAs. Some studies utilized a straightforward strategy that simply combined all known mRNA exons in sequential order as the putative full-length circRNA [2, 25]. This method, however, relies on the unsupported assumption that circular and linear transcripts share the same composition and thus may mislead downstream analyses. Other methods, such as CIRI-AS [10], CIRCexplorer2 [26], and FUCHS [27], identify the internal components of the BSJ. CIRI-AS employs a spliced junction signature-based algorithm and enables, for the first time, high-throughput detection of internal components of circRNAs based on short-read sequencing. Similar to CIRI-AS, FUCHS predicts circRNA internal sequences by extracting the mapping results of BSJ reads. Alternatively, CIRCexplorer2 detects alternative splicing events by comparing poly(A)+ and poly(A)− RNA-seq data sets. However, because these methods do not consider combinations of these components, whole-sequence prediction of circular isoforms with complicated AS events is still beyond their reach. Most recently, Ye et al. employed a strategy similar to CIRI-AS to assemble full-length sequences of circRNAs using BSJ read pairs, but this approach still faces an inherent challenge in that only a small fraction of circRNAs can be identified by assembling BSJ reads [28]. As a result, without the whole sequence of circRNAs, the reconstruction of circular isoforms within a given BSJ is beyond the scope of current analyses, and accurate quantification of circRNAs at the isoform level remains an unsolved problem. The inability to reconstruct full-length circRNAs and quantify circular isoforms therefore limits the discovery of previously unknown biological phenomena and restricts our ability to understand the diversity and expression patterns of circRNA isoforms.

To address this challenge, we present a new feature, reverse overlap (RO), for full-length circRNA reconstruction and isoform-level quantification. RO is especially suitable for identifying low-abundance circRNAs that are difficult to identify using the BSJ feature. Considering that the vast majority of circRNAs in various transcriptomes are expressed at low levels, the detection of such low-abundance transcripts is extremely important in circRNA studies. Moreover, we develop an accurate, high-throughput approach (CIRI-full) that uses both BSJ and RO features to reconstruct full-length circRNAs and the circular isoforms within them from RNA-seq data sets. Several recent independent studies demonstrated that CIRI2 exhibited remarkably balanced sensitivity, reliability, running time, and RAM usage in circRNA detection [29–32]. Most recently, Thomas B. Hansen systematically compared 11 circRNA detection algorithms and found that CIRI2 was one of the best algorithms for circRNA identification and performed comparably to annotation-based algorithms [33]. In CIRI-full, CIRI2 is employed to detect cirexons (circRNA exons) and to determine the boundaries of circRNAs. The RO feature, which is deduced from reversely overlapped paired-end reads, is used to explore the detailed cirexon landscape within the boundary sites and to assemble the full-length sequence. Based on the assembled full-length circRNAs, a forward splice graph (FSG)-based algorithm is employed to reconstruct all full-length isoforms within them and to determine their abundances. Compared with previous methods, CIRI-full is not only efficient at determining complete sequences of circRNAs but, more importantly, enables the analysis of circRNAs at the isoform level. We applied CIRI-full to survey circRNA expression patterns in samples from the brains of six vertebrates and also explored circRNA expression divergence between tumor and normal tissues at both BSJ- and isoform-level resolution, which uncovered distinct expression patterns between circRNAs and their isoforms. This study presents an important approach to assemble and quantify circRNAs and will greatly improve our understanding of their biogenesis and functions.

Methods

Overview of CIRI-full

The CIRI-full algorithm is a four-step process that includes RO read detection and verification, BSJ and cirexon detection, combined assembly of both RO and BSJ reads, and isoform reconstruction and quantification. RO read detection and verification is designed to detect RO reads from paired-end reads based on their 5′ reverse overlaps and to rule out linear transcripts and false positives resulting from lariat structures using post-alignment filters. An RO-merged read is identified as a full-length circRNA if the genomic alignments of both its ends either overlap or are located in the same cirexon. The BSJ and cirexon detection step was developed to detect BSJs and cirexons and to identify single-splice events. If the BSJ read pairs are located exclusively on the cirexons within the BSJ, the complete sequence of this circRNA can be reconstructed using these cirexons. RNA-seq reads are processed separately in the first two steps, resulting in the identification of a number of full-length circRNAs in addition to BSJs and RO-merged reads that are not sufficient for reconstructing full-length circRNAs independently. Therefore, these unused but informative RO-merged reads and BSJ reads are integrated to make a combined assembly. Finally, for a given BSJ, all forward splice junctions are recognized and quantified, and then an adapted FSG [34–36] is built to estimate the isoform expression abundance.

RO read detection and verification

During the reverse transcription step of library preparation, transcription begins at the primers and proceeds along the RNA template. Owing to the unique structure of circRNAs, these circular transcripts will be repeatedly reverse-transcribed. When these reverse-transcribed cDNAs are sequenced, characteristic reverse overlap features will be observed on the paired-end reads if the library insert length is greater than the length of the circRNA. Specifically, the 5′- or 3′-ends of the two paired reads reversely overlap with each other, which can be used as an indicator of a circRNA. Moreover, the presence of a 3′ RO on both reads indicates that the complete sequence of the circRNA has been read, and thus its whole sequence can be reconstructed.

The 5′ RO reads are identified based on the following strategy. For each read pair, the first 10 bp at the 5′-end of one read is divided into three subsequences with a window size of 8 bp and a step size of 1 bp. These subsequences are then used as seeds to search for matches at the 5′-end of the other read. Once all these seeds have matches (matched bases ≥ 7 bp) on the other read, the sequence ranging from the 5′ terminus to the matches on this read is extracted and aligned to the counterpart read. Both members of this pair of reads are taken as candidate 5′ RO reads if the alignment satisfies two criteria: (i) aligned length ≥ 13 bp and (ii) nucleotide identity ≥ 95%. Subsequently, these two reads are merged into a long read based on the alignment, and the long read is treated as a candidate RO-merged read for further validation.

During the library preparation step, the size of fragments is not strictly the same as the library insert length but varies around it. Moreover, if the fragment size is shorter than the sequenced read length, partial adapter sequences will be attached to the 3′-ends of the paired-end reads. Reads of this type, which actually originate from linear transcripts, can produce false-positive 5′ RO features during candidate RO read detection. To rule out such false-positive RO reads, a mapping-based filtration strategy is employed. First, the precise location of the candidate RO-merged read is determined. Specifically, the candidate RO-merged read is aligned to the reference genome using BWA-MEM (-T 19, minimum score to output), which outputs split alignments of the read on the genome. The alignments are then collected and sorted according to their alignment lengths. The longest alignment with mapping quality greater than 15 is used as an anchor alignment, and the accumulated mapping length within a 100-kbp interval on both sides of this anchor alignment is calculated. If the summed length exceeds half of the read length, this candidate RO-merged read is retained for further analysis. Otherwise, it is discarded. Finally, the retained candidate RO-merged reads are remapped using a local realignment strategy to accurately determine their locations. Briefly, highly reliable mapping fragments are determined using the BSJ position or, when no BSJ position is available, the anchor position. Then, the precise locations of the unmapped or abnormally mapped fragments are obtained using dynamic programming. After the locations of the candidate RO-merged reads are determined, linear transcript-derived reads are identified and ruled out if they satisfy two criteria: (i) the reads contain no BSJ and (ii) the subsequences with the same length on both ends of the read have no hit around the anchor alignment.

Because BWA-MEM was originally designed for mapping DNA sequencing reads, this tool does not consider GT/AG splicing signals and fails to obtain the accurate boundaries of split alignments, in which the mapping position may deviate by a few base pairs from the true boundary position. Moreover, use of BWA-MEM may also lead to the inclusion of lariat structures in the candidate RO reads. To adjust the boundaries and remove the lariat structures, the alignments for each candidate RO-merged read are revisited by checking for the presence of a GT/AG splicing signal. For each read, the GT/AG splicing site of the aligned fragments is first checked, and the read is filtered out if any aligned fragment does not contain a GT/AG splicing site. Then, the GT/AG splicing sites on the remaining candidate RO reads are adjusted. Next, the alignments of two 5-bp subsequences from both sides of the junction site on each read are extracted, and whether these alignments have mapping gaps or mismatches is determined. If there is no gap or mismatch in these alignments, this read is taken as a highly reliable RO-merged read. Otherwise, it is discarded.
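The 5′ RO seed search and merging step described in this section can be illustrated with a minimal Python sketch. It is not the CIRI-full implementation: the function names are hypothetical, the overlap is compared without gaps (a simplification of the alignment step), and the second read is reverse-complemented so that both reads lie on the same strand. The thresholds (three 8-bp seeds over the first 10 bp, ≥ 7 matching bases per seed, overlap ≥ 13 bp, identity ≥ 95%) are taken from the text.

```python
# Sketch of 5' reverse-overlap (RO) detection for one read pair.
# Assumptions (not from the paper): ungapped comparison, function names,
# and reverse-complementing read 2 before the search.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

SEED_LEN = 8         # window size of each seed
N_SEEDS = 3          # three seeds tile the first 10 bp with a step of 1 bp
MIN_SEED_MATCH = 7   # a seed hit needs at least 7 matching bases
MIN_OVERLAP = 13     # minimum aligned length in bp
MIN_IDENTITY = 0.95  # minimum identity over the overlap


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _matches(a, b):
    """Count positions at which two strings carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def find_5p_reverse_overlap(read1, read2):
    """Return the RO-merged read, or None if no 5' RO is detected."""
    mate = reverse_complement(read2)  # same strand as read1
    last_start = len(mate) - (SEED_LEN + N_SEEDS - 1)
    for pos in range(last_start + 1):
        # All three seeds from the 5' end of read1 must hit at
        # consecutive offsets of the mate.
        seeds_hit = all(
            _matches(read1[i:i + SEED_LEN],
                     mate[pos + i:pos + i + SEED_LEN]) >= MIN_SEED_MATCH
            for i in range(N_SEEDS)
        )
        if not seeds_hit:
            continue
        overlap = len(mate) - pos  # mate suffix that should equal read1 prefix
        if overlap < MIN_OVERLAP or overlap > len(read1):
            continue
        identity = _matches(read1[:overlap], mate[pos:]) / overlap
        if identity >= MIN_IDENTITY:
            # Merge: non-overlapping part of the mate, then all of read1.
            return mate[:pos] + read1
    return None
```

Positions are scanned from left to right, so the first accepted hit is the longest overlap. A production implementation would replace the ungapped comparison with a proper pairwise alignment.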

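The mapping-based filter for candidate RO-merged reads (anchor selection followed by the accumulated-length test) could be sketched as follows. The `Alignment` record and function names are hypothetical, and two details are assumptions rather than statements from the text: the anchor's own length is counted in the accumulated length, and only alignments on the anchor's chromosome are considered. The thresholds (mapping quality > 15, 100-kbp window, more than half the read length) follow the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alignment:
    """One split alignment of an RO-merged read (hypothetical record)."""
    chrom: str
    start: int      # 0-based reference start
    end: int        # reference end, exclusive
    mapq: int       # mapping quality
    query_len: int  # number of read bases covered by this alignment


WINDOW = 100_000  # 100-kbp interval on each side of the anchor
MIN_MAPQ = 15     # the anchor needs mapping quality greater than this


def pick_anchor(alignments: List[Alignment]) -> Optional[Alignment]:
    """Longest alignment whose mapping quality exceeds MIN_MAPQ."""
    for aln in sorted(alignments, key=lambda a: a.query_len, reverse=True):
        if aln.mapq > MIN_MAPQ:
            return aln
    return None


def passes_anchor_filter(alignments: List[Alignment], read_len: int) -> bool:
    """Keep the read if enough of it maps near the anchor alignment."""
    anchor = pick_anchor(alignments)
    if anchor is None:
        return False
    lo, hi = anchor.start - WINDOW, anchor.end + WINDOW
    # Assumption: the anchor itself contributes to the accumulated length.
    total = sum(
        a.query_len
        for a in alignments
        if a.chrom == anchor.chrom and a.start >= lo and a.end <= hi
    )
    return total > read_len / 2
```

In practice the `Alignment` records would be parsed from the split alignments reported by the aligner; that parsing is omitted here.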
BSJ and cirexon detection

BSJs in the RNA-seq reads are detected using CIRI2 [30], and single-splice events within these BSJs are inferred using CIRI-AS (parameter -d yes) [10]. Within each BSJ, all cirexons inferred from the single-splice events are collected, sorted, and recorded.


Combined assembly of RO and BSJ reads

After identifying RO-merged reads and BSJ boundaries, full-length circRNAs are identified separately using these two types of information. Specifically, an RO-merged read is identified as a full-length circRNA if the genomic alignments of both its ends either overlap (3′ RO) or are located on the same cirexon. Otherwise, the RO-merged read is reserved for further combined assembly. For each BSJ, the alignments of all the paired-end BSJ reads are collected; if the BSJ reads are all located exclusively on cirexons, the complete sequence of this circRNA is reconstructed by linearly connecting the cirexons. Otherwise, the BSJ and the cirexons within it are recorded for further combined assembly.

The BSJ feature provides an efficient method of detecting the existence of circRNAs but cannot be used to identify the internal composition of the circRNA. The RO feature greatly facilitates determination of the internal composition of circRNAs but may sometimes fail to cover the BSJ site owing to the limitation of read length. Hence, CIRI-full employs a combined strategy that uses the advantages of both BSJ and RO features to reconstruct a more comprehensive full-length circRNA repository. By utilizing the unused RO-merged reads and BSJ reads, full-length circRNAs are reconstructed using the following process. The RO-merged reads and BSJ reads are sorted and clustered according to the BSJ. If both types of read are observed within a BSJ, the reads are used to reconstruct full-length circRNAs. For each circRNA, the RO-merged reads are used to determine additional cirexons that were not identified by the BSJ reads, and the alignments of all the BSJ read pairs are re-checked. If these reads are all located exclusively on cirexons, the complete sequence is reconstructed by linearly connecting the cirexons. Otherwise, the identified cirexons within the BSJ are output and marked as a partially reconstructed circRNA.
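The decision rule of the combined assembly (pool the cirexons from BSJ and RO evidence, check that every BSJ read block lies on a cirexon, then connect the cirexons) can be sketched in Python. The function name, the half-open coordinate convention, the plain-string genome, and the plus-strand-only handling are all illustrative assumptions rather than details of CIRI-full.

```python
def reconstruct_circrna(genome, bsj_cirexons, bsj_read_blocks, ro_cirexons=()):
    """Try to reconstruct a full-length circRNA within one BSJ.

    genome          -- reference sequence of the chromosome (plus strand)
    bsj_cirexons    -- (start, end) cirexons inferred from BSJ reads
    bsj_read_blocks -- (start, end) aligned blocks of all BSJ read pairs
    ro_cirexons     -- additional (start, end) cirexons from RO-merged reads
    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    # Pool cirexons from both evidence types and sort them by position.
    cirexons = sorted(set(bsj_cirexons) | set(ro_cirexons))

    def on_cirexon(block):
        start, end = block
        return any(s <= start and end <= e for s, e in cirexons)

    # Full-length only if every BSJ read block lies within some cirexon.
    if cirexons and all(on_cirexon(block) for block in bsj_read_blocks):
        sequence = "".join(genome[s:e] for s, e in cirexons)
        return "full", sequence
    # Otherwise report the cirexons found so far as a partial reconstruction.
    return "partial", cirexons
```

For example, with the toy genome `"AAAACCCCGGGGTTTTAAAA"`, cirexons `(4, 8)` and `(12, 16)`, and read blocks confined to those intervals, the sketch returns `("full", "CCCCTTTT")`; a read block at `(9, 11)` makes the result partial until an RO-derived cirexon `(8, 12)` is added.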

Quantifying the expression of circRNAs at the isoform level

After obtaining the reconstructed circRNAs, CIRI-full builds a forward splice graph (FSG) using all BSJ reads and RO-merged reads within each BSJ. The nodes in the FSG represent cirexons, and the edges represent their connections (i.e., forward splice junctions (FSJs) between cirexons). Theoretically, the FSG covers all possible circular isoforms that are consistent with the mapped reads within the BSJ, and a traversing path from the start cirexon to the end cirexon represents a candidate splicing isoform. It should be noted that the FSG is a closed circuit owing to the nature of circRNAs. Then, an adapted depth-first search (DFS) algorithm is performed to exhaustively decompose the FSG into paths. Specifically, the DFS algorithm starts iteratively at each node and stops at breakpoints (nodes without continuing splicing events) or at the start cirexon. After these paths are obtained, short paths are merged into longer paths and redundant paths are then filtered out. To avoid a large number of false-positive paths, which would significantly affect the efficiency of later iteration steps, CIRI-full retains only a limited number of paths (10 by default). In detail, the edges of the FSG are categorized into four types: (i) BSJ; (ii) phasing FSJ, where the splicing event is exclusively occupied by only one circular isoform; (iii) co-occurring FSJs, where a number of splicing events are supported by the same RO read; and (iv) the remaining FSJs. Paths containing phasing FSJs and co-occurring FSJs are given top priority in screening, and the corresponding paths are referred to as phased isoforms, because these circular isoforms undoubtedly exist. The paths that contain the remaining FSJs are sorted by node sequencing depth, and the paths with the highest sequencing depth are retained until the threshold (10 by default) is reached. The resulting output paths are referred to as candidate isoforms.

Next, the relative abundance of each path is determined. In detail, a Monte Carlo simulation method is used to simulate the distribution of BSJ reads on each path based on the insert length distribution of the RNA-seq library, which is inferred from the mapping distance of paired-end reads. According to the distribution of simulated reads, the abundance of nodes and edges (splicing events) on each path can be calculated in the later steps. To quantify the relative abundance of each path, an approximate exhaustive search algorithm is proposed. Specifically, this approach starts by assigning a random putative abundance (a positive integer) to each path, where the summed abundance of all paths should equal the total number of BSJ reads. Based on the assigned putative abundances and the distribution of simulated BSJ reads, the putative abundances of nodes and edges on each path are computed. From these, the accumulated putative abundances of nodes and edges are calculated. Then, the distance between the putative and real abundances (inferred from mapped BSJ reads) of nodes and edges is calculated and recorded. The putative abundances of paths are then adapted toward the real abundances of nodes and edges. Next, the algorithm iterates through the following steps. The putative abundances of nodes and edges on each path are re-calculated. The accumulated putative abundances of nodes and edges are computed. The distance between the putative and real abundances of nodes and edges is calculated and compared with the previously recorded distance. If this distance is larger than the previously recorded distance, the path abundance adaptation is rejected and a new adaptation is performed. Otherwise, the distance is recorded and a new iteration starts. This iteration process stops when the distance converges. By this method, we obtain the relative abundance of the circular isoforms of a given circRNA.
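The path decomposition of the FSG can be illustrated with a plain depth-first enumeration of paths from the start cirexon to the end cirexon. This is a simplified stand-in for the adapted DFS described above: it does not start from every node, merge short paths, or prioritize phased isoforms, and the `max_paths` cap is applied in discovery order rather than by the screening rules in the text. The function name and the integer node labels are hypothetical.

```python
from collections import defaultdict


def enumerate_isoform_paths(fsj_edges, start, end, max_paths=10):
    """List simple paths from the start cirexon to the end cirexon.

    fsj_edges -- iterable of (upstream, downstream) cirexon pairs, one per
                 forward splice junction
    max_paths -- cap on the number of reported paths (10 mirrors the
                 default mentioned in the text)
    """
    graph = defaultdict(list)
    for upstream, downstream in fsj_edges:
        graph[upstream].append(downstream)

    paths = []
    stack = [(start, [start])]  # iterative DFS: (current node, path so far)
    while stack and len(paths) < max_paths:
        node, path = stack.pop()
        if node == end:
            paths.append(path)  # reaching the end cirexon closes the circle
            continue
        # Push in reverse order so that smaller node labels are explored first.
        for nxt in sorted(graph[node], reverse=True):
            if nxt not in path:  # never revisit a cirexon within one path
                stack.append((nxt, path + [nxt]))
    return paths
```

With edges `[(1, 2), (2, 3), (1, 3), (3, 4)]`, start `1`, and end `4`, the sketch reports the exon-inclusion path `[1, 2, 3, 4]` and the exon-skipping path `[1, 3, 4]`.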

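The iterative abundance adaptation can be approximated by a small greedy search, shown below as a hedged sketch. Several simplifications are assumptions of this example and not part of the described method: the Monte Carlo read distribution is omitted (each path contributes its full abundance to every edge it contains), only edges are scored, the starting point is an even split rather than a random assignment, and each adaptation moves a single read between two paths. All function names are hypothetical.

```python
def edge_totals(paths, abundances):
    """Accumulate the putative abundance of each edge over all paths."""
    totals = {}
    for path, abundance in zip(paths, abundances):
        for edge in zip(path, path[1:]):
            totals[edge] = totals.get(edge, 0) + abundance
    return totals


def path_distance(paths, abundances, observed_edges):
    """Sum of absolute differences between putative and observed edge counts."""
    putative = edge_totals(paths, abundances)
    edges = set(putative) | set(observed_edges)
    return sum(abs(putative.get(e, 0) - observed_edges.get(e, 0)) for e in edges)


def estimate_path_abundances(paths, observed_edges, total_reads):
    """Greedy search for integer path abundances summing to total_reads."""
    n = len(paths)
    # Start from an even split (the text describes a random start instead).
    abundances = [total_reads // n] * n
    for i in range(total_reads % n):
        abundances[i] += 1

    best = path_distance(paths, abundances, observed_edges)
    improved = True
    while improved:  # stop once no single-read move lowers the distance
        improved = False
        for src in range(n):
            for dst in range(n):
                if src == dst or abundances[src] == 0:
                    continue
                abundances[src] -= 1
                abundances[dst] += 1
                dist = path_distance(paths, abundances, observed_edges)
                if dist < best:
                    best = dist       # accept the adaptation
                    improved = True
                else:
                    abundances[src] += 1  # reject and restore
                    abundances[dst] -= 1
    return abundances
```

Because the distance is a non-negative integer that strictly decreases with every accepted move, the loop always terminates, although a greedy search of this kind may stop at a local optimum.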
Simulated data sets for circRNA identification and reconstruction

Simulated data sets were generated by a CIRI simulator [8]. This tool can simulate RNA-seq data with given references and annotations. Parameters such as read length, coverage, sequencing error rate, and insert size can be customized. The insert length L follows a normal distribution N(μ, σ²). For each pair of reads, the insert length is generated using two independent uniformly distributed random numbers x_1 and x_2 (0 < x_1, x_2 < 1), as follows:

\[ L = \sqrt{-2 \ln(x_1)} \times \cos(2\pi x_2) \times \sigma + \mu \]
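The insert-length formula is the Box-Muller transform, which converts two uniform random numbers into one normally distributed value. A minimal Python sketch follows; the function name and the optional `rng` argument are illustrative choices and not part of the CIRI simulator.

```python
import math
import random


def sample_insert_length(mu, sigma, rng=random):
    """Draw one insert length from N(mu, sigma^2) via Box-Muller."""
    x1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(x1) is defined
    x2 = rng.random()
    standard_normal = math.sqrt(-2.0 * math.log(x1)) * math.cos(2.0 * math.pi * x2)
    return standard_normal * sigma + mu
```

With a wide distribution such as μ = 350 bp and σ = 200 bp, individual draws can be very small or negative, so a caller would presumably need to resample or clamp such values; how the simulator handles this is not stated in the text.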

To make the simulated circRNAs more closely resemble real data, the length distribution and expression distribution of circRNAs were also considered. We first applied CIRI-AS to RNA-seq data from the HeLa cell line (SRA accession number: SRR3476956) to predict the size of identified circRNAs with ≥ 5 supporting BSJ reads by summing all cirexon lengths.

We also adjusted the sequencing depth of the circRNAs in the simulated data according to the real data. For the input sequencing depth parameter D, the coverage of a particular circular transcript was generated using a uniformly distributed random number x (0 < x < 1) as follows:

\[ \mathrm{Coverage} = \frac{(D - 0.5) \times 3 + 1}{1 - \sqrt{x}} \]

To validate the robustness of the RO feature, two groups of simulated paired-end transcriptomic data sets were used to test the performance of the RO detection method. The first group was designed to test the performance of the RO feature under different sequencing depths. This group consisted of four circular transcript sequencing data sets with different average depths (2×, 5×, 10×, and 15×). The read length was set to 200 bp, and the insert length distribution was set to μ = 350 bp, σ = 200 bp. The second group was constructed to test the performance of the RO feature under different read lengths. This group contained four sets of circular transcript sequencing data with different read lengths (75 bp, 100 bp, 150 bp, and 200 bp). The average depth of these four data sets was 10×, and the insert length followed the same distribution as that of the former group. Furthermore, BSJ-based circRNA detection tools, including CIRI2 [30], find_circ [4], CIRCexplorer2 [25], and KNIFE [37], were used for performance comparison.

Simulated data sets for circRNA isoform quantification

To compare the performance of circRNA isoform quantification between CIRI-full and CIRI-AS, we simulated circRNA-containing RNA-seq data sets in which two different isoforms were generated for each circRNA by simulating an additional exon-skipping event. The number of isoforms for each circRNA was set to two because only in this situation can CIRI-AS estimate the relative abundance of the two isoforms within a circRNA using percent spliced in (PSI) values. Two data sets were simulated. The first simulated transcripts with different sequencing depths (25×, 50×, 100×, and 150×), a uniform read length of 150 bp, and an insert length of 350 ± 200 bp. The second simulated transcripts with different read lengths (100, 150, 200, 250, and 300 bp), an insert length of 350 ± 200 bp, and a sequencing depth of 75×.

To further evaluate the sensitivity and accuracy of CIRI-full in circRNA isoform detection and quantification, we also simulated circRNA-containing RNA-seq data sets with three isoforms for each circRNA. The transcript sequencing depth was set to 50× with a read length of 150 bp and an insert length of 350 ± 200 bp.

Generation of HeLa cell RNA-seq data

Total RNA was isolated using TRIZOL (Invitrogen) from HeLa cells grown in standard medium under standard conditions. The RNA was divided into three samples containing equal amounts of RNA. The quality of these samples was manually controlled to produce different RNA integrity number (RIN) values. Specifically, the RIN value of the low-quality RNA sample was 5 and that of the two high-quality samples was 10. Next, a RiboMinus kit (Invitrogen, Carlsbad, CA, USA) was utilized to deplete ribosomal RNA in these samples. The resulting RNA was incubated at 37 °C and treated with 10 U μg⁻¹ RNase R (Epicenter, Madison, WI, USA). One of the high-RIN samples and the low-RIN sample were used separately as templates for cDNA libraries following the TruSeq protocol (Illumina, San Diego, CA, USA); the other high-RIN sample was used to construct a sequencing library using the same protocol but without the fragmentation step. Fragments spanning a broad size range (300–800 bp) were selected for library construction. The three libraries (low RIN/fragmented, high RIN/fragmented, high RIN/unfragmented) were sequenced on the Illumina HiSeq 2500 platform of the Research Facility Center at the Beijing Institutes of Life Science, CAS, with a read length of 250 bp. The sequencing data sets have been deposited in the SRA under project ID PRJNA475651 [38].

Zheng et al. Genome Medicine (2019) 11:2 Page 5 of 20


Generation of whole brain RNA-seq data
Human, macaque, and rabbit whole brain RNA samples were purchased from Zyagen (San Diego, CA, USA). Mouse, rat, and chicken whole brain tissues were obtained from the Research Facility Center at the Beijing Institutes of Life Science, CAS. RNA samples were isolated using TRIZOL (Invitrogen). For each species, three types of cDNA library were prepared. Specifically, a Ribo-/RNase R library was constructed using RNA samples that had been treated with the RiboMinus kit (Invitrogen) and then incubated at 37 °C with 10 U μg−1 RNase R, a Ribo-/cDNA library was constructed using RNA samples that were treated only with the RiboMinus kit, and a poly-A library was prepared according to the TruSeq v2 guide. Poly-A and Ribo- RNA samples were used as templates for cDNA libraries according to the TruSeq protocol (Illumina); Ribo-/RNase R-treated samples used the same protocol but without fragmentation. These libraries were sequenced on the Illumina HiSeq 2500 platform of the Research Facility Center at the Beijing Institutes of Life Science, CAS. PolyA+ and Ribo-/RNase R libraries were sequenced with paired-end 250-bp reads, and Ribo- libraries were sequenced with paired-end 150-bp reads. The sequencing data sets have been deposited in the SRA under project ID PRJNA475651 [38].

HeLa cell and brain RNA-seq data processing
CircRNAs from three HeLa cell RNA-seq data sets were identified using CIRI-full with default parameters. The RNA-seq data sets of brain samples from six species that had undergone RiboMinus/RNase R treatment were processed using CIRI-full. To normalize the number of resulting circRNAs, circRNAs with no RO read support and those with BSJ read support ≤ 5 were filtered out in human, mouse, rat, and rabbit according to the total BSJ number. Brain data sets from RiboMinus libraries were processed by CIRI2 with default parameters, and expression levels were normalized to data set size. Poly-A-selected RNA-seq data sets of brain samples of the six vertebrates were analyzed using Hisat2 [39] and StringTie [40, 41] with default parameters.
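The filtering and normalization rules above can be expressed as a short sketch. The record layout (`ro_reads`, `bsj_reads`) and the per-million scaling are illustrative assumptions; the text states only that expression was normalized to data set size, and CIRI-full's actual output columns are not reproduced here.

```python
from typing import Dict, List


def filter_circrnas(records: List[Dict], max_filtered_bsj: int = 5) -> List[Dict]:
    """Drop circRNAs with no RO read support or with BSJ support <= 5."""
    return [
        r for r in records
        if r["ro_reads"] > 0 and r["bsj_reads"] > max_filtered_bsj
    ]


def per_million(read_count: float, total_reads: float) -> float:
    """Scale a read count to a data set of one million reads.

    Assumption: per-million scaling stands in for the unspecified
    normalization to data set size.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return read_count * 1e6 / total_reads
```

For example, a circRNA with 12 BSJ reads and 3 RO reads passes the filter, whereas one with 12 BSJ reads and no RO reads, or one with 5 BSJ reads, does not.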

Experimental validation
To validate the predictions made by the RO method, outward-facing primer sets were designed to amplify 21 circRNAs (15 for BSJ validation and 6 for full-length circRNA validation). PCRs were performed using 35 cycles, and the sequences of the PCR products were determined via Sanger sequencing. Among the 21 validated circRNAs, all BSJs and FSJs of the six full-length candidates were supported by Sanger sequencing, indicating the accuracy of the reconstructed full-length sequences of these circRNAs. The remaining 15 circRNAs were validated by confirming their BSJ sites using Sanger sequencing.

To verify predicted circRNA isoforms and their relative abundances, outward-facing primers were designed to quantify the expression of 17 isoforms within 8 circRNAs. Specifically, HeLa cells were grown in standard media under standard conditions. Total RNA was isolated using TRIZOL and converted to cDNA using random hexamer primers with the FastKing RT Kit (TIANGEN). The resulting cDNA was used as the template, and real-time qPCR was performed using primer pairs specific for circRNA isoforms and two negative controls (GAPDH and β-actin) with SYBR FAST qPCR Kits (Kapa Biosystems). The reaction volume was 20 μl, which contained 1 μl of serially diluted cDNA, 10 μl of qPCR SYBR Green Master Mix, 0.5 μl each of forward and reverse primers, and 8 μl of water. Thermal cycling was carried out on a StepOnePlus instrument (Applied Biosystems) using the following conditions: 95 °C for 5 min, followed by 40 cycles of 95 °C for 10 s and 60 °C for 30 s. Fluorescent signals were detected at the annealing/extension step (60 °C).

Differential expression analysis of circRNAs and their isoforms in HCC patients
RiboMinus-treated RNA-seq data sets of tumor (SRX1558046-SRX1558064) and normal liver samples (SRX1558026-SRX1558045) from 20 HCC patients generated in a previous study [42] were downloaded from the NCBI SRA database. Full-length circRNAs and their isoforms were obtained by running CIRI-full on these data sets using default parameters. The statistical significance of differentially expressed circRNAs between normal and tumor samples was calculated using the Mann–Whitney U test.
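For readers unfamiliar with the Mann–Whitney U test, the sketch below computes the U statistic and a two-sided p value for the expression of one circRNA in two sample groups. It is a plain-Python illustration that uses midranks and a tie-corrected normal approximation without continuity correction. It is not the code used in the study, and for small groups an exact test from a statistics package would be preferable.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, two-sided p) for two independent samples."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    # Pool the values and remember which group each one came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Entries i..j-1 are tied and share the mean of ranks i+1..j.
        midrank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return u, min(1.0, p)
```

In the setting described above, `x` and `y` would hold the abundance of one circRNA (or one isoform) across the tumor and normal samples, respectively.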

Results
Currently available approaches for circRNA identification are exclusively based on the detection of back-splice junctions (BSJs). In this study, we propose a new feature, reverse overlap (RO), for full-length circRNA detection, which outperforms previous BSJ-based methods in detecting circRNAs, even for highly degraded RNA samples. We further develop an accurate and high-throughput approach, CIRI-full, that uses both BSJ and RO features to reconstruct full-length circRNAs and their isoforms from RNA-seq data sets.

The CIRI-full approach
The RO identification algorithm (Additional file 1: Figure S1) is based on the region in amplified circular transcripts in which the 5′- and 3′-ends of paired reads reversely overlap with each other (Fig. 1a, b). It should be noted that a 3′-end RO (3′ RO) will occur if the circRNA is completely covered by paired-end reads; in this case, the entire sequence of the circRNA can be reconstructed (Fig. 1a). However, the presence of RO features depends on library fragment size, cDNA amplification, and read length (Additional file 1: Figure S2). This method involves two steps, RO detection and RO verification. The former is designed to detect candidate RO reads. Specifically, for each read pair, the 5′-end subsequences of both reads are extracted and aligned using a seed-matching strategy. The read pairs for which the 5′-end subsequence alignment passes the length and identity thresholds are merged into a long sequence according to the alignment and taken as candidate RO-merged reads. Owing to contamination with lariat structures and linear transcript reads that have partial sequences attached to both ends, a significant number of candidate RO-merged reads are false positives. Next, a mapping-based RO verification step is used to select authentic RO-merged reads. In this step, candidate RO-merged reads are mapped to the reference genome, resulting in several split alignments. To accurately determine the location of candidate RO-merged reads, the longest alignment for each read is used as an anchor alignment, and the accumulated mapping length within a given interval on both sides of the anchor is calculated. If the summed length exceeds half of the read length, the read is retained. Abnormally mapped and unmapped regions of each retained read are then remapped using a local realignment strategy, and the boundaries of the mapping junctions are adjusted according to the GT/AG splicing signal. Candidate RO-merged reads derived from linear transcripts and lariat structures are also ruled out in this process. Finally, the locations of both ends of the remaining RO-merged reads are checked and recorded for further analyses (Fig. 1c, d and Additional file 1: Figure S3).

Fig. 1 Workflow of reverse overlap detection and full-length circular RNA reconstruction. RO, reverse overlap; BSJ, back-spliced junction; FSJ, forward-spliced junction. a RO is an overlapped region in amplified circular transcripts in which the 5′- or 3′-ends of paired reads are reversely overlapped with each other. The presence of a 5′ RO indicates that the paired reads are derived from a circular transcript. The presence of both 5′ and 3′ ROs indicates that a full-length circular transcript can be generated by merging the 5′ and 3′ overlapped sequences of the read pair. b Alignment of a read pair with 5′ RO and/or 3′ RO. c, d Candidate RO-merged reads are mapped to the reference genome to accurately determine the locations of the reads and to rule out contamination. The longest alignment is chosen as an anchor for determining the location of the reads (c). Unmapped and abnormally mapped fragments in the candidate RO-merged reads are realigned to the reference genome based on the location of the anchored alignment; the alignment boundaries are then adjusted based on the GT/AG splicing signal (d). e Workflow of full-length circRNA reconstruction. ROs, BSJs, and cirexons are first detected from RNA-seq data. Full-length circRNAs can be reconstructed when both 5′ and 3′ RO are present or when the circRNAs are completely covered by BSJ reads. For circRNAs lacking 3′ RO or FSJs, a combined assembly is performed to integrate the 5′ RO reads and the BSJ reads. f Isoform-level quantification of circRNAs. The BSJ and RO-merged reads are aligned to the reference genome. A forward splice graph (FSG) that records the splicing and coverage information is built based on the alignments. Next, the resulting FSG is dissected into paths that represent putative circular isoforms of the circRNA (right panel). Paths that contain a phasing FSJ, where the splicing event is exclusively occupied by only one circular isoform, or co-occurred FSJs, where several splicing events are supported by the same RO read, are classified as phased isoforms. The read coverage profile of each path is modeled by a Monte Carlo simulation (right middle panel). Expressed circular isoforms are dissected and quantified by employing an approximate exhaustive search algorithm (bottom)

The CIRI-full pipeline involves four steps: RO detection and reconstruction, BSJ and cirexon detection, combined assembly of RO and BSJ reads (Fig. 1e), and circular isoform detection and quantification (Fig. 1f). Several recent studies demonstrated that CIRI2 exhibited remarkably balanced sensitivity, reliability, running time, and RAM usage in circRNA detection [29–32] and performed comparably to annotation-based algorithms [33]; thus, CIRI2 was employed to detect BSJ reads in this pipeline. RNA-seq reads are processed separately in the first two steps, yielding a number of full-length circRNAs, as well as BSJ and RO-merged reads that are not sufficient for reconstructing full-length circRNAs independently. These unused but informative reads are therefore integrated in the next step to generate a combined assembly. Based on the identified full-length circRNAs, circular isoforms within the BSJs of a circRNA are then detected and quantified by employing statistical models.

In the first step, an RO-merged read is identified as a full-length circRNA if the genome alignments of both of its ends satisfy one of two criteria: (1) they overlap on the genome (Additional file 1: Figure S4A) or (2) they do not overlap but are located on the same cirexon (Additional file 1: Figure S4B). In the second step, CIRI-AS is employed to detect the BSJs and cirexons of circRNAs. Then, for each circRNA, the locations of BSJ read pairs within the BSJ are checked; if all the reads are exclusively located on the cirexons, the complete sequence of this circRNA is assembled by linearly connecting the cirexons (Additional file 1: Figure S4C). In the third step, incomplete information from the first two steps is clustered according to the BSJ, and the RO reads and cirexons for each BSJ are combined to complement each other to reconstruct full-length circRNAs (Additional file 1: Figure S4D). However, under certain conditions, full-length circRNAs cannot be reconstructed; these include long circRNAs whose length is more than twofold greater than the library size (Additional file 1: Figure S5D) and circRNAs with incomplete cirexons due to low expression levels (Additional file 1: Figure S5A–C).

Finally, a forward splice graph (FSG) is constructed using the alignments of BSJ and RO-merged reads within the BSJs for each assembled circRNA (Fig. 1f). The resulting FSG is dissected into paths by using an adapted depth-first search method, which iteratively traverses from different source nodes to find all non-redundant paths (Additional file 1: Figure S6). The abundance of each circular isoform is then estimated as follows (Additional file 1: Figure S7). First, a Monte Carlo method is employed to simulate the distribution of BSJ reads on each path according to the insert length of the RNA-seq library; this distribution is used to estimate the coverage of each node and edge of the path. Then, an approximate exhaustive search method is employed to find the optimum solution for the abundance of each path. Specifically, CIRI-full initially assigns a random value to each path and then calculates the abundance of every node and the splicing events of each edge on the path based on the simulated BSJ read distribution. CIRI-full then calculates a distance score from the accumulated putative abundance of each node and the splicing events of each edge; this score represents the discrepancy between the putative and real abundance of each path. To obtain the smallest distance, CIRI-full corrects the putative abundance of each path iteratively according to the FSG until the distance scores converge. After the iterative computation, the optimum solution is output, and thus, the abundance of each path is determined (Fig. 1f).
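The path-dissection step can be illustrated with a small depth-first search over a toy forward splice graph. In this sketch each node is a cirexon and each directed edge is an observed forward splice junction. The graph encoding, the function name, and the exon labels are hypothetical and chosen only for illustration; CIRI-full's adapted search and its handling of coverage are not reproduced.

```python
from typing import Dict, Iterable, List, Set, Tuple


def enumerate_isoform_paths(edges: Dict[str, List[str]],
                            sources: Iterable[str],
                            sinks: Iterable[str]) -> List[Tuple[str, ...]]:
    """List every distinct source-to-sink path in a forward splice graph."""
    sink_set = set(sinks)
    found: Set[Tuple[str, ...]] = set()

    def dfs(node: str, trail: Tuple[str, ...]) -> None:
        trail = trail + (node,)
        if node in sink_set:
            found.add(trail)  # a complete candidate isoform
            return
        for nxt in edges.get(node, []):
            if nxt not in trail:  # guard against accidental cycles
                dfs(nxt, trail)

    # Start one traversal from each source node, as in the text.
    for src in sources:
        dfs(src, ())
    return sorted(found)  # the set removes redundant paths


# Toy circRNA: E1 and E4 flank the BSJ; E2 and E3 can each be skipped.
toy_edges = {"E1": ["E2", "E3"], "E2": ["E3", "E4"], "E3": ["E4"]}
toy_paths = enumerate_isoform_paths(toy_edges, ["E1"], ["E4"])
```

Each returned tuple is one candidate isoform. Assigning abundances to these candidates is a separate step, handled in the text by the Monte Carlo coverage model and the iterative search.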

RO feature facilitates identification of low-abundance circRNAs
To explore the advantages of the RO feature, we extensively compared the performance of the RO-based method with that of BSJ-based methods by simulating circRNA-containing transcriptomic datasets. These datasets contained RNA-seq paired-end reads with an average library size of 350 bp and an average circRNA length of 300 ± 150 bp, where the length distribution was inferred from a HeLa circRNA dataset (Additional file 1: Figure S8). To measure the effect of read length on the results obtained using these two strategies, we simulated datasets with an average circRNA abundance of 10× and read lengths of 75, 100, 150, and 200 bp (Additional file 1: Figure S9). As expected, the sensitivity of the RO-based method increased significantly with read length, considering that 75% of the simulated circular transcripts were shorter than 480 bp (Additional file 1: Figure S8A). Moreover, the number of circRNAs that were exclusively detected by the RO method also increased with increased read length. Surprisingly, this contrasted sharply with the performance of the BSJ-based tools other than CIRI2, whose sensitivity dropped rapidly, especially with read lengths of 200 bp. The primary reason for this decrease is the increased number of junction sites produced by long reads. Most current BSJ-based approaches, except for CIRI2, are specifically designed for short reads, and they ignore the split alignment of long reads with more than three splitting sites. Next, we compared the performance of the RO- and BSJ-based methods in detecting low-abundance circRNAs. The simulator was adapted to generate paired-end reads with lengths of 200 bp and circRNAs with a gamma distribution of expression levels, in which most circRNAs exhibited low abundance (Additional file 1: Figure S9). As shown in Fig. 2a, b and Additional file 1: Figure S10, in all of the compared cases, the RO-based method achieved a sensitivity comparable to that of CIRI2 and a specificity similar to that of the BSJ-based methods. Notably, circRNAs that were only detected by the RO-based method were of low abundance, suggesting the potential application of the method for identifying low-abundance circRNAs.

To obtain a more comprehensive understanding of the RO feature, we further compared the relationship between RO reads and BSJ reads used for circRNA identification. As shown in Fig. 2c, 69% (30% + 39%) of the RO-merged reads were derived from the 78% (27% + 51%) of circRNAs having BSJ read support ≥ 2, and more than half of these RO-merged reads also had a 3′ RO, indicating that they could directly generate full-length circRNA transcripts. The remaining RO-merged reads comprised the 22% of the circRNAs that were exclusively identified by the RO-based method, suggesting that the RO-based method offers a distinct advantage over previous BSJ-based methods, in which low-abundance circRNAs with limited BSJ read support are usually discarded by setting an arbitrary threshold (e.g., #BSJ reads > 2 or a higher cutoff). We further surveyed the base depth distribution of circRNAs and found that the RO reads produced a more uniform depth distribution along the normalized circRNA transcript than the BSJ reads (Fig. 2d). Based on the foregoing results, we conclude that the RO feature performs well in identifying low-abundance circRNAs and generating more uniform read distributions along circular transcripts. This improvement will greatly facilitate downstream reconstruction of full-length circRNAs.
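To make the RO feature concrete, the sketch below detects a 5′ reverse overlap in one read pair: after read 2 is reverse-complemented onto the strand of read 1, the start of read 1 should match the end of that mate sequence if the fragment wrapped around a circular template. The seed length, overlap length, and identity threshold are placeholder assumptions, and the function is an illustration of the idea rather than CIRI-full's implementation.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_reverse_overlap(read1: str, read2: str, seed_len: int = 12,
                          min_overlap: int = 20,
                          min_identity: float = 0.95) -> Optional[str]:
    """Return the RO-merged sequence, or None if no 5' reverse overlap."""
    if len(read1) < seed_len:
        return None
    mate = revcomp(read2)      # read 2 on the strand of read 1
    seed = read1[:seed_len]    # 5'-end seed of read 1
    pos = mate.find(seed)
    while pos != -1:
        overlap = len(mate) - pos  # the overlap must run to the mate's end
        if min_overlap <= overlap <= len(read1):
            matches = sum(a == b for a, b in zip(mate[pos:], read1[:overlap]))
            if matches / overlap >= min_identity:
                # Mate first, then the part of read 1 beyond the overlap.
                return mate + read1[overlap:]
        pos = mate.find(seed, pos + 1)
    return None
```

A read pair from a linear fragment of the same locus has no such match between the start of read 1 and the end of the mate, so the function returns `None`. Candidates that pass would still need the mapping-based verification described for the CIRI-full approach.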

The FSG-based algorithm accurately identifies and quantifies circular isoforms
To validate the quantification accuracy of the FSG-based algorithm, we simulated circRNA-containing transcriptomic datasets with read lengths varying from 100 to 300 bp and sequencing depths of 25–150 fold (Fig. 2e, f). We sought to evaluate the performance of this approach by comparing it with CIRI-AS. Therefore, all the circRNAs in these data sets were designed to possess two isoforms, a setting in which CIRI-AS can estimate the relative abundance of the two isoforms within a circRNA using the PSI values. We then compared the accuracy of the two tools by calculating the discrepancy between the predicted and real abundance of each isoform. As shown in Fig. 2e, both approaches achieved a high level of accuracy, especially for isoforms with an abundance over 50 fold. Moreover, the FSG-based quantification approach exhibited increased accuracy with increasing read length, relative to the splicing events-based method (Fig. 2f). For more complicated splicing patterns, we further designed a simulated dataset containing 994 circRNAs with three isoforms each, which were referred to as the major, medium, and minor isoforms according to their abundance. We ran CIRI-full on this dataset, detected circular isoforms within each circRNA, and determined their abundances. Among these simulated circRNAs, all three isoforms could be precisely recognized for 73% (726/994). For isoform quantification, an average of 79% of these isoforms could be correctly determined (Fig. 2g–i). We further used four circRNA data sets with biological replicates and two simulated datasets to evaluate the reliability of isoform quantification by CIRI-full. As shown in Additional file 1: Figure S11, 76.2–85.6% of moderately or highly expressed isoforms (#BSJ reads ≥ 30) in the real datasets, including human brain tissue, human liver tissue, and the Hs68 cell line, could be accurately quantified using CIRI-full. Similar results were obtained for the simulated datasets.

To experimentally evaluate our approach for quantifying circRNA isoforms, we performed real-time RT-PCR to validate eight randomly selected circRNAs with two or three isoforms from a transcriptomic data set of HeLa cells (Additional file 1: Figure S12–14). Each of these circRNAs was predicted to contain at least two isoforms. We designed 17 pairs of primers to amplify fragments containing both the BSJ and alternatively spliced cirexons and quantified their abundance using real-time RT-PCR. As shown in Fig. 2j and Additional file 1: Table S1, the abundances of circRNAs determined by CIRI-full and qRT-PCR show a high level of consistency, demonstrating the reliability of the FSG-based method for isoform-level circRNA quantification.
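The accuracy rate reported for the simulated three-isoform data set (defined in the legend of Fig. 2 as the share of isoforms that are fully reconstructed and whose predicted relative abundance is within 20% of the ground truth) can be sketched as follows. The sketch assumes that "20%" means an absolute difference of 0.20 in relative abundance and that an isoform missing from the prediction counts as not reconstructed. Both choices are interpretations, not statements from the paper.

```python
from typing import Dict, Optional


def accuracy_rate(truth: Dict[str, float],
                  predicted: Dict[str, Optional[float]],
                  tolerance: float = 0.20) -> float:
    """Fraction of true isoforms recovered with abundance within tolerance."""
    if not truth:
        raise ValueError("truth must contain at least one isoform")
    correct = 0
    for isoform, real in truth.items():
        pred = predicted.get(isoform)  # None if the isoform was not rebuilt
        if pred is not None and abs(pred - real) < tolerance:
            correct += 1
    return correct / len(truth)
```

For a circRNA with true relative abundances of 0.6, 0.3, and 0.1 (major, medium, minor) and predictions of 0.55 and 0.35 for the first two only, two of the three isoforms count as correct.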

Reconstruction of full-length circRNAs based on CIRI-full
To further investigate the utility of using both RO and BSJ features in circRNA reconstruction, we generated 9.2 Gb of sequencing data with RiboMinus + RNase R treatment from HeLa cells. CIRI-full was then employed to identify circRNAs and to reconstruct their full-length transcripts. As shown in Fig. 3a, 77.6% of the circRNAs were identified as full- or nearly full-length circular transcripts, indicating a high efficiency of CIRI-full in reconstructing circRNAs by combining RO and BSJ features. We further explored the length distribution of these reconstructed circRNAs and found that the majority were between 150 and 500 bp in length (Fig. 3b). Within this length interval, circRNAs can be well covered by long paired-end reads (e.g., PE250 or PE300), but above this length, it is difficult to recover their complete sequences.

Fig. 2 Performance evaluation of the RO approach to circRNA identification. a, b Performance comparison between the RO approach and the BSJ-based tools. “RO only” represents the circRNAs that are only identified by the RO approach. a CircRNA detection rate on the four data sets with different circRNA depths (2×, 5×, 10×, and 15×). b CircRNA detection rate on the four data sets with different read lengths (PE75, PE100, PE150, and PE200). c Composition of circular RNAs and circular RNA reads detected by RO in simulated data (5×, paired-end 200 bp). d Base depth distribution of BSJ reads (pink) and RO reads (green) on normalized circular RNAs. e, f Accuracy evaluation of the AS events-based and the FSG-based quantification algorithms (CIRI-AS vs. CIRI-full) using simulated circRNA-containing transcriptomic data sets with different sequencing depths (e) and different read lengths (f). g Sensitivity evaluation of the FSG-based quantification algorithm on simulated circRNA-containing transcriptomic data sets, where each circRNA contains three isoforms with different abundance. The bar plot on the right top displays the number of isoforms detected in 994 circRNAs; the bar plot on the right bottom shows the accuracy of FSG quantification in three types of isoforms. Accuracy rate is defined as the percentage of isoforms that are fully reconstructed and of which the predicted relative abundance matches the ground truth (difference between them is smaller than 20%). h, i The accuracy distribution of the FSG method on the three types of reconstructed isoforms. j Experimental validation of the FSG-based isoform quantification algorithm. The x- and y-axes represent the relative abundance of circRNA isoforms determined by qPCR and by the FSG-based algorithm, respectively. Each dot represents a circRNA isoform, and dots in the same color come from the same circRNA


The lengths of the unreconstructed circRNAs were estimated by summing their potential cirexons; the results indicated that their lengths ranged from 750 to 1250 bp (Additional file 1: Figure S14A). These circRNAs represented only one quarter of the total number of identified circRNAs. By examining the length distribution and the expression levels of full-length circRNAs in more detail, we found that the circular transcripts reconstructed by the RO and BSJ features exhibited distinct patterns. Specifically, the BSJ feature favored long and highly transcribed circRNAs, whereas the RO feature preferentially identified short and low-abundance circRNAs, especially those supported by only one RO read (Fig. 3c and Additional file 1: Figure S14A). This finding indicates that either feature alone, RO or BSJ, may not be sufficient to recover all full-length circRNAs.

Fig. 3 Full-length circRNA reconstruction of HeLa cell line (a–c) and human brain (d–f) transcriptomes with RNase R + RiboMinus treatment. a, d CircRNAs reconstructed using both the RO and BSJ features. Completely reconstructed circRNAs are shown in blue-lined ovals. Nearly complete (≥ 80%) and partial (< 80%) circRNAs are shown in orange and gray, respectively. b, e Length distribution of reconstructed circRNAs in the HeLa cell line (9.2 Gb) and in human whole brain tissue (16.2 Gb). Complete, nearly complete, and partial circRNAs are shown in blue, orange, and gray, respectively. The length of partially reconstructed circRNAs was estimated based on supported BSJ/RO reads and sequencing depth in the RNase R-treated sample. c, f Expression levels of different categories of circRNA. g–j Performance of the RO feature in circRNA identification when applied to fragmented or low-quality RNA samples. The green bar indicates the RNA data set derived from high-quality RNA (RIN = 10) without manual fragmentation. The yellow bar indicates the RNA data set derived from high-quality RNA (RIN = 10) with manual fragmentation. The red bar indicates the RNA data set derived from low-quality RNA (RIN = 5) with manual fragmentation. g, h Comparison of the ratio of RO reads to total circRNA reads for circRNAs of different lengths. ‘**’ and ‘*’ represent P < 0.01 and P < 0.05, respectively (Mann–Whitney U test). i, j Ratio of the number of circular RNAs with RO reads to total circular RNAs of a given length. k Comparison of full-length circRNA structure and corresponding annotated exon regions in human brain tissue. The 150 most highly expressed circRNAs that were completely reconstructed are shown. Each line represents a circRNA with normalized length

For further validation, we generated 16.2 Gb of sequencing data from human brain samples based on RiboMinus RNA sequencing with RNase R treatment and ran CIRI-full on this dataset. Most of the circRNAs identified in this dataset were full- or nearly full-length (Fig. 3d). The number of circRNAs in the brain dataset was greater than that in HeLa cells when the sizes of the data sets were normalized. The length distribution of full-length circRNAs and the recognition patterns of the RO and BSJ features were similar to those identified in HeLa cells (Fig. 3e, f and Additional file 1: Figure S14B). Notably, a large majority of cirexons that were specifically identified by RO reads were enriched in both RiboMinus-treated and RiboMinus/RNase R-treated samples as compared to the poly(A) enrichment sample (Additional file 1: Figure S15).

To verify the reconstructed circRNAs based on RO features, both computational and experimental approaches were employed. Firstly, over 80% of circRNAs detected by the RO-based method could be supported by at least one BSJ-based method (Additional file 1: Figure S16A). The remaining circRNAs solely detected by the RO-based method also exhibited a typical reverse mapping signature when aligned to the reference genome (Additional file 1: Figure S16B). Secondly, we performed experimental validation by randomly selecting 15 circRNA loci in the HeLa sample; 3, 6, and 6 of these loci were highly, moderately, and weakly transcribed, respectively. Outward-facing primers were designed to amplify fragments containing BSJs, and the sequences of the PCR products were determined via Sanger sequencing. As shown in Additional file 1: Figure S17–19, all 15 of the loci were successfully validated. For additional validation, six predicted full-length circRNAs were randomly selected and validated using the same approach. As shown in Additional file 1: Figure S20, all of these predicted full-length circRNAs were successfully verified experimentally. This evidence demonstrates the reliability of CIRI-full in reconstructing full-length circRNAs.

To measure the robustness of using RO and BSJ features for circRNA reconstruction, RNA-seq libraries were constructed using both high-quality RNA (RNA Integrity Number, RIN = 10) and degraded RNA (RIN = 5) of HeLa cells treated with RNase R and RiboMinus. For the high-quality RNA sample, two different methods (with or without fragmentation) were used to construct RNA-seq libraries. For these three libraries, 9.2, 5.3, and 11.6 Gb of sequencing data were generated, respectively.

We employed CIRI-full to detect RO and BSJ reads from these data; the ratio of RO reads to the sum of RO and BSJ reads was then calculated. As expected, compared with the unfragmented library, the RO read ratio decreased considerably in the fragmented and low-quality libraries across all the compared distribution levels (Fig. 3g, h). In particular, the low-quality library, in which the circRNAs suffered from degradation, exhibited the lowest RO read ratio. We next investigated whether the number of RO-identified circRNAs was also affected (Additional file 1: Figure S21). To this end, we ran CIRI-full on each dataset and calculated the ratio of RO-identified circRNAs to the sum of RO- and BSJ-identified circRNAs. We found that although the ratio of RO reads decreased, the number of RO-identified circRNAs was unaffected by RNA fragmentation (Fig. 3i) and only weakly influenced by RNA degradation (Fig. 3j). These findings further confirm that the RO feature can be used to efficiently detect low-abundance circRNAs even if most of the RO reads are degraded or fragmented.

In addition to its high sensitivity and robustness in circRNA identification, CIRI-full also offers high accuracy for exploring detailed internal components within circRNAs. In contrast, previous studies simply combined all known or aligned mRNA exons in sequential order as putative full-length circRNAs (hereafter referred to as the reference-based method). We measured the accuracy of full-length circRNAs predicted using the reference-based method by aligning them with the circular transcripts reconstructed using CIRI-full. As shown in Fig. 3k, only 34% of the full-length circRNAs in human brain predicted by the reference-based method perfectly matched the circular transcripts reconstructed using CIRI-full, suggesting the former method's low accuracy. The errors can be classified into three categories: false cirexons, missing ICFs, and a mixture of the two (Additional file 1: Figure S22). False cirexons, representing the insertion of false additional exons, accounted for 23% of the predicted circRNAs, and the average number of false exons per circRNA was 2.4. Moreover, 21% of the errors were identified as missing ICFs, referring to missing intronic/intergenic circular fragments; these included an average of 25.5% of the full-length circular transcripts. A mixture of both false cirexons and missing ICFs accounted for up to 22% of the errors. A similar error rate was consistently observed in HeLa cells and mouse brain samples (Additional file 1: Figure S23), strongly indicating that the reference-based method is error-prone and not reliable for resolving the internal structure of circRNAs. Compared with previous approaches, which focused on the determination of internal sequence or alternative splicing events, CIRI-full exhibited high efficiency in reconstructing circular

Zheng et al. Genome Medicine (2019) 11:2 Page 12 of 20


transcripts using the combination of RO and BSJ features (Additional file 1: Figure S24).
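The three error categories described above can be illustrated with a simple set comparison. The sketch below is a minimal illustration, not CIRI-full's own implementation: it assumes each transcript is represented as a set of (start, end) segments, and the function name `classify_prediction` and all coordinates are hypothetical.

```python
def classify_prediction(predicted, reconstructed):
    """Compare a reference-based prediction with a reconstructed circRNA.

    Both arguments are sets of (start, end) segments (an assumed
    representation). Segments found only in the prediction are treated as
    false cirexons; segments found only in the reconstruction are treated
    as missing fragments (ICFs in the paper's terminology).
    """
    false_segments = predicted - reconstructed
    missing_segments = reconstructed - predicted
    if not false_segments and not missing_segments:
        return "perfect match"
    if false_segments and missing_segments:
        return "mixture"
    return "false cirexons" if false_segments else "missing ICFs"


# Hypothetical coordinates, for illustration only.
reconstructed = {(100, 200), (300, 400), (450, 480)}
predicted = {(100, 200), (250, 280), (300, 400)}
print(classify_prediction(predicted, reconstructed))  # mixture
```

Exact coordinate matching is the simplest possible criterion; a real comparison would probably need to tolerate small boundary differences.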

Profiling full-length circular RNAs in vertebrate brains

Evolutionary analysis is essential for insights into the genetic basis of phenotypes and into functional screening. For circRNAs, such analyses remain scarce despite growing attention to these circular transcripts. With this goal in mind, we determined circRNA repertoires in brain samples of six vertebrate species, including human, macaque, mouse, rat, rabbit, and chicken. For each species, a whole-brain sample was sequenced using RNA-seq with RiboMinus/RNase R treatment, poly(A) enrichment, and RiboMinus treatment (Fig. 4a). Then, we applied CIRI-full to the RiboMinus/RNase R-treated transcriptomic data to comprehensively explore the circular transcripts for each species, including the numbers of circRNAs, cirexons, BSJ reads, ICFs, full-length circRNAs, and RO reads. HISAT and StringTie [41] were run on the poly(A) enrichment data to obtain linear transcript abundance. We identified approximately 3500 to 11,500 full-length circRNAs in these organisms. Analysis of the lengths of these full-length circular transcripts revealed that circRNA length was highly conserved among these species, with the majority of circRNAs ranging from 250 to 500 bp in length (Fig. 4a). Next, we measured the exon boundary conservation of orthologous circular and linear transcripts between pairs of closely related species: human and macaque, and mouse and rat. Considering that most of the cirexons in circRNAs are identical to those in linear mRNAs, only the ICFs that are exclusively present in circRNAs were used for boundary conservation analysis. As shown in Additional file 1: Figure S25, lncRNA exon boundaries exhibited larger and more frequent changes across mammals than did protein-coding exons. Interestingly, the boundary conservation level of ICFs was similar to that of protein-coding exons. Consequently, compared with lncRNAs, circRNAs exhibit more constraint with respect to maintaining the exact position of splicing events among orthologous pairs. We further counted the number of shared circRNAs and mRNAs in these two pairs of species and found that circRNAs exhibited significantly decreased conservation compared with protein-coding genes (Fig. 4b); only a small subset of orthologous circRNAs were conserved between closely related species. For example, approximately 23.5% of human circRNAs were also expressed in macaque and 21.4% of mouse circRNAs were also expressed in rat, whereas more than 76% of protein-coding genes were expressed in both members of these two species pairs.

We further examined the orthologous circRNAs present in these six vertebrates and calculated the number of orthologous circRNAs that possessed the same sequences and BSJs and the number of shared genes on their ancestral nodes (denoted by “a”, “b”, “c”, and “d”). As shown in Fig. 4a, the number of shared full-length orthologous circRNAs and BSJs decreased rapidly with increased genetic distance. In contrast, the number of shared orthologous circRNAs decreased much more slowly, indicating that although derived from the same genes, the circRNAs of different species diverged rapidly in terms of sequences and BSJs. Next, the expression levels of orthologous circRNAs and genes on each node were estimated. As shown in Fig. 4c, the shared orthologous circRNAs exhibited increased levels of expression compared with lineage-specific circRNAs, whereas this scenario was not observed in the shared orthologous genes from which the circRNAs were derived. These findings suggest that there is a distinct evolutionary conservation pattern of orthologous circRNAs and that these shared circRNAs provide valuable targets for further functional screening. We next surveyed circRNA AS events in the six species. All four types of AS events could be detected within circRNAs in all these species (Fig. 4d). Exon skipping (ES) was the most prevalent AS type in circRNAs. Alternative 3′ splice site (A3SS) and alternative 5′ splice site (A5SS) were also major circular AS types, in agreement with our previous study showing that AS events not only occur in mRNAs but are also prevalent in circRNAs [10]. We then investigated the expression level of conserved circRNA isoforms in these species. As shown in Fig. 4e, conserved circRNA isoforms exhibited similar splicing patterns in closely related species. For example, the expression and splicing patterns of these circRNAs were more similar between human and macaque than between human and other species.

Read length is a key determinant of circRNA identification but not quantification

Considering that most publicly available RNA-seq data sets are generated for linear transcripts, which typically have short read lengths and contain a very limited fraction of RO reads, one may question whether the FSG quantification method works on short sequencing reads. To test this, we truncated the 250-bp paired-end reads from the human brain RNA-seq data set (RNase R + RiboMinus treatment) to 100-bp paired-end reads and compared the performance of CIRI-full on the data sets with long (PE250) and short (PE100) sequencing reads. As shown in Fig. 5a, although most of the highly expressed circRNAs (BSJ reads ≥ 20) could be successfully identified in both data sets, the number of identified circRNAs transcribed at low levels decreased rapidly after truncation into short reads. This scenario was also observed for the assembled circular isoforms in terms of number and length (Fig. 5b). In


particular, fewer than half of the full-length circRNA isoforms with length > 300 bp were reconstructed by the FSG-based method. It has been demonstrated that the number of assembled full-length linear transcripts can benefit from increasing sequencing depth [43]. We therefore investigated whether this holds true for circRNA isoforms. We classified the identified circRNAs into five categories according to their expression level and subsequently calculated the number of full-length circRNA isoforms in each category. Strikingly, the recovery of full-length circRNA isoforms did not improve considerably with increasing sequencing depth (Fig. 5c), indicating that read length rather than sequencing depth is the key determinant of circRNA isoform detection. This finding further raised the question of whether read length affects the accuracy of circRNA isoform quantification. We checked the relative abundance difference of the shared circRNA isoforms in these two data sets (PE250 and PE100) and found a high level of concordance, especially for highly expressed circRNAs (Fig. 5d). This observation demonstrates that the FSG-based method is efficient in quantifying circRNA isoforms even with short read length, but researchers can recognize more circRNA isoforms by increasing the sequencing read length.
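The read truncation used to derive the PE100 data set from the PE250 data set can be sketched as below. This is a minimal illustration under stated assumptions: the text does not say which tool was used or which end was trimmed, so the sketch keeps the first `length` bases of each read, assumes unwrapped four-line FASTQ records, and uses the hypothetical helper name `truncate_fastq`. Each mate file would be processed the same way.

```python
def truncate_fastq(lines, length=100):
    """Yield FASTQ lines with sequence and quality trimmed to `length`.

    Assumes plain four-line records: header, sequence, '+', quality.
    Keeping the 5' end of each read is an assumption of this sketch.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        # Within each record, line 2 is the sequence and line 4 the quality.
        if index % 4 in (1, 3):
            line = line[:length]
        yield line


# Toy record truncated to 4 bases for readability.
record = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIHHHH\n"]
print(list(truncate_fastq(record, length=4)))
# ['@read1', 'ACGT', '+', 'IIII']
```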

Isoform-level quantification helps filter false positives in differential circRNA expression analysis

Unlike for mRNA transcripts, current differential expression analysis of circRNAs is limited to the BSJ level because isoforms sharing a given BSJ cannot be distinguished (Fig. 6a, top). Using our FSG-based isoform quantification algorithm,

[Fig. 4 graphic; text recovered from the panels. Species on the tree: Human, Macaque, Mouse, Rat, Rabbit, Chicken. Tree node values: # circRNAs sharing both BSJ and internal sequence / # circRNAs sharing the same BSJ / # orthologous genes that can express circRNAs. Panel a table columns: RNase R (Gb), Poly A (Gb), #circRNA, #Cirexon, #ICF, #Full-length circRNAs, length distribution. Table rows as extracted:
16.2  6.3  18,397  33,353  8,018  11,497
18.1  3.7  10,545  16,268  5,580  6,315
14.8  5.3  15,013  31,226  6,666  9,847
15.9  4.6  10,395  18,144  5,151  6,171
18.7  3.7  13,675  25,397  7,355  9,160
12.6  6.0  6,536  11,643  3,741  3,881
Panel c axes: BSJ count per ten million reads; TPM. Panel d categories: exon skipping, alternative splice site, intron retention. Panel e scale: relative abundance, 0% to 100%.]

Fig. 4 CircRNA expression profiles in vertebrate brain tissues. a CircRNAs identified in six vertebrate brain tissues (RNase R + RiboMinus treatment) by CIRI-full. The number of shared circRNAs is shown on the phylogenetic tree. The table on the right shows the RNA-seq data set size and the numbers of identified circRNAs, cirexons, intronic/intergenic circRNA fragments (ICFs), and full-length circRNAs. The histogram on the right shows the length distribution of reconstructed circRNAs; blue, orange, and gray represent complete, nearly complete, and partial circRNAs, respectively. b Overlap of highly expressed mRNAs and circRNAs in closely related species. mRNA expression is more conserved than circRNA expression in closely related species (human vs. macaque, mouse vs. rat). c Expression levels of circRNAs and their corresponding mRNA genes. a, b, c, and d represent the four ancestral nodes, as shown in panel a. Species-specific circRNAs in four species (shown in blue) have much lower expression levels than the shared circRNAs present in ancestral nodes. d Percentage of circRNAs (BSJ ≥ 10 reads) containing four types of alternative splicing events. e Expression profiles of circRNA isoforms in the six species. The relative abundances of circRNA isoforms were normalized between 0 and 1


however, it is possible to distinguish differentially expressed isoforms among circRNAs with the same BSJ (Fig. 6a, bottom). To explore the difference in expression patterns between the BSJ and isoform levels, we applied CIRI-full to 40 RNA-seq data sets of HCC tumor tissues and their adjacent normal tissues [42]. The sequencing data mapped to the reference genome varied from 10 to 27 Gb per sample, and the number of circRNAs identified from these data sets ranged from approximately 4000 to 14,000 (Fig. 6b). Notably, a small fraction of circRNAs in these 40 samples contained at least two isoforms. Moreover, the number of isoforms positively correlated with sequencing data size, suggesting that increasing sequencing depth and read length should facilitate the detection of isoforms within circRNAs (Fig. 6c).

To better understand the discrepancy between BSJ-level and isoform-level differential expression analysis, we extracted the top 1000 most abundant circRNAs that were expressed in at least 80% of the 40 samples and found that 778 of them could be fully reconstructed. Then, we investigated their expression changes between normal and tumor tissues at both the BSJ and isoform levels. Specifically, the Mann–Whitney U test was employed to calculate the significance of expression alteration, and the sign attached to each P value was used as a proxy for the direction of the expression change (Fig. 6d). Among these 778 circRNAs, 587 showed the same significance values in the BSJ-level and isoform-level differential expression analyses, because each of these circRNAs expressed only a single isoform; 66 and 230 of them were significantly up- and downregulated in tumor tissues, respectively (Fig. 6e). For the remaining 191 circRNAs that expressed multiple isoforms, the significance level of their differential expression was generally overestimated, because different isoforms of circRNAs with a given BSJ cannot be distinguished from each other if they are quantified at the BSJ level. After correction by isoform-level quantification, only 32% of them were still significantly upregulated in tumor samples. The same phenomenon was also observed in downregulated circRNAs, of which only 35% were retained after correction. Notably, a small number of circRNAs were recognized as differentially expressed isoforms only by the isoform-level quantification, indicating that other isoforms sharing the same BSJ may interfere with differential expression analysis based solely on BSJ-level quantification.

To further investigate whether there were alternative splicing isoform switches present between normal and

[Fig. 5 graphic; text recovered from the panels. Axis labels: #circRNAs (×1000); # BSJ reads; #circRNA isoforms (×1000); circRNA length (bp); #Isoforms (×1000); circRNA expression level; Density; Relative abundance difference. Legend: 250bp only; 100bp & 250bp.]

Fig. 5 Sequencing length affects circRNA identification but not quantification. Gray bars represent the circRNAs that could be identified from both datasets and red bars represent the circRNAs exclusively detected in the PE250 dataset. a The number of circRNAs detected by CIRI-full from the PE250 and PE100 datasets. b The length distribution of reconstructed circRNA isoforms. c The number of reconstructed isoforms with different expression levels. d The difference in relative expression levels of circRNA isoforms estimated from the PE250 and PE100 datasets. Different circRNA expression levels are shown in different colors. Two vertical dashed lines represent the threshold of relative abundance difference between PE100 and PE250 (± 0.2). The ratios in the panel represent the percentage of accurately quantified circRNA isoforms
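The concordance measure in panel d can be sketched as the fraction of shared isoforms whose relative abundance differs by no more than 0.2 between the two data sets. This is a minimal illustration, assuming relative abundances are stored as dictionaries keyed by isoform identifier and that the threshold is inclusive; the function name `fraction_concordant` and the values are hypothetical.

```python
def fraction_concordant(abundance_a, abundance_b, threshold=0.2):
    """Fraction of shared isoforms whose relative abundance differs by at
    most `threshold` between two data sets (e.g., PE250 and PE100)."""
    shared = abundance_a.keys() & abundance_b.keys()
    if not shared:
        return 0.0
    concordant = sum(
        1 for isoform in shared
        if abs(abundance_a[isoform] - abundance_b[isoform]) <= threshold
    )
    return concordant / len(shared)


# Hypothetical relative abundances, for illustration only.
pe250 = {"iso_A": 0.9, "iso_B": 0.1, "iso_C": 0.5}
pe100 = {"iso_A": 0.5, "iso_B": 0.2}
print(fraction_concordant(pe250, pe100))  # 0.5
```

Isoforms detected in only one data set are ignored here, which matches the text's focus on shared isoforms.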


tumor samples, we extracted the top 50 most highly expressed circRNAs with multiple isoforms and calculated the expression fold change for each isoform between tumor and normal samples. We found that 10 out of 50 circRNAs underwent isoform switches, in which striking expression changes occurred in the most abundant isoform (green and red circles, Fig. 6f). In contrast, BSJ-level differential expression analysis cannot distinguish such a scenario. For example, circRNA chr2:207144264|207162097, located in the ZDBF2 gene, could express four circular isoforms (Fig. 6g). The alignments of BSJ reads on the second and fourth cirexons, as well as the splicing events between the fourth and sixth exons, exhibited distinct read support between tumor and normal samples, indicating the existence of alternative splicing isoform switches. Indeed, the circular isoform with sequence

[Fig. 6 graphic; text recovered from the panels. Axis and panel labels: BSJ-level expression and isoform-level expression (Normal, Tumor); isoform relative abundance; BSJ number (×1000); mapped data size (Gb); isoform number; -log10(P) before correction; -log10(P) after correction; fold change (log2); normalized depth. Panel e groups: circRNAs with only one isoform (587) and circRNAs with multiple isoforms (191); up-regulated; down-regulated; before correction; after correction. Panel g isoform lengths: 477 nt, 470 nt, 334 nt, 367 nt. Panel h genes: VAPB, PLOD2, FAM120A, TRUB1, PGD4, METTL3, CHD9, EHBP1. Legend: BSJ-based; major isoform in both; major isoform in normal; major isoform in tumor; minor isoform.]

Fig. 6 Differential circRNA isoform expression between normal and tumor liver tissues of 20 HCC patients. a A schematic comparison between BSJ-level and isoform-level differential expression analysis. b The number of circRNAs detected from normal and tumor liver tissues of 20 HCC patients. Bars in light color represent the numbers of circRNAs in a given sample, and bars in dark color represent the numbers of highly expressed circRNAs (> 1 BSJ read per 10 million reads). c The number of circRNAs containing AS events that are detected by CIRI-AS. Light bars represent the circRNAs with one AS event and dark bars correspond to circRNAs containing more than one AS event. Black curved lines represent the total mapped data size for each sample. d Comparison of differential expression analysis between the BSJ level (x-axis) and the isoform level (y-axis). Each dot denotes a circRNA isoform, with its size representing the expression level and its color representing its relative abundance in the parental circRNA. e CircRNAs in panel d can be classified into circRNAs with only one isoform and circRNAs with multiple isoforms. For circRNAs with multiple isoforms, Venn diagrams show the discrepancies of significantly up- or downregulated isoforms between BSJ-level and isoform-level differential expression analyses. f The average isoform expression fold change between normal and tumor tissues of the top 50 most highly expressed circRNAs that have multiple isoforms. Black dots represent the average fold change of circRNAs at the BSJ-level quantification. The dashed box highlights an example shown in panel g. g An example of an alternative splicing switch between normal and tumor samples. CircRNA (chr2:207144264|207162097), located in the ZDBF2 gene, can express four isoforms. Rectangles with different colors represent the cirexons within this circRNA. The green and red histograms on each cirexon represent the normalized sequencing depth in normal and tumor samples, respectively. The curves connecting cirexons represent the forward splice junctions (FSJs) within this circRNA, and their width is proportional to the read support. The red curve represents the FSJ of the dominant circular isoform. The relative abundance of the four circular isoforms is shown in the bar plot (bottom). h Expression profiles of eight circRNAs quantified at the BSJ and isoform levels between normal (left) and tumor (right) tissues across 20 HCC patients. Red and cyan lines represent the expression profiles of the major and minor circRNA isoforms, respectively. All statistical significances are calculated by the Mann–Whitney U test. “**” and “*” represent P < 0.01 and P < 0.05. “n.s.” indicates “not significant”
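The signed significance values compared in panel d can be sketched as follows. This is a self-contained illustration rather than the authors' code: it computes a two-sided Mann–Whitney U test with the tie-corrected normal approximation (no continuity correction, which is an assumption of this sketch) and attaches a sign to -log10(P) according to the direction of change. The function names and expression values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided P) via the normal approximation."""
    n1, n2 = len(x), len(y)
    values = sorted(list(x) + list(y))
    rank_of = {}
    tie_term = 0.0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        # Tied values share the mean of ranks i+1 .. j.
        rank_of[values[i]] = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


def signed_log10_p(tumor, normal):
    """-log10(P), positive when tumor values tend to exceed normal values."""
    u_tumor, p = mann_whitney_u(tumor, normal)
    sign = 1.0 if u_tumor > len(tumor) * len(normal) / 2.0 else -1.0
    return sign * -math.log10(p)


# Hypothetical expression values, for illustration only.
tumor = [10, 11, 12, 13, 14]
normal = [1, 2, 3, 4, 5]
print(signed_log10_p(tumor, normal) > 0)  # True
```

For small sample sizes an exact test would be more appropriate than the normal approximation used here.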


length of 477 nt was the dominant isoform in normal samples, whereas its expression dropped sharply in tumor samples, where the expression of this circRNA was dominated by the circular isoform with a sequence length of 334 nt. Moreover, eight circRNAs are shown as examples to detail the expression changes at both the BSJ and isoform levels between normal and tumor tissues of 20 patients (Fig. 6h). Although four of these circRNAs were found to be differentially expressed by traditional BSJ-level quantification, that analysis could not identify which isoforms were actually differentially expressed. In addition, there were a few cases in which certain circular isoforms exhibited increased expression levels across tumor samples but were missed by the BSJ-level differential expression analysis. Collectively, these results highlight the limitation of BSJ-based quantification and the necessity of extending differential circRNA expression analysis to the isoform level.
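Why BSJ-level quantification can hide an isoform switch is easy to see in a toy calculation: the BSJ-level signal is the sum over all isoforms sharing that junction, so opposite changes can cancel. The sketch below is illustrative only; the function names, isoform labels, and expression values are hypothetical and do not come from the study's data.

```python
def major_isoform(expression):
    """Return the isoform with the highest expression in one condition."""
    return max(expression, key=expression.get)


def summarize_switch(normal, tumor):
    """Compare BSJ-level and isoform-level views of one circRNA.

    `normal` and `tumor` map isoform labels to mean expression; the
    BSJ-level value is taken as the sum over isoforms sharing the BSJ.
    """
    bsj_normal = sum(normal.values())
    bsj_tumor = sum(tumor.values())
    return {
        "bsj_fold_change": bsj_tumor / bsj_normal,
        "major_normal": major_isoform(normal),
        "major_tumor": major_isoform(tumor),
        "switched": major_isoform(normal) != major_isoform(tumor),
    }


# Hypothetical values: the two isoforms swap dominance.
normal = {"iso_A": 6.0, "iso_B": 2.0}
tumor = {"iso_A": 2.0, "iso_B": 6.0}
print(summarize_switch(normal, tumor))
# {'bsj_fold_change': 1.0, 'major_normal': 'iso_A', 'major_tumor': 'iso_B', 'switched': True}
```

In this toy case the BSJ-level fold change is exactly 1.0 even though the dominant isoform changes, which is the kind of case the isoform-level analysis is meant to recover.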

Discussion

This study presents a novel experimental and bioinformatic framework, CIRI-full, that can be used to efficiently reconstruct full-length circRNAs and quantify their expression at the isoform level from transcriptomic data. The main advantage of CIRI-full is that it utilizes an unfragmented library preparation approach to generate RO reads for circRNAs. To fully utilize these RO reads, we developed a new computational algorithm with the aim of reconstructing the complete sequences of circRNAs and the circular isoforms within them, and further proposed a new Forward Splice Graph (FSG)-based algorithm for isoform-level circRNA quantification. Through extensive evaluations of both simulated and real data sets from HeLa cells and from the human brain, we demonstrated that CIRI-full offers excellent performance in full-length circRNA reconstruction and isoform-level quantification. By applying this tool to brain samples of six vertebrate species, we demonstrated the reliability of CIRI-full in exploring comprehensive circRNA repertoires across multiple species and unveiling their evolutionary conservation and divergence. By further applying CIRI-full to human normal and tumor tissues, we systematically compared BSJ-level and isoform-level differential expression analyses. We have packaged CIRI-full with our previous tools (CIRI2 and CIRI-AS), and users can identify circRNAs, detect alternative splicing events, and reconstruct circular isoforms from transcriptomic data using a single command. The running time and peak memory usage of CIRI-full under different conditions are shown in Fig. 7. This tool will greatly accelerate our understanding of the diversity and function of circRNAs and thereby contribute to the field of circRNA studies.

Most approaches to circRNA identification rely exclusively on recognizing BSJs, and these methods are of

limited application in determining the internal structure of circRNAs. In this study, for the first time, we propose an RO feature-based method for circRNA detection. This new approach is an important addition to the BSJ feature-based method, with each approach having its own advantages and limitations. Compared with the BSJ feature-based method, the RO feature-based method has the following distinct advantages. First, the RO feature provides more solid evidence for identifying full-length circRNAs. This greatly facilitates genome-wide full-length circRNA identification and thereby offers an indispensable advantage for downstream analyses, including functional and evolutionary analyses. Second, the RO feature facilitates the detection of weakly transcribed circRNAs, and even extremely low-abundance circRNAs could be efficiently and accurately identified. For instance, in the HeLa cell dataset, 3887 circRNAs were identified; 14.4% of these were supported by only one read, and 92% were successfully identified by CIRI-full. Considering that a vast majority of circRNAs in various transcriptomes are expressed at low levels, the detection of such low-abundance transcripts is extremely important in exploring circRNA profiles. For example, when detecting circRNAs from RNA-seq data sets without RNase R treatment, most of the circRNAs are in low abundance compared with their linear counterparts. Therefore, the ability to identify and reconstruct low-abundance circRNAs is not trivial in circRNA studies. Third, compared with BSJ reads, RO reads produce a more uniform depth distribution along the normalized circRNA transcript, which greatly facilitates quantification of circRNAs and determination of their internal structures.

Compared with the BSJ feature, RO-based circRNA identification also has limitations. First, the RO-based approach requires longer reads to obtain an entire circRNA sequence. Considering that most circRNAs are between 200 and 800 bp in length, it should become easy to obtain RO reads as sequencing technology continues to advance. Second, a large library size is required to produce high-quality RO reads during library preparation.

Considering the potential significance of the biogenesis and functions of full-length circRNAs, several computational algorithms have been developed to determine the internal components of these circular transcripts. The reference-based prediction method has been revealed to be error-prone; up to 66% of the predicted full-length circRNAs in our study were demonstrated to contain errors, including false cirexons and missing ICFs. Although spliced junction signature-based methods such as CIRI-AS represent an important step forward by facilitating accurate cirexon identification, full-length prediction of circular isoforms with complicated AS events is still not feasible. An alternative approach


involves utilization of new sequencing technologies such as PacBio long read sequencing, which promises increases in read length of several orders of magnitude, thereby making full-length circRNA identification considerably easier. However, this is achieved at the expense of higher cost per base and lower throughput. Furthermore, it is not cost-effective to use PacBio long reads to obtain the complete sequences of circRNAs, especially considering that their lengths range from 250 to 800 bp. In contrast, CIRI-full generates high-throughput “long” reads that are sufficient for determining the complete sequences of most full-length circRNAs in an economical and practical manner. Because no additional library preparation is needed and the fragmentation step is bypassed, RO reads are easily obtained using general paired-end sequencing, greatly expanding the applicability of this method.

Besides circular transcript reconstruction, CIRI-full is also the first tool to provide isoform-level quantification for circRNAs. Previous studies revealed the prevalence of AS events within circRNAs and the importance of accurate quantification of circRNA expression to confirm their crucial functions in many biological processes. However, current state-of-the-art transcript quantification approaches largely focus on linear RNAs, for which they estimate expression abundance at both gene and transcript levels. Until now, no method has been available for isoform-level circRNA quantification. In this study, CIRI-full successfully determined the abundance of circRNAs at both BSJ and isoform levels by employing the FSG-based algorithm, which reconstructs all full-length isoforms within each assembled full-length circRNA. By applying this approach to 40 RNA-seq data sets of human HCC tumor tissues and their adjacent normal tissues, we identified thousands of circRNAs, most of which were assembled into full length, and we systematically explored the discrepancy between BSJ-level and isoform-level differential expression analysis. We found that the significance level for differential expression of most circRNAs that expressed multiple isoforms was generally overestimated at the BSJ level. A large majority of differentially expressed circRNAs measured at the BSJ level tended to be false positives, as BSJ-level quantification cannot distinguish among different isoforms within a given circRNA. Consequently, for circRNAs with multiple isoforms, BSJ-level quantification does not necessarily reflect the real expression changes of certain circular isoforms between normal and tumor samples. These findings not only provide more accurate candidates for functional screening, but also unveil the complexity of circRNA isoform expression.

Conclusion

This study, for the first time, presents a high-throughput approach, CIRI-full, that employs a new feature for full-length circRNA reconstruction and isoform-level quantification. Extensive evaluations demonstrate that CIRI-full exhibits excellent performance in circRNA identification and whole-sequence assembly, as well as isoform reconstruction and quantification. We applied CIRI-full to investigate the evolutionary conservation of

Fig. 7 Time and memory usage of CIRI-full on human brain RNA-seq data sets. The CIRI-full pipeline consists of two components, BSJ detection using CIRI2/CIRI-AS and RO detection, which are executed simultaneously. The height of each box represents the running time of the corresponding module in this pipeline. The option “-t 5” was used for the last four data sets to activate the multithreading function of CIRI2 and BWA

Zheng et al. Genome Medicine (2019) 11:2 Page 18 of 20


circRNAs in brain transcriptomes across six vertebrates. In addition, we systematically compared the difference between the BSJ-level and isoform-level differential expression analyses using human liver tumor and normal tissues and found that a large majority of differentially expressed circRNAs measured at the BSJ level tended to be false positives. For circRNAs with multiple isoforms, isoform-level quantification instead of BSJ-level quantification can reflect the real expression changes of certain circular isoforms between normal and tumor samples. This study provides an indispensable approach for circRNA transcript reconstruction and quantification and highlights the necessity of deepening circRNA studies to the isoform-level resolution.

Availability and requirements
The availability and requirements are listed as follows:
Project name: CIRI-full
Project home page: https://sourceforge.net/projects/ciri
Operating system(s): Linux, Mac
Programming language: Java, Perl

Additional file

Additional file 1: Figure S1. Workflows of the RO detection method. Figure S2. All possible scenarios on the presence of RO and BSJ in RNA-seq reads. Figure S3. 5′ RO candidate read mapping and filtering steps. Figure S4. Full-length circRNA reconstruction. Figure S5. CircRNAs that cannot be reconstructed into full length. Figure S6. Workflow of the adapted DFS method in the FSG algorithm. Figure S7. Workflow of approximate exhaustive search in the FSG algorithm. Figure S8. Characteristics of simulated data sets. Figure S9. Abundance distributions of circRNAs in simulated data sets. Figure S10. False discovery rate of different circRNA detection tools. Figure S11. CircRNA quantification in four real RNA-seq data sets with biological replicates and two simulated data sets. Figure S12. Verified circRNA structure and their predicted relative abundance in total RNA-seq data set (PE100). Figure S13. Verified circRNA structure and their predicted relative abundance in total RNA-seq data set (PE250). Figure S14. Length of circRNAs in human HeLa cell line. Figure S15. Cirexon enrichment rate. Figure S16. RO feature is reliable in detecting lowly expressed circRNAs. Figure S17. Experimental validations of the RO method on detecting highly expressed circRNAs (# BSJ reads ≥ 30) in HeLa cell line. Figure S18. Experimental validations on moderately expressed circRNAs (# BSJ reads ≥ 10 & < 30). Figure S19. Experimental validations on weakly expressed circRNAs (# BSJ reads < 10). Figure S20. Experimental validations on reconstructing full-length circRNAs in HeLa cell line. Figure S21. RNA degradation and fragmentation can reduce the abundance of RO reads. Figure S22. Four examples of circRNAs in Fig. 3k. Figure S23. Comparison of full-length circRNA structure and corresponding annotated exon regions in human HeLa cell line (A) and mouse brain tissue (B). Figure S24. Performance comparison between CIRI-Full and CIRCexplorer2. Figure S25. Boundary conservation of orthologous exons in mRNA, circRNA and lincRNA. Table S1. RT-PCR and CIRI-full quantification results. (PDF 4930 kb)

Acknowledgements
We are grateful to National Science Foundation of China, Chinese Academy of Sciences for the support. We also thank the Sequencing and Computing Facilities at Beijing Institute of Life Sciences, Chinese Academy of Sciences that supported this work.

Funding
This work was supported by NSFC grants (31722031, 91640117, 91531306 and 31701148), National Key R&D Program (2018YFC0910400), Beijing Natural Science Foundation (JQ18020) and the Strategic Priority Research Program of the Chinese Academy of Sciences [XDB13000000].

Availability of data and materials
The sequencing data generated in this study were deposited in SRA under the project ID PRJNA475651 [38].

Authors’ contributions
FZ conceived the study and proposed the method. YZ implemented the algorithm and designed the software. YZ and PJ analyzed the data. SC and LH performed sequencing and experimental validations. PJ and YZ drafted the manuscript, and FZ revised the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China. 2 University of Chinese Academy of Sciences, Beijing 100049, China. 3 Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China.

Received: 3 July 2018 Accepted: 10 January 2019

References
1. Ashwal-Fluss R, Meyer M, Pamudurti NR, Ivanov A, Bartok O, Hanan M, Evantal N, Memczak S, Rajewsky N, Kadener S. circRNA biogenesis competes with pre-mRNA splicing. Mol Cell. 2014;56:55–66.

2. Guo JU, Agarwal V, Guo H, Bartel DP. Expanded identification and characterization of mammalian circular RNAs. Genome Biol. 2014;15:409.

3. Hansen TB, Jensen TI, Clausen BH, Bramsen JB, Finsen B, Damgaard CK, Kjems J. Natural RNA circles function as efficient microRNA sponges. Nature. 2013;495:384–8.

4. Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495:333–8.

5. Jeck WR, Sorrentino JA, Wang K, Slevin MK, Burd CE, Liu J, Marzluff WF, Sharpless NE. Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA. 2013;19:141–57.

6. Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One. 2012;7:e30733.

7. Salzman J, Chen RE, Olsen MN, Wang PL, Brown PO. Cell-type specific features of circular RNA expression. PLoS Genet. 2013;9:e1003777.

8. Gao Y, Wang J, Zhao F. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol. 2015;16:4.

9. Li Z, Huang C, Bao C, Chen L, Lin M, Wang X, Zhong G, Yu B, Hu W, Dai L, et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nat Struct Mol Biol. 2015;22:256–64.

10. Gao Y, Wang J, Zheng Y, Zhang J, Chen S, Zhao F. Comprehensive identification of internal structure and alternative splicing events in circular RNAs. Nat Commun. 2016;7:12060.

11. Chen YG, Satpathy AT, Chang HY. Gene regulation in the immune system by long noncoding RNAs. Nat Immunol. 2017;18:962–72.

12. Kristensen LS, Hansen TB, Veno MT, Kjems J. Circular RNAs in cancer: opportunities and challenges in the field. Oncogene. 2017;37:555–65.


13. Li X, Liu CX, Xue W, Zhang Y, Jiang S, Yin QF, Wei J, Yao RW, Yang L, Chen LL. Coordinated circRNA biogenesis and function with NF90/NF110 in viral infection. Mol Cell. 2017;67:214–227 e217.

14. Salta E, De Strooper B. Noncoding RNAs in neurodegeneration. Nat Rev Neurosci. 2017;18:627–40.

15. Fei T, Chen Y, Xiao T, Li W, Cato L, Zhang P, Cotter MB, Bowden M, Lis RT, Zhao SG, et al. Genome-wide CRISPR screen identifies HNRNPL as a prostate cancer dependency regulating RNA splicing. Proc Natl Acad Sci U S A. 2017;114:E5207–15.

16. Hirsch S, Blatte TJ, Grasedieck S, Cocciardi S, Rouhi A, Jongen-Lavrencic M, Paschka P, Kronke J, Gaidzik VI, Dohner H, et al. Circular RNAs of the nucleophosmin (NPM1) gene in acute myeloid leukemia. Haematologica. 2017;102:2039–47.

17. Legnini I, Di Timoteo G, Rossi F, Morlando M, Briganti F, Sthandier O, Fatica A, Santini T, Andronache A, Wade M, et al. Circ-ZNF609 is a circular RNA that can be translated and functions in myogenesis. Mol Cell. 2017;66:22–37 e29.

18. Piwecka M, Glazar P, Hernandez-Miranda LR, Memczak S, Wolf SA, Rybak-Wolf A, Filipchyk A, Klironomos F, Cerda Jara CA, Fenske P, et al. Loss of a mammalian circular RNA locus causes miRNA deregulation and affects brain function. Science. 2017;357.

19. Yang Y, Gao X, Zhang M, Yan S, Sun C, Xiao F, Huang N, Yang X, Zhao K, Zhou H, et al. Novel role of FBXW7 circular RNA in repressing glioma tumorigenesis. J Natl Cancer Inst. 2018;110.

20. Yu CY, Li TC, Wu YY, Yeh CH, Chiang W, Chuang CY, Kuo HC. The circular RNA circBIRC6 participates in the molecular circuitry controlling human pluripotency. Nat Commun. 2017;8:1149.

21. Zheng Q, Bao C, Guo W, Li S, Chen J, Chen B, Luo Y, Lyu D, Li Y, Shi G, et al. Circular RNA profiling reveals an abundant circHIPK3 that regulates cell growth by sponging multiple miRNAs. Nat Commun. 2016;7:11215.

22. Yang Y, Fan X, Mao M, Song X, Wu P, Zhang Y, Jin Y, Yang Y, Chen LL, Wang Y, et al. Extensive translation of circular RNAs driven by N(6)-methyladenosine. Cell Res. 2017;27:626–41.

23. Zhou C, Molinie B, Daneshvar K, Pondick JV, Wang J, Van Wittenberghe N, Xing Y, Giallourakis CC, Mullen AC. Genome-wide maps of m6A circRNAs identify widespread and cell-type-specific methylation patterns that are distinct from mRNAs. Cell Rep. 2017;20:2262–76.

24. Rybak-Wolf A, Stottmeister C, Glazar P, Jens M, Pino N, Giusti S, Hanan M, Behm M, Bartok O, Ashwal-Fluss R, et al. Circular RNAs in the mammalian brain are highly abundant, conserved, and dynamically expressed. Mol Cell. 2015;58:870–85.

25. Zhang XO, Wang HB, Zhang Y, Lu X, Chen LL, Yang L. Complementary sequence-mediated exon circularization. Cell. 2014;159:134–47.

26. Zhang XO, Dong R, Zhang Y, Zhang JL, Luo Z, Zhang J, Chen LL, Yang L. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res. 2016;26:1277–87.

27. Metge F, Czaja-Hasse LF, Reinhardt R, Dieterich C. FUCHS-towards full circular RNA characterization using RNAseq. PeerJ. 2017;5:e2934.

28. Ye CY, Zhang X, Chu Q, Liu C, Yu Y, Jiang W, Zhu QH, Fan L, Guo L. Full-length sequence assembly reveals circular RNAs with diverse non-GT/AG splicing signals in rice. RNA Biol. 2017;14:1055–63.

29. Gao Y, Zhao F. Computational strategies for exploring circular RNAs. Trends Genet. 2018;34:389–400.

30. Gao Y, Zhang J, Zhao F. Circular RNA identification based on multiple seed matching. Brief Bioinform. 2018;19:803–10.

31. Zeng X, Lin W, Guo M, Zou Q. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput Biol. 2017;13:e1005420.

32. Wang J, Liu K, Liu Y, Lv Q, Zhang F, Wang H. Evaluating the bias of circRNA predictions from total RNA-Seq data. Oncotarget. 2017;8:110914–21.

33. Hansen TB. Improved circRNA identification by combining prediction algorithms. Front Cell Dev Biol. 2018;6:20.

34. Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA. Splicing graphs and EST assembly problem. Bioinformatics. 2002;18(Suppl 1):S181–8.

35. Feng J, Li W, Jiang T. Inference of isoforms from short sequence reads. J Comput Biol. 2011;18:305–21.

36. Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci U S A. 2011;108:19867–72.

37. Szabo L, Morey R, Palpant NJ, Wang PL, Afari N, Jiang C, Parast MM, Murry CE, Laurent LC, Salzman J. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol. 2015;16:126.

38. Zheng Y, Ji P, Chen S, Hou L, Zhao F. RNA-seq data sets from human brain and HeLa cell line. 2018. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA475651/

39. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357–60.

40. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5.

41. Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016;11:1650–67.

42. Yang Y, Chen L, Gu J, Zhang H, Yuan J, Lian Q, Lv G, Wang S, Wu Y, Yang Y-CT, et al. Recurrently deregulated lncRNAs in hepatocellular carcinoma. Nat Commun. 2017;8:14421.

43. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8:1494.
