Download - Transcriptome reconstruction from RNA-seq Transcriptome ...

Transcript
Page 1: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

1

Transcriptome reconstruction from RNA-seq

Transcriptome assembly

Strategies:1.  De-novo assembly: no reference genome available

(Trinity, Bridger, Abyss, IDBA-trans)2.  Reference-based assembly: a reference genome is

available (Cufflinks, StringTie, iReckon)3.  Annotation-guided assembly: use known annotations

to improve the annotation process (Cufflinks, StringTie)

Transcriptome assembly

Goal: from read alignments to transcripts/splice graphs

Spliced alignments Ungapped alignments

Challenges

Short Reads

Page 2: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

2

Challenges

Short ReadsAmbiguous Origins

Challenges

Short ReadsAmbiguous Origins

Variable Coverage

Challenges

Short ReadsAmbiguous Origins

Variable Coverage

Challenges

Short ReadsAmbiguous Origins

Variable Coverage

Page 3: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

3

Challenges

Short ReadsAmbiguous Origins

Variable Coverage

Localized Evidence

0CVWTG�4GXKGYU�^�)GPGVKEU

F��#UUGODNGF�KUQHQTOU

C��5RNKEG�CNKIP�TGCFU�VQ�VJG�IGPQOG

D��$WKNF�C�ITCRJ�TGRTGUGPVKPI�CNVGTPCVKXG�URNKEKPI�GXGPVU

E��6TCXGTUG�VJG�ITCRJ�VQ�CUUGODNG�XCTKCPVU

��������MD ��������MD ��������MD ��������MD ��������MD

Figure 2 | Overview of the reference-based transcriptome assembly strategy. The steps of the reference-based transcriptome strategy are shown using an example of a maize gene (GRMZM2G060216). a | Reads (grey) are first splice-aligned to a reference genome. b | A connectivity or splice graph is then constructed to represent all possible isoforms at a locus. c,d | Finally, alternative paths through the graph (blue, red, yellow and green) are followed to join compatible reads together into isoforms.

Applications. Reference-based transcriptome assembly is easier to perform for the simple transcriptomes of bac-terial, archaeal and lower eukaryotic organisms, as these organisms have few introns and little alternative splicing. Transcription boundaries can be inferred from regions of contiguous read coverage in the genome even with-out graph construction and traversal37,51,52. Alternative transcription start and stop sites can also be inferred based on the 5′ cap or poly(A) signals (if cap- or end-specific experimental protocols are used)51,53. However, complications arise owing to the gene-dense nature of

these genomes. Many genes overlap, resulting in adja-cent genes being assembled into one transcript, even though they are not from a polycistronic RNA. Strand-specific RNA-seq has successfully been used to separate adjacent overlapping genes from opposite strands in the genome51,52. Overlapping genes that are transcribed from the same strand and that also have comparable expres-sion levels cannot easily be separated without using cap- or end-specific RNA-seq.

Plant and mammalian transcriptomes have complex alternative splicing patterns and are difficult to assemble

REVIEWS

NATURE REVIEWS | GENETICS VOLUME 12 | OCTOBER 2011 | 675

© 2011 Macmillan Publishers Limited. All rights reserved

ReferenceBasedAssembly

Figure from: Next generation transcriptome assembly. Nature Reviews Genetics, 2011

0CVWTG�4GXKGYU�^�)GPGVKEU

C��)GPGTCVG�CNN�UWDUVTKPIU�QH�NGPIVJ�M�HTQO�VJG�TGCFU

D��)GPGTCVG�VJG�&G�$TWKLP�ITCRJ�5GSWGPEKPI�GTTQT�QT�502�

&GNGVKQP�QT�KPVTQP�

M�OGTU�M���

4GCFU�

E��%QNNCRUG�VJG�&G�$TWKLP�ITCRJ�

G��#UUGODNGF�KUQHQTOU�

F��6TCXGTUG�VJG�ITCRJ�

ACCGC CCGCC CGCCC

GCCCA CCCAC

CTTCC TTCCT

TCCTG CCTGC CTGCT TGCTG CTGGT TGGTC GGTCT GTCTC TCTCT

TCCTC CCTCT

GCCCACAGC

GCCCTCAGCCAGCGCTTCCT

TCCTGCTGGTCTCTCTCTTGTTGGTCGTAG

ACCGCCCACAGCGCTTCCTGCTGGTCTCTTGTTG

ACCGCCC

ACCGC

CCGCC

CGCCC

CCCAC

CCACA

CACAG

GCCCA

ACAGC

CAGCG

AGCGC

GCGCT

CGCTT

GCTTC

CTTCC

TTCCT

TCCTG

CCTGC

CTGCT

TGCTG

GCTGG

CTGGT

TGGTC

GGTCT

GTCTC

TCTCT

CTCTT

TCTTG

CTTGT

CGCCCTCAGCGCTTCCTCTTGTTGGTCGTAG

TCCTCT

TTGTT

TGTTG

CGCCC

GCCCT

CCCTC

CCTCA

CTCAG

TCAGC

CAGCG

AGCGC

GCGCT

CGCTT

GCTTC

CTTCC

TTCCT

TCCTC

CCTCT

CTCTT

TCTTG

CTTGT

TTGTT

TGTTG

GTTGG

TTGGT

TGGTC

GGTCG

GTCGT

TCGTA

CGTAG

CCACA CACAG ACAGC

GCTGG

GCCCT CCCTC CCTCA CTCAG TCAGC

CAGCG AGCGC GCGCT CGCTT GCTTC

CTCTT TCTTG CTTGT TTGTT TGTTG GTTGG TTGGT TGGTC GGTCG GTCGT TCGTA CGTAG

GCCCACAGC

GCCCTCAGC

CAGCGCTTCCT

TCCTGCTGGTCTCT

CTCTTGTTGGTCGTAGACCGCCC

TCCTCT

ACCGCCCACAGCGCTTCCT--------CTTGTTGGTCGTAGACCGCCCACAGCGCTTCCTGCTGGTCTCTTGTTGGTCGTAG

ACCGCCCTCAGCGCTTCCTGCTGGTCTCTTGTTGGTCGTAGACCGCCCTCAGCGCTTCCT--------CTTGTTGGTCGTAG

Figure 3 | Overview of the de novo transcriptome assembly strategy. a | All substrings of length k (k-mers) are generated from each read. b | Each unique k-mer is used to represent a node (or vertex) in the De Bruijn graph, and pairs of nodes are connected if shifting a k-mer by one character creates an exact k–1 overlap between the two k-mers. Note that for non-strand-specific RNA sequencing data sets, the reverse complement of each k-mer will also be represented in the graph. Here, a simple example using 5-mers is shown. The example illustrates a SNP or

sequencing error (for example, A/T) and an example of an intron or a deletion. Single-nucleotide differences cause ‘bubbles’ of length k in the De Brujin graph, whereas introns or deletions introduce a shorter path in the graph. c,d | Chains of adjacent nodes in the graph are collapsed into a single node when the first node has an out degree of one and the second node has an in degree of one. Last, as in the reference-based approach, four alternative paths (blue, red, yellow and green) through the graph are chosen. e | The isoforms are then assembled.

REVIEWS

676 | OCTOBER 2011 | VOLUME 12 www.nature.com/reviews/genetics

© 2011 Macmillan Publishers Limited. All rights reserved

De-novoAssembly

Transcript assembly using Cufflinks

512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

L E T T E R S

junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworth’s theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1a–c and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (15–30 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

a

c

db

e

Map paired cDNAfragment sequences

to genomeTopHat

Cufflinks

Spliced fragmentalignments

Abundance estimationAssemblyMutually

incompatiblefragments

Transcript coverageand compatibility

Fragmentlength

distribution

Overlap graph

Maximum likelihoodabundances

Log-likelihood

Minimum path cover

Transcripts

Transcriptsand their

abundances

3

3

1

1

2

2

Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (b–e) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping ‘bundles’ of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of ‘incompatible’ fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an ‘overlap graph’ when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally ‘covered’ by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworth’s Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to ‘explain’ all the fragments. Cufflinks implements a proof of Dilworth’s Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

Page 4: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

4

Cufflinks overview

512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

L E T T E R S

junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworth’s theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1a–c and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (15–30 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

a

c

db

e

Map paired cDNAfragment sequences

to genomeTopHat

Cufflinks

Spliced fragmentalignments

Abundance estimationAssemblyMutually

incompatiblefragments

Transcript coverageand compatibility

Fragmentlength

distribution

Overlap graph

Maximum likelihoodabundances

Log-likelihood

Minimum path cover

Transcripts

Transcriptsand their

abundances

3

3

1

1

2

2

Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (b–e) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping ‘bundles’ of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of ‘incompatible’ fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an ‘overlap graph’ when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally ‘covered’ by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworth’s Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to ‘explain’ all the fragments. Cufflinks implements a proof of Dilworth’s Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.�Nature Biotechnology 2010.

Cufflinks overview²  Isoforms correspond to paths

through the overlap graph.²  Every read is consistent with

at least one assembled transcript.

²  Every transcript is tiled by reads.

²  Look for the smallest number of transcripts that explain the data.

512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

L E T T E R S

junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworth’s theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1a–c and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (15–30 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

a

c

db

e

Map paired cDNAfragment sequences

to genomeTopHat

Cufflinks

Spliced fragmentalignments

Abundance estimationAssemblyMutually

incompatiblefragments

Transcript coverageand compatibility

Fragmentlength

distribution

Overlap graph

Maximum likelihoodabundances

Log-likelihood

Minimum path cover

Transcripts

Transcriptsand their

abundances

3

3

1

1

2

2

Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (b–e) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping ‘bundles’ of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of ‘incompatible’ fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an ‘overlap graph’ when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally ‘covered’ by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworth’s Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to ‘explain’ all the fragments. Cufflinks implements a proof of Dilworth’s Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

Cufflinks overview²  Look for the smallest number

of isoforms that explain the data.

²  The number of mutually incompatible reads is the same as the minimum number of transcripts needed to explain all the alignments.

512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

L E T T E R S

junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworth’s theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1a–c and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (15–30 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

a

c

db

e

Map paired cDNAfragment sequences

to genomeTopHat

Cufflinks

Spliced fragmentalignments

Abundance estimationAssemblyMutually

incompatiblefragments

Transcript coverageand compatibility

Fragmentlength

distribution

Overlap graph

Maximum likelihoodabundances

Log-likelihood

Minimum path cover

Transcripts

Transcriptsand their

abundances

3

3

1

1

2

2

Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (b–e) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping ‘bundles’ of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of ‘incompatible’ fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an ‘overlap graph’ when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally ‘covered’ by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworth’s Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to ‘explain’ all the fragments. Cufflinks implements a proof of Dilworth’s Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

When the genome is already annotated

Reference-based assembly in Cufflinks: create faux-reads out of the gene models

Roberts et al

Reference annotation

Mappedsequenced

reads

Fauxreads

Assembly

Genome

Reference annotationbased transcript assembly

Merge andfilter

Fig. 2. An overview of our RABT assembly method. First paired-end reads (mates shown connected by solid lines) are mapped to the genome using a splicedread mapper that can map reads across junctions (shown in dotted lines). The reference annotation (blue) is used to generate faux-read alignments that tile thetranscripts (green). The faux-read alignments are used together with the spliced read alignments to generate a reference genome based assembly (black). Thisassembly is merged with the reference annotation, and “noisy” read mappings are filtered resulting in all reference annotation transcripts in the output (blue)as well as novel transcripts (light blue).

is to incorporate RNA-Seq data in a gene finding system. Suchapproaches have been successful, but they rely on models ofgenes structures that may limit predictions based on prior biases.Other approaches, such as Scripture (Guttman et al., 2010) maybe able to close gaps in coverage, but fail to incorporate existingannotation information during prediction. In the case of modelorganisms where gene annotations are based on years of efforts,in some cases by large consortia that have combined state of theart computational predictions (Guigo et al., 2006) with carefulhuman curated annotations (Ashurst et al., 2005), this results in theomission of a large amount of very useful information. We haveorganized existing assembly strategies into three categories, and asfar as we are aware, none of the programs in any of these categoriescan explicitly use existing annotations during assembly:

• De novo transcript assembly: the direct assembly of sequencedreads into transcripts without mapping to a reference genome.Examples include (Simpson et al., 2009).

• Genome reference based transcript assembly: assembly oftranscripts by first mapping to a reference genome. Thesemethods build on previous work using ESTs (Heber et al.,2002; Haas et al., 2003) or mapped pyrosequencing reads(Eriksson et al., 2008). Examples include (Guttman et al.,2010; Trapnell et al., 2010).

• RNA-Seq assisted protein coding gene annotation: theincorporation of read alignments as supporting evidence for ab

initio gene finding algorithms. Examples include (Schweikert

et al., 2009; Stanke and Waack, 2003; Allen and Salzberg,2005).

We address the need for a reference annotation based assemblerby developing a novel approach through modification of an existingassembler that we term reference annotation based transcript

assembly (RABT assembly). We adopt the approach of (Trapnellet al., 2010) which is to identify transcripts only based on readalignments (i.e., without regard to prior information about proteincoding gene structure), and we employ the parsimony approach tofind the fewest number of transcripts explaining the data (in this casethe aligned sequenced reads together with the reference annotation).A key feature of our approach is that it rapidly identifies noveltranscripts (with respect to the reference annotation).

2 METHODSOur RABT assembly method builds upon the Cufflinks assembler(Trapnell et al., 2010) that determines the minimum number oftranscripts needed to explain sets of reads aligned to a genome. Thealgorithm is based on finding a minimum path decomposition ofa directed acyclic overlap graph constructed from the reads, and isefficient thanks to a reduction of the computational problem to graphmatching. For details see the supplementary methods in (Trapnellet al., 2010).

We used the default parameters on the Cufflinks assembler,which include the removal of likely intronic reads (due to intronretention) as well as assembled transfrags with very low estimated

2

by guest on Decem

ber 1, 2013http://bioinform

atics.oxfordjournals.org/D

ownloaded from

Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 2011.

Page 5: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

5

Comparison of the two versions

Reference Annotation

Cufflinks Assembly

RABT Assembly

NM_014774.2

NM_014774.1

CUFF.1545.1CUFF.1540

NM_014774

CUFF.1546

CUFF.1545.2

Fig. 3. Comparison of assembler output for an example gene. Lack of sequencing coverage in the UTR and across one splice junction caused the Cufflinksassembler (teal) to output three transfrags that match the reference (blue) and a fourth that contains a novel splice junction. The RABT assembler output(red) includes both the reference transcript (NM 014774.1) and a novel isoform (NM 014774.2) that is assembled from a combination of sequencing reads,which reveal the novel junction, and faux-reads, which connect the three sections to form a single transcript. Note that even with the addition of the referencetranscript, the total number of transfrags output by the assembler has been reduced for this locus, and the transfrag lengths have increased.

D. melanogaster Output Set # of Genes # of Transfrags Avg Transfrag Length Isoforms Per GeneReference Annotation 13,302 20,715 1,629 1.56Cufflinks Assembly 7,167 8,701 2,334 1.21Cufflinks Assembly (Novel Only) 350 3,205 2,741 -RABT Assembly 13,634 23,913 1,815 1.75RABT Assembly (Novel Only) 332 3,018 2,719 -

Table 2. Results for two different versions of assembly on the first Drosophila melanogaster embryo time-point from (Graveley et al., 2010). The categoriescan be interpreted in the same manner as Table 1. These results show that the method also produces improved assemblies in fly.

data. Bioinformatics.Denoued, F. et al. (2008). Annotating genomes with massive-scale

RNA-sequencing. Genome Biology, 9, R175.Eriksson, N., Pachter, L., Mitsuya, Y., Rhee, S.-Y., Wang, C.,

Gharizadeh, B., Ronaghi, M., Shafer, R. W., and Beerenwinkel,N. (2008). Viral population estimation using pyrosequencing.PLoS Computational Biology, 4(4), e1000074.

Graveley, B., Brooks, A., Carlson, J., Landolin, J., Yang, L.,Artieri, C., van Baren, M., Boley, N., Booth, B., Brown, J.,Cherbas, L., Davis, C., Dobin, A., Li, R., Lin, W., Malone,J., Mattiuzzo, N., Miller, D., Sturgill, D., Tuch, B., Zaleski,C., Zhang, D., Blanchette, M., Dudoit, S., Eads, B., Green,R., Hammonds, A., Jiang, L., Kapranov, P., Langton, L.,et al. (2010). The developmental transcriptome of Drosophilamelanogaster. Nature.

Guigo, R. et al. (2006). EGASP: the human ENCODE GenomeAnnotation Assessment Project. Genome Biology, 7, S2.

Guttman, M., Garber, M., Levin, J., Donaghey, J., Robinson,J., Adiconis, X., Fan, L., Koziol, M., Gnirke, A., Nusbaum,C., Rinn, J., Lander, E., and Regev, A. (2010). Ab initioreconstruction of cell type-specific transcriptomes in mousereveals the conserved multi-exonic structure of lincrnas. Nature

Biotechnology, 28, 503–510.Haas, B. et al. (2003). Improving the Arabidopsis genome

annotation using maximal transcript alignment assemblies.Nucleic Acids Research, 31, 5654–5666.

Heber, S., Alekseyev, M., Sze, S.-H., Tang, H., and Pevzner,P. (2002). Splicing graphs and EST assembly problem.Bioinformatics, 18, S181–S188.

Jiang, H. and Wong, W. (2009). Statistical inferences for isoformexpression in RNA-Seq. Bioinformatics, 25, 1026–1032.

Mortazavi, A., Williams, B., McCue, K., and Schaeffer, L. (2008).Mapping and quantifying mammalian transcriptomes by rna-seq.Nature Methods, 5.

Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D.,Gerstein, M., and Snyder, M. (2008). The transcriptionallandscape of the yeast genome defined by RNA sequencing.Science, 320, 1344–1349.

Schweikert, G. et al. (2009). mGene: Accurate SVM-based genefinding with an application to nematode genomes. Genome

Research, 19, 2133–2143.Simpson, J., Wong, K., Jackman, S., and Schein, J. (2009). Abyss:

A parallel assembler for short read sequence data. Genome

Research, 19.Stanke, M. and Waack, S. (2003). Gene prediction with a hidden

Markov model and a new intron submodel. Bioinformatics, 19.Trapnell, C., Pachter, L., and Salzberg, S. (2009). Tophat:

discovering splice junctions with rna-seq. Bioinformatics, 25,1105–11.

Trapnell, C., Williams, B., Pertea, G., Mortazavi, A., GK,van Baren, M., Salzberg, S., Wold, B., and Pachter, L.(2010). Transcript assembly and quantification by rna-seqreveals unannotated transcripts and isoform switching during celldifferentiation. Nature Biotechnology, 28, 511–515.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G.,Smith, H., Yandell, M., Evans, C., Holt, R., et al. (2001). Thesequence of the human genome. Science, 291, 1304–1351.

5

by guest on Decem

ber 1, 2013http://bioinform

atics.oxfordjournals.org/D

ownloaded from

Reference Annotation

Cufflinks Assembly

RABT Assembly

NM_014774.2

NM_014774.1

CUFF.1545.1CUFF.1540

NM_014774

CUFF.1546

CUFF.1545.2

Fig. 3. Comparison of assembler output for an example gene. Lack of sequencing coverage in the UTR and across one splice junction caused the Cufflinksassembler (teal) to output three transfrags that match the reference (blue) and a fourth that contains a novel splice junction. The RABT assembler output(red) includes both the reference transcript (NM 014774.1) and a novel isoform (NM 014774.2) that is assembled from a combination of sequencing reads,which reveal the novel junction, and faux-reads, which connect the three sections to form a single transcript. Note that even with the addition of the referencetranscript, the total number of transfrags output by the assembler has been reduced for this locus, and the transfrag lengths have increased.

D. melanogaster Output Set # of Genes # of Transfrags Avg Transfrag Length Isoforms Per GeneReference Annotation 13,302 20,715 1,629 1.56Cufflinks Assembly 7,167 8,701 2,334 1.21Cufflinks Assembly (Novel Only) 350 3,205 2,741 -RABT Assembly 13,634 23,913 1,815 1.75RABT Assembly (Novel Only) 332 3,018 2,719 -

Table 2. Results for two different versions of assembly on the first Drosophila melanogaster embryo time-point from (Graveley et al., 2010). The categoriescan be interpreted in the same manner as Table 1. These results show that the method also produces improved assemblies in fly.

data. Bioinformatics.Denoued, F. et al. (2008). Annotating genomes with massive-scale

RNA-sequencing. Genome Biology, 9, R175.Eriksson, N., Pachter, L., Mitsuya, Y., Rhee, S.-Y., Wang, C.,

Gharizadeh, B., Ronaghi, M., Shafer, R. W., and Beerenwinkel,N. (2008). Viral population estimation using pyrosequencing.PLoS Computational Biology, 4(4), e1000074.

Graveley, B., Brooks, A., Carlson, J., Landolin, J., Yang, L.,Artieri, C., van Baren, M., Boley, N., Booth, B., Brown, J.,Cherbas, L., Davis, C., Dobin, A., Li, R., Lin, W., Malone,J., Mattiuzzo, N., Miller, D., Sturgill, D., Tuch, B., Zaleski,C., Zhang, D., Blanchette, M., Dudoit, S., Eads, B., Green,R., Hammonds, A., Jiang, L., Kapranov, P., Langton, L.,et al. (2010). The developmental transcriptome of Drosophilamelanogaster. Nature.

Guigo, R. et al. (2006). EGASP: the human ENCODE GenomeAnnotation Assessment Project. Genome Biology, 7, S2.

Guttman, M., Garber, M., Levin, J., Donaghey, J., Robinson,J., Adiconis, X., Fan, L., Koziol, M., Gnirke, A., Nusbaum,C., Rinn, J., Lander, E., and Regev, A. (2010). Ab initioreconstruction of cell type-specific transcriptomes in mousereveals the conserved multi-exonic structure of lincrnas. Nature

Biotechnology, 28, 503–510.Haas, B. et al. (2003). Improving the Arabidopsis genome

annotation using maximal transcript alignment assemblies.Nucleic Acids Research, 31, 5654–5666.

Heber, S., Alekseyev, M., Sze, S.-H., Tang, H., and Pevzner,P. (2002). Splicing graphs and EST assembly problem.Bioinformatics, 18, S181–S188.

Jiang, H. and Wong, W. (2009). Statistical inferences for isoformexpression in RNA-Seq. Bioinformatics, 25, 1026–1032.

Mortazavi, A., Williams, B., McCue, K., and Schaeffer, L. (2008).Mapping and quantifying mammalian transcriptomes by rna-seq.Nature Methods, 5.

Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D.,Gerstein, M., and Snyder, M. (2008). The transcriptionallandscape of the yeast genome defined by RNA sequencing.Science, 320, 1344–1349.

Schweikert, G. et al. (2009). mGene: Accurate SVM-based genefinding with an application to nematode genomes. Genome

Research, 19, 2133–2143.Simpson, J., Wong, K., Jackman, S., and Schein, J. (2009). Abyss:

A parallel assembler for short read sequence data. Genome

Research, 19.Stanke, M. and Waack, S. (2003). Gene prediction with a hidden

Markov model and a new intron submodel. Bioinformatics, 19.Trapnell, C., Pachter, L., and Salzberg, S. (2009). Tophat:

discovering splice junctions with rna-seq. Bioinformatics, 25,1105–11.

Trapnell, C., Williams, B., Pertea, G., Mortazavi, A., GK,van Baren, M., Salzberg, S., Wold, B., and Pachter, L.(2010). Transcript assembly and quantification by rna-seqreveals unannotated transcripts and isoform switching during celldifferentiation. Nature Biotechnology, 28, 511–515.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G.,Smith, H., Yandell, M., Evans, C., Holt, R., et al. (2001). Thesequence of the human genome. Science, 291, 1304–1351.

5

by guest on Decem

ber 1, 2013http://bioinform

atics.oxfordjournals.org/D

ownloaded from

SpliceGrapher: From Splice Graphs to Transcripts using RNA-Seq

Sofware: splicegrapher.sourceforge.net

Splice Graphs vs. TranscriptsTranscripts

Splice Graphs vs. TranscriptsTranscripts

Splice Graph

Page 6: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

6

Data●  Long introns●  Dominant form of alternative splicing: exon skipping

●  Best annotated plant genome●  Short introns●  Dominant form of alternative splicing: intron retention

Human

Arabidopsis

Data●  28M 75bp read pairs R. Xiao, et al. Nuclear matrix factor hnRNP U/SAF-A exerts a global control of alternative splicing by regulating U2 snRNP maturation. Molecular Cell, 45(5) 656–668, March 2012.

●  116M 76bp read pairs across five samples Y. Marquez, J. W. Brown, C. Simpson, A. Barta, and M. Kalyna. Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Research, 22(6):1184–1195, 2012.

Human

Arabidopsis

Simulated data: 100bp reads generated using FluxSimulator

Protocol (real data)

SpliceGrapherXT

Cufflinks IsoLasso

Protocol (simulated data)

SpliceGrapherXT

Cufflinks IsoLasso

Page 7: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

7

Exon level results: simulated data

Human A.thaliana

Human A.thaliana

Precision

Recall

Exon level results: real data

Human A.thaliana

Human A.thaliana

Recall

Precision

Example De-novo transcriptome reconstruction methods

Figure from: Bridger: a new framework for de novo transcriptome assembly using RNA-seq data

Page 8: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

8

De-novo transcriptome reconstruction methods

Assessment of transcript reconstruction methods

exon-level performance for the methods evaluated

Assessment of transcript reconstruction methods for RNA-seqTamara Steijger et al. Nature Methods 10, 1177–1184 (2013).

Assessment of transcript reconstruction methods

transcript-level performance for the methods evaluated

Assessment of transcript reconstruction methods for RNA-seqTamara Steijger et al. Nature Methods 10, 1177–1184 (2013).

Assessment of transcript reconstruction methods

transcript-level performance for the methods evaluated

Assessment of transcript reconstruction methods for RNA-seqTamara Steijger et al. Nature Methods 10, 1177–1184 (2013).

Page 9: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

9

Transcript reconstruction: read length matters

Pacific BioSciences ISO-seq sequencing

http://genome.duke.edu/sites/default/files/pacbio_isoseq_1.png

TAPIS: a transcriptome assembly and analysis pipeline for Pacific

Biosciences isoform sequencing reads

Page 10: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

10

Read length0 5000 10000 15000 20000 25000 30000 350000

50000

100000

150000

200000

250000

300000

350000

400000Combined

0 2000 4000 6000 8000 10000 12000 14000 160000

20000

40000

60000

80000

100000

120000

140000

160000

180000Full length

0 5000 10000 15000 20000 25000 30000 350000

20000

40000

60000

80000

100000

120000

140000

160000

180000Non-full length

0 2000 4000 6000 8000 100000

50000

100000

150000

200000

250000

300000

350000

400000

0 500 1000 1500 2000 2500 3000 3500 40000

20000

40000

60000

80000

100000

120000

140000

160000

180000

0 2000 4000 6000 8000 100000

20000

40000

60000

80000

100000

120000

140000

160000

180000

Reads of insert

Non-full length readsFull length reads

Error correction using genome

Align to genome using GMAP

Filter alignments with splice junction classifier

Final set of alignments

De-novo transcript discovery

Polyadenylation site analysis

Alternative splicing detection

Differential expression detection

The TAPIS pipeline

Data summaryReads Total Method Aligned After Filtering

non-full length 161,913

proovread 98,454 (60.1%) 52,582 (32.4%)

genome 82,730 (51.1%) 48,352 (29.9%)

hybrid 130,702 (80.7%) 61,549 (38.0%)

full length 134,984

proovread 129,434 (95.9%) 115,967 (85.9%)

genome 129,117 (95.7%) 118,679 (87.9%)

hybrid 133,489 (98.9%) 123,836 (91.7%)

Alternative splicing and transcript predictions

Source IR Alt5 Alt3 ES Total

Gene models v1

577 389 627 184 1777

Gene models v2

1671 1681 2869 1490 7711

PacBio

3897 796 1167 832 6692

Close to 8000 predicted transcripts

Page 11: Transcriptome reconstruction from RNA-seq Transcriptome ...

5/2/16

11

Alternative polyadenylation

6498

2856

6498

2886

6498

2917

6498

2978

6498

3034

6498

3065

6498

3131

Coordinate

0

5

10

15

20

25

Poly

-(A)s

ites

Orientation =)