Transcriptome reconstruction from RNA-seq Transcriptome ...

of 11 /11
5/2/16 1 Transcriptome reconstruction from RNA-seq Transcriptome assembly Strategies: 1. De-novo assembly: no reference genome available (Trinity, Bridger, Abyss, IDBA-trans) 2. Reference-based assembly: a reference genome is available (Cufflinks, StringTie, iReckon) 3. Annotation-guided assembly: use known annotations to improve the annotation process (Cufflinks, StringTie) Transcriptome assembly Goal: from read alignments to transcripts/splice graphs Spliced alignments Ungapped alignments Challenges Short Reads

Embed Size (px)

Transcript of Transcriptome reconstruction from RNA-seq Transcriptome ...

  • 5/2/16

    1

    Transcriptome reconstruction from RNA-seq

    Transcriptome assembly

    Strategies:

    1. De-novo assembly: no reference genome available

    (Trinity, Bridger, Abyss, IDBA-trans)

    2. Reference-based assembly: a reference genome is

    available (Cufflinks, StringTie, iReckon)

    3. Annotation-guided assembly: use known annotations

    to improve the annotation process (Cufflinks, StringTie)

    Transcriptome assembly

    Goal: from read alignments to transcripts/splice graphs

    Spliced alignments Ungapped alignments

    Challenges

    Short Reads

  • 5/2/16

    2

    Challenges

    Short Reads

    Ambiguous Origins

    Challenges

    Short Reads

    Ambiguous Origins

    Variable Coverage

    Challenges

    Short Reads

    Ambiguous Origins

    Variable Coverage

    Challenges

    Short Reads

    Ambiguous Origins

    Variable Coverage

  • 5/2/16

    3

    Challenges

    Short Reads

    Ambiguous Origins

    Variable Coverage

    Localized Evidence

    Figure 2 | Overview of the reference-based transcriptome assembly strategy. The steps of the reference-based transcriptome strategy are shown using an example of a maize gene (GRMZM2G060216). a | Reads (grey) are first splice-aligned to a reference genome. b | A connectivity or splice graph is then constructed to represent all possible isoforms at a locus. c,d | Finally, alternative paths through the graph (blue, red, yellow and green) are followed to join compatible reads together into isoforms.

    Applications. Reference-based transcriptome assembly is easier to perform for the simple transcriptomes of bac-terial, archaeal and lower eukaryotic organisms, as these organisms have few introns and little alternative splicing. Transcription boundaries can be inferred from regions of contiguous read coverage in the genome even with-out graph construction and traversal37,51,52. Alternative transcription start and stop sites can also be inferred based on the 5 cap or poly(A) signals (if cap- or end-specific experimental protocols are used)51,53. However, complications arise owing to the gene-dense nature of

    these genomes. Many genes overlap, resulting in adja-cent genes being assembled into one transcript, even though they are not from a polycistronic RNA. Strand-specific RNA-seq has successfully been used to separate adjacent overlapping genes from opposite strands in the genome51,52. Overlapping genes that are transcribed from the same strand and that also have comparable expres-sion levels cannot easily be separated without using cap- or end-specificRNA-seq.

    Plant and mammalian transcriptomes have complex alternative splicing patterns and are difficult to assemble

    REVIEWS

    NATURE REVIEWS | GENETICS VOLUME 12 | OCTOBER 2011 | 675

    2011 Macmillan Publishers Limited. All rights reserved

    Reference

    Based

    Assembly

    Figure from: Next generation transcriptome assembly. Nature Reviews Genetics, 2011

    ACCGC CCGCC CGCCC

    GCCCA CCCAC

    CTTCC TTCCT

    TCCTG CCTGC CTGCT TGCTG CTGGT TGGTC GGTCT GTCTC TCTCT

    TCCTC CCTCT

    GCCCACAGC

    GCCCTCAGCCAGCGCTTCCT

    TCCTGCTGGTCTCTCTCTTGTTGGTCGTAG

    ACCGCCCACAGCGCTTCCTGCTGGTCTCTTGTTG

    ACCGCCC

    ACCGC

    CCGCC

    CGCCC

    CCCAC

    CCACA

    CACAG

    GCCCA

    ACAGC

    CAGCG

    AGCGC

    GCGCT

    CGCTT

    GCTTC

    CTTCC

    TTCCT

    TCCTG

    CCTGC

    CTGCT

    TGCTG

    GCTGG

    CTGGT

    TGGTC

    GGTCT

    GTCTC

    TCTCT

    CTCTT

    TCTTG

    CTTGT

    CGCCCTCAGCGCTTCCTCTTGTTGGTCGTAG

    TCCTCT

    TTGTT

    TGTTG

    CGCCC

    GCCCT

    CCCTC

    CCTCA

    CTCAG

    TCAGC

    CAGCG

    AGCGC

    GCGCT

    CGCTT

    GCTTC

    CTTCC

    TTCCT

    TCCTC

    CCTCT

    CTCTT

    TCTTG

    CTTGT

    TTGTT

    TGTTG

    GTTGG

    TTGGT

    TGGTC

    GGTCG

    GTCGT

    TCGTA

    CGTAG

    CCACA CACAG ACAGC

    GCTGG

    GCCCT CCCTC CCTCA CTCAG TCAGC

    CAGCG AGCGC GCGCT CGCTT GCTTC

    CTCTT TCTTG CTTGT TTGTT TGTTG GTTGG TTGGT TGGTC GGTCG GTCGT TCGTA CGTAG

    GCCCACAGC

    GCCCTCAGC

    CAGCGCTTCCT

    TCCTGCTGGTCTCT

    CTCTTGTTGGTCGTAGACCGCCC

    TCCTCT

    ACCGCCCACAGCGCTTCCT--------CTTGTTGGTCGTAGACCGCCCACAGCGCTTCCTGCTGGTCTCTTGTTGGTCGTAG

    ACCGCCCTCAGCGCTTCCTGCTGGTCTCTTGTTGGTCGTAGACCGCCCTCAGCGCTTCCT--------CTTGTTGGTCGTAG

    Figure 3 | Overview of the denovo transcriptome assembly strategy. a | All substrings of length k (k-mers) are generated from each read. b | Each unique k-mer is used to represent a node (or vertex) in the De Bruijn graph, and pairs of nodes are connected if shifting a k-mer by one character creates an exact k1 overlap between the two k-mers. Note that for non-strand-specific RNA sequencing data sets, the reverse complement of each k-mer will also be represented in the graph. Here, a simple example using 5-mers is shown. The example illustrates a SNP or

    sequencing error (for example, A/T) and an example of an intron or a deletion. Single-nucleotide differences cause bubbles of length k in the De Brujin graph, whereas introns or deletions introduce a shorter path in the graph. c,d | Chains of adjacent nodes in the graph are collapsed into a single node when the first node has an out degree of one and the second node has an in degree of one. Last, as in the reference-based approach, four alternative paths (blue, red, yellow and green) through the graph are chosen. e | The isoforms are then assembled.

    REVIEWS

    676 | OCTOBER 2011 | VOLUME 12 www.nature.com/reviews/genetics

    2011 Macmillan Publishers Limited. All rights reserved

    De-novo

    Assembly

    Transcript assembly using Cufflinks

    512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

    L E T T E R S

    junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

    To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworths theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1ac and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

    during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

    We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (1530 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

    a

    c

    db

    e

    Map paired cDNAfragment sequences

    to genomeTopHat

    Cufflinks

    Spliced fragmentalignments

    Abundance estimationAssemblyMutually

    incompatiblefragments

    Transcript coverageand compatibility

    Fragmentlength

    distribution

    Overlap graph

    Maximum likelihoodabundances

    Log-likelihood

    Minimum path cover

    Transcripts

    Transcriptsand their

    abundances

    3

    3

    1

    1

    2

    2

    Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (be) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping bundles of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of incompatible fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an overlap graph when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally covered by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworths Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to explain all the fragments. Cufflinks implements a proof of Dilworths Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

  • 5/2/16

    4

    Cufflinks overview

    512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

    L E T T E R S

    junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

    To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworths theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1ac and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

    during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

    We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (1530 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

    a

    c

    db

    e

    Map paired cDNAfragment sequences

    to genomeTopHat

    Cufflinks

    Spliced fragmentalignments

    Abundance estimationAssemblyMutually

    incompatiblefragments

    Transcript coverageand compatibility

    Fragmentlength

    distribution

    Overlap graph

    Maximum likelihoodabundances

    Log-likelihood

    Minimum path cover

    Transcripts

    Transcriptsand their

    abundances

    3

    3

    1

    1

    2

    2

    Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (be) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping bundles of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of incompatible fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an overlap graph when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally covered by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworths Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to explain all the fragments. Cufflinks implements a proof of Dilworths Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

    Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL,

    Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated

    transcripts and isoform switching during cell differentiation.Nature Biotechnology 2010.

    Cufflinks overview

    Isoforms correspond to paths

    through the overlap graph.

    Every read is consistent with

    at least one assembled transcript.

    Every transcript is tiled by reads.

    Look for the smallest number of transcripts that explain the data.

    512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

    L E T T E R S

    junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

    To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworths theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1ac and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

    during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

    We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (1530 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

    a

    c

    db

    e

    Map paired cDNAfragment sequences

    to genomeTopHat

    Cufflinks

    Spliced fragmentalignments

    Abundance estimationAssemblyMutually

    incompatiblefragments

    Transcript coverageand compatibility

    Fragmentlength

    distribution

    Overlap graph

    Maximum likelihoodabundances

    Log-likelihood

    Minimum path cover

    Transcripts

    Transcriptsand their

    abundances

    3

    3

    1

    1

    2

    2

    Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (be) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping bundles of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of incompatible fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an overlap graph when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally covered by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworths Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to explain all the fragments. Cufflinks implements a proof of Dilworths Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

    Cufflinks overview

    Look for the smallest number

    of isoforms that explain the data.

    The number of mutually incompatible reads is the same as the minimum number of transcripts needed to explain all the alignments.

    512 VOLUME 28 NUMBER 5 MAY 2010 NATURE BIOTECHNOLOGY

    L E T T E R S

    junction (Supplementary Table 1). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by the UCSC, Ensembl or VEGA groups (known genes).

    To recover the minimal set of transcripts supported by our frag-ment alignments, we designed a comparative transcriptome assem-bly algorithm. Expressed sequence tag (EST) assemblers such as PASA introduced the idea of collapsing alignments to transcripts on the basis of splicing compatibility17, and Dilworths theorem18 has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4 bipartite graph that represents com-patibilities17 among fragments (Fig. 1ac and Supplementary Methods, section 4). Noncoding RNAs20 and microRNAs21 have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame (ORF). As Cufflinks does not make use of existing gene annotations

    during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations.

    We recovered a total of 13,692 known isoforms and 12,712 new iso-forms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes; 11,712 (92%) have an ORF, 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant transcripts (1530 expected fragments per kilobase of transcript per million fragments mapped, abbreviated FPKM; see below for further explanation) detected at the 60-h time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we con-servatively excluded from further analyses the novel isoforms that were unique to a single time point. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points; 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity

    a

    c

    db

    e

    Map paired cDNAfragment sequences

    to genomeTopHat

    Cufflinks

    Spliced fragmentalignments

    Abundance estimationAssemblyMutually

    incompatiblefragments

    Transcript coverageand compatibility

    Fragmentlength

    distribution

    Overlap graph

    Maximum likelihoodabundances

    Log-likelihood

    Minimum path cover

    Transcripts

    Transcriptsand their

    abundances

    3

    3

    1

    1

    2

    2

    Figure 1 Overview of Cufflinks. (a) The algorithm takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing spliced alignments, such as TopHat. (be) With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping bundles of fragment alignments (b,c) separately, which reduces running time and memory use, because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d,e). The first step in fragment assembly is to identify pairs of incompatible fragments that must have originated from distinct spliced mRNA isoforms (b). Fragments are connected in an overlap graph when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three. Isoforms are then assembled from the overlap graph (c). Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally covered by three paths (shaded in yellow, blue and red), each representing a different isoform. Dilworths Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to explain all the fragments. Cufflinks implements a proof of Dilworths Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Next, transcript abundance is estimated (d). Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to the Cufflinks model, if it were to come from the red isoform instead of the blue isoform. Last, the program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms ( 1, 2, 3) (e), producing the abundances that best explain the observed fragments, shown as a pie chart.

    When the genome is already annotated

    Reference-based assembly in Cufflinks: create faux-reads out of the gene models

    Roberts et al

    Reference annotation

    Mappedsequenced

    reads

    Fauxreads

    Assembly

    Genome

    Reference annotationbased transcript assembly

    Merge andfilter

    Fig. 2. An overview of our RABT assembly method. First paired-end reads (mates shown connected by solid lines) are mapped to the genome using a splicedread mapper that can map reads across junctions (shown in dotted lines). The reference annotation (blue) is used to generate faux-read alignments that tile thetranscripts (green). The faux-read alignments are used together with the spliced read alignments to generate a reference genome based assembly (black). Thisassembly is merged with the reference annotation, and noisy read mappings are filtered resulting in all reference annotation transcripts in the output (blue)as well as novel transcripts (light blue).

    is to incorporate RNA-Seq data in a gene finding system. Suchapproaches have been successful, but they rely on models ofgenes structures that may limit predictions based on prior biases.Other approaches, such as Scripture (Guttman et al., 2010) maybe able to close gaps in coverage, but fail to incorporate existingannotation information during prediction. In the case of modelorganisms where gene annotations are based on years of efforts,in some cases by large consortia that have combined state of theart computational predictions (Guigo et al., 2006) with carefulhuman curated annotations (Ashurst et al., 2005), this results in theomission of a large amount of very useful information. We haveorganized existing assembly strategies into three categories, and asfar as we are aware, none of the programs in any of these categoriescan explicitly use existing annotations during assembly:

    De novo transcript assembly: the direct assembly of sequencedreads into transcripts without mapping to a reference genome.Examples include (Simpson et al., 2009).

    Genome reference based transcript assembly: assembly oftranscripts by first mapping to a reference genome. Thesemethods build on previous work using ESTs (Heber et al.,2002; Haas et al., 2003) or mapped pyrosequencing reads(Eriksson et al., 2008). Examples include (Guttman et al.,2010; Trapnell et al., 2010).

    RNA-Seq assisted protein coding gene annotation: theincorporation of read alignments as supporting evidence for abinitio gene finding algorithms. Examples include (Schweikert

    et al., 2009; Stanke and Waack, 2003; Allen and Salzberg,2005).

    We address the need for a reference annotation based assemblerby developing a novel approach through modification of an existingassembler that we term reference annotation based transcriptassembly (RABT assembly). We adopt the approach of (Trapnellet al., 2010) which is to identify transcripts only based on readalignments (i.e., without regard to prior information about proteincoding gene structure), and we employ the parsimony approach tofind the fewest number of transcripts explaining the data (in this casethe aligned sequenced reads together with the reference annotation).A key feature of our approach is that it rapidly identifies noveltranscripts (with respect to the reference annotation).

    2 METHODSOur RABT assembly method builds upon the Cufflinks assembler(Trapnell et al., 2010) that determines the minimum number oftranscripts needed to explain sets of reads aligned to a genome. Thealgorithm is based on finding a minimum path decomposition ofa directed acyclic overlap graph constructed from the reads, and isefficient thanks to a reduction of the computational problem to graphmatching. For details see the supplementary methods in (Trapnellet al., 2010).

    We used the default parameters on the Cufflinks assembler,which include the removal of likely intronic reads (due to intronretention) as well as assembled transfrags with very low estimated

    2

    by guest on Decem

    ber 1, 2013http://bioinform

    atics.oxfordjournals.org/D

    ownloaded from

    Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using

    RNA-Seq. Bioinformatics, 2011.

  • 5/2/16

    5

    Comparison of the two versions

    Reference Annotation

    Cufflinks Assembly

    RABT Assembly

    NM_014774.2

    NM_014774.1

    CUFF.1545.1CUFF.1540

    NM_014774

    CUFF.1546

    CUFF.1545.2

    Fig. 3. Comparison of assembler output for an example gene. Lack of sequencing coverage in the UTR and across one splice junction caused the Cufflinksassembler (teal) to output three transfrags that match the reference (blue) and a fourth that contains a novel splice junction. The RABT assembler output(red) includes both the reference transcript (NM 014774.1) and a novel isoform (NM 014774.2) that is assembled from a combination of sequencing reads,which reveal the novel junction, and faux-reads, which connect the three sections to form a single transcript. Note that even with the addition of the referencetranscript, the total number of transfrags output by the assembler has been reduced for this locus, and the transfrag lengths have increased.

    D. melanogaster Output Set # of Genes # of Transfrags Avg Transfrag Length Isoforms Per GeneReference Annotation 13,302 20,715 1,629 1.56Cufflinks Assembly 7,167 8,701 2,334 1.21Cufflinks Assembly (Novel Only) 350 3,205 2,741 -RABT Assembly 13,634 23,913 1,815 1.75RABT Assembly (Novel Only) 332 3,018 2,719 -

    Table 2. Results for two different versions of assembly on the first Drosophila melanogaster embryo time-point from (Graveley et al., 2010). The categoriescan be interpreted in the same manner as Table 1. These results show that the method also produces improved assemblies in fly.

    data. Bioinformatics.Denoued, F. et al. (2008). Annotating genomes with massive-scale

    RNA-sequencing. Genome Biology, 9, R175.Eriksson, N., Pachter, L., Mitsuya, Y., Rhee, S.-Y., Wang, C.,

    Gharizadeh, B., Ronaghi, M., Shafer, R. W., and Beerenwinkel,N. (2008). Viral population estimation using pyrosequencing.PLoS Computational Biology, 4(4), e1000074.

    Graveley, B., Brooks, A., Carlson, J., Landolin, J., Yang, L.,Artieri, C., van Baren, M., Boley, N., Booth, B., Brown, J.,Cherbas, L., Davis, C., Dobin, A., Li, R., Lin, W., Malone,J., Mattiuzzo, N., Miller, D., Sturgill, D., Tuch, B., Zaleski,C., Zhang, D., Blanchette, M., Dudoit, S., Eads, B., Green,R., Hammonds, A., Jiang, L., Kapranov, P., Langton, L.,et al. (2010). The developmental transcriptome of Drosophilamelanogaster. Nature.

    Guigo, R. et al. (2006). EGASP: the human ENCODE GenomeAnnotation Assessment Project. Genome Biology, 7, S2.

    Guttman, M., Garber, M., Levin, J., Donaghey, J., Robinson,J., Adiconis, X., Fan, L., Koziol, M., Gnirke, A., Nusbaum,C., Rinn, J., Lander, E., and Regev, A. (2010). Ab initioreconstruction of cell type-specific transcriptomes in mousereveals the conserved multi-exonic structure of lincrnas. NatureBiotechnology, 28, 503510.

    Haas, B. et al. (2003). Improving the Arabidopsis genomeannotation using maximal transcript alignment assemblies.Nucleic Acids Research, 31, 56545666.

    Heber, S., Alekseyev, M., Sze, S.-H., Tang, H., and Pevzner,P. (2002). Splicing graphs and EST assembly problem.Bioinformatics, 18, S181S188.

    Jiang, H. and Wong, W. (2009). Statistical inferences for isoformexpression in RNA-Seq. Bioinformatics, 25, 10261032.

    Mortazavi, A., Williams, B., McCue, K., and Schaeffer, L. (2008).Mapping and quantifying mammalian transcriptomes by rna-seq.Nature Methods, 5.

    Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D.,Gerstein, M., and Snyder, M. (2008). The transcriptionallandscape of the yeast genome defined by RNA sequencing.Science, 320, 13441349.

    Schweikert, G. et al. (2009). mGene: Accurate SVM-based genefinding with an application to nematode genomes. GenomeResearch, 19, 21332143.

    Simpson, J., Wong, K., Jackman, S., and Schein, J. (2009). Abyss:A parallel assembler for short read sequence data. GenomeResearch, 19.

    Stanke, M. and Waack, S. (2003). Gene prediction with a hiddenMarkov model and a new intron submodel. Bioinformatics, 19.

    Trapnell, C., Pachter, L., and Salzberg, S. (2009). Tophat:discovering splice junctions with rna-seq. Bioinformatics, 25,110511.

    Trapnell, C., Williams, B., Pertea, G., Mortazavi, A., GK,van Baren, M., Salzberg, S., Wold, B., and Pachter, L.(2010). Transcript assembly and quantification by rna-seqreveals unannotated transcripts and isoform switching during celldifferentiation. Nature Biotechnology, 28, 511515.

    Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G.,Smith, H., Yandell, M., Evans, C., Holt, R., et al. (2001). Thesequence of the human genome. Science, 291, 13041351.

    5

    by guest on Decem

    ber 1, 2013http://bioinform

    atics.oxfordjournals.org/D

    ownloaded from

    Reference Annotation

    Cufflinks Assembly

    RABT Assembly

    NM_014774.2

    NM_014774.1

    CUFF.1545.1CUFF.1540

    NM_014774

    CUFF.1546

    CUFF.1545.2

    Fig. 3. Comparison of assembler output for an example gene. Lack of sequencing coverage in the UTR and across one splice junction caused the Cufflinksassembler (teal) to output three transfrags that match the reference (blue) and a fourth that contains a novel splice junction. The RABT assembler output(red) includes both the reference transcript (NM 014774.1) and a novel isoform (NM 014774.2) that is assembled from a combination of sequencing reads,which reveal the novel junction, and faux-reads, which connect the three sections to form a single transcript. Note that even with the addition of the referencetranscript, the total number of transfrags output by the assembler has been reduced for this locus, and the transfrag lengths have increased.

    D. melanogaster Output Set # of Genes # of Transfrags Avg Transfrag Length Isoforms Per GeneReference Annotation 13,302 20,715 1,629 1.56Cufflinks Assembly 7,167 8,701 2,334 1.21Cufflinks Assembly (Novel Only) 350 3,205 2,741 -RABT Assembly 13,634 23,913 1,815 1.75RABT Assembly (Novel Only) 332 3,018 2,719 -

    Table 2. Results for two different versions of assembly on the first Drosophila melanogaster embryo time-point from (Graveley et al., 2010). The categoriescan be interpreted in the same manner as Table 1. These results show that the method also produces improved assemblies in fly.

    data. Bioinformatics.Denoued, F. et al. (2008). Annotating genomes with massive-scale

    RNA-sequencing. Genome Biology, 9, R175.Eriksson, N., Pachter, L., Mitsuya, Y., Rhee, S.-Y., Wang, C.,

    Gharizadeh, B., Ronaghi, M., Shafer, R. W., and Beerenwinkel,N. (2008). Viral population estimation using pyrosequencing.PLoS Computational Biology, 4(4), e1000074.

    Graveley, B., Brooks, A., Carlson, J., Landolin, J., Yang, L.,Artieri, C., van Baren, M., Boley, N., Booth, B., Brown, J.,Cherbas, L., Davis, C., Dobin, A., Li, R., Lin, W., Malone,J., Mattiuzzo, N., Miller, D., Sturgill, D., Tuch, B., Zaleski,C., Zhang, D., Blanchette, M., Dudoit, S., Eads, B., Green,R., Hammonds, A., Jiang, L., Kapranov, P., Langton, L.,et al. (2010). The developmental transcriptome of Drosophilamelanogaster. Nature.

    Guigo, R. et al. (2006). EGASP: the human ENCODE GenomeAnnotation Assessment Project. Genome Biology, 7, S2.

    Guttman, M., Garber, M., Levin, J., Donaghey, J., Robinson,J., Adiconis, X., Fan, L., Koziol, M., Gnirke, A., Nusbaum,C., Rinn, J., Lander, E., and Regev, A. (2010). Ab initioreconstruction of cell type-specific transcriptomes in mousereveals the conserved multi-exonic structure of lincrnas. NatureBiotechnology, 28, 503510.

    Haas, B. et al. (2003). Improving the Arabidopsis genomeannotation using maximal transcript alignment assemblies.Nucleic Acids Research, 31, 56545666.

    Heber, S., Alekseyev, M., Sze, S.-H., Tang, H., and Pevzner,P. (2002). Splicing graphs and EST assembly problem.Bioinformatics, 18, S181S188.

    Jiang, H. and Wong, W. (2009). Statistical inferences for isoformexpression in RNA-Seq. Bioinformatics, 25, 10261032.

    Mortazavi, A., Williams, B., McCue, K., and Schaeffer, L. (2008).Mapping and quantifying mammalian transcriptomes by rna-seq.Nature Methods, 5.

    Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D.,Gerstein, M., and Snyder, M. (2008). The transcriptionallandscape of the yeast genome defined by RNA sequencing.Science, 320, 13441349.

    Schweikert, G. et al. (2009). mGene: Accurate SVM-based genefinding with an application to nematode genomes. GenomeResearch, 19, 21332143.

    Simpson, J., Wong, K., Jackman, S., and Schein, J. (2009). Abyss:A parallel assembler for short read sequence data. GenomeResearch, 19.

    Stanke, M. and Waack, S. (2003). Gene prediction with a hiddenMarkov model and a new intron submodel. Bioinformatics, 19.

    Trapnell, C., Pachter, L., and Salzberg, S. (2009). Tophat:discovering splice junctions with rna-seq. Bioinformatics, 25,110511.

    Trapnell, C., Williams, B., Pertea, G., Mortazavi, A., GK,van Baren, M., Salzberg, S., Wold, B., and Pachter, L.(2010). Transcript assembly and quantification by rna-seqreveals unannotated transcripts and isoform switching during celldifferentiation. Nature Biotechnology, 28, 511515.

    Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G.,Smith, H., Yandell, M., Evans, C., Holt, R., et al. (2001). Thesequence of the human genome. Science, 291, 13041351.

    5

    by guest on Decem

    ber 1, 2013http://bioinform

    atics.oxfordjournals.org/D

    ownloaded from

    SpliceGrapher: From Splice Graphs to Transcripts using RNA-Seq

    Sofware: splicegrapher.sourceforge.net

    Splice Graphs vs. Transcripts

    Transcripts

    Splice Graphs vs. Transcripts

    Transcripts

    Splice Graph

  • 5/2/16

    6

    Data

    Long introns

    Dominant form of alternative splicing: exon skipping

    Best annotated plant genome

    Short introns

    Dominant form of alternative splicing: intron retention

    Human

    Arabidopsis

    Data

    28M 75bp read pairs

    R. Xiao, et al. Nuclear matrix factor hnRNP U/SAF-A exerts a global

    control of alternative splicing by regulating U2 snRNP maturation.

    Molecular Cell, 45(5) 656668, March 2012.

    116M 76bp read pairs across five samples

    Y. Marquez, J. W. Brown, C. Simpson, A. Barta, and M. Kalyna.

    Transcriptome survey reveals increased complexity of the alternative

    splicing landscape in Arabidopsis. Genome Research, 22(6):11841195,

    2012.

    Human

    Arabidopsis

    Simulated data: 100bp reads generated using FluxSimulator

    Protocol (real data)

    SpliceGrapherXT

    Cufflinks IsoLasso

    Protocol (simulated data)

    SpliceGrapherXT

    Cufflinks IsoLasso

  • 5/2/16

    7

    Exon level results: simulated data

    Human A.thaliana

    Human A.thaliana

    Precision

    Recall

    Exon level results: real data

    Human A.thaliana

    Human A.thaliana

    Recall

    Precision

    Example De-novo transcriptome reconstruction methods

    Figure from:

    Bridger: a new framework for de novo transcriptome assembly using RNA-seq data

  • 5/2/16

    8

    De-novo transcriptome reconstruction methods

    Assessment of transcript reconstruction methods

    exon-level performance for the methods evaluated

    Assessment of transcript reconstruction methods for RNA-seq

    Tamara Steijger et al. Nature Methods 10, 11771184 (2013).

    Assessment of transcript reconstruction methods

    transcript-level performance for the methods evaluated

    Assessment of transcript reconstruction methods for RNA-seq

    Tamara Steijger et al. Nature Methods 10, 11771184 (2013).

    Assessment of transcript reconstruction methods

    transcript-level performance for the methods evaluated

    Assessment of transcript reconstruction methods for RNA-seq

    Tamara Steijger et al. Nature Methods 10, 11771184 (2013).

  • 5/2/16

    9

    Transcript reconstruction: read length matters

    Pacific BioSciences ISO-seq sequencing

    http://genome.duke.edu/sites/default/files/pacbio_isoseq_1.png

    TAPIS: a transcriptome assembly and analysis pipeline for Pacific

    Biosciences isoform sequencing reads

  • 5/2/16

    10

    Read length0 5000 10000 15000 20000 25000 30000 35000050000

    100000

    150000

    200000

    250000

    300000

    350000

    400000Combined

    0 2000 4000 6000 8000 10000 12000 14000 160000

    20000

    40000

    60000

    80000

    100000

    120000

    140000

    160000

    180000Full length

    0 5000 10000 15000 20000 25000 30000 350000

    20000

    40000

    60000

    80000

    100000

    120000

    140000

    160000

    180000Non-full length

    0 2000 4000 6000 8000 100000

    50000

    100000

    150000

    200000

    250000

    300000

    350000

    400000

    0 500 1000 1500 2000 2500 3000 3500 40000

    20000

    40000

    60000

    80000

    100000

    120000

    140000

    160000

    180000

    0 2000 4000 6000 8000 100000

    20000

    40000

    60000

    80000

    100000

    120000

    140000

    160000

    180000

    Reads of insert

    Non-full length readsFull length reads

    Error correction using genome

    Align to genome using GMAP

    Filter alignments with splice junction classifier

    Final set of alignments

    De-novo transcript discovery

    Polyadenylation site analysis

    Alternative splicing detection

    Differential expression detection

    The TAPIS pipeline

    Data summary

    Reads Total Method Aligned After Filtering

    non-full length 161,913

    proovread 98,454 (60.1%) 52,582 (32.4%)

    genome 82,730 (51.1%) 48,352 (29.9%)

    hybrid 130,702 (80.7%) 61,549 (38.0%)

    full length 134,984

    proovread 129,434 (95.9%) 115,967 (85.9%)

    genome 129,117 (95.7%) 118,679 (87.9%)

    hybrid 133,489 (98.9%) 123,836 (91.7%)

    Alternative splicing and transcript predictions

    Source IR Alt5 Alt3 ES Total

    Gene models v1

    577 389 627 184 1777

    Gene models v2

    1671 1681 2869 1490 7711

    PacBio

    3897 796 1167 832 6692

    Close to 8000 predicted transcripts

  • 5/2/16

    11

    Alternative polyadenylation

    6498

    2856

    6498

    2886

    6498

    2917

    6498

    2978

    6498

    3034

    6498

    3065

    6498

    3131

    Coordinate

    0

    5

    10

    15

    20

    25

    Poly

    -(A)s

    ites

    Orientation =)