Combining Transcriptome Assemblies - Kenlee Nakasugi



Summary:
1. The 35 query RNA silencing gene sequences were detected close to 100% assembled in the combined assemblies.
2. Using multiple assemblers, varying kmer sizes, and varying input read depth yielded more full-length transcripts in general.
3. The Dcl1 transcript example illustrates the effect of input read depth on the assembly of a complete transcript.
4. The EvidentialGene tr2aacds pipeline produced the best quality assemblies overall based on several metrics, including Feature Response Curves and bit-scores, despite not producing the longest transcripts based on traditional length-based metrics. In addition, the number of contigs/transcripts was substantially lower than with the TGI clustering tool. The caveat is that the Tgi tool can correctly generate additional full-length transcripts post-assembly, while the Evi pipeline will only be as good as the input transcripts.
5. In terms of full length RNA silencing gene transcript assembly in N. benthamiana, assembler performance was: Oa > Ta >= Tr >> So

Nicotiana benthamiana is an allopolyploid (hybridisation of 2 genomes followed by whole genome duplication) plant, which can be problematic for transcriptome assemblies due to duplicated gene copies yielding novel but highly similar transcripts with markedly differing expression levels. This can lead to unassembled, partial or chimerically assembled transcripts. In addition, no one assembler with any one given parameter space can assemble all possible transcripts.

In an effort to produce an assembly that contains (but is not limited to) full length RNA silencing gene transcripts, to facilitate more informative first pass searches and to increase the chances of finding paralogous transcripts while limiting redundancy, we combined the sequences from multiple assemblies generated by four popular de novo transcriptome assemblers: Trans-Abyss, Trinity, Soap-denovo-trans and Oases. We varied kmer sizes and, to a lesser extent, input read depth to try to assemble as many variants as possible. We then applied two pipelines to reduce redundancy, the TGI clustering tool and the EvidentialGene tr2aacds pipeline, and compared the quality of assemblies before and after processing with these tools.

Assembly overview:
1. Assemble kmers individually
2. Merge kmer assemblies - TaM, SoM, OaM ('M' denotes merged assembly). Tr is single kmer.
3. Combine assemblies:
   - combine merged assemblies
   - combine individual kmer assemblies without any merging
4. Apply Tgi and Evi tools to individually merged and combined assemblies
5. Assess via multiple metrics
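Step 3 of the overview (combining assemblies without any merging) amounts to pooling the FASTA output of each run. The following is a minimal sketch of that idea, not the authors' script: the labels such as "Tr_k25", the "label|id" naming scheme and the toy sequences are all assumptions made for illustration. Prefixing each ID with its source keeps transcripts traceable after pooling.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pool_assemblies(assemblies):
    """Concatenate assemblies, prefixing every ID with its assembly label.

    `assemblies` maps a hypothetical label (e.g. "Tr_k25") to a FASTA handle.
    """
    pooled = []
    for label, handle in assemblies.items():
        for header, seq in read_fasta(handle):
            seq_id = header.split()[0]  # drop any free-text description
            pooled.append((f"{label}|{seq_id}", seq))
    return pooled


# Toy input standing in for two assembler outputs (made-up sequences).
tr = io.StringIO(">comp1_c0_seq1 len=8\nACGTACGT\n")
oa = io.StringIO(">Locus_1_Transcript_1\nACGT\nTTGA\n>Locus_2_Transcript_1\nGGCC\n")

pooled = pool_assemblies({"Tr_k25": tr, "Oa_k25": oa})
for seq_id, seq in pooled:
    print(f">{seq_id}\n{seq}")
```

The pooled set would then be passed to a redundancy-reduction step such as the Tgi or Evi tools.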

            Paired end reads   Single end reads
Dataset 1   189,333,894        48,827,381
Dataset 2   228,279,832        50,249,303

                              A: Dataset 1 (ds1)     B: Dataset 2 (ds2)
1  Transabyss (Ta)            k48-86, step size 2    k20-44, step size 4
2  Transabyss (Ta)            -                      k48-86, step size 2
3  Trinity (Tr)               k25                    k25
4  Soap de novo trans (So)    k31                    k21-81, step size 10
5  Oases (Oa)                 -                      k25-75, step size 10

Combined assemblies                   Contains:
SasmM (sum of merged assemblies)*     A1, A3, A4, B1+B2, B2, B3, B4, B5
SasmK (sum of all kmer assemblies)    A1, A3, A4, B1, B2, B3, B4, B4,B5

* k-mer assemblies merged by: Ta – Ta merge utility; So – TGI clustering software; Oa – Oa merge utility
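The k-mer ranges above can be expanded into explicit lists of k values, one assembly run per value. This is only a sketch: the dictionary layout and the name KMER_PLAN are hypothetical, and the assignment of ranges to assemblers and datasets is read from the table as reconstructed here.

```python
def kmer_series(start, stop, step):
    """Return the k-mer sizes from start to stop (inclusive) at the given step."""
    return list(range(start, stop + 1, step))


# Hypothetical layout: (assembler, dataset) -> k values, one run per value.
KMER_PLAN = {
    ("Ta", "ds1"): kmer_series(48, 86, 2),
    ("Ta", "ds2"): kmer_series(20, 44, 4) + kmer_series(48, 86, 2),
    ("Tr", "ds1"): [25],
    ("Tr", "ds2"): [25],
    ("So", "ds1"): [31],
    ("So", "ds2"): kmer_series(21, 81, 10),
    ("Oa", "ds2"): kmer_series(25, 75, 10),
}

for (assembler, dataset), ks in KMER_PLAN.items():
    print(f"{assembler} {dataset}: {len(ks)} run(s), k = {ks}")
```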

Processed input RNAseq reads

Assembly: Ta, Tr, So, Oa

TGI clustering tool
Cluster and create assemblies (contigs) from a set of DNA sequences.
1. Group similar sequences into clusters
2. Assemble clusters using CAP3

EvidentialGene tr2aacds
“Principle is that over-assembling transcript reads with many assembly options produces a subset of accurate assemblies in a superset of crappy ones.”
tr2aacds selects the most biologically useful "best" set of mRNA, which are classified into primary and alternate transcripts.
1. Remove identical/highly similar transcripts
2. Find longest AA and remove highly similar AA
3. Self-blast to identify highly similar transcripts
4. Filter and classify 'main' and 'alternate' transcripts

Assess
- Std metrics (contig length/numbers)
- Avg. length of top 1000 longest prots
- CEGMA analysis
- Blast against reference databases
- Read mapping statistics
- Feature Response Curves
- Query alignment coverage
- Impact of sequencing depth
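To make the first tr2aacds step (removing identical/highly similar transcripts) concrete, here is a deliberately simplified sketch. It is not the tr2aacds implementation: it only drops exact duplicates and perfect fragments (on either strand) by substring search, whereas the real pipeline also handles near-identical sequences. The function names and toy sequences are assumptions, and the quadratic comparison would be far too slow for assemblies of the size reported here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_redundant(records):
    """Keep only sequences that are not contained in a longer kept sequence.

    `records` is a list of (id, sequence). Sequences are visited longest first,
    so an exact duplicate or a perfect fragment is always compared against the
    sequence that contains it.
    """
    kept = []
    for seq_id, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        seq = seq.upper()
        rc = reverse_complement(seq)
        if any(seq in longer or rc in longer for _, longer in kept):
            continue  # identical to, or fully contained in, a kept sequence
        kept.append((seq_id, seq))
    return kept


# Made-up toy transcripts.
records = [
    ("a", "ATGGCCTTAGGA"),
    ("b", "GCCTTAG"),       # fragment of "a"
    ("c", "GGCCA"),         # reverse complement of a fragment of "a"
    ("d", "ATGGCCTTAGGA"),  # exact duplicate of "a"
    ("e", "TTTTCCCC"),      # unrelated
]
kept = drop_redundant(records)
print([seq_id for seq_id, _ in kept])
```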

Details of assemblies:

Metric | TaMraw | TaMtgi | TaMevi | Trraw | Trtgi | Trevi | Soraw | SoMtgi | SoMevi | OaMraw | OaMtgi | OaMevi | SasmMraw | SasmMtgi | SasmMevi | SasmKraw | SasmKtgi | SasmKevi
Number of contigs | 750713 | 178700 | 128938 | 284583 | 247174 | 44726 | 1248430 | 243483 | 95212 | 797243 | 393516 | 72637 | 3562114 | 1066952 | 216177 | 9891564 | 735375 | 234526
N50 | 906 | 1463 | 1231 | 1853 | 1809 | 1951 | 468 | 883 | 686 | 2024 | 2226 | 2131 | 1400 | 2079 | 1806 | 971 | 2162 | 2208
Mean contig length | 706 | 925 | 933 | 1017 | 966 | 1323 | 433 | 642 | 618 | 1407 | 1605 | 1671 | 910 | 1327 | 1283 | 681 | 1340 | 1674
Longest contig | 14633 | 19508 | 14453 | 16069 | 16069 | 16069 | 15107 | 19151 | 15107 | 22494 | 22626 | 16312 | 22494 | 22494 | 16312 | 18177 | 20330 | 16038
Average length of top 1000 longest proteins | 1778 | 1433 | 1405 | 1678 | 1552 | 1325 | 1239 | 1119 | 932 | 2409 | 2033 | 1660 | 2811 | 2521 | 1840 | 3261 | 2456 | 2137
% of complete CEGMA proteins | 97.18 | 99.6 | 93.55 | 97.58 | 97.58 | 90.32 | 49.19 | 87.1 | 47.18 | 99.19 | 99.19 | 90.73 | 99.19 | 99.6 | 97.18 | 99.6 | 99.6 | 98.79
% of partial CEGMA proteins | 99.6 | 100 | 96.37 | 99.19 | 99.19 | 92.74 | 83.06 | 98.79 | 80.65 | 100 | 100 | 92.34 | 100 | 100 | 97.98 | 100 | 100 | 99.6
Average bitscore to top 1000 longest uniprot tomato proteins | 290.13 | 294.64 | 303.54 | 346.19 | 329.90 | 386.05 | 145.81 | 199.58 | 202.17 | 380.15 | 358.97 | 437.37 | 320.95 | 362.51 | 368.25 | 263.83 | 374.40 | 493.05
Average bitscore to top 1000 longest solgenomics Nben proteins | 319.99 | 344.99 | 333.77 | 420.45 | 403.25 | 434.89 | 163.87 | 240.74 | 224.79 | 442.55 | 423.07 | 503.58 | 365.16 | 421.91 | 418.70 | 298.63 | 449.56 | 554.61
Mapping % of input reads (BWA) | 99.53 | 99.55 | 88.54 | 99.21 | 99.19 | 84.16 | 99.62 | 99.64 | 81.89 | 99.63 | 99.64 | 85.88 | 99.77 | 99.77 | 93.57 | 99.78 | 99.77 | 93.87
Mapping % of chimeric reads mapQ>30 (BWA) | 0.081 | 0.265 | 0.237 | 0.480 | 0.536 | 0.381 | 0.062 | 0.570 | 0.344 | 0.015 | 0.033 | 0.450 | 0.002 | 0.013 | 0.124 | 0.002 | 0.016 | 0.179
Mapping % of input reads (Bowtie2) | 97.54 | 97.17 | 79.93 | 91.53 | 91.40 | 74.01 | 96.80 | 96.38 | 71.18 | 94.73 | 94.55 | 72.40 | 98.90 | 98.83 | 88.82 | 98.79 | 98.81 | 88.86

Statistics of assemblies generated from dataset 2 reads. Metrics are of merged assemblies of individual assemblers as well as combined assemblies. Each 'raw' assembly was processed by the TGI clustering tool (Tgi) or the EvidentialGene (Evi) tr2aacds pipeline. While the Tgi assemblies generated longer transcripts and proteins, average bitscores and mean contig length are higher, and contig number is lower, in most Evi assemblies.
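For readers unfamiliar with the standard length metrics in these statistics, the sketch below shows how contig number, mean contig length, N50 and longest contig can be computed from a list of contig lengths. It uses the usual definition of N50 (the length of the contig at which the running total of lengths, sorted longest first, reaches half the assembly size); the function names and the toy lengths are illustrative and are not taken from the poster.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum (longest first)
    first reaches at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Standard length metrics for one assembly."""
    return {
        "contigs": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "n50": n50(lengths),
        "longest": max(lengths),
    }


# Toy contig lengths (made up).
stats = summarize([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(stats)
```

Because N50 is driven by the longest contigs, it can rise after clustering even when the biological content does not improve, which is one reason the poster assesses assemblies with several additional metrics.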

Feature response curves of assemblies, using the "High_spanning_PE" feature, which measures the number of PE reads where the pairs are mapped onto different contigs. The feature threshold is used to filter out contigs whose number of features exceeds the threshold. That is, only contigs that contain fewer than the threshold number of features are used to calculate the coverage achieved. Except for the TransAbyss-Merged assembly, the EvidentialGene (Evi) tr2aacds pipeline appears to perform best, or at least on par, across all assemblies (raw vs tgi vs evi), as higher coverage is achieved at a lower feature threshold.
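The thresholding described in this caption can be sketched as follows. This follows the caption's wording (a contig counts towards coverage only if it has fewer features than the threshold) and may differ in detail from the tally used in the published FRC method. The function name, the choice of a reference length as the denominator, and the toy numbers are assumptions.

```python
def feature_response_curve(contigs, thresholds, reference_length):
    """Return (threshold, coverage) points.

    `contigs` is a list of (length, feature_count). For each threshold, only
    contigs with fewer features than the threshold contribute their length;
    coverage is that summed length divided by `reference_length`.
    """
    curve = []
    for threshold in thresholds:
        covered = sum(length for length, features in contigs
                      if features < threshold)
        curve.append((threshold, covered / reference_length))
    return curve


# Toy assembly: (contig length, number of High_spanning_PE-like features).
contigs = [(1000, 0), (500, 2), (300, 5), (200, 1)]
curve = feature_response_curve(contigs, thresholds=[1, 2, 3, 6],
                               reference_length=2000)
for threshold, coverage in curve:
    print(threshold, coverage)
```

A curve that climbs steeply at low thresholds indicates that most of the sequence sits in contigs carrying few suspicious features, which is the sense in which the caption calls an assembly better.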


Blast results summarized by Circos showing the proportion of transcripts in each assembly that matched the Solgenomics N. benthamiana predicted protein database (green track) and the percentage of these hits that were also found by other assemblies (orange tracks).

Common db hits between Trraw and, in clockwise direction: TaMraw, TaMtgi, TaMevi, Trraw, Trtgi, Trevi, SoMraw, SoMtgi, SoMevi, OaMraw, OaMtgi, OaMevi, SasmMraw, SasmMtgi, SasmMevi, SasmKraw, SasmKtgi, SasmKevi (self comparison will always be 100%).

[Figure: Circos plot for Trraw, with one segment per assembly (TaMraw to SasmKevi) and tracks scaled to 100%.]

Figure legend:
- % of Trraw assembly that matched against ref db
- Blue bars: Blastx against all proteins in ref db
- Greyscale bars: Blastx against top 1000 longest proteins in ref db
- Total db
- db % actually expressed (RSEM)
- Total hits against db
- Total hits against db (>80% db coverage)
- Links showing the proportion of the Trraw sequences that are found in the combined assembly SasmM before and after EvidentialGene pipeline

Alignment coverage of 35 RNA silencing gene transcripts. The consensus CDS (identified previously) of these genes were screened with blastn against each assembly. Genes such as Dcl2 and Dcl3 were always found to be assembled to 100% in all assemblies, whereas Dcl1 could only be assembled to 100% by Oases. As expected, the combined assemblies always contained the query sequences that were close to 100% assembled.
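One way to turn blastn hits into an alignment coverage percentage is to merge the query intervals of the hits against a single transcript and divide the covered bases by the query length. The sketch below assumes 1-based inclusive query coordinates, as in tabular blast output. Whether the poster computed coverage exactly this way is not stated, so treat the function and the toy coordinates as illustrative.

```python
def alignment_coverage(hsps, query_length):
    """Percent of the query covered by the union of HSP query intervals.

    `hsps` is a list of (qstart, qend) pairs, 1-based and inclusive.
    Overlapping or nested intervals are counted only once.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e in hsps)
    covered = 0
    current_end = 0  # right-most query position already counted
    for start, end in intervals:
        start = max(start, current_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / query_length


# Toy example: three HSPs on a 500 bp query, two of them overlapping.
partial = alignment_coverage([(1, 100), (50, 150), (300, 400)], 500)
full = alignment_coverage([(1, 250), (251, 500), (20, 30)], 500)
print(partial, full)
```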


Increased read counts prevent assembly of full length Dcl1 transcript in Trinity

A single complete transcript of Dcl1 can be generated by assembly of dataset 1 reads

Only two partial transcripts can be generated from assembly of dataset 2 reads

Read depth is noticeably higher in dataset 2 than in dataset 1 between positions 2200 and 2950 and towards the end of the Dcl1 transcript, implying that increased reads from Dcl1 variants could be preventing the assembly of the full-length Dcl1 transcript.

Inspection of the sequences of the transcripts at the read depth 'cliffs' (Box B) shows the presence of Dcl1 variants, including transcripts with intron sequence (Box A), which are supported by reads spanning potential intron/exon junctions (Box C).

The full-length Dcl1 transcript could not be assembled with dataset 2 reads most likely due to increased read 'noise' from other variants. Varying the read depth presented to the assembler (either by raw data, or by coverage threshold options) is important in generating full length transcripts. This also depends on the type of assembler used, as Trans-abyss was not able to generate the full-length Dcl1 sequence with either dataset 1 or 2.
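The read depth 'cliffs' mentioned above were found by visual inspection. A rough programmatic equivalent, given per-base depth values along a transcript, is to flag positions where depth jumps or drops sharply relative to the previous base. In the sketch below, the function name, the fold threshold and the pseudocount are all hypothetical choices, not values from the poster.

```python
def depth_cliffs(depth, fold=3.0, pseudocount=1.0):
    """Return 1-based positions where depth changes at least `fold`-fold
    relative to the previous base (either up or down).

    The pseudocount avoids division by zero at uncovered positions.
    """
    cliffs = []
    for i in range(1, len(depth)):
        previous = depth[i - 1] + pseudocount
        current = depth[i] + pseudocount
        ratio = current / previous
        if ratio >= fold or ratio <= 1.0 / fold:
            cliffs.append(i + 1)
    return cliffs


# Toy per-base depths with a sharp rise at position 4 and a drop at position 7.
cliffs = depth_cliffs([10, 10, 11, 40, 42, 41, 12, 10])
print(cliffs)
```

Positions flagged this way would be candidates for the kind of manual inspection described here, for example checking whether reads at the boundary carry intron sequence.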

[Figure: alignment along the Dcl1 transcript (positions 1 to 5000) of scaffold13731, ds1_comp79561_c0_seq14, scaffold165219 and ds2_comp85436_c0_seq7, with reads 1 to 4 and the intron sequence marked; boxes A, B and C as referred to in the text.]

References:
* EvidentialGene pipeline (by Don Gilbert @ Indiana University): http://arthropods.eugenes.org/EvidentialGene/about/EvigeneRNA2013poster.pdf http://arthropods.eugenes.org/EvidentialGene
* TGI Clustering tool: http://compbio.dfci.harvard.edu/tgi/software/
* Visualization tools:
  - Circos plot: Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645 (http://circos.ca/)
  - IGV: http://www.broadinstitute.org/igv/
  - Geneious: http://www.geneious.com/
* Feature Response Curve: Vezzi F et al. (2012) Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons. PLoS ONE 7(12): e52210
* Assemblers: TransAbyss, Trinity, Soap denovo Trans, Velvet/Oases
* Other software: R (v3.0.1) with ggplot2 package, BWA (v0.7.5a), Bowtie2 (v2.2.1), NCBI blast (v2.2.26+), CEGMA (v2.4)
* HPC resources: Intersect Australia, 'Orange' server; UQ Research Computing Centre, 'Barrine' cluster

Combining transcriptome assemblies from multiple de novo assemblers to generate full length RNA silencing gene transcripts in Nicotiana benthamiana

Kenlee Nakasugi and Peter Waterhouse
School of Molecular Bioscience, University of Sydney