Combined final report: genome and transcriptome assemblies


Nadia Fernandez- Trinity assembly, RSEM, Tophat and Cufflinks/Cuffmerge/Cuffdiff pipeline, and MAKER annotation

Stephanie Gutierrez

Avril Harder – some QC steps for reads/transcripts, SOAPdenovo2, HTSeq, DESeq2/edgeR for reference- and denovo-assembled transcriptomes

Samarth Mathur- Generated genome assemblies (using ABySS and SOAPdenovo) from trimmed, cleaned reads and from merged overlapping reads (using FLASH); merged genome assemblies using GAM-NGS; QUAST stats for generated assemblies; QC/contamination cleaning for transcripts; analysed Cuffdiff RNA-Seq differential expression data using the cummeRbund package; expression analysis for 3 chosen DEGs and hypothesis formation.

Alex Martinez

After DGE analyses were completed, I was responsible for examining our list of differentially expressed genes along with two other group members and choosing 3 genes of interest pertaining to mating type. Once our 3 genes were chosen, I was responsible for researching biological pathways and M. roreri functions in which our genes may be involved. Finally, I was responsible for developing a ‘story’ detailing potential roles our genes might serve in regard to mating type and reproduction in M. roreri.

Genome Assembly

Adapters and low quality sequences were removed from raw mate pair reads using Trimmomatic, and contaminant sequences (mitochondrial M. roreri genome and PhiX sequences) were removed using Bowtie 2. Read quality was checked before and after running Trimmomatic (Fig. 1). Numbers and proportions of reads surviving each quality control step are outlined in Table 1. Tadpole was used to error-correct paired-end reads shared by another group, with full corrections applied to 23,898,406 reads and partial corrections applied to 862,524 reads.

Table 1. Number of mate pair reads at each quality control step.

Step | MP_1 paired | MP_1 unpaired | MP_2 paired | MP_2 unpaired
Initial # reads with Nextera adapters (% initial) | 54753460 (22.85%) | -- | 58573898 (24.44%) | --
Initial # reads | 239670785 | -- | 239670785 | --
Remaining # reads following Bowtie2 contaminant (PhiX and mito. genome) removal (% initial) | 151450854 (63.19%) | 49518286 (20.66%) | 151450854 (63.19%) | 34426062 (14.36%)
Remaining # reads following Trimmomatic cleaning (% initial) | 152005076 (63.42%) | 49674223 (20.73%) | 152005076 (63.42%) | 34537324 (14.41%)
Remaining # reads with Nextera adapters following Trimmomatic cleaning | 0 | 0 | 0 | 0


Figure 1. FastQC plots for MP_1 reads (a) before and (b) after removal of adapter and low quality sequences.

Merging paired-end reads using FLASH

Paired-end and mate pair reads were merged using FLASH (Fast Length Adjustment of SHort reads) to produce extended fragments (hereafter, flash reads).

Parameters Used:


 

Min overlap: 10
Max overlap: 65
Max mismatch density: 0.250000
Allow "outie" pairs: false
Cap mismatch quals: false
Combiner threads: 10
Input format: FASTQ, phred_offset=33
Output format: FASTQ, phred_offset=33

Read combination statistics:

Reads | Total pairs | Combined pairs | Uncombined pairs | Percent combined
Paired End Reads | 25298314 | 3468074 | 21830240 | 13.71%
Mate Pair Reads | 157729360 | 519913 | 157209447 | 0.33%

The final output consists of merged reads, written as extended fragments (single-end reads), and uncombined read pairs (R1 and R2).
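The "Uncombined pairs" and "Percent combined" columns above follow directly from the total and combined pair counts. A minimal Python sketch of that arithmetic is given here as a check; the function and variable names are illustrative only and are not part of FLASH.

```python
def percent_combined(total_pairs, combined_pairs):
    """Percentage of read pairs that FLASH merged into extended fragments."""
    return 100.0 * combined_pairs / total_pairs


# Counts taken from the read combination statistics table above.
libraries = {
    "Paired End Reads": (25298314, 3468074),
    "Mate Pair Reads": (157729360, 519913),
}

for name, (total, combined) in libraries.items():
    uncombined = total - combined
    pct = percent_combined(total, combined)
    print(f"{name}: {uncombined} uncombined pairs, {pct:.2f}% combined")
```

The very low merge rate for mate pair reads is expected when most fragments are far longer than the maximum overlap allowed.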

K-mer size estimation

 

The optimal kmer size to use for genome assembly was identified using kmergenie. 

Final kmergenie predictions are:

Only paired reads | Predicted best k | Predicted assembly size
Raw Reads | 88 | 59,588,865 bp
Flash Reads | 90 | 59,651,847 bp

For genome assembly using ABySS, a kmer size of 88 was used for raw reads and 90 for flash reads. For genome assembly using SOAPdenovo, a kmer size of 88 was used for raw reads and 89 for flash reads.

Genome assembly using ABySS

Trimmed and cleaned paired-end and mate pair reads (raw reads) were assembled using ABySS with a kmer size of 88.

abyss-pe name=raw_kmer88 k=88 lib='pe1' mp='mp1' \
    pe1='./PE/phix.mito.unmap.1.fastq ./PE/phix.mito.unmap.2.fastq' \
    mp1='./MP/cleaned_mate-pair_reads.1.fastq ./MP/cleaned_mate-pair_reads.2.fastq'

Reads merged by FLASH (extended fragments, supplied as single-end reads) and uncombined reads (supplied as paired-end reads) were assembled using ABySS with a kmer size of 90.

abyss-pe name=flash_kmer90 k=90 lib='pe1' mp='mp1' \
    pe1='./flash/PE/PE.out.notCombined_1.fastq ./flash/PE/PE.out.notCombined_2.fastq' \
    mp1='./flash/MP/MP.out.notCombined_1.fastq ./flash/MP/MP.out.notCombined_2.fastq' \
    se='./flash/PE/PE.out.extendedFrags.fastq ./flash/MP/MP.out.extendedFrags.fastq'

SOAPdenovo2

Cleaned and corrected paired-end reads and cleaned mate pair reads were used to construct a de novo genome assembly with SOAPdenovo2 and an estimated genome size of 50 Mb. The .config file used to run SOAPdenovo2 is available as an attachment to this page (m_roreri_soapdenovo2.config.txt).

QUAST was run with the --scaffolds option to assess the quality of the SOAPdenovo2 assembly. With this option, QUAST produces two sets of summary statistics: (1) for the provided file of scaffolds and (2) for scaffolds resulting from QUAST breaking provided scaffolds after 10 consecutive Ns. When QUAST broke provided scaffolds according to this rule, the number of Ns per 100 Kb in the assembly decreased from 8956.16 to 0.23. The total length of the broken assembly was 50.24 Mb, with 10457 contigs, an N50 of 14,883, and with 139 Kb in the largest contig. Prior to breaking scaffolds, the total assembly length was 55.7 Mb, with an N50 of 107,380, and with 1.78 Mb in the largest contig.
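As a rough illustration of why breaking scaffolds lowers N50 (this is not QUAST's own code), the Python sketch below splits scaffolds at runs of 10 or more consecutive Ns and computes N50 before and after. The function names and the toy sequences are hypothetical.

```python
import re


def break_scaffold(seq, min_gap=10):
    """Split a scaffold into contigs at runs of >= min_gap consecutive Ns."""
    parts = re.split(r"[Nn]{%d,}" % min_gap, seq)
    return [p for p in parts if p]


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy scaffolds: the first has a 12-N gap (broken), the second a 2-N gap (kept).
scaffolds = ["ACGT" * 5 + "N" * 12 + "ACGT" * 3, "ACGTACGTNNACGT"]
contigs = [c for s in scaffolds for c in break_scaffold(s)]

print("scaffold N50:", n50([len(s) for s in scaffolds]))
print("broken N50:  ", n50([len(c) for c in contigs]))
```

Short gaps below the threshold stay inside a contig, which is why a small number of Ns can remain in the broken assembly.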

REAPR was also used to check assembly accuracy (reapr.sh). Only 2365 of 63,281,957 bases were found to be error-free using the perfectmap approach. The FCD rate plot (Fig. 2a) and read coverage plot (Fig. 2b) are below.

Figure 2. (a) FCD rate and (b) read coverage plots provided by REAPR analysis of the SOAPdenovo2 assembly.

Merging assemblies using GAM-NGS

GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing) improves de novo assemblies by merging two assemblies (assembly reconciliation) in order to enhance contiguity and possibly correctness. The two assemblies being merged are placed in a hierarchical order: one is elected as the master and the other as the slave. In situations where weights/features do not allow a decision (e.g. similar weights), we chose to be as conservative as possible, trusting only contigs belonging to the master assembly. GAM-NGS is a multistep process involving the following steps (the entire script file can be found here: gamngs_raw.txt):

1. For each assembly and each read library, GAM-NGS requires as input a file that lists the BAM files of the aligned libraries.
2. Blocks are constructed (block construction), with a minimum-reads setting specifying the number of reads required to build a block.
3. The master and slave assemblies are merged using the blocks constructed in the previous step.

For our analysis, we created a merged assembly with the ABySS assembly as the master and the SOAPdenovo assembly as the slave.

QUAST results for the merged GAM-NGS assembly (from broken scaffolds): total length of 57.01 Mb, with 4787 contigs, an N50 of 36,036, and with 290 Kb in the largest contig. Prior to breaking scaffolds, the total assembly length was 57.9 Mb, with an N50 of 102,258, and with 926 Kb in the largest contig.

Transcriptome Assembly

Data Quality Control 

Trimmomatic

Trimmomatic was run in paired-end mode on each individual for adapter removal, with the following parameters: phred33; ILLUMINACLIP:/group/bioinfo/apps/apps/trimmomatic-0.32/adapters/TruSeq3-PE-2.fa:2:20:9; LEADING:7 TRAILING:7; SLIDINGWINDOW:4:13; MINLEN:30.

 

Individual | Input Read Pairs | Both Surviving | Forward Only Surviving | Reverse Only Surviving | Dropped
JD-6 | 35,221,409 | 34,523,800 (98.02%) | 621,607 (1.76%) | 60,985 (0.17%) | 15,017 (0.04%)
JD-8 | 33,453,731 | 32,836,908 (98.16%) | 547,875 (1.64%) | 56,992 (0.17%) | 11,956 (0.04%)
JD-5 | 37,640,600 | 36,944,155 (98.15%) | 616,398 (1.64%) | 66,411 (0.18%) | 13,636 (0.04%)
MCA-2504 | 35,143,640 | 34,486,224 (98.13%) | 581,893 (1.66%) | 60,894 (0.17%) | 14,629 (0.04%)
MCA-2952 | 31,156,221 | 30,457,235 (97.76%) | 635,039 (2.04%) | 48,996 (0.16%) | 14,951 (0.05%)
MCA-2974 | 38,396,705 | 37,623,337 (97.99%) | 692,368 (1.80%) | 64,963 (0.17%) | 16,037 (0.04%)

Table 3. Number of reads surviving after adapter removal via Trimmomatic, and their percentages. 

 


Bowtie

For contaminant removal, the mitogenome and PhiX FASTA files were downloaded from the wiki page, merged into a single file, and indexed via bowtie-build. Trimmed reads were mapped against the merged contaminant file; reads that did not align were treated as "clean" reads and written to a new FASTQ file (--un), while reads that mapped to contaminants were written to a SAM file.

 

Example:

bowtie -t -S --un JD-8_trimmomatic_forward_filtered.fastq \
    merged_contaminants.fasta \
    JD-8_trimmomatic_forward_paired.fastq \
    JD-8_forward_contaminant_alignments.sam

bowtie -t -S --un JD-8_trimmomatic_reverse_filtered.fastq \
    merged_contaminants.fasta \
    JD-8_trimmomatic_reverse_paired.fastq \
    JD-8_reverse_contaminant_alignments.sam

 

FastQC

After adapter and contaminant removal, FastQC was used to quantify the quality of the reads.


Figure 3. FastQC plots for JD-5 forward reads (a) before and (b) after removal of adapter and low quality sequences.

De novo Transcriptome Assembly

The transcriptome was assembled with Trinity (v2.2.0) using the newly cleaned reads. The forward and reverse read files of all individuals were supplied to a single Trinity run.

Simplified script: 

Trinity --seqType fq --max_memory 96G --CPU 20 --verbose \
--left JD-8_trimmomatic_forward_filtered.fastq,\
JD-6_trimmomatic_forward_filtered.fastq,\
JD-5_trimmomatic_forward_filtered.fastq,\
MCA-2504_trimmomatic_forward_filtered.fastq,\
MCA-2952_trimmomatic_forward_filtered.fastq,\
MCA-2974_trimmomatic_forward_filtered.fastq \
--right JD-8_trimmomatic_reverse_filtered.fastq,\
JD-6_trimmomatic_reverse_filtered.fastq,\
JD-5_trimmomatic_reverse_filtered.fastq,\
MCA-2504_trimmomatic_reverse_filtered.fastq,\
MCA-2952_trimmomatic_reverse_filtered.fastq,\
MCA-2974_trimmomatic_reverse_filtered.fastq \
&> trinity_log.txt

 

Trinity stats on Trinity.fasta

Counts of transcripts, etc.
Total trinity 'genes': 21603
Total trinity transcripts: 58001
Percent GC: 48.64

Stats based on ALL transcript contigs:

Contig N10: 11366
Contig N20: 8868
Contig N30: 7097
Contig N40: 5735
Contig N50: 4654

Median contig length: 1774
Average contig: 2688.99
Total assembled bases: 155964366

Stats based on ONLY LONGEST ISOFORM per 'GENE':

Contig N10: 9477
Contig N20: 7060
Contig N30: 5559
Contig N40: 4449
Contig N50: 3604

Median contig length: 885
Average contig: 1763.01
Total assembled bases: 38086305

RSEM

Prepared reference

rsem-prepare-reference \
    --num-threads 20 \
    --transcript-to-gene-map gene.map \
    --bowtie2 \
    trinity_out_dir/Trinity.fasta Trinity_ref

Calculated expression (for each individual), example:

rsem-calculate-expression -p 20 --bowtie2 --paired-end \
    cleaned_reads/bowtie/repaired_reads/JD-6_forward_filtered_fixed.fastq \
    cleaned_reads/bowtie/repaired_reads/JD-6_reverse_filtered_fixed.fastq \
    Trinity_ref JD-6.rsem

 

Reference Genome Analysis

Reference genome: GCF_000488995.1_M_roreri_MCA_2997_v1_genomic.fna

Reference annotation: GCF_000488995.1_M_roreri_MCA_2997_v1_genomic.gff

Tophat

Bowtie was used to build an index of the reference genome, and Tophat was used to generate BAM files for each individual. The Tophat run initially failed with an error caused by mismatched read pairs, so we re-paired the reads with BBMap's "repair.sh". Such mismatches arise when the forward and reverse files contain unequal numbers of reads, for example when a read-trimming or filtering tool discards one read of a pair but not the other.
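The idea behind re-pairing can be sketched in Python as follows. This is only a conceptual illustration under simplifying assumptions (reads held in memory as (name, sequence) tuples, mates named with /1 and /2 suffixes); it is not how repair.sh is implemented, and all names in it are hypothetical.

```python
def base_name(read_id):
    """Strip a trailing /1 or /2 mate suffix so both mates share one key."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def repair_pairs(forward, reverse):
    """Keep only reads whose mate is present; report the rest as singletons.

    forward, reverse: lists of (read_name, sequence) tuples.
    Returns (paired_forward, paired_reverse, singletons), with the two
    paired lists in the same order so they can be written as matched files.
    """
    rev_by_name = {base_name(name): (name, seq) for name, seq in reverse}
    fwd_names = {base_name(name) for name, _ in forward}
    paired_fwd, paired_rev, singletons = [], [], []
    for name, seq in forward:
        key = base_name(name)
        if key in rev_by_name:
            paired_fwd.append((name, seq))
            paired_rev.append(rev_by_name[key])
        else:
            singletons.append((name, seq))
    for name, seq in reverse:
        if base_name(name) not in fwd_names:
            singletons.append((name, seq))
    return paired_fwd, paired_rev, singletons


# Toy input: r2 lost its reverse mate and r3 lost its forward mate.
forward = [("r1/1", "ACGT"), ("r2/1", "GGCC"), ("r4/1", "TTAA")]
reverse = [("r1/2", "TGCA"), ("r3/2", "CCGG"), ("r4/2", "AATT")]
paired_fwd, paired_rev, singletons = repair_pairs(forward, reverse)
print(len(paired_fwd), "pairs kept;", len(singletons), "singletons")
```

Because the forward and reverse files were filtered against contaminants separately, a step of this kind is needed before any aligner that expects the two files to be in matching order.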

 

Strain | Input total | Aligned pairs | Overall mapping rate
JD5 | 73,862,560 | 31,381,013 | 89.80%
JD6 | 69,023,838 | 28,939,213 | 88.50%
JD8 | 65,649,570 | 27,724,054 | 89.30%
MCA2504 | 68,945,066 | 25,383,466 | 79.20%
MCA2952 | 60,898,998 | 14,487,145 | 51.20%
MCA2974 | 75,224,452 | 23,234,076 | 66.30%


Cufflinks

 

cufflinks -p 20 --multi-read-correct --compatible-hits-norm \
    -o cufflinks_out/JD-6 \
    -G GCF_000488995.1_M_roreri_MCA_2997_v1_genomic.gff \
    JD-6_repaired/accepted_hits.bam

HTSeq

htseq-count --quiet \
    --format=bam \
    --stranded=no \
    JD5_accepted_hits.bam \
    JD5_transcripts.gtf \
    > JD5.count

Cuffmerge

 

assemblies_file.txt listed the paths to the transcripts.gtf file produced by Cufflinks for each individual.

cuffmerge -p 20 \
    -o cuffmerge_out \
    -g GCF_000488995.1_M_roreri_MCA_2997_v1_genomic.gff \
    -s GCF_000488995.1_M_roreri_MCA_2997_v1_genomic.fna \
    assemblies_file.txt

Cuffdiff

 

cuffdiff -o cuffdiff_out -b GCF_000488995.1_M_roreri_MCA_2997_v1_genomic.fna -p 20 \
    -L JD-6,JD-8,JD-5,MCA-2504,MCA-2952,MCA-2974 \
    -u cuffmerge_out/merged.gtf \
    JD-6/accepted_hits.bam \
    JD-8/accepted_hits.bam \
    JD-5/accepted_hits.bam \
    MCA-2504/accepted_hits.bam \
    MCA-2952/accepted_hits.bam \
    MCA-2974/accepted_hits.bam

CummeRbund

cummeRbund is a visualization package for Cufflinks high-throughput sequencing data. It is designed to help navigate through the Cuffdiff RNA-Seq differential expression analysis data.

All the following commands were executed in R 3.3.0 (http://www.R-project.org)

> setwd("cuffdiff_out")
> library(cummeRbund)
> cuff<-readCufflinks()
> cuff

CuffSet instance with:
    6 samples
    17830 genes
    17987 isoforms
    17910 TSS
    17910 CDS
    267450 promoters
    268650 splicing
    267450 relCDS

> disp<-dispersionPlot(genes(cuff))
> disp

Figure 4. Dispersion plots to estimate overdispersion for each sample as a quality control measure.

> genes.scv<-fpkmSCVPlot(genes(cuff))
> isoforms.scv<-fpkmSCVPlot(isoforms(cuff))

Figure 5. Estimating squared coefficient of variation (CV²) across all (a) genes and (b) isoforms.

> dens<-csDensity(genes(cuff))
> dens

Figure 6. Density distributions of FPKM scores across samples.

> b<-csBoxplot(genes(cuff))
> b

Figure 7. Boxplots showing log(fpkm) values for each sample.

> dend<-csDendro(genes(cuff))
> dend

Figure 8. Dendrogram with 2 branches and 6 members total, at height 0.1441899.

 

Differential Gene Expression Analyses (R)

For de novo transcriptome analysis, transcript counts from RSEM (RSEM.counts.matrix) were imported into DESeq2. For reference-based transcriptome analysis, *.count files produced by HTSeq were imported into DESeq2. DESeq2 was run with "JD" and "MCA" set as conditions in order to compare gene expression between the two mating types (rsem_to_deseq2.R, htseq_to_deseq2.R).

For both the de novo and the reference-based transcriptome, samples within mating types clustered more closely together than samples between mating types (Fig. 9). Samples within mating types were also more closely correlated with one another than samples between mating types (Fig. 10), with decreased distances between samples within mating types (Fig. 11).

Figure 9. For the (a) de novo assembly, mating type accounted for 86% of the variance between samples, and for the (b) reference-based assembly, mating type accounted for 85% of the variance between samples.

Figure 10. Pairwise correlations between samples for the (a) de novo and (b) reference-based transcriptome assemblies.

Figure 11. Distances between samples for the (a) de novo and (b) reference-based transcriptome assemblies.

Volcano plots were constructed to illustrate differentially expressed genes for both assemblies (Fig. 12).

Figure 12. Volcano plots illustrating DEGs for the (a) de novo and (b) reference-based transcriptome assemblies. Genes labeled in red have an adjusted p-value < 1e-10. Genes labeled in green have an adjusted p-value < 1e-10 and exhibited a log2 fold-change greater than 4.
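The colour rule in the Figure 12 caption can be written out as a small Python sketch. The thresholds are those stated in the caption; treating the fold-change cutoff as an absolute value, the "grey" label for all other genes, and the example values are assumptions made for illustration, not details taken from the plotting scripts.

```python
def volcano_label(padj, log2fc, alpha=1e-10, lfc_cutoff=4.0):
    """Assign a volcano-plot colour from adjusted p-value and log2 fold-change.

    Assumes the fold-change cutoff applies to the absolute value, and uses
    "grey" (a hypothetical label) for genes that pass neither threshold.
    """
    if padj is None or padj >= alpha:
        return "grey"
    if abs(log2fc) > lfc_cutoff:
        return "green"
    return "red"


# Hypothetical (adjusted p-value, log2 fold-change) pairs, not real results.
examples = {
    "gene_a": (3e-15, 5.2),
    "gene_b": (8e-12, -1.3),
    "gene_c": (0.02, 6.0),
    "gene_d": (1e-20, -4.7),
}
for gene, (padj, lfc) in examples.items():
    print(gene, volcano_label(padj, lfc))
```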

For the de novo assembly, 204 DEGs were identified. For the reference-based assembly, 155 DEGs were identified.

For the de novo assembly, DEGs were plotted as a heatmap, demonstrating differences in up- and down-regulation between samples (Fig. 13). DEGs are also listed in a FASTA file: denovo_degs.fasta.

Figure 13. Heatmap of DEGs identified in de novo assembly analysis.

DEGs of Interest

DEG#1

Identity

Ref Gene ID: Moror_3144
Cufflinks ID: XLOC_009906
Assoc. GO Terms: GO:0009277:C:fungal-type cell wall; GO:0005199:F:structural constituent of cell wall
Description: Hydrophobin 2

Figure 14. Expression levels (FPKM) of hydrophobin-producing gene Moror_3144 between M. roreri mating types.

Description

Moror_3144 is a gene responsible for producing hydrophobin proteins in M. roreri. Hydrophobins are a large family of cysteine-rich proteins that serve as a main component of fungal cell walls. Specifically, hydrophobins help form a hydrophobic sheath on the exterior of fungal spore and hyphal cell walls. Hydrophobins have high surfactant activity, which results from their self-assembly at hydrophilic–hydrophobic interfaces to form an amphipathic monolayer. As a critical component of fungal cell walls, hydrophobins play a key role in fungal interactions with both the external environment and other fungi. In particular, expression of SC3 hydrophobins is responsible for the production of aerial hyphae and the attachment of hyphae to hydrophobic surfaces in basidiomycete fungi. Individual fungal genomes contain multiple hydrophobin genes, possibly because the genes serve different functional roles or are differentially expressed under different environmental conditions or at different developmental stages.

Hypothesis #1

            Our data demonstrated increased expression of Moror_3144 in the MCA mating type. Differential expression of Moror_3144 and other hydrophobin genes could produce structural differences in fungal cell wall composition, rendering mating types incompatible upon initial contact.

Future Experiment #1

Knock out Moror_3144 and other genes responsible for the production of hydrophobin proteins in closely related basidiomycete fungi (e.g. Schizophyllum commune) to see if mating and production of fruiting bodies are altered or inhibited.

Hypothesis #2

SC15 is a secreted protein of 191 a.a. with a hydrophilic N-terminal half and a highly hydrophobic C-terminal half. SC15 is responsible for formation of aerial hyphae and attachment in the absence of the SC3 hydrophobin. Mating types with lower expression of SC3 genes should therefore show increased expression of genes producing SC15 protein, and expression levels of SC15 should increase when hydrophobins are knocked down.

Future Experiment #2

Silence expression of known SC3 genes and analyze expression levels of SC15 genes in both mating types. In our dataset, we looked at expression levels of known SC3 and SC15 genes in M. roreri.

Hydrophobin | SC15
Moror_3144 | Moror_16098
Moror_3864 | Moror_9579
Moror_3142 | Moror_2440
Moror_3141 |

Figure 15. Expression levels (FPKM) of hydrophobin and SC15 genes between M. roreri mating types.