IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three...


Page 1

iPlant Collaborative Discovery Environment RNA-seq Basic Analysis

http://preview.iplantcollaborative.org/de/

Log in with your iPlant ID. Three orange icons appear on the left side: Data, Apps, and Analyses.

1. Select Apps (Applications).

2. When the Apps window opens, search for the application of interest, in this case FastQC, since the first task is to determine whether the fastq data files are high quality.

3. Click on the “I” icon for more information.

Page 2

The full name can now be read: FastQC 0.10.1 (multi-file).

Open the User Manual link.

With the User Manual open, Quick Start and Test Data may be useful.

Page 3

Click on the App name and the App window opens. Modify the Analysis Name so it is easy to recall what was done. Add Comments; this will greatly help when you try to figure out what was done at a later date.

Scroll down using the bar (red arrow) to Select output folder; keep the default: /iplant/home/youraccountname/analyses.

Files can be moved around later.

Click on Select input data.

Page 4

Click on + Add.

A list of folders opens; click on the folder with the fastq files, check those that need QC, then click OK.

Check to make sure all files are present then Launch Analysis.

Page 5

A notification will briefly appear that lets you know App was successfully launched. Click on orange Analyses icon to open Analyses window to track progress. You can also look at notifications on top right side. Be sure to refresh Analyses window when you check the status.

Wait, this will take hours. Status will change to Running and then Completed. You may get Idle (no problem) or Failed (did not work).

Page 6

An email will arrive to let you know when the run is Completed or Failed.

Go to the Analyses folder for the output; the status will read Completed (!). Click on the name to go to the output folder.

The Data window now opens (one of the three orange icons on the DE desktop). Under analyses, the folder for this particular analysis will be selected and the subfolders can be seen to the right. Click on a folder that has one of the names of your fastq input files. You can share this output easily using the share icon and selecting a collaborator’s name.


Page 7

Once in the selected fastqc folder, click on images.

Slide the name bar to the right so that the entire name can be read. Select per_base_quality.png.

Page 8

Y-axis = Phred score (sequence quality):

40 = 1/10,000 chance of error
30 = 1/1,000 chance of error
20 = 1/100 chance of error
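The relationship behind these numbers is the standard Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative, not part of any iPlant or FastQC tool):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability P."""
    return -10 * math.log10(p)


# Reproduce the three reference points listed above.
for q in (20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: 1 in {round(1 / p):,} chance of error")
```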

28 and above are coded green; high quality.

The central red line is the median value.

The yellow box represents the inter-quartile range (25-75%).

The upper and lower whiskers represent the 10% and 90% points.

The blue line represents the mean quality.

The mean quality is quite high for this fastq file. Check all files. If quality goes way down at the ends, trimming may be needed.
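To make the box-and-whisker elements concrete, here is a rough sketch of how the values at one read position could be computed from a set of reads. It assumes the quality scores are already decoded to integers and that all reads have the same length; FastQC's own percentile method may differ in detail, and the function names are hypothetical.

```python
import statistics


def summarize_position(scores):
    """Summary values drawn at one read position of a per-base quality plot."""
    quartiles = statistics.quantiles(scores, n=4, method="inclusive")
    deciles = statistics.quantiles(scores, n=10, method="inclusive")
    return {
        "mean": statistics.fmean(scores),      # blue line
        "median": statistics.median(scores),   # central red line
        "lower_quartile": quartiles[0],        # bottom of yellow box (25%)
        "upper_quartile": quartiles[2],        # top of yellow box (75%)
        "p10": deciles[0],                     # lower whisker
        "p90": deciles[8],                     # upper whisker
    }


def positions_below(reads, threshold=28):
    """1-based read positions whose mean quality falls below the threshold.

    `reads` is a list of equal-length lists of Phred scores (an assumption
    of this sketch). The default of 28 is the green cutoff mentioned above.
    """
    return [
        position
        for position, column in enumerate(zip(*reads), start=1)
        if statistics.fmean(column) < threshold
    ]
```

If `positions_below` returns mostly positions near the end of the reads, that is the pattern where trimming may be needed.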

Page 9

This per_base_quality graph shows a dip at 27-28 nucleotides; however, the mean quality score (blue line) stays above 28. This fastq file is of acceptable quality.

Page 10

This per_base_quality graph shows a steady decline in quality. The reads must be filtered for higher quality.

Page 11

If you are concerned about any of your fastq files, screen them using a quality filter. Search in Apps and select the appropriate App.

Add information to the output name and be sure to use Comments!

Under Select input data, browse for the appropriate fastq file. Under Options, the default is a minimum quality score of 20 for 50% of bases; increase this to 75%. The correct type of Illumina encoding can be seen at the top of the per_base_quality graph.
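The filter rule can be sketched as follows: a read is kept only if at least the chosen percentage of its bases meet the minimum quality. This is an illustration of the logic, not the App's actual code. The encoding matters because it sets the ASCII offset used to decode the quality string: 33 for Sanger / Illumina 1.8+ and 64 for older Illumina 1.3 to 1.7 files. The function names and the treatment of the boundary case (exactly 75% passes) are assumptions of this sketch.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def passes_filter(quality_string, min_quality=20, min_percent=75, offset=33):
    """True if at least min_percent of the bases have quality >= min_quality."""
    scores = decode_qualities(quality_string, offset)
    if not scores:
        return False  # an empty read cannot pass
    good = sum(1 for score in scores if score >= min_quality)
    return 100 * good / len(scores) >= min_percent


# 'I' decodes to Phred 40 and '#' to Phred 2 with the offset of 33.
print(passes_filter("III#"))                   # 75% of bases are good
print(passes_filter("II##"))                   # only 50% of bases are good
print(passes_filter("II##", min_percent=50))   # the default 50% setting
```

Decoding with the wrong offset shifts every score by 31, which is why the encoding shown on the graph should be checked before filtering.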

Launch Analysis. Do this for all fastq files that have dips in quality.

Page 12

Let them run! When complete, go to the Analyses folder. The output is fastq_quality_filter_out.fastq. The name needs to be changed to reflect the sample sequenced, in this case 29d-3. You can see why it is good to put the sample name in the run name!

Page 13

Check the box next to the file name. Under Edit, click on Rename…

Change the file name, but retain .fastq at the end.

Create a new folder for the quality filtered fastq files: under Edit, click on New Folder, choose a location, and give an appropriate folder name.

Rename all quality filtered fastq files and move them into the new folder.

Page 14

Remaining FASTX_quality_filtered folders in Analyses Folder only contain logs and can be deleted.

Open the Analyses folder. Check the files to be deleted. Under Files, click on Move to Trash.

Now all quality filtered fastq files are in a folder, and ready for FASTQC to see if data are improved.

Use App to redo fastqc analysis as before.

Page 15

Two per_base_quality graphs are shown: before the quality filter, and after the quality filter (Phred above 20 for 75% of sequence).

The mean quality (blue line) is somewhat higher after the quality filter. This demonstrates that the dip in the middle is not a major concern.

Page 16

The next step is to align the quality filtered sequence reads to the genome. These reads are from spliced RNA so the alignment programs must take the presence of introns into account. Tophat is commonly used for alignment with eukaryotes.

Open Apps, and search for Tophat. These datasets are single-end (SE) reads, so choose Tophat2-SE. PE is for paired-end reads.

Add Input fastq files, using quality filtered files.

Select Reference Genome from list. If your genome is not on list, obtain FASTA file from appropriate genome project.

Page 17

Under Analysis Options

Use default Anchor length, but you may want to change minimum and maximum intron length depending on your organism. Arabidopsis has smaller introns. After these values are set, Launch Analysis.
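For readers who want to see what these options correspond to outside the Discovery Environment, the sketch below assembles an equivalent command line for the TopHat 2 program. The long option names are those of the TopHat command-line tool; the index prefix, file names, and intron lengths are hypothetical placeholders, not recommended values, and the DE App may set other options behind the scenes. The command is only built and printed, not run.

```python
import shlex


def build_tophat_command(index_prefix, fastq_path, output_dir,
                         min_anchor=None, min_intron=None, max_intron=None):
    """Assemble a tophat2 command; options left as None keep TopHat's defaults."""
    command = ["tophat2", "--output-dir", output_dir]
    if min_anchor is not None:
        command += ["--min-anchor-length", str(min_anchor)]
    if min_intron is not None:
        command += ["--min-intron-length", str(min_intron)]
    if max_intron is not None:
        command += ["--max-intron-length", str(max_intron)]
    # Positional arguments: Bowtie index prefix, then the reads file.
    command += [index_prefix, fastq_path]
    return command


# Placeholder values for illustration only; choose intron limits for your organism.
command = build_tophat_command(
    "genome_index", "29d-3.fastq", "tophat_out",
    min_intron=40, max_intron=5000,
)
print(shlex.join(command))
```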

Refresh Analysis window and check that your Tophat job is running.

Page 18

When Tophat2_SE is Completed (you will receive an email), click on analysis name and you will be taken to the Output Folder.

Open the bam folder. This folder can be shared with collaborators.

Widen the name column to read the names of the bam and bai (bam index) files. Note that bam files are much larger than bai files (GB vs. KB). The bam file is a compressed binary file with sequence alignment data (the binary counterpart of the text SAM format). A bam and a bai file will be generated for each input fastq file. These are the input for the next step in the analysis.
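If a bam file is downloaded, a quick sanity check is possible without special tools. BAM data is stored in BGZF blocks, which are gzip-compatible, and the decompressed stream begins with the four magic bytes `BAM\1`. The sketch below relies only on Python's standard gzip module; the function name is hypothetical, and a full check of the file would need a dedicated tool such as samtools.

```python
import gzip

BAM_MAGIC = b"BAM\x01"  # first four bytes of the decompressed BAM stream


def looks_like_bam(path):
    """True if the file is gzip/BGZF-compressed and starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed, truncated, or unreadable: not a usable bam file.
        return False
```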

Page 19

The next step is to count and compare reads at each gene locus and determine whether read count values are significantly different between samples.

Search Apps for cuffdiff and select Cuffdiff2.

Under Input data, give a clear name to each sample. Click Add and select the bam files from the Tophat2 output. Multiple samples can be added at one time. Do this for all samples.

Add the Reference Annotation and Reference Genome. The annotation is a GTF (Gene Transfer Format) file, a gene list by chromosome location, so the analysis is limited to known genes.
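A GTF file is plain text with nine tab-separated columns per feature: sequence name, source, feature type, start, end, score, strand, frame, and an attribute list such as gene_id and transcript_id. As a rough sketch of how one line is read (the example line is made up, and attribute values that themselves contain semicolons are not handled here):

```python
def parse_gtf_line(line):
    """Split one GTF feature line into a dictionary of its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    parsed_attributes = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed_attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),   # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": parsed_attributes,
    }


# Hypothetical example line, not taken from a real annotation.
example = 'chr1\texample\texon\t1000\t1500\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
record = parse_gtf_line(example)
print(record["attributes"]["gene_id"], record["end"] - record["start"] + 1)
```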

Page 20

Analysis Options

If samples are a time series, checking “treat sample files as a time series” will only compare adjacent time points.

The False Discovery Rate (FDR) can be reduced to 0.01 if more stringency is needed.
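Both options can be illustrated with a short sketch. The first function lists which sample pairs get compared with and without the time series option. The second applies the Benjamini-Hochberg procedure, the multiple-testing correction the Cuffdiff documentation describes, to turn p-values into FDR-adjusted q-values. This is an illustration of the idea rather than Cuffdiff's own code, and the sample names and p-values are made up.

```python
from itertools import combinations


def planned_comparisons(samples, time_series=False):
    """Pairs of samples to test: adjacent pairs for a time series, else all pairs."""
    if time_series:
        return list(zip(samples, samples[1:]))
    return list(combinations(samples, 2))


def benjamini_hochberg(p_values):
    """FDR-adjusted q-values, returned in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is capped by the ones above it.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


samples = ["0h", "6h", "12h", "24h"]  # hypothetical time points
print(len(planned_comparisons(samples)), "comparisons without the time series option")
print(len(planned_comparisons(samples, time_series=True)), "comparisons with it")

q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
for fdr in (0.05, 0.01):
    print(f"FDR {fdr}: {sum(q <= fdr for q in q_values)} significant")
```

Lowering the FDR from 0.05 to 0.01 can only shrink the list of genes called significant, which is the added stringency referred to above.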

Launch Analysis!

Wait until Completed…

Page 21

When complete, click on the Analysis name to get to the folder within the Analyses window. Open the cuffdiff_out folder. A number of files appear; gene_exp.diff will be the most important. It should be large (MB). This file can be shared with colleagues. Click on gene_exp.diff to view the data.

Page 22

The columns of the gene_exp.diff file include the gene name, the comparison being made (important if you have more than two samples), the expression value for each sample estimated from its three replicates (Cuffdiff reports FPKM, which is equivalent to RPKM for single-end reads), and the log2(fold change), which is positive if sample 2 is greater than sample 1. The p-value and q-value are shown, and if the q-value is ≤ 0.05, a yes appears in the significant column.
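Once downloaded, gene_exp.diff is a tab-delimited text file and can be screened with a few lines of code. The sketch below assumes the standard Cuffdiff column headers (test_id, gene_id, gene, locus, sample_1, sample_2, status, value_1, value_2, log2(fold_change), test_stat, p_value, q_value, significant); the two data rows are invented for illustration.

```python
import csv
import io

HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]

# Invented rows, for illustration only.
EXAMPLE = "\n".join([
    "\t".join(HEADER),
    "\t".join(["geneA", "geneA", "geneA", "chr1:1000-1500", "control", "treated",
               "OK", "10", "40", "2", "5.1", "0.0001", "0.002", "yes"]),
    "\t".join(["geneB", "geneB", "geneB", "chr2:200-900", "control", "treated",
               "OK", "20", "18", "-0.152003", "-0.4", "0.5", "0.7", "no"]),
]) + "\n"


def significant_genes(handle):
    """Return the rows flagged 'yes' in the significant column."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["significant"] != "yes":
            continue
        log2_fc = float(row["log2(fold_change)"])
        hits.append({
            "gene": row["gene"],
            "comparison": (row["sample_1"], row["sample_2"]),
            "log2_fold_change": log2_fc,
            # A positive value means sample 2 is higher than sample 1.
            "higher_in": row["sample_2"] if log2_fc > 0 else row["sample_1"],
            "q_value": float(row["q_value"]),
        })
    return hits


for hit in significant_genes(io.StringIO(EXAMPLE)):
    print(hit["gene"], hit["comparison"], hit["log2_fold_change"], hit["higher_in"])
```

For a real file, replace the in-memory example with `open("gene_exp.diff", newline="")`.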

Downloading instructions for gene_exp.diff files and subsequent Gene Ontology analyses can be found on the Life After Cuffdiff PowerPoint.