Introduction to RNA-Seq Transcriptome profiling with iPlant Jason Williams iPlant / Cold Spring Harbor Laboratory

Page 1:

Introduction to RNA-Seq Transcriptome profiling with iPlant

Jason Williams iPlant / Cold Spring Harbor Laboratory

Page 2:

This training module is designed to demonstrate a workflow in the iPlant Discovery Environment using RNA-Seq for transcriptome profiling.

Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?

RNA-Seq in the Discovery Environment

Page 3:

Scientific Objective

LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF).

Mutations cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response.

We will use RNA-Seq to compare WT and hy5 to identify HY5-regulated genes.

Source: http://www.gla.ac.uk/media/media_73736_en.jpg

Page 4:

Sample Dataset

• Experimental data downloaded from the NCBI Sequence Read Archive (SRA, formerly the Short Read Archive): GEO:GSM613465 and GEO:GSM613466

• Two replicates each of RNA-Seq runs for Wild-type and hy5 mutant seedlings.

Page 5:

RNA-Seq Conceptual Overview

Image source: http://www.bgisequence.com

Page 6:

RNA-seq Sample Read Statistics

• Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).

• Reads retained by TopHat are shown below

Sequence run    WT-1         WT-2         hy5-1        hy5-2
Reads           10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)    445.5        421.3        549.8        511.3
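The sequence totals in the table are consistent with every retained read being 41 bases long, which is the read length given in the FASTQ headers later in the deck. A minimal Python check of that arithmetic, assuming a uniform read length (the variable and function names are illustrative only):

```python
# Reads retained by TopHat, copied from the table above.
reads = {
    "WT-1": 10_866_702,
    "WT-2": 10_276_268,
    "hy5-1": 13_410_011,
    "hy5-2": 12_471_462,
}

READ_LENGTH = 41  # assumption: uniform length, as in the "length=41" FASTQ headers


def megabases(n_reads, read_length=READ_LENGTH):
    """Total sequence in Mbase for n_reads reads of a fixed length."""
    return n_reads * read_length / 1e6


mbase = {run: round(megabases(n), 1) for run, n in reads.items()}
print(mbase)
```

Under that assumption, the rounded values agree with the Seq. (Mbase) row.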

Page 7:

RNA-Seq Data

@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/[...]
@[...] HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:[...]
@[...] HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:

…Now What?

Page 9:

RNA-Seq Data - FastQ
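Each FASTQ record spans four lines: an "@" header, the base calls, a "+" separator, and one quality character per base. The sketch below reads the first two records from the slide and decodes their qualities. It assumes the Phred+33 encoding; the deck does not state the encoding, and the function name is illustrative.

```python
from io import StringIO

# First two records from the slide, one field per line.
FASTQ_TEXT = """\
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
"""


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]
        # Assumption: Phred+33, i.e. quality score = ASCII code - 33.
        yield read_id, sequence, [ord(char) - 33 for char in quality]


records = list(read_fastq(StringIO(FASTQ_TEXT)))
for read_id, sequence, qualities in records:
    print(read_id, len(sequence), min(qualities), max(qualities))
```

The run of "#" characters at the end of the first read decodes to a quality of 2 under this encoding, which is why trailing bases like these are often trimmed before alignment.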

Page 10:

[Figure: the same FASTQ reads, a "Bioinformatician", and binary digits]

Page 11:

*Graphics taken from these publications

Page 12:

The Tuxedo Protocol

*TopHat and Cufflinks require a sequenced genome

Page 13:

The Tuxedo Protocol

Page 14:

Most of RNA-Seq happens before analysis

ENCODE Project RNA-Seq Standards

Page 15:

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam

Your RNA-Seq Data

Your transformed RNA-Seq Data
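The per-sample TopHat and Cufflinks commands on this slide all follow one pattern, so they can be generated from a template rather than typed by hand, which avoids slips such as pairing a read file with the wrong mate. A minimal Python sketch, assuming the C1/C2 and R1 to R3 naming used on the slide; it only builds the command strings and does not run them:

```python
conditions = ["C1", "C2"]
replicates = ["R1", "R2", "R3"]


def tophat_command(sample, threads=8):
    """TopHat command for one paired-end sample, following the slide's naming."""
    return (
        f"tophat -p {threads} -G genes.gtf -o {sample}_thout genome "
        f"{sample}_1.fq {sample}_2.fq"
    )


def cufflinks_command(sample, threads=8):
    """Cufflinks command that assembles the TopHat alignments of one sample."""
    return f"cufflinks -p {threads} -o {sample}_clout {sample}_thout/accepted_hits.bam"


samples = [f"{cond}_{rep}" for cond in conditions for rep in replicates]
commands = [tophat_command(s) for s in samples] + [cufflinks_command(s) for s in samples]
print("\n".join(commands))
```

Because both mate files are derived from the same sample name, each condition is always aligned with its own pair of read files.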

Page 16:

RNA-Seq Analysis Workflow

TopHat (Bowtie)

Cufflinks

Cuffmerge

Cuffdiff

CummeRbund

Your Data

iPlant Data Store

FASTQ

Discovery Environment

Atmosphere

Page 17:

Moving your data in

www.iplantc.org/ds2

Page 18:

Moving your data in

iDrop Desktop – Java Program

Page 19:

Moving your data in

iCommands

Page 20:

The iPlant Discovery Environment

Page 21:

Data preparation

Decompress your data?

Page 22:

Data preparation

Decompress your data?

Page 23:

Data preparation

Pre-process sequences if needed (e.g., Sabre for de-multiplexing reads, and Scythe for removing primer/adapter sequences)

Image from: http://www.westburg.eu/lp/rna-seq-library-preparation

Page 24:

Data preparation

FastQC – Quality Control

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Page 25:

Data preparation

Per Base Sequence Quality

BAD GOOD

• The central red line is the median value

• The yellow box represents the inter-quartile range (25-75%)

• The upper and lower whiskers represent the 10% and 90% points

• The blue line represents the mean quality

Page 26:

Data preparation

BAD GOOD

Fail: most frequently observed mean quality is below 20 (1% error rate)

Per Sequence Quality Scores
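The 1% error rate quoted for a quality of 20 follows from the Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A short Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(quality):
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred quality score for a given base-call error probability."""
    return -10 * math.log10(probability)


for quality in (10, 20, 30, 40):
    print(quality, phred_to_error_probability(quality))
```

Each step of 10 in quality corresponds to a tenfold drop in error probability, so a quality of 30 means roughly one wrong call per thousand bases.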

Page 27:

Data preparation

BAD GOOD

Per Base N Content

Fail: any position shows an N content of >20%.

Page 28:

The iPlant Discovery Environment

GOOD

Sequence Length Distribution

Fail: error if any of the sequences have zero length.

Page 29:

The iPlant Discovery Environment

BAD

Overrepresented Sequences

Fail: module will issue an error if any sequence is found to represent more than 1% of the total
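The 1% rule can be illustrated with a simple count of identical reads. This is a simplified sketch of the idea rather than FastQC's own implementation, and the reads are made-up placeholder strings:

```python
from collections import Counter


def overrepresented(sequences, threshold=0.01):
    """Return {sequence: fraction} for sequences above the given fraction of all reads."""
    total = len(sequences)
    counts = Counter(sequences)
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}


# Hypothetical data: one sequence seen 3 times among 100 reads, the rest unique.
reads = ["GATCGGAAGAGC"] * 3 + [f"READ{i}" for i in range(97)]
flagged = overrepresented(reads)
print(flagged)
```

Only the repeated sequence is flagged; reads that make up exactly 1% of the total do not exceed the threshold.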

Page 30:

TopHat

Page 31:

TopHat

• TopHat is one of many applications for aligning short sequence reads to a reference genome.

• It uses the Bowtie aligner internally.

• Alternatives include BWA, MAQ, OLego, Stampy, and Novoalign.

Page 32:

TopHat

TopHat has a number of parameters and options, and their default values are tuned for processing mammalian RNA-Seq reads.

If you would like to use TopHat for another class of organism, we recommend setting some of the parameters with more strict, conservative values than their defaults.

Usually, setting the maximum intron size to 4 or 5 Kb is sufficient to discover most junctions while keeping the number of false positives low.

- TopHat User Manual
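As a sketch of how that advice translates into a command line, the snippet below builds a TopHat invocation with the maximum intron length lowered to 5 kb through TopHat's -I (--max-intron-length) option. The index name, read file, and output directory are hypothetical placeholders, and the command is only printed, not executed:

```python
import shlex


def tophat_args(index, reads, out_dir, max_intron=5000, threads=8):
    """Argument list for TopHat with a reduced maximum intron length."""
    return [
        "tophat",
        "-p", str(threads),
        "-I", str(max_intron),  # maximum intron length in bases
        "-o", out_dir,
        index,
        *reads,
    ]


# Hypothetical input names, for illustration only.
args = tophat_args("genome_index", ["WT_rep1.fastq"], "WT_rep1_thout")
print(shlex.join(args))
```

Keeping the arguments in a list makes it straightforward to pass them to subprocess.run later without worrying about shell quoting.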

Page 33:

TopHat outputs in IGV

Page 34:

Assembling the Transcripts

Page 35:

Assembling the Transcripts

Page 36:

Assembling the Transcripts

Provide a mask file (gtf/gff)

• Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file.

• Typically contains annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore.

- Cufflinks User Manual

Page 37:

Assembling the Transcripts

1) transcripts.gtf
This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id" and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.

2) isoforms.fpkm_tracking
This file contains the estimated isoform-level expression values (FPKM).

3) genes.fpkm_tracking
This file contains the estimated gene-level expression values (FPKM).

- Cufflinks User Manual
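To make the transcripts.gtf layout concrete, the sketch below splits one tab-separated GTF record into its fixed columns and turns the attribute column into a dictionary. The example record is made up for illustration and is not taken from real Cufflinks output; the function name is also an assumption.

```python
import re

# Attributes look like: key "value"; key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF record into its fixed columns and an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,  # "transcript" or "exon" in Cufflinks output
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }


# Hypothetical record, for illustration only.
example = (
    '1\tCufflinks\ttranscript\t3631\t5899\t1000\t+\t.\t'
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5";'
)
record = parse_gtf_line(example)
print(record["feature"], record["attributes"]["transcript_id"])
```

Grouping records by their transcript_id attribute then recovers each transcript together with its exons.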

Page 38:

Assembling the Transcripts

- Cufflinks User Manual

Page 39:

Merging the Transcriptomes

Cuffmerge is a meta-assembler; Assembly of Cufflinks transcripts / Reference based assembly

Page 40:

Comparing wild-type to hy5 transcriptomes

Page 41:

• Cuffdiff evaluates the variation in read counts for each gene across the replicates; this estimate is used to calculate the significance of expression changes.

• Cuffdiff can identify genes that are differentially spliced or differentially regulated via promoter switching. Isoforms of a gene that share the same TSS are grouped.

• Detection rate of differentially expressed genes/transcripts is strongly dependent on sequencing depth

Comparing wild-type to hy5 transcriptomes

Page 42:

Comparing wild-type to hy5 transcriptomes

Changes in fragment counts ≠ changes in expression

True expression is estimated by the sum of the length-normalized isoform read counts so the entire transcript must be taken into account.
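A toy calculation shows why raw fragment counts can mislead. In this hypothetical gene with a 1 kb and a 2 kb isoform, switching from the short isoform to the long one doubles the fragment count while the length-normalized expression stays the same. All numbers are invented, and the function is a simplification, not Cuffdiff's statistical model:

```python
def gene_expression(isoforms):
    """Sum of fragments per kilobase over a gene's isoforms.

    isoforms: list of (fragment_count, isoform_length_bp) pairs.
    """
    return sum(fragments / (length_bp / 1000) for fragments, length_bp in isoforms)


# Hypothetical gene with a 1 kb isoform and a 2 kb isoform.
condition_a = [(100, 1000), (0, 2000)]  # only the short isoform is expressed
condition_b = [(0, 1000), (200, 2000)]  # only the long isoform is expressed

counts_a = sum(fragments for fragments, _ in condition_a)
counts_b = sum(fragments for fragments, _ in condition_b)

print(counts_a, counts_b)
print(gene_expression(condition_a), gene_expression(condition_b))
```

A count-based comparison would report a twofold change here, whereas the per-isoform, length-normalized estimate reports none.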

Page 43:

Cuffdiff Results

1) FPKM tracking files
Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group.

(tss_groups.fpkm_tracking tracks summed FPKM of transcripts sharing tss_ids)

2) Count tracking files
Estimate of the number of fragments that originated from each transcript, primary transcript, and gene in each sample.

3) Read group tracking files
Expression and fragment count for each transcript, primary transcript, and gene in each replicate.

4) Differential expression tests
A tab-delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences.

Plus several other outputs (diff splicing, CDS, promoter, etc.)
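FPKM stands for fragments per kilobase of transcript per million mapped fragments, and the gene-level value is the sum over the gene's transcripts, as described for the FPKM tracking files. The sketch below uses the textbook definition with invented numbers; Cuffdiff itself estimates transcript abundances with a statistical model, so this is only an approximation of the idea:

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


TOTAL_MAPPED = 10_000_000  # hypothetical library size

# Hypothetical gene with two transcripts: (fragments, length in bp).
transcripts = {"transcript_1": (500, 2000), "transcript_2": (100, 1000)}

transcript_fpkm = {
    name: fpkm(fragments, length, TOTAL_MAPPED)
    for name, (fragments, length) in transcripts.items()
}
gene_fpkm = sum(transcript_fpkm.values())  # gene FPKM = sum of its transcripts' FPKMs

print(transcript_fpkm, gene_fpkm)
```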

Page 44:

Differentially expressed genes

Example filtered Cuffdiff results generated in the Discovery Environment.
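One way to produce such a filtered list outside the Discovery Environment is to read Cuffdiff's gene-level differential expression table and keep only rows that were tested successfully and flagged as significant. The sketch below uses the column names of the gene_exp.diff file; the rows are fabricated placeholders, and the fold-change cutoff is an arbitrary choice:

```python
import csv
from io import StringIO

HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]

# Fabricated rows, for illustration only.
ROWS = [
    ["XLOC_000001", "XLOC_000001", "GENE_A", "1:100-900", "WT", "hy5", "OK",
     "50.0", "12.5", "-2.0", "-4.1", "0.0001", "0.002", "yes"],
    ["XLOC_000002", "XLOC_000002", "GENE_B", "1:2000-3500", "WT", "hy5", "OK",
     "20.0", "22.0", "0.1375", "0.3", "0.70", "0.85", "no"],
    ["XLOC_000003", "XLOC_000003", "GENE_C", "2:500-1500", "WT", "hy5", "NOTEST",
     "0.5", "0.4", "-0.3219", "0", "1", "1", "no"],
]
table = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def significant_genes(handle, min_abs_log2fc=1.0):
    """Rows with status OK, significant == yes, and a large enough fold change."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        if row["status"] == "OK"
        and row["significant"] == "yes"
        and abs(float(row["log2(fold_change)"])) >= min_abs_log2fc
    ]


hits = significant_genes(StringIO(table))
print([row["gene"] for row in hits])
```

With a real file, the StringIO object would be replaced by an open file handle.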

Page 45:

Differentially expressed transcripts

Example filtered Cuffdiff results generated in the Discovery Environment.

Page 46:

Density Plot

Page 47:

Scatter Plot

Page 48:

Volcano Plot

Page 49:

Expression Plots