Practical RNA-seq Data Analysis BaRC Hot Topics – March 31, 2016 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/

Practical RNA-seq Data Analysis

BaRC Hot Topics – March 31, 2016

Bioinformatics and Research Computing

Whitehead Institute

http://barc.wi.mit.edu/hot_topics/

Experimental design

Things to consider:

• Replication (especially biological)
• Multiplexing
• Read length
• Read depth
• Paired or unpaired
• Stranded or unstranded
• Batch effect

Data analysis pipeline

Quality Control

Mapping

Quantifying

Differential expression

Presenting the results

Quality control

• Check the quality of the raw-read file using FastQC

Command:

fastqc <reads.fastq>

• Resources:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
https://sequencing.qcfail.com/
http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads

http://jura.wi.mit.edu/bio/education/hot_topics/

Responding to quality issues

• Method 1:
  – Keep all reads as is
  – Map as many as possible

• Method 2:
  – Drop all poor-quality reads
  – Trim poor-quality bases
  – Trim adapter/vector
  – Map only good-quality bases

• Which makes more sense for your experiment?
• Consider doing both and comparing the outcomes.
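To make "trim poor-quality bases" and "drop poor-quality reads" from Method 2 concrete, here is a minimal Python sketch. It is an illustration only, not a substitute for a dedicated trimming tool. The function names, the quality cutoff of 20, the minimum length of 25 and the Phred+33 offset are all assumptions chosen for the example.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until a base with quality >= min_q is found.

    min_q and offset are illustrative assumptions (Phred+33 encoding).
    """
    keep = len(seq)
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1
    return seq[:keep], qual[:keep]


def keep_read(seq, min_len=25):
    """Decide whether a trimmed read is still long enough to map (assumed cutoff)."""
    return len(seq) >= min_len


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33, so the last three bases are trimmed.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)  # ACGTA IIIII
```

Running the mapping once on the untouched reads and once on reads filtered this way is one way to carry out the "do both and compare" suggestion.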

Mapping reads: various mappers

Various mappers for spliced alignment:

TopHat (recommended), STAR, GSNAP, MapSplice, PALMapper, ReadsMap, GEM, PASS, GSTRUCT, BAGET…

Pär G Engström et al. Nature Methods | VOL.10 NO.12 | DECEMBER 2013

Mapping reads with tophat2

Command:

tophat [options] <genome_index_base> <reads.fastq>

Genome index location: /nfs/genomes/

Commonly considered parameters:

quality score encoding    Find out from the FastQC result
--segment-length          Shortest length of a spliced read that can map to one side of the junction (default: 25)
--no-novel-juncs          Only look at reads across junctions in the supplied GTF file
-G <GTF file>             Map reads to the virtual transcriptome (from the GTF file) first
-N                        Maximum number of mismatches in a read (default: 2)
-o/--output-dir           Default: tophat_out
--library-type            fr-unstranded, fr-firststrand, fr-secondstrand
-I/--max-intron-length    Default: 500000

http://ccb.jhu.edu/software/tophat/manual.shtml

Mapping reads: quality score encoding

• Fastq format:

@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG
+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh

Line 1: @seq identifier (ends in /1 or /2 for paired-end reads)
Line 2: seq
Line 3: +any description
Line 4: seq quality values

Input qualities    Illumina versions
--solexa-quals     <= 1.2
--phred64          1.3-1.7
--phred33          >= 1.8

http://en.wikipedia.org/wiki/FASTQ_format
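FastQC reports the encoding, but the same guess can be sketched from the quality characters themselves. The Python sketch below is a heuristic under stated assumptions: Phred+33 qualities start at ASCII 33, Solexa+64 at ASCII 59 and Phred+64 at ASCII 64, as described on the FASTQ format page linked above. The function name and the returned labels are made up for this example, and a handful of uniformly high-quality Phred+33 reads could be misclassified, so it should be run over many reads.

```python
def guess_quality_encoding(quality_strings):
    """Guess the FASTQ quality encoding from the lowest quality character seen.

    Heuristic only: the result is a best guess, not a guarantee.
    """
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    if lowest < 59:
        return "Phred+33"   # characters below ';' only occur with offset 33
    if lowest < 64:
        return "Solexa+64"  # characters ';' to '?' imply negative Solexa scores
    return "Phred+64"       # nothing below '@' was observed


# Quality line of the example read shown above.
example_quality = "hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"
print(guess_quality_encoding([example_quality]))  # Phred+64
```

For the example read, the lowest character is "d" (ASCII 100), which is why the sketch reports a 64-offset encoding.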

Mapping reads: Optimize mapping across introns

• TopHat default parameters are designed for mammalian RNA-seq data.
  -I/--max-intron-length: default is 500,000

• Reduce “maximum intron length” for non-mammalian organisms

Species        Max_intron_length
yeast          2,484
arabidopsis    11,603
C. elegans     100,913
fly            141,628
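As a sketch of how the species table could feed the mapping command, the Python snippet below assembles a tophat command line using only options listed on the earlier parameter slide (-I, -G, -o, --library-type). The function name, the dictionary and the file names in the usage line are hypothetical, and the snippet only builds and prints the command; it does not run TopHat.

```python
import shlex

# Maximum intron lengths copied from the table above.
MAX_INTRON_LENGTH = {
    "yeast": 2484,
    "arabidopsis": 11603,
    "C. elegans": 100913,
    "fly": 141628,
}


def build_tophat_command(genome_index_base, reads, species=None, gtf=None,
                         output_dir="tophat_out", library_type="fr-unstranded"):
    """Return the tophat command as a list of arguments (hypothetical helper)."""
    cmd = ["tophat", "-o", output_dir, "--library-type", library_type]
    if species in MAX_INTRON_LENGTH:
        # Species not in the table keep the default of 500000.
        cmd += ["-I", str(MAX_INTRON_LENGTH[species])]
    if gtf:
        cmd += ["-G", gtf]
    cmd += [genome_index_base, reads]
    return cmd


# Hypothetical index and file names, for illustration only.
cmd = build_tophat_command("yeast_index", "reads.fastq", species="yeast", gtf="genes.gtf")
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list makes it straightforward to pass to subprocess.run later without shell-quoting problems.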

Mapping reads: inspect result

• Inspect mapping result in a genome browser

• Remap if necessary

Quantifying Expression

• Different ways of quantifying expression level for different purposes

[Figure: Gene 1 and Gene 2 in Sample 1 and Sample 2, labeled "RPKM / FPKM" and "Normalized count"]

RPKM / FPKM: Reads (Fragments) Per Kilobase per Million mapped reads
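The definition above translates directly into a formula: the read count is divided by the gene length in kilobases and by the number of mapped reads in millions. A minimal Python sketch follows; the function name and the numbers in the example are invented for illustration. FPKM uses the same arithmetic with fragments counted instead of reads.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```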

Counting methods

• htseq-count (recommended) http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
  – Output is raw counts at gene level

• Cufflinks http://cole-trapnell-lab.github.io/cufflinks/
  – Output is FPKM and related statistics

• Bedtools (intersectBed; coverageBed) http://bedtools.readthedocs.org/en/latest/
  – Output is raw counts (but may need post-processing)

• featureCounts http://bioinf.wehi.edu.au/featureCounts/
  – Output is raw counts at feature level and gene level

Using htseq-count

Command:

htseq-count [options] <alignment_file> <gff_file>

Frequently considered options    Default
-r <order>                       name
-s <yes/no/reverse>              yes
-t <feature type>                exon
-m <mode>                        union
-f, -i, -o, -q, -h               …

[Figure: htseq-count "modes"]

Keep in mind:

Only unique alignments are counted!!
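Once htseq-count has been run on each sample, the per-sample outputs usually need to be combined into one count table before differential expression analysis. The Python sketch below shows one way to do that. It assumes each output is two tab-separated columns (feature ID, count) and that summary rows such as "__no_feature" start with two underscores; older HTSeq versions may label those rows differently, so that check is an assumption to verify against your files. The function names and the sample data are hypothetical.

```python
def parse_htseq_counts(lines):
    """Parse htseq-count output lines into {gene_id: count}, skipping summary rows."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):  # assumed prefix of summary rows
            continue
        counts[feature] = int(value)
    return counts


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into (sample_names, {gene: [counts per sample]})."""
    samples = list(per_sample)
    genes = sorted(set().union(*[c.keys() for c in per_sample.values()]))
    table = {g: [per_sample[s].get(g, 0) for s in samples] for g in genes}
    return samples, table


# Made-up output for two samples, for illustration only.
sample1 = ["geneA\t10\n", "geneB\t0\n", "__no_feature\t5\n"]
sample2 = ["geneA\t7\n", "geneB\t3\n", "__no_feature\t2\n"]
samples, table = merge_counts({
    "sample1": parse_htseq_counts(sample1),
    "sample2": parse_htseq_counts(sample2),
})
print(samples, table)
```

In practice each list of lines would come from opening the corresponding output file.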

Differential Expression Analysis

• Normalization

• Estimating variability (dispersion)

Differential Expression Analysis

• Normalization
  - Problem with simply normalizing to total reads:
    [Figure: example with total reads of 200 and 400]

• Variation (dispersion of data)

Preferred normalization method - geometric (implemented in DESeq, cuffdiff)

1. Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.

   Geometric mean of a gene's counts $x_1, \dots, x_n$ across $n$ samples: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$

2. Calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample.

3. Take the median of all the quotients to get the relative depth of the library and use it as the scaling factor.

          Sample 1    Sample 2    Sample 3    Sample 4
Gene 1    2135        615         128         161
          (447)       (586)       (346)       (288)
Gene 2    600         58          103         189
Gene 3    3150        1346        68          88
Gene 4    378         187         11          22

Values in parentheses are the Gene 1 counts divided by the size factors below.

> cds = estimateSizeFactors(cds)
> sizeFactors(cds)
 Sample 1  Sample 2  Sample 3  Sample 4
4.7772242 1.0490870 0.3697529 0.5590669
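The three steps of the geometric normalization can be sketched in a few lines of Python. This is an illustration of the procedure as described on the slide, not DESeq's own code. The function name is hypothetical, and skipping genes that have a zero count in any sample (whose geometric mean would be zero) is an assumption of this sketch. Because it uses only the four genes shown in the table, its size factors will differ from the sizeFactors(cds) values on the slide, which were presumably computed from the full dataset.

```python
import math
from statistics import median


def size_factors(count_rows):
    """Median-of-ratios size factors; count_rows holds one list of counts per gene."""
    n_samples = len(count_rows[0])
    # Assumption: genes with a zero count in any sample are left out.
    usable = [row for row in count_rows if all(c > 0 for c in row)]
    # Step 1: reference sample = geometric mean of each gene across samples.
    reference = [math.exp(sum(math.log(c) for c in row) / len(row)) for row in usable]
    # Steps 2 and 3: per-sample quotients against the reference, then their median.
    return [
        median([row[j] / ref for row, ref in zip(usable, reference)])
        for j in range(n_samples)
    ]


# The four genes from the table above (Sample 1 to Sample 4).
counts = [
    [2135, 615, 128, 161],
    [600, 58, 103, 189],
    [3150, 1346, 68, 88],
    [378, 187, 11, 22],
]
factors = size_factors(counts)
print(factors)

# Normalized counts = raw counts divided by the sample's size factor.
normalized_gene1 = [c / f for c, f in zip(counts[0], factors)]
print(normalized_gene1)
```

Using the median rather than the total makes the factor robust to a few very highly expressed genes, which is the problem with total-read normalization raised earlier.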

Differential Expression Analysis

Dispersion affects ability to differentiate

[Figure: two panels, each comparing x1 and x2 along x]

Methods for estimating dispersion implemented by DESeq

pooled           Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment. (Default)

per-condition    Each replicated condition receives its own model. Only available when all conditions have replicates.

blind            All samples are treated as replicates of a single global "condition" and used to build one model.

pooled-CR        Fit models according to modelFormula and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL).

Differential expression analysis with DESeq or edgeR

• Do it yourself (requires some knowledge of R)
  - Visit the webpages of DESeq or edgeR
    http://bioconductor.org/packages/release/bioc/html/DESeq.html
    https://bioconductor.org/packages/release/bioc/html/edgeR.html
  - Read the documentation
  - Follow the instructions and examples

• Use R scripts developed by BaRC
  - Get the scripts from /nfs/BaRC_Public/BaRC_Code/R/

./RunDESeq.R <input_file> <output_file> <Groups> <Method>

./get_DE_genes_with_edgeR.R <input_file> <Groups> <output_file> <fdr> <foldChange>

Interpreting DESeq output

Columns:

id: Gene ID (from GTF file)
baseMean: Mean normalized counts
baseMean.CEU, baseMean.YRI: Mean normalized counts for each group
YRI/CEU: Fold change
log2(YRI/CEU): Log2 (fold change)
pval: Raw p-value
padj: FDR p-value
CEU_NA07357, CEU_NA11881, YRI_NA18502, YRI_NA19200: Raw counts
CEU_NA07357_norm, CEU_NA11881_norm, YRI_NA18502_norm, YRI_NA19200_norm: Normalized counts = raw / (size factor)

Statistics:

id                 baseMean    baseMean.CEU    baseMean.YRI    YRI/CEU    log2(YRI/CEU)    pval        padj
ENSG00000213442    74.71       149.06          0.36            0.00       -8.69            8.13E-18    1.55E-13
ENSG00000138061    83.15       159.78          6.52            0.04       -4.62            5.61E-12    5.36E-08
ENSG00000198618    11.71       23.42           0.00            0.00       #NAME?           1.51E-06    0.00
ENSG00000134184    8.17        0.00            16.34           Inf        Inf              0.000211    0.11

Inf in the YRI/CEU column: division by 0
#NAME? in the log2(YRI/CEU) column: -Inf

Raw counts:

id                 CEU_NA07357    CEU_NA11881    YRI_NA18502    YRI_NA19200
ENSG00000213442    169            145            1              0
ENSG00000138061    184            153            10             4
ENSG00000198618    31             19             0              0
ENSG00000134184    0              0              15             15

Normalized counts:

id                 CEU_NA07357_norm    CEU_NA11881_norm    YRI_NA18502_norm    YRI_NA19200_norm
ENSG00000213442    148.28              149.85              0.72                0
ENSG00000138061    161.44              158.12              7.2                 5.83
ENSG00000198618    27.2                19.64               0                   0
ENSG00000134184    0                   0                   10.8                21.88

sizeFactors (from DESeq):

CEU_NA07357    CEU_NA11881    YRI_NA18502    YRI_NA19200
1.1397535      0.9676201      1.3891488      0.6856959
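The relationship "normalized counts = raw / (size factor)" and the origin of the Inf and -Inf entries can be checked by hand. The Python sketch below recomputes the first row (ENSG00000213442) from the raw counts and size factors shown on this slide. The function name and the explicit handling of zero group means are choices made for this illustration, not DESeq's implementation.

```python
import math

# Size factors and raw counts for ENSG00000213442, copied from the slide.
size_factor = {
    "CEU_NA07357": 1.1397535,
    "CEU_NA11881": 0.9676201,
    "YRI_NA18502": 1.3891488,
    "YRI_NA19200": 0.6856959,
}
raw = {"CEU_NA07357": 169, "CEU_NA11881": 145, "YRI_NA18502": 1, "YRI_NA19200": 0}


def summarize(raw_counts, factors, group_a, group_b):
    """Return (mean_a, mean_b, fold_change b/a, log2 fold change)."""
    norm = {s: raw_counts[s] / factors[s] for s in raw_counts}
    mean_a = sum(norm[s] for s in group_a) / len(group_a)
    mean_b = sum(norm[s] for s in group_b) / len(group_b)
    if mean_a == 0:
        # Division by 0: reported as Inf (or undefined when both means are 0).
        fold = math.inf if mean_b > 0 else math.nan
    else:
        fold = mean_b / mean_a
    if fold == 0:
        log2_fold = -math.inf  # log2 of 0
    elif math.isnan(fold) or math.isinf(fold):
        log2_fold = fold
    else:
        log2_fold = math.log2(fold)
    return mean_a, mean_b, fold, log2_fold


ceu = ["CEU_NA07357", "CEU_NA11881"]
yri = ["YRI_NA18502", "YRI_NA19200"]
mean_ceu, mean_yri, fold, log2_fold = summarize(raw, size_factor, ceu, yri)
print(round(mean_ceu, 2), round(mean_yri, 2), round(log2_fold, 2))
```

The printed values should agree with the slide's baseMean.CEU, baseMean.YRI and log2(YRI/CEU) for this gene to two decimals. Feeding in the raw counts of the last two rows reproduces the Inf and -Inf cases.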

Presenting results

• What do you want to show?

• All-gene scatterplots can be helpful to
  – See level and fold-change ranges
  – Identify sensible thresholds
  – Hint at data or analysis problems

• Heatmaps are useful if many conditions are being compared, but only for gene subsets

• Output normalized read counts with the same method used for DE statistics

• Whenever one gene is especially important, look at the mapped reads in a genome browser

Scatterplots

[Figure: two panels: standard scatterplot; MA (ratio-intensity) plot]
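For readers unfamiliar with the MA plot, each gene is placed at A (the average of the two log2 expression values, the "intensity") on the horizontal axis and M (their difference, the log2 "ratio") on the vertical axis. A minimal Python sketch of the coordinates follows; the function name and the pseudocount of 1, added so that zero counts do not produce log2(0), are assumptions of this example.

```python
import math


def ma_values(x, y, pseudocount=1.0):
    """Return (M, A) for expression values x and y of one gene in two samples."""
    log_x = math.log2(x + pseudocount)
    log_y = math.log2(y + pseudocount)
    m = log_y - log_x        # log2 ratio
    a = (log_x + log_y) / 2  # mean log2 intensity
    return m, a


# With the pseudocount, 3 and 15 become 4 and 16, i.e. log2 values of 2 and 4.
print(ma_values(3, 15))  # (2.0, 3.0)
```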

Clustering and heatmap

Software:

Cluster 3.0:
• Log-transform
• Mean center
• Cluster

Java TreeView:
• Visualize
• Export

Illustrator:
• Assemble

Liang et al. BMC Genomics 2010, 11:173
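The "log-transform" and "mean center" steps listed for Cluster 3.0 can be illustrated outside the GUI. The Python sketch below shows the arithmetic only and is not what Cluster 3.0 runs internally. Centering each gene (row) rather than each sample, and the pseudocount of 1, are assumptions of this example, as is the function name.

```python
import math


def log_transform_and_center(rows, pseudocount=1.0):
    """Log2-transform each row of expression values, then subtract the row mean."""
    centered = []
    for row in rows:
        logged = [math.log2(value + pseudocount) for value in row]
        row_mean = sum(logged) / len(logged)
        centered.append([value - row_mean for value in logged])
    return centered


# One hypothetical gene across three samples: log2 values 1, 2, 3 centered on 2.
print(log_transform_and_center([[1, 3, 7]]))  # [[-1.0, 0.0, 1.0]]
```

After centering, the heatmap colors reflect each gene's change relative to its own average rather than its absolute expression level.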

Further mining of the data

Find enrichment in descriptive terms, regulatory pathways and networks:

Commercial
• Ingenuity Pathway Analysis
• Pathway Studio
• GeneGo
• …

Public domain
• DAVID
• GSEA
• …

Summary

• Experimental design
• Quality control (fastqc)
• Mapping spliced reads (tophat)
• Counting gene levels (htseq-count)
• Identifying "differentially expressed" genes (DESeq R package)
• Exploring your results

Slides and exercise materials are available at http://jura.wi.mit.edu/bio/education/hot_topics/
More practical commands and procedures are available at http://barcwiki.wi.mit.edu/wiki/SOPs

Hands-on practice overview

Fastq file (checked with fastqc)
  → Tophat →
Bam file
  → htseq-count →
Gene read counts
  → DESeq →
Normalized read counts and differential expression