Post on 19-Aug-2020

Practical RNA-seq Data Analysis

BaRC Hot Topics – March 31, 2016

Bioinformatics and Research Computing

Whitehead Institute

http://barc.wi.mit.edu/hot_topics/

Experimental design

Things to consider:

• Replication (especially biological)
• Multiplexing
• Read length
• Read depth
• Paired or unpaired
• Stranded or unstranded
• Batch effect

Data analysis pipeline

Quality Control

Mapping

Quantifying

Differential expression

Presenting the results


Quality control

• Check the quality of the raw reads file using FastQC

Command:

fastqc <reads.fastq>

• Resources:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
https://sequencing.qcfail.com/
http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads

Responding to quality issues

• Method 1:
– Keep all reads as is
– Map as many as possible

• Method 2:
– Drop all poor-quality reads
– Trim poor-quality bases
– Trim adapter/vector
– Map only good-quality bases

• Which makes more sense for your experiment?
• Consider doing both and comparing the outcomes.
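To make "trim poor-quality bases" concrete, here is a minimal sketch of 3' quality trimming. The function name, the Q20 threshold and the Phred+33 offset are illustrative assumptions, not settings from the slides; dedicated trimming tools use more refined rules.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q.

    seq and qual are the sequence and quality lines of one FASTQ record.
    offset is the ASCII offset of the quality encoding (33 assumed here).
    """
    end = len(seq)
    # Walk backwards from the 3' end until a base passes the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: under Phred+33, 'I' encodes Q40 and '#' encodes Q2.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

Mapping the reads once untrimmed and once trimmed this way gives the two outcomes to compare.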


Mapping reads: various mappers

Various mappers for spliced alignment:

TopHat (recommended), STAR, GSNAP, MapSplice, PALMapper, ReadsMap, GEM, PASS, GSTRUCT, BAGET…

Pär G Engström et al. Nature Methods | VOL.10 NO.12 | DECEMBER 2013

Mapping reads with tophat2

Command:

tophat [options] <genome_index_base> <reads.fastq>

Genome indexes: /nfs/genomes/

Commonly considered parameters:

quality score encoding    Find out from the FastQC result
--segment-length          Shortest length of a spliced read that can map to one side of the junction. Default: 25
--no-novel-juncs          Only look at reads across junctions in the supplied GTF file
-G <GTF file>             Map reads to virtual transcriptome (from GTF file) first
-N                        Max. number of mismatches in a read. Default: 2
-o/--output-dir           Default: tophat_out
--library-type            fr-unstranded, fr-firststrand, fr-secondstrand
-I/--max-intron-length    Default: 500000

http://ccb.jhu.edu/software/tophat/manual.shtml

Mapping reads: quality score encoding

• Fastq format:

@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    @seq identifier (/1 or /2: paired-end)
GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG          seq
+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1    +any description
hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh          seq quality values

Input qualities    Illumina versions
--solexa-quals     <= 1.2
--phred64          1.3-1.7
--phred33          >= 1.8

http://en.wikipedia.org/wiki/FASTQ_format
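FastQC reports the encoding, but the ASCII range of the quality characters can also hint at it. The sketch below is a rough heuristic based on the ranges in the FASTQ format description (Phred+33 up to 'J', Solexa+64 from ';', Phred+64 from '@'); the function name and return labels are illustrative, and a few reads can easily be ambiguous.

```python
def guess_quality_encoding(quality_lines):
    """Guess the quality encoding from the ASCII range of quality characters.

    Returns one of the option names used on the slide, or "ambiguous".
    """
    codes = [ord(c) for line in quality_lines for c in line]
    lo, hi = min(codes), max(codes)
    if hi <= 74:
        # No character above 'J': Phred+33 if low characters are present.
        return "--phred33" if lo < 59 else "ambiguous"
    if lo < 59:
        # High and very low characters together fit no single encoding.
        return "ambiguous"
    # Characters above 'J' imply an offset of 64. Codes 59-63 only occur
    # in Solexa scores; without them Phred+64 is the likelier guess.
    return "--solexa-quals" if lo < 64 else "--phred64"


# Quality line of the example read shown above.
example = ["hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"]
print(guess_quality_encoding(example))
```

For the example read the characters range from 'd' to 'h', so the heuristic points to an offset of 64. Checking many reads rather than one makes the guess more reliable.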

Mapping reads: Optimize mapping across introns

• TopHat default parameters are designed for mammalian RNA-seq data.

-I: default is 500,000

• Reduce “maximum intron length” for non-mammalian organisms

Species        Max_intron_length
yeast          2,484
arabidopsis    11,603
C. elegans     100,913
fly            141,628

Mapping reads: inspect result

• Inspect mapping result in a genome browser

• Remap if necessary


Quantifying Expression

• Different ways of quantifying expression level for different purposes

[Figure: Gene 1 and Gene 2 in Sample 1 and Sample 2; labels: RPKM/FPKM, Normalized count]

RPKM (FPKM): Reads (Fragments) Per Kilobase per Million mapped reads
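The RPKM definition translates directly into arithmetic. This is a minimal sketch; the function name and the example numbers are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    kilobases = gene_length_bp / 1e3
    millions_mapped = total_mapped_reads / 1e6
    return read_count / (kilobases * millions_mapped)


# Hypothetical gene: 500 reads on a 2,000 bp gene, 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))
```

FPKM uses the same formula with fragments (read pairs) in place of reads.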

Counting methods

• htseq-count (recommended) http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
– Output is raw counts at gene level

• Cufflinks http://cole-trapnell-lab.github.io/cufflinks/
– Output is FPKM and related statistics

• Bedtools (intersectBed; coverageBed) http://bedtools.readthedocs.org/en/latest/
– Output is raw counts (but may need post-processing)

• featureCounts http://bioinf.wehi.edu.au/featureCounts/
– Output is raw counts at feature level and gene level

Using htseq-count

Command:

htseq-count [options] <alignment_file> <gff_file>

Frequently considered options    default
-r <order>                       name
-s <yes/no/reverse>              yes
-t <feature type>                exon
-m <mode>                        union
-f, -i, -o, -q, -h               …

Keep in mind: only unique alignments are counted!

[Figure: htseq-count "modes"]
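To illustrate what the default mode (-m union) decides for a single read, here is a simplified sketch: the read is assigned to a gene only if exactly one gene overlaps any of its bases. The function name, the half-open coordinates and the toy annotation are assumptions for illustration; strand (-s), spliced reads and multi-mapped reads are ignored, so this is not a replacement for htseq-count.

```python
def assign_read_union(read_start, read_end, gene_exons):
    """Assign one uniquely aligned read to a gene, union-mode style.

    gene_exons maps a gene name to a list of (start, end) exon intervals.
    All intervals are half-open: start is included, end is excluded.
    """
    # Collect every gene with at least one exon overlapping the read.
    overlapping = {
        gene
        for gene, exons in gene_exons.items()
        if any(start < read_end and read_start < end for start, end in exons)
    }
    if not overlapping:
        return "__no_feature"
    if len(overlapping) > 1:
        return "__ambiguous"
    return next(iter(overlapping))


# Hypothetical annotation; coordinates are made up for illustration.
genes = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
print(assign_read_union(150, 186, genes))  # overlaps geneA only
print(assign_read_union(370, 406, genes))  # overlaps geneA and geneB
print(assign_read_union(600, 636, genes))  # overlaps no gene
```

A read that only partly overlaps an exon of one gene is still counted for that gene in this mode; the stricter intersection modes treat such reads differently.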

Differential Expression Analysis

• Normalization

• Estimating variability (dispersion)


- Problem with simply normalizing to total reads:

[Figure: two samples with total reads of 200 and 400]

> cds = estimateSizeFactors(cds)
> sizeFactors(cds)
 Sample 1  Sample 2  Sample 3  Sample 4
4.7772242 1.0490870 0.3697529 0.5590669

Preferred normalization method - geometric (implemented in DESeq, cuffdiff)

1. Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.

2. Calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample.

3. Take the median of all the quotients to get the relative depth of the library and use it as the scaling factor.

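Geometric mean of the counts $x_1, \dots, x_n$ of a gene across the $n$ samples:

$$\left( \prod_{i=1}^{n} x_i \right)^{1/n}$$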

                        Sample 1  Sample 2  Sample 3  Sample 4
Gene 1                      2135       615       128       161
Gene 2                       600        58       103       189
Gene 3                      3150      1346        68        88
Gene 4                       378       187        11        22
Gene 1 / size factor       (447)     (586)     (346)     (288)
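The three steps of the geometric normalization can be sketched in a few lines. This is an illustrative re-implementation, not DESeq's code; the function name is made up, and skipping genes with a zero count in any sample is an assumption (their geometric mean would be 0).

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios scaling factors.

    counts is a list of rows, one row per gene, one count per sample.
    """
    # Assumption: genes with a zero count in any sample are skipped.
    rows = [row for row in counts if all(c > 0 for c in row)]
    # Step 1: reference sample = geometric mean of each gene across samples.
    reference = [math.exp(sum(math.log(c) for c in row) / len(row)) for row in rows]
    factors = []
    for j in range(len(counts[0])):
        # Step 2: quotient of each gene's count over the reference.
        quotients = [row[j] / ref for row, ref in zip(rows, reference)]
        # Step 3: the median quotient is the scaling factor of sample j.
        factors.append(median(quotients))
    return factors


# The four genes shown in the table above.
table = [
    [2135, 615, 128, 161],
    [600, 58, 103, 189],
    [3150, 1346, 68, 88],
    [378, 187, 11, 22],
]
print([round(f, 3) for f in size_factors(table)])
```

With only these four genes the factors will not match the sizeFactors output shown earlier, which was presumably estimated from the full count table. Dividing the Gene 1 counts by those reported factors does reproduce the values in parentheses, for example 2135 / 4.7772242 ≈ 447.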

Differential Expression Analysis

Dispersion affects ability to differentiate

[Figure: two panels, each marking x1 and x2 on the x axis]

Methods for estimating dispersion implemented by DESeq

pooled: Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment. (Default)

per-condition: Each replicated condition receives its own model. Only available when all conditions have replicates.

blind: All samples are treated as replicates of a single global "condition" and used to build one model.

pooled-CR: Fit models according to modelFormula and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL).

Differential expression analysis with DESeq or edgeR

• Do it yourself (requires some knowledge of R)
- Visit the webpages of DESeq or edgeR
  http://bioconductor.org/packages/release/bioc/html/DESeq.html
  https://bioconductor.org/packages/release/bioc/html/edgeR.html
- Read the documentation
- Follow the instructions and examples

• Use R scripts developed by BaRC
- Get scripts from /nfs/BaRC_Public/BaRC_Code/R/

./RunDESeq.R <input_file> <output_file> <Groups> <Method>

./get_DE_genes_with_edgeR.R <input_file> <Groups> <output_file> <fdr> <foldChange>

Interpreting DESeq output

Columns:

id                                    Gene ID (from GTF file)
baseMean                              Mean normalized counts
baseMean.CEU, baseMean.YRI            Mean normalized counts for group
YRI/CEU                               Fold change
log2(YRI/CEU)                         Log2 (fold change)
pval                                  Raw p-value
padj                                  FDR p-value
CEU_NA07357, CEU_NA11881,
YRI_NA18502, YRI_NA19200              Raw counts
columns ending in _norm               Normalized counts = raw / (size factor)

Statistics:

id               baseMean  baseMean.CEU  baseMean.YRI  YRI/CEU  log2(YRI/CEU)  pval      padj
ENSG00000213442  74.71     149.06        0.36          0.00     -8.69          8.13E-18  1.55E-13
ENSG00000138061  83.15     159.78        6.52          0.04     -4.62          5.61E-12  5.36E-08
ENSG00000198618  11.71     23.42         0.00          0.00     -Inf           1.51E-06  0.00
ENSG00000134184  8.17      0.00          16.34         Inf      Inf            0.000211  0.11

Inf: division by 0. The -Inf value appears as #NAME? on the slide.

Raw counts:

id               CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200
ENSG00000213442  169          145          1            0
ENSG00000138061  184          153          10           4
ENSG00000198618  31           19           0            0
ENSG00000134184  0            0            15           15

Normalized counts:

id               CEU_NA07357_norm  CEU_NA11881_norm  YRI_NA18502_norm  YRI_NA19200_norm
ENSG00000213442  148.28            149.85            0.72              0
ENSG00000138061  161.44            158.12            7.2               5.83
ENSG00000198618  27.2              19.64             0                 0
ENSG00000134184  0                 0                 10.8              21.88

sizeFactors (from DESeq):

CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200
1.1397535    0.9676201    1.3891488    0.6856959
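The relation "normalized counts = raw / (size factor)" can be checked by hand for the first gene of the table. The sketch below uses the raw counts of ENSG00000213442 and the size factors listed above; the variable names are illustrative.

```python
import math

# Size factors and raw counts of ENSG00000213442, copied from the slide.
size_factors = {
    "CEU_NA07357": 1.1397535,
    "CEU_NA11881": 0.9676201,
    "YRI_NA18502": 1.3891488,
    "YRI_NA19200": 0.6856959,
}
raw = {"CEU_NA07357": 169, "CEU_NA11881": 145, "YRI_NA18502": 1, "YRI_NA19200": 0}

# Normalized count = raw count / size factor of that sample.
normalized = {sample: raw[sample] / size_factors[sample] for sample in raw}

# baseMean is the mean of the normalized counts over all samples.
base_mean = sum(normalized.values()) / len(normalized)

# Group means and log2 fold change (YRI relative to CEU).
mean_ceu = (normalized["CEU_NA07357"] + normalized["CEU_NA11881"]) / 2
mean_yri = (normalized["YRI_NA18502"] + normalized["YRI_NA19200"]) / 2
log2_fold_change = math.log2(mean_yri / mean_ceu)

print({sample: round(value, 2) for sample, value in normalized.items()})
print(round(base_mean, 2), round(log2_fold_change, 2))
```

Rounded to two decimals, these values agree with the row for ENSG00000213442 (148.28, 149.85, 0.72, 0; baseMean 74.71; log2 fold change -8.69). When one group mean is 0, the ratio or its logarithm is undefined, which is where the Inf and -Inf entries come from.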

Presenting results

• What do you want to show?

• All-gene scatterplots can be helpful to
– See level and fold-change ranges
– Identify sensible thresholds
– Hint at data or analysis problems

• Heatmaps are useful if many conditions are being compared, but only for gene subsets

• Output normalized read counts with the same method used for DE statistics

• Whenever one gene is especially important, look at the mapped reads in a genome browser

Scatterplots

[Figures: standard scatterplot; MA (ratio-intensity) plot]
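The two plots show the same data in different coordinates: a standard scatterplot puts one sample on each axis, while an MA plot puts the log ratio (M) against the mean log intensity (A). A minimal sketch of the transformation follows; the function name is illustrative, and counts of 0 would need a pseudocount first, which is a choice left to the analyst.

```python
import math


def ma_values(x, y):
    """Return (M, A) for one gene with normalized counts x and y.

    M is the log2 ratio y/x, A is the average of the two log2 values.
    Both counts must be greater than 0.
    """
    m = math.log2(y) - math.log2(x)
    a = 0.5 * (math.log2(x) + math.log2(y))
    return m, a


# Hypothetical gene with normalized counts 8 and 32 in the two samples.
print(ma_values(8.0, 32.0))
```

Genes with no change lie on M = 0 regardless of expression level, which makes fold-change thresholds easier to judge than on the diagonal of a standard scatterplot.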

Clustering and heatmap

Software:

Cluster 3.0:
• Log-transform
• Mean center
• Cluster

Java TreeView:
• Visualize
• Export

Illustrator:
• Assemble

Liang et al. BMC Genomics 2010, 11:173
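The first two Cluster 3.0 steps, log-transform and mean center, are simple per-gene operations. The sketch below shows them for one row of normalized counts; the function name and the pseudocount of 1 (to avoid taking the log of 0) are assumptions, not settings from the slides.

```python
import math


def log_transform_and_center(row, pseudocount=1.0):
    """Log2-transform one gene's values, then subtract the gene's mean."""
    logged = [math.log2(value + pseudocount) for value in row]
    mean = sum(logged) / len(logged)
    return [value - mean for value in logged]


# Hypothetical gene measured in three conditions.
print(log_transform_and_center([1, 3, 7]))
```

After centering, each gene's values sum to 0, so the heatmap colors reflect changes relative to the gene's own average rather than its absolute expression level.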

Further mining of the data

Find enrichment in descriptive terms, regulatory pathways and networks:

Commercial:
• Ingenuity Pathway Analysis
• Pathway Studio
• GeneGo
• …

Public domain:
• DAVID
• GSEA
• …

Summary

• Experimental design
• Quality control (fastqc)
• Mapping spliced reads (tophat)
• Counting gene levels (htseq-count)
• Identifying "differentially expressed" genes (DESeq R package)
• Exploring your results

Slides and exercise materials are available at http://jura.wi.mit.edu/bio/education/hot_topics/
More practical commands and procedures are available at http://barcwiki.wi.mit.edu/wiki/SOPs

Hands on Practice overview

Fastq file → fastqc (quality control)

Fastq file → Tophat → Bam file → htseq-count → Gene read counts → DESeq → Normalized read counts and differential expression