Practical RNA-seq Data Analysis

BaRC Hot Topics – March 31, 2016

Bioinformatics and Research Computing

Whitehead Institute

http://barc.wi.mit.edu/hot_topics/

Experimental design

Things to consider:

• Replication (especially biological)
• Multiplexing
• Read length
• Read depth
• Paired or unpaired
• Stranded or unstranded
• Batch effect

Data analysis pipeline

Quality Control

Mapping

Quantifying

Differential expression

Presenting the results

Quality control

• Check quality of file of raw reads using FastQC

Command:

fastqc <reads.fastq>

• Resources:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
https://sequencing.qcfail.com/
http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads
http://jura.wi.mit.edu/bio/education/hot_topics/

Responding to quality issues

• Method 1:

– Keep all reads as is
– Map as many as possible

• Method 2:

– Drop all poor-quality reads
– Trim poor-quality bases
– Trim adapter/vector
– Map only good-quality bases

• Which makes more sense for your experiment?

• Consider doing both and comparing the outcomes.
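The "trim poor-quality bases" step of Method 2 can be illustrated with a minimal Python sketch. It assumes Phred+33 encoded qualities and a hypothetical threshold of 20 (neither comes from the slides), and it only shortens the read from the 3' end; dedicated trimming tools use more refined algorithms.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    seq, qual: sequence and quality strings of equal length.
    min_q:     hypothetical quality threshold (an assumption, not from the slides).
    offset:    ASCII offset of the quality encoding (33 assumed here).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # In Phred+33, 'I' encodes Q40 and '#' encodes Q2.
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

Running both the trimmed and untrimmed reads through the mapper is one way to compare the two methods on the same data.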

Mapping reads: various mappers

Various mappers for spliced alignment:

TopHat (recommended), STAR, GSNAP, MapSplice, PALMapper, ReadsMap, GEM, PASS, GSTRUCT, BAGET…

Pär G Engström et al., Nature Methods, Vol. 10 No. 12, December 2013

Mapping reads with tophat2

Command:

tophat [options] <genome_index_base> <reads.fastq>

Genome indexes: /nfs/genomes/

Commonly considered parameters:

--library-type            (fr-unstranded, fr-firststrand, fr-secondstrand)
-I/--max-intron-length    Default: 500000
--segment-length          Shortest length of a spliced read that can map to one side of the junction. Default: 25
-G <GTF file>             Map reads to virtual transcriptome (from GTF file) first.
--no-novel-juncs          Only look at reads across junctions in the supplied GTF file
-N                        Max. number of mismatches in a read. Default: 2
-o/--output-dir           Default: tophat_out
quality score encoding    Find out from the FastQC result

http://ccb.jhu.edu/software/tophat/manual.shtml

Mapping reads: quality score encoding

• Fastq format:

@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1

GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG

+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1

hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh

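Line 1: @ followed by the sequence identifier (/1 or /2 marks the mate in paired-end data)
Line 2: sequence
Line 3: + followed by any description
Line 4: sequence quality values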
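The four-line record layout can be read with a few lines of Python. This is a minimal sketch that assumes unwrapped records (exactly four lines each) and no trailing blank lines; the function name `read_fastq` is illustrative only.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, description, qualities) for each 4-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], seq, plus[1:], qual


# The record shown on the slide.
EXAMPLE = (
    "@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1\n"
    "GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG\n"
    "+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1\n"
    "hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
identifier = records[0][0]
# The mate number (1 or 2) follows the final "/" in this identifier style.
mate = identifier.rsplit("/", 1)[1]

if __name__ == "__main__":
    print(identifier, "mate", mate)
```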

Input qualities    Illumina versions
--solexa-quals     <= 1.2
--phred64          1.3-1.7
--phred33          >= 1.8

http://en.wikipedia.org/wiki/FASTQ_format
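When the FastQC report is not at hand, the encoding can be guessed from the lowest quality character seen in a sample of reads. The sketch below is a heuristic under stated assumptions: the cut-offs (ASCII 59 and 64) follow the ranges described on the Wikipedia FASTQ page, the returned labels are the option names as written on this slide, and a sample containing only high-quality bases cannot be classified reliably.

```python
def guess_quality_flag(quality_strings):
    """Guess the quality encoding from the lowest ASCII value observed.

    Returns one of the labels used on the slide. Heuristic only:
    Phred+33 characters start at ASCII 33, Solexa+64 at 59, Phred+64 at 64.
    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(c) for q in quality_strings for c in q)
    if lowest < 59:
        return "--phred33"
    if lowest < 64:
        return "--solexa-quals"
    return "--phred64"


if __name__ == "__main__":
    # Quality line of the example record on the slide.
    print(guess_quality_flag(["hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"]))
```

For the example record, every character is at or above ASCII 64, so the guess is the Phred+64 family, consistent with an Illumina 1.3-1.7 run.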

Mapping reads: Optimize mapping across introns

• Tophat default parameters are designed for mammalian RNA-seq data.

-I: default is 500,000

• Reduce “maximum intron length” for non-mammalian organisms

Species        Max_intron_length
yeast          2,484
arabidopsis    11,603
C. elegans     100,913
fly            141,628
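One way to apply the table is to look up the species when assembling the command line. The sketch below only builds the argument list and does not run anything; the function name, the index name, and the file names are hypothetical, while `-I` and `-o` are the options listed on the tophat slide. Whether to add a safety margin above the longest known intron is a judgment call not covered here.

```python
# Longest intron per species, as listed in the table above.
MAX_INTRON_LENGTH = {
    "yeast": 2484,
    "arabidopsis": 11603,
    "C. elegans": 100913,
    "fly": 141628,
}
TOPHAT_DEFAULT_MAX_INTRON = 500000  # tophat default, suited to mammals


def tophat_command(genome_index_base, reads, species=None, output_dir="tophat_out"):
    """Build a tophat argument list with a species-specific -I value."""
    max_intron = MAX_INTRON_LENGTH.get(species, TOPHAT_DEFAULT_MAX_INTRON)
    return ["tophat", "-I", str(max_intron), "-o", output_dir,
            genome_index_base, reads]


if __name__ == "__main__":
    # "sacCer3" and "reads.fastq" are placeholder names.
    print(" ".join(tophat_command("sacCer3", "reads.fastq", species="yeast")))
```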

Mapping reads: inspect result

• Inspect mapping result in a genome browser

• Remap if necessary

Quantifying Expression

• Different ways of quantifying expression level for different purposes

[Figure: reads for Gene 1 and Gene 2 in Sample 1 and Sample 2; labels: RPKM / FPKM, Normalized count]

RPKM (FPKM): Reads (Fragments) Per Kilobase per Million mapped reads
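The RPKM definition translates directly into arithmetic. The sketch below uses made-up numbers for illustration; for FPKM, fragments would be counted instead of reads.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Because the value is scaled by gene length, it is suited to comparing different genes within a sample, whereas raw or normalized counts are what the count-based differential expression tools expect.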

Counting methods

• htseq-count (recommended) http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

– Output is raw counts at gene level

• Cufflinks http://cole-trapnell-lab.github.io/cufflinks/

– Output is FPKM and related statistics

• Bedtools (intersectBed; coverageBed) http://bedtools.readthedocs.org/en/latest/

– Output is raw counts (but may need post-processing)

• featureCounts http://bioinf.wehi.edu.au/featureCounts/

– Output is raw counts at feature level and gene level

Using htseq-count

Command:

htseq-count [options] <alignment_file> <gff_file>

Frequently considered options    Default
-r <order>                       name
-s <yes/no/reverse>              yes
-t <feature type>                exon
-m <mode>                        union
-f, -i, -o, -q, -h …             …

[Figure: htseq-count "modes"]

Keep in mind: only unique alignments are counted!
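The counts table written by htseq-count is easy to load for downstream work. The sketch below assumes tab-separated "name, count" lines and assumes that the summary counters (such as reads with no feature or non-unique alignments) are the rows whose names start with two underscores, as in recent HTSeq versions; the gene names and numbers in the example are made up.

```python
import io


def read_htseq_counts(handle):
    """Split htseq-count output into gene counts and summary counters."""
    genes, special = {}, {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        name, count = line.split("\t")
        # Assumption: summary rows are prefixed with "__".
        target = special if name.startswith("__") else genes
        target[name] = int(count)
    return genes, special


# Hypothetical output for illustration.
EXAMPLE = "geneA\t120\ngeneB\t0\n__no_feature\t35\n__alignment_not_unique\t18\n"
genes, special = read_htseq_counts(io.StringIO(EXAMPLE))

if __name__ == "__main__":
    print(genes)
    print(special)
```

Checking the summary counters is a quick way to see how many reads were left out because their alignment was not unique.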

Differential Expression Analysis

• Normalization

• Estimating variability (dispersion)

• Normalization

• Variation (dispersion of data)

- Problem with simply normalizing to total reads:

[Figure: two samples with total reads of 200 and 400]

> cds = estimateSizeFactors(cds)
> sizeFactors(cds)
 Sample 1  Sample 2  Sample 3  Sample 4
4.7772242 1.0490870 0.3697529 0.5590669

Preferred normalization method - geometric (implemented in DESeq, cuffdiff)

1. Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.

2. Calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample.

3. Take the median of all the quotients to get the relative depth of the library and use it as the scaling factor.

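Geometric mean of the counts $x_1, \dots, x_n$ of a gene across $n$ samples:

$$\left( \prod_{i=1}^{n} x_i \right)^{1/n}$$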

                      Sample 1   Sample 2   Sample 3   Sample 4
Gene 1                2135       615        128        161
  (normalized)        (447)      (586)      (346)      (288)
Gene 2                600        58         103        189
Gene 3                3150       1346       68         88
Gene 4                378        187        11         22
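The three steps of the geometric method can be sketched in plain Python. This is an illustration under stated assumptions rather than the DESeq implementation: genes with a zero count in any sample are skipped because their geometric mean is zero, and the four genes from the table stand in for a full data set. The size factors printed on the slide were presumably computed from all genes, so the toy result differs from them; dividing the Gene 1 counts by the slide's size factors does reproduce the values shown in parentheses.

```python
import math
from statistics import median

# Counts from the table above (gene -> counts in Sample 1..4).
COUNTS = {
    "Gene 1": [2135, 615, 128, 161],
    "Gene 2": [600, 58, 103, 189],
    "Gene 3": [3150, 1346, 68, 88],
    "Gene 4": [378, 187, 11, 22],
}


def size_factors(counts):
    """Median-of-ratios size factors for a gene -> per-sample counts mapping."""
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if any(v == 0 for v in values):
            continue  # geometric mean would be 0; gene cannot act as reference
        # Step 1: reference value = geometric mean across samples (via logs).
        reference = math.exp(sum(math.log(v) for v in values) / len(values))
        # Step 2: quotient of each sample's count over the reference.
        for j, v in enumerate(values):
            ratios[j].append(v / reference)
    # Step 3: the median quotient of each sample is its scaling factor.
    return [median(r) for r in ratios]


toy_factors = size_factors(COUNTS)

# Size factors as printed on the slide.
SLIDE_FACTORS = [4.7772242, 1.0490870, 0.3697529, 0.5590669]
gene1_normalized = [round(c / s) for c, s in zip(COUNTS["Gene 1"], SLIDE_FACTORS)]

if __name__ == "__main__":
    print(toy_factors)
    print(gene1_normalized)  # [447, 586, 346, 288]
```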

Dispersion affects ability to differentiate


Methods for estimating dispersion implemented by DESeq

pooled (default): Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment.

per-condition: Each replicated condition receives its own model. Only available when all conditions have replicates.

blind: All samples are treated as replicates of a single global "condition" and used to build one model.

pooled-CR: Fit models according to modelFormula and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL).

Differential expression analysis with DESeq or edgeR

• Do it yourself (requires some knowledge of R)

- Visit the webpages of DESeq or edgeR
  http://bioconductor.org/packages/release/bioc/html/DESeq.html
  https://bioconductor.org/packages/release/bioc/html/edgeR.html
- Read the documentation
- Follow the instructions and examples

• Use R scripts developed by BaRC

- Get scripts from /nfs/BaRC_Public/BaRC_Code/R/

./RunDESeq.R <input_file> <output_file> <Groups> <Method>

./get_DE_genes_with_edgeR.R <input_file> <Groups> <output_file> <fdr> <foldChange>

Interpreting DESeq output

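Columns:

id: Gene ID (from GTF file)
baseMean: Mean normalized counts
baseMean.CEU, baseMean.YRI: Mean normalized counts for each group
YRI/CEU: Fold change
log2(YRI/CEU): Log2 (fold change)
pval: Raw p-value
padj: FDR p-value
CEU_NA07357, CEU_NA11881, YRI_NA18502, YRI_NA19200: Raw counts
CEU_NA07357_norm, CEU_NA11881_norm, YRI_NA18502_norm, YRI_NA19200_norm: Normalized counts = raw / (size factor)

Statistics:

id               baseMean  baseMean.CEU  baseMean.YRI  YRI/CEU  log2(YRI/CEU)  pval      padj
ENSG00000213442  74.71     149.06        0.36          0.00     -8.69          8.13E-18  1.55E-13
ENSG00000138061  83.15     159.78        6.52          0.04     -4.62          5.61E-12  5.36E-08
ENSG00000198618  11.71     23.42         0.00          0.00     #NAME?         1.51E-06  0.00
ENSG00000134184  8.17      0.00          16.34         Inf      Inf            0.000211  0.11

Raw and normalized counts:

id               CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200  CEU_NA07357_norm  CEU_NA11881_norm  YRI_NA18502_norm  YRI_NA19200_norm
ENSG00000213442  169          145          1            0            148.28            149.85            0.72              0
ENSG00000138061  184          153          10           4            161.44            158.12            7.2               5.83
ENSG00000198618  31           19           0            0            27.2              19.64             0                 0
ENSG00000134184  0            0            15           15           0                 0                 10.8              21.88

Notes:

Inf: division by 0 (baseMean.CEU is 0)
#NAME?: the value is -Inf (log2 of 0), which a spreadsheet program may display as #NAME?

sizeFactors (from DESeq):

CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200
1.1397535    0.9676201    1.3891488    0.6856959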
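The relationship "normalized counts = raw / (size factor)" can be checked against the first row of the table. The sketch below recomputes the normalized counts, the group means, and the log2 fold change for ENSG00000213442 from the raw counts and size factors shown above; the helper names are illustrative, and the explicit handling of zeros mirrors the Inf and -Inf cases in the table.

```python
import math

# Size factors and raw counts for ENSG00000213442, copied from the slide.
SIZE_FACTORS = {
    "CEU_NA07357": 1.1397535,
    "CEU_NA11881": 0.9676201,
    "YRI_NA18502": 1.3891488,
    "YRI_NA19200": 0.6856959,
}
RAW = {"CEU_NA07357": 169, "CEU_NA11881": 145, "YRI_NA18502": 1, "YRI_NA19200": 0}

# Normalized count = raw count / size factor of that sample.
normalized = {sample: RAW[sample] / SIZE_FACTORS[sample] for sample in RAW}


def group_mean(prefix):
    """Mean normalized count over the samples whose name starts with prefix."""
    values = [v for sample, v in normalized.items() if sample.startswith(prefix)]
    return sum(values) / len(values)


def log2_fold_change(numerator, denominator):
    """log2(numerator / denominator) with explicit handling of zeros."""
    if numerator == 0 and denominator == 0:
        return float("nan")
    if denominator == 0:
        return float("inf")   # division by 0
    if numerator == 0:
        return float("-inf")  # log2 of 0
    return math.log2(numerator / denominator)


ceu = group_mean("CEU")
yri = group_mean("YRI")
lfc = log2_fold_change(yri, ceu)

if __name__ == "__main__":
    print(round(ceu, 2), round(yri, 2), round(lfc, 2))
```

Rounded to two decimals, the results agree with the baseMean.CEU, baseMean.YRI, and log2(YRI/CEU) values in the first table row.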

Presenting results

• What do you want to show?

• All-gene scatterplots can be helpful to

– See level and fold-change ranges
– Identify sensible thresholds
– Hint at data or analysis problems

• Heatmaps are useful if many conditions are being compared, but only for gene subsets

• Output normalized read counts with the same method used for DE statistics

• Whenever one gene is especially important, look at the mapped reads in a genome browser

Scatterplots

[Figure panels: Standard scatterplot; MA (ratio-intensity) plot]
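The two plot types show the same pair of values in different coordinates. As a minimal sketch, assuming strictly positive normalized expression values x and y for one gene in two conditions, the MA coordinates are the log ratio M and the mean log intensity A; genes with a zero value need a pseudocount or separate handling, which is left out here.

```python
import math


def ma_values(x, y):
    """Return (A, M) for two positive expression values.

    M = log2(y) - log2(x)          (log ratio, vertical axis)
    A = (log2(x) + log2(y)) / 2    (mean log intensity, horizontal axis)
    math.log2 raises ValueError for zero or negative input.
    """
    m = math.log2(y) - math.log2(x)
    a = 0.5 * (math.log2(x) + math.log2(y))
    return a, m


if __name__ == "__main__":
    print(ma_values(4, 16))  # (3.0, 2.0)
```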

Clustering and heatmap

Software:

• Cluster 3.0: log-transform, mean center, cluster
• Java TreeView: visualize, export
• Illustrator: assemble

Liang et al. BMC Genomics 2010, 11:173
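The log-transform and mean-center steps can be illustrated for a single gene. This is a sketch of the arithmetic only, not of Cluster 3.0 itself; the pseudocount of 1 is an assumption added here so that zero counts can be logged.

```python
import math


def log_transform_and_center(row, pseudocount=1.0):
    """Log2-transform one gene's values, then subtract the gene's mean.

    pseudocount is a hypothetical choice to avoid log2(0).
    """
    logged = [math.log2(v + pseudocount) for v in row]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]


if __name__ == "__main__":
    # Values 1, 3, 7 become log2(2), log2(4), log2(8) = 1, 2, 3 before centering.
    print(log_transform_and_center([1, 3, 7]))  # [-1.0, 0.0, 1.0]
```

After centering, each gene's values describe its pattern across conditions rather than its absolute level, which is what makes rows comparable in a heatmap.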

Further mining of the data

Find enrichment in descriptive terms, regulatory pathways and networks:

Commercial

• Ingenuity Pathway Analysis
• Pathway Studio
• GeneGo
• …

Public domain

• DAVID
• GSEA
• …

Summary

• Experimental design
• Quality control (fastqc)
• Mapping spliced reads (tophat)
• Counting gene levels (htseq-count)
• Identifying "differentially expressed" genes (DESeq R package)
• Exploring your results

Slides and exercise materials are available at http://jura.wi.mit.edu/bio/education/hot_topics/

More practical commands and procedures are available at http://barcwiki.wi.mit.edu/wiki/SOPs

Hands on Practice overview

Fastq file → (fastqc) quality check

Fastq file → (Tophat) → Bam file → (htseq-count) → Gene read counts → Normalized read counts and differential expression
