Post on 19-Aug-2020
Practical RNA-seq Data Analysis
BaRC Hot Topics – March 31, 2016
Bioinformatics and Research Computing
Whitehead Institute
http://barc.wi.mit.edu/hot_topics/
Experimental design
Things to consider:
• Replication (especially biological)
• Multiplexing
• Read length
• Read depth
• Paired or unpaired
• Stranded or unstranded
• Batch effect
Data analysis pipeline
Quality Control
Mapping
Quantifying
Differential expression
Presenting the results
Quality control
• Check the quality of the raw-read file using FastQC
• Command:
fastqc <reads.fastq>
• Resources:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
https://sequencing.qcfail.com/
http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads
http://jura.wi.mit.edu/bio/education/hot_topics/
Responding to quality issues
• Method 1:
– Keep all reads as is
– Map as many as possible
• Method 2:
– Drop all poor-quality reads
– Trim poor-quality bases
– Trim adapter/vector
– Map only good-quality bases
• Which makes more sense for your experiment?
• Consider doing both and comparing the outcomes.
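As a rough illustration of the "trim poor-quality bases" step in Method 2, the sketch below removes low-quality bases from the 3' end of a single read. It is not the algorithm of any particular trimming tool. The function names, the quality threshold of 20, the minimum length of 25, and the Phred+33 offset are all assumptions chosen for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    seq and qual are the sequence and quality strings of a FASTQ record.
    offset is the ASCII offset of the quality encoding (33 or 64).
    min_q and offset are illustrative defaults, not recommendations.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below min_q.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=25):
    """Decide whether a trimmed read is still long enough to keep."""
    return len(seq) >= min_len


# 'I' is Q40 in Phred+33; '#' and '!' are Q2 and Q0.
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(seq, qual)  # ACGTACGT IIIIIIII
```

Running both methods on the same sample and comparing mapping rates is one way to decide which is more appropriate.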
Mapping reads: various mappers
Various mappers for spliced alignment:
TopHat (recommended), STAR, GSNAP, MapSplice, PALMapper, ReadsMap, GEM, PASS, GSTRUCT, BAGET…
Pär G Engström et al. Nature Methods | VOL.10 NO.12 | DECEMBER 2013
Mapping reads with tophat2
Command:
tophat [options] <genome_index_base> <reads.fastq>
Genome indexes: /nfs/genomes/
Commonly considered parameters:
quality score encoding   Find out from the FastQC result
--segment-length         Shortest length of a spliced read that can map to one side of the junction. Default: 25
--no-novel-juncs         Only look at reads across junctions in the supplied GTF file
-G <GTF file>            Map reads to the virtual transcriptome (from the GTF file) first
-N                       Max. number of mismatches in a read. Default: 2
-o/--output-dir          Default: tophat_out
--library-type           (fr-unstranded, fr-firststrand, fr-secondstrand)
-I/--max-intron-length   Default: 500000
http://ccb.jhu.edu/software/tophat/manual.shtml
Mapping reads: quality score encoding
• Fastq format:
@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG
+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh
Line 1: @seq identifier (/1 or /2: paired-end)
Line 2: seq
Line 3: +any description
Line 4: seq quality values

Input qualities   Illumina versions
--solexa-quals    <= 1.2
--phred64         1.3-1.7
--phred33         >= 1.8
http://en.wikipedia.org/wiki/FASTQ_format
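When the FastQC report is not at hand, the encoding can often be guessed from the lowest quality character in a sample of reads. The sketch below uses the usual ASCII ranges (Phred+33 starts at 33, Solexa at 59, Phred+64 at 64). The function name is hypothetical, the returned strings are simply the labels from the table above, and the result is only a heuristic: a small sample of uniformly high-quality reads can be misclassified.

```python
def guess_quality_option(quality_strings):
    """Guess the quality encoding from the lowest quality character seen.

    quality_strings is an iterable of FASTQ quality lines.
    Returns one of the labels from the table above.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return "--phred33"       # characters below ';' only occur in Phred+33
    if lowest < 64:
        return "--solexa-quals"  # ';' to '?' only occur in Solexa encoding
    return "--phred64"           # '@' or above: most likely Phred+64


# Quality line of the example record above; its lowest character is 'd' (ASCII 100).
example = ["hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"]
print(guess_quality_option(example))  # --phred64
```

In practice the guess is more reliable when many reads are inspected, for example the first few thousand records of the file.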
Mapping reads: Optimize mapping across introns
• TopHat default parameters are designed for mammalian RNA-seq data.
• Reduce "maximum intron length" for non-mammalian organisms
-I: default is 500,000

Species       Max_intron_length
yeast         2,484
arabidopsis   11,603
C. elegans    100,913
fly           141,628
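One way to apply the table above is to look up the species and pass the value to `-I` when assembling the TopHat command. The sketch below only builds the argument list and does not run TopHat. The function name, the index name `yeast_index`, and the choice to use the tabulated maximum directly (rather than adding some headroom) are assumptions made for illustration. `shlex.join` requires Python 3.8 or later.

```python
import shlex

# Maximum intron lengths copied from the table above.
MAX_INTRON = {
    "yeast": 2484,
    "arabidopsis": 11603,
    "C. elegans": 100913,
    "fly": 141628,
}


def tophat_command(species, index_base, reads, gtf=None, out_dir="tophat_out"):
    """Build a tophat argument list with -I set from the species table."""
    # Species not in the table keep the TopHat default of 500,000.
    max_intron = MAX_INTRON.get(species, 500000)
    args = ["tophat", "-I", str(max_intron), "-o", out_dir]
    if gtf is not None:
        args += ["-G", gtf]
    return args + [index_base, reads]


# "yeast_index" is a placeholder for a real genome index base name.
cmd = tophat_command("yeast", "yeast_index", "reads.fastq")
print(shlex.join(cmd))  # tophat -I 2484 -o tophat_out yeast_index reads.fastq
```

The resulting list can be passed to `subprocess.run` once the index and read file names are replaced with real paths.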
Mapping reads: inspect result
• Inspect mapping result in a genome browser
• Remap if necessary
Quantifying Expression
• Different ways of quantifying expression level for different purposes
[Figure: Gene 1, Gene 2; Sample 1, Sample 2; RPKM / FPKM; normalized count]
RPKM / FPKM: Reads (Fragments) Per Kilobase per Million mapped reads
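The RPKM definition above translates directly into a small calculation. The sketch below is a minimal illustration with a hypothetical function name and made-up input values. FPKM uses the same formula with fragment counts in place of read counts.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)


# Hypothetical values: 500 reads on a 2,000 bp gene, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```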
Counting methods
• htseq-count (recommended) http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
– Output is raw counts at gene level
• Cufflinks http://cole-trapnell-lab.github.io/cufflinks/
– Output is FPKM and related statistics
• Bedtools (intersectBed; coverageBed) http://bedtools.readthedocs.org/en/latest/
– Output is raw counts (but may need post-processing)
• featureCounts http://bioinf.wehi.edu.au/featureCounts/
– Output is raw counts at feature level and gene level
Using htseq-count
Command:
htseq-count [options] <alignment_file> <gff_file>

Frequently considered options   default
-r <order>                      name
-s <yes/no/reverse>             yes
-t <feature type>               exon
-m <mode>                       union
-f, -i, -o, -q, -h …            …

Keep in mind:
Only unique alignments are counted!!
[Figure: htseq-count "modes"]
Differential Expression Analysis
• Normalization
• Estimating variability (dispersion)
Differential Expression Analysis
• Normalization
• Variation (dispersion of data)
- Problem with simply normalizing to total reads:
[Figure: total reads 200 vs. 400]
Preferred normalization method - geometric (implemented in DESeq, cuffdiff)
1. Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.
2. Calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample.
3. Take the median of all the quotients to get the relative depth of the library and use it as the scaling factor.
Geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$

        Sample 1  Sample 2  Sample 3  Sample 4
Gene 1  2135      615       128       161
Gene 2  600       58        103       189
Gene 3  3150      1346      68        88
Gene 4  378       187       11        22
Gene 1 normalized (count / size factor): (447) (586) (346) (288)

> cds = estimateSizeFactors(cds)
> sizeFactors(cds)
 Sample 1  Sample 2  Sample 3  Sample 4
4.7772242 1.0490870 0.3697529 0.5590669
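The three steps of the geometric method can be sketched in a few lines. This is an illustration of the procedure as described above, not DESeq's own code. The function name is hypothetical, and skipping genes with a zero count in any sample is an assumption made so that the geometric mean stays defined. With only the four genes from the table, the result will not reproduce the size factors shown, which presumably come from the complete gene set.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts is a list of rows, one per gene, each holding one count per sample.
    Genes with a zero count in any sample are skipped (an assumption of this
    sketch), because their geometric mean would be zero.
    """
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        # Step 1: reference value = geometric mean of the gene across samples.
        ref = math.exp(sum(math.log(x) for x in row) / len(row))
        # Step 2: quotient of each sample's count over the reference.
        for j, x in enumerate(row):
            ratios[j].append(x / ref)
    # Step 3: the median quotient of each sample is its size factor.
    return [median(r) for r in ratios]


# The four genes from the table above.
counts = [
    [2135, 615, 128, 161],
    [600, 58, 103, 189],
    [3150, 1346, 68, 88],
    [378, 187, 11, 22],
]
print([round(f, 3) for f in size_factors(counts)])
```

Dividing each raw count by its sample's size factor then gives the normalized count.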
Differential Expression Analysis
Dispersion affects ability to differentiate
[Figure: two panels, axis labels x, x1, x2]
Methods for estimating dispersion implemented by DESeq
pooled: Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment. (Default)
per-condition: Each replicated condition receives its own model. Only available when all conditions have replicates.
blind: All samples are treated as replicates of a single global "condition" and used to build one model.
pooled-CR: Fit models according to modelFormula and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL).
Differential expression analysis with DESeq or edgeR
• Do it yourself (requires some knowledge of R)
- Visit the webpages of DESeq or edgeR
http://bioconductor.org/packages/release/bioc/html/DESeq.html
https://bioconductor.org/packages/release/bioc/html/edgeR.html
- Read the documentation
- Follow the instructions and examples
• Using R scripts developed by BaRC
- Get scripts from /nfs/BaRC_Public/BaRC_Code/R/
./RunDESeq.R <input_file> <output_file> <Groups> <Method>
./get_DE_genes_with_edgeR.R <input_file> <Groups> <output_file> <fdr> <foldChange>
Interpreting DESeq output
Columns:
id: Gene ID (from GTF file)
baseMean: Mean normalized counts
baseMean.CEU, baseMean.YRI: Mean normalized counts for group
YRI/CEU: Fold change
log2(YRI/CEU): Log2 (fold change)
pval: Raw p-value
padj: FDR p-value
CEU_NA07357, CEU_NA11881, YRI_NA18502, YRI_NA19200: Raw counts
*_norm columns: Normalized counts = raw / (size factor)

id               baseMean  baseMean.CEU  baseMean.YRI  YRI/CEU  log2(YRI/CEU)  pval      padj
ENSG00000213442  74.71     149.06        0.36          0.00     -8.69          8.13E-18  1.55E-13
ENSG00000138061  83.15     159.78        6.52          0.04     -4.62          5.61E-12  5.36E-08
ENSG00000198618  11.71     23.42         0.00          0.00     #NAME?         1.51E-06  0.00
ENSG00000134184  8.17      0.00          16.34         Inf      Inf            0.000211  0.11

id               CEU_NA07357  CEU_NA11881  YRI_NA18502  YRI_NA19200  CEU_NA07357_norm  CEU_NA11881_norm  YRI_NA18502_norm  YRI_NA19200_norm
ENSG00000213442  169          145          1            0            148.28            149.85            0.72              0
ENSG00000138061  184          153          10           4            161.44            158.12            7.2               5.83
ENSG00000198618  31           19           0            0            27.2              19.64             0                 0
ENSG00000134184  0            0            15           15           0                 0                 10.8              21.88

Inf: division by 0
#NAME?: -Inf
sizeFactors (from DESeq):
CEU_NA07357 CEU_NA11881 YRI_NA18502 YRI_NA19200
1.1397535 0.9676201 1.3891488 0.6856959
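The relationship between raw counts, size factors, group means and log2 fold change can be checked by hand for one gene. The sketch below recomputes the ENSG00000138061 row from its raw counts and the size factors above. The helper names are hypothetical, and the explicit handling of zeros is an assumption meant to mirror the Inf and -Inf cases in the example output.

```python
import math

# Size factors and raw counts copied from the example output above.
size_factor = {"CEU_NA07357": 1.1397535, "CEU_NA11881": 0.9676201,
               "YRI_NA18502": 1.3891488, "YRI_NA19200": 0.6856959}
raw = {"CEU_NA07357": 184, "CEU_NA11881": 153,
       "YRI_NA18502": 10, "YRI_NA19200": 4}  # ENSG00000138061


def group_mean(norm, prefix):
    """Mean normalized count over the samples whose name starts with prefix."""
    vals = [v for k, v in norm.items() if k.startswith(prefix)]
    return sum(vals) / len(vals)


def log2_fold_change(numerator, denominator):
    """log2(numerator / denominator), with explicit handling of zeros."""
    if numerator == 0 and denominator == 0:
        return float("nan")
    if denominator == 0:
        return float("inf")   # division by 0
    if numerator == 0:
        return float("-inf")  # log2 of 0
    return math.log2(numerator / denominator)


# Normalized counts = raw / (size factor)
norm = {s: raw[s] / size_factor[s] for s in raw}
ceu, yri = group_mean(norm, "CEU"), group_mean(norm, "YRI")
print(round(ceu, 2), round(yri, 2), round(log2_fold_change(yri, ceu), 2))
# 159.78 6.52 -4.62
```

Genes with a zero group mean, such as ENSG00000198618 and ENSG00000134184, are worth inspecting separately because their fold changes are infinite.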
Presenting results
• What do you want to show?
• All-gene scatterplots can be helpful to
– See level and fold-change ranges
– Identify sensible thresholds
– Hint at data or analysis problems
• Heatmaps are useful if many conditions are being compared, but only for gene subsets
• Output normalized read counts with the same method used for DE statistics
• Whenever one gene is especially important, look at the mapped reads in a genome browser
Scatterplots
[Figure panels: Standard scatterplot; MA (ratio-intensity) plot]
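For the MA (ratio-intensity) plot, each gene is placed at A (mean log2 intensity) on the x-axis and M (log2 ratio) on the y-axis. The sketch below computes both values for one gene from two normalized counts. The function name and the input values are illustrative, and genes with a zero count need separate handling, for example a pseudocount.

```python
import math


def ma_values(x, y):
    """Return (A, M) for one gene with normalized counts x and y.

    M is the log2 ratio y/x; A is the mean of the two log2 intensities.
    Both counts must be positive.
    """
    m = math.log2(y) - math.log2(x)
    a = 0.5 * (math.log2(x) + math.log2(y))
    return a, m


# Hypothetical gene with normalized counts 8 and 32.
print(ma_values(8, 32))  # (4.0, 2.0)
```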
Clustering and heatmap
Software:
Cluster 3.0:
• Log-transform
• Mean center
• Cluster
Java TreeView:
• Visualize
• Export
Illustrator:
• Assemble
Liang et al. BMC Genomics 2010, 11:173
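The first two Cluster 3.0 steps (log-transform, mean center) are simple enough to reproduce by hand when preparing a table for another tool. The sketch below applies them to one gene (row). It is not Cluster 3.0's implementation. The function name, the pseudocount of 1 used to avoid taking the log of zero, and the choice to center by gene rather than by sample are assumptions of this example.

```python
import math


def log_transform_and_center(row, pseudocount=1.0):
    """Log2-transform one gene's values, then subtract the row mean."""
    logged = [math.log2(v + pseudocount) for v in row]
    mean = sum(logged) / len(logged)
    return [v - mean for v in logged]


# Hypothetical normalized counts for one gene across three samples.
print(log_transform_and_center([1, 3, 7]))  # [-1.0, 0.0, 1.0]
```

After centering, each gene's values sum to zero, so the heatmap colors show differences between samples rather than overall expression level.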
Further mining of the data
Find enrichment in descriptive terms, regulatory pathways and networks:
Commercial
• Ingenuity Pathway Analysis
• Pathway Studio
• GeneGo
• …
Public domain
• DAVID
• GSEA
• …
Summary
• Experimental design
• Quality control (fastqc)
• Mapping spliced reads (tophat)
• Counting gene levels (htseq-count)
• Identifying "differentially expressed" genes (DESeq R package)
• Exploring your results
Slides and exercise materials are available at http://jura.wi.mit.edu/bio/education/hot_topics/
More practical commands and procedures are available at http://barcwiki.wi.mit.edu/wiki/SOPs
Hands on Practice overview
Fastq file: check quality with fastqc
Fastq file -> Tophat -> Bam file
Bam file -> htseq-count -> Gene read counts
Gene read counts -> DESeq -> Normalized read counts and differential expression