RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1....

31
1 RNA-seq data analysis RNA-seq data analysis 1. Introduction to RNA-seq 2. Quality control, preprocessing 3. Alignment to reference 4. Quantitation 5. Differential expression analysis Normalization Dispersion estimation Statistical testing

Transcript of RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1....

Page 1: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

1

RNA-seq data analysis

RNA-seq data analysis1. Introduction to RNA-seq2. Quality control, preprocessing3. Alignment to reference4. Quantitation5. Differential expression analysis

• Normalization• Dispersion estimation• Statistical testing

Page 2: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

2

Introduction to RNA-seq

RNA-seq for transcriptome analysis

What genes are expressed / differentially expressed?Advantages if compared to microarrays• Able to detect novel transcripts and isoforms• Increased dynamic range• High reproducibility, low background• Detection of gene fusions and SNPs

But• Based on sampling transcript’s expression signal depends on

expression level of other transcripts• Microarrays are based on hybridization probe intensities are independent of

each other

Page 3: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

3

Typical steps in RNA-seq

http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/RNA-seq.html

Things to take into account

Unequal coverage along transcript• depends on library construction procedure

Sequencing bias• GC-rich and GC-poor regions can be under-sampled• Not so random priming

Longer transcripts give more countsRNA composition effect due to sampling:

Page 4: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

4

Quality control, preprocessing(FastQC, PRINSEQ,)

Align reads to reference(TopHat, Star)

RNA-seq data analysis workflow

De novo assembly(Trinity, Velvet+Oases)

Align reads to transcripts

Reference based assembly to detect new transcripts and isoforms(Cufflinks)

Quantitation(HTSeq, Cuffdiff, etc)

reference

reads

Annotation(Blast2GO)

Differential expression analysis(edgeR, DESeq, Cuffdiff)

Quality control, preprocessing(FastQC, PRINSEQ,)

Align reads to reference(TopHat, Star)

RNA-seq data analysis today

De novo assembly(Trinity, Velvet+Oases)

Align reads to transcripts

Reference based assembly to detect new transcripts and isoforms(Cufflinks)

Quantitation(HTSeq, Cuffdiff, etc)

reference

reads

Annotation(Blast2GO)

Differential expression analysis(edgeR, DESeq, Cuffdiff)

Page 5: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

5

Quality control

What and why?

Potential problems

• low-quality sequences

• sequencing artifacts

• sequence contamination

• Examples: low confidence bases, Ns, duplicate reads,

sequence bias, adapters, another organism…

Knowing about potential problems in your data allows you to

correct for them before you spend a lot of time on analysis

take them into account when interpreting results

Page 6: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

6

Software packages for quality control

FastQC

FastX

PRINSEQ

TagCleaner

....

Things to check

Quality • Per base quality• Per read quality

Composition• Per base composition• GC content and profile

Contaminant identification• Overrepresented sequences and k-mers• Duplicate levels

Page 7: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

7

Base qualities

Phred quality score Q = -10 * log10 (probability that the base is wrong)

Encoded as ASCII characters so that 33 is added to the Phred score• This ”Sanger” encoding is used by Illumina 1.8+, 454 and SOLiD • Note: older Illumina encoding added 64 to Phred

Base qualities

Page 8: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

8

Per base sequence quality plot (FastQC)

good

so and so

bad

Quality drops suddenly

Problem in the flow cell -> trim the sequences

Page 9: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

9

Per base sequence content (FastQC)

Base composition bias

Correct sequence but biased location, typical for Illumina RNA-seq data

Page 10: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

10

Base composition bias II

Library has a restriction site at the frontA single sequence makes up of 20% of the library

Human data for 2 cell lines (h1-hESC and GM12878) from the ENCODE project• 76 b single-end reads, no replicatesDrosophila data from experiment studying the role of the pasilla splicing factor using RNAi• 4 untreated samples (2 sequenced single end and 2 paired end)• 3 treated samples (1 sequenced single end and 2 paired end)

Data used in exercises

Page 11: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

11

Exercise 1: Quality controlImport file h1-hESC_RNAseq.fastq. View it as text. Run the tool Quality control / Read quality with FastQC. • how many reads are there and how long are they? • what quality encoding is used? • is the base quality good all along the reads?

Preprocessing: Filtering and trimming low quality reads

Page 12: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

12

Filter and trim low quality reads: FastX

Filter sequences based on quality• Decide what is the minimum quality value allowed (e.g. 20)• Decide what percentage of bases in a read should have at least

this quality

Trim a certain number of bases from all reads• Decide how many bases should be trimmed and from which end

Filter reads for adapters, ambiguous nucleotides (N) and length• Give the adapter sequence and decide what is the minimum

allowed sequence length after clipping

Filter low quality reads: PRINSEQFilter sequences based on quality scores• Min/ max quality score per base, mean of quality scores

Filter for low complexity• DUST (score 1-100, > 7 means low complexity)• Entropy (score 1-100, < 70 means low complexity)

Filter for Ns• Maximum count/ percentage of Ns that a read is allowed to have

Filter for length • Min/ max length of a read

Filter for duplicates• Exact, reverse complement, or 5’/3’ duplicates

Filter for several criteria• All above, and possibility to get filtered pairs for paired end data

Page 13: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

13

Trim low quality reads: PRINSEQ

Trim based on quality scores• Minimum or mean quality score• In a sliding window• From 3’ or 5’ end

Trim polyA/T tails• Minimum number of A/Ts• From left or right

Trim based on several criteria• All above• Trim x bases from left/ right• Trim to length x

Exercise 2: Trim and filter readsTrim reads from 3’ end based on quality• Select the file h1-hESC_RNAseq.fastq and the tool Utilities /

Trim reads for several criteria. Set parameters so that 3’ ends are trimmed when the mean quality in a 3 base window is less than 20.

• Inspect the log file and figure out how many reads were discarded.

Select the file trimmed.fastq and run the tool Quality control / Read quality with PRINSEQ. • Select the result file reads-stats.html and visualization method

Open in external web browser. Does the quality look better? How is the read length distribution now?

Filter out reads that are shorter than 50 bases.• Select trimmed.fastq and the tool Filtering / Filter reads for

length. How many reads get discarded?

Page 14: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

14

Aligning (=mapping) reads to reference

Splice-aware aligners

TopHat (uses Bowtie)GSNAPStarRUMMapSplice

Page 15: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

15

Bowtie2 conceptsEnd-to-end alignment mode• Read is aligned over its entire length• Maximum alignment score = 0, deduct penalty for each mismatch

(less for low quality base), N, gap opening and gap extension

Local alignment mode• Read ends don’t need to align, if this maximizes the alignment score• Add bonus to alignment score for each match

Differences to Bowtie1: Can handle indels, Ns and pairs where one mate doesn’t align

TopH

at2

wor

kflo

w

Page 16: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

16

Mapping qualityConfidence in read’s point of originDepends on • length of alignment• number of mismatches and gaps• uniqueness of the aligned region in the genome

Expressed in Phred scores, like base qualities• Q = -10 * log10 (probability that read was mapped to a wrong

location)• Marked in numbers

Exercise 3: Align reads to reference genomeSelect accepted.fastq and run the tool Alignment / Bowtie2 for single end reads. • how many alignments are there and how many of them have

mapping quality higher than 1 (use the tool Utilities / Count alignments in BAM)

Visualize alignment in genome browser• Select the BAM file and visualization method Genome browser• Select the human genome and set the location to gene CNN2

and click Go.• Untick the box “reads” and set the coverage scale to 1000.• Is the coverage even along the transcript? Can you find SNPs?

Page 17: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

17

Quantitation

Software for counting aligned reads per genomic features (genes/exons/transcripts)

HTSeqCuffdiffBEDTools

Page 18: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

18

HTSeq countGiven a BAM file and a list of genomic features, counts how many reads map to each feature.• For RNA-seq the features are typically genes, where each

gene is considered as the union of all its exons. • Also exons can be considered as features, e.g., in order to

check for alternative splicing.

Features need to be supplied in GTF file• Note that the chromosome naming has to be the same as in

BAM

3 modes to handle reads which overlap several genes• Union• Intersection-strict• Intersection-nonempty

HTSeq count modes

Page 19: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

19

Exercise 4: Count reads per genes

Select the BAM file and run RNA-seq / Map aligned reads to genes with HTSeq so that the minimum alignment quality is 1.

Select htseq-counts.tsv and run Filtering / Filter table by column value.• how many genes have more than 10 counts?

Differential expression analysis

Page 20: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

20

Things to take into accountNormalization is required in order to compare expression

between samples

• Different library sizes

• RNA composition bias caused by sampling approach

Model has to account for overdispersion in biological

replicates negative binomial distribution

Raw counts are needed to assess measurement precision

Multiple testing problem

Software packages for DE analysis

edgeR

DESeq

DEXSeq

Cuffdiff

BaySeq

SAMseq

NOIseq

Limma + voom, limma + vst

...

Page 21: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

21

Comments from comparisons”Methods based on negative binomial modeling have

improved specificity and sensitivities as well as good

control of false positive errors”

”Cuffdiff performance has reduced sensitivity and

specificity. We postulate that the source of this is related to

the normalization procedure that attempts to account for

both alternative isoform expression and length of

transcripts”

Page 22: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

22

Differential expression analysis:Normalization

NormalizationFor comparing gene expression within sample, normalize for

• Gene length

• Gene GC content

For comparing gene expression between samples, normalize for

• Library size (number of reads obtained)

• RNA composition effect

Page 23: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

23

“FPKM and TC are ineffective and should be definitely abandoned in the context of differential analysis”

“In the presence of high count genes, only DESeq and TMM (edgeR) are able to maintain a reasonable false positive rate without any loss of power”

RPKM and FPKM Reads/fragments per kilobase per million mapped reads

• Example: 2000 bp transcript has 600 counts, library size is

30 million reads RPKM = (600/2) / 30 = 10

Normalizes for gene length and library size

Can be used only for reporting expression values, not for

testing differential expression

• Raw counts are needed to assess the measurement

precision correctly

Page 24: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

24

Normalization by edgeR and DESeqAim to make normalized counts for non-differentially

expressed genes similar between samples

• Do not aim to adjust count distributions between samples

Assume that

• Most genes are not differentially expressed

• Differentially expressed genes are divided equally between

up- and down-regulation

Do not transform data, but use normalization factors within

statistical testing

Normalization by edgeR and DESeq – how?DESeq• Take geometric mean of gene’s counts across all samples• Divide gene’s counts in a sample by the geometric mean• Take median of these ratios sample’s normalization factor

(applied to read counts)

edgeR• Select as reference the sample whose upper quartile is closest to

the mean upper quartile• Log ratio of gene’s counts in sample vs reference M value• Take weighted trimmed mean of M-values (TMM) normalization

factor (applied to library sizes)• Trim: Exclude genes with high counts or large differences in expression• Weights are from the delta method on binomial data

Page 25: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

25

Filtering

FilteringFilter out genes which have little chance of showing significant evidence for differential expression• E.g. genes which are not expressed in either experimental

group

Reduces the severity of multiple testing adjustment

Should be independent of experimental setup• do not use information on what group the sample belongs to

Page 26: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

26

Differential expression analysis:Dispersion estimation

DispersionDispersion = (BCV)2

• BCV = gene’s biological coefficient of variation• E.g. if gene’s expression typically differs from replicate to

replicate by 20%, this gene’s dispersion is 0.22 = 0.04

Note that the variance seen in counts is a sum of 2 things:• Sample-to-sample variation (dispersion)• Uncertainty in measuring expression by counting reads

Page 27: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

27

Dispersion estimation by edgeR and DESeqDESeq• Models the observed mean-variance relationship for the genes

using either parametric or local regression• User can choose to use the fitted values always, or only when

they are higher than the genewise value

edgeR• Estimates common dispersion for all genes using a conditional

maximum likelyhood approach • Trended dispersion: takes binned common dispersion and

abundance, and fits a curve though these binned values• Tagwise dispersion: uses empirical Bayes strategy to shrink

gene-wise dispersions towards the common one using a weighted likelyhood approach genes that are consistent between replicates are ranked more highly

Page 28: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

28

Data exploration using MDS plotedgeR outputs multidimensional scaling (MDS) plot which shows the relative similarities between samples

Allows you to see if replicates are consistent and if you can expect to find differentially expressed genes

Distances correspond to the biological coefficient of variation between each pair of samples• Calculated using 500 most heterogenous genes (that have

largest tagwise dispersion treating all libraries as one group)

Page 29: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

29

MDS plot by edgeR

Differential expression analysis:Statistical testing

Page 30: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

30

Statistical testing by DESeq and edgeRTwo group comparisons• Exact test for negative binomial distribution

Multifactor experiments• Generalized linear model (GLM) likelyhood ratio test

• GLM = extension of linear models to nonnormally distributed response data

Exercise 5: Analyze differential expression with edgeR classic

Select File / Open session and the file pasilla.zip. Inspect the phenodata. • This is a two-group comparison with 7 samples. Some samples

were sequenced single end, some paired end.Select the file counts.tsv and run the tool RNA-seq / Differential expression using edgeR.• do the samples separate according to group or something else

in the MDS plot? • how many differentially expressed genes do you get? • is common dispersion a good approximation of genewise

dispersions, judged by the dispersion plot? • how big fold change is required for DE genes (see MA plot)? • how does the p-value distribution look like in the p-value plot?

Page 31: RNA-seq data analysis - DKFZ · PDF file1 RNA-seq data analysis RNA-seq data analysis 1. Introductionto RNA-seq 2. Qualitycontrol, preprocessing 3. Alignment to reference 4. Quantitation

31

Exercise 6: Include filtering in edgeR classic

Repeat the run so that you analyze only genes which are expressed in at least 3 samples.• how does the number of DE genes change and why? • did the p-value distribution change? • did the MDS plot change?

Exercise 7: Include read type info in edgeR

Select the file counts.tsv and run the tool RNA-seq / Differential expression using edgeR for multivariate experiments so that you• analyze only genes which are expressed in at least 3 samples and• incorporate the readtype info to analysis: Add readtype column as

main effect 2.

Filter the file edger-glm.tsv using Filtering / Filter table by column value (column = PValue.group, cutoff = 0.05)• how many genes were analyzed and how many are DE?