Page 1: RNA-Seq Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2.3.

RNA-Seq Analysis

Simon Andrews

simon.andrews@babraham.ac.uk

@simon_andrews

v2.3

Page 2:

RNA-Seq Libraries

rRNA depleted mRNA

Fragment

Random prime + RT

2nd strand synthesis (+ U)

A-tailing

Adapter Ligation

(U strand degradation)

Sequencing

[Diagram labels: random primers (NNNN), U-marked second strand (u u u u), A-tail (AA), T-overhang adapter (T)]

Page 3:

RNA-Seq Analysis

QC

(Trimming)

Mapping

Mapped QC

Quantitation

Statistical Analysis

Page 4:

QC: Raw Data

• Sequence call quality

Page 5:

QC: Raw Data

• Sequence bias

Page 6:

QC: Raw Data

• Duplication level

Page 7:

Mapping

Exon 1 Exon 2 Exon 3 Genome

Simple mapping within exons

Mapping between exons

Spliced mapping

Mapping can be simplified by aligning first to a transcriptome and then translating back to genomic coordinates. Reads which don’t match the transcriptome can then be mapped to the whole genome.
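In SAM/BAM output a spliced alignment is normally recorded with an `N` (skipped region) operation in the CIGAR string. The following is a minimal sketch of turning a start position and CIGAR into the genomic blocks a read covers; the function names are hypothetical and positions are assumed to be 1-based, as in SAM.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(start, cigar):
    """Return the genomic (start, end) blocks covered by a read (1-based, inclusive).

    M, =, X and D consume the reference and extend the current block.
    N is a skipped region (an intron) and closes the current block.
    I, S, H and P do not consume the reference and are ignored here.
    """
    blocks = []
    pos = start
    block_start = None
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("M", "=", "X", "D"):
            if block_start is None:
                block_start = pos
            pos += length
        elif op == "N":
            if block_start is not None:
                blocks.append((block_start, pos - 1))
                block_start = None
            pos += length
    if block_start is not None:
        blocks.append((block_start, pos - 1))
    return blocks


def is_spliced(cigar):
    """A read is treated as spliced if its CIGAR contains a skipped region."""
    return "N" in cigar


print(aligned_blocks(100, "50M"))         # read entirely within one exon
print(aligned_blocks(100, "30M200N20M"))  # read spanning a 200 bp intron
```

A read with a single block is a simple within-exon mapping; a read with two or more blocks spans at least one splice junction.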

Page 8:

Spliced Mapping Software

• TopHat (http://tophat.cbcb.umd.edu/)

• STAR (http://code.google.com/p/rna-star/)

Page 9:

TopHat pipeline

Reference FastQ files, Indexed Genome

Reference GTF Models, Indexed Transcriptome

Reads

• Maps to transcriptome?
– Yes: Translate coords and report
– No: try the genome

• Maps to genome?
– Yes: Report
– No: try a split mapping

• Split map to genome
– Yes: Build consensus and report
– No: Discard
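The decision cascade in the flowchart can be sketched as a chain of three checks. This is only an illustration of the ordering, not TopHat's code: `place_read` and the three predicate arguments are hypothetical stand-ins for the real alignment steps.

```python
def place_read(read, in_transcriptome, in_genome, in_split_genome):
    """Walk one read through the three mapping attempts, in order."""
    if in_transcriptome(read):
        return "translate coords and report"
    if in_genome(read):
        return "report"
    if in_split_genome(read):
        return "build consensus and report"
    return "discard"


# Toy stand-ins for the three aligner calls (hypothetical, for illustration only).
transcriptome_hits = {"read1"}
genome_hits = {"read2"}
split_hits = {"read3"}

for name in ["read1", "read2", "read3", "read4"]:
    outcome = place_read(
        name,
        transcriptome_hits.__contains__,
        genome_hits.__contains__,
        split_hits.__contains__,
    )
    print(name, "->", outcome)
```

Each read stops at the first step which succeeds, so the cheaper transcriptome search handles most reads and only the leftovers reach the split mapping.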

Page 10:

Post Mapping QC

• Mapping statistics
• Proportion of reads which are in transcripts
• Proportion of reads in transcripts which are in exons
• Strand specificity
• Consistency of coverage

SeqMonk (RNA-Seq QC plot)

RNA-SeQC (easiest through GenePattern)

Page 11:

SeqMonk Mapping QC (good)

Page 12:

SeqMonk Mapping QC (bad)

Page 13:

Quantitation

Splice form 1: Exon 1, Exon 2, Exon 3

Splice form 2: Exon 1, Exon 3

Reads fall into three classes:

• Definitely splice form 1
• Definitely splice form 2
• Ambiguous
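The three read classes can be illustrated with a small sketch. It assumes each read has already been reduced to the ordered list of exons it covers; the exon names, `SPLICE_FORMS` and both functions are hypothetical and only mirror the two splice forms in the diagram.

```python
# The two splice forms from the diagram, as ordered exon lists.
SPLICE_FORMS = {
    "splice form 1": ["exon1", "exon2", "exon3"],
    "splice form 2": ["exon1", "exon3"],
}


def compatible_forms(read_exons):
    """Return the splice forms in which the read's exons appear as a contiguous run."""
    n = len(read_exons)
    hits = []
    for name, exons in SPLICE_FORMS.items():
        windows = (exons[i:i + n] for i in range(len(exons) - n + 1))
        if any(window == read_exons for window in windows):
            hits.append(name)
    return hits


def classify(read_exons):
    """Label a read as definite for one form, ambiguous, or incompatible."""
    hits = compatible_forms(read_exons)
    if len(hits) == 1:
        return "definitely " + hits[0]
    if len(hits) > 1:
        return "ambiguous"
    return "incompatible"


print(classify(["exon2"]))           # only splice form 1 contains exon 2
print(classify(["exon1", "exon3"]))  # the exon 1 / exon 3 junction exists only in form 2
print(classify(["exon1"]))           # exon 1 is shared by both forms
```

Only reads touching exon 2 or the exon 1 / exon 3 junction are informative; everything else is shared between the two forms.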

Page 14:

Options for handling splice variants

• Ignore them – combine exons and analyse at gene level
– Simple, powerful, inaccurate in some cases
– DESeq, SeqMonk

• Assign ambiguous reads based on unique ones – quantitate transcripts and optionally merge to gene level
– Potentially cleaner, more powerful signal
– High degree of uncertainty
– Cufflinks, BitSeq, RSEM

Page 15:

Read counting

• Simple (exon or transcript)
– HTSeq (htseq-count)
– BEDTools (multicov)
– featureCounts
– SeqMonk (graphical)

• Complex (re-assignment)
– Cufflinks, BitSeq, RSEM
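As an illustration of the simple approach, here is a minimal overlap counter. It is a sketch under stated assumptions, not the algorithm of any of the tools listed: features and reads are plain (chromosome, start, end) tuples with inclusive coordinates, and a read is only counted when it overlaps exactly one feature.

```python
from collections import Counter


def count_reads(features, reads):
    """Count reads per feature.

    features: {name: (chrom, start, end)}
    reads:    iterable of (chrom, start, end)
    Reads overlapping no feature or more than one feature are tallied
    separately as "no_feature" and "ambiguous".
    """
    counts = Counter({name: 0 for name in features})
    for chrom, start, end in reads:
        hits = [
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if f_chrom == chrom and start <= f_end and end >= f_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [("chr1", 90, 110), ("chr1", 190, 195), ("chr1", 250, 260), ("chr1", 400, 450)]
print(count_reads(features, reads))
```

The linear scan over features is fine for a toy example; real tools index the annotation so that each read is only compared with nearby features.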

Page 16:

Count Corrections

• Size of library

• Length of transcript

• Other factors

Page 17:

RPKM / FPKM / TPM

• RPKM (Reads per kilobase of transcript per million reads of library)
– Corrects for total library coverage
– Corrects for gene length
– Comparable between different genes within the same dataset

• FPKM (Fragments per kilobase of transcript per million reads of library)
– Only relevant for paired end libraries
– Pairs are not independent observations
– Equivalent to RPKM/2 when both reads of each pair are counted

• TPM (transcripts per million)
– Normalises to transcript copies instead of reads
– Corrects for cases where the average transcript length differs between samples
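The two corrections can be written out directly. This is a minimal sketch assuming raw counts and transcript lengths in base pairs are held in plain dictionaries; the gene names and numbers are made up for illustration.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million reads of library."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] / (lengths_bp[gene] / 1e3) / (total_reads / 1e6)
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-correct first, then scale to sum to 1e6."""
    per_base = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(per_base.values())
    return {gene: per_base[gene] / scale * 1e6 for gene in counts}


# Two genes at the same copy number: the 8 kb gene collects 8x the reads.
counts = {"geneA": 100, "geneB": 800}
lengths_bp = {"geneA": 1000, "geneB": 8000}

print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

Both measures give the two genes equal values once length is accounted for. TPM values always sum to one million within a sample, which is what makes them less sensitive to differences in average transcript length between samples.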

Page 18:

Normalisation

Page 19:

Normalisation

Page 20:

Filtering Genes

• Remove things which are uninteresting or shouldn’t be measured

• Reduces noise – easier to achieve significance
– Non-coding (miRNA, snoRNA etc) in RNA-Seq
– Known mis-spliced forms (exon skipping etc)
– Mitochondrial genes
– X/Y chr genes in mixed sex populations
– Unknown/Unannotated genes

Page 21:

Filtering Mouse mRNAs

Non coding

ESTs

Predicted genes

Good transcripts

Page 22:

Visualising Expression

• Comparing the same gene in different samples
– Normalised log2 RPM values

• Comparing different genes in the same sample
– Normalised log2 RPKM values

Page 23:

Linear Log2

Eef1a1

Actb

Lars2

Eef2

CD74

Page 24:

Differential Expression

• Microarrays traditionally used continuous statistical tests (t-test, ANOVA etc)

• RNA-Seq differs in that it is count based data, so continuous tests fail at low counts

• Most differential tests use count based distribution tests, usually based on a negative binomial distribution

Page 25:

Negative binomial tests (DESeq)

• Are the counts we see for gene X in condition 1 consistent with those for gene X in condition 2?

• Initially modelled using a simple Poisson distribution with mean expression as the only parameter

• A Poisson model doesn’t fit real data very well: replicates vary more than it predicts
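The difference between the two models is easiest to see in their variances. A Poisson distribution has variance equal to its mean, whereas the negative binomial adds a term which grows with the square of the mean, controlled by a dispersion parameter. The sketch below just evaluates the two formulas; the dispersion value of 0.1 is an arbitrary assumption chosen for illustration.

```python
def poisson_variance(mean):
    """Poisson: the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial: variance = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # assumed value, for illustration only
for mean in (10, 100, 1000):
    print(
        mean,
        poisson_variance(mean),
        negative_binomial_variance(mean, dispersion),
    )
```

At low counts the two are similar, but for well observed genes the extra term dominates, which is why a Poisson test reports far too many significant differences on replicated data.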

Page 26:

Poisson vs Negative binomial

Poisson

DESeq

Page 27:

Parameters

• Size factors
– Estimator of library sampling depth
– More stable measure than total coverage
– Based on median ratio between conditions

• Variance – required for NB distribution
– Custom variance distribution fitted to real data
– Insufficient observation to allow direct measure
– Smooth distribution assumed to allow fitting
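A median-of-ratios size factor can be sketched in a few lines. This follows the general idea described for DESeq (each sample is compared with a per-gene geometric mean across all samples, and the median ratio is taken), but it is a simplified illustration with made-up counts, not the package's implementation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: {sample: [count per gene]} with the same gene order in every sample.
    Genes with a zero count in any sample are skipped, since their
    geometric mean is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the geometric mean of each gene across samples.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Sample b is sequenced twice as deeply, apart from one strongly changing gene.
counts = {"a": [10, 20, 30, 1000], "b": [20, 40, 60, 100]}
print(size_factors(counts))
```

Because the median ignores the one gene which changes strongly, sample b still comes out with twice the size factor of sample a, whereas a total-count normalisation would be dominated by that single gene.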

Page 28:

Dispersion shrinkage

• Plot observed per-gene dispersion

• Calculate average dispersion for genes with similar observation levels

• Individual dispersions are regressed towards the mean, weighted by
– Distance from mean
– Number of observations

• Points more than 2SD above the mean are not regressed
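The steps above can be sketched as a weighted average on the log scale. This is a deliberately simplified stand-in for the real procedure: the fitted trend is passed in rather than estimated, and the `prior_weight` parameter and the weighting formula are assumptions made for illustration only.

```python
import math
from statistics import stdev


def shrink_dispersions(gene_disp, trend_disp, n_replicates, prior_weight=4.0):
    """Pull per-gene dispersions towards the fitted trend.

    gene_disp:    observed dispersion for each gene
    trend_disp:   average dispersion for genes with similar observation levels
    n_replicates: more replicates means more trust in the gene's own estimate
    prior_weight: hypothetical tuning value for how strongly the trend pulls
    """
    log_resid = [math.log(g) - math.log(t) for g, t in zip(gene_disp, trend_disp)]
    sd = stdev(log_resid)
    own_weight = n_replicates / (n_replicates + prior_weight)

    shrunk = []
    for g, t, resid in zip(gene_disp, trend_disp, log_resid):
        if resid > 2 * sd:
            # High outlier: keep the gene's own, larger dispersion.
            shrunk.append(g)
        else:
            log_value = own_weight * math.log(g) + (1 - own_weight) * math.log(t)
            shrunk.append(math.exp(log_value))
    return shrunk


gene_disp = [0.1] * 9 + [0.2, 5.0]
trend_disp = [0.1] * 11
print(shrink_dispersions(gene_disp, trend_disp, n_replicates=4))
```

Genes far from the trend move further in absolute terms, and leaving the high outliers alone avoids calling a gene significant when its replicates are genuinely more variable than its neighbours.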

Page 29:

Other filters

• Cook’s Distance
– Identifies high variance
– Effect on mean of removal of one replicate
– For n<3 test not performed
– For n = 3-6 genes which fail are removed
– For n>6 outliers removed to make trimmed mean
– Disable with cooksCutoff=FALSE

• Hit count optimisation
– Genes with low read counts are removed
– Limits multiple-testing to give max significant hits
– Disable with independentFiltering=FALSE
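The effect of removing poorly observed genes before multiple testing correction can be shown with a small Benjamini-Hochberg sketch. The threshold is passed in by hand here, whereas the hit count optimisation picks it automatically; the function names, p-values and counts are made up for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_after_filter(mean_counts, pvalues, min_mean, alpha=0.05):
    """Drop genes below min_mean, correct the rest, return indices of hits."""
    kept = [i for i, m in enumerate(mean_counts) if m >= min_mean]
    adjusted = bh_adjust([pvalues[i] for i in kept])
    return [i for i, q in zip(kept, adjusted) if q < alpha]


mean_counts = [100, 80, 1, 1, 1, 1, 1, 1]
pvalues = [0.01, 0.02, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

print(significant_after_filter(mean_counts, pvalues, min_mean=0))   # no filtering
print(significant_after_filter(mean_counts, pvalues, min_mean=10))  # low counts removed
```

With all eight genes tested neither of the two well observed genes survives correction; once the six barely observed genes are removed, both do.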

Page 30:

Replicates

• Compared to arrays, RNA-Seq is a very clean technical measure of expression
– Generally don’t run technical replicates

• Some statistics can be run on single replicates, but they can only tell you about technical noise (how likely is it that this change is due to a technical issue)

• Assessing biological variation requires biological replicates

Page 31:

Replicates

• Traditional statistics require min 3x3

• DESeq can operate at 2x2, but this is a minimum, not a recommendation

• True number of replicates required will depend on your biology and requirements

• 4x4 design is fairly common

• Always expect at least one sample to fail

• Randomise samples during sample prep

Page 32:

The problem of power…

• In a library Gene B is much better observed for the same copy number

• Power to detect DE is proportional to length

Gene A (1kb)

Gene B (8kb)
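A quick calculation shows why the longer gene is better observed. The sketch assumes reads are sampled evenly along transcripts and that counting noise is roughly Poisson, so the relative noise on a count is about 1/sqrt(count); the copy number and the reads-per-copy-per-kb figure are arbitrary values chosen for illustration.

```python
import math


def expected_reads(copies, length_bp, reads_per_copy_kb):
    """Expected read count if reads are sampled evenly along every transcript copy."""
    return copies * (length_bp / 1000) * reads_per_copy_kb


def relative_noise(count):
    """Approximate relative counting noise under Poisson sampling."""
    return 1 / math.sqrt(count)


copies = 10            # same copy number for both genes (assumed)
reads_per_copy_kb = 5  # arbitrary sequencing depth (assumed)

for name, length_bp in [("Gene A", 1000), ("Gene B", 8000)]:
    count = expected_reads(copies, length_bp, reads_per_copy_kb)
    print(name, count, round(relative_noise(count), 3))
```

Gene B collects eight times as many reads as Gene A at the same copy number, so the same fold change stands out more clearly from the counting noise.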

Page 33:

5x5 Replicates

5,000 out of 22,000 genes (23%) identified as DE using DESeq (p<0.05)

Page 34:

Intensity difference test

• Different approach to differential expression

• Doesn’t aim to find every differentially expressed gene

• Conservative test

• Guaranteed to never return large numbers of hits

Page 35:

Assumptions

• Noise is related to observation level
– Similar to DESeq

• Differences between conditions are either
– A direct response to stimulus
– Noise, either technical or biological

• Find points whose differences aren’t explained by general disruption

Page 36:

Method

Page 37:

Results

Page 38:

Exercises

• Look at raw QC

• Mapping with TopHat
– Small test data

• Quantitation and visualisation with SeqMonk
– Larger replicated data

• Differential expression with DESeq

• Review in SeqMonk