Introduction to DESeq and edgeR packages Peter A.C. ’t Hoen.

Poisson distribution

• discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event

$P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$

$\lambda$ = expected number of occurrences, $k$ = number of occurrences
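As a quick numerical illustration of the definition above, this minimal Python sketch evaluates the Poisson probability mass function and checks that its mean and variance both equal the expected number of occurrences. The function name `poisson_pmf` and the rate of 4.0 are illustrative choices, not part of the slides.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k events when lam events are expected."""
    # Work on the log scale to avoid overflow in lam**k and k!.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 4.0             # illustrative expected number of occurrences
support = range(200)  # far enough into the tail for lam = 4

mean = sum(k * poisson_pmf(k, lam) for k in support)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in support)
print(mean, var)      # both should be very close to lam
```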

Count process

• Poisson distribution

$Y_t \sim \mathrm{Poisson}(\lambda_t)$ with $\lambda_t = p_t n$

$t$: tag

$\lambda_t$: true expression

$Y_t$: observed expression

$p_t$: probability

$n$: total number of RNA molecules

• Truncated Poisson distribution: a zero can mean not expressed or not counted

• Count variance ~ $\lambda_t$

• Murray F Freeman and John W Tukey. Ann Math Statist, 21:607-611, (1950)

Negative binomial distribution

• discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number $r$ of failures occurs

• also arises as a continuous mixture of Poisson distributions where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can view the negative binomial as a Poisson($\lambda$) distribution, where $\lambda$ is itself a random variable, distributed according to Gamma($r$, $p/(1 - p)$), with shape $r$, scale $p/(1 - p)$ and $p$ the success probability.
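The mixture view can be checked by simulation. The sketch below draws a rate from Gamma(shape r, scale p/(1 − p)), then a Poisson count with that rate, and compares the sample mean and variance with the negative binomial values r·p/(1 − p) and r·p/(1 − p)². The helper names, the seed, and the values r = 3, p = 0.5 are assumptions made for the illustration; the Poisson sampler uses Knuth's multiplication method because the Python standard library has none.

```python
import math
import random

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def gamma_poisson_sample(r: float, p: float, rng: random.Random) -> int:
    """Draw lambda ~ Gamma(shape=r, scale=p/(1-p)), then a Poisson(lambda) count."""
    lam = rng.gammavariate(r, p / (1.0 - p))
    return poisson_sample(lam, rng)

rng = random.Random(1)  # fixed seed so the run is reproducible
r, p = 3.0, 0.5         # illustrative parameters
draws = [gamma_poisson_sample(r, p, rng) for _ in range(20000)]

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)
# Negative binomial theory: mean = r*p/(1-p) = 3, variance = r*p/(1-p)**2 = 6
print(mean, var)
```

The variance comes out clearly larger than the mean, which is the overdispersion that a plain Poisson model cannot represent.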

edgeR (1)

• Robinson, Smyth (Biostatistics, 2008; Bioinformatics 2007)

• Package available from Bioconductor with a very informative vignette

$Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi)$

$\mathrm{Var}(Y_{ij}) = \mu_{ij} (1 + \mu_{ij} \phi)$

• Negative binomial (gamma-Poisson) with mean $\mu$

• $\phi$ is the overdispersion parameter (biological variation)

• $\phi = 0$ gives the Poisson distribution
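To make the mean/dispersion parameterization concrete, here is a minimal Python sketch of the negative binomial probability mass function with mean μ ("mu") and overdispersion φ ("Phi"), followed by a numerical check that the variance equals μ(1 + μφ). The function name `nb_pmf` and the values μ = 10, φ = 0.2 are illustrative assumptions; this is not edgeR code.

```python
import math

def nb_pmf(y: int, mu: float, phi: float) -> float:
    """NB probability of count y for mean mu and dispersion phi (phi > 0)."""
    r = 1.0 / phi  # "size" parameter of the gamma-Poisson mixture
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

mu, phi = 10.0, 0.2    # illustrative values
support = range(2000)  # the tail beyond this is numerically zero here

mean = sum(y * nb_pmf(y, mu, phi) for y in support)
var = sum((y - mean) ** 2 * nb_pmf(y, mu, phi) for y in support)
print(mean, var, mu * (1 + mu * phi))  # variance should match mu * (1 + mu * phi)
```

Choosing φ very close to zero makes `nb_pmf` approach the Poisson probabilities with the same mean, in line with the last bullet.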

Overdispersion in our data

edgeR (2)

• Test per gene

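$Y_{gij} \sim \mathrm{NB}(\mu_{gij}, \phi_g)$ where $\mu_{gij} = M_j \times p_{gi}$

$\mathrm{Var}(Y_{gij}) = \mu_{gij} (1 + \mu_{gij} \phi_g)$

$p_{gi}$ is the proportion of tags for tag $g$ in sample $i$

$M_j$ is the library size for library $j$ of sample $i$

$\phi_g$ is the dispersion parameter for tag $g$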

edgeR (3)

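• Estimation of a common dispersion parameter by conditioning the likelihood for $\phi_g$ on the sum of counts and maximizing the common likelihood

$l_C(\phi) = \sum_g l_g(\phi)$

• Common dispersion parameter OR weighted linear combination of common and individual likelihoods

$WL(\phi_g) = l_g(\phi_g) + \alpha\, l_C(\phi_g)$, with $\alpha$ the weight given to the common likelihood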
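The idea of combining an individual and a common likelihood can be sketched in a few lines of Python. This is a simplified illustration, not edgeR's method: it uses the plain negative binomial log-likelihood with the mean fixed at each tag's sample mean instead of the conditional likelihood, assumes equal library sizes, and searches a coarse grid instead of optimizing. The counts, the grid, the weight (called `alpha` here), and all function names are made up for the example.

```python
import math

def nb_loglik(counts, phi):
    """NB log-likelihood of one tag's counts; the mean is set to the sample mean."""
    mu = sum(counts) / len(counts)
    r = 1.0 / phi
    return sum(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu))
        for y in counts
    )

# Candidate dispersions 0.01, 0.02, ..., 2.00 (a coarse grid, for illustration).
GRID = [0.01 * k for k in range(1, 201)]

def common_loglik(tags, phi):
    """Common likelihood: sum of the per-tag log-likelihoods at a shared phi."""
    return sum(nb_loglik(counts, phi) for counts in tags)

def common_dispersion(tags):
    return max(GRID, key=lambda phi: common_loglik(tags, phi))

def weighted_dispersion(tags, g, alpha):
    """Maximize individual + alpha * common log-likelihood for tag g over the grid."""
    return max(GRID, key=lambda phi: nb_loglik(tags[g], phi)
               + alpha * common_loglik(tags, phi))

# Made-up counts: four tags, four libraries of equal size.
tags = [[10, 12, 9, 11], [5, 30, 2, 18], [100, 80, 140, 60], [7, 7, 8, 6]]
common = common_dispersion(tags)
tagwise = weighted_dispersion(tags, 1, alpha=0.0)  # individual likelihood only
shrunk = weighted_dispersion(tags, 1, alpha=1.0)   # pulled toward the common value
print(common, tagwise, shrunk)
```

With a weight of zero the estimate depends on the tag's own counts only; increasing the weight moves it toward the common dispersion, which is the practical effect of the weighted likelihood.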

edgeR (4)

• Exact test replacing hypergeometric probabilities with NB-derived probabilities (qCML) for single-factor experiments

• Generalized linear models and Cox-Reid profile-adjusted likelihood (CR) method for multifactorial experiments
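The exact test in the first bullet can be sketched as follows for a single tag and two groups. The sketch conditions on the total count over both groups and sums the probabilities of all splits of that total that are no more likely than the observed one, using negative binomial instead of hypergeometric probabilities. It assumes equal library sizes and a known dispersion, and it plugs in the pooled mean under the null hypothesis; the function names and the example numbers are illustrative, and this is not the edgeR implementation.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log NB probability of count y for mean mu and dispersion phi."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def nb_exact_test(sum_a, sum_b, n_a, n_b, phi):
    """Two-sided p-value for one tag, conditioning on the total count.

    sum_a, sum_b: summed counts in groups A and B; n_a, n_b: libraries per group.
    """
    total = sum_a + sum_b
    if total == 0:
        return 1.0  # no counts at all, so no evidence of a difference
    mu = total / (n_a + n_b)  # pooled per-library mean under the null
    # The sum of n independent NB(mu, phi) counts is NB(n * mu, phi / n).
    probs = [
        math.exp(nb_logpmf(a, n_a * mu, phi / n_a)
                 + nb_logpmf(total - a, n_b * mu, phi / n_b))
        for a in range(total + 1)
    ]
    observed = probs[sum_a]
    # Add up all splits of the total that are at most as likely as the observed one.
    extreme = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return extreme / sum(probs)

p_balanced = nb_exact_test(30, 30, 3, 3, phi=0.2)   # identical group sums
p_extreme = nb_exact_test(0, 200, 3, 3, phi=0.2)    # all counts in one group
print(p_balanced, p_extreme)
```

Identical group sums give a p-value of 1, while a split with all counts in one group gives a very small p-value.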

edgeR: what is new?

• Exact test cannot handle confounders; replaced by a generalized linear model with a log-likelihood ratio test

• Abundance trending in dispersion estimates

Dispersion trend

[Figure; axes: dispersion, abundance]

Dispersion trending (after filtering for low ab)

[Figure; axes: dispersion, abundance]

DESeq (1)

• Anders and Huber: Genome Biology (2010) 11:R106

• Roughly same principles as edgeR

• No multifactorial analysis implemented yet

DESeq (2)

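(1) $Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$

(2) $\mu_{ij} = s_j q_{i,\rho(j)}$, where $s_j$ is the scaling factor for sample $j$ and $q_{i,\rho(j)}$ is proportional to the concentration of tag $i$ in condition $\rho(j)$

(3) $\sigma^2_{ij} = \mu_{ij} + s_j^2 \nu_{i,\rho(j)}$, where $\nu_{i,\rho(j)}$ is a smooth function depending on $q_{i,\rho(j)}$ (concentration)

In (3), $\mu_{ij}$ is the count noise and $s_j^2 \nu_{i,\rho(j)}$ is the extra variance.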
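A small Python sketch of equations (2) and (3): the mean is the scaling factor times the concentration, and the variance is the count noise plus the scaled extra variance. The smooth function is supplied by the caller. The quadratic choice used below (a dispersion times the squared concentration) is a hypothetical example, picked because it reproduces the edgeR-style variance mean × (1 + mean × dispersion); the function name and all numbers are illustrative.

```python
def deseq_mean_variance(s_j, q, nu):
    """Mean and variance of a count under equations (2) and (3).

    s_j: scaling factor of sample j; q: concentration of the tag in the
    sample's condition; nu: smooth function mapping q to the raw extra variance.
    """
    mu = s_j * q                # (2) expected count
    count_noise = mu            # Poisson-like part of the variance
    extra = s_j ** 2 * nu(q)    # extra variance, scaled to the sample
    return mu, count_noise + extra

phi = 0.1  # hypothetical dispersion

def nu(q):
    """Hypothetical smooth function: extra variance grows with q squared."""
    return phi * q ** 2

mu, var = deseq_mean_variance(s_j=1.5, q=40.0, nu=nu)
print(mu, var, mu * (1 + mu * phi))  # the last two values agree for this nu
```

Any other smooth function of the concentration can be passed in the same way, which is where the DESeq variance model is more flexible than a single dispersion value.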

DESeq (3): variance trend with expression

Purple: Poisson; dashed orange: edgeR (before trending); orange: DESeq

You can derive: squared CV is $1/\mu + \phi$
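The squared coefficient of variation follows in one step from the negative binomial variance $\mathrm{Var}(Y) = \mu (1 + \mu \phi)$, with $\mu$ the mean and $\phi$ the dispersion:

```latex
\mathrm{CV}^2
  = \frac{\mathrm{Var}(Y)}{\mu^2}
  = \frac{\mu \, (1 + \mu \phi)}{\mu^2}
  = \frac{1}{\mu} + \phi
```

For the Poisson case ($\phi = 0$) this reduces to $1/\mu$, so the relative noise vanishes at high expression, whereas with $\phi > 0$ it levels off at $\phi$.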

DESeq (3)

• Differences from edgeR:

• Complete shrinkage to trended dispersion; limited tagwise dispersion estimates

• Different variance estimates for different sample groups allowed

• Deals better with samples with large differences in read depth?

DESeq (4): statistical testing

• In analogy to the initial edgeR implementation: exact test on the NB probabilities in the two conditions

Conclusions

• edgeR and DESeq are comparable implementations of statistical tests using the NB distribution

• edgeR and DESeq produce largely similar results

• Implementation of generalized linear models in edgeR allows for testing with confounders

• Results comparable to limma for medium- to high-expressed genes: modeling of stochastic effects is particularly important for low-expressed genes

Comparison to limma (on sqrt scaled data)