Identifying differentially expressed genes from RNA- seq data

16
1 tifying differentially expressed genes from RNA-seq cent algorithms for calling differentially expresse Empirical analysis of digital gene expression data in R tp://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html Differential gene expression analysis based on the gative binomial distribution tp://www-huber.embl.de/users/anders/DESeq/ Empirical Bayesian analysis of patterns of differential pression in count data tp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

description

Identifying differentially expressed genes from RNA- seq data. Many recent algorithms for calling differentially expressed genes: edgeR : Empirical analysis of digital gene expression data in R http:// www.bioconductor.org /packages/2.10/ bioc /html/ edgeR.html - PowerPoint PPT Presentation

Transcript of Identifying differentially expressed genes from RNA- seq data

Page 1: Identifying differentially expressed genes from RNA- seq  data

1

Identifying differentially expressed genes from RNA-seq data

Many recent algorithms for calling differentially expressed genes:

edgeR: Empirical analysis of digital gene expression data in Rhttp://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html

DEseq: Differential gene expression analysis based on the negative binomial distributionhttp://www-huber.embl.de/users/anders/DESeq/

baySeq: Empirical Bayesian analysis of patterns of differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Page 2: Identifying differentially expressed genes from RNA- seq  data

2

Identifying differentially expressed genes from RNA-seq data

Why can’t we use the same software for microarray data analysis?

1. Microarray data are continuous values. Sequence data are discrete values ofread counts.

e.g. 5.342 signal intensity versus 5 counts

2. Reproducibility of RNA-seq measurements is different for low-abundance versushigh-abundance transcripts – this is called over-dispersion

Page 3: Identifying differentially expressed genes from RNA- seq  data

3

Overdispersion: variance of replicates is higher for high-abundance reads

Page 4: Identifying differentially expressed genes from RNA- seq  data

4

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Identifies differentially expressed genes between 2 or more samples using replicated RNA-seq data

Page 5: Identifying differentially expressed genes from RNA- seq  data

5

Bayes’ Theorem

P (M | D) = P (D | M) * P (M) P (D)

Read in English: The Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model times the Probability of your Model divided by the Probability of the Data

What the hell does this mean?

Page 6: Identifying differentially expressed genes from RNA- seq  data

6

Bayes’ Theorem: wikipedia’s ridiculous example

Your friend had a conversation with someone who happened to have long hair.What is the probability that that person was a woman, given that you know~50% of people are women and ~75% of women have long hair.

P (W) = Probability that this person was a woman = 0.5P (L | W) = Probability that the person had long hair IF the person is a woman = 0.75P (L | M) = Probability that the person had long hair IF that person is a man = 0.3P (L) = Probability that any random person has long hair =

P (L | W) * P (W) = 75% of 50% of the population = 0.75*0.5 = 0.375 + P (L | M) * P (M) = 30% of 50% of the population = 0.3 * 0.5 = 0.15

So:P (W|L) = P(L|W)*P(W) = P(L|W)*P(W) = 0.714

P(L) P (L|W)*P(W) + P (L|M)*P(M)

Page 7: Identifying differentially expressed genes from RNA- seq  data

7

Bayes’ Theorem

P (M | D) = P (D | M) * P (M) P (D)

Read in English: The Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model times the Probability of your Model divided by the Probability of the Data

P is called the ‘Posterior Probability’ that this Model is the right one to describe your data.

Each gene will have a PP for each model, where S PP = 1.

The Bigger the PP, the more likely this is the right model.

Posterior Probability is NOT a p-value!

Page 8: Identifying differentially expressed genes from RNA- seq  data

8

Bayes’ Theorem

We don’t know some of these factors ( e.g. P(D|M) ), but we can describe the data with some parameter set called q

(which includes mean and stdev of count data across replicates of each gene)

Imagine you have three RNA-seq replicates of two samples (WT vs mutant).There are two models for each gene

M0 = the model that your gene is NOT differentially expressed across the 2 samples

MDE = the model that a given gene IS differentially expressed

P (MDE | DgeneX) = P (DgeneX | MDE) * P (MDE) P (DgeneX)

P (M0 | DgeneX) = P (DgeneX | M0) * P (M0) P (DgeneX)

Page 9: Identifying differentially expressed genes from RNA- seq  data

9

wikepediaE.g. Infecting a cell culture with viral particles. l = Multiplicity of Infection (MOI) = # of viral particles/cell in your culture

How to model the mean and standard deviation of your replicates:Poisson Distribution: Accounts for random fluctuations

Page 10: Identifying differentially expressed genes from RNA- seq  data

10

Overdispersion: variance of replicates is higher for high-abundance reads

Page 11: Identifying differentially expressed genes from RNA- seq  data

11

How to model the mean and standard deviation of your replicates:Negative Binomial Distribution: Mean and variance are different

Notice the wider spread on the right side of each distribution, for higher numbers

Page 12: Identifying differentially expressed genes from RNA- seq  data

12

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Imagine you have triplicate measurements of sample A and sample B. We want toidentify genes differentially expressed (DE) in A versus B.

Arep1 Arep2 Brep1 Brep2

We define TWO possible models:NO DE (NDE): Arep1 = Arep2 = Brep1 = Brep2

we say that the data for all samples was drawn from the same distribution(* i.e. same mean and standard deviation, if normally distributed)

DE: Arep1 = Arep2 NOT EQUAL Brep1 = Brep2

A and B replicates are drawn from two different distributions

tuple: simply S counts per transcription unit

Page 13: Identifying differentially expressed genes from RNA- seq  data

13

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

P (MDE | DgeneX) = P(DgeneX|MDE) * P(MDE)

P(DgeneX)

Next, we use Bayes’ Rule to try to estimate the probability that the DE modelis true for geneX

baySeq tries to use data sharing across the dataset to estimate the componentson the right side of the equation

P (DgeneX | MDE ) = Int[ P(DgeneX| K, MDE) * P(K | MDE) ] dK

where K is the parameter set of q (mean, dispersion) for each gene in each replicate** q for each gene is estimated by looking at all data in the replicates

prior probability

Page 14: Identifying differentially expressed genes from RNA- seq  data

14

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

P (MDE | DgeneX) = P(DgeneX|MDE) * P(MDE)

P(DgeneX)

prior probability

The prior probability is the probability of M before any data are considered.

Here: 1. Guess at a starting prior probability for MDE

2. Iteratively test different alternatives3. Repeat until ‘convergence’ (P(MDE) does not change with more iterations)

* For data with strong DE signal, the posterior probability is not very dependenton the starting prior probability P(MDE)

Page 15: Identifying differentially expressed genes from RNA- seq  data

15

Page 16: Identifying differentially expressed genes from RNA- seq  data

16

Volcano Plot