scRNA-seq - Differential expression analyses

Olga Dethlefsen (olga.dethlefsen@nbis.se)
NBIS, National Bioinformatics Infrastructure Sweden
February 2019

Outline

• Introduction: what is so special about scRNA-seq DE?
• Common methods: what is out there?
• Performance: how do we know what is best?
• Practicalities: what to do in real life
• Summary: what to remember from this hour

Introduction

What does differential expression mean to you?

https://www.menti.com

Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]

adapted from Wu et al. 2017

Differential expression means:
• taking read count data, and
• performing statistical analysis to discover quantitative changes in expression levels between experimental groups,
• i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)

Differential expression is an old problem:
• known from bulk RNA-seq and microarray studies
• in fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences

Differential expression is an old problem. So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics:
• high noise levels (technical and biological factors)
• low library sizes
• low amount of available mRNAs results in amplification biases and dropout events
• 3' bias, partial coverage and uneven depth (technical)
• stochastic nature of transcription (biological)
• multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)

Figure: Density plots of expression values (x-axis: value; y-axis: density), one panel per gene, for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.

Common methods

Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods
• e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
• non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
• some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
• if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
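To make the rank-and-ties point concrete, here is a minimal sketch in Python (standard library only) of a two-sided Wilcoxon rank-sum test that uses mid-ranks and the normal approximation with a tie correction. The function names and the toy expression values are illustrative assumptions, not part of the original slides, and no continuity correction is applied.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test: returns (U statistic of x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # tie correction: sum of t^3 - t over every group of t tied values
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # every value is tied: the ranks carry no information
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical log-expression values for one gene in two cell groups.
group_a = [0, 0, 0, 0, 0, 0, 3, 5, 8, 12]
group_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

With this many zeros, most cells share a single mid-rank, so the test effectively sees only the handful of non-zero cells.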

Methods developed for bulk RNA-seq
• e.g. edgeR, DESeq2
• compare estimates of mean expression (sample size)
• based on the negative binomial distribution
• can be assessed using datasets where RNA-seq data has been validated by RT-qPCR

Methods developed for scRNA-seq
• e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
• single-cell experiments provide a large number of samples (i.e. cells) for each group we are comparing, so we can take advantage of the whole distribution of expression values in each group to identify differences between groups
• we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612

Common methods: More detailed examples

MAST
• uses a generalized linear hurdle model
• designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
• the rate of expression Z and the level of expression Y are modeled for each gene g, where Z_{ig} indicates whether gene g is expressed in cell i (i.e. Z_{ig} = 0 if y_{ig} = 0 and Z_{ig} = 1 if y_{ig} > 0)
• a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

\[ \mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D \]
\[ \Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2) \]

  where X_i is a design matrix
• model parameters are fitted using an empirical Bayesian framework
• allows for a joint estimate of nuisance and treatment effects
• DE is determined using the likelihood ratio test
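The two-part idea can be sketched in a few lines of Python (standard library only). This is a heavily simplified illustration, not the MAST implementation: the only covariate is a two-level group label, there is no empirical Bayes shrinkage, and the helper names and toy values are assumptions. With a single group covariate the logistic part reduces to comparing detection rates, and the Gaussian part to comparing means of the non-zero values; the two likelihood ratio statistics are added and referred to a chi-square distribution with 2 degrees of freedom.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    if n == 0:
        return 0.0
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)


def _rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Return (statistic, p_value) of a simplified two-part likelihood ratio test."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: is the detection rate the same in both groups?
    ll_alt = (_binom_loglik(len(pos_a), len(group_a))
              + _binom_loglik(len(pos_b), len(group_b)))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b),
                            len(group_a) + len(group_b))
    lr_discrete = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part: Gaussian model on the non-zero values, common variance.
    # The likelihood ratio statistic is n * log(RSS_null / RSS_alt).
    lr_continuous = 0.0
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_alt = _rss(pos_a) + _rss(pos_b)
        rss_null = _rss(pos_a + pos_b)
        if rss_alt > 0:
            lr_continuous = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt)

    stat = lr_discrete + lr_continuous
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    # When the continuous part is skipped this reference is conservative.
    return stat, math.exp(-stat / 2)


# Hypothetical log-expression values for one gene in two cell groups.
cells_a = [0, 0, 0, 0, 0, 0, 0, 0, 1.0, 1.2]
cells_b = [3.0, 3.5, 4.0, 2.8, 3.2, 3.9, 0, 4.1, 3.3, 3.6]
statistic, p_value = hurdle_lrt(cells_a, cells_b)
print(f"LR statistic = {statistic:.2f}, p = {p_value:.2g}")
```

A gene can be flagged here because it is detected in a different fraction of cells, because its non-zero expression level differs, or both.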

Let's stop for a minute

The key

\[ \text{outcome}_i = (\text{model}_i) + \text{error}_i \]

• we collect data on a sample from a much larger population
• statistics lets us make inferences about the population from which the sample was derived
• we try to predict the outcome given a model fitted to the data

The key

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

Figure: Histogram of height (x-axis: height [cm]; y-axis: Frequency)
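As a worked example of the statistic above, the following Python sketch (standard library only) computes the two-sample t statistic. It assumes that s_p is the usual pooled standard deviation, which the slide does not spell out, and the height values are made up for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Return (t statistic, degrees of freedom) for two independent samples."""
    n1, n2 = len(x), len(y)
    # pooled standard deviation: weighted average of the two sample variances
    sp = math.sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y))
                   / (n1 + n2 - 2))
    t = (mean(x) - mean(y)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical heights in cm for two groups.
heights_1 = [168.0, 171.5, 174.0, 169.5, 176.0]
heights_2 = [172.0, 178.5, 175.0, 180.0, 177.5]
t_stat, dof = pooled_t(heights_1, heights_2)
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

The same outcome = model + error logic applies: the model is one mean per group, and t measures the difference in means relative to the residual variation.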

Generic recipe
• model the data, e.g. gene expression with random error
• fit the model to the data and/or the data to the model
• estimate model parameters
• use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions: Negative Binomial

Figure: Three example histograms of negative binomial read counts (x-axis: Read Counts; y-axis: Frequency)

\[ \mu = \mathit{mu} \]
\[ \sigma^2 = \mathit{mu} + \frac{\mathit{mu}^2}{\mathit{size}} \]

mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014; Islam et al. 2011).
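The mean and variance relationship can be checked by simulation. The sketch below (Python, standard library only) draws negative binomial counts as a gamma-Poisson mixture, which is one standard way to generate them; the function names, parameter values and seed are illustrative assumptions.

```python
import math
import random
from statistics import mean, pvariance


def poisson(lam, rng):
    """Draw a Poisson count (Knuth's method; adequate for small rates)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_nb(mu, size, n_cells, seed=0):
    """Draw n_cells negative binomial counts with mean mu and dispersion size."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cells):
        # gamma-distributed rate with shape = size and mean = mu
        rate = rng.gammavariate(size, mu / size)
        counts.append(poisson(rate, rng))
    return counts


mu, size = 5.0, 2.0
counts = simulate_nb(mu, size, n_cells=20000, seed=1)
print("theoretical mean, variance:", mu, mu + mu ** 2 / size)
print("simulated   mean, variance:", round(mean(counts), 2),
      round(pvariance(counts), 2))
```

A smaller size gives a larger variance for the same mean, which is the sense in which size is inversely related to the variance.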

Common distributions: Zero-inflated NB

Figure: Two example histograms of zero-inflated negative binomial read counts (x-axis: Read Counts; y-axis: Frequency)

\[ \mu = \mathit{mu} \cdot (1 - d) \]
\[ \sigma^2 = \mu \cdot (1 - d) \cdot \left(1 + d \cdot \mu + \frac{\mu}{\mathit{size}}\right) \]

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
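A small numerical check of these moments, in Python. The sketch assumes that the symbol inside the variance expression stands for the mean of the negative binomial component (mu), which is an interpretation of the slide's notation rather than something the slide states; under that reading the expression agrees with the variance of a mixture of a point mass at zero (weight d) and an NB (weight 1 - d). Function names and parameter values are illustrative.

```python
def zinb_moments(mu, size, d):
    """Mean and variance as written on the slide, reading mu as the NB mean."""
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance


def mixture_moments(mu, size, d):
    """Same moments derived directly from the zero/NB mixture definition."""
    nb_variance = mu + mu ** 2 / size
    mean = (1 - d) * mu
    # E[X^2] of the mixture: zeros contribute nothing
    second_moment = (1 - d) * (nb_variance + mu ** 2)
    return mean, second_moment - mean ** 2


mu, size, d = 10.0, 2.0, 0.3
print("slide formula :", zinb_moments(mu, size, d))
print("mixture moment:", mixture_moments(mu, size, d))
print("no dropout    :", zinb_moments(mu, size, 0.0))
```

Setting d = 0 recovers the plain negative binomial mean and variance.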

Common distributions: Poisson-Beta

Figure: Two example histograms of Poisson-Beta read counts (x-axis: Read Counts; y-axis: Frequency)

\[ \mu = g \cdot \frac{a}{a + b} \]
\[ \sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2} \]

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
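The Poisson-Beta model can be simulated in two steps: draw a Beta(a, b) activity level per cell, scale it by g to get a rate, then draw a Poisson count at that rate. The Python sketch below (standard library only) does this with illustrative parameter values. As far as the simulation shows, the variance expression above matches the variance of the Beta-distributed rate; the simulated counts carry the Poisson noise on top of it, so their variance is larger by roughly the mean. The function names and seed are assumptions.

```python
import math
import random
from statistics import mean, pvariance


def poisson(lam, rng):
    """Draw a Poisson count (Knuth's method; adequate for moderate rates)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_poisson_beta(a, b, g, n_cells, seed=0):
    """Return (rates, counts) for n_cells cells under the Poisson-Beta model."""
    rng = random.Random(seed)
    rates, counts = [], []
    for _ in range(n_cells):
        rate = g * rng.betavariate(a, b)  # transcription rate of this cell
        rates.append(rate)
        counts.append(poisson(rate, rng))
    return rates, counts


a, b, g = 2.0, 3.0, 20.0
rates, counts = simulate_poisson_beta(a, b, g, n_cells=20000, seed=1)
formula_mean = g * a / (a + b)
formula_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
print("formula mean, variance:", formula_mean, formula_var)
print("simulated rate variance :", round(pvariance(rates), 2))
print("simulated count mean    :", round(mean(counts), 2))
print("simulated count variance:", round(pvariance(counts), 2))
```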

MAST (revisited)
• uses a generalized linear hurdle model
• designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
• the rate of expression Z and the level of expression Y are modeled for each gene g, where Z_{ig} indicates whether gene g is expressed in cell i (i.e. Z_{ig} = 0 if y_{ig} = 0 and Z_{ig} = 1 if y_{ig} > 0)
• a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

\[ \mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D \]
\[ \Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2) \]

  where X_i is a design matrix
• model parameters are fitted using an empirical Bayesian framework
• allows for a joint estimate of nuisance and treatment effects
• DE is determined using the likelihood ratio test

SCDE
• models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
• the NB distribution models the transcripts that are amplified and detected
• the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
• a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
• for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach

Monocle
• originally designed for ordering cells by progress through differentiation stages (pseudo-time)
• the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

\[ g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m) \]

  where Y is a specific gene expression level, x_i are predictor variables, g is a link function (typically the log function), and f_i are non-parametric functions (e.g. cubic splines)
• the observable expression level Y is then modeled using a GAM:

\[ E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon \]

  where \varphi_t(b_x, s_i) is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term \epsilon is normally distributed with a mean of zero
• the DE test is performed using an approximate χ² likelihood ratio test

Performance



No ground truth, i.e. no independently validated truth is available for testing

Known data
• using data we know something about to get positive controls

Simulated data
• null data sets by re-sampling
• modeling data sets based on various distributions

Comparing between methods and scenarios
• comparing numbers of DEs, incl. as a function of group size

Investigating results
• what do the expression and distributions of detected DEs look like?

• False positives (type I error) vs. false negatives (type II error)
• Sensitivity and specificity
• Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
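When the truth is known, as in simulated data, these metrics follow directly from the four cells of the confusion matrix. A minimal Python sketch, with hypothetical gene identifiers and a made-up set of calls:

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def de_metrics(called_de, truly_de, all_genes):
    """Compare a method's DE calls with a known set of truly DE genes."""
    called, truth = set(called_de), set(truly_de)
    tp = len(called & truth)          # DE and called DE
    fp = len(called - truth)          # type I errors
    fn = len(truth - called)          # type II errors
    tn = len(set(all_genes)) - tp - fp - fn
    return {
        "sensitivity": _ratio(tp, tp + fn),   # also called recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "false_positive_rate": _ratio(fp, fp + tn),
    }


# Hypothetical simulation: 10 genes, 4 of them truly DE.
genes = [f"g{i}" for i in range(1, 11)]
truth = ["g1", "g2", "g3", "g4"]
calls = ["g1", "g2", "g3", "g5"]
metrics = de_metrics(calls, truth, genes)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and recall are the same quantity; precision additionally depends on how many genes are truly DE, which is why it is often reported for simulated data where DE genes are a minority.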

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018: "Bias, robustness and scalability in single-cell differential expression analysis"
• 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
• based on 9 data sets with 11 - 21 separate instances (sample size effect)
• extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
• conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data
• Example data: 46078 genes x 96 cells
• 22229 genes with no expression at all

Figure: Two histograms of the example data (x-axes: Read Counts and 0 counts; y-axes: Frequency)
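A first look at a count matrix can be as simple as counting zeros. The Python sketch below uses a tiny made-up genes-by-cells matrix (not the example data from the slide) to report the number of genes with no expression at all, the number of zero counts per gene, and the per-cell detection rate, taken here as the fraction of genes with a non-zero count in that cell.

```python
def summarize_counts(counts):
    """counts: list of rows, one row per gene, one column per cell."""
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0
    # number of cells in which each gene has a zero count
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # genes that are zero in every cell
    never_expressed = sum(1 for z in zeros_per_gene if z == n_cells)
    # fraction of genes detected (count > 0) in each cell
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]
    return never_expressed, zeros_per_gene, detection_rate


# Hypothetical toy matrix: 4 genes x 3 cells.
toy_counts = [
    [0, 0, 0],
    [5, 0, 2],
    [1, 1, 0],
    [0, 0, 0],
]
never, zeros, rates = summarize_counts(toy_counts)
print("genes with no expression at all:", never)
print("zero counts per gene:", zeros)
print("detection rate per cell:", rates)
```

Genes that are zero in every cell carry no information for DE testing and are commonly filtered out before any method is run.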

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

• QC filtering
• Cell-cycle phase
• Normalization of cell-specific biases
• Confounding factors, incl. batch effects
• Detection rate, i.e. the fraction of detected genes per cell
• Imputation strategies for dropout values
• What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour?

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

• scRNA-seq is a rapidly growing field
• DE is a common task, so many newer and better methods will be developed
• understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
• staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 2: scRNA-seq - Differential expression analyses

Outline

Outline

Introduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DE

Common methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out there

Performance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best

Practicalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life

Summary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Introduction

Introduction

Olga (NBIS) scRNA-seq DE February 2019 3 46

Introduction

What does differential expression mean to you

httpswwwmenticom

Olga (NBIS) scRNA-seq DE February 2019 4 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 5 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 6 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)


Figure: density plots of expression (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods

e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test

non-parametric tests generally convert observed expression values to ranks & test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group

some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data

if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
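The rank-based idea, and the way tied zeros enter it, can be sketched in a few lines of Python. This is a minimal illustration only: it assumes a normal approximation with the standard tie correction, and `midranks` and `rank_sum_z` are hypothetical helper names, not functions from any DE package.

```python
import math

def midranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_z(group_a, group_b):
    """Wilcoxon rank-sum z-score and two-sided p-value (normal approximation)."""
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    w = sum(ranks[:n1])                 # rank sum of the first group
    mean_w = n1 * (n + 1) / 2
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    # ties shrink the variance; with all values tied it becomes zero
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return 0.0, 1.0
    z = (w - mean_w) / math.sqrt(var_w)
    return z, math.erfc(abs(z) / math.sqrt(2))

# Made-up expression values with many dropouts (zeros) in the first group
z, p = rank_sum_z([0, 0, 0, 1, 2, 0, 0, 3], [0, 4, 5, 0, 6, 7, 3, 8])
print(f"z = {z:.2f}, p = {p:.3f}")
```

When every value is a zero, all ranks are tied, the variance of the rank sum collapses to zero, and the test carries no information, which is the failure mode mentioned above.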

developed for bulk RNA-seq

e.g. edgeR, DESeq2

compare estimates of mean expression (sample size)

based on the negative binomial distribution

can be assessed by datasets where RNA-seq data has been validated by RT-qPCR

developed for scRNA-seq

e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.

large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups

we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612

Soneson and Robinson 2018

More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
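To make the two-part structure concrete, here is a heavily simplified Python sketch of a hurdle likelihood ratio test for a single gene and a plain two-group design. It is not MAST itself: there is no empirical Bayes shrinkage and no covariates, the closed-form maximum likelihood estimates only work because the design is just a group indicator, and `hurdle_lrt` is a hypothetical name.

```python
import math

def _bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def _rss(values):
    """Residual sum of squares around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Group-effect LRT for one gene; values are log-scale, 0 = not detected.

    Returns (statistic, p_value), with the statistic referred to a
    chi-square distribution with 2 degrees of freedom (one per part).
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part (Z): detection rate per group vs. one shared rate
    ll_full = (_bernoulli_loglik(len(pos_a), len(group_a))
               + _bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = _bernoulli_loglik(len(pos_a) + len(pos_b),
                                len(group_a) + len(group_b))
    lrt_discrete = 2 * (ll_full - ll_null)

    # Continuous part (Y | Z = 1): Gaussian with a common variance;
    # with ML variance estimates the LRT reduces to n * log(RSS0 / RSS1)
    lrt_continuous = 0.0
    if pos_a and pos_b:
        rss_full = _rss(pos_a) + _rss(pos_b)
        rss_null = _rss(pos_a + pos_b)
        if rss_full > 0:
            lrt_continuous = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_full)

    stat = max(0.0, lrt_discrete + lrt_continuous)
    return stat, math.exp(-stat / 2)  # chi-square(2) survival function

# Made-up values: the gene is rarely detected in group A, highly expressed in B
a = [0, 0, 0, 0, 0, 1.0, 1.2, 0, 0, 0]
b = [3.0, 3.5, 2.9, 3.1, 0, 3.3, 3.2, 3.4, 3.0, 3.6]
print(hurdle_lrt(a, b))
```

The point of the sketch is that a gene can be called DE because its detection rate changes, because its expression level among detected cells changes, or both; the two statistics simply add up.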


Let's stop for a minute

The key

$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data


$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

Figure: histogram of height [cm] (x-axis) against Frequency (y-axis)
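As a worked example of the two-sample t statistic with pooled standard deviation $s_p$, the Python sketch below computes it for two small groups of heights. The numbers are made up for illustration and `pooled_t_statistic` is a hypothetical helper name.

```python
import math

def pooled_t_statistic(x1, x2):
    """Two-sample t statistic using the pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    mean1 = sum(x1) / n1
    mean2 = sum(x2) / n2
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    # pooled standard deviation with n1 + n2 - 2 degrees of freedom
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights in cm
group_1 = [170, 172, 174]
group_2 = [176, 178, 180]
t = pooled_t_statistic(group_1, group_2)
print(round(t, 3))  # -3.674
```

Here the group means differ by 6 cm, $s_p = 2$ and $\sqrt{1/3 + 1/3} \approx 0.816$, so $t = -6 / 1.633 \approx -3.674$.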

Generic recipe

model the data, e.g. gene expression with random error

fit the model to the data and/or the data to the model, estimate model parameters

use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

Figure: three example histograms of negative binomial read counts (x-axis: Read Counts, y-axis: Frequency)

$\mu = mu$

$\sigma^2 = mu + \frac{mu^2}{size}$

mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011)
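The mean and variance relationship can be checked by simulation. The Python sketch below draws negative binomial counts as a gamma-Poisson mixture using only the standard library; the parameter values are arbitrary and the helper names `sample_poisson` and `sample_negative_binomial` are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_negative_binomial(mu, size, rng):
    """NB draw with mean mu and dispersion size, as a gamma-Poisson mixture."""
    rate = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return sample_poisson(rate, rng)

rng = random.Random(1)
mu, size = 5.0, 2.0
counts = [sample_negative_binomial(mu, size, rng) for _ in range(20000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
expected_var = mu + mu ** 2 / size  # 5 + 25 / 2 = 17.5

print(f"mean {sample_mean:.2f} (expected {mu}), "
      f"variance {sample_var:.2f} (expected {expected_var})")
```

A smaller `size` inflates the variance relative to the mean, which is why it is described as inversely related to the variance.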

Zero-inflated NB

Figure: two example histograms of zero-inflated negative binomial read counts (x-axis: Read Counts, y-axis: Frequency)

$\mu = mu \cdot (1 - d)$

$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \frac{mu}{size}\right)$

d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE
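For readers who want to see where the zero-inflated moments come from, the derivation can be sketched as follows. It assumes that a count is set to zero with probability d independently of the underlying negative binomial draw, and it writes mu for the mean of the negative binomial component.

```latex
\begin{aligned}
X &= \begin{cases} 0 & \text{with probability } d \\ N & \text{with probability } 1 - d \end{cases},
  \qquad N \sim \mathrm{NB}(mu, size) \\[4pt]
E[X] &= (1 - d)\, mu \\[4pt]
E[X^2] &= (1 - d)\, E[N^2] = (1 - d) \left( mu + \frac{mu^2}{size} + mu^2 \right) \\[4pt]
\operatorname{Var}(X) &= E[X^2] - E[X]^2
  = (1 - d)\, mu \left( 1 + \frac{mu}{size} + mu - (1 - d)\, mu \right) \\
  &= (1 - d)\, mu \left( 1 + d \cdot mu + \frac{mu}{size} \right)
\end{aligned}
```

Setting d = 0 recovers the plain negative binomial variance, mu + mu^2 / size.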

Poisson-Beta

Figure: two example histograms of Poisson-Beta read counts (x-axis: Read Counts, y-axis: Frequency)

$\mu = g \cdot \frac{a}{a + b}$

$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm
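A small simulation can illustrate how a, b and g shape the counts. The sketch below assumes the usual construction in which a Beta(a, b) draw scales the production rate g and the count is then Poisson distributed; parameter values and helper names are hypothetical. Note that the variance expression on the slide matches the variance of the underlying rate g * Beta(a, b); by the law of total variance, the Poisson sampling step adds the mean on top of that for the observed counts.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_poisson_beta(a, b, g, rng):
    """Count from a Poisson whose rate is g times a Beta(a, b) draw."""
    active_fraction = rng.betavariate(a, b)
    return sample_poisson(g * active_fraction, rng)

rng = random.Random(7)
a, b, g = 2.0, 3.0, 20.0
counts = [sample_poisson_beta(a, b, g, rng) for _ in range(20000)]

rate_mean = g * a / (a + b)                                # 8.0
rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)   # 16.0
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)

print(f"mean {sample_mean:.2f} (expected {rate_mean})")
print(f"variance {sample_var:.2f} (rate variance {rate_var} + mean {rate_mean})")
```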

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test


Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null-data sets by re-sampling, modeling data sets based on various distributions

Comparing between methods and scenarios

Comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression and distributions of detected DEs look like?
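The idea of a null data set obtained by re-sampling can be sketched as follows: cells from one homogeneous population are split at random into two pseudo-groups, so any gene called DE is a false positive by construction. Everything here is an assumption made for illustration: the simulated expression values, the group sizes, the simple z-type test and the name `two_sample_p`.

```python
import math
import random

def two_sample_p(x1, x2):
    """Two-sided p-value from a Welch-type z statistic (normal approximation)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    if se == 0:
        return 1.0
    z = (m1 - m2) / se
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(42)
n_cells, n_genes, alpha = 100, 1000, 0.05

# One homogeneous population: every gene has the same distribution in all
# cells, with roughly 30% dropouts (zeros)
expression = [
    [rng.gauss(5, 1) if rng.random() > 0.3 else 0.0 for _ in range(n_cells)]
    for _ in range(n_genes)
]

# Random split of the cells into two pseudo-groups of 50
cells = list(range(n_cells))
rng.shuffle(cells)
half_a, half_b = cells[:50], cells[50:]

false_positives = sum(
    1 for gene in expression
    if two_sample_p([gene[i] for i in half_a], [gene[i] for i in half_b]) < alpha
)
false_positive_rate = false_positives / n_genes
print(f"{false_positives} of {n_genes} genes called DE "
      f"({false_positive_rate:.1%}) at alpha = {alpha}")
```

A well-calibrated method should call roughly a fraction alpha of the genes on such data; a much larger fraction points to poor type I error control.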


False positives (type I error) vs. false negatives (type II error)

Sensitivity and specificity

Precision and recall

Figure adapted from Wikipedia


Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
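The performance measures can be made concrete with a small Python example. It borrows the simulation layout of 10 000 genes with 2 000 true DEs, but the numbers of true and false positives are hypothetical values chosen for illustration, not results from the cited study, and `classification_metrics` is a made-up helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard measures derived from a 2 x 2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),          # same quantity as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
    }

n_genes, n_true_de = 10_000, 2_000

# Hypothetical output of one DE method
tp, fp = 1_500, 400
fn = n_true_de - tp              # true DEs that were missed (type II errors)
tn = n_genes - n_true_de - fp    # non-DE genes correctly left alone

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these numbers the method finds 75% of the true DEs (sensitivity 0.75), keeps 95% of the non-DE genes out (specificity 0.95), and about 79% of its calls are correct (precision 0.789).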

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11-21 separate instances (sample size effect)

extensive evaluation metrics incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

Figure: histogram of read counts (x-axis: Read Counts, y-axis: Frequency) and histogram of zero counts (x-axis: 0 counts, y-axis: Frequency)
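A first look at the data often starts with counting zeros. The Python sketch below uses a tiny made-up genes x cells count matrix to compute the number of zero counts per gene, the number of genes with no expression at all, and the fraction of detected genes per cell; the matrix and variable names are illustrative only.

```python
# Toy count matrix: rows are genes, columns are cells
counts = [
    [0, 0, 0, 0],   # never detected
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 4, 6, 2],
]
n_genes = len(counts)
n_cells = len(counts[0])

# Number of cells in which each gene has a zero count
zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]

# Genes with no expression at all (zero in every cell)
unexpressed_genes = sum(1 for z in zeros_per_gene if z == n_cells)

# Fraction of genes detected in each cell
detection_rate_per_cell = [
    sum(1 for gene in counts if gene[j] > 0) / n_genes
    for j in range(n_cells)
]

print(zeros_per_gene)            # [4, 2, 3, 0]
print(unexpressed_genes)         # 1
print(detection_rate_per_cell)   # [0.5, 0.25, 0.5, 0.5]
```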


Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows etc.

Staying critical


Summary

What to remember from this hour?

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 3: scRNA-seq - Differential expression analyses

Outline

OutlineIntroduction what is so special about scRNA-seq DE

Common methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out there

Performance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best

Practicalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life

Summary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Introduction

Introduction

Olga (NBIS) scRNA-seq DE February 2019 3 46

Introduction

What does differential expression mean to you

httpswwwmenticom

Olga (NBIS) scRNA-seq DE February 2019 4 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 5 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 6 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1

Mybpc1 Nars Ndufa3 Nono Pgam2

Crispld2 Fbxw13 Hbxip Katna1 Lcorl

1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1

00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100

00

01

02

03

0

1

2

3

00

01

02

03

00

01

02

03

00

01

02

03

000

005

010

015

020

025

00

02

04

06

00

01

02

03

04

05

00

01

02

03

00

01

02

03

00

02

04

00

05

10

15

00

01

02

03

00

05

10

15

20

00

02

04

000

005

010

015

0

1

2

3

4

00

05

10

15

00

01

02

03

04

05

00

01

02

03

04

value

dens

ity

Based on tutorial data

Olga (NBIS) scRNA-seq DE February 2019 9 46

Common methods

Common methods

Olga (NBIS) scRNA-seq DE February 2019 10 46

Common methods

Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio

Olga (NBIS) scRNA-seq DE February 2019 11 46

Common methods

Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test

Olga (NBIS) scRNA-seq DE February 2019 12 46

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Common methods

Miao and Zhang 2016

Olga (NBIS) scRNA-seq DE February 2019 15 46

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612

Common methods More detailed examples


MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z_ig indicating whether gene g is expressed in cell i (i.e. Z_ig = 0 if y_ig = 0 and Z_ig = 1 if y_ig > 0)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
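To make the two-part idea concrete, here is a minimal sketch of a hurdle-style likelihood ratio test for one gene and two groups of cells. It is not MAST's implementation: there is no empirical Bayes regularisation and no extra covariates, and the function names (`bernoulli_loglik`, `hurdle_lrt`) and the toy values are made up for illustration. It assumes log-scale expression values where 0 means "not detected", and that both groups contain at least two detected cells.

```python
import math
import statistics

def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log((n - k) / n)
    return ll

def rss(values, center):
    """Residual sum of squares around a given center."""
    return sum((v - center) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-part LRT: detection rate (discrete) plus mean of detected values (continuous)."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: one detection rate per group vs a single shared rate
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_alt - ll_null)

    # Continuous part: Gaussian with group-specific means vs one common mean
    pooled = pos_a + pos_b
    rss_null = rss(pooled, statistics.fmean(pooled))
    rss_alt = rss(pos_a, statistics.fmean(pos_a)) + rss(pos_b, statistics.fmean(pos_b))
    stat_continuous = len(pooled) * math.log(rss_null / rss_alt)

    # Sum of the two statistics, compared with a chi-square with 2 degrees of freedom
    stat = max(0.0, stat_discrete + stat_continuous)
    p_value = math.exp(-stat / 2)  # chi-square survival function for df = 2
    return stat, p_value

# Hypothetical log-expression values for one gene in two groups of cells
cells_a = [0, 0, 0, 0, 1.1, 1.3, 0.9, 1.2]
cells_b = [0, 3.1, 2.9, 3.3, 3.0, 2.8, 3.2, 3.1]
stat, p_value = hurdle_lrt(cells_a, cells_b)
print(f"LRT statistic = {stat:.1f}, p = {p_value:.2g}")
```

With a single two-level group factor both parts have closed-form maximum likelihood fits, which is why no optimiser is needed here; a real design matrix with several covariates would require fitting the logistic and Gaussian regressions numerically.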

Let's stop for a minute

The key

$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
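As a small worked example of the two-sample t statistic with pooled standard deviation s_p, the sketch below computes it by hand. The function name `pooled_t_statistic` and the height values are hypothetical, and equal variances in the two groups are assumed.

```python
import math
import statistics

def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation (equal variances assumed)."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = statistics.fmean(x1), statistics.fmean(x2)
    var1, var2 = statistics.variance(x1), statistics.variance(x2)
    # Pooled standard deviation s_p
    sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom

# Hypothetical heights [cm] for two groups
group_a = [172.0, 175.5, 169.8, 178.2, 174.1, 171.3]
group_b = [168.4, 170.2, 166.9, 172.5, 169.0, 167.7]
t, df = pooled_t_statistic(group_a, group_b)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The statistic is then compared with a t distribution on n1 + n2 - 2 degrees of freedom to obtain a p-value.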

Generic recipe

model the data, e.g. gene expression with random error

fit the model to the data and/or the data to the model; estimate model parameters

use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of read counts drawn from Negative Binomial distributions (x-axis: read counts; y-axis: frequency)]

$\mu = \mathrm{mu}$

$\sigma^2 = \mathrm{mu} + \dfrac{\mathrm{mu}^2}{\mathrm{size}}$

mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
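The mean and variance relationship of the Negative Binomial can be checked by simulation. The sketch below draws NB counts as a gamma-Poisson mixture using only the standard library; the helper names and the parameter values (mu = 5, size = 2) are illustrative assumptions, and the simple Poisson sampler is only suitable for moderate rates.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """One Poisson(lam) draw via Knuth's multiplication method (moderate lam only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_nb(mu, size, rng):
    """One NB draw with mean mu and dispersion parameter size (gamma-Poisson mixture)."""
    lam = rng.gammavariate(size, mu / size)  # gamma with shape = size, scale = mu / size
    return sample_poisson(lam, rng)

mu, size = 5.0, 2.0
rng = random.Random(1)
nb_counts = [sample_nb(mu, size, rng) for _ in range(20000)]

print("empirical mean:", round(statistics.fmean(nb_counts), 2), "expected:", mu)
print("empirical variance:", round(statistics.variance(nb_counts), 2),
      "expected:", mu + mu ** 2 / size)
```

Lowering size while keeping mu fixed increases the variance, which is what "inversely related" means here.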

Common distributions

Zero-inflated NB

[Figure: two histograms of read counts drawn from zero-inflated Negative Binomial distributions (x-axis: read counts; y-axis: frequency)]

$\mu = \mathrm{mu} \cdot (1 - d)$

$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \dfrac{\mathrm{mu}}{\mathrm{size}}\right)$

d is the dropout rate; mu and size are the parameters of the underlying NB. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
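A quick simulation shows how the dropout rate d changes the moments. This is a sketch under the assumption that a zero-inflated NB count is 0 with probability d and an ordinary NB draw otherwise; the formulas in the code are written in terms of the NB mean mu, and the function name and parameter values are hypothetical.

```python
import math
import random
import statistics

def sample_zinb(mu, size, d, rng):
    """Return 0 with probability d (dropout), otherwise a gamma-Poisson (NB) count."""
    if rng.random() < d:
        return 0
    lam = rng.gammavariate(size, mu / size)
    # Knuth's Poisson sampler, adequate for moderate rates
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

mu, size, d = 10.0, 2.0, 0.3
rng = random.Random(42)
zinb_counts = [sample_zinb(mu, size, d, rng) for _ in range(20000)]

expected_mean = mu * (1 - d)
expected_var = mu * (1 - d) * (1 + d * mu + mu / size)
zero_fraction = sum(1 for c in zinb_counts if c == 0) / len(zinb_counts)

print("mean:", round(statistics.fmean(zinb_counts), 2), "expected:", expected_mean)
print("variance:", round(statistics.variance(zinb_counts), 2), "expected:", expected_var)
print("fraction of zeros:", round(zero_fraction, 3))
```

Note that the observed fraction of zeros is larger than d, because the NB component also produces zeros on its own.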

Common distributions

Poisson-Beta

[Figure: two histograms of read counts drawn from Poisson-Beta distributions (x-axis: read counts; y-axis: frequency)]

$\mu = g \cdot \dfrac{a}{a + b}$

$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
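The Poisson-Beta model can be simulated in two steps: draw the fraction of time the locus is active from a Beta(a, b) distribution, then draw a Poisson count with rate g times that fraction. The sketch below is illustrative only (function name and parameter values are assumptions). One point worth noting: the variance formula quoted for this model describes the variability of the Beta-distributed rate, so the simulated counts are expected to show that variance plus an additional Poisson term equal to the mean.

```python
import math
import random
import statistics

def sample_poisson_beta(a, b, g, rng):
    """One Poisson-Beta count: Beta(a, b) activity fraction, then Poisson(g * fraction)."""
    lam = g * rng.betavariate(a, b)
    # Knuth's Poisson sampler, adequate for moderate rates
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

a, b, g = 2.0, 3.0, 50.0
rng = random.Random(7)
pb_counts = [sample_poisson_beta(a, b, g, rng) for _ in range(20000)]

mean_rate = g * a / (a + b)
rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
count_var = mean_rate + rate_var  # rate variance plus Poisson sampling noise

print("mean:", round(statistics.fmean(pb_counts), 2), "expected:", mean_rate)
print("variance:", round(statistics.variance(pb_counts), 2), "expected:", count_var)
```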

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z_ig indicating whether gene g is expressed in cell i (i.e. Z_ig = 0 if y_ig = 0 and Z_ig = 1 if y_ig > 0)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


Supplementary Table 2: Evaluated differential expression methods together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios

comparing numbers of DE genes, incl. as a function of group size

Investigating results

What do the expression levels and distributions of the detected DE genes look like?


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure adapted from Wikipedia]
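When a truth set is available (e.g. from simulated data), these quantities follow directly from the confusion counts. The sketch below uses made-up gene names and a hypothetical helper `confusion_metrics`; it assumes that at least one gene is called, at least one is truly DE, and at least one is truly non-DE.

```python
def confusion_metrics(called, truth, all_genes):
    """Compare genes called DE by a method with the genes that are truly DE."""
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)               # true positives
    fp = len(called - truth)               # false positives (type I errors)
    fn = len(truth - called)               # false negatives (type II errors)
    tn = len(all_genes - called - truth)   # true negatives
    return {
        "sensitivity": tp / (tp + fn),     # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fdr": fp / (tp + fp),             # false discovery rate = 1 - precision
    }

# Hypothetical example: 10 genes, 4 truly DE, 4 called DE by some method
all_genes = [f"g{i}" for i in range(10)]
truly_de = ["g0", "g1", "g2", "g3"]
called_de = ["g0", "g1", "g2", "g7"]
metrics = confusion_metrics(called_de, truly_de, all_genes)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```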

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions)

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells

based on 9 data sets with 11–21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities


Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: histogram of read counts (x-axis: read counts; y-axis: frequency) and histogram of the number of zero counts (x-axis: 0 counts; y-axis: frequency)]
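A first look at a count matrix usually includes counting genes that are never detected and summarising zeros per gene and per cell. The sketch below works on a tiny, made-up genes-by-cells matrix (the gene names and values are hypothetical, not the example data above) and also computes the detection rate, i.e. the fraction of detected genes per cell.

```python
# Hypothetical toy count matrix: rows are genes, columns are cells
counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [5, 0, 2, 0],
    "geneC": [0, 0, 0, 0],
    "geneD": [1, 3, 0, 7],
}
n_genes = len(counts)
n_cells = len(next(iter(counts.values())))

# Genes with no expression at all, across all cells
silent_genes = [gene for gene, row in counts.items() if sum(row) == 0]

# Number of zero counts per gene (what a "0 counts" histogram would show)
zeros_per_gene = {gene: sum(1 for c in row if c == 0) for gene, row in counts.items()}

# Detection rate: fraction of genes with a non-zero count in each cell
detection_rate = [
    sum(1 for row in counts.values() if row[j] > 0) / n_genes
    for j in range(n_cells)
]

print(f"{n_genes} genes x {n_cells} cells")
print("genes with no expression at all:", silent_genes)
print("zeros per gene:", zeros_per_gene)
print("detection rate per cell:", detection_rate)
```

Genes that are silent in every cell carry no information for DE testing and are typically filtered out before any method is run.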

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical


Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq". Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data". Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis". Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods". Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data", no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics". Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics". Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database". bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

Page 4: scRNA-seq - Differential expression analyses

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out there

Performance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best

Practicalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life

Summary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Outline

OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour

Olga (NBIS) scRNA-seq DE February 2019 2 46

Introduction

Introduction

Olga (NBIS) scRNA-seq DE February 2019 3 46

Introduction

What does differential expression mean to you

httpswwwmenticom

Olga (NBIS) scRNA-seq DE February 2019 4 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 5 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 6 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1

Mybpc1 Nars Ndufa3 Nono Pgam2

Crispld2 Fbxw13 Hbxip Katna1 Lcorl

1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1

00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100

00

01

02

03

0

1

2

3

00

01

02

03

00

01

02

03

00

01

02

03

000

005

010

015

020

025

00

02

04

06

00

01

02

03

04

05

00

01

02

03

00

01

02

03

00

02

04

00

05

10

15

00

01

02

03

00

05

10

15

20

00

02

04

000

005

010

015

0

1

2

3

4

00

05

10

15

00

01

02

03

04

05

00

01

02

03

04

value

dens

ity

Based on tutorial data

Olga (NBIS) scRNA-seq DE February 2019 9 46

Common methods

Common methods

Olga (NBIS) scRNA-seq DE February 2019 10 46

Common methods

Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio

Olga (NBIS) scRNA-seq DE February 2019 11 46

Common methods

Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test

Olga (NBIS) scRNA-seq DE February 2019 12 46

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Common methods

Miao and Zhang 2016

Olga (NBIS) scRNA-seq DE February 2019 15 46

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 16 46

Common methods More detailed examples

More detailed examples

Olga (NBIS) scRNA-seq DE February 2019 17 46

Common methods More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Common methods More detailed examples

Letrsquos stop for a minute

Olga (NBIS) scRNA-seq DE February 2019 19 46

Common methods More detailed examples

The key

Outcomei = (Modeli) + errori

we collect data on a sample from a much larger population

statistics lets us to make inferences about the population from which sample wasderived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common methods More detailed examples

Common distributions

Negative Binomial

Read Counts

Fre

quen

cy

0 5 10 15 20

010

020

030

040

050

0

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10

050

100

150

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10 12

050

100

150

200

micro = mu

δ2 = mu + mu2

size

mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)

Olga (NBIS) scRNA-seq DE February 2019 24 46

Common methods More detailed examples

Common distributions

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 5 10 15 20

020

040

0

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 10 20 30 40

010

020

0micro = mu lowast (1 minus d)

δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)

d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE

Olga (NBIS) scRNA-seq DE February 2019 25 46

Common methods More detailed examples

Common distributions

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100

020

040

0

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100 140

050

150

micro = g lowast a(a + b)

δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)

a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm

Olga (NBIS) scRNA-seq DE February 2019 26 46

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
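The two-part structure of a hurdle model can be illustrated for a single gene and two groups of cells. This is a minimal sketch, not the MAST implementation: there is no empirical Bayes shrinkage and no covariates, the design matrix is reduced to a group indicator, and the two likelihood ratio statistics are simply added and compared with a chi-square distribution with 2 degrees of freedom. All function names and the toy values are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_loglik(n_detected, n_cells):
    """Bernoulli log-likelihood evaluated at the observed detection rate."""
    p = n_detected / n_cells
    return xlogy(n_detected, p) + xlogy(n_cells - n_detected, 1.0 - p)

def rss(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-group likelihood ratio test for one gene (log-scale expression).

    Needs at least one expressing cell per group and some spread in the
    positive values; a real implementation would guard these cases.
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part (Z): group-specific versus shared detection rate.
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    lr_discrete = 2.0 * (ll_alt - ll_null)

    # Continuous part (Y | Z = 1): group-specific versus shared mean,
    # Gaussian errors with a common variance.
    n_pos = len(pos_a) + len(pos_b)
    lr_continuous = n_pos * math.log(rss(pos_a + pos_b) / (rss(pos_a) + rss(pos_b)))

    stat = max(lr_discrete + lr_continuous, 0.0)
    p_value = math.exp(-stat / 2.0)  # chi-square survival function for 2 df
    return lr_discrete, lr_continuous, p_value

group_a = [0, 0, 0, 0, 0, 0, 1.1, 0.9]          # mostly dropouts, low level
group_b = [2.0, 2.4, 1.8, 2.2, 2.1, 0, 1.9, 2.3]  # mostly detected, higher level
lr_d, lr_c, p = hurdle_lrt(group_a, group_b)
print(lr_d, lr_c, p)
```

Returning the two components separately shows whether a difference is driven by the fraction of expressing cells, by the expression level among expressing cells, or by both.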

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
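The role of the two mixture components can be shown with Bayes' rule for a single observed count. The sketch below is not the SCDE code: it assumes the mixture parameters (dropout weight, background Poisson rate, NB mean and dispersion) are already known, whereas SCDE estimates them from the data, and all names and values are hypothetical.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mean mu, dispersion size)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def dropout_posterior(k, drop_weight, background_rate, mu, size):
    """Posterior probability that count k came from the dropout component."""
    dropped = drop_weight * poisson_pmf(k, background_rate)
    amplified = (1.0 - drop_weight) * nb_pmf(k, mu, size)
    return dropped / (dropped + amplified)

# Illustrative parameters: 30% dropouts, background rate 0.1, NB mean 10, size 2.
p_zero = dropout_posterior(0, 0.3, 0.1, 10.0, 2.0)
p_ten = dropout_posterior(10, 0.3, 0.1, 10.0, 2.0)
print(p_zero, p_ten)
```

A zero count is attributed mostly to the dropout component, while a count of ten is attributed almost entirely to the amplified (NB) component.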

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using the GAM

$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test
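The likelihood ratio test compares a full model, in which expression depends on pseudo-time, with a reduced model, in which it does not. The sketch below is a deliberate simplification and not Monocle's procedure: a straight line stands in for the cubic smoothing function, so the test has 1 degree of freedom instead of 3, and errors are assumed Gaussian. Function names and the toy data are hypothetical.

```python
import math

def rss_constant(y):
    """Residual sum of squares of the intercept-only (reduced) model."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)

def rss_line(t, y):
    """Residual sum of squares of a least-squares line in pseudo-time t."""
    n = len(y)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxx = sum((a - t_bar) ** 2 for a in t)
    sxy = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(t, y))

def lrt_pseudotime(t, y):
    """Approximate chi-square likelihood ratio test, full vs reduced model."""
    stat = max(len(y) * math.log(rss_constant(y) / rss_line(t, y)), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival, 1 df
    return stat, p_value

pseudotime = list(range(10))
trending = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0, 5.9, 7.2, 7.9, 9.1]
flat = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.2, 0.8, 1.1, 0.9]

stat_trend, p_trend = lrt_pseudotime(pseudotime, trending)
stat_flat, p_flat = lrt_pseudotime(pseudotime, flat)
print(p_trend, p_flat)
```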

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data
using data we know something about to get positive controls

Simulated data
null data sets by re-sampling
modeling data sets based on various distributions

Comparing between methods and scenarios
comparing numbers of DEs, incl. as a function of group size

Investigating results
What do the expression levels and distributions of detected DEs look like?


False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
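When the truly DE genes are known, as in a simulation, these quantities follow directly from the confusion matrix. The sketch below assumes gene identifiers are plain strings and uses a made-up example of ten genes; the function name is hypothetical.

```python
def classification_metrics(called, truth, all_genes):
    """Compare genes called DE by a method with the genes truly DE."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)                   # true positives
    fp = len(called - truth)                   # false positives (type I errors)
    fn = len(truth - called)                   # false negatives (type II errors)
    tn = len(set(all_genes)) - tp - fp - fn    # true negatives
    # Note: the ratios below are undefined if a denominator is zero,
    # e.g. precision when no gene is called DE.
    return {
        "sensitivity": tp / (tp + fn),         # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }

all_genes = ["g%d" % i for i in range(10)]
truth = ["g0", "g1", "g2", "g3"]    # simulated as DE
called = ["g0", "g1", "g2", "g7"]   # reported as DE by some method
metrics = classification_metrics(called, truth, all_genes)
print(metrics)
```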

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells

based on 9 data sets with 11-21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all

[Figure: two histograms; left: read counts (x-axis) vs frequency; right: "0 counts" (x-axis) vs frequency]
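Summaries of this kind are simple to compute from a genes x cells count matrix. The sketch below assumes the matrix is held as a dictionary mapping each gene to its list of per-cell counts; the function name and the toy values are hypothetical and unrelated to the example data on the slide.

```python
def summarise_zeros(counts):
    """counts: dict mapping gene -> list of read counts, one entry per cell."""
    n_cells = len(next(iter(counts.values())))
    # Genes with no expression in any cell.
    n_silent = sum(1 for row in counts.values() if all(c == 0 for c in row))
    # Number of zero counts per gene (the quantity behind a "0 counts" histogram).
    zeros_per_gene = {gene: sum(1 for c in row if c == 0)
                      for gene, row in counts.items()}
    # Number of detected genes per cell; divide by the number of genes
    # to obtain the detection rate.
    detected_per_cell = [sum(1 for row in counts.values() if row[j] > 0)
                         for j in range(n_cells)]
    return n_silent, zeros_per_gene, detected_per_cell

toy_counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [5, 0, 2, 0],
    "geneC": [1, 3, 0, 7],
}
n_silent, zeros_per_gene, detected_per_cell = summarise_zeros(toy_counts)
print(n_silent, zeros_per_gene, detected_per_cell)
```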

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573


Introduction

What does differential expression mean to you?

https://www.menti.com

Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]

adapted from Wu et al. 2017

Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups,

i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)

Differential expression is an old problem, known from bulk RNA-seq and microarray studies,

in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences

Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics
high noise levels (technical and biological factors)
low library sizes
low amount of available mRNAs results in amplification biases and dropout events
3' bias, partial coverage and uneven depth (technical)
stochastic nature of transcription (biological)
multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)

[Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1. Based on tutorial data]

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods
e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
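The conversion to ranks, and the way ties from dropouts enter the calculation, can be sketched for the Wilcoxon rank-sum test. The code below is an illustration under stated choices: tied values receive mid-ranks, the p-value comes from the normal approximation with the standard tie correction and without a continuity correction, and the function name and toy data are hypothetical.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                              # pooled[i:j] share one value
        t = j - i
        midrank = (i + 1 + j) / 2.0             # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                              # every value tied: no information
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))

group_x = [0, 0, 0, 0, 1, 2]   # many dropouts
group_y = [0, 3, 4, 5, 6, 7]
u_stat, p_value = rank_sum_test(group_x, group_y)
print(u_stat, p_value)
```

The more values are tied at zero, the smaller the variance of the statistic becomes; in the extreme case where all values are tied the test carries no information.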

developed for bulk RNA-seq
e.g. edgeR, DESeq2
compare estimates of mean expression (sample size)
based on the negative binomial distribution
can be assessed on datasets where RNA-seq data have been validated by RT-qPCR

developed for scRNA-seq
e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
large number of samples (i.e. cells) for each group we are comparing in single-cell experiments; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Evaluated differential expression methods: Supplementary Table 2 of Soneson and Robinson 2018 (Nature Methods, doi:10.1038/nmeth.4612)

Common methods More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Let's stop for a minute

The key

$$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram of height [cm]; y-axis: frequency]
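The statistic above is the two-sample t statistic with a pooled standard deviation. A minimal sketch, assuming equal variances in the two groups and using made-up height values, is:

```python
import math

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    s_p = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))   # pooled standard deviation
    return (m1 - m2) / (s_p * math.sqrt(1.0 / n1 + 1.0 / n2))

heights_1 = [170, 172, 174]   # illustrative heights in cm
heights_2 = [176, 178, 180]
t_stat = pooled_t(heights_1, heights_2)
print(t_stat)
```

Here the model is "each group has its own mean" and the error is the spread around those means, which is the Outcome = Model + error idea in its simplest form.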

Generic recipe
model data, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference

Important implication
the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three example histograms of negative binomial read counts; x-axis: read counts, y-axis: frequency]

$$\mu = \mathrm{mu}$$

$$\delta^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$

mu: mean expression; size: the dispersion, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
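The inverse relation between size and the variance can be seen in a short simulation. The sketch below relies on a standard fact that is not stated on the slide: an NB count with mean mu and dispersion size can be generated as a Poisson count whose rate is Gamma-distributed with shape size and scale mu / size. The sampler, seed and parameter values are illustrative choices.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def nb_draw(rng, mu, size):
    """NB count as a Gamma-Poisson mixture (mean mu, dispersion size)."""
    rate = rng.gammavariate(size, mu / size)   # shape = size, scale = mu / size
    return poisson_draw(rng, rate)

rng = random.Random(7)
mu, size = 10.0, 2.0   # illustrative values only
counts = [nb_draw(rng, mu, size) for _ in range(20000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
expected_var = mu + mu ** 2 / size

print(sample_mean, mu)
print(sample_var, expected_var)
```

Increasing size shrinks the mu^2 / size term, so the counts approach a Poisson distribution with variance equal to the mean.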

Common methods More detailed examples

Common distributions

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 5 10 15 20

020

040

0

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 10 20 30 40

010

020

0micro = mu lowast (1 minus d)

δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)

d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE

Olga (NBIS) scRNA-seq DE February 2019 25 46

Common methods More detailed examples

Common distributions

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100

020

040

0

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100 140

050

150

micro = g lowast a(a + b)

δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)

a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm

Olga (NBIS) scRNA-seq DE February 2019 26 46

Common methods More detailed examples

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 27 46

Common methods More detailed examples

SCDE

models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)

subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels

For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach

Olga (NBIS) scRNA-seq DE February 2019 28 46

Common methods More detailed examples

Monocole

Originally designed for ordering cells by progress through differentiation stages(pseudo-time)

The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as

g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions

(eg cubic splines)

The observable expression level Y is then modeled using GAM

E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero

The DE test is performed using an approx χ2 likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 29 46

Performance

Performance

Olga (NBIS) scRNA-seq DE February 2019 30 46

Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612



Outline

  • Introduction: what is so special about scRNA-seq DE
  • Common methods: what is out there
  • Performance: how do we know what is best
  • Practicalities: what to do in real life
  • Summary: what to remember from this hour

Introduction

What does differential expression mean to you?

https://www.menti.com

Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]

adapted from Wu et al. 2017

Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups, i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).

Differential expression is an old problem, known from bulk RNA-seq and microarray studies. In fact it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences.
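As a rough illustration of deciding whether an observed difference in read counts exceeds what random variation alone would produce, the sketch below runs a permutation test on a single gene. The counts, the function name `permutation_p_value` and the number of permutations are made up for illustration; this is not one of the DE methods reviewed in this deck.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean read counts."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled counts breaks any link between count and group.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical read counts for one gene in two groups of cells.
control = [0, 2, 1, 0, 3, 1, 0, 2, 1, 0]
treated = [5, 7, 0, 6, 9, 4, 0, 8, 6, 5]

p_shift = permutation_p_value(control, treated)       # groups differ
p_null = permutation_p_value(control, control[::-1])  # same values in both groups
print(p_shift, p_null)
```

When both groups hold the same values the observed difference is zero, so every permutation is at least as extreme and the p-value is 1.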

Differential expression is an old problem. So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics
  • high noise levels (technical and biological factors)
  • low library sizes
  • low amount of available mRNAs results in amplification biases and dropout events
  • 3' bias, partial coverage and uneven depth (technical)
  • stochastic nature of transcription (biological)
  • multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)

Figure: density plots (x-axis: value, y-axis: density), one panel per gene, for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.

Common methods

Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods
  • e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
  • non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
  • some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
  • if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
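The effect of tied values on a rank-based test can be sketched with a small Wilcoxon rank-sum implementation that uses midranks and the usual tie-corrected normal approximation. The helper names `midranks` and `rank_sum_test` and the example counts are hypothetical; in practice one would rely on an established statistics library rather than this sketch.

```python
import math

def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (U statistic, two-sided p-value) using a normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over each group of t tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in tie_counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # every value is tied, so the ranks carry no information
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: a gene with only dropouts, and a clearly shifted gene.
print(rank_sum_test([0] * 5, [0] * 5))
print(rank_sum_test([0, 0, 0, 0, 1], [4, 5, 6, 7, 8]))
```

With only zeros in both groups the tie correction drives the variance of U to zero, which is the degenerate case the slide warns about.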

Methods developed for bulk RNA-seq
  • e.g. edgeR, DESeq2
  • compare estimates of mean expression (sample size)
  • based on the negative binomial distribution
  • can be assessed by datasets where RNA-seq data has been validated by RT-qPCR

Methods developed for scRNA-seq
  • e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
  • there is a large number of samples (i.e. cells) for each group we are comparing in single-cell experiments, so we can take advantage of the whole distribution of expression values in each group to identify differences between groups
  • we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612

More detailed examples

MAST

  • uses a generalized linear hurdle model
  • designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
  • the rate of expression Z and the level of expression Y are modeled for each gene g, with Z_ig indicating whether gene g is expressed in cell i (i.e. z_ig = 0 if y_ig = 0 and z_ig = 1 if y_ig > 0)
  • a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

\[ \operatorname{logit}\big(\Pr(Z_{ig} = 1)\big) = X_i \beta_g^D \]

\[ \Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2) \]

    where X_i is a design matrix
  • model parameters are fitted using an empirical Bayesian framework
  • allows for a joint estimate of nuisance and treatment effects
  • DE is determined using the likelihood ratio test
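The two-part idea behind a hurdle model can be sketched for a single gene and two groups of cells. This is a deliberately stripped-down illustration: it has no covariates and no empirical Bayes shrinkage, it is not the API of the MAST package, and the function names and example values are hypothetical. The discrete part compares detection rates, the continuous part compares Gaussian means of the expressed cells, and the two likelihood ratio statistics are added.

```python
import math

def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Return (statistic, degrees of freedom, p-value) for one gene."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    # Discrete part (Z): one detection rate per group vs one shared rate.
    ll_alt = binom_loglik(len(pos_a), len(group_a)) + binom_loglik(len(pos_b), len(group_b))
    ll_null = binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat, df = max(0.0, 2 * (ll_alt - ll_null)), 1
    # Continuous part (Y | Z = 1): one Gaussian mean per group vs one shared mean.
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_alt = rss(pos_a) + rss(pos_b)
        rss_null = rss(pos_a + pos_b)
        if rss_alt > 0:
            stat += max(0.0, len(pos_a + pos_b) * math.log(rss_null / rss_alt))
            df = 2
    # Chi-square upper tail: closed forms exist for 1 and 2 degrees of freedom.
    p = math.erfc(math.sqrt(stat / 2)) if df == 1 else math.exp(-stat / 2)
    return stat, df, p

# Hypothetical log-expression values (0 means not detected).
group_a = [0, 0, 0, 0, 0, 0, 1.0, 1.2]
group_b = [2.0, 2.5, 3.0, 2.2, 2.8, 0, 2.6, 2.4]
print(hurdle_lrt(group_a, group_b))
```

Summing the two statistics lets a gene be called DE because its detection rate changes, because its expression level among expressing cells changes, or both.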

Let's stop for a minute

The key

\[ \text{Outcome}_i = (\text{Model}_i) + \text{error}_i \]

  • we collect data on a sample from a much larger population
  • statistics lets us make inferences about the population from which the sample was derived
  • we try to predict the outcome given a model fitted to the data

The key

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

Figure: histogram of height [cm] (x-axis: height [cm], y-axis: Frequency)
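The t statistic above can be computed in a few lines, taking s_p to be the pooled standard deviation of the two samples. The function name `pooled_t` and the height values are made up for illustration.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical heights in cm for two small groups.
t, df = pooled_t([170, 172, 174], [176, 178, 180])
print(round(t, 3), df)
```

For these values both sample variances are 4, so s_p = 2 and t = -6 / (2 * sqrt(2/3)), roughly -3.674 on 4 degrees of freedom.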

Generic recipe
  • model the data, e.g. gene expression, with random error
  • fit the model to the data and/or the data to the model, and estimate the model parameters
  • use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics.

Common distributions

Negative Binomial

Figure: three example histograms of Negative Binomial read counts (x-axis: Read Counts, y-axis: Frequency)

\[ \mu = mu \]

\[ \delta^2 = mu + \frac{mu^2}{size} \]

mu: mean expression; size: the dispersion, which is inversely related to the variance \delta^2. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011).
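The mean and variance relations for the Negative Binomial can be checked by simulation. The sketch below assumes the common gamma-Poisson construction (a Poisson count whose rate is drawn from a gamma distribution with shape `size` and mean `mu`); the parameter values, seed and helper names are arbitrary choices for illustration.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_nb(mu, size, rng):
    """One NB count: Poisson with a Gamma(shape=size, scale=mu/size) rate."""
    return sample_poisson(rng.gammavariate(size, mu / size), rng)

rng = random.Random(1)
mu, size = 5.0, 2.0
counts = [sample_nb(mu, size, rng) for _ in range(20000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
expected_var = mu + mu**2 / size  # 17.5 for these parameters

print(round(sample_mean, 2), round(sample_var, 2), expected_var)
```

A smaller `size` inflates the variance relative to the mean, while a very large `size` makes the counts behave almost like a Poisson.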

Common distributions

Zero-inflated NB

Figure: two example histograms of zero-inflated NB read counts (x-axis: Read Counts, y-axis: Frequency)

\[ \mu = mu \cdot (1 - d) \]

\[ \delta^2 = \mu \cdot (1 - d) \cdot \left(1 + d \cdot \mu + \frac{\mu}{size}\right) \]

d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.

Common distributions

Poisson-Beta

Figure: two example histograms of Poisson-Beta read counts (x-axis: Read Counts, y-axis: Frequency)

\[ \mu = g \cdot \frac{a}{a + b} \]

\[ \delta^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2} \]

a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
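A small simulation can make the Poisson-Beta parameters concrete. In this sketch the transcription rate of each cell is g times a Beta(a, b) draw and the observed count is Poisson around that rate; the parameter values, seed and helper names are assumptions chosen for illustration. Under this construction the formula for \delta^2 above matches the variance of the rate, and the observed counts carry additional Poisson noise on top of it.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
a, b, g = 2.0, 3.0, 50.0
rates = [g * rng.betavariate(a, b) for _ in range(20000)]
counts = [sample_poisson(r, rng) for r in rates]

rate_mean, rate_var = mean_var(rates)
count_mean, count_var = mean_var(counts)
expected_mean = g * a / (a + b)                                   # 20.0
expected_rate_var = g**2 * a * b / ((a + b + 1) * (a + b) ** 2)   # 100.0

print(round(rate_mean, 2), round(rate_var, 2))
print(round(count_mean, 2), round(count_var, 2))
```

By the law of total variance the counts should have variance close to `expected_rate_var + expected_mean`, which is why the second printed variance is larger than the first.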

MAST (revisited)

  • uses a generalized linear hurdle model
  • designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
  • the rate of expression Z and the level of expression Y are modeled for each gene g, with Z_ig indicating whether gene g is expressed in cell i (i.e. z_ig = 0 if y_ig = 0 and z_ig = 1 if y_ig > 0)
  • a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

\[ \operatorname{logit}\big(\Pr(Z_{ig} = 1)\big) = X_i \beta_g^D \]

\[ \Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2) \]

    where X_i is a design matrix
  • model parameters are fitted using an empirical Bayesian framework
  • allows for a joint estimate of nuisance and treatment effects
  • DE is determined using the likelihood ratio test

SCDE

  • models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
  • the NB distribution models the transcripts that are amplified and detected
  • the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
  • a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
  • for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach

Monocle

  • originally designed for ordering cells by progress through differentiation stages (pseudo-time)
  • the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

\[ g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m) \]

    where Y is a specific gene expression level, x_i are predictor variables, g is a link function (typically the log function) and f_i are non-parametric functions (e.g. cubic splines)
  • the observable expression level Y is then modeled using a GAM:

\[ E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon \]

    where \varphi_t(b_x, s_i) is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom; the error term \varepsilon is normally distributed with a mean of zero
  • the DE test is performed using an approximate \chi^2 likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data
  • using data we know something about to get positive controls

Simulated data
  • null data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios
  • comparing numbers of DEs, incl. as a function of group size

Investigating results
  • what do the expression and distributions of detected DEs look like?

Performance metrics
  • false positives (type I error) vs false negatives (type II error)
  • sensitivity and specificity
  • precision and recall

Figure adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
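When the truth is known, as in a simulation, these metrics follow directly from the overlap between the genes a method calls and the genes that were set to be DE. The sketch below reuses the 10 000 genes and 2 000 true DEs from the simulation design above, but the set of called genes (1 800 calls, 1 500 of them correct) is invented for illustration and is not a result from the cited paper.

```python
def de_metrics(called, truth, all_genes):
    """Confusion-matrix metrics for a set of called DE genes against known truth."""
    called, truth, universe = set(called), set(truth), set(all_genes)
    tp = len(called & truth)   # true positives
    fp = len(called - truth)   # false positives (type I errors)
    fn = len(truth - called)   # false negatives (type II errors)
    tn = len(universe) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),  # same as recall
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }

all_genes = range(10000)
truth = range(2000)          # genes 0..1999 simulated as DE
called = range(500, 2300)    # hypothetical method output: 1500 correct, 300 wrong

metrics = de_metrics(called, truth, all_genes)
print(metrics)
```

For this made-up call set the sensitivity is 0.75, the specificity 0.9625 and the precision about 0.83.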

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
  • 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
  • based on 9 data sets with 11 - 21 separate instances (sample size effect)
  • extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
  • conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells; 22229 genes with no expression at all

Figure: two histograms of the example data, one with x-axis "Read Counts" and one with x-axis "0 counts" (y-axis: Frequency in both)

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

  • QC filtering
  • cell-cycle phase
  • normalization of cell-specific biases
  • confounding factors, incl. batch effects
  • detection rate, i.e. the fraction of detected genes per cell
  • imputation strategies for dropout values
  • what is pragmatic: programming language, platform, speed, collaborative workflows etc.
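Two of the quantities mentioned in this section, the number of genes with no expression at all and the detection rate per cell, are simple to compute from a count matrix. The sketch below assumes a genes x cells matrix stored as a list of rows; the toy matrix and the function names are made up for illustration.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed separately for each cell."""
    n_genes, n_cells = len(counts), len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] > 0) / n_genes
        for c in range(n_cells)
    ]

def silent_genes(counts):
    """Row indices of genes that have zero counts in every cell."""
    return [g for g, row in enumerate(counts) if all(v == 0 for v in row)]

# Toy matrix: 4 genes (rows) x 3 cells (columns).
counts = [
    [0, 0, 0],
    [5, 0, 2],
    [1, 3, 0],
    [0, 0, 7],
]
print(detection_rate(counts))  # [0.5, 0.25, 0.5]
print(silent_genes(counts))    # [0]
```

The per-cell detection rate is the kind of value that the "DetRate" method variants in the table above include as a covariate.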

Staying critical

Summary

What to remember from this hour?

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary
  • scRNA-seq is a rapidly growing field
  • DE is a common task, so many newer and better methods will be developed
  • understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
  • staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.


Outline

  • Introduction: what is so special about scRNA-seq DE
  • Common methods: what is out there
  • Performance: how do we know what is best
  • Practicalities: what to do in real life
  • Summary: what to remember from this hour

Introduction

What does differential expression mean to you?

https://www.menti.com

Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]

adapted from Wu et al. 2017

Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups,

i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).

Differential expression is an old problem, known from bulk RNA-seq and microarray studies,

in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences.
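To make "greater than what would be expected just due to natural random variation" concrete, here is a minimal sketch of a permutation test for one gene. It is an illustration only and not one of the methods discussed in this talk; the read counts and the function name `permutation_p_value` are made up.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=1):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values breaks any link between value and group.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up read counts for one gene in two groups of cells.
control = [0, 2, 1, 0, 3, 1, 0, 2]
treated = [8, 5, 9, 0, 7, 6, 10, 4]
p = permutation_p_value(control, treated)
print(p)
```

A small p-value says that random relabelling of cells rarely produces a difference as large as the observed one.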

Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics

  • high noise levels (technical and biological factors)
  • low library sizes
  • low amount of available mRNAs results in amplification biases and dropout events
  • 3' bias, partial coverage and uneven depth (technical)
  • stochastic nature of transcription (biological)
  • multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)

Figure: Density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1. Based on tutorial data.

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io

Generic non-parametric methods

  • e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
  • non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
  • some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
  • if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
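The rank conversion and the role of ties can be sketched as follows. This is a plain-Python illustration of the Wilcoxon rank-sum test using the normal approximation with a tie correction and no continuity correction; the function names and the example values are assumptions made for this sketch.

```python
import math

def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (U statistic of x, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over groups of tied values.
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Made-up expression values with several dropouts (zeros).
group_1 = [0, 0, 0, 1, 2]
group_2 = [0, 3, 4, 5, 6]
u, p = rank_sum_test(group_1, group_2)
print(u, p)
```

The four zeros all receive the same midrank; the more values are tied, the smaller the variance term becomes and the less information the ranks carry.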

developed for bulk RNA-seq

  • e.g. edgeR, DESeq2
  • compare estimates of mean expression (sample size)
  • based on negative binomial distribution
  • can be assessed by datasets where RNA-seq data has been validated by RT-qPCR

developed for scRNA-seq

  • e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
  • large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
  • we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods, doi:10.1038/nmeth.4612. Soneson and Robinson 2018

More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:

$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
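The two-part idea can be illustrated with a heavily simplified sketch for one gene and two groups of cells. This is not the MAST implementation: it assumes no covariates other than the group, uses plain maximum likelihood instead of the empirical Bayesian fit, and the function names and expression values are made up. Values are taken to be log-scale expression, with 0 meaning not detected.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Likelihood ratio statistic and p-value of a two-part (hurdle) test."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    # Discrete part: does the fraction of expressing cells differ?
    ll_alt = bernoulli_loglik(len(pos_a), len(group_a)) + bernoulli_loglik(len(pos_b), len(group_b))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat = max(0.0, 2 * (ll_alt - ll_null))
    df = 1
    # Continuous part: Gaussian means of the positive values, common variance.
    if len(pos_a) > 1 and len(pos_b) > 1:
        rss_alt = rss(pos_a) + rss(pos_b)
        if rss_alt > 0:
            n_pos = len(pos_a) + len(pos_b)
            stat += max(0.0, n_pos * math.log(rss(pos_a + pos_b) / rss_alt))
            df = 2
    # Chi-square upper tail: exp(-x/2) for 2 df, erfc(sqrt(x/2)) for 1 df.
    p = math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))
    return stat, p

group_a = [0, 0, 0, 0, 1.2, 0, 0.8, 0]
group_b = [3.1, 2.8, 0, 3.5, 2.9, 3.3, 0, 3.0]
stat, p = hurdle_lrt(group_a, group_b)
print(stat, p)
```

The statistic is the sum of a discrete and a continuous contribution, which is why a gene can be called DE because of a change in the fraction of expressing cells, a change in expression level among expressing cells, or both.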

Let's stop for a minute

The key

$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

Figure: Histogram of height [cm] (x-axis) against frequency (y-axis).
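As a worked instance of the two-sample t statistic with pooled standard deviation, the following sketch uses made-up height measurements; `pooled_t_statistic` is an illustrative name.

```python
import math
import statistics

def pooled_t_statistic(x1, x2):
    """Two-sample t statistic using the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled_sd = math.sqrt(
        ((n1 - 1) * statistics.variance(x1) + (n2 - 1) * statistics.variance(x2))
        / (n1 + n2 - 2)
    )
    return (statistics.mean(x1) - statistics.mean(x2)) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))

# Made-up heights in cm for two groups of five people.
heights_a = [170, 172, 168, 175, 171]
heights_b = [178, 180, 176, 182, 179]
t = pooled_t_statistic(heights_a, heights_b)
print(round(t, 3))
```

Here the group means are 171.2 and 179.0 and the pooled variance is 5.85, which gives a t statistic of about -5.1.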

Generic recipe

  • model data, e.g. gene expression with random error
  • fit model to the data and/or data to the model, estimate model parameters
  • use model for prediction and/or inference

Important implication

  • the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

Figure: Three histograms of negative binomial read counts (x-axis: Read Counts, y-axis: Frequency).

$\mu = \mathit{mu}$

$\sigma^2 = \mathit{mu} + \dfrac{\mathit{mu}^2}{\mathit{size}}$

mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).
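The mean and variance relationship of the negative binomial can be checked by simulation. The sketch below draws NB counts as a Poisson whose rate follows a Gamma distribution (shape `size`, scale `mu / size`); the parameter values, the seed and the helper names are assumptions made for illustration.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def negative_binomial_draw(rng, mu, size):
    """NB count drawn as a Poisson whose rate is Gamma distributed."""
    rate = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return poisson_draw(rng, rate)

rng = random.Random(42)
mu, size = 5.0, 2.0
draws = [negative_binomial_draw(rng, mu, size) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
expected_var = mu + mu ** 2 / size
print(sample_mean, sample_var, expected_var)
```

With mu = 5 and size = 2 the expected variance is 17.5, well above the Poisson value of 5; a smaller size gives stronger overdispersion.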

Common distributions

Zero-inflated NB

Figure: Two histograms of zero-inflated negative binomial read counts (x-axis: Read Counts, y-axis: Frequency).

$\mu = \mathit{mu} \cdot (1 - d)$

$\sigma^2 = \mathit{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathit{mu} + \dfrac{\mathit{mu}}{\mathit{size}}\right)$

d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
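The zero-inflated variance expression follows from the law of total variance, reading mu as the mean of the negative binomial component and d as the probability of a dropout. A small numerical check, with illustrative function names and parameter values:

```python
def zinb_mean_variance(mu, size, d):
    """Moments of a zero-inflated NB via the law of total variance."""
    nb_var = mu + mu ** 2 / size
    mean = (1 - d) * mu
    # Within-component variance plus variance between the two component means.
    variance = (1 - d) * nb_var + d * (1 - d) * mu ** 2
    return mean, variance

def zinb_variance_closed_form(mu, size, d):
    """Closed form: mu * (1 - d) * (1 + d * mu + mu / size)."""
    return mu * (1 - d) * (1 + d * mu + mu / size)

mean, variance = zinb_mean_variance(mu=10.0, size=2.0, d=0.3)
closed = zinb_variance_closed_form(mu=10.0, size=2.0, d=0.3)
print(mean, variance, closed)
```

For mu = 10, size = 2 and d = 0.3 both routes give a mean of 7 and a variance of 63; setting d = 0 recovers the plain negative binomial.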

Common distributions

Poisson-Beta

Figure: Two histograms of Poisson-Beta read counts (x-axis: Read Counts, y-axis: Frequency).

$\mu = g \cdot \dfrac{a}{a + b}$

$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
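A short numerical sketch of the Poisson-Beta moments, under the assumption that counts are Poisson with rate g * p and p is Beta(a, b) distributed. Under that assumption the variance expression on the slide is the variance of the rate g * p, and the variance of the observed counts additionally contains the Poisson term equal to the mean. Function name and parameter values are illustrative.

```python
def poisson_beta_moments(a, b, g):
    """Mean, rate variance and count variance of a Poisson-Beta model."""
    mean = g * a / (a + b)
    # Variance of the Beta-distributed rate g * p, as written on the slide.
    rate_variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    # Counts add Poisson sampling noise on top (law of total variance).
    count_variance = mean + rate_variance
    return mean, rate_variance, count_variance

mean, rate_variance, count_variance = poisson_beta_moments(a=2.0, b=3.0, g=100.0)
print(mean, rate_variance, count_variance)
```

With a = 2, b = 3 and g = 100 the mean is 40, the rate variance is 400 and the count variance is 440.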

MAST (revisited)

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
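The mixture described above can be written down as a probability mass function. The sketch below is not the scde package: it only shows a two-component mixture of a low-rate Poisson (background or dropout signal) and a negative binomial (amplified signal), together with the posterior probability that an observed count came from the dropout component, which is the quantity an EM algorithm would use in its E-step. All names and parameter values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mu, size):
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)

def mixture_pmf(k, dropout_weight, background_rate, mu, size):
    """Probability of count k under the Poisson + NB mixture."""
    return (dropout_weight * poisson_pmf(k, background_rate)
            + (1 - dropout_weight) * nb_pmf(k, mu, size))

def posterior_dropout(k, dropout_weight, background_rate, mu, size):
    """Probability that count k was generated by the dropout component."""
    dropout_part = dropout_weight * poisson_pmf(k, background_rate)
    return dropout_part / mixture_pmf(k, dropout_weight, background_rate, mu, size)

params = dict(dropout_weight=0.4, background_rate=0.1, mu=20.0, size=2.0)
total = sum(mixture_pmf(k, **params) for k in range(2000))
print(total, posterior_dropout(0, **params), posterior_dropout(10, **params))
```

A count of zero is attributed almost entirely to the dropout component, while a count of ten is attributed almost entirely to the negative binomial component.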

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level $Y$ is then modeled using the GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data

  • using data we know something about to get positive controls

Simulated data

  • null data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios

  • comparing numbers of DEs, incl. as a function of group size

Investigating results

  • what do the expression levels and distributions of detected DEs look like?
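A null data set by re-sampling can be built by randomly splitting cells from a single condition into two mock groups, so that any gene called DE between them is a false positive by construction. A minimal sketch, with hypothetical cell identifiers and an illustrative function name:

```python
import random

def make_null_split(cell_ids, seed=0):
    """Randomly split cells from ONE condition into two mock groups."""
    rng = random.Random(seed)
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

cells = [f"cell_{i}" for i in range(10)]
mock_a, mock_b = make_null_split(cells)
print(mock_a, mock_b)
```

Repeating the split with different seeds and group sizes gives a distribution of false positive counts for each DE method.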

False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
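Given a simulation in which the truly DE genes are known, the metrics above follow from four counts. The sketch below mirrors the set-up of 10 000 genes with 2 000 true DEs; the set of called genes is entirely hypothetical and does not correspond to any method or result from the cited study.

```python
def classification_metrics(true_de, called_de, all_genes):
    """Sensitivity (recall), specificity and precision of a set of DE calls."""
    true_de, called_de, all_genes = set(true_de), set(called_de), set(all_genes)
    tp = len(true_de & called_de)              # true positives
    fp = len(called_de - true_de)              # false positives (type I errors)
    fn = len(true_de - called_de)              # false negatives (type II errors)
    tn = len(all_genes - true_de - called_de)  # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

genes = range(10_000)
truth = range(2_000)        # genes simulated as DE
calls = range(500, 2_300)   # hypothetical output of some DE method
metrics = classification_metrics(truth, calls, genes)
print(metrics)
```

In this made-up case 1 500 of the 2 000 true DEs are recovered (sensitivity 0.75) and 300 of the 1 800 calls are false (precision about 0.83).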

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

Figure: Two histograms, one of read counts (x-axis: Read Counts, y-axis: Frequency) and one of the number of 0 counts (x-axis: 0 counts, y-axis: Frequency).
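Summaries like the ones above are quick to compute before any DE analysis. A minimal sketch on a made-up 4 genes x 4 cells matrix (the variable names are illustrative):

```python
# Toy count matrix: rows are genes, columns are cells (made-up values).
counts = [
    [0, 0, 0, 0],
    [4, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 7, 3, 9],
]

# Number of cells with a zero count, per gene.
zeros_per_gene = [row.count(0) for row in counts]

# Genes with no expression at all, i.e. zero in every cell.
n_silent = sum(1 for row in counts if all(v == 0 for v in row))

print(zeros_per_gene, n_silent)
```

Genes that are zero in every cell carry no information for DE and are usually filtered out first.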

Choosing DE methods

Soneson and Robinson 2018

Page 8: scRNA-seq - Differential expression analyses

Introduction

Introduction

Olga (NBIS) scRNA-seq DE February 2019 3 46

Introduction

What does differential expression mean to you

httpswwwmenticom

Olga (NBIS) scRNA-seq DE February 2019 4 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 5 46

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]

Olga (NBIS) scRNA-seq DE February 2019 6 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1

Mybpc1 Nars Ndufa3 Nono Pgam2

Crispld2 Fbxw13 Hbxip Katna1 Lcorl

1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1

00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100

00

01

02

03

0

1

2

3

00

01

02

03

00

01

02

03

00

01

02

03

000

005

010

015

020

025

00

02

04

06

00

01

02

03

04

05

00

01

02

03

00

01

02

03

00

02

04

00

05

10

15

00

01

02

03

00

05

10

15

20

00

02

04

000

005

010

015

0

1

2

3

4

00

05

10

15

00

01

02

03

04

05

00

01

02

03

04

value

dens

ity

Based on tutorial data

Olga (NBIS) scRNA-seq DE February 2019 9 46

Common methods

Common methods

Olga (NBIS) scRNA-seq DE February 2019 10 46

Common methods

Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio

Olga (NBIS) scRNA-seq DE February 2019 11 46

Common methods

Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test

Olga (NBIS) scRNA-seq DE February 2019 12 46

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Developed for scRNA-seq
  • e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
  • large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
  • we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)

More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y|Z=1)

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
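The two-part structure of a hurdle model can be sketched for the simplest possible design: two groups of cells and no other covariates. This is a simplified illustration under stated assumptions, not MAST itself: there is no empirical Bayes shrinkage, the function names `bernoulli_loglik` and `hurdle_lrt` are hypothetical, and the input is assumed to be log-scale expression with 0 meaning "not detected". The discrete and continuous likelihood ratio statistics are summed and compared with a chi-square distribution.

```python
import math


def bernoulli_loglik(k, n):
    """Maximised log-likelihood for k detected cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def hurdle_lrt(expr_a, expr_b):
    """Two-group hurdle likelihood ratio test.

    expr_a, expr_b: lists of per-cell log-scale expression; 0 = not detected.
    Returns (statistic, degrees_of_freedom, p_value).
    """
    pos_a = [v for v in expr_a if v > 0]
    pos_b = [v for v in expr_b if v > 0]

    # Discrete part (Z): do the detection rates differ between groups?
    stat_d = 2.0 * (
        bernoulli_loglik(len(pos_a), len(expr_a))
        + bernoulli_loglik(len(pos_b), len(expr_b))
        - bernoulli_loglik(len(pos_a) + len(pos_b), len(expr_a) + len(expr_b))
    )
    stat_d = max(stat_d, 0.0)  # guard against tiny negative rounding error

    # Continuous part (Y | Z = 1): Gaussian model with a common variance.
    stat_c, df = 0.0, 1
    if pos_a and pos_b:
        pooled = pos_a + pos_b
        grand_mean = sum(pooled) / len(pooled)
        mean_a = sum(pos_a) / len(pos_a)
        mean_b = sum(pos_b) / len(pos_b)
        rss_null = sum((v - grand_mean) ** 2 for v in pooled)
        rss_alt = sum((v - mean_a) ** 2 for v in pos_a) + sum(
            (v - mean_b) ** 2 for v in pos_b
        )
        if rss_alt > 0:
            stat_c = len(pooled) * math.log(rss_null / rss_alt)
            df = 2

    stat = stat_d + stat_c
    # The chi-square survival function has a closed form for 1 and 2 df.
    if df == 2:
        p_value = math.exp(-stat / 2.0)
    else:
        p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, df, p_value
```

A gene can be flagged because its detection rate changes, because its expression level among expressing cells changes, or both; that is the practical appeal of the hurdle formulation.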

Let's stop for a minute

The key

$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Figure: histogram of height [cm] (y-axis: frequency)
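As a worked instance of "outcome = model + error", the pooled two-sample t statistic can be computed directly: the model is a mean per group, and the error is summarised by the pooled standard deviation $s_p$. The sketch below is illustrative only; the function name `pooled_t_statistic` and the example height values are made up for this purpose.

```python
import math


def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic with pooled standard deviation.

    Returns (t, degrees_of_freedom).
    """
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = sum(sample_1) / n1
    mean_2 = sum(sample_2) / n2

    # Pooled variance: within-group sums of squares over n1 + n2 - 2.
    ss_1 = sum((x - mean_1) ** 2 for x in sample_1)
    ss_2 = sum((x - mean_2) ** 2 for x in sample_2)
    df = n1 + n2 - 2
    s_p = math.sqrt((ss_1 + ss_2) / df)

    t = (mean_1 - mean_2) / (s_p * math.sqrt(1.0 / n1 + 1.0 / n2))
    return t, df


# Hypothetical heights in cm for two small groups.
heights_a = [168.0, 171.5, 174.0, 169.5, 172.0]
heights_b = [175.0, 178.5, 173.0, 177.0, 179.5]
t_value, dof = pooled_t_statistic(heights_a, heights_b)
```

The same logic carries over to expression data: the better the assumed model describes the data, the more trustworthy the resulting statistic.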

Generic recipe
  • model e.g. gene expression with random error
  • fit model to the data and/or data to the model, estimate model parameters
  • use model for prediction and/or inference

Important implication
  • the better the model fits the data, the better the statistics

Common distributions

Figure: three histograms of Negative Binomial read counts (x-axis: read counts, y-axis: frequency)

$\mu = \mathit{mu}$

$\sigma^2 = \mathit{mu} + \frac{\mathit{mu}^2}{\mathit{size}}$

mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014, Islam et al 2011).
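The mean and variance relations for the negative binomial can be checked numerically, and the same relations give a quick method-of-moments fit. This is a minimal sketch, assuming the (mu, size) parameterisation that R's `dnbinom(mu=, size=)` also uses; the function names `nb_pmf`, `nb_moments` and `fit_nb_moments` are hypothetical, and real DE tools estimate dispersion far more carefully.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (requires mu > 0, size > 0)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def nb_moments(mu, size, k_max=2000):
    """Mean and variance obtained by summing over the pmf up to k_max."""
    probs = [nb_pmf(k, mu, size) for k in range(k_max)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


def fit_nb_moments(counts):
    """Method-of-moments estimates (mu, size) from observed counts."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        return mean, math.inf  # no overdispersion: Poisson-like
    # Solve var = mean + mean^2 / size for size.
    return mean, mean ** 2 / (var - mean)


mean, var = nb_moments(mu=5.0, size=2.0)  # theory: 5 and 5 + 25/2 = 17.5
```

A small size means strong overdispersion; as size grows, the variance approaches the mean and the distribution approaches a Poisson.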

Common distributions

Figure: two histograms of zero-inflated Negative Binomial read counts (x-axis: read counts, y-axis: frequency)

$\mu = \mathit{mu} \cdot (1 - d)$

$\sigma^2 = \mathit{mu} \cdot (1 - d) \cdot (1 + d \cdot \mathit{mu} + \mathit{mu} / \mathit{size})$

d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
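The zero-inflated mean and variance can be verified by mixing a point mass at zero (weight d) with a negative binomial (weight 1 - d) and summing over the resulting probabilities. This is a sketch under the assumption that mu and size are the parameters of the negative binomial component; the function names are hypothetical.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mu > 0, size > 0)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout (0), otherwise NB."""
    p = (1.0 - d) * nb_pmf(k, mu, size)
    return p + d if k == 0 else p


def zinb_moments_numeric(mu, size, d, k_max=2000):
    """Mean and variance by direct summation over the pmf."""
    probs = [zinb_pmf(k, mu, size, d) for k in range(k_max)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


def zinb_moments_formula(mu, size, d):
    """Closed-form mean and variance of the zero-inflated NB."""
    mean = mu * (1.0 - d)
    var = mu * (1.0 - d) * (1.0 + d * mu + mu / size)
    return mean, var


numeric = zinb_moments_numeric(mu=10.0, size=3.0, d=0.4)
closed_form = zinb_moments_formula(mu=10.0, size=3.0, d=0.4)
```

Note that dropouts both lower the mean and inflate the variance relative to a plain negative binomial with the same mu.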

Common distributions

Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)

$\mu = g \cdot a / (a + b)$

$\sigma^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$

a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
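A Poisson-Beta count can be simulated in two steps: draw the fraction of time the gene is active from a Beta(a, b) distribution, then draw a Poisson count with rate g times that fraction. The sketch below is illustrative only (it is not how BPSC is implemented), and the function names and parameter values are assumptions. One detail worth stating: the $\sigma^2$ expression above describes the spread of the underlying rate, and simulated counts carry additional Poisson sampling variance equal to the mean, so the code compares the sample variance with $\mu + \sigma^2$.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    if rate <= 0:
        return 0
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_poisson_beta(a, b, g, n_cells, seed=0):
    """Simulate counts for one gene across n_cells cells."""
    rng = random.Random(seed)
    return [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(n_cells)]


def poisson_beta_moments(a, b, g):
    """Mean of the counts and variance of the Beta-distributed rate."""
    mean = g * a / (a + b)
    rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, rate_var


counts = sample_poisson_beta(a=0.5, b=2.0, g=20.0, n_cells=20000, seed=1)
mean, rate_var = poisson_beta_moments(a=0.5, b=2.0, g=20.0)
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
```

With small a and b the gene is mostly off with occasional bursts, which gives the many-zeros, long-tail shape typical of single-cell counts.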

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y|Z=1)

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
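The core of such a two-component mixture is the question "given this observed count, how likely is it that it came from the dropout component?". The sketch below computes that posterior probability, which is the quantity an EM algorithm would update in its E-step. It is an illustration under assumed parameter values, not SCDE's actual fitting procedure; the function names, the background rate of 0.1 and the prior of 0.3 are all hypothetical choices.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability of count k (rate > 0)."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mu > 0, size > 0)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def dropout_posterior(count, dropout_prior, background_rate, mu, size):
    """Posterior probability that an observed count is a dropout event.

    dropout_prior: mixing weight of the Poisson (background) component.
    background_rate: low Poisson rate for transcripts that failed to amplify.
    mu, size: NB parameters of the amplified (detected) component.
    """
    dropout = dropout_prior * poisson_pmf(count, background_rate)
    amplified = (1.0 - dropout_prior) * nb_pmf(count, mu, size)
    return dropout / (dropout + amplified)


# A zero is almost certainly a dropout when the gene is otherwise highly
# expressed; a count of 40 is almost certainly a real, amplified signal.
p_zero = dropout_posterior(0, dropout_prior=0.3, background_rate=0.1, mu=50.0, size=2.0)
p_high = dropout_posterior(40, dropout_prior=0.3, background_rate=0.1, mu=50.0, size=2.0)
```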

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level $Y$ is then modeled using a GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data
  • using data we know something about to get positive controls

Simulated data
  • null-data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios
  • comparing numbers of DEs, incl. as a function of group size

Investigating results
  • What do the expression and distributions of detected DEs look like?
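One way to build a null data set by re-sampling is to take cells from a single condition, split them at random into two pseudo-groups and run a DE test: since no true difference exists, every gene called significant is a false positive. The sketch below illustrates the idea with a deliberately simple test (a normal-approximation comparison of means); the function names are hypothetical and the toy data are simulated Gaussian noise, not real expression values.

```python
import math
import random


def mean_difference_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    if se == 0:
        return 1.0
    return math.erfc(abs(mean_a - mean_b) / se / math.sqrt(2.0))


def null_false_positive_rate(expression, alpha=0.05, seed=0):
    """Fraction of genes called DE after a random split of one condition.

    expression: list of genes, each a list of values for the same cells.
    """
    rng = random.Random(seed)
    cells = list(range(len(expression[0])))
    rng.shuffle(cells)
    half = len(cells) // 2
    group_a, group_b = cells[:half], cells[half:]

    n_called = 0
    for gene in expression:
        p = mean_difference_p_value(
            [gene[i] for i in group_a], [gene[i] for i in group_b]
        )
        if p < alpha:
            n_called += 1
    return n_called / len(expression)


# Toy example: 500 genes x 40 cells of pure noise.
noise_rng = random.Random(42)
toy = [[noise_rng.gauss(5.0, 1.0) for _ in range(40)] for _ in range(500)]
fpr = null_false_positive_rate(toy)
```

A well-calibrated method should call roughly a fraction alpha of genes on such data; a much larger fraction signals poor type I error control.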

  • False positives (type I error) vs false negatives (type II error)
  • Sensitivity and specificity
  • Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
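When the truth is known, as in a simulation, these quantities follow directly from comparing the set of genes a method calls DE with the set of truly DE genes. The sketch below is a minimal illustration; the function name `de_performance` is hypothetical, and the example numbers (a method calling 1 800 genes, 1 500 of them correctly) are made up to match the scale of the simulation described above.

```python
def de_performance(called_de, true_de, all_genes):
    """Confusion counts and derived metrics for one DE method.

    called_de, true_de, all_genes: sets of gene identifiers.
    """
    tp = len(called_de & true_de)               # true positives
    fp = len(called_de - true_de)               # false positives (type I errors)
    fn = len(true_de - called_de)               # false negatives (type II errors)
    tn = len(all_genes - called_de - true_de)   # true negatives

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "sensitivity": ratio(tp, tp + fn),  # identical to recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "FDR": ratio(fp, tp + fp),
    }


# Hypothetical result: 10 000 genes, 2 000 truly DE, 1 800 called.
all_genes = set(range(10000))
true_de = set(range(2000))
called_de = set(range(500, 2300))
metrics = de_performance(called_de, true_de, all_genes)
```

Sensitivity and recall are the same quantity; precision and FDR sum to one, so reporting one of each pair is enough.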

Consistency

Miao et al 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data
  • Example data: 46078 genes x 96 cells
  • 22229 genes with no expression at all

Figure: histograms of read counts and of 0 counts (y-axis: frequency)
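The summaries on this slide (matrix dimensions, genes with no expression at all, how many zeros each gene has) are cheap to compute before any DE testing. The sketch below assumes the count matrix is a plain list of rows, one row per gene and one column per cell; the function name `summarize_counts` and the toy matrix are hypothetical. It also computes the per-cell detection rate, i.e. the fraction of genes detected in each cell.

```python
def summarize_counts(counts):
    """Basic exploration of a genes x cells count matrix (list of lists)."""
    n_genes = len(counts)
    n_cells = len(counts[0])

    # Genes that are zero in every cell carry no information for DE.
    genes_all_zero = sum(1 for row in counts if not any(row))

    # Number of zero counts per gene: the basis of a "0 counts" histogram.
    zeros_per_gene = [row.count(0) for row in counts]

    # Detection rate: fraction of genes with a non-zero count in each cell.
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)
    ]

    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        "genes_all_zero": genes_all_zero,
        "zeros_per_gene": zeros_per_gene,
        "detection_rate": detection_rate,
    }


# Toy matrix: 4 genes x 3 cells.
toy_counts = [
    [0, 0, 0],
    [1, 0, 3],
    [2, 2, 2],
    [0, 5, 0],
]
summary = summarize_counts(toy_counts)
```

Genes that are zero everywhere are usually removed before testing, and the detection rate is a candidate covariate for methods that accept one.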

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary
  • scRNA-seq is a rapidly growing field
  • DE is a common task, so many newer and better methods will be developed
  • understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
  • staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 20954697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 9: scRNA-seq - Differential expression analyses

Introduction

What does differential expression mean to you?

https://www.menti.com

Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]

adapted from Wu et al 2017

Differential expression means
  • taking read count data &
  • performing statistical analysis to discover quantitative changes in expression levels between experimental groups
  • i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)

Differential expression is an old problem
  • known from bulk RNA-seq and microarray studies
  • in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences

Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics
  • high noise levels (technical and biological factors)
  • low library sizes
  • low amount of available mRNAs results in amplification biases and dropout events
  • 3' bias, partial coverage and uneven depth (technical)
  • stochastic nature of transcription (biological)
  • multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)

Introduction

Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1

Mybpc1 Nars Ndufa3 Nono Pgam2

Crispld2 Fbxw13 Hbxip Katna1 Lcorl

1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1

00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100

00

01

02

03

0

1

2

3

00

01

02

03

00

01

02

03

00

01

02

03

000

005

010

015

020

025

00

02

04

06

00

01

02

03

04

05

00

01

02

03

00

01

02

03

00

02

04

00

05

10

15

00

01

02

03

00

05

10

15

20

00

02

04

000

005

010

015

0

1

2

3

4

00

05

10

15

00

01

02

03

04

05

00

01

02

03

04

value

dens

ity

Based on tutorial data

Olga (NBIS) scRNA-seq DE February 2019 9 46

Common methods

Common methods

Olga (NBIS) scRNA-seq DE February 2019 10 46

Common methods

Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio

Olga (NBIS) scRNA-seq DE February 2019 11 46

Common methods

Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test

Olga (NBIS) scRNA-seq DE February 2019 12 46

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Common methods

Miao and Zhang 2016

Olga (NBIS) scRNA-seq DE February 2019 15 46

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 16 46

Common methods More detailed examples

More detailed examples

Olga (NBIS) scRNA-seq DE February 2019 17 46

Common methods More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Common methods More detailed examples

Letrsquos stop for a minute

Olga (NBIS) scRNA-seq DE February 2019 19 46

Common methods More detailed examples

The key

Outcomei = (Modeli) + errori

we collect data on a sample from a much larger population

statistics lets us to make inferences about the population from which sample wasderived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common methods More detailed examples

Common distributions

Negative Binomial

Read Counts

Fre

quen

cy

0 5 10 15 20

010

020

030

040

050

0

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10

050

100

150

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10 12

050

100

150

200

micro = mu

δ2 = mu + mu2

size

mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$\mu = \mathrm{mu} \cdot (1 - d)$

$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot (1 + d \cdot \mathrm{mu} + \mathrm{mu} / \mathrm{size})$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE
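The two moment formulas can be checked numerically. The sketch below is an illustration under assumed parameter values (mu = 5, size = 2, d = 0.3 are hypothetical): it builds the zero-inflated NB probability mass function, sums it to obtain the mean and variance, and compares them with the closed-form expressions.

```python
import math


def nb_pmf(k, mu, size):
    """NB probability of count k, with mean mu and dispersion parameter size."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout (forced zero),
    otherwise it is drawn from NB(mu, size)."""
    return (d if k == 0 else 0.0) + (1 - d) * nb_pmf(k, mu, size)


mu, size, d = 5.0, 2.0, 0.3  # hypothetical parameter values
ks = range(0, 2000)          # the tail beyond this range is negligible here
probs = [zinb_pmf(k, mu, size, d) for k in ks]

# Moments obtained by summing over the probability mass function.
mean_num = sum(k * pk for k, pk in zip(ks, probs))
var_num = sum(k * k * pk for k, pk in zip(ks, probs)) - mean_num ** 2

# Moments from the closed-form expressions.
mean_formula = mu * (1 - d)
var_formula = mu * (1 - d) * (1 + d * mu + mu / size)
```

With these values both routes give a mean of 3.5 and a variance of 17.5, so dropouts lower the mean without lowering the variance (the plain NB with the same mu and size also has variance 17.5).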

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$\mu = g \cdot \frac{a}{a + b}$

$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1)

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
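The two-part idea can be sketched for a single gene and two groups of cells. This is a simplified illustration and not MAST itself: it assumes no covariates other than the group, skips the empirical Bayes step, needs at least one expressed cell per group, and uses the asymptotic chi-square distribution with 2 degrees of freedom. All function names and data values are hypothetical.

```python
import math


def _binom_loglik(m, n, p):
    """Log-likelihood of m expressed cells out of n, taking 0 * log(0) as 0."""
    ll = 0.0
    if m > 0:
        ll += m * math.log(p)
    if n - m > 0:
        ll += (n - m) * math.log(1 - p)
    return ll


def _rss(values, center):
    """Residual sum of squares around a given center."""
    return sum((v - center) ** 2 for v in values)


def hurdle_lrt(group1, group2):
    """Simplified two-part (hurdle) likelihood ratio test for one gene."""
    pos1 = [y for y in group1 if y > 0]
    pos2 = [y for y in group2 if y > 0]
    n1, n2, m1, m2 = len(group1), len(group2), len(pos1), len(pos2)

    # Discrete part (Z): does the detection rate differ between groups?
    ll_alt = _binom_loglik(m1, n1, m1 / n1) + _binom_loglik(m2, n2, m2 / n2)
    p0 = (m1 + m2) / (n1 + n2)
    ll_null = _binom_loglik(m1, n1, p0) + _binom_loglik(m2, n2, p0)
    stat_discrete = 2 * (ll_alt - ll_null)

    # Continuous part (Y | Z = 1): Gaussian model on positive values only,
    # group-specific means versus one common mean, shared variance.
    both = pos1 + pos2
    rss_null = _rss(both, sum(both) / len(both))
    rss_alt = _rss(pos1, sum(pos1) / m1) + _rss(pos2, sum(pos2) / m2)
    stat_continuous = len(both) * math.log(rss_null / rss_alt)

    stat = stat_discrete + stat_continuous
    # The chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    return stat, math.exp(-stat / 2)


# Hypothetical log-expression values; zeros are non-detected cells.
cells_a = [0, 0, 0, 0, 1.0, 1.2]
cells_b = [2.9, 3.1, 3.0, 3.2, 0, 2.8]
stat, p_value = hurdle_lrt(cells_a, cells_b)
```

Summing the two statistics is what lets the test pick up a gene that changes in detection rate, in expression level, or in both.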

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit the parameters of the mixture of models via the EM algorithm

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
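The role of the two mixture components can be illustrated with the posterior probability that an observed count comes from the background (dropout) component, which is the quantity an EM algorithm updates in its E-step. This is a sketch under assumed parameter values, not the SCDE implementation; the function names and the numbers are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """NB probability of count k, with mean mu and dispersion parameter size."""
    p = size / (size + mu)
    return math.exp(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                    + size * math.log(p) + k * math.log(1 - p))


def dropout_posterior(k, pi_drop, lam_bg, mu, size):
    """Posterior probability that count k comes from the background
    Poisson component rather than from the amplified NB component."""
    background = pi_drop * poisson_pmf(k, lam_bg)
    amplified = (1 - pi_drop) * nb_pmf(k, mu, size)
    return background / (background + amplified)


# Hypothetical mixture parameters for one gene in one cell.
params = dict(pi_drop=0.4, lam_bg=0.1, mu=50.0, size=5.0)
post_zero = dropout_posterior(0, **params)
post_high = dropout_posterior(40, **params)
```

A zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed almost entirely to the NB component.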

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling

modeling data sets based on various distributions

Comparing between methods and scenarios

comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression levels and distributions of the detected DEs look like?


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
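When the truth is known, as in a simulation, these quantities follow directly from the four cells of the confusion matrix. The sketch below uses hypothetical counts for a simulation of 10 000 genes with 2 000 true DE genes; the numbers of detections are invented for illustration and are not results from the cited study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard evaluation metrics from the counts of a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),          # recall, true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),            # share of calls that are true DE
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


# Hypothetical outcome: 2 000 truly DE genes and 8 000 non-DE genes.
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
```

Sensitivity and recall are the same quantity, while specificity and precision answer different questions: how many non-DE genes are left alone, versus how many of the reported genes are truly DE.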

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11-21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: two histograms, one of read counts and one of 0 counts (y-axis: frequency)]
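A first look at the zeros in a count matrix takes only a few lines. The sketch below assumes the counts are stored as a list of gene rows with one column per cell; the helper names and the toy matrix are hypothetical.

```python
def count_unexpressed_genes(counts):
    """Number of genes (rows) whose count is zero in every cell (column)."""
    return sum(1 for gene_row in counts if all(c == 0 for c in gene_row))


def zeros_per_gene(counts):
    """For each gene, the number of cells with a zero count."""
    return [sum(1 for c in gene_row if c == 0) for gene_row in counts]


# Hypothetical toy matrix: 4 genes x 5 cells.
toy_counts = [
    [0, 0, 0, 0, 0],
    [3, 0, 12, 0, 1],
    [0, 0, 0, 7, 0],
    [150, 98, 230, 0, 310],
]
n_unexpressed = count_unexpressed_genes(toy_counts)
zero_counts = zeros_per_gene(toy_counts)
```

Genes with no expression at all carry no information for DE testing and are typically filtered out before any model is fitted.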

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
  • Performance
  • Practicalities
  • Summary
  • Bibliography
Page 10: scRNA-seq - Differential expression analyses

Introduction

Figure Simplified scRNA-seq workflow [adapted from Wikipedia]


adapted from Wu et al. 2017

Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups

i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)

Differential expression is an old problem, known from bulk RNA-seq and microarray studies

in fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences


Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics

high noise levels (technical and biological factors)

low library sizes

low amount of available mRNAs results in amplification biases and dropout events

3' bias, partial coverage and uneven depth (technical)

stochastic nature of transcription (biological)

multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)


[Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1. Based on tutorial data]

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods

e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test

non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group

some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data

if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
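The conversion to ranks, and the effect of ties, can be seen in a small sketch of the rank-sum statistic. It assumes the common convention that tied values share their average rank (midranks); the function names and expression values are hypothetical, and the p-value calculation is left out.

```python
def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = average_rank
        start = end + 1
    return ranks


def rank_sum_u(group1, group2):
    """Mann-Whitney U statistic of group1, computed from its rank sum."""
    ranks = midranks(list(group1) + list(group2))
    rank_sum1 = sum(ranks[: len(group1)])
    return rank_sum1 - len(group1) * (len(group1) + 1) / 2


# Hypothetical expression values; the zeros stand in for dropouts.
cells_a = [0, 0, 0, 5, 8]
cells_b = [0, 0, 3, 9, 12]
u = rank_sum_u(cells_a, cells_b)
```

Here five of the ten values are zeros and all receive the same rank of 3, so half of the data cannot separate the groups; this is how a large number of dropouts weakens rank-based tests.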

developed for bulk RNA-seq

e.g. edgeR, DESeq2

compare estimates of mean expression (sample size)

based on the negative binomial distribution

can be assessed by datasets where RNA-seq data has been validated by RT-qPCR

developed for scRNA-seq

e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.

there is a large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups

we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach


Miao and Zhang 2016


More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1)

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Let's stop for a minute


Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

adapted from Wikipedia

Olga (NBIS) scRNA-seq DE February 2019 33 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)

Olga (NBIS) scRNA-seq DE February 2019 34 46

Performance

Consistency

Miao et al 2017

Olga (NBIS) scRNA-seq DE February 2019 35 46

Performance

And so much more

Soneson and Robinson 2018

Bias robustness and scalability in single-celldifferential expression analysis

36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells

based on 9 data sets with 11 - 21separate instances (sample size effect)

extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc

conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets

Olga (NBIS) scRNA-seq DE February 2019 36 46

Practicalities

Practicalities

Olga (NBIS) scRNA-seq DE February 2019 37 46

Practicalities

Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all

Read Counts

Fre

quen

cy

0 500 1000 1500

050

0015

000

0 counts

Fre

quen

cy

0 20 40 60 800

2000

4000

6000

Olga (NBIS) scRNA-seq DE February 2019 38 46

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 39 46

Practicalities

Rembering the bigger picture

Stegle Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl batcheffects

Detection rate ie the fraction ofdetected genes per cell

Imputations strategies for dropoutvalues

What is pragmatic programminglanguage platform speedcollaborative workflows etc

Olga (NBIS) scRNA-seq DE February 2019 40 46

Practicalities

Staying critical

Olga (NBIS) scRNA-seq DE February 2019 41 46

Summary

What to remember from this hour

httpswwwmenticom amp 25 06 78

Olga (NBIS) scRNA-seq DE February 2019 42 46

Summary

Growing field

Angerer et al 2017

Olga (NBIS) scRNA-seq DE February 2019 43 46

Summary

Growing field

httpswwwscrna-toolsorgtools

Zappia Phipson and Oshlack 2018

Olga (NBIS) scRNA-seq DE February 2019 44 46

Summary

SummaryscRNA-seq is a rapidly growing field

DE is a common task so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set

staying critical staying updated staying connected

Olga (NBIS) scRNA-seq DE February 2019 45 46

Bibliography

Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329

Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7

Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612

Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062

Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549

Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145

Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004

Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573

Olga (NBIS) scRNA-seq DE February 2019 46 46

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 11: scRNA-seq - Differential expression analyses

Introduction

Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]

Differential expression analysis means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups, i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected from natural random variation alone).

Differential expression is an old problem, known from bulk RNA-seq and microarray studies. In fact, it builds on one of the most common statistical problems: comparing groups for statistical differences.

Adapted from Wu et al. 2017.

Differential expression is an old problem. So what is all the commotion about?

scRNA-seq special characteristics:
  • high noise levels (technical and biological factors)
  • low library sizes
  • low amount of available mRNA, which results in amplification biases and dropout events
  • 3' bias, partial coverage and uneven depth (technical)
  • stochastic nature of transcription (biological)
  • multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)

Figure: density plots of expression values (x-axis: value; y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.

Common methods

Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods
  • e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
  • non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
  • some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
  • if the conditions for a parametric test hold, it will typically be more powerful than a non-parametric test
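To make the rank-based idea and the tie problem concrete, here is a minimal sketch of a Wilcoxon rank-sum test written with the Python standard library only. It is an illustration under stated assumptions, not the implementation used by any of the packages named in these slides: the function name `rank_sum_test` and the toy expression values are made up, tied values receive mid-ranks, and the p-value comes from a normal approximation with tie correction and no continuity correction.

```python
import math
from collections import Counter

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test (illustrative sketch).

    Returns (U statistic of x, two-sided p-value from a normal
    approximation with tie correction, no continuity correction).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Assign every distinct value the average (mid) rank of its tie block.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of the 1-based ranks i+1..j
        i = j
    rank_sum_x = sum(ranks[v] for v in x)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Ties shrink the variance of U; many zeros mean one very large tie block.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no information at all
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical expression values for one gene in two groups of cells.
group_a = [0, 0, 0, 0, 0, 0, 3, 5, 8, 12]
group_b = [0, 0, 1, 4, 6, 9, 10, 15, 20, 25]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

The more zeros the two groups share, the larger the tie block and the smaller the variance term, which is one way to see why dropout-heavy genes carry little rank information.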

Methods developed for bulk RNA-seq
  • e.g. edgeR, DESeq2
  • compare estimates of mean expression (sample size)
  • based on the negative binomial distribution
  • can be assessed on data sets where RNA-seq data has been validated by RT-qPCR

Methods developed for scRNA-seq
  • e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
  • single-cell experiments provide a large number of samples (i.e. cells) for each group we are comparing, so we can take advantage of the whole distribution of expression values in each group to identify differences between groups
  • we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612

More detailed examples

MAST

  • uses a generalized linear hurdle model
  • designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
  • the rate of expression Z and the level of expression Y are modeled for each gene g, where Z_{ig} indicates whether gene g is expressed in cell i (i.e. Z_{ig} = 0 if Y_{ig} = 0 and Z_{ig} = 1 if Y_{ig} > 0)
  • a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$$\operatorname{logit}\left(\Pr(Z_{ig} = 1)\right) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N\left(X_i \beta_g^C, \sigma_g^2\right)$$

    where X_i is a design matrix
  • model parameters are fitted using an empirical Bayesian framework
  • allows for a joint estimate of nuisance and treatment effects
  • DE is determined using the likelihood ratio test
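The two-part structure of a hurdle model can be sketched for the simplest design, a single gene compared between two groups of cells. The code below is a strongly simplified illustration and not MAST itself: it has no covariates and no empirical Bayes shrinkage, the function name `hurdle_lrt` and the toy values are hypothetical, and it assumes that the likelihood ratio statistics of the discrete and continuous parts can be added and compared to a chi-square distribution with 2 degrees of freedom.

```python
import math

def _xlogy(k, p):
    """k * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(p)

def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections in n cells."""
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def hurdle_lrt(group1, group2):
    """Two-group hurdle likelihood ratio test for one gene (sketch).

    Inputs are log-scale expression values, zero meaning 'not detected'.
    Returns (discrete statistic, continuous statistic, combined p-value).
    """
    n1, n2 = len(group1), len(group2)
    pos1 = [v for v in group1 if v > 0]
    pos2 = [v for v in group2 if v > 0]

    # Discrete part (Z): one shared detection rate vs one rate per group.
    ll_alt = _binom_loglik(len(pos1), n1) + _binom_loglik(len(pos2), n2)
    ll_null = _binom_loglik(len(pos1) + len(pos2), n1 + n2)
    stat_d = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part (Y | Z = 1): Gaussian, shared mean vs group means.
    # Skipped (statistic 0) when a group has fewer than 2 detected cells.
    stat_c = 0.0
    if len(pos1) >= 2 and len(pos2) >= 2:
        both = pos1 + pos2
        m0 = sum(both) / len(both)
        m1 = sum(pos1) / len(pos1)
        m2 = sum(pos2) / len(pos2)
        rss0 = sum((v - m0) ** 2 for v in both)
        rss1 = sum((v - m1) ** 2 for v in pos1) + sum((v - m2) ** 2 for v in pos2)
        if rss1 > 0:
            stat_c = len(both) * math.log(rss0 / rss1)

    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    p_value = math.exp(-(stat_d + stat_c) / 2)
    return stat_d, stat_c, p_value

# Hypothetical log-expression values: group 1 is mostly undetected.
g1 = [0, 0, 0, 0, 0, 0, 0, 0, 1.0, 1.2]
g2 = [2.0, 2.5, 3.0, 2.2, 2.8, 0, 3.1, 2.9, 2.4, 2.6]
print(hurdle_lrt(g1, g2))
```

A gene can be flagged because its detection rate differs, because its expression level among detected cells differs, or both; the two statistics are returned separately so that this can be inspected.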

Let's stop for a minute

The key

$$\text{outcome}_i = (\text{model}_i) + \text{error}_i$$

  • we collect data on a sample from a much larger population
  • statistics lets us make inferences about the population from which the sample was derived
  • we try to predict the outcome given a model fitted to the data

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Figure: histogram of height [cm] (x-axis) against frequency (y-axis).
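As a worked example of the two-sample t statistic with pooled standard deviation s_p, the sketch below computes it by hand using only the Python standard library. The height values and the function name `pooled_t` are invented for illustration; the p-value lookup is left out because it needs the t distribution, which the standard library does not provide.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation.

    Returns (t, degrees of freedom).
    """
    n1, n2 = len(x), len(y)
    mean1, mean2 = statistics.mean(x), statistics.mean(y)
    var1, var2 = statistics.variance(x), statistics.variance(y)  # sample variances
    # Pooled standard deviation: variances weighted by their degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    t = (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical heights in cm for two groups of five people.
heights_1 = [172, 175, 178, 181, 169]
heights_2 = [165, 168, 171, 174, 162]
t_stat, df = pooled_t(heights_1, heights_2)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

Here both groups have sample variance 22.5, so s_p times the square-root term equals 3 and the mean difference of 7 cm gives t = 7/3.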

Generic recipe
  • model the data, e.g. gene expression with random error
  • fit the model to the data and/or the data to the model
  • estimate the model parameters
  • use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics.

Common distributions

Negative Binomial

Figure: three example histograms of negative binomial read counts (x-axis: read counts; y-axis: frequency).

$$\mu = mu$$

$$\sigma^2 = mu + \frac{mu^2}{size}$$

where mu is the mean expression and size is the dispersion, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
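The mean and variance relations of the negative binomial can be checked by simulation. The sketch below uses only the Python standard library and draws NB counts as a gamma-Poisson mixture, which is one standard way to construct the distribution; the parameter values (mu = 5, size = 2), the helper names and the simple Poisson sampler are choices made for this illustration.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (small to moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_nb(mu, size, rng):
    """Negative binomial draw with mean mu and dispersion parameter size.

    The Poisson rate is gamma distributed with shape `size` and scale
    `mu / size`, so it has mean mu and variance mu**2 / size.
    """
    rate = rng.gammavariate(size, mu / size)
    return sample_poisson(rate, rng)

rng = random.Random(1)
mu, size = 5.0, 2.0
draws = [sample_nb(mu, size, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
expected_var = mu + mu ** 2 / size  # variance formula from the slide

print(f"mean: {sample_mean:.2f} (expected {mu})")
print(f"variance: {sample_var:.2f} (expected {expected_var})")
```

Decreasing `size` inflates the variance relative to the mean, while a very large `size` brings the counts close to Poisson behaviour.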

Zero-inflated NB

Figure: two example histograms of zero-inflated negative binomial read counts (x-axis: read counts; y-axis: frequency).

$$\mu = mu \cdot (1 - d)$$

$$\sigma^2 = \mu \cdot (1 - d) \cdot \left(1 + d \cdot \mu + \frac{\mu}{size}\right)$$

where d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.

Poisson-Beta

Figure: two example histograms of Poisson-Beta read counts (x-axis: read counts; y-axis: frequency).

$$\mu = g \cdot \frac{a}{a + b}$$

$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$

where a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

These models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.

SCDE

  • models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
  • the NB distribution models the transcripts that are amplified and detected
  • the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
  • a subset of robust genes is used to fit the parameters of the mixture of models via the EM algorithm
  • for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach

Monocle

  • originally designed for ordering cells by progress through differentiation stages (pseudo-time)
  • the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

    where Y is a specific gene expression level, x_i are predictor variables, g is a link function (typically the log function), and f_i are non-parametric functions (e.g. cubic splines)
  • the observable expression level Y is then modeled using a GAM:

$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$

    where \varphi_t(b_x, s_i) is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom; the error term \varepsilon is normally distributed with a mean of zero
  • the DE test is performed using an approximate χ² likelihood ratio test

Performance


Supplementary Table 2 of Soneson and Robinson 2018 (Nature Methods, doi:10.1038/nmeth.4612), shown above under Common methods, lists the evaluated differential expression methods.

No ground truth, i.e. no independently validated truth is available for testing.

Known data
  • using data we know something about to get positive controls

Simulated data
  • null data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios
  • comparing numbers of DEs, incl. as a function of group size

Investigating results
  • what do the expression levels and distributions of detected DEs look like?
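One way to build a null data set by re-sampling is to draw two disjoint mock groups from cells that all belong to the same condition; because no true difference exists between such groups, every gene a method calls DE is a false positive. The sketch below only covers the splitting step, with a hypothetical helper `make_null_split` and invented cell identifiers; running an actual DE method on the mock groups is left out.

```python
import random

def make_null_split(cell_ids, group_size, seed=0):
    """Draw two disjoint mock groups from cells of ONE condition."""
    if 2 * group_size > len(cell_ids):
        raise ValueError("not enough cells for two disjoint groups")
    rng = random.Random(seed)
    drawn = rng.sample(list(cell_ids), 2 * group_size)
    return drawn[:group_size], drawn[group_size:]

# Hypothetical data set: 96 cells from a single, homogeneous population.
cells = [f"cell_{i}" for i in range(96)]

# Repeating the split for several group sizes gives null instances that
# can be used to study how false positives depend on sample size.
for size in (6, 12, 24, 48):
    mock_a, mock_b = make_null_split(cells, size, seed=size)
    print(size, mock_a[:3], mock_b[:3])
```

Using a fixed seed per instance keeps the null data sets reproducible, so different DE methods can be compared on exactly the same mock groups.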

  • False positives (type I error) vs false negatives (type II error)
  • Sensitivity and specificity
  • Precision and recall

Figure: adapted from Wikipedia.

Figure: Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 are set to be DE (based on NB and bimodal distributions).
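The performance measures named above follow directly from the four cells of a confusion table. The sketch below borrows the design of the simulation (10,000 genes, 2,000 of them truly DE), but the outcome of the method is entirely hypothetical: it is assumed to call 1,800 genes DE, of which 1,500 are truly DE. The function name is also made up for this example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # also called recall or true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # fraction of calls that are true DEs
        "false_discovery_rate": fp / (tp + fp),
    }

n_genes, n_true_de = 10_000, 2_000   # simulation design
n_called, n_called_true = 1_800, 1_500  # hypothetical method output

tp = n_called_true
fp = n_called - n_called_true        # type I errors
fn = n_true_de - n_called_true       # type II errors
tn = n_genes - tp - fp - fn

metrics = classification_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With only 20% of genes truly DE, specificity stays high even when several hundred false positives are made, which is why precision (or the false discovery rate) is often the more informative number.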

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis":
  • 36 statistical approaches for DE analysis to compare the expression levels in two groups of cells
  • based on 9 data sets with 11-21 separate instances (sample size effect)
  • extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
  • conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities


Getting to know your data

Example data: 46,078 genes x 96 cells; 22,229 genes with no expression at all.

Figure: two histograms; the first shows read counts (x-axis) against frequency (y-axis), the second shows the number of 0 counts (x-axis) against frequency (y-axis).
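A first look at a count matrix can be as simple as counting zeros. The sketch below, in plain Python, counts genes with no expression in any cell and the number of zero counts per gene; the toy matrix and the function names are invented for illustration, and a real analysis would read the matrix from file and most likely use a numerical library.

```python
def count_silent_genes(counts):
    """Number of genes (rows) with zero counts in every cell (column)."""
    return sum(1 for row in counts if all(v == 0 for v in row))

def zeros_per_gene(counts):
    """Number of cells with a zero count, for each gene."""
    return [sum(1 for v in row if v == 0) for row in counts]

# Hypothetical toy matrix: 5 genes (rows) x 4 cells (columns).
toy_counts = [
    [0, 0, 0, 0],      # never detected
    [12, 0, 7, 0],
    [0, 0, 0, 3],
    [150, 98, 0, 201],
    [0, 0, 0, 0],      # never detected
]

print("genes x cells:", len(toy_counts), "x", len(toy_counts[0]))
print("genes with no expression at all:", count_silent_genes(toy_counts))
print("zero counts per gene:", zeros_per_gene(toy_counts))
```

Genes that are never detected carry no information for DE testing and are commonly removed before any method is run.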

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

  • QC filtering
  • cell-cycle phase
  • normalization of cell-specific biases
  • confounding factors, incl. batch effects
  • detection rate, i.e. the fraction of detected genes per cell
  • imputation strategies for dropout values
  • what is pragmatic: programming language, platform, speed, collaborative workflows, etc.
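The detection rate mentioned in the list above is straightforward to compute from a genes x cells count matrix. The sketch below uses plain Python and a made-up toy matrix; the function name `detection_rate` and the threshold of "count greater than zero" for calling a gene detected are assumptions of this example.

```python
def detection_rate(counts):
    """Fraction of genes with a count above zero, for each cell.

    `counts` is a list of rows, one row per gene and one column per cell.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in range(n_genes) if counts[gene][cell] > 0) / n_genes
        for cell in range(n_cells)
    ]

# Hypothetical toy matrix: 4 genes (rows) x 3 cells (columns).
toy_matrix = [
    [5, 0, 2],
    [0, 0, 1],
    [3, 0, 0],
    [9, 4, 7],
]

rates = detection_rate(toy_matrix)
print(rates)
```

The per-cell values can then be supplied as a covariate to methods that accept one, as in the "DetRate" variants listed in Supplementary Table 2.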

Staying critical

Summary

What to remember from this hour

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools (Zappia, Phipson and Oshlack 2018)

  • scRNA-seq is a rapidly growing field
  • DE is a common task, so many newer and better methods will be developed
  • understanding basic statistical concepts enables one to think more like a statistician and to choose and evaluate methods for a given data set
  • staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 12: scRNA-seq - Differential expression analyses

Introduction

adapted from Wu et al. 2017

Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups,

i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).

Differential expression is an old problem, known from bulk RNA-seq and microarray studies,

in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences.

Olga (NBIS) scRNA-seq DE February 2019 7 / 46


Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics:

high noise levels (technical and biological factors)

low library sizes

low amount of available mRNAs results in amplification biases and dropout events

3' bias, partial coverage and uneven depth (technical)

stochastic nature of transcription (biological)

multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)


[Figure: density plots of expression values (x-axis: value, y-axis: density), one panel per gene, for 20 genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1]

Based on tutorial data

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods

e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test

non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group

some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data

if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
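The effect of tied zeros on a rank-based test can be made concrete with a small sketch. The code below is an illustration only, not the implementation used by any of the packages named in this talk: it computes a two-sided Wilcoxon rank-sum test with mid-ranks and the usual tie-corrected normal approximation (no continuity correction). The function names `midranks` and `rank_sum_test` are made up for this example.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test; returns (U statistic of group_a, p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    u_stat = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    # Tie correction: sum of (t^3 - t) over every set of t tied values.
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_stat, 1.0  # every value is tied, so the ranks carry no information
    z = (u_stat - n_a * n_b / 2) / math.sqrt(variance)
    return u_stat, math.erfc(abs(z) / math.sqrt(2))
```

When many cells have zero counts, `tie_term` grows and the variance of the statistic shrinks; in the extreme case where all values are tied the test cannot distinguish the groups at all.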

developed for bulk RNA-seq

e.g. edgeR, DESeq2

compare estimates of mean expression (sample size)

based on the negative binomial distribution

can be assessed with datasets where RNA-seq data has been validated by RT-qPCR

developed for scRNA-seq

e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.

large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups

we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)

Common methods: More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$$\operatorname{logit}\big(\Pr(Z_{ig} = 1)\big) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
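The two-part idea behind a hurdle model can be sketched for a single gene and two groups of cells. This is a deliberately stripped-down illustration and not MAST itself: it assumes no covariates beyond the group label, uses plain maximum likelihood instead of the empirical Bayes step, and relies on the asymptotic chi-square distribution with 2 degrees of freedom (whose tail probability is exactly exp(-x/2)). The function names are hypothetical, and the input values are assumed to be on a log-like scale with 0 meaning "not detected".

```python
import math


def bernoulli_loglik(successes, total):
    """Bernoulli log-likelihood evaluated at the observed success fraction."""
    if successes in (0, total):
        return 0.0
    p = successes / total
    return successes * math.log(p) + (total - successes) * math.log(1 - p)


def residual_ss(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Return (LR statistic, p-value) combining detection rate and expressed level."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part (Z): one shared detection rate versus one rate per group.
    null_loglik = bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    alt_loglik = (bernoulli_loglik(len(pos_a), len(group_a))
                  + bernoulli_loglik(len(pos_b), len(group_b)))
    lr_discrete = 2 * (alt_loglik - null_loglik)

    # Continuous part (Y | Z = 1): one shared mean versus one mean per group,
    # Gaussian errors with a common variance; the LR statistic is n * log(RSS0 / RSS1).
    lr_continuous = 0.0
    rss_alt = residual_ss(pos_a) + residual_ss(pos_b)
    if pos_a and pos_b and rss_alt > 0:
        rss_null = residual_ss(pos_a + pos_b)
        lr_continuous = len(pos_a + pos_b) * math.log(rss_null / rss_alt)

    statistic = max(lr_discrete + lr_continuous, 0.0)
    return statistic, math.exp(-statistic / 2)  # chi-square tail, 2 degrees of freedom
```

A gene can come out as differentially expressed because its detection rate changes, because its level among expressing cells changes, or both; the combined statistic captures all three situations.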

Let's stop for a minute

The key

$$\text{outcome}_i = (\text{model}_i) + \text{error}_i$$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram of height [cm] (x-axis, 165 to 180) against frequency (y-axis)]
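As a worked instance of the two-sample t statistic, the sketch below assumes that $s_p$ is the usual pooled standard deviation, $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$, which the slide does not spell out. The function name `pooled_t_statistic` is made up for this example.

```python
import math
import statistics


def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic with a pooled standard deviation (equal variances)."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    pooled_sd = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2))
    return (mean_1 - mean_2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
```

For example, `pooled_t_statistic([1, 2, 3], [4, 5, 6])` has group means 2 and 5 and a pooled standard deviation of 1, so t = -3 / sqrt(2/3).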

Generic recipe

model data, e.g. gene expression with random error

fit model to the data and/or data to the model; estimate model parameters

use model for prediction and/or inference

Important implication

the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = \mathrm{mu}$$

$$\sigma^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$

mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014; Islam et al. 2011).
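One way to see where the extra mu²/size term comes from is to simulate negative binomial counts as a Poisson count whose rate is itself Gamma distributed. The sketch below is illustrative only; the function names are made up, and the simple Poisson sampler is only suitable for the small rates used here.

```python
import math
import random
import statistics


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def sample_poisson(rate, rng):
    """Knuth's multiplication method for a Poisson draw."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mu, size, rng):
    """Poisson draw whose rate is Gamma distributed (shape=size, scale=mu/size)."""
    rate = rng.gammavariate(size, mu / size)
    return sample_poisson(rate, rng)


rng = random.Random(1)
draws = [sample_negative_binomial(4.0, 2.0, rng) for _ in range(20000)]
print(statistics.mean(draws), statistics.variance(draws), nb_variance(4.0, 2.0))
```

The Gamma-distributed rate contributes a variance of mu²/size on top of the Poisson variance mu, so a smaller size means more overdispersion, and a very large size approaches a plain Poisson.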

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = \mathrm{mu} \cdot (1 - d)$$

$$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \frac{\mathrm{mu}}{\mathrm{size}}\right)$$

d is the dropout rate and mu is the mean of the negative binomial component. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
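The zero-inflated moments can be checked numerically by summing over the probability mass function directly. The sketch below assumes that mu and size describe the negative binomial component before zero inflation and that each count is independently replaced by a zero with probability d; the function names are hypothetical.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_pmf(k, mu, size, d):
    """NB count that is replaced by a zero with probability d."""
    extra_zero = d if k == 0 else 0.0
    return extra_zero + (1 - d) * nb_pmf(k, mu, size)


def zinb_moments(mu, size, d, upper=2000):
    """Mean and variance obtained by summing the pmf over counts 0 .. upper-1."""
    probs = [zinb_pmf(k, mu, size, d) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return mean, second - mean ** 2


def zinb_variance(mu, size, d):
    """Closed-form variance of the zero-inflated negative binomial."""
    return mu * (1 - d) * (1 + d * mu + mu / size)
```

With mu = 4, size = 2 and d = 0.3 the summed moments agree with the closed forms (mean 2.8, variance 11.76), and setting d = 0 recovers the plain negative binomial variance mu + mu²/size.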

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = g \cdot \frac{a}{a + b}$$

$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
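The Poisson-Beta moments are easy to evaluate directly. In the sketch below, the first two functions follow the formulas on the slide. The third is an addition under a stated assumption: if the observed count is Poisson given a rate g·p with p drawn from Beta(a, b), the law of total variance adds the mean to the variance of the rate. The function names are made up for this example.

```python
def rate_mean(a, b, g):
    """Mean of the rate g * p when p follows Beta(a, b)."""
    return g * a / (a + b)


def rate_variance(a, b, g):
    """Variance of the rate g * p when p follows Beta(a, b)."""
    return g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)


def count_variance(a, b, g):
    """Variance of a Poisson count with the Beta-distributed rate above
    (assumption: Var(count) = E[rate] + Var(rate), by the law of total variance)."""
    return rate_mean(a, b, g) + rate_variance(a, b, g)
```

For a = 1, b = 3, g = 100 this gives a mean of 25 and a rate variance of 375; small a and b (slow switching) push the Beta towards 0 and 1, which is what produces bimodal counts.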

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$$\operatorname{logit}\big(\Pr(Z_{ig} = 1)\big) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
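The mixture idea can be illustrated with the quantity an EM algorithm recomputes in every E-step: the posterior probability that an observed count came from the low-rate Poisson (dropout) component rather than the NB (amplified) component. This is a sketch with fixed, made-up parameter values, not SCDE's fitting procedure, which estimates per-cell error models; all function and parameter names are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability of count k."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, dropout_weight, background_rate, mu, size):
    """Posterior probability that count k belongs to the dropout component."""
    dropout = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return dropout / (dropout + amplified)
```

With a mixing weight of 0.3, a background rate of 0.1 and an NB component with mu = 20 and size = 2, a count of 0 is attributed almost entirely to dropout while a count of 20 is attributed almost entirely to the amplified component.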

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios

Comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression levels and distributions of detected DEs look like?
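One simple way to build a null data set by re-sampling is to shuffle the group labels, which breaks any real association between expression and group; anything a DE method calls on such data is a false positive. The sketch below shows label shuffling for one gene and uses it for a permutation p-value on the difference in means. It is a generic illustration, not the procedure of any specific benchmark cited here, and the function names and the "A" group label are made up.

```python
import random


def mean_difference(values, labels, group="A"):
    """Mean of the cells labelled `group` minus the mean of all other cells."""
    in_group = [v for v, lab in zip(values, labels) if lab == group]
    out_group = [v for v, lab in zip(values, labels) if lab != group]
    return sum(in_group) / len(in_group) - sum(out_group) / len(out_group)


def shuffled_labels(labels, rng):
    """Return a permuted copy of the labels (group sizes are preserved)."""
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Share of label shufflings with a mean difference at least as extreme."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    hits = 0
    for _ in range(n_permutations):
        null_labels = shuffled_labels(labels, rng)
        if abs(mean_difference(values, null_labels)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

Running a DE method on many shuffled data sets and counting how often it reports significant genes gives an empirical check of its type I error control.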


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to DEs (based on NB and bimodal distributions)
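When the truly DE genes are known, as in a simulation, these metrics follow directly from the four cells of the confusion matrix. The sketch below is a minimal illustration; the function name and the gene identifiers are made up, and it assumes that each denominator is non-zero.

```python
def classification_metrics(true_de, called_de, all_genes):
    """Sensitivity (recall), specificity and precision of a set of DE calls."""
    true_de, called_de = set(true_de), set(called_de)
    tp = len(true_de & called_de)   # DE genes that were called
    fp = len(called_de - true_de)   # non-DE genes that were called (type I errors)
    fn = len(true_de - called_de)   # DE genes that were missed (type II errors)
    tn = len(set(all_genes)) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


genes = [f"g{i}" for i in range(10)]
print(classification_metrics({"g0", "g1", "g2", "g3"}, {"g0", "g1", "g2", "g4", "g5"}, genes))
```

In this toy case 3 of the 4 true DE genes are found (sensitivity 0.75), 2 of the 5 calls are wrong (precision 0.6), and 4 of the 6 non-DE genes are correctly left alone (specificity about 0.67).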

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells; 22229 genes with no expression at all

[Figure: two histograms; the first shows frequency against read counts (x-axis 0 to 1500), the second shows frequency against 0 counts (x-axis 0 to 80)]
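A first look at a count matrix often starts with exactly these two summaries: how many genes are never detected, and what fraction of genes each cell detects. The sketch below uses a tiny made-up matrix stored as a list of rows (genes) with one column per cell; the function names are hypothetical.

```python
def count_silent_genes(counts):
    """Number of genes (rows) with zero counts in every cell."""
    return sum(1 for gene_row in counts if all(c == 0 for c in gene_row))


def detection_rate_per_cell(counts):
    """For each cell (column), the fraction of genes with a non-zero count."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [sum(1 for gene_row in counts if gene_row[cell] > 0) / n_genes
            for cell in range(n_cells)]


# Toy matrix: 4 genes x 3 cells (values are invented for illustration).
toy_counts = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 5],
    [3, 0, 0],
]
print(count_silent_genes(toy_counts))       # genes with no expression at all
print(detection_rate_per_cell(toy_counts))  # one detection rate per cell
```

Genes that are silent in all cells carry no information for DE testing and are commonly filtered out, while the per-cell detection rate can be kept as a covariate.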

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 13: scRNA-seq - Differential expression analyses

Introduction

adapted from Wu et al 2017

Differential expression meanstaking read count data amp

performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups

ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)

Differential expression is an old problemknown from bulk RNA-seq and microarray studies

in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences

Olga (NBIS) scRNA-seq DE February 2019 7 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Differential expression is an old problem

So what is all the commotion about

httpswwwmenticom amp 25 06 78

scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)

Olga (NBIS) scRNA-seq DE February 2019 8 46

Introduction

Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1

Mybpc1 Nars Ndufa3 Nono Pgam2

Crispld2 Fbxw13 Hbxip Katna1 Lcorl

1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1

00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100

00

01

02

03

0

1

2

3

00

01

02

03

00

01

02

03

00

01

02

03

000

005

010

015

020

025

00

02

04

06

00

01

02

03

04

05

00

01

02

03

00

01

02

03

00

02

04

00

05

10

15

00

01

02

03

00

05

10

15

20

00

02

04

000

005

010

015

0

1

2

3

4

00

05

10

15

00

01

02

03

04

05

00

01

02

03

04

value

dens

ity

Based on tutorial data

Olga (NBIS) scRNA-seq DE February 2019 9 46

Common methods

Common methods

Olga (NBIS) scRNA-seq DE February 2019 10 46

Common methods

Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio

Olga (NBIS) scRNA-seq DE February 2019 11 46

Common methods

Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test

Olga (NBIS) scRNA-seq DE February 2019 12 46

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Miao and Zhang 2016

Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612

More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:

$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
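The two-part structure of a hurdle model can be made concrete with a small sketch. It is not MAST itself: it assumes a single two-group comparison with no other covariates, skips the empirical Bayes step, and uses the hypothetical function name `hurdle_lrt`. With only a group indicator, both maximum-likelihood fits have closed forms, so the likelihood ratio statistic can be computed with the standard library alone.

```python
import math


def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def _rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test for one gene.

    Inputs are log-scale expression values per cell, with 0 meaning
    "not detected". Returns (statistic, degrees of freedom, p-value).
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part: is the detection rate the same in both groups?
    ll_alt = (_binom_loglik(len(pos_a), len(group_a))
              + _binom_loglik(len(pos_b), len(group_b)))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b),
                            len(group_a) + len(group_b))
    stat = 2.0 * (ll_alt - ll_null)
    df = 1

    # Continuous part: Gaussian model on expressed cells only.
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_alt = _rss(pos_a) + _rss(pos_b)
        rss_null = _rss(pos_a + pos_b)
        if rss_alt > 0:
            stat += (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt)
            df += 1

    stat = max(stat, 0.0)
    # Chi-square survival function for 1 or 2 degrees of freedom.
    p = math.exp(-stat / 2.0) if df == 2 else math.erfc(math.sqrt(stat / 2.0))
    return stat, df, p


if __name__ == "__main__":
    a = [0, 0, 0, 0, 1.0, 1.2]
    b = [2.0, 2.2, 2.1, 1.9, 0, 2.3]
    print(hurdle_lrt(a, b))
```

The statistic is the sum of the discrete and continuous contributions, which is why a gene can be called DE because its detection rate changes, because its level among expressing cells changes, or both.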

Let's stop for a minute

The key

$$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram; x-axis: height [cm], y-axis: frequency]
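As a worked instance of the t statistic above, the following sketch computes the pooled two-sample version from raw values. The function name `pooled_t` and the height values are made up for illustration.

```python
import math


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    # Pooled standard deviation over both samples.
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1.0 / n1 + 1.0 / n2))


if __name__ == "__main__":
    # Hypothetical heights in cm for two small groups.
    print(pooled_t([170, 172, 174], [176, 178, 180]))
```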

Generic recipe

model data, e.g. gene expression with random error

fit model to the data and/or data to the model; estimate model parameters

use model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three negative binomial histograms; x-axis: read counts, y-axis: frequency]

$$\mu = \mathrm{mu}$$

$$\sigma^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$

mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011).
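The mean-variance relation of the negative binomial can be turned around to get a rough estimate of size from observed counts. The sketch below uses a simple method-of-moments estimate, which is an assumption made for illustration and not how the packages listed earlier estimate dispersion; the function names are hypothetical.

```python
def nb_variance(mu, size):
    """Variance implied by the NB parameterisation: mu + mu^2 / size."""
    return mu + mu ** 2 / size


def estimate_size(counts):
    """Method-of-moments estimate of size from the counts of one gene."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        # No overdispersion beyond Poisson: size tends to infinity.
        return float("inf")
    return mean ** 2 / (var - mean)


if __name__ == "__main__":
    print(nb_variance(10, 2))
    print(estimate_size([0, 0, 2, 10, 8]))
```

A small size means strong overdispersion; as size grows the variance approaches the mean and the distribution approaches a Poisson.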

Common distributions

Zero-inflated NB

[Figure: two zero-inflated negative binomial histograms; x-axis: read counts, y-axis: frequency]

$$\mu = \mathrm{mu} \cdot (1 - d)$$

$$\sigma^2 = \mu \cdot (1 - d) \cdot \left(1 + d \cdot \mu + \frac{\mu}{\mathrm{size}}\right)$$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.

Common distributions

Poisson-Beta

[Figure: two Poisson-Beta histograms; x-axis: read counts, y-axis: frequency]

$$\mu = g \cdot \frac{a}{a + b}$$

$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$

a is the rate of activation of transcription, b the rate of inhibition of transcription, and g the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
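A short simulation may help to see what the Poisson-Beta parameters do. The sketch assumes the usual generative reading of the model: a Beta(a, b) draw gives the fraction of time the gene is active, and the count is Poisson with rate g times that fraction. The function names and parameter values are hypothetical. Note that the variance expression on the slide describes the Beta-distributed rate; the Poisson sampling step adds a further term equal to the mean, so simulated counts are somewhat more variable.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def poisson_beta_draws(a, b, g, n, seed=0):
    """Simulate n counts: rate = g * Beta(a, b), count ~ Poisson(rate)."""
    rng = random.Random(seed)
    return [poisson_draw(rng, g * rng.betavariate(a, b)) for _ in range(n)]


def poisson_beta_mean(a, b, g):
    """Mean from the slide: g * a / (a + b)."""
    return g * a / (a + b)


if __name__ == "__main__":
    draws = poisson_beta_draws(a=1.0, b=1.0, g=10.0, n=20000)
    print(sum(draws) / len(draws), poisson_beta_mean(1.0, 1.0, 10.0))
```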

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:

$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
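To illustrate the mixture idea, the sketch below computes the posterior probability that a single observed count came from the low-rate Poisson (dropout) component and not from the NB (amplified) component. This is the kind of responsibility an EM fit of a two-component mixture would compute; it is not the scde implementation, and the mixing weight and distribution parameters are hypothetical values chosen for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, weight, lam, mu, size):
    """P(dropout component | count k) under a Poisson/NB mixture.

    weight is the prior probability of the dropout component.
    """
    drop = weight * poisson_pmf(k, lam)
    amplified = (1.0 - weight) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


if __name__ == "__main__":
    for count in (0, 1, 5, 40):
        print(count, dropout_posterior(count, 0.3, 0.1, 50.0, 2.0))
```

A zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed to the NB component, which is the behaviour the two bullet points on the components describe.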

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level $Y$ is then modeled using a GAM

$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling

modeling data sets based on various distributions

Comparing between methods and scenarios

comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression and distributions of detected DEs look like?
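The "null data sets by re-sampling" idea can be sketched as follows: cells from one homogeneous population are split at random into two mock groups, so any gene a method calls DE between them is a false positive by construction. The function name `make_null_split` and the group size are hypothetical.

```python
import random


def make_null_split(cell_ids, group_size, seed=0):
    """Randomly draw two disjoint mock groups from one population of cells."""
    rng = random.Random(seed)
    chosen = rng.sample(list(cell_ids), 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


if __name__ == "__main__":
    group_a, group_b = make_null_split(range(96), group_size=12)
    print(sorted(group_a))
    print(sorted(group_b))
```

Repeating the split with different seeds and group sizes gives a rough picture of how the number of false detections depends on sample size.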

False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
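When the truth is known, as in a simulation with 2 000 DE genes out of 10 000, the metrics named above follow directly from the four confusion counts. The counts in the sketch are invented for illustration and are not results from any of the cited studies.

```python
def de_metrics(tp, fp, fn, tn):
    """Standard evaluation metrics from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),   # same quantity as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical outcome: 2 000 truly DE genes, 8 000 truly non-DE genes.
    for name, value in de_metrics(tp=1500, fp=200, fn=500, tn=7800).items():
        print(name, round(value, 3))
```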

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11-21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: two histograms; x-axis: read counts, y-axis: frequency, and x-axis: number of 0 counts, y-axis: frequency]
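The summaries on this slide are easy to compute for any count matrix. The sketch assumes the matrix is a plain list of per-gene rows with one count per cell, which is a simplification chosen for illustration; the function name and the toy matrix are hypothetical. It also reports the fraction of detected genes per cell.

```python
def summarise_zeros(counts):
    """Summarise zeros in a genes x cells count matrix (list of rows).

    Returns (number of genes with no expression at all,
             number of zero counts per gene,
             fraction of detected genes per cell).
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    silent_genes = sum(1 for z in zeros_per_gene if z == n_cells)
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]
    return silent_genes, zeros_per_gene, detection_rate


if __name__ == "__main__":
    toy = [
        [0, 0, 0],   # never detected
        [5, 0, 2],
        [1, 1, 1],
        [0, 0, 7],
    ]
    print(summarise_zeros(toy))
```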

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq". Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data". Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis". Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods". Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data". No. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics". Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics". Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database". bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
  • Performance
  • Practicalities
  • Summary
  • Bibliography
Page 14: scRNA-seq - Differential expression analyses

Introduction

Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics

high noise levels (technical and biological factors)

low library sizes

low amount of available mRNAs results in amplification biases and dropout events

3' bias, partial coverage and uneven depth (technical)

stochastic nature of transcription (biological)

multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)

[Figure: per-gene density plots (x-axis: value, y-axis: density) for Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104 and Ccnt1. Based on tutorial data]

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods

e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test

non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group

some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data

if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
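The role of ties can be seen in a small sketch of the rank-sum test. It assigns mid-ranks to tied values and uses a normal approximation with the usual tie correction and no continuity correction; these are choices made for illustration, and the function names are hypothetical. When many cells share the value zero, the tie correction shrinks the variance of the statistic, and in the extreme case of all values tied the test carries no information.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Return (U statistic for group a, two-sided p-value)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = list(a) + list(b)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction: each group of t tied values contributes t^3 - t.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # every value tied: nothing to test

    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    print(midranks([0, 0, 3, 5]))
    print(rank_sum_test([0, 0, 0, 1], [0, 2, 3, 4]))
```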

developed for bulk RNA-seq

e.g. edgeR, DESeq2

compare estimates of mean expression (sample size)

based on the negative binomial distribution

can be assessed using datasets where RNA-seq data has been validated by RT-qPCR

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Common methods

Miao and Zhang 2016

Olga (NBIS) scRNA-seq DE February 2019 15 46

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 16 46

Common methods More detailed examples

More detailed examples

Olga (NBIS) scRNA-seq DE February 2019 17 46

Common methods More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Common methods More detailed examples

Letrsquos stop for a minute

Olga (NBIS) scRNA-seq DE February 2019 19 46

Common methods More detailed examples

The key

Outcomei = (Modeli) + errori

we collect data on a sample from a much larger population

statistics lets us to make inferences about the population from which sample wasderived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common methods More detailed examples

Common distributions

Negative Binomial

Read Counts

Fre

quen

cy

0 5 10 15 20

010

020

030

040

050

0

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10

050

100

150

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10 12

050

100

150

200

micro = mu

δ2 = mu + mu2

size

mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)

Olga (NBIS) scRNA-seq DE February 2019 24 46

Common methods More detailed examples

Common distributions

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 5 10 15 20

020

040

0

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 10 20 30 40

010

020

0micro = mu lowast (1 minus d)

δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)

d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE

Olga (NBIS) scRNA-seq DE February 2019 25 46

Common methods More detailed examples

Common distributions

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100

020

040

0

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100 140

050

150

micro = g lowast a(a + b)

δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)

a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm

Olga (NBIS) scRNA-seq DE February 2019 26 46

Common methods More detailed examples

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 27 46

Common methods More detailed examples

SCDE

models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)

subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels

For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach

Olga (NBIS) scRNA-seq DE February 2019 28 46

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level $Y$ is then modeled using a GAM

$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data
  • using data we know something about to get positive controls

Simulated data
  • null data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios
  • comparing numbers of DEs, incl. as a function of group size

Investigating results
  • what do the expression levels and distributions of detected DEs look like?


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
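The four quantities named on this slide follow directly from the confusion matrix of a DE call against a known truth. The sketch below uses the simulation size mentioned above (10 000 genes, 2 000 truly DE), but the split into true and false calls is entirely hypothetical and is not a result from Dal Molin et al.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summarise DE calls against a known truth.

    tp/fp: genes called DE that are truly DE / not DE
    tn/fn: genes not called DE that are truly not DE / DE
    """
    return {
        "sensitivity": tp / (tp + fn),          # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }

# Hypothetical outcome: 2 000 true DE genes among 10 000 simulated genes
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Sensitivity and recall are the same quantity, so a method is usually summarised by either the sensitivity/specificity pair or the precision/recall pair.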

Consistency

Miao et al 2017

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

  • 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
  • based on 9 data sets with 11-21 separate instances (sample size effect)
  • extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
  • conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells, 22229 genes with no expression at all

[Figure: two histograms with frequency on the y-axis; x-axes: read counts and 0 counts]
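A first look at a count matrix along these lines can be sketched as follows. The toy matrix and the function names are hypothetical; the slide's example data (46078 genes x 96 cells) is not reproduced here.

```python
# Toy count matrix: one row per gene, one column per cell (hypothetical values)
counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [5, 0, 12, 0],
    "geneC": [1, 3, 2, 7],
    "geneD": [0, 0, 0, 9],
}

def zeros_per_gene(count_matrix):
    """Number of cells in which each gene has a zero count."""
    return {gene: sum(1 for c in row if c == 0) for gene, row in count_matrix.items()}

def unexpressed_genes(count_matrix):
    """Genes with no expression at all, i.e. zero counts in every cell."""
    return [gene for gene, row in count_matrix.items() if all(c == 0 for c in row)]

zero_counts = zeros_per_gene(counts)
silent = unexpressed_genes(counts)
print(zero_counts)
print(silent)
```

The per-gene zero counts are what a histogram of 0 counts would summarise, and the fully silent genes are natural candidates for removal before any DE test.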

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

  • QC filtering
  • Cell-cycle phase
  • Normalization of cell-specific biases
  • Confounding factors, incl. batch effects
  • Detection rate, i.e. the fraction of detected genes per cell
  • Imputation strategies for dropout values
  • What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." No. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573


Introduction

Differential expression is an old problem

So what is all the commotion about?

https://www.menti.com & 25 06 78

scRNA-seq special characteristics
  • high noise levels (technical and biological factors)
  • low library sizes
  • low amount of available mRNAs results in amplification biases and dropout events
  • 3' bias, partial coverage and uneven depth (technical)
  • stochastic nature of transcription (biological)
  • multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)

[Figure: density plots (x-axis: value, y-axis: density), one panel per gene: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1]

Based on tutorial data

Common methods

Simplified scRNA-seq workflow (adapted from http://hemberg-lab.github.io)

Generic non-parametric methods
  • e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
  • non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
  • some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
  • if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
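The rank conversion and the effect of ties can be made concrete with a small sketch of the Wilcoxon rank-sum test. This is an illustrative implementation, not the one used by any package in this lecture: it uses midranks for ties and a tie-corrected normal approximation without continuity correction, and the example expression values are hypothetical.

```python
import math

def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def rank_sum_test(group_a, group_b):
    """Return (U statistic of group_a, two-sided p-value from a normal approximation)."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    # Ties shrink the variance of U; with many zeros the test loses resolution
    tie_sizes = {}
    for value in pooled:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u, 1.0  # every value is tied: the ranks carry no information
    z = (u - n_a * n_b / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical expression values with many dropouts (zeros) in the first group
u_stat, p_value = rank_sum_test([0, 0, 0, 0, 2, 5], [0, 3, 4, 6, 7, 9])
print(u_stat, p_value)
```

When every observation is a zero the variance term vanishes and no difference can be detected, which is the extreme case of the tie problem described above.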

Developed for bulk RNA-seq
  • e.g. edgeR, DESeq2
  • compare estimates of mean expression (sample size)
  • based on the negative binomial distribution
  • can be assessed with datasets where RNA-seq data has been validated by RT-qPCR

Developed for scRNA-seq
  • e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
  • large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
  • we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2 of Soneson and Robinson 2018 (Nature Methods, doi:10.1038/nmeth.4612): evaluated differential expression methods, together with package versions and the type of input values provided to each of them

More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1)

$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
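The two-part structure of the hurdle model can be sketched for the simplest design, a single gene compared between two groups of cells with no further covariates. This is a simplified illustration and not the MAST package's implementation: it uses plain maximum likelihood without the empirical Bayes step, the function names and expression values are hypothetical, and it assumes the discrete and continuous likelihood ratio statistics are summed and compared against a chi-square distribution with two degrees of freedom.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detections among n cells."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def residual_ss(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Return (discrete statistic, continuous statistic, combined p-value)."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part (Z): does the fraction of expressing cells differ between groups?
    full = bernoulli_loglik(len(pos_a), len(group_a)) + bernoulli_loglik(len(pos_b), len(group_b))
    null = bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    lam_discrete = max(0.0, 2.0 * (full - null))

    # Continuous part (Y | Z = 1): does the mean of the non-zero values differ?
    lam_continuous, df = 0.0, 1
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_full = residual_ss(pos_a) + residual_ss(pos_b)
        rss_null = residual_ss(pos_a + pos_b)
        if rss_full > 0:
            n_pos = len(pos_a) + len(pos_b)
            lam_continuous = max(0.0, n_pos * math.log(rss_null / rss_full))
            df = 2

    total = lam_discrete + lam_continuous
    # Chi-square survival function has a closed form for 1 and 2 degrees of freedom
    p_value = math.exp(-total / 2.0) if df == 2 else math.erfc(math.sqrt(total / 2.0))
    return lam_discrete, lam_continuous, p_value

# Hypothetical log-scale expression values for one gene in two groups of cells
group_a = [0, 0, 0, 0, 1.1, 0.9, 1.3, 0]
group_b = [2.1, 2.4, 0, 1.9, 2.2, 2.6, 2.0, 0]
print(hurdle_lrt(group_a, group_b))
```

In this toy example both the detection rate and the level of expression among expressing cells differ between the groups, so both parts contribute to the combined statistic.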

Let's stop for a minute

The key

$$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
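The t statistic above can be computed directly, as in the following sketch. The slide does not define $s_p$; the code assumes the standard pooled standard deviation of the equal-variance two-sample t-test, and the height values are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic with a pooled standard deviation (equal variances assumed)."""
    n1, n2 = len(sample_1), len(sample_2)
    pooled_variance = ((n1 - 1) * variance(sample_1)
                       + (n2 - 1) * variance(sample_2)) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_variance)
    return (mean(sample_1) - mean(sample_2)) / (s_p * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights in cm for two groups
heights_a = [168.0, 171.5, 173.0, 175.5, 170.0]
heights_b = [172.0, 176.5, 178.0, 174.5, 179.0]
t = pooled_t_statistic(heights_a, heights_b)
print(t)
```

The statistic is the difference in group means scaled by its estimated standard error, which is the same "model plus error" logic that the DE methods in this lecture apply gene by gene.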

Generic recipe
  • model, e.g. gene expression with random error
  • fit model to the data and/or data to the model, estimate model parameters
  • use model for prediction and/or inference

Important implication
  • the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = mu$$

$$\sigma^2 = mu + \frac{mu^2}{size}$$

The NB distribution is parameterized by the mean expression (mu) and the dispersion (size), which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014, Islam et al 2011)
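The mean and variance of the NB distribution can be checked by simulation. The sketch below draws NB counts as a gamma-Poisson mixture using only the Python standard library; the parameter values (mu = 5, size = 2), the seed and the helper names are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, pvariance

def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_negative_binomial(mu, size, rng):
    """NB draw as a mixture: rate ~ Gamma(shape=size, scale=mu/size), count ~ Poisson(rate)."""
    rate = rng.gammavariate(size, mu / size)
    return sample_poisson(rate, rng)

rng = random.Random(1)
mu, size = 5.0, 2.0
draws = [sample_negative_binomial(mu, size, rng) for _ in range(50000)]

empirical_mean = mean(draws)
empirical_variance = pvariance(draws)
expected_variance = mu + mu ** 2 / size  # formula from the slide

print(empirical_mean, empirical_variance, expected_variance)
```

A smaller size gives a larger variance for the same mean, which is the sense in which the dispersion parameter is inversely related to the variance.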

Common distributions

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = mu \cdot (1 - d)$$

$$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \frac{mu}{size}\right)$$

where mu and size are the mean and dispersion of the NB component and d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE
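The zero-inflated mean and variance can be derived in a few lines. The sketch assumes (this is not stated on the slide) that a zero-inflated count is $X = B \cdot N$, where $B \sim \text{Bernoulli}(1 - d)$ marks the absence of a dropout and $N \sim \text{NB}(mu, size)$ is independent of $B$.

```latex
\begin{align*}
E[X]   &= E[B]\,E[N] = (1 - d)\, mu \\
E[X^2] &= E[B]\,E[N^2] = (1 - d)\left(mu + \frac{mu^2}{size} + mu^2\right) \\
\operatorname{Var}(X)
       &= E[X^2] - E[X]^2 \\
       &= (1 - d)\left(mu + \frac{mu^2}{size} + mu^2\right) - (1 - d)^2\, mu^2 \\
       &= mu\,(1 - d)\left(1 + \frac{mu}{size} + mu - (1 - d)\, mu\right) \\
       &= mu\,(1 - d)\left(1 + d \cdot mu + \frac{mu}{size}\right)
\end{align*}
```

Setting $d = 0$ recovers the NB variance $mu + mu^2 / size$, so the zero-inflated form is consistent with the plain NB case.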

Common distributions

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = g \cdot \frac{a}{a + b}$$

$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm




Practicalities

Practicalities

Olga (NBIS) scRNA-seq DE February 2019 37 46

Practicalities

Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all

Read Counts

Fre

quen

cy

0 500 1000 1500

050

0015

000

0 counts

Fre

quen

cy

0 20 40 60 800

2000

4000

6000

Olga (NBIS) scRNA-seq DE February 2019 38 46

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 39 46

Practicalities

Rembering the bigger picture

Stegle Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl batcheffects

Detection rate ie the fraction ofdetected genes per cell

Imputations strategies for dropoutvalues

What is pragmatic programminglanguage platform speedcollaborative workflows etc

Olga (NBIS) scRNA-seq DE February 2019 40 46

Practicalities

Staying critical

Olga (NBIS) scRNA-seq DE February 2019 41 46

Summary

What to remember from this hour

httpswwwmenticom amp 25 06 78

Olga (NBIS) scRNA-seq DE February 2019 42 46

Summary

Growing field

Angerer et al 2017

Olga (NBIS) scRNA-seq DE February 2019 43 46

Summary

Growing field

httpswwwscrna-toolsorgtools

Zappia Phipson and Oshlack 2018

Olga (NBIS) scRNA-seq DE February 2019 44 46

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 20954697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 15487105. doi:10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573


Introduction

[Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.]

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods

e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test

non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group

some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data

if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
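The rank conversion and the effect of ties can be sketched as follows. This is a minimal illustration, assuming a two-sided Wilcoxon rank-sum test with a normal approximation and the standard tie correction; the function names and the toy expression values are made up here.

```python
import math

def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    w = sum(ranks[:n1])  # rank sum of the first group
    mean_w = n1 * (n + 1) / 2
    # tie correction: sum of t^3 - t over every group of t tied values
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return w, 1.0  # every value is tied: no evidence of a difference
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p

# hypothetical expression values dominated by dropouts (zeros)
group_a = [0, 0, 0, 0, 0, 0, 0, 0, 3, 5]
group_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
print(rank_sum_test(group_a, group_b))
```

With this many zeros, almost all cells share one midrank, so the variance of the rank sum shrinks and only the few non-zero cells drive the result.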

Developed for bulk RNA-seq

e.g. edgeR, DESeq2

compare estimates of mean expression (sample size)

based on the negative binomial distribution

can be assessed using data sets where RNA-seq data has been validated by RT-qPCR

Developed for scRNA-seq

e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.

single-cell experiments have a large number of samples (i.e. cells) for each group we are comparing; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups

we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)

More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
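The two-part idea can be illustrated with a stripped-down sketch for a single gene and two groups of cells. This is not MAST itself: there is no empirical Bayes step and no covariates, the design matrix is reduced to a group indicator, and the function name `hurdle_lrt` and the toy values are assumptions made for this example. The discrete part compares detection rates, the continuous part compares Gaussian means of the detected values, and the two likelihood ratio statistics are added.

```python
import math

def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections out of n cells."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def _rss(values, mean):
    """Residual sum of squares around a given mean."""
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group1, group2):
    """Two-group hurdle LRT on log-expression values (0 means not detected)."""
    pos1 = [v for v in group1 if v > 0]
    pos2 = [v for v in group2 if v > 0]
    # discrete part: one detection rate per group vs. one shared rate
    ll_alt = (_binom_loglik(len(pos1), len(group1))
              + _binom_loglik(len(pos2), len(group2)))
    ll_null = _binom_loglik(len(pos1) + len(pos2), len(group1) + len(group2))
    stat, df = 2 * (ll_alt - ll_null), 1
    # continuous part: one Gaussian mean per group vs. one shared mean
    # (common variance); only used when both groups have >= 2 detected cells
    if len(pos1) >= 2 and len(pos2) >= 2:
        both = pos1 + pos2
        rss_null = _rss(both, sum(both) / len(both))
        rss_alt = (_rss(pos1, sum(pos1) / len(pos1))
                   + _rss(pos2, sum(pos2) / len(pos2)))
        if rss_alt > 0:
            stat += len(both) * math.log(rss_null / rss_alt)
            df += 1
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # chi-square survival function, closed form for 1 and 2 degrees of freedom
    p = math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))
    return stat, df, p

# hypothetical log-expression values for one gene in two cell groups
group1 = [0, 0, 0, 0, 0, 0, 1.0, 1.2]
group2 = [3.0, 3.2, 2.9, 3.1, 0, 3.3, 3.0, 2.8]
print(hurdle_lrt(group1, group2))
```

A gene can reach significance through either part: a change in how often it is detected, a change in its level when detected, or both.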

Let's stop for a minute

The key

$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

[Figure: histogram of height [cm] (x-axis) vs. frequency (y-axis)]
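As a worked example of the statistic above, here is a small sketch of the pooled two-sample t statistic, where s_p is the pooled standard deviation. The function name `pooled_t` and the height values are invented for illustration.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation, plus its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    # pooled standard deviation from the two sample variances
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2))
                   / (n1 + n2 - 2))
    t = (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# hypothetical heights [cm] in two groups
heights_a = [168.0, 171.5, 173.0, 175.5, 170.0]
heights_b = [174.0, 177.5, 172.0, 179.0, 176.5]
print(pooled_t(heights_a, heights_b))
```

The model behind the statistic is simply "group mean plus normally distributed error", which is the Outcome = Model + error idea in its smallest form.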

Generic recipe

model the data, e.g. gene expression with random error

fit the model to the data and/or the data to the model; estimate model parameters

use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$\mu = mu$

$\sigma^2 = mu + \dfrac{mu^2}{size}$

mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).

Common distributions

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$\mu = mu \cdot (1 - d)$

$\sigma^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + mu / size)$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
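The mean and variance formulas for the NB and the zero-inflated NB can be checked numerically by summing over the probability mass function. The sketch below takes the symbols on the right-hand side of the variance formula to be the NB mean mu, and uses arbitrary parameter values (mu = 10, size = 2, d = 0.3) chosen only for illustration.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean mu and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: an extra point mass d at zero on top of the NB."""
    return d * (k == 0) + (1 - d) * nb_pmf(k, mu, size)

def moments(pmf, upper=2000):
    """Mean and variance by direct summation over counts 0 .. upper-1."""
    m = sum(k * pmf(k) for k in range(upper))
    v = sum((k - m) ** 2 * pmf(k) for k in range(upper))
    return m, v

mu, size, d = 10.0, 2.0, 0.3
nb_mean, nb_var = moments(lambda k: nb_pmf(k, mu, size))
zi_mean, zi_var = moments(lambda k: zinb_pmf(k, mu, size, d))

# closed-form values from the formulas on the slides
print(nb_mean, mu)
print(nb_var, mu + mu ** 2 / size)
print(zi_mean, mu * (1 - d))
print(zi_var, mu * (1 - d) * (1 + d * mu + mu / size))
```

Setting d = 0 reduces the zero-inflated formulas to the plain NB ones, which is a quick sanity check on both.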

Common distributions

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$\mu = g \cdot \dfrac{a}{a + b}$

$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
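The mixture idea can be made concrete with a small sketch: given fixed, made-up mixture parameters, Bayes' rule gives the posterior probability that an observed count came from the dropout (Poisson) component rather than the amplified (NB) component. This corresponds to a single E-step of an EM fit; it is not the SCDE implementation, and the fixed mixing weight `pi_drop` and all parameter values are assumptions made for brevity.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean mu and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def dropout_posterior(k, pi_drop, lam, mu, size):
    """Posterior probability that count k comes from the dropout (Poisson) component."""
    drop = pi_drop * poisson_pmf(k, lam)       # background-level signal
    amp = (1 - pi_drop) * nb_pmf(k, mu, size)  # amplified and detected
    return drop / (drop + amp)

# hypothetical parameters: 40% dropouts, background rate 0.1, NB mean 50, size 2
for count in (0, 1, 5, 30):
    print(count, dropout_posterior(count, 0.4, 0.1, 50.0, 2.0))
```

A zero count is attributed almost entirely to the dropout component under these parameters, while a count of 30 is attributed almost entirely to the NB component.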

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about, to get positive controls

Simulated data

null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios

comparing numbers of DE genes, incl. as a function of group size

Investigating results

What do the expression levels and distributions of the detected DE genes look like?


False positives (type I error) vs. false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure: adapted from Wikipedia]

[Figure: Dal Molin, Baruzzo and Di Camillo 2017. 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)]
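When the truth is known, as in simulated data, these quantities follow directly from the confusion table. A minimal sketch, with made-up gene identifiers and a hypothetical helper `confusion_metrics`; it assumes that both gene lists are subsets of `all_genes` and that none of the denominators is zero.

```python
def confusion_metrics(called_de, truly_de, all_genes):
    """Sensitivity, specificity and precision of a set of DE calls against known truth."""
    called, truth = set(called_de), set(truly_de)
    tp = len(called & truth)   # true positives
    fp = len(called - truth)   # false positives (type I errors)
    fn = len(truth - called)   # false negatives (type II errors)
    tn = len(set(all_genes)) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# hypothetical simulation: 10 genes, 4 of them truly DE
all_genes = ["g%d" % i for i in range(1, 11)]
truly_de = ["g1", "g2", "g3", "g4"]
called_de = ["g1", "g2", "g3", "g5"]
metrics = confusion_metrics(called_de, truly_de, all_genes)
print(metrics)
```

Because most genes are typically not DE, specificity alone can look high even for a method with many false positives; precision is often the more informative number.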

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis, used to compare the expression levels in two groups of cells

based on 9 data sets with 11-21 separate instances each (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets



Common methods

Common methods

Olga (NBIS) scRNA-seq DE February 2019 10 46

Common methods

Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio

Olga (NBIS) scRNA-seq DE February 2019 11 46

Common methods

Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test

Olga (NBIS) scRNA-seq DE February 2019 12 46

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Common methods

Miao and Zhang 2016

Olga (NBIS) scRNA-seq DE February 2019 15 46

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 16 46

Common methods More detailed examples

More detailed examples

Olga (NBIS) scRNA-seq DE February 2019 17 46

Common methods More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Common methods More detailed examples

Letrsquos stop for a minute

Olga (NBIS) scRNA-seq DE February 2019 19 46

Common methods More detailed examples

The key

Outcomei = (Modeli) + errori

we collect data on a sample from a much larger population

statistics lets us to make inferences about the population from which sample wasderived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common distributions

[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$\mu = mu$

$\sigma^2 = mu + \frac{mu^2}{size}$

mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. The NB distribution fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014; Islam et al 2011)
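The mean and variance formulas for the negative binomial can be checked numerically by summing the probability mass function. A minimal sketch, assuming the mu/size parameterisation with success probability size / (size + mu); the helper names and parameter values are illustrative:

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf parameterised by mean mu and dispersion parameter size."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def moments(pmf, k_max=2000):
    """Mean and variance of a count distribution, summing its pmf over 0..k_max."""
    probs = [pmf(k) for k in range(k_max + 1)]
    m = sum(k * pk for k, pk in enumerate(probs))
    v = sum((k - m) ** 2 * pk for k, pk in enumerate(probs))
    return m, v

mu, size = 5.0, 2.0  # illustrative values
m, v = moments(lambda k: nb_pmf(k, mu, size))
expected_variance = mu + mu ** 2 / size
```

A smaller size gives a larger variance for the same mean, which is the sense in which size is inversely related to the variance.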

Common distributions

[Figure: two histograms of zero-inflated NB read counts (x-axis: read counts, y-axis: frequency)]

$\mu = mu \cdot (1 - d)$

$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \frac{mu}{size}\right)$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE
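The effect of dropouts on the mean and variance can be checked by summing the zero-inflated pmf directly. A minimal sketch, assuming that a count is a dropout (zero) with probability d and otherwise follows NB(mu, size), and reading mu in the variance formula as the mean of the NB component; helper names and parameter values are illustrative:

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf parameterised by mean mu and dispersion parameter size."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout (0); otherwise it is NB(mu, size)."""
    return d * (k == 0) + (1 - d) * nb_pmf(k, mu, size)

def moments(pmf, k_max=2000):
    """Mean and variance of a count distribution, summing its pmf over 0..k_max."""
    probs = [pmf(k) for k in range(k_max + 1)]
    m = sum(k * pk for k, pk in enumerate(probs))
    v = sum((k - m) ** 2 * pk for k, pk in enumerate(probs))
    return m, v

mu, size, d = 5.0, 2.0, 0.3  # illustrative values
m, v = moments(lambda k: zinb_pmf(k, mu, size, d))
expected_mean = mu * (1 - d)
expected_variance = mu * (1 - d) * (1 + d * mu + mu / size)
```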

Common distributions

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$\mu = g \cdot \frac{a}{a + b}$

$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm
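A simulation makes the Poisson-Beta model concrete. The sketch below assumes the usual hierarchical reading: a Beta(a, b) draw gives the fraction of time the locus is active, and the count is Poisson with rate g times that fraction. Under this reading the variance expression on the slide is the variance of the rate, and by the law of total variance the observed counts carry an additional Poisson term equal to the mean. Function names, the seed and the parameter values are illustrative.

```python
import math
import random
from statistics import mean, variance

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count (Knuth's multiplication method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_poisson_beta(a, b, g, rng):
    """Beta-distributed activity fraction, then a Poisson count at rate g * fraction."""
    fraction_active = rng.betavariate(a, b)
    return sample_poisson(g * fraction_active, rng)

a, b, g = 2.0, 3.0, 20.0  # illustrative values
rng = random.Random(1)
counts = [sample_poisson_beta(a, b, g, rng) for _ in range(20000)]

expected_mean = g * a / (a + b)
rate_variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
expected_count_variance = expected_mean + rate_variance  # rate variance plus Poisson noise
sample_mean, sample_variance = mean(counts), variance(counts)
```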

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, where Z_ig indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
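The two-part structure of a hurdle model can be illustrated for the simplest design, a single gene compared between two groups of cells. This is a simplified sketch and not the MAST implementation: it uses plain maximum likelihood without the empirical Bayes step and has no additional covariates. The function names and the data are made up. It assumes each group has at least two detected cells whose values are not all identical.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    p = k / n
    return xlogy(k, p) + xlogy(n - k, 1 - p)

def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-group hurdle LRT on log-expression values, where 0 means not detected."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    # Discrete part: does the detection rate Pr(Z = 1) differ between groups?
    ll_full = binom_loglik(len(pos_a), len(group_a)) + binom_loglik(len(pos_b), len(group_b))
    ll_null = binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    lr_discrete = 2 * (ll_full - ll_null)
    # Continuous part: does the mean of Y given Z = 1 differ between groups?
    n_pos = len(pos_a) + len(pos_b)
    lr_continuous = n_pos * math.log(rss(pos_a + pos_b) / (rss(pos_a) + rss(pos_b)))
    # The two statistics add up; a chi-square with 2 df has survival function exp(-x / 2).
    lr = lr_discrete + lr_continuous
    return lr, math.exp(-lr / 2)

# Made-up log-expression values for one gene in two groups of cells.
group_a = [0, 0, 0, 2.1, 1.9, 2.0, 0, 2.2]
group_b = [3.0, 3.2, 0, 2.9, 3.1, 3.3, 3.0, 2.8]
lr, p_value = hurdle_lrt(group_a, group_b)
```

A gene can be called DE because it is detected in a different fraction of cells, because its level differs among the cells where it is detected, or both.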

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null-data sets by re-sampling

modeling data sets based on various distributions

Comparing between methods and scenarios

Comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression and distributions of detected DEs look like?


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure: adapted from Wikipedia]

[Figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)]
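These quantities all follow from the four cells of a confusion table. A minimal sketch; the counts below are hypothetical, chosen only so that they add up to 10 000 genes with 2 000 true DEs, and are not results from any of the cited studies:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance measures from true/false positives and negatives."""
    return {
        "sensitivity": tp / (tp + fn),        # same as recall (true positive rate)
        "specificity": tn / (tn + fp),        # true negative rate
        "precision": tp / (tp + fp),          # fraction of called DEs that are real
        "recall": tp / (tp + fn),
        "type_I_error_rate": fp / (fp + tn),  # false positive rate = 1 - specificity
        "type_II_error_rate": fn / (fn + tp), # false negative rate = 1 - sensitivity
    }

# Hypothetical outcome of one DE method on 10 000 genes, 2 000 of them truly DE.
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
```

With few true DEs among many genes, a method can have high specificity and still poor precision, which is why both pairs of measures are reported.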

Consistency

Miao et al 2017

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities


Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: histogram of read counts and histogram of the number of zero counts (y-axis: frequency)]
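A first look at the zeros in a count matrix takes only a few lines. A minimal sketch, assuming the counts are held as a list of rows with one row per gene and one column per cell; the toy matrix and the function name `zero_summary` are made up for illustration:

```python
def zero_summary(counts):
    """Summarise zeros in a genes x cells count matrix given as a list of rows."""
    n_cells = len(counts[0])
    # Genes with no expression at all: every cell has a zero count.
    genes_all_zero = sum(1 for row in counts if all(c == 0 for c in row))
    # Number of zero counts per gene and per cell.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    zeros_per_cell = [sum(1 for row in counts if row[j] == 0) for j in range(n_cells)]
    return genes_all_zero, zeros_per_gene, zeros_per_cell

# Toy matrix: 4 genes x 3 cells (made-up values).
toy_counts = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 5],
    [0, 0, 0],
]
genes_all_zero, zeros_per_gene, zeros_per_cell = zero_summary(toy_counts)
```

Genes that are zero in every cell carry no information for DE testing and are usually filtered out before any method is run.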

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al 2017

Growing field

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

Page 18: scRNA-seq - Differential expression analyses

Common methods

Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]

Generic non-parametric methods

e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test

non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group

some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data

if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
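The conversion to ranks, and what dropouts do to it, can be seen in a few lines. A minimal sketch of the rank-sum (Mann-Whitney U) statistic using midranks for ties; the function names and the expression values are made up for illustration:

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based ranks start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def rank_sum_u(group_a, group_b):
    """Mann-Whitney U statistic for group_a, computed from its rank sum."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    return sum(ranks[:n_a]) - n_a * (n_a + 1) / 2

# Made-up expression values with many dropouts (zeros).
group_a = [0, 0, 0, 5, 7]
group_b = [0, 0, 3, 8, 9]
ranks = midranks(group_a + group_b)
u = rank_sum_u(group_a, group_b)
```

Here half of the ten values are zeros and all share the midrank 3, so they contribute no information about the ordering of the two groups; this is the tie problem described above.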

developed for bulk RNA-seq

e.g. edgeR, DESeq2

compare estimates of mean expression (sample size)

based on negative binomial distribution

can be assessed by datasets where RNA-seq data has been validated by RT-qPCR

developed for scRNA-seq

e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.

large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups

we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

Miao and Zhang 2016

Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.


Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)

More detailed examples


Practicalities

Rembering the bigger picture

Stegle Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl batcheffects

Detection rate ie the fraction ofdetected genes per cell

Imputations strategies for dropoutvalues

What is pragmatic programminglanguage platform speedcollaborative workflows etc

Olga (NBIS) scRNA-seq DE February 2019 40 46

Practicalities

Staying critical

Olga (NBIS) scRNA-seq DE February 2019 41 46

Summary

What to remember from this hour

httpswwwmenticom amp 25 06 78

Olga (NBIS) scRNA-seq DE February 2019 42 46

Summary

Growing field

Angerer et al 2017

Olga (NBIS) scRNA-seq DE February 2019 43 46

Summary

Growing field

httpswwwscrna-toolsorgtools

Zappia Phipson and Oshlack 2018

Olga (NBIS) scRNA-seq DE February 2019 44 46

Summary

SummaryscRNA-seq is a rapidly growing field

DE is a common task so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set

staying critical staying updated staying connected

Olga (NBIS) scRNA-seq DE February 2019 45 46

Bibliography

Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329

Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7

Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612

Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062

Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549

Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145

Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004

Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573

Olga (NBIS) scRNA-seq DE February 2019 46 46

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 19: scRNA-seq - Differential expression analyses

Common methods

Generic non-parametric methods

  • e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
  • non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
  • some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
  • if the conditions for a parametric test hold, it will typically be more powerful than a non-parametric test
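To make the point about ties concrete, here is a minimal sketch of a Wilcoxon rank-sum test that uses midranks and the tie-corrected normal approximation. It uses the Python standard library only. The function names and the toy expression values are illustrative assumptions, not part of any package mentioned on the slides.

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    w = sum(ranks[:n1])  # rank sum of the first group
    mean_w = n1 * (n + 1) / 2
    # Each block of t tied values reduces the variance of the rank sum.
    ties = sum(t**3 - t for t in Counter(pooled).values())
    var_w = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var_w == 0:  # every value is tied: the ranks carry no information
        return w, 1.0
    z = (w - mean_w) / math.sqrt(var_w)
    return w, math.erfc(abs(z) / math.sqrt(2))


# Toy data (made up): many zeros, as is typical for dropouts.
group_a = [0, 0, 0, 0, 3, 5, 0, 2]
group_b = [0, 4, 6, 7, 0, 9, 5, 8]
print(rank_sum_test(group_a, group_b))
```

When every value is tied (for example a gene that is zero in all cells), the variance of the rank sum collapses to zero and the test cannot say anything, which is why the sketch returns a p-value of 1 in that case.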

Developed for bulk RNA-seq

  • e.g. edgeR, DESeq2
  • compare estimates of mean expression (sample size)
  • based on the negative binomial distribution
  • can be assessed on data sets where the RNA-seq data has been validated by RT-qPCR

Developed for scRNA-seq

  • e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
  • single-cell experiments have a large number of samples (i.e. cells) in each group we are comparing, so we can take advantage of the whole distribution of expression values in each group to identify differences between groups
  • we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach

[Figure: Miao and Zhang 2016]

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612

More detailed examples

MAST

  • uses a generalized linear hurdle model
  • designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
  • the rate of expression Z and the level of expression Y are modeled for each gene g, where Z_ig indicates whether gene g is expressed in cell i (i.e. Z_ig = 0 if y_ig = 0 and Z_ig = 1 if y_ig > 0)
  • a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

    where X_i is a design matrix
  • model parameters are fitted using an empirical Bayesian framework
  • allows for a joint estimate of nuisance and treatment effects
  • DE is determined using the likelihood ratio test
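The two-part structure of the hurdle model can be sketched for the simplest possible design: two groups of cells and no further covariates. This is an illustrative simplification and not the MAST implementation. It leaves out the empirical Bayes regularization and any nuisance covariates, and the function names are hypothetical. With only a group term, the logistic part reduces to comparing the fractions of expressing cells, and the Gaussian part to comparing group means among expressing cells with a common variance. The two likelihood ratio statistics are summed and compared with a chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2).

```python
import math


def _binom_loglik(k, n):
    """Binomial log-likelihood at the MLE p = k / n (with 0 * log 0 = 0)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll


def _rss(values, center):
    """Residual sum of squares around a given center."""
    return sum((v - center) ** 2 for v in values)


def hurdle_lrt(group1, group2):
    """Two-group hurdle likelihood ratio test on log-scale expression values."""
    pos1 = [v for v in group1 if v > 0]
    pos2 = [v for v in group2 if v > 0]
    n1, n2 = len(group1), len(group2)

    # Discrete part (Z): common expression rate vs. one rate per group.
    ll_null = _binom_loglik(len(pos1) + len(pos2), n1 + n2)
    ll_alt = _binom_loglik(len(pos1), n1) + _binom_loglik(len(pos2), n2)
    stat_discrete = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part (Y | Z = 1): common mean vs. one mean per group,
    # using only the expressing cells and a shared variance.
    stat_continuous = 0.0
    pos = pos1 + pos2
    if pos1 and pos2 and len(pos) > 2:
        rss_null = _rss(pos, sum(pos) / len(pos))
        rss_alt = _rss(pos1, sum(pos1) / len(pos1)) + _rss(pos2, sum(pos2) / len(pos2))
        if rss_alt > 0:
            stat_continuous = max(0.0, len(pos) * math.log(rss_null / rss_alt))

    stat = stat_discrete + stat_continuous
    return stat, math.exp(-stat / 2)  # chi-square survival function, 2 df


# Toy log-expression values (made up): group 1 is mostly not detected.
low = [0.0] * 8 + [1.0, 1.2]
high = [2.0, 2.5, 3.0, 2.2, 2.8, 0.0, 3.1, 2.4, 2.9, 2.6]
print(hurdle_lrt(low, high))
```

A gene can be called DE here because the fraction of expressing cells differs, because the expression level among expressing cells differs, or both.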

Let's stop for a minute

The key

$$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$$

  • we collect data on a sample from a much larger population
  • statistics lets us make inferences about the population from which the sample was derived
  • we try to predict the outcome given a model fitted to the data

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram; x-axis: height [cm], y-axis: Frequency]
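The pooled two-sample t statistic above can be computed in a few lines. The sketch below assumes equal variances in the two groups, which is what the pooled standard deviation s_p implies. The function name and the height values are made up for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom


# Toy heights in cm (made up).
heights_a = [168.0, 171.5, 174.0, 169.5, 172.0]
heights_b = [175.0, 178.5, 173.0, 177.0, 180.0]
print(pooled_t(heights_a, heights_b))
```

The statistic is the "model" part (the difference in group means) scaled by the "error" part (the pooled standard error), which is the Outcome = Model + error idea in its simplest form.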

Generic recipe

  • model the data, e.g. gene expression with random error
  • fit the model to the data and/or the data to the model, and estimate the model parameters
  • use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts; x-axis: Read Counts, y-axis: Frequency]

$$\mu = \mathrm{mu}$$

$$\sigma^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$

mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
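The mean and variance formulas of the negative binomial can be checked numerically. The sketch below writes the probability mass function in the (mu, size) parameterization and sums over counts to recover both moments. The parameter values are arbitrary and the helper names are illustrative.

```python
import math


def nb_pmf(k, mu, size):
    """P(X = k) for a negative binomial with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def moments(pmf, upper=500):
    """Mean and variance by direct summation over counts 0 .. upper - 1."""
    probs = [pmf(k) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mu, size = 5.0, 2.0
mean, var = moments(lambda k: nb_pmf(k, mu, size))
print(mean, var)                 # numerical moments
print(mu, mu + mu**2 / size)     # formulas from the slide
```

A smaller size gives a larger variance for the same mean, which is the sense in which size is inversely related to the variance.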

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts; x-axis: Read Counts, y-axis: Frequency]

$$\mu = \mathrm{mu} \cdot (1 - d)$$

$$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \frac{\mathrm{mu}}{\mathrm{size}}\right)$$

d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
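The zero-inflated moments follow from mixing a point mass at zero (probability d) with an NB(mu, size) distribution. The sketch below checks the two formulas numerically by summing over the probability mass function. Parameter values are arbitrary and the helper names are illustrative.

```python
import math


def nb_pmf(k, mu, size):
    """P(X = k) for a negative binomial with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout (zero), otherwise NB(mu, size)."""
    p = (1 - d) * nb_pmf(k, mu, size)
    return p + d if k == 0 else p


mu, size, d = 5.0, 2.0, 0.3
probs = [zinb_pmf(k, mu, size, d) for k in range(500)]
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

expected_mean = mu * (1 - d)
expected_var = mu * (1 - d) * (1 + d * mu + mu / size)
print(mean, expected_mean)
print(var, expected_var)
```

Note that zeros come from two sources in this model: dropouts and genuine NB zeros, so the observed fraction of zeros is larger than d.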

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts; x-axis: Read Counts, y-axis: Frequency]

$$\mu = g \cdot \frac{a}{a + b}$$

$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$

a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Here σ² is the variance of the underlying Poisson rate g·p with p ~ Beta(a, b); by the law of total variance, the variance of the counts themselves additionally includes the Poisson term μ. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC.

The models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
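A Poisson-Beta count can be simulated in two steps: draw the fraction of time the gene is active from a Beta(a, b) distribution, then draw a Poisson count with rate g times that fraction. The sketch below does this with the standard library only. The Poisson sampler is Knuth's simple algorithm, chosen here for brevity, and the parameter values and function names are assumptions for illustration. By the law of total variance, the variance of the simulated counts is the variance of the rate plus the mean.

```python
import math
import random


def rpois(lam, rng):
    """Knuth's algorithm for one Poisson draw; fine for small to moderate lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def rpoisson_beta(a, b, g, rng):
    """One Poisson-Beta draw: Beta-distributed activity, then Poisson counts."""
    p = rng.betavariate(a, b)  # fraction of time the gene is active
    return rpois(g * p, rng)


rng = random.Random(1)
a, b, g = 0.5, 1.5, 40.0
counts = [rpoisson_beta(a, b, g, rng) for _ in range(20000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)

mu = g * a / (a + b)                                      # expected mean
rate_var = g**2 * a * b / ((a + b + 1) * (a + b) ** 2)    # variance of the rate
print(mean, mu)
print(var, rate_var + mu)   # count variance = rate variance + Poisson term
```

With a and b both below or near 1, the Beta distribution puts much of its mass near 0 and 1, which produces the bimodal "on/off" count distributions typical of bursty genes.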


SCDE

  • models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
  • the NB distribution models the transcripts that are amplified and detected
  • the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
  • a subset of robust genes is used to fit the parameters of the mixture model via the EM algorithm
  • for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
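The mixture idea can be illustrated with a single Bayes-rule calculation: given an observed count, how likely is it that it came from the dropout (Poisson background) component rather than from the amplified (NB) component? This is the kind of quantity an EM algorithm evaluates in its E step. The sketch below is not the scde package's implementation. The mixing weight, the background rate and the function names are hypothetical values chosen for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """P(X = k) for a negative binomial with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def dropout_posterior(k, mu, size, background=0.1, dropout_weight=0.3):
    """Posterior probability that count k belongs to the dropout component."""
    like_drop = dropout_weight * poisson_pmf(k, background)
    like_amp = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return like_drop / (like_drop + like_amp)


# For a gene expected at about 50 reads when amplified (assumed values):
for k in (0, 1, 5, 40):
    print(k, dropout_posterior(k, mu=50.0, size=2.0))
```

A zero in a gene that is otherwise highly expressed is attributed almost entirely to dropout, whereas a count near the expected level is attributed to amplification.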

Monocle

  • originally designed for ordering cells by progress through differentiation stages (pseudo-time)
  • the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

    where Y is a specific gene expression level, x_i are predictor variables, g is a link function (typically the log function), and f_i are non-parametric functions (e.g. cubic splines)
  • the observable expression level Y is then modeled using a GAM:

$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$

    where φ_t(b_x, s_i) is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom; the error term ε is normally distributed with a mean of zero
  • the DE test is performed using an approximate χ² likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth is available for testing.

Known data

  • using data we know something about to get positive controls

Simulated data

  • null data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios

  • comparing numbers of DE genes, incl. as a function of group size

Investigating results

  • what do the expression levels and distributions of the detected DE genes look like?
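One way to build a null data set by re-sampling is to take cells from a single real group and split them at random into two mock groups. Because both mock groups come from the same population, every gene a method calls DE is a false positive. The sketch below assumes a simple in-memory layout (a dictionary from gene name to a list of expression values) and an arbitrary test function that returns a p-value. Both are placeholders for illustration and not a specific package's interface.

```python
import random


def null_split(n_cells, group_size, rng):
    """Randomly draw two disjoint mock groups of cell indices."""
    idx = rng.sample(range(n_cells), 2 * group_size)
    return idx[:group_size], idx[group_size:]


def count_false_positives(matrix, de_test, group_size, alpha=0.05, seed=0):
    """Count genes called DE between two mock groups from ONE real group.

    matrix:  dict gene -> list of expression values (one per cell)
    de_test: any function (x, y) -> p-value
    """
    rng = random.Random(seed)
    n_cells = len(next(iter(matrix.values())))
    g1, g2 = null_split(n_cells, group_size, rng)
    hits = 0
    for values in matrix.values():
        x = [values[i] for i in g1]
        y = [values[i] for i in g2]
        if de_test(x, y) < alpha:
            hits += 1
    return hits


# Toy usage with a placeholder test that never rejects.
toy = {"geneA": [0, 1, 0, 3, 2, 0], "geneB": [5, 4, 6, 5, 7, 4]}
print(count_false_positives(toy, lambda x, y: 1.0, group_size=3))
```

Repeating the split with different seeds and group sizes gives the number of false positives as a function of group size, which is one of the comparisons listed above.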

  • false positives (type I error) vs false negatives (type II error)
  • sensitivity and specificity
  • precision and recall

[Figure: adapted from Wikipedia]

[Figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)]
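When the truth is known, as in a simulation, these metrics follow directly from the four cells of the confusion matrix. The sketch below uses the simulation design quoted above (10 000 genes, 2 000 of them truly DE). The set of genes "called" by the method is invented for illustration, and the function name is hypothetical.

```python
def confusion_metrics(true_de, called_de, all_genes):
    """Sensitivity, specificity and precision from sets of gene identifiers."""
    true_de, called_de = set(true_de), set(called_de)
    tp = len(true_de & called_de)   # truly DE and called DE
    fp = len(called_de - true_de)   # called DE but not truly DE (type I errors)
    fn = len(true_de - called_de)   # truly DE but missed (type II errors)
    tn = len(set(all_genes)) - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


all_genes = range(10000)
true_de = range(2000)            # genes 0..1999 are simulated as DE
called_de = range(500, 2300)     # hypothetical calls from some method
print(confusion_metrics(true_de, called_de, all_genes))
```

Because only a minority of genes is DE, specificity can look high even when precision is modest, so it is worth reporting both. The function divides by zero if a method makes no calls at all, which a production version would need to handle.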

Consistency

[Figure: Miao et al. 2017]

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis":

  • 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
  • based on 9 data sets with 11-21 separate instances (sample size effect)
  • extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
  • conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

  • example data: 46078 genes x 96 cells
  • 22229 genes with no expression at all

[Figure: two histograms; left: x-axis Read Counts, y-axis Frequency; right: x-axis "0 counts", y-axis Frequency]
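A first look at a count matrix usually includes the number of genes with no expression at all and the detection rate per cell, i.e. the fraction of genes detected in each cell. The sketch below assumes the matrix is held as a plain list of rows (one row per gene, one column per cell). The toy matrix and the function name are made up for illustration.

```python
def summarize_counts(matrix):
    """Return the number of all-zero genes and the detection rate per cell.

    matrix: list of rows, one row per gene, one column per cell.
    """
    n_genes = len(matrix)
    n_cells = len(matrix[0])
    zero_genes = sum(1 for row in matrix if not any(row))
    detection_rate = [
        sum(1 for row in matrix if row[c] > 0) / n_genes for c in range(n_cells)
    ]
    return zero_genes, detection_rate


# Toy matrix: 4 genes x 3 cells.
toy = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 0, 0],
    [1, 3, 0],
]
print(summarize_counts(toy))
```

Genes that are zero in every cell carry no information for DE testing and are commonly filtered out before analysis, which also reduces the multiple-testing burden.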

Choosing DE methods

[Figure: Soneson and Robinson 2018]

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

  • QC filtering
  • cell-cycle phase
  • normalization of cell-specific biases
  • confounding factors, incl. batch effects
  • detection rate, i.e. the fraction of detected genes per cell
  • imputation strategies for dropout values
  • what is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour?

https://www.menti.com & 25 06 78

Growing field

[Figure: Angerer et al. 2017]

https://www.scrna-tools.org/tools

[Figure: Zappia, Phipson and Oshlack 2018]

  • scRNA-seq is a rapidly growing field
  • DE is a common task, so many newer and better methods will be developed
  • understanding basic statistical concepts enables one to think more like a statistician and to choose and evaluate methods for a given data set
  • staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 20954697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." No. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

Page 20: scRNA-seq - Differential expression analyses

Common methods

developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR

Olga (NBIS) scRNA-seq DE February 2019 13 46

Common methods

developed for scRNA-seq

eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach

Olga (NBIS) scRNA-seq DE February 2019 14 46

Common methods

Miao and Zhang 2016

Olga (NBIS) scRNA-seq DE February 2019 15 46

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 16 46

Common methods More detailed examples

More detailed examples

Olga (NBIS) scRNA-seq DE February 2019 17 46

Common methods More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Let's stop for a minute

The key

$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
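As a worked instance of the two-sample t statistic with pooled standard deviation $s_p$, the sketch below computes it by hand for two small groups. The height values are made up for illustration and are not taken from the figure.

```python
import math


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))  # pooled standard deviation
    return (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights in cm for two groups
group_a = [170, 172, 174]
group_b = [176, 178, 180]
t = pooled_t(group_a, group_b)
print(f"t = {t:.4f} on {len(group_a) + len(group_b) - 2} degrees of freedom")
```

Here the means are 172 and 178, the pooled standard deviation is 2, and the statistic is -6 / (2 * sqrt(2/3)), roughly -3.67.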

Generic recipe

model the data, e.g. gene expression with random error

fit the model to the data and/or the data to the model, and estimate the model parameters

use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions: Negative Binomial

[Figure: three histograms of negative binomial read counts; x-axis: read counts, y-axis: frequency]

$\mu = mu$

$\sigma^2 = mu + \dfrac{mu^2}{size}$

mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used in most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
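The mean and variance relationship above can be checked by simulation. The sketch below draws negative binomial counts as a gamma-Poisson mixture and recovers mu and size by the method of moments. The parameter values (mu = 5, size = 2), the seed, and the helper names are assumptions chosen for illustration.

```python
import math
import random


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mu, size, rng):
    """NB draw as a gamma-Poisson mixture: rate ~ Gamma(shape=size, scale=mu/size)."""
    lam = rng.gammavariate(size, mu / size)
    return sample_poisson(lam, rng)


def moment_estimates(counts):
    """Method-of-moments estimates of the mean, the variance and size."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    size = mean ** 2 / (var - mean) if var > mean else float("inf")
    return mean, var, size


rng = random.Random(1)
counts = [sample_nb(5.0, 2.0, rng) for _ in range(20000)]
mean, var, size = moment_estimates(counts)
print(f"mean = {mean:.2f}, variance = {var:.2f} (theory {nb_variance(5.0, 2.0)}), size = {size:.2f}")
```

A small size gives a variance far above the mean (overdispersion); as size grows, the variance approaches the mean and the distribution approaches a Poisson.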

Common distributions: Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts; x-axis: read counts, y-axis: frequency]

$\mu = mu \cdot (1 - d)$

$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \dfrac{mu}{size}\right)$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
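For readers who want to see where the zero-inflated moments come from, here is a short derivation. It assumes the standard construction in which a count is forced to zero with probability d and otherwise drawn from a negative binomial with mean mu and dispersion parameter size; the symbols X and Y are introduced only for this derivation.

```latex
% X ~ NB(mu, size):  E[X] = mu,  Var(X) = mu + mu^2 / size
% Y = 0 with probability d,  Y = X with probability 1 - d
\begin{align*}
E[Y]   &= (1 - d)\, mu \\
E[Y^2] &= (1 - d)\left(\mathrm{Var}(X) + E[X]^2\right)
        = (1 - d)\left(mu + \frac{mu^2}{size} + mu^2\right) \\
\mathrm{Var}(Y) &= E[Y^2] - E[Y]^2 \\
       &= (1 - d)\, mu \left(1 + \frac{mu}{size} + mu - (1 - d)\, mu\right) \\
       &= (1 - d)\, mu \left(1 + d \cdot mu + \frac{mu}{size}\right)
\end{align*}
```

Setting d = 0 recovers the plain negative binomial variance mu + mu^2 / size.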

Common distributions: Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts; x-axis: read counts, y-axis: frequency]

$\mu = g \cdot \dfrac{a}{a + b}$

$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b the rate of inhibition of transcription, and g the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.

MAST (revisited)

Uses a generalized linear hurdle model

Designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

Models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

The NB distribution models the transcripts that are amplified and detected

The Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

A subset of robust genes is used to fit the parameters of the mixture model via the EM algorithm

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
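The mixture idea can be illustrated with the E-step of a two-component mixture: given a count, how likely is it to come from the Poisson background (dropout) component rather than the NB (amplified) component? This is a simplified sketch, not the scde package's algorithm; the fixed parameter values and function names are assumptions for the example, and a full EM fit would alternate this step with re-estimating the parameters.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k at rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, with mean mu and dispersion size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, pi_dropout, lam_bg, mu, size):
    """Posterior probability that count k was produced by the dropout component.

    pi_dropout: prior weight of the Poisson background component
    lam_bg:     background Poisson rate
    mu, size:   NB parameters of the amplified component
    """
    drop = pi_dropout * poisson_pmf(k, lam_bg)
    amplified = (1 - pi_dropout) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


# Hypothetical parameters: 30% dropouts, background rate 0.1, amplified mean 50
for k in (0, 1, 5, 40):
    print(k, round(dropout_posterior(k, 0.3, 0.1, 50.0, 2.0), 4))
```

A zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed to the amplified component.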

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data: using data we know something about to get positive controls

Simulated data: null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size

Investigating results: what do the expression and distributions of detected DEs look like?


False positives (type I error) vs. false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure: adapted from Wikipedia]

[Figure: Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions)]
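These metrics all follow from the four cells of a confusion table. The sketch below uses the simulation layout quoted above (10,000 genes, 2,000 truly DE), but the numbers of calls made by the method are invented for illustration and are not results from any of the cited studies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-table counts of DE calls."""
    return {
        "sensitivity": tp / (tp + fn),           # share of true DE genes detected
        "specificity": tn / (tn + fp),           # share of non-DE genes left uncalled
        "precision": tp / (tp + fp),             # share of calls that are truly DE
        "recall": tp / (tp + fn),                # same quantity as sensitivity
        "false_positive_rate": fp / (fp + tn),   # type I error rate
        "false_negative_rate": fn / (fn + tp),   # type II error rate
    }


# Hypothetical outcome: 10,000 genes, 2,000 truly DE;
# the method calls 1,800 genes, of which 1,500 are truly DE
metrics = classification_metrics(tp=1500, fp=300, tn=7700, fn=500)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts the method detects 75% of the true DE genes, and about 83% of its calls are correct.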

Consistency

[Figure: Miao et al. 2017]

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46,078 genes x 96 cells

22,229 genes with no expression at all

[Figure: two histograms; left x-axis: read counts, right x-axis: 0 counts, y-axis: frequency]
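A first look at a count matrix can be as simple as counting zeros. The sketch below assumes the matrix is held as a list of rows, one row per gene and one column per cell; the toy matrix and the function name are made up for illustration.

```python
def summarize_zeros(counts):
    """Basic zero statistics for a genes x cells count matrix (list of rows)."""
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0
    # Genes with no expression in any cell
    silent_genes = sum(1 for row in counts if not any(row))
    # Number of cells with a zero count, per gene
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # Detection rate: fraction of genes detected in each cell
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)
    ]
    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        "silent_genes": silent_genes,
        "zeros_per_gene": zeros_per_gene,
        "detection_rate": detection_rate,
    }


# Toy matrix: 4 genes x 3 cells
toy_counts = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 0, 1],
    [3, 4, 0],
]
summary = summarize_zeros(toy_counts)
print(summary)
```

Genes that are zero in every cell carry no information for DE testing and are usually filtered out first.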

Choosing DE methods

[Figure: Soneson and Robinson 2018]

Remembering the bigger picture

[Figure: Stegle, Teichmann and Marioni 2015]

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

[Figure: Angerer et al. 2017]

[Figure: https://www.scrna-tools.org/tools (Zappia, Phipson and Oshlack 2018)]

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573


Common methods

developed for scRNA-seq

e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.

large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups

we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach

[Figure: Miao and Zhang 2016]



050

0015

000

0 counts

Fre

quen

cy

0 20 40 60 800

2000

4000

6000

Olga (NBIS) scRNA-seq DE February 2019 38 46

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 39 46

Practicalities

Rembering the bigger picture

Stegle Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl batcheffects

Detection rate ie the fraction ofdetected genes per cell

Imputations strategies for dropoutvalues

What is pragmatic programminglanguage platform speedcollaborative workflows etc

Olga (NBIS) scRNA-seq DE February 2019 40 46

Practicalities

Staying critical

Olga (NBIS) scRNA-seq DE February 2019 41 46

Summary

What to remember from this hour

httpswwwmenticom amp 25 06 78

Olga (NBIS) scRNA-seq DE February 2019 42 46

Summary

Growing field

Angerer et al 2017

Olga (NBIS) scRNA-seq DE February 2019 43 46

Summary

Growing field

httpswwwscrna-toolsorgtools

Zappia Phipson and Oshlack 2018

Olga (NBIS) scRNA-seq DE February 2019 44 46

Summary

SummaryscRNA-seq is a rapidly growing field

DE is a common task so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set

staying critical staying updated staying connected

Olga (NBIS) scRNA-seq DE February 2019 45 46

Bibliography

Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329

Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7

Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612

Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062

Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549

Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145

Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004

Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573

Olga (NBIS) scRNA-seq DE February 2019 46 46

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Common methods

Miao and Zhang 2016

Common methods

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)

Common methods: More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
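The two-part structure of the hurdle model can be sketched for the simplest design, a two-group comparison of a single gene. This is only an illustration and not MAST itself: it uses plain maximum likelihood with no empirical Bayes shrinkage, it assumes each group has at least two expressed cells, and the names `binom_loglik`, `hurdle_lrt` and `simulate` are made up for the example. The discrete and continuous likelihood ratio statistics are added and compared with a chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2).

```python
import math
import numpy as np

def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressed cells out of n."""
    if 0 < k < n:
        p = k / n
        return k * math.log(p) + (n - k) * math.log(1 - p)
    return 0.0  # all or none expressed: the fitted likelihood equals 1

def hurdle_lrt(y_a, y_b):
    """Likelihood ratio test for one gene; inputs are log-scale values, 0 = not detected."""
    y_a, y_b = np.asarray(y_a, float), np.asarray(y_b, float)
    pos_a, pos_b = y_a[y_a > 0], y_b[y_b > 0]

    # Discrete part (Z): separate expression rates per group vs one shared rate
    lr_discrete = 2 * (binom_loglik(len(pos_a), len(y_a))
                       + binom_loglik(len(pos_b), len(y_b))
                       - binom_loglik(len(pos_a) + len(pos_b), len(y_a) + len(y_b)))

    # Continuous part (Y | Z = 1): Gaussian with group means vs one shared mean
    pos = np.concatenate([pos_a, pos_b])
    rss_null = np.sum((pos - pos.mean()) ** 2)
    rss_alt = np.sum((pos_a - pos_a.mean()) ** 2) + np.sum((pos_b - pos_b.mean()) ** 2)
    lr_continuous = len(pos) * math.log(rss_null / rss_alt)

    stat = lr_discrete + lr_continuous
    return stat, math.exp(-stat / 2)  # chi-square survival function, 2 df

rng = np.random.default_rng(1)

def simulate(n_cells, frac_expressed, mean_level):
    """Toy data: a fraction of cells express the gene at a Gaussian level."""
    expressed = rng.random(n_cells) < frac_expressed
    level = np.clip(rng.normal(mean_level, 0.5, n_cells), 0.1, None)
    return np.where(expressed, level, 0.0)

group_a = simulate(50, 0.3, 2.0)
group_b = simulate(50, 0.8, 4.0)
stat, p_value = hurdle_lrt(group_a, group_b)
print(f"LRT statistic = {stat:.1f}, p = {p_value:.2e}")
```

Comparing a group with itself gives a statistic of zero, which is a quick sanity check on both parts of the test.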

Let's stop for a minute

The key

$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
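As a small worked instance of the t statistic, the sketch below computes it for two made-up samples of heights. It assumes the usual definition of the pooled standard deviation $s_p$, which the slide names but does not spell out; the function name `pooled_t` and the data are hypothetical.

```python
import numpy as np

def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    # pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    sp = np.sqrt(pooled_var)
    return (x1.mean() - x2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))

heights_a = [170, 172, 174]  # toy heights in cm
heights_b = [165, 167, 169]
t = pooled_t(heights_a, heights_b)
print(round(t, 4))
```

Here both sample variances are 4, so $s_p = 2$ and $t = 5 / (2\sqrt{2/3}) \approx 3.06$.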

Generic recipe

model the data, e.g. gene expression with random error

fit the model to the data and/or the data to the model; estimate model parameters

use the model for prediction and/or inference

Important implication

the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$\mu = mu$

$\delta^2 = mu + \frac{mu^2}{size}$

mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011).
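The mean and variance formulas can be checked by simulation. The sketch below draws negative binomial counts with NumPy and compares the sample moments with the formulas; the values of `mu` and `size` are arbitrary choices for illustration. NumPy parameterises the distribution by `(n, p)`, so `n = size` and `p = size / (size + mu)` are used to obtain mean `mu`.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, size = 5.0, 2.0  # illustrative values, not taken from the slides

# n = size and p = size / (size + mu) give a negative binomial with mean mu
counts = rng.negative_binomial(n=size, p=size / (size + mu), size=200_000)

expected_var = mu + mu**2 / size  # 5 + 25 / 2 = 17.5
sample_mean, sample_var = counts.mean(), counts.var()

# method-of-moments estimate of size, obtained by inverting the variance formula
size_hat = sample_mean**2 / (sample_var - sample_mean)

print(f"mean {sample_mean:.2f} (expected {mu}), variance {sample_var:.2f} (expected {expected_var})")
print(f"estimated size {size_hat:.2f}")
```

A smaller `size` gives a larger variance for the same mean, which is the inverse relation mentioned above.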

Common distributions

Zero-inflated NB

[Figure: two histograms of zero-inflated NB read counts (x-axis: read counts, y-axis: frequency)]

$\mu = mu \cdot (1 - d)$

$\delta^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + mu / size)$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
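A simulation makes the role of the dropout rate concrete. The sketch below sets each negative binomial count to zero with probability `d` and compares the sample moments with the formulas, reading `mu` inside the variance as the mean of the negative binomial component (an interpretation that follows from deriving the variance of such a mixture). All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, size, d = 5.0, 2.0, 0.3  # illustrative values
n = 200_000

nb_counts = rng.negative_binomial(n=size, p=size / (size + mu), size=n)
dropout = rng.random(n) < d               # True with probability d
counts = np.where(dropout, 0, nb_counts)  # dropouts are forced to zero

expected_mean = mu * (1 - d)                            # 3.5
expected_var = mu * (1 - d) * (1 + d * mu + mu / size)  # 3.5 * 5 = 17.5

print(f"mean {counts.mean():.2f} (expected {expected_mean})")
print(f"variance {counts.var():.2f} (expected {expected_var})")
print(f"fraction of zeros {np.mean(counts == 0):.3f}")
```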

Common distributions

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$\mu = g \cdot a / (a + b)$

$\delta^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
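The Poisson-Beta model can be simulated in two steps: draw the fraction of time a gene is active from a Beta(a, b) distribution, then draw a Poisson count with rate g times that fraction. The sketch below is only a check under arbitrary parameter values. One observation that is my own reading and not stated on the slide: the variance expression above matches the variance of the underlying rate, while the simulated counts carry an additional Poisson term equal to the mean (law of total variance).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, g = 2.0, 3.0, 50.0  # illustrative values
n = 200_000

active_fraction = rng.beta(a, b, size=n)   # Beta-distributed activity per cell
counts = rng.poisson(g * active_fraction)  # Poisson counts given the rate

expected_mean = g * a / (a + b)                         # 20
rate_var = g**2 * a * b / ((a + b + 1) * (a + b) ** 2)  # 100
count_var = expected_mean + rate_var                    # 120, adds the Poisson noise

print(f"mean {counts.mean():.2f} (expected {expected_mean})")
print(f"variance of counts {counts.var():.2f} (rate variance {rate_var}, plus mean {count_var})")
```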

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

The NB distribution models the transcripts that are amplified and detected

The Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

A subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
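The mixture idea can be illustrated with the calculation at the heart of an EM step: given mixture parameters, how likely is it that an observed count came from the Poisson (dropout or background) component and not from the NB (amplified) component? The sketch below is a simplified illustration and not the SCDE implementation; the function names and all parameter values (`pi_drop`, `lam`, `mu`, `size`) are assumptions chosen for the example.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, with mean mu and dispersion size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def dropout_posterior(k, pi_drop, lam, mu, size):
    """Posterior probability that count k was generated by the Poisson component."""
    from_dropout = pi_drop * poisson_pmf(k, lam)
    from_amplified = (1 - pi_drop) * nb_pmf(k, mu, size)
    return from_dropout / (from_dropout + from_amplified)

# Hypothetical gene: 30% dropouts with background rate 0.1, otherwise NB(mu=50, size=2)
for k in (0, 1, 5, 40):
    print(k, dropout_posterior(k, pi_drop=0.3, lam=0.1, mu=50.0, size=2.0))
```

With these values a zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed to the amplified component.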

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


No ground truth, i.e. no independently validated truth, is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios

Comparing numbers of DE genes, incl. as a function of group size

Investigating results

What do the expression levels and distributions of the detected DE genes look like?
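A null data set by re-sampling can be sketched as follows: take cells from one homogeneous population, assign them to two groups at random, and count how many genes a test calls DE. Since no true differences exist, every call is a false positive, and at a threshold of 0.05 roughly 5% of genes are expected to be called. The test used here (Welch's t with a normal approximation for the p-value) and the simulated counts are stand-ins chosen to keep the example self-contained; they are not what any of the cited studies used.

```python
import math
import numpy as np

def welch_p_values(x, y):
    """Two-sided p-values per gene (rows), normal approximation to Welch's t."""
    mean_diff = x.mean(axis=1) - y.mean(axis=1)
    se = np.sqrt(x.var(axis=1, ddof=1) / x.shape[1] + y.var(axis=1, ddof=1) / y.shape[1])
    t = mean_diff / se
    # erfc(|t| / sqrt(2)) equals the two-sided tail area of a standard normal
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in t])

rng = np.random.default_rng(0)
n_genes, n_cells = 2000, 100

# One homogeneous population: NB counts with mean 5, no true group differences
counts = rng.negative_binomial(n=2, p=2 / (2 + 5.0), size=(n_genes, n_cells))
log_expr = np.log2(counts + 1)

# Random split of the cells into two groups of equal size
labels = rng.permutation(np.repeat([0, 1], n_cells // 2))
p_values = welch_p_values(log_expr[:, labels == 0], log_expr[:, labels == 1])

false_positive_rate = float(np.mean(p_values < 0.05))
print(f"fraction of genes called DE on null data: {false_positive_rate:.3f}")
```

A method that calls far more than the nominal fraction on such data is not controlling its type I error.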


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure adapted from Wikipedia]

False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
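When the truth is known, as in a simulation, these measures follow directly from the four cells of the confusion matrix. The sketch below borrows only the set-up sizes from the slide (10 000 genes, 2 000 truly DE); the numbers of detected genes are invented for illustration and are not results from Dal Molin, Baruzzo and Di Camillo, and the function name `de_metrics` is hypothetical.

```python
import numpy as np

def de_metrics(truth, called):
    """Confusion-matrix summaries for DE calls against a known truth."""
    truth, called = np.asarray(truth, bool), np.asarray(called, bool)
    tp = int(np.sum(truth & called))    # true DE genes that were called
    fp = int(np.sum(~truth & called))   # type I errors
    fn = int(np.sum(truth & ~called))   # type II errors
    tn = int(np.sum(~truth & ~called))
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }

n_genes, n_de = 10_000, 2_000
truth = np.zeros(n_genes, bool)
truth[:n_de] = True               # the first 2 000 genes are truly DE

called = np.zeros(n_genes, bool)
called[:1_500] = True             # invented: 1 500 true positives
called[n_de:n_de + 400] = True    # invented: 400 false positives

metrics = de_metrics(truth, called)
print(metrics)
```

With these invented calls the sensitivity is 0.75, the specificity 0.95 and the precision about 0.79, which shows that a method can look good on specificity while one in five of its calls is wrong.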

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells; 22229 genes with no expression at all

[Figure: two histograms; x-axes: read counts and 0 counts, y-axis: frequency]
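The same first look can be taken at any count matrix. The sketch below assumes a genes x cells NumPy array and uses simulated data in place of the example data set, with 400 of 1 000 genes forced to zero so that the summary has something to find; all sizes and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells = 1_000, 96

# Simulated stand-in for a real count matrix (genes in rows, cells in columns)
counts = rng.negative_binomial(n=0.5, p=0.5 / (0.5 + 3.0), size=(n_genes, n_cells))
counts[:400, :] = 0  # pretend 400 genes are never detected

zero_genes = int(np.sum(counts.sum(axis=1) == 0))  # genes with no expression at all
zeros_per_gene = (counts == 0).sum(axis=1)         # number of cells with a 0 count, per gene
detection_rate = (counts > 0).mean(axis=0)         # fraction of detected genes, per cell

print(f"{zero_genes} of {n_genes} genes have no expression in any cell")
print(f"median detection rate per cell: {np.median(detection_rate):.2f}")
```

Histograms of `counts` and `zeros_per_gene` would correspond to the two panels described above.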

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
  • Performance
  • Practicalities
  • Summary
  • Bibliography
Page 23: scRNA-seq - Differential expression analyses

Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 16 46

Common methods More detailed examples

More detailed examples

Olga (NBIS) scRNA-seq DE February 2019 17 46

Common methods More detailed examples

MAST

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Common methods More detailed examples

Letrsquos stop for a minute

Olga (NBIS) scRNA-seq DE February 2019 19 46

Common methods More detailed examples

The key

Outcomei = (Modeli) + errori

we collect data on a sample from a much larger population

statistics lets us to make inferences about the population from which sample wasderived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common methods More detailed examples

Common distributions

Negative Binomial

Read Counts

Fre

quen

cy

0 5 10 15 20

010

020

030

040

050

0

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10

050

100

150

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10 12

050

100

150

200

micro = mu

δ2 = mu + mu2

size

mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)

Olga (NBIS) scRNA-seq DE February 2019 24 46

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$\mu = \text{mu} \cdot (1 - d)$

$\sigma^2 = \text{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \text{mu} + \dfrac{\text{mu}}{\text{size}}\right)$

Here mu and size are the mean and size of the NB component, $\mu$ and $\sigma^2$ are the mean and variance of the zero-inflated counts, and d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
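The zero-inflated NB moments can be checked the same way, with the formulas written in terms of the mean mu of the NB component and the dropout rate d. This is a sketch under the assumption that a count is zero with probability d and otherwise drawn from NB(mu, size); the values are illustrative.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf in the (mu, size) parameterisation."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: extra mass d at zero, NB(mu, size) otherwise."""
    extra_zero = d if k == 0 else 0.0
    return extra_zero + (1 - d) * nb_pmf(k, mu, size)


def zinb_moments(mu, size, d):
    """Closed-form mean and variance of the zero-inflated counts."""
    mean = mu * (1 - d)
    var = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, var


mu, size, d = 5.0, 2.0, 0.4  # illustrative values only
ks = range(500)
probs = [zinb_pmf(k, mu, size, d) for k in ks]
num_mean = sum(k * p for k, p in zip(ks, probs))
num_var = sum((k - num_mean) ** 2 * p for k, p in zip(ks, probs))
print(num_mean, num_var, zinb_moments(mu, size, d))
```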

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$\mu = g \cdot \dfrac{a}{a + b}$

$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

Depending on the particular DE algorithm, these models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effects or library depth.
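A small simulation can make the Poisson-Beta model concrete. The sketch assumes counts are Poisson with rate g·p, where p ~ Beta(a, b). Under that assumption the variance expression above is the variance of the rate g·p, and by the law of total variance the simulated counts carry an additional Poisson term equal to the mean. The sampler, seed and parameter values are illustrative choices, not part of BPSC.

```python
import math
import random
from statistics import fmean, pvariance


def poisson_beta_moments(a, b, g):
    """Mean, and variance of the Beta-distributed rate g * p."""
    mean = g * a / (a + b)
    rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, rate_var


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(1)
a, b, g = 2.0, 3.0, 10.0  # illustrative values only
counts = [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(50000)]
sim_mean, sim_var = fmean(counts), pvariance(counts)
pb_mean, pb_rate_var = poisson_beta_moments(a, b, g)
print(sim_mean, pb_mean)
print(sim_var, pb_mean + pb_rate_var)
```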

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\text{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
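The hurdle idea can be illustrated with a heavily simplified two-group likelihood ratio test for a single gene. This is not MAST's implementation: it assumes only a group label as covariate, a common variance for the continuous part, and no empirical Bayes regularisation. The function names and data are hypothetical.

```python
import math


def binomial_loglik(k, n, p):
    """Log-likelihood of k expressed cells out of n (0 * log 0 taken as 0)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1 - p)
    return ll


def hurdle_lrt(group1, group2):
    """Simplified hurdle LRT; returns (statistic, degrees of freedom, p-value)."""
    pos1 = [y for y in group1 if y > 0]
    pos2 = [y for y in group2 if y > 0]
    n1, n2, k1, k2 = len(group1), len(group2), len(pos1), len(pos2)

    # Discrete part: does the detection rate Pr(Z = 1) differ between groups?
    ll_alt = binomial_loglik(k1, n1, k1 / n1) + binomial_loglik(k2, n2, k2 / n2)
    p0 = (k1 + k2) / (n1 + n2)
    ll_null = binomial_loglik(k1, n1, p0) + binomial_loglik(k2, n2, p0)
    stat, df = max(0.0, 2 * (ll_alt - ll_null)), 1

    # Continuous part: does the mean of Y | Z = 1 differ between groups?
    if k1 > 0 and k2 > 0:
        m1, m2 = sum(pos1) / k1, sum(pos2) / k2
        m0 = (sum(pos1) + sum(pos2)) / (k1 + k2)
        rss_alt = sum((y - m1) ** 2 for y in pos1) + sum((y - m2) ** 2 for y in pos2)
        rss_null = sum((y - m0) ** 2 for y in pos1 + pos2)
        if rss_alt > 0:
            stat += max(0.0, (k1 + k2) * math.log(rss_null / rss_alt))
            df += 1

    # Chi-square survival function in closed form for 1 or 2 degrees of freedom
    p_value = math.erfc(math.sqrt(stat / 2)) if df == 1 else math.exp(-stat / 2)
    return stat, df, p_value


# Hypothetical log-scale expression values for one gene in two groups of cells
low = [0.0] * 8 + [1.0, 1.2]
high = [0.0, 0.0, 5.0, 5.5, 6.0, 4.8, 5.2, 5.9, 6.1, 5.0]
print(hurdle_lrt(low, high))
```

The two parts contribute one degree of freedom each, so a gene can be called DE because it is detected more often, because it is expressed at a higher level when detected, or both.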

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
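The two-component mixture can be illustrated with Bayes' rule: given an observed count, how likely is it that it came from the Poisson (dropout or background) component rather than the NB (amplified) component? This is only a sketch of the mixture idea, not SCDE's fitting procedure; the mixing weight, background rate and NB parameters are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson pmf, computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial pmf in the (mu, size) parameterisation."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def dropout_posterior(k, dropout_weight, background_rate, mu, size):
    """Posterior probability that count k came from the Poisson component."""
    drop = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


# Hypothetical parameters for one gene in one cell
params = dict(dropout_weight=0.3, background_rate=0.1, mu=50.0, size=2.0)
p_zero = dropout_posterior(0, **params)
p_twenty = dropout_posterior(20, **params)
print(p_zero, p_twenty)
```

In an EM fit, posteriors of this kind would be the E-step, and the mixture parameters would be re-estimated from them in the M-step.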

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data: using data we know something about to get positive controls

Simulated data: null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size

Investigating results: what do the expression levels and distributions of the detected DEs look like?
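One way to build a null data set by re-sampling is to draw two pseudo-groups at random from cells that belong to the same condition, so that any gene called DE between them is a false positive. A minimal sketch, with a hypothetical helper name and made-up cell identifiers:

```python
import random


def make_null_split(cell_ids, group_size, seed=0):
    """Randomly draw two non-overlapping pseudo-groups from one condition."""
    rng = random.Random(seed)
    chosen = rng.sample(cell_ids, 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# Hypothetical cell identifiers, all from the same biological condition
cells = [f"cell_{i}" for i in range(96)]
group_a, group_b = make_null_split(cells, group_size=24, seed=1)
print(len(group_a), len(group_b))
```

Repeating the split with different seeds and group sizes gives a picture of how the number of false positives depends on sample size.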


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure: adapted from Wikipedia]

[Figure: Dal Molin, Baruzzo and Di Camillo 2017. 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions)]
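When the truth is known, as in a simulation with 10,000 genes of which 2,000 are DE, these metrics follow directly from the confusion counts. The set of called genes below is invented for illustration and does not reproduce any result from the cited study.

```python
def de_performance(called, truth, n_genes):
    """Confusion-matrix metrics for a set of genes called DE."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)        # true positives
    fp = len(called - truth)        # false positives (type I errors)
    fn = len(truth - called)        # false negatives (type II errors)
    tn = n_genes - tp - fp - fn     # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


# Genes 0..1999 are truly DE; a hypothetical method calls 1,800 genes,
# 1,500 of them correctly
truth = range(2000)
called = list(range(1500)) + list(range(2000, 2300))
metrics = de_performance(called, truth, n_genes=10000)
print(metrics)
```

The sketch assumes at least one gene is called and at least one is truly DE; otherwise the ratios are undefined.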

Consistency

[Figure: Miao et al. 2017]

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"

36 statistical approaches for DE analysis to compare the expression levels in two groups of cells

based on 9 data sets with 11-21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: two histograms with frequency on the y-axis, one of read counts and one titled "0 counts"]
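A first look at a count matrix can be as simple as counting zeros per gene, counting genes with no expression at all, and computing the fraction of detected genes per cell. The sketch below uses a toy matrix and a hypothetical helper; a real data set would normally be handled with a matrix library.

```python
def summarize_counts(counts):
    """counts: list of rows, one row per gene and one column per cell."""
    n_genes, n_cells = len(counts), len(counts[0])
    # Number of cells with a zero count, for each gene
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # Genes with no expression in any cell
    unexpressed = sum(1 for z in zeros_per_gene if z == n_cells)
    # Fraction of genes detected (count > 0) in each cell
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)
    ]
    return zeros_per_gene, unexpressed, detection_rate


# Toy matrix: 3 genes x 4 cells
toy = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [1, 3, 0, 7],
]
zeros_per_gene, unexpressed, detection_rate = summarize_counts(toy)
print(zeros_per_gene, unexpressed, detection_rate)
```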

Choosing DE methods

[Figure: Soneson and Robinson 2018]

Remembering the bigger picture

[Figure: Stegle, Teichmann and Marioni 2015]

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

[Figure: Angerer et al. 2017]

https://www.scrna-tools.org/tools

[Figure: Zappia, Phipson and Oshlack 2018]

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

Page 25: scRNA-seq - Differential expression analyses

Common methods More detailed examples

MAST

uses a generalized linear hurdle model

designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:

$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 18 46

Let's stop for a minute

The key

$$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

The key

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
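The pooled two-sample t statistic on this slide can be sketched in a few lines of standard-library Python. The function name `pooled_t` and the height values are made up for illustration; they are not taken from the slides.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation (equal-variance assumption)."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights in cm for two groups.
group1 = [170.0, 172.0, 174.0, 176.0]
group2 = [166.0, 168.0, 170.0, 172.0]
t_stat = pooled_t(group1, group2)
print(round(t_stat, 3))
```

The statistic grows with the difference in group means and shrinks with the within-group spread, which is the "model plus error" idea from the previous slide in its simplest form.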

Generic recipe

  • model the data, e.g. gene expression with random error
  • fit the model to the data and/or the data to the model
  • estimate the model parameters
  • use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = \mathrm{mu}$$

$$\sigma^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$

mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
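As a quick check of the mean and variance of the negative binomial in the mu/size parametrisation, the sketch below sums the probability mass function numerically. The helper names `nb_pmf` and `moments` and the parameter values are illustrative assumptions.

```python
import math

def nb_pmf(k, mu, size):
    """P(K = k) for a negative binomial with mean `mu` and dispersion parameter `size`."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def moments(pmf, upper=2000):
    """Mean and variance obtained by summing a pmf over 0..upper-1."""
    probs = [pmf(k) for k in range(upper)]
    m = sum(k * p for k, p in enumerate(probs))
    v = sum((k - m) ** 2 * p for k, p in enumerate(probs))
    return m, v

mu, size = 5.0, 2.0  # made-up parameter values
mean_nb, var_nb = moments(lambda k: nb_pmf(k, mu, size))
print(mean_nb, var_nb, mu + mu ** 2 / size)
```

A smaller `size` inflates the variance relative to the mean, while a very large `size` brings the distribution close to a Poisson.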

Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = \mathrm{mu} \cdot (1 - d)$$

$$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \frac{\mathrm{mu}}{\mathrm{size}}\right)$$

d is the dropout rate, and mu and size are the mean and dispersion of the NB component. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
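The zero-inflated moments can be checked the same way. This sketch assumes that mu in the variance formula denotes the mean of the NB component (before dropout); the function names and parameter values are hypothetical.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean `mu` and dispersion parameter `size`."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout (zero); otherwise it is NB distributed."""
    return (d if k == 0 else 0.0) + (1 - d) * nb_pmf(k, mu, size)

mu, size, d = 5.0, 2.0, 0.3  # made-up parameter values
probs = [zinb_pmf(k, mu, size, d) for k in range(2000)]
mean_zinb = sum(k * p for k, p in enumerate(probs))
var_zinb = sum((k - mean_zinb) ** 2 * p for k, p in enumerate(probs))

expected_mean = mu * (1 - d)
expected_var = mu * (1 - d) * (1 + d * mu + mu / size)
print(mean_zinb, expected_mean, var_zinb, expected_var)
```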

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]

$$\mu = g \cdot \frac{a}{a + b}$$

$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
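A small numerical sketch of the Poisson-Beta moments, under the assumption that the two formulas on this slide describe the latent transcription rate g·p with p drawn from Beta(a, b). For counts drawn from a Poisson at that rate, the law of total variance adds the mean to the variance, which the last line illustrates. Function names and parameter values are made up.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution on (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def rate_moments(a, b, g, steps=20000):
    """Mean and variance of the rate g * p, p ~ Beta(a, b), by midpoint integration."""
    h = 1.0 / steps
    xs = [(i + 0.5) * h for i in range(steps)]
    m = sum(g * x * beta_pdf(x, a, b) for x in xs) * h
    v = sum((g * x - m) ** 2 * beta_pdf(x, a, b) for x in xs) * h
    return m, v

a, b, g = 2.0, 3.0, 10.0  # made-up parameter values
rate_mean, rate_var = rate_moments(a, b, g)
formula_mean = g * a / (a + b)
formula_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)

# Poisson counts given the rate: Var(count) = E[rate] + Var(rate).
count_var = rate_mean + rate_var
print(rate_mean, formula_mean, rate_var, formula_var, count_var)
```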

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable

The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:

$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$

$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$

where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
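To make the two-part idea concrete, here is a minimal sketch of a hurdle-style likelihood ratio test for one gene in a two-group design. It is not MAST's implementation: there is no empirical Bayes regularisation and no covariates, and the function names and expression values are hypothetical. With only a group indicator in the design, the logistic part reduces to comparing detection proportions and the Gaussian part to comparing means of the expressed cells. The sketch assumes each group has at least one expressed cell and some spread among the positive values, and it relies on the asymptotic chi-square distribution, which is rough for so few cells.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detected cells out of n."""
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def rss(values, centre):
    """Residual sum of squares around a given centre."""
    return sum((v - centre) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-part likelihood ratio test; returns (statistic, p-value)."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    # Discrete part: does the detection rate Pr(Z = 1) differ between groups?
    ll_full = (bernoulli_loglik(len(pos_a), len(group_a))
               + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_full - ll_null)
    # Continuous part: does the mean of Y | Z = 1 differ (shared variance)?
    pooled = pos_a + pos_b
    rss_full = (rss(pos_a, sum(pos_a) / len(pos_a))
                + rss(pos_b, sum(pos_b) / len(pos_b)))
    rss_null = rss(pooled, sum(pooled) / len(pooled))
    stat_continuous = len(pooled) * math.log(rss_null / rss_full)
    stat = max(0.0, stat_discrete + stat_continuous)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x / 2).
    return stat, math.exp(-stat / 2)

# Hypothetical log-scale expression values for one gene in two groups of cells.
group_a = [0.0, 0.0, 0.0, 1.1, 0.9, 1.3]
group_b = [2.9, 3.2, 0.0, 3.1, 2.7, 3.0]
stat, p_value = hurdle_lrt(group_a, group_b)
print(stat, p_value)
```

Summing the two statistics lets a gene be called DE because of a change in detection rate, a change in expression level among expressing cells, or both.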

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit the parameters of the mixture of models via the EM algorithm

for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
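The mixture idea can be illustrated with a fixed two-component model: given a count, how likely is it to have come from the low-rate Poisson (dropout) component rather than the NB (amplified) component? This is only a sketch with made-up, fixed parameters; in SCDE the mixture parameters are fitted by EM, which is not reproduced here, and the function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf with rate `lam`."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean `mu` and dispersion parameter `size`."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def dropout_posterior(k, pi_drop, lam, mu, size):
    """Posterior probability that count k was generated by the dropout component."""
    drop = pi_drop * poisson_pmf(k, lam)
    amplified = (1 - pi_drop) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)

# Made-up mixture: 30% dropouts with background rate 0.1, NB with mean 50.
post_zero = dropout_posterior(0, pi_drop=0.3, lam=0.1, mu=50.0, size=2.0)
post_high = dropout_posterior(40, pi_drop=0.3, lam=0.1, mu=50.0, size=2.0)
print(post_zero, post_high)
```

A zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed to the NB component.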

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$

where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level $Y$ is then modeled using a GAM

$$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data

  • using data we know something about to get positive controls

Simulated data

  • null-data sets by re-sampling
  • modeling data sets based on various distributions

Comparing between methods and scenarios

  • comparing numbers of DEs, incl. as a function of group size

Investigating results

  • what do the expression levels and distributions of the detected DEs look like?
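One way to build a null data set by re-sampling is to draw cells from a single homogeneous condition and split them at random into two artificial groups, so that any gene called DE between them is a false positive. The sketch below assumes this reading of "null-data sets by re-sampling"; the function name `null_split` and the cell identifiers are hypothetical.

```python
import random

def null_split(cell_ids, n_per_group, seed=0):
    """Sample 2 * n_per_group cells without replacement and split them into two groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    picked = rng.sample(cell_ids, 2 * n_per_group)
    return picked[:n_per_group], picked[n_per_group:]

# Hypothetical identifiers for 96 cells that all come from the same condition.
cells = [f"cell_{i:02d}" for i in range(96)]
group1, group2 = null_split(cells, n_per_group=12, seed=1)
print(group1[:3], group2[:3])
```

Repeating the split with different seeds and group sizes gives a set of null instances on which the number of reported DE genes can be compared across methods.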


  • False positives (type I error) vs false negatives (type II error)
  • Sensitivity and specificity
  • Precision and recall

[Figure adapted from Wikipedia]

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
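When the truth is known, as in simulated data, these quantities follow directly from the four cells of the confusion matrix. A minimal sketch, with a made-up example of ten genes and a hypothetical function name:

```python
def classification_metrics(truth, called):
    """truth / called: lists of booleans, one per gene (True = differentially expressed)."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)          # true positives
    fp = sum(1 for t, c in zip(truth, called) if not t and c)      # type I errors
    fn = sum(1 for t, c in zip(truth, called) if t and not c)      # type II errors
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)  # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Ten hypothetical genes: the first four are truly DE, the method calls four genes.
truth = [True] * 4 + [False] * 6
called = [True, True, True, False, False, True, False, False, False, False]
result = classification_metrics(truth, called)
print(result)
```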

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"

  • 36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
  • based on 9 data sets with 11-21 separate instances (sample size effect)
  • extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
  • conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities


Getting to know your data

Example data: 46078 genes x 96 cells; 22229 genes with no expression at all

[Figure: two histograms, one of read counts and one of 0 counts (y-axis: frequency)]
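Summaries like the ones on this slide are easy to compute from a genes-by-cells count matrix. The sketch below uses a tiny made-up matrix stored as a list of rows; the function name `summarise_zeros` is hypothetical.

```python
def summarise_zeros(counts):
    """counts: list of rows (genes), each a list of per-cell read counts."""
    # Genes with no expression at all: every cell has a zero count.
    n_all_zero = sum(1 for row in counts if all(c == 0 for c in row))
    # Number of zero counts per gene, e.g. as input for a histogram.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    return n_all_zero, zeros_per_gene

# Made-up example: 4 genes x 3 cells.
toy_counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 1],
]
n_all_zero, zeros_per_gene = summarise_zeros(toy_counts)
print(n_all_zero, zeros_per_gene)
```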

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

  • QC filtering
  • Cell-cycle phase
  • Normalization of cell-specific biases
  • Confounding factors, incl. batch effects
  • Detection rate, i.e. the fraction of detected genes per cell
  • Imputation strategies for dropout values
  • What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

  • scRNA-seq is a rapidly growing field
  • DE is a common task, so many newer and better methods will be developed
  • understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
  • staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 26: scRNA-seq - Differential expression analyses

Common methods More detailed examples

Letrsquos stop for a minute

Olga (NBIS) scRNA-seq DE February 2019 19 46

Common methods More detailed examples

The key

Outcomei = (Modeli) + errori

we collect data on a sample from a much larger population

statistics lets us to make inferences about the population from which sample wasderived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common methods More detailed examples

Common distributions

Negative Binomial

Read Counts

Fre

quen

cy

0 5 10 15 20

010

020

030

040

050

0

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10

050

100

150

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10 12

050

100

150

200

micro = mu

δ2 = mu + mu2

size

mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)

Olga (NBIS) scRNA-seq DE February 2019 24 46

Common methods More detailed examples

Common distributions

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 5 10 15 20

020

040

0

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 10 20 30 40

010

020

0micro = mu lowast (1 minus d)

δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)

d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE

Olga (NBIS) scRNA-seq DE February 2019 25 46

Common methods More detailed examples

Common distributions

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100

020

040

0

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100 140

050

150

micro = g lowast a(a + b)

δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)

a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm

Olga (NBIS) scRNA-seq DE February 2019 26 46

Common methods More detailed examples

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 27 46

Common methods More detailed examples

SCDE

models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)

subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels

For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach

Olga (NBIS) scRNA-seq DE February 2019 28 46

Common methods More detailed examples

Monocole

Originally designed for ordering cells by progress through differentiation stages(pseudo-time)

The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as

g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions

(eg cubic splines)

The observable expression level Y is then modeled using GAM

E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero

The DE test is performed using an approx χ2 likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 29 46

Performance

Performance

Olga (NBIS) scRNA-seq DE February 2019 30 46

Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612

Performance

Olga (NBIS) scRNA-seq DE February 2019 31 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

adapted from Wikipedia

Olga (NBIS) scRNA-seq DE February 2019 33 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)

Olga (NBIS) scRNA-seq DE February 2019 34 46

Performance

Consistency

Miao et al 2017

Olga (NBIS) scRNA-seq DE February 2019 35 46

Performance

And so much more

Soneson and Robinson 2018

Bias robustness and scalability in single-celldifferential expression analysis

36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells

based on 9 data sets with 11 - 21separate instances (sample size effect)

extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc

conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets

Olga (NBIS) scRNA-seq DE February 2019 36 46

Practicalities

Practicalities

Olga (NBIS) scRNA-seq DE February 2019 37 46

Practicalities

Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all

Read Counts

Fre

quen

cy

0 500 1000 1500

050

0015

000

0 counts

Fre

quen

cy

0 20 40 60 800

2000

4000

6000

Olga (NBIS) scRNA-seq DE February 2019 38 46

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 39 46

Practicalities

Page 27: scRNA-seq - Differential expression analyses

Common methods More detailed examples

The key

$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$

we collect data on a sample from a much larger population

statistics lets us make inferences about the population from which the sample was derived

we try to predict the outcome given a model fitted to the data

Olga (NBIS) scRNA-seq DE February 2019 20 46

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

[Figure: histogram of height [cm]; y-axis: frequency]
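The t statistic above can be computed by hand in a few lines. This is a minimal sketch using only the Python standard library; the height values and the function name `pooled_t_statistic` are made up for illustration and are not from the slides.

```python
import math
import statistics


def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    # Pooled standard deviation: weighted average of the two sample variances.
    pooled_sd = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2))
    return (mean_1 - mean_2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical height measurements [cm] for two groups.
heights_a = [170.0, 172.0, 174.0, 176.0]
heights_b = [166.0, 168.0, 170.0, 172.0]
t_value = pooled_t_statistic(heights_a, heights_b)
print(round(t_value, 3))
```

The larger the difference in means relative to the pooled spread, the larger the absolute value of t.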

Generic recipe

model data, e.g. gene expression with random error
fit the model to the data and/or the data to the model
estimate model parameters
use the model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics

Common distributions

[Figure: three histograms titled "Negative Binomial"; x-axis: read counts, y-axis: frequency]

$\mu = mu$

$\sigma^2 = mu + \frac{mu^2}{size}$

mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. The NB distribution fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
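To make the mean-variance relationship concrete, the sketch below evaluates the NB variance formula and inverts it to get a simple method-of-moments estimate of size. The counts and the function names are hypothetical; real DE packages estimate dispersion with more refined procedures.

```python
import statistics


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def estimate_size(counts):
    """Method-of-moments estimate of size, solving var = mean + mean^2 / size."""
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    if var <= mean:
        # No overdispersion relative to a Poisson: size is effectively infinite.
        return float("inf")
    return mean ** 2 / (var - mean)


# Hypothetical read counts of one gene across ten cells.
counts = [0, 1, 3, 0, 7, 2, 0, 12, 1, 4]
size_hat = estimate_size(counts)
print(statistics.mean(counts), statistics.variance(counts), size_hat)
```

A small size means a variance far above the mean; as size grows, the variance approaches the mean and the NB approaches a Poisson.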

[Figure: two histograms titled "Zero-inflated NB"; x-axis: read counts, y-axis: frequency]

$\mu = mu \cdot (1 - d)$

$\sigma^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + mu / size)$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
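The zero-inflated moments can be checked against a direct mixture calculation: a count is zero with probability d and otherwise drawn from an NB with mean mu and dispersion size. The sketch below is illustrative only; the parameter values and function names are assumptions.

```python
def zinb_moments(mu, size, d):
    """Mean and variance of a zero-inflated NB with dropout rate d."""
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance


def zinb_moments_from_mixture(mu, size, d):
    """Same moments, derived from E[Y^2] - E[Y]^2 of the two-component mixture."""
    nb_variance = mu + mu ** 2 / size
    mean = (1 - d) * mu
    second_moment = (1 - d) * (nb_variance + mu ** 2)
    return mean, second_moment - mean ** 2


# Hypothetical parameters: NB mean 5, size 2, 30% dropouts.
print(zinb_moments(5.0, 2.0, 0.3))
print(zinb_moments_from_mixture(5.0, 2.0, 0.3))
```

Setting d to 0 recovers the plain NB mean and variance.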

[Figure: two histograms titled "Poisson-Beta"; x-axis: read counts, y-axis: frequency]

$\mu = g \cdot \frac{a}{a + b}$

$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

These models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
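A small sketch of the Poisson-Beta moments as written on the slide. The function name and parameter values are hypothetical. Note that the variance expression is that of the Beta-distributed rate scaled by g; by the law of total variance, Poisson sampling of the counts adds a further term equal to the mean.

```python
def poisson_beta_moments(a, b, g):
    """Mean and variance expressions for the Poisson-Beta model, as given on the slide."""
    mean = g * a / (a + b)
    # Variance of the scaled Beta rate g * Beta(a, b).
    variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, variance


# Hypothetical gene: equal activation and inhibition rates, production rate 100.
mean, rate_variance = poisson_beta_moments(a=2.0, b=2.0, g=100.0)
count_variance = rate_variance + mean  # adds the Poisson sampling term
print(mean, rate_variance, count_variance)
```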

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
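The two parts of the hurdle model can be illustrated in the simplest possible case, where the design matrix contains only an intercept. Then the maximum-likelihood intercept of the logistic part is the logit of the detection rate, and that of the Gaussian part is the mean of the non-zero values. This is a sketch of the idea, not the MAST implementation; the data and the function name are hypothetical.

```python
import math
import statistics


def hurdle_summary(log_expression):
    """Intercept-only hurdle fit for one gene: discrete part Z and continuous part Y | Z = 1.

    Assumes at least one zero and at least one non-zero value.
    """
    z = [1 if y > 0 else 0 for y in log_expression]
    positive = [y for y in log_expression if y > 0]
    detection_rate = sum(z) / len(z)
    # Discrete part: the intercept is the logit of the fraction of expressing cells.
    beta_discrete = math.log(detection_rate / (1 - detection_rate))
    # Continuous part: the intercept is the mean of the non-zero values.
    beta_continuous = statistics.mean(positive)
    return detection_rate, beta_discrete, beta_continuous


# Hypothetical log-expression values of one gene across eight cells.
gene = [0.0, 0.0, 3.1, 2.9, 0.0, 3.0, 0.0, 3.2]
rate, beta_d, beta_c = hurdle_summary(gene)
print(rate, beta_d, beta_c)
```

A group effect on either part would show up as a difference in the detection rate, in the mean of the expressing cells, or in both.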

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

the NB distribution models the transcripts that are amplified and detected

the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
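The mixture idea can be sketched as follows: given fixed mixture parameters, Bayes' rule gives the probability that an observed count came from the background Poisson component rather than the NB component. All parameter values and function names below are assumptions chosen for illustration; SCDE itself estimates such parameters with EM and this is not its code.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterized by mean mu and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, dropout_weight, background_rate, mu, size):
    """Posterior probability that count k belongs to the background (dropout) component."""
    background = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return background / (background + amplified)


# Hypothetical parameters: 30% dropouts, background rate 0.1, NB mean 50, size 2.
for count in (0, 1, 20):
    print(count, dropout_posterior(count, 0.3, 0.1, 50.0, 2.0))
```

With these values a zero count is attributed almost entirely to the dropout component, while a count of 20 is attributed to the NB component.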

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using the GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test
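The likelihood ratio test can be illustrated in a much simpler setting than a GAM: a Gaussian model with one common mean (null) against a model with one mean per group (alternative). For Gaussian errors the statistic is n log(RSS0 / RSS1), compared to a $\chi^2$ distribution with one degree of freedom. The sketch below is a stand-in under these assumptions, not Monocle's procedure; the data and function name are hypothetical.

```python
import math


def gaussian_lrt(group_a, group_b):
    """Likelihood ratio test of a common mean versus group-specific means.

    Assumes the groups are not both constant, so the residual sum of squares is positive.
    """
    values = list(group_a) + list(group_b)
    n = len(values)
    grand_mean = sum(values) / n
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    rss_null = sum((v - grand_mean) ** 2 for v in values)
    rss_alt = (sum((v - mean_a) ** 2 for v in group_a)
               + sum((v - mean_b) ** 2 for v in group_b))
    statistic = max(0.0, n * math.log(rss_null / rss_alt))
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical log-expression values of one gene in two groups of cells.
stat, p = gaussian_lrt([1.0, 1.2, 0.8, 1.1], [3.0, 3.2, 2.9, 3.1])
print(stat, p)
```

A large statistic means the richer model explains the data much better than the null model, which yields a small p-value.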

Performance


Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios

comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression and distributions of detected DEs look like?


False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
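These quantities all derive from the same four counts. The sketch below uses the simulation size quoted above (10 000 genes, 2 000 truly DE) together with a made-up result for a hypothetical method that calls 1 800 genes, 1 500 of them correctly; those two numbers are assumptions, not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity (recall), specificity and precision from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true DE genes detected (recall)
        "specificity": tn / (tn + fp),  # fraction of non-DE genes correctly left out
        "precision": tp / (tp + fp),    # fraction of called genes that are truly DE
    }


n_genes, n_true_de = 10_000, 2_000
n_called, n_called_correct = 1_800, 1_500  # hypothetical method output

tp = n_called_correct
fp = n_called - n_called_correct          # type I errors
fn = n_true_de - n_called_correct         # type II errors
tn = n_genes - tp - fp - fn

metrics = classification_metrics(tp, fp, tn, fn)
print(metrics)
```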

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells

based on 9 data sets with 11–21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities


Getting to know your data

Example data: 46078 genes x 96 cells; 22229 genes with no expression at all

[Figure: two histograms of frequency; x-axes: read counts and 0 counts]
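A first look at the zeros in a count matrix takes only a few lines. The sketch below uses a tiny made-up genes x cells matrix stored as a list of lists; a real data set would normally be held in a dedicated matrix structure.

```python
# Hypothetical count matrix: rows are genes, columns are cells.
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 1],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]

# Genes with no expression in any cell.
n_all_zero = sum(1 for gene in counts if all(c == 0 for c in gene))

# Number of zero counts per gene (the quantity one would plot as a histogram).
zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]

print(n_all_zero, zeros_per_gene)
```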

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
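The detection rate mentioned in the list above is simple to compute and can then be supplied as a covariate to methods that accept one. A minimal sketch, assuming a small made-up genes x cells count matrix and a hypothetical function name:

```python
def detection_rate_per_cell(counts):
    """Fraction of genes with a non-zero count in each cell (counts is genes x cells)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in counts if gene[cell] > 0) / n_genes
        for cell in range(n_cells)
    ]


# Hypothetical count matrix: 4 genes x 3 cells.
counts = [
    [3, 0, 1],
    [0, 0, 2],
    [4, 0, 0],
    [1, 1, 5],
]
rates = detection_rate_per_cell(counts)
print(rates)
```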

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

Olga (NBIS) scRNA-seq DE February 2019 46 46

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Page 28: scRNA-seq - Differential expression analyses

Common methods More detailed examples

The key

t = x1minusx2

sp

radic1

n1+ 1

n2

height [cm]

Fre

quen

cy

165 170 175 180

010

3050

Olga (NBIS) scRNA-seq DE February 2019 21 46

Common methods More detailed examples

Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference

Olga (NBIS) scRNA-seq DE February 2019 22 46

Common methods More detailed examples

Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference

Important implicationthe better model fits to the data the better statistics

Olga (NBIS) scRNA-seq DE February 2019 23 46

Common methods More detailed examples

Common distributions

Negative Binomial

Read Counts

Fre

quen

cy

0 5 10 15 20

010

020

030

040

050

0

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10

050

100

150

Negative Binomial

Read Counts

Fre

quen

cy

0 2 4 6 8 10 12

050

100

150

200

micro = mu

δ2 = mu + mu2

size

mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)

Olga (NBIS) scRNA-seq DE February 2019 24 46

Common methods More detailed examples

Common distributions

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 5 10 15 20

020

040

0

Zerominusinflated NB

Read Counts

Fre

quen

cy

0 10 20 30 40

010

020

0micro = mu lowast (1 minus d)

δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)

d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE

Olga (NBIS) scRNA-seq DE February 2019 25 46

Common methods More detailed examples

Common distributions

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100

020

040

0

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100 140

050

150

micro = g lowast a(a + b)

δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)

a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm

Olga (NBIS) scRNA-seq DE February 2019 26 46

Common methods More detailed examples

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 27 46

Common methods More detailed examples

SCDE

models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)

subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels

For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach

Olga (NBIS) scRNA-seq DE February 2019 28 46

Common methods More detailed examples

Monocole

Originally designed for ordering cells by progress through differentiation stages(pseudo-time)

The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as

g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions

(eg cubic splines)

The observable expression level Y is then modeled using GAM

E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero

The DE test is performed using an approx χ2 likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 29 46

Performance

Performance

Olga (NBIS) scRNA-seq DE February 2019 30 46

Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612

Performance

Olga (NBIS) scRNA-seq DE February 2019 31 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

adapted from Wikipedia

Olga (NBIS) scRNA-seq DE February 2019 33 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)

Olga (NBIS) scRNA-seq DE February 2019 34 46

Performance

Consistency

Miao et al 2017

Olga (NBIS) scRNA-seq DE February 2019 35 46

Performance

And so much more

Soneson and Robinson 2018

Bias robustness and scalability in single-celldifferential expression analysis

36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells

based on 9 data sets with 11 - 21separate instances (sample size effect)

extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc

conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets

Olga (NBIS) scRNA-seq DE February 2019 36 46

Practicalities

Practicalities

Olga (NBIS) scRNA-seq DE February 2019 37 46

Practicalities

Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all

Read Counts

Fre

quen

cy

0 500 1000 1500

050

0015

000

0 counts

Fre

quen

cy

0 20 40 60 800

2000

4000

6000

Olga (NBIS) scRNA-seq DE February 2019 38 46

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Olga (NBIS) scRNA-seq DE February 2019 39 46

Practicalities

Rembering the bigger picture

Stegle Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl batcheffects

Detection rate ie the fraction ofdetected genes per cell

Imputations strategies for dropoutvalues

What is pragmatic programminglanguage platform speedcollaborative workflows etc

Olga (NBIS) scRNA-seq DE February 2019 40 46

Practicalities

Staying critical

Olga (NBIS) scRNA-seq DE February 2019 41 46

Summary

What to remember from this hour

httpswwwmenticom amp 25 06 78

Olga (NBIS) scRNA-seq DE February 2019 42 46

Summary

Growing field

Angerer et al 2017

Olga (NBIS) scRNA-seq DE February 2019 43 46

Summary

Growing field

httpswwwscrna-toolsorgtools

Zappia Phipson and Oshlack 2018

Olga (NBIS) scRNA-seq DE February 2019 44 46

Summary

SummaryscRNA-seq is a rapidly growing field

DE is a common task so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set

staying critical staying updated staying connected

Olga (NBIS) scRNA-seq DE February 2019 45 46

Bibliography

Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data,” no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.


Common methods: More detailed examples

Generic recipe

model, e.g. gene expression with random error

fit model to the data and/or data to the model

estimate model parameters

use model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics
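As a minimal sketch of the generic recipe, the example below models counts for a single gene with a Poisson distribution, fits the one parameter by maximum likelihood, and uses the fitted model to predict the fraction of zeros. The counts are made up for illustration and the Poisson choice is an assumption made only to keep the example short. The gap between the predicted and observed fraction of zeros shows why model fit matters for the statistics that follow.

```python
import math


def fit_poisson(counts):
    """Estimate the Poisson rate by maximum likelihood (the sample mean)."""
    return sum(counts) / len(counts)


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson model with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical counts for one gene across ten cells (not real data).
counts = [0, 0, 0, 0, 0, 0, 1, 2, 9, 18]

lam = fit_poisson(counts)                               # fit model, estimate parameter
expected_zero_fraction = poisson_pmf(0, lam)            # use model for prediction
observed_zero_fraction = counts.count(0) / len(counts)  # compare with the data

print(f"fitted rate: {lam:.2f}")
print(f"zeros expected under the model: {expected_zero_fraction:.3f}")
print(f"zeros observed in the data:     {observed_zero_fraction:.3f}")
```

The model predicts far fewer zeros than are observed, so any test built on this Poisson fit would rest on a poor description of the data.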

Common distributions

Negative Binomial

[Figure: three histograms of Negative Binomial read counts; x-axis: Read Counts, y-axis: Frequency]

$\mu = mu$

$\sigma^2 = mu + \frac{mu^2}{size}$

mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
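The relation between mean, size and variance can be checked numerically. The sketch below evaluates the NB variance formula and inverts it to get method-of-moments estimates of mu and size from a vector of counts. Method of moments is used here only because it is short; it is not the estimator used by the packages named in these slides, and the function names and counts are hypothetical.

```python
import math
import statistics


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def estimate_nb_params(counts):
    """Method-of-moments estimates of (mu, size) from observed counts."""
    mu = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    if var <= mu:
        # No overdispersion relative to Poisson: size is effectively infinite.
        return mu, math.inf
    return mu, mu ** 2 / (var - mu)


# Hypothetical counts for one gene (not real data).
counts = [0, 2, 4, 10, 14]
mu, size = estimate_nb_params(counts)
print(f"mu = {mu}, size = {size:.3f}, implied variance = {nb_variance(mu, size):.1f}")
```

A small size means a variance far above the mean; as size grows, the variance approaches the mean and the NB approaches a Poisson distribution.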

Zero-inflated NB

[Figure: two histograms of zero-inflated NB read counts; x-axis: Read Counts, y-axis: Frequency]

$\mu = mu \cdot (1 - d)$

$\sigma^2 = \mu \cdot (1 - d) \cdot (1 + d \cdot \mu + \frac{\mu}{size})$

d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
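A small worked example may help with the zero-inflated NB moments. The sketch below reads the mean on the right-hand side of the variance formula as the NB mean mu before dropout, which is an assumption about the slide's notation; with that reading the formula agrees with the standard zero-inflated NB variance and reduces to the plain NB variance when d = 0. Function names and parameter values are hypothetical.

```python
def zinb_mean(mu, d):
    """Mean of a zero-inflated NB: the NB mean scaled by the non-dropout fraction."""
    return mu * (1 - d)


def zinb_variance(mu, size, d):
    """Variance of a zero-inflated NB with NB mean mu, dispersion size, dropout rate d."""
    return mu * (1 - d) * (1 + d * mu + mu / size)


# Hypothetical parameters: NB mean 10, size 2, half of the values dropped out.
print(zinb_mean(10, 0.5))         # mean is halved by dropout
print(zinb_variance(10, 2, 0.5))  # variance with dropout
print(zinb_variance(10, 2, 0.0))  # no dropout: plain NB variance mu + mu^2 / size
```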

Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts; x-axis: Read Counts, y-axis: Frequency]

$\mu = g \cdot \frac{a}{a + b}$

$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
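The Poisson-Beta model can be simulated in two steps: draw the fraction of time the locus is active from a Beta(a, b) distribution, then draw a Poisson count with rate g times that fraction. The sketch below does this with the standard library only. The variance formula on the slide is the variance of the rate g * Beta(a, b); by the law of total variance, simulated counts carry an extra Poisson term equal to the mean. The Poisson sampler (Knuth's method), the function names and the parameter values are choices made for this illustration, not part of BPSC.

```python
import math
import random
import statistics


def poisson_beta_mean(a, b, g):
    """Mean of the Poisson-Beta model: g * a / (a + b)."""
    return g * a / (a + b)


def poisson_beta_rate_variance(a, b, g):
    """Variance of the underlying rate g * Beta(a, b), as given on the slide."""
    return g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)


def sample_poisson(lam, rng):
    """Draw one Poisson count with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_poisson_beta(a, b, g, n, seed=0):
    """Draw n counts: Beta-distributed activity, then Poisson noise on top."""
    rng = random.Random(seed)
    return [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(n)]


# Hypothetical parameters.
a, b, g = 2, 2, 12
counts = sample_poisson_beta(a, b, g, n=20000, seed=1)
print("theoretical mean:", poisson_beta_mean(a, b, g))
print("simulated mean:  ", statistics.mean(counts))
print("rate variance + mean:",
      poisson_beta_rate_variance(a, b, g) + poisson_beta_mean(a, b, g))
print("simulated variance:  ", statistics.variance(counts))
```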

MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
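The two-part structure of a hurdle model can be illustrated with a deliberately simplified two-group test for a single gene. The sketch below is not MAST: it has no design matrix beyond a group indicator, no covariates and no empirical Bayes shrinkage. It fits a binomial model to the detected/not-detected indicator and a Gaussian model with common variance to the detected values, then adds the two likelihood ratio statistics and compares the sum with a chi-square distribution with 2 degrees of freedom (an assumption of this sketch). All names and data are hypothetical.

```python
import math


def binom_loglik(k, n):
    """Binomial log-likelihood of k detections in n cells at the MLE p = k / n."""
    if k == 0 or k == n:
        return 0.0  # p is 0 or 1, so the observed data have probability 1
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def rss(values):
    """Residual sum of squares around the mean of the values."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-group hurdle likelihood ratio test for one gene.

    Inputs are log-scale expression values, with 0 meaning "not detected".
    Returns (statistic, p_value), using a chi-square with 2 degrees of freedom.
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: does the detection rate differ between the groups?
    lr_discrete = 2 * (
        binom_loglik(len(pos_a), len(group_a))
        + binom_loglik(len(pos_b), len(group_b))
        - binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    )

    # Continuous part: does the mean of the detected values differ?
    # Gaussian model with a common variance; skipped if it cannot be fitted.
    lr_continuous = 0.0
    if pos_a and pos_b:
        rss_null = rss(pos_a + pos_b)
        rss_alt = rss(pos_a) + rss(pos_b)
        if rss_alt > 0:
            lr_continuous = len(pos_a + pos_b) * math.log(rss_null / rss_alt)

    statistic = max(lr_discrete + lr_continuous, 0.0)
    # The survival function of a chi-square with 2 degrees of freedom is exp(-x / 2).
    return statistic, math.exp(-statistic / 2)


# Hypothetical log-expression values for one gene in two groups of six cells.
stat, p_value = hurdle_lrt([0, 0, 0, 0, 1.0, 1.2], [2.0, 2.2, 2.4, 2.6, 2.8, 3.0])
print(f"statistic = {stat:.2f}, p-value = {p_value:.2g}")
```

Here both parts contribute: the gene is detected in fewer cells of the first group, and where detected its level is lower.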

SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
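To make the mixture idea concrete, the sketch below computes, for one observed count, the posterior probability that it came from the Poisson (dropout or background) component rather than the NB (amplified) component. This is the kind of quantity an EM algorithm evaluates in its E-step for a two-component mixture. It is an illustration only, not SCDE's implementation; the mixing weight, background rate and NB parameters are hypothetical values.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam (computed on the log scale)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k with mean mu and dispersion size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def dropout_posterior(k, dropout_weight, background_rate, mu, size):
    """Posterior probability that count k belongs to the Poisson (dropout) component."""
    dropout = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return dropout / (dropout + amplified)


# Hypothetical mixture: 30% dropout, background rate 0.1, NB mean 10 and size 2.
for k in (0, 1, 5, 20):
    print(k, dropout_posterior(k, 0.3, 0.1, 10, 2))
```

A zero count is attributed mostly to dropout under these settings, while a count of 20 is attributed almost entirely to the amplified component.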

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance


Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
--- | --- | --- | --- | --- | ---
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth is available for testing

Known data

using data we know something about to get positive controls

Simulated data

null data sets by re-sampling

modeling data sets based on various distributions

Comparing between methods and scenarios

Comparing numbers of DEs, incl. as a function of group size

Investigating results

What do the expression and distributions of detected DEs look like?
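One way to build a null data set by re-sampling is to take cells from a single homogeneous population and split them at random into two mock groups. No gene is truly DE between such groups, so every gene a method calls significant is a false positive. The sketch below shows only the splitting step; the function name, group size and cell identifiers are hypothetical.

```python
import random


def make_null_split(cell_ids, group_size, seed=0):
    """Randomly draw two disjoint mock groups from one population of cells."""
    rng = random.Random(seed)
    chosen = rng.sample(list(cell_ids), 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# Hypothetical population of 96 cells from one condition.
cells = [f"cell{i}" for i in range(96)]
group_a, group_b = make_null_split(cells, group_size=24, seed=1)
print(len(group_a), len(group_b), len(set(group_a) & set(group_b)))
```

Repeating the split with different seeds and group sizes gives several null instances, which also shows how the number of false positives depends on group size.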


False positives (type I error) vs. false negatives (type II error)

Sensitivity and specificity

Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
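When the truth is known, as in a simulation with 10 000 genes of which 2 000 are DE, these metrics follow directly from the four cells of the confusion matrix. The sketch below computes them; the numbers of true and false calls are invented for illustration and are not results from any of the cited studies.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall or power
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),            # 1 - false discovery proportion
        "false_positive_rate": fp / (fp + tn),  # observed type I error rate
    }


# Hypothetical outcome of one method on a simulation with 2 000 DE genes
# out of 10 000: 1 500 DE genes found, 100 non-DE genes called by mistake.
metrics = classification_metrics(tp=1500, fp=100, fn=500, tn=7900)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```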

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: two histograms; left panel x-axis: Read Counts, right panel x-axis: 0 counts; y-axis: Frequency]
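A first look at a count matrix can be as simple as counting zeros. The sketch below, on a tiny made-up matrix with genes in rows and cells in columns, counts the genes with no expression at all, the number of zero counts per gene, and the detection rate per cell (the fraction of genes detected in that cell). The matrix and function names are hypothetical; a real analysis would typically use a matrix library rather than nested lists.

```python
def genes_without_expression(matrix):
    """Number of genes (rows) whose counts are zero in every cell."""
    return sum(1 for row in matrix if all(count == 0 for count in row))


def zero_counts_per_gene(matrix):
    """Number of cells with a zero count, for each gene."""
    return [row.count(0) for row in matrix]


def detection_rate_per_cell(matrix):
    """Fraction of genes with a non-zero count, for each cell (column)."""
    n_genes = len(matrix)
    n_cells = len(matrix[0])
    return [
        sum(1 for row in matrix if row[cell] > 0) / n_genes
        for cell in range(n_cells)
    ]


# Hypothetical count matrix: 4 genes (rows) x 3 cells (columns).
counts = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 0, 1],
    [3, 7, 0],
]

print(genes_without_expression(counts))
print(zero_counts_per_gene(counts))
print(detection_rate_per_cell(counts))
```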

Choosing DE methods

Soneson and Robinson 2018

Page 31: scRNA-seq - Differential expression analyses

Common methods More detailed examples

Common distributions

Negative Binomial

[Figure: three histograms of negative binomial read counts (x-axis: Read Counts; y-axis: Frequency)]

$\mu = mu$

$\sigma^2 = mu + \frac{mu^2}{size}$

mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
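As a hedged illustration of the NB mean-variance relation, the sketch below draws NB counts as a gamma-Poisson mixture using only the Python standard library and compares the empirical moments with mu and mu + mu^2/size. The function names (`nb_moments`, `sample_poisson`, `sample_nb`) and the parameter values are assumptions made for this example; they do not come from the slides.

```python
import math
import random
from statistics import fmean, pvariance


def nb_moments(mu, size):
    """Theoretical mean and variance of NB(mu, size)."""
    return mu, mu + mu ** 2 / size


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mu, size, rng):
    """NB draw: a Poisson whose rate is gamma distributed (shape=size, scale=mu/size)."""
    rate = rng.gammavariate(size, mu / size)
    return sample_poisson(rate, rng)


rng = random.Random(1)  # fixed seed so the illustration is reproducible
counts = [sample_nb(mu=5.0, size=2.0, rng=rng) for _ in range(20000)]
print("theory   :", nb_moments(5.0, 2.0))
print("empirical:", fmean(counts), pvariance(counts))
```

A smaller size gives a larger variance for the same mean, which is why size is described as inversely related to the variance.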

Olga (NBIS) scRNA-seq DE February 2019 24 46


Zero-inflated NB

[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: Read Counts; y-axis: Frequency)]

$\mu = mu \cdot (1 - d)$

$\sigma^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + \frac{mu}{size})$

d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
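The zero-inflated moments can be checked from first principles. The sketch below, written under the assumption that mu and size denote the mean and dispersion of the NB component and d the probability of a dropout zero, computes the mean and variance once from the closed form and once from the raw moments of the two-component mixture. The function names are hypothetical.

```python
import math


def nb_moments(mu, size):
    """Mean and variance of the NB component."""
    return mu, mu + mu ** 2 / size


def zinb_moments(mu, size, d):
    """Closed-form mean and variance of a zero-inflated NB with dropout rate d."""
    mean = mu * (1 - d)
    var = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, var


def zinb_moments_from_mixture(mu, size, d):
    """Same moments via E[X] and E[X^2] of the mixture (zeros contribute nothing)."""
    nb_mean, nb_var = nb_moments(mu, size)
    first = (1 - d) * nb_mean
    second = (1 - d) * (nb_var + nb_mean ** 2)
    return first, second - first ** 2


closed = zinb_moments(5.0, 2.0, 0.3)
mixture = zinb_moments_from_mixture(5.0, 2.0, 0.3)
print("closed form :", closed)
print("from mixture:", mixture)
```

Setting d = 0 recovers the plain NB moments, so the zero-inflated model nests the NB model.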


Poisson-Beta

[Figure: two histograms of Poisson-Beta read counts (x-axis: Read Counts; y-axis: Frequency)]

$\mu = g \cdot \frac{a}{a + b}$

$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$

a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.

These models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effects or library depth, depending on the particular DE algorithm.
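The following sketch simulates Poisson-Beta counts and compares them with the formulas for the mean and variance. One point worth stating as an assumption of this example rather than of the slides: the variance formula $g^2 ab / ((a+b+1)(a+b)^2)$ is the variance of the underlying rate $g \cdot p$ with $p \sim \mathrm{Beta}(a, b)$; by the law of total variance, the counts carry an additional Poisson term equal to the mean. Function names and parameter values are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def poisson_beta_moments(a, b, g):
    """Mean, variance of the rate, and variance of the counts."""
    mean = g * a / (a + b)
    rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    count_var = mean + rate_var  # Poisson noise added on top of the rate variance
    return mean, rate_var, count_var


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_poisson_beta(a, b, g, rng):
    """Draw the active fraction p ~ Beta(a, b), then a Poisson count with rate g * p."""
    return sample_poisson(g * rng.betavariate(a, b), rng)


rng = random.Random(2)  # fixed seed so the illustration is reproducible
draws = [sample_poisson_beta(0.5, 1.5, 20.0, rng) for _ in range(20000)]
pb_mean, pb_rate_var, pb_count_var = poisson_beta_moments(0.5, 1.5, 20.0)
print("theory   :", pb_mean, pb_rate_var, pb_count_var)
print("empirical:", fmean(draws), pvariance(draws))
```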


MAST (revisited)

uses a generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)

A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):

$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$

$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test
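To make the two parts of the hurdle model concrete, the sketch below splits one gene's values into the discrete part (detected or not) and the continuous part (level among detecting cells) for two groups of cells. It is a simplified illustration, not MAST's API: with a single two-level group covariate, the maximum-likelihood logistic coefficient equals the difference in logit detection rates and the Gaussian coefficient equals the difference in mean levels, so no model fitting is needed here. The empirical Bayes step and the likelihood ratio test are left out, and the expression values are invented.

```python
import math
from statistics import fmean


def logit(p):
    return math.log(p / (1 - p))


def hurdle_components(values):
    """Return (detection rate, mean level among detecting cells) for one gene."""
    z = [1 if v > 0 else 0 for v in values]   # Z_ig: detected or not
    positive = [v for v in values if v > 0]   # Y_ig given Z_ig = 1
    return fmean(z), fmean(positive)


# Hypothetical log2(TPM+1) values for one gene in two groups of cells.
group_a = [0.0, 0.0, 3.1, 2.9, 0.0, 3.4, 0.0, 3.0]
group_b = [4.2, 0.0, 4.0, 4.5, 3.9, 0.0, 4.1, 4.3]

rate_a, level_a = hurdle_components(group_a)
rate_b, level_b = hurdle_components(group_b)

beta_discrete = logit(rate_b) - logit(rate_a)   # log-odds of detection, B vs A
beta_continuous = level_b - level_a             # shift in level among detecting cells
print(f"discrete effect (log-odds): {beta_discrete:.3f}")
print(f"continuous effect (log2)  : {beta_continuous:.3f}")
```

A gene can differ between groups in either component, or in both, which is why the test combines them.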


SCDE

models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution

The NB distribution models the transcripts that are amplified and detected

The Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)

A subset of robust genes is used to fit the parameters of the mixture model via the EM algorithm

For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
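A toy version of the mixture idea is sketched below: given a Poisson background component and an NB "amplified" component, it computes the posterior probability that an observed count came from the background (dropout) component. This is the kind of responsibility an EM fit of such a mixture would update at each iteration; it is not the scde package's implementation, and the mixing weight and distribution parameters are invented for illustration.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """NB probability mass function in the (mu, size) parameterisation."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, pi_drop, lam_bg, mu, size):
    """P(background component | count k) under pi * Poisson + (1 - pi) * NB."""
    drop = pi_drop * poisson_pmf(k, lam_bg)
    amplified = (1 - pi_drop) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


for k in (0, 1, 5, 20):
    post = dropout_posterior(k, pi_drop=0.3, lam_bg=0.1, mu=20.0, size=2.0)
    print(f"count {k:2d}: P(dropout) = {post:.4f}")
```

Zero and near-zero counts are attributed mostly to the background component, while larger counts are attributed to the amplified NB component.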


Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612


No ground truth, i.e. no independently validated truth is available for testing

Known data: using data we know something about to get positive controls

Simulated data: null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size

Investigating results: what do the expression and distributions of the detected DEs look like?
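One way to build a null data set by re-sampling is to split cells from a single homogeneous population into two random groups, so that any gene called DE between them is a false positive. The sketch below shows such a split; the function name `null_split`, the cell identifiers and the group size are assumptions made for this illustration.

```python
import random


def null_split(cell_ids, group_size, seed=0):
    """Randomly draw two disjoint groups of cells from one population."""
    rng = random.Random(seed)
    picked = rng.sample(cell_ids, 2 * group_size)  # sampling without replacement
    return picked[:group_size], picked[group_size:]


cells = [f"cell_{i:02d}" for i in range(96)]  # hypothetical cell identifiers
group_1, group_2 = null_split(cells, group_size=24, seed=42)
print(len(group_1), len(group_2), len(set(group_1) & set(group_2)))
```

Repeating the split with different seeds and group sizes gives several null instances, which is how the number of false detections can be studied as a function of group size.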


False positives (type I error) vs false negatives (type II error)

Sensitivity and specificity

Precision and recall

[Figure adapted from Wikipedia]


Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 were set to be DE (based on NB and bimodal distributions)
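The metrics named above follow directly from the confusion matrix of DE calls against the simulated truth. The sketch below uses the simulation layout quoted from Dal Molin et al. (10,000 genes, 2,000 truly DE); the numbers of true and false detections are invented for illustration and are not results from that paper.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard rates derived from a 2 x 2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # fraction of calls that are truly DE
        "fpr": fp / (fp + tn),          # false positive (type I error) rate
        "fnr": fn / (fn + tp),          # false negative (type II error) rate
    }


n_genes, n_de = 10_000, 2_000
tp, fp = 1_500, 400          # hypothetical outcome of one DE method
fn = n_de - tp               # truly DE genes that were missed
tn = (n_genes - n_de) - fp   # non-DE genes correctly left uncalled

metrics = confusion_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

Because most genes are not DE, specificity can look high even when a sizeable share of the calls are false, which is why precision is reported alongside it.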


Consistency

Miao et al. 2017


And so much more

Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"

36 statistical approaches for DE analysis, used to compare the expression levels in two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: two histograms of the example data (x-axis: Read Counts, y-axis: Frequency; x-axis: 0 counts, y-axis: Frequency)]
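A first look at the zeros in a count matrix can be sketched as below, assuming the matrix is stored as a list of gene rows with one count per cell. The toy matrix and the function name `zero_summary` are made up for illustration; a real data set would be read from file and would be far larger.

```python
def zero_summary(counts):
    """Count silent genes and zeros per gene and per cell.

    counts: list of rows (genes), each a list of per-cell counts.
    """
    n_cells = len(counts[0])
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    silent_genes = sum(1 for z in zeros_per_gene if z == n_cells)
    zeros_per_cell = [sum(1 for row in counts if row[j] == 0) for j in range(n_cells)]
    return silent_genes, zeros_per_gene, zeros_per_cell


toy = [
    [0, 0, 0, 0],   # gene with no expression at all
    [5, 0, 2, 0],
    [0, 0, 0, 0],   # gene with no expression at all
    [9, 7, 0, 12],
]
silent, per_gene, per_cell = zero_summary(toy)
print("genes with no expression:", silent)
print("zeros per gene:", per_gene)
print("zeros per cell:", per_cell)
```

Genes with no expression in any cell carry no information for DE testing and are usually removed before any method is applied.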


Choosing DE methods

Soneson and Robinson 2018


Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
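Of the items above, the detection rate is the simplest to compute: for each cell, it is the fraction of genes with a non-zero count. The sketch below assumes a genes-by-cells count matrix stored as a list of rows; the toy values and the function name are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, for each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)]


toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [9, 7, 0, 12],
]
rates = detection_rate(toy_counts)
print(rates)
```

The resulting per-cell value can be added as a covariate in the design matrix, as done by the "DetRate" variants of edgeR/QLF and MAST in the comparison by Soneson and Robinson.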


Staying critical


Summary

What to remember from this hour

https://www.menti.com & 25 06 78


Growing field

Angerer et al. 2017


https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018


scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected


Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

Page 33: scRNA-seq - Differential expression analyses

Common methods More detailed examples

Common distributions

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100

020

040

0

PoissonminusBeta

Read Counts

Fre

quen

cy

0 20 60 100 140

050

150

micro = g lowast a(a + b)

δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)

a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm

Olga (NBIS) scRNA-seq DE February 2019 26 46

Common methods More detailed examples

MAST (revisited)

uses generalized linear hurdle model

designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable

The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)

A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)

logit(Pr (Zig = 1)) = XiβDg

Pr (Yig = Y |Zig = 1) = N(XiβCg σ

2g) where Xi is a design matrix

Model parameters are fitted using an empirical Bayesian framework

Allows for a joint estimate of nuisance and treatment effects

DE is determined using the likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 27 46

Common methods More detailed examples

SCDE

models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)

subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels

For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time).

The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines).

The observable expression level Y is then modeled using a GAM:

$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term ε is normally distributed with a mean of zero.

The DE test is performed using an approximate χ² likelihood ratio test.
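The likelihood ratio test over pseudo-time can be sketched as a comparison of two nested fits. This is a simplified stand-in, not Monocle's implementation: an ordinary cubic polynomial replaces the cubic smoothing function, errors are assumed Gaussian, the reduced model is intercept-only, and the statistic is compared with a χ² distribution on 3 degrees of freedom. All function names and the example data are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def residual_sum_of_squares(t, y, degree):
    """Least-squares fit of a polynomial in t; returns the residual sum of squares."""
    basis = [[ti ** d for d in range(degree + 1)] for ti in t]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(basis, y))


def chi2_sf_3df(x):
    """Survival function of a chi-squared distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def pseudotime_lrt(t, y):
    """Likelihood-ratio test: cubic trend in pseudo-time vs. constant expression."""
    rss_null = residual_sum_of_squares(t, y, 0)
    rss_full = residual_sum_of_squares(t, y, 3)
    stat = max(0.0, len(y) * math.log(rss_null / rss_full))
    return stat, chi2_sf_3df(stat)


if __name__ == "__main__":
    # Hypothetical data: 20 cells with pseudo-time in [0, 1] and a rising trend.
    t = [i / 19 for i in range(20)]
    y = [2 + 3 * ti + 0.1 * math.sin(7 * i) for i, ti in enumerate(t)]
    print(pseudotime_lrt(t, y))
```

A gene whose expression does not change along pseudo-time yields a small statistic, because the cubic fit barely reduces the residual sum of squares relative to the constant fit.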

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612

No ground truth, i.e. no independently validated truth, is available for testing.

Known data
using data we know something about to get positive controls

Simulated data
null data sets by re-sampling; modeling data sets based on various distributions

Comparing between methods and scenarios
comparing numbers of DEs, incl. as a function of group size

Investigating results
what do the expression levels and distributions of detected DEs look like?
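One way to build a null data set by re-sampling is to split cells from a single condition at random into two mock groups, so that any gene a method calls DE between them is a false positive. The sketch below shows only the splitting and the bookkeeping; the function names are hypothetical, and the p-values would come from whichever DE method is being evaluated.

```python
import random


def null_split(cell_ids, group_size, seed=0):
    """Randomly split cells from ONE condition into two disjoint mock groups."""
    if 2 * group_size > len(cell_ids):
        raise ValueError("not enough cells for two groups of this size")
    rng = random.Random(seed)
    shuffled = rng.sample(list(cell_ids), 2 * group_size)
    return shuffled[:group_size], shuffled[group_size:]


def false_positive_fraction(p_values, alpha=0.05):
    """Fraction of genes called DE in a null comparison (every call is false)."""
    return sum(p < alpha for p in p_values) / len(p_values)


if __name__ == "__main__":
    # Hypothetical cell identifiers from a single homogeneous population.
    cells = [f"cell_{i}" for i in range(96)]
    mock_a, mock_b = null_split(cells, group_size=24, seed=1)
    print(mock_a[:3], mock_b[:3])
```

Repeating the split with different seeds and group sizes gives the kind of "function of group size" comparison listed above.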


False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall

Figure adapted from Wikipedia.

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
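When the true DE genes are known, as in a simulation like the one above, these quantities follow directly from the confusion matrix. The sketch below assumes gene identifiers held in sets; the function name and the example call set (1 800 genes called, of which 1 500 are truly DE) are made up for illustration and are not results from the cited study.

```python
def de_performance(true_de, called_de, n_genes):
    """Confusion-matrix metrics for one DE method.

    true_de, called_de: sets of gene identifiers; n_genes: total genes tested.
    """
    tp = len(true_de & called_de)   # true positives
    fp = len(called_de - true_de)   # false positives (type I errors)
    fn = len(true_de - called_de)   # false negatives (type II errors)
    tn = n_genes - tp - fp - fn     # true negatives
    return {
        "sensitivity": tp / (tp + fn),            # same quantity as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_discovery_rate": fp / (tp + fp),
    }


if __name__ == "__main__":
    # 10 000 simulated genes, 2 000 of them truly DE (as in the simulation design);
    # the called set below is a hypothetical method output.
    true_de = set(range(2000))
    called_de = set(range(500, 2300))
    print(de_performance(true_de, called_de, n_genes=10_000))
```

Because most genes are not DE, specificity stays high even for a method with many false positives, which is why precision and recall are often the more informative pair.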

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all

[Figure: two histograms, one of read counts and one of zero counts; y-axis: frequency]
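Summaries like the ones on this slide (matrix dimensions, genes with no expression at all, zero counts per gene) can be computed with a few lines. The sketch below assumes a count matrix stored as a list of rows, one row per gene and one column per cell; the function name and the toy matrix are hypothetical.

```python
def summarise_counts(counts):
    """Basic summaries of a genes x cells count matrix (list of rows)."""
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0
    # Number of cells with a zero count, per gene.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # Genes that are zero in every cell, i.e. no expression at all.
    n_silent = sum(1 for z in zeros_per_gene if z == n_cells)
    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        "zeros_per_gene": zeros_per_gene,
        "genes_without_expression": n_silent,
    }


if __name__ == "__main__":
    # Toy matrix: 4 genes x 3 cells.
    toy = [
        [5, 0, 2],
        [0, 0, 0],
        [3, 1, 0],
        [0, 0, 0],
    ]
    print(summarise_counts(toy))
```

The `zeros_per_gene` list is what a histogram of zero counts would be drawn from.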

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
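Of the items above, the detection rate is the simplest to compute directly. The sketch below assumes a genes x cells count matrix stored as a list of rows and treats a gene as detected in a cell when its count exceeds a threshold; the function name, the default threshold of 0 and the toy matrix are illustrative assumptions.

```python
def detection_rate(counts, threshold=0):
    """Fraction of genes with a count above the threshold, for each cell."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in range(n_genes) if counts[gene][cell] > threshold) / n_genes
        for cell in range(n_cells)
    ]


if __name__ == "__main__":
    # Toy matrix: 4 genes x 3 cells.
    toy = [
        [5, 0, 2],
        [0, 0, 1],
        [3, 0, 0],
        [0, 0, 7],
    ]
    print(detection_rate(toy))
```

The resulting per-cell vector is the kind of value that can be supplied as a covariate to DE methods that accept one.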

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

Page 35: scRNA-seq - Differential expression analyses

Common methods More detailed examples

SCDE

models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution

NB distribution models the transcripts that are amplified and detected

Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)

subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels

For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach

Olga (NBIS) scRNA-seq DE February 2019 28 46

Common methods More detailed examples

Monocole

Originally designed for ordering cells by progress through differentiation stages(pseudo-time)

The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as

g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions

(eg cubic splines)

The observable expression level Y is then modeled using GAM

E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero

The DE test is performed using an approx χ2 likelihood ratio test

Olga (NBIS) scRNA-seq DE February 2019 29 46

Performance

Performance

Olga (NBIS) scRNA-seq DE February 2019 30 46

Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612

Performance

Olga (NBIS) scRNA-seq DE February 2019 31 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

False positives (type I error) vs. false negatives (type II error)
Sensitivity and specificity
Precision and recall

adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to DEs (based on NB and bimodal distributions)
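As a worked example of these measures, the sketch below computes them from a confusion table. The counts are invented for illustration: they only borrow the design of the simulation (10 000 genes, 2 000 truly DE) and are not results from Dal Molin, Baruzzo and Di Camillo 2017. The function name `classification_metrics` is likewise hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance measures from a 2 x 2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),          # same quantity as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
    }


# Hypothetical outcome for one method: 2 000 truly DE genes, 8 000 non-DE genes.
tp, fn = 1500, 500   # truly DE genes that were detected / missed
fp, tn = 400, 7600   # non-DE genes that were called DE / correctly left alone

metrics = classification_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the method has high specificity (0.95) but only about 79% of its calls are true positives, which is why precision and recall are usually reported alongside sensitivity and specificity when DE genes are a minority.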

Consistency

Miao et al. 2017

And so much more

Soneson and Robinson 2018

"Bias, robustness and scalability in single-cell differential expression analysis"

  36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
  based on 9 data sets with 11 - 21 separate instances (sample size effect)
  extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
  conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
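One simple way to look at consistency and similarity between methods is the overlap of their DE gene lists. The sketch below uses the Jaccard index; this measure is an assumption chosen for illustration and is not necessarily the one used in the cited studies. Method names and gene identifiers are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical DE calls from three methods on the same data set.
de_calls = {
    "methodA": {"g1", "g2", "g3", "g4"},
    "methodB": {"g2", "g3", "g4", "g5"},
    "methodC": {"g1", "g6"},
}

similarity = {
    (m1, m2): jaccard(de_calls[m1], de_calls[m2])
    for m1, m2 in combinations(sorted(de_calls), 2)
}
for pair, value in similarity.items():
    print(pair, round(value, 2))
```

The same function can be applied to the calls of one method on different subsamples of the cells, which gives a rough picture of how stable the method is as group size changes.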

Practicalities

Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all

[Figure: two histograms for the example data. Left: Frequency vs. Read Counts. Right: Frequency vs. 0 counts.]
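A first look at a count matrix along these lines might be sketched as follows. The matrix is simulated (1000 genes x 96 cells, with 480 genes forced to zero everywhere), so all numbers are assumptions and unrelated to the example data on the slide; variable names are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(0)
n_genes, n_cells, n_silent = 1000, 96, 480

# Assumed toy count matrix (genes x cells): the first n_silent genes are never
# expressed; the remaining genes are zero in roughly 60% of the cells.
counts = []
for gene in range(n_genes):
    if gene < n_silent:
        counts.append([0] * n_cells)
    else:
        counts.append(
            [rng.randint(1, 50) if rng.random() < 0.4 else 0 for _ in range(n_cells)]
        )

# Number of cells with a zero count, per gene.
zeros_per_gene = [row.count(0) for row in counts]
n_all_zero = sum(z == n_cells for z in zeros_per_gene)

# Total counts per cell.
library_sizes = [sum(row[c] for row in counts) for c in range(n_cells)]

# Coarse histogram of zeros per gene, in bins of width 20.
zero_histogram = Counter(20 * (z // 20) for z in zeros_per_gene)

print(f"genes with no expression at all: {n_all_zero} of {n_genes}")
print("zeros per gene (bin start -> number of genes):", sorted(zero_histogram.items()))
print("library size range:", min(library_sizes), "-", max(library_sizes))
```

Genes that are zero in every cell carry no information for DE testing and are typically removed before any method is run.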

Choosing DE methods

Soneson and Robinson 2018

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

  QC filtering
  Cell-cycle phase
  Normalization of cell-specific biases
  Confounding factors, incl. batch effects
  Detection rate, i.e. the fraction of detected genes per cell
  Imputation strategies for dropout values
  What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
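The detection rate mentioned in the list above is simple to compute. A minimal sketch, assuming a genes x cells matrix stored as a list of lists and treating any count above zero as "detected" (the function name and the toy matrix are hypothetical):

```python
def detection_rate(counts):
    """Per-cell fraction of genes with a count above zero (genes x cells input)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(counts[g][c] > 0 for g in range(n_genes)) / n_genes
        for c in range(n_cells)
    ]


# Toy matrix: 4 genes (rows) x 3 cells (columns).
counts = [
    [0, 3, 0],
    [5, 1, 0],
    [0, 2, 0],
    [7, 4, 1],
]
rates = detection_rate(counts)
print(rates)
```

The resulting per-cell value can be supplied as a covariate to methods that accept one, so that differences in detection rate between cells are not mistaken for differential expression.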

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv, 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

  • Outline
  • Introduction
  • Common methods
    • More detailed examples
      • Performance
      • Practicalities
      • Summary
      • Bibliography
Common methods More detailed examples

Monocle

Originally designed for ordering cells by progress through differentiation stages (pseudo-time)

The mean expression level of each gene is modeled with a GAM (generalized additive model), which relates one or more predictor variables to a response variable as

$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$

where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)

The observable expression level $Y$ is then modeled using a GAM

$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$

where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero

The DE test is performed using an approximate $\chi^2$ likelihood ratio test
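The logic of such a likelihood ratio test can be sketched in plain Python. This is a simplified stand-in, not Monocle's implementation: the smooth function is replaced by an ordinary cubic polynomial in pseudo-time (three parameters beyond the intercept), errors are assumed Gaussian, the data are simulated, and all function names are hypothetical. The full model is compared with a constant-mean model, and the statistic is referred to a chi-squared distribution with 3 degrees of freedom.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss_polynomial(t, y, degree):
    """Residual sum of squares of a least-squares polynomial fit of y on t."""
    basis = [[ti ** p for p in range(degree + 1)] for ti in t]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, row))) ** 2
        for row, yi in zip(basis, y)
    )


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared distribution with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def lrt_pseudotime(t, y):
    """LRT of a cubic trend in pseudo-time against a constant mean (Gaussian errors)."""
    n = len(y)
    rss_null = rss_polynomial(t, y, 0)   # intercept only
    rss_full = rss_polynomial(t, y, 3)   # intercept + cubic trend
    stat = max(0.0, n * math.log(rss_null / rss_full))
    return stat, chi2_sf_3df(stat)


random.seed(1)
pseudotime = [i / 99 for i in range(100)]
flat_gene = [5 + random.gauss(0, 1) for _ in pseudotime]           # no trend
trend_gene = [5 + 4 * t + random.gauss(0, 1) for t in pseudotime]  # rises with pseudo-time

stat_flat, p_flat = lrt_pseudotime(pseudotime, flat_gene)
stat_trend, p_trend = lrt_pseudotime(pseudotime, trend_gene)
print(f"flat gene:  LR = {stat_flat:.2f}, p = {p_flat:.3g}")
print(f"trend gene: LR = {stat_trend:.2f}, p = {p_trend:.3g}")
```

The statistic is $n \log(\mathrm{RSS}_0 / \mathrm{RSS}_1)$, which is twice the log-likelihood difference between the two Gaussian models; a gene whose expression changes along pseudo-time gives a large statistic and a small p-value.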

Performance

Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36]

Nature Methods: doi:10.1038/nmeth.4612

Page 38: scRNA-seq - Differential expression analyses

Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle

Short name Method Software version InputAvailablefrom

Reference

BPSC BPSC BPSC 09901 CPM GitHub [11]

D3E D3E D3E 10 raw counts GitHub [12]

DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]

DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]

DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]

DESeq2nofiltDESeq2 without the built-in in-dependent filtering

DESeq2 1141 raw counts Bioconductor [13]

DEsingle DEsingle DEsingle 010 raw counts GitHub [14]

edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]

edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]

edgeRLRTdeconvedgeRLRT with deconvolutionnormalization

edgeR 3191scran 120

raw counts Bioconductor [15 17 18]

edgeRLRTrobustedgeRLRT with robust disper-sion estimation

edgeR 3191 raw counts Bioconductor [15ndash17 19]

edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]

edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate

edgeR 3191 raw counts Bioconductor [15 16 20]

limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]

MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]

MASTcpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(CPM+1) Bioconductor [23]

MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]

MASTtpmDetRateMAST with cellular detectionrate as covariate

MAST 105 log2(TPM+1) Bioconductor [23]

metagenomeSeq metagenomeSeqmetagenomeSeq1160

raw counts Bioconductor [24]

monocle monocle (tobit) monocle 220 TPM Bioconductor [25]

monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]

monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]

NODES NODESNODES0009010

raw countsAuthor-providedlink

[27]

ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]

ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]

ROTSvoom ROTS ROTS 120voom-transformedraw counts

Bioconductor [28 29]

SAMseq SAMseq samr 20 raw counts CRAN [30]

scDD scDD scDD 100 raw counts Bioconductor [31]

SCDE SCDE scde 220 raw counts Bioconductor [32]

SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]

SeuratBimodnofiltSeurat (bimod test) without theinternal filtering

Seurat 1407 raw counts GitHub [33 34]

SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2

Seurat 1407 raw counts GitHub [33 34]

SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]

ttest t-test stats (R v 33)TMM-normalizedTPM

CRAN [16 35]

voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]

Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM

CRAN [16 36]

3

Nature Methods doi101038nmeth4612

Performance

Olga (NBIS) scRNA-seq DE February 2019 31 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall

[Figure adapted from Wikipedia]

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
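When the truth is known, as in such a simulation, these metrics follow directly from the confusion counts. The sketch below reuses the dimensions of the simulation (10 000 genes, 2 000 true DEs); the set of genes "called" by a method is invented for illustration and does not reproduce any result from the paper.

```python
def performance(truth, called, n_genes):
    """Sensitivity, specificity and precision from sets of gene indices."""
    tp = len(truth & called)      # true DEs that were called
    fp = len(called - truth)      # non-DEs that were called (type I errors)
    fn = len(truth - called)      # true DEs that were missed (type II errors)
    tn = n_genes - tp - fp - fn   # non-DEs that were not called
    return {
        "sensitivity": tp / (tp + fn),   # same as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }


n_genes = 10_000
truth = set(range(2_000))         # the 2 000 simulated DE genes
called = set(range(500, 2_300))   # hypothetical calls: 1 500 TP, 300 FP

metrics = performance(truth, called, n_genes)
print(metrics)
```

Here sensitivity (recall) is 1500 / 2000 = 0.75, specificity is 7700 / 8000 = 0.9625 and precision is 1500 / 1800, about 0.83.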

Performance

Consistency

Miao et al. 2017

Performance

And so much more

Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11 - 21 separate instances (sample size effect)

extensive evaluation metrics incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all

[Figure: two histograms, frequency of read counts (x-axis "Read Counts") and frequency of zero counts (x-axis "0 counts")]
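A first look of this kind can be sketched as follows, assuming the counts are held as a genes x cells list of lists. The tiny matrix is a made-up stand-in for the 46078 x 96 example data, and the function names are hypothetical.

```python
def genes_without_expression(counts):
    """Number of genes whose count is zero in every cell."""
    return sum(1 for gene in counts if not any(gene))


def zeros_per_gene(counts):
    """For each gene, the number of cells with a zero count."""
    return [sum(1 for c in gene if c == 0) for gene in counts]


# Hypothetical toy matrix: 4 genes (rows) x 4 cells (columns)
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 7, 1, 4],
]

print(genes_without_expression(counts))  # 1
print(zeros_per_gene(counts))            # [4, 2, 3, 0]
```

A histogram of the values returned by zeros_per_gene corresponds to the "0 counts" panel described above.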

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Practicalities

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows etc.
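Of the items above, the detection rate is the easiest to make concrete. A minimal sketch, assuming a genes x cells count matrix and taking "detected" to mean a count above zero (that threshold is an assumption, as are the toy data and the function name):

```python
def detection_rate(counts):
    """Per cell, the fraction of genes with a count above zero."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in counts if gene[j] > 0) / n_genes
        for j in range(n_cells)
    ]


# Hypothetical toy matrix: 4 genes (rows) x 4 cells (columns)
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 7, 1, 4],
]

print(detection_rate(counts))  # [0.5, 0.25, 0.5, 0.5]
```

The resulting per-cell values could then be inspected alongside the grouping variable, or passed as a covariate to DE methods that accept one.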

Practicalities

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Summary

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature reviews. Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.

Page 43: scRNA-seq - Differential expression analyses

Performance

No ground truth ie no independently validated truth is available fortesting

Known data

using data we know something aboutto get positive controls

Simulated data

null-data sets by re-samplingmodeling data sets based on variousdistributions

Comparing between methods andscenarios

Comparing numbers of DEs incl as afunction of group size

Investigating results

How does the expression anddistributions of detected DEs look like

Olga (NBIS) scRNA-seq DE February 2019 32 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

adapted from Wikipedia

Olga (NBIS) scRNA-seq DE February 2019 33 46

Performance

False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall

Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)

Olga (NBIS) scRNA-seq DE February 2019 34 46

Performance

Consistency

Miao et al 2017

Performance

And so much more

Soneson and Robinson 2018

Bias, robustness and scalability in single-cell differential expression analysis

36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells

based on 9 data sets with 11-21 separate instances (sample size effect)

extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.

conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets

Practicalities

Practicalities

Getting to know your data

Example data: 46078 genes x 96 cells

22229 genes with no expression at all

[Figure: two histograms with frequency on the y-axis; x-axes: read counts and number of zero counts]
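The two numbers behind this slide, genes with no expression at all and the number of zero counts per gene, can be computed directly from a count matrix. The sketch below assumes a genes-by-cells matrix stored as a list of rows; the function names and the toy data are hypothetical and much smaller than the 46078 x 96 example.

```python
def count_unexpressed_genes(counts):
    """Number of genes (rows) whose count is zero in every cell."""
    return sum(1 for gene_row in counts if all(c == 0 for c in gene_row))


def zeros_per_gene(counts):
    """For each gene (row), the number of cells with a zero count."""
    return [sum(1 for c in gene_row if c == 0) for gene_row in counts]


# Toy matrix: 5 genes (rows) x 4 cells (columns)
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [1, 3, 7, 9],
    [0, 0, 0, 0],
    [0, 12, 0, 0],
]

print(count_unexpressed_genes(toy_counts))
print(zeros_per_gene(toy_counts))
```

A histogram of the `zeros_per_gene` values corresponds to the "0 counts" panel described above.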

Practicalities

Choosing DE methods

Soneson and Robinson 2018

Practicalities

Remembering the bigger picture

Stegle, Teichmann and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
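The detection rate mentioned in the list above has a simple definition that can be written out in code. The sketch assumes a genes-by-cells count matrix stored as a list of rows and treats a gene as detected when its count exceeds a threshold (zero by default); the function name, the `threshold` parameter and the toy data are hypothetical.

```python
def detection_rate_per_cell(counts, threshold=0):
    """Fraction of genes with a count above `threshold`, for each cell."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    rates = []
    for cell in range(n_cells):
        detected = sum(1 for gene in range(n_genes) if counts[gene][cell] > threshold)
        rates.append(detected / n_genes)
    return rates


# Toy matrix: 5 genes (rows) x 4 cells (columns)
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [1, 3, 7, 9],
    [0, 0, 0, 0],
    [0, 12, 0, 0],
]

print(detection_rate_per_cell(toy_counts))
```

The per-cell values can then be inspected during QC or supplied as a covariate to DE methods that accept one.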

Practicalities

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Summary

Growing field

Angerer et al. 2017

Summary

Growing field

https://www.scrna-tools.org/tools

Zappia, Phipson and Oshlack 2018

Summary

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1-2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573

Practicalities

Remembering the bigger picture

Stegle, Teichmann, and Marioni 2015

QC filtering

Cell-cycle phase

Normalization of cell-specific biases

Confounding factors, incl. batch effects

Detection rate, i.e. the fraction of detected genes per cell

Imputation strategies for dropout values

What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
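
The detection rate mentioned in the list above can be made concrete with a small sketch. This is a minimal illustration in plain Python, not code from the slides: the function name `detection_rate`, the genes-by-cells layout, and the toy counts are all assumptions made for the example, and a gene is treated as detected in a cell when its count is above zero.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed per cell.

    counts: list of rows, one row per gene, one entry per cell
            (assumed layout; transpose first if cells are rows).
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        # count the genes detected in this cell, then divide by all genes
        sum(1 for gene in counts if gene[cell] > 0) / n_genes
        for cell in range(n_cells)
    ]


# Toy matrix: 4 genes x 3 cells (made-up numbers)
toy_counts = [
    [5, 0, 2],
    [0, 0, 1],
    [3, 1, 0],
    [0, 0, 4],
]

rates = detection_rate(toy_counts)
print(rates)
```

For the toy matrix the three cells get detection rates of 0.5, 0.25 and 0.75. On real data the same quantity would normally be computed on the raw count matrix before normalization.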

Staying critical

Summary

What to remember from this hour

https://www.menti.com & 25 06 78

Growing field

Angerer et al. 2017

https://www.scrna-tools.org/tools

Zappia, Phipson, and Oshlack 2018

scRNA-seq is a rapidly growing field

DE is a common task, so many newer and better methods will be developed

understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set

staying critical, staying updated, staying connected

Bibliography

Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.

Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.

Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.

Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.

Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.

Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.

Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
