Exploiting heterogeneity in single-cell transcriptomic analyses ...Exploiting heterogeneity in...
Transcript of Exploiting heterogeneity in single-cell transcriptomic analyses ...Exploiting heterogeneity in...
Exploiting heterogeneity in single-cell transcriptomic analyses: how to move beyond comparisons of averages
Keegan Korthauer1,2*, Li-Fang Chu4, Michael A. Newton3, Yuan Li3, James Thomson4, Ron M. Stewart4, and Christina Kendziorski3 1Harvard T.H. Chan School of Public Health, 2Dana-Farber Cancer Institute, 3University of Wisconsin-Madison, 4Morgridge Institute for Research
Keegan Korthauer Harvard T.H. Chan School of Public Health Dana-Farber Cancer Institute [email protected] @keegankorthauer
Contact [1] Korthauer, K. D., Chu, L. F., Newton, M. A., Li, Y., Thomson, J., Stewart, R., & Kendziorski, C. (2016). A statistical approach for identifying
differential distributions in single-cell RNA-seq experiments. Genome Biology, 17(1), 222. [2] Korthauer, K.D. (2017) scDD: Mixture modeling of single-cell RNA-seq data to identify genes with differential distributions. Bioconductor R
Package version 1.0.0 (BioC Release 3.5), https://bioconductor.org/Packages/scDD. [3] Lahav, G., et al. (2004). Dynamics of the p53-Mdm2 feedback loop in individual cells. Nature genetics, 36(2), 147-150. [4] Dobrzyński, M., et al. Nonlinear signalling networks and cell-to-cell variability transform external signals into broadly distributed or bimodal
responses. Journal of The Royal Society Interface 11.98 (2014): 20140383. [5] Jubelin, G., et al. FliZ is a global regulatory protein affecting the expression of flagellar and virulence genes in individual Xenorhabdus
nematophila bacterial cells. PloS Genet 9.10 (2013): e1003915. [6] Kharchenko, P. V., et al. Bayesian approach to single-cell differential expression analysis. Nature methods 11.7 (2014): 740-742. [7] Finak, Greg, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell
RNA sequencing data. Genome biology 16.1 (2015): 278.
References
The ability to quantify cellular heterogeneity is a major advantage of single-cell technologies. It is now possible to elucidate gene expression dynamics that were invisible using bulk RNA-seq, such as the presence of distinct expression states. However, statistical methods often treat cellular heterogeneity as a nuisance. We have developed a novel method to characterize differences in expression in the presence of distinct expression states within and among biological conditions. This framework can detect differential expression patterns under a wide range of settings. Compared to alternative approaches, this method has higher power to detect subtle differences in gene expression distributions that are more complex than a mean shift, and can characterize those differences. The R package scDD implements the approach, and is available on Bioconductor [2].
Abstract
The scDD [1] algorithm (summarized above) tests whether the distribution (possibly multi-modal) of expression is different between biological conditions and classifies genes into categories that summarize the salient characteristics of the differences.
Differential Expression Analysis in Bulk RNA-seq is blind to cellular heterogeneity
Biological mechanisms such as stochastic burst-like fluctuations [3], unsynchronized oscillations [4], and bistable feedback loops [5] (illustrated above) can give rise to a mixed population of cells at multiple different expression states, which manifests as multi-modal distributions. This multimodality complicates DE analysis methods for single-cell, since most assume a parametric distribution with one mode representing the expressed cells (such as SCDE [6] and MAST [7]).
Biological mechanisms leading to multi-modality
In an analysis of human embryonic stem cell (hESC) types (detailed below), we evaluated pairwise comparisons of four cell lines. Identifying which genes are expressed differently between these conditions can give insight into the differentiation process. scDD generally detects more differential genes than other methods, but the additional are enriched for complex patterns. As expected, cell cycle and pluripotency genes are among those detected only by scDD.
scDD detects and classifies complex patterns
scDD is a novel statistical framework and R package that detects gene expression differences in scRNA-seq experiments while explicitly accounting for potential multimodality among expressed cells. It has comparable performance to alternative methods at detecting mean shifts, but is able to detect and characterize more complex differences that are masked under unimodal assumptions.
Summary
In contrast to single-cell RNA-seq, which allows us to get a measurement for each cell, differential expression (DE) analysis in traditional (or bulk) RNA-seq is blind to any cellular heterogeneity. The illustration above shows an example where the bulk RNA-seq experiment would not detect differential expression, but there is clearly a different pattern of expression between the two populations. This type of pattern may be of great biological significance, so it is important that DE methods for scRNA-seq account for it. However, doing so is complicated by the fact that these types of patterns result in multi-modal expression distributions, which are generally not accommodated for in existing approaches.
scDD Algorithm
Website:
Korthauer et al. Page 27 of 31
Table 2 Power to detect DD genes in simulated data
True Gene Category
Sample Size Method DE DP DM DB Overall (FDR)
scDD 0.893 0.418 0.898 0.572 0.695 (0.029)
50 SCDE 0.872 0.026 0.817 0.260 0.494 (0.004)
MAST 0.908 0.400 0.871 0.019 0.550 (0.026)
scDD 0.951 0.590 0.960 0.668 0.792 (0.031)
75 SCDE 0.948 0.070 0.903 0.387 0.577 (0.003)
MAST 0.956 0.633 0.943 0.036 0.642 (0.022)
scDD 0.972 0.717 0.982 0.727 0.850 (0.033)
100 SCDE 0.975 0.125 0.946 0.478 0.631 (0.003)
MAST 0.977 0.752 0.970 0.045 0.686 (0.022)
scDD 1.000 0.983 1.000 0.905 0.972 (0.035)
500 SCDE 1.000 0.855 0.998 0.787 0.910 (0.004)
MAST 1.000 0.993 1.000 0.170 0.791 (0.022)Average power to detect simulated DD genes by true category. Averages are calculated over 20
replications. Standard errors were < 0.025 (not shown).
Table 3 Correct Classification Rate in simulated data
Gene Category
Sample Size DE DP DM DB
50 0.719 0.801 0.557 0.665
75 0.760 0.732 0.576 0.698
100 0.782 0.678 0.599 0.706
500 0.816 0.550 0.583 0.646Average Correct Classification Rate for detected DD genes. Averages are calculated over 20
replications. Standard errors were < 0.025 (not shown).
Table 4 Average correct classification rates by component mean distance
Sample Gene component mean distance �µ
Size Category 2 3 4 5 6
DP 0.02 0.20 0.78 0.94 0.98
50 DM 0.10 0.23 0.59 0.81 0.89
DB 0.08 0.22 0.59 0.80 0.80
DP 0.02 0.18 0.77 0.94 0.97
75 DM 0.08 0.27 0.69 0.86 0.90
DB 0.09 0.29 0.71 0.83 0.84
DP 0.03 0.16 0.74 0.93 0.95
100 DM 0.10 0.32 0.76 0.87 0.91
DB 0.08 0.32 0.80 0.85 0.84
DP 0.01 0.15 0.72 0.91 0.93
500 DM 0.12 0.33 0.72 0.85 0.89
DB 0.03 0.43 0.85 0.85 0.85Average Correct Classification Rates stratified by �µ. Averages are calculated over 20 replications.
Standard errors were < 0.025 (not shown).
When simulating gene expression from mixtures of negative binomial distributions that represent the patterns depicted above, scDD is comparable or slightly better at detecting the DD genes that have an overall mean shift (as shown in the table to the right). As expected, however, it is superior at detecting the DB category, which has no overall mean shift.
Fig 2, Lahav et al. 2004, Nature Genetics [3]
Stochastic burst fluctuations Bistable Feedback loops
Fig 3, Jubelin et al. 2013, PLOS Genetics [5]
Fig 2, Dobrzynski et al. 2012, CSMB [4]
Unsynchronized Oscillations
0.00
0.25
0.50
0.75
1.00
1 2 3+Number of Modes
Prop
ortio
n of
gen
es (o
r tra
nscr
ipts
)
DatasetGE.50GE.75GE.100LC.77H1.78DEC.64NPC.86H9.87
Modality of Bulk (Reds) vs Single−cell (Blues) RNA−seq datasets
Fig 2, Korthauer et al. 2016, Genome Biology [1]
Modality in scRNA-seq
Con
ditio
n 1
Sample 1 Sample 2
… Sample N2
Con
ditio
n 2
Measurement 1 Measurement 2 Measurement N2
Gene X Sample 1 Sample 2
… Sample N1
Measurement 1 Measurement 2 Measurement N1 Do not observe individual cell states in bulk
Gene X is not DE in bulk
Snapshot of Population of Single Cells
Histogram of Observed Expression Level of Gene X
Number of Cells
(A)
(B) (C)
Expression States of Gene X for Individual Cells Over Time
Low Expression State: µ1 High Expression State: µ2
µ1 µ2
Time
Cell 1
Cell 2
Cell 3 !"! !
Cell J
!"! !
in Condition 2
Snapshot of Population of Single Cells
Histogram of Observed Expression Level of Gene X
Number of Cells
(A)
(B) (C)
Expression States of Gene X for Individual Cells Over Time
Low Expression State: µ1 High Expression State: µ2
µ1 µ2
Time
Cell 1
Cell 2
Cell 3 !"! !
Cell J
!"! !
Preprocessing
1. Obtain log Expected Counts normalized for library size 2. Filter genes that are detected in fewer than 25% of cells
Detection
1. Model expressed cells for each gene: DPM of Normals 2. Quantify evidence of Differential Distributions (DD):
- BF with permutation for expressed component - GLM LRT for dropout component
Classification
Classify significant DD genes into patterns DE, DP, DM, DB, DZ
DE: Traditional Differential Expression
µ1 µ2
DP: Differential Proportion
µ1 µ2
DM: Differential Modality
µ1 µ2
DB: Both DM and DE
µ1 µ3 µ2
DZ: Differential proportion of Zeroes
0 µ1
DE: Traditional Differential Expression
DP: Differential Proportion DM: Differential Modality DB: Both DM and Differential Component means
DZ: Differential Proportion of Zeroes
Evaluate evidence of DD of expressed cells using an approximate Bayes Factor score comparing: • Global model for all cells • Independent models for each
biological condition Assess significance via permutation test or (for large datasets) the Kolmogorov-Smirnov test. If expressed component does not display significant DD, assess evidence for differential proportion of zeroes
Undifferentiated
Differentiated
H1
NPC DEC
H9
hESC types
0
1
2
3
4
DEC H1
DZ: SLAMF7
0
2
4
6
8
DEC H1
DP: FASTKD3
0
2
4
6
DEC H1
DM: KCNE3
0
2
4
6
DEC H1
DB: NCOA3
0
2
4
6
DEC H1
CHEK2
0
2
4
6
DEC H1
CDK7
0
2
4
6
DEC H1
FOXP1
2
4
6
8
DEC H1
PSMD12
log(
EC+1
)
(A) scDD−exclusive Genes
log(
EC+1
)lo
g(EC
+1)
(B)
(C)
Cell Cycle Genes
Pluripotency Genes
0
1
2
3
4
DEC H1
DZ: SLAMF7
0
2
4
6
8
DEC H1
DP: FASTKD3
0
2
4
6
DEC H1
DM: KCNE3
0
2
4
6
DEC H1
DB: NCOA3
0
2
4
6
DEC H1
CHEK2
0
2
4
6
DEC H1
CDK7
0
2
4
6
DEC H1
FOXP1
2
4
6
8
DEC H1
PSMD12
log(
EC+1
)
(A) scDD−exclusive Genes
log(
EC+1
)lo
g(EC
+1)
(B)
(C)
Cell Cycle Genes
Pluripotency Genes
471 DD genes not detected by SCDE or MAST are enriched for complex patterns (1 gene categorized as DE)
Korthauer et al. Page 28 of 31
Snapshot of Population of Single Cells
Histogram of Observed Expression Level of Gene X
Number of Cells
(A)
(B) (C)
Expression States of Gene X for Individual Cells Over Time
Low Expression State: µ1 High Expression State: µ2
µ1 µ2
Time
Cell 1
Cell 2
Cell 3 !"! !
Cell J
!"! !
Figure 1 Schematic of the presence of two cell states within a cell population which can lead to
bimodal expression distributions. (A) Time series of the underlying expression state of gene X in a
population of unsynchronized single cells, which switches back and forth between a low and high
state with mean µ1 and µ2, respectively. The color of cells at each time point corresponds to the
underlying expression state. (B) Population of individual cells shaded by expression state of gene
X at a snapshot in time. (C) Histogram of the observed expression level of gene X for the cell
population in (B).
Table 5 Number of DD genes identified in the hESC case study data for scDD, SCDE, and MAST.
Note that the Total for scDD includes genes detected as DD but not categorized.
scDD
Comparison DE DP DM DB DZ Total SCDE MAST
H1 vs NPC 1686 270 902 440 1603 5555 2921 5887
H1 vs DEC 913 254 890 516 911 5295 1616 3724
NPC vs DEC 1242 327 910 389 2021 5982 2147 5624
H1 vs H9 260 55 85 37 145 739 111 1119
Table 6 Number of DD genes identified in the myoblast and mESC case studies for scDD and
MAST. Note that the Total for scDD includes genes detected as DD but not categorized.
scDD
Comparison DE DP DM DB DZ Total MAST
Myoblast: T0 vs T72 312 44 200 36 1311 2134 2904
mESC: Serum vs 2i 5233 76 1259 1128 670 9130 9706
Differentially expressed genes detected by each method
H1 vs DEC
Cyclin genes expressed constitutively in hESCs, oscillatory in differentiated cell types
PSMD12 encodes a subunit of the proteasome complex vital to maintenance of pluripotency and has shown decreased expression in differentiating hESCs