RNA-seq co-expression analysis using mixture models
A. Rau, C. Maugis-Rabusseau, M.-L. Martin-Magniette, G. Celeux
September 29, 2015
Netbio @ Paris
[email protected] RNA-seq co-expression analysis 1 / 25
Introduction Biological context
Gene (co-)expression
Transcriptome data: main source of ’omic information available for living organisms
Microarrays (∼1995 - )
High-throughput sequencing (HTS): RNA-seq (∼2008 - )
Comparison of two conditions (hypothesis tests) → Differential expression analysis
Co-expression (clustering) analysis
Study gene expression behavior across several conditions
Co-expressed genes may be involved in similar biological process(es) ⇒ study genes without known or predicted function (orphan genes)
Introduction Co-expression analysis with RNA-seq data
High-throughput transcriptome sequencing data (RNA-seq)
Reads aligned or directly mapped to the genome to get counts per genomic feature (discrete data) ⇒ digital measures of gene expression
RNA-seq data, continued
Some statistical challenges of RNA-seq data analysis
Discrete, non-negative, and skewed data with very large dynamic range (up to 5+ orders of magnitude)
Sequencing depth (= “library size”) varies among experiments, and other technical biases...
Counts correlated with gene length
[Figure: Gene 1 and Gene 2 in Sample 1 and Sample 2]
To date, most methodological developments are for experimental design, normalization, and differential analysis...
Some notation
Notation
Let $Y_{ij\ell}$ be the count (expression measure) for gene $i$ in replicate $\ell$ of condition $j$, with corresponding observed value $y_{ij\ell}$.
Let $s_{j\ell}$ be the library size in replicate $\ell$ of condition $j$.
Let $\mathbf{y} = (y_{ij\ell})$ be the $n \times \sum_j L_j$ matrix of counts for all genes and variables, and $\mathbf{y}_i$ the $i$th row of the matrix.
Introduction Finite mixture models
Finite mixture models
Model-based clustering
Rigorous framework for parameter estimation and model selection
Output: each gene assigned a probability of cluster membership
Assume data $\mathbf{y}$ come from $K$ distinct subpopulations, each modeled separately:
$$f(\mathbf{y} \mid K, \Psi_K) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(\mathbf{y}_i \mid \theta_k)$$
$\Psi_K = (\pi_1, \ldots, \pi_{K-1}, \theta')'$
$\pi = (\pi_1, \ldots, \pi_K)'$ are the mixing proportions, where $\sum_{k=1}^{K} \pi_k = 1$
Poisson mixture model
Finite mixture models for RNA-seq data
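$$f(\mathbf{y} \mid K, \Psi_K) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(\mathbf{y}_i \mid \theta_k)$$
For microarray data, we often assume $\mathbf{y}_i \mid k \sim \text{MVN}(\mu_k, \Sigma_k)$...
For RNA-seq data, we need to choose the family and parameterization of $f_k(\cdot)$. One possibility:
$$\mathbf{y}_i \mid k \sim \prod_{j=1}^{J} \prod_{\ell=1}^{L_j} \mathcal{P}(y_{ij\ell} \mid \mu_{ij\ell k})$$
Question: How to parameterize the mean $\mu_{ij\ell k}$ to obtain meaningful clusters of co-expressed genes?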
Poisson mixture model Parameterization
Which genes should be clustered?
Poisson mixture model for RNA-seq data
Consider $y_{ij\ell} \mid k \sim \text{Poisson}(y_{ij\ell} \mid \mu_{ij\ell k})$, where
$$\mu_{ij\ell k} = w_i s_{j\ell} \lambda_{jk}$$
$w_i$: overall expression level of gene $i$ ($= y_{i\cdot\cdot}$)
$s_{j\ell}$: normalized library size (a)
$\lambda_k = (\lambda_{jk})$: parameters that define profiles of genes in each cluster (b)
(a) Estimated from data using standard techniques and considered to be fixed
(b) For identifiability of the model, we assume $\sum_{j,\ell} \lambda_{jk} s_{j\ell} = 1$ for all $k$
Genes are assigned to the same cluster if they share the same profile of variation around their mean count across all conditions
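The parameterization above can be sketched in a few lines of Python (standard library only). This is an illustrative sketch, not HTSCluster code: the function names and the toy numbers are assumptions. It evaluates the cluster-specific log-density of one gene as a sum of Poisson log-probabilities with means $w_i s_{j\ell} \lambda_{jk}$.

```python
import math

def poisson_logpmf(y, mu):
    """Log of the Poisson probability P(y | mu), for mu > 0."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def gene_logdensity(y_i, s, lam_k, conds):
    """Log f_k(y_i | theta_k) with means w_i * s[c] * lam_k[conds[c]].

    y_i   : counts of one gene, one entry per sample
    s     : normalized library sizes, one per sample
    lam_k : cluster profile, one lambda per condition
    conds : condition index j of each sample
    """
    w_i = sum(y_i)  # overall expression level y_i..
    return sum(
        poisson_logpmf(y, w_i * s[c] * lam_k[conds[c]])
        for c, y in enumerate(y_i)
    )

# Toy example (hypothetical numbers): 2 conditions, 2 replicates each.
conds = [0, 0, 1, 1]
s = [0.2, 0.3, 0.25, 0.25]
lam_k = [1.0, 1.0]   # flat profile; sum over samples of lambda * s equals 1
y_i = [20, 30, 25, 25]
means = [sum(y_i) * s[c] * lam_k[conds[c]] for c in range(len(s))]
```

Because of the identifiability constraint, the fitted means of a gene always add up to its total count $w_i$, so clusters differ only in how that total is distributed across conditions.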
Poisson mixture model Implementation
Parameter estimation
The log likelihood is
$$L(\Psi_K \mid \mathbf{y}, K) = \log\left[\prod_{i=1}^{n} f(\mathbf{y}_i \mid K, \Psi_K)\right] = \sum_{i=1}^{n} \log\left[\sum_{k=1}^{K} \pi_k f(\mathbf{y}_i \mid \theta_k)\right],$$
where $\theta_k = (w_i, \lambda_{1k}, \ldots, \lambda_{Jk})'$
Estimation approach (EM): mixture parameters are estimated for a given model $K$ by computing the maximum likelihood estimate (Dempster et al. 1977)
Note: the EM algorithm is sensitive to initialization, so we make use of a splitting small-EM initialization
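In practice the inner sum of the log likelihood is best evaluated on the log scale, since the per-gene densities are products of many small probabilities. The following is a minimal sketch (Python, standard library only) of that computation; the names `logsumexp` and `mixture_loglik` and the input layout are assumptions made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_loglik(log_dens, pi):
    """Mixture log likelihood.

    log_dens[i][k] : log f(y_i | theta_k) for gene i and cluster k
    pi[k]          : mixing proportions (positive, summing to 1)
    """
    return sum(
        logsumexp([math.log(p) + ld for p, ld in zip(pi, row)])
        for row in log_dens
    )
```

Subtracting the largest term before exponentiating keeps the sum finite even when every component log-density is far below the range in which `exp` underflows.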
Classification by the MAP rule
“Maximum a posteriori” (MAP) rule:
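Each individual is attributed to the cluster for which it has the largest conditional probability of membership given the estimated parameters:
$$\tau_{ik}(\theta) = \frac{\pi_k f_k(\mathbf{y}_i \mid \theta_k)}{\sum_{\ell=1}^{K} \pi_\ell f_\ell(\mathbf{y}_i \mid \theta_\ell)}$$
MAP rule with $\hat{\theta}_K$:
$$\hat{z}_{ik} = \begin{cases} 1 & \text{if } \tau_{ik}(\hat{\theta}_K) > \tau_{i\ell}(\hat{\theta}_K) \ \forall \ell \neq k \\ 0 & \text{otherwise} \end{cases}$$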
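A small sketch of the MAP rule in Python (standard library only) may help fix ideas. The function names are illustrative assumptions, and ties, which the strict inequality above leaves unassigned, are broken here in favor of the lowest cluster index.

```python
import math

def posterior_probs(log_dens_i, pi):
    """Conditional probabilities tau_ik for one gene.

    log_dens_i[k] : log f_k(y_i | theta_k)
    pi[k]         : mixing proportions
    """
    logs = [math.log(p) + ld for p, ld in zip(pi, log_dens_i)]
    m = max(logs)                       # shift for numerical stability
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def map_label(tau_i):
    """Index of the cluster with the largest conditional probability."""
    return max(range(len(tau_i)), key=lambda k: tau_i[k])
```

For example, with two equiprobable clusters and component densities 0.2 and 0.4, the conditional probabilities are 1/3 and 2/3, so the gene is assigned to the second cluster.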
Poisson mixture model Model selection
Model selection
1. Collection of models $(S_K)_{K \in \mathcal{K}}$ indexed by number of clusters $K$
2. In each model $S_K$, parameter estimation via MLE: $\hat{\Psi}_K$
3. Selection of the “best” model $\hat{K}$ using a penalized criterion:
$$\hat{K} = \arg\min_{K \in \mathcal{K}} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log f(\mathbf{y}_i \mid K, \hat{\Psi}_K) + \text{penalty}(K) \right\}$$
⇒ Asymptotic penalized criteria include the Bayesian Information Criterion (BIC) and the Integrated Completed Likelihood (ICL)
Slope heuristics for model selection (Birgé and Massart, 2006)
Non-asymptotic framework: construct a penalized criterion (1) such that the selected model has a risk close to that of the oracle model
Optimal penalty for a model of dimension $D$:
$$\text{penalty}_{\text{opt}} \approx 2\kappa \frac{D}{n}$$
In large dimensions:
Linear behavior of the log-likelihood with respect to model dimension $D$
⇒ Estimation of slope to calibrate $\kappa$ in a data-driven manner (Data-Driven Slope Estimation = DDSE), capushe R package
(1) Theoretically validated in the Gaussian framework, but encouraging applications in other contexts (Baudry et al., 2012)
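The idea behind the slope estimation can be sketched as follows (Python, standard library only). This is a simplified illustration and not the capushe implementation: it assumes the penalty shape is $D/n$, and it regresses on a fixed number of the most complex models (`n_points`, a hypothetical parameter), whereas DDSE chooses that number in a data-driven way.

```python
def fit_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def slope_heuristic_select(dims, logliks, n, n_points):
    """Select a model with a penalty 2 * kappa * D / n.

    dims     : model dimensions D
    logliks  : maximized log-likelihoods, same order as dims
    n        : number of observations
    n_points : number of most complex models used to estimate the slope
    """
    shape = [d / n for d in dims]               # penalty shape D / n
    neg_contrast = [ll / n for ll in logliks]   # -gamma_n = loglik / n
    order = sorted(range(len(dims)), key=lambda m: dims[m])
    tail = order[-n_points:]                    # most complex models
    kappa = fit_slope([shape[m] for m in tail],
                      [neg_contrast[m] for m in tail])
    crit = [-neg_contrast[m] + 2.0 * kappa * shape[m]
            for m in range(len(dims))]
    best = min(range(len(dims)), key=lambda m: crit[m])
    return best, kappa

# Toy log-likelihoods (hypothetical): rapid gains, then a linear regime.
dims = list(range(1, 13))
logliks = [-400.0 * 0.2 ** d + 2.0 * d for d in dims]
best, kappa = slope_heuristic_select(dims, logliks, n=100, n_points=5)
```

In this toy setting the log-likelihood grows by about 2 per added dimension once the structure is captured, so the estimated slope is close to 2 and the doubled penalty selects the model of dimension 4.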
Slope heuristics in practice for RNA-seq
[Figure: contrast representation. x-axis: penshape(m) (labels: model names, 1 to 130); y-axis: −γn(ŝm), from about −2.0e+08 to −5.0e+07. The regression line is computed with 17 points; validation points are marked.]
Poisson mixture model HTSCluster
HTSCluster R package
> PMM <- PoisMixClusWrapper(y=data, gmin=1, gmax=35,
conds=conds, split.init=TRUE, norm="TMM")
>
> summary(PMM)
*************************************************
Selected number of clusters via ICL = 10
Selected number of clusters via BIC = 30
Selected number of clusters via Djump = 15
Selected number of clusters via DDSE = 14
*************************************************
>
> summary(PMM$DDSE.results)
*************************************************
Number of clusters = 14
Model selection via DDSE
*************************************************
Cluster sizes:
Cluster 1 Cluster 2 Cluster 3 ...
540 192 235 ...
RNA-seq application Embryonic fly data
Real data analysis: Embryonic fly development
modENCODE project to provide functional annotation of Drosophila (Graveley et al., 2011)
Expression dynamics over 27 distinct stages of development during life cycle studied with RNA-seq
12 embryonic samples (collected at 2-hr intervals over 24 hrs) for 13,164 genes downloaded from ReCount database (Frazee et al., 2011)
3 independent runs, used HTSCluster to fit Poisson mixture models for $K \in \{1, \ldots, 60, 65, \ldots, 100, 110, \ldots, 130\}$
Using slope heuristics, selected model is $\hat{K} = 48$
HTSCluster model diagnostics
[Figure: histogram of the maximum conditional probability (x-axis, 0.0 to 1.0) against frequency (y-axis, 0 to 10000)]
[Figure: maximum conditional probability (y-axis, 0.2 to 1.0) by cluster (x-axis, clusters 1 to 48)]
[Figure: frequency (y-axis, 0 to 600) of genes with τ ≤ 0.8 and τ > 0.8 by cluster (x-axis; labeled clusters include 18, 19, 30, 44, 20, 48, 46, 10, 35, 16, 25, 22, 14, 3, 9, 6)]
HTSCluster: Visualization of results
[Figure: profiles of Clusters 1 to 6 across the 12 samples (x-axis, 2 to 12), shown in successive builds with different y-axis scales; the final build adds a plot of λjk sj· (y-axis, 0.0 to 1.0) for clusters 1 to 6, with a legend numbered 1 to 12]
[Figure: λjk sj· (y-axis, 0.0 to 1.0) by cluster (x-axis, clusters 1 to 48), with one legend entry per embryonic sample from Embryos 0−2hr to Embryos 22−24hr]
Functional enrichment analysis: 33 of 48 clusters associated with at least one Gene Ontology Biological Process term (e.g., cluster 6 associated with muscle attachment)
Discussion
HTSCluster for clustering count-based RNA-seq profiles
Interpretable parameterization for RNA-seq co-expression analyses, straightforward parameter estimation, and a sound mechanism for model selection
Performs well on real and simulated data compared to other approaches, especially when the number of clusters is unknown
HTSCluster (v2.0.4): R package on CRAN
Discussion Future work
Some limits (opportunities!) for HTSCluster
Computational time can be a drawback: a full collection of models is estimated to allow for selection of a single “best” model, splitting small-EM initialization prevents parallelization...
Samples are currently assumed to be conditionally independent given the cluster
Conditions are currently assumed to be a single multi-level factor: how to correctly account for more complex experimental designs? (e.g. factorial, time series)
Is a Poisson mixture model the most appropriate choice for RNA-seq data in practice? ...
Future work: Model comparisons for co-expression
Is it better to model the raw counts $y_{ij}$ using a Poisson distribution or appropriately transformed counts $t(y_{ij})$ using a Gaussian distribution? (2)
$$f(\mathbf{y}_i \mid K, \theta_K) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \mathcal{P}(y_{ij} \mid \theta_k)$$
- vs -
$$g(t(\mathbf{y}_i) \mid K, \eta_K) = \sum_{k=1}^{K} \pi_k \Phi(t(\mathbf{y}_i) \mid \mu_k, \Sigma_k)$$
For example,
$$t(y_{ij}) = \log\left(\frac{y_{ij}/y_{\cdot j} + 1}{m_i + 1}\right)$$
(2) Ph.D. work of Melina Gallopin
Future work: Model comparisons for RNA-seq co-expression
BIC model selection criterion enables an objective comparison:
$$\text{BIC}_f(K; \mathbf{y}) = \sum_{i=1}^{n} \log f(\mathbf{y}_i; K, \hat{\theta}_K) - \frac{\upsilon_f}{2} \log n$$
$$\text{BIC}_g(K; \mathbf{y}) = \sum_{i=1}^{n} \log g(t(\mathbf{y}_i); K, \hat{\eta}_K) + \sum_{i=1}^{n} \log t'(\mathbf{y}_i) - \frac{\upsilon_g}{2} \log n$$
Left: Sultan et al. (2008). Right: Mach et al. (2014)
Further comparisons of transformations / models in progress...
Thank you!
In collaboration with...
Gilles Celeux (Inria Saclay - Ile-de-France)
Cathy Maugis-Rabusseau (INSA / IMT Toulouse)
Marie-Laure Martin-Magniette (AgroParisTech / INRA URGV)
Panos Papastamoulis (University of Manchester)
Melina Gallopin (current Ph.D. student)
Estimation of finite mixture models
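A finite mixture model may be seen as an incomplete data structure model
The complete data are
$$\mathbf{x} = (\mathbf{y}, \mathbf{z}) = (\mathbf{x}_1, \ldots, \mathbf{x}_n) = ((\mathbf{y}_1, \ldots, \mathbf{y}_n), (\mathbf{z}_1, \ldots, \mathbf{z}_n))$$
where the missing data are $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_n) = (z_{ik})$
$\mathbf{z}_i$: component of $i$, where $z_{ik} = 1$ if $i$ arises from group $k$ and 0 otherwise
$\mathbf{z}$ defines a partition $P = (P_1, \ldots, P_K)$ of the observed data $\mathbf{y}$ with $P_k = \{i \mid z_{ik} = 1\}$
Expected completed likelihood:
$$L(\Psi_K; \mathbf{y}, \mathbf{z}) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left\{\log \pi_k + \log f_k(\mathbf{y}_i; \theta_k)\right\} + \lambda_\pi \left(\sum_{k=1}^{K} \pi_k - 1\right)$$
where $\lambda_\pi$ is the Lagrange multiplier for the constraint on $\pi$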
Estimation: EM algorithm (Dempster et al., 1977)
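E-step: Compute the conditional probabilities:
$$\tau_{ik}(\theta_k^{(b)}) = \frac{\pi_k^{(b)} f(\mathbf{y}_i \mid \theta_k^{(b)})}{\sum_{m=1}^{K} \pi_m^{(b)} f(\mathbf{y}_i \mid \theta_m^{(b)})}$$
M-step: Update $\Psi_K$ to maximize the expected value of the completed likelihood by weighting observation $i$ for cluster $k$ with $\tau_{ik}(\theta_k^{(b)})$:
$$\pi_k^{(b+1)} = \frac{1}{n} \sum_{i=1}^{n} \tau_{ik}(\theta_k^{(b)}), \qquad \hat{w}_i = y_{i\cdot\cdot}, \qquad \lambda_{jk}^{(b+1)} = \frac{\sum_{i=1}^{n} \tau_{ik}(\theta_k^{(b)}) \, y_{ij\cdot}}{s_{j\cdot} \sum_{i=1}^{n} \tau_{ik}(\theta_k^{(b)}) \, y_{i\cdot\cdot}}$$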
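The E- and M-steps above can be put together in a compact sketch (Python, standard library only). It is an illustration under simplifying assumptions rather than the HTSCluster implementation: the function names are hypothetical, the number of iterations is fixed instead of using a convergence test, and there are no safeguards against clusters that become empty.

```python
import math

def poisson_logpmf(y, mu):
    """Log Poisson probability; a zero mean puts all mass on y = 0."""
    if mu <= 0.0:
        return 0.0 if y == 0 else float("-inf")
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def em_poisson_mixture(y, conds, s, lam, pi, n_iter=50):
    """EM for the Poisson mixture with means w_i * s_c * lambda_jk.

    y     : n x q counts (list of rows), conds[c] : condition of sample c
    s     : library sizes per sample
    lam   : initial profiles lam[k][j], pi : initial proportions pi[k]
    """
    n, q, K = len(y), len(conds), len(pi)
    J = max(conds) + 1
    w = [sum(row) for row in y]                                   # y_i..
    s_cond = [sum(s[c] for c in range(q) if conds[c] == j)
              for j in range(J)]                                  # s_j.
    y_cond = [[sum(row[c] for c in range(q) if conds[c] == j)
               for j in range(J)] for row in y]                   # y_ij.
    tau = []
    for _ in range(n_iter):
        # E-step: conditional probabilities, computed on the log scale.
        tau = []
        for i in range(n):
            logs = [math.log(pi[k]) + sum(
                        poisson_logpmf(y[i][c], w[i] * s[c] * lam[k][conds[c]])
                        for c in range(q))
                    for k in range(K)]
            m = max(logs)
            wts = [math.exp(v - m) for v in logs]
            tot = sum(wts)
            tau.append([v / tot for v in wts])
        # M-step: closed-form updates of pi and lambda.
        pi = [sum(tau[i][k] for i in range(n)) / n for k in range(K)]
        lam = [[sum(tau[i][k] * y_cond[i][j] for i in range(n))
                / (s_cond[j] * sum(tau[i][k] * w[i] for i in range(n)))
                for j in range(J)]
               for k in range(K)]
    return pi, lam, tau

# Toy data (hypothetical): 6 genes, 2 unreplicated conditions.
y = [[80, 20], [160, 40], [40, 10], [10, 40], [20, 80], [30, 120]]
pi_hat, lam_hat, tau_hat = em_poisson_mixture(
    y, conds=[0, 1], s=[0.5, 0.5],
    lam=[[1.2, 0.8], [0.8, 1.2]], pi=[0.5, 0.5])
```

On this toy example the first three genes share an 80/20 profile and the last three a 20/80 profile, so the two estimated profiles converge to approximately (1.6, 0.4) and (0.4, 1.6), each satisfying the constraint that the profile weighted by library sizes sums to 1.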
Splitting initialization (Papastamoulis et al., 2014)
for K ← 2 to gmax do
    – Calculate per-class entropy $e_k = -\sum_{i \in k} \log \hat{t}_{ik}^{K-1}$ for model with (K − 1) clusters
    – Select cluster $k^\star = \arg\max_k e_k$ to be split
    for i ← 1 to init.runs do
        – Randomly split the observations in cluster $k^\star$ into two clusters
        – Calculate corresponding $\lambda^{(0,i),K}$ and $\pi^{(0,i),K}$
        – Update values of $\lambda^{(0,i),K}$ and $\pi^{(0,i),K}$ via EM algorithm with init.iter iterations
        – Calculate the log-likelihood $L^{(i),K} = L(\lambda^{(0,i),K}, \pi^{(0,i),K})$
    end
    Let $i^\star = \arg\max_i L^{(i),K}$. Fix new initial values $\lambda^{(0),K} = \lambda^{(0,i^\star),K}$ and $\pi^{(0),K} = \pi^{(0,i^\star),K}$.
end
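The first two steps of each outer iteration (computing the per-class entropy and choosing the cluster to split) can be sketched as below in Python (standard library only). The function name is hypothetical, and the sketch assumes that "i ∈ k" means the genes assigned to cluster k by the MAP rule in the (K − 1)-cluster model.

```python
import math

def cluster_to_split(tau):
    """Pick the cluster with the largest per-class entropy.

    tau[i][k] : conditional probabilities from the (K - 1)-cluster model
    Returns (k_star, entropies), where entropies[k] is
    -sum over genes MAP-assigned to k of log tau[i][k].
    """
    K = len(tau[0])
    labels = [max(range(K), key=lambda k: row[k]) for row in tau]
    entropy = [0.0] * K
    for row, k in zip(tau, labels):
        entropy[k] -= math.log(row[k])
    k_star = max(range(K), key=lambda k: entropy[k])
    return k_star, entropy
```

A cluster whose members are assigned with low confidence accumulates a large entropy, which makes it the natural candidate for splitting when moving from K − 1 to K clusters.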
Penalized criteria for model selection: BIC (Schwarz, 1978)
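Maximization of integrated likelihood:
$$\hat{m} = \arg\min_{m \in \mathcal{M}} -f(\mathbf{y} \mid m)$$
where
$$f(\mathbf{y} \mid m) = \int_{\Theta_m} f(\mathbf{y} \mid \theta, m) \, \Pi(\theta \mid m) \, d\theta$$
Asymptotic approximation (where $D_m = (m - 1) + m \times J$ is the dimension of $S_m$):
$$-\ln(f(\mathbf{y} \mid m)) \approx -L(\mathbf{y} \mid \hat{\theta}_m) + \frac{D_m}{2} \ln(n) = n \gamma_n(\hat{s}_m) + \frac{D_m}{2} \ln(n)$$
Bayesian information criterion (BIC):
$$\hat{m} = \arg\min_{m \in \mathcal{M}} \left\{\gamma_n(\hat{s}_m) + \text{pen}_{\text{BIC}}(m)\right\} \quad \text{with} \quad \text{pen}_{\text{BIC}}(m) = \frac{D_m}{2n} \ln(n)$$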
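As a worked illustration of the BIC criterion in this form, the following Python sketch (standard library only) computes the dimension, the penalty, and the selected number of clusters from a set of maximized log-likelihoods. The function names and the example log-likelihood values are hypothetical.

```python
import math

def bic_dimension(K, J):
    """Model dimension D = (K - 1) + K * J for K clusters, J conditions."""
    return (K - 1) + K * J

def pen_bic(K, J, n):
    """BIC penalty D / (2n) * ln(n)."""
    return bic_dimension(K, J) / (2.0 * n) * math.log(n)

def select_by_bic(logliks, J, n):
    """Minimize -loglik / n + pen_BIC over the collection of models.

    logliks : dict mapping K to the maximized log-likelihood
    """
    crit = {K: -ll / n + pen_bic(K, J, n) for K, ll in logliks.items()}
    return min(crit, key=crit.get), crit

# Hypothetical log-likelihoods for K = 1, 2, 3 with J = 12, n = 1000.
best_K, crit = select_by_bic({1: -5000.0, 2: -4000.0, 3: -3990.0},
                             J=12, n=1000)
```

Here the gain of 10 log-likelihood units from K = 2 to K = 3 does not offset the cost of 13 additional parameters, so K = 2 is retained.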
Penalized criteria for model selection: ICL (Biernacki et al., 2000)
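Alternative based on maximization of integrated completed likelihood:
$$\hat{m} = \arg\min_{m \in \mathcal{M}} -f(\mathbf{y}, \mathbf{z} \mid m)$$
where
$$f(\mathbf{y}, \mathbf{z} \mid m) = \int_{\Theta_m} f(\mathbf{y}, \mathbf{z} \mid \theta, m) \, \Pi(\theta \mid m) \, d\theta$$
BIC-like asymptotic approximation for Integrated Completed Likelihood (ICL):
$$\hat{m} = \arg\min_{m \in \mathcal{M}} \left\{-\frac{1}{n} L(\hat{\theta}_m \mid \mathbf{y}, \hat{\mathbf{z}}) + \frac{D_m}{2n} \ln(n)\right\} = \arg\min_{m \in \mathcal{M}} \left\{\gamma_n(\hat{s}_m) + \text{pen}_{\text{BIC}}(m) + \text{Ent}(m)\right\}$$
where $\text{Ent}(m) = -\frac{1}{n} \sum_i \sum_k \hat{z}_{ik} \ln \tau_{ik}(\hat{\theta}_m)$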
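The entropy term is simple to compute once the conditional probabilities are available: since the MAP indicator selects one cluster per gene, the double sum reduces to the log of each gene's largest conditional probability. A minimal Python sketch (standard library only, hypothetical function names) follows.

```python
import math

def entropy_term(tau):
    """Ent(m) = -(1/n) * sum_i sum_k z_ik * ln(tau_ik), z = MAP labels.

    tau[i][k] : conditional probabilities under the fitted model
    """
    return -sum(math.log(max(row)) for row in tau) / len(tau)

def icl_criterion(loglik, dim, tau):
    """ICL criterion: -loglik / n + dim / (2n) * ln(n) + Ent(m)."""
    n = len(tau)
    return -loglik / n + dim / (2.0 * n) * math.log(n) + entropy_term(tau)
```

When every gene is assigned with probability 1 the entropy term vanishes and the criterion coincides with BIC; poorly separated clusters are penalized more heavily, which is consistent with ICL selecting fewer clusters than BIC.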
Behavior of BIC and ICL in practice for RNA-seq data
[Figure: log-likelihood, BIC, and ICL plotted against the number of clusters, for a collection of up to about 60 clusters (first row of panels) and a collection of up to about 200 clusters (second row); an X is marked on the BIC and ICL panels]
Description of competing models
1. PoisL (Cai et al., 2004): K-means type algorithm using Poisson loglinear model
   Equivalent to HTSCluster when equal library sizes, unreplicated data, equiprobable Poisson mixtures, and parameter estimation via the Classification EM (CEM) algorithm
2. Witten (2011): hierarchical clustering of dissimilarity measure based on a Poisson loglinear model
   Originally intended to cluster samples
3. Si et al. (2014): model-based hierarchical algorithm using Poisson and negative binomial models
4. Classic K-means algorithm on expression profiles ($y_{ij\ell}/y_{i\cdot\cdot}$)
Model selection not addressed by any of the above ⇒ Calinski and Harabasz index (1974) used for comparison
Simulation procedure (based on fly and human data)
For each setting (fly and human), 50 individual datasets:
K fixed to 15, true experimental design used
All parameters $(\lambda_{jk})$, $(s_{j\ell})$, and $(w_i)$ fixed to estimated values from real data analysis
n = 3000 genes randomly sampled from fly or human data, weighted by their maximum conditional probability
For each selected gene, we sample from the appropriate Poisson distribution:
$$Y_{ij\ell} \sim \mathcal{P}(\mu_{ij\ell k})$$
where $\mu_{ij\ell k} = w_i s_{j\ell} \lambda_{jk}$ if $z_{ik} = 1$.
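The sampling step can be illustrated with a small Python sketch (standard library only). Everything here is an assumption made for illustration: the function names, the toy parameter values, and the Poisson sampler, which uses Knuth's multiplication method and is only suitable for moderate means (for large means, exp(-mu) underflows and another sampler is needed).

```python
import math
import random

def rpois(mu, rng):
    """Draw one Poisson(mu) variate (Knuth's method, moderate mu only)."""
    limit = math.exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(w, labels, lam, s, conds, seed=0):
    """Simulate counts with means w_i * s_c * lam[z_i][conds[c]].

    w      : overall expression level per gene
    labels : cluster label z_i per gene
    lam    : profiles lam[k][j]
    s      : library sizes per sample, conds : condition per sample
    """
    rng = random.Random(seed)
    return [[rpois(w[i] * s[c] * lam[labels[i]][conds[c]], rng)
             for c in range(len(s))]
            for i in range(len(w))]

# Toy setting (hypothetical values): two clusters, two conditions.
lam = [[1.6, 0.4], [0.4, 1.6]]
sim = simulate_counts(w=[50, 50, 50], labels=[0, 0, 1], lam=lam,
                      s=[0.5, 0.5], conds=[0, 1], seed=1)
```

Each simulated gene keeps its own overall level $w_i$ while inheriting the profile of its cluster, which is what makes the true partition known and allows clustering methods to be compared.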
Simulation results
All models fit for $K \in \{1, \ldots, 40\}$
Model selection via the slope heuristics (HTSCluster, PoisL) or CH-index (Si-Pois, Si-NB, Witten)
Models compared using the adjusted Rand Index (ARI, Hubert & Arabie 1985)
For comparison, also consider the oracle ARI (based on assignment of observations to clusters using the true parameter values)
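For reference, the adjusted Rand index can be computed from the contingency table of two partitions. The sketch below (Python, standard library only; the function names are illustrative) follows the usual pair-counting formula and assumes at least two observations.

```python
from collections import Counter

def comb2(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) // 2

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors of equal length."""
    n = len(a)
    sum_ij = sum(comb2(v) for v in Counter(zip(a, b)).values())
    sum_a = sum(comb2(v) for v in Counter(a).values())
    sum_b = sum(comb2(v) for v in Counter(b).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a relabeling of the clusters, is close to 0 for partitions that agree no more than expected by chance, and can be negative.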
Table: Mean (sd) ARI for simulations with parameters based on the fly and human liver data.

Method      Model selection  Fly          Human
HTSCluster  capushe          0.93 (0.05)  0.61 (0.02)
HTSCluster  True K           0.84 (0.09)  0.60 (0.02)
PoisL       capushe          0.79 (0.15)  0.53 (0.05)
PoisL       True K           0.82 (0.05)  0.53 (0.04)
Witten      CH index         0.15 (0.07)  0.11 (0.03)
Witten      True K           0.67 (0.09)  0.39 (0.04)
Si-Pois     CH index         0.26 (0.17)  0.48 (0.04)
Si-Pois     True K           0.95 (0.02)  0.61 (0.02)
Si-NB       CH index         0.23 (0.16)  0.47 (0.04)
Si-NB       True K           0.94 (0.02)  0.60 (0.02)
K-means     True K           0.79 (0.08)  0.42 (0.02)
Oracle      True K           0.95 (0.01)  0.63 (0.01)