Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech, Paris Unit of Plant Genomics Research (URGV), Evry M.L Martin-Magniette (INRA) Mixture models and genomic data 7-11 July 2014 1 / 30

description

Mixture models are useful for identifying underlying structures. In such models, the density of the observations is modelled by a weighted sum of parametric densities (e.g. each component is a Gaussian distribution) and each one represents a subpopulation composed of observations sharing common characteristics. The first part of my talk will be dedicated to a presentation of the mixture models. I will explain the concept and the outputs of an analysis based on a mixture through easy examples. In the second part of my talk, I will show how mixture models can be applied to analyze transcriptomic (co‐expression analysis of Arabidopsis thaliana genes) and chIP‐chip data (detection of enriched regions and of differentially methylated regions). First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/

Transcript of Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Page 1: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Mixture models for analysing transcriptome and ChIP-chip data

Marie-Laure Martin-Magniette

French National Institute for agricultural research (INRA)

Unit of Applied Mathematics and Informatics at AgroParisTech, Paris

Unit of Plant Genomics Research (URGV), Evry

M.L Martin-Magniette (INRA) Mixture models and genomic data 7-11 July 2014 1 / 30

Page 2: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Presentation outline

1 Introduction

2 Mixture model definition

3 Genomic examples

4 Conclusions


Page 3: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Introduction

Observations described by 2 variables

Observation distribution seems easy to model with one Gaussian


Page 4: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Introduction

Observations described by 2 variables

Data are scattered and subpopulations are observed

According to the experimental design, there is no external information about them

This is an underlying structure observed through the data


Page 5: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Introduction

Definition of a mixture model

It is a probabilistic model for representing the presence of subpopulations within an overall population.

Introduction of a latent variable Z indicating the subpopulation where each observation comes from

[Figure: three panels, "what we observe", "the model" (Z = ?) and "the expected results" (Z takes the values 1, 2, 3, each marked with its own symbol)]

→ It is an unsupervised classification method

Page 6: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Functional annotation is the new challenge

- It is now relatively easy to sequence an organism and to localize its genes
- But between 20% and 40% of the genes have an unknown function
- For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function

→ with the high-throughput technologies, it is now possible to improve the functional annotation

Page 7: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

First genomic example: co-expression analysis

- Co-expressed genes are good candidates to be involved in the same biological process (Eisen et al, 1998)
- Pearson correlation values are often used to measure the co-expression, but it is a local point of view
- Co-expression analysis can be recast as a search for an underlying structure in a whole dataset

Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.

Page 8: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Second example: ChIP-chip analysis

These experiments aim at identifying interactions between a protein and DNA

Most methods look for peaks of log(IP/Input) along the genome

There exists an underlying structure between the two samples

Page 9: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Presentation outline

1 Introduction

2 Mixture model definition

3 Genomic examples

4 Conclusions


Page 10: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Key ingredients of a mixture model

[Figure: three panels, "what we observe", "the model" (Z = ?) and "the expected results" (Z takes the values 1, 2, 3, each marked with its own symbol)]

Let $y = (y_1, \dots, y_n)$ denote $n$ observations with $y_i \in \mathbb{R}^Q$ and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.

1) Distribution of Z: the $\{Z_i\}$ are assumed to be independent and

$$P(Z_i = k) = \pi_k \quad \text{with} \quad \sum_{k=1}^{K} \pi_k = 1 \quad \rightarrow \quad Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K)$$

where $K$ is the number of components of the mixture

2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot\,; \gamma_k)$

Page 11: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Some properties:

- $\{Z_i\}$ are independent
- $\{Y_i\}$ are conditionally independent given $\{Z_i\}$
- Couples $\{(Y_i, Z_i)\}$ are i.i.d.
- The model is invariant under any permutation of the labels $\{1, \dots, K\}$ ⇒ the mixture model has $K!$ equivalent definitions.

Distribution of Y:

$$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$

→ It is a weighted sum of parametric distributions known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$
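As a rough illustration of the distribution of Y, the sketch below evaluates the mixture log-likelihood for univariate Gaussian components. The Gaussian choice for f, the function names and the toy numbers are assumptions made for this example only; they are not taken from the talk. The last lines check numerically that permuting the component labels leaves the likelihood unchanged.

```python
import numpy as np

def gaussian_pdf(y, mean, sd):
    # Density of a univariate Gaussian (assumed form of f for this sketch)
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_log_likelihood(y, weights, means, sds):
    # log P(Y | K, theta) = sum_i log( sum_k pi_k f(y_i; gamma_k) )
    y = np.asarray(y, dtype=float)[:, None]               # shape (n, 1)
    weighted = np.asarray(weights) * gaussian_pdf(
        y, np.asarray(means), np.asarray(sds))            # shape (n, K)
    return float(np.sum(np.log(weighted.sum(axis=1))))

# Hypothetical toy data and parameters
y = [-2.1, -1.9, 0.1, 3.0, 3.2]
loglik = mixture_log_likelihood(
    y, weights=[0.4, 0.2, 0.4], means=[-2.0, 0.0, 3.0], sds=[0.5, 0.5, 0.5])

# Same model with the labels 1, 2, 3 permuted: the likelihood is identical
loglik_permuted = mixture_log_likelihood(
    y, weights=[0.4, 0.4, 0.2], means=[3.0, -2.0, 0.0], sds=[0.5, 0.5, 0.5])
```

The two values agree up to floating-point rounding, which is the label-permutation invariance listed among the properties.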

Page 12: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Statistical inference of incomplete data models

Maximum likelihood estimate:

$$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k) \right]$$

→ It is not always possible since the likelihood, expanded over all configurations of Z, involves $K^n$ terms.

Expectation-Maximization algorithm: iterative algorithm based on the expectation of the complete-data log-likelihood conditionally on Y and $\theta^{(l)}$

$$\theta^{(l+1)} = \arg\max_{\theta} E\left\{ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right\}$$

→ According to the theory, it implies that $\log P(Y \mid K, \theta^{(l)})$ tends toward a local maximum.

Page 13: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

EM algorithm details

Initialisation of $\theta^{(0)}$

While the convergence criterion is not reached, iterate

E-step: calculation of the conditional probabilities

$$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$

M-step: calculation of $\hat{\theta}$ by maximising the complete likelihood where Z is replaced with the conditional probabilities

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[ \log \pi_k + \log f(y_i; \gamma_k) \right]$$

→ weighted version of the usual maximum likelihood estimates (MLE).
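The E-step and M-step above can be sketched for a univariate Gaussian mixture as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk or in any R package: the Gaussian form of f, the quantile-based initialisation, the stopping rule and all names (`em_gaussian_mixture`, `tol`, `n_iter`) are choices made for this example.

```python
import numpy as np

def em_gaussian_mixture(y, K, n_iter=200, tol=1e-8):
    y = np.asarray(y, dtype=float)
    n = y.size
    # Initialisation of theta^(0) (assumed strategy): equal proportions,
    # means at evenly spaced quantiles, common standard deviation
    pi = np.full(K, 1.0 / K)
    mu = np.quantile(y, (np.arange(K) + 0.5) / K)
    sd = np.full(K, y.std())
    loglik_path = []
    for _ in range(n_iter):
        # E-step: tau_ik = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        dens = pi * np.exp(-0.5 * ((y[:, None] - mu) / sd) ** 2) \
            / (sd * np.sqrt(2.0 * np.pi))
        g = dens.sum(axis=1)
        tau = dens / g[:, None]
        loglik_path.append(np.log(g).sum())
        # M-step: weighted versions of the usual Gaussian MLEs
        nk = tau.sum(axis=0)
        pi = nk / n
        mu = (tau * y[:, None]).sum(axis=0) / nk
        sd = np.sqrt((tau * (y[:, None] - mu) ** 2).sum(axis=0) / nk)
        sd = np.maximum(sd, 1e-6)  # guard against a degenerate component
        # Stop when the log-likelihood no longer increases noticeably
        if len(loglik_path) > 1 and loglik_path[-1] - loglik_path[-2] < tol:
            break
    return pi, mu, sd, tau, np.array(loglik_path)

# Simulated data with two well-separated subpopulations (toy example)
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi_hat, mu_hat, sd_hat, tau, loglik_path = em_gaussian_mixture(y, K=2)
```

Plotting `loglik_path` against the iteration index is a quick way to see the log-likelihood increase toward a (possibly local) maximum.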

Page 14: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

EM algorithm properties

Convergence is always reached but not always toward a global maximum

EM algorithm is sensitive to the initialisation step

EM algorithm is available in all good statistical software

In R software, it is available in MCLUST and RMIXMOD packages.

RMIXMOD proposes the best strategy of initialisation


Page 15: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Outputs of the model

Distribution:

$$g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$$

Conditional probabilities:

$$\tau_{ik} = P(Z_i = k \mid y_i) = \frac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$$

τik (%)   i = 1   i = 2   i = 3
k = 1     65.8    0.7     0.0
k = 2     34.2    47.8    0.0
k = 3     0.0     51.5    1.0

→ These probabilities enable the classification of the observations into the subpopulations

Page 16: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Outputs of the model


Maximum A Posteriori rule: Classification in the component for which the conditional probability is the highest.
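The Maximum A Posteriori rule can be sketched on the conditional probabilities of the "Outputs of the model" table. The values are copied as printed on the slide; the array layout and variable names are assumptions for this example.

```python
import numpy as np

# tau_ik in %, rows k = 1..3, columns i = 1..3 (values as printed in the table)
tau_percent = np.array([
    [65.8,  0.7, 0.0],
    [34.2, 47.8, 0.0],
    [ 0.0, 51.5, 1.0],
])

# MAP rule: for each observation i, pick the component k with the highest tau_ik
map_labels = tau_percent.argmax(axis=0) + 1   # +1 so that labels start at 1
# map_labels is array([1, 3, 3])

# The winning probability shows how certain each assignment is
map_confidence = tau_percent.max(axis=0)
```

Observation i = 2 is assigned to component 3 with only 51.5%, so its classification is far less certain than that of observation i = 1.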

Page 17: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Model selection

The number of components of the mixture is often unknown

A collection of models is considered, where K varies between 2 and $K_{max}$

The best model is the one maximising a criterion

Bayesian Information Criterion (BIC)

- proxy of the integrated likelihood $P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$
- aims at finding a good number of components for a global fit of the data distribution

$$BIC(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$

where
- $\nu_K$ is the number of free parameters of the model
- $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.

Page 18: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Model selection


Integrated Completed Likelihood (ICL)

- proxy of the integrated complete likelihood $P(Y, Z \mid K)$
- dedicated to classification since it strongly penalizes models for which the classification is uncertain

$$ICL(K) = BIC(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik},$$

where $BIC(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$ and
- $\nu_K$ is the number of free parameters
- $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
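The two criteria can be computed from the maximised log-likelihood and the conditional probabilities as in the sketch below. The function names, the parameter count for a univariate Gaussian mixture with free variances, and the toy numbers are assumptions for illustration; the formulas follow the definitions of BIC(K) and ICL(K) given on the model selection slides (both to be maximised).

```python
import numpy as np

def bic(loglik, n_free_params, n):
    # BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) * log(n)
    return loglik - 0.5 * n_free_params * np.log(n)

def icl(loglik, n_free_params, tau):
    # ICL(K) = BIC(K) + sum_i sum_k tau_ik * log(tau_ik), with 0 * log(0) = 0
    tau = np.asarray(tau, dtype=float)
    n = tau.shape[0]
    safe_tau = np.where(tau > 0, tau, 1.0)     # log(1) = 0 for zero entries
    return bic(loglik, n_free_params, n) + np.sum(tau * np.log(safe_tau))

def n_free_params_univariate_gaussian(K):
    # Assumed parametrisation: K - 1 proportions, K means, K variances
    return (K - 1) + 2 * K

# Toy comparison: same fit, certain versus uncertain classification
loglik, K = -100.0, 2
nu = n_free_params_univariate_gaussian(K)
tau_certain = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
tau_uncertain = np.full((4, 2), 0.5)

bic_value = bic(loglik, nu, n=4)
icl_certain = icl(loglik, nu, tau_certain)      # equals bic_value
icl_uncertain = icl(loglik, nu, tau_uncertain)  # lower than bic_value
```

When every observation is classified with probability 1 the entropy term vanishes and ICL equals BIC; the more uncertain the classification, the larger the penalty.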

Page 19: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Conclusions on the model selection

- BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components
- ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain.
- Whatever the criterion, it must be a convex function of the number of components

[Figure: two criterion curves, labelled "Bad behavior" and "Correct behavior"]

→ a non-convex function may indicate a modeling issue

Page 20: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Presentation outline

1 Introduction

2 Mixture model definition

3 Genomic examples
- Mixtures for co-expression analysis
- Mixtures for analysing ChIP-chip data

4 Conclusions


Page 21: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

GEM2Net: From gene expression modeling to -omics network

Goal: Explore the orphan gene space to identify new genes involved in defense and adaptation processes

Method: Predict co-expression networks using mixture models

Data: An original resource generated by the transcriptomic platform of URGV

- Homogeneous data generated with the CATMA microarray
- 5,095 genes not present on the Affymetrix chip
- High diversity of biological samples relative to stress conditions

Page 22: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Workflow overview

- Extraction from CATdb of 387 stress comparisons

- 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)

- Analyses performed with Gaussian Mixture Models

- According to the BIC curve, the naive clustering on the whole dataset is not relevant

- Gene co-expression depends on the stress categories → The functional modules vary with the environment

Page 23: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Results of the co-expression analysis

- 18 categories (9 biotic and 9 abiotic), identification of 681 clusters

- Large overlap between biotic and abiotic clusters

- 98% of clusters have a functional bias toward a Gene Ontology term

- 80% are associated with a stress term

- 39% have a preferential sub-cellular localization in plastid

- 18% are enriched in transcription factors and for stifenia, no cluster is enriched in TF


Page 24: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Focus on nematode stress

- 7467 genes described by 10 expression differences
- 29 clusters of co-expression identified
- 1519 genes with a conditional probability close to 1

Example of Cluster 14

- 49 genes repressed from 14 days after infection
- 13 genes known to be involved in stress response
- 10 orphan genes
- Endoplasmic reticulum bias

Page 25: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

GEM2Net database

http://urgv.evry.inra.fr/GEM2NET

Integration of various resources: gene ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted).

Original representation and interactive visualization, using pie charts to summarize the functional biases at first glance

Page 26: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

ChIP-chip experiments

The log-ratio is not tractable while the couple (IP, Input) is

Development of a mixture of 2 linear regressions

Page 27: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

MultiChIPmix: Mixture of two linear regressions

Let $Z_i$ be the status of probe $i$: $P(Z_i = 1) = \pi$

The linear relation between IP and Input depends on the probe status

$$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$

Martin-Magniette et al. (2008), Bioinformatics
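To make the mixture of two regressions concrete, the sketch below computes the conditional probability that a probe is enriched once the parameters are known. It is not the MultiChIPmix code: it assumes Gaussian errors $E_{ir}$, reads the index r as a replicate, and treats replicates as independent given $Z_i$; the function name and the toy parameter values are hypothetical.

```python
import numpy as np

def enriched_probability(ip, inp, pi, a0, b0, a1, b1, sigma):
    """P(Z_i = 1 | data) for each probe; ip and inp have shape (n_probes, n_replicates)."""
    ip = np.asarray(ip, dtype=float)
    inp = np.asarray(inp, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    def log_density(a, b):
        # Sum over replicates r of the Gaussian log-density of IP_ir
        # around the regression line a_r + b_r * Input_ir
        resid = ip - (np.asarray(a) + np.asarray(b) * inp)
        return np.sum(-0.5 * (resid / sigma) ** 2
                      - np.log(sigma * np.sqrt(2.0 * np.pi)), axis=1)

    log_normal = np.log(1.0 - pi) + log_density(a0, b0)
    log_enriched = np.log(pi) + log_density(a1, b1)
    # Subtract the maximum before exponentiating for numerical stability
    m = np.maximum(log_normal, log_enriched)
    num = np.exp(log_enriched - m)
    return num / (np.exp(log_normal - m) + num)

# Two probes measured in two replicates (toy values)
inp = [[8.0, 8.0], [8.0, 8.0]]
ip = [[8.0, 8.0],     # lies on the "normal" line
      [10.0, 10.0]]   # lies on the "enriched" line
tau_enriched = enriched_probability(
    ip, inp, pi=0.1,
    a0=[0.0, 0.0], b0=[1.0, 1.0],
    a1=[2.0, 2.0], b1=[1.0, 1.0],
    sigma=[0.5, 0.5])
```

In this toy setting the first probe gets a probability close to 0 and the second a probability close to 1; combining replicates in one product is what lets several replicates be analysed simultaneously.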


Page 29: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Used to
- create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal
- study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal

Page 30: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

MultiChIPmixHMM for taking the spatial information into account

- When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered
- Assuming that the probe statuses are (Markov-)dependent brings this information into the model: $\{Z_i\} \sim MC(\pi, \nu)$

$$\pi_{k\ell} = \Pr\{Z_i = k \mid Z_{i-1} = \ell\}$$
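How Markov dependence changes the conditional probabilities can be sketched with a generic scaled forward-backward recursion. This is a textbook HMM computation written for illustration, not the MultiChIPmixHMM implementation; the function name and toy values are assumptions. Note that the code stores transitions as `trans[l, k]`, the probability of moving from status l to status k, so each row sums to 1.

```python
import numpy as np

def forward_backward(emission, trans, init):
    """Posterior P(Z_i = k | all data).

    emission[i, k] = f(y_i | Z_i = k)
    trans[l, k]    = P(Z_i = k | Z_{i-1} = l)
    init[k]        = P(Z_1 = k)
    """
    emission = np.asarray(emission, dtype=float)
    trans = np.asarray(trans, dtype=float)
    n, K = emission.shape
    alpha = np.zeros((n, K))
    beta = np.ones((n, K))
    scale = np.zeros(n)
    # Forward pass, rescaled at each position to avoid underflow
    alpha[0] = np.asarray(init) * emission[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans) * emission[i]
        scale[i] = alpha[i].sum()
        alpha[i] /= scale[i]
    # Backward pass, using the same scaling constants
    for i in range(n - 2, -1, -1):
        beta[i] = trans @ (emission[i + 1] * beta[i + 1]) / scale[i + 1]
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Five probes, status 0 = normal, 1 = enriched; the middle probe is ambiguous
emission = np.array([[0.1, 0.9], [0.1, 0.9], [0.5, 0.5], [0.1, 0.9], [0.1, 0.9]])
init = np.array([0.5, 0.5])

sticky = np.array([[0.9, 0.1], [0.1, 0.9]])        # neighbours tend to share a status
independent = np.array([[0.5, 0.5], [0.5, 0.5]])   # reduces to the plain mixture

post_hmm = forward_backward(emission, sticky, init)
post_mix = forward_backward(emission, independent, init)
```

With independent statuses the ambiguous middle probe stays at 0.5, whereas the Markov model raises its probability of being enriched because its neighbours are.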

Page 31: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.

MultiChIPmix and MultiChIPmixHMM are alternatives to peak-detection methods

Analysis of several replicates simultaneously + modelling the spatial dependency = more accurate conditional probabilities

MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics

Page 32: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Presentation outline

1 Introduction

2 Mixture model definition

3 Genomic examples

4 Conclusions


Page 33: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Conclusions

- Mixtures reveal underlying structures
- Key ingredients are P(Z) and P(Y|Z)
- For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data
- Applications on genomic data sometimes raise new methodological questions about the parameter inference and classification rules
- Examples of R packages using mixtures: Mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix

Page 34: Marie Laure-Martin-Magniette - Mixture models for analysing transcriptome and chIP‐chip data

Acknowledgements

Statistics: S. Robin, T. Mary-Huard, C. Bérard, G. Celeux, C. Maugis-Rabusseau, G. Rigaill, A. Rau, P. Papastamoulis, M. Seifert

Bioinformatics: V. Brunaud, J-P Tamby, R. Zaag, Z. Tariq

Biology: J-P. Renou, E. Delannoy, S. Balzergue, V. Colot, F. Roudier

Thank you for your attention!