Weighted Lasso for Network inference


Transcript of Weighted Lasso for Network inference

Page 1: Weighted Lasso for Network inference

Inferring Gene Regulatory Networks from Time-Course Gene Expression Data

Camille Charbonnier and Julien Chiquet and Christophe Ambroise

Laboratoire Statistique et Génome, La Génopole - Université d'Évry

JOBIM – 7-9 September 2010


Page 2: Weighted Lasso for Network inference

Problem

[Figure: time course t0, t1, …, tn]

- ≈ tens of microarrays over time
- ≈ thousands of probes ("genes")

Inference

Which interactions?

The main statistical issue is the high-dimensional setting.


Page 3: Weighted Lasso for Network inference

Handling the scarcity of the data
By introducing some prior

Priors should be biologically grounded

1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering),

[Figure: an example network over genes G1 to G13]


Page 5: Weighted Lasso for Network inference

Handling the scarcity of the data
By introducing some prior

Priors should be biologically grounded

1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering),

[Figure: the same network organized into latent clusters A, B and C]

Page 6: Weighted Lasso for Network inference

Outline

Statistical models
- Gaussian Graphical Model for time-course data
- Structured regularization

Algorithms and methods
- Overall view
- Model selection

Numerical experiments
- Inference methods
- Performance on simulated data
- E. coli S.O.S. DNA repair network


Page 7: Weighted Lasso for Network inference

Gaussian Graphical Model for Time-course data

Collecting gene expression

1. Follow-up of one single experiment/individual;
2. Close enough time points to ensure
   - dependency between consecutive measurements;
   - homogeneity of the Markov process.

[Figure: the graph G over X1,…,X5 stands for the time-unfolded graph linking variables at time t to variables at time t+1]


Page 9: Weighted Lasso for Network inference

Gaussian Graphical Model for Time-course data

Collecting gene expression

1. Follow-up of one single experiment/individual;
2. Close enough time points to ensure
   - dependency between consecutive measurements;
   - homogeneity of the Markov process.

[Figure: the graph G repeated between each pair of consecutive time points of the series X^(1),…,X^(5), t = 1,…,n]

Page 10: Weighted Lasso for Network inference

Gaussian Graphical Model for Time-Course data

Assumption
A microarray can be represented as a multivariate Gaussian vector X = (X(1), …, X(p)) ∈ R^p, following a first-order vector autoregressive process VAR(1):

X_t = Θ X_{t−1} + b + ε_t,   t ∈ [1, n],

where we are looking for Θ = (θ_ij), i, j ∈ P.

Graphical interpretation

conditional dependency between X_{t−1}(j) and X_t(i)
⇕
non-null partial correlation between X_{t−1}(j) and X_t(i)
⇕
θ_ij ≠ 0

[Figure: if and only if the graph contains the edge j → i]
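A VAR(1) process of this kind is straightforward to simulate. The following numpy sketch (dimensions, coefficients and noise level are illustrative, not taken from the talk) makes the reading of Θ as an edge list explicit:

```python
import numpy as np

rng = np.random.default_rng(42)

p, n = 5, 200
# Sparse transition matrix: Theta[i, j] != 0  <=>  edge j -> i in the graph.
Theta = np.zeros((p, p))
Theta[1, 0] = 0.6   # gene 1 regulates gene 2
Theta[3, 2] = -0.5  # gene 3 represses gene 4
b = np.zeros(p)     # intercept, null here for simplicity

# Simulate X_t = Theta X_{t-1} + b + eps_t.
X = np.zeros((n, p))
for t in range(1, n):
    X[t] = Theta @ X[t - 1] + b + 0.1 * rng.standard_normal(p)

# The edge set is exactly the support of Theta.
edges = [(j, i) for i in range(p) for j in range(p) if Theta[i, j] != 0]
print(edges)  # [(0, 1), (2, 3)]
```

Reading off the support of Θ is what turns the estimation problem of the next slides into a network-inference problem.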



Page 13: Weighted Lasso for Network inference

Gaussian Graphical Model for Time-course data

Let
- X be the n × p matrix whose kth row is X_k,
- S = n⁻¹ X_{∖n}ᵀ X_{∖n} be the within-time covariance matrix,
- V = n⁻¹ X_{∖n}ᵀ X_{∖0} be the across-time covariance matrix.

The log-likelihood

L_time(Θ; S, V) = n · Trace(VΘ) − (n/2) · Trace(Θᵀ S Θ) + c.

Maximum likelihood estimator: Θ̂_MLE = S⁻¹V

- not defined for n < p;
- even if n > p, requires multiple testing.
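The estimator above can be sketched in a few lines of numpy (a toy illustration, not the talk's implementation; with the row convention used here, S⁻¹V estimates Θᵀ, so the result is transposed back):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stable VAR(1): X_t = Theta X_{t-1} + eps_t (intercept b = 0).
p, n = 4, 5000
Theta = np.array([[0.5, 0.0, 0.0, 0.3],
                  [0.0, 0.4, 0.0, 0.0],
                  [0.2, 0.0, 0.3, 0.0],
                  [0.0, 0.0, 0.0, 0.5]])
X = np.zeros((n, p))
for t in range(1, n):
    X[t] = Theta @ X[t - 1] + 0.1 * rng.standard_normal(p)

X_lag, X_now = X[:-1], X[1:]         # X without its last / first row
S = X_lag.T @ X_lag / n              # within-time covariance
V = X_lag.T @ X_now / n              # across-time covariance

# Transpose back to the convention X_t = Theta X_{t-1}.
Theta_mle = np.linalg.solve(S, V).T
```

With n ≫ p the estimate is close to the true Θ; the slide's point is precisely that this breaks down in the high-dimensional regime n < p, where S is singular.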


Page 14: Weighted Lasso for Network inference

Structured regularization
ℓ1-penalized log-likelihood

Charbonnier, Chiquet, Ambroise, SAGMB 2010

Θ̂_λ = argmax_Θ  L_time(Θ; S, V) − λ · Σ_{i,j ∈ P} P^Z_ij |Θ_ij|

where λ is an overall tuning parameter and P^Z is a (non-symmetric) matrix of weights depending on the underlying clustering Z.

It performs
1. regularization (needed when n ≪ p),
2. selection (specificity of the ℓ1-norm),
3. cluster-driven inference (penalty adapted to Z).
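Row by row, the penalized criterion reduces to a weighted-ℓ1 regression. A minimal coordinate-descent sketch (an illustrative re-implementation, not the SIMoNe code; data and λ are toy values) shows where the per-coefficient weights enter:

```python
import numpy as np

def weighted_lasso(X, y, lam, w, n_iter=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum_j w[j] * |b[j]|
    by cyclic coordinate descent with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j.
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            # Soft-threshold with the coordinate-specific weight w[j].
            beta[j] = np.sign(z) * max(abs(z) - lam * w[j], 0.0) / col_sq[j]
    return beta

# Toy check: a sparse signal is recovered, null coefficients are zeroed out.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.1 * rng.standard_normal(100)
beta = weighted_lasso(X, y, lam=5.0, w=np.ones(5))
```

In the network setting, `w` would be the row of P^Z corresponding to the regressed gene, so edges deemed likely by the clustering are penalized less.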


Page 15: Weighted Lasso for Network inference

Structured regularization
"Bayesian" interpretation of ℓ1 regularization

Laplacian prior on Θ depends on the clustering Z

P(Θ | Z) ∝ ∏_{i,j} exp{ −λ · P^Z_ij · |Θ_ij| }.

P^Z summarizes prior information on the position of edges.

[Figure: Laplacian prior density of Θ_ij, for λP^Z_ij = 1 and λP^Z_ij = 2]

Page 16: Weighted Lasso for Network inference

How to come up with a latent clustering?

Biological expertise

- Build Z from prior biological information:
  - transcription factors vs. regulatees,
  - number of potential binding sites,
  - KEGG pathways, …
- Build the weight matrix from Z.

Inference: Erdős–Rényi Mixture for Networks (Daudin et al., 2008)

- Spread the nodes into Q classes;
- Connection probabilities depend upon the node classes:

  P(i → j | i ∈ class q, j ∈ class ℓ) = π_qℓ.

- Build P^Z ∝ 1 − π_qℓ.
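Once class memberships and a connectivity matrix are available, turning them into penalty weights is a direct lookup. A small sketch with hypothetical values (the classes, the probabilities and the plain 1 − π transformation are illustrative):

```python
import numpy as np

# Estimated class of each node (Q = 2 classes: 0 = hub, 1 = leaf).
z = np.array([0, 0, 1, 1, 1])

# Estimated connection probabilities pi[q, l] between classes.
pi = np.array([[0.8, 0.6],
               [0.1, 0.05]])

# P^Z proportional to 1 - pi: likely edges receive a light penalty,
# unlikely edges a heavy one. Entry (i, j) looks up pi[z[i], z[j]].
PZ = 1.0 - pi[np.ix_(z, z)]
```

Note the resulting matrix is non-symmetric whenever π_qℓ ≠ π_ℓq, matching the directed-edge setting of the talk.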


Page 17: Weighted Lasso for Network inference

Algorithm

SIMoNe
Suppose you want to recover a clustered network:

[Figure: the target network and its adjacency matrix]


Page 18: Weighted Lasso for Network inference

Algorithm

SIMoNe
Start with microarray data.

[Figure: the data matrix]


Page 19: Weighted Lasso for Network inference

Algorithm

SIMoNe

[Figure: Data → inference without prior → adjacency matrix corresponding to Ĝ]


Page 20: Weighted Lasso for Network inference

Algorithm

SIMoNe

[Figure: Data → inference without prior → adjacency matrix corresponding to Ĝ → Mixer → connectivity matrix π_Z → decreasing transformation → penalty matrix P^Z]


Page 21: Weighted Lasso for Network inference

Algorithm

SIMoNe

[Figure: Data → inference without prior → Ĝ → Mixer → connectivity matrix π_Z → decreasing transformation → penalty matrix P^Z; Data + P^Z → inference with clustered prior → adjacency matrix corresponding to Ĝ_Z]


Page 22: Weighted Lasso for Network inference

Tuning the penalty parameter
BIC / AIC

Degrees of freedom of the Lasso (Zou et al., 2008)

df(β̂_λ) = Σ_k 1(β̂_λ,k ≠ 0)

Straightforward extensions to the graphical framework:

BIC(λ) = L(Θ̂_λ; X) − df(Θ̂_λ) · (log n)/2
AIC(λ) = L(Θ̂_λ; X) − df(Θ̂_λ)

- These criteria rely on asymptotic approximations, but remain relevant on simulated small samples.
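Both criteria are cheap to evaluate once the support of Θ̂_λ is known. A minimal sketch (the log-likelihood value is passed in as a placeholder; criteria are written to be maximized, as on the slide):

```python
import numpy as np

def df_lasso(Theta_hat):
    """Degrees of freedom = number of non-zero estimated coefficients."""
    return int(np.count_nonzero(Theta_hat))

def bic(loglik, Theta_hat, n):
    """Penalized log-likelihood: higher is better along the lambda path."""
    return loglik - df_lasso(Theta_hat) * np.log(n) / 2.0

def aic(loglik, Theta_hat):
    return loglik - df_lasso(Theta_hat)

# Toy 2x2 estimate with two selected coefficients.
Theta_hat = np.array([[0.5, 0.0],
                      [0.2, 0.0]])
```

In practice one scans a grid of λ values and keeps the Θ̂_λ maximizing the chosen criterion.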


Page 23: Weighted Lasso for Network inference

Inference methods

1. Lasso (Tibshirani)

2. Adaptive Lasso (Zou et al.): weights inversely proportional to an initial Lasso estimate.

3. KnwCl: weights structured according to the true clustering.

4. InfCl: weights structured according to the inferred clustering.

5. Renet-VAR (Shimamura et al.): edge estimation based on a recursive elastic net.

6. G1DBN (Lebre et al.): edge estimation based on dynamic Bayesian networks, followed by statistical testing of edges.

7. Shrinkage (Opgen-Rhein et al.): edge estimation based on shrinkage, followed by local false discovery rate correction for multiple testing.
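For reference, the adaptive-Lasso weights of method 2 are typically built from an initial estimate as follows (a generic sketch; the exponent γ and the small constant ε are common but illustrative choices, not the talk's settings):

```python
import numpy as np

def adaptive_weights(beta_init, gamma=1.0, eps=1e-6):
    """Weights inversely proportional to an initial estimate:
    large initial coefficients get a small penalty, and vice versa."""
    return 1.0 / (np.abs(beta_init) + eps) ** gamma

# Large, small and null initial coefficients.
beta_init = np.array([2.0, 0.1, 0.0])
w = adaptive_weights(beta_init)
```

These weights then play the same role as P^Z in the weighted-ℓ1 penalty, but derive from the data rather than from a clustering.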


Page 24: Weighted Lasso for Network inference

Simulations: time-course data with star-pattern

Simulation settings

- 2 classes, hubs and leaves, with proportions α = (0.1, 0.9),
- K = 2p edges, among which:
  - 85% from hubs to leaves,
  - 15% between hubs.

p genes   n arrays   samples
20        40         500
20        20         500
20        10         500
100       100        200
100       50         200
800       20         100


Page 25: Weighted Lasso for Network inference

Simulations: time-course data with star-pattern

[Figure: precision and recall of Lasso-AIC, Lasso-BIC, Adaptive-AIC, Adaptive-BIC, KnwCl-AIC, KnwCl-BIC, InfCl-AIC, InfCl-BIC, Renet-VAR, G1DBN and Shrinkage in six settings: a) p=20, n=40; b) p=20, n=20; c) p=20, n=10; d) p=100, n=100; e) p=100, n=50; f) p=800, n=20]

Page 26: Weighted Lasso for Network inference

Reasonable computing time

[Figure: log CPU time (seconds, from 1 second to 4 hours) against log(p/n) for G1DBN, Renet-VAR and InfCl]

Figure: Computing times on the log-log scale for Renet-VAR, G1DBN and InfCl (including inference of classes). Intel Dual Core 3.40 GHz processor.


Page 27: Weighted Lasso for Network inference

E. coli S.O.S DNA repair network

E. coli S.O.S. data

- 8 major genes involved in the S.O.S. response
- 50 time points

[Figure: the S.O.S. pathway]


Page 28: Weighted Lasso for Network inference

E. coli S.O.S. DNA repair network
Precision and Recall rates

[Figure: precision and recall of Lasso, Adaptive, G1DBN, Renet, KnownCl and InferCl across experiments 1 to 4]


Page 29: Weighted Lasso for Network inference

E. coli S.O.S. DNA repair network
Inferred networks

[Figure: networks over uvrD, lexA, umuDC, recA, uvrA, uvrY, ruvA and polB, as inferred by the Lasso, the Adaptive-Lasso, and the weighted Lasso with known and inferred classifications]


Page 30: Weighted Lasso for Network inference

Conclusions

To sum up

- cluster-driven inference of gene regulatory networks from time-course data,
- expert-based or inferred latent structure,
- embedded in the SIMoNe R package along with similar algorithms dealing with steady-state or multitask data.

Perspectives

- inference of truly dynamic networks,
- use of additional biological information to refine the inference,
- comparison of inferred networks.



Page 32: Weighted Lasso for Network inference

Publications

Ambroise, Chiquet, Matias (2009). Inferring sparse Gaussian graphical models with latent structure. Electronic Journal of Statistics, 3, 205-238.

Chiquet, Smith, Grasseau, Matias, Ambroise (2009). SIMoNe: Statistical Inference for MOdular NEtworks. Bioinformatics, 25(3), 417-418.

Charbonnier, Chiquet, Ambroise (2010). Weighted-Lasso for Structured Network Inference from Time Course Data. SAGMB, 9.

Chiquet, Grandvalet, Ambroise (2010). Inferring multiple Gaussian graphical models. Statistics and Computing.

Working paper: Chiquet, Charbonnier, Ambroise, Grasseau. SIMoNe: An R package for inferring Gaussian networks with latent structure. Journal of Statistical Software.

Working paper: Chiquet, Grandvalet, Ambroise, Jeanmougin. Biological analysis of breast cancer by multitask learning.
