Biological Network Inference via Gaussian Graphical Models


Transcript of Biological Network Inference via Gaussian Graphical Models

Page 1: Biological Network Inference via Gaussian Graphical Models

An introduction to Biological network inference via Gaussian Graphical Models

Christophe Ambroise, Julien Chiquet

Statistique et Génome, CNRS & Université d’Évry Val d’Essonne

São Paulo – School on Advanced Science – October 2012

http://stat.genopole.cnrs.fr/~cambroise

Network inference 1

Page 2: Biological Network Inference via Gaussian Graphical Models

Outline

Introduction
  Motivations; Background on omics; Modeling issue

Modeling tools
  Statistical dependence; Graphical models; Covariance selection and Gaussian vector

Gaussian Graphical Models for genomic data
  Steady-state data; Time-course data

Statistical inference
  Penalized likelihood approach; Inducing sparsity and regularization; The Lasso

Application in Post-genomics
  Modeling time-course data; Illustrations; Multitask learning


Page 5: Biological Network Inference via Gaussian Graphical Models

Real networks

- Many scientific fields:
  - World Wide Web
  - Biology, sociology, physics
- Nature of the data under study:
  - Interactions between N objects
  - O(N^2) possible interactions
- Network topology:
  - Describes the way nodes interact; structure/function relationship

Figure: Sample of 250 blogs (nodes) with their links (edges) from the French political blogosphere.

Page 6: Biological Network Inference via Gaussian Graphical Models

What the reconstructed networks are expected to be¹ (1)
Regulatory networks

E. coli regulatory network
- relationships between genes and their products
- inhibition/activation
- impossible to recover at large scale
- always incomplete

Figure 1: Regulatory network of E. coli displayed with colors for each MixNet class (5 groups).

¹ and are presumably wrongly assumed to be

Page 7: Biological Network Inference via Gaussian Graphical Models

What the reconstructed networks are expected to be (2)
Regulatory networks

Figure: Regulatory network identified in mammalian cells: highly structured

Page 8: Biological Network Inference via Gaussian Graphical Models

What the reconstructed networks are expected to be (3)
Protein–protein interaction networks

Figure: Yeast PPI network: do not be misled by the representation; trust the statistics!



Page 12: Biological Network Inference via Gaussian Graphical Models

What are we looking at?

Central dogma of molecular biology

DNA → mRNA (transcription) → Proteins (translation); DNA also copies itself (replication).

Proteins
- are the building blocks of any cellular functionality,
- are encoded by the genes,
- do interact (at the protein and gene level – regulations).

Page 13: Biological Network Inference via Gaussian Graphical Models

What questions in functional genomics? (1)

Various levels/scales of study
- genome: sequence analysis,
- transcriptome: gene expression levels,
- proteome: protein functions and interactions.

Questions
1. Biological understanding: mechanisms of diseases; gene/protein functions and interactions.
2. Medical/clinical care: diagnosis (type of disease), prognosis (survival analysis), treatment (prediction of response).


Page 15: Biological Network Inference via Gaussian Graphical Models

What questions in functional genomics? (2)

Central dogma of molecular biology

DNA → mRNA (transcription) → Proteins (translation); DNA also copies itself (replication).

Basic biostatistical issues
- Selecting some genes of interest (biomarkers),
- Looking for interactions between them (pathway analysis).

Page 16: Biological Network Inference via Gaussian Graphical Models

How is this measured? (1)
Microarray technology: parallel measurement of many biological features

After signal processing and pretreatment, we obtain a matrix of features with n ≪ p:

  X = [ x_1^1  x_1^2  x_1^3  ...  x_1^p ]
      [  ...                       ...  ]
      [ x_n^1  x_n^2  x_n^3  ...  x_n^p ]

Expression levels of p probes are simultaneously monitored for n individuals.

Page 17: Biological Network Inference via Gaussian Graphical Models

How is this measured? (2)
Next Generation Sequencing: parallel measurement of even many more biological features

After assembling and pretreatment, we obtain a matrix of features with n ≪ p:

  X = [ k_1^1  k_1^2  k_1^3  ...  k_1^p ]
      [  ...                       ...  ]
      [ k_n^1  k_n^2  k_n^3  ...  k_n^p ]

Expression counts are extracted from small repeated sequences and monitored for n individuals.

Page 18: Biological Network Inference via Gaussian Graphical Models

What questions are we dealing with? (1)
Supervised canonical example at the gene level: differential analysis

Leukemia (Golub data, thanks to P. Neuvial)
- AML – Acute Myeloblastic Leukemia, n1 = 11,
- ALL – Acute Lymphoblastic Leukemia, n2 = 27,
- an (n1 + n2)-vector of outcomes giving each patient's tumor type.

Supervised classification: find genes with significantly different expression levels between the groups – biomarkers (prediction purpose).
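As a toy illustration of this gene-by-gene differential analysis (not the actual Golub data: the matrix, effect sizes, and the 20 "truly differential" genes below are simulated, only the group sizes n1 = 11 and n2 = 27 come from the slide), a minimal sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-in for Golub-style data: n1 + n2 patients, p genes.
n1, n2, p = 11, 27, 1000
X_aml = rng.normal(0.0, 1.0, size=(n1, p))
X_all = rng.normal(0.0, 1.0, size=(n2, p))
X_all[:, :20] += 2.0  # make the first 20 genes truly differential

# One Welch t-test per gene, then a Bonferroni-corrected selection.
t, pval = stats.ttest_ind(X_aml, X_all, axis=0, equal_var=False)
biomarkers = np.flatnonzero(pval < 0.05 / p)
print(len(biomarkers), "candidate biomarkers")
```

With these settings most of the selected genes fall among the 20 truly shifted ones; the Bonferroni correction keeps the 980 null genes out.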

Page 19: Biological Network Inference via Gaussian Graphical Models

What questions are we dealing with? (2)
Unsupervised canonical example at the gene level: hierarchical clustering

Same kind of data, but no outcome is considered.

(Unsupervised) clustering: find groups of genes which show statistical dependencies/commonalities – hoping for biological interactions (exploratory purpose, functional understanding).

Can we do better than that? And how do genes interact anyway?
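A minimal hierarchical-clustering sketch in the same spirit, on a made-up expression matrix containing two gene modules (all sizes and the correlation-based dissimilarity are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Toy expression matrix: 30 genes x 20 samples, two correlated blocks.
base1, base2 = rng.normal(size=20), rng.normal(size=20)
genes = np.vstack([base1 + 0.3 * rng.normal(size=(15, 20)),
                   base2 + 0.3 * rng.normal(size=(15, 20))])

# Correlation-based dissimilarity, then average-linkage clustering.
d = pdist(genes, metric="correlation")  # 1 - Pearson correlation
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the tree at two clusters recovers the two blocks of co-expressed genes; note that this only groups statistically similar genes, it says nothing yet about how they interact.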



Page 22: Biological Network Inference via Gaussian Graphical Models

The problem at hand

Inference: from ≈ tens to hundreds of microarray/sequencing experiments over ≈ thousands of probes (“genes”), infer the network.

Modeling questions prior to inference
1. What do the nodes represent? (the easiest one)
2. What is/should be the meaning of an edge? (the toughest one)
   - Biologically?
   - Statistically?


Page 26: Biological Network Inference via Gaussian Graphical Models

More questions/issues

Modelling
- Is the network dynamic or static?
- How have the data been generated? (time-course/steady state)
- Are the edges oriented or not? (causality)
- What do the edges represent for my particular problem?

Statistical challenges
- (Ultra) high dimensionality,
- Noisy data, lack of reproducibility,
- Heterogeneity of the data (many techniques, various signals).



Page 29: Biological Network Inference via Gaussian Graphical Models

Canonical model settings
Biological microarrays in comparable conditions

Notations
1. a set P = {1, …, p} of p variables: these are typically the genes (they could be proteins);
2. a sample N = {1, …, n} of individuals associated with the variables: these are typically the microarrays (they could be sequence counts).

Basic statistical model. This can be viewed as
- a random vector X in R^p, whose j-th entry is the j-th variable,
- an n-sample (X_1, …, X_n), such that X_i is the i-th microarray,
  - the X_i could be independent identically distributed copies (steady-state data),
  - or dependent in a certain way (time-course data),
- assuming a parametric probability distribution for X (Gaussian).


Page 31: Biological Network Inference via Gaussian Graphical Models

Canonical model settings
Biological microarrays in comparable conditions

The data: stacking (X_1, …, X_n), we meet the usual individual/variable table

  X = [ x_1^1  x_1^2  x_1^3  ...  x_1^p ]
      [  ...                       ...  ]
      [ x_n^1  x_n^2  x_n^3  ...  x_n^p ]


Page 33: Biological Network Inference via Gaussian Graphical Models

Modeling relationships between variables (1)
Independence

Definition (Independence of events)
Two events A and B are independent if and only if

  P(A, B) = P(A) P(B),

which is usually denoted by A ⊥⊥ B. Equivalently,
- A ⊥⊥ B ⇔ P(A|B) = P(A),
- A ⊥⊥ B ⇔ P(A|B) = P(A|Bᶜ).

Example (class vs party)

  class \ party   Labour  Tory        class \ party   Labour  Tory
  working          0.42   0.28        working          0.60   0.40
  bourgeoisie      0.06   0.24        bourgeoisie      0.20   0.80

Table: Joint probability (left) vs. conditional probability (right)
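The left table can be checked numerically. The short sketch below recomputes the marginals from the joint table of the slide, tests the product criterion, and recovers the conditional table on the right:

```python
import numpy as np

# Joint probabilities P(class, party) from the left table.
joint = np.array([[0.42, 0.28],    # working:     Labour, Tory
                  [0.06, 0.24]])   # bourgeoisie: Labour, Tory

p_class = joint.sum(axis=1)        # marginals: [0.70, 0.30]
p_party = joint.sum(axis=0)        # marginals: [0.48, 0.52]

# Under independence the joint would equal the product of the marginals.
product = np.outer(p_class, p_party)
print(np.allclose(joint, product))  # → False: class and party are dependent

# Conditional probabilities P(party | class) reproduce the right-hand table.
cond = joint / p_class[:, None]
print(np.round(cond, 2))           # rows [0.6, 0.4] and [0.2, 0.8]
```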


Page 35: Biological Network Inference via Gaussian Graphical Models

Modeling relationships between variables (2)
Conditional independence

Generalizing to more than two events requires strong assumptions (mutual independence). It is better handled with:

Definition (Conditional independence of events)
Two events A and B are conditionally independent given C if and only if

  P(A, B | C) = P(A|C) P(B|C),

which is usually denoted by A ⊥⊥ B | C.

Example (Does IQ depend on weight?)
Consider the events A = “having a low IQ”, B = “having a low weight”.


Page 37: Biological Network Inference via Gaussian Graphical Models

Modeling relationships between variables (2)
Conditional independence

Example (Does IQ depend on weight?), continued. Consider the events A = “having a low IQ”, B = “having a low weight”. Estimating² P(A, B), P(A) and P(B) in a sample would lead to

  P(A, B) ≠ P(A) P(B).

² stupidly

Page 38: Biological Network Inference via Gaussian Graphical Models

Modeling relationships between variables (2)
Conditional independence

Example, continued. But in fact, introducing C = “having a given age”,

  P(A, B | C) = P(A|C) P(B|C).
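A numeric sketch of the same phenomenon with made-up probabilities: A ("low score") and B ("low weight") are built to be conditionally independent given age C, yet they come out marginally dependent because both are more likely for children:

```python
import numpy as np

# Hypothetical numbers, chosen only to illustrate the confounding effect.
p_c = np.array([0.5, 0.5])       # P(child), P(adult)
p_a = np.array([0.8, 0.2])       # P(A | child), P(A | adult)
p_b = np.array([0.9, 0.1])       # P(B | child), P(B | adult)

# Marginal joint of (A, B): sum over C of P(C) P(A|C) P(B|C),
# which is valid precisely because A ⊥⊥ B | C by construction.
p_ab = np.sum(p_c * p_a * p_b)   # = 0.5*0.8*0.9 + 0.5*0.2*0.1 = 0.37
p_a_marg = np.sum(p_c * p_a)     # P(A) = 0.5
p_b_marg = np.sum(p_c * p_b)     # P(B) = 0.5

print(p_ab, p_a_marg * p_b_marg)  # 0.37 vs 0.25: marginally dependent
```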

Page 39: Biological Network Inference via Gaussian Graphical Models

Independence of random vectors (1)
Independence and conditional independence: natural generalization

Definition
Consider 3 random vectors X, Y, Z with distributions f_X, f_Y, f_Z, and joint distributions f_XY, f_XYZ. Then
- X and Y are independent iff f_XY(x, y) = f_X(x) f_Y(y);
- X and Y are conditionally independent given Z iff, for all z such that f_Z(z) > 0, f_{XY|Z}(x, y; z) = f_{X|Z}(x; z) f_{Y|Z}(y; z).

Proposition (Factorization criterion)
X and Y are independent (resp. conditionally independent given Z) iff there exist functions g and h such that, for all x and y,
1. f_XY(x, y) = g(x) h(y),
2. f_XYZ(x, y, z) = g(x, z) h(y, z), for all z with f_Z(z) > 0.


Page 41: Biological Network Inference via Gaussian Graphical Models

Independence of random vectors (2)
Independence vs conditional independence

Figure: Mutual independence, conditional dependence, full dependence — densities f with X ⊥⊥ Z | Y, X ⊥⊥ Y | Z, Y ⊥⊥ Z | X, a general f_XYZ, and the fully independent f_X f_Y f_Z.

Page 42: Biological Network Inference via Gaussian Graphical Models

OutlineIntroduction

MotivationsBackground on omicsModeling issue

Modeling toolsStatistical dependenceGraphical modelsCovariance selection and Gaussian vector

Gaussian Graphical Models for genomic dataSteady-state dataTime-course data

Statistical inferencePenalized likelihood approachInducing sparsity and regularizationThe Lasso

Application in Post-genomicsModeling time-course dataIllustrationsMultitask learning

Network inference 27

Page 43: Biological Network Inference via Gaussian Graphical Models

Definition

Definition. A graphical model gives a graphical (intuitive) representation of the dependence structure of a probability distribution.

Graphical structure ↔ random variables / random vector

It links
1. a random vector (or a set of random variables) X = {X_1, …, X_p} with distribution P,
2. a graph G = (P, E) where
   - P = {1, …, p} is the set of nodes, one per variable,
   - E is a set of edges describing the dependence relationships of X ∼ P.


Page 45: Biological Network Inference via Gaussian Graphical Models

Conditional Independence Graphs
Definition

Definition. The conditional independence graph of a random vector X is the undirected graph G = (P, E), with set of nodes P = {1, …, p}, where

  (i, j) ∉ E ⇔ X_i ⊥⊥ X_j | X_{P \ {i, j}}.

Property. It satisfies the Markov property: any two subsets of variables separated by a third are independent conditionally on the variables in the third set.


Page 47: Biological Network Inference via Gaussian Graphical Models

Conditional Independence Graphs
An example

Let X_1, X_2, X_3, X_4 be four random variables with joint probability density function f_X(x) = exp(u + x_1 + x_1 x_2 + x_2 x_3 x_4), with u a given constant.

Apply the factorization property:

  f_X(x) = exp(u + x_1 + x_1 x_2 + x_2 x_3 x_4)
         = exp(u) · exp(x_1 + x_1 x_2) · exp(x_2 x_3 x_4)

Graphical representation: G = (P, E) such that P = {1, 2, 3, 4} and

  E = {(1, 2), (2, 3), (2, 4), (3, 4)}.
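This structure can be verified numerically. The sketch below restricts the variables to {0, 1} (an assumption made only to allow exact enumeration; the slide's density is over R^4) and checks the absence of edge (1, 3): X_1 ⊥⊥ X_3 given all remaining variables {X_2, X_4}:

```python
import itertools
import numpy as np

# Unnormalized density from the slide (u absorbed in the normalization),
# restricted to binary variables for tractability.
def phi(x1, x2, x3, x4):
    return np.exp(x1 + x1 * x2 + x2 * x3 * x4)

states = list(itertools.product([0, 1], repeat=4))
z = sum(phi(*s) for s in states)              # normalizing constant
p = {s: phi(*s) / z for s in states}

# No edge (1, 3): P(x1, x3 | x2, x4) must factorize for every (x2, x4).
for x2, x4 in itertools.product([0, 1], repeat=2):
    p24 = sum(p[(a, x2, b, x4)] for a in (0, 1) for b in (0, 1))
    for x1, x3 in itertools.product([0, 1], repeat=2):
        cond = p[(x1, x2, x3, x4)] / p24                     # P(x1,x3 | x2,x4)
        m1 = sum(p[(x1, x2, b, x4)] for b in (0, 1)) / p24   # P(x1 | x2,x4)
        m3 = sum(p[(a, x2, x3, x4)] for a in (0, 1)) / p24   # P(x3 | x2,x4)
        assert np.isclose(cond, m1 * m3)
print("X1 ⊥⊥ X3 | {X2, X4}: verified")
```

The assertion holds because f_X splits into a factor in (x_1, x_2) and a factor in (x_2, x_3, x_4), exactly the factorization criterion above.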


Page 51: Biological Network Inference via Gaussian Graphical Models

Directed Acyclic conditional independence Graph (DAG)
Motivation

Limitation of undirected graphs. Sometimes an ordering on the variables is known, which allows one to break the symmetry of the graphical representation and introduce, in some sense, “causality” into the modeling.

Consequences
- Each element of E has to be directed.
- There is no directed cycle in the graph.

We thus deal with a directed acyclic graph (or DAG).

Page 52: Biological Network Inference via Gaussian Graphical Models

Directed Acyclic conditional independence Graph (DAG)
Definition

Definition (Ordering)
An ordering of the variables {1, …, p} is a relation ≺ such that: i) for every pair (i, j), either i ≺ j or j ≺ i; ii) ≺ is transitive; iii) ≺ is not reflexive.

- A natural ordering is obtained when variables are observed across time,
- A natural conditioning set for a pair of variables (i, j) with i ≺ j is the past of j, denoted P(j) = {1, …, j}.

Definition (DAG)
The directed conditional dependence graph of X is the directed graph G = (P, E_≺) where, for (i, j) such that i ≺ j,

  (i, j) ∉ E ⇔ X_j ⊥⊥ X_i | X_{P(j) \ {i, j}}.


Page 54: Biological Network Inference via Gaussian Graphical Models

Directed Acyclic conditional independence Graph (DAG)
Factorization and Markov property

Another view uses parent/descendant relationships to deal with the ordering of the nodes.

The factorization property:

  f_X(x) = ∏_{k=1}^{p} f_{X_k | pa_k}(x_k | pa_k),

where pa_k are the parents of node k.

Page 55: Biological Network Inference via Gaussian Graphical Models

Directed Acyclic conditional independence Graph (DAG)
An example

Figure: a DAG over x_1, …, x_7, with edges 1→4, 2→4, 3→4, 1→5, 3→5, 4→6, 4→7, 5→7, matching the factorization

  f_X(x) = f_{X_1} f_{X_2} f_{X_3} f_{X_4|X_1,X_2,X_3} f_{X_5|X_1,X_3} f_{X_6|X_4} f_{X_7|X_4,X_5}.


Page 63: Biological Network Inference via Gaussian Graphical Models

Directed Acyclic conditional independence Graph (DAG)
Markov property

Local Markov property. For any Y among the non-descendants of k (Y ∉ de_k, where de_k are the descendants of k),

  X_k ⊥⊥ Y | pa_k,

that is, X_k is conditionally independent of its non-descendants given its parents.

Page 64: Biological Network Inference via Gaussian Graphical Models

Local Markov property: example

Figure: a DAG over x_1, …, x_5 (see the factorization used below).

Check that x_4 ⊥⊥ x_5 | {x_2, x_3}, using the factorization property:

  P(x_4 | x_5, x_2, x_3) = P(x_2, x_3, x_4, x_5) / P(x_2, x_3, x_5)
                         = [P(x_2) P(x_3) P(x_4|x_2, x_3) P(x_5|x_3)] / [P(x_2) P(x_3) P(x_5|x_3)]
                         = P(x_4 | x_2, x_3).
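The same computation can be checked numerically on randomly drawn conditional probability tables for the sub-DAG used above (binary variables; all the numbers are random and purely illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Random CPTs for the structure x4 <- {x2, x3}, x5 <- x3.
p2 = rng.uniform(0.2, 0.8)               # P(x2 = 1)
p3 = rng.uniform(0.2, 0.8)               # P(x3 = 1)
p4 = rng.uniform(0.2, 0.8, size=(2, 2))  # P(x4 = 1 | x2, x3)
p5 = rng.uniform(0.2, 0.8, size=2)       # P(x5 = 1 | x3)

def bern(p, x):                          # P(X = x) for X ~ Bernoulli(p)
    return p if x == 1 else 1 - p

def joint(x2, x3, x4, x5):               # the factorization above
    return (bern(p2, x2) * bern(p3, x3)
            * bern(p4[x2, x3], x4) * bern(p5[x3], x5))

# x4 ⊥⊥ x5 | {x2, x3}: conditioning on x5 never changes P(x4 | x2, x3).
for x2, x3, x5 in itertools.product([0, 1], repeat=3):
    denom = joint(x2, x3, 0, x5) + joint(x2, x3, 1, x5)
    lhs = joint(x2, x3, 1, x5) / denom   # P(x4 = 1 | x2, x3, x5)
    assert np.isclose(lhs, p4[x2, x3])
print("x4 ⊥⊥ x5 | {x2, x3}: verified")
```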



Page 67: Biological Network Inference via Gaussian Graphical Models

Modeling the genomic data
Gaussian assumption

The data:

  X = [ x_1^1  x_1^2  x_1^3  ...  x_1^p ]
      [  ...                       ...  ]
      [ x_n^1  x_n^2  x_n^3  ...  x_n^p ]

Assuming f_X multivariate Gaussian greatly simplifies the inference:
- it naturally links independence and conditional independence to the covariance and partial covariance,
- it gives a straightforward interpretation to the graphical modeling previously considered.
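A glimpse of what the Gaussian assumption buys: for a multivariate Gaussian, conditional independence between two variables given all the others corresponds to a zero in the precision matrix Θ = Σ⁻¹ (a standard Gaussian graphical model fact; the 3-variable chain below is a made-up example):

```python
import numpy as np

# A sparse precision matrix encodes the conditional independence graph:
# theta[i, j] == 0  <=>  X_i ⊥⊥ X_j given all the other variables.
theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])   # chain graph 1 - 2 - 3
sigma = np.linalg.inv(theta)

# X1 and X3 are marginally correlated (through X2)...
print(sigma[0, 2])                       # ≈ 0.25, nonzero

# ...but their partial correlation, read off the precision matrix as
# -theta[i, j] / sqrt(theta[i, i] * theta[j, j]), vanishes.
d = np.sqrt(np.diag(theta))
partial_cor = -theta / np.outer(d, d)
print(partial_cor[0, 2])                 # zero: no edge (1, 3)
```

This is why sparse estimation of Θ (the penalized-likelihood/Lasso approaches announced in the outline) recovers the conditional independence graph.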


Page 69: Biological Network Inference via Gaussian Graphical Models

Start gently with the univariate Gaussian distribution

The Gaussian distribution is the natural model for the expression level of a gene (noisy data).

We note X ∼ N(µ, σ²), so that E X = µ, Var X = σ², and

f_X(x) = 1/(σ√(2π)) · exp{ −(x − µ)² / (2σ²) },

and

log f_X(x) = −log(σ√(2π)) − (x − µ)² / (2σ²).

Useless for modeling the distribution of the expression levels of a whole set of genes.

Network inference 39


Page 71: Biological Network Inference via Gaussian Graphical Models

One step forward: the bivariate Gaussian distribution
Needs the concepts of covariance and correlation

Let X, Y be two real random variables.

Definitions

cov(X, Y) = E[(X − E X)(Y − E Y)] = E(XY) − E(X) E(Y).

ρ_XY = cor(X, Y) = cov(X, Y) / √(Var(X) · Var(Y)).

Proposition

- cov(X, X) = Var(X) = E[(X − E X)²],
- cov(X + Y, Z) = cov(X, Z) + cov(Y, Z),
- Var(X + Y) = Var(X) + Var(Y) + 2 cov(X, Y),
- X ⊥⊥ Y ⇒ cov(X, Y) = 0,
- X ⊥⊥ Y ⇔ cov(X, Y) = 0 when X, Y are Gaussian.

Network inference 40


Page 73: Biological Network Inference via Gaussian Graphical Models

The bivariate Gaussian distribution

f_XY(x, y) = 1/(2π √(det Σ)) · exp{ −½ (x − µ1, y − µ2) Σ⁻¹ (x − µ1, y − µ2)ᵀ },

where Σ is the variance/covariance matrix, which is symmetric and positive definite:

Σ = ⎛ Var(X)     cov(X, Y) ⎞
    ⎝ cov(X, Y)  Var(Y)    ⎠.

If standardized,

Σ = ⎛ 1     ρ_XY ⎞
    ⎝ ρ_XY  1    ⎠,

and

f_X,Y(x, y) = 1/(2π √(1 − ρ²_XY)) · exp{ −(x² + y² − 2 ρ_XY x y) / (2(1 − ρ²_XY)) },

where ρ_XY is the correlation between X and Y and describes the interaction between them.

Network inference 41


Page 75: Biological Network Inference via Gaussian Graphical Models

The bivariate Gaussian distribution
The Covariance Matrix

Let X ∼ N(0, Σ), with unit variances and ρ_XY = 0:

Σ = ⎛ 1  0 ⎞
    ⎝ 0  1 ⎠.

The shape of the 2-D distribution evolves accordingly (figure).

Network inference 42

Page 76: Biological Network Inference via Gaussian Graphical Models

The bivariate Gaussian distribution
The Covariance Matrix

Let X ∼ N(0, Σ), with unit variances and ρ_XY = 0.9:

Σ = ⎛ 1    0.9 ⎞
    ⎝ 0.9  1   ⎠.

The shape of the 2-D distribution evolves accordingly (figure).

Network inference 42

Page 77: Biological Network Inference via Gaussian Graphical Models

Full generalization: multivariate Gaussian vector
Now need partial covariance and partial correlation

Let X, Y, Z be real random variables.

Definitions

cov(X, Y | Z) = cov(X, Y) − cov(X, Z) cov(Y, Z) / Var(Z).

ρ_XY|Z = (ρ_XY − ρ_XZ ρ_YZ) / ( √(1 − ρ²_XZ) √(1 − ρ²_YZ) ).

These give the interaction between X and Y once the effect of Z has been removed.

Proposition
When X, Y, Z are jointly Gaussian, then

cov(X, Y | Z) = 0 ⇔ cor(X, Y | Z) = 0 ⇔ X ⊥⊥ Y | Z.

Network inference 43
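As a numerical sanity check (a sketch of mine, not from the slides: the simulated data and the regression helper are assumptions), the formula for ρ_XY|Z above agrees with correlating the residuals of X and Y after regressing each on Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# X and Y are both driven by Z plus independent noise,
# so X and Y are correlated but X is independent of Y given Z by construction.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = -0.5 * z + rng.normal(size=n)

def cor(a, b):
    return np.corrcoef(a, b)[0, 1]

# Partial correlation from the slide's formula
r_xy, r_xz, r_yz = cor(x, y), cor(x, z), cor(y, z)
rho_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Same quantity as the correlation of least-squares residuals
def residual(a, b):
    """Residual of the regression of a on b (with intercept)."""
    bc = b - b.mean()
    slope = np.dot(a - a.mean(), bc) / np.dot(bc, bc)
    return a - a.mean() - slope * bc

rho_resid = cor(residual(x, z), residual(y, z))
print(rho_xy_z, rho_resid)  # equal, and both close to 0 here
```

The two computations coincide exactly in-sample, which is the residual-based reading of partial correlation.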


Page 79: Biological Network Inference via Gaussian Graphical Models

The multivariate Gaussian distribution

Allows modeling the expression level of a whole set of genes P:

Gaussian vector
Let X ∼ N(µ, Σ), and consider any block decomposition with {a, b} a partition of P:

Σ = ⎛ Σaa  Σab ⎞
    ⎝ Σba  Σbb ⎠.

Then

1. Xa is Gaussian with distribution N(µa, Σaa);
2. Xa | Xb = x is Gaussian with distribution N(µ_a|b, Σ_a|b), with known parameters.

Network inference 44
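Point 2 states that the conditional distribution is "known"; for reference, the standard Gaussian conditioning formulas (not spelled out on the slide) are:

```latex
\mu_{a|b} = \mu_a + \Sigma_{ab}\,\Sigma_{bb}^{-1}\,(x_b - \mu_b),
\qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\,\Sigma_{bb}^{-1}\,\Sigma_{ba}.
```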


Page 82: Biological Network Inference via Gaussian Graphical Models

Steady-state data: scheme

≈ 10s of microarrays
≈ 1000s of probes ("genes")

Inference: which interactions?

Network inference 47

Page 83: Biological Network Inference via Gaussian Graphical Models

Modeling the underlying distribution (1)

Model for data generation

- A microarray can be represented as a multivariate vector X = (X1, …, Xp) ∈ ℝᵖ;
- consider n biological replicates in the same condition, which form a usual n-size sample (X¹, …, Xⁿ).

Consequence: a Gaussian Graphical Model

- X ∼ N(µ, Σ), with X¹, …, Xⁿ i.i.d. copies of X;
- Θ = (θij)_{i,j ∈ P} = Σ⁻¹ is called the concentration matrix.

Network inference 48


Page 85: Biological Network Inference via Gaussian Graphical Models

Modeling the underlying distribution (2)
Interpretation as a GGM

Multivariate Gaussian vector and covariance selection

−θij / √(θii θjj) = cor(Xi, Xj | X_{P\{i,j}}) = ρ_{ij | P\{i,j}}.

Graphical interpretation
The matrix Θ = (θij)_{i,j ∈ P} encodes the network G we are looking for:

conditional dependency between Xi and Xj
(or non-null partial correlation between Xi and Xj)
⇔ θij ≠ 0
⇔ edge i — j in G.

Network inference 49
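A small numerical check of the identity above (the covariance values below are arbitrary, not from the lecture): for p = 3, the quantity −θ₁₂/√(θ₁₁θ₂₂) matches the partial correlation computed from pairwise correlations.

```python
import numpy as np

# An arbitrary 3x3 symmetric positive-definite covariance matrix
Sigma = np.array([[2.0, 0.6, 0.4],
                  [0.6, 1.5, 0.5],
                  [0.4, 0.5, 1.0]])
Theta = np.linalg.inv(Sigma)  # concentration matrix

# Partial correlation of X1, X2 given X3, from the concentration matrix
rho_12_3 = -Theta[0, 1] / np.sqrt(Theta[0, 0] * Theta[1, 1])

# Same quantity from pairwise correlations (recursion formula)
D = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(D, D)
rho_alt = (R[0, 1] - R[0, 2] * R[1, 2]) / np.sqrt((1 - R[0, 2]**2) * (1 - R[1, 2]**2))

print(rho_12_3, rho_alt)  # identical up to floating-point error
```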



Page 88: Biological Network Inference via Gaussian Graphical Models

Time-course data: scheme

t0, t1, …, tn

≈ 10s of microarrays over time
≈ 1000s of probes ("genes")

Inference: which interactions?

Network inference 51

Page 89: Biological Network Inference via Gaussian Graphical Models

Modeling time-course data with DAG

Collecting gene expression

1. Follow-up of one single experiment/individual;
2. time-points close enough to ensure
   - dependency between consecutive measurements;
   - homogeneity of the Markov process.

(Figure: a network G on X1, …, X5, which stands for the two-slice graph with edges from the X_i^t to the X_j^{t+1}.)

Network inference 52


Page 91: Biological Network Inference via Gaussian Graphical Models

Modeling time-course data with DAG

Collecting gene expression

1. Follow-up of one single experiment/individual;
2. time-points close enough to ensure
   - dependency between consecutive measurements;
   - homogeneity of the Markov process.

(Figure: the sample unrolled over time, X¹, X², …, Xⁿ, with the same graph G linking each pair of consecutive time-points.)

Network inference 52

Page 92: Biological Network Inference via Gaussian Graphical Models

DAG: remark

(Figure, left: a graph on X1, …, X5 containing a cycle.) Argh, there is a cycle :'(

versus

(Figure, right: the same interactions unrolled over two time slices, with edges X_i^t → X_j^{t+1}.) This is indeed a DAG.

It overcomes the rather restrictive acyclicity requirement.

Network inference 53

Page 93: Biological Network Inference via Gaussian Graphical Models

Modeling the underlying distribution (1)

Model for data generation
A microarray can be represented as a multivariate vector X = (X1, …, Xp) ∈ ℝᵖ, generated through a first-order vector autoregressive process VAR(1):

X^t = Θ X^{t−1} + b + ε^t,   t ∈ [1, n],

where ε^t is a white noise ensuring the Markov property, and X⁰ ∼ N(0, Σ0).

Consequence: a Gaussian Graphical Model

- Each X^t | X^{t−1} ∼ N(Θ X^{t−1}, Σ);
- or, equivalently, X_j^t | X^{t−1} ∼ N(Θ_j X^{t−1}, Σ),

where Σ is known and Θ_j is the jth row of Θ.

Network inference 54


Page 95: Biological Network Inference via Gaussian Graphical Models

Modeling the underlying distribution (2)

Elementwise, the VAR(1) model reads

⎡ X_1^t ⎤   ⎡ θ11 … θ1j … θ1p ⎤ ⎡ X_1^{t−1} ⎤   ⎡ b1 ⎤   ⎡ ε_1^t ⎤
⎢   ⋮   ⎥   ⎢  ⋮         ⋮   ⎥ ⎢     ⋮     ⎥   ⎢ ⋮  ⎥   ⎢   ⋮   ⎥
⎢ X_i^t ⎥ = ⎢ θi1 … θij … θip ⎥ ⎢ X_j^{t−1} ⎥ + ⎢ bi ⎥ + ⎢ ε_i^t ⎥
⎢   ⋮   ⎥   ⎢  ⋮         ⋮   ⎥ ⎢     ⋮     ⎥   ⎢ ⋮  ⎥   ⎢   ⋮   ⎥
⎣ X_p^t ⎦   ⎣ θp1 … θpj … θpp ⎦ ⎣ X_p^{t−1} ⎦   ⎣ bp ⎦   ⎣ ε_p^t ⎦

Example:

Θ = ⎛ θ11  θ12  0 ⎞
    ⎜ θ21  0    0 ⎟
    ⎝ 0    θ32  0 ⎠

Network inference 55
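The sparse example above is easy to simulate; a minimal sketch (the coefficient values, noise level, and series length are my own assumptions, chosen so the process is stable):

```python
import numpy as np

rng = np.random.default_rng(42)

# Sparse VAR(1) coefficient matrix with the zero pattern of the example
Theta = np.array([[0.4, 0.3, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 0.6, 0.0]])
b = np.zeros(3)
n, p = 50, 3

X = np.zeros((n + 1, p))
X[0] = rng.normal(size=p)  # X^0 ~ N(0, Sigma0), here Sigma0 = I
for t in range(1, n + 1):
    # X^t = Theta X^{t-1} + b + white noise
    X[t] = Theta @ X[t - 1] + b + 0.1 * rng.normal(size=p)

print(X.shape)  # (51, 3): the initial state plus n time-points
```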

Page 96: Biological Network Inference via Gaussian Graphical Models

Modeling the underlying distribution (3)
Interpretation as a GGM

The VAR(1) as a covariance selection model

θij = cov( X_i^t, X_j^{t−1} | X_{P\j}^{t−1} ) / var( X_j^{t−1} | X_{P\j}^{t−1} ).

Graphical interpretation
The matrix Θ = (θij)_{i,j ∈ P} encodes the network G we are looking for:

conditional dependency between X_j^{t−1} and X_i^t
(or non-null partial correlation between X_j^{t−1} and X_i^t)
⇔ θij ≠ 0
⇔ edge j → i in G.

Network inference 56



Page 100: Biological Network Inference via Gaussian Graphical Models

The graphical models: reminder
(for goldfish-like memories)

Assumption
A microarray can be represented as a multivariate Gaussian vector X.

Collecting gene expression

1. Steady-state data leads to an i.i.d. sample.
2. Time-course data gives a time series.

Graphical interpretation
conditional dependency between X(i) and X(j)
(or non-null partial correlation between X(i) and X(j))
⇔ edge i — j,

encoded in an unknown matrix of parameters Θ.

Network inference 59


Page 102: Biological Network Inference via Gaussian Graphical Models

The graphical models: reminder
(for goldfish-like memories)

Assumption
A microarray can be represented as a multivariate Gaussian vector X.

Collecting gene expression

1. Steady-state data leads to an i.i.d. sample.
2. Time-course data gives a time series.

Graphical interpretation
conditional dependency between X_t(i) and X_{t−1}(j)
(or non-null partial correlation between X_t(i) and X_{t−1}(j))
⇔ edge j → i,

encoded in an unknown matrix of parameters Θ.

Network inference 59

Page 103: Biological Network Inference via Gaussian Graphical Models

The maximum likelihood estimator
The natural approach for parametric statistics

Let X be a random vector with distribution defined by f_X(x; Θ), where Θ are the model parameters.

Maximum likelihood estimator

Θ̂ = arg max_Θ L(Θ; X),

where L is the log-likelihood, a function of the parameters:

L(Θ; X) = log ∏_{k=1}^n f_X(x^k; Θ),

where x^k is the kth row of X.

Remarks

- This is a convex optimization problem;
- we just need to detect the nonzero coefficients in Θ.

Network inference 60

Page 104: Biological Network Inference via Gaussian Graphical Models

The penalized likelihood approach

Let Θ be the parameters to infer (the edges).

A penalized likelihood approach

Θ̂_λ = arg max_Θ L(Θ; X) − λ pen_ℓ1(Θ),

- L is the model log-likelihood,
- pen_ℓ1 is a penalty function tuned by λ > 0.

It performs
1. regularization (needed when n ≪ p),
2. selection (sparsity induced by the ℓ1-norm).

Network inference 61



Page 107: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Constrained Optimization

(Figure: the surface f(β1, β2; X) over the (β1, β2) plane.)

We basically want to solve a problem of the form

maximize_{β1, β2} f(β1, β2; X),

where f is typically a concave likelihood function. This is strictly equivalent to solving

minimize_{β1, β2} g(β1, β2; X),

where g = −f is convex (for instance the squared loss in the OLS).

Network inference 63

Page 108: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Constrained Optimization

(Figure: the same surface, now restricted to a constraint region.)

maximize_{β1, β2} f(β1, β2; X)   s.t.   Ω(β1, β2) ≤ c,

where Ω defines a domain that constrains β.

Network inference 63


Page 110: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Constrained Optimization

maximize_{β1, β2} f(β1, β2; X)   s.t.   Ω(β1, β2) ≤ c,

where Ω defines a domain that constrains β. This is equivalent to

maximize_{β1, β2} f(β1, β2; X) − λ Ω(β1, β2).

How shall we define Ω to induce sparsity?

Network inference 63

Page 111: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Supporting Hyperplane

A hyperplane supports a set iff

- the set is contained in one half-space,
- the set has at least one point on the hyperplane.

(Figure in the (β1, β2) plane.)

Network inference 64


Page 115: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Supporting Hyperplane

A hyperplane supports a set iff

- the set is contained in one half-space,
- the set has at least one point on the hyperplane.

There are supporting hyperplanes at all points of a convex set: they generalize tangents.

Network inference 64


Page 117: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Dual Cone

Generalizes normals.

(Figures: dual cones at several points of the constraint region in the (β1, β2) plane.)

Network inference 65


Page 120: Biological Network Inference via Gaussian Graphical Models

A Geometric View of Sparsity
Dual Cone

Generalizes normals.

(Figures: dual cones at several points of the constraint region.)

Shape of the dual cones ⇒ sparsity pattern.

Network inference 65


Page 122: Biological Network Inference via Gaussian Graphical Models

The LASSO

R. Tibshirani, 1996. The Lasso: Least Absolute Shrinkage and Selection Operator.

S. Chen, D. Donoho, M. Saunders, 1995. Basis Pursuit.

Weisberg, 1980. Forward Stagewise regression.

(Embedded excerpt, translated from the French:)

3.2. ℓp regularizations

(Figure 3.2: comparison of the solutions of problems regularized by an ℓ1 norm and by an ℓ2 norm, showing β^ls, β^ℓ1 and β^ℓ2 in the (β1, β2) plane.)

On the left of Figure 3.2, β^ℓ1 is the estimator of problem (3.2) regularized by an ℓ1 norm. The second component of β^ℓ1 is set to zero, because the ellipse reaches the admissible region at the corner lying on the axis β2 = 0. On the right of Figure 3.2, β^ℓ2 is the estimator of problem (3.2) regularized by an ℓ2 norm. The circular shape of the admissible region does not push the coefficients toward zero.

To pursue this discussion with arguments both simple and formal, one can write a coefficient of the estimators β^ℓ1 and β^ℓ2 when the matrix X is orthogonal (which corresponds to circular contours of the quadratic loss). For β^ℓ2 we have

β^ℓ2_m = β^ls_m / (1 + λ).

The coefficients undergo a proportional shrinkage through the factor 1/(1 + λ). In particular, β^ℓ2_m can be zero only if the coefficient β^ls_m is itself exactly zero. For β^ℓ1 we have

β^ℓ1_m = sign(β^ls_m) ( |β^ls_m| − λ )_+,

where [u]_+ = max(0, u). We thus obtain a "soft" thresholding: the least-squares coefficients are shrunk by a constant λ when |β^ls_m| > λ, and set to zero otherwise.

Stability

Definition 3.2 (Stability). Following Breiman [1996], a problem is unstable if, for similar but not identical training sets (small perturbations), one obtains very different predictions or estimators (a large perturbation).

Remark 3.5. Bousquet and Elisseeff [2002] gave formal definitions of several notions of stability, based on the behavior of the estimators when the training sample is perturbed by removing or replacing one example.

The Lasso problem:

minimize_{β ∈ ℝ²} ‖y − Xβ‖²₂   s.t.   ‖β‖₁ = |β1| + |β2| ≤ c
⇔
minimize_{β ∈ ℝ²} ‖y − Xβ‖²₂ + λ‖β‖₁.

Network inference 67

Page 123: Biological Network Inference via Gaussian Graphical Models

Orthogonal case and link to the OLS

OLS shrinkage
The Lasso has no analytical solution except in the orthogonal case: when XᵀX = I (never true for real data),

β̂_j^lasso = sign(β̂_j^ols) max(0, |β̂_j^ols| − λ).

(Figure: the soft-thresholding function β̂^lasso plotted against β̂^ols over [−4, 4], compared with the identity line of the OLS.)

Network inference 68
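In code, the orthogonal-case solution above is just soft thresholding; a minimal sketch (the function name is mine):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso solution for each coefficient under an orthogonal design:
    zeros out small coefficients, shrinks the rest by lam."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), 1.0))
```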

Page 124: Biological Network Inference via Gaussian Graphical Models

LARS: Least Angle Regression

B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, 2004. Least Angle Regression.

An efficient algorithm to compute the Lasso solutions
The LARS output is a curve giving the solution β̂ for each value of λ:

- it constructs a piecewise-linear path of solutions, starting from the null vector and moving towards the OLS estimate,
- at (almost) the same cost as the OLS,
- and it is well adapted to cross-validation (which helps us choose λ).

Network inference 69

Page 125: Biological Network Inference via Gaussian Graphical Models

Example: prostate cancer I

Lasso solution path with lars

> library(lars)             # LARS/Lasso path algorithms
> load("prostate.rda")      # prostate cancer data: x (predictors), y (response)
> x <- scale(as.matrix(x))  # standardize the predictors
> out <- lars(x, y)         # compute the full Lasso solution path
> plot(out)

Network inference 70

Page 126: Biological Network Inference via Gaussian Graphical Models

Example: prostate cancer II

(Figure: the LASSO coefficient path produced by plot(out), standardized coefficients against |beta|/max|beta|, one piecewise-linear curve per predictor, with breakpoints where variables enter the model.)

Network inference 71

Page 127: Biological Network Inference via Gaussian Graphical Models

Choice of the tuning parameter λ I

Model selection criteria

BIC(λ) = ‖y − X β̂_λ‖²₂ − df(β̂_λ) (log n)/2,
AIC(λ) = ‖y − X β̂_λ‖²₂ − df(β̂_λ),

where df(β̂_λ) is the number of nonzero entries in β̂_λ.

Cross-validation

1. split the data into K folds,
2. use each fold successively as the testing set,
3. compute the test error on this fold,
4. average to obtain the CV estimate of the test error.

λ is chosen to minimize the CV test error.

Network inference 72
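The four cross-validation steps can be sketched generically; a minimal version (my own sketch: plain least squares stands in here for the Lasso fitted at a given λ, and the simulated data are assumptions):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, K=10, seed=0):
    """Generic K-fold cross-validated test error (steps 1-4 above)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)                      # step 1: K folds
    errs = []
    for k in range(K):                                  # step 2: each fold tests once
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        model = fit(X[train], y[train])
        errs.append(np.mean((y[test] - predict(model, X[test])) ** 2))  # step 3
    return np.mean(errs)                                # step 4: average

# Example with ordinary least squares as the fitting routine
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
beta = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ beta + 0.1 * rng.normal(size=200)

cv_err = kfold_cv_mse(X, y,
                      fit=lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0],
                      predict=lambda m, A: A @ m)
print(cv_err)  # close to the noise variance 0.01
```

Running this over a grid of λ values (with a Lasso solver in place of `lstsq`) and keeping the minimizer is exactly the selection rule stated on the slide.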

Page 128: Biological Network Inference via Gaussian Graphical Models

Choice of the tuning parameter λ II

CV choice for λ

> cv.lars(x, y, K=10)

Network inference 73

Page 129: Biological Network Inference via Gaussian Graphical Models

Choice of the tuning parameter λ III

(Figure: cross-validated MSE, with error bars, as a function of the fraction of the final L1 norm, as produced by cv.lars.)

Network inference 74

Page 130: Biological Network Inference via Gaussian Graphical Models

Many variations

Group-Lasso
Activates the variables by groups (given by the user).

Adaptive/Weighted-Lasso
Adjusts the penalty level of each variable, according to prior knowledge or with data-driven weights.

BoLasso
A bootstrapped version that removes false positives and stabilizes the estimate.

etc., plus many theoretical results.

Network inference 75


Page 133: Biological Network Inference via Gaussian Graphical Models

Problem

t0, t1, …, tn

≈ 10s of microarrays over time
≈ 1000s of probes ("genes")

Inference: which interactions?

The main statistical issue is the high-dimensional setting.

Network inference 78

Page 134: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of the data
By introducing some priors

Priors should be biologically grounded:

1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering).

(Figure: a network on genes G1, …, G13.)

Network inference 79


Page 136: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of the data
By introducing some priors

Priors should be biologically grounded:

1. few genes effectively interact (sparsity),
2. networks are organized (latent clustering).

(Figure: the same network with the nodes grouped into clusters A, B and C.)

Network inference 79

Page 137: Biological Network Inference via Gaussian Graphical Models

Penalized log-likelihood

Banerjee et al., JMLR 2008

Θ̂_λ = arg max_Θ L_iid(Θ; S) − λ‖Θ‖_ℓ1,

efficiently solved by the graphical Lasso of Friedman et al., 2008.

Ambroise, Chiquet, Matias, EJS 2009
Use adaptive penalty parameters for the different coefficients:

L̃_iid(Θ; S) − λ‖P_Z ⋆ Θ‖_ℓ1,

where P_Z is a matrix of weights depending on the underlying clustering Z.

Works with the pseudo log-likelihood (computationally efficient).

Network inference 80


Page 139: Biological Network Inference via Gaussian Graphical Models

Neighborhood selection (1)

Let

- X_i be the ith column of X,
- X_{\i} be X deprived of X_i.

X_i = X_{\i} β + ε,   where   β_j = −θ_ij / θ_ii.

Meinshausen and Bühlmann, 2006
Since sign(cor_{ij|P\{i,j}}) = sign(β_j), select the neighbors of i with

arg min_β  (1/n) ‖X_i − X_{\i} β‖²₂ + λ ‖β‖_ℓ1.

The sign pattern of Θ_λ is inferred after a symmetrization step.

Network inference 81
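A compact sketch of the procedure (my own implementation, not the authors' code: the coordinate-descent Lasso solver, the simulated data, and the "AND" symmetrization rule are assumptions):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                      # drop j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def neighborhood_selection(X, lam):
    """One Lasso regression per node; keep edge i-j when both node i's
    and node j's regressions select it (the 'AND' symmetrization rule)."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = np.array([j for j in range(p) if j != i])
        b = lasso_cd(X[:, others], X[:, i], lam)
        adj[i, others] = b != 0
    return adj & adj.T                               # symmetrization step

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
G = neighborhood_selection(X, lam=0.3)
print(G.shape)  # (5, 5), symmetric, empty diagonal
```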

Page 140: Biological Network Inference via Gaussian Graphical Models

Neighborhood selection (2)

The pseudo log-likelihood of the i.i.d. Gaussian sample is

L̃_iid(Θ; S) = Σ_{i=1}^p Σ_{k=1}^n log P( X_k(i) | X_k(P\i); Θ_i )
            = (n/2) log det(D) − (n/2) Trace( D^{−1/2} Θ S Θ D^{−1/2} ) − (n/2) log(2π),

where D = diag(Θ).

Proposition

Θ̂_λ^pseudo = arg max_Θ  L̃_iid(Θ; S) − λ ‖Θ‖_ℓ1

has the same null entries as inferred by neighborhood selection.

Network inference 82

Page 141: Biological Network Inference via Gaussian Graphical Models

Structured regularization
Introduce prior knowledge

Building the weights

1. Build w from prior biological information:
   - transcription factors vs. regulatees,
   - number of potential binding sites,
   - KEGG pathways, Gene Ontology, ...

2. Build the weights matrix from a clustering algorithm:
   - infer the network G0 with w = 1 for each node,
   - apply a clustering algorithm on G0,
   - re-infer G with w built according to the clustering Z.

Network inference 83
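Step 2 above can be sketched as follows; the cluster labels and the within/between penalty weights are purely hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical cluster labels Z for p = 6 genes (two clusters)
Z = np.array([0, 0, 0, 1, 1, 1])

# Weight matrix P_Z: lighter penalty within clusters, heavier between them
# (the values 0.5 and 2.0 are illustrative, not taken from the slides)
w_in, w_out = 0.5, 2.0
PZ = np.where(Z[:, None] == Z[None, :], w_in, w_out)
np.fill_diagonal(PZ, 0.0)   # the diagonal of Theta is typically left unpenalized

print(PZ)
```

The entrywise product P_Z ⋆ Θ in the penalty then encourages edges within clusters and discourages edges between them.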

Page 142: Biological Network Inference via Gaussian Graphical Models

Structured regularization
"Bayesian" interpretation of \ell_1 regularization

The Laplacian prior on \Theta depends on the clustering Z:

P(\Theta | Z) \propto \prod_{j} \exp\{ -\lambda \cdot w_{kj}(Z) |\theta_{kj}| \}.

w_k(Z) summarizes prior information on the position of edges for node k.

[Figure: the prior distribution plotted against a_{kj}, for \lambda w_{kj}(Z) = 0.5 and \lambda w_{kj}(Z) = 1.]

Network inference 84

Page 143: Biological Network Inference via Gaussian Graphical Models

Outline

Introduction
  Motivations
  Background on omics
  Modeling issue

Modeling tools
  Statistical dependence
  Graphical models
  Covariance selection and Gaussian vector

Gaussian Graphical Models for genomic data
  Steady-state data
  Time-course data

Statistical inference
  Penalized likelihood approach
  Inducing sparsity and regularization
  The Lasso

Application in Post-genomics
  Modeling time-course data
  Illustrations
  Multitask learning

Network inference 85

Page 144: Biological Network Inference via Gaussian Graphical Models

Multiple Sclerosis
Studying treatment effect on 31 selected genes

Collaboration started with J.-C. Corvol, La Pitié-Salpêtrière (and C. Charbonnier, C. Ambroise).

The data set

- 31 genes identified as implicated in the MS physiopathology,

- each patient received either a placebo or the treatment at t = 0,

- gene expressions are recorded by microarray along time for each patient (time-course data),

- full data set of 100 microarrays, 25 patients (≈ 75%/25% treatment/placebo), 4 time points.

Comparing networks inferred under treatment or placebo.

Network inference 86

Page 145: Biological Network Inference via Gaussian Graphical Models

Multiple sclerosis
Detecting changes in the regulatory networks

[Figure: four networks over the selected genes (nodes include IL2RA, ITGAM, JAK1, STAT5A, STAT3, ITGA6, IL2RB, CD58, ITGB2, ITGAL, VCAM1, IL2, TOB1, JAK3, IL7, IL2RG, IL21, STAT5B, ICAM1 and ITGA4, some probes appearing several times): the network inferred under the active treatment, the one inferred under placebo, their intersection and their difference.]

- Better insight than differential analysis,

- networks try to tell us how genes vary together,

- hope that sometimes an edge describes regulations.

Network inference 87



Page 148: Biological Network Inference via Gaussian Graphical Models

E. coli S.O.S DNA repair network

E. coli S.O.S data

[Slide reproduces the first page of the source paper: Michal Ronen, Revital Rosenberg, Boris I. Shraiman and Uri Alon, "Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics", PNAS, vol. 99, no. 16, pp. 10555-10560, 2002.]

- 8 major genes involved in the S.O.S. response: uvrD, lexA, umuDC, recA, uvrA, uvrY, ruvA, polB,

- 50 time points.

Network inference 88

Page 149: Biological Network Inference via Gaussian Graphical Models

E. coli S.O.S DNA repair network
Precision and Recall rates

[Figure: precision and recall bars for four experiments (Exp. 1 to Exp. 4), comparing Lasso, Adaptive, G1DBN, Renet, KnownCl and InferCl.]

Network inference 89
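The precision and recall rates reported in this figure compare inferred edges against known ones; here is a small self-contained sketch of the computation (the toy adjacency matrices are made up):

```python
import numpy as np

def precision_recall(A_true, A_hat):
    """Edge-wise precision and recall between two undirected adjacency matrices."""
    iu = np.triu_indices_from(A_true, k=1)   # count each undirected edge once
    t, h = A_true[iu].astype(bool), A_hat[iu].astype(bool)
    tp = np.sum(t & h)                       # correctly recovered edges
    precision = tp / max(np.sum(h), 1)       # among inferred edges
    recall = tp / max(np.sum(t), 1)          # among true edges
    return float(precision), float(recall)

A_true = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A_hat  = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(precision_recall(A_true, A_hat))   # -> (0.5, 0.5)
```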

Page 150: Biological Network Inference via Gaussian Graphical Models

E. coli S.O.S DNA repair network
Inferred networks

[Figure: networks over the eight S.O.S. genes (uvrD, lexA, umuDC, recA, uvrA, uvrY, ruvA, polB) as inferred by the Lasso, with the known classification and with the inferred classification.]

Network inference 90

Page 151: Biological Network Inference via Gaussian Graphical Models

Reasonable computing time
Comparing computation time with the multi-tests approach

[Figure: log of CPU time (sec), from 1 sec. up to 4 hr., against log of p/n, for G1DBN, Renet-VAR and InfCl.]

Figure: Computing times on the log-log scale for Renet-VAR, G1DBN and InfCl (including inference of classes). Intel Dual Core 3.40 GHz processor.

Network inference 91

Page 152: Biological Network Inference via Gaussian Graphical Models

Outline

Introduction
  Motivations
  Background on omics
  Modeling issue

Modeling tools
  Statistical dependence
  Graphical models
  Covariance selection and Gaussian vector

Gaussian Graphical Models for genomic data
  Steady-state data
  Time-course data

Statistical inference
  Penalized likelihood approach
  Inducing sparsity and regularization
  The Lasso

Application in Post-genomics
  Modeling time-course data
  Illustrations
  Multitask learning

Network inference 92

Page 153: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of data (1)
By reducing the number of parameters

Assumption
Connections will only appear between informative genes.

1. Differential analysis: select p key genes P, with p "reasonable" compared to n (typically, n \in [p/5; 5p]).

2. The learning dataset: n size-p expression vectors (X_1, ..., X_n) with X_i \in R^p.

3. Inference.

Network inference 93

Page 154: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of data (2)
By collecting as many observations as possible

Multitask learning

How should we merge the data?

[Figure: one organism observed under drug 1, drug 2 and drug 3.]

Network inference 94

Page 155: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of data (2)
By collecting as many observations as possible

Multitask learning

By inferring each network independently: for each drug t = 1, 2, 3, a sample (X^{(t)}_1, ..., X^{(t)}_{n_t}) with X^{(t)}_i \in R^p, and one inference per sample.

Network inference 94

Page 156: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of data (2)
By collecting as many observations as possible

Multitask learning

By pooling all the available data: a single sample (X_1, ..., X_n), X_i \in R^p, with n = n_1 + n_2 + n_3, and a single inference.

Network inference 94
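Pooling as described here simply stacks the task samples before estimating one covariance matrix; a minimal numpy sketch (sizes and data are made up, and the observations are assumed centered):

```python
import numpy as np

# Three tasks over the same p genes, with different sample sizes n_t
rng = np.random.default_rng(3)
p, sizes = 4, [30, 20, 50]
tasks = [rng.standard_normal((n_t, p)) for n_t in sizes]

# Pooling: stack all samples, then compute one empirical covariance
X = np.vstack(tasks)                 # n = n_1 + n_2 + n_3 rows
S_pooled = X.T @ X / X.shape[0]      # valid for centered data
print(X.shape, S_pooled.shape)
```

This gains sample size but assumes all tasks share the same covariance, which the next slides relax.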

Page 157: Biological Network Inference via Gaussian Graphical Models

Handling the scarcity of data (2)
By collecting as many observations as possible

Multitask learning

By breaking the separability: keep the three samples (X^{(t)}_1, ..., X^{(t)}_{n_t}) with X^{(t)}_i \in R^p, but run a single, coupled inference.

Network inference 94

Page 158: Biological Network Inference via Gaussian Graphical Models

Coupling related problems

Consider

- T samples concerning the expressions of the same p genes,

- X^{(t)}_1, ..., X^{(t)}_{n_t} is the t-th sample, drawn from N(0_p, \Sigma^{(t)}), with empirical covariance matrix S^{(t)}.

Multiple samples setup

Ignoring the relationships between the tasks leads to

\arg\max_{\Theta^{(t)},\, t = 1, ..., T} \sum_{t=1}^{T} L(\Theta^{(t)}; S^{(t)}) - \lambda \, \mathrm{pen}_{\ell_1}(\Theta^{(t)}, Z).

Breaking the separability

- Either by modifying the objective function,

- or the constraints.

Network inference 95
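Since the objective above is separable in the Θ^(t), ignoring the task relationships amounts to running one graphical lasso per task. A minimal sketch using scikit-learn's `GraphicalLasso` (simulated tasks and an arbitrary penalty level, standing in for the lab's own R tools):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Three tasks (e.g. three drug conditions) over the same p genes
rng = np.random.default_rng(2)
p, sizes = 5, [40, 60, 50]
tasks = [rng.standard_normal((n_t, p)) for n_t in sizes]

# Separable objective: maximizing the penalized sum over t
# is exactly one independent graphical lasso per task
precisions = [GraphicalLasso(alpha=0.3).fit(X_t).precision_ for X_t in tasks]
print(len(precisions), precisions[0].shape)
```

A coupled (non-separable) penalty would instead tie the T estimates together, as the next slides illustrate.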



Page 161: Biological Network Inference via Gaussian Graphical Models

Coupling problems by grouping variables through penalty

[Figure: for four variables X1, ..., X4, the potential groups and the group(s) induced by the edges (1, 2), under the Group-Lasso and under the Cooperative-Lasso.]

Network inference 96
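The group-lasso penalty used here to couple tasks acts through a groupwise soft-thresholding operator: a whole group of coefficients (e.g. the same edge seen across several tasks) is shrunk or zeroed jointly. A minimal sketch (the group vector and penalty levels are illustrative):

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of the group-lasso penalty lam * ||v||_2:
    shrink the whole group towards zero, or set it exactly to zero."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)      # the whole group is killed at once
    return (1.0 - lam / norm) * v    # otherwise the group shrinks jointly

# Same edge (say, edge (1,2)) observed across three tasks, treated as one group
v = np.array([0.6, 0.0, 0.8])        # ||v||_2 = 1.0
print(group_soft_threshold(v, 0.5))  # shrunk jointly, group survives
print(group_soft_threshold(v, 1.5))  # whole group zeroed out
```

This joint in-or-out behavior is what makes the penalty select the same edge in all tasks simultaneously.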

Page 162: Biological Network Inference via Gaussian Graphical Models

Conclusion

Versatile Statistical Tools

- Gaussian Graphical Models

- Sparse Regression

- Model-Based Graph Clustering

Available Packages on CRAN

- simone to infer graphs from various kinds of data

- quadrupen to solve efficiently various penalized regression problems

- osbm and mixer to search for structure in graphs

Network inference 97

Page 163: Biological Network Inference via Gaussian Graphical Models

References

Christopher Bishop, 2006. Pattern Recognition and Machine Learning.

Joe Whittaker, 1996. Graphical Models in Applied Multivariate Statistics.

Trevor Hastie, Robert Tibshirani, Jerome Friedman, 2009. The Elements of Statistical Learning (2nd Edition).

Steffen Lauritzen, 1996. Graphical Models.

+ papers of our lab (see webpage) ;-)

Network inference 98