Areejit Samal Randomizing metabolic networks

Areejit Samal

Laboratoire de Physique Théorique et Modèles Statistiques (LPTMS), CNRS and Univ Paris-Sud, Orsay, France

and Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany

Large-scale structure of metabolic networks: Role of Biochemical and Functional constraints

Outline

Architectural features of metabolic networks

Issues with existing null models for metabolic networks

Our method to generate realistic randomized metabolic networks using Markov Chain Monte Carlo (MCMC) sampling and Flux Balance Analysis (FBA)

Biological function is a main driver of large-scale structure in metabolic networks


Architectural features of real-world networks

Small Average Path Length & High Local Clustering

Watts and Strogatz, Nature (1998)

Small-world

Power law degree distribution

Barabasi and Albert, Science (1999)

Scale-free

Scale-free graphs can be generated by a simple model that combines growth with a preferential attachment scheme.

Large-scale structure of metabolic networks

Small-world, Scale-free and Hierarchical Organization: Wagner & Fell (2001); Jeong et al (2000); Ravasz et al (2002)

Small Average Path Length, High Local Clustering, Power-law degree distribution

Bow-tie architecture: Ma and Zeng, Bioinformatics (2003)

Directed graph with a giant component of strongly connected nodes along with associated IN and OUT components

Bow-tie architecture of the metabolism is similar to that found for the WWW by Broder et al.

Shen-Orr et al Nature Genetics (2002); Milo et al Science (2002,2004)

Network motifs and Evolutionary design

Sub-graphs that are over-represented in real networks compared to randomized networks with same degree sequence.

Certain network motifs were shown to perform important information processing tasks.

For example, the coherent Feed Forward Loop (FFL) can be used to detect persistent signals.

[Figure: histogram of the ‘chosen property’ over the randomized networks; x-axis: chosen property (0.60 to 0.70), y-axis: frequency (0 to 300)]

“Null hypothesis” based approach for identifying design principles in real-world networks

Statistically significant network properties are deduced as follows:

1) Measure the ‘chosen property’ (e.g., motif significance profile or average clustering coefficient) in the investigated real network.

2) Generate randomized networks with structure similar to the investigated real network using an appropriate null model.

3) Use the distribution of the ‘chosen property’ for the randomized networks to estimate a p-value.
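Step 3 can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the function name `empirical_p_value`, the one-sided test direction, the +1 correction and all numbers are assumptions made for the example.

```python
import random

def empirical_p_value(observed, null_values):
    """One-sided empirical p-value: share of randomized networks whose
    property is at least as large as the observed one (with a +1
    correction so the estimate is never exactly zero)."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)

# Hypothetical data: the chosen property measured on 1000 randomized networks.
rng = random.Random(0)
null_values = [rng.gauss(0.65, 0.02) for _ in range(1000)]
observed = 0.72  # hypothetical value for the investigated network

p = empirical_p_value(observed, null_values)
print(f"empirical p-value = {p:.4f}")
```

A small p-value indicates that the observed value is rarely reached by the null model, which is only as meaningful as the null model itself.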

[Figure labels: value of the property for the investigated network; p-value]

Edge-exchange algorithm: commonly-used null model

Randomized networks preserve the degree at each node and the degree sequence.

[Figure: investigated network; after 1 exchange; after 2 exchanges]

Reference: Shen-Orr et al Nature Genetics (2002); Milo et al Science (2002,2004); Maslov & Sneppen Science (2002)
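A minimal sketch of the edge-exchange idea for a simple undirected graph. The function names, the rejection of self-loops and duplicate edges, and the ring example are assumptions for illustration; the cited papers differ in details such as directed edges.

```python
import random
from collections import Counter

def edge_exchange(edges, n_swaps, seed=0):
    """Degree-preserving randomization: repeatedly pick two edges (a, b)
    and (c, d) and rewire them to (a, d) and (c, b)."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # consider both orientations of the second edge
            c, d = d, c
        if a == d or c == b:  # would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:  # would duplicate an edge
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

# Toy example: a ring of 10 nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
randomized = edge_exchange(ring, n_swaps=20)
print(degrees(randomized) == degrees(ring))
```

Each accepted exchange changes which nodes are connected but leaves every node's degree untouched, which is exactly the property the null model is designed to keep.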


Artzy-Randrup et al Science (2004)

Carefully posed null models are required for deducing design principles with confidence

C. elegans Neuronal Network: Nearby neurons have a greater chance of forming a connection than distant neurons.

A null model incorporating this spatial constraint gives different results from the generic edge-exchange algorithm.

Edge-exchange algorithm: case of bipartite metabolic networks

[Figure: bipartite graph of the reactions ASPT and CITL and the metabolites asp-L, fum, nh4, cit, oaa and ac]

ASPT: asp-L → fum + nh4

CITL: cit → oaa + ac

[Figure: the same bipartite graph before (ASPT, CITL) and after (ASPT*, CITL*) one edge exchange]

ASPT*: asp-L → ac + nh4

CITL*: cit → oaa + fum

The exchange preserves the degree of each node in the network, but it generates fictitious reactions that violate the mass, charge and atomic balance satisfied by real chemical reactions!

Note that fum has 4 carbon atoms while ac has 2 carbon atoms in the example shown.

Null-model used in many studies including: Guimera & Amaral Nature (2005); Samal et al BMC Bioinformatics (2006)
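The carbon imbalance in the exchanged reactions can be checked directly. The sketch below is an illustration only: it tracks carbon atoms alone (not full mass or charge balance), and the dictionary and function names are assumptions. The carbon counts are the standard ones for these metabolites.

```python
# Carbon atoms per metabolite (asp-L: aspartate, fum: fumarate,
# nh4: ammonium, cit: citrate, oaa: oxaloacetate, ac: acetate).
CARBONS = {"asp-L": 4, "fum": 4, "nh4": 0, "cit": 6, "oaa": 4, "ac": 2}

# Each reaction: (substrates, products), all with stoichiometric coefficient 1.
REACTIONS = {
    "ASPT":  (["asp-L"], ["fum", "nh4"]),
    "CITL":  (["cit"], ["oaa", "ac"]),
    "ASPT*": (["asp-L"], ["ac", "nh4"]),   # produced by the edge exchange
    "CITL*": (["cit"], ["oaa", "fum"]),    # produced by the edge exchange
}

def carbon_balanced(substrates, products):
    """True if the carbon count is the same on both sides of the reaction."""
    left = sum(CARBONS[m] for m in substrates)
    right = sum(CARBONS[m] for m in products)
    return left == right

balance = {name: carbon_balanced(*rxn) for name, rxn in REACTIONS.items()}
for name, ok in balance.items():
    print(name, "balanced" if ok else "NOT balanced")
```

The two real reactions conserve carbon, while both exchanged reactions do not, even though every node keeps its degree.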


A common generic null model (edge exchange algorithm) is unsuitable for different types of real-world networks.


Null models for metabolic networks: Blind-watchmaker network


Randomizing metabolic networks

We propose a new method using Markov Chain Monte Carlo (MCMC) sampling and Flux Balance Analysis (FBA) to generate meaningful randomized ensembles for metabolic networks by successively imposing biochemical and functional constraints.

Increasing level of constraints:

Constraint I, Ensemble R: Given no. of valid biochemical reactions

+ Constraint II, Ensemble Rm: Fix the no. of metabolites

+ Constraint III, Ensemble uRm: Exclude blocked reactions

+ Constraint IV, Ensemble uRm-V1: Functional constraint of growth on specified environments

Comparison with E. coli: Large-scale structural properties of interest

In this work, we generate randomized metabolic networks using only reactions in KEGG and compare the structural properties of the randomized networks with those of E. coli.

The E. coli metabolic network has m (~750) metabolites and r (~900) reactions.

We study the following large-scale structural properties:

• Metabolite degree distribution

• Clustering coefficient

• Average path length

• Pc: Probability that a path exists between two nodes in the directed graph

• Largest strongly connected component (LSC) and the union of the LSC, IN and OUT components

Scale-free and small-world properties. Ref: Jeong et al (2000); Wagner & Fell (2001)

Bow-tie architecture of the Internet and metabolism. Ref: Broder et al; Ma and Zeng, Bioinformatics (2003)
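The two directed-graph measures, Pc and the bow-tie components, can be sketched with plain breadth-first search. This is a minimal illustration under assumptions: the graph is a dict from node to successor set, the function names are hypothetical, and the all-pairs search is only meant for graphs of modest size (hundreds of nodes).

```python
from collections import deque

def reachable(graph, source):
    """Set of nodes reachable from source (including source itself)."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(graph):
    """Return (Pc, LSC, IN, OUT) for a directed graph."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    reach = {u: reachable(graph, u) for u in nodes}
    n = len(nodes)
    # Pc: fraction of ordered node pairs (i, j), i != j, joined by a directed path.
    pc = sum(len(r) - 1 for r in reach.values()) / (n * (n - 1))
    # Strongly connected component of u: nodes v with paths u -> v and v -> u.
    lsc = max(({v for v in reach[u] if u in reach[v]} for u in nodes), key=len)
    seed = next(iter(lsc))
    out_comp = reach[seed] - lsc                              # reachable from the LSC
    in_comp = {u for u in nodes if seed in reach[u]} - lsc    # can reach the LSC
    return pc, lsc, in_comp, out_comp

# Toy graph: i -> x -> y -> z -> x (a cycle), and z -> o.
toy = {"i": {"x"}, "x": {"y"}, "y": {"z"}, "z": {"x", "o"}}
pc, lsc, in_comp, out_comp = bow_tie(toy)
print(pc, sorted(lsc), sorted(in_comp), sorted(out_comp))
```

In the toy graph the cycle x, y, z forms the LSC, node i is the IN component and node o is the OUT component.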

Different Graph-theoretic Representations of the Metabolic Network

[Figure: (a) Bipartite Representation and (b) Unipartite Representation of the reactions HEX1, PGI and PFK. Legend: currency metabolite, other metabolite, reversible reaction, irreversible reaction. The unipartite graph contains glc-D, g6p, f6p and fdp.]

HEX1: atp + glc-D → adp + g6p + h

PGI: g6p ↔ f6p

PFK: atp + f6p → adp + fdp + h

[Figure: the same bipartite and unipartite representations, repeated with annotations for the measures computed from them: degree distribution, clustering coefficient, average path length, largest strong component]

The values obtained for different network measures depend on the list of currency metabolites used to generate the unipartite graph of metabolites from the bipartite graph. In this work, we have used a well-defined, biochemically sound list of currency metabolites to construct the unipartite graph corresponding to each metabolic network.
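A minimal sketch of the two representations for the three example reactions. The currency list `{"atp", "adp", "h"}` is read off the toy example only (it is not the list used in the work), and the function names and the rule "link every non-currency substrate to every non-currency product" are assumptions for illustration.

```python
# name -> (substrates, products, reversible)
REACTIONS = {
    "HEX1": (["atp", "glc-D"], ["adp", "g6p", "h"], False),
    "PGI":  (["g6p"], ["f6p"], True),
    "PFK":  (["atp", "f6p"], ["adp", "fdp", "h"], False),
}
CURRENCY = {"atp", "adp", "h"}  # assumed list for this toy example

def bipartite_edges(reactions):
    """Directed edges metabolite -> reaction and reaction -> metabolite."""
    edges = []
    for name, (subs, prods, reversible) in reactions.items():
        edges += [(m, name) for m in subs] + [(name, m) for m in prods]
        if reversible:  # a reversible reaction also runs backwards
            edges += [(m, name) for m in prods] + [(name, m) for m in subs]
    return edges

def unipartite_edges(reactions, currency):
    """Directed metabolite -> metabolite edges, skipping currency metabolites."""
    edges = set()
    for subs, prods, reversible in reactions.values():
        for s in subs:
            for p in prods:
                if s in currency or p in currency:
                    continue
                edges.add((s, p))
                if reversible:
                    edges.add((p, s))
    return edges

bip = bipartite_edges(REACTIONS)
uni = unipartite_edges(REACTIONS, CURRENCY)
print(sorted(uni))
```

Changing `CURRENCY` changes the unipartite graph, and hence every measure computed from it, which is the dependence noted above.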


Constraint I: Given number of valid biochemical reactions

Instead of generating fictitious reactions with the edge-exchange mechanism, we can restrict ourselves to the >6000 valid reactions contained in the KEGG database.

We use the database of 5870 mass-balanced reactions derived from KEGG by Rodrigues and Wagner. For simplicity, I will refer to this database in the talk as ‘KEGG’.

Global Reaction Set

KEGG Database + E. coli iJR904

Biomass composition: The E. coli biomass composition of iJR904 was used to determine the viability of genotypes.

Nutrient metabolites: The set of external metabolites in iJR904 was allowed for uptake/excretion in the global reaction set.

We can implement Flux Balance Analysis (FBA) within this database.

Genome-scale metabolic models: Flux Balance Analysis (FBA)

Inputs: list of metabolic reactions with stoichiometric coefficients; biomass composition; set of nutrients in the environment

→ Flux Balance Analysis (FBA) →

Outputs: growth rate for the given medium (‘Biomass Yield’); fluxes of all reactions

Advantages: FBA does not require enzyme kinetic information, which is not known for most reactions.

Disadvantages: FBA cannot predict internal metabolite concentrations and is restricted to steady states. Basic models do not account for metabolic regulation.

Reference: Varma and Palsson, Biotechnology (1994); Price et al Nature Reviews Microbiology (2004)
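For orientation, FBA is commonly written as a linear program. The notation below is an assumption introduced here, not taken from the slides: S is the stoichiometric matrix (metabolites by reactions), v the vector of reaction fluxes, and v_j^min, v_j^max the flux bounds that encode reversibility and nutrient availability.

```latex
\begin{aligned}
\max_{v}\quad & v_{\text{biomass}} \\
\text{subject to}\quad & S\,v = 0 \qquad \text{(steady state for every internal metabolite)} \\
& v_j^{\min} \le v_j \le v_j^{\max} \qquad \text{for every reaction } j
\end{aligned}
```

A genotype is then called viable in a given environment when the optimal biomass flux is strictly positive.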


Bit string representation of metabolic networks within the global reaction set

KEGG Database:

R1: 3pg + nad → 3php + h + nadh

R2: 6pgc + nadp → co2 + nadph + ru5p-D

R3: 6pgc → 2ddg6p + h2o

...

RN: ---------------------------------

E. coli: 101.... (Present: 1; Absent: 0)

The E. coli metabolic network, or any random network of reactions within KEGG, can be represented as a bit string of length N with exactly r entries equal to 1, where r is the number of reactions in the E. coli metabolic network.

N = 5870 reactions within KEGG (or global reaction set); the bit string contains r reactions (1’s in the bit string). The number of such bit strings is of the order of C(6000, 900), the binomial coefficient “6000 choose 900”.
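The bit string encoding can be sketched in a few lines. The five-reaction universe and the function names are hypothetical and only illustrate the mapping between a reaction list and its bit string.

```python
def to_bitstring(network, universe):
    """Bit i is 1 if the i-th reaction of the universe is in the network."""
    present = set(network)
    return [1 if rxn in present else 0 for rxn in universe]

def from_bitstring(bits, universe):
    """Inverse mapping: list the reactions whose bit is 1."""
    return [rxn for rxn, b in zip(universe, bits) if b]

# Hypothetical toy universe of N = 5 reactions and a network with r = 2.
universe = ["R1", "R2", "R3", "R4", "R5"]
network = ["R1", "R3"]

bits = to_bitstring(network, universe)
print("".join(map(str, bits)))
```

The order of reactions in the universe is fixed once, so every network in the same universe is a point in the same space of N-bit strings.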


Ensemble R: Random networks within KEGG with same number of reactions as E. coli

To compare with E. coli, we generate an ensemble R of random networks as follows:

Pick random subsets with exactly r (~900) reactions from the KEGG database of N (~6000) reactions, where r is the no. of reactions in E. coli.

The no. of networks possible within KEGG with exactly r reactions is huge: ~ C(6000, 900), the binomial coefficient “6000 choose 900”.

Fixing the no. of reactions is equivalent to fixing genome size.

[Figure: random subsets (networks) with r reactions drawn from KEGG with N = 5870 reactions, shown as bit strings 100...., 001...., 111...., 101....]

Each random network or bit string contains exactly r reactions (1’s in the string), as in E. coli.
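Sampling ensemble R and counting its size can be sketched as follows. The function name and the choice r = 900 (the slides give r only approximately) are assumptions; `math.comb` requires Python 3.8 or later.

```python
import math
import random

def random_network(n_universe, r, rng):
    """Bit string of length n_universe with exactly r ones, chosen uniformly."""
    chosen = set(rng.sample(range(n_universe), r))
    return [1 if i in chosen else 0 for i in range(n_universe)]

N, r = 5870, 900  # N from the slides; r is approximate
rng = random.Random(1)
ensemble = [random_network(N, r, rng) for _ in range(5)]

# Number of possible networks with exactly r reactions, as a power of ten.
log10_count = math.log10(math.comb(N, r))
print(f"about 10^{log10_count:.0f} possible networks")
```

Because every subset is equally likely, this direct sampling is uniform over ensemble R; it stops being practical once further constraints make acceptable networks rare.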


Metabolite degree distribution

[Figure: log-log plot of the metabolite degree distribution P(k) versus degree k for E. coli]

E. coli: γ = 2.17; most organisms: γ is 2.1-2.3, where γ is the exponent of the power law P(k) ~ k^-γ.

Reference: Jeong et al (2000); Wagner and Fell (2001)

The degree of a metabolite is the number of reactions in the network in which the metabolite participates.
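The definition translates directly into code. This is a minimal sketch with hypothetical function names, applied to three toy reactions rather than a genome-scale network.

```python
from collections import Counter

# name -> (substrates, products)
REACTIONS = {
    "HEX1": (["atp", "glc-D"], ["adp", "g6p", "h"]),
    "PGI":  (["g6p"], ["f6p"]),
    "PFK":  (["atp", "f6p"], ["adp", "fdp", "h"]),
}

def metabolite_degrees(reactions):
    """Degree of a metabolite = number of reactions it participates in."""
    deg = Counter()
    for subs, prods in reactions.values():
        for m in set(subs) | set(prods):
            deg[m] += 1
    return deg

def degree_distribution(deg):
    """P(k): fraction of metabolites that have degree k."""
    counts = Counter(deg.values())
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}

deg = metabolite_degrees(REACTIONS)
dist = degree_distribution(deg)
print(dict(deg))
print(dist)
```

On a genome-scale network, plotting `dist` on log-log axes gives the kind of curve from which the exponent is estimated.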

[Figure: log-log plots of P(k) versus k comparing E. coli, random networks in ensemble R and the complete KEGG]

Random networks in ensemble R: degree exponent γ = 2.51

Complete KEGG: degree exponent γ = 2.31

Any (large enough) random subset of reactions in KEGG has a power-law degree distribution!

Reaction Degree Distribution

[Figure: histogram of the number of substrates in a reaction (x-axis, 0 to 12) versus frequency (y-axis, 0 to 3000), with bars split by the number of currency metabolites in the reaction (0 to 7)]

For each reaction, we have counted the number of currency metabolites and other metabolites; the counts are shown in the figure. Most reactions involve 4 metabolites, of which 2 are currency metabolites and 2 are other metabolites.

The degree of a reaction is the number of metabolites that participate in it.

The reaction degree distribution is bell-shaped, with a typical reaction in the network involving 4 metabolites. It is very different in nature from the metabolite degree distribution, which follows a power law.

[Figure: a reaction R involving the metabolites A, B, atp and adp]

Scale-free versus Scale-rich network: Origin of Power laws in metabolic networks

Tanaka and Doyle have suggested a classification of metabolites into three categories based on their biochemical roles: (a) ‘Carriers’ (very high degree), (b) ‘Precursors’ (intermediate degree), (c) ‘Others’ (low degree).

It has been argued that gene duplication and divergence at the level of protein interaction networks can lead to the observed power laws in metabolic networks. However, the presence of ubiquitous (high degree) ‘currency’ metabolites along with low degree ‘other’ metabolites in each reaction can explain the observed fat tail in the metabolite degree distribution.

Reference: Tanaka (2005); Tanaka, Csete, and Doyle (2008)

[Figure: a reaction R involving the metabolites A, B, atp and adp]

Network Properties: Given number of valid biochemical reactions

[Figure: four panels comparing R, RM, uRM, uRM-V1, uRM-V5, uRM-V10 and E. coli: clustering coefficient <C>; average path length <L>; Pc, the probability that a path exists between two nodes; fraction of nodes in the LSC and in LSC + IN + OUT]

Constraint II: Fix the number of metabolites

Direct sampling of randomized networks with the same no. of reactions and metabolites as in E. coli is not possible.

Markov Chain Monte Carlo (MCMC) sampling of randomized networks

Accept/Reject criterion of reaction swap: accept if the no. of metabolites in the new network is less than or equal to that in E. coli.

A reaction swap keeps the number of reactions in the network constant.

[Figure: MCMC scheme in the KEGG reaction universe (N = 5870 reactions). Starting from the E. coli bit string, each proposed swap is accepted or rejected. Step 1: 10^6 swaps to erase the memory of the initial network. Step 2: store a network every 10^3 swaps (saving frequency) to build Ensemble RM.]
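The sampling scheme can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the toy reaction universe and the small burn-in and saving intervals are hypothetical, and rejected proposals are counted as steps in which the chain stays where it is.

```python
import random

def n_metabolites(network, metabolites_of):
    """Number of distinct metabolites used by the reactions in the network."""
    mets = set()
    for rxn in network:
        mets |= metabolites_of[rxn]
    return len(mets)

def mcmc_sample(start, universe, metabolites_of, max_mets,
                burn_in, save_every, n_samples, seed=0):
    """Reaction-swap MCMC: swap one reaction in for one reaction out and
    accept only if the metabolite count stays within max_mets."""
    rng = random.Random(seed)
    current = set(start)
    inside = list(current)
    outside = [r for r in universe if r not in current]
    samples = []
    for step in range(1, burn_in + save_every * n_samples + 1):
        i = rng.randrange(len(inside))
        j = rng.randrange(len(outside))
        proposal = (current - {inside[i]}) | {outside[j]}
        if n_metabolites(proposal, metabolites_of) <= max_mets:
            current = proposal
            inside[i], outside[j] = outside[j], inside[i]
        # After the burn-in, store the current network at a fixed frequency.
        if step > burn_in and (step - burn_in) % save_every == 0:
            samples.append(frozenset(current))
    return samples

# Hypothetical toy universe: 40 reactions over 20 metabolites.
metabolites_of = {f"R{i}": {f"m{i % 20}", f"m{(i + 1) % 20}", f"m{(3 * i) % 20}"}
                  for i in range(40)}
universe = list(metabolites_of)
start = universe[:8]                      # plays the role of E. coli
max_mets = n_metabolites(start, metabolites_of)

samples = mcmc_sample(start, universe, metabolites_of, max_mets,
                      burn_in=1000, save_every=100, n_samples=20)
print(len(samples), "networks stored")
```

The burn-in plays the role of Step 1 (erasing the memory of the initial network) and `save_every` the role of the saving frequency in Step 2.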


Network Properties: After fixing number of metabolites


[Figure: panels for Pc, fraction of nodes, clustering coefficient <C> and average path length <L>]

Constraint III: Exclude Blocked Reactions

[Figure: Unblocked KEGG shown as a subset of KEGG]

Large-scale metabolic networks typically have certain dead-end or blocked reactions which cannot contribute to growth of the organism.

Blocked reactions have zero flux for every investigated chemical environment under any steady-state condition and do not contribute to biomass growth flux.

We next exclude blocked reactions from our biochemical universe used for sampling randomized networks.

For Blocked reactions, see Burgard et al Genome Research (2004)
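As a rough illustration of why some reactions can never carry flux, the sketch below prunes dead ends topologically. This is a simplified stand-in, not the flux-based method of Burgard et al: it assumes irreversible reactions, uses hypothetical names, and finds only those blocked reactions that involve a metabolite which is never produced or never consumed.

```python
def prune_dead_ends(reactions, external):
    """reactions: name -> (substrates, products), all irreversible.
    external: metabolites that can be taken up or excreted.
    Returns the names of reactions that survive iterative pruning."""
    active = dict(reactions)
    changed = True
    while changed:
        changed = False
        produced = set(external)
        consumed = set(external)
        for subs, prods in active.values():
            consumed |= set(subs)
            produced |= set(prods)
        for name, (subs, prods) in list(active.items()):
            # At steady state every internal metabolite must be both produced
            # and consumed; otherwise the reaction must have zero flux.
            if any(m not in produced for m in subs) or \
               any(m not in consumed for m in prods):
                del active[name]
                changed = True
    return set(active)

# Hypothetical toy network.
toy = {
    "R1": (["glc"], ["a"]),
    "R2": (["a"], ["b"]),
    "R3": (["b"], ["biomass"]),
    "R4": (["a"], ["c"]),      # c is only consumed by R6
    "R5": (["d"], ["b"]),      # d is never produced
    "R6": (["c"], ["e"]),      # e is never consumed
}
unblocked = prune_dead_ends(toy, external={"glc", "biomass"})
print(sorted(unblocked))
```

Removing R6 turns c into a dead end, so R4 is removed in the next pass; this cascading is why the pruning is iterated.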


MCMC Sampling: Exclusion of blocked reactions

Accept/Reject Criterion of reaction swap:

Restrict the reaction swaps to the set of unblocked reactions within KEGG.

Accept if the no. of metabolites in the new network is less than or equal to E. coli.

[Figure: MCMC scheme in the KEGG database, starting from the E. coli bit string, with each proposed swap accepted or rejected. Step 1: 10^6 swaps to erase the memory of the initial network. Step 2: store a network every 10^3 swaps (saving frequency) to build Ensemble uRM.]

Network Properties: After excluding blocked reactions


[Figure: panels for Pc, fraction of nodes, clustering coefficient <C> and average path length <L>]

Constraint IV: Growth Phenotype

Definition of Phenotype

[Figure: bit string and equivalent network representation of two genotypes, GA (011....) and GB (001....), which differ by one deleted reaction. Flux Balance Analysis (FBA) tests each genotype’s ability to synthesize biomass components: growth means a viable genotype, no growth an unviable genotype.]

For FBA, see Price et al (2004) for a review.

MCMC Sampling: Growth phenotype in one environment

Accept/Reject Criterion of reaction swap:

Restrict the reaction swaps to the set of unblocked reactions within KEGG.

Accept if (a) the no. of metabolites in the new network is less than or equal to that in E. coli, and (b) the new network is able to grow on the specified environment.
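The combined criterion can be sketched as below. Note the substitution: the viability test here is a simple reachability check (network expansion from the nutrients), used as a cheap stand-in for FBA, which needs a linear programming solver. All names and the toy reactions are hypothetical.

```python
def producible(network, reactions, nutrients):
    """Metabolites that can be made from the nutrients by firing reactions
    whose substrates are all available (network expansion)."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for name in network:
            subs, prods = reactions[name]
            if set(subs) <= available and not set(prods) <= available:
                available |= set(prods)
                changed = True
    return available

def is_viable(network, reactions, nutrients, biomass):
    """Stand-in for FBA: can every biomass component be produced?"""
    return set(biomass) <= producible(network, reactions, nutrients)

def accept_swap(network, reactions, nutrients, biomass, max_mets):
    """Criterion (a): metabolite count; criterion (b): growth."""
    mets = set()
    for name in network:
        subs, prods = reactions[name]
        mets |= set(subs) | set(prods)
    return len(mets) <= max_mets and is_viable(network, reactions, nutrients, biomass)

# Hypothetical toy universe.
reactions = {
    "R1": (["glc"], ["a"]),
    "R2": (["a"], ["b"]),
    "R3": (["a"], ["c"]),
    "R4": (["x"], ["b"]),
}
nutrients, biomass = {"glc"}, {"b"}

print(accept_swap({"R1", "R2"}, reactions, nutrients, biomass, max_mets=4))
print(accept_swap({"R1", "R3"}, reactions, nutrients, biomass, max_mets=4))
```

In a real run, `is_viable` would be replaced by an FBA call that checks for a positive biomass flux in the specified environment.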

[Figure: MCMC scheme in the KEGG database, starting from the E. coli bit string, with each proposed swap accepted or rejected. Step 1: 10^6 swaps to erase the memory of the initial network. Step 2: store a network every 10^3 swaps (saving frequency) to build Ensemble uRM-V1.]

Network Properties: Growth phenotype in one environment


[Figure: panels for Pc, fraction of nodes, clustering coefficient <C> and average path length <L>]

Network Properties: Growth phenotype in multiple environments


[Figure: panels for Pc, fraction of nodes, clustering coefficient <C> and average path length <L>]

Global structural properties of real metabolic networks: Consequence of simple biochemical and functional constraints

[Figure: network properties across the ensembles, ordered by increasing level of constraints]

Samal and Martin, PLoS ONE (2011) 6(7): e22295

Conjectured by: A. Wagner (2007) and B. Papp et al. (2008)

High level of genetic diversity in our randomized ensembles

Any two random networks in our most constrained ensemble uRM-V10 differ in ~ 60% of their reactions.

[Figure: bit strings 101.... and 100....: E. coli versus a random network in ensemble uRM-V10]

The Hamming distance between the two networks is ~ 60% of the maximum possible between two bit strings.

There is a set of ~ 100 reactions which are present in all of our random networks.
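Both quantities on this slide can be computed from the bit strings. In this sketch the normalization is an assumption: for two bit strings with the same number r of 1’s (and length at least 2r), the largest possible Hamming distance is 2r, so the distance is divided by 2r. The function names and toy strings are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def normalized_distance(a, b):
    """Hamming distance as a fraction of its maximum 2r for two strings
    that each contain exactly r ones (assumes length >= 2r)."""
    r = sum(a)
    if sum(b) != r:
        raise ValueError("both networks must contain the same number of reactions")
    return hamming(a, b) / (2 * r)

def core_reactions(ensemble):
    """Indices of reactions present in every network of the ensemble."""
    n = len(ensemble[0])
    return [i for i in range(n) if all(bits[i] for bits in ensemble)]

# Hypothetical toy networks with r = 3 reactions in a universe of N = 6.
a = [1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 0, 1, 0]

print(hamming(a, b), normalized_distance(a, b), core_reactions([a, b]))
```

A normalized distance of 0 means identical networks, and 1 means the two networks share no reaction.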

[Figure: bit strings 101.... and 100....: a random network in ensemble uRM-V10 versus another random network in ensemble uRM-V10]

Necessity of MCMC sampling

Reference: Samal et al, BMC Systems Biology (2010)

[Figure: Ensembles R, Rm, uRm, uRm-V1, uRm-V5 and uRm-V10 drawn as embedded sets, like Russian dolls]

[Figure: fraction of genotypes that are viable (y-axis, 10^0 down to 10^-20) versus number of reactions in the genotype (x-axis, 2000 to 2800), for glucose, acetate and succinate, together with an analytic prefactor]

Including the viability constraint on the first chemical environment leads to a reduction by at least a factor of 10^22.

Summary

We have proposed a new method based on MCMC sampling and FBA to generate randomized metabolic networks by successively imposing a few macroscopic biochemical and functional constraints in our sampling criteria.

With only these few constraints, the randomized networks approach the properties of real metabolic networks.

We conclude that biological function is a main driver of structure in metabolic networks.


Limitations and Future Outlook

We can extend the study to many organisms using the SEED database. Automated reconstruction of metabolic models for >140 organisms on which FBA can be performed.

The list of reactions in KEGG is incomplete, and its use may reflect evolutionary constraints associated with natural organisms.

From a synthetic biology perspective, many more reactions could occur that are not in KEGG.

An alternative approach, proposed by Basler et al, Bioinformatics (2011), generates new mass-balanced reactions but does not impose a functionality constraint such as biomass production.

Genotype networks: A many-to-one genotype to phenotype map

Genotype → Phenotype

RNA sequence → Secondary structure of pairings (folding)

Protein sequence → Three-dimensional structure

Gene network → Gene expression levels

Genotype network refers to the set of (many) genotypes that have a given phenotype, and their “organization” in genotype space. In the context of RNA, genotype networks are commonly referred to as neutral networks.

Reference: Fontana & Schuster (1998); Lipman & Wilbur (1991); Ciliberti, Martin & Wagner (2007)

Properties of the genotype-to-phenotype map

Are there many genotypes that have a given phenotype?

Do the genotypes with a given phenotype form a significant fraction of all possible genotypes?

Are all the genotypes with a given phenotype rather similar?

Are biological genotypes atypically robust compared to random genotypes with similar phenotype?

Are biological genotypes atypically innovative?

Can genotypes change gradually in response to selective pressures?

A. Wagner, Nat. Rev. Genet. (2008)


We have considered these questions in the context of metabolic networks.

Genotype = list of metabolic enzymes or reactions

Phenotype = growth or not in a given environment

Computational tool for Evolutionary Systems Biology


Acknowledgement

Andreas Wagner, João Rodrigues (University of Zurich)

Jürgen Jost (Max Planck Institute for Mathematics in the Sciences)

Olivier C. Martin (Université Paris-Sud)

Funding

CNRS & Max Planck Society Joint Program in Systems Biology (CNRS MPG GDRE 513) for a Postdoctoral Fellowship.

Other Work

Genotype networks in metabolic reaction spaces. Areejit Samal, João F Matias Rodrigues, Jürgen Jost, Olivier C Martin* and Andreas Wagner*. BMC Systems Biology 2010, 4:30

Environmental versatility promotes modularity in genome-scale metabolic networks. Areejit Samal, Andreas Wagner* and Olivier C Martin*. BMC Systems Biology 2011, 5:135
