IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 16, NO. 2, MARCH 2017 131

NEMo: An Evolutionary Model With Modularity for PPI Networks

Min Ye, Xiuwei Zhang, Gabriela C. Racz, Qijia Jiang, and Bernard M. E. Moret∗

Abstract: Modeling the evolution of biological networks is a major challenge. Biological networks are usually represented as graphs; evolutionary events not only include addition and removal of vertices and edges but also duplication of vertices and their associated edges. Since duplication is viewed as a primary driver of genomic evolution, recent work has focused on duplication-based models. Missing from these models is any embodiment of modularity, a widely accepted attribute of biological networks. Some models spontaneously generate modular structures, but none is known to maintain and evolve them. We describe network evolution with modularity (NEMo), a new model that embodies modularity. NEMo allows modules to appear and disappear and to fission and to merge, all driven by the underlying edge-level events using a duplication-based process. We also introduce measures to compare biological networks in terms of their modular structure; we present comparisons between NEMo and existing duplication-based models and run our measuring tools on both generated and published networks.

Index Terms: Evolutionary event, evolutionary model, generative model, modularity, network topology, protein–protein interaction (PPI) network.

I. INTRODUCTION

THE RAPID growth of experimentally measured data in biology requires effective computational models to uncover biological mechanisms in the data. Networks are commonly used to represent key processes in biology; examples include transcriptional regulatory networks, protein–protein interaction (PPI) networks, and metabolic networks. The model is typically a graph, directed or undirected, where edges or arcs represent interactions and vertices represent actors (genes, proteins, etc.). Establishing experimentally the existence of a particular interaction is expensive and time consuming, while high-throughput experimental methods such as affinity purification mass spectrometry [1] produce large numbers of false positives, so most published networks have been inferred through computational methods ranging from data mining the literature (see [2]–[4]) to inferring the evolutionary history of the networks from present observations [5]–[8]. (Makino and McLysaght [9] present a thorough discussion of evolutionary approaches to PPI networks.) Often these networks are built through a process of accretion, by adding new actors and new interactions as they are observed, published, or inferred, with the result that errors in many current biological networks tend to be false positives (errors of commission) rather than false negatives (errors of omission). A variety of databases store inferred networks and range from large graphs, such as the human PPI network in the STRING database (circa 4 250 000 interactions) [10], down to quite small ones, such as the manually curated Human Protein Reference Database (ca. 40 000 interactions) [11]. Even a cursory reading of the literature shows that agreement among findings is rather limited, partly because of the variety of samples used and the dynamic nature of the networks, and partly because of the difficulty of inference.

Manuscript received August 6, 2016; revised January 9, 2017; accepted January 18, 2017. Date of publication January 19, 2017; date of current version April 20, 2017. Asterisk indicates corresponding author.

M. Ye is with the School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland (e-mail: [email protected]).

X. Zhang is with the Simons Institute, University of California Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]).

G. C. Racz is with the University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

Q. Jiang is with Stanford University, Stanford, CA 94305 USA.

∗B. M. E. Moret is with the School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNB.2017.2656058

This intrinsic difficulty has led some research groups to go beyond the inference of a single network from data about one organism and to use comparative methods. However, pairwise comparative methods, while more powerful, offer only limited protection against noise and high variability. This weakness in turn has led to the use of evolutionary methods that use several different organisms and carry out simultaneous inference on all of them [5], [7], [9], a type of inference that falls within the category of transfer learning [12]. A unique feature of these approaches is their use of evolutionary models, something not normally associated with transfer learning. These approaches posit a model of evolution for the networks, typically based on inserting and deleting edges and duplicating or losing vertices, and then seek to infer present-day networks as well as ancestral networks that, under the chosen evolutionary model, would best explain the data collected. The evolutionary model is thus the crucial component of the inference procedure.

An early finding about biological networks such as regulatory networks and PPI networks was the clear presence of modularity [13]: these networks are not homogeneous, with comparable connectivity patterns at every vertex, but instead present a higher-level structure consisting of well-connected subgraphs with less substantial connectivity to other such subgraphs. Modularity is now widely viewed as one of the main characteristics of living systems [14]. While some of the models devised for networks lead automatically to the emergence of modules within the network [15], these models are purely generative, increasing the size of the network at each step; moreover, the types of modular structure they create have not been compared to those found in biological networks. There is thus a need for an evolutionary model for PPI networks that, while still based on the gain and loss of vertices and edges, takes into account modularity.

1536-1241 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

In this paper, we introduce network evolution with modularity (NEMo), an evolutionary model for PPI networks that includes both growth and reduction operators and that explicitly models the influence of modularity on network evolution. While network evolution remains the product of purely local events (at the level of single vertices or edges), these events are subject to slightly different parameters according to whether they occur within or between modules. Our model allows modules to emerge, disappear, merge, and split, all as consequences of lower-level evolutionary events.

We want to test whether the introduction of modularity into the evolutionary model makes a difference in the properties of the resulting networks compared with biological networks. To this end, we present the results of simulations and compare the networks thus produced with the consensus networks currently stored in a variety of databases for model organisms. Our comparisons are based on both network alignment ideas and new measures aimed at quantifying modularity, so we also discuss the usefulness of these measures and evaluate published PPI networks with respect to these measures. Our measures of modularity can be used to analyze the general characteristics of PPI networks and clearly distinguish the various model organisms. Our findings support the accepted bias of published networks toward false positives and the often-reported distribution of modules into a few large subgraphs and a collection of much smaller subgraphs; NEMo produces networks with the latter characteristic and maintains it even when it has reached a target range of sizes.

II. CURRENT GENERATIVE MODELS FOR PPI NETWORKS

All evolutionary models to date are based on the addition or removal of the basic constituent elements of the network: vertices (proteins) and edges (pairwise interactions). In terms of complexity and verisimilitude, however, models proposed to date vary widely. Most of the recent models are based on duplication followed by divergence, denoted D&D [16], [17], in which a vertex is duplicated (think of a gene duplication) and inherits some randomly chosen subset of the connections of the original vertex (the copy of the gene initially produces much the same protein as the original and so enters into many of the same interactions). Most evolutionary biologists view gene duplication (single gene, a segment of genes, or even the entire genome) as the most important source of diversification in genomic evolution [18], [19], so models based on D&D have become widely used for PPI networks.

In the full D&D model, a duplication event duplicates all interactions of the original copy, but some interactions for both the original and the duplicated copies are immediately lost with some probability. A recent variation on the D&D model is the duplication–mutation–complementarity (DMC) model [20]–[22], in which the same interaction cannot be lost simultaneously in the original and in the copy and in which the duplicated gene itself may gain a direct interaction with the original gene. The DMR (random mutation) model [23] is another variation, in which new interactions (not among those involving the original vertex) can be introduced between the duplicate vertex and some random vertices in the network.
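The DMC duplication step described above can be illustrated with a minimal Python sketch. It is an illustration only, not the implementation used in the paper: the adjacency-dictionary representation, the function name `dmc_step`, and the parameter names `q_mod` and `q_con` are assumptions introduced here, as is the choice to pick the losing copy uniformly at random.

```python
import random

def dmc_step(adj, q_mod, q_con, rng=random):
    """One DMC-style duplication step on an undirected graph (sketch).

    adj maps each node to the set of its neighbors and is modified in place.
    q_mod: probability that an inherited interaction is lost in one copy.
    q_con: probability that the copy gains an edge to the original.
    Returns the label of the new node.
    """
    original = rng.choice(sorted(adj))
    copy = max(adj) + 1                     # assumes integer node labels
    adj[copy] = set(adj[original])          # duplicate all interactions
    for nb in adj[copy]:
        adj[nb].add(copy)
    for nb in sorted(adj[original]):
        if rng.random() < q_mod:
            # Complementarity: the edge is removed from exactly one of the
            # two copies, so it can never be lost in both at once.
            loser = rng.choice([original, copy])
            adj[loser].discard(nb)
            adj[nb].discard(loser)
    if rng.random() < q_con:
        adj[original].add(copy)
        adj[copy].add(original)
    return copy
```

Calling `dmc_step` repeatedly on a small seed graph grows the network by one node per call, which matches the purely generative behavior of this model family.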

III. NEMO

While, as noted earlier, the D&D model (and, by extension, its various derivatives) will automatically give rise to modular structures, it does so in scenarios of unrestricted growth: no edge deletions are allowed other than those that occur as part of a node duplication and a node gets deleted only indirectly, if and when its degree is reduced to zero. In that sense, the D&D model, while a generative model, is not an evolutionary model: it can only grow networks, not evolve them while keeping their size within some fixed range. The same is true of its several variants.

Our aim is to produce a generative model that is also an evolutionary model, a model that we can later use for reconstructing the evolutionary history of PPI networks. Under such a model, a network may grow, shrink, or, most commonly, vary in size within some bounded range. Since the dominant growth operator is duplication and since this operator typically adds multiple edges to the network, random (i.e., unrelated to other events) deletion of edges must be fairly common. We conjectured that under such a model, modularity would not necessarily be preserved, simply because the selection of interactions to lose is independent of the modular structure. Since modules appear both necessary to life and quite robust against mutations, a model of evolution of PPI networks that is biased (as nature appears to be) in favor of the survival of modules would need to “know” about the module structure. (From an evolutionary standpoint, mutations that remove interactions within modules would be under negative selection.)

We therefore designed a two-level model, NEMo. In NEMo, a PPI network is represented as a graph, with nodes representing proteins and undirected edges representing interactions between pairs of proteins.

Events in NEMo occur at the lower level and are based on the D&D model, suitably augmented. The main event in a D&D model is node duplication. Node duplication copies an existing node and all of its connections, thereby creating a new node and a collection of new edges; in addition, some of the edges copied as well as some of the new edges created are probabilistically lost as part of the same event. We retain this event in NEMo but allow the newly created node to be connected to an additional node, randomly chosen within the graph. (The loss of edges in the D&D model corresponds to the common evolutionary adaptation that reduces the level of conservation in genes that exist in multiple copies; most of the time the resulting divergence in the gene sequence will lead to a loss of interaction, but it is also possible that it will lead to a gain.) We also add an independent gain or loss event for each node: with low probability, a node can establish a new connection to a previously unconnected node. The higher level is “module-aware” so that evolutionary events can be classified as within a module or between modules. Such a model requires the identification of modules within a network and the extraction and quantification of some high-level attributes that can be used to measure similarity. Methodologies used in much of the work on the identification of functional modules [24]–[26] are not applicable here, as we deal with an anonymous graph, not with annotated proteins, so we use clustering to identify modular structures, with a clustering algorithm that supports node overlap between clusters. (Many proteins have multiple domains and thus naturally interact with very different proteins and even a single-domain protein can be part of several pathways or modules: hence, we need a similar flexibility in the definition of modules in our model.)

More precisely, events affecting nodes and edges can be classified into four categories: node gain, node loss, edge gain, and edge loss. Node gain occurs exclusively through duplication of an existing node, a duplication that typically also results in both edge gains and edge losses. Node loss removes a randomly chosen node, reflecting such biological events as mutation in transcription factors or pseudogene formation. (As in the D&D models, it is also possible to lose a node through progressive loss of edges until the node has degree zero.) Edge loss (other than edge losses associated with a node duplication) removes a randomly chosen edge and reflects such biological events as domain mutations, structural mutations, subfunctionalization, and the like. Edge gain (other than edge gains associated with a node duplication) connects a previously unconnected pair of nodes and thus reflects many of the same events that can also cause edge loss, such as domain or structural mutations, or progressive neofunctionalization.
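The four event categories can be sketched as operations on an adjacency dictionary. This is a simplified illustration under stated assumptions, not the authors' code: all function and parameter names (`node_gain`, `p_loss`, `p_extra`, and so on) are hypothetical, edges are lost only on the copy's side during duplication, and every random choice is uniform.

```python
import random

def add_edge(adj, u, v):
    adj[u].add(v)
    adj[v].add(u)

def remove_edge(adj, u, v):
    adj[u].discard(v)
    adj[v].discard(u)

def node_gain(adj, original, p_loss, p_extra, rng):
    """Duplicate `original`: each inherited edge is dropped with probability
    p_loss and, with probability p_extra, the copy gets one extra edge to a
    randomly chosen node it is not yet connected to."""
    copy = max(adj) + 1                     # assumes integer node labels
    adj[copy] = set()
    for nb in sorted(adj[original]):
        if rng.random() >= p_loss:
            add_edge(adj, copy, nb)
    if rng.random() < p_extra:
        candidates = [n for n in sorted(adj)
                      if n != copy and n not in adj[copy]]
        if candidates:
            add_edge(adj, copy, rng.choice(candidates))
    return copy

def node_loss(adj, node):
    """Remove a node together with all of its edges."""
    for nb in list(adj[node]):
        remove_edge(adj, node, nb)
    del adj[node]

def edge_gain(adj, rng):
    """Connect a random, previously unconnected pair; return it or None."""
    nodes = sorted(adj)
    pairs = [(u, v) for i, u in enumerate(nodes)
             for v in nodes[i + 1:] if v not in adj[u]]
    if not pairs:
        return None
    u, v = rng.choice(pairs)
    add_edge(adj, u, v)
    return u, v

def edge_loss(adj, rng):
    """Remove a randomly chosen existing edge; return it or None."""
    edges = [(u, v) for u in sorted(adj) for v in sorted(adj[u]) if u < v]
    if not edges:
        return None
    u, v = rng.choice(edges)
    remove_edge(adj, u, v)
    return u, v
```

Enumerating all candidate pairs costs quadratic time, which is acceptable for a sketch but would likely be replaced by rejection sampling on large networks.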

The higher level of the model reflects the modular structure and influences the event chain as follows. First, we allow up to one event to occur in each module within the same step. That is, whereas existing models treat the network as one unit and allow a single event at a time, our model treats the network as a collection of subgraphs (modules) and allows up to one event in each subgraph. Multiple events within the same step can more closely model interconnected events: events in two different modules, for instance, can affect the same shared node. Second, we distinguish intramodular events (all four events can be intramodular) from intermodular events (only edge gains and losses can be intermodular), allowing us to use different parameters for the two types. We use this flexibility to introduce a slight bias in favor of intramodular edges over intermodular edges. Finally and crucially, while we automatically place a duplicate node within the same module as the original node, we also periodically recompute the subgraph decomposition, thereby “discovering” changes in the module structure and recording evolutionary events at the module level as module emergence, module disappearance, fusion of modules, and fission of modules. (These module-level events are thus not independently generated, but come into being as a consequence of node- and edge-level events.) Recomputing the modular structure can be done at fixed intervals (in the results presented in the following, for instance, the recomputation takes place after one third, two thirds, and all of the steps) or once the current modular structure has diverged sufficiently from the last recorded one.

In an evolutionary simulation using NEMo, at each step, each module may record no event or one lower-level event; in the latter case, that event may be an intramodular event (node duplication, node loss, edge loss, or intramodular edge gain) or an intermodular event (intermodular edge loss or gain). The parameter controlling the “no event” outcome at each step can be used to allow the simulation of distinct evolutionary rates in different modules, while the parameter controlling intramodular versus intermodular events can be used to introduce a bias in favor of module conservation. (Note that when a node loss occurs, the node is removed from its module, but not from any overlapping module: it is removed entirely from the network only when it is the target of node loss and appears in one module only.) Very small modules can easily disappear as a consequence of just a few node and/or edge losses and are thus somewhat unstable when all modules are assigned the same loss and gain parameter values.
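The per-step logic just described (at most one event per module, a "no event" parameter, an intramodular versus intermodular parameter, and the node-loss rule for overlapping modules) might be sketched as follows. The names `schedule_step`, `lose_node`, `p_none`, and `p_intra` are hypothetical, and choosing uniformly among the event types is an assumption made for illustration; the text does not give the actual parameterization.

```python
import random

INTRA_EVENTS = ("node_duplication", "node_loss", "edge_loss", "edge_gain")
INTER_EVENTS = ("edge_loss", "edge_gain")

def schedule_step(module_ids, p_none, p_intra, rng):
    """Pick at most one lower-level event per module for a single step."""
    schedule = []
    for m in module_ids:
        if rng.random() < p_none:
            continue                         # this module records no event
        if rng.random() < p_intra:
            schedule.append((m, "intra", rng.choice(INTRA_EVENTS)))
        else:
            schedule.append((m, "inter", rng.choice(INTER_EVENTS)))
    return schedule

def lose_node(adj, modules, module_id, node):
    """Apply a node loss inside one module.

    The node always leaves `module_id`; it is deleted from the network only
    if no other module still contains it. Returns True when the node was
    removed from the network. (How edges to the abandoned module are handled
    is not modeled in this sketch.)
    """
    modules[module_id].discard(node)
    if any(node in members for members in modules.values()):
        return False
    for nb in list(adj[node]):
        adj[nb].discard(node)
    del adj[node]
    return True
```

Giving each module its own `p_none` value, rather than one shared value as here, would be one way to simulate distinct evolutionary rates in different modules.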

IV. ASSESSING MODULARITY

In order to evaluate the output of NEMo, we must first quantify significant attributes of PPI networks. The resulting features can then be used to measure the similarity of our generated networks to real networks, as well as the differences between networks generated by our model and networks generated under existing models. While it is sometimes possible to compare two biological networks using a rich set of annotations to produce a network alignment, simulated networks lack such annotations and thus cannot be meaningfully aligned. Thus, similarity here refers to structural and topological features such as modularity and connectivity: we need to compare networks very different in size and composition and so cannot use tools such as network alignment methods. We thus propose a set of features applicable to all networks; these features are chosen to measure global properties of networks and to quantify aspects of modularity.

Most of the features proposed here are commonly used in the analysis of networks [26], [31]; several are modified so as to provide a level of independence from size (while simulations can be run at all sizes, bacterial PPI networks are necessarily smaller than mammalian PPI networks), and some are measures of importance in other fields that we have repurposed for networks, such as the Gini coefficient. For each network, we compute the number of nodes, the number of edges, and the degree distribution; we also run the clustering algorithms and store the number of clusters as well as the size and composition of each cluster. We then compute these six global measures.

1) Clustering Coefficient (CC): The CC is based on triplets of vertices. A triplet is open if connected with two edges and closed if connected with all three edges. The CC is the ratio of the number of closed triplets to the total number of (open or closed) triplets [32].

2) Graph Density (GD): The density of a graph is the ratio of the actual number of edges to the number of possible edges.

3) Fraction of Edges Inside (FEI): FEI is the fraction of edges contained within modules. We expect it to be high since PPI networks contain highly connected substructures (modules) that have only few connections to vertices outside the substructure [31], [33], [34].

4) Diameter: The diameter of a graph is the length of the longest shortest path between any two vertices in the graph.

5) Shortest Path Mean (SPM): SPM is the mean length of all pairwise shortest paths in the graph.

6) Gini Coefficient (Gini): The Gini coefficient, proposed by the Italian statistician C. Gini in 1912, measures distributional inequality, such as the inequality of incomes within a population. If household $i$ has a yearly income of $x_i$, then the Gini coefficient of a population of $n$ households is given by

$$G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2n \sum_{i=1}^{n} x_i}.$$

We repurposed that definition for our use in studying modularity by defining the “income” of a node as the degree of the node plus the sum of the degrees of its immediate neighbors.
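As an illustration of how the six measures could be computed, here is a minimal pure-Python sketch. It assumes an undirected graph stored as a dictionary of neighbor sets with comparable (for example integer) node labels, and modules given as a list of node sets; all function names are hypothetical. Shortest paths are taken over connected pairs only, which is an assumption for disconnected graphs that the text does not address.

```python
from collections import deque
from itertools import combinations

def edge_list(adj):
    """Undirected edges as (u, v) pairs with u < v."""
    return [(u, v) for u in adj for v in adj[u] if u < v]

def clustering_coefficient(adj):
    """Closed triplets divided by all (open or closed) triplets."""
    closed = total = 0
    for nbrs in adj.values():
        for a, b in combinations(sorted(nbrs), 2):
            total += 1                       # triplet centered on this node
            closed += b in adj[a]            # closed if a and b are linked
    return closed / total if total else 0.0

def graph_density(adj):
    n = len(adj)
    return 2 * len(edge_list(adj)) / (n * (n - 1)) if n > 1 else 0.0

def fraction_edges_inside(adj, modules):
    """Share of edges whose endpoints share at least one module."""
    edges = edge_list(adj)
    inside = sum(any(u in m and v in m for m in modules) for u, v in edges)
    return inside / len(edges) if edges else 0.0

def shortest_path_lengths(adj):
    """BFS distances for every connected unordered pair of nodes."""
    lengths = []
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths.extend(d for node, d in dist.items() if node > source)
    return lengths

def diameter_and_spm(adj):
    lengths = shortest_path_lengths(adj)
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)

def neighborhood_incomes(adj):
    """Node 'income': own degree plus the degrees of immediate neighbors."""
    return [len(adj[v]) + sum(len(adj[u]) for u in adj[v]) for v in adj]

def gini(values):
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff = sum(abs(x - y) for x in values for y in values)
    return diff / (2 * n * total)
```

For the three-node path 0-1-2, the incomes are 3, 4, and 3, so the double sum of absolute differences is 4 and the Gini coefficient is 4 / (2 · 3 · 10) = 1/15.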

We considered a number of other measures, such as modularity, as defined by Newman and Girvan [35] to measure the community structure, and betweenness centrality, a measure that originated in sociology [36], as well as several others, such as conformance to the characteristics of small-world networks or scale-free networks, but present only these six here, as the others either are closely correlated with them or appear unrelated to modularity, as further discussed in the following.

V. RESULTS ON NATURAL PPI NETWORKS

For the data, we chose to work with model organisms, as they have large numbers of documented high-confidence interactions. We picked the three species with the largest number of such interactions, Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens. Different sources were considered to investigate the discrepancies among the networks in current databases.

One source was the STRING database [10], which aims to provide a global perspective for as many organisms as feasible, tolerating lower quality data and computational predictions. Due to this bias, STRING includes a large number of indirect interactions, which we treat as false positives, since our aim is to evolve a network of direct interactions. Fortunately, STRING stores an evidence score for each interaction to allow elimination of false positive entries by the user. We thus used both the complete data set and a subset filtered using a high threshold of 900 on the evidence scores. For other sources, we consulted the manually curated H. sapiens PPI network database HPRD [37] and the experimental setup of the MAGNA++ algorithm [38], which aims at maximizing accuracy in global network alignment: an H. sapiens PPI network of 9141 proteins and 41 456 interactions [39], a high-confidence E. coli PPI network [40] of 1941 proteins with 3989 interactions, and a yeast (S. cerevisiae) PPI network with 2390 proteins and 161 277 PPIs [41]. We also use the database HitPredict [42], [43], which stores experimentally determined protein–protein interactions with reliability scores; for this database, we also included the network of C. elegans as an additional reference. For clustering (that is, to identify putative modules), we used both ClusterOne and MCL.

TABLE I. GENERAL CHARACTERISTICS OF THE THREE PPI NETWORKS IN VARIOUS DATABASES

TABLE II. VALUES OF OUR MEASURES FOR THE REFERENCE PPI NETWORKS IN VARIOUS DATABASES

Table I provides a brief description of these PPI networks in various databases and versions, while Table II shows the values of our various measures on these networks. In Tables I and II, S stands for STRING’s complete data set, S900 for the filtered data set with confidence scores above 900, H for HPRD, M for MAGNA++, and P for HitPredict. The different clustering results yield different values for the FEIs.

The very large differences in size among the databases for the same network are striking: the STRING database has well over four million edges for the human PPI network, whereas the HPRD database has fewer than 40 000, or less than 1% of the number in STRING. This large discrepancy underlines the difference in philosophy between various PPI databases and illustrates why testing models or inferences against databases must be done with great care. For instance, simply clustering the graph has pitfalls, as shown in the number of clusters found by the same algorithm for E. coli on the various databases, going from 16 clusters among 4145 nodes in STRING to 1151 clusters among 3351 nodes in HitPredict, values that again differ by around two orders of magnitude. The graphs themselves are all sparse (GD is low, even for the relatively denser STRING networks), but some structural differences are clear, although the reason for any such difference is not always clear: it may lie in differences between the numbers of proteins and interactions stored in the databases, differences between the complexity of the networks, or differences between the organisms’ metabolic needs and lifestyles. The Gini coefficient points to significant inequality of distribution in the degree of one-level neighborhoods (Gini coefficients above 0.6 for income per capita are very rare in today’s world), but the values are quite variable across the databases. The FEI displays one of the more striking differences, being very high for networks in STRING, HPRD, and MAGNA++, but much lower in networks in HitPredict, presumably because HitPredict is good at excluding indirect interactions that simply shortcut paths through transitive closure.

Fig. 1. Degree distribution for the E. coli network in STRING, a complete data set.

We also tested these networks for one of the characteristic attributes of social networks, small-world networks, and scale-free networks, namely, a degree distribution that follows a power law. The conclusion is very clear for the STRING networks: they do not follow a power law, as the plot in Fig. 1 clearly shows; a power law would result in an oblique straight line, not in the complex curve shown in the figure. It is less clear for the other three databases; in fact, for E. coli, the plot appears to support a hypothesis of an underlying power law, at least in HitPredict, as shown in Fig. 2.
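The "oblique line" criterion can be made concrete with a small sketch: a power law count(k) ∝ k^(-γ) becomes a straight line of slope -γ when both axes are logarithmic, so a least-squares fit of log count against log degree gives a rough diagnostic. The function name `loglog_fit` is hypothetical, and such a fit is only a coarse visual-style check, not a rigorous statistical test and not necessarily the procedure used for the figures.

```python
import math
from collections import Counter

def loglog_fit(degrees):
    """Least-squares line through (log k, log count(k)) for degrees k > 0.

    Returns (slope, r_squared). A slope near -gamma with r_squared close
    to 1 is consistent with a power law; a curved plot gives a low value.
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared
```

For example, a synthetic degree list with 100 nodes of degree 1, 25 of degree 2, 4 of degree 5, and 1 of degree 10 follows count(k) = 100 · k^(-2) exactly, so the fitted slope is -2.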

VI. RESULTS ON SIMULATIONS

A. Simulation Goals and Setup

The goal of our simulations is to verify the ability of NEMo to produce networks with characteristics similar to those of the natural PPI networks and also to compare the networks it produces with those produced without the module-aware level and with those produced by D&D models. In particular, we want to test the ability of NEMo to sustain modules in networks not undergoing growth, but subject only to evolutionary changes, where gain of proteins and interactions is balanced by loss of same. Therefore, we run two distinct series of simulations, one for generation and one for evolution.

Fig. 2. Degree distribution for the E. coli network in HitPredict, a complete data set.

TABLE III. VALUES OF OUR FEATURES FOR THE GENERATED NETWORKS AND THE REFERENCE PPI NETWORKS IN VARIOUS DATABASES

The first series uses both the DMC model [22], perhaps the most commonly used model in the D&D family today, and NEMo to grow networks to fixed sizes. We then compute our features on these networks and compare both types of generated networks with the PPI networks of the model organisms. Since DMC is not module aware, but is claimed to generate modular networks [15], whereas NEMo is explicitly module aware, we want to see how well the characteristics of each type of generated network compare with those of the PPI networks of the model organisms.

In the second series of simulations, we use NEMo in steady-state mode (balanced gains and losses) over many steps to evolve networks produced during the first simulation series. Our main intent here is to observe the evolution (mostly in terms of size, edge density, and modules) of the networks. We use parameters for NEMo that give it a slight bias toward growth, mostly to prevent the natural variance of the process from “starving” too many of the networks.

Fig. 3. Evolution of network characteristics under the NEMo model over 600 steps, with reclustering into modules at 200 and 400 steps. Top line shows the total number of edges, second line the number of vertices, third line the number of modules, fourth line the size of the largest module, and bottom line the number of singleton modules. (a) Evolution from a DMC-generated network. (b) Evolution from a NEMo-generated network.

B. Results for Network Generation

We set parameters of our model for simulating growth of the network and compare the resulting networks with those built with the standard DMC model for similar sizes, as well as with the PPI networks from the three model organisms. (In generative mode, NEMo is not just module aware, but also reclusters the network regularly.)

We compute our network features for each of these networks, but report mean values over the set of simulations. Table III shows these means, preceded for convenience by the same features shown for PPI networks (from Table II).

DMC and NEMo both generate networks with features comparable to those observed in the PPI networks collected from HPRD, MAGNA++, and HitPredict, although the significantly lower clustering coefficient of the DMC-generated network (0.05 compared with 0.14 for the NEMo-generated network) indicates a less resolved modular structure. (All PPI networks from databases have larger clustering coefficients than the generated networks, but the size of networks matters in this respect, as does the number of additional indirect interaction edges.) The Gini coefficients of the generated networks are comparable and are considerably smaller than those of the networks from the databases, which is to be expected from a model used in generative mode: the generation gives little time for module-level events such as merging and splitting that contribute to the unequal distribution of neighborhood degrees.

C. Results for Network Evolution

In the second step of our experiments, we test the ability of NEMo to simulate the evolution of a PPI network (with roughly balanced node gain and loss rates) while preserving modularity, and we also test how NEMo’s behavior is affected by its initial condition, using both DMC- and NEMo-generated networks at time zero. We want to observe the evolution of the network after a larger number of events, so we (arbitrarily) choose 600 steps; recall that NEMo allows up to one event per module at each step, so that the 600 steps can yield a much larger number of events. Fig. 3 shows the changes in network size (numbers of edges and vertices) and structure (numbers of modules) as an initial network is evolved through 600 steps, with reclustering into modules taking place after 200 and 400 steps.

The main observation here is that NEMo, when started with a DMC-generated network [Fig. 3(a)], begins by reconfiguring the network, reducing its number of vertices by about one-third over the first hundred steps and replacing edges. It then moves into much the same mode as that depicted in Fig. 3(b), which shows a steady evolutionary behavior mixed with a small bias toward growth. The implication is that, while the DMC-generated network may have a modular structure, that structure is not really compatible with the type of structure our two-level model embodies: the module structure built by DMC is somehow “wrong” and needs to be heavily modified before the model can enter a stable phase. In particular, observe that the GD of the DMC-generated network is low and is swiftly increased by NEMo, while the initial number of modules is high and is swiftly decreased by NEMo as a consequence of the removal of many nodes. After the first 200 steps and the first reclustering of modules, the evolution follows the same path as that followed immediately when working from a NEMo-generated initial graph, as seen in Fig. 3(b). Fig. 3(b) shows variance in the rate of increase in the number of edges, partly a consequence of the node duplication process: duplicating a few high-degree nodes in rapid succession quickly increases the overall degree of the network, while also increasing the number of high-degree nodes. Most NEMo simulations show a mixed growth rate within the 600 simulation steps, indicating that NEMo is flexible and allows a reshaping and restructuring of a network while keeping the network size fairly stable. The node–edge ratio for biological PPI networks (see Table I) shows that the number of edges is some multiple (larger than 1) of the number of nodes, but that this multiple is quite variable. Thus, the flexibility and dynamics that NEMo enables are important.

The mild generative bias we deliberately introduced into the evolutionary simulations can be removed without harm when evolving NEMo-generated networks; run over larger numbers of steps, such simulations then evolve a modular structure closer to that of the PPI networks from the databases.

The module-aware level of NEMo derives its power from its ability to distinguish intermodular from intramodular events. However, NEMo uses this power in a minimal way, by assigning slightly different probabilities to the two classes of events; in evolutionary terms, it simulates a slightly stronger negative selection for intramodular interactions than for intermodular interactions. The distinction between the two classes of events could be used to a much larger extent, but our results show that even this minimal intervention, consistent with a selective pressure to preserve modularity while allowing modules themselves to adapt, suffices to create a significant difference in the types of networks produced.
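The distinction between the two classes of events can be sketched in a few lines. This is an illustration only, restricted to edge loss: the function names, the `module_of` mapping, and the two probability parameters are hypothetical, and both the actual values and the direction of the bias are left to the caller rather than taken from the paper.

```python
import random

def is_intramodular(u, v, module_of):
    """True when both endpoints of the edge lie in the same module."""
    return module_of[u] == module_of[v]

def edge_loss_pass(edges, module_of, p_loss_intra, p_loss_inter, rng=random):
    """Return the edges that survive one round of random edge loss.

    Each edge is lost with a probability that depends only on its
    class: p_loss_intra for intramodular edges and p_loss_inter for
    intermodular edges (both names are assumptions of this sketch).
    """
    kept = []
    for u, v in edges:
        p = p_loss_intra if is_intramodular(u, v, module_of) else p_loss_inter
        if rng.random() >= p:     # the edge survives this round
            kept.append((u, v))
    return kept
```

Setting the two probabilities to nearly equal values corresponds to the minimal intervention described above; widening the gap would be one way to use the distinction "to a much larger extent".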

VII. DISCUSSION AND CONCLUSION

We presented NEMo, a module-aware evolutionary model for PPI networks. The emphasis of NEMo, compared with existing models for PPI networks, is on evolution rather than generation: whereas existing models (and the first layer of NEMo, which is a variant of existing models) are known to generate a modular structure when growing networks, we were interested in a model that would evolve existing networks, using the same basic set of evolutionary events.

The salient feature of NEMo is a module-aware layer that sits above the event layer and distinguishes between intermodular and intramodular events. The awareness is achieved through periodic recomputation (triggered by sampling and analysis for drift) of the modular structure. The uses to which this awareness is put are minimal: NEMo simply gives a slightly higher probability to intramodular events than to intermodular events, thereby slightly favoring conservation of modules. The details of the model are broadly adjustable: the algorithm used to detect modules, the number and nature of parameters used to control intramodular versus intermodular events, the features chosen to characterize the network, and the distance measure used to measure drift in order to decide when to reevaluate the composition of modules are all flexible.
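The drift-triggered recomputation described above might be organized as in the following sketch. Everything here is an assumption made for illustration: the Euclidean distance stands in for whichever distance measure is chosen, and `step_fn`, `features_fn`, `recluster_fn`, `sample_every`, and `threshold` are hypothetical names for the pluggable pieces the text lists as adjustable.

```python
import math

def feature_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def needs_reclustering(reference, current, threshold):
    """True when the sampled features have drifted past the threshold."""
    return feature_distance(reference, current) > threshold

def evolve(state, steps, sample_every, threshold,
           step_fn, features_fn, recluster_fn):
    """Run `steps` evolutionary steps, reclustering when drift is detected.

    Returns the list of step indices at which reclustering took place.
    """
    reference = features_fn(state)
    reclustered_at = []
    for t in range(1, steps + 1):
        step_fn(state)                      # apply one round of events
        if t % sample_every != 0:
            continue                        # only sample periodically
        if needs_reclustering(reference, features_fn(state), threshold):
            recluster_fn(state)             # recompute the modules
            reference = features_fn(state)  # drift is measured from here on
            reclustered_at.append(t)
    return reclustered_at
```

Passing the pieces in as functions mirrors the flexibility claimed for the model: the module-detection algorithm, the feature set, and the distance measure can each be swapped without touching the loop.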

Our simulation results show that its second layer enables NEMo to run through large numbers of evolutionary events (large as compared to the size of the network), balanced so as not to affect the expected size of the network, while preserving the characteristics of its original (growth-derived) modular structure. To the best of our knowledge, this is the first such result, and it paves the way for phylogenetic analyses as well as population studies of PPI networks.

As discussed by Makino and McLysaght [9], however, the number of factors that could affect the evolution of PPI networks is very large. NEMo captures only a small subset of these factors, since it works just on the graph structure and, at the level of individual events, makes the same independence assumptions as current models. Interdependent events or hidden underlying events present serious challenges. Incorporating externally supplied data (in addition to the network itself) makes sense in a data-rich era, but will require, for each type of data, further development of the model.

REFERENCES

[1] J. Morris et al., “Affinity purification–mass spectrometry and network analysis to understand protein-protein interactions,” Nature Protocols, vol. 9, no. 11, pp. 2539–2554, 2014.

[2] E. M. Marcotte, I. Xenarios, and D. Eisenberg, “Mining literature for protein–protein interactions,” Bioinformatics, vol. 17, no. 4, pp. 359–363, 2001.

[3] Y. Hao, X. Zhu, M. Huang, and M. Li, “Discovering patterns to extract protein–protein interactions from the literature: Part II,” Bioinformatics, vol. 21, no. 15, pp. 3294–3300, 2005.

[4] A. Abi-Haidar et al., “Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks,” Genome Biol., vol. 9, p. S11, Sep. 2008.


[5] J. Dutkowski and J. Tiuryn, “Phylogeny-guided interaction mapping in seven eukaryotes,” BMC Bioinf., vol. 10, no. 1, p. 393, 2009.

[6] X. Zhang and B. M. E. Moret, “Refining transcriptional regulatory networks using network evolutionary models and gene histories,” Algorithms Mol. Biol., vol. 5, p. 1, Jan. 2010.

[7] X. Zhang and B. M. E. Moret, “Refining regulatory networks through phylogenetic transfer of information,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 9, no. 4, pp. 1032–1045, Jul. 2012.

[8] S. M. E. Sahraeian and B.-J. Yoon, “A network synthesis model for generating protein interaction network families,” PLoS ONE, vol. 7, no. 8, p. e41474, 2012.

[9] T. Makino and A. McLysaght, “Evolutionary analyses of protein interaction networks,” in Biological Data Mining in Protein Interaction Networks, X.-L. Li and S.-K. Ng, Eds. Hershey, PA, USA: IGI Global, 2009, pp. 169–181.

[10] D. Szklarczyk et al., “STRING v10: Protein–protein interaction networks, integrated over the tree of life,” Nucl. Acids Res., vol. 43, no. D1, pp. D447–D452, 2015.

[11] T. S. K. Prasad et al., “Human protein reference database–2009 update,” Nucl. Acids Res., vol. 37, pp. D767–D772, Jan. 2009.

[12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.

[13] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray, “From molecular to modular cell biology,” Nature, vol. 402, no. 6761, pp. C47–C52, 1999.

[14] G. Schlosser and G. Wagner, Modularity in Development and Evolution. Chicago, IL, USA: Univ. Chicago Press, 2004.

[15] R. Solé and S. Valverde, “Spontaneous emergence of modularity in cellular networks,” J. Roy. Soc. Interface, vol. 5, no. 18, pp. 129–133, 2008.

[16] J. Qian, N. M. Luscombe, and M. Gerstein, “Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model,” J. Mol. Biol., vol. 313, no. 4, pp. 673–689, Nov. 2001.

[17] A. Bhan, D. Galas, and T. Dewey, “A duplication growth model of gene expression networks,” Bioinformatics, vol. 18, no. 11, pp. 1486–1493, 2002.

[18] S. Ohno, Evolution by Gene Duplication. Berlin, Germany: Springer-Verlag, 1970.

[19] M. Lynch and J. S. Conery, “The evolutionary fate and consequences of duplicate genes,” Science, vol. 290, no. 5494, pp. 1151–1155, 2000.

[20] M. Middendorf, E. Ziv, and C. H. Wiggins, “Inferring network mechanisms: The Drosophila melanogaster protein interaction network,” Proc. Nat. Acad. Sci. USA, vol. 102, no. 9, pp. 3192–3197, 2005.

[21] S. Navlakha and C. Kingsford, “Network archaeology: Uncovering ancient networks from present-day interactions,” PLoS Comput. Biol., vol. 7, no. 4, p. e1001119, 2011.

[22] A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani, “Global protein function prediction from protein-protein interaction networks,” Nature Biotechnol., vol. 21, no. 6, pp. 697–700, 2003.

[23] R. Solé, R. Pastor-Satorras, E. Smith, and T. Kepler, “A model of large-scale proteome evolution,” Adv. Complex Syst., vol. 5, no. 1, pp. 43–54, 2002.

[24] J. Dutkowski and J. Tiuryn, “Identification of functional modules from conserved ancestral protein–protein interactions,” Bioinformatics, vol. 23, no. 13, pp. i149–i158, 2007.

[25] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, “Identifying functional modules in protein–protein interaction networks: An integrated exact approach,” Bioinformatics, vol. 24, no. 13, pp. i223–i231, 2008.

[26] T. Aittokallio, “Module finding approaches for protein interaction networks,” in Biological Data Mining in Protein Interaction Networks, X.-L. Li and S.-K. Ng, Eds. Hershey, PA, USA: IGI Global, 2009, pp. 335–353.

[27] T. Nepusz, H. Yu, and A. Paccanaro, “Detecting overlapping protein complexes in protein-protein interaction networks,” Nature Methods, vol. 9, no. 5, pp. 471–472, 2012.

[28] S. V. Dongen, “Graph clustering by flow simulation,” Ph.D. dissertation, Dept. Math. Comput. Sci., Utrecht Univ., Utrecht, The Netherlands, 2000.

[29] A. Enright, S. Van Dongen, and C. Ouzounis, “An efficient algorithm for large-scale detection of protein families,” Nucl. Acids Res., vol. 30, no. 7, pp. 1575–1584, 2002.

[30] S. van Dongen and C. Abreu-Goodger, “Using MCL to extract clusters from networks,” in Bacterial Molecular Networks (Methods in Molecular Biology), vol. 804, J. van Helden, A. Toussaint, and D. Thieffry, Eds. Berlin, Germany: Springer-Verlag, 2012, pp. 281–295.

[31] A.-L. Barabási and Z. Oltvai, “Network biology: Understanding the cell’s functional organization,” Nature Rev. Genet., vol. 5, pp. 101–113, Feb. 2004.

[32] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge, U.K.: Cambridge Univ. Press, 1994.

[33] A. Wagner, “The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes,” Mol. Biol. Evol., vol. 18, no. 7, pp. 1283–1292, 2001.

[34] Y. Jin, D. Turaev, T. Weinmaier, T. Rattei, and H. A. Makse, “The evolutionary dynamics of protein-protein interaction networks inferred from the reconstruction of ancient networks,” PLoS ONE, vol. 8, no. 3, p. e58134, 2013.

[35] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 69, no. 2, pp. 26113–26127, 2004.

[36] L. C. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, vol. 40, no. 1, pp. 35–41, Mar. 1977.

[37] T. S. K. Prasad et al., “Human protein reference database–2009 update,” Nucl. Acids Res., vol. 37, no. 1, pp. D767–D772, Jan. 2009.

[38] V. Saraph and T. Milenkovic, “MAGNA: Maximizing accuracy in global network alignment,” Bioinformatics, vol. 30, no. 20, pp. 2931–2940, 2014. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/30/20/2931.abstract

[39] P. Radivojac et al., “An integrated approach to inferring gene–disease associations in humans,” Proteins, vol. 72, no. 3, pp. 1030–1037, 2008.

[40] J. M. Peregrín-Alvarez, X. Xiong, C. Su, and J. Parkinson, “The modular organization of protein interactions in Escherichia coli,” PLoS Comput. Biol., vol. 5, no. 10, p. e1000523, 2009.

[41] S. Collins et al., “Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae,” Mol. Cellular Proteomics, vol. 6, no. 3, pp. 439–450, 2007.

[42] Y. López, K. Nakai, and A. Patil, “HitPredict version 4: Comprehensive reliability scoring of physical protein–protein interactions from more than 100 species,” Database, J. Biol. Databases Curation, 2015, doi: 10.1093/database/bav117.

[43] A. Patil, K. Nakai, and H. Nakamura, “HitPredict: A database of quality assessed protein–protein interactions in nine species,” Nucl. Acids Res., vol. 39, no. 1, pp. D744–D749, 2011.

Min Ye received the B.S. and M.S. degrees in computational biology and bioinformatics from Saarland University, Germany. She is currently working toward the Ph.D. degree in the Laboratory for Computational Biology and Bioinformatics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

Her research focuses on model and algorithm design for biological networks and their phylogenetic reconstruction.

Xiuwei Zhang received the Ph.D. degree in computer science, under the supervision of Prof. B. Moret, from the Laboratory for Computational Biology and Bioinformatics, École Polytechnique Fédérale de Lausanne, Switzerland.

She was a Postdoctoral Researcher with the Teichmann Group, European Bioinformatics Institute, and a Simons fellow with the Simons Institute, University of California, Berkeley, CA, USA. She is currently a Postdoctoral Researcher with the Yosef Laboratory, UC Berkeley. Her current research interests include the analysis and application of evolution of biological data, such as biological networks and protein structures, and method development and data analysis for RNA-Seq data and single-cell RNA-Seq data.

Gabriela Clara Racz received the M.S. degree in applied mathematics from the University of Zagreb, Croatia. She is currently working toward the Ph.D. degree with the Earth, Ocean and Atmosphere Science Department, University of British Columbia, Canada.

In 2015, she was a Research Intern with the Laboratory for Computational Biology and Bioinformatics, École Polytechnique Fédérale de Lausanne, Switzerland.


Qijia Jiang received the B.S. degree in electrical engineering and the B.A. degree in statistics from Rice University, Houston, TX, USA, in 2015. She is currently working toward the Ph.D. degree with the Electrical Engineering Department, Stanford University, Stanford, CA, USA.

Bernard M. E. Moret received the Ph.D. degree from the University of Tennessee, Knoxville, TN, USA, in 1980.

He was on the faculty of the Department of Computer Science, University of New Mexico, until 2006, serving as the Chairman from 1991 to 1993. He is currently a Professor of Computer Science and holds the Chair of Bioinformatics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. His research interests are in the area of algorithms and applications, particularly in computational molecular biology. He founded the ACM Journal of Experimental Algorithmics in 1995 and served as its Editor-in-Chief for seven years. Since 2000, he has focused on the development of models and algorithms for evolutionary genomics, publishing over 100 peer-reviewed articles in the area and founding, in 2001, the annual Workshop on Algorithms in Bioinformatics.