IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 16, NO. 2, MARCH 2017
NEMo: An Evolutionary Model With Modularity for PPI Networks
Min Ye, Xiuwei Zhang, Gabriela C. Racz, Qijia Jiang, and Bernard M. E. Moret∗
Abstract: Modeling the evolution of biological networks is a major challenge. Biological networks are usually represented as graphs; evolutionary events include not only addition and removal of vertices and edges but also duplication of vertices and their associated edges. Since duplication is viewed as a primary driver of genomic evolution, recent work has focused on duplication-based models. Missing from these models is any embodiment of modularity, a widely accepted attribute of biological networks. Some models spontaneously generate modular structures, but none is known to maintain and evolve them. We describe network evolution with modularity (NEMo), a new model that embodies modularity. NEMo allows modules to appear and disappear and to fission and to merge, all driven by the underlying edge-level events using a duplication-based process. We also introduce measures to compare biological networks in terms of their modular structure; we present comparisons between NEMo and existing duplication-based models and run our measuring tools on both generated and published networks.
Index Terms: Evolutionary event, evolutionary model, generative model, modularity, network topology, protein–protein interaction (PPI) network.
Manuscript received August 6, 2016; revised January 9, 2017; accepted January 18, 2017. Date of publication January 19, 2017; date of current version April 20, 2017. Asterisk indicates corresponding author.
M. Ye is with the School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland (e-mail: [email protected]).
X. Zhang is with the Simons Institute, University of California Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]).
G. C. Racz is with the University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Q. Jiang is with Stanford University, Stanford, CA 94305 USA.
∗B. M. E. Moret is with the School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNB.2017.2656058

I. INTRODUCTION
THE RAPID growth of experimentally measured data in biology requires effective computational models to uncover biological mechanisms in the data. Networks are commonly used to represent key processes in biology; examples include transcriptional regulatory networks, protein–protein interaction (PPI) networks, and metabolic networks. The model is typically a graph, directed or undirected, where edges or arcs represent interactions and vertices represent actors (genes, proteins, etc.). Establishing experimentally the existence of a particular interaction is expensive and time consuming, while high-throughput experimental methods such as affinity purification mass spectrometry [1] produce large numbers of false positives, so most published networks have been inferred through computational methods ranging from data mining the literature (see [2]–[4]) to inferring the evolutionary history of the networks from present observations [5]–[8]. (Makino and McLysaght [9] present a thorough discussion of evolutionary approaches to PPI networks.) Often these networks are built through a process of accretion, by adding new actors and new interactions as they are observed, published, or inferred, with the result that errors in many current biological networks tend to be false positives (errors of commission) rather than false negatives (errors of omission). A variety of databases store inferred networks and range from large graphs, such as the human PPI network in the STRING database (circa 4 250 000 interactions) [10], down to quite small ones, such as the manually curated Human Protein Reference Database (ca. 40 000 interactions) [11]. Even a cursory reading of the literature shows that agreement among findings is rather limited, in part because of the variety of samples used and the dynamic nature of the networks, and in part because of the difficulty of inference.
This intrinsic difficulty has led some research groups to go beyond the inference of a single network from data about one organism and to use comparative methods. However, pairwise comparative methods, while more powerful, offer only limited protection against noise and high variability. This weakness in turn has led to the use of evolutionary methods that use several different organisms and carry out simultaneous inference on all of them [5], [7], [9], a type of inference that falls within the category of transfer learning [12]. A unique feature of these approaches is their use of evolutionary models, which are not normally associated with transfer learning. These approaches posit a model of evolution for the networks, typically based on inserting and deleting edges and duplicating or losing vertices, and then seek to infer present-day networks as well as ancestral networks that, under the chosen evolutionary model, would best explain the data collected. The evolutionary model is thus the crucial component of the inference procedure.
An early finding about biological networks such as regulatory networks and PPI networks was the clear presence of modularity [13]: these networks are not homogeneous, with comparable connectivity patterns at every vertex, but instead present a higher-level structure consisting of well-connected subgraphs with less substantial connectivity to other such subgraphs. Modularity is now widely viewed as one of the main characteristics of living systems [14]. While some of the models devised for networks lead automatically to the emergence of modules within the network [15], these models are purely generative, increasing the size of the network at each step; moreover, the types of modular structure they create have not been compared to those found in biological networks. There is thus a need for an evolutionary model for PPI networks that, while still based on the gain and loss of vertices and edges, takes into account modularity.

1536-1241 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In this paper, we introduce network evolution with modularity (NEMo), a network evolutionary model with modularity for PPI networks that includes both growth and reduction operators, and that explicitly models the influence of modularity on network evolution. While network evolution remains the product of purely local events (at the level of single vertices or edges), these events are subject to slightly different parameters according to whether they occur within or between modules. Our model allows modules to emerge, disappear, merge, and split, all as consequences of lower-level evolutionary events.
We want to test whether the introduction of modularity into the evolutionary model makes a difference in the properties of the resulting networks compared with biological networks. To this end, we present the results of simulations and compare the networks thus produced with the consensus networks currently stored in a variety of databases for model organisms. Our comparisons are based on both network alignment ideas and new measures aimed at quantifying modularity, so we also discuss the usefulness of these measures and evaluate published PPI networks with respect to these measures. Our measures of modularity can be used to analyze the general characteristics of PPI networks and clearly distinguish the various model organisms. Our findings support the accepted bias of published networks toward false positives and the often reported distribution of modules into a few large subgraphs and a collection of much smaller subgraphs; NEMo produces networks with the latter characteristic and maintains it even when it has reached a target range of sizes.
II. CURRENT GENERATIVE MODELS FOR PPI NETWORKS
All evolutionary models to date are based on the addition or removal of the basic constituent elements of the network: vertices (proteins) and edges (pairwise interactions). In terms of complexity and verisimilitude, however, models proposed to date vary widely. Most of the recent models are based on duplication followed by divergence, denoted D&D [16], [17], in which a vertex is duplicated (think of a gene duplication) and inherits some randomly chosen subset of the connections of the original vertex (the copy of the gene initially produces much the same protein as the original and so enters into many of the same interactions). Most evolutionary biologists view gene duplication (single gene, a segment of genes, or even the entire genome) as the most important source of diversification in genomic evolution [18], [19], so models based on D&D have become widely used for PPI networks.
In the full D&D model, a duplication event duplicates all interactions of the original copy, but some interactions for both the original and the duplicated copies are immediately lost with some probability. A recent variation on the D&D model is the duplication–mutation–complementarity (DMC) model [20]–[22], in which the same interaction cannot be lost simultaneously in the original and in the copy and in which the duplicated gene itself may gain a direct interaction with the original gene. The DMR (random mutation) model [23] is another variation, in which new interactions (not among those involving the original vertex) can be introduced between the duplicate vertex and some random vertices in the network.
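The DMC duplication step described above can be illustrated with a minimal Python sketch. It is not the implementation used in the cited work: the adjacency-set representation, the function name `dmc_duplicate`, and the parameters `q_mod` (probability that a shared interaction is lost on one side) and `q_con` (probability of a direct original-copy interaction), together with their default values, are assumptions made for illustration.

```python
import random

def dmc_duplicate(adj, rng, q_mod=0.4, q_con=0.1):
    """One duplication-mutation-complementarity step (illustrative sketch).

    adj maps each integer node to the set of its neighbours (undirected).
    q_mod and q_con are hypothetical names and values, not from the paper.
    """
    original = rng.choice(sorted(adj))
    copy = max(adj) + 1
    # Duplication: the copy inherits every interaction of the original.
    adj[copy] = set(adj[original])
    for nb in adj[copy]:
        adj[nb].add(copy)
    # Complementarity: a shared interaction may be lost by the original
    # or by the copy, but never by both.
    for nb in sorted(adj[original]):
        if rng.random() < q_mod:
            loser = rng.choice((original, copy))
            adj[loser].discard(nb)
            adj[nb].discard(loser)
    # The copy may gain a direct interaction with the original.
    if rng.random() < q_con:
        adj[original].add(copy)
        adj[copy].add(original)
    return original, copy
```

Because each shared interaction is removed from at most one of the two nodes, every former neighbour of the original stays connected to the original, to the copy, or to both, which is the complementarity property that distinguishes DMC from the plain D&D model.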
III. NEMO
While, as noted earlier, the D&D model (and, by extension, its various derivatives) will automatically give rise to modular structures, it does so in scenarios of unrestricted growth: no edge deletions are allowed other than those that occur as part of a node duplication, and a node gets deleted only indirectly, if and when its degree is reduced to zero. In that sense, the D&D model, while a generative model, is not an evolutionary model: it can only grow networks, not evolve them while keeping their size within some fixed range. The same is true of its several variants.
Our aim is to produce a generative model that is also an evolutionary model, a model that we can later use for reconstructing the evolutionary history of PPI networks. Under such a model, a network may grow, shrink, or, most commonly, vary in size within some bounded range. Since the dominant growth operator is duplication and since this operator typically adds multiple edges to the network, random (i.e., unrelated to other events) deletion of edges must be fairly common. We conjectured that under such a model, modularity would not necessarily be preserved, simply because, under such a model, the selection of interactions to lose is independent of the modular structure. Since modules appear both necessary to life and quite robust against mutations, a model of evolution of PPI networks that is biased (as nature appears to be) in favor of the survival of modules would need to “know” about the module structure. (From an evolutionary standpoint, mutations that remove interactions within modules would be under negative selection.)
We therefore designed a two-level model, NEMo. In NEMo, a PPI network is represented as a graph, with nodes representing proteins and undirected edges representing interactions between pairs of proteins.
Events in NEMo occur at the lower level and are based on the D&D model, suitably augmented. The main event in a D&D model is node duplication. Node duplication copies an existing node and all of its connections, thereby creating a new node and a collection of new edges; in addition, some of the edges copied as well as some of the new edges created are probabilistically lost as part of the same event. We retain this event in NEMo but allow the newly created node to be connected to an additional node, randomly chosen within the graph. (The loss of edges in the D&D model corresponds to the common evolutionary adaptation that reduces the level of conservation in genes that exist in multiple copies; most of the time the resulting divergence in the gene sequence will lead to a loss of interaction, but it is also possible that it will lead to a gain.) We also add an independent gain or loss event for each node: with low probability, a node can establish a new connection to a previously unconnected node. The higher level is “module-aware” so that evolutionary events can be classified as within a module or between modules. Such a model requires the identification of modules within a network and the extraction and quantification of some high-level attributes that can be used to measure similarity. Methodologies used in much of the work on the identification of functional modules [24]–[26] are not applicable here, as we deal with an anonymous graph, not with annotated proteins, so we use clustering to identify modular structures, with a clustering algorithm that supports node overlap between clusters. (Many proteins have multiple domains and thus naturally interact with very different proteins, and even a single-domain protein can be part of several pathways or modules: hence, we need a similar flexibility in the definition of modules in our model.)
More precisely, events affecting nodes and edges can be classified into four categories: node gain, node loss, edge gain, and edge loss. Node gain occurs exclusively through duplication of an existing node, a duplication that typically also results in both edge gains and edge losses. Node loss removes a randomly chosen node, reflecting such biological events as mutation in transcription factors or pseudogene formation. (As in the D&D models, it is also possible to lose a node through progressive loss of edges until the node has degree zero.) Edge loss (other than edge losses associated with a node duplication) removes a randomly chosen edge and reflects such biological events as domain mutations, structural mutations, subfunctionalization, and the like. Edge gain (other than edge gains associated with a node duplication) connects a previously unconnected pair of nodes and thus reflects many of the same events that can also cause edge loss, such as domain or structural mutations, or progressive neofunctionalization.
The higher level of the model reflects the modular structure and influences the event chain as follows. First, we allow up to one event to occur in each module within the same step. That is, whereas existing models treat the network as one unit and allow a single event at a time, our model treats the network as a collection of subgraphs (modules) and allows up to one event in each subgraph. Multiple events within the same step can more closely model interconnected events: events in two different modules, for instance, can affect the same shared node. Second, we distinguish intramodular events (all four events can be intramodular) from intermodular events (only edge gains and losses can be intermodular), allowing us to use different parameters for the two types. We use this flexibility to introduce a slight bias in favor of intramodular edges over intermodular edges. Finally and crucially, while we automatically place a duplicate node within the same module as the original node, we also periodically recompute the subgraph decomposition, thereby “discovering” changes in the module structure and recording evolutionary events at the module level as module emergence, module disappearance, fusion of modules, and fission of modules. (These module-level events are thus not independently generated, but come into being as a consequence of node- and edge-level events.) Recomputing the modular structure can be done at fixed intervals (in the results presented in the following, the recomputation takes place after one third, two thirds, and all of the steps, for instance) or once the current modular structure has diverged sufficiently from the last recorded one.
In an evolutionary simulation using NEMo, at each step, each module may record no event or one lower-level event; in the latter case, that event may be an intramodular event (node duplication, node loss, edge loss, or intramodular edge gain) or an intermodular event (intermodular edge loss or gain). The parameter controlling the “no event” outcome at each step can be used to allow the simulation of distinct evolutionary rates in different modules, while the parameter controlling intramodular versus intermodular events can be used to introduce a bias in favor of module conservation. (Note that when a node loss occurs, the node is removed from its module, but not from any overlapping module: it is removed entirely from the network only when it is the target of node loss and appears in one module only.) Very small modules can easily disappear as a consequence of just a few node and/or edge losses and are thus somewhat unstable when all modules are assigned the same loss and gain parameter values.
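The per-module event loop described in this section can be sketched in Python as follows. This is a simplified illustration rather than the authors' simulator: the names `nemo_step`, `p_event`, and `p_intra` and their default values are hypothetical, event types are drawn uniformly, the divergence losses that accompany a duplication are omitted for brevity, and periodic reclustering is left out.

```python
import random

def nemo_step(adj, modules, rng, p_event=0.5, p_intra=0.8):
    """One step: at most one low-level event per module (illustrative sketch).

    adj maps integer nodes to neighbour sets; modules is a list of node sets
    that may overlap. p_event and p_intra are hypothetical parameters.
    """
    for module in modules:
        if not module or rng.random() >= p_event:
            continue  # "no event" outcome for this module
        u = rng.choice(sorted(module))
        if rng.random() < p_intra:
            kind = rng.choice(("duplication", "node_loss", "edge_loss", "edge_gain"))
            partners = module - {u}          # intramodular partners
        else:
            kind = rng.choice(("edge_loss", "edge_gain"))
            partners = set(adj) - module     # intermodular partners
        if kind == "duplication":
            new = max(adj) + 1
            adj[new] = set(adj[u])
            for nb in adj[new]:
                adj[nb].add(new)
            module.add(new)  # the copy joins the module of the original
        elif kind == "node_loss":
            module.discard(u)
            # Remove the node from the network only if no module still holds it.
            if not any(u in m for m in modules):
                for nb in adj.pop(u):
                    adj[nb].discard(u)
        elif kind == "edge_loss":
            candidates = sorted(partners & adj[u])
            if candidates:
                v = rng.choice(candidates)
                adj[u].discard(v)
                adj[v].discard(u)
        else:  # edge_gain between previously unconnected nodes
            candidates = sorted(partners - adj[u])
            if candidates:
                v = rng.choice(candidates)
                adj[u].add(v)
                adj[v].add(u)
```

Raising `p_intra` in this sketch plays the role of the bias toward module conservation, and giving each module its own `p_event` would mimic distinct evolutionary rates across modules.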
IV. ASSESSING MODULARITY
In order to evaluate the output of NEMo, we must first quantify significant attributes of PPI networks. The resulting features can then be used to measure the similarity of our generated networks to real networks, as well as the differences between networks generated by our model and networks generated under existing models. While it is sometimes possible to compare two biological networks using a rich set of annotations to produce a network alignment, simulated networks lack such an annotation and thus cannot be meaningfully aligned. Thus, similarity here refers to structural and topological features such as modularity and connectivity: we need to compare networks very different in size and composition and so cannot use tools such as network alignment methods. We thus propose a set of features applicable to all networks; these features are chosen to measure global properties of networks and to quantify aspects of modularity.
Most of the features proposed here are commonly used in the analysis of networks [26], [31]; several are modified so as to provide a level of independence from size (while simulations can be run at all sizes, bacterial PPI networks are necessarily smaller than mammalian PPI networks), and some are measures of importance in other fields that we have repurposed for networks, such as the Gini coefficient. For each network, we compute the number of nodes, the number of edges, and the degree distribution; we also run the clustering algorithms and store the number of clusters as well as the size and composition of each cluster. We then compute these six global measures.
1) Cluster Coefficient (CC): The CC is based on triplets of vertices. A triplet is open if connected with two edges and closed if connected with all three edges. The CC is the ratio of the number of closed triplets to the total number of (open or closed) triplets [32].
2) Graph Density (GD): The density of a graph is the ratio of the actual number of edges to the number of possible edges.
3) Fraction of Edges Inside (FEI): FEI is the fraction of edges contained within modules. We expect it to be high since PPI networks contain highly connected substructures (modules) that have only a few connections to vertices outside the substructure [31], [33], [34].
4) Diameter: The diameter of a graph is the length of the longest shortest path between any pair of vertices in the graph.
5) Shortest Path Mean (SPM): SPM is the mean of all pairwise shortest paths in the graph.
6) Gini Coefficient (Gini): The Gini coefficient, proposed by the Italian statistician C. Gini in 1912, measures distributional inequality, such as the inequality of incomes within a population. If household i has a yearly income of x_i, then the Gini coefficient of a population of n households is given by
$$G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2n \sum_{i=1}^{n} x_i}.$$
We repurposed that definition for our use in studying modularity by defining the “income” of a node as the degree of the node plus the sum of the degrees of its immediate neighbors.
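The six measures listed above can be computed along the following lines. This is a minimal pure-Python sketch, not the authors' tooling: the function names and the adjacency-set representation are assumptions, the diameter is taken as the longest shortest path over connected pairs (the usual graph-theoretic reading), and an edge counts as "inside" when both endpoints share at least one module.

```python
from collections import deque
from itertools import combinations

def graph_density(adj):
    n = len(adj)
    m = sum(len(nbs) for nbs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def cluster_coefficient(adj):
    closed = total = 0
    for u, nbs in adj.items():
        for v, w in combinations(sorted(nbs), 2):
            total += 1             # triplet centred on u
            closed += w in adj[v]  # closed if the third edge exists
    return closed / total if total else 0.0

def fraction_edges_inside(adj, modules):
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    inside = sum(any(e <= m for m in modules) for e in edges)
    return inside / len(edges) if edges else 0.0

def shortest_path_stats(adj):
    """Return (diameter, mean shortest path) over connected pairs, via BFS."""
    lengths = []
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths.extend(d for d in dist.values() if d > 0)
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)

def gini(values):
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff = sum(abs(a - b) for a in values for b in values)
    return diff / (2 * n * total)

def neighbourhood_gini(adj):
    # "Income" of a node: its degree plus the degrees of its neighbours.
    income = [len(adj[u]) + sum(len(adj[v]) for v in adj[u]) for u in adj]
    return gini(income)
```

The double sum in `gini` is quadratic in the number of nodes, which is acceptable for a sketch; for networks with millions of edges a sorted-values formulation would be preferable.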
We considered a number of other measures, such as modularity, as defined by Newman and Girvan [35] to measure the community structure, and betweenness centrality, a measure that originated in sociology [36], as well as several others, such as conformance to the characteristics of small-world networks or scale-free networks, but present only these six here, as others either are closely correlated with them or appear unrelated to modularity, as further discussed in the following.
V. RESULTS ON NATURAL PPI NETWORKS
For the data, we chose to work with model organisms, as they have large numbers of documented high-confidence interactions. We picked the three species with the largest number of such interactions, Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens. Different sources were considered to investigate the discrepancies among the networks in current databases.
One source was the STRING database [10], which aims to provide a global perspective for as many organisms as feasible, tolerating lower quality data and computational predictions. Due to this bias, STRING includes a large number of indirect interactions, which we treat as false positives, since our aim is to evolve a network of direct interactions. Fortunately, STRING stores an evidence score for each interaction to allow elimination of false positive entries by the user. (We thus used both the complete data set and a subset filtered using a high threshold of 900 on the evidence scores.) For other sources, we consulted the manually curated H. sapiens PPI network database HPRD [37] and the experimental setup of the MAGNA++ algorithm [38], which aims at maximizing accuracy in global network alignment: an H. sapiens PPI network of 9141 proteins and 41 456 interactions [39], a high-confidence E. coli PPI network [40] of 1941 proteins with 3989 interactions, and a yeast S. cerevisiae PPI network with 2390 proteins and 161 277 PPIs [41]. We also use the database HitPredict [42], [43], which stores experimentally determined protein–protein interactions with reliability scores; for this database, we also included the network of C. elegans as an additional reference. For clustering (that is, to identify putative modules), we used both ClusterOne and MCL.

TABLE I. GENERAL CHARACTERISTICS OF THE THREE PPI NETWORKS IN VARIOUS DATABASES

TABLE II. VALUES OF OUR MEASURES FOR THE REFERENCE PPI NETWORKS IN VARIOUS DATABASES
Table I provides a brief description of these PPI networks in various databases and versions, while Table II shows the values of our various measures on these networks. In Tables I and II, S stands for STRING’s complete data set, S900 for the filtered data set with confidence score >900, H for HPRD, M for MAGNA++, and P for HitPredict. The different clustering results yield different values for the FEIs.
The very large differences in size among the databases for the same network are striking: the STRING database has well over four million edges for the human PPI network, whereas the HPRD database has fewer than 40 000, or less than 1% of the number in STRING. This large discrepancy underlines the difference in philosophy between various PPI databases and illustrates why testing models or inferences against databases must be done with great care. For instance, simply clustering the graph has pitfalls, as shown in the number of clusters found by the same algorithm for E. coli on the various databases, going from 16 clusters among 4145 nodes in STRING to 1151 clusters among 3351 nodes in HitPredict, values that again differ by around two orders of magnitude. The graphs themselves are all sparse (GD is low, even for the relatively denser STRING networks), but some structural differences are clear, although the reason for any such difference is not always clear: differences between the numbers of proteins and interactions stored in the databases, differences between the complexity of the networks, or differences between the organisms’ metabolic needs and lifestyles. The Gini coefficient points to significant inequality of distribution in the degree of one-level neighborhoods (Gini coefficients above 0.6 for income per capita are very rare in today’s world), but the values are quite variable across the databases. The FEI displays one of the more striking differences, being very high for networks in STRING, HPRD, and MAGNA++, but much lower in networks in HitPredict, presumably because HitPredict is good at excluding indirect interactions that simply shortcut paths through transitive closure.

Fig. 1. Degree distribution for the E. coli network in STRING, a complete data set.
We also tested these networks for one of the characteristic attributes of social networks, small-world networks, and scale-free networks, namely, a degree distribution that follows a power law. The conclusion is very clear for the STRING networks: they do not follow a power law, as the plot in Fig. 1 clearly shows; a power law would result in an oblique line, not in the complex curve shown in the figure. It is less clear for the other three databases; in fact, for E. coli, the plot appears to support a hypothesis of an underlying power law, at least in HitPredict, as shown in Fig. 2.
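The "oblique line" criterion corresponds to the fact that a power law P(k) ∝ k^(-γ) is a straight line of slope -γ on log-log axes. A rough numerical counterpart to the visual check is a least-squares fit in log-log space, sketched below. The function name `loglog_fit` is an assumption, and such a fit is only a crude diagnostic, not a rigorous statistical test and not necessarily the procedure used for the figures.

```python
import math
from collections import Counter

def loglog_fit(degrees):
    """Fit a line to (log k, log P(k)); return (slope, r_squared).

    Requires at least two distinct positive degrees. A slope near -gamma
    with r_squared close to 1 is consistent with a power law, but does
    not prove one.
    """
    counts = Counter(d for d in degrees if d > 0)
    n = sum(counts.values())
    ks = sorted(counts)
    xs = [math.log(k) for k in ks]
    ys = [math.log(counts[k] / n) for k in ks]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, r_squared
```

A degree sequence whose plot bends, as described for the STRING networks, would show up here as a low `r_squared` or as residuals with a systematic pattern.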
VI. RESULTS ON SIMULATIONS
A. Simulation Goals and Setup
The goal of our simulations is to verify the ability of NEMo to produce networks with characteristics similar to those of the natural PPI networks and also to compare the networks it produces with those produced without the module-aware level and with those produced by D&D models. In particular, we want to test the ability of NEMo to sustain modules in networks not undergoing growth, but subject only to evolutionary changes, where gain of proteins and interactions is balanced by loss of same. Therefore, we run two distinct series of simulations, one for generation and one for evolution.

Fig. 2. Degree distribution for the E. coli network in HitPredict, a complete data set.

TABLE III. VALUES OF OUR FEATURES FOR THE GENERATED NETWORKS AND THE REFERENCE PPI NETWORKS IN VARIOUS DATABASES
The first series uses both the DMC model [22], perhaps the most commonly used model in the D&D family today, and NEMo to grow networks to fixed sizes. We then compute our features on these networks and compare both types of generated networks with the PPI networks of the model organisms. Since DMC is not module aware, but claimed to generate modular networks [15], whereas NEMo is explicitly module aware, we want to see how well the characteristics of each type of generated network compare with those of the PPI networks of the model organisms.
In the second series of simulations, we use NEMo in steady-state mode (balanced gains and losses) over many steps to evolve networks produced during the first simulation series. Our main intent here is to observe the evolution (mostly in terms of size, edge density, and modules) of the networks. We use parameters for NEMo that give it a slight bias toward growth, mostly to prevent the natural variance of the process from “starving” too many of the networks.

Fig. 3. Evolution of network characteristics under the NEMo model over 600 steps, with reclustering into modules at 200 and 400 steps. Top line shows the total number of edges, second line the number of vertices, third line the number of modules, fourth line the size of the largest module, and bottom line the number of singleton modules. (a) Evolution from a DMC-generated network. (b) Evolution from a NEMo-generated network.
B. Results for Network Generation
We set parameters of our model for simulating growth of the network and compare the resulting networks with those built with the standard DMC model for similar sizes, as well as with the PPI networks from the three model organisms. (In generative mode, NEMo is not just module aware, but also reclusters the network regularly.)
We compute our network features for each of these networks, but report mean values over the set of simulations. Table III shows these means, preceded for convenience by the same features shown for PPI networks (from Table II).
DMC and NEMo both generate networks with features comparable to those observed in the PPI networks collected from HPRD, MAGNA++, and HitPredict, although the significantly lower clustering coefficient of the DMC-generated network (0.05 compared with 0.14 for the NEMo-generated network) indicates a less resolved modular structure. (All PPI networks from databases have larger clustering coefficients than the generated networks, but the size of networks matters in this respect, as does the number of additional indirect interaction edges.) The Gini coefficients of the generated networks are comparable and are considerably smaller than those of the networks from the databases, which is to be expected from a model used in generative mode: the generation gives little time for module-level events such as merging and splitting that contribute to the unequal distribution of neighborhood degrees.
C. Results for Network Evolution
In the second step of our experiments, we test the ability
of NEMo to simulate the evolution of a PPI network (with
roughly balanced node gain and loss rates) while preserving
modularity; we also test how NEMo’s behavior is affected by
its initial condition, using both DMC- and NEMo-generated
networks at time zero. We want to observe the evolution of
the network after a larger number of events, so we (arbitrarily)
choose 600 steps; recall that NEMo allows up to one event
per module at each step, so that the 600 steps can yield
a much larger number of events. Fig. 3 shows the changes
in network size (numbers of edges and vertices) and structure
(numbers of modules) as an initial network is evolved
through 600 steps, with reclustering into modules taking place
after 200 and 400 steps.
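The step schedule described above can be sketched as follows. This is only a toy counter under stated assumptions, not the NEMo implementation: `event_prob` is a hypothetical per-module chance of an event in a step, the number of modules is held fixed, and reclustering is reduced to recording the step at which it would occur.

```python
import random

def run_schedule(n_modules, steps=600, recluster_at=(200, 400),
                 event_prob=0.5, seed=0):
    """Count events when each module has at most one event per step.

    event_prob is a hypothetical placeholder; the actual event rates are
    model parameters that this sketch does not reproduce.
    """
    rng = random.Random(seed)
    events = 0
    reclusterings = []
    for step in range(1, steps + 1):
        # Each module independently experiences zero or one event this step.
        events += sum(1 for _ in range(n_modules) if rng.random() < event_prob)
        if step in recluster_at:
            # Placeholder: a real run would recompute the modules here,
            # which could change n_modules; this sketch only logs the step.
            reclusterings.append(step)
    return events, reclusterings

events, reclusterings = run_schedule(n_modules=50)
print(events, reclusterings)
```

The point of the sketch is the bound: with m modules, s steps can produce up to m × s events, which is why 600 steps can correspond to far more than 600 events.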
The main observation here is that NEMo, when started with
a DMC-generated network [Fig. 3(a)], begins by reconfiguring
the network, reducing its number of vertices by about one-third
over the first hundred steps and replacing edges. It then
moves into much the same mode as depicted in Fig. 3(b),
which shows a steady evolutionary behavior mixed with
a small bias toward growth. The implication is that, while the
DMC-generated network may have a modular structure, that
structure is not really compatible with the type of structure
our two-level model embodies: the module structure built by
DMC is somehow “wrong” and needs to be heavily modified
before the model can enter a stable phase. In particular,
observe that the GD of the DMC-generated network is low
and is swiftly increased by NEMo, while the initial number
of modules is high and is swiftly decreased by NEMo
as a consequence of the removal of many nodes. After the
first 200 steps and the first reclustering of modules, the evolution
follows the same path as that followed immediately
when working from a NEMo-generated initial graph, as seen
in Fig. 3(b). Fig. 3(b) shows variance in the rate of increase
in the number of edges, partly a consequence of the node
duplication process: duplicating a few high-degree nodes in
rapid succession quickly increases the overall degree of the
network, while also increasing the number of high-degree
nodes. Most NEMo simulations show a mixed growth rate
within the 600 simulation steps, indicating that NEMo is
flexible and allows a reshaping and restructuring of a network
while keeping the network size fairly stable. The node–edge
ratio for biological PPI networks (see Table I) shows that
the number of edges is some multiple (larger than 1) of the
number of nodes, but that this multiple is quite variable.
Thus, the flexibility and dynamics that NEMo enables are
important.
The mild generative bias we deliberately introduced into
the evolutionary simulations can be harmlessly removed when
evolving NEMo-generated networks and when evolving, through
larger numbers of steps, a modular structure closer to that of
the PPI networks from the databases.
The module-aware level of NEMo derives its power from its
ability to distinguish intermodular from intramodular events.
However, NEMo uses this power in a minimal way, by assigning
slightly different probabilities to the two classes of
events: in evolutionary terms, it simulates a slightly stronger
negative selection for intramodular interactions than for intermodular
interactions. The distinction between the two classes
of events could be used to a much larger extent, but our
results show that even this minimal intervention, consistent
with a selective pressure to preserve modularity while allowing
modules themselves to adapt, suffices to create a significant
difference in the types of networks produced.
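The distinction between the two classes of events can be illustrated with a minimal sketch. It assumes that each vertex belongs to exactly one module and that an edge-level event is intramodular when both endpoints share a module; the function names, the module assignment, and the rate values (including which of the two is larger) are hypothetical placeholders, not the parameters used in the experiments.

```python
def classify_edge(u, v, module_of):
    """Return 'intra' if both endpoints share a module, else 'inter'."""
    return "intra" if module_of[u] == module_of[v] else "inter"

def event_probability(u, v, module_of, rates):
    """Look up the probability of an edge-level event by edge class."""
    return rates[classify_edge(u, v, module_of)]

# Hypothetical module assignment and event rates (illustrative values only;
# neither the magnitudes nor their ordering are taken from the paper).
module_of = {"a": 0, "b": 0, "c": 1}
rates = {"intra": 0.010, "inter": 0.012}

print(classify_edge("a", "b", module_of))             # intra
print(event_probability("a", "c", module_of, rates))  # 0.012
```

Keeping the class-dependent rates in a single lookup table makes the minimal nature of the intervention explicit: the event layer is unchanged, and only the probability attached to an event depends on the module assignment.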
VII. DISCUSSION AND CONCLUSION
We presented NEMo, a module-aware evolutionary model
for PPI networks. The emphasis of NEMo, compared with
existing models for PPI networks, is on evolution rather than
generation: whereas existing models (and the first layer of
NEMo, which is a variant of existing models) are known to
generate a modular structure when growing networks, we were
interested in a model that would evolve existing networks,
using the same basic set of evolutionary events.
The salient feature of NEMo is a module-aware layer that
sits above the event layer and distinguishes between intermodular
and intramodular events. The awareness is achieved
through periodic recomputation (triggered by sampling and
analysis for drift) of the modular structure. The use to which
this awareness is put is minimal: NEMo simply gives
a slightly higher probability to intramodular events than to
intermodular events, thereby slightly favoring conservation of
modules. The details of the model are broadly adjustable: the
algorithm used to detect modules, the number and nature of
parameters used to control intramodular versus intermodular
events, the features chosen to characterize the network, and
the distance measure used to measure drift in order to decide
when to reevaluate the composition of modules are all flexible.
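The drift-triggered recomputation can be sketched as a simple threshold test on a feature vector. The Euclidean distance, the choice of features, the threshold, and all numeric values below are assumptions made for illustration; as noted above, the feature set and the distance measure are adjustable components of the model.

```python
import math

def drift(reference, current):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, current)))

def needs_reclustering(reference, current, threshold):
    """True when the sampled features have drifted past the threshold."""
    return drift(reference, current) > threshold

# Hypothetical feature vectors, e.g. (clustering coefficient, Gini coefficient),
# recorded at the last reclustering and sampled at the current step.
at_last_clustering = (0.14, 0.30)
sampled_now = (0.10, 0.33)

print(drift(at_last_clustering, sampled_now))
print(needs_reclustering(at_last_clustering, sampled_now, threshold=0.04))
```

With these placeholder values the drift is about 0.05, so a threshold of 0.04 would trigger a recomputation of the modules while a threshold of 0.1 would not.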
Our simulation results show that its second layer enables
NEMo to run through large numbers (as compared to the
size of the network) of evolutionary events, balanced so as to
not affect the expected size of the network, while preserving
the characteristics of its original (growth-derived) modular
structure. To the best of our knowledge, this is the first such
result and it paves the way for phylogenetic analyses as well
as population studies of PPI networks.
As discussed by Makino and McLysaght [9], however,
the number of factors that could affect the evolution of
PPI networks is very large. NEMo captures only a small
subset of these factors, since it works just on the graph
structure and, at the level of individual events, makes the same
independence assumptions as current models. Interdependent
events or hidden underlying events present serious challenges.
Incorporating externally supplied data (in addition to the
network itself) makes sense in a data-rich era, but will require,
for each type of data, further development of the model.
REFERENCES
[1] J. Morris et al., “Affinity purification–mass spectrometry and network analysis to understand protein-protein interactions,” Nature Protocols, vol. 9, no. 11, pp. 2539–2554, 2014.
[2] E. M. Marcotte, I. Xenarios, and D. Eisenberg, “Mining literature for protein–protein interactions,” Bioinformatics, vol. 17, no. 4, pp. 359–363, 2001.
[3] Y. Hao, X. Zhu, M. Huang, and M. Li, “Discovering patterns to extract protein–protein interactions from the literature: Part II,” Bioinformatics, vol. 21, no. 15, pp. 3294–3300, 2005.
[4] A. Abi-Haidar et al., “Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks,” Genome Biol., vol. 9, p. S11, Sep. 2008.
[5] J. Dutkowski and J. Tiuryn, “Phylogeny-guided interaction mapping in seven eukaryotes,” BMC Bioinf., vol. 10, no. 1, p. 393, 2009.
[6] X. Zhang and B. M. E. Moret, “Refining transcriptional regulatory networks using network evolutionary models and gene histories,” Algorithms Mol. Biol., vol. 5, p. 1, Jan. 2010.
[7] X. Zhang and B. M. E. Moret, “Refining regulatory networks through phylogenetic transfer of information,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 9, no. 4, pp. 1032–1045, Jul. 2012.
[8] S. M. E. Sahraeian and B.-J. Yoon, “A network synthesis model for generating protein interaction network families,” PLoS ONE, vol. 7, no. 8, p. e41474, 2012.
[9] T. Makino and A. McLysaght, “Evolutionary analyses of protein interaction networks,” in Biological Data Mining Protein Interaction Networks, X.-L. Li and S.-K. Ng, Eds. Hershey, PA, USA: IGI Global, 2009, pp. 169–181.
[10] D. Szklarczyk et al., “STRING v10: Protein–protein interaction networks, integrated over the tree of life,” Nucl. Acids Res., vol. 43, no. D1, pp. D447–D452, 2015.
[11] T. S. K. Prasad et al., “Human protein reference database–2009 update,” Nucl. Acids Res., vol. 37, pp. D767–D772, Jan. 2009.
[12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[13] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray, “From molecular to modular cell biology,” Nature, vol. 402, no. 6761, pp. C47–C52, 1999.
[14] G. Schlosser and G. Wagner, Modularity in Development and Evolution. Chicago, IL, USA: Univ. Chicago Press, 2004.
[15] R. Solé and S. Valverde, “Spontaneous emergence of modularity in cellular networks,” J. Roy. Soc. Interface, vol. 5, no. 18, pp. 129–133, 2008.
[16] J. Qian, N. M. Luscombe, and M. Gerstein, “Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model,” J. Mol. Biol., vol. 313, no. 4, pp. 673–689, Nov. 2001.
[17] A. Bhan, D. Galas, and T. Dewey, “A duplication growth model of gene expression networks,” Bioinformatics, vol. 18, no. 11, pp. 1486–1493, 2002.
[18] S. Ohno, Evolution by Gene Duplication. Berlin, Germany: Springer-Verlag, 1970.
[19] M. Lynch and J. S. Conery, “The evolutionary fate and consequences of duplicate genes,” Science, vol. 290, no. 5494, pp. 1151–1155, 2000.
[20] M. Middendorf, E. Ziv, and C. H. Wiggins, “Inferring network mechanisms: The Drosophila melanogaster protein interaction network,” Proc. Nat. Acad. Sci. USA, vol. 102, no. 9, pp. 3192–3197, 2005.
[21] S. Navlakha and C. Kingsford, “Network archaeology: Uncovering ancient networks from present-day interactions,” PLoS Comput. Biol., vol. 7, no. 4, p. e1001119, 2011.
[22] A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani, “Global protein function prediction from protein-protein interaction networks,” Nature Biotechnol., vol. 21, no. 6, pp. 697–700, 2003.
[23] R. Solé, R. Pastor-Satorras, E. Smith, and T. Kepler, “A model of large-scale proteome evolution,” Adv. Complex Syst., vol. 5, no. 1, pp. 43–54, 2002.
[24] J. Dutkowski and J. Tiuryn, “Identification of functional modules from conserved ancestral protein–protein interactions,” Bioinformatics, vol. 23, no. 13, pp. i149–i158, 2007.
[25] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, “Identifying functional modules in protein–protein interaction networks: An integrated exact approach,” Bioinformatics, vol. 24, no. 13, pp. i223–i231, 2008.
[26] T. Aittokallio, “Module finding approaches for protein interaction networks,” in Biological Data Mining Protein Interaction Networks, X.-L. Li and S.-K. Ng, Eds. Hershey, PA, USA: IGI Global, 2009, pp. 335–353.
[27] T. Nepusz, H. Yu, and A. Paccanaro, “Detecting overlapping protein complexes in protein-protein interaction networks,” Nature Methods, vol. 9, no. 5, pp. 471–472, 2012.
[28] S. V. Dongen, “Graph clustering by flow simulation,” Ph.D. dissertation, Dept. Math. Comput. Sci., Utrecht Univ., Utrecht, The Netherlands, 2000.
[29] A. Enright, S. Van Dongen, and C. Ouzounis, “An efficient algorithm for large-scale detection of protein families,” Nucl. Acids Res., vol. 30, no. 7, pp. 1575–1584, 2002.
[30] S. van Dongen and C. Abreu-Goodger, “Using MCL to extract clusters from networks,” in Bacterial Molecular Networks (Methods in Molecular Biology), vol. 804, J. van Helden, A. Toussaint, and D. Thieffry, Eds. Berlin, Germany: Springer-Verlag, 2012, pp. 281–295.
[31] A.-L. Barabási and Z. Oltvai, “Network biology: Understanding the cell’s functional organization,” Nature Rev. Genet., vol. 5, pp. 101–113, Feb. 2004.
[32] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge, U.K.: Cambridge Univ. Press, 1994.
[33] A. Wagner, “The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes,” Mol. Biol. Evol., vol. 18, no. 7, pp. 1283–1292, 2001.
[34] Y. Jin, D. Turaev, T. Weinmaier, T. Rattei, and H. A. Makse, “The evolutionary dynamics of protein-protein interaction networks inferred from the reconstruction of ancient networks,” PLoS ONE, vol. 8, no. 3, p. e58134, 2013.
[35] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 69, no. 2, pp. 26113–26127, 2004.
[36] L. C. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, vol. 40, no. 1, pp. 35–41, Mar. 1977.
[37] T. S. K. Prasad et al., “Human protein reference database–2009 update,” Nucl. Acids Res., vol. 37, no. 1, pp. D767–D772, Jan. 2009.
[38] V. Saraph and T. Milenkovic, “MAGNA: Maximizing accuracy in global network alignment,” Bioinformatics, vol. 30, no. 20, pp. 2931–2940, 2014. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/30/20/2931.abstract
[39] P. Radivojac et al., “An integrated approach to inferring gene–disease associations in humans,” Proteins, vol. 72, no. 3, pp. 1030–1037, 2008.
[40] J. M. Peregrín-Alvarez, X. Xiong, C. Su, and J. Parkinson, “The modular organization of protein interactions in Escherichia coli,” PLoS Comput. Biol., vol. 5, no. 10, p. e1000523, 2009.
[41] S. Collins et al., “Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae,” Mol. Cellular Proteomics, vol. 6, no. 3, pp. 439–450, 2007.
[42] Y. López, K. Nakai, and A. Patil, “HitPredict version 4: Comprehensive reliability scoring of physical protein–protein interactions from more than 100 species,” Database, J. Biol. Databases Curation, 2015, doi: 10.1093/database/bav117.
[43] A. Patil, K. Nakai, and H. Nakamura, “HitPredict: A database of quality assessed protein–protein interactions in nine species,” Nucl. Acids Res., vol. 39, no. 1, pp. D744–D749, 2015.
Min Ye received the B.S. and M.S. degrees
in computational biology and bioinformatics
from Saarland University, Germany. She is
currently working toward the Ph.D. degree in
the Laboratory for Computational Biology and
Bioinformatics, École Polytechnique Fédérale
de Lausanne, Lausanne, Switzerland.
Her research focuses on model
and algorithm design for biological
networks and their phylogenetic
reconstruction.
Xiuwei Zhang received the Ph.D. degree in
computer science, under the supervision of Prof.
B. Moret, from the Laboratory for Computational
Biology and Bioinformatics, École Polytechnique
Fédérale de Lausanne, Switzerland.
She was a Postdoctoral Researcher with
the Teichmann Group, European Bioinformatics
Institute, and a Simons Fellow with the Simons
Institute, University of California, Berkeley, CA,
USA. She is currently a Postdoctoral Researcher
with the Yosef Laboratory, UC Berkeley. Her current
research interests include the analysis and application of evolution
of biological data, such as biological networks and protein structures, and
method development and data analysis for RNA-Seq data and single-cell
RNA-Seq data.
Gabriela Clara Racz received the M.S. degree
in applied mathematics from the University of
Zagreb, Croatia. She is currently working toward
the Ph.D. degree with the Earth, Ocean and
Atmosphere Science Department, University of
British Columbia, Canada.
In 2015, she was a Research Intern with the
Laboratory for Computational Biology and Bioinformatics,
École Polytechnique Fédérale de Lausanne,
Switzerland.
Qijia Jiang received the B.S. degree in electrical
engineering and the B.A. degree in statistics from
Rice University, Houston, TX, USA, in 2015. She
is currently working toward the Ph.D. degree with
the Electrical Engineering Department, Stanford
University, Stanford, CA, USA.
Bernard M. E. Moret received the Ph.D. degree
from the University of Tennessee, Knoxville, TN,
USA, in 1980.
He was on the faculty of the Department of
Computer Science, University of New Mexico,
until 2006, serving as the Chairman from 1991
to 1993. He is currently a Professor of Computer
Science and holds the Chair of Bioinformatics,
École Polytechnique Fédérale de Lausanne,
Lausanne, Switzerland. His research interests
are in the area of algorithms and applications,
particularly in computational molecular biology. He founded the ACM
Journal of Experimental Algorithmics in 1995 and served as its Editor-in-Chief
for seven years. Since 2000, he has focused on the development
of models and algorithms for evolutionary genomics, publishing over 100
peer-reviewed articles in the area and founding, in 2001, the annual
Workshop on Algorithms in Bioinformatics.