
Proteomics 2013, 13, 278–290. DOI 10.1002/pmic.201200309

RESEARCH ARTICLE

Clustering and overlapping modules detection in PPI network based on IBFO

Xiujuan Lei1, Shuang Wu1, Liang Ge2 and Aidong Zhang2

1 College of Computer Science, Shaanxi Normal University, Xi’an, P. R. China
2 Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA

Traditional clustering algorithms do not work well on protein–protein interaction networks because of the topological features of these networks. This paper proposes an improved clustering method based on the bacteria foraging optimization (BFO) mechanism and the intuitionistic fuzzy set, referred to as improved BFO (IBFO), in which trigonometric functions are used to define the membership degrees and the indeterminacy degree is introduced to detect overlapping modules. In the chemotactic operation of BFO, the algorithm initializes a cluster center according to the comprehensive network feature value of each node and eliminates isolated points according to the edge-clustering coefficient. In the reproduction operation of BFO, the nodes possessing high membership degrees are merged into the cluster that the cluster center belongs to and are labeled as visited nodes. Meanwhile, the nodes that also have high indeterminacy degrees are visited again when generating another cluster. The elimination–dispersal operation is equivalent to the selection of the next cluster center. Finally, the algorithm merges the clusters having high similarity. The results show that the algorithm not only determines the cluster number automatically and improves the f-measure value of the cluster results, but also successfully identifies the overlaps in protein–protein interaction networks.

Keywords:

Bacteria foraging optimization / Bioinformatics / Indeterminacy degree / Overlap / Protein–protein interaction networks

Received: July 25, 2012

Revised: September 19, 2012

Accepted: October 11, 2012

1 Introduction

In the postgenomic era, the research focus of biological science has gradually shifted from genomics to proteomics. Recently, the rapid development of proteomics and the explosion of protein–protein interaction (PPI) datasets have drawn more and more researchers to investigate PPI networks in order to predict the functions of unknown proteins. Research on PPI networks contributes to predicting the functions of unknown proteins at the molecular level, further uncovering regularities of cellular activities such

Correspondence: Dr. Xiujuan Lei, College of Computer Science, Shaanxi Normal University, Xi’an, Shaanxi Province, 710062, P. R. China
E-mail: [email protected]: +86 29 85310161

Abbreviations: BFO, bacteria foraging optimization; CNFV, comprehensive network feature value; PPI, protein–protein interaction

as growth, development, and metabolism. In addition, it is extremely helpful in the diagnosis of major diseases and the intensive study of therapies, and it stimulates the development of biology, medicine, bioinformatics, and related fields.

PPI networks share the small-world feature [1], which is characterized by high clustering coefficients. In addition, PPI networks are scale-free [2], which suggests an important topological feature of PPI networks, namely modularity [3]. It is therefore natural to use clustering methods to predict functional modules. However, traditional clustering methods such as hierarchical clustering, density-based methods, and fuzzy clustering algorithms [4,5] either require prior knowledge of the cluster number or are sensitive to noisy data. Nabieva et al. [6] first put forward the functional-flow model to explore the underlying structure of PPI networks. The experimental results showed that the method performed well. However, the running time

Colour Online: See the article online to view Figs. 3, 4 and 7–10 in colour.

© 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com


was relatively high. Cho et al. [7] proposed another flow-based modularization algorithm to predict overlapping functional modules in a weighted graph, but the f-measure [8] value of the clustering result was relatively low. Kenley et al. [9] proposed a novel information-theoretic definition, graph entropy, as a measure of the structural complexity of a graph. The results showed that the approach had higher accuracy in predicting protein complexes. Recently, more and more intelligent optimization algorithms have found broad application in many research fields. Goel et al. [10] presented software for the generation and analysis of dynamic, four-dimensional PPI networks. Our previous research [11–13] successfully applied the quantum-behaved particle swarm optimization algorithm and the artificial bee colony algorithm to optimize the functional-flow model, and also used joint-strength-based ant colony optimization to cluster functional modules; the results showed that these methods performed well in identifying functional modules.

Following the rapid development of intelligent algorithms, Passino [14] presented the bacteria foraging optimization (BFO) algorithm, inspired by the social foraging behavior of Escherichia coli. The BFO algorithm has drawn the attention of researchers from various fields such as harmonic estimation, transmission loss reduction, and machine learning. In order to explore the searching ability of the BFO algorithm, several researchers integrated it with other intelligent methods [15] such as the genetic algorithm and the particle swarm optimization algorithm. We then designed a new model that takes advantage of the BFO algorithm to cluster PPI networks [16]. The initial positions of the bacteria are treated as cluster centers, and the cluster modules are generated during the reproduction and elimination–dispersal operations. The experimental results showed that the method could effectively improve the accuracy of the cluster results. However, the recall value was low and the algorithm ignored the fact that a protein may belong to two or more clusters. Fuzzy methods [17], introduced in 1965, have since been adopted in cluster analysis. Atanassov [18] extended the theory of fuzzy sets and proposed the intuitionistic fuzzy set, which includes the concepts of membership degree, nonmembership degree, and indeterminacy degree. In this paper, we use the indeterminacy degree of protein nodes to take the overlapping functional modules of PPI networks into consideration in order to improve the recall value of the cluster results.

In this paper, we propose a novel model and algorithm that uses the BFO mechanism and the intuitionistic fuzzy set to detect the overlapping functional modules of PPI networks and to determine the cluster number automatically. Section 2.1 briefly presents the principle of the BFO algorithm, the definitions of the intuitionistic fuzzy set, several graph concepts, and the evaluation criteria of PPI network clustering. Section 2.2 describes the model design and implementation steps of the improved algorithm, which integrates the intuitionistic fuzzy set into the mechanism of the BFO algorithm. In Section 3, we execute the algorithm and compare it with the BFO algorithm of ref. [16] and the functional-flow algorithm [7]. The experimental

results show that the algorithm is superior to the BFO algorithm in f-measure value.

2 Materials and methods

2.1 Basic principles and concepts

2.1.1 Principle of BFO algorithm

The BFO algorithm [19] is an intelligent algorithm that has been widely accepted as a global optimization searching method. It is composed of the chemotactic operation, the reproduction operation, and the elimination–dispersal operation.

2.1.1.1 Chemotactic operation

Figure 1(A) shows that the bacterium moves in two different ways, swimming and tumbling, by means of a set of tensile flagella. When all the flagella rotate clockwise, each flagellum operates relatively independently of the others, and the bacterium tumbles. When all the flagella rotate counterclockwise, they push the bacterium so that it moves in one direction at a very fast rate. Figure 1(B) shows the swimming and tumbling behavior of the bacterium in a neutral medium. Figure 1(C) shows a nutrient concentration gradient: the darker the shade, the higher the nutrient concentration. The bacterium alternately swims and tumbles in order to move up the nutrient gradient and avoid noxious environments. The BFO algorithm models this phenomenon as the chemotactic behavior, which largely broadens the local exploring ability of the algorithm.

2.1.1.2 Reproduction operation

In general, bacteria that become incapable of finding food during the chemotactic behavior are eliminated. In order to maintain the population size, the remaining bacteria replicate themselves and generate new individuals. To improve the global convergence speed and efficiency, the BFO algorithm generally selects the bacteria that rank in the top half of the population to reproduce and generate new individuals that are identical to the original bacteria.

2.1.1.3 Elimination–dispersal operation

Owing to sudden changes in the local environment, the bacteria population may gradually become unsuited to the environment, so that a group of bacteria is either killed or dispersed to a new location. This phenomenon is simulated by the elimination–dispersal operation, which is normally executed with a certain probability. If a bacterium satisfies the probability of the elimination–dispersal operation, it dies and the algorithm generates a new individual at a random position in the feasible solution space. The elimination–dispersal operation enhances the random



Figure 1. The foraging behavior of bacteria [20].

searching ability of the BFO algorithm, maintains the diversity of the population, and avoids premature convergence.

Define a chemotactic step to be a tumble followed by a tumble, or a tumble followed by a run. Let j be the index of the chemotactic step, k the index of the reproduction step, and l the index of the elimination–dispersal event. The positions of the S bacteria in the population at the jth chemotactic step, kth reproduction step, and lth elimination–dispersal event are represented as follows:

$$ P(j, k, l) = \{ x_i(j, k, l) \mid i = 1, 2, \ldots, S \}. $$

2.1.2 Fuzzy set

In the typical setting, clusters are nonoverlapping. However, PPI networks contain many overlapping modules. Therefore, we adopt the fuzzy concept that each node belongs to certain clusters with a probability between 0 and 1.

Definition 1. The fuzzy set A in the domain X is defined as follows [17]:

$$ A = \{ (x, A(x)) \mid x \in X \}, $$

where A(x) is the membership function. The fuzzy set A satisfies the requirement A: X → M, where M is the membership space. The membership function A(x) denotes the membership degree, or the probability, that the element x belongs to the fuzzy set A. Therefore, each element (x, A(x)) in the fuzzy set A expresses the membership degree of the element x.

Definition 2. Let a set X be fixed. An intuitionistic fuzzy set [21] is an object having the form

$$ B = \{ \langle x, \mu_B(x), \nu_B(x) \rangle \mid x \in X \}, $$

where the functions $\mu_B(x)$ and $\nu_B(x)$ define the membership degree and the nonmembership degree of the element x with respect to the set B, respectively. In addition, for each element x, the sum of $\mu_B(x)$ and $\nu_B(x)$ lies between 0 and 1. The indeterminacy degree of the element x is equal to $1 - \mu_B(x) - \nu_B(x)$.
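As a small illustration of Definition 2, the following sketch computes the indeterminacy degree from a given membership and nonmembership degree. The function name `indeterminacy` and the sample values are illustrative assumptions and do not come from the original method.

```python
def indeterminacy(mu: float, nu: float) -> float:
    """Indeterminacy degree 1 - mu - nu of an intuitionistic fuzzy element."""
    if not (0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0):
        raise ValueError("mu and nu must each lie in [0, 1]")
    if mu + nu > 1.0:
        raise ValueError("Definition 2 requires mu + nu <= 1")
    return 1.0 - mu - nu


# Hypothetical element: membership 0.6, nonmembership 0.3
example_pi = indeterminacy(0.6, 0.3)
print(round(example_pi, 2))
```

An element with membership 1 and nonmembership 0 has no indeterminacy left, which is the crisp-set limit of the definition.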

2.1.3 Relevant concepts of graph

In an undirected graph, the degree of a node is the number of its direct neighbors. For weighted graphs, the weighted degree of a node i [8] is the sum of the weights of the edges between node i and its neighbors,

$$ w(i) = \sum_{j \in N(i)} w(i, j). \tag{1} $$

For a node i, let $k_i$ denote the node degree and $n_i$ the number of edges linking the neighbors of node i with each other. The node clustering coefficient is calculated as follows [8]:

$$ C_i = \frac{2 n_i}{k_i (k_i - 1)}. \tag{2} $$
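Eqs. (1) and (2) can be sketched in a few lines of Python. The dictionary-of-dictionaries graph representation, the unit edge weights, and the convention that the clustering coefficient is 0 for nodes with fewer than two neighbors are assumptions made for this illustration; the edge list mirrors the small graph of Fig. 2.

```python
from itertools import combinations

# Hypothetical weighted undirected graph: graph[u][v] = w(u, v)
graph = {
    "a": {"b": 1.0, "d": 1.0},
    "b": {"a": 1.0, "d": 1.0},
    "c": {"d": 1.0},
    "d": {"a": 1.0, "b": 1.0, "c": 1.0, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0},
    "f": {"d": 1.0},
}


def weighted_degree(g, i):
    """Eq. (1): sum of the weights of the edges incident to node i."""
    return sum(g[i].values())


def clustering_coefficient(g, i):
    """Eq. (2): C_i = 2 n_i / (k_i (k_i - 1))."""
    neighbors = list(g[i])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: the text does not define C_i for k_i < 2
    # n_i: number of edges among the neighbors of node i
    n_i = sum(1 for u, v in combinations(neighbors, 2) if v in g[u])
    return 2.0 * n_i / (k * (k - 1))


print(weighted_degree(graph, "d"), clustering_coefficient(graph, "d"))
```

In this toy graph the hub node d has a large weighted degree but a small clustering coefficient, because only one pair of its five neighbors is connected.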

By analogy with $C_i$, the edge clustering coefficient is defined as the ratio of the number of triangles containing the edge to the number of all possible triangles including this edge. In a weighted graph, the edge clustering coefficient $WE_{i,j}$ [22] is calculated as follows:

$$ WE_{i,j} = \frac{\sum_{k \in I_{i,j}} w(i, k) \cdot \sum_{k \in I_{i,j}} w(j, k)}{\sum_{s \in N_i} w(i, s) \cdot \sum_{t \in N_j} w(j, t)}. \tag{3} $$

In Eq. (3), the sets $N_i$ and $N_j$ are the sets of nodes adjacent to node i and node j, respectively, and w(i, s) is the weight of the edge linking node i with node s. The set $I_{i,j}$ is the set of common neighbors of nodes i and j; note that $I_{i,j} = N_i \cap N_j$. The weighted edge clustering coefficient is thus the ratio between the product of the weight sums of the edges connecting the two nodes (i, j) with their common neighbors (k) and the product of the weight sums of the edges linking the two nodes (i, j) with all of their respective neighbors (s, t). The edge clustering coefficient is not sensitive to false-positive data. Therefore, it is well suited to large-scale PPI data containing many false positives.
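A minimal sketch of Eq. (3) follows. The graph representation, the unit weights, and the choice to return 0 when the denominator vanishes are assumptions for illustration only.

```python
# Hypothetical weighted undirected graph: graph[u][v] = w(u, v)
graph = {
    "a": {"b": 1.0, "d": 1.0},
    "b": {"a": 1.0, "d": 1.0},
    "c": {"d": 1.0},
    "d": {"a": 1.0, "b": 1.0, "c": 1.0, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0},
    "f": {"d": 1.0},
}


def edge_clustering_coefficient(g, i, j):
    """Eq. (3): weighted edge clustering coefficient WE_{i,j}."""
    common = set(g[i]) & set(g[j])  # I_{i,j} = N_i ∩ N_j
    numerator = sum(g[i][k] for k in common) * sum(g[j][k] for k in common)
    denominator = sum(g[i].values()) * sum(g[j].values())
    # Assumption: treat the coefficient as 0 if either node has no edges
    return numerator / denominator if denominator else 0.0


print(edge_clustering_coefficient(graph, "a", "b"))
print(edge_clustering_coefficient(graph, "c", "d"))
```

The edge (c, d) has no common neighbors, so its coefficient is 0; a node whose incident edges all have coefficient 0 is the kind of node the method later treats as isolated.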

2.1.4 Objective function

The density of a subgraph s is defined by the following equation:

$$ D(s) = \frac{2e}{n(n - 1)}, \tag{4} $$

where n is the number of nodes and e is the number of edges connecting protein nodes with each other in the subgraph s. For weighted PPI networks, each cluster can be assessed according to Eq. (5):

$$ WD(s) = \frac{2 \sum_{i=1}^{n-1} \sum_{j=2}^{n} WE_{i,j}}{n(n - 1)}, \tag{5} $$

where i and j index the ith and jth protein nodes in the subgraph s, respectively, and $WE_{i,j}$ is the edge clustering coefficient of the edge connecting the ith node with the jth node. The value WD(s) thus reflects the average edge clustering coefficient over the protein nodes in cluster s. If the number of obtained clusters is denoted numclu, the objective function of a set of clusters C is calculated as follows:

$$ \mathit{fun}(C) = \frac{1}{\mathit{numclu}} \sum_{k=1}^{\mathit{numclu}} WD(s)_k. \tag{6} $$
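The objective function can be sketched as below. This is one reading of Eq. (5), in which every unordered node pair of the cluster is counted once and pairs without an edge contribute 0; that reading, the `we` lookup table, and the sample values are assumptions for illustration.

```python
from itertools import combinations


def weighted_density(nodes, we):
    """Eq. (5), read as a sum over unordered node pairs of the cluster.

    `we` maps frozenset({i, j}) to the edge clustering coefficient WE_{i,j};
    missing pairs are assumed to contribute 0.
    """
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(we.get(frozenset(pair), 0.0) for pair in combinations(nodes, 2))
    return 2.0 * total / (n * (n - 1))


def objective(clusters, we):
    """Eq. (6): mean weighted density over all clusters."""
    if not clusters:
        return 0.0
    return sum(weighted_density(c, we) for c in clusters) / len(clusters)


# Hypothetical edge clustering coefficients and clusters
we = {
    frozenset(("a", "b")): 0.25,
    frozenset(("a", "d")): 0.1,
    frozenset(("b", "d")): 0.1,
}
clusters = [["a", "b", "d"], ["c", "d"]]
print(objective(clusters, we))
```

A set of clusters scores higher when its members are joined by edges with high edge clustering coefficients, which is what the fitness comparison in the implementation steps relies on.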

2.1.5 Evaluation criteria of cluster results

Precision, recall, and p-value are usually adopted to evaluate clustering results [8]. Suppose that X is one cluster in the cluster results and $F_i$ is the matched cluster in the standard PPI dataset. Then

$$ \mathrm{precision}(X, F_i) = \frac{|X \cap F_i|}{|X|}, \tag{7} $$

$$ \mathrm{recall}(X, F_i) = \frac{|X \cap F_i|}{|F_i|}, \tag{8} $$

where $|X \cap F_i|$ is the number of proteins common to clusters X and $F_i$. However, both criteria are biased with respect to cluster size.

Therefore, in order to balance the precision and recall values, we define the f-measure value as follows:

$$ f\text{-measure} = \frac{2}{\dfrac{1}{\mathrm{precision}} + \dfrac{1}{\mathrm{recall}}}. \tag{9} $$
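Eqs. (7) to (9) translate directly into set operations. In the sketch below, the protein identifiers are hypothetical, and returning 0 when either precision or recall is 0 is an assumption, since Eq. (9) is undefined in that case.

```python
def precision(x, f):
    """Eq. (7): fraction of the predicted cluster x found in the reference f."""
    return len(x & f) / len(x)


def recall(x, f):
    """Eq. (8): fraction of the reference cluster f recovered by x."""
    return len(x & f) / len(f)


def f_measure(x, f):
    """Eq. (9): harmonic mean of precision and recall."""
    p, r = precision(x, f), recall(x, f)
    if p == 0.0 or r == 0.0:
        return 0.0  # assumption: Eq. (9) is undefined here
    return 2.0 / (1.0 / p + 1.0 / r)


# Hypothetical predicted cluster and matched reference complex
predicted = {"p1", "p2", "p3", "p4"}
reference = {"p2", "p3", "p4", "p5", "p6"}
print(precision(predicted, reference), recall(predicted, reference))
print(f_measure(predicted, reference))
```

Here three of the four predicted proteins are correct, while three of the five reference proteins are recovered, so the f-measure lies between the two values.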

In a PPI network, protein modules can be statistically evaluated using the p-value from the hypergeometric distribution, which is defined as

$$ P = 1 - \sum_{i=0}^{k-1} \frac{\binom{|X|}{i} \binom{|V| - |X|}{n - i}}{\binom{|V|}{n}}, \tag{10} $$

where |V| is the total number of proteins, |X| is the number of proteins in a reference function, n is the number of proteins in an identified module, and k is the number of proteins in common between the function and the module.
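Eq. (10) can be evaluated exactly with integer binomial coefficients from the Python standard library. The function and argument names below, as well as the small numbers in the example, are illustrative assumptions.

```python
from math import comb


def hypergeometric_p_value(total, in_function, module_size, common):
    """Eq. (10): probability of at least `common` shared proteins by chance.

    total       -- |V|, total number of proteins
    in_function -- |X|, proteins annotated with the reference function
    module_size -- n, proteins in the identified module
    common      -- k, proteins shared by the function and the module
    """
    denominator = comb(total, module_size)
    below = sum(
        comb(in_function, i) * comb(total - in_function, module_size - i)
        for i in range(common)
    )
    return 1.0 - below / denominator


# Hypothetical toy numbers: 10 proteins, 4 annotated, module of 3, 2 in common
print(hypergeometric_p_value(10, 4, 3, 2))
```

A smaller p-value indicates that the overlap between the module and the reference function is less likely to have arisen by chance.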

2.2 Methods

2.2.1 Data preprocessing

2.2.1.1 Calculation of distance among protein nodes

In PPI networks, each protein name can be mapped to a positive integer and the data converted into an adjacency matrix P. Assume that the number of protein nodes is n and that $X_i$ represents the ith protein, denoted $X_i = (P_{i1}, P_{i2}, \ldots, P_{in})$. The inner product of two protein nodes is calculated as $X_{ij} = X_i \cdot X_j = P_{i1} P_{j1} + P_{i2} P_{j2} + \cdots + P_{in} P_{jn}$. The similarity between nodes i and j is defined as follows [23]:

$$ S_{ij} = \frac{\sum_{k=1}^{n} \min(X_{ik}, X_{jk})}{\sum_{k=1}^{n} \max(X_{ik}, X_{jk})}. \tag{11} $$

A protein node not only interacts with adjacent protein nodes, but also interacts with other protein nodes via one or several intermediate protein nodes.

As Fig. 2 shows, the protein node a can be denoted as $X_a = (p_{aa}, p_{ab}, p_{ac}, p_{ad}, p_{ae}, p_{af}) = (0, 1, 0, 1, 0, 0)$. Similarly, $X_b = (1, 0, 0, 1, 0, 0)$, $X_c = (0, 0, 0, 1, 0, 0)$, $X_d = (1, 1, 1, 0, 1, 1)$, $X_e = (0, 0, 0, 1, 0, 0)$, and $X_f = (0, 0, 0, 1, 0, 0)$. The associated matrix is obtained as follows:

$$ X = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix} $$



Figure 2. A sketch subgraph of PPI network.

Then, Xaa = Xa • (Xa)T = (0, 1, 0, 1, 0, 0) • (0, 1, 0, 1, 0, 0)T = 2, Xab = Xa • (Xb)T = (0, 1, 0, 1, 0, 0) • (1, 0, 0, 1, 0, 0)T = 1, and Xbb = Xb • (Xb)T = (1, 0, 0, 1, 0, 0) • (1, 0, 0, 1, 0, 0)T = 2. In the same way, Xac = 1, Xbc = 1, Xad = 1, Xbd = 1, Xae = 1, Xbe = 1, Xaf = 1, and Xbf = 1. Therefore, the similarity between nodes a and b is Sab = (1 + 1 + 1 + 1 + 1 + 1)/(2 + 2 + 1 + 1 + 1 + 1) = 6/8 = 0.75. The higher the similarity between two protein nodes, the shorter the distance between them. So, the distance between the two protein nodes a and b is defined as dab = 1 – Sab = 0.25.
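The worked example for nodes a and b can be reproduced with a short script. The variable and function names are illustrative assumptions; the adjacency matrix is the one given for Fig. 2, and the similarity follows Eq. (11) applied to the rows of the inner-product matrix.

```python
# Adjacency matrix of the Fig. 2 subgraph, node order a, b, c, d, e, f
P = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]


def inner_products(adj):
    """X_ij: inner product of adjacency rows i and j."""
    n = len(adj)
    return [
        [sum(adj[i][k] * adj[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def similarity(x, i, j):
    """Eq. (11): sum of element-wise minima over sum of element-wise maxima."""
    num = sum(min(p, q) for p, q in zip(x[i], x[j]))
    den = sum(max(p, q) for p, q in zip(x[i], x[j]))
    return num / den if den else 0.0


X = inner_products(P)
s_ab = similarity(X, 0, 1)
d_ab = 1 - s_ab
print(X[0], X[1])   # [2, 1, 1, 1, 1, 1] [1, 2, 1, 1, 1, 1]
print(s_ab, d_ab)   # 0.75 0.25
```

The two rows of inner products differ only in their first two entries, which is why the minima sum to 6 and the maxima to 8.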

The similarity between two protein nodes can be calculated according to Eq. (11), while the similarity between a module $M_i$ and a different module $M_j$ is measured [7] by Eq. (12):

$$ S(M_i, M_j) = \frac{\sum_{x \in M_i,\, y \in M_j} c(x, y)}{\min(|M_i|, |M_j|)}, \tag{12} $$

where

$$ c(x, y) = \begin{cases} 1 & \text{if } x = y, \\ w(x, y) & \text{if } x \neq y \text{ and } \langle x, y \rangle \in E, \\ 0 & \text{otherwise.} \end{cases} \tag{13} $$
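Eqs. (12) and (13) can be sketched as follows. The edge-weight lookup keyed by unordered node pairs, the handling of empty modules, and the sample weights are assumptions made for this illustration.

```python
def module_similarity(m_i, m_j, weights):
    """Eq. (12): similarity between modules m_i and m_j.

    `weights` maps frozenset({x, y}) to w(x, y) for every edge <x, y> in E.
    """
    if not m_i or not m_j:
        return 0.0  # assumption: empty modules are treated as dissimilar

    def c(x, y):
        # Eq. (13): 1 for a shared node, the edge weight for linked nodes, else 0
        if x == y:
            return 1.0
        return weights.get(frozenset((x, y)), 0.0)

    total = sum(c(x, y) for x in m_i for y in m_j)
    return total / min(len(m_i), len(m_j))


# Hypothetical edge weights and two disjoint modules
weights = {
    frozenset(("a", "b")): 0.4,
    frozenset(("a", "d")): 0.2,
    frozenset(("b", "d")): 0.6,
    frozenset(("c", "d")): 0.5,
}
print(module_similarity({"a", "b"}, {"c", "d"}, weights))
```

Shared nodes contribute 1 each, so heavily overlapping modules receive high similarity values and become candidates for merging.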

2.2.1.2 Initialization of cluster center

The node clustering coefficient merely measures the connection density and strength among the nodes in the local neighborhood of a node. The comprehensive network feature value (CNFV) of a node [8] additionally reflects the joint strength between this node and other nodes. The CNFV of node i is defined as follows:

$$ CNFV_i = \beta \times C_i + (1 - \beta) \times w(i)/n. \tag{14} $$

The parameter $\beta$ is a random number between 0 and 1, and n is the number of protein nodes in the PPI network.
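The following sketch evaluates Eq. (14) for a few nodes and picks the node with the highest CNFV as a candidate cluster center. The clustering coefficients and weighted degrees are hypothetical inputs, and the weighting parameter is fixed at 0.5 for reproducibility, whereas the text draws it at random between 0 and 1.

```python
def cnfv(c_i, w_i, n, beta):
    """Eq. (14): comprehensive network feature value of one node."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * c_i + (1.0 - beta) * w_i / n


# Hypothetical per-node clustering coefficients C_i and weighted degrees w(i)
coefficients = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 0.1}
degrees = {"a": 2.0, "b": 2.0, "c": 1.0, "d": 5.0}
n_nodes = 6      # number of protein nodes in the toy network
beta = 0.5       # assumption: fixed here, random in the original description

scores = {v: cnfv(coefficients[v], degrees[v], n_nodes, beta) for v in coefficients}
center = max(sorted(scores), key=scores.get)
print(scores)
print(center)
```

Sorting the node names before taking the maximum only makes tie-breaking deterministic in this toy example.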

2.2.2 Determination of membership function and nonmembership function

The concepts of membership degree $\mu$ and nonmembership degree $\nu$ play an important role in the clustering procedure, and it is essential to construct appropriate functions to calculate the membership degree and nonmembership degree between protein nodes. If two protein nodes are close, the possibility that they can be grouped into one cluster is high, and the membership degree between the two nodes is close to 1. As the distance increases, the membership degree gradually decreases. The relationship between the distance between protein nodes and the membership degree can be roughly described as in Fig. 3(A).

Figure 3. The relationships among membership degree, nonmembership degree, and distance of protein nodes.



The membership function is obtained according to the corresponding relationship between the membership degree and the distance between two protein nodes:

$$ \mu_{ij} = \begin{cases} 1 & 0 \le d_{ij} < 0.1, \\ \cos\left( \dfrac{d_{ij} \cdot \pi}{2} \right) & 0.1 \le d_{ij} < 0.9, \\ 0 & 0.9 \le d_{ij} \le 1. \end{cases} \tag{15} $$

Conversely, when two protein nodes are relatively close, they can be merged into one cluster, so the nonmembership degree between the two nodes is considered to be close to 0. As the distance increases between 0 and 1, it becomes less likely that the two nodes belong to the same cluster, and the nonmembership degree rises accordingly. The relation between the nonmembership degree and the distance between two protein nodes is illustrated in Fig. 3(B). Similarly, the nonmembership degree is calculated as follows:

$$ \nu_{ij} = \begin{cases} 0 & 0 \le d_{ij} < 0.1, \\ \sin\left( \dfrac{d_{ij} \cdot \pi}{2} \right) & 0.1 \le d_{ij} < 0.9, \\ 1 & 0.9 \le d_{ij} \le 1. \end{cases} \tag{16} $$
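The two piecewise functions of Eqs. (15) and (16) can be written down directly. The function names are illustrative assumptions, and the distance used in the example is the value 0.25 obtained for nodes a and b in the worked example of Section 2.2.1.1.

```python
import math


def membership(d):
    """Eq. (15): membership degree as a function of the distance d in [0, 1]."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    if d < 0.1:
        return 1.0
    if d < 0.9:
        return math.cos(d * math.pi / 2)
    return 0.0


def nonmembership(d):
    """Eq. (16): nonmembership degree as a function of the distance d in [0, 1]."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    if d < 0.1:
        return 0.0
    if d < 0.9:
        return math.sin(d * math.pi / 2)
    return 1.0


d_ab = 0.25  # distance between nodes a and b from the worked example
print(membership(d_ab), nonmembership(d_ab))
```

Close nodes get a membership degree near 1 and a nonmembership degree near 0, and the two values swap roles as the distance approaches 1.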

2.2.3 Model design of improved algorithm

The BFO algorithm in ref. [16] mainly mapped three behaviors of bacteria, namely the chemotactic, reproduction, and elimination–dispersal operations, onto the problem of clustering PPI networks. However, the recall value of the cluster results is relatively low. This is because a PPI network, unlike other complex networks, has small-world and scale-free characteristics: a large number of proteins have few interactions with other proteins and are abandoned in the clustering procedure.

In fact, a protein (black node in Fig. 4) in a real PPI network can be included in several different protein complexes to perform different functions, i.e., protein functional modules overlap with each other, as Fig. 4 shows. Naturally, the concepts of membership degree and indeterminacy degree in the intuitionistic fuzzy set can be introduced to detect the overlapping modules. This paper proposes an improved BFO clustering algorithm based on the intuitionistic fuzzy set.

We adopt the principle of improved bacteria foraging optimization (IBFO) to cluster PPI networks. During the clustering procedure, one bacterium is regarded as one protein node; the correspondence between the IBFO mechanism and PPI network clustering is listed in Table 1.

The clustering model based on IBFO is shown in Fig. 5. Figure 6 is the flow chart of the clustering method based on IBFO.

Figure 4. The overlap of protein functional modules.

Table 1. Correspondence between the IBFO mechanism and PPI network clustering

| BFO algorithm | The problem of clustering PPI networks |
| --- | --- |
| Bacterium | Protein node |
| Chemotactic operation | Eliminate isolated nodes according to the edge-clustering coefficient |
| Reproduction operation | Merge nodes into the cluster that the cluster center belongs to |
| Membership degree | Nodes that possess a high membership degree are grouped into the cluster |
| Indeterminacy degree | A node whose indeterminacy degree is lower than the given threshold is merged into this cluster. Otherwise, merge the node into the cluster and mark it as an unvisited node |
| Elimination–dispersal operation | Randomly choose one node as the new cluster center |
| Food distribution function | Average value of the edge-clustering coefficient in each cluster |

Here, flag is the parameter controlling the chemotactic operation, and Nre and Ned are the parameters of the reproduction operation and the elimination–dispersal operation, respectively. The counters of the reproduction operation and the elimination–dispersal operation are denoted k and l. In the preliminary stage, k and l are both set to 0.

First, in the chemotactic operation, the algorithm uses the edge clustering coefficient to eliminate isolated protein nodes. For each protein node i in the PPI network, the algorithm calculates the sum of the edge clustering coefficients of the edges connecting node i with other protein nodes in the PPI network. If the sum is zero, node i is regarded as an isolated node and abandoned. Then, the algorithm chooses a protein node that has a high CNFV as the initial cluster center.



Figure 5. The clustering model design based on IBFO.

In the reproduction period, the algorithm searches for the protein nodes that have high membership degrees with respect to the cluster center and merges them into the cluster that the cluster center belongs to. It calculates the membership degree, nonmembership degree, and indeterminacy degree between the cluster center j and each node i among the unvisited nodes of the PPI network based on Eqs. (15) and (16). If the membership degree $\mu_{ij}$ of node i is higher than the membership threshold (T1) and the indeterminacy degree $\pi_{ij}$ is lower than the indeterminacy threshold (T3), the protein node i is assigned to the cluster that the cluster center j belongs to and marked as a visited protein node (the nonmembership threshold (T2) has no effect on the clustering and is not considered). Conversely, if the indeterminacy degree $\pi_{ij}$ is higher than the indeterminacy threshold, node i is not only grouped into the cluster of the cluster center j, but is also labeled as a fuzzy node that will be visited again and may be merged into other clusters. The procedure continues until all nodes in the PPI network have been evaluated. One cluster is then obtained.

In the elimination–dispersal phase, several bacteria die and the population produces new individuals. This operation corresponds to selecting a new protein node as the next cluster center according to the CNFV of the nodes. The algorithm then generates the next cluster through the reproduction operation described above. If the cluster number is larger than 3, the algorithm calculates the similarity between any two cluster modules according to Eqs. (12) and (13) and then merges the clusters whose similarity is higher than the given threshold.

2.2.4 Implementation steps of IBFO algorithm

The specific implementation steps are as follows:

Procedure Initialization

Assign values to the parameters: set the index of the external loop iter = 1 and the maximal number of iterations of the external loop maxiter = 100. Initialize the optimal fitness value gfval and the global optimal cluster result gcluster.

Calculate the CNFV of all the protein nodes and the distance d between any two protein nodes. Calculate the membership degree, nonmembership degree, and indeterminacy degree between all pairs of protein nodes according to the membership function, Eq. (15), and the nonmembership function, Eq. (16). Determine the threshold T1 of the membership degree and the threshold T3 of the indeterminacy degree.

Step 1: During the chemotactic operation, for each node i, the algorithm calculates the sum of the edge clustering coefficients of the edges connecting node i with other protein nodes in the PPI network. If the sum is zero, node i is regarded as an isolated node and eliminated.

Step 2: Set the index of the internal loop count = 1 and the maximal number of iterations of the internal loop maxcount = 100.

Step 3: Randomly select one protein node that has a high comprehensive network feature value as the cluster center.

Step 4: Corresponding to the reproduction operation of the BFO algorithm, the algorithm clusters the PPI network according to the membership degree, nonmembership degree, and indeterminacy degree between the cluster center and the other protein nodes. Some protein nodes are grouped into the cluster that the cluster center belongs to and marked as visited nodes. Other nodes are assigned to the cluster but labeled as unvisited nodes, so that they may also participate in other clusters.

Step 5: Set count = count + 1; meanwhile, use the elimination–dispersal operation to randomly select a new cluster center, and go back to Step 4 to generate the next cluster with the reproduction operation.

Step 6: If the cluster number is larger than 3, calculate the similarity between any two clusters. Afterwards, merge the clusters whose similarity is higher than the given threshold.

Step 7: When all the protein nodes are visited or the index of the internal loop, count, reaches the maximal number of iterations of the internal loop, maxcount, a set of clusters of the PPI network is obtained.

Step 8: Calculate the fitness value of the obtained cluster results and compare it with the optimal fitness value gfval; then update gfval and the global optimal cluster result gcluster. Meanwhile, set iter = iter + 1.

Step 9: The algorithm terminates when iter reaches the maximal number of iterations of the external loop, maxiter; otherwise, go back to Step 2.

Output the final clustering result.
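The inner loop of the steps above (Steps 1 to 5 and 7) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `ibfo_pass` and all input tables are hypothetical, the cluster center is taken deterministically as the unvisited node with the highest CNFV instead of being drawn at random, fuzzy nodes are assumed to require a membership degree above T1 as well, and both the merging of similar clusters (Step 6) and the external loop (Steps 8 and 9) are omitted.

```python
def ibfo_pass(nodes, edge_cc_sum, cnfv, mu, pi_deg, t1=0.52, t3=0.15, maxcount=100):
    """One simplified pass that builds clusters center by center.

    edge_cc_sum -- per-node sum of edge clustering coefficients
    cnfv        -- per-node comprehensive network feature value
    mu, pi_deg  -- membership / indeterminacy degrees keyed by frozenset({u, v})
    """
    # Step 1 (chemotaxis): drop isolated nodes
    unvisited = {v for v in nodes if edge_cc_sum[v] > 0}
    clusters = []
    count = 0
    while unvisited and count < maxcount:
        # Steps 3 and 5: pick the next center (simplification: highest CNFV)
        center = max(sorted(unvisited), key=lambda v: cnfv[v])
        unvisited.discard(center)
        cluster = {center}
        # Step 4 (reproduction): attach nodes with high membership degree
        for v in sorted(unvisited):
            key = frozenset((center, v))
            if mu[key] > t1:
                cluster.add(v)
                if pi_deg[key] < t3:
                    unvisited.discard(v)  # visited; fuzzy nodes stay unvisited
        clusters.append(cluster)
        count += 1
    return clusters


# Hypothetical toy inputs; node "e" is isolated
nodes = ["a", "b", "c", "d", "e"]
edge_cc_sum = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.5, "e": 0.0}
cnfv = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.8, "e": 0.1}
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
mu = dict(zip(map(frozenset, pairs), [0.9, 0.7, 0.2, 0.3, 0.1, 0.8]))
pi_deg = dict(zip(map(frozenset, pairs), [0.05, 0.3, 0.1, 0.1, 0.1, 0.05]))

result = ibfo_pass(nodes, edge_cc_sum, cnfv, mu, pi_deg)
print(result)
```

In this toy run, node c joins the first cluster with a high indeterminacy degree, remains unvisited, and is then also absorbed by the second cluster, which is how overlapping modules arise in the procedure.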



Figure 6. The flow chart of clustering method based on IBFO.

2.2.5 Time complexity of algorithm

Suppose that the number of protein nodes in the PPI dataset is n, the number of obtained clusters is numclu, the number of protein nodes in a cluster is num, the maximal number of iterations of the external loop is maxiter, and the maximal number of iterations of the internal loop is maxcount. The time complexity of the algorithm is then as follows:

(i) In the data preprocessing phase, the time complexity of calculating the membership degree and indeterminacy degree between any two protein nodes is O(n²).

(ii) For each cluster center, the time complexity of obtaining one cluster via the chemotactic operation and the reproduction operation is O(n).

(iii) The time complexity of calculating the similarity and merging the clusters that have high similarity is O(numclu × num² + numclu × num).

(iv) The time complexity of obtaining a set of clusters is O(maxcount × (numclu × num²)).

(v) The global optimal cluster results are obtained after executing the algorithm maxiter times, so the overall time complexity is O(maxiter × maxcount × (numclu × num²)).

3 Results

3.1 Parameter analysis

To assess the performance of the algorithm, the experiments were carried out under Windows XP on an Intel Core 2 Duo processor running at 2.93 GHz with 2 GB of memory. We use the Munich Information Center for Protein Sequences (MIPS) PPI datasets as our data source and the MIPS complex database as ground truth to evaluate the protein complexes predicted by our method [24]. Several parameters influence the cluster results, such as the parameters of BFO, the threshold of the membership degree, the threshold of the indeterminacy degree, and the maximal number of iterations of the IBFO algorithm.



Figure 7. The consequence of two initializing mechanisms on clustering results.

We execute the algorithm 20 times. Figure 7 illustrates the influence on the f-measure value of two different initialization mechanisms: the cluster center initialized by the bacterial reproduction operation, and the cluster center initialized randomly. The f-measure values obtained with the former are higher than those obtained with the latter; they range between 0.78 and 0.82, which is relatively stable.

Figure 8(A) illustrates the influence of the threshold T1 of the membership degree on the cluster results in terms of precision, recall, and f-measure values. When T1 is less than 0.5, the three values show an upward trend, and they decrease as T1 approaches 0.8. Figure 8(B) describes the influence of the threshold T3 of the indeterminacy degree. When T3 varies from 0 to 0.15, the precision, recall, and f-measure values gradually increase to their optimal values. Therefore, the threshold of the membership degree is set to 0.52 and the threshold of the indeterminacy degree is set to 0.15 in the following experiments.

Figure 9 shows the effect of the maximal number of iterations on the clustering results. The algorithm performs best in precision, recall, and f-measure values when the maximal number of iterations reaches 100.

3.2 Performance of IBFO algorithm

This paper integrates the concepts of membership degree and indeterminacy degree in the intuitionistic fuzzy set into the principle of the BFO algorithm, so the algorithm may generate several overlapping functional modules in the final cluster results.

During the reproduction operation in the model design, the indeterminacy degree is used to determine whether a protein node should be regarded as a fuzzy node and visited several times when clustering the PPI network, which is intended to handle the fact that one protein node may

Figure 8. The effect of the threshold of membership degree and indeterminacy degree on clustering results.

Figure 9. The influence of maximum iterations on clustering results.



Figure 10. Comparisons of clustering results with and without the indeterminacy degree.

belong to more than one functional module. It is essential to show whether the improvement based on the intuitionistic fuzzy set is reasonable and effective. In Fig. 10, the dotted line stands for the cluster results obtained by the algorithm that ignores the indeterminacy degree of protein nodes, while the solid line represents the cluster results of the improved algorithm proposed in this paper. Figure 10(A) shows the results of these two algorithms in terms of recall value. The results show that the IBFO algorithm achieves a higher recall value. This is because each protein node can be merged into one or more clusters, so the algorithm can find the clusters as completely as possible. Figure 10(B) evaluates these two algorithms in terms of f-measure values. It shows that the f-measure value of the IBFO algorithm is superior to that of the algorithm that takes no account of the overlapping functional modules in the PPI network.

The functional-flow algorithm [7] is a relatively effective method for clustering PPI networks. It is based on the principle that the functional information of a protein flows through every possible path, so that we can quantify how much a protein functionally influences its adjacent proteins. The algorithm considers the generation of overlapping functional modules in the model design and merges the modules that have high similarity in the postprocessing stage. Moreover, the experiments show that the algorithm is comparatively efficient. However, the algorithm has to predefine the cluster number, and its precision and recall values are relatively low. Consequently, the method presented in ref. [16] uses the mechanism of the BFO algorithm to optimize the procedure of clustering PPI networks. In that model design, the clusters are created one by one, which overcomes the drawback of predefining the cluster number. However, each protein node can be grouped into only one cluster, which goes against the topological character of PPI networks that one protein node can belong to two or more clusters. Therefore, the obtained clusters contain fewer protein nodes than the matched modules in the standard dataset, so the BFO algorithm does not perform well in terms of the recall value of the clustering results. To address the shortcomings of a predefined cluster number and low recall, the IBFO algorithm proposed in this paper introduces the concept of indeterminacy degree on the basis of the BFO algorithm. We executed each of the three algorithms 20 times; the precision, recall, and f-measure values of the clustering results are shown in Table 2.

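As Tables 2 and 3 show, the IBFO algorithm performs better than the Flow and BFO algorithms in terms of average precision, recall, and f-measure values, and it attains the highest average precision and f-measure among the algorithms in Table 3 (only ABC-Flow [12] reports a higher average recall). The ultimate goal of the algorithm is to predict the clusters as accurately as possible. The top 20 clusters obtained by the IBFO algorithm are listed in Table 4.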
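As a quick consistency check on the reported numbers, the f-measure can be recomputed from precision and recall. The sketch below assumes the standard definition of f-measure as the harmonic mean of precision and recall, and uses the values from run 1 of Table 2; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed standard definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-measure) for run 1 of Table 2.
run1 = {
    "Flow": (0.3159, 0.6635, 0.4280),
    "BFO": (0.7436, 0.4447, 0.5566),
    "IBFO": (0.8628, 0.7626, 0.8096),
}

# Recompute each f-measure and round to the precision used in the table.
recomputed = {
    name: round(f_measure(p, r), 4) for name, (p, r, _) in run1.items()
}
```

Under this assumption, the recomputed values agree with the reported f-measures of run 1 to four decimal places.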

As Table 4 shows, the top 20 clusters obtained by the IBFO algorithm include both correctly classified proteins and proteins that should have been grouped into clusters other than the corresponding modules. Relatively many correctly classified protein nodes appear in modules 1, 6, and 19, so the recall value of the clustering results is largely improved. A low p-value indicates that the module closely corresponds to the function, because it is less probable that the network would produce the module by chance. The clustering results in Table 4 can effectively identify sets of unknown proteins that share the same function or protein complex, which helps to predict the function of unknown proteins.
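The interpretation of the p-value can be made concrete with a short sketch. It assumes the hypergeometric enrichment test that is commonly used to score functional modules; this section does not state the formula, so the test itself, the function name `module_p_value`, and its parameter names are assumptions made for illustration.

```python
from math import comb


def module_p_value(n_total, n_function, n_module, n_shared):
    """Probability that a random module of size `n_module`, drawn from
    `n_total` proteins of which `n_function` carry a given function,
    contains at least `n_shared` proteins with that function
    (hypergeometric tail; assumed scoring scheme)."""
    upper = min(n_function, n_module)
    hits = sum(
        comb(n_function, i) * comb(n_total - n_function, n_module - i)
        for i in range(n_shared, upper + 1)
    )
    return hits / comb(n_total, n_module)


# Toy example: 10 proteins, 5 annotated with the function,
# a module of 3 proteins that all carry the function.
p = module_p_value(10, 5, 3, 3)
```

The smaller the value, the less likely it is that a module of the same size would share that many annotated proteins by chance.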

Table 2. Comparisons among the three algorithms

| Run | Flow Precision | Flow Recall | Flow f-measure | BFO Precision | BFO Recall | BFO f-measure | IBFO Precision | IBFO Recall | IBFO f-measure |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3159 | 0.6635 | 0.4280 | 0.7436 | 0.4447 | 0.5566 | 0.8628 | 0.7626 | 0.8096 |
| 2 | 0.2727 | 0.7316 | 0.3973 | 0.7446 | 0.7669 | 0.5740 | 0.9053 | 0.7170 | 0.8002 |
| 3 | 0.2554 | 0.7454 | 0.3804 | 0.6776 | 0.4413 | 0.5345 | 0.8491 | 0.7586 | 0.8013 |
| 4 | 0.2341 | 0.5225 | 0.3233 | 0.7102 | 0.7391 | 0.5427 | 0.8519 | 0.7343 | 0.7887 |
| 5 | 0.2445 | 0.5364 | 0.3359 | 0.6651 | 0.4460 | 0.5340 | 0.8492 | 0.7728 | 0.8092 |
| 6 | 0.2365 | 0.5265 | 0.3264 | 0.6364 | 0.4480 | 0.5259 | 0.8564 | 0.7435 | 0.7959 |
| 7 | 0.2192 | 0.5367 | 0.3117 | 0.7117 | 0.4355 | 0.5404 | 0.8877 | 0.7217 | 0.7961 |
| 8 | 0.2292 | 0.5212 | 0.3123 | 0.6715 | 0.7179 | 0.5153 | 0.8736 | 0.7494 | 0.8067 |
| 9 | 0.2192 | 0.5253 | 0.3093 | 0.7590 | 0.4222 | 0.5426 | 0.8839 | 0.7164 | 0.7913 |
| 10 | 0.2083 | 0.5155 | 0.2967 | 0.7294 | 0.4673 | 0.5697 | 0.8907 | 0.6824 | 0.7727 |
| 11 | 0.2211 | 0.5276 | 0.3116 | 0.6510 | 0.4435 | 0.5277 | 0.8661 | 0.7278 | 0.7909 |
| 12 | 0.2044 | 0.5395 | 0.2965 | 0.6679 | 0.4677 | 0.5502 | 0.8504 | 0.7690 | 0.8076 |
| 13 | 0.2215 | 0.5352 | 0.3133 | 0.6868 | 0.4429 | 0.5386 | 0.8575 | 0.7081 | 0.7756 |
| 14 | 0.2264 | 0.5194 | 0.3153 | 0.6943 | 0.4316 | 0.5324 | 0.8472 | 0.7937 | 0.8195 |
| 15 | 0.2176 | 0.5423 | 0.3106 | 0.7222 | 0.4294 | 0.5386 | 0.8750 | 0.7042 | 0.7803 |
| 16 | 0.2212 | 0.5337 | 0.3128 | 0.6480 | 0.4221 | 0.5113 | 0.8491 | 0.7577 | 0.8008 |
| 17 | 0.2373 | 0.5221 | 0.3263 | 0.7165 | 0.7148 | 0.5255 | 0.8644 | 0.7702 | 0.8149 |
| 18 | 0.2222 | 0.5334 | 0.3137 | 0.7050 | 0.4273 | 0.5322 | 0.8709 | 0.6942 | 0.7725 |
| 19 | 0.2302 | 0.5126 | 0.3177 | 0.7179 | 0.4366 | 0.5431 | 0.8726 | 0.6777 | 0.7629 |
| 20 | 0.2427 | 0.5243 | 0.3318 | 0.7185 | 0.4327 | 0.5402 | 0.8539 | 0.7674 | 0.8083 |

Table 3. Comparisons of Flow and our algorithms on average value

| Algorithm | Average precision | Average recall | Average f-measure |
|---|---|---|---|
| Flow [7] | 0.23 | 0.56 | 0.31 |
| IQ-Flow [11] | 0.67 | – | – |
| IQ-Flow fast [11] | 0.72 | – | – |
| ABC-Flow [12] | 0.70 | 0.84 | 0.76 |
| JSACO [13] | 0.87 | 0.26 | 0.55 |
| BFO [16] | 0.70 | 0.50 | 0.54 |
| IBFO | 0.87 | 0.74 | 0.79 |

Table 5 shows that three overlapping proteins are detected when the indeterminacy degree is set to 0.05, 0.2, or 0.25, whereas 5 and 7 overlapping proteins are identified when the indeterminacy degree is 0.1 and 0.15, respectively. This illustrates that the most overlapping proteins are obtained when the indeterminacy degree is set to 0.15, which can also be seen from Fig. 8(B). This result is only from the statistics and simulation point of view. In practice, how to set the appropriate value of the indeterminacy degree, and which proteins actually overlap, should depend on experiments and analysis by biologists. Nevertheless, the result at least provides them with a reference to a certain extent.
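The counts quoted from Table 5 can be reproduced with a few lines of Python. The sketch below simply transcribes the protein lists of Table 5 into a dictionary (the variable names are chosen here for illustration), counts the overlapping proteins per indeterminacy degree, and also lists the proteins that are reported at more than one setting.

```python
from collections import Counter

# Overlapping proteins per indeterminacy degree, transcribed from Table 5.
overlaps = {
    0.05: ["YDR276c", "YDR488c", "YPL017c"],
    0.10: ["YNL088w", "YDR488c", "YER007w", "YPL155c", "YMR138w"],
    0.15: ["YNL088w", "YDR276c", "YEL020w-a", "YFL018c",
           "YLR055c", "YDR149c", "YOR266w"],
    0.20: ["YLR055c", "YBR107c", "YPR070w"],
    0.25: ["YIL095w", "YPR023c", "YLR212c"],
}

# Number of overlapping proteins detected at each setting.
counts = {degree: len(proteins) for degree, proteins in overlaps.items()}

# Setting that yields the most overlapping proteins.
best = max(counts, key=counts.get)

# Proteins that appear under more than one indeterminacy degree.
seen = Counter(p for proteins in overlaps.values() for p in proteins)
recurring = sorted(p for p, n in seen.items() if n > 1)
```

Proteins that recur across several settings may be the more robust overlap candidates, although, as noted above, confirming them remains a matter for biological experiments.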

Table 4. The proteins and p-value of the top 20 clusters

| Cluster ordinal | The proteins classified rightly | The proteins classified wrongly | ID of the protein function modules | p-value |
|---|---|---|---|---|
| 1 | YBR120c, YOR334w, YIR021w, YDR194c, YMR023c, YGR222w | YKR052c, YLR382c, YDL044c, YHR005c-a, YJL133w, YPR134w | 500.50 | 0.1644 |
| 2 | YPR025c, YDL108w, YLR005w, YPR056w, YPL122c, YER171w | YIL143c, YDR311w | 510.100 | 0.0771 |
| 3 | YDR176w, YDR145w, YGL112c, YPL254w, YCL010c, YOL148c, YLR055c, YGL066w, YDR392w, YBR081c, YBR192c, YMR236w | YDR167w | 230.20.10 | 0.1044 |
| 4 | YHR069c, YDR280w, YOL021c, YGR195W, YDL111c, YGR095c, YCR035c | YOL077w-a, YKL190w, YKL058w, YJL074c | 440.12.10 | 0.0985 |
| 5 | YLR381w, YJR135c | YPR046w | 270.20.20 | 0.2321 |
| 6 | YLR115w, YLR277c, YAL043c | YDR301w | 440.10.20 | 0.2021 |
| 7 | YPL010w, YNL287w, YDL145c, YFR051c, YGL137w | YLR093c, YDR238c | 260.30.10 | 0.1864 |
| 8 | YDR028c, YER133w, YOR178c, YKL193c, YER054c, YMR311c | YNL126w | 450 | 0.0936 |
| 9 | YPL129w, YPL016w, YJL176c, YMR033w, YBR289w, YOR290c, YDR073w, YPR034w | YNR023w, YHL025w | 510.190.50 | 0.1563 |
| 10 | YML099c, YMR043w, YMR042w, YDR137c | YGL154c, YGR244c, YGR113w | 510.190.120 | 0.2179 |
| 11 | YRB123c, YAL001c, YGR047c, YOR110w, YDR362c | – | 510.150 | 0.0827 |
| 12 | YMR033w, YOR213c, YCR052w, YKR008, YFR037c, YLR321c, YIL126w, YLR357w, YPR034w, YGR056w | YLR060w, YLP045w, YLR148w | 400 | 0.1329 |
| 13 | YOL123w, YGL044c, YDR228c, YMR061w, YOR250c | YHR012w | 440.10.10 | 0.1195 |
| 14 | YEL020w-a, YHR005c-a | YCL009c | 440.40 | 0.3019 |
| 15 | YJL154c, YJL053w, YHR012w | – | 260.30.30.10 | 0.1092 |
| 16 | YNL062c, YNL244c, YOR361c, YDR429c, YMR146c, YBR079c | YGL130w, YHR164c, YER176w | 500.10.40 | 0.3919 |
| 17 | YNL199c, YPL075w | YPL001w | 510.190.90 | 0.2109 |
| 18 | YDL069c, YHL038c, YJL209w, YDR197w | YPL075w | 440.20 | 0.1560 |
| 19 | YNL290w, YOR217w, YJR068w, YOL094c, YBR087w | – | 410.40.30 | 0.1210 |
| 20 | YGR072w, YMR080c, YHR077c | YJR052w, YER090w | 300 | 0.2057 |

Table 5. The overlapping proteins under different indeterminacy degree

| Indeterminacy degree | The overlapping proteins |
|---|---|
| 0.05 | YDR276c, YDR488c, YPL017c |
| 0.10 | YNL088w, YDR488c, YER007w, YPL155c, YMR138w |
| 0.15 | YNL088w, YDR276c, YEL020w-a, YFL018c, YLR055c, YDR149c, YOR266w |
| 0.20 | YLR055c, YBR107c, YPR070w |
| 0.25 | YIL095w, YPR023c, YLR212c |

4 Discussion

The BFO algorithm in ref. [16] has a low recall value in clustering PPI networks, so in this paper we proposed a novel method using the BFO mechanism based on the intuitionistic fuzzy set. The algorithm initially eliminates the isolated points based on the edge-clustering coefficient in the chemotactic operation. In the reproduction operation, the nodes that have a high membership degree are merged into the cluster that the cluster center belongs to. Meanwhile, the nodes that also have a high indeterminacy degree are labeled as unvisited protein nodes and may be grouped into two or more clusters. The elimination–dispersal operation is equivalent to selecting the next cluster center and generating another cluster. In the end, the algorithm merges the clusters having high similarity and terminates when the maximal number of iterations is reached. The simulation results on the PPI dataset showed that the algorithm could not only effectively improve the accuracy of the clustering results and automatically determine the cluster number, but also identify the overlapping modules successfully. However, some parameters of the algorithm influence the clustering results and should be discussed further. In addition, how to construct a dynamic model of the PPI network and design the corresponding algorithms is a direction for future research.

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61100164 and 61173190, the Natural Science Foundation of Shaanxi Province of China in 2010 under Grant No. 2010JQ8034, and the Fundamental Research Funds for the Central Universities under Grant No. GK200902016.

The authors have declared no conflict of interest.

5 References

[1] Watts, D. J., Strogatz, S. H., Collective dynamics of ‘small-world’ networks. Nature 1998, 393, 440–442.

[2] Barabasi, A. L., Oltvai, Z. N., Network biology: understanding the cell’s functional organization. Nature Rev. Genet. 2004, 5, 101–113.

[3] Soon-Hyung, Y., Oltvai, Z. N., Barabasi, A. L., Functional and topological characterization of protein interaction networks. Proteomics 2004, 4, 928–942.

[4] Penggang, S., Lin, G., Identification of overlapping and non-overlapping community structure by fuzzy clustering in complex networks. Inform. Sci. 2011, 181, 1060–1071.

[5] Berggard, T., Linse, S., James, P., Methods for the detection and analysis of protein-protein interactions. Proteomics 2007, 7, 2833–2842.

[6] Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M., Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21, i302–i310.

[7] Cho, Y. R., Hwang, W., Ramanathan, M., Aidong, Z., Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 2007, doi:10.1186/1471-2105-8-265.

[8] Aidong, Z., Protein Interaction Networks, Cambridge University Press, New York 2009.

[9] Kenley, E. C., Cho, Y. R., Detecting protein complexes and functional modules from protein interaction networks: a graph entropy approach. Proteomics 2011, 11, 3825–3844.

[10] Goel, A., Simone, S. Li, Marc, R. W., Four-dimensional visualisation and analysis of protein–protein interaction networks. Proteomics 2011, 11, 2672–2682.

[11] Xiujuan, L., Xu, H., Lei, S., Aidong, Z., Clustering PPI data based on improved functional-flow model through quantum-behaved PSO. Int. J. Data Min. Bioinform. 2012, 6, 42–60.

[12] Xiujuan, L., Jianfang, T., The information flow clustering model and algorithm based on the artificial bee colony mechanism of PPI network. Chinese J. Comput. 2012, 35, 134–145.

[13] Xiujuan, L., Xu, H., Shuang, W., Ling, G., Joint strength based ant colony optimization clustering algorithm for PPI networks. Acta Electron. Sin. 2012, 40, 695–702.

[14] Passino, K. M., Biomimicry of bacterial foraging for distributed optimization and control. IEEE Contr. Syst. Mag. NY 2002, 22, 52–67.

[15] Kim, D. H., Abraham, A., Cho, J. H., A hybrid genetic algorithm and bacterial foraging approach for global optimization. Inform. Sci. 2007, 177, 3918–3937.

[16] Xiujuan, L., Shuang, W., Liang, G., Aidong, Z., Clustering PPI data based on bacteria foraging optimization algorithm. 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM11), Atlanta, Georgia, 2011, 96–99.

[17] Zadeh, L. A., Fuzzy sets. Inform. Cont. 1965, 8, 338–353.

[18] Atanassov, K., Intuitionistic fuzzy sets. Fuzzy Sets Syst. 1986, 20, 87–96.

[19] Dasgupta, S., Biswas, A., Abraham, A., Das, S., Adaptive computational chemotaxis in bacterial foraging algorithm. 2008 International Conference on Complex, Intelligent and Software Intensive Systems 2008, 13, 64–71.

[20] Veysel, G., Kevin, M. P., Swarm Stability and Optimization. Springer Verlag, Berlin Heidelberg 2011.

[21] De, S. K., Biswas, R., Roy, A. R., Some operations on intuitionistic fuzzy sets. Fuzzy Sets Syst., Arti. Intell. 2003, 2715, 285–292.

[22] Huan, W., Min, L., Jianxin, W., Yi, P., A new method for identifying essential proteins based on edge clustering coefficient. Lecture Notes in Computer Science 2011, 6674, 87–98.

[23] Letovsky, S., Kasif, S., Predicting protein function from protein-protein interaction data: a probabilistic approach. BMC Bioinformatics 2003, 19, 197–204.

[24] Guldener, U., Munsterkotter, M., Kastenmuller, G., Strack, N. et al., CYGD: the comprehensive yeast genome database. Nucl. Acids Res. 2005, 33, D364–D368.

