A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena,...

25
Social Network Analysis and Mining manuscript No. (will be inserted by the editor) A Method to Detect Communities with Stability in Social Networks Nam P. Nguyen · Md Abdul Alim · Thang N. Dinh · My T. Thai Received: date / Accepted: date Abstract We propose SCD (Stable Community Detection), a framework to effectively identify stable communities in Online Social Networks (OSNs). Our framework works by first enriching the input network with the mutual rela- tionship estimation of all links, and then discovering stable communities using a lumped Markov chain model. SCD has the advantage of handling the real model of OSNs with weighted reciprocity relationships. This approach is also supported by a key connection between the persistence probability of a com- munity and its local topology. To certify the efficiency and stability of the dis- covered communities, we test SCD on both synthesized datasets and real-world social traces, including the NetHEPT collaboration, Foursquare, Twitter and Facebook social networks, in reference to the consensus of other state-of-the- art detection methods. Competitive experimental results confirm the quality and efficacy of the proposed framework on identifying stable communities in OSNs. Nam P. Nguyen Department of Computer & Information Sciences, Towson University 7800 York Road, Suite 406, Room 445 - Towson, MD 21252 E-mail: [email protected] Md Abdul Alim Department of Computer & Information Science & Engineering, University of Florida CSE Building, Room E555, P.O. Box 116120, Gainesville, Florida 32611-6120 E-mail: [email protected]fl.edu Thang N. Dinh Department of Computer Science, Virginia Commonwealth University East Hall, Room E4244, P.O. Box 843019 Richmond, Virginia 23284-3019 E-mail: [email protected] My T. Thai Department of Computer & Information Science & Engineering, University of Florida CSE Building, Room E550, P.O. Box 116120, Gainesville, Florida 32611-6120 Phone: 352-450-1837, E-mail: [email protected]fl.edu

Transcript of A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena,...

Page 1: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Social Network Analysis and Mining manuscript No.(will be inserted by the editor)

A Method to Detect Communities with Stability in

Social Networks

Nam P. Nguyen · Md Abdul Alim ·

Thang N. Dinh · My T. Thai

Received: date / Accepted: date

Abstract We propose SCD (Stable Community Detection), a framework toeffectively identify stable communities in Online Social Networks (OSNs). Ourframework works by first enriching the input network with the mutual rela-tionship estimation of all links, and then discovering stable communities usinga lumped Markov chain model. SCD has the advantage of handling the realmodel of OSNs with weighted reciprocity relationships. This approach is alsosupported by a key connection between the persistence probability of a com-munity and its local topology. To certify the efficiency and stability of the dis-covered communities, we test SCD on both synthesized datasets and real-worldsocial traces, including the NetHEPT collaboration, Foursquare, Twitter andFacebook social networks, in reference to the consensus of other state-of-the-art detection methods. Competitive experimental results confirm the qualityand efficacy of the proposed framework on identifying stable communities inOSNs.

Nam P. NguyenDepartment of Computer & Information Sciences, Towson University7800 York Road, Suite 406, Room 445 - Towson, MD 21252E-mail: [email protected]

Md Abdul AlimDepartment of Computer & Information Science & Engineering, University of FloridaCSE Building, Room E555, P.O. Box 116120, Gainesville, Florida 32611-6120E-mail: [email protected]

Thang N. DinhDepartment of Computer Science, Virginia Commonwealth UniversityEast Hall, Room E4244, P.O. Box 843019 Richmond, Virginia 23284-3019E-mail: [email protected]

My T. ThaiDepartment of Computer & Information Science & Engineering, University of FloridaCSE Building, Room E550, P.O. Box 116120, Gainesville, Florida 32611-6120Phone: 352-450-1837, E-mail: [email protected]

Page 2: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Keywords Community detection, Online social network, Stable communitydetection, Lumped Markov chain

1 Introduction

Popular social network sites, such as Facebook, Twitter, and recently Google+,have brought a great deal of attention to OSNs nowadays. Alongside theirfruitful expansions in sizes and spaces, the need to understand their underlyingstructures has become more and more pressing. Most social networks in realityexhibit some common topological features, such as the small-world, scale-freephenomena, the power-law degree distribution, and especially, the property ofcontaining community structure, i.e. they contain groups of users with moresocial interactions within a group than among groups [26]. For instance, acommunity on Facebook could be a group of users sharing a same interest, suchas music, movies or photography. Similarly, a community (termed a “circle”) onthe newly introduced Google+ network usually consists of friends, colleaguesor college students who mostly interact with each other. Community detectionin OSNs, as a result, is the classification of users into communities so that thenetwork’s natural structure is properly displayed.

The detection of network communities has been a very active researchtopic in the field, especially in the age of OSNs. With recent technologies suchas Facebook and Twitter, more and more emphasis is placed on communitystructure. One example of such uses in information retrieval would be usingcommunity structure to analyze spending patterns within a community, andutilizing this data to generate community specific advertisements or purchaserecommendations. In addition, community detection also finds itself extremelyhelpful in different aspects, including sensor reprogramming in WSNs [31],forwarding and routing strategies in mobile networks [29], and Sybil attackin OSNs [38]. These applications generalize the applicability of communitystructure in not only social network analysis but also in data mining andinformation management.

OSNs in reality are highly dynamic as social interactions on them tendto occur frequently. As a result, their communities are also highly dynamicaland evolve heavily as the network evolves over time. However, Palla et al. [30]observe in their seminal work that some communities in social networks aretightly connected and remain wealthy even over a long period of time. Theauthors also point out that large-size communities with a high internal den-sities and less external distractions tend to remain stable during the networkevolution, which intuitively agrees with the findings reported in [17]. These ob-servations reassemble the concept of stable communities in OSNs. For example,stable communities on Facebook can be visualized as groups of users who de-voted themselves to a particular interest such as movie, music or photography.Likewise, a stable community in Twitter can be illustrated via a group of userswho may follow many but are only loyal to a specific celebrity. Similarly, sta-ble communities in a citation network may refer to well-established research

Page 3: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

topics whereas unstable communities may indicate topical or arising researchdirections in the field.

The discovery of these stable communities, as a consequence, will provide usvaluable insights into the core characteristics of not only each community butalso of the network as a whole. This knowledge can further benefit social datamining and information retrieval in OSNs as input queries can be redirected tostable communities sharing the most similar characteristics to the queries formore meaningful answers. For instance, the search for well-established researchtopics in a citation network can be mined more effectively when one looks atits stable rather than unstable communities, as discussed above. However, thelarge-scale and nonreciprocal topologies of OSNs in reality make the detectionof stable community structure an extremely challenging yet topical problem.

A large body of work has been devoted to find general communities (i.e.without the concept of stability) on both directed and undirected networks inthe literature [11]. On the contrary, not too many approaches are suggested toidentify stable communities [20][40], especially on directed and weighted net-works. The main source of difficulty is due to the inconsistency of communitymembers in a general structure: while they might appear to be in a communityat one time, they may not commit to that particular community in a long run.One possible approach, therefore, is to find a consensus of a specific algorithmafter multiple executions and use those cores as stable communities [20][5][36].We argue that while this approach is feasible for small graphs, it will incurexpensive computational cost, is time consuming, and is lack of a convergenceguarantee when applied to large-scale social networks. In [40], the authors es-timate the mutual links between pairs of users and suggest a detection methodthat optimizes the total mutual connection on the whole network. While theidea of mutual connection is sound, we find that it might not be sufficientbecause some estimated mutual links are of low magnitudes, and thus, maynot reflect the correct concept of stability at the community level.

Intuitively, a stable community is often realized either by its internallytight and strong mutual relationships among the users, or by its internal linkswhich display a high tendency to remain within the community over time [8].In other words, stable communities in the network are commonly characterizedby stable connections among their members. Motivated by these observations,we suggest SCD (short for Stable Community Detection), a framework toeffectively identify stable communities in directed and weighted OSNs thatfacilitates both of the above intuitions. SCD works by first enriching the inputnetwork with the stability estimation of all links in the network, and then dis-covering communities via stable connections using the lumped Markov chainmodel. Our approach is mathematically developed based on a key theory re-garding the persistence probability of a community at the stationary distribu-tion and its local topology. The notable features of SCD are: (1) it accounts fora very general model of OSNs with reciprocity and weighted social connections,(2) it is parameter-free, requires only a single execution (which significantlyreduces the running time) and is of high consensus, (3) it is not affected bythe order of network vertices and does not suffer from the resolution limit.

Page 4: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Furthermore, since our method intrinsically accounts for stability, the discov-ered communities should be stable as opposed to doing a statistical analysis.In this work, we focus on the technical development of the framework and itsanalysis.

In summary, our contributions in this work are

– We suggest an estimation which provides helpful insights into the stabilityof links in the input network. Based on that, we propose SCD - a frame-work to identify community structure in real OSNs with the advantage ofcommunity stability.

– We explore an vital connection between the persistence probability of acommunity at the stationary distribution and its local topology, which isthe fundamental mathematical theory to develop the SCD framework.

– To certify the efficiency of our approach, we extensively test SCD onboth synthesized data with embedded communities, and real-world socialtraces including NetHEPT, NetHEPT WC collaboration networks as wellas Foursquare, Tiwtter and Facebook social networks. The experimentsis conducted in reference to the consensus of other state-of-the-art detec-tion methods. Highly competitive empirical results confirm the quality andefficiency of SCD on identifying stable communities in OSNs.

The organization of this work is as follows: Section 2 reviews the relatedwork on stable community detection. Section 3 introduces the graph modeland notations that will be used throughout the paper. Section 4 describes linkstability estimation process, the first phase of SCD framework. In section 5,we discuss in detail the motivations and detection mechanism of our approach.Section 6 reports our empirical results on synthesized and real social networksin comparison with other methods. Finally, section 7 concludes the paper.

2 Related Work

Community detection in complex networks is a well established field and atremendous number of identification methods has been proposed in the lit-erature. Some notable approach directions include classical graph clusteringalgorithms [2][24], dynamic approaches [33], modularity optimization methods[25][27], statistical inference [28] or random walk for community detection [9](see [11] and references therein for an excellent survey).

The discovery of stable communities, on the contrary, is still an untroddenarea with only a few attempts has been suggested [8][20][40]. This specialproperty of network communities was perhaps first observed by Palla et al. inhis seminal work [30], where they point out that tight-knit communities withhigh internal densities and less external distractions tend to remain strong overtime, thereby reassembles the concept of community stability. Delvenne et al.[8] extend this general concept to proposed an measure, called “stability ofthe clustering r(t,H)”, to quantify how stable a given cluster (or communitystructure) H is at a specific time step t based on the Markov Autocovariance

Page 5: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

model. Under this notation, a cluster H is stable at time t if a high value ofr(t,H) is observed. This quantity, instead, is more appropriate for verificationrather than identification of stable network communities since it requires thespecification of time step t a prior.

In a different approach, Lancichinetti et al. [20] investigate on the consensusof community detection methods. The authors report that, given a particularalgorithm A, the consensus on communities found by A after multiple runsdramatically improve the quality of the detection, henceforth suggest thatthose communities are candidates for stable structures. Other studies followthe same line of approach [5][36]. This is a very interesting approach, however,might encounter some disadvantages of (1) the expensive computational costand time consuming, and (2) the convergence of the whole iterative processis not guaranteed. In a recent attempt, Yanhua et al. [40] utilize the conceptof mutual links and suggest an spectral-clustering-based identification methodthat tries to maximize the total mutual connections in order to find stablecommunities. However, there are possibilities that some mutual links are of lowmagnitudes, and thus, do not significantly contribute to the overall stabilityat the community level.

3 Basic Notations

We introduce the basic notations representing the underlying social networkthat we will use throughout this paper.

(Graph notation) Let G = (V,E,w) be a directed and weighted graph rep-resenting a social network with V is the set of n network users (or nodes), Eis the set of m directed relationships (or edges), and w (or precisely wuv) isthe weight function on each edge (u, v) ∈ E representing the communicationfrequency between user u and v in the social network. Without loss of gener-ality, we assume that all edge weights are normalized, i.e.

(u,v)∈E wuv = 1

and wuv ≥ 0. For each edge (u, v) ∈ E which (v, u) /∈ E, we say that thebackwards edge (v, u) is missing, we will use the notation (v;u) and wv;u todenote the mutual link of edge (u, v) if (v, u) should indeed exist in E and itsweight, respectively. Furthermore, we will use the notation st(u, v, t) to denotethe stability estimate of edge (u, v) at time step (or hop) t. These notationswill be described in detail in next section.

(Community notation) Denote by C = {C1, C2, ..., Cq} the network com-munity structure, i.e. a collection of q subsets of V satisfying ∪q

i=1Ci = Vand Ci ∩ Cj = ∅ ∀i, j. We say that each Ci ∈ C and its induced subgraphform a community of G. For a node u ∈ V , let N+

u , N−u and Nu denote the

set of outgoing, the set of incoming, and the set of all neighbor nodes adja-cent to u, respectively. Furthermore, let k+u (or w

+u ), k

−u (or w

−u ) and ku(or wu)

be the corresponding cardinalities (or total weights) of these sets. For anyC ⊆ V , let Cin and Cout denote the set of links having both endpointsin C and the set of links heading out from C, respectively. In addition, letmC = |Cin| (rsp. wC = w(Cin)) and k+C =

u∈C k+u (rsp. w+C =

u∈C w+u ).

Page 6: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Finally, the terms node-vertex as well as edge-link -connection are used inter-changeably.

4 Link Stability Estimation

We describe our first step towards the identification of stable communities inthe network: the link stability estimation process. Our estimation procedureconsists of two stages: In the early stage, the reciprocity of each link in thenetwork is first estimated, and based on the estimation, its stability is conse-quently evaluated in the later stage.

4.1 Link reciprocity prediction

When dealing with large scale OSNs, it is possible that some backwards edgesbetween individuals are missing. This lack of information may due to the im-perfect data collection process, or because these backwards edges are not yetreflected in the underlying network but should be due to the strong relation-ships between local network users. For instance, Leskovec et al. [22] observethat friends of friends in social networks tend to be friend of each other in thenear future, i.e. there should be dual connections between friends of friendswith high chance even if they are not yet friends of each other. Therefore,predicting the existence of these backwards edges will allow a more completeand comprehensive detection of stable communities by increasing the internaldensity of strongly connected components, which are potential candidates fornetwork communities.

Link reciprocity prediction problem is a well-studied field and many meth-ods are proposed in the literature [23][1][10]. In this paper, we utilize a methodcalled “friends-measure” suggested in [10]. The intuition behind this measureis that when looking at two users in the social network, one can assume thatthe more connections their neighbors have with each other, the higher thechance the two users are actually connected. Originally, this friends-measurebetween two users u and v is formulated as:

friends-measure(u, v) =∑

x∈Nu

y∈Nv

δ(x, y)

where δ(x, y) = 1 if either x = y or (x, y) ∈ E or (y, x) ∈ E. This measure hasbeen extensively verified among other topological features and has been shownto be a promising one in comparison with other metrics [10]. However, in thecase of directed networks, there are possibilities that different link topologiescan share a common friends-measure value. Therefore, we need to modify theabove formula so that it reflects the true relationship between the networkusers, and furthermore copes with edge weights in the network.

In order to better handle directed and weighted graphs, we will attempt topredict the existence of backwards edges of unidirectional links. For example,

Page 7: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

if (u, v) ∈ E and (v, u) /∈ E , we will try to find the possibility whether weshould enrich the network by inserting (v, u) into E. To this end, we first relaxthe direction of the edge between u and v, and next compute the likelihoodthat a backwards edge should exist between u and v by using the modifiedformula

∆(u, v) =

x∈Nv

y∈Nuτ(x, y)

WvWu

(1)

Where τ(x, y) = wvxwxywyu is the total possibility of the backwards pathstarting from v, passing through neighbor nodes x and y, and ending at u.When the network is unweighted Wu = du,Wv = dv and thus, ∆(u, v) countsthe (normalized) number of paths of lengths two and three joining two users uand v, which intuitively agrees with the aforementioned friends-measure for-mula. By Proposition 1, we show that ∆(u, v) is indeed the generalization ofweighted friend-measure(u, v) and depends only on the nodes’ topology. Hence,∆(u, v) can be regarded as the estimated probability that the backwards con-nection < v, u > indeed exists, i.e. we set w<vu> = ∆(u, v).

Proposition 1 For any (u, v) ∈ E which (v, u) /∈ E, 0 ≤ ∆(u, v) ≤ 1.

Proof We first prove this for unweighted graphs. The proof for weighted graphscan be extended straightforwardly. It is obvious that 0 ≤ ∆(u, v). Now we show∆(u, v) ≤ 1. For any x, y such that δ(x, y) = 1, if x = y, they can make just oneconnection counted towards the summation. Otherwise, they can make at mostdu (or dv) dual-connections at each vertex. Taking these facts into account, wehave ∆(u, v) =

x∈Nv

y∈Nuδ(x, y) ≤ dvdu. Thus, the inequalities follows.

The left equality holds when there are no connections from u to v and viceversa. The right equality holds when every path of length 2 from u to v (orfrom v to u) are contained in the corresponding path of length 3.

Proposition 2 Let n0 be the number of unidirectional links in the input net-work. The time complexity for estimating the mutual connections for theselinks is O(n0M).

Proof The total time required for estimating the possibility for a backwardconnection at an edge (u, v) is du + dv +min

x∈N+v

y∈N−

y{d+x , d−y }. Thus,

for all n0 links, the total time complexity is upper bounded by n0(2M)+n0M =O(n0M).

4.2 Link stability estimation

After the reciprocity of each link in the network has been estimated, the inputnetwork is now enriched with more information of the backwards edges. Whilethe presence of these dual edges is helpful in characterizing the mutual rela-tionships between pairs of network users, it might not be sufficient to evaluatethe stability of all network connections as some of the backwards edges maybe of low magnitudes, and thus, may not be able to hint the stability of the

Page 8: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

u v

0.45

u v

x

0.5 0.2

u v

x 0.5 0.2

0.1

y

a) b) c)

Fig. 1 Illustrations of stability function. a) st(u, v, 0) = 0.45 = w<vu>, b) st(u, v, 1) =0.5× 0.2 = 0.1, c) st(u, v, 2) = 0.5× 0.1× 0.2 = 0.05.

connection. Therefore, we need to further estimate the stability of a networklink given its predicted reciprocity. In order to do so, we define the stabilityof an edge (u, v) ∈ E at t time steps (or t hops) as follow

st(u, v, t) =∑

|P |=t

w(P )

where P is a path going from v to u (v and u are excluded) of length |P | = t,and w(P ) =

(a,b)∈P wab is the total weight of path P . Finally, we define the

stability st(u, v) of a link (u, v) ∈ E as the total stability of up to T0 timesteps, where T0 is a predefined parameter (or the upper bound on the numberof hops)

st(u, v) =

T0∑

t=1

st(u, v, t). (2)

The intuition behind our stability function st(u, v) is as follows: since sta-ble communities are commonly recognized by a high density of stable edges, itis reasonable to expect that such edges form a cycles. In the senses of directedand weighted networks, the stronger the strength of cycles an edge (u, v) is on,the more stable it is believed to be. On the contrary, edges that connecting orjoining between communities shall hardly be part of many cycles, and eventu-ally result in low stability. Figure 1 illustrates the stability estimates for link(u, v) at 0, 1 and 2 hops.

As a local measure, our suggested stability function has the following ad-vantages (1) it puts more focus on the existence of the mutual link of anylink (u, v) by reserving the original strength of the backwards edge < v, u >.This intuitively agrees with the findings that stable clusters are usually made ofbidirectional links in [40]. Moreover, our formula further takes into account thestrength of cycles containing the current link; (2) the more time (or, numberof hops) we allow, the more stability a link would be. Nevertheless, links thatreally belong to a stable community are more likely to have strong stabilitywhereas those connecting communities are of low stability. These advantagessupport the intuitions of stable communities that we discussed above. Theevaluation of our stability estimation is evaluated in section 6.

In summary, our link stability estimation first predicts the potential of thedual link of any link (u, v) ∈ E such that (v, u) /∈ E by using the modifiedmeasure in equation (1). Next, it evaluates the stability of the every link in the

Page 9: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

given network enriched from the first stage by using equation (2), and utilizesthese stability values as new weights for links in the network. This resultingnetwork will be consequently passed as the input network to our main process:the identification of stable communities.

5 Stable community detection

In this section, we present our main contribution: the stable community identi-fication process. Given the input network enriched with link stability informa-tion, we discover the stable communities by exploring an important connectionbetween the persistence probability of each community and its local networktopology. In the following paragraphs, we first review the concept of LumpedMarkov chain [18][32], and then establish our key connection between thisMarkov chain and the local network topology. Finally, we describe in detailour last but most important process - stable community detection.

5.1 Lumped Markov chain

A Markov chain [35] is a mathematical system representing transitions fromone system’s state to another, between a finite number of predefined states.In terms of social networks, a state can be either a user (a node in thegraph) or a group of tightly connected users (a community) in the networks,whereas transitions can be regarded as the user-to-user or group-to-groupcommunication tendencies. An n-state Markov chain corresponding to an n-node network is commonly represented by the transition πt+1 = πtP , whereπt = (π1,t, π2,t, ..., πn,t) with πu,t is the probability of being at node u at timet, and P = (puv) is the transition matrix. In particular, this n-state Markovchain can be associated to input network by letting the probability of transit-ing from a node u to a neighbor node v as

puv =wuv

j wuj

=wuv

w+u

.

Basically, puv is the probability of a random walker jumps from node u to nodev given the network topology. A Markov chain is said to be at its stationarystate distribution π if π satisfies the equation π = πP . As shown in [16], whenthe network is originally connected P would be irreducible, and thus, the equa-tion π = πP has a unique solution which is strictly positive (πu > 0 ∀u ∈ V )which corresponds to the stationary Markov chain state distribution. When thenetwork is undirected, π can be exactly computed as π = 1

2W0(w1, w2, ..., wn)

with W0 is the total edge weights. However, we do not have an exact form forthe stationary distribution π in general for directed network, and thus, π hasto be computed numerically.

As our ultimate goal is to detect the stable network community structure,we sought to find a good partitioning of V where each partition will remain

Page 10: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

wealthy over time. In the light of Markovian chain method, this correspondsto finding a collection of communities C = {C1, C2, ..., Cq} where a randomwalker would spend most of the time walking inside a community and lesstime wandering among communities. By defining this partition C of q commu-nities, we introduce a so called q-state meta-network where each communityin the network becomes a meta-state. However, at this aggregate level, a ingeneral dynamics Markovian description of a random walker walking amongcommunities is not possible because the Markovian property may not be well-preserved [18]. Nevertheless, this q-state community-to-community transitioncan still be defined using the lumped Markov chain, which correctly describesthe random walker at this scale given the stochastic process is started at thestationary distribution π [16]. This lumped Markov chain is defined via theq × q matrix as U in [32]

U = [diag(πH)]−1HTdiag(π)PH

where H is a n× q binary matrix representing the partitioning C.One of the notable advantages of the lumped Markov chain Πt+1 = ΠtU

defined on U is that it shares the same stationary distribution with the originalMarkov chain, i.e. the new stationary distribution defined by Π = πH satis-fies the equation Π = ΠU . Moreover, the difference between Πt+1 = ΠtU ,starting at Π0 = πU , and the original πtH tends exponentially to zero if thetwo chains are regular. These advantages make the community-based lumpedMarkov chain defined byΠtU a very good approximation of the original n-nodenetwork. We stress that the ability of the lumped Markov chain to describethe random walk dynamics only at stationary is not a limitation for the detec-tion of stable communities. Indeed, this stationary requirement evaluates therandom walk dynamics of all nodes at their stable states, and hence perfectlysupports the concept of stable communities.

In terms of interpretation, each entry ucd of U denotes the chance that arandom walker, at time t, wanders from community c to another community din time t+1. As a result, the diagonal elements uCC ’s (or uC ’s in short) of Uindicate the persistence probabilities that a random walker just walking withina particular community C. Of course, large values of uC ’s are expected formeaningful communities. It is also shown in [32] that in directed and weightedgraphs, uC can be computed as

uC =

i,j∈C πipij∑

i∈C πi

(3)

Note that∑

i,j∈C πipij is the fraction of time a random walker spends on thelinks inside a community C. Hence, uC is indeed the ratio between the amountof time a random walker spends on links and that it spends on nodes in C. Inundirected networks, one can verify that

uC =

i,j∈C πiwij∑

i∈C wi

=2wC

2wC + w(Cout).

Page 11: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

5.2 Connection to the network topology

At this stage, one might try to optimize uC for all communities C ∈ C in orderto maximize the persistence probabilities. However, doing in this way requiressolving for the stationary distribution πi’s in equation (3), which is costlyin large-scale directed networks. Hence, a natural question is “How can weeffectively optimize the persistence probability uC for each community withoutsolving for that costly exact stationary distribution?” As an answer for thischallenging question, we present in Proposition 3 a connection regarding thepersistence probability of a community C and its local topology. In particular,we show that the minimum value of uC can be represented under quantitiesthat only involve C’s local topology. Therefore, optimizing uC can be shiftedas the optimization of these local components, which are inexpensive and easyto compute.

Proposition 3 For any community C ∈ C, at the stationary distribution π,we have the following inequality

uC =

i,j∈C πipij∑

i∈C πi

≥ wC

w+C

.

Proof It is easy to see that

uC =

i,j∈C πipij∑

i∈C πi

=

i∈C πiwi,C

w+

i∑

i∈C πi

.

where wi,C =∑

j∈C wij . Next, we rewrite∑

i∈C πi in the form∑

i∈C πi =

πTeC where eC = (ei)N×1 and ei = 1 if i ∈ C and 0 otherwise. Since π is thestationary distribution of the Markov chain, we have π = πP . Thus

πTeC = πTPeC =∑

i∈C

πi

(

j:(i,j)∈E

1

w+i

)

Now we have,

i∈C

πi × wC =∑

i∈C

πi

(

j:(i,j)∈E

1

w+i

)

wC

≤∑

i∈C

πi

wi,C

w+i

(

t∈C

w+t

)

=∑

i,j∈C

πi

wi,C

w+i

× w+C

Hence, the conclusion follows. The quality holds when all πi equals to eachother and wC = w+

C . This happens when C is a full dually connected cliqueand is disconnected from the rest of the network. ⊓⊔

Proposition 3 is the key theory under our development since its establishesthe connection between the persistence probability of a random walker stayingwithin a community C and the local network topology. As a result, if wecan maximize the later quantity, we can provide insurance to the originaloptimization problem with high confidence.

Page 12: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

5.3 Detecting stable communities

5.3.1 Formulation

Taking into account this intuition, we propose Stable Community Detection(SCD) as an optimization problem defined as follow:

Given a directed, weighted network G = (V,E,w), find a community struc-ture C = {C1, C2, ..., Cq} such that the overall total persistence probability ismaximized:

maxR =∑

C∈C

wC

w+C

subject toCi ∩ Cj = ∅ ∀i, j ∈ {1, 2, ..., q}

q⋃

i=1

Ci = V

Note that in the above formulation, the number of communities q willbe determined by optimizing the objective function R and is not an inputparameter. Indeed, optimizing R provides us q as a very good estimate for theactual number of communities, as we will show in section 6.

5.3.2 Resolution limit analysis

Perhaps one of the most important properties a metric suggested for iden-tifying community structure should satisfy is the ability of overcoming theresolution limit [13], i.e. the metric should be able to detect network com-munities even at different scaling levels. In this subsection, we analyze theresistance to resolution limit of our proposed function R by looking particu-larly at the condition in which two communities should be merged together. Inwhat following, we simplify the situation by considering undirected networks.

Let us consider two communities C1 and C2. Let m12 be the number ofedges connecting C1 and C2. In order to merge C1 and C2 into a bigger com-munity, m12 should satisfy:

mC1

d+C1

+mC2

d+C2

≤ mC1+mC2

+m12

d+C1+ d+C2

The above condition is equivalent to:

mC1

d+C2

d+C1+

mC2

d+C1

d+C2≤ m12

which in turn implies 2√mC1

mc2 ≤ m12. Without loss of generality, we canassume that mC1

≤ mC2, thus 2mC1

≤ m12. This violates the condition ofeven a weak community. Moreover, this inequality implies the sufficient con-dition to merge two adjacent communities depends on the local structure oftwo communities only, regardless of the rest of the network. This observationindicates that our proposed metric R is strongly against the resolution limit.

Page 13: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

5.3.3 Connection to stability estimation

We next verify the following properties of network communities identified byoptimizing our suggested metric R: (1) links within a communities are of highstability and (2) links connecting communities are of low stability values. Thesetwo observations are shown in Propositon 4.

Proposition 4 Let C = {C1, C2, ..., Ck} be a community structure detected byoptimizing R, links within each Ci are of strong stability and those connectingcommunities are of weak stability values.

Proof For any node p ∈ V and subset A ⊆ V , let wp,A be the total weightof all links that p has towards A and vice versa. By this definition, we obtainwp = wp,A + wp,V \A. For any community C ∈ C, s ∈ C and p /∈ C, since p isnot a member of C, we have

wC

w+C

>wC + wp,C

w+C + wp

=wC + wp,C

w+C + wp,C + wp,V \C

,

because otherwise joining p to C will give a better value of R. This equalityequals

wp,C

wp

<wC

w+C

,

which in turn implies that the stability contribution of links joining p to C areinsignificant in comparison to C as a whole.

Similarly, for any node s ∈ C, we have

wC

w+C

>wC − wp,C

w+C − wp

=wC − wp,C

w+C − wp,C − wp,V \C

,

because otherwise excluding s from C will give a better R. This inequalityequals to

ws,C

ws

>wC

w+C

,

which in turn implies that the stability contribution of internal links of C aresignificant in comparison to C as a whole.

5.4 A greedy algorithm for SCD problem

Analyzing the theoretical hardness of the SCD problem is an aspect beyondthe scope of this paper. The NP-hardness of the SCD problem can be shownby a similar reduction to MODULARITY as in [4] (see also [39] and [15] fora comprehensive survey on similar graph clustering problems). Given its NP-hardness, a heuristic approach that can provide a good solution in a timelymanner is therefore more desirable. In this section, we describe a greedy al-gorithm for the SCD problem consisting of three phases: community growing,strengthening and refinement phases described below.

Page 14: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Algorithm 1 SCD AlgorithmInput: A directed weighted graph G = (V,E,w)Output: Community structure C

Growing Phase:

C ← ∅;A← V ;while (∃ unassigned node u ∈ A) do

C ← {u};A← A \ {u};while (∃ v ∈ A such that uC∪{v} > uC) do

v ← argmaxv∈A{uC∪{v}};C ← C ∪ {v};A← A \ {v};

end while

C ← C ∪ {C};end while

Strengthening Phase:

for C ∈ C do

while (∃u ∈ C such that uC < uC\{u}) do

C ← C \ {u};C ← C ∪ {u};

end while

end for

Refining Phase:

while (∃C1, C2 such that uC1∪C2> uC1

+ uC2) do

(C1, C2)← argmaxC1,C2∈C{uC1∪C2− uC1

− uC2};

C ← (C \ {C1, C2}) ∪ {C1 ∪ C2};end while

Return C

Growing phase. This phase is responsible for discovering raw communitiesin the input network. Initially, all nodes are unassigned and do not belongto any community. Next, a random node is selected as the first member (orthe seed) of a new community C, and consequently, new members who helpto maximize C’s persistence probability are gradually admitted into C. Whenthere is no more node that can improve this objective of the current commu-nity, another new community is formed and the whole process is then cycledin the very same manner on this newly formed community.

Strengthening phase. We further rearrange nodes into more appropriatecommunities. Since new members are admitted into a community C in a ran-dom order, C’s objective value could be further improve with the absence ofsome of it members as they can be obstacles for the total stability. This re-quires the reevaluation of all C’s members as a result. Therefore, in this phase,we exclude any node which reduces the persistence probability of a commu-nity and let them be singleton communities. The removal of such nodes createsmore cohesive communities, i.e. communities with higher internal stability.

Refining phase. In the last phase, the global stability of the whole networkis reevaluated. In particular, this last refinement phase looks at the merging of

Page 15: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

two adjacent communities in order to improve the overall objective function.If two communities have a great number of mutual connections between them,it is thus more stable to merge them into one community. The final algorithm,which we call SCD algorithm, is presented in Alg. 1.

6 Experimental results

In this section, we present our results on the discovery of network communi-ties on both synthesized networks with known groundtruths and real-world so-cial traces including NetHEPT, NetHEPT WC collaboration and Foursquare,Twitter and Facebook social networks. We evaluate the following aspects ofour proposed SCD framework (1) the effectiveness of our link stability esti-mation process, (2) the ability of identifying the general network communitystructure without the concept of community stability, i.e. how similar ourdetected communities are in comparison with the groundtruths, and (3) theability of identifying stable communities in reference to the consensus of otherstate-of-the-art methods, including Blondel’s [3], Infomap [34] and OSLOM[21] methods, after their multiple executions as suggested in [20][5][36].

6.1 Setup

6.1.1 Datasets

Synthesized networks. Of course, the best way to evaluate our approaches isto validate them on real-world networks with known community structures.Unfortunately, we often do not know that structures beforehand, or such struc-tures cannot be easily mined from the network topologies. Although synthe-sized networks might not reflect all the statistical properties of real ones, theycan provide us the known groundtruths via planted communities and the abil-ity to vary other network parameters such as sizes, densities and overlappinglevels, etc. Testing community detection methods on generated data has alsobecomes a usual practice that is widely accepted in the field [14].

We use the well-known LFR benchmark [14] to generate 190 weighted anddirected testbeds. Generated data follow power-law degree distribution andcontain embedded communities of varying sizes that capture characteristicsof real-world networks. Parameters are: the number of nodes N = 1000 and5000, the mixing parameter µ = [0.1...1] controlling the overall sharpness ofthe community structure, the minimum (minC) and maximum (maxC) ofcommunity sizes are set to (25, 50) for small-size and (50, 100) for big-sizecommunities as in the standard settings. Each test is averaged over 100 runsfor consistency.

NetHEPT and NetHEPT WC. These traces are widely-used datasets fortesting social-aware detection methods [6][7]. These traces contain informa-tion, mostly the academic collaboration from arXiv’s “High Energy Physics -

Page 16: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Theory” section where nodes stand for authors and links represent coauthor-ships. In their deliverable, the NetHEPT networks contain 15233 nodes and31398 links, and weights on edges are assigned by either uniformly at random(for NetHEPT data) or by weighted cascade (for NetHEPT WC data) wherewuv = 1/din(v) with din(v) is the indegree of a node v.

Facebook. This dataset contains friendship information among New Orleansregional network on Facebook, spanning from September 2006 to January2009 [37]. The data contains more than 63K nodes (users) connected by morethan 1.5 million friendship links with an average node degree of 23.5. In ourexperiments, the weight for each link between users u and v is proportional tothe communication frequency between them, normalized on the whole network.

Foursquare. is a location-based social networking website for mobile devicessuch as smart phones, tablet and PDAs. In general, users check-in at venues us-ing an application in their supported devices by selecting from a list of venuesthe application locates nearby. Each check-in awards the user points and some-times badges. The user who checks in most often to a place becomes the“mayor” and, and users regularly vie for “majorship” (www.foursquare.com).To collect the data, we created the Foursquare API and then crawl the datastarting from some of our accounts. This dataset contain roughly 1.6M con-nections among 44,896 users.

Twitter. To obtain this dataset, we apply the unbiased sampling approachto sample a portion of the network which is provided by Kwak et al. [19]. Thisdataset contains 48,092 users with 16M “followship” connections.

6.1.2 Metric

To measure the quality of the detected communities in comparison with theembedded groundtruths, we evaluate Generalized Normalized Mutual Infor-mation (NMI) [14]. Basically, the NMI(U,V) value of two structures U and Vis 1 if U and V are identical and is 0 if they are totally separated. This is themost important metric for a community detection algorithm because it indi-cates how good the algorithm is in comparison with the planned communities.Higher NMI values are expected for a better community detection algorithm.

6.2 Effect of link stability estimation

We first evaluate the effect of our link stability estimation on the detection ofnetwork communities by comparing NMI values of SCD and its version with NoLink stability Prediction (SCD-NLP). Due to space limit, results of SCD andSCD-NLP are also reported in Figure 2, where those on general communitystructure detection are also presented.

In general, SCD-NLP performs very competitively even without being pre-processed: on synthesized networks with no community size constraint (Figure2(a)), its discovered communities are almost of perfect similarity to the em-bedded ones (NMI values approximately 1) on µ = [0...0.5] whereas the quality

Page 17: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

NM

I (N

= 1

000)

µ

SCDSCD-NLP

BlondelOslom

Infomap 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

NM

I (N

= 5

000)

µ

SCDSCD-NLP

BlondelOslom

Infomap

(a) Networks with minC,maxC unconstrained.

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

NM

I (N

= 1

000)

µ

SCDSCD-NLP

BlondelOslom

Infomap 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

NM

I (N

= 5

000)

µ

SCDSCD-NLP

BlondelOslom

Infomap

(b) Networks with minC = 25,maxC = 50 (small-size).

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

NM

I (N

= 1

000)

µ

SCDSCD-NLP

BlondelOslom

Infomap 0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

NM

I (N

= 5

000)

µ

SCDSCD-NLP

BlondelOslom

Infomap

(c) Networks with minC = 50,maxC = 100 (big-size).

Fig. 2 Results on synthesized networks with different community criteria.

drops down quickly when µ is above 0.5. We note that this drop of detectionquality is controversial and does not necessary imply a bad performance sincenetworks with µ > 0.5 is consider very stochastic, and thus, may not con-tain a clear community structure. Nevertheless, with the help of the stabilityestimation, the performance is now boosted up significantly on SCD as thedetection qualities are very high even for µ > 0.65 (N = 1000) and µ > 0.75(N = 5000), and only drop down when the networks are extremely stochastic(µ > 0.8).

We next take a look at the cases where networks are constrained withsmall-sized (Figure 2(b)) and big-sized communities (Figure 2(c)). We observethat, when community sizes are constrained, SCD-NLP performs much betterthan before and even overcome its prior limit µ = 0.5. In particular, the

Page 18: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=1

0K

, U

W)

µ (a)

OslomBlondelInfomap

SCD

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=1

0K

, U

W)

µ (b)

OslomBlondelInfomap

SCD

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=1

0K

, D

W)

µ (c)

OslomBlondelInfomap

SCD

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=1

0K

, D

W)

µ (d)

OslomBlondelInfomap

SCD

Fig. 3 Results on large synthesized directed weighted (DW) and undirected weighted (UW)networks with 10000 nodes. Left figures: minC = 20,maxC = 2000, Right figures: minC =40,maxC = 4000.

performance of SCD-NLP closely approaches that of SCD, especially in largenetworks (N = 5000). However, SCD-NLP appears to be sensitive to big-sizecommunities in small networks as its quality drops down quickly in Figure 2(c)(left), and seems to favor small-size communities as its plots tend to tanglewith those of SCD (Figure 2(b)). SCD detection quality, thanks to the stabilityestimation, stays wealthy in all test cases.

In summary, these results indicate that (1) without the stability estima-tion process, our suggested metric R appears to be a very good one to detectcommunity structure in general directed and weighted networks, and (2) whenthe community size is constrained, link stability estimation has a little effecton the community detection quality. However, in real-world social networksettings where community sizes are typically unknown, and therefore uncon-strained, the stability estimation has a significant effect on the detection ofnetwork communities. These experiments also confirm the efficacy of our pro-posed stability estimation procedure.

6.3 General community structure detection

We next investigate on the ability to identify general network community struc-ture, i.e. without community stability, in comparison with the aforementionedstate-of-the-art detection methods. Results are also reported in Figure 2.

Page 19: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=2

0K

, U

W)

µ (a)

OslomBlondelInfomap

SCD

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=2

0K

, U

W)

µ (b)

OslomBlondelInfomap

SCD

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=2

0K

, D

W)

µ (c)

OslomBlondelInfomap

SCD

0

0.2

0.4

0.6

0.8

1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

NM

I (N

=2

0K

, D

W)

µ (d)

OslomBlondelInfomap

SCD

Fig. 4 Results on large synthesized directed weighted (DW) and undirected weighted (UW)networks with 20000 nodes. Left figures: minC = 20,maxC = 2000, Right figures: minC =40,maxC = 4000.

In general, the performance of our SCD frameworks on synthesized net-works appears to be better than those of Blondel and Infomap methods, andonly lags behind Oslom’s when the networks are heavily stochastic. When thecommunity size is unconstrained, the detection quality of SCD and other meth-ods, except for Blondel’s, retain at nearly perfect on µ = [0...0.65] (N = 1000)and µ = [0...0.8] (N = 5000) and then all degrade quickly. Among the threemethods, Infomap’s performance appears to be sensitive to some certain mix-ing threshold µ as it NMI values tend to drop directly to 0, whereas Oslomand ours tend to drop down slower. On average, the NMI values of SCD areabout 8% and 3% better than those of Blondel and Informap methods, and areabout 2% lag behind those of Oslom method. Blondel’s method, on the otherhand, does not attain a good performance through due to low NMI values evenat a low range of mixing value µ. An possible explanation for this behavior ofBlondel’s method is due to the effect of resolution limit, as we shall discussbelow.

When the embedded communities are constrained with small and largecommunity sizes, we observe the nearly same behavior of SCD, Oslom andInfomap methods as depicted in Figures 2(b) and 2(c). Blondel’s method gets asignificant improvement in these cases where its performance is closely relatedto the others. As we discussed above, one possible reason for the bad behaviorof Blondel’s method is due to the resolution limit of modularity objectivefunction [12]. As the community size is unconstrained, this resolution limit canmislead Blondel method to merge some communities whose are of small sizes in

Page 20: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Table 1 Running time

Time(h) 1U2 1U4 1D2 1D4 2U2 2U4 2D2 2D4Blondel 0.06 0.07 0.06 0.07 0.07 0.08 0.09 0.09Oslom 2.76 2.78 2.81 2.80 5.95 5.96 6.31 6.42Infomap 2.88 2.85 2.97 3.03 5.98 6.10 6.51 6.81SCD 1.40 1.42 1.56 1.61 2.74 2.8 3.11 3.25

comparison to the rest of the network, thus results in the low NMI values. Onthe other hand, this resolution limit does not take effect when size constraintsare imposed and thus the significant improvement. Our SCD framework, asshown in section 5.3.2, can withstand this scaling limit as its obtains highlycompetitively results. Moreover, the difference between our SCD and othermethods are insignificant on average which indicates that all methods are ableto detect network communities with high quality. This is not a surprising resultsince Blondel, Oslom and Informap are currently state-of-the-art methods buta great motivation for our SCD framework.

To further certify the performance of SCD in data with known groundtruth, we generated 3200 directed and undirected weighted networks withN = 10000 and N = 20000 vertices. The notable advantage of those medium-sized synthesized datasets is, again, the ability to judge community detectionalgorithms based on the embedded communities. Due to our ultimate goal oftesting the network’s stable structures, we set a wider range of communitiessizes than before: minc = 20,maxc = 2000 and minc = 40,maxc = 4000. Agood detection algorithm should be able to reveal the core stable structuresof those communities.

The performance of SCD and other approaches are reported in Figures3 and 4. On these medium-sized data, the results depicted in these Figuresindicate that SCD framework was able to detect a high quality communitystructure via high NMI scores in comparison with other detection algorithms.In particular, SCD was among the top 2 approaches (with Oslom) that ob-tained highest NMI scores even on networks with fuzzy community structures(µ > 0.5). We observed that Blondel method performed rather poor on thesedatasets.

The required execution time is also reported in Table 1 (Note: Short nota-tions “1” stands for the number of nodes, “U” and “D” stand for Undirectedand Directed network, and last “2” and “4” are for the minimum communitysize). Due to it community-compression nature, Blondel method unsurpris-ingly required the least time among the four approaches but in turn, wastraded off by its performance. In general, SCD framework required less ex-ecution time than Oslom and Infomap while still obtained good communitystructures through its high similarity to the embedded groundtruth. This ex-periment on synthesized data provides us with confidence when applying SCDon real-world data.

Page 21: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Blondel Infomap Oslom

NetHEPT

NMIJaccard

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(a) NetHEPT

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Blondel Infomap Oslom

NetHEPT-WC

NMIJaccard

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

(b) NetHEPT WC

0

0.2

0.4

0.6

0.8

1

Blondel Infomap Oslom

Foursquare

NMIJaccard

(c) Foursquare

0

0.2

0.4

0.6

0.8

1

Blondel Infomap Oslom

Twitter

NMIJaccard

(d) Twitter

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Blondel Infomap Oslom

Facebook

NMIJaccard

(e) Facebook

Fig. 5 Results of stable community detection on real social traces.

6.4 Results on stable community detection

In order to compare our results to the consensus of other detection methods,we will adopt a strategy recently proposed in [20]. In particular, given a specificcommunity detection method A, its consensus (or stable) communities can bedetermined by:

– Execute A on G np times to have np partitions.– Find the matrix D = (Dij) where Dij is the probability which vertices i

and j of G are assigned to the same cluster among np partitions.

Page 22: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

Table 2 Running time

Time Twitter Foursquare TwitterBlondel 5.8min 3.4min 6.2minOslom 3.4h 2.9h 4.2hInfomap 3.5h 3.1h 4.6hSCD 25.2min 17.9min 36.2min

– All Dij ’s that are below a threshold τ will be disregarded (iv) Apply A onD np times, so to create np partitions and (v) if all partitions are equal,stop (the result matrix would be block diagonal). Otherwise go to step (ii).

As suggested in [20], the resulted communities are ideal candidates for sta-ble structures as members commit to their communities. We also compute the

Jaccard index J(U, V ) = |A∩B||A∪B| to better evaluate the quality of the detected

stable communities. Results are represented in Figure 5.

As illustrated by the subfigures, even a single run of SCD is able to obtainvery high NMI scores and Jaccard indicies in comparison with the consen-sus of other methods after multiple runs. In particular, community structuresdiscovered by SCD on NetHEPT, NetHEPT WC, Foursquare and Twitter ob-tain nearly 70% similarity in comparison with Blondel, Infomap and Oslommethods, meanwhile the Jaccard indicies indicate that, in average, almost 66%number of nodes are found in common between SCD and the core structure ofother competitors. This show that communities discovered by SCD are indeedhighly overlap with core community structures identified by other detectionmethods, which in turns implies that those clusters found by SCD are stablewith high confidence. Surprising, in all real-world networks except Facebook,we observe the high similarity among the consensus of Blondel, Infomap andOslom methods even with difference in edge weight distribution. This obser-vation indicate those identified communities by SCD are, in fact, stable inthese networks. In Facebook, a large network with real social interactions, thesimilarity between consensus communities discovered by other methods andby SCD are still of high similarity with nearly 60%, 50% similarity to thosefound by Blondel and Infomap methods with over 50% overlap in the stablepartitions as indicated by the Jaccard indices. The achieved NMI values incomparison with Oslom method are relatively low as their core communitiesdo not appear to highly overlap (Jaccard index of only 35%). We note thatthis low similarity does not indicate the unstability community structure ofour SCD framework since communities detected by Oslom can be overlappedwith each other, while SCD works towards disjoint community structure. Nev-ertheless, as just a single run, the above competitively results in reference toother state-of-the-art methods confirm the efficay and quality of our methodin detecting stable network communities in OSNs.

Page 23: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

7 Conclusion

In this work, we investigate community structures in directed OSNs with morefocus on community stability. As an effort towards the understanding of stablecommunities, we suggest an estimation procedure which provides helpful in-sights into the stability of links in the input network. Based on that, we proposeSCD, a framework to identify community structure in directed OSNs with theadvantage of community stability. We explore an essential connection betweenthe persistence probability of a community at the stationary distribution andits local topology, which is the fundamental point to back our SCD framework.Finally, we certify the efficiency of our approach on both synthesized datasetswith embedded communities and real-world social traces in reference to theconsensus of other state-of-the-art detection methods. Highly competitive em-pirical results confirm the quality and efficiency of SCD on identifying stablecommunities in OSNs.

References

1. Backstrom, L., Leskovec, J.: Supervised random walks: predicting and recommendinglinks in social networks. In: Proceedings of the fourth ACM international conferenceon Web search and data mining, WSDM ’11, pp. 635–644. ACM, New York, NY, USA(2011). DOI 10.1145/1935826.1935914

2. Barnes, E.R.: An algorithm for partitioning the nodes of a graph. SIAM Journal onAlgebraic and Discrete Methods 3(4), 541–550 (1982). DOI 10.1137/0603056

3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of com-munities in large networks. Journal of Statistical Mechanics: Theory and Experiment2008(10), P10,008 (2008)

4. Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z., Wagner,D.: On modularity clustering. IEEE Trans. on Knowl. and Data Eng. 20(2), 172–188(2008). DOI 10.1109/TKDE.2007.190689

5. Chakraborty, T., Srinivasan, S., Ganguly, N., Bhowmick, S., Mukherjee, A.: ConstantCommunities in Complex Networks. Scientific Reports 3 (2013)

6. Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viralmarketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDDinternational conference on Knowledge discovery and data mining, KDD (2010)

7. Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networksunder the linear threshold model. In: The 10th IEEE Int. Conference on Data Mining,ICDM (2010)

8. Delvenne, J.C.C., Yaliraki, S.N., Barahona, M.: Stability of graph communities acrosstime scales. Proceedings of the National Academy of Sciences of the United States ofAmerica 107(29), 12,755–12,760 (2010). DOI 10.1073/pnas.0903215107

9. E, W., Li, T., Vanden-Eijnden, E.: Optimal partition and effective dynamics of complexnetworks. Proceedings of the National Academy of Sciences of the United States ofAmerica 105(23), 7907–7912 (2008)

10. Fire, M., Tenenboim, L., Lesser, O., Puzis, R., Rokach, L., Elovici, Y.: Link predic-tion in social networks using computationally efficient topological features. In: Social-Com/PASSAT, pp. 73–80. IEEE (2011)

11. Fortunato, S.: Community detection in graphs. Physics Reports 486(3-5), 75 – 174(2010)

12. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. physics0607100104(1), 8 (2006)

13. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proceedingsof the National Academy of Sciences 104(1), 36 (2007)

Page 24: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

14. Fortunato, S., Lancichinetti, A.: Community detection algorithms: a comparative analy-sis: invited presentation, extended abstract. In: Proceedings of the Fourth InternationalICST Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS’09, pp. 27:1–27:2. ICST (2009)

15. Garey, M.R., Johnson, D.S.: Computers and Intractability; A Guide to the Theory ofNP-Completeness. W. H. Freeman & Co., New York, NY, USA (1990)

16. Hoffmann, K.H., Salamon, P.: Bounding the lumping error in markov chain dynamics.Applied Mathematics Letters 22(9), 1471–1475 (2009)

17. Jamali, M., Haffari, G., Ester, M.: Modeling the temporal dynamics of social rat-ing networks using bidirectional effects of social relations and rating patterns. In:Proceedings of the 2010 IEEE International Conference on Data Mining Workshops,ICDMW ’10, pp. 344–351. IEEE Computer Society, Washington, DC, USA (2010).DOI 10.1109/ICDMW.2010.103

18. Kemeny, J.G., Hazleton, M., J. Laurie, S., Gerald L., T.: Finite mathematical structures.1st edition. Englewood Cliffs, N.J.: Prentice-Hall, Inc (1959)

19. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media?In: Proceedings of the 19th International Conference on World Wide Web, WWW ’10,pp. 591–600. ACM, New York, NY, USA (2010). DOI 10.1145/1772690.1772751. URLhttp://doi.acm.org/10.1145/1772690.1772751

20. Lancichinetti, A., Fortunato, S.: Consensus clustering in complex networks. ScientificReports 2 (2012). DOI 10.1038/srep00336

21. Lancichinetti, A., Radicchi, F., Ramasco, J.J., Fortunato, S.: Finding statisticallysignificant communities in networks. PLoS ONE 6(4), e18,961 (2011). DOI10.1371/journal.pone.0018961

22. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative linksin online social networks. In: Proceedings of the 19th international conference onWorld wide web, WWW ’10, pp. 641–650. ACM, New York, NY, USA (2010). DOIhttp://doi.acm.org/10.1145/1772690.1772756

23. Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks. In:Proceedings of the twelfth international conference on Information and knowledgemanagement, CIKM ’03, pp. 556–559. ACM, New York, NY, USA (2003). DOI10.1145/956863.956972

24. Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416(2007). DOI 10.1007/s11222-007-9033-z

25. Newman, M.: Fast algorithm for detecting community structure in networks. PhysicalReview E 69 (2003)

26. Newman, M.E.J.: The structure and function of complex networks. SIAM Review 45(2),167–256 (2003)

27. Newman, M.E.J.: Modularity and community structure in networks. Proceed-ings of the National Academy of Sciences 103(23), 8577–8582 (2006). DOI10.1073/pnas.0601602103

28. Newman, M.E.J., Leicht, E.A.: Mixture models and exploratory analysis in networks.Proceedings of the National Academy of Sciences of the United States of America104(23), 9564–9569 (2007)

29. Nguyen, N.P., Dinh, T.N., Tokala, S., Thai, M.T.: Overlapping communities in dynamicnetworks: their detection and mobile applications. In: Proceedings of the 17th annualinternational conference on Mobile computing and networking, MobiCom ’11, pp. 85–96.ACM, New York, NY, USA (2011). DOI 10.1145/2030613.2030624

30. Palla, G., Barabasi, A.L., Vicsek, T.: Quantifying social group evolution. Nature446(7136), 664–667 (2007). DOI 10.1038/nature05670

31. Pasztor, B., Mottola, L., Mascolo, C., Picco, G.P., Ellwood, S., Macdonald, D.: Selectivereprogramming of mobile sensor networks through social community detection. In:Proceedings of the 7th European conference on Wireless Sensor Networks, EWSN’10,pp. 178–193. Springer-Verlag, Berlin, Heidelberg (2010). DOI 10.1007/978-3-642-11917-0 12

32. Piccardi, C.: Finding and testing network communities by lumped markov chains. PLoSONE 6(11), e27,028 (2011). DOI 10.1371/journal.pone.0027028

33. Reichardt, J., Bornholdt, S.: Detecting fuzzy community structures in complex networkswith a potts model. Physical Review Letters 93(21), 218,701 (2004)

Page 25: A Method to Detect Communities with Stability in Social ...mythai/files/stability.pdf · phenomena, the power-law degree distribution, and especially, the property of containing community

34. Rosvall, M., Bergstrom, C.T.: Mapping change in large networks. PLoS ONE 5(1),e8694 (2010). DOI 10.1371/journal.pone.0008694

35. Sean P., M., Richard L., T.: Markov chains and stochastic stability. 2nd edition. Cam-bridge University Press (2009)

36. Seifi, M., Junier, I., Rouquier, J.B., Iskrov, S., Guillaume, J.L.: Stable community coresin complex networks. In: R. Menezes, A. Evsukoff, M.C. Gonzlez (eds.) Complex Net-works, Studies in Computational Intelligence, vol. 424, pp. 87–98. Springer Berlin Hei-delberg (2013). DOI 10.1007/978-3-642-30287-9 10

37. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user inter-action in facebook. In: 2nd ACM SIGCOMM Workshop on Social Networks (2009)

38. Viswanath, B., Post, A., Gummadi, K.P., Mislove, A.: An analysis of social network-based sybil defenses. In: Proceedings of the ACM SIGCOMM 2010 confer-ence, SIGCOMM ’10, pp. 363–374. ACM, New York, NY, USA (2010). DOI10.1145/1851182.1851226

39. Sıma, J., Schaeffer, S.E.: On the np-completeness of some graph cluster measures. In:Proceedings of the 32nd conference on Current Trends in Theory and Practice of Com-puter Science, SOFSEM’06, pp. 530–537. Springer-Verlag, Berlin, Heidelberg (2006).DOI 10.1007/11611257 51

40. Yanhua Li Zhi-Li Zhang, J.B.: Mutual or unrequited love: Identifying stable clusters insocial networks with uni- and bi-directional links. In: LNCS WAW 2012 (to appear)(2012)