
Social Butterfly: Social Caches for Distributed Social Networks

Lu Han, Badri Nath, Liviu Iftode, S. Muthukrishnan

Department of Computer Science, Rutgers, The State University of New Jersey
{luhan, badri, iftode, muthu}@cs.rutgers.edu

Abstract: A distributed architecture for implementing online social networks (OSNs) can overcome several disadvantages of the now popular centralized online social networks such as Facebook or Twitter. Owners of centralized OSNs control all of the individuals’ data and associated policies on dissemination of data. Hence, individuals can hardly exert control over their personal data, resulting in serious potential privacy violations. Distributed OSNs, where each individual hosts their personal data, provide greater control over the ownership of personal data. However, for distributed OSNs to be scalable, methods that optimize peer-to-peer data dissemination need to be developed. In this paper, we introduce Social Caches as a way to alleviate the peer-to-peer traffic that would ensue in a distributed OSN. Social caches act as local bridges among friends for efficient information delivery. We provide a novel social communication model, Social Butterfly, for distributed OSNs that utilize social caches. We further formulate social cache selection as the Neighbor-Dominating Set Problem, and propose two algorithms to solve it: 1) Approximate NDS, and 2) Social Score algorithm. Performance evaluation based on real social traffic data shows that both algorithms reduce peer-to-peer social traffic by an order of magnitude.

I. INTRODUCTION

Various distributed social networks (DSNs) have been proposed and developed recently, such as Diaspora¹, DSNP², Safebook [1], and AppleSeed³. A more detailed list of distributed social networks can be found on Wikipedia.⁴ The proposed DSNs are expected to overcome some of the problems associated with centralized social networks such as Facebook, Twitter, and LinkedIn. The main drawback of these centralized social networks is the lack of privacy and control over personal content stored on servers. Distributed social networks, on the other hand, allow users to store and organize personal content at the place of their choosing, such as cloud storage, enterprise servers or personal devices. With such flexible distributed storage, users have, to a large extent, control over their personal data and can share this data with social contacts on their terms.

As distributed social networks do not have a central server, reads and writes of social updates require a separate peer-to-peer connection for each social contact. Hence, we propose a form of distributed caching scheme that intelligently places data in the network so as to optimize the message traffic that originates from members of the OSNs. Certain friends “cache” the updates for other friends, and these selected friends act as “Social Caches.” By utilizing social caches, we can optimize social message traffic, reduce the transmission latency, and reduce bandwidth usage in the network.

¹ http://joindiaspora.com/
² http://www.complang.org/dsnp/
³ http://opensource.appleseedproject.org/
⁴ http://www.en.wikipedia.org/wiki/Distributed_social_network

In this paper, we propose a novel distributed social network data dissemination model, Social Butterfly, which utilizes social caches to cache social updates before the updates are requested by friends. In order to optimize the social traffic, we formulate social cache selection as the Neighbor-Dominating Set problem and propose two social cache selection algorithms: (i) the Approximate NDS algorithm and (ii) the Social Score algorithm. The Approximate NDS algorithm selects social caches by reducing the problem to the Set Cover problem. The Social Score algorithm selects the vertices with the highest social scores, where a social score is a representation of a vertex’s social properties. Real social network data is used for comparing the two algorithms, and the evaluation shows that both algorithms can efficiently reduce the network traffic with the dataset that we considered.

II. RELATED WORK

Recently, Distributed Social Networks (DSNs) such as Diaspora and Safebook [1] have been proposed. Diaspora allows users to host their data on their personal machines and communicate with friends in a P2P manner. Cutillo et al. proposed Safebook [1], a distributed social network that leverages trust relationships to cope with the lack of trust and cooperation in P2P systems. The system utilizes “Matryoshkas”, concentric rings of nodes built around each member, to provide data storage to replicate friends’ profiles. Other DSNs are being developed, such as PeerSoN [2] and PrPI [3]. They have yet to exploit social update patterns and do not provide an efficient information delivery mechanism for social updates.

III. SOCIAL BUTTERFLY

In this section, we describe the design of Social Butterfly, which selects certain nodes as Social Caches for bridging social update delivery between producers and consumers, thereby reducing the cost of operating a distributed OSN. The selected social cache nodes act as “local servers”, while the remaining nodes are assigned as members of one or more social caches to form Social Clusters. A member node only contacts the social caches it is associated with, while a social cache can be a member of multiple social clusters. Every node in the social network can be both a social consumer and producer. Producers push social updates to the social caches they are associated with, and consumers fetch data from the corresponding social caches when they want to learn the updates from their friends.

Fig. 1. Example comparing update costs of DSN without and with social caches. The edges in Fig-a show the friends relationship. The push scheme incurs an update cost of $\sum_{i=1}^{6} \mathrm{degree}(N_i) = 14$. The edges in Fig-b show the update relationship with social caches C1 and C2. Updates are from N1, N2, N3 and C2 to C1, and N4 to C2, which incurs a total cost of 5.

Fig. 2. Given the same graph, the top two graphs show that social cache and distributed cache selection yield different caches. The bottom two graphs show that social cache and distributed cache selection yield the same caches. The shaded vertices are the selected caches.

Typically, consumers and producers in social networks are friends and are at one-hop distance. This property requires that a social cache bridging a consumer and a producer be friends with both nodes if neither of them is a social cache. Thus, social caches hold information that belongs only to friends.

Fig. 1 gives an example of the distributed social network architecture with six nodes. The edges in Fig. 1-a show the friends relationship and the edges in Fig. 1-b show the update relationship, where C1 and C2 act as social caches. In Fig. 1-a, the push scheme will incur an update cost equal to the sum of the out degree of all the nodes, which is significantly higher than the update cost incurred in the network with social caches (5 ≪ 14) as shown in Fig. 1-b.
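The cost comparison can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the friendship edges are the seven edges that Table I lists for Fig. 1-a, and the five cached update links follow the Fig. 1 caption, reading its last link as N4 to C2 (an assumption, since C1 and C2 are the only caches named). The function names are hypothetical.

```python
# Friendship edges of the six-node example in Fig. 1-a (as listed in Table I).
FRIEND_EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (4, 5), (4, 6)]

# Update links with social caches, taken from the Fig. 1 caption.
# Assumption: the last link is read as N4 -> C2.
CACHED_LINKS = [("N1", "C1"), ("N2", "C1"), ("N3", "C1"),
                ("C2", "C1"), ("N4", "C2")]


def push_cost(edges):
    """Push-to-all-friends cost: one connection per (node, friend) pair.

    This is the sum of all vertex degrees, which equals 2 * |E|.
    """
    return 2 * len(edges)


def cached_cost(links):
    """Cost with social caches: one connection per update link."""
    return len(links)


print(push_cost(FRIEND_EDGES), cached_cost(CACHED_LINKS))
```

For this toy graph the two functions return 14 and 5, the values quoted in the text.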

IV. SOCIAL CACHING VS DISTRIBUTED CACHING

Distributed caching [4], [5] and Dynamic Data Replication [6] are well-studied topics. However, they differ fundamentally from social caching in several aspects. First, social network topology differs from that of distributed networks by showing non-trivial clustering and positive degree correlations [7]. Second, selecting social caches is social-relationship-driven; information providers proactively send updates to the selected social caches. Third, social caches are selected to cache updates only for friends, to ensure security requirements, and only allow one-hop communication (i.e., communication between friends). Therefore, neither distributed cache selection algorithms nor dynamic data replication placement methods can be used for selecting social caches.

Fig. 2 illustrates that, given the same network topologies, social cache selection and distributed cache selection methods can yield both different and identical caches. Fig. 2-a shows cache selection scenarios in distributed social networks, where vertices represent social nodes and edges represent friendship relations; Fig. 2-b shows cache selection scenarios in distributed networks, where edges represent connectivity/distance. The shaded vertices are the selected caches. The top two graphs show that the two cache selection methods select different caches. In the social cache scenario, node 3 is the social cache through which nodes 1, 2 and 4 send social updates and request friends’ updates. Node 5 is the social cache for nodes 4 and 6. In the distributed cache scenario, nodes 3, 4, and 5 could be selected as caches depending on where the providers and consumers are relative to each other, as there is no one-hop constraint. The bottom two graphs show that the two methods select identical caches. In the social cache scenario, node 3 is the cache for nodes 1, 2, and 4, while node 4 is the cache for nodes 3, 5 and 6. The distributed cache selection scheme produces the same caches, in which node 3 caches content for nodes 1 and 2, and node 4 caches content for nodes 5 and 6.

V. SOCIAL CACHE SELECTION ALGORITHMS

The selected social caches should satisfy the following requirements. Given an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, the social caches form a Neighbor-Dominating Set (NDS) such that:
• Every vertex should either be a social cache or connect to at least one social cache;
• A pair of friends should be connected by at least one social cache if neither of them is a social cache.

In this section, we define the NDS problem formally, and discuss two social cache selection algorithms: the Approximate NDS algorithm, which approximates the set cover problem, and the Social Score algorithm, which takes into account the nodes’ connectivity in a social graph.

A. Problem Definition

Define N(u) ⊆ V to be the set of neighbors of node u, that is, v ∈ N(u) iff (u, v) ∈ E.

Definition 5.1: A Neighbor-Dominating Set of graph G = (V, E) is a set S ⊆ V of vertices such that for each edge (u, v) ∈ E, there exists a w ∈ S satisfying w ∈ (N(u) ∩ N(v)) ∪ {u, v}.

Intuitively, for each edge (u, v), either one of its endpoints or one of the common neighbors of u and v should be in the neighbor-dominating set.
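Definition 5.1 translates directly into a membership check. The following is a minimal sketch, assuming the graph is given as a list of undirected edges; the helper names `adjacency` and `is_neighbor_dominating` are illustrative and not from the paper.

```python
from collections import defaultdict


def adjacency(edges):
    """Build N(u) for every vertex from a list of undirected edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def is_neighbor_dominating(edges, candidate):
    """Return True if `candidate` satisfies Definition 5.1 for the graph."""
    adj = adjacency(edges)
    candidate = set(candidate)
    for u, v in edges:
        # Vertices allowed to cover edge (u, v): its endpoints and
        # the common neighbors of u and v.
        allowed = (adj[u] & adj[v]) | {u, v}
        if not (allowed & candidate):
            return False
    return True


# Edges of the six-node example graph (Fig. 1-a, as listed in Table I).
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (4, 5), (4, 6)]
print(is_neighbor_dominating(EDGES, {2, 4}))  # both caches present
print(is_neighbor_dominating(EDGES, {2}))     # edge (4, 6) is left uncovered
```

On the example graph, {2, 4} passes the check, while {2} alone fails because neither endpoint nor any common neighbor of edge (4, 6) is selected.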

We should distinguish this from a number of related, well-known concepts. For example, a vertex cover is a set S ⊆ V such that for each edge (u, v) ∈ E, either u ∈ S or v ∈ S. Also, a dominating set is a set S ⊆ V of vertices such that for all vertices u ∈ V − S, there exists an edge (u, w) ∈ E, where w ∈ S. Fig. 3 shows examples where these concepts differ from one another.

Definition 5.2: (NDS problem) Given an undirected graph G = (V, E), find a neighbor-dominating set of the smallest size.


Fig. 3. The top two graphs show that the minimum vertex cover (on the left) is not the same as the minimum neighbor-dominating set (on the right). The bottom two graphs show that the minimum dominating set (on the left) is not the same as the minimum neighbor-dominating set (on the right).

B. Approximate NDS Algorithm

Our algorithm is as follows. From an instance of the NDS problem, we generate an instance of the set cover problem. Given a collection S of sets over a universe U, a set cover is a collection of sets from S whose union is U [8].

The universe is the set E of edges. For each vertex u ∈ V, generate a set Su such that e = (α, β) ∈ E belongs to Su iff u ∈ (N(α) ∩ N(β)) ∪ {α, β}. Let S be the collection of all such Su’s.
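The construction of the sets Su can be sketched as follows. This is an illustrative helper (the name `build_edge_sets` is hypothetical), assuming an edge-list input; edges are stored as sorted tuples so that (α, β) and (β, α) are the same element of the universe.

```python
from collections import defaultdict


def build_edge_sets(edges):
    """Map each vertex u to S_u, the set of edges that u can cover.

    Edge (a, b) belongs to S_u iff u is an endpoint of the edge or a
    common neighbor of a and b.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    edge_sets = {u: set() for u in adj}
    for a, b in edges:
        edge = (min(a, b), max(a, b))
        for u in (adj[a] & adj[b]) | {a, b}:
            edge_sets[u].add(edge)
    return edge_sets


# Six-node example graph of Fig. 1-a.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (4, 5), (4, 6)]
for vertex, subset in sorted(build_edge_sets(EDGES).items()):
    print(vertex, sorted(subset))
```

Run on the example graph, the helper yields the same subsets as Table I; for instance, vertex 2 covers six of the seven edges and vertex 6 covers only (4, 6).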

Lemma 5.3: Each set cover T gives a neighbor-dominating set of size |T|, and each neighbor-dominating set S gives a set cover of size |S|.
Proof. Consider a set cover T of S. Each set Su in T corresponds to some vertex u; let S be the set of all such u’s. Since T is a set cover, for each edge e = (α, β) there is some set Su ∈ T that contains it, and in turn u ∈ S. Thus, S is a neighbor-dominating set. The proof that each neighbor-dominating set S gives a set cover of size |S| is similar. □

Theorem 5.4: There is an O(log m) approximation algorithm for the neighbor-dominating set problem that takes time O(md), where m is the number of edges and d is the maximum degree of any node in G.
Proof. This follows from the best known approximation for the set cover problem [8]. □

We implement a greedy algorithm for the set cover problem. For each vertex vi ∈ V, generate a set Si that contains the edges e = (α, β) where either vi ∈ {α, β} or vi is a common neighbor of α and β. We have a finite set of edges E and a family S of subsets of E, such that every element of E belongs to at least one subset in S. At each stage, the algorithm picks the set Si ∈ S that covers the greatest number of elements not yet covered. The vertices vi whose sets Si are selected form the set of social caches. The vertices in each set Si form the corresponding social cluster. Algorithm 1 describes the Approximate NDS algorithm.

Let us look at the example graph shown in Fig. 1-a. For each vertex vi ∈ V, we generate the set of edges in vi’s egocentric network, which is shown in Table I. According to Algorithm 1, vertex 2 is selected first, as it covers 6 edges out of 7. Vertex 4 (or 6) is selected next, which covers the remaining edge (4, 6). Therefore, {2, 4} are the selected social caches.
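The greedy selection on this example can be traced with a short script. This is a sketch rather than the authors' implementation: the subsets are copied from Table I, the function name `greedy_nds` is hypothetical, and ties are broken in favor of the smaller vertex id, which is an assumption (the text allows either vertex 4 or vertex 6 in the second step).

```python
# Edge subsets per vertex, copied from Table I (edges as sorted tuples).
EDGE_SETS = {
    1: {(1, 2), (1, 3), (2, 3)},
    2: {(1, 2), (2, 3), (1, 3), (2, 4), (2, 5), (4, 5)},
    3: {(1, 3), (2, 3), (1, 2)},
    4: {(2, 4), (4, 5), (2, 5), (4, 6)},
    5: {(2, 5), (4, 5), (2, 4)},
    6: {(4, 6)},
}


def greedy_nds(edge_sets):
    """Greedy set cover: repeatedly pick the vertex covering most uncovered edges."""
    uncovered = set().union(*edge_sets.values())
    caches = []
    while uncovered:
        # max() returns the first maximal element, so iterating over the
        # sorted vertex ids breaks ties toward the smaller id (assumption).
        best = max(sorted(edge_sets),
                   key=lambda v: len(edge_sets[v] & uncovered))
        caches.append(best)
        uncovered -= edge_sets[best]
    return caches


print(greedy_nds(EDGE_SETS))
```

With this tie-breaking rule the script selects vertex 2 and then vertex 4, matching the caches named in the text.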

The Approximate NDS algorithm works well on social graphs, and is valid for any graph; however, the algorithm does not exploit any properties exhibited by the social graph.

Algorithm 1 Approximate NDS Algorithm
Input: undirected graph G = (V, E)
Outputs: Social Caches C and Social Clusters SC
Initialization:
Universal Set = E
Covered Set = φ
for each vertex vi ∈ V do
    generate Si containing the edges e = (α, β), where vi ∈ (N(α) ∩ N(β)) ∪ {α, β}
end for
Main algorithm:
while Universal Set != φ do
    select Si that maximizes |Universal Set ∩ Si|
    Universal Set = Universal Set − Si
    Covered Set = Covered Set ∪ Si
end while

TABLE I
SUBSET OF EDGES FOR EACH VERTEX IN FIG. 1-A

Vertex  Edges in subset
1       {(1,2), (1,3), (2,3)}
2       {(1,2), (2,3), (1,3), (2,4), (2,5), (4,5)}
3       {(1,3), (2,3), (1,2)}
4       {(2,4), (4,5), (2,5), (4,6)}
5       {(2,5), (4,5), (2,4)}
6       {(4,6)}

Since we are designing a novel selection scheme in the context of social networks, we can also integrate social properties into the cache selection algorithm. We will now discuss an algorithm that exploits social properties, called the Social Score Algorithm.

C. Social Score Algorithm

Before describing the algorithm, we will discuss two important social network properties that are used in this algorithm.

1) Clustering Coefficient: Clustering coefficient is a measure of the degree to which vertices in a graph tend to cluster together [9]. Social networks show a community-structure property: groups of nodes have a high density of links within them, with a lower density of links between groups. A vertex with a lower clustering coefficient may connect to many unconnected nodes which belong to different communities.
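As an illustration, the local clustering coefficient of a vertex can be computed as the fraction of pairs of its neighbors that are themselves linked. The sketch below uses the six-node graph of Fig. 1-a; the adjacency dictionary `ADJ` and the function name are illustrative, and returning 0 for vertices with fewer than two neighbors is a convention assumed here because it agrees with the value listed for node 6 in Table II.

```python
from itertools import combinations

# Adjacency sets of the six-node example graph (Fig. 1-a).
ADJ = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2},
    4: {2, 5, 6},
    5: {2, 4},
    6: {4},
}


def clustering_coefficient(adj, v):
    """Fraction of neighbor pairs of v that are directly connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed for degree-0 and degree-1 vertices
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)


for v in sorted(ADJ):
    print(v, round(clustering_coefficient(ADJ, v), 3))
```

Node 2, for example, has four neighbors and two links among them, (1,3) and (4,5), giving 2/6 ≈ 0.333.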

2) Egocentric Betweenness Centrality: An egocentric network [10] is a “local” network for a vertex which contains only the vertex and its one-hop neighbors. Betweenness centrality measures to what extent a vertex can facilitate communication with others in the network. Vertices that occur on many shortest paths between other vertices have higher betweenness centrality. Egocentric Betweenness Centrality [10] is the corresponding betweenness for an egocentric network. It not only shows how many times a vertex is on the shortest paths between its neighbors in the egocentric network, but also indicates how many additional pairs of nodes among the neighbors could be connected if the vertex is selected as a social cache.

TABLE II
TABLE FOR SOCIAL SCORE ALGORITHM.

Node  Degree  CC     EBC    SS    IsCache  Cache list
1     2       1.0    0.0    0.0   N        {2}
2     4       0.333  0.667  16/3  Y        {}
3     2       1.0    0.0    0.0   N        {2}
4     3       0.333  0.667  12/3  Y        {2}
5     2       1.0    0.0    0.0   N        {2}
6     1       0.0    0.0    1.0   N        {4}
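One way to compute egocentric betweenness is sketched below. For every pair of non-adjacent neighbors of the ego, the ego lies on one of the length-two shortest paths between them; the pair contributes the ego's share of those paths. Dividing by the number of neighbor pairs is an assumption made here because it reproduces the EBC column of Table II; the paper does not spell out its normalization. The names `ADJ` and `egocentric_betweenness` are illustrative.

```python
from itertools import combinations

# Adjacency sets of the six-node example graph (Fig. 1-a).
ADJ = {
    1: {2, 3},
    2: {1, 3, 4, 5},
    3: {1, 2},
    4: {2, 5, 6},
    5: {2, 4},
    6: {4},
}


def egocentric_betweenness(adj, ego):
    """Normalized betweenness of `ego` within its one-hop ego network."""
    nbrs = adj[ego]
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = 0.0
    for a, b in combinations(nbrs, 2):
        if b in adj[a]:
            continue  # a and b are adjacent: the ego is not on their shortest path
        # Other length-2 paths inside the ego network go through a vertex
        # that neighbors a, b, and the ego.
        other_bridges = len(adj[a] & adj[b] & nbrs)
        total += 1.0 / (1 + other_bridges)
    # Assumed normalization: divide by the number of neighbor pairs.
    return total / (k * (k - 1) / 2)


for v in sorted(ADJ):
    print(v, round(egocentric_betweenness(ADJ, v), 3))
```

For node 2, four of the six neighbor pairs are non-adjacent and are bridged only by node 2, giving 4/6 ≈ 0.667 as in Table II.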

3) Social Score Algorithm: In addition to clustering coefficient and egocentric betweenness centrality, vertex degree is also an important factor. Higher-degree vertices, which have more neighbors, should have a higher probability of being selected as a social cache. Therefore, vertices with lower clustering coefficient (CC), higher egocentric betweenness centrality (EBC), and higher degree connect to a larger number of non-interconnected nodes, and are potential candidates for social caches.

The above discussion suggests that a social score can be defined as a combination of clustering coefficient and egocentric betweenness centrality, scaled by the vertex degree D, in the form of:

$SS = ((1 - CC) + EBC) \cdot D$

A similar metric has been used in ad hoc networks to select the backbone nodes for energy-efficient forwarding [11].

The Social Score algorithm works as follows. For each vertex vi in the graph, we maintain a table with the following six fields: clustering coefficient, egocentric betweenness centrality, social score, IsCache, Cache List, and Member List. The IsCache field is a boolean variable that indicates whether the vertex is a social cache, and the Cache List field stores the social caches that vertex vi is affiliated with. Member List holds all the members within a social cluster if the vertex is a cache. Algorithm 2 presents the Social Score algorithm. During each stage, we pick the vertex with the highest score as a social cache, and treat its neighbors that have not yet been covered by other social clusters as its social cluster members.

We use Fig. 1-a as an example to illustrate the algorithm. The clustering coefficient, egocentric betweenness centrality and social score are calculated in Table II. Node 2 and node 4 are the selected social caches, with sets {1,3,4,5} and {6} as social cluster members.
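The example can be traced in code. The sketch below is one reading of Algorithm 2, not the authors' implementation: vertices are visited in decreasing order of social score (ties broken by vertex id, an assumption), and a vertex becomes a cache if it has friends that are neither one of its own caches nor friends of those caches. The degree, CC and EBC values are copied from Table II; the names `TABLE_II`, `ADJ`, `social_score` and `select_caches` are illustrative.

```python
# (degree, CC, EBC) per node, copied from Table II.
TABLE_II = {
    1: (2, 1.0, 0.0), 2: (4, 1 / 3, 2 / 3), 3: (2, 1.0, 0.0),
    4: (3, 1 / 3, 2 / 3), 5: (2, 1.0, 0.0), 6: (1, 0.0, 0.0),
}

# Adjacency sets of the six-node example graph (Fig. 1-a).
ADJ = {1: {2, 3}, 2: {1, 3, 4, 5}, 3: {1, 2},
       4: {2, 5, 6}, 5: {2, 4}, 6: {4}}


def social_score(degree, cc, ebc):
    """SS = ((1 - CC) + EBC) * D."""
    return ((1 - cc) + ebc) * degree


def select_caches(adj, scores):
    """Return ({cache: member set}, {vertex: cache list})."""
    cache_list = {v: set() for v in adj}
    member_list = {}
    # Highest score first; ties broken by vertex id (assumption).
    for v in sorted(adj, key=lambda x: (-scores[x], x)):
        covered = set()
        for c in cache_list[v]:
            covered |= adj[c] | {c}  # the cache itself and its friends
        members = adj[v] - covered
        # A vertex with no cache yet, or with uncovered friends, becomes a cache.
        if members or not cache_list[v]:
            member_list[v] = members
            for m in members:
                cache_list[m].add(v)
    return member_list, cache_list


scores = {v: social_score(*row) for v, row in TABLE_II.items()}
clusters, cache_lists = select_caches(ADJ, scores)
print(clusters)
print(cache_lists)
```

Under these assumptions the script selects nodes 2 and 4 with members {1,3,4,5} and {6}, and the resulting cache lists agree with the last column of Table II.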

VI. EVALUATION

We first compare the performance of the two caching algorithms against various metrics, such as number of caches selected and cluster size, by using data available from Facebook [12]. We then evaluate how social traffic could be reduced by using caches when compared with other data dissemination methods.

A. Dataset

The Facebook dataset we study was crawled from the Facebook New Orleans network by [12]. The dataset contains public profiles of 60,290 users who are connected by 1,545,686 links, with an average node degree of 25.6418. The dataset also includes wall posts spanning September 26th, 2006 to January 22nd, 2009. Specifically, we use posts from January 1st, 2007 to December 31st, 2008, which include 838,092 wall posts.

Algorithm 2 Social Score Algorithm
Input: undirected graph G = (V, E)
Outputs: Social Caches C and Social Clusters SC
Initialization:
for each vertex vi ∈ V do
    calculate CCi, EBCi, and SSi;
    set IsCachei, Cache Listi and Member Listi to φ;
end for
Main algorithm:
for each vertex in V do
    pick vertex vj with highest social score
    if Cache Listj == φ then
        IsCachej = True
        Member Listj = vj’s friends Fj
    else
        get Cache Listj and friends of vj: Fj
        for each cache Ck ∈ Cache Listj do
            get friends of Ck: Fk
            Member Listj = Fj − Fk
        end for
        if Member Listj != φ then
            IsCachej = True
            update Member Listj
        end if
    end if
end for

B. Social Score Algorithm Properties

Before comparing the two algorithms, we first present some results of the Social Score algorithm. As discussed in Section V-C, social score is calculated based on clustering coefficient, egocentric betweenness centrality, and degree. The correlations between the three factors for the Facebook dataset are plotted in Fig. 4. The figure on the left shows that as vertex degree increases, the range of clustering coefficient shrinks from [0,1] exponentially towards 0. The figure in the middle shows that as vertex degree increases, the range of egocentric betweenness centrality shrinks from [0,1] towards 1. This means that a vertex with a higher degree has a lower clustering coefficient and a higher egocentric betweenness centrality. The figure on the right shows that most vertices are clustered around the line EBC = 1 − CC.

Fig. 5 shows social scores for the Facebook dataset in the space composed of degree, clustering coefficient, and egocentric betweenness centrality as x, y, z axes. The blue dots are the social scores in this coordinate system. Red dots are the trajectory on the x-y surface, yellow dots are the trajectory on the x-z surface, and the green dots are the trajectory on the y-z surface.

Fig. 4. Correlations between vertex degree, clustering coefficient and egocentric betweenness centrality for Facebook data. Range of clustering coefficient shrinks exponentially towards 0 as degree increases (on the left). Egocentric betweenness centrality shrinks towards 1 as degree increases (in the middle). Most vertices are clustered around the line EBC = 1 − CC (on the right).

Fig. 5. Social score distribution in the coordinate system of vertex degree, clustering coefficient, and egocentric betweenness centrality as x, y, z axes respectively. The blue dots are the social scores, red dots are the trajectory on the x-y surface, yellow dots are the trajectory on the x-z surface, and the green dots are the trajectory on the y-z surface.

Nodes with higher social scores have higher egocentric betweenness centrality, smaller clustering coefficient, and larger degree. They are connected to a large number of non-interconnected vertices, and are candidates for social caches.

C. Comparison of the Two Algorithms

We evaluated both algorithms by using the dataset obtained from Facebook [12], and compared the selection of social caches, as well as the corresponding social clusters.

Statistics on the total number of social caches selected and on the number of social caches each vertex connects to under the two algorithms are listed in Table III. The Social Score algorithm selects about 1.5 times as many caches as the Approximate NDS algorithm does. The number of caches a vertex connects to determines the cost of sending or requesting social updates. Both algorithms yield a similar average number of social caches per vertex, which means they should generate similar social traffic. The cumulative distribution of the number of social caches a vertex connects to is shown in Fig. 6. The CDFs for both algorithms converge quickly: with each vertex connecting to fewer than 15 social caches, the caches selected by the algorithms cover 90% of the network, and with each vertex connecting to fewer than 25 social caches, 99% of the entire network is covered. The Approximate NDS algorithm converges faster with the given dataset.

TABLE III
STATISTICS OF SOCIAL CACHES SELECTED FOR THE TWO ALGORITHMS.

No. of social caches        Social Score    Approximate NDS
Total no.                   40782           26288
Avg no. a vertex connects   7.058 (5.991)   5.867 (5.434)
Max no. a vertex connects   61              48
Min no. a vertex connects   0               1

Fig. 6. CDF for number of social caches that each vertex connects. [Plot: x-axis, number of social caches connected by vertex; y-axis, CDF; curves for the Social Score algorithm and the Approximate NDS algorithm.]

TABLE IV
STATISTICS OF SOCIAL CLUSTER SIZE FOR THE TWO ALGORITHMS.

Social cluster size   Social Score      Approximate NDS
Average size          11.424 (24.491)   13.224 (34.418)
Median size           4                 4
Maximum size          1098              1098
Minimum size          1                 1

Table IV presents the statistics about social cluster size for the two algorithms. Cluster size measures the number of nodes that contact the same social cache for social updates, and therefore the social traffic a cache needs to handle. The average social cluster sizes are similar for the two algorithms. Fig. 7 shows the CDF of social cluster size for both algorithms. Both have a few large social clusters (> 1000 members), exhibiting a long-tail distribution. Social caches associated with large social clusters are required to handle significant social traffic from members, which should be avoided.

D. Social Traffic Comparison

We evaluate the effectiveness of the two proposed algo-rithms by determining the social traffic generated in compar-ison to other social network data dissemination methods suchas “push-to-all-friends” in a distributed social networks. Inparticular, we compare the following six methods:

a. Push social updates immediately to all friends.

b. Push periodically (every m minutes) to all friends. In this scenario, the amount of social traffic does not depend on how many social updates the vertices generate, but on how many friends each vertex has and on the time interval m. In other words, it depends on the number of edges in the graph. Therefore, social traffic is calculated as ST = (24 × 60 / m) × 2 × |E|.

c. Pull periodically from friends. Same as above.

d. Push periodically (every k minutes) to all friends if the vertex has updates. If a node updates two or more times during the k minutes, only one connection needs to be set up. This reduces the total network traffic by way of local caching.

e. Social Score algorithm.

f. Approximate NDS algorithm.

Fig. 7. CDF of social cluster size for the two algorithms. (Axes: social cluster size vs. CDF of social cluster size; curves: Social Score algorithm, Approximate NDS algorithm.)

Fig. 8. The overall social traffic generated daily by utilizing data dissemination method a, methods b and c with m = 1440, method d with k = 10, method d with k = 30, method d with k = 60, and methods e and f.

We evaluate the above six methods using the Facebook dataset. For each wall posting, we simulate the social traffic generated by pushing or pulling as defined by each method. The total daily social traffic is plotted in Fig. 8. The first straight line is the plot for methods b and c. We choose m = 1440 minutes, which is the number of minutes in a day. The result shows that even with a push or pull only once per day between each pair of friends, the total network traffic is still extremely high when compared with the other methods. The next set of lines in the graph is for method a, as well as method d with k = 10, 30, and 60 minutes. The local caching mechanism used in method d reduces the social traffic by 4.68%, 8.79%, and 11.92%, respectively. The bottom set of lines shows the results of the social caching methods, e and f, which reduce the overall social traffic by 83.90% and 84.92%, respectively, compared with method a. The performance of the two algorithms is similar given the current Facebook dataset. We anticipate that with different social graphs and social traffic patterns, the performance will be different.
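The traffic accounting for the baseline methods a, b/c, and d described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulator used in the evaluation: traffic is counted as one unit per friend per push, update times are given in minutes since midnight, method d uses fixed k-minute windows aligned to midnight, and all function names and toy data are hypothetical. Methods e and f are omitted because they depend on the social cache selection.

```python
def traffic_push_immediately(updates, degree):
    """Method a: every update is pushed to every friend."""
    return sum(len(times) * degree[v] for v, times in updates.items())


def traffic_periodic(num_edges, m):
    """Methods b and c: ST = (24 * 60 / m) * 2 * |E|, independent of updates."""
    return (24 * 60 / m) * 2 * num_edges


def traffic_push_if_updated(updates, degree, k):
    """Method d: one push per friend for each k-minute window with an update."""
    total = 0
    for v, times in updates.items():
        # Updates falling in the same k-minute window share one push (assumption).
        windows = {int(t // k) for t in times}
        total += len(windows) * degree[v]
    return total


# Hypothetical toy graph: triangle 1-2-3, so |E| = 3 and each vertex has 2 friends.
degree = {1: 2, 2: 2, 3: 2}
num_edges = 3
# Hypothetical update times for one day, in minutes since midnight.
updates = {1: [5, 7, 300], 2: [100], 3: []}

st_a = traffic_push_immediately(updates, degree)
st_bc = traffic_periodic(num_edges, m=1440)
st_d = traffic_push_if_updated(updates, degree, k=10)
```

The factor 2 × |E| in `traffic_periodic` is the sum of all vertex degrees, since each edge contributes one friend to each of its two endpoints. In the toy data, the two updates of vertex 1 at minutes 5 and 7 fall in the same 10-minute window, which is where method d saves traffic relative to method a.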

VII. CONCLUSION AND FUTURE WORK

In this paper, we proposed Social Butterfly, a novel peer-to-peer information dissemination model for distributed social networks that utilizes social caches. We proposed the Neighbor-Dominating Set problem as a formal definition of the social cache selection problem and presented two algorithms for this problem. Evaluation with social traffic data made available from Facebook usage shows that the use of social caches reduces the total social traffic by more than 80%.

The social cache selection algorithms we discussed in this paper are offline algorithms based on the underlying topology of the social graph, and assume the selected social caches are available all the time. Issues such as social traffic patterns, social tie-strength [13], and availability models need to be considered. We plan to develop online algorithms, and will extend our algorithms to include social traffic and availability models.

REFERENCES

[1] L. A. Cutillo, R. Molva, and T. Strufe, “Privacy preserving social networking through decentralization,” in Proceedings of the Sixth international conference on Wireless On-Demand Network Systems and Services, 2009.

[2] S. Buchegger, D. Schioberg, L. H. Vu, and A. Datta, “Peerson: P2p social networking - early experiences and insights,” in Proceedings of the Second ACM Workshop on Social Network Systems, co-located with Eurosys 2009, Nurnberg, Germany, March 31, 2009.

[3] S.-W. Seong, J. Seo, M. Nasielski, D. Sengupta, S. Hangal, S. K. Teh, R. Chu, B. Dodson, and M. S. Lam, “Prpl: a decentralized social networking infrastructure,” in Proceedings of the 1st ACM Workshop on Mobile Cloud Computing Services: Social Networks and Beyond, ser. MCS ’10, 2010.

[4] L. Yin and G. Cao, “Supporting cooperative caching in ad hoc networks,” in IEEE INFOCOM, 2004.

[5] D. P. John and J. Harrison, “A distributed internet cache,” in Proceedings of the 20th Australian Computer Science Conference, 1997, pp. 5–7.

[6] S. Acharya and S. B. Zdonik, “An efficient scheme for dynamic data replication,” Providence, RI, USA, Tech. Rep., 1993.

[7] M. E. J. Newman and J. Park, “Why social networks are different from other types of networks,” Phys. Rev. E, vol. 68, no. 3, Sep 2003.

[8] D. P. Williamson and D. B. Shmoys, The Design of Approximation Algorithms. Cambridge Univ Press, 2011.

[9] P. W. Holland and S. Leinhardt, “Transitivity in structural models of small groups,” Small Group Research, vol. 2, pp. 107–124, 1971.

[10] P. Marsden, “Egocentric and sociocentric measures of network centrality,” Social Networks, vol. 24, no. 4, pp. 407–422, 2002.

[11] B. Chen, K. Jamieson, H. Balakrishnan, and R. Morris, “Span: An Energy-Efficient Coordination Algorithm for Topology Maintenance in Ad Hoc Wireless Networks,” in 7th ACM MOBICOM, Rome, Italy, July 2001.

[12] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, “On the evolution of user interaction in facebook,” in Proceedings of the 2nd ACM workshop on Online social networks, ser. WOSN ’09. New York, NY, USA: ACM, 2009, pp. 37–42.

[13] E. Gilbert and K. Karahalios, “Predicting tie strength with social media,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’09), 2009, p. 220.