Analysis of hybrid P2P overlay network topology



Available online at www.sciencedirect.com

www.elsevier.com/locate/comcom

Computer Communications 31 (2008) 190–200

Analysis of hybrid P2P overlay network topology q,qq

Chao Xie a,*, Guihai Chen c, Art Vandenberg d, Yi Pan b,*

a Department of Computer Science, University of Wisconsin-Madison, Madison, WI, 53706-1685, USA

b Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA

c State Key Laboratory of Novel Software, Nanjing University, Nanjing 210093, China

d Department of Information Systems and Technology, Georgia State University, Atlanta, GA 30302-3968, USA

Available online 19 August 2007

Abstract

Modeling peer-to-peer (P2P) networks is a challenge for P2P researchers. In this paper, we provide a detailed analysis of large-scale hybrid P2P overlay network topology, using Gnutella as a case study. First, we re-examine the power-law distributions of the Gnutella network discovered by previous researchers. Our results show that the current Gnutella network deviates from the earlier power-laws, suggesting that the Gnutella network topology may have evolved a lot over time. Second, we identify important trends with regard to the evolution of the Gnutella network between September 2005 and February 2006. Upon analyzing the limitations of the power-laws, we provide a novel two-layered approach to study the topology of the Gnutella network. We divide the Gnutella network into two layers, namely the mesh and the forest, to model the hybrid and highly dynamic architecture of the current Gnutella network. We give a detailed analysis of the two-layered overlay and present six power-laws and one empirical law to characterize the topology. Using the two-layered approach and laws proposed, realistic topologies can be generated and the realism of artificial topologies can be validated.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Peer-to-peer; Overlay network; Network topology; Power-law

1. Introduction

Modeling the topologies of peer-to-peer (P2P) networks is an important open problem. An accurate topological model can have significant influence on P2P research. First, we can gain detailed insight into the nature of the underlying system. Second, the model can enable detailed analysis of algorithms and facilitate design of more efficient protocols that take advantage of topology properties. Third, we can generate more accurate artificial topologies for simulation purposes. Furthermore, we can predict future trends and thereby address potential problems in advance.

0140-3664/$ - see front matter © 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.comcom.2007.08.014

q This paper extends and supplants the earlier version of this paper presented at IEEE GLOBECOM’06 [1].

qq Guihai Chen’s work is supported by China NSF under Grant 60573131, China Jiangsu Provincial NSF under Grant BK2005208, China 973 projects under Grants 2006CB303000 and 2002CB312002, and Nokia Bridging the World Program. Yi Pan’s work is supported in part by the National Science Foundation (NSF) under Grants ECS-0196569, ECS-0334813, and CCF-0514750. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF, China NSF or Nokia.

* Corresponding authors. Tel.: +1 404 651 0649; fax: +1 404 463 9912.

E-mail addresses: [email protected] (C. Xie), [email protected] (G. Chen), [email protected] (A. Vandenberg), [email protected] (Y. Pan).

URLs: http://www.cs.wisc.edu/~cxie (C. Xie), http://www.cs.gsu.edu/ pan (Y. Pan).

Previous researchers [2] and [7] tended to use power-laws to characterize the topology of P2P networks. Recent advances in P2P networks have resulted in hybrid architectures, represented by the success of Gnutella protocol 0.6 [3] and Kazaa [4]. In this paper, we provide a detailed analysis of large-scale hybrid P2P network topology, giving results concerning major topology properties and main distributions. In our study, we choose Gnutella as a case study, as it has a large user community and open architecture. Our work can be summarized by the following points.

First, we re-examine the power-law distributions of the Gnutella network discovered by previous researchers. Our results show that the current Gnutella network deviates from the earlier power-laws. This observation suggests that the Gnutella network topology may have evolved a lot over time.


Second, we identify important trends with regard to the evolution of the Gnutella network between September 2005 and February 2006.

As our primary contribution, we provide a novel two-layered approach to study the topology of the Gnutella network. Due to the limitations of the power-laws, we divide the Gnutella network into two layers, namely the mesh and the forest, to model the hybrid and highly dynamic architecture of the current Gnutella network. We give a detailed analysis of the two-layered overlay and present six power-laws and one empirical law to characterize the topology.

Finally, we focus on the generation of realistic topologies and the validation of artificial topologies using our approach and the proposed laws.

The rest of this paper is organized as follows. Section 2 presents background and previous work. In Section 3, we present our traces of the Gnutella network. In Section 4, we re-examine the power-law distributions discovered by previous researchers and identify the trends concerning the evolution of the Gnutella network. In Section 5, we analyze the limitations of the power-laws and introduce our new two-layered approach to study the topology of the Gnutella network. In Section 6, we analyze the topological properties of the mesh and present four power-laws concerning the mesh topology. In Section 7, we examine the topology properties of the forest and provide one empirical law concerning the tree size. In Section 8, we present two power-laws concerning the overlay network as a whole and discuss the practical uses of our approach and laws. Finally, Section 9 concludes our work.

2. Background and previous work

2.1. Gnutella Protocol and the crawler

Gnutella protocol 0.4 [5] employs a pure decentralized model. In this model, individual nodes, also called servents, are equal in terms of functionality. They not only perform server-side roles, such as matching incoming queries against their local resources and responding with applicable results, but also offer client-side functions, such as issuing queries and collecting search results. All servents are connected to each other randomly. Fig. 1 illustrates the topology of the Gnutella 0.4 network.

Fig. 1. Topology of the Gnutella 0.4 Network.

Gnutella protocol 0.6 [3] employs a hybrid architecture combining centralized and decentralized models. Servents are categorized into leaves and ultrapeers. A leaf keeps only a small number of connections to ultrapeers. An ultrapeer maintains connections with other ultrapeers and acts as a proxy to the Gnutella network for the leaves connected to it. An ultrapeer only forwards a query to a leaf if it believes the leaf can answer it, and leaves never relay queries between ultrapeers. Fig. 2 illustrates the topology of the Gnutella 0.6 network. Protocol 0.6 is compatible with protocol 0.4, which implies that the current Gnutella network can contain some fraction of nodes that still follow the former protocol specification 0.4.

2.2. Power-law

Power-laws have been found in numerous diverse fields spanning sociological, geological, natural and biological systems. Power-laws of the form $y \propto x^{a}$ enable a compact characterization of topologies through their exponents. Faloutsos et al. [8] discovered four power-laws characterizing the topology of the Internet, while Magoni et al. [9] found another four power-laws of the Internet.

In [2,7,11], several power-laws were found with regard to the topology of the Gnutella network. In 2002, Ripeanu et al. [10] argued that the connection distribution of the more recent Gnutella network may follow a two-tier power-law distribution. P2P studies usually assume that these power-laws characterize the topology of P2P networks and use synthetically generated topologies following these power-laws [12–17].

3. Our Gnutella Network Traces

We developed a crawler to collect topology information of the Gnutella network, taking advantage of the message communication mechanisms of both protocol 0.4 and protocol 0.6. The crawler is based on the Limewire [6] open source client and performs a parallel breadth-first search of the network. It can discover more than 100,000 nodes in minutes.
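The breadth-first discovery described above can be sketched as follows. This is only an illustration, not the authors' Limewire-based crawler: it is sequential rather than parallel, and `get_neighbors` is a hypothetical callable standing in for whatever protocol messages a real crawler would exchange to learn a node's peers.

```python
from collections import deque

def crawl(seeds, get_neighbors):
    """Breadth-first discovery of an overlay starting from seed nodes.

    get_neighbors is a hypothetical callable returning the peers that a
    node reports; a real crawler would issue protocol messages here.
    """
    seen = set(seeds)
    queue = deque(seeds)
    edges = set()
    while queue:
        node = queue.popleft()
        for peer in get_neighbors(node):
            # Store each connection once, regardless of direction.
            edges.add((min(node, peer), max(node, peer)))
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen, edges
```

Storing each connection as an ordered pair gives one edge per connected node pair, so the edge set can be treated directly as an undirected graph.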

We can build the graph of nodes by analyzing the collected data on the Gnutella network. We model two adjacent nodes that have at least one connection between each other by an edge. We treat the Gnutella network as an undirected graph.

Fig. 2. Topology of the Gnutella 0.6 Network.

Table 1
Basic Statistics of the Gnutella Network

Stat. Data   Ours                  [11]                  [2]
             091505     021106     V34206     V57926
Time         09–2005    02–2006    09–2003    10–2003    11–2000    12–2000
Nodes        107,205    118,925    34,206     57,926     992        1,125
Edges        118,187    130,612    43,958     80,276     2,465      4,080
l            6.4        7.9        5.4        5.8        3.7        3.3
Diam.        22         24         16         15         9          8
k            2.20       2.20       2.57       2.72       4.97       7.25

Fig. 3. Log–log plot of the degree $d_v$ versus the rank $r_v$ in the sequence of decreasing degree.

In this paper, we provide two traces of the Gnutella network, namely the 091505 trace and the 021106 trace. Note that we have studied the topology of the Gnutella network from September 2005 until February 2006, and all the traces we collected agree with the results given in this paper. In Table 1, we present some basic statistics about our traces and previous work [2,11]. In Table 1, l represents the average shortest distance and k represents the average degree.
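As a quick consistency check on Table 1, the average degree of an undirected graph follows from the node and edge counts, because every edge contributes to the degree of two nodes. The symbols $|V|$ and $|E|$ for the node and edge counts are introduced here for illustration only:

```latex
k = \frac{2\,|E|}{|V|}, \qquad
k_{091505} = \frac{2 \times 118{,}187}{107{,}205} \approx 2.20, \qquad
k_{021106} = \frac{2 \times 130{,}612}{118{,}925} \approx 2.20 .
```

Both values agree with the k row reported for our two traces.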

4. Current Gnutella network topology

In this section, we examine the power-laws of the Gnutella network described in the previous literature against our two traces. The goal of our work is to find out whether the topology of the current Gnutella network accords with the early power-laws.

We use linear regression to fit a line to a set of two-dimensional points using the least-squares method. The validity of the approximation is quantified by the correlation coefficient, which ranges from −1.0 to 1.0. The absolute value of the correlation coefficient is denoted ACC. An ACC value of 1.0 indicates perfect linear correlation. In general, the ACC level should be greater than 0.90 to validate linear correlation.
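The fitting procedure can be sketched in a few lines. The function names `fit_line` and `fit_power_law` are illustrative and not from the paper, and the sketch assumes base-10 logarithms and strictly positive data; it is one plain way to obtain a slope and an ACC value, not necessarily the tooling used for the figures.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = slope*x + intercept, plus the absolute
    correlation coefficient (ACC) of the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    acc = abs(sxy) / math.sqrt(sxx * syy)
    return slope, intercept, acc

def fit_power_law(xs, ys):
    """Fit y ~ c * x**a by regressing log10(y) on log10(x).

    The returned slope is the exponent a; all inputs must be positive.
    """
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    return fit_line(log_x, log_y)
```

A power-law exponent is then the slope returned by `fit_power_law`, and the fit would be accepted under the criterion above when the returned ACC exceeds 0.90.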

4.1. Rank distribution

In this section, we study the degrees of the nodes in the Gnutella network.

Power-law of rank exponent R: The degree $d_v$ of a node v is proportional to the rank of the node $r_v$ to the power of a constant R: $d_v \propto r_v^{R}$. The rank $r_v$ of a node v is defined as its index in the order of decreasing degree.

Jovanovic [2] found that the early Gnutella network followed the above power-law with rank exponent of −0.98 and ACC of 0.94. For our two traces, the rank exponent is −0.64268 and −0.60681 and ACC is 0.92178 and 0.88120 in chronological order, as we see in Fig. 3. The low ACC values imply that this power-law is relatively weak in the 091505 graph and even invalid for the 021106 graph.
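To make the rank definition concrete, the following sketch builds the (rank, degree) sequence from an edge list. The function name and the edge-list input format are assumptions made for illustration; ties in degree are ranked in arbitrary order, which the definition above does not specify.

```python
from collections import Counter

def degree_rank_pairs(edges):
    """Return (rank, degree) pairs in order of decreasing degree.

    edges is an iterable of undirected (u, v) pairs, one per connected
    node pair; ranks start at 1.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ordered = sorted(degree.values(), reverse=True)
    return [(rank, d) for rank, d in enumerate(ordered, start=1)]
```

Plotting these pairs on log–log axes and fitting a line yields the rank exponent R as the slope.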

Compared with a pure power-law distribution, the two graphs deviate from the linear regression with similar patterns. On the one hand, the nodes with high rank are of too small degree. This is because the Gnutella protocol 0.6 imposes a limit on the maximal number of connections of an ultrapeer. On the other hand, there are too many nodes with degree around 30, with the result that the curve breaks away from the linear regression. This pattern suggests that ultrapeers in the Gnutella 0.6 network tend to have a connection limit of around 30.

Moreover, the 021106 graph is somewhat different from the 091505 graph. First, the nodes with high rank in the former graph are of smaller degree compared with their counterparts in the latter, implying that protocol 0.6 is effectively replacing protocol 0.4. Second, the curve after a degree of approximately 30 drops much more suddenly in the former graph than in the latter, which suggests that ultrapeers tend to employ as many connections as they can.

4.2. Degree Distribution

In this section, we study the distribution of the degrees of the nodes. Note that the degree power law we present in the current work is different from the one in earlier work [2]. However, they both refer to the same distribution. The difference is that the current work uses the cumulative probability distribution function, while the earlier work uses the probability distribution function. As a result, the exponents of the two power-laws differ approximately by one. The cumulative distribution is preferable because it can be estimated in a statistically robust way.
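The "differ approximately by one" statement can be seen with a short continuous approximation. Here $p(d)$, $C$ and $\alpha$ are symbols introduced only for this illustration, with $\alpha > 1$ so that the tail integral converges:

```latex
p(d) = C\,d^{-\alpha}
\quad\Longrightarrow\quad
\Pr[\text{degree} > d] \;\approx\; \int_{d}^{\infty} C\,x^{-\alpha}\,dx
\;=\; \frac{C}{\alpha - 1}\, d^{-(\alpha - 1)} .
```

The cumulative exponent is therefore $-(\alpha - 1)$, which is the probability-distribution exponent $-\alpha$ plus one. The relation is only approximate for real traces because degrees are discrete and bounded.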

Power-law of degree exponent D: The complementary cumulative distribution function (CCDF) $D_d$ of a degree d is proportional to the degree to the power of a constant D: $D_d \propto d^{D}$. The CCDF of a degree d is the percentage of nodes that have degree greater than the degree d.
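The CCDF defined above can be computed from a list of node degrees as in the sketch below. The function name is illustrative, and the result is returned as a fraction between 0 and 1 rather than as a percentage.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Map each observed degree d to the fraction of nodes with degree > d."""
    total = len(degrees)
    counts = Counter(degrees)
    ccdf = {}
    above = total
    for d in sorted(counts):
        above -= counts[d]  # nodes with degree strictly greater than d
        ccdf[d] = above / total
    return ccdf
```

The largest observed degree always maps to 0, so that point has to be dropped before taking logarithms for a log–log fit.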

Jovanovic [2] showed degree exponent of −1.4 and ACC of 0.96 for the early Gnutella network by probability distribution. For our two traces, the degree exponent is −2.25926 and −2.31074 and ACC is 0.91744 and 0.87718 in chronological order, as we see in Fig. 4. Again, the low ACC values imply that this power-law is relatively weak in the 091505 graph and even invalid for the 021106 graph.

Compared with a pure power-law distribution, the graphs share some common patterns. There are too many nodes with degree around 30, and the resulting curves deviate from the linear regression. This coincides with what we found in the rank distribution.

Fig. 4. Log–log plot of $D_d$ versus the degree d.

Furthermore, in the 021106 graph, degrees in the interval 5–20 follow an almost constant distribution, which means there are too few ultrapeers with a degree in this interval. This confirms our previous conclusion that ultrapeers try to hold more connections up to the limit. The curve of higher degree in the 021106 graph drops much more sharply, which agrees with our previous comment that the Gnutella protocol 0.6 prevents ultrapeers from employing a large number of connections.

5. The two-layered approach

In this section, we first discuss the limitations of the power-laws and then present a new approach to study the topology of the Gnutella network.

5.1. Limitations of the power-laws

Previous studies [18] and [19] suggest two key causes for power-law distributions in network topologies: incremental growth and preferential connectivity. Incremental growth refers to open networks that form by the continual addition of new nodes, and thus the gradual increase in the size of the network. Preferential connectivity refers to the tendency of a new node to connect to existing nodes that are highly connected or popular.
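The two mechanisms can be illustrated with a small generative sketch in the style of the well-known Barabási–Albert model. This is not a generator from the paper or necessarily from [18] and [19]; the function name, the fully connected starting core and the `links_per_node` parameter are assumptions chosen for illustration.

```python
import random

def grow_preferential(n_nodes, links_per_node=2, seed=0):
    """Grow a graph one node at a time; each new node links to existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a small fully connected core of links_per_node + 1 nodes.
    core = links_per_node + 1
    edges = [(u, v) for u in range(core) for v in range(u + 1, core)]
    # Each node appears in `targets` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    targets = [x for edge in edges for x in edge]
    for new in range(core, n_nodes):
        chosen = set()
        while len(chosen) < links_per_node:
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.append((new, old))
            targets.extend((new, old))
    return edges
```

Incremental growth corresponds to the outer loop, which never removes nodes, and preferential connectivity corresponds to the degree-weighted draw. Both assumptions are exactly what short-lived, connection-limited leaves violate.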

The topology of the Gnutella network is highly dynamic, since a node can join or leave the Gnutella network at any time. More specifically, most leaves tend to disconnect from the Gnutella network within several minutes after they connect to the network. The transient lifetime of the leaves works against incremental growth. Moreover, due to the hybrid architecture of Gnutella protocol 0.6 [3], a leaf keeps only a small number of connections to ultrapeers and cannot connect to other leaves. This limitation on leaves also works against preferential connectivity, because leaves can never become highly connected. Combining the above factors, we can explain why the current Gnutella network does not follow the early power-law distributions. It is the limitations of the power-laws that make them inappropriate for modeling hybrid and highly dynamic topologies.

As we mentioned earlier, P2P studies usually use synthetically generated topologies characterized by the early power-laws. These topologies may not reflect properties of current P2P networks. So there should be a new approach to model current P2P networks.

5.2. Our approach

In our study, we propose a new two-layered approach to model the topology of the current Gnutella network. We split the Gnutella network into two layers, namely the mesh and the forest.

Before we present the analysis of our approach, we provide below a few definitions. Note that Magoni et al. [9] proposed some definitions to describe the AS network. We keep these definitions and modify them into the following ones. Fig. 5 shows different kinds of nodes in a sample graph.

• Cycle node: a node that belongs to a cycle (i.e. it is on a closed path of disjoint nodes; in Fig. 5, there are eleven cycle nodes).

• Bridge node: a node which is not a cycle node and is on a path connecting 2 cycle nodes (in Fig. 5, there is one bridge node).

• In-mesh node: a node which is a cycle node or a bridge node (in Fig. 5, the mesh has twelve in-mesh nodes).

• In-tree node: a node which is not an in-mesh node (i.e. it belongs to a tree; in Fig. 5, each tree has four in-tree nodes).

Mesh is the set of in-mesh nodes and forest is the set of in-tree nodes.

• Branch node: an in-tree node of degree at least 2.

• Leaf node: an in-tree node of degree 1.

• Root node: an in-mesh node which is the root of a tree.

• Relay node: a node having exactly 2 connections.

• Border node: a node located on the diameter of the network.

If we split the Gnutella network into the mesh and the forest, we can analyze the topological properties of the mesh and the forest, respectively.
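One way to perform this split is to peel off nodes with at most one remaining neighbor until none are left. The sketch below rests on our reading that the in-mesh nodes (cycle nodes plus bridge nodes) are exactly the nodes that survive this pruning, which graph theory calls the 2-core; the function name and the edge-list input are illustrative, and the graph is assumed to have no self-loops.

```python
from collections import defaultdict

def split_mesh_forest(edges):
    """Split an undirected graph into in-mesh and in-tree node sets by
    repeatedly removing nodes that have at most one remaining neighbor."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = {node: len(nbrs) for node, nbrs in adj.items()}
    stack = [node for node, deg in remaining.items() if deg <= 1]
    forest = set()
    while stack:
        node = stack.pop()
        if node in forest:
            continue  # already pruned via an earlier stack entry
        forest.add(node)
        for nbr in adj[node]:
            if nbr not in forest:
                remaining[nbr] -= 1
                if remaining[nbr] <= 1:
                    stack.append(nbr)
    mesh = set(adj) - forest
    return mesh, forest
```

For a triangle with a short tail attached, the triangle nodes form the mesh and the tail nodes form the forest; the triangle node that the tail hangs from is the root node of that tree.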

After careful comparison between Figs. 2 and 5, we can find that the mesh in Fig. 5 is composed merely of ultrapeers and acts as the backbone of the Gnutella network. Since ultrapeers are relatively stable and tend to stay in the Gnutella network for a longer time, the mesh can meet the requirement of incremental growth. Furthermore, since ultrapeers can connect to other ultrapeers, the mesh can meet the requirement of preferential connectivity. Hence, the topology of the mesh theoretically should comply with power-laws (see Section 6 for detailed validation). On the other hand, we can also obtain major topology properties and distributions of the forest (see Section 7). Note that it is not necessary to have all ultrapeers in the mesh.

Fig. 5. Different kinds of nodes.

With the knowledge of both the topology of the mesh and the topology of the forest, we can model the topology of the Gnutella network easily by merging these two layers.

6. Mesh topology analysis

In this section, we study the topology properties concerning the mesh in the Gnutella network. In Table 2, we present some basic statistics about the mesh in our traces. In Table 2, p(m) represents the percentage of nodes in the mesh, l represents average shortest distance, and k represents average degree.

6.1. Mesh node rank exponent $R_m$

In this section, we study the degrees of the nodes in the mesh. We sort the nodes in the mesh in decreasing order of degree $d_{v_m}$ and define the mesh node rank $r_{v_m}$ as the index of the node in the sequence. We plot the $(d_{v_m}, r_{v_m})$ pairs in log–log scale. The plots are shown in Fig. 6. The data values are represented by points, while the solid lines represent the least-squares approximation.

The points of Fig. 6 are well approximated by the linear regression. The ACC is 0.96425 for the 091505 trace and 0.96580 for the 021106 trace. This leads us to the following power law and definition.

Power-law 1 (Mesh node rank exponent): The degree $d_{v_m}$ of a mesh node $v_m$ is proportional to the rank of the mesh node $r_{v_m}$ to the power of a constant $R_m$:

$d_{v_m} \propto r_{v_m}^{R_m}$.

Definition 1. Let us sort the mesh nodes of a graph in decreasing order of degree. We define the mesh rank exponent $R_m$ to be the slope of the plot of the degrees of the mesh nodes versus the rank of the nodes in log–log scale.

6.2. Mesh node degree exponent $O_m$

In this section, we study the distribution of the degrees of the nodes in the mesh. We define the frequency $f_{d_m}$ of a mesh node degree $d_m$ as the number of nodes in the mesh with degree $d_m$. We plot the $(f_{d_m}, d_m)$ pairs in log–log scale in Fig. 7. In these plots, we exclude a small percentage of nodes of higher degree that have frequency of one, but still plot 99.9% of the total number of nodes. As we saw earlier, the higher degrees are described and captured by the mesh rank exponent.

Table 2
Basic Statistics of the Mesh

Stat. Data     091505     021106
Nb of Nodes    16,487     11,852
p(m)           15.4%      10.0%
Nb of Edges    27,467     23,539
l              5.2        6.5
Diameter       14         17
k              3.33       3.97

Fig. 6. Log–log plot of the mesh node degree $d_{v_m}$ versus the rank $r_{v_m}$ in the sequence of decreasing degree.

Fig. 7. Log–log plot of frequency $f_{d_m}$ versus the mesh node degree $d_m$.

The major observation of Fig. 7 is that the plots are approximately linear with ACC of 0.97171 for the 091505 trace and 0.96016 for the 021106 trace. We infer the following power-law and definition.

Power-law 2 (Mesh node degree exponent): The frequency $f_{d_m}$ of a mesh node degree $d_m$ is proportional to the degree to the power of a constant $O_m$:

$f_{d_m} \propto d_m^{O_m}$.

Definition 2. We define the mesh node degree exponent $O_m$ to be the slope of the plot of the frequency of the mesh node degrees versus the degrees in log–log scale.

6.3. Mesh pair rank exponent $P_m$

In this section, we study the Number of distinct Shortest Paths (NSP) of each pair of vertices in the mesh. The number of distinct shortest paths between two vertices is the number of shortest paths such that any two of these paths have at least one vertex not in common [9]. The distribution of NSP is useful for evaluating the amount of redundant edges involved in shortest paths. Higher NSP values mean that if one edge of a shortest path between a pair of nodes is removed, there is still a chance that another shortest path of the same length exists for this pair. We sort the pairs of in-mesh nodes in decreasing NSP $n_{p_m}$ and define the pair rank $r_{p_m}$ as the index of the pair in the sequence. We plot the $(n_{p_m}, r_{p_m})$ pairs in log–log scale. The plots are shown in Fig. 8. Due to the enormous number of node pairs, we plot the first $10^6$ pairs only.
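Counting the shortest paths from one source to every other node can be done during a single breadth-first search, as in the sketch below. The function name and the adjacency-mapping input are assumptions for illustration; the paper does not state how the NSP values were computed.

```python
from collections import deque

def shortest_path_counts(adj, source):
    """Return {node: number of distinct shortest paths from source}.

    adj maps each node to an iterable of its neighbors (undirected graph).
    """
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                count[nbr] = 0
                queue.append(nbr)
            if dist[nbr] == dist[node] + 1:
                # Every shortest path to node extends to one for nbr.
                count[nbr] += count[node]
    return count
```

Running this from every in-mesh node over the mesh subgraph and keeping each unordered pair once gives the NSP values, which can then be sorted in decreasing order to obtain the pair ranks.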

The points of Fig. 8 are well approximated by the linear regression with ACC of 0.99157 for the 091505 trace and 0.99632 for the 021106 trace. Note that in Fig. 8(a) a significant portion of the upper left part of the curve seems to depart from the straight line. However, this is a visual illusion. The dots in the lower right part of the curve are much denser than the dots in the upper left part, so the ACC value remains high. This leads us to the following power law and definition.

Power-law 3 (Mesh pair rank exponent): The NSP $n_{p_m}$ between a pair of mesh nodes $p_m$ is proportional to the rank of the pair $r_{p_m}$ to the power of a constant $P_m$:

$n_{p_m} \propto r_{p_m}^{P_m}$.

Fig. 8. Log–log plot of the mesh NSP $n_{p_m}$ versus the rank $r_{p_m}$ in the sequence of decreasing NSP.

Fig. 9. Log–log plot of frequency $f_{n_m}$ versus the mesh NSP $n_m$.

Definition 3. Let us sort the pairs of nodes in the mesh of a graph in decreasing order of NSP. We define the mesh pair rank exponent $P_m$ to be the slope of the plot of the NSP versus the rank of the mesh node pairs in log–log scale.

Table 3
Basic Statistics of the Forest

Stat. Data         091505     021106
Nb of Nodes        90,718     107,073
p(t)               84.6%      90.0%
Nb of trees        9,886      6,830
Mean tree size     10.18      16.68
Max tree size      4,824      231
Mean tree depth    1.52       1.30
Max tree depth     8          10

6.4. Mesh NSP exponent $N_m$

In this section, we study the distribution of NSP of in-mesh nodes. We define the frequency $f_{n_m}$ of an NSP $n_m$ as the number of pairs with NSP of $n_m$ in the mesh. We plot the $(f_{n_m}, n_m)$ pairs in log–log scale in Fig. 9. In these plots, we exclude a small percentage of pairs of higher NSP that have the lowest frequency, but still plot more than 99.9% of the total number of pairs. The solid lines are the result of the linear regression.

The major observation of Fig. 9 is that the plots are approximately linear with ACC of 0.94301 for the 091505 trace and 0.99840 for the 021106 trace. We infer the following power-law and definition.

Power-law 4 (Mesh NSP exponent): The frequency $f_{n_m}$ of an NSP between a pair of nodes in the mesh, $n_m$, is proportional to the NSP to the power of a constant $N_m$:

$f_{n_m} \propto n_m^{N_m}$.

Definition 4. We define the mesh NSP exponent $N_m$ to be the slope of the plot of the frequency of the mesh NSP versus the mesh NSP in log–log scale.

7. Forest topology analysis

In this section, we study the topology properties concerning the forest in the Gnutella network. In Table 3, we present some basic statistics about the forest in our traces. In Table 3, p(t) represents the percentage of nodes in the forest.

7.1. Tree depth distribution

We define the probability $p(t_d)$ of a tree depth $t_d$ as the percentage of trees in the forest with depth $t_d$. Fig. 10 describes the tree depth distribution.

Fig. 10. Tree depth distribution.

Fig. 11. Plot of the tree size $s_t$ (log-scale) versus the rank $r_t$ in the sequence of decreasing size.

In Fig. 10, we notice that more than 56% of trees are simply composed of leaves that are directly connected to their corresponding root. We can also observe that more than 27% of trees have depth 2 and less than 4% of trees have depth larger than 3.

7.2. Tree rank distribution

In this section, we study the size of each tree, which is defined as the sum of the vertices composing the tree plus the root. We sort the trees in decreasing tree size $s_t$ and define tree rank $r_t$ as the index of the tree in the sequence. We plot the $(s_t, r_t)$ pairs in Fig. 11, applying log-scale only on the y-axis. The solid lines are given by linear regression.
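Tree sizes and depths can be collected with one breadth-first search per root, as sketched below. The sketch assumes that all in-tree nodes hanging off one root form a single tree and that depth is counted in hops from the root; both are our reading of the definitions, and the function name and inputs are illustrative.

```python
from collections import deque

def tree_sizes_and_depths(adj, mesh):
    """For every root (an in-mesh node with in-tree neighbors), return
    (root, size, depth); size counts the in-tree nodes plus the root."""
    results = []
    for root in mesh:
        depth_of = {root: 0}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                # Walk only into the forest; never cross to other mesh nodes.
                if nbr not in mesh and nbr not in depth_of:
                    depth_of[nbr] = depth_of[node] + 1
                    queue.append(nbr)
        if len(depth_of) > 1:
            results.append((root, len(depth_of), max(depth_of.values())))
    # Decreasing size gives the tree rank order used in the text.
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Under this size definition, the mean tree size equals the number of in-tree nodes divided by the number of trees, plus one for the root, which matches Table 3 (for example, 90,718 / 9,886 + 1 ≈ 10.18 for the 091505 trace).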

The plots of Fig. 11 match the linear regression line. The ACC is 0.95621 for the 091505 trace and 0.95465 for the 021106 trace. Consequently, we infer the following empirical law and definition.

Empirical law 1: The size $s_t$ of a tree t is proportional to an exponential function whose exponent is the product of the rank of the tree $r_t$ and a constant T:

$s_t \propto \exp(T\,r_t)$.

Definition 5. Let us sort the trees of a graph in decreasing order of size. We define T to be the slope of the plot of the sizes of trees versus the rank of the trees with log-scale applied on the sizes of trees.

This empirical law provides the formula on the sizes of trees in a sequence of trees.
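Taking logarithms shows why a log-scale on the size axis alone turns the law into a straight line. The constant $c$ below is introduced only for this illustration, and the slope equals T exactly only when natural logarithms are used:

```latex
s_t = c\,\exp(T\,r_t)
\quad\Longrightarrow\quad
\ln s_t = T\,r_t + \ln c ,
\qquad
\frac{s_{t'}}{s_t} = \exp(T) \;\; \text{when } r_{t'} = r_t + 1 .
```

Since trees are sorted in decreasing size, T is negative, and the law says that each step down in rank shrinks the tree size by the same constant factor $\exp(T)$.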

8. Discussion

In this section, we first present two more power-lawsconcerning all the nodes (including both in-mesh nodesand in-tree nodes) in the Gnutella network. Then wefocus on the generation of synthetic topologies of P2Pnetworks.

8.1. Additional power-laws

In our study, we find that the NSP rank distribution andNSP distribution of all the nodes in the Gnutella networkfollow power-laws as well. This can be explained easily.Because the mesh is the core part of the network, shortestpaths is mainly constituted by nodes in the mesh, whilenodes in the forest barely contribute to shortest paths.However, the two power-laws presented below could beused as minor metrics to distinguish P2P topologies.

8.1.1. Pair rank exponent P

Here we study the NSP of all the nodes (including both in-mesh nodes and in-tree nodes). We sort the pairs of nodes in decreasing NSP $n_p$ and plot the $(n_p, r_p)$ pairs in log–log scale in Fig. 12. Due to the enormous number of node pairs, we plot the first $10^6$ pairs only. The data values are represented by points, while the solid lines represent the least-squares approximation.

The points of Fig. 12 are well approximated by the linear regression, with ACC of 0.98184 for the 091505 trace and 0.99259 for the 021106 trace. Note that in both Fig. 12(a) and (b), a significant portion of the upper left part of the curves appears to go off the straight line. However, this too is a visual illusion: the dots in the lower right part of the curve are much denser than the dots in the upper left part, so the ACC value remains high. This leads us to the following power-law and definition.

Fig. 12. Log–log plot of the NSP $n_p$ versus the rank of the pairs $r_p$ in the sequence of decreasing NSP.

Fig. 13. Log–log plot of frequency $f_n$ versus the NSP $n$.

Power-law 5 (Pair Rank Exponent): The NSP $n_p$ between a pair of nodes $p$ is proportional to the rank of the pair $r_p$ to the power of a constant $P$:

$$n_p \propto r_p^{P}.$$

Definition 6. Let us sort the pairs of nodes of a graph in decreasing order of NSP. We define the pair rank exponent $P$ to be the slope of the plot of the NSP versus the rank of the pairs in log–log scale.
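Definition 6 can be sketched in code as a log–log regression of NSP against rank. The example below is an illustration under stated assumptions rather than the analysis behind Fig. 12: the NSP values are taken as a given list of positive numbers, base-10 logarithms are used, the optional `max_pairs` cut mirrors plotting only the leading pairs, and the function names and synthetic data are hypothetical. The correlation coefficient is returned alongside the slope; its absolute value is assumed here to be comparable to the reported ACC.

```python
import math

def loglog_fit(xs, ys):
    """Return (slope, correlation) of log10(y) regressed on log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

def pair_rank_exponent(nsp_values, max_pairs=None):
    """Sort NSP values in decreasing order and fit NSP versus rank."""
    ordered = sorted(nsp_values, reverse=True)
    if max_pairs is not None:
        ordered = ordered[:max_pairs]  # keep only the leading pairs
    ranks = range(1, len(ordered) + 1)
    return loglog_fit(ranks, ordered)

# Synthetic NSP values following n_p = 1000 * r_p^(-0.5); not Gnutella data.
synthetic_nsp = [1000.0 * r ** -0.5 for r in range(1, 2001)]
P, corr = pair_rank_exponent(synthetic_nsp)
```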

8.1.2. NSP exponent N

Here we study the distribution of the NSP of all the nodes (including both in-mesh nodes and in-tree nodes). We define the frequency $f_n$ of an NSP $n$ as the number of pairs with NSP of $n$. We plot the $(f_n, n)$ pairs in log–log scale in Fig. 13. In these plots, we exclude a small percentage of pairs of higher NSP that have the lowest frequency. In any case, we plot more than 99.9% of the total number of pairs. The solid lines are the result of the linear regression.

The major observation is that the plots are approximately linear, with ACC of 0.93510 for the 091505 trace and 0.98810 for the 021106 trace. We infer the following power-law and definition.

Power-law 6 (NSP Exponent): The frequency $f_n$ of an NSP $n$ between a pair of nodes is proportional to the NSP to the power of a constant $N$:

$$f_n \propto n^{N}.$$

Definition 7. We define the NSP exponent $N$ to be the slope of the plot of the frequency of the NSP versus the NSP in log–log scale.
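A minimal sketch of how the NSP exponent of Definition 7 might be estimated is given below. It assumes the NSP values are positive integers supplied as a list, uses base-10 logarithms, and approximates the exclusion of rare high-NSP pairs by keeping the smallest NSP values until a `coverage` share of all pairs is reached. That cut-off rule, the function name, and the synthetic data are assumptions made for illustration.

```python
import math
from collections import Counter

def nsp_exponent(nsp_values, coverage=0.999):
    """Slope of log10(frequency) versus log10(NSP)."""
    freq = Counter(nsp_values)
    total = len(nsp_values)

    # Keep NSP values in increasing order until `coverage` of pairs is included.
    points, covered = [], 0
    for n in sorted(freq):
        points.append((n, freq[n]))
        covered += freq[n]
        if covered >= coverage * total:
            break

    lx = [math.log10(n) for n, _ in points]
    ly = [math.log10(f) for _, f in points]
    k = len(points)
    mx, my = sum(lx) / k, sum(ly) / k
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx

# Synthetic data with f_n close to 10000 * n^(-2) for n = 1..10; not Gnutella data.
synthetic = []
for n in range(1, 11):
    synthetic.extend([n] * round(10000 / n ** 2))
N = nsp_exponent(synthetic)
```

Because the synthetic frequencies are rounded to integers, the fitted slope is close to, but not exactly, the generating exponent.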

8.2. Topology generation

The regularity observed in our traces of the Gnutella network between September 2005 and February 2006 (including but not restricted to the two traces specifically discussed in this paper) is unlikely to be a coincidence. We could reasonably conjecture that our laws might continue to hold, at least for the near future.

Our work can facilitate the generation of realistic topologies of P2P networks, especially those which employ a hybrid and highly dynamic architecture like the Gnutella network. As an overview, we list the following guidelines for creating P2P network topologies. First, a small percentage of the nodes (15.4% or 10.0%) belong to the mesh and a large percentage of the nodes (84.6% or 90.0%) belong to the forest. Second, the degree distribution of the mesh is skewed following our power-laws 1 and 2. Third, more than 56% of the trees have depth one, less than 4% of the trees have depth larger than 3, and the maximum depth is 7 or 10. Fourth, the size distribution of the trees is skewed following our empirical law 1. As a final step, we merge the generated mesh and the generated forest together to get the P2P network topology. We can further use our power-laws 3, 4, 5, and 6 to examine the quality of the generated topologies. If we fine-tune the parameters, we can get specific topologies that meet our needs.
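The guidelines above can be turned into a small generator. The following is a rough sketch under several assumptions and is not the authors' generation procedure. The mesh is built with a preferential-attachment process, used here only as a stand-in for a skewed degree distribution and not calibrated to power-laws 1 and 2. Every mesh node is allowed to root a tree. Tree sizes follow $\exp(T\, r_t)$ only in expectation. All parameter names and default values (`T`, `p_attach_root`, `links_per_new_mesh_node`) are hypothetical.

```python
import math
import random

def generate_two_layer_topology(n_total, mesh_fraction=0.15, T=-0.01,
                                links_per_new_mesh_node=3,
                                p_attach_root=0.8, max_depth=7, seed=0):
    """Return (n_mesh, edges, depth) for a mesh-plus-forest topology.

    Nodes 0..n_mesh-1 form the mesh; the remaining nodes form the forest.
    `depth` gives each node's distance to its tree root (0 for mesh nodes).
    """
    rng = random.Random(seed)
    m = links_per_new_mesh_node
    n_mesh = max(m + 1, round(n_total * mesh_fraction))
    edges = set()

    # Layer 1: mesh. Start from a small clique, then attach each new node
    # to m distinct existing nodes chosen proportionally to their degree.
    endpoints = []  # each node appears once per incident edge
    for u in range(m + 1):
        for v in range(u):
            edges.add((v, u))
            endpoints.extend((v, u))
    for u in range(m + 1, n_mesh):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for v in targets:
            edges.add((v, u))
            endpoints.extend((v, u))

    # Layer 2: forest. The tree of rank r receives new nodes with weight
    # exp(T * r), so expected tree sizes decay exponentially with rank.
    roots = list(range(n_mesh))
    rng.shuffle(roots)
    weights = [math.exp(T * r) for r in range(1, n_mesh + 1)]
    depth = {u: 0 for u in range(n_mesh)}
    members = {root: [] for root in roots}
    for node in range(n_mesh, n_total):
        root = rng.choices(roots, weights=weights)[0]
        # Non-root parents must leave room for a child within max_depth.
        candidates = [v for v in members[root] if depth[v] < max_depth]
        if candidates and rng.random() > p_attach_root:
            parent = rng.choice(candidates)
        else:
            parent = root
        edges.add((parent, node))
        depth[node] = depth[parent] + 1
        members[root].append(node)
    return n_mesh, edges, depth

n_mesh, edges, depth = generate_two_layer_topology(2000)
```

A generated topology of this kind could then be checked against the remaining laws, and the parameters adjusted until the fitted exponents fall near the desired values.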

9. Conclusion and future work

In this paper, we study the hybrid P2P network topology from the mesh perspective and the forest perspective, respectively. Using the two-layered approach and the laws proposed, realistic topologies can be generated.

References

[1] C. Xie, Y. Pan, Analysis of large-scale hybrid peer-to-peer network topology, in: Proc. IEEE GLOBECOM’06, San Francisco, USA, 2006.

[2] M.A. Jovanovic, Modelling large-scale peer-to-peer networks and a case study of gnutella, Master’s thesis, University of Cincinnati, Cambridge, June 2000.

[3] Gnutella, The gnutella protocol v0.6, 2002.

[4] The KaZaA website, 2006.

[5] Clip2, The Gnutella protocol specification v0.4, 2001.

[6] The Limewire website, 2006.

[7] L.A. Adamic, R.M. Lukose, A.R. Puniyani, B.A. Huberman, Search in power-law networks, Physical Review E 64 (2001) 46135–46143.

[8] M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet topology, in: Proc. ACM SIGCOMM’99, New York, NY, 1999, pp. 251–262.

[9] D. Magoni, J.-J. Pansiot, Analysis of the autonomous system network topology, ACM SIGCOMM Computer Communication Review 31 (3) (2001) 26–37.

[10] M. Ripeanu, I. Foster, A. Iamnitchi, Mapping the Gnutella network: properties of large-scale peer-to-peer systems and implications for system design, IEEE Internet Computing Journal 6 (1) (2002) 50–57.

[11] H. Chen, H. Jin, J. Sun, D. Deng, X. Liao, Analysis of large-scale topological properties for peer-to-peer networks, in: Proc. IEEE CCGrid’04, 2004, pp. 27–34.

[12] Q. He, M. Ammar, G. Riley, H. Raj, R. Fujimoto, Mapping peer behavior to packet-level details: a framework for packet-level simulation of peer-to-peer systems, in: Proc. IEEE/ACM MASCOTS’03, Orlando, FL, October 2003.

[13] S. Merugu, S. Srinivasan, E. Zegura, P-sim, A simulator for peer-to-peer networks, in: Proc. IEEE/ACM MASCOTS’03, Orlando, FL, October 2003.

[14] N.S. Ting, R. Deters, 3LS – A peer-to-peer network simulator, in: Proc. IEEE P2P’03, Sweden, 2003.

[15] N. Kotilainen, M. Vapa, T. Keltanen, A. Auvinen, J. Vuori, P2PRealm – Peer-to-Peer Network Simulator, in: Proc. 11th International Workshop on Computer-Aided Modeling, Analysis and Design of Communication Links and Networks, 2006, pp. 93–99.

[16] M. Jelasity, A. Montresor, G.P. Jesi, Peersim peer-to-peer simulator, 2004, Available from: <http://peersim.sourceforge.net/>.

[17] W. Yang, N. Abu-Ghazaleh, GPS: a general peer-to-peer simulator and its use for modeling BitTorrent, in: Proc. IEEE/ACM MASCOTS’05, Atlanta, GA, 2005.

[18] A.L. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.

[19] A. Medina, I. Matta, J. Byers, On the origin of power laws in internet topologies, ACM SIGCOMM Computer Communication Review 30 (2) (2000) 18–28.

Chao Xie is currently a Ph.D. student in the Department of Computer Science at University of Wisconsin-Madison. He obtained his M.S. degree in Computer Science from Georgia State University, USA, in 2007, obtained his M.Eng. degree in Computer Science from Huazhong University of Science and Technology, China, in 2005, and obtained his B.S. degree in Mechanical Engineering from Huazhong University of Science and Technology, China, in 2001. His main research interests include computer networks, distributed systems, parallel computing and data mining. Chao Xie is a member of the Association of Computing Machinery and the IEEE Computer Society.

Guihai Chen obtained his B.S. degree from Nanjing University, M.Eng. from Southeast University, and Ph.D. from University of Hong Kong. He visited Kyushu Institute of Technology, Japan in 1998 as a research fellow, and University of Queensland, Australia in 2000 as a visiting professor. From September 2001 to August 2003, he was a visiting professor at Wayne State University. He is now a full professor and deputy chair of the Department of Computer Science, Nanjing University. Prof. Chen has published more than 100 papers in peer-reviewed journals and refereed conference proceedings in the areas of wireless sensor networks, high-performance computer architecture, peer-to-peer computing and performance evaluation. He has also served on technical program committees of numerous international conferences. He is a member of the IEEE Computer Society.

Art Vandenberg was born in Grasonville, Maryland, 1950. Education includes B.A. English Literature, Swarthmore College, Swarthmore, PA, 1972; M.V.A. Painting and Drawing, Georgia State University, Atlanta, GA 1979; and M.S. Information and Computer Systems, Georgia Institute of Technology, Atlanta, GA 1985. He has worked in library systems, research and administrative computing since 1976, including 15 years in information technology positions at Georgia Institute of Technology. Since 1997 he has been with Information Systems & Technology at Georgia State University, as Director of Advanced Campus Services charged with deploying middleware and research computing infrastructure. His current activities include deploying grid computing solutions and establishing high-performance computing cyberinfrastructure. Recent research grants include a NSF ITR Award 0312636 as Co-PI investigating a unique approach to resolving metadata heterogeneity for information integration by combining monitoring, clustering and visualization to discover patterns or trends. He is a member of Georgia State’s IT Risk Management Research Group, the Georgia State Information Integration Lab, and serves as Chair of SURAgrid, a regional grid initiative of the Southeastern Universities Research Association. Mr. Vandenberg is a member of the Association of Computing Machinery and the IEEE Computer Society.

Yi Pan is the chair and a professor in the Department of Computer Science and a professor in the Department of Computer Information Systems at Georgia State University. Dr. Pan received his B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the University of Pittsburgh, USA, in 1991. Dr. Pan’s research interests include parallel and distributed computing, optical networks, wireless networks, and bioinformatics. Dr. Pan has published more than 100 journal papers with 30 papers published in various IEEE journals. In addition, he has published over 100 papers in refereed conferences (including IPDPS, ICPP, ICDCS, INFOCOM, and GLOBECOM). He has also co-authored/co-edited 30 books (including proceedings) and contributed several book chapters. His pioneering work on computing using reconfigurable optical buses has inspired extensive subsequent work by many researchers, and his research results have been cited by more than 100 researchers worldwide in books, theses, journal and conference papers. He is a co-inventor of three U.S. patents (pending) and 5 provisional patents, and has received many awards from agencies such as NSF, AFOSR, JSPS, IISF and Mellon Foundation. His recent research has been supported by NSF, NIH, NSFC, AFOSR, AFRL, JSPS, IISF and the states of Georgia and Ohio. He has served as a reviewer/panelist for many research foundations/agencies such as the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Australian Research Council, and the Hong Kong Research Grants Council. Dr. Pan has served as an editor-in-chief or editorial board member for 15 journals including 5 IEEE Transactions and a guest editor for 10 special issues for 9 journals including 2 IEEE Transactions. He has organized several international conferences and workshops and has also served as a program committee member for several major international conferences such as INFOCOM, GLOBECOM, ICC, IPDPS, and ICPP. Dr. Pan has delivered over 10 keynote speeches at many international conferences. Dr. Pan is an IEEE Distinguished Speaker (2000-2002), a Yamacraw Distinguished Speaker (2002), a Shell Oil Colloquium Speaker (2002), and a senior member of IEEE. He is listed in Men of Achievement, Who’s Who in Midwest, Who’s Who in America, Who’s Who in American Education, Who’s Who in Computational Science and Engineering, and Who’s Who of Asian Americans.