
UNIVERSITY OF CINCINNATI


Modeling Large-scale Peer-to-Peer Networks and a Case Study of Gnutella

A thesis submitted to the

Division of Graduate Studies and Research of

the University of Cincinnati

in partial fulfillment of the

requirements for the degree of

MASTER OF SCIENCE

in the Department of

Electrical and Computer Engineering and Computer Science of the College of Engineering

June, 2000

by

Mihajlo A. Jovanovic, B.S., Department of Mathematics and Computer Science, Otterbein College, Westerville, Ohio, 1997.

Thesis Advisor and Committee Chair: Dr. Fred S. Annexstein and Dr. Kenneth A. Berman

Abstract

The ongoing digital revolution has brought on the emergence of novel network applications such as Gnutella, Freenet, and Napster, intended to facilitate worldwide sharing of information. These applications have embraced the familiar peer-to-peer (P2P) architecture model of the original Internet in new and innovative ways, forever changing the world of personal computing. However, if P2P is to truly replace the well-established client-server model as the computing paradigm of the future, more efficient decentralized algorithms must first be designed. This requires a better understanding of the P2P network model on which those algorithms would operate. Such a model includes both network topology and traffic.

In this thesis, we study both of these factors, using as our case study Gnutella, a fully decentralized file-sharing network application. In order to study the Gnutella network topology, we have developed a network crawler that allows topology discovery to be performed in parallel. Upon analyzing the obtained topology data, we discovered that it exhibits strong "small-world" properties. More specifically, we observed the properties of small diameter and clustering in the Gnutella network topology. In addition, we report evidence of four different power laws previously observed in other technological networks, such as the Internet and the WWW.

In the second part of our thesis, we utilize our topology model in order to study network traffic. Specifically, we show that heterogeneous latencies present in many large-scale P2P network applications, when combined with the standard protocol mechanisms of time-to-live (TTL) and unique message identification (UID) used to govern flooding message transmissions, can potentially have a devastating effect on the reachability of message broadcast. We call this combined effect "short-circuiting," and we investigate the consequences of this phenomenon. We show through experimentation that, in the worst case, short-circuiting can almost completely eliminate the reach of broadcast messages. We report measurements obtained through both network simulation studies and experimental studies performed on Gnutella. Our results indicate that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall large-scale system.

We believe our discoveries of both network topology properties and short-circuiting are an important step toward a uniform model of P2P network applications, and could serve as a valuable tool in analyzing the performance of existing algorithms, as well as in designing new, more scalable solutions.

Acknowledgments

First, I would like to thank my advisers, Dr. Fred Annexstein and Dr. Kenneth Berman, for hours of intellectually stimulating discussions, suggestions and ideas. For the duration of this thesis, they have been not just my advisers but also my mentors, providing constant encouragement as well as financial support in the form of a Research Assistantship.

I would also like to thank Dr. Yizong Cheng for taking the time out of his busy schedule to be on my thesis committee, and Dr. John Schlipf for attending my thesis defense. Special thanks go to Dr. John Franco for providing motivation and guidance, particularly during my first year at UC, and also to Linda Gruber for her always kind and helpful attitude.

I extend my sincere gratitude to the Department of Electrical and Computer Engineering and Computer Science for its generous support, without which this work would not have been possible. The department provided me with a Graduate Assistantship during my first year and a University Graduate Scholarship for three full academic years.

Finally, I dedicate this work to my parents, Aleksandar and Mirjana, without whose love and support, even from half a world away, I could not have done it.

Table of Contents

1 Introduction
  1.1 Peer-to-Peer Computing
    1.1.1 Example Applications
  1.2 Modeling P2P Applications
    1.2.1 Benefits to Modeling
2 Modeling Topology of Large P2P Networks
  2.1 Small-World Networks
    2.1.1 Modeling Small-World Networks
    2.1.2 Gnutella as a Small-World
  2.2 Power-Laws
    2.2.1 Power-Law Models
    2.2.2 Power-Laws in Gnutella
3 Modeling Network Latencies
  3.1 Latency Effects
  3.2 Modeling the Short-Circuiting Effect
  3.3 Empirical experiments
    3.3.1 Gnutella Studies
    3.3.2 Network Simulation Studies
4 Gnutella Crawler Implementation
  4.1 Introduction to Gnutella
    4.1.1 Gnutella Protocol
    4.1.2 Discovering Gnutella Network Topology
  4.2 Design
    4.2.1 Algorithm
    4.2.2 Initial Implementation
    4.2.3 Parallel Algorithm
    4.2.4 Limitations
  4.3 Distributed Computing Solution Using Java RMI
5 Conclusions and future research
  5.1 Conclusions
  5.2 Future Directions
    5.2.1 Network Topology Modeling
    5.2.2 Network Visualization
    5.2.3 Server Placement

Appendix
A Visualizations of the Gnutella Network Topology
B Java source code for gnutsim
C Network Simulation Results

List of Figures

2.1 Values for the clustering coefficient as defined in definition 3 for the Gnutella, Barabasi-Albert, Watts-Strogatz, random graph, and the 2D torus topologies
2.2 Log-log plots of degree versus rank (power-law 1)
2.3 Log-log plot of frequency versus degree (power-law 2)
2.4 Log-log plot of the number of pairs of nodes versus the number of hops (power-law 3) for four snapshots of the Gnutella topology
2.5 Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots of the Gnutella topology
3.1 The results of level-1 short-circuiting effects on the broadcast horizon on the Gnutella network, October 2000. The y-axis represents the broadcast horizon size, and the x-axis labels each of 68 broadcast trials. The top line is the resulting horizon from multiple distinct broadcasts from the same source, and the lower line is the resulting horizon from a single broadcast message from a single source. The discrepancy represents "level-1 short-circuiting" effects.
3.2 Horizon-size versus t
3.3 Horizon-size variation over time with broadcasting client using multiple connections on the Gnutella network, March 2001. The y-axis represents the horizon size, and the x-axis labels each of 180 broadcast trials, performed consecutively in six minute intervals.
3.4 Difficulty in conducting experiments on today's Gnutella network
3.5 Short-circuiting effects for the Watts-Strogatz topology (nodes = 10000, k = 3, p = 0.2)
A.1 Gnutella network topology using CAIDA's Otter
A.2 Gnutella network topology using LEDA's 2D spring layout
A.3 Gnutella network topology using experimental layout
A.4 Gnutella network backbone (dominating set using greedy algorithm) using LEDA's 3D spring layout
A.5 Gnutella network backbone (nodes with degree > 10) using LEDA's 3D spring layout
A.6 Gnutella network backbone (nodes with degree > 20) using LEDA's 3D spring layout

Chapter 1

Introduction

The new wave of innovative network applications such as Gnutella, Freenet, Jabber, Popular Power, SETI@Home, Publius, Free Haven, Groove, and others has brought on a revolution in personal computing, threatening the long-established client-server architecture of the Internet. For lack of a better term, this revolution has been labeled peer-to-peer (P2P), or simply peer computing. The success of this revolution will depend on the ability of modern P2P network applications to provide efficient communication between an increasingly large number of autonomous hosts dispersed all over the Internet. To cope with this problem, some P2P applications, like instant messaging and Napster, rely on a centralized server. Other applications, such as Gnutella and Freenet, adopt a fully decentralized design approach and require scalable algorithmic solutions for functions such as routing and searching. Gnutella, for example, utilizes a flooding mechanism for transmitting messages through the network. These algorithms are typically built into the application in the form of an application-level protocol. The inadequacy of the existing protocols became painfully clear to Gnutella developers during the summer of 2000, when the size of the user community rapidly increased. The problem is that the original protocols were designed without any knowledge about the nature of the network on which they would be operating. In P2P applications such as Gnutella and Freenet, much like in social networks, this nature is determined by collective phenomena, as users connect to each other in a seemingly random manner. Under these circumstances, and given the highly dynamic nature of these networks, even relatively simple protocols result in complex interactions that are difficult to predict. To provide a better understanding of such interactions, in this thesis we study the nature of P2P networks using Gnutella as our case study. In particular, we study two fundamental components of a network, namely the topology and the traffic.

In the first part of this thesis (chapter 2), we focus on the network topology model. In order to study the Gnutella network topology, we have designed and implemented a distributed network crawler that allows topology discovery to be performed in parallel, an important feature considering the highly dynamic nature of Gnutella. The analysis of the obtained topology data reveals several important structural characteristics of P2P networks:

1. We report that the Gnutella network is a small-world topology, exhibiting both small diameter and clustering typical of many social networks.

2. We present evidence of four different power laws also found in other technological networks, such as the Internet and the WWW.

As a result, we conclude that many P2P networks, such as Gnutella, possess characteristics of both technological and social networks. It is our thesis that these characteristics can be utilized for designing more efficient algorithms operating on such networks.

In the second part of this thesis (chapter 3), we turn our focus to network traffic. More specifically, we study the effects of heterogeneous latencies on reachability in P2P networks operating under flooding protocols. We show that heterogeneous latencies present in many large-scale P2P network applications, when combined with the standard protocol mechanisms of time-to-live (TTL) and unique message identification (UID) used to govern flooding message transmissions, can potentially have a devastating effect on the reachability of message broadcast. We call this combined effect "short-circuiting," and we investigate the consequences of this phenomenon. We show through experimentation that, in the worst case, short-circuiting can almost completely eliminate the reach of broadcast messages. We report measurements obtained through both network simulation studies and experimental studies performed on Gnutella. Our results indicate that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall large-scale system. In chapter 4, we describe the design and implementation of our parallel network crawler. Finally, chapter 5 concludes this thesis with a description of future work.
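To make the interaction between latency, TTL, and UID concrete, the following is a minimal illustrative sketch in Python. It is not the simulator used in this thesis; the function name flood_reach, the toy topology, and the latency values are assumptions chosen only for illustration. Each node forwards only the first copy of a message it receives (the UID check), using whatever TTL that copy still carries, so a copy that arrives early over a long but fast path can suppress a later copy that had more hops left.

```python
import heapq

def flood_reach(latency, source, ttl):
    """Return the set of nodes reached by one flooded message.

    latency maps an undirected edge (u, v) to its delay. A node forwards
    only the first copy it receives, with the TTL carried by that copy.
    """
    if ttl <= 0:
        return set()
    neighbors = {}
    for (u, v), delay in latency.items():
        neighbors.setdefault(u, []).append((v, delay))
        neighbors.setdefault(v, []).append((u, delay))

    seen = {source}  # nodes that have already recorded this UID
    # Each event is (arrival time, receiving node, hops still allowed).
    events = [(delay, v, ttl - 1) for v, delay in neighbors.get(source, [])]
    heapq.heapify(events)
    while events:
        time, node, remaining = heapq.heappop(events)
        if node in seen:
            continue  # duplicate UID: dropped even if it has more TTL left
        seen.add(node)
        if remaining > 0:
            for v, delay in neighbors[node]:
                if v not in seen:
                    heapq.heappush(events, (time + delay, v, remaining - 1))
    return seen - {source}

# Hypothetical topology: s-a is a direct link; s-b-c-a is a three-hop detour.
links = [("s", "a"), ("s", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "e")]
uniform = {edge: 1 for edge in links}
skewed = dict(uniform)
skewed[("s", "a")] = 10  # the direct link is now much slower than the detour

print(sorted(flood_reach(uniform, "s", 3)))
print(sorted(flood_reach(skewed, "s", 3)))
```

With uniform latencies and a TTL of 3, all five other nodes are reached. With the slow direct link, node a first sees the copy that took the detour and has no hops left, so d and e are never reached.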

For the remainder of this chapter, we first present a brief overview of the P2P computing paradigm. Then, we summarize the main reasons for network modeling and present our formal model.

1.1 Peer-to-Peer Computing

As with many new technologies, there is no single universally accepted definition of P2P. The recently formed Peer-to-Peer Working Group, a consortium led by industry giants such as Hewlett-Packard, Intel and IBM, defines peer computing as "sharing of computer resources by direct exchange." Indeed, it is this notion of direct access to resources, instead of access through a centralized server as in the traditional client-server model, that characterizes P2P. However, this definition may be too general, as it would seem to include applications typically considered client-server, such as FTP and TELNET. According to [25], the two fundamental criteria that each P2P application must satisfy are (1) treating variable connectivity and temporary network addresses as the norm and (2) giving nodes at the edges of the network significant autonomy. Using this definition, applications such as email are not P2P, since addresses are not machine independent, while instant messaging applications such as ICQ and Jabber are P2P, because "they devolve connection management to the individual nodes" and dynamically map users to their IP addresses.

However, the fundamental idea of having computers act as peers is hardly new. Some may even argue that it has its roots in the original design of the Internet, as part of the early ARPANET architecture. In fact, early network applications such as USENET and DNS were based on a peer-to-peer communication model and can be considered predecessors to modern P2P technologies. The true innovation of these technologies therefore lies not in their architecture design, but rather in their implementation and scale. In order for these applications to extend the scope of P2P computing beyond a single LAN, they needed to overcome serious technical challenges posed by technologies such as firewalls, dynamic IP, and NAT, which obstruct open communication between computers for reasons of security. They did so by moving application complexity to the edges of the network, thereby creating a much more significant role for Internet-connected PCs than previously offered by the traditional client-server model.

This idea of transferring the complexity to the edges can best be explained by comparison with a telephone network. At first glance a telephone network may seem P2P, since communication occurs directly between two points in the network. However, the crucial difference between a telephone network and P2P is that the former relies on an intelligent network for functions such as routing, and on relatively "dumb" devices in the form of telephone sets. In contrast, a P2P application like Gnutella relies on an existing, "dumb" network (the Internet) and incorporates all the application logic at the endpoints. The main advantage of such a design from the perspective of a researcher is that it enables rapid development and deployment of innovative technologies, which may explain the large number of P2P applications we are seeing today.

1.1.1 Example Applications

Current network applications have embraced three forms of peer computing: sharing of information, sharing of computing power, and communication. This does not mean that the P2P computing model is limited to these resources, but simply that a P2P application for sharing other types of resources has not yet been designed. Table 1.1 shows the list of the most popular P2P applications in each category. Applications such as SETI@Home highlight a clear relationship between P2P and another computing paradigm commonly referred to as distributed computing. These applications allow the computing power of thousands of Internet-connected PCs to be harnessed and used for performing computationally intensive tasks that would otherwise require the use of a supercomputer. Examples include processing radio signals from outer space in search of extraterrestrial intelligence [4] and simulating protein folding [2]. Perhaps the most popular form of peer computing on the Internet is instant messaging. Unlike email, where messages travel through centralized mail servers, instant messaging allows individuals to communicate directly with each other. To route messages between users across the entire Internet, applications such as AIM, ICQ, MSN, and Jabber rely on a centralized back-end server to dynamically map users to their IP addresses and to buffer messages in case the user is offline.

Ongoing work toward the development of a generalized platform for building P2P applications [16] can perhaps be taken as an indication that the P2P model is here to stay. The main goal of the Groove developers is to abstract away many common challenges in building P2P network applications, such as providing open PC-to-PC communication. The main obstacles arise from the fact that the Internet architecture has been built for years around the prevalent client-server model. As a result, numerous technologies such as firewalls, dynamic IP, NAT, and asymmetric bandwidth connections have been deployed on the Internet, driven by the fundamental assumption that most Internet-connected PCs will only serve as clients. This underlying assumption is being strongly challenged by P2P applications such as Gnutella, Napster, and Freenet, which strive to provide a fully distributed worldwide information sharing system. These applications require their users to serve as both consumers and producers of information in a large distributed information storage system. The idea behind peer-to-peer information sharing is that much of the desired content is stored on individual workstations and not behind some centralized server. Applications like Gnutella allow users to connect directly to each other for the purpose of exchanging information.

Sharing of Information   Sharing of Computing Power   Communication
Gnutella                 SETI@Home                    AIM
Freenet                  Folding@Home                 ICQ
Napster                  FightAIDS@Home               MSN
Publius                  Popular Power                Jabber
Free Haven               Intel's NetBatch

Table 1.1: List of most popular P2P applications

From the perspective of this thesis, a common thread that ties all of these applications together is that they all form highly dynamic networks of peers with complex topology. Understanding the nature of these networks, particularly with regard to their topological structure, is the main topic of chapter 2. In addition, applications such as Gnutella and Free Haven [14], which rely on a broadcast search mechanism typically implemented through flooding, are susceptible to a potential negative effect of heterogeneous latencies on message reachability, a phenomenon we call "short-circuiting." We examine this phenomenon in detail in chapter 3.

1.2 Modeling P2P Applications

In this section we present our formal model for representing network topology. We model the topology of P2P networks with an undirected graph G whose nodes represent hosts and whose edges represent Internet connections between those hosts. For the remainder of this thesis, we use the term network graph to refer to a graph representing the topological structure of a network. In order to study the effects of latencies on broadcast flooding operations in chapter 3, we will further refine our model to include edge weights denoting network latencies along communication links.
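As an illustration only, the model just described could be represented in code along the following lines. This is a minimal Python sketch, not the data structure used by our crawler; the class name NetworkGraph, its method names, and the default latency of 1.0 are assumptions introduced for this example.

```python
from collections import defaultdict

class NetworkGraph:
    """Undirected graph: nodes are hosts, edges are connections between hosts."""

    def __init__(self):
        # host -> {neighboring host: latency of the connecting link}
        self.adj = defaultdict(dict)

    def add_connection(self, u, v, latency=1.0):
        """Add an undirected edge; the weight models latency along the link."""
        if u == v:
            raise ValueError("a host cannot be connected to itself")
        self.adj[u][v] = latency
        self.adj[v][u] = latency

    def degree(self, u):
        """Number of connections of host u (0 if u is unknown)."""
        return len(self.adj.get(u, {}))

    def num_edges(self):
        """Each undirected edge is stored twice, once per endpoint."""
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2

# Hypothetical hosts and latencies, for illustration only.
g = NetworkGraph()
g.add_connection("host-a", "host-b", latency=0.04)
g.add_connection("host-a", "host-c", latency=0.25)
print(g.degree("host-a"), g.num_edges())
```

Storing the latency on both endpoints keeps the graph undirected, while an unweighted topology is simply the special case in which every latency is equal.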

1.2.1 Benefits to Modeling

There are many reasons for obtaining an accurate network model. The main ones can be summarized as follows:

Provides insight into the nature of the underlying system: Even if it were possible to catalog all the vertices and edges of a graph, such information does not explain the evolutionary process of the corresponding network, nor does it provide a deeper understanding of its nature.

Enables analysis of algorithms: The performance of graph algorithms is closely related to the structural properties of the underlying graph [28]. A well-formulated graph model can aid in the analysis of algorithms operating on such topologies.

Allows generation of realistic topologies for simulation purposes: Besides analysis, simulations are a widely used method of assessing the performance of algorithms. However, successful simulations require realistic topologies that accurately capture important structural characteristics present in the original networks.

Facilitates design of new scalable algorithms: If the nature of a particular topology is well understood, algorithms can be designed to take advantage of particular structural properties.

Helps in understanding of related network structures: A good understanding of the nature of a particular system could lead to a better understanding of other dynamic, decentralized network structures for which complete topological data may not be available.

Allows prediction of future trends: A good network model can be used to simulate future growth, thereby allowing developers to address potential problems in advance.

As we have mentioned earlier, the topology of many P2P networks such as Gnutella is completely defined by usage patterns, or collective phenomena. In this sense, there is a clear relationship between P2P and social networks. In recent years, a lot of research has been done on social network models. In the following chapter we present some of the most notable network models and discuss how they can be adapted for P2P networks. We support our claims with results obtained on the Gnutella network topology.

Chapter 2

Modeling Topology of Large P2P Networks

In this chapter we focus on one major aspect of the overall network model, namely the topology. We analyzed the Gnutella network topology instances obtained by our network crawler between the months of May and December of 2000. In our analysis, we discovered some important structural properties of the topology graph, such as the small-world properties and several power-law distributions of certain graph metrics. It is our thesis that these properties can be used to test the "representativeness" of synthetically generated topologies used to model P2P networks such as Gnutella. Conversely, we believe these properties are an essential ingredient of an accurate P2P network topology model.

Here we present our results in the context of other related research. We begin with a brief introduction to small-world networks and their characteristics. We then present our discoveries on Gnutella, showing that the Gnutella network topology exhibits strong small-world properties. Next, we describe several power-laws recently observed in various network structures arising in technology. Finally, we report four of these power-laws characterizing the topology of the Gnutella network. It is our thesis that these power-laws are a fundamental property of many large-scale P2P networks, and therefore must be dealt with in their corresponding models.

2.1 Small-World Networks

The small-world phenomenon in the context of a worldwide social network refers to a widely accepted belief that we are all connected by a short chain of intermediate acquaintances. One of the first experimental studies of this phenomenon was conducted by Stanley Milgram in the late 1960s. Milgram's famous experiment consisted of taking a number of letters addressed to a person in the Boston area, and distributing them to a randomly selected group of people in Nebraska. Each person who received a letter was asked to pass it to someone they knew on a first-name basis in an effort to get it closer to its destination. Among the letters that eventually reached their destination, Milgram observed that the average number of steps for a letter to get from Nebraska to Boston was between five and six. The results of Milgram's experiment were the first to quantify the phenomenon, giving birth to the popular expression "six degrees of separation."

One way to model the small-world phenomenon is by a graph whose vertices are people and whose edges exist between two people who know each other. Such a graph is often referred to as the human acquaintanceship graph. As suggested by the phenomenon, the acquaintanceship graph is characterized by small diameter. Stated more precisely, its diameter seems to be of the order of log n, where n is the size of the graph. Furthermore, the acquaintanceship graph also shows a tendency to be clustered. Clustering can be thought of as a measure of how well connected each node's neighborhood is. For the human acquaintanceship graph this property seems intuitive, as two people with a common friend are with high probability themselves friends. It is these two properties of clustering and small diameter that define a class of graphs Watts and Strogatz call small-world graphs. In [27], they argue that the structure of many biological, technological, and social networks exhibits small-world behavior. As examples of such networks, they studied the only completely mapped neural network, that of the nematode worm Caenorhabditis elegans, the electric power grid of the western US, and the Hollywood graph. The collaboration graph of film actors, appropriately termed the Hollywood graph, contains 225,000 vertices representing actors and an edge for any two actors who have appeared in a feature film together. Similar collaboration graphs exist for active scientists [17] and even baseball players [24]. Since each of these social networks is a subgraph of the acquaintanceship graph, it is not surprising that they also show properties of clustering and small diameter. Without providing a strict mathematical definition, Watts and Strogatz define small-world behavior in terms of two properties, namely the characteristic path length and clustering. In order to quantify these properties for various networks, they defined the characteristic path length L and the clustering coefficient C as follows:

Definition 1 Characteristic path length L, a global property, is defined as the number of edges in the shortest path between two vertices, averaged over all pairs of vertices.

Definition 2 Clustering Coefficient Cv, a local (node) property measuring "cliquishness" of vertex v, is calculated by taking all the neighbors of v, counting the edges between them, and then dividing by the maximum number of edges that could possibly be drawn between those neighbors. Clustering coefficient C of a graph is defined as the average of Cv over all vertices v.
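The two definitions translate directly into code. The following Python sketch is an illustration under stated assumptions, not the analysis code used in this thesis: the graph is given as a dictionary mapping each vertex to the set of its neighbors, L is averaged only over pairs that are connected by some path, and Cv is taken to be 0 for vertices with fewer than two neighbors (the definition leaves that case open).

```python
from collections import deque
from itertools import combinations

def characteristic_path_length(adj):
    """Definition 1: shortest-path length averaged over connected vertex pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}            # breadth-first search from source
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1        # ordered pairs (source, reachable vertex)
    return total / pairs if pairs else 0.0

def clustering_coefficient(adj):
    """Definition 2: average over all vertices v of Cv."""
    coefficients = []
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coefficients.append(0.0)  # assumption: Cv = 0 when v has < 2 neighbors
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coefficients.append(links / (k * (k - 1) / 2))
    return sum(coefficients) / len(coefficients)

# A triangle (1, 2, 3) with one pendant vertex 4 attached to vertex 3.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(characteristic_path_length(example))
print(clustering_coefficient(example))
```

For this small example the six vertex pairs have distances 1, 1, 2, 1, 2, 1, giving L = 8/6, and the per-vertex coefficients are 1, 1, 1/3 and 0, giving C = 7/12.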

Table 2.1 shows the L and C values for the three real networks mentioned above, benchmarked against a random graph of the same size. The results clearly demonstrate the small-world phenomenon for these networks: $L \gtrsim L_{random}$ but $C \gg C_{random}$.

              n         L_actual   L_random   C_actual   C_random
Film actors   225,226   3.65       2.99       0.79       0.00027
Power grid    4,941     18.7       12.4       0.080      0.005
C. elegans    282       2.65       2.25       0.28       0.05

Table 2.1: Small-world behavior of three real networks

Recently, Lada Adamic showed in [5] that the web hyperlink graph, in which nodes are static home pages and edges are hyperlinks between those pages, is also a small-world. In addition, the author demonstrated how this fact could be used to improve the performance of web search engines.

Besides small diameter and clustering, many small-world networks share other important properties:

They tend to be sparse: These graphs all have relatively few edges, considering their vast number of vertices. Stated more precisely, in small-world graphs the number of edges is typically closer to the number of vertices n than to the maximum possible number of edges $\binom{n}{2}$. The Hollywood graph, for example, has 225,000 vertices connected by 13 million edges, far short of the roughly 25 billion in a clique. The largest studied sample of the WWW graph contains 1.5 billion links connecting 200 million pages. This means that only a fraction of about $7.5 \times 10^{-8}$ of all possible edges exist in the WWW graph.

They are self-organizing: Most of these small-world networks are not deliberate constructions. Instead, they can be viewed as naturally occurring artifacts that have developed through some evolutionary process. A good theoretical model for generating realistic small-world topologies must inevitably provide deeper insight into the nature of such a process.
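The clique sizes behind the sparsity argument are easy to check. The short Python sketch below is a worked calculation only; the helper name edge_density is an assumption introduced for this example, and edges are counted as undirected.

```python
from math import comb

def edge_density(n_vertices, n_edges):
    """Fraction of the comb(n, 2) possible undirected edges that are present."""
    return n_edges / comb(n_vertices, 2)

# Number of edges in a clique on 225,000 vertices (the Hollywood graph size).
hollywood_clique = comb(225_000, 2)

# 1.5 billion links among 200 million pages (the WWW sample).
www_density = edge_density(200_000_000, 1_500_000_000)

print(hollywood_clique)   # 25312387500, about 25 billion
print(www_density)        # about 7.5e-08
```

A clique on 225,000 vertices has 225,000 × 224,999 / 2 = 25,312,387,500 edges, and 1.5 billion links among 200 million pages amount to about 7.5 × 10^-8 of the roughly 2 × 10^16 possible pairs.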


2.1.1 Modeling Small-World Networks

The simplest way to model the small-world phenomenon is by means of a uniform

random graph. Graphs of this type were thoroughly studied by Erdos and Renyi in

the 1960s. While these graphs exhibit small diameter, their major limitation as a

model of the small-world is that they show no tendency to form clusters. To address

this problem, Watts and Strogatz proposed a model based on interpolating between

a completely regular and completely random topology [27]. The authors start by

taking a highly regular ring lattice topology, created by arranging n vertices in a

circle and joining each vertex to its k nearest neighbors for some small constant k.

Each edge in the original lattice is then examined and redirected to another randomly

chosen destination with probability p. This method allowed the authors to “tune”

the graph between regularity (p = 0) and disorder (p = 1), and thereby to probe

the intermediate region 0 < p < 1, about which little is known. Because of the

potential rewiring of edges, Watts and Strogatz refer to their model as the rewired

ring lattice. Another way to look at this construction process is to observe that all

the edges in the original lattice are local contacts. The rewiring process can then

simply be viewed as adding a number of long-range contacts. Watts and Strogatz

observed that adding only a few such edges results in a dramatic decrease in diameter

size while still preserving the clustering property of the original lattice. While the

Watts-Strogatz model remains one of the most popular models of the small-world,

most of the recent research utilizes a variation of the model proposed by Newman

and Watts. In this version, instead of rewiring the existing links, new shortcut links

are added. This greatly simplifies the analysis by eliminating the possibility present

in the original model for a portion of the graph to become disconnected from the rest.

The model was later generalized by Kleinberg in [19], who introduced an additional

parameter consequently defining an entire family of random networks. Kleinberg


showed that the performance of decentralized algorithms varies within this family of

network models, proving the existence of a unique model within the family for which

decentralized algorithms are effective. The idea most relevant to our thesis is that the

small-world property of a network topology can significantly impact the performance

of algorithms, such as routing algorithms, that operate on such a topology.
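The rewiring procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the construction used by Watts and Strogatz themselves: the function names `ring_lattice`, `rewire`, and `edge_count`, the adjacency-set representation, and the decision to skip self-loops and duplicate edges are all assumptions made for the example, and k is assumed to be even.

```python
import random


def ring_lattice(n, k):
    """Arrange n vertices on a circle and join each to its k nearest neighbors.

    Assumes k is even and n > k + 1, so that k/2 neighbors lie on each side.
    """
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def rewire(adj, k, p, rng):
    """Redirect each lattice edge to a random destination with probability p.

    Self-loops and duplicate edges are avoided (an assumption of this sketch),
    so the total number of edges never changes.
    """
    n = len(adj)
    for j in range(1, k // 2 + 1):
        for u in range(n):
            v = (u + j) % n
            if v in adj[u] and rng.random() < p:
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if not candidates:
                    continue
                w = rng.choice(candidates)
                adj[u].discard(v)
                adj[v].discard(u)
                adj[u].add(w)
                adj[w].add(u)
    return adj


def edge_count(adj):
    """Number of undirected edges in an adjacency-set graph."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2
```

Calling `rewire(ring_lattice(n, k), k, p, random.Random(seed))` with p = 0 returns the regular lattice unchanged, while p = 1 redirects every lattice edge. A Newman-Watts style variant would keep the lattice edges and only add the randomly chosen shortcut.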

2.1.2 Gnutella as a Small-World

Upon analyzing the Gnutella network topology data obtained by our crawler, we

discovered both the small diameter and the clustering properties characteristic of

small-world networks. To show this, we calculated the clustering coefficient and the

characteristic path length as defined by Watts and Strogatz for five different snapshots

of the Gnutella topology obtained during the months of November and December of

2000. Since the results presented in this chapter are based on these particular datasets,

we present some basic statistics for them in table 2.2.

Snapshot date    Nodes    Edges    Diameter
11/13/2000         992     2465           9
11/16/2000        1008     1782          12
12/20/2000        1077     4094          10
12/27/2000        1026     3752           8
12/28/2000        1125     4080           8

Table 2.2: Statistics for five snapshots of the Gnutella network topology

We present the statistics for the clustering coefficient C and the characteristic

path length L in tables 2.3 and 2.4. The values for each one are benchmarked against

the random graph G(n, p) and the 2-D mesh of the same size (in terms of the number


of nodes) as the original Gnutella topology graph. For random graphs, average values

out of 100 trials are shown.

                 Count source vertex                 Do not count source vertex
Snapshot date    Gnutella    G(n,p)      2D mesh     Gnutella    G(n,p)      2D mesh
11/13/2000       0.643587    0.389914    0.413181    0.035122    0.007789    0
11/16/2000       0.701287    0.492788    0.41276     0.010896    0.005636    0
12/20/2000       0.539189    0.268877    0.412366    0.065172    0.009371    0
12/27/2000       0.514996    0.278801    0.41276     0.063023    0.010213    0
12/28/2000       0.521659    0.27966     0.411995    0.054443    0.009013    0

Table 2.3: Values for the clustering coefficient C as defined by Watts and Strogatz in definition 2

Because it is not clear from their definition whether Watts and Strogatz consider

each vertex to be a neighbor of itself, we have calculated the results using both

methods. Based on the results in table 2.3, we believe the authors were not counting the

source vertex. However, the results obtained on a 2D mesh, typically regarded as a

highly clustered topology, highlight a potential inconsistency with this definition. For

this reason we propose a more consistent definition for the clustering coefficient of a

graph:

Definition 3 The characteristic coefficient $C^{(l)}_v$ of vertex v is calculated by dividing the number of cross edges in a BFS-tree of depth l rooted at v by the maximum possible number of cross edges, given by $\binom{k}{2} - (k-1)$, where k is the number of vertices in the BFS-tree. The clustering coefficient $C^{(l)}$ of a graph is defined as the average of $C^{(l)}_v$ over all vertices v.
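Definition 3 can be turned into a short computation. The sketch below is one possible reading, not the code used to produce the reported values: it assumes an undirected simple graph stored as a dictionary of neighbor sets, and it takes the cross edges to be all edges among the k vertices within depth l of v other than the k − 1 tree edges. The names `bfs_ball`, `local_coefficient`, `clustering_coefficient`, and `torus` are hypothetical.

```python
from collections import deque


def bfs_ball(adj, root, depth):
    """Return the set of vertices within `depth` hops of `root`."""
    seen = {root}
    frontier = deque([(root, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == depth:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen


def local_coefficient(adj, v, depth):
    """Cross edges of the depth-limited BFS tree at v over their maximum number."""
    ball = bfs_ball(adj, v, depth)
    k = len(ball)
    max_cross = k * (k - 1) // 2 - (k - 1)
    if max_cross == 0:
        return 0.0
    # Every undirected edge inside the ball is seen once from each endpoint.
    induced = sum(1 for u in ball for w in adj[u] if w in ball) // 2
    cross = induced - (k - 1)  # a BFS tree on k vertices uses k - 1 edges
    return cross / max_cross


def clustering_coefficient(adj, depth):
    """Average of the per-vertex coefficient over all vertices."""
    return sum(local_coefficient(adj, v, depth) for v in adj) / len(adj)


def torus(rows, cols):
    """2D torus: each vertex is joined to its four wrap-around grid neighbors."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            adj[(r, c)] = {
                ((r + 1) % rows, c), ((r - 1) % rows, c),
                (r, (c + 1) % cols), (r, (c - 1) % cols),
            }
    return adj
```

Under this reading, every vertex of a sufficiently large 2D torus has 13 vertices and 4 cross edges within depth 2, giving 4/66 ≈ 0.0606, and 25 vertices and 12 cross edges within depth 3, giving 12/276 ≈ 0.0435.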


Snapshot date    Gnutella     BA           WS           G(n,p)        2D torus
11/13/2000       0.0223545    0.0149507    0.0372667    0.00403533    0.0606061
11/16/2000       0.0088999    0.0095887    0.0372356    0.00249125    0.0606061
12/20/2000       0.0300611    0.0178844    0.0537228    0.00618598    0.0606061
12/27/2000       0.0205752    0.0184729    0.0539221    0.00620002    0.0606061
12/28/2000       0.0206982    0.0173541    0.0535703    0.00561928    0.0606061

(a) l = 2

Snapshot date    Gnutella     BA            WS           G(n,p)        2D torus
11/13/2000       0.0141344    0.00693268    0.0110796    0.00391614    0.0434783
11/16/2000       0.0100001    0.00524975    0.0110373    0.00243858    0.0434783
12/20/2000       0.0136551    0.00743268    0.0143759    0.00601365    0.0434783
12/27/2000       0.0125729    0.00773103    0.014582     0.00602383    0.0434783
12/28/2000       0.0122163    0.00718639    0.0142141    0.00545913    0.0434783

(b) l = 3

Figure 2.1: Values for the clustering coefficient as defined in definition 3 for the Gnutella, Barabasi-Albert, Watts-Strogatz, random graph, and the 2D torus topologies

We believe our definition to be in better agreement with our intuitive understanding of clustering. Furthermore, such a definition allows us to identify the aspect

of clustering in various topologies that contributes to the “short-circuiting” effect

we study in chapter 3. The results for the new clustering coefficient are presented

in figure 2.1. Besides the values for the Gnutella, the random graph and the 2D

torus, each table also contains results for the Barabasi-Albert (discussed in the sub-


sequent section) and the Watts-Strogatz models. The parameters for these models

were chosen in a way so that the number of nodes and average degree of the resulting

graph are approximately equal to those of the original Gnutella topology. For example,

the Gnutella topology snapshot from 12/28/2000 is compared to the Watts-Strogatz

topology generated according to the following parameters: n = 1125, k = 3, and

p = 1 (every node gets a random edge - the Newman-Watts version of the model is

used).

Snapshot date    Gnutella    BA         WS         G(n,p)     2D mesh
11/13/2000       3.72299     3.47491    4.59706    4.48727    20.6667
11/16/2000       4.42593     4.07535    4.61155    5.5372     21.3333
12/20/2000       3.3065      3.19022    4.22492    3.6649     22
12/27/2000       3.30361     3.18046    4.19174    3.70995    21.3333
12/28/2000       3.32817     3.20749    4.25202    3.7688     22.6667

Table 2.4: Values for the characteristic path length L for the Gnutella, Barabasi-Albert, Watts-Strogatz, random graph, and the 2D mesh topologies

As can be seen, all of the Gnutella topology instances show the small-world phenomenon: characteristic path length is comparable to that of a random graph (table

2.4), while the clustering coefficient is considerably higher. These results clearly indi-

cate strong small-world properties of the Gnutella network topology. It is our thesis

that this is an important issue to consider when modeling P2P networks such as

Gnutella. More specifically, an accurate P2P model must inevitably generate topolo-

gies exhibiting the described small-world properties. Furthermore, our discovery can

aid in designing and predicting performance of distributed algorithms, such as those

for routing and searching. For example, Gnutella’s current broadcast routing strategy


is unlikely to work well on the clustered topology of a small-world network, as

it would generate large amounts of duplicate messages. This would result in poor uti-

lization of network bandwidth and hinder scaling - a phenomenon recently observed

in practice [13].

2.2 Power-Laws

The major limitation of the described small-world models is that they do not account for the increasing evidence of various power-laws of the form $y = x^a$, governing the distribution of various graph metrics for many large, self-organizing networks [15, 10, 11, 20]. Faloutsos et al. [15] discovered four of these power-laws characterizing the topology of the Internet at both the inter-domain and router level. These power-laws are defined as follows:

Power-Law 1 (rank exponent R): The outdegree, $d_v$, of a node v, is proportional to the rank of the node, $r_v$, to the power of a constant, R: $d_v \propto r_v^{R}$. The rank $r_v$ of a node, v, is defined as its index in the order of decreasing outdegree.

Power-Law 2 (out-degree exponent O): The frequency, $f_d$, of an out-degree, d, is proportional to the out-degree to the power of a constant, O: $f_d \propto d^{O}$.

Power-Law 3 (hop-plot exponent H): The total number of pairs of nodes, P(h), within h hops, is proportional to the number of hops to the power of a constant, H: $P(h) \propto h^{H}$, $h \ll \delta$, where $\delta$ is the diameter. The number of pairs P(h) is the total number of pairs of nodes within h or fewer hops, including self-pairs, and counting all other pairs twice.

Power-Law 4 (eigen exponent E): The eigenvalues, $\lambda_i$, of a graph are proportional to the order, i, to the power of a constant, E: $\lambda_i \propto i^{E}$.
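The quantities that enter power-laws 1 through 3 can be computed directly from an adjacency structure. The following is a minimal sketch under stated assumptions: the graph is undirected, so outdegree is taken to be plain degree, it is stored as a dictionary of neighbor sets, and the function names `degree_rank`, `degree_frequency`, and `hop_plot` are hypothetical. Eigenvalues (power-law 4) are omitted because they would need a numerical library.

```python
from collections import Counter, deque


def degree_rank(adj):
    """Return (rank, degree) pairs; rank 1 is the node of highest degree."""
    degrees = sorted((len(nb) for nb in adj.values()), reverse=True)
    return list(enumerate(degrees, start=1))


def degree_frequency(adj):
    """Map each degree d to the number of nodes that have that degree."""
    return dict(Counter(len(nb) for nb in adj.values()))


def hop_plot(adj, max_hops):
    """Return [P(0), ..., P(max_hops)].

    P(h) counts self-pairs once and every other pair within h hops twice,
    which is what summing over ordered (source, target) pairs produces.
    """
    counts = [0] * (max_hops + 1)
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for d in dist.values():
            for h in range(d, max_hops + 1):
                counts[h] += 1
    return counts
```

For a three-node path, for instance, P(0) is 3 (the self-pairs), P(1) adds the two edges counted twice, and P(2) reaches the full 9 ordered pairs.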


Several research groups have also independently discovered evidence of the same

power-laws describing structural properties of the web graph [10, 11, 20]. Since these

discoveries occurred on various scales and levels of granularity, they could be taken as

indications of possible self-similar or fractal nature of the web. Of particular interest

is the fact that all of these groups reported practically identical values for the power-

law 2 exponent, ranging between 2.1 and 2.2. This observation led the authors in

[15] to suggest the use of power-law exponents as a way of characterizing different

families of graphs. In addition, they demonstrated how these exponents could be used

to approximate important graph metrics, such as the number of nodes, the number

of edges, the average neighborhood size, and the effective diameter. Albert, Jeong,

and Barabasi went even further to argue the scale-invariant nature of the power-law

distributions, suggesting that ”large networks self-organize into a scale-free state, a

feature unpredicted by all existing random graph models” [10].

The significance of these power-laws is that they clearly outline the inadequacy

of the described small-world models to accurately capture the true nature of many

large networks. The problem is that these models do not explain the existence of

highly connected nodes, a simple consequence of the power-law 2. The described

power-law observations have therefore opened up a search for alternative techniques

for generating realistic network topologies that exhibit such power-law phenomena.

2.2.1 Power-Law Models

Based on the discoveries described above, a number of alternative models have been

proposed that produce graphs exhibiting the observed power-law properties. While

some set out to synthetically reproduce various power-law distributions accepting

them as empirical facts, others attempt to provide an explanation as to the origin of

such phenomena. An example of the latter is a model proposed by Barabasi and Albert


[10]. The two argue that the existence of power-laws in many real networks is caused

by two key features: growth and preferential attachment. The growth feature describes

the dynamic nature of many real networks, in which new vertices are continuously

added. Preferential attachment is used to model the fact that in real networks, new

vertices are more likely to link to existing vertices of high degree, resulting in the so-called

“rich-get-richer” phenomenon. In the case of the web graph, these two features are

evident as new pages are created daily, typically containing hyperlinks to already

highly connected and therefore highly visible pages. Barabasi and Albert build their

model by starting with a small number of vertices and no edges. Then, a new vertex is

added at each time step by linking it to m other vertices already present in the system.

The existing vertices are chosen with probability that is proportional to their degree.

This process produces a random graph that reaches a steady state characterized

by the same power-law distribution observed in many real networks. Notice that,

without continuous addition of new vertices, this model would eventually produce a

clique, as all the vertices would ultimately be connected. In fact the authors proved

that both growth and preferential attachment are necessary to correctly model the

behavior of real networks: the growth factor ensures a stationary power-law distribution,

and preferential attachment is responsible for its scale-free nature. The Barabasi-

Albert model possesses certain intuitive appeal, particularly when used to model the

topology of many P2P networks such as Gnutella. Recently, a topology generator

called BRITE was proposed that produces graphs exhibiting all four of the discussed

power-laws based on factors such as growth and preferential attachment studied by

Barabasi and Albert [21]. We are currently experimenting with adopting this model

for P2P networks such as Gnutella.
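The growth and preferential-attachment steps of the Barabasi-Albert model described above can be sketched as follows. This is an illustrative reading, not the authors' code or the BRITE generator. Two choices are assumptions of the sketch: the process starts from m isolated vertices and the first new vertex links to all of them (since every initial degree is zero), and degree-proportional selection is done by sampling from a list in which each vertex appears once per incident edge. The name `barabasi_albert` is hypothetical.

```python
import random


def barabasi_albert(n, m, rng):
    """Grow a graph to n vertices, attaching each new vertex to m existing ones.

    Returns a list of undirected edges (new_vertex, existing_vertex).
    """
    edges = []
    targets = list(range(m))  # assumption: first new vertex links to all m seeds
    repeated = []             # each vertex listed once per incident edge
    for new in range(m, n):
        for t in targets:
            edges.append((new, t))
        repeated.extend(targets)
        repeated.extend([new] * m)
        # Choose m distinct targets for the next vertex; a vertex is drawn with
        # probability proportional to its current degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges
```

Because each new vertex contributes exactly m edges, a graph grown to n vertices has m(n − m) edges, which makes it straightforward to match the node count and average degree of a measured topology.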

If the goal is to simply generate graphs that match exactly the power-law prop-

erties observed empirically, then the α − β graph model proposed by Aiello, Chung,

and Lu could be used [7]. This model involves two parameters, α and β, represent-


ing the intercept and the slope of the plot of degree distribution on a log-log scale.

Since any fixed pair of values for α and β defines a finite set of graphs, the authors

propose simply selecting a graph from this set at random. More recently, Internet

topology generators have been proposed that subscribe to the same philosophy of

using power-laws to guide graph construction [23].

2.2.2 Power-Laws in Gnutella

Upon analyzing the Gnutella topology data obtained using our network crawler, we

discovered it obeys all four of the power-laws described in the previous section. The

results for power-laws 1 through 4 are presented in figures 2.2, 2.3, 2.4, and 2.5,

respectively. Power-law relationships between variables are typically plotted on a log-log scale, since such a plot should, by definition, appear linear. The power-law exponent can then be defined as the slope of this line. We used linear regression to fit a line to the set of two-dimensional points using the least-squares method. To quantify the validity of the approximation, with each figure we include the absolute value of the correlation coefficient r, which ranges between −1 and 1. A |r| value of 1 indicates perfect linear correlation.
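The fitting step can be illustrated with ordinary least squares on the logarithms of the data. The sketch below is an assumption about how such a fit may be carried out, not the original analysis scripts; the function name `fit_power_law` is hypothetical, and natural logarithms are used so that the fitted curve has the form exp(intercept) * x**slope.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (ln x, ln y).

    Returns (slope, intercept, |r|), where slope is the power-law exponent
    and |r| is the absolute value of the correlation coefficient.
    All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, abs(r)
```

For data that follow an exact power law, the returned slope equals the exponent and |r| equals 1; measured degree or hop-count data give values of |r| below 1, and the closer to 1 the better the fit.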

As mentioned earlier, power-law 1 is evaluated by sorting all nodes in descending

order according to their degree, and plotting degree versus rank of a node in this

sequence on a log-log scale. For comparison, we present plots for both the snapshots

of the Gnutella network topology and a simple connected random graph of the same

size. Figure 2.2 shows this power-law holds for the Gnutella topology instance with

rank exponent R = −0.98 and a correlation coefficient of 0.94, which cannot be said

for the random topology.

Power-law 2 is of particular importance, because it is the one that is most fre-

quently cited in the recent studies of large network topologies. Figure 2.3 shows


(a) Gnutella 12/28/00 (|r| = 0.94), fitted line exp(6.04022)*x**(−1.42696); (b) Random graph

Figure 2.2: Log-log plots of degree versus rank (power-law 1)

node degree power-law exponent of −1.4 for the Gnutella topology. We must remark

that a group called Clip2 independently discovered this particular power-law for the

Gnutella network topology [13]. However, they reported a power-law exponent of −2.3, in disagreement with our result. We believe this discrepancy arises because our results are based on network crawls performed during December of 2000, while the other result dates back to the summer of the same year.

Since that time, the Gnutella network has undergone significant changes in terms

of its structure and size, as described in [13]. While the values of the node degree

exponent O for all of the Gnutella topology instances obtained during the month of

December are consistently around −1.4, we have observed O values of −1.6 for the

data obtained in November. This may be taken as indication of a highly-dynamic,

evolving state of the Gnutella network. We are nevertheless currently attempting to

establish contact with people from Clip2 in order to further examine reasons for this

discrepancy. Interestingly, power-law degree distributions have recently been reported

for another file-sharing P2P application, Freenet [22].


(a) Gnutella 12/28/00 (|r| = 0.96), fitted line exp(7.27358)*x**(−0.98116); (b) Random graph

Figure 2.3: Log-log plot of frequency versus degree (power-law 2)

It has been shown that power-laws 3 and 4 hold for almost all types of topologies,

including random, regular, and hierarchical [21]. Power-law 3 by definition holds for regular topologies such as a ring topology and a 2-D mesh, with hop-plot exponents of 1 and 2, respectively, for $h \ll \delta$. It is therefore not surprising that we have also

observed these power-laws in the Gnutella network topology. However a case has been

made that, while the mere presence of these two power-laws is not a distinguishing

property of a graph, the values of their exponents can be. For this reason, instead

of plotting power-laws 3 and 4 for a single instance of the Gnutella topology and a

random graph of the same size, we compare results for several different snapshots

of the Gnutella topology. Figure 2.4 shows the hop-plots for four of these Gnutella

topology snapshots described previously. For each one, we approximated only the

first four hops. Clearly, power-law 3 holds for all four snapshots with very high

correlation coefficients of 0.99. More importantly, the hop-plot exponents seem to be

clustered tightly around the value of 3.5. Notice that this value lies right between the

exponent values reported for the inter-domain and router level topology instances of


(a) Gnutella 11/16/00 (|r| = 0.99), fitted line exp(8.36937)*x**(3.48228); (b) Gnutella 12/20/00 (|r| = 0.99); (c) Gnutella 12/27/00 (|r| = 0.99); (d) Gnutella 12/28/00 (|r| = 0.99). Legend entries: Gnutella snapshot, fitted line, maximum number of pairs.

Figure 2.4: Log-log plot of the number of pairs of nodes versus the number of hops (power-law 3) for four snapshots of the Gnutella topology

the Internet [15]. Like the authors in [15, 21], we must concede that the results for

this particular power-law may be misleading given such a small number of data points.

This limitation is imposed by the fact that these graphs have a small diameter.

An application of power-law 3 that seems particularly relevant to Gnutella was

suggested by the authors in [15]. They introduced a concept of the effective diameter


δef , which is essentially the number of hops required to reach a “sufficiently large”

portion of a network. In other words, any two nodes are within δef hops of each other

with high probability. We present the definition below for convenience.

Definition 4 (effective diameter) Given a graph with N nodes, E edges, and hop-plot exponent H, the effective diameter, $\delta_{ef}$, is defined as:

$$\delta_{ef} = \left( \frac{N^2}{N + 2E} \right)^{1/H}$$

Substituting the values for the Gnutella topology snapshot from December 28th,

2000, we get that, during that time, a better value for the maximum TTL would have

been 4 (instead of 7, which is the default specified by the Gnutella protocol).
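The substitution can be checked numerically. The sketch below uses N = 1125 and E = 4080 from table 2.2 for the December 28th snapshot; the hop-plot exponent H = 3.5 is an assumption, taken from the value around which the measured exponents are reported to cluster, since the exact exponent for that snapshot is not stated here. The function name `effective_diameter` is hypothetical.

```python
def effective_diameter(nodes, edges, hop_exponent):
    """Effective diameter (N^2 / (N + 2E))^(1/H) from Definition 4."""
    return (nodes ** 2 / (nodes + 2 * edges)) ** (1.0 / hop_exponent)


# N and E from table 2.2 (12/28/2000); H = 3.5 is an approximate, assumed value.
estimate = effective_diameter(1125, 4080, 3.5)
```

With these inputs the estimate comes out slightly above 4, consistent with the suggested maximum TTL of 4. A different value of H would shift the estimate, so the figure should be read as approximate.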

Similar trends to the ones reported for the hop-plots appear in the eigenvalue

plots. Figure 2.5 shows the first 20 eigenvalues versus their order on a log-log scale

for the Gnutella topology snapshots. Once again, we see the consistency of power-law

exponents across different snapshots. Interestingly the exponents for the snapshots

obtained during the month of December are practically equal, while the exponent

for the snapshot from November is slightly smaller. Again, this fact may be taken

as an indication that the Gnutella network was going through an evolutionary state,

captured by these power-law exponents. There is a rich literature proving that eigen-

values of a graph are closely related to its topological properties. In the future, we

plan to further analyze the eigenvalues of P2P network topologies and their practical

implications.

Our empirical results clearly outline strong power-law properties of the Gnutella

network topology. It is our thesis that these properties can be utilized to improve

performance of algorithms such as those used for searching [6]. In addition, we believe

that an accurate model of the network topology of P2P network applications such as

Gnutella must inevitably exhibit power-laws 1 and 2, as well as produce

all four power-law exponents in close agreement with the ones observed empirically.


(a) Gnutella 11/16/00 (|r| = 0.97), fitted line exp(2.27850)*x**(−0.22301); (b) Gnutella 12/20/00 (|r| = 0.89), fitted line exp(2.83511)*x**(−0.30114); (c) Gnutella 12/27/00 (|r| = 0.94), fitted line exp(2.82127)*x**(−0.29278); (d) Gnutella 12/28/00 (|r| = 0.94), fitted line exp(2.81997)*x**(−0.29412)

Figure 2.5: Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots of the Gnutella topology


Chapter 3

Modeling Network Latencies

In this chapter we further refine our model of P2P networks to include traffic. In par-

ticular, we study the effects of heterogeneous latencies on reachability in P2P network

applications operating under flooding protocols. We call this potentially devastating

effect “short-circuiting.” Traditionally, latency has been studied to model network

performance as it relates to throughput. Network reachability has traditionally been

studied through the analysis of distance in graphs. In this work, we point out

a novel fact that heterogeneous latencies can significantly impact reachability, inde-

pendent of distance.

We begin with a brief introduction of short-circuiting. We then present our formal

model for studying the effects of short-circuiting. Finally, we report our results from

both network simulation studies and empirical tests performed on Gnutella. We

conclude based on these results that, on average, the real effects of short-circuiting

are significant, but not devastating to the performance of an overall system.


3.1 Latency Effects

We have seen in chapter 1 that P2P applications are inherently decentralized, there-

fore relying on efficient decentralized algorithms for communication between hosts.

As a result, many of these applications, including Gnutella, have adopted a flood-

ing mechanism to forward messages in an effort to maximize reachability. Notice

that reachability, or the number of hosts receiving a particular message, is an im-

portant performance metric for many P2P applications, particularly those used for

file-sharing.

Flooding dictates that each host is to simply forward each received message to

all of its neighbors, except the one from which the message was received. As such,

flooding provides a simple and effective way of broadcasting messages in a dynam-

ically changing network without requiring the use of routing tables or knowledge

of the global network topology. However it clearly does not scale for Internet-wide

applications, as it generates a large number of redundant messages and uses all avail-

able paths across the network. For this reason, in practice, flooding is typically

implemented in combination with one or more of the following standard governing

mechanisms designed to restrict its scope and limit redundant messages:

Mechanism 1. Time-to-Live Bounds Time-to-Live (TTL) is a governing mechanism that prevents messages from traveling farther than a specified number of hops, defined by an initial TTL value. TTL bounds are implemented by including a TTL value field in each message header. When a node receives a message, it first checks whether the TTL value is greater than zero. If so, the node continues the flood with a decremented TTL. Otherwise, the message is dropped.

Mechanism 2. Unique Message Identification Unique Message Identification is


a mechanism that prevents unique messages from being transmitted more than

once from each node. This mechanism is implemented by including in each

message header a UID (a unique ID label, or unique sequence number). When

a node receives a message it checks to see if it has previously seen that message.

If it has, the message is dropped and not forwarded. Otherwise, the node stores

the new UID in a local table, and then continues the flood.

Mechanism 3. Path Identification Path Identification is a mechanism that pre-

vents message paths from looping. This mechanism is implemented by including

in each message a header that records which nodes of the network have already

encountered the message. Before forwarding messages, each node simply checks

the header to verify whether or not it has previously seen the message. If so,

the message is dropped and not forwarded. If not, the node adds its name to

the header, and then continues the flood.

Ordinarily, a broadcast operation functioning under these mechanisms should

reach all nodes within the TTL bound of the broadcast source. However we have

discovered that network latencies can negatively impact reachability of broadcast op-

erations. We define latency as the time it takes a message to traverse a link in the

network. We will show that, when Mechanisms 1 and 2 are implemented together,

heterogeneous network latencies can potentially have a devastating effect on reach-

ability. We call this phenomenon the ”short-circuiting effect,” and describe it as

follows:

Short-circuiting Effect. Consider a message broadcast from a source node a, and

consider a path P = {u1, u2, . . . , up}, joining nodes a = u1 and b = up. It is

possible that there may be no throughput of the broadcast messages from a to b

along P , even if the hop-length p of the path P is less than or equal to the TTL

value t. This can result from heterogeneous latencies, as the following scenario


shows. Suppose there exists a message path Q from a to some intermediate

node x = ui of P , having a strictly smaller latency (but, with possibly a greater

hop number). Then a broadcast message originating from a, and following path

P will be killed (by Mechanism 2) when it reaches x, since it is the duplicate

of an earlier arriving message originating from a, but following path Q. Notice

that there may also be no throughput along path R consisting of the path Q

together with the subpath of P from x to b. This effect results from the fact

that R may possibly have a hop-length strictly greater than t, and hence, by

Mechanism 1 there is no throughput of the broadcast message originating at

a along path R. And, indeed, there may be no throughput of the broadcast

message along any path from a to b; it is this latency effect on reachability

which we call short-circuiting.

For the remainder of this chapter, we will consider broadcasts as operating under

the combination of Mechanisms 1 and 2. Note that short-circuiting-like effects cannot be caused by the combination of Mechanisms 1 and 3, since, in that case, all

loop-free paths within the TTL bound are valid message paths.

3.2 Modeling the Short-Circuiting Effect

In order to analyze the problem of short-circuiting, we refine our network model from chapter

1 to include edge weights representing latency values on communication links. We

consider the latency of a message path to be the sum of the latencies of its edges.

The flooding operation governed by mechanisms 1 and 2 in a network G is defined

by the following protocol regimen. We denote packets in the network by $p(u, t, h)$, with unique message identifier UID = $u$, initial TTL value TTL = $t$, and current hop-value HOP = $h$. The hop-value denotes the number of hops from the packet’s source node. We denote a packet (ready for broadcast) originating at node $s$, with initial TTL = $t$, by $p(u_s, t, 0)$. The broadcast regimen operates as follows, and

defines the valid message paths associated with the transmission of the broadcast

packet.

1. Source s sends $p(u_s, t, 0)$ to all the neighbors of s, injecting the packet on all links connected to s at the same time.

2. Nodes process packets on a first-come-first-served basis as follows: when a node v receives packet $p(u_s, t, h)$, it checks whether the UID $u_s$ has been seen previously. If it has, then the packet is dropped with no further processing.

3. If not, then v records $u_s$ in its local table, and checks whether t = h. If t > h, then v replicates and forwards the message $p(u_s, t, h+1)$ (with incremented hop count) to all neighbors except the node from which it received the packet. If t = h then the packet is dropped and not forwarded.
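The regimen above, combined with per-link latencies, can be simulated with a small event queue. The sketch below is illustrative only and rests on several assumptions: latencies are symmetric and stored in a dictionary keyed by `frozenset((u, v))`, simultaneous arrivals are processed in the order they were sent, and the hop value carried by a packet is read as the number of links it has traversed on arrival, so the source's neighbors see a hop value of 1 and nothing is delivered beyond t hops. The function name `flood_horizon` and the six-node example graph are hypothetical.

```python
import heapq


def flood_horizon(adj, latency, source, ttl):
    """Return the set of nodes that receive a flood from `source` (TTL + UID)."""
    seen = {source}   # nodes that have already recorded this message's UID
    reached = set()
    events = []       # heap of (arrival_time, seq, node, sender, hops)
    seq = 0           # unique tie-breaker: earlier sends are processed first
    for w in adj[source]:
        heapq.heappush(events, (latency[frozenset((source, w))], seq, w, source, 1))
        seq += 1
    while events:
        time, _, node, sender, hops = heapq.heappop(events)
        if node in seen:
            continue  # duplicate UID: drop without further processing
        seen.add(node)
        reached.add(node)
        if hops < ttl:  # hop budget left: forward to all neighbors but the sender
            for w in adj[node]:
                if w != sender:
                    arrival = time + latency[frozenset((node, w))]
                    heapq.heappush(events, (arrival, seq, w, node, hops + 1))
                    seq += 1
    return reached


# Hypothetical example: path P = a-x-y-b (3 hops) and detour Q = a-c-d-x.
edges = [("a", "x"), ("a", "c"), ("c", "d"), ("d", "x"), ("x", "y"), ("y", "b")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

uniform = {frozenset(e): 1.0 for e in edges}
skewed = dict(uniform)
skewed[frozenset(("a", "x"))] = 10.0  # the direct link a-x is slow
```

With uniform latencies and a TTL of 3, every node is reached. With the slow a-x link, the copy travelling along the detour arrives at x first with its hop budget exhausted, the later copy on the direct link is dropped as a duplicate, and y and b are never reached even though they lie within 3 hops of a.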

When latencies are introduced into this model of a flooding broadcast, complica-

tions arise as to the reachability of nodes. To determine reachability it is not sufficient

to consider only minimum-cost paths from s to v.

In order to quantify reachability, we introduce the notion of a horizon, defined as

follows:

Definition 5 The t-horizon R(s, t) from a source node s is the set of all nodes v which receive a packet $p(u_s, t, -)$ broadcast from s with TTL = t. The t-neighborhood N(s, t) from a source node s is the set of all nodes within a hop-distance of t from s. Likewise, for a set of source nodes S, we denote by R(S, t) and N(S, t) the t-horizon and t-neighborhood, respectively, from S, where we assume that the broadcast is initiated by each s ∈ S simultaneously.

In the subsequent sections, we present our experimental results on the size of

t-horizon as a function of latencies under the described broadcast model.


3.3 Empirical experiments

We have conducted a series of experiments to empirically test the effects of short-

circuiting. These experiments are divided into two categories: simulations performed

on various static network topologies and empirical tests performed on a real P2P

network application. For the latter, we use Gnutella as our case study.

3.3.1 Gnutella Studies

We have already mentioned Gnutella as a rapidly evolving technology based on the

peer-to-peer network model. In this section we continue our case study of Gnutella

with the analysis of short-circuiting effects on reachability. In order to see why

Gnutella presents a meaningful testbed for studying the problem of short-circuiting,

let us briefly describe its design. Gnutella’s application-level protocol supports two

basic types of broadcast requests: ping, which is essentially a request for a host to

announce itself, and a query. These messages are propagated through the network by

means of a flooding broadcast. The response messages are then routed back along the

same path that the original request arrived by means of dynamically updated routing

tables maintained by each host. The flooding in Gnutella is implemented using

mechanisms 1 and 2 described in previous sections, with the Gnutella software generally

limiting TTL values to at most 7. Its routing protocol, together with heterogeneous

latencies, makes Gnutella potentially vulnerable to the short-circuiting effects we have

described.

Our original interest in the effects of short-circuiting arose from an experiment

that involved crawling and mapping the entire Gnutella network. In particular, we

noted that the number of reachable hosts reported by a client was substantially smaller

than the number obtained by off-line analysis of the generated topology map. This analysis consisted of

calculating the number of elements in the BFS tree rooted at the node representing that

particular client. We consistently noted discrepancies of approximately

one half. After conjecturing that short-circuiting may play a substantial role in such

discrepancies, we set out to test this empirically.

Figure 3.1: The results of level-1 short-circuiting effects on the broadcast horizon on the Gnutella network, October 2000. The y-axis represents the broadcast horizon size, and the x-axis labels each of 68 broadcast trials. The top line is the resulting horizon from multiple distinct broadcasts from the same source, and the lower line is the resulting horizon from a single broadcast message from a single source. The discrepancy represents “level-1 short-circuiting” effects.

To test our hypothesis, we have devised an experimental method of discovering

what we call the “level-1 short-circuiting” effect. These are the effects of

short-circuiting caused by paths interfering at the first level. That is, in our experiments

we compare the 7-horizon of a message broadcast from v with the 6-horizon of distinct

message broadcasts from the neighbors of v. The idea is that sending messages with

distinct ID labels will prevent them from interfering with each other, and thereby

allow us to measure a subset of the total short-circuiting effect. The actual number

of hosts reached by the broadcast of the shared message is compared to a union of

host sets reached by the set of distinct broadcast messages. More refined estimates

of short-circuiting effects can be obtained by comparing the hop counts of messages


responding to a shared broadcast to the hop counts of messages responding to distinct

broadcasts: if the former is larger than the minimum of the latter, then we posit that

short-circuiting has occurred. Figure 3.1 shows the results of a particular experiment

of this nature conducted in October of 2000. We note that the observed reductions

average 55%.
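The comparison underlying this experiment can be written down compactly. The following Python sketch is illustrative only: the function name `level1_reduction` and the host labels are hypothetical, and the reduction is taken to be one minus the ratio of the shared-broadcast horizon size to the size of the union of the distinct-broadcast horizons, which is our reading of the comparison described above.

```python
def level1_reduction(shared_hosts, distinct_host_sets):
    """Fraction of hosts lost to level-1 short-circuiting.

    shared_hosts       -- hosts that answered the single shared broadcast
    distinct_host_sets -- one collection of answering hosts per distinct
                          broadcast sent through a neighbor of the source
    """
    union = set().union(*distinct_host_sets)
    if not union:
        return 0.0  # nothing was reachable, so nothing was lost
    return 1.0 - len(set(shared_hosts)) / len(union)

# Made-up host labels: 4 hosts answer the shared broadcast,
# 8 distinct hosts answer the three separate broadcasts.
shared = {"h1", "h2", "h3", "h4"}
distinct = [{"h1", "h2", "h3"}, {"h3", "h4", "h5"}, {"h6", "h7", "h8"}]
reduction = level1_reduction(shared, distinct)  # 0.5
```

Taking the union rather than the sum of the individual counts matters, because the same host is often reachable through more than one neighbor.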

Figure 3.2: Horizon-size versus t (x-axis: TTL value t, from 2 to 7; y-axis: horizon size, from 0 to 450; plotted series: 2 servers, 3 servers)

In another set of experiments we focused on the t-horizon as a function of the TTL

value. We performed the experiment by connecting to a set of servers and sending

successive ping messages with increasing TTL. Figure 3.2 shows the results of one such

experiment using two and three broadcast servers. As predicted by short-circuiting,

we observed a decrease in the t-horizon after the TTL exceeded a certain threshold,

typically around 5. We have been able to explain this phenomenon analytically in [9].

This particular experiment required connections to selected servers to persist over a

longer period of time, so that a number of test trials could be performed.

Difficulties in conducting experiments on Gnutella. Overall, we have found it

quite challenging to isolate the effects of short-circuiting, as well as other phenomena,


on the Gnutella application. The challenge has been mainly due to system instability,

both in terms of topology and latencies. One of our preliminary experiments focused

on measuring variance in the size of the broadcast horizon over time. We have found

that several identical tests of horizon size, which were performed consecutively, can

differ drastically in their results. Figure 3.3 shows the size of the broadcast horizon

over time using four broadcast servers. Each data point represents the horizon size

for a particular broadcast trial, with trials performed consecutively in six minute

intervals.

Figure 3.3: Horizon-size variation over time with broadcasting client using multiple connections on the Gnutella network, March 2001. The y-axis represents the horizon size, and the x-axis labels each of 180 broadcast trials, performed consecutively in six-minute intervals.

We attribute this phenomenon to the highly dynamic nature of the network and

constantly changing network conditions and topology. (We remark that in our net-

work simulations, we have also observed that slight changes in latency distribution

can result in dramatic changes in the size of the t-horizon.) Such high variance, as

well as the existence of a number of factors influencing the actual number of hosts


reached, makes it challenging to obtain meaningful results.

By far the biggest challenge to isolating the effects of short-circuiting on Gnutella

is the emergence of a new generation of “intelligent” Gnutella clients. These

clients contain built-in application logic designed to promote overall network health

by conserving bandwidth. While such clients have succeeded in allowing the Gnutella

network to scale up to about five times the original size, they have also created a

serious obstacle to conducting sophisticated experimental studies on the network.

In order to see this, consider a simple procedure for calculating the size of the t-

horizon in Gnutella, performed by sending a ping message and counting the number

of responses. Figure 3.4 shows the results of an experiment in which eight of these

procedures were performed simultaneously.

Figure 3.4: Difficulty in conducting experiments on today’s Gnutella network (x-axis ticks from 1 to 14; y-axis ticks from 0 to 3500; plotted series: ping1 through ping8)

As the figure shows, typically only one of these procedures results in a considerable

number of responses. The reason is that Gnutella clients are now “intelligent”

enough to recognize when messages are the same, and will forward only one of them.

In addition, many clients will now cache the responses to ping and query messages


for a certain amount of time. While such design decisions are understandable from

the performance standpoint, they also effectively take away the ability to accurately

determine the exact size of the broadcast horizon in Gnutella at any given time. As

a result we have found it extremely difficult to repeat experiments such as those

reported in figures 3.1 and 3.2 on the current system. Because of the difficulties with

measuring short-circuiting effects directly on the application, we turned our attention

to a series of network simulation studies in which we were able to precisely isolate the

effects of short-circuiting on theoretical network topologies.

3.3.2 Network Simulation Studies

In order to study the practical impact of short-circuited t-horizon reductions, we

needed to carefully consider both the topology of the network and the assignment of

latencies. Simulated studies allowed us to isolate the effects of short-circuiting on fixed

topologies. We conducted the simulations using our network simulator gnutsim, based

on a modified version of Dijkstra’s shortest path algorithm. The Java source code for

gnutsim is given in appendix B. To carry out these simulations, we needed to choose

the network topological model, as well as the network latency model. We report in

this chapter on a number of well-known regular topologies, such as the mesh and the

hypercube, as well as the Watts-Strogatz “small world” topology and snapshots of the

Gnutella topology obtained through crawling. To model network latencies we used

several classes of weights representing various commonly used Internet connection

bandwidths. We conducted our experiments by using random distributions of these

weights.
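To make the simulated broadcast model concrete, the following is a minimal Python sketch of a latency-driven flood. It is not gnutsim (whose Java source is in appendix B) but an illustration in the same spirit: arrivals are processed in time order from a priority queue, as in Dijkstra's algorithm, a node keeps only the first copy of a message it sees (mechanism 2), and a copy whose TTL is exhausted is not forwarded (mechanism 1). The function name, the data layout, and the convention of counting the source as reached are assumptions of this example.

```python
import heapq
from itertools import count

def flood_horizon(adj, latency, source, ttl):
    """Return the set of nodes reached by a flood from `source` with TTL `ttl`.

    adj     -- dict: node -> list of neighbors
    latency -- dict: frozenset({u, v}) -> delay of the link between u and v
    """
    seen = set()
    tie = count()  # unique tie-breaker so heap entries never compare node labels
    events = [(0.0, next(tie), source, ttl, None)]
    while events:
        time, _, node, hops_left, sender = heapq.heappop(events)
        if node in seen:
            continue              # duplicate message ID: dropped
        seen.add(node)
        if hops_left == 0:
            continue              # TTL exhausted: received but not forwarded
        for nbr in adj[node]:
            if nbr != sender:     # never echo back to the sending neighbor
                delay = latency[frozenset((node, nbr))]
                heapq.heappush(events, (time + delay, next(tie), nbr, hops_left - 1, node))
    return seen

# Small example: c is two hops from s, but only via b.
adj = {"s": ["a", "b"], "a": ["s", "b"], "b": ["s", "a", "c"], "c": ["b"]}
links = [("s", "a"), ("s", "b"), ("a", "b"), ("b", "c")]
uniform = {frozenset(e): 1.0 for e in links}
skewed = dict(uniform)
skewed[frozenset(("s", "b"))] = 10.0  # slow direct link from s to b

full = flood_horizon(adj, uniform, "s", 2)   # {"s", "a", "b", "c"}
short = flood_horizon(adj, skewed, "s", 2)   # {"s", "a", "b"}
```

With the slow s-b link, b first hears the message through a, with no hops left, and later drops the direct copy from s as a duplicate, so c is never reached. This is the short-circuiting effect on a four-node graph.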

We present the statistics of our simulation studies as tables, which report the

reduction ratios in reachability caused by short-circuiting, given by randomly chosen

latencies on a fixed topology. Each table is associated with a fixed topology. Each

row of the table represents results from 100 trials using random latencies. In each

row we report, for a fixed t, the worst, average, and best observed t-horizon, and the

t-neighborhood (which is equal to the t-horizon when using uniform latencies). We then

give the reduction ratios, obtained by dividing the worst and the average t-horizon by the

t-neighborhood (columns WRR and MRR, respectively).

TTL  Worst   Avg  Best   Nbhd   WRR   MRR
  1      8     8     8      8  100%  100%
  2     18    21    24     25   72%   84%
  3     24    47    66     69   35%   68%
  4     43    84   124    138   31%   61%
  5     67   150   238    310   22%   48%
  6    121   274   424    678   18%   40%
  7    278   498   723   1399   20%   36%
  8    434   819  1364   2771   16%   30%
  9    765  1388  2307   5018   15%   28%
 10    977  2148  3420   7729   13%   28%
 11   2030  3153  4549   9449   21%   33%
 12   2252  4290  5812   9928   23%   43%
 13   3692  5519  6599   9994   37%   55%
 14   4995  6392  7563  10000   50%   64%

(a) Reduction ratios for the Watts-Strogatz topology

(b) Histogram of 1000 trials with random distribution of latencies (t = 10)

Figure 3.5: Short-circuiting effects for the Watts-Strogatz topology (nodes = 10000, k = 3, p = 0.2)
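As a worked check of how the two ratio columns follow from the other entries in a row, the short Python sketch below recomputes them for two rows of the Watts-Strogatz table. The function name `reduction_ratios` is ours, and rounding to whole percentages is an assumption that matches the printed values.

```python
def reduction_ratios(worst, avg, nbhd):
    """Return (WRR, MRR) in whole percent: worst and average t-horizon over t-neighborhood."""
    return round(100 * worst / nbhd), round(100 * avg / nbhd)

row_t10 = reduction_ratios(worst=977, avg=2148, nbhd=7729)   # (13, 28)
row_t7 = reduction_ratios(worst=278, avg=498, nbhd=1399)     # (20, 36)
```

A ratio of 13% means that, in the worst trial, the broadcast reached only about one eighth of the nodes that lie within t hops of the source.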

Figure 3.5 represents the results for the Watts-Strogatz small-world topology. The

histogram on the right represents distribution of t-horizon values over 100 trials using

random latencies for t = 10, which is the value of t for which the reduction ratios are

the most severe. The results for other topologies are presented in appendix C.

Observations and Conclusions. Our empirical results indicate that, in practice,

the effects of short-circuiting are not as devastating as suggested by the theoretical re-

sults in [9]. We have observed the most significant impact on “small-world” topologies

such as our Gnutella snapshots and Watts-Strogatz network models. For these graphs,

we have observed reduction ratios in t-horizon size of over 90% in the worst case,

for certain values of t. In other words, we have observed that with random latencies

one can expect instances where the ratio of sizes of the t-neighborhood divided by

t-horizon is greater than 10 to 1, as shown in figure ??. Furthermore, the histogram

in the same figure shows that the reduction in reachability caused by short-circuiting

was always greater than 50% using random latencies.

In our experimental studies we have also observed that both random graphs and

highly structured graphs such as the mesh and hypercube tend to have, on aver-

age, less pronounced short-circuiting effects, as compared with “small-world” graphs.

Intuitively, this is best understood by considering how the clustering property,

as defined in chapter 2, can promote short-circuiting.

In general, for a fixed TTL = t, the t-horizon sizes tend to be

normally distributed with small variance, independent of network topology. We have

also observed that, independent of topology, mean reduction ratios depend on

the TTL = t. Our results suggest that the reduction becomes more severe as t increases

until a certain threshold is reached, usually at about the point where t equals half the

network radius or diameter, after which the reduction decreases.


Chapter 4

Gnutella Crawler Implementation

In this chapter we discuss issues related to design and implementation of our Gnutella

network crawler. We begin by providing a brief introduction to Gnutella and its

protocol, necessary for understanding the remainder of this chapter. We then present

both the sequential and parallel algorithms for discovering topology of the Gnutella

network, followed by the discussion of our distributed implementation using Java

RMI.

4.1 Introduction to Gnutella

Gnutella can be best explained as a fully distributed, information sharing technology.

It originated as a project at Nullsoft, a subsidiary of America Online, but was aban-

doned out of fear of its potential use for copyright infringement. After being quickly

reverse-engineered by several programmers and open-source enthusiasts, Gnutella’s

popularity really took off. Gnutella allows distributed file sharing by allowing each

user to specify directories on their local machine they want to share. In this sense,

Gnutella can be viewed as a distributed file storage system with search capabili-

ties. Unlike its predecessor Napster, which relies on a centralized search database,


Gnutella promotes decentralization of all network functions. As we have already seen,

Gnutella is based on a peer-to-peer model. This means that users connect to each

other directly through a piece of client-server software, forming a high-level network.

Throughout this thesis, we have referred, and will continue to refer, to this high-level network

as the Gnutella network, or GnutellaNet. Because Gnutella software functions as

both a server and a client, it is sometimes referred to as a “servant.” In this thesis

we may use the terms client, servant, and host interchangeably to refer to Gnutella

software running on a particular machine.

4.1.1 Gnutella Protocol

Each Gnutella client implements the application level Gnutella protocol, which spec-

ifies how messages are routed between GnutellaNet hosts. We have already described

Gnutella’s protocol design at a high-level in chapter 3. We will now complete our

description with a few implementation details.

The Gnutella protocol supports four basic types of messages, summarized in table 4.1.

The routing technique employed by the Gnutella protocol is a form of controlled

flooding, where messages are passed recursively between hosts. Flooding operates

by each Gnutella host forwarding the received ping and search messages to all of its

neighbors, except to the one that sent the message. To limit exponential spread of

messages through the network, each message header contains a time-to-live (TTL)

field. TTL is used in the same fashion as in the IP protocol: at each hop its value

is decremented until it reaches zero, at which point the message is dropped. This

is equivalent to mechanism 1 described in chapter 3. The maximum TTL value

specified by the Gnutella protocol is seven. Recall that this restriction effectively

segments the Gnutella network into subnets, imposing on each user a virtual “horizon”

beyond which their messages cannot reach. In practice, this situation is acceptable

as information may still get around.

Type        Description                             Contains
Ping        Request for a host to announce itself   No body
Pong        Reply to Ping message                   IP and port of responding host, number and size of files shared
Query       Search request                          Minimum speed requirement for responding host, search string
Query Hits  Reply to Query message                  IP, port, and speed of responding host, number of matching files and their indexed result set

Table 4.1: Gnutella protocol message description

Each Gnutella message is also flagged with a

unique ID. Message ID is used by peers to detect and subsequently drop duplicate

messages, indicating a loop in GnutellaNet topology (mechanism 2). In addition, it is

also used to route the response messages back along the path on which the original request

arrived. This is implemented by each host maintaining a dynamic routing table of

message IDs and connection labels indicating a particular connection along which

that specific message arrived. When a response message arrives at a host, it should

contain the same message ID as the original request. The host then checks its routing

table to determine along which link the response message should be forwarded. This

technique greatly improves efficiency while also preserving network bandwidth.
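The per-host bookkeeping described above can be sketched in a few lines of Python. This is an illustration only, not code from any Gnutella client: the class name `Host`, its method names, and the use of plain strings as connection labels are assumptions made for the example.

```python
class Host:
    """Minimal model of a host's duplicate detection and reverse-path routing."""

    def __init__(self):
        # message ID -> label of the connection on which the request first arrived
        self.routes = {}

    def on_request(self, msg_id, ttl, from_conn, connections):
        """Handle an incoming ping or query; return [(connection, ttl), ...] to forward to."""
        if msg_id in self.routes:
            return []                     # duplicate ID: drop (mechanism 2)
        self.routes[msg_id] = from_conn   # remember the way back for responses
        ttl -= 1                          # one hop consumed
        if ttl <= 0:
            return []                     # TTL expired: do not forward (mechanism 1)
        return [(c, ttl) for c in connections if c != from_conn]

    def on_response(self, msg_id):
        """Return the connection a pong or query hit should be sent on, or None if unknown."""
        return self.routes.get(msg_id)

host = Host()
conns = ["A", "B", "C"]
first = host.on_request("id-1", 3, "A", conns)    # forwarded to B and C with TTL 2
second = host.on_request("id-1", 3, "B", conns)   # same ID again: dropped
back = host.on_response("id-1")                   # "A"
```

Because the table records only the first arrival of each ID, a response always retraces the path of the fastest copy of the request, which under heterogeneous latencies need not be the path with the fewest hops.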


4.1.2 Discovering Gnutella Network Topology

Topology discovery in IP networks is a well-studied area of research [26]. Generally

the approach is based on some protocol-specific feature, as in the case of traceroute.

Although the Gnutella protocol is much simpler than IP and provides no feedback regarding

message delivery, it nevertheless provides the necessary functionality for mapping

GnutellaNet topology. Notice that, according to the Gnutella protocol, it is possible

to discover neighbors of a particular host by connecting to that host and sending a

ping message with TTL = 2. As a result, pong messages would be sent back from

the connected host and all of its immediate neighbors. A complete network topology

could therefore be discovered by connecting to all the hosts, discovering their neigh-

bors, and combining the information into a single graph. We refer to this process

as crawling. Notice that, by following the described procedure, each edge would be

discovered twice, thus introducing a level of redundancy. However, it is still necessary

to connect to all the hosts in order to guarantee that the obtained topology map is

complete.

Compared with IP networks, GnutellaNet is highly dynamic. This means that its

topology is constantly changing - nodes and edges are added and removed as hosts

join and leave the network, establish new connections, and close the existing ones.

Therefore any topology discovery algorithm operating on the Gnutella network is

really capturing an instance, or a snapshot of the topology at a specific point in time.

Clearly, this poses an additional requirement for any topology discovery algorithm

to be efficient, since the accuracy of the topology map is inversely proportional to

the actual running time of an algorithm that was used to obtain it. In designing our

crawler, we have paid close attention to this requirement.


4.2 Design

In this section we discuss some issues related to design of our Gnutella network

crawler. We present informal performance analysis for both our sequential and parallel

algorithms for discovering Gnutella network topology.

4.2.1 Algorithm

Based on the procedure described in the previous section for discovering GnutellaNet

topology, an intuitive design solution might be to use the BFS to crawl the network,

applying the algorithm for discovering direct neighbors to each encountered host.

However, there are some practical issues that make this approach inefficient. In order

to see this, let us first examine the basic operation of discovering neighbors of a single

Gnutella host. This operation requires establishing a connection, sending a ping

message, and waiting for all pong messages to be received - overall a time-consuming

process with running time on the order of several minutes. However, it is clear that such

an operation represents a lower bound for any topology discovery algorithm operating

on Gnutella and based on the procedure described in the previous section. We will

therefore use this basic operation as a unit in our performance analysis of algorithms

for discovering GnutellaNet topology.

The complexity of the BFS algorithm for discovering topology of the Gnutella

network with N hosts is clearly O(N), since the basic operation is applied once to each host. Also, for the moment, let us assume that

our crawling workstation is capable of maintaining up to b simultaneous network

connections. Then if b ≥ N and we had a list of addresses for all the Gnutella hosts,

we could simply connect to all of them simultaneously and obtain the entire network

topology map in constant time. Fortunately, such a list is available, as every Gnutella

client maintains a dynamically updated list of live hosts. Using this list as input, we

can now formulate our new algorithm for discovering GnutellaNet topology as follows:


Procedure buildTopoMap (G, l)

Input: An empty graph G, and a complete host list l

Output: A graph G representing the Gnutella network topology

for each element h of l
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge h − h2 to G
            if (h2 is not in l)
                add h2 to the end of l
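A runnable counterpart of procedure buildTopoMap is sketched below in Python. It is not the Java crawler described in this chapter: it visits hosts one after another rather than over simultaneous connections, and the network access is replaced by a caller-supplied function `ping_ttl2`, a hypothetical stand-in that returns the hosts answering a TTL = 2 ping sent to h, or None when the connection fails.

```python
def build_topo_map(host_list, ping_ttl2):
    """Return the set of undirected edges discovered from an initial host list.

    host_list -- initial (possibly incomplete or stale) list of host addresses
    ping_ttl2 -- function(h) -> iterable of responding hosts, or None if h is unreachable
    """
    hosts = list(host_list)
    known = set(hosts)
    edges = set()
    i = 0
    while i < len(hosts):            # the list grows as new hosts are discovered
        h = hosts[i]
        i += 1
        responders = ping_ttl2(h)
        if responders is None:
            continue                 # dead host: ignored
        for h2 in responders:
            if h2 != h:
                edges.add(frozenset((h, h2)))   # a set absorbs the second sighting of an edge
            if h2 not in known:
                known.add(h2)
                hosts.append(h2)     # crawl newly discovered hosts later
    return edges

# Made-up network: a - b - c, plus a stale address "d" in the input list.
network = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

def fake_ping(h):
    # The pinged host answers together with its immediate neighbors.
    return [h] + network[h] if h in network else None

edges = build_topo_map(["a", "d"], fake_ping)   # {a-b, b-c}
```

Storing each edge as an unordered pair removes the redundancy noted earlier, where every edge is reported once from each of its endpoints.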

Due to the highly dynamic nature of the network, the input list of hosts is not guaranteed

to be either complete or perfectly accurate. This means that new hosts

not contained in the list could have just joined the network and, furthermore, hosts

contained in the list may no longer be active. Nevertheless our algorithm will still

work, as new hosts will be discovered at run-time and added to the end of the list.

Similarly, hosts that are no longer active will simply be ignored. The ability of our

algorithm to work with incomplete input data is particularly important considering

the highly dynamic nature of the Gnutella network. However, the more complete the list

is, the closer the performance of our algorithm will be to optimal.

Notice that our algorithm in effect partitions the problem of discovering Gnutella

network topology into two steps, or phases: discovering nodes (host list) and discov-

ering edges (connections). Since the functionality for solving the first phase is already

provided through the existing Gnutella client software, our algorithm’s focus is on the

second phase of the problem.


4.2.2 Initial Implementation

We have implemented the algorithm presented in the previous section as a Java

application. We chose Java as our development platform primarily for its support

for networking and threads. Platform-independence was also an important benefit,

particularly for our distributed implementation described in the subsequent sections.

The main problem with our initial implementation is due to our original assump-

tion that the number of connections that could be maintained simultaneously is

greater than the total number of Gnutella hosts. In practice, this assumption doesn’t

hold as the number of live Gnutella hosts at any given time is typically on the order

of thousands. To cope with this situation we were forced to organize threads into

groups of b, where b is the maximum number of simultaneous connections that our

system could handle. This strategy introduces additional complexity and, as already

discussed, sacrifices the integrity of a time-critical task such as topology discovery in

a highly dynamic network. However since connections to different Gnutella hosts can

be done asynchronously, a natural solution would be to run the crawler in parallel.

The following section describes issues involved in discovering GnutellaNet topology

in parallel, as well as our implementation using Java RMI.

4.2.3 Parallel Algorithm

The simplest and perhaps the most natural way to make our topology discovery algo-

rithm run in parallel would be to partition the initial list of Gnutella host addresses.

Each processor would then be responsible for discovering neighbors of only a subset of

hosts. In addition, each processor would need to have some way of knowing whether

a newly discovered host address has already been “crawled” by another processor.

One way this could be done is by hashing the host address string and checking the

result (modulo the number of processors participating in the crawl) against the

processor’s index. If there is a match, the processor would know that it should go ahead

and crawl the host. If not, it would then need to pass the information to the appro-

priate processor. In fact, this technique is commonly used for indexing the WWW

by many search engines, including Google, primarily because it results in good load

balancing. However it also requires additional inter-processor communication in or-

der to pass the Gnutella host addresses discovered at run-time to the appropriate

processors. Instead, we have opted for a perhaps less elegant but more robust solution.

Our algorithm provides each processor with a complete input list of active hosts.

Each processor then executes an algorithm for calculating the subset for which it is

responsible, based on its unique processor number and the total number of processors

involved in the computation. For example, processor 0 of 10 would only attempt to

discover neighbors of the first 10% of hosts from the input list. The parallel version of

the topology discovery algorithm presented in the previous section is formulated below.

For clarity, we are assuming that the size of the initial list of hosts is a multiple

of the number of processors.

Procedure parallelBuildTopoMap (G, l)

Input: An empty graph G, and a complete host list l

Output: A graph G representing the Gnutella network topology

startIndex = (sizeofhosts / numberofprocs) ∗ procID
endIndex = startIndex + (sizeofhosts / numberofprocs) − 1
l2 = hosts[startIndex..endIndex]
for each element h of l2
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge h − h2 to G
            if (h2 is not in l)
                add h2 to the end of l2
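The two ways of dividing work among processors discussed in this section, contiguous index ranges (the approach adopted here) and hashing of host addresses (the alternative), can be contrasted in a short Python sketch. The function names and sample addresses are made up for illustration. CRC-32 is used as the hash only because it is stable across machines, whereas Python's built-in `hash` of a string varies between processes; any hash shared by all processors would serve.

```python
import zlib

def index_range_slice(hosts, proc_id, num_procs):
    """Contiguous partition: processor proc_id takes its block of the input list.

    Assumes, as in the pseudocode, that len(hosts) is a multiple of num_procs.
    """
    chunk = len(hosts) // num_procs
    start = chunk * proc_id
    end = start + chunk - 1
    return hosts[start:end + 1]

def hash_owner(address, num_procs):
    """Hash partition: index of the processor responsible for this address."""
    return zlib.crc32(address.encode("utf-8")) % num_procs

hosts = ["10.0.0.%d:6346" % i for i in range(20)]   # made-up addresses
first_block = index_range_slice(hosts, 0, 10)       # first 10% of the list
owners = [hash_owner(h, 10) for h in hosts]
```

With index ranges, a host discovered at run-time has no designated owner, so several processors may crawl it. With hashing, every address maps to exactly one processor, at the cost of forwarding newly discovered addresses to their owners.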


Despite its apparent simplicity, due to the highly asynchronous nature of the task, our

parallel algorithm in the best case achieves optimal speed-up. In addition, as long as

the total number of Gnutella hosts N ≤ pb, where p is the number of processors and b

is the maximum number of connections each processor can maintain simultaneously,

our algorithm will run in constant time. In practice, we were typically able to satisfy

this requirement with only a few processors, as the size of the largest connected public

segment of the Gnutella network at the time rarely exceeded two thousand users.

One potential problem with our algorithm is that its performance is dependent

on the “completeness” of the input list of host addresses. Recall from our previous

discussion that the input list is not guaranteed to be complete, as new hosts could

have joined the network. Because our algorithm only partitions the initial set of

hosts, each processor would discover new hosts independently. This would result in

redundant work being performed by all the processors. Notice that this would not

be a problem had we used the hashing solution mentioned above. However, it is easy

to show that, as long as the number of hosts discovered at run-time is within b,

performance of our algorithm will be within a factor of two of optimal. This is true

because only a single additional step will be required by each processor.
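The step count behind this argument can be made explicit with a small Python sketch. It rests on a simplified model that is our assumption, not a result of this thesis: one step is one round of up to b simultaneous neighbor-discovery operations per processor, all run-time discoveries surface during the initial rounds, and every processor crawls all of them independently.

```python
from math import ceil

def crawl_steps(initial_hosts, new_hosts, procs, conns_per_proc):
    """Number of logical steps under the simplified model described above."""
    per_proc = ceil(initial_hosts / procs)             # size of each processor's sublist
    initial_rounds = ceil(per_proc / conns_per_proc)
    extra_rounds = ceil(new_hosts / conns_per_proc)    # redundant work, done by every processor
    return initial_rounds + extra_rounds

optimal = crawl_steps(2000, 0, procs=4, conns_per_proc=500)         # 1
with_new_hosts = crawl_steps(2000, 300, procs=4, conns_per_proc=500)  # 2
```

With N ≤ pb and at most b newly discovered hosts, the count is one initial step plus one additional step, which is the factor of two stated above.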

Typically an important issue in designing parallel algorithms is load balancing. In

our case, this refers to the actual number of connections each processor is required to

make. Recall that the input list of potential hosts may also contain some hosts that

have recently left the network. Therefore even though each processor will receive an

equal number of potential hosts to connect to, the number of actual live hosts in a

list is likely to be smaller and will vary between processors. However our experiments

indicate this is not a significant problem. In order to see this recall that, even though

the actual number of connections made by each processor could vary, they are still

handled simultaneously by each processor in a single logical step.


4.2.4 Limitations

The main limitation of our crawler is related to the notion of private networks. Since

a significant portion of Gnutella users reside behind a firewall that prevents anyone

on the outside from establishing a direct connection to them, our crawler will not be

able to accurately discover topology between such hosts. Notice that these hosts may

still appear in the final topology graph, due to their connections with hosts outside

the firewall. In this sense, the topology obtained by our crawler can be viewed as a

subgraph of the actual Gnutella network topology.

In addition, even though the running time of our algorithm is optimal for any topology

discovery algorithm based on the Gnutella protocol, the actual execution time is still

bounded by the round-trip time (RTT) of messages in the Gnutella network and can take up

to several minutes. One could therefore question the integrity of our topology data,

based on the fact that the network structure may have significantly changed over

the course of several minutes. Despite these limitations we believe our crawler is a

valuable tool, able to accurately capture important structural properties of the actual

Gnutella network topology.

4.3 Distributed Computing Solution Using Java RMI

We have implemented our parallel algorithm for GnutellaNet topology discovery for a

network of workstations (NOW), primarily because we felt it would give the greatest

amount of flexibility and portability to our code. In addition, we felt that the task at

hand would be perfectly suited for a distributed computing model, since it requires

very little inter-processor communication. In fact, in our design, communication only

occurs at the beginning of the process, to distribute input, and at the end, to gather


the output at a central location. The mechanism for this communication is provided

by Java RMI. Remote method invocation (RMI) is JavaSoft’s implementation of

remote procedure calls (RPC). It is distributed as a standard Java library, providing

necessary functionality for distributed object communication. In our implementation,

crawling a subset of the Gnutella network is provided as a service residing on various

remote locations throughout our network. In other words, our parallel algorithm

described in the previous section is implemented as a distributed object residing on

remote machines.

Our distributed computing system includes an object serving as the “brain” of

the entire computation. This central object is responsible for “bootstrapping” the

entire topology discovery process by distributing the initial list of Gnutella hosts

to other remote objects. Upon receiving the input, each remote object performs

topology discovery of its portion of the network, and subsequently returns a graph

object representing network topology to the central object. The central object is then

responsible for merging all the output graphs into a single one representing topology

of the entire Gnutella network. We should mention that our crawler utilized some

Java classes providing functionality related to Gnutella protocol compliance from furi

- a full-fledged open-source Gnutella client developed by William Wong [3].

The main feature of our distributed implementation is that it allows a heterogeneous

network of workstations to participate in discovery of the Gnutella network

topology. As explained, this topology discovery can be executed in constant time

using only a few processors. In addition, the output graph representing Gnutella

network topology is provided in GML format [18], which is a fast growing standard

for representing graph data structures, and can immediately be viewed using visu-

alization tools such as LEDA’s graphwin [8]. Several visualizations of the Gnutella

network topology data obtained using our crawler are presented in appendix A.
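The merge step performed by the central object, and the GML output, can be illustrated with a short Python sketch. This is not the Java RMI implementation described above: the function names are ours, partial graphs are modeled simply as collections of host pairs, and only the basic GML keys (`graph`, `directed`, `node`, `id`, `label`, `edge`, `source`, `target`) are emitted.

```python
def merge_graphs(partial_edge_lists):
    """Union the edge lists returned by the crawling processes into one edge set."""
    merged = set()
    for edges in partial_edge_lists:
        merged |= {frozenset(e) for e in edges}   # (a, b) and (b, a) become the same edge
    return merged

def to_gml(edges):
    """Serialize an undirected edge set as GML text."""
    nodes = sorted({h for e in edges for h in e})
    ids = {h: i for i, h in enumerate(nodes)}
    lines = ["graph [", "  directed 0"]
    for h in nodes:
        lines.append('  node [ id %d label "%s" ]' % (ids[h], h))
    for a, b in sorted(tuple(sorted(e)) for e in edges):
        lines.append("  edge [ source %d target %d ]" % (ids[a], ids[b]))
    lines.append("]")
    return "\n".join(lines)

# Two made-up partial results that overlap in the edge a - b.
parts = [[("a", "b")], [("b", "a"), ("b", "c")]]
merged = merge_graphs(parts)
gml = to_gml(merged)
```

Because the partial graphs overlap wherever two processes see the same connection from different endpoints, the merge is a set union rather than a concatenation.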


Chapter 5

Conclusions and future research

5.1 Conclusions

Modeling complex network structures produced by modern P2P network applications

is a difficult task. The main contribution of this thesis to the task at hand is two-fold.

First, we made several important discoveries regarding the structure of the underlying

network topology of a P2P network application known as Gnutella. Specifically we

discovered it exhibits “small-world” properties of clustering and small diameter. In

addition, we observed four different power law relationships of various graph metrics.

It is our thesis that these empirical observations must be accounted for by any accurate

graph-based model of P2P network topology. Second, we pointed out the potentially

devastating effects of heterogeneous latencies on reachability of message broadcast in

P2P network applications operating under flooding protocols. Even though our em-

pirical results indicate that this problem we call “short-circuiting” is on average not

devastating to the overall system performance, we believe it should be taken seriously

by protocol designers. It is our hope that our results can be used in designing the

new generation of application-level protocols for P2P network applications.
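The short-circuiting effect can be illustrated with a small event-driven sketch. It is a simplified stand-in for the simulator listed in Appendix B, not a copy of it: the class FloodSketch, its methods, and the four-host example graph are assumptions made for illustration. The sketch assumes that a host drops any copy of a message it has already seen, decrements the TTL on receipt, and forwards only while the TTL remains positive.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Illustrative sketch only; names and structure are assumptions.
class FloodSketch {
    /** Creates empty adjacency lists; adj.get(u) will hold {neighbor, latency} pairs. */
    static List<List<int[]>> newGraph(int n) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        return adj;
    }

    /** Adds an undirected link with the given latency. */
    static void link(List<List<int[]>> adj, int u, int v, int latency) {
        adj.get(u).add(new int[] {v, latency});
        adj.get(v).add(new int[] {u, latency});
    }

    /** Number of hosts (source included) that see a broadcast sent with the given TTL. */
    static int horizon(List<List<int[]>> adj, int source, int ttl) {
        // Each event is {arrivalTime, receiver, sender, ttlOnArrival}, handled in time order.
        PriorityQueue<int[]> events = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        Set<Integer> seen = new HashSet<>();
        seen.add(source);
        if (ttl > 0) {
            for (int[] e : adj.get(source)) events.add(new int[] {e[1], e[0], source, ttl});
        }
        while (!events.isEmpty()) {
            int[] ev = events.poll();
            int host = ev[1];
            if (!seen.add(host)) continue;   // duplicate copy: dropped, never forwarded
            int remaining = ev[3] - 1;       // TTL is decremented on receipt
            if (remaining <= 0) continue;    // expired: received but not forwarded
            for (int[] e : adj.get(host)) {
                if (e[0] != ev[2]) events.add(new int[] {ev[0] + e[1], e[0], host, remaining});
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<List<int[]>> uniform = newGraph(4);
        link(uniform, 0, 1, 1); link(uniform, 1, 2, 1); link(uniform, 0, 2, 1); link(uniform, 2, 3, 1);
        List<List<int[]>> skewed = newGraph(4);
        link(skewed, 0, 1, 1); link(skewed, 1, 2, 1); link(skewed, 0, 2, 5); link(skewed, 2, 3, 1);
        System.out.println("uniform latencies: " + horizon(uniform, 0, 2));
        System.out.println("slow 0-2 link:     " + horizon(skewed, 0, 2));
    }
}
```

With a TTL of 2 and uniform latencies, all four hosts are reached. When the direct link between hosts 0 and 2 is slow, host 2 first receives the copy relayed through host 1, whose TTL expires on arrival, and later discards the direct copy as a duplicate, so host 3 is never reached and the horizon shrinks to three hosts.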

5.2 Future Directions

Future research directions can be divided into three categories: those dealing with

network topology, visualization, and server placement. In the following sections, we

briefly discuss each one.

5.2.1 Network Topology Modeling

In this thesis we have reported discoveries of some structural properties of P2P net-

work topologies. However the search continues toward a uniform model of P2P net-

work topology, encompassing all of those structural properties observed in real net-

work applications. We speculate that for many P2P network applications, including

Gnutella, such a model will be a modification of the discussed Barabasi-Albert model,

perhaps accounting for hosts leaving the network and dynamically-changing connec-

tions. In addition, more research needs to be done on spectral analysis of the topology

graph’s eigenvalues and their relationship with the structural properties.
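As a point of reference for such a modification, the following is a minimal sketch of the unmodified preferential-attachment growth process of the Barabasi-Albert model. The class name, the method generate, and the choice of a small clique as the seed graph are assumptions of this sketch; host departures and changing connections are deliberately not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only: plain preferential attachment, no host departures.
class BarabasiAlbertSketch {
    /** Returns the edge list of an n-node graph in which each new node adds m edges. */
    static List<int[]> generate(int n, int m, Random rnd) {
        if (m < 1 || n <= m) throw new IllegalArgumentException("need n > m >= 1");
        List<int[]> edges = new ArrayList<>();
        // One entry per edge endpoint, so a uniform draw from this list selects
        // a node with probability proportional to its current degree.
        List<Integer> endpoints = new ArrayList<>();
        // Seed graph (an assumption of this sketch): a clique on nodes 0..m.
        for (int u = 0; u <= m; u++) {
            for (int v = u + 1; v <= m; v++) addEdge(edges, endpoints, u, v);
        }
        for (int v = m + 1; v < n; v++) {
            // Choose m distinct existing nodes before wiring in the new node.
            Set<Integer> targets = new LinkedHashSet<>();
            while (targets.size() < m) {
                targets.add(endpoints.get(rnd.nextInt(endpoints.size())));
            }
            for (int u : targets) addEdge(edges, endpoints, u, v);
        }
        return edges;
    }

    private static void addEdge(List<int[]> edges, List<Integer> endpoints, int u, int v) {
        edges.add(new int[] {u, v});
        endpoints.add(u);
        endpoints.add(v);
    }

    public static void main(String[] args) {
        int n = 1000;
        List<int[]> edges = generate(n, 3, new Random(42));
        int[] degree = new int[n];
        for (int[] e : edges) { degree[e[0]]++; degree[e[1]]++; }
        int max = 0;
        for (int d : degree) max = Math.max(max, d);
        System.out.println(edges.size() + " edges, maximum degree " + max);
    }
}
```

A variant that accounts for hosts leaving the network could, for example, delete a randomly chosen node between insertions; how such deletions should be distributed is exactly the open modeling question raised above.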

5.2.2 Network Visualization

Better graph drawing algorithms need to be designed for visualizing the topology

of large-scale P2P networks. Such algorithms should be able to present topological

structure of a network in a way so that meaningful conclusions can be drawn. Network

visualizations can then be used by engineers to identify network-related problems.

5.2.3 Server Placement

The problem of finding an optimal placement of servers has received a lot of attention

in the Internet community. Many P2P file-sharing applications such as Gnutella

present another attractive practical application of this problem. For example, each

connection of a Gnutella user to the network can be modeled as a graph augmentation

problem. This problem can be formulated as adding a single vertex and t edges to

a graph G so that the size of t-horizon would be optimized. In the future, we plan

to examine some theoretical issues behind this problem using the knowledge we’ve

obtained on the Gnutella topology model.
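One way to evaluate a candidate solution to this augmentation problem is sketched below. The sketch assumes uniform latencies, so it measures the hop-limited neighborhood of the new vertex rather than the latency-dependent horizon, and it uses separate parameters for the set of attachment points (attachTo) and the hop limit (hops). The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: scores one candidate set of attachment points.
class AugmentationSketch {
    /** Counts existing hosts within `hops` hops of a new vertex joined to `attachTo`. */
    static int reach(List<List<Integer>> adj, int[] attachTo, int hops) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        if (hops >= 1) {
            for (int v : attachTo) {
                // Attachment points are one hop away from the new vertex.
                if (dist[v] == -1) { dist[v] = 1; queue.add(v); }
            }
        }
        int count = queue.size();
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (dist[u] == hops) continue;   // hop limit reached, do not expand further
            for (int w : adj.get(u)) {
                if (dist[w] == -1) { dist[w] = dist[u] + 1; queue.add(w); count++; }
            }
        }
        return count;
    }

    /** Builds a path graph 0 - 1 - ... - (n-1), used here as a toy topology. */
    static List<List<Integer>> path(int n) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int i = 0; i + 1 < n; i++) { adj.get(i).add(i + 1); adj.get(i + 1).add(i); }
        return adj;
    }

    public static void main(String[] args) {
        List<List<Integer>> g = path(7);
        System.out.println("attach to {0}:    " + reach(g, new int[] {0}, 2));
        System.out.println("attach to {3}:    " + reach(g, new int[] {3}, 2));
        System.out.println("attach to {1, 5}: " + reach(g, new int[] {1, 5}, 2));
    }
}
```

On the seven-host path with a hop limit of 2, attaching at an end host reaches 2 hosts, attaching at the middle host reaches 3, and attaching at hosts 1 and 5 reaches 6. Searching over all candidate sets with such a scoring function is a brute-force baseline only; the theoretical question is whether good attachment points can be found without exhaustive search.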

Bibliography

[1] Cooperative Association for Internet Data Analysis (CAIDA).

http://www.caida.org.

[2] Folding@home. http://www.stanford.edu/group/pandegroup/Cosm.

[3] The Furi Homepage. http://www.jps.net/williamw/furi/.

[4] SETI@home. http://setiathome.ssl.berkeley.edu.

[5] Lada Adamic. The small world web. In ECDL’99, pages 443–452, Springer,

1999. Lecture Notes in Computer Science 1696.

[6] Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Hu-

berman. Search in power-law networks.

http://www.parc.xerox.com/istl/groups/iea/papers/plsearch/, March 20, 2001.

[7] William Aiello, Fan R. K. Chung, and Linyuan Lu. A random graph model for

massive graphs. In ACM Symposium on Theory of Computing, pages 171–180,

Portland, Oregon, 2000.

[8] Algorithmic Solutions Software GmbH. The LEDA Homepage.

http://www.algorithmic-solutions.com/as_html/products/products.html.

[9] Fred S. Annexstein, Kenneth A. Berman, and Mihajlo A. Jovanovic. Latency

effects on reachability in large-scale peer-to-peer networks. In ACM Symposium

on Parallel Algorithms and Architectures, July 2001.

[10] Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random net-

works. Science, 286:509–512, October 15, 1999.

[11] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Ra-

jagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structures

in the web. Computer Networks, 33(1-6):309–20, June 2000.

[12] Brown University. The Java Data Structures Library (JDSL).

http://www.cs.brown.edu/cgc/jdsl/.

[13] Gnutella: To the bandwidth barrier and beyond. Clip2.com, November 6, 2000.

http://dss.clip2.com/gnutella.html.

[14] Roger Dingledine, Michael J. Freedman, and David Molnar. The free haven

project: Distributed anonymous storage service. In Workshop on Design Issues

in Anonymity and Unobservability, July 2000.

[15] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law

relationships of the internet topology. In SIGCOMM, pages 251–262, 1999.

[16] Groove Networks, Inc. Introducing Groove. http://www.groove.net/products/.

[17] Jerrold W. Grossman and Patrick D. F. Ion. The Erdos Number Project.

http://www.oakland.edu/~grossman/erdoshp.html.

[18] Michael Himsolt. GML: A portable graph file format. Technical Report 94030,

University of Passau, 1997.

[19] Jon Kleinberg. The small-world phenomenon: An algorithmic perspective. Tech-

nical Report 99-1776, Cornell University Department of Computer Science, Oc-

tober 1999.

[20] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and

Andrew Tomkins. The web as a graph: measurements, models, and methods. In

5th Annual International Conference on Computing and Combinatorics, volume

1627, pages 1–7, 1999. Lecture Notes in Computer Science.

[21] Albert Medina, Ibrahim Matta, and John Byers. On the origin of power laws in

internet topologies. ACM Computer Communications Review, 30(2), April 2000.

[22] Andrew Oram, editor. Harnessing the Power of Disruptive Technologies. O’Reilly

& Associates, 1 edition, March 2001.

[23] Christopher R. Palmer and J. Gregory Steffan. Generating network topolo-

gies that obey power laws. http://citeseer.nj.nec.com/palmer00generating.html,

2000.

[24] T. Remes. Six degrees of Rogers Hornsby. New York Times, August 17, 1997.

[25] Clay Shirky. What is p2p... and what isn’t? The O’Reilly Network,

November 24, 2000. http://www.openp2p.com/pub/a/p2p/2000/11/24/shirky1-

whatisp2p.html.

[26] R. Siamwalla, R. Sharma, and S. Keshav. Discovering internet topology.

http://www.cs.cornell.edu/skeshav/papers.html, 1998.

[27] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world

networks. Nature, 393:440–442, June 1998.

[28] Ellen W. Zegura, Kenneth L. Calvert, and Michael J. Donahoo. A quantitative

comparison of graph-based models for Internet topology. IEEE/ACM Transac-

tions on Networking, 5(6):770–783, December 1997.

Appendix A

Visualizations of the Gnutella

Network Topology

In this appendix we present visualizations of the Gnutella network topology data

obtained using our crawler between November 13 and December 28 of 2000. The

visualizations were done using Otter - a network visualization tool developed by

CAIDA [1], and LEDA’s graph drawing software [8].

Figure A.1: Gnutella network topology using CAIDA’s Otter

Figure A.2: Gnutella network topology using LEDA’s 2D spring layout

Figure A.3: Gnutella network topology using experimental layout

Figure A.4: Gnutella network backbone (dominating set using greedy algorithm)

using LEDA’s 3D spring layout

Figure A.5: Gnutella network backbone (nodes with degree > 10) using LEDA’s 3D

spring layout

Figure A.6: Gnutella network backbone (nodes with degree > 20) using LEDA’s 3D

spring layout

Appendix B

Java source code for gnutsim

The following is the Java source code for our Gnutella network simulator gnutsim,

which we used to study the problem of short-circuiting. Our code makes use of some

classes from the JDSL package developed at Brown University [12].

/*

* gnutsim - Gnutella message transmission simulator

* Copyright (C) November 2000 Mihajlo A. Jovanovic

* [email protected]

*

*/

import jdsl.core.api.*;

import jdsl.core.ref.ArrayHeap;

import java.io.BufferedReader;

import java.io.InputStreamReader;

import java.io.FileInputStream;

import java.io.PrintWriter;

import java.io.FileWriter;

import java.io.File;

import java.util.Vector;

import java.util.Hashtable;

import java.util.Enumeration;

import java.util.StringTokenizer;

import java.util.Random;

import java.util.Date;

class MsgComparator implements Comparator

{

public int compare(Object a, Object b) { return ((Msg)a).compareTo((Msg)b); }

public boolean isLessThan(Object a, Object b) { return true; }

public boolean isGreaterThan(Object a, Object b) { return true; }

public boolean isEqualTo(Object a, Object b) { return true; }

public boolean isLessThanOrEqualTo(Object a, Object b) { return true; }

public boolean isGreaterThanOrEqualTo(Object a, Object b) { return true; }

public boolean isComparable(Object b) { return true; }

}

class HostComparator implements Comparator

{

public int compare(Object a, Object b) { return ((Host)a).compareTo((Host)b); }

public boolean isLessThan(Object a, Object b) { return true; }

public boolean isGreaterThan(Object a, Object b) { return true; }

public boolean isEqualTo(Object a, Object b) { return true; }

public boolean isLessThanOrEqualTo(Object a, Object b) { return true; }

public boolean isGreaterThanOrEqualTo(Object a, Object b) { return true; }

public boolean isComparable(Object b) { return true; }

}

class Msg

{

private int guid;

private int ttl = 7;

private int cost = 0;

Msg(int id) { guid = id; }

Msg(Msg m)

{

//COPY CONSTRUCTOR

guid = m.getGuid();

ttl = m.getTtl();

cost = m.getCost();

}

public void setTtl(int newTTL) { ttl = newTTL; }

public int getGuid() { return guid; }

public int getTtl() { return ttl; }

public int getCost() { return cost; }

public boolean decTTL()

{

ttl--;

if (ttl == 0)

return false;

else

return true;

}

public void incrCost(int w) { cost += w; }

public int compareTo(Msg m) { return (new Integer(cost)).compareTo(new Integer(m.getCost())); }

public boolean equals(Object msg)

{

return (guid == ((Msg)msg).getGuid());

}

public String toString() { return "GUID: " + guid + " TTL: " + ttl + " Cost: " + cost; }

}

class Host

{

Vector msgHistory = new Vector(10, 10);

Hashtable neighbors = null; //keys: neighbors (Host) Values: link weights (Integer)

ArrayHeap sendQueue = new ArrayHeap(new MsgComparator());

String id;

Host(String address) { id = address; }

public String getID() { return id; }

public void clearAndReset(Random r, Hashtable map)

{

msgHistory.clear();

//Recalculate link weights

for (Enumeration e = neighbors.keys() ; e.hasMoreElements() ;)

{

int w = r.nextInt(gnutsim.MAX_WEIGHT);

neighbors.put(e.nextElement(), (Integer)map.get(new Integer(w)));

}

}

public void setBroadcastMsg(Msg newMsg)

{

msgHistory.add(newMsg);

for (Enumeration e = neighbors.keys() ; e.hasMoreElements() ;)
{

Host h = (Host)e.nextElement();

Msg outMsg = new Msg(newMsg);

outMsg.incrCost(((Integer)neighbors.get(h)).intValue());

sendQueue.insert(outMsg, h);

}

}

public boolean wasMsgSeen(Msg msg)

{

return msgHistory.contains(msg);

}

public void setNeighbors(Hashtable h) { neighbors = h; }

public void addNeighbor(Host h, int w)

{

if (neighbors == null)

neighbors = new Hashtable();

neighbors.put(h, new Integer(w));

}

public void receiveMsg(Host sender, Msg inMsg)

{

if (msgHistory.contains(inMsg))

{

return;

}

else

{

msgHistory.add(inMsg);

}

if (inMsg.decTTL())

{

/*for all neighbors except sender

1. create a new Msg object(m), incr cost

2. add to the send queue(msg, neighbor)*/

for (Enumeration e = neighbors.keys() ; e.hasMoreElements() ;)
{

Host h = (Host)e.nextElement();

if (h.equals(sender))

continue;

Msg outMsg = new Msg(inMsg);

outMsg.incrCost(((Integer)neighbors.get(h)).intValue());

sendQueue.insert(outMsg, h);

}

}

}

public Host sendNextMsg()

{

Msg outMsg = (Msg)sendQueue.min().key();

Host receiver = (Host) sendQueue.removeMin();

receiver.receiveMsg(this, outMsg);

return receiver;

}

public int getNextMsgCost()

{

if (sendQueue.isEmpty())

return -1;

else

return ((Msg)sendQueue.min().key()).getCost();

}

public boolean equals(Object host)

{

if (id.equals(((Host)host).getID()))

return true;

return false;

}

public int compareTo(Host m) { return (new Integer(getNextMsgCost())).compareTo(new Integer(m.getNextMsgCost())); }

public String toString() { return id; }

}

public class gnutsim

{

static final int NUM_OF_TRIALS = 100;

static final int MAX_WEIGHT = 9;

static boolean isArrayHeapElement(ArrayHeap a, Object el)

{

for (ObjectIterator i = a.keys(); i.hasNext() ;)

{

Object o = i.nextObject();

if (el.equals(o))

return true;

}

return false;

}

public static void main(String args[])

{

ArrayHeap pq = new ArrayHeap(new HostComparator());

//CREATE WEIGHTED TOPOLOGY

String line = "";

StringTokenizer t;

String token = null;

Hashtable nodes = null; //keys: node ID (Integer) values: hosts (Host)

Random r = new Random((new Date()).getTime());

Hashtable map = new Hashtable();

map.put(new Integer(0), new Integer(1));









int min = -1, max = -1, accum = 0, ttl = -1;

try

{

for (int trial = 0; trial < NUM_OF_TRIALS; trial++)

{

if (trial == 0)

{

ttl = Integer.parseInt(args[1]);

File f = new File(args[0]);

if (!f.exists() || !f.canRead())

throw new Exception("Cannot read file " + f);

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(f)));

while ((line = in.readLine()) != null)

{

t = new StringTokenizer(line, " ");

token = t.nextToken();

if (token.equals(new String("t")))

nodes = new Hashtable(2*Integer.parseInt(t.nextToken()));

else if (token.equalsIgnoreCase(new String("?")))

{

int i = Integer.parseInt(t.nextToken());

Host h = new Host(t.nextToken());

nodes.put(new Integer(i), h);

}

else if (token.equalsIgnoreCase(new String("L")))

{

t.nextToken();

int nodeID = Integer.parseInt(t.nextToken());

Host h1 = (Host)nodes.get(new Integer(nodeID));

nodeID = Integer.parseInt(t.nextToken());

Host h2 = (Host)nodes.get(new Integer(nodeID));

if (h1 == null || h2 == null)

throw new Exception("Invalid .odf file format!");

/*UNIFORM WEIGHTS

h1.addNeighbor(h2, 1);

h2.addNeighbor(h1, 1);

*/

int w = r.nextInt(MAX_WEIGHT);

h1.addNeighbor(h2, ((Integer)map.get(new Integer(w))).intValue());

h2.addNeighbor(h1, ((Integer)map.get(new Integer(w))).intValue());

}

}

}

else

{

//clear all host objects

for (Enumeration e = nodes.elements() ; e.hasMoreElements() ;)

((Host)e.nextElement()).clearAndReset(r, map);

}

//ADD BROADCAST SERVER ONTO PQ

Msg m = new Msg(1);

m.setTtl(ttl);

Host h = (Host)nodes.get(new Integer(0));

h.setBroadcastMsg(m);

pq.insert(h, new Boolean(true));

while(!pq.isEmpty())

{

Locator l = pq.min();

Host nextHost = (Host)l.key();

Host newHost = nextHost.sendNextMsg();

pq.remove(l);

if (nextHost.getNextMsgCost() != -1)

pq.insert(nextHost, new Boolean(true));

//if new host is not already in the pq and its cost is not -1 - add to pq

if (!isArrayHeapElement(pq, newHost) && newHost.getNextMsgCost() != -1)

pq.insert(newHost, new Boolean(true));

}

int horSize = 0;

for (Enumeration e = nodes.elements() ; e.hasMoreElements() ;)

if (((Host)e.nextElement()).wasMsgSeen(m))

horSize++;

System.out.println("Total horizon size: " + horSize);

if (min == -1 || horSize < min)

min = horSize;

if (max == -1 || horSize > max)

max = horSize;

accum+=horSize;

}

System.out.println("Average horizon size: " + accum*1.0/NUM_OF_TRIALS);

System.out.println("Min horizon size: " + min);

System.out.println("Max horizon size: " + max);

}

catch (ArrayIndexOutOfBoundsException e)

{

System.out.println("Usage: java gnutsim [graph_file.odf] [TTL]");

}

catch (Exception e)

{

System.out.println(e);

}

}

}

Appendix C

Network Simulation Results

In this appendix we present the statistics obtained from our network simulation studies.

The tables report reduction ratios in reachability caused by short-circuiting under

randomly chosen latencies on a fixed topology. Each table is associated with

a fixed topology. Each row of the table represents results from 100 trials using random

latencies. In each row we report, for a fixed t (the TTL column), the worst, average, and best

observed t-horizon, and the t-neighborhood (which is equal to the t-horizon when using uniform

latencies). We then give the reduction ratios, obtained by dividing the worst t-horizon by the

t-neighborhood (WRR) and the average t-horizon by the t-neighborhood (MRR).
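As a worked example of these two ratios, the short sketch below recomputes the entries for the TTL = 4 row of Table C.1 (worst 15, average 52, t-neighborhood 96). The class and method names are hypothetical, and rounding to the nearest whole percent is an assumption that happens to agree with the tabulated values.

```java
// Illustrative sketch only: recomputes the reduction ratios for one table row.
class ReductionRatioSketch {
    /** Ratio of an observed t-horizon to the t-neighborhood, as a rounded percentage. */
    static long percent(int horizon, int neighborhood) {
        return Math.round(100.0 * horizon / neighborhood);
    }

    public static void main(String[] args) {
        int worst = 15, average = 52, neighborhood = 96;   // Table C.1, TTL = 4
        System.out.println("WRR = " + percent(worst, neighborhood) + "%");    // prints WRR = 16%
        System.out.println("MRR = " + percent(average, neighborhood) + "%");  // prints MRR = 54%
    }
}
```

Because the tabulated averages are themselves rounded, recomputing a ratio from the printed columns may in some rows differ from the printed percentage by one point.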

TTL  Worst  Avg   Best  Nbhd   WRR   MRR
1    7      7     7     7      100%  100%
2    9      14    16    16     56%   88%
3    12     28    41    42     29%   67%
4    15     52    83    96     16%   54%
5    28     105   188   252    11%   42%
6    55     181   337   494    11%   37%
7    105    333   525   830    13%   40%
8    185    496   719   1055   18%   47%
9    371    659   877   1121   33%   59%
10   468    804   983   1129   41%   71%

Table C.1: Short-circuiting effects on the Watts-Strogatz topology (nodes = 1129, k = 3, p = 0.2)


TTL  Worst  Avg   Best  Nbhd   WRR   MRR
1    2      2     2     2      100%  100%
2    4      4     4     4      100%  100%
3    10     10    10    10     100%  100%
4    65     92    113   113    58%   81%
5    214    492   689   844    25%   58%
6    246    589   843   1107   22%   53%
7    419    806   1040  1124   37%   72%
8    566    915   1071  1125   50%   81%

Table C.2: Short-circuiting effects on the Gnutella topology (nodes = 1125, edges = 4080)


TTL  Worst  Avg   Best  Nbhd   WRR   MRR
1    6      6     6     6      100%  100%
2    54     54    54    54     100%  100%
3    405    410   419   419    97%   98%
4    1473   2216  2606  2851   52%   78%
5    4686   5986  6875  9021   52%   66%
6    6557   8143  8809  9998   66%   81%
7    8113   9060  9443  10000  81%   91%

Table C.3: Short-circuiting effects on a random topology (nodes = 10000, edges = 40000)


TTL  Worst  Avg   Best  Nbhd   WRR   MRR
1    11     11    11    11     100%  100%
2    56     56    56    56     100%  100%
3    92     150   176   176    52%   85%
4    263    319   372   386    68%   83%
5    307    523   606   638    48%   82%
6    478    720   821   848    56%   85%
7    533    852   933   968    55%   88%
8    699    948   1002  1013   69%   94%
9    883    991   1020  1023   86%   97%
10   916    1011  1024  1024   89%   99%

Table C.4: Short-circuiting effects on a hypercube topology (N = 2^10)


TTL  Worst  Avg   Best  Nbhd   WRR   MRR
1    14     14    14    14     100%  100%
2    92     92    92    92     100%  100%
3    258    315   368   378    68%   83%
4    685    858   1008  1093   63%   78%
5    1120   1750  2139  2380   47%   74%
6    2243   3079  3544  4096   55%   75%
7    2796   4422  5298  5812   48%   76%
8    3970   5813  6644  7099   56%   82%
9    6023   6844  7424  7814   77%   88%
10   6259   7558  7950  8100   77%   93%
11   6930   7907  8147  8178   85%   97%
12   7877   8108  8187  8191   96%   99%
13   8050   8174  8192  8192   98%   100%

Table C.5: Short-circuiting effects on a hypercube topology (N = 2^13)
