UNIVERSITY OF CINCINNATI
Modeling Large-scale Peer-to-Peer Networks and a Case Study of Gnutella
A thesis submitted to the
Division of Graduate Studies and Research of
the University of Cincinnati
in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE
in the Department of
Electrical and Computer Engineering and Computer Science
of the College of Engineering
June, 2000
by
Mihajlo A. Jovanovic
B.S., Department of Mathematics and Computer Science, Otterbein College, Westerville, Ohio, 1997.
Thesis Advisors and Committee Chairs: Dr. Fred S. Annexstein and Dr. Kenneth A. Berman
Abstract
The ongoing digital revolution has brought on the emergence of novel network applications such as Gnutella, Freenet, and Napster, intended to facilitate worldwide sharing of information. These applications have embraced the familiar peer-to-peer (P2P) architecture model of the original Internet in new and innovative ways, forever changing the world of personal computing. However, if P2P is to truly replace the well-established client-server model as the computing paradigm of the future, more efficient decentralized algorithms must first be designed. This requires a better understanding of the P2P network model on which those algorithms would operate. Such a model includes both network topology and traffic.
In this thesis, we study both of these factors using as our case study Gnutella, a fully decentralized file-sharing network application. In order to study the Gnutella network topology, we have developed a network crawler that allows topology discovery to be performed in parallel. Upon analyzing the obtained topology data, we discovered that it exhibits strong “small-world” properties. More specifically, we observed the properties of small diameter and clustering in the Gnutella network topology. In addition, we report evidence of four different power laws previously observed in other technological networks, such as the Internet and the WWW.
In the second part of our thesis, we utilize our topology model in order to study network traffic. Specifically, we show that heterogeneous latencies present in many large-scale P2P network applications, when combined with the standard protocol mechanisms of time-to-live (TTL) and unique message identification (UID) used to govern flooding message transmissions, can potentially have a devastating effect on the reachability of message broadcast. We call this combined effect “short-circuiting,” and we investigate consequences of this phenomenon. We show through experimentation that, in the worst case, short-circuiting can near-completely eliminate the reach of broadcast messages. We report measurements obtained through both network simulation studies and experimental studies performed on Gnutella. Our results indicate that, on average, the real effects of short-circuiting are significant, but not devastating to the performance of an overall large-scale system.
We believe our discoveries of both network topology properties and short-circuiting are an important step toward a uniform model of P2P network applications, and could serve as a valuable tool in analyzing the performance of existing algorithms, as well as in designing new, more scalable solutions.
Acknowledgments
First, I would like to thank my advisers, Dr. Fred Annexstein and Dr. Kenneth
Berman, for hours of intellectually stimulating discussions, suggestions and ideas.
For the duration of this thesis, they have been not just my advisers but also my
mentors, providing constant encouragement as well as financial support in the form
of a Research Assistantship.
I would also like to thank Dr. Yizong Cheng for taking the time out of his busy
schedule to be on my thesis committee, and Dr. John Schlipf for attending my
thesis defense. Special thanks go to Dr. John Franco for providing motivation and
guidance, particularly during my first year at UC, and also Linda Gruber for her
always kind and helpful attitude.
I extend my sincere gratitude to the Department of Electrical and Computer En-
gineering and Computer Science for its generous support without which this work
would not be possible. The department has provided me with a Graduate Assis-
tantship during my first year and a University Graduate Scholarship for three full
academic years.
Finally, I dedicate this work to my parents, Aleksandar and Mirjana, without whose
love and support, even from half a world away, I could not have done it.
Table of Contents
1 Introduction
  1.1 Peer-to-Peer Computing
    1.1.1 Example Applications
  1.2 Modeling P2P Applications
    1.2.1 Benefits to Modeling
2 Modeling Topology of Large P2P Networks
  2.1 Small-World Networks
    2.1.1 Modeling Small-World Networks
    2.1.2 Gnutella as a Small-World
  2.2 Power-Laws
    2.2.1 Power-Law Models
    2.2.2 Power-Laws in Gnutella
3 Modeling Network Latencies
  3.1 Latency Effects
  3.2 Modeling the Short-Circuiting Effect
  3.3 Empirical Experiments
    3.3.1 Gnutella Studies
    3.3.2 Network Simulation Studies
4 Gnutella Crawler Implementation
  4.1 Introduction to Gnutella
    4.1.1 Gnutella Protocol
    4.1.2 Discovering Gnutella Network Topology
  4.2 Design
    4.2.1 Algorithm
    4.2.2 Initial Implementation
    4.2.3 Parallel Algorithm
    4.2.4 Limitations
  4.3 Distributed Computing Solution Using Java RMI
5 Conclusions and Future Research
  5.1 Conclusions
  5.2 Future Directions
    5.2.1 Network Topology Modeling
    5.2.2 Network Visualization
    5.2.3 Server Placement
Appendix
A Visualizations of the Gnutella Network Topology
B Java source code for gnutsim
C Network Simulation Results
List of Figures
2.1 Values for the clustering coefficient as defined in definition 3 for the
    Gnutella, Barabasi-Albert, Watts-Strogatz, random graph, and the 2D
    torus topologies
2.2 Log-log plots of degree versus rank (power-law 1)
2.3 Log-log plot of frequency versus degree (power-law 2)
2.4 Log-log plot of the number of pairs of nodes versus the number of hops
    (power-law 3) for four snapshots of the Gnutella topology
2.5 Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots
    of the Gnutella topology
3.1 The results of level-1 short-circuiting effects on the broadcast horizon
    on the Gnutella network, October 2000. The y-axis represents the
    broadcast horizon size, and the x-axis labels each of 68 broadcast trials.
    The top line is the resulting horizon from multiple distinct broadcasts
    from the same source, and the lower line is the resulting horizon from
    a single broadcast message from a single source. The discrepancy
    represents “level-1 short-circuiting” effects.
3.2 Horizon-size versus t
3.3 Horizon-size variation over time with broadcasting client using multiple
    connections on the Gnutella network, March 2001. The y-axis
    represents the horizon size, and the x-axis labels each of 180 broadcast
    trials, performed consecutively in six-minute intervals.
3.4 Difficulty in conducting experiments on today’s Gnutella network
3.5 Short-circuiting effects for the Watts-Strogatz topology
    (nodes = 10000, k = 3, p = 0.2)
A.1 Gnutella network topology using Caida’s Otter
A.2 Gnutella network topology using LEDA’s 2D spring layout
A.3 Gnutella network topology using experimental layout
A.4 Gnutella network backbone (dominating set using greedy algorithm)
    using LEDA’s 3D spring layout
A.5 Gnutella network backbone (nodes with degree > 10) using LEDA’s
    3D spring layout
A.6 Gnutella network backbone (nodes with degree > 20) using LEDA’s
    3D spring layout
Chapter 1
Introduction
The new wave of innovative network applications such as Gnutella, Freenet, Jabber,
Popular Power, SETI@Home, Publius, Free Haven, Groove, and others, has brought
on a revolution in personal computing threatening the long-established client-server
architecture of the Internet. For lack of a better term, this revolution has been labeled
peer-to-peer (P2P), or simply peer computing. The success of this revolution
will depend on the ability of modern P2P network applications to provide efficient
communication between an increasingly large number of autonomous hosts dispersed all
over the Internet. To cope with this problem, some P2P applications, like instant messaging
and Napster, rely on a centralized server. Other applications, such as Gnutella
and Freenet, adopt a fully decentralized design approach and require scalable algorithmic
solutions for functions such as routing and searching. Gnutella, for example,
utilizes a flooding mechanism for transmitting messages through the network. These
algorithms are typically built into the application in the form of an application-level
protocol. The inadequacy of the existing protocols became painfully clear to Gnutella
developers during the summer of 2000, when the size of the user community rapidly
increased. The problem is that the original protocols were designed without any
knowledge about the nature of the network on which they would be operating. In P2P
applications such as Gnutella and Freenet, much like in social networks, this nature
is determined by collective phenomena, as users connect to each other in a seemingly
random manner. Under these circumstances and given the highly-dynamic nature of
these networks, even relatively simple protocols result in complex interactions that
are difficult to predict. To provide better understanding of such interactions, in this
thesis we study the nature of P2P networks using Gnutella as our case study. In
particular, we study two fundamental components of a network, namely the topology
and the traffic.
In the first part of this thesis (chapter 2), we focus on the network topology model.
In order to study the Gnutella network topology, we have designed and implemented a
distributed network crawler that allows topology discovery to be performed in parallel
- an important feature considering the highly dynamic nature of Gnutella. The analysis
of the obtained topology data reveals several important structural characteristics of
P2P networks:
1. We report that the Gnutella network is a small-world topology, exhibiting both
small diameter and clustering typical of many social networks.
2. We present evidence of four different power laws also found in other technolog-
ical networks, such as the Internet and the WWW.
As a result, we conclude that many P2P networks, such as Gnutella, possess characteristics
of both technological and social networks. It is our thesis that these characteristics
can be utilized for designing more efficient algorithms operating on such
networks.
In the second part of this thesis (chapter 3), we turn our focus to network traffic.
More specifically, we study the effects of heterogeneous latencies on reachability in
P2P networks operating under flooding protocols. We show that heterogeneous la-
tencies present in many large-scale P2P network applications, when combined with
the standard protocol mechanisms of time-to-live (TTL) and unique message identi-
fication (UID) used to govern flooding message transmissions, can potentially have
a devastating effect on the reachability of message broadcast. We call this combined
effect “short-circuiting,” and we investigate consequences of this phenomenon.
We show through experimentation that, in the worst case, short-circuiting can near-
completely eliminate the reach of broadcast messages. We report measurements ob-
tained through both network simulation studies and experimental studies performed
on Gnutella. Our results indicate that, on average, the real effects of short-circuiting
are significant, but not devastating to the performance of an overall large-scale sys-
tem. In chapter 4, we describe the design and implementation of our parallel network
crawler. Finally, chapter 5 concludes this thesis with the description of future work.
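The interaction between TTL, UID, and heterogeneous latencies described above can be illustrated with a small event-driven simulation. The following Python sketch is only one reading of the mechanism and is not the gnutsim simulator listed in the appendix: the function names flood and hop_horizon, the five-node toy topology, and the latency values are all hypothetical. It assumes that a node records the UID of a message on first receipt, silently drops every later copy, and forwards the first copy to its neighbors with the TTL decremented by one.

```python
import heapq
from collections import deque

def flood(adj, source, ttl):
    """Latency-aware flood; adj maps node -> {neighbor: latency}.
    Returns the nodes (other than the source) that receive the message."""
    seen = {source}  # nodes that have already recorded the message UID
    # Each event is (arrival time, receiving node, TTL carried by that copy).
    events = [(lat, nbr, ttl) for nbr, lat in adj[source].items()]
    heapq.heapify(events)
    while events:
        time, node, copy_ttl = heapq.heappop(events)
        if node in seen:
            continue  # duplicate UID: dropped regardless of the TTL it carries
        seen.add(node)
        remaining = copy_ttl - 1  # TTL is decremented on receipt
        if remaining > 0:
            for nbr, lat in adj[node].items():
                heapq.heappush(events, (time + lat, nbr, remaining))
    return seen - {source}

def hop_horizon(adj, source, ttl):
    """Nodes within ttl hops of the source (the reach under uniform latencies)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == ttl:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist) - {source}

# Hypothetical toy topology: the direct link s-a is slow, the detour via x is fast.
toy = {
    "s": {"a": 10, "x": 1},
    "x": {"s": 1, "a": 1},
    "a": {"s": 10, "x": 1, "b": 1},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1},
}
print(sorted(hop_horizon(toy, "s", 3)))  # ['a', 'b', 'c', 'x']
print(sorted(flood(toy, "s", 3)))        # ['a', 'b', 'x']
```

In this toy case node c lies within three hops of s, yet it is never reached: the copy that travels through x arrives at a first with a smaller TTL, and the later copy on the direct link, which carries the larger TTL, is discarded as a duplicate.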
For the remainder of this chapter, we first present a brief overview of the P2P
computing paradigm. Then, we summarize the main reasons for network modeling
and present our formal model.
1.1 Peer-to-Peer Computing
As with many new technologies, there is no single universally accepted definition for
P2P. The recently formed Peer-to-Peer Working Group, a consortium led by
industry giants such as Hewlett-Packard, Intel and IBM, defines peer computing as
“sharing of computer resources by direct exchange.” Indeed, it is this notion of direct
access to resources, instead of through a centralized server as with the traditional
client-server model, that characterizes P2P. However, this definition may be too general,
as it would seem to include applications typically considered client-server, such
as FTP and TELNET. According to [25], the two fundamental criteria that each
P2P application must satisfy are (1) treating variable connectivity and temporary
network addresses as the norm and (2) giving nodes at the edges of the network sig-
nificant autonomy. Using this definition, applications such as email are not P2P since
addresses are not machine independent, while instant messaging applications such
as ICQ and Jabber are P2P, because “they devolve connection management to the
individual nodes” and dynamically map users to their IP addresses. However, the fundamental
idea of having computers act as peers is hardly new - some may even argue
it has its roots in the original design of the Internet, as part of the early ARPANET
architecture. In fact, early network applications such as USENET and DNS were
based on a peer-to-peer communication model and can be considered predecessors to
modern P2P technologies. The true innovation of these technologies therefore lies not
in their architecture design, but rather in their implementation and scale. In order
for these applications to extend the scope of P2P computing beyond a single LAN,
they needed to overcome serious technical challenges posed by technologies such as
firewalls, dynamic IP, and NAT, designed to obstruct open communications between
computers for reasons of security. They did so by migrating application complexity
to the edges of the network, thereby creating a much more significant role for the
Internet-connected PCs than previously offered by the traditional client-server model.
This idea of transferring the complexity to the edges can be best explained by comparison
with a telephone network. At first glance a telephone network may seem P2P,
since communication occurs directly between two points in the network. However, the
crucial difference between a telephone network and P2P is that the former relies on
an intelligent network for functions such as routing, and relatively “dumb” devices
in the form of telephone sets. In contrast, a P2P application like Gnutella relies on an
existing, “dumb” network (the Internet) and incorporates all the application logic at
the endpoints. The main advantage of such a design from the perspective of a researcher is
that it enables rapid development and deployment of innovative technologies, which
can perhaps serve as an explanation for such a large number of P2P applications we
are seeing today.
1.1.1 Example Applications
Current network applications have embraced three forms of peer computing: sharing
of information, sharing of computing power, and communication. This does not
mean the P2P computing model is limited to these resources, but simply that a P2P
application for sharing other types of resources has not yet been designed. Table 1.1
lists the most popular P2P applications in each category. Applications
such as SETI@Home highlight a clear relationship between P2P and another computing
paradigm commonly referred to as distributed computing. These applications allow
the computing power of thousands of Internet-connected PCs to be harnessed and
used for performing computationally intensive tasks that would otherwise require the
use of a supercomputer. Examples include processing radio signals from outer space
in search of extraterrestrial intelligence [4] and simulating protein folding [2]. Perhaps
the most popular form of peer computing on the Internet is instant messaging.
Unlike email, where messages travel through centralized mail servers, instant messaging
allows individuals to communicate directly with each other. To route messages
between users across the entire Internet, applications such as AIM, ICQ, MSN, and
Jabber rely on a centralized back-end server to dynamically map users to their IP
addresses and buffer messages in case the user is offline.

Sharing of Information   Sharing of Computing Power   Communication
Gnutella                 SETI@Home                    AIM
Freenet                  Folding@Home                 ICQ
Napster                  FightAIDS@Home               MSN
Publius                  Popular Power                Jabber
Free Haven               Intel’s NetBatch
Table 1.1: List of most popular P2P applications

Ongoing work toward development of a generalized platform for building P2P
applications [16] can perhaps be taken as an indication that the P2P model is here
to stay. The main goal of Groove developers is abstracting away many common
challenges to building P2P network applications, such as providing open PC-to-PC
communication. The main obstacles arise from the fact that the Internet architecture
has been built for years around the prevalent client-server model. As a result,
numerous technologies such as firewalls, dynamic IP, NAT, and asymmetric band-
width connections have been deployed on the Internet, driven by the fundamental
assumption that most Internet-connected PCs will only serve as clients. This under-
lying assumption is being strongly challenged by P2P applications such as Gnutella,
Napster, and Freenet, which strive to provide a fully distributed worldwide informa-
tion sharing system. These applications require their users to serve both as consumers
and producers of information in a large distributed information storage system. The
idea behind peer-to-peer information sharing is that much of the desired content is
stored on individual workstations and not behind some centralized server. Applica-
tions like Gnutella allow users to directly connect to each other for the purpose of
exchanging information.
From the perspective of this thesis, a common thread that ties all of these appli-
cations is that they all form highly dynamic networks of peers with complex topology.
Understanding the nature of these networks, particularly with regard to their topological
structure, is the main topic of chapter 2. In addition, applications such as
Gnutella and Free Haven [14], which rely on a broadcast search mechanism typically
implemented through flooding, are susceptible to a potential negative effect of hetero-
geneous latencies on message reachability - a phenomenon we call “short-circuiting.”
We examine this phenomenon in detail in chapter 3.
1.2 Modeling P2P Applications
In this section we present our formal model for representing network topology. We
model the topology of a P2P network as an undirected graph G whose nodes represent
hosts and whose edges represent Internet connections between those hosts. For the remainder
of this thesis, we use the term network graph for a graph representing the topological
structure of a network. In order to study the effects of latencies on broadcast flooding
operations in chapter 3, we will further refine our model to include edge weights
denoting network latencies along communication links.
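As an illustration, the model can be represented with a plain adjacency map. The following Python sketch is a minimal example under stated assumptions, not the data structure used by our crawler: the class name NetworkGraph, its method names, and the default latency of 1.0 are hypothetical. Hosts are graph nodes, each connection is stored in both directions so that the graph is undirected, and every edge carries a latency weight.

```python
class NetworkGraph:
    """Undirected graph of hosts; every edge carries a latency weight."""

    def __init__(self):
        # Adjacency map: host -> {neighbor: latency}.
        self._adj = {}

    def add_connection(self, u, v, latency=1.0):
        """Record an undirected connection between hosts u and v."""
        if u == v:
            raise ValueError("a host cannot be connected to itself")
        self._adj.setdefault(u, {})[v] = latency
        self._adj.setdefault(v, {})[u] = latency

    def neighbors(self, host):
        """Hosts directly connected to the given host (empty if unknown)."""
        return set(self._adj.get(host, {}))

    def latency(self, u, v):
        """Latency of the edge between u and v; raises KeyError if absent."""
        return self._adj[u][v]

    def num_nodes(self):
        return len(self._adj)

    def num_edges(self):
        # Every edge is stored twice, once per endpoint.
        return sum(len(nbrs) for nbrs in self._adj.values()) // 2


g = NetworkGraph()
g.add_connection("host-a", "host-b", latency=0.05)  # hypothetical hosts
g.add_connection("host-b", "host-c")                # default unit latency
print(g.num_nodes(), g.num_edges())
```

Setting every latency to the same value recovers the unweighted topology model, so the same structure can serve for both the topology and the latency studies.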
1.2.1 Benefits to Modeling
There are many reasons for obtaining an accurate network model. The main ones
can be summarized as follows:
Provides insight into the nature of the underlying system: Even if it were possible
to catalog all the vertices and edges of a graph, such information would not
explain the evolutionary process of the corresponding network, nor would it provide
a deeper understanding of its nature.
Enables theoretical analysis of algorithms: Performance of graph algorithms is
closely related to the structural properties of the underlying graph [28]. A well-formulated
graph model can aid in the theoretical analysis of algorithms operating
on such topologies.
Allows generation of realistic topologies for simulation purposes: Besides theoretical
analysis, simulations are a widely used method of assessing the performance
of algorithms. However, successful simulations require realistic topologies
that accurately capture important structural characteristics present in the original
networks.
Facilitates design of new scalable algorithms: If the nature of a particular topology
is well understood, algorithms can be designed to take advantage of particular
structural properties.
Helps in understanding of related network structures: A good understanding
of the nature of a particular system could lead to better understanding of other
dynamic, decentralized network structures for which complete topological data
may not be available.
Allows prediction of future trends: A good network model can be used to sim-
ulate future growth, thereby allowing developers to address potential problems
in advance.
As we have mentioned earlier, the topology of many P2P networks such as Gnutella
is completely defined by usage patterns, or collective phenomena. In this sense, there
is a clear relationship between P2P and social networks. Over the recent years, a lot of
research has been done on social network models. In the following chapter we present
some of the most notable network models and discuss how they can be adapted for
P2P networks. We support our claims with results obtained on the Gnutella network
topology.
Chapter 2
Modeling Topology of Large P2P Networks
In this chapter we focus on one major aspect of the overall network model, namely
the topology. We analyzed the Gnutella network topology instances obtained by our
network crawler between the months of May and December of 2000. In our analysis,
we discovered some important structural properties of the topology graph, such as the
small-world properties and several power-law distributions of certain graph metrics.
It is our thesis that these properties can be used to test the “representativeness”
of synthetically generated topologies used to model P2P networks such as Gnutella.
Conversely, we believe these properties are an essential ingredient of an accurate P2P
network topology model.
Here we present our results in the context of other related research. We begin
with a brief introduction of small-world networks and their characteristics. We then
present our discoveries on Gnutella, showing that the Gnutella network topology
exhibits strong small-world properties. Next, we describe several power-laws recently
observed in various network structures arising in technology. Finally, we report four
of these power-laws characterizing topology of the Gnutella network. It is our thesis
that these power-laws are a fundamental property of many large-scale P2P networks,
and therefore must be dealt with in their corresponding models.
2.1 Small-World Networks
The small-world phenomenon in the context of a worldwide social network refers to a
widely accepted belief that we are all connected by a short chain of intermediate ac-
quaintances. One of the first experimental studies of this phenomenon was conducted
by Stanley Milgram in the late 1960s. Milgram’s famous experiment consisted of
taking a number of letters addressed to a person in the Boston area, and distributing
them to a randomly selected group of people in Nebraska. Each person who received
a letter was asked to pass it to someone they knew on a first-name basis in an effort
to get it closer to its destination. As many of the letters eventually reached their des-
tination, Milgram observed that the average number of steps for a letter to get from
Nebraska to Boston was between five and six. The results of Milgram’s experiment
were the first to quantify the phenomenon, giving birth to the popular expression “six
degrees of separation.”
One way to model the small-world phenomenon is by a graph whose vertices are
people and whose edges join pairs of people who know each other. Such a graph is often
referred to as the human acquaintanceship graph. As suggested by the phenomenon,
the acquaintanceship graph is characterized by small diameter. Stated more precisely,
its diameter seems to be of the order of log n, where n is the size of the graph. Furthermore,
the acquaintanceship graph also shows a tendency to be clustered. Clustering
can be thought of as a measure of how well connected each node’s neighborhood is.
For the human acquaintanceship graph this property seems intuitive, as two people
with a common friend are with high probability themselves friends. It is these two
properties of clustering and small diameter that define a class of graphs Watts and
Strogatz call small-world graphs. In [27], the two argue that the structure of
many biological, technological, and social networks exhibits small-world behavior. As
examples of such networks, they studied the neural network of the nematode worm
Caenorhabditis elegans (the only completely mapped neural network), the electric
power grid of the western US, and the Hollywood graph. The collaboration graph of
film actors, appropriately termed the Hollywood graph, contains 225,000 vertices
representing actors and an edge for any two actors who have appeared in a feature
film together. Similar collaboration
graphs exist for active scientists [17] and even baseball players [24]. Since each
of these social networks is a subgraph of the acquaintanceship graph, it is not surprising
that they also show properties of clustering and small diameter. Without providing
a strict mathematical definition, Watts and Strogatz define small-world behavior in
terms of two properties, namely the characteristic path length and clustering. In order
to quantify these properties for various networks, the two defined the characteristic path
length L and the clustering coefficient C as follows:
Definition 1 Characteristic path length L, a global property, is defined as the
number of edges in the shortest path between two vertices, averaged over all pairs of
vertices.
Definition 2 Clustering coefficient C_v, a local (node) property measuring the “cliquishness”
of vertex v, is calculated by taking all the neighbors of v, counting the edges
between them, and then dividing by the maximum number of edges that could possibly
be drawn between those neighbors. The clustering coefficient C of a graph is defined as
the average of C_v over all vertices v.
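The two definitions translate directly into code. The following Python sketch is a minimal illustration and not the analysis code used for our measurements. The function names and the four-vertex example graph are hypothetical, and two conventions are assumptions of the sketch: a vertex with fewer than two neighbors contributes a clustering value of zero, and the path length is averaged over connected pairs only.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average of C_v over all vertices; adj maps vertex -> set of neighbors."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: C_v = 0 for vertices with fewer than two neighbors
        # Count the edges that actually exist among the neighbors of v.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

def characteristic_path_length(adj):
    """Shortest-path length averaged over all connected pairs of vertices."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives hop distances from the source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical example: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(example)      # (1 + 1 + 1/3 + 0) / 4
L = characteristic_path_length(example)  # 8 / 6
print(round(C, 4), round(L, 4))
```

Running one breadth-first search per vertex costs on the order of n(n + m) steps for n vertices and m edges, which is practical for sparse graphs of moderate size.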
Table 2.1 shows the L and C values for the three real networks mentioned above,
benchmarked against a random graph of the same size. The results clearly demonstrate
the small-world phenomenon for these networks: L ≳ L_random but C ≫ C_random.
              n         L_actual   L_random   C_actual   C_random
Film actors   225,226   3.65       2.99       0.79       0.00027
Power grid    4,941     18.7       12.4       0.080      0.005
C. elegans    282       2.65       2.25       0.28       0.05
Table 2.1: Small-world behavior of three real networks
Recently, Lada Adamic showed in [5] that the web hyperlink graph, in which nodes
are static home pages and edges are hyperlinks between those pages, is also a small-
world. In addition, the author demonstrated how this fact could be used to improve
performance of web search engines.
Besides small diameter and clustering, many small-world networks share other
important properties:
They tend to be sparse: These graphs all have relatively few edges, considering
their vast number of vertices. Stated more precisely, in small-world graphs the
number of edges is typically closer to the number of vertices n than to the
maximum possible number of edges, n(n - 1)/2. The Hollywood graph, for example,
has 225,000 vertices connected by 13 million edges, far short of the 25 billion in a
clique. The largest studied sample of the WWW graph contains 1.5 billion links
connecting 200 million pages. This means that only a minute fraction, on the order
of 10^-7, of all possible edges exist in the WWW graph.
They are self-organizing: Most of these small-world networks are not deliberate
constructions. Instead, they can be viewed as naturally occurring artifacts that
have developed through some evolutionary process. A good theoretical model
for generating realistic small-world topologies must inevitably provide deeper
insight into the nature of such process.
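The sparsity argument above is simple arithmetic and can be checked directly. The following Python sketch uses the vertex and link counts quoted in the text. The function names max_edges and density are hypothetical, and treating hyperlinks as undirected edges is a simplifying assumption of the sketch.

```python
def max_edges(n):
    """Number of edges in a clique on n vertices, i.e. n choose 2."""
    return n * (n - 1) // 2

def density(n, m):
    """Fraction of all possible undirected edges that are actually present."""
    return m / max_edges(n)

# Hollywood graph: the clique on 225,000 vertices has about 25 billion edges.
print(max_edges(225_000))  # 25312387500

# WWW sample: 1.5 billion links among 200 million pages.
print(density(200_000_000, 1_500_000_000))  # roughly 7.5e-08
```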
2.1.1 Modeling Small-World Networks
The simplest way to model the small-world phenomenon is by means of a uniform
random graph. Graphs of this type were thoroughly studied by Erdős and Rényi in
the 1960s. While these graphs exhibit small diameter, their major limitation as a
model of the small-world is that they show no tendency to form clusters. To address
this problem, Watts and Strogatz proposed a model based on interpolating between
a completely regular and completely random topology [27]. The authors start by
taking a highly regular ring lattice topology, created by arranging n vertices in a
circle and joining each vertex to its k nearest neighbors for some small constant k.
Each edge in the original lattice is then examined and redirected to another randomly
chosen destination with probability p. This method allowed the authors to “tune”
the graph between regularity (p = 0) and disorder (p = 1), and thereby to probe
the intermediate region 0 < p < 1, about which little is known. Because of the
potential rewiring of edges, Watts and Strogatz refer to their model as the rewired
ring lattice. Another way to look at this construction process is to observe that all
the edges in the original lattice are local contacts. The rewiring process can then
simply be viewed as adding a number of long-range contacts. Watts and Strogatz
observed that adding only a few such edges results in a dramatic decrease in diameter
size while still preserving the clustering property of the original lattice. While the
Watts-Strogatz model remains one of the most popular models of the small-world,
most of the recent research utilizes a variation of the model proposed by Newman
and Watts. In this version, instead of rewiring the existing links, new shortcut links
are added. This greatly simplifies the analysis by eliminating the possibility present
in the original model for a portion of the graph to become disconnected from the rest.
The model was later generalized by Kleinberg in [19], who introduced an additional
parameter, thereby defining an entire family of random networks. Kleinberg
showed that the performance of decentralized algorithms varies within this family of
network models, proving the existence of a unique model within the family for which
decentralized algorithms are effective. The idea most relevant to our thesis is that the
small-world property of a network topology can significantly impact the performance
of algorithms, such as those for routing, that operate on such a topology.
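The two constructions described above can be sketched in a few lines of Python. This is
only an illustrative sketch, not the generator used to produce the results reported in
this thesis. The function names, the convention that k is an even number of neighbors
split evenly between the two sides of the ring, and the choice to draw one candidate
shortcut per lattice edge in the Newman-Watts variant are all assumptions made for the
example.

```python
import random


def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbors (k even, n > k)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def watts_strogatz(n, k, p, rng):
    """Rewire each lattice edge to a random new endpoint with probability p."""
    adj = ring_lattice(n, k)
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            if rng.random() < p:
                # Avoid self-loops and duplicate edges when picking the new endpoint.
                candidates = [x for x in range(n) if x != v and x not in adj[v]]
                if candidates:
                    new_w = rng.choice(candidates)
                    adj[v].discard(w)
                    adj[w].discard(v)
                    adj[v].add(new_w)
                    adj[new_w].add(v)
    return adj


def newman_watts(n, k, p, rng):
    """Keep every lattice edge; add a random shortcut per lattice edge with probability p."""
    adj = ring_lattice(n, k)
    for _ in range(n * k // 2):
        if rng.random() < p:
            u, v = rng.sample(range(n), 2)
            adj[u].add(v)
            adj[v].add(u)
    return adj


def edge_count(adj):
    """Number of undirected edges in an adjacency-set graph."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2
```

Rewiring preserves the number of edges but may disconnect the graph, whereas the
shortcut variant never removes a lattice edge, which is the simplification noted above.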
2.1.2 Gnutella as a Small-World
Upon analyzing the Gnutella network topology data obtained by our crawler, we
discovered both the small diameter and the clustering properties characteristic of
small-world networks. To show this, we calculated the clustering coefficient and the
characteristic path length as defined by Watts and Strogatz for five different snapshots
of the Gnutella topology obtained during the months of November and December of
2000. Since the results presented in this chapter are based on these particular datasets,
we present some basic statistics for them in table 2.2.
Snapshot date    Nodes    Edges    Diameter
11/13/2000         992     2465           9
11/16/2000        1008     1782          12
12/20/2000        1077     4094          10
12/27/2000        1026     3752           8
12/28/2000        1125     4080           8
Table 2.2: Statistics for five snapshots of the Gnutella network topology
We present the statistics for the clustering coefficient C and the characteristic
path length L in tables 2.3 and 2.4. The values for each one are benchmarked against
the random graph G(n, p) and the 2-D mesh of the same size (in terms of the number
of nodes) as the original Gnutella topology graph. For random graphs, average values
out of 100 trials are shown.
                 Count source vertex                 Do not count source vertex
Snapshot date    Gnutella    G(n,p)      2D mesh     Gnutella    G(n,p)      2D mesh
11/13/2000       0.643587    0.389914    0.413181    0.035122    0.007789    0
11/16/2000       0.701287    0.492788    0.41276     0.010896    0.005636    0
12/20/2000       0.539189    0.268877    0.412366    0.065172    0.009371    0
12/27/2000       0.514996    0.278801    0.41276     0.063023    0.010213    0
12/28/2000       0.521659    0.27966     0.411995    0.054443    0.009013    0
Table 2.3: Values for the clustering coefficient C as defined by Watts and Strogatz in
definition 2
Because it is not clear from their definition whether Watts and Strogatz consider
each vertex to be a neighbor of itself, we have calculated the results using both
methods. Based on the results in table 2.3, we believe the two authors did not count
the source vertex. However, the results obtained on a 2D mesh, typically regarded as a
highly clustered topology, highlight a potential inconsistency with this definition. For
this reason we propose a more consistent definition for the clustering coefficient of a
graph:
Definition 3 Characteristic coefficient $C^{(l)}_v$ of vertex v is calculated by dividing
the number of cross edges in a BFS-tree of depth l and rooted at v, by the maximum
possible number of cross edges given by $\binom{k}{2} - (k-1)$, where k is the number
of vertices in the BFS-tree. Clustering coefficient $C^{(l)}$ of a graph is defined as the
average of $C^{(l)}_v$ over all vertices v.
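The following Python sketch shows one way to evaluate definition 3. It is not the code
used to produce our results, and it rests on an assumption about the term "cross edge":
here a cross edge is taken to be any edge joining two vertices of the BFS-tree that is
not itself one of the k − 1 tree edges. The function names and the torus helper are
introduced only for illustration.

```python
from collections import deque


def bfs_ball(adj, root, depth):
    """Vertex set of the BFS-tree of the given depth rooted at root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == depth:
            continue  # do not expand beyond the requested depth
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)


def vertex_coefficient(adj, root, depth):
    """Cross edges among the BFS-tree vertices divided by C(k, 2) - (k - 1)."""
    ball = bfs_ball(adj, root, depth)
    k = len(ball)
    max_cross = k * (k - 1) // 2 - (k - 1)
    if max_cross == 0:
        return 0.0  # trees on one or two vertices admit no cross edges
    induced_edges = sum(1 for v in ball for w in adj[v] if w in ball) // 2
    cross_edges = induced_edges - (k - 1)  # assumption: all non-tree edges count
    return cross_edges / max_cross


def clustering_coefficient(adj, depth):
    """Average of the per-vertex coefficient over all vertices."""
    return sum(vertex_coefficient(adj, v, depth) for v in adj) / len(adj)


def torus(side):
    """2D torus with side * side vertices, each joined to its four neighbors."""
    adj = {}
    for x in range(side):
        for y in range(side):
            adj[(x, y)] = {((x + 1) % side, y), ((x - 1) % side, y),
                           (x, (y + 1) % side), (x, (y - 1) % side)}
    return adj
```

Under this reading, a 10 × 10 torus gives 4/66 ≈ 0.0606 for l = 2 and 12/276 ≈ 0.0435
for l = 3, which agrees with the 2D torus values listed in figure 2.1.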
Snapshot date    Gnutella     BA           WS           G(n,p)        2D torus
11/13/2000       0.0223545    0.0149507    0.0372667    0.00403533    0.0606061
11/16/2000       0.0088999    0.0095887    0.0372356    0.00249125    0.0606061
12/20/2000       0.0300611    0.0178844    0.0537228    0.00618598    0.0606061
12/27/2000       0.0205752    0.0184729    0.0539221    0.00620002    0.0606061
12/28/2000       0.0206982    0.0173541    0.0535703    0.00561928    0.0606061
(a) l = 2
Snapshot date    Gnutella     BA            WS           G(n,p)        2D torus
11/13/2000       0.0141344    0.00693268    0.0110796    0.00391614    0.0434783
11/16/2000       0.0100001    0.00524975    0.0110373    0.00243858    0.0434783
12/20/2000       0.0136551    0.00743268    0.0143759    0.00601365    0.0434783
12/27/2000       0.0125729    0.00773103    0.014582     0.00602383    0.0434783
12/28/2000       0.0122163    0.00718639    0.0142141    0.00545913    0.0434783
(b) l = 3
Figure 2.1: Values for the clustering coefficient as defined in definition 3 for the
Gnutella, Barabasi-Albert, Watts-Strogatz, random graph, and the 2D torus topologies
We believe our definition to be in better agreement with our intuitive understanding
of clustering. Furthermore, such a definition allows us to identify the aspect
of clustering in various topologies that contributes to the “short-circuiting” effect
we study in chapter 3. The results for the new clustering coefficient are presented
in figure 2.1. Besides the values for the Gnutella topology, the random graph, and the 2D
torus, each table also contains results for the Barabasi-Albert (discussed in the
subsequent section) and the Watts-Strogatz models. The parameters for these models
were chosen so that the number of nodes and the average degree of the resulting
graph are approximately equal to those of the original Gnutella topology. For example,
the Gnutella topology snapshot from 12/20/2000 is compared to the Watts-Strogatz
topology generated according to the following parameters: n = 1125, k = 3, and
p = 1 (every node gets a random edge - the Newman-Watts version of the model is
used).
Snapshot date    Gnutella    BA         WS         G(n,p)     2D mesh
11/13/2000       3.72299     3.47491    4.59706    4.48727    20.6667
11/16/2000       4.42593     4.07535    4.61155    5.5372     21.3333
12/20/2000       3.3065      3.19022    4.22492    3.6649     22
12/27/2000       3.30361     3.18046    4.19174    3.70995    21.3333
12/28/2000       3.32817     3.20749    4.25202    3.7688     22.6667
Table 2.4: Values for the characteristic path length L for the Gnutella, Barabasi-Albert,
Watts-Strogatz, random graph, and the 2D mesh topologies
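For readers who wish to reproduce a quantity of this kind, the characteristic path
length can be computed with one breadth-first search per vertex. The sketch below is
illustrative rather than the code used for table 2.4. It assumes an unweighted,
undirected graph stored as a dictionary of neighbor sets, and it averages only over
pairs that are actually connected, which is a choice made for this example.

```python
from collections import deque


def hop_distances(adj, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def characteristic_path_length(adj):
    """Mean shortest-path length over all connected pairs of distinct vertices."""
    total = 0
    pairs = 0
    for s in adj:
        for t, d in hop_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0
```

Each unordered pair is visited twice, once from each endpoint, which leaves the mean
unchanged.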
As can be seen, all of the Gnutella topology instances show the small-world phenomenon:
the characteristic path length is comparable to that of a random graph (table
2.4), while the clustering coefficient is considerably higher. These results clearly
indicate strong small-world properties of the Gnutella network topology. It is our thesis
that this is an important issue to consider when modeling P2P networks such as
Gnutella. More specifically, an accurate P2P model must inevitably generate topolo-
gies exhibiting the described small-world properties. Furthermore, our discovery can
aid in designing and predicting performance of distributed algorithms, such as those
for routing and searching. For example, Gnutella’s current broadcast routing strategy
is clearly not likely to work well on a clustered topology of a small-world network, as
it would generate large amounts of duplicate messages. This would result in poor uti-
lization of network bandwidth and hinder scaling - a phenomenon recently observed
in practice [13].
2.2 Power-Laws
A major limitation of the described small-world models is exposed by increasing evidence
of various power-laws of the form $y = x^a$, governing the distribution of various graph
metrics for many large, self-organizing networks [15, 10, 11, 20]. Faloutsos et al. [15]
discovered four of these power-laws characterizing the topology of the Internet at both
the inter-domain and router levels. These power-laws are defined as follows:
Power-Law 1 (rank exponent R): The outdegree, $d_v$, of a node v, is proportional
to the rank of the node, $r_v$, to the power of a constant, R: $d_v \propto r_v^{R}$. The rank
$r_v$ of a node, v, is defined as its index in the order of decreasing outdegree.
Power-Law 2 (out-degree exponent O): The frequency, $f_d$, of an out-degree, d,
is proportional to the out-degree to the power of a constant, O: $f_d \propto d^{O}$.
Power-Law 3 (hop-plot exponent H): The total number of pairs of nodes, P(h),
within h hops, is proportional to the number of hops to the power of a constant,
H: $P(h) \propto h^{H}$, $h \ll \delta$, where $\delta$ is the diameter. The number of pairs P(h) is the total
number of pairs of nodes within less than or equal to h hops, including self-pairs, and
counting all other pairs twice.
Power-Law 4 (eigen exponent E): The eigenvalues, $\lambda_i$, of a graph are proportional
to the order, i, to the power of a constant, E: $\lambda_i \propto i^{E}$.
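The counting convention in power-law 3 (self-pairs included, every other pair counted
twice) amounts to counting ordered pairs of nodes. The short Python sketch below
computes the hop-plot under that convention. It is an illustration only, with a
hypothetical function name, and it assumes an undirected graph given as a dictionary of
neighbor sets.

```python
from collections import deque


def hop_plot(adj):
    """Return [P(0), P(1), ...], where P(h) counts ordered pairs within h hops."""
    exact = []  # exact[d] = number of ordered pairs at distance exactly d
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        for d in dist.values():
            while len(exact) <= d:
                exact.append(0)
            exact[d] += 1  # the pair (s, s) contributes to distance 0
    plot = []
    running = 0
    for count in exact:
        running += count
        plot.append(running)
    return plot
```

For a path on three nodes this yields P(0) = 3, P(1) = 7, and P(2) = 9, the last value
being the square of the number of nodes, as expected once every pair is within reach.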
Several research groups have also independently discovered evidence of the same
power-laws describing structural properties of the web graph [10, 11, 20]. Since these
discoveries occurred on various scales and levels of granularity, they could be taken as
indications of possible self-similar or fractal nature of the web. Of particular interest
is the fact that all of these groups reported practically identical values for the power-
law 2 exponent, ranging between 2.1 and 2.2. This observation led the authors in
[15] to suggest the use of power-law exponents as a way of characterizing different
families of graphs. In addition, they demonstrated how these exponents could be used
to approximate important graph metrics, such as the number of nodes, the number
of edges, the average neighborhood size, and the effective diameter. Albert, Jeong,
and Barabasi went even further to argue the scale-invariant nature of the power-law
distributions, suggesting that “large networks self-organize into a scale-free state, a
feature unpredicted by all existing random graph models” [10].
The significance of these power-laws is that they clearly outline the inadequacy
of the described small-world models to accurately capture the true nature of many
large networks. The problem is that these models do not explain the existence of
highly connected nodes, a simple consequence of power-law 2. The described
power-law observations have therefore opened up a search for alternative techniques
for generating realistic network topologies that exhibit such power-law phenomena.
2.2.1 Power-Law Models
Based on the discoveries described above, a number of alternative models have been
proposed that produce graphs exhibiting the observed power-law properties. While
some set out to synthetically reproduce various power-law distributions accepting
them as empirical facts, others attempt to provide an explanation as to the origin of
such phenomena. An example of the latter is a model proposed by Barabasi and Albert
[10]. The two argue that the existence of power-laws in many real networks is caused
by two key features: growth and preferential attachment. The growth feature describes
the dynamic nature of many real networks, in which new vertices are continuously
added. Preferential attachment is used to model the fact that in real networks, new
vertices are more likely to link to existing vertices of high degree, resulting in the
so-called “rich-get-richer” phenomenon. In the case of the web graph, these two features are
evident as new pages are created daily, typically containing hyperlinks to already
highly connected and therefore highly visible pages. Barabasi and Albert build their
model by starting with a small number of vertices and no edges. Then, a new vertex is
added at each time step by linking it to m other vertices already present in the system.
The existing vertices are chosen with probability that is proportional to their degree.
This process produces a random graph that reaches a steady state characterized
by the same power-law distribution observed in many real networks. Notice that,
without continuous addition of new vertices, this model would eventually produce a
clique, as all the vertices would ultimately be connected. In fact the authors proved
that both growth and preferential attachment are necessary to correctly model the
behavior of real networks: the growth factor ensures a stationary power-law distribution,
and preferential attachment is responsible for its scale-free nature. The Barabasi-Albert
model possesses a certain intuitive appeal, particularly when used to model the
topology of P2P networks such as Gnutella. Recently, a topology generator
called BRITE was proposed that produces graphs exhibiting all four of the discussed
power-laws based on factors such as growth and preferential attachment studied by
Barabasi and Albert [21]. We are currently experimenting with adopting this model
for P2P networks such as Gnutella.
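The growth and preferential attachment process described above can be sketched in
Python as follows. This is a minimal illustration, not the BRITE generator and not
necessarily the exact procedure of [10]. Since the seed vertices start with no edges and
therefore zero degree, the sketch assumes that the first new vertex links to all m seed
vertices. That bootstrap rule, the function name, and the parameter names are
assumptions of this example.

```python
import random


def barabasi_albert(n, m, rng):
    """Grow a graph to n vertices; each new vertex links to m existing vertices."""
    adj = {v: set() for v in range(m)}  # m seed vertices and no edges
    targets = list(range(m))            # assumption: first newcomer links to every seed
    endpoints = []                      # each vertex appears once per incident edge
    for new in range(m, n):
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints.extend([new, t])
        # Drawing uniformly from the endpoint list selects a vertex with
        # probability proportional to its current degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return adj
```

Keeping a list with one entry per edge endpoint is a common way to obtain
degree-proportional sampling without recomputing the degree distribution at every step.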
If the goal is to simply generate graphs that match exactly the power-law prop-
erties observed empirically, then the α − β graph model proposed by Aiello, Chung,
and Lu could be used [7]. This model involves two parameters, α and β, representing
the intercept and the slope of the plot of degree distribution on a log-log scale.
Since any fixed pair of values for α and β defines a finite set of graphs, the authors
propose simply selecting a graph from this set at random. More recently, Internet
topology generators have been proposed that subscribe to the same philosophy of
using power-laws to guide graph construction [23].
2.2.2 Power-Laws in Gnutella
Upon analyzing the Gnutella topology data obtained using our network crawler, we
discovered it obeys all four of the power-laws described in the previous section. The
results for power-laws 1 through 4 are presented in figures 2.2, 2.3, 2.4, and 2.5,
respectively. Power-law relationships between variables are typically plotted on a
logarithmic scale, since their plot should, by definition, appear linear. Power-law
exponents can then be defined as the slope of this linear plot. We used linear regression
to fit a line to a set of two-dimensional points using the least-squares method. To
quantify the validity of the approximation, with each figure we included the absolute
value of the correlation coefficient r, which ranges between −1 and 1. An |r| value of 1
indicates perfect linear correlation.
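The fitting procedure just described can be sketched as follows. This is an illustrative
least-squares fit in log-log space, not the tool used to produce our figures. The
function name is hypothetical, and the sketch assumes strictly positive inputs with at
least two distinct x values and two distinct y values.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (ln x, ln y); returns (exponent, intercept, r)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx                   # the power-law exponent
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)      # correlation coefficient, between -1 and 1
    return slope, intercept, r
```

The fitted curve is then exp(intercept) * x ** slope, which is the form in which fitted
lines are labeled in the plots of this section.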
As mentioned earlier, power-law 1 is evaluated by sorting all nodes in descending
order according to their degree, and plotting degree versus rank of a node in this
sequence on a log-log scale. For comparison, we present plots for both the snapshots
of the Gnutella network topology and a simple connected random graph of the same
size. Figure 2.2 shows this power-law holds for the Gnutella topology instance with
rank exponent R = −0.98 and a correlation coefficient of 0.94, which cannot be said
for the random topology.
(a) Gnutella 12/28/00 (|r| = 0.94), fitted line exp(6.04022)*x**(−1.42696)   (b) Random graph
Figure 2.2: Log-log plots of degree versus rank (power-law 1)
Power-law 2 is of particular importance, because it is the one that is most
frequently cited in recent studies of large network topologies. Figure 2.3 shows
a node degree power-law exponent of −1.4 for the Gnutella topology. We must remark
that a group called Clip2 independently discovered this particular power-law for the
Gnutella network topology [13]. However, they reported a power-law exponent of
−2.3, in disagreement with our result. We believe this discrepancy arises because
our results are based on network crawls performed during
December of 2000, while the other result dates back to the summer of the same year.
Since that time, the Gnutella network has undergone significant changes in terms
of its structure and size, as described in [13]. While the values of the node degree
exponent O for all of the Gnutella topology instances obtained during the month of
December are consistently around −1.4, we have observed O values of −1.6 for the
data obtained in November. This may be taken as an indication of a highly dynamic,
evolving state of the Gnutella network. We are nevertheless currently attempting to
establish contact with people from Clip2 in order to further examine the reasons for this
discrepancy. Interestingly, power-law degree distributions have recently been reported
for another file-sharing P2P application, Freenet [22].
(a) Gnutella 12/28/00 (|r| = 0.96), fitted line exp(7.27358)*x**(−0.98116)   (b) Random graph
Figure 2.3: Log-log plot of frequency versus degree (power-law 2)
It has been shown that power-laws 3 and 4 hold for almost all types of topologies,
including random, regular, and hierarchical [21]. Power-law 3 by definition holds
for regular topologies such as a ring topology and a 2-D mesh, with hop-plot exponents
of 1 and 2, respectively, for $h \ll \delta$. It is therefore not surprising that we have also
observed these power-laws in the Gnutella network topology. However, a case has been
made that, while the mere presence of these two power-laws is not a distinguishing
property of a graph, the values of their exponents can be. For this reason, instead
of plotting power-laws 3 and 4 for a single instance of the Gnutella topology and a
random graph of the same size, we compare results for several different snapshots
of the Gnutella topology. Figure 2.4 shows the hop-plots for four of the Gnutella
topology snapshots described previously. For each one, we approximated only the
first four hops. Clearly, power-law 3 holds for all four snapshots with very high
correlation coefficients of 0.99. More importantly, the hop-plot exponents seem to be
clustered tightly around the value of 3.5. Notice that this value lies right between the
exponent values reported for the inter-domain and router level topology instances of
the Internet [15]. Like the authors in [15, 21], we must concede that the results for
this particular power-law may be misleading given such a small number of data points.
This limitation is imposed by the fact that these graphs have a small diameter.
(a) Gnutella 11/16/00 (|r| = 0.99), fitted line exp(8.36937)*x**(3.48228)
(b) Gnutella 12/20/00 (|r| = 0.99), fitted line exp(9.32629)*x**(3.54494)
(c) Gnutella 12/27/00 (|r| = 0.99), fitted line exp(9.26415)*x**(3.52262)
(d) Gnutella 12/28/00 (|r| = 0.99), fitted line exp(9.31438)*x**(3.60599)
Each panel also shows the maximum number of pairs.
Figure 2.4: Log-log plot of the number of pairs of nodes versus the number of hops
(power-law 3) for four snapshots of the Gnutella topology
An application of power-law 3 that seems particularly applicable to Gnutella was
suggested by the authors in [15]. They introduced the concept of the effective diameter
$\delta_{ef}$, which is essentially the number of hops required to reach a “sufficiently large”
portion of a network. In other words, any two nodes are within $\delta_{ef}$ hops of each other
with high probability. We present the definition below for convenience.
Definition 4 (effective diameter) Given a graph with N nodes, E edges, and H
hop-plot exponent, the effective diameter, $\delta_{ef}$, is defined as:
$$\delta_{ef} = \left( \frac{N^2}{N + 2E} \right)^{1/H}$$
Substituting the values for the Gnutella topology snapshot from December 28th,
2000 (N = 1125, E = 4080, H ≈ 3.61), we obtain an effective diameter of approximately
3.9. This suggests that, during that time, a better value for the maximum TTL would have
been 4 (instead of 7, which is the default specified by the Gnutella protocol).
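The substitution can be checked with a few lines of Python. The node and edge counts are
taken from table 2.2 and the hop-plot exponent from the fitted line for the 12/28/2000
snapshot in figure 2.4. The function name is introduced here only for illustration.

```python
import math


def effective_diameter(num_nodes, num_edges, hop_exponent):
    """Effective diameter (N^2 / (N + 2E))^(1/H) from definition 4."""
    return (num_nodes ** 2 / (num_nodes + 2 * num_edges)) ** (1.0 / hop_exponent)


# Snapshot of 12/28/2000: N = 1125, E = 4080, H = 3.60599.
delta_ef = effective_diameter(1125, 4080, 3.60599)
suggested_ttl = math.ceil(delta_ef)  # round up to a whole number of hops
```

The result is slightly below 4, so rounding up to a whole number of hops gives the TTL
value of 4 quoted above.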
Similar trends to the ones reported for the hop-plots appear in the eigenvalue
plots. Figure 2.5 shows the first 20 eigenvalues versus their order on a log-log scale
for the Gnutella topology snapshots. Once again, we see the consistency of power-law
exponents across different snapshots. Interestingly the exponents for the snapshots
obtained during the month of December are practically equal, while the exponent
for the snapshot from November is slightly smaller. Again, this fact may be taken
as an indication that the Gnutella network was going through an evolutionary state,
captured by these power-law exponents. There is a rich literature proving that eigen-
values of a graph are closely related to its topological properties. In the future, we
plan to further analyze the eigenvalues of P2P network topologies and their practical
implications.
Our empirical results clearly outline strong power-law properties of the Gnutella
network topology. It is our thesis that these properties can be utilized to improve
the performance of algorithms such as those used for searching [6]. In addition, we believe
that an accurate model of the network topology of P2P network applications such as
Gnutella must inevitably exhibit the presence of power-laws 1 and 2, as well as produce
all four power-law exponents in close agreement with the ones observed empirically.
(a) Gnutella 11/16/00 (|r| = 0.97), fitted line exp(2.27850)*x**(−0.22301)
(b) Gnutella 12/20/00 (|r| = 0.89), fitted line exp(2.83511)*x**(−0.30114)
(c) Gnutella 12/27/00 (|r| = 0.94), fitted line exp(2.82127)*x**(−0.29278)
(d) Gnutella 12/28/00 (|r| = 0.94), fitted line exp(2.81997)*x**(−0.29412)
Figure 2.5: Log-log plot of eigenvalues versus rank (power-law 4) for four snapshots
of the Gnutella topology
Chapter 3
Modeling Network Latencies
In this chapter we further refine our model of P2P networks to include traffic. In par-
ticular, we study the effects of heterogeneous latencies on reachability in P2P network
applications operating under flooding protocols. We call this potentially devastating
effect “short-circuiting.” Traditionally, latency has been studied to model network
performance as it relates to throughput. Network reachability has traditionally been
studied through the analysis of distance in graphs. In this work, we point towards
a novel fact that heterogeneous latencies can significantly impact reachability, inde-
pendent of distance.
We begin with a brief introduction of short-circuiting. We then present our formal
model for studying the effects of short-circuiting. Finally, we report our results from
both network simulation studies and empirical tests performed on Gnutella. We
conclude based on these results that, on average, the real effects of short-circuiting
are significant, but not devastating to the performance of an overall system.
3.1 Latency Effects
We have seen in chapter 1 that P2P applications are inherently decentralized, there-
fore relying on efficient decentralized algorithms for communication between hosts.
As a result, many of these applications, including Gnutella, have adopted a flood-
ing mechanism to forward messages in an effort to maximize reachability. Notice
that reachability, or the number of hosts receiving a particular message, is an im-
portant performance metric for many P2P applications, particularly those used for
file-sharing.
Flooding dictates that each host is to simply forward each received message to
all of its neighbors, except the one from which the message was received. As such,
flooding provides a simple and effective way of broadcasting messages in a dynam-
ically changing network without requiring the use of routing tables or knowledge
of the global network topology. However, it clearly does not scale for Internet-wide
applications, as it generates a large number of redundant messages and uses all avail-
able paths across the network. For this reason, in practice, flooding is typically
implemented in combination with one or more of the following standard governing
mechanisms designed to restrict its scope and limit redundant messages:
Mechanism 1. Time-to-Live Bounds Time-to-Live (TTL) is a governing mechanism
that prevents messages from traveling farther than a specified number
of hops, defined by an initial TTL value. TTL bounds are implemented by
including in each message header a TTL value field. When a node receives a
message it first checks to see if its TTL value is greater than zero. If so, the
node continues the flood with a decremented TTL. Otherwise the message is
dropped.
Mechanism 2. Unique Message Identification Unique Message Identification is
a mechanism that prevents any given message from being transmitted more than
once from each node. This mechanism is implemented by including in each
message header a UID (a unique ID label, or unique sequence number). When
a node receives a message it checks to see if it has previously seen that message.
If it has, the message is dropped and not forwarded. Otherwise, the node stores
the new UID in a local table, and then continues the flood.
Mechanism 3. Path Identification Path Identification is a mechanism that pre-
vents message paths from looping. This mechanism is implemented by including
in each message a header that records which nodes of the network have already
encountered the message. Before forwarding messages, each node simply checks
the header to verify whether or not it has previously seen the message. If so,
the message is dropped and not forwarded. If not, the node adds its name to
the header, and then continues the flood.
Ordinarily, a broadcast operation functioning under these mechanisms should
reach all nodes within the TTL bound of the broadcast source. However, we have
discovered that network latencies can negatively impact reachability of broadcast op-
erations. We define latency as the time it takes a message to traverse a link in the
network. We will show that, when Mechanisms 1 and 2 are implemented together,
heterogeneous network latencies can potentially have a devastating effect on reachability.
We call this phenomenon the “short-circuiting effect,” and describe it as
follows:
Short-circuiting Effect. Consider a message broadcast from a source node a, and
consider a path $P = \{u_1, u_2, \ldots, u_p\}$, joining nodes $a = u_1$ and $b = u_p$. It is
possible that there may be no throughput of the broadcast messages from a to b
along P, even if the hop-length p of the path P is less than or equal to the TTL
value t. This can result from heterogeneous latencies, as the following scenario
shows. Suppose there exists a message path Q from a to some intermediate
node $x = u_i$ of P, having a strictly smaller latency (but possibly a greater
hop number). Then a broadcast message originating from a, and following path
P will be killed (by Mechanism 2) when it reaches x, since it is the duplicate
of an earlier arriving message originating from a, but following path Q. Notice
that there may also be no throughput along path R consisting of the path Q
together with the subpath of P from x to b. This effect results from the fact
that R may possibly have a hop-length strictly greater than t, and hence, by
Mechanism 1 there is no throughput of the broadcast message originating at
a along path R. And, indeed, there may be no throughput of the broadcast
message along any path from a to b; it is this latency effect on reachability
which we call short-circuiting.
For the remainder of this chapter, we will consider broadcasts as operating under
the combination of Mechanisms 1 and 2. Note that short-circuiting-like effects cannot
be caused by the combination of Mechanisms 1 and 3, since, in that case, all
loop-free paths within the TTL bound are valid message paths.
3.2 Modeling the Short-Circuiting Effect
In order to analyze the problem of short-circuiting, we refine our network model from
chapter 1 to include edge weights representing latency values on communication links. We
consider the latency of a message path to be the sum of the latencies of its edges.
The flooding operation governed by mechanisms 1 and 2 in a network G is defined
by the following protocol regimen. We denote packets in the network by $p(u, t, h)$,
with unique message identifier UID = u, initial TTL value TTL = t, and current
hop-value HOP = h. The hop-value denotes the number of hops from the packet’s
source node. We will denote a packet (ready for broadcast) originating at node s,
with initial TTL = t, by $p(u_s, t, 0)$. The broadcast regimen operates as follows, and
defines the valid message paths associated with the transmission of the broadcast
packet.
1. Source s sends $p(u_s, t, 0)$ to all the neighbors of s, injecting the packet on all
links connected to s at the same time.
2. Nodes process packets on a first-come-first-served basis as follows: when a node v
receives packet $p(u_s, t, h)$ it checks whether the UID $u_s$ has been seen previously.
If it has, then the packet is dropped with no further processing.
3. If not, then v records $u_s$ in its local table, and checks whether t = h. If t > h,
then v replicates and forwards the message $p(u_s, t, h+1)$ (with incremented hop
count) to all neighbors except the node from which it received the packet. If
t = h then the packet is dropped and not forwarded.
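The regimen above lends itself to a small event-driven simulation. The Python sketch
below is illustrative only and is not the simulator used for the results in this chapter.
It assumes that hops are counted on arrival, so that no node farther than t hops from
the source is ever reached, and that each direction of each link has a fixed latency.
The function and variable names are hypothetical.

```python
import heapq


def flood_horizon(adj, latency, source, ttl):
    """Set of nodes (source excluded) that receive a broadcast under Mechanisms 1 and 2.

    adj:     dict mapping each node to an iterable of its neighbors
    latency: dict mapping a directed link (u, v) to its delay
    """
    seen = {source}  # nodes that have recorded the UID
    events = []      # entries: (arrival_time, tie_breaker, node, sender, hops)
    counter = 0
    for w in adj[source]:
        heapq.heappush(events, (latency[(source, w)], counter, w, source, 1))
        counter += 1
    while events:
        time, _, node, sender, hops = heapq.heappop(events)
        if node in seen:
            continue  # Mechanism 2: duplicate UID, packet dropped
        seen.add(node)
        if hops >= ttl:
            continue  # Mechanism 1: TTL exhausted, packet not forwarded
        for w in adj[node]:
            if w != sender:
                arrival = time + latency[(node, w)]
                heapq.heappush(events, (arrival, counter, w, node, hops + 1))
                counter += 1
    seen.discard(source)
    return seen


def symmetric(delays):
    """Give both directions of every listed link the same latency."""
    both = dict(delays)
    both.update({(v, u): d for (u, v), d in delays.items()})
    return both


# Node b is two hops from a (via x), but the detour a-c-x is faster than link a-x.
adj = {"a": ["c", "x"], "c": ["a", "x"], "x": ["a", "c", "b"], "b": ["x"]}
skewed = symmetric({("a", "c"): 1, ("c", "x"): 1, ("a", "x"): 5, ("x", "b"): 1})
uniform = symmetric({("a", "c"): 1, ("c", "x"): 1, ("a", "x"): 1, ("x", "b"): 1})
reached_skewed = flood_horizon(adj, skewed, "a", 2)    # {"c", "x"}
reached_uniform = flood_horizon(adj, uniform, "a", 2)  # {"b", "c", "x"}
```

With the skewed latencies, x first hears the message over the two-hop detour, cannot
forward it because the TTL is spent, and later drops the direct copy as a duplicate, so
b is never reached even though it lies within two hops of a.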
When latencies are introduced into this model of a flooding broadcast, complica-
tions arise as to the reachability of nodes. To determine reachability it is not sufficient
to consider only minimum-cost paths from s to v.
In order to quantify reachability, we introduce the notion of a horizon, defined as
follows:
Definition 5 The t-horizon R(s, t) from a source node s is the set of all nodes v
which receive a packet $p(u_s, t, -)$ broadcast from s with TTL = t. The t-neighborhood
N(s, t) from a source node s is the set of all nodes within a hop-distance of
t from s. Likewise, for a set of source nodes S, we denote by R(S, t) and N(S, t)
the t-horizon and t-neighborhood, respectively, from S, where we assume that the
broadcast is initiated by each s ∈ S simultaneously.
In the subsequent sections, we present our experimental results on the size of
t-horizon as a function of latencies under the described broadcast model.
3.3 Empirical experiments
We have conducted a series of experiments to empirically test the effects of short-
circuiting. These experiments are divided into two categories: simulations performed
on various static network topologies and empirical tests performed on a real P2P
network application. For the latter, we use Gnutella as our case study.
3.3.1 Gnutella Studies
We have already mentioned Gnutella as a rapidly evolving technology based on the
peer-to-peer network model. In this section we continue our case study of Gnutella
with the analysis of short-circuiting effects on reachability. In order to see why
Gnutella presents a meaningful testbed for studying the problem of short-circuiting,
let us briefly describe its design. Gnutella’s application-level protocol supports two
basic types of broadcast requests: ping, which is essentially a request for a host to
announce itself, and a query. These messages are propagated through the network by
means of a flooding broadcast. The response messages are then routed back along the
same path that the original request arrived by means of dynamically updated routing
tables maintained by each host. The flooding in Gnutella is implemented using mech-
anisms 1 and 2 described in previous sections, with the Gnutella software generally
limiting TTL values to at most 7. Its routing protocol, together with heterogeneous
latencies, makes Gnutella potentially vulnerable to the short-circuiting effects we have
described.
Our original interest in the effects of short-circuiting arose from an experiment
that involved crawling and mapping the entire Gnutella network. In particular, we
noted that the number of reachable hosts reported by a client was substantially less
than the number obtained by off-line analysis of the generated topology map. This
analysis consisted of calculating the number of elements in the BFS tree rooted at a
node representing that particular client. We consistently noted discrepancies of this
nature of approximately one half. After conjecturing that short-circuiting may play a
substantial role in such discrepancies, we attempted to verify this empirically.
Figure 3.1: The results of level-1 short-circuiting effects on the broadcast horizon on
the Gnutella network, October 2000. The y-axis represents the broadcast horizon
size, and the x-axis labels each of 68 broadcast trials. The top line is the resulting
horizon from multiple distinct broadcasts from the same source, and the lower line
is the resulting horizon from a single broadcast message from a single source. The
discrepancy represents “level-1 short-circuiting” effects.
To test our hypothesis, we have devised an experimental method of discovering
what we call the “level-1 short-circuiting” effect. These are the effects of short-
circuiting caused by paths interfering at the first level. That is, in our experiments
we compare the 7-horizon of a message broadcast from v with the 6-horizon of distinct
message broadcasts from the neighbors of v. The idea is that sending messages with
distinct ID labels will prevent them from interfering with each other, and thereby
allow us to measure a subset of the total short-circuiting effect. The actual number
of hosts reached by the broadcast of the shared message is compared to a union of
host sets reached by the set of distinct broadcast messages. More refined estimates
of short-circuiting effects can be obtained by comparing the hop counts of messages
responding to a shared broadcast to the hop counts of messages responding to distinct
broadcasts: if the former is larger than the minimum of the latter, then we posit that
short-circuiting has occurred. Figure 3.1 shows the results of a particular experiment
of this nature conducted in October of 2000. We note that the observed reductions
average 55%.
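The comparison described above can be illustrated with a minimal sketch. It assumes that the hosts answering the shared broadcast and the hosts answering each distinct per-neighbor broadcast have already been collected as sets; the function name `level1_reduction` and the toy host labels are hypothetical and are not taken from our measurement code.

```python
def level1_reduction(shared_hosts, distinct_host_sets):
    """Estimate the level-1 short-circuiting reduction.

    shared_hosts: hosts that answered the single shared broadcast.
    distinct_host_sets: one set of answering hosts per distinct broadcast
    (one broadcast per neighbor of the source).
    Returns the fraction of the combined horizon that the shared broadcast missed.
    """
    union = set().union(*distinct_host_sets)
    if not union:
        return 0.0
    return 1.0 - len(set(shared_hosts)) / len(union)


# Toy data (hypothetical labels): the shared broadcast reaches 3 of 6 hosts.
reduction = level1_reduction(
    {"a", "b", "c"},
    [{"a", "b"}, {"c", "d"}, {"e", "f"}],
)
```

In this toy case the shared broadcast reaches half of the hosts reached by the distinct broadcasts, so the estimated reduction is one half.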
[Plot omitted: horizon size (y-axis, 0 to 450) versus t (x-axis, 2 to 7), with one curve for 2 servers and one for 3 servers.]
Figure 3.2: Horizon-size versus t
In another set of experiments we focused on the t-horizon as a function of the TTL
value. We performed the experiment by connecting to a set of servers and sending
successive ping messages with increasing TTL. Figure 3.2 shows the results of one such
experiment using two and three broadcast servers. As predicted by short-circuiting,
we observed a decrease in t-horizon after TTL has exceeded a certain threshold, typically
around 5. We have been able to explain this phenomenon analytically in [9].
This particular experiment required connections to selected servers to persist over a
longer period of time, so that a number of test trials could be performed.
Difficulties in conducting experiments on Gnutella. Overall, we have found it
quite challenging to isolate the effects of short-circuiting, as well as other phenomena,
on the Gnutella application. The challenge has been mainly due to system instability,
both in terms of topology and latencies. One of our preliminary experiments focused
on measuring variance in the size of the broadcast horizon over time. We have found
that several identical tests of horizon size, which were performed consecutively, can
differ drastically in their results. Figure 3.3 shows the size of the broadcast horizon
over time using four broadcast servers. Each data point represents the horizon size
for a particular broadcast trial, with trials performed consecutively at six-minute
intervals.
[Plot omitted: horizon size (y-axis, 0 to 4000) versus broadcast trial (x-axis, 0 to 200).]
Figure 3.3: Horizon-size variation over time with broadcasting client using multiple
connections on the Gnutella network, March 2001. The y-axis represents the horizon
size, and the x-axis labels each of 180 broadcast trials, performed consecutively in six
minute intervals.
We attribute this phenomenon to the highly dynamic nature of the network and
constantly changing network conditions and topology. (We remark that in our net-
work simulations, we have also observed that slight changes in latency distribution
can result in dramatic changes in the size of the t-horizon.) Such high variance, as
well as the existence of a number of factors influencing the actual number of hosts
reached, makes it challenging to obtain meaningful results.
By far the biggest challenge to isolating the effects of short-circuiting on Gnutella
is the emergence of a new generation of “intelligent” Gnutella clients. These
clients contain built-in application logic designed to promote overall network health
by conserving bandwidth. While such clients have succeeded in allowing the Gnutella
network to scale up to about five times the original size, they have also created a
serious obstacle to conducting sophisticated experimental studies on the network.
In order to see this, consider a simple procedure for calculating the size of the t-
horizon in Gnutella, performed by sending a ping message and counting the number
of responses. Figure 3.4 shows the results of an experiment in which eight of these
procedures were performed simultaneously.
[Plot omitted: eight curves, ping1 through ping8, one per simultaneous ping procedure; x-axis ticks 1 to 14, y-axis ticks 0 to 3500.]
Figure 3.4: Difficulty in conducting experiments on today’s Gnutella network
As Figure 3.4 shows, typically only one of these procedures results in a considerable
number of responses. The reason is that Gnutella clients are now “intelligent”
enough to recognize when messages are the same, and will forward only one of them.
In addition, many clients now cache the responses to ping and query messages
for a certain amount of time. While such design decisions are understandable from
the performance standpoint, they also effectively take away the ability to accurately
determine the exact size of the broadcast horizon in Gnutella at any given time. As
a result we have found it extremely difficult to repeat experiments such as those
reported in figures 3.1 and 3.2 on the current system. Because of the difficulties with
measuring short-circuiting effects directly on the application, we turned our attention
to a series of network simulation studies in which we were able to precisely isolate the
effects of short-circuiting on theoretical network topologies.
3.3.2 Network Simulation Studies
In order to study the practical impact of short-circuited t-horizon reductions, we
needed to carefully consider both the topology of the network and the assignment of
latencies. Simulated studies allowed us to isolate the effects of short-circuiting on fixed
topologies. We conducted the simulations using our network simulator gnutsim, based
on a modified version of Dijkstra’s shortest path algorithm. The Java source code for
gnutsim is given in appendix B. To carry out these simulations, we needed to choose
the network topological model, as well as the network latency model. We report in
this chapter on a number of well-known regular topologies, such as the mesh and the
hypercube, as well as the Watts-Strogatz “small world” topology and snapshots of the
Gnutella topology obtained through crawling. To model network latencies we used
several classes of weights representing various commonly used Internet connection
bandwidths. We conducted our experiments by using random distributions of these
weights.
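To make the simulated broadcast model concrete, the following is a minimal event-driven sketch. It is not the gnutsim code of appendix B; it only follows the same general idea of processing message arrivals in order of time, as in Dijkstra's algorithm. The function names `undirected`, `t_horizon`, and `t_neighborhood`, the toy graph, and its latencies are all hypothetical. Each host accepts only the first copy of the message it receives (mechanism 2) and forwards it only while the hop count of that copy is below the TTL (mechanism 1).

```python
import heapq


def undirected(edges):
    """Build {node: {neighbor: latency}} from (u, v, latency) triples."""
    graph = {}
    for u, v, latency in edges:
        graph.setdefault(u, {})[v] = latency
        graph.setdefault(v, {})[u] = latency
    return graph


def t_horizon(graph, source, ttl):
    """Hosts that accept a flooded message sent from source with the given TTL."""
    if ttl < 1:
        return set()
    seen = {source}
    reached = set()
    # Each event is (arrival time, hops travelled so far, receiving host).
    events = [(latency, 1, nbr) for nbr, latency in graph[source].items()]
    heapq.heapify(events)
    while events:
        time, hops, node = heapq.heappop(events)
        if node in seen:
            continue  # duplicate message ID: dropped, even if it used fewer hops
        seen.add(node)
        reached.add(node)
        if hops < ttl:  # TTL not yet exhausted: forward to the other neighbors
            for nbr, latency in graph[node].items():
                if nbr not in seen:
                    heapq.heappush(events, (time + latency, hops + 1, nbr))
    return reached


def t_neighborhood(graph, source, ttl):
    """Hosts within ttl hops: the t-horizon under uniform latencies."""
    uniform = {u: {v: 1 for v in nbrs} for u, nbrs in graph.items()}
    return t_horizon(uniform, source, ttl)


# Toy topology: the slow direct link s-c loses the race to the fast 3-hop path.
g = undirected([("s", "a", 1), ("a", "b", 1), ("b", "c", 1),
                ("s", "c", 10), ("c", "d", 1), ("d", "e", 1)])
horizon = t_horizon(g, "s", 3)
neighborhood = t_neighborhood(g, "s", 3)
```

In this toy example host c first receives the message over the fast three-hop path, at which point the TTL of 3 is exhausted; the later copy arriving over the slow direct link is dropped as a duplicate, so d and e lie inside the 3-neighborhood but outside the 3-horizon.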
We present the statistics of our simulation studies as tables, which report the
reduction ratios in reachability caused by short-circuiting, given randomly chosen
latencies on a fixed topology. Each table is associated with a fixed topology. Each
row of the table represents results from 100 trials using random latencies. In each
row we report, for a fixed t, the worst, average, and best observed t-horizon, and the
t-neighborhood (which is equal to the t-horizon when using uniform latencies). We then
give the reduction ratios by dividing the worst over the t-neighborhood (WRR), and the
average over the t-neighborhood (MRR).

TTL   Worst    Avg   Best    Nbhd    WRR    MRR
  1       8      8      8       8   100%   100%
  2      18     21     24      25    72%    84%
  3      24     47     66      69    35%    68%
  4      43     84    124     138    31%    61%
  5      67    150    238     310    22%    48%
  6     121    274    424     678    18%    40%
  7     278    498    723    1399    20%    36%
  8     434    819   1364    2771    16%    30%
  9     765   1388   2307    5018    15%    28%
 10     977   2148   3420    7729    13%    28%
 11    2030   3153   4549    9449    21%    33%
 12    2252   4290   5812    9928    23%    43%
 13    3692   5519   6599    9994    37%    55%
 14    4995   6392   7563   10000    50%    64%
(a) Reduction ratios for the Watts-Strogatz topology

[Histogram omitted: both axes have ticks from 10 to 100.]
(b) Histogram of 1000 trials with random distribution of latencies (t = 10)

Figure 3.5: Short-circuiting effects for the Watts-Strogatz topology (nodes = 10000, k = 3, p = 0.2)
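As a worked instance of these two ratios, the short sketch below recomputes the WRR and MRR entries for the TTL = 10 row of the Watts-Strogatz table in Figure 3.5(a) from its Worst, Avg, and Nbhd columns. The function name `reduction_ratios` is hypothetical, and rounding to whole percentages is an assumption about how the table entries were produced.

```python
def reduction_ratios(worst, average, neighborhood):
    """Return (WRR, MRR) as whole percentages of the t-neighborhood size."""
    wrr = round(100 * worst / neighborhood)
    mrr = round(100 * average / neighborhood)
    return wrr, mrr


# TTL = 10 row: Worst = 977, Avg = 2148, Nbhd = 7729.
row_ttl_10 = reduction_ratios(977, 2148, 7729)
```

The result agrees with the 13% and 28% reported in that row.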
Figure 3.5 presents the results for the Watts-Strogatz small-world topology. The
histogram in part (b) shows the distribution of t-horizon values over 100 trials using
random latencies for t = 10, which is the value of t for which the reduction ratios are
the most severe. The results for other topologies are presented in appendix C.
Observations and Conclusions. Our empirical results indicate that, in practice,
the effects of short-circuiting are not as devastating as suggested by the theoretical re-
sults in [9]. We have observed the most significant impact on “small-world” topologies
such as our Gnutella snapshots and Watts-Strogatz network models. For these graphs,
we have observed reduction ratios in t-horizon size of over 90% in the worst case,
for certain values of t. In other words, we have observed that with random latencies
one can expect instances where the ratio of sizes of the t-neighborhood divided by
t-horizon is greater than 10 to 1, as shown in figure ??. Furthermore, the histogram
in the same figure shows that the reduction in reachability caused by short-circuiting
was always greater than 50% using random latencies.
In our experimental studies we have also observed that both random graphs and
highly structured graphs such as the mesh and hypercube tend to have, on aver-
age, less pronounced short-circuiting effects, as compared with “small-world” graphs.
Intuitively, this can be best understood if one considers the potentially stimulating
effect of the clustering property as defined in chapter 2 on short-circuiting.
In general, for a fixed TTL = t, the t-horizon sizes tend to be
normally distributed with small variance, independent of network topology. We have
also observed that, independent of topology, mean reduction ratios depend on
the TTL = t. Our results suggest that the reduction becomes more severe as t increases,
until a certain threshold is reached, usually at about the point where t is equal to half
the network radius or diameter, after which the reduction diminishes.
Chapter 4
Gnutella Crawler Implementation
In this chapter we discuss issues related to the design and implementation of our Gnutella
network crawler. We begin by providing a brief introduction to Gnutella and its
protocol, necessary for understanding the remainder of this chapter. We then present
both the sequential and parallel algorithms for discovering the topology of the Gnutella
network, followed by a discussion of our distributed implementation using Java
RMI.
4.1 Introduction to Gnutella
Gnutella can be best explained as a fully distributed, information sharing technology.
It originated as a project at Nullsoft, a subsidiary of America Online, but was aban-
doned out of fear of its potential use for copyright infringement. After being quickly
reverse-engineered by several programmers and open-source enthusiasts, Gnutella’s
popularity really took off. Gnutella allows distributed file sharing by allowing each
user to specify directories on their local machine they want to share. In this sense,
Gnutella can be viewed as a distributed file storage system with search capabili-
ties. Unlike its predecessor Napster, which relies on a centralized search database,
Gnutella promotes decentralization of all network functions. As we have already seen,
Gnutella is based on a peer-to-peer model. This means that users connect to each
other directly through a piece of client-server software, forming a high-level network.
Throughout this thesis, we have and will continue to refer to this high-level network
as the Gnutella network, or GnutellaNet. Because Gnutella software functions as
both a server and a client, it is sometimes referred to as a “servant.” In this thesis
we may use the terms client, servant, and host interchangeably to refer to Gnutella
software running on a particular machine.
4.1.1 Gnutella Protocol
Each Gnutella client implements the application level Gnutella protocol, which spec-
ifies how messages are routed between GnutellaNet hosts. We have already described
Gnutella’s protocol design at a high-level in chapter 3. We will now complete our
description with a few implementation details.
The Gnutella protocol supports four basic types of messages, summarized in Table 4.1.
The routing technique employed by the Gnutella protocol is a form of controlled
flooding, where messages are passed recursively between hosts. Flooding operates
by each Gnutella host forwarding the received ping and query messages to all of its
neighbors, except to the one that sent the message. To limit exponential spread of
messages through the network, each message header contains a time-to-live (TTL)
field. TTL is used in the same fashion as in the IP protocol: at each hop its value
is decremented until it reaches zero, at which point the message is dropped. This
is equivalent to mechanism 1 described in chapter 3. The maximum TTL value
specified by the Gnutella protocol is seven. Recall that this restriction effectively
segments the Gnutella network into subnets, imposing on each user a virtual “horizon”
beyond which their messages cannot reach. In practice, this situation is acceptable
as information may still get around. Each Gnutella message is also flagged with a
unique ID. The message ID is used by peers to detect and subsequently drop duplicate
messages, which indicate a loop in the GnutellaNet topology (mechanism 2). In addition,
it is also used to route the response messages back along the same path by which the
original request arrived. This is implemented by each host maintaining a dynamic routing
table of message IDs and connection labels indicating the particular connection along
which each message arrived. When a response message arrives at a host, it should
contain the same message ID as the original request. The host then checks its routing
table to determine along which link the response message should be forwarded. This
technique greatly improves efficiency while also preserving network bandwidth.

Type        Description                            Contains
Ping        Request for a host to announce itself  No body
Pong        Reply to Ping message                  IP and port of responding host; number and size of files shared
Query       Search request                         Minimum speed requirement for responding host; search string
Query Hits  Reply to Query message                 IP, port, and speed of responding host; number of matching files and their indexed result set
Table 4.1: Gnutella protocol message description
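The interplay of TTL handling, duplicate detection, and reverse-path routing of responses can be sketched as follows. This is an illustrative toy model, not Gnutella client code: the class `Host`, its methods, and the message ID string are hypothetical, message delivery is modeled as a direct method call, and latencies are ignored.

```python
class Host:
    """Toy model of a Gnutella host handling ping and pong messages."""

    def __init__(self, name):
        self.name = name
        self.links = []    # directly connected hosts
        self.routes = {}   # message ID -> link the request arrived on (None at the origin)
        self.pongs = []    # names of responders, collected at the originating host

    def connect(self, other):
        self.links.append(other)
        other.links.append(self)

    def send_ping(self, msg_id, ttl):
        self.routes[msg_id] = None          # mark this host as the originator
        for link in self.links:
            link.receive_ping(msg_id, ttl, self)

    def receive_ping(self, msg_id, ttl, sender):
        if msg_id in self.routes:
            return                          # mechanism 2: duplicate ID, drop
        self.routes[msg_id] = sender        # remember where the request came from
        sender.receive_pong(msg_id, self.name)
        ttl -= 1                            # mechanism 1: decrement TTL at each hop
        if ttl > 0:
            for link in self.links:
                if link is not sender:
                    link.receive_ping(msg_id, ttl, self)

    def receive_pong(self, msg_id, responder):
        back_link = self.routes.get(msg_id)
        if back_link is None:
            self.pongs.append(responder)    # this host originated the request
        else:
            back_link.receive_pong(msg_id, responder)  # route back along arrival link


# Toy network: c - h, h - n1, h - n2, n1 - far.
c, h, n1, n2, far = (Host(n) for n in ("c", "h", "n1", "n2", "far"))
c.connect(h)
h.connect(n1)
h.connect(n2)
n1.connect(far)
c.send_ping("id-1", 2)
```

With TTL = 2, the originating host c collects responses from h and from the two immediate neighbors of h, while the host labeled far stays beyond the horizon.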
4.1.2 Discovering Gnutella Network Topology
Topology discovery in IP networks is a well-studied area of research [26]. Generally
the approach is based on some protocol-specific feature, as in the case of traceroute.
Although the Gnutella protocol is much simpler than IP and provides no feedback regarding
message delivery, it nevertheless provides the necessary functionality for mapping
GnutellaNet topology. Notice that, according to the Gnutella protocol, it is possible
to discover neighbors of a particular host by connecting to that host and sending a
ping message with TTL = 2. As a result, pong messages would be sent back from
the connected host and all of its immediate neighbors. A complete network topology
could therefore be discovered by connecting to all the hosts, discovering their neigh-
bors, and combining the information into a single graph. We refer to this process
as crawling. Notice that, by following the described procedure, each edge would be
discovered twice thus introducing a level of redundancy. However it is still necessary
to connect to all the hosts in order to guarantee that the obtained topology map is
complete.
Compared with IP networks, GnutellaNet is highly dynamic. This means that its
topology is constantly changing - nodes and edges are added and removed as hosts
join and leave the network, establish new connections, and close the existing ones.
Therefore any topology discovery algorithm operating on the Gnutella network is
really capturing an instance, or a snapshot of the topology at a specific point in time.
Clearly, this imposes an additional requirement that any topology discovery algorithm
be efficient, since the accuracy of the topology map decreases as the actual running
time of the algorithm that was used to obtain it increases. In designing our
crawler, we have paid close attention to this requirement.
4.2 Design
In this section we discuss some issues related to the design of our Gnutella network
crawler. We present an informal performance analysis for both our sequential and parallel
algorithms for discovering Gnutella network topology.
4.2.1 Algorithm
Based on the procedure described in the previous section for discovering GnutellaNet
topology, an intuitive design solution might be to use BFS to crawl the network,
applying the algorithm for discovering direct neighbors to each encountered host.
However, there are some practical issues that make this approach inefficient. In order
to see this, let us first examine the basic operation of discovering neighbors of a single
Gnutella host. This operation requires establishing a connection, sending a ping
message, and waiting for all pong messages to be received - overall a time-consuming
process with running time on the order of several minutes. However, it is clear that such
an operation represents a lower bound for any topology discovery algorithm operating
on Gnutella and based on the procedure described in the previous section. We will
therefore use this basic operation as a unit in our performance analysis of algorithms
for discovering GnutellaNet topology.
The complexity of the BFS algorithm for discovering topology of the Gnutella
network with N hosts is clearly O(log N). Also, for the moment, let us assume that
our crawling workstation is capable of maintaining up to b simultaneous network
connections. Then if b ≥ N and we had a list of addresses for all the Gnutella hosts,
we could simply connect to all of them simultaneously and obtain the entire network
topology map in constant time. Fortunately, such a list is available, as every Gnutella
client maintains a dynamically updated list of live hosts. Using this list as input, we
can now formulate our new algorithm for discovering GnutellaNet topology as follows:
Procedure buildTopoMap (G, l)
Input: An empty graph G, and a complete host list l
Output: A graph G representing the Gnutella network topology
for each element h of l
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge h − h2 to G
                if (h2 is not in l)
                    add h2 to the end of l
Due to the highly dynamic nature of the network, the input list of hosts is not
guaranteed to be either complete or perfectly accurate. This means that new hosts
not contained in the list could have just joined the network and, furthermore, hosts
contained in the list may no longer be active. Nevertheless our algorithm will still
work, as new hosts will be discovered at run-time and added to the end of the list.
Similarly, hosts that are no longer active will simply be ignored. The ability of our
algorithm to work with incomplete input data is particularly important considering
the highly dynamic nature of the Gnutella network. However, the more complete the list
is, the closer the performance of our algorithm will be to optimal.
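The procedure buildTopoMap and its tolerance of an incomplete or stale host list can be sketched in runnable form as follows. This is a sequential illustration only, not our Java crawler: the names `build_topo_map` and `discover_neighbors` are hypothetical, and the connect-and-ping step is replaced by a lookup in a small made-up dictionary, where a missing entry stands for a failed connection.

```python
def build_topo_map(host_list, discover_neighbors):
    """Return the set of undirected edges found by crawling host_list.

    discover_neighbors(h) stands in for "connect to h and send a ping with
    TTL = 2"; it returns the responding hosts, or None if the connection fails.
    """
    hosts = list(host_list)   # work list; grows as new hosts are discovered
    known = set(hosts)
    edges = set()
    i = 0
    while i < len(hosts):
        h = hosts[i]
        i += 1
        responders = discover_neighbors(h)
        if responders is None:
            continue          # host no longer active: simply ignored
        for h2 in responders:
            if h2 != h:
                edges.add(frozenset((h, h2)))  # each edge may be seen twice; the set keeps one
                if h2 not in known:
                    known.add(h2)
                    hosts.append(h2)           # newly discovered host, crawled later
    return edges


# Toy network (hypothetical): "gone" has left, "c" and "d" are missing from the input list.
network = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
edges = build_topo_map(["a", "b", "gone"], network.get)
```

Starting from the incomplete list, the sketch still recovers all three edges of the toy network, because c and d are appended to the work list when they first respond.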
Notice that our algorithm in effect partitions the problem of discovering Gnutella
network topology into two steps, or phases: discovering nodes (host list) and discov-
ering edges (connections). Since the functionality for solving the first phase is already
provided through the existing Gnutella client software, our algorithm’s focus is on the
second phase of the problem.
4.2.2 Initial Implementation
We have implemented the algorithm presented in the previous section as a Java
application. We chose Java as our development platform primarily for its support
for networking and threads. Platform-independence was also an important benefit,
particularly for our distributed implementation described in the subsequent sections.
The main problem with our initial implementation stems from our original assumption
that the number of connections that could be maintained simultaneously is
greater than the total number of Gnutella hosts. In practice, this assumption does not
hold, as the number of live Gnutella hosts at any given time is typically on the order
of thousands. To cope with this situation we were forced to organize threads into
groups of b, where b is the maximum number of simultaneous connections that our
system could handle. This strategy introduces additional complexity and, as already
discussed, sacrifices the integrity of a time-critical task such as topology discovery in
a highly dynamic network. However since connections to different Gnutella hosts can
be done asynchronously, a natural solution would be to run the crawler in parallel.
The following section describes issues involved in discovering GnutellaNet topology
in parallel, as well as our implementation using Java RMI.
4.2.3 Parallel Algorithm
The simplest and perhaps the most natural way to make our topology discovery algo-
rithm run in parallel would be to partition the initial list of Gnutella host addresses.
Each processor would then be responsible for discovering neighbors of only a subset of
hosts. In addition, each processor would need to have some way of knowing whether
a newly discovered host address has already been “crawled” by another processor.
One way this could be done is by hashing the host address string and checking the
result (modulo the number of processors participating in the crawl) against the pro-
cessor’s index. If there is a match, the processor would know that it should go ahead
and crawl the host. If not, it would then need to pass the information to the appro-
priate processor. In fact, this technique is commonly used for indexing the WWW
by many search engines, including Google, primarily because it results in good load
balancing. However it also requires additional inter-processor communication in or-
der to pass the Gnutella host addresses discovered at run-time to the appropriate
processors. Instead, we have opted for a perhaps less elegant but more robust solution.
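For comparison, the hashing alternative discussed above can be sketched as follows. This is only an illustration of the idea and is not part of our crawler: the function names `owner` and `handle_discovered`, the use of CRC-32 as the hash function, and the example addresses are all assumptions made for the example.

```python
import zlib


def owner(address, num_procs):
    """Map a host address string to a processor index in [0, num_procs)."""
    return zlib.crc32(address.encode("utf-8")) % num_procs


def handle_discovered(address, my_index, num_procs, local_queue, outboxes):
    """Crawl the host locally if this processor owns it, otherwise pass it on."""
    target = owner(address, num_procs)
    if target == my_index:
        local_queue.append(address)
    else:
        outboxes[target].append(address)  # requires inter-processor communication


# Processor 0 of 4 handling three hypothetical addresses discovered at run-time.
num_procs = 4
local_queue = []
outboxes = {i: [] for i in range(num_procs)}
addresses = ["10.0.0.1:6346", "10.0.0.2:6346", "10.0.0.3:6346"]
for addr in addresses:
    handle_discovered(addr, 0, num_procs, local_queue, outboxes)
```

Every discovered address ends up with exactly one owner, which avoids redundant crawling at the cost of the message passing represented here by the outboxes.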
Our algorithm provides each processor with a complete input list of active hosts.
Each processor then executes an algorithm for calculating the subset for which it is
responsible, based on its unique processor number and the total number of processors
involved in the computation. For example, processor 0 of 10 would only attempt to
discover neighbors of the first 10% of hosts from the input list. The parallel version of
the topology discovery algorithm presented in the previous section is formulated below.
For clarity, we assume that the size of the initial list of hosts is a multiple
of the number of processors.
Procedure parallelBuildTopoMap (G, l)
Input: An empty graph G, and a complete host list l
Output: A graph G representing the Gnutella network topology
startIndex = (sizeofhosts/numberofprocs) ∗ procID
endIndex = startIndex + (sizeofhosts/numberofprocs) − 1
l2 = hosts[startIndex..endIndex]
for each element h of l2
    connect to h
    if (connection is successful)
        send ping message with TTL = 2
        for each response message m from host h2
            if (h2 != h)
                add edge h − h2 to G
                if (h2 is not in l)
                    add h2 to the end of l2
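The index arithmetic at the top of parallelBuildTopoMap can be checked with a short sketch. The function name `my_partition` and the host labels are hypothetical; as in the pseudocode, the size of the host list is assumed to be a multiple of the number of processors.

```python
def my_partition(hosts, num_procs, proc_id):
    """Return the contiguous slice of hosts assigned to processor proc_id."""
    chunk = len(hosts) // num_procs
    start_index = chunk * proc_id
    end_index = start_index + chunk - 1      # inclusive, as in the pseudocode
    return hosts[start_index:end_index + 1]


# 20 hypothetical hosts split among 10 processors: each gets 10% of the list.
hosts = [f"host{i}" for i in range(20)]
parts = [my_partition(hosts, 10, p) for p in range(10)]
```

Each processor computes its own slice from the full list without any communication, and together the slices cover the list exactly once.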
Despite its apparent simplicity, and owing to the highly asynchronous nature of the task, our
parallel algorithm in the best case achieves optimal speed-up. In addition, as long as
the total number of Gnutella hosts N ≤ pb, where p is the number of processors and b
is the maximum number of connections each processor can maintain simultaneously,
our algorithm will run in constant time. In practice, we were typically able to satisfy
this requirement with only a few processors, as the size of the largest connected public
segment of the Gnutella network at the time rarely exceeded two thousand users.
One potential problem with our algorithm is that its performance is dependent
on the “completeness” of the input list of host addresses. Recall from our previous
discussion that the input list is not guaranteed to be complete, as new hosts could
have joined the network. Because our algorithm only partitions the initial set of
hosts, each processor would discover new hosts independently. This would result in
redundant work being performed by all the processors. Notice that this would not
be a problem had we used the hashing solution mentioned above. However, it is easy
to show that, as long as the number of hosts discovered at run-time is within b,
performance of our algorithm will be within a factor of two of optimal. This is true
because only a single additional step will be required by each processor.
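The step counts behind this argument can be made explicit with a small calculation, where one step is one round of up to b simultaneous neighbor discoveries on a processor. The function names and the values of p, b, and the number of hosts discovered at run-time are hypothetical; only the network size of about two thousand hosts comes from the discussion above.

```python
import math


def initial_steps(num_hosts, num_procs, max_conns):
    """Rounds needed for each processor to crawl its share of the input list."""
    per_proc = math.ceil(num_hosts / num_procs)
    return math.ceil(per_proc / max_conns)


def total_steps(num_hosts, num_procs, max_conns, discovered):
    """Add the rounds every processor spends on hosts found at run-time."""
    return initial_steps(num_hosts, num_procs, max_conns) + math.ceil(discovered / max_conns)


# N = 2000 hosts, p = 4 processors, b = 500 connections each, so N <= p * b.
steps_complete_list = total_steps(2000, 4, 500, 0)
steps_with_new_hosts = total_steps(2000, 4, 500, 300)
```

With a complete list the crawl takes a single step; with up to b newly discovered hosts each processor needs one more step, which is the factor of two stated above.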
Typically an important issue in designing parallel algorithms is load balancing. In
our case, this refers to the actual number of connections each processor is required to
make. Recall that the input list of potential hosts may also contain some hosts that
have recently left the network. Therefore even though each processor will receive an
equal number of potential hosts to connect to, the number of actual live hosts in a
list is likely to be smaller and will vary between processors. However our experiments
indicate this is not a significant problem. In order to see this recall that, even though
the actual number of connections made by each processor could vary, they are still
handled simultaneously by each processor in a single logical step.
4.2.4 Limitations
The main limitation of our crawler is related to the notion of private networks. Since
a significant portion of Gnutella users reside behind a firewall that prevents anyone
on the outside from establishing a direct connection to them, our crawler will not be
able to accurately discover the topology between such hosts. Notice that these hosts may
still appear in the final topology graph, due to their connections with hosts outside
the firewall. In this sense, the topology obtained by our crawler can be viewed as a
subgraph of the actual Gnutella network topology.
In addition, even though the running time of our algorithm is optimal for any topology
discovery algorithm based on the Gnutella protocol, the actual execution time is still
bounded by the round-trip time (RTT) of messages in the Gnutella network and can take up
to several minutes. One could therefore question the integrity of our topology data,
based on the fact that the network structure may have significantly changed over
the course of several minutes. Despite these limitations we believe our crawler is a
valuable tool, able to accurately capture important structural properties of the actual
Gnutella network topology.
4.3 Distributed Computing Solution Using Java RMI
We have implemented our parallel algorithm for GnutellaNet topology discovery for a
network of workstations (NOW), primarily because we felt it would give the greatest
amount of flexibility and portability to our code. In addition, we felt that the task at
hand would be perfectly suited for a distributed computing model, since it requires
very little inter-processor communication. In fact, in our design, communication only
occurs at the beginning of the process, to distribute input, and at the end, to gather
the output at a central location. The mechanism for this communication is provided
by Java RMI. Remote method invocation (RMI) is JavaSoft’s implementation of
remote procedure calls (RPC). It is distributed as a standard Java library, providing
necessary functionality for distributed object communication. In our implementation,
crawling a subset of the Gnutella network is provided as a service residing on various
remote locations throughout our network. In other words, our parallel algorithm
described in the previous section is implemented as a distributed object residing on
remote machines.
Our distributed computing system includes an object serving as the “brain” of
the entire computation. This central object is responsible for “bootstrapping” the
entire topology discovery process by distributing the initial list of Gnutella hosts
to other remote objects. Upon receiving the input, each remote object performs
topology discovery of its portion of the network, and subsequently returns a graph
object representing network topology to the central object. The central object is then
responsible for merging all the output graphs into a single one representing topology
of the entire Gnutella network. We should mention that our crawler utilized some
Java classes providing functionality related to Gnutella protocol compliance from furi
- a full-fledged open-source Gnutella client developed by William Wong [3].
The main feature of our distributed implementation is that it allows a heterogeneous
network of workstations to participate in discovery of the Gnutella network
topology. As explained, this topology discovery can be executed in constant time
using only a few processors. In addition, the output graph representing Gnutella
network topology is provided in GML format [18], which is a fast growing standard
for representing graph data structures, and can immediately be viewed using visu-
alization tools such as LEDA’s graphwin [8]. Several visualizations of the Gnutella
network topology data obtained using our crawler are presented in appendix A.
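The final merge step and the GML output can be sketched as follows. This is an illustration only, not the code of our central object: the function names `merge_graphs` and `to_gml` and the host labels are hypothetical, partial graphs are represented simply as sets of undirected edges, and the output uses only the basic GML keys graph, directed, node, id, label, edge, source, and target.

```python
def merge_graphs(partial_edge_sets):
    """Union the edge sets returned by the individual crawlers."""
    merged = set()
    for edges in partial_edge_sets:
        merged |= edges   # edges reported by several crawlers are kept once
    return merged


def to_gml(edges):
    """Serialize a set of undirected edges (frozensets of two hosts) as GML text."""
    nodes = sorted({host for edge in edges for host in edge})
    ids = {host: i for i, host in enumerate(nodes)}
    lines = ["graph [", "  directed 0"]
    for host in nodes:
        lines.append(f'  node [ id {ids[host]} label "{host}" ]')
    for u, v in sorted(tuple(sorted(edge)) for edge in edges):
        lines.append(f"  edge [ source {ids[u]} target {ids[v]} ]")
    lines.append("]")
    return "\n".join(lines)


# Two hypothetical partial results that overlap in the edge a - b.
part0 = {frozenset(("a", "b"))}
part1 = {frozenset(("a", "b")), frozenset(("b", "c"))}
merged = merge_graphs([part0, part1])
gml_text = to_gml(merged)
```

Because the partial results are merged as sets, an edge reported by two crawlers appears only once in the final graph.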
Chapter 5
Conclusions and future research
5.1 Conclusions
Modeling complex network structures produced by modern P2P network applications
is a difficult task. The main contribution of this thesis to the task at hand is two-fold.
First, we made several important discoveries regarding the structure of the underlying
network topology of a P2P network application known as Gnutella. Specifically we
discovered it exhibits “small-world” properties of clustering and small diameter. In
addition, we observed four different power law relationships of various graph metrics.
It is our thesis that these empirical observations must be accounted for by any accurate
graph-based model of P2P network topology. Second, we pointed out the potentially
devastating effects of heterogeneous latencies on the reachability of message broadcasts in
P2P network applications operating under flooding protocols. Even though our em-
pirical results indicate that this problem we call “short-circuiting” is on average not
devastating to the overall system performance, we believe it should be taken seriously
by protocol designers. It is our hope that our results can be used in designing the
new generation of application-level protocols for P2P network applications.
5.2 Future Directions
Future research directions can be divided into three categories: those dealing with
network topology, visualization, and server placement. In the following sections, we
briefly discuss each one.
5.2.1 Network Topology Modeling
In this thesis we have reported discoveries of some structural properties of P2P net-
work topologies. However the search continues toward a uniform model of P2P net-
work topology, encompassing all of those structural properties observed in real net-
work applications. We speculate that for many P2P network applications, including
Gnutella, such a model will be a modification of the discussed Barabasi-Albert model,
perhaps accounting for hosts leaving the network and dynamically-changing connec-
tions. In addition, more research needs to be done on spectral analysis of the topology
graph’s eigenvalues and their relationship with the structural properties.
5.2.2 Network Visualization
Better graph drawing algorithms need to be designed for visualizing the topology
of large-scale P2P networks. Such algorithms should present the topological
structure of a network in such a way that meaningful conclusions can be drawn. Network
visualizations can then be used by engineers to identify network-related problems.
5.2.3 Server Placement
The problem of finding an optimal placement of servers has received a lot of attention
in the Internet community. Many P2P file-sharing applications, such as Gnutella,
present another attractive practical application of this problem. For example, each
connection of a Gnutella user to the network can be modeled as a graph augmentation
problem: add a single vertex and t edges to a graph G so that the size of the resulting
t-horizon is optimized. In the future, we plan to examine some theoretical issues behind
this problem using the knowledge we’ve obtained on the Gnutella topology model.
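One way to explore this augmentation problem experimentally is a greedy heuristic, sketched below under stated assumptions. The sketch assumes uniform latencies, so the t-horizon of the new vertex is itself plus every vertex within t-1 hops of one of its chosen neighbors. To keep the two quantities apart it uses k for the number of new edges and t for the horizon radius; this notation, the class name, and the greedy rule are illustrative choices, and the heuristic is not guaranteed to find an optimal placement.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class AttachmentSketch {
    /** Vertices within 'radius' hops of 'start', found by breadth-first search. */
    static BitSet ball(List<List<Integer>> g, int start, int radius) {
        BitSet seen = new BitSet(g.size());
        int[] dist = new int[g.size()];
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        seen.set(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (dist[u] == radius) continue;   // do not expand past the radius
            for (int v : g.get(u)) {
                if (!seen.get(v)) {
                    seen.set(v);
                    dist[v] = dist[u] + 1;
                    queue.add(v);
                }
            }
        }
        return seen;
    }

    /** Greedily picks k attachment points for a new vertex with horizon radius t >= 1. */
    static List<Integer> chooseNeighbors(List<List<Integer>> g, int k, int t) {
        List<BitSet> balls = new ArrayList<>();
        for (int v = 0; v < g.size(); v++) balls.add(ball(g, v, t - 1));
        BitSet covered = new BitSet(g.size());
        List<Integer> chosen = new ArrayList<>();
        for (int round = 0; round < k; round++) {
            int best = -1, bestGain = -1;
            for (int v = 0; v < g.size(); v++) {
                if (chosen.contains(v)) continue;
                BitSet gain = (BitSet) balls.get(v).clone();
                gain.andNot(covered);          // vertices newly reached via v
                if (gain.cardinality() > bestGain) {
                    bestGain = gain.cardinality();
                    best = v;
                }
            }
            if (best < 0) break;               // fewer than k candidates exist
            chosen.add(best);
            covered.or(balls.get(best));
        }
        return chosen;
    }

    /** A path 0-1-...-(n-1), used only as a toy input. */
    static List<List<Integer>> path(int n) {
        List<List<Integer>> g = new ArrayList<>();
        for (int i = 0; i < n; i++) g.add(new ArrayList<>());
        for (int i = 0; i + 1 < n; i++) {
            g.get(i).add(i + 1);
            g.get(i + 1).add(i);
        }
        return g;
    }

    public static void main(String[] args) {
        // On a 7-vertex path with k = 2 and t = 2 the heuristic selects vertices 1 and 4
        System.out.println(chooseNeighbors(path(7), 2, 2));
    }
}
```

In the toy run, the two chosen neighbors cover six of the seven path vertices, so the new vertex would have a 2-horizon of seven vertices counting itself.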
Bibliography
[1] Cooperative Association for Internet Data Analysis (CAIDA).
http://www.caida.org.
[2] Folding@home. http://www.stanford.edu/group/pandegroup/Cosm.
[3] The Furi Homepage. http://www.jps.net/williamw/furi/.
[4] SETI@home. http://setiathome.ssl.berkeley.edu.
[5] Lada Adamic. The small world web. In ECDL’99, pages 443–452, Springer,
1999. Lecture Notes in Computer Science 1696.
[6] Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman.
Search in power-law networks.
http://www.parc.xerox.com/istl/groups/iea/papers/plsearch/, March 20, 2001.
[7] William Aiello, Fan R. K. Chung, and Linyuan Lu. A random graph model for
massive graphs. In ACM Symposium on Theory of Computing, pages 171–180,
Portland, Oregon, 2000.
[8] Algorithmic Solutions Software GmbH. The LEDA Homepage.
http://www.algorithmic-solutions.com/as html/products/products.html.
[9] Fred S. Annexstein, Kenneth A. Berman, and Mihajlo A. Jovanovic. Latency
effects on reachability in large-scale peer-to-peer networks. In ACM Symposium
on Parallel Algorithms and Architectures, July 2001.
[10] Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random
networks. Science, 286:509–512, October 15, 1999.
[11] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar
Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure
in the web. Computer Networks, 33(1-6):309–320, June 2000.
[12] Brown University. The Java Data Structures Library (JDSL).
http://www.cs.brown.edu/cgc/jdsl/.
[13] Gnutella: To the bandwidth barrier and beyond. Clip2.com, November 6, 2000.
http://dss.clip2.com/gnutella.html.
[14] Roger Dingledine, Michael J. Freedman, and David Molnar. The free haven
project: Distributed anonymous storage service. In Workshop on Design Issues
in Anonymity and Unobservability, July 2000.
[15] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law
relationships of the internet topology. In SIGCOMM, pages 251–262, 1999.
[16] Groove Networks, Inc. Introducing Groove. http://www.groove.net/products/.
[17] Jerrold W. Grossman and Patrick D. F. Ion. The Erdos Number Project.
http://www.oakland.edu/ grossman/erdoshp.html.
[18] Michael Himsolt. Gml: A portable graph file format. Technical Report 94030,
University of Passau, 1997.
[19] Jon Kleinberg. The small-world phenomenon: An algorithmic perspective.
Technical Report 99-1776, Cornell University Department of Computer Science,
October 1999.
[20] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and
Andrew Tomkins. The web as a graph: measurements, models, and methods. In
5th Annual International Conference on Computing and Combinatorics, volume
1627, pages 1–7, 1999. Lecture Notes in Computer Science.
[21] Albert Medina, Ibrahim Matta, and John Byers. On the origin of power laws in
internet topologies. ACM Computer Communications Review, 30(2), April 2000.
[22] Andrew Oram, editor. Harnessing the Power of Disruptive Technologies. O’Reilly
& Associates, 1st edition, March 2001.
[23] Christopher R. Palmer and J. Gregory Steffan. Generating network topologies
that obey power laws. http://citeseer.nj.nec.com/palmer00generating.html,
2000.
[24] T. Remes. Six degrees of Rogers Hornsby. New York Times, August 17, 1997.
[25] Clay Shirky. What is p2p... and what isn’t? The O’Reilly Network,
November 24, 2000.
http://www.openp2p.com/pub/a/p2p/2000/11/24/shirky1-whatisp2p.html.
[26] R. Siamwalla, R. Sharma, and S. Keshav. Discovering internet topology.
http://www.cs.cornell.edu/skeshav/papers.html, 1998.
[27] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world
networks. Nature, 393:440–442, June 1998.
[28] Ellen W. Zegura, Kenneth L. Calvert, and Michael J. Donahoo. A quantitative
comparison of graph-based models for Internet topology. IEEE/ACM Transactions
on Networking, 5(6):770–783, December 1997.
Appendix A
Visualizations of the Gnutella Network Topology
In this appendix we present visualizations of the Gnutella network topology data
obtained using our crawler between November 13 and December 28 of 2000. The
visualizations were done using Otter, a network visualization tool developed by
CAIDA [1], and LEDA’s graph drawing software [8].
Figure A.1: Gnutella network topology using Caida’s Otter
Figure A.2: Gnutella network topology using LEDA’s 2D spring layout
Figure A.3: Gnutella network topology using experimental layout
Figure A.4: Gnutella network backbone (dominating set using greedy algorithm)
using LEDA’s 3D spring layout
Figure A.5: Gnutella network backbone (nodes with degree > 10) using LEDA’s 3D
spring layout
Figure A.6: Gnutella network backbone (nodes with degree > 20) using LEDA’s 3D
spring layout
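The backbone in Figure A.4 is described as a dominating set found by a greedy algorithm. The exact procedure is not listed in this appendix; the sketch below shows one common greedy heuristic, offered as an assumption rather than as the code used to produce the figure. It repeatedly selects the vertex that dominates the largest number of not-yet-dominated vertices (itself and its neighbors). The class name and the toy input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyDominatingSet {
    /** Returns a dominating set of the simple undirected graph g, in order of selection. */
    static List<Integer> compute(List<List<Integer>> g) {
        int n = g.size();
        boolean[] dominated = new boolean[n];
        int remaining = n;                     // vertices not yet dominated
        List<Integer> chosen = new ArrayList<>();
        while (remaining > 0) {
            int best = -1, bestGain = 0;
            for (int v = 0; v < n; v++) {
                // gain = number of undominated vertices among v and its neighbors
                int gain = dominated[v] ? 0 : 1;
                for (int u : g.get(v)) if (!dominated[u]) gain++;
                if (gain > bestGain) {
                    bestGain = gain;
                    best = v;
                }
            }
            chosen.add(best);
            if (!dominated[best]) {
                dominated[best] = true;
                remaining--;
            }
            for (int u : g.get(best)) {
                if (!dominated[u]) {
                    dominated[u] = true;
                    remaining--;
                }
            }
        }
        return chosen;
    }

    /** A path 0-1-...-(n-1), used only as a toy input. */
    static List<List<Integer>> path(int n) {
        List<List<Integer>> g = new ArrayList<>();
        for (int i = 0; i < n; i++) g.add(new ArrayList<>());
        for (int i = 0; i + 1 < n; i++) {
            g.get(i).add(i + 1);
            g.get(i + 1).add(i);
        }
        return g;
    }

    public static void main(String[] args) {
        // On a 5-vertex path the heuristic selects vertices 1 and 3
        System.out.println(compute(path(5)));
    }
}
```

The result is a valid dominating set, but in general the greedy rule does not guarantee a minimum one.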
Appendix B
Java source code for gnutsim
The following is the Java source code for our Gnutella network simulator gnutsim,
which we used to study the problem of short-circuiting. Our code makes use of some
classes from the JDSL package developed at Brown University [12].
/*
* gnutsim - Gnutella message transmission simulator
* Copyright (C) November 2000 Mihajlo A. Jovanovic
*
*/
import jdsl.core.api.*;
import jdsl.core.ref.ArrayHeap;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.io.FileWriter;
import java.io.File;
import java.util.Vector;
import java.util.Hashtable;
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.Random;
import java.util.Date;

//Orders messages by accumulated cost (see Msg.compareTo). Only compare() carries
//ordering logic; the remaining predicates are stubs that always return true.
class MsgComparator implements Comparator
{
    public int compare(Object a, Object b) { return ((Msg)a).compareTo((Msg)b); }
    public boolean isLessThan(Object a, Object b) { return true; }
    public boolean isGreaterThan(Object a, Object b) { return true; }
    public boolean isEqualTo(Object a, Object b) { return true; }
    public boolean isLessThanOrEqualTo(Object a, Object b) { return true; }
    public boolean isGreaterThanOrEqualTo(Object a, Object b) { return true; }
    public boolean isComparable(Object b) { return true; }
}

//Orders hosts by the cost of their next queued message (see Host.compareTo).
//As above, only compare() carries ordering logic.
class HostComparator implements Comparator
{
    public int compare(Object a, Object b) { return ((Host)a).compareTo((Host)b); }
    public boolean isLessThan(Object a, Object b) { return true; }
    public boolean isGreaterThan(Object a, Object b) { return true; }
    public boolean isEqualTo(Object a, Object b) { return true; }
    public boolean isLessThanOrEqualTo(Object a, Object b) { return true; }
    public boolean isGreaterThanOrEqualTo(Object a, Object b) { return true; }
    public boolean isComparable(Object b) { return true; }
}

class Msg
{
    private int guid;       //message identifier, used for duplicate detection
    private int ttl = 7;    //remaining time-to-live
    private int cost = 0;   //accumulated link weight along the path taken so far

    Msg(int id) { guid = id; }
    Msg(Msg m)
    {
        //COPY CONSTRUCTOR
        guid = m.getGuid();
        ttl = m.getTtl();
        cost = m.getCost();
    }
    public void setTtl(int newTTL) { ttl = newTTL; }
    public int getGuid() { return guid; }
    public int getTtl() { return ttl; }
    public int getCost() { return cost; }
    //Decrements the TTL; returns false once it reaches zero
    public boolean decTTL()
    {
        ttl--;
        if (ttl == 0)
            return false;
        else
            return true;
    }
    public void incrCost(int w) { cost += w; }
    public int compareTo(Msg m) { return (new Integer(cost)).compareTo(new Integer(m.getCost())); }
    //Two messages are equal when their GUIDs match, regardless of TTL and cost
    public boolean equals(Object msg)
    {
        return (guid == ((Msg)msg).getGuid());
    }
    public String toString() { return "GUID: " + guid + " TTL: " + ttl + " Cost: " + cost; }
}

class Host
{
    Vector msgHistory = new Vector(10, 10); //messages already seen by this host
    Hashtable neighbors = null; //keys: neighbors (Host) Values: link weights (Integer)
    ArrayHeap sendQueue = new ArrayHeap(new MsgComparator()); //keys: outgoing Msg, elements: receiving Host
    String id;

    Host(String address) { id = address; }
    public String getID() { return id; }

    public void clearAndReset(Random r, Hashtable map)
    {
        msgHistory.clear();
        //Recalculate link weights
        for (Enumeration e = neighbors.keys() ; e.hasMoreElements() ;)
        {
            int w = r.nextInt(gnutsim.MAX_WEIGHT);
            neighbors.put(e.nextElement(), (Integer)map.get(new Integer(w)));
        }
    }

    //Called on the broadcast source: queue one copy of the message per neighbor
    public void setBroadcastMsg(Msg newMsg)
    {
        msgHistory.add(newMsg);
        for (Enumeration e = neighbors.keys() ; e.hasMoreElements() ;)
        {
            Host h = (Host)e.nextElement();
            Msg outMsg = new Msg(newMsg);
            outMsg.incrCost(((Integer)neighbors.get(h)).intValue());
            sendQueue.insert(outMsg, h);
        }
    }

    public boolean wasMsgSeen(Msg msg)
    {
        return msgHistory.contains(msg);
    }

    public void setNeighbors(Hashtable h) { neighbors = h; }

    public void addNeighbor(Host h, int w)
    {
        if (neighbors == null)
            neighbors = new Hashtable();
        neighbors.put(h, new Integer(w));
    }

    public void receiveMsg(Host sender, Msg inMsg)
    {
        //Drop duplicates; otherwise remember the message
        if (msgHistory.contains(inMsg))
        {
            return;
        }
        else
        {
            msgHistory.add(inMsg);
        }
        if (inMsg.decTTL())
        {
            /*for all neighbors except sender
            1. create a new Msg object(m), incr cost
            2. add to the send queue(msg, neighbor)*/
            for (Enumeration e = neighbors.keys() ; e.hasMoreElements() ;)
            {
                Host h = (Host)e.nextElement();
                if (h.equals(sender))
                    continue;
                Msg outMsg = new Msg(inMsg);
                outMsg.incrCost(((Integer)neighbors.get(h)).intValue());
                sendQueue.insert(outMsg, h);
            }
        }
    }

    //Delivers the lowest-cost queued message and returns the host that received it
    public Host sendNextMsg()
    {
        Msg outMsg = (Msg)sendQueue.min().key();
        Host receiver = (Host) sendQueue.removeMin();
        receiver.receiveMsg(this, outMsg);
        return receiver;
    }

    //Cost of the next message to be sent, or -1 if the send queue is empty
    public int getNextMsgCost()
    {
        if (sendQueue.isEmpty())
            return -1;
        else
            return ((Msg)sendQueue.min().key()).getCost();
    }

    public boolean equals(Object host)
    {
        if (id.equals(((Host)host).getID()))
            return true;
        return false;
    }

    public int compareTo(Host m) { return (new Integer(getNextMsgCost())).compareTo(new Integer(m.getNextMsgCost())); }
    public String toString() { return id; }
}

public class gnutsim
{
    static final int NUM_OF_TRIALS = 100;
    static final int MAX_WEIGHT = 9;

    //Linear scan: true if el equals one of the keys stored in the heap
    static boolean isArrayHeapElement(ArrayHeap a, Object el)
    {
        for (ObjectIterator i = a.keys(); i.hasNext() ;)
        {
            Object o = i.nextObject();
            if (el.equals(o))
                return true;
        }
        return false;
    }

    public static void main(String args[])
    {
        ArrayHeap pq = new ArrayHeap(new HostComparator());
        //CREATE WEIGHTED TOPOLOGY
        String line = "";
        StringTokenizer t;
        String token = null;
        Hashtable nodes = null; //keys: node ID (Integer) values: hosts (Host)
        Random r = new Random((new Date()).getTime());
        //Maps a random index in [0, MAX_WEIGHT) to a link weight
        Hashtable map = new Hashtable();
        map.put(new Integer(0), new Integer(1));
        map.put(new Integer(1), new Integer(6));
        map.put(new Integer(2), new Integer(31));
        map.put(new Integer(3), new Integer(127));
        map.put(new Integer(4), new Integer(500));
        map.put(new Integer(5), new Integer(2001));
        map.put(new Integer(6), new Integer(8005));
        map.put(new Integer(7), new Integer(16400));
        map.put(new Integer(8), new Integer(33000));
        int min = -1, max = -1, accum = 0, ttl = -1;
        try
        {
            for (int trial = 0; trial < NUM_OF_TRIALS; trial++)
            {
                if (trial == 0)
                {
                    //First trial: read the topology file and assign random link weights
                    ttl = Integer.parseInt(args[1]);
                    File f = new File(args[0]);
                    if (!f.exists() || !f.canRead())
                        throw new Exception("Cannot read file " + f);
                    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
                    while ((line = in.readLine()) != null)
                    {
                        t = new StringTokenizer(line, " ");
                        token = t.nextToken();
                        if (token.equals(new String("t")))
                            nodes = new Hashtable(2*Integer.parseInt(t.nextToken()));
                        else if (token.equalsIgnoreCase(new String("?")))
                        {
                            int i = Integer.parseInt(t.nextToken());
                            Host h = new Host(t.nextToken());
                            nodes.put(new Integer(i), h);
                        }
                        else if (token.equalsIgnoreCase(new String("L")))
                        {
                            t.nextToken();
                            int nodeID = Integer.parseInt(t.nextToken());
                            Host h1 = (Host)nodes.get(new Integer(nodeID));
                            nodeID = Integer.parseInt(t.nextToken());
                            Host h2 = (Host)nodes.get(new Integer(nodeID));
                            if (h1 == null || h2 == null)
                                throw new Exception("Invalid .odf file format!");
                            /*UNIFORM WEIGHTS
                            h1.addNeighbor(h2, 1);
                            h2.addNeighbor(h1, 1);
                            */
                            int w = r.nextInt(MAX_WEIGHT);
                            h1.addNeighbor(h2, ((Integer)map.get(new Integer(w))).intValue());
                            h2.addNeighbor(h1, ((Integer)map.get(new Integer(w))).intValue());
                        }
                    }
                }
                else
                {
                    //clear all host objects
                    for (Enumeration e = nodes.elements() ; e.hasMoreElements() ;)
                        ((Host)e.nextElement()).clearAndReset(r, map);
                }
                //ADD BROADCAST SERVER ONTO PQ
                Msg m = new Msg(1);
                m.setTtl(ttl);
                Host h = (Host)nodes.get(new Integer(0));
                h.setBroadcastMsg(m);
                pq.insert(h, new Boolean(true));
                //Repeatedly let the host with the cheapest pending message send it
                while(!pq.isEmpty())
                {
                    Locator l = pq.min();
                    Host nextHost = (Host)l.key();
                    Host newHost = nextHost.sendNextMsg();
                    pq.remove(l);
                    if (nextHost.getNextMsgCost() != -1)
                        pq.insert(nextHost, new Boolean(true));
                    //if new host is not already in the pq and its cost is not -1 - add to pq
                    if (!isArrayHeapElement(pq, newHost) && newHost.getNextMsgCost() != -1)
                        pq.insert(newHost, new Boolean(true));
                }
                //Count the hosts that saw the message in this trial
                int horSize = 0;
                for (Enumeration e = nodes.elements() ; e.hasMoreElements() ;)
                    if (((Host)e.nextElement()).wasMsgSeen(m))
                        horSize++;
                System.out.println("Total horizon size: " + horSize);
                if (min == -1 || horSize < min)
                    min = horSize;
                if (max == -1 || horSize > max)
                    max = horSize;
                accum+=horSize;
            }
            System.out.println("Average horizon size: " + accum*1.0/NUM_OF_TRIALS);
            System.out.println("Min horizon size: " + min);
            System.out.println("Max horizon size: " + max);
        }
        catch (ArrayIndexOutOfBoundsException e)
        {
            System.out.println("Usage: java gnutsim [graph_file.odf] [TTL]");
        }
        catch (Exception e)
        {
            System.out.println(e);
        }
    }
}
Appendix C
Network Simulation Results
In this appendix we present the statistics obtained from our network simulation
studies. The tables report the reduction in reachability caused by short-circuiting when
latencies are chosen at random on a fixed topology. Each table is associated with one
fixed topology, and each row represents the results of 100 trials using random latencies.
For a fixed t (the TTL column), each row reports the worst, average, and best observed
t-horizon, together with the t-neighborhood (Nbhd), which equals the t-horizon under
uniform latencies. We then give two reduction ratios: the worst t-horizon divided by the
t-neighborhood (WRR) and the average t-horizon divided by the t-neighborhood (MRR).
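As a worked example of how the two ratio columns follow from the others, the short sketch below recomputes them for the TTL = 2 row of Table C.1. The class and method names are illustrative, and rounding to the nearest whole percent is an assumption that is consistent with the tabulated values.

```java
class ReductionRatios {
    /** 100 * observed / neighborhood, rounded to the nearest whole percent. */
    static long percent(int observed, int neighborhood) {
        return Math.round(100.0 * observed / neighborhood);
    }

    public static void main(String[] args) {
        // Row TTL = 2 of Table C.1: worst = 9, average = 14, t-neighborhood = 16
        System.out.println("WRR = " + percent(9, 16) + "%");   // 56.25 rounds to 56
        System.out.println("MRR = " + percent(14, 16) + "%");  // 87.5 rounds to 88
    }
}
```

Because the tabulated averages are themselves rounded, a ratio recomputed this way may occasionally differ by a point from one derived from unrounded data.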
TTL Worst Avg Best Nbhd WRR MRR
1 7 7 7 7 100% 100%
2 9 14 16 16 56% 88%
3 12 28 41 42 29% 67%
4 15 52 83 96 16% 54%
5 28 105 188 252 11% 42%
6 55 181 337 494 11% 37%
7 105 333 525 830 13% 40%
8 185 496 719 1055 18% 47%
9 371 659 877 1121 33% 59%
10 468 804 983 1129 41% 71%
Table C.1: Short-circuiting effects on the Watts-Strogatz topology (nodes = 1129, k
= 3, p = 0.2)
TTL Worst Avg Best Nbhd WRR MRR
1 2 2 2 2 100% 100%
2 4 4 4 4 100% 100%
3 10 10 10 10 100% 100%
4 65 92 113 113 58% 81%
5 214 492 689 844 25% 58%
6 246 589 843 1107 22% 53%
7 419 806 1040 1124 37% 72%
8 566 915 1071 1125 50% 81%
Table C.2: Short-circuiting effects on the Gnutella topology (nodes = 1125, edges =
4080)
TTL Worst Avg Best Nbhd WRR MRR
1 6 6 6 6 100% 100%
2 54 54 54 54 100% 100%
3 405 410 419 419 97% 98%
4 1473 2216 2606 2851 52% 78%
5 4686 5986 6875 9021 52% 66%
6 6557 8143 8809 9998 66% 81%
7 8113 9060 9443 10000 81% 91%
Table C.3: Short-circuiting effects on a random topology (nodes = 10000, edges =
40000)
TTL Worst Avg Best Nbhd WRR MRR
1 11 11 11 11 100% 100%
2 56 56 56 56 100% 100%
3 92 150 176 176 52% 85%
4 263 319 372 386 68% 83%
5 307 523 606 638 48% 82%
6 478 720 821 848 56% 85%
7 533 852 933 968 55% 88%
8 699 948 1002 1013 69% 94%
9 883 991 1020 1023 86% 97%
10 916 1011 1024 1024 89% 99%
Table C.4: Short-circuiting effects on a hypercube topology (N = 2^10)
TTL Worst Avg Best Nbhd WRR MRR
1 14 14 14 14 100% 100%
2 92 92 92 92 100% 100%
3 258 315 368 378 68% 83%
4 685 858 1008 1093 63% 78%
5 1120 1750 2139 2380 47% 74%
6 2243 3079 3544 4096 55% 75%
7 2796 4422 5298 5812 48% 76%
8 3970 5813 6644 7099 56% 82%
9 6023 6844 7424 7814 77% 88%
10 6259 7558 7950 8100 77% 93%
11 6930 7907 8147 8178 85% 97%
12 7877 8108 8187 8191 96% 99%
13 8050 8174 8192 8192 98% 100%
Table C.5: Short-circuiting effects on a hypercube topology (N = 2^13)