A Measurement Study of Peer-to-Peer File Sharing Systems
-
Upload
samantha-koch -
Category
Documents
-
view
38 -
download
0
description
Transcript of A Measurement Study of Peer-to-Peer File Sharing Systems
A Measurement Study of Peer-to-Peer File Sharing Systems
Presented by
Cristina Abad
Motivation
In a P2P file sharing system, peers are usually in the “edge” of the network
Does this affect/limit the quality of the infrastructure?
What are the characteristics of hosts that choose to participate?
Solution: Measure Gnutella and Napster traffic to help understand these issues
Napster
Gnutella
Methodology Crawler periodically takes “snapshot” of
Napster/Gnutella– capture basic info (peers, files shared, …)
For peers discovered– measure bottleneck bandwidth– measure latency– track content and degree of sharing
Measure lifetime– track availability of peers (at P2P and IP level)
Crawling Napster Peers can only be discovered by querying
index Crawler issues queries with names of
popular song artists Query responses contain
– IP, reported bandwidth, files shared (number, names and sizes)
Results:– Captured 40-60% of Napster hosts
(contributing to 80-95% of total files)– Could not capture peers that do not share files
Crawling Gnutella Crawler uses ping/pong to discover peers Each crawl captured aprox. 10000 peers
Measuring bandwidth Reported bandwidth may not be accurate
(ignorance or lies) Use bottleneck bandwidth as approximation
to available bandwidth– capacity of slowest host along path between
two hosts
Used SProbe to actively measure both upstream and downstream bottleneck bandwidth– Similar to “packet pair” technique
Packet Pair Technique Two packets queued next to each other at
bottleneck link exit the link t seconds apart:
Then,
Kevin Lai and Mary Baker. “Measuring bandwidth”. In Proceedings of IEEE INFOCOM '99. 1999.
bnlb
st 2 s2: size of second packet
bbnl: bottleneck bandwidth
t
sbbnl
2
How many peers are server-like? High-bandwidth, low latency, high
availability 8% have upstreambb 10Mbps
Availability – Host uptimes
Availability – Session duration
Free-riders
Is Gnutella robust?
Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload
Presented by
Cristina Abad
Three-tiered approach
1. Analyze 200-day trace of Kazaa traffic Considered only traffic going from U.
Washington to the outside
2. Develop a model of multimedia workloads
Analyze and confirm hypothesis
3. Explore potential impact of locality -awareness in Kazaa
Contributions Obtained some useful characterizations of
Kazaa’s traffic Showed that Kazaa’s workload is not Zipf
– Showed that other workloads (multimedia) may not be Zipf either
Presented a model of P2P file-sharing workloads based on their trace results– Validated the model through simulations that
yielded results very similar to those from traces Proved the usefulness of exploiting locality-
aware request routing
Measurement results
Users are patient Users slow down as they age Kazaa is not one workload Kazaa clients fetch objects at-most-once Popularity of objects is often short-lived Kazaa is not Zipf
User characteristics (1)
Users are patient
User characteristics (2) Users slow down as they age
– clients “die”– older clients ask for less each time they use
system
User characteristics (3) Client activity
– Tracing used could only detect users when their clients transfer data
– Thus, they only report statistics on client activity, which is a lower bound on availability
– Avg session lengths are typically small (median: 2.4 mins)
• Many transactions fail
• Periods of inactivity may occur during a request if client cannot find an available server with the object
Object characteristics (1) Kazaa is not
one workload
Object characteristics (2)
Kazaa object dynamics– Kazaa clients fetch objects at most once– Popularity of objects is often short-lived– Most popular objects tend to be recently born
objects– Most requests are for old objects
Object characteristics (3) Kazaa is not Zipf Web access patterns are Zipf: small number of
objects are extremely popular, but there is a long tail of unpopular requests.
Zipf’s law: popularity of ith-most popular object is proportional to i-α, (α: Zipf coefficient)
(Zipf) looks linear on log-log
scale
Model of P2P file-sharing workloads On average, a client requests 2 objects/day P(x): probability that a user requests an
object of popularity rank x Zipf(1)– Adjusted so that objects are requested at most
once A(x): probability that a newly arrived object
is inserted at popularity rank x Zipf(1) All objects are assumed to have same size Use caching to observe performance
changes (effectiveness hit rate)
Model – Simulation results File-sharing effectiveness diminishes with
client age– System evolves towards one with no locality
and objects chosen at random from large space New object arrivals improve performance
– Arrivals replenish supply of popular objects New clients cannot stabilize performance
– Can’t compensate for increasing number of old clients
– Overall bandwidth increases in proportion to population size
Model validation By tweaking the arrival rate of of new
objects, were able to match trace results (with 5475 new arrivals per year)
Exploring locality-awareness
Currently organizations shape or filter P2P traffic
Alternative strategy: exploit locality in file-sharing workload– Caching; or,– Use content available within organization to
substantially decrease external bandwidth usage– Result: 86% of externally downloaded bytes
could be avoided by using an organizational proxy
Questions?
Analysis
How can results obtained be used when evaluating P2P schemes?
Are any of the measurements obtained biased?
Peers are heterogeneous– Incentives– Enforcement (e.g. super-peers in Kazaa)
SProbe
Works in uncooperative environments Works on asymmetric network paths Exploit properties of TCP protocol
– Send SYN packet with large payload; then, measure time dispersion of received RST packet
Zipf Linguist George Kingsley Zipf observed that for
many frequency distributions, the n-th largest frequency is proportional to a negative power of the rank order n
"Zipf's law" is also sometimes used to refer to the corresponding probability distribution
Is an instance of a power law Zipf's law is often demonstrated by plotting the
data, with the axes being log(rank order) and log(frequency). If the points are close to a single straight line, the distribution follows Zipf's law.