Adaptive Content Management in Structured P2P Communities
Jussi Kangasharju
Keith W. Ross
David A. Turner
Contents
– Introduction
– Related Work
– Adaptive Algorithms
– Experimental Results
– Optimization Theory
– Conclusion
Introduction (1)
P2P file sharing is the dominant traffic type on the Internet
Two types of P2P systems
– Unstructured, e.g. KaZaA and Gnutella
  Nodes are not organized into highly structured overlays
  Content is randomly assigned to nodes
– Structured, e.g. CAN, Chord
  Distributed hash table (DHT) substrates are used
  Nodes are organized into highly structured overlays
  Keys are deterministically assigned to nodes
Introduction (2)
Assume the system is a DHT-based P2P file-sharing community
– P2P community: a collection of intermittently-connected nodes
– Nodes contribute storage, content and bandwidth to the rest of the community
– When a node in the community wants a file
  Retrieve the file from the other nodes in the community
  If the file is not found, the community retrieves the file from outside
  The file is cached and a copy is forwarded to the requesting node
Introduction (3)
Address the problem of content management in P2P file sharing communities
Propose algorithms to adaptively manage content
– Minimize the average delay: the time from when a node makes a query for a file until the node receives the file in its entirety
– File transfer delays >> lookup delays
– Intra-community file transfers occur at relatively fast rates as compared with file transfers into the community
Introduction (4)
The problem is equivalent to “adaptively managing content to maximize the intra-community hit rate”
– Replication: how should content be replicated to provide satisfactory hit rates
– Replacement: how does a node decide which files to keep or evict
Contributions
– Algorithms for dynamically replicating and replacing files in a P2P community
  No a priori assumptions about file request rates or nodal up probabilities
  Simple, adaptive and fully distributed
– An analytical optimization theory to benchmark the adaptive replication algorithms
  For complete-file replication
  For the case when files are segmented and erasure codes are used
Related Work
Squirrel [8]
– Distributed, server-less, P2P web caching system
– Built on top of the Pastry DHT substrate
– Focuses on protocol design and implementation
– Does not address the issues of replication and file replacement
References [13] and [14] study optimal replication in unstructured peer-to-peer networks
– Goal: reduce random search times
DHT Substrate
Each node has access to the API of a DHT substrate
Given a file j and a value of K, the substrate determines an ordered list (i1, i2, …, iK) of the up nodes
i1 is the first-place winner for file j
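The paper does not fix a specific substrate, but a deterministic winner ordering can be sketched with rendezvous-style hashing: every node ranks the up nodes by a hash of (file, node), so all nodes compute the same list (i1, …, iK). The function name and scheme here are illustrative, not the paper's API.

```python
import hashlib

def winners(file_id, up_nodes, K):
    """Illustrative winner ordering (not the paper's specific substrate):
    rank the currently-up nodes by a hash of (file, node) so that every
    node computes the same ordered list (i1, ..., iK) for file j."""
    ranked = sorted(
        up_nodes,
        key=lambda n: hashlib.sha256(f"{file_id}:{n}".encode()).hexdigest(),
    )
    return ranked[:K]  # ranked[0] is the first-place winner i1
```

Because the ordering depends only on hashes, any node can recompute the next winner when a higher-ranked node goes down.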
LRU Algorithms (1)
Fundamental Problem:
– “How can we adaptively add and remove replicas, in a distributed manner and as a function of evolving demand, to maximize the hit probability?”
Suppose X is a node that wants file j
Basic LRU Algorithm
– X uses the substrate to determine i1, the first-place winner for j
– If i1 doesn’t have j, i1 retrieves j from outside the community and puts a copy of the file in its storage
– If i1 needs to make room for j, the LRU replacement policy is used
– i1 sends j to X
– X does not put j in its storage
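The first-place winner's behavior under the basic LRU algorithm can be sketched as follows; `fetch_from_outside` and the class name are illustrative stand-ins, and files are assumed to be equal-sized so capacity is counted in slots.

```python
from collections import OrderedDict

class BasicLRUNode:
    """Sketch of a first-place winner under the basic LRU algorithm."""

    def __init__(self, capacity, fetch_from_outside):
        self.capacity = capacity            # storage slots (equal-size files)
        self.cache = OrderedDict()          # file id -> data, in LRU order
        self.fetch_from_outside = fetch_from_outside

    def handle_request(self, j):
        if j in self.cache:
            self.cache.move_to_end(j)       # mark j most recently used
            return self.cache[j]            # intra-community hit
        data = self.fetch_from_outside(j)   # miss: retrieve from outside
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        self.cache[j] = data                # keep a copy in storage
        return data                         # this copy is sent to requester X
```

Note that only the winner caches the file; the requesting node X stores nothing, matching the last bullet above.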
LRU Algorithms (2)
Basic LRU Algorithm
– A request can be a “miss” even when the file is cached in some up node within the community
Top-K LRU Algorithm
– When i1 doesn’t have j, i1 determines i2, …, iK and pings each of these K−1 nodes to see if any of them has j
– If so, i1 retrieves j from one of those nodes and puts a copy in its storage
– Otherwise, i1 retrieves j from outside the community
The algorithm replicates content
– Without any a priori knowledge of request patterns or nodal up probabilities
– In a fully distributed manner
Observations
The Top-K LRU algorithm is simple, but its performance is significantly below the theoretical optimum
Observed that
– LRU lets unpopular files linger in nodes. Intuitively, if we do not store the less popular files, the popular files will have more replicas
– Searching more than one node is needed to find files in the file-sharing system
MFR Algorithm (1)
Most Frequently Requested (MFR) Algorithm
– Has near-optimal performance
Each node i maintains an estimate aj(i) of the local request rate for file j
– aj(i) is the number of requests that node i has seen for file j divided by the amount of time node i has been up
Each node i stores the files with the highest aj(i) values, packing in as many files as possible
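The storage rule at a single node can be sketched directly from these definitions; the function name and dictionary layout are illustrative, and the greedy "pack in rate order" step is a simplification for unequal file sizes.

```python
def mfr_contents(request_counts, up_time, file_sizes, capacity):
    """Which files node i keeps under MFR (sketch).

    request_counts: {file id: requests node i has seen for that file}
    up_time:        total time node i has been up
    Estimates a_j(i) = count / up_time and packs files into `capacity`
    bytes greedily in decreasing rate order.
    """
    rates = {j: c / up_time for j, c in request_counts.items()}
    stored, used = set(), 0
    for j in sorted(rates, key=rates.get, reverse=True):
        if used + file_sizes[j] <= capacity:
            stored.add(j)
            used += file_sizes[j]
    return stored
```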
MFR Algorithm (2)
MFR retrieval and replacement policy
– When node i receives a request for file j, it updates aj(i)
– If i doesn’t have j and MFR says it should, i retrieves j from outside and puts j in its storage
– If i needs to make room for j, the MFR replacement policy is used
Searching more than one node is needed
– “Ping” dynamics influence aj(i) so that the number of replicas across all nodes becomes nearly optimal
MFR Algorithm (3)
“Ping” the top-K winners in parallel
– Retrieve the file from any node that has the file
– Each “ping” could be considered a request
– Nodes update their request rates and manage their storage with MFR
– However, this approach doesn’t give better performance
Instead, sequentially request j from the top-K winners
– Stop the sequential requests once j is found
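The sequential variant can be sketched as below; the per-node `caches` and `counts` dictionaries are illustrative stand-ins for distributed state. Every node contacted counts the request toward its aj(i), which is how replicas of a file decay out of lower-ranked winners once higher-ranked winners reliably hold it.

```python
def sequential_request(j, winners, caches, counts):
    """Sequential Top-K MFR retrieval (sketch).

    Request file j from the top-K winners in order, stopping once j is
    found. Each contacted node i records the request, updating the data
    behind its rate estimate a_j(i).
    """
    for i in winners:
        counts[i][j] = counts[i].get(j, 0) + 1  # node i sees the request
        if j in caches[i]:
            return i                            # found: stop probing
    return None                                 # miss: fetch from outside
```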
Experiment Results (1)
Run simulation experiments
– 100 nodes and 10,000 files
– Request probabilities follow a Zipf distribution with parameters 0.8 and 1.2
– All file sizes are the same
– Each node contributes the same amount of storage
Measure the hit performance of the algorithms
Experiment Results (2)
LRU performs better than the non-cooperative algorithm but significantly worse than the theoretical optimum
Experiment Results (3)
Experiment Results (4)
Using a K greater than 1 improves the hit probability
Increasing K beyond 5 gives insignificant improvement
Experiment Results (5)
The number of replicas changes over time; the graphs report the average values
The optimal scheme replicates the more popular files much more aggressively
The optimal scheme does not store the less popular files
Experiment Results (6)
Experiment Results (7)
The MFR replica profile is very close to optimal; thus, the hit rates are also very close to optimal
Analysis of MFR (1)
Analytical procedure for calculating the steady-state replica profile and hit probability of Top-K MFR for the case K = I
The results still serve as excellent approximations when K is small
Assume
– I is the number of nodes
– J is the number of distinct files
– pi is the “up” probability of node i
– Si is the amount of shared storage (in bytes) at node i
– bj is the size (in bytes) of file j
– qj is the request probability for file j
The request probabilities for the J files are known
Analysis of MFR (2)
The procedure sequentially places copies of files into the nodes
– Ti is the remaining unallocated storage at node i
– xij equals 1 if a copy of file j has been placed at node i
Initialize Ti = Si, xij = 0 and vj = qj/bj
1. Find the file j that has the largest value of vj
2. Sequentially examine the winning nodes for j until a node i is found such that Ti >= bj and xij = 0
   Set xij = 1; set vj = vj(1 − pi); set Ti = Ti − bj
3. If there is no node such that Ti >= bj and xij = 0, remove file j from further consideration
4. Return to Step 1 if not all files have been removed from consideration
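The placement procedure translates almost line-for-line into code. In this sketch, `winners(j)` is a stand-in for the DHT substrate's winner ordering for file j; the function name is illustrative.

```python
def mfr_replica_profile(p, S, b, q, winners):
    """Steady-state replica profile for Top-K MFR with K = I (sketch).

    p[i]: up probability of node i; S[i]: storage of node i
    b[j]: size of file j;           q[j]: request probability of file j
    winners(j): ordered list of winning nodes for file j
    Returns x with x[i][j] = 1 if a copy of file j is placed at node i.
    """
    I, J = len(p), len(q)
    T = list(S)                              # remaining unallocated storage
    x = [[0] * J for _ in range(I)]
    v = [q[j] / b[j] for j in range(J)]      # placement priority per file
    active = set(range(J))
    while active:
        j = max(active, key=lambda f: v[f])  # Step 1: largest v_j
        for i in winners(j):                 # Step 2: scan winning nodes
            if T[i] >= b[j] and x[i][j] == 0:
                x[i][j] = 1                  # place a copy at node i
                v[j] *= 1 - p[i]             # value of one more copy drops
                T[i] -= b[j]
                break
        else:                                # Step 3: no node can take a copy
            active.discard(j)
    return x
```

The loop terminates because every iteration either consumes storage or retires a file.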
Optimization Theory (1)
Analytical theory for optimal replication in P2P communities
– Complete-file replication (no fragmentation)
– Files are segmented and erasure coded
No Fragmentation

maximize
    P(\mathrm{hit}) = \sum_{j=1}^{J} q_j \Big[ 1 - \prod_{i=1}^{I} (1 - p_i x_{ij}) \Big]

subject to
    \sum_{j=1}^{J} b_j x_{ij} \le S_i, \quad i = 1, 2, \ldots, I
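The objective P(hit) = Σ_j q_j [1 − Π_i (1 − p_i x_ij)] can be evaluated directly for any 0-1 placement x; a minimal sketch:

```python
def hit_probability(p, q, x):
    """P(hit) for placement x[i][j] in {0, 1}: a request for file j hits
    if at least one node holding j is up, weighted by q_j."""
    I, J = len(p), len(q)
    total = 0.0
    for j in range(J):
        miss = 1.0
        for i in range(I):
            miss *= 1.0 - p[i] * x[i][j]  # node i misses if down or lacks j
        total += q[j] * (1.0 - miss)
    return total
```

For example, one file replicated on two nodes that are each up with probability 0.5 hits with probability 1 − 0.5² = 0.75.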
Optimization Theory (2)
The problem is NP-complete
Consider a special case
– pi = p for all nodes
– nj = number of replicas for file j

maximize
    P(\mathrm{hit}) = \sum_{j=1}^{J} q_j \big[ 1 - (1-p)^{n_j} \big]

subject to
    \sum_{j=1}^{J} b_j n_j \le S, \quad \text{where } S = S_1 + S_2 + \cdots + S_I

The problem can be efficiently solved by dynamic programming
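A dynamic program for this special case can be sketched as below. This is a sketch under the assumption of integer file sizes and storage, not necessarily the paper's exact formulation: it fills a table of best objective values by capacity, choosing a replica count per file.

```python
def optimal_replicas(p, q, b, S):
    """Maximize sum_j q_j (1 - (1-p)**n_j) s.t. sum_j b_j n_j <= S,
    with integer sizes b[j] and integer total storage S (sketch)."""
    J = len(q)
    best = [0.0] * (S + 1)                  # best[s]: optimum with capacity s
    choice = []                             # choice[j][s]: replicas of file j
    for j in range(J):
        new, pick = [0.0] * (S + 1), [0] * (S + 1)
        for s in range(S + 1):
            n = 0
            while n * b[j] <= s:            # try n replicas of file j
                val = best[s - n * b[j]] + q[j] * (1 - (1 - p) ** n)
                if val > new[s]:
                    new[s], pick[s] = val, n
                n += 1
        best = new
        choice.append(pick)
    s, n_opt = S, [0] * J                   # backtrack the replica counts
    for j in reversed(range(J)):
        n_opt[j] = choice[j][s]
        s -= n_opt[j] * b[j]
    return best[S], n_opt
```

The diminishing returns of extra replicas, (1−p)^n, are what make spreading storage across files worthwhile.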
Optimization Theory (3)
Upper bound on the performance of adaptive management algorithms for the case of erasures
– File j is made up of Rj erasures
– Any Mj of the Rj erasures are needed to reconstruct the file
– The size of each erasure is bj/Mj
– Assume homogeneous “up” probabilities, pi = p
– Denote the rth erasure of file j as erasure jr, r = 1, …, Rj
– njr is the number of copies of erasure jr stored in the community of nodes
Optimization Theory (4)
Let Y_{jr} be a 0-1 random variable which is 1 if any of the n_{jr} copies of erasure jr is in some up node:

    P(Y_{jr} = 1) = 1 - (1-p)^{n_{jr}}

A request for file j is a hit if any Mj of the Rj erasures for file j are available:

    P(\mathrm{hit}_j) = \sum_{m=M_j}^{R_j} P_j(m)

    P_j(m) = \sum_{A \in \mathcal{A}(m)} \prod_{r \in A} \big[ 1 - (1-p)^{n_{jr}} \big] \prod_{r \in A^c} (1-p)^{n_{jr}}

    \mathcal{A}(m) = \{ A : A \subseteq \{1, \ldots, R_j\}, |A| = m \}
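This hit probability can be computed exactly by enumerating the erasure subsets, as a brute-force check (exponential in Rj, so only for small codes):

```python
from itertools import combinations

def erasure_hit_probability(p, n, M):
    """Exact P(hit_j) for an erasure-coded file (sketch).

    p: common node up probability; n[r]: copies of erasure r;
    M: erasures needed to reconstruct. Erasure r is available with
    probability 1 - (1-p)**n[r]; a hit needs at least M of the R
    erasures available.
    """
    R = len(n)
    avail = [1 - (1 - p) ** n_r for n_r in n]
    total = 0.0
    for m in range(M, R + 1):
        for A in combinations(range(R), m):   # exactly these m available
            prob = 1.0
            for r in range(R):
                prob *= avail[r] if r in A else (1 - avail[r])
            total += prob
    return total
```

With R = M = 1 this reduces to the no-erasure expression 1 − (1−p)^n.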
Optimization Theory (5)
Equivalently, summing directly over the erasure subsets of size at least Mj:

    P(\mathrm{hit}_j) = \sum_{A \in \mathcal{T}_j} \prod_{r \in A} \big[ 1 - (1-p)^{n_{jr}} \big] \prod_{r \in A^c} (1-p)^{n_{jr}}

    \mathcal{T}_j = \{ A : A \subseteq \{1, \ldots, R_j\}, |A| \ge M_j \}

By Theorem 2.2 of Boland et al. [24], the function P(\mathrm{hit}_j) = h(n_{j1}, n_{j2}, \ldots, n_{jR_j}) is Schur concave, so the hit probability is maximized by spreading the copies evenly over the Rj erasures. With x_j = \frac{1}{R_j} \sum_{r=1}^{R_j} n_{jr}:

    P(\mathrm{hit}_j) = \sum_{m=M_j}^{R_j} \binom{R_j}{m} \big[ 1 - (1-p)^{x_j} \big]^m \big[ (1-p)^{x_j} \big]^{R_j - m}

The overall problem becomes

maximize
    P(\mathrm{hit}) = \sum_{j=1}^{J} q_j \, P(\mathrm{hit}_j)

subject to
    \sum_{j=1}^{J} \frac{b_j}{M_j} \sum_{r=1}^{R_j} n_{jr} \le S
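When every erasure of file j gets the same number of copies x, the hit probability collapses to a binomial sum in the per-erasure availability 1 − (1−p)^x; a minimal sketch:

```python
from math import comb

def equal_spread_hit(p, R, M, x):
    """P(hit_j) when all R erasures have x copies each (sketch):
    binomial sum over the number of available erasures, requiring at
    least M of the R to reconstruct the file."""
    a = 1 - (1 - p) ** x                     # availability of one erasure
    return sum(comb(R, m) * a**m * (1 - a) ** (R - m) for m in range(M, R + 1))
```

With R = M = 1 this again reduces to 1 − (1−p)^x, matching the no-erasure case.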
Optimization Theory (6)
Special case: no erasures, Rj = Mj = 1
Order the files by decreasing qj/bj and let L be the number of files that receive replicas. The continuous relaxation gives

    n_j^* = \frac{S}{B_L} + \frac{1}{\ln\!\big(\frac{1}{1-p}\big)} \Big[ \ln\frac{q_j}{b_j} - \frac{1}{B_L} \sum_{l=1}^{L} b_l \ln\frac{q_l}{b_l} \Big], \quad j = 1, \ldots, L

where

    B_L = \sum_{j=1}^{L} b_j

qj/bj plays a key role in influencing the number of replicas
This is an upper bound on the true optimum because it optimizes over continuous rather than integer variables
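The closed form above can be evaluated numerically. This sketch also searches for the largest L for which all replica counts come out positive, which is one standard way to pick L and is an assumption here, not a step spelled out in the slides:

```python
from math import log

def continuous_optimal_replicas(p, q, b, S):
    """Continuous-relaxation optimum for the no-erasure case (sketch):
    n_j* = S/B_L + [ln(q_j/b_j) - (1/B_L) sum_l b_l ln(q_l/b_l)] / ln(1/(1-p))
    over the L files with the largest q_j/b_j for which n_j* > 0."""
    J = len(q)
    order = sorted(range(J), key=lambda j: q[j] / b[j], reverse=True)
    for L in range(J, 0, -1):                # try the largest feasible L first
        idx = order[:L]
        B_L = sum(b[j] for j in idx)
        avg = sum(b[l] * log(q[l] / b[l]) for l in idx) / B_L
        n = {j: S / B_L + (log(q[j] / b[j]) - avg) / log(1 / (1 - p))
             for j in idx}
        if all(v > 0 for v in n.values()):
            return {j: n.get(j, 0.0) for j in range(J)}
    return {j: 0.0 for j in range(J)}
```

By construction the returned counts satisfy the storage constraint with equality, Σ_j b_j n_j* = S.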
Conclusion
Claimed that structured/DHT designs can potentially improve search and download performance
Proposed the Top-K MFR algorithm, which is simple, fully distributed, adaptive and near-optimal
Introduced an optimization methodology for benchmarking the performance of adaptive algorithms
The methodology can also be applied to designs that use erasures