1
Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
Gabi Kliot, Computer Science Department, Technion
Topics in Reliable Distributed Computing, 21/11/2004
Partially borrowed from Peter Druschel’s presentation
2
Outline
Introduction
Pastry overview
PAST overview
Storage management
Caching
Experimental results
Conclusion
3
Sources
“Storage management and caching in PAST, a large-scale persistent peer-to-peer storage utility”, Antony Rowstron (Microsoft Research), Peter Druschel (Rice University)
“Pastry: scalable, decentralized object location and routing for large-scale peer-to-peer systems”, Antony Rowstron (Microsoft Research), Peter Druschel (Rice University)
4
PASTRY
5
Pastry: Generic p2p location and routing substrate (DHT)
Self-organizing overlay network (joins, departures, locality repair)
Consistent hashing
Lookup/insert of an object in < log_{2^b} N routing steps (expected)
O(log N) per-node state
Network locality heuristics
Scalable, fault resilient, self-organizing, locality aware, secure
6
Pastry: API
nodeId = pastryInit(Credentials, Application): join the local node to the Pastry network
route(M, X): route message M to the node with nodeId numerically closest to X
Application callbacks:
deliver(M): deliver message M to the application
forward(M, X): message M is being forwarded towards key X
newLeafs(L): report a change in the leaf set L to the application
(see the interface sketch below)
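A minimal sketch, in Python, of how this API could be exposed to an application. The class and method names, the md5-based nodeId, and the callback signatures are illustrative assumptions, not the paper's actual interface:

```python
import hashlib
from abc import ABC, abstractmethod

class PastryApplication(ABC):
    """Callbacks a Pastry application implements (sketch of the API above)."""

    @abstractmethod
    def deliver(self, msg, key):
        """msg has arrived at the node with nodeId numerically closest to key."""

    @abstractmethod
    def forward(self, msg, key, next_id):
        """msg is about to be forwarded towards key via next_id."""

    @abstractmethod
    def new_leafs(self, leaf_set):
        """The local leaf set has changed."""

class PastryNode:
    def __init__(self, credentials: bytes, application: PastryApplication):
        # pastryInit: join the local node to the Pastry network under a 128-bit nodeId
        # (illustrative derivation; the real system assigns nodeIds securely).
        self.node_id = hashlib.md5(credentials).hexdigest()   # 128 bits as a hex string
        self.app = application

    def route(self, msg, key: str):
        """Route msg to the live node whose nodeId is numerically closest to key
        (the forwarding logic itself is sketched later, with the routing procedure)."""
        raise NotImplementedError
```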
7
Pastry: Object distribution
Consistent hashing over a 128-bit circular id space (0 … 2^128 - 1)
nodeIds are assigned uniformly at random
objIds/keys are assigned uniformly at random
Invariant: the node with the numerically closest nodeId maintains the object (see the sketch below)
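A minimal sketch of that "numerically closest nodeId" assignment, assuming ids are plain integers in [0, 2^128) and circular numerical distance:

```python
# Sketch of the consistent-hashing invariant: the node whose nodeId is
# numerically closest to a key is responsible for the object stored under it.
ID_SPACE = 2 ** 128

def circular_distance(a: int, b: int) -> int:
    """Numerical distance on the circular 128-bit id space."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)

def responsible_node(key: int, node_ids: list[int]) -> int:
    """Return the nodeId numerically closest to key."""
    return min(node_ids, key=lambda n: circular_distance(n, key))
```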
8
Pastry: Object insertion/lookup
A message with key X is routed to the live node with nodeId closest to X
Problem: a complete routing table is not feasible
[Figure: Route(X) on the circular id space 0 … 2^128 - 1]
9
Pastry: Routing
Tradeoff:
O(log N) routing table size: 2^b * log_{2^b} N + 2L entries
O(log N) message forwarding steps
10
Pastry: Routing table (nodeId 10233102)
log_{2^b} N rows (actually log_{2^b} 2^128 = 128/b)
2^b columns
L nodes in the leaf set
L neighbors
11
Pastry: Leaf sets
Each node maintains the IP addresses of the nodes with the L numerically closest larger nodeIds and the L numerically closest smaller nodeIds
Used for: routing efficiency/robustness, fault detection (keep-alive), application-specific local coordination
12
Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to the numerically closest leaf set member
else
    let l = length of the prefix shared with D
    let d = value of the l-th digit of D
    if (R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node (from M ∪ R ∪ L) that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
(see the sketch below)
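A minimal sketch of one such routing step, assuming b = 4 (hex digits), nodeIds and keys as fixed-length hex strings, R as a list of rows of 16 entries, and the leaf set as a flat list; helper names are illustrative:

```python
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def numeric(x: str) -> int:
    return int(x, 16)

def route_step(key: str, local_id: str, leaf_set, R):
    """Return the nodeId to forward the message to, or None to deliver locally."""
    if key == local_id:
        return None

    # 1. If the key falls within the leaf set range, go straight to the
    #    numerically closest node (possibly ourselves).
    if leaf_set:
        lo, hi = min(map(numeric, leaf_set)), max(map(numeric, leaf_set))
        if lo <= numeric(key) <= hi:
            best = min(leaf_set + [local_id],
                       key=lambda n: abs(numeric(n) - numeric(key)))
            return None if best == local_id else best

    # 2. Otherwise use the routing table entry that shares one more prefix digit.
    l = shared_prefix_len(key, local_id)
    d = int(key[l], 16)
    if l < len(R) and R[l][d] is not None:
        return R[l][d]

    # 3. Rare case: fall back to any known node that shares at least as long a
    #    prefix and is numerically closer to the key than this node.
    candidates = [n for row in R for n in row if n] + list(leaf_set)
    closer = [n for n in candidates
              if shared_prefix_len(key, n) >= l
              and abs(numeric(n) - numeric(key)) < abs(numeric(local_id) - numeric(key))]
    return min(closer, key=lambda n: abs(numeric(n) - numeric(key))) if closer else None
```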
13
Pastry: Routing
Properties: log_{2^b} N steps, O(log N) state
[Figure: example of Route(d46a1c) — the message hops from 65a1fc via d13da3, d4213f and d462ba to d467c4, the live node closest to d46a1c (next to d471f1)]
14
Pastry: Routing
Integrity of the overlay: guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
No failures: < log_{2^b} N expected, 128/b + 1 max
During failure recovery: O(N) worst case, average case much better
15
Pastry: Locality properties
Assumption: a scalar proximity metric, e.g. ping/RTT delay, # IP hops (traceroute), or subnet masks; a node can probe its distance to any other node
Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
16
Pastry: Geometric routing in proximity space
[Figure: the same Route(d46a1c) hops (65a1fc → d13da3 → d4213f → d462ba → d467c4/d471f1) shown both in the nodeId space and in the proximity space]
The proximity distance traveled by the message in each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^{bl})
The distance traveled by the message from its source increases monotonically at each step (the message takes larger and larger strides)
17
Pastry: Locality properties
Each routing step is local, but there is no guarantee of a globally shortest path
Nevertheless, simulations show:
The expected distance traveled by a message in the proximity space is within a small constant of the minimum
Among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first
18
Pastry: Self-organization
Initializing and maintaining routing tables and leaf sets
Node addition
Node departure (failure)
The goal is to keep every routing table entry referring to a near node, among all live nodes with the appropriate prefix
19
Pastry: Node addition
New node X contacts a nearby node A
A routes a “join” message with key X, which arrives at Z, the node closest to X
X obtains its leaf set from Z, and the i-th row of its routing table from the i-th node on the path from A to Z
X informs any nodes that need to be aware of its arrival
X also improves its table locality by requesting neighborhood sets from all nodes X knows
In practice: an optimistic approach (see the sketch below)
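A rough sketch, not the paper's code, of how the joining node could seed its state from the nodes on the join path; it assumes each node object exposes leaf_set (a list of nodeIds) and routing_table (a list of rows) attributes:

```python
def seed_new_node(x_id, join_path):
    """join_path[0] is the nearby bootstrap node A, join_path[-1] is Z, the node
    whose nodeId is numerically closest to x_id."""
    z = join_path[-1]

    # The leaf set comes from Z: Z's leaves are (mostly) X's leaves as well.
    leaf_set = list(z.leaf_set)

    # Row i of X's routing table is copied from the i-th node on the path: that
    # node already shares an i-digit prefix with X, so its row-i entries are
    # valid for X and, by the proximity invariant, reasonably close to A (and X).
    routing_table = []
    for i, node in enumerate(join_path):
        if i < len(node.routing_table):
            routing_table.append(list(node.routing_table[i]))

    return leaf_set, routing_table
```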
20
Pastry: Node addition
[Figure: new node X = d46a1c joins via nearby node A = 65a1fc; the join message Route(d46a1c) passes d13da3, d4213f and d462ba and arrives at Z = d467c4 (next to d471f1)]
21
Pastry: Node addition (new node: d46a1c)
[Figure: the join route for d46a1c shown in both the proximity space and the nodeId space]
X is close to A, and B is close to B1. Why is X also close to B1? Because the expected distance from B to its row-one entries (B1) is much larger than the expected distance from A to B (B is chosen from a set of exponentially decreasing size).
22
Node departure (failure)
Leaf set repair (eager – all the time):
Leaf set members exchange keep-alive messages
On a failure, request the leaf set from the furthest live node in the set
Routing table repair (lazy – upon failure):
Get a replacement entry from peers in the same row; if none is found, from higher rows
Neighborhood set repair (eager)
23
Pastry: Security
Secure nodeId assignment
Randomized routing – pick a random node among all potential next hops
Byzantine fault-tolerant leaf set membership protocol
24
Pastry: Distance traveled
|L| = 16, 100k random queries; proximity in an emulated network, nodes placed randomly
[Figure: relative distance traveled (0.8–1.4) vs. number of nodes (1,000–100,000), comparing Pastry against a complete routing table]
25
Pastry: Summary
Generic p2p overlay network
Scalable, fault resilient, self-organizing, secure
O(log N) routing steps (expected)
O(log N) routing table size
Network locality properties
26
PAST
27
INTRODUCTION: PAST system
Internet-based, peer-to-peer global storage utility
Characteristics:
Strong persistence, high availability (by using k replicas)
Scalability (due to efficient Pastry routing)
Short insert and query paths
Query load balancing and latency reduction (due to wide dispersion, Pastry locality and caching)
Security
Composed of nodes connected to the Internet; each node has a 128-bit nodeId
Uses Pastry as an efficient routing scheme
No support for mutable files, searching, or directory lookup
28
INTRODUCTION: Function of nodes
Store replicas of files
Initiate and route client requests to insert or retrieve files in PAST
File-related properties:
Inserted files have a quasi-unique fileId
Files are replicated across multiple nodes
To retrieve a file, a client must know its fileId and decryption key (if necessary)
fileId: 160 bits, computed as the SHA-1 hash of the file name, the owner’s public key and a random salt (see the sketch below)
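A minimal sketch of that fileId derivation; concatenating the three inputs into a single SHA-1 hash is an assumption about the exact byte layout:

```python
import hashlib
import os

def compute_file_id(file_name: str, owner_public_key: bytes, salt: bytes = None):
    """Return a 160-bit fileId (as an int) and the salt used to derive it."""
    if salt is None:
        salt = os.urandom(20)  # a fresh random salt yields a different fileId
    h = hashlib.sha1()
    h.update(file_name.encode("utf-8"))
    h.update(owner_public_key)
    h.update(salt)
    return int.from_bytes(h.digest(), "big"), salt
```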
29
PAST operations
Insert: fileId = Insert(name, owner-credentials, k, file)
1. fileId is computed (hash of the file name, the owner's public key, etc.)
2. The request message reaches one of the k nodes closest to fileId
3. That node accepts a replica of the file and forwards the message to the k-1 other closest nodes in its leaf set
4. Once k nodes have accepted, an ‘ack’ message with a store receipt is passed back to the client
Lookup: file = Lookup(fileId)
Reclaim: Reclaim(fileId, owner-credentials)
(see the client-side sketch below)
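A rough client-side sketch of the Insert flow; the pastry.route call returning the collected store receipts and the owner_credentials.public_key attribute are simplifications for illustration, not the paper's interface:

```python
def past_insert(pastry, name: str, owner_credentials, k: int, file_bytes: bytes) -> int:
    """Insert k replicas of a file into PAST and return its fileId."""
    file_id, _salt = compute_file_id(name, owner_credentials.public_key)

    # Route the insert request towards fileId; Pastry delivers it to the live
    # node whose nodeId is numerically closest to fileId. That node stores a
    # replica and asks the k-1 next-closest nodes in its leaf set to do the same.
    receipts = pastry.route({"op": "insert", "fileId": file_id, "k": k,
                             "data": file_bytes,
                             "credentials": owner_credentials}, key=file_id)

    # The client considers the insert successful only once it holds store
    # receipts from all k replica holders.
    if len(receipts) < k:
        raise RuntimeError("insert failed: fewer than k nodes accepted the file")
    return file_id
```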
30
STORAGE MANAGEMENT: why?
Responsibility: replicas of a file must be maintained by the k nodes with nodeIds closest to the fileId
Goal: balance the free storage space among the nodes in PAST
Conflict: the k closest nodes may have insufficient storage while neighboring nodes have sufficient storage
Causes of load imbalance (3 differences):
Number of files assigned to each node
Size of each inserted file
Storage capacity of each node
Resolution: replica diversion and file diversion
31
STORAGE MANAGEMENT: Replica diversion
GOAL: balance the remaining free storage space among the nodes in a leaf set
Diversion steps of node A (which received the insertion request but has insufficient space):
1. Choose a node B among the nodes in A's leaf set, excluding the k closest, such that B does not already hold a diverted replica of the file
2. Ask B to store a copy
3. Enter a file entry in A's table with a pointer to B
4. Send the store receipt as usual
32
STORAGE MANAGEMENT: Replica diversion
Policy for accepting a replica at a node (see the sketch below):
A node rejects a file if file_size / remaining_storage > t
Threshold t: tpri for primary replicas, tdiv for diverted replicas (tpri > tdiv)
Avoids unnecessary diversion while the node still has space
Prefers diverting large files – minimizes the number of diversions
Prefers accepting primary replicas over diverted replicas
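A minimal sketch of this acceptance test, using the threshold values from the experiments (tpri = 0.1, tdiv = 0.05); parameter names are illustrative:

```python
T_PRI = 0.1   # threshold for primary replicas
T_DIV = 0.05  # stricter threshold for diverted replicas

def accepts_replica(file_size: int, remaining_storage: int, is_primary: bool) -> bool:
    """Reject the file if file_size / remaining_storage exceeds the threshold.
    The stricter T_DIV makes a node prefer primary over diverted replicas, and
    the ratio test makes large files the first candidates for diversion."""
    if remaining_storage <= 0:
        return False
    t = T_PRI if is_primary else T_DIV
    return file_size / remaining_storage <= t
```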
33
STORAGE MANAGEMENT: File diversion
GOAL: balance the remaining free storage space among the nodes of the whole PAST network
Used when all k closest nodes and their leaf sets have insufficient space
The client node generates a new fileId using a different salt value and retries the insert (see the sketch below)
Retry limit: 3 times
If the insert still fails after that, the file is made smaller by fragmenting it
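A rough sketch of that retry loop, reusing the hypothetical past_insert helper from above (each attempt picks a fresh random salt inside compute_file_id, so the fileId and hence the responsible nodes change):

```python
MAX_FILE_DIVERSIONS = 3

def insert_with_file_diversion(pastry, name, owner_credentials, k, file_bytes):
    for _attempt in range(1 + MAX_FILE_DIVERSIONS):
        try:
            # Each attempt uses a fresh random salt, so the fileId (and therefore
            # the set of k responsible nodes) changes on every retry.
            return past_insert(pastry, name, owner_credentials, k, file_bytes)
        except RuntimeError:
            continue
    # Retries exhausted: the application may fall back to fragmenting the file
    # and inserting the smaller pieces individually.
    raise RuntimeError("insert failed after retries; consider fragmenting the file")
```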
34
STORAGE MANAGEMENT: node strategy to maintain k replicas
In Pastry, neighboring nodes exchange keep-alive messages
If a period T passes without a response, the leaf nodes remove the failed node from their leaf sets and include the live node with the next-closest nodeId
File strategy for nodes joining and dropping out of leaf sets: if the failed node was one of the k nodes for certain files (a primary or diverted replica holder), the replicas it held are re-created
To cope with diverter failure – replicate the diversion pointers
Optimization – a joining node may, instead of requesting all its replicas at once, install pointers to the previous replica holders in its file table (like replica diversion), then migrate the files gradually
35
STORAGE MANAGEMENT: Fragmenting and file encoding
Reed-Solomon encoding can be used to increase availability
Fragmentation:
Improves equal disk utilization
Improves bandwidth – parallel download
But adds latency, since several nodes must be contacted for retrieval
36
CACHING
GOAL: minimize client access latency, maximize query throughput, and balance the query load
Create and maintain additional copies of highly popular files in the “unused” disk space of nodes
Caching happens during successful insertions and lookups, at all nodes along the route
GreedyDual-Size (GD-S) policy for replacement (see the sketch below):
Each cached file f is assigned the value H(f) = cost(f) / size(f)
The file with the lowest H(f) is replaced
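A minimal sketch of a GreedyDual-Size cache along those lines; the class layout is assumed, and it includes the standard GD-S aging value L, which the slide's one-line summary leaves out (H(f) = L + cost(f)/size(f), evict the lowest H, then raise L to the evicted value):

```python
class GDSCache:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.L = 0.0          # inflation ("aging") value
        self.entries = {}     # file_id -> (H, size, data)

    def insert(self, file_id, data: bytes, cost: float = 1.0):
        if file_id in self.entries:
            return
        size = len(data)
        if size == 0 or size > self.capacity:
            return            # nothing to cache, or file too large to cache at all
        while self.used + size > self.capacity:
            self._evict_lowest()
        self.entries[file_id] = (self.L + cost / size, size, data)
        self.used += size

    def lookup(self, file_id, cost: float = 1.0):
        entry = self.entries.get(file_id)
        if entry is None:
            return None
        _, size, data = entry
        # Refresh H on a hit so popular files stay cached.
        self.entries[file_id] = (self.L + cost / size, size, data)
        return data

    def _evict_lowest(self):
        victim = min(self.entries, key=lambda fid: self.entries[fid][0])
        h, size, _ = self.entries.pop(victim)
        self.L = h            # age the remaining entries relative to the victim
        self.used -= size
```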
37
Security in PAST
Smartcards – a private/public key scheme ensures the integrity of nodeId and fileId assignment, protecting against malicious nodes
Store receipts – prevent a malicious node from storing fewer than k replicas
File certificates – verify the authenticity of file contents
File privacy through client-side encryption
Signed routing table entries
Randomized routing scheme, to prevent denial of service
Cannot completely prevent a malicious node from suppressing valid entries
38
EXPERIMENTAL RESULTS: Effects of storage management
Policy: accept a file if file_size / free_space < t
No diversion (tpri = 1, tdiv = 0):
Maximum utilization 60.8%
51.1% of inserts failed
(leaf set size shows the effect of local load balancing)
Replica/file diversion (tpri = 0.1, tdiv = 0.05):
Maximum utilization > 98%
< 1% of inserts failed
39
EXPERIMENTAL RESULTS: Determining threshold values
Policy: accept a file if file_size / free_space < t
Insertion statistics and utilization as tpri is varied, with tdiv = 0.05
Insertion statistics and utilization as tdiv is varied, with tpri = 0.1
As tpri increases, fewer files are successfully inserted, but higher storage utilization is achieved
The lower tpri is, the less likely a large file can be stored, so many small files are stored instead; utilization drops because large files are rejected at low utilization levels
As tdiv increases, storage utilization improves, but fewer files are successfully inserted
40
EXPERIMENTAL RESULTS: Impact of file and replica diversion
File diversions are negligible for storage utilization below 83%
The number of replica diversions is small even at high utilization: at 80% utilization, fewer than 10% of replicas are diverted
=> The overhead imposed by replica and file diversions is small as long as utilization stays below 95%
41
EXPERIMENTAL RESULTS: File insertion failures
File insertion failures vs. storage utilization
Utilization vs. failure rate for smaller files
The failure ratio rises sharply above 90% utilization
Failed insertions are heavily biased towards large files
42
EXPERIMENTAL RESULTS: Caching
Global cache hit ratio and average number of message hops
The hit ratio drops as storage utilization and the number of files increase, because files are evicted from the caches
Hit ratio ↓ -> routing hops ↑
For comparison, log_16 2250 ≈ 3 is the expected Pastry hop count in this experiment (2250 nodes, b = 4), i.e. the cost when nothing is cached (computed below)
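A one-line check of that baseline figure:

```python
# Baseline hop count for the caching experiment: N = 2250 nodes and b = 4, so a
# lookup takes about log_{2^b} N = log_16 2250 hops when nothing is cached.
import math
print(math.log(2250, 16))   # ≈ 2.78, i.e. roughly 3 hops
```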
43
CONCLUSION
Design and evaluation of PAST: storage management and caching
Nodes and files are assigned uniformly distributed IDs
Replicas of a file are stored at the k nodes closest to its fileId
Experimental results:
Storage utilization of 98% is achieved
Low file insertion failure ratio even at high storage utilization
Effective caching achieves load balancing
44
Weaknesses
Does not support mutable files – read only
No searching or directory lookup
A local fault in a network segment may leave a functioning node unable to contact the outside world, since its routing table is mostly local
No direct support for anonymity or confidentiality
Breaking a large node apart – is it good or bad?
The simulation is too sterile
No experimental comparison of PAST to other systems
45
Comparison to other systems
46
Comparison
PASTRY compared to Freenet and Gnutella:
A guaranteed answer in a bounded number of steps, while retaining the scalability of Freenet and the self-organization of Freenet and Gnutella
PASTRY compared to Chord:
Chord makes no explicit effort to achieve good network locality
PAST compared to OceanStore:
PAST has no support for mutable files, searching or directory lookup; more sophisticated storage semantics could be built on top of PAST
Pastry (and Tapestry) are similar to Plaxton: routing based on prefixes, a generalization of hypercube routing
Plaxton is not self-organizing; one node is associated with each file, hence a single point of failure
47
Comparison
PAST compared to FarSite:
FarSite has traditional file system semantics and a distributed directory service to locate content
Every node maintains a partial list of live nodes, from which it chooses nodes to store replicas
The LAN assumptions of FarSite may not hold in a wide-area environment
PAST compared to CFS:
CFS is built on top of Chord
A file-sharing medium: block oriented, read only
Each block is stored on multiple nodes with adjacent Chord nodeIds; popular blocks are cached
Increased file retrieval overhead, but parallel block retrieval is good for large files
CFS assumes an abundance of free disk space
It relies on hosting multiple logical nodes with separate ids in one physical Chord node in order to accommodate nodes with large storage capacity => increased query overhead
48
Comparison
PAST compared to LAND:
Expected constant number of outgoing links at each node
Constant number of pointers to each object
Constant bound on distortion (stretch): accumulated route cost divided by the distance cost
The choice of links enforces a distance upper bound at each stage of the route
LAND uses a two-tier architecture with super-nodes
49
The END