Phenix Workshop on Global Computing Systems: Pastis, a peer-to-peer file system for persistent large-scale storage (transcript)
1
Phenix Workshop on
Global Computing Systems
Pastis, a peer-to-peer file system for persistent large-scale storage
by Fabio Picconi
advisor: Pierre Sens
Dec. 7, 2006 Laboratoire d’Informatique de Paris 6 (LIP6 Computer Science Lab.)
2
1. Introduction
2. Peer-to-peer file systems: DHTs, previous designs
3. Pastis file system: data structures, updates, consistency
4. Experimental evaluation: Pastis prototype, performance
5. Predicting durability: Markov chain model, parameter estimation problem
6. Conclusions
Outline
3
P2P storage system
Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions
Goals:
large aggregate capacity
low cost (shared ownership)
high durability (persistence)
high availability
self-organization
4
P2P storage system (cont.)
Challenges:
large-scale routing
high latency, low bandwidth
node disconnections/failures
consistency
security
5
Distributed file systems
Figure: distributed file systems plotted by architecture (client-server vs. P2P) and scalability (number of nodes): NFS and AFS at LAN scale (~100 nodes), FARSITE and Pangaea at organization scale (~10,000 nodes), Ivy* and Oceanstore* at Internet scale (~1,000,000 nodes).
* uses a Distributed Hash Table (DHT) to store data
6
Distributed Hash Tables
Persistent storage service: simple API
put(key, object)
object = get(key)
decentralization
scalability (e.g., O( log N ))
fault-tolerance (replication)
self-organization
Figure: overlay address space [0000, FFFF] with nodes 04F2, 3A79, 5230, 620F, 85E0, 8909, 8954, 8957, AC78, C52A, E25A; put(8955, obj) stores replicas of obj on the nodes whose IDs are closest to the key (8954, 8957, 8909).
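The put/get interface and replica placement described above can be sketched as a toy DHT. This is hypothetical code, not the PAST implementation: node IDs and keys live on a circular 16-bit address space, and an object is replicated on the k nodes whose IDs are closest to its key.

```python
class ToyDHT:
    """Minimal sketch of a replicated DHT (not PAST's actual API)."""
    SPACE = 0x10000  # circular key space [0000, FFFF]

    def __init__(self, node_ids, replicas=3):
        self.nodes = {nid: {} for nid in node_ids}   # per-node local store
        self.k = replicas

    def _closest(self, key):
        # sort nodes by circular distance to the key, take the k closest
        def dist(nid):
            return min((nid - key) % self.SPACE, (key - nid) % self.SPACE)
        return sorted(self.nodes, key=dist)[:self.k]

    def put(self, key, obj):
        for nid in self._closest(key):
            self.nodes[nid][key] = obj               # one replica per node

    def get(self, key):
        for nid in self._closest(key):
            if key in self.nodes[nid]:
                return self.nodes[nid][key]
        return None

# the example from the figure: put(8955, obj) lands on the nodes near 0x8955
dht = ToyDHT([0x04F2, 0x3A79, 0x5230, 0x620F, 0x85E0, 0x8909,
              0x8954, 0x8957, 0xAC78, 0xC52A, 0xE25A])
dht.put(0x8955, "obj")
```

With this node set, the three closest IDs to key 0x8955 are 0x8954, 0x8957, and 0x8909, matching the replica placement shown in the figure.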
7
DHT-based file systems
Ivy [OSDI'02]: log-based, one log per user
fast writes, slow reads
limited to a small number of users
Oceanstore [FAST'03]: updates serialized by primary replicas
partial centralization
BFT agreement requires well-connected primary replicas
Figures: Ivy keeps one per-user log (User A's, B's, C's) as chained DHT blocks; Oceanstore serializes updates through a small set of primary replicas, which propagate them to secondary replicas.
8
P2P file systems
             decentralization   Internet scale   large number of users
Ivy                 X                 X
Oceanstore                            X                    X
FARSITE                                                    X
Pangaea             X                                      X
none of the systems presents all three characteristics
9
PASTIS
FILE SYSTEM
10
Pastis file system
Characteristics:
decentralization
scalable (number of nodes, number of users)
high durability
performance
security
the Pastis layer hides from the application most of the problems caused by distribution and large scale
Figure: layered architecture: the application uses a POSIX API (open, read, write, lseek, etc., plus directories, user groups, ACLs) on the Pastis layer, which relies on the PAST DHT's put/get for persistence, fault-tolerance, and self-organization.
11
Pastis data structures
Data structures similar to the Unix file system:
inodes are stored in modifiable DHT blocks (UCBs)
file contents are stored in immutable DHT blocks (CHBs)
Figure: a file inode (UCB, holding metadata and block addresses) points to the content blocks CHB1 and CHB2; each block (UCB, CHB1, CHB2) is stored on its own replica set in the DHT address space.
12
Pastis data structures (cont.)
directories contain <file name, inode key> entries
indirect blocks are used for large files
Figure: a directory inode (UCB) points to a CHB listing entries <file1, key1>, <file2, key2>, ...; file1's inode (UCB) points to its content CHBs, through an indirect block (CHB) for large files; CHBs holding old contents remain in the DHT.
13
DHT blocks used by Pastis
Content Hash Block (CHB):
key = Hash(contents)
self-verifying
the block is immutable
User Certificate Block (UCB):
contents signed by the writer
certificate proves write access
the block is modifiable
UCB layout: contents, version number (used for consistency), sign(Kwriter_private), certificate (Kwriter_public, Kblock_public, expiration date, sign(Kblock_private))
UCB key: H(Kblock_public); CHB key: H(contents)
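The self-verifying property of a CHB can be shown in a few lines: since the key is the hash of the contents, any reader can recompute the hash and detect a block substituted by a malicious node. A minimal sketch (SHA-1 is an assumption here; the slide only says "Hash"):

```python
import hashlib

def chb_key(contents: bytes) -> str:
    """A Content Hash Block's key is the hash of its contents."""
    return hashlib.sha1(contents).hexdigest()

def verify_chb(key: str, contents: bytes) -> bool:
    """Self-verification: recompute the hash and compare with the key."""
    return chb_key(contents) == key

data = b"file contents"
key = chb_key(data)
ok = verify_chb(key, data)            # genuine block passes
bad = verify_chb(key, b"tampered")    # substituted contents fail
```

Immutability follows directly: changing the contents would change the key, so a CHB can only ever be replaced by inserting a new block under a new key.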
14
Pastis updates
User modifies the file contents. DHT accesses:
1. retrieve the CHB storing the old file contents → CHB get
2. a new CHB is created and stored in the DHT → CHB put
3. the pointer in the file inode is updated → UCB put
Figure: the file inode (UCB) goes from version 1 to version 2 (the version is increased); the block address for the modified offset now points to CHB2 (new contents) instead of CHB1 (old contents). The old CHB is kept in the DHT to allow relaxed consistency models.
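The three DHT accesses above can be sketched in code. This is a toy model with hypothetical helper names, using a plain dict as a stand-in for the DHT, not Pastis's or PAST's actual API:

```python
import hashlib

dht = {}  # toy stand-in for the DHT's put/get

def put_chb(contents: bytes) -> str:
    """CHB put: the key of an immutable block is the hash of its contents."""
    key = hashlib.sha1(contents).hexdigest()
    dht[key] = contents
    return key

def write_block(inode: dict, block_no: int, modify) -> None:
    """The three DHT accesses of the update, in order."""
    old = dht[inode["blocks"][block_no]]       # 1. CHB get: old contents
    new_key = put_chb(modify(old))             # 2. CHB put: new immutable block
    inode["blocks"][block_no] = new_key        # repoint the modified block
    inode["version"] += 1                      # version number, used for consistency
    dht[inode["key"]] = inode                  # 3. UCB put: updated inode
    # note: the old CHB is NOT deleted; it stays in the DHT so that
    # relaxed consistency models can still read the previous version

inode = {"key": "ucb-1", "version": 1, "blocks": [put_chb(b"old contents")]}
old_key = inode["blocks"][0]
write_block(inode, 0, lambda _old: b"new contents")
```

After the call the inode is at version 2 and points to the new CHB, while the old CHB remains reachable under its original key.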
15
Pastis consistency models
Close-to-open [AFS]:
defer update propagation until the file is closed
file open must return the most recently closed version in the system
ignore updates from other clients while the file is open
Implementation:
• file open: fetch all inode replicas from the DHT and keep the most recent version
• file close: update the inode replicas in the DHT
Read-your-writes [Bayou]:
file open must return a version consistent with previous local writes
Implementation:
• file open: fetch one inode replica not older than those accessed previously
read-your-writes is useful when files are seldom shared
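The two open() strategies above can be contrasted in a short sketch (hypothetical function names; inode replicas are modeled as dicts carrying only a version number):

```python
def open_close_to_open(replicas):
    """Close-to-open open(): fetch every inode replica and keep the one
    with the highest version number."""
    return max(replicas, key=lambda ino: ino["version"])

def open_read_your_writes(replicas, last_seen_version):
    """Read-your-writes open(): any single replica is acceptable as long
    as it is not older than what this client has already seen."""
    for ino in replicas:
        if ino["version"] >= last_seen_version:
            return ino
    raise RuntimeError("no sufficiently recent replica reachable")

replicas = [{"version": 3}, {"version": 5}, {"version": 4}]
newest = open_close_to_open(replicas)            # must contact all replicas
good_enough = open_read_your_writes(replicas, 4)  # can stop at the first hit
```

The cost difference is visible in the sketch: close-to-open must wait for all replicas (one round trip to the slowest one), while read-your-writes can return as soon as one sufficiently recent replica answers, which is why it is faster when files are seldom shared.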
16
PASTIS
EVALUATION
17
Pastis architecture
Prototype uses FreePastry (open-source PAST implementation)
two interfaces for applications:
• NFS loop-back server
• Java class
~10,000 lines of Java
Figure: prototype architecture: a Java application or the NFS loop-back client sits on Pastis, which runs on PAST over Pastry (FreePastry), over either TCP/UDP or the LS3 simulator.
18
Pastis evaluation
Emulation:
prototype runs on a local cluster
a router adds constant latency between nodes
controlled environment
LS3 simulator [Busca]:
intercepts calls to the transport layer and adds latency (sphere topology)
executes the same upper-layer code
up to 80,000 DHT nodes with 2 GB of RAM
Andrew Benchmark:
Phase 1: create subdirectories
Phase 2: write to files
Phase 3: read file attributes
Phase 4: read file contents
Phase 5: execute the "make" command
Figure: emulation setup: cluster nodes N1 to N4 connected through the latency-adding router.
19
Pastis performance with concurrent clients
Configuration:
16 DHT nodes
100 ms constant inter-node latency (Dummynet)
4 replicas per object
close-to-open consistency
Ivy's read overhead increases rapidly with the number of users
(the client must retrieve the records of more logs)
Figure: normalized execution time (0.98 to 1.14) vs. number of active users (0 to 16, each running an independent benchmark, every user writing to the FS), for Pastis and Ivy.
20
Pastis consistency models
Configuration:
16 DHT nodes
100 ms constant inter-node latency (Dummynet)
4 replicas per object
Performance penalty compared to NFS (close-to-open):
Pastis 1.9
Ivy [OSDI'02] 2.0 – 3.0
Oceanstore [FAST'03] 2.55
Figure: execution time (0 to 300 s) per Andrew Benchmark phase (Phase 1: dirs, Phase 2: write, Phase 3: attr., Phase 4: read, Phase 5: make, and Total) for Pastis (close-to-open), Pastis (read-your-writes), and NFSv3.
21
Impact of replication factor
Configuration:
100 DHT nodes
transit-stub topology (Modelnet), 200 AS, 50 stubs
node bandwidth: 10 Mbit/s
close-to-open consistency
Figure: execution time (0 to 1000 s) vs. replication factor (4 to 12) for the read phase, the write phase, and the total (5 phases).
higher replication factor → more replicas transmitted when inserting a CHB/UCB or when retrieving a UCB
22
Impact of network size
Configuration:
sphere topology (LS3 simulator), 300 ms max latency
16 replicas per object
close-to-open consistency
Figure: execution time (0 to 400 s, left axis) and hop count (0 to 5, right axis) vs. number of DHT nodes (16 to 16,384); execution time remains roughly constant as the network grows.
reason: in the PAST DHT, the lookup latency is mainly due to the last hop
23
Pastis evaluation summary
graceful degradation with the number of concurrent users (cf. Ivy)
slightly faster than Ivy and Oceanstore with one active Andrew Benchmark
read-your-writes model 30% faster than close-to-open
latency increases with the replication factor
latency remains constant with DHT size
24
PREDICTING
DURABILITY
25
Durability vs. availability
Data availability
metric: percentage of time that data is accessible (e.g., 99.9%)
depends on the rate of temporary replica disconnections
Data durability
metric: probability of data loss (e.g., 0.1% one year after insertion)
depends on the rate of disk failures
data can be stored durably despite being temporarily unavailable (i.e., all replicas off-line)
both problems are studied in the thesis; only durability is discussed in this talk
26
Durability in P2P systems
Problem: restoring a crashed hard disk may take a long time
Example: 300 GB per node, 1 Mbit/s connection
Time needed to restore the hard disk contents: 1 month
Classical solution:
replicate objects on different nodes (avoid correlated failures)
restore replicas after a disk crash
Figure: timeline of Replica 1 and Replica 2: Replica 1 crashes at tc1 and is restored by tr1, one month later; if Replica 2 crashes at tc2, before tr1, the data loss is irrecoverable.
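The "1 month" figure above follows directly from the example's numbers; a worked version of the arithmetic:

```python
# Restore-time arithmetic: 300 GB of data pulled over a 1 Mbit/s link.
b_bits = 300 * 10**9 * 8      # node capacity in bits (300 GB)
bw_bits = 1 * 10**6           # link bandwidth in bits per second (1 Mbit/s)
t_restore_days = b_bits / bw_bits / 86400
# about 27.8 days, i.e. roughly the "1 month" quoted on the slide
```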
27
Analytical model for durability
Goal: create an analytical model to predict the probability of data loss (instead of using simulations)
Benefits:
the model can be used to determine the minimum number of replicas needed to achieve a given level of durability
the model avoids new simulations each time system parameters change (repair bandwidth, node capacity, etc.)
28
Using Markov chains to predict durability
Model the number of replicas present on the network:
the chain models one data object (replicated k times)
k+1 states, 0 to k
state transitions:
• replica failure → from state i to i−1 with rate λi
• replica repair → from state i to i+1 with rate µi
state probabilities at t = 0: Pk(0) = 1, Pi(0) = 0 for i < k
PLOSS(t) = P0(t) (probability of having reached the absorbing state 0 by time t)
Figure: the chain k → k−1 → ... → 2 → 1 → 0, with failure rates λk, λk−1, ..., λ1 moving toward 0 and repair rates µk−1, µk−2, ..., µ1 moving back up; state 0 is absorbing.
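The chain above can be turned into numbers by integrating its forward Kolmogorov equations. The sketch below uses the naive rate choices (βi = i, αm = 1, i.e. failure rate i·λ out of state i and a constant repair rate µ), precisely the simplistic parameters the following slides improve on; a plain Euler scheme stands in for a proper matrix exponential:

```python
def p_loss(k, lam, mu, t, steps=200_000):
    """PLOSS(t) = P0(t) for the birth-death chain: state i = number of
    live replicas, failure rate i*lam (beta_i = i), repair rate mu from
    states 1..k-1 (alpha_m = 1), state 0 absorbing, Pk(0) = 1."""
    p = [0.0] * (k + 1)
    p[k] = 1.0                                   # all k replicas alive at t = 0
    dt = t / steps
    for _ in range(steps):
        q = p[:]
        for i in range(1, k + 1):
            out = i * lam + (mu if i < k else 0.0)
            q[i] -= dt * out * p[i]              # probability flowing out of state i
            q[i - 1] += dt * i * lam * p[i]      # failure: i -> i-1
            if i < k:
                q[i + 1] += dt * mu * p[i]       # repair: i -> i+1
        p = q
    return p[0]

# illustrative rates (per day): MTBF = 60 days, Trepair = 7 days, k = 3
loss_1y = p_loss(k=3, lam=1/60, mu=1/7, t=365)
```

As expected from the model, the loss probability grows with the horizon t and shrinks as replicas are added, which is how the model answers "how many replicas for a given durability level".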
29
Estimating the model parameters
Express the transition rates using coefficients αi and βi.
Failure rates:
state 1 (one live replica): λ = 1 / MTBF
state i (i live replicas): βi = i (failure rate βi·λ)
Repair rates:
state i = k−1 (one missing replica): µ = 1 / Trepair, with Trepair = ?
state i = k−m (m missing replicas): repair rate αm·µ, with αm = ?
What are the values of Trepair and αm?
Figure: the chain k → ... → 2 → 1 → 0 with failure rates βk·λ, βk−1·λ, ..., β2·λ, λ and repair rates µ, α2·µ, ..., αk−1·µ.
30
Parameters proposed by other authors:
Ramabhadran et al. [INFOCOM'06]: αm = m (m = number of missing replicas)
no expression for Trepair (assumed to be configurable)
Chun et al. [NSDI'06]: αm = 1
Trepair = b / bw
b: node capacity [bytes]
bw: node bandwidth [bytes/sec]
Figure: αm vs. number of missing replicas m (0 to 5): the line αm = m (Ramabhadran et al.) and the constant αm = 1 (Chun et al.), both too simplistic.
Are these expressions for Trepair and αm accurate?
31
Our parameter estimation
Analytical estimation of Trepair:
multiple crashes: a node may crash again before it finishes restoring its disk
bandwidth sharing: the uploader's bandwidth bw may be shared among several downloaders; this effect increases with a smaller MTBF (crashes are more frequent → more nodes are in the restoring state)
Our value of Trepair depends on three parameters: b, bw, and MTBF
Estimation of αm:
the coefficients αm are very hard to obtain analytically
measure αm with a simulator
32
Simulator
DHT protocol:
simplified version of the PAST DHT
object replicas stored on ring neighbors
a crashed node restores objects sequentially
each object is downloaded from a random source
Assumptions:
symmetric node bandwidth
node inter-failure times drawn from an exponential distribution with mean MTBF
high node availability (temporary disconnections ignored; a conservative choice)
33
Validation of our expression for Trepair
Configuration: bw = 1.5 Mbit/s, MTBF = 2 months, k = {3, 7}, b variable
Figures: (left) Trepair [days] (0 to 48) vs. b [GB per node] (0 to 500) for the simulation (k=3, k=7), our estimation, and Chun et al.; (right) estimation error (0% to 100%) vs. b for our estimation (k=3, k=7) and Chun et al. (k=3, k=7).
34
Measured values of αm
Configuration: b = 250 GB per node, bw = 1.5 Mbit/s, MTBF = 2 months, k = {5, 7, 9}
Figure: αm (0 to 9) vs. number of missing replicas m (0 to 8): measured values for k = 5, 7, 9, the empirical fit f(m), and the curves of Ramabhadran et al. (αm = m) and Chun et al. (αm = 1); the measured αm plateau due to fewer remaining replicas to download the object from.
Empirical function f(m): a saturating fit to the measured αm, with a constant c obtained by solving a closed-form condition in k.
f(m) can be used to obtain the model parameters from the system characteristics b, bw, k, and MTBF.
35
Predicted loss probability
Configuration: b = 250 GB per node, bw = 1.5 Mbit/s, MTBF = 2 months, k = 5
Figure: PLOSS(t) (0 to 0.075) vs. time after insertion t [years] (0 to 4): simulation, our f(i)-based parameters, Ramabhadran et al., and Chun et al.
36
Conclusion
Pastis: a new P2P file system
completely decentralized
scalable in the number of nodes and users
relaxed consistency models: close-to-open, read-your-writes
performance equal to or better than Ivy and Oceanstore
Durability prediction:
Markov chains used to estimate the probability of object loss
previously proposed model parameters are inaccurate
a more accurate parameter set based on a new analytical expression and empirical data
37
Future work
Pastis:
explore new consistency models
flexible replica location
evaluation in a wide-area testbed (PlanetLab)
Durability:
refine the model to consider other DHT protocols and the impact of temporary node disconnections