Pastis: a peer-to-peer file system for persistant large-scale storage Jean-Michel Busca
description
Transcript of Pastis: a peer-to-peer file system for persistant large-scale storage Jean-Michel Busca
1JTE HPC/FS
Pastis: a peer-to-peer file system for persistant large-scale storage
Jean-Michel BuscaFabio Picconi
Pierre Sens
LIP6, Université Paris 6 – CNRS, Paris, FranceINRIA, Rocquencourt, France
2JTE HPC/FS
1. DHT-based File Systems
2. Pastis
3. Performance evaluation
Outline
3JTE HPC/FS
Distributed file systems
Client-server P2P
LAN
(100)
NFS -
Organization(10.000)
AFS FARSITE
Pangaea
Internet(1.000.000)
- Ivy *Oceanstore *
Pastis *
scalability(number of nodes)
architecture
* uses a Distributed Hash Table (DHT) to store data
4JTE HPC/FS
Distributed Hash Tables
52
24
7540
18
91
32
66
83
5JTE HPC/FS
DHTs
logical address space
52
24
7540
18
91
32
66
83
South America
North America
North America
Australia
Asia
Asia
Europe
Europe
Asia
high latency,low bandwidthbetween logical neighbors
Overlay network
6JTE HPC/FS
Insertion of blocks in DHT
04F2
5230
834BC52A
8909
8BB2
3A79
8954
8957AC78
895D
E25A
04F2
3A79
5230
834B8909
8954
8957
8BB2
AC78
C52A
E25A
k = 8958
k = 8959
put(8959,block)
root ofkey 8959 block
Address space
replica
895D
7JTE HPC/FS
PAST: Storage System
PAST: Cooperative, archival file storage and distribution Layered on top of Pastry
Goals: Strong persistence of the data High availability Scalability of the System Reduced cost (no backup) Efficient use of pooled resources
8JTE HPC/FS
Insertion of blocks in DHT
04F2
5230
834BC52A
8909
8BB2
3A79
8954
8957AC78
895D
E25A
04F2
3A79
5230
834B8909
8954
8957
8BB2
AC78
C52A
E25A
k = 8958
k = 8959
put(8959,block)
root ofkey 8959 block
Address space
replica
895D
replica
9JTE HPC/FS
Insertion of blocks in DHT
04F2
5230
834BC52A
8909
8BB2
3A79
8954
8957AC78
895D
E25A
04F2
3A79
5230
834B8909
8954
8957
8BB2
AC78
C52A
E25A
block
Address space
replica
895D
replica
k = 8958
k = 8959
get(8959,block)
10JTE HPC/FS
P2P File systems architecture
put(key, block)
block = get(key)
files and directoriesread-write access
semanticssecurity and access control
DHash / Past
Ivy / Pastis
DHT
FS
- scalability
- fault-tolerance
- self-organization
block store (DHT)
message routing
open(), read(), write(), close(), etc.
11JTE HPC/FS
DHT-based file systems
Ivy [OSDI’02]log-based, one log per user
fast writes, slow reads
limited to small number of users
Oceanstore [FAST’03]updates serialized by primary replicas
partially centralized system
BFT agreement protocol requires well-connected primary replicas
primary replicas
secondary replicas
User A’s log
User B’s log
User C’s log
DHTobject
DHTobject
DHTobject
12JTE HPC/FS
Pastis
13JTE HPC/FS
Pastis design
Design goals simple completely decentralized scalable (network size and number of users)
put(key, block) block = get(key)
Pastry
Past
Pastis
DHT
FS
storage
routing
14JTE HPC/FS
Pastis data structures
Data structures similar to the Unix file system inodes are stored in modifiable DHT blocks (UCBs) file contents are stored in immutable DHT blocks (CHBs)
metadata
block addresses
UCB
file inode
CHB1
CHB2
file contents
file contentsUCB
CHB1
CHB2replica sets
DHT address space
Inode key
15JTE HPC/FS
Pastis data structures (cont.) directories contain <file name, inode key> entries use indirect blocks for large files
metadata
block addresses
UCB
directory inode
CHB
file1, key1file2, key2
…metadata
block addresses
UCB
file1 inode
CHB
oldcontents
CHB
indirectblock
CHB
filecontents
CHB
oldcontents
CHB
filecontents
16JTE HPC/FS
Content Hash Block (CHB)
Content Hash Block block has to be immutable
Solution to check and prevent modification
block contents determine block key can detect if block is modified
data block
block key = Hash( block contents )
block contents
17JTE HPC/FS
User Certificate Blocks (UCBs)
UCBs are modifiable by the block owner.
Question: How to check that the file is modified only by the owner?
Protocol
(KBpub
, KBpriv
) associated to each block
The owner builds a signature of the block using KB
priv.
Authentication Verify signature of UCB
using the KBpub
sign(KBpriv)
timestamp
UCB
block key = Hash( KBpub
)
inode contents
18JTE HPC/FS
UCBs: Multiple User Edits
We want that multiple users can edit BUT we do not want to share the private Key.
(KBpub
, KBpriv
) associated to each block
(KUpub
, KUpriv
) associated to each user
Certificate grants write access to a given user
(identified by KUpub) issued by the file owner expiration date allows access
revocation
Authentication Verify signature of certificate using
the storage key (KBpub) Verify signature of UCB using the
KUpub
sign(KUpriv)
KUpub
expiration date
sign(KBpriv)
timestamp
UCB
certificate
block key = Hash( KBpub
)
inode contents
KBpub
19JTE HPC/FS
Pastis – Update handling
File update insert the new file contents (CHBs) reinsert the file inode (UCB) replace data blocks (CHBs)
metadata@CHB1@CHB2…
@CHBi@CHBii@CHBiii
UCB1
directory inode
file1 @UCB2file2 @UCB3file3 ……
CHB1
directory contentsmetadata@CHB3……
………
UCB2
file inode
foo
CHB3
file contents
20JTE HPC/FS
Pastis – Update handling
File update insert the new file contents (CHBs) reinsert the file inode (UCB) replace data blocks (CHBs)
metadata@CHB1@CHB2…
@CHBi@CHBii@CHBiii
directory inode
CHB1
directory contentsmetadata@CHB3……
………
file inode
foo
CHB3
file contents
foo bar
CHB4
new file contents
Insert new CHB into the DHT
UCB1 UCB2
file1 @UCB2file2 @UCB3file3 ……
21JTE HPC/FS
Pastis – Update handling
File update insert the new file contents (CHBs) reinsert the file inode (UCB) replace data blocks (CHBs)
metadata@CHB1@CHB2…
@CHBi@CHBii@CHBiii
directory inode
CHB1
directory contentsmetadata@CHB4……
………
file inode
foo
CHB3
file contents
foo bar
CHB4
new file contents
Update file inode to point to new CHB
file1 @UCB2file2 @UCB3file3 ……
UCB1 UCB2
22JTE HPC/FS
Pastis – Update handling
File update insert the new file contents (CHBs) reinsert the file inode (UCB) replace data blocks (CHBs)
metadata@CHB1@CHB2…
@CHBi@CHBii@CHBiii
directory inode
CHB1
directory contentsmetadata@CHB4……
………
file inode
foo
CHB3
file contents
foo bar
CHB4
new file contents
Reinsert inode UCB into the DHT
file1 @UCB2file2 @UCB3file3 ……
UCB1 UCB2
23JTE HPC/FS
Pastis – Consistency
Strict consistency → too expensive, requires too many network accesses
Close-to-open consistency
open(): returns the latest version of the file commited by close()
between open() and close(): user only sees his own updates defer writes until file is closed
Client Aopen read ‘1’
open
write ‘2’
read ‘1’ open read ‘2’
close
closewrite is cached until close
(CHBs and inode UCB are stored in a local buffer)
Client B
write ‘2’ is sent to the network
(CHBs and UCB and inserted into the DHT)
a “close-to-open” path makes updates visible
B retrieves inode from the DHTStill quite expensive: an open requires retrieving the mostup-to-date inode replica
24JTE HPC/FS
Pastis – Consistency
Read-your-writes consistency relaxation of the close-to-open model read() must reflect previous local writes only writes from other clients may or may not be visible
Client Aopen read ‘1’
open write ‘2’ open read ‘2’
close
close
Client B
read must reflect local previous writes
open read ‘1’
A’s read may not reflect B’s writes
An open does not require retrieving the most up-to-date inode replica, just fetch one inode replica not older than those accessed previously
25JTE HPC/FS
Class materials and Bibliography
Everything at http://www-sop.inria.fr/members/Frederic.Giroire/enseignement/p2p/
Slides Paper and technical report:
Pastis: a Highly-Scalable Multi-User Peer-to-Peer File System, Busca, Picconi, and Sens, EuroPar 2005.
26JTE HPC/FS
Evaluation
27JTE HPC/FS
Evaluation
Prototype programmed in Java Client interface : NFS, Fuse Test program: Andrew Benchmark
Phase 1: create subdirectories Phase 2: copy files Phase 3: read file attributes Phase 4: read file contents Phase 5: make command
Emulation LAN with one DHT node per machine DummyNet router emulates WAN latencies
Simulation discrete event simulator - LS3
simulates overlay network latency
28JTE HPC/FS
0.98
1
1.02
1.04
1.06
1.08
1.1
1.12
1.14
0 2 4 6 8 10 12 14 16
Number of active users
Pastis Ivy
Pastis performance with concurrent clients
Configuration
16 DHT nodes
100 ms constant inter-node latency (Dummynet)
4 replicas per object
close-to-openconsistency
Ivy’s read overhead increases rapidly with the number of users
(the client must retrieve the records of more logs)
normalizedexecution
time[sec.]
(each running an independent benchmark)every userreading and writing to FS
29JTE HPC/FS
0
50
100
150
200
250
300
Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Total
Andrew Benchmark phase
Pastis consistency models
Configuration
16 DHT nodes
100 ms constant inter-node latency (Dummynet)
4 replicas per object
Pastis 1.9
Ivy [OSDI’02] 2.0 – 3.0
Oceanstore [FAST’03] 2.55
execution time[sec.]
performance penalty compared to NFS(close-to-open)
Pastis (close-to-open) Pastis (read-your-writes) NFSv3
(dirs) (write) (attr.) (read) (make)
30JTE HPC/FS
Evaluation: consistency models
N = 32768, sphere topology, max. latency: 300 ms, k = 16
CTO RYW
RYW with 10%stale UCB replicas
31JTE HPC/FS
Conclusion
Pastis simple completely decentralized (cf. Oceanstore) scalable number of users (cf. Ivy) good performance thanks to:
PAST-Pastry’s locality properties relaxed consistency models (close-to-open, read-your-
writes) Future work
explore new consistency models flexible replica location evaluation in a wide-area testbed (Planetlab)
32JTE HPC/FS
Links
Pastis : http://regal.lip6.fr/projects/pastis
Pastry, Past : http://freepastry.rice.edu
LS3 : http://regal.lip6.fr/projects/pastis/ls3
33JTE HPC/FS
Questions?
34JTE HPC/FS
Internet
Blocks distribution
root
replication
Past / Pastry overlay
Pastis FS
Pastis design