
Page 1: Title

JTE HPC/FS

Pastis: a peer-to-peer file system for persistent large-scale storage

Jean-Michel Busca, Fabio Picconi, Pierre Sens

LIP6, Université Paris 6 – CNRS, Paris, France
INRIA, Rocquencourt, France

Page 2: Outline

1. DHT-based File Systems

2. Pastis

3. Performance evaluation

Page 3: Distributed file systems

Scalability (number of nodes) versus architecture:

Scale (number of nodes)     Client-server   P2P
LAN (100)                   NFS             -
Organization (10,000)       AFS             FARSITE, Pangaea
Internet (1,000,000)        -               Ivy *, OceanStore *, Pastis *

* uses a Distributed Hash Table (DHT) to store data

Page 4: Distributed Hash Tables

[Figure: DHT nodes with identifiers 52, 24, 75, 40, 18, 91, 32, 66, 83]

Page 5: DHTs

[Figure: the logical address space with nodes 52, 24, 75, 40, 18, 91, 32, 66, 83, mapped onto an overlay network whose nodes are spread over South America, North America, Australia, Asia and Europe]

High latency, low bandwidth between logical neighbors

Page 6: Insertion of blocks in DHT

[Figure: address space with nodes 04F2, 3A79, 5230, 834B, 8909, 8954, 8957, 895D, 8BB2, AC78, C52A, E25A. The keys k = 8958 and k = 8959 are marked. put(8959, block) stores the block on the root of key 8959, and a replica is stored on another node.]

Page 7: PAST: Storage System

PAST: cooperative, archival file storage and distribution
- Layered on top of Pastry

Goals:
- Strong persistence of the data
- High availability
- Scalability of the system
- Reduced cost (no backup)
- Efficient use of pooled resources

Page 8: Insertion of blocks in DHT

[Figure: same address space and nodes as on page 6. After put(8959, block), the block is held by the root of key 8959 and by two replicas.]

Page 9: Insertion of blocks in DHT

[Figure: same address space and nodes as on page 6, with the block stored on the root of key 8959 and on two replicas. block = get(8959) retrieves the block.]
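The put/get behaviour pictured on pages 6 to 9 can be sketched as follows. This is a minimal illustration, not Past's implementation: it assumes a 16-bit circular identifier space, takes the root of a key to be the node whose identifier is numerically closest to it, and places replicas on the next closest nodes. The names `ToyDHT`, `replica_set` and `ring_distance` are hypothetical, and message routing is not modeled.

```python
from typing import Dict, FrozenSet, List, Optional

# Node identifiers copied from the figure (16-bit hexadecimal IDs).
NODE_IDS = [0x04F2, 0x3A79, 0x5230, 0x834B, 0x8909, 0x8954,
            0x8957, 0x895D, 0x8BB2, 0xAC78, 0xC52A, 0xE25A]
ID_SPACE = 1 << 16  # assumed size of the circular address space


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two identifiers on the circular space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def replica_set(key: int, k: int) -> List[int]:
    """The k nodes closest to `key`; the first element is the root."""
    return sorted(NODE_IDS, key=lambda n: (ring_distance(n, key), n))[:k]


class ToyDHT:
    """In-memory stand-in for a DHT: one dictionary per node."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.stores: Dict[int, Dict[int, bytes]] = {n: {} for n in NODE_IDS}

    def put(self, key: int, block: bytes) -> None:
        # Store the block on the root and on the other k - 1 replicas.
        for node in replica_set(key, self.k):
            self.stores[node][key] = block

    def get(self, key: int,
            failed: FrozenSet[int] = frozenset()) -> Optional[bytes]:
        # Try the root first, then fall back to the remaining replicas.
        for node in replica_set(key, self.k):
            if node not in failed and key in self.stores[node]:
                return self.stores[node][key]
        return None
```

With these assumptions, key 8959 is rooted at node 8957 and replicated on 895D and 8954, so a get still succeeds when the root is unavailable.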

Page 10: P2P File systems architecture

FS layer (Ivy / Pastis), interface: open(), read(), write(), close(), etc.
- files and directories
- read-write access semantics
- security and access control

DHT layer (DHash / Past), interface: put(key, block), block = get(key)
- block store (DHT)
- message routing
- scalability
- fault-tolerance
- self-organization

Page 11: DHT-based file systems

Ivy [OSDI’02]
- log-based, one log per user
- fast writes, slow reads
- limited to small number of users

[Figure: User A’s log, User B’s log and User C’s log]

OceanStore [FAST’03]
- updates serialized by primary replicas
- partially centralized system
- BFT agreement protocol requires well-connected primary replicas

[Figure: DHT objects held by primary replicas and secondary replicas]

Page 12: Pastis

Page 13: Pastis design

Design goals:
- simple
- completely decentralized
- scalable (network size and number of users)

Layers, from top to bottom:
- Pastis: FS
- Past: DHT storage, accessed through put(key, block) and block = get(key)
- Pastry: routing

Page 14: Pastis data structures

Data structures similar to the Unix file system:
- inodes are stored in modifiable DHT blocks (UCBs)
- file contents are stored in immutable DHT blocks (CHBs)

[Figure: a file inode (UCB) holds metadata and block addresses that point to the file contents (CHB1, CHB2). The inode key locates the UCB in the DHT address space, where the UCB, CHB1 and CHB2 are each stored on a replica set.]

Page 15: Pastis data structures (cont.)

- directories contain <file name, inode key> entries
- use indirect blocks for large files

[Figure: a directory inode (UCB: metadata, block addresses) points to a CHB holding the entries <file1, key1>, <file2, key2>, … The file1 inode (UCB: metadata, block addresses) points to CHBs holding file contents and to an indirect-block CHB, which in turn points to further file-contents CHBs. CHBs labeled "old contents" are also shown.]
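As an illustration of how <file name, inode key> entries tie these structures together, the sketch below resolves a path by walking directory contents from a root inode. It is a simplified model under stated assumptions: a plain dictionary stands in for the DHT, an inode is a dictionary with a "blocks" list, directory CHBs are dictionaries of entries, and indirect blocks are not modeled. The names `lookup` and `read_file` are hypothetical.

```python
from typing import Any, Dict


def lookup(dht: Dict[str, Any], root_inode_key: str, path: str) -> str:
    """Return the inode key of `path`, starting from the root directory."""
    inode_key = root_inode_key
    for name in [part for part in path.split("/") if part]:
        inode = dht[inode_key]                # UCB: the directory inode
        entries: Dict[str, str] = {}
        for chb_key in inode["blocks"]:       # CHBs: the directory contents
            entries.update(dht[chb_key])
        if name not in entries:
            raise FileNotFoundError(path)
        inode_key = entries[name]             # follow <file name, inode key>
    return inode_key


def read_file(dht: Dict[str, Any], inode_key: str) -> bytes:
    """Concatenate the data blocks listed in a file inode."""
    return b"".join(dht[k] for k in dht[inode_key]["blocks"])
```

Each path component costs one inode fetch plus the fetches of the directory's content blocks, which is why the number of DHT accesses per operation matters for performance.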

Page 16: Content Hash Block (CHB)

A Content Hash Block has to be immutable.

Solution to check and prevent modification:
- the block contents determine the block key: block key = Hash(block contents)
- any modification of the block can be detected

[Figure: a data block whose key is the hash of its contents]
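A minimal sketch of the CHB check follows. The slides only say "Hash"; SHA-1 is used here as an assumption, a dictionary stands in for the DHT, and the names `chb_key`, `put_chb` and `get_chb` are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def chb_key(contents: bytes) -> str:
    """Block key = Hash(block contents); SHA-1 is an assumed choice."""
    return hashlib.sha1(contents).hexdigest()


def put_chb(store: Dict[str, bytes], contents: bytes) -> str:
    """Insert a block under the key derived from its contents."""
    key = chb_key(contents)
    store[key] = contents
    return key


def get_chb(store: Dict[str, bytes], key: str) -> Optional[bytes]:
    """Return the block only if its contents still hash to `key`."""
    block = store.get(key)
    if block is None or chb_key(block) != key:
        return None  # missing, or modified by a faulty or malicious node
    return block
```

Because the key is derived from the contents, a reader needs no extra metadata to validate a block, and changing a file always produces a block with a different key.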

Page 17: User Certificate Blocks (UCBs)

UCBs are modifiable by the block owner.

Question: how to check that the file is modified only by the owner?

Protocol:
- a key pair (KBpub, KBpriv) is associated with each block
- the owner signs the block using KBpriv

Authentication:
- verify the signature of the UCB using KBpub

[Figure: a UCB contains the inode contents, a timestamp and sign(KBpriv); block key = Hash(KBpub).]

Page 18: UCBs: Multiple User Edits

We want multiple users to be able to edit a file without sharing the private key.

- a key pair (KBpub, KBpriv) is associated with each block
- a key pair (KUpub, KUpriv) is associated with each user

Certificate:
- grants write access to a given user (identified by KUpub)
- issued by the file owner
- expiration date allows access revocation

Authentication:
- verify the signature of the certificate using the storage key (KBpub)
- verify the signature of the UCB using KUpub

[Figure: a UCB contains the inode contents, a timestamp, KBpub, sign(KUpriv) and a certificate made of KUpub, an expiration date and sign(KBpriv); block key = Hash(KBpub).]
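The two-step authentication can be sketched as below. To stay self-contained, the example uses textbook RSA with fixed primes and no padding, which is a toy and not secure; the slides do not say which signature scheme Pastis uses. The field layout, the payload encoding and all names (`Certificate`, `UCB`, `check_ucb`, ...) are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

Key = Tuple[int, int]  # (modulus, exponent)
E = 65537              # public exponent of the toy scheme


def make_keypair(p: int, q: int) -> Tuple[Key, Key]:
    """TOY textbook RSA from two given primes (needs Python 3.8+)."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))
    return (n, E), (n, d)  # (public key, private key)


def _digest(msg: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n


def sign(priv: Key, msg: bytes) -> int:
    n, d = priv
    return pow(_digest(msg, n), d, n)


def verify(pub: Key, msg: bytes, sig: int) -> bool:
    n, e = pub
    return pow(sig, e, n) == _digest(msg, n)


def pub_bytes(pub: Key) -> bytes:
    return f"{pub[0]}:{pub[1]}".encode()


def block_key(kb_pub: Key) -> str:
    """block key = Hash(KBpub); SHA-1 is an assumed choice."""
    return hashlib.sha1(pub_bytes(kb_pub)).hexdigest()


@dataclass
class Certificate:
    ku_pub: Key     # user granted write access
    expires: int    # expiration date (abstract time units)
    owner_sig: int  # sign(KBpriv) over (KUpub, expiration date)


@dataclass
class UCB:
    kb_pub: Key
    inode: bytes
    timestamp: int
    cert: Certificate
    writer_sig: int  # sign(KUpriv) over (inode contents, timestamp)


def cert_payload(ku_pub: Key, expires: int) -> bytes:
    return pub_bytes(ku_pub) + b"|" + str(expires).encode()


def ucb_payload(inode: bytes, timestamp: int) -> bytes:
    return inode + b"|" + str(timestamp).encode()


def issue_certificate(kb_priv: Key, ku_pub: Key, expires: int) -> Certificate:
    """The file owner grants write access to the holder of KUpub."""
    return Certificate(ku_pub, expires,
                       sign(kb_priv, cert_payload(ku_pub, expires)))


def write_ucb(kb_pub: Key, cert: Certificate, ku_priv: Key,
              inode: bytes, timestamp: int) -> UCB:
    """A certified user signs a new version of the inode."""
    return UCB(kb_pub, inode, timestamp, cert,
               sign(ku_priv, ucb_payload(inode, timestamp)))


def check_ucb(key: str, ucb: UCB, now: int) -> bool:
    """Checks a storage node or reader could run before accepting a UCB."""
    c = ucb.cert
    return (block_key(ucb.kb_pub) == key
            and now <= c.expires
            and verify(ucb.kb_pub, cert_payload(c.ku_pub, c.expires),
                       c.owner_sig)
            and verify(c.ku_pub, ucb_payload(ucb.inode, ucb.timestamp),
                       ucb.writer_sig))
```

In this sketch the owner never hands out KBpriv: a writer only needs a certificate and their own key pair, and letting the certificate expire is what revokes access.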

Page 19: Pastis – Update handling

File update:
- insert the new file contents (CHBs)
- reinsert the file inode (UCB)
- replace data blocks (CHBs)

[Figure, initial state: UCB1 is the directory inode (metadata, @CHB1, @CHB2, …, plus @CHBi, @CHBii, @CHBiii). CHB1 holds the directory contents (file1 @UCB2, file2 @UCB3, file3 …). UCB2 is the file inode (metadata, @CHB3, …). CHB3 holds the file contents "foo".]

Page 20: Pastis – Update handling

File update:
- insert the new file contents (CHBs)
- reinsert the file inode (UCB)
- replace data blocks (CHBs)

Step shown: insert new CHB into the DHT

[Figure: same structures as on page 19 (UCB1, CHB1, UCB2, CHB3). A new block CHB4 holding the new file contents "foo bar" is added; the file inode still points to @CHB3.]

Page 21: Pastis – Update handling

File update:
- insert the new file contents (CHBs)
- reinsert the file inode (UCB)
- replace data blocks (CHBs)

Step shown: update file inode to point to new CHB

[Figure: same structures as on page 19 (UCB1, CHB1, UCB2, CHB3), plus CHB4 holding the new file contents "foo bar". The file inode now lists @CHB4 instead of @CHB3.]

Page 22: Pastis – Update handling

File update:
- insert the new file contents (CHBs)
- reinsert the file inode (UCB)
- replace data blocks (CHBs)

Step shown: reinsert inode UCB into the DHT

[Figure: same structures as on page 19 (UCB1, CHB1, UCB2, CHB3), plus CHB4 holding the new file contents "foo bar". The file inode UCB2, which now lists @CHB4, is reinserted.]
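The update sequence of pages 19 to 22 can be sketched as three steps. This is a simplified model, not the Pastis code: a dictionary stands in for the DHT, SHA-1 is an assumed hash, signatures are omitted, and the inode layout (a "version" counter and a "blocks" list) and the names `update_file` and `read_file` are hypothetical.

```python
import hashlib
from typing import Any, Dict, List


def put_chb(dht: Dict[str, Any], contents: bytes) -> str:
    """Insert an immutable block under the hash of its contents."""
    key = hashlib.sha1(contents).hexdigest()
    dht[key] = contents
    return key


def update_file(dht: Dict[str, Any], inode_key: str,
                new_blocks: List[bytes]) -> None:
    # Step 1: insert the new file contents as new CHBs.
    new_keys = [put_chb(dht, block) for block in new_blocks]
    # Step 2: update a copy of the inode so it points to the new CHBs.
    inode = dict(dht.get(inode_key, {"version": 0, "blocks": []}))
    inode["blocks"] = new_keys
    inode["version"] += 1
    # Step 3: reinsert the inode (UCB) under the same, unchanged key.
    dht[inode_key] = inode


def read_file(dht: Dict[str, Any], inode_key: str) -> bytes:
    return b"".join(dht[k] for k in dht[inode_key]["blocks"])
```

Since the inode key never changes, the directory entry that points to the file does not need to be rewritten; in this sketch the old CHB simply stays in the store.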

Page 23: Pastis – Consistency

Strict consistency → too expensive, requires too many network accesses

Close-to-open consistency:
- open(): returns the latest version of the file committed by close()
- between open() and close(): user only sees his own updates
- defer writes until file is closed

[Figure, timeline of clients A and B on a file that initially holds ‘1’: one client opens the file and writes ‘2’; the write is cached until close (CHBs and inode UCB are stored in a local buffer), so the other client still reads ‘1’. On close, write ‘2’ is sent to the network (CHBs and UCB are inserted into the DHT). A later open by the other client reads ‘2’: a “close-to-open” path makes updates visible. B retrieves inode from the DHT.]

Still quite expensive: an open requires retrieving the most up-to-date inode replica
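The close-to-open rules can be illustrated with a small model in which a dictionary plays the role of the DHT and each open file keeps a local buffer. The class name `OpenFile` is hypothetical, and fetching "the latest committed version" is reduced to a dictionary read, which hides the cost of finding the most up-to-date replica.

```python
from typing import Dict


class OpenFile:
    """One client's view of a file between open() and close()."""

    def __init__(self, dht: Dict[str, bytes], name: str) -> None:
        # open(): fetch the latest version committed by a close().
        self.dht = dht
        self.name = name
        self.data = dht.get(name, b"")
        self.dirty = False

    def read(self) -> bytes:
        # Reads are served from the local copy, including own writes.
        return self.data

    def write(self, data: bytes) -> None:
        # Writes are buffered locally and not sent to the network yet.
        self.data = data
        self.dirty = True

    def close(self) -> None:
        # close(): only now are the buffered updates inserted in the DHT.
        if self.dirty:
            self.dht[self.name] = self.data
            self.dirty = False
```

A reader that opened the file before the writer's close keeps seeing the old contents until it opens the file again.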

Page 24: Pastis – Consistency

Read-your-writes consistency:
- relaxation of the close-to-open model
- read() must reflect previous local writes only
- writes from other clients may or may not be visible

[Figure, timeline of clients A and B: a read must reflect previous local writes (a client that wrote ‘2’ reads ‘2’ after reopening the file), while A’s read may not reflect B’s writes (a client may still read ‘1’ after the other client’s close).]

An open does not require retrieving the most up-to-date inode replica; it just fetches one inode replica not older than those accessed previously.
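The difference between the two open() rules can be sketched as follows, assuming each inode replica carries the timestamp stored in its UCB. The names `open_cto` and `RywClient` are hypothetical, and replicas are passed in as a list in the order a client would happen to contact them.

```python
from typing import Dict, List, Optional, Tuple

Replica = Tuple[int, bytes]  # (timestamp, inode contents)


def open_cto(replicas: List[Replica]) -> Replica:
    """Close-to-open: inspect every replica and keep the most recent."""
    return max(replicas, key=lambda r: r[0])


class RywClient:
    """Read-your-writes: remember the newest timestamp seen per inode."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, int] = {}

    def note_write(self, key: str, timestamp: int) -> None:
        # Called after a local write so later opens reflect it.
        self.last_seen[key] = max(self.last_seen.get(key, 0), timestamp)

    def open(self, key: str, replicas: List[Replica]) -> Optional[Replica]:
        # Accept the first replica that is not older than what this
        # client has already read or written; skip stale ones.
        floor = self.last_seen.get(key, 0)
        for timestamp, inode in replicas:
            if timestamp >= floor:
                self.last_seen[key] = timestamp
                return timestamp, inode
        return None
```

A client with no history may accept a stale replica on the first contact, whereas the close-to-open rule must look at all replicas to be sure it has the latest one.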

Page 25: Class materials and Bibliography

Everything at http://www-sop.inria.fr/members/Frederic.Giroire/enseignement/p2p/

- Slides
- Paper and technical report: Pastis: a Highly-Scalable Multi-User Peer-to-Peer File System, Busca, Picconi, and Sens, Euro-Par 2005.

Page 26: Evaluation

Page 27: Evaluation

Prototype programmed in Java
- Client interface: NFS, FUSE
- Test program: Andrew Benchmark
  - Phase 1: create subdirectories
  - Phase 2: copy files
  - Phase 3: read file attributes
  - Phase 4: read file contents
  - Phase 5: make command

Emulation
- LAN with one DHT node per machine
- Dummynet router emulates WAN latencies

Simulation
- discrete event simulator: LS3
- simulates overlay network latency

Page 28: Pastis performance with concurrent clients

[Figure: normalized execution time [sec.] (axis from 0.98 to 1.14) versus number of active users (0 to 16), for Pastis and Ivy. Every user is reading and writing to the FS, each running an independent benchmark.]

Configuration:
- 16 DHT nodes
- 100 ms constant inter-node latency (Dummynet)
- 4 replicas per object
- close-to-open consistency

Ivy’s read overhead increases rapidly with the number of users (the client must retrieve the records of more logs)

Page 29: Pastis consistency models

[Figure: execution time [sec.] (axis from 0 to 300) per Andrew Benchmark phase: Phase 1 (dirs), Phase 2 (write), Phase 3 (attr.), Phase 4 (read), Phase 5 (make) and Total, for Pastis (close-to-open), Pastis (read-your-writes) and NFSv3.]

Configuration:
- 16 DHT nodes
- 100 ms constant inter-node latency (Dummynet)
- 4 replicas per object

Performance penalty compared to NFS (close-to-open):

Pastis                  1.9
Ivy [OSDI’02]           2.0 – 3.0
OceanStore [FAST’03]    2.55

Page 30: Evaluation: consistency models

N = 32768, sphere topology, max. latency: 300 ms, k = 16

[Figure: results for CTO, RYW, and RYW with 10% stale UCB replicas]

Page 31: Conclusion

Pastis
- simple
- completely decentralized (cf. OceanStore)
- scalable number of users (cf. Ivy)
- good performance thanks to:
  - PAST-Pastry’s locality properties
  - relaxed consistency models (close-to-open, read-your-writes)

Future work
- explore new consistency models
- flexible replica location
- evaluation in a wide-area testbed (PlanetLab)

Page 32: Links

Pastis : http://regal.lip6.fr/projects/pastis

Pastry, Past : http://freepastry.rice.edu

LS3 : http://regal.lip6.fr/projects/pastis/ls3

Page 33: Questions?

Page 34: Pastis design

[Figure: the Pastis FS runs on top of the Past / Pastry overlay, which spans the Internet. Blocks distribution: each block is stored on a root node and replicated.]