1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent...

37
1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens Dec. 7, 2006 Laboratoire d’Informatique de Paris 6 (LIP6 Computer Science Lab.)

Transcript of 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent...

Page 1: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

1

Phenix Workshopon

Global Computing Systems

Pastis, a peer-to-peer file systemfor persistent large-scale storage

byFabio Picconi

advisorPierre Sens

Dec. 7, 2006 Laboratoire d’Informatique de Paris 6 (LIP6 Computer Science Lab.)

Page 2: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

2

1. Introduction

2. Peer-to-peer file systems DHTs previous designs

3. Pastis file system data structures, updates, consistency

4. Experimental evaluation Pastis prototype performance

5. Predicting durability Markov chain model parameter estimation problem

6. Conclusions

Outline

Page 3: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

3

P2P storage system

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Goalslarge aggregate capacity

low cost (shared ownership)

high durability (persistence)

high availability

self-organization

Page 4: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

4

P2P storage system (cont.)

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Challengeslarge-scale routing

high latency, low bandwidth

node disconnections/failures

consistency

security

Page 5: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

5

Distributed file systems

Client-server P2P

LAN

(100)

NFS -

Organization(10.000)

AFS FARSITE

Pangaea

Internet(1.000.000)

- Ivy *Oceanstore *

scalability(number of nodes)

architecture

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

* uses a Distributed Hash Table (DHT) to store data

Page 6: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

6

Distributed Hash Tables

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Persistent storage servicesimple API

put(key, object)

object = get(key)

decentralization

scalability (e.g., O( log N ))

fault-tolerance (replication)

self-organization

04F2

3A79

5230

8909

8954

8957

AC78

C52A

E25A

put(8955,obj)

overlay address space[0000,FFFF]

620F

85E0

obj

obj

obj

replicas

Page 7: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

7

DHT-based file systems

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Ivy [OSDI’02]log-based, one log per user

fast writes, slow reads

limited to small number of users

Oceanstore [FAST’03]updates serialized by primary replicas

partial centralization

BFT agreement requires well-connected primary replicas

primary replicas

secondary replicas

User A’s log

User B’s log

User C’s log

DHTblock

DHTblock

DHTblock

Page 8: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

8

P2P file systems

decentralization Internet scale large numberof users

Ivy X X

Oceanstore X X

FARSITE X

Pangaea X X

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

none of the systems presents all three characteristics

Page 9: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

9

PASTIS

FILE SYSTEM

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Page 10: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

10

Pastis file system

Characteristicsdecentralization

scalable•number of nodes•number of users

high durability

performance

security

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

the Pastis layer abstracts from most problems due to distribution and large-scale

open, read,

write, lseek,

etc.

application

Pastis

PAST DHT

put, getpersistencefault-toleranceself-organization

directoriesPOSIX APIuser groups,

ACLs

Page 11: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

11

Pastis data structures

Data structures similar to the Unix file system inodes are stored in modifiable DHT blocks (UCBs) file contents are stored in immutable DHT blocks (CHBs)

metadata

block addresses

UCB

file inode

CHB1

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

CHB2

file contents

file contentsUCB

CHB1

CHB2replica sets

DHT address space

Page 12: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

12

Pastis data structures (cont.)

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

directories contain <file name, inode key> entries use indirect blocks for large files

metadata

block addresses

UCB

directory inode

CHB

file1, key1file2, key2

…metadata

block addresses

UCB

file1 inode

CHB

oldcontents

CHB

indirectblock

CHB

filecontents

CHB

oldcontents

CHB

filecontents

Page 13: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

13

DHT blocks used by Pastis

Content Hash Block key = Hash(contents) self-verifying the block is immutable

contents

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

User Certificate Block contents signed by writer certificate proves write access the block is modifiable

contents

version numbersign(Kwriter_private)

Kwriter_public

Kblock_public

expiration date

sign(Kblock_private)

certificate

UCB key:H(Kblock_public)

CHB key:

H(contents)

used for consistency

Page 14: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

14

Pastis updates

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

User modifies the file contents DHT access retrieve CHB storing old file contents → CHB get a new CHB is created and stored in the DHT → CHB put the pointer in the file inode is updated → UCB put

metadata

version 1

block addresses

UCB

file inode

CHB1

oldcontents

CHB2

newcontents

* old CHB is kept in the DHT to allow relaxed consistency models

*

metadata

version 2

block addresses

version is increased

block storing modified offset

Page 15: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

15

Pastis consistency models

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Close-to-open [AFS] defer update propagation until file is closed file open must return the most recently closed version in the system ignore updates from other clients while file is open

Implementation• file open: fetch all inode replicas from DHT and keep most recent version• file close: update inode replicas in the DHT

Read-your-writes [Bayou] file open must return a version consistent with previous local writes

Implementation:• file open: fetch one inode replica not older than those accessed previously

read-your-writes useful when files are seldom shared

Page 16: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

16

PASTIS

EVALUATION

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Page 17: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

17

Pastis architecture

Prototype uses FreePastry (PAST open-source impl.) two interfaces for applications

• NFS loop-back server• Java Class

~ 10.000 lines of Java

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

prototype architecture

NFSloop-back

client

PAST

Pastry

LS3 simulator TCP/UDP

Javaapplication

FreePastry

Pastis

Page 18: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

18

Pastis evaluation

Emulation prototype runs on a local cluster router adds constant latency between nodes controlled environment

LS3 Simulator [Busca] intercept calls to transport layer, adds latency (sphere topology) executes the same upper layer code up to 80.000 DHT nodes with 2 GB RAM

Andrew BenchmarkPhase 1: create subdirectoriesPhase 2: write to filesPhase 3: read file attributesPhase 4: read file contentsPhase 5: execute “make” command

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

router N1 N2 N3 N4

Page 19: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

19

Pastis performance with concurrent clients

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Configuration

16 DHT nodes

100 ms constant inter-node latency (Dummynet)

4 replicas per object

close-to-openconsistency

Ivy’s read overhead increases rapidly with the number of users

(the client must retrieve the records of more logs)

0.98

1

1.02

1.04

1.06

1.08

1.1

1.12

1.14

0 2 4 6 8 10 12 14 16

Number of active users

Pastis Ivynormalizedexecution

time[sec.]

(each running an independent benchmark)every userwriting to FS

Page 20: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

20

Pastis consistency models

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Configuration

16 DHT nodes

100 ms constant inter-node latency (Dummynet)

4 replicas per object

Pastis 1.9

Ivy [OSDI’02] 2.0 – 3.0

Oceanstore [FAST’03] 2.55

0

50

100

150

200

250

300

Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Total

Andrew Benchmark phase

execution time[sec.]

performance penalty compared to NFS(close-to-open)

Pastis (close-to-open) Pastis (read-your-writes) NFSv3

(dirs) (write) (attr.) (read) (make)

Page 21: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

21

0

250

500

750

1000

4 5 6 7 8 9 10 11 12

Replication factor

read phase

write phase

total (5 phases)

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Configuration

100 DHT nodes

transit-stub topology (Modelnet)200 AS, 50 stubs

node bandwidth:10 MBit/s

close-to-openconsistency

Impact of replication factor

execution time[sec.]

higher replication factor → more replicas transmitted when inserting

a CHB/UCB or when retrieving an UCB

Page 22: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

22

0

50

100

150

200

250

300

350

400

16 64 256 1024 4096 16384

Number of DHT nodes

0

1

2

3

4

5

execution time

hop count

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Configuration

sphere topology(LS3 simulator)300 ms. max latency

16 replicas perobject

close-to-openconsistency

Impact of network size

execution time[sec.]

reason: in the PAST DHT, the lookup latencyis mainly due to the last hop

Page 23: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

23

Pastis evaluation summary

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

graceful degradation with number of concurrent users (cf. Ivy) slightly faster than Ivy and Oceanstore with one active Andrew Bench. read-your-writes model 30% faster than close-to-open latency increases with replication factor latency remains constant with DHT size

Page 24: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

24

PREDICTING

DURABILITY

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Page 25: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

25

Durability vs. availability

Data availability metric: percentage of time that data is accessible (e.g, 99.9%) depends on the rate of replica temporary disconnections

Data durability metric: probability of data loss (e.g., 0.1% after one year from insertion) depends on the rate of disk failures data can be stored durably despite being temporarily unavailable

(i.e., all replicas off-line)

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

both problems studied in thesis, will only discuss durability in talk

Page 26: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

26

Durability in P2P systems

Problem: restoring a crashed hard disk may take a long time

Example: 300 GB per node, 1 Mbit/s connection

Time needed to restore the hard disk contents: 1 month

tr1tc1 tc2

Replica 1

Replica 2

irrecoverable data loss

Classical solution replicate objects on different nodes (avoid correlated failures) restore replicas after a disk crash

1 month

t

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Page 27: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

27

Analytical model for durability

Goals create an analytical model to predict the probability of data loss

(instead of using simulations)

Benefits model can be used to determine the minimum number of replicas

to achieve a given level of durability model avoids new simulations each time system parameters change

(repair bandwidth, node capacity, etc.)

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Page 28: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

28

Using Markov chains to predict durability

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Model the number of replicas present on the network

the chain models one data object (replicated k times) k+1 states, 0 to k state transitions:

• replica failure → from state i to i-1 with rate λi

• replica repair → from state i to i+1 with rate µi

state probabilities for t = 0: Pk(0) = 1, Pi(0) = 0 for i < k

PLOSS(t) = P0(t) (prob. of reaching the absorbing state 0 at time t)

2 1 0

λ2 λ1

μ1

. . .k

λk

μk-1

k-1 k-2

μk-2

λk-1

Page 29: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

29Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Estimating the model parameters

Express transition rates using coefficients αi and βi

Failure rates

state 1 (one live replica)

λ = 1 / MTBF

state i (i live replicas)

βi = i

Repair rates

state i = k-1 (one missing replica)µ = 1 / Trepair Trepair = ?

state i = k-m (m missing replicas)

αm = ?

What are the values of Trepair and αm?

2 1 0

β2 λ λ

αk-1 μ

. . .k

βk λ

μ

k-1 k-2

α2 μ

βk-1 λ

Page 30: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

30

0

1

2

3

4

5

0 1 2 3 4 5

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Ramadhadran et al. [INFOCOM’06] αm = m (m = number of missing replicas)

no expression for Trepair

(assumed to be configurable)

Chun et al. [NSDI’06] αm = 1

b node capacity [bytes]

bw node bandwidth [bytes/sec]

Trepair = b / bw

m

αm

Parameters proposed by other authors

Ramadhadran et al.

Chun et al.

Are these expressions for Trepair and αm accurate?

too simplisticnumber of missing replicas

Page 31: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

31Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Our parameter estimation

Analytical estimation of Trepair

multiple crashes• a node may crash again before it finishes restoring its disk

bandwidth sharing• the uploader’s bandwidth bw may be shared among several downloaders• this effect increases with a smaller MTBF

(crashes are more frequent → more nodes are in the restoring state)

Our value of Trepair depends on three parameters: b, bw, MTBF

Estimation for αm

the coefficients αm are very hard to obtain analytically

measure αm with a simulator

Page 32: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

32Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Simulator

DHT protocol simplified version of the PAST DHT object replicas stored on ring neighbors crashed node restores objects sequentially object is downloaded from random source

Assumptions symmetric node bandwidth node inter-failure times drawn from exponential distribution

with mean MTBF high node availability (temporary disconnections ignored)

conservativechoice

Page 33: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

33Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

bw = 1.5 Mbit/s MTBF = 2 months k = {3,7} b = variable

Validation of our expression for Trepair

Trepair [days] estimation error

0%

25%

50%

75%

100%

0 100 200 300 400 500

b [GB per node]

our estimation (k=3)our estimation (k=7)Chun et al. (k=3)Chun et al. (k=7)

0

6

12

18

24

30

36

42

48

0 100 200 300 400 500

b [GB per node]

simul (k=3)simul (k=7)our estimationChun et al.

Page 34: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

34

0

1

2

3

4

5

6

7

8

9

0 1 2 3 4 5 6 7 8

k = 5

k = 7

k = 9

Ramadhadran et al.

Chun et al.

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

αm

b = 250 GB per node bw = 1.5 Mbit/s MTBF = 2 months k = {5,7,9}

m2

)1(

:solvingby obtained constant

11)(1

kkf

c

ecmf c

m

m

Measured values of αm

f(m)

Empirical function f(m)

f(m) can be used to obtain the model parameters fromthe system characteristics b, bw, k, and MTBF

plateau due to fewer remaining replicasto download the object from

Page 35: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

35

0

0.025

0.05

0.075

0 1 2 3 4

simul

Ramadhadran et al.

Chun et al.f(i)

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Predicted loss probability

b = 250 GB per node bw = 1.5 Mbit/s MTBF = 2 months k = 5

PLOSS(t)

time after insertion [years]

t

Page 36: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

36

Conclusion

Pastis: new P2P file system completely decentralized scalable number of nodes and users

relaxed consistency models close-to-open, read-your-write

performance equal or better than Ivy and Oceanstore

durability prediction use Markov chains to estimate probability of object loss previous model parameters are inaccurate

more accurate parameter set based on new analytical expression empirical data

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions

Page 37: 1 Phenix Workshop on Global Computing Systems Pastis, a peer-to-peer file system for persistent large-scale storage by Fabio Picconi advisor Pierre Sens.

37

Future work

Pastis explore new consistency models flexible replica location evaluation in a wide-area testbed (Planetlab)

Durability refine model to consider

other DHT protocols impact of temporary node disconnections

Introduction – P2P file systems – Pastis – Evaluation – Durability - Conclusions