2011 Storage Developer Conference. © Cleversafe. All Rights Reserved.
Changing Requirements for Distributed File Systems in Cloud Storage
Wesley Leggette, Cleversafe
Presentation Agenda

- About Cleversafe
- Scalability, our core driver
- Object storage as basis for filesystem technology
  - Namespace-based routing
  - Distributed transactions
  - Optimistic concurrency
- Designing an ultra-scalable filesystem
  - Filesystem operations on object layer
- Conclusions
About Cleversafe

- We offer scalable storage solutions
  - Target market is massive storage (>10 PiB)
- Information Dispersal Algorithms (erasure codes)
  - Reduce cost by avoiding replication overhead
  - Maximize reliability by tolerating many failures
- Object storage is our core product offering
- How do we translate this technology to the filesystem space?
  - Evolution from object storage concepts
  - Also influenced by distributed databases and P2P
  - The techniques we investigate are not unique to IDA
How Dispersed Storage Works

[Diagram: digital content sliced and dispersed across Sites 1–4]

1. Digital assets are divided into slices using Information Dispersal Algorithms
2. Slices are distributed to separate disks, storage nodes, and geographic locations
3. A threshold number of slices are retrieved and used to regenerate the original content

Total slices = "width" = N
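The threshold property above can be sketched with a toy scheme: three data slices plus one XOR parity slice give width N=4 and threshold T=3, so any three slices rebuild the original. This is an illustrative stand-in, not Cleversafe's actual IDA codec (real codecs also record the original length instead of relying on zero padding).

```python
def disperse(data: bytes) -> list:
    """Split data into 3 data slices and append one XOR parity slice."""
    data = data + b"\x00" * (-len(data) % 3)   # pad to a multiple of 3
    step = len(data) // 3
    slices = [data[i * step:(i + 1) * step] for i in range(3)]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*slices))
    return slices + [parity]

def rebuild(slices: list) -> bytes:
    """Recover the data from any 3 of the 4 slices (missing slice = None)."""
    missing = [i for i, s in enumerate(slices) if s is None]
    assert len(missing) <= 1, "below threshold: data is lost"
    if missing:
        present = [s for s in slices if s is not None]
        # XOR of the three surviving slices reconstructs the missing one.
        slices[missing[0]] = bytes(a ^ b ^ c for a, b, c in zip(*present))
    return b"".join(slices[:3]).rstrip(b"\x00")  # sketch only: strips padding
```

Losing any single slice (data or parity) is survivable; losing two is not, which is exactly the "threshold number of slices" behavior described above.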
Access Methods

We sell two deployment models:

- Simple Object HTTP (Accesser appliance)
  - The Accesser exposes an HTTP REST API
  - Multiple Accessers can be load balanced for increased throughput and availability
  - The Accesser returns a unique 36-character object ID
  - The application server keeps a database that stores metadata (object IDs)
- Simple Object Client Library (Java)
  - Accesser function is embedded into the client
  - Accesser functionality, including slicing and dispersal, is contained within the client library

Both talk to object vaults over the dsNet protocol. These are the "clients" in the context of this presentation.
Scalability – A Primary Requirement

- Big Data customers are petabyte to exabyte scale
- Scale-out architecture
  - Add storage capacity with commodity machines
  - Reduce costs: commodity hard drives
- Invariants
  - Reliability – keep data even as cheap disks fail
  - Availability – access data during node failures
  - Performance – linear performance growth
Scale Example

- Shutterfly
  - 10 PB Cleversafe dsNet storage system
  - All commodity hard drives
  - Single storage container for all photos
  - Tens of thousands of large photos stored per minute
  - Max capacity is many times this level
- 14 access nodes for load-balanced read/write
  - No single point of failure
  - Linear performance growth with each new node
- This uses the object storage product
Investigating Filesystem Space

- We have scalable object storage
  - Limitless capacity and performance growth
  - Fully concurrent read/write
- Some customers want the same with a filesystem
- Is this technically possible?
- What tradeoffs would have to be made?
Scale comes from homogeneity

- To scale out, we need to do so at each layer
  - Eliminate the central chokepoint for data operations
  - Central point of failure, central point of…
- We accomplish this today with object storage
- Consider the same concept in a filesystem

[Diagram: many clients talking directly to many storage nodes is scalable;
routing every client through a single metadata node is not scalable]
What approach can we take?

- Start with scalable transactional object storage
- Add a filesystem implementation on top

Layer stack (top to bottom):

- Filesystem
- Object Layer – transactional object storage with check-and-write transactions
- Reliability Layer – IDA + distributed transaction client; ensures committed objects are reliable and consistent
- Namespace Layer – namespace-based storage routing; routes actual data storage with no central I/O manager
- Remote Session – session management (multi-path)
- Slice Servers
Namespace Layer
Traditional Centralized Routing

- A central controller directs traffic
  - Easier to implement, allows simple search
  - Can detect conflicts and control locking
- Does not scale out with the rest of the architecture
  - Today, a 10 PB system needs 90 45-disk nodes*
  - These nodes can service 57,600 2 MB req/s**
  - A routing master handling 10,000 req/s, against nodes serving 640 req/s each, caps the system at about 15 servers
- Central point of failure = less availability

* 3 TB drives, some IDA overhead
** 10 Gbps NIC, nodes saturate wire speed
Namespace-based Routing

- Namespace concept comes from P2P systems
  - Chord, CAN, Kademlia
  - MongoDB, CouchDB are production examples
- Physical mapping is determined by a storage map
  - Small data (<10 KiB) loaded at start-up
- P2P systems use a dynamic overlay protocol
  - We'll have tens of thousands of nodes, not millions

[Diagram: Slicestors A–H arranged on a circular namespace starting at
index 0; a 4-wide vault occupies indices 0–3, an 8-wide vault indices 0–7]
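A minimal sketch of namespace-based routing under these assumptions: an object's name is hashed into a fixed circular namespace, and each slice index maps to a Slicestor using only the small storage map loaded at start-up, so no central lookup sits on the I/O path. The node names, hash choice, and placement rule are illustrative, not Cleversafe's actual scheme.

```python
import hashlib

# Storage map loaded at start-up (hypothetical 4-wide vault).
STORAGE_MAP = ["Slicestor A", "Slicestor B", "Slicestor C", "Slicestor D"]

def namespace_index(source_name: str, ranges: int) -> int:
    """Hash a source name onto the circular namespace [0, ranges)."""
    digest = hashlib.sha256(source_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % ranges

def route(source_name: str) -> list:
    """Return (slice index, Slicestor) pairs for every slice of the object.

    Each slice lands on a distinct node, walking the ring from the
    object's starting index.
    """
    width = len(STORAGE_MAP)
    start = namespace_index(source_name, width)
    return [(i, STORAGE_MAP[(start + i) % width]) for i in range(width)]
```

Because the mapping is a pure function of the name and the storage map, every client computes the same placement independently, which is what eliminates the routing master.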
Storing Data in a Namespace

- No central lookup for data I/O:
  1. Generate an "object id"
  2. Map it to storage
- With object storage, object id → database
- How do we map a file name to an object id?

[Diagram: object id → source name → slice names, which the storage map
resolves to Slicestors]
Reliability Layer
Replication and Eventual Consistency

- Eventual consistency is often used with replication
  - Client writes new versions to available nodes
  - Versions sync to other replicas lazily
- Application is responsible for consistency
  - Already true in filesystems
  - Allows partition-tolerant systems

[Diagram: Client A copies V2 to two of three replicas; repair later
propagates V2 to the third, and until then Client B's read sees the old
version V1]
Dispersal Requires Consistency

- Dispersal doesn't store replicas
  - A threshold of slices is required to recover data
  - A crash during "unsafe" periods can cause loss
- Methods to prevent loss
  - Three-phase distributed transaction
    - Commit: all revisions visible during the unsafe period
    - Finalize: cleanup once the new version's commit is safe
  - Quorum-based voting
    - Writes fail if fewer than T slices succeed

[Diagram: width 4, threshold 3. Rolling out V2 slice by slice over time:
V1 V1 V1 V1 (safe), V2 V1 V1 V1 (safe), V2 V2 V1 V1 (UNSAFE – neither
revision has a threshold of slices), V2 V2 V2 V1 (safe)]
Three-Phase Commit Protocol

[Diagram: width 4, threshold 3]

2-phase commit protocol:

1. WRITE – V2 written alongside V1 on all servers
2. COMMIT – if the commit reaches only some servers before a failure,
   neither V1 nor V2 holds a threshold of slices: commit failure causes loss!

3-phase commit protocol:

1. WRITE
2. COMMIT – both revisions remain visible
3. FINALIZE/UNDO – the old revision is removed only after the new
   revision is safely committed; on failure, undo rolls back to V1
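The client-driven protocol above can be sketched as follows: WRITE makes the new revision durable but keeps the old one, COMMIT makes both revisions visible, and FINALIZE removes the old one only once the commit is safe. A quorum of at least T of N servers must acknowledge each phase. The class and method names here are illustrative, not the dsNet wire protocol.

```python
WIDTH, THRESHOLD = 4, 3   # as in the slide

class SliceServer:
    """Holds at most two revisions of one slice (illustrative stand-in)."""
    def __init__(self):
        self.finalized = "V1"   # last finalized (old) revision
        self.pending = None     # written but not yet committed
        self.committed = None   # committed but not yet finalized

    def write(self, rev):
        self.pending = rev
        return True

    def commit(self):
        self.committed, self.pending = self.pending, None
        return True

    def finalize(self):
        self.finalized, self.committed = self.committed, None
        return True

def three_phase_write(servers, rev):
    # Phase 1: WRITE – new revision stored alongside the old one.
    if sum(s.write(rev) for s in servers) < THRESHOLD:
        return False   # abort; every server still holds the old revision
    # Phase 2: COMMIT – both revisions visible during the unsafe period.
    if sum(s.commit() for s in servers) < THRESHOLD:
        return False   # undo can still roll back to the old revision
    # Phase 3: FINALIZE – clean up the old revision only once commit is safe.
    for s in servers:
        s.finalize()
    return True
```

Because the old revision survives until FINALIZE, a crash mid-commit leaves a recoverable state instead of the 2-phase loss scenario shown above.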
Consistent Transactional Interface

- The distributed transaction makes dispersal safe
  - All of it happens in the client; no server coordination
- Write consistency
  - A side effect of distributed transactions
  - Writes either succeed or fail "atomically"
- Limitation: consistency = less partition tolerance
  - CAP theorem (we also choose availability)
  - Either the read or the write fails during a partition
  - Still "shardable": affects availability, not scalability
- Is consistency useful for filesystem directories?
Object Layer
Write-if-absent for WORM

- Object storage is WORM
  - Enforced by the underlying storage
  - Write-if-absent model built on transactions
- Distributed transactions emulate atomicity
  - A "checked write" fails if a previous revision exists

Example:

- Client A: WRITE V1 IF PREVIOUS = ∅ → Success
- Client B: WRITE V1' IF PREVIOUS = ∅ → Failure (V1 already exists)
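A minimal in-memory sketch of the checked write, assuming a hypothetical store interface: the write carries the revision the client last observed (None for "absent"), and the store rejects the write if the stored revision differs. The same call with a non-empty `previous` is the multiple-revision extension discussed next.

```python
class ObjectStore:
    """In-memory stand-in for the WORM object layer (illustrative)."""
    def __init__(self):
        self.objects = {}

    def checked_write(self, oid, revision, previous=None):
        """Succeed only if the currently stored revision equals `previous`."""
        if self.objects.get(oid) != previous:
            return False   # another writer got there first
        self.objects[oid] = revision
        return True

store = ObjectStore()
assert store.checked_write("obj-1", "V1")                 # Client A: previous = ∅
assert not store.checked_write("obj-1", "V1'")            # Client B: V1 exists
assert store.checked_write("obj-1", "V2", previous="V1")  # revision-matched update
```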
Optimistic Concurrency Control

- It is easy to extend this model to multiple revisions
  - A write succeeds iff the last revision matches the one given
  - This is the basis for "optimistic concurrency"
- How do concurrent writers update a directory?

Example:

1. Client A: WRITE V1 IF PREVIOUS = ∅ → Success
2. Client A: WRITE V2 IF PREVIOUS = V1 → Success
3. Client B: WRITE V2' IF PREVIOUS = V1 → Failure (current is V2);
   Client B reads V2, redoes its action, then WRITE V3 IF PREVIOUS = V2 → Success
Filesystem Layer
Ultra-Scalable Filesystem Technology

- Filesystem layer on top of object storage
  - Scalable no-master storage
  - Inherits reliability, security, and performance
- Three questions to answer:
  - How do we map a file name to an object id?
  - Is consistency useful for filesystem directories?
  - How do concurrent writers update a directory?
Object-based directory tree

- How do we map a file name to an object id? Directories are stored as objects
  - Filesystem structure is as reliable as the data
- Directory content data is a map of file name to object id
  - The object id points to another object on the system
    - Id for content data
    - Id for metadata (xattr, etc.)
- Data objects are WORM
  - Zero-copy snapshot support
  - Reference counting
- Well-known object id for "root"
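A sketch of a directory stored as an object, under the assumption that its content is just a serialized map from file name to the ids of the content and metadata objects; the JSON encoding, field names, and root id value here are illustrative, not the actual on-disk format.

```python
import json

ROOT_ID = "root"   # stand-in for the well-known "root" object id

def encode_directory(entries):
    """Serialize {file name: {"content": oid, "metadata": oid}} as object data."""
    return json.dumps(entries, sort_keys=True).encode()

def lookup(directory_bytes, name):
    """Resolve a file name to its entry, or None if absent."""
    return json.loads(directory_bytes).get(name)

root = encode_directory({"photo.jpg": {"content": "obj-42", "metadata": "obj-43"}})
```

Path resolution is then repeated lookups: read the root directory object, look up the next path component to get a child object id, and recurse.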
Directory Internal Consistency

- Is consistency useful for filesystem directories?
  - The object layer allows "atomic" directory updates
  - This mimics the model used by traditional filesystems
- Content data is stored in separate "immutable" storage
  - Safe snapshot support
- Eventual consistency has temporary effects
  - Writes: orphaned data
  - Deletes: read errors
- Is consistency an absolute requirement? No.
Concurrency Requires Serialization

- How do concurrent writers update a directory?
  - Updates to directory entries are atomic (by definition)
  - More precisely, filesystem operations are serialized
  - Client A adds a file, Client B adds a file, Client C deletes a file:
    first to call wins, and the application must impose a sane order
- Kernels use mutexes (locks) for serialization
  - A master controller (pNFS, GoogleFS) does this
  - We want to use a "multiple/no master" model
- Distributed locking protocols exist (e.g., Paxos)
  - It's hard: the protocols are complex and have drawbacks
  - It's slow: overhead for every operation
Optimistic Concurrency

- We want to serialize without locking
- Observation: file writes have two steps
  - Write the data (long, no contention)*
  - Modify the directory (short, serialized)**
- Use checked writes for the directory
  - Always read the directory before writing
  - Write the new revision "if-not-modified-since"
  - On a "write conflict": re-read, replay, repeat

* Consider a workload where files are >1 MiB; we write content data in WORM storage
** Because directories are stored as objects themselves, modifying a directory is re-writing an object
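The re-read/replay/repeat loop above can be sketched as follows, using a hypothetical in-memory directory with revision-checked writes standing in for the object layer:

```python
class VersionedDirectory:
    """In-memory directory object with revision-checked writes (illustrative)."""
    def __init__(self):
        self.revision = 0
        self.entries = {}

    def read(self):
        return self.revision, dict(self.entries)

    def checked_write(self, expected_revision, entries):
        if self.revision != expected_revision:
            return False   # directory modified since we read it: write conflict
        self.revision += 1
        self.entries = entries
        return True

def add_file(directory, name, object_id):
    """Lockless update: read, apply the operation, write-if-not-modified, repeat."""
    while True:
        revision, entries = directory.read()
        entries[name] = object_id          # the short, serialized step
        if directory.checked_write(revision, entries):
            return
```

Only the small directory object is retried on conflict; the large content data was already written contention-free and is never rewritten.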
Lockless Directory Update

- Optimistic concurrency guarantees serialization
  - Operations are simple ("add file"), so replay is trivial
  - On conflict, operation replay semantics are clear
- Content data (large) is not rewritten on conflict
  - Highly parallelizable
- Potentially unbounded contention latency
  - A back-off protocol can help
  - Not good for high directory contention use cases
Conclusions

- Advantages
- Limitations
- Final Thoughts
Advantages

- Scalability and performance
  - Content data I/O is quick and contention-free
  - No-master concurrent read and write
  - Linearly scalable performance
- Availability
  - Load balancing without complicated HA setups
- Reliability
  - Information dispersal
  - Both data and metadata have the same reliability
  - No separate "backup" required for an index server
Limitations

- Optimistic concurrency is sensitive to high contention
- Cache requirements limit directory size
  - No intrinsic limit, but a 100 MiB directory object?
- No central master makes explicit file locking hard
  - SMB and NFS protocols support such locks
- Not suitable for random-write workloads
- Not suitable for majority-small-file workloads
  - Directory write times eclipse file write times
- Requires a separate index service for search
Final Thoughts

- Significant advances come from the P2P and NoSQL space
- Three key techniques allow for an ultra-scalable FS:
  - Namespace-based routing
  - Distributed transactions using quorum voting and 3-phase commit
  - Optimistic concurrency using checked writes
- The techniques are usable with IDA or replicated systems
- The filesystem would not be general purpose
  - The techniques have some trade-offs
  - Excellent for specific big data use cases