2011 Storage Developer Conference. © Cleversafe. All Rights Reserved.
Changing Requirements for Distributed File Systems in Cloud Storage
Wesley Leggette, Cleversafe
Presentation Agenda

- About Cleversafe
- Scalability, our core driver
- Object storage as basis for filesystem technology
  - Namespace-based routing
  - Distributed transactions
  - Optimistic concurrency
- Designing an ultra-scalable filesystem
  - Filesystem operations on object layer
- Conclusions
About Cleversafe

- We offer scalable storage solutions
  - Target market is massive storage (>10 PiB)
- Information Dispersal Algorithms (erasure codes)
  - Reduce cost by avoiding replication overhead
  - Maximize reliability by tolerating many failures
- Object storage is our core product offering
- How do we translate this technology to the filesystem space?
  - Evolution from object storage concepts
  - Also influenced by distributed databases and P2P
  - The techniques we investigate are not unique to IDA
How Dispersed Storage Works

[Diagram: digital content sliced and dispersed across Sites 1–4]

1. Digital assets are divided into slices using Information Dispersal Algorithms
2. Slices are distributed to separate disks, storage nodes, and geographic locations
3. A threshold number of slices are retrieved and used to regenerate the original content

Total slices = "width" = N
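The threshold property above can be sketched with a toy scheme: three data slices plus one XOR parity slice give width N=4 and threshold T=3, so any three slices rebuild the original. This is an illustrative stand-in, not Cleversafe's actual IDA codec (real codecs also record the original length instead of relying on zero padding).

```python
def disperse(data: bytes) -> list:
    """Split data into 3 data slices and append one XOR parity slice."""
    data = data + b"\x00" * (-len(data) % 3)   # pad to a multiple of 3
    step = len(data) // 3
    slices = [data[i * step:(i + 1) * step] for i in range(3)]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*slices))
    return slices + [parity]

def rebuild(slices: list) -> bytes:
    """Recover the data from any 3 of the 4 slices (missing slice = None)."""
    missing = [i for i, s in enumerate(slices) if s is None]
    assert len(missing) <= 1, "below threshold: data is lost"
    if missing:
        present = [s for s in slices if s is not None]
        # XOR of the three surviving slices reconstructs the missing one.
        slices[missing[0]] = bytes(a ^ b ^ c for a, b, c in zip(*present))
    return b"".join(slices[:3]).rstrip(b"\x00")  # sketch only: strips padding
```

Losing any single slice (data or parity) is survivable; losing two is not, which is exactly the "threshold number of slices" behavior described above.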
Access Methods

We sell two deployment models:

- Simple Object HTTP (Accesser appliance)
  - The Accesser exposes an HTTP REST API
  - Multiple Accessers can be load balanced for increased throughput and availability
  - The Accesser returns a unique 36-character object ID
  - The application server keeps a database that stores metadata (object IDs)
- Simple Object Client Library (Java)
  - Accesser function is embedded into the client
  - Accesser functionality, including slicing and dispersal, is contained within the client library

Both talk to object vaults over the dsNet protocol. These are the "clients" in the context of this presentation.
Scalability – A Primary Requirement

- Big Data customers are petabyte to exabyte scale
- Scale-out architecture
  - Add storage capacity with commodity machines
  - Reduce costs: commodity hard drives
- Invariants
  - Reliability – keep data even as cheap disks fail
  - Availability – access data during node failures
  - Performance – linear performance growth
Scale Example

- Shutterfly
  - 10 PB Cleversafe dsNet storage system
  - All commodity hard drives
  - Single storage container for all photos
  - Tens of thousands of large photos stored per minute
  - Max capacity is many times this level
- 14 access nodes for load-balanced read/write
  - No single point of failure
  - Linear performance growth with each new node
- This uses the object storage product
Investigating Filesystem Space

- We have scalable object storage
  - Limitless capacity and performance growth
  - Fully concurrent read/write
- Some customers want the same with a filesystem
- Is this technically possible?
- What tradeoffs would have to be made?
Scale comes from homogeneity

- To scale out, we need to do so at each layer
  - Eliminate the central chokepoint for data operations
  - Central point of failure, central point of…
- We accomplish this today with object storage
- Consider the same concept in a filesystem

[Diagram: many clients talking directly to many storage nodes is scalable;
routing every client through a single metadata node is not scalable]
What approach can we take?

- Start with scalable transactional object storage
- Add a filesystem implementation on top

Layer stack (top to bottom):

- Filesystem
- Object Layer – transactional object storage with check-and-write transactions
- Reliability Layer – IDA + distributed transaction client; ensures committed objects are reliable and consistent
- Namespace Layer – namespace-based storage routing; routes actual data storage with no central I/O manager
- Remote Session – session management (multi-path)
- Slice Servers
Namespace Layer
Traditional Centralized Routing

- A central controller directs traffic
  - Easier to implement, allows simple search
  - Can detect conflicts and control locking
- Does not scale out with the rest of the architecture
  - Today, a 10 PB system needs 90 45-disk nodes*
  - These nodes can service 57,600 2 MB req/s**
  - A routing master handling 10,000 req/s, against nodes serving 640 req/s each, caps the system at about 15 servers
- Central point of failure = less availability

* 3 TB drives, some IDA overhead
** 10 Gbps NIC, nodes saturate wire speed
Namespace-based Routing

- Namespace concept comes from P2P systems
  - Chord, CAN, Kademlia
  - MongoDB, CouchDB are production examples
- Physical mapping is determined by a storage map
  - Small data (<10 KiB) loaded at start-up
- P2P systems use a dynamic overlay protocol
  - We'll have tens of thousands of nodes, not millions

[Diagram: Slicestors A–H arranged on a circular namespace starting at
index 0; a 4-wide vault occupies indices 0–3, an 8-wide vault indices 0–7]
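A minimal sketch of namespace-based routing under these assumptions: an object's name is hashed into a fixed circular namespace, and each slice index maps to a Slicestor using only the small storage map loaded at start-up, so no central lookup sits on the I/O path. The node names, hash choice, and placement rule are illustrative, not Cleversafe's actual scheme.

```python
import hashlib

# Storage map loaded at start-up (hypothetical 4-wide vault).
STORAGE_MAP = ["Slicestor A", "Slicestor B", "Slicestor C", "Slicestor D"]

def namespace_index(source_name: str, ranges: int) -> int:
    """Hash a source name onto the circular namespace [0, ranges)."""
    digest = hashlib.sha256(source_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % ranges

def route(source_name: str) -> list:
    """Return (slice index, Slicestor) pairs for every slice of the object.

    Each slice lands on a distinct node, walking the ring from the
    object's starting index.
    """
    width = len(STORAGE_MAP)
    start = namespace_index(source_name, width)
    return [(i, STORAGE_MAP[(start + i) % width]) for i in range(width)]
```

Because the mapping is a pure function of the name and the storage map, every client computes the same placement independently, which is what eliminates the routing master.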
Storing Data in a Namespace

- No central lookup for data I/O:
  1. Generate an "object id"
  2. Map it to storage
- With object storage, object id → database
- How do we map a file name to an object id?

[Diagram: object id → source name → slice names, which the storage map
resolves to Slicestors]
Reliability Layer
Replication and Eventual Consistency

- Eventual consistency is often used with replication
  - Client writes new versions to available nodes
  - Versions sync to other replicas lazily
- Application is responsible for consistency
  - Already true in filesystems
  - Allows partition-tolerant systems

[Diagram: Client A copies V2 to two of three replicas; repair later
propagates V2 to the third, and until then Client B's read sees the old
version V1]
Dispersal Requires Consistency

- Dispersal doesn't store replicas
  - A threshold of slices is required to recover data
  - A crash during "unsafe" periods can cause loss
- Methods to prevent loss
  - Three-phase distributed transaction
    - Commit: all revisions visible during the unsafe period
    - Finalize: cleanup once the new version's commit is safe
  - Quorum-based voting
    - Writes fail if fewer than T slices succeed

[Diagram: width 4, threshold 3. Rolling out V2 slice by slice over time:
V1 V1 V1 V1 (safe), V2 V1 V1 V1 (safe), V2 V2 V1 V1 (UNSAFE – neither
revision has a threshold of slices), V2 V2 V2 V1 (safe)]
Three-Phase Commit Protocol

[Diagram: width 4, threshold 3]

2-phase commit protocol:

1. WRITE – V2 written alongside V1 on all servers
2. COMMIT – if the commit reaches only some servers before a failure,
   neither V1 nor V2 holds a threshold of slices: commit failure causes loss!

3-phase commit protocol:

1. WRITE
2. COMMIT – both revisions remain visible
3. FINALIZE/UNDO – the old revision is removed only after the new
   revision is safely committed; on failure, undo rolls back to V1
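The client-driven protocol above can be sketched as follows: WRITE makes the new revision durable but keeps the old one, COMMIT makes both revisions visible, and FINALIZE removes the old one only once the commit is safe. A quorum of at least T of N servers must acknowledge each phase. The class and method names here are illustrative, not the dsNet wire protocol.

```python
WIDTH, THRESHOLD = 4, 3   # as in the slide

class SliceServer:
    """Holds at most two revisions of one slice (illustrative stand-in)."""
    def __init__(self):
        self.finalized = "V1"   # last finalized (old) revision
        self.pending = None     # written but not yet committed
        self.committed = None   # committed but not yet finalized

    def write(self, rev):
        self.pending = rev
        return True

    def commit(self):
        self.committed, self.pending = self.pending, None
        return True

    def finalize(self):
        self.finalized, self.committed = self.committed, None
        return True

def three_phase_write(servers, rev):
    # Phase 1: WRITE – new revision stored alongside the old one.
    if sum(s.write(rev) for s in servers) < THRESHOLD:
        return False   # abort; every server still holds the old revision
    # Phase 2: COMMIT – both revisions visible during the unsafe period.
    if sum(s.commit() for s in servers) < THRESHOLD:
        return False   # undo can still roll back to the old revision
    # Phase 3: FINALIZE – clean up the old revision only once commit is safe.
    for s in servers:
        s.finalize()
    return True
```

Because the old revision survives until FINALIZE, a crash mid-commit leaves a recoverable state instead of the 2-phase loss scenario shown above.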
Consistent Transactional Interface

- The distributed transaction makes dispersal safe
  - All of it happens in the client; no server coordination
- Write consistency
  - A side effect of distributed transactions
  - Writes either succeed or fail "atomically"
- Limitation: consistency = less partition tolerance
  - CAP theorem (we also choose availability)
  - Either the read or the write fails during a partition
  - Still "shardable": affects availability, not scalability
- Is consistency useful for filesystem directories?
Object Layer
Write-if-absent for WORM

- Object storage is WORM
  - Enforced by the underlying storage
  - Write-if-absent model built on transactions
- Distributed transactions emulate atomicity
  - A "checked write" fails if a previous revision exists

Example:

- Client A: WRITE V1 IF PREVIOUS = ∅ → Success
- Client B: WRITE V1' IF PREVIOUS = ∅ → Failure (V1 already exists)
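A minimal in-memory sketch of the checked write, assuming a hypothetical store interface: the write carries the revision the client last observed (None for "absent"), and the store rejects the write if the stored revision differs. The same call with a non-empty `previous` is the multiple-revision extension discussed next.

```python
class ObjectStore:
    """In-memory stand-in for the WORM object layer (illustrative)."""
    def __init__(self):
        self.objects = {}

    def checked_write(self, oid, revision, previous=None):
        """Succeed only if the currently stored revision equals `previous`."""
        if self.objects.get(oid) != previous:
            return False   # another writer got there first
        self.objects[oid] = revision
        return True

store = ObjectStore()
assert store.checked_write("obj-1", "V1")                 # Client A: previous = ∅
assert not store.checked_write("obj-1", "V1'")            # Client B: V1 exists
assert store.checked_write("obj-1", "V2", previous="V1")  # revision-matched update
```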
Optimistic Concurrency Control

- It is easy to extend this model to multiple revisions
  - A write succeeds iff the last revision matches the one given
  - This is the basis for "optimistic concurrency"
- How do concurrent writers update a directory?

Example:

1. Client A: WRITE V1 IF PREVIOUS = ∅ → Success
2. Client A: WRITE V2 IF PREVIOUS = V1 → Success
3. Client B: WRITE V2' IF PREVIOUS = V1 → Failure (current is V2);
   Client B reads V2, redoes its action, then WRITE V3 IF PREVIOUS = V2 → Success
Filesystem Layer
Ultra-Scalable Filesystem Technology

- Filesystem layer on top of object storage
  - Scalable no-master storage
  - Inherits reliability, security, and performance
- Three questions to answer:
  - How do we map a file name to an object id?
  - Is consistency useful for filesystem directories?
  - How do concurrent writers update a directory?
Object-based directory tree

- How do we map a file name to an object id? Directories are stored as objects
  - Filesystem structure is as reliable as the data
- Directory content data is a map of file name to object id
  - The object id points to another object on the system
    - Id for content data
    - Id for metadata (xattr, etc.)
- Data objects are WORM
  - Zero-copy snapshot support
  - Reference counting
- Well-known object id for "root"
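A sketch of a directory stored as an object, under the assumption that its content is just a serialized map from file name to the ids of the content and metadata objects; the JSON encoding, field names, and root id value here are illustrative, not the actual on-disk format.

```python
import json

ROOT_ID = "root"   # stand-in for the well-known "root" object id

def encode_directory(entries):
    """Serialize {file name: {"content": oid, "metadata": oid}} as object data."""
    return json.dumps(entries, sort_keys=True).encode()

def lookup(directory_bytes, name):
    """Resolve a file name to its entry, or None if absent."""
    return json.loads(directory_bytes).get(name)

root = encode_directory({"photo.jpg": {"content": "obj-42", "metadata": "obj-43"}})
```

Path resolution is then repeated lookups: read the root directory object, look up the next path component to get a child object id, and recurse.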
Directory Internal Consistency

- Is consistency useful for filesystem directories?
  - The object layer allows "atomic" directory updates
  - This mimics the model used by traditional filesystems
- Content data is stored in separate "immutable" storage
  - Safe snapshot support
- Eventual consistency has temporary effects
  - Writes: orphaned data
  - Deletes: read errors
- Is consistency an absolute requirement? No.
Concurrency Requires Serialization

- How do concurrent writers update a directory?
  - Updates to directory entries are atomic (by definition)
  - More precisely, filesystem operations are serialized
  - Client A adds a file, Client B adds a file, Client C deletes a file:
    first to call wins, and the application must impose a sane order
- Kernels use mutexes (locks) for serialization
  - A master controller (pNFS, GoogleFS) does this
  - We want to use a "multiple/no master" model
- Distributed locking protocols exist (e.g., Paxos)
  - It's hard: the protocols are complex and have drawbacks
  - It's slow: overhead for every operation
Optimistic Concurrency

- We want to serialize without locking
- Observation: file writes have two steps
  - Write the data (long, no contention)*
  - Modify the directory (short, serialized)**
- Use checked writes for the directory
  - Always read the directory before writing
  - Write the new revision "if-not-modified-since"
  - On a "write conflict": re-read, replay, repeat

* Consider a workload where files are >1 MiB; we write content data in WORM storage
** Because directories are stored as objects themselves, modifying a directory is re-writing an object
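The re-read/replay/repeat loop above can be sketched as follows, using a hypothetical in-memory directory with revision-checked writes standing in for the object layer:

```python
class VersionedDirectory:
    """In-memory directory object with revision-checked writes (illustrative)."""
    def __init__(self):
        self.revision = 0
        self.entries = {}

    def read(self):
        return self.revision, dict(self.entries)

    def checked_write(self, expected_revision, entries):
        if self.revision != expected_revision:
            return False   # directory modified since we read it: write conflict
        self.revision += 1
        self.entries = entries
        return True

def add_file(directory, name, object_id):
    """Lockless update: read, apply the operation, write-if-not-modified, repeat."""
    while True:
        revision, entries = directory.read()
        entries[name] = object_id          # the short, serialized step
        if directory.checked_write(revision, entries):
            return
```

Only the small directory object is retried on conflict; the large content data was already written contention-free and is never rewritten.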
Lockless Directory Update

- Optimistic concurrency guarantees serialization
  - Operations are simple ("add file"), so replay is trivial
  - On conflict, operation replay semantics are clear
- Content data (large) is not rewritten on conflict
  - Highly parallelizable
- Potentially unbounded contention latency
  - A back-off protocol can help
  - Not good for high directory contention use cases
Conclusions

- Advantages
- Limitations
- Final Thoughts
Advantages

- Scalability and performance
  - Content data I/O is quick and contention-free
  - No-master concurrent read and write
  - Linearly scalable performance
- Availability
  - Load balancing without complicated HA setups
- Reliability
  - Information dispersal
  - Both data and metadata have the same reliability
  - No separate "backup" required for an index server
Limitations

- Optimistic concurrency is sensitive to high contention
- Cache requirements limit directory size
  - No intrinsic limit, but a 100 MiB directory object?
- No central master makes explicit file locking hard
  - SMB and NFS protocols support such locks
- Not suitable for random-write workloads
- Not suitable for majority-small-file workloads
  - Directory write times eclipse file write times
- Requires a separate index service for search
Final Thoughts

- Significant advances come from the P2P and NoSQL space
- Three key techniques allow for an ultra-scalable FS:
  - Namespace-based routing
  - Distributed transactions using quorum voting and 3-phase commit
  - Optimistic concurrency using checked writes
- The techniques are usable with IDA or replicated systems
- The filesystem would not be general purpose
  - The techniques have some trade-offs
  - Excellent for specific big data use cases